Cluster Monitoring
This section of the OICM+ platform provides tenant admins with tools to monitor the health and performance of the GPU infrastructure dedicated to their tenant.
Overview
The Cluster Monitoring dashboard reports GPU node counts and utilization metrics so that tenant admins can manage and troubleshoot their dedicated resources.
Node & GPU Counts
- Total GPU Nodes: Displays the total number of nodes equipped with GPUs within the tenant's allocated resources.
- Total GPUs: Shows the total number of GPUs in the tenant's infrastructure.
- Allocated GPUs: Indicates how many GPUs are currently allocated to running workloads.
- Available GPUs: Shows the number of GPUs free for allocation to new workloads.
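The following is a minimal sketch of how these four counts might relate to one another. It is not the OICM+ API: the `GpuCounts` class and its field names are hypothetical, and it assumes that every GPU is either allocated or available, which this page does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuCounts:
    """Hypothetical snapshot of the dashboard's count tiles (not an OICM+ type)."""
    total_gpu_nodes: int
    total_gpus: int
    allocated_gpus: int

    def __post_init__(self):
        # Reject snapshots where more GPUs are allocated than exist.
        if not 0 <= self.allocated_gpus <= self.total_gpus:
            raise ValueError("allocated_gpus must be between 0 and total_gpus")

    @property
    def available_gpus(self) -> int:
        # Assumption: every GPU is either allocated or available.
        return self.total_gpus - self.allocated_gpus

    @property
    def allocation_pct(self) -> float:
        # Share of GPUs currently allocated; 0.0 when the tenant has no GPUs.
        if self.total_gpus == 0:
            return 0.0
        return 100.0 * self.allocated_gpus / self.total_gpus


counts = GpuCounts(total_gpu_nodes=4, total_gpus=32, allocated_gpus=24)
print(counts.available_gpus)  # 8
print(counts.allocation_pct)  # 75.0
```

If the dashboard ever shows Allocated GPUs and Available GPUs that do not sum to Total GPUs, the assumption above does not hold for your tenant (for example, some GPUs may be in a state this page does not describe).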
Utilization Metrics
Monitor the real-time utilization of GPU resources to optimize performance and detect potential bottlenecks:
- GPU Compute (%): Percentage of GPU compute capacity currently in use.
- GPU Memory (%): Percentage of GPU memory currently in use.
- Compute Utilization (%): Overall percentage of compute resources being utilized across all workloads.
- Memory Utilization (%): Overall memory utilization across the tenant's infrastructure.
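To illustrate how per-GPU figures could roll up into overall percentages, here is a rough sketch. The aggregation formulas OICM+ actually uses are not documented on this page, so the choices below (a plain mean for compute, a capacity-weighted ratio for memory) are assumptions, and `GpuSample` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GpuSample:
    """Hypothetical per-GPU reading (not an OICM+ type)."""
    gpu_id: str
    compute_pct: float       # GPU compute in use, 0-100
    memory_used_gib: float
    memory_total_gib: float

    @property
    def memory_pct(self) -> float:
        # Per-GPU memory utilization as a percentage of that GPU's capacity.
        return 100.0 * self.memory_used_gib / self.memory_total_gib


def summarize(samples):
    """Return (overall compute %, overall memory %) for a list of samples.

    Assumption: compute is averaged across GPUs, and memory is total used
    divided by total capacity. The platform may aggregate differently.
    """
    if not samples:
        return (0.0, 0.0)
    compute = mean(s.compute_pct for s in samples)
    used = sum(s.memory_used_gib for s in samples)
    total = sum(s.memory_total_gib for s in samples)
    return (compute, 100.0 * used / total)


samples = [
    GpuSample("gpu-0", compute_pct=90.0, memory_used_gib=60.0, memory_total_gib=80.0),
    GpuSample("gpu-1", compute_pct=30.0, memory_used_gib=20.0, memory_total_gib=80.0),
]
print(samples[0].memory_pct)  # 75.0
print(summarize(samples))     # (60.0, 50.0)
```

In this example one GPU is busy while the other is mostly idle, so the overall figures look moderate even though a single device may be the bottleneck. This is one reason to read the per-GPU and overall metrics together.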
Best Practices
- Check the Cluster Monitoring dashboard regularly to stay informed about the health and performance of your GPU infrastructure.
- Use utilization metrics to make informed decisions about resource allocation and workload management.
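As one possible way to turn these practices into a routine check, the sketch below combines an allocation percentage with a compute utilization reading to produce a simple hint. The function name `allocation_hint`, the threshold values, and the category labels are all illustrative assumptions, not platform features or recommended defaults.

```python
def allocation_hint(allocated_gpus, total_gpus, compute_pct, high=85.0, low=30.0):
    """Return a coarse label for the tenant's GPU situation.

    The `high` and `low` thresholds are arbitrary example values; tune them
    to your own workloads.
    """
    if total_gpus <= 0:
        raise ValueError("total_gpus must be positive")
    allocation_pct = 100.0 * allocated_gpus / total_gpus

    if allocation_pct >= high and compute_pct >= high:
        # Nearly everything is allocated and busy.
        return "saturated"
    if allocation_pct >= high and compute_pct <= low:
        # GPUs are reserved by workloads but mostly idle.
        return "over-allocated"
    if allocation_pct <= low:
        # Plenty of free GPUs for new workloads.
        return "underused"
    return "healthy"


print(allocation_hint(30, 32, compute_pct=90.0))  # saturated
print(allocation_hint(30, 32, compute_pct=10.0))  # over-allocated
print(allocation_hint(4, 32, compute_pct=50.0))   # underused
print(allocation_hint(16, 32, compute_pct=50.0))  # healthy
```

An "over-allocated" result suggests reviewing how many GPUs each workload requests, while "saturated" suggests that new workloads may have to wait for capacity.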