Cluster Monitoring

Cluster admins can use Cluster Monitoring to oversee GPU infrastructure dedicated to their tenants. By reviewing node availability and utilization metrics, admins can efficiently manage and troubleshoot resource usage.

1. Overview

The Cluster Monitoring Dashboard displays key GPU node stats and real-time utilization data. These insights help you balance workloads, prevent bottlenecks, and ensure effective GPU usage.

2. Node & GPU Counts

Total GPU Nodes – The number of GPU-equipped nodes assigned to your tenant.
Total GPUs – The total number of GPUs in your infrastructure, whether in use or free.
Allocated GPUs – The number of GPUs currently in use.
Available GPUs – The number of GPUs that remain free for additional workloads.
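
The counts are related by simple arithmetic. Assuming every GPU is either allocated or available, Available GPUs equals Total GPUs minus Allocated GPUs. The Python sketch below illustrates this with a hypothetical `GpuCounts` type and made-up numbers. It is not the dashboard's data model or API, only a way to sanity-check the figures you see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuCounts:
    """Hypothetical container mirroring the dashboard's count fields."""
    total_nodes: int
    total_gpus: int
    allocated_gpus: int

    @property
    def available_gpus(self) -> int:
        # Assumption: every GPU is either allocated or available.
        return self.total_gpus - self.allocated_gpus

    @property
    def allocation_ratio(self) -> float:
        # Fraction of GPUs in use; 0.0 when there are no GPUs at all.
        if self.total_gpus == 0:
            return 0.0
        return self.allocated_gpus / self.total_gpus


# Illustrative values only.
counts = GpuCounts(total_nodes=4, total_gpus=32, allocated_gpus=20)
print(counts.available_gpus)    # 12
print(counts.allocation_ratio)  # 0.625
```

If the Available GPUs figure on the dashboard differs from this subtraction, some GPUs may be in a state the sketch does not model, such as nodes that are offline or under maintenance.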

3. Utilization Metrics

Track real-time GPU resource usage:

GPU Compute (%) – Current GPU compute load.
GPU Memory (%) – Percentage of GPU memory in use.
Compute Utilization (%) – Overall compute usage across all workloads.
Memory Utilization (%) – Aggregated memory usage for your tenant’s environment.
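
This page does not specify how the aggregated figures are computed. One plausible reading is that they are simple averages of per-GPU readings. The sketch below shows that interpretation in Python, using hypothetical GPU names and sample values. The dashboard may aggregate differently, for example by weighting each GPU by its memory size.

```python
from statistics import mean

# Hypothetical per-GPU samples: name -> (compute %, memory %).
samples = {
    "node-a/gpu0": (90.0, 70.0),
    "node-a/gpu1": (50.0, 30.0),
    "node-b/gpu0": (10.0, 20.0),
}


def aggregate(readings):
    """Return (mean compute %, mean memory %) across all GPUs.

    Assumption: an unweighted mean; returns (0.0, 0.0) for no readings.
    """
    if not readings:
        return 0.0, 0.0
    compute = mean(c for c, _ in readings.values())
    memory = mean(m for _, m in readings.values())
    return compute, memory


compute_pct, memory_pct = aggregate(samples)
print(f"{compute_pct:.1f} {memory_pct:.1f}")  # 50.0 40.0
```

Note that an average can hide imbalance: in the sample data one GPU sits at 90% compute while another is at 10%, yet the aggregate reads 50%.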

4. Best Practices

Routine Checks – Regularly monitor your GPU infrastructure to catch potential performance issues early.
Optimal Allocation – Use metrics to determine if you need to scale resources or rebalance workloads for maximum efficiency.
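
As a rough illustration of turning metrics into a decision, the Python sketch below maps a compute utilization percentage and a free-GPU count to a suggested action. The function name `recommend` and the 85% and 30% thresholds are assumptions chosen for the example, not product defaults. Tune them to your own workloads.

```python
def recommend(compute_pct: float, available_gpus: int,
              high: float = 85.0, low: float = 30.0) -> str:
    """Suggest an action from compute utilization and free GPU count.

    Thresholds are illustrative assumptions, not product defaults.
    """
    if compute_pct >= high and available_gpus == 0:
        # Saturated with nothing free: more capacity is needed.
        return "scale up"
    if compute_pct >= high:
        # Busy, but free GPUs exist: spread the load onto them.
        return "rebalance"
    if compute_pct <= low and available_gpus > 0:
        # Mostly idle with spare GPUs: capacity may be oversized.
        return "consider scaling down"
    return "no action"


print(recommend(92.0, 0))  # scale up
print(recommend(92.0, 4))  # rebalance
print(recommend(60.0, 2))  # no action
```

A single reading can be misleading, so in practice you would apply a rule like this to utilization observed over a period rather than to one snapshot.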

5. Next Steps

Resource Management – Assign and monitor nodes across tenants.
Tenant Management – Create or delete tenants and manage tenant admins.
Tenant Admin Tools Overview – Explore tenant-level administrative capabilities.