Usage Monitoring
Cluster admins can use Usage Monitoring to oversee usage of GPU infrastructure dedicated to their tenants. By reviewing utilization metrics, admins can efficiently manage and troubleshoot resource usage.
1. Overview
The Usage Monitoring Dashboard displays key GPU node statistics and real-time utilization data. These insights help you balance workloads, prevent bottlenecks, and keep GPUs effectively utilized.
2. Utilization Metrics
Track GPU resource usage with the following views:
- Tenant GPU Usage Distribution – How GPU usage is distributed across all selected tenants.
- Tenant GPU hours Interval – Accumulated GPU hours across the selected tenants.
- Tenants – A list of all selected tenants with their usage details. Click a tenant to open more detailed usage information for that tenant.
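The sketch below illustrates the kind of arithmetic behind the GPU hours and usage distribution views. It is not the dashboard's implementation. It assumes that GPU hours are computed as the number of GPUs allocated multiplied by the hours they were allocated, and that a tenant's share of the distribution is its GPU hours divided by the total across the selected tenants. The `UsageRecord` type, its fields, and the tenant names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    # Hypothetical usage record; the dashboard's real data model is not documented here.
    tenant: str   # tenant name
    gpus: int     # number of GPUs allocated during the interval
    hours: float  # length of the interval, in hours


def gpu_hours_by_tenant(records, selected):
    """Sum GPUs x hours per tenant, counting only the selected tenants (assumed formula)."""
    totals = defaultdict(float)
    for r in records:
        if r.tenant in selected:
            totals[r.tenant] += r.gpus * r.hours
    return dict(totals)


def usage_distribution(totals):
    """Return each tenant's fraction of the combined GPU hours."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        # Avoid division by zero when no GPU hours were recorded.
        return {tenant: 0.0 for tenant in totals}
    return {tenant: hours / grand_total for tenant, hours in totals.items()}


# Example: three hypothetical tenants, two of them selected in the dashboard.
records = [
    UsageRecord("team-a", gpus=4, hours=2.0),
    UsageRecord("team-b", gpus=2, hours=3.0),
    UsageRecord("team-a", gpus=1, hours=2.0),
    UsageRecord("team-c", gpus=8, hours=1.0),  # excluded: not selected
]
totals = gpu_hours_by_tenant(records, selected={"team-a", "team-b"})
shares = usage_distribution(totals)
print(totals)  # {'team-a': 10.0, 'team-b': 6.0}
print(shares)  # {'team-a': 0.625, 'team-b': 0.375}
```

Under these assumptions, team-a accumulates 4 × 2 + 1 × 2 = 10 GPU hours and team-b accumulates 2 × 3 = 6, so team-a accounts for 62.5% of the usage among the selected tenants.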
3. Best Practices
- Routine Checks
Regularly monitor your GPU infrastructure to catch potential performance issues early.
- Optimal Allocation
Use metrics to determine if you need to scale resources or rebalance workloads for maximum efficiency.
Next Steps
- Resource Management – Assign and monitor nodes across tenants.
- Tenant Management – Create or delete tenants and manage tenant admins.
- Tenant Admin Tools Overview – Explore tenant-level administrative capabilities.