Jobs Overview
The Jobs Management module centralizes the submission, execution, and monitoring of AI jobs. It supports both Kubernetes and SLURM environments, providing efficient resource usage, flexible job configuration, and consistent scheduling across both platforms.
Key Features
- Diverse Job Initiation Options
  - Script-Based Submission – Upload scripts via the UI for custom AI tasks.
  - Docker Image Submission – Deploy prebuilt containers with all required code and dependencies.
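The two submission modes above can be sketched as job payloads. This is an illustrative sketch only; the field names (`type`, `script`, `image`) are assumptions, not the module's actual schema.

```python
# Hypothetical payloads for the two submission modes; field names are
# assumptions, not the module's real schema.

def make_script_job(name: str, script_path: str) -> dict:
    """Build a payload for a script-based submission."""
    return {"name": name, "type": "script", "script": script_path}

def make_docker_job(name: str, image: str) -> dict:
    """Build a payload for a prebuilt Docker image submission."""
    return {"name": name, "type": "docker", "image": image}

script_job = make_script_job("train-resnet", "train.py")
docker_job = make_docker_job("train-resnet", "registry.local/resnet:1.0")
print(script_job["type"], docker_job["type"])  # script docker
```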
- Comprehensive Job Configuration
  - Define dependencies, runtime parameters, and execution constraints.
  - Expose parameterized settings so jobs can be tuned toward performance targets.
- Advanced Resource Specification
  - Request CPU, GPU, and memory allocations precisely.
  - Ensure each job’s performance is optimized for available hardware.
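A precise resource request can be modeled as a small validated structure. The sketch below is an assumption about shape, not the module's actual API: it assumes CPU cores, GPU count, and memory in MiB as the three fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceSpec:
    """Illustrative resource request: cores, GPU count, memory in MiB.
    Field names are assumptions, not the module's real schema."""
    cpus: int
    gpus: int
    memory_mib: int

    def __post_init__(self):
        # Reject nonsensical requests before they reach the scheduler.
        if self.cpus < 1 or self.gpus < 0 or self.memory_mib < 1:
            raise ValueError(f"invalid resource request: {self}")

spec = ResourceSpec(cpus=8, gpus=2, memory_mib=32768)
```

Validating at construction time keeps malformed requests from ever entering the scheduling queue.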
- Robust Scheduling & Submission
  - Gang Scheduling – Acquire all resources before job launch, avoiding partial allocations.
  - Priority-Based Scheduling – Adjust policies and priorities to fit operational needs.
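A minimal sketch of both policies, under simplifying assumptions (a single pool of interchangeable GPUs, integer counts): jobs are ordered by priority, and gang scheduling is modeled as an all-or-nothing allocation check. This is not the module's actual scheduler.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                          # lower value = scheduled first
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)

def schedule(jobs: list[Job], free_gpus: int) -> list[str]:
    """Launch jobs in priority order; gang scheduling means a job
    starts only if *all* of its GPUs are free, never a partial set."""
    queue = list(jobs)
    heapq.heapify(queue)
    launched = []
    while queue:
        job = heapq.heappop(queue)
        if job.gpus_needed <= free_gpus:   # all-or-nothing allocation
            free_gpus -= job.gpus_needed
            launched.append(job.name)
        # else: the job waits; no partial allocation is made
    return launched

jobs = [Job(2, "eval", 1), Job(1, "train", 4), Job(3, "big-job", 8)]
print(schedule(jobs, free_gpus=5))  # ['train', 'eval']
```

Note that `big-job` is held back entirely rather than being given 5 of its 8 GPUs; that deferral is the essence of gang scheduling.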
- Real-Time Monitoring & Management
  - Track CPU, GPU, and memory usage at runtime.
  - Explore detailed logs for execution progress and performance analysis.
- Proactive Alerting & Notifications
  - Receive immediate alerts for critical conditions or anomalies.
  - Intervene rapidly to maintain job stability.
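One common shape for this kind of alerting is a threshold check over sampled usage metrics. The metric names and limits below are assumptions for illustration, not values the module defines.

```python
# Illustrative limits; metric names and values are assumptions.
LIMITS = {"cpu_pct": 95.0, "gpu_mem_pct": 90.0, "mem_pct": 85.0}

def check_alerts(sample: dict[str, float],
                 limits: dict[str, float] = LIMITS) -> list[str]:
    """Return one alert string per metric that exceeds its limit."""
    return [
        f"{metric} at {value:.1f}% exceeds limit {limits[metric]:.1f}%"
        for metric, value in sample.items()
        if metric in limits and value > limits[metric]
    ]

alerts = check_alerts({"cpu_pct": 97.2, "gpu_mem_pct": 40.0, "mem_pct": 90.1})
print(alerts)  # cpu_pct and mem_pct alerts; gpu_mem_pct is within limits
```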
- Streamlined Completion & Post-Processing
  - Automate cleanup and data aggregation upon job completion.
  - Minimize manual tasks with an integrated workflow.
- Flexible Result Storage & Access
  - Configure storage backends for secure, easily accessible job outputs.
  - Maintain organization and retrieval consistency across jobs.
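Consistent retrieval usually comes from a fixed output layout keyed by backend, job, and run. The backends, bucket names, and URI scheme below are hypothetical placeholders, not values the module ships with.

```python
import posixpath

# Hypothetical backend roots; real deployments would configure their own.
BACKENDS = {"s3": "s3://ml-results", "nfs": "file:///mnt/results"}

def output_uri(backend: str, job_name: str, run_id: str) -> str:
    """Build a predictable output location: <root>/<job>/<run>/outputs."""
    base = BACKENDS[backend]
    return base + "/" + posixpath.join(job_name, run_id, "outputs")

print(output_uri("s3", "train-resnet", "run-007"))
# s3://ml-results/train-resnet/run-007/outputs
```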
- Cross-Platform Compatibility
  - Unified job management for Kubernetes and SLURM.
  - Maintain a single workflow, regardless of the underlying infrastructure.
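The single-workflow idea can be sketched as one job spec rendered to either backend: an sbatch script for SLURM or a Pod manifest for Kubernetes. The `JobSpec` fields are assumptions; the `#SBATCH` flags and Kubernetes resource keys shown are standard for those platforms.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    """One platform-neutral description of a job (illustrative fields)."""
    name: str
    image: str
    gpus: int
    cpus: int

def to_slurm(spec: JobSpec) -> str:
    """Render the spec as a minimal sbatch script header."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.name}",
        f"#SBATCH --cpus-per-task={spec.cpus}",
        f"#SBATCH --gres=gpu:{spec.gpus}",
    ])

def to_kubernetes(spec: JobSpec) -> dict:
    """Render the spec as a minimal Pod manifest dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": spec.name},
        "spec": {"containers": [{
            "name": spec.name,
            "image": spec.image,
            "resources": {"limits": {
                "cpu": str(spec.cpus),
                "nvidia.com/gpu": str(spec.gpus),
            }},
        }]},
    }

spec = JobSpec(name="train-resnet", image="registry.local/resnet:1.0",
               gpus=2, cpus=8)
print(to_slurm(spec).splitlines()[1])  # #SBATCH --job-name=train-resnet
```

Because both renderers consume the same `JobSpec`, the user-facing workflow stays identical while only the backend translation differs.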
Note: This module also supports distributed training with Ray. For more details, see the dedicated page on distributed training.