Jobs Overview
The Jobs Management module centralizes the submission, execution, and monitoring of AI jobs. It supports both Kubernetes and SLURM environments, providing efficient resource usage, flexible job configuration, and consistent scheduling across both platforms.
Key Features
- Diverse Job Initiation Options
  - Script-Based Submission – Upload scripts via the UI for custom AI tasks.
  - Docker Image Submission – Deploy prebuilt containers with all required code and dependencies.
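The two submission modes above can be sketched as job payloads. This is an illustrative sketch only; the field names (`type`, `script`, `image`) are assumptions, not the module's actual schema.

```python
# Hypothetical payloads for the two submission modes; field names are
# assumptions, not the module's real schema.

def make_script_job(name: str, script_path: str) -> dict:
    """Build a payload for a script-based submission."""
    return {"name": name, "type": "script", "script": script_path}

def make_docker_job(name: str, image: str) -> dict:
    """Build a payload for a prebuilt Docker image submission."""
    return {"name": name, "type": "docker", "image": image}

script_job = make_script_job("train-resnet", "train.py")
docker_job = make_docker_job("train-resnet", "registry.local/resnet:1.0")
print(script_job["type"], docker_job["type"])  # script docker
```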
- Comprehensive Job Configuration
  - Define dependencies, runtime parameters, and execution constraints.
  - Expose parameterized settings so jobs can be tuned toward performance targets.
- Advanced Resource Specification
  - Request CPU, GPU, and memory allocations precisely.
  - Ensure each job’s performance is optimized for available hardware.
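A precise resource request can be modeled as a small validated structure. The sketch below is an assumption about shape, not the module's actual API: it assumes CPU cores, GPU count, and memory in MiB as the three fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceSpec:
    """Illustrative resource request: cores, GPU count, memory in MiB.
    Field names are assumptions, not the module's real schema."""
    cpus: int
    gpus: int
    memory_mib: int

    def __post_init__(self):
        # Reject nonsensical requests before they reach the scheduler.
        if self.cpus < 1 or self.gpus < 0 or self.memory_mib < 1:
            raise ValueError(f"invalid resource request: {self}")

spec = ResourceSpec(cpus=8, gpus=2, memory_mib=32768)
```

Validating at construction time keeps malformed requests from ever entering the scheduling queue.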
- Robust Scheduling & Submission
  - Gang Scheduling – Acquire all resources before job launch, avoiding partial allocations.
  - Priority-Based Scheduling – Adjust policies and priorities to fit operational needs.
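A minimal sketch of both policies, under simplifying assumptions (a single pool of interchangeable GPUs, integer counts): jobs are ordered by priority, and gang scheduling is modeled as an all-or-nothing allocation check. This is not the module's actual scheduler.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                          # lower value = scheduled first
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)

def schedule(jobs: list[Job], free_gpus: int) -> list[str]:
    """Launch jobs in priority order; gang scheduling means a job
    starts only if *all* of its GPUs are free, never a partial set."""
    queue = list(jobs)
    heapq.heapify(queue)
    launched = []
    while queue:
        job = heapq.heappop(queue)
        if job.gpus_needed <= free_gpus:   # all-or-nothing allocation
            free_gpus -= job.gpus_needed
            launched.append(job.name)
        # else: the job waits; no partial allocation is made
    return launched

jobs = [Job(2, "eval", 1), Job(1, "train", 4), Job(3, "big-job", 8)]
print(schedule(jobs, free_gpus=5))  # ['train', 'eval']
```

Note that `big-job` is held back entirely rather than being given 5 of its 8 GPUs; that deferral is the essence of gang scheduling.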
- Real-Time Monitoring & Management
  - Track CPU, GPU, and memory usage at runtime.
  - Explore detailed logs for execution progress and performance analysis.
- Proactive Alerting & Notifications
  - Receive immediate alerts for critical conditions or anomalies.
  - Intervene rapidly to maintain job stability.
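One common shape for this kind of alerting is a threshold check over sampled usage metrics. The metric names and limits below are assumptions for illustration, not values the module defines.

```python
# Illustrative limits; metric names and values are assumptions.
LIMITS = {"cpu_pct": 95.0, "gpu_mem_pct": 90.0, "mem_pct": 85.0}

def check_alerts(sample: dict[str, float],
                 limits: dict[str, float] = LIMITS) -> list[str]:
    """Return one alert string per metric that exceeds its limit."""
    return [
        f"{metric} at {value:.1f}% exceeds limit {limits[metric]:.1f}%"
        for metric, value in sample.items()
        if metric in limits and value > limits[metric]
    ]

alerts = check_alerts({"cpu_pct": 97.2, "gpu_mem_pct": 40.0, "mem_pct": 90.1})
print(alerts)  # cpu_pct and mem_pct alerts; gpu_mem_pct is within limits
```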
- Streamlined Completion & Post-Processing
  - Automate cleanup and data aggregation upon job completion.
  - Minimize manual tasks with an integrated workflow.
- Flexible Result Storage & Access
  - Configure storage backends for secure, easily accessible job outputs.
  - Maintain organization and retrieval consistency across jobs.
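Consistent retrieval usually comes from a fixed output layout keyed by backend, job, and run. The backends, bucket names, and URI scheme below are hypothetical placeholders, not values the module ships with.

```python
import posixpath

# Hypothetical backend roots; real deployments would configure their own.
BACKENDS = {"s3": "s3://ml-results", "nfs": "file:///mnt/results"}

def output_uri(backend: str, job_name: str, run_id: str) -> str:
    """Build a predictable output location: <root>/<job>/<run>/outputs."""
    base = BACKENDS[backend]
    return base + "/" + posixpath.join(job_name, run_id, "outputs")

print(output_uri("s3", "train-resnet", "run-007"))
# s3://ml-results/train-resnet/run-007/outputs
```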
- Cross-Platform Compatibility
  - Unified job management for Kubernetes and SLURM.
  - Maintain a single workflow, regardless of the underlying infrastructure.
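The single-workflow idea can be sketched as one job spec rendered to either backend: an sbatch script for SLURM or a Pod manifest for Kubernetes. The `JobSpec` fields are assumptions; the `#SBATCH` flags and Kubernetes resource keys shown are standard for those platforms.

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    """One platform-neutral description of a job (illustrative fields)."""
    name: str
    image: str
    gpus: int
    cpus: int

def to_slurm(spec: JobSpec) -> str:
    """Render the spec as a minimal sbatch script header."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.name}",
        f"#SBATCH --cpus-per-task={spec.cpus}",
        f"#SBATCH --gres=gpu:{spec.gpus}",
    ])

def to_kubernetes(spec: JobSpec) -> dict:
    """Render the spec as a minimal Pod manifest dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": spec.name},
        "spec": {"containers": [{
            "name": spec.name,
            "image": spec.image,
            "resources": {"limits": {
                "cpu": str(spec.cpus),
                "nvidia.com/gpu": str(spec.gpus),
            }},
        }]},
    }

spec = JobSpec(name="train-resnet", image="registry.local/resnet:1.0",
               gpus=2, cpus=8)
print(to_slurm(spec).splitlines()[1])  # #SBATCH --job-name=train-resnet
```

Because both renderers consume the same `JobSpec`, the user-facing workflow stays identical while only the backend translation differs.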
Note: This module also supports distributed training with Ray. For more details, see the dedicated page on distributed training.