Job Management
The Job Management section of the OICM+ platform lets users manage and monitor jobs that use computing resources for tasks such as training machine learning models.
Initiating a New Job
- Navigate to the Jobs section
- Click the "+ New Job" button
- Fill in the required information:
  - Title: A descriptive name for your job
  - Job type: Select the framework for your job (PyTorch, Ray, TensorFlow)
  - Tags: Add optional tags to organize and filter your jobs
A confirmation message appears once the job is created, and the new job is added to the jobs table alongside the existing jobs.
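The form fields above can be pictured as a small record. The following is a minimal sketch in Python, assuming a hypothetical `JobRequest` class that is not part of OICM+. It only mirrors the three fields and the three job types listed above, which can be handy for checking job definitions kept in your own notes or scripts before entering them in the form.

```python
from dataclasses import dataclass, field

# Job types listed in the "+ New Job" form.
ALLOWED_JOB_TYPES = ("PyTorch", "Ray", "TensorFlow")


@dataclass
class JobRequest:
    """Hypothetical local model of the New Job form; not an OICM+ API."""

    title: str
    job_type: str
    tags: list[str] = field(default_factory=list)  # tags are optional

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the form is complete."""
        problems = []
        if not self.title.strip():
            problems.append("title is required")
        if self.job_type not in ALLOWED_JOB_TYPES:
            allowed = ", ".join(ALLOWED_JOB_TYPES)
            problems.append(f"job type must be one of {allowed}")
        return problems
```

Calling `JobRequest("resnet50 fine-tune", "PyTorch", ["vision"]).validate()` returns an empty list, while a blank title or an unlisted job type is reported as a problem.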
Job-Specific Page
Clicking a job in the list opens a detailed page for that job. Each job has its own dedicated interface, with a separate tab for each aspect of job management:
Scripts
- Upload options:
  - Single file upload
  - Directory upload for multiple files
- Edit files directly in the interface
- Delete and Refresh options
Important File Naming:
When uploading multiple files, name your main script main.py and your configuration file config.yaml. This naming convention helps the system identify the primary execution files among multiple uploaded files.
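Before a directory upload, it can help to confirm that both files are present under the expected names. The following is a minimal sketch of such a local check. The helper `missing_entry_files` is hypothetical, and it assumes the two files sit at the top level of the directory, which the naming rule above does not state explicitly.

```python
from pathlib import Path

# File names the naming convention asks for.
EXPECTED_FILES = ("main.py", "config.yaml")


def missing_entry_files(directory):
    """Return the expected file names absent from the top level of `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means both files were found. Otherwise the returned list names the files to add or rename before uploading.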
Workers
- View all compute units assigned to your job
- Monitor status of each worker
- Track resource allocation and utilization
- View detailed worker specifications and configurations
Logs
- Real-time execution logs for each worker
- Worker selection via dropdown menu
- Live console output display
Events
- Chronological timeline of job-related events
- Error and warning notifications
- Important state transitions
System Metrics
- GPU performance indicators:
  - Core utilization and temperature
  - Memory usage and availability
  - Hardware encoder/decoder usage
- Real-time monitoring with interactive graphs
- Historical data tracking
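To cross-check the graphs against values read from inside a worker, the same kinds of indicators can be sampled with `nvidia-smi`. The following is a sketch only. It assumes the worker has an NVIDIA GPU with `nvidia-smi` on its path, and it says nothing about how OICM+ itself collects these metrics. The function names are hypothetical.

```python
import subprocess

# Standard nvidia-smi query: one CSV line per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,temperature.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_sample(line):
    """Parse one CSV line, e.g. '87, 64, 10240, 16384', into named metrics."""
    util, temp, used, total = (float(x) for x in line.split(","))
    return {
        "utilization_pct": util,
        "temperature_c": temp,
        "memory_used_mib": used,
        "memory_free_mib": total - used,
    }


def read_gpu_samples():
    """Run nvidia-smi and return one parsed sample per GPU (requires a GPU host)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_sample(line) for line in out.splitlines() if line.strip()]
```

Printing the result of `read_gpu_samples()` from your script puts the readings in the Logs tab, where they can be compared with the System Metrics graphs for the same period.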
Settings
- Update job configuration:
  - Edit job title
  - Modify job type
  - Manage tags
- Delete the job
Monitoring Job Performance
While individual tabs provide specific monitoring capabilities, effective job management often requires using them in combination:
- Correlate the Events timeline with System Metrics to understand performance changes
- Use Workers status alongside System Metrics to identify resource bottlenecks
- Read Logs in the context of Events to debug issues
This integrated approach helps maintain optimal job performance and quickly identify potential issues.
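The correlation step can be sketched in a few lines. The example below assumes you have noted an event time from the Events tab and a series of timestamped metric readings, for example GPU utilization. How those values are obtained is not covered here, and `samples_near_event` is a hypothetical helper, not a platform function.

```python
from datetime import datetime, timedelta


def samples_near_event(event_time, samples, window_seconds=30):
    """Return the (timestamp, value) samples within `window_seconds` of `event_time`."""
    window = timedelta(seconds=window_seconds)
    return [(ts, value) for ts, value in samples if abs(ts - event_time) <= window]
```

For a worker-restart event, a sharp drop in utilization among the returned samples suggests the event and the performance change are related. A flat series suggests looking elsewhere, such as the Logs tab.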
Best Practices
- Use the Workers tab to ensure resources are allocated and running properly
- Monitor System Metrics to optimize resource utilization
- Track Events for understanding job lifecycle and troubleshooting
- Review Logs regularly by selecting specific workers for detailed execution information
- Tag jobs appropriately for better organization and tracking