
Job Management

The Job Management section of the OICM+ platform lets users create, manage, and monitor jobs that use computing resources for tasks such as training machine learning models.

Initiating a New Job

  1. Navigate to the Jobs section
  2. Click the "+ New Job" button
  3. Fill in the required information:
     • Title: A descriptive name for your job
     • Job type: Select the framework for your job (PyTorch, Ray, TensorFlow)
     • Tags: Add optional tags to organize and filter your jobs

Creating a new job through the job creation form

A confirmation message appears once the job is successfully initiated. The new job then shows up in the jobs table alongside existing jobs.

Jobs list showing all jobs in a tabular view

Job Specific Page

By clicking on a specific job from the list, users are directed to a detailed page for that job. Each job has its own dedicated interface with multiple tabs for different aspects of job management:


Scripts

  • Upload options:
      • Single file upload
      • Directory upload for multiple files
  • Edit files directly in the interface
  • Delete and Refresh options

Important (file naming): When uploading multiple files, name your main script main.py and your configuration file config.yaml. This naming convention helps the system identify the primary execution files among the uploaded files.
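Before a directory upload, it can help to confirm locally that both expected files are present. The following is a minimal sketch, not part of the OICM+ tooling: the function name missing_required_files is hypothetical, and it assumes main.py and config.yaml sit at the top level of the directory being uploaded, which this page does not specify.

```python
from pathlib import Path

# File names taken from the naming convention described above.
REQUIRED_FILES = ("main.py", "config.yaml")


def missing_required_files(directory):
    """Return the required file names absent from the top level of `directory`.

    Assumption: the platform looks for these files at the top level of the
    uploaded directory. Adjust the check if your layout differs.
    """
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling missing_required_files("my_job") returns an empty list when both files exist, or the names of whichever files still need to be added or renamed. The directory name my_job is only an example.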

Workers

  • View all compute units assigned to your job
  • Monitor status of each worker
  • Track resource allocation and utilization
  • View detailed worker specifications and configurations

Logs

  • Real-time execution logs for each worker
  • Worker selection via dropdown menu
  • Live console output display

Events

  • Chronological timeline of job-related events
  • Error and warning notifications
  • Important state transitions

System Metrics

  • GPU performance indicators:
      • Core utilization and temperature
      • Memory usage and availability
      • Hardware encoder/decoder usage
  • Real-time monitoring with interactive graphs
  • Historical data tracking

Settings

  • Update job configuration:
      • Edit job title
      • Modify job type
      • Manage tags
  • Delete the job

Monitoring Job Performance

While individual tabs provide specific monitoring capabilities, effective job management often requires using them in combination:

  • Correlate Events timeline with System Metrics to understand performance changes
  • Use Workers status alongside System Metrics to identify resource bottlenecks
  • Read Logs in the context of Events to debug issues

This integrated approach helps maintain optimal job performance and quickly identify potential issues.
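As an illustration of correlating events with metrics, the sketch below averages GPU utilization samples recorded within a time window around each event. It is not an OICM+ API: the function names, the (timestamp, value) tuple layout, and the idea of having the data available outside the interface are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right


def sample_range(event_time, sample_times, window):
    """Return indices (lo, hi) so that sample_times[lo:hi] lies within
    [event_time - window, event_time + window]. sample_times must be sorted."""
    lo = bisect_left(sample_times, event_time - window)
    hi = bisect_right(sample_times, event_time + window)
    return lo, hi


def correlate(events, samples, window=30.0):
    """Pair each event with the mean metric value recorded around it.

    events:  list of (timestamp_seconds, message)      -- hypothetical layout
    samples: list of (timestamp_seconds, gpu_util_pct) -- sorted by timestamp
    Returns a list of (message, mean_value), with None when no sample
    falls inside the window.
    """
    times = [t for t, _ in samples]
    report = []
    for event_time, message in events:
        lo, hi = sample_range(event_time, times, window)
        values = [v for _, v in samples[lo:hi]]
        mean = sum(values) / len(values) if values else None
        report.append((message, mean))
    return report
```

A low average around an event such as a worker restart, compared with the averages before it, suggests the event coincided with a drop in utilization and is worth checking in the Logs tab for that worker.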

Best Practices

  • Use the Workers tab to ensure resources are allocated and running properly
  • Monitor System Metrics to optimize resource utilization
  • Track Events for understanding job lifecycle and troubleshooting
  • Review Logs regularly by selecting specific workers for detailed execution information
  • Tag jobs appropriately for better organization and tracking