Job Configuration with config.yaml

This document explains how to configure jobs using the config.yaml file. The configuration file defines the resources, scaling, and environment settings for your job execution.

Configuration Structure

The config.yaml file supports the following fields:

Required Fields

Field | Type | Description
resources | object | Computational resources configuration for each instance
replicas | integer | Number of instances (Kubernetes pods) to run
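
For reference, a minimal config.yaml that sets only the required fields (the values here match Example 1 below):

resources:
  memory: 16   # GB of RAM per instance
  cpu: 4       # CPU cores per instance
  storage: 1   # GiB of ephemeral storage per instance
replicas: 1    # run a single instance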

resources (Required)

The resources field contains the following sub-fields:

Field | Type | Required | Description
memory | integer | Yes | RAM allocation per instance (in GB)
cpu | integer | Yes | Number of CPU cores per instance
gpu | object | No | GPU configuration for GPU-accelerated jobs
storage | integer | Yes | Ephemeral storage per instance (in GiB)

gpu (Optional, under resources)

When specified, the gpu field contains:

Field | Type | Description
accelerator_count | integer | Number of GPUs per instance
accelerator | string | GPU type (H100, A100, V100, etc.)

Optional Fields

Field | Type | Description
min_available | integer | Minimum number of replicas needed before the job starts
config_map | object | Environment variables for the job
dependencies | array | Additional Python packages to install

Field Definitions and Usage

resources

The resources section defines the computational resources for each instance.

gpu (Optional)

Configure GPU resources for machine learning workloads.

resources:
  gpu:
    accelerator_count: 8    # Number of GPUs per instance
    accelerator: H100       # GPU type (H100, A100, V100, etc.)

When to use: For deep learning training or any GPU-accelerated computations.

memory (Required)

RAM allocation in GB per instance.

resources:
  memory: 128  # 128 GB RAM per instance

When to use: Always required. Set based on your job's memory requirements.

cpu (Required)

Number of CPU cores per instance.

resources:
  cpu: 4  # 4 CPU cores per instance

When to use: Always required. Consider your job's computational intensity.
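
storage (Required)

Ephemeral storage allocation in GiB per instance. A brief sketch, using the value from Example 1 below:

resources:
  storage: 1  # 1 GiB of ephemeral storage per instance

When to use: Always required. Size it for the temporary files your job writes; for data shared between replicas, see Data volumes below.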

replicas (Required)

Number of instances to run concurrently.

replicas: 10  # Run 10 instances of the job

When to use: Always required. Use multiple replicas for:
- Distributed training
- Parallel data processing

min_available (Optional)

Minimum number of replicas that must be available before the job starts.

min_available: 2  # Start job when at least 2 replicas are ready

When to use:
- When you need partial scaling (don't wait for all replicas)
- For jobs that can start with fewer resources - To reduce startup time in resource-constrained environments

Default behavior: If not specified, the job waits for all replicas to be ready.

config_map (Optional)

Environment variables passed to all job instances.

config_map:
  FOLDER_NAME: "test"
  EPOCHS: "5"
  BATCH_SIZE: "256"
  LEARNING_RATE: "0.01"

When to use:
- Pass configuration parameters to your application
- Set training hyperparameters
- Configure data paths and model settings

Data volumes

Refer to the Datavolumes Docs for more information on attaching a volume to your Job workload.
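
As a reference, Example 2 below attaches a data volume with the output_checkpoints_data_volume field; a minimal sketch, assuming the data volume already exists and its ID replaces the placeholder:

output_checkpoints_data_volume:
  name: data-vol-your-datavolume-id  # placeholder: replace with your data volume ID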

dependencies (Optional)

Additional Python packages to install before running the job.

dependencies:
  - torch
  - torchvision
  - boto3

Alternative: Add a requirements.txt file to the job files instead; see Job Requirements.

When to use:
- Install packages not included in the base image
- Add specific versions of libraries (see the sketch after this list)
- Quick package installation without maintaining separate requirements files
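
A hedged sketch of pinning versions, assuming pip-style requirement specifiers are accepted in this field (check Job Requirements for what your platform supports):

dependencies:
  - torch==2.2.0        # pinned version (pip-style specifier, assumed supported)
  - torchvision==0.17.0
  - boto3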

Configuration Examples

Example 1: Basic CPU job

resources:
  memory: 16
  cpu: 4
  storage: 1
replicas: 1
config_map:
  DATA_PATH: "/data/input"
  OUTPUT_PATH: "/data/output"

Expected behavior:
- Single instance with 16GB RAM, 4 CPU cores, and 1 GiB of ephemeral storage
- Job starts immediately (only 1 replica needed)
- Environment variables set for data paths

Example 2: Distributed GPU training with a data volume

resources:
  gpu:
    accelerator_count: 2
    accelerator: A100
  memory: 64
  cpu: 8
  storage: 1
output_checkpoints_data_volume:
  name: data-vol-your-datavolume-id
replicas: 4
min_available: 2
config_map:
  EPOCHS: "100"
  BATCH_SIZE: "64"
  LEARNING_RATE: "0.001"
dependencies:
  - transformers
  - datasets

Expected behavior:
- 4 instances, each with 2 A100 GPUs, 64GB RAM, 8 CPU cores
- Job starts when at least 2 instances are ready
- transformers and datasets libraries installed before the job starts
- Training hyperparameters set via environment variables
- Data volume mounted and accessible for model/dataset sharing

Best Practices

  1. Resource Planning: Start with smaller resource allocations and scale up based on actual usage
  2. Replica Strategy: Use min_available for jobs that can benefit from partial scaling
  3. Data Volumes: Use data volumes when replicas need to share data or synchronize state
  4. Environment Variables: Use config_map to make your jobs configurable without code changes
  5. Dependencies: Keep dependency lists minimal and use specific versions for reproducibility