Job Configuration with config.yaml

This document explains how to configure jobs using the config.yaml file. The configuration file defines the resources, scaling, and environment settings for your job execution.

Configuration Structure

The config.yaml file supports the following fields:

Required Fields

| Field | Type | Description |
| --- | --- | --- |
| resources | object | Computational resources configuration for each instance |
| replicas | integer | Number of instances (Kubernetes pods) to run |

resources (Required)

The resources field contains the following sub-fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| memory | integer | Yes | RAM allocation per instance (in GB) |
| cpu | integer | Yes | Number of CPU cores per instance |
| gpu | object | No | GPU configuration for GPU-accelerated jobs |

gpu (Optional, under resources)

When specified, the gpu field contains:

| Field | Type | Description |
| --- | --- | --- |
| accelerator_count | integer | Number of GPUs per instance |
| accelerator | string | GPU type (H100, A100, V100, etc.) |

Optional Fields

| Field | Type | Description |
| --- | --- | --- |
| min_available | integer | Minimum number of replicas needed before the job starts |
| config_map | object | Environment variables for the job |
| shared_volume | object | Persistent volume shared across all replicas |
| dependencies | array | Additional Python packages to install |

Field Definitions and Usage

resources

The resources section defines the computational resources for each instance.
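Putting the sub-fields together, a complete resources block for a GPU job might look like this (values are illustrative):

```yaml
resources:
  gpu:                     # optional: omit for CPU-only jobs
    accelerator_count: 4   # GPUs per instance
    accelerator: A100      # GPU type
  memory: 64               # required: RAM in GB per instance
  cpu: 16                  # required: CPU cores per instance
```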

gpu (Optional)

Configure GPU resources for machine learning workloads.

resources:
  gpu:
    accelerator_count: 8    # Number of GPUs per instance
    accelerator: H100       # GPU type (H100, A100, V100, etc.)

When to use: For deep learning training or any GPU-accelerated computations.

memory (Required)

RAM allocation in GB per instance.

resources:
  memory: 128  # 128 GB RAM per instance

When to use: Always required. Set based on your job's memory requirements.

cpu (Required)

Number of CPU cores per instance.

resources:
  cpu: 4  # 4 CPU cores per instance

When to use: Always required. Consider your job's computational intensity.

replicas (Required)

Number of instances to run concurrently.

replicas: 10  # Run 10 instances of the job

When to use: Always required. Use multiple replicas for:
- Distributed training
- Parallel data processing
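For distributed work, each replica typically needs to know its own index and the total replica count so it can claim a distinct slice of the work. How the platform exposes these values is not specified in this document; the sketch below assumes they arrive as environment variables (REPLICA_INDEX and REPLICA_COUNT are hypothetical names) and shows one common sharding pattern:

```python
import os

# Hypothetical variable names; the actual mechanism depends on the platform.
replica_index = int(os.environ.get("REPLICA_INDEX", "0"))
replica_count = int(os.environ.get("REPLICA_COUNT", "1"))

def shard(items, index, count):
    """Return the slice of `items` this replica is responsible for."""
    return items[index::count]

work_items = list(range(10))
print(shard(work_items, replica_index, replica_count))
```

With replicas: 4, replica 1 would process items 1, 5, and 9, replica 2 items 2 and 6, and so on, with no overlap between instances.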

min_available (Optional)

Minimum number of replicas that must be available before the job starts.

min_available: 2  # Start job when at least 2 replicas are ready

When to use:
- When you need partial scaling (don't wait for all replicas)
- For jobs that can start with fewer resources
- To reduce startup time in resource-constrained environments

Default behavior: If not specified, the job waits for all replicas to be ready.

config_map (Optional)

Environment variables passed to all job instances.

config_map:
  FOLDER_NAME: "test"
  EPOCHS: "5"
  BATCH_SIZE: "256"
  LEARNING_RATE: "0.01"

When to use:
- Pass configuration parameters to your application
- Set training hyperparameters
- Configure data paths and model settings
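Note that config_map values reach the job as environment variable strings (as in the example above), so numeric hyperparameters must be cast in application code. A minimal sketch in Python, using the same defaults as the example:

```python
import os

# config_map values arrive as strings; cast numeric hyperparameters explicitly.
epochs = int(os.environ.get("EPOCHS", "5"))
batch_size = int(os.environ.get("BATCH_SIZE", "256"))
learning_rate = float(os.environ.get("LEARNING_RATE", "0.01"))

print(f"training for {epochs} epochs, batch={batch_size}, lr={learning_rate}")
```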

shared_volume (Optional)

Creates a persistent volume shared across all replicas, mounted at /home.

shared_volume:
  size: 20  # 20 GB shared storage

When to use:
- Share datasets across multiple training instances
- Collect outputs from distributed jobs
- Synchronize checkpoints between replicas
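Because the volume is mounted at /home on every replica, replicas can coordinate through ordinary file I/O. The sketch below is illustrative Python: it writes to a temporary directory so it runs anywhere, but in a real job shared_dir would be /home (the shared_volume mount point). The write-then-rename pattern keeps other replicas from ever reading a partially written checkpoint:

```python
import json
import os
import tempfile

def save_checkpoint(shared_dir, step, state):
    """Write a checkpoint, then rename it into place (atomic on POSIX filesystems)."""
    path = os.path.join(shared_dir, f"checkpoint_{step}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)
    return path

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

# In a real job: shared_dir = "/home"
shared_dir = tempfile.mkdtemp()
p = save_checkpoint(shared_dir, 100, {"loss": 0.25})
print(load_checkpoint(p))  # → {'loss': 0.25}
```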

dependencies (Optional)

Additional Python packages to install before running the job.

dependencies:
  - torch
  - torchvision
  - boto3

Alternative: instead of listing packages here, you can add a requirements.txt file to the job files; see Job Requirements.

When to use:
- Install packages not included in the base image
- Add specific versions of libraries
- Quick package installation without maintaining separate requirements files
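The best practices below recommend pinning versions for reproducibility. Assuming the dependencies field accepts pip-style requirement specifiers (not confirmed by this document), pinned entries would look like:

```yaml
dependencies:
  - torch==2.1.0         # exact pin for reproducibility (pip-style specifier, assumed supported)
  - torchvision==0.16.0
  - boto3>=1.28,<2       # bounded range
```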

Configuration Examples

Example 1: Basic CPU job

resources:
  memory: 16
  cpu: 4
replicas: 1
config_map:
  DATA_PATH: "/data/input"
  OUTPUT_PATH: "/data/output"

Expected behavior:
- Single instance with 16GB RAM and 4 CPU cores
- Job starts immediately (only 1 replica needed)
- Environment variables set for data paths

Example 2: Distributed GPU training

resources:
  gpu:
    accelerator_count: 2
    accelerator: A100
  memory: 64
  cpu: 8
replicas: 4
min_available: 2
config_map:
  EPOCHS: "100"
  BATCH_SIZE: "64"
  LEARNING_RATE: "0.001"
shared_volume:
  size: 100
dependencies:
  - transformers
  - datasets

Expected behavior:
- 4 instances, each with 2 A100 GPUs, 64GB RAM, 8 CPU cores
- Job starts when at least 2 instances are ready
- 100GB shared volume mounted at /home for dataset sharing
- transformers and datasets libraries installed before the job starts
- Training hyperparameters set via environment variables

Example 3: Parallel GPU inference

resources:
  gpu:
    accelerator_count: 1
    accelerator: V100
  memory: 32
  cpu: 4
replicas: 8
config_map:
  MODEL_PATH: "/models/bert-large"
  BATCH_SIZE: "32"
  MAX_WORKERS: "4"
shared_volume:
  size: 50

Expected behavior:
- 8 instances for parallel inference processing
- Each instance has 1 V100 GPU for model acceleration
- 50GB shared volume for model weights and input/output data
- Waits for all 8 replicas before starting (no min_available specified)

Example 4: Lightweight CPU data processing

resources:
  memory: 8
  cpu: 2
replicas: 5
min_available: 1
config_map:
  CHUNK_SIZE: "1000"
  PARALLEL_WORKERS: "2"
dependencies:
  - pandas
  - numpy
  - boto3

Expected behavior:
- 5 lightweight instances (8GB RAM, 2 CPU cores each)
- Job starts as soon as 1 instance is ready
- No GPU allocation (CPU-only processing)
- No shared volume (each instance processes independently)
- Common data processing libraries installed

Example 5: Single large multi-GPU instance

resources:
  gpu:
    accelerator_count: 8
    accelerator: H100
  memory: 256
  cpu: 32
replicas: 1
config_map:
  OMP_NUM_THREADS: "32"

Expected behavior:
- Single powerful instance with 8 H100 GPUs and 256GB RAM
- CPU threading tuned via OMP_NUM_THREADS to match the 32 cores
- Job starts immediately when the single replica is ready

Best Practices

  1. Resource Planning: Start with smaller resource allocations and scale up based on actual usage
  2. Replica Strategy: Use min_available for jobs that can benefit from partial scaling
  3. Shared Storage: Use shared_volume when replicas need to share data or synchronize state
  4. Environment Variables: Use config_map to make your jobs configurable without code changes
  5. Dependencies: Keep dependency lists minimal and use specific versions for reproducibility