# Job Configuration with config.yaml

This document explains how to configure jobs using the `config.yaml` file. The configuration file defines the resources, scaling, and environment settings for your job execution.
## Configuration Structure

The `config.yaml` file supports the following fields:
### Required Fields

| Field | Type | Description |
|---|---|---|
| `resources` | object | Computational resources configuration for each instance |
| `replicas` | integer | Number of instances (Kubernetes pods) to run |
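A minimal `config.yaml` therefore needs only these two fields. The values below are an illustrative sketch, not a recommendation:

```yaml
resources:
  memory: 16   # GB of RAM per instance (required sub-field, see below)
  cpu: 4       # CPU cores per instance (required sub-field, see below)
replicas: 1    # number of instances (Kubernetes pods)
```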
### resources (Required)

The `resources` field contains the following sub-fields:

| Field | Type | Required | Description |
|---|---|---|---|
| `memory` | integer | Yes | RAM allocation per instance (in GB) |
| `cpu` | integer | Yes | Number of CPU cores per instance |
| `gpu` | object | No | GPU configuration for GPU-accelerated jobs |
#### gpu (Optional, under resources)

When specified, the `gpu` field contains:

| Field | Type | Description |
|---|---|---|
| `accelerator_count` | integer | Number of GPUs per instance |
| `accelerator` | string | GPU type (H100, A100, V100, etc.) |
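Putting the sub-fields together, a GPU-enabled `resources` block looks like this (values are illustrative):

```yaml
resources:
  memory: 64   # GB of RAM per instance
  cpu: 8       # CPU cores per instance
  gpu:         # optional; omit for CPU-only jobs
    accelerator_count: 2
    accelerator: A100
```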
### Optional Fields

| Field | Type | Description |
|---|---|---|
| `min_available` | integer | Minimum number of replicas needed before the job starts |
| `config_map` | object | Environment variables for the job |
| `shared_volume` | object | Persistent volume shared across all replicas |
| `dependencies` | array | Additional Python packages to install |
## Field Definitions and Usage

### resources

The `resources` section defines the computational resources for each instance.
#### gpu (Optional)

Configure GPU resources for machine learning workloads.

```yaml
resources:
  gpu:
    accelerator_count: 8  # Number of GPUs per instance
    accelerator: H100     # GPU type (H100, A100, V100, etc.)
```

When to use: For deep learning training or any GPU-accelerated computations.
#### memory (Required)

RAM allocation in GB per instance.

When to use: Always required. Set based on your job's memory requirements.
#### cpu (Required)

Number of CPU cores per instance.

When to use: Always required. Consider your job's computational intensity.
### replicas (Required)

Number of instances to run concurrently.

When to use: Always required. Use multiple replicas for:
- Distributed training
- Parallel data processing
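Combined with the required `resources` fields, a minimal multi-replica setup might look like this (a sketch with illustrative values):

```yaml
resources:
  memory: 32
  cpu: 8
replicas: 4  # four pods run the same job concurrently
```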
### min_available (Optional)

Minimum number of replicas that must be available before the job starts.

When to use:
- When you need partial scaling (don't wait for all replicas)
- For jobs that can start with fewer resources
- To reduce startup time in resource-constrained environments
Default behavior: If not specified, the job waits for all replicas to be ready.
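For instance, the following sketch lets a job begin once half of its replicas are ready (values are illustrative):

```yaml
replicas: 4
min_available: 2  # start as soon as 2 of the 4 replicas are available
```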
### config_map (Optional)

Environment variables passed to all job instances.

When to use:
- Pass configuration parameters to your application
- Set training hyperparameters
- Configure data paths and model settings
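A minimal sketch of a `config_map` block; note that the examples in this document quote every value, including numbers, as a string:

```yaml
config_map:
  DATA_PATH: "/data/input"  # illustrative application setting
  BATCH_SIZE: "64"          # numeric values are passed as strings
```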
### shared_volume (Optional)

Creates a persistent volume shared across all replicas, mounted at `/home`.

When to use:
- Share datasets across multiple training instances
- Collect outputs from distributed jobs
- Synchronize checkpoints between replicas
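A minimal sketch; judging by the examples below, `size` is specified in GB:

```yaml
shared_volume:
  size: 100  # 100GB volume, mounted at /home on every replica
```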
### dependencies (Optional)

Additional Python packages to install before running the job.

Alternative: add a `requirements.txt` file to your job files instead.

When to use:
- Install packages not included in the base image
- Add specific versions of libraries
- Quick package installation without maintaining separate requirements files
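A short sketch; the pinned version below is illustrative, not a requirement:

```yaml
dependencies:
  - transformers==4.41.0  # pinning versions aids reproducibility
  - datasets
```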
## Configuration Examples

### Example 1: Basic single-instance job

```yaml
resources:
  memory: 16
  cpu: 4
replicas: 1
config_map:
  DATA_PATH: "/data/input"
  OUTPUT_PATH: "/data/output"
```

Expected behavior:
- Single instance with 16GB RAM and 4 CPU cores
- Job starts immediately (only 1 replica needed)
- Environment variables set for data paths
### Example 2: Distributed GPU training

```yaml
resources:
  gpu:
    accelerator_count: 2
    accelerator: A100
  memory: 64
  cpu: 8
replicas: 4
min_available: 2
config_map:
  EPOCHS: "100"
  BATCH_SIZE: "64"
  LEARNING_RATE: "0.001"
shared_volume:
  size: 100
dependencies:
  - transformers
  - datasets
```

Expected behavior:
- 4 instances, each with 2 A100 GPUs, 64GB RAM, 8 CPU cores
- Job starts when at least 2 instances are ready
- 100GB shared volume mounted at `/home` for dataset sharing
- `transformers` and `datasets` libraries installed before the job starts
- Training hyperparameters set via environment variables
### Example 3: Parallel GPU inference

```yaml
resources:
  gpu:
    accelerator_count: 1
    accelerator: V100
  memory: 32
  cpu: 4
replicas: 8
config_map:
  MODEL_PATH: "/models/bert-large"
  BATCH_SIZE: "32"
  MAX_WORKERS: "4"
shared_volume:
  size: 50
```

Expected behavior:
- 8 instances for parallel inference processing
- Each instance has 1 V100 GPU for model acceleration
- 50GB shared volume for model weights and input/output data
- Waits for all 8 replicas before starting (no `min_available` specified)
### Example 4: Lightweight parallel data processing

```yaml
resources:
  memory: 8
  cpu: 2
replicas: 5
min_available: 1
config_map:
  CHUNK_SIZE: "1000"
  PARALLEL_WORKERS: "2"
dependencies:
  - pandas
  - numpy
  - boto3
```

Expected behavior:
- 5 lightweight instances (8GB RAM, 2 CPU cores each)
- Job starts as soon as 1 instance is ready
- No GPU allocation (CPU-only processing)
- No shared volume (each instance processes independently)
- Common data processing libraries installed
### Example 5: Single high-end GPU node

```yaml
resources:
  gpu:
    accelerator_count: 8
    accelerator: H100
  memory: 256
  cpu: 32
replicas: 1
config_map:
  OMP_NUM_THREADS: "32"
```

Expected behavior:
- Single powerful instance with 8 H100 GPUs and 256GB RAM
- `OMP_NUM_THREADS` set to match the 32 CPU cores for thread-level parallelism
- Job starts immediately when the single replica is ready
## Best Practices

- Resource Planning: Start with smaller resource allocations and scale up based on actual usage
- Replica Strategy: Use `min_available` for jobs that can benefit from partial scaling
- Shared Storage: Use `shared_volume` when replicas need to share data or synchronize state
- Environment Variables: Use `config_map` to make your jobs configurable without code changes
- Dependencies: Keep dependency lists minimal and use specific versions for reproducibility