Job Configuration with config.yaml
This document explains how to configure jobs using the config.yaml file. The configuration file defines the resources, scaling, and environment settings for your job execution.
Configuration Structure
The config.yaml file supports the following fields:
Required Fields
| Field | Type | Description |
|---|---|---|
| resources | object | Computational resources configuration for each instance |
| replicas | integer | Number of instances (Kubernetes pods) to run |
resources (Required)
The resources field contains the following sub-fields:
| Field | Type | Required | Description |
|---|---|---|---|
| memory | integer | Yes | RAM allocation per instance (in GB) |
| cpu | integer | Yes | Number of CPU cores per instance |
| gpu | object | No | GPU configuration for GPU-accelerated jobs |
| storage | integer | Yes | Ephemeral storage per instance (in GiB) |
gpu (Optional, under resources)
When specified, the gpu field contains:
| Field | Type | Description |
|---|---|---|
| accelerator_count | integer | Number of GPUs per instance |
| accelerator | string | GPU type (H100, A100, V100, etc.) |
Optional Fields
| Field | Type | Description |
|---|---|---|
| min_available | integer | Minimum number of replicas needed before the job starts |
| config_map | object | Environment variables for the job |
| dependencies | array | Additional Python packages to install |
Field Definitions and Usage
resources
The resources section defines the computational resources for each instance.
gpu (Optional)
Configure GPU resources for machine learning workloads.
resources:
  gpu:
    accelerator_count: 8   # Number of GPUs per instance
    accelerator: H100      # GPU type (H100, A100, V100, etc.)
When to use: For deep learning training or any GPU-accelerated computations.
memory (Required)
RAM allocation in GB per instance.
When to use: Always required. Set based on your job's memory requirements.
cpu (Required)
Number of CPU cores per instance.
When to use: Always required. Consider your job's computational intensity.
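For illustration, a minimal resources block that fills in the required sub-fields from the table above might look like the sketch below; the numbers are placeholders, not recommendations.

resources:
  memory: 16   # GB of RAM per instance
  cpu: 4       # CPU cores per instance
  storage: 1   # GiB of ephemeral storage per instance (also required)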
replicas (Required)
Number of instances to run concurrently.
When to use: Always required. Use multiple replicas for:
- Distributed training
- Parallel data processing
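For example, a distributed training or data-processing run across four instances would add a replica count alongside the resources block; the value is illustrative.

replicas: 4   # run four instances (Kubernetes pods) concurrently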
min_available (Optional)
Minimum number of replicas that must be available before the job starts.
When to use:
- When you need partial scaling (don't wait for all replicas)
- For jobs that can start with fewer resources
- To reduce startup time in resource-constrained environments
Default behavior: If not specified, the job waits for all replicas to be ready.
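As a sketch, the following requests eight replicas but lets the job start once four are ready; the numbers are placeholders.

replicas: 8        # request eight instances in total
min_available: 4   # start the job as soon as four instances are ready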
config_map (Optional)
Environment variables passed to all job instances.
When to use:
- Pass configuration parameters to your application
- Set training hyperparameters
- Configure data paths and model settings
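For illustration, the snippet below sets a data path and a hyperparameter; the variable names follow the examples later in this document, and the values are passed to your application as ordinary environment-variable strings.

config_map:
  DATA_PATH: "/data/input"   # exposed to every instance as an environment variable
  EPOCHS: "100"              # values are strings; parse them in your application code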
Data volumes
Refer to the Datavolumes Docs for more information on attaching a volume to your Job workload.
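As a minimal sketch based on Example 2 below, a data volume can be referenced from config.yaml roughly like this; the volume name is a placeholder, and the full set of supported fields is documented in the Datavolumes Docs.

output_checkpoints_data_volume:
  name: data-vol-your-datavolume-id   # placeholder ID of an existing data volume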
dependencies (Optional)
Additional Python packages to install before running the job.
Alternative: Add a requirements.txt file to the job files instead.

When to use:
- Install packages not included in the base image
- Add specific versions of libraries
- Quick package installation without maintaining separate requirements files
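For example, assuming pip-style version specifiers are accepted, a pinned dependency list might look like this; the version shown is a placeholder.

dependencies:
  - transformers==4.40.0   # pin a specific version for reproducibility (placeholder version)
  - datasets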
Configuration Examples
Example 1: Basic CPU job
resources:
  memory: 16
  cpu: 4
  storage: 1
replicas: 1
config_map:
  DATA_PATH: "/data/input"
  OUTPUT_PATH: "/data/output"
Expected behavior:
- Single instance with 16GB RAM and 4 CPU cores
- Job starts immediately (only 1 replica needed)
- Environment variables set for data paths
Example 2: Multi-GPU distributed training job
resources:
  gpu:
    accelerator_count: 2
    accelerator: A100
  memory: 64
  cpu: 8
  storage: 1
output_checkpoints_data_volume:
  name: data-vol-your-datavolume-id
replicas: 4
min_available: 2
config_map:
  EPOCHS: "100"
  BATCH_SIZE: "64"
  LEARNING_RATE: "0.001"
dependencies:
  - transformers
  - datasets
Expected behavior:
- 4 instances, each with 2 A100 GPUs, 64GB RAM, 8 CPU cores
- Job starts when at least 2 instances are ready
- transformers and datasets libraries installed before the job starts
- Training hyperparameters set via environment variables
- Data volume mounted and accessible for model/dataset sharing
Best Practices
- Resource Planning: Start with smaller resource allocations and scale up based on actual usage
- Replica Strategy: Use min_available for jobs that can benefit from partial scaling
- Data Volumes: Use data volumes when replicas need to share data or synchronize state
- Environment Variables: Use config_map to make your jobs configurable without code changes
- Dependencies: Keep dependency lists minimal and use specific versions for reproducibility