Job Examples
The Jobs module in OICM+ supports various AI workloads, including distributed training with frameworks like PyTorch and Ray. Below, you’ll find code samples and best practices to adapt for your own AI pipelines.
1. PyTorch Examples
1.1 FSDP (Fully Sharded Data Parallel)
Scale massive PyTorch models by sharding parameters across GPUs.
Fully Sharded Data Parallel (FSDP) is a PyTorch wrapper that enables training of models too large to fit on a single GPU by sharding the model's parameters across GPUs. The following example demonstrates how to set up and run a training job using FSDP within the OICM+ Platform.
- Core Idea – Enable training for models too large for a single GPU by distributing parameters and states.
- Example – ResNet18 on CIFAR-10 with AWS S3 logging. The provided example trains a ResNet18 model on the CIFAR-10 dataset using FSDP. This example includes setting up the environment, initializing distributed training, and leveraging AWS S3 for logging and saving model checkpoints.
Key Steps
- Setup & Init
  - Logger Setup: A global logger is initialized to handle output across different nodes during training.
  - AWS S3 Upload: Functions are provided to upload logs and model checkpoints to an S3 bucket.
  - Distributed Training Setup: The code initializes the process group using torchrun with the environment variables RANK and WORLD_SIZE, which are automatically set by the OICM+ Platform.
- Training & Eval
  - Forward/backward passes, loss computation, and checkpointing.
  - Training Function: The train function handles the forward and backward passes, loss computation, and optimization step for each batch.
  - Evaluation Function: The test function evaluates the model on the test dataset and logs the performance metrics.
- Monitoring
  - Track progress in the Jobs tab (resource usage, logs, etc.). You can also delete the job if needed.
- Code – the complete example is available on the OICM+ Jobs GitHub; a condensed sketch follows below.
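The full example lives in the OICM+ Jobs GitHub repository; the following is a condensed, illustrative sketch of the steps above, not the exact script. It assumes torchrun supplies the RANK, LOCAL_RANK, and WORLD_SIZE environment variables and that BUCKET_NAME is exposed from the job's secrets; hyperparameters are placeholders.

import os

import boto3
import torch
import torch.distributed as dist
import torchvision
import torchvision.transforms as T
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun provides RANK, LOCAL_RANK, and WORLD_SIZE to every worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
    ])
    train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
    sampler = DistributedSampler(train_set)  # each rank gets a unique shard
    loader = DataLoader(train_set, batch_size=256, sampler=sampler, num_workers=2)

    # FSDP shards the model's parameters, gradients, and optimizer state across GPUs.
    model = FSDP(torchvision.models.resnet18(num_classes=10).cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(5):
        sampler.set_epoch(epoch)
        model.train()
        for images, labels in loader:
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch}: last batch loss {loss.item():.4f}")

    # state_dict() is a collective call on every rank; only rank 0 saves and uploads.
    state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, "resnet18_fsdp.pt")
        # BUCKET_NAME is assumed to come from the job's secrets (see the sample YAML below).
        boto3.client("s3").upload_file("resnet18_fsdp.pt", os.environ["BUCKET_NAME"], "checkpoints/resnet18_fsdp.pt")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()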
1.2 DDP (Distributed Data Parallel)
Scale training across multiple GPUs with PyTorch’s DDP.
The Distributed Data Parallel (DDP) framework is a widely used method in PyTorch for scaling up training across multiple GPUs. This example demonstrates how to fine-tune a ResNet18 model on the CIFAR-10 dataset using DDP within the OICM+ Platform. To see the complete code, visit our GitHub page.
- Core Idea – Synchronize gradients effectively for multi-GPU setups.
- Example – Fine-tune ResNet18 on CIFAR-10 with AWS S3 for logs and checkpoints. This example script covers the entire process from initializing distributed training to evaluating the model. It also includes functionality to upload logs and model checkpoints to AWS S3 for easy access and backup.
Key Steps
- Setup & Init
  - Logger Setup: A global logger is set up to capture output across all nodes during training.
  - AWS S3 Upload: The script includes functions to upload training logs and model checkpoints to an S3 bucket.
  - Distributed Training Setup: The script initializes the process group using the RANK environment variable for distributed training, which should be automatically set when using torchrun.
- Data Prep
  - Use CIFAR-10, apply normalization & augmentation.
  - Data Loading: The CIFAR-10 dataset is used, and the script splits the training data into training and validation sets. The data is normalized and augmented to improve model performance.
  - Data Loaders: Distributed samplers are used to ensure that each GPU processes a unique subset of the data during training.
- Training & Eval
  - Run forward/backward pass, check validation metrics per epoch.
  - Training Function: This function handles the forward pass, loss computation, backpropagation, and optimization steps for each batch. It also applies L2 regularization to prevent overfitting.
  - Evaluation Function: After each epoch, the model is evaluated on the validation set to track its performance.
- Code – the complete example is available on the OICM+ Jobs GitHub; a condensed sketch follows below.
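As with the FSDP example, the full script is on GitHub; the sketch below illustrates the DDP-specific pieces described above. It assumes torchrun sets the distributed environment variables, uses weight_decay to stand in for the L2 regularization mentioned above, and uses illustrative split sizes and hyperparameters.

import os

import torch
import torch.distributed as dist
import torchvision
import torchvision.transforms as T
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, random_split
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group(backend="nccl")  # RANK / WORLD_SIZE come from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # CIFAR-10 with normalization and light augmentation; hold out a validation split.
    transform = T.Compose([
        T.RandomHorizontalFlip(),
        T.ToTensor(),
        T.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
    ])
    full_train = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
    # Fixed generator so every rank produces the same train/validation split.
    train_set, val_set = random_split(full_train, [45000, 5000], generator=torch.Generator().manual_seed(0))

    # Distributed samplers give every GPU a unique subset of the data.
    train_sampler = DistributedSampler(train_set)
    train_loader = DataLoader(train_set, batch_size=256, sampler=train_sampler)
    val_loader = DataLoader(val_set, batch_size=256, sampler=DistributedSampler(val_set, shuffle=False))

    # DDP synchronizes gradients across GPUs after every backward pass.
    model = DDP(torchvision.models.resnet18(num_classes=10).cuda(), device_ids=[local_rank])
    # weight_decay provides the L2 regularization mentioned above.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(5):
        train_sampler.set_epoch(epoch)
        model.train()
        for x, y in train_loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()

        # Per-epoch validation on this rank's shard of the held-out set.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                preds = model(x.cuda()).argmax(dim=1)
                correct += (preds == y.cuda()).sum().item()
                total += y.size(0)
        if dist.get_rank() == 0:
            print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()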
Sample YAML resource allocation for both examples above:
resources:
  gpu:
    accelerator_count: 2
    accelerator: <your accelerator name> # examples: A10G or L4
  memory: 8
  cpu: 4
dependencies:
  - torch
  - torchvision
  - boto3
secrets:
  AWS_ACCESS_KEY_ID: <access key>
  AWS_SECRET_ACCESS_KEY: <secret key>
  REGION_NAME: <region name>
  BUCKET_NAME: <bucket name>
config_map:
  FOLDER_NAME: <folder name>
  EPOCHS: "5"
  BATCH_SIZE: "256"
  LEARNING_RATE: "0.01"
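Assuming the secrets and config_map entries above are exposed to the job as environment variables (verify this for your deployment), a training script might consume them as sketched here; the upload_to_s3 helper is illustrative, not part of the examples above.

import os

import boto3

# Hyperparameters injected through the config_map above.
epochs = int(os.environ.get("EPOCHS", "5"))
batch_size = int(os.environ.get("BATCH_SIZE", "256"))
learning_rate = float(os.environ.get("LEARNING_RATE", "0.01"))

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment;
# the region, bucket, and folder come from the secrets and config_map above.
s3 = boto3.client("s3", region_name=os.environ["REGION_NAME"])


def upload_to_s3(local_path: str) -> None:
    """Upload a log file or checkpoint under the configured folder in the bucket."""
    key = f"{os.environ['FOLDER_NAME']}/{os.path.basename(local_path)}"
    s3.upload_file(local_path, os.environ["BUCKET_NAME"], key)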
2. Ray-Based Examples
Ray is an open-source framework for scaling AI tasks across multiple GPUs and nodes, providing a flexible and powerful compute layer for parallel processing. With Ray, you can efficiently manage distributed training jobs, making it easier to train models that span multiple devices such as GPUs. Ray integrates seamlessly with the OICM+ Platform, allowing you to leverage its capabilities within your existing workflows. To learn more about training with Ray, see the official Ray documentation.
2.1 Sequence Classification with DeepSpeed + Ray
Train BERT on GLUE MRPC with Ray’s distributed data loading and DeepSpeed optimization.
- Core Idea – Combine Ray’s data parallelism with DeepSpeed’s large-scale optimization.
- Example – BERT model for classification, computing Binary F1 and Accuracy. This example demonstrates how to train a BERT model for sequence classification using DeepSpeed in conjunction with Ray. The training function is distributed across multiple GPUs, utilizing Ray's data loading and DeepSpeed's optimization techniques. The model is trained on the GLUE MRPC dataset, and performance metrics such as Binary F1 Score and Accuracy are computed during evaluation.
- Code – GitHub link; a condensed sketch also follows the Key Steps below.
Key Steps
- Init
  - Split data with Ray. The GLUE MRPC dataset is loaded and split into training and validation sets using Ray's data loading capabilities. The BERT model is initialized with DeepSpeed for optimization.
- DeepSpeed
  - Automatic checkpointing, memory optimizations. The training is set up with DeepSpeed to handle large-scale distributed training across GPUs, with automatic checkpointing for fault tolerance.
- Training & Eval
  - Forward pass, backprop, measure metrics each epoch.
  - Training Function: The train function handles the forward and backward passes, loss computation, and optimization step for each batch. The model is trained across multiple epochs, with metrics such as Binary F1 Score and Accuracy computed during evaluation.
  - Evaluation Function: The evaluation function computes metrics on the validation set after each epoch and updates the model's performance metrics.
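The GitHub example is the reference; the sketch below is an illustrative reconstruction of the steps above, not the exact script. It assumes Ray Train's TorchTrainer with a Ray Dataset shard of GLUE MRPC and a ZeRO stage-2 DeepSpeed config; the model checkpoint name, batch sizes, and the torchmetrics-based metrics are assumptions.

# Illustrative dependencies: ray[train], deepspeed, transformers, datasets, torchmetrics
import deepspeed
import ray.data
import ray.train
import torch
from datasets import load_dataset
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from torchmetrics.classification import BinaryAccuracy, BinaryF1Score
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def train_loop(config):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # DeepSpeed builds the optimizer and handles ZeRO sharding and fp16 from this config.
    ds_config = {
        "train_micro_batch_size_per_gpu": 16,
        "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }
    engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config,
                                           model_parameters=model.parameters())

    # Each Ray worker iterates over its own shard of the dataset passed to the trainer.
    train_shard = ray.train.get_dataset_shard("train")
    f1 = BinaryF1Score().to(engine.device)
    acc = BinaryAccuracy().to(engine.device)

    for epoch in range(config["epochs"]):
        for batch in train_shard.iter_torch_batches(batch_size=16):
            enc = tokenizer(list(batch["sentence1"]), list(batch["sentence2"]),
                            padding=True, truncation=True, return_tensors="pt").to(engine.device)
            labels = batch["label"].to(engine.device)
            out = engine(**enc, labels=labels)
            engine.backward(out.loss)
            engine.step()
            preds = out.logits.argmax(dim=-1)
            f1.update(preds, labels)
            acc.update(preds, labels)
        # Reported metrics show up in the job's logs in OICM+.
        ray.train.report({"epoch": epoch, "f1": f1.compute().item(), "accuracy": acc.compute().item()})
        f1.reset()
        acc.reset()


if __name__ == "__main__":
    mrpc = load_dataset("glue", "mrpc")
    trainer = TorchTrainer(
        train_loop,
        train_loop_config={"epochs": 3},
        datasets={"train": ray.data.from_huggingface(mrpc["train"])},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    trainer.fit()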
2.2 FSDP + Ray
Combine Fully Sharded Data Parallel with Ray for large-scale training.
- Core Idea – Manage memory and compute across multiple GPUs with FSDP.
- Example – Custom CNN, synthetic data. This example shows how to set up and run a fully sharded data parallel (FSDP) training job using Ray. The training function is distributed across multiple GPUs, leveraging Ray's capabilities for data loading and model distribution. The model is a custom convolutional neural network (CNN) designed to classify a synthetic dataset.
- Code – GitHub link; a condensed sketch also follows the Key Steps below.
Key Steps
- Synthetic Data
  - Ray's distributed data loading.
  - Define a synthetic dataset: The dataset is synthetically generated to simulate real-world data, and Ray's distributed data loading capabilities are used to manage it across multiple GPUs.
- Model Init
  - FSDP for parameter sharding.
  - Initialize the model with FSDP: The model is prepared for training with Fully Sharded Data Parallel (FSDP) to efficiently manage memory and computational resources across multiple GPUs.
- Train & Monitor
  - Evaluate performance, watch logs in OICM+.
  - Training Function: The train function handles the forward pass, loss computation, backpropagation, and optimization steps for each batch. The model is trained across multiple epochs using FSDP, ensuring efficient distribution of the model's parameters.
  - Evaluation Function: The model's performance is evaluated periodically, and the training process is monitored to ensure that the model is converging as expected.
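A minimal illustrative sketch of the steps above, not the published example: it assumes Ray Train's prepare_model helper with parallel_strategy="fsdp", and the SmallCNN architecture and random-tensor dataset stand in for the actual synthetic data.

import torch
import torch.nn as nn
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model


class SmallCNN(nn.Module):
    """Illustrative CNN for 32x32 single-channel synthetic images, 10 classes."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 8 * 8, 10)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))


def train_loop(config):
    # Synthetic dataset standing in for real data.
    x = torch.randn(4096, 1, 32, 32)
    y = torch.randint(0, 10, (4096,))
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x, y), batch_size=config["batch_size"], shuffle=True)
    # prepare_data_loader adds a DistributedSampler and moves batches to the worker's device.
    loader = prepare_data_loader(loader)

    # parallel_strategy="fsdp" shards the model's parameters across the Ray workers.
    model = prepare_model(SmallCNN(), parallel_strategy="fsdp")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(config["epochs"]):
        model.train()
        total_loss = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        # Reported metrics show up in the job's logs in OICM+.
        ray.train.report({"epoch": epoch, "loss": total_loss / len(loader)})


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop,
        train_loop_config={"epochs": 5, "batch_size": 64},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    trainer.fit()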
2.3 FashionMNIST with Ray
Train a feedforward neural network on FashionMNIST, distributed with Ray.
This example demonstrates how to perform distributed training on the FashionMNIST dataset using Ray and PyTorch. The training function handles distributed data loading and model training across multiple GPUs.
- Core Idea – Simple CPU/GPU distribution with Ray.
- Example – Download FashionMNIST, train across GPUs. This example trains a simple feedforward neural network on the FashionMNIST dataset using Ray. The training is distributed across multiple GPUs, utilizing Ray's capabilities for data loading and model distribution.
- Code – GitHub link; a condensed sketch also follows the Key Steps below.
Key Steps
- Data Prep
  - PyTorch utilities, Ray for distribution.
  - Load the FashionMNIST dataset: The dataset is downloaded and prepared for training using PyTorch's data utilities. Ray is used to distribute the dataset across multiple GPUs for parallel processing.
  - Prepare the model for distributed training: A simple feedforward neural network is initialized and prepared for distributed training across multiple GPUs using Ray.
- Train
  - Forward, backward passes on each GPU.
  - Training Function: The train function handles the forward pass, loss computation, backpropagation, and optimization steps for each batch. The model is trained across multiple epochs, with the dataset evenly distributed across all available GPUs.
- Eval
  - Monitor accuracy and logs with OICM+.
  - Evaluation Function: The model's performance is evaluated periodically on a validation set, and the training metrics are monitored to ensure optimal performance.
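A minimal sketch of the steps above using Ray Train's TorchTrainer; the network layout, hyperparameters, and use of the test split for evaluation are illustrative and may differ from the full example on GitHub.

import torch
import torch.nn as nn
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model
from torchvision import datasets, transforms


def train_loop(config):
    transform = transforms.ToTensor()
    train_set = datasets.FashionMNIST("./data", train=True, download=True, transform=transform)
    test_set = datasets.FashionMNIST("./data", train=False, download=True, transform=transform)
    # prepare_data_loader shards the data across workers and moves batches to the right device.
    train_loader = prepare_data_loader(
        torch.utils.data.DataLoader(train_set, batch_size=config["batch_size"], shuffle=True))
    test_loader = prepare_data_loader(
        torch.utils.data.DataLoader(test_set, batch_size=config["batch_size"]))

    # Simple feedforward network: flattened 28x28 images through two hidden layers.
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )
    model = prepare_model(model)  # wraps the model with DDP and moves it to the worker's device

    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    criterion = nn.CrossEntropyLoss()

    for epoch in range(config["epochs"]):
        model.train()
        for xb, yb in train_loader:
            optimizer.zero_grad()
            criterion(model(xb), yb).backward()
            optimizer.step()

        # Evaluate on this worker's shard; reported metrics appear in the OICM+ logs.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for xb, yb in test_loader:
                correct += (model(xb).argmax(dim=1) == yb).sum().item()
                total += yb.size(0)
        ray.train.report({"epoch": epoch, "accuracy": correct / total})


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop,
        train_loop_config={"epochs": 5, "batch_size": 64, "lr": 0.01},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    trainer.fit()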
3. Configuring Ray Jobs
A Ray job is configured with a YAML file that defines the runtime environment, replica count, resources, and environment variables:
- runtimeEnvYAML: Specifies the Python packages and dependencies required for your job.
- replicas: Defines the number of replicas (or workers) that will run your job.
- resources: Specifies the resources allocated to each worker, including CPU, memory, and GPU specifications.
- env: Environment variables needed for your job, such as AWS credentials for accessing S3 storage.
runtimeEnvYAML: |
  pip:
    - torch
    - torchvision
    - torchaudio
replicas: 2
resources:
  cpu: "5"
  memory: "12G"
  gpu:
    accelerator: t4
    accelerator_count: 1
env:
  KEY_1: VALUE_1
  KEY_2: VALUE_2
Further Resources
- OICM+ Jobs GitHub – Complete code examples and configurations.
- PyTorch Docs – Official documentation for advanced PyTorch features.
- Ray Docs – Learn about scaling Python and ML tasks with Ray.