Jobs Examples
The Jobs module in the OICM+ Platform supports various AI job types, including distributed training using frameworks like Ray and PyTorch. Below are detailed examples demonstrating how to configure and run jobs using the module, along with code samples that can be directly used or adapted for your own AI workflows.
Example 1: Fully Sharded Data Parallel (FSDP) with PyTorch
Fully Sharded Data Parallel (FSDP) is a PyTorch wrapper that enables training models too large to fit on a single GPU by sharding the model's parameters across GPUs. The following example demonstrates how to set up and run a training job using FSDP within the OICM+ Platform.
Example Code
The provided example trains a ResNet18 model on the CIFAR-10 dataset using FSDP. This example includes setting up the environment, initializing distributed training, and leveraging AWS S3 for logging and saving model checkpoints. To see the complete code, visit our GitHub page.
Setup and Initialization
- Logger Setup: A global logger is initialized to handle output across different nodes during training.
- AWS S3 Upload: Functions are provided to upload logs and model checkpoints to an S3 bucket.
- Distributed Training Setup: The code initializes the process group using torchrun with the environment variables RANK and WORLD_SIZE, which are automatically set by the OICM+ Platform.
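The sketch below shows what this initialization can look like. It assumes the RANK, WORLD_SIZE, and LOCAL_RANK variables that torchrun injects into each process; the setup_distributed helper and the ResNet18 wrapping are illustrative, not part of the platform API.

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torchvision.models import resnet18

def setup_distributed():
    # RANK, WORLD_SIZE, and LOCAL_RANK are injected by torchrun / the OICM+ job launcher.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank

local_rank = setup_distributed()
# FSDP shards the ResNet18 parameters across all GPUs in the process group.
model = FSDP(resnet18(num_classes=10).to(local_rank))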
Training and Evaluation Functions
- Training Function: The train function handles the forward and backward passes, loss computation, and optimization step for each batch.
- Evaluation Function: The test function evaluates the model on the test dataset and logs the performance metrics.
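A minimal sketch of such a train/test pair is shown below; the loss function, metric, and logging details are simplified stand-ins for the full implementation in the repository.

import torch
import torch.nn.functional as F

def train(model, loader, optimizer, device, epoch):
    model.train()
    for batch_idx, (images, labels) in enumerate(loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)  # forward pass and loss computation
        loss.backward()                                # backward pass
        optimizer.step()                               # optimization step

def test(model, loader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total  # accuracy on the test set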
Monitoring and Managing the Job
Once the job is started, you can monitor its progress in the Jobs tab. The platform provides real-time updates on job status, resource utilization, and logs. You can also delete the job if needed.
Example 2: Distributed Data Parallel (DDP) with PyTorch
The Distributed Data Parallel (DDP) framework is a widely used method in PyTorch for scaling up training across multiple GPUs. This example demonstrates how to fine-tune a ResNet18 model on the CIFAR-10 dataset using DDP within the OICM+ Platform. To see the complete code, visit our GitHub page.
Example Code
This example script covers the entire process from initializing distributed training to evaluating the model. It also includes functionality to upload logs and model checkpoints to AWS S3 for easy access and backup.
Setup and Initialization
- Logger Setup: A global logger is set up to capture output across all nodes during training.
- AWS S3 Upload: The script includes functions to upload training logs and model checkpoints to an S3 bucket.
- Distributed Training Setup: The script initializes the process group using the RANK environment variable for distributed training, which should be automatically set when using torchrun.
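A condensed sketch of this DDP setup is shown below, again assuming the environment variables set by torchrun; the ResNet18 wrapping mirrors the fine-tuning described above.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet18

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process it launches,
# so init_process_group can read them directly from the environment.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# DDP keeps a full copy of the model on each GPU and synchronizes gradients.
model = DDP(resnet18(num_classes=10).to(local_rank), device_ids=[local_rank])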
Data Preparation
- Data Loading: The CIFAR-10 dataset is used, and the script splits the training data into training and validation sets. The data is normalized and augmented to improve model performance.
- Data Loaders: Distributed samplers are used to ensure that each GPU processes a unique subset of the data during training.
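The data pipeline can be sketched roughly as follows; the split sizes, augmentations, and normalization constants are illustrative rather than the exact values used in the example.

import torch
from torch.utils.data import DataLoader, random_split
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),           # augmentation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),  # CIFAR-10 channel means
                         (0.2470, 0.2435, 0.2616)), # CIFAR-10 channel stds
])

full_train = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
train_set, val_set = random_split(full_train, [45000, 5000])

# DistributedSampler gives every rank a disjoint shard of the data each epoch.
train_loader = DataLoader(train_set, batch_size=256,
                          sampler=DistributedSampler(train_set))
val_loader = DataLoader(val_set, batch_size=256, shuffle=False)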
Training and Evaluation Functions
- Training Function: This function handles the forward pass, loss computation, backpropagation, and optimization steps for each batch. It also applies L2 regularization to prevent overfitting.
- Evaluation Function: After each epoch, the model is evaluated on the validation set to track its performance.
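In PyTorch, L2 regularization is typically applied through the optimizer's weight_decay argument; given the DDP-wrapped model from the sketch above, that looks like:

import torch

# weight_decay adds an L2 penalty on the model parameters; the value here is illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)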
For both of the examples above, the following configuration is applicable:
resources:
  gpu:
    accelerator_count: 2
    accelerator: <your accelerator name> # examples: A10G or L4
  memory: 8
  cpu: 4
dependencies:
  - torch
  - torchvision
  - boto3
secrets:
  AWS_ACCESS_KEY_ID: <access key>
  AWS_SECRET_ACCESS_KEY: <secret key>
  REGION_NAME: <region name>
  BUCKET_NAME: <bucket name>
config_map:
  FOLDER_NAME: <folder name>
  EPOCHS: "5"
  BATCH_SIZE: "256"
  LEARNING_RATE: "0.01"
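Inside the training scripts, these values are consumed as environment variables. The sketch below assumes that the platform injects secrets and config_map entries into the job's environment; the file name and S3 key layout are illustrative.

import os
import boto3

# Assumes secrets and config_map entries are exposed as environment variables.
epochs = int(os.environ.get("EPOCHS", "5"))
batch_size = int(os.environ.get("BATCH_SIZE", "256"))
learning_rate = float(os.environ.get("LEARNING_RATE", "0.01"))

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment automatically.
s3 = boto3.client("s3", region_name=os.environ["REGION_NAME"])
s3.upload_file("training.log", os.environ["BUCKET_NAME"],
               f"{os.environ['FOLDER_NAME']}/training.log")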
Training with Ray
Ray is an open-source framework designed to scale AI and Python applications across multiple GPUs and nodes, providing a flexible and powerful compute layer for parallel processing. With Ray, you can efficiently manage distributed training jobs, making it easier to train models that span multiple devices, such as GPUs. Ray integrates seamlessly with the OICM+ Platform, allowing you to leverage its capabilities within your existing workflows. To learn more about training with Ray, visit the official documentation of the Ray framework.
Example 1: Sequence Classification with DeepSpeed and Ray
This example demonstrates how to train a BERT model for sequence classification using DeepSpeed in conjunction with Ray. The training function is distributed across multiple GPUs, utilizing Ray's data loading and DeepSpeed's optimization techniques. The model is trained on the GLUE MRPC dataset, and performance metrics such as Binary F1 Score and Accuracy are computed during evaluation.
Example Code
The script loads the GLUE MRPC dataset with Ray's data utilities, initializes the BERT model and optimizer with DeepSpeed, and distributes training and evaluation across the available GPUs.
Setup and Initialization
- Initialize Ray datasets and model: The GLUE MRPC dataset is loaded and split into training and validation sets using Ray's data loading capabilities. The BERT model is initialized with DeepSpeed for optimization.
- Use DeepSpeed for optimized training and checkpointing: The training is set up with DeepSpeed to handle large-scale distributed training across GPUs, with automatic checkpointing for fault tolerance.
Training and Evaluation Functions
- Training Function: The train function handles the forward and backward passes, loss computation, and optimization step for each batch. The model is trained across multiple epochs, with metrics such as Binary F1 Score and Accuracy computed during evaluation.
- Evaluation Function: The evaluation function computes metrics on the validation set after each epoch and updates the model's performance metrics.
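A heavily condensed sketch of how such a training function plugs into Ray Train with DeepSpeed is shown below; the DeepSpeed configuration, dataset handling, and hyperparameters are placeholders for the full example.

import deepspeed
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from transformers import AutoModelForSequenceClassification

def train_func(config):
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    # DeepSpeed wraps the model and optimizer according to the given config.
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(),
        config=config["deepspeed_config"])
    # ... iterate over the MRPC shards, call engine.backward(loss) and engine.step(),
    # and compute Binary F1 Score and Accuracy on the validation split ...

trainer = TorchTrainer(
    train_func,
    train_loop_config={"deepspeed_config": {"train_micro_batch_size_per_gpu": 16}},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()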
For the full code and configuration, refer to our GitHub repository.
Example 2: Distributed Training with FSDP and Ray
This example shows how to set up and run a fully sharded data parallel (FSDP) training job using Ray. The training function is distributed across multiple GPUs, leveraging Ray's capabilities for data loading and model distribution. The model is a custom convolutional neural network (CNN) designed to classify a synthetic dataset.
Example Code
The script defines the synthetic dataset, wraps the CNN with FSDP, and runs the distributed training loop across multiple GPUs using Ray. For the full code and configuration, refer to our GitHub repository.
Setup and Initialization
- Define a synthetic dataset: The dataset is synthetically generated to simulate real-world data, and Ray's distributed data loading capabilities are used to manage it across multiple GPUs.
- Initialize the model with FSDP: The model is prepared for training with Fully Sharded Data Parallel (FSDP) to efficiently manage memory and computational resources across multiple GPUs.
Training and Evaluation Functions
- Training Function: The train function handles the forward pass, loss computation, backpropagation, and optimization steps for each batch. The model is trained across multiple epochs using FSDP, ensuring efficient distribution of the model's parameters.
- Evaluation Function: The model's performance is evaluated periodically, and the training process is monitored to ensure that the model is converging as expected.
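The sketch below shows one way to express this with Ray Train, assuming a Ray version whose prepare_model accepts the fsdp parallel strategy; the small CNN and random tensors stand in for the custom model and synthetic dataset of the full example.

import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    device = ray.train.torch.get_device()
    # A small CNN standing in for the custom model in the full example.
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
    # prepare_model moves the model to the worker's device and wraps it with FSDP.
    model = ray.train.torch.prepare_model(model, parallel_strategy="fsdp")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(config["epochs"]):
        # Synthetic batch standing in for the real dataset.
        images = torch.randn(64, 3, 32, 32, device=device)
        labels = torch.randint(0, 10, (64,), device=device)
        loss = nn.functional.cross_entropy(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_func,
    train_loop_config={"epochs": 5},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()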
For the full code and configuration, refer to our GitHub repository.
Example 3: Distributed FashionMNIST Training with Ray
This example demonstrates how to perform distributed training on the FashionMNIST dataset using Ray and PyTorch. The training function handles distributed data loading and model training across multiple GPUs.
Example Code
This example trains a simple feedforward neural network on the FashionMNIST dataset using Ray. The training is distributed across multiple GPUs, utilizing Ray's capabilities for data loading and model distribution. For the full code and configuration, refer to our GitHub repository.
Setup and Initialization
- Load the FashionMNIST dataset: The dataset is downloaded and prepared for training using PyTorch's data utilities. Ray is used to distribute the dataset across multiple GPUs for parallel processing.
- Prepare the model for distributed training: A simple feedforward neural network is initialized and prepared for distributed training across multiple GPUs using Ray.
Training and Evaluation Functions
- Training Function: The train function handles the forward pass, loss computation, backpropagation, and optimization steps for each batch. The model is trained across multiple epochs, with the dataset evenly distributed across all available GPUs.
- Evaluation Function: The model's performance is evaluated periodically on a validation set, and the training metrics are monitored to ensure optimal performance.
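A condensed sketch of the FashionMNIST training loop under Ray Train is shown below; the network size, batch size, and epoch count are illustrative.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    dataset = datasets.FashionMNIST(
        "./data", train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    # prepare_data_loader adds a DistributedSampler and moves batches to the right device.
    loader = ray.train.torch.prepare_data_loader(loader)

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                          nn.ReLU(), nn.Linear(128, 10))
    model = ray.train.torch.prepare_model(model)  # wraps the model with DDP by default
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(config["epochs"]):
        for images, labels in loader:
            loss = nn.functional.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

trainer = TorchTrainer(
    train_func,
    train_loop_config={"epochs": 5, "batch_size": 256},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
trainer.fit()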
For the full code and configuration, refer to our GitHub repository.
Configuring the Ray Job
To run distributed training jobs using Ray on the OICM+ Platform, you need to configure a YAML file that specifies the environment, resources, and any necessary environment variables. Below is a guide on how to structure the YAML file for different Ray jobs, along with examples of key configurations.
General Structure of the YAML Configuration
A typical YAML configuration file for Ray jobs consists of several key sections:
- runtimeEnvYAML: Specifies the Python packages and dependencies required for your job.
- replicas: Defines the number of replicas (or workers) that will run your job.
- resources: Specifies the resources allocated to each worker, including CPU, memory, and GPU specifications.
- env: Environment variables needed for your job, such as AWS credentials for accessing S3 storage.
Here’s a general template for structuring your YAML file:
runtimeEnvYAML: |
  pip:
    - torch
    - torchvision
    - torchaudio
replicas: 2
resources:
  cpu: "5"
  memory: "12G"
  gpu:
    accelerator: t4
    accelerator_count: 1
env:
  KEY_1: VALUE_1
  KEY_2: VALUE_2
Further Reading and Resources
For more detailed information and additional examples, please check the following resources:
- OICM+ Jobs Module GitHub Repository - Explore complete code examples and configurations.
- PyTorch Documentation - Learn more about PyTorch's capabilities and APIs.
- Ray Documentation - Discover the full range of features offered by Ray for distributed computing.
These resources will help you dive deeper into the frameworks and tools used within the OICM+ Platform.