
Data Volumes

Data Volumes in OICM provide a convenient way to manage data such as models, datasets, and other files required for AI workloads.

OICM can be connected to external sources, for example Hugging Face or an S3-compatible Object Storage. Data from these sources is fetched and stored inside a Persistent Volume. In addition, Data Volumes can be used to store data generated within the platform such as trained or fine-tuned models. All stored data in Data Volumes contributes to the File Storage Quota.

Once created, Data Volumes make the stored data available to be attached to workloads within the same workspace.

Datavolumes Landing

How it works

When you create a Data Volume, you define:

  • Name: A unique identifier for the volume
  • Data Type: Choose Model, Dataset, or Other
  • File Storage Quota: The maximum size the volume can reach (in GiB)

    note: Make sure your workspace has enough File Storage quota for your allocations

  • Tags (optional): Add searchable tags to organize and filter volumes later

Datavolumes Creation

After creating the volume, you can choose to import data from an external source. Supported sources are:

  • Hugging Face (HF): Provide a saved secret blueprint containing your HF username and access token, then enter the repo name.

Datavolume Import HF

  • OBS (S3): Provide the secret OBS blueprint, endpoint URL, bucket name, and directory.

Once configured, OICM automatically syncs the data into the Data Volume when you click Import.

Datavolume Import OBS

Workloads and Data Volumes

Data Volumes can be attached to workloads to read or write data. Supported workload types are:

  • Jobs
  • Fine-tuning (coming soon)
  • Model deployment

Jobs

Job workloads can read from or write to Data Volumes. To connect, configure your config.yaml with the following options based on your needs:

Read Model

# Read-only access mounted at "/data-volumes/input_model"
input_model_data_volume:
  name: <identifier of the data volume>

Read Dataset

# Read-only access mounted at "/data-volumes/input_dataset"
input_dataset_data_volume:
  name: <identifier of the data volume>

Write Data

# Read-write access mounted at "/data-volumes/output_checkpoints"
output_checkpoints_data_volume:
  name: <identifier of the data volume>

Write Model

# Read-write access mounted at "/data-volumes/output_model"
output_model_data_volume:
  name: <identifier of the data volume>
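
For example, a job script running with the volumes above could read from a read-only mount and write results to a read-write mount. The sketch below is only an illustration that relies on the documented mount paths; the file and field names are hypothetical.

import json
import os

# Mount paths documented above; adjust to the volumes your job actually attaches.
INPUT_DATASET = "/data-volumes/input_dataset"             # read-only
OUTPUT_CHECKPOINTS = "/data-volumes/output_checkpoints"   # read-write

# List whatever files the attached dataset volume contains.
dataset_files = sorted(os.listdir(INPUT_DATASET))
print(f"Dataset files: {dataset_files}")

# Write a (hypothetical) checkpoint artifact back to the output volume.
checkpoint = {"step": 100, "files_seen": len(dataset_files)}
with open(os.path.join(OUTPUT_CHECKPOINTS, "checkpoint-100.json"), "w") as f:
    json.dump(checkpoint, f)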

Testing Read-Only Access

You can verify that your job can read from Data Volumes with this script:

import os

def print_tree(folder_path, indent=""):
    """Recursively print the contents of folder_path as a tree."""
    try:
        items = os.listdir(folder_path)
    except FileNotFoundError:
        print(f"{folder_path} does not exist.")
        return
    except PermissionError:
        print(f"{folder_path}: Permission Denied")
        return

    print(f"Folder {folder_path} is mounted")
    for index, item in enumerate(sorted(items)):
        full_path = os.path.join(folder_path, item)
        connector = "├── " if index < len(items) - 1 else "└── "
        print(indent + connector + item)
        if os.path.isdir(full_path):
            new_indent = indent + ("│   " if index < len(items) - 1 else "    ")
            print_tree(full_path, new_indent)

if __name__ == "__main__":
    paths = ["/data-volumes/input_model", "/data-volumes/input_dataset"]

    for path in paths:
        if os.path.exists(path):
            print(f"\nDirectory structure for {path}:")
            print(path)
            print_tree(path)
        else:
            print(f"\n{path} does not exist.")

Testing Read-Write Access

Use the following script to check write, read, and delete permissions:

import os

def check_dir_permissions(folder_path):
    test_file = os.path.join(folder_path, "test_permission_file.txt")

    if not os.path.exists(folder_path):
        print(f"{folder_path} does not exist.")
        return

    try:
        # Test write
        with open(test_file, "w") as f:
            f.write("permission test")
        print(f"Write OK in {folder_path}")

        # Test read
        with open(test_file, "r") as f:
            content = f.read()
        if content == "permission test":
            print(f"Read OK in {folder_path}")
        else:
            print(f"Read FAILED in {folder_path}")

        # Test delete
        os.remove(test_file)
        print(f"Delete OK in {folder_path}")

    except PermissionError:
        print(f"Permission Denied in {folder_path}")
    except Exception as e:
        print(f"Error in {folder_path}: {e}")

if __name__ == "__main__":
    paths = ["/data-volumes/output_model", "/data-volumes/output_checkpoints"]

    for path in paths:
        print(f"\nChecking permissions in: {path}")
        check_dir_permissions(path)

Fine-Tuning (Coming Soon)

You can import a model from Hugging Face or from an external OBS into a Data Volume for later fine-tuning. Once the data is imported into a Data Volume with data type Model, it appears under Model Source and can be selected when creating a Fine-Tuning Task.

Datavolumes Fine-Tuning

Model Deployment

You can deploy a model directly from a Data Volume. This is the fastest way to deploy, since the model data is already stored on the platform and does not need to be fetched from an external source for each deployment. A single Data Volume can be used for multiple model deployments.

note: When deploying from a Data Volume, use the deployment name as the model name when calling the OpenAI-compatible inference endpoint.

Datavolumes Model Deployment
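
For example, assuming a deployment created from a Data Volume and named my-deployment, a request to the OpenAI-compatible endpoint might look like the sketch below. The base URL and API key are placeholders for your environment; only the use of the deployment name as the model name is taken from the note above.

import requests

BASE_URL = "https://<your-inference-endpoint>/v1"  # placeholder for your endpoint
API_KEY = "<your-api-key>"                         # placeholder for your credentials
DEPLOYMENT_NAME = "my-deployment"                  # hypothetical deployment name

# The deployment name is passed as the "model" field, as noted above.
response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": DEPLOYMENT_NAME,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])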

Verifying Data Volume Checksum

This guide explains how to verify the checksum calculated by our platform for your data volume. The checksum ensures data integrity by creating a unique fingerprint of your entire directory structure and file contents.

How the Checksum Works

Our platform uses the BLAKE3 hashing algorithm to create a hierarchical checksum that includes:

  • File contents (hashed using BLAKE3)
  • Directory structure
  • File and directory names
  • Symlink targets

The algorithm processes entries in alphabetical order to ensure consistent results across different systems.

Data Structure

For each directory, we create a manifest with entries in this format:

  • Files: f:<filename>:<blake3_hash_of_content>
  • Directories: d:<dirname>:<blake3_hash_of_subdirectory>
  • Symlinks: l:<linkname>:<target_path>

These entries are sorted alphabetically, joined with null bytes (\0), and then hashed with BLAKE3 to produce the final checksum.

Verification Method

Installation

pip install oip-checksum-validator

Usage

Generate a checksum:

oic /path/to/directory

# OR invoke oip-checksum-validator via uvx, no permanent install needed
uvx --from oip-checksum-validator oic /path/to/directory

Verify against a reference checksum:

oic /path/to/directory -c <expected_checksum>

# OR invoke oip-checksum-validator via uvx, no permanent install needed
uvx --from oip-checksum-validator oic /path/to/directory -c <expected_checksum>

Example

For a directory structure:

my-data/
├── file1.txt
├── subdir/
│   └── file2.txt
└── link -> file1.txt

The algorithm creates:

  • f:file1.txt:<blake3_hash_of_file1_content>
  • d:subdir:<blake3_hash_of_subdir_manifest>
  • l:link:file1.txt

These are sorted, joined with null bytes, and hashed to produce the final checksum.
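
If you want to reproduce the calculation yourself, the following Python sketch implements the manifest scheme described above using the third-party blake3 package (pip install blake3). It is an illustration of the documented format, not the platform's reference implementation, so edge cases (hidden files, special files, read chunk sizes) may be handled differently.

import os

from blake3 import blake3  # third-party package: pip install blake3

def hash_file(path):
    """BLAKE3 hash of a file's contents, read in chunks."""
    hasher = blake3()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def hash_directory(path):
    """Build the f:/d:/l: manifest, sort it, join with null bytes, and hash it."""
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.islink(full):
            entries.append(f"l:{name}:{os.readlink(full)}")
        elif os.path.isdir(full):
            entries.append(f"d:{name}:{hash_directory(full)}")
        else:
            entries.append(f"f:{name}:{hash_file(full)}")
    manifest = "\0".join(sorted(entries)).encode()
    return blake3(manifest).hexdigest()

if __name__ == "__main__":
    print(hash_directory("my-data"))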

Verification

Compare your calculated checksum with the one provided by our platform. If they match, your data integrity is confirmed.