
Data Volumes

Data Volumes in OICM provide a convenient way to manage data such as models, datasets, and other files required for AI workloads.

Data Volumes support two storage types:

  • File Storage (FS): High-performance storage for fast, low-latency access. Supports data imports from Hugging Face and external object storage.
  • Object Storage (OBS): Cost-efficient storage for large datasets and training. Supports UI uploads and imports from Hugging Face or external storage.

OICM can be connected to external sources, for example Hugging Face or an S3-compatible Object Storage. Data from these sources is fetched and stored in a Data Volume. In addition, Data Volumes can be used to store data generated within the platform such as trained or fine-tuned models. All stored data in Data Volumes contributes to the respective storage quota (FS or OBS).

Once created, Data Volumes make the stored data available to be attached to workloads within the same workspace.

Datavolumes Landing

How it works

When you create a Data Volume, you define:

  • Storage Type: Choose between File Storage (FS) or Object Storage (OBS)
  • Name: A unique identifier for the volume
  • Data Type: Choose Model, Dataset, or Other
  • Storage Quota: The maximum size the volume can reach (in GiB)

    note: Make sure your workspace has enough quota for your allocations (FS or OBS, depending on the selected storage type)

  • Tags (optional): Add searchable tags to organize and filter volumes later

Datavolumes Creation

After creating the volume, you can choose to import data from an external source. Supported sources are:

  • Hugging Face (HF): Provide a saved secret blueprint containing your HF username and access token, then enter the repo name.
  • OBS (S3): Provide a saved OBS secret blueprint, the endpoint URL, bucket name, and directory.

For both import sources, you can specify a Source Directory to import only a specific folder instead of the entire repository or bucket. You can also set a Destination Directory to control where the imported data is placed within the Data Volume. OICM will replace all contents inside the destination directory with the contents from the source directory. If the destination is set to root (/), the entire Data Volume will be updated.

Datavolume Import HF

Once configured, OICM automatically syncs the data into the Data Volume when you click Import.

Deployments with OBS

For Deployments that use a Data Volume, model weights stored in OBS need to be in the root of the bucket.

Exporting Data

Data Volumes support exporting data to an external OBS (S3-compatible storage). Provide the secret blueprint, endpoint, region, and bucket name. You can also specify a Source Directory to export only a specific folder from the Data Volume, and a Destination Directory to control where the data is placed in the target bucket.

Datavolume Export OBS

File Upload (OBS only)

For Data Volumes with Object Storage (OBS) type, you can also upload files directly through the UI. Click or drag files into the upload area, and optionally set a relative path for each file. A maximum of 1000 files can be uploaded at a time.

Datavolume Upload OBS
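Because the UI accepts at most 1000 files per upload, a larger file set can be split into batches and uploaded in several rounds. A minimal helper for planning such batches (the limit comes from the text above; the function name is our own):

```python
MAX_FILES_PER_UPLOAD = 1000  # UI limit per upload round

def batch_files(paths, batch_size=MAX_FILES_PER_UPLOAD):
    """Split a list of file paths into batches no larger than batch_size."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
```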

Workloads and Data Volumes

Data Volumes can be attached to workloads for reading or writing data. Supported workload types are:

  • Jobs
  • Fine-tuning (coming soon)
  • Model deployment

Jobs

Job workloads can read from or write to Data Volumes. To connect, configure your config.yaml with the following options based on your needs:

Read Model

# Read-only access mounted at "/data-volumes/input_model"
input_model_data_volume:
  name: <identifier of the data volume>

Read Dataset

# Read-only access mounted at "/data-volumes/input_dataset"
input_dataset_data_volume:
  name: <identifier of the data volume>

Write Data

# Read-Write Access mounted at "/data-volumes/output_checkpoints"
output_checkpoints_data_volume:
  name: <identifier of the data volume>

Write Model

# Read-Write Access mounted at "/data-volumes/output_model"
output_model_data_volume:
  name: <identifier of the data volume>
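A single job can combine several of these options. For example, a job that reads a base model and a dataset and writes checkpoints and a resulting model might use a config.yaml like this (the volume names are placeholders for your own identifiers):

```yaml
input_model_data_volume:
  name: base-model-volume          # read-only at /data-volumes/input_model
input_dataset_data_volume:
  name: training-dataset-volume    # read-only at /data-volumes/input_dataset
output_checkpoints_data_volume:
  name: checkpoints-volume         # read-write at /data-volumes/output_checkpoints
output_model_data_volume:
  name: tuned-model-volume         # read-write at /data-volumes/output_model
```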

Testing Read-Only Access

You can verify that your job can read from Data Volumes with this script:

import os

def print_tree(folder_path, indent=""):
    try:
        items = os.listdir(folder_path)
    except FileNotFoundError:
        print(f"{folder_path} does not exist.")
        return
    except PermissionError:
        print(f"{folder_path}: Permission Denied")
        return

    print(f"Folder {folder_path} is mounted")
    for index, item in enumerate(sorted(items)):
        full_path = os.path.join(folder_path, item)
        connector = "├── " if index < len(items) - 1 else "└── "
        print(indent + connector + item)
        if os.path.isdir(full_path):
            new_indent = indent + ("│   " if index < len(items) - 1 else "    ")
            print_tree(full_path, new_indent)

if __name__ == "__main__":
    paths = ["/data-volumes/input_model", "/data-volumes/input_dataset"]

    for path in paths:
        if os.path.exists(path):
            print(f"\nDirectory structure for {path}:")
            print(path)
            print_tree(path)
        else:
            print(f"\n{path} does not exist.")

Testing Read-Write Access

Use the following script to check write, read, and delete permissions:

import os

def check_dir_permissions(folder_path):
    test_file = os.path.join(folder_path, "test_permission_file.txt")

    if not os.path.exists(folder_path):
        print(f"{folder_path} does not exist.")
        return

    try:
        # Test write
        with open(test_file, "w") as f:
            f.write("permission test")
        print(f"Write OK in {folder_path}")

        # Test read
        with open(test_file, "r") as f:
            content = f.read()
        if content == "permission test":
            print(f"Read OK in {folder_path}")
        else:
            print(f"Read FAILED in {folder_path}")

        # Test delete
        os.remove(test_file)
        print(f"Delete OK in {folder_path}")

    except PermissionError:
        print(f"Permission Denied in {folder_path}")
    except Exception as e:
        print(f"Error in {folder_path}: {e}")

if __name__ == "__main__":
    paths = ["/data-volumes/output_model", "/data-volumes/output_checkpoints"]

    for path in paths:
        print(f"\nChecking permissions in: {path}")
        check_dir_permissions(path)

Fine-Tuning (Coming Soon)

You can import a model from Hugging Face or from an external OBS into a Data Volume for later fine-tuning. Once imported into a Data Volume with data type Model, it will appear under Model Source and can be selected when creating a Fine-Tuning Task.

Datavolumes Fine-tuning

Model Deployment

You can deploy a model directly from a Data Volume. This is the fastest way to deploy a model, since the data is already on the platform and does not need to be fetched from an external source on every deployment. A single Data Volume can back multiple model deployments.

note: When deploying from a Data Volume, use the deployment name as the model name in the OpenAI-compatible inference endpoint.
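For example, if your deployment is named my-model-deployment (a placeholder), a request body for the OpenAI-compatible chat completions endpoint would reference that name in the model field (the endpoint URL is likewise a placeholder for your deployment's URL):

```python
import json

DEPLOYMENT_NAME = "my-model-deployment"  # placeholder: your deployment's name

# Body for POST <your-inference-endpoint>/v1/chat/completions
payload = {
    "model": DEPLOYMENT_NAME,  # the deployment name, not the original model name
    "messages": [{"role": "user", "content": "Hello!"}],
}
print(json.dumps(payload, indent=2))
```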

Datavolumes Model Deployment

Verifying Data Volume Checksum

This guide explains how to verify the checksum calculated by our platform for your data volume. The checksum ensures data integrity by creating a unique fingerprint of your entire directory structure and file contents.

How the Checksum Works

Our platform uses the BLAKE3 hashing algorithm to create a hierarchical checksum that includes:

  • File contents (hashed using BLAKE3)
  • Directory structure
  • File and directory names
  • Symlink targets

The algorithm processes entries in alphabetical order to ensure consistent results across different systems.

Data Structure

For each directory, we create a manifest with entries in this format:

  • Files: f:<filename>:<blake3_hash_of_content>
  • Directories: d:<dirname>:<blake3_hash_of_subdirectory>
  • Symlinks: l:<linkname>:<target_path>

These entries are sorted alphabetically, joined with null bytes (\0), and then hashed with BLAKE3 to produce the final checksum.

Verification Method

Installation

pip install oip-checksum-validator

Usage

Generate a checksum:

oic /path/to/directory

# OR invoke oip-checksum-validator via uvx, no permanent install needed
uvx --from oip-checksum-validator oic /path/to/directory

Verify against a reference checksum:

oic /path/to/directory -c <expected_checksum>

# OR invoke oip-checksum-validator via uvx, no permanent install needed
uvx --from oip-checksum-validator oic /path/to/directory -c <expected_checksum>

Example

For a directory structure:

my-data/
├── file1.txt
├── subdir/
│   └── file2.txt
└── link -> file1.txt

The algorithm creates:

  • f:file1.txt:<blake3_hash_of_file1_content>
  • d:subdir:<blake3_hash_of_subdir_manifest>
  • l:link:file1.txt

These are sorted, joined with null bytes, and hashed to produce the final checksum.

Verification

Compare your calculated checksum with the one provided by our platform. If they match, your data integrity is confirmed.