Dataset API Client

This section details the OIP Dataset Client library, covering installation, usage, and primary methods for interacting with datasets on the OICM+ platform.

The Dataset Client provides a straightforward setup process for configuring and initializing the client to interact with the OICM+ platform. Setup involves defining environment variables that hold your OICM+ credentials and, optionally, storage provider credentials for services such as AWS, Azure, GCP, or MinIO. Once setup is complete, the Dataset Client is ready to use.


1. Getting Started

1.1 Prerequisites

  • Python installed on your machine.
  • OICM+ account for obtaining an API Key.

1.2 Installation

Install via pip:

pip install oip-dataset-client

1.3 Initialization

You can initialize the DatasetClient using either the Python SDK or the Command-Line Interface (CLI).

Python SDK

from oip_dataset_client.dataset import DatasetClient

api_host = "http://192.168.1.35:8000"
api_key = "72e3f81c-8c75-4f88-9358-d36a3a50ef36"
workspace_name = "default_workspace"

DatasetClient.connect(api_host=api_host, api_key=api_key, workspace_name=workspace_name)
  • Parameters
    • api_host (str, required) – Hostname of the Dataset API server.
    • api_key (str, required) – User’s API authentication token.
    • workspace_name (str, required) – Workspace to store datasets.
    • verbose (bool, optional) – If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.

You can obtain the Dataset API server address and your API key through the OICM+ Web UI.

CLI (Command-Line Interface)

If you prefer a more interactive setup, our CLI makes it easy to initialize the dataset client. After installing the client, simply run the initialization command, and follow the prompts to configure your OIP credentials.

  • Run the following command to start the setup process:
oip-dataset-client-init
  • Follow interactive prompts to configure OIP credentials.

2. DatasetClient Methods

The DatasetClient class is the central component of the oip_dataset_client. It is the core interface for managing datasets throughout their lifecycle: creating them, adding files, uploading, downloading, and finalizing. The following sections cover its essential methods.
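Putting the lifecycle together end to end (paths and names below are placeholders; the calls are the methods documented in this section):

```python
from oip_dataset_client.dataset import DatasetClient

# Connect once per session; credentials come from the OICM+ Web UI.
DatasetClient.connect(
    api_host="http://192.168.1.35:8000",
    api_key="<your-api-key>",
    workspace_name="default_workspace",
)

# Create a dataset, stage local files, then upload and finalize.
my_dataset = DatasetClient.create(name="my_dataset")
my_dataset.add_files(path="/absolute/path/to/files")
my_dataset.upload()
my_dataset.finalize()
```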

2.1 create

Initialize a new dataset with optional parent datasets and versioning.

my_dataset = DatasetClient.create(
    name="my_dataset",
    parent_datasets=["datasetA_id", "datasetB_id"],
    version="2.0.1",
    is_main=True,
    tags=["testing", "CSV", "NASA"],
    description="A CSV testing dataset from NASA"
)
  • Parameters

    • name (str) – Required dataset name.
    • parent_datasets (list[str], optional) – IDs of parent datasets.
    • version (str, optional) – Semantic version (defaults to 1.0.0 for first version).
    • is_main (bool, optional) – Mark this dataset as the main version.
    • tags (list[str], optional) – Keywords for classification (e.g., domain, topic).
    • description (str, optional) – Short description.
    • verbose (bool, optional) – If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.
  • Returns

    • Dataset: Newly created Dataset object.
  • Raises

    • ValueError: If the name is empty or None.
    • ValueError: If any of the parent_datasets is not completed.
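As noted above, when no version is given the first version defaults to 1.0.0 and later versions are bumped from the highest existing semantic version. A minimal sketch of that rule (an illustration only, not the client's internal code; incrementing the patch component is an assumption here):

```python
def next_version(existing_versions):
    """Return the next semantic version for a dataset.

    With no existing versions, default to "1.0.0"; otherwise bump the
    patch component of the highest existing version.
    """
    if not existing_versions:
        return "1.0.0"
    # Compare versions numerically as (major, minor, patch) tuples.
    major, minor, patch = max(
        tuple(int(p) for p in v.split(".")) for v in existing_versions
    )
    return f"{major}.{minor}.{patch + 1}"
```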

2.2 add_files

Include one or more files in a dataset. The client checks for duplicates and skips files that are already present and unchanged.

my_dataset.add_files(path="/absolute/path/to/files")
  • Parameters

    • path (str) – Local path to files.
    • wildcard (str or list[str], optional) – Filter files with glob patterns.
    • recursive (bool, optional) – If True, match recursively.
    • max_workers (int, optional) – Number of threads for parallel adds.
    • verbose (bool, optional) – If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.
  • Returns

    • int: The number of files that have been added.
    • int: The number of files that have been modified.
  • Raises

    • Exception: If the dataset is in a final state (completed, aborted, or failed).
    • ValueError: If the specified path does not exist.
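The wildcard and recursive parameters follow standard glob semantics. The sketch below (stdlib only, not the client's implementation) shows roughly how such a filter selects files under a path:

```python
from pathlib import Path

def match_files(root, wildcard="*", recursive=False):
    """List files under `root` whose names match a glob pattern.

    With recursive=True the pattern is applied in all subfolders,
    mirroring the `recursive` flag of add_files.
    """
    root = Path(root)
    pattern = f"**/{wildcard}" if recursive else wildcard
    return sorted(str(p) for p in root.glob(pattern) if p.is_file())
```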

2.3 remove_files

Remove one or more files from a dataset (useful when removing files inherited from parent datasets).

my_dataset.remove_files(wildcard_path="relative/path/*.csv")
  • Parameters

    • wildcard_path (str) – The pattern matching files to remove (e.g., folder/file*).
    • recursive (bool, optional) – If True, match subfolders recursively.
    • verbose (bool, optional) – If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.
  • Returns

    • int: The number of files that have been removed.
  • Raises

    • Exception: If the dataset is in a final state (completed, aborted, or failed).

2.4 upload

Uploads newly added files, skipping duplicates and files inherited from parent datasets, so that only missing or changed files are transferred.

my_dataset.upload()
  • Parameters

    • verbose (bool, optional) – If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.
  • Returns

    • none: Does not return any results.
  • Raises

    • Exception: If the dataset is in a final state (completed, aborted, or failed).
    • Exception: If the upload fails.

2.5 finalize

Marks a dataset as finalized once all files are uploaded, preventing further changes.

my_dataset.finalize(auto_upload=True)
  • Parameters

    • auto_upload (bool, optional) – If True, will attempt an upload first.
    • verbose (bool, optional) – If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.
  • Returns

    • none: Does not return any results.
  • Raises

    • Exception: If there is a pending upload.
    • Exception: If the dataset's status is not valid for finalization.

2.6 get

Retrieve an existing dataset by name, ID, or version. By default, returns the dataset with the highest version if multiple are found.

my_dataset = DatasetClient.get(dataset_name="my_dataset")
  • Parameters

    • dataset_id (str, optional) – ID of the dataset to retrieve.
    • dataset_name (str, optional) – Name of the dataset to retrieve.
    • dataset_version (str, optional) – Specific version to retrieve.
    • only_completed (bool, optional) – If True, only consider completed datasets.
    • auto_create (bool, optional) – If True, create the dataset when no match is found.
  • Returns

    • Dataset: Returns a Dataset object.
  • Raises

    • ValueError: If the selection criteria are invalid (e.g., neither a dataset id nor a name is provided).
    • ValueError: If no dataset matches the provided selection criteria.
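For instance, to pin an exact version rather than taking the highest one (parameters as documented above; the names are placeholders):

```python
my_dataset = DatasetClient.get(
    dataset_name="my_dataset",
    dataset_version="1.0.0",
    only_completed=True,
)
```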

2.7 get_local_copy (Internal Dataset)

Download a local copy of a finalized dataset, including files inherited from parent datasets, into a local directory for further inspection.

dataset = DatasetClient.get(dataset_name="my_dataset")
dataset.get_local_copy()
  • Parameters

    • verbose (bool, optional) – If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.
  • Returns

    • none: Does not return any results.
  • Raises

    • Exception: If the dataset is not in a completed state.
    • Exception: If a compressed file cannot be extracted.
    • Exception: If copying a file from the source folder to the target folder fails.

2.8 add_aws_storage_conf

Configure AWS S3 access for storing or retrieving dataset files.

from oip_dataset_client.Storage.StorageManager import StorageManager

StorageManager.add_aws_storage_conf(access_key="...", secret_key="...", region="...")
  • Parameters

    • access_key (str, required): AWS access key ID that identifies the account accessing AWS services and resources.
    • secret_key (str, required): AWS secret access key, used together with the access key for secure access to AWS services.
    • region (str, required): Geographical AWS region where resources are provisioned or operated.
  • Returns

    • none: Does not return any results.

2.9 migrate

Transfer a dataset from an external source (e.g., AWS S3) to OIP storage, creating a new dataset that contains all files from that source.

DatasetClient.migrate(
  storage_conf_name="aws_config",
  download_uri="s3://bucket/path/",
  name="migrated_dataset"
)
  • Parameters

    • storage_conf_name (str, required): Name of the storage configuration that identifies which provider credentials to use.
    • download_uri (str, required): Location within the storage provider where the dataset files are hosted.
    • name (str, required): Name of the new dataset.
    • parent_datasets (list[str], optional): IDs of parent datasets whose files extend the new dataset.
    • version (str, optional): Version of the new dataset. If omitted, the first version defaults to 1.0.0; subsequent versions automatically increment the highest existing semantic version.
    • is_main (bool, optional): True if the new dataset is the main version.
    • tags (list[str], optional): Keywords that categorize the dataset by subject matter, domain, or topic for easier identification.
    • description (str, optional): A brief description of the new dataset.
    • verbose (bool, optional): If True, enables console output for progress information. Defaults to the class-level DatasetClient.verbose flag (True by default); set DatasetClient.verbose = False to disable it globally.
  • Returns

    • none: Does not return any results.
  • Raises

    • ValueError: If the name is empty or None.
    • ValueError: If any of the parent_datasets is not completed.

2.10 get_local_copy (External Files)

Download external files (e.g., from a public URL) locally, then add them to your dataset.

from oip_dataset_client.Storage.StorageManager import StorageManager

cifar_path = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
local_path = StorageManager.get_local_copy(remote_url=cifar_path)

my_dataset = DatasetClient.create(name="my_dataset")
my_dataset.add_files(path=local_path)
my_dataset.upload()
my_dataset.finalize()
  • Parameters

    • remote_url (str, required): URL of the remote file to download.
    • target_folder (str, optional): Local directory where the file will be downloaded.
    • extract_archive (bool, optional): If True and the file is compressed, extract it. Defaults to True.
  • Returns

    • str: path to the downloaded file.
  • Raises

    • Exception: If we encounter a failure while attempting to download the requested file.
    • Exception: If we are unable to unzip a compressed file.
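When extract_archive is enabled, the downloaded file is unpacked if it is a recognised archive. A rough stdlib sketch of that behaviour (for illustration only; the StorageManager's actual logic may differ):

```python
import tarfile
import zipfile
from pathlib import Path

def extract_if_archive(path, target_folder=None):
    """Extract `path` if it is a tar or zip archive.

    Returns the extraction directory, or the original path when the
    file is not a recognised archive.
    """
    path = Path(path)
    target = Path(target_folder) if target_folder else path.parent / path.stem
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tar:
            tar.extractall(target)
        return str(target)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(target)
        return str(target)
    return str(path)
```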

Next Steps