Dataset API Client
This section details the OIP Dataset Client library, covering installation, usage, and primary methods for interacting with datasets on the OICM+ platform.
The Dataset Client provides a convenient setup process for configuring and initializing the client to interact with the OICM+ platform. Setup involves defining environment variables that include your OICM+ credentials and, optionally, storage provider credentials for services such as AWS, Azure, GCP, or MinIO. Once setup is complete, the Dataset Client is ready to use.
1. Getting Started
1.1 Prerequisites
- Python installed on your machine.
- OICM+ account for obtaining an API Key.
1.2 Installation
Install via pip:
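The package name below is inferred from the library's import path (oip_dataset_client) and may differ from the actual published name:
pip install oip-dataset-client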
1.3 Initialization
You can initialize the DatasetClient using either the Python SDK or the Command-Line Interface (CLI).
Python SDK
from oip_dataset_client.dataset import DatasetClient

# Replace these placeholder values with your own OICM+ credentials
api_host = "http://192.168.1.35:8000"
api_key = "72e3f81c-8c75-4f88-9358-d36a3a50ef36"
workspace_name = "default_workspace"

DatasetClient.connect(api_host=api_host, api_key=api_key, workspace_name=workspace_name)
- Parameters
  - api_host (str, required): Hostname of the Dataset API server.
  - api_key (str, required): Your API authentication token.
  - workspace_name (str, required): Workspace in which datasets are stored.
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
You can obtain both the Dataset API server address and your API key through the OICM+ web UI.
CLI (Command-Line Interface)
If you prefer a more interactive setup, our CLI makes it easy to initialize the dataset client. After installing the client, simply run the initialization command, and follow the prompts to configure your OIP credentials.
- Run the initialization command to start the setup process (see the sketch after this list).
- Follow the interactive prompts to configure your OIP credentials.
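The exact entry-point name is not given in this document; a hypothetical invocation might look like:
oip-dataset-client init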
2. DatasetClient Methods
The DatasetClient class is the central component of the oip_dataset_client library. It is the core interface for interacting with and managing datasets throughout their lifecycle: creating them, adding files, uploading, downloading, and finalizing. The following sections cover the essential methods of this class.
2.1 create
Initialize a new dataset with optional parent datasets and versioning.
my_dataset = DatasetClient.create(
name="my_dataset",
parent_datasets=["datasetA_id", "datasetB_id"],
version="2.0.1",
is_main=True,
tags=["testing", "CSV", "NASA"],
description="A CSV testing dataset from NASA"
)
- Parameters
  - name (str, required): Dataset name.
  - parent_datasets (list[str], optional): IDs of parent datasets.
  - version (str, optional): Semantic version (defaults to 1.0.0 for the first version).
  - is_main (bool, optional): Mark this dataset as the main version.
  - tags (list[str], optional): Keywords for classification (e.g., domain, topic).
  - description (str, optional): Short description.
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
- Returns
  - Dataset: The newly created Dataset object.
- Raises
  - ValueError: If the name is empty or None.
  - ValueError: If any of the parent_datasets is not completed.
2.2 add_files
Include one or more files in a dataset. The system checks for duplicates and ignores files that are already present or unchanged.
- Parameters
  - path (str, required): Local path to the files.
  - wildcard (str or list[str], optional): Filter files with glob patterns.
  - recursive (bool, optional): If True, match files recursively.
  - max_workers (int, optional): Number of threads for adding files in parallel.
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
- Returns
  - int: The number of files that were added.
  - int: The number of files that were modified.
- Raises
  - Exception: If the dataset is in a final state (completed, aborted, or failed).
  - ValueError: If the specified path to the files does not exist.
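A usage sketch, continuing with the dataset created above (the local path is illustrative, and unpacking the two documented return values as a tuple is an assumption):
# Add all CSV files under a local "data" folder, including subfolders
added, modified = my_dataset.add_files(path="data/", wildcard="*.csv", recursive=True)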
2.3 remove_files
Remove one or more files from a dataset (useful when removing files inherited from parent datasets).
- Parameters
  - wildcard_path (str, required): Pattern matching the files to remove (e.g., folder/file*).
  - recursive (bool, optional): If True, match subfolders recursively.
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
- Returns
  - int: The number of files that were removed.
- Raises
  - Exception: If the dataset is in a final state (completed, aborted, or failed).
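A brief sketch using the documented pattern syntax (the pattern itself is illustrative):
# Remove files matching the pattern, searching subfolders as well
removed = my_dataset.remove_files(wildcard_path="folder/file*", recursive=True)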
2.4 upload
Upload newly added files, skipping duplicates and files inherited from parent datasets; only missing or changed files are transferred.
- Parameters
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
- Returns
  - none: Does not return any results.
- Raises
  - Exception: If the dataset is in a final state (completed, aborted, or failed).
  - Exception: If the upload fails.
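For example, continuing with the dataset created above:
# Transfer only the files that are missing or changed on the server
my_dataset.upload()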
2.5 finalize
Marks a dataset as finalized once all files are uploaded, preventing further changes.
- Parameters
  - auto_upload (bool, optional): If True, attempt an upload before finalizing.
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
- Returns
  - none: Does not return any results.
- Raises
  - Exception: If there is a pending upload.
  - Exception: If the dataset's status is not valid for finalization.
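For example, to upload any pending files and finalize in a single call:
# auto_upload=True attempts an upload first, then marks the dataset as finalized
my_dataset.finalize(auto_upload=True)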
2.6 get
Retrieve an existing dataset by name, ID, or version. By default, returns the dataset with the highest version if multiple are found.
- Parameters
  - dataset_id (str, optional)
  - dataset_name (str, optional)
  - dataset_version (str, optional)
  - only_completed (bool, optional)
  - auto_create (bool, optional)
- Returns
  - Dataset: The matching Dataset object.
- Raises
  - ValueError: If the selection criteria are not met (an ID or name was not provided correctly).
  - ValueError: If no dataset matching the provided selection criteria could be found.
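A usage sketch with the documented parameters (the dataset name is illustrative):
# Fetch the completed dataset with the given name; if several versions match,
# the highest version is returned
my_dataset = DatasetClient.get(dataset_name="my_dataset", only_completed=True)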
2.7 get_local_copy (Internal Dataset)
Download a local copy of a finalized dataset, including files inherited from parent datasets, into a local directory for further inspection.
- Parameters
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
- Returns
  - none: Does not return any results.
- Raises
  - Exception: If the dataset is in a final state (completed, aborted, or failed).
  - Exception: If a compressed file cannot be unzipped.
  - Exception: If copying a file from a source folder to a target folder fails.
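A minimal sketch, assuming get_local_copy is called without arguments on a Dataset object retrieved via get:
my_dataset = DatasetClient.get(dataset_name="my_dataset", only_completed=True)
# Download the dataset's files, including inherited parent files, to a local directory
my_dataset.get_local_copy()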
2.8 add_aws_storage_conf
Configure AWS S3 access for storing or retrieving dataset files.
from oip_dataset_client.Storage.StorageManager import StorageManager

# Register AWS S3 credentials with the client (placeholder values shown)
StorageManager.add_aws_storage_conf(access_key="...", secret_key="...", region="...")
- Parameters
  - access_key (str, required): Unique identifier that grants access to AWS services and resources.
  - secret_key (str, required): Confidential key used together with the access key for secure access to AWS services.
  - region (str, required): Geographical location where the AWS resources are provisioned or operated.
- Returns
  - none: Does not return any results.
2.9 migrate
Transfer a dataset from an external source (e.g., AWS S3) to OIP storage, creating a new dataset in the process.
- Creates a new dataset with all files from the external source.
DatasetClient.migrate(
storage_conf_name="aws_config",
download_uri="s3://bucket/path/",
name="migrated_dataset"
)
- Parameters
  - storage_conf_name (str, required): Name of the storage configuration, used as an identifier.
  - download_uri (str, required): Location or path within the storage provider where the dataset files are hosted.
  - name (str, required): Name of the new dataset.
  - parent_datasets (list[str], optional): IDs of parent datasets whose files extend the new dataset.
  - version (str, optional): Version of the new dataset. If omitted, it defaults to 1.0.0 for the dataset's first version; subsequent versions automatically increment the highest available semantic version.
  - is_main (bool, optional): True if the new dataset is the main version.
  - tags (list[str], optional): Keywords describing the dataset's subject matter, domain, or specific topics for easier identification and organization.
  - description (str, optional): A brief description of the new dataset.
  - verbose (bool, optional): If True, enables console output for progress information. Defaults to DatasetClient.verbose (enabled by default); disable it globally with DatasetClient.verbose = False.
- Returns
  - none: Does not return any results.
- Raises
  - ValueError: If the name is empty or None.
  - ValueError: If any of the parent_datasets is not completed.
2.10 get_local_copy (External Files)
Download external files (e.g., from a public URL) locally, then add them to your dataset.
from oip_dataset_client.Storage.StorageManager import StorageManager

# Download an external archive locally, then build a dataset from it
cifar_path = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
local_path = StorageManager.get_local_copy(remote_url=cifar_path)

my_dataset = DatasetClient.create(name="my_dataset")
my_dataset.add_files(path=local_path)
my_dataset.upload()
my_dataset.finalize()
- Parameters
  - remote_url (str, required): URL of the dataset file.
  - target_folder (str, optional): Local directory where the file is downloaded.
  - extract_archive (bool, optional): If True and the file is compressed, extract it. Defaults to True.
- Returns
  - str: Path to the downloaded file.
- Raises
  - Exception: If downloading the requested file fails.
  - Exception: If a compressed file cannot be unzipped.
Next Steps
- Datasets & Dataframes UI – Manage datasets visually.
- Dataset Overview – Learn about dataset inheritance and versioning.
- Dataframes – Explore single CSV file handling in the “Dataframes” module.