Tracking API Client
The tracking API enables data scientists and engineers to track experiments, package code, and share models, making it easier to reproduce and collaborate on ML projects.
Installation
Install it via pip:
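The package name below is an assumption for illustration; substitute the actual name of the client package if it differs:
pip install oip-tracking-client  # assumed package name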
Library Import
Add the following import statement to your Python code to gain access to the TrackingClient class from the client library:
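The module path below is an assumption for illustration and may differ in your installation:
from oip_tracking_client.tracking import TrackingClient  # assumed module path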
Initialization
To interact with the tracking server, you need to initialize the client as follows:
api_host = "<API_HOST>"
api_key = "<API_KEY>"
workspace_name = "<TARGET_WORKSPACE_NAME>"
TrackingClient.connect(api_host, api_key, workspace_name)
- api_host (str, required): Replace <API_HOST> with the hostname of the API server.
- api_key (str, required): Replace <API_KEY> with your API key.
- workspace_name (str, required): Replace <TARGET_WORKSPACE_NAME> with the name of the workspace you want to connect to. This workspace should already exist on the platform's user interface (UI).
Experiment Setting
After initializing the client, specify the experiment you wish to track:
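experiment_name = "<TARGET_EXPERIMENT_NAME>"
TrackingClient.set_experiment(experiment_name)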
- experiment_name (str, required): Replace <TARGET_EXPERIMENT_NAME> with the name of the experiment you want to track. If the experiment does not exist, it will be created. See also: Experiment UI.
Compatibility with MLflow for tracking
The Open Innovation TrackingClient is built around the MLflow client and provides extra features. All MLflow functions are accessible through the TrackingClient. For a comprehensive understanding of MLflow, its various features, and its usage, refer to the official MLflow documentation. For instance:
TrackingClient.autolog() is equivalent to mlflow.autolog().
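Likewise, other standard MLflow calls can be made directly on the client. A minimal sketch, assuming the usual MLflow function signatures carry over unchanged:
with TrackingClient.start_run():
    # Equivalent to mlflow.log_param and mlflow.log_metric
    TrackingClient.log_param("learning_rate", 0.01)
    TrackingClient.log_metric("rmse", 0.73)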
Tracking Experiments
Once you've set up your experiment, you can take advantage of the Open Innovation TrackingClient's Auto-tracking feature, which automatically logs parameters, metrics, and models during training without the need for explicit log statements. The code below demonstrates the initial setup with Auto-tracking enabled.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
# Module path assumed for illustration; adjust to your installation
from oip_tracking_client.tracking import TrackingClient

api_host = "<API_HOST>"
api_key = "<API_KEY>"
workspace_name = "<TARGET_WORKSPACE_NAME>"
experiment_name = "<EXPERIMENT_NAME>"

TrackingClient.connect(api_host, api_key, workspace_name)
TrackingClient.set_experiment(experiment_name)

with TrackingClient.start_run():
    TrackingClient.set_run_name("YOUR_RUN_NAME")
    # Enable automatic logging of parameters, metrics, and models
    TrackingClient.autolog()
    db = load_diabetes()
    X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)
    # Create and train the model
    rf = RandomForestRegressor(n_estimators=10, max_depth=6, max_features=3)
    rf.fit(X_train, y_train)
    # Use the model to make predictions on the test dataset
    predictions = rf.predict(X_test)
Manual Model Logging
While the TrackingClient.autolog() function provides automatic logging for many libraries, there may be occasions where manual logging is either necessary or preferred.
Manual logging of a model can be performed as follows:
1. Utilize the TrackingClient.<name_of_the_library>.log_model method.
2. For method details, consult the MLflow documentation. Our method is based on mlflow.<name_of_the_library>.log_model, ensuring that TrackingClient.<name_of_the_library>.log_model behaves in the same manner.
The following are the arguments of log_model:
- 1st argument: The model.
- 2nd argument: The artifact path, which should be set to "model".
- signature: A named argument that represents the model's input and output schemas.
To infer the model's signature using samples of its input and output data, use:
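# Infer the signature from samples of the model's inputs and outputs
signature = TrackingClient.infer_signature(x_train, y_train)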
Supported Libraries
Below is a list of the log_model methods available for various libraries:
TrackingClient.catboost.log_model
TrackingClient.diviner.log_model
TrackingClient.fastai.log_model
TrackingClient.gluon.log_model
TrackingClient.h2o.log_model
TrackingClient.johnsnowlabs.log_model
TrackingClient.langchain.log_model
TrackingClient.lightgbm.log_model
TrackingClient.mleap.log_model
TrackingClient.onnx.log_model
TrackingClient.openai.log_model
TrackingClient.paddle.log_model
TrackingClient.pmdarima.log_model
TrackingClient.prophet.log_model
TrackingClient.pyfunc.log_model
TrackingClient.pytorch.log_model
TrackingClient.sentence_transformers.log_model
TrackingClient.sklearn.log_model
TrackingClient.spacy.log_model
TrackingClient.spark.log_model
TrackingClient.statsmodels.log_model
TrackingClient.tensorflow.log_model
TrackingClient.transformers.log_model
TrackingClient.xgboost.log_model
For manually logging a scikit-learn model:
signature = TrackingClient.infer_signature(x_train, y_train)
TrackingClient.sklearn.log_model(model, "model", signature=signature)
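The same pattern applies to the other supported flavors. For example, a minimal sketch for PyTorch, where the model and sample tensors are purely illustrative:
import torch
import torch.nn as nn

# Illustrative model and sample batch
model = nn.Linear(10, 1)
sample_input = torch.randn(5, 10)
sample_output = model(sample_input)

# Infer the signature from numpy samples of the inputs and outputs
signature = TrackingClient.infer_signature(
    sample_input.numpy(), sample_output.detach().numpy()
)
TrackingClient.pytorch.log_model(model, "model", signature=signature)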
Note: Automated logging offers convenience, but manual logging provides a level of control that might be necessary in specific scenarios.
Model Artifacts
After training the model, valuable information related to the experiment is generated, including model artifacts, metrics, parameters, and tags.
The model artifacts include the serialized model file (model.pkl for sklearn, model.h5 for Keras, etc.) and its associated metadata (MLmodel). These artifacts are crucial for reproducing and deploying the trained model.
The metrics and parameters represent the recorded information during the training process, helping you evaluate the model's performance and understand the chosen hyperparameters.
By understanding the folder structure and accessing these artifacts, you can effectively manage and analyze your machine learning experiments.
- Run ID: A unique identifier for the run, facilitating experiment tracking and comparison.
- Artifacts: This folder holds all the artifacts generated during the run, such as the model itself.
- Artifacts/model: Subfolder containing the model artifacts.
- Artifacts/model/MLmodel: Metadata file containing model details, hyperparameters, and other information.
- Artifacts/model/model.pkl: The serialized model file.
- Metrics: Logs and metrics recorded during the run, allowing you to assess model performance.
- Params: Records the hyperparameters and parameters used during the run.
- Tags: Additional metadata tags associated with the run.
Below is an example of an Artifacts/model/MLmodel file:
artifact_path: model
flavors:
  python_function:
    env:
      conda: conda.yaml
      virtualenv: python_env.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    predict_fn: predict
    python_version: 3.9.7
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.2.2
mlflow_version: 2.2.1
model_uuid: cd3b0852791d48a2a67ce739b0b07070
run_id: fb1aeba6036345cca027d44d747d1aeb
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1, 10]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1]}}]'
utc_time_created: '2023-07-20 09:16:16.898078'
By using the TrackingClient API, you can easily manage and access model artifacts, metrics, and parameters for future analysis and reproduction of your machine learning experiments.
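For example, a logged model can be pulled down locally for inspection. A minimal sketch, assuming the MLflow artifact helpers are exposed through the client in the same way as mlflow.artifacts.download_artifacts:
# Assumption: mirrors mlflow.artifacts.download_artifacts
local_path = TrackingClient.artifacts.download_artifacts(
    run_id="<RUN_ID>", artifact_path="model"
)
print(local_path)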
Logging Artifacts
In addition to all the features offered by MLflow, the TrackingClient API also enables tracking of specific artifact types (images, audio, video, figures, text, JSON, etc.) for analyzing machine learning training tasks. This allows for sophisticated and advanced analytics and visualizations through our platform UI.
Image Tracking
The log_image_at_step method accepts an image as a numpy.ndarray or a PIL.Image.Image:
from PIL import Image
# Load or create your image as numpy.ndarray or PIL.Image
image_data = Image.open("test_image.jpg")
# Log image at specific step
extra = {"label": "car"}
TrackingClient.log_image_at_step(image_data, 'image_file.jpg', 1, extra)
In this example, we log the image image_data at step 1 and attach a label to it through the extra metadata.
Please note that for images, you can directly pass the path to the image file.
Audio Tracking
The log_audio_at_step method accepts audio data as a numpy.ndarray:
import numpy as np
# Create or load your audio data as a numpy array
audio_data = np.random.random(1000)
# Log audio at specific step
TrackingClient.log_audio_at_step(audio_data, 'audio_file.wav', 1, rate=44100)
The audio_data should be a numpy array. If the audio is stereo, the array should be 2-dimensional.
Text Tracking
The log_text_at_step method accepts text as a str:
# Log text at specific step
text_data = "This is a sample text."
TrackingClient.log_text_at_step(text_data, 'text_file.txt', 1)
Figure Tracking
The log_figure_at_step method accepts a figure as a matplotlib.figure.Figure or a plotly.graph_objects.Figure:
import matplotlib.pyplot as plt
# Create a figure
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 2, 3])
# Log figure at specific step
TrackingClient.log_figure_at_step(fig, 'figure_file.jpg', 1)
The fig should be a matplotlib.figure.Figure or plotly.graph_objects.Figure object.
JSON Tracking
The log_dict_at_step method accepts a dictionary or list to be logged as JSON:
# Log dictionary at specific step
dict_data = {"key1": "value1", "key2": "value2"}
TrackingClient.log_dict_at_step(dict_data, 'dict_file.json', 1)
The dictionary or list will be saved as a JSON file.
Extra Parameters
Note: All log_*_at_step methods accept an optional extra parameter (of type dict) which can be used to log additional metadata about the artifact, and a file_name (of type str) that specifies the name of the artifact file. For log_audio_at_step, there is also a rate parameter (of type int) to specify the sample rate of the audio data.
The extra parameter should be a dictionary with string keys. The values can be of types int, float, str, bool, list, or None.
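Putting these together, a sketch that logs audio with both rate and extra metadata; the assumption here is that extra is accepted as a keyword argument alongside rate:
import numpy as np

audio_data = np.random.random(1000)
# Extra metadata about the artifact (assumed keyword argument)
extra = {"speaker": "narrator", "language": "en"}
TrackingClient.log_audio_at_step(audio_data, 'narration.wav', 1, rate=44100, extra=extra)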