Tracking API Client
The tracking API enables data scientists and engineers to track experiments, package code, and share models, making it easier to reproduce and collaborate on ML projects.
Installation
Install it via pip:
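The package name below is an assumption for illustration; substitute the actual name of the client package if it differs:
pip install oip-tracking-client  # assumed package name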
Library Import
Add the following import statement to your Python code to gain access to the TrackingClient class from the client library:
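The module path below is an assumption for illustration and may differ in your installation:
from oip_tracking_client.tracking import TrackingClient  # assumed module path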
Initialization
To interact with the tracking server, you need to initialize the client as follows:
api_host = "<API_HOST>"
api_key = "<API_KEY>"
workspace_name = "<TARGET_WORKSPACE_NAME>"
TrackingClient.connect(api_host, api_key, workspace_name)
- api_host (str, required): Replace <API_HOST> with the hostname of the API server.
- api_key (str, required): Replace <API_KEY> with your API key.
- workspace_name (str, required): Replace <TARGET_WORKSPACE_NAME> with the name of the workspace you want to connect to. This workspace should already exist on the platform's user interface (UI).
Experiment Setting
After initializing the client, specify the experiment you wish to track:
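experiment_name = "<TARGET_EXPERIMENT_NAME>"
TrackingClient.set_experiment(experiment_name)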
- experiment_name (str, required): Replace <TARGET_EXPERIMENT_NAME> with the name of the experiment you want to track. If the experiment does not exist, it will be created. See also: Experiment UI.
Compatibility with MLflow for tracking
The Open Innovation TrackingClient is built around the MLflow client and provides extra features. All MLflow functions are accessible through the TrackingClient. For a comprehensive understanding of MLflow, its various features, and its usage, refer to the official MLflow documentation. For instance:
TrackingClient.autolog() is equivalent to mlflow.autolog().
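Likewise, other standard MLflow calls can be made directly on the client. A minimal sketch, assuming the usual MLflow function signatures carry over unchanged:
with TrackingClient.start_run():
    # Equivalent to mlflow.log_param and mlflow.log_metric
    TrackingClient.log_param("learning_rate", 0.01)
    TrackingClient.log_metric("rmse", 0.73)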
Tracking Experiments
Once you've set up your experiment, you can take advantage of the Open Innovation TrackingClient's Auto-tracking feature, which automatically logs parameters, metrics, and models during training without the need for explicit log statements. The code below demonstrates the initial setup with Auto-tracking enabled.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
# Module path assumed for illustration; adjust to your installation
from oip_tracking_client.tracking import TrackingClient

api_host = "<API_HOST>"
api_key = "<API_KEY>"
workspace_name = "<TARGET_WORKSPACE_NAME>"
experiment_name = "<EXPERIMENT_NAME>"

TrackingClient.connect(api_host, api_key, workspace_name)
TrackingClient.set_experiment(experiment_name)

with TrackingClient.start_run():
    TrackingClient.set_run_name("YOUR_RUN_NAME")
    # Enable automatic logging of parameters, metrics, and models
    TrackingClient.autolog()
    db = load_diabetes()
    X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)
    # Create and train the model
    rf = RandomForestRegressor(n_estimators=10, max_depth=6, max_features=3)
    rf.fit(X_train, y_train)
    # Use the model to make predictions on the test dataset
    predictions = rf.predict(X_test)
Manual Model Logging
While the TrackingClient.autolog() function provides automatic logging for many libraries, there may be occasions where manual logging is either necessary or preferred.
Manual logging of a model can be performed as follows:
1. Utilize the TrackingClient.<name_of_the_library>.log_model method.
2. For method details, consult the MLflow documentation. Our method is based on mlflow.<name_of_the_library>.log_model, ensuring that TrackingClient.<name_of_the_library>.log_model behaves in the same manner.
The following are the arguments of log_model:
- 1st argument: The model.
- 2nd argument: The artifact path, which should be set to "model".
- signature: A named argument that represents the model's input and output schemas.
To infer the model's signature using samples of its input and output data, use:
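# Infer the signature from samples of the model's inputs and outputs
signature = TrackingClient.infer_signature(x_train, y_train)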
Supported Libraries
Below is a list of the log_model methods available for various libraries:
TrackingClient.catboost.log_model
TrackingClient.diviner.log_model
TrackingClient.fastai.log_model
TrackingClient.gluon.log_model
TrackingClient.h2o.log_model
TrackingClient.johnsnowlabs.log_model
TrackingClient.langchain.log_model
TrackingClient.lightgbm.log_model
TrackingClient.mleap.log_model
TrackingClient.onnx.log_model
TrackingClient.openai.log_model
TrackingClient.paddle.log_model
TrackingClient.pmdarima.log_model
TrackingClient.prophet.log_model
TrackingClient.pyfunc.log_model
TrackingClient.pytorch.log_model
TrackingClient.sentence_transformers.log_model
TrackingClient.sklearn.log_model
TrackingClient.spacy.log_model
TrackingClient.spark.log_model
TrackingClient.statsmodels.log_model
TrackingClient.tensorflow.log_model
TrackingClient.transformers.log_model
TrackingClient.xgboost.log_model
For manually logging a scikit-learn model:
signature = TrackingClient.infer_signature(x_train, y_train)
TrackingClient.sklearn.log_model(model, "model", signature=signature)
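The same pattern applies to the other supported flavors. For example, a minimal sketch for PyTorch, where the model and sample tensors are purely illustrative:
import torch
import torch.nn as nn

# Illustrative model and sample batch
model = nn.Linear(10, 1)
sample_input = torch.randn(5, 10)
sample_output = model(sample_input)

# Infer the signature from numpy samples of the inputs and outputs
signature = TrackingClient.infer_signature(
    sample_input.numpy(), sample_output.detach().numpy()
)
TrackingClient.pytorch.log_model(model, "model", signature=signature)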
Note: Automated logging offers convenience, but manual logging provides a level of control that might be necessary in specific scenarios.
Model Artifacts
After training the model, valuable information related to the experiment is generated, including model artifacts, metrics, parameters, and tags.
The model artifacts include the serialized model file (model.pkl for sklearn, model.h5 for Keras, etc.) and its associated metadata (MLmodel). These artifacts are crucial for reproducing and deploying the trained model.
The metrics and parameters represent the recorded information during the training process, helping you evaluate the model's performance and understand the chosen hyperparameters.
By understanding the folder structure and accessing these artifacts, you can effectively manage and analyze your machine learning experiments.
- Run ID: A unique identifier for the run, facilitating experiment tracking and comparison.
- Artifacts: This folder holds all the artifacts generated during the run, such as the model itself.
- Artifacts/model: Subfolder containing the model artifacts.
- Artifacts/model/MLmodel: Metadata file containing model details, hyperparameters, and other information.
- Artifacts/model/model.pkl: The serialized model file.
- Metrics: Logs and metrics recorded during the run, allowing you to assess model performance.
- Params: Records the hyperparameters and parameters used during the run.
- Tags: Additional metadata tags associated with the run.
Below is an example of an Artifacts/model/MLmodel file:
artifact_path: model
flavors:
  python_function:
    env:
      conda: conda.yaml
      virtualenv: python_env.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    predict_fn: predict
    python_version: 3.9.7
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.2.2
mlflow_version: 2.2.1
model_uuid: cd3b0852791d48a2a67ce739b0b07070
run_id: fb1aeba6036345cca027d44d747d1aeb
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1, 10]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1]}}]'
utc_time_created: '2023-07-20 09:16:16.898078'
By using the TrackingClient API, you can easily manage and access model artifacts, metrics, and parameters for future analysis and reproduction of your machine learning experiments.
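For example, a logged model can be pulled down locally for inspection. A minimal sketch, assuming the MLflow artifact helpers are exposed through the client in the same way as mlflow.artifacts.download_artifacts:
# Assumption: mirrors mlflow.artifacts.download_artifacts
local_path = TrackingClient.artifacts.download_artifacts(
    run_id="<RUN_ID>", artifact_path="model"
)
print(local_path)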
Logging Artifacts
In addition to all the features offered by MLflow, the TrackingClient API also enables tracking of specific artifact types (images, audio, video, figures, text, JSON, etc.) for analyzing machine learning training tasks. This allows for sophisticated and advanced analytics and visualizations through our platform UI.
Image Tracking
The log_image_at_step method accepts an image as a numpy.ndarray or a PIL.Image.Image:
from PIL import Image
# Load or create your image as numpy.ndarray or PIL.Image
image_data = Image.open("test_image.jpg")
# Log image at specific step
extra = {"label": "car"}
TrackingClient.log_image_at_step(image_data, 'image_file.jpg', 1, extra)
In this example, we log the image image_data at step 1 and attach a label to it through the extra metadata.
Please note that for images, you can directly pass the path to the image file.
Audio Tracking
The log_audio_at_step method accepts audio data as a numpy.ndarray:
import numpy as np
# Create or load your audio data as a numpy array
audio_data = np.random.random(1000)
# Log audio at specific step
TrackingClient.log_audio_at_step(audio_data, 'audio_file.wav', 1, rate=44100)
The audio_data should be a numpy array. If the audio is stereo, the array should be 2-dimensional.
Text Tracking
The log_text_at_step method accepts text as a str:
# Log text at specific step
text_data = "This is a sample text."
TrackingClient.log_text_at_step(text_data, 'text_file.txt', 1)
Figure Tracking
The log_figure_at_step method accepts a figure as a matplotlib.figure.Figure or a plotly.graph_objects.Figure:
import matplotlib.pyplot as plt
# Create a figure
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 2, 3])
# Log figure at specific step
TrackingClient.log_figure_at_step(fig, 'figure_file.jpg', 1)
The fig should be a matplotlib.figure.Figure or plotly.graph_objects.Figure object.
JSON Tracking
The log_dict_at_step method accepts a dictionary or list to be logged as JSON:
# Log dictionary at specific step
dict_data = {"key1": "value1", "key2": "value2"}
TrackingClient.log_dict_at_step(dict_data, 'dict_file.json', 1)
The dictionary or list will be saved as a JSON file.
Extra Parameters
Note: All log_*_at_step methods accept an optional extra parameter (of type dict) which can be used to log additional metadata about the artifact, and a file_name (of type str) that specifies the name of the artifact file. For log_audio_at_step, there is also a rate parameter (of type int) to specify the sample rate of the audio data.
The extra parameter should be a dictionary with string keys. The values can be of types int, float, str, bool, list, or None.
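Putting these together, a sketch that logs audio with both rate and extra metadata; the assumption here is that extra is accepted as a keyword argument alongside rate:
import numpy as np

audio_data = np.random.random(1000)
# Extra metadata about the artifact (assumed keyword argument)
extra = {"speaker": "narrator", "language": "en"}
TrackingClient.log_audio_at_step(audio_data, 'narration.wav', 1, rate=44100, extra=extra)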