Job Tracking

Use oip-tracking-client in job scripts when you want job executions to create Tracking experiments and runs in OICM.

Use TrackingClient to log parameters, metrics, and tags from the job.

Setup

Import TrackingClient:

from oip_tracking_client.v2.tracking import TrackingClient

Before the job starts tracking, it needs:

  • api_host, ending with /api/tracking
  • api_key, generated by OICM or provided by the job runtime
  • workspace_id, identifying the workspace where the experiment should appear
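A job can fail fast when one of these values is missing or malformed. The sketch below shows one way to check them before the client is constructed. `validate_tracking_settings` is a hypothetical helper for illustration and is not part of oip-tracking-client; the only rule it takes from this page is that api_host ends with /api/tracking.

```python
def validate_tracking_settings(api_host: str, api_key: str, workspace_id: str) -> None:
    """Raise ValueError if a required tracking setting is empty or malformed.

    Hypothetical helper: not part of oip-tracking-client.
    """
    settings = {
        "api_host": api_host,
        "api_key": api_key,
        "workspace_id": workspace_id,
    }
    # Collect every empty setting so the error names all of them at once.
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing tracking settings: {', '.join(missing)}")
    # The tracking endpoint is expected to end with /api/tracking.
    if not api_host.endswith("/api/tracking"):
        raise ValueError("api_host must end with /api/tracking")
```

Calling the helper right after reading the values surfaces a configuration problem at job start instead of at the first logging call.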

Example:

import os

from oip_tracking_client.v2.tracking import TrackingClient

api_host = os.environ["OIP_TRACKING_SERVER"]
api_key = os.environ["OIP_WORKLOAD_ACCESS_KEY"]
workspace_id = os.environ["OIP_WORKSPACE_ID"]

tc = TrackingClient(
    api_host=api_host,
    api_key=api_key,
)

tc.set_experiment(
    experiment_name="job-training-runs",
    workspace_id=workspace_id,
)

with tc.start_run(run_name="job-run-1", tags=["job"]):
    tc.log_param("entrypoint", "train.py")
    tc.log_metric("accuracy", 0.91)

If the job runtime exposes the workspace ID under a different environment variable name, read that variable instead and pass its value to set_experiment(...).
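When the variable name differs between runtimes, a small lookup helper can try several names in order. This is a minimal sketch: `first_env` is a hypothetical helper, and any name other than OIP_WORKSPACE_ID passed to it is an assumption about your runtime.

```python
import os


def first_env(*names: str) -> str:
    """Return the value of the first listed environment variable that is set and non-empty.

    Hypothetical helper for illustration; raises KeyError if none is set.
    """
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    raise KeyError(f"None of these environment variables is set: {', '.join(names)}")
```

For example, `workspace_id = first_env("OIP_WORKSPACE_ID", "MY_RUNTIME_WORKSPACE_ID")` reads the standard variable first and falls back to `MY_RUNTIME_WORKSPACE_ID`, which is a made-up name used only to show the pattern.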

Multiple Runs from a Job

Create multiple runs under the same experiment when a job trains or evaluates multiple candidates.

for model_name, score in {
    "ridge": 0.82,
    "random_forest": 0.89,
}.items():
    with tc.start_run(run_name=model_name, tags=["job", model_name]):
        tc.log_param("model", model_name)
        tc.log_metric("score", score)

The experiment will show one run per model, and the runs can be compared in the Tracking UI.
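If the job also needs to know which candidate scored highest, the loop can be wrapped in a function. The sketch below is illustrative only: `log_candidates` is a hypothetical helper, `tc` is assumed to be a configured TrackingClient with an experiment already set, and the function uses only the start_run, log_param, and log_metric calls shown on this page.

```python
def log_candidates(tc, scores: dict[str, float]) -> str:
    """Log one run per candidate and return the name of the highest-scoring one.

    Hypothetical helper: `tc` is assumed to be a configured TrackingClient.
    `scores` must not be empty, otherwise max() raises ValueError.
    """
    for model_name, score in scores.items():
        # One run per candidate, tagged with the candidate's name.
        with tc.start_run(run_name=model_name, tags=["job", model_name]):
            tc.log_param("model", model_name)
            tc.log_metric("score", score)
    # Pick the candidate with the largest score.
    return max(scores, key=scores.get)
```

Calling `log_candidates(tc, {"ridge": 0.82, "random_forest": 0.89})` creates the same two runs as the loop above and returns `"random_forest"`.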

Multi-GPU Jobs

When using torchrun or another multi-process launcher, each process can attempt to create a run. Restrict tracking to the primary process to avoid duplicate runs.

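# torchrun exports the global rank as RANK; GLOBAL_RANK is a fallback for runtimes that set it.
if os.getenv("RANK", os.getenv("GLOBAL_RANK", "0")) == "0":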
    with tc.start_run(run_name="primary-process-run", tags=["multi-gpu"]):
        tc.log_metric("loss", 0.42)
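The rank check can be factored into a helper so that every tracking call in the script uses the same rule. This is a sketch under assumptions: `is_primary_process` is a hypothetical name, torchrun is known to export the global rank as RANK, and GLOBAL_RANK is consulted only as a fallback for launchers or runtimes that set it.

```python
import os


def is_primary_process() -> bool:
    """Return True for the global rank 0 process, or when no rank variable is set.

    Hypothetical helper. RANK is the variable torchrun exports; GLOBAL_RANK is
    checked as a fallback and is an assumption about other launchers.
    """
    rank = os.environ.get("RANK") or os.environ.get("GLOBAL_RANK") or "0"
    return int(rank) == 0
```

A script would then guard its tracking code with `if is_primary_process():`. A single-process job sets no rank variable and is treated as primary.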

Next Steps