Job Tracking
Use oip-tracking-client in job scripts when you want job executions to create Tracking experiments and runs in OICM.
Use TrackingClient to log parameters, metrics, and tags from the job.
Setup
Import TrackingClient from the oip_tracking_client.v2.tracking module.
Before the job starts tracking, it needs:
- api_host, ending with /api/tracking
- api_key, generated by OICM or provided by the job runtime
- workspace_id, identifying the workspace where the experiment should appear
Example:
import os

from oip_tracking_client.v2.tracking import TrackingClient

# Connection settings are read from the job's environment.
api_host = os.environ["OIP_TRACKING_SERVER"]
api_key = os.environ["OIP_WORKLOAD_ACCESS_KEY"]
workspace_id = os.environ["OIP_WORKSPACE_ID"]

tc = TrackingClient(
    api_host=api_host,
    api_key=api_key,
)

# Select the experiment that the runs will be grouped under.
tc.set_experiment(
    experiment_name="job-training-runs",
    workspace_id=workspace_id,
)

# Everything logged inside the block is attached to this run.
with tc.start_run(run_name="job-run-1", tags=["job"]):
    tc.log_param("entrypoint", "train.py")
    tc.log_metric("accuracy", 0.91)
If the job runtime exposes the workspace ID under a different environment variable name, read that variable instead and pass its value to set_experiment(...).
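One way to handle differing variable names is to resolve the settings in a small helper before constructing the client. The sketch below is illustrative only: resolve_tracking_config is a hypothetical helper, not part of oip-tracking-client, and WORKSPACE_ID is an assumed fallback name standing in for whatever your runtime actually provides. It also checks that the host ends with /api/tracking, as required above.

```python
import os


def resolve_tracking_config(env=None, workspace_vars=("OIP_WORKSPACE_ID", "WORKSPACE_ID")):
    """Collect the three tracking settings from an environment mapping.

    Hypothetical helper. "WORKSPACE_ID" is an assumed fallback name; replace
    it with the variable your job runtime really sets.
    """
    env = os.environ if env is None else env

    api_host = env["OIP_TRACKING_SERVER"]
    if not api_host.endswith("/api/tracking"):
        raise ValueError(f"api_host must end with /api/tracking, got {api_host!r}")

    api_key = env["OIP_WORKLOAD_ACCESS_KEY"]

    # Use the first candidate variable that is set to a non-empty value.
    for name in workspace_vars:
        value = env.get(name)
        if value:
            return {"api_host": api_host, "api_key": api_key, "workspace_id": value}

    raise KeyError(f"none of {workspace_vars} is set; cannot determine workspace_id")
```

The returned api_host and api_key can then be passed to TrackingClient(...), and workspace_id to set_experiment(...), exactly as in the example above.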
Multiple Runs from a Job
Create multiple runs under the same experiment when a job trains or evaluates multiple candidates.
for model_name, score in {
    "ridge": 0.82,
    "random_forest": 0.89,
}.items():
    with tc.start_run(run_name=model_name, tags=["job", model_name]):
        tc.log_param("model", model_name)
        tc.log_metric("score", score)
The experiment will show one run per model, and the runs can be compared in the Tracking UI.
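If the job also needs to know which candidate scored best, the loop can be wrapped in a function. The following is a minimal sketch, not part of the client library: log_candidates is a hypothetical name, it assumes that a higher score is better, and it uses only the start_run, log_param, and log_metric calls shown above.

```python
def log_candidates(tc, scores, base_tags=("job",)):
    """Log one run per candidate and return the name with the highest score.

    Hypothetical helper. `tc` is an already configured TrackingClient and
    `scores` maps a model name to its score (assumed: higher is better).
    """
    if not scores:
        raise ValueError("scores must contain at least one candidate")

    for model_name, score in scores.items():
        # One run per candidate, tagged with the candidate's own name.
        with tc.start_run(run_name=model_name, tags=[*base_tags, model_name]):
            tc.log_param("model", model_name)
            tc.log_metric("score", score)

    return max(scores, key=scores.get)
```

Calling log_candidates(tc, {"ridge": 0.82, "random_forest": 0.89}) creates the same two runs as the loop above and returns the name of the better-scoring model.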
Multi-GPU Jobs
When using torchrun or another multi-process launcher, each process can attempt to create a run. Restrict tracking to the primary process to avoid duplicate runs.
if os.getenv("GLOBAL_RANK", "0") == "0":
    with tc.start_run(run_name="primary-process-run", tags=["multi-gpu"]):
        tc.log_metric("loss", 0.42)
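The name of the rank variable depends on the launcher. torchrun exports each process's global rank as RANK, while the check above reads GLOBAL_RANK, which is presumably set by the job runtime. A hedged sketch of a guard that accepts either name follows; is_primary_process is a hypothetical helper, and the order in which the variables are consulted is an assumption.

```python
import os


def is_primary_process(env=None, rank_vars=("GLOBAL_RANK", "RANK")):
    """Return True only for the process that should create tracking runs.

    Hypothetical helper. The first rank variable that is set decides the
    result; if none is set, the job is treated as single-process.
    """
    env = os.environ if env is None else env
    for name in rank_vars:
        value = env.get(name)
        if value is not None:
            return value.strip() == "0"
    return True
```

The guard then becomes `if is_primary_process():` in place of the direct os.getenv(...) comparison, and the body of the block stays unchanged.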
Next Steps
- Jobs Overview - Understand core job concepts.
- Jobs UI - Learn how to visualize and manage jobs.
- Tracking API Client - Review the TrackingClient API.
- Tracking Examples - Try simple and multi-run examples.