OICM Troubleshooting Guide
This guide covers common troubleshooting scenarios for end users across OICM modules. Each issue is outlined with:
- Problem
- Symptoms
- Possible Causes
- Resolution Steps
1. Installation & Access
1.1 OICM Is Not Accessible or Login Issues
Symptoms:
Cannot access the web interface, or login redirects back to the login page.
Possible Causes:
- Network or DNS misconfiguration
- Misconfigured SSO or un-provisioned user
Resolution Steps:
- Verify the URL and network access (see the sketch after this list).
- Review SSO settings and ensure your user has access.
- Confirm tenant user is active and has an appropriate role in OICM.
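If it is unclear whether the problem is on the network/DNS side or the SSO side, a quick client-side check can narrow it down. The sketch below is a minimal example assuming Python with the requests package installed; the URL is a placeholder for your tenant's actual OICM address.

```python
# Minimal reachability check for the OICM web interface.
# "https://oicm.example.com" is a placeholder; use your tenant's actual URL.
import socket
from urllib.parse import urlparse

import requests  # third-party; pip install requests

OICM_URL = "https://oicm.example.com"
host = urlparse(OICM_URL).hostname

# 1. DNS: can the hostname be resolved at all?
try:
    print("Resolved to:", socket.gethostbyname(host))
except socket.gaierror as exc:
    print("DNS resolution failed:", exc)

# 2. HTTP: does the platform answer? A 200 or a redirect to the SSO login
#    page means connectivity is fine and the issue is more likely SSO
#    configuration or user provisioning.
try:
    resp = requests.get(OICM_URL, timeout=10, allow_redirects=False)
    print("HTTP status:", resp.status_code)
except requests.RequestException as exc:
    print("HTTP request failed:", exc)
```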
2. Authentication & Access
2.1 Cannot Access Workspace
Symptoms:
Workspace is not visible to a particular user, or the API returns the error “workspace not found.”
Possible Causes:
- User not added to the workspace
- Typo in workspace name
- Workspace no longer exists
Resolution Steps:
- Ask a Tenant Admin to add the user to the workspace with an appropriate role.
- Double-check the workspace name and ID.
- Confirm the workspace still exists.
3. Model Management
3.1 Model Registration Fails or Freezes
Symptoms:
Model registration fails or freezes without completing.
Possible Causes:
- Wrong external credentials or model ID
Resolution Steps:
- Verify the bucket path or Hugging Face model ID.
- Verify that the access tokens configured in Blueprints are correct, then retry.
- Verify that your HF or S3 token has the required access to the model (see the sketch below).
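As a quick sanity check outside OICM, the sketch below verifies that a Hugging Face token can see the model and that an S3 object path is reachable. The model ID, token, bucket, and key are placeholders, and the huggingface_hub and boto3 packages are assumed to be installed.

```python
# Hedged sketch: confirm the Hugging Face model ID/token and the S3 path
# independently of OICM before retrying registration.
from huggingface_hub import HfApi  # pip install huggingface_hub
import boto3                       # pip install boto3

HF_MODEL_ID = "org/model-name"     # placeholder model ID
HF_TOKEN = "hf_..."                # placeholder HF access token

try:
    info = HfApi().model_info(HF_MODEL_ID, token=HF_TOKEN)
    print("HF model reachable:", info.id)
except Exception as exc:
    print("HF check failed (wrong ID, token, or gated model):", exc)

# For S3-hosted models, confirm the bucket path and credentials the same way.
s3 = boto3.client("s3")            # uses your configured AWS credentials
try:
    s3.head_object(Bucket="my-models-bucket", Key="path/to/model/config.json")
    print("S3 object reachable")
except Exception as exc:
    print("S3 check failed (wrong path or insufficient permissions):", exc)
```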
4. Jobs
4.1 Job Fails or Stuck in Pending
Symptoms:
Job status stays “Pending”.
Possible Causes:
- No GPU/CPU available
- Misconfigured resource requests
Resolution Steps:
- Check the Events and Logs tabs for errors.
- Check resource usage in “Resource Monitoring”.
- Adjust job config to fit available resources.
5. Deployment
5.1 Deployment Fails or Doesn’t Start
Symptoms:
Deployment stuck or crashes.
Possible Causes:
- Resource misconfiguration or health-check failures
- Invalid HF or Model access token
Resolution Steps:
- On the deployment page, open the Events tab and check for error events.
- If there are no error events, or the events are unclear, review the logs in the Logs tab.
- Verify the access tokens (see the sketch below).
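If the failing deployment pulls a Hugging Face model, one quick way to rule out a bad token is to check it directly with huggingface_hub, as sketched below; the token value is a placeholder.

```python
# Hedged sketch: confirm that the HF access token used by the deployment
# is still valid before redeploying.
from huggingface_hub import HfApi  # pip install huggingface_hub

HF_TOKEN = "hf_..."  # placeholder: the token configured for the deployment

try:
    identity = HfApi().whoami(token=HF_TOKEN)
    print("Token is valid for:", identity.get("name"))
except Exception as exc:
    print("Token check failed; regenerate the token and update the deployment:", exc)
```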
5.2 Model Not Responding to Inference Calls
Symptoms:
Predictions time out or return unexpected results.
Possible Causes:
- Wrong input format or API path
Resolution Steps:
- Check the model’s expected call format (a hedged example follows this list).
- Use the Inference UI playground to verify behavior.
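To rule out client-side formatting issues, it can also help to call the endpoint directly and inspect the raw response. The sketch below assumes a requests-based call; the endpoint path, authorization header, and payload schema are placeholders and must be taken from the model’s documented call format.

```python
# Hedged sketch: call the inference endpoint directly and inspect the raw
# response. Endpoint, API key, and payload shape are placeholders.
import requests

ENDPOINT = "https://oicm.example.com/inference/my-deployment/v1/predict"  # placeholder
API_KEY = "..."                              # your user API key

payload = {"inputs": "Hello, world"}         # adjust to the model's expected schema
headers = {"Authorization": f"Bearer {API_KEY}"}

resp = requests.post(ENDPOINT, json=payload, headers=headers, timeout=60)
print("Status:", resp.status_code)
print(resp.text[:500])                       # inspect the raw response body
```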
5.3 Model Deployment is Not Scheduled
Symptoms:
Deployment remains in Pending status.
Possible Causes:
- No GPU/CPU available
- Resources are not allocated to the target workspace
- Misconfigured resource requests, particularly RAM
Resolution Steps:
- Check the Events and Logs tabs for errors.
- Check resource usage in “Resource Monitoring”.
- Adjust the workload resources to fit available compute in the workspace.
6. API Usage
6.1 Inference Requests Slow or Timeout
Symptoms:
Long latency or timeout on API call.
Possible Causes:
- The operation is long-running, the platform is under heavier load, or the default timeout is too low
Resolution Steps:
- Increase the timeout using the OICM-Request-Timeout header (see the sketch below).
- For inference calls, enable autoscaling or optimize the model.
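A hedged example of setting the header from Python is shown below. The endpoint is a placeholder, and the header value is assumed to be in seconds, so confirm the expected format for your installation.

```python
# Hedged sketch: raise the server-side timeout with the OICM-Request-Timeout
# header and give the client a matching, slightly larger timeout.
import requests

ENDPOINT = "https://oicm.example.com/inference/my-deployment/v1/predict"  # placeholder
headers = {
    "Authorization": "Bearer ...",   # your user API key
    "OICM-Request-Timeout": "300",   # assumed to be seconds; confirm the unit
}

resp = requests.post(ENDPOINT, json={"inputs": "..."}, headers=headers, timeout=310)
print("Status:", resp.status_code)
```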
7. Tracking
7.1 Tracking Library Misconfiguration
Symptoms:
The oip-tracking-client library does not respond or loads indefinitely.
Possible Causes:
- Library version is not compatible with the OICM+ product version
Resolution Steps:
- Find the product version in the User panel, and ensure that the corresponding version of oip-tracking-client is installed in the IDE (see the sketch below).
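The sketch below shows one way to check the installed client version from Python so it can be compared against the OICM+ product version shown in the User panel; the distribution name is assumed to be oip-tracking-client.

```python
# Hedged sketch: report the installed tracking client version.
# The distribution name "oip-tracking-client" is an assumption.
from importlib.metadata import version, PackageNotFoundError

try:
    print("Installed oip-tracking-client version:", version("oip-tracking-client"))
except PackageNotFoundError:
    print("oip-tracking-client is not installed in this environment")

# If the versions do not match, reinstall the matching release, e.g.:
#   pip install "oip-tracking-client==<product-compatible version>"
```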
7.2 Tracking Client Cannot Access the Experiment
Symptoms:
oip-tracking-client throws a permission error in the terminal.
Possible Causes:
- The user API key is invalid, or the user has no access to the workspace that the tracking call targets
Resolution Steps:
- Generate a new user API key from the User panel, and confirm that the user has access to the target workspace where the experiment is stored.
8. Benchmarks
8.1 Benchmark Run is Out of Memory
Symptoms:
Knowledge benchmark run logs show an out-of-memory (OOM) error.
Possible Causes:
- The configured batch size is too large, or no batch size is configured
Resolution Steps:
- Rerun the benchmark with a batch size between 16 and 64, depending on the GPU memory available to the tenant.