OICM Troubleshooting Guide

This guide covers common troubleshooting scenarios for end users across OICM modules. Each issue is outlined with:

  • Problem
  • Symptoms
  • Possible Causes
  • Resolution Steps

1. Installation & Access

1.1 OICM Is Not Accessible or Login Fails

Symptoms:
Cannot access the web interface, or login redirects back to the login page.

Possible Causes:

  • Network or DNS misconfiguration
  • Misconfigured SSO or an unprovisioned user

Resolution Steps:

  • Verify the URL and network access.
  • Review SSO settings and ensure your user has access.
  • Confirm the tenant user is active and has an appropriate role in OICM.
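The first step above can be sketched as a quick connectivity check. This is a minimal diagnostic, not part of OICM itself; `oicm.example.com` is a placeholder for your actual OICM hostname.

```python
# Quick DNS + HTTPS reachability check for an OICM endpoint.
# "oicm.example.com" is a placeholder hostname; substitute your own.
import socket
import urllib.request

host = "oicm.example.com"

# 1. DNS: confirm the hostname resolves at all.
try:
    ip = socket.gethostbyname(host)
    print(f"{host} resolves to {ip}")
except socket.gaierror as e:
    print(f"DNS lookup failed: {e}")

# 2. HTTPS: confirm the web interface answers.
# Any HTTP status (even an error page) means the server is reachable.
try:
    with urllib.request.urlopen(f"https://{host}", timeout=10) as resp:
        print(f"Reachable, HTTP status {resp.status}")
except Exception as e:
    print(f"Request failed: {e}")
```

If DNS fails, the problem is network configuration rather than OICM; if DNS succeeds but the request fails, check proxies, firewalls, and the SSO configuration.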

2. Authentication & Access

2.1 Cannot Access Workspace

Symptoms:
The workspace is not visible to a particular user, or the API returns a “workspace not found” error.

Possible Causes:

  • User not added to the workspace
  • Typo in workspace name
  • Workspace no longer exists

Resolution Steps:

  • Ask a Tenant Admin to add the user to the workspace with an appropriate role.
  • Double-check workspace name and ID.
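For the typo case, a near-match comparison against the workspaces you can actually see often finds the problem faster than eyeballing. The workspace names below are illustrative; in practice, copy the list from the OICM workspace overview.

```python
# Catch workspace-name typos by fuzzy-matching the failing name against
# the names visible in the UI. Workspace names here are examples only.
import difflib

visible_workspaces = ["ml-research", "fraud-detection", "nlp-prod"]
requested = "nlp_prod"  # the name that produced "workspace not found"

if requested in visible_workspaces:
    print(f"'{requested}' exists; check your role in it instead.")
else:
    close = difflib.get_close_matches(requested, visible_workspaces, n=1)
    if close:
        print(f"'{requested}' not found; did you mean '{close[0]}'?")
    else:
        print(f"'{requested}' not found and nothing similar is visible; "
              "ask a Tenant Admin to add you to the workspace.")
```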

3. Model Management

3.1 Model Registration Fails or Freezes

Symptoms:
Model registration fails or hangs before completing.

Possible Causes:

  • Wrong external credentials or model ID

Resolution Steps:

  • Verify the bucket path or HuggingFace model ID.
  • Verify that the access tokens in Blueprints are correct, then retry.
  • Verify that your HF or S3 token has the correct access to the model.
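For the HuggingFace case, you can verify the model ID and token access outside OICM using the public HF Hub HTTP API. This is a diagnostic sketch, not an OICM feature; the model ID and token below are placeholders.

```python
# Check that a HuggingFace model ID exists and that a token can see it,
# via the public HF Hub API (https://huggingface.co/api/models/<id>).
import json
import urllib.error
import urllib.request

def check_hf_model(model_id, token=None):
    """Return a short diagnosis string for a HuggingFace model ID."""
    req = urllib.request.Request(f"https://huggingface.co/api/models/{model_id}")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            return f"OK: found {json.load(resp)['id']}"
    except urllib.error.HTTPError as e:
        if e.code == 401:
            return "401: token missing/invalid, or the model is gated/private"
        if e.code in (403, 404):
            return f"{e.code}: wrong model ID, or the token lacks access"
        return f"HTTP {e.code}: unexpected error"
    except urllib.error.URLError as e:
        return f"network error: {e.reason}"

# Substitute your real model ID and token before running.
print(check_hf_model("org/model-name", token=None))
```

If this check succeeds with the same token you configured in the Blueprint, the problem lies in the Blueprint configuration rather than the credentials themselves. For S3-hosted models, the analogous check is a HEAD request on the bucket object with the same credentials.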

4. Jobs

4.1 Job Fails or Is Stuck in Pending

Symptoms:
Job status stays “Pending”, or the job fails.

Possible Causes:

  • No GPU/CPU available
  • Misconfigured resource requests

Resolution Steps:

  • Check Events and Logs tab for errors
  • Check resource usage in “Resource Monitoring”
  • Adjust job config to fit available resources.

5. Deployment

5.1 Deployment Fails or Doesn’t Start

Symptoms:
Deployment stuck or crashes.

Possible Causes:

  • Resource misconfiguration or health-check failures
  • Invalid HF or Model access token

Resolution Steps:

  • Inside the deployment page, go to the Events tab and check for error events.
  • If there are no error events, or an event is unclear, check the Logs tab.
  • Verify Access Tokens.

5.2 Model Not Responding to Inference Calls

Symptoms:
Prediction calls time out or return unexpected results.

Possible Causes:

  • Wrong input format or API path

Resolution Steps:

  • Check the model’s expected call format
  • Use Inference UI playground to verify behavior.
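A wrong input format or API path usually shows up as a 4xx response rather than a model failure. The sketch below assumes an OpenAI-style chat-completions payload purely as an illustration; the endpoint path, payload shape, and model name are placeholders, so check your deployment's expected call format (for example in the Inference UI playground) and adjust.

```python
# Minimal inference-call sketch. Endpoint URL, path, payload shape,
# model name, and API key are all placeholders -- verify the real call
# format for your deployment before using this.
import json
import urllib.request

ENDPOINT = "https://oicm.example.com/v1/chat/completions"  # placeholder
payload = {
    "model": "my-deployed-model",  # placeholder deployment name
    "messages": [{"role": "user", "content": "Hello"}],
}
req = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <API_KEY>",  # placeholder key
    },
)
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(json.load(resp))
except Exception as e:
    # A 4xx here usually indicates a wrong path or input format
    # rather than a problem with the model itself.
    print(f"Request failed: {e}")
```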

5.3 Model Deployment is Not Scheduled

Symptoms:
Deployment stays in “Pending” status.

Possible Causes:

  • No GPU/CPU available
  • Resources are not allocated to the target workspace
  • Misconfigured resource requests, particularly RAM

Resolution Steps:

  • Check Events and Logs tab for errors
  • Check resource usage in “Resource Monitoring”
  • Adjust the workload resources to fit available compute in the workspace.

6. API Usage

6.1 Inference Requests Slow or Timeout

Symptoms:
Long latency or timeout on API call.

Possible Causes:

  • The operation is long-running, the platform is under heavy load, or the default timeout is too low

Resolution Steps:

  • Increase the timeout using the OICM-Request-Timeout header.
  • For inference calls, enable autoscaling or optimize the model.
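Setting the OICM-Request-Timeout header looks like the sketch below. The endpoint URL, path, and API key are placeholders, and the value is assumed to be in seconds, so confirm the expected unit for your installation.

```python
# Raise the per-request timeout via the OICM-Request-Timeout header.
# URL, path, and API key are placeholders; the header value is assumed
# to be seconds -- confirm the unit for your installation.
import urllib.request

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <API_KEY>",       # placeholder key
    "OICM-Request-Timeout": "300",             # allow up to 5 minutes
}
req = urllib.request.Request(
    "https://oicm.example.com/v1/predict",     # placeholder endpoint
    data=b"{}",
    headers=headers,
)
# Keep the client-side timeout at least as high as the server-side one,
# or the client will give up before the server does.
try:
    with urllib.request.urlopen(req, timeout=300) as resp:
        print(resp.status)
except Exception as e:
    print(f"Request failed: {e}")
```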

7. Tracking

7.1 Tracking Library Misconfiguration

Symptoms:
oip-tracking-client does not respond or loads indefinitely.

Possible Causes:

  • Library version is not compatible with the OICM+ product version

Resolution Steps:

  • Find the product version in the User panel, and ensure the corresponding version of oip-tracking-client is installed in the IDE.
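To see which client version is actually installed in your environment, you can query package metadata; the distribution name `oip-tracking-client` is taken from this guide. Compare the result against the product version shown in the User panel and reinstall the matching release if they differ.

```python
# Print the locally installed version of the tracking client so it can
# be matched against the OICM+ product version from the User panel.
from importlib.metadata import PackageNotFoundError, version

try:
    print("installed:", version("oip-tracking-client"))
except PackageNotFoundError:
    print("oip-tracking-client is not installed in this environment")
```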

7.2 Tracking Client Cannot Access the Experiment

Symptoms:
oip-tracking-client throws a permission error in the terminal.

Possible Causes:

  • The user API key is invalid, or the user has no access to the workspace the tracking client is calling

Resolution Steps:

  • Generate a new user API key from the user panel, and cross-check the user's access to the target workspace where the experiment is stored.

8. Benchmarks

8.1 Benchmark Run is Out of Memory

Symptoms:
Knowledge benchmark run logs show OOM error.

Possible Causes:

  • The configured batch size is too large, or no batch size is configured

Resolution Steps:

  • Rerun the benchmark with a batch size between 16 and 64, depending on the GPU memory available to the tenant.