OICM Troubleshooting Guide
This guide covers common troubleshooting scenarios for end users across OICM modules. Each issue is outlined with:
- Problem
- Symptoms
- Possible Causes
- Resolution Steps
1. Installation & Access
1.1 OICM Is Not Accessible or Login Issues
Symptoms:
Cannot access the web interface, or login redirects back to the login page.
Possible Causes:
- Network or DNS misconfiguration
- Misconfigured SSO or un-provisioned user
Resolution Steps:
- Verify the URL and network access (see the sketch after this list).
- Review SSO settings and ensure your user has access.
- Confirm tenant user is active and has an appropriate role in OICM.
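If it is unclear whether the problem is on the network/DNS side or the SSO side, a quick client-side check can narrow it down. The sketch below is a minimal example assuming Python with the requests package installed; the URL is a placeholder for your tenant's actual OICM address.

```python
# Minimal reachability check for the OICM web interface.
# "https://oicm.example.com" is a placeholder; use your tenant's actual URL.
import socket
from urllib.parse import urlparse

import requests  # third-party; pip install requests

OICM_URL = "https://oicm.example.com"
host = urlparse(OICM_URL).hostname

# 1. DNS: can the hostname be resolved at all?
try:
    print("Resolved to:", socket.gethostbyname(host))
except socket.gaierror as exc:
    print("DNS resolution failed:", exc)

# 2. HTTP: does the platform answer? A 200 or a redirect to the SSO login
#    page means connectivity is fine and the issue is more likely SSO
#    configuration or user provisioning.
try:
    resp = requests.get(OICM_URL, timeout=10, allow_redirects=False)
    print("HTTP status:", resp.status_code)
except requests.RequestException as exc:
    print("HTTP request failed:", exc)
```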
2. Authentication & Access
2.1 Cannot Access Workspace
Symptoms:
Workspace is not visible to a particular user, or the API returns the error “workspace not found.”
Possible Causes:
- User not added to the workspace
- Typo in workspace name
- Workspace no longer exists
Resolution Steps:
- Ask a Tenant Admin to add the user to the workspace with an appropriate role.
- Double-check the workspace name and ID.
- Confirm the workspace still exists.
3. Model Management
3.1 Model Registration Fails or Freezes
Symptoms:
Model registration fails or freezes without completing.
Possible Causes:
- Wrong external credentials or model ID
Resolution Steps:
- Verify the bucket path or Hugging Face model ID.
- Verify that the access tokens configured in Blueprints are correct, then retry.
- Verify that your HF or S3 token has the required access to the model (see the sketch below).
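As a quick sanity check outside OICM, the sketch below verifies that a Hugging Face token can see the model and that an S3 object path is reachable. The model ID, token, bucket, and key are placeholders, and the huggingface_hub and boto3 packages are assumed to be installed.

```python
# Hedged sketch: confirm the Hugging Face model ID/token and the S3 path
# independently of OICM before retrying registration.
from huggingface_hub import HfApi  # pip install huggingface_hub
import boto3                       # pip install boto3

HF_MODEL_ID = "org/model-name"     # placeholder model ID
HF_TOKEN = "hf_..."                # placeholder HF access token

try:
    info = HfApi().model_info(HF_MODEL_ID, token=HF_TOKEN)
    print("HF model reachable:", info.id)
except Exception as exc:
    print("HF check failed (wrong ID, token, or gated model):", exc)

# For S3-hosted models, confirm the bucket path and credentials the same way.
s3 = boto3.client("s3")            # uses your configured AWS credentials
try:
    s3.head_object(Bucket="my-models-bucket", Key="path/to/model/config.json")
    print("S3 object reachable")
except Exception as exc:
    print("S3 check failed (wrong path or insufficient permissions):", exc)
```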
4. Jobs
4.1 Job Fails or Stuck in Pending
Symptoms:
Job status stays “Pending”.
Possible Causes:
- No GPU/CPU available
- Misconfigured resource requests
Resolution Steps:
- Check the Events and Logs tabs for errors.
- Check resource usage in “Resource Monitoring”.
- Adjust job config to fit available resources.
5. Deployment
5.1 Deployment Fails or Doesn’t Start
Symptoms:
Deployment stuck or crashes.
Possible Causes:
- Resource misconfiguration or health-check failures
- Invalid HF or Model access token
Resolution Steps:
- On the deployment page, open the Events tab and check for error events.
- If there are no error events, or the events are unclear, review the logs in the Logs tab.
- Verify the access tokens (see the sketch below).
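If the failing deployment pulls a Hugging Face model, one quick way to rule out a bad token is to check it directly with huggingface_hub, as sketched below; the token value is a placeholder.

```python
# Hedged sketch: confirm that the HF access token used by the deployment
# is still valid before redeploying.
from huggingface_hub import HfApi  # pip install huggingface_hub

HF_TOKEN = "hf_..."  # placeholder: the token configured for the deployment

try:
    identity = HfApi().whoami(token=HF_TOKEN)
    print("Token is valid for:", identity.get("name"))
except Exception as exc:
    print("Token check failed; regenerate the token and update the deployment:", exc)
```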
5.2 Model Not Responding to Inference Calls
Symptoms:
Predictions time out or return unexpected results.
Possible Causes:
- Wrong input format or API path
Resolution Steps:
- Check the model’s expected call format (a hedged example follows this list).
- Use the Inference UI playground to verify behavior.
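To rule out client-side formatting issues, it can also help to call the endpoint directly and inspect the raw response. The sketch below assumes a requests-based call; the endpoint path, authorization header, and payload schema are placeholders and must be taken from the model’s documented call format.

```python
# Hedged sketch: call the inference endpoint directly and inspect the raw
# response. Endpoint, API key, and payload shape are placeholders.
import requests

ENDPOINT = "https://oicm.example.com/inference/my-deployment/v1/predict"  # placeholder
API_KEY = "..."                              # your user API key

payload = {"inputs": "Hello, world"}         # adjust to the model's expected schema
headers = {"Authorization": f"Bearer {API_KEY}"}

resp = requests.post(ENDPOINT, json=payload, headers=headers, timeout=60)
print("Status:", resp.status_code)
print(resp.text[:500])                       # inspect the raw response body
```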
5.3 Model Deployment is Not Scheduled
Symptoms:
Deployment remains in Pending status.
Possible Causes:
- No GPU/CPU available
- Resources are not allocated to the target workspace
- Misconfigured resource requests, particularly RAM
Resolution Steps:
- Check the Events and Logs tabs for errors.
- Check resource usage in “Resource Monitoring”.
- Adjust the workload resources to fit available compute in the workspace.
6. API Usage
6.1 Inference Requests Slow or Timeout
Symptoms:
Long latency or timeout on API call.
Possible Causes:
- The operation is long-running, the platform is under heavier load, or the default timeout is too low
Resolution Steps:
- Increase the timeout using the OICM-Request-Timeout header (see the sketch below).
- For inference calls, enable autoscaling or optimize the model.
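A hedged example of setting the header from Python is shown below. The endpoint is a placeholder, and the header value is assumed to be in seconds, so confirm the expected format for your installation.

```python
# Hedged sketch: raise the server-side timeout with the OICM-Request-Timeout
# header and give the client a matching, slightly larger timeout.
import requests

ENDPOINT = "https://oicm.example.com/inference/my-deployment/v1/predict"  # placeholder
headers = {
    "Authorization": "Bearer ...",   # your user API key
    "OICM-Request-Timeout": "300",   # assumed to be seconds; confirm the unit
}

resp = requests.post(ENDPOINT, json={"inputs": "..."}, headers=headers, timeout=310)
print("Status:", resp.status_code)
```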
7. Tracking
7.1 Tracking Library Misconfiguration
Symptoms:
The oip-tracking-client library does not respond or loads indefinitely.
Possible Causes:
- Library version is not compatible with the OICM+ product version
Resolution Steps:
- Find the product version in the User panel, and ensure that the corresponding version of oip-tracking-client is installed in the IDE (see the sketch below).
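The sketch below shows one way to check the installed client version from Python so it can be compared against the OICM+ product version shown in the User panel; the distribution name is assumed to be oip-tracking-client.

```python
# Hedged sketch: report the installed tracking client version.
# The distribution name "oip-tracking-client" is an assumption.
from importlib.metadata import version, PackageNotFoundError

try:
    print("Installed oip-tracking-client version:", version("oip-tracking-client"))
except PackageNotFoundError:
    print("oip-tracking-client is not installed in this environment")

# If the versions do not match, reinstall the matching release, e.g.:
#   pip install "oip-tracking-client==<product-compatible version>"
```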
7.2 Tracking Client Cannot Access the Experiment
Symptoms:
oip-tracking-client throws a permission error in the terminal.
Possible Causes:
- The user API key is invalid, or the user has no access to the workspace that the tracking call targets
Resolution Steps:
- Generate a new user API key from the User panel, and confirm that the user has access to the target workspace where the experiment is stored.
8. Benchmarks
8.1 Benchmark Run is Out of Memory
Symptoms:
Knowledge benchmark run logs show an out-of-memory (OOM) error.
Possible Causes:
- The configured batch size is too large, or no batch size is configured
Resolution Steps:
- Rerun the benchmark with a batch size between 16 and 64, depending on the GPU memory available to the tenant.