Inference REST API
This section explains how to set up and call the Inference Gateway to serve AI/ML models via secure, token-based REST APIs. Supported model types include LLMs, computer vision, speech processing, and classical ML models.
1. Overview
- Token-Based Auth – Generate unique tokens to secure your model endpoints.
- Multi-Task Support – Text generation/chat, sequence classification, speech, TTS, image generation, etc.
- Simplified Deployment – Focus on your application logic; let the gateway handle serving complexities.
2. Setting Up an Inference Gateway
2.1 Access Model Settings
- Navigate to your Registered Model in the workspace.
- Open the Settings tab in the model version page.
- Find the Access tokens section.
2.2 Generate an Access Token
- Click + Add new token.
- Provide a name (e.g., "new-token").
- Save the token immediately (it won’t be visible again).
2.3 Configure Inference Settings
When you choose a deployment type (text generation, ASR, etc.), the system auto-generates inference code and optimized payload structures. Supported tasks include text gen/chat, sequence classification, speech recognition, TTS, image generation, translation, reranking, and embeddings.
The example below shows how to use the API access token and the generated code snippet to call a Large Language Model deployment from a Python application.
2.4 Making API Calls
To perform inference calls, you need the following (a minimal sketch combining them appears after this list):
- Your API token
- The model version ID
- The correct endpoint for your use case
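As a minimal sketch that puts these pieces together (the endpoint path and the inputs field follow the text-generation example in Section 5; replace the placeholders with your own values):
import requests

# Placeholders: your access token, model version ID, and gateway base URL
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

# Endpoint pattern as used by the text-generation examples in Section 5
url = f"{base_url}/models/{model_version_id}/proxy/v1/chat/completions"
headers = {"Authorization": f"Bearer {api_key}"}

response = requests.post(url, headers=headers, json={"inputs": "Hello!"})
print(response.status_code, response.text)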
3. Advanced Configuration
3.1 Temperature & Top-p (Text Generation)
- Temperature (default 0.7) – Higher => more randomness; lower => more deterministic.
- Top-p (default 0.9) – Nucleus sampling threshold; a payload sketch follows below.
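As an illustrative sketch, these parameters are set directly in the request payload (field names as used in the chat payloads of Section 5.2; values here are only examples):
# Illustrative values only
payload = {
    "model": "<model_name>",  # obtained from the /models endpoint, see Section 5.2
    "messages": [{"role": "user", "content": "Summarize photosynthesis in one sentence."}],
    "temperature": 0.3,  # default 0.7; higher => more randomness
    "top_p": 0.9,        # default 0.9; nucleus sampling threshold
    "max_tokens": 200
}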
3.2 Request Timeout
- Default: 120 seconds
- Custom: Use the OICM-Request-Timeout header for per-request overrides (in seconds); see Section 7.1 for an example.
4. Token Management
- Expiration – Configure tokens with expiry periods.
- Multiple Tokens – Create separate tokens for different apps.
- Revocation – Revoke any token anytime in the Settings page.
- Visibility – Tokens are displayed once; store them securely.
5. Inference Payload Examples
Below are common endpoints and usage patterns for different AI tasks.
5.1 Text Generation
5.1.1 Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/chat/completions
5.1.2 Usage Example
import requests
import json
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>" # e.g. https://inference.develop.openinnovation.ai
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}
payload = {
    "inputs": "What is Deep Learning?",
    "max_new_tokens": 200,
    "do_sample": False,
    "stream": True
}
chat_response = requests.post(inference_url, headers=headers, json=payload, stream=True)
for token in chat_response.iter_lines():
    try:
        # Each SSE line is prefixed with "data: "; strip it before parsing
        data_str = token.decode("utf-8")[6:]
        data_json = json.loads(data_str)
        content = data_json["choices"][0]["delta"]["content"]
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        # Skip keep-alive lines and non-content chunks
        pass
5.1.3 Chat Template
You can add a chat_template in the payload for custom formatting via Jinja. Example:
{
"messages": [
{"role": "system", "content": "Be friendly"},
{"role": "user", "content": "What's the capital of UAE?"},
{"role": "assistant", "content": ""}
],
"chat_template": '''
{% if messages[0]['role'] == 'system' %}
{% set loop_messages = messages[1:] %}
{% set system_message = messages[0]['content'] %}
{% else %}
{% set loop_messages = messages %}
{% set system_message = '' %}
{% endif %}
{% for message in loop_messages %}
{% if loop.index0 == 0 %}
{{ system_message.strip() }}
{% endif %}
{{ '\n\n' + message['role'] + ': ' + message['content'].strip().replace('\r\n', '\n').replace('\n\n', '\n') }}
{% if loop.last and message['role'] == 'user' %}
{{ '\n\nAssistant: ' }}
{% endif %}
{% endfor %}
'''
}
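A sketch of sending such a payload from Python: the chat_template is passed as an ordinary JSON string field. The shortened template and the non-streaming stream: False flag below are illustrative assumptions, not fixed requirements.
import requests

# Same placeholders/endpoint as the 5.1.2 example
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/chat/completions"
headers = {"Authorization": f"Bearer {api_key}"}

# Shortened Jinja template for illustration; the full template above is sent the same way
chat_template = (
    "{% for message in messages %}"
    "{{ '\\n\\n' + message['role'] + ': ' + message['content'].strip() }}"
    "{% endfor %}"
    "{{ '\\n\\nAssistant: ' }}"
)

payload = {
    "messages": [
        {"role": "system", "content": "Be friendly"},
        {"role": "user", "content": "What's the capital of UAE?"}
    ],
    "chat_template": chat_template,
    "stream": False  # assumption: non-streaming returns a single JSON response
}
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())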
5.2 Text Completion (vLLM / TGI)
Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/chat/completions
Usage Example (vLLM)
import requests
import json
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1"
headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}
endpoints = {
    "models": "models",
    "chat_completion": "chat/completions"
}
# e.g. retrieving model info
model_info = requests.get(f"{inference_url}/models", headers=headers).json()
model_name = model_info["data"][0]["id"]
payload = {
    "model": model_name,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Describe gravity to a 6-year-old."}
    ],
    "temperature": 0.9,
    "top_p": 0.7,
    "max_tokens": 1000,
    "stream": True
}
chat_response = requests.post(
    f"{inference_url}/chat/completions",
    headers=headers,
    json=payload,
    stream=True
)
for token in chat_response.iter_lines():
    try:
        # Each SSE line is prefixed with "data: "; strip it before parsing
        data_str = token.decode("utf-8")[6:]
        data_json = json.loads(data_str)
        content = data_json["choices"][0]["delta"]["content"]
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        # Skip keep-alive lines and non-content chunks
        pass
Usage with TGI is similar; just adapt the endpoints and payload fields accordingly.
Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/chat/completions
Usage Example (TGI)
import requests
import json
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1"
headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}
endpoints = {
    "models": "models",
    "chat_completion": "chat/completions"
}
model_info = requests.get(f"{inference_url}/{endpoints['models']}", headers=headers).json()
model_name = model_info["data"][0]["id"]
payload = {
    "model": model_name,
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Describe gravity to a 6-year-old child in 50 words"
            # "content": "tell me a long story"
        }
    ],
    "temperature": 0.9,
    "top_p": 0.7,
    "max_tokens": 1000,
    "stream": True
}
chat_response = requests.post(f"{inference_url}/{endpoints['chat_completion']}", headers=headers, json=payload, stream=True)
for token in chat_response.iter_lines():
    try:
        # Each SSE line is prefixed with "data: "; strip it before parsing
        string_data = token.decode("utf-8")[6:]
        json_data = json.loads(string_data)
        content = json_data["choices"][0]["delta"]["content"]
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        # Skip keep-alive lines and non-content chunks
        pass
5.3 Sequence Classification
Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/classify
Usage Example
import requests
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/classify"
headers = {"Authorization": f"Bearer {api_key}"}
payload = {
    "inputs": "this is good!",
}
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())
The equivalent curl call:
curl -X POST "${inference_url}" \
  -H "Authorization: Bearer ${api_key}" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "this is good!"
  }'
Response
{
    "classification": [
        {
            "label": "admiration",
            "score": 0.7764764428138733
        },
        {
            "label": "excitement",
            "score": 0.11938948929309845
        },
        {
            "label": "joy",
            "score": 0.04363647475838661
        },
        {
            "label": "approval",
            "score": 0.012329215183854103
        },
        {
            "label": "gratitude",
            "score": 0.010198703035712242
        },
        ...
    ]
}
5.4 Automatic Speech Recognition
Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/transcript
Usage Example
import requests
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/transcript"
headers = {"Authorization": f"Bearer {api_key}"}
# (file name, file object, MIME type) -- the MIME type should match the uploaded audio format
files = {
    "file": (
        "file_name.mp3",
        open("/path/to/audio_file", "rb"),
        "audio/mpeg"
    )
}
response = requests.post(inference_url, headers=headers, files=files)
print(response.json())
Response
5.5 Text To Speech
Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/generate-speech
Request
import requests
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/generate-speech"
headers = {"Authorization": f"Bearer {api_key}"}
# Get the schema of the model; each TTS model has its own schema
schema = requests.get(f"{inference_url}/schema", headers=headers)
print(schema.json())
# Based on the schema, build the POST request with the supported params
Response
[
    {
        "desc": "Hey, how are you doing today?",
        "label": "Prompt",
        "name": "prompt",
        "required": true,
        "type": "string"
    },
    {
        "desc": "A female speaker with a slightly low-pitched voice",
        "label": "Speaker Description",
        "name": "description",
        "required": true,
        "type": "string"
    }
]
Request Body Format
The request body uses the field names exactly as received from the schema.
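For the example schema above, a request sketch might look like the following (the prompt and description fields come from that schema; your model's schema may expose different fields):
# Sketch only: field names taken from the example schema above
payload = {
    "prompt": "Hey, how are you doing today?",
    "description": "A female speaker with a slightly low-pitched voice"
}
# inference_url and headers as defined in the request example above
response = requests.post(inference_url, headers=headers, json=payload)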
Response Format
The received audio is returned as a base64-encoded string.
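As a sketch, assuming the base64 audio is returned under an audio key (an assumption; check your model's actual response), you can decode and save it like this:
import base64

# Assumption: the base64-encoded audio lives under the "audio" key;
# adjust the key to match your model's actual response payload.
audio_b64 = response.json()["audio"]
with open("output.wav", "wb") as f:
    f.write(base64.b64decode(audio_b64))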
5.6 Text To Image
Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/generate-image
Usage example
import requests
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/generate-image"
headers = {"Authorization": f"Bearer {api_key}"}
payload = {
    "prompt": "A man walking on the moon",
    "num_inference_steps": 20,
    "high_noise_frac": 8,
}
response = requests.post(inference_url, headers=headers, json=payload)
Response
5.7 Translation
Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/translate
Usage example
import requests
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/translate"
headers = {"Authorization": f"Bearer {api_key}"}
payload = {"text": "A man walking on the moon"}
response = requests.post(inference_url, headers=headers, json=payload)
Response
5.8 Reranking / Embedding / Classification
import requests
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
url = f"{base_url}/models/{model_version_id}/proxy/v1"
headers = {"Authorization": f"Bearer {api_key}"}
# Embedding
embed_payload = {"inputs": "What is Deep Learning?"}
embed_resp = requests.post(f"{url}/embed", headers=headers, json=embed_payload)
# Reranking
rerank_payload = {
    "query": "What is Deep Learning?",
    "texts": ["Deep Learning is not...", "Deep learning is..."]
}
rerank_resp = requests.post(f"{url}/rerank", headers=headers, json=rerank_payload)
# Classification
classify_payload = {"inputs": "Abu Dhabi is great!"}
classify_resp = requests.post(
    f"{url}/predict", headers=headers, json=classify_payload
)
6. Classical ML Models
6.1 API Endpoint
POST {base_url}/models/{model_version_id}/proxy/v1/predict
6.2 Usage Example
import requests
api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/predict"
headers = {"Authorization": f"Bearer {api_key}"}
# The payload must match the model's input schema (see Response Format below)
payload = [1, 2, 3]
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())
Response Format
{
    "data": [83.4155584413916, 209.9168121704531],
    "meta": {
        "input_schema": [
            {
                "tensor-spec": {
                    "dtype": "float64",
                    "shape": [-1, 10]
                },
                "type": "tensor"
            }
        ],
        "output_schema": [
            {
                "tensor-spec": {
                    "dtype": "float64",
                    "shape": [-1]
                },
                "type": "tensor"
            }
        ]
    }
}
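The payload you send must conform to the model's input_schema. For the schema above (float64 tensors of shape [-1, 10]), a sketch with two rows of ten features might look like this (illustrative values only; the bare-list format mirrors the usage example above):
# Illustrative only: two rows of 10 features each, matching shape [-1, 10]
payload = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
]
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())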
7. Advanced Options
7.1 Timeout
- Default: 120 seconds
- Override via OICM-Request-Timeout header.
import requests

response = requests.post(endpoint, json=data, headers={
    "Authorization": f"Bearer {api_key}",
    "OICM-Request-Timeout": "300"  # per-request timeout in seconds
})
7.2 Best Practices
- Store tokens securely.
- Set appropriate temperature/top-p for text gen.
- Use system messages effectively for chat.
- Rotate tokens and watch for expiry.
- Validate request formats and endpoints if issues arise.
8. Troubleshooting
- Invalid Token – Verify token is active and not expired.
- Incorrect Model Version ID – Double-check the ID in your workspace.
- Request Format – Confirm JSON structure matches endpoint specs.
- Permission Errors – Ensure the token grants appropriate permissions.
Next Steps
- Model Inference Overview – Learn about input formats and model families.
- Deployments UI – Manage and monitor your model deployments.
- Performance Benchmark – Load-test your model for scalability insights.