
Inference REST API

This section explains how to set up and call the Inference Gateway to serve AI/ML models via secure, token-based REST APIs. Supported model types include LLMs, computer vision, speech processing, and classical ML models.


1. Overview

  • Token-Based Auth – Generate unique tokens to secure your model endpoints.
  • Multi-Task Support – Text generation/chat, sequence classification, speech, TTS, image generation, etc.
  • Simplified Deployment – Focus on your application logic; let the gateway handle serving complexities.

2. Setting Up an Inference Gateway

2.1 Access Model Settings

  1. Navigate to your Registered Model in the workspace.
  2. Open the Settings tab in the model version page.
  3. Find the Access tokens section.

Model Registry Settings

2.2 Generate an Access Token

  1. Click + Add new token.
  2. Provide a name (e.g., "new-token").
  3. Save the token immediately (it won’t be visible again).

Token Generated

2.3 Configure Inference Settings

When you choose a deployment type (text generation, ASR, etc.), the system auto-generates inference code and optimized payload structures. Supported tasks include text gen/chat, sequence classification, speech recognition, TTS, image generation, translation, reranking, and embeddings.

The example below shows how to use the API access token and the auto-generated code snippet to make API calls to a Large Language Model deployment from a Python application.

Inference Code Snippet

2.4 Making API Calls

To perform inference calls, you need the following (a minimal request sketch follows this list):

  1. Your API token
  2. The model version ID
  3. The correct endpoint for your use case
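
As a minimal sketch (the placeholders <api_key>, <model_version_id>, and <base_url> are values you substitute from your workspace, and a sequence-classification deployment is assumed here), every call follows the same pattern; task-specific endpoints and payloads are covered in section 5:

import requests

api_key = "<api_key>"                    # token generated in the Access tokens section
model_version_id = "<model_version_id>"  # ID of the registered model version
base_url = "<base_url>"                  # e.g. https://inference.develop.openinnovation.ai

# All endpoints share the prefix <base_url>/models/<model_version_id>/proxy/v1/...
inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/classify"

headers = {"Authorization": f"Bearer {api_key}"}
payload = {"inputs": "this is good!"}

response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())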

3. Advanced Configuration

3.1 Temperature & Top-p (Text Generation)

  • Temperature (default 0.7) – Higher => more randomness; lower => more deterministic.
  • Top-p (default 0.9) – Nucleus sampling threshold.
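
For example, a chat-completions payload (a sketch reusing the field names from section 5.2; <model_name> is a placeholder) can override both explicitly:

payload = {
    "model": "<model_name>",   # as returned by the /models endpoint (see section 5.2)
    "messages": [{"role": "user", "content": "Write a haiku about the sea."}],
    "temperature": 0.2,        # lower than the 0.7 default => more deterministic output
    "top_p": 0.9,              # nucleus sampling threshold (the default)
    "max_tokens": 100
}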

3.2 Request Timeout

  • Default: 120 seconds
  • Custom: Use header OICM-Request-Timeout for per-request overrides (in seconds).
headers = {
    "Authorization": "Bearer <api_key>",
    "OICM-Request-Timeout": "300"  # 5 minutes
}

4. Token Management

  • Expiration – Configure tokens with expiry periods.
  • Multiple Tokens – Create separate tokens for different apps.
  • Revocation – Revoke any token anytime in the Settings page.
  • Visibility – Tokens are displayed once; store them securely.

5. Inference Payload Examples

Below are common endpoints and usage patterns for different AI tasks.


5.1 Text Generation

5.1.1 Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/chat/completions

5.1.2 Usage Example

import requests
import json

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"  # e.g. https://inference.develop.openinnovation.ai

inference_url = f"{base_url}/models/{model_version_id}/proxy/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}

payload = {
    "inputs": "What is Deep Learning?",
    "max_new_tokens": 200,
    "do_sample": False,
    "stream": True
}

chat_response = requests.post(inference_url, headers=headers, json=payload, stream=True)
for line in chat_response.iter_lines():
    if not line:
        continue
    try:
        # Each streamed SSE line is prefixed with "data: "; strip it before parsing the JSON chunk.
        data_str = line.decode("utf-8")[6:]
        data_json = json.loads(data_str)
        content = data_json["choices"][0]["delta"]["content"]
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        # Skip keep-alive lines and the final "[DONE]" marker.
        continue

5.1.3 Chat Template

You can add a chat_template in the payload for custom formatting via Jinja. Example:

{
    "messages": [
        {"role": "system", "content": "Be friendly"},
        {"role": "user", "content": "What's the capital of UAE?"},
        {"role": "assistant", "content": ""}
    ],
    "chat_template": '''
        {% if messages[0]['role'] == 'system' %}
            {% set loop_messages = messages[1:] %}
            {% set system_message = messages[0]['content'] %}
        {% else %}
            {% set loop_messages = messages %}
            {% set system_message = '' %}
        {% endif %}

        {% for message in loop_messages %}
            {% if loop.index0 == 0 %}
                {{ system_message.strip() }}
            {% endif %}
            {{ '\n\n' + message['role'] + ': ' + message['content'].strip().replace('\r\n', '\n').replace('\n\n', '\n') }}

            {% if loop.last and message['role'] == 'user' %}
                {{ '\n\nAssistant: ' }}
            {% endif %}
        {% endfor %}
    '''
}

5.2 Text Completion (VLLM / TGI)

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1

Usage Example (VLLM)

import requests
import json

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = f"{base_url}/models/{model_version_id}/proxy/v1"
headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}

endpoints = {
    "models": "models",
    "chat_completion": "chat/completions"
}

# e.g. retrieving model info
model_info = requests.get(f"{inference_url}/{endpoints['models']}", headers=headers).json()
model_name = model_info["data"][0]["id"]

payload = {
    "model": model_name,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Describe gravity to a 6-year-old."}
    ],
    "temperature": 0.9,
    "top_p": 0.7,
    "max_tokens": 1000,
    "stream": True
}

chat_response = requests.post(
    f"{inference_url}/{endpoints['chat_completion']}",
    headers=headers,
    json=payload,
    stream=True
)

for line in chat_response.iter_lines():
    if not line:
        continue
    try:
        # Strip the SSE "data: " prefix before parsing each streamed chunk.
        data_str = line.decode("utf-8")[6:]
        data_json = json.loads(data_str)
        content = data_json["choices"][0]["delta"]["content"]
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        continue

Usage with TGI is similar; just adapt the endpoints and payload fields accordingly.

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1

Usage Example (TGI)

import requests
import json

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = f"{base_url}/models/{model_version_id}/proxy/v1"


headers = {
    "Authorization": f"Bearer {api_key}",
    "accept": "text/event-stream"
}

endpoints = {
    "models": "models",
    "chat_completion": "chat/completions"
}

model_info = requests.get(f"{inference_url}/{endpoints['models']}", headers=headers).json()
model_name = model_info["data"][0]["id"]


payload = {
    "model": model_name,
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Describe gravity to a 6-year-old child in 50 words"
        }
    ],
    "temperature": 0.9,
    "top_p": 0.7,
    "max_tokens": 1000,
    "stream": True
}

chat_response = requests.post(f"{inference_url}/{endpoints['chat_completion']}", headers=headers, json=payload, stream=True)

for line in chat_response.iter_lines():
    if not line:
        continue
    try:
        # Strip the SSE "data: " prefix before parsing each streamed chunk.
        string_data = line.decode("utf-8")[6:]
        json_data = json.loads(string_data)
        content = json_data["choices"][0]["delta"]["content"]
        print(content, end="", flush=True)
    except (json.JSONDecodeError, KeyError, IndexError):
        continue

5.3 Sequence Classification

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/classify

Usage Example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/classify"
)

headers = {"Authorization": f"Bearer {api_key}"}

payload = {
    "inputs": "this is good!",
}
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json())

The equivalent curl request:

curl -X POST "<base_url>/models/<model_version_id>/proxy/v1/classify" \
  -H "Authorization: Bearer <api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "this is good!"
  }'

Response

{
    "classification": [
        {
            "label": "admiration",
            "score": 0.7764764428138733
        },
        {
            "label": "excitement",
            "score": 0.11938948929309845
        },
        {
            "label": "joy",
            "score": 0.04363647475838661
        },
        {
            "label": "approval",
            "score": 0.012329215183854103
        },
        {
            "label": "gratitude",
            "score": 0.010198703035712242
        },
        ...
    ]
}

5.4 Automatic Speech Recognition

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/transcript

Usage Example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/transcript"
)

headers = {"Authorization": f"Bearer {api_key}"}

files = {
    "file": (
        "file_name.wav",                   # filename reported to the server
        open("/path/to/audio_file", "rb"),
        "audio/wav"                        # MIME type; adjust to match your audio format
    )
}
response = requests.post(inference_url, headers=headers, files=files)

Response

{"text": "Hi, can you help me with the driving license?"}

5.5 Text To Speech

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/generate-speech

Request

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/generate-speech"
)

headers = {"Authorization": f"Bearer {api_key}"}

# Get the schema of the model; each TTS model has its own schema.
schema = requests.get(f"{inference_url}/schema", headers=headers).json()
# Based on the schema, build the POST request with the supported params.

Response

[
    {
        "desc": "Hey, how are you doing today?",
        "label": "Prompt",
        "name": "prompt",
        "required": true,
        "type": "string"
    },
    {
        "desc": "A female speaker with a slightly low-pitched voice",
        "label": "Speaker Description",
        "name": "description",
        "required": true,
        "type": "string"
    }
]

Request Body Format

The request body should be in the following format. Use the field names as received in the schema:

{
    "prompt": "Hi, can you help me?",
    "description": "A man with clear voice"
}

Response Format

The received audio is a base64-encoded string:

{
    "audio": "UklGRiRcAgBXQVZFZm10IBAAAA...",
    "sampling_rate": 44100
}
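
Continuing the example above, a minimal sketch that sends the request body and decodes the returned audio (the .wav extension is an assumption based on the RIFF header visible in the sample response; adjust it to your model's output):

import base64

payload = {
    "prompt": "Hi, can you help me?",
    "description": "A man with clear voice"
}
response = requests.post(inference_url, headers=headers, json=payload)
result = response.json()

# The audio field is base64-encoded; decode it and write it to a file.
with open("speech.wav", "wb") as f:
    f.write(base64.b64decode(result["audio"]))
print("sampling rate:", result["sampling_rate"])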

5.6 Text To Image

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/generate-image

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/generate-image"
)

headers = {"Authorization": f"Bearer {api_key}"}

payload = {
    "prompt": "A man walking on the moon",
    "num_inference_steps": 20,
    "high_noise_frac": 8,
}
response = requests.post(f"{inference_url}", headers=headers, json=payload)

Response

{"image": "iVBORw0KGgoAAAANS..."}

5.7 Translation

Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/translate

Usage example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/translate"
)

headers = {"Authorization": f"Bearer {api_key}"}

payload = {"text": "A man walking on the moon"}
response = requests.post(f"{inference_url}", headers=headers, json=payload)

Response

{"translation": "Un hombre caminando en la luna"}

5.8 Reranking / Embedding / Classification

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

url = f"{base_url}/models/{model_version_id}/proxy/v1"
headers = {"Authorization": f"Bearer {api_key}"}

# Embedding
embed_payload = {"inputs": "What is Deep Learning?"}
embed_resp = requests.post(f"{url}/embed", headers=headers, json=embed_payload)

# Reranking
rerank_payload = {
  "query": "What is Deep Learning?",
  "texts": ["Deep Learning is not...", "Deep learning is..."]
}
rerank_resp = requests.post(f"{url}/rerank", headers=headers, json=rerank_payload)

# Classification
classify_payload = {"inputs": "Abu Dhabi is great!"}
classify_resp = requests.post(
  f"{url}/predict", headers=headers, json=classify_payload
)

6. Classical ML Models

6.1 API Endpoint

POST <API_HOST>/models/<model_version_id>/proxy/v1/predict

6.2 Usage Example

import requests

api_key = "<api_key>"
model_version_id = "<model_version_id>"
base_url = "<base_url>"

inference_url = (
    f"{base_url}/models/{model_version_id}/proxy/v1/predict"
)


headers = {"Authorization": f"Bearer {api_key}"}


payload = [1, 2, 3]
response = requests.post(f"{inference_url}", headers=headers, json=payload)

Response Format

{
    "data": [83.4155584413916, 209.9168121704531],
    "meta": {
        "input_schema": [
            {
                "tensor-spec": {
                    "dtype": "float64",
                    "shape": [-1, 10]
                },
                "type": "tensor"
            }
        ],
        "output_schema": [
            {
                "tensor-spec": {
                    "dtype": "float64",
                    "shape": [-1]
                },
                "type": "tensor"
            }
        ]
    }
}
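
The meta.input_schema describes the input the model expects. As a sketch (assuming the [-1, 10] float64 schema shown above, and that the endpoint accepts a nested list; confirm this against the snippet generated for your model), a two-row request could look like this:

# Two rows of 10 features each, matching the [-1, 10] input schema above (hypothetical values).
payload = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
]
response = requests.post(inference_url, headers=headers, json=payload)
print(response.json()["data"])  # one prediction per input row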

7. Advanced Options

7.1 Timeout

  • Default: 120 seconds
  • Override via OICM-Request-Timeout header.
import requests

requests.post(endpoint, json=data, headers={
    "Authorization": f"Bearer {api_key}",
    "OICM-Request-Timeout": "300"  # seconds
})

7.2 Best Practices

  • Store tokens securely.
  • Set appropriate temperature/top-p for text gen.
  • Use system messages effectively for chat.
  • Rotate tokens and watch for expiry.
  • Validate request formats and endpoints if issues arise.

8. Troubleshooting

  • Invalid Token – Verify token is active and not expired.
  • Incorrect Model Version ID – Double-check the ID in your workspace.
  • Request Format – Confirm JSON structure matches endpoint specs.
  • Permission Errors – Ensure the token grants appropriate permissions.

Next Steps