Inference
This section explains how to run inference against machine learning models deployed in the Open Innovation Platform. Depending on your model type (LLM or classical ML), different methods are available for obtaining predictions.
1. LLM Inference
Currently, the platform supports two primary LLM inference types:
- Text Generation
- Sequence Classification
1.1 Text Generation
Use TGI or VLLM deployments to access text generation endpoints. Two inference modes are supported:
1.1.1 Chat Inference
When using Chat mode, you provide a list of dictionaries representing conversation history:
{
  "messages": [
    {"role": "system", "content": "Be friendly"},
    {"role": "user", "content": "What's the capital of UAE?"},
    {"role": "assistant", "content": ""}
  ]
}
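As a minimal sketch, the payload above can be sent to the deployment endpoint with Python's requests library. The endpoint URL and API key below are placeholders; use the values shown on your deployment's page.
import requests

# Placeholders: replace with your deployment's endpoint URL and credentials
ENDPOINT_URL = "https://<your-chat-endpoint-url>"
HEADERS = {"Authorization": "Bearer <your-api-key>"}

payload = {
    "messages": [
        {"role": "system", "content": "Be friendly"},
        {"role": "user", "content": "What's the capital of UAE?"},
        {"role": "assistant", "content": ""}
    ]
}

# Send the chat request and print the generated reply
response = requests.post(ENDPOINT_URL, json=payload, headers=HEADERS)
response.raise_for_status()
print(response.json())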
1.1.2 Completion Inference
When using Completion mode, you provide a single string:
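(Illustrative example; the field name is an assumption. TGI deployments typically expect "inputs", while VLLM's completion API typically expects "prompt".)
{
  "inputs": "What's the capital of UAE?"
}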
Note: When you pass a list of dictionaries (Chat mode), the platform automatically formats the conversation according to a default or custom chat template. If you want full control over the prompt formatting, use Completion mode to pass a single string directly.
1.1.3 Inference Parameters
You can control additional parameters (e.g., temperature, top-k) alongside your input message. The exact parameters depend on whether you use VLLM or TGI as your inference backend.
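As an illustrative sketch, the payload below adds sampling parameters to a chat request. The parameter names and nesting shown are assumptions and differ between backends; for instance, TGI groups generation options under a "parameters" object, while VLLM accepts OpenAI-style fields such as "max_tokens" at the top level.
{
  "messages": [
    {"role": "user", "content": "Summarize the history of Abu Dhabi in two sentences."}
  ],
  "temperature": 0.7,
  "top_k": 40,
  "max_tokens": 128
}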
1.1.4 Chat Templates
Chat mode relies on model-specific templates. The following families are supported by default:
- LLAMA 2
- LLAMA 3
- Falcon
- Yi
- Mistral
- Aya-23
If your model family isn't in this list, add a custom chat template to the model version configuration (see the sketch below) or use Completion mode to format prompts yourself.
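As a rough sketch, and assuming the platform accepts Hugging Face-style Jinja chat templates, a custom template might look like the following. The special tokens are placeholders and must match your model's actual vocabulary.
{% for message in messages %}
  {% if message['role'] == 'system' %}<|system|>{{ message['content'] }}
  {% elif message['role'] == 'user' %}<|user|>{{ message['content'] }}
  {% elif message['role'] == 'assistant' %}<|assistant|>{{ message['content'] }}
  {% endif %}
{% endfor %}
<|assistant|>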
1.2 Sequence Classification Inference
For sequence_classification models deployed with OI_SERVE, provide a single text input:
Request:
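(Illustrative request body; the "text" field name is an assumption, so check the input schema shown for your deployment.)
{
  "text": "The delivery was fast and the support team was very helpful."
}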
Response:
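(Illustrative response for a sentiment model; the label set and score format depend on the deployed model and may differ from what is shown here.)
{
  "label": "POSITIVE",
  "score": 0.98
}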
2. Classical ML Inference
2.1 Input Format
When deploying a tracked experiment or custom model, you can provide inputs in different formats:
Tip: Log your model signature to ensure inputs are parsed correctly:
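A minimal sketch of logging a signature, assuming MLflow-style experiment tracking is used for tracked experiments; infer_signature derives the input and output schema from sample data.
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small example model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Infer the input/output schema from sample inputs and predictions
signature = infer_signature(X, model.predict(X))

# Log the model together with its signature so the endpoint can parse inputs correctly
with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model", signature=signature)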
2.1.1 Tensor Input (NumPy Arrays)
If the model expects a NumPy array, submit data as a JSON list. For a shape (-1, 3, 2):
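(Illustrative example showing two samples of shape (3, 2); the "inputs" key is an assumption, and the exact field name depends on your deployment's input schema.)
{
  "inputs": [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
  ]
}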
2.1.2 Named Parameters (Pandas DataFrame)
If the model expects multiple columns (e.g., a DataFrame), use a list of objects:
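(Illustrative payload with hypothetical column names age, income, and country; replace them with the columns from your model's signature. The "inputs" key is again an assumption.)
{
  "inputs": [
    {"age": 42, "income": 55000.0, "country": "AE"},
    {"age": 31, "income": 48000.0, "country": "FR"}
  ]
}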
2.2 Output Format
The endpoint returns a JSON dictionary with a predictions field:
- List Output: If the model returns a Python list or NumPy array, you'll get a list of arrays.
- Dictionary Output: If the model returns a Pandas DataFrame or dict, you'll see key-value pairs.
Example (List Output):
{
  "predictions": [
    [
      -3.644273519515991,
      -4.824134826660156,
      -3.8084142208099365,
      -5.363550662994385
    ],
    [
      -4.997870922088623,
      -4.3103718757629395,
      -0.13021154701709747,
      -3.2400429248809814
    ]
  ]
}
Example (Dictionary Output):
{
  "predictions": [
    {"sentiment": "POSITIVE", "score": 0.976},
    {"sentiment": "NEUTRAL", "score": 0.7345}
  ]
}
Next Steps
- Deployments UI – Manage and monitor your deployed models, including inference tests.
- Registered Models & Versions – Organize and version your ML models.
- Performance Benchmark – Evaluate how your model scales under load.