User Interface
Deployments UI
Accessing Model Deployments
- Navigate to the Model Deployments section from the main platform menu
- The dashboard displays your existing deployments with their status
- Use the search functionality to find specific deployments
Creating a New Deployment
You can deploy either registered models from the Model Registry or custom Docker images.
Step 1: Choose a Deployment Source
- Click the "Create New Deployment" button
- Choose your deployment source:
- Registered Model: Deploy a model from the Model Registry
- Custom Docker Image: Deploy your own containerized model
Deploying a Registered Model
When deploying from the Model Registry, follow these steps:
Step 1: Model Selection
- Enter a unique deployment name
- Select a registered model from the dropdown
- Select a specific model version from the dropdown
- Choose the appropriate task type for your model:
- Text Generation (LLMs)
- Text-to-Image
- Text-to-Speech
- Automatic Speech Recognition (ASR)
- Classical ML
- And others
Note: For guidance on task types, refer to the Hugging Face model categories, which provide a comprehensive list of tasks and their purposes.
Step 2: Deployment Configuration
Select a model serving framework based on your requirements:
- TGI 2: Text Generation Inference v2
- TGI 3: Text Generation Inference v3
- vLLM: High-throughput and memory-efficient inference
- SGLang: Structured Generation Language
- Ray Serve: Scalable model serving framework
- OI_Serve: Our proprietary serving framework
- Text_embedding_inference: Specialized for embedding models
Note: Each framework has its own set of configuration options. Default configurations are provided for all frameworks, but you can customize them as needed.
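For illustration, a set of framework overrides might look like the sketch below. The option names shown are borrowed from vLLM's engine arguments (gpu_memory_utilization, max_model_len, tensor_parallel_size); other frameworks expose different options, and the values shown are hypothetical, not platform defaults.

```python
# Illustrative only: option names below come from vLLM's engine arguments;
# TGI, SGLang, Ray Serve, and other frameworks use their own configuration keys.
framework_overrides = {
    "gpu_memory_utilization": 0.90,  # fraction of GPU memory the engine may claim
    "max_model_len": 8192,           # maximum context length to serve
    "tensor_parallel_size": 1,       # number of GPUs per model replica
}
```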
Step 3: Resource Allocation
Configure the computing resources for each deployment instance:
- Compute Type:
- Full GPUs
- Fractional GPUs
- CPUs
- Memory (RAM): Amount of memory allocated to each instance
- Storage: Disk space for model artifacts and runtime data
- Accelerator Type: GPU model (if applicable)
- Accelerator Count: Number of GPUs per instance
- CPU Count: Number of CPU cores per instance
Important: Resources specified here are for a single deployment instance. The total resources consumed will be multiplied by the number of replicas configured in the next step.
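As a quick sanity check, you can estimate the worst-case footprint by multiplying the per-instance allocation by the maximum replica count. The sketch below uses hypothetical figures, not platform defaults.

```python
# Hypothetical per-instance allocation; replace with your own values.
per_instance = {"gpus": 1, "cpu_cores": 8, "ram_gb": 32, "storage_gb": 100}
max_replicas = 4  # configured in the scaling step (Step 4)

# Worst-case consumption when the deployment is scaled out to max_replicas.
total = {resource: amount * max_replicas for resource, amount in per_instance.items()}
print(total)  # {'gpus': 4, 'cpu_cores': 32, 'ram_gb': 128, 'storage_gb': 400}
```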
GPU Fractioning Option
When the GPU fractioning option is selected, you can also choose an accelerator slice. The available slices vary depending on the GPU type.
What is GPU Fractioning?
Modern NVIDIA GPUs (A100, H100, H200, etc.) support MIG (Multi-Instance GPU), allowing a single card to be split into independent "slices". Each slice has its own compute cores, memory, and bandwidth isolation, enabling multiple lightweight models to share one physical GPU economically.
Choosing a Slice Size
Here is an example of the slices available on an H100 GPU:
| Slice | VRAM | Typical use-case |
|---|---|---|
| 1g.5gb | 5 GB | Small classical/embedding models, testing |
| 1g.10gb | 10 GB | Medium ASR / CV models |
| 2g.10gb | 10 GB (w/ double SMs) | Throughput-sensitive inference |
| 3g.20gb | 20 GB | 7-13 B parameter LLMs, TTS |
| 4g.20gb | 20 GB (more SMs) | High-token-rate LLMs |
| 7g.40gb* | 40 GB | Single-tenant large LLMs; uses the whole card |
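A practical rule of thumb is to pick the smallest slice whose VRAM comfortably fits your model plus runtime overhead. The helper below is a hypothetical sketch based on the table above; it is not part of the platform.

```python
# Hypothetical helper: choose the smallest H100 MIG slice that fits a model.
# Slice names and VRAM sizes mirror the table above; adjust for your GPU type
# and leave headroom for KV cache / activations when sizing LLMs.
SLICES = [
    ("1g.5gb", 5),
    ("1g.10gb", 10),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
    ("4g.20gb", 20),
    ("7g.40gb", 40),
]

def smallest_slice(required_vram_gb: float) -> str:
    for name, vram_gb in SLICES:
        if vram_gb >= required_vram_gb:
            return name
    raise ValueError("Model does not fit on a single slice; use one or more full GPUs")

print(smallest_slice(12))  # -> '3g.20gb'
```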
Step 4: Scaling Configuration
- Toggle "Enable Autoscaling" on or off
- For fixed scaling (autoscaling disabled):
- Set the number of replicas to maintain at all times
- For autoscaling (enabled):
- Target Metric: The metric used to trigger scaling (default: ml_model_concurrent_requests)
- Scale Threshold: The value of the target metric that triggers scaling
- Min Replicas: Minimum number of instances to maintain regardless of load
- Max Replicas: Maximum number of instances allowed during peak load
- Activation Threshold: The threshold that must be exceeded to trigger a scaling event
Autoscaling Example
Consider an LLM deployment with the following configuration:
- Target Metric: ml_model_concurrent_requests
- Scale Threshold: 5
- Min Replicas: 1
- Max Replicas: 10
- Activation Threshold: 6
In this scenario:
- The deployment starts with one replica
- When the number of concurrent requests exceeds 6 (the activation threshold), the platform triggers a scale-up
- New replicas are added until each instance handles approximately 5 concurrent requests
- During periods of low activity, replicas are gradually removed until reaching the minimum (1)
- The system maintains between 1 and 10 replicas depending on the load
This approach ensures efficient resource utilization while maintaining responsive service.
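The behaviour described above can be approximated with the common `desired = ceil(metric / threshold)` rule, clamped to the replica bounds. The sketch below is a simplification under that assumption; the platform's actual autoscaler may add cooldowns, smoothing, or other refinements.

```python
import math

# Values from the example configuration above.
SCALE_THRESHOLD = 5          # target concurrent requests per replica
MIN_REPLICAS, MAX_REPLICAS = 1, 10
ACTIVATION_THRESHOLD = 6     # load that must be exceeded before scaling up

def desired_replicas(concurrent_requests: int, current_replicas: int) -> int:
    # Below the activation threshold, an idle deployment stays at the minimum.
    if concurrent_requests <= ACTIVATION_THRESHOLD and current_replicas == MIN_REPLICAS:
        return MIN_REPLICAS
    desired = math.ceil(concurrent_requests / SCALE_THRESHOLD)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, desired))

print(desired_replicas(4, 1))   # 1  (below the activation threshold)
print(desired_replicas(23, 1))  # 5  (~5 concurrent requests per replica)
print(desired_replicas(80, 5))  # 10 (capped at Max Replicas)
```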
Step 5: Review and Deploy
The final step shows a summary of your deployment configuration:
- Review all settings
- Choose one of the deployment options:
- Save: Store the configuration without starting the deployment
- Save & Deploy: Create and immediately start the deployment
Managing Deployments
Once created, you can manage your deployments through the Deployments dashboard:
- Monitor the status of active deployments
- Start, stop, or delete deployments
- View performance metrics and logs
- Update deployment configurations
Best Practices
- Resource Optimization: Start with modest resources and scale based on actual performance
- Autoscaling: Configure appropriate thresholds to balance performance and cost
- Monitoring: Regularly review deployment metrics to identify optimization opportunities