User Interface
Deployments UI
Accessing Model Deployments
- Navigate to the Deployments section from the main platform menu
- The dashboard displays your existing deployments with their status
- Use the search functionality to find specific deployments

Creating a New Deployment
Models can be deployed using one of several deployment types:
- Pre-configured models from the Model Hub
- Registered model versions from the Model Registry
- Custom Docker images
- Models stored in a data volume
Selecting Deployment Type
Click the "Create New Deployment" button. Then, in Select Deployment Type, choose how you want to deploy your model:
- Deploy from Model Hub: Deploy a pre-configured model from the Model Hub, or customize its settings as needed.
- Deploy from Model Version: Deploy a registered model version from the Model Registry.
- Deploy from Docker Image: Deploy your own containerized model from your container registry.
- Deploy from Data Volume: Deploy a model stored on a data volume (for example, custom frameworks or artifacts not yet in the Model Registry).

Deploying a Registered Model
When deploying from the Model Registry, follow these steps:
Step 1: Model Selection
- Enter a unique deployment name
- Select a registered model from the dropdown
- Select a specific model version from the dropdown
Choose the appropriate task type for your model:
- Text Generation (LLMs)
- Text-to-Image
- Text-to-Speech
- Automatic Speech Recognition (ASR)
- Classical ML
- And others
Optionally, add LoRA adapters to your base model. LoRA adapters are available only when deploying from a model version. To make an adapter selectable here, the model must be marked as a LoRA adapter when its model version is created. Once the deployment is ready, deployed adapters can be selected and switched between in the inference functionality, as sketched below.
Note: For guidance on task types, refer to the Hugging Face model categories, which provide a comprehensive list of tasks and their purposes.
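As a rough illustration of what adapter switching can look like once the deployment is ready: with an OpenAI-compatible serving framework such as vLLM with LoRA serving enabled, a deployed adapter is typically selected by passing its name in the request's model field. The endpoint, token, and adapter names below are placeholders, and the exact mechanism depends on the serving framework chosen in Step 3.

```python
# Illustrative sketch only: switching between deployed LoRA adapters by changing the
# "model" field of an OpenAI-compatible request (the mechanism vLLM uses when LoRA
# serving is enabled). Endpoint, token, and adapter names are placeholders.
import requests

ENDPOINT = "https://<your-deployment-endpoint>/v1/completions"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

for adapter in ("base-model", "lora-adapter-a", "lora-adapter-b"):  # hypothetical names
    resp = requests.post(
        ENDPOINT,
        headers=HEADERS,
        json={"model": adapter, "prompt": "Hello!", "max_tokens": 32},
        timeout=60,
    )
    print(adapter, "->", resp.json()["choices"][0]["text"])
```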

Step 2: Resource Allocation
Configure the computing resources for each deployment instance:
- Compute Type:
- Full GPUs
- Fractional GPUs
- CPUs
- Memory (RAM): Amount of memory allocated to each instance
- Storage: Disk space for model artifacts and runtime data
- Accelerator Type: GPU model (if applicable)
- Accelerator Count: Number of GPUs per instance
- CPU Count: Number of CPU cores per instance
Important: Resources specified here are for a single deployment instance. The total resources consumed will be multiplied by the number of replicas configured in the next step.
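To make the note above concrete, here is a minimal sketch of the arithmetic: total consumption is the per-instance allocation multiplied by the replica count. The figures are illustrative, not recommendations.

```python
# Minimal sketch: total resources = per-instance allocation x number of replicas.
# The example values are illustrative only.
per_instance = {"gpus": 1, "cpu_cores": 8, "ram_gb": 32, "storage_gb": 100}
replicas = 4  # e.g. the maximum replicas configured in the scaling step

total = {resource: amount * replicas for resource, amount in per_instance.items()}
print(total)  # {'gpus': 4, 'cpu_cores': 32, 'ram_gb': 128, 'storage_gb': 400}
```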

GPU Fractioning Option
When the GPU fractioning option is selected, you can choose an accelerator slice. The available slices depend on the GPU type.
What is GPU Fractioning?
Modern NVIDIA GPUs (A100, H100, H200, etc.) support MIG (Multi-Instance GPU), which allows a single card to be split into independent "slices". Each slice has its own compute cores, memory, and bandwidth isolation, enabling multiple lightweight models to share one physical GPU economically.
Choosing a slice size
Here is an example of the slices available on an H100 GPU.

| Slice | VRAM | Typical use case |
|---|---|---|
| 1g.5gb | 5 GB | Small classical/embedding models, testing |
| 1g.10gb | 10 GB | Medium ASR / CV models |
| 2g.10gb | 10 GB (with double the SMs) | Throughput-sensitive inference |
| 3g.20gb | 20 GB | 7-13 B parameter LLMs, TTS |
| 4g.20gb | 20 GB (more SMs) | High-token-rate LLMs |
| 7g.40gb | 40 GB | Single-tenant large LLMs; uses the whole card |
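As a rough illustration of how to reason about slice sizes, the sketch below picks the smallest slice from the table above that fits a model's estimated VRAM footprint. The slice data mirrors the table; the footprint estimate (weights plus KV cache and overhead) is an assumption you provide.

```python
# Illustrative helper: pick the smallest MIG slice (from the table above) whose VRAM
# fits the model's estimated memory footprint. The footprint estimate is up to you.
SLICES = [  # (profile, vram_gb), ordered from smallest to largest
    ("1g.5gb", 5), ("1g.10gb", 10), ("2g.10gb", 10),
    ("3g.20gb", 20), ("4g.20gb", 20), ("7g.40gb", 40),
]

def pick_slice(required_vram_gb: float) -> str:
    for profile, vram in SLICES:
        if vram >= required_vram_gb:
            return profile
    raise ValueError("Model does not fit in a single slice; use a full GPU instead.")

# Example: a 7B model in fp16 is roughly 14 GB of weights, plus KV cache and overhead.
print(pick_slice(18))  # -> 3g.20gb
```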
Scaling Configuration
- Toggle "Enable Autoscaling" on or off
- For manual scaling (autoscaling disabled):
- Set the number of replicas to maintain at all times
- For autoscaling (enabled):
- Target Metric: The metric used to trigger scaling (default: ml_model_concurrent_requests)
- Scale Threshold: The value of the target metric that triggers scaling
- Min Replicas: Minimum number of instances to maintain regardless of load
- Max Replicas: Maximum number of instances allowed during peak load
- Activation Threshold: The threshold that must be exceeded to trigger a scaling event

Autoscaling Example
Consider an LLM deployment with the following configuration:
- Target Metric: ml_model_concurrent_requests
- Scale Threshold: 5
- Min Replicas: 1
- Max Replicas: 10
- Activation Threshold: 6
In this scenario:
- The deployment starts with one replica
- When the number of concurrent requests exceeds 6, the platform triggers the scale-up
- New replicas are added until each instance handles approximately 5 concurrent requests
- During periods of low activity, replicas are gradually removed until reaching the minimum (1)
- The system maintains between 1 and 10 replicas depending on the load
This approach ensures efficient resource utilization while maintaining responsive service.
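The sketch below mirrors the scaling behaviour described above; it is an illustration of the idea, not the platform's actual autoscaler. Once the metric exceeds the activation threshold, the desired replica count is roughly the metric divided by the scale threshold, clamped between the minimum and maximum.

```python
# Illustrative sketch of the scaling behaviour described above (not the platform's
# actual autoscaler): scale once the metric exceeds the activation threshold, aiming
# for about scale_threshold concurrent requests per replica, clamped to [min, max].
import math

def desired_replicas(concurrent_requests: float,
                     scale_threshold: float = 5,
                     activation_threshold: float = 6,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    if concurrent_requests <= activation_threshold:
        return min_replicas  # low load: fall back toward the minimum
    target = math.ceil(concurrent_requests / scale_threshold)
    return max(min_replicas, min(max_replicas, target))

print(desired_replicas(3))   # 1  (below the activation threshold)
print(desired_replicas(23))  # 5  (about 5 concurrent requests per replica)
print(desired_replicas(80))  # 10 (capped at max replicas)
```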
Step 3: Deployment Configuration
Select a model serving framework based on your requirements:
- TGI 2: Text Generation Inference v2
- TGI 3: Text Generation Inference v3
- vLLM: High-throughput and memory-efficient inference
- SGLang: Structured Generation Language
- Ray Serve: Scalable model serving framework
- OI_Serve: Our proprietary serving framework
- Text_embedding_inference: Specialized for embedding models
Note: Each framework has its own set of configuration options. Default configurations are provided for all frameworks, but you can customize them as needed.
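The client-side API also depends on the framework you pick. As a hedged example, TGI exposes a native /generate endpoint, while vLLM (and TGI's OpenAI-compatible Messages API) serves /v1/chat/completions; the endpoint URL, token, and model name below are placeholders for your deployment's actual values.

```python
# Hedged sketch: how the client-side request shape can differ by serving framework.
# The endpoint URL, token, and model name are placeholders, not platform-defined values.
import requests

BASE = "https://<your-deployment-endpoint>"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

# TGI-style native endpoint
tgi = requests.post(
    f"{BASE}/generate",
    headers=HEADERS,
    json={"inputs": "Hello!", "parameters": {"max_new_tokens": 32}},
    timeout=60,
)
print(tgi.json()["generated_text"])

# OpenAI-compatible endpoint (exposed by vLLM, and by TGI's Messages API)
oai = requests.post(
    f"{BASE}/v1/chat/completions",
    headers=HEADERS,
    json={"model": "my-deployment",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 32},
    timeout=60,
)
print(oai.json()["choices"][0]["message"]["content"])
```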
Advanced Server Settings
To set advanced server settings, use Quick Settings for the most common options. For Model Server Arguments not offered in Quick Settings, switch to Developer Mode.
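For example, if vLLM were the framework selected in Step 3, Developer Mode arguments might look like the following. These flags are standard vLLM server options; which arguments are accepted depends on the chosen framework.

```python
# Illustrative only: arguments one might enter in Developer Mode if vLLM is the
# selected framework. Other frameworks accept different Model Server Arguments.
extra_server_args = [
    "--max-model-len", "8192",           # cap the context length to fit in VRAM
    "--gpu-memory-utilization", "0.90",  # fraction of GPU memory the server may use
    "--tensor-parallel-size", "2",       # shard the model across two accelerators
]
```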

Step 4: Review and Deploy
The final step shows a summary of your deployment configuration:
- Review all settings
- Choose one of the deployment options:
- Save: Store the configuration without starting the deployment
- Save & Deploy: Create and immediately start the deployment

Managing Deployments
Once created, you can manage your deployments through the Deployments dashboard:
- Monitor the status of active deployments
- Start, stop, or delete deployments
- View performance metrics and logs
- Update deployment configurations

Best Practices
- Resource Optimization: Start with modest resources and scale based on actual performance
- Autoscaling: Configure appropriate thresholds to balance performance and cost
- Monitoring: Regularly review deployment metrics to identify optimization opportunities