User Interface
Deployments UI
Accessing Model Deployments
- Navigate to the Model Deployments section from the main platform menu
- The dashboard displays your existing deployments with their status
- Use the search functionality to find specific deployments
Creating a New Deployment
You can deploy either registered models from the Model Registry or custom Docker images.
Step 1: Choose a Deployment Source
- Click the "Create New Deployment" button
- Choose your deployment source:
- Registered Model: Deploy a model from the Model Registry
- Custom Docker Image: Deploy your own containerized model
Deploying a Registered Model
When deploying from the Model Registry, follow these steps:
Step 1: Model Selection
- Enter a unique deployment name
- Select a registered model from the dropdown
- Select a specific model version from the dropdown
- Choose the appropriate task type for your model:
- Text Generation (LLMs)
- Text-to-Image
- Text-to-Speech
- Automatic Speech Recognition (ASR)
- Classical ML
- And others
Note: For guidance on task types, refer to the Hugging Face model categories, which provide a comprehensive list of tasks and their purposes.
Step 2: Deployment Configuration
Select a model serving framework based on your requirements:
- TGI 2: Text Generation Inference v2
- TGI 3: Text Generation Inference v3
- vLLM: High-throughput and memory-efficient inference
- SGLang: Structured Generation Language
- Ray Serve: Scalable model serving framework
- OI_Serve: Our proprietary serving framework
- Text_embedding_inference: Specialized for embedding models
Note: Each framework has its own set of configuration options. Default configurations are provided for all frameworks, but you can customize them as needed.
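For illustration, a set of framework overrides might look like the sketch below. The option names shown are borrowed from vLLM's engine arguments (gpu_memory_utilization, max_model_len, tensor_parallel_size); other frameworks expose different options, and the values shown are hypothetical, not platform defaults.

```python
# Illustrative only: option names below come from vLLM's engine arguments;
# TGI, SGLang, Ray Serve, and other frameworks use their own configuration keys.
framework_overrides = {
    "gpu_memory_utilization": 0.90,  # fraction of GPU memory the engine may claim
    "max_model_len": 8192,           # maximum context length to serve
    "tensor_parallel_size": 1,       # number of GPUs per model replica
}
```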
Step 3: Resource Allocation
Configure the computing resources for each deployment instance:
- Compute Type:
- Full GPUs
- Fractional GPUs
- CPUs
- Memory (RAM): Amount of memory allocated to each instance
- Storage: Disk space for model artifacts and runtime data
- Accelerator Type: GPU model (if applicable)
- Accelerator Count: Number of GPUs per instance
- CPU Count: Number of CPU cores per instance
Important: Resources specified here are for a single deployment instance. The total resources consumed will be multiplied by the number of replicas configured in the next step.
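As a quick sanity check, you can estimate the worst-case footprint by multiplying the per-instance allocation by the maximum replica count. The sketch below uses hypothetical figures, not platform defaults.

```python
# Hypothetical per-instance allocation; replace with your own values.
per_instance = {"gpus": 1, "cpu_cores": 8, "ram_gb": 32, "storage_gb": 100}
max_replicas = 4  # configured in the scaling step (Step 4)

# Worst-case consumption when the deployment is scaled out to max_replicas.
total = {resource: amount * max_replicas for resource, amount in per_instance.items()}
print(total)  # {'gpus': 4, 'cpu_cores': 32, 'ram_gb': 128, 'storage_gb': 400}
```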
GPU Fractioning Option
When the GPU fractioning option is selected, you can also choose an accelerator slice. The available slices vary depending on the GPU type.
What is GPU Fractioning?
Modern NVIDIA GPUs (A100, H100, H200, etc.) support MIG (Multi-Instance GPU), allowing a single card to be split into independent "slices". Each slice has its own compute cores, memory, and bandwidth isolation, enabling multiple lightweight models to share one physical GPU economically.
Choosing a Slice Size
Here is an example of the slices available on an H100 GPU:
| Slice | VRAM | Typical use-case |
|---|---|---|
| 1g.5gb | 5 GB | Small classical/embedding models, testing |
| 1g.10gb | 10 GB | Medium ASR / CV models |
| 2g.10gb | 10 GB (w/ double SMs) | Throughput-sensitive inference |
| 3g.20gb | 20 GB | 7-13 B parameter LLMs, TTS |
| 4g.20gb | 20 GB (more SMs) | High-token-rate LLMs |
| 7g.40gb* | 40 GB | Single-tenant large LLMs; uses the whole card |
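A practical rule of thumb is to pick the smallest slice whose VRAM comfortably fits your model plus runtime overhead. The helper below is a hypothetical sketch based on the table above; it is not part of the platform.

```python
# Hypothetical helper: choose the smallest H100 MIG slice that fits a model.
# Slice names and VRAM sizes mirror the table above; adjust for your GPU type
# and leave headroom for KV cache / activations when sizing LLMs.
SLICES = [
    ("1g.5gb", 5),
    ("1g.10gb", 10),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
    ("4g.20gb", 20),
    ("7g.40gb", 40),
]

def smallest_slice(required_vram_gb: float) -> str:
    for name, vram_gb in SLICES:
        if vram_gb >= required_vram_gb:
            return name
    raise ValueError("Model does not fit on a single slice; use one or more full GPUs")

print(smallest_slice(12))  # -> '3g.20gb'
```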
Step 4: Scaling Configuration
- Toggle "Enable Autoscaling" on or off
- For fixed scaling (autoscaling disabled):
- Set the number of replicas to maintain at all times
- For autoscaling (enabled):
- Target Metric: The metric used to trigger scaling (default: ml_model_concurrent_requests)
- Scale Threshold: The value of the target metric that triggers scaling
- Min Replicas: Minimum number of instances to maintain regardless of load
- Max Replicas: Maximum number of instances allowed during peak load
- Activation Threshold: The threshold that must be exceeded to trigger a scaling event
Autoscaling Example
Consider an LLM deployment with the following configuration:
- Target Metric: ml_model_concurrent_requests
- Scale Threshold: 5
- Min Replicas: 1
- Max Replicas: 10
- Activation Threshold: 6
In this scenario:
- The deployment starts with one replica
- When the number of concurrent requests exceeds 6 (the activation threshold), the platform triggers a scale-up
- New replicas are added until each instance handles approximately 5 concurrent requests
- During periods of low activity, replicas are gradually removed until reaching the minimum (1)
- The system maintains between 1 and 10 replicas depending on the load
This approach ensures efficient resource utilization while maintaining responsive service.
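The behaviour described above can be approximated with the common `desired = ceil(metric / threshold)` rule, clamped to the replica bounds. The sketch below is a simplification under that assumption; the platform's actual autoscaler may add cooldowns, smoothing, or other refinements.

```python
import math

# Values from the example configuration above.
SCALE_THRESHOLD = 5          # target concurrent requests per replica
MIN_REPLICAS, MAX_REPLICAS = 1, 10
ACTIVATION_THRESHOLD = 6     # load that must be exceeded before scaling up

def desired_replicas(concurrent_requests: int, current_replicas: int) -> int:
    # Below the activation threshold, an idle deployment stays at the minimum.
    if concurrent_requests <= ACTIVATION_THRESHOLD and current_replicas == MIN_REPLICAS:
        return MIN_REPLICAS
    desired = math.ceil(concurrent_requests / SCALE_THRESHOLD)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, desired))

print(desired_replicas(4, 1))   # 1  (below the activation threshold)
print(desired_replicas(23, 1))  # 5  (~5 concurrent requests per replica)
print(desired_replicas(80, 5))  # 10 (capped at Max Replicas)
```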
Step 5: Review and Deploy
The final step shows a summary of your deployment configuration:
- Review all settings
- Choose one of the deployment options:
- Save: Store the configuration without starting the deployment
- Save & Deploy: Create and immediately start the deployment
Managing Deployments
Once created, you can manage your deployments through the Deployments dashboard:
- Monitor the status of active deployments
- Start, stop, or delete deployments
- View performance metrics and logs
- Update deployment configurations
Best Practices
- Resource Optimization: Start with modest resources and scale based on actual performance
- Autoscaling: Configure appropriate thresholds to balance performance and cost
- Monitoring: Regularly review deployment metrics to identify optimization opportunities