User Interface
Deployments UI
Accessing Model Deployments
- Navigate to the Model Deployments section from the main platform menu
- The dashboard displays your existing deployments with their status
- Use the search functionality to find specific deployments
Creating a New Deployment
You can deploy either registered models from the Model Registry or custom Docker images.
Choosing a Deployment Source
- Click the "Create New Deployment" button
- Choose your deployment source:
  - Registered Model: Deploy a model from the Model Registry
  - Custom Docker Image: Deploy your own containerized model
Deploying a Registered Model
When deploying from the Model Registry, follow these steps:
Step 1: Model Selection
- Enter a unique deployment name
- Select a registered model from the dropdown
- Select a specific model version from the dropdown
- Choose the appropriate task type for your model:
  - Text Generation (LLMs)
  - Text-to-Image
  - Text-to-Speech
  - Automatic Speech Recognition (ASR)
  - Classical ML
  - And others
Note: For guidance on task types, refer to the Hugging Face task categories, which provide a comprehensive list of tasks and their purposes.
Step 2: Deployment Configuration
Select a model serving framework based on your requirements:
- TGI 2: Text Generation Inference v2
- TGI 3: Text Generation Inference v3
- vLLM: High-throughput and memory-efficient inference
- SGLang: Structured Generation Language
- Ray Serve: Scalable model serving framework
- OI_Serve: Our proprietary serving framework
- Text_embedding_inference: Specialized for embedding models
Note: Each framework has its own set of configuration options. Default configurations are provided for all frameworks, but you can customize them as needed.
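As an illustration, the sketch below shows the kind of engine options a vLLM-based deployment commonly exposes. The option names are standard vLLM server flags, but whether this platform surfaces them under these exact keys in its configuration form is an assumption.

```python
# Hypothetical framework overrides for a vLLM deployment.
# The option names are standard vLLM engine flags; the exact keys the
# platform accepts are an assumption -- check the configuration form.
vllm_overrides = {
    "max_model_len": 8192,           # longest prompt + completion the server accepts
    "gpu_memory_utilization": 0.90,  # fraction of GPU memory reserved for the engine
    "tensor_parallel_size": 2,       # shard the model across two GPUs
    "dtype": "bfloat16",             # weight and activation precision
}
```

Comparable knobs exist for the other frameworks (for example, TGI's --max-total-tokens); the defaults are a reasonable starting point unless you have specific latency or context-length requirements.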
Step 3: Resource Allocation
Configure the computing resources for each deployment instance:
- Compute Type:
  - Full GPUs
  - Fractional GPUs
  - CPUs
- Memory (RAM): Amount of memory allocated to each instance
- Storage: Disk space for model artifacts and runtime data
- Accelerator Type: GPU model (if applicable)
- Accelerator Count: Number of GPUs per instance
- CPU Count: Number of CPU cores per instance
Important: Resources specified here apply to a single deployment instance. Total consumption equals this per-instance allocation multiplied by the number of replicas configured in the next step.
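For example, a quick back-of-the-envelope check (values are illustrative):

```python
# Per-instance allocation from Step 3 (illustrative values).
gpus_per_instance = 1
ram_gb_per_instance = 32
storage_gb_per_instance = 100

replicas = 4  # configured in Step 4

# Total consumption is the per-instance allocation times the replica count.
print(gpus_per_instance * replicas)        # 4 GPUs
print(ram_gb_per_instance * replicas)      # 128 GB RAM
print(storage_gb_per_instance * replicas)  # 400 GB storage
```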
Step 4: Scaling Configuration
- Toggle "Enable Autoscaling" on or off
- For fixed scaling (autoscaling disabled):
  - Set the number of replicas to maintain at all times
- For autoscaling (enabled):
  - Target Metric: The metric used to drive scaling decisions (default: ml_model_concurrent_requests)
  - Scale Threshold: The target value of the metric per replica; the autoscaler adds or removes replicas to keep each instance near this value
  - Min Replicas: Minimum number of instances to maintain regardless of load
  - Max Replicas: Maximum number of instances allowed during peak load
  - Activation Threshold: The metric value that must be exceeded before the platform scales up from the minimum replica count
Autoscaling Example
Consider an LLM deployment with the following configuration:
- Target Metric: ml_model_concurrent_requests
- Scale Threshold: 5
- Min Replicas: 1
- Max Replicas: 10
- Activation Threshold: 6
In this scenario:
- The deployment starts with one replica
- When the number of concurrent requests exceeds 6 (the activation threshold), the platform triggers a scale-up
- New replicas are added until each instance handles approximately 5 concurrent requests (the scale threshold)
- During periods of low activity, replicas are gradually removed until reaching the minimum (1)
- The system maintains between 1 and 10 replicas depending on the load
This approach ensures efficient resource utilization while maintaining responsive service.
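A minimal sketch of this behavior, assuming the common rule that the desired replica count is the observed metric divided by the scale threshold, rounded up and clamped to the configured bounds (the platform's exact algorithm may differ):

```python
import math

def desired_replicas(concurrent_requests: float,
                     scale_threshold: float = 5,
                     activation_threshold: float = 6,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Approximate the autoscaler's target replica count (illustrative only)."""
    # Below the activation threshold, hold at the configured minimum.
    if concurrent_requests <= activation_threshold:
        return min_replicas
    # Aim for roughly `scale_threshold` concurrent requests per replica.
    target = math.ceil(concurrent_requests / scale_threshold)
    return max(min_replicas, min(max_replicas, target))

print(desired_replicas(3))   # 1  -- below activation, stays at the minimum
print(desired_replicas(23))  # 5  -- ceil(23 / 5)
print(desired_replicas(80))  # 10 -- capped at max_replicas
```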
Step 5: Review and Deploy
The final step shows a summary of your deployment configuration:
- Review all settings
- Choose one of the deployment options:
- Save: Store the configuration without starting the deployment
- Save & Deploy: Create and immediately start the deployment
Managing Deployments
Once created, you can manage your deployments through the Deployments dashboard:
- Monitor the status of active deployments
- Start, stop, or delete deployments
- View performance metrics and logs
- Update deployment configurations
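Once a text-generation deployment reports a running status, you can send it traffic. Most of the frameworks above (vLLM, SGLang, and recent TGI versions) expose an OpenAI-compatible completions endpoint; the URL and token below are placeholders, so copy the real values from the deployment's detail page.

```python
import requests

# Placeholder URL and token -- copy the real ones from the dashboard.
url = "https://platform.example.com/deployments/llama-chat-prod/v1/completions"
headers = {"Authorization": "Bearer <your-api-token>"}

resp = requests.post(
    url,
    headers=headers,
    json={"model": "my-llm", "prompt": "Hello!", "max_tokens": 32},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```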
Best Practices
- Resource Optimization: Start with modest resources and scale based on actual performance
- Autoscaling: Configure appropriate thresholds to balance performance and cost
- Monitoring: Regularly review deployment metrics to identify optimization opportunities