Deployments UI

Accessing Model Deployments

  1. Navigate to the Deployments section from the main platform menu
  2. The dashboard displays your existing deployments with their status
  3. Use the search functionality to find specific deployments

Model Deployments Dashboard

Creating a New Deployment

Models can be deployed using one of several deployment types:

  • Pre-configured models from the Model Hub
  • Registered model versions from the Model Registry
  • Custom Docker images
  • Models stored in a data volume

Selecting Deployment Type

  1. Click the "Create New Deployment" button
  2. In Select Deployment Type, choose how you want to deploy your model:

    • Deploy from Model Hub: Deploy a pre-configured model from the Model Hub, or customize its settings as needed.
    • Deploy from Model Version: Deploy a registered model version from the Model Registry.
    • Deploy from Docker Image: Deploy your own containerized model from your container registry.
    • Deploy from Data Volume: Deploy a model stored on a data volume (for example, custom frameworks or artifacts not yet in the Model Registry).

Deployment Source Selection

Deploying a Registered Model

When deploying from the Model Registry, follow these steps:

Step 1: Model Selection

  1. Enter a unique deployment name
  2. Select a registered model from the dropdown
  3. Select a specific model version from the dropdown
  4. Choose the appropriate task type for your model:

    • Text Generation (LLMs)
    • Text-to-Image
    • Text-to-Speech
    • Automatic Speech Recognition (ASR)
    • Classical ML
    • And others
  5. Optionally, add LoRA adapters to your base model. LoRA adapters are available only when deploying from a model version, and a model appears here only if it was marked as a LoRA adapter when its model version was created. Once the deployment is ready, you can select and switch between the deployed adapters through the inference functionality (see the sketch following the note below).

Note: For guidance on task types, refer to the Hugging Face model categories, which provide a comprehensive list of tasks and their purposes.
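
The adapter used for a given request is typically selected by name at inference time. As a minimal, hypothetical sketch, assuming the deployment exposes an OpenAI-compatible endpoint (as many LLM serving frameworks do), and with the endpoint URL, model name, and adapter name all placeholders rather than actual platform values, switching between the base model and an adapter could look like this:

    # Hypothetical sketch: choosing a deployed LoRA adapter per request.
    # The endpoint URL, model name, and adapter name are placeholders.
    import requests

    ENDPOINT = "https://platform.example.com/deployments/my-llm/v1/chat/completions"

    def chat(model_name: str, prompt: str) -> str:
        """Send one chat request; model_name selects the base model or an adapter."""
        response = requests.post(
            ENDPOINT,
            json={
                "model": model_name,  # e.g. "base-model" or "my-lora-adapter"
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    print(chat("base-model", "Hello!"))       # served by the base model
    print(chat("my-lora-adapter", "Hello!"))  # served with the LoRA adapter applied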

Model Selection

Step 2: Resource Allocation

Configure the computing resources for each deployment instance:

  • Compute Type:
    • Full GPUs
    • Fractional GPUs
    • CPUs
  • Memory (RAM): Amount of memory allocated to each instance
  • Storage: Disk space for model artifacts and runtime data
  • Accelerator Type: GPU model (if applicable)
  • Accelerator Count: Number of GPUs per instance
  • CPU Count: Number of CPU cores per instance

Important: Resources specified here are for a single deployment instance. The total resources consumed will be multiplied by the number of replicas configured in the next step.
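
For example, an instance size of 1 GPU, 8 CPU cores, 32 GB RAM, and 100 GB storage scaled to 4 replicas consumes 4 GPUs, 32 cores, 128 GB RAM, and 400 GB storage in total. A minimal sketch of that arithmetic (the class and field names are illustrative, not a platform API):

    # Illustrative only: total consumption = per-instance resources x replicas.
    from dataclasses import dataclass

    @dataclass
    class InstanceResources:
        gpus: int
        cpu_cores: int
        ram_gb: int
        storage_gb: int

    def total_resources(per_instance: InstanceResources, replicas: int) -> InstanceResources:
        """Resources the whole deployment consumes across all replicas."""
        return InstanceResources(
            gpus=per_instance.gpus * replicas,
            cpu_cores=per_instance.cpu_cores * replicas,
            ram_gb=per_instance.ram_gb * replicas,
            storage_gb=per_instance.storage_gb * replicas,
        )

    print(total_resources(InstanceResources(gpus=1, cpu_cores=8, ram_gb=32, storage_gb=100), replicas=4))
    # InstanceResources(gpus=4, cpu_cores=32, ram_gb=128, storage_gb=400)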

Resource Allocation

GPU Fractioning option

When the GPU fractioning option is selected, you can choose an accelerator slice. The available slices vary depending on the GPU type.

What is GPU Fractioning?
Modern NVIDIA GPUs (A100, H100, H200, etc.) support MIG (Multi-Instance GPU), which allows a single card to be split into independent "slices". Each slice has its own compute cores, memory, and bandwidth isolation, enabling multiple lightweight models to share one physical GPU economically.

Choosing a slice size

Here is an example of the available slices on an H100 GPU.

Fractioned GPU UI

Slice     VRAM                 Typical use case
1g.5gb    5 GB                 Small classical/embedding models, testing
1g.10gb   10 GB                Medium ASR / CV models
2g.10gb   10 GB (double SMs)   Throughput-sensitive inference
3g.20gb   20 GB                7-13 B parameter LLMs, TTS
4g.20gb   20 GB (more SMs)     High-token-rate LLMs
7g.40gb   40 GB                Single-tenant large LLMs; uses the whole card
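
MIG profile names follow the pattern <compute>g.<memory>gb: the first number is how many of the card's seven compute slices the instance receives, and the second is its VRAM. The platform decodes this for you, but a small sketch makes the naming concrete:

    # Decode a MIG profile name such as "3g.20gb" into its two components.
    # A100/H100-class cards expose 7 compute slices in total, so "3g" means
    # 3 of the card's 7 compute slices.
    import re

    def parse_mig_profile(profile: str) -> tuple[int, int]:
        """Return (compute_slices, vram_gb) for a profile name like '3g.20gb'."""
        match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
        if match is None:
            raise ValueError(f"Not a MIG profile name: {profile!r}")
        return int(match.group(1)), int(match.group(2))

    for profile in ["1g.5gb", "2g.10gb", "3g.20gb", "7g.40gb"]:
        slices, vram = parse_mig_profile(profile)
        print(f"{profile}: {slices}/7 of the card's compute, {vram} GB VRAM")
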
Scaling Configuration
  1. Toggle "Enable Autoscaling" on or off
  2. For manual scaling (autoscaling disabled):
    • Set the number of replicas to maintain at all times
  3. For autoscaling (enabled):
    • Target Metric: The metric used to trigger scaling (default: ml_model_concurrent_requests)
    • Scale Threshold: The target value of the metric per replica; the autoscaler adds replicas until each instance handles roughly this value
    • Min Replicas: Minimum number of instances to maintain regardless of load
    • Max Replicas: Maximum number of instances allowed during peak load
    • Activation Threshold: The metric value that must be exceeded before a scale-up from the minimum is triggered (see the worked example below)

Scaling Configuration

Autoscaling Example

Consider an LLM deployment with the following configuration:

- Target Metric: ml_model_concurrent_requests
- Scale Threshold: 5
- Min Replicas: 1
- Max Replicas: 10
- Activation Threshold: 6

In this scenario:

  1. The deployment starts with one replica
  2. When the number of concurrent requests exceeds 6 (the activation threshold), the platform triggers a scale-up
  3. New replicas are added until each instance handles approximately 5 concurrent requests
  4. During periods of low activity, replicas are gradually removed until reaching the minimum (1)
  5. The system maintains between 1 and 10 replicas depending on the load

This approach ensures efficient resource utilization while maintaining responsive service.
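
The behaviour above can be approximated with the standard "desired replicas" formula used by Kubernetes-style autoscalers: ceil(metric / scale threshold), clamped to the configured bounds, with the activation threshold gating the initial scale-up. A simplified sketch, not the platform's exact algorithm:

    # Simplified autoscaling model for the example configuration above.
    import math

    SCALE_THRESHOLD = 5        # target concurrent requests per replica
    ACTIVATION_THRESHOLD = 6   # must be exceeded before scaling kicks in
    MIN_REPLICAS, MAX_REPLICAS = 1, 10

    def desired_replicas(concurrent_requests: int) -> int:
        """ceil(metric / threshold), clamped to [MIN_REPLICAS, MAX_REPLICAS]."""
        if concurrent_requests <= ACTIVATION_THRESHOLD:
            return MIN_REPLICAS  # below activation: stay at the minimum
        desired = math.ceil(concurrent_requests / SCALE_THRESHOLD)
        return max(MIN_REPLICAS, min(MAX_REPLICAS, desired))

    for load in [3, 6, 7, 23, 80]:
        print(f"{load:>3} concurrent requests -> {desired_replicas(load)} replica(s)")
    # 3 -> 1, 6 -> 1, 7 -> 2, 23 -> 5, 80 -> 10 (capped at Max Replicas)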

Step 3: Deployment Configuration

Select a model serving framework based on your requirements.

Note: Each framework has its own set of configuration options. Default configurations are provided for all frameworks, but you can customize them as needed.

Advanced Server Settings

For advanced server settings, use Quick Settings for the most common options. For Model Server Arguments not offered in Quick Settings, switch to Developer Mode (see the example below).
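
As an illustration, if the selected framework were vLLM, Developer Mode arguments could include standard vLLM server flags such as the following (values are examples only; consult your framework's documentation for the full list):

    --max-model-len 8192          # longest sequence the server will accept
    --gpu-memory-utilization 0.9  # fraction of GPU memory the server may use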

Deployment Configuration

Step 4: Review and Deploy

The final step shows a summary of your deployment configuration:

  1. Review all settings
  2. Choose one of the deployment options:
    • Save: Store the configuration without starting the deployment
    • Save & Deploy: Create and immediately start the deployment

Deployment Review

Managing Deployments

Once created, you can manage your deployments through the Deployments dashboard:

  • Monitor the status of active deployments
  • Start, stop, or delete deployments
  • View performance metrics and logs
  • Update deployment configurations

Deployment Management

Best Practices

  • Resource Optimization: Start with modest resources and scale based on actual performance
  • Autoscaling: Configure appropriate thresholds to balance performance and cost
  • Monitoring: Regularly review deployment metrics to identify optimization opportunities