
Benchmarks UI

When you select Benchmarks from the sidebar, you’ll see an overview of benchmark packages, which are simply collections of individual benchmarks.

(Screenshot: Benchmarks home page)


1. Creating a New Benchmark Package

  1. Click Create new package.
  2. Name your package.
  3. Specify the tasks you want to evaluate.

(Screenshot: Create new benchmark package)


2. Benchmark Runs

After choosing a package, you’ll see a list of benchmark runs for that package.

(Screenshot: Benchmark runs list)

2.1 Create a New Run

  1. Click Create new run.
  2. Select the model to evaluate and the resources to allocate.
  3. Confirm to start the benchmark.

(Screenshot: Create benchmark run)


3. Compare Runs

In the Compare runs tab, pick multiple runs to analyze side by side.

(Screenshot: Compare runs)


4. Benchmark Run Overview

Selecting a specific run opens a detailed view with:

  • Execution details
  • Logs
  • Metrics
  • Task configurations

(Screenshot: Run overview)


5. Benchmark Run Settings

Navigate to the Settings tab of a run to delete the run or modify its configuration.

(Screenshot: Run settings)


6. Prompt Files

Under Benchmarks > Prompt files, you’ll find a list of existing prompt files used in benchmarking.

(Screenshot: Prompt files)

6.1 Upload Prompt Files

  • Click Upload files
  • Select the file(s) needed for your benchmark tasks

(Screenshot: Upload prompt files)


7. Custom Tasks

The Custom tasks tab lets you define user-created benchmark tasks. They can be combined with existing platform tasks in a single package.

(Screenshot: Custom tasks main screen)

7.1 Creating a User-Defined Task

  1. Click Create new task.
  2. Complete the form fields:
    • General – Task name, dataset, metrics, etc.
      • Task name: the name of the task. Must be unique.
      • Description: a description of the task.
      • Dataset: the dataset file to use for the task. CSV and JSON files are supported.
      • Metrics: the metrics used to evaluate the task.
      • Task Output Type: how the model is expected to produce its output.
        • generate_until: generate text until the EOS token is reached.
        • loglikelihood: return the log-likelihood of generating a given piece of text from a given input.
        • loglikelihood_rolling: return the log-likelihood of generating a piece of text on its own, without a conditioning input.
        • multiple_choice: choose one of the provided options.
    • Prompting – Define how prompts and answers are formatted.
      • Prompt Column: the prompt to feed to the LLM. Can be either a column in the dataset or a template.
      • Answer Column: the expected answer. Can be either a column in the dataset or a template.
      • Possible Choices: the candidate answers when using the multiple_choice task output type.
      • Fixed Choices: whether the same set of choices (Possible Choices) is used for every prompt.
    • Few-Shots – Include example-based configurations.
      • Number of few-shots: the number of few-shot examples to prepend to the prompt.
      • Few-shots description: a string prepended to the few-shot examples. Can be either a fixed string or a template.
      • Few-shots delimiter: the string inserted between few-shot examples. Defaults to a blank line ("\n\n").
    • Advanced – Repeat runs, delimiters, etc.
      • Repeat Runs: the number of times each sample is fed to the LLM.
      • Target delimiter: the string inserted between a prompt and its answer. Defaults to a single space (" "). The sketch below shows how the few-shot and delimiter settings combine into a final prompt.
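
How the few-shot settings and delimiters fit together is easiest to see in code. The following is a minimal sketch under the assumptions above, not the platform's implementation; build_prompt and every name in it are hypothetical.

    # Hypothetical sketch of prompt assembly from the settings above.
    # Not the platform's actual code.
    def build_prompt(question, examples, description="",
                     fewshot_delimiter="\n\n", target_delimiter=" "):
        # Each few-shot example is a completed prompt/answer pair,
        # joined by the target delimiter ("Q? A").
        shots = [q + target_delimiter + a for q, a in examples]
        # The few-shots delimiter separates the description, the
        # examples, and the prompt to be answered.
        parts = ([description] if description else []) + shots + [question]
        return fewshot_delimiter.join(parts)

    print(build_prompt(
        "What is the result of three times two?",
        examples=[("What is two plus two?", "Four")],
        description="Answer the arithmetic question.",
    ))
    # Answer the arithmetic question.
    #
    # What is two plus two? Four
    #
    # What is the result of three times two?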

7.2 Prompt Examples

  • Basic Prompt
    If your dataset has columns question and answer, set Prompt Column to question and Answer Column to answer.

    question,answer
    What is the result of three times two?,Six
    

  • Template Prompt
    Use placeholders like {{passage}} or {{question}} in the Prompt Column field; each placeholder is filled in with the value of the matching dataset column.

    passage,question,answer
    In this document we describe a recipe to make bread,What is the text about?,Bread
    

  • Multiple Choice
    Provide a list of columns for possible choices, and mark the correct answer column.

    question,distractor1,distractor2,correct
    Compounds that are capable of accepting electrons are called what?,residues,oxygen,oxidants
    

For this multiple-choice example, you can then configure the task with:

  • Prompt Column: question
  • Possible Choices:
    - distractor1
    - distractor2
    - correct
  • Answer Column: correct
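
To make the multiple_choice output type concrete: a common way to score such a task is to compute the model's log-likelihood for each candidate answer appended to the prompt, pick the highest-scoring choice, and compare it with the Answer Column. The sketch below assumes that behavior; evaluate_multiple_choice and score_loglikelihood are hypothetical names, and the toy scorer exists only to make the example runnable.

    # Hypothetical sketch of multiple-choice scoring. score_loglikelihood
    # stands in for the actual model call.
    def evaluate_multiple_choice(row, prompt_col, choice_cols, answer_col,
                                 score_loglikelihood, target_delimiter=" "):
        prompt = row[prompt_col]
        # Score each candidate continuation given the prompt.
        scores = {
            col: score_loglikelihood(prompt + target_delimiter + row[col])
            for col in choice_cols
        }
        best = max(scores, key=scores.get)
        # The prediction is correct when the best-scoring column holds
        # the expected answer.
        return row[best] == row[answer_col]

    row = {
        "question": "Compounds that are capable of accepting electrons are called what?",
        "distractor1": "residues",
        "distractor2": "oxygen",
        "correct": "oxidants",
    }
    # Toy scorer that happens to prefer the right answer, for demonstration.
    toy_scorer = lambda text: 0.0 if text.endswith("oxidants") else -1.0
    print(evaluate_multiple_choice(
        row, "question", ["distractor1", "distractor2", "correct"],
        "correct", toy_scorer,
    ))  # True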

(Screenshot: Custom task creation)


Next Steps