Benchmarks UI
When you select Benchmarks from the sidebar, you’ll see an overview of benchmark packages, which are simply collections of individual benchmarks.

1. Creating a New Benchmark Package
- Click Create new package.
- Name your package.
- Specify the tasks you want to evaluate.

2. Benchmark Runs
After choosing a package, you’ll see a list of benchmark runs for that package.

2.1 Create a New Run
- Click Create new run.
- Select the model and resource allocations needed.
- Confirm to start the benchmark.


3. Compare Runs
In the Compare runs tab, pick multiple runs to analyze side by side.

4. Benchmark Run Overview
Selecting a specific run opens a detailed view with:
- Execution details
- Logs
- Metrics
- Task configurations


5. Benchmark Run Settings
Navigate to the Settings tab of a run to delete it or modify related configurations.

6. Prompt Files
Under Benchmarks > Prompt files, you’ll find a list of existing prompt files used in benchmarking.

6.1 Upload Prompt Files
- Click Upload files.
- Select the file(s) needed for your benchmark tasks.

7. Custom Tasks
The Custom tasks tab lets you define your own benchmark tasks, which can be combined with existing platform tasks in a single package.

7.1 Creating a User-Defined Task
- Click Create new task.
- Complete the form fields:
  - General – Task name, dataset, metrics, etc.
    - Task name: the name of the task. Must be unique.
    - Description: the task description.
    - Dataset: the dataset file to use for the task. Supports CSV and JSON (see the dataset sketch after this list).
    - Metrics: list of metrics used to evaluate the task.
    - Task Output Type: the type of the task.
      - generate_until: generate text until the EOS token is reached.
      - loglikelihood: return the log-likelihood of generating a piece of text given a certain input.
      - loglikelihood_rolling: return the log-likelihood of generating a piece of text.
      - multiple_choice: choose one of the provided options.
  - Prompting – Define how prompts and answers are formatted (see the prompt-assembly sketch after this list).
    - Prompt Column: the prompt to feed to the LLM. Can be either a column in the dataset or a template.
    - Answer Column: the expected answer. Can be either a column in the dataset or a template.
    - Possible Choices: the possible choices when using the multiple_choice task output type.
    - Fixed Choices: specifies whether the same set of choices (Possible Choices) is used for every prompt.
  - Few-Shots – Include example-based configurations.
    - Number of few-shots: number of few-shot examples to add to the prompt.
    - Few-shots description: a string prepended to the few-shots. Can be either a fixed string or a template.
    - Few-shots delimiter: string inserted between the few-shots. Default is a blank line "\n\n".
  - Advanced – Repeat runs, delimiters, etc.
    - Repeat Runs: number of times each sample is fed to the LLM.
    - Target delimiter: string added between the question and answer prompt. Default is a single whitespace " ".
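
The exact dataset schema is up to you, as long as the column names match what you enter in the task form. As a minimal sketch (the file name and the question/answer columns below are placeholders, not requirements of the platform), this snippet writes a small JSON dataset that a generate_until task could use:

```python
import json

# Hypothetical example rows; any column names work as long as they match
# the Prompt Column / Answer Column settings in the task form.
rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many legs does a spider have?", "answer": "8"},
]

# Save as a JSON file that can then be selected in the Dataset field.
with open("qa_dataset.json", "w") as f:
    json.dump(rows, f, indent=2)
```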
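To make the Prompting, Few-Shots, and Advanced fields concrete, here is a rough sketch of how these settings typically combine into the final prompt. It assumes the conventional few-shot layout; the platform's actual assembly logic may differ, and every value below is a placeholder:

```python
# Illustrative prompt assembly: few-shots description, then few-shot
# examples joined by the few-shots delimiter, then the target question.
few_shots_description = "Answer the following questions."
few_shots_delimiter = "\n\n"   # default: a blank line
target_delimiter = " "         # default: a single whitespace between question and answer

few_shot_examples = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What color is the sky?", "answer": "Blue"},
]
target_question = "What is the capital of France?"

# Each few-shot example shows a question and its answer, separated by the target delimiter.
shots = few_shots_delimiter.join(
    f"{ex['question']}{target_delimiter}{ex['answer']}" for ex in few_shot_examples
)

# The target question is appended without an answer; the model is expected to complete it.
prompt = few_shots_description + few_shots_delimiter + shots + few_shots_delimiter + target_question
print(prompt)
```

With Repeat Runs set to more than 1, this assembled prompt would simply be sent to the model that many times per sample.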
Prompt Examples
- Basic Prompt
  If your dataset has columns question and answer, set Prompt Column to question and Answer Column to answer.
- Template Prompt
  Use placeholders like {{passage}} or {{question}} in your prompt field.
- Multiple Choice
  Provide a list of columns for possible choices, and mark the correct answer column. For example, if the dataset has the columns question, distractor1, distractor2, and correct, we can configure the task with:
  - Prompt Column: question
  - Possible Choices:
    - distractor1
    - distractor2
    - correct
  - Answer Column: correct
  A sample dataset for this configuration is sketched below.
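
As a rough illustration of the multiple-choice example above (the file name and rows are made up), this snippet writes a small CSV with one correct answer and two distractors per question:

```python
import csv

# Hypothetical multiple-choice rows matching the configuration above.
rows = [
    {
        "question": "Which planet is closest to the Sun?",
        "distractor1": "Venus",
        "distractor2": "Earth",
        "correct": "Mercury",
    },
    {
        "question": "What is H2O commonly called?",
        "distractor1": "Salt",
        "distractor2": "Oxygen",
        "correct": "Water",
    },
]

# Save as a CSV file that can then be selected in the Dataset field.
with open("multiple_choice.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "distractor1", "distractor2", "correct"])
    writer.writeheader()
    writer.writerows(rows)
```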

Next Steps
- Knowledge Benchmarking Overview – Learn about the broader theory and features.
- LLM Fine-Tuning UI – Adapt your models before benchmarking them.
- Inference UI – See how models perform in real-world testing scenarios.