Benchmarks

This page contains example configurations for knowledge benchmarks, ASR benchmarks, and custom tasks.

Example 1. Simple Knowledge benchmark

A knowledge benchmark of mathematics questions in Arabic.

Name: Benchmark Math
Tasks:
- ammlu_high_school_mathematics
- ammlu_college_mathematics

Model

Source: HF
Model: tiiuae/falcon-7b
Accelerator: L4
Storage: 64
Memory: 32

Example 2. HF Leaderboard benchmark

The knowledge benchmark used on the Hugging Face leaderboard.

Name: HF Leaderboard
Tasks:
- truthfulqa
- hellaswag, num fewshot: 10
- arc_challenge, num fewshot: 25
- winogrande, num fewshot: 5
- gsm8k, num fewshot: 5
- mmlu, num fewshot: 5

Model

Source: HF
Model: mistralai/Mistral-7B-Instruct-v0.2
Secrets Blueprint: HF Model Read
- Token with access to mistralai/Mistral-7B-Instruct-v0.2
Accelerator: A10G
Storage: 120
Memory: 64
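
The "num fewshot" values in the task list of Example 2 control how many solved examples are placed before each evaluated question. The sketch below illustrates that idea only. The dictionary layout, the `build_fewshot_prompt` helper, and the "Question/Answer" wording are hypothetical and are not the platform's API or prompt format; `None` marks a task for which this page lists no few-shot value, and treating it as zero-shot is an assumption.

```python
from typing import List, Optional, Tuple

# Task list from Example 2. The dict layout is an illustration, not the
# platform's schema. None means this page lists no few-shot value.
LEADERBOARD_TASKS = {
    "truthfulqa": None,
    "hellaswag": 10,
    "arc_challenge": 25,
    "winogrande": 5,
    "gsm8k": 5,
    "mmlu": 5,
}


def build_fewshot_prompt(
    solved: List[Tuple[str, str]],
    question: str,
    num_fewshot: Optional[int],
) -> str:
    """Prepend up to num_fewshot solved examples to the target question."""
    # Assumption: a missing value is treated as zero-shot.
    shots = solved[: num_fewshot or 0]
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The final block is left open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    solved = [("1+1?", "2"), ("2+2?", "4"), ("3+3?", "6")]
    print(build_fewshot_prompt(solved, "4+4?", 2))
```

A larger few-shot count lengthens every prompt, which may be one reason to pick an accelerator with more memory for this example than for Example 1.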

Example 3. Custom task

A custom task using the BoolQ dataset.

Name: BoolQ
Task Output Type: multiple_choice
Evaluation Dataframe: BoolQ
Prompt column: {passage}\n{question}
Answer column: {answer}
Use fixed choices: true
Possible Choices:
- true
- false
Metrics:
- exact_match
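
To make the custom task fields in Example 3 more concrete, here is a minimal sketch of how the prompt and answer templates could be filled from one dataframe row and scored with an exact-match comparison. It assumes the `{column}` placeholders behave like Python's `str.format` fields and that the comparison ignores case and surrounding whitespace; both are assumptions, and `render` and `exact_match` are hypothetical helpers rather than platform functions. The sample row is invented for illustration.

```python
# Field values copied from Example 3.
PROMPT_TEMPLATE = "{passage}\n{question}"
ANSWER_TEMPLATE = "{answer}"
POSSIBLE_CHOICES = ["true", "false"]


def render(template: str, row: dict) -> str:
    """Fill {column} placeholders with values from one dataframe row."""
    return template.format(**row)


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the strings match, else 0.0.

    Assumption: case and surrounding whitespace are ignored.
    """
    return float(prediction.strip().lower() == reference.strip().lower())


if __name__ == "__main__":
    # Invented row with the three columns the templates refer to.
    row = {
        "passage": "Water boils at 100 degrees Celsius at sea level.",
        "question": "does water boil at 100 degrees celsius at sea level",
        "answer": True,
    }
    print(render(PROMPT_TEMPLATE, row))
    reference = render(ANSWER_TEMPLATE, row)  # "True"
    for choice in POSSIBLE_CHOICES:
        print(choice, exact_match(choice, reference))
```

Note that a boolean answer column renders as `True` or `False`, so a case-sensitive comparison against the lowercase choices would never match; this is why the sketch lowercases both sides.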

Example 4. AR/EN ASR benchmark

An automatic speech recognition (ASR) benchmark covering Arabic and English.

Name: AR/EN benchmark
Dataset 1: fleurs - en
- Source: HF
- Model: google/fleurs
- Subset: en_us
- Split: test
- Column containing the audio: audio
- Column containing the transcription: transcription
- Use normalizer: true
- Language: en
Dataset 2: fleurs - ar
- Source: HF
- Model: google/fleurs
- Subset: ar_eg
- Split: test
- Column containing the audio: audio
- Column containing the transcription: transcription
- Use normalizer: true
- Language: ar

Model

Source: HF
Model: openai/whisper-small
Accelerator: L4
Storage: 32
Memory: 16
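
The "Use normalizer" setting in Example 4 matters because ASR output and reference transcriptions often differ only in casing or punctuation. The sketch below shows the effect with a deliberately simple normalizer and a word error rate (WER) computation. This page does not name the metric or the normalizer the platform uses, so both are assumptions; `simple_normalize` and `word_error_rate` are hypothetical helpers, and the normalizer only handles ASCII punctuation, so it would not strip Arabic punctuation marks.

```python
import re
import string


def simple_normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    reference = "Hello, world!"
    hypothesis = "hello world"
    print(word_error_rate(reference, hypothesis))
    print(word_error_rate(simple_normalize(reference),
                          simple_normalize(hypothesis)))
```

Without normalization every word in this pair counts as an error; after normalization the two strings are identical, so the score reflects recognition quality rather than formatting differences.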