
Benchmark Examples

This page showcases example configurations for knowledge benchmarks, ASR benchmarks, and custom tasks, each followed by a short illustrative sketch of an equivalent run.


1. Simple Knowledge Benchmark

1.1 Setup

  • Name: Benchmark Math
  • Tasks:
    • ammlu_high_school_mathematics
    • ammlu_college_mathematics

1.2 Model

  • Source: HF
  • Model: tiiuae/falcon-7b
  • Accelerator: L4
  • Storage: 64
  • Memory: 32
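
An equivalent run outside the UI could look roughly like the sketch below. It assumes the open-source lm-evaluation-harness (which defines the ammlu_* task names) and its Python API; the platform's actual runner may differ.

```python
# Sketch only: assumes lm-evaluation-harness (`pip install lm-eval`) and
# hardware comparable to the L4 profile above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                # Hugging Face backend
    model_args="pretrained=tiiuae/falcon-7b",  # the configured model
    tasks=[
        "ammlu_high_school_mathematics",
        "ammlu_college_mathematics",
    ],
    batch_size=8,
)
print(results["results"])  # per-task scores
```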

2. Hugging Face Leaderboard Benchmark

2.1 Setup

  • Name: HF Leaderboard
  • Tasks:
    • truthfulqa
    • hellaswag, 10-shot
    • arc_challenge, 25-shot
    • winogrande, 5-shot
    • gsm8k, 5-shot
    • mmlu, 5-shot

2.2 Model

  • Source: HF
  • Model: mistralai/Mistral-7B-Instruct-v0.2
  • Secrets Blueprint: HF Model Read (access token required)
  • Accelerator: A10G
  • Storage: 120
  • Memory: 64
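
Because the leaderboard tasks use different few-shot counts, a sketch of the same evaluation loops over (task, shots) pairs. The lm-evaluation-harness API and task names below are assumptions, not the platform's internals; the HF read token referenced by the Secrets Blueprint can be supplied through the standard HF_TOKEN environment variable.

```python
import os
import lm_eval

# Assumption: the "HF Model Read" secret maps to a Hugging Face read token.
os.environ.setdefault("HF_TOKEN", "<hf-read-token>")

# (task, num_fewshot) pairs mirroring the configuration above.
LEADERBOARD = [
    ("truthfulqa", None),   # no shot count given; use the task default
    ("hellaswag", 10),
    ("arc_challenge", 25),
    ("winogrande", 5),
    ("gsm8k", 5),
    ("mmlu", 5),
]

for task, shots in LEADERBOARD:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
        tasks=[task],
        num_fewshot=shots,
        batch_size=4,
    )
    print(task, out["results"][task])
```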

3. Custom Task: BoolQ

3.1 Setup

  • Name: BoolQ
  • Task Output Type: multiple_choice
  • Evaluation Dataframe: BoolQ
  • Prompt Column: {passage}\n{question}
  • Answer Column: {answer}
  • Fixed Choices: true
  • Possible Choices:
    • true
    • false
  • Metrics:
    • exact_match

4. AR/EN ASR Benchmark

4.1 Setup

  • Name: AR/EN benchmark

4.1.1 Dataset 1: English (Fleurs)

  • Source: HF
  • Dataset: google/fleurs
  • Subset: en_us
  • Split: test
  • Audio Column: audio
  • Transcription Column: transcription
  • Normalizer: true
  • Language: en

4.1.2 Dataset 2: Arabic (Fleurs)

  • Source: HF
  • Dataset: google/fleurs
  • Subset: ar_eg
  • Split: test
  • Audio Column: audio
  • Transcription Column: transcription
  • Normalizer: true
  • Language: ar
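
The two dataset configurations above map directly onto Hugging Face datasets calls; a minimal sketch, assuming the google/fleurs dataset on the Hub with its audio and transcription columns:

```python
# Sketch only: loading the two Fleurs test splits configured above.
from datasets import load_dataset

fleurs_en = load_dataset("google/fleurs", "en_us", split="test")
fleurs_ar = load_dataset("google/fleurs", "ar_eg", split="test")

print(fleurs_en[0]["transcription"])           # reference text (Transcription Column)
print(fleurs_ar[0]["audio"]["sampling_rate"])  # 16 kHz waveform (Audio Column)
```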

4.2 Model

  • Source: HF
  • Model: openai/whisper-small
  • Accelerator: L4
  • Storage: 32
  • Memory: 16
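
An end-to-end sketch of this benchmark, assuming the transformers ASR pipeline for openai/whisper-small, Whisper's BasicTextNormalizer as a stand-in for the Normalizer option, and word error rate (WER) from the evaluate library as the metric (the configuration above does not name one):

```python
# Sketch only: WER over the Fleurs test splits with whisper-small.
import evaluate
from datasets import load_dataset
from transformers import pipeline
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
wer_metric = evaluate.load("wer")
normalize = BasicTextNormalizer()  # stand-in for the "Normalizer: true" option

def run_benchmark(subset: str, language: str) -> float:
    ds = load_dataset("google/fleurs", subset, split="test")
    preds, refs = [], []
    for ex in ds:
        audio = {"array": ex["audio"]["array"],
                 "sampling_rate": ex["audio"]["sampling_rate"]}
        # Assumption: short language codes are accepted by Whisper decoding.
        out = asr(audio, generate_kwargs={"language": language, "task": "transcribe"})
        preds.append(normalize(out["text"]))
        refs.append(normalize(ex["transcription"]))
    return wer_metric.compute(predictions=preds, references=refs)

print("WER en_us:", run_benchmark("en_us", "en"))
print("WER ar_eg:", run_benchmark("ar_eg", "ar"))
```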