
Benchmarks

This page gives example configurations for knowledge benchmarks, ASR benchmarks, and custom tasks.

Example 1. Simple knowledge benchmark

A knowledge benchmark made up of mathematics questions in Arabic.

  • Name: Benchmark Math
  • Tasks:
    • ammlu_high_school_mathematics
    • ammlu_college_mathematics

Model

  • Source: HF
  • Model: tiiuae/falcon-7b
  • Accelerator: L4
  • Storage: 64
  • Memory: 32

Example 2. HF Leaderboard benchmark

The knowledge benchmark used on the Hugging Face leaderboard.

  • Name: HF Leaderboard
  • Tasks:
    • truthfulqa
    • hellaswag, num fewshot: 10
    • arc_challenge, num fewshot: 25
    • winogrande, num fewshot: 5
    • gsm8k, num fewshot: 5
    • mmlu, num fewshot: 5

Model

  • Source: HF
  • Model: mistralai/Mistral-7B-Instruct-v0.2
  • Secrets Blueprint: HF Model Read
  • Accelerator: A10G
  • Storage: 120
  • Memory: 64
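
The task names and few-shot counts in this example match task names used by EleutherAI's lm-evaluation-harness. This page does not say what runs underneath, so treat the following as a rough local approximation under that assumption, not as a description of the platform. The helper `build_commands` is a hypothetical name introduced here; it only builds `lm_eval` command lines and does not run them. Since truthfulqa has no few-shot count in the list, the sketch leaves that flag off and lets the harness use its default.

```python
import shlex

# Tasks from Example 2. None means no few-shot count is given in the list,
# so the --num_fewshot flag is omitted for that task.
LEADERBOARD_TASKS = {
    "truthfulqa": None,
    "hellaswag": 10,
    "arc_challenge": 25,
    "winogrande": 5,
    "gsm8k": 5,
    "mmlu": 5,
}


def build_commands(model_id, tasks):
    """Return one lm_eval command line (as a string) per task.

    Hypothetical helper: assumes the lm-evaluation-harness CLI is installed.
    """
    commands = []
    for task, shots in tasks.items():
        args = [
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={model_id}",
            "--tasks", task,
        ]
        if shots is not None:
            args += ["--num_fewshot", str(shots)]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in build_commands("mistralai/Mistral-7B-Instruct-v0.2", LEADERBOARD_TASKS):
        print(cmd)
```

One command per task is used because each task carries its own few-shot count. If the model repository requires authentication, a read token would also need to be available in the local environment, which is presumably the role of the "HF Model Read" secret above.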

Example 3. Custom task

A custom task using the BoolQ dataset.

  • Name: BoolQ
  • Task Output Type: multiple_choice
  • Evaluation Dataframe: BoolQ
  • Prompt column: {passage}\n{question}
  • Answer column: {answer}
  • Use fixed choices: true
  • Possible Choices:
    • true
    • false
  • Metrics:
    • exact_match
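
To make the fields above concrete, here is a minimal sketch of how such a task could be scored. It assumes that the prompt and answer columns are filled in with Python-style `{column}` substitution, that the model's answer is the fixed choice with the highest score, and that `exact_match` compares strings case-insensitively. None of this is confirmed by this page; `render`, `pick_choice`, and `exact_match` are illustrative names, not the platform's API.

```python
# Templates and choices copied from the Example 3 configuration.
PROMPT_TEMPLATE = "{passage}\n{question}"
ANSWER_TEMPLATE = "{answer}"
CHOICES = ["true", "false"]


def render(template, row):
    """Fill a template with values from one dataset row (assumed behavior)."""
    return template.format(**row)


def pick_choice(scores):
    """Return the fixed choice with the highest score, e.g. a log-likelihood."""
    return max(CHOICES, key=lambda choice: scores[choice])


def exact_match(prediction, reference):
    """1.0 if the strings match after trimming and lowercasing, else 0.0.

    Case-insensitive comparison is an assumption of this sketch.
    """
    return float(prediction.strip().lower() == reference.strip().lower())


if __name__ == "__main__":
    row = {
        "passage": "Paris is the capital of France.",
        "question": "is paris the capital of france",
        "answer": True,  # a boolean label renders as the string "True"
    }
    prompt = render(PROMPT_TEMPLATE, row)
    reference = render(ANSWER_TEMPLATE, row)
    prediction = pick_choice({"true": -0.2, "false": -1.9})  # made-up scores
    print(prompt)
    print(exact_match(prediction, reference))
```

Note that a boolean answer column renders as "True" or "False", while the configured choices are lowercase. The sketch handles this by lowercasing before comparison; a strict string comparison would score every row as a mismatch.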

Example 4. AR/EN ASR benchmark

An automatic speech recognition (ASR) benchmark covering Arabic and English.

  • Name: AR/EN benchmark
  • Dataset 1: fleurs - en
    • Source: HF
    • Model: google/fleurs
    • Subset: en_us
    • Split: test
    • Column containing the audio: audio
    • Column containing the transcription: transcription
    • Use normalizer: true
    • Language: en
  • Dataset 2: fleurs - ar
    • Source: HF
    • Model: google/fleurs
    • Subset: ar_eg
    • Split: test
    • Column containing the audio: audio
    • Column containing the transcription: transcription
    • Use normalizer: true
    • Language: ar

Model

  • Source: HF
  • Model: openai/whisper-small
  • Accelerator: L4
  • Storage: 32
  • Memory: 16
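
This page does not name the metric the ASR benchmark reports. ASR systems are commonly scored with word error rate (WER), so the sketch below assumes that metric and shows why the "Use normalizer" setting matters: without normalization, differences in case and punctuation count as word errors. The `normalize` function is a deliberately crude stand-in written for this example, not the platform's normalizer. It treats combining marks such as Arabic diacritics as punctuation, so a real Arabic normalizer would need more care.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Simplified stand-in for illustration only.
    """
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "Hello, world!"
    hypothesis = "hello world"
    print(word_error_rate(reference, hypothesis))
    print(word_error_rate(normalize(reference), normalize(hypothesis)))
```

In the usage example, the raw strings differ in both words because of case and punctuation, while the normalized strings are identical.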