Benchmarks
This page contains example configurations for knowledge and ASR benchmarks, as well as custom tasks.
Example 1. Simple Knowledge benchmark
A knowledge benchmark consisting of mathematics questions in Arabic.
- Name: Benchmark Math
- Tasks:
  - ammlu_high_school_mathematics
  - ammlu_college_mathematics

Model
- Source: HF
- Model: tiiuae/falcon-7b
- Accelerator: L4
- Storage: 64
- Memory: 32
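
The ammlu_* task names follow EleutherAI's lm-evaluation-harness conventions. As a rough sketch, an equivalent run outside the platform could look like the following, assuming lm-eval is installed and these task names are available in your installed version:

```python
# Sketch only: reproducing the Benchmark Math tasks with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The task names are taken from the
# example above and are assumed to exist in your harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                # Hugging Face backend
    model_args="pretrained=tiiuae/falcon-7b",  # the model configured above
    tasks=[
        "ammlu_high_school_mathematics",
        "ammlu_college_mathematics",
    ],
)

# Per-task metrics (e.g. accuracy) live under the "results" key.
for task, metrics in results["results"].items():
    print(task, metrics)
```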
Example 2. HF Leaderboard benchmark
The knowledge benchmark used on the Hugging Face Open LLM Leaderboard.
- Name: HF Leaderboard
- Tasks:
  - truthfulqa
  - hellaswag, num fewshot: 10
  - arc_challenge, num fewshot: 25
  - winogrande, num fewshot: 5
  - gsm8k, num fewshot: 5
  - mmlu, num fewshot: 5

Model
- Source: HF
- Model: mistralai/Mistral-7B-Instruct-v0.2
- Secrets Blueprint: HF Model Read
  - Token with access to mistralai/Mistral-7B-Instruct-v0.2
- Accelerator: A10G
- Storage: 120
- Memory: 64
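
These are the same tasks and few-shot counts the Open LLM Leaderboard uses. A hedged sketch of that mix with lm-evaluation-harness is below; looping over tasks is only one way to apply different few-shot counts, and the gated Mistral checkpoint additionally needs a Hugging Face read token (the "HF Model Read" secret above), for example via the HF_TOKEN environment variable:

```python
# Sketch only: the Open LLM Leaderboard task mix with lm-evaluation-harness.
# Few-shot counts mirror the example above; None means "use the task default".
# A read token for the gated model is assumed to be available as HF_TOKEN.
import lm_eval

TASK_FEWSHOT = {
    "truthfulqa": None,
    "hellaswag": 10,
    "arc_challenge": 25,
    "winogrande": 5,
    "gsm8k": 5,
    "mmlu": 5,
}

all_results = {}
for task, shots in TASK_FEWSHOT.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
        tasks=[task],
        num_fewshot=shots,
    )
    all_results.update(out["results"])

print(all_results)
```

Running one task per call keeps the few-shot counts explicit, at the cost of reloading the model for every task.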
Example 3. Custom task
A custom task using the BoolQ dataset.
- Name: BoolQ
- Task Output Type: multiple_choice
- Evaluation Dataframe: BoolQ
- Prompt column: {passage}\n{question}
- Answer column: {answer}
- Use fixed choices: true
- Possible Choices:
  - true
  - false
- Metrics: exact_match
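
For a multiple_choice task with fixed choices, a common scoring recipe is to compute the model's log-likelihood of each choice given the formatted prompt, pick the highest-scoring one, and compare it to the answer column with exact_match. The sketch below illustrates that recipe with a tiny illustrative dataframe and gpt2 as a stand-in scorer; neither is part of the example above, and the evaluation backend may score choices differently.

```python
# Sketch only: scoring a fixed-choice multiple_choice task with exact_match.
# The tiny dataframe and the gpt2 stand-in model are illustrative assumptions.
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHOICES = ["true", "false"]

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


def choice_loglikelihood(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    n_prompt = prompt_ids.shape[1]
    target = full_ids[0, n_prompt:]                        # choice tokens only
    scores = logprobs[0, n_prompt - 1 : full_ids.shape[1] - 1]
    return scores[torch.arange(len(target)), target].sum().item()


df = pd.DataFrame(
    {
        "passage": ["BoolQ is a yes/no question answering dataset."],
        "question": ["is boolq a question answering dataset"],
        "answer": ["true"],
    }
)

correct = 0
for _, row in df.iterrows():
    prompt = f"{row['passage']}\n{row['question']}"        # Prompt column template
    pred = max(CHOICES, key=lambda c: choice_loglikelihood(prompt, c))
    correct += int(pred == row["answer"])                  # exact_match

print("exact_match:", correct / len(df))
```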
Example 4. AR/EN ASR benchmark
An automatic speech recognition benchmark covering Arabic and English.
- Name: AR/EN benchmark
- Dataset 1: fleurs - en
  - Source: HF
  - Dataset: google/fleurs
  - Subset: en_us
  - Split: test
  - Column containing the audio: audio
  - Column containing the transcription: transcription
  - Use normalizer: true
  - Language: en
- Dataset 2: fleurs - ar
  - Source: HF
  - Dataset: google/fleurs
  - Subset: ar_eg
  - Split: test
  - Column containing the audio: audio
  - Column containing the transcription: transcription
  - Use normalizer: true
  - Language: ar

Model
- Source: HF
- Model: openai/whisper-small
- Accelerator: L4
- Storage: 32
- Memory: 16
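
A comparable evaluation can be approximated outside the platform with the datasets, transformers, and evaluate libraries: load each FLEURS subset, transcribe the audio column with whisper-small, normalize both predictions and references, and compute the word error rate. The sketch below makes some assumptions: the whitespace/lower-casing normalizer is a simplified stand-in for the "Use normalizer" option, and the test[:16] slice is only there to keep the run short.

```python
# Sketch only: approximating the AR/EN ASR benchmark with open-source tooling.
# Requires datasets, transformers, evaluate, and jiwer; the normalizer below is a
# simplified stand-in for the platform's "Use normalizer" option.
import evaluate
from datasets import load_dataset
from transformers import pipeline

wer = evaluate.load("wer")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")


def normalize(text: str) -> str:
    # Minimal normalization; Whisper's own text normalizer is stricter.
    return " ".join(text.lower().split())


for subset, language in [("en_us", "en"), ("ar_eg", "ar")]:
    ds = load_dataset("google/fleurs", subset, split="test[:16]")  # small slice for illustration
    predictions, references = [], []
    for sample in ds:
        # Force the decoding language per subset, matching the Language field above.
        out = asr(sample["audio"], generate_kwargs={"language": language})
        predictions.append(normalize(out["text"]))
        references.append(normalize(sample["transcription"]))
    print(subset, "WER:", wer.compute(predictions=predictions, references=references))
```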