This page lists the evaluation benchmarks that LLM Evaluation Jobs provides, organized by category.

To run certain benchmarks, a team admin must first add the required API keys as team-scoped secrets. Any team member can then reference those secrets when configuring an evaluation job.

- If a benchmark has Yes in the OpenAI Scorer column, it uses OpenAI models for scoring. An organization or team admin must add an OpenAI API key as a team secret. When you configure an evaluation job with such a benchmark, set the Scorer API key field to that secret.
- If a benchmark has Yes in the Gated HF Dataset column, it requires access to a gated Hugging Face dataset. An organization or team admin must request access to the dataset on Hugging Face, create a Hugging Face user access token, and store that token in a team secret. When you configure such a benchmark, set the Hugging Face Token field to that secret. A sketch showing how you might verify both credentials appears after this list.
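Both requirements boil down to credentials that must work before the job runs. The sketch below is a hypothetical local pre-flight check, not part of LLM Evaluation Jobs itself: it assumes the `openai`, `datasets`, and `huggingface_hub` packages are installed, reads the keys from environment variables, and uses a placeholder dataset ID that you would replace with the gated dataset the benchmark needs.

```python
# Hypothetical pre-flight check (not part of LLM Evaluation Jobs): confirm the
# credentials you plan to store as team secrets actually work.
import os

from datasets import load_dataset
from huggingface_hub import HfApi
from openai import OpenAI

# 1. Verify the OpenAI API key intended for the Scorer API key secret.
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
openai_client.models.list()  # raises an authentication error if the key is invalid
print("OpenAI API key is valid")

# 2. Verify the Hugging Face user access token intended for the Hugging Face Token secret.
hf_token = os.environ["HF_TOKEN"]
user = HfApi().whoami(token=hf_token)  # raises if the token is invalid
print(f"Hugging Face token belongs to: {user['name']}")

# 3. Confirm the token has been granted access to the gated dataset.
#    "example-org/gated-dataset" is a placeholder; substitute the dataset the
#    benchmark requires. This raises an error if access was not approved.
dataset = load_dataset("example-org/gated-dataset", token=hf_token)
print(dataset)
```

If both checks pass, storing the same values as team secrets and selecting them in the Scorer API key and Hugging Face Token fields should satisfy the benchmark's requirements.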
## Knowledge

Evaluate factual knowledge across various domains like science, language, and general reasoning.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| BoolQ | boolq | | | Boolean yes/no questions from natural language queries |
| GPQA Diamond | gpqa_diamond | | | Graduate-level science questions (highest quality subset) |
| HLE | hle | | Yes | Humanity's Last Exam: expert-written questions at the frontier of human knowledge across many subjects |
| Lingoly | lingoly | | Yes | Linguistics olympiad problems |
| Lingoly Too | lingoly_too | | Yes | Extended linguistics challenge problems |
| MMIU | mmiu | | | Multimodal Multi-image Understanding benchmark |
| MMLU (0-shot) | mmlu_0_shot | | | Massive Multitask Language Understanding without examples |
| MMLU (5-shot) | mmlu_5_shot | | | Massive Multitask Language Understanding with 5 examples |
| MMLU-Pro | mmlu_pro | | | More challenging version of MMLU |
| ONET M6 | onet_m6 | | | Thai Ordinary National Educational Test (grade 12) exam questions |
| PAWS | paws | | | Paraphrase Adversaries from Word Scrambling: paraphrase identification |
| SevenLLM MCQ (English) | sevenllm_mcq_en | | | Multiple choice questions in English |
| SevenLLM MCQ (Chinese) | sevenllm_mcq_zh | | | Multiple choice questions in Chinese |
| SevenLLM QA (English) | sevenllm_qa_en | | | Question answering in English |
| SevenLLM QA (Chinese) | sevenllm_qa_zh | | | Question answering in Chinese |
| SimpleQA | simpleqa | Yes | | Straightforward factual question answering |
| SimpleQA Verified | simpleqa_verified | | | Verified subset of SimpleQA with validated answers |
| WorldSense | worldsense | | | Evaluates understanding of world knowledge and common sense |
## Reasoning

Evaluate logical thinking, problem-solving, and common-sense reasoning capabilities.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE AQUA-RAT | agie_aqua_rat | | | Algebraic question answering with rationales |
| AGIE LogiQA (English) | agie_logiqa_en | | | Logical reasoning questions in English |
| AGIE LSAT Analytical Reasoning | agie_lsat_ar | | | LSAT analytical reasoning (logic games) problems |
| AGIE LSAT Logical Reasoning | agie_lsat_lr | | | LSAT logical reasoning questions |
| ARC Challenge | arc_challenge | | | Challenging science questions requiring reasoning (AI2 Reasoning Challenge) |
| ARC Easy | arc_easy | | | Easier set of science questions from the ARC dataset |
| BBH | bbh | | | BIG-Bench Hard: challenging tasks from BIG-Bench |
| CoCoNot | coconot | | | Contextual noncompliance: evaluates appropriate refusal of requests that should not be complied with |
| CommonsenseQA | commonsense_qa | | | Commonsense reasoning questions |
| HellaSwag | hellaswag | | | Commonsense natural language inference |
| MUSR | musr | | | Multi-step reasoning benchmark |
| PIQA | piqa | | | Physical commonsense reasoning |
| WinoGrande | winogrande | | | Commonsense reasoning via pronoun resolution |
## Math

Evaluate mathematical problem-solving at various difficulty levels, from grade school to competition-level problems.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE Math | agie_math | | | Advanced mathematical reasoning from AGIE benchmark suite |
| AGIE SAT Math | agie_sat_math | | | SAT mathematics questions |
| AIME 2024 | aime2024 | | | American Invitational Mathematics Examination problems from 2024 |
| AIME 2025 | aime2025 | | | American Invitational Mathematics Examination problems from 2025 |
| GSM8K | gsm8k | | | Grade School Math 8K: multi-step math word problems |
| InfiniteBench Math Calc | infinite_bench_math_calc | | | Mathematical calculations in long contexts |
| InfiniteBench Math Find | infinite_bench_math_find | | | Finding mathematical patterns in long contexts |
| MATH | math | | | Competition-level mathematics problems |
| MGSM | mgsm | | | Multilingual Grade School Math |
## Code

Evaluate programming and software development capabilities like debugging, code execution prediction, and function calling.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| BFCL | bfcl | | | Berkeley Function Calling Leaderboard: tests function calling and tool use capabilities |
| InfiniteBench Code Debug | infinite_bench_code_debug | | | Long-context code debugging tasks |
| InfiniteBench Code Run | infinite_bench_code_run | | | Long-context code execution prediction |
## Reading

Evaluate reading comprehension and information extraction from complex texts.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AGIE LSAT Reading Comprehension | agie_lsat_rc | | | LSAT reading comprehension passages and questions |
| AGIE SAT English | agie_sat_en | | | SAT reading and writing questions with passages |
| AGIE SAT English (No Passage) | agie_sat_en_without_passage | | | SAT English questions without accompanying passages |
| DROP | drop | | | Discrete Reasoning Over Paragraphs: reading comprehension requiring numerical reasoning |
| RACE-H | race_h | | | Reading comprehension from English exams (high difficulty) |
| SQuAD | squad | | | Stanford Question Answering Dataset: extractive question answering on Wikipedia articles |
## Long context

Evaluate the ability to process and reason over extended contexts, including retrieval and pattern recognition.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| InfiniteBench KV Retrieval | infinite_bench_kv_retrieval | | | Key-value retrieval in long contexts |
| InfiniteBench LongBook (English) | infinite_bench_longbook_choice_eng | | | Multiple choice questions on long books |
| InfiniteBench LongDialogue QA (English) | infinite_bench_longdialogue_qa_eng | | | Question answering over long dialogues |
| InfiniteBench Number String | infinite_bench_number_string | | | Number pattern recognition in long sequences |
| InfiniteBench Passkey | infinite_bench_passkey | | | Retrieval of information from long context |
| NIAH | niah | | | Needle in a Haystack: long-context retrieval test |
## Safety

Evaluate alignment, bias detection, harmful content resistance, and truthfulness.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| AgentHarm | agentharm | Yes | | Tests model resistance to harmful agent behavior and misuse scenarios |
| AgentHarm Benign | agentharm_benign | Yes | | Benign baseline for AgentHarm to measure false positive rates |
| Agentic Misalignment | agentic_misalignment | | | Evaluates potential misalignment in agentic behavior |
| AHB | ahb | | | Agent Harmful Behavior: tests resistance to harmful agentic actions |
| AIRBench | air_bench | | | Tests adversarial instruction resistance |
| BBEH | bbeh | | | BIG-Bench Extra Hard: harder successor to BIG-Bench Hard |
| BBEH Mini | bbeh_mini | | | Smaller version of BBEH benchmark |
| BBQ | bbq | | | Bias Benchmark for Question Answering |
| BOLD | bold | | | Bias in Open-Ended Language Generation Dataset |
| CYSE3 Visual Prompt Injection | cyse3_visual_prompt_injection | | | Tests resistance to visual prompt injection attacks |
| Make Me Pay | make_me_pay | | | Tests resistance to financial scam and fraud scenarios |
| MASK | mask | Yes | Yes | Tests model honesty: whether models contradict their own beliefs under pressure |
| Personality BFI | personality_BFI | | | Big Five personality trait assessment |
| Personality TRAIT | personality_TRAIT | | Yes | Comprehensive personality trait evaluation |
| SOSBench | sosbench | Yes | | Safety and oversight stress test |
| StereoSet | stereoset | | | Measures stereotypical biases in language models |
| StrongREJECT | strong_reject | | | Tests model’s ability to reject harmful requests |
| Sycophancy | sycophancy | | | Evaluates tendency toward sycophantic behavior |
| TruthfulQA | truthfulqa | | | Tests model truthfulness and resistance to falsehoods |
| UCCB | uccb | | | Unsafe Content Classification Benchmark |
| WMDP Bio | wmdp_bio | | | Tests hazardous knowledge in biology |
| WMDP Chem | wmdp_chem | | | Tests hazardous knowledge in chemistry |
| WMDP Cyber | wmdp_cyber | | | Tests hazardous knowledge in cybersecurity |
| XSTest | xstest | Yes | | Exaggerated safety test for over-refusal detection |
## Domain-Specific

Evaluate specialized knowledge in medicine, chemistry, law, biology, and other professional fields.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| ChemBench | chembench | | | Chemistry knowledge and problem-solving benchmark |
| HealthBench | healthbench | Yes | | Healthcare and medical knowledge evaluation |
| HealthBench Consensus | healthbench_consensus | Yes | | Healthcare questions with expert consensus |
| HealthBench Hard | healthbench_hard | Yes | | Challenging healthcare scenarios |
| LabBench Cloning Scenarios | lab_bench_cloning_scenarios | | | Laboratory experiment planning and cloning |
| LabBench DBQA | lab_bench_dbqa | | | Database question answering for lab scenarios |
| LabBench FigQA | lab_bench_figqa | | | Figure interpretation in scientific contexts |
| LabBench LitQA | lab_bench_litqa | | | Literature-based question answering for research |
| LabBench ProtocolQA | lab_bench_protocolqa | | | Laboratory protocol understanding |
| LabBench SeqQA | lab_bench_seqqa | | | Biological sequence analysis questions |
| LabBench SuppQA | lab_bench_suppqa | | | Supplementary material interpretation |
| LabBench TableQA | lab_bench_tableqa | | | Table interpretation in scientific papers |
| MedQA | medqa | | | Medical licensing exam questions |
| PubMedQA | pubmedqa | | | Biomedical question answering from research abstracts |
| SEC-QA v1 | sec_qa_v1 | | | SEC filing question answering |
| SEC-QA v1 (5-shot) | sec_qa_v1_5_shot | | | SEC-QA with 5 examples |
| SEC-QA v2 | sec_qa_v2 | | | Updated SEC filing benchmark |
| SEC-QA v2 (5-shot) | sec_qa_v2_5_shot | | | SEC-QA v2 with 5 examples |
## Multimodal

Evaluate vision and language understanding combining visual and textual inputs.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| DocVQA | docvqa | | | Document Visual Question Answering: questions about document images |
| MathVista | mathvista | | | Mathematical reasoning with visual contexts combining vision and math |
| MMMU Multiple Choice | mmmu_multiple_choice | | | Multimodal understanding with multiple choice format |
| MMMU Open | mmmu_open | | | Multimodal understanding with open-ended responses |
| V*Star Bench Attribute Recognition | vstar_bench_attribute_recognition | | | Visual attribute recognition tasks |
| V*Star Bench Spatial Relationship | vstar_bench_spatial_relationship_reasoning | | | Spatial reasoning with visual inputs |
## Instruction Following

Evaluate adherence to specific instructions and formatting requirements.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| IFEval | ifeval | | | Tests precise instruction-following capabilities |
## System

Basic system validation and pre-flight checks.

| Evaluation | Task ID | OpenAI Scorer | Gated HF Dataset | Description |
|---|---|---|---|---|
| Pre-Flight | pre_flight | | | Basic system check and validation test |
## Next steps