Add LEXam legal exam benchmark to Swiss legal evals by JoelNiklaus · Pull Request #1243 · huggingface/lighteval

JoelNiklaus · 2026-05-21T14:36:19Z

Add LEXam legal exam benchmark to Swiss legal evals

Problem

The Swiss legal evals module only covers translation and headnote-summarization tasks. There is no support for LEXam (ICLR 2026), a benchmark of 340 Swiss / EU / international law exams that ships both open-ended questions graded by legal experts and multiple-choice questions with 4, 8, 16, or 32 answer choices.

Solution

Adds LEXam to lighteval.tasks.multilingual.tasks.swiss_legal:

Open questions (lexam_oq:{en,de}): LLM-as-judge scorer (JudgeLEXamOQ) that compares free-form legal answers against expert reference answers and extracts a [[0.0, 1.0]] correctness score. Defaults to openai/gpt-4o-2024-11-20 via LiteLLM and follows the 20250324 judging rubric from the upstream LEXam repo.
Multiple-choice questions (lexam_mcq_{4,8,16,32}{,_idk}:{en,de}): letter-based extractive scorer (LEXamMCQExtractive) using IndicesExtractionConfig("NativeLetters") with a small fallback regex for the ###X### / \boxed{X} / Final answer: X conventions used in the prompt template. Choices are shuffled deterministically (seed 42).
Optional "I don't know" calibration: when with_idk=True, an extra letter option labeled "I don't know" is appended, and the prompt describes the +1/0/-1 scoring that rewards calibration. The metric then reports trad_score, idk_score, idk_freq, and extract_fail; with with_idk=False it reports acc and extract_fail.
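To make the MCQ scoring above concrete, here is a minimal sketch of a fallback letter extractor and the two metric modes. It is not the PR's LEXamMCQExtractive: the names `extract_letter` and `score_mcq`, the exact regex patterns, and the use of the last match are all hypothetical. The mapping of +1/0/-1 to correct / "I don't know" / wrong, and the choice to score unparseable output as -1, are assumptions as well.

```python
import re

# Hypothetical patterns for the three conventions named in the description;
# the regex actually used in the PR may differ.
_FALLBACK_PATTERNS = [
    re.compile(r"###\s*([A-Z])\s*###"),
    re.compile(r"\\boxed\{\s*([A-Z])\s*\}"),
    re.compile(r"Final answer:\s*([A-Z])\b", re.IGNORECASE),
]


def extract_letter(text):
    """Return the last answer letter matched by the first pattern that hits, or None."""
    for pattern in _FALLBACK_PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches[-1].upper()
    return None


def score_mcq(text, gold, idk_letter=None):
    """Score one model output; idk_letter=None corresponds to with_idk=False."""
    letter = extract_letter(text)
    extract_fail = float(letter is None)
    correct = float(letter == gold)
    if idk_letter is None:
        return {"acc": correct, "extract_fail": extract_fail}
    if letter == gold:
        idk_score = 1.0
    elif letter == idk_letter:
        idk_score = 0.0
    else:
        # Assumption: wrong answers and unparseable outputs both cost -1.
        idk_score = -1.0
    return {
        "trad_score": correct,
        "idk_score": idk_score,
        "idk_freq": float(letter == idk_letter),
        "extract_fail": extract_fail,
    }
```

For a 4-choice task with IDK, the extra option would be letter E under this sketch, so `score_mcq("###E###", "A", idk_letter="E")` abstains without a penalty.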

This adds 18 new tasks to TASKS_TABLE: 2 OQ tasks (one per language) plus 16 MCQ tasks (4 choice-counts × 2 langs × {plain, IDK}).
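The task count can be checked by expanding the name patterns given above. This is only an enumeration of the names as written in this description; it does not import lighteval or read TASKS_TABLE, and the variable names are illustrative.

```python
from itertools import product

LANGS = ["en", "de"]
CHOICE_COUNTS = [4, 8, 16, 32]
IDK_SUFFIXES = ["", "_idk"]

# lexam_oq:{en,de}
oq_tasks = [f"lexam_oq:{lang}" for lang in LANGS]

# lexam_mcq_{4,8,16,32}{,_idk}:{en,de}
mcq_tasks = [
    f"lexam_mcq_{n}{suffix}:{lang}"
    for n, suffix, lang in product(CHOICE_COUNTS, IDK_SUFFIXES, LANGS)
]

all_tasks = oq_tasks + mcq_tasks
print(len(oq_tasks), len(mcq_tasks), len(all_tasks))  # prints: 2 16 18
```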

Prompts are adapted from:

Testing

All 18 LEXam tasks register via TASKS_TABLE (total grows from 93 to 111).
End-to-end load of LEXam-Benchmark/LEXam confirmed: language filter yields 619 / 1655 EN MCQs and 2105 / 2541 DE open questions; the stringified choices field is parsed via ast.literal_eval.
LEXamMCQExtractive returns the expected values on ###X###, \boxed{X}, wrong answers, IDK selections, and unparseable outputs (both modes).
ruff check and ruff format --check pass on all changed files.
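As an illustration of the choices handling mentioned in the testing notes, the sketch below parses a stringified list with ast.literal_eval and shuffles it reproducibly with seed 42 while tracking the gold index. The helper names `parse_choices` and `shuffle_choices`, the sample row, and the decision to seed a fresh generator per call are assumptions, not the PR's code.

```python
import ast
import random


def parse_choices(raw):
    """Parse a stringified Python list of answer choices into a list of str."""
    choices = ast.literal_eval(raw)
    if not isinstance(choices, list):
        raise ValueError(f"expected a list of choices, got {type(choices).__name__}")
    return [str(choice) for choice in choices]


def shuffle_choices(choices, gold_index, seed=42):
    """Shuffle choices with a fixed seed and return (shuffled, new_gold_index)."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(gold_index)


# Hypothetical row: the real dataset field contents may look different.
raw = "['Art. 1 OR', 'Art. 2 ZGB', 'Art. 41 OR', 'Art. 8 ZGB']"
choices = parse_choices(raw)
shuffled, gold = shuffle_choices(choices, gold_index=2)
```

Using ast.literal_eval rather than eval keeps parsing limited to Python literals, and a seeded local generator gives the same order on every run.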

Adds LEXam (https://huggingface.co/datasets/LEXam-Benchmark/LEXam) tasks:

- lexam_oq:{en,de}: open-ended legal exam questions scored with an LLM-as-judge (JudgeLEXamOQ, default openai/gpt-4o-2024-11-20 via LiteLLM).
- lexam_mcq_{4,8,16,32}{,_idk}:{en,de}: letter-based MCQ scorer using IndicesExtractionConfig with a ###X### / \boxed{X} / "Final answer: X" fallback regex. Choices are shuffled deterministically (seed 42).
- Optional "I don't know" calibration: when with_idk=True, an extra letter slot is reserved for IDK and the metric reports trad_score, idk_score (+1/0/-1), idk_freq, and extract_fail; otherwise it reports acc and extract_fail.

Adds 18 tasks to TASKS_TABLE (2 OQ tasks, one per language, and 16 MCQ tasks: 4 choice-counts x 2 langs x {plain, IDK}).

Co-authored-by: Cursor <cursoragent@cursor.com>

bot-ci-comment · 2026-05-21T14:39:02Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

JoelNiklaus requested a review from NathanHB May 21, 2026 14:39

JoelNiklaus wants to merge 1 commit into huggingface:main from JoelNiklaus:feat/lexam-swiss-legal
