Add LEXam legal exam benchmark to Swiss legal evals #1243

Open
JoelNiklaus wants to merge 1 commit into huggingface:main from JoelNiklaus:feat/lexam-swiss-legal

@JoelNiklaus (Collaborator) commented May 21, 2026

Add LEXam legal exam benchmark to Swiss legal evals

Problem

The Swiss legal evals module only covers translation and headnote-summarization tasks. There is no support for LEXam (ICLR 2026), a benchmark of 340 Swiss / EU / international law exams that ships both open-ended questions graded by legal experts and multiple-choice questions with 4, 8, 16, or 32 answer choices.

Solution

Adds LEXam to lighteval.tasks.multilingual.tasks.swiss_legal:

  • Open questions (lexam_oq:{en,de}): LLM-as-judge scorer (JudgeLEXamOQ) that compares free-form legal answers against expert reference answers and extracts a [[0.0, 1.0]] correctness score. Defaults to openai/gpt-4o-2024-11-20 via LiteLLM and follows the 20250324 judging rubric from the upstream LEXam repo.
  • Multiple-choice questions (lexam_mcq_{4,8,16,32}{,_idk}:{en,de}): letter-based extractive scorer (LEXamMCQExtractive) using IndicesExtractionConfig("NativeLetters") with a small fallback regex for the ###X### / \boxed{X} / Final answer: X conventions used in the prompt template. Choices are shuffled deterministically (seed 42).
  • Optional "I don't know" calibration: when with_idk=True, an extra letter is appended as I don't know and the prompt rewards calibration via +1/0/-1 scoring. The metric then reports trad_score, idk_score, idk_freq, and extract_fail; with with_idk=False it reports acc and extract_fail.
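The extraction fallback and the two scoring modes described above can be sketched roughly as follows. This is an illustration only, not the PR's code: the regex patterns, the helper names `extract_letter` and `score_mcq`, and the choice to score an unparseable output as -1 in IDK mode are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical patterns for the three answer conventions named in the
# description (###X###, \boxed{X}, "Final answer: X"). Not the PR's regex.
_FALLBACK_PATTERNS = [
    re.compile(r"###\s*([A-Za-z])\s*###"),
    re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}"),
    re.compile(r"Final answer:\s*([A-Z])\b"),
]


def extract_letter(output: str) -> Optional[str]:
    """Return the last letter matched by the first pattern that hits, else None."""
    for pattern in _FALLBACK_PATTERNS:
        matches = pattern.findall(output)
        if matches:
            return matches[-1].upper()
    return None


def score_mcq(output: str, gold: str, idk_letter: Optional[str] = None) -> dict:
    """Score one response; idk_letter=None models with_idk=False."""
    pred = extract_letter(output)
    extract_fail = float(pred is None)
    correct = float(pred == gold)
    if idk_letter is None:
        return {"acc": correct, "extract_fail": extract_fail}

    chose_idk = pred == idk_letter
    # Assumed mapping: +1 correct, 0 for "I don't know", -1 otherwise.
    # Treating an extraction failure as -1 is a guess for this sketch.
    if pred == gold:
        idk_score = 1.0
    elif chose_idk:
        idk_score = 0.0
    else:
        idk_score = -1.0
    return {
        "trad_score": correct,
        "idk_score": idk_score,
        "idk_freq": float(chose_idk),
        "extract_fail": extract_fail,
    }
```

For instance, with `gold="A"` and `idk_letter="E"`, an output ending in `###E###` would get `idk_score` 0.0 and `idk_freq` 1.0 under these assumptions, while `###B###` would get `idk_score` -1.0.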

This produces 18 new tasks in TASKS_TABLE: 2 OQ tasks (one per language) and 16 MCQ tasks (4 choice-counts × 2 langs × {plain, IDK}).
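As a quick sanity check on the count, the task-name patterns listed above can be expanded by hand. The snippet below only builds name strings from those patterns; it does not touch lighteval or TASKS_TABLE, and the variable names are illustrative.

```python
from itertools import product

LANGS = ["en", "de"]
CHOICE_COUNTS = [4, 8, 16, 32]
IDK_SUFFIXES = ["", "_idk"]  # plain vs. "I don't know" variant

# lexam_oq:{en,de}
oq_tasks = [f"lexam_oq:{lang}" for lang in LANGS]

# lexam_mcq_{4,8,16,32}{,_idk}:{en,de}
mcq_tasks = [
    f"lexam_mcq_{n}{suffix}:{lang}"
    for n, suffix, lang in product(CHOICE_COUNTS, IDK_SUFFIXES, LANGS)
]

all_tasks = oq_tasks + mcq_tasks
print(len(oq_tasks), len(mcq_tasks), len(all_tasks))  # 2 16 18
```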

Prompts are adapted from:

Testing

  • All 18 LEXam tasks register via TASKS_TABLE (total grows from 93 to 111).
  • End-to-end load of LEXam-Benchmark/LEXam confirmed: language filter yields 619 / 1655 EN MCQs and 2105 / 2541 DE open questions; the stringified choices field is parsed via ast.literal_eval.
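A minimal sketch of the choice handling mentioned here and in the Solution section: parsing a stringified list with ast.literal_eval, then shuffling with a fixed seed while tracking the gold index. The function names and the per-call `random.Random(42)` are assumptions for illustration; the PR may seed and shuffle differently.

```python
import ast
import random


def parse_choices(raw: str) -> list[str]:
    """Parse a stringified Python list of choices, e.g. "['a', 'b']"."""
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list of choices, got {type(parsed).__name__}")
    return [str(choice) for choice in parsed]


def shuffle_choices(
    choices: list[str], gold_index: int, seed: int = 42
) -> tuple[list[str], int]:
    """Shuffle choices reproducibly and return the gold answer's new position."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(gold_index)


choices = parse_choices("['Art. 1 applies', 'Art. 2 applies', 'Neither', 'Both']")
shuffled, new_gold = shuffle_choices(choices, gold_index=2)
```

Using a local `random.Random` instance keeps the shuffle reproducible across runs without depending on whatever else has consumed the global random state.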
  • LEXamMCQExtractive returns the expected values on ###X###, \boxed{X}, wrong answers, IDK selections, and unparseable outputs (both modes).
  • ruff check and ruff format --check pass on all changed files.

Adds LEXam (https://huggingface.co/datasets/LEXam-Benchmark/LEXam) tasks:
- lexam_oq:{en,de}: open-ended legal exam questions scored with an
  LLM-as-judge (JudgeLEXamOQ, default openai/gpt-4o-2024-11-20 via LiteLLM).
- lexam_mcq_{4,8,16,32}{,_idk}:{en,de}: letter-based MCQ scorer using
  IndicesExtractionConfig with a ###X### / \boxed{X} / "Final answer: X"
  fallback regex. Choices are shuffled deterministically (seed 42).
- Optional "I don't know" calibration: when with_idk=True, an extra letter
  slot is reserved for IDK and the metric reports trad_score, idk_score
  (+1/0/-1), idk_freq, and extract_fail; otherwise it reports acc and
  extract_fail.

Adds 18 tasks to TASKS_TABLE (2 OQ tasks, one per language, and 16 MCQ
tasks: 4 choice-counts x 2 langs x {plain, IDK}).

Co-authored-by: Cursor <cursoragent@cursor.com>
@JoelNiklaus requested a review from NathanHB on May 21, 2026, 14:39