Multi-domain GRPO hardening across three professional verticals using real benchmarks with verifiable rewards.
| Domain | Dataset | Size / Notes | Task Type |
|---|---|---|---|
| Legal | nguha/legalbench | 162 tasks (NeurIPS 2023) | Classification, QA |
| Medical | GBaker/MedQA-USMLE-4-options | 10K+ USMLE questions | 4-choice MCQ |
| Finance | PatronusAI/financebench | 150 SEC 10-K QA pairs | Numeric extraction |
| Finance | TheFinAI/flare-headlines | Market-news headlines | Binary classification |
| Structured | Generated | JSON work products | RL-Struct 5-component |
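The dataset mix above can be wired up with a small registry that resolves each domain to its Hugging Face dataset ID. This is a minimal sketch; the domain keys and the `dataset_for` helper are hypothetical names, only the dataset IDs come from the table.

```python
# Registry mapping domain keys (hypothetical names) to the Hugging Face
# dataset IDs listed in the table above.
DOMAIN_DATASETS = {
    "legal": "nguha/legalbench",
    "medical": "GBaker/MedQA-USMLE-4-options",
    "finance_qa": "PatronusAI/financebench",
    "finance_cls": "TheFinAI/flare-headlines",
}

def dataset_for(domain: str) -> str:
    """Resolve a domain key to its benchmark dataset ID."""
    try:
        return DOMAIN_DATASETS[domain]
    except KeyError:
        raise ValueError(f"unknown domain: {domain}") from None
```

From here, `datasets.load_dataset(dataset_for("medical"))` would pull the MedQA benchmark (network access and, for some of these datasets, per-task config names are required).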
| Weight | Reward component | What it measures |
|---|---|---|
| 0.45 | Domain Correctness | MCQ letter match, numeric within ±2%, exact classification |
| 0.20 | Structured Output | JSON validity + schema + types + content (RL-Struct) |
| 0.20 | Professional Quality | Domain terminology + evidence citation + structure |
| 0.10 | Reasoning Depth | Think tags + logical connectors |
| 0.05 | Length Penalty | DAPO soft overlong |
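The weighted components above can be sketched as a single composite reward. This is an illustrative sketch, not the project's actual code: the component scorers, the tolerance handling, and the DAPO-style length limits (`max_len`, `buffer`) are assumptions; only the weights and the ±2% numeric tolerance come from the table.

```python
# Weights from the reward table (sum to 1.0).
WEIGHTS = {
    "correctness": 0.45,  # Domain Correctness
    "structure": 0.20,    # Structured Output (RL-Struct)
    "quality": 0.20,      # Professional Quality
    "reasoning": 0.10,    # Reasoning Depth
    "length": 0.05,       # Length Penalty (DAPO soft overlong)
}

def domain_correctness(pred: str, gold: str, numeric: bool = False) -> float:
    """Exact match for MCQ letters / class labels; numeric answers within +/-2%."""
    if numeric:
        try:
            p, g = float(pred), float(gold)
        except ValueError:
            return 0.0
        if g == 0.0:
            return 1.0 if p == 0.0 else 0.0
        return 1.0 if abs(p - g) / abs(g) <= 0.02 else 0.0
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def soft_overlong_penalty(length: int, max_len: int = 1024, buffer: int = 256) -> float:
    """DAPO-style soft overlong: zero penalty up to max_len - buffer,
    then a linear ramp down to -1 at max_len (limits are assumptions)."""
    soft_start = max_len - buffer
    if length <= soft_start:
        return 0.0
    if length >= max_len:
        return -1.0
    return (soft_start - length) / buffer

def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum over the five reward components."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```

A completion that scores 1.0 on the four quality components with no length penalty earns 0.95; the remaining 0.05 is only reachable if the length component is defined to contribute positively, which depends on how the penalty term is normalized.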
```shell
python professional_hardening.py
```