policyfx-nli
Fine-tuned version of tasksource/ModernBERT-base-nli for natural language inference on Congressional Research Service (CRS) policy reports.
Intended Use
Classifies whether a policy passage entails, contradicts, or is neutral with respect to a stakeholder impact statement.
Input Format
premise = "The program provides monthly cash payments of $500 to eligible low-income families."
hypothesis = "low-income families: Low-income families receive monthly cash payments of $500 through the program."
The hypothesis follows the format {stakeholder}: {effect statement}. The tokenized premise-hypothesis pair is truncated to a maximum of 2048 tokens.
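As a minimal sketch of the hypothesis format described above, the helper below joins a stakeholder and an effect statement. The function name `build_hypothesis` is hypothetical and not part of this model's code; it only illustrates the `{stakeholder}: {effect statement}` convention.

```python
def build_hypothesis(stakeholder: str, effect: str) -> str:
    """Join a stakeholder and an effect statement as '{stakeholder}: {effect statement}'."""
    stakeholder = stakeholder.strip()
    effect = effect.strip()
    if not stakeholder or not effect:
        raise ValueError("stakeholder and effect must both be non-empty")
    return f"{stakeholder}: {effect}"


# Reproduces the example hypothesis shown in this section.
hypothesis = build_hypothesis(
    "low-income families",
    "Low-income families receive monthly cash payments of $500 through the program.",
)
print(hypothesis)
```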
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax
model_id = "JoshuaAshkinaze/policyfx-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
LABEL_NAMES = ["contradiction", "entailment", "neutral"]
premise = "The bill will raise taxes."
hypothesis = "Taxpayers: Individual taxpayers will have less disposable income."
inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                   truncation=True, max_length=2048)
with torch.no_grad():
    logits = model(**inputs).logits.float().numpy()  # cast to float32 before converting to NumPy
probs = softmax(logits, axis=-1)[0]
pred = {label: round(float(p), 4) for label, p in zip(LABEL_NAMES, probs)}
print(pred)
Note on label order: config.id2label is set to {0: "contradiction", 1: "entailment", 2: "neutral"}. Use this mapping as-is; do not substitute the base model's label configuration, whose label order may differ.
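One way to avoid hard-coding the label order is to derive the names from the mapping itself. The sketch below uses the mapping stated in this card as a plain dict; with a loaded model, the same dict should be available as `model.config.id2label`. The helper name `label_names_from_config` is hypothetical.

```python
def label_names_from_config(id2label: dict) -> list:
    """Return label names ordered by class index, so position i matches logit i."""
    # Keys may be ints or strings depending on how the config was loaded.
    return [name for _, name in sorted((int(k), v) for k, v in id2label.items())]


# Mapping as stated in this card; with a loaded model, pass model.config.id2label instead.
id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}
LABEL_NAMES = label_names_from_config(id2label)
print(LABEL_NAMES)  # ['contradiction', 'entailment', 'neutral']
```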
Training
Data: ~61,600 (premise, stakeholder-effect hypothesis, label) triples generated from ~3,000 CRS reports, covering topics from housing and healthcare to defense and immigration. Train/val/test split is report-level to prevent leakage.
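A report-level split can be sketched as follows. This is an illustration under assumptions, not the author's pipeline: the `report_id` field, the 10% validation and test fractions, and the seed are all hypothetical, since the card does not state them. The point is that whole reports, not individual examples, are assigned to a split.

```python
import random


def split_by_report(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole reports to train/val/test so no report spans two splits."""
    report_ids = sorted({ex["report_id"] for ex in examples})
    random.Random(seed).shuffle(report_ids)
    n_test = int(len(report_ids) * test_frac)
    n_val = int(len(report_ids) * val_frac)
    test_ids = set(report_ids[:n_test])
    val_ids = set(report_ids[n_test:n_test + n_val])
    splits = {"train": [], "val": [], "test": []}
    for ex in examples:
        if ex["report_id"] in test_ids:
            splits["test"].append(ex)
        elif ex["report_id"] in val_ids:
            splits["val"].append(ex)
        else:
            splits["train"].append(ex)
    return splits


# Toy data: 20 hypothetical reports with 3 examples each.
examples = [
    {"report_id": f"R{r}", "text": f"example {i}"}
    for r in range(20)
    for i in range(3)
]
splits = split_by_report(examples)
print({name: len(rows) for name, rows in splits.items()})
```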
Hard example filtering: The base NLI model scored all examples; only those with P(true_label) < 0.80 were retained (~30,000 examples). This removes trivially easy examples and focuses training on genuine domain difficulty.
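The filtering rule above can be sketched in a few lines. The data layout here is an assumption: each example carries an integer `label`, and `base_probs` holds the base model's class probabilities in the same index order. Only the 0.80 threshold comes from the card.

```python
THRESHOLD = 0.80  # cutoff stated in the card


def keep_hard_examples(examples, base_probs, threshold=THRESHOLD):
    """Keep examples where the base model's probability for the gold label is below threshold."""
    kept = []
    for ex, probs in zip(examples, base_probs):
        if probs[ex["label"]] < threshold:
            kept.append(ex)
    return kept


# Toy inputs with hypothetical probabilities.
examples = [{"id": "a", "label": 1}, {"id": "b", "label": 0}, {"id": "c", "label": 2}]
base_probs = [
    [0.02, 0.95, 0.03],  # confident and correct: dropped
    [0.40, 0.35, 0.25],  # uncertain: kept
    [0.70, 0.20, 0.10],  # confidently wrong: kept
]
hard = keep_hard_examples(examples, base_probs)
print([ex["id"] for ex in hard])  # ['b', 'c']
```

Note that the rule keeps confidently wrong examples as well as uncertain ones, since both have a low probability on the true label.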
Spurious correlation check: A TF-IDF classifier trained on hypotheses alone achieves macro-F1 = 0.34 (≈ chance), confirming that labels are not recoverable from surface form.
Hyperparameters (best config, retrained on train+val):
| Parameter | Value |
|---|---|
| Base model | tasksource/ModernBERT-base-nli |
| Learning rate | 2e-5 |
| Epochs | 1 |
| Batch size | 16 |
| Max length | 2048 |
| Optimizer | AdamW (fused) |
| Precision | bfloat16 + tf32 |
| Parameters | 149,607,171 |
Evaluation
Evaluated on a held-out test set of 2,874 balanced examples (958 per class).
| Metric | Score |
|---|---|
| Macro-F1 | 0.9652 |
| ROC-AUC (OvR) | 0.9964 |
| Brier Score | 0.0549 |
| ECE | 0.0163 |
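For readers unfamiliar with the two calibration metrics, here is a minimal sketch of one common way to compute them. The card does not say which variants were used, so the choices below are assumptions: a multiclass Brier score summed over classes, and a top-label ECE with 10 equal-width bins. The toy probabilities are made up and are not drawn from the test set.

```python
def multiclass_brier(probs, labels):
    """Mean over examples of the summed squared error against the one-hot label."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: weighted gap between confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            ece += len(b) / len(labels) * abs(avg_conf - accuracy)
    return ece


# Toy predictions over (contradiction, entailment, neutral); values are illustrative only.
probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 1]
print(multiclass_brier(probs, labels))
print(expected_calibration_error(probs, labels))
```

Lower is better for both metrics; a perfectly calibrated, perfectly accurate model scores 0 on each.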
Per-class precision, recall, and F1:
| Label | Precision | Recall | F1 |
|---|---|---|---|
| Contradiction | 0.946 | 0.953 | 0.950 |
| Entailment | 0.954 | 0.944 | 0.949 |
| Neutral | 0.996 | 0.999 | 0.997 |
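As a quick consistency check, F1 is the harmonic mean of precision and recall, and macro-F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes both from the table. Because the table values are rounded to three decimals, the results should match the reported figures only to about three decimal places.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the per-class table.
per_class = {
    "contradiction": (0.946, 0.953),
    "entailment": (0.954, 0.944),
    "neutral": (0.996, 0.999),
}
f1_scores = {label: f1(p, r) for label, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for label, score in f1_scores.items():
    print(f"{label}: {score:.3f}")
print(f"macro-F1: {macro_f1:.3f}")
```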
Limitations
- Domain specificity: Fine-tuned on CRS policy text with a specific hypothesis format ({stakeholder}: {effect}). Performance on general NLI benchmarks (SNLI, MNLI, ANLI) is lower than that of the zero-shot base model; this is an expected cost of domain adaptation.
- Hypothesis format: The model expects the stakeholder-prefixed hypothesis format. Inputs that deviate significantly from this format may produce unreliable probabilities.
- English only.
Citation
@misc{ashkinaze2026crs,
author = {Ashkinaze, Joshua},
title = {ModernBERT-NLI fine-tuned on Congressional Research Service policy reports},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/jashkina/modernbert-nli-crs}
}