policyfx-nli

Fine-tuned version of tasksource/ModernBERT-base-nli for natural language inference on Congressional Research Service (CRS) policy reports.

Intended Use

Classifies whether a policy passage entails, contradicts, or is neutral with respect to a stakeholder impact statement.

Input Format

premise    = "The program provides monthly cash payments of $500 to eligible low-income families."
hypothesis = "low-income families: Low-income families receive monthly cash payments of $500 through the program."

The hypothesis follows the format {stakeholder}: {effect statement}. The premise and hypothesis are tokenized as a single sequence pair, and the pair as a whole is truncated to 2048 tokens.
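
As a minimal sketch of the format described above, a small helper can assemble the hypothesis string from its two parts. The function name `make_hypothesis` and its whitespace handling are illustrative assumptions, not part of the released model or its training pipeline.

```python
def make_hypothesis(stakeholder: str, effect: str) -> str:
    """Join a stakeholder and an effect statement as '{stakeholder}: {effect statement}'."""
    stakeholder = stakeholder.strip()
    effect = effect.strip()
    if not stakeholder or not effect:
        raise ValueError("both stakeholder and effect must be non-empty")
    return f"{stakeholder}: {effect}"


# Rebuilds the example hypothesis shown above.
hypothesis = make_hypothesis(
    "low-income families",
    "Low-income families receive monthly cash payments of $500 through the program.",
)
print(hypothesis)
```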

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax

# Load the fine-tuned checkpoint and switch to inference mode.
model_id = "JoshuaAshkinaze/policyfx-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Index order matches config.id2label (see the note on label order).
LABEL_NAMES = ["contradiction", "entailment", "neutral"]

premise    = "The bill will raise taxes."
hypothesis = "Taxpayers: Individual taxpayers will have less disposable income."

# Encode premise and hypothesis as one sequence pair, truncated to 2048 tokens in total.
inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                   truncation=True, max_length=2048)
with torch.no_grad():
    logits = model(**inputs).logits.numpy()

# Convert logits to probabilities and report one score per label.
probs = softmax(logits, axis=-1)[0]
pred  = {label: round(float(p), 4) for label, p in zip(LABEL_NAMES, probs)}
print(pred)

Note on label order: config.id2label is set to {0: "contradiction", 1: "entailment", 2: "neutral"}. Do not remap from the base model's label config.
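
To make the label-order note concrete, the sketch below maps one row of logits to named probabilities through an `id2label` mapping instead of a hardcoded list. The helper `label_probs`, the `ID2LABEL` constant, and the logits are illustrative assumptions. The mapping copies the values stated in the note.

```python
import math

# Assumed copy of config.id2label as stated in this card; with a loaded model,
# read it from model.config.id2label rather than hardcoding it.
ID2LABEL = {0: "contradiction", 1: "entailment", 2: "neutral"}


def label_probs(logits, id2label):
    """Softmax a single row of logits and key the result by label name."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}


probs = label_probs([-1.2, 3.4, 0.1], ID2LABEL)  # made-up logits for illustration
top = max(probs, key=probs.get)
print(top, round(probs[top], 4))
```

With a loaded model, the same helper could be called as `label_probs(logits[0].tolist(), model.config.id2label)`, so the names always come from the checkpoint's own config.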

Training

Data: ~61,600 (premise, stakeholder-effect hypothesis, label) triples generated from ~3,000 CRS reports, covering topics from housing and healthcare to defense and immigration. Train/val/test split is report-level to prevent leakage.
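
A report-level split assigns every example from a given report to the same partition. The sketch below shows one way to do this. The field name `report_id`, the split fractions, and the seed are assumptions for illustration; the card does not state the actual proportions or procedure.

```python
import random
from collections import defaultdict


def split_by_report(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole reports to train/val/test so no report spans two splits."""
    by_report = defaultdict(list)
    for ex in examples:
        by_report[ex["report_id"]].append(ex)

    # Shuffle report IDs (not examples) so the split unit is the report.
    report_ids = sorted(by_report)
    random.Random(seed).shuffle(report_ids)

    n_test = int(len(report_ids) * test_frac)
    n_val = int(len(report_ids) * val_frac)
    assignment = {
        "test": report_ids[:n_test],
        "val": report_ids[n_test:n_test + n_val],
        "train": report_ids[n_test + n_val:],
    }
    return {
        name: [ex for rid in ids for ex in by_report[rid]]
        for name, ids in assignment.items()
    }


# Toy data: 10 hypothetical reports with 3 examples each.
toy = [{"report_id": f"R{r}", "idx": i} for r in range(10) for i in range(3)]
splits = split_by_report(toy, val_frac=0.2, test_frac=0.2)
print({name: len(rows) for name, rows in splits.items()})
```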

Hard example filtering: The base NLI model scored all examples; only those with P(true_label) < 0.80 were retained (~30,000 examples). This removes trivially easy examples and focuses training on genuine domain difficulty.
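
The filtering rule can be sketched as follows, assuming each example already carries the base model's class probabilities. The field names `label` and `base_probs` and the toy probabilities are hypothetical. Only the 0.80 threshold comes from the description above.

```python
def filter_hard(examples, threshold=0.80):
    """Keep examples where the base model's probability for the gold label is below threshold."""
    return [ex for ex in examples if ex["base_probs"][ex["label"]] < threshold]


toy = [
    # Base model is confident and correct: dropped as trivially easy.
    {"id": "a", "label": "entailment",
     "base_probs": {"contradiction": 0.02, "entailment": 0.95, "neutral": 0.03}},
    # Base model is unsure: kept.
    {"id": "b", "label": "neutral",
     "base_probs": {"contradiction": 0.30, "entailment": 0.25, "neutral": 0.45}},
    # Base model is wrong: kept.
    {"id": "c", "label": "contradiction",
     "base_probs": {"contradiction": 0.10, "entailment": 0.70, "neutral": 0.20}},
]

hard = filter_hard(toy)
print([ex["id"] for ex in hard])  # ['b', 'c']
```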

Spurious correlation check: A TF-IDF classifier trained on hypotheses alone achieves macro-F1 = 0.34 (≈ chance), suggesting that labels are not recoverable from the surface form of the hypothesis alone.

Hyperparameters (best config, retrained on train+val):

Parameter	Value
Base model	tasksource/ModernBERT-base-nli
Learning rate	2e-5
Epochs	1
Batch size	16
Max length	2048
Optimizer	AdamW (fused)
Precision	bfloat16 + tf32
Parameters	149,607,171

Evaluation

Evaluated on a held-out test set of 2,874 balanced examples (958 per class).

Metric	Score
Macro-F1	0.9652
ROC-AUC (OvR)	0.9964
Brier Score	0.0549
ECE	0.0163
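
For readers unfamiliar with the two calibration metrics, the sketch below shows one common way to compute them. The conventions are assumptions: the Brier score sums squared error over classes before averaging over examples, and ECE uses top-label confidence with 10 equal-width bins. The card does not state which variants or bin count produced the reported numbers, so this code is not expected to reproduce them exactly.

```python
def brier_score(probs, labels):
    """Multiclass Brier: mean over examples of summed squared error against the one-hot label."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: weighted gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[b].append((conf, 1.0 if pred == y else 0.0))

    n = len(probs)
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            acc = sum(a for _, a in items) / len(items)
            ece += len(items) / n * abs(avg_conf - acc)
    return ece


# Made-up predictions over (contradiction, entailment, neutral); labels are class indices.
toy_probs = [[0.92, 0.04, 0.04], [0.10, 0.85, 0.05], [0.20, 0.25, 0.55], [0.65, 0.30, 0.05]]
toy_labels = [0, 1, 2, 1]
print(brier_score(toy_probs, toy_labels), expected_calibration_error(toy_probs, toy_labels))
```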

Per-class metrics:

Label	Precision	Recall	F1
Contradiction	0.946	0.953	0.950
Entailment	0.954	0.944	0.949
Neutral	0.996	0.999	0.997
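
As a quick consistency check, F1 can be recomputed from the rounded precision and recall in the table using F1 = 2PR / (P + R), and macro-F1 as the unweighted mean of the three values. Because the inputs are already rounded to three decimals, the recomputed figures should agree with the reported ones only up to rounding.

```python
# Precision and recall copied from the per-class table above.
rows = {
    "Contradiction": (0.946, 0.953),
    "Entailment": (0.954, 0.944),
    "Neutral": (0.996, 0.999),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class = {label: f1(p, r) for label, (p, r) in rows.items()}
macro_f1 = sum(per_class.values()) / len(per_class)

for label, value in per_class.items():
    print(f"{label}: {value:.4f}")
print(f"Macro-F1: {macro_f1:.4f}")
```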

Limitations

Domain specificity: Fine-tuned on CRS policy text with a specific hypothesis format ({stakeholder}: {effect}). Performance on general NLI benchmarks (SNLI, MNLI, ANLI) is lower than that of the zero-shot base model, which is an expected cost of domain adaptation.
Hypothesis format: The model expects the stakeholder-prefixed hypothesis format ({stakeholder}: {effect statement}). Inputs that deviate significantly from this format may produce unreliable probabilities.
English only.

Citation

@misc{ashkinaze2026crs,
  author    = {Ashkinaze, Joshua},
  title     = {ModernBERT-NLI fine-tuned on Congressional Research Service policy reports},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jashkina/modernbert-nli-crs}
}
