policyfx-nli

Fine-tuned version of tasksource/ModernBERT-base-nli for natural language inference on Congressional Research Service (CRS) policy reports.

Intended Use

Classifies whether a policy passage entails, contradicts, or is neutral with respect to a stakeholder impact statement.

Input Format

premise    = "The program provides monthly cash payments of $500 to eligible low-income families."
hypothesis = "low-income families: Low-income families receive monthly cash payments of $500 through the program."

The hypothesis follows the format {stakeholder}: {effect statement}. The premise and hypothesis are tokenized as a pair, and the pair is truncated to 2048 tokens in total.
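A minimal sketch of building a hypothesis string in the stakeholder-prefixed format described above. The helper name `format_hypothesis` is hypothetical and not part of the released model or its tooling; it only joins the two parts with a colon and a space.

```python
def format_hypothesis(stakeholder: str, effect: str) -> str:
    """Join a stakeholder and an effect statement as '{stakeholder}: {effect}'."""
    stakeholder = stakeholder.strip()
    effect = effect.strip()
    if not stakeholder or not effect:
        raise ValueError("stakeholder and effect must both be non-empty")
    return f"{stakeholder}: {effect}"


# Reproduces the example hypothesis from this section.
hypothesis = format_hypothesis(
    "low-income families",
    "Low-income families receive monthly cash payments of $500 through the program.",
)
print(hypothesis)
```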

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from scipy.special import softmax

model_id = "JoshuaAshkinaze/policyfx-nli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

LABEL_NAMES = ["contradiction", "entailment", "neutral"]

premise    = "The bill will raise taxes."
hypothesis = "Taxpayers: Individual taxpayers will have less disposable income."

inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                   truncation=True, max_length=2048)
with torch.no_grad():
    logits = model(**inputs).logits.numpy()

probs = softmax(logits, axis=-1)[0]
pred  = {label: round(float(p), 4) for label, p in zip(LABEL_NAMES, probs)}
print(pred)

Note on label order: config.id2label is set to {0: "contradiction", 1: "entailment", 2: "neutral"}. Use this mapping as-is; do not remap labels according to the base model's label config.
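One way to avoid hard-coding the label order is to read it from the id2label mapping. The sketch below uses a plain dict with the mapping stated above and made-up probabilities, so it runs without downloading the model; the helper name `label_probabilities` is hypothetical.

```python
def label_probabilities(probs, id2label):
    """Pair each class probability with its label name via an id2label mapping."""
    # Keys can be ints or strings depending on how the mapping was loaded.
    mapping = {int(k): v for k, v in id2label.items()}
    return {mapping[i]: round(float(p), 4) for i, p in enumerate(probs)}


id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}  # mapping stated above
example_probs = [0.05, 0.90, 0.05]  # made-up values, for illustration only
print(label_probabilities(example_probs, id2label))
```

With the loaded model, passing `model.config.id2label` in place of the dict should give the same pairing.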

Training

Data: ~61,600 (premise, stakeholder-effect hypothesis, label) triples generated from ~3,000 CRS reports, covering topics from housing and healthcare to defense and immigration. Train/val/test split is report-level to prevent leakage.
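A report-level split can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the dataset: the `report_id` field, the 10% validation and test fractions, and the seed are all hypothetical. The point is that whole reports are assigned to a split, so no report contributes examples to more than one split.

```python
import random


def split_by_report(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole reports to train/val/test so no report spans two splits."""
    report_ids = sorted({ex["report_id"] for ex in examples})
    random.Random(seed).shuffle(report_ids)

    n_test = int(len(report_ids) * test_frac)
    n_val = int(len(report_ids) * val_frac)
    test_ids = set(report_ids[:n_test])
    val_ids = set(report_ids[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for ex in examples:
        if ex["report_id"] in test_ids:
            splits["test"].append(ex)
        elif ex["report_id"] in val_ids:
            splits["val"].append(ex)
        else:
            splits["train"].append(ex)
    return splits


# Toy data: 10 hypothetical reports with 5 examples each.
toy_examples = [{"report_id": f"R{i % 10}", "text": f"example {i}"} for i in range(50)]
toy_splits = split_by_report(toy_examples)
print({name: len(part) for name, part in toy_splits.items()})
```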

Hard example filtering: The base NLI model scored all examples; only those with P(true_label) < 0.80 were retained (~30,000 examples). This removes trivially easy examples and focuses training on genuine domain difficulty.
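The filtering rule above can be sketched like this, assuming the base model's class probabilities have already been computed for each example. The function name, the example records, and the probabilities are made up for illustration; only the 0.80 threshold comes from the description above.

```python
def keep_hard_examples(examples, base_probs, threshold=0.80):
    """Keep examples whose base-model probability on the true label is below threshold."""
    kept = []
    for ex, probs in zip(examples, base_probs):
        if probs[ex["label"]] < threshold:
            kept.append(ex)
    return kept


# Label ids follow the order contradiction=0, entailment=1, neutral=2.
toy_examples = [
    {"id": "a", "label": 1},
    {"id": "b", "label": 0},
    {"id": "c", "label": 2},
]
toy_base_probs = [
    [0.02, 0.95, 0.03],  # confident and correct: dropped as trivially easy
    [0.40, 0.35, 0.25],  # uncertain: kept
    [0.70, 0.20, 0.10],  # confidently wrong: kept
]
hard = keep_hard_examples(toy_examples, toy_base_probs)
print([ex["id"] for ex in hard])
```

Note that the rule keeps examples the base model gets wrong as well as those it gets right with low confidence.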

Spurious correlation check: A TF-IDF classifier trained on hypotheses alone achieves macro-F1 = 0.34 (≈ chance), confirming that labels are not recoverable from surface form.
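A hypothesis-only baseline of this kind might look like the sketch below. The card does not say which classifier or TF-IDF settings were used, so the logistic regression and the default vectorizer here are assumptions, and the toy hypotheses and labels are made up. With three balanced classes, a macro-F1 near 1/3 indicates the hypotheses alone carry little label signal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def hypothesis_only_macro_f1(train_hyps, train_labels, test_hyps, test_labels):
    """Train on hypotheses alone (no premise) and return macro-F1 on the test set."""
    # Assumed classifier: TF-IDF features into logistic regression.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_hyps, train_labels)
    preds = clf.predict(test_hyps)
    return f1_score(test_labels, preds, average="macro")


# Made-up toy data; real use would pass the dataset's hypotheses and labels.
toy_train_hyps = [
    "Taxpayers: Taxes rise.",
    "Renters: Rents fall.",
    "Veterans: Benefits expand.",
    "Farmers: Subsidies end.",
    "Students: Loans are forgiven.",
    "Hospitals: Funding is cut.",
]
toy_train_labels = [0, 1, 2, 0, 1, 2]
toy_test_hyps = ["Taxpayers: Taxes fall.", "Renters: Rents rise.", "Veterans: Benefits end."]
toy_test_labels = [1, 0, 2]

score = hypothesis_only_macro_f1(
    toy_train_hyps, toy_train_labels, toy_test_hyps, toy_test_labels
)
print(round(score, 2))
```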

Hyperparameters (best config, retrained on train+val):

Parameter       Value
Base model      tasksource/ModernBERT-base-nli
Learning rate   2e-5
Epochs          1
Batch size      16
Max length      2048
Optimizer       AdamW (fused)
Precision       bfloat16 + tf32
Parameters      149,607,171

Evaluation

Evaluated on a held-out test set of 2,874 balanced examples (958 per class).

Metric          Score
Macro-F1        0.9652
ROC-AUC (OvR)   0.9964
Brier Score     0.0549
ECE             0.0163

Per-class precision, recall, and F1:

Label           Precision   Recall   F1
Contradiction   0.946       0.953    0.950
Entailment      0.954       0.944    0.949
Neutral         0.996       0.999    0.997
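For readers who want to compute calibration metrics on their own data, here is a sketch of a multi-class Brier score and expected calibration error (ECE). The card does not state which Brier variant or how many ECE bins were used, so the definitions below (squared error summed over classes, 10 equal-width confidence bins) are assumptions and may not reproduce the reported numbers exactly.

```python
import numpy as np


def brier_score(probs, labels):
    """Mean over examples of the squared error summed over classes."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between accuracy and confidence across confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)  # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of examples in the bin
    return float(ece)


# Made-up probabilities: both predictions say class 0 with confidence 0.8,
# but only the first is correct.
toy_probs = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]
toy_labels = [0, 1]
print(brier_score(toy_probs, toy_labels), expected_calibration_error(toy_probs, toy_labels))
```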

Limitations

  • Domain specificity: Fine-tuned on CRS policy text with a specific hypothesis format ({stakeholder}: {effect}). Performance on general NLI benchmarks (SNLI, MNLI, ANLI) is lower than that of the zero-shot base model; this is an expected cost of domain adaptation.
  • Hypothesis format: The model expects the stakeholder-prefixed hypothesis format. Inputs that deviate significantly from this format may produce unreliable probabilities.
  • English only.
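Because of the format sensitivity noted above, it may help to check inputs before scoring them. The sketch below is a rough heuristic and not part of the model: the regular expression and the function name are hypothetical, and it only checks for a non-empty stakeholder, a colon, and a non-empty effect statement.

```python
import re

# Hypothetical pattern: non-empty stakeholder, a colon, whitespace, non-empty effect.
_HYPOTHESIS_RE = re.compile(r"^\s*[^:\s][^:]*:\s+\S.*$")


def looks_like_stakeholder_hypothesis(text: str) -> bool:
    """Return True if text roughly matches '{stakeholder}: {effect}'."""
    return bool(_HYPOTHESIS_RE.match(text))


for candidate in [
    "Taxpayers: Individual taxpayers will have less disposable income.",
    "Individual taxpayers will have less disposable income.",
]:
    print(looks_like_stakeholder_hypothesis(candidate), "|", candidate)
```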

Citation

@misc{ashkinaze2026crs,
  author    = {Ashkinaze, Joshua},
  title     = {ModernBERT-NLI fine-tuned on Congressional Research Service policy reports},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jashkina/modernbert-nli-crs}
}