NPC Reason 1.5B

A math-reasoning model whose every load-bearing arithmetic step emits a mechanically-checkable assertion in the form <<EXPR = RESULT>>. Specialized from DeepSeek-R1-Distill-Qwen-1.5B (MIT). The point is not just a final answer. It is that a pure-code checker can re-execute every step and confirm the chain, so "verifiable-rate" is not the model's opinion. Anyone can run the checker.

Results (frozen held-out eval, n=500, GSM8K + MATH-500, greedy, format prompt)

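| metric               | base R1-Distill | SFT (V4 distill) | NPC Reason (RL) |
|----------------------|-----------------|------------------|-----------------|
| verifiable-rate      | 0.0%            | 76.8%            | 76.2%           |
| accuracy             | 61.6%           | 65.8%            | 66.6%           |
| verified-and-correct | 0.0%            | 58.0%            | 59.6%           |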

verified-and-correct (both axes) is the headline. The full arc is shown on purpose, not just the best column.
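As a minimal sketch of how the three metrics relate, assume each evaluated problem yields two independent booleans, one per axis. The `Record` class and `summarize` function are hypothetical names for illustration and are not part of the shipped evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Record:
    verifiable: bool  # every load-bearing step re-executed and the answer composed
    correct: bool     # final answer equals the gold answer


def summarize(records):
    """Return the three reported metrics as percentages of all records."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")

    def pct(count):
        return 100.0 * count / n

    return {
        "verifiable-rate": pct(sum(r.verifiable for r in records)),
        "accuracy": pct(sum(r.correct for r in records)),
        # The headline counts only chains that pass on BOTH axes.
        "verified-and-correct": pct(sum(r.verifiable and r.correct for r in records)),
    }


demo = [
    Record(True, True),
    Record(True, False),   # checks out step by step, but the final answer is wrong
    Record(False, True),   # right answer, but the chain does not verify
    Record(False, False),
]
print(summarize(demo))
```

Because the headline requires both flags, it can never exceed either of the other two rates.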

What actually happened (plain language, no overclaiming)

  • The base model produces zero mechanically-verifiable chains, even when asked for the format. Only 1 of 500 base outputs even contained a << marker.
  • The SFT distillation did the heavy lifting: verifiable-rate went from 0% to 76.8%. Training on a corpus of DeepSeek-V4 chains that the frozen verifier confirmed (verifiable AND correct; 7,546 kept of 13,245 generated) transferred the grounding. Accuracy also rose (61.6% to 65.8%), so verifiability was not bought by sacrificing correctness.
  • RLVR/GRPO against the frozen verifier was a stable refinement, not a decisive gain. It moved verified-and-correct +1.6pp (58.0% to 59.6%) and accuracy +0.8pp (65.8% to 66.6%), while verifiable-rate stayed essentially flat (-0.6pp, 76.8% to 76.2%). On n=500, the +1.6pp headline gain is 8 problems. The RL model and the SFT model are statistically about even.
  • The shipped model is the RL checkpoint (marginally best on the headline). The SFT model is statistically equivalent and available as a fallback. Either is a defensible "NPC Reason".
  • The pre-registered 90% verifiable bar was NOT met (stuck near 77%). It was deliberately not chased into instability. This is the open limitation and the next frontier.
  • Accuracy includes the greedy no-answer floor. The same greedy decoding is used for base and tuned models, so the comparisons are apples-to-apples.
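The SFT-to-RL deltas and the "about even" reading can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the results table. The binomial standard error is an unpaired approximation added here for illustration; a paired test on per-problem outcomes would be more appropriate, but it needs per-problem data that this card does not include.

```python
import math

N = 500  # size of the frozen held-out eval

# Percentages copied from the results table.
sft = {"verifiable-rate": 76.8, "accuracy": 65.8, "verified-and-correct": 58.0}
rl = {"verifiable-rate": 76.2, "accuracy": 66.6, "verified-and-correct": 59.6}


def delta_pp(before, after):
    """Difference in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def as_problems(pp, n=N):
    """Convert a percentage-point change into a count of problems."""
    return round(pp / 100 * n)


def binomial_se_pp(rate_pct, n=N):
    """Standard error of one measured rate, in percentage points."""
    p = rate_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


for metric in sft:
    d = delta_pp(sft[metric], rl[metric])
    print(f"{metric}: {d:+.1f}pp ({as_problems(d):+d} problems)")

se = binomial_se_pp(rl["verified-and-correct"])
print(f"standard error of the headline rate: about {se:.1f}pp")
```

The standard error of a single rate near 60% at n=500 is a little over 2pp, so the 1.6pp headline gap sits inside one standard error.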

The verifier (the methodological core)

A pure-code Python/SymPy checker, frozen at sha256 d5d146cf..., used as BOTH the evaluation metric AND the RL reward (the same bytes in both roles). A chain is VERIFIABLE iff every load-bearing <<EXPR = RESULT>> assertion re-executes correctly AND the final answer composes from the last step. Correctness (final == gold) is a separate, independent axis. The verifier is shipped with the model (verifier/step_verifier.py); users run it on the model's own outputs.
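To make the re-execution idea concrete, here is a minimal standard-library sketch. It is not the shipped verifier/step_verifier.py, which uses SymPy and also decides which steps are load-bearing and whether the final answer composes from the last step. This toy version handles plain arithmetic only, and it assumes a chain with no assertions counts as not verifiable. The names `STEP`, `_eval`, and `check_chain` are hypothetical.

```python
import ast
import operator
import re
from fractions import Fraction

# Matches <<EXPR = RESULT>>; assumes neither side contains '<', '>' or '='.
STEP = re.compile(r"<<([^<>=]+)=([^<>=]+)>>")

_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a restricted arithmetic AST exactly, using Fractions."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        # str() keeps decimal literals such as 0.1 exact (1/10).
        return Fraction(str(node.value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        return _BINOPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")


def check_chain(text):
    """Return (all_steps_ok, [(expr, claimed, ok), ...]) for one model output."""
    results = []
    for expr, claimed in STEP.findall(text):
        expr, claimed = expr.strip(), claimed.strip()
        try:
            lhs = _eval(ast.parse(expr, mode="eval"))
            rhs = _eval(ast.parse(claimed, mode="eval"))
            ok = lhs == rhs
        except (ValueError, SyntaxError, ZeroDivisionError):
            ok = False
        results.append((expr, claimed, ok))
    # Assumption: zero assertions means there is nothing to verify.
    return bool(results) and all(r[2] for r in results), results


good = "3 boxes of 12 pens is <<3 * 12 = 36>> pens; giving away 5 leaves <<36 - 5 = 31>>."
bad = "Half of 7 is <<7 / 2 = 3>>."
print(check_chain(good)[0], check_chain(bad)[0])
```

Exact rational arithmetic avoids false failures from floating-point rounding, which is one reason a checker of this kind would not compare floats directly.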

Methods finding worth keeping

GRPO with this hard, frozen, pure-code verifier reward trained STABLY: KL stayed flat (~0.0002), with no length runaway, no mode collapse, and no early-stop trip. This is notable because prior RL attempts in related work were unstable. In this run, RLVR with a clean verifier reward was a regime where a small model trained without collapsing. The lift was small, but the stability is the result worth keeping.
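For readers unfamiliar with GRPO, the core step is to score a group of sampled completions for the same prompt and normalize each reward against its own group. The sketch below shows that normalization under two assumptions that this card does not confirm: the reward is binary (1 if the verifier accepts the chain, 0 otherwise), and the population standard deviation is used. The function name and the `eps` value are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples.

    Each advantage is (reward - group mean) / (group std + eps), so a sample
    is credited only for doing better than its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions for one problem; 1 = accepted by the verifier.
mixed = group_relative_advantages([1, 0, 0, 1])
uniform = group_relative_advantages([1, 1, 1, 1])
print(mixed)
print(uniform)
```

When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.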

Intended use and limits

  • Use: math problems where checkable, grounded reasoning steps matter (arithmetic and arithmetic-reducible word problems). Prompt for the <<EXPR = RESULT>> format (see USAGE.md).
  • Math-first. Logic, proofs, and general chain-of-thought are NOT claimed and are future work.
  • Not a general chat model. The 1.5B reasoning ceiling applies. ~23% of format-prompt chains are still not fully verifiable (the unverified tail).
  • Simulation/research artifact. Verify outputs with the included checker before relying on them.

Lineage and license

  • Base: DeepSeek-R1-Distill-Qwen-1.5B (MIT).
  • Training chains distilled from DeepSeek V4 (distillation permitted; output rights assigned to the user). Released under MIT to match the base.
  • Attribution: Rama Krishna Bachu, Bottensor (Independent Research). ORCID 0009-0000-1298-0681.

Reproducibility

The pre-registration was frozen BEFORE any training and an honest-null clause was in force. Frozen references:

  • Verifier: VERIFIER.lock sha256 d5d146cf...
  • Eval set: EVAL.lock sha256 e1573cab...
  • Pre-registration: PREREG.lock sha256 b5a49437...
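A frozen reference can be checked by hashing the local file and comparing it with the recorded digest. The sketch below assumes each lock records the SHA-256 of the corresponding artifact (for example, verifier/step_verifier.py for VERIFIER.lock); that mapping is an assumption, not something this card states. Only truncated digests appear here, so the helper compares a prefix. Take the full digest from the lock file itself for a real check.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path, expected_prefix):
    """True if the file's digest starts with the given hex prefix."""
    return sha256_hex(path).startswith(expected_prefix.lower())
```

For example, `matches_prefix("verifier/step_verifier.py", "d5d146cf")` would return True if that file is the artifact the lock refers to. An 8-character prefix is a sanity check and is much weaker than comparing the full digest.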

GGUF quantizations are provided with a per-quant decision-fidelity check against the bf16 model (see gguf_fidelity.md). Pick the recommended quant rather than choosing on file size alone.
