# Quotebound 27B

A 27B LoRA adapter for evidence-faithful reasoning over closed packets of source text.

Quotebound 27B is the standalone model release from the Evidence-Faithful Reasoning project. It is trained to read a bounded evidence packet, identify the supporting units, copy exact quotes, and abstain with "Insufficient evidence." when the packet does not justify an answer.

The project asks a stricter question than "did the model get the answer right?" It asks whether the answer is recoverably grounded in the supplied text.

## Fresh public holdout: Quotebound 27B versus the prior bridge model

On a fresh 36-task public holdout, Quotebound 27B improves task accuracy, evidence F1, and quote F1 over the prior bridge model. The largest raw gain is quote faithfulness: 0.3343 → 0.6815.

## Result snapshot

| Question | Answer |
| --- | --- |
| What ships here? | A PEFT/LoRA adapter for `Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`. |
| What changed inside the model? | Raw quote F1 roughly doubled on the fresh public holdout: 0.3343 → 0.6815. |
| Best standalone-system row on that holdout | Quotebound + deterministic_v3: task 0.8889, strict 0.5833, evidence F1 0.9093, quote F1 0.9093. |
| Output reliability | Zero invalid outputs across every reported evaluation surface. |
| Important boundary | Perfect probe_v0 belongs to the benchmark-winning hybrid stack, not to this adapter alone. |

## Why this model exists

Reasoning-tuned models can sound structured while grounding badly: they may answer correctly but cite the wrong evidence, corrupt a quote, or keep going when the packet is actually insufficient.

Quotebound 27B is trained for a narrower, auditable behavior:

  1. choose the smallest sufficient evidence units,
  2. quote those units verbatim,
  3. answer only from those units,
  4. refuse cleanly when the packet runs out.

Correctness alone is not credited; the model is meant for settings where a user needs the answer and its support to survive inspection together.
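As a concrete illustration of "surviving inspection together", a strict gate could credit an output only when the answer, the evidence set, and the quotes all hold at once. This is a hypothetical sketch; the project's actual strict-contract gates are defined in its repository, not in this card:

```python
def strict_pass(pred: dict, gold: dict, packet: dict[str, str]) -> bool:
    """Hypothetical strict gate, NOT the project's published contract:
    credit an output only when answer, evidence, and quotes all hold."""
    answer_ok = pred["answer"].strip() == gold["answer"].strip()
    evidence_ok = set(pred["evidence_ids"]) == set(gold["evidence_ids"])
    quotes_ok = all(
        q["quote"] in packet.get(q["unit_id"], "") for q in pred["quotes"]
    )
    return answer_ok and evidence_ok and quotes_ok
```

Under a gate like this, a correct answer with the wrong citation or a corrupted quote scores zero, which is why the strict column sits well below task accuracy in the tables further down.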

## Quick start

Install the usual Transformers + PEFT stack, then load the base model and attach the adapter:

```bash
pip install -U transformers peft accelerate bitsandbytes
```

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
```

The base is a 27B-parameter model. Use the quantization and serving setup your hardware requires; 4-bit loading with bitsandbytes is a practical inference path on constrained GPUs.
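As one example, here is a minimal 4-bit loading sketch; the `BitsAndBytesConfig` values are common defaults, not settings published by this project:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

# NF4 quantization with bf16 compute and double quantization: common
# bitsandbytes defaults for large models, not project-tuned values.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
```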

## Model details

| Field | Value |
| --- | --- |
| Adapter | `darcar0/quotebound-27b` |
| Base model | `Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2` |
| Artifact type | LoRA / PEFT adapter |
| Primary behavior | Closed-packet grounded QA, claim verification, exact quote attribution, and abstention |
| Output style | JSON with answer, evidence IDs, verbatim quotes, and a short justification |
| Training sources | Public FEVER-style verify-claim data, public HotpotQA-style grounded-QA data, and project-local packet scaffolding derived from those sources |
| License | Apache 2.0 |

## Prompt contract

The model is trained for an evidence-first prompt that makes the answer subordinate to the cited text. A minimal version:

```text
You are answering from a bounded evidence packet only.

Work in this order:
1. Identify the smallest set of packet units that matters.
2. Copy exact quote(s) from those units.
3. Only then give the final answer.

Rules:
- No outside facts.
- Return valid JSON only.
- Every quote must be a verbatim substring of the cited unit.
- Do not paraphrase, ellipsize, or stitch quotes.
- If the packet is insufficient, the answer field must be exactly
  "Insufficient evidence."
```

Expected output shape:

```json
{
  "task_id": "<task id>",
  "label": "support|contradict|insufficient|null",
  "answer": "<one-sentence answer>",
  "evidence_ids": ["unit_id_1", "unit_id_2"],
  "quotes": [
    {"unit_id": "unit_id_1", "quote": "<exact quote>"}
  ],
  "justification": "<one short sentence tied to the cited evidence>"
}
```
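A minimal sketch of running the contract end to end, assuming the quick-start `model` and `tokenizer` are loaded and the base tokenizer ships a chat template. The example packet, the `unit_id: text` serialization, and the generation settings are illustrative assumptions, not a published project convention:

```python
import json

# Contract text from above, abbreviated here for space.
SYSTEM_PROMPT = "You are answering from a bounded evidence packet only. ..."

# Hypothetical packet: unit IDs mapped to source-text units.
packet = {
    "unit_1": "The Eiffel Tower was completed in 1889.",
    "unit_2": "It stands on the Champ de Mars in Paris.",
}
question = "When was the Eiffel Tower completed?"

# Assumed serialization: one "unit_id: text" line per evidence unit.
packet_text = "\n".join(f"{uid}: {text}" for uid, text in packet.items())
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Packet:\n{packet_text}\n\nQuestion: {question}"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
reply = tokenizer.decode(
    output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
)

# Enforce the contract mechanically: every quote must be a verbatim
# substring of the unit it cites.
result = json.loads(reply)
for q in result["quotes"]:
    assert q["quote"] in packet[q["unit_id"]], "quote is not verbatim"
```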

## Evaluation

### Fresh 36-task mixed public holdout

The main standalone comparison uses a fresh 36-task public holdout: 18 FEVER verify-claim tasks and 18 HotpotQA grounded-QA tasks. Source rows were de-duplicated against training, dev, and probe_v0 rows.

| Stack | Task | Strict | Evidence F1 | Quote F1 |
| --- | --- | --- | --- | --- |
| Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
| Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
| Bridge + deterministic_v3 | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
| Quotebound + deterministic_v3 | 0.8889 | 0.5833 | 0.9093 | 0.9093 |

How to read this table:

- Raw rows measure the model outputs before quote repair.
- deterministic_v3 rows add the packet-local quote normalizer from the project repository (see the toy sketch after this list).
- Quotebound improves task accuracy, evidence F1, and quote F1 in both raw and normalized form; it also ties normalized strict success.
- The largest model-side gain is raw quote faithfulness, from 0.3343 to 0.6815.
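deterministic_v3 itself lives in the project repository and is not shipped with this adapter; its exact rules are not reproduced in this card. As a toy illustration of what a packet-local quote normalizer can do, the hypothetical helper below snaps a near-miss quote back to an exact substring of its cited unit:

```python
import difflib

def normalize_quote(quote: str, unit_text: str) -> str:
    """Toy packet-local repair, NOT the deterministic_v3 algorithm:
    if a quote is not a verbatim substring of the cited unit, fall
    back to the longest span the quote and the unit share exactly."""
    if quote in unit_text:
        return quote  # already verbatim; nothing to repair
    matcher = difflib.SequenceMatcher(None, unit_text, quote, autojunk=False)
    m = matcher.find_longest_match(0, len(unit_text), 0, len(quote))
    return unit_text[m.a : m.a + m.size] if m.size else quote
```

Because a repair step like this only ever returns substrings of the packet, it can raise quote F1 without introducing any text the packet does not contain.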

### Fixed dev triage slice

| Stack | Task | Strict | Evidence F1 | Quote F1 |
| --- | --- | --- | --- | --- |
| Quotebound + deterministic_v3 | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

### Untouched 104-task HotpotQA shadow slice

On a 104-task HotpotQA shadow slice that was never touched during selection, Quotebound raw improved quote-faithful behavior over the prior bridge model. Quotebound plus deterministic_v3 matched bridge plus deterministic_v3 at the system level. This surface is reported as a narrative parity result because the freeze memo does not publish per-metric cells for it.

## Release architecture

The project ends in two finished results that are intentionally reported separately:

| Result | What it is | What it proves |
| --- | --- | --- |
| Quotebound 27B | The downloadable LoRA adapter on this page. | More of the evidence-faithful behavior moved into the model itself, with gains across non-probe_v0 surfaces. |
| Benchmark-winning hybrid stack | A trained bridge checkpoint plus the deterministic_v3 packet-local quote normalizer. | The full system clears every gate of the strict contract on frozen held-out probe_v0. |

These are connected, but they are not the same claim. Quotebound 27B is the standalone model release. The hybrid stack is the benchmark-facing winner. Perfect probe_v0 belongs to the hybrid stack, not to this adapter alone.

## Intended use

Use this release when answers must stay inside a fixed body of supplied text:

- bounded document QA with explicit evidence requirements,
- claim verification over closed packets of source text,
- policy, compliance, contract, and internal-document review where answers need source-text support,
- research on evidence-faithful reasoning, quote fidelity, and abstention.

## Limitations

- This is not a general chatbot. Open-domain QA, open chat, and free-form generation outside the closed-packet setup are not characterized.
- The downloadable artifact is the LoRA adapter only; the 27B base model is required.
- deterministic_v3 is not shipped as part of this model repo. It is a separate packet-local post-processing step in the project repository.
- Perfect probe_v0 belongs to the benchmark-winning hybrid stack, not to this adapter alone.
- Raw item-level contents of the frozen held-out probe are intentionally not published; the held-out gate has to stay closed to remain meaningful.
- For high-stakes use, treat the model as an evidence-grounding component that still requires human review and application-specific validation.

## Citation

```bibtex
@misc{quotebound_27b_2026,
  title        = {Quotebound 27B: Evidence-Faithful Reasoning Standalone Release},
  author       = {{darcar0}},
  year         = {2026},
  howpublished = {Hugging Face model release},
  url          = {https://huggingface.co/darcar0/quotebound-27b}
}
```
