# Quotebound 27B

A 27B LoRA adapter for evidence-faithful reasoning over closed packets of source text.

Quotebound 27B is the standalone model release from the Evidence-Faithful Reasoning project. It is trained to read a bounded evidence packet, identify the supporting units, copy exact quotes, and abstain with "Insufficient evidence." when the packet does not justify an answer.

The project asks a stricter question than "did the model get the answer right?" It asks whether the answer is recoverably grounded in the supplied text.

## Fresh public holdout: Quotebound 27B versus the prior bridge model

On a fresh 36-task public holdout, Quotebound 27B improves task accuracy, evidence F1, and quote F1 over the prior bridge model. The largest raw gain is quote faithfulness: 0.3343 → 0.6815.

## Result snapshot

| Question | Answer |
| --- | --- |
| What ships here? | A PEFT/LoRA adapter for `Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2`. |
| What changed inside the model? | Raw quote F1 roughly doubled on the fresh public holdout: 0.3343 → 0.6815. |
| Best standalone-system row on that holdout | Quotebound + deterministic_v3: task 0.8889, strict 0.5833, evidence F1 0.9093, quote F1 0.9093. |
| Output reliability | Zero invalid outputs across every reported evaluation surface. |
| Important boundary | Perfect probe_v0 belongs to the benchmark-winning hybrid stack, not to this adapter alone. |

## Why this model exists

Reasoning-tuned models can sound structured while grounding badly: they may answer correctly but cite the wrong evidence, corrupt a quote, or keep going when the packet is actually insufficient.

Quotebound 27B is trained for a narrower, auditable behavior:

  1. choose the smallest sufficient evidence units,
  2. quote those units verbatim,
  3. answer only from those units,
  4. refuse cleanly when the packet runs out.

Correctness alone is not credited; the model is meant for settings where a user needs the answer and its support to survive inspection together.
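As a concrete illustration of "surviving inspection together", a strict gate could credit an output only when the answer, the evidence set, and the quotes all hold at once. This is a hypothetical sketch; the project's actual strict-contract gates are defined in its repository, not in this card:

```python
def strict_pass(pred: dict, gold: dict, packet: dict[str, str]) -> bool:
    """Hypothetical strict gate, NOT the project's published contract:
    credit an output only when answer, evidence, and quotes all hold."""
    answer_ok = pred["answer"].strip() == gold["answer"].strip()
    evidence_ok = set(pred["evidence_ids"]) == set(gold["evidence_ids"])
    quotes_ok = all(
        q["quote"] in packet.get(q["unit_id"], "") for q in pred["quotes"]
    )
    return answer_ok and evidence_ok and quotes_ok
```

Under a gate like this, a correct answer with the wrong citation or a corrupted quote scores zero, which is why the strict column sits well below task accuracy in the tables further down.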

## Quick start

Install the usual Transformers + PEFT stack, then load the base model and attach the adapter:

```bash
pip install -U transformers peft accelerate bitsandbytes
```

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
```

The base is a 27B-parameter model. Use the quantization and serving setup your hardware requires; 4-bit loading with bitsandbytes is a practical inference path on constrained GPUs.
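As one example, here is a minimal 4-bit loading sketch; the `BitsAndBytesConfig` values are common defaults, not settings published by this project:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2"
adapter_id = "darcar0/quotebound-27b"

# NF4 quantization with bf16 compute and double quantization: common
# bitsandbytes defaults for large models, not project-tuned values.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
```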

## Model details

| Field | Value |
| --- | --- |
| Adapter | `darcar0/quotebound-27b` |
| Base model | `Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2` |
| Artifact type | LoRA / PEFT adapter |
| Primary behavior | Closed-packet grounded QA, claim verification, exact quote attribution, and abstention |
| Output style | JSON with answer, evidence IDs, verbatim quotes, and a short justification |
| Training sources | Public FEVER-style verify-claim data, public HotpotQA-style grounded-QA data, and project-local packet scaffolding derived from those sources |
| License | Apache 2.0 |

## Prompt contract

The model is trained for an evidence-first prompt that makes the answer subordinate to the cited text. A minimal version:

```text
You are answering from a bounded evidence packet only.

Work in this order:
1. Identify the smallest set of packet units that matters.
2. Copy exact quote(s) from those units.
3. Only then give the final answer.

Rules:
- No outside facts.
- Return valid JSON only.
- Every quote must be a verbatim substring of the cited unit.
- Do not paraphrase, ellipsize, or stitch quotes.
- If the packet is insufficient, the answer field must be exactly
  "Insufficient evidence."
```

Expected output shape:

```json
{
  "task_id": "<task id>",
  "label": "support|contradict|insufficient|null",
  "answer": "<one-sentence answer>",
  "evidence_ids": ["unit_id_1", "unit_id_2"],
  "quotes": [
    {"unit_id": "unit_id_1", "quote": "<exact quote>"}
  ],
  "justification": "<one short sentence tied to the cited evidence>"
}
```
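A minimal sketch of running the contract end to end, assuming the quick-start `model` and `tokenizer` are loaded and the base tokenizer ships a chat template. The example packet, the `unit_id: text` serialization, and the generation settings are illustrative assumptions, not a published project convention:

```python
import json

# Contract text from above, abbreviated here for space.
SYSTEM_PROMPT = "You are answering from a bounded evidence packet only. ..."

# Hypothetical packet: unit IDs mapped to source-text units.
packet = {
    "unit_1": "The Eiffel Tower was completed in 1889.",
    "unit_2": "It stands on the Champ de Mars in Paris.",
}
question = "When was the Eiffel Tower completed?"

# Assumed serialization: one "unit_id: text" line per evidence unit.
packet_text = "\n".join(f"{uid}: {text}" for uid, text in packet.items())
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Packet:\n{packet_text}\n\nQuestion: {question}"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
reply = tokenizer.decode(
    output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
)

# Enforce the contract mechanically: every quote must be a verbatim
# substring of the unit it cites.
result = json.loads(reply)
for q in result["quotes"]:
    assert q["quote"] in packet[q["unit_id"]], "quote is not verbatim"
```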

## Evaluation

### Fresh 36-task mixed public holdout

The main standalone comparison uses a fresh 36-task public holdout: 18 FEVER verify-claim tasks and 18 HotpotQA grounded-QA tasks. Source rows were de-duplicated against training, dev, and probe_v0 rows.

| Stack | Task | Strict | Evidence F1 | Quote F1 |
| --- | --- | --- | --- | --- |
| Bridge raw | 0.8611 | 0.2222 | 0.8815 | 0.3343 |
| Quotebound raw | 0.8889 | 0.4444 | 0.9093 | 0.6815 |
| Bridge + deterministic_v3 | 0.8611 | 0.5833 | 0.8815 | 0.8815 |
| Quotebound + deterministic_v3 | 0.8889 | 0.5833 | 0.9093 | 0.9093 |

How to read this table:

- Raw rows measure the model outputs before quote repair.
- deterministic_v3 rows add the packet-local quote normalizer from the project repository (see the toy sketch after this list).
- Quotebound improves task accuracy, evidence F1, and quote F1 in both raw and normalized form; it also ties normalized strict success.
- The largest model-side gain is raw quote faithfulness, from 0.3343 to 0.6815.
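deterministic_v3 itself lives in the project repository and is not shipped with this adapter; its exact rules are not reproduced in this card. As a toy illustration of what a packet-local quote normalizer can do, the hypothetical helper below snaps a near-miss quote back to an exact substring of its cited unit:

```python
import difflib

def normalize_quote(quote: str, unit_text: str) -> str:
    """Toy packet-local repair, NOT the deterministic_v3 algorithm:
    if a quote is not a verbatim substring of the cited unit, fall
    back to the longest span the quote and the unit share exactly."""
    if quote in unit_text:
        return quote  # already verbatim; nothing to repair
    matcher = difflib.SequenceMatcher(None, unit_text, quote, autojunk=False)
    m = matcher.find_longest_match(0, len(unit_text), 0, len(quote))
    return unit_text[m.a : m.a + m.size] if m.size else quote
```

Because a repair step like this only ever returns substrings of the packet, it can raise quote F1 without introducing any text the packet does not contain.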

### Fixed dev triage slice

| Stack | Task | Strict | Evidence F1 | Quote F1 |
| --- | --- | --- | --- | --- |
| Quotebound + deterministic_v3 | 1.0000 | 0.6190 | 0.8320 | 0.7095 |

### Untouched 104-task HotpotQA shadow slice

On a 104-task HotpotQA shadow slice that was never touched during selection, Quotebound raw improved quote-faithful behavior over the prior bridge model. Quotebound plus deterministic_v3 matched bridge plus deterministic_v3 at the system level. This surface is reported as a narrative parity result because the freeze memo does not publish per-metric cells for it.

## Release architecture

The project ends in two finished results that are intentionally reported separately:

| Result | What it is | What it proves |
| --- | --- | --- |
| Quotebound 27B | The downloadable LoRA adapter on this page. | More of the evidence-faithful behavior moved into the model itself, with gains across non-probe_v0 surfaces. |
| Benchmark-winning hybrid stack | A trained bridge checkpoint plus the deterministic_v3 packet-local quote normalizer. | The full system clears every gate of the strict contract on frozen held-out probe_v0. |

These are connected, but they are not the same claim. Quotebound 27B is the standalone model release. The hybrid stack is the benchmark-facing winner. Perfect probe_v0 belongs to the hybrid stack, not to this adapter alone.

## Intended use

Use this release when answers must stay inside a fixed body of supplied text:

- bounded document QA with explicit evidence requirements,
- claim verification over closed packets of source text,
- policy, compliance, contract, and internal-document review where answers need source-text support,
- research on evidence-faithful reasoning, quote fidelity, and abstention.

## Limitations

- This is not a general chatbot. Open-domain QA, open chat, and free-form generation outside the closed-packet setup are not characterized.
- The downloadable artifact is the LoRA adapter only; the 27B base model is required.
- deterministic_v3 is not shipped as part of this model repo. It is a separate packet-local post-processing step in the project repository.
- Perfect probe_v0 belongs to the benchmark-winning hybrid stack, not to this adapter alone.
- Raw item-level contents of the frozen held-out probe are intentionally not published; the held-out gate has to stay closed to remain meaningful.
- For high-stakes use, treat the model as an evidence-grounding component that still requires human review and application-specific validation.

## Citation

```bibtex
@misc{quotebound_27b_2026,
  title        = {Quotebound 27B: Evidence-Faithful Reasoning Standalone Release},
  author       = {{darcar0}},
  year         = {2026},
  howpublished = {Hugging Face model release},
  url          = {https://huggingface.co/darcar0/quotebound-27b}
}
```
