ParaWriter-SFT

ParaWriter-SFT is a private ParadoxGPT writer model, fine-tuned from a 4B Qwen-family base model for research-paper writing analysis and revision planning. It is trained to read rich paper context and produce reviewer-aware writing diagnostics in Chinese across six writer tasks:

S1 realization diagnosis
S3 problem insight
S4 introduction structure
S5 commitment alignment
S6 method necessity
S7 experiment closure

The model is intended for paper-level or section-level inputs that include enough evidence for the requested judgment: title, abstract, introduction, relevant body sections, method details, experiment summaries, and available tables/figures when the task requires them. It should not be used to infer unsupported numbers, experiments, or claims.

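Description (translated from Chinese)

ParaWriter-SFT is ParadoxGPT's private writer model. It is obtained by further supervised fine-tuning of a 4B Qwen-family base model and is used for paper-writing diagnosis, argument-structure analysis, and writing planning before revision. It expects fairly complete paper context as input and produces analysis in Chinese, written from a reviewer's perspective.

The writer capabilities covered are:

S1: analyze the paper's core realization, distinguishing "so that is how it is" from "so it can also be done this way"
S3: extract the concrete problem setting and the core insight that drives the method
S4: break down the narrative structure of the Introduction and how the reader's understanding changes
S5: check whether the commitments made in the Introduction are fulfilled by evidence in the body
S6: analyze the design necessity of the main method components
S7: analyze how the experiments close the key commitments made in the Introduction

Recommended inputs include sufficient paper context, such as the title, abstract, introduction, method, experiment summary, descriptions of relevant tables/figures, and body excerpts. The model should not be asked to make up experiments, invent numbers, or manufacture evidence that the paper does not contain.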

Training Data

The SFT set contains 47,564 writer examples:

Split	Examples	Unique papers
train	42,349	7,089
dev	2,469	413
test	2,746	460

Task distribution:

Task	Examples
S1_realization_diagnosis	7,838
S3_problem_insight	7,945
S4_intro_structure	7,953
S5_commitment_alignment	7,940
S6_method_necessity	7,945
S7_experiment_closure	7,943
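
The two tables above describe the same 47,564 examples from two angles, so their totals should agree. A minimal sketch of that arithmetic check, using only the counts reported in this card (the variable names are illustrative and not part of any released tooling):

```python
# Counts copied from the split and task tables in this model card.
split_counts = {"train": 42_349, "dev": 2_469, "test": 2_746}
task_counts = {
    "S1_realization_diagnosis": 7_838,
    "S3_problem_insight": 7_945,
    "S4_intro_structure": 7_953,
    "S5_commitment_alignment": 7_940,
    "S6_method_necessity": 7_945,
    "S7_experiment_closure": 7_943,
}
reported_total = 47_564

# Both views of the data must add up to the reported total.
split_total = sum(split_counts.values())
task_total = sum(task_counts.values())

# Percentage share of each task, to show how balanced the six tasks are.
task_share = {
    task: round(100 * count / reported_total, 2)
    for task, count in task_counts.items()
}

for task, share in task_share.items():
    print(f"{task}: {share}%")
```

Each task accounts for roughly one sixth of the examples, so no single writer task dominates the SFT set.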

Quality filter: kept only records with exactly one balanced <think>...</think> block.
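
The quality filter can be sketched as follows. This is an illustrative reconstruction, not the actual data pipeline: the record layout and the `output` field name are assumptions, and "balanced" is interpreted here as one opening tag that appears before one closing tag.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def has_single_balanced_think(text: str) -> bool:
    """Return True if text has exactly one <think>...</think> pair, in order."""
    # Reject records with zero or multiple opening/closing tags.
    if text.count(THINK_OPEN) != 1 or text.count(THINK_CLOSE) != 1:
        return False
    # The single opening tag must come before the single closing tag.
    return text.index(THINK_OPEN) < text.index(THINK_CLOSE)


def filter_records(records, field="output"):
    """Keep only records whose `field` passes the think-block check.

    `field` is a hypothetical key; the real dataset schema is not documented here.
    """
    return [r for r in records if has_single_balanced_think(r.get(field, ""))]


if __name__ == "__main__":
    sample = [
        {"output": "<think>plan</think>final answer"},
        {"output": "final answer without reasoning"},
        {"output": "<think>a<think>b</think>"},
    ]
    print(len(filter_records(sample)))
```

Under this interpretation, records with missing, duplicated, or reversed tags are all dropped.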

Final training metrics:

Metric	Value
train_loss	0.7323
eval_loss	0.7947
epoch	3.0

Example: Problem Insight

Input

请阅读这篇论文，分析它的问题设定和核心洞察。

要求：
1. 问题设定不是背景介绍，是论文具体要解决的问题
2. core insight 不是方法描述，是驱动方法设计的观察/发现
3. 说明什么不是 insight（避免把方法当 insight）
4. 用中文自然语言回答

论文标题: Adaptive Evidence Routing for Retrieval-Augmented Scientific QA

摘要:
Scientific QA systems often retrieve many passages but still answer incorrectly when evidence types are mismatched. We propose an adaptive routing framework that selects between definition, comparison, numerical, and causal evidence channels before answer generation.

引言:
Retrieval-augmented generation has improved scientific QA, but failures remain common when a query requires a specific type of evidence. A stronger retriever may return more relevant passages, yet the generator can still mix definitions with causal explanations or use numerical tables for conceptual questions. This suggests that the bottleneck is not only retrieval relevance, but alignment between question intent, evidence type, and generation strategy.

方法:
The system first predicts the evidence need of the question, routes retrieval results through specialized evidence channels, and then constrains generation to cite the selected channel. Ablations compare the full router with a single retriever, random routing, and oracle evidence routing.

实验:
The full system improves answer faithfulness and reduces evidence-type errors. Gains are largest on causal and numerical questions; random routing removes most of the improvement.

Expected style of ParaWriter-SFT output

这篇论文的问题设定不是“RAG 在科学问答中还不够好”这么宽泛，而是更具体的一层：当问题需要特定证据类型时，系统即使检索到了看似相关的材料，也可能因为 evidence type 和 question intent 没对齐而回答错误。

它的核心 insight 是：错误来源不只在 retrieval relevance，而在“问题意图 - 证据类型 - 生成策略”之间的错配。这个 insight 直接解释了为什么单纯增强 retriever 不一定解决问题，也自然推出了 adaptive evidence routing 的必要性。

不要把“设计一个 router”本身当成 insight。router 是方法组件；真正驱动它的是前面的观察：不同问题需要不同证据形态，如果不先区分证据需求，生成阶段就会把定义、因果解释和数值证据混用。

Example: Commitment Alignment

Input

请阅读这篇论文，找出 Introduction 中的主要承诺，分析它们是否被证据支撑。

论文标题: Calibrated Planning for Tool-Using Language Agents

Introduction claim:
We argue that tool-use failures are often caused by miscalibrated planning rather than tool API limitations. Our method explicitly separates plan confidence from tool execution and should reduce unnecessary tool calls while preserving task success.

Experiments:
Table 2 reports success rate and tool-call count on three agent benchmarks. Figure 3 compares calibrated planning against direct prompting and always-call-tool baselines. Section 4.3 ablates confidence estimation.

Expected style of ParaWriter-SFT output

这里至少有两个承诺需要分开看。

第一，作者承诺“失败主要来自 planning calibration，而不是 API 本身”。要支撑这个 claim，不能只给最终成功率，还需要证明在 API 可用的情况下，错误集中发生在过度调用、漏调用或低置信计划继续执行。最直接的证据应来自错误分解或 confidence ablation。

第二，作者承诺方法会减少不必要 tool calls，同时保持 task success。这个 claim 可以由 Table 2 支撑，但要看两个指标是否同时成立：tool-call count 下降，success rate 没有明显牺牲。Figure 3 只能说明相对 baseline 的趋势，Section 4.3 的 confidence ablation 才更接近验证“calibration 是关键组件”。

如果论文只展示总成功率提升，而没有错误类型或 confidence ablation，那么 Introduction 的第一个承诺会是弱兑现；第二个承诺则取决于效率和成功率是否同时报告。

Observed Behavior vs Base 4B

On held-out writer tasks, ParaWriter-SFT follows the requested task format much more closely than the base 4B model. In qualitative checks, the base model often produced long English planning traces or overly broad analysis, while ParaWriter-SFT more consistently produced compact Chinese answers aligned to the requested writer skill.

The strongest observed gains are on:

problem insight extraction
introduction structure analysis
commitment/evidence alignment
method necessity analysis

Known limitation: for broad realization-diagnosis and experiment-closure prompts, the model can over-generate or repeat itself when the maximum generation length is set very high. Cap the generation length (for example, at the max_new_tokens value recommended below) and ask for a bounded structure, such as a fixed number of points, when concise output is required.

Recommended Use

Use the model as a paper-writing assistant that diagnoses and restructures arguments before drafting. Good prompts specify the writer task, include the relevant paper context, and ask the model to ground claims in provided sections, tables, figures, or experimental summaries.
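
One way to assemble such a prompt is sketched below. The helper `build_prompt` and the `TASK_INSTRUCTIONS` mapping are hypothetical; the instruction strings and section labels are copied from the two examples in this card, and the exact template used in training is not documented here.

```python
# Task instructions taken from the examples in this card (S3 and S5 only).
TASK_INSTRUCTIONS = {
    "S3_problem_insight": "请阅读这篇论文，分析它的问题设定和核心洞察。",
    "S5_commitment_alignment": "请阅读这篇论文，找出 Introduction 中的主要承诺，分析它们是否被证据支撑。",
}


def build_prompt(task: str, title: str, sections: dict) -> str:
    """Combine a task instruction, the paper title, and labeled context sections.

    `sections` maps a label (e.g. "摘要", "引言", "方法", "实验") to its text.
    Empty sections are skipped so the model is not shown blank evidence slots.
    """
    if task not in TASK_INSTRUCTIONS:
        raise KeyError(f"unknown writer task: {task}")
    parts = [TASK_INSTRUCTIONS[task], "", f"论文标题: {title}"]
    for label, text in sections.items():
        if text and text.strip():
            parts += ["", f"{label}:", text.strip()]
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        "S3_problem_insight",
        "Adaptive Evidence Routing for Retrieval-Augmented Scientific QA",
        {"摘要": "Scientific QA systems often retrieve many passages ...", "引言": ""},
    )
    print(prompt)
```

Keeping the task instruction first and the evidence sections explicitly labeled mirrors the layout of the example inputs above.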

Recommended decoding:

temperature = 0.0
max_new_tokens = 2048  # increase only for full-paper diagnostic reports
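
As a rough sketch, these settings could be turned into generation arguments as follows. This assumes a Hugging Face Transformers-style `generate` call, where greedy decoding is requested with `do_sample=False` rather than by passing a temperature of zero; the card does not state which inference stack is used, and `build_generation_kwargs` is a hypothetical helper.

```python
def build_generation_kwargs(temperature: float = 0.0, max_new_tokens: int = 2048) -> dict:
    """Map the recommended decoding settings to generate()-style keyword arguments."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")

    kwargs = {"max_new_tokens": max_new_tokens}
    if temperature == 0.0:
        # temperature = 0.0 means deterministic output: use greedy decoding.
        kwargs["do_sample"] = False
    else:
        # Any positive temperature requires sampling to take effect.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    return kwargs


if __name__ == "__main__":
    # Defaults follow the recommendation in this card.
    print(build_generation_kwargs())
```

The resulting dictionary would typically be passed as `model.generate(**inputs, **kwargs)`, with a larger `max_new_tokens` only for full-paper diagnostic reports.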

Citation

If this private model is useful in your internal research workflow, cite:

@misc{zhang2026parawriter,
  title        = {ParaWriter-SFT: A ParadoxGPT Writer Model for Scientific Writing Diagnostics},
  author       = {Heng Zhang},
  year         = {2026},
  howpublished = {Hugging Face model repository},
  note         = {Private ParadoxGPT supervised fine-tuned model}
}