Dyslexia Academic Writing Correction System

A style-preserving AI system that corrects grammar and elevates academic vocabulary in dyslexic writing while maintaining the author's personal voice, tone, and authorship signal. It is a corrector, not a rewriter.

Overview

This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:

Preserving the author's unique writing style via a 512-dimensional style fingerprint vector
Elevating vocabulary to academic register using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
Resisting AI detection through a frozen Human Pattern Classifier that penalises AI-typical writing during training
Maintaining semantic meaning with cosine-similarity-based semantic preservation loss

The core model is Grammarly CoEdit-Large fine-tuned with LoRA (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.

Latest Evaluation Results (v4)

Metric	Score	Description
GLEU	0.0000	Grammar + fluency correction quality (see note below)
BERTScore F1	0.9349	Semantic closeness to reference corrections
1 − WER	0.3191	Word-level accuracy (WER = 68.09%)
Human Score	0.8272	RoBERTa-based human-text classifier score
AI Score	0.1728	AI-text probability (lower is better)
Composite	0.5203	`mean(GLEU, BERTScore F1, 1−WER, HumanScore)`

Note on GLEU = 0.0: The zero GLEU score is a known compatibility issue between the CoEdit-Large tokenisation space and the google_bleu metric from the evaluate library, not a fluency regression. The BERTScore F1 of 0.9349 indicates that outputs remain semantically close to the reference corrections. GLEU will be re-calibrated in v5 with a tokeniser-aware scorer.

Note on composite drop (0.8634 → 0.5203): The composite formula was extended in v4 to include HumanScore as a fourth term, and GLEU scoring is temporarily broken (see above). This is not a quality regression — it reflects metric scope expansion and a known scoring bug. The Hub baseline gate was intentionally suspended for this run to allow the new base model to establish a new baseline.
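The composite arithmetic can be checked directly from the reported values. The following is a minimal sketch; `composite_score` is an illustrative name, not a function taken from the project code, and it assumes the composite is an unweighted mean as stated in the table.

```python
from statistics import mean

def composite_score(metrics):
    """Unweighted mean of whichever metrics are part of the composite."""
    return mean(metrics.values())

# v3 composite: three terms
v3 = composite_score({"gleu": 0.7593, "bertscore_f1": 0.9758, "one_minus_wer": 0.8552})

# v4 composite: HumanScore added as a fourth term, GLEU currently scoring 0.0
v4 = composite_score({
    "gleu": 0.0,
    "bertscore_f1": 0.9349,
    "one_minus_wer": 0.3191,
    "human_score": 0.8272,
})

print(round(v3, 4), round(v4, 4))  # 0.8634 0.5203
```

Because the mean is unweighted, a single term stuck at 0.0 removes up to a quarter of the attainable score, which is why the v3 and v4 composites are not directly comparable.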

Score Progression

Metric	v1	v2	v3	v4	Δ v3→v4
GLEU	—	0.7506	0.7593	0.0000	(metric bug)
BERTScore F1	—	0.9733	0.9758	0.9349	−0.0409
1 − WER	—	0.8488	0.8552	0.3191	(model transition)
Human Score	—	—	—	0.8272	(new metric)
AI Score	—	—	—	0.1728	(new metric)
Composite	—	0.8576	0.8634	0.5203	(scope change)

What Changed in v4

v4 is the most significant architectural upgrade to date. The base model changed from google/flan-t5-small (77M params) to grammarly/coedit-large (~770M params), a model already specialised for grammatical error correction. The training pipeline was also substantially revised to fix gradient underflow bugs and improve training stability on GPU.

Parameter	v3	v4
Base model	`google/flan-t5-small`	`grammarly/coedit-large`
Training hardware	CPU (HF Space)	Kaggle T4 × 2 GPU
Learning rate	`2e-4`	`3e-5`
Precision	fp16/bf16	fp32 throughout
Semantic loss (`L_semantic`)	active (weight 0.5)	removed
Cross-entropy implementation	custom scatter+log_softmax	`nn.CrossEntropyLoss(label_smoothing=0.1)`
Epochs	10	2 (coedit-large already GEC-pretrained)
Batch size (per device)	2 (CPU)	8 (GPU)
Gradient accumulation	32	8
Effective batch size	64	64
Max sequence length	256	128 (GEC sentences are short)
Composite metrics	GLEU + BERTScore + 1-WER	+ HumanScore (RoBERTa)
Hub baseline gate	strictly beats saved baseline	suspended for base-model transition
AI detector	MLP (17-dim features)	`roberta-base-openai-detector`
Training data cap	all available	25,000 pairs (sufficient for fine-tune of pretrained GEC model)
Eval frequency	per epoch	every 250 steps

Why the Base Model Changed

flan-t5-small (77M params) was always a hardware compromise — chosen to fit within a 4GB VRAM laptop GPU. With Kaggle T4 GPUs (16GB × 2) available for training, the system can now use grammarly/coedit-large, which:

Is already pretrained specifically on grammatical error correction tasks
Has ~10× more parameters, giving it substantially more correction capacity
Requires only 2 fine-tuning epochs to reach a useful correction quality (vs 10 for Flan-T5)
Produces outputs that require less post-generation vocabulary elevation

Gradient Stability Fixes (v4)

Three bugs were identified in v3 that caused zero or near-zero gradients during training:

LR too high: 2e-4 → 3e-5. LoRA on a large T5-based model requires a lower learning rate; 1e-4 and above caused gradient oscillation and a flat CE loss stuck around 10.0.
Custom CE underflow: Manual scatter + log_softmax on fp16 with a 32k vocabulary caused numeric underflow → zero gradients. Replaced with nn.CrossEntropyLoss(label_smoothing=0.1).
Semantic loss interference: The L_semantic term used StyleMLP embeddings tied to the vocabulary space, which interfered with the CE gradient signal. Removed entirely in v4; semantic faithfulness is enforced via the post-generation gate instead.

Human Score (New Metric in v4)

v4 introduces a dedicated RoBERTa-based AI detection score (roberta-base-openai-detector) as a first-class evaluation metric. During the v4 run:

Human score: 0.8272 — the model's outputs are classified as 82.7% likely human-written
AI score: 0.1728 — only 17.3% AI-text probability

Additional human-pattern diagnostics logged:

Burstiness: 0.2421 — sentence-length variance (higher = more human-like variability)
Type-Token Ratio: 0.9623 — lexical diversity (near 1.0 = very high unique-word ratio)
AI marker density: 0.000115 — near-zero usage of flagged AI clichés ("leverage", "delve", etc.)
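The exact formulas behind these diagnostics are not given here, so the sketch below is only one plausible reading. It assumes burstiness is the coefficient of variation of sentence lengths, TTR is unique words over total words, and marker density is flagged words over total words. The marker set contains only the two examples named above; the real list is presumably longer.

```python
import re
from statistics import mean, pstdev

# Only the two markers quoted in this document; the full list is an unknown.
AI_MARKERS = {"leverage", "delve"}

def human_pattern_diagnostics(text):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]

    # Assumption: burstiness = std / mean of sentence lengths.
    avg = mean(lengths) if lengths else 0.0
    burstiness = pstdev(lengths) / avg if avg > 0 else 0.0

    ttr = len(set(words)) / len(words) if words else 0.0
    marker_density = sum(w in AI_MARKERS for w in words) / len(words) if words else 0.0
    return {"burstiness": burstiness, "ttr": ttr, "ai_marker_density": marker_density}

sample = ("The results were clear. We did not expect the second experiment "
          "to fail, but it did. Why?")
stats = human_pattern_diagnostics(sample)
print(stats)
```

Note that TTR falls as texts get longer, so a value near 1.0 is easiest to reach on short evaluation sentences.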

Combined Loss (v4)

L = L_CE + 0.3·L_style + 0.4·L_human          (GPU)

The semantic loss term is removed from v4 onwards. Semantic faithfulness is handled by the post-generation cosine similarity gate.

Term	Purpose	Weight
`L_CE`	`nn.CrossEntropyLoss` with label smoothing 0.1	1.0
`L_style`	`1 − cos_sim(style(input), style(output))`	0.3
`L_human`	`1 − HumanPatternClassifier(output)` — anti-AI penalty	0.4
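As a rough illustration of how the three terms combine, here is a scalar sketch in plain Python. It is not the trainer's implementation, which operates on tensors: `combined_loss` and its arguments are hypothetical names, and the style vectors and human probability are passed in as already-computed values.

```python
import math

LAMBDA_STYLE = 0.3
LAMBDA_HUMAN = 0.4

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def combined_loss(l_ce, style_in, style_out, human_prob):
    """L = L_CE + 0.3 * L_style + 0.4 * L_human (scalar illustration)."""
    l_style = 1.0 - cosine_similarity(style_in, style_out)  # 0 when style is unchanged
    l_human = 1.0 - human_prob                               # 0 when judged fully human
    return l_ce + LAMBDA_STYLE * l_style + LAMBDA_HUMAN * l_human

# Identical style vectors, human probability taken from the v4 evaluation.
loss = combined_loss(l_ce=2.0, style_in=[1.0, 0.0], style_out=[1.0, 0.0], human_prob=0.8272)
print(round(loss, 5))  # 2.06912
```

With these weights the style and human terms can add at most 0.3·2 + 0.4 = 1.0 to the cross-entropy, since cosine similarity lies in [−1, 1].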

What Changed in v3

v3 keeps the same base model and LoRA rank as v2 but improves every other stage of the pipeline: wider context window, better generation, a semantic faithfulness gate that prevents meaning-destroying corrections, and optional ERRANT F0.5 evaluation.

Parameter	v2	v3
Context window	128 tokens	256 tokens
Additional data	JFLEG + W&I only	+ C4-200M-GEC (~100k pairs, falls back if unavailable)
Beam search	`num_beams=2`	`num_beams=5`, `length_penalty=1.2`, `repetition_penalty=1.3`, `no_repeat_ngram_size=3`
Faithfulness gate	none	cosine sim < 0.75 → revert output to source
Human-pattern loss	skipped on CPU	active on GPU
Evaluation cap	always 200 samples	200 on CPU, full test set on GPU
ERRANT F0.5	not present	optional metric
Composite	mean(GLEU, BERTScore, 1-WER)	mean(GLEU, BERTScore, 1-WER [, ERRANT F0.5 if available])

What Changed in v2

The original model had a critical bug: CorrectionTrainer.compute_loss() only used cross-entropy loss. The multi-objective loss was fully designed in loss_functions.py but was never wired into the trainer. v2 fixes this and upgrades several other parameters.

Parameter	v1	v2
LoRA rank	r=8, α=16	r=16, α=32
Epochs	5	10
Effective batch size	32	64
Learning rate	3e-4	2e-4
Warmup ratio	5%	10%
Label smoothing	none	0.1
Loss function	CE only (bug)	CE + Style + Semantic (fixed)
Evaluation	GLEU only	GLEU + BERTScore F1 + (1−WER) composite
Early stopping	none	patience=3
Hub gate	none	composite must beat saved baseline

Features

Feature	Description
Two-pass spell correction	Dyslexia-aware phonetic pattern handling via LanguageTool
Style fingerprinting	41 raw features → MLP → 512-dim L2-normalised style vector
LoRA fine-tuning	r=16, α=32, dropout=0.05 — targeting all attention + FFN projections
Academic vocabulary elevation	BERT fill-mask → AWL candidate filtering → semantic similarity gate
Human pattern anti-AI loss	RoBERTa-based AI detector (`roberta-base-openai-detector`) + burstiness/TTR/marker features
Combined training loss	`L_CE + λ₁·L_style (+ λ₂·L_human on GPU)`
Semantic faithfulness gate	Outputs with cosine sim < 0.75 to source are reverted — prevents meaning drift
Sentence-chunked inference	Long texts split into 128-token chunks matching training window
FastAPI server	RESTful `/correct` endpoint with CORS and rate limiting
Multi-stage training	Orchestrated via `train.sh` with checkpoint system
Synthetic data augmentation	`DyslexiaSimulator` generates realistic errors from clean text (25% error rate in v4)
Composite score gating	Hub push gated by composite score improvement (suspended during base-model transitions)

Project Structure

Rewriter/
├── configs/
│   ├── training_config.yaml
│   ├── training_config_fast.yaml
│   ├── inference_config.yaml
│   ├── model_config.yaml
│   └── awl_config.yaml
├── scripts/
│   ├── train.py
│   ├── evaluate.py
│   ├── run_inference.py
│   ├── preprocess_data.py
│   ├── pretrain_human_pattern_classifier.py
│   ├── download_datasets.sh
│   └── download_kaggle_datasets.sh
├── src/
│   ├── model/
│   │   ├── base_model.py
│   │   ├── style_conditioner.py
│   │   ├── generation_utils.py
│   │   └── lora_adapter.py
│   ├── preprocessing/
│   │   ├── pipeline.py
│   │   ├── spell_corrector.py
│   │   ├── dyslexia_simulator.py
│   │   ├── dependency_parser.py
│   │   ├── ner_tagger.py
│   │   └── sentence_segmenter.py
│   ├── style/
│   │   ├── fingerprinter.py
│   │   ├── style_vector.py
│   │   ├── formality_classifier.py
│   │   └── emotion_classifier.py
│   ├── training/
│   │   ├── dataset.py
│   │   ├── trainer.py
│   │   ├── loss_functions.py
│   │   ├── human_pattern_extractor.py
│   │   └── callbacks.py
│   ├── vocabulary/
│   │   ├── lexical_substitution.py
│   │   ├── awl_loader.py
│   │   └── register_filter.py
│   ├── inference/
│   │   ├── corrector.py
│   │   └── postprocessor.py
│   ├── evaluation/
│   │   ├── gleu_scorer.py
│   │   ├── errant_evaluator.py
│   │   ├── style_metrics.py
│   │   └── authorship_verifier.py
│   └── api/
│       ├── main.py
│       ├── schemas.py
│       └── middleware.py
├── train_and_upgrade.py            # v4 upgrade pipeline (Kaggle notebook edition)
├── data/
│   ├── raw/
│   ├── processed/
│   ├── cache/
│   └── awl/
├── train.sh
├── start.sh
├── baseline_score.json             # Saved composite score (0.5203) — gate for Hub push
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── pyproject.toml

Model Architecture

Mermaid Diagram:

graph TB
    subgraph INFERENCE["🔮 Inference Pipeline"]
        direction TB
        INPUT["📝 Raw Dyslexic Text"]
        subgraph PREPROCESS["Pre-Processing"]
            SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
            SENT_SEG["Sentence Segmenter"]
            DEP_PARSE["Dependency Parser"]
            NER["NER Tagger"]
        end
        subgraph STYLE["Style Analysis"]
            FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
            EMOTION["Emotion Classifier"]
            FORMALITY["Formality Classifier"]
            STYLE_VEC["Style Vector Composer"]
        end
        subgraph GENERATION["Core Generation"]
            STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
            BASE_MODEL["Base LM<br/><i>CoEdit-Large (warm-merged)</i>"]
            LORA["LoRA Adapter<br/><i>r=16</i>"]
            GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
        end
        subgraph POSTPROCESS["Post-Processing"]
            FAITH["Faithfulness Gate<br/><i>cos sim &lt; 0.75 → revert</i>"]
            POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
            VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
            AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
            REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
        end
        OUTPUT["✅ Corrected Academic Text"]
        INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
        INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
        NER --> STYLE_COND
        STYLE_VEC --> STYLE_COND
        STYLE_COND --> BASE_MODEL
        LORA -.->|"merged weights"| BASE_MODEL
        BASE_MODEL --> GEN_UTILS --> FAITH --> POSTPROC
        POSTPROC --> VOCAB_SUB
        AWL --> VOCAB_SUB
        VOCAB_SUB --> REG_FILTER --> OUTPUT
    end

    subgraph TRAINING["🏋️ Training Pipeline (v4)"]
        direction TB
        subgraph WARMSTART["Warm-Start Merge"]
            HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=16 (v3)</i>"]
            MERGE["merge_and_unload()"]
            FRESH_LORA["Fresh LoRA r=16"]
        end
        subgraph DATA["Data Pipeline"]
            JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
            WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
            DYSLEXIA_AUG["DyslexiaSimulator<br/><i>25% error rate augmentation</i>"]
            SPLIT["capped at 25k pairs — 88%/7%/5% splits"]
        end
        subgraph LOSS["Combined Loss (v4)"]
            L_CE["L_CE (nn.CrossEntropyLoss, label_smoothing=0.1)"]
            L_STYLE["0.3 · L_style"]
            L_HUMAN["0.4 · L_human<br/><i>(RoBERTa detector)</i>"]
        end
        subgraph EVAL["Composite Evaluation"]
            GLEU_E["GLEU<br/><i>(metric fix pending)</i>"]
            BERT_E["BERTScore F1"]
            WER_E["1 − WER"]
            HUMAN_E["HumanScore<br/><i>(roberta-base-openai-detector)</i>"]
            COMPOSITE["Composite = mean(4 metrics)"]
            HUB_PUSH["Push to Hub ✅"]
        end
        HUB_ADAPTER --> MERGE --> FRESH_LORA
        JFLEG --> DYSLEXIA_AUG
        WILOCNESS --> DYSLEXIA_AUG
        DYSLEXIA_AUG --> SPLIT
        L_CE --> COMPOSITE
        L_STYLE --> COMPOSITE
        L_HUMAN --> COMPOSITE
        GLEU_E --> COMPOSITE
        BERT_E --> COMPOSITE
        WER_E --> COMPOSITE
        HUMAN_E --> COMPOSITE
        COMPOSITE --> HUB_PUSH
    end

Design Choices & Rationale

Why CoEdit-Large? (v4 upgrade from Flan-T5-Small)

Consideration	Decision
Task specialisation	CoEdit-Large is pretrained specifically on GEC and text editing tasks — it already knows how to correct errors before any fine-tuning
Parameter capacity	~770M params vs 77M in Flan-T5-Small, giving substantially more correction capacity
Training efficiency	2 fine-tuning epochs reach good correction quality vs 10 for Flan-T5
Hardware	Kaggle T4 × 2 (16GB each) replaces the 4GB laptop GPU constraint
LoRA efficiency	r=16 adapter on CoEdit-Large is still <3% of total params

Why Flan-T5-Small was used in v1–v3

Consideration	Decision
Hardware constraint	RTX 3050 Laptop GPU (4GB VRAM) — ruled out models > 500M params
Architecture	Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks
Instruction tuning	Flan-T5 is instruction-tuned on 1,800+ tasks, so it follows correction prompts naturally

Why LoRA over Full Fine-Tuning?

Memory: Full fine-tuning of CoEdit-Large requires ~6GB for gradients; LoRA r=16 fits comfortably in 4GB
Warm-start safety: Merging the previous adapter's weights before applying a fresh LoRA preserves prior corrections
Merging: LoRA weights merge into base model at inference time — zero latency overhead

Why a Combined Multi-Objective Loss?

The system uses (v4): L = L_CE + 0.3·L_style + 0.4·L_human

Term	Purpose	Weight
`L_CE`	Cross-entropy with label smoothing (0.1)	1.0
`L_style`	`1 − cos_sim(style(input), style(output))` — preserves writing fingerprint	0.3
`L_human`	Anti-AI penalty via RoBERTa detector	0.4

The L_semantic term from v2/v3 was removed in v4. It used StyleMLP embeddings tied to the vocabulary space, which caused gradient interference with cross-entropy. Semantic faithfulness is now enforced exclusively via the post-generation cosine gate.

Why a Semantic Faithfulness Gate?

Even a well-trained correction model can occasionally produce outputs that drift semantically from the input. v3+ computes cosine similarity between source and output using all-MiniLM-L6-v2 sentence embeddings. Outputs below 0.75 similarity are treated as unreliable and the original input is returned unchanged.
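The gate logic itself is small. The sketch below shows the revert rule under stated assumptions: `faithfulness_gate` and `toy_embed` are hypothetical names, and the bag-of-words embedding is only a stand-in so the example runs without downloading all-MiniLM-L6-v2. In the real pipeline the embedding would be a dense sentence vector.

```python
import math
from collections import Counter

FAITHFULNESS_THRESHOLD = 0.75

def toy_embed(text):
    """Stand-in embedding: word counts. NOT the sentence encoder used in the project."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def faithfulness_gate(source, output, embed=toy_embed, threshold=FAITHFULNESS_THRESHOLD):
    """Return (text, reverted). Falls back to the source when similarity is too low."""
    if cosine_similarity(embed(source), embed(output)) < threshold:
        return source, True
    return output, False

source = "the student recieved the report on friday"
kept = faithfulness_gate(source, "the student received the report on friday")
reverted = faithfulness_gate(source, "cats like milk")
print(kept, reverted)
```

Returning the `reverted` flag alongside the text makes it straightforward to count fallbacks during evaluation when tuning the threshold.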

Why Sentence-Chunked Inference?

The model is trained with max_input_length=128 tokens. The task prefix alone consumes ~20 tokens. Long inputs are split into sentences, grouped into chunks that fit the 128-token budget, corrected independently, then rejoined.
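A greedy grouping strategy is one way to implement this. The sketch below assumes a whitespace token count as a stand-in for the real tokenizer and a regex sentence splitter in place of the project's sentence segmenter; `chunk_sentences` and its parameters are illustrative names.

```python
import re

def chunk_sentences(text, max_tokens=128, prefix_tokens=20,
                    count_tokens=lambda s: len(s.split())):
    """Group whole sentences into chunks that fit the token budget left after the prefix."""
    budget = max_tokens - prefix_tokens
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        # A single over-long sentence still becomes its own chunk here;
        # the model would then truncate it.
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Tiny budget (26 - 20 = 6 tokens) to make the grouping visible.
chunks = chunk_sentences("One two three. Four five six. Seven eight nine.", max_tokens=26)
print(chunks)  # ['One two three. Four five six.', 'Seven eight nine.']
```

Passing the real tokenizer's length function as `count_tokens` would make the budget exact rather than approximate.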

Why Post-Generation Vocabulary Elevation?

Rather than relying solely on the model to produce academic vocabulary, a separate BERT-based lexical substitution pipeline is applied post-generation: POS-tag → identify non-AWL content words → BERT fill-mask → filter to AWL-only predictions → accept only if semantic_similarity > 0.82.
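The filtering and gating steps at the end of that chain can be sketched without loading BERT. In the example below, the candidate list stands in for ranked fill-mask predictions, the AWL set is a two-word hand-picked subset, and the similarity scores are made up purely for illustration; `elevate_word` is a hypothetical helper name.

```python
SEMANTIC_THRESHOLD = 0.82

def elevate_word(word, candidates, awl, similarity, threshold=SEMANTIC_THRESHOLD):
    """Return the first AWL candidate that passes the similarity gate, else the original word."""
    if word.lower() in awl:
        return word  # already academic register, nothing to do
    for candidate in candidates:  # assumed to be in fill-mask rank order
        if candidate.lower() in awl and similarity(word, candidate) > threshold:
            return candidate
    return word

# Illustrative inputs only: a tiny AWL subset and invented similarity scores.
awl_subset = {"obtain", "acquire"}
fake_scores = {("get", "acquire"): 0.80, ("get", "obtain"): 0.90}
similarity = lambda a, b: fake_scores.get((a, b), 0.0)

result = elevate_word("get", ["receive", "acquire", "obtain"], awl_subset, similarity)
print(result)  # obtain
```

Here "receive" is rejected for not being in the AWL subset and "acquire" for falling below 0.82, so a stricter threshold trades coverage for fewer inappropriate substitutions.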

Quick Start

Prerequisites

Python ≥ 3.10
NVIDIA GPU with ≥ 8GB VRAM recommended (T4 or better); CPU supported but slow
~15GB disk space for models and datasets

Option A: Kaggle Notebook (v4 — Recommended)

Run the pipeline on Kaggle with T4 × 2 GPU:

Upload train_and_upgrade.py as a Kaggle notebook
Enable GPU (T4 × 2) and Internet
Add your HuggingFace token as a Kaggle Secret named HF_TOKEN

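# Run the pipeline
import os
from kaggle_secrets import UserSecretsClient

# Kaggle Secrets are not exposed as environment variables, so copy the token in explicitly
os.environ["HF_TOKEN"] = UserSecretsClient().get_secret("HF_TOKEN")
main()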

The pipeline runs 8 steps automatically: Load base model → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Save → Push

Estimated runtime: 30–60 minutes on T4 × 2

Option B: Self-Improving Upgrade Pipeline (local GPU)

git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
pip install -r requirements.txt

export HF_TOKEN="your-hf-token-with-write-access"
python train_and_upgrade.py

Option C: Manual Step-by-Step

# 1. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# 2. Preprocess datasets
python scripts/preprocess_data.py

# 3. Pre-train the human pattern classifier
python scripts/pretrain_human_pattern_classifier.py

# 4. Train the correction model
PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss

# 5. Merge LoRA adapter into base model
python -c "
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
model = AutoModelForSeq2SeqLM.from_pretrained('grammarly/coedit-large', torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
model = model.merge_and_unload()
model.save_pretrained('checkpoints/best_model_merged')
AutoTokenizer.from_pretrained('grammarly/coedit-large').save_pretrained('checkpoints/best_model_merged')
"

# 6. Run inference
PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."

# 7. Or start the API server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000

Training Pipeline

v4 Upgrade Pipeline (`train_and_upgrade.py`) — 8 Steps

Step	Action
1	Load `grammarly/coedit-large` base model (fp32) + existing Hub LoRA adapter
2	Merge adapter into base weights (`merge_and_unload`) — warm start
3	Apply fresh LoRA r=16 on merged base
4	Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (25% error rate); cap at 25k pairs
5	Train with combined loss for 2 epochs (`L_CE + L_style + L_human`)
6	Evaluate: BERTScore F1 + (1−WER) + HumanScore [+ GLEU — fix pending]
7	Save adapter + merge full model
8	Push adapter + merged model (`merged/` subfolder) to Hub; update baseline

v3 Upgrade Pipeline — 10 Steps

Step	Action
1	Load existing LoRA adapter (r=16, v2) from Hub
2	Merge into base weights — warm start
3	Apply fresh LoRA r=16 on merged base
4	Load JFLEG + W&I+LOCNESS + C4-GEC; augment (20% error rate)
5	Train with combined loss for 10 epochs, early stopping patience=3
6	Evaluate: GLEU + BERTScore F1 + (1−WER) [+ ERRANT F0.5 if installed]
7	Apply semantic faithfulness gate — revert outputs with cosine sim < 0.75
8	Compare composite score against `baseline_score.json`
9	If improved: merge adapter → save full model
10	Push adapter + merged model to Hub; update baseline
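Steps 8 to 10 amount to a simple gate on the composite score. The sketch below is an assumption about how such a gate could look: the JSON key `composite` and the function name `should_push` are hypothetical, since the layout of `baseline_score.json` is not shown here. The `gate_suspended` flag mirrors the v4 behaviour of bypassing the gate during a base-model transition.

```python
import json
import tempfile
from pathlib import Path

def should_push(new_composite, baseline_path="baseline_score.json", gate_suspended=False):
    """Push only if the new composite strictly beats the saved baseline."""
    path = Path(baseline_path)
    if gate_suspended or not path.exists():
        return True  # no baseline to beat, or gate deliberately bypassed
    baseline = json.loads(path.read_text())["composite"]  # assumed key name
    return new_composite > baseline

# Demonstration against a temporary baseline file holding the v3 composite.
with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline_score.json"
    baseline_file.write_text(json.dumps({"composite": 0.8634}))
    blocked = should_push(0.5203, baseline_file)
    allowed = should_push(0.5203, baseline_file, gate_suspended=True)

print(blocked, allowed)  # False True
```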

v1 Original Pipeline (`train.sh`) — 5 Stages

Stage	Action
1	Setup & Dependencies
2	Data Preprocessing
3	Human Pattern Classifier Pre-Training
4	Main Model Training (LoRA r=8, 5 epochs, CE only)
5	Evaluation (GLEU only)

Hyperparameter Reference

v4 (`train_and_upgrade.py` — Kaggle T4 × 2)

BASE_MODEL_ID   = "grammarly/coedit-large"

LORA_R          = 16
LORA_ALPHA      = 32
LORA_DROPOUT    = 0.05
TARGET_MODULES  = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]

EPOCHS          = 2             # coedit-large is already GEC-pretrained
BATCH_SIZE      = 8             # per device (GPU)
GRAD_ACCUM      = 8             # effective batch = 64
LR              = 3e-5          # reduced from 2e-4 in v3; 1e-4 and above caused a flat loss on T5-based models
WARMUP_RATIO    = 0.05
LABEL_SMOOTHING = 0.1
MAX_INPUT_LEN   = 128           # back to 128 — GEC sentences are typically short
MAX_TARGET_LEN  = 128
MAX_TRAIN_PAIRS = 25_000

LAMBDA_STYLE    = 0.3
# LAMBDA_SEMANTIC removed — caused gradient interference
LAMBDA_HUMAN    = 0.4

FAITHFULNESS_THRESHOLD  = 0.75
AI_DETECTION_THRESHOLD  = 0.4

# fp16 and bf16 both False — fp32 throughout to prevent gradient underflow

v3 (`train_and_upgrade.py`)

LORA_R          = 16
LORA_ALPHA      = 32
LORA_DROPOUT    = 0.05
EPOCHS          = 10
BATCH_SIZE      = 2            # CPU
GRAD_ACCUM      = 32
LR              = 2e-4
MAX_INPUT_LEN   = 256
LAMBDA_STYLE    = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN    = 0.4
FAITHFULNESS_THRESHOLD = 0.75

v2 (`train_and_upgrade.py`)

LORA_R          = 16
LORA_ALPHA      = 32
EPOCHS          = 10
BATCH_SIZE      = 2
GRAD_ACCUM      = 32
LR              = 2e-4
MAX_INPUT_LEN   = 128
LAMBDA_STYLE    = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN    = 0.4

v1 (`configs/training_config.yaml`)

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 3.0e-4
loss:
  lambda_style: 0.3
  lambda_semantic: 0.5
  lambda_human_pattern: 0.4

`configs/inference_config.yaml`

model:
  key: "coedit-large"
  checkpoint_path: "checkpoints/best_model_merged"
  use_lora: false

generation:
  num_beams: 5
  length_penalty: 1.2
  repetition_penalty: 1.3
  no_repeat_ngram_size: 3
  max_new_tokens: 128

vocabulary:
  semantic_threshold: 0.82

faithfulness:
  threshold: 0.75

Inference Pipeline (8 Steps)

Raw Text
  │
  ▼
1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
  │
  ▼
2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
  │
  ▼
3. Sentence-Chunked Generation ─ Split into 128-token chunks → CoEdit-Large → rejoin
  │
  ▼
4. Faithfulness Gate ──── cosine_sim(source, output) < 0.75 → revert to source
  │
  ▼
5. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
  │
  ▼
6. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
  │
  ▼
7. Register Filtering ── Expand contractions, replace colloquialisms
  │
  ▼
8. Metrics ──────────── Style similarity, AWL coverage, readability scores
  │
  ▼
Corrected Text

API Usage

# Start the server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000

# Correct text
curl -X POST http://localhost:8000/correct \
  -H "Content-Type: application/json" \
  -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'

# Health check
curl http://localhost:8000/health

Interactive docs at http://localhost:8000/docs.
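The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names are illustrative, and because the response schema is not documented here, the client simply returns the parsed JSON as-is.

```python
import json
from urllib import request

def build_correct_request(text, style_alpha=0.6, base_url="http://localhost:8000"):
    """Build the POST request for the /correct endpoint shown in the curl example."""
    payload = json.dumps({"text": text, "style_alpha": style_alpha}).encode("utf-8")
    return request.Request(
        f"{base_url}/correct",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def correct(text, **kwargs):
    """Send the request; requires the API server to be running."""
    with request.urlopen(build_correct_request(text, **kwargs), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_correct_request("The studnet recieved alot of informtion.")
print(req.get_method(), req.full_url)
# With the server running: print(correct("The studnet recieved alot of informtion."))
```

A generous timeout is used because the first call may be slow while LanguageTool starts up.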

Hardware Requirements

Tier	GPU	Config	Epochs	Training Time
Tested (v1)	RTX 3050 4GB	Flan-T5-Small, r=8	5	~45 min
Tested (v2 CPU)	None (HF Space CPU)	Flan-T5-Small, r=16	10	~12–24 hours
Tested (v3 CPU)	None (HF Space CPU)	Flan-T5-Small, r=16	10	~12–24 hours
Tested (v4)	Kaggle T4 × 2 (16GB each)	CoEdit-Large, r=16	2	~30–60 min
Recommended	RTX 3090 24GB	CoEdit-Large, r=16 + full loss	3–5	~2–3h
Maximum	A100 80GB	Full pipeline with ERRANT	10	~12h

Data Sources

Dataset	Type	Size	Access
JFLEG (`jhu-clsp/jfleg`)	Fluency corrections (4 refs each)	~5k pairs	HF Hub, no registration
W&I+LOCNESS (`bea2019st/wi_locness`)	Learner errors + corrections	~34k pairs	HF Hub, no registration
C4-200M-GEC (`cointegrated/c4_200m-gec-filtered`)	Synthetic GEC pairs	~100k pairs (capped)	HF Hub — falls back silently if unavailable
FCE v2.1	Learner errors + corrections	~28k pairs	BEA-2019 (registration required)
Shanegerami AI_Human.csv	Human vs AI classification	~50k samples	Kaggle
Starblasters8 data.parquet	Human vs AI classification	~50k samples	Kaggle
Coxhead AWL	Academic Word List	570 families / 549 headwords	Victoria University of Wellington

Note: train_and_upgrade.py (v4) uses JFLEG + W&I+LOCNESS capped at 25k pairs. C4-GEC and FCE require additional setup.

Dyslexia Error Simulation

The DyslexiaSimulator generates synthetic training data based on research by Rello et al. (2013, 2017). v4 uses a 25% per-word error rate (up from 20% in v2/v3).

Error Type	Frequency	Example
Phonetic substitution	35%	"because" → "becaus"
Letter transposition	18%	"the" → "teh"
Letter omission	16%	"important" → "importnt"
Letter doubling	12%	"letter" → "lettter"
Letter reversal (b/d, p/q)	10%	"bad" → "dad"
Word boundary errors	9%	"a lot" → "alot"
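The frequency table maps naturally onto weighted sampling. The sketch below is not the project's `DyslexiaSimulator`; it is a simplified stand-in that implements only the four character-level error types. Phonetic substitution and word-boundary errors need lookup tables or multi-word context, so words drawn for those types are left unchanged here.

```python
import random

ERROR_WEIGHTS = {
    "phonetic_substitution": 0.35, "transposition": 0.18, "omission": 0.16,
    "doubling": 0.12, "reversal": 0.10, "word_boundary": 0.09,
}
REVERSALS = {"b": "d", "d": "b", "p": "q", "q": "p"}

def transpose(word, rng):
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def omit(word, rng):
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word))  # never drop the first letter
    return word[:i] + word[i + 1:]

def double(word, rng):
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i + 1] + word[i] + word[i + 1:]

def reverse_letter(word, rng):
    positions = [i for i, c in enumerate(word) if c in REVERSALS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + REVERSALS[word[i]] + word[i + 1:]

OPERATIONS = {"transposition": transpose, "omission": omit,
              "doubling": double, "reversal": reverse_letter}

def simulate(text, error_rate=0.25, seed=0):
    """Corrupt each word with probability error_rate, choosing the error type by weight."""
    rng = random.Random(seed)
    kinds, weights = zip(*ERROR_WEIGHTS.items())
    out = []
    for word in text.split():
        if rng.random() < error_rate:
            op = OPERATIONS.get(rng.choices(kinds, weights=weights)[0])
            if op is not None:  # unimplemented types leave the word as-is
                word = op(word, rng)
        out.append(word)
    return " ".join(out)

print(simulate("the important letter was bad because of a lot of noise", seed=3))
```

Seeding the generator keeps augmented datasets reproducible across runs.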

Style Fingerprint Vector

The 512-dimensional style vector captures 41 raw features:

Group	Features	Count
Sentence stats	mean, std, skew of sentence lengths	3
Word stats	mean, std of word lengths	2
Lexical	type-token ratio, lexical density	2
Syntactic	passive voice ratio, active voice ratio, subordinate clause ratio, avg dependency tree depth	4
Discourse	20 academic discourse markers (per 100 words)	20
Register	hedging frequency, formality score, nominalization ratio	3
Readability	Flesch reading ease, avg syllables per word	2
Pronouns	first-person ratio, third-person ratio	2
Other	question ratio, exclamation ratio, AWL coverage	3

Projected through a 2-layer MLP (41 → 256 → 512) with LayerNorm and GELU activation, then L2-normalised.
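To make the shapes concrete, here is a dependency-free sketch of that projection with randomly initialised weights. It assumes LayerNorm is applied after the first linear layer and before GELU, and it omits LayerNorm's learnable gain and bias; the actual layer ordering in `fingerprinter.py` may differ.

```python
import math
import random

def linear(x, weights, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def l2_normalise(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm else x

def init_layer(n_in, n_out, rng):
    """Random weights for illustration only; a trained model would load these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def style_vector(features, params):
    (w1, b1), (w2, b2) = params
    hidden = [gelu(v) for v in layer_norm(linear(features, w1, b1))]  # 41 -> 256
    return l2_normalise(linear(hidden, w2, b2))                        # 256 -> 512, unit length

rng = random.Random(0)
params = (init_layer(41, 256, rng), init_layer(256, 512, rng))
vec = style_vector([rng.random() for _ in range(41)], params)
print(len(vec), round(sum(v * v for v in vec), 6))  # 512 1.0
```

Because the output is unit length, the cosine similarity between two style vectors reduces to their dot product.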

Known Limitations

GLEU scoring bug (v4): The google_bleu evaluate metric is incompatible with CoEdit-Large's tokenisation space and scores 0.0. This will be resolved in v5 with a tokeniser-aware GLEU implementation. BERTScore F1 (0.9349) is a more reliable quality indicator for this run.
1-WER regression (v4): The 1-WER drop (0.8552 → 0.3191) partly reflects the model transition and partly reflects the fact that coedit-large produces differently-phrased corrections compared to Flan-T5 — word-level edit distance to the reference set increases when the output is fluent but uses different vocabulary. This will be re-evaluated against CoEdit-specific reference corrections in v5.
Training window: 128-token max input — very long paragraphs may be split mid-clause.
Vocabulary elevation: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
Already-correct text: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
LanguageTool latency: Spell correction takes ~15–20s due to JVM startup on first call.
Faithfulness gate conservatism: The 0.75 cosine similarity threshold occasionally reverts valid-but-heavily-corrected outputs. Monitor num_fallback in evaluation to tune the threshold.
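The 1-WER limitation above is easier to see with a small worked example. The sketch below implements word error rate as word-level Levenshtein distance divided by reference length, which is the standard definition; whether the evaluation script normalises text before scoring is not stated here, so treat the numbers as illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0

reference = "the student received a lot of information"
exact = word_error_rate(reference, reference)
paraphrase = word_error_rate(reference, "the student got plenty of information")
print(exact, round(paraphrase, 4))  # 0.0 0.4286
```

The second hypothesis is fluent and keeps the meaning, yet three of seven reference words count as errors, which illustrates how differently-phrased corrections depress 1-WER while BERTScore stays high.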
