Dyslexia Academic Writing Correction System
A style-preserving, grammar-correcting AI system that elevates vocabulary to an academic register and corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal: a corrector, not a rewriter.
Overview
This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:
- Preserving the author's unique writing style via a 512-dimensional style fingerprint vector
- Elevating vocabulary to academic register using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
- Resisting AI detection through a frozen Human Pattern Classifier that penalises AI-typical writing during training
- Maintaining semantic meaning with cosine-similarity-based semantic preservation loss
The core model is Grammarly CoEdit-Large fine-tuned with LoRA (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.
Latest Evaluation Results (v4)
| Metric | Score | Description |
|---|---|---|
| GLEU | 0.0000 | Grammar + fluency correction quality (see note below) |
| BERTScore F1 | 0.9349 | Semantic closeness to reference corrections |
| 1 − WER | 0.3191 | Word-level accuracy (WER = 68.09%) |
| Human Score | 0.8272 | RoBERTa-based human-text classifier score |
| AI Score | 0.1728 | AI-text probability (lower is better) |
| Composite | 0.5203 | mean(GLEU, BERTScore F1, 1−WER, HumanScore) |
Note on GLEU = 0.0: The zero GLEU score is a known compatibility issue between the new CoEdit-Large tokenisation space and the google_bleu metric from the evaluate library, not a fluency regression. The BERTScore F1 of 0.9349 indicates that the outputs remain semantically close to the reference corrections. GLEU will be re-calibrated in v5 with a tokeniser-aware scorer.
Note on composite drop (0.8634 → 0.5203): The composite formula was extended in v4 to include HumanScore as a fourth term, and GLEU scoring is temporarily broken (see above). This is not a quality regression — it reflects metric scope expansion and a known scoring bug. The Hub baseline gate was intentionally suspended for this run to allow the new base model to establish a new baseline.
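As a quick arithmetic check, the composite figures can be reproduced from the per-metric scores reported in this section. The function name composite_score is illustrative and is not taken from the repository.

```python
from statistics import fmean

def composite_score(*metrics: float) -> float:
    """Unweighted mean of the supplied metric scores."""
    return fmean(metrics)

# v4: mean(GLEU, BERTScore F1, 1 - WER, HumanScore)
v4 = composite_score(0.0000, 0.9349, 0.3191, 0.8272)
# v3: mean(GLEU, BERTScore F1, 1 - WER), before HumanScore was added
v3 = composite_score(0.7593, 0.9758, 0.8552)

print(f"v4 composite: {v4:.4f}")
print(f"v3 composite: {v3:.4f}")
```

With GLEU pinned at 0.0, a four-term mean cannot exceed 0.75, so the v4 composite is not directly comparable to the v3 figure.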
Score Progression
| Metric | v1 | v2 | v3 | v4 | Δ v3→v4 |
|---|---|---|---|---|---|
| GLEU | — | 0.7506 | 0.7593 | 0.0000 | (metric bug) |
| BERTScore F1 | — | 0.9733 | 0.9758 | 0.9349 | −0.0409 |
| 1 − WER | — | 0.8488 | 0.8552 | 0.3191 | (model transition) |
| Human Score | — | — | — | 0.8272 | (new metric) |
| AI Score | — | — | — | 0.1728 | (new metric) |
| Composite | — | 0.8576 | 0.8634 | 0.5203 | (scope change) |
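For reference, the 1 − WER row is a word-level edit-distance metric. The sketch below is a minimal standalone implementation for illustration only; the repository's own scorer may tokenise or normalise text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = distance between the reference words seen so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "The student received a lot of information."
exact = word_error_rate(reference, reference)
rephrased = word_error_rate(reference, "The student got plenty of information.")
print(exact, rephrased)
```

A fluent paraphrase is penalised for every word that differs from the reference, which is why WER can move sharply when a model phrases its corrections differently.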
What Changed in v4
v4 is the most significant architectural upgrade to date. The base model changed from google/flan-t5-small (77M params) to grammarly/coedit-large (~400M params), a model already specialised for grammatical error correction. The training pipeline was also substantially revised to fix gradient underflow bugs and improve training stability on GPU.
| Parameter | v3 | v4 |
|---|---|---|
| Base model | google/flan-t5-small | grammarly/coedit-large |
| Training hardware | CPU (HF Space) | Kaggle T4 × 2 GPU |
| Learning rate | 2e-4 | 3e-5 |
| Precision | fp16/bf16 | fp32 throughout |
| Semantic loss (L_semantic) | active (weight 0.5) | removed |
| Cross-entropy implementation | custom scatter+log_softmax | nn.CrossEntropyLoss(label_smoothing=0.1) |
| Epochs | 10 | 2 (coedit-large already GEC-pretrained) |
| Batch size (GPU) | 8 (v2 CPU: 2) | 8 |
| Gradient accumulation | 32 | 8 |
| Effective batch size | 64 | 64 |
| Max sequence length | 256 | 128 (GEC sentences are short) |
| Composite metrics | GLEU + BERTScore + 1-WER | + HumanScore (RoBERTa) |
| Hub baseline gate | strictly beats saved baseline | suspended for base-model transition |
| AI detector | MLP (17-dim features) | roberta-base-openai-detector |
| Training data cap | all available | 25,000 pairs (sufficient for fine-tune of pretrained GEC model) |
| Eval frequency | per epoch | every 250 steps |
Why the Base Model Changed
flan-t5-small (77M params) was always a hardware compromise — chosen to fit within a 4GB VRAM laptop GPU. With Kaggle T4 GPUs (16GB × 2) available for training, the system can now use grammarly/coedit-large, which:
- Is already pretrained specifically on grammatical error correction tasks
- Has ~5× more parameters, giving it substantially more correction capacity
- Requires only 2 fine-tuning epochs to reach a useful correction quality (vs 10 for Flan-T5)
- Produces outputs that require less post-generation vocabulary elevation
Gradient Stability Fixes (v4)
Three bugs were identified in v3 that caused zero or near-zero gradients during training:
- LR too high: 2e-4 → 3e-5. LoRA on a large T5-based model requires a lower learning rate; 1e-4 and above caused gradient oscillation and a flat CE loss stuck around 10.0.
- Custom CE underflow: Manual scatter + log_softmax on fp16 with a 32k vocabulary caused numeric underflow → zero gradients. Replaced with nn.CrossEntropyLoss(label_smoothing=0.1).
- Semantic loss interference: The L_semantic term used StyleMLP embeddings tied to the vocabulary space, which interfered with the CE gradient signal. Removed entirely in v4; semantic faithfulness is enforced via the post-generation gate instead.
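To make the cross-entropy fix concrete, here is a dependency-free sketch of label-smoothed cross-entropy for a single token, using a numerically stable log-softmax (the max logit is subtracted before exponentiating). It follows the smoothing definition documented for nn.CrossEntropyLoss, but it is an illustration and not the training code.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Stable log-softmax: shift by the max logit so exp() cannot overflow."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_cross_entropy(logits: list[float], target: int,
                           smoothing: float = 0.1) -> float:
    """Cross-entropy against a target that mixes one-hot with a uniform distribution."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]                     # standard CE term
    uniform = -sum(log_probs) / len(log_probs)   # CE against the uniform distribution
    return (1.0 - smoothing) * nll + smoothing * uniform

# Extreme logits stay finite because of the max shift.
loss = smoothed_cross_entropy([1000.0, 0.0, -1000.0], target=0)
print(loss)
```

Setting smoothing=0.0 recovers plain cross-entropy, which makes the function easy to sanity-check against a reference implementation.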
Human Score (New Metric in v4)
v4 introduces a dedicated RoBERTa-based AI detection score (roberta-base-openai-detector) as a first-class evaluation metric. During the v4 run:
- Human score: 0.8272 — the model's outputs are classified as 82.7% likely human-written
- AI score: 0.1728 — only 17.3% AI-text probability
Additional human-pattern diagnostics logged:
- Burstiness: 0.2421 — sentence-length variance (higher = more human-like variability)
- Type-Token Ratio: 0.9623 — lexical diversity (near 1.0 = very high unique-word ratio)
- AI marker density: 0.000115 — near-zero usage of flagged AI clichés ("leverage", "delve", etc.)
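The README does not spell out how these diagnostics are computed, so the sketch below uses assumed definitions: burstiness as the coefficient of variation of sentence lengths, type-token ratio as unique words over total words, and marker density as flagged words over total words. The marker set is a two-word illustrative subset.

```python
import re
from statistics import mean, pstdev

AI_MARKERS = {"leverage", "delve"}  # illustrative subset; the full list is not shown here

def diagnostics(text: str) -> dict[str, float]:
    """Compute three human-pattern diagnostics for non-empty text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[a-z']+", text.lower())
    lengths = [len(s.split()) for s in sentences]
    return {
        # Assumed definition: std / mean of sentence lengths
        "burstiness": pstdev(lengths) / mean(lengths),
        "type_token_ratio": len(set(words)) / len(words),
        "ai_marker_density": sum(w in AI_MARKERS for w in words) / len(words),
    }

sample = "We delve into results. The results are clear and the method is simple."
stats = diagnostics(sample)
print(stats)
```

Empty input is not handled; a production version would need to guard the divisions.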
Combined Loss (v4)
L = L_CE + 0.3·L_style + 0.4·L_human (GPU)
The semantic loss term is removed from v4 onwards. Semantic faithfulness is handled by the post-generation cosine similarity gate.
| Term | Purpose | Weight |
|---|---|---|
| L_CE | nn.CrossEntropyLoss with label smoothing 0.1 | 1.0 |
| L_style | 1 − cos_sim(style(input), style(output)) | 0.3 |
| L_human | 1 − HumanPatternClassifier(output) (anti-AI penalty) | 0.4 |
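The weighting can be sketched on plain Python floats. In training these terms are tensors computed per batch; the function below only shows how the three terms combine, and its name and signature are hypothetical.

```python
import math

LAMBDA_STYLE = 0.3
LAMBDA_HUMAN = 0.4

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def combined_loss(ce: float, style_in: list[float], style_out: list[float],
                  human_prob: float) -> float:
    """L = L_CE + 0.3 * L_style + 0.4 * L_human, on already-computed scalar inputs."""
    l_style = 1.0 - cosine_similarity(style_in, style_out)
    l_human = 1.0 - human_prob  # human_prob: probability that the output is human-written
    return ce + LAMBDA_STYLE * l_style + LAMBDA_HUMAN * l_human

# Orthogonal style vectors and a 0.5 human score add 0.3 + 0.2 to the CE term.
print(combined_loss(2.0, [1.0, 0.0], [0.0, 1.0], human_prob=0.5))
```

When the output keeps the input's style vector and is scored as fully human, both penalty terms vanish and the loss reduces to cross-entropy.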
What Changed in v3
v3 keeps the same base model and LoRA rank as v2 but improves every other stage of the pipeline: wider context window, better generation, a semantic faithfulness gate that prevents meaning-destroying corrections, and optional ERRANT F0.5 evaluation.
| Parameter | v2 | v3 |
|---|---|---|
| Context window | 128 tokens | 256 tokens |
| Additional data | JFLEG + W&I only | + C4-200M-GEC (~100k pairs, falls back if unavailable) |
| Beam search | num_beams=2 | num_beams=5, length_penalty=1.2, repetition_penalty=1.3, no_repeat_ngram_size=3 |
| Faithfulness gate | none | cosine sim < 0.75 → revert output to source |
| Human-pattern loss | skipped on CPU | active on GPU |
| Evaluation cap | always 200 samples | 200 on CPU, full test set on GPU |
| ERRANT F0.5 | not present | optional metric |
| Composite | mean(GLEU, BERTScore, 1-WER) | mean(GLEU, BERTScore, 1-WER [, ERRANT F0.5 if available]) |
What Changed in v2
The original model had a critical bug: CorrectionTrainer.compute_loss() only used cross-entropy loss. The multi-objective loss was fully designed in loss_functions.py but was never wired into the trainer. v2 fixes this and upgrades several other parameters.
| Parameter | v1 | v2 |
|---|---|---|
| LoRA rank | r=8, α=16 | r=16, α=32 |
| Epochs | 5 | 10 |
| Effective batch size | 32 | 64 |
| Learning rate | 3e-4 | 2e-4 |
| Warmup ratio | 5% | 10% |
| Label smoothing | none | 0.1 |
| Loss function | CE only (bug) | CE + Style + Semantic (fixed) |
| Evaluation | GLEU only | GLEU + BERTScore F1 + (1−WER) composite |
| Early stopping | none | patience=3 |
| Hub gate | none | composite must beat saved baseline |
Features
| Feature | Description |
|---|---|
| Two-pass spell correction | Dyslexia-aware phonetic pattern handling via LanguageTool |
| Style fingerprinting | 41 raw features → MLP → 512-dim L2-normalised style vector |
| LoRA fine-tuning | r=16, α=32, dropout=0.05 — targeting all attention + FFN projections |
| Academic vocabulary elevation | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
| Human pattern anti-AI loss | RoBERTa-based AI detector (roberta-base-openai-detector) + burstiness/TTR/marker features |
| Combined training loss | L_CE + λ₁·L_style (+ λ₂·L_human on GPU) |
| Semantic faithfulness gate | Outputs with cosine sim < 0.75 to source are reverted — prevents meaning drift |
| Sentence-chunked inference | Long texts split into 128-token chunks matching training window |
| FastAPI server | RESTful /correct endpoint with CORS and rate limiting |
| Multi-stage training | Orchestrated via train.sh with checkpoint system |
| Synthetic data augmentation | DyslexiaSimulator generates realistic errors from clean text (25% error rate in v4) |
| Composite score gating | Hub push gated by composite score improvement (suspended during base-model transitions) |
Project Structure
Rewriter/
├── configs/
│ ├── training_config.yaml
│ ├── training_config_fast.yaml
│ ├── inference_config.yaml
│ ├── model_config.yaml
│ └── awl_config.yaml
├── scripts/
│ ├── train.py
│ ├── evaluate.py
│ ├── run_inference.py
│ ├── preprocess_data.py
│ ├── pretrain_human_pattern_classifier.py
│ ├── download_datasets.sh
│ └── download_kaggle_datasets.sh
├── src/
│ ├── model/
│ │ ├── base_model.py
│ │ ├── style_conditioner.py
│ │ ├── generation_utils.py
│ │ └── lora_adapter.py
│ ├── preprocessing/
│ │ ├── pipeline.py
│ │ ├── spell_corrector.py
│ │ ├── dyslexia_simulator.py
│ │ ├── dependency_parser.py
│ │ ├── ner_tagger.py
│ │ └── sentence_segmenter.py
│ ├── style/
│ │ ├── fingerprinter.py
│ │ ├── style_vector.py
│ │ ├── formality_classifier.py
│ │ └── emotion_classifier.py
│ ├── training/
│ │ ├── dataset.py
│ │ ├── trainer.py
│ │ ├── loss_functions.py
│ │ ├── human_pattern_extractor.py
│ │ └── callbacks.py
│ ├── vocabulary/
│ │ ├── lexical_substitution.py
│ │ ├── awl_loader.py
│ │ └── register_filter.py
│ ├── inference/
│ │ ├── corrector.py
│ │ └── postprocessor.py
│ ├── evaluation/
│ │ ├── gleu_scorer.py
│ │ ├── errant_evaluator.py
│ │ ├── style_metrics.py
│ │ └── authorship_verifier.py
│ └── api/
│ ├── main.py
│ ├── schemas.py
│ └── middleware.py
├── train_and_upgrade.py # v4 upgrade pipeline (Kaggle notebook edition)
├── data/
│ ├── raw/
│ ├── processed/
│ ├── cache/
│ └── awl/
├── train.sh
├── start.sh
├── baseline_score.json # Saved composite score (0.5203) — gate for Hub push
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── pyproject.toml
Model Architecture
PNG:
Mermaid Diagram:
graph TB
subgraph INFERENCE["🔮 Inference Pipeline"]
direction TB
INPUT["📝 Raw Dyslectic Text"]
subgraph PREPROCESS["Pre-Processing"]
SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
SENT_SEG["Sentence Segmenter"]
DEP_PARSE["Dependency Parser"]
NER["NER Tagger"]
end
subgraph STYLE["Style Analysis"]
FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
EMOTION["Emotion Classifier"]
FORMALITY["Formality Classifier"]
STYLE_VEC["Style Vector Composer"]
end
subgraph GENERATION["Core Generation"]
STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
BASE_MODEL["Base LM<br/><i>CoEdit-Large (warm-merged)</i>"]
LORA["LoRA Adapter<br/><i>r=16</i>"]
GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
end
subgraph POSTPROCESS["Post-Processing"]
FAITH["Faithfulness Gate<br/><i>cos sim < 0.75 → revert</i>"]
POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
end
OUTPUT["✅ Corrected Academic Text"]
INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
NER --> STYLE_COND
STYLE_VEC --> STYLE_COND
STYLE_COND --> BASE_MODEL
LORA -.->|"merged weights"| BASE_MODEL
BASE_MODEL --> GEN_UTILS --> FAITH --> POSTPROC
POSTPROC --> VOCAB_SUB
AWL --> VOCAB_SUB
VOCAB_SUB --> REG_FILTER --> OUTPUT
end
subgraph TRAINING["🏋️ Training Pipeline (v4)"]
direction TB
subgraph WARMSTART["Warm-Start Merge"]
HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=16 (v3)</i>"]
MERGE["merge_and_unload()"]
FRESH_LORA["Fresh LoRA r=16"]
end
subgraph DATA["Data Pipeline"]
JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
DYSLEXIA_AUG["DyslexiaSimulator<br/><i>25% error rate augmentation</i>"]
SPLIT["capped at 25k pairs — 88%/7%/5% splits"]
end
subgraph LOSS["Combined Loss (v4)"]
L_CE["L_CE (nn.CrossEntropyLoss, label_smoothing=0.1)"]
L_STYLE["0.3 · L_style"]
L_HUMAN["0.4 · L_human<br/><i>(RoBERTa detector)</i>"]
end
subgraph EVAL["Composite Evaluation"]
GLEU_E["GLEU *(metric fix pending)*"]
BERT_E["BERTScore F1"]
WER_E["1 − WER"]
HUMAN_E["HumanScore<br/><i>(roberta-base-openai-detector)</i>"]
COMPOSITE["Composite = mean(4 metrics)"]
HUB_PUSH["Push to Hub ✅"]
end
HUB_ADAPTER --> MERGE --> FRESH_LORA
JFLEG --> DYSLEXIA_AUG
WILOCNESS --> DYSLEXIA_AUG
DYSLEXIA_AUG --> SPLIT
L_CE --> COMPOSITE
L_STYLE --> COMPOSITE
L_HUMAN --> COMPOSITE
GLEU_E --> COMPOSITE
BERT_E --> COMPOSITE
WER_E --> COMPOSITE
HUMAN_E --> COMPOSITE
COMPOSITE --> HUB_PUSH
end
Design Choices & Rationale
Why CoEdit-Large? (v4 upgrade from Flan-T5-Small)
| Consideration | Decision |
|---|---|
| Task specialisation | CoEdit-Large is pretrained specifically on GEC and text editing tasks — it already knows how to correct errors before any fine-tuning |
| Parameter capacity | ~400M params vs 77M in Flan-T5-Small — substantially more correction capacity |
| Training efficiency | 2 fine-tuning epochs reach good correction quality vs 10 for Flan-T5 |
| Hardware | Kaggle T4 × 2 (16GB each) replaces the 4GB laptop GPU constraint |
| LoRA efficiency | r=16 adapter on CoEdit-Large is still <3% of total params |
Why Flan-T5-Small was used in v1–v3
| Consideration | Decision |
|---|---|
| Hardware constraint | RTX 3050 Laptop GPU (4GB VRAM) — ruled out models > 500M params |
| Architecture | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
| Instruction tuning | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
Why LoRA over Full Fine-Tuning?
- Memory: Full fine-tuning of CoEdit-Large requires ~6GB for gradients; LoRA r=16 fits comfortably in 4GB
- Warm-start safety: Merging the previous adapter's weights before applying a fresh LoRA preserves prior corrections
- Merging: LoRA weights merge into base model at inference time — zero latency overhead
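The merging point can be illustrated with plain lists. A LoRA layer computes W·x + (α/r)·B·A·x; folding (α/r)·B·A into W once gives the same output with a single matrix multiply. The matrices below are toy values, not weights from this model.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha: float, r: int):
    """Return W + (alpha / r) * B @ A, the merged weight used at inference."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy 2x2 layer with a rank-1 adapter (r=1, alpha=2).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (r, d_in)
B = [[0.5], [0.25]]    # shape (d_out, r)
x = [[3.0], [4.0]]     # column vector

merged = merge_lora(W, A, B, alpha=2.0, r=1)
adapter_path = matmul(B, matmul(A, x))
unmerged = [[wx[0] + 2.0 * bax[0]] for wx, bax in zip(matmul(W, x), adapter_path)]
print(matmul(merged, x), unmerged)
```

Both paths produce the same vector, which is the property merge_and_unload() relies on when it bakes the adapter into the base weights.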
Why a Combined Multi-Objective Loss?
The system uses (v4): L = L_CE + 0.3·L_style + 0.4·L_human
| Term | Purpose | Weight |
|---|---|---|
| L_CE | Cross-entropy with label smoothing (0.1) | 1.0 |
| L_style | 1 − cos_sim(style(input), style(output)); preserves writing fingerprint | 0.3 |
| L_human | Anti-AI penalty via RoBERTa detector | 0.4 |
The L_semantic term from v2/v3 was removed in v4. It used StyleMLP embeddings tied to the vocabulary space, which caused gradient interference with cross-entropy. Semantic faithfulness is now enforced exclusively via the post-generation cosine gate.
Why a Semantic Faithfulness Gate?
Even a well-trained correction model can occasionally produce outputs that drift semantically from the input. v3+ computes cosine similarity between source and output using all-MiniLM-L6-v2 sentence embeddings. Outputs below 0.75 similarity are treated as unreliable and the original input is returned unchanged.
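A minimal sketch of the gate logic follows. The embedder is passed in as a function so the example runs without downloading a model; the letter-count embedder is a toy stand-in for all-MiniLM-L6-v2, and the function names are hypothetical.

```python
import math
from typing import Callable, Sequence

FAITHFULNESS_THRESHOLD = 0.75

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def faithfulness_gate(source: str, output: str,
                      embed: Callable[[str], Sequence[float]],
                      threshold: float = FAITHFULNESS_THRESHOLD) -> tuple[str, bool]:
    """Return (text, reverted); revert to the source when similarity is below threshold."""
    similarity = cosine(embed(source), embed(output))
    if similarity < threshold:
        return source, True
    return output, False

# Toy embedder for demonstration only: letter-frequency counts.
def toy_embed(text: str) -> list[float]:
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

kept = faithfulness_gate("teh cat sat", "the cat sat", toy_embed)
reverted = faithfulness_gate("teh cat sat", "zzzz qqqq", toy_embed)
print(kept, reverted)
```

Returning the reverted flag alongside the text makes it straightforward to count fallbacks during evaluation.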
Why Sentence-Chunked Inference?
The model is trained with max_input_length=128 tokens. The task prefix alone consumes ~20 tokens. Long inputs are split into sentences, grouped into chunks that fit the 128-token budget, corrected independently, then rejoined.
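The grouping step can be sketched as a greedy packer. Whitespace word counts stand in for real tokeniser counts and a regex stands in for the sentence segmenter; both are simplifying assumptions, and chunk_sentences is a hypothetical name.

```python
import re
from typing import Callable

def chunk_sentences(text: str, budget: int = 128, prefix_tokens: int = 20,
                    count_tokens: Callable[[str], int] = lambda s: len(s.split())
                    ) -> list[str]:
    """Greedily pack whole sentences into chunks that fit budget - prefix_tokens."""
    limit = budget - prefix_tokens
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and used + n > limit:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)  # an over-long single sentence becomes its own chunk
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five six. Seven eight nine."
chunks = chunk_sentences(text, budget=26, prefix_tokens=20)
print(chunks)
```

Passing the model tokeniser's length function as count_tokens would make the budget exact; a sentence longer than the budget would still need truncation or a finer split downstream.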
Why Post-Generation Vocabulary Elevation?
Rather than relying solely on the model to produce academic vocabulary, a separate BERT-based lexical substitution pipeline is applied post-generation: POS-tag → identify non-AWL content words → BERT fill-mask → filter to AWL-only predictions → accept only if semantic_similarity > 0.82.
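The final two stages of that chain (AWL filtering and the similarity gate) can be sketched as a pure function. The candidate list and similarity scores are made-up inputs that a fill-mask model and a sentence-similarity scorer would normally supply, and pick_substitute is a hypothetical name.

```python
SEMANTIC_THRESHOLD = 0.82

def pick_substitute(word: str, candidates: list[tuple[str, float]],
                    awl: set[str], threshold: float = SEMANTIC_THRESHOLD) -> str:
    """Return the best AWL candidate above the similarity threshold, else the original word.

    candidates holds (word, sentence_similarity) pairs produced upstream.
    """
    eligible = [(sim, cand) for cand, sim in candidates
                if cand in awl and sim > threshold]
    return max(eligible)[1] if eligible else word

awl = {"obtain", "acquire", "demonstrate"}  # tiny illustrative subset of the AWL
candidates = [("get", 0.99), ("obtain", 0.91), ("acquire", 0.85), ("demonstrate", 0.40)]
choice = pick_substitute("got", candidates, awl)
print(choice)
```

Falling back to the original word when nothing passes both filters keeps the elevation step conservative.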
Quick Start
Prerequisites
- Python ≥ 3.10
- NVIDIA GPU with ≥ 8GB VRAM recommended (T4 or better); CPU supported but slow
- ~15GB disk space for models and datasets
Option A: Kaggle Notebook (v4 — Recommended)
Run the pipeline on Kaggle with T4 × 2 GPU:
- Upload train_and_upgrade.py as a Kaggle notebook
- Enable GPU (T4 × 2) and Internet
- Add your HuggingFace token as a Kaggle Secret named HF_TOKEN
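# Run the pipeline. Kaggle Secrets are read via kaggle_secrets rather than os.environ.
import os
from kaggle_secrets import UserSecretsClient
HF_TOKEN = UserSecretsClient().get_secret("HF_TOKEN")
os.environ["HF_TOKEN"] = HF_TOKEN  # also expose it to libraries that read the environment
main()  # entry point defined in train_and_upgrade.py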
The pipeline runs 8 steps automatically: Load base model → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Save → Push
Estimated runtime: 30–60 minutes on T4 × 2
Option B: Self-Improving Upgrade Pipeline (local GPU)
git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
pip install -r requirements.txt
export HF_TOKEN="your-hf-token-with-write-access"
python train_and_upgrade.py
Option C: Manual Step-by-Step
# 1. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm
# 2. Preprocess datasets
python scripts/preprocess_data.py
# 3. Pre-train the human pattern classifier
python scripts/pretrain_human_pattern_classifier.py
# 4. Train the correction model
PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss
# 5. Merge LoRA adapter into base model
python -c "
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
model = AutoModelForSeq2SeqLM.from_pretrained('grammarly/coedit-large', torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
model = model.merge_and_unload()
model.save_pretrained('checkpoints/best_model_merged')
AutoTokenizer.from_pretrained('grammarly/coedit-large').save_pretrained('checkpoints/best_model_merged')
"
# 6. Run inference
PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."
# 7. Or start the API server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
Training Pipeline
v4 Upgrade Pipeline (train_and_upgrade.py) — 8 Steps
| Step | Action |
|---|---|
| 1 | Load grammarly/coedit-large base model (fp32) + existing Hub LoRA adapter |
| 2 | Merge adapter into base weights (merge_and_unload) — warm start |
| 3 | Apply fresh LoRA r=16 on merged base |
| 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (25% error rate); cap at 25k pairs |
| 5 | Train with combined loss for 2 epochs (L_CE + L_style + L_human) |
| 6 | Evaluate: BERTScore F1 + (1−WER) + HumanScore [+ GLEU — fix pending] |
| 7 | Save adapter + merge full model |
| 8 | Push adapter + merged model (merged/ subfolder) to Hub; update baseline |
v3 Upgrade Pipeline — 10 Steps
| Step | Action |
|---|---|
| 1 | Load existing LoRA adapter (r=16, v2) from Hub |
| 2 | Merge into base weights — warm start |
| 3 | Apply fresh LoRA r=16 on merged base |
| 4 | Load JFLEG + W&I+LOCNESS + C4-GEC; augment (20% error rate) |
| 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
| 6 | Evaluate: GLEU + BERTScore F1 + (1−WER) [+ ERRANT F0.5 if installed] |
| 7 | Apply semantic faithfulness gate — revert outputs with cosine sim < 0.75 |
| 8 | Compare composite score against baseline_score.json |
| 9 | If improved: merge adapter → save full model |
| 10 | Push adapter + merged model to Hub; update baseline |
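Steps 8 to 10 amount to a strict comparison against a stored score. The sketch below assumes baseline_score.json holds a single "composite" key; the real file layout is not shown in this README, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def should_push(composite: float, baseline_path: Path) -> bool:
    """True when there is no saved baseline or the new composite strictly beats it."""
    if not baseline_path.exists():
        return True
    baseline = json.loads(baseline_path.read_text())["composite"]  # assumed key name
    return composite > baseline

def update_baseline(composite: float, baseline_path: Path) -> None:
    """Overwrite the stored baseline with the new composite score."""
    baseline_path.write_text(json.dumps({"composite": composite}, indent=2))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "baseline_score.json"
    first_run = should_push(0.8576, path)   # no baseline yet
    update_baseline(0.8576, path)           # store the v2 composite
    improved = should_push(0.8634, path)    # v3 composite against the v2 baseline
    tie = should_push(0.8576, path)         # a tie does not pass a strict gate
print(first_run, improved, tie)
```

Suspending the gate for a base-model transition, as v4 does, corresponds to skipping the should_push check for one run and then writing the new baseline.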
v1 Original Pipeline (train.sh) — 5 Stages
| Stage | Action |
|---|---|
| 1 | Setup & Dependencies |
| 2 | Data Preprocessing |
| 3 | Human Pattern Classifier Pre-Training |
| 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
| 5 | Evaluation (GLEU only) |
Hyperparameter Reference
v4 (train_and_upgrade.py — Kaggle T4 × 2)
BASE_MODEL_ID = "grammarly/coedit-large"
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
EPOCHS = 2 # coedit-large is already GEC-pretrained
BATCH_SIZE = 8 # per device (GPU)
GRAD_ACCUM = 8 # effective batch = 64
LR = 3e-5 # reduced from 1e-4; 1e-4 caused flat loss on T5-based models
WARMUP_RATIO = 0.05
LABEL_SMOOTHING = 0.1
MAX_INPUT_LEN = 128 # back to 128 — GEC sentences are typically short
MAX_TARGET_LEN = 128
MAX_TRAIN_PAIRS = 25_000
LAMBDA_STYLE = 0.3
# LAMBDA_SEMANTIC removed — caused gradient interference
LAMBDA_HUMAN = 0.4
FAITHFULNESS_THRESHOLD = 0.75
AI_DETECTION_THRESHOLD = 0.4
# fp16 and bf16 both False — fp32 throughout to prevent gradient underflow
v3 (train_and_upgrade.py)
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
EPOCHS = 10
BATCH_SIZE = 2 # CPU
GRAD_ACCUM = 32
LR = 2e-4
MAX_INPUT_LEN = 256
LAMBDA_STYLE = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN = 0.4
FAITHFULNESS_THRESHOLD = 0.75
v2 (train_and_upgrade.py)
LORA_R = 16
LORA_ALPHA = 32
EPOCHS = 10
BATCH_SIZE = 2
GRAD_ACCUM = 32
LR = 2e-4
MAX_INPUT_LEN = 128
LAMBDA_STYLE = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN = 0.4
v1 (configs/training_config.yaml)
lora:
r: 8
lora_alpha: 16
lora_dropout: 0.05
training:
per_device_train_batch_size: 4
gradient_accumulation_steps: 8
learning_rate: 3.0e-4
loss:
lambda_style: 0.3
lambda_semantic: 0.5
lambda_human_pattern: 0.4
configs/inference_config.yaml
model:
key: "coedit-large"
checkpoint_path: "checkpoints/best_model_merged"
use_lora: false
generation:
num_beams: 5
length_penalty: 1.2
repetition_penalty: 1.3
no_repeat_ngram_size: 3
max_new_tokens: 128
vocabulary:
semantic_threshold: 0.82
faithfulness:
threshold: 0.75
Inference Pipeline (8 Steps)
Raw Text
│
▼
1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
│
▼
2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
│
▼
3. Sentence-Chunked Generation ─ Split into 128-token chunks → CoEdit-Large → rejoin
│
▼
4. Faithfulness Gate ──── cosine_sim(source, output) < 0.75 → revert to source
│
▼
5. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
│
▼
6. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
│
▼
7. Register Filtering ── Expand contractions, replace colloquialisms
│
▼
8. Metrics ──────────── Style similarity, AWL coverage, readability scores
│
▼
Corrected Text
API Usage
# Start the server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
# Correct text
curl -X POST http://localhost:8000/correct \
-H "Content-Type: application/json" \
-d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'
# Health check
curl http://localhost:8000/health
Interactive docs at http://localhost:8000/docs.
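The same call can be made from Python using only the standard library. The request body mirrors the curl example; the shape of the JSON response is not documented here, so correct() simply returns whatever the server sends back.

```python
import json
import urllib.request

def build_correct_request(text: str, style_alpha: float = 0.6,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /correct endpoint with a JSON body."""
    payload = json.dumps({"text": text, "style_alpha": style_alpha}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/correct",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def correct(text: str, **kwargs) -> dict:
    """Send the request and decode the JSON response (the server must be running)."""
    with urllib.request.urlopen(build_correct_request(text, **kwargs), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))

request = build_correct_request("The studnet recieved alot of informtion.")
print(request.get_method(), request.full_url)
```

Calling correct("...") against a running server performs the request; the generous timeout allows for the slow first call caused by LanguageTool start-up.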
Hardware Requirements
| Tier | GPU | Config | Epochs | Training Time |
|---|---|---|---|---|
| Tested (v1) | RTX 3050 4GB | Flan-T5-Small, r=8 | 5 | ~45 min |
| Tested (v2 CPU) | None (HF Space CPU) | Flan-T5-Small, r=16 | 10 | ~12–24 hours |
| Tested (v3 CPU) | None (HF Space CPU) | Flan-T5-Small, r=16 | 10 | ~12–24 hours |
| Tested (v4) | Kaggle T4 × 2 (16GB each) | CoEdit-Large, r=16 | 2 | ~30–60 min |
| Recommended | RTX 3090 24GB | CoEdit-Large, r=16 + full loss | 3–5 | ~2–3h |
| Maximum | A100 80GB | Full pipeline with ERRANT | 10 | ~12h |
Data Sources
| Dataset | Type | Size | Access |
|---|---|---|---|
| JFLEG (jhu-clsp/jfleg) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
| W&I+LOCNESS (bea2019st/wi_locness) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
| C4-200M-GEC (cointegrated/c4_200m-gec-filtered) | Synthetic GEC pairs | ~100k pairs (capped) | HF Hub; falls back silently if unavailable |
| FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
| Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
| Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
| Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |
Note: train_and_upgrade.py (v4) uses JFLEG + W&I+LOCNESS capped at 25k pairs. C4-GEC and FCE require additional setup.
Dyslexia Error Simulation
The DyslexiaSimulator generates synthetic training data based on research by Rello et al. (2013, 2017). v4 uses a 25% per-word error rate (up from 20% in v2/v3).
| Error Type | Frequency | Example |
|---|---|---|
| Phonetic substitution | 35% | "because" → "becaus" |
| Letter transposition | 18% | "the" → "teh" |
| Letter omission | 16% | "important" → "importnt" |
| Letter doubling | 12% | "letter" → "lettter" |
| Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
| Word boundary errors | 9% | "a lot" → "alot" |
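A toy version of the sampling scheme is sketched below. It draws an error type per corrupted word using the table's frequencies as weights, and implements only four of the six types; phonetic and word-boundary errors are left as pass-throughs. This is an illustration, not the DyslexiaSimulator class itself.

```python
import random
from typing import Optional

ERROR_WEIGHTS = {  # frequencies from the table above
    "phonetic": 35, "transposition": 18, "omission": 16,
    "doubling": 12, "reversal": 10, "boundary": 9,
}
REVERSALS = str.maketrans("bdpq", "dbqp")

def transpose(word: str, rng: random.Random) -> str:
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def omit(word: str, rng: random.Random) -> str:
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word))  # keep the first letter
    return word[:i] + word[i + 1:]

def double(word: str, rng: random.Random) -> str:
    i = rng.randrange(len(word))
    return word[:i] + word[i] + word[i:]

def reverse_letter(word: str, rng: random.Random) -> str:
    positions = [i for i, c in enumerate(word) if c in "bdpq"]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + word[i].translate(REVERSALS) + word[i + 1:]

APPLY = {"transposition": transpose, "omission": omit,
         "doubling": double, "reversal": reverse_letter}

def simulate(text: str, error_rate: float = 0.25, seed: Optional[int] = None) -> str:
    """Corrupt each word with probability error_rate, choosing the error type by weight."""
    rng = random.Random(seed)
    kinds, weights = zip(*ERROR_WEIGHTS.items())
    out = []
    for word in text.split():
        if rng.random() < error_rate:
            kind = rng.choices(kinds, weights=weights, k=1)[0]
            word = APPLY.get(kind, lambda w, _rng: w)(word, rng)  # unimplemented types pass through
        out.append(word)
    return " ".join(out)

clean = "the student received a lot of important information"
print(simulate(clean, seed=0))
```

Fixing the seed makes the augmentation reproducible across runs, which helps when comparing models trained on the same synthetic pairs.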
Style Fingerprint Vector
The 512-dimensional style vector captures 41 raw features:
| Group | Features | Count |
|---|---|---|
| Sentence stats | mean, std, skew of sentence lengths | 3 |
| Word stats | mean, std of word lengths | 2 |
| Lexical | type-token ratio, lexical density | 2 |
| Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
| Discourse | 20 academic discourse markers (per 100 words) | 20 |
| Register | hedging frequency, formality score, nominalization ratio | 3 |
| Readability | Flesch reading ease, avg syllables per word | 2 |
| Pronouns | first-person ratio, third-person ratio | 2 |
| Other | question ratio, exclamation ratio, AWL coverage | 3 |
Projected through a 2-layer MLP (41 → 256 → 512) with LayerNorm and GELU activation, then L2-normalised.
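As an illustration of the raw-feature stage, the sketch below computes five of the 41 features (the sentence-length and word-length statistics) and shows L2 normalisation. In the real pipeline normalisation is applied to the 512-dim MLP output, not to the raw features; it is applied to the raw vector here only to demonstrate the operation. The regex sentence splitter and all function names are assumptions.

```python
import math
import re
from statistics import mean, pstdev

def l2_normalise(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def skewness(values: list[float]) -> float:
    """Population skewness; 0.0 when all values are equal."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def length_features(text: str) -> list[float]:
    """Sentence-length mean/std/skew followed by word-length mean/std."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    sent_lens = [float(len(s.split())) for s in sentences]
    word_lens = [float(len(w)) for w in re.findall(r"[A-Za-z']+", text)]
    return [mean(sent_lens), pstdev(sent_lens), skewness(sent_lens),
            mean(word_lens), pstdev(word_lens)]

raw = length_features("I like cats. Dogs are fine too, I suppose.")
unit = l2_normalise(raw)
print(raw)
print(sum(v * v for v in unit))
```

Because the final style vector is unit-length, the cosine similarity between two fingerprints reduces to a plain dot product.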
Known Limitations
- GLEU scoring bug (v4): The google_bleu metric from the evaluate library is incompatible with CoEdit-Large's tokenisation space and scores 0.0. This will be resolved in v5 with a tokeniser-aware GLEU implementation. BERTScore F1 (0.9349) is a more reliable quality indicator for this run.
- 1-WER regression (v4): The 1-WER drop (0.8552 → 0.3191) partly reflects the model transition and partly reflects coedit-large phrasing its corrections differently from Flan-T5: word-level edit distance to the reference set increases when the output is fluent but uses different vocabulary. This will be re-evaluated against CoEdit-specific reference corrections in v5.
- Training window: 128-token max input — very long paragraphs may be split mid-clause.
- Vocabulary elevation: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
- Already-correct text: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
- LanguageTool latency: Spell correction takes ~15–20s due to JVM startup on first call.
- Faithfulness gate conservatism: The 0.75 cosine similarity threshold occasionally reverts valid-but-heavily-corrected outputs. Monitor num_fallback in evaluation to tune the threshold.