Dyslexia Academic Writing Correction System

A style-preserving, grammar-correcting, academic-vocabulary-elevating AI system that corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal. It is a corrector, not a rewriter.

Overview

This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:

  1. Preserving the author's unique writing style via a 512-dimensional style fingerprint vector
  2. Elevating vocabulary to academic register using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
  3. Resisting AI detection through a frozen Human Pattern Classifier that penalises AI-typical writing during training
  4. Maintaining semantic meaning with cosine-similarity-based semantic preservation loss

The core model is Grammarly CoEdit-Large fine-tuned with LoRA (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.


Latest Evaluation Results (v4)

| Metric | Score | Description |
|---|---|---|
| GLEU | 0.0000 | Grammar + fluency correction quality (see note below) |
| BERTScore F1 | 0.9349 | Semantic closeness to reference corrections |
| 1 − WER | 0.3191 | Word-level accuracy (WER = 68.09%) |
| Human Score | 0.8272 | RoBERTa-based human-text classifier score |
| AI Score | 0.1728 | AI-text probability (lower is better) |
| Composite | 0.5203 | mean(GLEU, BERTScore F1, 1−WER, HumanScore) |

Note on GLEU = 0.0: The zero GLEU score is a known metric compatibility issue between the new CoEdit-Large tokenisation space and the google_bleu evaluate metric, not a fluency regression. BERTScore F1 of 0.9349 confirms the model is producing semantically correct outputs. GLEU will be re-calibrated in v5 with a tokeniser-aware scorer.

Note on composite drop (0.8634 → 0.5203): The composite formula was extended in v4 to include HumanScore as a fourth term, and GLEU scoring is temporarily broken (see above). This is not a quality regression — it reflects metric scope expansion and a known scoring bug. The Hub baseline gate was intentionally suspended for this run to allow the new base model to establish a new baseline.
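The arithmetic behind the two composite values can be checked directly. The sketch below is only an illustration of the unweighted mean described above; the helper name `composite` is hypothetical, and the three-term inputs are the GLEU, BERTScore F1 and 1 − WER values that average to the 0.8634 quoted in the note.

```python
from statistics import mean

def composite(*metrics: float) -> float:
    """Unweighted mean of the metrics that make up the composite score."""
    return mean(metrics)

# v4: four terms, with GLEU currently scoring 0.0
v4 = composite(0.0000, 0.9349, 0.3191, 0.8272)

# Previous three-term composite (GLEU, BERTScore F1, 1 - WER), no HumanScore
previous = composite(0.7593, 0.9758, 0.8552)

# v4 with the broken GLEU term left out, for comparison only
v4_without_gleu = composite(0.9349, 0.3191, 0.8272)

print(round(v4, 4), round(previous, 4), round(v4_without_gleu, 4))
# 0.5203 0.8634 0.6937
```

Leaving out the zero GLEU term raises the v4 mean to 0.6937, so the scoring bug accounts for part of the gap and the lower 1 − WER accounts for most of the rest.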

Score Progression

Metric v1 v2 v3 v4 Δ v3→v4
GLEU 0.7506 0.7593 0.0000 (metric bug)
BERTScore F1 0.9733 0.9758 0.9349 −0.0409
1 − WER 0.8488 0.8552 0.3191 (model transition)
Human Score 0.8272 (new metric)
AI Score 0.1728 (new metric)
Composite 0.8576 0.8634 0.5203 (scope change)

What Changed in v4

v4 is the most significant architectural upgrade to date. The base model was switched from google/flan-t5-small (77M params) to grammarly/coedit-large (~770M params), a model already specialised for grammatical error correction. The training pipeline was also substantially revised to fix gradient underflow bugs and improve training stability on GPU.

| Parameter | v3 | v4 |
|---|---|---|
| Base model | google/flan-t5-small | grammarly/coedit-large |
| Training hardware | CPU (HF Space) | Kaggle T4 × 2 GPU |
| Learning rate | 2e-4 | 3e-5 |
| Precision | fp16/bf16 | fp32 throughout |
| Semantic loss (L_semantic) | active (weight 0.5) | removed |
| Cross-entropy implementation | custom scatter+log_softmax | nn.CrossEntropyLoss(label_smoothing=0.1) |
| Epochs | 10 | 2 (coedit-large already GEC-pretrained) |
| Batch size (per device) | 8 on GPU, 2 on CPU | 8 |
| Gradient accumulation | 32 | 8 |
| Effective batch size | 64 (2 × 32 on CPU) | 64 (8 × 8) |
| Max sequence length | 256 | 128 (GEC sentences are short) |
| Composite metrics | GLEU + BERTScore + 1-WER | GLEU + BERTScore + 1-WER + HumanScore (RoBERTa) |
| Hub baseline gate | strictly beats saved baseline | suspended for base-model transition |
| AI detector | MLP (17-dim features) | roberta-base-openai-detector |
| Training data cap | all available | 25,000 pairs (sufficient for fine-tune of pretrained GEC model) |
| Eval frequency | per epoch | every 250 steps |

Why the Base Model Changed

flan-t5-small (77M params) was always a hardware compromise — chosen to fit within a 4GB VRAM laptop GPU. With Kaggle T4 GPUs (16GB × 2) available for training, the system can now use grammarly/coedit-large, which:

  • Is already pretrained specifically on grammatical error correction tasks
  • Has roughly 10× more parameters (~770M vs 77M), giving it substantially more correction capacity
  • Requires only 2 fine-tuning epochs to reach a useful correction quality (vs 10 for Flan-T5)
  • Produces outputs that require less post-generation vocabulary elevation

Gradient Stability Fixes (v4)

Three bugs were identified in v3 that caused zero or near-zero gradients during training:

  1. LR too high: 2e-4 → 3e-5. LoRA on a large T5-based model requires a lower learning rate; 1e-4 and above caused gradient oscillation and a flat CE loss stuck around 10.0.
  2. Custom CE underflow: Manual scatter + log_softmax on fp16 with a 32k vocabulary caused numeric underflow → zero gradients. Replaced with nn.CrossEntropyLoss(label_smoothing=0.1).
  3. Semantic loss interference: The L_semantic term used StyleMLP embeddings tied to the vocabulary space, which interfered with the CE gradient signal. Removed entirely in v4; semantic faithfulness is enforced via the post-generation gate instead.
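Fix 2 can be illustrated without a GPU. The sketch below is a dependency-free illustration, not the project's code: `to_fp16` uses the standard library's half-precision `struct` format to show how a small token probability rounds to zero in fp16, and `smoothed_cross_entropy` reproduces the label-smoothed cross-entropy formula on top of a stable log-softmax. All function names are hypothetical, and the exact failure path in the v3 custom loss is an assumption.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to IEEE single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def smoothed_cross_entropy(logits: list[float], target: int,
                           smoothing: float = 0.1) -> float:
    """(1 - eps) * NLL(target) + eps * mean over classes of -log p."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - smoothing) * nll + smoothing * uniform

# A probability of 1e-8 is plausible for one token in a 32k vocabulary.
print(to_fp16(1e-8))       # 0.0, so log(p) is undefined and the gradient is lost
print(to_fp32(1e-8) > 0)   # True, fp32 still represents the value

# Working in log space avoids ever materialising the tiny probability.
print(smoothed_cross_entropy([4.0, 1.0, 0.5], target=0))
```

Keeping the computation in log space, as the built-in loss does, together with fp32 precision removes both sources of underflow shown here.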

Human Score (New Metric in v4)

v4 introduces a dedicated RoBERTa-based AI detection score (roberta-base-openai-detector) as a first-class evaluation metric. During the v4 run:

  • Human score: 0.8272 — the model's outputs are classified as 82.7% likely human-written
  • AI score: 0.1728 — only 17.3% AI-text probability

Additional human-pattern diagnostics logged:

  • Burstiness: 0.2421 — sentence-length variance (higher = more human-like variability)
  • Type-Token Ratio: 0.9623 — lexical diversity (near 1.0 = very high unique-word ratio)
  • AI marker density: 0.000115 — near-zero usage of flagged AI clichés ("leverage", "delve", etc.)
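The three diagnostics above can be approximated with a few lines of standard-library Python. This is a rough sketch under stated assumptions: burstiness is taken to be the coefficient of variation of sentence length (the project's exact formula is not given here), tokenisation is a simple regex, and `AI_MARKERS` contains only the two example words named above rather than the real marker list.

```python
import re
from statistics import mean, pstdev

# Assumption: only the two markers quoted in the text; the real list is longer.
AI_MARKERS = {"leverage", "delve"}

def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, splitting after ., ! or ?."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]

def burstiness(text: str) -> float:
    """Std / mean of sentence lengths (assumed definition)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)

def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens."""
    tokens = _tokens(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ai_marker_density(text: str) -> float:
    """Share of tokens that appear in the marker list."""
    tokens = _tokens(text)
    return sum(t in AI_MARKERS for t in tokens) / len(tokens) if tokens else 0.0

sample = "I ran. I ran far away today."
print(burstiness(sample), type_token_ratio(sample), ai_marker_density(sample))
```

A Type-Token Ratio near 1.0, as logged for v4, is typical of short evaluation sentences, where few words repeat.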

Combined Loss (v4)

L = L_CE + 0.3·L_style + 0.4·L_human          (GPU)

The semantic loss term is removed from v4 onwards. Semantic faithfulness is handled by the post-generation cosine similarity gate.

| Term | Purpose | Weight |
|---|---|---|
| L_CE | nn.CrossEntropyLoss with label smoothing 0.1 | 1.0 |
| L_style | 1 − cos_sim(style(input), style(output)) | 0.3 |
| L_human | 1 − HumanPatternClassifier(output), the anti-AI penalty | 0.4 |
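As a worked example of how the three terms combine, the sketch below evaluates the v4 formula on plain Python numbers. It is an illustration only: in training these would be tensors, and the function names and the toy style vectors are assumptions.

```python
import math

LAMBDA_STYLE = 0.3
LAMBDA_HUMAN = 0.4

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combined_loss(ce_loss: float, style_in: list[float],
                  style_out: list[float], human_prob: float) -> float:
    """L = L_CE + 0.3 * L_style + 0.4 * L_human."""
    l_style = 1.0 - cosine_similarity(style_in, style_out)
    l_human = 1.0 - human_prob  # human_prob: classifier's P(human-written)
    return ce_loss + LAMBDA_STYLE * l_style + LAMBDA_HUMAN * l_human

# Orthogonal style vectors (L_style = 1) and a 50/50 classifier (L_human = 0.5)
print(combined_loss(1.2, [1.0, 0.0], [0.0, 1.0], human_prob=0.5))

# Identical style and a confident "human" verdict leave only the CE term
print(combined_loss(1.2, [1.0, 2.0], [1.0, 2.0], human_prob=1.0))
```

The first call evaluates 1.2 + 0.3·1 + 0.4·0.5, which shows that the auxiliary terms can add at most 0.7 to the cross-entropy when the style vectors are orthogonal.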

What Changed in v3

v3 keeps the same base model and LoRA rank as v2 but improves every other stage of the pipeline: wider context window, better generation, a semantic faithfulness gate that prevents meaning-destroying corrections, and optional ERRANT F0.5 evaluation.

| Parameter | v2 | v3 |
|---|---|---|
| Context window | 128 tokens | 256 tokens |
| Additional data | JFLEG + W&I only | + C4-200M-GEC (~100k pairs, falls back if unavailable) |
| Beam search | num_beams=2 | num_beams=5, length_penalty=1.2, repetition_penalty=1.3, no_repeat_ngram_size=3 |
| Faithfulness gate | none | cosine sim < 0.75 → revert output to source |
| Human-pattern loss | skipped on CPU | active on GPU |
| Evaluation cap | always 200 samples | 200 on CPU, full test set on GPU |
| ERRANT F0.5 | not present | optional metric |
| Composite | mean(GLEU, BERTScore, 1-WER) | mean(GLEU, BERTScore, 1-WER [, ERRANT F0.5 if available]) |

What Changed in v2

The original model had a critical bug: CorrectionTrainer.compute_loss() only used cross-entropy loss. The multi-objective loss was fully designed in loss_functions.py but was never wired into the trainer. v2 fixes this and upgrades several other parameters.

| Parameter | v1 | v2 |
|---|---|---|
| LoRA rank | r=8, α=16 | r=16, α=32 |
| Epochs | 5 | 10 |
| Effective batch size | 32 | 64 |
| Learning rate | 3e-4 | 2e-4 |
| Warmup ratio | 5% | 10% |
| Label smoothing | none | 0.1 |
| Loss function | CE only (bug) | CE + Style + Semantic (fixed) |
| Evaluation | GLEU only | GLEU + BERTScore F1 + (1−WER) composite |
| Early stopping | none | patience=3 |
| Hub gate | none | composite must beat saved baseline |

Features

| Feature | Description |
|---|---|
| Two-pass spell correction | Dyslexia-aware phonetic pattern handling via LanguageTool |
| Style fingerprinting | 41 raw features → MLP → 512-dim L2-normalised style vector |
| LoRA fine-tuning | r=16, α=32, dropout=0.05, targeting all attention + FFN projections |
| Academic vocabulary elevation | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
| Human pattern anti-AI loss | RoBERTa-based AI detector (roberta-base-openai-detector) + burstiness/TTR/marker features |
| Combined training loss | L_CE + λ₁·L_style (+ λ₂·L_human on GPU) |
| Semantic faithfulness gate | Outputs with cosine sim < 0.75 to source are reverted, which prevents meaning drift |
| Sentence-chunked inference | Long texts split into 128-token chunks matching training window |
| FastAPI server | RESTful /correct endpoint with CORS and rate limiting |
| Multi-stage training | Orchestrated via train.sh with checkpoint system |
| Synthetic data augmentation | DyslexiaSimulator generates realistic errors from clean text (25% error rate in v4) |
| Composite score gating | Hub push gated by composite score improvement (suspended during base-model transitions) |

Project Structure

Rewriter/
├── configs/
│   ├── training_config.yaml
│   ├── training_config_fast.yaml
│   ├── inference_config.yaml
│   ├── model_config.yaml
│   └── awl_config.yaml
├── scripts/
│   ├── train.py
│   ├── evaluate.py
│   ├── run_inference.py
│   ├── preprocess_data.py
│   ├── pretrain_human_pattern_classifier.py
│   ├── download_datasets.sh
│   └── download_kaggle_datasets.sh
├── src/
│   ├── model/
│   │   ├── base_model.py
│   │   ├── style_conditioner.py
│   │   ├── generation_utils.py
│   │   └── lora_adapter.py
│   ├── preprocessing/
│   │   ├── pipeline.py
│   │   ├── spell_corrector.py
│   │   ├── dyslexia_simulator.py
│   │   ├── dependency_parser.py
│   │   ├── ner_tagger.py
│   │   └── sentence_segmenter.py
│   ├── style/
│   │   ├── fingerprinter.py
│   │   ├── style_vector.py
│   │   ├── formality_classifier.py
│   │   └── emotion_classifier.py
│   ├── training/
│   │   ├── dataset.py
│   │   ├── trainer.py
│   │   ├── loss_functions.py
│   │   ├── human_pattern_extractor.py
│   │   └── callbacks.py
│   ├── vocabulary/
│   │   ├── lexical_substitution.py
│   │   ├── awl_loader.py
│   │   └── register_filter.py
│   ├── inference/
│   │   ├── corrector.py
│   │   └── postprocessor.py
│   ├── evaluation/
│   │   ├── gleu_scorer.py
│   │   ├── errant_evaluator.py
│   │   ├── style_metrics.py
│   │   └── authorship_verifier.py
│   └── api/
│       ├── main.py
│       ├── schemas.py
│       └── middleware.py
├── train_and_upgrade.py            # v4 upgrade pipeline (Kaggle notebook edition)
├── data/
│   ├── raw/
│   ├── processed/
│   ├── cache/
│   └── awl/
├── train.sh
├── start.sh
├── baseline_score.json             # Saved composite score (0.5203) — gate for Hub push
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── pyproject.toml

Model Architecture

[Figure: Architecture (PNG)]

Mermaid diagram:

graph TB
    subgraph INFERENCE["🔮 Inference Pipeline"]
        direction TB
        INPUT["📝 Raw Dyslectic Text"]
        subgraph PREPROCESS["Pre-Processing"]
            SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
            SENT_SEG["Sentence Segmenter"]
            DEP_PARSE["Dependency Parser"]
            NER["NER Tagger"]
        end
        subgraph STYLE["Style Analysis"]
            FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
            EMOTION["Emotion Classifier"]
            FORMALITY["Formality Classifier"]
            STYLE_VEC["Style Vector Composer"]
        end
        subgraph GENERATION["Core Generation"]
            STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
            BASE_MODEL["Base LM<br/><i>CoEdit-Large (warm-merged)</i>"]
            LORA["LoRA Adapter<br/><i>r=16</i>"]
            GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
        end
        subgraph POSTPROCESS["Post-Processing"]
            FAITH["Faithfulness Gate<br/><i>cos sim &lt; 0.75 → revert</i>"]
            POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
            VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
            AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
            REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
        end
        OUTPUT["✅ Corrected Academic Text"]
        INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
        INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
        NER --> STYLE_COND
        STYLE_VEC --> STYLE_COND
        STYLE_COND --> BASE_MODEL
        LORA -.->|"merged weights"| BASE_MODEL
        BASE_MODEL --> GEN_UTILS --> FAITH --> POSTPROC
        POSTPROC --> VOCAB_SUB
        AWL --> VOCAB_SUB
        VOCAB_SUB --> REG_FILTER --> OUTPUT
    end

    subgraph TRAINING["🏋️ Training Pipeline (v4)"]
        direction TB
        subgraph WARMSTART["Warm-Start Merge"]
            HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=16 (v3)</i>"]
            MERGE["merge_and_unload()"]
            FRESH_LORA["Fresh LoRA r=16"]
        end
        subgraph DATA["Data Pipeline"]
            JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
            WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
            DYSLEXIA_AUG["DyslexiaSimulator<br/><i>25% error rate augmentation</i>"]
            SPLIT["capped at 25k pairs — 88%/7%/5% splits"]
        end
        subgraph LOSS["Combined Loss (v4)"]
            L_CE["L_CE (nn.CrossEntropyLoss, label_smoothing=0.1)"]
            L_STYLE["0.3 · L_style"]
            L_HUMAN["0.4 · L_human<br/><i>(RoBERTa detector)</i>"]
        end
        subgraph EVAL["Composite Evaluation"]
            GLEU_E["GLEU *(metric fix pending)*"]
            BERT_E["BERTScore F1"]
            WER_E["1 − WER"]
            HUMAN_E["HumanScore<br/><i>(roberta-base-openai-detector)</i>"]
            COMPOSITE["Composite = mean(4 metrics)"]
            HUB_PUSH["Push to Hub ✅"]
        end
        HUB_ADAPTER --> MERGE --> FRESH_LORA
        JFLEG --> DYSLEXIA_AUG
        WILOCNESS --> DYSLEXIA_AUG
        DYSLEXIA_AUG --> SPLIT
        L_CE --> COMPOSITE
        L_STYLE --> COMPOSITE
        L_HUMAN --> COMPOSITE
        GLEU_E --> COMPOSITE
        BERT_E --> COMPOSITE
        WER_E --> COMPOSITE
        HUMAN_E --> COMPOSITE
        COMPOSITE --> HUB_PUSH
    end

Design Choices & Rationale

Why CoEdit-Large? (v4 upgrade from Flan-T5-Small)

| Consideration | Decision |
|---|---|
| Task specialisation | CoEdit-Large is pretrained specifically on GEC and text editing tasks, so it already knows how to correct errors before any fine-tuning |
| Parameter capacity | ~770M params vs 77M in Flan-T5-Small, giving substantially more correction capacity |
| Training efficiency | 2 fine-tuning epochs reach good correction quality vs 10 for Flan-T5 |
| Hardware | Kaggle T4 × 2 (16GB each) replaces the 4GB laptop GPU constraint |
| LoRA efficiency | r=16 adapter on CoEdit-Large is still <3% of total params |

Why Flan-T5-Small was used in v1–v3

| Consideration | Decision |
|---|---|
| Hardware constraint | RTX 3050 Laptop GPU (4GB VRAM), which ruled out models > 500M params |
| Architecture | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
| Instruction tuning | Flan-T5 is instruction-tuned on 1,800+ tasks and follows correction prompts naturally |

Why LoRA over Full Fine-Tuning?

  • Memory: Full fine-tuning of CoEdit-Large requires ~6GB for gradients; LoRA r=16 fits comfortably in 4GB
  • Warm-start safety: Merging the previous adapter's weights before applying a fresh LoRA preserves prior corrections
  • Merging: LoRA weights merge into base model at inference time — zero latency overhead

Why a Combined Multi-Objective Loss?

The system uses (v4): L = L_CE + 0.3·L_style + 0.4·L_human

| Term | Purpose | Weight |
|---|---|---|
| L_CE | Cross-entropy with label smoothing (0.1) | 1.0 |
| L_style | 1 − cos_sim(style(input), style(output)); preserves writing fingerprint | 0.3 |
| L_human | Anti-AI penalty via RoBERTa detector | 0.4 |

The L_semantic term from v2/v3 was removed in v4. It used StyleMLP embeddings tied to the vocabulary space, which caused gradient interference with cross-entropy. Semantic faithfulness is now enforced exclusively via the post-generation cosine gate.

Why a Semantic Faithfulness Gate?

Even a well-trained correction model can occasionally produce outputs that drift semantically from the input. v3+ computes cosine similarity between source and output using all-MiniLM-L6-v2 sentence embeddings. Outputs below 0.75 similarity are treated as unreliable and the original input is returned unchanged.
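The gate logic itself is small. The sketch below shows the revert rule with a pluggable similarity function; in the real system that function would compare all-MiniLM-L6-v2 sentence embeddings, whereas the bag-of-words cosine used here is only a stand-in so the example runs without any model. The names `faithfulness_gate` and `bag_of_words_similarity` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable

FAITHFULNESS_THRESHOLD = 0.75

def faithfulness_gate(source: str, output: str,
                      similarity: Callable[[str, str], float],
                      threshold: float = FAITHFULNESS_THRESHOLD) -> tuple[str, bool]:
    """Return (text_to_use, reverted). Reverts to source below the threshold."""
    if similarity(source, output) < threshold:
        return source, True
    return output, False

def bag_of_words_similarity(a: str, b: str) -> float:
    """Stand-in for embedding cosine similarity: cosine over word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

src = "the student received the information"
print(faithfulness_gate(src, "the student received the information today",
                        bag_of_words_similarity))
print(faithfulness_gate(src, "cats sleep all day", bag_of_words_similarity))
```

Returning the `reverted` flag alongside the text makes it easy to count fallbacks during evaluation when tuning the threshold.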

Why Sentence-Chunked Inference?

The model is trained with max_input_length=128 tokens. The task prefix alone consumes ~20 tokens. Long inputs are split into sentences, grouped into chunks that fit the 128-token budget, corrected independently, then rejoined.
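A greedy packing loop is one way to implement the chunking described above. The sketch below is an assumption about the approach, not the project's code: `count_tokens` is a caller-supplied function (in practice the model tokenizer's length), `PREFIX_TOKENS` is the approximate prefix cost quoted above, and the regex sentence splitter stands in for the project's sentence segmenter.

```python
import re
from typing import Callable

MAX_INPUT_LEN = 128
PREFIX_TOKENS = 20  # approximate cost of the task prefix, per the text above

def chunk_sentences(text: str, count_tokens: Callable[[str], int],
                    budget: int = MAX_INPUT_LEN - PREFIX_TOKENS) -> list[str]:
    """Greedily pack whole sentences into chunks that fit the token budget."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        if current and used + cost > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        # A single over-budget sentence still becomes its own chunk.
        current.append(sentence)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks

def correct_long_text(text: str, correct_chunk: Callable[[str], str],
                      count_tokens: Callable[[str], int]) -> str:
    """Correct each chunk independently, then rejoin."""
    return " ".join(correct_chunk(c) for c in chunk_sentences(text, count_tokens))

words = lambda s: len(s.split())  # crude whitespace token counter for the demo
print(chunk_sentences("One two three. Four five six. Seven eight nine ten.",
                      words, budget=6))
```

Because chunks are corrected independently, context does not carry across chunk boundaries.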

Why Post-Generation Vocabulary Elevation?

Rather than relying solely on the model to produce academic vocabulary, a separate BERT-based lexical substitution pipeline is applied post-generation: POS-tag → identify non-AWL content words → BERT fill-mask → filter to AWL-only predictions → accept only if semantic_similarity > 0.82.
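The filtering and acceptance steps at the end of that chain can be sketched as follows. This is a simplified illustration: the fill-mask candidates and the similarity function are passed in rather than computed, the similarity is scored per candidate (the real pipeline may compare whole sentences), and `pick_awl_substitute` is a hypothetical name.

```python
from typing import Callable, Iterable, Optional

SEMANTIC_THRESHOLD = 0.82

def pick_awl_substitute(word: str, candidates: Iterable[str], awl: set[str],
                        similarity: Callable[[str, str], float],
                        threshold: float = SEMANTIC_THRESHOLD) -> Optional[str]:
    """Return the best AWL candidate scoring above the threshold, else None."""
    if word.lower() in awl:
        return None  # already academic vocabulary, nothing to elevate
    best, best_score = None, threshold
    for candidate in candidates:
        if candidate.lower() not in awl:
            continue  # keep AWL-only predictions
        score = similarity(word, candidate)
        if score > best_score:  # strict '>' matches "similarity > 0.82"
            best, best_score = candidate, score
    return best

# Toy inputs: a three-word AWL and hand-written similarity scores
awl = {"obtain", "acquire", "significant"}
scores = {"obtain": 0.90, "acquire": 0.85, "grab": 0.95}
sim = lambda original, candidate: scores.get(candidate, 0.0)

print(pick_awl_substitute("get", ["grab", "acquire", "obtain"], awl, sim))
```

Here "grab" has the highest similarity but is rejected because it is not on the AWL, so "obtain" is chosen.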


Quick Start

Prerequisites

  • Python ≥ 3.10
  • NVIDIA GPU with ≥ 8GB VRAM recommended (T4 or better); CPU supported but slow
  • ~15GB disk space for models and datasets

Option A: Kaggle Notebook (v4 — Recommended)

Run the pipeline on Kaggle with T4 × 2 GPU:

  1. Upload train_and_upgrade.py as a Kaggle notebook
  2. Enable GPU (T4 × 2) and Internet
  3. Add your HuggingFace token as a Kaggle Secret named HF_TOKEN
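
# Run the pipeline in a notebook cell (Kaggle Secrets are read via kaggle_secrets, not os.environ)
import os
from kaggle_secrets import UserSecretsClient

HF_TOKEN = UserSecretsClient().get_secret("HF_TOKEN")
os.environ["HF_TOKEN"] = HF_TOKEN  # also export it for code that reads the environment
main()  # defined in train_and_upgrade.py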

The pipeline runs 8 steps automatically: Load base model → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Save → Push

Estimated runtime: 30–60 minutes on T4 × 2

Option B: Self-Improving Upgrade Pipeline (local GPU)

git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
pip install -r requirements.txt

export HF_TOKEN="your-hf-token-with-write-access"
python train_and_upgrade.py

Option C: Manual Step-by-Step

# 1. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# 2. Preprocess datasets
python scripts/preprocess_data.py

# 3. Pre-train the human pattern classifier
python scripts/pretrain_human_pattern_classifier.py

# 4. Train the correction model
PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss

# 5. Merge LoRA adapter into base model
python -c "
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
model = AutoModelForSeq2SeqLM.from_pretrained('grammarly/coedit-large', torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
model = model.merge_and_unload()
model.save_pretrained('checkpoints/best_model_merged')
AutoTokenizer.from_pretrained('grammarly/coedit-large').save_pretrained('checkpoints/best_model_merged')
"

# 6. Run inference
PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."

# 7. Or start the API server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000

Training Pipeline

v4 Upgrade Pipeline (train_and_upgrade.py) — 8 Steps

| Step | Action |
|---|---|
| 1 | Load grammarly/coedit-large base model (fp32) + existing Hub LoRA adapter |
| 2 | Merge adapter into base weights (merge_and_unload) as a warm start |
| 3 | Apply fresh LoRA r=16 on merged base |
| 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (25% error rate); cap at 25k pairs |
| 5 | Train with combined loss for 2 epochs (L_CE + L_style + L_human) |
| 6 | Evaluate: BERTScore F1 + (1−WER) + HumanScore [+ GLEU, fix pending] |
| 7 | Save adapter + merge full model |
| 8 | Push adapter + merged model (merged/ subfolder) to Hub; update baseline |

v3 Upgrade Pipeline — 10 Steps

| Step | Action |
|---|---|
| 1 | Load existing LoRA adapter (r=16, v2) from Hub |
| 2 | Merge into base weights (warm start) |
| 3 | Apply fresh LoRA r=16 on merged base |
| 4 | Load JFLEG + W&I+LOCNESS + C4-GEC; augment (20% error rate) |
| 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
| 6 | Evaluate: GLEU + BERTScore F1 + (1−WER) [+ ERRANT F0.5 if installed] |
| 7 | Apply semantic faithfulness gate: revert outputs with cosine sim < 0.75 |
| 8 | Compare composite score against baseline_score.json |
| 9 | If improved: merge adapter → save full model |
| 10 | Push adapter + merged model to Hub; update baseline |
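Steps 8 to 10 amount to a small gate around baseline_score.json. The sketch below shows one plausible shape for it; the JSON key `"composite"`, the function names, and the `gate_enabled` switch (modelling the suspension used for the v4 base-model transition) are all assumptions, since the file's actual layout is not shown in this README.

```python
import json
from pathlib import Path

def should_push(composite: float, baseline_path: str,
                gate_enabled: bool = True) -> bool:
    """Push only if the new composite strictly beats the saved baseline."""
    if not gate_enabled:
        return True  # gate suspended, e.g. during a base-model transition
    path = Path(baseline_path)
    if not path.exists():
        return True  # no baseline yet, so the first run establishes one
    baseline = json.loads(path.read_text())["composite"]  # hypothetical key
    return composite > baseline

def update_baseline(composite: float, baseline_path: str) -> None:
    """Record the new composite as the baseline for the next run."""
    Path(baseline_path).write_text(json.dumps({"composite": composite}))
```

A typical call sequence would be `if should_push(score, "baseline_score.json"): ...push...; update_baseline(score, "baseline_score.json")`.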

v1 Original Pipeline (train.sh) — 5 Stages

| Stage | Action |
|---|---|
| 1 | Setup & Dependencies |
| 2 | Data Preprocessing |
| 3 | Human Pattern Classifier Pre-Training |
| 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
| 5 | Evaluation (GLEU only) |

Hyperparameter Reference

v4 (train_and_upgrade.py — Kaggle T4 × 2)

BASE_MODEL_ID   = "grammarly/coedit-large"

LORA_R          = 16
LORA_ALPHA      = 32
LORA_DROPOUT    = 0.05
TARGET_MODULES  = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]

EPOCHS          = 2             # coedit-large is already GEC-pretrained
BATCH_SIZE      = 8             # per device (GPU)
GRAD_ACCUM      = 8             # effective batch = 64
LR              = 3e-5          # reduced from 1e-4; 1e-4 caused flat loss on T5-based models
WARMUP_RATIO    = 0.05
LABEL_SMOOTHING = 0.1
MAX_INPUT_LEN   = 128           # back to 128 — GEC sentences are typically short
MAX_TARGET_LEN  = 128
MAX_TRAIN_PAIRS = 25_000

LAMBDA_STYLE    = 0.3
# LAMBDA_SEMANTIC removed — caused gradient interference
LAMBDA_HUMAN    = 0.4

FAITHFULNESS_THRESHOLD  = 0.75
AI_DETECTION_THRESHOLD  = 0.4

# fp16 and bf16 both False — fp32 throughout to prevent gradient underflow

v3 (train_and_upgrade.py)

LORA_R          = 16
LORA_ALPHA      = 32
LORA_DROPOUT    = 0.05
EPOCHS          = 10
BATCH_SIZE      = 2            # CPU
GRAD_ACCUM      = 32
LR              = 2e-4
MAX_INPUT_LEN   = 256
LAMBDA_STYLE    = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN    = 0.4
FAITHFULNESS_THRESHOLD = 0.75

v2 (train_and_upgrade.py)

LORA_R          = 16
LORA_ALPHA      = 32
EPOCHS          = 10
BATCH_SIZE      = 2
GRAD_ACCUM      = 32
LR              = 2e-4
MAX_INPUT_LEN   = 128
LAMBDA_STYLE    = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN    = 0.4

v1 (configs/training_config.yaml)

lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8
  learning_rate: 3.0e-4
loss:
  lambda_style: 0.3
  lambda_semantic: 0.5
  lambda_human_pattern: 0.4

configs/inference_config.yaml

model:
  key: "coedit-large"
  checkpoint_path: "checkpoints/best_model_merged"
  use_lora: false

generation:
  num_beams: 5
  length_penalty: 1.2
  repetition_penalty: 1.3
  no_repeat_ngram_size: 3
  max_new_tokens: 128

vocabulary:
  semantic_threshold: 0.82

faithfulness:
  threshold: 0.75

Inference Pipeline (8 Steps)

Raw Text
  │
  ▼
1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
  │
  ▼
2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
  │
  ▼
3. Sentence-Chunked Generation ─ Split into 128-token chunks → CoEdit-Large → rejoin
  │
  ▼
4. Faithfulness Gate ──── cosine_sim(source, output) < 0.75 → revert to source
  │
  ▼
5. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
  │
  ▼
6. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
  │
  ▼
7. Register Filtering ── Expand contractions, replace colloquialisms
  │
  ▼
8. Metrics ──────────── Style similarity, AWL coverage, readability scores
  │
  ▼
Corrected Text

API Usage

# Start the server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000

# Correct text
curl -X POST http://localhost:8000/correct \
  -H "Content-Type: application/json" \
  -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'

# Health check
curl http://localhost:8000/health

Interactive docs at http://localhost:8000/docs.
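The same request can be made from Python with only the standard library. This is a minimal sketch: the request body mirrors the curl example above, but the response schema is not documented here, so `correct` simply returns the decoded JSON. The function names and the 60-second timeout are assumptions.

```python
import json
import urllib.request

def build_correct_request(text: str, style_alpha: float = 0.6,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /correct request without sending it."""
    payload = json.dumps({"text": text, "style_alpha": style_alpha}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/correct",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def correct(text: str, **kwargs):
    """Send the request and return the parsed JSON response."""
    request = build_correct_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage, with the server running locally:
# print(correct("The studnet recieved alot of informtion."))
```

A generous timeout is useful because the first call can be slow while LanguageTool starts up.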


Hardware Requirements

| Tier | GPU | Config | Epochs | Training Time |
|---|---|---|---|---|
| Tested (v1) | RTX 3050 4GB | Flan-T5-Small, r=8 | 5 | ~45 min |
| Tested (v2 CPU) | None (HF Space CPU) | Flan-T5-Small, r=16 | 10 | ~12–24 hours |
| Tested (v3 CPU) | None (HF Space CPU) | Flan-T5-Small, r=16 | 10 | ~12–24 hours |
| Tested (v4) | Kaggle T4 × 2 (16GB each) | CoEdit-Large, r=16 | 2 | ~30–60 min |
| Recommended | RTX 3090 24GB | CoEdit-Large, r=16 + full loss | 3–5 | ~2–3h |
| Maximum | A100 80GB | Full pipeline with ERRANT | 10 | ~12h |

Data Sources

| Dataset | Type | Size | Access |
|---|---|---|---|
| JFLEG (jhu-clsp/jfleg) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
| W&I+LOCNESS (bea2019st/wi_locness) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
| C4-200M-GEC (cointegrated/c4_200m-gec-filtered) | Synthetic GEC pairs | ~100k pairs (capped) | HF Hub (falls back silently if unavailable) |
| FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
| Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
| Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
| Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |

Note: train_and_upgrade.py (v4) uses JFLEG + W&I+LOCNESS capped at 25k pairs. C4-GEC and FCE require additional setup.


Dyslexia Error Simulation

The DyslexiaSimulator generates synthetic training data based on research by Rello et al. (2013, 2017). v4 uses a 25% per-word error rate (up from 20% in v2/v3).

| Error Type | Frequency | Example |
|---|---|---|
| Phonetic substitution | 35% | "because" → "becaus" |
| Letter transposition | 18% | "the" → "teh" |
| Letter omission | 16% | "important" → "importnt" |
| Letter doubling | 12% | "letter" → "lettter" |
| Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
| Word boundary errors | 9% | "a lot" → "alot" |
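To make the table concrete, here is a small sketch of weighted error injection. It is not the DyslexiaSimulator itself: only the four letter-level error types are implemented (phonetic substitution and word-boundary errors need lookup tables that are not shown in this README), and all function names are hypothetical.

```python
import random

# Frequencies from the table above (percent)
ERROR_WEIGHTS = {"phonetic_substitution": 35, "transposition": 18, "omission": 16,
                 "doubling": 12, "reversal": 10, "word_boundary": 9}
REVERSALS = {"b": "d", "d": "b", "p": "q", "q": "p"}

def transpose(word: str, i: int) -> str:
    """Swap the letters at positions i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def omit(word: str, i: int) -> str:
    return word[:i] + word[i + 1:]

def double(word: str, i: int) -> str:
    return word[:i + 1] + word[i] + word[i + 1:]

def reverse_letter(word: str, i: int) -> str:
    """Mirror a b/d or p/q letter at position i."""
    return word[:i] + REVERSALS.get(word[i], word[i]) + word[i + 1:]

# Letter-level subset; random.choices accepts unnormalised weights.
LETTER_ERRORS = {transpose: ERROR_WEIGHTS["transposition"],
                 omit: ERROR_WEIGHTS["omission"],
                 double: ERROR_WEIGHTS["doubling"],
                 reverse_letter: ERROR_WEIGHTS["reversal"]}

def corrupt_word(word: str, rng: random.Random) -> str:
    options = dict(LETTER_ERRORS)
    if not any(c in REVERSALS for c in word):
        del options[reverse_letter]  # nothing to mirror in this word
    op = rng.choices(list(options), weights=list(options.values()))[0]
    if op is transpose:
        i = rng.randrange(len(word) - 1)
    elif op is reverse_letter:
        i = rng.choice([k for k, c in enumerate(word) if c in REVERSALS])
    else:
        i = rng.randrange(len(word))
    return op(word, i)

def simulate(text: str, error_rate: float = 0.25, seed=None) -> str:
    """Corrupt each alphabetic word of 3+ letters with probability error_rate."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) >= 3 and word.isalpha() and rng.random() < error_rate:
            word = corrupt_word(word, rng)
        out.append(word)
    return " ".join(out)

print(simulate("the student received important information", seed=0))
```

Passing a fixed `seed` makes the corruption reproducible, which helps when regenerating the same synthetic pairs across training runs.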

Style Fingerprint Vector

The 512-dimensional style vector captures 41 raw features:

| Group | Features | Count |
|---|---|---|
| Sentence stats | mean, std, skew of sentence lengths | 3 |
| Word stats | mean, std of word lengths | 2 |
| Lexical | type-token ratio, lexical density | 2 |
| Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
| Discourse | 20 academic discourse markers (per 100 words) | 20 |
| Register | hedging frequency, formality score, nominalization ratio | 3 |
| Readability | Flesch reading ease, avg syllables per word | 2 |
| Pronouns | first-person ratio, third-person ratio | 2 |
| Other | question ratio, exclamation ratio, AWL coverage | 3 |

Projected through a 2-layer MLP (41 → 256 → 512) with LayerNorm and GELU activation, then L2-normalised.
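The projection can be traced end to end with a dependency-free sketch. This is an illustration only: the real module is presumably a PyTorch layer with trained weights, whereas the weights here are random, and applying LayerNorm before GELU on the hidden layer is an assumption because the exact ordering is not stated. All names are hypothetical.

```python
import math
import random

IN_DIM, HIDDEN_DIM, OUT_DIM = 41, 256, 512

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(v: list[float], eps: float = 1e-5) -> list[float]:
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def linear(v, weights, bias):
    """weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_linear(n_in: int, n_out: int, rng: random.Random):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def l2_normalise(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def style_vector(features: list[float], params) -> list[float]:
    """41 raw features -> 256 hidden units -> 512-dim unit-length vector."""
    (w1, b1), (w2, b2) = params
    hidden = [gelu(x) for x in layer_norm(linear(features, w1, b1))]
    return l2_normalise(linear(hidden, w2, b2))

rng = random.Random(0)
params = (init_linear(IN_DIM, HIDDEN_DIM, rng), init_linear(HIDDEN_DIM, OUT_DIM, rng))
vec = style_vector([rng.random() for _ in range(IN_DIM)], params)
print(len(vec), round(math.sqrt(sum(x * x for x in vec)), 6))  # 512 1.0
```

Because the output is L2-normalised, the cosine similarity between two style vectors reduces to a plain dot product.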


Known Limitations

  1. GLEU scoring bug (v4): The google_bleu evaluate metric is incompatible with CoEdit-Large's tokenisation space and scores 0.0. This will be resolved in v5 with a tokeniser-aware GLEU implementation. BERTScore F1 (0.9349) is a more reliable quality indicator for this run.
  2. 1-WER regression (v4): The 1-WER drop (0.8552 → 0.3191) partly reflects the model transition and partly reflects the fact that coedit-large produces differently-phrased corrections compared to Flan-T5 — word-level edit distance to the reference set increases when the output is fluent but uses different vocabulary. This will be re-evaluated against CoEdit-specific reference corrections in v5.
  3. Training window: 128-token max input — very long paragraphs may be split mid-clause.
  4. Vocabulary elevation: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
  5. Already-correct text: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
  6. LanguageTool latency: Spell correction takes ~15–20s due to JVM startup on first call.
  7. Faithfulness gate conservatism: The 0.75 cosine similarity threshold occasionally reverts valid-but-heavily-corrected outputs. Monitor num_fallback in evaluation to tune the threshold.
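Limitation 2 is easier to see with a concrete case. The sketch below computes word error rate as word-level edit distance divided by reference length, which is the standard definition; the example sentences are invented for illustration and are not taken from the evaluation set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                              # deletion
                            curr[j - 1] + 1,                          # insertion
                            prev[j - 1] + (ref_word != hyp_word)))    # substitution
        prev = curr
    if not ref:
        return float(bool(hyp))
    return prev[-1] / len(ref)

reference = "The student received a lot of information."
paraphrase = "The student obtained a great deal of information."
print(round(word_error_rate(reference, paraphrase), 4))  # 0.4286
```

The paraphrase is fluent and keeps the meaning, yet it costs three edits against a seven-word reference, so 1 − WER falls to about 0.57 while an embedding-based metric such as BERTScore would stay high.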