FormalASR: End-to-End Spoken Chinese to Formal Text
Abstract
FormalASR enables direct transcription of spoken Chinese into formal written text through compact end-to-end models trained on newly created datasets with LLM-based rewriting and quality filtering.
Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline in which an LLM post-edits the ASR output, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B parameters) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then apply supervised fine-tuning to Qwen3-ASR at both scales. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative character error rate (CER) reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
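For readers unfamiliar with the headline metric, the following is a minimal sketch of how CER and a relative CER reduction are conventionally computed. It is not the authors' evaluation code: the abstract does not describe text normalization or scoring details, the function names are hypothetical, and the example strings are invented for illustration. It assumes hypotheses are scored against a formal written reference.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, model_cer: float) -> float:
    """Fraction by which model_cer improves on baseline_cer."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - model_cer) / baseline_cer


# Invented example: a formal reference, a verbatim transcript that keeps
# filler words, and a formal-style transcript with one extra character.
formal_ref = "我们明天开会"
verbatim_hyp = "嗯我们那个明天开会"
formal_hyp = "我们明天开会吧"

baseline = cer(formal_ref, verbatim_hyp)
model = cer(formal_ref, formal_hyp)
print(f"baseline CER={baseline:.3f}, model CER={model:.3f}, "
      f"relative reduction={relative_reduction(baseline, model):.1%}")
```

Under this convention, a relative reduction of 37.4% means the model's CER is 62.6% of the baseline's CER, not that the CER dropped by 37.4 absolute points.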
Models citing this paper 2
TaurenMountain/FormalASR-0.6B
Datasets citing this paper 2
TaurenMountain/WenetSpeech-Formal
TaurenMountain/Speechio-Formal