YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
WtpsplitKit
On-device text segmentation for iOS / macOS using Core ML SaT models
("Segment any Text"). It splits
text into TTS-friendly chunks that respect length bounds and break at natural
pauses — a Swift port of scripts/run_segmentation.py from this repo.
The whole pipeline runs locally: XLM-RoBERTa tokenization, Core ML inference, and the length-constrained Viterbi segmentation are all in pure Swift (plus Core ML).
What it does
tokenize (XLM-R Unigram)
→ per-token boundary logits (windowed Core ML inference)
→ scatter onto characters → sigmoid
→ pause-aware mask (bias breaks to commas / clauses / connectors, never mid-word)
→ length-constrained DP (Viterbi or greedy) with a length prior
The output chunks always rejoin to the exact input (chunks.joined() == text).
Fidelity
This port is verified against the Python/Hugging Face references:
- Tokenizer — ids and character offsets match Hugging Face
tokenizers(xlm-roberta-base) exactly across Latin, CJK, full-width, emoji, Devanagari and mixed/whitespace edge cases. - Segmentation — mask + prior + DP produce byte-identical chunks to
wtpsplit/utils/{constraints,priors}.pyandpause_aware_maskwhen fed the same probabilities, across every prior / algorithm / overflow / allow-midword combination.
The only intentional difference from the ONNX path is logit precision: the iOS models are quantized (fp16 / int8 / palettized), so boundary logits differ slightly from the fp32 ONNX models. Tokenization and segmentation logic are exact.
Installation
Swift Package Manager:
.package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0")
.target(name: "App", dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")])
Requires iOS 16+ / macOS 13+.
Models
The Core ML models are not bundled (they are 40–430 MB). Build them with
scripts/build_ios_coreml.py (produces ios_models/<variant>/<variant>-<quant>.mlpackage)
and ship the one you want with your app, or download it on first launch.
| Vocabulary | Variants | Notes |
|---|---|---|
*-full-* |
full XLM-R (250 k) | every language SaT supports |
*-en_zh-* |
pruned (≈101 k) | English + Chinese only, ~2× smaller |
en_zh models need token-id remapping; the kit handles it automatically (the
remap table is bundled). Pass .auto and it infers the vocabulary from the path
(en_zh ⇒ pruned), or set it explicitly.
The 4 MB tokenizer vocabulary and 1 MB remap table are bundled as package resources — no network or extra files needed for tokenization.
Usage
import WtpsplitKit
let modelURL = Bundle.main.url(forResource: "sat-3l-sm-full-fp16",
withExtension: "mlpackage")!
let segmenter = try SaTSegmenter(modelURL: modelURL) // load/compile once, reuse
var options = SegmentationOptions()
options.maxLength = 80 // hard cap (unless overflow > 0)
options.minLength = 40
options.prior = .gaussian // .uniform | .gaussian | .clippedPolynomial
options.targetLength = 70
options.spread = 12
let chunks = try segmenter.segment(
"Breaking News: Scientists announced a discovery. 这是一个测试。It works well!",
options: options)
// → ["Breaking News: Scientists announced a discovery. 这是一个测试。",
// "It works well!"]
SaTSegmenter is reusable; create it once (loading/compiling the model is the
expensive step) and call segment as needed. Run it off the main thread for long
inputs.
Options (mirror run_segmentation.py)
| Option | Default | Meaning |
|---|---|---|
maxLength |
80 | target max characters per chunk |
minLength |
40 | min characters per chunk (best effort) |
overflow |
0 | chars a chunk may exceed maxLength to reach a pause (0 = hard cap) |
prior |
.gaussian |
length-prior shape |
targetLength |
70 | gaussian / polynomial target |
spread |
12 | gaussian / polynomial spread |
algorithm |
.viterbi |
.viterbi (optimal) or .greedy |
allowMidword |
false |
permit breaks inside words (skips the pause-aware mask) |
windowOverlap |
16 | token overlap between inference windows for long text |
Compute units
let segmenter = try SaTSegmenter(modelURL: modelURL, computeUnits: .cpuAndNeuralEngine)
Tokenizer-only
let tok = try XLMRobertaTokenizer()
let (ids, charEnds) = tok.encode("Hello world! ä½ å¥½ã€‚")
CLI (for validation / experimentation)
swift run wtpseg --model path/to/sat-3l-sm-full-fp16.mlpackage \
--text "Your text. ä½ çš„æ–‡å—。" --max-length 80 --min-length 40
# flags: --max-length --min-length --overflow --prior {uniform,gaussian,clipped_polynomial}
# --target --spread --algorithm {viterbi,greedy} --allow-midword
# --vocab {auto,full,en_zh} --file <path>
# --tokens / --dump-probs / --dump-chunks (debug dumps)
Notes
- Long inputs are processed in
seqLen-sized windows (the models have a fixed sequence length, 256 by default) with overlapping logits averaged. .mlpackageis compiled on first load. To skip recompilation, pass a precompiled.mlmodelcURL instead.- All text indexing is by Unicode scalar (code point) to match Python
strsemantics, so split positions and lengths line up with the reference.
- Downloads last month
- 37