WtpsplitKit

On-device text segmentation for iOS / macOS using Core ML SaT models ("Segment any Text"). It splits text into TTS-friendly chunks that respect length bounds and break at natural pauses — a Swift port of scripts/run_segmentation.py from this repo.

The whole pipeline runs locally: XLM-RoBERTa tokenization, Core ML inference, and the length-constrained Viterbi segmentation are all in pure Swift (plus Core ML).

What it does

tokenize (XLM-R Unigram)
  → per-token boundary logits (windowed Core ML inference)
  → scatter onto characters → sigmoid
  → pause-aware mask (bias breaks to commas / clauses / connectors, never mid-word)
  → length-constrained DP (Viterbi or greedy) with a length prior

The output chunks always rejoin to the exact input (chunks.joined() == text).

Fidelity

This port is verified against the Python/Hugging Face references:

Tokenizer — ids and character offsets match Hugging Face tokenizers (xlm-roberta-base) exactly across Latin, CJK, full-width, emoji, Devanagari and mixed/whitespace edge cases.
Segmentation — mask + prior + DP produce byte-identical chunks to wtpsplit/utils/{constraints,priors}.py and pause_aware_mask when fed the same probabilities, across every prior / algorithm / overflow / allow-midword combination.

The only intentional difference from the ONNX path is logit precision: the iOS models are quantized (fp16 / int8 / palettized), so boundary logits differ slightly from the fp32 ONNX models. Tokenization and segmentation logic are exact.

Installation

Swift Package Manager:

.package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0")

.target(name: "App", dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")])

Requires iOS 16+ / macOS 13+.

Models

The Core ML models are not bundled (they are 40–430 MB). Build them with scripts/build_ios_coreml.py (produces ios_models/<variant>/<variant>-<quant>.mlpackage) and ship the one you want with your app, or download it on first launch.

Vocabulary	Variants	Notes
`-full-`	full XLM-R (250 k)	every language SaT supports
`-en_zh-`	pruned (≈101 k)	English + Chinese only, ~2× smaller

en_zh models need token-id remapping; the kit handles it automatically (the remap table is bundled). Pass .auto and it infers the vocabulary from the path (en_zh ⇒ pruned), or set it explicitly.

The 4 MB tokenizer vocabulary and 1 MB remap table are bundled as package resources — no network or extra files needed for tokenization.

Usage

import WtpsplitKit

let modelURL = Bundle.main.url(forResource: "sat-3l-sm-full-fp16",
                               withExtension: "mlpackage")!
let segmenter = try SaTSegmenter(modelURL: modelURL)   // load/compile once, reuse

var options = SegmentationOptions()
options.maxLength = 80      // hard cap (unless overflow > 0)
options.minLength = 40
options.prior = .gaussian   // .uniform | .gaussian | .clippedPolynomial
options.targetLength = 70
options.spread = 12

let chunks = try segmenter.segment(
    "Breaking News: Scientists announced a discovery. 这是一个测试。It works well!",
    options: options)
// → ["Breaking News: Scientists announced a discovery. 这是一个测试。",
//    "It works well!"]

SaTSegmenter is reusable; create it once (loading/compiling the model is the expensive step) and call segment as needed. Run it off the main thread for long inputs.

Options (mirror `run_segmentation.py`)

Option	Default	Meaning
`maxLength`	80	target max characters per chunk
`minLength`	40	min characters per chunk (best effort)
`overflow`	0	chars a chunk may exceed `maxLength` to reach a pause (0 = hard cap)
`prior`	`.gaussian`	length-prior shape
`targetLength`	70	gaussian / polynomial target
`spread`	12	gaussian / polynomial spread
`algorithm`	`.viterbi`	`.viterbi` (optimal) or `.greedy`
`allowMidword`	`false`	permit breaks inside words (skips the pause-aware mask)
`windowOverlap`	16	token overlap between inference windows for long text

Compute units

let segmenter = try SaTSegmenter(modelURL: modelURL, computeUnits: .cpuAndNeuralEngine)

Tokenizer-only

let tok = try XLMRobertaTokenizer()
let (ids, charEnds) = tok.encode("Hello world! 你好。")

CLI (for validation / experimentation)

swift run wtpseg --model path/to/sat-3l-sm-full-fp16.mlpackage \
    --text "Your text. 你的文字。" --max-length 80 --min-length 40

# flags: --max-length --min-length --overflow --prior {uniform,gaussian,clipped_polynomial}
#        --target --spread --algorithm {viterbi,greedy} --allow-midword
#        --vocab {auto,full,en_zh} --file <path>
#        --tokens / --dump-probs / --dump-chunks  (debug dumps)

Notes

Long inputs are processed in seqLen-sized windows (the models have a fixed sequence length, 256 by default) with overlapping logits averaged.
.mlpackage is compiled on first load. To skip recompilation, pass a precompiled .mlmodelc URL instead.
All text indexing is by Unicode scalar (code point) to match Python str semantics, so split positions and lengths line up with the reference.

Downloads last month: 37

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support