Apertus Tokenizer Extension
Current canonical artifact: C3 + 17,408 curated/backfilled added units.
The active ship tokenizer is:
tokenizers/c3_added_17408_curated_padded/
It starts from swiss-ai/Apertus-8B-2509, preserves the Apertus base ids
0..131071 verbatim, adds 17,408 C3 Greek-extension BPE units, and keeps
the final vocab aligned at 148,480 = 128 * 1160 = 256 * 580.
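The size arithmetic above can be checked with a minimal sketch. The constants are taken from the paragraph above; `base_ids_preserved` is a hypothetical helper, not part of this repository, and it assumes both vocabularies are available as plain token-to-id dictionaries (for example as returned by a tokenizer's `get_vocab()` method).

```python
# Sanity checks for the vocabulary layout described above.
# The helper name below is illustrative and not taken from the repository.

BASE_VOCAB_SIZE = 131_072   # Apertus base ids 0..131071
ADDED_UNITS = 17_408        # C3 Greek-extension BPE units
FINAL_VOCAB_SIZE = BASE_VOCAB_SIZE + ADDED_UNITS


def base_ids_preserved(base_vocab: dict, extended_vocab: dict) -> bool:
    """Return True if every base token keeps the same id in the extended vocab."""
    return all(extended_vocab.get(tok) == idx for tok, idx in base_vocab.items())


# The final size should match the documented value and stay aligned
# to multiples of 128 and 256.
assert FINAL_VOCAB_SIZE == 148_480
assert FINAL_VOCAB_SIZE == 128 * 1160 == 256 * 580
```

The alignment to 128 and 256 presumably matters for the embedding matrix shape downstream; the check only confirms the numbers are internally consistent.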
Canonical Tokenizer
| Path | Vocab | Added units | Curation | SHA-256 tokenizer.json |
|---|---|---|---|---|
| tokenizers/c3_added_17408_curated_padded/ | 148,480 | 17,408 | 69 noisy raw-cutoff tokens structurally skipped and backfilled | 358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394 |
The cutoff can be revisited after downstream embedding adaptation and CPT, but this is the current handoff artifact for the Apertus Greek tokenizer extension line.
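One way to confirm that a downloaded copy matches the canonical artifact is to hash `tokenizer.json` and compare it with the SHA-256 listed in the table. The following is a sketch using only the Python standard library; the function names are illustrative, and the local path in the usage note is an assumption about where the file was downloaded.

```python
import hashlib

# SHA-256 of tokenizer.json as listed in the Canonical Tokenizer table.
EXPECTED_SHA256 = "358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks and return its hex SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

Assuming the artifact was downloaded to the same relative path, `matches_expected("tokenizers/c3_added_17408_curated_padded/tokenizer.json")` should return `True` for an unmodified copy.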
Cutoff Evidence
The compact evidence bundle is under:
experiments/02_1_7_intrinsic_eval_sweep_20260518/
Important files:
- CHOSEN_CUTOFF.md: pinned decision and downstream contract.
- REPORT.md: intrinsic/fertility/MorphScore evidence base.
- artifacts/plots/: report plots for the cutoff decision.
- manifests/curated_padded_at_17408_manifest.json: exact backfilled construction.
- manifests/removal_mask_at_17408.jsonl: the 69 in-cutoff tokens filtered out.
- firing_counts_c3_added_17408_curated_padded/: compact token firing-count attribution artifacts for the exact C3 BPE training corpus, split by GlossAPI-nanochat, HPLT, combined C3, and exact source_dataset.
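As a quick consistency check on the removal mask, the number of records in `removal_mask_at_17408.jsonl` can be compared with the 69 filtered tokens stated above. This sketch assumes the file holds one JSON record per removed token, one per line; the record schema is not documented here, so the code only counts records and does not read any fields. The function names are illustrative.

```python
import json

EXPECTED_REMOVED = 69  # in-cutoff tokens filtered out, per the list above


def load_jsonl(path):
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def removal_mask_is_consistent(path, expected=EXPECTED_REMOVED):
    """Return True if the mask has exactly the expected number of records.

    Assumes one record per removed token, which is not confirmed by the source.
    """
    return len(load_jsonl(path)) == expected
```

If the file turns out to use a different layout (for example several tokens per record), the count comparison would need to be adapted accordingly.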
The full sweep zoo is intentionally not mirrored here. This repo keeps the canonical tokenizer and just enough evidence to explain why it is current. Raw local variants, TokEval vendor caches, parquet outputs, and large geometry arrays remain outside the Hub artifact set.
Historical Artifacts
Older C1/C2/C3 full-tokenizer candidates and language-attribution artifacts are still present in their historical paths, including:
continuous/fresh/analysis/artifacts/language_attribution_20260515/
Those are retained for provenance, but the current tokenizer handoff is the 17,408 curated/backfilled artifact above.
Source Code
Process docs and scripts live in GitHub:
https://github.com/fffoivos/glossapi-tokenizer-extension
Relevant commit for this upload:
9a6b039 Add firing count attribution workflow