Apertus Tokenizer Extension

Current canonical artifact: C3 + 17,408 curated/backfilled added units.

The active ship tokenizer is:

tokenizers/c3_added_17408_curated_padded/

It starts from swiss-ai/Apertus-8B-2509, preserves the Apertus base ids 0..131071 verbatim, adds 17,408 C3 Greek-extension BPE units, and keeps the final vocab aligned at 148,480 = 128 * 1160 = 256 * 580.

Canonical Tokenizer

  • Path: tokenizers/c3_added_17408_curated_padded/
  • Vocab: 148,480
  • Added units: 17,408
  • Curation: 69 noisy raw-cutoff tokens structurally skipped and backfilled
  • SHA-256 of tokenizer.json: 358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394
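
A local copy can be checked against the SHA-256 recorded above. The sketch below uses only the Python standard library. It assumes the hashed file sits at tokenizers/c3_added_17408_curated_padded/tokenizer.json in your checkout, and the helper names are hypothetical rather than part of the repository.

```python
import hashlib

# Digest copied from this README; the file location is an assumption.
EXPECTED_SHA256 = "358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394"
ASSUMED_PATH = "tokenizers/c3_added_17408_curated_padded/tokenizer.json"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks and return its hex SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file digest with the expected hex string, ignoring case."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    print(matches_expected(ASSUMED_PATH))
```

A mismatch most likely means the file is a different tokenizer variant or was modified after download.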

The cutoff can be revisited after downstream embedding adaptation and CPT, but this is the current handoff artifact for the Apertus Greek tokenizer extension line.

Cutoff Evidence

The compact evidence bundle is under:

experiments/02_1_7_intrinsic_eval_sweep_20260518/

Important files:

  • CHOSEN_CUTOFF.md: pinned decision and downstream contract.
  • REPORT.md: intrinsic/fertility/MorphScore evidence base.
  • artifacts/plots/: report plots for the cutoff decision.
  • manifests/curated_padded_at_17408_manifest.json: exact backfilled construction.
  • manifests/removal_mask_at_17408.jsonl: the 69 in-cutoff tokens filtered out.
  • firing_counts_c3_added_17408_curated_padded/: compact token firing-count attribution artifacts for the exact C3 BPE training corpus, split by GlossAPI-nanochat, HPLT, combined C3, and exact source_dataset.
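
As a quick sanity check on the removal mask, its records can be counted and compared with the 69 filtered tokens mentioned above. This is a sketch under a loud assumption: that the JSONL file holds exactly one JSON record per removed token. The record schema is not documented here, so the code only counts records and does not read any field; the function name is hypothetical.

```python
import json


def count_jsonl_records(path: str) -> int:
    """Count non-empty lines in a JSONL file, raising on malformed JSON."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            json.loads(line)  # validates the record; schema is not inspected
            count += 1
    return count


if __name__ == "__main__":
    # Path is relative to the evidence bundle directory named above.
    n = count_jsonl_records("manifests/removal_mask_at_17408.jsonl")
    print(n, "records; the README reports 69 filtered tokens")
```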

The full sweep zoo is intentionally not mirrored here. This repo keeps the canonical tokenizer and just enough evidence to explain why it is current. Raw local variants, TokEval vendor caches, parquet outputs, and large geometry arrays remain outside the Hub artifact set.

Historical Artifacts

Older C1/C2/C3 full-tokenizer candidates and language-attribution artifacts are still present in their historical paths, including:

  • continuous/
  • fresh/
  • analysis/
  • artifacts/language_attribution_20260515/

Those are retained for provenance, but the current tokenizer handoff is the 17,408 curated/backfilled artifact above.

Source Code

Process docs and scripts live in GitHub:

https://github.com/fffoivos/glossapi-tokenizer-extension

Relevant commit for this upload:

9a6b039 Add firing count attribution workflow