Apertus Tokenizer Extension

Current canonical artifact: C3 + 17,408 curated/backfilled added units.

The active ship tokenizer is:

tokenizers/c3_added_17408_curated_padded/

It starts from swiss-ai/Apertus-8B-2509, preserves the Apertus base ids 0..131071 verbatim, and adds 17,408 C3 Greek-extension BPE units. The final vocab size stays aligned to multiples of 128 and 256: 131,072 + 17,408 = 148,480 = 128 * 1160 = 256 * 580.
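The size arithmetic above can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the repo's scripts: the function name `check_vocab_layout` and the constants are hypothetical, and it checks only the numbers, not the tokenizer files themselves.

```python
# Minimal sketch: check the vocab-size arithmetic stated in this README.
# All names here are hypothetical helpers, not part of the repository.

BASE_VOCAB = 131_072     # Apertus base ids 0..131071
ADDED_UNITS = 17_408     # C3 Greek-extension BPE units
EXPECTED_TOTAL = 148_480


def check_vocab_layout(base, added, total, multiples=(128, 256)):
    """Return {multiple: quotient} if the layout is consistent, else raise."""
    if base + added != total:
        raise ValueError(f"{base} + {added} != {total}")
    quotients = {}
    for m in multiples:
        if total % m != 0:
            raise ValueError(f"{total} is not a multiple of {m}")
        quotients[m] = total // m
    return quotients


if __name__ == "__main__":
    print(check_vocab_layout(BASE_VOCAB, ADDED_UNITS, EXPECTED_TOTAL))
```

After loading the tokenizer with a library of your choice, the same total could be compared against the loaded vocabulary size. That step is left out here because it depends on the local environment.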

Canonical Tokenizer

| Path | Vocab | Added units | Curation | SHA-256 of `tokenizer.json` |
| --- | --- | --- | --- | --- |
| `tokenizers/c3_added_17408_curated_padded/` | 148,480 | 17,408 | 69 noisy raw-cutoff tokens structurally skipped and backfilled | `358ae3f29ac17c99769d6d437339e28657d5fcaed3486f8550feed3d6adfc394` |

The cutoff can be revisited after downstream embedding adaptation and CPT, but this is the current handoff artifact for the Apertus Greek tokenizer extension line.

Cutoff Evidence

The compact evidence bundle is under:

experiments/02_1_7_intrinsic_eval_sweep_20260518/

Important files:

CHOSEN_CUTOFF.md — pinned decision and downstream contract.
REPORT.md — intrinsic/fertility/MorphScore evidence base.
artifacts/plots/ — report plots for the cutoff decision.
manifests/curated_padded_at_17408_manifest.json — exact backfilled construction.
manifests/removal_mask_at_17408.jsonl — the 69 in-cutoff tokens filtered out.
firing_counts_c3_added_17408_curated_padded/ — compact token firing-count attribution artifacts for the exact C3 BPE training corpus, split by GlossAPI-nanochat, HPLT, combined C3, and exact source_dataset.
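As a quick consistency check on the removal mask, the records in the JSONL file can be counted and compared with the 69 filtered tokens mentioned above. This sketch assumes one JSON record per non-empty line, which is the usual JSONL convention but is not confirmed here. It makes no assumption about field names, and `count_jsonl_records` is a hypothetical helper.

```python
import json


def count_jsonl_records(path):
    """Count non-empty lines that parse as JSON; raises on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                json.loads(line)  # validate only; the schema is not assumed
                count += 1
    return count


if __name__ == "__main__":
    # Path relative to the evidence bundle directory (assumed layout).
    n = count_jsonl_records("manifests/removal_mask_at_17408.jsonl")
    print(f"{n} records (README states 69 filtered tokens)")
```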

The full sweep zoo is intentionally not mirrored here. This repo keeps the canonical tokenizer and just enough evidence to explain why it is current. Raw local variants, TokEval vendor caches, parquet outputs, and large geometry arrays remain outside the Hub artifact set.

Historical Artifacts

Older C1/C2/C3 full-tokenizer candidates and language-attribution artifacts are still present in their historical paths, including:

continuous/
fresh/
analysis/
artifacts/language_attribution_20260515/

Those are retained for provenance, but the current tokenizer handoff is the 17,408 curated/backfilled artifact above.

Source Code

Process docs and scripts live in GitHub:

https://github.com/fffoivos/glossapi-tokenizer-extension

Relevant commit for this upload:

9a6b039 Add firing count attribution workflow
