The gaussian proxy lies: MoE expert co-activation has real structure
A negative result on MoE expert clustering, overturned by feeding the router real tokens instead of noise. Three orthogonal tests, one verdict, and the same methodological lesson that keeps recurring: the proxy you measure on is usually the thing that's wrong.
Companion compression study (VQ for LLM weights) and the same "measure honestly" philosophy: huggingface.co/KikoCis/vq-llm-compression · agentic leaderboard: rab.utopiaia.com
TL;DR
- In a Mixture-of-Experts model, the natural unit isn't the weight; it's the expert. Permuting/grouping experts is ~free (the router indexes them), so if some experts reliably fire together, you could co-locate them in fast memory and prefetch them as a group. The goal is cache locality, not size compression.
- A first experiment said no structure: clustering the router's co-activation graph gave a gain over chance of ≈1.000×, i.e. none. Dead end.
- That experiment fed the router gaussian noise instead of real hidden states. Re-run on a real MoE (OLMoE-1B-7B) with real text, the co-activation clustering-gain is 1.4–2.1×, and a marginal-preserving shuffle null collapses it back to ≈1.0, so the structure is real, not a clustering artifact.
- The structure is real, but whether it's worth exploiting depends entirely on the baseline. Against frequency caching (LFU) a co-activation policy wins; against recency (LRU), the baseline that actually matters for a dynamic cache, an explicit prefetcher just wastes bandwidth, because LRU already retains co-firing experts implicitly.
- The one real, deployable win is static placement. For sharded / expert-parallel inference (fixed expert→device, no dynamic cache, where recency is unavailable), a load-capped co-activation placement cuts cross-device communication 12–17% vs standard frequency-balanced placement, at matched load balance, and the gain grows with device count and with depth. The unconstrained version cuts more (−21%) but imbalances the devices; the load-capped version is the honest, free win.
The idea, and the first (wrong) answer
MoE layers route each token to a small top-k subset of N experts (OLMoE: top-8 of 64; MiniMax-M3: 4 of 128). Only the selected experts compute. If you can't keep all experts in fast memory, you want the ones that tend to be needed at the same time to live together; then one fetch warms a whole "team."
The first pass (on MiniMax-M3's router) built the co-activation matrix by simulating top-k selection over gaussian random input directions, clustered it, and measured intra-cluster co-activation vs chance. Result: locality-gain ≈ 1.000 at every layer, with strong marginal structure (some experts hot, some dead) but no pairwise teams. Conclusion at the time: co-activation is fully explained by independent selection with non-uniform popularity, so clustering buys nothing over ordinary LFU caching.
The flaw is in the input. Gaussian h has no semantic correlation. Real hidden states arriving at an MoE layer are highly structured: a token in a code span, a math span, a Spanish span each produce correlated activations that drive correlated routing. Feeding the gate noise destroys exactly the correlation the experiment was trying to detect.
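The co-activation matrix that both the gaussian proxy and the real-trace runs rely on can be sketched as a simple pair count over per-token top-k selections. This is a minimal illustration, not the repository's code: the function name `coactivation_counts` and the toy trace are hypothetical, and storing per-expert counts on the diagonal is an assumption made for convenience.

```python
from itertools import combinations


def coactivation_counts(selections, num_experts):
    """Count how often each pair of experts is selected for the same token.

    selections: list of per-token expert-id lists (the top-k set per token).
    Returns a symmetric num_experts x num_experts matrix of pair counts;
    the diagonal holds per-expert selection counts (an assumed convention).
    """
    counts = [[0] * num_experts for _ in range(num_experts)]
    for experts in selections:
        unique = sorted(set(experts))
        for e in unique:
            counts[e][e] += 1
        for a, b in combinations(unique, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


# Toy trace (hypothetical): 4 experts, top-2 routing, two fixed "teams".
toy_trace = [[0, 1], [0, 1], [2, 3], [0, 1], [2, 3]]
C = coactivation_counts(toy_trace, num_experts=4)
```

With independent selection, off-diagonal counts would track the product of the marginals; "teams" show up as pairs whose count sits well above that.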
Test 1: real traces + a shuffle null
We can't run M3 locally (128 GB). So we ran the largest MoE we can run end-to-end, OLMoE-1B-7B-0924 (64 experts, top-8, 16 layers), on a 6-domain real-text corpus with output_router_logits=True, captured the genuine per-token expert selection, and applied the identical clustering-gain metric. Critically, we added a marginal-preserving shuffle null: independently permute which tokens select each expert, so per-expert frequencies are preserved exactly but all pairwise co-firing is destroyed. If the gain survives the shuffle, it was a finite-sample/clustering artifact; if it collapses to ≈1.0, the pairwise structure is real.
| layer | coact-Gini | clustering-gain | shuffle-null | excess |
|---|---|---|---|---|
| 0 | 0.63 | 1.44 | 0.86 | +0.57 |
| 4 | 0.73 | 1.69 | 0.92 | +0.77 |
| 8 | 0.79 | 2.07 | 0.99 | +1.08 |
| 12 | 0.78 | 1.70 | 0.96 | +0.74 |
| 15 | 0.81 | 2.00 | 0.99 | +1.00 |
Real gain is 1.4–2.1×; the shuffle null sits at 0.86–0.99 ≈ chance. The excess is real pairwise structure, strongest in deep layers. The gaussian-proxy negative was an artifact of the input, not a property of MoE routers.
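The marginal-preserving shuffle null can be sketched as follows: for each expert, take its 0/1 "fired on this token" column and permute it independently of every other expert. This is a minimal sketch under assumptions; `shuffle_null` and the toy trace are hypothetical names, and the real script may differ in detail. Note that after the shuffle a token's selected set need not have exactly k members.

```python
import random


def shuffle_null(selections, num_experts, seed=0):
    """Independently permute which tokens select each expert.

    Per-expert selection counts are preserved exactly, while any pairwise
    co-firing between experts is randomized. Per-token set sizes are not
    preserved (a token may end up with more or fewer than k experts).
    """
    rng = random.Random(seed)
    num_tokens = len(selections)
    shuffled = [[] for _ in range(num_tokens)]
    for e in range(num_experts):
        fired = [1 if e in sel else 0 for sel in selections]
        rng.shuffle(fired)  # permute this expert's column on its own
        for t, bit in enumerate(fired):
            if bit:
                shuffled[t].append(e)
    return shuffled


def expert_counts(selections, num_experts):
    """Per-expert selection frequency, used to check the marginals."""
    return [sum(e in sel for sel in selections) for e in range(num_experts)]


# Toy trace (hypothetical): experts 0/1 and 2/3 always fire as pairs.
toy_trace = [[0, 1]] * 50 + [[2, 3]] * 50
null_trace = shuffle_null(toy_trace, num_experts=4, seed=1)
pair_before = sum(1 for s in toy_trace if 0 in s and 1 in s)
pair_after = sum(1 for s in null_trace if 0 in s and 1 in s)
```

Running the same clustering-gain metric on `null_trace` then gives the chance level for exactly these marginals, which is what the shuffle-null column reports.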
Test 2: next-token prefetch hit-rate (no clustering, no chance model)
Clustering-gain still leans on a clustering algorithm and a chance model, the same shape of metric that has misled this line of work before. So we re-asked the question with neither. Honest train/test split: learn expert frequency and co-activation on the first half of each document, evaluate on the second. At a prefetch budget of B experts for the next token, compare two predictors:
- LFU: prefetch the B globally-most-frequent experts (static). The "trivial caching" baseline.
- CO-ACT: prefetch the B experts most co-activated with the set that just fired.
hit-rate = (experts that fire next ∩ prefetched) / top-k. Mean over layers {0,1,4,8,12,15}:
| budget B | LFU | CO-ACT | Δ |
|---|---|---|---|
| 12 | 0.46 | 0.49 | +0.03 |
| 16 | 0.54 | 0.57 | +0.03 |
| 24 | 0.66 | 0.70 | +0.04 |
| 32 | 0.77 | 0.80 | +0.03 |
CO-ACT wins all 24 cells (6 layers × 4 budgets), where chance would lose about half, and the margin grows with depth (layer 15, B=16: 0.58 → 0.67, +9 points). Same depth pattern as Test 1.
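The two predictors and the hit-rate can be sketched on a toy trace. This is an illustration only: all names are hypothetical, and scoring a candidate expert by its summed co-activation with the experts that just fired is an assumption about how "most co-activated with the set that just fired" is computed.

```python
from collections import Counter
from itertools import combinations


def fit(train, num_experts):
    """Learn per-expert frequency and pairwise co-activation on the train split."""
    freq = Counter()
    coact = [[0] * num_experts for _ in range(num_experts)]
    for sel in train:
        freq.update(sel)
        for a, b in combinations(sorted(set(sel)), 2):
            coact[a][b] += 1
            coact[b][a] += 1
    return freq, coact


def lfu_prefetch(freq, budget):
    """Static baseline: the `budget` globally most frequent experts."""
    return {e for e, _ in freq.most_common(budget)}


def coact_prefetch(coact, prev, budget):
    """Rank experts by summed co-activation with the experts that just fired."""
    n = len(coact)
    score = [sum(coact[p][e] for p in prev) for e in range(n)]
    ranked = sorted(range(n), key=lambda e: -score[e])
    return set(ranked[:budget])


def hit_rate(test, predict, k):
    """Fraction of next-token experts that were in the prefetched set."""
    hits = total = 0
    for prev, nxt in zip(test, test[1:]):
        hits += len(set(nxt) & predict(prev))
        total += k
    return hits / total


# Toy data (hypothetical): 4 experts, top-2, budget B=2.
train = [[0, 1], [0, 1], [0, 1], [2, 3], [2, 3]]
test = [[0, 1], [0, 1], [2, 3], [2, 3], [2, 3]]
freq, coact = fit(train, num_experts=4)
lfu_hr = hit_rate(test, lambda prev: lfu_prefetch(freq, 2), k=2)
coact_hr = hit_rate(test, lambda prev: coact_prefetch(coact, prev, 2), k=2)
```

On this toy trace the co-activation predictor follows the active team while the static LFU set does not, which is the effect the table measures at scale.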
Test 3: a real capacity-C cache, real eviction, real miss-rate
Tests 1 and 2 both scored co-activation against an expert-frequency baseline (clustering-vs-chance; LFU prefetch). That is the wrong opponent. The real opponent for any caching scheme is recency: an LRU cache. So we built an actual capacity-C expert cache, replayed the real test trace, and counted real demand misses (cold loads). For the co-activation policy we also counted prefetch loads, so its bandwidth cost is not hidden.
Demand miss-rate (cold loads / expert-activation; lower is better), mean over layers {0,1,4,8,12,15}:
| cache cap | LRU | LFU-static | CO-ACT (demand) | CO-ACT (+ prefetch bandwidth) |
|---|---|---|---|---|
| 16 | 0.423 | 0.459 | 0.419 | 0.530 |
| 24 | 0.280 | 0.337 | 0.289 | 0.331 |
| 32 | 0.179 | 0.232 | 0.189 | 0.208 |
This is the sobering, honest result, and it reframes Tests 1–2:
- LFU-static is the worst policy, especially deep (layer 15, cap 16: LFU 0.418 vs LRU 0.321). Pinning experts by global popularity is a bad cache. Tests 1–2 flattered co-activation by comparing it to this weak baseline.
- CO-ACT prefetch ties LRU on demand misses but spends far more bandwidth (cap 16: 0.530 total vs LRU 0.423). The explicit prefetcher does not pay for itself.
- Why: experts that co-fire are, by definition, recently used, so LRU already retains them implicitly. The co-activation signal is real, but a competent dynamic cache exploits it for free, without a prefetcher.
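The demand miss-rate comparison between an LRU cache and a statically pinned set can be sketched as below. This is a minimal illustration under assumptions: the function names and toy trace are hypothetical, each expert activation is treated as one cache access, and the capacity is assumed to be at least top-k so a token never evicts its own experts.

```python
from collections import OrderedDict


def lru_miss_rate(trace, capacity):
    """Replay a trace through a capacity-limited LRU expert cache.

    Returns cold loads divided by total expert activations.
    Assumes capacity >= top-k (see lead-in).
    """
    cache = OrderedDict()
    misses = total = 0
    for sel in trace:
        for e in sel:
            total += 1
            if e in cache:
                cache.move_to_end(e)  # mark as most recently used
            else:
                misses += 1
                cache[e] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return misses / total


def static_miss_rate(trace, pinned):
    """Miss-rate when a fixed set of experts is pinned and never evicted."""
    total = sum(len(sel) for sel in trace)
    misses = sum(1 for sel in trace for e in sel if e not in pinned)
    return misses / total


# Toy trace (hypothetical): the active "team" switches once.
toy_trace = [[0, 1]] * 3 + [[2, 3]] * 5
lru = lru_miss_rate(toy_trace, capacity=2)
pinned = static_miss_rate(toy_trace, pinned={0, 1})
```

LRU pays once per team switch and then hits; the pinned set keeps missing for as long as the other team is active, which is the mechanism behind the LFU-static column.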
Test 4: static expert placement, the exploit that survives
Test 3 killed the prefetch framing but pointed at the one setting LRU can't help: static placement. In expert-parallel / sharded inference, each expert lives on a fixed device with no dynamic cache. Per token, the hidden state must be dispatched to every distinct device holding one of its top-k experts, so the communication cost ∝ distinct devices touched per token. If co-firing experts share a device, you touch fewer devices. Recency is unavailable here, so the question is purely placement.
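The communication proxy described above, distinct devices touched per token, is straightforward to compute for a given expert→device map. A minimal sketch, with hypothetical names and a toy trace:

```python
def devices_per_token(trace, placement):
    """Mean number of distinct devices a token's top-k experts live on.

    trace: list of per-token expert-id lists.
    placement: list where placement[expert_id] is that expert's device id.
    """
    touched = sum(len({placement[e] for e in sel}) for sel in trace)
    return touched / len(trace)


# Toy setup (hypothetical): 4 experts, 2 devices, two co-firing pairs.
toy_trace = [[0, 1], [0, 1], [2, 3], [2, 3]]
split_teams = [0, 1, 0, 1]  # each co-firing pair straddles both devices
keep_teams = [0, 0, 1, 1]   # each co-firing pair shares one device

split_cost = devices_per_token(toy_trace, split_teams)
team_cost = devices_per_token(toy_trace, keep_teams)
```

Both placements put two experts on each device, so the difference in cost comes purely from which experts are grouped together.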
We compare three balanced placements (each device gets exactly N/D experts, learned on train): random, frequency-balanced (the standard load-balancing assignment), and co-activation-clustered (greedy agglomeration putting high-co-activation experts together). We report distinct devices/token (lower = better locality) and load-CV (lower = better balanced; a guard against winning locality by overloading one device). Mean over layers {0,1,4,8,12,15}:
| D devices | FREQ-bal dev/tok | CO-ACT (unconstrained) | CO-ACT load-capped | Δ vs FREQ | load-CV (freq → capped) |
|---|---|---|---|---|---|
| 2 | 2.00 | 1.92 (CV 0.42) | 1.97 | −1.5% | 0.02 → 0.02 |
| 4 | 3.73 | 3.08 (CV 0.57) | 3.27 | −12.4% | 0.06 → 0.09 |
| 8 | 5.71 | 4.52 (CV 0.68) | 4.72 | −17.3% | 0.09 → 0.19 |
The unconstrained co-activation placement cuts dev/token harder (up to −21%, −33% in deep layers) but at a real load-balance cost: CV blows up from ~0.07 to 0.42–0.68, which means stragglers in expert-parallel inference. That is not a free win. A load-capped co-activation placement (the agglomeration penalized by per-device load) recovers nearly all the balance while keeping most of the locality: −12% to −17% cross-device traffic at a load-CV close to that of standard load-balancing (0.06 → 0.09 at D=4, 0.09 → 0.19 at D=8). The benefit grows with device count (more devices → more to gain from co-locating teams), and is largest in deep layers, the same depth signature as Tests 1 and 2.
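One way to picture a load-penalized placement is the simplified greedy assignment below. It is not the agglomerative procedure used in the study: it is a swapped-in, easier-to-read variant that places experts one at a time on the device with the best co-activation affinity minus a load penalty. The names, the `load_penalty` weight, and the toy inputs are all hypothetical.

```python
from statistics import mean, pstdev


def load_capped_placement(coact, freq, num_devices, load_penalty=0.5):
    """Greedy sketch: hottest experts first, each to the device maximizing
    (co-activation with experts already there) - load_penalty * device load,
    with at most ceil(N / D) experts per device."""
    n = len(freq)
    slots = -(-n // num_devices)  # ceil(n / num_devices)
    members = [[] for _ in range(num_devices)]
    load = [0.0] * num_devices
    placement = [None] * n
    for e in sorted(range(n), key=lambda x: -freq[x]):
        best, best_score = None, None
        for d in range(num_devices):
            if len(members[d]) >= slots:
                continue  # device is full
            affinity = sum(coact[e][m] for m in members[d])
            score = affinity - load_penalty * load[d]
            if best_score is None or score > best_score:
                best, best_score = d, score
        placement[e] = best
        members[best].append(e)
        load[best] += freq[e]
    return placement


def load_cv(placement, freq, num_devices):
    """Coefficient of variation of per-device load (std / mean)."""
    load = [0.0] * num_devices
    for e, d in enumerate(placement):
        load[d] += freq[e]
    return pstdev(load) / mean(load)


# Toy inputs (hypothetical): experts 0/1 and 2/3 co-fire; equal popularity.
coact = [[0, 10, 0, 0], [10, 0, 0, 0], [0, 0, 0, 10], [0, 0, 10, 0]]
freq = [10, 10, 10, 10]
placement = load_capped_placement(coact, freq, num_devices=2)
cv = load_cv(placement, freq, num_devices=2)
```

The penalty weight trades locality against balance: at zero the assignment chases affinity only (the unconstrained regime), and larger values push toward the frequency-balanced behaviour. Since affinity and load are in different units here, the weight would need tuning on real traces.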
What this means
- There is exactly one real, free exploit, and it's static placement. For any system with a dynamic expert cache, plain LRU already captures co-activation, and an explicit prefetcher just wastes bandwidth (Test 3). But for static expert sharding (no cache, fixed expert→device), a load-capped co-activation placement cuts cross-device communication 12–17% over standard frequency-balanced placement, at matched load balance, with the gain rising in device count and depth. That is a genuine, deployable win for distributed MoE inference, and it falls directly out of the structure the gaussian proxy said didn't exist.
- The methodological lesson is the durable result. A negative was produced by measuring the right object (the router gate) with the wrong input (gaussian noise). Then the recovered structure looked exploitable against a frequency baseline, until the right baseline (LRU) showed a dynamic cache eats it for free. Then the unconstrained clustering looked like a free win, until the load-CV guard showed it just imbalanced the devices. Only the load-capped placement, measured on real traces against the strongest honest baseline, with a balance guard, is a real result. Three proxy traps peeled off one finding (wrong input → wrong baseline → unguarded objective). Same recurring lesson as the VQ companion study (weight-MSE and per-tensor output-cosine both lied): the thing you measure against is usually where the error hides.
Caveats (what would make this airtight)
- OLMoE ≠ M3. The structure and the placement win are shown on a 64-expert top-8 model; confirming on M3's 128-expert top-4 router needs real M3 traces, which needs 128 GB to run (a server job, not local). With more experts and a smaller top-k, co-location has more room to help, so 64/8 is if anything a conservative case.
- The placement model counts distinct devices/token (all-to-all dispatch fan-out), not real interconnect latency or per-device kernel time on specific hardware. The relative ordering of placements and the load-CV trade-off are the robust parts; absolute dev/token is setup-dependent.
- Placement is learned on one corpus's co-activation statistics; a production placement should be fit on a representative traffic sample and is only as good as that sample's coverage.
Code is public: moe_router_locality.py (gaussian proxy, the original negative), moe_real_traces.py (real traces + shuffle null), moe_cache_sim.py (prefetch hit-rate vs LFU), moe_cache_miss.py (real cache miss-rate vs LRU), moe_static_placement.py (static placement vs frequency-balanced, with a load-capped variant). All run on a single public MoE checkpoint; no model training details are involved.
Repository contents (code)
Five scripts, each one stage of the investigation. All run on a single public MoE checkpoint (default allenai/OLMoE-1B-7B-0924); no model training is involved.
| file | what it does |
|---|---|
| `moe_router_locality.py` | the original negative: clusters the router co-activation graph from gaussian input directions → locality-gain ≈ 1.0 (the trap) |
| `moe_real_traces.py` | real per-token router traces on multi-domain text + a marginal-preserving shuffle null → gain 1.4–2.1×, null ≈ 1.0 |
| `moe_cache_sim.py` | next-token prefetch hit-rate, co-activation vs frequency (LFU), honest train/test split |
| `moe_cache_miss.py` | a real capacity-C expert cache miss-rate vs LRU; shows a dynamic cache already captures the structure |
| `moe_static_placement.py` | static expert placement (sharded inference): load-capped co-activation placement cuts cross-device traffic 12–17% vs frequency-balanced; the one surviving win |
Reproduce
```bash
pip install -r requirements.txt
python moe_real_traces.py        # real traces + shuffle null (the reversal)
python moe_cache_miss.py         # dynamic cache: LRU already wins
python moe_static_placement.py   # static placement: the real, deployable win
```
Study on a public MoE checkpoint. Model training details are out of scope.