The gaussian proxy lies: MoE expert co-activation has real structure

A negative result on MoE expert clustering, overturned by feeding the router real tokens instead of noise. Four tests, one verdict, and the same methodological lesson that keeps recurring: the proxy you measure on is usually the thing that's wrong.

📊 Companion compression study (VQ for LLM weights) and the same "measure honestly" philosophy: huggingface.co/KikoCis/vq-llm-compression · agentic leaderboard: rab.utopiaia.com

TL;DR

In a Mixture-of-Experts model, the natural unit isn't the weight — it's the expert. Permuting/grouping experts is ~free (the router indexes them), so if some experts reliably fire together, you could co-locate them in fast memory and prefetch as a group. Cache locality, not size compression.
A first experiment said no structure: clustering the router's co-activation graph gave a locality-gain of ≈1.000×, i.e. no better than chance. Dead end.
That experiment fed the router gaussian noise instead of real hidden states. Re-run on a real MoE (OLMoE-1B-7B) with real text, the co-activation clustering-gain is 1.4–2.1×, and a marginal-preserving shuffle null collapses it back to ≈1.0 — so the structure is real, not a clustering artifact.
The structure is real, but whether it's worth exploiting depends entirely on the baseline. Against frequency caching (LFU) a co-activation policy wins; against recency (LRU) — the baseline that actually matters for a dynamic cache — an explicit prefetcher just wastes bandwidth, because LRU already retains co-firing experts implicitly.
The one real, deployable win is static placement. For sharded / expert-parallel inference (fixed expert→device, no dynamic cache, where recency is unavailable), a load-capped co-activation placement cuts cross-device communication 12–17% vs standard frequency-balanced placement, at matched load balance — and the gain grows with device count and with depth. The unconstrained version cuts more (−21%) but imbalances the devices; the load-capped version is the honest, free win.

The idea, and the first (wrong) answer

MoE layers route each token to a small top-k subset of N experts (OLMoE: top-8 of 64; MiniMax-M3: 4 of 128). Only the selected experts compute. If you can't keep all experts in fast memory, you want the ones that tend to be needed at the same time to live together — then one fetch warms a whole "team."

The first pass (on MiniMax-M3's router) built the co-activation matrix by simulating top-k selection over gaussian random input directions, clustered it, and measured intra-cluster co-activation vs chance. Result: locality-gain ≈ 1.000 at every layer, with strong marginal structure (some experts hot, some dead) but no pairwise teams. Conclusion at the time: co-activation is fully explained by independent selection with non-uniform popularity — clustering buys nothing over ordinary LFU caching.

The flaw is in the input. Gaussian h has no semantic correlation. Real hidden states arriving at an MoE layer are highly structured — a token in a code span, a math span, a Spanish span each produce correlated activations that drive correlated routing. Feeding the gate noise destroys exactly the correlation the experiment was trying to detect.
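To make the proxy setup concrete, here is a minimal sketch of how a co-activation matrix can be built from top-k selections over gaussian inputs. It is an illustration under stated assumptions, not the code in moe_router_locality.py: the linear `gate`, the toy sizes, and the function names `top_k_experts`, `route`, and `coactivation_counts` are all hypothetical.

```python
import random
from itertools import combinations


def top_k_experts(logits, k):
    """Indices of the k largest router logits, i.e. the experts selected for one token."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]


def route(gate, hidden, k):
    """Toy linear router (an assumption): logits = gate @ hidden, then top-k."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate]
    return top_k_experts(logits, k)


def coactivation_counts(selections, n_experts):
    """Symmetric matrix: counts[a][b] = number of tokens that selected both a and b."""
    counts = [[0] * n_experts for _ in range(n_experts)]
    for chosen in selections:
        for a, b in combinations(chosen, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


# Gaussian proxy: every "hidden state" is independent noise.
rng = random.Random(0)
n_experts, dim, k = 8, 16, 2  # toy sizes, not OLMoE's 64 experts / top-8
gate = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
noise_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
selections = [route(gate, h, k) for h in noise_tokens]
coact = coactivation_counts(selections, n_experts)
```

Swapping `noise_tokens` for hidden states captured from real text is the only change the later tests make to this pipeline; the counting step stays the same.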

Test 1 — real traces + a shuffle null

We can't run M3 locally (128 GB). So we ran the largest MoE we can run end-to-end — OLMoE-1B-7B-0924 (64 experts, top-8, 16 layers) — on a 6-domain real-text corpus with output_router_logits=True, captured the genuine per-token expert selection, and applied the identical clustering-gain metric. Critically, we added a marginal-preserving shuffle null: independently permute which tokens select each expert, so per-expert frequencies are preserved exactly but all pairwise co-firing is destroyed. If the gain survives the shuffle, it was a finite-sample/clustering artifact; if it collapses to ≈1.0, the pairwise structure is real.

layer	coact-Gini	clustering-gain	shuffle-null	excess
0	0.63	1.44	0.86	+0.57
4	0.73	1.69	0.92	+0.77
8	0.79	2.07	0.99	+1.08
12	0.78	1.70	0.96	+0.74
15	0.81	2.00	0.99	+1.00

Real gain is 1.4–2.1×; the shuffle null sits at 0.86–0.99 ≈ chance. The excess is real pairwise structure, strongest in deep layers. The gaussian-proxy negative was an artifact of the input, not a property of MoE routers.
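The shuffle null is simple enough to sketch. The version below is one plausible reading of "independently permute which tokens select each expert", written for illustration and not taken from moe_real_traces.py; `shuffle_null`, `pair_count`, and `expert_count` are hypothetical names, and the two-team trace is synthetic.

```python
import random


def shuffle_null(selections, n_experts, rng):
    """Marginal-preserving null: reassign each expert's hits to random tokens.

    Each expert keeps exactly its original number of activations, but the
    tokens it fires on are drawn independently of every other expert.
    """
    n_tokens = len(selections)
    shuffled = [set() for _ in range(n_tokens)]
    for e in range(n_experts):
        hits = sum(1 for chosen in selections if e in chosen)
        for t in rng.sample(range(n_tokens), hits):
            shuffled[t].add(e)
    return shuffled


def pair_count(selections, a, b):
    """Number of tokens on which experts a and b both fire."""
    return sum(1 for chosen in selections if a in chosen and b in chosen)


def expert_count(selections, e):
    """Number of tokens on which expert e fires (its marginal frequency)."""
    return sum(1 for chosen in selections if e in chosen)


# Synthetic trace with two perfect "teams": {0, 1} and {2, 3}.
real = [{0, 1} if t % 2 == 0 else {2, 3} for t in range(400)]
null = shuffle_null(real, n_experts=4, rng=random.Random(0))

real_pairs = pair_count(real, 0, 1)  # 200: the team always fires together
null_pairs = pair_count(null, 0, 1)  # near the independence level of 400 * 0.5 * 0.5
```

Note that this null preserves per-expert frequencies exactly but not the per-token count: a shuffled token may hold more or fewer than k experts. Any metric applied to both traces has to tolerate that.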

Test 2 — next-token prefetch hit-rate (no clustering, no chance model)

Clustering-gain still leans on a clustering algorithm and a chance model — the same shape of metric that has misled this line of work before. So we re-asked the question with neither. Honest train/test split: learn expert frequency and co-activation on the first half of each document, evaluate on the second. At a prefetch budget of B experts for the next token, compare two predictors:

LFU — prefetch the B globally-most-frequent experts (static). The "trivial caching" baseline.
CO-ACT — prefetch the B experts most co-activated with the set that just fired.

hit-rate = |experts that fire next ∩ prefetched| / k, where k is the router's top-k (8 for OLMoE). Mean over layers {0,1,4,8,12,15}:

budget B	LFU	CO-ACT	Δ
12	0.46	0.49	+0.03
16	0.54	0.57	+0.03
24	0.66	0.70	+0.04
32	0.77	0.80	+0.03

CO-ACT wins all 24 cells (6 layers × 4 budgets), where two equivalent predictors would split them about evenly, and the margin grows with depth (layer 15, B=16: 0.58 → 0.67, +9 points). Same depth pattern as Test 1.
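A minimal sketch of the two predictors and the hit-rate metric follows. The CO-ACT scoring rule here (sum of within-token co-activation counts with the experts that just fired) is an assumption about how "most co-activated with the set that just fired" could be computed; it is not lifted from moe_cache_sim.py, and the tiny train/test trace is contrived to make the difference visible.

```python
from collections import Counter
from itertools import combinations


def fit_stats(train, n_experts):
    """Learn per-expert frequency and pairwise co-activation counts on the train half."""
    freq = Counter()
    coact = [[0] * n_experts for _ in range(n_experts)]
    for chosen in train:
        freq.update(chosen)
        for a, b in combinations(chosen, 2):
            coact[a][b] += 1
            coact[b][a] += 1
    return freq, coact


def top_b(scores, budget):
    """Indices of the `budget` highest scores (ties broken by lower index)."""
    return set(sorted(range(len(scores)), key=lambda e: -scores[e])[:budget])


def hit_rate(trace, k, predict):
    """Mean of |next ∩ prefetched| / k over consecutive token pairs."""
    hits = sum(len(set(nxt) & predict(prev)) for prev, nxt in zip(trace, trace[1:]))
    return hits / (k * (len(trace) - 1))


n_experts, k, budget = 4, 2, 2
train = [(0, 1)] * 3 + [(2, 3)] * 2  # first half: fit statistics
test = [(2, 3)] * 3                  # second half: evaluate

freq, coact = fit_stats(train, n_experts)
lfu_set = top_b([freq[e] for e in range(n_experts)], budget)


def lfu(prev):
    return lfu_set  # static: ignores what just fired


def co_act(prev):
    # Score each expert by its total co-activation with the experts that just fired.
    return top_b([sum(coact[a][e] for a in prev) for e in range(n_experts)], budget)


lfu_rate = hit_rate(test, k, lfu)
coact_rate = hit_rate(test, k, co_act)
```

On this toy trace the gap is total (LFU misses everything, CO-ACT hits everything) only because the test half sits entirely in the less frequent team; the real margins are the few points reported in the table.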

Test 3 — a real capacity-C cache, real eviction, real miss-rate

Tests 1 and 2 both scored co-activation against an expert-frequency baseline (clustering-vs-chance; LFU prefetch). That is the wrong opponent. The real opponent for any caching scheme is recency — an LRU cache. So we built an actual capacity-C expert cache, replayed the real test trace, and counted real demand misses (cold loads). For the co-activation policy we also counted prefetch loads, so its bandwidth cost is not hidden.

Demand miss-rate (cold loads / expert-activation; lower is better), mean over layers {0,1,4,8,12,15}:

cache cap	LRU	LFU-static	CO-ACT (demand)	CO-ACT (+ prefetch bandwidth)
16	0.423	0.459	0.419	0.530
24	0.280	0.337	0.289	0.331
32	0.179	0.232	0.189	0.208

This is the sobering, honest result, and it reframes Tests 1–2:

LFU-static is the worst policy, especially deep (layer 15, cap 16: LFU 0.418 vs LRU 0.321). Pinning experts by global popularity is a bad cache. Tests 1–2 flattered co-activation by comparing it to this weak baseline.
CO-ACT prefetch ties LRU on demand misses but spends far more bandwidth (cap 16: 0.530 total vs LRU 0.423). The explicit prefetcher does not pay for itself.
Why: experts that co-fire are, by definition, recently used — so LRU already retains them implicitly. The co-activation signal is real, but a competent dynamic cache exploits it for free, without a prefetcher.
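The LRU-versus-static comparison can be reproduced in miniature. This is a sketch of a capacity-C expert cache replaying a trace, not the simulator in moe_cache_miss.py; the function names and the three-phase synthetic trace are assumptions, and the prefetch-bandwidth accounting for CO-ACT is left out.

```python
from collections import Counter, OrderedDict


def lru_miss_rate(trace, capacity):
    """Demand miss-rate (cold loads / expert activations) of an LRU expert cache."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = activations = 0
    for chosen in trace:
        for e in chosen:
            activations += 1
            if e in cache:
                cache.move_to_end(e)  # refresh recency on a hit
            else:
                misses += 1  # cold load
                cache[e] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used
    return misses / activations


def static_miss_rate(trace, pinned):
    """Miss-rate when a fixed set of experts is pinned and never evicted."""
    activations = sum(len(chosen) for chosen in trace)
    misses = sum(1 for chosen in trace for e in chosen if e not in pinned)
    return misses / activations


def most_frequent(train, capacity):
    """The `capacity` globally most frequent experts (the LFU-static pin set)."""
    freq = Counter(e for chosen in train for e in chosen)
    return {e for e, _ in freq.most_common(capacity)}


# Three "domains" in sequence, each using its own pair of experts.
trace = [(0, 1)] * 3 + [(2, 3)] * 3 + [(4, 5)] * 3
pinned = most_frequent(trace, capacity=2)
lru = lru_miss_rate(trace, capacity=2)
static = static_miss_rate(trace, pinned)
```

Here LRU pays two cold loads per domain switch and then hits, while the pinned set only ever serves one domain. The capacity is assumed to be at least k; below that, experts of the same token evict each other.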

Test 4 — static expert placement: the exploit that survives

Test 3 killed the prefetch framing but pointed at the one setting LRU can't help: static placement. In expert-parallel / sharded inference, each expert lives on a fixed device with no dynamic cache. Per token, the hidden state must be dispatched to every distinct device holding one of its top-k experts, so the communication cost ≈ distinct devices touched per token. If co-firing experts share a device, you touch fewer devices. Recency is unavailable here, so the question is purely placement.
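The cost model in this paragraph is a one-liner. The sketch below assumes a placement is a plain dict from expert index to device index; `devices_per_token` and the two example placements are hypothetical names for illustration.

```python
from statistics import mean


def devices_per_token(trace, placement):
    """Mean number of distinct devices holding a token's selected experts."""
    return mean(len({placement[e] for e in chosen}) for chosen in trace)


# Experts 0 and 1 always co-fire, as do experts 2 and 3.
trace = [(0, 1), (0, 1), (2, 3), (2, 3)]
together = {0: 0, 1: 0, 2: 1, 3: 1}  # co-firing experts share a device
split = {0: 0, 1: 1, 2: 0, 3: 1}     # co-firing experts are separated

cost_together = devices_per_token(trace, together)
cost_split = devices_per_token(trace, split)
```

Both placements put two experts on each device, so they are equally balanced by expert count; only the dispatch fan-out differs.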

We compare three balanced placements (each device gets exactly N/D experts, learned on train): random, frequency-balanced (the standard load-balancing assignment), and co-activation-clustered (greedy agglomeration putting high-co-activation experts together). We report distinct devices/token (lower = better locality) and load-CV (lower = better balanced — a guard against winning locality by overloading one device). Mean over layers {0,1,4,8,12,15}:

D devices	FREQ-bal dev/tok	CO-ACT (unconstrained)	CO-ACT load-capped	Δ vs FREQ	load-CV (freq → capped)
2	2.00	1.92 (CV 0.42)	1.97	−1.5%	0.02 → 0.02
4	3.73	3.08 (CV 0.57)	3.27	−12.4%	0.06 → 0.09
8	5.71	4.52 (CV 0.68)	4.72	−17.3%	0.09 → 0.19

The unconstrained co-activation placement cuts dev/token harder (up to −21%, −33% in deep layers) but at a real load-balance cost: CV blows up from 0.02–0.09 to 0.42–0.68, which means stragglers in expert-parallel inference. That is not a free win. A load-capped co-activation placement (the agglomeration penalized by per-device load) recovers most of the balance while keeping most of the locality: −12% to −17% cross-device traffic at 4–8 devices, with load-CV rising only from 0.06–0.09 to 0.09–0.19. The benefit grows with device count (more devices → more to gain from co-locating teams), and is largest in deep layers, the same depth signature as Tests 1 and 2.
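One way a load-penalized greedy placement could look is sketched below. This is an assumption-laden illustration, not the algorithm in moe_static_placement.py: the hottest-first ordering, the linear `penalty` on accumulated device load, and the definition of load-CV as the coefficient of variation of per-device activation counts are all choices made here for the example.

```python
from statistics import mean, pstdev


def load_capped_placement(freq, coact, n_devices, penalty):
    """Greedy placement: join the device with the best affinity minus a load penalty."""
    n_experts = len(freq)
    slots = -(-n_experts // n_devices)  # ceil: per-device expert capacity
    members = [[] for _ in range(n_devices)]
    load = [0.0] * n_devices
    placement = {}
    for e in sorted(range(n_experts), key=lambda x: -freq[x]):  # hottest first
        best, best_score = None, None
        for d in range(n_devices):
            if len(members[d]) >= slots:
                continue  # device already holds its share of experts
            affinity = sum(coact[e][m] for m in members[d])
            score = affinity - penalty * load[d]
            if best_score is None or score > best_score:
                best, best_score = d, score
        placement[e] = best
        members[best].append(e)
        load[best] += freq[e]
    return placement


def load_cv(trace, placement, n_devices):
    """Coefficient of variation of per-device activation counts (0 = perfectly balanced)."""
    load = [0] * n_devices
    for chosen in trace:
        for e in chosen:
            load[placement[e]] += 1
    return pstdev(load) / mean(load)


# Two teams of equal popularity: {0, 1} and {2, 3}.
trace = [(0, 1)] * 10 + [(2, 3)] * 10
freq = [10, 10, 10, 10]
coact = [[0, 10, 0, 0], [10, 0, 0, 0], [0, 0, 0, 10], [0, 0, 10, 0]]
placement = load_capped_placement(freq, coact, n_devices=2, penalty=0.5)
cv = load_cv(trace, placement, n_devices=2)
```

Setting `penalty=0` gives an unconstrained variant that chases affinity alone; raising it pushes hot experts apart at the cost of locality, which is the trade-off the table quantifies.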

What this means

There is exactly one real, free exploit, and it's static placement. For any system with a dynamic expert cache, plain LRU already captures co-activation — an explicit prefetcher just wastes bandwidth (Test 3). But for static expert sharding (no cache, fixed expert→device), a load-capped co-activation placement cuts cross-device communication 12–17% over standard frequency-balanced placement, at matched load balance, with the gain rising in device count and depth. That is a genuine, deployable win for distributed MoE inference — and it falls directly out of the structure the gaussian proxy said didn't exist.
The methodological lesson is the durable result. A negative was produced by measuring the right object (the router gate) with the wrong input (gaussian noise). Then the recovered structure looked exploitable against a frequency baseline — until the right baseline (LRU) showed a dynamic cache eats it for free. Then the unconstrained clustering looked like a free win — until the load-CV guard showed it just imbalanced the devices. Only the load-capped placement, measured on real traces against the strongest honest baseline, with a balance guard, is a real result. Three proxy traps peeled off one finding (wrong input → wrong baseline → unguarded objective). Same recurring lesson as the VQ companion study (weight-MSE and per-tensor output-cosine both lied): the thing you measure against is usually where the error hides.

Caveats (what would make this airtight)

OLMoE ≠ M3. The structure and the placement win are shown on a 64-expert top-8 model; confirming on M3's 128-expert top-4 router needs real M3 traces, which needs 128 GB to run — a server job, not local. With more experts and a smaller top-k, co-location has more room to help, so 64/8 is if anything a conservative case.
The placement model counts distinct devices/token (all-to-all dispatch fan-out), not real interconnect latency or per-device kernel time on specific hardware. The relative ordering of placements and the load-CV trade-off are the robust parts; absolute dev/token is setup-dependent.
Placement is learned on one corpus's co-activation statistics; a production placement should be fit on a representative traffic sample and is only as good as that sample's coverage.

Code is public: moe_router_locality.py (gaussian proxy, the original negative), moe_real_traces.py (real traces + shuffle null), moe_cache_sim.py (prefetch hit-rate vs LFU), moe_cache_miss.py (real cache miss-rate vs LRU), moe_static_placement.py (static placement vs frequency-balanced, with a load-capped variant). All run on a single public MoE checkpoint; no model training is involved.

Repository contents (code)

Five scripts, each one stage of the investigation. All run on a single public MoE checkpoint (default allenai/OLMoE-1B-7B-0924); no model training is involved.

file	what it does
`moe_router_locality.py`	the original negative: clusters the router co-activation graph from gaussian input directions → locality-gain ≈ 1.0 (the trap)
`moe_real_traces.py`	real per-token router traces on multi-domain text + a marginal-preserving shuffle null → gain 1.4–2.1×, null ≈ 1.0
`moe_cache_sim.py`	next-token prefetch hit-rate, co-activation vs frequency (LFU), honest train/test split
`moe_cache_miss.py`	a real capacity-C expert cache miss-rate vs LRU — shows a dynamic cache already captures the structure
`moe_static_placement.py`	static expert placement (sharded inference): load-capped co-activation placement cuts cross-device traffic 12–17% vs frequency-balanced — the one surviving win

Reproduce

pip install -r requirements.txt
python moe_real_traces.py        # real traces + shuffle null (the reversal)
python moe_cache_miss.py         # dynamic cache: LRU already wins
python moe_static_placement.py   # static placement: the real, deployable win

Study on a public MoE checkpoint. Model training details are out of scope.
