The gaussian proxy lies: MoE expert co-activation has real structure

A negative result on MoE expert clustering, overturned by feeding the router real tokens instead of noise. Three orthogonal tests, one verdict, and the same methodological lesson that keeps recurring: the proxy you measure on is usually the thing that's wrong.

📊 Companion compression study (VQ for LLM weights) and the same "measure honestly" philosophy: huggingface.co/KikoCis/vq-llm-compression · agentic leaderboard: rab.utopiaia.com

TL;DR

  • In a Mixture-of-Experts model, the natural unit isn't the weight; it's the expert. Permuting/grouping experts is ~free (the router indexes them), so if some experts reliably fire together, you could co-locate them in fast memory and prefetch as a group. Cache locality, not size compression.
  • A first experiment said no structure: clustering the router's co-activation graph beat chance by ≈1.000×. Dead end.
  • That experiment fed the router gaussian noise instead of real hidden states. Re-run on a real MoE (OLMoE-1B-7B) with real text, the co-activation clustering-gain is 1.4–2.1×, and a marginal-preserving shuffle null collapses it back to ≈1.0, so the structure is real, not a clustering artifact.
  • The structure is real, but whether it's worth exploiting depends entirely on the baseline. Against frequency caching (LFU) a co-activation policy wins; against recency (LRU), the baseline that actually matters for a dynamic cache, an explicit prefetcher just wastes bandwidth, because LRU already retains co-firing experts implicitly.
  • The one real, deployable win is static placement. For sharded / expert-parallel inference (fixed expert→device, no dynamic cache, where recency is unavailable), a load-capped co-activation placement cuts cross-device communication 12–17% vs standard frequency-balanced placement, at matched load balance, and the gain grows with device count and with depth. The unconstrained version cuts more (−21%) but imbalances the devices; the load-capped version is the honest, free win.

The idea, and the first (wrong) answer

MoE layers route each token to a small top-k subset of N experts (OLMoE: top-8 of 64; MiniMax-M3: 4 of 128). Only the selected experts compute. If you can't keep all experts in fast memory, you want the ones that tend to be needed at the same time to live together, so that one fetch warms a whole "team."
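As a minimal sketch of the objects discussed here, the snippet below turns a matrix of router logits into per-token top-k selections and then into a pairwise co-activation count matrix. The function names and the toy logits are illustrative assumptions, not taken from the repository scripts.

```python
import numpy as np

def topk_selection(router_logits: np.ndarray, k: int) -> np.ndarray:
    """Return a boolean (tokens, experts) matrix marking each token's top-k experts."""
    num_tokens, num_experts = router_logits.shape
    # argpartition leaves the k largest logits in the last k positions (unordered).
    top_idx = np.argpartition(router_logits, num_experts - k, axis=1)[:, -k:]
    selected = np.zeros((num_tokens, num_experts), dtype=bool)
    np.put_along_axis(selected, top_idx, True, axis=1)
    return selected

def coactivation_matrix(selected: np.ndarray) -> np.ndarray:
    """Count, for every expert pair (i, j), the tokens routed to both."""
    s = selected.astype(np.int64)
    coact = s.T @ s
    np.fill_diagonal(coact, 0)  # the diagonal would only repeat per-expert frequency
    return coact

# Toy example (hypothetical logits): 3 tokens, 4 experts, top-2 routing.
logits = np.array([[3.0, 2.0, 1.0, 0.0],
                   [3.0, 0.0, 2.0, 1.0],
                   [0.0, 1.0, 2.0, 3.0]])
selected = topk_selection(logits, k=2)
coact = coactivation_matrix(selected)
```

In this toy case experts 0 and 1 fire together once, as do 0 and 2, and 2 and 3; every other pair never co-fires.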

The first pass (on MiniMax-M3's router) built the co-activation matrix by simulating top-k selection over gaussian random input directions, clustered it, and measured intra-cluster co-activation vs chance. Result: locality-gain ≈ 1.000 at every layer, with strong marginal structure (some experts hot, some dead) but no pairwise teams. Conclusion at the time: co-activation is fully explained by independent selection with non-uniform popularity, so clustering buys nothing over ordinary LFU caching.

The flaw is in the input. Gaussian h has no semantic correlation. Real hidden states arriving at an MoE layer are highly structured: tokens in a code span, a math span, or a Spanish span each produce correlated activations that drive correlated routing. Feeding the gate noise destroys exactly the correlation the experiment was trying to detect.

Test 1: real traces + a shuffle null

We can't run M3 locally (128 GB). So we ran the largest MoE we can run end-to-end, OLMoE-1B-7B-0924 (64 experts, top-8, 16 layers), on a 6-domain real-text corpus with output_router_logits=True, captured the genuine per-token expert selection, and applied the identical clustering-gain metric. Critically, we added a marginal-preserving shuffle null: independently permute which tokens select each expert, so per-expert frequencies are preserved exactly but all pairwise co-firing is destroyed. If the gain survives the shuffle, it was a finite-sample/clustering artifact; if it collapses to ≈1.0, the pairwise structure is real.

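| layer | coact-Gini | clustering-gain | shuffle-null | excess |
| --- | --- | --- | --- | --- |
| 0 | 0.63 | 1.44 | 0.86 | +0.57 |
| 4 | 0.73 | 1.69 | 0.92 | +0.77 |
| 8 | 0.79 | 2.07 | 0.99 | +1.08 |
| 12 | 0.78 | 1.70 | 0.96 | +0.74 |
| 15 | 0.81 | 2.00 | 0.99 | +1.00 |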

Real gain is 1.4–2.1×; the shuffle null sits at 0.86–0.99, roughly chance. The excess is real pairwise structure, strongest in deep layers. The gaussian-proxy negative was an artifact of the input, not a property of MoE routers.
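The shuffle null is simple enough to sketch. The version below permutes each expert's column of the selection matrix independently, which keeps every per-expert frequency exact while breaking pairwise co-firing. The gain function is a simplified stand-in (mean same-cluster co-activation over mean all-pairs co-activation, with fixed cluster labels), assumed here for illustration; it is not necessarily the metric or the clustering used in moe_real_traces.py.

```python
import numpy as np

def shuffle_null(selected: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Permute each expert's column independently across tokens.

    Per-expert activation counts are unchanged; which experts fire on the
    same token becomes random.
    """
    shuffled = np.empty_like(selected)
    for e in range(selected.shape[1]):
        shuffled[:, e] = rng.permutation(selected[:, e])
    return shuffled

def intra_cluster_gain(selected: np.ndarray, clusters: np.ndarray) -> float:
    """Assumed stand-in metric: same-cluster co-activation relative to all pairs."""
    s = selected.astype(np.float64)
    coact = s.T @ s
    n = coact.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    same = (clusters[:, None] == clusters[None, :]) & off_diag
    return float(coact[same].mean() / coact[off_diag].mean())

# Toy trace (hypothetical): two "teams" of experts that never mix.
rng = np.random.default_rng(0)
selected = np.zeros((2000, 4), dtype=bool)
selected[:1000, [0, 1]] = True
selected[1000:, [2, 3]] = True
clusters = np.array([0, 0, 1, 1])

real_gain = intra_cluster_gain(selected, clusters)
null_gain = intra_cluster_gain(shuffle_null(selected, rng), clusters)
```

On this toy trace the real gain is well above 1 while the shuffled gain falls back to about 1, which is the qualitative pattern the table reports. Note that after the shuffle a token no longer selects exactly k experts; only the marginals are preserved.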

Test 2: next-token prefetch hit-rate (no clustering, no chance model)

Clustering-gain still leans on a clustering algorithm and a chance model, the same shape of metric that has misled this line of work before. So we re-asked the question with neither. Honest train/test split: learn expert frequency and co-activation on the first half of each document, evaluate on the second. At a prefetch budget of B experts for the next token, compare two predictors:

  • LFU: prefetch the B globally-most-frequent experts (static). The "trivial caching" baseline.
  • CO-ACT: prefetch the B experts most co-activated with the set that just fired.

hit-rate = (experts that fire next ∩ prefetched) / top-k. Mean over layers {0,1,4,8,12,15}:

| budget B | LFU | CO-ACT | Δ |
| --- | --- | --- | --- |
| 12 | 0.46 | 0.49 | +0.03 |
| 16 | 0.54 | 0.57 | +0.03 |
| 24 | 0.66 | 0.70 | +0.04 |
| 32 | 0.77 | 0.80 | +0.03 |

CO-ACT wins all 24 cells (6 layers × 4 budgets), where chance would lose about half, and the margin grows with depth (layer 15, B=16: 0.58 → 0.67, +9 points). Same depth pattern as Test 1.
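A minimal sketch of the two predictors and the hit-rate, under stated assumptions: the CO-ACT score for a candidate expert is taken to be its summed training co-activation with the experts that just fired, and ties are broken by expert index. The function name, the scoring rule, and the toy traces are illustrative and may differ from moe_cache_sim.py.

```python
import numpy as np

def prefetch_hit_rates(train: np.ndarray, heldout: np.ndarray, budget: int):
    """Mean next-token hit-rate of an LFU and a co-activation prefetcher.

    train, heldout: boolean (tokens, experts) selection matrices in token order.
    """
    s = train.astype(np.float64)
    freq = s.sum(axis=0)
    coact = s.T @ s
    np.fill_diagonal(coact, 0.0)

    # LFU: one static set of the globally most frequent experts.
    lfu_set = np.argsort(-freq, kind="stable")[:budget]
    lfu_hits, coact_hits = [], []
    for t in range(len(heldout) - 1):
        current, nxt = heldout[t], heldout[t + 1]
        k = nxt.sum()
        # Assumed scoring rule: summed co-activation with the experts that just fired.
        scores = coact[current].sum(axis=0)
        coact_set = np.argsort(-scores, kind="stable")[:budget]
        lfu_hits.append(nxt[lfu_set].sum() / k)
        coact_hits.append(nxt[coact_set].sum() / k)
    return float(np.mean(lfu_hits)), float(np.mean(coact_hits))

def team_tokens(team, count, num_experts=6):
    """Hypothetical helper: `count` tokens that all route to the experts in `team`."""
    rows = np.zeros((count, num_experts), dtype=bool)
    rows[:, team] = True
    return rows

# Toy traces: the held-out text never uses the globally hottest team {0, 1}.
train = np.vstack([team_tokens([0, 1], 30), team_tokens([2, 3], 20), team_tokens([4, 5], 20)])
heldout = np.vstack([team_tokens([2, 3], 10), team_tokens([4, 5], 10)])
lfu_rate, coact_rate = prefetch_hit_rates(train, heldout, budget=2)
```

The toy case is deliberately extreme: LFU prefetches a team the held-out text never touches, while the co-activation predictor only misses at the single domain switch. Real traces show the far smaller margins in the table.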

Test 3: a real capacity-C cache, real eviction, real miss-rate

Tests 1 and 2 both scored co-activation against an expert-frequency baseline (clustering-vs-chance; LFU prefetch). That is the wrong opponent. The real opponent for any caching scheme is recency: an LRU cache. So we built an actual capacity-C expert cache, replayed the real test trace, and counted real demand misses (cold loads). For the co-activation policy we also counted prefetch loads, so its bandwidth cost is not hidden.

Demand miss-rate (cold loads / expert-activation; lower is better), mean over layers {0,1,4,8,12,15}:

| cache cap | LRU | LFU-static | CO-ACT (demand) | CO-ACT (+ prefetch bandwidth) |
| --- | --- | --- | --- | --- |
| 16 | 0.423 | 0.459 | 0.419 | 0.530 |
| 24 | 0.280 | 0.337 | 0.289 | 0.331 |
| 32 | 0.179 | 0.232 | 0.189 | 0.208 |

This is the sobering, honest result, and it reframes Tests 1–2:

  • LFU-static is the worst policy, especially deep (layer 15, cap 16: LFU 0.418 vs LRU 0.321). Pinning experts by global popularity is a bad cache. Tests 1–2 flattered co-activation by comparing it to this weak baseline.
  • CO-ACT prefetch ties LRU on demand misses but spends far more bandwidth (cap 16: 0.530 total vs LRU 0.423). The explicit prefetcher does not pay for itself.
  • Why: experts that co-fire are, by definition, recently used, so LRU already retains them implicitly. The co-activation signal is real, but a competent dynamic cache exploits it for free, without a prefetcher.
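The replay behind these numbers can be sketched in a few lines. The code below counts demand misses for an LRU expert cache and for a statically pinned set on the same trace. Function names and the toy trace are assumptions made for illustration; the prefetching CO-ACT policy and its bandwidth accounting are omitted.

```python
from collections import OrderedDict

def lru_demand_miss_rate(trace, capacity: int) -> float:
    """Replay a per-token expert trace through an LRU cache.

    trace: iterable of per-token expert-id tuples.
    Returns cold loads divided by total expert activations.
    """
    cache = OrderedDict()
    misses = activations = 0
    for experts in trace:
        for e in experts:
            activations += 1
            if e in cache:
                cache.move_to_end(e)           # mark as most recently used
            else:
                misses += 1
                cache[e] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used expert
    return misses / activations

def static_miss_rate(trace, pinned) -> float:
    """Miss-rate when a fixed set of experts is pinned and never replaced."""
    pinned = set(pinned)
    total = sum(len(experts) for experts in trace)
    misses = sum(e not in pinned for experts in trace for e in experts)
    return misses / total

# Toy trace (hypothetical): team {0, 1} fires for a while, then team {2, 3}.
trace = [(0, 1)] * 10 + [(2, 3)] * 10
lru_rate = lru_demand_miss_rate(trace, capacity=2)
static_rate = static_miss_rate(trace, pinned=[0, 1])
```

In the toy trace LRU pays only for the four cold loads at the two team changes, while the pinned set misses every activation once the active team shifts. That is the mechanism the last bullet describes: recency tracks the current team without any explicit co-activation model.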

Test 4: static expert placement, the exploit that survives

Test 3 killed the prefetch framing but pointed at the one setting LRU can't help: static placement. In expert-parallel / sharded inference, each expert lives on a fixed device with no dynamic cache. Per token, the hidden state must be dispatched to every distinct device holding one of its top-k experts, so the communication cost ≈ distinct devices touched per token. If co-firing experts share a device, you touch fewer devices. Recency is unavailable here, so the question is purely placement.
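The cost model in this paragraph reduces to one matrix product. A minimal sketch, assuming a boolean selection matrix and an integer expert→device array (both names are illustrative):

```python
import numpy as np

def mean_devices_per_token(selected: np.ndarray, placement: np.ndarray) -> float:
    """Average number of distinct devices a token's top-k experts live on.

    selected: bool (tokens, experts); placement[e] = device id of expert e.
    """
    num_devices = int(placement.max()) + 1
    # one_hot[e, d] = 1 if expert e lives on device d
    one_hot = np.eye(num_devices, dtype=np.int64)[placement]
    # touched[t, d] = number of token t's selected experts on device d
    touched = selected.astype(np.int64) @ one_hot
    return float((touched > 0).sum(axis=1).mean())

# Toy trace (hypothetical): tokens route to team {0, 1} or team {2, 3}.
selected = np.zeros((8, 4), dtype=bool)
selected[:4, [0, 1]] = True
selected[4:, [2, 3]] = True

together = mean_devices_per_token(selected, np.array([0, 0, 1, 1]))  # teams co-located
split = mean_devices_per_token(selected, np.array([0, 1, 0, 1]))     # teams split
```

With teams co-located every token touches one device; with each team split across both devices every token touches two.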

We compare three balanced placements (each device gets exactly N/D experts, learned on train): random, frequency-balanced (the standard load-balancing assignment), and co-activation-clustered (greedy agglomeration putting high-co-activation experts together). We report distinct devices/token (lower = better locality) and load-CV (lower = better balanced; a guard against winning locality by overloading one device). Mean over layers {0,1,4,8,12,15}:

| D devices | FREQ-bal dev/tok | CO-ACT (unconstrained) | CO-ACT load-capped | Δ vs FREQ | load-CV (freq → capped) |
| --- | --- | --- | --- | --- | --- |
| 2 | 2.00 | 1.92 (CV 0.42) | 1.97 | −1.5% | 0.02 → 0.02 |
| 4 | 3.73 | 3.08 (CV 0.57) | 3.27 | −12.4% | 0.06 → 0.09 |
| 8 | 5.71 | 4.52 (CV 0.68) | 4.72 | −17.3% | 0.09 → 0.19 |

The unconstrained co-activation placement cuts dev/token harder (up to −21%, −33% in deep layers) but at a real load-balance cost: CV blows up from ~0.07 to 0.42–0.68, which means stragglers in expert-parallel inference. That is not a free win. A load-capped co-activation placement (the agglomeration penalized by per-device load) recovers nearly all the balance while keeping most of the locality: −12% to −17% cross-device traffic at D = 4 and 8, at a load-CV far closer to standard load-balancing (0.09 and 0.19, vs 0.06 and 0.09) than the unconstrained version. The benefit grows with device count (more devices → more to gain from co-locating teams), and is largest in deep layers, the same depth signature as Tests 1 and 2.
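One way to picture a load-penalized greedy placement is sketched below. It is an assumption-laden illustration, not the algorithm in moe_static_placement.py: experts are visited in descending frequency, each goes to the non-full device that maximizes co-activation with the experts already there minus a penalty on that device's accumulated load, and every device receives exactly N/D experts. The `load_penalty` parameter and both function names are hypothetical.

```python
import numpy as np

def load_capped_placement(selected: np.ndarray, num_devices: int,
                          load_penalty: float = 0.5) -> np.ndarray:
    """Greedy expert->device assignment trading co-activation against load."""
    s = selected.astype(np.float64)
    num_experts = s.shape[1]
    freq = s.sum(axis=0)
    coact = s.T @ s
    np.fill_diagonal(coact, 0.0)

    slots = num_experts // num_devices   # assumes num_experts divisible by num_devices
    placement = np.full(num_experts, -1)
    count = np.zeros(num_devices, dtype=int)
    load = np.zeros(num_devices)
    for e in np.argsort(-freq, kind="stable"):   # hottest experts first
        best_d, best_score = -1, -np.inf
        for d in range(num_devices):
            if count[d] >= slots:                # device already holds N/D experts
                continue
            affinity = coact[e, placement == d].sum()
            score = affinity - load_penalty * load[d]
            if score > best_score:
                best_d, best_score = d, score
        placement[e] = best_d
        count[best_d] += 1
        load[best_d] += freq[e]
    return placement

def load_cv(selected: np.ndarray, placement: np.ndarray) -> float:
    """Coefficient of variation of per-device activation load."""
    freq = selected.sum(axis=0).astype(np.float64)
    per_device = np.bincount(placement, weights=freq)
    return float(per_device.std() / per_device.mean())

# Toy trace (hypothetical): two equally busy teams, {0, 1} and {2, 3}.
selected = np.zeros((200, 4), dtype=bool)
selected[:100, [0, 1]] = True
selected[100:, [2, 3]] = True
placement = load_capped_placement(selected, num_devices=2)
cv = load_cv(selected, placement)
```

Raising `load_penalty` pushes the result toward a pure load-balancing assignment and lowering it toward pure co-location, which is the trade-off the load-CV column guards.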

What this means

  1. There is exactly one real, free exploit, and it's static placement. For any system with a dynamic expert cache, plain LRU already captures co-activation; an explicit prefetcher just wastes bandwidth (Test 3). But for static expert sharding (no cache, fixed expert→device), a load-capped co-activation placement cuts cross-device communication 12–17% over standard frequency-balanced placement, at matched load balance, with the gain rising in device count and depth. That is a genuine, deployable win for distributed MoE inference, and it falls directly out of the structure the gaussian proxy said didn't exist.
  2. The methodological lesson is the durable result. A negative was produced by measuring the right object (the router gate) with the wrong input (gaussian noise). Then the recovered structure looked exploitable against a frequency baseline, until the right baseline (LRU) showed a dynamic cache eats it for free. Then the unconstrained clustering looked like a free win, until the load-CV guard showed it just imbalanced the devices. Only the load-capped placement, measured on real traces against the strongest honest baseline, with a balance guard, is a real result. Three proxy traps peeled off one finding (wrong input → wrong baseline → unguarded objective). Same recurring lesson as the VQ companion study (weight-MSE and per-tensor output-cosine both lied): the thing you measure against is usually where the error hides.

Caveats (what would make this airtight)

  • OLMoE ≠ M3. The structure and the placement win are shown on a 64-expert top-8 model; confirming on M3's 128-expert top-4 router needs real M3 traces, which needs 128 GB to run: a server job, not local. With more experts and a smaller top-k, co-location has more room to help, so 64/8 is if anything a conservative case.
  • The placement model counts distinct devices/token (all-to-all dispatch fan-out), not real interconnect latency or per-device kernel time on specific hardware. The relative ordering of placements and the load-CV trade-off are the robust parts; absolute dev/token is setup-dependent.
  • Placement is learned on one corpus's co-activation statistics; a production placement should be fit on a representative traffic sample and is only as good as that sample's coverage.

Code is public: moe_router_locality.py (gaussian proxy, the original negative), moe_real_traces.py (real traces + shuffle null), moe_cache_sim.py (prefetch hit-rate vs LFU), moe_cache_miss.py (real cache miss-rate vs LRU), moe_static_placement.py (static placement vs frequency-balanced, with a load-capped variant). All run on a single public MoE checkpoint; no model training details are involved.


Repository contents (code)

Five scripts, each one stage of the investigation. All run on a single public MoE checkpoint (default allenai/OLMoE-1B-7B-0924); no model training is involved.

| file | what it does |
| --- | --- |
| moe_router_locality.py | the original negative: clusters the router co-activation graph from gaussian input directions → locality-gain ≈ 1.0 (the trap) |
| moe_real_traces.py | real per-token router traces on multi-domain text + a marginal-preserving shuffle null → gain 1.4–2.1×, null ≈ 1.0 |
| moe_cache_sim.py | next-token prefetch hit-rate, co-activation vs frequency (LFU), honest train/test split |
| moe_cache_miss.py | a real capacity-C expert cache miss-rate vs LRU; shows a dynamic cache already captures the structure |
| moe_static_placement.py | static expert placement (sharded inference): load-capped co-activation placement cuts cross-device traffic 12–17% vs frequency-balanced; the one surviving win |

Reproduce

pip install -r requirements.txt
python moe_real_traces.py        # real traces + shuffle null (the reversal)
python moe_cache_miss.py         # dynamic cache: LRU already wins
python moe_static_placement.py   # static placement: the real, deployable win

Study on a public MoE checkpoint. Model training details are out of scope.
