Laguna Martini - structured MoE pruning for Laguna XS.2
10% leaner, served straight up.
This is a provisional research submission for the Poolside Research Hackathon. Laguna Martini identifies 10% of Laguna XS.2's routed MoE parameters for removal while preserving a regular structure designed for efficient deployment kernels. That is 3.14B removable parameters, or 9.39% of the full 33.44B-parameter model.
Results at a glance
| Configuration (native grouped pruning) | Full-cache mean loss | Full MMLU 0-shot accuracy | MMLU-STEM 5-shot accuracy | GSM8K-CoT 8-shot strict EM |
|---|---|---|---|---|
| BF16 baseline | 2.347142 | 0.733514 | 0.690771 | 0.931818 |
| 10% pruning | 2.363612 | 0.735437 | 0.693942 | 0.909091 |
| 20% pruning | 2.416442 | 0.734012 | 0.679036 | 0.780303 |
Loss uses the full held-out 4k cache. GSM8K-CoT uses the paired first-10% subset (132 examples).
Full MMLU 0-shot remains effectively flat at 10% and 20%. Native-grouped MMLU-STEM 5-shot changes
from 0.690771 to 0.693942 (+0.003172) at 10%, then to 0.679036 (-0.011735) at 20%.
We treat the small positive deltas as retained quality, not evidence that pruning improves the
model. The separate atomic-pruning MMLU sweep is not substituted here.
The current artifact measures pruning quality, not deployment speed. It applies the structured pruning mask by zeroing blocks inside the original tensors, so the released Laguna kernel still computes those blocks. A modified grouped MoE kernel and physically compact deployment checkpoint are future work.
One-line claim
Using HEAPr-style importance scores, we identify 10% of Laguna XS.2's routed MoE parameters for
structured removal: 3,140,616,192 parameters, or 9.39% of the full model. Full-cache
perplexity moves from 10.455647 to 10.629271 (+1.66%), while strict GSM8K-CoT exact match on
a paired first-10% subset moves from 0.931818 to 0.909091.
What structured pruning means here
Laguna routes tokens to parent experts. Inside each parent expert, the computation can be decomposed
into smaller atomic contributions. We sort those contributions by importance and bundle them into
regular 64-wide blocks. The recommended 10% pruning mask removes 7,987 / 79,872 of those blocks across 39
sparse layers while retaining at least one block in every parent expert.
This grouping keeps the pruned layout structured enough for a future deployment kernel to skip removed blocks. It is different from deleting routed parent experts outright.
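As a rough illustration of the decomposition described above, the sketch below builds one toy gated-MLP expert in NumPy and checks that zeroing a block in place gives the same output as physically dropping that block. The SiLU activation, the toy sizes, and every name in the snippet are assumptions made for illustration; the source only states that each atom owns one gate-projection row, one up-projection row, and one down-projection column.

```python
import numpy as np

def silu(x):
    # SiLU activation; assumed here, not confirmed by the source.
    return x / (1.0 + np.exp(-x))

def expert_forward(x, w_gate, w_up, w_down):
    # x: [hidden]; w_gate, w_up: [atoms, hidden]; w_down: [hidden, atoms].
    # The output is a sum over atoms of (gated activation) * (down column).
    return w_down @ (silu(w_gate @ x) * (w_up @ x))

rng = np.random.default_rng(0)
hidden, atoms, block = 16, 8, 2   # toy sizes, not Laguna's 2,048 / 512 / 64
x = rng.standard_normal(hidden)
w_gate = rng.standard_normal((atoms, hidden))
w_up = rng.standard_normal((atoms, hidden))
w_down = rng.standard_normal((hidden, atoms))

pruned = np.arange(2 * block, 3 * block)        # atoms of the block to remove
kept = np.setdiff1d(np.arange(atoms), pruned)   # atoms that survive

# Option 1: zero the block in place, keeping the original tensor shapes.
zg, zu, zd = w_gate.copy(), w_up.copy(), w_down.copy()
zg[pruned] = 0.0
zu[pruned] = 0.0
zd[:, pruned] = 0.0
out_zeroed = expert_forward(x, zg, zu, zd)

# Option 2: physically drop the block, as a compact checkpoint would.
out_compact = expert_forward(x, w_gate[kept], w_up[kept], w_down[:, kept])

# Both options produce the same expert output.
assert np.allclose(out_zeroed, out_compact)
```

Under these assumptions, the in-place zeroing used for evaluation and a physically compact layout are numerically equivalent for a single expert; only the amount of work the kernel performs differs.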
Parameter calculation
Each atomic contribution owns three width-2,048 vectors: one gate-projection row, one
up-projection row, and one down-projection column. A removable 64-wide block therefore contains:
64 atoms × 3 projections × 2,048 parameters = 393,216 parameters
The recommended 10% mask removes 7,987 blocks:
7,987 blocks × 393,216 parameters = 3,140,616,192 removable parameters
| Quantity | Parameters |
|---|---|
| Full Laguna XS.2 model | 33,442,617,088 |
| Routed MoE parameters | 31,406,948,352 |
| Routed MoE parameters identified for removal | 3,140,616,192 |
| Projected compact model | 30,302,000,896 |
This is 9.9997% of routed MoE parameters and 9.39% of the full model. In BF16 weight
storage, the removable parameters correspond to 5.850 GiB. The currently published artifact
still uses zeroed original-shape tensors; the projected compact size requires a materialized
deployment checkpoint.
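The arithmetic in this section can be re-derived with a few lines of Python. The constants are taken from the text and table above; the variable names are illustrative only.

```python
HIDDEN = 2_048                 # width of each gate/up row and down column
ATOMS_PER_BLOCK = 64
PROJECTIONS = 3                # gate, up, down
TOTAL_BLOCKS = 39 * 256 * 8    # sparse layers x parent experts x blocks per expert
PRUNED_BLOCKS = 7_987
FULL_MODEL_PARAMS = 33_442_617_088

params_per_block = ATOMS_PER_BLOCK * PROJECTIONS * HIDDEN
routed_params = TOTAL_BLOCKS * params_per_block
removable = PRUNED_BLOCKS * params_per_block
compact = FULL_MODEL_PARAMS - removable

# BF16 stores 2 bytes per parameter; 1 GiB = 2**30 bytes.
removable_gib = removable * 2 / 2**30

print(f"parameters per block:   {params_per_block:,}")
print(f"routed MoE parameters:  {routed_params:,}")
print(f"removable parameters:   {removable:,}")
print(f"projected compact size: {compact:,}")
print(f"share of routed MoE:    {100 * removable / routed_params:.4f}%")
print(f"share of full model:    {100 * removable / FULL_MODEL_PARAMS:.2f}%")
print(f"BF16 storage removed:   {removable_gib:.3f} GiB")
```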
Method
Laguna XS.2 is a mixture-of-experts model with 39 sparse layers, 256 routed parent experts per sparse layer, and 512 atomic contributions per parent expert. An atomic expert is one independently scorable contribution inside a routed parent expert. We adapt HEAPr (Hessian-based Efficient Atomic Expert Pruning in Output Space) in five steps:
- Decompose each routed parent expert into atomic expert contributions.
- Score each atom with an output-space Hessian approximation derived from expert-output gradient covariance.
- Sort atoms by importance within each parent expert and form eight groups of 64 atoms.
- Globally prune the lowest-scoring groups while retaining at least one group in every parent expert.
- Evaluate the structurally pruned model through Laguna's original parent routing path by zeroing each pruned group's rows and columns in place.
The recommended 10% keep mask removes 7,987 of 79,872 groups. The published 25% stress-point
mask removes 19,968 groups. Its per-layer statistics are in
reports/native_group_pruning_stats.md.
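The grouping and global selection steps above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's implementation: the function name is hypothetical, a group's score is assumed to be the sum of its atoms' scores, and the minimum-retention rule is assumed to be enforced by protecting each parent expert's highest-scoring group.

```python
import numpy as np

def build_group_keep_mask(atomic_scores, group_size, num_pruned):
    """atomic_scores: [layers, experts, atoms]. Returns (group_indices, keep_mask)."""
    layers, experts, atoms = atomic_scores.shape
    if atoms % group_size != 0:
        raise ValueError("atoms must be divisible by group_size")
    groups = atoms // group_size

    # Sort atoms inside each parent expert from most to least important.
    order = np.argsort(-atomic_scores, axis=-1, kind="stable")
    group_indices = order.reshape(layers, experts, groups, group_size)
    sorted_scores = np.take_along_axis(atomic_scores, order, axis=-1)

    # Assumption: a group's score is the sum of its atoms' scores.
    group_scores = sorted_scores.reshape(layers, experts, groups, group_size).sum(-1)

    # Assumption: protect the best group of every parent expert.
    protected = np.zeros(group_scores.shape, dtype=bool)
    best = group_scores.argmax(axis=-1)
    np.put_along_axis(protected, best[..., None], True, axis=-1)
    if num_pruned > group_scores.size - layers * experts:
        raise ValueError("num_pruned would violate the minimum-retention rule")

    # Globally prune the lowest-scoring unprotected groups.
    candidates = np.where(protected.reshape(-1), np.inf, group_scores.reshape(-1))
    prune_flat = np.argsort(candidates, kind="stable")[:num_pruned]
    keep_flat = np.ones(group_scores.size, dtype=bool)
    keep_flat[prune_flat] = False
    return group_indices, keep_flat.reshape(group_scores.shape)

rng = np.random.default_rng(0)
toy_scores = rng.lognormal(size=(3, 4, 32))   # toy [layers, experts, atoms]
group_indices, keep_mask = build_group_keep_mask(toy_scores, group_size=8, num_pruned=12)
print(keep_mask.shape, int((~keep_mask).sum()))
```

With Laguna's dimensions the same call would take scores of shape [39, 256, 512], a group size of 64, and 7,987 pruned groups for the 10% point.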
Results
Recommended 10% result
| Metric | BF16 baseline | Native grouped 10% pruning | Delta |
|---|---|---|---|
| Groups pruned | 0 / 79,872 | 7,987 / 79,872 | 10.00% |
| Full-model parameters | 33.44B | projected 30.30B | -9.39% |
| BF16 weight storage | 62.292 GiB | projected 56.442 GiB | -5.850 GiB |
| Full-cache perplexity | 10.455647 | 10.629271 | +1.66% |
| Full-cache mean loss | 2.347142 | 2.363612 | +0.016469 |
| GSM8K-CoT strict EM, paired first 10% subset | 0.931818 | 0.909091 | -0.022727 |
Loss and perplexity use a held-out cache of 1,024 sequences × 4,096 tokens. MMLU uses the full 14,042-example zero-shot task suite.
Original-routing structured pruning sweep
The primary evaluation path, called native in the code and reports, keeps Laguna's original top-8 parent routing semantics and zeroes pruned blocks in place.
| Pruned groups | Mean loss | Perplexity | Perplexity delta |
|---|---|---|---|
| 0% | 2.347142 | 10.455647 | - |
| 10% | 2.363612 | 10.629271 | +1.66% |
| Random 10% control | 2.480595 | 11.948376 | +14.28% |
| 20% | 2.416442 | 11.205921 | +7.18% |
| 25% | 2.466706 | 11.783569 | +12.70% |
| Random 25% control | 2.561328 | 12.953011 | +23.89% |
| 40% | 2.683520 | 14.636528 | +39.99% |
The native sweep uses the directly comparable explicit-static-cache BF16 baseline. The detailed
report also records an earlier BF16 baseline whose perplexity differs by less than 0.04%.
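The perplexity columns in the sweep are consistent with perplexity being the exponential of the mean (natural-log) loss. The short sketch below recomputes perplexity and the relative perplexity delta from the rounded losses in the table; because the inputs are rounded to six decimals, the last digits can differ slightly from the reported values. The function names are illustrative.

```python
import math

BASELINE_LOSS = 2.347142

# Mean losses copied from the native sweep table.
sweep = {
    "10%": 2.363612,
    "Random 10% control": 2.480595,
    "20%": 2.416442,
    "25%": 2.466706,
    "Random 25% control": 2.561328,
    "40%": 2.683520,
}

def perplexity(mean_loss):
    # Perplexity is exp(mean cross-entropy loss in nats).
    return math.exp(mean_loss)

def perplexity_delta_pct(mean_loss, baseline_loss):
    # Relative perplexity change depends only on the loss difference.
    return 100.0 * (math.exp(mean_loss - baseline_loss) - 1.0)

print(f"baseline: ppl={perplexity(BASELINE_LOSS):.4f}")
for name, loss in sweep.items():
    delta = perplexity_delta_pct(loss, BASELINE_LOSS)
    print(f"{name}: ppl={perplexity(loss):.4f}, delta={delta:+.2f}%")
```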
What we tried: repacked routing
We also tested repacked routing: treating each retained 64-wide block as an independently routed mini-expert and selecting a fixed top-64 blocks per token. This is useful as an exploratory evaluator, but it changes the post-pruning behavior. Selected blocks can come from parent experts that the original top-8 router would not have selected. It therefore answers a different question and performs worse than the original-routing evaluation path.
| Repacked child groups pruned | Mean loss | Perplexity | Perplexity delta |
|---|---|---|---|
| 10% | 2.404768 | 11.075862 | +5.93% |
| 20% | 2.517318 | 12.395307 | +18.55% |
| 40% | 2.928546 | 18.700425 | +78.85% |
The comparison motivates a deployment kernel that preserves original parent routing while skipping pruned child-group work.
Insights
Importance ranking improves the pruning tradeoff
Importance-ranked pruning consistently beats the measured random controls. At the recommended 10%
point, perplexity rises by only 1.66%, compared with 14.28% for random pruning. The ranked curve
is also nonlinear: moving from 20% to 40% raises its perplexity delta from 7.18% to 39.99%.
Together with the GSM8K curve, this supports 10% as the initial operating point.
Later layers have lower-importance atomic experts
Later sparse layers contain lower-importance atomic experts under this calibration set and HEAPr-style
score. The median atomic importance in the final third of sparse layers is about 1.35 orders of
magnitude below the median in the first third. This helps explain why global importance pruning
selects more groups from later layers. It is an empirical score pattern, not a claim that later layers
are universally less important for every task.
The y-axis is log10(importance score): negative tick labels denote small positive scores below 1,
not negative importance.
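One way to compute a first-third versus final-third gap of this kind is sketched below. It assumes the statistic pools every atomic score within each third of the sparse layers before taking the median, which may differ from how the reported 1.35 figure was derived; the function name and the synthetic scores are illustrative only.

```python
import numpy as np

def median_log10_gap(atomic_scores):
    """log10(median score in first third of layers) minus the same for the final third."""
    layers = atomic_scores.shape[0]
    third = layers // 3
    first = np.median(atomic_scores[:third])
    last = np.median(atomic_scores[layers - third:])
    return float(np.log10(first) - np.log10(last))

# Synthetic scores that shrink by 0.1 orders of magnitude per layer.
layer_scale = 10.0 ** (-np.arange(39) / 10.0)
toy_scores = np.broadcast_to(layer_scale[:, None, None], (39, 4, 16)).copy()
gap = median_log10_gap(toy_scores)
print(f"gap in orders of magnitude: {gap:.2f}")
```

With the published scores, the input would be the array stored in artifacts/atomic_scores.npy, loaded with `np.load`.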
Preserving routing structure matters
The repacked experiment is informative even though it performs worse. Independently selecting retained blocks changes which parent experts contribute after pruning. Preserving Laguna's original router decisions is therefore part of the deployment target, which motivates the sentinel-group kernel rather than a simple repack.
Importance ranking beats random pruning
At the recommended 10% point, the same-size random control reaches 11.948376 perplexity
(+14.28%), compared with 10.629271 (+1.66%) for importance-ranked pruning. The 25% control
shows the same pattern: random pruning reaches 12.953011 (+23.89%), compared with 11.783569
(+12.70%) for the importance-ranked mask. Both random masks use the same minimum-retention rule,
so the gaps are evidence that the HEAPr-style score selects meaningfully less important blocks.
Benchmark sensitivity still matters
The 25% mask moves full zero-shot MMLU only modestly, but the paired first-10% GSM8K-CoT subset is more sensitive:
| Benchmark | BF16 baseline | Structured-pruning variant | Delta |
|---|---|---|---|
| Full MMLU 0-shot accuracy, 10% pruning | 0.733514 | 0.735437 | +0.001923 |
| Full MMLU 0-shot accuracy, 20% pruning | 0.733514 | 0.734012 | +0.000499 |
| Full MMLU 0-shot accuracy, 25% pruning | 0.733514 | 0.725965 | -0.007549 |
| MMLU-STEM 5-shot accuracy, 10% pruning | 0.690771 | 0.693942 | +0.003172 |
| MMLU-STEM 5-shot accuracy, 20% pruning | 0.690771 | 0.679036 | -0.011735 |
| MMLU-STEM 5-shot accuracy, 25% pruning | 0.690771 | 0.666350 | -0.024421 |
| GSM8K-CoT 8-shot strict EM, first 10% subset, 10% pruning | 0.931818 | 0.909091 | -0.022727 |
| GSM8K-CoT 8-shot flexible EM, first 10% subset, 10% pruning | 0.886364 | 0.863636 | -0.022727 |
| GSM8K-CoT 8-shot strict EM, first 10% subset, 20% pruning | 0.931818 | 0.780303 | -0.151515 |
| GSM8K-CoT 8-shot flexible EM, first 10% subset, 20% pruning | 0.886364 | 0.742424 | -0.143939 |
| GSM8K-CoT 8-shot strict EM, first 10% subset, 25% pruning | 0.931818 | 0.681818 | -0.250000 |
| GSM8K-CoT 8-shot flexible EM, first 10% subset, 25% pruning | 0.886364 | 0.689394 | -0.196970 |
The GSM8K result is a paired 132-example subset, not a full benchmark run. It is still a useful warning: perplexity and one downstream suite are not sufficient to characterize a pruning point. The completed 10%, 20%, and 25% curve shows increasingly steep reasoning-quality degradation as pruning rises: a modest drop at 10%, followed by much larger losses at 20% and 25%.
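To make the subset size concrete, the sketch below converts the strict-EM accuracies into correct-answer counts out of 132 and computes a simple binomial standard error for each. The standard error ignores the pairing between runs, so it is only a rough, unpaired approximation; the helper names are illustrative.

```python
import math

N = 132  # size of the paired first-10% GSM8K-CoT subset

# Strict exact-match accuracies copied from the table above.
strict_em = {"baseline": 0.931818, "10%": 0.909091, "20%": 0.780303, "25%": 0.681818}

def correct_count(accuracy, n=N):
    # Number of correct answers implied by an accuracy on n examples.
    return round(accuracy * n)

def standard_error(accuracy, n=N):
    # Binomial standard error of a single proportion (ignores the pairing).
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

for name, acc in strict_em.items():
    print(f"{name}: {correct_count(acc)}/{N} correct, SE ~ {standard_error(acc):.3f}")
```

Under this approximation, the 10% point differs from the baseline by three answers (123 versus 120 correct), a gap comparable to one standard error, while the 20% and 25% drops are far larger.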
CRUXEval harness validation
The CRUXEval-O CoT executable-grading path was smoke-tested successfully, but the paired coding
subset was stopped after 84 / 400 baseline generations because the time box ran out. The
one-function smoke test recorded a pass@1 of 0.0; it validates the local generation and
executable-grader path, not coding quality.
Future work: sentinel-group kernel
The next implementation step is a modified grouped MoE kernel. For each selected parent expert, its pruned block indices should point to sentinel groups instead of materialized expert blocks. The kernel can then avoid loading and computing removed blocks while preserving the original parent router decisions.
The structured mask identifies the removable MoE parameters. Realized checkpoint-size, memory, and runtime improvements remain to be measured after the deployment checkpoint and kernel exist.
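One possible shape for the sentinel indirection is sketched below: a per-layer table that maps each [expert, group] slot either to a position in compact block storage or to a sentinel value. This layout, the `SENTINEL` value, and the function name are hypothetical and are not taken from the planned kernel.

```python
import numpy as np

SENTINEL = -1  # hypothetical marker meaning "no materialized block"

def build_block_table(keep_mask):
    """Map each [layer, expert, group] slot to a compact per-layer index or SENTINEL."""
    layers = keep_mask.shape[0]
    flat = keep_mask.reshape(layers, -1)
    # A running count of kept blocks gives each kept block its compact position.
    compact = np.cumsum(flat, axis=1) - 1
    table = np.where(flat, compact, SENTINEL)
    return table.reshape(keep_mask.shape)

# Toy mask: 1 layer, 2 parent experts, 3 groups per expert.
toy_mask = np.array([[[True, False, True], [False, True, True]]])
table = build_block_table(toy_mask)
print(table.tolist())
```

A kernel following this idea would read the table for each routed parent expert and skip any slot holding the sentinel, leaving the parent routing decision itself untouched.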
Reproducibility artifacts
| File | Shape | Dtype | Description |
|---|---|---|---|
| artifacts/atomic_scores.npy | [39, 256, 512] | float32 | HEAPr-style atomic importance scores |
| artifacts/group_scores_importance_sorted.npy | [39, 256, 8] | float32 | Scores for importance-sorted 64-wide groups |
| artifacts/group_indices_importance_sorted.npy | [39, 256, 8, 64] | int64 | Atomic indices assigned to each group |
| artifacts/group_keep_mask_10pct.npy | [39, 256, 8] | bool | Recommended 10% pruning mask |
| artifacts/group_keep_mask_25pct.npy | [39, 256, 8] | bool | Exploratory 25% stress-point mask |
| results/summary.json | - | JSON | Compact machine-readable results |
The source snapshot includes the scoring, pruning, evaluation, grouped reference runtime, and tests.
This Hub repository is an experimental artifact and code package, not a directly loadable pruned
Transformers checkpoint. It intentionally does not include full model weights. Load the base model from
poolside/Laguna-XS.2.
Reproduce
Create the environment:
uv sync --extra dev --extra eval
Evaluate the recommended 10% pruning point in memory against a prepared 4k calibration cache:
uv run python scripts/eval_pruned_loss.py \
--model-id poolside/Laguna-XS.2 \
--cache-path artifacts/data/calibration/<cache_id>/chunks.npy \
--scores-path artifacts/group_scores_importance_sorted.npy \
--group-indices artifacts/group_indices_importance_sorted.npy \
--output-dir artifacts/runs/native_group_loss \
--mode group \
--ratios 0.10 \
--batch-size 4 \
--gpu-memory-per-device 46GiB
The published keep mask can also be loaded directly and applied with
heapr.prune.apply_group_mask_to_model.
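Before applying a keep mask, it can be useful to sanity-check it against the numbers reported above. The sketch below is one way to do that; it assumes `True` means "keep", uses a synthetic stand-in with the published shape, and the helper name is illustrative rather than part of the `heapr` package.

```python
import numpy as np

def summarize_keep_mask(keep_mask):
    """Summarize a [layers, experts, groups] boolean keep mask (True = kept, assumed)."""
    pruned = ~keep_mask
    return {
        "total_groups": int(keep_mask.size),
        "pruned_groups": int(pruned.sum()),
        "pruned_per_layer": pruned.sum(axis=(1, 2)).tolist(),
        "every_expert_retains_a_group": bool(keep_mask.any(axis=-1).all()),
    }

# Synthetic stand-in with the published shape [39, 256, 8] and 7,987 pruned groups.
rng = np.random.default_rng(0)
demo_mask = np.ones((39, 256, 8), dtype=bool)
picks = rng.choice(39 * 256 * 7, size=7_987, replace=False)
layer, expert, group = np.unravel_index(picks, (39, 256, 7))
demo_mask[layer, expert, group + 1] = False   # group 0 is never pruned in this demo

summary = summarize_keep_mask(demo_mask)
print(summary["pruned_groups"], "/", summary["total_groups"])
```

With the published file, the stand-in would be replaced by `np.load("artifacts/group_keep_mask_10pct.npy")`.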
Run the local unit suite:
uv run pytest -q
Regenerate the insight plots from the published summary and keep mask:
uv run --extra dev python scripts/plot_pruning_insights.py
Status
- Completed: native grouped loss sweep at 10%, 20%, 25%, and 40%.
- Completed: full baseline and native-grouped 10%, 20%, and 25% MMLU.
- Completed: exploratory repacked child-routing loss sweep.
- Completed: paired first-10% GSM8K-CoT baseline and native-grouped 10%, 20%, and 25% curve.
- Stopped after smoke validation: paired CRUXEval-O CoT comparison, due to the remaining time box.
- Completed: local MMLU-STEM 5-shot baseline and native-grouped 10%, 20%, and 25% comparisons.
- Completed: random native-grouped 10% and 25% full-cache loss controls.
- Pending: additional downstream evaluation.
- Future work: sentinel-group grouped MoE kernel and measured inference benchmarks.
License
Apache 2.0, inheriting poolside/Laguna-XS.2.