
Laguna Martini - structured MoE pruning for Laguna XS.2

10% leaner, served straight up.

This is a provisional research submission for the Poolside Research Hackathon. Laguna Martini identifies 10% of Laguna XS.2's routed MoE parameters for removal while preserving a regular structure designed for efficient deployment kernels. That is 3.14B removable parameters, or 9.39% of the full 33.44B-parameter model.

Results at a glance

| Native grouped pruning | Full-cache loss | Full MMLU 0-shot accuracy | MMLU-STEM 5-shot accuracy | GSM8K-CoT 8-shot strict EM |
| --- | --- | --- | --- | --- |
| BF16 baseline | 2.347142 | 0.733514 | 0.690771 | 0.931818 |
| 10% pruning | 2.363612 | 0.735437 | 0.693942 | 0.909091 |
| 20% pruning | 2.416442 | 0.734012 | 0.679036 | 0.780303 |

Loss is computed on the full held-out 4k cache. GSM8K-CoT uses the paired first-10% subset (132 examples). Full MMLU 0-shot remains effectively flat at 10% and 20%. Native-grouped MMLU-STEM 5-shot changes from 0.690771 to 0.693942 (+0.003172) at 10%, then to 0.679036 (-0.011735) at 20%. We treat the small positive deltas as retained quality, not as evidence that pruning improves the model. All rows report native grouped pruning; the separate atomic-pruning MMLU sweep is not substituted for any of them.

The current artifact measures pruning quality, not deployment speed. It applies the structured pruning mask by zeroing blocks inside the original tensors, so the released Laguna kernel still computes those blocks. A modified grouped MoE kernel and physically compact deployment checkpoint are future work.

One-line claim

Using HEAPr-style importance scores, we identify 10% of Laguna XS.2's routed MoE parameters for structured removal: 3,140,616,192 parameters, or 9.39% of the full model. Full-cache perplexity moves from 10.455647 to 10.629271 (+1.66%), while strict GSM8K-CoT exact match on a paired first-10% subset moves from 0.931818 to 0.909091.

What structured pruning means here

Laguna routes tokens to parent experts. Inside each parent expert, the computation can be decomposed into smaller atomic contributions. We sort those contributions by importance and bundle them into regular 64-wide blocks. The recommended 10% pruning mask removes 7,987 / 79,872 of those blocks across 39 sparse layers while retaining at least one block in every parent expert.

This grouping keeps the pruned layout structured enough for a future deployment kernel to skip removed blocks. It is different from deleting routed parent experts outright.
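The decomposition into atomic contributions can be illustrated with a small NumPy sketch. It assumes a gated MLP with SiLU activation, which the text above does not state, and uses toy dimensions; the function names are hypothetical and are not taken from the Laguna or HEAPr code. It checks that the per-atom outputs sum to the parent expert output, and that zeroing one 64-wide block in place removes exactly that block's contributions.

```python
import numpy as np

def silu(z):
    # Assumption: SiLU gating. The source does not name the activation.
    return z / (1.0 + np.exp(-z))

def expert_forward(x, w_gate, w_up, w_down):
    """Gated MLP for one parent expert.

    w_gate, w_up: [atoms, hidden] (one row per atom)
    w_down:       [hidden, atoms] (one column per atom)
    """
    return w_down @ (silu(w_gate @ x) * (w_up @ x))

def atomic_contributions(x, w_gate, w_up, w_down):
    """Return [atoms, hidden]: the output vector contributed by each atom."""
    act = silu(w_gate @ x) * (w_up @ x)   # one scalar per atom
    return act[:, None] * w_down.T        # scale each down-projection column

rng = np.random.default_rng(0)
hidden, atoms, block = 16, 128, 64        # toy sizes; only the block width matches the text
x = rng.standard_normal(hidden)
w_gate = rng.standard_normal((atoms, hidden))
w_up = rng.standard_normal((atoms, hidden))
w_down = rng.standard_normal((hidden, atoms))

full = expert_forward(x, w_gate, w_up, w_down)
contrib = atomic_contributions(x, w_gate, w_up, w_down)
contributions_sum_to_output = bool(np.allclose(contrib.sum(axis=0), full))

# Zero one block's gate rows, up rows, and down columns in place.
pruned = np.arange(block, 2 * block)
g, u, d = w_gate.copy(), w_up.copy(), w_down.copy()
g[pruned] = 0.0
u[pruned] = 0.0
d[:, pruned] = 0.0
zeroed = expert_forward(x, g, u, d)

# The zeroed expert equals the full output minus the pruned atoms' contributions.
zeroing_equals_dropping = bool(
    np.allclose(zeroed, full - contrib[pruned].sum(axis=0))
)
print(contributions_sum_to_output, zeroing_equals_dropping)
```

Because each atom contributes additively, a block can be removed without touching the router or the other blocks of the same parent expert.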

Parameter calculation

Each atomic contribution owns three width-2,048 vectors: one gate-projection row, one up-projection row, and one down-projection column. A removable 64-wide block therefore contains:

64 atoms x 3 projections x 2,048 parameters = 393,216 parameters

The recommended 10% mask removes 7,987 blocks:

7,987 blocks x 393,216 parameters = 3,140,616,192 removable parameters

| Quantity | Parameters |
| --- | --- |
| Full Laguna XS.2 model | 33,442,617,088 |
| Routed MoE parameters | 31,406,948,352 |
| Routed MoE parameters identified for removal | 3,140,616,192 |
| Projected compact model | 30,302,000,896 |

This is 9.9997% of routed MoE parameters and 9.39% of the full model. In BF16 weight storage, the removable parameters correspond to 5.850 GiB. The currently published artifact still uses zeroed original-shape tensors; the projected compact size requires a materialized deployment checkpoint.
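The arithmetic in this section can be re-derived with a few lines of Python. This is only a consistency check on the figures quoted in this card; the constant names are illustrative, and the layer and expert counts are the ones given in the Method section.

```python
ATOM_WIDTH = 2_048            # parameters per projection vector
ATOMS_PER_BLOCK = 64
PROJECTIONS = 3               # gate row, up row, down column
SPARSE_LAYERS = 39
EXPERTS_PER_LAYER = 256
BLOCKS_PER_EXPERT = 8
FULL_MODEL_PARAMS = 33_442_617_088
PRUNED_BLOCKS = 7_987
BYTES_PER_BF16 = 2

params_per_block = ATOMS_PER_BLOCK * PROJECTIONS * ATOM_WIDTH
total_blocks = SPARSE_LAYERS * EXPERTS_PER_LAYER * BLOCKS_PER_EXPERT
routed_params = total_blocks * params_per_block
removable_params = PRUNED_BLOCKS * params_per_block
compact_params = FULL_MODEL_PARAMS - removable_params

routed_fraction = removable_params / routed_params
full_fraction = removable_params / FULL_MODEL_PARAMS
removable_gib = removable_params * BYTES_PER_BF16 / 2**30

print(f"{params_per_block:,} parameters per block")
print(f"{removable_params:,} removable parameters")
print(f"{routed_fraction:.4%} of routed MoE, {full_fraction:.2%} of the full model")
print(f"{removable_gib:.3f} GiB in BF16")
```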

Method

Laguna XS.2 is a mixture-of-experts model with 39 sparse layers, 256 routed parent experts per sparse layer, and 512 atomic contributions per parent expert. An atomic expert is one independently scorable contribution inside a routed parent expert. We adapt HEAPr (Hessian-based Efficient Atomic Expert Pruning in Output Space) as follows:

  1. Decompose each routed parent expert into atomic expert contributions.
  2. Score each atom with an output-space Hessian approximation derived from expert-output gradient covariance.
  3. Sort atoms by importance within each parent expert and form eight groups of 64 atoms.
  4. Globally prune the lowest-scoring groups while retaining at least one group in every parent expert.
  5. Evaluate the structurally pruned model through Laguna's original parent routing path by zeroing each pruned group's rows and columns in place.

The recommended 10% keep mask removes 7,987 of 79,872 groups. The published 25% stress-point mask removes 19,968 groups. Its per-layer statistics are in reports/native_group_pruning_stats.md.
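Steps 3 and 4 can be sketched in NumPy as follows. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, summing atom scores to obtain a group score is an assumption, and so is rounding the prune count down. The retention rule is implemented here by protecting each parent expert's highest-scoring group from pruning.

```python
import numpy as np

def build_groups(atomic_scores, group_size=64):
    """Sort atoms by descending importance within each expert and chunk them.

    atomic_scores: [layers, experts, atoms]
    Returns group_indices [layers, experts, groups, group_size] and
    group_scores [layers, experts, groups].
    """
    layers, experts, atoms = atomic_scores.shape
    order = np.argsort(-atomic_scores, axis=-1, kind="stable")
    group_indices = order.reshape(layers, experts, atoms // group_size, group_size)
    sorted_scores = np.take_along_axis(atomic_scores, order, axis=-1)
    # Assumption: a group's score is the sum of its atoms' scores.
    group_scores = sorted_scores.reshape(layers, experts, -1, group_size).sum(axis=-1)
    return group_indices, group_scores

def global_keep_mask(group_scores, prune_ratio):
    """Prune the globally lowest-scoring groups, keeping >= 1 group per expert."""
    protected = group_scores.argmax(axis=-1)            # best group of each expert
    n_prune = min(int(group_scores.size * prune_ratio),
                  group_scores.size - protected.size)
    candidate = group_scores.astype(np.float64)         # astype returns a copy
    np.put_along_axis(candidate, protected[..., None], np.inf, axis=-1)
    flat_order = np.argsort(candidate, axis=None, kind="stable")
    keep = np.ones(group_scores.size, dtype=bool)
    keep[flat_order[:n_prune]] = False
    return keep.reshape(group_scores.shape)

# Toy example: 3 layers x 4 experts x 512 atoms of synthetic scores.
rng = np.random.default_rng(0)
toy_scores = rng.lognormal(size=(3, 4, 512))
group_indices, group_scores = build_groups(toy_scores)
keep_mask = global_keep_mask(group_scores, prune_ratio=0.25)
print(group_indices.shape, int((~keep_mask).sum()), bool(keep_mask.any(axis=-1).all()))
```

With the real shapes, [39, 256, 512] scores yield a [39, 256, 8] mask, matching the artifact shapes listed under Reproducibility artifacts.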

Results

Recommended 10% result

| Metric | BF16 baseline | Native grouped 10% pruning | Delta |
| --- | --- | --- | --- |
| Groups pruned | 0 / 79,872 | 7,987 / 79,872 | 10.00% |
| Full-model parameters | 33.44B | projected 30.30B | -9.39% |
| BF16 weight storage | 62.292 GiB | projected 56.442 GiB | -5.850 GiB |
| Full-cache perplexity | 10.455647 | 10.629271 | +1.66% |
| Full-cache mean loss | 2.347142 | 2.363612 | +0.016469 |
| GSM8K-CoT strict EM, paired first 10% subset | 0.931818 | 0.909091 | -0.022727 |

Loss and perplexity use a held-out cache of 1,024 sequences x 4,096 tokens. MMLU uses the full 14,042-example zero-shot task suite.

Original-routing structured pruning sweep

The primary evaluation path, called native in the code and reports, keeps Laguna's original top-8 parent routing semantics and zeroes pruned blocks in place.

| Pruned groups | Mean loss | Perplexity | Perplexity delta |
| --- | --- | --- | --- |
| 0% | 2.347142 | 10.455647 | - |
| 10% | 2.363612 | 10.629271 | +1.66% |
| Random 10% control | 2.480595 | 11.948376 | +14.28% |
| 20% | 2.416442 | 11.205921 | +7.18% |
| 25% | 2.466706 | 11.783569 | +12.70% |
| Random 25% control | 2.561328 | 12.953011 | +23.89% |
| 40% | 2.683520 | 14.636528 | +39.99% |

The native sweep uses the directly comparable explicit-static-cache BF16 baseline. The detailed report also records an earlier BF16 baseline whose perplexity differs by less than 0.04%.
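The perplexity and delta columns follow from the mean loss column if the loss is a natural-log cross-entropy per token. That is an assumption, but it is consistent with the tabulated values. A short sketch with hypothetical helper names:

```python
import math

BASELINE_LOSS = 2.347142  # explicit-static-cache BF16 baseline mean loss

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean per-token loss (in nats)."""
    return math.exp(mean_loss)

def ppl_delta_pct(mean_loss, baseline_loss=BASELINE_LOSS):
    """Relative perplexity change versus the baseline, in percent."""
    return 100.0 * (math.exp(mean_loss - baseline_loss) - 1.0)

native_sweep = {
    "10%": 2.363612,
    "Random 10% control": 2.480595,
    "20%": 2.416442,
    "25%": 2.466706,
    "Random 25% control": 2.561328,
    "40%": 2.683520,
}

for name, loss in native_sweep.items():
    print(f"{name:>20}: ppl={perplexity(loss):.4f} delta={ppl_delta_pct(loss):+.2f}%")
```

Because the tabulated losses are rounded to six decimals, recomputed perplexities can differ from the table in the last digit or two.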

What we tried: repacked routing

We also tested repacked routing: treating each retained 64-wide block as an independently routed mini-expert and selecting a fixed top-64 blocks per token. This is useful as an exploratory evaluator, but it changes the post-pruning behavior. Selected blocks can come from parent experts that the original top-8 router would not have selected. It therefore answers a different question and performs worse than the original-routing evaluation path.

| Repacked child groups pruned | Mean loss | Perplexity | Perplexity delta |
| --- | --- | --- | --- |
| 10% | 2.404768 | 11.075862 | +5.93% |
| 20% | 2.517318 | 12.395307 | +18.55% |
| 40% | 2.928546 | 18.700425 | +78.85% |

The comparison motivates a deployment kernel that preserves original parent routing while skipping pruned child-group work.

Insights

Importance ranking improves the pruning tradeoff

Figure: Line chart showing lower perplexity degradation for importance-ranked structured MoE pruning than random pruning.

Importance-ranked pruning consistently beats the measured random controls. At the recommended 10% point, perplexity rises by only 1.66%, compared with 14.28% for random pruning. The ranked curve is also nonlinear: moving from 20% to 40% raises its perplexity delta from 7.18% to 39.99%. Together with the GSM8K curve, this supports 10% as the initial operating point.

Later layers have lower-importance atomic experts

Figure: Box plots showing atomic importance score distributions across sparse layers.

Later sparse layers contain lower-importance atomic experts under this calibration set and HEAPr-style score. The median atomic importance in the final third of sparse layers is about 1.35 orders of magnitude below the median in the first third. This helps explain why global importance pruning selects more groups from later layers. It is an empirical score pattern, not a claim that later layers are universally less important for every task.

The y-axis is log10(importance score): negative tick labels denote small positive scores below 1, not negative importance.
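One way to compute an "orders of magnitude" gap like the one quoted above is to compare pooled medians of the first and final thirds of the sparse layers on a log10 scale. The sketch below is an assumption about the procedure, since the card does not say whether medians are pooled or taken per layer first; the function name is hypothetical and the demo uses synthetic scores rather than the published ones.

```python
import numpy as np

def early_vs_late_gap(atomic_scores):
    """log10(median of first third) - log10(median of final third).

    atomic_scores: [layers, experts, atoms], strictly positive.
    Assumption: medians are pooled over all atoms in each third of the layers.
    """
    third = atomic_scores.shape[0] // 3
    early = np.median(atomic_scores[:third])
    late = np.median(atomic_scores[-third:])
    return float(np.log10(early) - np.log10(late))

# Synthetic check: early layers score 100x higher, so the gap is 2 orders.
synthetic = np.ones((39, 4, 8))
synthetic[:13] = 100.0
gap = early_vs_late_gap(synthetic)
print(gap)
```

Applied to the [39, 256, 512] array in artifacts/atomic_scores.npy, each third covers 13 sparse layers.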

Preserving routing structure matters

The repacked experiment is informative even though it performs worse. Independently selecting retained blocks changes which parent experts contribute after pruning. Preserving Laguna's original router decisions is therefore part of the deployment target, which motivates the sentinel-group kernel described under Future work rather than a simple repack.

Importance ranking beats random pruning

At the recommended 10% point, the same-size random control reaches 11.948376 perplexity (+14.28%), compared with 10.629271 (+1.66%) for importance-ranked pruning. The 25% control shows the same pattern: random pruning reaches 12.953011 (+23.89%), compared with 11.783569 (+12.70%) for the importance-ranked mask. Both random masks use the same minimum-retention rule, so the gaps are evidence that the HEAPr-style score selects meaningfully less important blocks.

Benchmark sensitivity still matters

The 25% mask moves full zero-shot MMLU only modestly, but the paired first-10% GSM8K-CoT subset is more sensitive:

Figure: Line chart showing increasingly lower GSM8K-CoT exact match as structured MoE pruning increases.

| Benchmark | BF16 baseline | Structured-pruning variant | Delta |
| --- | --- | --- | --- |
| Full MMLU 0-shot accuracy, 10% pruning | 0.733514 | 0.735437 | +0.001923 |
| Full MMLU 0-shot accuracy, 20% pruning | 0.733514 | 0.734012 | +0.000499 |
| Full MMLU 0-shot accuracy, 25% pruning | 0.733514 | 0.725965 | -0.007549 |
| MMLU-STEM 5-shot accuracy, 10% pruning | 0.690771 | 0.693942 | +0.003172 |
| MMLU-STEM 5-shot accuracy, 20% pruning | 0.690771 | 0.679036 | -0.011735 |
| MMLU-STEM 5-shot accuracy, 25% pruning | 0.690771 | 0.666350 | -0.024421 |
| GSM8K-CoT 8-shot strict EM, first 10% subset, 10% pruning | 0.931818 | 0.909091 | -0.022727 |
| GSM8K-CoT 8-shot flexible EM, first 10% subset, 10% pruning | 0.886364 | 0.863636 | -0.022727 |
| GSM8K-CoT 8-shot strict EM, first 10% subset, 20% pruning | 0.931818 | 0.780303 | -0.151515 |
| GSM8K-CoT 8-shot flexible EM, first 10% subset, 20% pruning | 0.886364 | 0.742424 | -0.143939 |
| GSM8K-CoT 8-shot strict EM, first 10% subset, 25% pruning | 0.931818 | 0.681818 | -0.250000 |
| GSM8K-CoT 8-shot flexible EM, first 10% subset, 25% pruning | 0.886364 | 0.689394 | -0.196970 |

The GSM8K result is a paired 132-example subset, not a full benchmark run. It is still a useful warning: perplexity and one downstream suite are not sufficient to characterize a pruning point. The completed 10%, 20%, and 25% curve shows increasingly steep reasoning-quality degradation as pruning rises: a modest drop at 10%, followed by much larger losses at 20% and 25%.
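To make the subset size concrete, the strict-EM accuracies can be converted back into counts out of 132. The sketch below also computes a plain binomial standard error for the baseline as a rough sense of scale. That is an unpaired approximation chosen for illustration; it ignores the pairing between runs and is not an analysis performed in this card.

```python
import math

SUBSET_SIZE = 132  # paired first-10% GSM8K-CoT subset

strict_em = {
    "baseline": 0.931818,
    "10%": 0.909091,
    "20%": 0.780303,
    "25%": 0.681818,
}

def correct_count(accuracy, n=SUBSET_SIZE):
    """Recover the number of correct answers from a rounded accuracy."""
    return round(accuracy * n)

def binomial_se(accuracy, n=SUBSET_SIZE):
    """Unpaired binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

counts = {name: correct_count(acc) for name, acc in strict_em.items()}
lost_vs_baseline = {name: counts["baseline"] - c for name, c in counts.items()}
baseline_se = binomial_se(strict_em["baseline"])

print(counts)
print(lost_vs_baseline)
print(f"baseline standard error ~ {baseline_se:.4f}")
```

On this subset, the 10% point differs from the baseline by three examples, which is why a full benchmark run would be needed to pin down the size of the drop.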

CRUXEval harness validation

The CRUXEval-O CoT executable-grading path was smoke-tested successfully, but the paired coding subset was stopped after 84 / 400 baseline generations because too little of the time box remained. The one-function smoke test recorded a pass@1 of 0.0; it validates the local generation and executable-grader path, not coding quality.

Future work: sentinel-group kernel

The next implementation step is a modified grouped MoE kernel. For each selected parent expert, its pruned block indices should point to sentinel groups instead of materialized expert blocks. The kernel can then avoid loading and computing removed blocks while preserving the original parent router decisions.
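The indexing idea can be sketched as a NumPy reference, well short of an actual kernel. Everything here is hypothetical: the sentinel value of -1, the block-table layout, the function names, and the SiLU activation are assumptions made for illustration. The sketch stores only kept blocks, maps each of an expert's eight block slots to a compact storage index or the sentinel, and checks that skipping sentinel slots reproduces the zeroed-in-place result.

```python
import numpy as np

SENTINEL = -1  # hypothetical marker for a pruned block slot

def silu(z):
    # Assumption: SiLU gating. The source does not name the activation.
    return z / (1.0 + np.exp(-z))

def build_block_table(keep):
    """Map each block slot to its index in compact storage, or SENTINEL."""
    table = np.full(keep.shape, SENTINEL, dtype=np.int64)
    table[keep] = np.arange(int(keep.sum()))
    return table

def forward_compact(x, table, gate_blocks, up_blocks, down_blocks):
    """Compute one expert's output while skipping sentinel slots.

    gate_blocks, up_blocks: [kept, block, hidden]; down_blocks: [kept, hidden, block]
    """
    out = np.zeros_like(x)
    for slot in table:
        if slot == SENTINEL:
            continue  # pruned block: nothing is loaded or computed
        act = silu(gate_blocks[slot] @ x) * (up_blocks[slot] @ x)
        out += down_blocks[slot] @ act
    return out

rng = np.random.default_rng(0)
hidden, block, n_blocks = 16, 4, 8   # toy sizes; the real block width is 64
x = rng.standard_normal(hidden)
w_gate = rng.standard_normal((n_blocks, block, hidden))
w_up = rng.standard_normal((n_blocks, block, hidden))
w_down = rng.standard_normal((n_blocks, hidden, block))
keep = np.array([True, False, True, True, False, True, True, True])

# Reference: zero pruned blocks in place and compute every block.
m = keep[:, None, None]
zeroed_out = sum(
    (w_down * m)[b] @ (silu((w_gate * m)[b] @ x) * ((w_up * m)[b] @ x))
    for b in range(n_blocks)
)

table = build_block_table(keep)
compact_out = forward_compact(x, table, w_gate[keep], w_up[keep], w_down[keep])
print(table.tolist(), bool(np.allclose(compact_out, zeroed_out)))
```

The router is not involved in this sketch: parent expert selection happens upstream and is unchanged, which is the property the native evaluation path preserves.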

The structured mask identifies the removable MoE parameters. Realized checkpoint-size, memory, and runtime improvements remain to be measured after the deployment checkpoint and kernel exist.

Reproducibility artifacts

| File | Shape | Dtype | Description |
| --- | --- | --- | --- |
| artifacts/atomic_scores.npy | [39, 256, 512] | float32 | HEAPr-style atomic importance scores |
| artifacts/group_scores_importance_sorted.npy | [39, 256, 8] | float32 | Scores for importance-sorted 64-wide groups |
| artifacts/group_indices_importance_sorted.npy | [39, 256, 8, 64] | int64 | Atomic indices assigned to each group |
| artifacts/group_keep_mask_10pct.npy | [39, 256, 8] | bool | Recommended 10% pruning mask |
| artifacts/group_keep_mask_25pct.npy | [39, 256, 8] | bool | Exploratory 25% stress-point mask |
| results/summary.json | - | JSON | Compact machine-readable results |

The source snapshot includes the scoring, pruning, evaluation, grouped reference runtime, and tests. This Hub repository is an experimental artifact and code package, not a directly loadable pruned Transformers checkpoint. It intentionally does not include full model weights. Load the base model from poolside/Laguna-XS.2.

Reproduce

Create the environment:

uv sync --extra dev --extra eval

Evaluate the recommended 10% pruning point in memory against a prepared 4k calibration cache:

uv run python scripts/eval_pruned_loss.py \
  --model-id poolside/Laguna-XS.2 \
  --cache-path artifacts/data/calibration/<cache_id>/chunks.npy \
  --scores-path artifacts/group_scores_importance_sorted.npy \
  --group-indices artifacts/group_indices_importance_sorted.npy \
  --output-dir artifacts/runs/native_group_loss \
  --mode group \
  --ratios 0.10 \
  --batch-size 4 \
  --gpu-memory-per-device 46GiB

The published keep mask can also be loaded directly and applied with heapr.prune.apply_group_mask_to_model.
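Before applying a mask, it can be inspected with plain NumPy. The sketch below assumes that True means "keep", as the file name suggests, and that the file is present at the path listed under Reproducibility artifacts. It does not call heapr.prune.apply_group_mask_to_model, whose signature is not documented in this card; the helper name is hypothetical.

```python
import os
import numpy as np

MASK_PATH = "artifacts/group_keep_mask_10pct.npy"

def summarize_keep_mask(keep_mask):
    """Summarize a [layers, experts, groups] boolean keep mask."""
    keep_mask = np.asarray(keep_mask, dtype=bool)
    pruned = ~keep_mask
    return {
        "total_groups": int(keep_mask.size),
        "pruned_groups": int(pruned.sum()),
        "min_kept_per_expert": int(keep_mask.sum(axis=-1).min()),
        "pruned_per_layer": pruned.sum(axis=(1, 2)).tolist(),
    }

# Only runs when the artifact has been downloaded to the expected location.
if os.path.exists(MASK_PATH):
    summary = summarize_keep_mask(np.load(MASK_PATH))
    print(summary["pruned_groups"], "of", summary["total_groups"], "groups pruned")
    print("minimum groups kept in any parent expert:", summary["min_kept_per_expert"])
```

For the recommended mask, the figures quoted in this card would correspond to 7,987 pruned groups out of 79,872 and a minimum of at least one kept group per parent expert.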

Run the local unit suite:

uv run pytest -q

Regenerate the insight plots from the published summary and keep mask:

uv run --extra dev python scripts/plot_pruning_insights.py

Status

  • Completed: native grouped loss sweep at 10%, 20%, 25%, and 40%.
  • Completed: full baseline and native-grouped 10%, 20%, and 25% MMLU.
  • Completed: exploratory repacked child-routing loss sweep.
  • Completed: paired first-10% GSM8K-CoT baseline and native-grouped 10%, 20%, and 25% curve.
  • Stopped after smoke validation: paired CRUXEval-O CoT comparison, because too little of the time box remained.
  • Completed: local MMLU-STEM 5-shot baseline and native-grouped 10%, 20%, and 25% comparisons.
  • Completed: random native-grouped 10% and 25% full-cache loss controls.
  • Pending: additional downstream evaluation.
  • Future work: sentinel-group grouped MoE kernel and measured inference benchmarks.

License

Apache 2.0, inheriting poolside/Laguna-XS.2.
