Two custom releases, both unusual takes on common problems, on a single RTX 3090 + a vast.ai pod.
🔹 ManniX-ITA/Qwen3.5-27B-Omnimerge-v2
3-source weight-space merge over Qwen3.5-27B combining OBIM-lite magnitude masking + DAREx rescaling + EMR election (sign from consensus, amplitude from max-abs across sources). GPU-accelerated, ~35× over CPU.
Sources: Claude-4.6-Opus-distill (0.40), Esper3.1 code (0.35), Gemini-3.1-Pro-distill (0.25). Density 0.53, DAREx q 0.75.
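For readers who want the gist of the election step, here is a minimal per-tensor sketch in plain PyTorch. It is not the released GPU kernel; the exact OBIM-lite masking and DAREx rescaling details are simplified assumptions, only the density/q/weight values come from the config above.

```python
import torch

def emr_merge_tensor(base, sources, weights, density=0.53, q=0.75):
    """Toy sketch of the merge applied to one tensor (not the release script).

    sources: same-shape fine-tuned tensors; weights: per-source mix weights;
    density: fraction of largest-magnitude delta entries kept (OBIM-lite-style mask);
    q: DARE-style keep probability, survivors rescaled by 1/q.
    """
    deltas = []
    for src, w in zip(sources, weights):
        d = (src - base) * w
        # Magnitude mask: zero out everything below the top-`density` quantile of |d|.
        k = max(1, int(density * d.numel()))
        thresh = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        d = torch.where(d.abs() >= thresh, d, torch.zeros_like(d))
        # DARE-style random drop + rescale so the expected delta is preserved.
        keep = (torch.rand_like(d) < q).to(d.dtype)
        deltas.append(d * keep / q)

    stacked = torch.stack(deltas)                    # [n_sources, ...]
    # Sign election: majority (weighted) sign at each position.
    sign = torch.sign(stacked.sum(dim=0))
    # Amplitude election: largest magnitude among sources that agree with the elected sign.
    agree = (torch.sign(stacked) == sign).to(stacked.dtype)
    amp = (stacked.abs() * agree).max(dim=0).values
    return base + sign * amp
```

The amplitude-from-max-abs choice means entries that agree with the consensus sign keep the strongest source's magnitude instead of being averaged down.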
Q6_K vs best source:
• GPQA Diamond: 53.03 → 69.19 (+16.16 pp)
• MBPP pass@1: 71.20 → 74.60 (+3.40)
• HumanEval pass@1: 76.22 → 79.27 (+3.05)
vs Omnimerge v1 (vanilla DARE-TIES): +8.08 pp GPQA, +2.80 MBPP. Amplitude-from-max + sign-from-consensus is what unlocked the GPQA jump.
🔹 ManniX-ITA/gemma-4-A4B-98e-v3-it
Gemma 4 26B-A4B pruned 128 → 98 experts/layer (-23.4% MoE capacity, -5.2B params), zero GPQA degradation.
GPQA Diamond:
• 128e reference: 75.25%
• 98e v3 (this): 75.25% (+0.00 pp despite -23.4% capacity, -5.2B params)
• 109e v3 (older): 71.72% (-3.53 pp)
The win over 109e v3 came from changing the importance map: aggregating per-expert contributions across math/logic/code/science/creative prompts via 128-token teacher forcing, instead of the GPQA-specific per-question top-16 map (which overfit). Result: more experts dropped, quality preserved.
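A rough sketch of that aggregated importance map, assuming an HF-style MoE interface that returns per-layer router logits (output_router_logits=True, as Mixtral-style models do). The helper name, prompt sets, and the exact Gemma 4 routing API are assumptions; the 128-token teacher forcing and the 98-expert budget are from the run above.

```python
import torch

@torch.no_grad()
def expert_importance(model, tokenizer, domain_texts, n_layers, n_experts,
                      keep_per_layer=98, max_tokens=128):
    """Hypothetical helper: aggregate per-expert routing mass across domains,
    then keep the top experts per layer."""
    scores = torch.zeros(n_layers, n_experts)
    for domain, texts in domain_texts.items():
        domain_scores = torch.zeros(n_layers, n_experts)
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_tokens).input_ids.to(model.device)
            out = model(ids, output_router_logits=True)      # teacher-forced forward pass
            for layer, logits in enumerate(out.router_logits):
                probs = logits.softmax(dim=-1)                # [tokens, n_experts]
                domain_scores[layer] += probs.sum(dim=0).float().cpu()
        # Normalize per layer so every domain contributes equally to the final map.
        scores += domain_scores / domain_scores.sum(dim=-1, keepdim=True)
    # Keep the highest-scoring experts in every layer; the rest get dropped.
    return scores.topk(keep_per_layer, dim=-1).indices.sort(dim=-1).values
```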
Findings worth flagging:
• Experts are NOT topic-specialized → 28/32 of the math and creative top-32 experts overlap.
• Max pairwise expert weight cosine ≈ 0.05 → merging experts destroys the model. Dropping is the only viable structural compression here.
• Contribution Gini ≈ 0.38 → ~75 experts/layer carry 80% of the signal (quick checks sketched after this list).
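The checks behind the last two bullets, as a hedged sketch: expert_weights and contributions are stand-ins for a layer's flattened expert matrices and the aggregated routing mass from the importance map above, not names from the release.

```python
import torch

def max_pairwise_cosine(expert_weights):
    """Max off-diagonal cosine similarity; expert_weights: [n_experts, d] flattened matrices."""
    w = torch.nn.functional.normalize(expert_weights.float(), dim=-1)
    sim = w @ w.T
    sim.fill_diagonal_(-1.0)          # ignore self-similarity
    return float(sim.max())

def gini(contributions):
    """Gini coefficient of non-negative per-expert contributions (1-D tensor)."""
    x, _ = contributions.flatten().float().sort()
    n = x.numel()
    cum = x.cumsum(0)
    # Lorenz-curve form: G = 1 + 1/n - 2 * sum_i L_i / n, with L_i = cum_i / total.
    return float(1 + 1 / n - 2 * cum.sum() / (n * cum[-1]))

def experts_for_mass(contributions, mass=0.80):
    """How many top experts cover `mass` of the total contribution in one layer."""
    x, _ = contributions.flatten().float().sort(descending=True)
    frac = x.cumsum(0) / x.sum()
    return int((frac < mass).sum().item()) + 1
```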
Eval: lm-eval gpqa_diamond_cot_zeroshot, llama-server --reasoning-format deepseek --reasoning-budget 8192, Gemma 4 official sampling. Feedback welcome.