modernbert-embed-base-yat

This model is nomic-ai/modernbert-embed-base with every GeGLU feed-forward block replaced by a sigmoid-gated Yat-kernel MLP (an alignment / inverse-distance kernel primitive, no GELU/GeGLU). Only the 22 feed-forward blocks are changed; attention, embeddings and norms are kept from the base model.

The Yat FFNs are fit by end-to-end last-layer distillation. Everything except the FFNs is frozen, and the FFNs are trained so that the model's final hidden state matches that of the frozen GeGLU teacher. The loss is a normalized MSE on the last-layer hidden states plus a cosine term on the mean-pooled embedding, and training runs for one epoch over all-nli sentences. Matching only the final representation, rather than imitating each GeGLU block pointwise (which hits a function-class ceiling), lets the kernel layers reallocate computation and recover teacher parity on the average STS score.
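The distillation objective can be sketched in NumPy as below. This is a minimal illustration, not the training code: the exact normalization of the MSE (here, squared error divided by the teacher's squared norm over non-padding tokens), the relative weight `cos_weight`, and the function names are all assumptions.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.

    hidden: (batch, seq, dim) float array, mask: (batch, seq) of 0/1.
    """
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)

def distill_loss(student, teacher, mask, cos_weight=1.0):
    """Normalized MSE on last-layer states plus a cosine term on pooled embeddings.

    Assumption: "normalized" means dividing by the teacher's energy;
    cos_weight is a hypothetical balancing factor.
    """
    m = mask[..., None].astype(student.dtype)
    err = (((student - teacher) ** 2) * m).sum()
    ref = ((teacher ** 2) * m).sum()
    nmse = err / max(ref, 1e-12)

    s, t = mean_pool(student, mask), mean_pool(teacher, mask)
    cos = (s * t).sum(axis=-1) / (
        np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1) + 1e-12
    )
    return nmse + cos_weight * (1.0 - cos).mean()
```

With this definition the loss is zero when the student reproduces the teacher's final hidden states exactly, regardless of what the intermediate layers compute, which is the freedom the last-layer setup relies on.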

Evaluation (MTEB STS, cosine Spearman)

| Task | base modernbert-embed-base (GeGLU) | this model (Yat) |
|---|---|---|
| STSBenchmark | 0.835 | 0.815 |
| STS12 | 0.676 | 0.718 |
| STS16 | 0.835 | 0.814 |
| average | 0.782 | 0.782 |

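The Yat-kernel swap reaches the same average STS as the GeGLU base (0.782): it is ahead on STS12 by about 0.04 and behind on STSBenchmark and STS16 by about 0.02 each. Scores were reproduced after a Hub round-trip, that is, after uploading the model and reloading it with trust_remote_code=True, which loads the bundled custom architecture (modeling_yatmodernbert.py) automatically.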
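For readers unfamiliar with the metric, "cosine Spearman" is the Spearman rank correlation between the cosine similarity of each sentence pair's embeddings and the human similarity score. The sketch below shows that computation in plain NumPy; it is not the MTEB implementation, and the function names are illustrative.

```python
import numpy as np

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)      # last rank occupied by each tie group
    first = last - counts + 1     # first rank occupied by each tie group
    return ((first + last) / 2.0)[inverse]

def cosine_spearman(emb_a, emb_b, gold):
    """Spearman correlation between pairwise cosine similarity and gold scores.

    emb_a, emb_b: (n, dim) embeddings of the two sentences in each pair.
    gold: (n,) human similarity ratings.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    cos = (emb_a * emb_b).sum(axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    )
    # Spearman is the Pearson correlation of the ranks.
    return np.corrcoef(average_ranks(cos), average_ranks(gold))[0, 1]
```

Because only ranks matter, the metric is insensitive to any monotone rescaling of the similarities, so two models can tie on it while producing differently scaled embeddings.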

Usage

from sentence_transformers import SentenceTransformer
m = SentenceTransformer("mlnomad/modernbert-embed-base-yat", trust_remote_code=True)
emb = m.encode(["A man is eating food.", "A man is eating a meal."])

The Yat FFN

g(x) = ( softplus(a) * (x·W + b)^2 / (||x - W||^2 + exp(le)) * sigmoid(gate(x)) ) @ A + c

Each hidden unit is a non-negative rational (alignment-over-distance) kernel feature with a sigmoid gate. The block replaces the GeGLU map d -> 4d -> d at the same hidden width.
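A forward pass matching the formula for g(x) can be sketched in NumPy as follows. The shapes are assumptions rather than details taken from modeling_yatmodernbert.py: `W` is stored as (d, h) with one weight vector per hidden unit, `a`, `b` and `le` are per-unit vectors, and `gate(x)` is taken to be a linear map with hypothetical parameters `Wg` and `bg`.

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)), computed stably
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def yat_features(x, W, b, a, le, Wg, bg):
    """Gated kernel features, shape (n, h). Every entry is non-negative."""
    dot = x @ W                                       # alignment x·w_j
    # ||x - w_j||^2 = ||x||^2 - 2 x·w_j + ||w_j||^2, for all units at once
    sq_dist = (x ** 2).sum(axis=-1, keepdims=True) - 2.0 * dot + (W ** 2).sum(axis=0)
    sq_dist = np.maximum(sq_dist, 0.0)                # guard against rounding
    kernel = softplus(a) * (dot + b) ** 2 / (sq_dist + np.exp(le))
    return kernel * sigmoid(x @ Wg + bg)              # assumed linear gate

def yat_ffn(x, W, b, a, le, Wg, bg, A, c):
    """g(x): kernel features projected back to the model width by A, plus bias c."""
    return yat_features(x, W, b, a, le, Wg, bg) @ A + c
```

The `exp(le)` term keeps the denominator strictly positive, so the feature stays finite even when an input coincides with a weight vector.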

Notes

  • Distillation reaches teacher parity, which is also its ceiling: the model matches the base but does not beat it. A light contrastive fine-tune of the FFNs alone does not exceed the base either.
  • Quantizes losslessly to int8 (weight-only PTQ leaves STS unchanged); below 8 bits the rational kernel is more sensitive than GeGLU under naive uniform quantization.
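To make the int8 note concrete, here is a minimal sketch of weight-only post-training quantization. The card does not say which scheme was used, so symmetric absmax scaling with one scale per column is an assumption, as are the function names.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization with one scale per column (assumed scheme)."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)      # avoid dividing by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover float weights; activations stay in floating point."""
    return q.astype(np.float32) * scale
```

Under this scheme each weight moves by at most half a quantization step. With fewer bits the step grows, and since weight error enters both the numerator and the denominator of the rational kernel, that is one plausible reason for the sub-8-bit sensitivity noted above.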

Part of the ⵟ-kernel research project (kernel-native replacements for transformer FFNs).

Model size: 0.1B params (Safetensors), tensor type F32.