modernbert-embed-base-yat

nomic-ai/modernbert-embed-base with every GeGLU feed-forward block replaced by a sigmoid-gated Yat-kernel MLP (an alignment / inverse-distance kernel primitive, no GELU/GeGLU). Only the 22 feed-forward blocks are changed; attention, embeddings and norms are the base model's.

The Yat FFNs are fit by end-to-end last-layer distillation: everything except the FFNs is frozen, and the FFNs are trained so that the model's final hidden state matches that of the frozen GeGLU teacher. The loss is a normalized MSE on the last-layer hidden states plus a cosine term on the mean-pooled embedding, and training runs for one epoch over all-nli sentences. Matching only the final representation, rather than imitating each GeGLU block pointwise (which hits a function-class ceiling), lets the kernel layers reallocate computation and recover teacher parity on average STS.
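As a rough illustration of the objective described above, the NumPy sketch below combines a normalized MSE on last-layer hidden states with a cosine term on the mean-pooled embedding. The exact normalization (here, dividing by the teacher's summed squared activations), the weighting of the two terms, and the names `mean_pool`, `distill_loss`, and `cos_weight` are assumptions made for illustration; they are not taken from the training code.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.

    hidden: (batch, seq, dim) last-layer hidden states
    mask:   (batch, seq), 1 for real tokens and 0 for padding
    """
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)

def distill_loss(student, teacher, mask, cos_weight=1.0, eps=1e-8):
    """Normalized MSE on hidden states plus (1 - cosine) on pooled embeddings."""
    m = mask[..., None].astype(student.dtype)
    # Squared error over real tokens, divided by the teacher's energy
    # (assumed form of the "normalized" MSE).
    err = (((student - teacher) ** 2) * m).sum()
    ref = ((teacher ** 2) * m).sum()
    nmse = err / (ref + eps)
    # Cosine term between mean-pooled student and teacher embeddings.
    s, t = mean_pool(student, mask), mean_pool(teacher, mask)
    cos = (s * t).sum(-1) / (np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1) + eps)
    return nmse + cos_weight * (1.0 - cos).mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 5, 8))            # toy teacher hidden states
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
loss_same = distill_loss(teacher, teacher, mask)
loss_noisy = distill_loss(teacher + 0.1 * rng.normal(size=teacher.shape), teacher, mask)
```

A student that reproduces the teacher exactly gives a loss near zero, and any deviation in the final hidden states raises it, regardless of what the intermediate blocks compute.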

Evaluation (MTEB STS, cosine Spearman)

Task	base `modernbert-embed-base` (GeGLU)	this model (Yat)
STSBenchmark	0.835	0.815
STS12	0.676	0.718
STS16	0.835	0.814
average	0.782	0.782

The Yat-kernel swap reaches the same average STS as the GeGLU base: it is ahead on STS12 (0.718 vs 0.676) and about 0.02 behind on STSBenchmark and STS16. The scores were reproduced after a Hub round-trip, loading the model with trust_remote_code=True so that the bundled custom architecture (modeling_yatmodernbert.py) is loaded automatically.
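The averages and per-task gaps can be checked directly from the table; the short script below only redoes that arithmetic on the reported numbers (the variable names are illustrative).

```python
# Scores copied from the table: task -> (GeGLU base, Yat)
scores = {
    "STSBenchmark": (0.835, 0.815),
    "STS12": (0.676, 0.718),
    "STS16": (0.835, 0.814),
}

base_avg = sum(base for base, _ in scores.values()) / len(scores)
yat_avg = sum(yat for _, yat in scores.values()) / len(scores)

# Per-task difference, Yat minus base (positive means Yat is ahead).
deltas = {task: round(yat - base, 3) for task, (base, yat) in scores.items()}

print(round(base_avg, 3), round(yat_avg, 3))
print(deltas)
```

Both averages round to 0.782, while the per-task differences are -0.020, +0.042 and -0.021, so the parity holds on average rather than task by task.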

Usage

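from sentence_transformers import SentenceTransformer

# trust_remote_code=True lets the bundled modeling_yatmodernbert.py be loaded
m = SentenceTransformer("mlnomad/modernbert-embed-base-yat", trust_remote_code=True)
# encode returns one embedding per input sentence
emb = m.encode(["A man is eating food.", "A man is eating a meal."])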
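The evaluation metric, cosine Spearman, is the Spearman rank correlation between the cosine similarity of each sentence pair's embeddings and the gold similarity score. The pure-Python sketch below shows that computation on toy vectors standing in for `m.encode(...)` outputs; the vectors and gold ratings are hypothetical, and reported numbers should come from MTEB's own evaluator rather than from this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy embedding pairs (hypothetical), one (left, right) pair per sentence pair.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [4.8, 2.5, 0.2]  # hypothetical human similarity ratings

pred = [cosine(u, v) for u, v in pairs]
score = spearman(pred, gold)
```

Because only ranks matter, the metric rewards ordering sentence pairs correctly and ignores the absolute scale of the cosine values.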

The Yat FFN

g(x) = ( softplus(a) * (x·W + b)^2 / (||x - W||^2 + exp(le)) * sigmoid(gate(x)) ) @ A + c

This is a non-negative rational (alignment-over-distance) kernel feature with a sigmoid gate. It replaces the GeGLU map d -> 4d -> d at the same hidden width.
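A minimal NumPy sketch of this formula follows, as one possible reading of it rather than the code in modeling_yatmodernbert.py. The tensor shapes, treating `a` and `le` as scalars, modelling `gate(x)` as a linear layer with hypothetical parameters `Wg` and `bg`, and the random initialization are all assumptions.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)  # numerically stable log(1 + exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def yat_ffn(x, W, b, a, le, Wg, bg, A, c):
    """Assumed shapes: x (n, d); W, Wg (h, d); b, bg (h,); A (h, d); c (d,)."""
    dot = x @ W.T                                    # x·W, shape (n, h)
    # ||x - w||^2 = ||x||^2 - 2 x·w + ||w||^2 for every row w of W
    dist2 = (x ** 2).sum(-1, keepdims=True) - 2.0 * dot + (W ** 2).sum(-1)
    dist2 = np.maximum(dist2, 0.0)                   # guard against round-off
    # Alignment (squared) over distance; exp(le) keeps the denominator positive.
    kernel = softplus(a) * (dot + b) ** 2 / (dist2 + np.exp(le))
    gate = sigmoid(x @ Wg.T + bg)                    # assumed linear gate
    return (kernel * gate) @ A + c, kernel

rng = np.random.default_rng(0)
d, n = 8, 4
h = 4 * d                                            # hidden width 4d, as in d -> 4d -> d
x = rng.normal(size=(n, d))
W, Wg = rng.normal(size=(h, d)), rng.normal(size=(h, d))
b, bg = np.zeros(h), np.zeros(h)
A, c = rng.normal(size=(h, d)) / h, np.zeros(d)
out, kernel = yat_ffn(x, W, b, a=0.0, le=0.0, Wg=Wg, bg=bg, A=A, c=c)
```

The numerator is a square and the denominator is strictly positive, so the kernel feature is non-negative before the gate and the projection `A` are applied.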

Notes

Distillation reaches teacher parity, which is also its ceiling: the model matches the base but does not beat it. A light contrastive fine-tune of the FFNs alone does not exceed the base either.
Quantizes to int8 with no change in STS (weight-only PTQ); below 8 bits the rational kernel is more sensitive than GeGLU under naive uniform quantization.
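For readers unfamiliar with weight-only PTQ, the sketch below shows naive symmetric uniform quantization of a weight matrix and how the reconstruction error grows as the bit width drops. It uses a random stand-in matrix and a per-tensor scale, both assumptions; it does not reproduce the STS measurements or the exact quantization scheme behind the note above.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1            # 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                      # weights as seen at inference time

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 16))  # stand-in for an FFN weight matrix

# Relative reconstruction error at 8 and 4 bits.
rel_err = {
    bits: np.linalg.norm(quantize_dequantize(w, bits) - w) / np.linalg.norm(w)
    for bits in (8, 4)
}
```

Activations stay in floating point in this setting; only the stored weights are rounded, which is why small weight errors at 8 bits can leave downstream scores unchanged while coarser grids may not.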

Part of the ⵟ-kernel research project (kernel-native replacements for transformer FFNs).

Model size: 0.1B params (Safetensors, tensor type F32)

Model tree for mlnomad/modernbert-embed-base-yat

Base model: answerdotai/ModernBERT-base
Finetuned: nomic-ai/modernbert-embed-base
Finetuned: this model

Evaluation results (self-reported, cosine_spearman)

MTEB STSBenchmark	0.815
MTEB STS12	0.718
MTEB STS16	0.814