lettuce-emb-768d-v4

A roleplay-first embedding model for LettuceAI's on-device memory layer.

lettuce-emb-768d-v4 is built for one job: retrieve the right memory from a long, messy roleplay conversation history. v3 failed at this (recall@1 = 0.020). v4 hits recall@1 = 0.924 while keeping general semantic quality intact (STSBenchmark = 0.819).

It also works as a general retrieval embedder, though that was not the primary optimization target.

Backbone: nomic-ai/nomic-embed-text-v1.5
Output: 768d native (no Dense projection)
Matryoshka dims: 64 / 128 / 256 / 512 / 768
Context length: 4096 tokens
Pooling: mean over tokens, L2 normalized
License: Apache 2.0

Headline numbers

Metric	v3	v4	Change
RP recall@1	0.020	0.924	46.2x
RP recall@5	0.109	0.982	9.0x
STSBenchmark Spearman	0.809	0.819	+0.010
Output dim	512d	768d native	no bottleneck
Matryoshka	no	5 tiers from one file	yes
ONNX	not released	FP32 + INT8	shipped

The full release write-up is on the LettuceAI blog.

Files

.
├── config.json
├── configuration_hf_nomic_bert.py
├── model.safetensors            # 547 MB, FP32 weights
├── tokenizer.json
├── tokenizer_config.json
├── metrics.json                 # release checkpoint metrics, full
├── best_release_metrics.json    # release checkpoint metrics, summary
└── onnx/
    ├── model.fp32.onnx          # 547.7 MB, server / GPU
    └── model.int8.onnx          # 138.0 MB, on-device CPU

Both ONNX files return L2-normalized 768d vectors directly. The caller picks the Matryoshka dim by slicing.

Usage

transformers + torch

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Zeolit/lettuce-emb-768d-v4"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

def embed(texts, dim=768):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    mask = enc.attention_mask.unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    pooled = F.normalize(pooled, p=2, dim=1)
    sliced = pooled[:, :dim]
    return F.normalize(sliced, p=2, dim=1)  # re-normalize after slice

vecs = embed(["hello world", "I remember that day"], dim=256)

sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Zeolit/lettuce-emb-768d-v4", trust_remote_code=True)
vecs = model.encode(["hello world", "I remember that day"], normalize_embeddings=True)

ONNX (recommended for production)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Zeolit/lettuce-emb-768d-v4")
sess = ort.InferenceSession("onnx/model.fp32.onnx", providers=["CPUExecutionProvider"])

enc = tok(["hello world"], padding=True, truncation=True, max_length=4096, return_tensors="np")
inputs = {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
}
vec = sess.run(None, inputs)[0]   # already L2-normalized, shape (1, 768)

For a roughly 4x smaller deployment (138.0 MB instead of 547.7 MB), swap to model.int8.onnx. The output API is identical.

Matryoshka slicing

def slice_dim(vec, dim):
    sliced = vec[..., :dim]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

v_64  = slice_dim(vec, 64)    # 256 bytes per vector (FP32)
v_128 = slice_dim(vec, 128)
v_256 = slice_dim(vec, 256)
v_512 = slice_dim(vec, 512)
v_768 = vec                   # already normalized
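
Because the sliced vectors are re-normalized, cosine similarity reduces to a plain dot product. The sketch below shows a top-k lookup over a small in-memory store. `top_k` is a hypothetical helper, and the random matrix only stands in for real embeddings, so the scores mean nothing beyond demonstrating the shapes involved.

```python
import numpy as np


def slice_dim(vec, dim):
    # Same truncation + re-normalization as in the snippet above.
    sliced = vec[..., :dim]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)


def top_k(query, memory, k=3):
    """Return (indices, scores) of the k rows of `memory` most similar to `query`.

    Both inputs must already be L2-normalized at the same dimension,
    so the dot product equals cosine similarity.
    """
    scores = memory @ query
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Stand-in data: random unit vectors instead of real model output.
rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 768)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

memory_256 = slice_dim(full, 256)
query_256 = memory_256[42]  # query with a stored vector, so row 42 should rank first

idx, scores = top_k(query_256, memory_256, k=3)
print(idx, scores)
```

Query and memory vectors must be sliced to the same dimension before comparison; mixing a 256d query with a 768d store will fail on shape.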

Matryoshka tradeoff

Dim	Bytes (FP32)	recall@1	recall@5	recall@10	MRR@10
64d	256	0.424	0.648	0.698	0.523
128d	512	0.488	0.723	0.768	0.591
256d	1,024	0.504	0.752	0.796	0.614
512d	2,048	0.509	0.767	0.808	0.622
768d	3,072	0.512	0.769	0.815	0.628

(Numbers from the 144k-passage extreme retrieval benchmark. Full benchmark in the release post.)
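
For readers unfamiliar with the column headers, the sketch below shows how recall@k and MRR@k are conventionally computed when each query has exactly one relevant passage. The single-relevant-passage setup and the function names are assumptions for illustration; the exact protocol of the release benchmark is not described in this card.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold passage appears in the top k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


def mrr_at_k(ranked_ids, gold_ids, k):
    """Mean of 1/rank of the gold passage, counting 0 when it is outside the top k."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        top = ranked[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(gold_ids)


# Toy example: three queries, each with a ranked list of passage ids.
ranked = [["p1", "p7", "p3"], ["p4", "p2", "p9"], ["p5", "p6", "p8"]]
gold = ["p1", "p2", "p0"]

print(recall_at_k(ranked, gold, 1))
print(recall_at_k(ranked, gold, 3))
print(mrr_at_k(ranked, gold, 10))
```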

Going from 768d to 64d costs ~17% of recall@1 (0.512 to 0.424) in exchange for 12x smaller vectors. Even at 64d, v4 is well above v3's full 512d performance.
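
The size and recall figures follow directly from the table. The snippet below reproduces that arithmetic and extends it to raw index size. The 100,000-memory store is a hypothetical figure chosen for illustration, not a number from the release.

```python
# recall@1 per dimension, copied from the Matryoshka tradeoff table.
RECALL_AT_1 = {64: 0.424, 128: 0.488, 256: 0.504, 512: 0.509, 768: 0.512}
BYTES_PER_FLOAT32 = 4


def bytes_per_vector(dim):
    return dim * BYTES_PER_FLOAT32


def relative_recall_loss(dim, full_dim=768):
    """Share of full-dimension recall@1 given up by truncating to `dim`."""
    return 1.0 - RECALL_AT_1[dim] / RECALL_AT_1[full_dim]


def index_size_mb(num_vectors, dim):
    """Raw FP32 storage in MB (10**6 bytes), ignoring any index overhead."""
    return num_vectors * bytes_per_vector(dim) / 1e6


NUM_MEMORIES = 100_000  # hypothetical store size

for dim in RECALL_AT_1:
    print(f"{dim:>3}d  {index_size_mb(NUM_MEMORIES, dim):7.1f} MB  "
          f"recall@1 loss vs 768d: {relative_recall_loss(dim):.1%}")
```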

Training

Three-stage curriculum, ~285k pairs/triplets across roleplay/persona, long-form narrative, and general retrieval data, with BGE-M3 hard negatives refreshed per epoch.

Stage	Seq len	Batch	Negatives	Losses
1 warmup	512	128 pairs	in-batch	MNR
2 main	2048	16 triplets	hard negatives	MNR + Cosine distillation
3 refinement	4096	8 triplets	refreshed hard negatives	MNR + Cosine + MarginMSE + STS replay
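
MNR in the Losses column presumably refers to a multiple-negatives ranking objective. As a rough illustration of the idea (not the project's training code), the sketch below computes an in-batch-negatives loss in plain Python: each anchor is scored against every positive in the batch by scaled cosine similarity, and cross-entropy rewards the matching pair. The scale of 20 mirrors the sentence-transformers default and is an assumption about this model's setup.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss.

    For anchor i, positives[i] is the target and every other positives[j]
    in the batch acts as a negative. Returns the mean cross-entropy.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy batch of 2D embeddings: aligned pairs give a low loss, swapped pairs a high one.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]

print(mnr_loss(anchors, aligned))
print(mnr_loss(anchors, swapped))
```

Larger batches supply more in-batch negatives per anchor, which is one plausible reason stage 1 uses a batch of 128 pairs before the later stages switch to mined hard negatives.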

The released checkpoint is best_release (step 34400): the highest-recall checkpoint that still passes the STSBenchmark release floor. It is not the final training step. See the engineering postmortem on the LettuceAI blog for why.
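
That selection rule (highest recall among checkpoints that clear an STS floor) can be sketched as follows. The floor value and the checkpoint records are hypothetical placeholders; the actual floor and per-step metrics are not stated in this card.

```python
def pick_release_checkpoint(checkpoints, sts_floor):
    """Return the checkpoint with the highest recall@1 whose STS score
    is at or above `sts_floor`, or None if none qualifies."""
    eligible = [c for c in checkpoints if c["sts"] >= sts_floor]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["recall_at_1"])


# Made-up training log: the last step has the best recall but fails the floor.
checkpoints = [
    {"step": 1000, "recall_at_1": 0.70, "sts": 0.83},
    {"step": 2000, "recall_at_1": 0.90, "sts": 0.82},
    {"step": 3000, "recall_at_1": 0.95, "sts": 0.78},
]

best = pick_release_checkpoint(checkpoints, sts_floor=0.80)  # floor is hypothetical
print(best["step"])
```

Filtering before taking the maximum is what lets an earlier step win over the final one when recall keeps rising while general semantic quality drifts below the floor.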

Intended use

Memory retrieval over multi-turn roleplay / persona conversations (primary).
General sentence similarity and retrieval over short and long documents.
On-device embedding via INT8 ONNX for resource-constrained hardware.

Out of scope

Cross-lingual retrieval. Trained on English data.
Code retrieval. Not in the training mix.
Reranking. Use a dedicated cross-encoder for that.

Limitations

Benchmarks reported here are in-distribution for v4 (sources it saw during training). v3 was tested on the same set so the relative comparison is fair, but absolute generalization on completely held-out corpora may differ.
Tuned for roleplay-style memory retrieval. On clean-QA benchmarks like MS MARCO, dedicated retrieval models will likely outperform it.
The 4096-token context is usable in full, but it only pays off when the embedded passage is genuinely long. Short passages gain nothing from it.

Citation

@misc{lettuceemb_v4_2026,
  author       = {Zeolit and LettuceAI},
  title        = {lettuce-emb-768d-v4: a roleplay-first embedding model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Zeolit/lettuce-emb-768d-v4}}
}

Acknowledgments

nomic-ai/nomic-embed-text-v1.5 for the backbone.
BAAI/bge-m3 for hard-negative mining and teacher cosine scores.
cross-encoder/ms-marco-MiniLM-L-6-v2 for false-negative filtering during data prep.
google/gemma-4-26b-a4b-it for synthetic query generation.
Training data sources: google/Synthetic-Persona-Chat, nazlicanto/persona-based-chat, kmfoda/booksum, deepmind/narrativeqa, and the sentence-transformers mirrors of HotpotQA, GooAQ, NQ, AllNLI.
