lettuce-emb-768d-v4

A roleplay-first embedding model for LettuceAI's on-device memory layer.

lettuce-emb-768d-v4 is built for one job: retrieve the right memory from a long, messy roleplay conversation history. v3 failed at this (recall@1 = 0.020). v4 hits recall@1 = 0.924 while keeping general semantic quality intact (STSBenchmark = 0.819).

It also works as a general retrieval embedder; that was simply not the first optimization priority.

  • Backbone: nomic-ai/nomic-embed-text-v1.5
  • Output: 768d native (no Dense projection)
  • Matryoshka dims: 64 / 128 / 256 / 512 / 768
  • Context length: 4096 tokens
  • Pooling: mean over tokens, L2 normalized
  • License: Apache 2.0

Headline numbers

Metric                  v3             v4                       Change
RP recall@1             0.020          0.924                    46.2x
RP recall@5             0.109          0.982                    9.0x
STSBenchmark Spearman   0.809          0.819                    +0.010
Output dim              512d           768d native              no bottleneck
Matryoshka              no             5 tiers from one file    yes
ONNX                    not released   FP32 + INT8              shipped

The full release write-up is on the LettuceAI blog.
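
For readers unfamiliar with the metrics in the table above, here is a minimal sketch of how recall@k and MRR@k are conventionally computed. It uses the standard textbook definitions and assumes exactly one relevant (gold) passage per query; it is not the project's evaluation code, and the function names and toy ids are made up for illustration.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str], k: int) -> float:
    """Share of queries whose gold passage appears in the top-k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


def mrr_at_k(ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str], k: int = 10) -> float:
    """Mean reciprocal rank of the gold passage; a miss outside the top-k counts as 0."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        top = list(ranked[:k])
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(gold_ids)


# Toy example: three queries, each with a ranked list of passage ids.
ranked = [["m1", "m7", "m3"], ["m4", "m2", "m9"], ["m5", "m6", "m8"]]
gold = ["m1", "m2", "m0"]  # the third query's gold passage is never retrieved

print(recall_at_k(ranked, gold, k=1))  # 1 of 3 queries hit at rank 1
print(recall_at_k(ranked, gold, k=5))  # 2 of 3 queries hit within the top 5
print(mrr_at_k(ranked, gold, k=10))    # (1/1 + 1/2 + 0) / 3 = 0.5
```

Read this way, recall@1 = 0.924 means the correct memory is the single top result for about 92% of queries.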

Files

.
├── config.json
├── configuration_hf_nomic_bert.py
├── model.safetensors            # 547 MB, FP32 weights
├── tokenizer.json
├── tokenizer_config.json
├── metrics.json                 # release checkpoint metrics, full
├── best_release_metrics.json    # release checkpoint metrics, summary
└── onnx/
    ├── model.fp32.onnx          # 547.7 MB, server / GPU
    └── model.int8.onnx          # 138.0 MB, on-device CPU

Both ONNX files return L2-normalized 768d vectors directly. The caller picks the Matryoshka dim by slicing.

Usage

transformers + torch

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "Zeolit/lettuce-emb-768d-v4"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

def embed(texts, dim=768):
    # Tokenize with padding so the batch forms a rectangular tensor.
    enc = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state
    # Mean-pool over real tokens only; the mask zeroes out padding positions.
    mask = enc.attention_mask.unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    pooled = F.normalize(pooled, p=2, dim=1)
    # Matryoshka: keep the first `dim` components of the full vector.
    sliced = pooled[:, :dim]
    return F.normalize(sliced, p=2, dim=1)  # re-normalize after slice

vecs = embed(["hello world", "I remember that day"], dim=256)

sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Zeolit/lettuce-emb-768d-v4", trust_remote_code=True)
vecs = model.encode(["hello world", "I remember that day"], normalize_embeddings=True)

ONNX (recommended for production)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Zeolit/lettuce-emb-768d-v4")
sess = ort.InferenceSession("onnx/model.fp32.onnx", providers=["CPUExecutionProvider"])

enc = tok(["hello world"], padding=True, truncation=True, max_length=4096, return_tensors="np")
inputs = {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
}
vec = sess.run(None, inputs)[0]   # already L2-normalized, shape (1, 768)

For a roughly 4x smaller file (138.0 MB vs 547.7 MB), swap to model.int8.onnx. The output API is identical.
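
Before swapping, it can be worth checking how closely the INT8 vectors track the FP32 ones on your own texts. The helper below is a small sketch of such a check, not part of the release; `mean_cosine`, `sess_fp32`, and `sess_int8` are hypothetical names, and the runnable part uses toy data rather than model output.

```python
import numpy as np


def mean_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Average row-wise cosine similarity between two (n, dim) batches."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))


# With both sessions loaded, the comparison would look like this
# (`inputs` built as in the ONNX example above; session names are hypothetical):
#   v_fp32 = sess_fp32.run(None, inputs)[0]
#   v_int8 = sess_int8.run(None, inputs)[0]
#   print(mean_cosine(v_fp32, v_int8))

# Self-contained check on toy data: a small perturbation stays close to 1.0.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 768))
print(mean_cosine(x, x + 0.01 * rng.normal(size=(4, 768))))
```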

Matryoshka slicing

def slice_dim(vec, dim):
    sliced = vec[..., :dim]
    return sliced / np.linalg.norm(sliced, axis=-1, keepdims=True)

v_64  = slice_dim(vec, 64)    # 256 bytes per vector (FP32)
v_128 = slice_dim(vec, 128)
v_256 = slice_dim(vec, 256)
v_512 = slice_dim(vec, 512)
v_768 = vec                   # already normalized

Matryoshka tradeoff

Dim    Bytes (FP32)   recall@1   recall@5   recall@10   MRR@10
64d    256            0.424      0.648      0.698       0.523
128d   512            0.488      0.723      0.768       0.591
256d   1,024          0.504      0.752      0.796       0.614
512d   2,048          0.509      0.767      0.808       0.622
768d   3,072          0.512      0.769      0.815       0.628

(Numbers from the 144k-passage extreme retrieval benchmark. Full benchmark in the release post.)

Going from 768d to 64d costs about 17% of recall@1 (0.512 to 0.424) in exchange for 12x smaller vectors. Even at 64d, v4 is well above v3's full 512d performance.
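
The trade-off can be checked with a few lines of arithmetic. The sketch below recomputes bytes per vector and the relative recall@1 loss from the table above; the 100,000-memory store is a hypothetical figure chosen only for illustration, and the size estimate ignores any index overhead.

```python
# recall@1 per Matryoshka dim, copied from the table above.
RECALL_AT_1 = {64: 0.424, 128: 0.488, 256: 0.504, 512: 0.509, 768: 0.512}

BYTES_PER_FLOAT32 = 4


def bytes_per_vector(dim: int) -> int:
    """Storage for one FP32 vector of the given dimension."""
    return dim * BYTES_PER_FLOAT32


def relative_recall_loss(dim: int, full_dim: int = 768) -> float:
    """Fraction of full-dimension recall@1 given up by truncating to `dim`."""
    return 1.0 - RECALL_AT_1[dim] / RECALL_AT_1[full_dim]


def store_size_mb(num_vectors: int, dim: int) -> float:
    """Raw vector storage in MB (10**6 bytes), ignoring index overhead."""
    return num_vectors * bytes_per_vector(dim) / 1e6


if __name__ == "__main__":
    n = 100_000  # hypothetical number of stored memories
    for dim in sorted(RECALL_AT_1):
        print(f"{dim:>3}d  {bytes_per_vector(dim):>5} B/vec  "
              f"{store_size_mb(n, dim):7.1f} MB  "
              f"recall@1 loss {relative_recall_loss(dim):6.1%}")
```

Most of the loss is concentrated in the last step down: 256d gives up under 2% of recall@1 relative to 768d, while 64d gives up about 17%.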

Training

Three-stage curriculum, ~285k pairs/triplets across roleplay/persona, long-form narrative, and general retrieval data, with BGE-M3 hard negatives refreshed per epoch.

Stage          Seq len   Batch          Negatives                  Losses
1 warmup       512       128 pairs      in-batch                   MNR
2 main         2048      16 triplets    hard negatives             MNR + Cosine distillation
3 refinement   4096      8 triplets     refreshed hard negatives   MNR + Cosine + MarginMSE + STS replay
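
MNR in the table presumably refers to the multiple negatives ranking loss. As a rough NumPy sketch of its standard in-batch form: each anchor is scored against every positive in the batch, and cross-entropy pushes the matching pair to the top. The scale of 20 mirrors the common sentence-transformers default and is an assumption here, not a confirmed training setting for this model.

```python
import numpy as np


def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch multiple negatives ranking loss.

    Row i of `positives` is the match for row i of `anchors`; every other
    row in the batch serves as a negative. `scale` is an assumed temperature.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch) scaled cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the diagonal


# Toy batch: correctly paired vectors give a lower loss than shuffled pairs.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 32))
positives = anchors + 0.1 * rng.normal(size=(8, 32))
aligned = mnr_loss(anchors, positives)
shuffled = mnr_loss(anchors, np.roll(positives, 1, axis=0))
print(aligned < shuffled)
```

With in-batch negatives, a larger batch means more negatives per anchor, which fits stage 1 using the largest batch of the three stages.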

The released checkpoint is best_release (step 34400): the highest-recall checkpoint that still passes the STSBenchmark release floor. It is not the final training step. See the engineering postmortem on the LettuceAI blog for why.
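
The selection rule described above can be sketched as a filter followed by a max. Everything concrete in this example is hypothetical: the metric values are made up, the field names are invented, and the 0.80 floor is a placeholder, since the actual release floor is not stated here.

```python
from typing import Optional


def pick_release_checkpoint(checkpoints: list[dict], sts_floor: float) -> Optional[dict]:
    """Highest-recall checkpoint whose STS score is at or above the floor."""
    eligible = [c for c in checkpoints if c["sts"] >= sts_floor]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["recall_at_1"])


# Made-up metrics for illustration only; these are not real training logs.
history = [
    {"step": 1000, "recall_at_1": 0.70, "sts": 0.83},
    {"step": 2000, "recall_at_1": 0.90, "sts": 0.82},
    {"step": 3000, "recall_at_1": 0.95, "sts": 0.78},  # best recall, fails the floor
]

best = pick_release_checkpoint(history, sts_floor=0.80)  # hypothetical floor
print(best["step"])  # 2000: the final step is skipped because its STS is too low
```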

Intended use

  • Memory retrieval over multi-turn roleplay / persona conversations (primary).
  • General sentence similarity and retrieval over short and long documents.
  • On-device embedding via INT8 ONNX for resource-constrained hardware.
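
As an illustration of the primary use case, here is a minimal top-k lookup over stored memory vectors, assuming the memories were embedded ahead of time and every vector is L2-normalized. `top_k_memories` is a hypothetical helper, and the random vectors stand in for real embeddings.

```python
import numpy as np


def top_k_memories(query_vec: np.ndarray, memory_vecs: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k most similar memories.

    Assumes every vector is already L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = memory_vecs @ query_vec      # shape (num_memories,)
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]       # highest score first
    return order, scores[order]


# Toy stand-ins for real embeddings: random unit vectors, not model output.
rng = np.random.default_rng(0)
memories = rng.normal(size=(100, 768)).astype(np.float32)
memories /= np.linalg.norm(memories, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of memory 42.
query = memories[42] + 0.01 * rng.normal(size=768).astype(np.float32)
query /= np.linalg.norm(query)

idx, sims = top_k_memories(query, memories, k=3)
print(idx[0])  # the perturbed source memory ranks first
```

A brute-force scan like this is usually enough for a per-user memory store; larger collections would typically move to an approximate nearest-neighbor index.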

Out of scope

  • Cross-lingual retrieval. Trained on English data.
  • Code retrieval. Not in the training mix.
  • Reranking. Use a dedicated cross-encoder for that.

Limitations

  • Benchmarks reported here are in-distribution for v4 (sources it saw during training). v3 was tested on the same set so the relative comparison is fair, but absolute generalization on completely held-out corpora may differ.
  • Tuned for roleplay-style memory retrieval. On clean-QA benchmarks like MS MARCO, dedicated retrieval models will likely outperform it.
  • The 4096-token context is fully usable, but it only helps when the embedded passage is genuinely long. Short passages do not need it.

Citation

@misc{lettuceemb_v4_2026,
  author       = {Zeolit and LettuceAI},
  title        = {lettuce-emb-768d-v4: a roleplay-first embedding model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Zeolit/lettuce-emb-768d-v4}}
}

Acknowledgments

  • nomic-ai/nomic-embed-text-v1.5 for the backbone.
  • BAAI/bge-m3 for hard-negative mining and teacher cosine scores.
  • cross-encoder/ms-marco-MiniLM-L-6-v2 for false-negative filtering during data prep.
  • google/gemma-4-26b-a4b-it for synthetic query generation.
  • Training data sources: google/Synthetic-Persona-Chat, nazlicanto/persona-based-chat, kmfoda/booksum, deepmind/narrativeqa, and the sentence-transformers mirrors of HotpotQA, GooAQ, NQ, AllNLI.