Party is over: regularizing ColBERT models to fix efficient ANN methods

Community Article
Published June 16, 2026


Efficient ANN methods like MUVERA and SMVE promise to simplify ColBERT infrastructure, but they fall short with modern ColBERT models. We trace the root cause to embedding geometry, show that mean-centering is a strong but insufficient fix, and introduce STE-based regularization that directly optimizes the model for the target projection space. The surprise: the fix doesn't make embeddings more isotropic but concentrates them into fewer dimensions, the opposite of what intuition predicts but exactly what random projections need. The regularization transfers across methods, seeds, and hyperparameters without degrading full MaxSim retrieval.

🤗 Models: LateOn-regularized | LateOn


Late interaction: powerful, but expensive

Late interaction models (also known as ColBERT or multi-vector models) have shown very strong performance on various retrieval tasks such as out-of-domain, long-context, code, and agentic retrieval. Despite these strong results, they come with infrastructure challenges because they require storing and searching the embeddings of every token in every document. Besides the storage cost, retrieval also relies on an index structure to generate a candidate set before gathering and decompressing the document token embeddings and applying the full MaxSim operation. This is typically handled with PLAID indexes, which we made easily and efficiently available to a broad audience through FastPlaid and NextPlaid.
Although those libraries enabled the use of ColBERT models at larger scale, PLAID indexes can still be complicated to maintain. Notably, updating the centroids remains a challenge due to the limitations of IVF-PQ, which requires rebuilding the whole index once the distribution shifts too much, even after adding mechanisms to follow that shift in FastPlaid; this makes CRUD operations difficult. SPFresh offers a trade-off that makes updates easier, but might face other scaling issues.
For these reasons, new methods have been proposed to generate candidates without relying on centroids.

The solutions

MUVERA uses the multi-vector representations to create Fixed Dimensional Encodings (FDEs), akin to large, dense, single vectors, in order to leverage existing single-vector infrastructure such as HNSW.
SMVE uses random rotation projections to create sparse encodings of the multi-vector representations, in order to leverage existing sparse retrieval infrastructure.

Those methods are very attractive because they would simplify the multi-vector retrieval infrastructure extensively by offering easier to maintain and CRUD-friendly indexes to generate candidate sets to be reranked with full MaxSim. However, the early experiments exhibiting promising results were run with the old ColBERT-v2 model, and surprisingly did not seem to generalize to newer models…

Are you sure about that?

On 27 August 2025, an issue was opened on PyLate reporting that ModernColBERT was not performing correctly when used with MUVERA. We quickly realized that the issue extended to other models such as ColBERT-small, which cleared our training loop because that model was not trained with PyLate. We then hypothesized that it could be due to the score normalization used in the distillation setup, but we ruled that out by performing KD without normalization, and further confirmed that some contrastive-only models such as ColBERT-Zero-supervised also suffered from the issue. Since most of the new models were affected, it became clear that the problem was not tied to a specific training setup but was more general. A simple explanation would have been that those methods simply did not work properly; but then why would they work for some models (including new ones) and not for others?

Anisotropy, the ideal suspect: if everything is similar, nothing is dissimilar

Then, someone came up, asking:
> Question/curiosity - why are GTE-ModernColBERT-v1's token vectors so much more anisotropic than colbertv2.0's?
> Mean pairwise cosine similarity for GTE-ModernColBERT-v1 on FreshStack datasets is ~0.9. For Colbertv2.0 it's ~0.2.
At that time, the answer seemed to be that it was probably related to the training setup, that also caused MUVERA to fail. However, it turns out it could be a key clue about why MUVERA is failing.

When we think about embeddings for retrieval, we love to imagine them scattered evenly across the hypersphere, so that proximity in the space is meaningful, so that we can divide the space into cells and search only a subset of them, and so that we can quantize the embeddings without introducing errors by collapsing two nearby embeddings.

image

Illustration of the decomposition of an embedding space into Voronoï cells, credits

However, the embeddings of most recent ColBERT models collapse into a narrow cone, all leaning in roughly the same direction (as indicated by the 0.9 mean cosine similarity). This means that most of the token embedding energy lies along a single shared axis (essentially the corpus mean) and only a thin residual encodes the discriminative information that is actually useful for retrieval.

| Model | Raw avg. cosine | Centered avg. cosine |
|---|---|---|
| colbertv2 | 0.2407 | 0.0005 |
| ColBERT-Zero | 0.6197 | 0.0040 |
| answerai-colbert-small | 0.8383 | 0.0046 |
| GTE-v1 | 0.8965 | 0.0107 |
| LateOn | 0.9458 | 0.0012 |

Mean pairwise cosine similarity between document tokens for various models on the NFCorpus dataset.
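
The two columns above are straightforward to measure. The sketch below is a minimal illustration on synthetic data, not the script used for the table: `mean_pairwise_cosine` and the `tokens` array are hypothetical names, and the cloud is built by hand as one large shared direction plus a small residual.

```python
import numpy as np

def mean_pairwise_cosine(x: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    # Exclude the diagonal: self-similarity is always 1.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

rng = np.random.default_rng(0)
dim, n_tokens = 128, 2000
shared = rng.normal(size=dim)
shared /= np.linalg.norm(shared)
# Synthetic anisotropic cloud: shared component of norm 3, residual of norm ~1.
tokens = 3.0 * shared + rng.normal(size=(n_tokens, dim)) / np.sqrt(dim)

raw = mean_pairwise_cosine(tokens)
centered = mean_pairwise_cosine(tokens - tokens.mean(axis=0, keepdims=True))
print(f"raw avgcos={raw:.3f}  centered avgcos={centered:.3f}")
```

On this toy cloud the raw value is high because every pair shares the same dominant component, while the centered value is close to zero, which mirrors the pattern in the table.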

image

Illustration of isotropic and anisotropic embedding distribution

This could explain why MUVERA and SMVE fall short with those models. Instead of relying on centroids (which would “eat up” the common direction), MUVERA partitions the space with random hyperplanes (à la SimHash) and pools the tokens that fall into each bucket into a Fixed Dimensional Encoding, while SMVE applies random rotations, keeps the largest projections per direction, and sums them into a sparse sketch. But random partitions only separate points that are actually separable in the space. When all of the embeddings lie within the same thin cone, a random hyperplane almost always leaves the whole cone on one side, and a random projection returns nearly the same value for every embedding. Applying these approaches to anisotropic ColBERT embeddings therefore yields degenerate outcomes: almost every token maps to the same code, so every document maps to nearly the same FDE/sparse vector, making the candidate generation step essentially random.

image

image

Illustration of the failure cases of MUVERA/SMVE on an anisotropic space
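
The collapse can be reproduced on synthetic data. The following sketch covers only the SimHash partitioning step, not a full MUVERA implementation; `simhash_buckets` and `largest_bucket_share` are hypothetical helper names, and the cone is built by hand so that its mean pairwise cosine is roughly 0.9.

```python
import numpy as np

def simhash_buckets(x: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Bucket id per row: the sign pattern of its projections, read as a binary number."""
    bits = (x @ hyperplanes.T > 0).astype(np.int64)        # (n, k)
    return bits @ (2 ** np.arange(hyperplanes.shape[0]))   # (n,)

def largest_bucket_share(x: np.ndarray, hyperplanes: np.ndarray) -> float:
    """Fraction of rows that fall into the most populated bucket."""
    counts = np.bincount(simhash_buckets(x, hyperplanes), minlength=2 ** len(hyperplanes))
    return float(counts.max() / len(x))

rng = np.random.default_rng(0)
dim, n_tokens, k = 128, 2000, 5
hyperplanes = rng.normal(size=(k, dim))        # k hyperplanes -> 2**k = 32 buckets

residual = rng.normal(size=(n_tokens, dim)) / np.sqrt(dim)   # well-spread cloud
shared = rng.normal(size=dim)
shared /= np.linalg.norm(shared)
cone = 3.0 * shared + residual                 # same cloud pushed into a narrow cone

share_spread = largest_bucket_share(residual, hyperplanes)
share_cone = largest_bucket_share(cone, hyperplanes)
print(f"largest bucket share: spread={share_spread:.2f}  cone={share_cone:.2f}")
```

With the spread cloud, tokens are distributed over all 32 buckets in roughly equal shares. With the cone, the shared component decides most sign bits, so a much larger share of tokens ends up in a single bucket.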

The exact reasons why the new ColBERT models are so anisotropic remain unknown and are surely an interesting avenue of research. Among the working hypotheses, we know that ModernBERT base embeddings are already somewhat anisotropic, which does not help (most modern ColBERT models are based on this backbone), but this cannot be the only explanation as ColBERT-small is not ModernBERT-based. Another possible factor is that, in ColBERT training, we only directly backpropagate into the tokens that have been selected by the MaxSim operator. While this helps stabilize training, it also means that the training signal for many tokens is very indirect: they are just “pulled” by the updates of other tokens. Finally, it seems that embedding models in general tend to become anisotropic.

One simple post-hoc solution is to center all of the embeddings by subtracting the mean vector (computed either within the document or over the collection). This reduces anisotropy by a very large factor and, because the cloud is now centered, random projections are much more likely to split it along the informative axes. This is not only theoretical: simply centering the embeddings before computing MUVERA/SMVE representations takes LateOn (our latest state-of-the-art ColBERT model) from plainly unusable to somewhat working, but still far from the target. Centering only partially solves the problem: we remove most of the noise coming from the shared direction, but the cloud is still not shaped to be projected efficiently into those representations. Although the dominant anisotropy is removed from the embeddings, they are still not efficiently separable through the random operations.

image

Illustration of why a centered space can still be difficult to project

| Model | Center | PLAID | Method | rk=0 | rk=50 | rk=100 | rk=200 | rk=400 | rk=1000 |
|---|---|---|---|---|---|---|---|---|---|
| LateOn | no | 55.28 | MUVERA | 2.89 (-52.4) | 9.46 (-45.8) | 12.53 (-42.8) | 15.87 (-39.4) | 20.29 (-35.0) | 27.52 (-27.8) |
| LateOn | no | 55.28 | SMVE | 1.51 (-53.8) | 7.59 (-47.7) | 10.84 (-44.4) | 14.24 (-41.0) | 18.29 (-37.0) | 25.10 (-30.2) |
| LateOn | yes | 55.28 | MUVERA | 32.66 (-22.6) | 46.22 (-9.1) | 48.68 (-6.6) | 50.01 (-5.3) | 50.85 (-4.4) | 52.09 (-3.2) |
| LateOn | yes | 55.28 | SMVE | 21.18 (-34.1) | 37.39 (-17.9) | 40.90 (-14.4) | 43.52 (-11.8) | 45.83 (-9.5) | 48.61 (-6.7) |

Results of LateOn models with non-centered and centered embeddings using MUVERA and SMVE. rk=k means we rerank the top-k candidates with full MaxSim.
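
To make the rk=k protocol concrete, here is a minimal sketch of two-stage retrieval. `maxsim` follows the standard ColBERT definition; `retrieve_then_rerank` is a hypothetical helper, and the mean-pooled codes are only a placeholder for real MUVERA/SMVE codes. Keeping the non-reranked tail in first-stage order is a choice made for this sketch.

```python
import numpy as np

def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
    """Late-interaction score: best-matching doc token per query token, summed."""
    return float((query @ doc.T).max(axis=1).sum())

def retrieve_then_rerank(query, docs, query_code, doc_codes, rk):
    """Stage 1 ranks documents by dot product of single-vector codes.
    Stage 2 rescores the top-rk candidates with full MaxSim (rk=0 skips it)."""
    first_stage = np.argsort(-(doc_codes @ query_code))
    if rk == 0:
        return first_stage
    head = first_stage[:rk]
    exact = np.array([maxsim(query, docs[i]) for i in head])
    reranked = head[np.argsort(-exact)]
    return np.concatenate([reranked, first_stage[rk:]])

rng = np.random.default_rng(0)
docs = [rng.normal(size=(rng.integers(5, 20), 32)) for _ in range(100)]
query = rng.normal(size=(4, 32))
# Placeholder codes (mean pooling); a real pipeline would use FDEs or sparse sketches.
doc_codes = np.stack([d.mean(axis=0) for d in docs])
query_code = query.mean(axis=0)

ranking = retrieve_then_rerank(query, docs, query_code, doc_codes, rk=10)
print(ranking[:10])
```

Reranking can only reorder what the first stage returns, which is why a degenerate first stage caps the quality at every rk in the table above.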

This is not inevitable, but post-hoc methods only go so far: we are limited by the space learned by the model.
This is why regularization objectives that attempt to spread the vectors across the hypersphere were proposed long ago and are still very much in use. Notably, the GOR objective, introduced in 2017, was used very recently in GemmaEmbeddings and Jina Embeddings to enhance quantization robustness.

The GOR objective is as simple as it gets: we don’t want two embeddings to be similar, so we will take the pairwise cosine similarity matrix between all the embeddings (Gram matrix) and try to push all the non diagonal terms to 0. This effectively tries to minimize any similarity and create an isotropic space.
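
As a rough illustration of that description, the sketch below penalizes the squared off-diagonal entries of the cosine Gram matrix. It follows the wording of this post rather than the exact formulation of the GOR paper, and `gor_penalty` is a hypothetical name.

```python
import numpy as np

def gor_penalty(x: np.ndarray) -> float:
    """Mean squared off-diagonal cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    gram = unit @ unit.T
    off_diag = gram[~np.eye(len(x), dtype=bool)]   # drop self-similarities
    return float(np.mean(off_diag ** 2))

rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 128))   # roughly isotropic cloud
cone = spread + 3.0                    # same cloud with a large shared offset

print(f"spread={gor_penalty(spread):.4f}  cone={gor_penalty(cone):.4f}")
```

The penalty is near zero for the spread cloud and large for the cone, so minimizing it pushes embeddings away from a shared direction.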

Thus, the obvious solution seemed to be regularizing the space when training ColBERT models to prevent them from using it suboptimally, which is why we started adding GOR to the training. However, contrary to dense models, where it makes sense to push all the (negative) in-batch vectors towards different parts of the space, the multi-vector setup raises some questions. A negative document can contain tokens that should actually be similar to tokens of the positive document. It makes sense for the word “cat” to be represented similarly even though the two documents do not cover the exact same topic. Take the example of “the cat drinks milk” and “the cat drinks water”. We don’t want the embeddings of “the”, “cat” and “drinks” to be radically different in the two examples; we want the distinction to be made by the “water”/“milk” token. This also makes sense after the projections: we want most of the codes of both representations to be similar because these are similar sequences, so they should be close in the projected spaces as well. This is achievable if most of the tokens are similar (and thus receive the same codes) and the representations only differ in the codes corresponding to the last token.
Also, considering that our problem is the anisotropy of document embeddings, should we also regularize the queries, or let them be shaped by the document regularization?
While trying various setups using all the possible pairwise similarity matrices and trying to balance the different weights to fix the issue without degrading the main performance of the model (without much success), a question arose… should we even bother?

The easiest way to optimize something is… to optimize it

Everything we’ve discussed so far consists of theories and hypotheses based on a few observations. Until now, we have no certainty beyond the fact that the embedding space is not suited and that we should regularize. What to regularize and how are still very much open questions.
So, instead of guessing, could we simply optimize the model to work well in our target representation space?
The main issue is that we can’t simply apply the contrastive loss to the projected codes and hope the gradients sort it out… because the construction of the representations blocks them. More precisely, the discretization step of both pipelines cuts the gradient flow.

In MUVERA, this step is the SimHash partition: each token is routed using the sign of its projection onto a few random hyperplanes, and those signs are stitched into a one-hot assignment to one of the buckets. But the sign function is flat on each side (−1 / +1) and only changes at the boundary, so moving a token within its cell changes nothing: the derivative is zero. It is this which-bucket assignment that blocks the gradient (the pooling that follows is a harmless sum).
For SMVE, after computing the random rotations and the associated projected values, we keep only the largest projections per direction and zero the rest to make it sparse. Which values survive depends on the ordering and thus, small changes are not reflected until the order changes, making the gradient zero again. In both cases, the gradient-killers are hard, piecewise constant selections.

If you have trained quantization-aware models, it might ring some bells as this is the exact same issue. Quantization transforms a continuous value onto a discrete grid of levels, rounding it to the nearest level and creating a staircase: flat between levels with zero gradient almost everywhere except at the boundaries. One very popular approach to perform Quantization Aware Training (QAT) is by using the Straight Through Estimator (STE). The idea is quite simple: we perform the gradient-breaking operation in the forward pass and compute the loss value as usual. The trick is during the backward pass: we pretend the hard selection operation was actually an identity operation gradient-wise and let the gradient flow as if nothing happened.

image

Illustration of the approximation used by the STE trick
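
The staircase and the straight-through trick fit in a few lines. This is a toy, scalar sketch with a hand-written backward pass (no autograd framework); the step size, target, and learning rate are arbitrary values chosen for illustration.

```python
import numpy as np

def quantize(x: float, step: float = 0.25) -> float:
    """Round to the nearest multiple of `step` (a staircase function)."""
    return float(np.round(x / step) * step)

# The true derivative of the staircase is zero away from the jumps.
eps = 1e-4
true_grad = (quantize(0.3 + eps) - quantize(0.3 - eps)) / (2 * eps)

# Gradient descent on L = 0.5 * (quantize(x) - target)**2 using the STE.
x, target, lr = 0.1, 1.0, 0.1
for _ in range(200):
    q = quantize(x)          # forward pass: hard rounding
    grad_q = q - target      # dL/dq
    grad_x = grad_q * 1.0    # backward pass: pretend dq/dx = 1
    x -= lr * grad_x

print(true_grad, quantize(x))
```

With the true (zero) derivative, `x` would never move. With the straight-through gradient, `x` keeps drifting across rounding thresholds until the quantized value reaches the target, at which point the gradient vanishes.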

In our case, we build the true MUVERA/SMVE code and score it using contrastive loss, so we are actually directly optimizing the representation we’ll then leverage at test-time. On the backward pass we treat the bucketing/top-k operation as identity on whatever it selected: each token receives the gradient of the bucket it landed in, and SMVE passes gradient through the surviving coordinates and zero through the dropped ones. The linear machinery around the selection (projections, rotation, pooling) keeps its exact gradient.
The loss gradient therefore flows straight back into the embeddings, and the model is trained to make its codes discriminative after the projection, optimizing the target metric directly instead of a hand-crafted proxy. Centering and GOR were us guessing at what a well-shaped space should look like and hoping the projections would cooperate; STE lets the projection itself tell the model what to fix.
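
The backward rules described above can be written out by hand. The sketch below is a simplified stand-in, not the training code: it shows hard SimHash routing followed by a per-bucket sum, and a plain top-k selection over a vector, each with its straight-through backward pass. All function names are hypothetical, and the top-k is a generic version rather than the exact SMVE selection.

```python
import numpy as np

def bucket_sum_forward(tokens, hyperplanes):
    """Hard SimHash routing followed by a sum per bucket."""
    bits = (tokens @ hyperplanes.T > 0).astype(np.int64)
    assign = bits @ (2 ** np.arange(hyperplanes.shape[0]))
    code = np.zeros((2 ** hyperplanes.shape[0], tokens.shape[1]))
    np.add.at(code, assign, tokens)          # sum tokens that share a bucket
    return code, assign

def bucket_sum_backward(grad_code, assign):
    """Routing is treated as fixed: each token receives the gradient
    of the bucket it landed in."""
    return grad_code[assign]

def topk_forward(values, k):
    """Keep the k largest entries, zero the rest."""
    mask = np.zeros(values.shape, dtype=bool)
    mask[np.argsort(-values)[:k]] = True
    return values * mask, mask

def topk_backward(grad_out, mask):
    """Gradient flows through the survivors, zero through dropped entries."""
    return grad_out * mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
hyperplanes = rng.normal(size=(3, 8))
code, assign = bucket_sum_forward(tokens, hyperplanes)
grad_code = rng.normal(size=code.shape)      # stand-in for dLoss/dcode
grad_tokens = bucket_sum_backward(grad_code, assign)

values = rng.normal(size=10)
sparse, mask = topk_forward(values, k=3)
grad_values = topk_backward(np.ones(10), mask)
```

In both cases the backward pass is the exact gradient of the forward pass with the selection frozen. What is missing is any signal about how the selection itself would change, which is the bias discussed next.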

It's worth noting this is obviously not free: the straight-through gradient is biased as the gradient we pretend exists is not the real (zero) derivative, and the term we drop is exactly the one that would push a token across a hyperplane or flip the ranking. This approximation can be noisy, but since we use it only as an additional regularization signal alongside the standard MaxSim contrastive objective, we still learn a genuinely good late-interaction space while pushing it toward something that projects cleanly.

image

Illustration of the STE trick

Talk is cheap, show me the results

| Model | Center | PLAID | Method | rk=0 | rk=50 | rk=100 | rk=200 | rk=400 | rk=1000 |
|---|---|---|---|---|---|---|---|---|---|
| LateOn | no | 55.28 | MUVERA | 2.89 (-52.4) | 9.46 (-45.8) | 12.53 (-42.8) | 15.87 (-39.4) | 20.29 (-35.0) | 27.52 (-27.8) |
| LateOn | no | 55.28 | SMVE | 1.51 (-53.8) | 7.59 (-47.7) | 10.84 (-44.4) | 14.24 (-41.0) | 18.29 (-37.0) | 25.10 (-30.2) |
| LateOn | yes | 55.28 | MUVERA | 32.66 (-22.6) | 46.22 (-9.1) | 48.68 (-6.6) | 50.01 (-5.3) | 50.85 (-4.4) | 52.09 (-3.2) |
| LateOn | yes | 55.28 | SMVE | 21.18 (-34.1) | 37.39 (-17.9) | 40.90 (-14.4) | 43.52 (-11.8) | 45.83 (-9.5) | 48.61 (-6.7) |
| MUVERA-reg | yes | 55.72 | MUVERA | 40.80 (-14.9) | 51.60 (-4.1) | 52.37 (-3.4) | 53.02 (-2.7) | 53.60 (-2.1) | 53.87 (-1.9) |
| MUVERA-reg | yes | 55.72 | SMVE | 37.18 (-18.5) | 48.97 (-6.8) | 50.45 (-5.3) | 51.51 (-4.2) | 52.40 (-3.3) | 53.50 (-2.2) |
| SMVE-reg | yes | 54.92 | MUVERA | 40.22 (-14.7) | 50.94 (-4.0) | 51.90 (-3.0) | 52.67 (-2.3) | 53.02 (-1.9) | 53.35 (-1.6) |
| SMVE-reg | yes | 54.92 | SMVE | 40.39 (-14.5) | 50.70 (-4.2) | 52.09 (-2.8) | 52.92 (-2.0) | 53.19 (-1.7) | 53.47 (-1.5) |

Results of the regularized models with both SMVE and MUVERA. rk=k means we rerank the top-k candidates with full MaxSim

As we found centering to be a very strong and cheap operation, all of our trainings and evaluations are run using mean-centered representations. The training setup is pretty simple: we take the unregularized unsupervised model and perform the same supervised training step, adding the gradient coming from STE as a regularization. The loss is (1 - alpha) * MaxSim contrastive loss + alpha * projected-representation contrastive loss. Note that this supervised stage is very cheap to run (we did not even run the full 3 epochs), so these results show what is obtainable with a very lightweight regularization.
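
The weighting of the two terms can be sketched as follows. The in-batch softmax cross-entropy used here is an assumed, common form of contrastive loss (the post does not spell out the exact one), the `alpha=0.1` value is arbitrary, and both function names are hypothetical.

```python
import numpy as np

def in_batch_contrastive(scores: np.ndarray) -> float:
    """Cross-entropy where scores[i, j] compares query i with document j
    and the positive for query i is document i."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def regularized_loss(maxsim_scores, projected_scores, alpha):
    """(1 - alpha) * MaxSim contrastive loss + alpha * projected-code contrastive loss."""
    return ((1 - alpha) * in_batch_contrastive(maxsim_scores)
            + alpha * in_batch_contrastive(projected_scores))

rng = np.random.default_rng(0)
# Toy score matrices: MaxSim separates positives well, projected codes less so.
maxsim_scores = rng.normal(size=(8, 8)) + 5.0 * np.eye(8)
projected_scores = rng.normal(size=(8, 8)) + 1.0 * np.eye(8)

loss = regularized_loss(maxsim_scores, projected_scores, alpha=0.1)
print(f"{loss:.4f}")
```

In training, `projected_scores` would come from the MUVERA or SMVE codes built in the forward pass, with gradients routed back through the straight-through estimator.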

The first very clear result is that these very simple regularizations achieve their goal and significantly boost the performance of SMVE/MUVERA over the unregularized (LateOn) model without degrading the PLAID results (meaning that having the space correctly shaped does not conflict with the MaxSim requirements). The base results (without any reranking) are much stronger by themselves, but it also shows that the gap with the full PLAID retrieval can be smaller when coupled to top-k reranking. Please note that we did not optimize the hyper-parameters of MUVERA/SMVE to boost performance. It is possible that using enough dimensions, we could totally close the gap (even with fewer reranked elements), but the goal of this study was simply to show that the models could be regularized, not study the trade-off between cost/performance of all of these methods.

The second (and probably most important) thing to note is that both regularizations help both methods: using MUVERA representations during training helps SMVE during evaluation and vice versa. This is a critical piece of information: it shows that the training did not optimize only one specific subspace, but improved compression globally. This verification mattered because, during training, the random elements are kept fixed (sampled once at the start of training). We tried resampling them at every step, but this seemed to make the training unstable because the signal from ever-changing spaces confused the model. Training on several bases by alternating every few steps could be a direction, but we were pleased to find that the model was not overfitting to these fixed elements. Although the two methods are vastly different and we used different seeds for both anyway, we also validated the results using various test seeds. The results show very little variance and, most importantly, the training seed is not the best of them, ruling out space-specific learning.

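| Method | rk | seed=1221* | seed=999 | seed=5555 | seed=31415 | mean | std |
|---|---|---|---|---|---|---|---|
| MUVERA | 0 | 40.80 | 41.61 | 40.42 | 40.97 | 40.95 | 0.43 |
| MUVERA | 200 | 53.02 | 53.11 | 53.36 | 53.20 | 53.17 | 0.13 |
| SMVE | 0 | 37.18 | 37.47 | 37.06 | 36.73 | 37.11 | 0.27 |
| SMVE | 200 | 51.51 | 51.98 | 51.71 | 51.65 | 51.71 | 0.17 |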

Results of the regularized models with MUVERA/SMVE projections across different random seeds.

We can see that the variance is not too big and is even smaller when we consider reranking. Interestingly, we can note that the seed used for the training (1221) is not even the strongest one of the 4 seeds evaluated, emphasizing that the variance comes more from the methods themselves rather than a specific overfitting.
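
As a quick arithmetic check, the mean and std columns can be recomputed from the per-seed scores. The reported std values appear to match the population standard deviation (NumPy's default `ddof=0`); that reading is an inference from the numbers, not something stated in the post.

```python
import numpy as np

# Per-seed scores copied from the table above (seeds 1221, 999, 5555, 31415).
scores = {
    ("MUVERA", 0): [40.80, 41.61, 40.42, 40.97],
    ("MUVERA", 200): [53.02, 53.11, 53.36, 53.20],
    ("SMVE", 0): [37.18, 37.47, 37.06, 36.73],
    ("SMVE", 200): [51.51, 51.98, 51.71, 51.65],
}

stats = {key: (float(np.mean(v)), float(np.std(v))) for key, v in scores.items()}
for (method, rk), (mean, std) in stats.items():
    print(f"{method} rk={rk}: mean={mean:.2f} std={std:.2f}")
```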

Besides specific projection directions, the model could also overfit to the specific MUVERA hyperparameters used during training (R=20, k=5, d=16, FDE dim=10,240), making the regularization ineffective at other operating points. To test this, we swept each MUVERA parameter independently at evaluation time while keeping the others fixed. MUVERA builds a fixed-dimensional encoding (FDE) by first partitioning token embeddings with SimHash, then pooling within each partition, projecting each pooled block to a lower-dimensional space, and finally concatenating the results across several independent repetitions. In this construction, R is the number of independent repetitions, k is the number of SimHash bits used to define the partitioning, and d is the per-bucket projection dimension after the inner random projection. Together, these parameters determine the FDE size: FDE dim = 2^k × d × R.
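
The construction and the size formula can be checked with a simplified sketch. Several details here are assumptions not stated in this post: the sum pooling for queries versus mean pooling for documents, the ±1 inner projection, and the absence of any empty-bucket handling. The `fde` function is a hypothetical helper, not a library API.

```python
import numpy as np

def fde(tokens, R, k, d, seed=0, is_query=True):
    """Simplified fixed-dimensional encoding of a (n_tokens, dim) matrix."""
    rng = np.random.default_rng(seed)   # queries and documents must share the seed
    dim = tokens.shape[1]
    blocks = []
    for _ in range(R):
        hyperplanes = rng.normal(size=(k, dim))
        projection = rng.choice([-1.0, 1.0], size=(dim, d)) / np.sqrt(d)
        # SimHash: k sign bits select one of 2**k buckets per token.
        bits = (tokens @ hyperplanes.T > 0).astype(np.int64)
        assign = bits @ (2 ** np.arange(k))
        pooled = np.zeros((2 ** k, dim))
        np.add.at(pooled, assign, tokens)
        if not is_query:
            counts = np.bincount(assign, minlength=2 ** k)
            pooled /= np.maximum(counts, 1)[:, None]    # mean per non-empty bucket
        blocks.append((pooled @ projection).ravel())    # 2**k * d values per repetition
    return np.concatenate(blocks)

rng = np.random.default_rng(1)
query_tokens = rng.normal(size=(32, 128))
doc_tokens = rng.normal(size=(180, 128))

q = fde(query_tokens, R=20, k=5, d=16, is_query=True)
p = fde(doc_tokens, R=20, k=5, d=16, is_query=False)
print(q.shape, float(q @ p))   # the dot product is the first-stage score
```

The output length is 2^k × d × R by construction, so the training configuration gives 32 × 16 × 20 = 10,240 dimensions, and changing any one parameter rescales the size as in the sweeps below.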

Across all configurations, from aggressive compression (FDE dim=1,024) to high capacity (FDE dim=20,480), the regularized model consistently outperforms the baseline. Notably, the improvement is largest at the most aggressive compression levels, far from the training configuration: at FDE dim ~1K the gap reaches +7 to +12 NDCG@10 points at rk=0 and +12 to +13 at rk=200, compared to +8 and +3 at the training dim of 10K. This confirms that the regularization teaches a general geometric property rather than optimizing for a specific set of projection parameters. Exploring whether training at a different capacity (e.g., a smaller FDE to force stronger compression, or a larger one to relax the constraint) could further improve results is an interesting direction, but orthogonal to our main finding and left for future work.

| Config | FDE dim | Model | rk=0 | rk=200 |
|---|---|---|---|---|
| R2_k5_d16 | 1024 | LateOn | 6.12 | 24.15 |
| R2_k5_d16 | 1024 | LateOn-regularized | 13.42 | 36.28 |
| R5_k5_d16 | 2560 | LateOn | 15.38 | 38.34 |
| R5_k5_d16 | 2560 | LateOn-regularized | 27.03 | 48.03 |
| R10_k5_d16 | 5120 | LateOn | 26.08 | 46.11 |
| R10_k5_d16 | 5120 | LateOn-regularized | 35.63 | 51.92 |
| R20_k5_d16 | 10240 | LateOn | 32.66 | 50.01 |
| R20_k5_d16 | 10240 | LateOn-regularized | 40.80 | 53.02 |
| R40_k5_d16 | 20480 | LateOn | 39.31 | 52.02 |
| R40_k5_d16 | 20480 | LateOn-regularized | 44.13 | 53.69 |

Sweep over R, the number of independent FDE repetitions. Increasing R gives more randomized views of the token set and increases the final FDE size linearly.

MUVERA k_sim sweep (R=20, d=16)

| Config | FDE dim | Model | rk=0 | rk=200 |
|---|---|---|---|---|
| R20_k3_d16 | 2560 | LateOn | 24.54 | 47.36 |
| R20_k3_d16 | 2560 | LateOn-regularized | 34.93 | 52.23 |
| R20_k4_d16 | 5120 | LateOn | 29.83 | 49.04 |
| R20_k4_d16 | 5120 | LateOn-regularized | 38.67 | 52.53 |
| R20_k5_d16 | 10240 | LateOn | 32.66 | 50.01 |
| R20_k5_d16 | 10240 | LateOn-regularized | 40.80 | 53.02 |
| R20_k6_d16 | 20480 | LateOn | 35.32 | 50.75 |
| R20_k6_d16 | 20480 | LateOn-regularized | 42.99 | 53.24 |

Sweep over k, the number of SimHash bits used for partitioning. Increasing k creates more buckets, since each repetition uses 2^k partitions, so the FDE size grows exponentially with k.

MUVERA d_proj sweep (R=20, k=5)

| Config | FDE dim | Model | rk=0 | rk=200 |
|---|---|---|---|---|
| R20_k5_d2 | 1280 | LateOn | 9.64 | 32.81 |
| R20_k5_d2 | 1280 | LateOn-regularized | 21.52 | 45.54 |
| R20_k5_d4 | 2560 | LateOn | 18.59 | 41.90 |
| R20_k5_d4 | 2560 | LateOn-regularized | 31.69 | 50.40 |
| R20_k5_d8 | 5120 | LateOn | 26.89 | 47.18 |
| R20_k5_d8 | 5120 | LateOn-regularized | 37.48 | 52.39 |
| R20_k5_d16 | 10240 | LateOn | 32.66 | 50.01 |
| R20_k5_d16 | 10240 | LateOn-regularized | 40.80 | 53.02 |
| R20_k5_d32 | 20480 | LateOn | 35.81 | 51.44 |
| R20_k5_d32 | 20480 | LateOn-regularized | 41.69 | 53.20 |

Sweep over d, the inner projection dimension used for each bucket representation before concatenation. Increasing d preserves more information per bucket and increases the FDE size linearly.

What did the regularization actually do?

Beyond giving us a model that works with MUVERA and SMVE, the STE training left behind a second gift: two checkpoints (before and after) that we can probe to understand what changed in the geometry of the space. We ran a battery of measurements on all documents and queries of SciFact, computing every metric on mean-centered representations. The findings are informative, and run directly counter to the narrative we built in the first half of this post.

I. Anisotropy was not the (only) problem (or at least, not the whole story)

Everything above pointed at anisotropy as the culprit. So we expected the regularized model to produce a more isotropic embedding space. It doesn't.

| Model | Uncentered | Centered |
|---|---|---|
| LateOn | 0.949 | 0.002 |
| LateOn-regularized | 0.960 | 0.004 |

Mean pairwise cosine similarity between document tokens, with and without regularization, on the NFCorpus dataset.

The regularized model is actually slightly more anisotropic than the original, in both the raw and centered spaces. The STE regularization, which dramatically improved MUVERA/SMVE quality, did not make the space more isotropic.

This makes sense in retrospect as we trained on centered representations, so the model never received any signal to reduce the dominant mean direction. But it means that anisotropy per se is not the damaging property, or at least not the only one (centering does a lot of the heavy lifting here). It was a useful clue that led us here, but the real mechanism is elsewhere.

(Note: the Iso-ModernColBERT model effectively reduced the anisotropy of ModernColBERT but, in our experiments, it seems to bring improvements when indexing non-centered embeddings while not being much better with centered embeddings. This confirms that anisotropy is an issue but not the only one, and that centering is a fair post-hoc fix for this part of the problem. However, it's possible that centering actually hurts the results (masked by the large gains it brings for MUVERA/SMVE) and that letting the regularization "learn" the centering might lead to a better solution.)

II. Fewer effective dimensions, not more

While searching for what actually moved during the training, we measured the Stable Rank (SR), a smooth estimate of how many directions a set of vectors effectively spans. We computed it at two levels: per-document and corpus-wide.

| Metric | LateOn | LateOn-regularized | Change |
|---|---|---|---|
| Per-doc stable rank | 4.88 | 3.85 | −21% |
| Corpus stable rank | 15.4 | 11.4 | −26% |
| Top-1 eigenvalue fraction (per-doc) | 21.3% | 27.0% | +27% |
| Top-3 eigenvalue fraction (per-doc) | 43.0% | 50.7% | +18% |

Stable rank comparison of the model with regularization at the document and corpus level.

Both stable ranks decreased. Per-document, tokens that spanned ~5 effective directions now span ~4. At the corpus level, the cloud contracts from ~15 to ~11 effective dimensions. The top eigenvalue captures a larger share of the variance, meaning each document's token information concentrates into fewer principal axes.
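
For reference, both metrics can be computed from singular values. The sketch assumes the standard definition of stable rank (squared Frobenius norm over squared spectral norm) and expects inputs that are already mean-centered; `stable_rank` and `top_eig_fraction` are hypothetical helper names.

```python
import numpy as np

def stable_rank(x: np.ndarray) -> float:
    """Sum of squared singular values divided by the largest squared singular value."""
    s2 = np.linalg.svd(x, compute_uv=False) ** 2
    return float(s2.sum() / s2.max())

def top_eig_fraction(x: np.ndarray, top: int = 1) -> float:
    """Share of total variance carried by the `top` leading principal directions."""
    s2 = np.linalg.svd(x, compute_uv=False) ** 2   # sorted in descending order
    return float(s2[:top].sum() / s2.sum())

flat = np.eye(4)                        # four equally important directions
peaked = np.diag([2.0, 1.0, 1.0, 0.0])  # one dominant direction

print(stable_rank(flat), stable_rank(peaked))
print(top_eig_fraction(flat), top_eig_fraction(peaked))
```

For a single matrix, the stable rank is the inverse of the top-1 eigenvalue fraction. The table reports averages over documents, so its rows are close to, but not exactly, inverses of each other.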

This is the exact opposite of what a GOR-type regularization would do. GOR pushes tokens apart to fill the space uniformly and increase stable rank. STE did the reverse: it collapsed each document's tokens into fewer dimensions. And yet this is what made SMVE and MUVERA work dramatically better.

(Note: the SR results are also interesting in themselves. They mean we are encoding and storing full-size vectors while the effective dimensionality is much lower.)

Why fewer dimensions help random projections. The SMVE score between a query and a document is a sum over a fixed number of random projection directions. Each term captures one random "glimpse" of the query–document interaction. When the token cloud spans many effective dimensions, those glimpses are spread thin: each captures a small, noisy slice, and the sum has high variance as an estimator of the true MaxSim score. When the cloud spans fewer dimensions, the same number of glimpses oversample the space more densely. Each captures a cleaner slice, and the sum converges to a tighter estimate.

The width sweep confirms this directly. We ran SMVE at different numbers of random projections (from 128 to 1024):

| Width | LateOn | LateOn-regularized | Ratio |
|---|---|---|---|
| 128 | 0.013 | 0.027 | 2.2× |
| 256 | 0.057 | 0.100 | 1.8× |
| 512 | 0.096 | 0.177 | 1.9× |
| 1024 | 0.209 | 0.269 | 1.3× |

Performance of the models under different widths for SMVE projections.

The regularized model wins at every width, but the advantage narrows as width increases. With enough random projections, even a higher-dimensional space gets adequately sampled, so the regularization matters less. This is exactly the pattern you'd expect if the mechanism is better coverage of a lower-dimensional subspace.

III. Learned concentration vs. naive compression

If fewer dimensions help, a natural question arises: why not just project the embeddings onto their top PCA directions post-hoc to reduce dimensionality? We tested this by taking LateOn's tokens, projecting them onto the top-k corpus principal components, and measuring SMVE quality:

| Corpus PCA-k | LateOn SMVE | % of original |
|---|---|---|
| 4 | 0.007 | 7% |
| 8 | 0.010 | 10% |
| 16 | 0.014 | 15% |
| 32 | 0.036 | 38% |
| 64 | 0.067 | 70% |
64 0.067 70%

Performance of LateOn (unregularized) when using the embeddings projected using PCA.
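
The post-hoc projection tested here can be sketched with an SVD. This is an illustration on random data: `project_top_k` is a hypothetical helper, and projecting back into the original coordinates (rather than keeping k-dimensional coefficients) is a choice made for this sketch that leaves dot products between projected vectors unchanged.

```python
import numpy as np

def project_top_k(tokens: np.ndarray, k: int) -> np.ndarray:
    """Project mean-centered tokens onto their top-k principal directions."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                       # (k, dim), orthonormal rows
    return centered @ basis.T @ basis    # same shape as the input, rank <= k

rng = np.random.default_rng(0)
tokens = rng.normal(size=(500, 64))
low_rank = project_top_k(tokens, k=8)

centered = tokens - tokens.mean(axis=0, keepdims=True)
retained = float((low_rank ** 2).sum() / (centered ** 2).sum())
print(low_rank.shape, np.linalg.matrix_rank(low_rank), f"{retained:.2f}")
```

PCA keeps the directions with the most variance, which says nothing about whether those directions carry the signal that separates relevant from irrelevant documents.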

The results are clear: naive low-rank projection is catastrophic. Keeping only 4 PCA directions destroys 93% of SMVE quality; even 16 directions lose 85%. Simply squashing the space into fewer dimensions wipes out the retrieval signal along with the noise.

The regularized model achieves a comparable reduction in effective dimensionality, yet its SMVE quality improves by 85%. The difference: the model learned which dimensions to concentrate into. The STE gradient tells each token which projection bucket it landed in, so the model can rearrange its embedding space to place discriminative information along the axes that survive the projection rather than along arbitrary high-variance directions that happen to capture the most energy in an unregularized space.

Query–document alignment measurements confirm this. We measured how much query token energy falls within the top principal components of the document token cloud:

| Metric | LateOn | LateOn-regularized |
|---|---|---|
| Query energy in top-4 doc PCs | 6.1% | 9.1% |
| Query energy in top-16 doc PCs | 24.2% | 29.0% |

Percentage of the energy of queries when projected into the top-k components of the documents
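
One way to compute such an alignment score is shown below. It assumes both inputs are already mean-centered and aggregates energy over all query tokens at once; whether the original measurement averaged per token instead is not stated. `energy_in_top_k` is a hypothetical helper name.

```python
import numpy as np

def energy_in_top_k(queries: np.ndarray, docs: np.ndarray, k: int) -> float:
    """Fraction of squared query norm lying in the top-k principal
    directions of the document token cloud."""
    _, _, vt = np.linalg.svd(docs, full_matrices=False)
    basis = vt[:k]                                    # (k, dim), orthonormal rows
    return float(((queries @ basis.T) ** 2).sum() / (queries ** 2).sum())

rng = np.random.default_rng(0)
scales = np.array([10.0] * 4 + [1.0] * 12)            # docs dominated by 4 directions
docs = rng.normal(size=(1000, 16)) * scales
aligned_queries = rng.normal(size=(50, 16)) * scales  # same dominant subspace
generic_queries = rng.normal(size=(50, 16))           # no preferred direction

aligned = energy_in_top_k(aligned_queries, docs, k=4)
generic = energy_in_top_k(generic_queries, docs, k=4)
print(f"aligned={aligned:.2f}  generic={generic:.2f}")
```

Queries that share the documents' dominant subspace score close to 1, while directionless queries score near k divided by the dimension.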

Queries concentrate more of their energy into the same principal components that documents use. When both live in a more shared subspace, random projections through that subspace capture retrieval-relevant signals on both sides simultaneously, making each SMVE/MUVERA term more informative.

This is why directly optimizing the target with STE is so much more effective than geometric proxies: the projection loss captures the full chain from geometry to score. Centering and GOR are educated guesses about what a well-shaped space should look like. STE lets the projection itself tell the model what to fix.

Takeaways and what's next

Four things to remember from this work:

  1. New ColBERT models are highly anisotropic, which breaks ANN candidate-generation methods like MUVERA and SMVE that rely on random projections to separate embeddings.
  2. Mean-centering is a strong baseline that removes the dominant shared direction, but it's not enough on its own as the residual geometry still doesn't project cleanly.
  3. STE-based regularization fixes the problem, not by spreading embeddings uniformly across the sphere as expected, but by teaching the model to concentrate discriminative information into a low-dimensional subspace that random projections can efficiently capture, the opposite of what isotropy-based intuition predicts.
  4. The regularization is generic: it transfers across methods (MUVERA-trained helps SMVE and vice-versa), across random seeds, and across hyperparameter configurations far from the training setup.

These findings also connect to a broader question about the structure of ColBERT embedding spaces. As we showed in our work on hierarchical pooling, much of the per-token information in a document is redundant. The stable rank results here reinforce that picture and suggest that the effective dimensionality of these spaces is much lower than the nominal embedding dimension. We believe this work is directly related to the recent line of work that learns SAEs over ColBERT models to create sparse representations. Because MUVERA/SMVE are not learned, we have to make sure the representations are directly usable. We believe that understanding the effect of this kind of regularization, and uncovering the underlying structures that make embedding spaces compressible (SMVE/MUVERA, pooling, quantization) and let us expose them through index types other than MaxSim, could be very useful in building the next generation of models and indexes.

Get started now:

Citation

LateOn Regularization

@misc{chaffin2026lateonregularization,
  title={Party is over: regularizing ColBERT models to fix efficient ANN methods},
  author={Chaffin, Antoine},
  year={2026},
  howpublished={\url{https://huggingface.co/blog/lightonai/lateon-regularization}},
}

DenseOn and LateOn

@misc{sourty2026denseonlateon,
  title={DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models},
  author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Demoura, Paulo and Chatelain, Amelie},
  year={2026},
  howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
}
