arxiv:2606.07502

Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Published on Jun 5

· Submitted by

Songhao Wu on Jun 8

#1 Paper of the day

Authors:

Songhao Wu ,

Abstract

Text embeddings from large language models are enhanced by EmbedFilter, a linear transformation that reduces the influence of high-frequency tokens and improves semantic representations while enabling dimensionality reduction.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to directly refine text embeddings derived from LLMs. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent subspace that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
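The idea in the abstract can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation: the abstract does not say how the frequent-token subspace is identified, so the sketch simply assumes it is the span of the unembedding rows of a hand-picked set of frequent tokens. The matrix values, the token list, and every function name (`orthonormal_basis`, `vocab_logits`, `filter_embedding`, `reduce_dimension`, `top_token`) are hypothetical and exist only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis

def vocab_logits(embedding, unembedding):
    """Project an embedding onto vocabulary space: one score per token row."""
    return [dot(row, embedding) for row in unembedding]

def filter_embedding(embedding, basis):
    """Remove the component of `embedding` that lies in span(basis).
    Assumes `basis` is orthonormal."""
    out = list(embedding)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

def reduce_dimension(embedding, basis):
    """Coordinates of the filtered embedding in an orthonormal basis of the
    orthogonal complement of span(basis); dot products between filtered
    embeddings are unchanged by this change of coordinates."""
    dim = len(embedding)
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    complement = orthonormal_basis(list(basis) + identity)[len(basis):]
    return [dot(embedding, c) for c in complement]

# Toy unembedding matrix (vocab_size x hidden_dim); all values are made up.
vocab = ["the", ",", "cat", "dog"]
W_U = [
    [1.0, 1.0, 0.0, 0.0],   # "the"
    [1.0, -1.0, 0.0, 0.0],  # ","
    [0.0, 0.0, 1.0, 0.0],   # "cat"
    [0.0, 0.0, 0.0, 1.0],   # "dog"
]
frequent_ids = [0, 1]  # assumption: frequent tokens are known in advance

def top_token(vec):
    """Token whose unembedding row scores highest against `vec`."""
    logits = vocab_logits(vec, W_U)
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

embedding = [3.0, 2.0, 1.0, 0.5]  # made-up text embedding
basis = orthonormal_basis([W_U[i] for i in frequent_ids])
filtered = filter_embedding(embedding, basis)
reduced = reduce_dimension(embedding, basis)

print("top token before filtering:", top_token(embedding))
print("top token after filtering: ", top_token(filtered))
print("dimension:", len(embedding), "->", len(reduced))
```

In this toy setup the raw embedding scores highest on a frequent token, while the filtered one scores highest on a content token, and the filtered vector can be stored with fewer coordinates. Both steps are linear maps, so in practice they could be folded into a single matrix applied to every embedding.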

Community

shwu

Paper author Paper submitter about 20 hours ago

edited about 20 hours ago

We show that the unembedding matrix within LLMs serves as an overlooked feature extractor that comes for free. It encodes a latent semantic space; filtering out its effects from the primary text embeddings markedly improves zero-shot representation performance. We also empirically confirm that this can be achieved through a simple linear transformation, which reduces vector dimensionality as a bonus.

noahml

about 11 hours ago

Cool paper - I liked the way "Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings" frames the problem without making it feel too abstract.

Curious if you think this would still work once the setup gets messier in the wild?

I made a podcast on it with ResearchPod; it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/e08d64a5-f5d5-4555-9d1f-db797b88cc1b

shwu

Paper author about 10 hours ago

Thanks for checking it out! We found this phenomenon holds up across a wide range of LLMs and even embedding models trained via contrastive learning (where the unembedding matrix wasn't part of the training).

To your point about the "wild": we've actually used these insights to develop a new training framework aimed at improving text embedding models. We hope this will help boost embedding capabilities in much "messier" real-world scenarios.

Stay tuned, it's coming soon!