AstroBERT Small Embeddings

This is an AstroBERT Small model fine-tuned using sentence-transformers. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

The training dataset was generated using a random sample of arXiv abstracts labeled as astro-ph.

The model was trained by distilling embeddings from the larger Qwen3-Embedding-8B model using EmbedDistillLoss over the generated training dataset.

As noted in the paper Well-Read Students Learn Better: On the Importance of Pre-training Compact Models, it's important that the base model is pretrained on a large corpus of relevant documents prior to distillation.

Usage (txtai)

This model can be used to build embeddings databases with txtai for semantic search and/or as a knowledge source for retrieval-augmented generation (RAG).

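import txtai

# content=True stores the original text alongside the vectors
embeddings = txtai.Embeddings(path="neuml/astrobert-small-embeddings", content=True)

# documents() is a placeholder for your own iterable of texts to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")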

Usage (Sentence-Transformers)

Alternatively, the model can be loaded with sentence-transformers.

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("neuml/astrobert-small-embeddings")
embeddings = model.encode(sentences)
print(embeddings)
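
The vectors returned by model.encode can be compared with cosine similarity to rank texts against a query. The following is a minimal sketch of that comparison in plain Python. The tiny 3-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and cosine and rank are hypothetical helper names, not part of sentence-transformers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Return (index, score) pairs sorted by descending similarity to the query."""
    scores = [(i, cosine(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up vectors standing in for model.encode(...) output
query = [1.0, 0.0, 0.0]
documents = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

for index, score in rank(query, documents):
    print(index, round(score, 3))
```

With real embeddings, the same ranking logic applies unchanged; only the vector length differs.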

Usage (Hugging Face Transformers)

The model can also be used directly with Transformers.

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def meanpooling(output, mask):
    embeddings = output[0] # First element of the model output contains all token embeddings
    mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
    return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("neuml/astrobert-small-embeddings")
model = AutoModel.from_pretrained("neuml/astrobert-small-embeddings")

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    output = model(**inputs)

# Perform pooling. In this case, mean pooling.
embeddings = meanpooling(output, inputs['attention_mask'])

print("Sentence embeddings:")
print(embeddings)
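
To see why meanpooling multiplies by the attention mask, consider a padded sequence: without the mask, padding positions would be averaged in with the real tokens. The sketch below reproduces the same computation for a single sequence in plain Python with made-up 2-dimensional token vectors. The name masked_mean and the values are assumptions used only for illustration.

```python
def masked_mean(tokens, mask):
    """Average token vectors, counting only positions where mask is 1."""
    dims = len(tokens[0])
    total = [0.0] * dims
    for vector, keep in zip(tokens, mask):
        for d in range(dims):
            total[d] += vector[d] * keep
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero
    count = max(sum(mask), 1e-9)
    return [value / count for value in total]

# Two real tokens followed by one padding token (made-up values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(tokens, mask))       # padding ignored
print(masked_mean(tokens, [1, 1, 1]))  # padding wrongly included
```

The first call averages only the two real tokens, while the second shows how an unmasked average is skewed by the padding position.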

Evaluation Results

A BEIR-compatible dataset was generated to facilitate the evaluation process. This is a separate random sample of Wikipedia articles alongside generated user queries.

Evaluation results are shown below. NDCG is used as the evaluation metric.
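
For reference, NDCG (normalized discounted cumulative gain) rewards rankings that place relevant documents near the top. The sketch below computes it in plain Python for a single query. The cutoff k=10, the linear gain, and the example relevance labels are assumptions made for illustration, since the variant and cutoff used for the evaluation are not stated here. The function returns a value between 0 and 1, whereas the scores reported in the table appear to be on a 0 to 100 scale.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance labels in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the returned ranking divided by the best achievable DCG."""
    ideal = sorted(all_relevances, reverse=True)[:k]
    best = dcg(ideal)
    return dcg(ranked_relevances[:k]) / best if best else 0.0

# Made-up example: one relevant document exists for the query
all_relevances = [1, 0, 0, 0]

print(ndcg([1, 0, 0, 0], all_relevances))  # relevant document ranked first
print(ndcg([0, 1, 0, 0], all_relevances))  # relevant document ranked second
```

A benchmark score is typically the mean of this per-query value over all queries.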

Model	Parameters	NDCG	Index Time	Search Time	Disk
AstroBERT Small Embeddings	22.7M	69.09	9.9s	0.42s	16 MB
all-MiniLM-L6-v2	22.7M	40.45	12.50s	0.38s	16 MB
DenseOn	149M	61.46	67.35s	0.77s	31 MB
EmbeddingGemma	300M	57.44	86.17s	1.43s	31 MB
Qwen3-Embedding-0.6B	600M	65.73	114.17s	2.20s	41 MB
Qwen3-Embedding-4B	4000M	71.14	545.28s	9.89s	103 MB
Qwen3-Embedding-8B	8000M	73.84	941.82s	17.24s	164 MB

This model performs well for its size. It beats the same-sized all-MiniLM-L6-v2 model by a wide margin (69.09 vs. 40.45 NDCG). It also beats the 600M-parameter Qwen3-Embedding-0.6B model, which is over 25x larger (69.09 vs. 65.73). It scores about 4.75 points lower than the model it was distilled from, Qwen3-Embedding-8B (73.84).
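
The comparisons above follow directly from the table. As a quick check, this snippet recomputes the size ratios and score differences from the reported values. The dictionary layout and variable names are just for illustration.

```python
# (parameters in millions, NDCG) copied from the results table
results = {
    "AstroBERT Small Embeddings": (22.7, 69.09),
    "all-MiniLM-L6-v2": (22.7, 40.45),
    "Qwen3-Embedding-0.6B": (600, 65.73),
    "Qwen3-Embedding-8B": (8000, 73.84),
}

base_params, base_ndcg = results["AstroBERT Small Embeddings"]

for name, (params, ndcg) in results.items():
    size_ratio = params / base_params
    delta = base_ndcg - ndcg
    print(f"{name}: {size_ratio:.1f}x parameters, NDCG difference {delta:+.2f}")
```

By these figures, the 0.6B model is roughly 26x larger and the 8B teacher roughly 352x larger than this model.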

The model is well suited to CPU-only setups, giving up little accuracy relative to much larger models. It shows how small models can excel in specialized domains while requiring less compute and disk space.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

More Information

Read more about the model in this article.

Model tree for NeuML/astrobert-small-embeddings

Base model: NeuML/astrobert-small
Fine-tuned: this model

Paper for NeuML/astrobert-small-embeddings

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv:1908.08962, published Aug 23, 2019.