AstroBERT Small Embeddings

This is an AstroBERT Small model fine-tuned using sentence-transformers. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
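
To make the semantic search use case concrete, here is a minimal sketch of ranking by cosine similarity. The 3-dimensional vectors are made-up stand-ins for the model's 384-dimensional outputs, and the `cosine` and `rank` helpers are hypothetical names, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query, corpus):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query, vector)) for i, vector in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real sentence embeddings (illustrative only)
corpus = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.8, 0.6, 0.0]

results = rank(query, corpus)
print(results)
```

With real embeddings the same ranking step applies; only the vectors change.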

The training dataset was generated using a random sample of ArXiv abstracts labeled as astro-ph.

The model was trained by distilling embeddings from the larger Qwen3-Embedding-8B model using EmbedDistillLoss over the generated training dataset.
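
The idea behind embedding distillation can be illustrated with a toy objective. The sketch below is an assumption for illustration only: it is not the definition of EmbedDistillLoss. Because a student and teacher may have different output dimensions, it compares pairwise similarity structure rather than raw vectors. All function names and numbers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    """All pairwise cosine similarities within one batch of vectors."""
    return [[cosine(a, b) for b in vectors] for a in vectors]

def similarity_distill_loss(student, teacher):
    """Mean squared error between student and teacher similarity matrices."""
    s, t = similarity_matrix(student), similarity_matrix(teacher)
    n = len(student)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Toy batch: teacher vectors are 4-dimensional, student vectors 2-dimensional
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
aligned_student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed_student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

aligned_loss = similarity_distill_loss(aligned_student, teacher)
collapsed_loss = similarity_distill_loss(collapsed_student, teacher)
print(aligned_loss)    # near zero: the student preserves the teacher's structure
print(collapsed_loss)  # larger: every student vector is identical
```

A training loop would minimize a loss of this kind with respect to the student's weights while the teacher's embeddings stay fixed.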

As noted in the paper "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models", it's important that the base model is pretrained on a large corpus of relevant documents prior to distillation.

Usage (txtai)

This model can be used to build embeddings databases with txtai for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).

import txtai

embeddings = txtai.Embeddings(path="neuml/astrobert-small-embeddings", content=True)
# documents() is a placeholder for an iterable of texts to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
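
For the RAG use case, search results are typically folded into a prompt. The sketch below assumes results shaped like the dictionaries txtai returns when `content=True` (keys `id`, `text`, `score`). The sample results, their scores, and the `build_prompt` helper are hypothetical, and no model or index is loaded.

```python
def build_prompt(question, results, limit=3):
    """Format the top search results as numbered context for an LLM prompt."""
    context = "\n".join(
        f"[{i + 1}] {result['text']}" for i, result in enumerate(results[:limit])
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical results in the shape of embeddings.search(...) output
results = [
    {"id": "0", "text": "Type Ia supernovae are used as standard candles.", "score": 0.82},
    {"id": "1", "text": "Cepheid variables follow a period-luminosity relation.", "score": 0.74},
]

prompt = build_prompt("How are cosmic distances measured?", results)
print(prompt)
```

The resulting string can be passed to whichever LLM the application uses.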

Usage (Sentence-Transformers)

Alternatively, the model can be loaded with sentence-transformers.

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("neuml/astrobert-small-embeddings")
embeddings = model.encode(sentences)
print(embeddings)

Usage (Hugging Face Transformers)

The model can also be used directly with Transformers.

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def meanpooling(output, mask):
    embeddings = output[0]  # First element of output contains all token embeddings
    mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
    return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("neuml/astrobert-small-embeddings")
model = AutoModel.from_pretrained("neuml/astrobert-small-embeddings")

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    output = model(**inputs)

# Perform pooling. In this case, mean pooling.
embeddings = meanpooling(output, inputs['attention_mask'])

print("Sentence embeddings:")
print(embeddings)
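
To see why the attention mask matters in the pooling step, here is a small pure-Python sketch of the same masked average on made-up numbers. It mirrors the logic of `meanpooling` above for a single sequence but is not a replacement for it. The `masked_mean` name and the values are illustrative.

```python
def masked_mean(token_embeddings, mask):
    """Average token vectors, skipping positions where mask is 0 (padding)."""
    dims = len(token_embeddings[0])
    count = max(sum(mask), 1e-9)  # guard against division by zero
    return [
        sum(vector[d] * m for vector, m in zip(token_embeddings, mask)) / count
        for d in range(dims)
    ]

# One sequence with two real tokens and one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
unmasked = masked_mean(tokens, [1, 1, 1])
print(pooled)    # padding token is ignored
print(unmasked)  # padding token skews the average
```

Without the mask, padding positions would pull every sentence embedding toward whatever values the model emits for padding tokens.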

Evaluation Results

A BEIR-compatible dataset was generated to facilitate the evaluation process. This is a separate random sample of astro-ph ArXiv abstracts alongside generated user queries.

Evaluation results are shown below. NDCG is used as the evaluation metric.
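
For reference, NDCG compares the discounted gain of a ranking against the best possible ordering of the same items. The sketch below is a generic implementation of the standard formula with linear gain. The cutoff `k` and the relevance labels are assumptions, since the evaluation setup is not specified here beyond the metric name.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    """DCG normalized by the ideal ordering of the same relevance labels.

    Assumes every relevant document appears somewhere in `relevances`.
    """
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# Relevance labels of returned documents, in ranked order (illustrative)
relevant_first = [1, 0, 0]
relevant_second = [0, 1, 0]

print(ndcg(relevant_first))
print(ndcg(relevant_second))
```

The values this function returns lie between 0 and 1. The scores reported in this section appear to be on a 0 to 100 scale.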

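| Model                      | Parameters | NDCG  | Index Time | Search Time | Disk   |
|----------------------------|------------|-------|------------|-------------|--------|
| AstroBERT Small Embeddings | 22.7M      | 69.09 | 9.9s       | 0.42s       | 16 MB  |
| all-MiniLM-L6-v2           | 22.7M      | 40.45 | 12.50s     | 0.38s       | 16 MB  |
| DenseOn                    | 149M       | 61.46 | 67.35s     | 0.77s       | 31 MB  |
| EmbeddingGemma             | 300M       | 57.44 | 86.17s     | 1.43s       | 31 MB  |
| Qwen3-Embedding-0.6B       | 600M       | 65.73 | 114.17s    | 2.20s       | 41 MB  |
| Qwen3-Embedding-4B         | 4000M      | 71.14 | 545.28s    | 9.89s       | 103 MB |
| Qwen3-Embedding-8B         | 8000M      | 73.84 | 941.82s    | 17.24s      | 164 MB |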

This model is a solid performer for its size. It beats the same-sized all-MiniLM-L6-v2 model by a wide margin (69.09 vs. 40.45 NDCG). It also beats the 600M parameter Qwen3-Embedding-0.6B model, which is over 25x larger. Only Qwen3-Embedding-4B (71.14) and the model it was distilled from, Qwen3-Embedding-8B (73.84), score higher.
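
The size and score comparisons follow directly from the reported results. The short calculation below only reproduces that arithmetic; the variable names are illustrative.

```python
# Parameter counts (millions) and NDCG scores copied from the results table
params = {
    "AstroBERT Small Embeddings": 22.7,
    "Qwen3-Embedding-0.6B": 600,
    "Qwen3-Embedding-8B": 8000,
}
ndcg = {
    "AstroBERT Small Embeddings": 69.09,
    "all-MiniLM-L6-v2": 40.45,
    "Qwen3-Embedding-0.6B": 65.73,
    "Qwen3-Embedding-8B": 73.84,
}

student = "AstroBERT Small Embeddings"
size_ratio = params["Qwen3-Embedding-0.6B"] / params[student]
gain_vs_minilm = ndcg[student] - ndcg["all-MiniLM-L6-v2"]
gap_to_teacher = ndcg["Qwen3-Embedding-8B"] - ndcg[student]

print(f"Qwen3-Embedding-0.6B is {size_ratio:.1f}x larger")
print(f"Gain over all-MiniLM-L6-v2: {gain_vs_minilm:.2f} NDCG points")
print(f"Gap to Qwen3-Embedding-8B: {gap_to_teacher:.2f} NDCG points")
```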

This is a great model that can be used in CPU-only setups without trading off much on the accuracy front. It shows how small models can excel at specialized domains, requiring less compute and disk space.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

More Information

Read more about the model in this article.
