# BiomedBERT Small: Medical models at 22.7M parameters
These models are solid performers in both speed and accuracy. The dense embeddings model even beats the original PubMedBERT Embeddings model across the board at only 20% of the parameters. It also clearly outperforms all-MiniLM-L6-v2, a commonly used small model of roughly the same size, meaning these models are practical for CPU-only environments.
The following new models are released as part of this effort. All models have an Apache 2.0 license.
| Model | Description |
|---|---|
| biomedbert-small | Base 22.7M parameter language model |
| biomedbert-small-embeddings | Small Sentence Transformers model for embeddings |
| biomedbert-small-colbert | Late interaction (ColBERT) small model |
| biomedbert-base-embeddings | Improved Base Sentence Transformers model for embeddings |
## Building a Strong Baseline
In order to create task-specific models, a strong baseline is necessary. A 22.7M parameter BERT encoder-only model was trained on data from PubMed. The raw data was transformed using PaperETL with the results stored as a local dataset via the Hugging Face Datasets library. Masked language modeling was the training objective.
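The masked language modeling objective can be sketched in a few lines of PyTorch: a fraction of input tokens is replaced with a mask token and the loss is computed only at the masked positions. The 15% mask rate and token ids below are illustrative, not the exact training configuration.

```python
import torch

def mask_tokens(input_ids, mask_token_id=103, mask_prob=0.15):
    """Illustrative MLM masking: returns masked inputs and labels.

    Labels are -100 (ignored by cross entropy) everywhere except
    the masked positions, which keep the original token id.
    """
    labels = input_ids.clone()

    # Sample positions to mask
    mask = torch.rand(input_ids.shape) < mask_prob
    labels[~mask] = -100                # only masked positions contribute to the loss

    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels

torch.manual_seed(0)
ids = torch.randint(1000, 2000, (2, 16))    # stand-in token ids
inputs, labels = mask_tokens(ids)
print((labels != -100).float().mean())      # fraction of tokens masked
```

In practice, this masking is handled by `DataCollatorForLanguageModeling` from the Transformers library rather than hand-rolled code.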
After training, the model was evaluated using this Medical Abstracts Text Classification Dataset. A handful of biomedical models and general models were selected for comparison.
| Model | Parameters | Accuracy | Loss |
|---|---|---|---|
| biomedbert-hash-nano | 0.969M | 0.6195 | 0.9464 |
| biomedbert-small | 22.7M | 0.6274 | 0.8647 |
| bert-base-uncased | 110M | 0.6118 | 0.9712 |
| biomedbert-base | 110M | 0.6195 | 0.9037 |
| ModernBERT-base | 149M | 0.5672 | 1.1079 |
| BioClinical-ModernBERT-base | 149M | 0.5679 | 1.0915 |
As shown above, this model outperforms models many times its size, making it a strong baseline.
## Training a Small Embeddings model
With this strong baseline and teacher model, we can now train a small embeddings model.
biomedbert-small-embeddings was trained using Sentence Transformers. The training dataset was generated using a random sample of PubMed title-abstract pairs along with similar title pairs.
The training workflow was a multi-step distillation process as follows.
- Distill embeddings from the larger pubmedbert-base-embeddings model using this model distillation script from Sentence Transformers.
- Build a distilled dataset of teacher scores using the `biomedbert-base-reranker` cross-encoder for a separate random sample of title-abstract pairs.
- Further fine-tune the model on the distilled dataset using KLDivLoss.
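The KLDivLoss fine-tuning step can be sketched as follows: the student's similarity scores for a group of pairs are pushed toward the teacher's score distribution with KL divergence. The scores below are dummy tensors; in the actual workflow they come from the cross-encoder (teacher) and the embeddings model (student).

```python
import torch
import torch.nn.functional as F

# Dummy scores standing in for a group of (title, abstract) pairs:
# teacher scores from the cross-encoder, student scores from the
# embeddings model being fine-tuned.
teacher_scores = torch.tensor([[4.1, 0.3, -1.2]])
student_scores = torch.tensor([[2.0, 1.0, 0.5]], requires_grad=True)

# KL divergence between the two score distributions
loss = F.kl_div(
    F.log_softmax(student_scores, dim=-1),
    F.softmax(teacher_scores, dim=-1),
    reduction="batchmean",
)
loss.backward()
print(loss.item())
```

Minimizing this loss teaches the cheap bi-encoder to reproduce the ranking behavior of the much more expensive cross-encoder.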
## Training ColBERT models
A similar methodology as above was employed to train biomedbert-small-colbert as follows.
- Train a model with MSELoss using `biomedbert-small-embeddings` as the base model.
- Build a distilled dataset of teacher scores using the `biomedbert-base-reranker` cross-encoder for a separate random sample of title-abstract pairs.
- Fine-tune the model on the distilled dataset using KLDivLoss.
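For reference, late interaction scores a query against a document with MaxSim: each query token embedding is matched against its most similar document token embedding and the maxima are summed. A minimal sketch with dummy tensors:

```python
import torch
import torch.nn.functional as F

def maxsim(query_emb, doc_emb):
    """ColBERT-style MaxSim score.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim)
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                        # token-level cosine similarities
    return sim.max(dim=-1).values.sum()  # best doc token per query token, summed

torch.manual_seed(0)
score = maxsim(torch.randn(4, 64), torch.randn(20, 64))
print(score.item())
```

Because matching happens at the token level, late interaction retains fine-grained signals that a single pooled vector can wash out, at the cost of storing one embedding per token.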
## Fine-Tuning the base PubMedBERT Embeddings model
The original PubMedBERT Embeddings model was released almost 3 years ago. It gets over 500K downloads a month and has been cited many times in the literature.
A simple idea was explored as part of this effort. What if we fine-tuned this model on the same distilled dataset used for the nano and small series of models? It turns out this adds a sizable performance boost to the base model, as shown below.
## Evaluation Results
Performance of these models is compared to previously released models trained on medical literature. The most commonly used small embeddings model is also included for comparison.
The following datasets were used to evaluate model performance.
- PubMed QA
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- PubMed Subset
  - Split: test, Pair: (title, text)
- PubMed Summary
  - Subset: pubmed, Split: validation, Pair: (article, abstract)
Evaluation results are shown below. The Pearson correlation coefficient is used as the evaluation metric.
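The metric itself is straightforward: the model's similarity score for each pair is correlated with the gold signal using Pearson's r. The arrays below are dummy values purely to show the computation; in a Sentence Transformers workflow this is typically reported by `EmbeddingSimilarityEvaluator`.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two score arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Dummy values: model similarity scores vs gold labels
predicted = [0.91, 0.12, 0.77, 0.35]
gold = [1.0, 0.0, 1.0, 0.0]
print(pearson(predicted, gold))
```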
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 90.40 | 95.92 | 94.07 | 93.46 |
| biomedbert-base-colbert | 94.59 | 97.18 | 96.21 | 95.99 |
| biomedbert-base-embeddings | 94.60 | 98.39 | 97.61 | 96.87 |
| biomedbert-base-reranker | 97.66 | 99.76 | 98.81 | 98.74 |
| biomedbert-small-colbert | 93.51 | 97.20 | 95.85 | 95.52 |
| biomedbert-small-embeddings | 93.25 | 97.93 | 96.65 | 95.94 |
| biomedbert-hash-nano-embeddings | 90.39 | 96.29 | 95.32 | 94.00 |
| pubmedbert-base-embeddings | 93.27 | 97.00 | 96.58 | 95.62 |
The 22.7M parameter small models pack quite a punch. biomedbert-small-embeddings beats the original PubMedBERT Embeddings model across the board at only 20% of the parameters.
As with other ColBERT models on this dataset, biomedbert-small-colbert tends to score lower on longer-form queries. But note how it outperforms its equivalent small embeddings model on the PubMed QA dataset. For traditional user queries, this model will likely get better results in production.
Lastly, the new biomedbert-base-embeddings model is a sizable jump over the original PubMedBERT Embeddings model.
## Wrapping up
This article introduced the new BiomedBERT Small series of models, along with a new strong-performing, standard-sized dense embeddings model.
If you're interested in building custom models like this for your data or domain area, feel free to reach out!
NeuML is the company behind txtai and we provide AI consulting services around our stack. Schedule a meeting or send a message to learn more.
We're also building an easy and secure way to run hosted txtai applications with txtai.cloud.

