# ChemLM 2.30M
ChemLM 2.30M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.
This checkpoint is intended to be used as an encoder initialization for downstream molecular property prediction tasks through the fine-tuning and linear probe code in the original repository.
- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?
- Tokenizer: ibm-research/MoLFormer-XL-both-10pct
- Model size: 2.30M parameters
## Model Details
| Hyperparameter | Value |
|---|---|
| Hidden size | 192 |
| Number of hidden layers | 5 |
| Number of attention heads | 3 |
| Intermediate size | 768 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |
The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
## Intended Use
This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```
The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.
The supported training modes are:
- full fine-tuning
- linear probe, where the BERT encoder is frozen and only the prediction head is trained
## Model Architecture for Downstream Tasks
For downstream molecular property prediction, the repository uses `BertForSequenceClassificationMolecule`.
This model consists of:

- a pre-trained `BertModel` encoder,
- a `BertPoolerC` pooler, which mean-pools non-special tokens using the attention mask,
- dropout,
- a linear prediction head: `self.classifier = nn.Linear(config.hidden_size, self.num_labels)`

The output is a Hugging Face `SequenceClassifierOutput` containing the loss and logits.
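The pooling step can be sketched as follows. This is an illustrative reimplementation, not the repository's `BertPoolerC` itself, and it assumes a separate special-tokens mask is available (for example via `return_special_tokens_mask=True` when tokenizing):

```python
import torch

def masked_mean_pool(hidden_states, attention_mask, special_tokens_mask):
    """Mean-pool token embeddings over real, non-special tokens.

    Illustrative sketch of the pooling described above; the repository's
    BertPoolerC may differ in detail.
    """
    # 1 for positions that are real, non-special tokens; 0 elsewhere.
    keep = (attention_mask * (1 - special_tokens_mask)).unsqueeze(-1)
    keep = keep.to(hidden_states.dtype)
    summed = (hidden_states * keep).sum(dim=1)   # sum over kept tokens
    counts = keep.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts                       # (batch, hidden_size)
```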
The loss function depends on the task type:
| Task type | Number of labels | Loss |
|---|---|---|
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
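For multitask datasets such as Tox21, some molecules lack labels for some targets, so the loss is averaged over observed entries only. A minimal sketch, assuming missing labels are stored as NaN (a common MoleculeNet convention; the repository's exact encoding may differ):

```python
import torch
import torch.nn as nn

def masked_bce_with_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average BCEWithLogitsLoss over observed (non-NaN) label entries."""
    observed = ~torch.isnan(labels)  # (batch, num_tasks) bool mask
    per_entry = nn.functional.binary_cross_entropy_with_logits(
        logits, torch.nan_to_num(labels), reduction="none"
    )
    # Missing entries never contribute to the loss or its gradient.
    return per_entry[observed].mean()
```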
## Usage
The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.
First, clone the original repository and install the fine-tuning environment:
```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining
conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning
pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```
Then run fine-tuning, for example on BBBP:
```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-2.30m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```
For linear probe evaluation, add `--training_type "linear_probe"` to the command above.
In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
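Conceptually, the freezing amounts to the following plain-PyTorch sketch (the script handles this via the flag; you would not normally write it yourself):

```python
# Freeze the pre-trained encoder; only the prediction head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

# Only the classifier head should remain, e.g. classifier.weight / classifier.bias.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```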
## Loading the Model in Python
The downstream script loads the converted checkpoint with the custom config and model class (note that `args.json` is read with a plain file open, so the checkpoint must be available on disk):
```python
import json
from argparse import Namespace

from transformers import AutoTokenizer

from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-2.30m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Resolve the number of labels from the task metadata.
with open(task_config_path) as f:
    task_to_keys = json.load(f)
task_info = task_to_keys[task_name]
if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:  # multitask classification
    num_labels = len(task_info["target_columns"])

# The pre-training arguments saved next to the checkpoint are passed
# through to the custom model class.
with open(f"{model_name_or_path}/args.json") as f:
    pretrain_run_args = json.load(f)
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
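Once loaded, the model can be sanity-checked on a single SMILES string. This is a hypothetical sketch, assuming the custom model follows the standard Hugging Face forward signature:

```python
import torch

smiles = "CCO"  # ethanol, an arbitrary example molecule
inputs = tokenizer(smiles, return_tensors="pt", truncation=True,
                   max_length=task_info["max_seq_length"])
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
# For a binary task such as BBBP, logits has shape (1, 2).
print(outputs.logits.softmax(dim=-1))
```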
For normal use, `run_ft_molecule.py` is recommended instead of manually writing this loading code.
## Data
The pre-training data preparation follows the repository pipeline:
- download the ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
- preprocess, shard, and split the dataset,
- create masked language modeling samples with the ibm-research/MoLFormer-XL-both-10pct tokenizer.
## Pre-training Objective
The model was pre-trained with masked language modeling.
The sample generation configuration uses:
| Setting | Value |
|---|---|
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | ibm-research/MoLFormer-XL-both-10pct |
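The cap of 77 predictions matches the usual BERT rule of rounding maximum sequence length times masking probability (512 × 0.15 ≈ 77). The repository pre-generates masked samples as a data-preparation step (Academic Budget BERT style), but an equivalent masking configuration can be sketched with the standard Hugging Face collator, assuming the tokenizer defines a mask token:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)
# Same masking rate as the pre-training configuration above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([tokenizer("CCO", truncation=True, max_length=512)])
print(batch["labels"])  # -100 everywhere except positions selected for prediction
```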
## Downstream Tasks
The repository defines task metadata in `chemlm_pretraining/task_config.json`.
Supported task categories include:
- binary classification, e.g. BBBP, BACE, HIV
- multitask classification, e.g. Tox21, ClinTox, SIDER
- regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity
Metrics are selected from the task config:
| Task category | Metric |
|---|---|
| Binary classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
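A sketch of the multitask metric, assuming NaN marks missing labels and using scikit-learn's `roc_auc_score`; tasks whose test labels are all one class are skipped, since ROC-AUC is undefined there (the repository's exact handling may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Mean ROC-AUC over task columns, skipping missing (NaN) labels."""
    aucs = []
    for task in range(y_true.shape[1]):
        observed = ~np.isnan(y_true[:, task])
        labels = y_true[observed, task]
        if len(np.unique(labels)) < 2:  # ROC-AUC needs both classes present
            continue
        aucs.append(roc_auc_score(labels, y_score[observed, task]))
    return float(np.mean(aucs))
```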
## Citation
If you use this model, please cite:
```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
  title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
  author={Tatsuya Sagawa and Ryosuke Kojima},
  year={2026},
  eprint={2602.11618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.11618},
}
```