# ChemLM 2.30M
ChemLM 2.30M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.
This checkpoint is intended to be used as an encoder initialization for downstream molecular property prediction tasks through the fine-tuning and linear probe code in the original repository.
- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?
- Tokenizer: ibm-research/MoLFormer-XL-both-10pct
- Model size: 2.30M parameters
## Model Details
| Hyperparameter | Value |
|---|---|
| Hidden size | 192 |
| Number of hidden layers | 5 |
| Number of attention heads | 3 |
| Intermediate size | 768 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |
The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
## Intended Use
This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```
The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.
The supported training modes are:
- full fine-tuning
- linear probe, where the BERT encoder is frozen and only the prediction head is trained
## Model Architecture for Downstream Tasks
For downstream molecular property prediction, the repository uses `BertForSequenceClassificationMolecule`.
This model consists of:

- a pre-trained `BertModel` encoder,
- a `BertPoolerC` pooler, which mean-pools non-special tokens using the attention mask,
- dropout,
- a linear prediction head: `self.classifier = nn.Linear(config.hidden_size, self.num_labels)`

The output is a Hugging Face `SequenceClassifierOutput` containing the loss and logits.
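The pooling step can be sketched as follows. This is an illustrative reimplementation, not the repository's `BertPoolerC` itself, and it assumes a separate special-tokens mask is available (for example via `return_special_tokens_mask=True` when tokenizing):

```python
import torch

def masked_mean_pool(hidden_states, attention_mask, special_tokens_mask):
    """Mean-pool token embeddings over real, non-special tokens.

    Illustrative sketch of the pooling described above; the repository's
    BertPoolerC may differ in detail.
    """
    # 1 for positions that are real, non-special tokens; 0 elsewhere.
    keep = (attention_mask * (1 - special_tokens_mask)).unsqueeze(-1)
    keep = keep.to(hidden_states.dtype)
    summed = (hidden_states * keep).sum(dim=1)   # sum over kept tokens
    counts = keep.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts                       # (batch, hidden_size)
```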
The loss function depends on the task type:
| Task type | Number of labels | Loss |
|---|---|---|
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
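For multitask datasets such as Tox21, some molecules lack labels for some targets, so the loss is averaged over observed entries only. A minimal sketch, assuming missing labels are stored as NaN (a common MoleculeNet convention; the repository's exact encoding may differ):

```python
import torch
import torch.nn as nn

def masked_bce_with_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average BCEWithLogitsLoss over observed (non-NaN) label entries."""
    observed = ~torch.isnan(labels)  # (batch, num_tasks) bool mask
    per_entry = nn.functional.binary_cross_entropy_with_logits(
        logits, torch.nan_to_num(labels), reduction="none"
    )
    # Missing entries never contribute to the loss or its gradient.
    return per_entry[observed].mean()
```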
## Usage
The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.
First, clone the original repository and install the fine-tuning environment:
```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining
conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning
pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```
Then run fine-tuning, for example on BBBP:
```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-2.30m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```
For linear probe evaluation, add `--training_type "linear_probe"` to the command above.
In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
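Conceptually, the freezing amounts to the following plain-PyTorch sketch (the script handles this via the flag; you would not normally write it yourself):

```python
# Freeze the pre-trained encoder; only the prediction head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

# Only the classifier head should remain, e.g. classifier.weight / classifier.bias.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```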
## Loading the Model in Python
The downstream script loads the converted checkpoint with the custom config and model class (note that `args.json` is read with a plain file open, so the checkpoint must be available on disk):
```python
import json
from argparse import Namespace

from transformers import AutoTokenizer

from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-2.30m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Resolve the number of labels from the task metadata.
with open(task_config_path) as f:
    task_to_keys = json.load(f)
task_info = task_to_keys[task_name]
if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:  # multitask classification
    num_labels = len(task_info["target_columns"])

# The pre-training arguments saved next to the checkpoint are passed
# through to the custom model class.
with open(f"{model_name_or_path}/args.json") as f:
    pretrain_run_args = json.load(f)
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
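Once loaded, the model can be sanity-checked on a single SMILES string. This is a hypothetical sketch, assuming the custom model follows the standard Hugging Face forward signature:

```python
import torch

smiles = "CCO"  # ethanol, an arbitrary example molecule
inputs = tokenizer(smiles, return_tensors="pt", truncation=True,
                   max_length=task_info["max_seq_length"])
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
# For a binary task such as BBBP, logits has shape (1, 2).
print(outputs.logits.softmax(dim=-1))
```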
For normal use, `run_ft_molecule.py` is recommended instead of manually writing this loading code.
## Data
The pre-training data preparation follows the repository pipeline:
- download the ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
- preprocess, shard, and split the dataset,
- create masked language modeling samples with the ibm-research/MoLFormer-XL-both-10pct tokenizer.
## Pre-training Objective
The model was pre-trained with masked language modeling.
The sample generation configuration uses:
| Setting | Value |
|---|---|
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | ibm-research/MoLFormer-XL-both-10pct |
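The cap of 77 predictions matches the usual BERT rule of rounding maximum sequence length times masking probability (512 × 0.15 ≈ 77). The repository pre-generates masked samples as a data-preparation step (Academic Budget BERT style), but an equivalent masking configuration can be sketched with the standard Hugging Face collator, assuming the tokenizer defines a mask token:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)
# Same masking rate as the pre-training configuration above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([tokenizer("CCO", truncation=True, max_length=512)])
print(batch["labels"])  # -100 everywhere except positions selected for prediction
```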
## Downstream Tasks
The repository defines task metadata in `chemlm_pretraining/task_config.json`.
Supported task categories include:
- binary classification, e.g. BBBP, BACE, HIV
- multitask classification, e.g. Tox21, ClinTox, SIDER
- regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity
Metrics are selected from the task config:
| Task category | Metric |
|---|---|
| Binary classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
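A sketch of the multitask metric, assuming NaN marks missing labels and using scikit-learn's `roc_auc_score`; tasks whose test labels are all one class are skipped, since ROC-AUC is undefined there (the repository's exact handling may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Mean ROC-AUC over task columns, skipping missing (NaN) labels."""
    aucs = []
    for task in range(y_true.shape[1]):
        observed = ~np.isnan(y_true[:, task])
        labels = y_true[observed, task]
        if len(np.unique(labels)) < 2:  # ROC-AUC needs both classes present
            continue
        aucs.append(roc_auc_score(labels, y_score[observed, task]))
    return float(np.mean(aucs))
```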
## Citation
If you use this model, please cite:
```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
  title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
  author={Tatsuya Sagawa and Ryosuke Kojima},
  year={2026},
  eprint={2602.11618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.11618},
}
```