license: cc-by-nc-4.0
library_name: pytorch
tags:
- chemistry
- cheminformatics
- optical-chemical-structure-recognition
- ocsr
- molecule-recognition
- smiles
- transformer
- swin-transformer
- minimum-risk-training
- molecular-graph
datasets:
- Keylab/COMO
metrics:
- exact_match
- tanimoto_similarity
COMO: Closed-Loop Optical Molecule Recognition
COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure diagrams from images and predicts SMILES strings with atom-level 2D coordinates and bond matrices. COMO uses Minimum Risk Training (MRT) to directly optimize molecular-level, non-differentiable objectives, closing the gap between token-level training and molecular-level evaluation.
Model Summary
- Architecture: Swin-B encoder → 6-layer Transformer decoder → bond MLP
- Input: 384×384 RGB image of a chemical structure diagram
- Output: SMILES string + atom coordinates + bond matrix
- Vocabulary: chartok_coords format (200 tokens: SMILES chars + 64 X/Y bins)
- Parameters: ~94M
- Training data: 1M PubChem + 652K USPTO (MLE) + 83K MolParser-SFT (MRT)
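The 64 X/Y bins in the vocabulary imply that atom coordinates are quantized before being emitted as tokens. The following is a minimal sketch of such a quantization, assuming coordinates are first normalized to [0, 1]; the binning rule and the token spellings (`<X_i>`, `<Y_i>`) are hypothetical and may differ from what COMO actually uses.

```python
N_BINS = 64  # from the model summary: 64 bins per axis


def coord_to_bin(value: float, n_bins: int = N_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    clipped = min(max(value, 0.0), 1.0)
    # value == 1.0 would land in bin n_bins, so clamp to the last bin.
    return min(int(clipped * n_bins), n_bins - 1)


def bin_to_coord(index: int, n_bins: int = N_BINS) -> float:
    """Map a bin index back to the centre of its bin."""
    return (index + 0.5) / n_bins


def atom_tokens(symbol: str, x: float, y: float) -> list:
    """Emit one atom as [label, X-bin token, Y-bin token] (hypothetical spelling)."""
    return [symbol, f"<X_{coord_to_bin(x)}>", f"<Y_{coord_to_bin(y)}>"]


print(atom_tokens("C", 0.5, 0.25))
```

With 64 bins on a 384×384 input, each bin spans 6 pixels, so the round trip through `bin_to_coord` loses at most 3 pixels per axis under this scheme.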
Available Checkpoints
All checkpoints come from the joint training pipeline that combines maximum likelihood estimation (MLE) with MRT (30 epochs, interleaved MLE/MRT from scratch). Three reward variants are provided:
| Checkpoint | Reward Mode | Description |
|---|---|---|
| models/tanimoto/final.pth | Tanimoto | Morgan fingerprint Tanimoto similarity reward |
| models/tanimoto/best.pth | Tanimoto | Best validation epoch |
| models/edit_distance/final.pth | Edit Distance | Levenshtein string-similarity reward |
| models/edit_distance/best.pth | Edit Distance | Best validation epoch |
| models/visual/final.pth | Visual | Siamese visual-encoder cosine-similarity reward |
| models/visual/best.pth | Visual | Best validation epoch |
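To make two of the reward modes concrete, here is a rough sketch of a Levenshtein-based string similarity and a Tanimoto similarity over fingerprint bit sets. The normalization by the longer string and the handling of empty inputs are assumptions, not COMO's confirmed definitions, and computing Morgan fingerprints themselves (for example with RDKit) is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity over sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen for this sketch
    return len(bits_a & bits_b) / union


print(edit_similarity("CC(=O)O", "CC(=O)N"))
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

String similarity is cheap but sensitive to SMILES atom ordering, whereas a fingerprint-based score compares substructures, which is one plausible reason to offer both variants.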
Architecture
Image (384×384)
→ Swin-B backbone (ImageNet pretrained)
→ 2D sinusoidal positional encoding
→ 6-layer Transformer decoder (d=256, 8 heads)
→ chartok_coords tokens → SMILES + coordinates
→ Bond MLP (2-layer, GELU) → 7-class bond matrix
→ Graph reconstruction → canonical SMILES
The model outputs a molecular graph $G = (A, B)$ where:
- $A = \{(l_i, x_i, y_i)\}$: atom SMILES labels with 2D image coordinates
- $B$: pairwise bond types (none, single, double, triple, aromatic, wedge, dash)
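One way to picture the output graph is as a list of labelled, positioned atoms plus an integer matrix of bond classes. The sketch below is illustrative only: the class names `Atom` and `MolGraph` are hypothetical, and the index assigned to each bond type simply follows the order listed above, which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Seven bond classes in the order given in the text (index mapping is assumed).
BOND_TYPES = ["none", "single", "double", "triple", "aromatic", "wedge", "dash"]


@dataclass
class Atom:
    label: str  # atom SMILES label l_i
    x: float    # image x coordinate x_i
    y: float    # image y coordinate y_i


@dataclass
class MolGraph:
    atoms: List[Atom]       # the set A
    bonds: List[List[int]]  # the matrix B; bonds[i][j] indexes BOND_TYPES

    def edges(self) -> List[Tuple[int, int, str]]:
        """List bonded atom pairs (i < j) with their bond type name."""
        n = len(self.atoms)
        return [(i, j, BOND_TYPES[self.bonds[i][j]])
                for i in range(n) for j in range(i + 1, n)
                if self.bonds[i][j] != 0]


# Acetic acid, CC(=O)O, with made-up normalized coordinates.
atoms = [Atom("C", 0.2, 0.5), Atom("C", 0.4, 0.5),
         Atom("O", 0.5, 0.3), Atom("O", 0.5, 0.7)]
bonds = [[0] * 4 for _ in range(4)]
for i, j, bond_class in [(0, 1, 1), (1, 2, 2), (1, 3, 1)]:
    bonds[i][j] = bonds[j][i] = bond_class
graph = MolGraph(atoms, bonds)
print(graph.edges())
```

Wedge and dash bonds are directional, so a real implementation would likely store them asymmetrically; the symmetric matrix here is a simplification.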
Training
MLE Phase
- Data: 1M PubChem SMILES (synthetic) + 652K USPTO patent molecules
- Augmentation: Indigo-rendered images with random styles, functional group substitution, R-group insertion, wavy bonds, scan shadows, multilingual comments
- Optimizer: AdamW, lr=4×10⁻⁴ (encoder & decoder), weight decay=10⁻⁶
- Schedule: 2% linear warmup → cosine decay, batch size 64/GPU
- Loss: Label-smoothed cross-entropy (ε=0.1) + bond classification CE
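The learning-rate schedule described above (2% linear warmup followed by cosine decay from a peak of 4×10⁻⁴) can be sketched as a plain function of the step index. The function name `lr_at` and the decay floor of zero are assumptions; the actual training code may use a different minimum or step convention.

```python
import math


def lr_at(step: int, total_steps: int,
          peak_lr: float = 4e-4, warmup_frac: float = 0.02) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for s in (0, 19, 20, 510, 1000):
    print(s, lr_at(s, total_steps=1000))
```

In a PyTorch training loop, the same function could drive `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step, total_steps) / peak_lr` as the multiplicative factor.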
MRT Phase
- Data: 83K real-world molecular images (MolParser-SFT)
- Candidates: N=32 per image, multinomial sampling at τ=0.5
- Reward weights: validity=0.1, similarity=0.5, exact match=0.4
- Sharpening: α=1.0, loss weight λ=0.1
- Schedule: First 5 epochs MLE-only warmup, then interleaved MLE+MRT
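The hyperparameters above fit the standard minimum risk training recipe: score each sampled candidate with a reward, renormalize the model's candidate probabilities with a sharpening exponent, and minimize the expected cost. The sketch below assumes that standard form, with cost taken as one minus reward and the MRT term added to the MLE loss with weight λ; how COMO combines the terms exactly is not stated here, and the function names are hypothetical.

```python
import math

# Values taken from the list above.
W_VALID, W_SIM, W_EXACT = 0.1, 0.5, 0.4
ALPHA, LAMBDA = 1.0, 0.1


def reward(valid: bool, similarity: float, exact: bool) -> float:
    """Weighted sum of validity, similarity, and exact-match terms."""
    return W_VALID * float(valid) + W_SIM * similarity + W_EXACT * float(exact)


def mrt_risk(log_probs, rewards, alpha: float = ALPHA) -> float:
    """Expected cost under the sharpened distribution over sampled candidates.

    Q(y_i) is proportional to p(y_i)^alpha, renormalized over the candidates;
    the cost of a candidate is assumed to be 1 - reward.
    """
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    z = sum(weights)
    return sum(w / z * (1.0 - r) for w, r in zip(weights, rewards))


def total_loss(mle_loss: float, risk: float, lam: float = LAMBDA) -> float:
    """Assumed combination: MLE loss plus lambda-weighted MRT risk."""
    return mle_loss + lam * risk


# Two candidates: one exact match, one invalid string.
rewards = [reward(True, 1.0, True), reward(False, 0.0, False)]
risk = mrt_risk([math.log(0.75), math.log(0.25)], rewards)
print(risk, total_loss(2.0, risk))
```

In actual training the log-probabilities would be differentiable tensors summed over tokens, so that lowering the risk shifts probability mass toward high-reward candidates.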
Evaluation Results
Exact match accuracy (%) on 10 benchmarks (COMO-Tanimoto variant):
| Benchmark | Images | Synthetic/Real | COMO-Tanimoto |
|---|---|---|---|
| Indigo | 5,719 | Synthetic | 98.6 |
| ChemDraw | 5,719 | Synthetic | 96.5 |
| CLEF | 992 | Real (patents) | 94.8 |
| JPO | 450 | Real (patents) | 88.4 |
| UOB | 5,740 | Real (academic) | 98.0* |
| USPTO | 5,719 | Real (patents) | 93.4 |
| USPTO-10K | 10,000 | Real (patents) | 96.1 |
| Staker | 50,000 | Real | 87.4 |
| ACS | 331 | Real (publications) | 84.6 |
| WildMol-10K | 10,000 | Real (wild) | 77.1 |
*UOB results after tautomer standardization.
See the paper for full comparison with MolScribe, MolParser, SwinOCSR, and other baselines.
Installation
pip install como-ocsr
Usage
import como
# Download checkpoint from HuggingFace:
# huggingface-cli download Keylab/COMO models/tanimoto/final.pth
model = como.load_model("models/tanimoto/final.pth", device="cuda")
# Single image prediction
smiles = como.predict(model, "molecule.png")
print(smiles) # "CC(=O)O"
# Batch prediction
smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"])
# Benchmark evaluation
metrics = como.evaluate(model, "benchmark/USPTO/", "benchmark/USPTO.csv")
print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")
Full documentation is available on GitHub and PyPI.
Benchmarks
Benchmark datasets are available in the benchmarks/ directory of this
repository. Each dataset contains .png images and a CSV file with columns
image_id and SMILES.
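Given the CSV layout described above, the ground-truth labels can be read with the standard library. This is a small sketch using an in-memory sample; the helper name `load_labels` and the sample rows are made up for illustration, and whether `image_id` includes the `.png` extension is not specified here.

```python
import csv
import io


def load_labels(csv_file) -> dict:
    """Read a benchmark CSV into an {image_id: SMILES} mapping."""
    reader = csv.DictReader(csv_file)
    return {row["image_id"]: row["SMILES"] for row in reader}


# In-memory stand-in for a file such as benchmark/USPTO.csv.
sample = io.StringIO(
    "image_id,SMILES\n"
    "mol_0001,CC(=O)O\n"
    "mol_0002,c1ccccc1\n"
)
labels = load_labels(sample)
print(labels["mol_0001"])
```

For a real file, pass `open(path, newline="")` instead of the `StringIO` object.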
Note: These benchmarks are collected from existing public OCSR datasets. Please refer to the original sources for attribution:
| Dataset | Source |
|---|---|
| USPTO, CLEF, JPO, UOB, Staker | Rajan et al., 2020; Xiong et al., 2023 |
| Indigo, ChemDraw, ACS, Staker | Qian et al., 2023 |
| USPTO-10K | Morin et al., 2023 |
| WildMol-10K | Fang et al., 2025 |
License
- Model Weights: CC BY-NC 4.0 (non-commercial use only)
- Code: MIT License
- Benchmarks: See original sources for applicable terms
Citation
@article{lyu2026closed,
title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
author={Lyu, Zhuoqi and Ke, Qing},
journal={arXiv preprint arXiv:2604.23546},
year={2026}
}