metadata
license: cc-by-nc-4.0
library_name: pytorch
tags:
  - chemistry
  - cheminformatics
  - optical-chemical-structure-recognition
  - ocsr
  - molecule-recognition
  - smiles
  - transformer
  - swin-transformer
  - minimum-risk-training
  - molecular-graph
datasets:
  - Keylab/COMO
metrics:
  - exact_match
  - tanimoto_similarity

COMO: Closed-Loop Optical Molecule Recognition

COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure diagrams from images and predicts SMILES strings with atom-level 2D coordinates and bond matrices. COMO uses Minimum Risk Training (MRT) to directly optimize molecular-level, non-differentiable objectives, closing the gap between token-level training and molecular-level evaluation.

Model Summary

  • Architecture: Swin-B encoder → 6-layer Transformer decoder → bond MLP
  • Input: 384×384 RGB image of a chemical structure diagram
  • Output: SMILES string + atom coordinates + bond matrix
  • Vocabulary: chartok_coords format (200 tokens: SMILES chars + 64 X/Y bins)
  • Parameters: ~94M
  • Training data: 1M PubChem + 652K USPTO (MLE) + 83K MolParser-SFT (MRT)
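The vocabulary entry above mentions 64 X/Y bins. As a rough illustration of what coordinate binning can look like, the sketch below quantizes a pixel coordinate on a 384×384 image into one of 64 bins and maps a bin back to its centre. The uniform bin layout, the bin-centre decoding, and the function names `coord_to_bin` and `bin_to_coord` are assumptions made for this example; COMO's actual tokenizer may normalize coordinates differently.

```python
IMAGE_SIZE = 384  # input resolution stated in the model summary
N_BINS = 64       # number of X/Y coordinate bins stated in the vocabulary


def coord_to_bin(value: float, size: int = IMAGE_SIZE, n_bins: int = N_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, n_bins - 1].

    Assumes uniform bins; this is an illustration, not COMO's tokenizer.
    """
    idx = int(value / size * n_bins)
    # Clamp so that out-of-range values and value == size stay in range.
    return min(max(idx, 0), n_bins - 1)


def bin_to_coord(idx: int, size: int = IMAGE_SIZE, n_bins: int = N_BINS) -> float:
    """Return the pixel coordinate at the centre of a bin."""
    return (idx + 0.5) * size / n_bins
```

Under these assumptions each bin is 6 pixels wide, so decoding to the bin centre introduces at most 3 pixels of quantization error per axis.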

Available Checkpoints

All checkpoints are from the joint MLE+MRT training pipeline (30 epochs, interleaved MLE/MRT from scratch). Three reward variants are provided:

| Checkpoint | Reward Mode | Description |
|---|---|---|
| models/tanimoto/final.pth | Tanimoto | Morgan fingerprint Tanimoto similarity reward |
| models/tanimoto/best.pth | Tanimoto | Best validation epoch |
| models/edit_distance/final.pth | Edit Distance | Levenshtein string-similarity reward |
| models/edit_distance/best.pth | Edit Distance | Best validation epoch |
| models/visual/final.pth | Visual | Siamese visual-encoder cosine-similarity reward |
| models/visual/best.pth | Visual | Best validation epoch |
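To make the first two reward modes concrete, here is a minimal sketch of a Tanimoto similarity and a Levenshtein-based string similarity in plain Python. It is an illustration only: fingerprints are represented as sets of on-bit indices rather than real Morgan fingerprints, the normalization of the edit distance by the longer string is an assumption, and the function names are hypothetical. The visual reward is omitted because it depends on a trained encoder.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen for this sketch: two empty sets match
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ch_a == ch_b else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

In practice the fingerprint sets would come from a cheminformatics toolkit applied to the predicted and reference molecules.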

Architecture

Image (384×384)
  → Swin-B backbone (ImageNet pretrained)
    → 2D sinusoidal positional encoding
      → 6-layer Transformer decoder (d=256, 8 heads)
        → chartok_coords tokens → SMILES + coordinates
        → Bond MLP (2-layer, GELU) → 7-class bond matrix
          → Graph reconstruction → canonical SMILES

The model outputs a molecular graph $G = (A, B)$ where:

  • $A = \{(l_i, x_i, y_i)\}$: atom SMILES labels with 2D image coordinates
  • $B$: pairwise bond types (none, single, double, triple, aromatic, wedge, dash)
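As a small illustration of the graph output, the sketch below turns a predicted bond-class matrix into an edge list. The class indices follow the order in which the bond types are listed above, which is an assumption, and `bond_matrix_to_edges` is a hypothetical helper, not part of the COMO API.

```python
# Assumed class order, taken from the order the bond types are listed in the text.
BOND_CLASSES = ["none", "single", "double", "triple", "aromatic", "wedge", "dash"]


def bond_matrix_to_edges(bond_matrix):
    """Collect (i, j, bond_type) for every non-'none' entry above the diagonal.

    Only the upper triangle is read, which treats the matrix as symmetric.
    Wedge and dash bonds are directional, so a full implementation would
    also need to inspect the [j][i] entry to recover the stereo orientation.
    """
    edges = []
    n_atoms = len(bond_matrix)
    for i in range(n_atoms):
        for j in range(i + 1, n_atoms):
            bond_class = bond_matrix[i][j]
            if bond_class != 0:
                edges.append((i, j, BOND_CLASSES[bond_class]))
    return edges


# Ethanol (C-C-O) as a toy example: atoms carry a label and image coordinates.
atoms = [("C", 100.0, 200.0), ("C", 150.0, 170.0), ("O", 200.0, 200.0)]
bonds = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
edges = bond_matrix_to_edges(bonds)
```

The atom list and edge list together are what a graph-reconstruction step would hand to a cheminformatics toolkit to produce a canonical SMILES.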

Training

MLE Phase

  • Data: 1M PubChem SMILES (synthetic) + 652K USPTO patent molecules
  • Augmentation: Indigo-rendered images with random styles, functional group substitution, R-group insertion, wavy bonds, scan shadows, multilingual comments
  • Optimizer: AdamW, lr=4×10⁻⁴ (encoder & decoder), weight decay=10⁻⁶
  • Schedule: 2% linear warmup → cosine decay, batch size 64/GPU
  • Loss: Label-smoothed cross-entropy (ε=0.1) + bond classification CE
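The schedule and loss listed above can be sketched in plain Python as follows. This is a minimal illustration, not COMO's training code: it assumes the cosine phase decays to zero, that warmup is measured as a fraction of total steps, and that label smoothing mixes the one-hot target with a uniform distribution over all classes (the same convention as the `label_smoothing` argument of PyTorch's cross-entropy loss). The function names are hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 4e-4,
               warmup_frac: float = 0.02) -> float:
    """Linear warmup over the first 2% of steps, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def label_smoothed_ce(logits, target: int, eps: float = 0.1) -> float:
    """Cross-entropy against the target distribution (1 - eps) * one_hot + eps / K."""
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_z for x in logits]
    nll = -log_probs[target]                       # one-hot part
    uniform = -sum(log_probs) / len(log_probs)     # uniform part
    return (1.0 - eps) * nll + eps * uniform
```

With 1,000 total steps, for example, the learning rate climbs linearly for the first 20 steps, reaches 4×10⁻⁴, and then follows a half cosine down to zero.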

MRT Phase

  • Data: 83K real-world molecular images (MolParser-SFT)
  • Candidates: N=32 per image, multinomial sampling at τ=0.5
  • Reward weights: validity=0.1, similarity=0.5, exact match=0.4
  • Sharpening: α=1.0, loss weight λ=0.1
  • Schedule: First 5 epochs MLE-only warmup, then interleaved MLE+MRT
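One way to read the MRT settings above is the standard minimum-risk formulation: sampled candidates are renormalized into a distribution sharpened by α, and the loss is the expected cost under that distribution. The sketch below follows that reading in plain Python. Treating the cost as 1 − reward and adding the risk to the MLE loss with weight λ are assumptions for this example, and `combined_reward` and `mrt_risk` are hypothetical names; in real training the log-probabilities would be differentiable tensors.

```python
import math


def combined_reward(valid: bool, similarity: float, exact: bool,
                    w_valid: float = 0.1, w_sim: float = 0.5,
                    w_exact: float = 0.4) -> float:
    """Weighted sum of validity, similarity, and exact match (weights from the text)."""
    return w_valid * float(valid) + w_sim * similarity + w_exact * float(exact)


def mrt_risk(log_probs, rewards, alpha: float = 1.0) -> float:
    """Expected cost over candidates under Q(y) proportional to p(y) ** alpha.

    Cost is taken as 1 - reward, which is an assumption of this sketch.
    """
    scaled = [alpha * lp for lp in log_probs]
    max_scaled = max(scaled)                      # subtract max for stability
    weights = [math.exp(s - max_scaled) for s in scaled]
    total = sum(weights)
    q = [w / total for w in weights]              # renormalized over the candidate set
    return sum(q_i * (1.0 - r) for q_i, r in zip(q, rewards))


# Toy example with two candidates instead of N=32.
rewards = [combined_reward(True, 1.0, True), combined_reward(True, 0.6, False)]
risk = mrt_risk([math.log(0.75), math.log(0.25)], rewards)
lambda_mrt = 0.1
mle_loss = 1.0  # placeholder value for illustration
total_loss = mle_loss + lambda_mrt * risk
```

Minimizing this risk shifts probability mass toward candidates with higher reward, which is how a non-differentiable molecular-level score can influence training.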

Evaluation Results

Exact match accuracy (%) on 10 benchmarks (COMO-Tanimoto variant):

| Benchmark | Images | Synthetic/Real | COMO-Tanimoto |
|---|---|---|---|
| Indigo | 5,719 | Synthetic | 98.6 |
| ChemDraw | 5,719 | Synthetic | 96.5 |
| CLEF | 992 | Real (patents) | 94.8 |
| JPO | 450 | Real (patents) | 88.4 |
| UOB | 5,740 | Real (academic) | 98.0* |
| USPTO | 5,719 | Real (patents) | 93.4 |
| USPTO-10K | 10,000 | Real (patents) | 96.1 |
| Staker | 50,000 | Real | 87.4 |
| ACS | 331 | Real (publications) | 84.6 |
| WildMol-10K | 10,000 | Real (wild) | 77.1 |

*UOB results after tautomer standardization.

See the paper for full comparison with MolScribe, MolParser, SwinOCSR, and other baselines.
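For readers who want to compute an exact-match score on their own predictions, the following is a minimal sketch. It compares predicted and reference SMILES after passing both through a caller-supplied canonicalization function; the default only strips whitespace, so it is a placeholder rather than a chemical canonicalizer. The function name and the handling of failures are assumptions, and the numbers in the table may rely on additional post-processing that this sketch does not reproduce.

```python
def exact_match_accuracy(predictions, references, canonicalize=str.strip) -> float:
    """Fraction of predictions whose canonical form equals the reference's.

    `canonicalize` is any callable mapping a SMILES string to a canonical
    string; a prediction that makes it raise is counted as a miss.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        try:
            if canonicalize(pred) == canonicalize(ref):
                hits += 1
        except Exception:
            pass  # unparsable prediction: count as a miss
    return hits / len(references)


accuracy = exact_match_accuracy(["CCO ", "CCN"], ["CCO", "CCC"])
```

With RDKit installed, passing `rdkit.Chem.CanonSmiles` as `canonicalize` would compare molecules rather than raw strings.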

Installation

pip install como-ocsr

Usage

import como

# Download checkpoint from HuggingFace:
# huggingface-cli download Keylab/COMO models/tanimoto/final.pth

model = como.load_model("models/tanimoto/final.pth", device="cuda")

# Single image prediction
smiles = como.predict(model, "molecule.png")
print(smiles)  # "CC(=O)O"

# Batch prediction
smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"])

# Benchmark evaluation
metrics = como.evaluate(model, "benchmark/USPTO/", "benchmark/USPTO.csv")
print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")

Full documentation: GitHub, PyPI
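As an alternative to the command-line download, a checkpoint can be fetched from Python with `hf_hub_download` from the `huggingface_hub` package, which returns the local path of the cached file. The sketch below builds the file name from the reward mode and variant listed under Available Checkpoints. The helper names `checkpoint_filename` and `download_checkpoint` are hypothetical, and whether `como.load_model` accepts the returned cache path is an assumption.

```python
REPO_ID = "Keylab/COMO"  # repository named in the usage comment
REWARDS = ("tanimoto", "edit_distance", "visual")
VARIANTS = ("final", "best")


def checkpoint_filename(reward: str = "tanimoto", variant: str = "final") -> str:
    """Build the in-repo path of a checkpoint, e.g. models/tanimoto/final.pth."""
    if reward not in REWARDS:
        raise ValueError(f"unknown reward mode: {reward!r}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"models/{reward}/{variant}.pth"


def download_checkpoint(reward: str = "tanimoto", variant: str = "final") -> str:
    """Download one checkpoint and return its local path (needs network access)."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID,
                           filename=checkpoint_filename(reward, variant))
```

A call such as `download_checkpoint("visual", "best")` would then yield a path that can be passed on to the loading function.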

Benchmarks

Benchmark datasets are available in the benchmarks/ directory of this repository. Each dataset contains .png images and a CSV file with columns image_id and SMILES.
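A benchmark CSV with the two columns described above can be read with the standard library, as in the sketch below. It assumes that `image_id` is the image file name without the `.png` extension, which this README does not state, and `load_benchmark` is a hypothetical helper. The demo uses an in-memory CSV so that it runs without the repository files.

```python
import csv
import io
from pathlib import Path


def load_benchmark(csv_file, image_dir):
    """Return (image_path, smiles) pairs from an open CSV file.

    Assumes image_id is the file stem, so the image lives at
    <image_dir>/<image_id>.png; adjust if the IDs already carry an extension.
    """
    image_dir = Path(image_dir)
    pairs = []
    for row in csv.DictReader(csv_file):
        image_path = image_dir / f"{row['image_id']}.png"
        pairs.append((image_path, row["SMILES"]))
    return pairs


# In-memory stand-in for a real benchmark CSV.
demo_csv = io.StringIO("image_id,SMILES\nmol_0,CCO\nmol_1,c1ccccc1\n")
pairs = load_benchmark(demo_csv, "benchmarks/USPTO")
```

For a file on disk, open it with `open(path, newline="")` and pass the handle in place of `demo_csv`.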

Note: These benchmarks are collected from existing public OCSR datasets. Please refer to the original sources for attribution:

| Dataset | Source |
|---|---|
| USPTO, CLEF, JPO, UOB, Staker | Rajan et al., 2020; Xiong et al., 2023 |
| Indigo, ChemDraw, ACS, Staker | Qian et al., 2023 |
| USPTO-10K | Morin et al., 2023 |
| WildMol-10K | Fang et al., 2025 |

License

  • Model Weights: CC BY-NC 4.0 (non-commercial use only)
  • Code: MIT License
  • Benchmarks: See original sources for applicable terms

Citation

@article{lyu2026closed,
  title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
  author={Lyu, Zhuoqi and Ke, Qing},
  journal={arXiv preprint arXiv:2604.23546},
  year={2026}
}