license: cc-by-nc-4.0
library_name: pytorch
tags:
  - chemistry
  - cheminformatics
  - optical-chemical-structure-recognition
  - ocsr
  - molecule-recognition
  - smiles
  - transformer
  - swin-transformer
  - minimum-risk-training
  - molecular-graph
datasets:
  - Keylab/COMO
metrics:
  - exact_match
  - tanimoto_similarity

COMO: Closed-Loop Optical Molecule Recognition

COMO (Closed-loop Optical Molecule recOgnition) is a deep learning framework for Optical Chemical Structure Recognition (OCSR). It recognizes chemical structure diagrams from images and predicts SMILES strings with atom-level 2D coordinates and bond matrices. COMO uses Minimum Risk Training (MRT) to directly optimize molecular-level, non-differentiable objectives, closing the gap between token-level training and molecular-level evaluation.

Model Summary

Architecture: Swin-B encoder → 6-layer Transformer decoder → bond MLP
Input: 384×384 RGB image of a chemical structure diagram
Output: SMILES string + atom coordinates + bond matrix
Vocabulary: chartok_coords format (200 tokens: SMILES chars + 64 X/Y bins)
Parameters: ~94M
Training data: 1M PubChem + 652K USPTO (MLE) + 83K MolParser-SFT (MRT)
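
The 64 X/Y bins in the vocabulary imply that atom coordinates are quantized before being emitted as tokens. The following is a minimal sketch of that idea, not the COMO implementation: it assumes coordinates normalized to [0, 1], and the token spellings (`<X_n>`, `<Y_n>`), the label-then-X-then-Y ordering, and the helper names are hypothetical.

```python
N_BINS = 64  # from the model summary: 64 X/Y bins


def coord_to_bin(value: float, n_bins: int = N_BINS) -> int:
    """Map a coordinate normalized to [0, 1] onto a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * n_bins), n_bins - 1)


def bin_to_coord(index: int, n_bins: int = N_BINS) -> float:
    """Return the center of a bin as a normalized coordinate."""
    return (index + 0.5) / n_bins


def atom_to_tokens(label: str, x: float, y: float) -> list:
    """Emit an atom label followed by its X and Y bin tokens (hypothetical spelling)."""
    return [label, f"<X_{coord_to_bin(x)}>", f"<Y_{coord_to_bin(y)}>"]


tokens = atom_to_tokens("C", 0.25, 0.75)
```

With 64 bins, decoding a bin back to its center recovers the original coordinate to within half a bin width (1/128 of the image side).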

Available Checkpoints

All checkpoints are from the joint MLE+MRT training pipeline (30 epochs, interleaved MLE/MRT from scratch). Three reward variants are provided:

Checkpoint	Reward Mode	Description
`models/tanimoto/final.pth`	Tanimoto	Morgan fingerprint Tanimoto similarity reward
`models/tanimoto/best.pth`	Tanimoto	Best validation epoch
`models/edit_distance/final.pth`	Edit Distance	Levenshtein string-similarity reward
`models/edit_distance/best.pth`	Edit Distance	Best validation epoch
`models/visual/final.pth`	Visual	Siamese visual-encoder cosine-similarity reward
`models/visual/best.pth`	Visual	Best validation epoch

Architecture

Image (384×384)
  → Swin-B backbone (ImageNet pretrained)
    → 2D sinusoidal positional encoding
      → 6-layer Transformer decoder (d=256, 8 heads)
        → chartok_coords tokens → SMILES + coordinates
        → Bond MLP (2-layer, GELU) → 7-class bond matrix
          → Graph reconstruction → canonical SMILES

The model outputs a molecular graph $G = (A, B)$ where:

$A = \{(l_i, x_i, y_i)\}$: atom SMILES labels with 2D image coordinates
$B$: pairwise bond types (none, single, double, triple, aromatic, wedge, dash)
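
One way to picture this output is as a small data structure. The sketch below is an assumption-laden illustration, not the package's actual classes: the names `BondType`, `Atom`, and `MolGraph` are hypothetical, and the integer index assigned to each bond class simply follows the order listed above.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterator, List, Tuple


class BondType(IntEnum):
    """The 7 bond classes; the index order is an assumption."""
    NONE = 0
    SINGLE = 1
    DOUBLE = 2
    TRIPLE = 3
    AROMATIC = 4
    WEDGE = 5
    DASH = 6


@dataclass
class Atom:
    label: str  # atom SMILES label, e.g. "C" or "[NH3+]"
    x: float    # image coordinates
    y: float


@dataclass
class MolGraph:
    atoms: List[Atom]
    bonds: List[List[BondType]]  # n x n matrix of pairwise bond types

    def edges(self) -> Iterator[Tuple[int, int, BondType]]:
        """Yield (i, j, bond) for i < j wherever a bond is present.

        Simplification: only the upper triangle is read, so any direction
        information carried by wedge/dash bonds is ignored here.
        """
        n = len(self.atoms)
        for i in range(n):
            for j in range(i + 1, n):
                if self.bonds[i][j] != BondType.NONE:
                    yield i, j, self.bonds[i][j]


# Ethanol (CCO) with made-up coordinates
S, N = BondType.SINGLE, BondType.NONE
graph = MolGraph(
    atoms=[Atom("C", 0.2, 0.5), Atom("C", 0.5, 0.4), Atom("O", 0.8, 0.5)],
    bonds=[[N, S, N], [S, N, S], [N, S, N]],
)
```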

Training

MLE Phase

Data: 1M PubChem SMILES (synthetic) + 652K USPTO patent molecules
Augmentation: Indigo-rendered images with random styles, functional group substitution, R-group insertion, wavy bonds, scan shadows, multilingual comments
Optimizer: AdamW, lr=4×10⁻⁴ (encoder & decoder), weight decay=10⁻⁶
Schedule: 2% linear warmup → cosine decay, batch size 64/GPU
Loss: Label-smoothed cross-entropy (ε=0.1) + bond classification CE
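
The schedule above (2% linear warmup followed by cosine decay) can be sketched as a function of the training step. This is a sketch under stated assumptions: the function name is hypothetical, warmup is counted in optimizer steps, and the learning rate is assumed to decay to zero because the card does not give a final value.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 4e-4,
               warmup_frac: float = 0.02, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed to be 0)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a 1,000-step run.
schedule = [lr_at_step(s, 1000) for s in (0, 19, 20, 510, 1000)]
```

In a PyTorch training loop, the same function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the peak rate instead of the absolute value.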

MRT Phase

Data: 83K real-world molecular images (MolParser-SFT)
Candidates: N=32 per image, multinomial sampling at τ=0.5
Reward weights: validity=0.1, similarity=0.5, exact match=0.4
Sharpening: α=1.0, loss weight λ=0.1
Schedule: First 5 epochs MLE-only warmup, then interleaved MLE+MRT
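
To show how these hyperparameters might fit together, here is a minimal numeric sketch of a minimum-risk objective over sampled candidates. It follows the common MRT formulation (candidate probabilities sharpened by α and renormalized over the sample, risk taken as the expected cost 1 − reward), which is an assumption: the card does not spell out the exact loss. Function names are hypothetical, and a real training loop would compute this on tensors so that gradients flow through the log-probabilities.

```python
import math
from typing import List


def combined_reward(valid: bool, similarity: float, exact: bool,
                    w_valid: float = 0.1, w_sim: float = 0.5,
                    w_exact: float = 0.4) -> float:
    """Weighted sum of validity, similarity, and exact match (weights from the card)."""
    return w_valid * float(valid) + w_sim * similarity + w_exact * float(exact)


def mrt_risk(log_probs: List[float], rewards: List[float], alpha: float = 1.0) -> float:
    """Expected cost under the sharpened, renormalized candidate distribution."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return sum((w / total) * (1.0 - r) for w, r in zip(weights, rewards))


def total_loss(mle_loss: float, risk: float, lam: float = 0.1) -> float:
    """Assumption: lambda scales the MRT term added to the MLE loss."""
    return mle_loss + lam * risk


# Two equally likely candidates: one perfect, one invalid and dissimilar.
rewards = [combined_reward(True, 1.0, True), combined_reward(False, 0.0, False)]
risk = mrt_risk([math.log(0.5), math.log(0.5)], rewards)
```

Minimizing this risk shifts probability mass toward candidates with higher reward, which is how a non-differentiable metric such as exact match can shape training.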

Evaluation Results

Exact match accuracy (%) on 10 benchmarks (COMO-Tanimoto variant):

Benchmark	Images	Synthetic/Real	COMO-Tanimoto
Indigo	5,719	Synthetic	98.6
ChemDraw	5,719	Synthetic	96.5
CLEF	992	Real (patents)	94.8
JPO	450	Real (patents)	88.4
UOB	5,740	Real (academic)	98.0*
USPTO	5,719	Real (patents)	93.4
USPTO-10K	10,000	Real (patents)	96.1
Staker	50,000	Real	87.4
ACS	331	Real (publications)	84.6
WildMol-10K	10,000	Real (wild)	77.1

*UOB results after tautomer standardization.

See the paper for full comparison with MolScribe, MolParser, SwinOCSR, and other baselines.

Installation

pip install como-ocsr

Usage

import como

# Download the checkpoint from Hugging Face into the current directory:
# huggingface-cli download Keylab/COMO models/tanimoto/final.pth --local-dir .

model = como.load_model("models/tanimoto/final.pth", device="cuda")

# Single image prediction
smiles = como.predict(model, "molecule.png")
print(smiles)  # e.g. CC(=O)O

# Batch prediction
smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"])

# Benchmark evaluation
metrics = como.evaluate(model, "benchmark/USPTO/", "benchmark/USPTO.csv")
print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")

Full documentation: GitHub, PyPI

Benchmarks

Benchmark datasets are available in the benchmarks/ directory of this repository. Each dataset contains .png images and a CSV file with columns image_id and SMILES.
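
For readers who want to iterate over a benchmark without the `como` package, the layout described above can be read with the standard library. This is a sketch, assuming each `image_id` corresponds to a file named `<image_id>.png` inside the image directory; that naming rule and the `load_benchmark` helper are assumptions, not part of the package.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def load_benchmark(image_dir: str, csv_path: str) -> List[Tuple[Path, str]]:
    """Return (image path, ground-truth SMILES) pairs from a benchmark CSV."""
    root = Path(image_dir)
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            image_id = row["image_id"]
            # Assumption: append ".png" unless the ID already carries it.
            name = image_id if image_id.lower().endswith(".png") else f"{image_id}.png"
            pairs.append((root / name, row["SMILES"]))
    return pairs
```

The function only builds paths and does not check that the image files exist, so a missing file surfaces when the image is opened.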

Note: These benchmarks were collected from existing public OCSR datasets. Please refer to the original sources for attribution:

Dataset	Source
USPTO, CLEF, JPO, UOB, Staker	Rajan et al., 2020; Xiong et al., 2023
Indigo, ChemDraw, ACS, Staker	Qian et al., 2023
USPTO-10K	Morin et al., 2023
WildMol-10K	Fang et al., 2025

License

Model Weights: CC BY-NC 4.0 (non-commercial use only)
Code: MIT License
Benchmarks: See original sources for applicable terms

Citation

@article{lyu2026closed,
  title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
  author={Lyu, Zhuoqi and Ke, Qing},
  journal={arXiv preprint arXiv:2604.23546},
  year={2026}
}