
1. Model Overview

  • Model Name: MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
  • Summary: MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
  • Model Specification: Encoder–decoder Transformer. 220M parameters.
  • Developed by: Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
  • License: MIT license.
  • Base Model: ChemT5 (chemistry-domain pretrained T5).
  • Model Type: Transformer
  • Languages: SMARTS (chemical substructure representation)
  • Pipeline Tag: text2text-generation for MMP transformation
  • Library: Transformers, PyTorch

2. Intended Use

  • Direct Use:
    • Generation of chemically valid matched molecular pair transformations (MMPTs).
    • Analog design at a user-specified edit site (R-group substitution or core hopping).
  • Downstream Use:
    • Integration into analog enumeration pipelines
    • Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
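
For intuition about what an MMPT looks like as text, the sketch below assumes a SMIRKS-style serialization of the form `before>>after` with numbered attachment points such as `[*:1]`; the exact serialization used by MMPT-FM is defined in the papers cited below, so treat the format, the function names, and the example strings here as illustrative assumptions only.

```python
import re

def parse_mmpt(transformation: str) -> tuple[str, str]:
    """Split an MMPT string of the assumed form 'variable_before>>variable_after'."""
    parts = transformation.split(">>")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected exactly one '>>' separator: {transformation!r}")
    return parts[0], parts[1]

def attachment_points(fragment: str) -> set[str]:
    """Collect numbered attachment-point labels like [*:1] from a fragment."""
    return set(re.findall(r"\[\*:(\d+)\]", fragment))

def is_consistent(transformation: str) -> bool:
    """A transformation is consistent if both sides expose the same attachment points."""
    lhs, rhs = parse_mmpt(transformation)
    return attachment_points(lhs) == attachment_points(rhs)
```

For example, `is_consistent("[*:1]C>>[*:1]CC")` is `True` (a methyl-to-ethyl edit with a matching attachment point), while `is_consistent("[*:1]C>>[*:2]CC")` is `False`.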

3. Bias, Risks, and Limitations

  • Known Limitations: The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
  • Biases: Inherits biases from ChEMBL-derived medicinal chemistry literature.
  • Risk Areas: The model is intended for research use and does not introduce specific ethical concerns.

4. Training Details

5. Evaluation

  • Metrics:
    • Validity
    • Novelty (Novel/valid, Novel/all)
    • Recall (overall, in-training, out-of-training)
  • Benchmarks:
    • Held-out ChEMBL MMPT test set (in-distribution)
    • Within-patent analog generation (PMV17)
    • Cross-patent analog generation (PMV17 → PMV21)
  • Testing Data: Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
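
The metrics above can be read as simple set ratios over generated transformations. The sketch below assumes set-based definitions (validity as the valid fraction of all outputs, novelty relative to the training set, recall against a held-out reference, split by whether a reference transformation was seen in training); chemical-validity checking itself (e.g. via RDKit) is out of scope here, and the exact definitions used in the papers may differ.

```python
def evaluation_counts(generated, valid, training, reference):
    """Set-based sketch of the card's evaluation metrics.

    generated: all model outputs; valid: the chemically valid subset;
    training: transformations seen during training;
    reference: held-out ground-truth transformations to recover.
    """
    generated = set(generated)
    valid = generated & set(valid)
    training, reference = set(training), set(reference)

    novel = valid - training            # valid outputs unseen in training
    recovered = valid & reference      # reference transformations the model reproduced
    ref_in, ref_out = reference & training, reference - training

    def ratio(a, b):
        return len(a) / len(b) if b else 0.0

    return {
        "validity": ratio(valid, generated),
        "novel_over_valid": ratio(novel, valid),
        "novel_over_all": ratio(novel, generated),
        "recall": ratio(recovered, reference),
        "recall_in_training": ratio(recovered & training, ref_in),
        "recall_out_of_training": ratio(recovered - training, ref_out),
    }
```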

6. Usage
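
Since this section is otherwise empty, the following is a minimal, hypothetical sketch of loading a text2text model with the Transformers library and sampling several candidate transformations. The Hub ID `Merck/MMPT-FM`, the input format, and the decoding settings are assumptions, not confirmed by this card; consult the repository for the actual interface.

```python
MODEL_ID = "Merck/MMPT-FM"  # assumed Hub ID; check the actual repository name

def dedupe_keep_order(candidates):
    """Deduplicate sampled transformations while preserving generation order."""
    seen, out = set(), []
    for c in candidates:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out

def generate_transformations(variable_fragment, n=5, model_id=MODEL_ID):
    """Sample n candidate MMPTs for a SMARTS fragment marking the edit site."""
    # Heavy dependencies are imported lazily so the helper above stays importable.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(variable_fragment, return_tensors="pt")
    outputs = model.generate(
        **inputs, do_sample=True, num_return_sequences=n, max_new_tokens=64
    )
    return dedupe_keep_order(
        tokenizer.decode(o, skip_special_tokens=True) for o in outputs
    )
```

Sampling with `num_return_sequences` typically yields duplicates, hence the order-preserving deduplication before returning candidates.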

7. Citation

@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
      title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition}, 
      author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
      year={2026},
      eprint={2602.16684},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.16684}, 
}
@article{doi:10.26434/chemrxiv.15001722/v1,
      author = {Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang},
      title = {Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
      journal = {ChemRxiv},
      volume = {2026},
      number = {0407},
      year = {2026},
      doi = {10.26434/chemrxiv.15001722/v1},
      url = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
      eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
}