Instructions to use multimolecule/pangolin with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MultiMolecule
How to use multimolecule/pangolin with MultiMolecule:
pip install multimolecule
from multimolecule import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("multimolecule/pangolin") model = AutoModel.from_pretrained("multimolecule/pangolin") inputs = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt") outputs = model(**inputs) embeddings = outputs.last_hidden_state - Notebooks
- Google Colab
- Kaggle
datasets:
- multimolecule/gencode
library_name: multimolecule
license: agpl-3.0
pipeline: splice-site
pipeline_tag: other
tags:
- Biology
- RNA
- Splicing
- rna
widget:
- example_title: microRNA 21
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: UAGCUUAUCAGACUGAUGUUGA
- example_title: microRNA 146a
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: UGAGAACUGAAUUCCAUGGGUU
- example_title: microRNA 155
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: UUAAUGCUAAUCGUGAUAGGGGUU
- example_title: RNA component of mitochondrial RNA processing endoribonuclease
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: >-
GGUUCGUGCUGAAGGCCUGUAUCCUAGGCUACACACUGAGGACUCUGUUCCUCCCCUUUCCGCCUAGGGGAAAGUCCCCGGACCUCGGGCAGAGAGUGCCACGUGCAUACGCACGUAGACAUUCCCCGCUUCCCACUCCAAAGUCCGCCAAGAAGCGUAUCCCGCUGAGCGGCGUGGCGCGGGGGCGUCAUCCGUCAGCUCCCUCUAGUUACGCAGGCAGUGCGUGUCCGCGCACCAACCACACGGGGCUCAUUCUCAGCGCGGCUGUAAAAAAAAA
- example_title: 7SK small nuclear RNA
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: >-
GGAUGUGAGGGCGAUCUGGCUGCGACAUCUGUCACCCCAUUGAUCGCCAGGGUUGAUUCGGCUGAUCUGGCUGGCUAGGCGGGUGUCCCCUUCCUCCCUCACCGCUCCAUGUGCGUCCCUCCCGAAGCUGCGCGCUCGGUCGAAGAGGACGACCAUCCCCGAUAGAGGAGGACCGGUCUUCGGUCAAGGGUAUACGAGUAGCUGCGCUCCCCUGCUAGAACCUCCAAACAAGCUCUCAAGGUCCAUUUGUAGGAGAACGUAGGGUAGUCAAGCUUCCAAGACUCCAGACACAUCCAAAUGAGGCGCUGCAUGUGGCAGUCUGCCUUUCUUUU
- example_title: telomerase RNA component
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: >-
GGGUUGCGGAGGGUGGGCCUGGGAGGGGUGGUGGCCAUUUUUUGUCUAACCCUAACUGAGAAGGGCGUAGGCGCCGUGCUUUUGCUCCCCGCGCGCUGUUUUUCUCGCUGACUUUCAGCGGGCGGAAAAGCCUCGGCCUGCCGCCUUCCACCGUUCAUUCUAGAGCAAACAAAAAAUGUCAGCUGCUGGCCCGUUCGCCCCUCCCGGGGACCUGCGGCGGGUCGCCUGCCCAGCCCCCGAACCCCGCCUGGAGGCCGCGGUCGGCCCGGGGCUUCUCCGGAGGCACCCACUGCCACCGCGAAGAGUUGGGCUCUGUCAGCCGCGGGUCUCUCGGGGGCGAGGGCGAGGUUCAGGCCUUUCAGGCCGCAGGAAGAGGAACGGAGCGAGUCCCCGCGCGCGGCGCGAUUCCCUGAGCUGUGGGACGUGCACCCAGGACUCGGCUCACACAUGC
- example_title: vault RNA 2-1
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: >-
CGGGUCGGAGUUAGCUCAAGCGGUUACCUCCUCAUGCCGGACUUUCUAUCUGUCCAUCUCUGUGCUGGGGUUCGAGACCCGCGGGUGCUUACUGACCCUUUUAUGCAA
- example_title: brain cytoplasmic RNA 1
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: >-
GGCCGGGCGCGGUGGCUCACGCCUGUAAUCCCAGCUCUCAGGGAGGCUAAGAGGCGGGAGGAUAGCUUGAGCCCAGGAGUUCGAGACCUGCCUGGGCAAUAUAGCGAGACCCCGUUCUCCAGAAAAAGGAAAAAAAAAAACAAAAGACAAAAAAAAAAUAAGCGUAACUUCCCUCAAAGCAACAACCCCCCCCCCCCUUU
- example_title: HIV-1 TAR-WT
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCUGGCUAACUAGGGAACC
- example_title: prion protein (Kanno blood group)
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGGCGAACCUUGGCUGCUGGAUGCUGGUUCUCUUUGUGGCCACAUGGAGUGACCUGGGCCUCUGC
- example_title: interleukin 10
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGCACAGCUCAGCACUGCUCUGUUGCCUGGUCCUCCUGACUGGGGUGAGGGCC
- example_title: Zaire ebolavirus
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: >-
AAUGUUCAAACACUUUGUGAAGCUCUGUUAGCUGAUGGUCUUGCUAAAGCAUUUCCUAGCAAUAUGAUGGUAGUCACAGAGCGUGAGCAAAAAGAAAGCUUAUUGCAUCAAGCAUCAUGGCACCACACAAGUGAUGAUUUUGGUGAGCAUGCCACAGUUAGAGGGAGUAGCUUUGUAACUGAUUUAGAGAAAUACAAUCUUGCAUUUAGAUAUGAGUUUACAGCACCUUUUAUAGAAUAUUGUAACCGUUGCUAUGGUGUUAAGAAUGUUUUUAAUUGGAUGCAUUAUACAAUCCCACAGUGUUAU
- example_title: SARS coronavirus
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: >-
AUGUUUAUUUUCUUAUUAUUUCUUACUCUCACUAGUGGUAGUGACCUUGACCGGUGCACCACUUUUGAUGAUGUUCAAGCUCCUAAUUACACUCAACAUACUUCAUCUAUGAGGGGGGUUUACUAUCCUGAUGAAAUUUUUAGAUCAGACACUCUUUAUUUAACUCAGGAUUUAUUUCUUCCAUUUUAUUCUAAUGUUACAGGGUUUCAUACUAUUAAUCAUACGUUUGACAACCCUGUCAUACCUUUUAAGGAUGGUAUUUAUUUUGCUGCCACAGAGAAAUCAAAUGUUGUCCGUGGUUGGGUUUUUGGUUCUACCAUGAACAACAAGUCACAGUCGGUGAUUAUUAUUAACAAUUCUACUAAUGUUGUUAUACGAGCAUGUAACUUUGAAUUGUGUGACAACCCUUUCUUUGCUGUUUCUAAACCCAUGGGUACACAGACACAUACUAUGAUAUUCGAUAAUGCAUUUAAAUGCACUUUCGAGUACAUAUCU
- example_title: insulin
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: >-
AUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCUGCAGGUGGGGCAGGUGGAGCUGGGCGGGGGCCCUGGUGCAGGCAGCCUGCAGCCCUUGGCCCUGGAGGGGUCCCUGCAGAAGCGUGGCAUUGUGGAACAAUGCUGUACCAGCAUCUGCUCCCUCUACCAGCUGGAGAACUACUGCAACUAG
- example_title: cyclin dependent kinase inhibitor 2A
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: >-
AUGGAGCCGGCGGCGGGGAGCAGCAUGGAGCCUUCGGCUGACUGGCUGGCCACGGCCGCGGCCCGGGGUCGGGUAGAGGAGGUGCGGGCGCUGCUGGAGGCGGGGGCGCUGCCCAACGCACCGAAUAGUUACGGUCGGAGGCCGAUCCAGGUCAUGAUGAUGGGCAGCGCCCGAGUGGCGGAGCUGCUGCUGCUCCACGGCGCGGAGCCCAACUGCGCCGACCCCGCCACUCUCACCCGACCCGUGCACGACGCUGCCCGGGAGGGCUUCCUGGACACGCUGGUGGUGCUGCACCGGGCCGGGGCGCGGCUGGACGUGCGCGAUGCCUGGGGCCGUCUGCCCGUGGACCUGGCUGAGGAGCUGGGCCAUCGCGAUGUCGCACGGUACCUGCGCGCGGCUGCGGGGGGCACCAGAGGCAGUAACCAUGCCCGCAUAGAUGCCGCGGAAGGUCCCUCAGACAUCCCCGAUUGA
- example_title: human papillomavirus type 16 E6
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: >-
AUGCACCAAAAGAGAACUGCAAUGUUUCAGGACCCACAGGAGCGACCCAGAAAGUUACCACAGUUAUGCACAGAGCUGCAAACAACUAUACAUGAUAUAAUAUUAGAAUGUGUGUACUGCAAGCAACAGUUACUGCGACGUGAGGUAUAUGACUUUGCUUUUCGGGAUUUAUGCAUAGUAUAUAGAGAUGGGAAUCCAUAUGCUGUAUGUGAUAAAUGUUUAAAGUUUUAUUCUAAAAUUAGUGAGUAUAGACAUUAUUGUUAUAGUUUGUAUGGAACAACAUUAGAACAGCAAUACAACAAACCGUUGUGUGAUUUGUUAAUUAGGUGUAUUAACUGUCAAAAGCCACUGUGUCCUGAAGAAAAGCAAAGACAUCUGGACAAAAAGCAAAGAUUCCAUAAUAUAAGGGGUCGGUGGACCGGUCGAUGUAUGUCUUGUUGCAGAUCAUCAAGAACACGUAGAGAAACCCAGCUGUAA
- example_title: NRAS proto-oncogene
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: >-
GGGGCCGGAAGUGCCGCUCCUUGGUGGGGGCUGUUCAUGGCGGUUCCGGGGUCUCCAACAUUUUUCCCGGCUGUGGUCCUAAAUCUGUCCAAAGCAGAGGCAGUGGAGCUUGAGGUUCUUGCUGGUGUGAA
- example_title: amyloid beta precursor protein
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: >-
GUCAGUUUCCUCGGCAGCGGUAGGCGAGAGCACGCGGAGGAGCGUGCGCGGGGGCCCCGGGAGACGGCGGCGGUGGCGGCGCGGGCAGAGCAAGGACGCGGCGGAUCCCACUCGCACAGCAGCGCACUCGGUGCCCCGCGCAGGGUCGCG
- example_title: RUNX family transcription factor 1
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: >-
ACUUCUUUGGGCCUCAUAAACAACCACAGAACCACAAGUUGGGUAGCCUGGCAGUGUCAGAAGUCUGAACCCAGCAUAGUGGUCAGCAGGCAGGACGAAUCACACUGAAUGCAAACCACAGGGUUUCGCAGCGUGGUAAAAGAAAUCAUUGAGUCCCCCGCCUUCAGAAGAGGGUGCAUUUUCAGGAGGAAGCG
- example_title: fragile X messenger ribonucleoprotein 1
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: >-
CUCAGUCAGGCGCUCAGCUCCGUUUCGGUUUCACUUCCGGUGGAGGGCCGCCUCUGAGCGGGCGGCGGGCCGACGGCGAGCGCGGGCGGCGGCGGUGACGGAGGCGCCGCUGCCAGGGGGCGUGCGGCAGCGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGAGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCUGGGCCUCGAGCGCCCGCAGCCCACCUCUCGGGGGCGGGCUCCCGGCGCUAGCAGGGCUGAAGAGAAG
- example_title: MYC proto-oncogene
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: >-
AACUCGCUGUAGUAAUUCCAGCGAGAGGCAGAGGGAGCGAGCGGGCGGCCGGCUAGGGUGGAAGAGCCGGGCGAGCAGAGCUGCGCUGCGGGCGUCCUGGGAAGGGAGAUCCGGAGCGAAUAGGGGGCUUCGCCUCUGGCCCAGCCCUCCCGCUGAUCCCCCAGCCAGCGGUCCGCAACCCUUGCCGCAUCCACGAAACUUUGCCCAUAGCAGCGGGCGGGCACUUUGCACUGGAACUUACAACACCCGAGCAAGGACGCGACUCUCCCGACGCGGGGAGGCUAUUCUGCCCAUUUGGGGACACUUCCCCGCCGCUGCCAGGACCCGCUUCUCUGAAAGGCUCUCCUUGCAGCUGCUUAGACG
- example_title: activating transcription factor 4
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: >-
CAUUUCUACUUUGCCCGCCCACAGAUGUAGUUUUCUCUGCGCGUGUGCGUUUUCCCUCCUCCCCGCCCUCAGGGUCCACGGCCACCAUGGCGUAUUAGGGGCAGCAGUGCCUGCGGCAGCAUUGGCCUUUGCAGCGGCGGCAGCAGCACCAGGCUCUGCAGCGGCAACCCCCAGCGGCUUAAGCCAUGGCGCUUCUCACGGCAUUCAGCAGCAGCGUUGCUGUAACCGACAAAGACACCUUCGAAUUAAGCACAUUCCUCGAUUCCAGCAAAGCACCGCAAC
- example_title: Human GPI protein p137
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: >-
UUUUUAAAAGGAAAAGAUACCAAAUGCCUGCUGCUACCACCCUUUUCAAUUGCUAUGUUUUGAAAGGCACCAGUAUGUGUUUUAGAUUGAUUUAAAUGUUUCAUUUAAAUCACGGACAGUAGUUUCAGUUCUGAUGGUAUAAGCAAAACAAAUAAAACGUUUAUAAAAGUUGUAUCUUGAAACACUGGUGUUCAACAGCUAGCAGCUUAUGUGAUUCACCCCAUGCCACGUUAGUGUCACAAAUUUUAUGGUUUAUCUCCAGCAACAUUUCUCUAGUACUUGCACUUAUUAUCUGAAUUC
- example_title: nucleophosmin 1
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: >-
GAAAAUAGUUUAAACAAUUUGUUAAAAAAUUUUCCGUCUUAUUUCAUUUCUGUAACAGUUGAUAUCUGGCUGUCCUUUUUAUAAUGCAGAGUGAGAACUUUCCCUACCGUGUUUGAUAAAUGUUGUCCAGGUUCUAUUGCCAAGAAUGUGUUGUCCAAAAUGCCUGUUUAGUUUUUAAAGAUGGAACUCCACCCUUUGCUUGGUUUUAAGUAUGUAUGGAAUGUUAUGAUAGGACAUAGUAGUAGCGGUGGUCAGACAUGGAAAUGGUGGGGAGACAAAAAUAUACAUGUGAAAUAAAACUCAGUAUUUUAAUAAAGUAGCACGGUUUCUAUUGA
- example_title: superoxide dismutase 1
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: >-
ACAUUCCCUUGGAUGUAGUCUGAGGCCCCUUAACUCAUCUGUUAUCCUGCUAGCUGUAGAAAUGUAUCCUGAUAAACAUUAAACACUGUAAUCUUAAAAGUGUAAUUGUGUGACUUUUUCAGAGUUGCUUUAAAGUACCUGUAGUGAGAAACUGAUUUAUGAUCACUUGGAAGAUUUGUAUAGUUUUAUAAAACUCAGUUAAAAUGUCUGUUUCAAUGACCUGUAUUUUGCCAGACUUAAAUCACAGAUGGGUAUUAAACUUGUCAGAAUUUCUUUGUCAUUCAAGCCUGUGAAUAAAAACCCUGUAUGGCACUUAUUAUGAGGCUAUUAAAAGAAUCCAAAUUCAAACUAAA
- example_title: hemoglobin subunit alpha 2
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: >-
CUGGAGCCUCGGUAGCCGUUCCUCCUGCCCGCUGGGCCUCCCAACGGGCCCUCCUCCCCUCCUUGCACCGGCCCUUCCUGGUCUUUGAAUAAAGUCUGAGUGGGCAGCA
- example_title: BRAF proto-oncogene
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: >-
AACAAAUGAGUGAGAGAGUUCAGGAGAGUAGCAACAAAAGGAAAAUAAAUGAACAUAUGUUUGCUUAUAUGUUAAAUUGAAUAAAAUACUCUCUUUUUUUUUAAGGUGAACCAAAGAACACUUGUGUGGUUAAAGACUAGAUAUAAUUUUUCCCCAAACUAAAAUUUAUACUUAACAUUGGAUUUUUAACAUCCAAGGGUUAAAAUACAUAGACAUUGCUAAAAAUUGGCAGAGCCUCUUCUAGAGGCUUUACUUUCUGUUCCGGGUUUGUAUCAUUCACUUGGUUAUUUUAAGUAGUAAACUUCAGUUUCUCAUGCAACUUUUGUUGCCAGCUAUCACAUGUCCACUAGGGACUCCAGAAGAAGACCCUACCUAUGCCUGUGUUUGCAGGUGAGAAGUUGGCAGUCGGUUAGCCUGGG
- example_title: H3 clustered histone 1
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: UUACUGUGGUCUCUCUGACGGUCCAAGCAAAGGCUCUUUUCAGAGCCACCACCUUUUC
Pangolin
Convolutional neural network for predicting tissue-specific splice site strength from pre-mRNA sequences.
Disclaimer
This is an UNOFFICIAL implementation of Predicting RNA splicing from DNA sequence using Pangolin by Tony Zeng, et al.
The OFFICIAL repository of Pangolin is at tkzeng/Pangolin.
The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
The team releasing Pangolin did not write this model card for this model so this model card has been written by the MultiMolecule team.
Model Details
Pangolin is a deep convolutional neural network (CNN) that predicts splice site strength from primary pre-mRNA sequence. It extends the dilated-residual SpliceAI architecture to predict tissue-specific splice site usage, and is trained on splicing measurements derived from RNA-seq data across multiple tissues. The network processes a one-hot encoded nucleotide sequence and, for each position, predicts a splice-site score and a splice-site usage score per tissue. Pangolin is typically used to estimate the effect of genetic variants on splicing by scoring reference and alternate sequences and taking the difference. Please refer to the Training Details section for more information on the training process.
Model Specification
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
|---|---|---|---|---|
| 16 | 32 | 8.36 | 168.85 | 84.04 |
Links
- Code: multimolecule.pangolin
- Data: Cross-species RNA-seq splice-site usage from human, rhesus, rat, and mouse tissues
- Paper: Predicting RNA splicing from DNA sequence using Pangolin
- Developed by: Tony Zeng, Yang I. Li
- Model type: Dilated residual 1D CNN ensemble for per-nucleotide multi-tissue splice-site usage prediction
- Original Repository: tkzeng/Pangolin
Usage
The model file depends on the multimolecule library. You can install it using pip:
pip install multimolecule
Direct Use
RNA Splicing Site Prediction
You can use this model directly to predict per-nucleotide tissue-specific splice-site score and usage channels for a pre-mRNA sequence:
>>> from multimolecule import RnaTokenizer, PangolinModel
>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/pangolin")
>>> model = PangolinModel.from_pretrained("multimolecule/pangolin")
>>> output = model(tokenizer("AGCAGUCAUUAUGGCGAA", return_tensors="pt")["input_ids"])
>>> output.keys()
odict_keys(['last_hidden_state', 'probabilities'])
The probabilities tensor reproduces the original Pangolin output: for each of the four tissues, two splice-site score channels (softmax) and one splice-site usage channel (sigmoid).
Downstream Use
Token Prediction
You can fine-tune Pangolin for per-nucleotide splice site strength regression with [PangolinForTokenPrediction][multimolecule.models.PangolinForTokenPrediction], which adds a shared token prediction head on top of the backbone.
Interface
- Input length: variable pre-mRNA sequence
- Padding: flanking context padded with
Nnear transcript ends - Output: per-position tissue-specific channels — for each of 4 tissues, 2 splice-site score channels + 1 splice-site usage channel
Training Details
Pangolin was trained to predict tissue-specific splice site usage from primary pre-mRNA sequence.
Training Data
Pangolin was trained on splice site usage derived from RNA-seq data in heart, liver, brain, and testis tissues from human and three other species, using gene annotations from GENCODE.
For each nucleotide whose splicing status was predicted, a sequence window centered on that nucleotide was used, with the flanking context padded with N (unknown nucleotide) when near transcript ends.
Training Procedure
Pre-training
The model was trained to minimize a combination of cross-entropy loss over splice-site classification and a regression loss over splice-site usage, comparing predictions against measurements derived from RNA-seq.
- Optimizer: AdamW
- Learning rate scheduler: Step decay
Citation
@article{zeng2022predicting,
author = {Zeng, Tony and Li, Yang I.},
title = {Predicting RNA splicing from DNA sequence using Pangolin},
journal = {Genome Biology},
volume = {23},
number = {1},
pages = {103},
year = {2022},
doi = {10.1186/s13059-022-02664-4},
publisher = {BioMed Central}
}
The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
Contact
Please use GitHub issues of MultiMolecule for any questions or comments on the model card.
Please contact the authors of the Pangolin paper for questions or comments on the paper/model.
License
This model implementation is licensed under the GNU Affero General Public License.
For additional terms and clarifications, please refer to our License FAQ.
SPDX-License-Identifier: AGPL-3.0-or-later