---
datasets:
- multimolecule/encode
library_name: multimolecule
license: agpl-3.0
pipeline: regulatory-profile
pipeline_tag: other
tags:
- Biology
- DNA
- dna
widget:
- example_title: tumor protein p53
pipeline_tag: regulatory-profile
sequence_type: DNA
task: regulatory-profile
text: ACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGG
- example_title: BRCA1 DNA repair associated
pipeline_tag: regulatory-profile
sequence_type: DNA
task: regulatory-profile
text: TCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGG
- example_title: hemoglobin subunit beta
pipeline_tag: regulatory-profile
sequence_type: DNA
task: regulatory-profile
text: CATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
- example_title: CF transmembrane conductance regulator
pipeline_tag: regulatory-profile
sequence_type: DNA
task: regulatory-profile
text: ACTTCACTTCTAATGGTGATTATGGGAGAACTGGAGCCTTCAGAGGGTAAAATTAAGCACAGTGGAAGAATTTCATTCTGTTCTCAGTTTTCCTGGATTATGCCTGGCACCATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGAATATAGATACAGAAGCGTCATCAAAGCATGCCAACTAGAAGAG
- example_title: telomerase reverse transcriptase
pipeline_tag: regulatory-profile
sequence_type: DNA
task: regulatory-profile
text: CGCGGGGGTGGCCGGGGCCAGGGCTTCCCACGTGCGCAGCAGGACGCAGCGCTGCCTGAAACTCGCGCCGCGAGGAGAGGGCGGGGCCGCGGAAAGGAAGGGGAGGGGCTGGGAGGGCCCGGAGGGGGCTGGGCCGGGGACCCGGGAGGGGTCGGGACGGGGCGGGGTCCGCGCGGAGGAGGCGGAGCTGGAAGGTGAAGGGGCAGGACGGGTGCCCGGGTCCCCAGTCCCTCCGCCACGTGGGAAGCGCGGTCCTGGGCGTCTGTGCCCGCGAATCCACTGGGAGCCCGGCCTGGCCCCGACAGCGCAGCTGCTCCGGGCGGACCCGGGG
- example_title: KRAS proto-oncogene
pipeline_tag: regulatory-profile
sequence_type: DNA
task: regulatory-profile
text: GCCTGCTGAAAATGACTGAATATAAACTTGTGGTAGTTGGAGCTGGTGGCGTAGGCAAGAGTGCCTTGACGATACAGCTAATTCAGAATCATTTTGTGGACGAATATGATCCAACAATAGAG
- example_title: prion protein (Kanno blood group)
pipeline_tag: regulatory-profile
sequence_type: cDNA
task: regulatory-profile
text: ATGGCGAACCTTGGCTGCTGGATGCTGGTTCTCTTTGTGGCCACATGGAGTGACCTGGGCCTCTGC
- example_title: interleukin 10
pipeline_tag: regulatory-profile
sequence_type: cDNA
task: regulatory-profile
text: ATGCACAGCTCAGCACTGCTCTGTTGCCTGGTCCTCCTGACTGGGGTGAGGGCC
- example_title: Zaire ebolavirus
pipeline_tag: regulatory-profile
sequence_type: cDNA
task: regulatory-profile
text: AATGTTCAAACACTTTGTGAAGCTCTGTTAGCTGATGGTCTTGCTAAAGCATTTCCTAGCAATATGATGGTAGTCACAGAGCGTGAGCAAAAAGAAAGCTTATTGCATCAAGCATCATGGCACCACACAAGTGATGATTTTGGTGAGCATGCCACAGTTAGAGGGAGTAGCTTTGTAACTGATTTAGAGAAATACAATCTTGCATTTAGATATGAGTTTACAGCACCTTTTATAGAATATTGTAACCGTTGCTATGGTGTTAAGAATGTTTTTAATTGGATGCATTATACAATCCCACAGTGTTAT
- example_title: SARS coronavirus
pipeline_tag: regulatory-profile
sequence_type: cDNA
task: regulatory-profile
text: ATGTTTATTTTCTTATTATTTCTTACTCTCACTAGTGGTAGTGACCTTGACCGGTGCACCACTTTTGATGATGTTCAAGCTCCTAATTACACTCAACATACTTCATCTATGAGGGGGGTTTACTATCCTGATGAAATTTTTAGATCAGACACTCTTTATTTAACTCAGGATTTATTTCTTCCATTTTATTCTAATGTTACAGGGTTTCATACTATTAATCATACGTTTGACAACCCTGTCATACCTTTTAAGGATGGTATTTATTTTGCTGCCACAGAGAAATCAAATGTTGTCCGTGGTTGGGTTTTTGGTTCTACCATGAACAACAAGTCACAGTCGGTGATTATTATTAACAATTCTACTAATGTTGTTATACGAGCATGTAACTTTGAATTGTGTGACAACCCTTTCTTTGCTGTTTCTAAACCCATGGGTACACAGACACATACTATGATATTCGATAATGCATTTAAATGCACTTTCGAGTACATATCT
- example_title: insulin
pipeline_tag: regulatory-profile
sequence_type: cDNA
task: regulatory-profile
text: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG
- example_title: cyclin dependent kinase inhibitor 2A
pipeline_tag: regulatory-profile
sequence_type: cDNA
task: regulatory-profile
text: ATGGAGCCGGCGGCGGGGAGCAGCATGGAGCCTTCGGCTGACTGGCTGGCCACGGCCGCGGCCCGGGGTCGGGTAGAGGAGGTGCGGGCGCTGCTGGAGGCGGGGGCGCTGCCCAACGCACCGAATAGTTACGGTCGGAGGCCGATCCAGGTCATGATGATGGGCAGCGCCCGAGTGGCGGAGCTGCTGCTGCTCCACGGCGCGGAGCCCAACTGCGCCGACCCCGCCACTCTCACCCGACCCGTGCACGACGCTGCCCGGGAGGGCTTCCTGGACACGCTGGTGGTGCTGCACCGGGCCGGGGCGCGGCTGGACGTGCGCGATGCCTGGGGCCGTCTGCCCGTGGACCTGGCTGAGGAGCTGGGCCATCGCGATGTCGCACGGTACCTGCGCGCGGCTGCGGGGGGCACCAGAGGCAGTAACCATGCCCGCATAGATGCCGCGGAAGGTCCCTCAGACATCCCCGATTGA
- example_title: human papillomavirus type 16 E6
pipeline_tag: regulatory-profile
sequence_type: cDNA
task: regulatory-profile
text: ATGCACCAAAAGAGAACTGCAATGTTTCAGGACCCACAGGAGCGACCCAGAAAGTTACCACAGTTATGCACAGAGCTGCAAACAACTATACATGATATAATATTAGAATGTGTGTACTGCAAGCAACAGTTACTGCGACGTGAGGTATATGACTTTGCTTTTCGGGATTTATGCATAGTATATAGAGATGGGAATCCATATGCTGTATGTGATAAATGTTTAAAGTTTTATTCTAAAATTAGTGAGTATAGACATTATTGTTATAGTTTGTATGGAACAACATTAGAACAGCAATACAACAAACCGTTGTGTGATTTGTTAATTAGGTGTATTAACTGTCAAAAGCCACTGTGTCCTGAAGAAAAGCAAAGACATCTGGACAAAAAGCAAAGATTCCATAATATAAGGGGTCGGTGGACCGGTCGATGTATGTCTTGTTGCAGATCATCAAGAACACGTAGAGAAACCCAGCTGTAA
---
# ProCapNet
Base-resolution convolutional neural network for predicting PRO-cap transcription-initiation signal from DNA sequence.
## Disclaimer
This is an UNOFFICIAL implementation of [Dissecting the cis-regulatory syntax of transcription initiation with deep learning](https://doi.org/10.1101/2024.05.28.596138) by Kelly Cochran, et al.
The OFFICIAL repository of ProCapNet is at [kundajelab/ProCapNet](https://github.com/kundajelab/ProCapNet).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
**The team releasing ProCapNet did not write a model card for this model, so this model card has been written by the MultiMolecule team.**
## Model Details
ProCapNet is a convolutional neural network (CNN) trained to predict base-resolution PRO-cap transcription-initiation signal from primary DNA sequence. Its architecture is largely adapted from Jacob Schreiber's `bpnet-lite` and shares BPNet's dilated-convolution backbone and profile/count factorization. The output is two-stranded (plus / minus strand), mappability-aware, and reconstructed by `ProCapNetForProfilePrediction.postprocess`. Please refer to the [Training Details](#training-details) section for more information on the training process.
### Model Specification
| Input Length | Profile Length | Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ------------ | -------------- | ---------- | ----------- | ------------------ | --------- | -------- |
| 2114 | 1000 | 9 | 512 | 6.43 | 27.17 | 13.58 |
FLOPs and MACs are measured on the canonical 2114 bp ProCapNet input window.
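The gap between the input and profile lengths in the table can be read as sequence context on either side of the predicted region. The sketch below works through that arithmetic, assuming (as in BPNet-style models) that the 1000 bp profile is centered within the 2114 bp input. The helper names `flank_length` and `profile_to_input_index` are hypothetical and not part of the `multimolecule` API.

```python
INPUT_LENGTH = 2114    # bp of DNA fed to the model (from the table above)
PROFILE_LENGTH = 1000  # bp covered by the predicted profile (from the table above)


def flank_length(input_length: int = INPUT_LENGTH, profile_length: int = PROFILE_LENGTH) -> int:
    """Context on each side of the profile, ASSUMING the profile is centered."""
    extra = input_length - profile_length
    if extra < 0 or extra % 2 != 0:
        raise ValueError("input must exceed the profile by an even number of bases")
    return extra // 2


def profile_to_input_index(profile_index: int) -> int:
    """Map a 0-based profile position to its 0-based input position (same assumption)."""
    if not 0 <= profile_index < PROFILE_LENGTH:
        raise IndexError("profile index out of range")
    return profile_index + flank_length()
```

Under this assumption, `flank_length()` is 557, so profile position 0 corresponds to input position 557.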
### Links
- **Code**: [multimolecule.procapnet](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/procapnet)
- **Data**: [K562 PRO-cap (ENCODE ENCSR261KBX)](https://www.encodeproject.org/experiments/ENCSR261KBX/)
- **Paper**: [Dissecting the cis-regulatory syntax of transcription initiation with deep learning](https://doi.org/10.1101/2024.05.28.596138)
- **Developed by**: Kelly Cochran, Melody Yin, Anika Mantripragada, Jacob Schreiber, Georgi K. Marinov, Sagar R. Shah, Haiyuan Yu, John T. Lis, Anshul Kundaje
- **Model type**: BPNet-derived 1D dilated CNN with two-stranded factorized profile-and-count heads for PRO-cap transcription-initiation prediction
- **Original Repository**: [kundajelab/ProCapNet](https://github.com/kundajelab/ProCapNet)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### Transcription-Initiation Profile Prediction
You can use this model directly to predict the PRO-cap transcription-initiation profile for a DNA sequence:
```python
>>> from multimolecule import DnaTokenizer, ProCapNetForProfilePrediction
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/procapnet")
>>> model = ProCapNetForProfilePrediction.from_pretrained("multimolecule/procapnet")
>>> output = model(**tokenizer(("ACGT" * 529)[:2114], return_tensors="pt"))
>>> output.keys()
odict_keys(['profile_logits', 'count_logits'])
>>> output["profile_logits"].shape
torch.Size([1, 1000, 2])
>>> output["count_logits"].shape
torch.Size([1, 1])
>>> track = model.postprocess(output)
>>> track.shape
torch.Size([1, 1000, 2])
```
The recombined `track` is the usable base-resolution prediction. The last dimension stacks the `num_strands` (plus, minus) PRO-cap signal predictions.
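To make the recombination step concrete, here is a minimal pure-Python sketch of a BPNet-style profile/count recombination. It is not the library's implementation. It assumes that the profile logits are normalized with a softmax taken jointly over all positions and both strands, and that `count_logits` represents `log(count + 1)` as the training loss suggests. The actual `postprocess` method may differ in these details. The function name `recombine` is hypothetical.

```python
import math


def recombine(profile_logits, count_logit):
    """Turn factorized outputs into a per-position, per-strand track.

    profile_logits: list of [plus, minus] logits, one pair per position.
    count_logit: scalar, ASSUMED to be the predicted log(total count + 1).
    """
    n_strands = len(profile_logits[0])
    flat = [x for position in profile_logits for x in position]

    # Numerically stable softmax over every position and strand together
    # (the strand-merged normalization is an assumption).
    peak = max(flat)
    exps = [math.exp(x - peak) for x in flat]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Invert log(count + 1); clamp so the predicted total is never negative.
    total = max(math.expm1(count_logit), 0.0)

    return [
        [probs[i * n_strands + s] * total for s in range(n_strands)]
        for i in range(len(profile_logits))
    ]
```

With this factorization the profile head only determines where the signal falls, while the count head sets how much signal there is in total.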
### Interface
- **Input length**: 2114 bp DNA window
- **Profile length**: 1000 bp, two-stranded (plus / minus)
- **Output**: factorized `(profile_logits, count_logits)`; recombine the base-resolution PRO-cap track via `ProCapNetForProfilePrediction.postprocess`
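Because the model expects a fixed 2114 bp window, longer sequences need to be cut down before tokenization. The following is a small sketch of one way to do that, by taking a window centered on a position of interest. The helper `center_window` is hypothetical and not part of `multimolecule`, and rejecting out-of-range windows (rather than padding them) is a design choice made here for simplicity.

```python
INPUT_LENGTH = 2114  # fixed input window, in bp


def center_window(sequence: str, center: int, length: int = INPUT_LENGTH) -> str:
    """Return `length` bases of `sequence` centered on the 0-based index `center`."""
    start = center - length // 2
    end = start + length
    if start < 0 or end > len(sequence):
        raise ValueError(
            f"window [{start}, {end}) falls outside a sequence of length {len(sequence)}"
        )
    return sequence[start:end]
```

The returned string can then be passed to the tokenizer in the same way as the example sequence in the Direct Use section.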
## Training Details
ProCapNet was trained to predict the base-resolution, two-stranded PRO-cap transcription-initiation signal in human cell lines.
### Training Data
The published ProCapNet models were trained on PRO-cap signal using ~2 kb genomic windows. The K562 model was trained on K562 PRO-cap experiment [ENCSR261KBX](https://www.encodeproject.org/experiments/ENCSR261KBX/). Training and test regions, observed signal tracks, and contribution scores are distributed through the same ENCODE release.
### Training Procedure
#### Pre-training
The model was trained with a composite loss: a (strand-merged) multinomial negative log-likelihood on the per-position, two-stranded profile shape plus a mean-squared-error regression on `log(count + 1)` total counts.
- Optimizer: Adam
- Training is mappability-aware
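As an illustration of the composite loss described above, the sketch below computes both terms for a single example in pure Python. It is not the training code. The relative weight of the two terms (`count_weight`) is a hypothetical parameter, the function names are made up for this example, and the mappability-aware handling is omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a flat list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def multinomial_nll(logits, counts):
    """Negative log-likelihood of observed counts under softmax(logits)."""
    log_probs = log_softmax(logits)
    total = sum(counts)
    # log of the multinomial coefficient: log(N!) - sum(log(k_i!))
    log_coeff = math.lgamma(total + 1) - sum(math.lgamma(k + 1) for k in counts)
    return -(log_coeff + sum(k * lp for k, lp in zip(counts, log_probs)))


def composite_loss(profile_logits, count_logit, observed, count_weight=1.0):
    """Strand-merged multinomial NLL plus MSE on log(count + 1).

    profile_logits / observed: lists of [plus, minus] values per position.
    count_weight: HYPOTHETICAL balance between the two terms.
    """
    # Merge strands by flattening positions and strands into one distribution.
    flat_logits = [x for position in profile_logits for x in position]
    flat_counts = [k for position in observed for k in position]

    profile_term = multinomial_nll(flat_logits, flat_counts)
    count_term = (count_logit - math.log(sum(flat_counts) + 1)) ** 2
    return profile_term + count_weight * count_term
```

Separating the two terms lets the profile head be scored purely on shape, independent of sequencing depth, while the count term handles overall signal strength.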
## Citation
```bibtex
@article{cochran2024procapnet,
author = {Cochran, Kelly and Yin, Melody and Mantripragada, Anika and Schreiber, Jacob and Marinov, Georgi K. and Shah, Sagar R. and Yu, Haiyuan and Lis, John T. and Kundaje, Anshul},
title = {Dissecting the cis-regulatory syntax of transcription initiation with deep learning},
journal = {bioRxiv},
year = 2024,
doi = {10.1101/2024.05.28.596138},
note = {Preprint}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [ProCapNet paper](https://doi.org/10.1101/2024.05.28.596138) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```