aparent / README.md
ZhiyuanChen's picture
Upload folder using huggingface_hub
2d2848b verified
metadata
library_name: multimolecule
license: agpl-3.0
pipeline: polyadenylation
pipeline_tag: other
tags:
  - Biology
  - RNA
  - 3' UTR
  - rna
widget:
  - example_title: microRNA 21
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: UAGCUUAUCAGACUGAUGUUGA
  - example_title: microRNA 146a
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: UGAGAACUGAAUUCCAUGGGUU
  - example_title: microRNA 155
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: UUAAUGCUAAUCGUGAUAGGGGUU
  - example_title: RNA component of mitochondrial RNA processing endoribonuclease
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: >-
      GGUUCGUGCUGAAGGCCUGUAUCCUAGGCUACACACUGAGGACUCUGUUCCUCCCCUUUCCGCCUAGGGGAAAGUCCCCGGACCUCGGGCAGAGAGUGCCACGUGCAUACGCACGUAGACAUUCCCCGCUUCCCACUCCAAAGUCCGCCAAGAAGCGUAUCCCGCUGAGCGGCGUGGCGCGGGGGCGUCAUCCGUCAGCUCCCUCUAGUUACGCAGGCAGUGCGUGUCCGCGCACCAACCACACGGGGCUCAUUCUCAGCGCGGCUGUAAAAAAAAA
  - example_title: 7SK small nuclear RNA
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: >-
      GGAUGUGAGGGCGAUCUGGCUGCGACAUCUGUCACCCCAUUGAUCGCCAGGGUUGAUUCGGCUGAUCUGGCUGGCUAGGCGGGUGUCCCCUUCCUCCCUCACCGCUCCAUGUGCGUCCCUCCCGAAGCUGCGCGCUCGGUCGAAGAGGACGACCAUCCCCGAUAGAGGAGGACCGGUCUUCGGUCAAGGGUAUACGAGUAGCUGCGCUCCCCUGCUAGAACCUCCAAACAAGCUCUCAAGGUCCAUUUGUAGGAGAACGUAGGGUAGUCAAGCUUCCAAGACUCCAGACACAUCCAAAUGAGGCGCUGCAUGUGGCAGUCUGCCUUUCUUUU
  - example_title: telomerase RNA component
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: >-
      GGGUUGCGGAGGGUGGGCCUGGGAGGGGUGGUGGCCAUUUUUUGUCUAACCCUAACUGAGAAGGGCGUAGGCGCCGUGCUUUUGCUCCCCGCGCGCUGUUUUUCUCGCUGACUUUCAGCGGGCGGAAAAGCCUCGGCCUGCCGCCUUCCACCGUUCAUUCUAGAGCAAACAAAAAAUGUCAGCUGCUGGCCCGUUCGCCCCUCCCGGGGACCUGCGGCGGGUCGCCUGCCCAGCCCCCGAACCCCGCCUGGAGGCCGCGGUCGGCCCGGGGCUUCUCCGGAGGCACCCACUGCCACCGCGAAGAGUUGGGCUCUGUCAGCCGCGGGUCUCUCGGGGGCGAGGGCGAGGUUCAGGCCUUUCAGGCCGCAGGAAGAGGAACGGAGCGAGUCCCCGCGCGCGGCGCGAUUCCCUGAGCUGUGGGACGUGCACCCAGGACUCGGCUCACACAUGC
  - example_title: vault RNA 2-1
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: >-
      CGGGUCGGAGUUAGCUCAAGCGGUUACCUCCUCAUGCCGGACUUUCUAUCUGUCCAUCUCUGUGCUGGGGUUCGAGACCCGCGGGUGCUUACUGACCCUUUUAUGCAA
  - example_title: brain cytoplasmic RNA 1
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: >-
      GGCCGGGCGCGGUGGCUCACGCCUGUAAUCCCAGCUCUCAGGGAGGCUAAGAGGCGGGAGGAUAGCUUGAGCCCAGGAGUUCGAGACCUGCCUGGGCAAUAUAGCGAGACCCCGUUCUCCAGAAAAAGGAAAAAAAAAAACAAAAGACAAAAAAAAAAUAAGCGUAACUUCCCUCAAAGCAACAACCCCCCCCCCCCUUU
  - example_title: HIV-1 TAR-WT
    pipeline_tag: polyadenylation
    sequence_type: ncRNA
    task: polyadenylation
    text: GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCUGGCUAACUAGGGAACC
  - example_title: prion protein (Kanno blood group)
    pipeline_tag: polyadenylation
    sequence_type: mRNA
    task: polyadenylation
    text: AUGGCGAACCUUGGCUGCUGGAUGCUGGUUCUCUUUGUGGCCACAUGGAGUGACCUGGGCCUCUGC
  - example_title: interleukin 10
    pipeline_tag: polyadenylation
    sequence_type: mRNA
    task: polyadenylation
    text: AUGCACAGCUCAGCACUGCUCUGUUGCCUGGUCCUCCUGACUGGGGUGAGGGCC
  - example_title: Zaire ebolavirus
    pipeline_tag: polyadenylation
    sequence_type: mRNA
    task: polyadenylation
    text: >-
      AAUGUUCAAACACUUUGUGAAGCUCUGUUAGCUGAUGGUCUUGCUAAAGCAUUUCCUAGCAAUAUGAUGGUAGUCACAGAGCGUGAGCAAAAAGAAAGCUUAUUGCAUCAAGCAUCAUGGCACCACACAAGUGAUGAUUUUGGUGAGCAUGCCACAGUUAGAGGGAGUAGCUUUGUAACUGAUUUAGAGAAAUACAAUCUUGCAUUUAGAUAUGAGUUUACAGCACCUUUUAUAGAAUAUUGUAACCGUUGCUAUGGUGUUAAGAAUGUUUUUAAUUGGAUGCAUUAUACAAUCCCACAGUGUUAU
  - example_title: SARS coronavirus
    pipeline_tag: polyadenylation
    sequence_type: mRNA
    task: polyadenylation
    text: >-
      AUGUUUAUUUUCUUAUUAUUUCUUACUCUCACUAGUGGUAGUGACCUUGACCGGUGCACCACUUUUGAUGAUGUUCAAGCUCCUAAUUACACUCAACAUACUUCAUCUAUGAGGGGGGUUUACUAUCCUGAUGAAAUUUUUAGAUCAGACACUCUUUAUUUAACUCAGGAUUUAUUUCUUCCAUUUUAUUCUAAUGUUACAGGGUUUCAUACUAUUAAUCAUACGUUUGACAACCCUGUCAUACCUUUUAAGGAUGGUAUUUAUUUUGCUGCCACAGAGAAAUCAAAUGUUGUCCGUGGUUGGGUUUUUGGUUCUACCAUGAACAACAAGUCACAGUCGGUGAUUAUUAUUAACAAUUCUACUAAUGUUGUUAUACGAGCAUGUAACUUUGAAUUGUGUGACAACCCUUUCUUUGCUGUUUCUAAACCCAUGGGUACACAGACACAUACUAUGAUAUUCGAUAAUGCAUUUAAAUGCACUUUCGAGUACAUAUCU
  - example_title: insulin
    pipeline_tag: polyadenylation
    sequence_type: mRNA
    task: polyadenylation
    text: >-
      AUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCUGCAGGUGGGGCAGGUGGAGCUGGGCGGGGGCCCUGGUGCAGGCAGCCUGCAGCCCUUGGCCCUGGAGGGGUCCCUGCAGAAGCGUGGCAUUGUGGAACAAUGCUGUACCAGCAUCUGCUCCCUCUACCAGCUGGAGAACUACUGCAACUAG
  - example_title: cyclin dependent kinase inhibitor 2A
    pipeline_tag: polyadenylation
    sequence_type: mRNA
    task: polyadenylation
    text: >-
      AUGGAGCCGGCGGCGGGGAGCAGCAUGGAGCCUUCGGCUGACUGGCUGGCCACGGCCGCGGCCCGGGGUCGGGUAGAGGAGGUGCGGGCGCUGCUGGAGGCGGGGGCGCUGCCCAACGCACCGAAUAGUUACGGUCGGAGGCCGAUCCAGGUCAUGAUGAUGGGCAGCGCCCGAGUGGCGGAGCUGCUGCUGCUCCACGGCGCGGAGCCCAACUGCGCCGACCCCGCCACUCUCACCCGACCCGUGCACGACGCUGCCCGGGAGGGCUUCCUGGACACGCUGGUGGUGCUGCACCGGGCCGGGGCGCGGCUGGACGUGCGCGAUGCCUGGGGCCGUCUGCCCGUGGACCUGGCUGAGGAGCUGGGCCAUCGCGAUGUCGCACGGUACCUGCGCGCGGCUGCGGGGGGCACCAGAGGCAGUAACCAUGCCCGCAUAGAUGCCGCGGAAGGUCCCUCAGACAUCCCCGAUUGA
  - example_title: human papillomavirus type 16 E6
    pipeline_tag: polyadenylation
    sequence_type: mRNA
    task: polyadenylation
    text: >-
      AUGCACCAAAAGAGAACUGCAAUGUUUCAGGACCCACAGGAGCGACCCAGAAAGUUACCACAGUUAUGCACAGAGCUGCAAACAACUAUACAUGAUAUAAUAUUAGAAUGUGUGUACUGCAAGCAACAGUUACUGCGACGUGAGGUAUAUGACUUUGCUUUUCGGGAUUUAUGCAUAGUAUAUAGAGAUGGGAAUCCAUAUGCUGUAUGUGAUAAAUGUUUAAAGUUUUAUUCUAAAAUUAGUGAGUAUAGACAUUAUUGUUAUAGUUUGUAUGGAACAACAUUAGAACAGCAAUACAACAAACCGUUGUGUGAUUUGUUAAUUAGGUGUAUUAACUGUCAAAAGCCACUGUGUCCUGAAGAAAAGCAAAGACAUCUGGACAAAAAGCAAAGAUUCCAUAAUAUAAGGGGUCGGUGGACCGGUCGAUGUAUGUCUUGUUGCAGAUCAUCAAGAACACGUAGAGAAACCCAGCUGUAA
  - example_title: NRAS proto-oncogene
    pipeline_tag: polyadenylation
    sequence_type: 5' UTR
    task: polyadenylation
    text: >-
      GGGGCCGGAAGUGCCGCUCCUUGGUGGGGGCUGUUCAUGGCGGUUCCGGGGUCUCCAACAUUUUUCCCGGCUGUGGUCCUAAAUCUGUCCAAAGCAGAGGCAGUGGAGCUUGAGGUUCUUGCUGGUGUGAA
  - example_title: amyloid beta precursor protein
    pipeline_tag: polyadenylation
    sequence_type: 5' UTR
    task: polyadenylation
    text: >-
      GUCAGUUUCCUCGGCAGCGGUAGGCGAGAGCACGCGGAGGAGCGUGCGCGGGGGCCCCGGGAGACGGCGGCGGUGGCGGCGCGGGCAGAGCAAGGACGCGGCGGAUCCCACUCGCACAGCAGCGCACUCGGUGCCCCGCGCAGGGUCGCG
  - example_title: RUNX family transcription factor 1
    pipeline_tag: polyadenylation
    sequence_type: 5' UTR
    task: polyadenylation
    text: >-
      ACUUCUUUGGGCCUCAUAAACAACCACAGAACCACAAGUUGGGUAGCCUGGCAGUGUCAGAAGUCUGAACCCAGCAUAGUGGUCAGCAGGCAGGACGAAUCACACUGAAUGCAAACCACAGGGUUUCGCAGCGUGGUAAAAGAAAUCAUUGAGUCCCCCGCCUUCAGAAGAGGGUGCAUUUUCAGGAGGAAGCG
  - example_title: fragile X messenger ribonucleoprotein 1
    pipeline_tag: polyadenylation
    sequence_type: 5' UTR
    task: polyadenylation
    text: >-
      CUCAGUCAGGCGCUCAGCUCCGUUUCGGUUUCACUUCCGGUGGAGGGCCGCCUCUGAGCGGGCGGCGGGCCGACGGCGAGCGCGGGCGGCGGCGGUGACGGAGGCGCCGCUGCCAGGGGGCGUGCGGCAGCGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGAGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCUGGGCCUCGAGCGCCCGCAGCCCACCUCUCGGGGGCGGGCUCCCGGCGCUAGCAGGGCUGAAGAGAAG
  - example_title: MYC proto-oncogene
    pipeline_tag: polyadenylation
    sequence_type: 5' UTR
    task: polyadenylation
    text: >-
      AACUCGCUGUAGUAAUUCCAGCGAGAGGCAGAGGGAGCGAGCGGGCGGCCGGCUAGGGUGGAAGAGCCGGGCGAGCAGAGCUGCGCUGCGGGCGUCCUGGGAAGGGAGAUCCGGAGCGAAUAGGGGGCUUCGCCUCUGGCCCAGCCCUCCCGCUGAUCCCCCAGCCAGCGGUCCGCAACCCUUGCCGCAUCCACGAAACUUUGCCCAUAGCAGCGGGCGGGCACUUUGCACUGGAACUUACAACACCCGAGCAAGGACGCGACUCUCCCGACGCGGGGAGGCUAUUCUGCCCAUUUGGGGACACUUCCCCGCCGCUGCCAGGACCCGCUUCUCUGAAAGGCUCUCCUUGCAGCUGCUUAGACG
  - example_title: activating transcription factor 4
    pipeline_tag: polyadenylation
    sequence_type: 5' UTR
    task: polyadenylation
    text: >-
      CAUUUCUACUUUGCCCGCCCACAGAUGUAGUUUUCUCUGCGCGUGUGCGUUUUCCCUCCUCCCCGCCCUCAGGGUCCACGGCCACCAUGGCGUAUUAGGGGCAGCAGUGCCUGCGGCAGCAUUGGCCUUUGCAGCGGCGGCAGCAGCACCAGGCUCUGCAGCGGCAACCCCCAGCGGCUUAAGCCAUGGCGCUUCUCACGGCAUUCAGCAGCAGCGUUGCUGUAACCGACAAAGACACCUUCGAAUUAAGCACAUUCCUCGAUUCCAGCAAAGCACCGCAAC
  - example_title: Human GPI protein p137
    pipeline_tag: polyadenylation
    sequence_type: 3' UTR
    task: polyadenylation
    text: >-
      UUUUUAAAAGGAAAAGAUACCAAAUGCCUGCUGCUACCACCCUUUUCAAUUGCUAUGUUUUGAAAGGCACCAGUAUGUGUUUUAGAUUGAUUUAAAUGUUUCAUUUAAAUCACGGACAGUAGUUUCAGUUCUGAUGGUAUAAGCAAAACAAAUAAAACGUUUAUAAAAGUUGUAUCUUGAAACACUGGUGUUCAACAGCUAGCAGCUUAUGUGAUUCACCCCAUGCCACGUUAGUGUCACAAAUUUUAUGGUUUAUCUCCAGCAACAUUUCUCUAGUACUUGCACUUAUUAUCUGAAUUC
  - example_title: nucleophosmin 1
    pipeline_tag: polyadenylation
    sequence_type: 3' UTR
    task: polyadenylation
    text: >-
      GAAAAUAGUUUAAACAAUUUGUUAAAAAAUUUUCCGUCUUAUUUCAUUUCUGUAACAGUUGAUAUCUGGCUGUCCUUUUUAUAAUGCAGAGUGAGAACUUUCCCUACCGUGUUUGAUAAAUGUUGUCCAGGUUCUAUUGCCAAGAAUGUGUUGUCCAAAAUGCCUGUUUAGUUUUUAAAGAUGGAACUCCACCCUUUGCUUGGUUUUAAGUAUGUAUGGAAUGUUAUGAUAGGACAUAGUAGUAGCGGUGGUCAGACAUGGAAAUGGUGGGGAGACAAAAAUAUACAUGUGAAAUAAAACUCAGUAUUUUAAUAAAGUAGCACGGUUUCUAUUGA
  - example_title: superoxide dismutase 1
    pipeline_tag: polyadenylation
    sequence_type: 3' UTR
    task: polyadenylation
    text: >-
      ACAUUCCCUUGGAUGUAGUCUGAGGCCCCUUAACUCAUCUGUUAUCCUGCUAGCUGUAGAAAUGUAUCCUGAUAAACAUUAAACACUGUAAUCUUAAAAGUGUAAUUGUGUGACUUUUUCAGAGUUGCUUUAAAGUACCUGUAGUGAGAAACUGAUUUAUGAUCACUUGGAAGAUUUGUAUAGUUUUAUAAAACUCAGUUAAAAUGUCUGUUUCAAUGACCUGUAUUUUGCCAGACUUAAAUCACAGAUGGGUAUUAAACUUGUCAGAAUUUCUUUGUCAUUCAAGCCUGUGAAUAAAAACCCUGUAUGGCACUUAUUAUGAGGCUAUUAAAAGAAUCCAAAUUCAAACUAAA
  - example_title: hemoglobin subunit alpha 2
    pipeline_tag: polyadenylation
    sequence_type: 3' UTR
    task: polyadenylation
    text: >-
      CUGGAGCCUCGGUAGCCGUUCCUCCUGCCCGCUGGGCCUCCCAACGGGCCCUCCUCCCCUCCUUGCACCGGCCCUUCCUGGUCUUUGAAUAAAGUCUGAGUGGGCAGCA
  - example_title: BRAF proto-oncogene
    pipeline_tag: polyadenylation
    sequence_type: 3' UTR
    task: polyadenylation
    text: >-
      AACAAAUGAGUGAGAGAGUUCAGGAGAGUAGCAACAAAAGGAAAAUAAAUGAACAUAUGUUUGCUUAUAUGUUAAAUUGAAUAAAAUACUCUCUUUUUUUUUAAGGUGAACCAAAGAACACUUGUGUGGUUAAAGACUAGAUAUAAUUUUUCCCCAAACUAAAAUUUAUACUUAACAUUGGAUUUUUAACAUCCAAGGGUUAAAAUACAUAGACAUUGCUAAAAAUUGGCAGAGCCUCUUCUAGAGGCUUUACUUUCUGUUCCGGGUUUGUAUCAUUCACUUGGUUAUUUUAAGUAGUAAACUUCAGUUUCUCAUGCAACUUUUGUUGCCAGCUAUCACAUGUCCACUAGGGACUCCAGAAGAAGACCCUACCUAUGCCUGUGUUUGCAGGUGAGAAGUUGGCAGUCGGUUAGCCUGGG
  - example_title: H3 clustered histone 1
    pipeline_tag: polyadenylation
    sequence_type: 3' UTR
    task: polyadenylation
    text: UUACUGUGGUCUCUCUGACGGUCCAAGCAAAGGCUCUUUUCAGAGCCACCACCUUUUC

APARENT

Convolutional neural network for predicting human 3'UTR Alternative Polyadenylation (APA) from sequence.

Disclaimer

This is an UNOFFICIAL implementation of A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation by Nicholas Bogard, Johannes Linder, et al.

The OFFICIAL repository of APARENT is at johli/aparent.

The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

The team releasing APARENT did not write this model card for this model so this model card has been written by the MultiMolecule team.

Model Details

APARENT (APA REgression NeT) is a convolutional neural network trained on more than 3.5 million randomized 3'UTR poly-A signals expressed on mini-gene reporters in HEK293. Given a fixed-length 205 nt 3'UTR/polyA sequence, APARENT predicts the alternative-polyadenylation isoform proportion (a scalar) and a positional cleavage distribution. The model is primarily used to score the impact of genetic variants on APA regulation and to engineer new polyadenylation signals. Please refer to the Training Details section for more information on the training process.

The base, non-normalised APARENT model is recommended by the original authors for isoform and variant-effect prediction.

Model Specification

Num Layers Hidden Size Num Parameters (M) FLOPs (G) MACs (G) Max Num Tokens
4 256 6.43 0.03 0.01 205

Links

Usage

The model file depends on the multimolecule library. You can install it using pip:

pip install multimolecule

Direct Use

APA Isoform Prediction

You can use this model directly to predict the APA isoform proportion of a 3'UTR/polyA sequence:

>>> from multimolecule import RnaTokenizer, AparentForSequencePrediction

>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/aparent")
>>> model = AparentForSequencePrediction.from_pretrained("multimolecule/aparent")
>>> output = model(**tokenizer("ACGUACGUACGU", return_tensors="pt"))

>>> output.keys()
odict_keys(['logits'])

The full APARENT isoform and cleavage outputs are available on the backbone:

>>> from multimolecule import RnaTokenizer, AparentModel

>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/aparent")
>>> model = AparentModel.from_pretrained("multimolecule/aparent")
>>> output = model(**tokenizer("ACGUACGUACGU", return_tensors="pt"))

>>> output.keys()
odict_keys(['pooler_output', 'isoform_logits', 'cleavage_logits'])

Interface

  • Input length: fixed 205 nt 3'UTR / polyA sequence
  • Output (AparentModel): isoform_logits (scalar APA proportion) + cleavage_logits (206-dim positional cleavage distribution)
  • Output (AparentForSequencePrediction): APA isoform scalar only (logits)

Training Details

APARENT was trained to jointly predict the APA isoform proportion and the positional cleavage distribution of randomized 3'UTR poly-A signals.

Training Data

APARENT was trained on more than 3.5 million randomized 3'UTR poly-A signal sequences expressed on mini-gene reporters in HEK293 cells (a massively parallel reporter assay, MPRA). The raw sequencing data for the 3'UTR MPRA libraries are available at GEO accession GSE113849.

This APARENT model was trained on all MPRA libraries (no libraries held out) to produce the best general-purpose APA predictor; it differs from the per-library held-out model evaluated in the paper.

Training Procedure

Pre-training

The model was trained to minimize a combined objective: a sigmoid KL-divergence on the isoform proportion and a KL-divergence on the positional cleavage distribution, weighted equally.

Citation

@article{bogard2019adeep,
  author    = {Bogard, Nicholas and Linder, Johannes and Rosenberg, Alexander B. and Seelig, Georg},
  title     = {A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation},
  journal   = {Cell},
  volume    = {178},
  number    = {1},
  pages     = {91--106.e23},
  year      = {2019},
  publisher = {Elsevier BV},
  doi       = {10.1016/j.cell.2019.04.046}
}

The artifacts distributed in this repository are part of the MultiMolecule project. If MultiMolecule supports your research, please cite the MultiMolecule project as follows:

@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the APARENT paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

SPDX-License-Identifier: AGPL-3.0-or-later