| library_name: transformers | |
| license: mit | |
| datasets: | |
| - chandar-lab/UR100P | |
| language: | |
| - en | |
| tags: | |
| - biology | |
| > [!NOTE] | |
| > This model has been optimized using NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine) | |
| > library. Slight numerical differences may be observed between the original model and the optimized | |
| > model. For instructions on how to install TransformerEngine, please refer to the | |
| > [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation). | |
| # AMPLIFY (TransformerEngine-Optimized) Overview | |
| ## Description: | |
| AMPLIFY is an efficient, state-of-the-art protein language model (pLM). AMPLIFY can generate residue and protein | |
| embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available in | |
| two sizes, 120M and 350M parameters. | |
| This version of the AMPLIFY model is optimized with NVIDIA's | |
| [TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from | |
| Chandar Research Lab (CRL), and (within numerical precision) has identical weights and outputs. | |
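One way to check the "within numerical precision" claim is to run the same input through both checkpoints and compare the outputs elementwise. The sketch below uses plain Python lists as stand-ins for model outputs; the helper names `max_abs_diff` and `outputs_match` and the tolerance values are assumptions chosen for illustration, not figures published by NVIDIA or CRL.

```python
import math


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(a, b, rel_tol=1e-3, abs_tol=1e-5):
    """True if every pair of values agrees within the given tolerances.

    The default tolerances are illustrative assumptions; choose values
    suited to the dtype in use (bf16 needs looser bounds than fp32).
    """
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


# Toy stand-ins for values produced by the original and the optimized model.
original = [0.1200, -1.5000, 3.2500]
optimized = [0.1201, -1.5003, 3.2498]

print(max_abs_diff(original, optimized))
print(outputs_match(original, optimized))  # True
```

With real PyTorch tensors, `torch.allclose` performs the same kind of tolerance-based comparison.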
| This model is ready for commercial/non-commercial use. | |
| ## Third-Party Community Consideration | |
| This model is not owned or developed by NVIDIA. It was developed and built to a third party's requirements for this | |
| application and use case; see the non-NVIDIA | |
| [AMPLIFY Model Card](https://huggingface.co/chandar-lab/AMPLIFY_120M). | |
| ### License/Terms of Use: | |
| AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE). | |
| ### Deployment Geography: | |
| Global | |
| ### Use Case: | |
| Protein design, mutation prediction, and function analysis. | |
| ### Release Date: | |
| Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) | |
| ## References: | |
| - [Protein Language Models: Is Scaling Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - | |
| detailed information on the model architecture and training data. | |
| ## Model Architecture: | |
| **Architecture Type:** Transformer | |
| **Network Architecture:** ESM-2 | |
| **This model was developed based on:** [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_120M) <br> | |
| **Number of model parameters:** 1.2 x 10^8 | |
| ## Input: | |
| **Input Type:** Text (Protein Sequences) <br> | |
| **Input Format:** String <br> | |
| **Input Parameters:** One-Dimensional (1D) <br> | |
| **Other Properties Related to Input:** Protein sequence represented as a string of canonical amino acids. The maximum | |
| context length is 2048 residues. | |
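The input constraints above can be checked before tokenization. The following is a minimal sketch, assuming the 20 standard one-letter amino acid codes and treating the 2048 limit as a count of residues; whether special tokens count toward that limit is not stated here, and `validate_sequence` is a hypothetical helper, not part of the model's API.

```python
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MAX_CONTEXT_LENGTH = 2048  # residues, as stated in the model card


def validate_sequence(sequence, max_length=MAX_CONTEXT_LENGTH):
    """Return the upper-cased sequence, or raise ValueError if it is invalid."""
    # Upper-casing and whitespace stripping are conveniences assumed here.
    cleaned = sequence.strip().upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    invalid = sorted(set(cleaned) - CANONICAL_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"non-canonical residues: {''.join(invalid)}")
    if len(cleaned) > max_length:
        raise ValueError(
            f"sequence has {len(cleaned)} residues; maximum is {max_length}"
        )
    return cleaned


print(validate_sequence("mkvlaagivg"))  # MKVLAAGIVG
```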
| ## Output: | |
| **Output Type:** Embeddings (Amino acid and sequence-level) <br> | |
| **Output Format:** Numeric vector <br> | |
| **Output Parameters:** One-Dimensional (1D) <br> | |
| **Other Properties Related to Output:** Numeric vector with floating-point values corresponding to an embedding for each | |
| amino acid in the input protein sequence. | |
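A sequence-level embedding can be derived from the per-residue vectors by pooling. The sketch below uses mean pooling over plain Python lists; mean pooling is one common choice and is an assumption here, since the card does not say which pooling the authors use. It also assumes any special-token positions have already been removed.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one sequence-level vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must have the same dimension")
    count = len(residue_embeddings)
    return [sum(vec[d] for vec in residue_embeddings) / count for d in range(dim)]


# Toy example: 3 residues, each with a 4-dimensional embedding.
per_residue = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 2.0, 0.0],
    [2.0, 4.0, 2.0, 2.0],
]
print(mean_pool(per_residue))  # [2.0, 2.0, 2.0, 2.0]
```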
| Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware | |
| (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times | |
| compared to CPU-only solutions. | |
| ## Software Integration: | |
| **Runtime Engines:** | |
| - Hugging Face Transformers | |
| **Supported Hardware Microarchitecture Compatibility:** | |
| - NVIDIA Ampere | |
| - NVIDIA Blackwell | |
| - NVIDIA Hopper | |
| **Preferred Operating System(s):** | |
| - Linux | |
| The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific | |
| data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at | |
| both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure | |
| compliance with safety and ethical standards before deployment. | |
| ## Model and checkpoint versions are noted below: | |
| - [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M) <br> | |
| - [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) <br> | |
| **Get Started** | |
```python
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Load AMPLIFY and its tokenizer
model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/AMPLIFY_120M", trust_remote_code=True
)

# Move the model to GPU (required due to Flash Attention)
model = model.to("cuda")

# Load the UniProt validation set
dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

for sample in dataset:
    # Protein
    print("Sample: ", sample["name"], sample["sequence"])

    # Tokenize the protein (named input_ids to avoid shadowing the built-in input)
    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
    print("Input: ", input_ids)

    # Move to the GPU and make a prediction
    input_ids = input_ids.to("cuda")
    output = model(input_ids)
    print("Output: ", output)

    # Only process the first sample
    break
```
| ## Training and Evaluation Datasets: | |
| ## Training Datasets: | |
| **Link:** [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0) | |
| **Data Modality:** | |
| - Text (Protein Sequences) | |
| **Text Training Data Size:** | |
| - 1 Billion to 10 Trillion Tokens | |
| **Data Collection Method:** | |
| - Human | |
| **Labeling Method:** | |
| - N/A | |
| **Properties (Quantity, Dataset Descriptions, Sensor(s)):** UniRef100 contains all records in the UniProt Knowledgebase | |
| and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using | |
| the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the | |
| longest sequence is not always the most informative. There is often more biologically relevant information and | |
| annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are | |
| ranked to facilitate the selection of a biologically relevant representative for the cluster. | |
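The grouping rule described above (identical sequences and subfragments share one cluster) can be illustrated with a naive substring check. This is a toy sketch only: it is not CD-HIT and not the UniProt pipeline, the function name is hypothetical, and real pipelines apply additional criteria that are ignored here.

```python
def cluster_identical_and_fragments(sequences):
    """Group each sequence under the first (longest) representative containing it.

    Returns a dict mapping representative -> list of member sequences.
    Exact duplicates are collapsed before clustering.
    """
    clusters = {}
    # Visit longer sequences first so they become the representatives.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        for representative in clusters:
            if seq in representative:  # identical or a contiguous subfragment
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


toy = ["MKVLAAGIVG", "VLAAG", "MKVLAAGIVG", "QQQ"]
print(cluster_identical_and_fragments(toy))
# {'MKVLAAGIVG': ['MKVLAAGIVG', 'VLAAG'], 'QQQ': ['QQQ']}
```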
| **Link:** [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/) | |
| **Data Modality:** | |
| - Text (Protein Sequences) | |
| **Text Training Data Size:** | |
| - 1 Billion to 10 Trillion Tokens | |
| **Data Collection Method:** | |
| - Human | |
| **Labeling Method:** | |
| - Human | |
| **Properties:** The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for | |
| use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These | |
| repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals. | |
| **Link:** [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download) | |
| **Data Modality:** | |
| - Text (Protein Sequences) | |
| **Text Training Data Size:** | |
| - 1 Billion to 10 Trillion Tokens | |
| **Data Collection Method:** | |
| - Hybrid: Human, Automated | |
| **Labeling Method:** | |
| - Hybrid: Human, Automated | |
| **Properties:** The main levels of classification in SCOP are: | |
| - Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and | |
| alpha+beta. | |
| - Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same | |
| topological connections. | |
| - Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and | |
| functional features, even if sequence similarity is low. | |
| - Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through | |
| sequence comparison methods. | |
| - Species: Represents a distinct protein sequence. | |
| - Protein: Groups similar sequences with the same function. | |
| ## Evaluation Datasets: | |
| **Link:** [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/) | |
| **Benchmark Score:** LR P@L of 17.8±14.1 | |
| **Data Collection Method:** | |
| - Human | |
| **Labeling Method:** | |
| - N/A | |
| **Properties:** The data is collected by taking sequences of protein structures that are about to be released weekly by | |
| the Protein Data Bank (PDB). These sequences are sent as "blind targets" to participating protein structure prediction | |
| servers, which then return their predictions. | |
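The benchmark score is reported as LR P@L, which is commonly read as long-range precision at L: among the L highest-scoring predicted residue pairs with a large sequence separation (L being the sequence length), the fraction that are true contacts. The sketch below follows that common definition; the separation threshold of 24 and the choice to divide by the number of pairs actually selected are assumptions, since the card does not define the metric.

```python
def long_range_precision_at_l(scores, contacts, seq_len, min_separation=24):
    """Precision of the top-L predicted long-range contacts.

    scores:   dict mapping (i, j) with i < j to a predicted contact score
    contacts: set of (i, j) pairs with i < j that are true contacts
    seq_len:  sequence length L, used as the number of pairs to keep
    """
    candidates = [
        (score, pair)
        for pair, score in scores.items()
        if pair[1] - pair[0] >= min_separation
    ]
    if not candidates:
        raise ValueError("no residue pairs meet the separation threshold")
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = candidates[:seq_len]
    hits = sum(1 for _, pair in top if pair in contacts)
    return hits / len(top)


# Toy example: 5 residues, with the threshold lowered to 2 so pairs qualify.
scores = {
    (0, 1): 0.95,  # too close in sequence, ignored
    (0, 2): 0.90,
    (0, 3): 0.20,
    (0, 4): 0.80,
    (1, 3): 0.70,
    (1, 4): 0.60,
    (2, 4): 0.50,
}
contacts = {(0, 1), (0, 2), (1, 3), (2, 4)}
precision = long_range_precision_at_l(scores, contacts, seq_len=5, min_separation=2)
print(precision)  # 0.6
```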
| **Link:** [CASP14 (Critical Assessment of Methods of Protein Structure Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/) | |
| **Benchmark Score:** LR P@L of 12.4±11.3 | |
| **Data Collection Method:** | |
| - Human | |
| **Labeling Method:** | |
| - N/A | |
| **Properties:** The data for CASP14 targets is collected from protein structures that are newly solved by experimental | |
| structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full, | |
| three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to | |
| participating research groups and servers, who must submit their predicted structures within a specific time frame. | |
| **Link:** [CASP15 (Critical Assessment of Methods of Protein Structure Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/) | |
| **Benchmark Score:** LR P@L of 16.9±13.2 | |
| **Data Collection Method:** | |
| - Human | |
| **Labeling Method:** | |
| - N/A | |
| **Properties:** The data for CASP15 targets is collected from protein structures that are newly solved by experimental | |
| structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full, | |
| three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to | |
| participating research groups and servers, who must submit their predicted structures within a specific time frame. | |
| ## Inference: | |
| **Acceleration Engine:** | |
| - Hugging Face Transformers | |
| **Test Hardware:** | |
| - A100 | |
| - H100 | |
| - H200 | |
| - GB200 | |
| ## Ethical Considerations: | |
| NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable | |
| development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, | |
| developers should work with their internal model team to ensure this model meets requirements for the relevant industry | |
| and use case and addresses unforeseen product misuse. | |
| Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and | |
| comply with applicable safety regulations and ethical standards. | |
| Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns | |
| [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). | |