	---
	library_name: transformers
	license: mit
	datasets:
	- chandar-lab/UR100P
	language:
	- en
	tags:
	- biology
	---

	> [!NOTE]
	> This model has been optimized using NVIDIA's [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
	> library. Slight numerical differences may be observed between the original model and the optimized
	> model. For instructions on how to install TransformerEngine, please refer to the
	> [official documentation](https://github.com/NVIDIA/TransformerEngine?tab=readme-ov-file#installation).

	# AMPLIFY (TransformerEngine-Optimized) Overview

	## Description:

	AMPLIFY is an efficient, state-of-the-art protein language model (pLM). AMPLIFY can generate residue and protein
	embeddings, suggest mutations, and differentiate disordered proteins from non-protein sequences. AMPLIFY is available
	in two sizes, 120M and 350M parameters.

	This version of the AMPLIFY model is optimized with NVIDIA's
	[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) library. It is based on the original AMPLIFY model from
	Chandar Research Lab (CRL), and (within numerical precision) has identical weights and outputs.
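
As a rough illustration of what agreement "within numerical precision" can look like in practice, the sketch below compares two embedding vectors element-wise. The tolerance values and the sample vectors are assumptions chosen for illustration; they are not thresholds published for AMPLIFY or TransformerEngine.

```python
import math


def max_abs_diff(reference, optimized):
    """Return the largest element-wise absolute difference between two vectors."""
    if len(reference) != len(optimized):
        raise ValueError("vectors must have the same length")
    return max((abs(r - o) for r, o in zip(reference, optimized)), default=0.0)


def outputs_match(reference, optimized, rel_tol=1e-3, abs_tol=1e-3):
    """Check that every pair of values agrees within the (hypothetical) tolerances."""
    if len(reference) != len(optimized):
        return False
    return all(
        math.isclose(r, o, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, o in zip(reference, optimized)
    )


# Made-up values standing in for one embedding from each model.
reference_embedding = [0.1250, -0.5000, 2.0000]
optimized_embedding = [0.1251, -0.4998, 2.0003]

print(max_abs_diff(reference_embedding, optimized_embedding))
print(outputs_match(reference_embedding, optimized_embedding))
```

With real model outputs, the same comparison could be run on tensors converted to lists, with tolerances loosened or tightened to suit the precision (for example bf16 versus fp32) in use.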

	This model is ready for commercial/non-commercial use.

	## Third-Party Community Consideration

	This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements
	for this application and use case; see the non-NVIDIA [AMPLIFY Model
	Card](https://huggingface.co/chandar-lab/AMPLIFY_120M).

	### License/Terms of Use:

	AMPLIFY is provided under the [MIT license](https://github.com/chandar-lab/AMPLIFY/blob/main/LICENSE).

	### Deployment Geography:

	Global

	### Use Case:

	Protein design, mutation prediction, and function analysis.

	### Release Date:

	Hugging Face 06/12/2025 via [https://huggingface.co/nvidia/AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M)

	## References:

	- [Protein Language Models: Is Scaling
	Necessary?](https://www.biorxiv.org/content/biorxiv/early/2024/09/23/2024.09.23.614603.full.pdf) - detailed
	information on the model architecture and training data.

	## Model Architecture:

	Architecture Type: Transformer <br>
	Network Architecture: ESM-2

	This model was developed based on: [AMPLIFY](https://huggingface.co/chandar-lab/AMPLIFY_120M) <br>
	Number of model parameters: 1.2 x 10^8

	## Input:

	Input Type: Text (Protein Sequences) <br>
	Input Format: String <br>
	Input Parameters: One-Dimensional (1D) <br>
	Other Properties Related to Input: Protein sequence represented as a string of canonical amino acids. The maximum
	context length is 2048 residues.
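
A minimal pre-check for these input constraints might look like the sketch below. The helper name `validate_sequence` is hypothetical and is not part of the AMPLIFY tokenizer. The sketch assumes the 20 standard one-letter amino-acid codes in upper case, and it counts residues only, since this card does not say whether special tokens count toward the 2048 limit.

```python
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MAX_RESIDUES = 2048  # maximum context length stated in this model card


def validate_sequence(sequence, max_residues=MAX_RESIDUES):
    """Return a list of problems found in `sequence`; an empty list means it passed."""
    problems = []
    if not sequence:
        problems.append("sequence is empty")
    # Any character outside the 20 canonical one-letter codes is reported.
    invalid = sorted(set(sequence) - CANONICAL_AMINO_ACIDS)
    if invalid:
        problems.append("non-canonical characters: " + ", ".join(invalid))
    if len(sequence) > max_residues:
        problems.append(f"length {len(sequence)} exceeds {max_residues} residues")
    return problems


print(validate_sequence("MKTAYIAKQR"))  # []
print(validate_sequence("MKTXB"))       # ['non-canonical characters: B, X']
```

Lower-case letters are treated as invalid here; callers that accept mixed case could apply `sequence.upper()` before validating.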

	## Output:

	Output Type: Embeddings (Amino acid and sequence-level) <br>
	Output Format: Numeric vector <br>
	Output Parameters: One-Dimensional (1D) <br>
	Other Properties Related to Output: Numeric vector with floating-point values corresponding to an embedding for each
	amino acid in the input protein sequence.
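
One common way to turn per-residue embeddings into a single sequence-level vector is mean pooling, sketched below in plain Python. This is an assumption for illustration: the card does not state which pooling AMPLIFY uses for sequence-level embeddings, and the 2-dimensional vectors are made up.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one sequence-level vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must have the same dimension")
    count = len(residue_embeddings)
    # Average each coordinate across all residues.
    return [sum(vec[i] for vec in residue_embeddings) / count for i in range(dim)]


# Three residues with made-up 2-dimensional embeddings.
residue_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sequence_embedding = mean_pool(residue_embeddings)
print(sequence_embedding)  # [3.0, 4.0]
```

In practice, embeddings for any special tokens added by the tokenizer would usually be excluded before pooling.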

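	Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware
	(e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times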
	compared to CPU-only solutions.

	## Software Integration:

	Runtime Engines:

	- Hugging Face Transformers

	Supported Hardware Microarchitecture Compatibility:

	- NVIDIA Ampere
	- NVIDIA Blackwell
	- NVIDIA Hopper

	Preferred Operating System(s):

	- Linux

	The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific
	data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at
	both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure
	compliance with safety and ethical standards before deployment.

	## Model and checkpoint versions are noted below:

	- [AMPLIFY_350M](https://huggingface.co/nvidia/AMPLIFY_350M) <br>
	- [AMPLIFY_120M](https://huggingface.co/nvidia/AMPLIFY_120M) <br>

	## Get Started

	```python
	from transformers import AutoModel
	from transformers import AutoTokenizer
	from datasets import load_dataset

	# Load AMPLIFY and tokenizer
	model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained(
	    "nvidia/AMPLIFY_120M", trust_remote_code=True
	)

	# Move the model to GPU (required due to Flash Attention)
	model = model.to("cuda")

	# Load the UniProt validation set
	dataset = load_dataset("chandar-lab/UR100P", data_dir="UniProt", split="test")

	for sample in dataset:
	    # Protein
	    print("Sample: ", sample["name"], sample["sequence"])

	    # Tokenize the protein
	    input_ids = tokenizer.encode(sample["sequence"], return_tensors="pt")
	    print("Input: ", input_ids)

	    # Move to the GPU and make a prediction
	    input_ids = input_ids.to("cuda")
	    output = model(input_ids)
	    print("Output: ", output)

	    # Stop after the first sample
	    break
	```

	## Training and Evaluation Datasets:

	## Training Datasets:

	Link: [UniRef100](https://www.uniprot.org/uniref?query=identity%3A1.0)

	Data Modality:

	- Text (Protein Sequences)

	Text Training Data Size:

	- 1 Billion to 10 Trillion Tokens

	Data Collection Method:

	- Human

	Labeling Method:

	- N/A

	Properties (Quantity, Dataset Descriptions, Sensor(s)): UniRef100 contains all records in the UniProt Knowledgebase
	and selected UniParc records. In UniRef100, identical sequences and subfragments are placed into a single cluster using
	the CD-HIT algorithm. The longest members of the cluster (seed sequences) are used to generate UniRef90. However, the
	longest sequence is not always the most informative. There is often more biologically relevant information and
	annotation (name, function, cross-references) available on other cluster members. All the proteins in each cluster are
	ranked to facilitate the selection of a biologically relevant representative for the cluster.

	Link: [Observed Antibody Space (OAS)](https://opig.stats.ox.ac.uk/webapps/oas/downloads_paired/)

	Data Modality:

	- Text (Protein Sequences)

	Text Training Data Size:

	- 1 Billion to 10 Trillion Tokens

	Data Collection Method:

	- Human

	Labeling Method:

	- Human

	Properties: The Observed Antibody Space (OAS) database is a project to collect and annotate immune repertoires for
	use in large-scale analysis. It currently contains over one billion sequences, from over 80 different studies. These
	repertoires cover diverse immune states, organisms (primarily human and mouse), and individuals.

	Link: [Structural Classification of Proteins (SCOP)](https://www.ebi.ac.uk/pdbe/scop/download)

	Data Modality:

	- Text (Protein Sequences)

	Text Training Data Size:

	- 1 Billion to 10 Trillion Tokens

	Data Collection Method:

	- Hybrid: Human, Automated

	Labeling Method:

	- Hybrid: Human, Automated

	Properties: The main levels of classification in SCOP are:

	- Class: Groups proteins based on their secondary structure content, such as all-alpha, all-beta, alpha/beta, and
	alpha+beta.
	- Fold: Proteins within the same fold have the same major secondary structures arranged in the same way with the same
	topological connections.
	- Superfamily: Groups protein domains with a probable common evolutionary ancestry based on shared structural and
	functional features, even if sequence similarity is low.
	- Family: Groups closely related proteins with clear evidence of a common evolutionary origin, often detectable through
	sequence comparison methods.
	- Protein: Groups similar sequences with the same function.
	- Species: Represents a distinct protein sequence.

	## Evaluation Datasets:

	Link: [Continuous Automated Model EvaluatiOn (CAMEO)](https://pmc.ncbi.nlm.nih.gov/articles/PMC8673552/)

	Benchmark Score: LR P@L of 17.8±14.1

	Data Collection Method:

	- Human

	Labeling Method:

	- N/A

	Properties: Each week, the sequences of protein structures that the Protein Data Bank (PDB) is about to release are
	collected. These sequences are sent as "blind targets" to participating protein structure prediction servers, which
	then return their predictions.

	Link: [CASP14 (Critical Assessment of Methods of Protein Structure
	Prediction)](https://pubmed.ncbi.nlm.nih.gov/34533838/)

	Benchmark Score: LR P@L of 12.4±11.3

	Data Collection Method:

	- Human

	Labeling Method:

	- N/A

	Properties: The data for CASP14 targets is collected from protein structures that are newly solved by experimental
	structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
	three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
	participating research groups and servers, who must submit their predicted structures within a specific time frame.

	Link: [CASP15 (Critical Assessment of Methods of Protein Structure
	Prediction)](https://pubmed.ncbi.nlm.nih.gov/37920879/)

	Benchmark Score: LR P@L of 16.9±13.2

	Data Collection Method:

	- Human

	Labeling Method:

	- N/A

	Properties: The data for CASP15 targets is collected from protein structures that are newly solved by experimental
	structural biologists. The CASP organizers receive the amino acid sequences of these proteins before their full,
	three-dimensional structures are publicly released in the Protein Data Bank (PDB). They then provide these sequences to
	participating research groups and servers, who must submit their predicted structures within a specific time frame.
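
The benchmark scores above are reported as LR P@L, which this card does not define. A common reading in the contact-prediction literature is long-range precision at L: among residue pairs at least 24 positions apart, take the L highest-scoring predicted contacts (L being the sequence length) and report the fraction that are true contacts. The sketch below follows that reading. The function name, the separation threshold of 24, and the toy data are all assumptions, not details confirmed by the AMPLIFY authors.

```python
def long_range_precision_at_l(scores, contacts, seq_len, min_separation=24):
    """Precision of the top-`seq_len` scored pairs with |i - j| >= min_separation.

    scores:   dict mapping (i, j) pairs with i < j to a predicted contact score
    contacts: set of (i, j) pairs with i < j that are true contacts
    """
    candidates = [
        (score, pair)
        for pair, score in scores.items()
        if pair[1] - pair[0] >= min_separation
    ]
    if not candidates:
        return 0.0
    # Highest scores first; keep at most L pairs.
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = candidates[:seq_len]
    hits = sum(1 for _, pair in top if pair in contacts)
    return hits / len(top)


# Toy example with a reduced separation threshold so a tiny sequence suffices.
toy_scores = {(0, 3): 0.9, (1, 4): 0.8, (0, 4): 0.2, (0, 1): 0.99}
toy_contacts = {(0, 3), (0, 1)}
print(long_range_precision_at_l(toy_scores, toy_contacts, seq_len=2, min_separation=3))  # 0.5
```

Here the pair `(0, 1)` is ignored because it is too close in sequence, and one of the two remaining top-scoring pairs is a true contact. The sketch divides by the number of pairs actually selected, which can be smaller than L for short sequences.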

	## Inference:

	Acceleration Engine:

	- Hugging Face Transformers

	Test Hardware:

	- A100
	- H100
	- H200
	- GB200

	## Ethical Considerations:

	NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable
	development for a wide array of AI applications. When downloaded or used in accordance with our terms of service,
	developers should work with their internal model team to ensure this model meets requirements for the relevant industry
	and use case and addresses unforeseen product misuse.

	Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and
	comply with applicable safety regulations and ethical standards.

	Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns
	[here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).