GENERanno-eukaryote-0.5b-base model
About
In this repository, we present GENERanno, a genomic foundation model with a context length of 8k base pairs and 500M parameters, trained on a dataset of 386 billion base pairs of eukaryotic DNA. Our evaluations show that GENERanno performs comparably to GENERator on Genomic Benchmarks, NT tasks, and our newly proposed Gener tasks, making the two the top genomic foundation models in the field (as of 2025-02).
Beyond benchmark performance, GENERanno is designed specifically for gene annotation. The model efficiently and accurately identifies gene locations, predicts gene function, and annotates gene structure, which could substantially improve the precision and efficiency of gene annotation in genomic research.
The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERanno.
Please note that GENERanno-eukaryote is still under development. We are actively refining the model and will release more technical details soon. Stay tuned for updates!
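Because the context length is 8k base pairs, sequences longer than that have to be split before they are passed to the model. The following is a minimal sketch of overlapping windows, not part of the GENERanno codebase. The helper name `sliding_windows` is hypothetical, and the defaults are assumptions: `window=8192` is one reading of "8k" (it may mean 8000, and special tokens such as `<s>` may also count toward the limit), and `stride=4096` is an arbitrary 50% overlap.

```python
from typing import Iterator, Tuple


def sliding_windows(
    sequence: str, window: int = 8192, stride: int = 4096
) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, subsequence) pairs that cover `sequence`.

    Hypothetical helper: `window` and `stride` are assumed values,
    not settings documented for GENERanno.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Short sequences fit in a single window.
    if len(sequence) <= window:
        yield 0, sequence
        return
    start = 0
    while True:
        end = start + window
        if end >= len(sequence):
            # Align the final window to the end so it keeps full length.
            yield len(sequence) - window, sequence[-window:]
            return
        yield start, sequence[start:end]
        start += stride


# Toy example with a small window so the offsets are easy to read.
toy = "ACGT" * 5  # 20 bp
windows = list(sliding_windows(toy, window=8, stride=4))
starts = [offset for offset, _ in windows]
print(starts)
```

Each window can then be embedded independently, and the returned offsets make it possible to map per-window results back to coordinates in the original sequence.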
How to use
Example: Embedding Extraction
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained(
    "GenerTeam/GENERanno-eukaryote-0.5b-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(
    "GenerTeam/GENERanno-eukaryote-0.5b-base",
    trust_remote_code=True,
)

# Define input sequences.
sequences = [
    "ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG",
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT",
]
# Prepend the BOS token manually; special tokens are disabled in the tokenizer call below.
processed_sequences = ["<s>" + seq for seq in sequences]

# Tokenize the sequences.
inputs = tokenizer(
    processed_sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
).to("cuda")

with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]
attention_mask = inputs["attention_mask"]

# Option 1: Use the first token (BOS) as the sentence embedding.
bos_embeddings = hidden_states[:, 0, :]

# Option 2: Use mean pooling over the token embeddings (padding positions are masked out).
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

print("BOS Embeddings:", bos_embeddings)
print("Mean Embeddings:", mean_embeddings)
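To see why the mean pooling in Option 2 multiplies by the attention mask, the sketch below applies the same computation to a tiny hand-made tensor on the CPU, with no model download. The helper name `masked_mean_pool` and the toy values are illustrative assumptions. The `clamp` guard against an all-padding row is an addition that the example above does not include.

```python
import torch


def masked_mean_pool(
    hidden_states: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
    """Average token embeddings over non-padding positions only.

    Hypothetical helper mirroring Option 2 in the example above.
    """
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(torch.float32)
    summed = (hidden_states.to(torch.float32) * mask).sum(dim=1)
    # Number of real tokens per sequence; clamp avoids division by zero.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Toy batch: 2 sequences, 3 positions, hidden size 2.
# The last position of the second sequence is padding with a deliberately large value.
hidden = torch.tensor(
    [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
        [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
    ]
)
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The second row averages only the two real tokens, so the padded value of 100 does not affect the result. A plain `hidden.mean(dim=1)` would include it.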
Citation
@article{li2025generanno,
author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
elocation-id = {2025.06.04.656517},
year = {2025},
doi = {10.1101/2025.06.04.656517},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
journal = {bioRxiv}
}