GENERanno-eukaryote-0.5b-base model

About

In this repository, we present GENERanno, a genomic foundation model with a context length of 8k base pairs and 500M parameters, trained on a dataset of 386 billion base pairs of eukaryotic DNA. In our evaluations, GENERanno performs comparably to GENERator on Genomic Benchmarks, NT tasks, and our newly proposed Gener tasks, making the two models the top genomic foundation models in the field as of 2025-02.
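
Because the context length is 8k base pairs, longer sequences have to be split before they are passed to the model. The following is a minimal sketch of one way to do that, not part of the GENERanno code base: `chunk_sequence` is a hypothetical helper, and the default `window=8000` is an assumed conservative value that leaves room for a prepended special token. Adjust it to the limit the released tokenizer and model actually enforce.

```python
from typing import List, Tuple


def chunk_sequence(seq: str, window: int = 8000, overlap: int = 0) -> List[Tuple[int, str]]:
    """Split a DNA string into windows of at most `window` bases.

    Returns (start_offset, chunk) pairs so that per-chunk results can be
    mapped back to coordinates in the original sequence.
    `window` and `overlap` are illustrative parameters, not model settings.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks = []
    start = 0
    while start < len(seq):
        chunks.append((start, seq[start:start + window]))
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(seq):
            break
        start += step
    return chunks


# Usage: a 20,000 bp toy sequence split into overlapping windows.
long_sequence = "ACGT" * 5000
for offset, chunk in chunk_sequence(long_sequence, window=8000, overlap=1000):
    print(offset, len(chunk))
```

A non-zero `overlap` gives features near a window boundary some context on both sides, at the cost of processing some bases twice.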

Beyond benchmark performance, GENERanno is designed specifically for gene annotation. The model identifies gene locations, predicts gene function, and annotates gene structure efficiently and accurately, which could substantially improve the precision and efficiency of gene annotation in genomic research.

The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERanno.

Please note that GENERanno-eukaryote is still under development. We are actively refining the model and will release more technical details soon. Stay tuned for updates!

How to use

Example: Embedding Extraction


import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained(
    "GenerTeam/GENERanno-eukaryote-0.5b-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    "GenerTeam/GENERanno-eukaryote-0.5b-base",
    trust_remote_code=True,
)

# Define input sequences.
sequences = [
    "ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG",
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
]

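# Prepend the BOS token manually, because the tokenizer is called with add_special_tokens=False below.
processed_sequences = ["<s>" + seq for seq in sequences]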

# Tokenize the sequences
inputs = tokenizer(
    processed_sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
).to("cuda")

with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer
hidden_states = outputs.hidden_states[-1]
attention_mask = inputs["attention_mask"]

# Option 1: Use the first token (BOS) as the sentence embedding
bos_embeddings = hidden_states[:, 0, :]

# Option 2: Use mean pooling over the token embeddings.
# The attention mask zeroes out padded positions, so each sum is divided
# by the number of real tokens in that sequence.
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

print("BOS Embeddings:", bos_embeddings)
print("Mean Embeddings:", mean_embeddings)
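
The mean-pooling step above can be checked, and the resulting embeddings compared, without loading the model. The sketch below uses small hand-written tensors in place of real hidden states. `masked_mean_pool` and `pairwise_cosine` are hypothetical helper names introduced here for illustration, and cosine similarity is only one possible way to compare sequence embeddings.

```python
import torch
import torch.nn.functional as F


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padded positions only."""
    mask = attention_mask.unsqueeze(-1).to(torch.float32)       # (batch, seq, 1)
    summed = (hidden_states.to(torch.float32) * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)                     # guard against all-padding rows
    return summed / counts


def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every pair of rows."""
    normed = F.normalize(embeddings.to(torch.float32), dim=-1)
    return normed @ normed.T


# Toy stand-in for model output: batch of 2, 3 tokens, hidden size 2.
# The last token of the first sequence is padding and must be ignored.
toy_hidden = torch.tensor([
    [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
toy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(toy_hidden, toy_mask)
similarity = pairwise_cosine(pooled)
print(pooled)
print(similarity)
```

With real model output, `hidden_states` and `inputs["attention_mask"]` from the example above would take the place of the toy tensors.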

Citation

@article{li2025generanno,
    author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
    title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
    elocation-id = {2025.06.04.656517},
    year = {2025},
    doi = {10.1101/2025.06.04.656517},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
    journal = {bioRxiv}
}
