# varLens – Pretrained Weights
Lightweight variant-effect predictor that consumes Track Delta features from NTv3-650M-post and returns a sigmoid score per SNV. ~627K parameters, roughly 1/1000 the size of NTv3-650M.

Pretrained on 9.7M autosomal SNVs from gnomAD with MAF-based proxy labels (MAF < 1% → likely-deleterious, MAF > 5% → likely-tolerated; intermediate 1–5% excluded). Final balanced set: 4,869,209 + 4,869,209 SNVs.
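The labeling rule is simple enough to state in code. A minimal sketch of the proxy-labeling logic described above (the function name is illustrative, not part of the package):

```python
def maf_proxy_label(maf: float):
    """MAF-based proxy label used for pretraining (rule stated above)."""
    if maf < 0.01:
        return 1      # MAF < 1%: likely-deleterious
    if maf > 0.05:
        return 0      # MAF > 5%: likely-tolerated
    return None       # intermediate 1-5% MAF: excluded
```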
Code and full documentation: github.com/omicsEye/varLens

Paper: *varLens: Enhancing Single Nucleotide Variation (SNV) Effect Prediction Using Language Models*. Fu, D. and Rahnavard, A. (in preparation).
## Files in this repo
| File | Purpose |
|---|---|
| `varLens_PreTrained_9p7M.pt` | Model weights (torch `state_dict`) |
| `varLens_PreTrained_9p7M.safetensors` | Same weights in safetensors format |
| `scaler.npz` | Per-feature mean/std (length 8064 = 63×128); required to normalize Track Delta inputs |
| `panel_core_v1.json` | 63-track functional panel definition (same file as in the GitHub `configs/`) |

All runtime artifacts (`.pt` or `.safetensors`, `scaler.npz`, `panel_core_v1.json`) are needed to reproduce inference.
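If you want to fetch the artifacts yourself rather than go through `varlens.load`, here is a minimal sketch using `huggingface_hub` (repo id taken from the quick-use example below):

```python
from huggingface_hub import hf_hub_download

# Each call downloads one file from this repo and returns its local cache path
ckpt   = hf_hub_download("domizzz2025/varLens", "varLens_PreTrained_9p7M.pt")
scaler = hf_hub_download("domizzz2025/varLens", "scaler.npz")
panel  = hf_hub_download("domizzz2025/varLens", "panel_core_v1.json")
```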
## Quick use
```bash
# 1. Get the code
pip install varlens

# 2. Load weights + scaler directly from this HF repo
python - <<'PY'
import varlens
model = varlens.load("domizzz2025/varLens")  # auto-downloads ckpt + scaler + panel
scores = model.score(my_track_delta)         # (N, 63, 128) -> (N,)
PY
```

Or via the CLI:

```bash
varlens score-features --features my_track_delta.npz --out scores.csv
```
## Minimal Python loading example (no pip install)
```python
import numpy as np
import torch

from varlens.model import VarLensV3  # from the GitHub package

model = VarLensV3(in_channels=63, n_positions=128, center_k=5)
model.load_state_dict(torch.load("varLens_PreTrained_9p7M.pt",
                                 map_location="cpu", weights_only=True))
model.eval()

sc = np.load("scaler.npz")
mean, std = sc["mean"], sc["std"]

# x: (N, 63, 128) Track Delta features from NTv3-650M-post.
# The file name and array key below are placeholders for your own features.
x = np.load("my_track_delta.npz")["x"]

x_norm = ((x.reshape(len(x), -1) - mean) / (std + 1e-8)).reshape(x.shape)

with torch.no_grad():
    scores = torch.sigmoid(model(torch.from_numpy(x_norm).float())).numpy()
```
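If you prefer the safetensors file, the same weights can be loaded with the `safetensors` package (assuming identical `state_dict` keys, which the file table above implies):

```python
from safetensors.torch import load_file

# Drop-in replacement for the torch.load call above
model.load_state_dict(load_file("varLens_PreTrained_9p7M.safetensors"))
```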
## Model details
- Architecture: parallel CNN + Attention branches over a 63×128 Track Delta tensor, concatenated into a 384-dim vector, then a 2-layer MLP head. Both branches use center pooling (±5 positions around the SNV, `center_k=5`). See the sketch after this list.
- Input: `NTv3(alt_sequence) - NTv3(ref_sequence)` across 63 curated functional tracks × 128 positions centered on the SNV. See `panel_core_v1.json` for the track list (DNase, ATAC, H3K4me3/27ac/4me1/36me3/27me3/9me3, CTCF, other TFs, CAGE, RNA-seq).
- Parameters: ~627K
- Training FLOPs: 0.16 GFLOPs/sample (vs 333.7 GFLOPs for NTv3-650M fine-tuning)
- Training data: 9.7M autosomal gnomAD SNVs with MAF-proxy labels (4,869,209 likely-deleterious + 4,869,209 likely-tolerated)
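For orientation, a hedged PyTorch sketch of the two-branch design described above. The documented values (63 input tracks, 128 positions, `center_k=5`, 384-dim fused vector, 2-layer MLP head) are kept; every other layer size and the attention configuration are guesses, so refer to `varlens.model.VarLensV3` on GitHub for the real definition:

```python
import torch
import torch.nn as nn

class VarLensSketch(nn.Module):
    """Illustrative two-branch model; NOT the shipped VarLensV3 definition."""

    def __init__(self, in_channels=63, n_positions=128, center_k=5, fused_dim=384):
        super().__init__()
        self.center_k = center_k
        # CNN branch over the (tracks x positions) Track Delta tensor
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, fused_dim // 2, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Attention branch: positions attend to each other, tracks as features
        # (head count is a guess; 7 divides 63 evenly)
        self.attn = nn.MultiheadAttention(embed_dim=in_channels, num_heads=7,
                                          batch_first=True)
        self.attn_proj = nn.Linear(in_channels, fused_dim // 2)
        # 2-layer MLP head on the concatenated 384-dim vector -> one logit
        self.head = nn.Sequential(nn.Linear(fused_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 1))

    def center_pool(self, h):
        # mean over +/- center_k positions around the window center (the SNV)
        c = h.shape[-1] // 2
        return h[..., c - self.center_k : c + self.center_k + 1].mean(dim=-1)

    def forward(self, x):                        # x: (N, 63, 128)
        h_cnn = self.center_pool(self.cnn(x))    # (N, 192)
        xt = x.transpose(1, 2)                   # (N, 128, 63) for attention
        a, _ = self.attn(xt, xt, xt)
        h_attn = self.attn_proj(self.center_pool(a.transpose(1, 2)))  # (N, 192)
        return self.head(torch.cat([h_cnn, h_attn], dim=1)).squeeze(-1)  # logits
```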
## Evaluation
Evaluated on 5 external variant-effect benchmarks spanning gene expression regulation, clinical pathogenicity, and trait association:
| Dataset | Task | Metric |
|---|---|---|
| Causal eQTL | Fine-mapped causal vs non-causal eQTL | AUROC |
| ClinVar Pathogenic | Pathogenic vs Benign | AUROC |
| ClinVar Pathogenic vs. gnomAD Common Missense | Pathogenic missense vs common missense | AUROC |
| Mendelian Traits | Causal variants for 113 monogenic traits | AUPRC |
| Complex Traits | Causal variants across 83 polygenic traits | AUPRC |
Full numbers and comparisons against 7 genome language models (DNABERT, DNABERT-2, HyenaDNA, NTv2-100/500M, NTv3-100/650M) are in the GitHub README.
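To reproduce the metric computations on your own benchmark splits, a short scikit-learn sketch (`y_true` and `scores` are hypothetical arrays of binary labels and model outputs):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

auroc = roc_auc_score(y_true, scores)            # eQTL / ClinVar benchmarks
auprc = average_precision_score(y_true, scores)  # Mendelian / Complex Traits
```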
## Limitations
- The MAF-proxy pretraining objective is an indirect signal for pathogenicity: rare variants are enriched for deleterious ones, but the two categories are not identical. Fine-tune on labeled task data when a direct pathogenic/benign decision is needed.
- Inputs must come from NTv3-650M-post, the specific 63-track panel, and a 128 bp window centered on the variant. Other NTv3 variants or track selections will not work with these weights.
- The model is trained on human autosomal variants only.
## Citation
```bibtex
@software{varLens2026,
  title  = {varLens: Enhancing Single Nucleotide Variation (SNV) Effect Prediction Using Language Models},
  author = {Fu, Dezhao and Rahnavard, Ali},
  year   = {2026},
  url    = {https://github.com/omicsEye/varLens}
}
```
## Acknowledgments
This work was supported by the National Science Foundation under grant no. 2109688 to A.R.
## Competing Interests
A.R. is an inventor on a pending patent related to the work presented.
## License
MIT. See the GitHub repo for full terms.