varLens β€” Pretrained Weights

Lightweight variant-effect predictor that consumes Track Delta features from NTv3-650M-post and returns a sigmoid score per SNV. ~627K parameters β€” roughly 1/1000 the size of NTv3-650M.

Pretrained on 9.7M autosomal SNVs from gnomAD with MAF-based proxy labels (MAF < 1% β†’ likely-deleterious, MAF > 5% β†’ likely-tolerated; intermediate 1–5% excluded). Final balanced set: 4,869,209 + 4,869,209 SNVs.
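The MAF-proxy labeling rule above can be sketched in a few lines. This is my paraphrase of the described thresholds, not the authors' code; the function name is hypothetical:

```python
def maf_proxy_label(maf):
    """MAF-based proxy label: 1 = likely-deleterious, 0 = likely-tolerated,
    None = intermediate-frequency SNV excluded from training."""
    if maf < 0.01:   # MAF < 1% -> likely-deleterious
        return 1
    if maf > 0.05:   # MAF > 5% -> likely-tolerated
        return 0
    return None      # 1-5%: dropped

print([maf_proxy_label(m) for m in (0.002, 0.03, 0.12)])  # [1, None, 0]
```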

Code and full documentation: https://github.com/omicsEye/varLens

Paper: Fu, D. and Rahnavard, A. *varLens: Enhancing Single Nucleotide Variation (SNV) Effect Prediction Using Language Models* (in preparation).

Files in this repo

| File | Purpose |
|---|---|
| `varLens_PreTrained_9p7M.pt` | Model weights (torch `state_dict`) |
| `varLens_PreTrained_9p7M.safetensors` | Same weights in safetensors format |
| `scaler.npz` | Per-feature mean/std (length 8064 = 63 × 128), required to normalize Track Delta inputs |
| `panel_core_v1.json` | 63-track functional panel definition (same file as in the GitHub `configs/`) |

All three runtime artifacts (the weights in either `.pt` or `.safetensors` form, `scaler.npz`, and `panel_core_v1.json`) are needed to reproduce inference.

Quick use

```shell
# 1. Get the code
pip install varlens

# 2. Load weights + scaler directly from this HF repo
python - <<'PY'
import varlens
model = varlens.load("domizzz2025/varLens")  # auto-downloads ckpt + scaler + panel
scores = model.score(my_track_delta)         # (N, 63, 128) -> (N,)
PY
```

Or via the CLI:

```shell
varlens score-features --features my_track_delta.npz --out scores.csv
```

Minimal Python loading example (no pip install)

```python
import numpy as np
import torch
from varlens.model import VarLensV3   # from the GitHub package

model = VarLensV3(in_channels=63, n_positions=128, center_k=5)
model.load_state_dict(torch.load("varLens_PreTrained_9p7M.pt",
                                 map_location="cpu", weights_only=True))
model.eval()

sc = np.load("scaler.npz")
mean, std = sc["mean"], sc["std"]

# x: (N, 63, 128) Track Delta features from NTv3-650M-post
x_norm = ((x.reshape(len(x), -1) - mean) / (std + 1e-8)).reshape(x.shape)
with torch.no_grad():
    scores = torch.sigmoid(model(torch.from_numpy(x_norm).float())).numpy()
```
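The normalization step can be exercised standalone with synthetic data: each (63, 128) Track Delta tensor is flattened to 8064 features, standardized per-feature, and reshaped back. In this sketch the mean/std are computed from the synthetic batch itself; in real use they come from `scaler.npz`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for Track Delta features: batch of 16 SNVs
x = rng.normal(loc=2.0, scale=3.0, size=(16, 63, 128)).astype(np.float32)

flat = x.reshape(len(x), -1)               # (16, 8064)
mean, std = flat.mean(axis=0), flat.std(axis=0)
x_norm = ((flat - mean) / (std + 1e-8)).reshape(x.shape)

print(x_norm.shape)  # (16, 63, 128), each of the 8064 features now ~zero-mean
```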

Model details

  • Architecture: parallel CNN + Attention branches over a 63Γ—128 Track Delta tensor, concatenated into a 384-dim vector, then a 2-layer MLP head. Both branches use center pooling (Β±5 positions around the SNV, center_k=5).
  • Input: NTv3(alt_sequence) βˆ’ NTv3(ref_sequence) across 63 curated functional tracks Γ— 128 positions centered on the SNV. See panel_core_v1.json for the track list (DNase, ATAC, H3K4me3/27ac/4me1/36me3/27me3/9me3, CTCF, other TFs, CAGE, RNA-seq).
  • Parameters: ~627K
  • Training FLOPs: 0.16 GFLOPs/sample (vs 333.7 GFLOPs for NTv3-650M fine-tuning)
  • Training data: 9.7M autosomal gnomAD SNVs with MAF-proxy labels (4,869,209 likely-deleterious + 4,869,209 likely-tolerated)
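The efficiency claim follows directly from the quoted per-sample costs; a quick check of the arithmetic:

```python
# Per-sample training cost from the bullets above
varlens_gflops = 0.16   # varLens
ntv3_gflops = 333.7     # NTv3-650M fine-tuning

ratio = ntv3_gflops / varlens_gflops
print(f"varLens uses ~{ratio:.0f}x fewer training FLOPs per sample")  # ~2086x
```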

Evaluation

Evaluated on 5 external variant-effect benchmarks spanning gene expression regulation, clinical pathogenicity, and trait association:

| Dataset | Task | Metric |
|---|---|---|
| Causal eQTL | Fine-mapped causal vs. non-causal eQTL | AUROC |
| ClinVar Pathogenic | Pathogenic vs. Benign | AUROC |
| ClinVar Pathogenic vs. gnomAD Common Missense | Pathogenic missense vs. common missense | AUROC |
| Mendelian Traits | Causal variants for 113 monogenic traits | AUPRC |
| Complex Traits | Causal variants across 83 polygenic traits | AUPRC |

Full numbers and comparisons against 7 genome language models (DNABERT, DNABERT-2, HyenaDNA, NTv2-100/500M, NTv3-100/650M) are in the GitHub README.

Limitations

• The MAF-proxy pretraining objective is an indirect signal for pathogenicity: rare variants are enriched for deleterious effects, but rarity and pathogenicity are not the same thing. Fine-tune on labeled task data when a direct pathogenic/benign decision is needed.
  • Inputs must come from NTv3-650M-post, the specific 63-track panel, and a 128 bp window centered on the variant. Other NTv3 variants or track selections will not work with these weights.
  • The model is trained on human autosomal variants only.

Citation

```bibtex
@software{varLens2026,
  title  = {varLens: Enhancing Single Nucleotide Variation (SNV) Effect Prediction Using Language Models},
  author = {Fu, Dezhao and Rahnavard, Ali},
  year   = {2026},
  url    = {https://github.com/omicsEye/varLens}
}
```

Acknowledgments

This work was supported by the National Science Foundation under grant no. 2109688 to A.R.

Competing Interests

A.R. is the inventor of a pending patent related to the work presented.

License

MIT. See the GitHub repo for full terms.
