SVSTR-Score: long-read-guided confidence scoring for short-read SV and STR genotypes
Per-class random-forest models that assign a calibrated confidence score (CS ∈ [0,1]) and a four-tier operating point to each short-read structural-variant (SV) or short-tandem-repeat (STR) call, learned from paired short-/long-read genomes. The primary SV model is the 23-feature caller-output random forest; a caller-independent sequence-only model is provided as a portable secondary option. The long-read genotype is used only to build the training label; inference needs short-read caller output only.
Files
| File | Description |
|---|---|
| sv_model_v13_parents.joblib | SV random forest (23 caller-output features; primary SV model / external-validation headline) |
| str_model_v13_parents.joblib | STR random forest (25 features) |
| seqonly/ | caller-independent sequence-only SV model (svspr, 11 features): portable secondary option for callers whose fields differ from training. seqonly/model/svspr_v14_seq.pkl + the svspr inference package (VCF + reference → CS/tier; tiers HIGH≥0.9/MOD≥0.7/WARN≥0.5) + example. 11 features = SV length/log-length, SVTYPE one-hot, ±100 bp flank GC/AT/inner-GC, motif-2/3 counts; no caller fields |
| str_locus_lookup.parquet | per-locus historical concordance (locus_conc_rate) keyed by (chrom, pos); 163,726 loci |
| sv_config.json / str_config.json | feature order, tier thresholds, lookup metadata |
| feature_manifest.json | definition of every SV/STR feature |
| tier_thresholds.json | tier cut-offs + override rule |
| score_svstr.py | inference entry point |
| requirements.txt | pinned dependencies (scikit-learn==1.5.1) |
| example/ | real example calls from the study cohort (input features + scored output) |
Models were trained with scikit-learn 1.5.1; load with the same version to avoid
InconsistentVersionWarning and guarantee identical results.
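The version requirement above can be enforced before a model is unpickled. The following is a minimal sketch, not part of the repository: the helper names `installed_version`, `check_pinned` and `load_model` are hypothetical, and the only facts taken from this card are the pin (1.5.1) and the `.joblib` file format.

```python
from importlib import metadata
from typing import Optional

PINNED_SKLEARN = "1.5.1"  # pin stated in requirements.txt per this card


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pinned(installed: Optional[str], pinned: str = PINNED_SKLEARN) -> bool:
    """True only when the installed version equals the pin exactly."""
    return installed is not None and installed == pinned


def load_model(path: str):
    """Hypothetical helper: refuse to unpickle under a mismatched scikit-learn."""
    found = installed_version("scikit-learn")
    if not check_pinned(found):
        raise RuntimeError(
            f"scikit-learn {PINNED_SKLEARN} required, found {found}"
        )
    import joblib  # imported lazily so the version check runs first

    # A .joblib file is a pickle: only load files from a trusted source.
    return joblib.load(path)
```

Calling `load_model("sv_model_v13_parents.joblib")` would then fail early with a clear message instead of emitting a warning at load time.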
Quick start
pip install -r requirements.txt
# SV
python score_svstr.py --variant sv --model-dir . --features example/sv_features.tsv --out sv_scored.tsv
# STR (locus_conc_rate / locus_in_lookup are filled here from str_locus_lookup.parquet)
python score_svstr.py --variant str --model-dir . --features example/str_features.tsv --out str_scored.tsv
Input is a table extracted from the caller VCF (e.g. with bcftools query) holding
the features in *_config.json. For STR, supply the 23 caller-output features plus
chrom,pos; the script joins the catalogue lookup to add locus_conc_rate and
locus_in_lookup.
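The catalogue join described above can be pictured as a keyed lookup on (chrom, pos). This is only an illustrative sketch in plain Python, not the code in score_svstr.py: the function name `add_locus_features` is hypothetical, and the fill value used for loci missing from the catalogue (`missing_rate`) is an assumption, since the card does not state what the script writes in that case.

```python
from typing import Dict, List, Tuple

Locus = Tuple[str, int]  # (chrom, pos)


def add_locus_features(
    rows: List[dict],
    lookup: Dict[Locus, float],
    missing_rate: float = 0.0,  # ASSUMPTION: placeholder, not from the card
) -> List[dict]:
    """Attach locus_conc_rate and locus_in_lookup to each STR feature row."""
    enriched_rows = []
    for row in rows:
        key = (row["chrom"], int(row["pos"]))
        hit = key in lookup
        enriched = dict(row)  # copy so the caller's rows are left untouched
        enriched["locus_conc_rate"] = lookup[key] if hit else missing_rate
        enriched["locus_in_lookup"] = int(hit)
        enriched_rows.append(enriched)
    return enriched_rows


# Toy data, invented for illustration only.
toy_lookup = {("chr1", 1000): 0.98}
toy_rows = [{"chrom": "chr1", "pos": 1000}, {"chrom": "chr2", "pos": 500}]
scored_input = add_locus_features(toy_rows, toy_lookup)
```

The second toy row has no catalogue entry, so it receives `locus_in_lookup = 0` and the placeholder rate.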
example/ holds real calls drawn from the study cohort (*_features.tsv) and
their scored output (*_scored.tsv), spanning the tier range. They double as a
reproducibility check: the SV set includes precise PASS calls scored HIGH and
override cases (is_imprecise==1 or filter_pass==0) demoted to LOW; the STR
set includes high-concordance loci scored HIGH and a feature-clean but
historically discordant locus scored LOW, illustrating that locus_conc_rate
dominates the STR score.
Confidence tiers
| Tier | Cut |
|---|---|
| HIGH | CS ≥ 0.70 |
| MODERATE | 0.50 ≤ CS < 0.70 |
| WARNING | 0.30 ≤ CS < 0.50 |
| LOW | CS < 0.30 |
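As a sketch of how the table maps a score to a tier (the function name `assign_tier` is hypothetical; the authoritative cut-offs live in tier_thresholds.json and the logic in score_svstr.py):

```python
from typing import Sequence, Tuple

# Cut-offs copied from the table above, highest tier first.
DEFAULT_CUTS: Tuple[Tuple[str, float], ...] = (
    ("HIGH", 0.70),
    ("MODERATE", 0.50),
    ("WARNING", 0.30),
)


def assign_tier(cs: float, cuts: Sequence[Tuple[str, float]] = DEFAULT_CUTS) -> str:
    """Return the first tier whose lower bound the confidence score reaches."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"confidence score must lie in [0, 1], got {cs}")
    for name, lower_bound in cuts:
        if cs >= lower_bound:
            return name
    return "LOW"  # anything below the lowest cut-off
```

Lower bounds are inclusive, so a score of exactly 0.70 lands in HIGH. This sketch does not apply the override rules, which can change the final tier.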
Override. Deterministic quality-demotion rules in each *_config.json (rules)
are applied on top of the CS tier by score_svstr.py; any rule triggered →
override_target regardless of CS. SV → LOW (QUAL<20, GQ<15, total_alt_support≤2,
vaf_estimate<0.15, is_imprecise==1); **STR → WARNING** (locus_coverage<20, is_low_depth==1,
support_type_a2≤1, total_support_a2<5, ci_width_a2>3, allele_balance<0.3). Only the
MODERATE→HIGH range is a monotone precision ladder; LOW and WARNING are not rank-ordered
by precision. Use the HIGH tier as a candidate-triage filter.
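The override step can be sketched as a list of predicates evaluated per call. The thresholds and field names below are copied from the paragraph above, but the representation is an assumption: the actual rule format in *_config.json and the logic in score_svstr.py may differ, and `apply_override` is a hypothetical name. The sketch follows the literal wording ("regardless of CS"), so a triggered rule always yields the target tier.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[dict], bool]

# SV rules: any hit sends the call to LOW.
SV_RULES: Dict[str, Rule] = {
    "QUAL<20": lambda r: r["QUAL"] < 20,
    "GQ<15": lambda r: r["GQ"] < 15,
    "total_alt_support<=2": lambda r: r["total_alt_support"] <= 2,
    "vaf_estimate<0.15": lambda r: r["vaf_estimate"] < 0.15,
    "is_imprecise==1": lambda r: r["is_imprecise"] == 1,
}

# STR rules: any hit sends the call to WARNING.
STR_RULES: Dict[str, Rule] = {
    "locus_coverage<20": lambda r: r["locus_coverage"] < 20,
    "is_low_depth==1": lambda r: r["is_low_depth"] == 1,
    "support_type_a2<=1": lambda r: r["support_type_a2"] <= 1,
    "total_support_a2<5": lambda r: r["total_support_a2"] < 5,
    "ci_width_a2>3": lambda r: r["ci_width_a2"] > 3,
    "allele_balance<0.3": lambda r: r["allele_balance"] < 0.3,
}


def apply_override(
    record: dict, cs_tier: str, rules: Dict[str, Rule], target: str
) -> Tuple[str, List[str]]:
    """Return (final tier, names of triggered rules) for one call."""
    triggered = [name for name, rule in rules.items() if rule(record)]
    return (target if triggered else cs_tier), triggered
```

Returning the triggered rule names alongside the tier makes it easy to see why a high-scoring call was moved out of HIGH.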
Intended use & performance
- Sample-LOSO cross-validation (143 unrelated parents): SV F1 = 0.9308, STR F1 = 0.9960.
- Within-cohort generalization — re-scored on 3,608 unseen short-read genomes without retraining; the score distribution and per-1-Mb genomic reliability map reproduce (training↔unseen Spearman ρ = 0.90).
- External multi-sample validation — 194 HPRC genomes (sequence-aware Truvari labels; long-read truth Sawfish/SV, TRGT/STR):
- SV (this 23-feature model, the primary external object): per-genome median AUROC 0.927, above raw QUAL (0.741), GQ (0.476) and read-depth (0.671) and the applicable dedicated tools (duphold, Paragraph), most clearly for deletions and insertions (AUPRC 0.948, expected calibration error 0.053). Robust across the tested callers (median 0.915 on an independent DRAGEN reanalysis). The companion caller-independent sequence-only model transfers at a lower ceiling (median AUROC 0.851).
- STR (this model): per-genome median AUROC 0.912; in-catalogue 0.909 vs out-of-catalogue 0.688 (catalogue-bound reliability atlas).
- GIAB HG002 (single-sample supplementary check): HIGH-tier SV precision 97.7 % (consensus v0.6).
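For readers unfamiliar with the "per-genome median AUROC" summary used above, the sketch below shows one way such a number can be computed: AUROC is evaluated separately within each genome, then the median is taken across genomes. It is an illustration only, not the study's evaluation code; the function names are hypothetical, labels are assumed to be 1 for calls concordant with the long-read truth and 0 otherwise, and the toy scores are invented.

```python
from statistics import median
from typing import Dict, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Pairwise (Mann-Whitney) form; ties count as half a win. Quadratic in
    the number of calls, which is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def per_genome_median_auroc(
    calls: Dict[str, Tuple[Sequence[float], Sequence[int]]]
) -> float:
    """Median over genomes of the within-genome AUROC."""
    return median(auroc(scores, labels) for scores, labels in calls.values())


# Invented toy data: genome id -> (confidence scores, concordance labels).
toy_calls = {
    "genome_a": ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),
    "genome_b": ([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]),
    "genome_c": ([0.5, 0.5], [1, 0]),
}
summary = per_genome_median_auroc(toy_calls)
```

Taking the median across genomes, rather than pooling all calls, keeps one genome with many calls from dominating the summary.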
Limitations (read before use)
- External validation is multi-sample (194 HPRC genomes; see above). Outputs remain candidate triage, not validated clinical calls. The 23-feature caller-output model is robust across the tested Manta-family callers (Manta v1.6, DRAGEN), but it consumes caller-specific fields and is untested on structurally different callers (e.g. DELLY, GATK-SV); for those use the caller-independent sequence-only model (lower ceiling).
- STR scoring is a per-locus reliability atlas. ~94 % of STR model importance is locus_conc_rate; removing it drops cross-validated AUROC 0.998 → 0.62. The score does not extend to STR loci outside the 163,726-locus catalogue (out-of-catalogue loci get a conservative score by construction), and only the HIGH tier transfers across callers; lower-tier ranking does not.
- Trained on a single ancestry and pipeline (Illumina DRAGEN + Manta/ExpansionHunter; PacBio HiFi + Sawfish/TRGT, GRCh38). Behaviour under other pipelines is untested.
- .joblib is a pickle: load only from trusted sources and with the pinned versions.
Citation
Kim W*, Yeom K*, et al. SVSTR-Score: long-read-guided confidence scoring for short-read SV and STR genotypes. (manuscript in preparation.)
Licence
MIT.