SVSTR-Score: long-read-guided confidence scoring for short-read SV and STR genotypes

Per-class random-forest models that assign a calibrated confidence score (CS ∈ [0,1]) and a four-tier operating point to each short-read structural-variant (SV) or short-tandem-repeat (STR) call, learned from paired short-/long-read genomes. The primary SV model is the 23-feature caller-output random forest; a caller-independent sequence-only model is provided as a portable secondary option. The long-read genotype is used only to build the training label; inference needs short-read caller output only.

Files

| File | Description |
|---|---|
| sv_model_v13_parents.joblib | SV random forest (23 caller-output features; primary SV model / external-validation headline) |
| str_model_v13_parents.joblib | STR random forest (25 features) |
| seqonly/ | Caller-independent sequence-only SV model (svspr, 11 features); portable secondary option for callers whose fields differ from training. Contains seqonly/model/svspr_v14_seq.pkl, the svspr inference package (VCF + reference → CS/tier; tiers HIGH≥0.9 / MOD≥0.7 / WARN≥0.5) and an example. The 11 features are SV length/log-length, SVTYPE one-hot, ±100 bp flank GC/AT/inner-GC and motif-2/3 counts; no caller fields are used |
| str_locus_lookup.parquet | per-locus historical concordance (locus_conc_rate) keyed by (chrom, pos); 163,726 loci |
| sv_config.json / str_config.json | feature order, tier thresholds, lookup metadata |
| feature_manifest.json | definition of every SV/STR feature |
| tier_thresholds.json | tier cut-offs + override rule |
| score_svstr.py | inference entry point |
| requirements.txt | pinned dependencies (scikit-learn==1.5.1) |
| example/ | real example calls from the study cohort (input features + scored output) |

Models were trained with scikit-learn 1.5.1; load with the same version to avoid InconsistentVersionWarning and guarantee identical results.
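
A minimal sketch of a guarded load, assuming the pin from requirements.txt. The helper names `versions_match` and `load_model` are illustrative and are not part of score_svstr.py.

```python
from importlib import metadata

PINNED_SKLEARN = "1.5.1"  # pin taken from requirements.txt


def versions_match(installed: str, pinned: str = PINNED_SKLEARN) -> bool:
    """Return True only when the installed version string equals the pin."""
    return installed.strip() == pinned


def load_model(path: str):
    """Hypothetical helper: check the scikit-learn pin, then unpickle the model."""
    installed = metadata.version("scikit-learn")
    if not versions_match(installed):
        raise RuntimeError(
            f"scikit-learn {installed} is installed, "
            f"but the models were trained with {PINNED_SKLEARN}"
        )
    import joblib  # imported lazily so the version check runs first

    return joblib.load(path)
```

Typical use would be `model = load_model("sv_model_v13_parents.joblib")`; failing early is stricter than the warning scikit-learn emits on its own.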

Quick start

pip install -r requirements.txt
# SV
python score_svstr.py --variant sv  --model-dir . --features example/sv_features.tsv  --out sv_scored.tsv
# STR (locus_conc_rate / locus_in_lookup are filled here from str_locus_lookup.parquet)
python score_svstr.py --variant str --model-dir . --features example/str_features.tsv --out str_scored.tsv

Input is a table extracted from the caller VCF (e.g. with bcftools query) holding the features in *_config.json. For STR, supply the 23 caller-output features plus chrom,pos; the script joins the catalogue lookup to add locus_conc_rate and locus_in_lookup.
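
The catalogue join for STR can be pictured as a keyed lookup on (chrom, pos). The sketch below uses plain dictionaries rather than the parquet file, and the names `add_locus_features` and `missing_rate` are hypothetical. The value written for loci absent from the catalogue is an assumption here; this README does not state what score_svstr.py fills in.

```python
def add_locus_features(rows, lookup, missing_rate=0.0):
    """Attach locus_conc_rate and locus_in_lookup to each STR call.

    rows         : list of dicts, each with 'chrom' and 'pos'
    lookup       : dict mapping (chrom, pos) -> historical concordance rate
    missing_rate : ASSUMED placeholder for out-of-catalogue loci
    """
    enriched_rows = []
    for row in rows:
        key = (row["chrom"], int(row["pos"]))
        rate = lookup.get(key)
        enriched = dict(row)  # copy so the caller's rows are not mutated
        enriched["locus_in_lookup"] = int(rate is not None)
        enriched["locus_conc_rate"] = missing_rate if rate is None else rate
        enriched_rows.append(enriched)
    return enriched_rows
```

Keeping `locus_in_lookup` as an explicit 0/1 flag lets downstream code tell a genuinely low concordance rate from a locus that is simply not in the catalogue.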

example/ holds real calls drawn from the study cohort (*_features.tsv) and their scored output (*_scored.tsv), spanning the tier range. They double as a reproducibility check: the SV set includes precise PASS calls scored HIGH and override cases (is_imprecise==1 or filter_pass==0) demoted to LOW; the STR set includes high-concordance loci scored HIGH and a feature-clean but historically discordant locus scored LOW, illustrating that locus_conc_rate dominates the STR score.

Confidence tiers

| Tier | Cut |
|---|---|
| HIGH | CS ≥ 0.70 |
| MODERATE | 0.50 ≤ CS < 0.70 |
| WARNING | 0.30 ≤ CS < 0.50 |
| LOW | CS < 0.30 |
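
The cut-offs above map to a tier as in the following sketch. The thresholds are hard-coded here for illustration; the authoritative values are in tier_thresholds.json, and `assign_tier` is a hypothetical name.

```python
# (tier name, inclusive lower bound), checked from the highest tier down
TIER_CUTS = (("HIGH", 0.70), ("MODERATE", 0.50), ("WARNING", 0.30))


def assign_tier(cs: float) -> str:
    """Map a confidence score in [0, 1] to its tier, before any override."""
    if not 0.0 <= cs <= 1.0:
        raise ValueError(f"CS must lie in [0, 1], got {cs}")
    for name, lower_bound in TIER_CUTS:
        if cs >= lower_bound:
            return name
    return "LOW"
```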

Override. score_svstr.py applies the deterministic quality-demotion rules listed under rules in each *_config.json on top of the CS tier: if any rule fires, the call is assigned override_target regardless of its CS. SV calls are demoted to LOW (QUAL<20, GQ<15, total_alt_support≤2, vaf_estimate<0.15, is_imprecise==1). STR calls are demoted to WARNING rather than LOW (locus_coverage<20, is_low_depth==1, support_type_a2≤1, total_support_a2<5, ci_width_a2>3, allele_balance<0.3). Only the MODERATE→HIGH range is a monotone precision ladder; LOW and WARNING are not rank-ordered by precision. Use the HIGH tier as a candidate-triage filter.
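
A sketch of how the override could be evaluated, with the rules transcribed from the paragraph above rather than read from the config files. `apply_override` is a hypothetical name, and skipping rules whose feature is absent is an assumption, not documented behaviour of score_svstr.py.

```python
import operator

# (feature, comparison, threshold): a rule fires when comparison(value, threshold) is true
SV_RULES = [
    ("QUAL", operator.lt, 20),
    ("GQ", operator.lt, 15),
    ("total_alt_support", operator.le, 2),
    ("vaf_estimate", operator.lt, 0.15),
    ("is_imprecise", operator.eq, 1),
]
STR_RULES = [
    ("locus_coverage", operator.lt, 20),
    ("is_low_depth", operator.eq, 1),
    ("support_type_a2", operator.le, 1),
    ("total_support_a2", operator.lt, 5),
    ("ci_width_a2", operator.gt, 3),
    ("allele_balance", operator.lt, 0.3),
]
OVERRIDES = {"sv": ("LOW", SV_RULES), "str": ("WARNING", STR_RULES)}


def apply_override(variant: str, cs_tier: str, features: dict):
    """Return (final_tier, fired_rule_names) for one call.

    Any fired rule replaces the CS tier with the override target.
    ASSUMPTION: rules whose feature is missing from `features` are skipped.
    """
    target, rules = OVERRIDES[variant]
    fired = [
        name
        for name, compare, threshold in rules
        if name in features and compare(features[name], threshold)
    ]
    return (target if fired else cs_tier), fired
```

For example, a precise-looking SV call with a HIGH CS tier but `is_imprecise == 1` comes back as `("LOW", ["is_imprecise"])`; returning the fired rule names makes each demotion auditable.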

Intended use & performance

  • Sample-LOSO cross-validation (143 unrelated parents): SV F1 = 0.9308, STR F1 = 0.9960.
  • Within-cohort generalization — re-scored on 3,608 unseen short-read genomes without retraining; the score distribution and per-1-Mb genomic reliability map reproduce (training↔unseen Spearman ρ = 0.90).
  • External multi-sample validation — 194 HPRC genomes (sequence-aware Truvari labels; long-read truth Sawfish/SV, TRGT/STR):
    • SV (this 23-feature model, the primary external object): per-genome median AUROC 0.927, above raw QUAL (0.741), GQ (0.476) and read-depth (0.671) and the applicable dedicated tools (duphold, Paragraph), most clearly for deletions and insertions (AUPRC 0.948, expected calibration error 0.053). Robust across the tested callers (median 0.915 on an independent DRAGEN reanalysis). The companion caller-independent sequence-only model transfers at a lower ceiling (median AUROC 0.851).
    • STR (this model): per-genome median AUROC 0.912; in-catalogue 0.909 vs out-of-catalogue 0.688 (catalogue-bound reliability atlas).
  • GIAB HG002 (single-sample supplementary check): HIGH-tier SV precision 97.7 % (consensus v0.6).
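
For readers reproducing the headline metric, "per-genome median AUROC" can be computed as in this sketch: one AUROC per genome from the pairwise (Mann-Whitney) definition, then the median across genomes. Labels are assumed to be 1 for calls concordant with the long-read truth and 0 otherwise; the function names are illustrative and the study's exact evaluation code may differ, for instance in how single-class genomes are handled.

```python
from statistics import median


def auroc(scores, labels):
    """Probability that a random concordant call outscores a random discordant one.

    Ties count as half a win. Quadratic in the number of calls, which is
    fine for a sketch but slow for full call sets.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one call of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def per_genome_median_auroc(calls_by_genome):
    """calls_by_genome: dict mapping genome id -> (scores, labels)."""
    return median(auroc(s, y) for s, y in calls_by_genome.values())
```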

Limitations (read before use)

  • External validation is multi-sample (194 HPRC genomes; see above). Outputs remain candidate triage, not validated clinical calls. The 23-feature caller-output model is robust across the tested Manta-family callers (Manta v1.6, DRAGEN), but it consumes caller-specific fields and is untested on structurally different callers (e.g. DELLY, GATK-SV); for those use the caller-independent sequence-only model (lower ceiling).
  • STR scoring is a per-locus reliability atlas. ~94 % of STR model importance is locus_conc_rate; removing it drops cross-validated AUROC 0.998 → 0.62. The score does not extend to STR loci outside the 163,726-locus catalogue (out-of-catalogue loci get a conservative score by construction), and only the HIGH tier transfers across callers — lower-tier ranking does not.
  • Trained on a single ancestry and pipeline (Illumina DRAGEN + Manta/ExpansionHunter; PacBio HiFi + Sawfish/TRGT, GRCh38). Behaviour under other pipelines is untested.
  • .joblib is a pickle: load only from trusted sources and with the pinned versions.
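
One way to act on the pickle warning is to compare a file's SHA-256 digest with a value obtained from a source you trust before unpickling it. This README does not publish checksums, so the expected digest in the sketch below is something the user must supply; `sha256_of` and `verify_file` are hypothetical helpers.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_hex: str) -> bool:
    """True when the file's digest equals the trusted digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A matching digest only shows the file is the one you expected; it does not make an untrusted pickle safe to load.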

Citation

Kim W*, Yeom K*, et al. SVSTR-Score: long-read-guided confidence scoring for short-read SV and STR genotypes. (manuscript in preparation.)

Licence

MIT.
