sctherapy-artifacts

Trained checkpoints, derived data, and evaluation results for the sctherapy_pytorch project, which predicts drug response (% inhibition) from gene expression and is benchmarked on a 12-patient AML cohort from the scTherapy paper.

Code: Tino3141/sctherapy_pytorch (FT-Transformer, GRANDE, LightGBM baseline) holds the training/eval code, scripts, and configs that produced these artifacts. As far as artifacts go, this repo is self-contained: checkpoints, the full training set, both eval cohorts, and results are all here. (Training data is also mirrored at Tino3141/lincs-pharmaco-training / Tino3141/pharmaco_flat.)

The Python snippets below (from eval.model_registry import ..., from src.model.lgbm import ...) assume you're running inside a clone of the GitHub repo; this repo holds the artifacts, not the code.

Checkpoints

Neural (checkpoints/neural/): the exact checkpoints used in the AML eval

Pulled from Weights & Biases (cpinkl/pmlr-sctherapy). These are the precise epoch checkpoints behind the per-seed predictions in results/aml_evals/, not the final best-val (epoch-10) snapshots. The eval used an earlier epoch (6–9) per run; the filename encodes the epoch, and each .pt's stored epoch field matches it. The complete sweep covers seeds {42, 43, 44} × 4 variants = 12 files.

| File | Arch | Head | Seed | Epoch used | W&B run | val_rmse @ epoch |
|---|---|---|---|---|---|---|
| ft_transformer_hill__s42__epoch7__run-mwwp1nej.pt | FT-Transformer | Hill | 42 | 7 | mwwp1nej | 10.87 |
| ft_transformer_hill__s43__epoch7__run-k36em957.pt | FT-Transformer | Hill | 43 | 7 | k36em957 | 10.81 |
| ft_transformer_hill__s44__epoch7__run-dnyzw2jk.pt | FT-Transformer | Hill | 44 | 7 | dnyzw2jk | 10.72 |
| ft_transformer_scalar__s42__epoch7__run-lv6eowft.pt | FT-Transformer | Scalar | 42 | 7 | lv6eowft | 11.01 |
| ft_transformer_scalar__s43__epoch9__run-1lx7r1vp.pt | FT-Transformer | Scalar | 43 | 9 | 1lx7r1vp | 10.91 |
| ft_transformer_scalar__s44__epoch8__run-5mtffjfa.pt | FT-Transformer | Scalar | 44 | 8 | 5mtffjfa | 10.74 |
| grande_hill__s42__epoch8__run-fhszu55u.pt | GRANDE | Hill | 42 | 8 | fhszu55u | 11.47 |
| grande_hill__s43__epoch8__run-0vsn6313.pt | GRANDE | Hill | 43 | 8 | 0vsn6313 | 11.56 |
| grande_hill__s44__epoch6__run-xutm14sz.pt | GRANDE | Hill | 44 | 6 | xutm14sz | 11.72 |
| grande_scalar__s42__epoch7__run-izgn7e8j.pt | GRANDE | Scalar | 42 | 7 | izgn7e8j | 30.78 |
| grande_scalar__s43__epoch6__run-iqngqwll.pt | GRANDE | Scalar | 43 | 6 | iqngqwll | 31.00 |
| grande_scalar__s44__epoch8__run-k2bepvf1.pt | GRANDE | Scalar | 44 | 8 | k2bepvf1 | 30.45 |
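
Because the filenames follow a fixed arch_head__s<seed>__epoch<N>__run-<id>.pt pattern, the columns above can be recovered from a directory listing. The following is a minimal sketch, assuming that pattern holds for every file; the regex and both helper functions are illustrative and not part of the project code. The val_rmse values are copied from the table above.

```python
import re
from collections import defaultdict
from statistics import mean

# Pattern inferred from the filenames listed above (an assumption, not project code).
NAME_RE = re.compile(
    r"^(?P<arch>ft_transformer|grande)_(?P<head>hill|scalar)"
    r"__s(?P<seed>\d+)__epoch(?P<epoch>\d+)__run-(?P<run>[0-9a-z]+)\.pt$"
)


def parse_checkpoint_name(name):
    """Split a checkpoint filename into arch, head, seed, epoch, and W&B run id."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {name}")
    info = match.groupdict()
    info["seed"] = int(info["seed"])
    info["epoch"] = int(info["epoch"])
    return info


# val_rmse at the archived epoch, copied from the table above.
VAL_RMSE = {
    "ft_transformer_hill__s42__epoch7__run-mwwp1nej.pt": 10.87,
    "ft_transformer_hill__s43__epoch7__run-k36em957.pt": 10.81,
    "ft_transformer_hill__s44__epoch7__run-dnyzw2jk.pt": 10.72,
    "ft_transformer_scalar__s42__epoch7__run-lv6eowft.pt": 11.01,
    "ft_transformer_scalar__s43__epoch9__run-1lx7r1vp.pt": 10.91,
    "ft_transformer_scalar__s44__epoch8__run-5mtffjfa.pt": 10.74,
    "grande_hill__s42__epoch8__run-fhszu55u.pt": 11.47,
    "grande_hill__s43__epoch8__run-0vsn6313.pt": 11.56,
    "grande_hill__s44__epoch6__run-xutm14sz.pt": 11.72,
    "grande_scalar__s42__epoch7__run-izgn7e8j.pt": 30.78,
    "grande_scalar__s43__epoch6__run-iqngqwll.pt": 31.00,
    "grande_scalar__s44__epoch8__run-k2bepvf1.pt": 30.45,
}


def mean_rmse_by_variant(val_rmse):
    """Average val_rmse over seeds for each (arch, head) variant."""
    groups = defaultdict(list)
    for name, rmse in val_rmse.items():
        info = parse_checkpoint_name(name)
        groups[(info["arch"], info["head"])].append(rmse)
    return {variant: mean(values) for variant, values in groups.items()}


for variant, rmse in sorted(mean_rmse_by_variant(VAL_RMSE).items()):
    print(variant, round(rmse, 2))
```

Averaged over the three seeds, the two FT-Transformer variants and GRANDE-Hill sit within about one RMSE point of each other, while GRANDE-Scalar is far higher.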

Each .pt is a dict with the keys model_state_dict, epoch, and val_rmse. Load it with the project's eval/model_registry.py (state-dict introspection rebuilds the architecture, so no separate config is needed), or via build_model after reading the shapes.

Why these epochs (not best-val): the evaluated checkpoint was chosen on a combination of the mean and standard deviation of validation performance, favouring a stable epoch over the single lowest-mean snapshot. The selected epoch (6–9) therefore deliberately gives up a marginally lower mean val_rmse in exchange for lower variance.
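
The exact selection rule is not spelled out here. One way to combine mean and standard deviation, shown purely as an illustration, is to penalise the mean by a multiple of the spread and pick the epoch with the lowest penalised score. The weight lam, the function name, and the numbers below are all made up for this example; they are neither the project's rule nor its real validation curves.

```python
from statistics import mean, pstdev


def pick_stable_epoch(val_rmse_by_epoch, lam=1.0):
    """Return the epoch minimising mean + lam * std of its validation RMSEs.

    val_rmse_by_epoch maps epoch -> list of repeated validation measurements
    (for example, one per validation fold). Hypothetical rule for illustration.
    """
    def score(epoch):
        values = val_rmse_by_epoch[epoch]
        return mean(values) + lam * pstdev(values)

    return min(val_rmse_by_epoch, key=score)


# Synthetic numbers: epoch 10 has the lower mean but a much larger spread.
curves = {
    7: [10.80, 10.82, 10.81],
    10: [10.20, 10.60, 11.00],
}
print(pick_stable_epoch(curves, lam=0.0))  # ranks by mean only
print(pick_stable_epoch(curves, lam=1.0))  # ranks by mean plus spread
```

With lam set to 0 the rule reduces to plain best-val selection; increasing lam shifts the choice toward epochs whose validation scores agree with each other.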

Recovering other snapshots: the absolute best-val epoch (usually epoch 10) scores a marginally lower mean val_rmse (FT-hill ~10.5–10.6, GRANDE-hill ~11.3–11.6). To pull it instead, fetch cpinkl/pmlr-sctherapy/model-<runid>:best (the alias best points to epoch 10). For any other epoch, use model-<runid>:vN, where v0 = epoch 1 … v9 = epoch 10.
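
The version numbering is zero-based while epochs are one-based, which is easy to get wrong by one. A small sketch that builds the reference string from a run id and an epoch; the helper name is hypothetical, and only the naming scheme itself comes from the description above.

```python
PROJECT = "cpinkl/pmlr-sctherapy"  # W&B entity/project named above


def artifact_ref(run_id, epoch=None):
    """Build the artifact reference for a run.

    epoch=None selects the "best" alias; otherwise the version is
    v(epoch - 1), following the v0 = epoch 1 ... v9 = epoch 10 convention.
    """
    if epoch is None:
        return f"{PROJECT}/model-{run_id}:best"
    if not 1 <= epoch <= 10:
        raise ValueError("epoch must be between 1 and 10")
    return f"{PROJECT}/model-{run_id}:v{epoch - 1}"


print(artifact_ref("k36em957"))           # best-val snapshot
print(artifact_ref("k36em957", epoch=7))  # the archived epoch for this run
```

With the wandb client installed and access to the project, a reference like this can typically be passed to wandb.Api().artifact(...) and fetched with its download() method; that step is left out here because it needs network access and credentials.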

LightGBM (checkpoints/lgbm/)

| Path | Notes |
|---|---|
| cell_seed42_ts0.8_nb1000_es50/model.txt | Cell-split baseline (held-out cell lines), 1000 rounds, ES@50. Headline LGBM used in the registry. Val RMSE ≈ 6.96, AUC ≈ 0.70 on the 12-patient benchmark. |
| lgbm_baseline_model.txt | Alternative local LGBM. |

Usage

The neural .pt files are PyTorch state-dicts ({model_state_dict, epoch, val_rmse}): weights only, no architecture. The repo's eval/model_registry.py re-infers the architecture from the tensor shapes, so loading needs nothing but the checkpoint and a registry key (ft_scalar, ft_hill, grande_scalar, grande_hill, or lgbm; the filename tells you which).

import numpy as np
import torch
from huggingface_hub import hf_hub_download
from eval.model_registry import REGISTRY, load_predictor

# Download one of the archived checkpoints (best FT-hill, seed 43, epoch 7)
ckpt = hf_hub_download(
    "Tino3141/sctherapy-artifacts",
    "checkpoints/neural/ft_transformer_hill__s43__epoch7__run-k36em957.pt",
)

pred = load_predictor(
    REGISTRY["ft_hill"],                 # key must match the checkpoint's arch/head
    device=torch.device("cpu"),
    checkpoint_override=ckpt,
)

# Inputs: gene z-scores (N, 978), ECFP4 bits (N, 1024), dose in µM (N,)
gene  = np.zeros((2, 978), dtype=np.float32)
ecfp4 = np.zeros((2, 1024), dtype=np.float32)
dose  = np.array([1.0, 10.0], dtype=np.float32)

y = pred.predict(gene, ecfp4, dose)      # → predicted % inhibition, shape (N,)
print(y)
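
Since the registry key has to match the checkpoint's arch and head, it can be derived from the filename instead of being typed by hand. A minimal sketch, assuming the ft_transformer / grande prefixes map to the ft_* / grande_* keys listed above; the helper is hypothetical and not part of eval/model_registry.py.

```python
# Prefix-to-key mapping inferred from the filenames and registry keys above.
ARCH_PREFIX = {"ft_transformer": "ft", "grande": "grande"}
HEADS = ("hill", "scalar")


def registry_key_for(path):
    """Map a neural checkpoint path to its registry key, e.g. 'ft_hill'."""
    filename = path.rsplit("/", 1)[-1]
    stem = filename.split("__", 1)[0]       # e.g. "ft_transformer_hill"
    arch, _, head = stem.rpartition("_")    # split off the trailing head name
    if arch not in ARCH_PREFIX or head not in HEADS:
        raise ValueError(f"cannot infer registry key from: {path}")
    return f"{ARCH_PREFIX[arch]}_{head}"


print(registry_key_for(
    "checkpoints/neural/ft_transformer_hill__s43__epoch7__run-k36em957.pt"
))
```

The returned string would then be used as REGISTRY[registry_key_for(ckpt)] in the loading call above.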

LightGBM (model.txt) loads directly:

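from src.model.lgbm import LGBMDrugPredictor
# Path as listed under checkpoints/lgbm/ in this repo; prefix it with your local download directory
lgbm = LGBMDrugPredictor.load("cell_seed42_ts0.8_nb1000_es50/model.txt")
# Feature layout: np.hstack([gene(978), ecfp4(1024), dose(1)])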
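
The flat feature layout in the comment above can be assembled as follows. This is a sketch of the stacking step only; whether LGBMDrugPredictor expects this matrix directly or builds it internally from separate arrays is not stated here, so check the class before relying on it. The helper name is hypothetical.

```python
import numpy as np

N_GENES, N_BITS = 978, 1024  # column counts given in the layout comment above


def flat_features(gene, ecfp4, dose):
    """Stack gene (N, 978), ecfp4 (N, 1024), and dose (N,) into one (N, 2003) matrix."""
    gene = np.asarray(gene, dtype=np.float32)
    ecfp4 = np.asarray(ecfp4, dtype=np.float32)
    dose = np.asarray(dose, dtype=np.float32).reshape(-1, 1)
    if gene.shape[1] != N_GENES or ecfp4.shape[1] != N_BITS:
        raise ValueError("unexpected gene or fingerprint width")
    if not (len(gene) == len(ecfp4) == len(dose)):
        raise ValueError("gene, ecfp4, and dose must have the same number of rows")
    return np.hstack([gene, ecfp4, dose])


X = flat_features(np.zeros((2, 978)), np.zeros((2, 1024)), [1.0, 10.0])
print(X.shape)  # (2, 2003)
```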

The AML patient inputs are in eval/aml_eval/ (model_inputs, with the ground-truth DSS to score against); the separate multi-cancer held-out set is in eval/zenodo_eval/; derived/gene_names.json gives the gene column order and derived/lincs_gene_stats.npz the (mean, std) used to z-score new expression data. To download everything at once: hf download Tino3141/sctherapy-artifacts --local-dir ./sctherapy-artifacts.
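
Z-scoring new expression data with the stored statistics amounts to a per-gene subtract-and-divide. The sketch below uses synthetic stand-ins for the statistics, because the array names inside lincs_gene_stats.npz are not documented here; the eps guard against zero standard deviations is an added safety measure, not something the project is known to use.

```python
import numpy as np


def zscore(expr, mean, std, eps=1e-8):
    """Z-score an (N, G) expression matrix with per-gene mean and std of shape (G,)."""
    expr = np.asarray(expr, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (expr - mean) / np.maximum(std, eps)


# Synthetic stand-ins. With the real file, load it with
# np.load("derived/lincs_gene_stats.npz") and inspect its .files attribute
# to find the actual array names before indexing.
rng = np.random.default_rng(0)
mean = rng.normal(size=978)
std = rng.uniform(0.5, 2.0, size=978)
expr = rng.normal(size=(3, 978))

z = zscore(expr, mean, std)
print(z.shape, z.dtype)
```

The columns of expr must be in the order given by derived/gene_names.json for the statistics to line up with the right genes.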

Derived data (derived/)

Small generated files needed to reproduce inference (not the raw sources):

| File | What |
|---|---|
| gene_names.json | 978 LINCS landmark gene symbols, in feature order |
| feature_names.json | Full flat-feature names `[genes |
| lincs_gene_stats.npz | Per-gene LINCS (mean, std) for z-scoring patient inputs |
| matched_samples.parquet | LINCS×PharmacoDB matched (cell, drug) rows |
| training_data.parquet | Full training set: 3.24M rows [sig_id, cell_iname, smiles, dose, inhibition, gene_expression(978), ecfp4(1024)] (1.5 GB) |

The full training parquet is included here so the repo is self-contained; it is also mirrored at Tino3141/lincs-pharmaco-training / Tino3141/pharmaco_flat.

Eval datasets (eval/): two separate cohorts

eval/aml_eval/: the 12-patient AML benchmark

The primary patient benchmark (scTherapy paper; consolidated from the public Tino3141/aml12-drug-response-eval dataset). Patients patient1 … patient12.

| Path | What |
|---|---|
| ground_truth/dss.csv / .parquet | Ground-truth Drug Sensitivity Scores per (patient, drug); the eval target |
| ground_truth/supplementary_scTherapy.xlsx | scTherapy supplementary Excel the DSS were extracted from |
| model_inputs/full_inputs.parquet | Assembled (patient × drug) model inputs |
| model_inputs/patient_deg_vectors.parquet | 978-gene log2FC vector per patient |
| model_inputs/drug_fingerprints.parquet | 1024-bit ECFP4 per drug |
| model_inputs/example_full_input.parquet | Small worked example of the input layout |
| degs/ | Differential-expression gene sets per patient (raw + filtered + meta) |

eval/zenodo_eval/: multi-cancer held-out set (NOT AML)

A separate single-cell drug-response set from public GEO/Zenodo studies: HNSCC (SCC47, JHU006, HN120/137), lung (PC9, H1975), CLL, HCC, breast (FCIBC02), etc., across 10 drugs. Each pseudo-bulk sample carries a binary sensitive/resistant label. Built from the ../tino/*.pt source (named after the author).

| Path | What |
|---|---|
| expression_zscore.parquet | Dose-expanded expression, LINCS z-score (built by scripts/build_zenodo_zscore.py) |
| expression_logfc.parquet | Dose-expanded expression, LINCS log2FC (built by scripts/build_zenodo_logfc.py) |
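
Because each pseudo-bulk sample has a binary label, a natural score for this cohort is ROC AUC. The sketch below is a generic pairwise AUC in plain Python, not the project's evaluation code; which label counts as positive, and whether a higher prediction means more sensitive, are assumptions to check against the eval scripts.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 0/1 (1 = positive class); ties between scores count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic pair loop is fine for cohorts of this size; for large inputs a rank-based formulation or a library implementation would be the usual choice.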

eval/reference/: shared lookups

| Path | What |
|---|---|
| lincs_landmark_genes.csv | The 978 LINCS landmark genes |
| drug_name_to_smiles.json | Drug-name → canonical SMILES map |

Each folder has its own README.md. Raw AML patient .RDS files (~5 GB) are not included here; they can be re-downloaded from Zenodo 13340927.

Results (results/)

Per-drug / per-patient prediction CSVs, metric tables, and plots from the headline runs (exploratory/superseded variants were pruned):

| Dir | Cohort | What |
|---|---|---|
| aml_evals/ | AML | Per-arch/head/seed AML predictions: the main neural-vs-LGBM comparison (filenames encode the epoch used) |
| aml_lgbm_seed42/ | AML | LGBM baseline AML eval (summary, per-drug, per-patient AUC) |
| zenodo_cluster/ | Zenodo | Run with all 5 models in one aggregated.csv + ROC plots |
| zenodo_logfc_all_final_final/ | Zenodo | Final full Zenodo-cohort run |
| zenodo_logfc_within_sweep/ | Zenodo | Within-file hyperparameter sweep grid |
| split_comparison/ | - | Random vs cell vs drug vs cell_drug split analysis (the data-leakage finding) |
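
The split types differ in which entities are kept disjoint between train and test. A minimal sketch of a grouped split, assuming rows are dicts with a cell_iname field as in the training parquet; the function name is hypothetical, and the 0.8 train fraction and seed 42 merely echo the LGBM directory name, whose exact meaning is an assumption here.

```python
import random


def group_split(rows, key, train_frac=0.8, seed=42):
    """Split rows so that no value of rows[i][key] appears in both train and test."""
    groups = sorted({row[key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = max(1, int(round(train_frac * len(groups))))
    train_groups = set(groups[:n_train])
    train = [row for row in rows if row[key] in train_groups]
    test = [row for row in rows if row[key] not in train_groups]
    return train, test


# Toy data: 5 cell lines, 3 drugs, 30 rows.
rows = [{"cell_iname": f"cell{i % 5}", "smiles": f"drug{i % 3}"} for i in range(30)]
train, test = group_split(rows, key="cell_iname")
overlap = {r["cell_iname"] for r in train} & {r["cell_iname"] for r in test}
print(len(train), len(test), overlap)
```

Passing key="smiles" gives a drug split in the same way. A random row-level split, by contrast, lets the same cell line appear on both sides, which is presumably the kind of leakage the split_comparison/ analysis examines.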