# AlloGen Inference Guide

This guide covers how to score binder designs and apply guidance with the bundled Q_θ checkpoint. The public release supports inference and guidance only; training is not included.

> **Env var.** Throughout this doc, `${ALLOGEN_ROOT}` is the path to the cloned repo. Either `cd` into it and use relative paths, or `export ALLOGEN_ROOT=/path/to/AlloGen`.

> **Python.** Use the env from `environment.yml` / `requirements.txt`. All scripts insert `code/` into `sys.path` via a `_CODE_DIR` boot block, so they work from any CWD.
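
For your own scripts outside `code/scripts/`, a similar boot step can be sketched as below. The helper name `add_code_dir` and the fallback to `ALLOGEN_ROOT` are assumptions made for illustration; the repo's actual `_CODE_DIR` block may be written differently.

```python
import os
import sys
from pathlib import Path


def add_code_dir(repo_root=None):
    """Put <repo_root>/code at the front of sys.path and return that path.

    Hypothetical helper: uses the explicit argument if given, otherwise
    $ALLOGEN_ROOT, otherwise the current working directory.
    """
    root = Path(repo_root or os.environ.get("ALLOGEN_ROOT", ".")).resolve()
    code_dir = str(root / "code")
    if code_dir not in sys.path:  # avoid duplicate entries on repeated calls
        sys.path.insert(0, code_dir)
    return code_dir
```

Calling `add_code_dir()` before `from models... import ...` lets the import resolve whether or not the script is launched from the repo root.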

---

## 1. Checkpoint

The Phase 2 weights `checkpoints/Q_theta_phase2.pt` are the **v4-S2 target-swap split** model used in the paper. Phase 1 (`Q_theta_phase1.pt`) is the DockQ regression intermediate.

Pull via Git LFS:

```bash
git lfs install
git lfs pull
```
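
If the pull is skipped, the `.pt` file on disk is only a small Git LFS pointer (a text stub that begins with the LFS spec URL), and loading it as a checkpoint will fail. The following is a minimal sketch of a sanity check; the function name `is_lfs_pointer` is illustrative and not part of the repo.

```python
from pathlib import Path

# Git LFS pointer files start with this line (LFS pointer spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` is still an LFS pointer stub rather than real data."""
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


if __name__ == "__main__":
    ckpt = Path("checkpoints/Q_theta_phase2.pt")
    if not ckpt.exists():
        print(f"{ckpt} is missing; run this from the repo root")
    elif is_lfs_pointer(ckpt):
        print(f"{ckpt} is an LFS pointer; run `git lfs pull`")
    else:
        print(f"{ckpt} looks like a real file ({ckpt.stat().st_size} bytes)")
```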

---

## 2. Score binders

### 2a. Python API

```python
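import sys
sys.path.insert(0, 'code')  # relative path: run from the repo root (${ALLOGEN_ROOT})

from models.differentiable_features import DifferentiableQTheta

# Load the Phase 2 checkpoint onto the chosen device.
scorer = DifferentiableQTheta(
    checkpoint='checkpoints/Q_theta_phase2.pt',
    device='cuda:0',
)
# Register the receptor in both conformational states.
scorer.load_receptor(
    holo_path='holo.pdb', rec_chain='A',
    apo_path='apo.pdb',   apo_chain='A',
)
# Score the same design against each state.
q_holo = scorer.score('design.pdb', binder_chain='B', state='holo')
q_apo  = scorer.score('design.pdb', binder_chain='B', state='apo')
# S is the holo-minus-apo score difference.
print(f'S = {q_holo - q_apo:.3f}')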
```
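
To score several designs against the same receptor, the two `score` calls can be wrapped in a loop. The sketch below is illustrative only: `rank_by_selectivity` is a hypothetical helper, and it assumes `scorer.score(...)` returns a scalar that `float()` can convert.

```python
def rank_by_selectivity(scorer, design_paths, binder_chain="B"):
    """Score each design in both states and sort by S = q_holo - q_apo, highest first.

    `scorer` is any object exposing score(path, binder_chain=..., state=...),
    for example a DifferentiableQTheta with a receptor already loaded.
    Returns a list of (path, q_holo, q_apo, S) tuples.
    """
    rows = []
    for path in design_paths:
        # Assumption: the returned score is convertible to a Python float.
        q_holo = float(scorer.score(path, binder_chain=binder_chain, state="holo"))
        q_apo = float(scorer.score(path, binder_chain=binder_chain, state="apo"))
        rows.append((path, q_holo, q_apo, q_holo - q_apo))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

With the `scorer` from the snippet above, `rank_by_selectivity(scorer, ['design_1.pdb', 'design_2.pdb'])` would return the designs ordered from most to least holo-selective.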

### 2b. CLI on the bundled sample

```bash
python code/scripts/evaluate.py \
    --target cam \
    --checkpoint checkpoints/Q_theta_phase2.pt \
    --data_dir data/sample/ \
    --outdir /tmp/cam_inference \
    --no_wandb
```

This scores every binder in `data/sample/cam/test.pkl` and writes `tables/eval_cam_test.json`, which reports Spearman ρ, AUC, and the selectivity gap.
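
A short sketch for reading the results back follows. It assumes the JSON lands somewhere under the `--outdir` given above, which this guide does not state explicitly, so it searches recursively. The key names inside the file are not documented here, so the sketch simply pretty-prints whatever it finds. `load_eval_json` is an illustrative name.

```python
import json
from pathlib import Path


def load_eval_json(search_root, filename="eval_cam_test.json"):
    """Find `filename` anywhere under `search_root`; return (path, parsed JSON)."""
    matches = sorted(Path(search_root).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {search_root}")
    path = matches[0]
    with open(path) as fh:
        return path, json.load(fh)


if __name__ == "__main__":
    try:
        path, results = load_eval_json("/tmp/cam_inference")
    except FileNotFoundError as err:
        print(err)
    else:
        print(f"loaded {path}")
        print(json.dumps(results, indent=2))
```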

---

## 3. Guidance methods (PXDesign)

The shipped guidance code wraps **PXDesign** as the prior and uses Q_θ as the gradient / classifier signal.

| Script | Method |
|---|---|
| `code/scripts/pxdesign_guidance/langevin_pxdesign.py` | Post-hoc Langevin refinement |
| `code/scripts/pxdesign_guidance/smc_pxdesign.py` | Sequential Monte Carlo |
| `code/scripts/pxdesign_guidance/tds_pxdesign.py` | Twisted Diffusion Sampler |
| `code/scripts/pxdesign_guidance/guided_pxdesign.py` | Classifier guidance |
| `code/scripts/pxdesign_guidance/iterative_refinement.py` | Iterative refinement loop |
| `code/scripts/pxdesign_guidance/qtheta_pxdesign.py` | Q_θ wrapper used by the above |

Common flags:

- `--checkpoint checkpoints/Q_theta_phase2.pt`
- `--holo_pdb your_holo.pdb` / `--apo_pdb your_apo.pdb`
- `--output_dir designs/`
- `--device cuda:0`
- `--seed 42`

Method-specific arguments (steps, batch sizes, guidance scales) are in each script's `argparse` block.
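
As an illustration of how the common flags combine, the sketch below assembles a command line for one of the scripts in the table. `build_guidance_cmd` is a hypothetical helper, not part of the repo, and the `extra` argument is a placeholder for whatever method-specific flags the chosen script's `argparse` block defines.

```python
import shlex
import sys


def build_guidance_cmd(script, holo_pdb, apo_pdb, output_dir="designs/",
                       checkpoint="checkpoints/Q_theta_phase2.pt",
                       device="cuda:0", seed=42, extra=()):
    """Return the argv list for one guidance script using the common flags."""
    return [
        sys.executable, script,
        "--checkpoint", checkpoint,
        "--holo_pdb", holo_pdb,
        "--apo_pdb", apo_pdb,
        "--output_dir", output_dir,
        "--device", device,
        "--seed", str(seed),
        *extra,  # method-specific flags, copied from the script's argparse block
    ]


cmd = build_guidance_cmd(
    "code/scripts/pxdesign_guidance/langevin_pxdesign.py",
    holo_pdb="your_holo.pdb",
    apo_pdb="your_apo.pdb",
)
# Print the command for inspection; run it with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when PDB paths contain spaces.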

To plug Q_θ into RFdiffusion, Proteina-ComplexA, or any other backbone prior, see `code/scripts/README.md`.

---

## 4. Bundled sample data

`data/sample/cam/test.pkl` is the held-out test split for Calmodulin (CaM), small enough to run on a laptop CPU in under a minute. **It is the only data shipped in the repo.** To score your own targets, use the Python API in §2a, which takes raw PDBs as input.

---

## 5. Training reproduction

Training data, training scripts, and per-target processed graphs are NOT shipped in this public release. The paper's main result (Phase 2 on the **v4-S2 target-swap** split) is provided as a frozen checkpoint at `checkpoints/Q_theta_phase2.pt`. Retraining requires the full pipeline, which must be requested separately.

---

## 6. Citation

```bibtex
@inproceedings{cao2026allogen,
  title     = {AlloGen: State-Selective Scoring for Allosteric Binder Design},
  author    = {Cao, Hanqun and others},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2026}
}
```