INTEGRATION_RECIPES.md — Wiring the 3-channel composer loss into your RL stack
Status: Wave 14 release reference. Supersedes the historical
docs/INTEGRATION_ARCHITECTURE.md (Recipes A–D), which is retained as background reading for the original mechanism-level diagrams.
Companion docs:
- docs/USER_GUIDE.md: narrative walk-through, sections 1–8
- docs/API_REFERENCE.md: exact kwarg signatures
- docs/TROUBLESHOOTING.md: error → fix index
- docs/V3_SUBSTRATE_COVERAGE.md: what each substrate covers
- docs/adrs/ADR-006-rl-frameworks.md: why these five recipes and not others
This document is the canonical answer to "how do I plug the 3-channel composer loss into framework X?" for the five frameworks the project supports as of Wave 14:
- TRL GRPOTrainer subclass
- VeRL custom adv_estimator + DataProto extension
- PRIME-RL custom-loss config
- Serverless Decoupled DiLoCo (Modal / HF Jobs / SageMaker)
- Monarch actor mesh (TorchForge-style topology)
Each recipe follows the same seven-part template:
- When to use it: decision criteria.
- Install command: which optional extras of composer-replication.
- Minimum-viable Python script: copy-pasteable, ≤ 60 lines.
- Decoupled DiLoCo wiring: how ServerlessExecutor + ObjectStoreAllReduce + MockManager layer on top.
- Distillation-loss wiring: how to switch DPO → SimPO and add TAID via compose_loss(..., dpo_variant=..., sdpo_wrapper=...) or the recipe's own loss-config field.
- Cost ballpark: GPU $/hr + API spend, sourced from docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md.
- Known limitations as of Wave 14.
A cross-recipe comparison matrix closes the doc.
TL;DR — the unified loss
For any of the five recipes, the v0.1 trainer step computes:
total_loss = grpo_loss
+ α * sdpo_kl_loss (channel 2 — Composer hint-distill;
optional TAID or Entropy-OPD wrapper)
+ β * trace_replay_loss (channel 3 — N-teacher DPO;
switchable to SimPO)
This is implemented once, in
composer_replication/loss.py::compose_loss,
and re-used by every recipe via the kwargs documented in
API_REFERENCE.md. The full signature — including
all ADR-007 channel-2/3 knobs (dpo_variant, sdpo_wrapper, taid_t,
simpo_beta/simpo_gamma, entropy_opd_h_max, …) — is the
single source of truth in
API_REFERENCE.md § compose_loss.
The conceptual call shape is just:
compose_loss(model, inputs, **kwargs) # see API_REFERENCE.md#compose_loss for full signature
All five recipes below either call compose_loss directly or call a
thin per-framework adapter that forwards these kwargs unchanged. Each
recipe's §5 Distillation-loss wiring documents the kwargs that
recipe uses by default and why; refer back to API_REFERENCE.md for
defaults, types, and which kwargs are mutually exclusive.
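As a rough illustration of the weighted sum above, the following plain-Python sketch combines three already-computed scalar losses. It is not the project's compose_loss: the function name combine_channels and the use of None to mark a channel whose inputs are missing from the batch are assumptions made for this example; only the default weights (0.1 and 0.05) are taken from the Recipe 1 script.

```python
from typing import Optional


def combine_channels(
    grpo_loss: float,
    sdpo_kl_loss: Optional[float],
    trace_replay_loss: Optional[float],
    alpha: float = 0.1,
    beta: float = 0.05,
) -> float:
    """Weighted sum of the three channels; a missing channel contributes 0."""
    total = grpo_loss
    if sdpo_kl_loss is not None and alpha != 0.0:
        total += alpha * sdpo_kl_loss        # channel 2
    if trace_replay_loss is not None and beta != 0.0:
        total += beta * trace_replay_loss    # channel 3
    return total


full = combine_channels(1.0, 2.0, 4.0)        # 1.0 + 0.1 * 2.0 + 0.05 * 4.0
no_hints = combine_channels(1.0, None, 4.0)   # channel 2 absent from this batch
ablated = combine_channels(1.0, 2.0, 4.0, alpha=0.0, beta=0.0)  # GRPO only
```

Setting both weights to zero reduces the total to the GRPO term, which is the ablation baseline.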
Recipe 1 — TRL GRPOTrainer subclass
1. When to use it
This is the default v0.0/v0.1 path and the one we recommend for ~99% of users today. Pick TRL when:
- Your model fits on ≤ 32 GPUs (typically ≤ 70B-param FSDP).
- You already have a HuggingFace model + tokenizer + datasets flow.
- You want minimum integration cost: ComposerReplicationTrainer is a single subclass override of _compute_loss over trl.GRPOTrainer, no Ray, no actor mesh.
- You're doing single-host (one node, possibly multi-GPU FSDP) training.
Don't pick TRL when you need >100 B-param scale, when you must async-decouple tool calls from the GPU loop, or when a Ray cluster is already in your stack (in which case Recipe 2 is cheaper).
2. Install command
pip install -e ".[train,replaysim]"
The train extra pulls trl>=0.12, peft, accelerate, and datasets.
The replaysim extra pulls data-juicer for CPU-side DPO normalization
(channel 3 cleaning step). Add [serverless] if you also want Decoupled
DiLoCo (see step 4).
3. Minimum-viable Python script
# train_trl.py — minimum viable Recipe 1
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer_replication import ComposerReplicationTrainer
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct" # swap for 7B once it works
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
dataset = load_dataset("trl-lib/tldr", split="train[:512]")
def reward_length(completions, **_):
    return [-abs(len(c) - 64) for c in completions]
trainer = ComposerReplicationTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [reward_length],
train_dataset = dataset,
# Composer extras (defaults shown):
alpha_sdpo = 0.1,
beta_replay = 0.05,
sdpo_jsd_beta = 0.5,
sdpo_temperature = 1.0,
sdpo_token_clip = None,
replay_dpo_beta = 0.1,
)
trainer.train()
Channels 2 and 3 auto-disable per step when their inputs aren't
present in the batch (e.g. batches with no error sites get
sdpo_kl=0). Set alpha_sdpo=0 / beta_replay=0 to disable globally
for ablations.
4. Decoupled DiLoCo wiring
ComposerReplicationTrainer is a single-process trainer. To run N
replicas of it under Decoupled DiLoCo, layer the serverless stack on the
outside: each replica runs the script above; MockManager stands in for
torchft.Manager on the inner loop and ObjectStoreAllReduce runs the
outer-loop pseudo-gradient exchange:
# diloco_replica.py — what each of the N replicas runs
import os
from composer_replication.diloco import make_diloco_outer_loop
from composer_replication.diloco.serverless import (
LocalProcessExecutor, ObjectStoreAllReduce, MockManager,
)
rendezvous = ObjectStoreAllReduce(
uri = "s3://my-bucket/diloco-runs/run42/",
world_size = 4,
rank = int(os.environ["REPLICA_RANK"]),
)
manager = MockManager(allreduce=rendezvous)
# trainer.optimizer is the *inner* optimizer; the outer is built here:
outer = make_diloco_outer_loop(
inner_optimizer = trainer.optimizer,
manager = manager,
sync_every_h = 500,
)
trainer.add_callback(outer.callback()) # syncs every H inner steps
trainer.train()
The driver process spins these up with any ServerlessExecutor:
# Wave 14: ModalExecutor / HFJobsExecutor are skeletons (raise NotImplementedError);
# use LocalProcessExecutor for testing. Swap once the cloud backends land.
executor = LocalProcessExecutor()
handles = executor.launch_replicas(
n_replicas = 4,
entrypoint = "diloco_replica.py",
entrypoint_args = {"rendezvous": rendezvous.uri,
"rank_env": "REPLICA_RANK"},
)
result = executor.collect(handles, timeout=3600)
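To make the outer-loop exchange concrete, here is a small plain-Python sketch of one DiLoCo outer round on toy parameter vectors. It is an illustration under stated assumptions, not the behaviour of make_diloco_outer_loop: the helper names are hypothetical, parameters are plain lists, and the outer update is plain SGD, whereas the DiLoCo paper uses Nesterov momentum for the outer optimizer.

```python
from typing import List

Vector = List[float]


def pseudo_gradient(start: Vector, local: Vector) -> Vector:
    """Outer 'gradient' of one replica: where it started minus where it ended."""
    return [s - l for s, l in zip(start, local)]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean across replicas (the role played by the all-reduce)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def outer_step(start: Vector, replicas: List[Vector], outer_lr: float = 1.0) -> Vector:
    """One outer round: average the pseudo-gradients, then take a plain SGD step."""
    avg = average([pseudo_gradient(start, r) for r in replicas])
    return [s - outer_lr * g for s, g in zip(start, avg)]


start = [1.0, 2.0]
after_h_steps = [[0.0, 2.0], [1.0, 4.0]]   # two replicas after H inner steps
new_params = outer_step(start, after_h_steps)
```

With outer_lr = 1.0 and no momentum, the update reduces to averaging the replicas' parameters; a smaller outer learning rate moves only part of the way toward that average.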
5. Distillation-loss wiring
ComposerReplicationTrainer exposes the new ADR-007 channels via the
shared compose_loss kwargs — pass them through **kwargs on the
trainer and they're forwarded to compose_loss:
trainer = ComposerReplicationTrainer(
model = model, processing_class = tokenizer,
reward_funcs = [reward_length], train_dataset = dataset,
# SimPO instead of DPO for channel 3:
dpo_variant = "simpo",
simpo_beta = 2.0,
simpo_gamma = 1.0,
# TAID for channel 2 (SakanaAI port; logit-space mix + forward-KL):
sdpo_wrapper = "taid",
taid_t = 0.4, # current TAID coeff in [0, 1];
# drive from TAIDScheduler if you want
# the paper's adaptive scheme
)
Alternatively, substitute entropy_opd for taid if you want
per-token entropy-gated forward/reverse KL instead of TAID's
linear-blend interpolation. SimPO does not use reference
log-probs: if a channel 3 batch sets dpo_chosen_ref_logprobs /
dpo_rejected_ref_logprobs, those fields are silently ignored.
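The difference between the two channel 3 variants can be seen in a per-pair sketch. This uses the standard published forms of the DPO and SimPO objectives on plain floats and is not the code in composer_replication/loss.py; the function names are hypothetical, and the defaults mirror the values shown in this recipe (replay_dpo_beta = 0.1, simpo_beta = 2.0, simpo_gamma = 1.0).

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO on summed sequence log-probs; needs reference-model log-probs."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def simpo_loss(chosen_token_lps: Sequence[float], rejected_token_lps: Sequence[float],
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: length-normalised log-probs and a target margin, no reference model."""
    avg_chosen = sum(chosen_token_lps) / len(chosen_token_lps)
    avg_rejected = sum(rejected_token_lps) / len(rejected_token_lps)
    return -log_sigmoid(beta * (avg_chosen - avg_rejected) - gamma)


dpo_example = dpo_loss(-4.0, -8.0, ref_chosen=-5.0, ref_rejected=-7.0)
simpo_example = simpo_loss([-1.0, -1.0], [-2.0, -2.0])
```

The SimPO function takes no reference arguments at all, which is why reference log-prob fields have nothing to feed into when dpo_variant is set to simpo.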
6. Cost ballpark
- GPU: single host, g5.12xlarge ($5.67/hr) or RunPod 4×A100-80GB ($5–9/hr) gets you Qwen2.5-7B at moderate throughput. For Qwen2.5-72B you'll want 2–4× H100: p5.48xlarge (~$98/hr on AWS, ~$25–30/hr on Lambda Cloud / RunPod community).
- API: channel 3 teacher replay via OpenRouter, verified at ~$0.98/trace for 50 steps × 3 teachers (spike 001). For a 100-trace curriculum that's ~$100 in teacher tokens.
- Storage: negligible until you turn on DiLoCo (then see Recipe 4).
7. Known limitations as of Wave 14
- Tool calls block the GPU. TRL's rollout is synchronous; long tool-call latency idles the trainer. Async-decouple via Recipe 2/3/5 if this matters.
- No native multi-node. TRL is single-process; multi-host scaling is via Decoupled DiLoCo (Recipe 4) on top, not via TRL itself.
- vLLM weight sync is co-located — no resharding between FSDP and TP. At 70B+ this becomes the bottleneck and you should move to Recipe 2.
- reward_funcs must be Python callables that return list[float]; shell-out reward graders need a wrapper.
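One possible shape for such a wrapper is sketched below. The helper name shell_reward, the stdin/stdout contract (one completion in, one float out), and the choice to score failed grader runs as 0.0 are all assumptions for illustration; the sketch also assumes completions arrive as plain strings rather than chat-message lists.

```python
import subprocess
import sys
from typing import Callable, List, Sequence


def shell_reward(cmd: Sequence[str], timeout_s: float = 30.0) -> Callable[..., List[float]]:
    """Wrap an external grader that reads one completion on stdin and prints a float."""

    def reward(completions, **_) -> List[float]:
        scores = []
        for text in completions:
            try:
                proc = subprocess.run(
                    list(cmd), input=text, capture_output=True,
                    text=True, timeout=timeout_s, check=True,
                )
                scores.append(float(proc.stdout.strip()))
            except (subprocess.SubprocessError, OSError, ValueError):
                scores.append(0.0)  # assumption: a failed grader run scores 0.0
        return scores

    return reward


# Stand-in grader: a Python one-liner that prints the completion's length.
grader = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
reward_chars = shell_reward(grader)
```

The returned callable has the same (completions, **kwargs) shape as reward_length in step 3, so it can be listed in reward_funcs alongside it. Spawning one process per completion is slow; a real grader would likely want batching.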
Recipe 2 — VeRL custom adv_estimator + DataProto extension
1. When to use it
Pick VeRL when:
- You need >70B-param scale or >32-GPU multi-host, and a Ray cluster is acceptable in your stack.
- You're already using or willing to adopt 3D-HybridEngine for efficient FSDP↔TP weight resharding (verified ~5× weight-sync speed-up vs co-located vLLM at 70B+).
- You need async multi-turn rollouts where tool-call latency must not block the GPU loop. VeRL's AsyncServer + AgentLoop is the best-in-class option here.
- You want extension points the framework's authors expect third parties to use: the @register_adv_est("...") decorator and the DataProto extension contract are first-class APIs.
Don't pick VeRL if you're <7B-param or single-host (overkill — Recipe 1's Trainer subclass is one file, not a Ray cluster).
2. Install command
pip install -e ".[replaysim]"
pip install verl # not packaged as an extra; pinned at >=0.3
# Optional, for the Composer adapter:
pip install -e ".[serverless]" # for Decoupled DiLoCo on top
The framework's verl adapter lives at
composer_replication.recipes.verl (currently shape-only — see
Limitations below).
3. Minimum-viable Python script
VeRL's actual entry point is a Hydra/YAML config + verl.trainer.main_ppo
CLI; the pythonic surface looks like this:
# train_verl.py — minimum viable Recipe 2 sketch
from verl.trainer.ppo import core_algos
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
from composer_replication.loss import compose_loss
@core_algos.register_adv_est("grpo_composer")
def composer_advantage(data, **kwargs):
    """Custom adv-estimator that adds SDPO + DPO channels to GRPO.

    Reads two extra DataProto keys (populated by the data prep step):
    - data.batch["sdpo_teacher_logits"] (channel 2)
    - data.non_tensor_batch["teacher_actions"] (channel 3)
    and returns the standard (advantages, returns) tuple plus a stashed
    composer-loss term consumed by the critic worker.
    """
    advantages, returns = core_algos.compute_grpo_outcome_advantage(data, **kwargs)
    composer_term = compose_loss(
        model = kwargs["actor_module"],
        inputs = data.batch,
        alpha_sdpo = 0.1,
        beta_replay = 0.05,
        dpo_variant = "dpo",
        sdpo_wrapper = "none",
    )
    data.meta_info["composer_loss"] = composer_term
    return advantages, returns
# Then in your YAML:
# algorithm:
# adv_estimator: grpo_composer
# and run: python -m verl.trainer.main_ppo --config-name composer_grpo
The full driver wires RayPPOTrainer against your config; consult VeRL's
own quickstart for the Ray-cluster boilerplate. The composer-specific
piece is just the registered estimator above.
4. Decoupled DiLoCo wiring
VeRL's actor workers run in Ray; DiLoCo replicates the whole VeRL job. Each "replica" is one Ray cluster running Recipe 2 end-to-end; the outer loop is independent of Ray and just exchanges pseudo-gradients via the object store between Ray-job invocations:
from composer_replication.diloco.serverless import (
LocalProcessExecutor, ObjectStoreAllReduce,
)
rendezvous = ObjectStoreAllReduce(
uri = "s3://verl-diloco/run/",
world_size = 4,
)
executor = LocalProcessExecutor() # Wave 14: ModalExecutor is a skeleton (raises NotImplementedError) — keep LocalProcessExecutor for now
handles = executor.launch_replicas(
n_replicas = 4,
entrypoint = "verl.trainer.main_ppo",
entrypoint_args = {
"+algorithm.adv_estimator": "grpo_composer",
"+algorithm.diloco.rendezvous": rendezvous.uri,
"+algorithm.diloco.sync_every_h": 500,
},
)
executor.collect(handles, timeout=24 * 3600)
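The Ray cluster inside each replica handles intra-replica scaling (FSDP / TP / vLLM); the object-store exchange handles cross-replica sync. The exchange pattern is the same as in Recipe 1: one pseudo-gradient upload per replica per outer round, roughly every 30 min at H=500. In bf16 that is ~2 GB per 1B parameters, so ~14 GB per round for a 7B-param model; see Recipe 4, step 6 for the storage budget.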
5. Distillation-loss wiring
The custom adv_estimator from step 3 already calls compose_loss;
flip the kwargs there to switch DPO → SimPO or add TAID:
composer_term = compose_loss(
model = kwargs["actor_module"],
inputs = data.batch,
alpha_sdpo = 0.1,
beta_replay = 0.05,
dpo_variant = "simpo", # ← SimPO swap
simpo_beta = 2.0,
simpo_gamma = 1.0,
sdpo_wrapper = "taid", # ← TAID wrap
taid_schedule_step = data.meta_info.get("global_step", 0),
taid_total_steps = 10_000,
)
VeRL's data.meta_info carries the global step automatically, which is
exactly what TAID's interpolation schedule needs. Channel 2 batches
without student_init_logits / student_init_input_ids are auto-skipped
(returns 0 for that step).
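For intuition about what a step-driven TAID coefficient does, here is a per-token sketch in plain Python. It is a simplification under stated assumptions: the linear schedule is hypothetical (how compose_loss maps taid_schedule_step and taid_total_steps to a coefficient is defined in API_REFERENCE.md, and the TAID paper describes an adaptive update), and the helper names are made up for this example. The loss follows the logit-space mix plus forward-KL description given in Recipe 1.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def linear_taid_t(step: int, total_steps: int, t_start: float = 0.0) -> float:
    """Hypothetical linear schedule: t rises from t_start to 1 over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (1.0 - t_start) * frac


def taid_loss(student_logits: List[float], teacher_logits: List[float], t: float) -> float:
    """Forward KL from the interpolated target p_t to the student, for one token."""
    mixed = [(1.0 - t) * s + t * q for s, q in zip(student_logits, teacher_logits)]
    p_t = softmax(mixed)               # target; student logits in the mix act as constants
    q_student = softmax(student_logits)
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(p_t, q_student) if p > 0.0)


t_mid = linear_taid_t(step=5_000, total_steps=10_000)
loss_mid = taid_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], t_mid)
```

At t = 0 the target equals the student and the loss vanishes; as t approaches 1 the target approaches the teacher distribution, so the student is pulled toward the teacher gradually rather than all at once.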
6. Cost ballpark
- GPU: 8× H100 (p5.48xlarge, ~$98/hr on AWS, ~$25/hr on Lambda or RunPod community) is the entry point for 70B-class. Expect 32–256 H100 for full 671B (matches DeepSeek's reported VeRL config).
- API: same ~$0.98/trace as Recipe 1 (channel 3 is a Python helper, not a VeRL primitive, so costs are framework-independent).
- Ray cluster overhead: head node + redis + dashboard adds ~1 CPU-instance ($0.10–0.50/hr) per cluster, negligible at GPU scale.
7. Known limitations as of Wave 14
- composer_replication.recipes.verl is shape-only. The decorator registration and DataProto extension are documented but not yet shipped as a runnable adapter; the Wave 14 release exposes the contract, not the glue. Expect this to land in a v0.2 follow-up spike.
- Ray dependency. Adds a heavyweight runtime; debugging cross-actor crashes can be painful. Use VeRL's --debug mode early.
- Custom-adv_estimator LOC: writing your own takes ~50–150 LOC including DataProto plumbing. Not a one-liner.
- No first-class TAID hook in VeRL itself. We route TAID through the meta_info channel; this works but means you can't use VeRL's built-in checkpoint-replay tooling without re-stamping taid_schedule_step on each replay.
Recipe 3 — PRIME-RL CustomLossConfig
1. When to use it
Pick PRIME-RL when:
- You're operating in the PRIME-Intellect / decentralized training universe and want INTELLECT-style scaling on a long-horizon training run.
- You need DPPO importance-ratio masking (the rationale most users arrive with). PRIME-RL's headline contribution is the out-of-band-token mask (not clip) on log_ratio = trainer_lp - inference_lp, with defaults low=-4.0, high=4.0.
- You want a first-class custom-loss surface: PRIME-RL ships CustomLossConfig that takes an importable Python function and a LossInputs struct exposing exactly the tensors we need (trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask). No fork, no Trainer subclass, no monkey-patch.
- You have access to multi-node infrastructure that PRIME-RL's trainer/inference/orchestrator split is designed for.
Don't pick PRIME-RL if you need full vocab logits (channel 2 SDPO requires logits not log-probs — see Limitations).
2. Install command
pip install -e ".[prime-rl,replaysim]"
# pulls prime-rl>=0.5
3. Minimum-viable Python script
PRIME-RL drives via YAML config; the only Python you write is the
custom-loss function (already shipped at
composer_replication/recipes/prime_rl/composer_loss.py). Wire it in:
# prime_rl_config.yaml: point at the framework's adapter
loss:
  custom:
    import_path: composer_replication.recipes.prime_rl.composer_loss:loss_fn
    kwargs:
      alpha_sdpo: 0.0 # channel 2 deferred in v0 (see below)
      beta_dpo: 0.0 # channel 3 emits a warning if non-zero
      dppo_mask_high: 4.0 # PRIME-RL DPPO mask bounds
      dppo_mask_low: -4.0
      epsilon: 1.0e-6
trainer:
  model: Qwen/Qwen2.5-7B-Instruct
  ... # standard PRIME-RL fields
The shipped loss_fn signature is fixed by PRIME-RL's contract:
def loss_fn(
    inputs: LossInputs,
    *,
    alpha_sdpo: float = 0.0,
    beta_dpo: float = 0.0,
    dppo_mask_high: float = 4.0,
    dppo_mask_low: float = -4.0,
    epsilon: float = 1e-6,
) -> torch.Tensor:
    # Per-token log importance ratio between trainer and inference policies.
    log_ratio = inputs.trainer_logprobs - inputs.inference_logprobs
    # DPPO: drop (mask) tokens whose ratio falls outside the band; no clipping.
    dppo_invalid = (log_ratio > dppo_mask_high) | (log_ratio < dppo_mask_low)
    keep_mask = inputs.loss_mask & ~dppo_invalid
    grpo = -(inputs.advantages * inputs.trainer_logprobs * keep_mask).sum() \
        / keep_mask.sum().clamp_min(epsilon)
    if alpha_sdpo != 0.0:
        raise NotImplementedError(
            "Channel 2 SDPO requires full-vocab logits; PRIME-RL v0.5 "
            "exposes only log-probs. Deferred to v0.2."
        )
    if beta_dpo != 0.0:
        import warnings
        warnings.warn(
            "Channel 3 trace-replay DPO is out-of-scope for PRIME-RL recipe v0",
            stacklevel=2,
        )
    return grpo
Shape note (caught in the Wave 13 cross-model review): PRIME-RL
calls the loss function once per sample; tensors are 1-D (seq,),
not batched (B, T). The 10 unit tests in
composer_replication/recipes/prime_rl/tests/test_composer_loss.py
cover this plus DPPO mask edges.
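The masking rule can be checked by hand with a torch-free sketch that mirrors the arithmetic of loss_fn above on plain lists. The function name masked_pg_loss is hypothetical and exists only for this example; the bounds and epsilon are the defaults shown above.

```python
from typing import List


def masked_pg_loss(
    trainer_lp: List[float],
    inference_lp: List[float],
    advantages: List[float],
    loss_mask: List[bool],
    low: float = -4.0,
    high: float = 4.0,
    eps: float = 1e-6,
) -> float:
    """Mean of -advantage * logprob over tokens that survive both masks."""
    kept = 0
    total = 0.0
    for t_lp, i_lp, adv, in_loss in zip(trainer_lp, inference_lp, advantages, loss_mask):
        log_ratio = t_lp - i_lp
        # Tokens outside the band are dropped entirely (masked), not clipped.
        if not in_loss or log_ratio > high or log_ratio < low:
            continue
        total += -adv * t_lp
        kept += 1
    return total / max(kept, eps)


# Third token has log_ratio = -6.0 - (-1.0) = -5.0, below low = -4.0, so it is masked.
loss = masked_pg_loss(
    trainer_lp=[-1.0, -0.5, -6.0],
    inference_lp=[-1.0, -0.5, -1.0],
    advantages=[1.0, 1.0, 1.0],
    loss_mask=[True, True, True],
)
```

Inputs are one-dimensional, matching the per-sample (seq,) shape described in the shape note.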
4. Decoupled DiLoCo wiring
PRIME-RL was designed for decentralized training and ships its own
weight-sync primitives. Stack DiLoCo on top via the
ServerlessExecutor Protocol — each replica runs an independent
PRIME-RL job pointing at the same composer_loss:loss_fn:
from composer_replication.diloco.serverless import (
LocalProcessExecutor, ObjectStoreAllReduce,
)
rendezvous = ObjectStoreAllReduce(
uri = "s3://prime-rl-diloco/run/",
world_size = 4,
)
# Wave 14: ModalExecutor is a skeleton (raises NotImplementedError until v0.x).
# Use LocalProcessExecutor for the inner-replica wiring; swap to the cloud
# executor once it lands. The DiLoCo + rendezvous code below is identical.
executor = LocalProcessExecutor()
handles = executor.launch_replicas(
n_replicas = 4,
entrypoint = "prime_rl.cli:main",
entrypoint_args = {
"config": "prime_rl_config.yaml",
"+diloco.rendezvous": rendezvous.uri,
"+diloco.sync_every_h": 500,
},
)
executor.collect(handles, timeout=24 * 3600)
Note PRIME-RL's own multi-node story (the trainer / inference / orchestrator split) is orthogonal to Decoupled DiLoCo: PRIME-RL multi-node = single replica scaled across many GPUs; DiLoCo = N independent replicas synchronizing via object store. Combine both for "big PRIME-RL job × N replicas".
5. Distillation-loss wiring
Channel 2 (SDPO + TAID + Entropy-OPD) is deferred in v0 because
PRIME-RL's LossInputs exposes log-probs, not full-vocab logits. The
SimPO swap on channel 3 is gated by the same shape constraint; the
DPPO mask itself is unaffected. To get TAID/SimPO into a PRIME-RL job
today you must:
- Switch to Recipe 1 or 2 for the SFT/distill phase.
- Use PRIME-RL only for the on-policy GRPO+DPPO phase.
The v0.2 plan (per ADR-007) is to extend LossInputs with a
teacher_logits field; the loss adapter is already shape-ready.
6. Cost ballpark
- GPU: similar profile to Recipe 2: 8–32 H100 typical, scaling to hundreds for INTELLECT-class runs. Lambda Cloud or RunPod community H100 pricing (~$2–4/hr per H100) is most cost-effective.
- API: channel 3 is gated, so the only OpenRouter spend is from the offline data-prep spike (using the verifier harness in Recipe 1 to pre-bake DPO pairs), not from the training loop itself. Order of magnitude: $50–500 for a curriculum-bake one-time, then $0/run.
- Network: PRIME-RL's own decentralized weight sync uses substantial bandwidth between training replicas (one of its design constraints); this is separate from the Decoupled DiLoCo bandwidth and shows up as a ceiling on cross-region replica placement.
7. Known limitations as of Wave 14
- Channel 2 deferred (see step 5). alpha_sdpo != 0 raises NotImplementedError.
- Channel 3 emits a warning if beta_dpo != 0; trace-replay DPO pairs must be folded into the training data (offline) rather than the loss (online) until v0.2.
- PRIME-RL ≥ 0.5 required. Earlier versions don't ship CustomLossConfig.
- Smoke test deferred. Per prime_rl_recipe.md, the runtime smoke test requires a CUDA box + prime-rl >= 0.5 install and is gated to a follow-up spike. The 10 unit tests run cleanly without GPU.
- DPPO defaults are PRIME-RL's, not ours. We pin low=-4.0, high=4.0 to match. If you change them, you're now diverging from PRIME-RL's example configs.
Recipe 4 — Serverless Decoupled DiLoCo
1. When to use it
Pick Decoupled DiLoCo when:
- You have N independent training replicas that should sync occasionally but can't (or shouldn't) cross-talk on every step.
- The cost or operational burden of an always-on multi-node cluster is unacceptable, but you're happy paying for 4× independent serverless jobs.
- Your inner trainer is one of Recipes 1–3 — DiLoCo wraps any inner optimizer; it's purely outer-loop.
- You need failure isolation: if one replica crashes, the others keep training; on restart it picks up from the last outer round.
DiLoCo's design rests on two abstractions (per ADR-005):
- ServerlessExecutor Protocol: uniform interface for spinning up N replicas across cloud backends (Modal / HF Jobs / SageMaker / k8s).
- ObjectStoreAllReduce: fsspec-backed pseudo-gradient exchange that replaces the in-process torchft.Manager.allreduce call.
The communication pattern is one S3 PutObject + N GetObjects per replica
every H inner steps, matching DiLoCo paper §3.2 (arXiv:2311.08105). For
a 1B-param model in bf16 that's ~2 GB per replica roughly every 30 min;
step 6 budgets the resulting storage.
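The put-then-get-all pattern can be sketched against a local directory standing in for the bucket. This is an illustration only, not the project's ObjectStoreAllReduce: the function name allreduce_mean, the round-NNNNNN/rank-K.json layout, JSON payloads, and polling for peers are all assumptions. The demo runs two ranks as threads in one process.

```python
import json
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def allreduce_mean(root: Path, round_id: int, rank: int, world_size: int,
                   vector: List[float], timeout_s: float = 60.0) -> List[float]:
    """Publish this rank's vector, wait for all peers, return the element-wise mean."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} must be in [0, {world_size})")
    round_dir = root / f"round-{round_id:06d}"
    round_dir.mkdir(parents=True, exist_ok=True)

    # "PutObject": write under a temp name, then rename so readers never see a partial file.
    tmp = round_dir / f"rank-{rank}.json.tmp"
    tmp.write_text(json.dumps(vector))
    tmp.replace(round_dir / f"rank-{rank}.json")

    # "N GetObjects": poll until every rank's file exists, then read them all.
    deadline = time.monotonic() + timeout_s
    paths = [round_dir / f"rank-{r}.json" for r in range(world_size)]
    while not all(p.exists() for p in paths):
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_id}: some peers never published")
        time.sleep(0.05)
    vectors = [json.loads(p.read_text()) for p in paths]
    return [sum(col) / world_size for col in zip(*vectors)]


with tempfile.TemporaryDirectory() as d, ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(allreduce_mean, Path(d), 0, rank, 2, vec)
               for rank, vec in enumerate([[1.0, 3.0], [3.0, 5.0]])]
    results = [f.result() for f in futures]
```

Keying objects by round and rank means a restarted replica can re-read the last completed round, which fits the failure-isolation property listed above. With world_size = 1 the function returns its own vector unchanged.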
2. Install command
pip install -e ".[diloco,serverless]"
# also one of the inner-trainer extras:
pip install -e ".[train]" # if the inner trainer is Recipe 1
# OR pip install verl # if the inner trainer is Recipe 2
# OR pip install -e ".[prime-rl]" # if the inner trainer is Recipe 3
3. Minimum-viable Python script
This pattern is independent of the inner trainer — pick any of Recipes
1/2/3 and wrap it with a ServerlessExecutor. The replica entrypoint
runs the inner trainer; the driver launches N of them and waits.
# diloco_driver.py — driver that launches N replicas
from composer_replication.diloco.serverless import (
LocalProcessExecutor, # for dev — runs replicas as local subprocesses
ObjectStoreAllReduce,
)
rendezvous = ObjectStoreAllReduce(
uri = "s3://my-bucket/diloco-runs/run42/", # or file:// for local
world_size = 4,
)
executor = LocalProcessExecutor() # Wave 14: ModalExecutor skeleton raises NotImplementedError; swap once cloud backend lands
handles = executor.launch_replicas(
n_replicas = 4,
entrypoint = "diloco_replica.py", # (script below)
entrypoint_args = {
"rendezvous": rendezvous.uri,
"rank_env": "REPLICA_RANK",
},
)
result = executor.collect(handles, timeout=3600)
print({h.replica_id: h.exit_code for h in result})
# diloco_replica.py — runs inside each replica
import os
from composer_replication.diloco import make_diloco_outer_loop
from composer_replication.diloco.serverless import (
ObjectStoreAllReduce, MockManager,
)
# Build inner trainer (Recipe 1 example):
from train_trl import trainer
rendezvous = ObjectStoreAllReduce(
uri = os.environ["DILOCO_RENDEZVOUS"],
world_size = 4,
rank = int(os.environ["REPLICA_RANK"]),
)
manager = MockManager(allreduce=rendezvous)
outer = make_diloco_outer_loop(
inner_optimizer = trainer.optimizer,
manager = manager,
sync_every_h = 500,
)
trainer.add_callback(outer.callback())
trainer.train()
4. Decoupled DiLoCo wiring
This recipe is the DiLoCo wiring — see step 3. The available executor adapters are:
| Executor | Status | Use case |
|---|---|---|
| LocalProcessExecutor | Production-ready | Dev loop: N subprocesses on one box |
| ModalExecutor | Skeleton (modal-client gated) | Modal cloud, $/sec billing |
| HFJobsExecutor | Skeleton (hf-hub gated) | HuggingFace Jobs, transformer-shop |
| SageMakerExecutor | Roadmap (post-v0.2) | AWS, warm-pool ~10s cold start |
| K8sExecutor | Roadmap | KubeRay / Volcano gang scheduling |
Cross-cloud replica placement (e.g. 2× Modal + 2× HF Jobs) is supported in principle — they all read/write the same S3 / GCS / HF rendezvous — but treat as experimental.
5. Distillation-loss wiring
DiLoCo is loss-agnostic — it operates purely on inner-optimizer state.
Whichever inner trainer you're running (Recipe 1, 2, or 3) handles
distillation kwargs as documented in that recipe's step 5. The only
DiLoCo-specific knob worth knowing: TAID's taid_schedule_step is a
global counter, but each replica increments it independently. If you
care about replicas all reading the same α at outer-sync time, set
taid_schedule_step = trainer.state.global_step + replica_offset and
let the outer-loop sync average them out.
6. Cost ballpark
Pulled from
docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md:
| Backend | A100-80GB $/hr | H100 $/hr | Cold-start | Notes |
|---|---|---|---|---|
| Modal | $0.00139/sec (≈ $5/hr per A100; 4× ≈ $20/hr) | ~$8/hr per H100 | 1–60s warm, 60–120s first-run | $/sec billing; no minimum |
| AWS SageMaker | $4.10/A100·hr | $12.29/hr | 2–5 min cold, ~10s warm pool | Min 60min on warm pool |
| GCP Vertex | $3.67/A100·hr | $11/hr | 2–6 min cold | 30–50% premium over raw GPU |
| Azure ML | ~$3.67/A100·hr | ~$12.25/hr | 3–8 min cold | Use curated env to cut cold-start |
| RunPod | $1.19/hr (community), $2.17 (secure) | $1.99/hr (community), $4.18 (secure) | seconds | No federation; same-DC only |
| HF Jobs | comparable to Modal | ~$8–12/hr | 30–90s | Best DX for HF-shop |
Object-store cost. ~$0.02/GB-month for S3 standard, ~$0/free-tier. Pseudo-gradients are ~2 GB per replica per outer round; for a 24-hour 4-replica run at H=500 that's ~50 outer rounds × 2 GB × 4 replicas = ~400 GB written. Free-tier blows through fast — budget $10–20 in storage.
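The storage estimate above can be reproduced with a few lines of arithmetic. The helper names are hypothetical; the inputs (24-hour run, one outer round per ~30 min, 2 bytes per parameter in bf16, 4 replicas, ~$0.02/GB-month) come from this document. The dollar figure assumes every round's objects are retained for a full month and excludes request charges.

```python
def outer_rounds(run_hours: int, minutes_per_round: int) -> int:
    """Number of complete outer rounds that fit in the run."""
    return run_hours * 60 // minutes_per_round


def payload_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of one pseudo-gradient upload in GB (bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


def total_written_gb(rounds: int, gb_per_round: float, replicas: int) -> float:
    return rounds * gb_per_round * replicas


rounds = outer_rounds(run_hours=24, minutes_per_round=30)   # 48, quoted as "~50"
per_round = payload_gb(1e9)                                  # 2.0 GB for 1B params
written = total_written_gb(rounds, per_round, replicas=4)    # 384 GB, quoted as "~400 GB"
monthly_storage_usd = written * 0.02                         # if nothing is deleted
```

Deleting objects from completed rounds keeps the stored volume near one round's worth (replicas × payload) rather than the cumulative total.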
7. Known limitations as of Wave 14
- ModalExecutor and HFJobsExecutor are skeletons. They check import modal / import huggingface_hub at adapter init time and raise; the actual launch_replicas is shape-only until the relevant spike lands. Use LocalProcessExecutor for dev.
- ObjectStoreAllReduce(world_size=1) must passthrough cleanly: the unit test test_object_store_allreduce_world_size_1_passthrough is the regression guard. Don't override unless you've read it.
- Rank validation is mandatory. Tests assert ObjectStoreAllReduce(rank=N, world_size=N) raises (rank must be < world_size); silent corruption otherwise.
- MockManager is not feature-complete. It implements the Manager.allreduce surface that DiLoCo's outer-loop needs, but not the full torchft.Manager API (no fault-tolerance, no membership protocol). Don't use it as a drop-in for live torchft.
- No native heterogeneous compute: all replicas are assumed to have the same compute shape. Mixed A100+H100 placements work but the slow replica gates outer-loop progress.
Recipe 5 — Monarch actor mesh
1. When to use it
Pick Monarch when:
- You're at TorchForge-style topology scale: trainer / generator / rewarder / N-teachers all want to be independent, asynchronously scheduled, fault-tolerant actors on a typed mesh.
- You want heterogeneous executor support: different actors run in different clouds (e.g. TrainerActor on Modal A100s, GeneratorActor on dedicated H100s, TeacherPoolActor as 0-GPU CPU pods on k8s).
- You need hot-swap of actor implementations: replace "OpenRouter teachers" with "local vLLM teachers" by changing one Monarch binding, no trainer code change.
- You're prepared to track upstream Monarch (v0.4.1 stable, v0.5 dev daily); the API is moving and v0 of this recipe is intentionally deferred per ADR-006.
Don't pick Monarch in Wave 14 unless you're explicitly scoping a v0.2+ pilot. The framework ships skeleton actors that fail-fast on instantiation; this is a reference-pattern reading exercise, not a production target.
2. Install command
pip install -e ".[prime-rl,monarch]"
# pulls monarch>=0.4.1 plus the PRIME-RL trainer used inside actors
3. Minimum-viable Python script
The framework ships skeleton actor definitions at
composer_replication/recipes/monarch/actors.py; they raise
NotImplementedError on instantiation in Wave 14. The shape of the
final answer:
# monarch_train.py — what v0.2+ usage will look like
from monarch import Actor, mesh, endpoint
from composer_replication.recipes.monarch.actors import (
TrainerActor, GeneratorActor, RewarderActor, TeacherPoolActor,
)
# Topology
trainers = mesh.spawn(TrainerActor, n=4, gpu="A100")
generator = mesh.spawn(GeneratorActor, n=1, gpu="A100")
rewarder = mesh.spawn(RewarderActor, n=1, gpu=None)
teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None)
# Wire endpoints
async def outer_step(batch_id: int):
    prompts = await trainers[0].sample_prompts.call(batch_id)
    rollouts = await generator.rollout.call(prompts)
    rewards = await rewarder.score.call(rollouts)
    teacher_acts = await teachers.replay.call([
        {"state": r["state"]} for r in rollouts
    ])
    await trainers.train_outer_step.call(
        batch_id, rollouts=rollouts, rewards=rewards,
        teacher_actions=teacher_acts,
    )

# Run
import asyncio
for batch_id in range(1000):
    asyncio.run(outer_step(batch_id))
The Composer 3-channel loss lives inside TrainerActor.train_outer_step,
which calls compose_loss(...) exactly as Recipe 1 does. The
orchestration changes; the loss math doesn't.
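Because the skeleton actors cannot be instantiated in Wave 14, the data flow of outer_step can still be exercised with plain asyncio coroutines standing in for the actor endpoints. Nothing below is Monarch API: every function is a hypothetical stub, and the payloads are toy values chosen only to show which step feeds which.

```python
import asyncio
from typing import Dict, List


# Plain coroutines standing in for actor endpoints; none of this is Monarch API.
async def sample_prompts(batch_id: int) -> List[str]:
    return [f"prompt-{batch_id}-{i}" for i in range(2)]


async def rollout(prompts: List[str]) -> List[Dict[str, str]]:
    return [{"state": p, "completion": p.upper()} for p in prompts]


async def score(rollouts: List[Dict[str, str]]) -> List[float]:
    return [float(len(r["completion"])) for r in rollouts]


async def replay(states: List[Dict[str, str]]) -> List[str]:
    return [f"teacher-action-for-{s['state']}" for s in states]


async def outer_step(batch_id: int) -> Dict[str, object]:
    prompts = await sample_prompts(batch_id)
    rollouts = await rollout(prompts)
    # Scoring and teacher replay depend only on the rollouts, so they can overlap.
    rewards, teacher_acts = await asyncio.gather(
        score(rollouts),
        replay([{"state": r["state"]} for r in rollouts]),
    )
    return {"batch_id": batch_id, "rewards": rewards, "teacher_actions": teacher_acts}


async def main(n_batches: int) -> List[Dict[str, object]]:
    # One event loop for the whole run, rather than one asyncio.run call per batch.
    return [await outer_step(b) for b in range(n_batches)]


results = asyncio.run(main(3))
```

Two design choices differ from the script above and are suggestions rather than requirements: the rewarder and teacher calls are gathered concurrently, and a single event loop drives all batches.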
4. Decoupled DiLoCo wiring
Monarch + Decoupled DiLoCo compose naturally: each TrainerActor is a
DiLoCo replica, and Monarch's supervision tree handles the failure
recovery that ADR-005 lists as a DiLoCo design constraint. The wire-up
is identical to Recipe 4's LocalProcessExecutor pattern, just running
inside Monarch instead of subprocess:
from composer_replication.diloco.serverless import (
ObjectStoreAllReduce, MockManager,
)
class TrainerActor(Actor):
    def __init__(self, rendezvous_uri: str, rank: int, world_size: int):
        self.rendezvous = ObjectStoreAllReduce(
            uri=rendezvous_uri, rank=rank, world_size=world_size,
        )
        self.manager = MockManager(allreduce=self.rendezvous)
        # ... build inner ComposerReplicationTrainer ...

    @endpoint
    async def train_outer_step(self, batch_id: int, **kw):
        # Inner H steps locally, then sync via self.rendezvous
        ...
The "object store" is the cross-actor synchronization point that doesn't go through Monarch's RDMA data plane — by design, slow syncs (S3) and fast syncs (RDMA for in-actor weight broadcast) live on different planes.
5. Distillation-loss wiring
Monarch sees the loss as opaque: it lives inside TrainerActor and
takes the same compose_loss kwargs as Recipe 1. The mesh-level
benefit is swap-by-binding: you can replace TeacherPoolActor
("OpenRouter") with a LocalVLLMTeacherActor to switch the
supplier of teacher log-probs without touching the loss config.
# Original binding — channel 3 via OpenRouter
teachers = mesh.spawn(TeacherPoolActor, n=1, gpu=None)
# Swap binding — channel 3 via local vLLM
teachers = mesh.spawn(LocalVLLMTeacherActor, n=1, gpu="A100",
model_id="Qwen/Qwen2.5-72B-Instruct")
# Trainer config unchanged:
trainer.compose_loss_kwargs = dict(
dpo_variant = "simpo", # same as before
sdpo_wrapper = "taid",
taid_schedule_step = batch_id,
taid_total_steps = 10_000,
)
6. Cost ballpark
In Wave 14: $0 (skeleton fails fast; no compute used). Projected for v0.2+:
- Mesh overhead: Monarch's coordination plane is light — typically <1% of total compute even at 4-actor scale. The dominant cost is whatever the actors run.
- Heterogeneous placement is the cost lever: e.g. a 4-trainer mesh with TeacherPoolActor on 0-GPU CPU pods can cut total $/hr by ~10–20% vs forcing all actors onto GPU nodes.
- Cluster bring-up: Monarch v0.5's Slurm backend is stable; k8s backend is dev-track; bare-metal SSH backend is documented.
7. Known limitations as of Wave 14
- Skeleton only, fails fast. Importing
actors.pyis fine; instantiatingTrainerActor(...)raisesNotImplementedError("v0 skeleton; deferred to v0.2 per ADR-006"). By design. - Upstream Monarch API is moving. v0.4.1 stable + v0.5 dev daily means breaking changes are expected. Pin to a Monarch hash if you prototype.
- TorchForge is paused. Per its own repo banner — don't take TorchForge's recipes as production patterns. Monarch alone is active; Forge as a layered framework is reference reading.
- Open question (deferred): does Monarch v0.5's Slurm backend hand-shake cleanly with the HF Jobs lifecycle? See monarch_actor_layout.md for the open-questions list.
- Open question (deferred): can TrainerActor host ComposerReplicationTrainer unmodified, or does it need a step_init/step_compute split for Monarch's async actor model?
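When prototyping against the skeleton, the fail-fast behaviour from the first bullet can be handled with an ordinary guard. The class below is a local stand-in that mimics the documented behaviour by raising the message quoted above. It is not imported from the repo, and build_actor_or_none is a hypothetical helper.

```python
class TrainerActorStandIn:
    """Local stand-in mimicking the documented v0 skeleton behaviour."""

    def __init__(self, *args, **kwargs) -> None:
        raise NotImplementedError("v0 skeleton; deferred to v0.2 per ADR-006")


def build_actor_or_none(actor_cls, **kwargs):
    """Try to construct the actor; return (None, reason) if it fails fast."""
    try:
        return actor_cls(**kwargs), None
    except NotImplementedError as exc:
        return None, str(exc)


actor, reason = build_actor_or_none(TrainerActorStandIn, lr=1e-5)
if actor is None:
    # Fall back to another recipe instead of crashing the launcher.
    print(f"Monarch recipe unavailable: {reason}")
```

Catching only NotImplementedError keeps real configuration errors visible while letting a launcher fall back to Recipes 1–4.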
Comparison matrix
| Dimension | Recipe 1 — TRL | Recipe 2 — VeRL | Recipe 3 — PRIME-RL | Recipe 4 — Serverless DiLoCo | Recipe 5 — Monarch |
|---|---|---|---|---|---|
| Maturity (Wave 14) | Production-ready | Production-ready (adapter shape-only) | Recipe ready, runtime smoke deferred | LocalProcessExecutor ready; cloud adapters skeleton | Skeleton only; v0.2+ scope |
| Supports DAPO / GRPO | GRPO ✅; DAPO via TRL master | GRPO ✅; DAPO ✅ (built-in) | GRPO+DPPO ✅ (DAPO mask is the headline) | Inherits from inner trainer | Inherits from inner trainer |
| Custom-loss extension cost (LOC) | ~30 LOC (subclass override) | ~50–150 LOC (registered estimator) | ~20 LOC (single Python fn) | 0 (transparent wrapper) | ~30 LOC (loss inside actor) |
| OpenEnv-compatible | ✅ (HF datasets layer) | ✅ (DataProto extension) | ✅ (rollout JSONL contract) | ✅ (orthogonal) | ✅ (RewarderActor binding) |
| Native multi-node | ❌ (single-host FSDP only) | ✅ (Ray cluster + 3D-HybridEngine) | ✅ (trainer/inference/orchestrator split) | ✅ (the whole point) | ✅ (mesh of actors) |
| Native Decoupled DiLoCo | ❌ — wrap with Recipe 4 | ❌ — wrap with Recipe 4 | ❌ — wrap with Recipe 4 | ✅ (this is it) | ✅ (compose with Recipe 4 inside actor) |
| License | Apache 2.0 (TRL) | Apache 2.0 (VeRL) | Apache 2.0 (PRIME-RL) | Apache 2.0 (this repo) | BSD-3 (Monarch) |
| Our recommendation (Wave 14) | Default for ≤ 70B / single-host | Pick at >70B if Ray is acceptable | Pick if PRIME-Intellect / DPPO mask is required | Stack on top of 1/2/3 for N replicas | Reference pattern only — revisit v0.2 |
Cross-recipe checklist
Regardless of which recipe you pick, these invariants are tested across the 115-test suite (post-Wave-15) and should be true of your wired-up system:
- alpha_sdpo=0 must reproduce the channel-1-only baseline bit-exact (test_compose_loss_integration.py).
- beta_replay=0 must reproduce the no-channel-3 baseline bit-exact.
- sdpo_wrapper="taid" without taid_schedule_step must raise ValueError at the first step (test_compose_loss_integration.py).
- sdpo_wrapper="taid" at taid_schedule_step / taid_total_steps = 0 must ignore the teacher signal (test_taid_loss_alpha_zero_ignores_teacher).
- sdpo_wrapper="taid" at taid_schedule_step / taid_total_steps = 1 must equal plain SDPO (test_taid_blended_logits_endpoints).
- dpo_variant="simpo" must be differentiable through the loss-of-sigmoid path (test_simpo_loss_differentiable).
- sdpo_wrapper="entropy_opd" must zero out when student ≡ teacher (test_entropy_aware_opd_zero_when_distributions_match).
- ObjectStoreAllReduce(world_size=1) must passthrough cleanly (test_object_store_allreduce_world_size_1_passthrough).
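The two TAID endpoint invariants can be pictured with a toy interpolation. The sketch assumes TAID blends student and teacher logits linearly with coefficient t = taid_schedule_step / taid_total_steps. That linear form, the function name taid_blend, and the exact ValueError condition are assumptions for illustration; this is not the signature of compose_loss or the repo's implementation.

```python
from typing import Optional, Sequence


def taid_blend(student_logits: Sequence[float], teacher_logits: Sequence[float],
               schedule_step: Optional[int], total_steps: int) -> list[float]:
    """Linearly interpolate logits with t = schedule_step / total_steps (assumed form)."""
    if schedule_step is None:
        raise ValueError("taid_schedule_step is required when sdpo_wrapper='taid'")
    t = schedule_step / total_steps
    return [(1.0 - t) * s + t * q for s, q in zip(student_logits, teacher_logits)]


student = [0.5, -1.0, 2.0]
teacher = [1.5, 0.0, -2.0]

# t = 0: the target is the student itself, so the teacher contributes nothing.
at_start = taid_blend(student, teacher, schedule_step=0, total_steps=10_000)
# t = 1: the target is exactly the teacher, i.e. the un-wrapped distillation target.
at_end = taid_blend(student, teacher, schedule_step=10_000, total_steps=10_000)
```

If a wired-up system violates either endpoint, the usual suspect is a dropped or mis-scaled taid_schedule_step at the adapter boundary rather than the blend itself.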
If any of these fail in your wired-up system, run the corresponding unit test to localize: most break because a kwarg got dropped at the adapter boundary, not because the loss math is wrong.
Picking a recipe — decision flow
- Piloting Monarch (v0.2+)? → Recipe 5.
- Else, need >70B / multi-host? → Recipe 2 (VeRL) if Ray is OK, Recipe 3 (PRIME-RL) if you're in the PRIME-Intellect / DPPO universe, otherwise wait for Recipe 5.
- Else → Recipe 1 (TRL) is the v0.0/v0.1 default.
- At any of 1–3, need N independent replicas / failure isolation? → Stack Recipe 4 (Decoupled DiLoCo) on top.
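The decision flow above can be encoded as a small function. This is a sketch: the function name, argument names, and return labels are hypothetical, and checking Ray before PRIME-RL when both apply is an assumption based on the order in which the options are listed.

```python
def pick_recipes(piloting_monarch: bool, needs_multi_host: bool,
                 ray_ok: bool, prime_universe: bool,
                 needs_replicas: bool) -> list[str]:
    """Encode the decision flow; returns the recipes to use, in stacking order."""
    if piloting_monarch:
        return ["Recipe 5 (Monarch)"]
    if needs_multi_host:
        if ray_ok:
            base = "Recipe 2 (VeRL)"
        elif prime_universe:
            base = "Recipe 3 (PRIME-RL)"
        else:
            return ["wait for Recipe 5 (Monarch)"]
    else:
        base = "Recipe 1 (TRL)"
    # Recipes 1-3 can be wrapped with Decoupled DiLoCo for N replicas.
    return [base, "Recipe 4 (Serverless DiLoCo)"] if needs_replicas else [base]


# Single-host run that also wants failure isolation across replicas.
print(pick_recipes(False, False, False, False, needs_replicas=True))
```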
Pointers to source
- Loss core: composer_replication/loss.py
- TRL trainer: composer_replication/trainer/composer_trainer.py
- PRIME-RL adapter: composer_replication/recipes/prime_rl/composer_loss.py, recipe doc: composer_replication/recipes/prime_rl/prime_rl_recipe.md
- Monarch skeleton: composer_replication/recipes/monarch/actors.py, layout doc: composer_replication/recipes/monarch/monarch_actor_layout.md
- Serverless DiLoCo: composer_replication/diloco/serverless/
- VeRL adapter (shape-only): composer_replication/recipes/verl/
- ADRs: docs/adrs/ADR-005-serverless-diloco.md, docs/adrs/ADR-006-rl-frameworks.md, docs/adrs/ADR-007-distillation-losses.md