MemRM: 1.7B QLoRA Reward Model
Description
MemRM is a QLoRA adapter trained on top of the 1.7B-parameter Qwen/Qwen3-1.7B-Base model. It predicts whether a memory compaction event in an LLM agent trajectory is HARMFUL (degrades task completion) or SAFE. The adapter is trained on the memgym-rm-train dataset (15,630 IID training pairs from SWE-Gym trajectories) and evaluated on three held-out distributions.
This model is the paper's primary published model. The 8B model (rm_v2_8b_yn_cw3) was used as a development baseline and is NOT the published model.
CRITICAL: Checkpoint and Eval File Mapping
The paper reports headline metrics from checkpoint-500 evaluated on the IID held-out set. There are two eval results files in the repository, and they describe different models:
| File | Model | AUROC | ECE | Safe-F1 | Use this? |
|---|---|---|---|---|---|
| training_output/lightweight_comparison/eval_results_1p7b_ckpt500_iid.json | 1.7B ckpt-500 (THIS MODEL) | 0.985 | 0.009 | 0.849 | YES (paper source) |
| training_output/reward_model_v2_run1/eval_results.json | 8B baseline | 0.985 | 0.011 | 0.828 | NO (different model) |
Both models happen to round to AUROC = 0.985, but secondary metrics diverge. All paper Table 3 values correspond to the 1.7B ckpt-500 file.
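Because the two files agree on AUROC after rounding, a script that consumes one of them may want to confirm which model it describes from the secondary metrics. The following is a minimal sketch, assuming the JSON exposes keys named `ece` and `safe_f1`; those key names are hypothetical, since the actual schema of the eval files is not stated here.

```python
# Reference values copied from the table above.
REFERENCE = {
    "1.7B ckpt-500": {"ece": 0.009, "safe_f1": 0.849},
    "8B baseline": {"ece": 0.011, "safe_f1": 0.828},
}

def identify_eval_file(metrics, tol=5e-4):
    """Return the model label whose secondary metrics match, or None.

    `metrics` is a dict with the hypothetical keys "ece" and "safe_f1".
    `tol` allows for values that were rounded to three decimals.
    """
    for label, ref in REFERENCE.items():
        if all(abs(metrics[key] - value) <= tol for key, value in ref.items()):
            return label
    return None
```

A result of `None` means the file matches neither row and should not be treated as a paper source.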
Model Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B-Base |
| PEFT type | LoRA (QLoRA with 4-bit quantization during training) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 (paper text incorrectly states 0; see M3) |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Max sequence length | 32,768 tokens |
| Adapter size | 24.5 MB (paper text incorrectly states 25.7 MB; see M4) |
| Training checkpoints | 6 (steps 100-600, every 100 steps) |
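As a sanity check on the adapter size, the LoRA parameter count can be estimated from the rank and target modules. This is a rough sketch: the attention shapes below (28 layers, hidden size 2048, 16 query heads and 8 key/value heads of dimension 128) are assumptions taken from the public Qwen3-1.7B configuration, not from this card, and float32 storage is also assumed.

```python
# Assumed base-model attention shapes (not stated on this card).
NUM_LAYERS = 28
HIDDEN = 2048
Q_OUT = 16 * 128   # q_proj output width
KV_OUT = 8 * 128   # k_proj / v_proj output width
RANK = 16          # LoRA rank from the table above

# (in_features, out_features) for each targeted projection.
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
}

def lora_param_count():
    # Each adapted matrix adds A (rank x in) and B (out x rank).
    per_layer = sum(RANK * (fin + fout) for fin, fout in TARGETS.values())
    return NUM_LAYERS * per_layer

params = lora_param_count()
size_bytes = params * 4            # assumes float32 storage
size_mb = size_bytes / 1e6         # decimal megabytes
size_mib = size_bytes / 2**20      # binary mebibytes
print(params, round(size_mb, 1), round(size_mib, 1))
```

Under these assumptions the estimate is 6,422,528 parameters, which is about 25.7 MB in decimal units and 24.5 MiB in binary units. That would be consistent with the two figures in the table differing only by unit convention, but this is an observation from the arithmetic, not a confirmed explanation.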
Performance
The IID metrics below are from checkpoint-500, eval file eval_results_1p7b_ckpt500_iid.json:
IID (in-distribution, n=3,007 held-out rows, 7 held-out repos)
| Metric | Value |
|---|---|
| AUROC | 0.985 |
| ECE | 0.009 |
| Accuracy | 97.3% |
| Harmful-F1 | 0.985 |
| Safe-F1 | 0.849 |
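For readers unfamiliar with the two headline metrics, the sketch below shows one common way to compute AUROC and ECE from binary labels and predicted probabilities. It is illustrative only: the binning scheme (10 equal-width bins over the positive-class probability) is an assumption, and the evaluation script used for the paper may differ.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the number of rows, which is
    fine for a sketch but slow for large evaluation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(labels, scores, n_bins=10):
    """Expected calibration error with equal-width bins over P(positive)."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # score 1.0 goes in the last bin
        bins[idx].append((y, s))
    total = len(labels)
    err = 0.0
    for b in bins:
        if not b:
            continue
        frac_pos = sum(y for y, _ in b) / len(b)
        mean_conf = sum(s for _, s in b) / len(b)
        err += len(b) / total * abs(frac_pos - mean_conf)
    return err
```

Here `labels` uses 1 for the positive class (for example HARMFUL) and `scores` holds the predicted probability of that class.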
Strategy-OOD (n=166 pre-rebuild bundle, coverage 26.5%)
| Metric | Value |
|---|---|
| Covered AUROC | 0.714 |
| Coverage rate | 26.5% (44/166) |
These numbers were computed on the pre-rebuild 166-row Strategy-OOD bundle, captured in evals/eval_results_1p7b_ckpt500_strategy_ood_v2.json shipped alongside this checkpoint. The numbers match what is reported in Paper Table 3.
Public dataset note: The Strategy-OOD dataset published as MemGym/memgym-rm-strategy-ood is a rebuilt 22-pair bundle (2026-05-06) that renders prompts in the long-context training-distribution shape (~22-38K tokens) rather than the short-template ~600-token form used for the historical eval above. The two are different artifacts:
- Use the 166-row numbers above to compare against Paper Table 3.
- Re-running this checkpoint on MemGym/memgym-rm-strategy-ood will yield different metrics: the dataset is intentionally smaller and prompt-distribution-matched, intended for downstream users to validate their own RM checkpoints rather than to reproduce paper headline numbers.
Note (Stage 1 M2): Paper Table 3 reports Strategy-OOD ECE = 0.850. Disk computation on the same 166-row file gives 0.578. Recompute on a covered subset (n=44) is pending paper-cross-check.
Scenario-OOD β WebArena V2 (n=426, coverage 20.4%)
| Metric | Value |
|---|---|
| Covered AUROC | 0.748 |
| Coverage rate | 20.4% (87/426) |
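The coverage rates in the two OOD tables are the ratio of covered rows to total rows, and the covered AUROC is computed on the covered subset only. The sketch below reproduces the reported percentages from the reported counts. The criterion that marks a row as covered is not defined in this card, so the mask is supplied by the caller.

```python
def coverage_rate(covered_mask):
    """Fraction of rows flagged as covered."""
    return sum(covered_mask) / len(covered_mask)

def covered_subset(rows, covered_mask):
    """Keep only the rows on which a covered metric would be computed."""
    return [row for row, keep in zip(rows, covered_mask) if keep]

# Masks built from the counts reported above (row order is arbitrary here).
strategy_ood = [True] * 44 + [False] * (166 - 44)
scenario_ood = [True] * 87 + [False] * (426 - 87)

print(f"Strategy-OOD coverage: {coverage_rate(strategy_ood):.1%}")
print(f"Scenario-OOD coverage: {coverage_rate(scenario_ood):.1%}")
```

With roughly a fifth to a quarter of rows covered, the covered AUROC values rest on 44 and 87 rows respectively, which is worth keeping in mind when comparing them with the IID result.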
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    "MemGym/memgym-rm-1p7b",
    subfolder="checkpoint-500",  # paper checkpoint; stored as a subdir, not a git revision
)
tokenizer = AutoTokenizer.from_pretrained("MemGym/memgym-rm-1p7b")
# Format the prompt (see memgym.training.scripts.eval_three_distributions for full format)
prompt = "[System]\nYou are a helpful assistant...\n[User]\n<trajectory>\n[Assistant]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
# Token IDs for " Y" and " N"
y_id = tokenizer.encode(" Y", add_special_tokens=False)[0]
n_id = tokenizer.encode(" N", add_special_tokens=False)[0]
last_logits = logits[0, -1, [y_id, n_id]]
prob_harmful = torch.softmax(last_logits, dim=0)[0].item()
print(f"P(HARMFUL): {prob_harmful:.3f}")
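The last step of the usage snippet is a two-way softmax over the Y and N logits, which is equivalent to a sigmoid of their difference. The sketch below shows that reduction in plain Python, together with a thresholding helper. The function names are hypothetical, and the default threshold of 0.5 is illustrative only, not a value taken from the paper.

```python
import math

def prob_harmful_from_logits(y_logit, n_logit):
    """Two-way softmax over the Y/N logits, written as a stable sigmoid."""
    gap = y_logit - n_logit
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)  # gap < 0, so this cannot overflow
    return e / (1.0 + e)

def classify(prob_harmful, threshold=0.5):
    # threshold=0.5 is an illustrative default, not a tuned value.
    return "HARMFUL" if prob_harmful >= threshold else "SAFE"

print(classify(prob_harmful_from_logits(2.0, 0.0)))
```

Only the gap between the two logits matters, so adding the same constant to both leaves the probability unchanged.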
Training Recipe
See docs/release/provenance.md artifact #13. Requires 8× A100 and ~6 hours:
# Override VENV to your environment path
VENV=/your/venv bash memgym/training/scripts/launch_train.sh \
--dataset data/world_model/training_output/reward_model_v2/reward_model_pairs_v2.jsonl \
--output-dir checkpoints/rm_v2_1p7b_qlora_32k \
--base-model Qwen/Qwen3-1.7B-Base
Note: launch_train.sh ships with a deploy-host-specific default for VENV. Always override it via the VENV= environment variable to point at your own virtualenv.
License
Apache 2.0 (code + model weights, inherited from Qwen3-1.7B-Base Apache 2.0 license).
Citation
@inproceedings{xu2026memgym,
title = {MemGym: a Long-Horizon Memory Environment for LLM Agents},
author = {Anonymous Authors},
booktitle = {Advances in Neural Information Processing Systems},
year = {2026},
note = {Under review}
}
Known Limitations and Paper-Text Discrepancies
- M3 (LoRA dropout): adapter_config.json shows lora_dropout = 0.05. Paper Appendix §MemRM states dropout = 0. The config file is authoritative; the paper text needs a correction.
- M4 (Adapter size): Disk file is 24.5 MB. Paper Appendix states 25.7 MB. The disk size is authoritative (4.7% discrepancy, likely from an earlier export).
- Checkpoint choice: Paper reports checkpoint-500 metrics. Checkpoint-600 is also available and shows marginal improvement in some secondary metrics (see threshold_sweep_1p7b_ckpt600_iid.json). Use checkpoint-500 for paper reproducibility.
- 8B confusion: The repository also contains reward_model_v2_run1/eval_results.json from an 8B development model. That file is NOT this model. Its AUROC happens to round to the same 0.985 but its other metrics differ.
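Since the config file is treated as authoritative for M3, a downloaded adapter can be checked against the Model Configuration table. The sketch below is a minimal example: the field names `r`, `lora_alpha`, `lora_dropout`, and `target_modules` follow the usual PEFT `adapter_config.json` layout, while the helper names and the file path passed by the caller are hypothetical.

```python
import json

# Values from the Model Configuration table on this card.
EXPECTED = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": {"q_proj", "k_proj", "v_proj", "o_proj"},
}

def config_mismatches(cfg):
    """Return {key: (found, expected)} for every field that differs."""
    diffs = {}
    for key, expected in EXPECTED.items():
        found = cfg.get(key)
        if key == "target_modules" and found is not None:
            found = set(found)  # order of modules is not significant
        if found != expected:
            diffs[key] = (found, expected)
    return diffs

def check_adapter_config(path):
    # `path` points at a local copy of adapter_config.json (assumed location).
    with open(path, encoding="utf-8") as fh:
        return config_mismatches(json.load(fh))
```

An empty result means the adapter on disk agrees with the table; a `lora_dropout` entry in the result would indicate a config that matches the paper text instead.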