Long-Term Memory Retrieval Benchmark
Code release for the experiments described in the accompanying paper:
- Hierarchical memory organization (User Profile / Semantic / Episodic).
- Plan-Act-Read agentic retrieval that interleaves keyword, time-filter, and embedding search.
- Flat / dense / oracle baselines for comparison.
Repository layout
.
├── main.py                          # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py                   # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                     # Model registry
├── prompts/                         # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                          # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                    # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                      # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/                  # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py     # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py            # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py             # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt
The benchmark dataset (evolv_mem_v5.json) is released separately; place it
under dataset/ along with the supporting files referenced by main.py
(all_sessions.json, all_session_summary.json, etc.).
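For example, assuming the released files were downloaded to ~/downloads (paths are placeholders; adjust to wherever your copy lives):
mkdir -p dataset
cp ~/downloads/evolv_mem_v5.json ~/downloads/all_sessions.json ~/downloads/all_session_summary.json dataset/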
Setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
API keys
The pipeline can call LLMs through four optional providers; set the key for whichever you plan to use:
| Provider | Env var | Flag |
|---|---|---|
| OpenAI-compatible inference API | NV_API_KEY | --nvidia |
| OpenAI-compatible LiteLLM proxy | LITELLM_API_KEY | --tritonai |
| Direct Anthropic API | ANTHROPIC_API_KEY | (default) |
| Azure OpenAI | AZURE_OPENAI_KEY | (default) |
Each --<flag> selects which client the pipeline uses; entries in
model_zoo.py are tagged accordingly.
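For example, to route requests through the OpenAI-compatible inference API (the key value is a placeholder):
export NV_API_KEY="<your key>"
and pass --nvidia on the main.py command line, as in the Quick start examples below.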
Quick start
1. Build the per-question retrieval cache (one-time)
python scripts/build_retrieval_cache.py \
--dataset dataset/evolv_mem_v5.json \
--all_sessions dataset/all_sessions.json \
--out_dir response_cache/retrieval/
2. Shard the dataset for parallel runs
python scripts/make_v5_shards.py \
--dataset dataset/evolv_mem_v5.json \
--ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
--out_dir output/shards/v5_run_nchunks10/ \
--num_shards 8
3. Run the QA pipeline
Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):
export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
--in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
--out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
--model_name gpt-5.5 \
--top_k 20 \
--n_chunks 10 \
--nvidia \
--all_sessions_file dataset/all_sessions.json \
--no_semantic \
--mode embed
Agentic retrieval over hierarchical memory:
python main.py \
--in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
--out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
--model_name gpt-5.5 \
--top_k 20 \
--n_chunks 10 \
--nvidia \
--all_sessions_file dataset/all_sessions.json \
--hier_v2 --hier_union \
--mode agent
To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit
scripts/slurm/example_dense_retrieval.slurm or
scripts/slurm/example_agentic_retrieval.slurm.
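Both scripts are submitted with the standard sbatch command, e.g.:
sbatch scripts/slurm/example_agentic_retrieval.slurm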
4. Merge shards and evaluate
python scripts/merge_jsonl_by_dataset_order.py \
--dataset dataset/evolv_mem_v5.json \
--parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
--out_file output/v5_run_dense_gte_topk20.jsonl
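As a quick sanity check (assuming one output line per question), the merged line count should match the number of questions in evolv_mem_v5.json:
wc -l output/v5_run_dense_gte_topk20.jsonl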
python evaluate_qa.py \
--hyp_file output/v5_run_dense_gte_topk20.jsonl \
--ref_file dataset/evolv_mem_v5.json \
--eval_model_name gpt-5.2 \
--eval_mode both \
--nvidia
The evaluator caches a per-question atomic rubric
(<dataset>.atomic-v1.rubric.json) so subsequent runs reuse it.
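To force the rubric to be regenerated, delete the cached file; its exact name follows the <dataset>.atomic-v1.rubric.json pattern, e.g.:
rm dataset/evolv_mem_v5*.atomic-v1.rubric.json   # adjust to the actual cache path on disk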
Pipeline modes
main.py --mode selects how a question is answered:
- embed: top-k flat dense retrieval (GTE-7B), then a single LLM call to answer.
- keyword: LLM-generated keywords + lexical matching, then answer.
- agent: Plan-Act-Read loop. Combine --hier_v2 (semantic-summary stage) with --hier_union (union with flat top-k) for the hierarchical-memory variant.
--no_semantic disables the semantic-summary memory layer (flat memory).
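A keyword-mode run mirrors the embed example from the Quick start; only --mode changes (the output directory here is illustrative; see main.py for the full flag list):
python main.py \
--in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
--out_file output/shards/v5_run_nchunks10/keyword_topk20/part_00.jsonl \
--model_name gpt-5.5 \
--top_k 20 \
--n_chunks 10 \
--nvidia \
--all_sessions_file dataset/all_sessions.json \
--mode keyword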
Baselines
The three external baselines (MemoChat, RAPTOR, ReadAgent) live under
baselines/ together with our thin wrappers
(run_<baseline>_baseline.py). Each baseline's upstream LICENSE is preserved.
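Each wrapper is invoked directly with Python; flags are baseline-specific, so consult the wrapper script itself (the path below only illustrates the run_<baseline>_baseline.py naming and may differ in your checkout):
python baselines/raptor/run_raptor_baseline.py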
License
This repository is released under the license stated in the corresponding LICENSE file (TBD prior to release). Upstream baselines retain their original licenses.