Long-Term Memory Retrieval Benchmark
Code release for the experiments described in the accompanying paper:
- Hierarchical memory organization (User Profile / Semantic / Episodic).
- Plan-Act-Read agentic retrieval that interleaves keyword, time-filter, and embedding search.
- Flat / dense / oracle baselines for comparison.
Repository layout
.
├── main.py                          # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py                   # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                     # Model registry
├── prompts/                         # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                          # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                    # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                      # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/                  # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py     # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py            # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py             # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt
The benchmark dataset (evolv_mem_v5.json) is released separately; place it
under dataset/ along with the supporting files referenced by main.py
(all_sessions.json, all_session_summary.json, etc.).
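For example, assuming the released files were downloaded to ~/downloads (paths are placeholders; adjust to wherever your copy lives):
mkdir -p dataset
cp ~/downloads/evolv_mem_v5.json ~/downloads/all_sessions.json ~/downloads/all_session_summary.json dataset/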
Setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
API keys
The pipeline can call LLMs through four optional providers; set the key for whichever you plan to use:
| Provider | Env var | Flag |
|---|---|---|
| OpenAI-compatible inference API | NV_API_KEY | --nvidia |
| OpenAI-compatible LiteLLM proxy | LITELLM_API_KEY | --tritonai |
| Direct Anthropic API | ANTHROPIC_API_KEY | (default) |
| Azure OpenAI | AZURE_OPENAI_KEY | (default) |
Each --<flag> selects which client the pipeline uses; entries in
model_zoo.py are tagged accordingly.
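For example, to route requests through the OpenAI-compatible inference API (the key value is a placeholder):
export NV_API_KEY="<your key>"
and pass --nvidia on the main.py command line, as in the Quick start examples below.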
Quick start
1. Build the per-question retrieval cache (one-time)
python scripts/build_retrieval_cache.py \
--dataset dataset/evolv_mem_v5.json \
--all_sessions dataset/all_sessions.json \
--out_dir response_cache/retrieval/
2. Shard the dataset for parallel runs
python scripts/make_v5_shards.py \
--dataset dataset/evolv_mem_v5.json \
--ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
--out_dir output/shards/v5_run_nchunks10/ \
--num_shards 8
3. Run the QA pipeline
Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):
export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
--in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
--out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
--model_name gpt-5.5 \
--top_k 20 \
--n_chunks 10 \
--nvidia \
--all_sessions_file dataset/all_sessions.json \
--no_semantic \
--mode embed
Agentic retrieval over hierarchical memory:
python main.py \
--in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
--out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
--model_name gpt-5.5 \
--top_k 20 \
--n_chunks 10 \
--nvidia \
--all_sessions_file dataset/all_sessions.json \
--hier_v2 --hier_union \
--mode agent
To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit
scripts/slurm/example_dense_retrieval.slurm or
scripts/slurm/example_agentic_retrieval.slurm.
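Both scripts are submitted with the standard sbatch command, e.g.:
sbatch scripts/slurm/example_agentic_retrieval.slurm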
4. Merge shards and evaluate
python scripts/merge_jsonl_by_dataset_order.py \
--dataset dataset/evolv_mem_v5.json \
--parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
--out_file output/v5_run_dense_gte_topk20.jsonl
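As a quick sanity check (assuming one output line per question), the merged line count should match the number of questions in evolv_mem_v5.json:
wc -l output/v5_run_dense_gte_topk20.jsonl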
python evaluate_qa.py \
--hyp_file output/v5_run_dense_gte_topk20.jsonl \
--ref_file dataset/evolv_mem_v5.json \
--eval_model_name gpt-5.2 \
--eval_mode both \
--nvidia
The evaluator caches a per-question atomic rubric
(<dataset>.atomic-v1.rubric.json) so subsequent runs reuse it.
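To force the rubric to be regenerated, delete the cached file; its exact name follows the <dataset>.atomic-v1.rubric.json pattern, e.g.:
rm dataset/evolv_mem_v5*.atomic-v1.rubric.json   # adjust to the actual cache path on disk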
Pipeline modes
main.py --mode selects how a question is answered:
- embed: top-k flat dense retrieval (GTE-7B), then a single LLM call to answer.
- keyword: LLM-generated keywords + lexical matching, then answer.
- agent: Plan-Act-Read loop. Combine --hier_v2 (semantic-summary stage) with --hier_union (union with flat top-k) for the hierarchical-memory variant.
--no_semantic disables the semantic-summary memory layer (flat memory).
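A keyword-mode run mirrors the embed example from the Quick start; only --mode changes (the output directory here is illustrative; see main.py for the full flag list):
python main.py \
--in_file output/shards/v5_run_nchunks10/dataset/shard_00.json \
--out_file output/shards/v5_run_nchunks10/keyword_topk20/part_00.jsonl \
--model_name gpt-5.5 \
--top_k 20 \
--n_chunks 10 \
--nvidia \
--all_sessions_file dataset/all_sessions.json \
--mode keyword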
Baselines
The three external baselines (MemoChat, RAPTOR, ReadAgent) live under
baselines/ together with our thin wrappers
(run_<baseline>_baseline.py). Each baseline's upstream LICENSE is preserved.
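Each wrapper is invoked directly with Python; flags are baseline-specific, so consult the wrapper script itself (the path below only illustrates the run_<baseline>_baseline.py naming and may differ in your checkout):
python baselines/raptor/run_raptor_baseline.py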
License
This repository is released under the license stated in the corresponding LICENSE file (TBD prior to release). Upstream baselines retain their original licenses.