Long-Term Memory Retrieval Benchmark

Code release for the experiments described in the accompanying paper. Key components:

  • Hierarchical memory organization (User Profile / Semantic / Episodic).
  • Plan-Act-Read agentic retrieval that interleaves keyword, time-filter, and embedding search.
  • Flat / dense / oracle baselines for comparison.

Repository layout

.
├── main.py                       # End-to-end QA pipeline (agent, embed, keyword modes)
├── evaluate_qa.py                # Atomic-rubric QA evaluator (strict + partial)
├── model_zoo.py                  # Model registry
├── prompts/                      # Prompt templates
│   ├── agentic_retrieval_prompt.txt
│   ├── agentic_retrieval_prompt_wo_profile.txt
│   ├── keyword_search_prompt.txt
│   └── read_and_extract_prompt.txt
├── memory/                       # Episodic + semantic memory stores
├── baselines/
│   ├── MemoChat/                 # MemoChat baseline (upstream code + our wrapper)
│   ├── raptor/                   # RAPTOR baseline (upstream code + our wrapper)
│   └── read-agent/               # ReadAgent baseline wrapper
├── scripts/
│   ├── build_retrieval_cache.py  # Pre-compute GTE-7B embeddings for the corpus
│   ├── make_v5_shards.py         # Deterministic shard split by question_id
│   ├── merge_jsonl_by_dataset_order.py
│   ├── run_oracle_qa.py          # Gold-session-only upper bound
│   ├── plot_main_results.py
│   ├── llm_judge_agreement.py
│   └── slurm/
│       ├── example_dense_retrieval.slurm
│       └── example_agentic_retrieval.slurm
└── requirements.txt

The benchmark dataset (evolv_mem_v5.json) is released separately; place it under dataset/ along with the supporting files referenced by main.py (all_sessions.json, all_session_summary.json, etc.).
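
A quick sanity check that the separately released files are in place (a minimal sketch; the file list follows the files named above):

# Verify the dataset files main.py expects are present under dataset/
for f in dataset/evolv_mem_v5.json dataset/all_sessions.json dataset/all_session_summary.json; do
    [ -f "$f" ] && echo "ok      $f" || echo "MISSING $f"
done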

Setup

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

API keys

The pipeline calls LLMs through four optional providers; set whichever you plan to use:

Provider                          Env var              Flag
OpenAI-compatible inference API   NV_API_KEY           --nvidia
OpenAI-compatible LiteLLM proxy   LITELLM_API_KEY      --tritonai
Direct Anthropic API              ANTHROPIC_API_KEY    (default)
Azure OpenAI                      AZURE_OPENAI_KEY     (default)

Each --<flag> selects which client the pipeline uses; entries in model_zoo.py are tagged accordingly.
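
For example, to route calls through the OpenAI-compatible inference API:

export NV_API_KEY="<your-key>"   # then pass --nvidia to main.py / evaluate_qa.py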

Quick start

1. Build the per-question retrieval cache (one-time)

python scripts/build_retrieval_cache.py \
    --dataset dataset/evolv_mem_v5.json \
    --all_sessions dataset/all_sessions.json \
    --out_dir response_cache/retrieval/

2. Shard the dataset for parallel runs

python scripts/make_v5_shards.py \
    --dataset dataset/evolv_mem_v5.json \
    --ret_cache_jsonl response_cache/retrieval/flat-gte/v5_retrievallog_turn_flat-gte \
    --out_dir output/shards/v5_run_nchunks10/ \
    --num_shards 8

3. Run the QA pipeline

Flat dense retrieval @ top-k=20 (single shard, e.g. for smoke testing):

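# Note: ret_cache is not passed as a flag below; it appears to be read by main.py from the environment.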
export ret_cache="output/shards/v5_run_nchunks10/ret_cache/shard_00.jsonl"
python main.py \
    --in_file  output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/dense_gte_topk20/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --no_semantic \
    --mode embed

Agentic retrieval over hierarchical memory:

python main.py \
    --in_file  output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/agentic_hier/part_00.jsonl \
    --model_name gpt-5.5 \
    --top_k 20 \
    --n_chunks 10 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --hier_v2 --hier_union \
    --mode agent

To launch the full 8-shard parallel sweep on a SLURM cluster, edit and submit scripts/slurm/example_dense_retrieval.slurm or scripts/slurm/example_agentic_retrieval.slurm.
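
After filling in the cluster-specific fields, submission is a standard sbatch call:

sbatch scripts/slurm/example_dense_retrieval.slurm
sbatch scripts/slurm/example_agentic_retrieval.slurm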

4. Merge shards and evaluate

python scripts/merge_jsonl_by_dataset_order.py \
    --dataset dataset/evolv_mem_v5.json \
    --parts_glob "output/shards/v5_run_nchunks10/dense_gte_topk20/part_*.jsonl" \
    --out_file output/v5_run_dense_gte_topk20.jsonl

python evaluate_qa.py \
    --hyp_file output/v5_run_dense_gte_topk20.jsonl \
    --ref_file dataset/evolv_mem_v5.json \
    --eval_model_name gpt-5.2 \
    --eval_mode both \
    --nvidia

The evaluator caches an atomic rubric per question (in <dataset>.atomic-v1.rubric.json), so subsequent runs reuse it.
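
To force the rubrics to be regenerated (e.g. after the reference file changes), delete the cache file; assuming it sits next to the default dataset, that is:

rm dataset/evolv_mem_v5.atomic-v1.rubric.json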

Pipeline modes

main.py --mode selects how a question is answered:

  • embed: top-k flat dense retrieval (GTE 7B), then a single LLM call to answer.
  • keyword: LLM-generated keywords + lexical matching, then answer.
  • agent: Plan-Act-Read loop. Combines --hier_v2 (semantic-summary stage) and --hier_union (union with flat top-K) for the hierarchical-memory variant.

--no_semantic disables the semantic-summary memory layer (flat memory).
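
For completeness, a keyword-mode run mirrors the embed example from the Quick start; the keyword_gte output directory below is only an illustrative name:

python main.py \
    --in_file  output/shards/v5_run_nchunks10/dataset/shard_00.json \
    --out_file output/shards/v5_run_nchunks10/keyword_gte/part_00.jsonl \
    --model_name gpt-5.5 \
    --nvidia \
    --all_sessions_file dataset/all_sessions.json \
    --mode keyword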

Baselines

The three external baselines (MemoChat, RAPTOR, ReadAgent) live under baselines/ together with our thin wrappers (run_<baseline>_baseline.py). Each baseline's upstream LICENSE is preserved.

License

This repository is released under the license stated in the corresponding LICENSE file (TBD prior to release). Upstream baselines retain their original licenses.
