Skip to main content

Memory Retrieval Strategies

Storing memories is only half the problem. The other half—often the harder half—is getting the right memories back when you need them. Ask the wrong question of your memory system and you get irrelevant context. Ask the right question the wrong way and you miss crucial information.

This lesson covers four retrieval strategies: recency-based (what happened recently), relevance-based (what matches the query), entity-based (what involves specific entities), and hybrid (combining all three). You'll learn when to use each and how to manage the constraint that makes it all tricky: the token budget.

The Retrieval Challenge

When a user asks "What about the project?"—which memories should the agent retrieve?

Available Memories:
├── "Alex completed OAuth implementation" (yesterday)
├── "Phoenix project deadline is Jan 31" (2 weeks ago)
├── "Alex prefers morning meetings" (1 month ago)
├── "Sprint planning discussed Phoenix blockers" (3 days ago)
├── "User's favorite color is blue" (2 months ago)
└── "Phoenix project uses auth-service" (2 weeks ago)

Different strategies give different answers:

StrategySelected Memories
RecencyOAuth implementation, Sprint planning
RelevancePhoenix deadline, Phoenix uses auth-service
Entity ("Phoenix")Phoenix deadline, Sprint planning, Phoenix uses auth-service
HybridPhoenix deadline, Sprint planning, OAuth implementation

The right choice depends on context. Let's explore each strategy.

Recency-Based Retrieval

Principle: Most recent memories are most likely relevant to current context.

When to Use

  • Continuing an ongoing conversation
  • Following up on recent events
  • When the user references "what we just discussed"

Implementation

def get_recent_memories(user_id: str, limit: int = 10):
"""Retrieve most recent memories regardless of content."""
return memory.search(
query="", # Empty query = no semantic filtering
filters={"user_id": user_id},
sort_by="created_at",
sort_order="desc",
limit=limit
)

Output:

[
{"memory": "Alex completed OAuth implementation", "created_at": "2025-01-20"},
{"memory": "Sprint planning discussed Phoenix blockers", "created_at": "2025-01-18"},
{"memory": "Phoenix project deadline is Jan 31", "created_at": "2025-01-06"},
...
]

Limitations

  • Misses old but highly relevant memories
  • Recency doesn't equal importance
  • Fails for queries about past events

Principle: Memories semantically similar to the query are most relevant.

When to Use

  • Answering specific questions
  • Looking up facts
  • When the query clearly states what's needed

Implementation

def get_relevant_memories(user_id: str, query: str, limit: int = 10, threshold: float = 0.5):
"""Retrieve memories semantically similar to query."""
results = memory.search(
query=query,
filters={"user_id": user_id},
limit=limit
)

# Filter by similarity threshold
return [r for r in results['results'] if r['score'] >= threshold]

Example:

query = "What is the project deadline?"

results = get_relevant_memories("alex", query)

Output:

[
{"memory": "Phoenix project deadline is Jan 31", "score": 0.91},
{"memory": "Sprint planning discussed Phoenix blockers", "score": 0.67},
{"memory": "Phoenix project uses auth-service", "score": 0.54},
]

How It Works

  1. Query is embedded into a vector
  2. Memory vectors are compared using cosine similarity
  3. Highest-similarity memories are returned
Query: "What is the project deadline?"
↓ (embedding)
[0.1, 0.8, 0.3, 0.5, ...]

Compare to each memory vector:
Memory A: "Phoenix project deadline is Jan 31"
[0.15, 0.75, 0.28, 0.48, ...] → similarity: 0.91

Memory B: "Alex prefers morning meetings"
[0.7, 0.1, 0.9, 0.2, ...] → similarity: 0.23

Limitations

  • May miss memories with different wording
  • Semantic similarity isn't always relevance
  • Requires good embeddings

Entity-Based Retrieval

Principle: When a specific entity (person, project, concept) is mentioned, retrieve all memories involving that entity.

When to Use

  • User mentions a name or project
  • Building context around a specific topic
  • Answering "What do you know about X?"

Implementation

def get_entity_memories(user_id: str, entity: str, limit: int = 10):
"""Retrieve memories involving a specific entity."""
# Option 1: Semantic search for entity name
return memory.search(
query=entity,
filters={"user_id": user_id},
limit=limit
)

# Option 2: Metadata filter (if entities are tagged)
# return memory.search(
# query="",
# filters={
# "user_id": user_id,
# "entities": {"contains": entity}
# }
# )

Entity Extraction

First, extract entities from the user's message:

def extract_entities(message: str) -> list:
"""Extract named entities from message."""
# Using LLM for extraction
prompt = f"""
Extract named entities from this message.
Categories: people, projects, companies, products, dates

Message: {message}

Return as JSON: {{"entities": [{{"name": "...", "type": "..."}}]}}
"""
result = llm.generate(prompt)
return json.loads(result)['entities']

Example:

message = "What's happening with the Phoenix project that Alex is working on?"

entities = extract_entities(message)
# [{"name": "Phoenix", "type": "project"}, {"name": "Alex", "type": "person"}]

# Retrieve memories for each entity
for entity in entities:
memories = get_entity_memories(user_id, entity['name'])

Limitations

  • Entity extraction can miss or misidentify entities
  • Same entity may have multiple names ("Phoenix", "the project")
  • Requires entity disambiguation

Hybrid Retrieval

Principle: Combine multiple strategies to get the best of all worlds.

When to Use

  • Most production scenarios
  • When query type is unclear
  • When maximum recall is important

Implementation

def hybrid_retrieve(user_id: str, query: str, limit: int = 10):
"""Combine recency, relevance, and entity retrieval."""

# 1. Get relevant memories (semantic)
relevant = memory.search(
query=query,
filters={"user_id": user_id},
limit=limit
)['results']

# 2. Extract entities and get entity-based memories
entities = extract_entities(query)
entity_memories = []
for entity in entities:
entity_results = memory.search(
query=entity['name'],
filters={"user_id": user_id},
limit=5
)['results']
entity_memories.extend(entity_results)

# 3. Get recent memories
recent = memory.search(
query="",
filters={"user_id": user_id},
sort_by="created_at",
sort_order="desc",
limit=5
)['results']

# 4. Combine and deduplicate
all_memories = deduplicate(relevant + entity_memories + recent)

# 5. Re-score with weighted formula
scored = []
for mem in all_memories:
score = calculate_hybrid_score(
memory=mem,
query=query,
semantic_weight=0.5,
recency_weight=0.3,
entity_weight=0.2
)
scored.append({**mem, "hybrid_score": score})

# 6. Sort by hybrid score and return top
scored.sort(key=lambda x: x['hybrid_score'], reverse=True)
return scored[:limit]

Weighted Scoring

def calculate_hybrid_score(memory, query, semantic_weight, recency_weight, entity_weight):
"""Calculate weighted hybrid score."""

# Semantic similarity (from vector search)
semantic = memory.get('score', 0)

# Recency decay
days_old = (datetime.now() - memory['created_at']).days
recency = math.exp(-days_old / 30)

# Entity match boost
entities = extract_entities(query)
entity_match = 1.0 if any(e['name'].lower() in memory['memory'].lower() for e in entities) else 0.0

return (
semantic_weight * semantic +
recency_weight * recency +
entity_weight * entity_match
)

The Token Budget Constraint

You can retrieve 100 relevant memories, but you can only inject so many into the context window.

The Problem

Context Window: 128,000 tokens
System Prompt: 2,000 tokens
User Message: 500 tokens
Available for Memories: 4,000 tokens (budget decision)
Reserved for Response: 2,000 tokens
Remaining for Conversation: 119,500 tokens

Average Memory Size: 200 tokens
Maximum Memories: 4,000 / 200 = 20 memories

Token-Aware Selection

def select_within_budget(memories: list, token_budget: int) -> list:
"""Select highest-scored memories that fit within token budget."""
selected = []
current_tokens = 0

# Memories should already be sorted by score
for memory in memories:
mem_tokens = estimate_tokens(memory['memory'])

if current_tokens + mem_tokens <= token_budget:
selected.append(memory)
current_tokens += mem_tokens
else:
# Check if we can fit a smaller memory
continue

return selected

def estimate_tokens(text: str) -> int:
"""Estimate token count. Rule of thumb: ~4 chars per token."""
return len(text) // 4

Budget Allocation Strategies

Fixed Budget:

# Always use same budget
memory_budget = 4000 # tokens

Dynamic Budget:

# Adjust based on query complexity
if is_simple_query(query):
memory_budget = 2000
elif is_complex_query(query):
memory_budget = 6000
else:
memory_budget = 4000

Priority Tiers:

# Allocate budget across tiers
budgets = {
"critical": 1500, # Must-have context
"relevant": 2000, # Highly relevant
"background": 500 # Nice to have
}

for tier, budget in budgets.items():
tier_memories = get_memories_by_priority(tier)
selected.extend(select_within_budget(tier_memories, budget))

Strategy Selection Guide

ScenarioRecommended Strategy
"What did we just discuss?"Recency
"What's the project deadline?"Relevance
"Tell me about Alex"Entity
"Help me with the next step"Hybrid (recency + relevance)
"What do you know about me?"Entity (user) + Relevance
Complex, unclear queryHybrid

Complete Retrieval Pipeline

class MemoryRetriever:
def __init__(self, memory_client, token_budget: int = 4000):
self.memory = memory_client
self.token_budget = token_budget

def retrieve(self, user_id: str, query: str) -> list:
"""Full retrieval pipeline."""

# 1. Classify query type
query_type = self.classify_query(query)

# 2. Choose strategy
if query_type == "recent":
memories = self.recency_retrieve(user_id, query)
elif query_type == "specific":
memories = self.relevance_retrieve(user_id, query)
elif query_type == "entity":
memories = self.entity_retrieve(user_id, query)
else:
memories = self.hybrid_retrieve(user_id, query)

# 3. Apply token budget
selected = self.select_within_budget(memories, self.token_budget)

# 4. Format for injection
return self.format_for_context(selected)

def format_for_context(self, memories: list) -> str:
"""Format memories for prompt injection."""
if not memories:
return ""

lines = ["Relevant context from previous interactions:"]
for mem in memories:
lines.append(f"- {mem['memory']}")

return "\n".join(lines)

Try With AI

Use these prompts to practice retrieval strategy design with Claude or your preferred AI assistant.

Prompt 1: Hybrid Strategy Design

Design a hybrid retrieval strategy for a project management agent.

The agent needs to answer questions like:
- "What's the status of Project Alpha?" (entity + relevance)
- "What did we discuss yesterday?" (recency)
- "Who's responsible for the API?" (entity + relevance)
- "What are my priorities?" (entity + recency + relevance)

For each query type:
1. Which retrieval strategies should be combined?
2. What weights would you assign to each strategy?
3. Show example memories that would be retrieved

Then design a query classifier that determines which strategy to use.

What you're learning: Different queries need different strategy mixes. A hybrid approach with query classification adapts to user needs dynamically.

Prompt 2: Token Budget Optimization

Your agent has 20 relevant memories but only 2000 tokens of budget.

Memories (with token counts):
1. "User prefers Python over JavaScript" (50 tokens) - relevance: 0.92
2. "Project Phoenix deadline Jan 31" (60 tokens) - relevance: 0.88
3. "User completed 5 tasks yesterday" (80 tokens) - relevance: 0.75
4. "Phoenix uses microservices architecture" (150 tokens) - relevance: 0.85
5. "User had meeting with Sarah about auth" (100 tokens) - relevance: 0.70
... (15 more memories)

Design an algorithm that:
1. Maximizes total relevance within the token budget
2. Ensures diversity (not all memories about same topic)
3. Handles the case where a high-relevance memory is also very long

Show which memories would be selected and the reasoning.

What you're learning: Token budget management is an optimization problem. Sometimes a shorter, less relevant memory is better than a longer, more relevant one. Diversity constraints prevent over-concentration on one topic.

Prompt 3: Entity-Based Retrieval System

Build an entity-based retrieval system for a customer relationship management (CRM) agent.

Entity types:
- Companies (customers)
- Contacts (people at companies)
- Deals (sales opportunities)
- Products
- Issues (support tickets)

Design:
1. How to extract these entities from user queries
2. How to link related entities (e.g., Contact → works at → Company)
3. A retrieval strategy that follows entity relationships

Example query: "What's happening with the Acme deal?"

Show the retrieval steps:
- Entity extraction
- Relationship traversal
- Memory retrieval
- Final context assembly

What you're learning: Entity-based retrieval shines in domain-specific agents. A CRM agent needs to understand relationships—when you ask about a deal, you probably also need context about the company and contacts involved.