Give Your AI Searchable Context: RAG on Postgres with pgvector — A Crash Course
15 Concepts · 80% of Real Use · Built by your agent, not by hand
Imagine you could tell your computer: "Take this folder of documents, stand up a database that understands their meaning, and build me a search where asking for 'Empire State of Mind' returns quotes about New York — even the ones that never say the word 'New York'." And it actually does it. Not a slideshow about vector databases. A running system you can query.
That is what this course teaches — and the twist is that you will not write the SQL by hand. You will direct Claude Code or OpenCode to install the extensions, design the schema, generate the embeddings, build the search, and benchmark the indexes. Your job is to know enough about Postgres and pgvector to give clear instructions and to judge whether the agent got it right. That second part is the whole skill.
This crash course teaches 15 concepts that cover about 80% of what you will use day to day. By the end, you will have built a working retrieval system, and you will know which knobs matter and which to leave alone.
One idea that makes everything click: you already learned, in the Agentic Coding Crash Course, that the entire game with a coding agent is giving it the right information at the right time and keeping irrelevant information out — that's what the rules file, /compact, and on-demand file loading are for. A vector database is how you do exactly that for your application's AI, for your users. RAG is context management, one layer down. Every section here connects back to that.
This course assumes you have done the Agentic Coding Crash Course — you should be comfortable driving Claude Code or OpenCode, using plan mode, and managing context. It also builds on AI Prompting in 2026. If "embedding" and "vector" are brand new to you, don't worry — Concept 2 covers them from zero.
Two tools, one discipline (again)
Same pairing as the coding course: Claude Code and OpenCode. Here, both are the driver and your Postgres — a serverless Neon database — is the subject. The agent operates Neon through the Neon MCP server: it creates the project, opens a branch, runs the SQL, and previews each migration on that branch before committing it. You never click around the Neon console or open psql yourself. Everything in this course works in either tool — where a command genuinely differs, I call out the difference inline.
What this course covers
| Part | Topic | What you learn |
|---|---|---|
| 1 | Foundations | The AI-engineer mindset, vectors in one minute, what runs on Neon, connecting your agent to Neon |
| 2 | Your First RAG | Schema, the embedding worker (and choosing a model), chunking, semantic search, RAG |
| 3 | Making Search Fast | Why and when to index, choosing an index, and tuning recall with ef_search |
| 4 | Making Search Good | Eval-driven development, filtered search, hybrid search, multi-tenancy, text-to-SQL |
| 5 | Full Worked Example | One complete task — empty database to working RAG — driven in both tools |
| 6 | Ship It as a Tool | Wrap your RAG in an MCP server so any agent — Claude Code, OpenCode, a Digital FTE — can call it |
| 7 | Where It Runs | Neon branching, the embedding worker, and production notes |
| 8 | Hand It to an Agent | Plug your RAG tool into an agent (OpenAI Agents SDK) — the bridge to the Build AI Agents course |
Learn by doing? Skip to Part 5 for the full worked example first, then come back for the why.
New to all of this? Read Concepts 1–9 and the worked example in Part 5 — together they're a complete, working RAG system, end to end. Treat Parts 3–4 (Concepts 10–15) as reference you come back to when search needs to be faster or sharper, not something to absorb in one sitting.
How long is this? The 15 concepts are a ~90-minute read. Parts 5–8 — the worked example, shipping an MCP server, production notes, and the bridge to building agents — are build-and-reference: work them at the keyboard across a few sessions, not in one pass.
By the end you'll have a working RAG repo: a docs/ corpus loaded into a Neon (or TigerData) dev branch with pgvector enabled, an embedding worker that keeps the vectors in sync, a semantic-search query, an answer_question() function, a 5-question eval report, and — optionally — an MCP server that hands the whole thing to any agent. Budget the ~90-minute read plus a few hours at the keyboard for the full build.
Set up your environment (once)
This course is built by directing your agent. When you're ready to build — at Part 5, or now if you'd rather have it waiting — you start from a small base that already has the Neon and Context7 MCP servers wired and a short rules file your agent reads. You set this up once: postgres-ai/ is your folder for the whole course, the worked example and the MCP server (Part 6) alike. Each build provisions its own fresh Neon project (a database), but you never re-download or re-unzip.
Claude Code
cd postgres-ai
claude
OpenCode
cd postgres-ai
opencode
This base assumes a capable general agent (Claude Code, or OpenCode running Claude Sonnet or Opus, GPT-5, or similar). A smaller model will drift on the build prompts; if its first plan looks vague instead of specific, switch to a stronger one before you go further.
Prep the base (~3 min). The base ships the rules file and the MCP wiring; the skills and your key come next. Have your agent set itself up. Paste this:
Read AGENTS.md, then get this base ready: install the skills it lists for whichever agent you are, copy .env.example to .env for me, and tell me exactly what you need from me to bring the Neon and Context7 MCP servers online.
Watch for: the agent installing neon-postgres and mcp-builder (you see the install run), creating .env, then asking you for two things: your OPENAI_API_KEY to paste into .env (one key covers embeddings and generation), and one browser click to authorize Neon over OAuth. Neon is free; if you don't have an account yet, sign up at neon.com in about a minute, or create one right at the authorization screen. When the install and wiring are done, the agent asks you to restart it (exit and relaunch) so the new skills and MCP servers load; neither loads mid-session.
Part 1: Foundations
1. What you're actually building (and your job in it)
The most common misconception: building AI applications requires a machine-learning team. It does not. The models are off-the-shelf. The infrastructure is a database you may already run. What's left is the work of an AI engineer — someone who uses AI to build products, rather than a researcher who trains models. That's the role this whole book is preparing you for, and it's very much within reach.
The second misconception is the one this course exists to fix: that you must hand-write all the SQL. You don't. You direct an agent that already knows pgvector's operators and the embedding workflow cold. Your value moves up the stack — from typing CREATE INDEX to deciding which index, from writing the query to judging whether the results are any good.
That changes what "knowing this material" means. You're not memorizing syntax; you're building enough of a mental model to:
- give the agent a precise instruction ("store the embedding in the same table, use cosine distance, index it with HNSW"),
- read back what it produced and spot when it's wrong,
- and decide the architecture choices the agent shouldn't make for you.
This is the same plan-then-execute habit from the coding course — and here it matters even more, because the agent is about to make decisions (which extension, which index, which distance function) that are expensive to undo once you have data.
The mindset shift: stop asking "what's the SQL for semantic search?" Start saying "build me semantic search over this table; here are my constraints; show me the plan first."
There's a bigger name for this. In the book's terms, the database you're about to build is a system of record for the agent era — the authoritative ground truth your agents read from, write to, and verify against. Jensen Huang's argument is that agents don't remove the need for a system of record; they depend on one. Without authoritative ground truth an agent hallucinates; with it, it executes. RAG on Postgres is how you hand an agent that ground truth — which is exactly why the rest of this course treats retrieval quality as the thing that decides whether the agent can be trusted.
Take this literally for the rest of the course: you never open psql, write SQL, or run a migration yourself — Claude Code or OpenCode does all of it. Every SQL block below is shown so you can read and judge what the agent produced, not so you can type it. Knowing how to read it is what lets you catch a wrong distance function or a missing filter before it ships.
2. Vectors and embeddings, in one minute
If "vector" and "embedding" are new, here is everything you need to start.
A vector is just a list of numbers — [0.021, -0.88, 0.14, …]. An embedding model takes a piece of content (a sentence, a paragraph, an image) and turns it into one of these lists. The trick is that the list captures the meaning of the content: two pieces of text that mean similar things get lists of numbers that sit close together. "Empire State of Mind" and "the city that never sleeps" land near each other even though they share no words.
A vector database is just a system that stores these lists and finds the closest ones to a given list, fast. That's it. When a user asks a question, your app embeds their question into a vector, then asks the database: "which stored vectors are nearest to this one?" The nearest ones are the most semantically relevant — and that is how your app fetches the right context to hand an LLM. (See the opening diagram: this is context management for your users.)
You do not need to understand how the embedding model works internally any more than you need to understand how a JPEG compresses an image. You need to know it exists, that it converts content to meaning-vectors, and that closeness equals similarity.
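To make "closeness equals similarity" concrete, here is a minimal sketch of cosine distance in plain Python. The three-number vectors are hand-picked toy values, not output from any real embedding model (real embeddings have hundreds or thousands of dimensions), and the variable names are purely illustrative.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity. 0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings", invented for illustration only.
empire_state_of_mind = [0.9, 0.1, 0.0]
city_that_never_sleeps = [0.8, 0.2, 0.1]
sourdough_recipe = [0.0, 0.1, 0.9]

near = cosine_distance(empire_state_of_mind, city_that_never_sleeps)
far = cosine_distance(empire_state_of_mind, sourdough_recipe)
```

With these toy values, near comes out much smaller than far, which is all "semantically close" means at the level of the numbers.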
Open a session and ask your agent: "In plain language, explain what an embedding is, then show me two short sentences that would have nearby vectors and two that would be far apart — and why." Reading its answer is a faster gut-check on your own understanding than re-reading this section.
3. The extensions — and what you get on Neon
Postgres becomes a vector database through extensions — add-ons that give it new powers without giving up everything Postgres already does well (transactions, joins, reliability, SQL). Two matter: pgvector, which you'll always use, and pgvectorscale, the scale tier worth knowing about. (The third piece of the puzzle — creating the embeddings — isn't an extension at all; it's a small worker your agent writes, covered in the note below and built in Concept 6.)
| Extension | What it adds | When you need it |
|---|---|---|
| pgvector | The vector data type, distance operators, and the HNSW + IVFFlat indexes | Always — and on Neon it's pre-installed |
| pgvectorscale | The StreamingDiskANN index, vector compression, high-accuracy filtered search at large scale | Native on TigerData; not on Neon, where it's the tier you graduate to |
On Neon, pgvector is your whole stack: it ships built in, and it is all you need for a complete RAG engine. The embeddings themselves come from a small worker your agent writes (Concept 6) that runs beside the database, not from a managed extension. pgvectorscale (the StreamingDiskANN layer) isn't part of Neon's vetted extension set; think of it as the tier you'd graduate to, on a host that ships it, only if you ever outgrow HNSW. That's not a gap: pgvector plus a worker covers most of what you'd otherwise reach for, and Neon's autoscaling and branching handle the rest. On TigerData Cloud, pgvector and pgvectorscale are both native extensions, so starting there gives you the StreamingDiskANN tier in the box from day one (Concept 4).
The one-database payoff still holds: your vectors live next to the rows they describe, so a similarity search and a WHERE price < 2000 AND in_stock filter happen in the same query, on the same source of truth. No second database, no sync pipeline, no data drift. (Hold onto that — it's the whole reason filtered search in Concept 13 is so easy.)
pgvector is a mature, widely used Postgres extension — the stable bedrock this whole course rests on. Everything after embedding (semantic search, indexing, evals, filters, hybrid search, RLS, the MCP server) is pure pgvector.
The one thing pgvector doesn't do for you is create the embeddings — calling an embedding model and writing the vectors back. The clean, portable way to do that is a small worker your agent writes: source table → chunk → embed via an API call → write vectors to an embeddings table → search with pgvector. Keeping that work outside the database is the point, not a workaround — a stateful system of record shouldn't depend on a volatile external API, so embedding (and the LLM calls in Concept 9) live in the worker and app layers, where they can fail, retry, and scale without touching your data, and the whole thing ports cleanly to any host or container. (A managed convenience layer once baked embedding into the database — pgai's Vectorizer — but that coupling proved brittle, and its repository was archived by the maintainer in Feb 2026; this course doesn't depend on it.) The worker is a few lines your agent produces and you review. Learn the pattern; it outlives any one package.
4. Connect your agent to Neon
We don't install or run our own database. We use Neon — serverless Postgres with pgvector already built in — and we let the agent operate it through the Neon MCP server. MCP is the same connector mechanism from the coding course; here it hands Claude Code and OpenCode a set of tools (create_project, create_branch, run_sql, get_database_tables, prepare_database_migration, complete_database_migration) so they can manage Neon entirely through natural language. You never open the Neon console or a psql shell yourself. (This course uses Neon as its default host; everything works identically on TigerData Cloud — see "Prefer TigerData?" at the end of this concept.)
One-time wiring. If you started from the course base, the Neon MCP server is already declared in .mcp.json (Claude Code) and opencode.json (OpenCode) — you authorize it once in the browser over OAuth, with no API key to manage. (Wiring it by hand instead is the same single step: add the Neon MCP server to your tool and authorize it.) That is the only manual setup; from here on, everything happens by instructing the agent. Drive it in plan mode:
Using the Neon MCP server, create a project called agent-factory-rag and enable the pgvector extension on it. Then create a branch called dev for us to build on, and read me the connection string for that branch from the MCP (never print my API key). Show me the plan before you run anything.
Then read the plan. What you're checking for:
- It works on a branch, not directly on production. Branching is Neon's superpower: a branch is an instant, copy-on-write clone of your whole database. The agent makes schema changes on a branch, you preview them, and only then commit to the default branch — the same plan-then-execute discipline, enforced by the platform. (It's also how you'll benchmark indexes and run evals later: branch, test, throw the branch away.)
- It enables pgvector — the one extension this course needs — with a single statement:
CREATE EXTENSION IF NOT EXISTS vector; -- pgvector is pre-installed on Neon; this just switches it on
- For embeddings (Concept 6), the agent builds a small worker — a short Python script or service — that reads new or changed rows, calls the embedding model, and writes the vectors into an embeddings table. It runs off Neon, reaching your branch over its connection string. The embedding provider's API key lives in the worker's environment, never in the database.
You do not need to memorize the MCP tool names or Neon's API. The point of plan mode is that you read the plan, confirm it's working on a branch and enabling pgvector, and then approve. If the plan does something you don't recognize, ask the agent why before saying yes — that question is your real job.
Neon's own guidance is that the MCP server is meant for local development and IDE integrations — it can run powerful operations, so keep it to your dev workflow and review every action the agent proposes before approving. Production changes still go through your normal reviewed migration process.
The course base already includes a short AGENTS.md / CLAUDE.md with the rules this course needs: which Neon branch you're on, that keys live in the environment (never committed), your chosen distance function, and two hard rules — "always make schema changes on a Neon branch and let me preview before committing," and "never run destructive SQL (DROP, TRUNCATE, DELETE without WHERE) without showing me first." Building without the base? Run /init and trim to exactly that.
Prefer TigerData?
Everything in this course runs unchanged on TigerData Cloud, from the team behind pgvector's companion extensions, who position it as "Agentic Postgres." The agent-driven loop is identical; only the names change:
- Tiger MCP in place of the Neon MCP server. It's built into the Tiger CLI: install it with tiger mcp install, then drive Tiger Cloud in natural language exactly as above ("create a service, fork it, enable the extensions, show me the plan first"). It even ships Postgres Skills that teach the agent best practices.
- Forks in place of branches: tiger service fork … makes an instant, zero-copy clone, giving you the same fork → test → throw away discipline you'll use for evals and index benchmarks.
- pgvectorscale is native (alongside pgvector), so the StreamingDiskANN "graduate tier" (Concept 11) is in the box from day one, and its index supports vectors up to 16,000 dimensions (big models like text-embedding-3-large then need no halfvec). Your embedding worker runs the same way it does on Neon.
Your embedding worker and the generation step both live in app code — identical to Neon. Rule of thumb: Neon for the simplest serverless start; TigerData when you want pgvectorscale's scale and filtered-search performance without ever migrating.
Part 2: Your first RAG, built by your agent
We'll build a small, classic example: a table of quotes by historical figures about US cities, then search it by meaning, then have an LLM answer questions over it.
5. The schema: vectors live next to your data
Ask the agent for the table. You want it to understand that the embedding is not a separate store — it's a column (or a managed companion table) right alongside the human-readable data.
Create a quotes table with columns: person, city, and quote. We'll add embeddings in a companion table next, populated by a small worker, so for now just the source data. Then insert a handful of real quotes about New York, San Francisco, and Chicago so we have something to search.
The table the agent writes will look about like this:
CREATE TABLE quotes (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
person text NOT NULL,
city text NOT NULL,
quote text NOT NULL
);
The mental model to hold: one source of truth. The quote, who said it, the city, and (soon) its meaning-vector all live in the same database. That's what makes the filtering in Part 4 trivial.
6. Create embeddings — the worker your agent builds
Here is the piece that does the most work for you, and the one part pgvector doesn't hand you for free. You won't write a throwaway script that loops over rows once, and you won't call an embedding function inside a SELECT. Instead, the agent builds a small, reusable embedding worker: a short program that finds rows without a current embedding, chunks their text, calls the embedding model, and writes the vectors into a companion table. Run it once to backfill, then on a schedule so embeddings stay in sync as your data changes.
One choice to make first: the embedding model. Before the worker can embed anything, you pick which model turns text into vectors — the one decision here that's annoying to reverse, since switching models means re-embedding everything. Three things to weigh:
- Quality vs cost. A small, cheap model (OpenAI's text-embedding-3-small, 1536 dimensions) is the sensible default and is plenty for most RAG. A larger model (text-embedding-3-large, 3072 dimensions) retrieves a little better and costs more; it's worth it only if your evals show the lift.
- Dimensions. More dimensions means more storage and memory and slightly slower search. Two practical notes: many modern models let you request fewer dimensions without re-training, and (a real gotcha) pgvector's HNSW/IVFFlat indexes cap the vector type at 2,000 dimensions (the halfvec type extends that to 4,000), so a 3072-dim model like text-embedding-3-large needs halfvec or reduced dimensions to be indexable. Your agent should flag this; you should recognize it when it does. (On TigerData, pgvectorscale's StreamingDiskANN index supports vectors up to 16,000 dimensions, so the limit rarely bites.)
- Language and domain. If your content isn't English, or is highly specialized (legal, medical, code), a multilingual or domain-tuned model can beat a bigger general one.
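The dimension limits above are easy to check mechanically before any table exists. The sketch below encodes the two index caps stated in this section and pgvector's documented per-row sizes (4 bytes per dimension plus 8 for vector, 2 bytes per dimension plus 8 for halfvec). The function names are hypothetical helpers, not part of pgvector.

```python
VECTOR_INDEX_MAX_DIMS = 2000   # HNSW/IVFFlat cap for the vector type
HALFVEC_INDEX_MAX_DIMS = 4000  # HNSW/IVFFlat cap for the halfvec type

BYTES_PER_DIM = {"vector": 4, "halfvec": 2}
HEADER_BYTES = 8


def indexable_type(dims):
    """Return the narrowest-fit column type pgvector can index at this size, or None."""
    if dims <= VECTOR_INDEX_MAX_DIMS:
        return "vector"
    if dims <= HALFVEC_INDEX_MAX_DIMS:
        return "halfvec"
    return None


def storage_bytes(dims, column_type="vector"):
    """Approximate bytes used by one stored embedding (the value only, no row overhead)."""
    return BYTES_PER_DIM[column_type] * dims + HEADER_BYTES


small = indexable_type(1536)   # text-embedding-3-small
large = indexable_type(3072)   # text-embedding-3-large
```

Multiply storage_bytes by your expected chunk count to get a rough size for the embeddings column; the index adds its own overhead on top, which this sketch does not model.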
In the worker the model is a single setting, so you test alternatives on your eval set (Concept 12) instead of guessing — shortlist a couple of candidates from the MTEB leaderboard (the standard public benchmark for embedding models), then let your evals pick the winner on your data.
We're embedding English support docs, cost matters, and we need to stay under pgvector's index limit. Recommend an embedding model and dimension count, explain the tradeoff in a paragraph, and set it as the worker's embedding model.
With the model chosen, have the agent build the worker:
Create a companion table quotes_embedding with a foreign key to quotes, the chunk text, and an embedding vector(1536) column. Then write a small embedding worker that finds quotes with no current embedding, chunks the quote text, embeds each chunk with the model we chose, and inserts the vectors into quotes_embedding. Run it once to backfill, show me how to confirm the rows landed, and explain how I'd schedule it. Read the API key from the environment.
What the agent builds, in plain terms:
- A companion table, quotes_embedding, holding each chunk and its vector next to a reference back to the source quote. Keeping vectors in their own table, rather than a single column on quotes, is what lets one long source row produce several chunks.
- A worker: a short script (the background process from Concept 4) that selects the rows needing embeddings, calls the model, and writes the vectors back. The same code runs as the one-off backfill and as the scheduled job that keeps things current.
- A way to stay in sync, kept decoupled: the database never calls the embedding API itself. The simplest version polls: the worker re-runs on a schedule and re-embeds rows whose text changed. If you want change-driven updates, an INSERT/UPDATE trigger can mark a row dirty (set a flag, enqueue an id) for the worker to pick up, but the trigger only flags; the embedding call still happens out-of-band in the worker. Either way you solve the "stale embeddings" problem, the thing that quietly wrecks RAG quality, and your writes never block on a slow or failing embedding endpoint.
The same worker pattern embeds documents, not just short fields — point it at PDFs, DOCX, or files in an S3 bucket and have it parse, chunk, and embed them into the same companion table. And because the model is one setting in the worker, you can swap providers (OpenAI, Cohere, Voyage, a local model) to test which retrieves best on your data — the model experiment above.
The worker calls an external embedding provider, so it needs that provider's API key. The key belongs to the worker's environment (or your cloud provider's secret store), never hard-coded in SQL or committed to your repo. Tell your agent explicitly: "read the API key from an environment variable, never write it into a file we commit." Then verify the diff before you approve. Switching embedding providers later (Cohere, Voyage, a local model) is a one-line change in the worker — the rest of your app doesn't move.
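Here is a rough, in-memory sketch of the worker's core loop, so you know what shape to expect when you review the agent's version. It assumes one way of detecting stale rows (comparing a hash of the current text with the hash recorded at embedding time); your agent may choose a timestamp or a dirty flag instead. Plain dicts stand in for the quotes and quotes_embedding tables, fake_embed is a placeholder for the real model call, and chunking is left out to keep it short. All function names are hypothetical.

```python
import hashlib
import os


def content_hash(text):
    """Fingerprint of the source text, stored next to the vector to detect edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rows_needing_embedding(source_rows, embedded_hashes):
    """Ids whose text was never embedded, or changed since it was embedded."""
    return [
        row_id
        for row_id, text in source_rows.items()
        if embedded_hashes.get(row_id) != content_hash(text)
    ]


def run_worker(source_rows, store, embed):
    """One pass: embed every new or changed row and record it in the store."""
    known = {row_id: record["hash"] for row_id, record in store.items()}
    pending = rows_needing_embedding(source_rows, known)
    for row_id in pending:
        text = source_rows[row_id]
        store[row_id] = {"hash": content_hash(text), "embedding": embed(text)}
    return pending


def fake_embed(text):
    """Placeholder for the real embedding call (assumption: yours hits an API)."""
    return [float(len(text)), float(text.count(" "))]


def load_api_key(env=os.environ):
    """Read the provider key from the environment; never hard-code it."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in the worker's environment")
    return key
```

Running run_worker twice in a row should do nothing the second time; that idempotence is what makes the same code safe as both the backfill and the scheduled job.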
7. Chunking — the lever that sets your ceiling
Chunking is one step inside your worker — and, quietly, the most important one. A long document embedded as a single vector turns into mush — one blurry average of everything it says — so you split it into smaller chunks, and each chunk gets its own vector. What counts as a chunk decides what your search can ever retrieve.
Two dials:
- Size. Too large and a chunk spans several topics, so its vector is unfocused and you retrieve near-misses. Too small and a chunk loses the context that made it meaningful. A few hundred tokens is a common starting point; the right answer depends on your content.
- Overlap. Letting chunks overlap a little (say 10–20%) keeps a sentence that straddles a boundary from being orphaned. Some overlap almost always helps; too much wastes storage and retrieves near-duplicates.
There's also strategy: split by character count (the simple default), by structure (markdown headings, paragraphs), or semantically (group sentences that belong together). For structured docs, splitting on headings usually beats blind character counts.
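The two dials are easier to reason about once you have seen them as code. This is a minimal sliding-window chunker, assuming whitespace-separated words as a stand-in for tokens (a real worker would count tokens with the embedding model's tokenizer). The function name and defaults are illustrative, not a recommendation.

```python
def chunk_words(text, size=200, overlap=30):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(n) for n in range(10))
pieces = chunk_words(sample, size=4, overlap=1)
```

Each chunk repeats the last word of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.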
Why this earns its own concept: chunking sets your recall ceiling — recall meaning simply whether your search pulled back the right chunks in the first place. If the right answer never lands cleanly in a single chunk, no embedding model and no clever prompt can recover it — which is why bad chunking is one of the most common reasons a RAG system quietly underperforms. So you don't guess; it's the first thing to tune against your evals.
On a Neon branch, re-run our worker with three chunking setups — 256, 512, and 800 tokens, ~15% overlap each — plus a heading-aware split. Run our eval questions against each and tell me which gives the best retrieval recall. Then throw the branches away.
8. Semantic search: order by distance
Now the magic from the intro. To find quotes similar in meaning to a phrase, the phrase is turned into a query vector (in a few lines of app code the agent writes for you), then the database sorts rows by how close their stored vector is to it.
The one new piece of SQL is a distance operator. pgvector has several; these three are the ones that matter for text embeddings, and cosine distance (<=>) is the usual default:
| Operator | Distance | Use it for |
|---|---|---|
| <=> | Cosine | Text-embedding similarity (the default) |
| <-> | L2 (Euclidean) | When magnitude matters |
| <#> | Inner product | Certain models that recommend it |
Write me a query that returns the top 5 quotes most similar in meaning to a search phrase. The phrase's embedding will be passed in as a parameter from application code — don't embed it inside the SQL.
The query it hands back (yours to read, not to type):
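SELECT q.person, q.city, q.quote
FROM quotes_embedding AS e               -- the table your worker populates: chunks + their vectors
JOIN quotes AS q ON q.id = e.quote_id    -- quote_id = the foreign key back to quotes (column name assumed)
ORDER BY e.embedding <=> $1              -- $1 = the query phrase, embedded in app code
LIMIT 5;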
Search "Empire State of Mind" and the top results are quotes about New York — including ones that never contain the words "New York" — because the meanings are close. That, finally made concrete, is semantic search.
Embedding the user's phrase happens in your application code (a few lines the agent writes), and the resulting vector is passed into the query as the $1 parameter. The part that calls an external model stays in your app, where you control the model, retries, and caching. The storage and the nearest-neighbor search stay in Postgres, which is exactly where they belong.
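One detail worth recognizing in the agent's app code is how a Python list becomes the query parameter. pgvector accepts a vector as text in the form [0.1,0.2,0.3], so a small formatter is often all that is needed. The sketch below is an assumption about how your agent might wire it: to_pgvector_literal is a hypothetical helper, the chunk column name is a guess at your schema, and the %s placeholder is the psycopg style (other drivers use $1).

```python
def to_pgvector_literal(embedding):
    """Format a sequence of numbers as pgvector's text input, e.g. [0.1,2.0,-3.5]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Assumed query text for a psycopg-style driver; the column name is hypothetical.
SEARCH_SQL = """
SELECT chunk
FROM quotes_embedding
ORDER BY embedding <=> %s::vector
LIMIT 5
"""

query_param = to_pgvector_literal([0.1, 2, -3.5])
# With a live connection this would run roughly as:
#   cursor.execute(SEARCH_SQL, (query_param,))
```

The vector is passed as a bound parameter rather than pasted into the SQL string, which keeps the query plan reusable and avoids injection problems.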
9. RAG: retrieve, then generate
Semantic search finds the relevant text. RAG (Retrieval-Augmented Generation) goes one step further: it takes those retrieved chunks, stuffs them into a prompt as context, and asks an LLM to compose an answer grounded in your data. This is the customer-support bot, the docs assistant, the "chat with my files" feature — all of them are this loop.
The loop has two stages, and the split is the whole point:
- Retrieve (in Postgres): run the Concept 8 query to pull the few most relevant chunks. This is the context-management step — fetch the signal, leave the million irrelevant rows out.
- Generate (in your app): build a prompt = system instructions + retrieved chunks + the user's question, send it to an LLM, return the answer.
Build an answer_question(question) function in application code. It should: embed the question, run the top-k semantic search against quotes_embedding, format the retrieved quotes into a prompt as context, call the LLM, and return the grounded answer. Keep retrieval in SQL and generation in app code.
The architecture splits cleanly: retrieval stays in Postgres (where it's fast and filterable) and generation lives in your application (where you control the prompt, the model, retries, and streaming). The agent writes that app-side glue for you; it's small.
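The retrieve-then-generate loop is short enough to sketch in full. This version takes the three external pieces (the embedding call, the SQL search, the LLM call) as plain function arguments so the flow is visible and testable without any service running. That injected signature is an assumption made for illustration; the function your agent writes will likely call its own clients directly. The prompt wording is also just an example.

```python
NO_CONTEXT_ANSWER = "I could not find anything relevant in the indexed documents."


def build_prompt(question, chunks):
    """Assemble system instructions + retrieved chunks + the user's question."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer_question(question, embed, search, generate, k=5):
    """Retrieve the top-k chunks for the question, then ask the LLM to answer from them."""
    query_vector = embed(question)       # app code: call the embedding model
    chunks = search(query_vector, k)     # Postgres: ORDER BY embedding <=> query LIMIT k
    if not chunks:
        return NO_CONTEXT_ANSWER         # nothing retrieved, so do not let the LLM guess
    return generate(build_prompt(question, chunks))  # app code: call the LLM
```

Numbering the chunks in the prompt makes it easy to ask the model to cite which one it used, which helps later when you are debugging whether a bad answer came from retrieval or from generation.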
Notice what stage 1 is doing: handing the LLM exactly the context it needs and nothing else. If retrieval is sloppy — wrong chunks, too many, irrelevant ones — the LLM gives a bad answer and people blame "the AI." It's almost always the retrieval. Which is why Part 4 exists.
answer_question() isn't only a chatbot backend — it's a tool an agent calls. In the agent chapters, retrieval becomes one of the tools a larger agent reaches for when it needs grounded facts, the same way it reaches for a calculator or a web search. RAG is the searchable context an agent draws on, not just a Q&A endpoint.
Part 3: Making search fast — indexes
Intermediate from here. Parts 3 and 4 are the tuning layer — you can ship a working system on Parts 1–2 alone, and return to these when search needs to be faster (Part 3) or sharper (Part 4).
10. Why you need a vector index (and when you don't)
Without an index, a similarity search compares the query vector to every row — an exact nearest-neighbor scan. That's perfectly accurate, and perfectly fine while you're small. As the table grows, it gets slow.
The fix is an approximate search: trade a sliver of accuracy for a large speed-up by not checking every vector. A vector index is the data structure that makes that approximation good.
The threshold that saves you from over-engineering: below roughly 100,000 vectors, exact search is often fast enough — and always correct. But the real threshold shifts with dimensions, compute size, latency target, filters, concurrency, and how often the vectors get rewritten, so have the agent benchmark before adding an index rather than reaching for one by default; add it when searches actually get slow, not before. (This mirrors the coding course's "add a rule when something goes wrong, not before.")
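To see why exact search is always correct and eventually slow, it helps to look at what it does: one distance computation per stored row. The sketch below is an in-memory stand-in for the unindexed query, with made-up two-dimensional vectors and a hypothetical exact_top_k helper. It is not how Postgres executes the scan, only the same idea.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def exact_top_k(query, rows, k=5):
    """Compare the query against every (id, vector) row and keep the k closest ids."""
    scored = ((cosine_distance(query, vector), row_id) for row_id, vector in rows)
    return [row_id for _, row_id in heapq.nsmallest(k, scored)]


# Toy data, invented for illustration.
rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [0.9, 0.1])]
closest = exact_top_k([1.0, 0.0], rows, k=2)
```

The work grows with row count times dimensions, which is why the benchmark matters more than any fixed threshold. The exact result is also the ground truth you compare an approximate index against.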
11. The indexes, which to use, and how to tune them
On Neon, the live choice is the middle card: HNSW is your workhorse (with IVFFlat as a legacy option). The StreamingDiskANN card is native on TigerData; on Neon it's what you'd graduate to a pgvectorscale host for.
The agent builds the index for you, but you must know enough to choose. On Neon you have two of the three options below (Postgres allows more than one index on a column, but you'd normally keep a single vector index per column): HNSW, your default, and IVFFlat. StreamingDiskANN belongs to pgvectorscale: available now if you're on TigerData, or the option you'd move to a pgvectorscale host for if you start on Neon. Here's what each looks like so you can confirm the index type and the distance op match your queries:
-- HNSW — your default on Neon
CREATE INDEX ON quotes_embedding USING hnsw (embedding vector_cosine_ops);
-- IVFFlat: also on Neon; legacy, needs a lists parameter, and is best built after the data is loaded and rebuilt after large changes
CREATE INDEX ON quotes_embedding USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
-- StreamingDiskANN — native on TigerData; on Neon only via a pgvectorscale host
CREATE INDEX ON quotes_embedding USING diskann (embedding vector_cosine_ops);
Note vector_cosine_ops — the index must match the distance function your queries use (<=> → cosine). Mismatch this and the index silently won't help.
How to actually decide, as an instruction:
We have about 2 million quote-vectors and we filter most searches by city. On a fresh Neon branch, build an HNSW index, tune m and ef_construction, run my 10 sample queries, and report p95 latency and whether the top results match. Recommend settings and explain the tradeoff. Then throw the branch away.
That's the AI-engineer move: you don't memorize which settings are faster, you make the agent measure it on a Neon branch and bring you a recommendation you can sanity-check against the card above — then discard the branch at no cost. (p95 latency, which the prompt asks for, just means the speed 95% of queries come in under — a sensible worst-case to hold the line on.)
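If you want to sanity-check the p95 figure the agent reports, the arithmetic is small enough to do yourself. The sketch below uses the nearest-rank definition of a percentile (one common convention among several, so the agent's tool may differ slightly); the latency samples are made up for illustration.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile: the smallest sample that 95% of samples do not exceed."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]

# Hypothetical timings in milliseconds: 18 typical queries and two slow outliers.
samples = [12.0, 9.5, 11.2, 10.1, 48.7, 10.9, 9.8, 11.5, 10.4, 10.0,
           9.9, 10.6, 11.1, 10.2, 9.7, 10.8, 11.9, 10.3, 10.5, 95.0]
print(p95(samples))  # 48.7
```

Note how the single 95 ms outlier does not set the number, but the second-slowest query does: p95 tells you about the slow tail without being hostage to one freak measurement.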
Row count isn't the only axis — churn is the other. How often your vectors get rewritten (re-chunking, swapping embedding models, time-decay re-indexing) is a cost of its own. HNSW takes incremental inserts and updates without a full rebuild, but heavy rewrite volume bloats the graph over time — recall drifts and you eventually need a REINDEX — and the re-embedding itself is compute you pay for. So a corpus that's constantly rewritten can get expensive well before it's large. Tell the agent your real update pattern, not just your row count, and have it benchmark against that.
The one knob worth knowing: ef_search. The build-time settings (m, ef_construction) shape the HNSW graph when the index is built. The dial that matters day to day is at query time — ef_search controls how hard a search looks: higher means better recall and slower queries, lower means faster and less accurate. The agent sets it per query; this is the line you'll see it run:
SET LOCAL hnsw.ef_search = 100; -- raise for more recall, lower for more speed
So the move is: tell the agent the recall you actually need and have it tune ef_search to hit it, not blindly max it. And have it confirm the index is doing its job by running EXPLAIN ANALYZE and checking for an index scan, not a sequential scan — then report back. A sequential scan usually means the query's operator doesn't match the index (a cosine index needs <=>).
Here's the difference, in the part of the agent's EXPLAIN ANALYZE output you actually read (the same 2M-vector table, with and without a usable index):
✅ GOOD — the index is doing the work
-> Index Scan using quotes_embedding_hnsw_idx on quotes_embedding
Order By: (embedding <=> $1)
Execution Time: 0.8 ms
❌ BAD — no index used; every row got scanned and sorted
-> Seq Scan on quotes_embedding (rows=2000000)
Sort Key: (embedding <=> $1)
Execution Time: 52.4 ms
Two lines tell you which you got: the operator line (Index Scan using …hnsw… versus Seq Scan on …) and the execution time (sub-millisecond versus tens of milliseconds on the same data). Seq Scan means the index isn't being used — usually because the query's distance operator doesn't match the index, or because no index exists yet. That's the one thing to have the agent confirm before you call indexing "done."
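The same check can be automated so it runs with every benchmark. This is a deliberately naive sketch that only looks at how each plan line starts; the function name `scan_kind` is hypothetical, and real plans have more node types (bitmap scans, parallel scans) than it recognizes.

```python
def scan_kind(plan_text):
    """Classify a plan as 'index', 'seq', or 'unknown' from the start of each plan line."""
    found_index = False
    for line in plan_text.splitlines():
        node = line.strip().lstrip("->").strip()  # drop the arrow that prefixes child nodes
        if node.startswith("Seq Scan"):
            return "seq"  # any sequential scan is worth flagging
        if node.startswith(("Index Scan", "Index Only Scan")):
            found_index = True
    return "index" if found_index else "unknown"

good_plan = """-> Index Scan using quotes_embedding_hnsw_idx on quotes_embedding
Order By: (embedding <=> $1)
Execution Time: 0.8 ms"""

bad_plan = """-> Seq Scan on quotes_embedding (rows=2000000)
Sort Key: (embedding <=> $1)
Execution Time: 52.4 ms"""

print(scan_kind(good_plan), scan_kind(bad_plan))  # index seq
```

Treat an "unknown" result as a prompt to read the plan yourself rather than as a pass.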
These are the recurring ways vector search quietly breaks — and since your job is judging the agent's work, they're exactly what to watch for:
- Dimension mismatch — the column's dimensions don't match the embedding model's output.
- No index past the point where you needed one — a silent slowdown, not an error.
- Operator / index mismatch: the query uses <-> but the index is cosine, so the index is ignored.
- Whole documents instead of chunks: the Concept 7 mistake; retrieval can never find anything precise.
- Skipping EXPLAIN ANALYZE, so nobody catches any of the above until users do.
If most of your searches carry a WHERE clause (by tenant, by date, by category), pair HNSW with ordinary B-tree indexes (Postgres's standard index for plain columns like dates and categories) on the filter columns so Postgres can narrow the set efficiently. (StreamingDiskANN — native on TigerData, or via a pgvectorscale host from Neon — is built for heavy filtered search at very large scale.) On Neon, HNSW plus good filter indexes carries you a long way first.
Part 4: Making search good — the advanced layer
A working RAG is not a good RAG. This is where most projects stall, and where knowing the moves separates you from someone who can only produce a demo.
12. Eval-driven development
The habit that separates a shippable system from a lucky demo. Most people start by building, then eyeball whether the output "looks right." Instead, start with the questions. Before writing anything, write down 10 questions your users will actually ask and put them in a file. That file is your evaluation set — your yardstick for whether a change made things better or worse.
Then, whenever you change the system — a new embedding model, a different chunking strategy, an added filter — you re-run the eval set and see the effect, instead of guessing. As the app grows, the set grows with it (20, 50 questions), and you catch regressions before your users do.
The second half is decompose the problem. When an answer is bad, don't conclude "the AI is dumb." Trace the stages:
- Retrieval: did semantic search even return the right chunks? (Often the real problem is an inventory gap — users ask about something you never put in the database.)
- Context: were the right chunks passed to the LLM, or too many/too few?
- Generation: given good context, did the model still answer poorly?
Nine times out of ten the failure is retrieval, not the LLM. Fix the stage that's actually broken.
Here are 12 questions in evals/questions.md with the answers I expect. Build a small harness that runs each through our retrieve-then-generate pipeline, and for each one show: the chunks retrieved, the final answer, and whether it matches what I expected. Summarize where it's failing: retrieval or generation.
Eval-driven development is the spine of reliable agents, not just RAG. The dedicated treatment is the Eval-Driven Development Crash Course — do it after this one.
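The harness that prompt asks for can be very small. Here is a minimal sketch of its shape, assuming the expected answer is a short phrase you can check with a substring match (real harnesses often need fuzzier matching). `fake_retrieve`, `fake_generate`, and the two-line corpus are hypothetical stand-ins for your real pipeline.

```python
def run_evals(cases, retrieve, generate):
    """Run each case through retrieve-then-generate and label the stage that failed."""
    report = []
    for case in cases:
        chunks = retrieve(case["question"])
        answer = generate(case["question"], chunks)
        expected = case["expected"].lower()
        retrieved_ok = any(expected in chunk.lower() for chunk in chunks)
        if expected in answer.lower():
            verdict = "pass"
        elif not retrieved_ok:
            verdict = "retrieval"   # the right chunk never came back
        else:
            verdict = "generation"  # good context, poor answer
        report.append({"question": case["question"], "chunks": chunks,
                       "answer": answer, "verdict": verdict})
    return report

# Hypothetical stand-ins so the sketch runs without a database or a model.
CORPUS = ["Refunds are processed within 5 business days.",
          "Support is open Monday to Friday."]

def fake_retrieve(question):
    words = [w for w in question.lower().replace("?", "").split() if len(w) > 3]
    return [c for c in CORPUS if any(w in c.lower() for w in words)]

def fake_generate(question, chunks):
    return chunks[0] if chunks else "I don't know."

cases = [
    {"question": "How long do refunds take?", "expected": "5 business days"},
    {"question": "What is the warranty period?", "expected": "two years"},
]
for row in run_evals(cases, fake_retrieve, fake_generate):
    print(row["verdict"], "|", row["question"])
```

The second case fails at retrieval because nothing about warranties is in the corpus at all, which is the inventory gap described above: no amount of prompt tuning on the generation side would fix it.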
13. Filtered search: the WHERE clause is your friend
Pure semantic search returns the globally most-similar rows. Often you want the most similar rows that also satisfy some condition. Because your vectors live next to your data (Concept 5), this is just a WHERE clause on the same query — no second system, no juggling. Five patterns cover almost everything:
| Pattern | Example use case | The added clause (sketch) |
|---|---|---|
| Metadata filter | Docs search across multiple products | WHERE product = 'CRM' AND doc_type = 'api-reference' |
| Composite filter | E-commerce recommendations | WHERE category = 'electronics' AND price BETWEEN 500 AND 2000 AND in_stock |
| Time filter | News recommender — only recent articles | WHERE published_at > now() - interval '7 days' |
| Permissions filter | Internal RAG where users see only what they're cleared for | WHERE clearance_level <= $user_level |
| Geospatial filter | "Recommend things within 5 km" (add PostGIS) | WHERE ST_DWithin(location, $point, 5000) |
Each is the same shape: ORDER BY embedding <=> $1 with a WHERE in front. The permissions one is worth dwelling on — it's how you keep tenant A from ever retrieving tenant B's documents, enforced in the database rather than hoped for in application code.
Add a city filter and a "last N days" filter to our semantic search, both optional. On a Neon branch, add B-tree indexes on the filter columns and tune the HNSW search so filtered queries stay fast, then show me the before/after latency on our eval set.
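When you review what the agent wrote for the optional filters, the thing to look for is that every value travels as a bound parameter and only fixed SQL fragments are joined together. A minimal sketch of that pattern, assuming psycopg-style `%s` placeholders and hypothetical `city`, `created_at`, and `quote` columns on the `quotes_embedding` table:

```python
def build_search(query_vec, city=None, last_n_days=None, limit=5):
    """Return (sql, params) for a semantic search with optional filters, all values bound."""
    where, params = [], []
    if city is not None:
        where.append("city = %s")
        params.append(city)
    if last_n_days is not None:
        where.append("created_at > now() - make_interval(days => %s)")
        params.append(int(last_n_days))
    sql = "SELECT id, quote FROM quotes_embedding"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY embedding <=> %s::vector LIMIT %s"
    # pgvector accepts a text literal like '[0.1,0.2]' cast to vector
    vector_literal = "[" + ",".join(str(float(x)) for x in query_vec) + "]"
    params += [vector_literal, int(limit)]
    return sql, params

sql, params = build_search([0.1, 0.2], city="New York", last_n_days=7, limit=3)
print(sql)
print(params)
```

Skipping a filter simply leaves its clause out, so the same function serves pure semantic search and every filtered variant; the user-supplied city never touches the SQL string itself.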
14. Hybrid search: meaning and keywords
By 2026 this stopped being an optional trick and became the default for serious retrieval: run keyword search and vector search, then merge. Each covers the other's blind spot — vector search understands paraphrase but underweights exact rare terms (a product code, a person's name); keyword search nails the exact term but misses meaning. Postgres does both natively — keyword search via full-text search (tsvector), vectors via pgvector — so it stays one database and, often, one query.
The shape that's become standard:
- Retrieve from both, over-fetching — say the top 20 from keyword and top 20 from vector — so the merge has signal to work with.
- Fuse with Reciprocal Rank Fusion (RRF). RRF merges the two ranked lists by position, not score — which sidesteps the real headache that keyword scores and cosine distances live on completely different scales and can't be averaged sanely. It's a few lines of SQL and needs no model.
- (Optional) rerank the top candidates with a cross-encoder, a small model that scores each query–chunk pair directly. This is the precision step: RRF picks a good candidate pool, and the cross-encoder orders the final handful you hand the LLM. Add it only if your evals say the lift is worth the extra latency.
The shape the agent produces (yours to read, not type) — two ranked lists fused by RRF, all in one query:
WITH kw AS ( -- keyword side: full-text search, ranked
SELECT id, row_number() OVER (ORDER BY ts_rank_cd(ts, plainto_tsquery($1)) DESC) AS rank
FROM quotes_embedding WHERE ts @@ plainto_tsquery($1)
ORDER BY rank LIMIT 20 -- explicit ORDER BY so LIMIT keeps the 20 best matches, not an arbitrary 20
),
vec AS ( -- vector side: semantic search, ranked
SELECT id, row_number() OVER (ORDER BY embedding <=> $2) AS rank
FROM quotes_embedding ORDER BY embedding <=> $2 LIMIT 20
)
SELECT id, SUM(1.0 / (60 + rank)) AS score -- RRF: k = 60, summed across both lists
FROM (SELECT * FROM kw UNION ALL SELECT * FROM vec) r
GROUP BY id ORDER BY score DESC LIMIT 10; -- $1 = query text, $2 = query vector
Reading it: kw returns the top 20 by keyword match and vec the top 20 by meaning, each row tagged with its rank in that list (1 = best). The final query adds up 1 / (60 + rank) for every row across both lists — so a row near the top of either list scores well, and a row that lands in both wins. The 60 (the standard RRF constant) stops any single high rank from dominating, and because it works on positions, you never have to reconcile keyword scores and cosine distances on different scales.
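The fusion step is easy to check by hand. Here is the same RRF arithmetic as a small Python sketch, useful for verifying the SQL's output on a couple of eval questions. The ids are hypothetical, and the tie-break by id is an addition so the result is deterministic.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, limit=10):
    """Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:limit]

keyword_hits = ["q7", "q2", "q9"]  # hypothetical ids, best match first
vector_hits = ["q2", "q4", "q7"]
for doc_id, score in rrf_fuse([keyword_hits, vector_hits]):
    print(doc_id, round(score, 5))
```

Here q2 comes out on top (rank 2 by keyword, rank 1 by meaning), narrowly ahead of q7 (rank 1 and rank 3), and both beat the rows that appear in only one list. No scores from either search were needed, only positions.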
Add full-text search over the quote column alongside our vector search, fuse the two with RRF, and run our eval set vector-only vs hybrid. Tell me which questions improved and by how much.
Hybrid search wins most on queries that mix a concept with a specific term — "what did Truman Capote say about the city" needs the name matched exactly and the meaning understood. Whether it actually helps your corpus, and by how much, is a question only your eval set can answer — so measure vector-only vs hybrid before taking on the extra moving parts.
15. Multi-tenancy and text-to-SQL
Two more you should recognize even if you don't build them today.
Multi-tenancy. If you're building SaaS, each customer's data must stay walled off from every other's. There's a ladder of isolation, from loosest to strictest:
| Approach | Isolation | Cost / complexity | Typical fit |
|---|---|---|---|
| Shared table + tenant_id filter | Weakest | Cheapest | Internal tools, low-risk data |
| Schema per tenant | Good | Moderate | The sweet spot for most SaaS |
| Database per tenant | Strongest | Highest (backups, ops) | High-security / regulated clients |
Schema-per-tenant is the usual balance: real isolation, one database to operate. Either way, the rule from Concept 13 stands — enforce the boundary in the database, not just in your app code. The cleanest Postgres mechanism is Row-Level Security (RLS): you have the agent write a policy once (rows are visible only where, say, tenant_id = current_setting('app.tenant')) and Postgres then applies it to every query automatically — so a single forgotten WHERE clause can't leak one tenant's vectors to another.
Text-to-SQL. Postgres holds structured data too — numbers, dates, relations. Text-to-SQL lets a user ask in plain English ("what were Q3 sales by region?") and have an agent translate it into a correct SQL query against your real tables. What makes it accurate is a well-described schema: clear table and column names, COMMENTs that explain what each one means, and a handful of example question→query pairs the agent can learn from. Give the agent that context, keep a human in the loop to review the SQL before it runs, and combine it with semantic search (for your documents) — an agent that picks the right tool per question is the foundation of a real data assistant.
Add clear COMMENTs to our tables and columns describing what each holds, and write a few example question→SQL pairs for our schema. Then let me ask a question in English, and show me the SQL you would run against our real tables for review before executing it.
Text-to-SQL is plan mode by another name: have the agent show the SQL first, especially anything that writes. A wrong SELECT wastes a second; a wrong UPDATE ruins your afternoon.
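One way to back that habit with code is a hard stop in front of execution: everything still gets shown to you, but anything that might write cannot run without explicit approval. The sketch below is a naive keyword check under that assumption, not a SQL parser; the function name is hypothetical, and it is biased on purpose toward asking too often. A read-only database role remains the real safeguard.

```python
import re

READ_ONLY_STARTS = {"select", "with", "explain", "show"}
WRITE_WORDS = {"insert", "update", "delete", "drop", "alter", "truncate", "create", "grant"}

def needs_approval(sql):
    """Naive gate: True unless the text is a single statement that looks plainly read-only."""
    statements = [s.strip() for s in sql.strip().split(";") if s.strip()]
    if len(statements) != 1:
        return True  # empty input or several statements: always ask a human
    words = re.findall(r"[a-z_]+", statements[0].lower())
    if not words or words[0] not in READ_ONLY_STARTS:
        return True
    # Catches writes hidden behind a read-only opener, e.g. WITH ... DELETE ... RETURNING
    return any(word in WRITE_WORDS for word in words)

print(needs_approval("SELECT region, sum(amount) FROM sales GROUP BY region"))  # False
print(needs_approval("UPDATE sales SET amount = 0"))                            # True
```

Because it matches whole words, a column called `updated_at` does not trip it, while a semicolon inside a string literal does. That false alarm costs one extra confirmation, which is the right direction to be wrong in.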
Part 5: A complete worked example
One task, start to finish: an empty Neon project to a working, grounded Q&A over a folder of documents. This is the Plan / Execute split applied to data work — plan with a strong model, let a cheaper one do the routine execution. The prompts below are the whole job — type them into either tool. They're identical for Claude Code and OpenCode; only two mechanics differ: how you enter plan mode (Claude Code: Shift+Tab · OpenCode: Tab) and that you switch to a cheaper model with /model before the build (in OpenCode, e.g. deepseek-v4-flash).
1. Plan first — enter plan mode with a strong model, then paste:
I have a folder ./docs of markdown files. Build a RAG system on Neon: using the Neon MCP server, create a project and a dev branch, enable pgvector, load the docs into a table, build a small embedding worker that chunks and embeds them into a companion chunks table, and give me an answer_question() function that retrieves and then generates. Show me the full plan and the schema before running anything.
The embedding worker is just code the agent writes — source → chunk → embed → store → search. Everything downstream (search, evals, the MCP server) is identical regardless of how the vectors get created.
2. Read the plan before you approve. Check: is it working on a Neon branch? Is pgvector enabled? Does the embedding worker read the API key from the environment (and is the key kept out of the repo)? Is generation in app code? If all yes, approve.
3. Execute — switch to a cheaper model for the routine build, then paste:
Looks right. Proceed. Run the migration on the branch, start the worker, confirm the embeddings backfilled, then run these 5 questions and show me the retrieved chunks plus the answer for each.
4. Evaluate, then iterate — paste:
Q3 and Q5 returned irrelevant chunks. Diagnose: is it retrieval or generation? If retrieval, try a different chunking strategy on a fresh branch and re-run the same 5 questions.
Notice the rhythm: plan → review → execute → evaluate → iterate. That's the same loop from the coding course; the only new thing is what you're reviewing — schema choices, the embedding worker, where generation runs. Master that loop and the specific SQL stops mattering, because you can always have the agent produce it and you can always tell whether it's right.
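When step 4 sends the agent off to "try a different chunking strategy," it helps to know what the simplest one looks like. This sketch shows fixed-size word windows with overlap, one common baseline among many and not necessarily what your agent will choose; the function name and the default sizes are assumptions for illustration.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap exists so a sentence that straddles a boundary still appears whole in at least one chunk. Whether 200 words with 40 of overlap beats splitting on headings or paragraphs is exactly what re-running the same five questions tells you.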
Part 6: Ship your RAG as an MCP tool
You've built answer_question(). Right now only your code can call it. Wrap it in a Model Context Protocol (MCP) server and the same retrieval becomes a tool that any agent can discover and call — Claude Code, OpenCode, Claude Desktop, Cursor, or a Digital FTE you build later. MCP is the open standard this book keeps returning to: write the capability once, and every agent speaks to it the same way. One server, every agent.
This is the Concept 9 promise — that retrieval is the searchable context an agent draws on — made real. Your vector search stops being a feature buried in one app and becomes a reusable capability: searchable context an agent reaches for the way it reaches for a calculator.
You've now met both, and they do opposite jobs:
- The Neon MCP server (Concept 4) is a dev-time admin tool. It lets your agent operate the database while you build — create branches, run SQL, preview migrations. It is not for production or end users.
- The RAG MCP server (this part) is your product surface at runtime. It exposes read-only retrieval — "search my knowledge," "answer from my data" — to whatever agent you point at it.
The first one builds the system. The second one is the system, offered to agents. Never hand the Neon admin server to end users.
What the server looks like
An MCP server is a small program that advertises a list of tools an agent can call. With FastMCP — the standard Python library, a thin decorator layer over the official MCP SDK — each tool is just a typed function with a docstring; you don't write any of the JSON-RPC plumbing. As ever, the agent writes this — it's shown so you can judge it:
# server.py: your RAG, exposed as MCP tools (review material, not to type)
import os
from fastmcp import FastMCP
# your existing retrieval + answer_question pipeline are imported here
mcp = FastMCP("agent-factory-rag")
@mcp.tool()
def search_knowledge(query: str, limit: int = 5) -> list[dict]:
    """Search the knowledge base by meaning and return the closest chunks.
    Use this when you need grounded facts from the user's own data."""
    # embeds `query` in app code, runs the Concept 8 search on Neon,
    # returns [{text, source, score}, ...]; retrieval only, read-only role
    return retrieve(query, limit)
@mcp.tool()
def answer_question(question: str) -> str:
    """Answer a question grounded in the knowledge base (retrieve, then generate)."""
    return rag_answer(question) # the Concept 9 pipeline
if __name__ == "__main__":
    mcp.run() # start the server; the calling agent connects to it
Two things matter more than the rest. The docstring is the interface — it's the text the calling agent reads to decide when to use the tool, so it must say plainly what the tool does and when to reach for it. And retrieval stays read-only: the tool runs the parameterized search from Concept 8 under a read-only database role, so a tool argument can never mutate or leak your data.
Build it with your agent
Same discipline as everywhere else — plan, review, execute. In plan mode:
Wrap our retrieval in a FastMCP server called agent-factory-rag. Expose two tools: search_knowledge(query, limit) returning the top matching chunks with their source and similarity score, and answer_question(question) returning a grounded answer. Reuse our semantic-search query and our existing answer_question() pipeline. Read the Neon pooled connection string and the model API keys from the environment, and connect with a read-only database role. Write clear, action-oriented tool docstrings, since that's what the calling agent reads to decide when to use each tool. Show me the plan and the tool list before writing any code.
Read the plan, confirm the tool list and the read-only role, then approve and let it build.
Register it so agents can use it — with ease
Claude Code — one command:
claude mcp add --scope project rag -- uv run server.py
--scope project writes a .mcp.json at your repo root that you can commit — so teammates and students who clone the repo get the tool automatically (Claude Code prompts them once to approve it). Check it with claude mcp list; reconnect mid-session with /mcp. (FastMCP can also wire this up for you: fastmcp install claude-code server.py handles the dependencies and runs the same claude mcp add underneath.)
OpenCode — add a block to opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"mcp": {
"rag": {
"type": "local",
"command": ["uv", "run", "server.py"],
"environment": {
"DATABASE_URL": "{env:DATABASE_URL}",
"OPENAI_API_KEY": "{env:OPENAI_API_KEY}"
},
"enabled": true
}
}
}
Then just say "use the rag tool" in a prompt. Both tools follow the same add → check → use flow you already know from the Neon MCP server in Concept 4.
Try it from either tool:
Use the rag tool to answer: what did people say about New York? Show me which chunks it retrieved first.
Watch the agent call search_knowledge, get your chunks back, and ground its answer in them. That round-trip is the whole point: your data is now something any agent can reason over.
To share one server with many people instead of running it per-machine, have the agent host it behind a URL and register that URL instead: claude mcp add --transport http rag https://your-host/mcp, or in OpenCode set "type": "remote" with a url. Add OAuth once more than one person connects. The Neon admin server stays dev-only either way.
A few rules to have the agent follow, and to confirm in review:
- Read-only role. The retrieval tools only ever SELECT. Connect with a database role that can't write, so no tool argument can delete or alter data.
- Parameterized queries. The query text arrives as a bound parameter, never string-concatenated into SQL: the same rule as Concept 8.
- Multi-tenancy: never trust a tenant_id passed as a plain tool argument. Derive it from the authenticated session and enforce it with RLS (Concept 15), so one tenant's agent can't read another's vectors.
- Trust the config. A local MCP server is a command your machine runs. Only register servers, and only open .mcp.json / opencode.json files, that you trust; a project config can launch a process on your machine.
And the retrieval is only ever as good as the data behind it, so your eval set still rules the day: point it at the MCP tool and you're measuring the exact thing your agents will experience.
Part 7: Where it runs
| Where | Best for | Notes |
|---|---|---|
| Neon (this course) | From first build to production | Serverless Postgres, pgvector built in, instant branching, autoscaling. You operate nothing. |
| Neon branches | Dev, preview, evals, benchmarks | Each branch is an instant clone — build on dev, preview, commit to the default branch. |
| TigerData Cloud | A full-stack alternative to Neon | "Agentic Postgres": pgvector + pgvectorscale native, Tiger MCP, instant zero-copy forks. Pick it for StreamingDiskANN scale and filtered search without ever migrating. |
A few production realities worth flagging to your agent up front:
- The embedding worker is a real process — it has to be running for embeddings to stay in sync. In production it's a service you deploy and monitor, not something you start by hand.
- Migrations: do schema and worker changes on a Neon branch, preview them, then commit to the default branch — the agent drives this through the Neon MCP's migration tools. Versioned, reviewed, reversible, and never an ad-hoc edit against production.
- Cost lives in the embedding calls and the generation calls, both of which hit external model APIs. Batch where you can, cache where you can, and use the model-matching habit — a cheap model for routine generation, a strong one where quality matters.
- Scale-to-zero vs the worker. Neon scales the database to zero when idle to save money — but an embedding worker that polls constantly keeps it awake, quietly defeating that. For low-traffic apps, have the agent set the worker to run on a schedule (or batch embeddings) rather than tight-polling, so you actually get the savings.
- Use the pooled connection string for app traffic. Neon's pooled endpoint (the -pooler host, PgBouncer in transaction mode) is built for the many short-lived connections serverless and high-concurrency apps open. Have the agent point your RAG-serving app at it; migrations and the worker can use the direct string.
- The worker runs off Neon. You can't run sidecar processes on Neon's managed compute, so the embedding worker lives on your app host, a small VM, or a scheduled job, reaching Neon over its connection string. Tell your agent where it should run.
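"Batch where you can, cache where you can" has a simple shape in the worker. The sketch below plans the work before any API call is made: it skips chunks whose content hash is already stored and groups the rest into batches. The function name, the batch size, and the idea of keeping a SHA-256 hash per chunk are assumptions for illustration, not a description of what your agent's worker does.

```python
import hashlib

def plan_embedding_batches(chunks, already_embedded, batch_size=64):
    """Skip chunks whose content hash is already stored, then group the rest into batches."""
    seen = set(already_embedded)
    pending = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:   # unchanged content needs no new API call
            seen.add(digest)     # also dedupes repeats within this run
            pending.append((digest, text))
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]

stored = {hashlib.sha256(b"b").hexdigest()}  # pretend chunk "b" was embedded earlier
batches = plan_embedding_batches(["a", "b", "a", "c"], stored, batch_size=2)
print([[text for _, text in batch] for batch in batches])  # [['a', 'c']]
```

Four chunks came in and one batch of two goes out. Running this on a schedule instead of in a tight polling loop is also what lets the database scale to zero between runs.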
Whether you need a dedicated vector database at all is the real question behind the hype, and the honest answer is usually not. If you already run Postgres, pgvector keeps your vectors next to the data they describe: one source of truth, filters and joins in the same query, no second system to sync, secure, and pay for. HNSW comfortably covers the ~100k–10M range, and pgvectorscale's StreamingDiskANN extends past a billion vectors, so raw scale is rarely the deciding factor it once was. For most applications, Postgres is the vector database.
A dedicated store (Pinecone, Weaviate, Qdrant, Milvus, and others) earns its place when you've measured a need it meets and Postgres can't — say, very high query throughput at billion-plus scale, a specific recall/latency target your benchmarks show pgvector missing, or a fully managed vector service you'd rather rent than operate. Those are real cases; they're just the minority. The trap is reaching for one by default because a tutorial or a headline did. Decide it the way you decide indexes (Concept 11): let your eval set and a benchmark on your real workload — including its churn — make the call, and weigh the standing cost of a second system against the gain it actually delivers.
Part 8: Hand it to an agent
You've built retrieval and wrapped it as a tool (Part 6). The natural next move is an agent that uses that tool to get real work done — which is exactly where the Build AI Agents crash course picks up. You don't rebuild any of this there; you hand the agent the tool you already have.
Here's the whole bridge in one idea. That course teaches the agent loop with the OpenAI Agents SDK: an Agent is a model equipped with instructions and tools, and a Runner drives the model → tool → model loop until the job is done. Your RAG is one of those tools. Where this course ends — a callable search_knowledge / answer_question — is where that one begins.
Two ways to connect what you built (the agents course covers the loop; you just supply the tool):
- As the MCP server from Part 6. The Agents SDK can consume MCP servers directly, so you point the agent at your agent-factory-rag server and its tools appear in the agent's toolbox automatically. It is the same server you already registered in Claude Code and OpenCode.
- As a function tool. If you'd rather keep it in-process, wrap the same read-only query in the SDK's @function_tool, that course's native tool style. Same retrieval, expressed as a Python function the agent can call.
The deeper bridge: memory vs. knowledge. The agents course frames every agent around two questions: what it can draw on (state) and what it's allowed to do (trust). One kind of state is memory — what it recalls from the running conversation, which that course handles with sessions. Your RAG provides the other kind: knowledge — durable, searchable context it can look up, over a corpus far too big to fit in any window. Sessions hold the chat; your Postgres + pgvector holds everything the agent might need to retrieve. An agent wired to both can remember what was just said and look up what it never knew — and the "fetch only the relevant chunks, leave the million irrelevant rows out" discipline from this course is exactly what keeps that retrieved context from flooding the window. Same throughline, now serving an agent instead of an app. In the book's language, that knowledge half is the agent's system of record — the ground truth it reads from (retrieval), writes to (the worker keeps it current), and verifies against (your eval set). It is what turns a fluent guesser into an agent that executes.
Wire it the way you built everything else — through the agent. In plan mode:
Scaffold the minimal agent from the Build AI Agents crash course (OpenAI Agents SDK: an Agent plus a Runner loop). Give it exactly one tool: our retrieval. Either connect it to the agent-factory-rag MCP server from Part 6 so search_knowledge and answer_question show up as tools, or wrap the same read-only query as a @function_tool; recommend which fits and say why. Leave the agent's own mechanics (sessions, guardrails, model routing) to that course; here, just prove the agent can call our retrieval and ground an answer in it. Show me the plan first.
Approve it, run a question that forces the tool call, and watch the agent retrieve from your Neon (or TigerData) database and answer from it. That's the handoff: Build AI Agents teaches the loop, the guardrails, the sessions, and the deployment; this course gave the agent something true to say. Your eval set crosses over intact — it still measures the retrieval the agent now depends on.
Where to go next
You now have the 80%: you can turn Neon into a vector database, build a grounded RAG system, choose an index on evidence, and improve retrieval with filters, hybrid search, and evals — all by directing an agent and judging its work.
- Make it reliable: Eval-Driven Development Crash Course
- Make it an agent: Part 8 bridges your RAG tool into an agent; the full loop, guardrails, sessions, and deployment are in Build AI Agents
- Make it a product: turn the assistant into a deployable Digital FTE
The throughline never changes: right information, right moment, irrelevant information out. You learned it for your agent. Now you can build it for everyone else's.