Skip to main content

Reading GROUP BY and Aggregate Traps

The boss sends a third message.

"What's the average expense amount per category? Quick read for a budget meeting."

The agent writes a query. The output looks reasonable. The Food category sits at $42, Transport at $35, Entertainment at $61. The boss thanks you. The next day the head of finance says the averages are wrong: too low across the board.

Why? Some expenses had no category assigned at the time they were logged. The category column was NULL on those rows. An AVG silently excluded them. The query was not wrong by SQL standards. It just answered a slightly different question than the one the boss asked.

This lesson teaches you to spot aggregate traps before the boss does.

Key Terms for This Lesson
  • GROUP BY: Buckets rows by one or more columns. After grouping, each row in the result represents one bucket, not one original row.
  • Aggregate function: A function that summarises a group of rows: COUNT, SUM, AVG, MIN, MAX.
  • HAVING: A filter that runs after grouping. Use it when the filter is about the aggregate, not the original rows.
  • WHERE: A filter that runs before grouping. Use it when the filter is about the original rows.
  • NULL in aggregates: AVG, SUM, and COUNT(column) silently skip NULL values. COUNT(*) counts every row, NULLs included.

A Grouped Query You Can Read

SELECT category, AVG(amount) AS avg_amount
FROM expenses
GROUP BY category;

Walk it.

  • Bucket: Each row in the result is one category.
  • Aggregate: The average expense amount per category.
  • Skipped silently: Any row where category is NULL is bucketed into a NULL group (or excluded, depending on the database).
  • Skipped silently: Any row where amount is NULL is excluded from the AVG.

For most reporting, that double-silence is what bites you. The query did not crash. It just told you the average of the rows it could count. Whether those rows are the rows the boss meant depends on the data.

HAVING vs WHERE

This trips up almost everyone, including the agent on bad days. Read the difference twice.

-- WRONG: WHERE cannot reference an aggregate. This is a syntax error.
SELECT category, AVG(amount) AS avg_amount
FROM expenses
WHERE AVG(amount) > 50
GROUP BY category;
-- RIGHT: HAVING filters groups after the aggregation
SELECT category, AVG(amount) AS avg_amount
FROM expenses
GROUP BY category
HAVING AVG(amount) > 50;

The mental model: WHERE is the colander that decides which raw rows enter the bucket. HAVING is the second colander that decides which buckets leave the kitchen.

When you read an agent's query and the WHERE clause references an aggregate function, that is a bug. The fix is to move the condition into a HAVING clause.

COUNT(*) vs COUNT(column)

Two queries that look almost identical can answer very different questions.

-- Counts every row in the table
SELECT COUNT(*) FROM expenses;
-- Counts only rows where category is not NULL
SELECT COUNT(category) FROM expenses;

If half your expenses have no category, the second query returns half the number. When you read a COUNT, look at what is inside the parentheses. If it is a column that can be NULL, the count silently excludes those rows.

For "how many expenses total," use COUNT(*). For "how many expenses have a category set," use COUNT(category). Either can be correct. They cannot both be the answer to the same question.

PRIMM-AI+ Practice: Average Per Category

Predict [AI-FREE]

The agent produced this query for the boss's budget meeting.

SELECT category, AVG(amount) AS avg_amount, COUNT(*) AS expense_count
FROM expenses
WHERE date >= '2026-01-01'
GROUP BY category
ORDER BY avg_amount DESC;

Before running, write down:

  • One bucket = one row of the result. Which column is the bucket?
  • What rows go into each bucket? (Hint: read the WHERE clause first.)
  • What happens to rows where category is NULL?
  • What happens to rows where amount is NULL?
  • Your confidence score from 1 to 5.

Run

Ask Claude Code to run the query against your Neon database.

You should see something like this:

  category   | avg_amount | expense_count
-------------+------------+---------------
Transport | 65.40 | 8
Food | 42.15 | 12
Entertainment | 38.00 | 4
NULL | 71.20 | 3
(4 rows)

The NULL row at the bottom is the giveaway. There are three expenses in your database that have no category. They have a high average. The boss did not ask about them. They are in the report anyway.

Investigate

Write your own explanation first:

  1. The WHERE clause limited rows to expenses on or after January 1, 2026. Everything before that date was excluded.
  2. The GROUP BY put every remaining row into a bucket named after its category. Rows with NULL category were bucketed into a NULL group.
  3. AVG computed the average of amount within each bucket. Any row with NULL amount inside a bucket was silently excluded from the average.

Then ask the agent these questions in order:

  1. "Is there a way to keep only the categories that have at least 5 expenses?" (Answer: HAVING COUNT(*) >= 5.)
  2. "What happens to the NULL category row if I add WHERE category IS NOT NULL? Why is that safer for this report?"
  3. "What is the difference between COUNT(*) and COUNT(amount) for this table?"

Modify

Change one requirement: the boss now wants to see only categories with at least 5 expenses, and she does not want the NULL row in the report.

Predict the changes: a HAVING clause to filter by count, and a WHERE clause that excludes NULL categories.

Ask the agent to make exactly those two changes. Read the result. Verify nothing else moved.

Make [Mastery Gate]

Write an English brief for a grouped report on your own table: per-user totals, per-month sums, per-category counts.

Ask the agent for the query. Read it. Check the three traps in order:

  1. Is the WHERE filtering raw rows or trying (illegally) to filter aggregates?
  2. Is the HAVING filtering aggregates correctly?
  3. Are NULLs being silently excluded in a way that matters for the answer?

Approve or reject based on the read. The gate passes when you can correctly call out a NULL-in-aggregate issue or a HAVING-vs-WHERE mistake without running the query.

Try With AI

Prompt 1: HAVING vs WHERE Drill

I'm going to give you two requests. For each, tell me whether the
condition belongs in WHERE or HAVING, and write the correct query
against the expenses table.

A) Find categories where the total spend is over $500.
B) Find expenses where the amount is over $500.

What you're learning: Practising the WHERE-vs-HAVING decision out loud. The first request is a HAVING (filtering buckets by their sum). The second is a WHERE (filtering raw rows). Wrong picks here are one of the top three aggregate bugs in agent output.

Prompt 2: NULL in AVG

Create a small temporary table with three rows: one with amount 10,
one with amount 20, and one with amount NULL. Then run
`SELECT AVG(amount) FROM that_table` and explain why the answer is
15 and not 10.

What you're learning: You are forcing the NULL-skip behaviour to appear in front of you. Once you have seen 15 come back from a three-row table with one NULL, you will remember it for years.

Prompt 3: Read a Real Report Query

Write a SQL query for: "Average spend per user, only for users with
more than 3 expenses in the last 90 days, sorted highest first."
Show me the query before running. I want to read it.

What you're learning: This combines WHERE (date range), GROUP BY (per user), HAVING (more than 3 expenses), and ORDER BY. Reading the query is your final integration test for the lesson.

Checkpoint

  • I can read a GROUP BY query and name what each row in the result represents.
  • I know when to use HAVING and when to use WHERE.
  • I know that AVG, SUM, and COUNT(column) silently skip NULLs.
  • I rejected at least one wrong aggregate query and explained the issue in business terms.

Flashcards Study Aid