Capstone - Data Processing Pipeline
The Challenge: From Data to Insight
You've now mastered three fundamental Python data structures: lists (ordered, mutable sequences), tuples (ordered, immutable sequences), and dictionaries (fast key-value lookups). But mastery means more than understanding each structure in isolation. The real test is this: Can you combine all three to solve a realistic, end-to-end data processing problem?
This is what data engineers, analysts, and backend developers do every day. They receive raw data—messy, unstructured, often in text format—and transform it into meaningful insights. A student dataset. A transaction log. A sensor reading stream. The pattern is always the same:
- Ingest the data (get it into a usable format)
- Parse it (structure it as objects you can work with)
- Filter it (select only what matters)
- Aggregate it (calculate summaries, patterns, counts)
- Output it (present results to humans or systems)
In this lesson, you'll build a Data Processing Pipeline that demonstrates this entire workflow. You'll parse student records, filter by major and GPA, aggregate statistics by program, and output a professional summary report.
The Learning Goal: Prove you can think architecturally about data structures and build a complete application that combines all three collections intelligently. This is not a toy exercise—this is the foundation of real-world data work.
The Pipeline Architecture: Planning Before Code
💬 AI Colearning Prompt
"Design the data structure for this pipeline. We're processing student records (name, major, GPA). Should each record be a dict or a tuple? Why? What structure should we use for aggregating counts by major?"
Before you write a single line of code, understand the design. The best developers sketch their data structures first—on paper, in a notebook, or discussing with their AI partner.
Here's the structure we'll use:
Step 1: Raw Data (String Format)
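A minimal sketch of the raw input. The student names are invented for illustration; the majors and GPAs are chosen to be consistent with the report output shown at the end of this lesson.

```python
# Hypothetical sample data: names are invented; majors and GPAs match
# the summary report produced at the end of this lesson.
raw_data: str = """name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.6
Carol,Computer Science,3.9
David,Physics,3.5
Eve,Mathematics,3.2
Frank,Computer Science,3.5
Grace,Physics,3.4"""
```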
This simulates reading a CSV file (we'll learn actual file I/O in Chapter 27). For now, it's just a multi-line string.
Step 2: Parsed Data (List of Dicts)
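After parsing, the same data has this shape (a sketch of the target structure, shown before we write the parsing code itself):

```python
# Target structure: one dict per student, all records collected in a list.
students: list[dict[str, str | float]] = [
    {"name": "Alice", "major": "Computer Science", "gpa": 3.8},
    {"name": "Bob", "major": "Mathematics", "gpa": 3.6},
    # ... one dict per remaining row
]
```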
Each student is a dict (key-value mapping for field access by name), stored in a list (ordered collection of all records).
Step 3: Filtered Data (List Comprehension)
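For example, selecting Computer Science students with a GPA of 3.5 or higher (assuming the `students` list sketched above):

```python
# Keep only records that satisfy both conditions.
cs_students = [
    student
    for student in students
    if student["major"] == "Computer Science" and student["gpa"] >= 3.5
]
```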
We use a list comprehension with if conditions to select only records matching our criteria.
Step 4: Aggregated Results (Dict of Counts/Stats)
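With the sample data above, the aggregated result would have this shape (the values here match the report at the end of the lesson):

```python
# Target structure: major name -> summary statistics for that major.
stats: dict[str, dict[str, float | int]] = {
    "Computer Science": {"count": 3, "average_gpa": 3.73},
    "Mathematics": {"count": 2, "average_gpa": 3.40},
    "Physics": {"count": 2, "average_gpa": 3.45},
}
```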
We use a dict mapping major names to summary stats (another dict containing counts and averages).
🎓 Expert Insight
In AI-native development, you don't just code—you design. Notice how we chose list for "ordered records" (we care about having all students), dict for "key-value lookup by student field", and dict again for "meaningful keys in aggregations". Structure choice is communication. When future you (or a teammate) reads this code, the structures tell the story.
Phase 1: Parse Raw Data into List of Dicts
Let's start with the foundation. You have a CSV-like string, and your job is to convert it into a list of dicts, where each dict represents one student record.
Specification:
- Input: Multi-line string with headers on first line, data on remaining lines
- Output: list[dict[str, str | float]] where keys are column names
- Each dict = one record
- Handle missing values gracefully (skip malformed rows)
Code Example: Data Parsing
📘 Note: In Chapter 25, you'll learn how to organize this parsing logic into reusable functions. For now, we're writing the code inline to focus on the data structure transformations—how lists and dicts work together to structure raw text.
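A sketch of the parsing step, reusing the hypothetical sample data from Step 1:

```python
raw_data: str = """name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.6
Carol,Computer Science,3.9
David,Physics,3.5
Eve,Mathematics,3.2
Frank,Computer Science,3.5
Grace,Physics,3.4"""

rows = raw_data.strip().split("\n")
header = rows[0].split(",")  # ["name", "major", "gpa"]

students: list[dict[str, str | float]] = []
for row in rows[1:]:
    fields = row.split(",")
    if len(fields) != len(header):
        continue  # skip malformed rows
    students.append({
        "name": fields[0],
        "major": fields[1],
        "gpa": float(fields[2]),  # convert the GPA string to a number
    })

print(f"✓ Parsed {len(students)} student records")
print(students[0])  # {'name': 'Alice', 'major': 'Computer Science', 'gpa': 3.8}
```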
✨ Teaching Tip
When debugging this parsing step, ask your AI: "Why is my list empty?" or "Show me what each dict contains after parsing". AI can help you visualize the structure and spot issues. Use `print(students[0])` to inspect the first record.
Phase 2: Filter Data with Comprehensions
Now that you have structured data, filter it. Let's find all Computer Science students with a GPA of 3.5 or higher.
Specification:
- Input: list[dict[str, str | float]] of all students
- Criteria: major == "Computer Science" AND gpa >= 3.5
- Output: list[dict] containing only matching records
- Use comprehension (not a loop)
Code Example: Filtering with List Comprehension
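A sketch of the filter, assuming the `students` list built in Phase 1:

```python
# Both conditions must be true for a record to be kept.
cs_students = [
    student
    for student in students
    if student["major"] == "Computer Science" and student["gpa"] >= 3.5
]

print(f"✓ Found {len(cs_students)} Computer Science students")
for student in cs_students:
    print(f"  {student['name']}: {student['gpa']}")
```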
Notice the two conditions in the if clause:
student["major"] == "Computer Science"(exact match)student["gpa"] >= 3.5(numeric comparison)
Both must be true for the student to be included.
💬 AI Colearning Prompt
"Show me how to write a list comprehension that filters students from multiple majors (Computer Science OR Mathematics). How would the condition change?"
🚀 CoLearning Challenge
Ask your AI Co-Teacher:
"Given the student data, write a comprehension that finds all students with GPA between 3.5 and 3.9 (inclusive). Then explain what each part of the comprehension does."
Expected Outcome: You'll understand how to combine multiple conditions in comprehensions and apply range-based filtering to numerical data.
Phase 3: Aggregate Data with Dictionaries
Filtering is useful, but aggregation is powerful. Now calculate statistics by major:
- How many students in each major?
- What's the average GPA per major?
Specification:
- Input: list[dict[str, str | float]] of all students
- Output: dict[str, dict[str, float | int]] where:
- Outer key = major name
- Inner dict = `{"count": N, "average_gpa": X.XX}`
- Use dict to accumulate counts and sums
Code Example: Aggregation with Dict
📘 Note: This aggregation pattern—grouping data and calculating statistics—is fundamental to data analysis. In Chapter 25, you'll learn to package this logic into reusable functions. For now, focus on understanding the dict-based accumulation pattern.
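A sketch of the dict-based accumulator, again assuming the `students` list from Phase 1:

```python
# Accumulate a count and a running GPA total for each major.
stats: dict[str, dict[str, float | int]] = {}
for student in students:
    major = student["major"]
    if major not in stats:                         # check if the key exists
        stats[major] = {"count": 0, "total_gpa": 0.0}  # initialize if needed
    stats[major]["count"] += 1                     # accumulate the count
    stats[major]["total_gpa"] += student["gpa"]    # accumulate the GPA sum

# Turn each running total into a final average.
for entry in stats.values():
    entry["average_gpa"] = round(entry["total_gpa"] / entry["count"], 2)
    del entry["total_gpa"]  # drop the intermediate sum

print("✓ Calculated statistics by major")
print(stats)
```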
Notice the pattern:
- Check if the key exists: `if major not in stats`
- Initialize if needed: `stats[major] = {...}`
- Accumulate: `stats[major]["count"] += 1`
- Calculate the final value: `average_gpa = total_gpa / count`
🎓 Expert Insight
This aggregation pattern appears everywhere: calculating totals, counting occurrences, tracking minimums/maximums. You're learning a skill that applies to data analysis, reporting, analytics dashboards, and more. The dict-based accumulator is fundamental. Syntax is cheap—understanding this pattern is gold.
🚀 CoLearning Challenge
Ask your AI:
"I need to find the student with the highest GPA in each major. How would I modify this aggregation to also track the top student's name in each major?"
Expected Outcome: You'll extend the aggregation pattern to track multiple values per group.
Phase 4: Output Formatted Results
Raw dicts are great for computation, but humans need readable output. Format your results as a professional summary report.
Specification:
- Input: dict[str, dict[str, float | int]] of aggregated statistics
- Output: Formatted string suitable for printing or saving
- Include: Major name, student count, average GPA
- Format clearly with spacing and alignment
Code Example: Formatted Output
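A sketch of the report formatting, assuming the `stats` dict produced in Phase 3:

```python
# Build the report line by line, then join once at the end.
lines: list[str] = []
lines.append("Student Statistics by Major")
lines.append("=" * 50)
for major, entry in stats.items():
    lines.append(
        f"{major:25s} | Count: {entry['count']:2d} "
        f"| Avg GPA: {entry['average_gpa']:.2f}"
    )
lines.append("=" * 50)

report = "\n".join(lines)
print(report)
```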
Notice the formatting techniques:
- `{major:25s}` — left-align the major name in a 25-character field
- `{count:2d}` — right-align the integer in a 2-character field
- `{avg_gpa:.2f}` — float with 2 decimal places
- `'\n'.join(lines)` — combine a list of strings with newlines
✨ Teaching Tip
When your output doesn't look quite right, show it to your AI: "Here's my output. The columns don't line up. How can I fix the formatting?" AI can suggest better alignment and explain f-string formatting codes.
Putting It All Together: The Complete Pipeline
Now integrate all phases into one cohesive application. This is the complete, runnable code combining everything you've learned:
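A complete, runnable sketch, using the same hypothetical sample data as the earlier phases:

```python
# Phase 0: raw data (simulates reading a CSV file; names are invented)
raw_data: str = """name,major,gpa
Alice,Computer Science,3.8
Bob,Mathematics,3.6
Carol,Computer Science,3.9
David,Physics,3.5
Eve,Mathematics,3.2
Frank,Computer Science,3.5
Grace,Physics,3.4"""

# Phase 1: parse into a list of dicts
rows = raw_data.strip().split("\n")
header = rows[0].split(",")

students: list[dict[str, str | float]] = []
for row in rows[1:]:
    fields = row.split(",")
    if len(fields) != len(header):
        continue  # skip malformed rows
    students.append({
        "name": fields[0],
        "major": fields[1],
        "gpa": float(fields[2]),
    })
print(f"✓ Parsed {len(students)} student records")

# Phase 2: filter with a comprehension
cs_students = [
    s for s in students
    if s["major"] == "Computer Science" and s["gpa"] >= 3.5
]
print(f"✓ Found {len(cs_students)} Computer Science students")

# Phase 3: aggregate counts and average GPA by major
stats: dict[str, dict[str, float | int]] = {}
for student in students:
    major = student["major"]
    if major not in stats:
        stats[major] = {"count": 0, "total_gpa": 0.0}
    stats[major]["count"] += 1
    stats[major]["total_gpa"] += student["gpa"]
for entry in stats.values():
    entry["average_gpa"] = round(entry["total_gpa"] / entry["count"], 2)
    del entry["total_gpa"]
print("✓ Calculated statistics by major")

# Phase 4: format and print the summary report
report_lines: list[str] = []
report_lines.append("Student Statistics by Major")
report_lines.append("=" * 50)
for major, entry in stats.items():
    report_lines.append(
        f"{major:25s} | Count: {entry['count']:2d} "
        f"| Avg GPA: {entry['average_gpa']:.2f}"
    )
report_lines.append("=" * 50)
print("\n".join(report_lines))
```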
Output:

```
✓ Parsed 7 student records
✓ Found 3 Computer Science students
✓ Calculated statistics by major
Student Statistics by Major
==================================================
Computer Science          | Count:  3 | Avg GPA: 3.73
Mathematics               | Count:  2 | Avg GPA: 3.40
Physics                   | Count:  2 | Avg GPA: 3.45
==================================================
```
Validation Checklist
- Data parses correctly (right number of students, correct field values)
- Filtering works (CS students match expected records)
- Aggregation is accurate (counts and averages are correct)
- Output is readable (aligned columns, no errors)
- Code runs without exceptions
Common Pitfalls and How to Debug Them
Pitfall 1: KeyError When Accessing Dict Values
Error: KeyError: 'major'
Cause: The dict doesn't have the expected key (field name misspelled, or data parsing failed).
Debug approach:
- Print the first dict to see what keys actually exist: `print(students[0].keys())`
- Check if your header parsing is correct
- Ask AI: "Why is my key 'major' not in the dict after parsing?"
Pitfall 2: Type Errors in Aggregation
Error: TypeError: '>' not supported between instances of 'str' and 'float'
Cause: You're trying to compare GPA but it's stored as a string instead of float.
Debug approach:
- Print a student record: `print(students[0]['gpa'], type(students[0]['gpa']))`
- Check your parsing code—is it converting the GPA to float?
- Ask AI: "How do I convert string '3.8' to float in Python?"
Pitfall 3: Comprehension Syntax Error
Error: SyntaxError: invalid syntax
Cause: Missing colon, wrong if placement, or unbalanced brackets.
Debug approach:
- Break the comprehension into a loop to verify the logic, as in the sketch below (assuming the `students` list from Phase 1):
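```python
# Loop form of the Phase 2 filter; easier to add print() calls
# inside the loop body for debugging.
cs_students = []
for student in students:
    if student["major"] == "Computer Science" and student["gpa"] >= 3.5:
        cs_students.append(student)
print(cs_students)
```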
- Once the loop works, convert back to comprehension
- Ask AI: "Convert this loop to a list comprehension and explain each part"
✨ Teaching Tip
When debugging, never just ask AI to fix your code. Instead, ask AI to explain what you're seeing: "I got this error. What does it mean?" Then work with AI to diagnose. This builds your debugging skills—the most valuable skill in professional development.
Extensions: Making It Real
Your basic pipeline works. Now make it more sophisticated. Choose one or more extensions:
Extension 1: Multi-Criteria Filtering
Filter students who are Computer Science OR Mathematics majors with GPA above 3.4:
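A sketch, assuming the `students` list from the pipeline; `in` tests membership against a tuple of accepted majors:

```python
# OR across majors, AND with the GPA threshold.
selected = [
    student
    for student in students
    if student["major"] in ("Computer Science", "Mathematics")
    and student["gpa"] > 3.4
]
print(f"Matched {len(selected)} students")
```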
Extension 2: Sort Results
Sort students by GPA (highest first) before output:
📘 Note: The `lambda` syntax below is a shorthand for defining small, inline operations. You'll learn lambda functions in Chapter 25. For now, just understand: `key=lambda s: s["gpa"]` means "sort by the 'gpa' field of each student dict."
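A sketch, again assuming the `students` list from the pipeline:

```python
# sorted() returns a new list; reverse=True puts the highest GPA first.
ranked = sorted(students, key=lambda s: s["gpa"], reverse=True)
for student in ranked:
    print(f"{student['name']:10s} {student['gpa']:.2f}")
```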
Extension 3: Find Outliers
Find students whose GPA is significantly different from their major's average:
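One possible approach, assuming both `students` and the aggregated `stats` dict from the pipeline. The 0.3 threshold is an arbitrary choice, and with this small sample few or no students may qualify:

```python
# Flag students whose GPA is far from their major's average.
THRESHOLD = 0.3
for student in students:
    major_avg = stats[student["major"]]["average_gpa"]
    gap = student["gpa"] - major_avg
    if abs(gap) > THRESHOLD:
        print(f"{student['name']} ({student['major']}): "
              f"GPA {student['gpa']:.2f} vs major average {major_avg:.2f}")
```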
Capstone Validation: Am I Done?
Check yourself against these criteria:
Core Functionality (Required):
- Code parses raw CSV string into list[dict] ✓
- Filtering works with at least one comprehension ✓
- Aggregation calculates correct counts and averages ✓
- Output is formatted and readable ✓
- No runtime errors when processing data ✓
Code Quality (Expected):
- Type hints present on key data structures ✓
- Variable names are descriptive (not `x`, `data1`, etc.) ✓
- Comments explain non-obvious logic ✓
- Code follows consistent indentation ✓
Understanding (Critical):
- I can explain why each data structure (list, dict) was chosen ✓
- I can justify the comprehension logic ✓
- I could modify this for a different data format (products, employees, transactions) ✓
- I asked AI when I was stuck and learned from the explanation ✓
Stretch Goals (Optional):
- Implemented at least one extension ✓
- Data handles edge cases (empty records, missing fields) ✓
- Code is organized and easy to read ✓
Try With AI
Build a complete sales data pipeline integrating all Chapter 23 concepts.
🔍 Explore Pipeline Architecture:
"Show me a sales CSV pipeline design: parse to list[dict], filter Electronics with price>$100, aggregate revenue by category. Explain structure choices and why dict for records, list for collection, dict for aggregation."
🎯 Practice Pipeline Construction:
"Help me build working code: CSV parsing (handle headers), type conversion (float/int), filtering with comprehension, revenue aggregation with dict accumulator, find top category. Show complete code with type hints."
🧪 Test Edge Case Handling:
"Debug production CSV issues: empty lines, missing fields, invalid prices ('N/A'), duplicate products. Show how code breaks for each, add validation (skip? defaults?), explain strict vs lenient tradeoffs."
🚀 Apply Multi-Dimensional Analysis:
"Build advanced pipeline: revenue by category AND price bracket (
<$100, $100-$500,$500+), top 3 products by revenue (sorted), summary stats, formatted table. Use nested dicts, comprehensions, sorted(), f-strings. Integrate all Chapter 23 concepts."
Capstone Success
You've now completed the full journey from raw data to insights. You've:
- Designed data structures strategically
- Parsed text into structured Python objects
- Filtered data with comprehensions
- Aggregated results using dict-based accumulators
- Output professional summaries
This is real work. Data engineers, backend developers, analytics engineers do this every day. You've demonstrated the core competency: architectural thinking combined with execution.
Congratulations on completing Chapter 23. You're ready for Chapter 24 (Sets and Frozen Sets) and beyond. The collection structures you've mastered form the foundation for everything that comes next—from functions that operate on collections to objects that contain collections as attributes.
What's next: In Chapter 25, you'll learn how to encapsulate this pipeline logic into reusable functions. In Chapter 26, you'll handle exceptions robustly when data is malformed. In Chapter 27, you'll read/write data from actual files. But the core pattern—ingest, transform, aggregate, output—remains your north star.
Keep building. Keep asking your AI partner. Keep validating. You're thinking like a developer now.