Updated Mar 15, 2026

For Instructors: Calibrating and Maintaining AI Prompts

The AI check prompts in this part are designed for the current generation of AI tools (Claude and ChatGPT as of early 2026). AI capabilities will evolve. What counts as a rigorous evaluation today may need adjustment as models improve, change behavior, or develop new failure modes. This section provides a maintenance protocol to keep the assessment system accurate over time.


Semester Calibration Protocol

Step 1 -- Score Distribution Audit (every semester)

Collect Thinking Score Card data across all students. If more than 80% of students score 8 or higher on any dimension by Chapter 3, the prompts are too lenient. If more than 50% score below 4 by Chapter 8, the prompts may be too harsh.

Healthy distribution: Chapter 1 averages of 4-6 rising to Chapter 10 averages of 6-8, with natural variance.
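The audit above is mechanical enough to script. Here is a minimal sketch, assuming Score Card data is stored as `(student_id, chapter, dimension, score)` tuples -- a hypothetical format; adapt the unpacking to however your gradebook actually exports:

```python
def audit_distribution(records, chapter, dimension):
    """Flag a (chapter, dimension) pair whose score distribution
    breaches the protocol's leniency/harshness thresholds.

    records: iterable of (student_id, chapter, dimension, score)
             -- a hypothetical layout assumed for illustration.
    """
    scores = [s for (_, ch, dim, s) in records
              if ch == chapter and dim == dimension]
    if not scores:
        return None
    pct_8_plus = sum(1 for s in scores if s >= 8) / len(scores)
    pct_below_4 = sum(1 for s in scores if s < 4) / len(scores)
    flags = []
    if chapter <= 3 and pct_8_plus > 0.80:   # >80% at 8+ by Chapter 3
        flags.append("too lenient")
    if chapter >= 8 and pct_below_4 > 0.50:  # >50% below 4 by Chapter 8
        flags.append("too harsh")
    return {"mean": sum(scores) / len(scores),
            "pct_8_plus": pct_8_plus,
            "pct_below_4": pct_below_4,
            "flags": flags}
```

Running this per dimension per chapter turns the semester audit into a table of flags rather than a manual scan of 40 exercises' worth of scores.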

Step 2 -- Prompt Spot-Testing (every semester)

Take five student deliverables from the previous semester (one strong, one weak, three average). Submit each to the current AI models using the current prompts. Compare the AI scores to the instructor's independent assessment.

If AI and instructor scores consistently diverge by more than 2 points, revise the prompt.

Common drift patterns: score inflation (AI becomes more generous), compression (AI stops distinguishing mediocre from good), or new blind spots.
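The spot-test comparison and the drift patterns above can be checked with a few lines of code. This is a sketch, not the book's own tooling; the function name, the 2-point threshold argument, and the compression heuristic (AI score spread collapsing to under half the instructor's) are assumptions you can tune:

```python
from statistics import mean, pstdev

def spot_test(ai_scores, instructor_scores, threshold=2):
    """Compare AI scores against the instructor's independent scores
    for the same five deliverables, given in the same order.
    Hypothetical helper; threshold and heuristics are illustrative."""
    diffs = [a - i for a, i in zip(ai_scores, instructor_scores)]
    return {
        "mean_diff": mean(diffs),
        # Consistent divergence: every deliverable off by > threshold.
        "revise_prompt": all(abs(d) > threshold for d in diffs),
        # Inflation: AI systematically more generous than the instructor.
        "inflation": mean(diffs) > threshold,
        # Compression: AI's score spread collapses relative to the
        # instructor's, i.e. it stops distinguishing mediocre from good.
        "compression": pstdev(ai_scores) < 0.5 * pstdev(instructor_scores),
    }
```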

Step 3 -- Feedback Challenge Review (every semester)

Review all Feedback Challenge Protocol submissions. If students successfully challenge AI feedback more than 30% of the time, prompts need tightening. If challenge rate is 0%, students may be too deferential -- consider adding a mandatory challenge requirement (each student must dispute at least one AI score across the 10 chapters).
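The challenge-rate decision rule is small enough to encode directly. A minimal sketch, assuming each Feedback Challenge submission is recorded as a boolean (True if the student's challenge was upheld) -- the record shape and function name are hypothetical:

```python
def challenge_review(challenges):
    """challenges: one bool per Feedback Challenge submission,
    True if the challenge was upheld (hypothetical record shape)."""
    rate = sum(challenges) / len(challenges) if challenges else 0.0
    if rate > 0.30:
        # Students win too often: the AI's feedback is too loose.
        return f"success rate {rate:.0%}: tighten the prompts"
    if rate == 0.0:
        # Nobody ever wins (or nobody tries): students may be too
        # deferential; consider a mandatory challenge requirement.
        return "success rate 0%: students may be too deferential; consider a mandatory challenge"
    return f"success rate {rate:.0%}: within expected range"
```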

Step 4 -- Model Migration (when major AI models update)

When a major new model version is released, run the full spot-test before the semester begins. New models may score differently. Adjust prompt language to maintain consistent scoring behavior.

The five Thinking Score Card dimensions are permanent -- only the prompt wording that elicits accurate scores should be tuned.
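The migration spot-test amounts to re-scoring last semester's reference deliverables with the new model and diffing against the old model's scores. A sketch under assumed names (the 2-point tolerance mirrors Step 2's threshold; nothing here is prescribed by the protocol itself):

```python
def migration_check(baseline_scores, new_model_scores, tolerance=2):
    """Compare the new model's scores on the five reference deliverables
    against the scores the previous model gave (hypothetical helper).

    Returns which deliverables shifted by more than `tolerance`, so
    prompt language can be adjusted before the semester begins."""
    flagged = [i for i, (old, new)
               in enumerate(zip(baseline_scores, new_model_scores))
               if abs(old - new) > tolerance]
    return {"stable": not flagged, "deliverables_to_review": flagged}
```

If `stable` is False, revise only the prompt wording until the new model reproduces the baseline scores; the five dimensions themselves stay fixed.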

Step 5 -- Scenario Refresh (annually)

Review exercise scenarios for continued relevance. Scenarios based on emerging technology may become dated as these technologies mature. Replace settled scenarios with new dilemmas requiring genuine thinking.

The exercise structure and AI prompts remain the same -- only the scenario content changes.


The goal of calibration is not perfect AI scoring -- that is impossible. The goal is consistent scoring that reliably distinguishes strong thinking from weak thinking, so that the Score Card trajectory is meaningful across 40 exercises. Small inaccuracies on individual scores wash out over 40 data points. Systematic bias does not -- and that is what the calibration protocol catches.
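The noise-versus-bias distinction can be made concrete with a toy simulation (entirely synthetic numbers, not course data): random per-exercise error largely cancels across 40 scores, while a constant inflation shifts the whole trajectory and never cancels.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic "true" trajectory over 40 exercises, rising from 4 to 8.
true_scores = [4 + 4 * i / 39 for i in range(40)]

# Unbiased noise: each AI score is off by +/-1 at random.
noisy = [t + random.choice([-1, 1]) for t in true_scores]

# Systematic bias: every AI score inflated by 1.5 points.
biased = [t + 1.5 for t in true_scores]

noise_error = mean(noisy) - mean(true_scores)   # shrinks as errors cancel
bias_error = mean(biased) - mean(true_scores)   # stays at 1.5 regardless
```

Averaged over the full trajectory, `noise_error` sits well inside `bias_error`: individual inaccuracies wash out, but the inflated trajectory is simply wrong everywhere, which is exactly what the calibration protocol is built to catch.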

The Thinking Score Card dimensions (Independent Thinking, Critical Evaluation, Reasoning Depth, Originality, Self-Awareness) are permanent. The prompts that measure them are tunable. Calibrate the instrument; do not change what it measures.

Knowledge is the foundation. Thinking is the building. This part teaches you to build.