Bronze Capstone: First Real Day

In the Give Your Employee an Identity, Teach Your Employee a Skill, and Connect Your Employee to the World lessons, you gave your employee an identity, a skill, and a connection to the outside world. Now you will find out whether any of it works under real conditions.

This is not a demo. You will send your AI employee the kinds of tasks that a human in your profession handles on a typical workday. Some will be routine. Some will require the domain skill you built in the Teach Your Employee a Skill lesson. At least one will be deliberately ambiguous: the kind of request where a good employee asks for clarification instead of guessing.

The goal is not a perfect score. The goal is an honest evaluation that tells you exactly what works, what fails, and what to improve next.

The Challenge

Send your AI employee 3 to 5 real professional tasks and evaluate the results using a structured rubric. At least one task must use the workflow from the Teach Your Employee a Skill lesson, at least one must use the connection you set up in the Connect Your Employee to the World lesson, and at least one must be ambiguous enough that the employee should ask a clarifying question rather than guess.
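Before sending anything, it can help to check that your planned task set actually meets these constraints. The following is a minimal sketch, not part of NanoClaw: the `Task` class, the `kind` labels, and `plan_problems` are hypothetical names chosen for illustration, and the sample prompts are borrowed from the accountant example in this lesson.

```python
from dataclasses import dataclass

# Hypothetical labels for what each task is meant to test. The challenge
# requires that skill, connection, and ambiguity each appear at least once.
REQUIRED_KINDS = {"skill", "connection", "ambiguous"}


@dataclass
class Task:
    prompt: str
    kind: str  # e.g. "routine", "skill", "connection", "ambiguous", "complex"


def plan_problems(tasks: list[Task]) -> list[str]:
    """Return problems with the plan; an empty list means the plan qualifies."""
    problems = []
    if not 3 <= len(tasks) <= 5:
        problems.append(f"expected 3-5 tasks, got {len(tasks)}")
    missing = REQUIRED_KINDS - {t.kind for t in tasks}
    for kind in sorted(missing):
        problems.append(f"no task of kind '{kind}'")
    return problems


plan = [
    Task("Review this invoice for errors", "routine"),
    Task("Categorize these expenses by tax deduction type", "skill"),
    Task("Email the client about their payment status", "connection"),
    Task("Handle this tax situation", "ambiguous"),
]
print(plan_problems(plan))  # prints []
```

Labeling each task with a single `kind` is a simplification: a complex task may exercise several areas at once, so treat the label as the primary thing the task is meant to test.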

Acceptance Criteria

  1. Conversation log exported showing all tasks and responses
  2. Self-evaluation rubric completed with scores and evidence for each dimension
  3. At least one response improved through follow-up iteration (you gave feedback, the employee adapted)
  4. Written reflection identifying what worked, what failed, and one specific improvement to make

Deliverables

Add these files to your nanoclaw-employee repo:

  • conversation-log.md: full task/response transcript
  • evaluation.md: completed rubric with scores and reflection
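The lesson does not prescribe a layout for conversation-log.md. If you are assembling the transcript by hand rather than exporting it, one possible structure is a section per task with the prompt and the response side by side. The sketch below assumes that layout; `Exchange` and `render_log` are hypothetical helpers, not part of any NanoClaw tooling.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    task_number: int
    prompt: str     # what you sent
    response: str   # what the employee replied
    note: str = ""  # optional label, e.g. "revised after feedback"


def render_log(exchanges: list[Exchange]) -> str:
    """Render exchanges as markdown, one section per prompt/response pair."""
    lines = ["# Conversation Log", ""]
    for ex in exchanges:
        title = f"## Task {ex.task_number}"
        if ex.note:
            title += f" ({ex.note})"
        lines += [
            title,
            "",
            f"**You:** {ex.prompt}",
            "",
            f"**Employee:** {ex.response}",
            "",
        ]
    return "\n".join(lines)
```

Keeping the original and the revised response as separate entries for the same task number (using `note` to tell them apart) makes the iteration required by the acceptance criteria easy to find later.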

These examples show how different professions structure their five tasks. Adapt the pattern to your own work.

Accountant:

  1. "Review this invoice for errors": routine task testing basic identity and tone
  2. "Categorize these expenses by tax deduction type": tests Teach Your Employee a Skill
  3. "Email the client about their payment status": tests Connect Your Employee to the World via Gmail
  4. "Handle this tax situation": ambiguous, should ask: which jurisdiction? personal or business?
  5. "Prepare a quarterly summary from these three invoices": complex, combines skill + reasoning

Teacher:

  1. "Write a welcome message for parents about the field trip": routine tone check
  2. "Plan next week's math unit on fractions for 4th graders": tests Teach Your Employee a Skill
  3. "Post an update in the parent channel about homework policy": tests Connect Your Employee to the World via Slack
  4. "Help with this student": ambiguous, should ask: academic help? behavioral? what subject?
  5. "Create a differentiated worksheet for my mixed-ability class": complex, combines skill + judgment

Consultant:

  1. "Draft a status update for the project team": routine task
  2. "Build a proposal outline for a new client engagement": tests Teach Your Employee a Skill
  3. "Check my calendar and prep notes for tomorrow's meetings": tests Connect Your Employee to the World
  4. "Follow up with the client": ambiguous, should ask: which client? about what? what tone?
  5. "Analyze why this project is behind schedule and suggest recovery options": complex reasoning

Evaluation Rubric

Use this rubric for your evaluation.md. Score each dimension 1-5 and include specific evidence.

| Dimension | 1 (Poor) | 3 (Adequate) | 5 (Excellent) | Your Score | Evidence |
| --- | --- | --- | --- | --- | --- |
| Domain accuracy | Major factual errors about your profession | Mostly correct, minor gaps | Gets professional details right consistently | | |
| Appropriate tone | Would embarrass you if a client saw it | Acceptable but generic | Matches the voice you defined in Give Your Employee an Identity | | |
| Skill usage | Did not use Teach Your Employee a Skill when it should have | Used the skill but missed nuances | Applied the skill effectively with domain insight | | |
| Connection usage | Failed to use Connect Your Employee to the World | Used the connection but with errors | Smooth integration with the external channel/tool | | |
| Clarification behavior | Guessed on ambiguous task | Asked a question but not the right one | Asked targeted clarifying questions before acting | | |
| Iteration quality | No improvement after feedback | Some improvement, missed key points | Meaningfully improved response based on your feedback | | |
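If you prefer to keep your scores in a structured form before writing them into evaluation.md, the rubric can be modeled as a short list of records. This is a sketch under assumptions: `Score`, `validate`, and `weakest` are hypothetical names, and the scores and evidence strings in the example are placeholders, not real results.

```python
from dataclasses import dataclass

# The six rubric dimensions from this lesson.
DIMENSIONS = [
    "Domain accuracy",
    "Appropriate tone",
    "Skill usage",
    "Connection usage",
    "Clarification behavior",
    "Iteration quality",
]


@dataclass
class Score:
    dimension: str
    score: int     # 1 (Poor) to 5 (Excellent)
    evidence: str  # quote or task reference from conversation-log.md


def validate(scores: list[Score]) -> None:
    """Raise ValueError if a dimension is missing, out of range, or lacks evidence."""
    seen = {s.dimension for s in scores}
    missing = [d for d in DIMENSIONS if d not in seen]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    for s in scores:
        if not 1 <= s.score <= 5:
            raise ValueError(f"{s.dimension}: score must be 1-5")
        if not s.evidence.strip():
            raise ValueError(f"{s.dimension}: evidence is required")


def weakest(scores: list[Score]) -> Score:
    """Return the lowest-scoring dimension (the first one wins on ties)."""
    return min(scores, key=lambda s: s.score)


# Placeholder values for illustration only.
example = [
    Score("Domain accuracy", 4, "placeholder: Task 2"),
    Score("Appropriate tone", 4, "placeholder: Task 1"),
    Score("Skill usage", 3, "placeholder: Task 2"),
    Score("Connection usage", 4, "placeholder: Task 3"),
    Score("Clarification behavior", 2, "placeholder: Task 4"),
    Score("Iteration quality", 3, "placeholder: Task 4 follow-up"),
]
validate(example)
print(weakest(example).dimension)  # prints Clarification behavior
```

Requiring non-empty evidence in `validate` mirrors the rubric's instruction to back each score with specifics, and `weakest` points at the dimension to target when you iterate.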

Hints

PRIMM-AI+ Practice

This challenge follows the PRIMM-AI+ cycle. Before you build, predict. After you build, investigate.

Predict [AI-FREE]

Before you run any tasks, write down:

  • Which of your 3 to 5 tasks you expect the employee to handle best and which you expect it to struggle with most.
  • What the employee will do when it receives the ambiguous task: will it ask a clarifying question, guess, or refuse?
  • What your lowest-scoring rubric dimension will be and why.
  • Your confidence score from 1 to 5.

Do not ask the agent until those notes are written.

Run

Open your NanoClaw WhatsApp group alongside Claude Code ($ claude). Send each of your 3 to 5 tasks in sequence, logging the full exchange. Use Claude Code to help you draft the evaluation.md rubric scores and reflection after running the tasks.

Complete the challenge as described above: send all tasks, export the conversation log, complete the rubric, and iterate on at least one weak response.

Investigate

First, write your own explanation of whether your predicted strongest and weakest tasks matched the actual rubric scores. Then ask the agent: "Looking at the task where you scored lowest in my evaluation, what specific part of my identity configuration or skill definition caused the weak response?"

Modify

Change one requirement: pick the dimension where the employee scored lowest (for example, clarification behavior) and add or tighten one rule in groups/main/CLAUDE.md to address it. First, predict how the score for that dimension will change. Then apply the fix, re-run only the task that exposed the weakness, and record the new response alongside the original.

Make [Mastery Gate]

Verify against the acceptance criteria above. Passing means: the conversation log is exported and committed, the evaluation rubric is completed with scores and evidence for all six dimensions, at least one response was improved through follow-up iteration, and the written reflection identifies one specific improvement to make next.
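Part of this gate can be checked mechanically. The sketch below is an assumption-laden helper, not an official grader: `gate_problems` is a hypothetical function that only confirms the two deliverable files exist, are non-empty, and that evaluation.md mentions each rubric dimension by name. It cannot tell whether the files are committed, whether the evidence is convincing, or whether a response genuinely improved.

```python
from pathlib import Path

REQUIRED_FILES = ["conversation-log.md", "evaluation.md"]
DIMENSIONS = [
    "Domain accuracy",
    "Appropriate tone",
    "Skill usage",
    "Connection usage",
    "Clarification behavior",
    "Iteration quality",
]


def gate_problems(repo: Path) -> list[str]:
    """Return a list of mechanical problems found in the repo folder."""
    problems = []
    for name in REQUIRED_FILES:
        path = repo / name
        if not path.is_file() or not path.read_text(encoding="utf-8").strip():
            problems.append(f"{name} is missing or empty")
    evaluation = repo / "evaluation.md"
    if evaluation.is_file():
        # Case-insensitive search for each dimension name in the rubric.
        text = evaluation.read_text(encoding="utf-8").lower()
        for dim in DIMENSIONS:
            if dim.lower() not in text:
                problems.append(f"evaluation.md does not mention '{dim}'")
    return problems
```

Run from the root of your nanoclaw-employee repo, `gate_problems(Path("."))` returning an empty list means only that the files are in place; the judgment-based criteria still need your own review.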

Level 1: Planning Your Tasks

Think about what you actually did at work this week. Pick tasks that range from routine to complex. The best test tasks are real ones, not hypothetical scenarios. If you can use actual documents, emails, or situations from your work (with sensitive details removed), the evaluation will be far more meaningful.

Level 2: Ask Your AI for Task Ideas

Before starting the evaluation, send this to Claude:

"What are 5 common daily tasks for a [your profession] that vary in complexity from routine to judgment-heavy? For each, note whether it primarily tests identity/tone, domain skill, tool usage, or ambiguity handling."

Use the response to design a balanced test set that covers all four areas named in the prompt: identity/tone, domain skill, tool usage, and ambiguity handling.

Level 3: Structuring the Evaluation

Run your tasks in this specific order for the clearest signal:

  1. Task 1 (Routine): Tests basic identity and tone. Should be something your employee handles easily.
  2. Task 2 (Skill-heavy): Requires the SKILL.md you built in Teach Your Employee a Skill. Does the domain expertise come through?
  3. Task 3 (Connection-dependent): Must use the channel or MCP server you set up in Connect Your Employee to the World. Does the integration work end-to-end?
  4. Task 4 (Ambiguous): Deliberately vague. A good employee asks questions before acting. A bad one guesses.
  5. Task 5 (Complex): Combines everything. Tests whether identity + skill + connection work together.

For the iteration test: pick the weakest response from Tasks 1-5, give specific feedback, and ask for a revised version. Compare the two.