Debugging & Troubleshooting
Your agent failed. Now what?
In Lesson 10, you deployed agent_main.py as a systemd service with restart policies, resource limits, and a health check script. That infrastructure keeps your agent running through routine crashes. But when something genuinely breaks -- a memory leak, a full disk, a network timeout at 3 AM -- automatic restarts won't help. You need to find the root cause.
Production debugging is not about memorizing commands. It is about systematic diagnosis: gathering evidence, isolating the problem, and fixing the root cause instead of blindly restarting the service.
By the end of this lesson, you will have a four-phase triage methodology and the specific tools to execute each phase. When your Digital FTE fails, you will know exactly where to look.
Structured Triage Methodology
Before diving into individual tools, learn the system. Every production issue falls into one of four categories, and diagnosing them in the right order saves time:
Phase 1: Logs -- What did the agent say before it died?
Phase 2: Network -- Can the agent reach its dependencies?
Phase 3: Disk -- Is there space for the agent to operate?
Phase 4: Processes -- Is the agent consuming resources abnormally?
This order matters. Logs answer "what happened" immediately in 80% of cases. If logs are clean, check network connectivity. If the network is fine, check disk space. If disk is fine, inspect the process itself. Skipping phases or jumping to process debugging first wastes time on symptoms instead of causes.
Agent fails → Check logs (journalctl)
↓ logs clean?
Check network (curl, ss, ping)
↓ network fine?
Check disk (df, du)
↓ disk fine?
Check process (ps, strace, lsof)
The rest of this lesson teaches the tools for each phase, applied to the my-agent service you created in Lesson 10.
Phase 1: Log Analysis with journalctl
When a systemd service fails, the answer is almost always in the logs. journalctl reads the system journal where systemd captures all stdout and stderr from your service.
Read Service Logs
Check the current status first:
sudo systemctl status my-agent
Output (failed service):
● my-agent.service - Sample Digital FTE Agent
Loaded: loaded (/etc/systemd/system/my-agent.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Tue 2026-02-11 14:32:15 UTC; 2min ago
Process: 12345 ExecStart=/usr/local/bin/uvicorn agent_main:app --host 0.0.0.0 --port 8000 (code=exited, status=1/FAILURE)
Main PID: 12345 (code=exited, status=1/FAILURE)
Now pull the full logs for this service:
journalctl -u my-agent
This shows every log entry for my-agent since the journal began. For a service that has been running for days, this is too much output. Filter it down.
Filter by Recency
Show only the last 50 lines:
journalctl -u my-agent -n 50
Output:
Feb 11 14:32:10 server uvicorn[12345]: INFO: Processing request batch 847
Feb 11 14:32:12 server uvicorn[12345]: WARNING: Memory usage at 490MB
Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Main process exited, code=exited, status=1/FAILURE
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'
Follow Logs in Real Time
Watch logs as they appear (like tail -f for the journal):
journalctl -u my-agent -f
Output (live stream):
Feb 11 14:35:00 server uvicorn[12400]: INFO: 192.168.1.5:42386 - "GET /health HTTP/1.1" 200
Feb 11 14:35:05 server uvicorn[12400]: INFO: Processing request batch 1
Feb 11 14:35:10 server uvicorn[12400]: INFO: 192.168.1.5:42388 - "GET /tasks HTTP/1.1" 200
Press Ctrl+C to stop following.
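When the live stream is noisy, pipe the follow output through grep to surface only suspicious lines. The pattern below is one reasonable filter; adjust it to match your agent's log format:
journalctl -u my-agent -f | grep -iE "error|warning|traceback"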
Filter by Priority
Show only errors (skip informational messages):
journalctl -p err --since "1 hour ago"
Output:
Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'
Priority levels from most to least severe:
| Level | Name | Meaning |
|---|---|---|
| 0 | emerg | System is unusable |
| 1 | alert | Immediate action required |
| 2 | crit | Critical condition |
| 3 | err | Error condition |
| 4 | warning | Warning condition |
| 5 | notice | Normal but significant |
| 6 | info | Informational |
| 7 | debug | Debug-level detail |
Using -p err shows levels 0 through 3 (emerg, alert, crit, err). This cuts through noise fast.
Filter by Time Range
Show logs from a specific window:
journalctl -u my-agent --since "2026-02-11 14:00" --until "2026-02-11 15:00"
Output:
Feb 11 14:05:30 server systemd[1]: Started Sample Digital FTE Agent.
Feb 11 14:05:31 server uvicorn[12345]: INFO: Application startup complete.
Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'
Combine unit, priority, and time filters for precision:
journalctl -u my-agent -p err --since "1 hour ago"
This gives you only errors from your service in the last hour -- the exact information you need for triage.
journalctl Quick Reference
| Command | Purpose |
|---|---|
| journalctl -u my-agent | All logs for the service |
| journalctl -u my-agent -n 50 | Last 50 lines |
| journalctl -u my-agent -f | Follow live output |
| journalctl -p err --since "1 hour ago" | Errors from last hour |
| journalctl -u my-agent --since "2026-02-11 14:00" --until "2026-02-11 15:00" | Specific time window |
| journalctl -u my-agent -b | Since last boot |
Real-World Scenario: MemoryError Crash
Your agent has been running fine for three days. This morning, it shows failed in systemctl status. Here is how you diagnose it:
# Step 1: What happened?
journalctl -u my-agent -p err --since "6 hours ago"
Output:
Feb 11 03:14:22 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 03:14:22 server systemd[1]: my-agent.service: Main process exited, code=exited, status=1/FAILURE
The agent crashed at 3:14 AM with a MemoryError. Now check whether this is a server-wide RAM issue or a process-specific limit:
free -h
Output:
total used free shared buff/cache available
Mem: 1.0Gi 780Mi 50Mi 12Mi 170Mi 120Mi
Swap: 0B 0B 0B
Only 120 MB available system-wide. Check if the service has a memory limit:
systemctl show my-agent --property=MemoryMax
Output:
MemoryMax=536870912
The service is capped at 512 MB by the MemoryMax directive from Lesson 10. The agent tried to allocate beyond that limit. You now have two paths: increase the limit or optimize your agent's memory usage.
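If you choose to raise the limit, a systemd drop-in override keeps the change out of the original unit file. A sketch, assuming you settle on a 768 MB cap -- on this 1 GB server that leaves little headroom, so treat it as a stopgap while you also fix the allocation pattern:
sudo systemctl edit my-agent
Add these lines in the editor that opens:
[Service]
MemoryMax=768M
Then apply the change and confirm the new cap:
sudo systemctl daemon-reload
sudo systemctl restart my-agent
systemctl show my-agent --property=MemoryMax
The last command should now report MemoryMax=805306368 (768 MB).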
Phase 2: Network Diagnosis
If logs show connection errors, timeouts, or "Connection refused" messages, the problem is in the network layer. Diagnose from local to remote -- this builds on the networking foundations from Lesson 9.
Layer 1: Is the Local Service Responding?
Test whether your agent is listening on its port:
curl -v localhost:8000/health
Output (working):
* Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /health HTTP/1.1
< HTTP/1.1 200 OK
{"status":"healthy","agent":"running","timestamp":"2026-02-11T14:40:01.234567"}
Output (not working):
* Trying 127.0.0.1:8000...
* connect to 127.0.0.1 port 8000 failed: Connection refused
If connection is refused, check whether anything is listening on that port:
sudo ss -tlnp | grep 8000
Output (listening):
LISTEN 0 128 0.0.0.0:8000 0.0.0.0:* users:(("uvicorn",pid=12400,fd=7))
Output (not listening):
(no output -- nothing is bound to port 8000)
No output means the agent process is not running or is listening on a different port. Go back to Phase 1 and check logs.
Layer 2: Can You Reach External Hosts?
Test DNS resolution and basic connectivity:
ping -c 3 api.example.com
Output (working):
PING api.example.com (93.184.216.34) 56(84) bytes of data.
64 bytes from 93.184.216.34: icmp_seq=1 ttl=56 time=11.4 ms
64 bytes from 93.184.216.34: icmp_seq=2 ttl=56 time=11.2 ms
64 bytes from 93.184.216.34: icmp_seq=3 ttl=56 time=11.3 ms
Output (DNS failure):
ping: api.example.com: Name or service not known
If DNS fails, the problem is name resolution, not the remote server. Check /etc/resolv.conf for DNS configuration.
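Two quick follow-ups narrow it down. The resolvectl command assumes the server uses systemd-resolved (the default on recent Ubuntu releases); if yours does not, the resolv.conf check alone still tells you which nameservers are configured:
# Which resolver is the system pointed at?
cat /etc/resolv.conf
# Ask the system resolver directly and see whether the name resolves
resolvectl query api.example.com
If resolv.conf points at a nameserver you cannot reach, or this one name fails while others resolve, the problem is DNS configuration rather than your agent.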
Layer 3: Can You Reach the Remote Service?
Test HTTP connectivity to an external API your agent depends on:
curl -o /dev/null -s -w "%{http_code}" https://api.example.com
Output:
200
A 200 means the remote service is reachable and responding. Other common codes:
| Code | Meaning | Action |
|---|---|---|
| 000 | Connection failed entirely | Check network, DNS, firewall |
| 403 | Forbidden | Check API key or IP allowlist |
| 429 | Rate limited | Back off, check request volume |
| 502/503 | Remote server error | Not your problem -- wait or contact provider |
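The status code alone does not distinguish a slow dependency from a dead one. This variant bounds the request at 10 seconds and reports where the time went, using standard curl -w variables:
curl -o /dev/null -s --max-time 10 \
  -w "code=%{http_code} dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n" \
  https://api.example.com
A code of 000 with total close to the 10-second cap points at a timeout along the path; 000 returned almost instantly points at a refused connection or a failed DNS lookup.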
Layer 4: Is the Firewall Blocking Traffic?
sudo ufw status
Output:
Status: active
To Action From
-- ------ ----
22/tcp ALLOW Anywhere
8000/tcp ALLOW Anywhere
If port 8000 is not listed and you need external access, add it:
sudo ufw allow 8000/tcp
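Opening the port to the whole internet is rarely what an agent API needs. If only one network should reach it, scope the rule -- the subnet below matches the 192.168.1.x clients from the earlier outputs and is only an example; substitute your own:
sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp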
Network Diagnosis Summary
Follow this sequence every time:
# 1. Is the agent listening locally?
curl -v localhost:8000/health
# 2. Is the port bound?
sudo ss -tlnp | grep 8000
# 3. Can you resolve and reach external hosts?
ping -c 3 api.example.com
# 4. Does HTTP to the remote service work?
curl -o /dev/null -s -w "%{http_code}" https://api.example.com
# 5. Is the firewall blocking?
sudo ufw status
If step 1 fails, the problem is local. If steps 1-2 pass but step 3 fails, the problem is DNS or routing. If steps 1-3 pass but step 4 fails, the problem is the remote service or a firewall.
Phase 3: Disk Monitoring
When disk space runs out, services crash with cryptic errors -- "No space left on device," write failures, or silent hangs. Agents that generate logs, cache responses, or store outputs can fill a disk faster than you expect.
Check Overall Disk Usage
df -h
Output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20G 18G 1.2G 94% /
tmpfs 512M 0 512M 0% /dev/shm
If Use% is above 90%, you are in the danger zone. Above 95%, services will start failing.
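A small threshold check catches this before a 3 AM failure does. A minimal sketch you could schedule from cron; the 90% threshold and the root filesystem target are assumptions to adjust:
#!/usr/bin/env bash
# disk-check.sh -- warn when the root filesystem crosses a usage threshold
THRESHOLD=90
# df --output=pcent prints a header plus a value like " 94%"; keep only the digits
USAGE=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "WARNING: / is at ${USAGE}% (threshold ${THRESHOLD}%)"
fi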
Find What Is Consuming Space
du -sh /var/log/* | sort -rh | head -5
Output:
1.2G /var/log/journal
340M /var/log/syslog
128M /var/log/auth.log
45M /var/log/kern.log
12M /var/log/dpkg.log
The journal is consuming 1.2 GB. For agent-specific directories:
du -sh /opt/agent/*
Output:
4.0K /opt/agent/agent_main.py
256M /opt/agent/cache
890M /opt/agent/outputs
Agent outputs are consuming 890 MB. Clean old outputs or configure automatic rotation.
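One approach is a scheduled find over the outputs directory. The 7-day retention below is an example, not a recommendation -- always preview what matches before letting anything delete:
# Preview: list output files older than 7 days
find /opt/agent/outputs -type f -mtime +7 -print | head
# Remove them once the preview looks right
find /opt/agent/outputs -type f -mtime +7 -delete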
Log Rotation with logrotate
Prevent logs from growing indefinitely. Create a logrotate configuration:
sudo nano /etc/logrotate.d/agent-logs
Add this content:
/opt/agent/logs/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
create 0640 nobody nobody
}
This rotates logs daily, keeps 7 days of history, and compresses old files. Test without applying:
sudo logrotate -d /etc/logrotate.d/agent-logs
Output:
reading config file /etc/logrotate.d/agent-logs
considering log /opt/agent/logs/*.log
log does not need rotating (log is empty)
Force an immediate rotation:
sudo logrotate -f /etc/logrotate.d/agent-logs
Vacuum the System Journal
If /var/log/journal is consuming excessive space, limit it:
sudo journalctl --vacuum-size=500M
Output:
Vacuuming done, freed 724.0M of archived journals from /var/log/journal.
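Vacuuming is a one-time fix. To keep the journal bounded permanently, cap it in journald's own configuration (the 500M value mirrors the vacuum target above):
sudo nano /etc/systemd/journald.conf
Add or uncomment under the [Journal] section:
SystemMaxUse=500M
Then restart journald to apply the cap:
sudo systemctl restart systemd-journald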
Phase 4: Process Debugging
When logs are clean, the network is fine, and disk has space, the problem is inside the process itself. These tools let you inspect what a running agent is actually doing.
Find the Agent Process
ps aux | grep agent
Output:
nobody 12400 2.3 5.1 234567 52340 ? Ss 14:05 0:15 /usr/local/bin/uvicorn agent_main:app --host 0.0.0.0 --port 8000
root 12890 0.0 0.0 12345 672 pts/0 S+ 14:45 0:00 grep --color=auto agent
The columns that matter: PID (12400), CPU% (2.3), MEM% (5.1), and the command.
Monitor Resource Consumption
Watch a specific process in real time:
top -p 12400
Output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12400 nobody 20 0 234567 52340 12456 S 2.3 5.1 0:15.42 uvicorn
Press q to exit. If %MEM keeps climbing over time, you have a memory leak.
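To confirm a leak rather than guess from a single top snapshot, sample resident memory on an interval and look for a steady climb. A minimal sketch using the PID 12400 from above (ps reports RSS in kilobytes):
# Record RSS once a minute; stop with Ctrl+C and inspect the trend
while true; do
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$(ps -o rss= -p 12400)"
    sleep 60
done | tee -a /tmp/agent-rss.log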
Trace System Calls
strace shows every system call a process makes. Attach to a running agent:
sudo strace -p 12400 -c
Let it run for 10-15 seconds, then press Ctrl+C:
Output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
45.23 0.004523 15 301 read
30.11 0.003011 12 251 write
12.45 0.001245 8 156 recvfrom
8.33 0.000833 6 139 sendto
3.88 0.000388 4 97 epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00 0.010000 944 total
This summary shows where the process spends its time. If read or write dominates with high error counts, the process may be struggling with file I/O or network connections.
For detailed output (every individual call), use:
sudo strace -p 12400 -e trace=network 2>&1 | head -20
This filters to only network-related system calls -- useful when debugging connection issues.
List Open Files
lsof shows every file and network connection a process has open:
sudo lsof -p 12400
Output (abbreviated):
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
uvicorn 12400 nobody cwd DIR 8,1 4096 1234 /opt/agent
uvicorn 12400 nobody 3u IPv4 56789 0t0 TCP *:8000 (LISTEN)
uvicorn 12400 nobody 7u IPv4 56790 0t0 TCP 192.168.1.10:8000->192.168.1.5:42386 (ESTABLISHED)
uvicorn 12400 nobody 9r REG 8,1 102400 5678 /opt/agent/cache/model.bin
This reveals: the process is listening on port 8000, has one active connection, and has a cache file open. If you see hundreds of ESTABLISHED connections or too many open files, that indicates a leak.
Count open files for the process:
sudo lsof -p 12400 | wc -l
Output:
42
If this number grows steadily over time, the agent is leaking file descriptors.
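The same count is available straight from /proc, alongside the limit the process is counting toward -- useful for judging how close it is to a "Too many open files" error:
# One entry per open descriptor
sudo ls /proc/12400/fd | wc -l
# The per-process limit on open files
sudo grep 'open files' /proc/12400/limits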
Diagnostic Tool Comparison
| Tool | What It Checks | When to Use | Example Output |
|---|---|---|---|
| journalctl -u | Systemd service logs | Service crashes or fails to start | MemoryError: Unable to allocate |
| journalctl -f | Live log stream | Watching agent behavior in real time | Log lines appearing as events occur |
| curl localhost | Local HTTP connectivity | Agent not responding to requests | HTTP/1.1 200 OK or Connection refused |
| ss -tlnp | Port binding status | Checking if service is listening | 0.0.0.0:8000 LISTEN |
| ping | DNS and ICMP connectivity | Testing if remote hosts are reachable | 64 bytes from... time=11.4 ms |
| df -h | Disk space usage | Services crashing with write errors | /dev/sda1 94% / |
| du -sh | Directory sizes | Finding what consumes disk space | 890M /opt/agent/outputs |
| ps aux | Process list and resource usage | Finding process PID and CPU/MEM | nobody 12400 2.3 5.1 ... uvicorn |
| strace -p | System calls of a running process | Process is stuck or behaving strangely | 45% read, 30% write |
| lsof -p | Open files and connections | Suspected file descriptor leak | 42 open files |
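The four phases chain naturally into a first-pass evidence-gathering script. This is a sketch, not a replacement for the Lesson 10 health check script: the service name, port, and API host are assumptions to adjust, and it should run with sudo so journalctl and ss can see everything:
#!/usr/bin/env bash
# triage.sh -- first-pass evidence gathering when the agent misbehaves
SERVICE=my-agent
PORT=8000
API=https://api.example.com

echo "== Phase 1: logs (recent errors) =="
journalctl -u "$SERVICE" -p err -n 20 --no-pager

echo "== Phase 2: network =="
curl -s -o /dev/null -w "local health: %{http_code}\n" "localhost:${PORT}/health"
ss -tlnp | grep ":${PORT} " || echo "nothing listening on port ${PORT}"
curl -s -o /dev/null --max-time 10 -w "remote API: %{http_code}\n" "$API"

echo "== Phase 3: disk =="
df -h /

echo "== Phase 4: process =="
PID=$(systemctl show "$SERVICE" --property=MainPID --value)
if [ "${PID:-0}" -gt 0 ] 2>/dev/null; then
    ps -o pid,pcpu,pmem,rss,etime,cmd -p "$PID"
else
    echo "service has no running main process"
fi
Read the output top to bottom; the first phase that looks abnormal is where you dig in with the detailed tools from this lesson.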
Exercises
Exercise 1: Find Error-Level Log Messages
Find all ERROR-level messages from the last hour across all services:
journalctl -p err --since "1 hour ago" | head -5
Expected output (if errors exist):
Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'
Expected output (if no errors):
-- No entries --
If you see -- No entries --, that means no errors occurred in the last hour. Try extending the range: --since "24 hours ago".
Exercise 2: Find the Largest Log Files
Check disk usage in /var/log and identify the largest files:
du -sh /var/log/* | sort -rh | head -3
Expected output:
1.2G /var/log/journal
340M /var/log/syslog
128M /var/log/auth.log
Your numbers will differ, but the format shows the largest consumers first. If /var/log/journal dominates, consider running sudo journalctl --vacuum-size=500M to reclaim space.
Exercise 3: Find All Listening Processes
List every process that is listening for network connections:
sudo ss -tlnp | grep LISTEN
Expected output:
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1234,fd=3))
LISTEN 0 128 0.0.0.0:8000 0.0.0.0:* users:(("uvicorn",pid=12400,fd=7))
Verify that your agent (uvicorn on port 8000) appears in the list. If it does not, the service is not running -- check systemctl status my-agent and review the logs.
For a comprehensive health check that combines service status, health endpoint, and resource usage, see Health Check Script.
Try With AI
Ask Claude: "My agent service shows 'failed' in systemctl status. Here is the output: [paste your actual systemctl status output]. What's wrong and how do I fix it?"
What you're learning: AI can parse complex error output and suggest targeted fixes faster than reading man pages. Pay attention to how it identifies the specific failure reason from the status output and maps it to a concrete fix.
Tell Claude: "The agent was working yesterday but fails today. Nothing changed in the code. What environmental factors should I check?" Then systematically verify each suggestion.
What you're learning: Debugging requires considering the full environment -- disk space, available memory, network changes, expired certificates, updated dependencies -- not just code. AI generates a checklist of non-obvious environmental factors that experienced sysadmins learn over years.
Reproduce a common failure (kill a dependency process or fill up /tmp), then work with Claude to diagnose it WITHOUT telling Claude what you did. See if AI's diagnostic process finds the real cause.
What you're learning: Testing diagnostic methodology by creating known failures and validating the troubleshooting process. This exercise builds diagnostic confidence -- you know the answer, so you can evaluate whether the triage approach actually works.
When debugging on production servers, prefer read-only diagnostic commands (journalctl, df, ps, ss) before running anything that modifies state. Never run strace on a production process during peak traffic -- it adds overhead to every system call. Use strace -c for a summary instead of full tracing. Always check systemctl status and logs before restarting a service, so you capture the evidence before it is lost.