Updated Feb 16, 2026

Debugging & Troubleshooting

Your agent failed. Now what?

In Lesson 10, you deployed agent_main.py as a systemd service with restart policies, resource limits, and a health check script. That infrastructure keeps your agent running through routine crashes. But when something genuinely breaks -- a memory leak, a full disk, a network timeout at 3 AM -- automatic restarts won't help. You need to find the root cause.

Production debugging is not about memorizing commands. It is about systematic diagnosis: gathering evidence, isolating the problem, and fixing the root cause instead of blindly restarting the service.

By the end of this lesson, you will have a four-phase triage methodology and the specific tools to execute each phase. When your Digital FTE fails, you will know exactly where to look.


Structured Triage Methodology

Before diving into individual tools, learn the system. Every production issue falls into one of four categories, and diagnosing them in the right order saves time:

Phase 1: Logs -- What did the agent say before it died?
Phase 2: Network -- Can the agent reach its dependencies?
Phase 3: Disk -- Is there space for the agent to operate?
Phase 4: Processes -- Is the agent consuming resources abnormally?

This order matters. Logs answer "what happened" immediately in 80% of cases. If logs are clean, check network connectivity. If the network is fine, check disk space. If disk is fine, inspect the process itself. Skipping phases or jumping to process debugging first wastes time on symptoms instead of causes.

Agent fails → Check logs (journalctl)
↓ logs clean?
Check network (curl, ss, ping)
↓ network fine?
Check disk (df, du)
↓ disk fine?
Check process (ps, strace, lsof)

The rest of this lesson teaches the tools for each phase, applied to the my-agent service you created in Lesson 10.


Phase 1: Log Analysis with journalctl

When a systemd service fails, the answer is almost always in the logs. journalctl reads the system journal where systemd captures all stdout and stderr from your service.

Read Service Logs

Check the current status first:

sudo systemctl status my-agent

Output (failed service):

● my-agent.service - Sample Digital FTE Agent
     Loaded: loaded (/etc/systemd/system/my-agent.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2026-02-11 14:32:15 UTC; 2min ago
    Process: 12345 ExecStart=/usr/local/bin/uvicorn agent_main:app --host 0.0.0.0 --port 8000 (code=exited, status=1/FAILURE)
   Main PID: 12345 (code=exited, status=1/FAILURE)

Now pull the full logs for this service:

journalctl -u my-agent

This shows every log entry for my-agent since the journal began. For a service that has been running for days, this is too much output. Filter it down.

Filter by Recency

Show only the last 50 lines:

journalctl -u my-agent -n 50

Output:

Feb 11 14:32:10 server uvicorn[12345]: INFO:     Processing request batch 847
Feb 11 14:32:12 server uvicorn[12345]: WARNING: Memory usage at 490MB
Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Main process exited, code=exited, status=1/FAILURE
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'

Follow Logs in Real Time

Watch logs as they appear (like tail -f for the journal):

journalctl -u my-agent -f

Output (live stream):

Feb 11 14:35:00 server uvicorn[12400]: INFO:     192.168.1.5:42386 - "GET /health HTTP/1.1" 200
Feb 11 14:35:05 server uvicorn[12400]: INFO: Processing request batch 1
Feb 11 14:35:10 server uvicorn[12400]: INFO: 192.168.1.5:42388 - "GET /tasks HTTP/1.1" 200

Press Ctrl+C to stop following.
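
If the live stream is noisy, pipe it through grep to keep only the lines you care about -- a small sketch, assuming your agent logs the words ERROR or WARNING:

journalctl -u my-agent -f | grep -iE "error|warning"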

Filter by Priority

Show only errors (skip informational messages):

journalctl -p err --since "1 hour ago"

Output:

Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'

Priority levels from most to least severe:

| Level | Name    | Meaning                    |
|-------|---------|----------------------------|
| 0     | emerg   | System is unusable         |
| 1     | alert   | Immediate action required  |
| 2     | crit    | Critical condition         |
| 3     | err     | Error condition            |
| 4     | warning | Warning condition          |
| 5     | notice  | Normal but significant     |
| 6     | info    | Informational              |
| 7     | debug   | Debug-level detail         |

Using -p err shows levels 0 through 3 (emerg, alert, crit, err). This cuts through noise fast.
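
To also catch warnings -- like the "Memory usage at 490MB" line that preceded the crash above -- lower the threshold to priority 4 (warning):

journalctl -u my-agent -p warning --since "1 hour ago"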

Filter by Time Range

Show logs from a specific window:

journalctl -u my-agent --since "2026-02-11 14:00" --until "2026-02-11 15:00"

Output:

Feb 11 14:05:30 server systemd[1]: Started Sample Digital FTE Agent.
Feb 11 14:05:31 server uvicorn[12345]: INFO: Application startup complete.
Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'

Combine unit, priority, and time filters for precision:

journalctl -u my-agent -p err --since "1 hour ago"

This gives you only errors from your service in the last hour -- the exact information you need for triage.

journalctl Quick Reference

| Command                                                                       | Purpose                  |
|-------------------------------------------------------------------------------|--------------------------|
| journalctl -u my-agent                                                        | All logs for the service |
| journalctl -u my-agent -n 50                                                  | Last 50 lines            |
| journalctl -u my-agent -f                                                     | Follow live output       |
| journalctl -p err --since "1 hour ago"                                        | Errors from last hour    |
| journalctl -u my-agent --since "2026-02-11 14:00" --until "2026-02-11 15:00"  | Specific time window     |
| journalctl -u my-agent -b                                                     | Since last boot          |

Real-World Scenario: MemoryError Crash

Your agent has been running fine for three days. This morning, it shows failed in systemctl status. Here is how you diagnose it:

# Step 1: What happened?
journalctl -u my-agent -p err --since "6 hours ago"

Output:

Feb 11 03:14:22 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 03:14:22 server systemd[1]: my-agent.service: Main process exited, code=exited, status=1/FAILURE

The agent crashed at 3:14 AM with a MemoryError. Now check whether this is a server-wide RAM issue or a process-specific limit:

free -h

Output:

              total        used        free      shared  buff/cache   available
Mem:          1.0Gi       780Mi        50Mi        12Mi       170Mi       120Mi
Swap:            0B          0B          0B

Only 120 MB available system-wide. Check if the service has a memory limit:

systemctl show my-agent --property=MemoryMax

Output:

MemoryMax=536870912

The service is capped at 512 MB by the MemoryMax directive from Lesson 10. The agent tried to allocate beyond that limit. You now have two paths: increase the limit or optimize your agent's memory usage.
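
If you take the first path, a systemd drop-in raises the cap without editing the original unit file. A minimal sketch -- the 768M value is only an example, and on the 1 GB server above there is little headroom, so optimizing the agent may be the better fix:

# Open an override file for the service (creates a drop-in, not a full copy)
sudo systemctl edit my-agent

# In the editor, add:
# [Service]
# MemoryMax=768M

# Apply the change and restart the agent
sudo systemctl daemon-reload
sudo systemctl restart my-agent

# Confirm the new cap
systemctl show my-agent --property=MemoryMax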


Phase 2: Network Diagnosis

If logs show connection errors, timeouts, or "Connection refused" messages, the problem is in the network layer. Diagnose from local to remote -- this builds on the networking foundations from Lesson 9.

Layer 1: Is the Local Service Responding?

Test whether your agent is listening on its port:

curl -v localhost:8000/health

Output (working):

*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /health HTTP/1.1
< HTTP/1.1 200 OK
{"status":"healthy","agent":"running","timestamp":"2026-02-11T14:40:01.234567"}

Output (not working):

*   Trying 127.0.0.1:8000...
* connect to 127.0.0.1 port 8000 failed: Connection refused

If the connection is refused, check whether anything is listening on that port:

ss -tlnp | grep 8000

Output (listening):

LISTEN   0   128   0.0.0.0:8000   0.0.0.0:*   users:(("uvicorn",pid=12400,fd=7))

Output (not listening):

(no output -- nothing is bound to port 8000)

No output means the agent process is not running or is listening on a different port. Go back to Phase 1 and check logs.

Layer 2: Can You Reach External Hosts?

Test DNS resolution and basic connectivity:

ping -c 3 api.example.com

Output (working):

PING api.example.com (93.184.216.34): 56 data bytes
64 bytes from 93.184.216.34: icmp_seq=0 ttl=56 time=11.4 ms
64 bytes from 93.184.216.34: icmp_seq=1 ttl=56 time=11.2 ms
64 bytes from 93.184.216.34: icmp_seq=2 ttl=56 time=11.3 ms

Output (DNS failure):

ping: api.example.com: Name or service not known

If DNS fails, the problem is name resolution, not the remote server. Check /etc/resolv.conf for DNS configuration.
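
A quick way to inspect the resolver from the shell -- a short sketch; resolvectl only exists on hosts running systemd-resolved:

# Which nameservers is this host configured to use?
cat /etc/resolv.conf

# On systemd-resolved systems, show per-interface DNS servers
resolvectl status

# Test name resolution through the system resolver
getent hosts api.example.com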

Layer 3: Can You Reach the Remote Service?

Test HTTP connectivity to an external API your agent depends on:

curl -o /dev/null -s -w "%{http_code}" https://api.example.com

Output:

200

A 200 means the remote service is reachable and responding. Other common codes:

| Code    | Meaning                    | Action                                       |
|---------|----------------------------|----------------------------------------------|
| 000     | Connection failed entirely | Check network, DNS, firewall                 |
| 403     | Forbidden                  | Check API key or IP allowlist                |
| 429     | Rate limited               | Back off, check request volume               |
| 502/503 | Remote server error        | Not your problem -- wait or contact provider |
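
For 403 and 429 responses, it helps to see the response headers and to send the same credentials your agent uses. A hedged sketch -- the API_KEY variable and the Authorization header are placeholders for whatever your provider actually expects:

# -i prints response headers; --max-time avoids hanging on a dead endpoint
curl -i --max-time 10 -H "Authorization: Bearer $API_KEY" https://api.example.com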

Layer 4: Is the Firewall Blocking Traffic?

sudo ufw status

Output:

Status: active

To         Action      From
--         ------      ----
22/tcp     ALLOW       Anywhere
8000/tcp   ALLOW       Anywhere

If port 8000 is not listed and you need external access, add it:

sudo ufw allow 8000/tcp

Network Diagnosis Summary

Follow this sequence every time:

# 1. Is the agent listening locally?
curl -v localhost:8000/health

# 2. Is the port bound?
ss -tlnp | grep 8000

# 3. Can you resolve and reach external hosts?
ping -c 3 api.example.com

# 4. Does HTTP to the remote service work?
curl -o /dev/null -s -w "%{http_code}" https://api.example.com

# 5. Is the firewall blocking?
sudo ufw status

If step 1 fails, the problem is local. If steps 1-2 pass but step 3 fails, the problem is DNS or routing. If steps 1-3 pass but step 4 fails, the problem is the remote service or a firewall.


Phase 3: Disk Monitoring

When disk space runs out, services crash with cryptic errors -- "No space left on device," write failures, or silent hangs. Agents that generate logs, cache responses, or store outputs can fill a disk faster than you expect.

Check Overall Disk Usage

df -h

Output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   18G  1.2G  94% /
tmpfs           512M     0  512M   0% /dev/shm

If Use% is above 90%, you are in the danger zone. Above 95%, services will start failing.
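
"No space left on device" can also mean the filesystem is out of inodes even when df -h shows free bytes -- a common trap for agents that write many small files. Check inode usage as well:

# IUse% near 100% means you are out of inodes, not bytes
df -i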

Find What Is Consuming Space

du -sh /var/log/* | sort -rh | head -5

Output:

1.2G    /var/log/journal
340M    /var/log/syslog
128M    /var/log/auth.log
45M     /var/log/kern.log
12M     /var/log/dpkg.log

The journal is consuming 1.2 GB. For agent-specific directories:

du -sh /opt/agent/*

Output:

4.0K    /opt/agent/agent_main.py
256M    /opt/agent/cache
890M    /opt/agent/outputs

Agent outputs are consuming 890 MB. Clean old outputs or configure automatic rotation.
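
A quick way to reclaim space is to delete outputs older than a week. A minimal sketch, assuming everything under /opt/agent/outputs can be regenerated -- list first, delete only after reviewing:

# List output files older than 7 days
find /opt/agent/outputs -type f -mtime +7 -print

# Once you are sure nothing listed is needed, remove them
sudo find /opt/agent/outputs -type f -mtime +7 -delete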

Log Rotation with logrotate

Prevent logs from growing indefinitely. Create a logrotate configuration:

sudo nano /etc/logrotate.d/agent-logs

Add this content:

/opt/agent/logs/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0640 nobody nobody
}

This rotates logs daily, keeps 7 days of history, and compresses old files. Test without applying:

sudo logrotate -d /etc/logrotate.d/agent-logs

Output:

reading config file /etc/logrotate.d/agent-logs
considering log /opt/agent/logs/*.log
log does not need rotating (log is empty)

Force an immediate rotation:

sudo logrotate -f /etc/logrotate.d/agent-logs

Vacuum the System Journal

If /var/log/journal is consuming excessive space, limit it:

sudo journalctl --vacuum-size=500M

Output:

Vacuuming done, freed 724.0M of archived journals from /var/log/journal.
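
Vacuuming is a one-time cleanup. To keep the journal bounded permanently, set a cap in journald's configuration -- a small sketch, assuming the default /etc/systemd/journald.conf location:

sudo nano /etc/systemd/journald.conf

# Under the [Journal] section, uncomment or add:
# SystemMaxUse=500M

# Apply the new limit
sudo systemctl restart systemd-journald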

Phase 4: Process Debugging

When logs are clean, the network is fine, and disk has space, the problem is inside the process itself. These tools let you inspect what a running agent is actually doing.

Find the Agent Process

ps aux | grep agent

Output:

nobody   12400  2.3  5.1 234567 52340 ?        Ss   14:05   0:15 /usr/local/bin/uvicorn agent_main:app --host 0.0.0.0 --port 8000
root     12890  0.0  0.0  12345   672 pts/0    S+   14:45   0:00 grep --color=auto agent

The columns that matter: PID (12400), %CPU (2.3), %MEM (5.1), and the command.

Monitor Resource Consumption

Watch a specific process in real time:

top -p 12400

Output:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
12400 nobody    20   0  234567  52340  12456 S   2.3   5.1   0:15.42 uvicorn

Press q to exit. If %MEM keeps climbing over time, you have a memory leak.
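
To confirm a leak rather than eyeball it, sample the process's resident memory at intervals and watch the trend. A rough sketch, assuming PID 12400 from above:

# Print the timestamp and RSS (in KB) every 60 seconds; a steady climb suggests a leak
while true; do
    echo "$(date +%T) $(ps -o rss= -p 12400) KB"
    sleep 60
done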

Trace System Calls

strace shows every system call a process makes. Attach to a running agent:

sudo strace -p 12400 -c

Let it run for 10-15 seconds, then press Ctrl+C:

Output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.23    0.004523          15       301           read
 30.11    0.003011          12       251           write
 12.45    0.001245           8       156           recvfrom
  8.33    0.000833           6       139           sendto
  3.88    0.000388           4        97           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00    0.010000                   944           total

This summary shows where the process spends its time. If read or write dominates with high error counts, the process may be struggling with file I/O or network connections.

For detailed output (every individual call), use:

sudo strace -p 12400 -e trace=network 2>&1 | head -20

This filters to only network-related system calls -- useful when debugging connection issues.

List Open Files

lsof shows every file and network connection a process has open:

sudo lsof -p 12400

Output (abbreviated):

COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
uvicorn 12400 nobody  cwd    DIR    8,1     4096 1234 /opt/agent
uvicorn 12400 nobody    3u  IPv4  56789      0t0  TCP *:8000 (LISTEN)
uvicorn 12400 nobody    7u  IPv4  56790      0t0  TCP 192.168.1.10:8000->192.168.1.5:42386 (ESTABLISHED)
uvicorn 12400 nobody    9r   REG    8,1   102400 5678 /opt/agent/cache/model.bin

This reveals: the process is listening on port 8000, has one active connection, and has a cache file open. If you see hundreds of ESTABLISHED connections or too many open files, that indicates a leak.
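
To put a number on "hundreds of ESTABLISHED connections," count them directly. A small sketch using lsof's -a flag to AND the filters together (the TCP state filter works on Linux builds of lsof):

sudo lsof -a -p 12400 -i TCP -s TCP:ESTABLISHED | grep -c ESTABLISHED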

Count open files for the process:

sudo lsof -p 12400 | wc -l

Output:

42

If this number grows steadily over time, the agent is leaking file descriptors.
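
Compare that count against the process's limit, and re-check it periodically to see the trend -- a short sketch using the /proc interface:

# The kernel's per-process cap on open file descriptors
sudo grep "Max open files" /proc/12400/limits

# Re-count open files every 30 seconds; press Ctrl+C to stop
sudo watch -n 30 "lsof -p 12400 | wc -l"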


Diagnostic Tool Comparison

| Tool           | What It Checks                    | When to Use                            | Example Output                        |
|----------------|-----------------------------------|----------------------------------------|---------------------------------------|
| journalctl -u  | Systemd service logs              | Service crashes or fails to start      | MemoryError: Unable to allocate       |
| journalctl -f  | Live log stream                   | Watching agent behavior in real time   | Log lines appearing as events occur   |
| curl localhost | Local HTTP connectivity           | Agent not responding to requests       | HTTP/1.1 200 OK or Connection refused |
| ss -tlnp       | Port binding status               | Checking if service is listening       | 0.0.0.0:8000 LISTEN                   |
| ping           | DNS and ICMP connectivity         | Testing if remote hosts are reachable  | 64 bytes from... time=11.4 ms         |
| df -h          | Disk space usage                  | Services crashing with write errors    | /dev/sda1 94% /                       |
| du -sh         | Directory sizes                   | Finding what consumes disk space       | 890M /opt/agent/outputs               |
| ps aux         | Process list and resource usage   | Finding process PID and CPU/MEM        | nobody 12400 2.3 5.1 ... uvicorn      |
| strace -p      | System calls of a running process | Process is stuck or behaving strangely | 45% read, 30% write                   |
| lsof -p        | Open files and connections        | Suspected file descriptor leak         | 42 open files                         |
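
The four phases can be strung together into a single first-pass script. This is a rough sketch, not a substitute for judgment -- it assumes the my-agent service and port 8000 used throughout this lesson, and the quick-triage.sh name is just a placeholder:

#!/usr/bin/env bash
# quick-triage.sh -- gather first-pass evidence for a failed agent

echo "=== Phase 1: Logs (errors from the last hour) ==="
journalctl -u my-agent -p err --since "1 hour ago" -n 20 --no-pager

echo "=== Phase 2: Network (port binding and health endpoint) ==="
ss -tlnp | grep 8000
curl -s -o /dev/null -w "health endpoint: %{http_code}\n" --max-time 5 localhost:8000/health

echo "=== Phase 3: Disk ==="
df -h /

echo "=== Phase 4: Process ==="
ps aux | grep -v grep | grep agent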

Exercises

Exercise 1: Find Error-Level Log Messages

Find all ERROR-level messages from the last hour across all services:

journalctl -p err --since "1 hour ago" | head -5

Expected output (if errors exist):

Feb 11 14:32:14 server python[12345]: MemoryError: Unable to allocate 128MB array
Feb 11 14:32:15 server systemd[1]: my-agent.service: Failed with result 'exit-code'

Expected output (if no errors):

-- No entries --

If you see -- No entries --, that means no errors occurred in the last hour. Try extending the range: --since "24 hours ago".

Exercise 2: Find the Largest Log Files

Check disk usage in /var/log and identify the largest files:

du -sh /var/log/* | sort -rh | head -3

Expected output:

1.2G    /var/log/journal
340M    /var/log/syslog
128M    /var/log/auth.log

Your numbers will differ, but the format shows the largest consumers first. If /var/log/journal dominates, consider running sudo journalctl --vacuum-size=500M to reclaim space.

Exercise 3: Find All Listening Processes

List every process that is listening for network connections:

ss -tlnp | grep LISTEN

Expected output:

LISTEN   0   128   0.0.0.0:22      0.0.0.0:*   users:(("sshd",pid=1234,fd=3))
LISTEN   0   128   0.0.0.0:8000    0.0.0.0:*   users:(("uvicorn",pid=12400,fd=7))

Verify that your agent (uvicorn on port 8000) appears in the list. If it does not, the service is not running -- check systemctl status my-agent and review the logs.

For a comprehensive health check that combines service status, health endpoint, and resource usage, see Health Check Script.


Try With AI

Ask Claude: "My agent service shows 'failed' in systemctl status. Here is the output: [paste your actual systemctl status output]. What's wrong and how do I fix it?"

What you're learning: AI can parse complex error output and suggest targeted fixes faster than reading man pages. Pay attention to how it identifies the specific failure reason from the status output and maps it to a concrete fix.

Tell Claude: "The agent was working yesterday but fails today. Nothing changed in the code. What environmental factors should I check?" Then systematically verify each suggestion.

What you're learning: Debugging requires considering the full environment -- disk space, available memory, network changes, expired certificates, updated dependencies -- not just code. AI generates a checklist of non-obvious environmental factors that experienced sysadmins learn over years.

Reproduce a common failure (kill a dependency process or fill up /tmp), then work with Claude to diagnose it WITHOUT telling Claude what you did. See if AI's diagnostic process finds the real cause.

What you're learning: Testing diagnostic methodology by creating known failures and validating the troubleshooting process. This exercise builds diagnostic confidence -- you know the answer, so you can evaluate whether the triage approach actually works.

Safety Reminder

When debugging on production servers, prefer read-only diagnostic commands (journalctl, df, ps, ss) before running anything that modifies state. Never run strace on a production process during peak traffic -- it adds overhead to every system call. Use strace -c for a summary instead of full tracing. Always check systemctl status and logs before restarting a service, so you capture the evidence before it is lost.