In the previous post I covered git worktrees and automated PR review. This one’s about custom commands I built to debug production systems. They’ve become the fastest way I know to find bugs in distributed systems.
The core problem: when something breaks across multiple services - an order fails to process, data doesn’t sync, a queue backs up - you need to trace it through CloudWatch logs from Lambda, API Gateway, SQS, RDS, and compute instances. Manual searching is slow: narrowing down the time window, finding the relevant logs, connecting them into a causal chain, and identifying the root cause can take hours.
I built /check-services and /check-orders (and similar) to make this instant.
How it works
Both commands are just shell scripts that live in my project, invoked through Claude Code’s custom command system. They follow the same pattern:
- Fetch logs from multiple AWS services for the past N minutes
- Feed them to Claude with context about the system architecture
- Claude interprets the data and reports which services are healthy, which have errors, which might be escalating
- For specific issues, trace them through every stage they touch
Here’s the flow for /check-services:
# Fetch Lambda logs from all functions
aws logs filter-log-events --log-group-name /aws/lambda/...
# Get API Gateway execution logs
aws logs filter-log-events --log-group-name "API-Gateway-Execution-Logs_..."
# Get SQS queue depths and metrics
aws sqs get-queue-attributes --queue-url ... --attribute-names All
# Get EC2 instance logs (if running custom applications)
aws ssm send-command --document-name "AWS-RunShellScript" --targets ... --parameters ...
# Pipe everything to Claude with system context
Claude gets the raw logs plus a model of your system: which Lambda functions do what, which APIs call which services, which queues feed which consumers, what’s considered normal vs abnormal.
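To make that concrete, here’s a minimal sketch of the fetch half of such a script. The function names, queue URL variable, and default window are placeholders, not the real system:

#!/usr/bin/env bash
# check-services.sh (sketch) - gather recent logs; Claude does the interpreting
set -euo pipefail

MINUTES="${1:-15}"
START=$(( ( $(date +%s) - MINUTES * 60 ) * 1000 ))  # CloudWatch wants epoch millis

for FN in order-processor inventory-sync payment-webhook; do  # placeholder names
  echo "=== /aws/lambda/$FN (last ${MINUTES}m) ==="
  aws logs filter-log-events \
    --log-group-name "/aws/lambda/$FN" \
    --start-time "$START" \
    --query 'events[].message' \
    --output text
done

echo "=== notification-queue ==="
aws sqs get-queue-attributes \
  --queue-url "$NOTIFICATION_QUEUE_URL" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

The script only gathers; the interpretation happens when Claude reads this output in context.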
The output isn’t a wall of logs. It’s structured:
✓ order-processor (green) - processing 50 orders/min, no errors
✓ inventory-sync (green) - last sync 2 minutes ago
⚠ payment-webhook (yellow) - 3 errors in last 10 minutes, all timeout related
✗ notification-queue (red) - backed up with 500+ messages, DLQ shows consumer crashed
Escalation analysis:
- payment-webhook errors are transient (retrying successfully)
- notification-queue is critical - consumer needs restart
This is all automated in a single command. No “let me check CloudWatch”. No switching between different AWS dashboards. Run /check-services, get a full picture.
/check-orders - auditing order workflows
/check-orders audits whether orders are moving through the system correctly. Pass it a number of days to check:
/check-orders 14
This checks orders from the past 14 days, auditing each stage they touch:
- Queries the e-commerce platform for orders with specific tags
- Cross-references with the database to ensure they exist
- For each order, checks if:
  - It was created as a job in the processing service
  - The job was sent to the next service (queued for fulfillment, payment processing, etc.)
  - Each stage consumed and processed the message
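A minimal sketch of one audit pass, assuming a Postgres database reachable via DATABASE_URL and a helper that lists order IDs from the platform API (every table, column, and helper name here is illustrative):

#!/usr/bin/env bash
# check-orders.sh (sketch) - table, column, and helper names are made up
set -euo pipefail

DAYS="${1:-14}"
SINCE=$(date -u -d "-${DAYS} days" +%Y-%m-%dT%H:%M:%SZ)  # GNU date

for ID in $(./fetch-platform-orders.sh --since "$SINCE"); do  # hypothetical helper
  IN_DB=$(psql "$DATABASE_URL" -tAc "SELECT count(*) FROM orders WHERE external_id = '$ID'")
  HAS_JOB=$(psql "$DATABASE_URL" -tAc "SELECT count(*) FROM jobs WHERE order_external_id = '$ID'")
  QUEUED=$(psql "$DATABASE_URL" -tAc "SELECT count(*) FROM jobs WHERE order_external_id = '$ID' AND queued_at IS NOT NULL")
  echo "order=$ID in_db=$IN_DB has_job=$HAS_JOB queued=$QUEUED"
done

Claude reads those per-order flags and turns them into the categorized report shown below.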
Most order systems look like:
- Order placed (API call)
- Saved to database
- Published to processing queue
- Service picks it up and creates a job
- Calls external APIs (payment, fulfillment)
- Publishes to next queue
- Next service processes and marks complete
When orders get stuck, it’s usually one of three failures: the order exists in the platform but not in the database, it exists in the database but never got queued, or it got queued but the worker never picked it up.
/check-orders categorizes exactly what went wrong:
✓ Order #12345 - complete flow (Platform → DB → Queue → Processed)
✓ Order #12346 - complete flow
✗ Order #12347 - in platform, in DB, but job was never created
✗ Order #12348 - in DB with job, but never sent to fulfillment queue
⚠ Order #12349 - job exists but worker hasn't processed yet (might be slow queue)
Claude interprets the patterns: “Most orders are flowing fine, but these 2 got stuck at job creation - probably an error in that service. These 3 never made it to the next queue - likely a bug in the queue publisher.”
Tracing a specific order or event
When you need to drill into a single order or event that’s failing, there’s another command to trace it end-to-end through all services:
/trace-order 9F2D7K3N
This does deeper investigation on a single order:
- Finds all logs across services that mention this order ID
- Reconstructs the exact timeline:
  - When it entered each service
  - The exact API requests and responses
  - Any errors or timeouts
  - When it moved to the next stage
- Identifies where it got stuck and why
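The gathering half is conceptually just a grep across every log group, sorted by timestamp. A rough sketch (scanning all log groups is slow on large accounts, so the real script should whitelist the relevant ones):

#!/usr/bin/env bash
# trace-order.sh (sketch) - find every log line mentioning one order ID
set -euo pipefail

ORDER_ID="$1"
START=$(( ( $(date +%s) - 24 * 3600 ) * 1000 ))  # last 24h, epoch millis

for GROUP in $(aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text); do
  aws logs filter-log-events \
    --log-group-name "$GROUP" \
    --start-time "$START" \
    --filter-pattern "\"$ORDER_ID\"" \
    --query 'events[].[timestamp,message]' \
    --output text
done | sort -n  # interleave all services, oldest first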
Output looks like:
Order 9F2D7K3N timeline:
14:23:01 - Order created
14:23:15 - Saved to database
14:23:22 - Published to order-processor queue
14:23:45 - Processor picked it up
14:24:12 - Called payment API
14:24:14 - Payment API returned: 500 Server Error (timeout)
14:24:45 - Retried payment API
14:24:47 - Payment API succeeded
14:25:01 - Published to fulfillment queue
14:25:02 - Fulfillment service started processing
14:25:03 - Error: Cannot deserialize message JSON
Stack: LineItem.price is undefined
14:25:04 - Message sent to DLQ (dead letter queue)
Root cause: LineItem schema changed but fulfillment service not updated
Problems that would take an hour of jumping between CloudWatch, the database, and third-party dashboards, trying to correlate timestamps across multiple systems, get solved in seconds.
Why this works
Three things make this work:
Logging at every stage. Every code path logs enough context to trace through. Not just errors but also state transitions, external API calls, and queue messages. A function that processes an order doesn’t just log success or failure; it logs: received message ID, parsed fields, called X API, got response status Y, saved to database, published to next queue. Specific enough that Claude can construct a timeline.
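For example, one order passing through a single function might leave a trail like this (illustrative, not real output):

received message id=msg-8841 order=9F2D7K3N
parsed 3 line items
POST payments-api /charge -> 200 (340ms)
saved order 9F2D7K3N status=paid
published to fulfillment-queue message id=msg-8842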
Structured IDs. Order IDs, request IDs, correlation IDs flow through the system. When the payment service logs “failed request 9F2D7K3N-001”, the same ID appears in the payment API logs, the order processor logs, and the database. This lets Claude connect dots across services.
System context. When I set up the command, I gave Claude a model of the system: service names, which APIs they call, expected latencies, which queues feed which services, where to expect certain error patterns. This lets Claude distinguish between “payment timeout, but the payment went through” versus “payment never reached the API”.
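That context can be as simple as a static block of text the script prints ahead of the logs. A sketch, with made-up service names and thresholds:

System model:
- order-processor (Lambda): consumes order-queue, writes orders to RDS, publishes to fulfillment-queue. Normal: under 1s per message.
- payment-webhook (Lambda): invoked by API Gateway, calls the payment provider. Occasional timeouts are normal if retries succeed.
- notification-queue (SQS): consumed by the notifier service on EC2. Depth above ~100 messages is abnormal.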
Building your own
These commands are straightforward to build:
- Write a bash script that fetches the relevant logs
- Create a .claude/commands/ directory in your project (if it doesn’t exist)
- Drop the script there with a descriptive name
- Reference it in your CLAUDE.md with a commands: section
Claude Code will make it available as a slash command. When invoked, Claude runs the script and has the output in context for the next message.
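As a sketch, the command definition itself can be little more than a prompt wrapping the script. In current Claude Code versions, commands are markdown files and $ARGUMENTS is replaced with whatever follows the slash command (the file layout and script path here are illustrative):

# .claude/commands/check-orders.md (sketch)
Run scripts/check-orders.sh $ARGUMENTS and audit the output.
For each order, report which stage it reached (platform, DB, job, queue, processed)
and flag anything that stalled, with the most likely cause.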
You can keep the scripts simple - they don’t need to do the interpretation, just the fetching. Claude does the heavy lifting of understanding what the logs mean.
The real work is in the logging layer. If your services don’t log state transitions and external calls, no amount of clever command-building will help. But if you have solid logging, Claude can turn hours of debugging into minutes.
When this saves time
These commands work well for:
- Debugging production issues where the flow crosses multiple services
- Auditing workflows to verify data is being transformed and routed correctly through each stage
- Checking system health - is anything degrading or about to fail?
- Diagnosing race conditions or timing-dependent bugs where you need the exact sequence of events
They’re less useful for:
- Issues that are obvious from error messages in a single service
- Problems that need local debugging (step through code, inspect variables)
- Performance optimization (though the logs can hint at where things are slow)
The advantage compounds as your system gets more complex. A three-service system is manageable to debug manually. A fifteen-service system isn’t. These commands keep the fifteen-service case tractable.