In the previous post I covered git worktrees and automated PR review. This one’s about custom commands I built to debug production systems. They’ve become the fastest way I know to find bugs in distributed systems.
The core problem: when something breaks across multiple services - an order fails to process, data doesn’t sync, a queue backs up - you need to trace it through CloudWatch logs from Lambda, API Gateway, SQS, RDS, and compute instances. Manual searching is slow: narrowing down the time window, finding the relevant logs, connecting them into a causal chain, and identifying the root cause can take hours.
I built /check-services and /check-orders (and similar) to make this instant.
How it works
Both commands are just shell scripts that live in my project, invoked through Claude Code’s custom command system. They follow the same pattern:
- Fetch logs from multiple AWS services for the past N minutes
- Feed them to Claude with context about the system architecture
- Claude interprets the data and reports which services are healthy, which have errors, which might be escalating
- For specific issues, trace them through every stage they touch
Here’s the flow for /check-services:
# Fetch Lambda logs from all functions
aws logs filter-log-events --log-group-name /aws/lambda/...
# Get API Gateway execution logs
aws logs filter-log-events --log-group-name "API-Gateway-Execution-Logs_..."
# Get SQS queue depths and metrics
aws sqs get-queue-attributes --queue-url ... --attribute-names All
# Get EC2 instance logs (if running custom applications)
aws ssm send-command --document-name "AWS-RunShellScript" --targets ... --parameters ...
# Pipe everything to Claude with system context
Claude gets the raw logs plus a model of your system: which Lambda functions do what, which APIs call which services, which queues feed which consumers, what’s considered normal vs abnormal.
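To make that concrete, here’s a minimal sketch of the fetch half of such a script. The function names, queue URL variable, and default window are placeholders, not the real system:

#!/usr/bin/env bash
# check-services.sh (sketch) - gather recent logs; Claude does the interpreting
set -euo pipefail

MINUTES="${1:-15}"
START=$(( ( $(date +%s) - MINUTES * 60 ) * 1000 ))  # CloudWatch wants epoch millis

for FN in order-processor inventory-sync payment-webhook; do  # placeholder names
  echo "=== /aws/lambda/$FN (last ${MINUTES}m) ==="
  aws logs filter-log-events \
    --log-group-name "/aws/lambda/$FN" \
    --start-time "$START" \
    --query 'events[].message' \
    --output text
done

echo "=== notification-queue ==="
aws sqs get-queue-attributes \
  --queue-url "$NOTIFICATION_QUEUE_URL" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

The script only gathers; the interpretation happens when Claude reads this output in context.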
The output isn’t a wall of logs. It’s structured:
✓ order-processor (green) - processing 50 orders/min, no errors
✓ inventory-sync (green) - last sync 2 minutes ago
⚠ payment-webhook (yellow) - 3 errors in last 10 minutes, all timeout related
✗ notification-queue (red) - backed up with 500+ messages, DLQ shows consumer crashed
Escalation analysis:
- payment-webhook errors are transient (retrying successfully)
- notification-queue is critical - consumer needs restart
This is all automated in a single command. No “let me check CloudWatch”. No switching between different AWS dashboards. Run /check-services, get a full picture.
/check-orders - auditing order workflows
/check-orders audits whether orders are moving through the system correctly. Pass it a number of days to check:
/check-orders 14
This checks orders from the past 14 days, auditing each stage they touch:
- Queries the e-commerce platform for orders with specific tags
- Cross-references with the database to ensure they exist
- For each order, checks if:
  - It was created as a job in the processing service
  - The job was sent to the next service (queued for fulfillment, payment processing, etc.)
  - Each stage consumed and processed the message
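A minimal sketch of one audit pass, assuming a Postgres database reachable via DATABASE_URL and a helper that lists order IDs from the platform API (every table, column, and helper name here is illustrative):

#!/usr/bin/env bash
# check-orders.sh (sketch) - table, column, and helper names are made up
set -euo pipefail

DAYS="${1:-14}"
SINCE=$(date -u -d "-${DAYS} days" +%Y-%m-%dT%H:%M:%SZ)  # GNU date

for ID in $(./fetch-platform-orders.sh --since "$SINCE"); do  # hypothetical helper
  IN_DB=$(psql "$DATABASE_URL" -tAc "SELECT count(*) FROM orders WHERE external_id = '$ID'")
  HAS_JOB=$(psql "$DATABASE_URL" -tAc "SELECT count(*) FROM jobs WHERE order_external_id = '$ID'")
  QUEUED=$(psql "$DATABASE_URL" -tAc "SELECT count(*) FROM jobs WHERE order_external_id = '$ID' AND queued_at IS NOT NULL")
  echo "order=$ID in_db=$IN_DB has_job=$HAS_JOB queued=$QUEUED"
done

Claude reads those per-order flags and turns them into the categorized report shown below.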
Most order systems look like:
- Order placed (API call)
- Saved to database
- Published to processing queue
- Service picks it up and creates a job
- Calls external APIs (payment, fulfillment)
- Publishes to next queue
- Next service processes and marks complete
When orders get stuck, it’s usually one of three failures: the order exists in the platform but not in the database, it exists in the database but never got queued, or it got queued but the worker never picked it up.
/check-orders categorizes exactly what went wrong:
✓ Order #12345 - complete flow (Platform → DB → Queue → Processed)
✓ Order #12346 - complete flow
✗ Order #12347 - in platform, in DB, but job was never created
✗ Order #12348 - in DB with job, but never sent to fulfillment queue
⚠ Order #12349 - job exists but worker hasn't processed yet (might be slow queue)
Claude interprets the patterns: “Most orders are flowing fine, but these 2 got stuck at job creation - probably an error in that service. These 3 never made it to the next queue - likely a bug in the queue publisher.”
Tracing a specific order or event
When you need to drill into a single order or event that’s failing, there’s another command to trace it end-to-end through all services:
/trace-order 9F2D7K3N
This does deeper investigation on a single order:
- Finds all logs across services that mention this order ID
- Reconstructs the exact timeline:
  - When it entered each service
  - The exact API requests and responses
  - Any errors or timeouts
  - When it moved to the next stage
- Identifies where it got stuck and why
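The gathering half is conceptually just a grep across every log group, sorted by timestamp. A rough sketch (scanning all log groups is slow on large accounts, so the real script should whitelist the relevant ones):

#!/usr/bin/env bash
# trace-order.sh (sketch) - find every log line mentioning one order ID
set -euo pipefail

ORDER_ID="$1"
START=$(( ( $(date +%s) - 24 * 3600 ) * 1000 ))  # last 24h, epoch millis

for GROUP in $(aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text); do
  aws logs filter-log-events \
    --log-group-name "$GROUP" \
    --start-time "$START" \
    --filter-pattern "\"$ORDER_ID\"" \
    --query 'events[].[timestamp,message]' \
    --output text
done | sort -n  # interleave all services, oldest first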
Output looks like:
Order 9F2D7K3N timeline:
14:23:01 - Order created
14:23:15 - Saved to database
14:23:22 - Published to order-processor queue
14:23:45 - Processor picked it up
14:24:12 - Called payment API
14:24:14 - Payment API returned: 500 Server Error (timeout)
14:24:45 - Retried payment API
14:24:47 - Payment API succeeded
14:25:01 - Published to fulfillment queue
14:25:02 - Fulfillment service started processing
14:25:03 - Error: Cannot deserialize message JSON
Stack: LineItem.price is undefined
14:25:04 - Message sent to DLQ (dead letter queue)
Root cause: LineItem schema changed but fulfillment service not updated
Problems that would take an hour of jumping between CloudWatch, the database, and third-party dashboards, trying to correlate timestamps across multiple systems, get solved in seconds.
Why this works
Three things make this work:
Logging at every stage. Every code path logs enough context to trace through. Not just errors but also state transitions, external API calls, and queue messages. A function that processes an order doesn’t just log success or failure; it logs: received message ID, parsed fields, called X API, got response status Y, saved to database, published to next queue. Specific enough that Claude can construct a timeline.
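For example, one order passing through a single function might leave a trail like this (illustrative, not real output):

received message id=msg-8841 order=9F2D7K3N
parsed 3 line items
POST payments-api /charge -> 200 (340ms)
saved order 9F2D7K3N status=paid
published to fulfillment-queue message id=msg-8842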
Structured IDs. Order IDs, request IDs, correlation IDs flow through the system. When the payment service logs “failed request 9F2D7K3N-001”, the same ID appears in the payment API logs, the order processor logs, and the database. This lets Claude connect dots across services.
System context. When I set up the command, I gave Claude a model of the system: service names, which APIs they call, expected latencies, which queues feed which services, where to expect certain error patterns. This lets Claude distinguish between “payment timeout, but the payment went through” versus “payment never reached the API”.
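That context can be as simple as a static block of text the script prints ahead of the logs. A sketch, with made-up service names and thresholds:

System model:
- order-processor (Lambda): consumes order-queue, writes orders to RDS, publishes to fulfillment-queue. Normal: under 1s per message.
- payment-webhook (Lambda): invoked by API Gateway, calls the payment provider. Occasional timeouts are normal if retries succeed.
- notification-queue (SQS): consumed by the notifier service on EC2. Depth above ~100 messages is abnormal.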
Building your own
These commands are straightforward to build:
- Write a bash script that fetches the relevant logs
- Create a .claude/commands/ directory in your project (if it doesn’t exist)
- Drop the script there with a descriptive name
- Reference it in your CLAUDE.md with a commands: section
Claude Code will make it available as a slash command. When invoked, Claude runs the script and has the output in context for the next message.
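As a sketch, the command definition itself can be little more than a prompt wrapping the script. In current Claude Code versions, commands are markdown files and $ARGUMENTS is replaced with whatever follows the slash command (the file layout and script path here are illustrative):

# .claude/commands/check-orders.md (sketch)
Run scripts/check-orders.sh $ARGUMENTS and audit the output.
For each order, report which stage it reached (platform, DB, job, queue, processed)
and flag anything that stalled, with the most likely cause.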
You can keep the scripts simple - they don’t need to do the interpretation, just the fetching. Claude does the heavy lifting of understanding what the logs mean.
The real work is in the logging layer. If your services don’t log state transitions and external calls, no amount of clever command-building will help. But if you have solid logging, Claude can turn hours of debugging into minutes.
When this saves time
These commands work well for:
- Debugging production issues where the flow crosses multiple services
- Auditing workflows to verify data is being transformed and routed correctly through each stage
- Checking system health - is anything degrading or about to fail?
- Diagnosing race conditions or timing-dependent bugs where you need the exact sequence of events
They’re less useful for:
- Issues that are obvious from error messages in a single service
- Problems that need local debugging (step through code, inspect variables)
- Performance optimization (though the logs can hint at where things are slow)
The advantage compounds as your system gets more complex. A three-service system is manageable to debug manually. A fifteen-service system isn’t. These commands keep the fifteen-service case tractable.