Picking between REST, gRPC, tRPC, and queues is the easy part. Making any of them survive crashes, retries, and partial failure is where most production systems live.
In fintech and other regulated spaces, “the call failed” is not an acceptable outcome. A payment must be applied exactly once even if the API server, the queue, or the consumer restarts mid-transaction. The patterns below come up over and over in companies like Stripe, Adyen, and Block, and apply equally to startups handling money or critical state.
Bulletproof writes: idempotency keys
For APIs, the centrepiece is the idempotency key. The client generates a unique ID per operation and sends it with the request; the server stores the result against that key, and if the client retries (network blip, client crash, ambiguous timeout) the server returns the original response instead of running the operation again. Stripe persists keys for 24 hours and caches failures as well as successes - a retry of a 500 returns the same 500, so client and server can’t disagree about what happened.
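A minimal sketch of the server side, using an in-memory SQLite table to stand in for the real store; the idempotency_keys schema and the apply_payment() stub are illustrative, not Stripe's actual implementation:

```python
import json
import sqlite3

# In-memory SQLite stands in for the real datastore.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE idempotency_keys (
    key           TEXT PRIMARY KEY,  -- client-generated, one per logical operation
    status_code   INTEGER,           -- failures are cached too, not just successes
    response_body TEXT,
    created_at    TEXT DEFAULT CURRENT_TIMESTAMP  -- purge rows older than the dedup window
)""")

def apply_payment(request):
    # Placeholder for the real side effect (charge the card, write ledger rows, ...).
    return 200, {"charged": request["amount"]}

def handle_request(idempotency_key, request):
    # Replay path: if we've seen this key, return the stored outcome - success
    # or failure - instead of running the operation again.
    row = db.execute(
        "SELECT status_code, response_body FROM idempotency_keys WHERE key = ?",
        (idempotency_key,),
    ).fetchone()
    if row:
        return row[0], json.loads(row[1])

    # First time: run the operation, then record the outcome against the key so
    # any retry - even a retry of a failure - gets the same answer.
    status, body = apply_payment(request)
    with db:
        db.execute(
            "INSERT INTO idempotency_keys (key, status_code, response_body) VALUES (?, ?, ?)",
            (idempotency_key, status, json.dumps(body)),
        )
    return status, body
```

A real implementation also has to handle two concurrent requests racing on the same fresh key - typically by inserting the key row first and letting the unique constraint serialise them.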
The dedup window has to be at least as long as the maximum retry window of every client that calls you. Pick 24 hours and you’re fine for most consumer SDKs; pick 5 minutes and a client with aggressive retries on a long network partition will produce duplicates that look like the idempotency key didn’t work.
Retries from the client side need exponential backoff with jitter. Without backoff, every client retries immediately when the server recovers and the resulting load can knock it back over; without jitter, retries from many clients line up on the same tick. Both Stripe and AWS document this in their SDKs and it’s worth copying.
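A sketch of the client side in that spirit (capped exponential delays, full jitter); the numbers and the TransientError class are placeholders, not any SDK's actual code:

```python
import random
import time

class TransientError(Exception):
    """Timeouts, 429s, 5xx responses - anything worth retrying."""

def call_with_retries(make_request, max_attempts=5, base_delay=0.5, max_delay=30.0):
    for attempt in range(max_attempts):
        try:
            return make_request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff: 0.5s, 1s, 2s, 4s, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount up to the ceiling so many
            # clients recovering at once don't all retry on the same tick.
            time.sleep(random.uniform(0, ceiling))
```

The retried request must carry the same idempotency key as the original, otherwise the server has no way to recognise it as a retry.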
Layered server-side defences
A bulletproof API also uses layered defences. API-level idempotency stops duplicate requests, database-level optimistic locking or row-level constraints stop concurrent writers from corrupting state, and immutable double-entry ledgers (rather than mutable balance fields) make money flow auditable. None of these alone is enough; together they get you to the reliability fintech needs. There’s a good write-up of this layering for a deeper read.
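The database layer is worth a concrete look. A sketch of optimistic locking, with illustrative table and column names - the update only succeeds if nobody else has bumped the version since we read the row:

```python
class ConcurrentUpdateError(Exception):
    pass

def update_account_status(db, account_id, new_status, expected_version):
    with db:  # one transaction around the conditional update
        cur = db.execute(
            """UPDATE accounts
                  SET status = ?, version = version + 1
                WHERE id = ? AND version = ?""",
            (new_status, account_id, expected_version),
        )
        if cur.rowcount == 0:
            # Another writer changed the row since we read it. Re-read and
            # retry rather than silently overwriting their update.
            raise ConcurrentUpdateError(account_id)
```

The ledger layer is the same idea taken further: append-only debit and credit rows instead of an UPDATE on a balance column, so concurrent writers can only ever add history, never overwrite it.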
Bulletproof reads: layered caching
The patterns above are mostly about writes - making sure a write that needs to land, lands exactly once. For reads, the bulletproof equivalent is a layered cache. A common setup is two layers: an in-process cache (LRU in the application’s own memory) for sub-millisecond hits on hot keys, backed by a shared cache like Redis or Memcached for cross-instance hits, with the database as the source of truth behind both. I once visited a company that ran almost entirely on this two-layer strategy: their database saw a small fraction of the read traffic because the in-process layer absorbed the hottest keys and Redis handled the rest.

The trade-off is staleness: every cache layer is a window where the data can be out of date. You design TTLs, invalidation on writes, or both, depending on how fresh the data has to be. Done well, the database survives traffic that would otherwise overload it.
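A sketch of that read path, assuming the redis-py client; the cache size, TTL, and load_from_db() stub are illustrative:

```python
import json
from collections import OrderedDict

import redis  # assumed shared cache; pip install redis

r = redis.Redis()

class LocalLRU:
    """Tiny in-process LRU; real deployments often reach for cachetools.TTLCache."""
    def __init__(self, max_items=10_000):
        self.max_items = max_items
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)  # evict the least recently used key

local = LocalLRU()

def load_from_db(key):
    # Placeholder for the real database read (the source of truth).
    return {"key": key}

def get_cached(key, redis_ttl_seconds=60):
    # 1. In-process layer: sub-millisecond, but per-instance and stale until evicted.
    value = local.get(key)
    if value is not None:
        return value
    # 2. Shared layer: one network hop, shared across all instances.
    raw = r.get(key)
    if raw is not None:
        value = json.loads(raw)
    else:
        # 3. Source of truth: populate both layers on the way back up.
        value = load_from_db(key)
        r.set(key, json.dumps(value), ex=redis_ttl_seconds)
    local.put(key, value)
    return value
```

Invalidation on writes is straightforward for the Redis layer (delete the key when you write), but the in-process layer on other instances can't see that delete, which is why it usually gets a short TTL or a pub/sub invalidation channel.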
Bulletproof queues: outbox, inbox, DLQ
For queues, the equivalent is the transactional outbox. The naive pattern (write to the database, then publish to the queue) has a window where the database commits and the publish fails. The transactional outbox closes that window: the service writes the domain change and the outbound message into the same database transaction (into an “outbox” table), and a separate process polls the outbox and publishes the messages. If publishing fails, the message stays in the outbox and gets retried. The database transaction is the only atomic step; the queue publish becomes eventually consistent.
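A sketch of both halves, with illustrative table names and a publish() function standing in for your broker client; the outbox table is assumed to have created_at and published_at columns:

```python
import json
import uuid

# Assumed schema: outbox(event_id PRIMARY KEY, topic, payload,
#                        created_at DEFAULT now, published_at NULL until sent).

def record_payment(db, account_id, amount_cents):
    # Producer side: the domain row and the outbound message commit together.
    with db:
        db.execute(
            "INSERT INTO payments (id, account_id, amount_cents) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), account_id, amount_cents),
        )
        db.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "payment.recorded",
             json.dumps({"account_id": account_id, "amount_cents": amount_cents})),
        )

def drain_outbox(db, publish, batch_size=100):
    # Runs as a separate process or loop. If publish() raises, published_at
    # stays NULL and the row is picked up again on the next pass.
    rows = db.execute(
        "SELECT event_id, topic, payload FROM outbox"
        " WHERE published_at IS NULL ORDER BY created_at LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, payload)
        with db:
            db.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE event_id = ?",
                (event_id,),
            )
```

Note the poller can still publish a message twice (publish succeeds, the process dies before marking the row), which is exactly why the consumer needs the inbox below: the outbox gives you at-least-once, not exactly-once.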
On the consumer side, the matching pattern is the inbox: when a message arrives, write its event ID into an “inbox” table inside the same transaction as the state change it causes. If the same message is delivered twice (which it will be), the second write fails on the unique event ID and the consumer skips it. That’s how you turn at-least-once delivery into effectively-once processing without lying about exactly-once semantics.
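A sketch of the consumer side, assuming an inbox table with a unique event_id and using sqlite3's IntegrityError to stand in for whatever your driver raises on a constraint violation:

```python
import sqlite3

def apply_state_change(db, payload):
    # Placeholder for whatever the message causes (ledger rows, status flips, ...).
    ...

def handle_message(db, event_id, payload):
    try:
        with db:  # one transaction: inbox row + the state change it causes
            db.execute("INSERT INTO inbox (event_id) VALUES (?)", (event_id,))
            apply_state_change(db, payload)
    except sqlite3.IntegrityError:
        # Redelivery: the unique event_id already exists, so the whole
        # transaction rolls back and the change isn't applied a second time.
        return "duplicate"
    return "processed"
```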
The third leg is a dead letter queue with monitoring. Outbox handles the producer side, inbox handles duplicates on the consumer side, but neither helps when a message fails repeatedly because it’s malformed or hits a bug. Without a DLQ the broker either keeps redelivering forever (blocking the queue head) or silently drops the message after max-receives. With a DLQ wired up and an alarm on its depth, failures become visible immediately and you can fix the underlying issue, then redrive once the cause is resolved.
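What that wiring looks like depends on the broker; as one example, on SQS with boto3 it is a redrive policy plus a CloudWatch alarm on the DLQ's depth. Queue names and the SNS topic ARN below are hypothetical:

```python
import json

import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

# Create the DLQ and look up its ARN.
dlq_url = sqs.create_queue(QueueName="payments-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main queue: after 5 failed receives, SQS moves the message to the DLQ
# instead of redelivering it forever or dropping it.
main_url = sqs.create_queue(QueueName="payments")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=main_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)

# Alert as soon as anything lands in the DLQ.
cloudwatch.put_metric_alarm(
    AlarmName="payments-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "payments-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # hypothetical topic
)
```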
Redrive caveats apply (the first post covers them in the queue section): if the consumer transformed the message before failing, the DLQ holds the transformed version, and reconciliation against current state matters because by the time you redrive, the team may have already worked around the failure manually.
A few operational notes. Outbox tables grow without limit if you don’t archive processed rows, and high-throughput systems can rack up hundreds of millions of rows quickly enough to start hurting query performance - so plan for archival or partitioning from day one. The outbox poller needs its own retry policy with exponential backoff and a circuit breaker, otherwise an extended broker outage will hammer it. Poller lag under load is the most common production bug here: domain writes commit fast, the queue lags by minutes, downstream services drift. CDC-from-WAL (Debezium and friends) is a meaningful alternative to polling once volume justifies it. And when you redrive from a DLQ, the inbox pattern is what saves you from duplicates - but only if you’ve been consistent about using event IDs from the start.
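The archival point is the easiest of these to sketch: a batched delete of published rows older than a retention window, assuming the outbox table from earlier and that your audit requirements allow dropping them (at high volume, partitioning the table by time and dropping old partitions is the cheaper version of the same idea):

```python
from datetime import datetime, timedelta, timezone

def archive_published_outbox_rows(db, retention_days=30, batch_size=10_000):
    # Batched delete so a huge backlog doesn't turn into one giant transaction.
    cutoff = (datetime.now(timezone.utc) - timedelta(days=retention_days)).strftime(
        "%Y-%m-%d %H:%M:%S"
    )
    with db:
        db.execute(
            """DELETE FROM outbox
                WHERE event_id IN (
                    SELECT event_id FROM outbox
                     WHERE published_at IS NOT NULL AND published_at < ?
                     LIMIT ?)""",
            (cutoff, batch_size),
        )
```

Run it on a schedule and keep looping while rows are still being deleted; copy rows into an archive table first if you need them for audit.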
When to use these patterns
These patterns aren’t free: extra tables, polling processes, and reasoning cost. For internal tooling they’re often overkill. For anything where a duplicate or a lost message has real-world consequences (money moved, dose administered, contract signed), they’re the baseline.