Yan Cui’s post on the Outbox and Inbox patterns explains how to make event-driven systems reliable: an outbox to make sure state changes produce events, an inbox to make sure incoming events are processed at most once.
I wanted to build a sandboxed exemplar of these patterns while also addressing issues with queues and redrives I’ve experienced in production systems. The rest of this post is the implementation that worked, then a list of queue smells based on my experience of production systems.
Code at danieljohnmorris/inbox-outbox-demo. It runs on DynamoDB Local and LocalStack Community in two docker compose containers, with no AWS account and no paid tier. EventBridge Pipes is the one piece that’s only available on LocalStack Pro, so I replaced it with a TypeScript worker that polls the DynamoDB stream and publishes to EventBridge directly.
The outbox write
The outbox commits the database write and the “I want to publish this” record together. In DynamoDB that’s TransactWriteItems:
await ddb.send(
new TransactWriteItemsCommand({
TransactItems: [
{ Put: { TableName: "Orders", Item: marshall(order) } },
{ Put: { TableName: "Outbox", Item: marshall(outboxItem) } },
],
}),
);
Either both rows land, or neither does. The outbox row sits in PENDING until a separate worker publishes it.
Publishing the outbox
The publisher has two paths into the same set of pending rows. The DynamoDB stream is the live path; a five-second scan over status = PENDING is the backstop. The stream catches the live flow, the scan catches anything the stream missed during a transient failure or an iterator expiry. A conditional update on publishedAt keeps either path from publishing the same row twice:
await ddb.send(
new UpdateItemCommand({
TableName: "Outbox",
Key: { outboxId: { S: id } },
UpdateExpression: "SET #s = :pub, publishedAt = :p",
ConditionExpression: "attribute_not_exists(publishedAt)",
// ...
}),
);
If both the stream record and the scan see the same row, only one UpdateItem succeeds and the other returns ConditionalCheckFailedException.
The order of operations inside publish matters. Mark the row PUBLISHED with the conditional, call PutEvents, roll back to PENDING if the call fails. The other order (publish then mark) drops events whenever PutEvents errors after the mark, because the next scan filters on PENDING and never retries:
try {
await markPublished(envelope.id);
} catch (err) {
if (err instanceof ConditionalCheckFailedException) return "skipped";
throw err;
}
try {
await putEvents(envelope);
} catch (err) {
await rollbackToPending(envelope.id);
throw err;
}
If the rollback itself fails, the row is left as PUBLISHED with no event sent. That’s a degenerate case worth alerting on; the recovery is one operator-driven UpdateItem to flip the row back, after which the next scan picks it up.
Inbox with a lease
The inbox is an idempotency-key implementation. An idempotency key is a unique ID per logical operation: the server records what it did the first time, so a retry or a DLQ redrive returns the stored result instead of running the work again. Two ways the key gets to the server:
- Stripe-style: the client puts a UUID in an HTTP header on every request.
- Inbox: the publisher puts an
eventIdin the message envelope.
eventId is per-message, not per-entity. The publisher mints it once at outbox-write time; the same ID travels with every redelivery and DLQ redrive of that message, while a distinct event gets a fresh ID.
The lease is what makes the inbox survive worker crashes, by adding a TTL on the in-progress state.
Each eventId gets one row in the inbox table, which moves through three states:
- No row yet - the message has never been seen.
PROCESSING- a worker is running the handler.PROCESSED- the handler finished; further claims for thiseventIdalways reject.
A row in flight, just after claim succeeds:
{
"eventId": "evt_01HXYZ",
"type": "OrderCreated",
"status": "PROCESSING",
"claimedAt": "2026-05-08T13:00:00.000Z",
"expiresAt": "2026-05-08T13:00:30.000Z"
}
The full claim function:
type ClaimResult =
| { kind: "claimed"; expiresAt: string } // lease is yours; run the handler
| { kind: "leased"; expiresAt: string } // another worker holds it; leave on queue
| { kind: "processed" }; // already done; skip
async function claim(
eventId: string,
type: string,
leaseSeconds = 30,
): Promise<ClaimResult> {
const expiresAt = new Date(Date.now() + leaseSeconds * 1000).toISOString();
try {
await ddb.send(new PutItemCommand({
TableName: NAMES.inboxTable,
Item: marshall({ eventId, type, status: "PROCESSING", expiresAt }),
ConditionExpression:
"attribute_not_exists(eventId) OR (#s = :processing AND expiresAt < :nowIso)",
// attribute names/values omitted for brevity
}));
return { kind: "claimed", expiresAt };
} catch (err) {
if (!(err instanceof ConditionalCheckFailedException)) throw err;
}
// Put rejected → row exists. Fetch to learn its current state.
const existing = await getInboxRow(eventId);
if (existing.status === "PROCESSED") return { kind: "processed" };
return { kind: "leased", expiresAt: existing.expiresAt };
}
The condition expression on the put is what produces the three outcomes. It accepts when there’s no row (first claim), or when the existing row is in PROCESSING past its expiresAt (recovery from a crashed worker). A naive condition - “succeed only if no row exists for this eventId” - would handle duplicates but leave a row stuck in PROCESSING forever after a crash. When the put rejects, the function fetches the existing row to distinguish processed from leased.
The four cases:
Row state at claim time | claim result | What happens next |
|---|---|---|
| No row | succeeds | Worker runs the handler |
PROCESSING, lease still valid | rejects | Redelivered SQS message goes back on the queue |
PROCESSING, lease expired | succeeds | A new worker takes over |
PROCESSED | rejects | Permanent dedup |
The repo’s chaos:stuck-claim script tests the recovery path: pre-claim an event with a 10-second lease, exit without calling markProcessed. With SQS visibility set to 5 seconds, the consumer logs leased by another worker, leaving on queue twice, then once the lease expires the next redelivery succeeds and the order charges exactly once.
Two timers, one lease
The lease and the SQS visibility timeout are independent timers. Two rules:
- Lease above the slowest handler. If the handler can run longer than its lease, another worker reclaims while the first is still working and both run the side effect. Either size the lease comfortably above the longest expected runtime, or extend it with a heartbeat.
- Visibility timeout against the lease, in either direction. Shorter than the lease: redeliveries arrive while the lease still holds and
claimreturnsleased, so the message goes back on the queue. Longer than the lease: the lease expires first and the next redelivery succeeds with a new worker.
The lease half of rule 1 is asserted directly in tests/inbox.test.ts - second claim during an active lease returns leased, and a fresh claim succeeds once the lease has expired. The SQS half is exercised live by the chaos:stuck-claim run linked above.
Queue smells
Failure modes I tripped over building this, or have watched come up in production systems. The implementation above handles the first two; the rest are constraints on how it gets used.
| Smell | Mitigation / defensive rules |
|---|---|
DynamoDB Local trims its stream too aggressively. TRIM_HORIZON iterators throw TrimmedDataAccessException on the first poll; LATEST silently drops any insert that lands during a recovery window. | Pair the stream with a 5s PENDING scan backstop and a conditional update on publishedAt. Worth keeping in production too, where the same gap can open during stream outages. |
Publish-then-mark ordering. Marking the row PUBLISHED before PutEvents drops events whenever the publish errors after the mark. | Mark, publish, then roll back to PENDING on failure. |
DLQ redrive after a manual fix. On-call deletes the inbox row to “fix” a stuck order and redrives the DLQ; claim succeeds and the side effect runs a second time. Reproduced live by chaos:manual-fix. | Any manual remediation touching the same business entity must also write the inbox row, or go through the same handler the consumer uses. |
Idempotency keyed on a regenerated value. Dedup on Date.now(), a fresh UUID inside the handler, or a timestamp-filename: each call has a new key, so retries insert new rows and rerun the side effect. | The key has to come from the inbound message envelope. |
| Side effects before the inbox write. A handler that calls a third-party charge API, sends an email, or triggers a physical action before writing the inbox row is at-least-once even with the inbox in place. | Order the handler so the inbox row is written first, then give each side effect its own idempotency key (Yan’s layered approach). |
What I’d take into a real system
The patterns work once these pieces are right:
- Outbox row and domain write commit in one transaction. If they commit separately, there’s a window where one lands without the other - you can lose events, or send events for changes that rolled back.
- The publisher has to be idempotent. The same row can be picked up twice (the stream and the scan, or after a retry). A conditional update on
attribute_not_exists(publishedAt)makes only one attempt mark the row as published; a rollback path handles the case where the mark succeeds butPutEventsfails, returning the row toPENDINGso the next pass retries. - The inbox needs a lease. Without one, a worker that crashes mid-handler leaves the row stuck in
PROCESSINGforever, and every redelivery skips it - the side effect never runs. A TTL onPROCESSINGlets another worker reclaim the row once the lease expires. - The inbox only protects the consumer’s normal path. Operators editing the database to fix stuck rows, side-effecting calls before the inbox write, and dedup keys regenerated per attempt all sit outside the guarantee. Each one needs its own discipline: a runbook for operators, a handler ordering rule, and a code-review check on dedup-key sources.
The full demo, including the chaos scripts and the vitest suite, is at danieljohnmorris/inbox-outbox-demo. Every chaos scenario reproduces on a laptop in under thirty seconds.