Most service-to-service communication ends up in one of two shapes: synchronous request/response (REST, gRPC, tRPC), or asynchronous events through a queue. The differences inside the synchronous group matter less than the split between synchronous and asynchronous. Below are the trade-offs I’ve run into when picking between them.
REST
REST is the default for most APIs because every language has a client library, every browser and CLI tool can talk to it, and the transport (HTTP and JSON) is human-readable. You can debug it with curl or Postman, and cache it at a CDN without any extra layer in front. Postman in particular has become the standard way teams share, test, and document REST endpoints, since requests live in a collection that anyone with the workspace can run.
A small Hono server in TypeScript:
import { Hono } from "hono";

// In-memory store so the example runs on its own.
type Item = { id: number; name: string; price: number };
const items = new Map<number, Item>();
let nextId = 1;

const app = new Hono();

app.get("/items/:id", (c) => {
  const id = Number(c.req.param("id"));
  const item = items.get(id);
  if (!item) return c.json({ error: "Not found" }, 404);
  return c.json(item);
});

app.post("/items", async (c) => {
  const body = await c.req.json<Omit<Item, "id">>();
  const item: Item = { id: nextId++, ...body };
  items.set(item.id, item);
  return c.json(item, 201);
});

export default app;
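Exercising it needs nothing beyond an HTTP client - a quick check with fetch, assuming the server is listening on localhost:3000:

const res = await fetch("http://localhost:3000/items/1");
if (res.status === 404) {
  console.log("no such item");
} else {
  console.log(await res.json());
}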
The contract problem
One cost is that there’s no API contract by default - no enforced agreement between caller and callee about what fields exist, what types they are, and what each endpoint returns. With plain REST that contract lives in your head, a README, or server code the client has to mirror by hand.
The usual fix is API docs: Swagger UI on top of an OpenAPI spec is the most common. You describe the endpoints once and Swagger renders an interactive page that doubles as a spec for client generators. Done well, this gets you most of the way to a real contract - generated clients, response validation against the schema, breaking changes visible in a diff. Done badly, the OpenAPI file drifts out of sync with the code, the docs say one thing and the server returns another, and you find out when a client breaks in production. gRPC and tRPC sidestep this by generating both client and server from one schema, so the two cannot drift apart.
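Done well, the client side looks something like this - a sketch using openapi-fetch against types produced by openapi-typescript; the generated `paths` import and the file layout are assumptions, not part of the server above:

import createClient from "openapi-fetch";
// Generated by openapi-typescript from the OpenAPI spec (hypothetical path).
import type { paths } from "./generated/api";

const client = createClient<paths>({ baseUrl: "https://api.example.com" });

// Path, params, and response types all come from the spec; a typo here
// fails to compile instead of failing in production.
const { data, error } = await client.GET("/items/{id}", {
  params: { path: { id: 1 } },
});

The catch is the one described above: this is only as trustworthy as the spec it was generated from.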
Convention, not contract
What REST does have, even without a machine-readable contract, is a commonly agreed shape - resources, HTTP verbs, status codes, predictable URL structure. That’s a convention, not a contract: it helps a human reader navigate the API, but nothing enforces it and nothing fails to compile when it’s wrong. The conventions even disagree at the edges - the classic one is whether a single item lives at /orders/123 (plural collection) or /order/123 (singular).
REST works well in a wide range of situations: public APIs, partner integrations, anywhere you want HTTP caching, and plenty of internal service-to-service traffic too. gRPC and tRPC give stronger contract guarantees, but those come with a coordination cost between client and server that isn’t always worth paying. Plenty of solid systems are REST end to end on purpose.
gRPC
gRPC pulls the contract into the toolchain. You write a .proto file once, generate clients in every language, and the compiler enforces the wire format on both sides. It doesn’t catch everything - field-number reuse, removed fields still in flight, and version skew with stale clients can still bite - but it’s a much stronger floor than an OpenAPI file maintained by hand.
The schema is written in protobuf, its own small IDL - not Go or TypeScript or any runtime language, but the contract that gets compiled into all of them.
syntax = "proto3";

service ItemService {
  rpc GetItem (GetItemRequest) returns (Item);
  rpc CreateItem (ItemCreate) returns (Item);
}

message Item {
  uint32 id = 1;
  string name = 2;
  double price = 3;
}

message GetItemRequest { uint32 id = 1; }
message ItemCreate { string name = 1; double price = 2; }
The server side, in Go, reads almost like a regular handler:
func (s *server) GetItem(
	ctx context.Context,
	req *pb.GetItemRequest,
) (*pb.Item, error) {
	// items is the server's store, e.g. a package-level map[uint32]*pb.Item.
	item, ok := items[req.Id]
	if !ok {
		return nil, status.Error(codes.NotFound, "not found")
	}
	return item, nil
}
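For a feel of the other side, here’s a sketch of calling it from Node with @grpc/grpc-js and @grpc/proto-loader. Dynamic loading keeps the example short but gives untyped stubs; real projects generate typed clients with protoc or buf. The proto filename and port are assumptions:

import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

// Load the .proto at runtime; codegen would bake this in at build time.
const def = protoLoader.loadSync("item_service.proto");
const proto = grpc.loadPackageDefinition(def) as any;

const client = new proto.ItemService(
  "localhost:50051",
  grpc.credentials.createInsecure(),
);

client.GetItem({ id: 1 }, (err: grpc.ServiceError | null, item: any) => {
  if (err) {
    // codes.NotFound from the Go handler surfaces here as err.code.
    console.error(err.code, err.details);
    return;
  }
  console.log(item.name, item.price);
});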
gRPC is fast: protobuf is binary and compact, HTTP/2 multiplexes, and you get bidirectional streams, server-push, deadlines, and cancellation that propagates through context. For service-to-service traffic inside your own infrastructure, it’s hard to beat.
The downsides hit you outside your network. Browser support means gRPC-Web plus a gateway like Envoy in front, with no client streaming - that’s the supported path, not a workaround, but the constraints are real. Debugging means reaching for grpcurl instead of curl, and you can’t read traffic in tcpdump or your service mesh’s request log without proto descriptors, which makes incident response materially worse. HTTP caching effectively goes away - everything is a POST with a binary body, so CDNs and reverse proxies have nothing to key on. Load balancers, WAFs, and rate limiters that work fine with REST often need extra configuration for HTTP/2’s long-lived connections. And the proto toolchain adds a build step that REST doesn’t need. If the caller isn’t a service you control, the friction usually isn’t worth it.
tRPC
tRPC is the TypeScript-only answer to “I just want my types to flow from server to client.” There’s no schema file, no codegen step, no transport spec. The client imports the server’s router type and gets full autocomplete and type checking.
Server side, in TypeScript:
import { initTRPC } from "@trpc/server";
import { z } from "zod";

// Same in-memory store as the REST example.
type Item = { id: number; name: string; price: number };
const items = new Map<number, Item>();
let nextId = 1;

const t = initTRPC.create();

export const appRouter = t.router({
  getItem: t.procedure
    .input(z.object({ id: z.number() }))
    .query(({ input }) => items.get(input.id)),
  createItem: t.procedure
    .input(z.object({ name: z.string(), price: z.number() }))
    .mutation(({ input }) => {
      const item: Item = { id: nextId++, ...input };
      items.set(item.id, item);
      return item;
    }),
});

export type AppRouter = typeof appRouter;
The last line is where the schema becomes a type. typeof appRouter asks TypeScript to infer the full shape of the router (every procedure name, every input type, every return type) and exports it as AppRouter. There’s no separate schema file; the router definition itself is what gets typed.
And the matching client:
import type { AppRouter } from "../server";
import { createTRPCClient, httpBatchLink } from "@trpc/client";

const trpc = createTRPCClient<AppRouter>({
  links: [httpBatchLink({ url: "/trpc" })],
});

const item = await trpc.getItem.query({ id: 1 });
The client passes AppRouter as a generic to createTRPCClient<AppRouter>, and from that point everything is typed: trpc.getItem autocompletes, .query({ id: 1 }) checks the input shape, and the awaited result has the return type the server inferred. No code is generated and no JSON schema is exchanged at runtime; the contract lives entirely in the imported type.
Rename a field on the server and the client fails to compile. For a Next.js app where the same team owns frontend and backend, that catches a category of bugs that would otherwise show up at runtime. The same team also has the option of skipping the API layer entirely with React Server Components (covered separately), so tRPC is one of two reasonable defaults rather than the obvious one.
Both ends being TypeScript is what makes it worth using. The wire is HTTP/JSON, so a Python or Go client can technically call a tRPC endpoint, but without the inferred types there’s no reason to - you’ve taken on tRPC’s constraints to get nothing back. You also don’t get gRPC’s performance, and HTTP caching is awkward in practice - the default httpBatchLink sends everything as POST so CDNs can’t cache it (you can switch to httpLink and force GET for queries to get back to something cacheable). The type safety is also build-time, not runtime - if a deployed client talks to a newer server, you can still get runtime errors. Zod input validation catches the worst of it.
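The cacheable variant looks like this - a sketch assuming a recent tRPC where httpLink accepts methodOverride (check your version’s docs):

import type { AppRouter } from "../server";
import { createTRPCClient, httpLink } from "@trpc/client";

const trpc = createTRPCClient<AppRouter>({
  links: [
    httpLink({
      url: "/trpc",
      // Queries go out as GET requests, so CDNs have something to cache.
      methodOverride: "GET",
    }),
  ],
});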
REST vs tRPC vs gRPC
These three all do the same thing: a caller asks, a callee answers. The differences are about who owns the contract and where the cost lands.
REST has the loosest contract. HTTP/JSON, schema in your head or in OpenAPI if you maintain it, client and server written by different people in different languages without coordinating. That looseness is useful when the consumer isn’t yours; when both sides are yours, the contract tends to drift and mismatches show up at runtime.
tRPC closes that drift, but only for TypeScript. The types are the schema - no codegen, no proto file, no OpenAPI. The catch is language lock-in: if anything other than a TS client needs to call this, you’ll add REST or another transport alongside.
gRPC gives you tRPC-style contract enforcement across languages. The proto file is the source of truth and generated clients land in Go, Python, Rust, Java, whatever you need. You also get binary encoding, HTTP/2 multiplexing, and proper streaming - none of which REST or tRPC give you out of the box. The price is build complexity (proto compilers, generated code in your repo) and the loss of HTTP-native tooling.
The shortest version: REST when you don’t control the caller, tRPC when you control a TypeScript caller, and gRPC when you control a polyglot caller and care about throughput or streaming.
Event queues
REST, gRPC, and tRPC are all synchronous: the caller sends a request and waits for the response. Queues work the other way around. The producer writes a message to the broker and continues with whatever it was doing, and the consumer reads the message at its own pace, possibly on a different machine, possibly after a crash and restart.
A producer in Python pushing to SQS with boto3:
import json
import os

import boto3

sqs = boto3.client("sqs")

# `order` is the record the surrounding handler just created.
sqs.send_message(
    QueueUrl=os.environ["ORDER_QUEUE_URL"],
    MessageBody=json.dumps({
        "type": "order.created",
        "orderId": order.id,
        "customerId": order.customer_id,
        "total": order.total,
    }),
)
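The consumer doesn’t have to share a language with the producer - that’s rather the point. A sketch of the other end in TypeScript with the AWS SDK v3, where handleOrderCreated is a hypothetical handler:

import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const queueUrl = process.env.ORDER_QUEUE_URL!;

async function handleOrderCreated(event: unknown): Promise<void> {
  // business logic goes here
}

while (true) {
  // Long-poll: wait up to 20s for messages instead of hammering the API.
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20,
    }),
  );

  for (const msg of Messages ?? []) {
    const event = JSON.parse(msg.Body!);
    await handleOrderCreated(event);

    // Delete only after the work succeeded; a crash before this line
    // means the broker redelivers (at-least-once in action).
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: queueUrl,
        ReceiptHandle: msg.ReceiptHandle!,
      }),
    );
  }
}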
What queues buy you
The indirection buys you a few things. The producer doesn’t have to care whether the consumer is up - messages buffer until the consumer is back, and failed work gets retried by the broker rather than the producer. A slow consumer doesn’t pull a fast producer down with it. Multiple consumers can subscribe to the same event independently, without the producer needing to know about them. Of the four shapes in this article, queues are the only one that survives a downstream outage without losing work or pushing the failure back to the caller.
Cross-VPC and dead letter queues
Two operational wins worth calling out. First, networking. If your producer and consumer live in different VPCs (or different AWS accounts, or different clouds), getting them to talk over HTTP usually means VPC peering, IP whitelists, or a private link, all of which take ongoing maintenance. A managed queue sits outside those VPCs and is reachable from any of them with IAM credentials, so neither service has to know where the other one is running.
Second, dead letter queues. They aren’t automatic - you configure the main queue with a DLQ target. Once wired up, a message that fails repeatedly gets shunted to the DLQ instead of looping forever, and you can inspect the failures, fix the bug, and redrive them back onto the main queue. With a synchronous API, a failed call is gone unless you’ve separately logged the request body somewhere.
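In SQS that wiring is a redrive policy on the main queue - a sketch with the AWS SDK v3, where the DLQ ARN and the threshold are placeholders:

import { SQSClient, SetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

await sqs.send(
  new SetQueueAttributesCommand({
    QueueUrl: process.env.ORDER_QUEUE_URL!,
    Attributes: {
      RedrivePolicy: JSON.stringify({
        // Placeholder ARN for the dead letter queue.
        deadLetterTargetArn: "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
        // After 5 failed receives, the message moves to the DLQ.
        maxReceiveCount: "5",
      }),
    },
  }),
);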
Redrives have caveats. The main one is what lands in the DLQ: if the consumer parses, unwraps, or transforms the message before it fails, the DLQ holds the partial or transformed version rather than the original payload. Redriving that pushes a message onto the main queue that the consumer wasn’t designed to receive - fields can be missing, envelopes wrong, and the retry fails for a different reason than the original. Preserving the original message all the way to the DLQ is something to design for up front. The other caveat is reconciliation: by the time you redrive, the team may have already worked around the failure manually (reissued the order, refunded the customer, fixed the row by hand), and replaying the DLQ in that state produces duplicates or conflicts. Reconcile DLQ contents against current state before replaying.
What queues cost
What you pay for it: there’s no return value, so the producer doesn’t know if the work succeeded. Debugging means correlating logs across services and reading queue depth metrics, and ordering guarantees vary by broker.

The schema problem is also worse than REST - there’s no synchronous error to tell you the consumer can’t parse your message. Most teams past the early stage layer schemas on top (Avro or protobuf via a registry, JSON Schema in the envelope, CloudEvents) so producers and consumers can agree on shape without the broker enforcing it. Optional, but strongly advised once events cross team boundaries.

Delivery guarantees are the other trap. Every queue you’ll deploy in practice is at-least-once, not exactly-once - the consumer can crash after processing a message but before it acks the broker, and the broker will redeliver. Some systems (Kafka with transactions) claim exactly-once, but only within Kafka itself; the moment the consumer writes to a database or calls an external API, you’re back to at-least-once and have to handle duplicates. Consumers have to be idempotent: processing the same message twice should produce the same result as once.
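Idempotency in its smallest form is deduplication by message id - a sketch where the in-memory set stands in for a durable store (a unique-keyed table, or Redis SET NX, in practice):

// Stand-in for a durable store keyed by message id.
const processed = new Set<string>();

async function handleOnce(messageId: string, body: string): Promise<void> {
  if (processed.has(messageId)) return; // redelivery: already done, skip

  const event = JSON.parse(body);
  await applyEvent(event); // the actual side effect (hypothetical)

  // Marking done after the side effect leaves a window: a crash between
  // the two lines still duplicates. Real implementations put the mark
  // and the side effect in one database transaction.
  processed.add(messageId);
}

async function applyEvent(event: unknown): Promise<void> {
  // business logic goes here
}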
I once got chatting to some startup engineers in a New York bar who’d pushed this further than I’d seen elsewhere. Their pipeline accepted any incoming event with no validation and dumped the lot into a NoSQL store - schemaless on the way in, so anything was a valid write. Validation happened later on the consumer side, and anything that didn’t parse got dropped at processing time. For their use case it worked: high-volume telemetry where losing a small fraction of malformed events cost less than slowing ingestion down to validate every one. The trade-off was that producers weren’t told when events were bad and some percentage of stored data was unusable, which is reasonable as a deliberate choice. It becomes a problem when no one realises the choice has been made, because a queue feeding a schemaless store doesn’t surface the loss on its own.
Fan-out and microservice orchestration
Queues are the right call for fan-out, for slow work, for anything where the producer shouldn’t block on the consumer. Fan-out means one event triggering several independent reactions: order placed, send a confirmation email and update the warehouse and notify analytics. None of those should be in the order’s request path, and none of them care about each other - the producer publishes once and multiple consumers each do their own thing.
They’re also useful for orchestrating multi-step flows across microservices. Once a system is broken into services that each own their own data, a single business action (place an order, onboard a customer, publish a listing) usually spans several of them. Doing that synchronously means a chain of HTTP calls where every service has to be up at once and any failure leaves the system in a half-applied state. Queues let you express the same flow as a sequence of events: each service handles its part and emits the next event when done. This is the saga pattern - the “transaction” is a sequence of local steps held together by events, with compensating actions for unwinding on failure. There are two flavours: choreography (each service reacts to events from the others) and orchestration (a dedicated orchestrator sends commands and tracks progress). Choreography is lighter to start but gets harder to reason about as the flow grows, since no single place describes the whole sequence; orchestration adds a component but keeps the flow visible. Either way, observability matters more than for synchronous systems - you can’t see the flow in a stack trace, only in the events that crossed the queues.
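A toy choreography sketch, with an in-memory bus standing in for the broker and every service and step name invented for illustration:

type Order = { orderId: string; total: number };
type Handler = (event: Order) => Promise<void>;

// In-memory stand-in for the broker; real code publishes to SNS/Kafka/etc.
const handlers = new Map<string, Handler[]>();
const subscribe = (topic: string, h: Handler) =>
  handlers.set(topic, [...(handlers.get(topic) ?? []), h]);
const publish = async (topic: string, event: Order) => {
  for (const h of handlers.get(topic) ?? []) await h(event);
};

// Inventory service: handles its part, emits the next event.
subscribe("order.created", async (order) => {
  await reserveStock(order);
  await publish("stock.reserved", order);
});

// Payment service: charges, or emits a compensating event on failure.
subscribe("stock.reserved", async (order) => {
  try {
    await chargeCard(order);
    await publish("payment.captured", order);
  } catch {
    await publish("stock.release", order); // compensating action
  }
});

// Hypothetical local steps owned by each service.
async function reserveStock(order: Order): Promise<void> {}
async function chargeCard(order: Order): Promise<void> {}

Note that no single place describes the whole flow - you have to read every subscriber to reconstruct it, which is exactly the choreography trade-off described above.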
Not all queues are the same
“Queue” gets used loosely. The actual products behave differently, and picking the wrong one for the workload causes problems that are hard to fix later.
AWS SQS is a point-to-point work queue. One producer publishes messages and one logical consumer (typically a worker pool competing for messages) reads them. Each message is delivered to one worker, processed, and then deleted from the queue. There’s no replay and no fan-out. SQS is the right choice when you need a job done once, somewhere, and don’t care which worker does it.
AWS SNS is pub/sub. The producer publishes to a topic, every subscriber gets a copy. SNS has its own delivery retry policies per subscription protocol and supports subscription-level DLQs, but no consumer-ack work-queue semantics. The common pattern is SNS to one or more SQS queues - SNS handles fan-out, each SQS queue handles its own consumer’s retries and dead letters.
Apache Kafka is a distributed append-only log. Messages aren’t deleted when they’re read; they sit on the partition for a configurable retention window, so multiple consumer groups can read the same topic at their own pace and a new consumer can replay from the beginning. The cost is operational weight (partitions, brokers, ZooKeeper or KRaft, schema registry) and a mental model shift, since you’re managing offsets rather than popping messages. Kafka makes sense for high-throughput event streams, audit logs, and anywhere replay matters.
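What replay looks like in practice - a sketch with kafkajs, where a fresh consumer group plus fromBeginning re-reads the topic’s retained history (broker address, topic, and group names are placeholders):

import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "analytics", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "analytics-backfill" });

await consumer.connect();
// A new group id plus fromBeginning means: start at the oldest retained
// offset, i.e. replay everything still inside the retention window.
await consumer.subscribe({ topic: "orders", fromBeginning: true });

await consumer.run({
  eachMessage: async ({ partition, message }) => {
    console.log(partition, message.offset, message.value?.toString());
  },
});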
RabbitMQ sits between the two. A real message broker with exchanges, routing keys, and per-message acks. More flexible routing than SQS or Kafka (topic, direct, fanout, headers exchanges) and lighter to operate than Kafka. Throughput tops out lower than Kafka but it’s plenty for most internal workloads.
Redis offers two options here. Redis pub/sub is fire-and-forget with no persistence, so any subscriber that’s offline when a message is published will miss it. Redis Streams adds a persistent log with consumer groups, which is closer to a smaller Kafka in behaviour. Both are cheap and fast if you’re already running Redis, but neither is the first thing to reach for when durability really matters.
The decision usually maps to the question “what does the consumer need to do with this?”
- One worker does the job once: SQS, RabbitMQ work queue.
- Many independent services react to the same event: SNS to SQS, Kafka, RabbitMQ topic exchange.
- Replay history or rebuild state: Kafka, Redis Streams.
- Already running Redis and the work is small: Redis Streams.
- All-AWS shop and you don’t want to operate brokers: SNS plus SQS, with EventBridge if you need routing rules on top.
A fifth shape skips the API question by having the server return rendered HTML (RSC, Hotwire, LiveView, Livewire, HTMX) for the client to swap in. Covered separately in HTML Over the Wire.
Picking between them
| | REST | gRPC | tRPC | Queues |
|---|---|---|---|---|
| Schema | optional (OpenAPI) | required (proto) | inferred (TS) | none by default |
| Wire format | JSON | protobuf | JSON | varies |
| Languages | any | any | TypeScript only | any |
| Caching | HTTP caches work | no (POST + binary) | limited (config required) | n/a |
| Streaming | server-sent events | bidirectional | subscriptions | native |
| Coupling | loose | tight (shared proto) | tight (shared types) | loose |
| Producer waits | yes | yes | yes | no |
| Survives consumer outage | no | no | no | yes |
Three questions usually decide it:
- Who’s calling? Third party or anyone outside your build pipeline: REST. TypeScript frontend you own: tRPC. Another service in your fleet: gRPC.
- Does the caller need an answer? Yes: REST, gRPC, or tRPC. No: a queue. “User clicked submit, send the email” is queue work; “user clicked submit, charge the card” is not.
- How tight is the coupling? Sharing a proto file or a TypeScript type means coordinated deploys. Fine inside one team, painful across orgs.
Most real systems use more than one transport, picked per workload.
A related trap is reaching for a queue to decouple services that need a synchronous answer - that just produces a slow RPC with extra failure modes. If the producer can’t continue without the result, use a call.
For the patterns that make any of these reliable in production - idempotency keys, transactional outbox, dead letter queues - see Making APIs and Queues Bulletproof.