API logging in production

Here we will explore three tiers of logging for a serverless API on AWS.

CloudWatch alone. Structured JSON logs, correlation IDs threaded by hand, per-function retention, a handful of saved Logs Insights queries. CloudWatch is AWS’s built-in logs and metrics service; Logs Insights is its query language for structured log search.
CloudWatch plus Powertools and EMF. Powertools is a Lambda-aware logger: it auto-injects request_id, function_name, and cold-start info on every line, and extracts a correlation ID from API Gateway / SQS / EventBridge events without each handler doing it by hand. EMF (Embedded Metric Format) is one JSON log line that CloudWatch parses into both a log entry and a custom metric. W3C traceparent headers propagate trace context between services.
OpenTelemetry to a tracing backend. OpenTelemetry (OTel) is the vendor-neutral standard for instrumenting code with traces, metrics, and logs; an OTel SDK in each service emits spans that ship to a backend like Honeycomb, Datadog, or Grafana Tempo. Tail sampling and log shipping to S3 sit on top.

Not the same as CloudTrail

CloudTrail and CloudWatch sound similar and get confused often. CloudTrail is the AWS audit log: it records API activity against your AWS account (who launched which EC2 instance, who deleted which S3 bucket, which principal assumed which role) for compliance and forensics. CloudWatch is for application logs and metrics: what your code is doing inside a Lambda, an ECS task, or an EC2 instance. This post is about CloudWatch. CloudTrail sits alongside it as a separate service with its own retention and bill; the common query pattern is Athena over the S3 export.

CloudWatch does not disappear at Tier 3

At Tier 3 the OTel backend becomes the primary observability surface, but CloudWatch is still in the picture. Lambda’s stdout is forwarded to CloudWatch Logs by the runtime and you cannot turn that off. AWS-emitted service metrics (Invocations, Duration, Errors, plus API Gateway / SQS / EventBridge counters) keep landing in CloudWatch automatically.

Tiers at a glance

Tier	Setup	Right for
1	CloudWatch only. JSON logs, correlation IDs, retention, redaction, saved queries.	One AWS account, one small team, single service or a handful of Lambdas
2	Add Powertools logger, EMF for custom metrics, W3C `traceparent` propagation, CloudWatch Application Signals.	Many Lambdas across multiple teams, cross-service investigations are routinely slow
3	OTel SDK plus a tracing backend (Honeycomb, Datadog, Grafana Tempo). Tail sampling, log shipping.	High traffic, routine filtering by customer or tenant ID, dedicated observability budget

Tier 1: CloudWatch alone

CloudWatch only, without an OpenTelemetry collector or third-party tooling. One AWS account, one small team, a single service or a handful of Lambdas. CloudWatch itself scales much further; the investigation workflow across many log groups tends to strain first.

What “enough” means

Three things have to work when an incident hits:

Find one specific request across every function it touched.
Filter to errors only, across all functions, for a recent window.
Pull the full structured input of one failing call without exposing PII.

Anything that does not serve these three is overhead; anything that prevents them is a blocker.

JSON-format logs

Start by making CloudWatch parse your logs as structured data rather than free text. Lambda Node 20 and above supports a JSON logging format set on the function:

new NodejsFunction(this, 'CreateOrder', {
  // ...
  loggingFormat: LoggingFormat.JSON,
  applicationLogLevelV2: ApplicationLogLevel.INFO,
});

Once that flips, console.log({ event: 'order.created', orderId }) produces a structured log line in CloudWatch that Logs Insights can query as fields rather than parse with regex.

At this tier I do not reach for Pino, Winston, or Powertools logger. The runtime’s built-in JSON formatting plus a thin wrapper is enough until there is a concrete reason to add a library:

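// lib/log.ts
type Level = 'INFO' | 'WARN' | 'ERROR';
type Context = { correlationId?: string; orderId?: string; customerId?: string };

// Module-level state survives warm invocations of the same execution
// environment, so each handler should overwrite every field it relies on.
let ctx: Context = {};
export const setContext = (next: Context) => { ctx = { ...ctx, ...next }; };

// Merge the ambient context into every line. Each level goes to the matching
// console method so the level Lambda derives from the method agrees with
// the level field in the line.
const emit = (level: Level, event: string, fields: object = {}) => {
  const line = { level, event, ...ctx, ...fields };
  if (level === 'ERROR') console.error(line);
  else if (level === 'WARN') console.warn(line);
  else console.log(line);
};

export const log = {
  info: (event: string, fields?: object) => emit('INFO', event, fields),
  warn: (event: string, fields?: object) => emit('WARN', event, fields),
  error: (event: string, fields?: object) => emit('ERROR', event, fields),
};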

Every handler calls setContext at the top with the correlation ID and any business IDs from the event, and every later log line in that invocation inherits them.

Canonical fields

Every log line carries:

level (INFO / WARN / ERROR)
event (a dot-separated action name like order.created or payment.declined)
correlationId (a UUID generated at the API Gateway entry, threaded forward)
orderId, customerId, sku (the domain IDs you will filter by during an incident)
latencyMs on completion lines
statusCode for HTTP responses
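
As a rough sketch, the canonical fields above could be pinned down as a TypeScript type, so a misspelled or missing field fails at compile time rather than during an incident. The CanonicalLine type and completionLine helper are illustrative names, not part of the log helper shown earlier, and mapping 5xx responses to ERROR is an assumption:

```typescript
// Hypothetical type for the canonical fields; names mirror the list above.
type Level = 'INFO' | 'WARN' | 'ERROR';

interface CanonicalLine {
  level: Level;
  event: string;          // e.g. 'order.created'
  correlationId: string;
  orderId?: string;
  customerId?: string;
  sku?: string;
  latencyMs?: number;     // completion lines only
  statusCode?: number;    // HTTP responses only
}

type BaseFields = Pick<
  CanonicalLine,
  'event' | 'correlationId' | 'orderId' | 'customerId' | 'sku'
>;

// Build a completion line. Latency is measured from a caller-supplied start
// time; treating 5xx as ERROR is an assumption made for this sketch.
const completionLine = (
  base: BaseFields,
  startedAtMs: number,
  statusCode: number,
  nowMs: number = Date.now(),
): CanonicalLine => ({
  ...base,
  level: statusCode >= 500 ? 'ERROR' : 'INFO',
  latencyMs: nowMs - startedAtMs,
  statusCode,
});
```

A handler would record Date.now() on entry and pass the result of completionLine to the log helper just before returning.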

I never log raw request bodies. The handler validates the event with a zod schema first; the schema explicitly omits PII fields, so only safe parsed data is ever passed to the log helper.

Threading the correlation ID

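API Gateway gives every request an ID: the Lambda sees it as requestContext.requestId on the event, and REST APIs also return it to the client in the x-amzn-RequestId response header. It works as a correlation ID for synchronous request-response flows. But SQS, EventBridge, and Step Functions all re-wrap the event, so the original request ID ends up buried somewhere the next Lambda cannot easily reach. The pattern I use instead: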

The first Lambda in a flow generates a UUID and stores it in correlationId.
Every event published downstream (an SQS message body, an EventBridge detail, a Step Functions input) includes correlationId as a top-level field.
Every Lambda processing a downstream event calls setContext({ correlationId: input.correlationId }) first thing.
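
The three steps above can be sketched roughly as follows. The event shapes are trimmed to the fields that matter here, and the helper names (newCorrelationId, withCorrelationId, fromSqsRecord, fromEventBridge) are hypothetical:

```typescript
import { randomUUID } from 'node:crypto';

// Minimal shapes for illustration; real events carry many more fields.
type SqsRecord = { body: string };
type EventBridgeEvent = { detail: { correlationId?: string } };

// Step 1: the first Lambda in the flow mints the ID once.
const newCorrelationId = (): string => randomUUID();

// Step 2: publishers put correlationId at the top level of whatever they
// send downstream (SQS body, EventBridge detail, Step Functions input).
const withCorrelationId = <T extends object>(payload: T, correlationId: string) => ({
  ...payload,
  correlationId,
});

// Step 3a: a consumer recovers the ID from an SQS record body.
const fromSqsRecord = (record: SqsRecord): string | undefined => {
  try {
    const parsed: unknown = JSON.parse(record.body);
    if (typeof parsed === 'object' && parsed !== null && 'correlationId' in parsed) {
      const id = (parsed as { correlationId: unknown }).correlationId;
      return typeof id === 'string' ? id : undefined;
    }
  } catch {
    // Body was not JSON; treat it as "no correlation ID".
  }
  return undefined;
};

// Step 3b: a consumer recovers the ID from an EventBridge detail.
const fromEventBridge = (event: EventBridgeEvent): string | undefined =>
  event.detail.correlationId;
```

Whatever the extractor returns goes straight into setContext({ correlationId }) at the top of the consuming handler. If it returns undefined, minting a fresh ID and logging a warning is one option, so the gap is at least visible.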

Threading through every event shape is manual work, but it is the cheapest pattern that works without a tracing backend. Later, when the team moves to Tier 2 and adopts W3C Trace Context, the same threading points carry the traceparent value alongside the correlation ID.

Log group retention

The default retention on a CloudWatch log group is “Never expire”. On a serverless app where every Lambda gets a log group automatically, that turns into a creeping storage bill. The fix is one line per function in CDK:

new NodejsFunction(this, 'CreateOrder', {
  // ...
  logRetention: RetentionDays.ONE_MONTH,
});

I run 30 days for hot production traffic, 7 days for DEBUG-heavy services, 1 year for billing or audit-relevant ones. Past that, ship to S3 if you ever expect to read it again.

What not to log

The most common GDPR leak in a Lambda codebase is a framework or library that logs raw request bodies under a debug flag that never gets turned off in production. Once the log line lands in CloudWatch, it stays there for the full retention period you configured.

Fields I never log:

authorization header, session cookies, JWTs
Email, phone, full name, address, date of birth
Card numbers, CVV, full bank account numbers
IPs in EU traffic (treated as PII under GDPR)
Raw request bodies before schema validation

The log helper logs whatever structured object you pass it. Validation schemas strip the PII fields before the handler logic ever sees them, so by the time you log the parsed input it is already safe.
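
A minimal sketch of that allowlist idea, hand-rolled here so it runs without dependencies (in practice a zod schema does this job, as mentioned earlier). The parseCreateOrder function and its field names are hypothetical:

```typescript
// Fields considered safe to log for this hypothetical endpoint.
type SafeCreateOrder = {
  orderId: string;
  customerId: string;
  sku: string;
  quantity: number;
};

// Copy only allowlisted fields out of the raw body. Anything else
// (email, address, card data, ...) never reaches the handler or the logs.
// Error messages name the field but never echo its value.
const parseCreateOrder = (raw: unknown): SafeCreateOrder => {
  if (typeof raw !== 'object' || raw === null) {
    throw new Error('body must be an object');
  }
  const r = raw as Record<string, unknown>;

  const str = (key: string): string => {
    const v = r[key];
    if (typeof v !== 'string' || v.length === 0) {
      throw new Error(`${key} must be a non-empty string`);
    }
    return v;
  };

  const quantity = r['quantity'];
  if (typeof quantity !== 'number' || !Number.isInteger(quantity) || quantity < 1) {
    throw new Error('quantity must be a positive integer');
  }

  return {
    orderId: str('orderId'),
    customerId: str('customerId'),
    sku: str('sku'),
    quantity,
  };
};
```

Because the return value is built field by field rather than spread from the input, a new PII field added to the request later stays out of the logs until someone deliberately allowlists it.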

Saved Logs Insights queries

Three queries cover most of what I open the console for. Save them with descriptive names against the relevant log group sets; the Logs Insights UI is bad at remembering frequent queries, so you will retype them every time otherwise.

Find one order across every function in the flow. Run it across all the log groups the flow touches:

fields @timestamp, @logStream, level, event, orderId, customerId, correlationId, @message
| filter orderId = "o101"
| sort @timestamp asc

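Errors in the last hour, grouped by event (as with every query here, the time window comes from the console’s time-range picker, not from the query text):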

fields @timestamp, level, event, correlationId, @message
| filter level = "ERROR"
| stats count(*) by event

Slow requests in the last 15 minutes:

fields @timestamp, event, latencyMs, correlationId
| filter latencyMs > 1000
| sort latencyMs desc
| limit 50

The find-one-order query is the one I open most, often with the orderId filter swapped for correlationId. Filtering by the ID and sorting by timestamp gives a top-to-bottom timeline of one request across every Lambda it touched.

Tier 2: when CloudWatch alone starts to strain

Signs Tier 1 is straining:

“Where did this order get stuck” routinely takes more than five minutes to answer because the flow now spans many Lambdas and you have to jump between several log groups.
Filtering logs by customer ID, tenant ID, or experiment ID becomes a routine question, and CloudWatch’s per-query cost on those fields adds up.
The Logs Insights bill starts to rival what a managed tracing backend would cost.

What changes:

Powertools logger. Replaces the hand-rolled context helper with a Lambda-aware logger that auto-injects request ID, cold-start info, and correlation context without each handler setting them by hand. Worth adopting when the hand-rolled helper starts to accumulate special cases.

EMF for custom metrics. A single JSON log line that CloudWatch parses into both a log entry and a metric. It avoids a PutMetricData API call per data point, and the resulting metrics can drive CloudWatch alarms like any other metric, with no extra plumbing. Keep dimensions low-cardinality (service, env, route); high-cardinality fields go in the log body, not as metric dimensions, or you end up with one custom metric per unique combination and the bill grows fast.
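
For a sense of what such a line looks like, here is a small sketch that builds one by hand. The emfLine helper, the dimension names, and the example values are assumptions; at this tier the Powertools metrics utility would normally produce the line for you:

```typescript
// Low-cardinality dimensions only; these names are illustrative.
type Dimensions = { service: string; env: string; route: string };

type Metric = { name: string; value: number; unit: 'Milliseconds' | 'Count' };

// Build one EMF log line. The `_aws` block tells CloudWatch which top-level
// keys are dimensions and which are metrics; every other key is plain log data.
const emfLine = (
  namespace: string,
  dims: Dimensions,
  metric: Metric,
  logFields: Record<string, string | number> = {},
  timestampMs: number = Date.now(),
): Record<string, unknown> => ({
  _aws: {
    Timestamp: timestampMs,
    CloudWatchMetrics: [
      {
        Namespace: namespace,
        Dimensions: [Object.keys(dims)],
        Metrics: [{ Name: metric.name, Unit: metric.unit }],
      },
    ],
  },
  ...logFields, // high-cardinality context (orderId, correlationId) stays log-only
  ...dims,
  [metric.name]: metric.value,
});
```

The object is serialised with JSON.stringify and written to stdout as a single line. Fields such as orderId ride along in the log entry but are never listed under Dimensions, which is what keeps the custom-metric count flat.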

W3C traceparent propagation. With OTel auto-instrumentation and OTEL_PROPAGATORS=tracecontext,baggage, Lambda-to-Lambda HTTP calls pick up the standard trace headers automatically. SQS, EventBridge, and Step Functions still need manual threading, the same way the Tier 1 correlation ID is threaded. The correlationId field from Tier 1 carries forward as the human-readable companion to the machine traceId.
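
For the hops that need manual threading, the header itself is simple enough to handle by hand. The sketch below formats and parses a version-00 traceparent value and attaches it to an SQS message as a message attribute; the helper names and the choice of traceparent as the attribute name are assumptions, and in a real service the OTel propagation API would normally do this work:

```typescript
// W3C Trace Context `traceparent`: version-traceId-parentId-flags, lowercase hex.
type TraceContext = { traceId: string; parentId: string; sampled: boolean };

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Only the sampled flag is modelled here; other flag bits are dropped.
const formatTraceparent = ({ traceId, parentId, sampled }: TraceContext): string =>
  `00-${traceId}-${parentId}-${sampled ? '01' : '00'}`;

const parseTraceparent = (header: string): TraceContext | undefined => {
  const m = TRACEPARENT.exec(header.trim());
  if (!m) return undefined;
  const traceId = m[1] ?? '';
  const parentId = m[2] ?? '';
  const flags = m[3] ?? '00';
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
};

// Manual threading for SQS: carry the header as a String message attribute.
const toSqsAttributes = (ctx: TraceContext) => ({
  traceparent: { DataType: 'String', StringValue: formatTraceparent(ctx) },
});
```

The consuming Lambda reads the attribute back, runs it through parseTraceparent, and starts its own span as a child of the parsed context.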

CloudWatch Application Signals. AWS’s distributed-tracing UI sitting on top of OTel-emitted spans. Native to the AWS account, no third-party data egress, and the trace map covers most “where did this order get stuck” questions a small team has.

Tier 3: when Tier 2 stops being enough

Signs Tier 2 stops being enough:

Filtering by customer or tenant ID across logs or traces is routinely slow or expensive in CloudWatch.
Trace exploration through Application Signals does not match what Honeycomb, Datadog, or Tempo show in the same situation.
A dedicated observability budget exists.

What changes:

OTel SDK plus a tracing backend. Honeycomb, Datadog APM, and Grafana Tempo all accept OTel-native traces. The advantage over Application Signals is filtering traces by fields like customer ID or tenant ID, and a trace-exploration UI built for ad-hoc questions rather than pre-defined dashboards.
Tail sampling via the OTel Collector. Buffer the whole trace, then decide at the end based on errors, latency, or attributes. Cheaper than keeping every trace, and it can be configured to keep every error trace that fits within the buffer window. Below a few hundred requests per second, do not sample at all.
Log shipping to S3 plus Athena for long-retention queries on cold data. Cheaper than long CloudWatch retention if you need it at all.
Per-service redaction in the OTel Collector rather than relying on every handler to omit PII, so redaction lives at one choke point rather than spread across the codebase.
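
To make the tail-sampling decision above concrete, here is the logic written out as plain TypeScript. It only illustrates the decision: in practice this is configured as policies in the OTel Collector rather than written by hand. The Span and Policy shapes, the tenantId attribute, and the thresholds are all assumptions:

```typescript
// Illustrative span shape; a real collector sees far richer data.
type Span = {
  traceId: string;
  durationMs: number;
  error: boolean;
  attributes: Record<string, string>;
};

type Policy = { slowMs: number; keepTenants: Set<string>; baselineRate: number };

// Deterministic value in [0, 1] derived from the trace ID, so every
// evaluation of the same trace reaches the same baseline decision.
const traceFraction = (traceId: string): number =>
  parseInt(traceId.slice(-8), 16) / 0xffffffff;

// Decide once the whole trace has been buffered.
const keepTrace = (spans: Span[], policy: Policy): boolean => {
  const first = spans[0];
  if (first === undefined) return false;

  // Always keep traces that contain an error.
  if (spans.some((s) => s.error)) return true;

  // Keep slow traces.
  if (spans.some((s) => s.durationMs >= policy.slowMs)) return true;

  // Keep traces for tenants under investigation.
  const watched = spans.some((s) => {
    const tenant = s.attributes['tenantId'];
    return tenant !== undefined && policy.keepTenants.has(tenant);
  });
  if (watched) return true;

  // Everything else is sampled at the baseline rate.
  return traceFraction(first.traceId) < policy.baselineRate;
};
```

The order of the checks is the point: errors and slow traces are decided before the probabilistic step, so lowering baselineRate to cut cost does not lower the share of error traces that are kept.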