SST to CDK: Lessons from Running Multiple Services on AWS

I used SST across multiple production services: Lambda APIs, REST APIs, sync workers, Next.js apps. It’s a useful abstraction. One config file, sensible defaults, a lot of AWS complexity handled for you. The developer experience holds up day to day.

I’ve since migrated everything to CDK. CDK has been more reliable when stacks fail, which matters more than the smoother SST authoring experience when something breaks at 2am.

Why SST first

SST wraps CDK under the hood. The appeal is that you don’t have to write CDK constructs by hand. A Lambda function with an API Gateway, a cron schedule, or a queue is a few lines of config. Getting the same result in raw CDK takes significantly more code.

The problem surfaces when something goes wrong. SST’s abstractions become opaque. Debugging a failed deploy means digging into generated CloudFormation templates you didn’t write, looking at CDK constructs SST created internally, and working around SST’s own opinions about how things should be wired. The error messages are often SST errors, not AWS errors, which adds a layer of indirection when you’re trying to understand what failed.
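To make the trade concrete, here’s roughly what that brevity looks like - a minimal sketch using SST v2’s sst/constructs (the CDK-based version described here; handler paths and names are placeholders):

```ts
import { StackContext, Api, Cron } from "sst/constructs";

// A minimal SST v2 stack: an HTTP API plus a scheduled job.
export function API({ stack }: StackContext) {
  const api = new Api(stack, "api", {
    routes: {
      "GET /": "packages/functions/src/lambda.handler",
    },
  });

  new Cron(stack, "cron", {
    schedule: "rate(1 hour)",
    job: "packages/functions/src/cron.handler",
  });

  stack.addOutputs({ ApiEndpoint: api.url });
}
```

Those dozen lines stand in for the API Gateway routes, Lambda functions, IAM roles, and log groups SST synthesizes for you - the same generated resources you end up reading through when a deploy fails.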

CDK CodePipeline vs CodePipeline

They’re different things. Regular CodePipeline - a pipeline defined directly in AWS, with build steps in a buildspec - is the one I love. CDK CodePipeline is the aws-cdk-lib/pipelines construct, where the pipeline itself is defined in CDK code.

The interesting feature of CDK CodePipeline is self-mutation: the pipeline can update its own definition as part of its run. Change the CDK pipeline construct, push, and the pipeline rewrites itself before deploying your app. In practice it adds complexity. The pipeline that deploys your CDK stack is itself a CDK stack. When it breaks, you’re debugging CDK with CDK. And when a rollback fails in CDK CodePipeline, it can leave the stack in a broken state that requires manual intervention to recover. I now use plain CodePipeline so the pipeline that deploys my app is not itself a CDK stack.
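For reference, the shape of a self-mutating pipeline looks roughly like this - a sketch, with repo, branch, and stage names as placeholders:

```ts
import * as cdk from 'aws-cdk-lib';
import { pipelines } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Placeholder stage; the real application stacks would be defined inside it.
class AppStage extends cdk.Stage {
  constructor(scope: Construct, id: string, props?: cdk.StageProps) {
    super(scope, id, props);
    // application stacks go here
  }
}

// Deployed once manually with `cdk deploy`; after that it updates itself.
class PipelineStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const pipeline = new pipelines.CodePipeline(this, 'Pipeline', {
      // selfMutation defaults to true: after synth, the pipeline rewrites
      // its own definition before it deploys any application stages.
      selfMutation: true,
      synth: new pipelines.ShellStep('Synth', {
        input: pipelines.CodePipelineSource.gitHub('my-org/my-repo', 'main'),
        commands: ['npm ci', 'npx cdk synth'],
      }),
    });

    // The app rides along as a stage of the same pipeline.
    pipeline.addStage(new AppStage(this, 'Prod'));
  }
}
```

Note the recursion this sets up: PipelineStack deploys a pipeline whose first job is to redeploy PipelineStack.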

Databases and NAT gateways outside CDK

The split between long-lived infrastructure and application stacks is the lesson that has paid back the most.

When CDK deploys a stack and something fails midway, CloudFormation rolls back. Most resources roll back cleanly. Some do not.

NAT Gateways with Elastic IPs are one. If your stack owns the NAT Gateway and a rollback releases the Elastic IP, the IP is gone. Any downstream service that had that IP in an allowlist now can’t connect. In my case, an IP-restricted API stopped working because the NAT Gateway’s IP changed after a failed rollback. Incident, manual fix, new IP added to every allowlist it needed to be in.
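If a NAT Gateway has to live in a CDK-managed stack at all, one mitigation - a sketch, assuming aws-cdk-lib’s ec2 module - is to allocate the Elastic IP as its own retained resource and hand the allocation to the NAT provider, so a rollback orphans the IP instead of releasing it:

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

class NetworkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Allocate the Elastic IP as a first-class resource and RETAIN it,
    // so CloudFormation leaves the allocation behind rather than
    // releasing the address during a rollback or stack deletion.
    const natEip = new ec2.CfnEIP(this, 'NatEip');
    natEip.applyRemovalPolicy(cdk.RemovalPolicy.RETAIN);

    new ec2.Vpc(this, 'Vpc', {
      natGateways: 1,
      natGatewayProvider: ec2.NatProvider.gateway({
        // Reuse the pre-allocated EIP instead of letting CDK create one.
        eipAllocationIds: [natEip.attrAllocationId],
      }),
    });
  }
}
```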

RDS databases are another. If your database is inside the CDK stack that’s failing, a bad enough rollback can delete it. CDK supports RemovalPolicy.RETAIN to protect resources, but relying on that is fragile. Better to keep the database completely separate from the application stack.
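For reference, these are the knobs in question - a sketch with an illustrative Postgres instance; useful as a backstop, but no substitute for moving the database out of the app stack entirely:

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';
import { Construct } from 'constructs';

class DatabaseStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

    new rds.DatabaseInstance(this, 'Db', {
      engine: rds.DatabaseInstanceEngine.postgres({
        version: rds.PostgresEngineVersion.VER_16,
      }),
      vpc,
      // RETAIN orphans the instance instead of deleting it when the
      // stack goes away; deletionProtection blocks explicit deletes.
      removalPolicy: cdk.RemovalPolicy.RETAIN,
      deletionProtection: true,
    });
  }
}
```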

The pattern I use now:

  • Infra stack (managed separately, rarely touched): VPC, subnets, RDS (with deletion protection on), NAT Gateway, Elastic IPs, security groups
  • App stack (deployed via CDK/pipeline): Lambda, API Gateway, CloudFront, S3, application-layer resources

The app stack imports the infra stack’s outputs. It can be destroyed and recreated freely. The infra stack changes rarely and only deliberately.
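Concretely, the wiring looks something like this - a sketch where the stack names and the export name are placeholders, and a real app stack would import subnet and security group IDs the same way:

```ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

// Infra stack: owns the network, exports what app stacks need.
class InfraStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2, natGateways: 1 });

    new cdk.CfnOutput(this, 'VpcIdOutput', {
      value: vpc.vpcId,
      exportName: 'infra-vpc-id',
    });
  }
}

// App stack: imports the VPC by reference, never owns it. Destroying or
// recreating this stack cannot touch the network topology.
class AppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = ec2.Vpc.fromVpcAttributes(this, 'Vpc', {
      vpcId: cdk.Fn.importValue('infra-vpc-id'),
      availabilityZones: cdk.Stack.of(this).availabilityZones,
    });

    // Lambda, API Gateway, etc. attach to the imported `vpc` here.
  }
}
```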

This means a bad deploy can break the app stack but cannot touch the database or the network topology.

What I use now

CDK for application stacks. CodePipeline for CI/CD. Direct CDK deploy only when infrastructure config changes require it - not for routine deploys.

SST is still the right choice if you want to move fast on a greenfield project and aren’t managing shared infrastructure. The moment your services share a VPC, or you have stateful resources you can’t afford to lose, CDK gives you more control over what’s happening underneath.