The chat API started as a Next.js application. It worked - streaming responses, RAG search, tool calling. But deploying it as a customer-facing service on Lambda exposed problems.
Cold starts were noticeable. The Node.js runtime needs to initialise, load dependencies, and establish connections before the first request can be served. For a chat interface, a multi-second delay on the first message is unacceptable.
Memory usage was high. The Node.js process with its dependencies consumed more memory than the Lambda needed, pushing it into a larger (and more expensive) memory tier.
Why Go
Go compiles to a single binary. No runtime to initialise, no node_modules to load, no dependency resolution at startup. The binary starts, allocates its connections, and serves requests.
On Lambda, this translates directly to faster cold starts and lower memory usage. The Go binary starts in under a second, whereas the Node.js version took several seconds. Memory usage dropped enough to run in a smaller Lambda memory tier.
Go’s type system also caught bugs at compile time that the TypeScript version only caught at runtime (or didn’t catch at all - any-typed values in the AI SDK response handling were a source of subtle issues).
Architecture
The API uses Go Fiber (a fast HTTP framework) with an adapter for Lambda. The same binary runs locally for development and on Lambda for deployment.
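In outline, the entry point looks something like this - a minimal sketch assuming the aws-lambda-go-api-proxy Fiber adapter and a placeholder chat handler, not the service's real routes and middleware:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	fiberadapter "github.com/awslabs/aws-lambda-go-api-proxy/fiber"
	"github.com/gofiber/fiber/v2"
	"github.com/gofiber/fiber/v2/middleware/cors"
	"github.com/gofiber/fiber/v2/middleware/logger"
)

// newApp wires up routing and middleware; the real service also adds API key validation.
func newApp() *fiber.App {
	app := fiber.New()
	app.Use(cors.New(), logger.New())
	app.Post("/chat", func(c *fiber.Ctx) error {
		return c.SendString("ok") // placeholder for the chat handler
	})
	return app
}

func main() {
	app := newApp()

	// On Lambda, proxy API Gateway events into Fiber; locally, just listen on a port.
	if os.Getenv("AWS_LAMBDA_FUNCTION_NAME") != "" {
		proxy := fiberadapter.New(app)
		lambda.Start(func(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
			return proxy.ProxyWithContext(ctx, req)
		})
		return
	}
	log.Fatal(app.Listen(":3000"))
}
```

Branching on the Lambda environment variable keeps a single main() for both local development and deployment.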
Routing: Fiber handles HTTP routing with middleware for CORS, logging, and API key validation. The Lambda adapter translates API Gateway events into Fiber requests.
Services: Each external integration is a separate service - embedding (OpenAI), vector search (pgvector), cache (DynamoDB), shop (Shopify GraphQL), email (Resend), verification (2FA codes).
Infrastructure: AWS CDK defines everything - Lambda function (ARM64, AL2023 runtime), API Gateway with API key authentication, DynamoDB tables, Route53 DNS, Secrets Manager, CloudWatch alarms. Staging and production are separate stacks with different configurations; a sketch of the Lambda definition follows below.
CI/CD: CodePipeline with CodeBuild handles build, test, and deploy. The Go binary is cross-compiled for ARM64 Linux in the build step.
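For a flavour of the infrastructure piece, here is roughly what the Lambda definition looks like with the CDK Go bindings - the memory size, timeout, and asset path are placeholders, not the real stack's values:

```go
package main

import (
	"github.com/aws/aws-cdk-go/awscdk/v2"
	"github.com/aws/aws-cdk-go/awscdk/v2/awslambda"
	"github.com/aws/jsii-runtime-go"
)

func main() {
	app := awscdk.NewApp(nil)
	stack := awscdk.NewStack(app, jsii.String("ChatApiStaging"), nil)

	// The CI build cross-compiles the binary (GOOS=linux GOARCH=arm64) and
	// packages it as "bootstrap", which is what the provided runtime expects.
	awslambda.NewFunction(stack, jsii.String("ChatApi"), &awslambda.FunctionProps{
		Runtime:      awslambda.Runtime_PROVIDED_AL2023(),
		Architecture: awslambda.Architecture_ARM_64(),
		Handler:      jsii.String("bootstrap"),
		Code:         awslambda.Code_FromAsset(jsii.String("../build"), nil),
		MemorySize:   jsii.Number(256),                          // placeholder value
		Timeout:      awscdk.Duration_Seconds(jsii.Number(30)),  // placeholder value
	})

	app.Synth(nil)
}
```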
Tool calling in Go
OpenAI’s function calling works the same in any language - you define tools as JSON schemas and handle the responses. Go’s strict typing caught malformed tool calls at deserialisation rather than at execution.
Each tool has a typed input struct, a typed output struct, and an executor function. The framework deserialises the LLM’s tool call arguments into the input struct (catching malformed calls), executes the tool, and serialises the result back.
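A sketch of that pattern for a hypothetical order-lookup tool - the field names and the canned result are illustrative, not the service's real schema:

```go
package chat

import (
	"context"
	"encoding/json"
	"fmt"
)

// OrderLookupInput is the typed input for a hypothetical order-lookup tool.
// The JSON tags must match the schema advertised to the model.
type OrderLookupInput struct {
	Email       string `json:"email"`
	OrderNumber string `json:"order_number"`
}

// OrderLookupOutput is the typed result serialised back into the conversation.
type OrderLookupOutput struct {
	Status   string `json:"status"`
	Tracking string `json:"tracking,omitempty"`
}

// ExecuteOrderLookup deserialises the model's raw arguments into the typed
// input, so malformed calls fail here rather than during execution.
func ExecuteOrderLookup(ctx context.Context, rawArgs string) (string, error) {
	var in OrderLookupInput
	if err := json.Unmarshal([]byte(rawArgs), &in); err != nil {
		return "", fmt.Errorf("malformed tool arguments: %w", err)
	}
	if in.Email == "" || in.OrderNumber == "" {
		return "", fmt.Errorf("missing required fields in tool arguments")
	}

	// The real tool queries Shopify; a canned result stands in here.
	out := OrderLookupOutput{Status: "shipped", Tracking: "TRACK123"}
	b, err := json.Marshal(out)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```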
The tool loop runs up to three iterations: send message → get tool calls → execute tools → send results → get more tool calls or final response. This handles multi-step tool chains (look up customer email, verify with 2FA, then look up order).
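And a sketch of the bounded loop itself, assuming the community go-openai client (the service's actual OpenAI client may differ) and a dispatcher that routes to typed executors like the one above:

```go
package chat

import (
	"context"
	"fmt"

	openai "github.com/sashabaranov/go-openai"
)

// runToolLoop implements the bounded loop: send the conversation, execute any
// requested tools, append the results, and repeat at most three times.
func runToolLoop(ctx context.Context, client *openai.Client, messages []openai.ChatCompletionMessage, tools []openai.Tool) (string, error) {
	for i := 0; i < 3; i++ {
		resp, err := client.CreateChatCompletion(ctx, openai.ChatCompletionRequest{
			Model:    "gpt-4o", // placeholder model name
			Messages: messages,
			Tools:    tools,
		})
		if err != nil {
			return "", err
		}

		msg := resp.Choices[0].Message
		if len(msg.ToolCalls) == 0 {
			return msg.Content, nil // no more tool calls: this is the final response
		}

		// Echo the assistant's tool-call message back, then append one tool
		// result message per call, keyed by the tool call ID.
		messages = append(messages, msg)
		for _, tc := range msg.ToolCalls {
			result, err := executeTool(ctx, tc.Function.Name, tc.Function.Arguments)
			if err != nil {
				result = fmt.Sprintf(`{"error":%q}`, err.Error())
			}
			messages = append(messages, openai.ChatCompletionMessage{
				Role:       openai.ChatMessageRoleTool,
				ToolCallID: tc.ID,
				Content:    result,
			})
		}
	}
	return "", fmt.Errorf("tool loop did not converge within three iterations")
}

// executeTool dispatches to the typed executors, such as ExecuteOrderLookup above.
func executeTool(ctx context.Context, name, rawArgs string) (string, error) {
	switch name {
	case "lookup_order":
		return ExecuteOrderLookup(ctx, rawArgs)
	default:
		return "", fmt.Errorf("unknown tool %q", name)
	}
}
```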
DynamoDB caching
Search results are cached in DynamoDB, keyed by SHA256(scope + query). This avoids re-embedding and re-searching for repeated queries. TTL is 12 hours in staging, 24 hours in production.
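In outline, with the AWS SDK for Go v2 - the attribute names (pk, result, ttl) are assumptions, not the real table schema:

```go
package cache

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"strconv"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// cacheKey derives the partition key as SHA256(scope + query).
func cacheKey(scope, query string) string {
	sum := sha256.Sum256([]byte(scope + query))
	return hex.EncodeToString(sum[:])
}

// getCached returns the cached result for a key, or "" on a miss.
func getCached(ctx context.Context, db *dynamodb.Client, table, key string) (string, error) {
	out, err := db.GetItem(ctx, &dynamodb.GetItemInput{
		TableName: aws.String(table),
		Key:       map[string]types.AttributeValue{"pk": &types.AttributeValueMemberS{Value: key}},
	})
	if err != nil || out.Item == nil {
		return "", err
	}
	if v, ok := out.Item["result"].(*types.AttributeValueMemberS); ok {
		return v.Value, nil
	}
	return "", nil
}

// putCached stores a result with a TTL attribute (24h here; 12h in staging).
func putCached(ctx context.Context, db *dynamodb.Client, table, key, result string) error {
	expires := time.Now().Add(24 * time.Hour).Unix()
	_, err := db.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String(table),
		Item: map[string]types.AttributeValue{
			"pk":     &types.AttributeValueMemberS{Value: key},
			"result": &types.AttributeValueMemberS{Value: result},
			"ttl":    &types.AttributeValueMemberN{Value: strconv.FormatInt(expires, 10)},
		},
	})
	return err
}
```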
The cache tracks hit rates with atomic counters. This revealed that about 40% of queries were cache hits - common questions that multiple customers ask. For those queries, response time dropped from seconds (embedding + vector search) to milliseconds (DynamoDB lookup).
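The counters are plain DynamoDB atomic ADD updates - a sketch, reusing the client and imports from the cache code above (the hits attribute name is an assumption; a matching misses counter on the other branch gives the hit rate):

```go
// recordHit atomically increments the hit counter for a cache key.
func recordHit(ctx context.Context, db *dynamodb.Client, table, key string) error {
	_, err := db.UpdateItem(ctx, &dynamodb.UpdateItemInput{
		TableName:                 aws.String(table),
		Key:                       map[string]types.AttributeValue{"pk": &types.AttributeValueMemberS{Value: key}},
		UpdateExpression:          aws.String("ADD #h :one"),
		ExpressionAttributeNames:  map[string]string{"#h": "hits"},
		ExpressionAttributeValues: map[string]types.AttributeValue{":one": &types.AttributeValueMemberN{Value: "1"}},
	})
	return err
}
```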
What I’d do differently
The VPC configuration was the most painful part. A Lambda inside a VPC needs a NAT Gateway for outbound internet access (to reach OpenAI, Shopify, and Resend), and NAT Gateways aren't cheap. For a service that only needs VPC access for the database, a VPC endpoint for RDS and the public internet for everything else would be simpler.
The CDK infrastructure code is longer than the application code. That’s fine for production, but for prototyping, it’s a lot of ceremony. If I were building this again, I’d start with a simpler deployment (Fly.io, Railway) and add AWS infrastructure only when needed.