Amir Hossein Shekari

Posted on • Originally published at vanenshi.com

Scaling the Inbox: How I Contributed the Step Functions Construct to SST

Every developer knows the "happy path"—that glorious moment when your code works perfectly on a small sample size. But when the "happy path" meets the reality of 100,000+ emails and strict API rate limits, things tend to break.

This is the story of how a bottleneck in my AI startup, Glim, led me to build and contribute the Step Functions construct to the SST (Serverless Stack) ecosystem.

The Problem: The Gmail "Wall"

I was building Glim, an AI assistant designed to help users manage their chaos-ridden inboxes. We offered real-time filtering and a powerful bulk-cleanup tool that could archive or delete thousands of emails based on specific criteria like sender, date, or domain.

To make this work, I had to implement a full inbox sync. According to the Google API documentation, this is a two-step process:

  1. Fetch a list of all email IDs.
  2. Iterate through that list and perform a messages.get request for each individual email's full data.
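The two-step sync can be sketched like this. `GmailLikeClient` is a hypothetical interface standing in for the real `googleapis` Gmail client; the paging and per-message fetch mirror the `messages.list` / `messages.get` flow described above.

```typescript
// Hypothetical minimal interface standing in for the Gmail API client.
interface GmailLikeClient {
  // Mirrors users.messages.list: returns one page of IDs plus a cursor.
  listMessageIds(pageToken?: string): Promise<{ ids: string[]; nextPageToken?: string }>;
  // Mirrors users.messages.get: returns the full message for one ID.
  getMessage(id: string): Promise<{ id: string; payload: string }>;
}

// Step 1: page through the mailbox and collect every message ID.
async function fetchAllIds(client: GmailLikeClient): Promise<string[]> {
  const ids: string[] = [];
  let pageToken: string | undefined;
  do {
    const page = await client.listMessageIds(pageToken);
    ids.push(...page.ids);
    pageToken = page.nextPageToken;
  } while (pageToken);
  return ids;
}

// Step 2: one messages.get call per email to fetch its full data.
async function fetchAllMessages(client: GmailLikeClient) {
  const ids = await fetchAllIds(client);
  const messages = [];
  for (const id of ids) {
    messages.push(await client.getMessage(id));
  }
  return messages;
}
```

Note that step 2 issues one request per email, which is exactly where the rate limits below start to bite.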

The Math of the 429

Google’s rate limits are notoriously strict and follow a specific quota system. The limit consists of two primary factors:

  • Per-second Unit Limit: You are allowed 100 quota units per second.
  • Request Weight: Each type of request has a "weight." A messages.get request costs 20 units.

This means you can only fetch 5 emails per second (300 per minute). If you exceed it, Google hits you with a 429 Too Many Requests error and a block that grows exponentially each time you retry too early.
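The quota arithmetic is simple enough to put in code (values taken from the numbers above; the helper name is illustrative):

```typescript
// Quota figures as described above.
const UNITS_PER_SECOND = 100; // per-second unit budget
const MESSAGES_GET_COST = 20; // weight of one messages.get call

// How many requests of a given weight fit in the per-second budget.
function maxRequestsPerSecond(budget: number, costPerRequest: number): number {
  return Math.floor(budget / costPerRequest);
}

const perSecond = maxRequestsPerSecond(UNITS_PER_SECOND, MESSAGES_GET_COST); // 5
const perMinute = perSecond * 60; // 300
```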

Attempt One: An SQS Disaster

My first architecture was a classic serverless pattern: Lambda + SQS.

I bundled 20 email IDs into each SQS message. To respect the rate limits, I throttled publishing to one message per second. I set a 60-second visibility timeout, figuring that if a batch hit a 429 and wasn't deleted, the message would simply reappear in the queue and be retried a minute later.
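A sketch of that first design: `send` stands in for an SQS `SendMessage` call (the real code would use `@aws-sdk/client-sqs`), and `delayMs` is the one-message-per-second throttle.

```typescript
// Split a list of IDs into fixed-size batches (20 per SQS message here).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Publish one batch per interval; `send` stands in for an SQS SendMessage call.
async function publishBatches(
  ids: string[],
  send: (body: string) => Promise<void>,
  delayMs = 1000, // one message per second to respect the rate limit
): Promise<number> {
  const batches = chunk(ids, 20);
  for (const batch of batches) {
    await send(JSON.stringify({ ids: batch }));
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return batches.length;
}
```

The throttle only controls how fast messages *enter* the queue; once several consumers are draining it concurrently, nothing stops multiple batches from hitting Google at the same time, which is the failure mode described next.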

It worked... until it didn't.

For users with fewer than 40,000 emails, it was fine. But for power users with massive inboxes, the math fell apart. Multiple SQS batches would hit the rate limit simultaneously. The 60-second window wasn't long enough to clear the Google block, leading to more retries, more 429s, and a cascading failure that increased the block time exponentially. I realized I didn't just need a queue; I needed a State Machine.

The Solution: The "Scatter-Gather" Step Function

I needed a way to strictly control the flow of execution. If "Batch 1" hit a rate limit, the entire process needed to pause until that block cleared before "Batch 2" even started.

I decided to move to a Scatter-Gather pattern using AWS Step Functions. This allowed me to:

  1. Fetch IDs: Fetch 500 IDs at a time (even fetching the IDs can hit limits once you pass 40k emails).
  2. Strict Flow Control: Process the batch, respect the rate limit, then move to the next 500.
  3. Software-Level Backoff: Implement an exponential wait step within the flow so that retries didn't stack on top of each other against Google's API.
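The flow-control logic from steps 2 and 3 can be sketched as a plain loop (in the real state machine, the pause is a Wait state and the retry is a Choice back-edge; all names here are illustrative):

```typescript
// Exponential backoff: base * 2^attempt seconds, capped so waits stay bounded.
function backoffSeconds(attempt: number, base = 2, cap = 300): number {
  return Math.min(base * 2 ** attempt, cap);
}

// Process batches strictly one at a time; on a 429, wait and retry the SAME
// batch before moving on (the ordering guarantee SQS could not provide).
async function runSequentially(
  batches: string[][],
  process: (batch: string[]) => Promise<"ok" | "rate-limited">,
  sleep: (seconds: number) => Promise<void>,
): Promise<void> {
  for (const batch of batches) {
    let attempt = 0;
    while ((await process(batch)) === "rate-limited") {
      await sleep(backoffSeconds(attempt++));
    }
  }
}
```

The key property is that batch N+1 cannot start until batch N has fully succeeded, so retries never pile up concurrently.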

The Hurdle: SST Didn't Speak Step Functions

At the time, Glim was built entirely on SST (Serverless Stack). I loved the developer experience of SST, but there was a major missing piece: it didn't have a built-in construct for Step Functions.

I had two choices: manually write raw Pulumi code (which would feel out of place in my clean SST codebase) or build a native SST construct myself.

I chose the latter. I designed the component around a linked-node architecture: each state in the machine is a node that links to the next, so the whole state machine can be walked as a graph. I wrote the logic to compile this graph into the exact JSON format required by Pulumi (the backend of SST) and pushed it to AWS.
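A toy version of that compile step, with hypothetical node and output shapes (the real construct's API differs): walking the linked nodes from the start emits an Amazon States Language document of the kind Pulumi deploys.

```typescript
// Hypothetical node shape: one state plus a link to the next state.
interface StateNode {
  name: string;
  definition: Record<string, unknown>; // e.g. { Type: "Task", Resource: "..." }
  next?: StateNode;
}

// Walk the linked nodes and emit an Amazon States Language document.
function compile(start: StateNode): Record<string, unknown> {
  const states: Record<string, unknown> = {};
  for (let node: StateNode | undefined = start; node; node = node.next) {
    states[node.name] = {
      ...node.definition,
      // Terminal states get End: true; everything else points at its successor.
      ...(node.next ? { Next: node.next.name } : { End: true }),
    };
  }
  return { StartAt: start.name, States: states };
}
```

Keeping the user-facing API as linked nodes meant the construct could validate the chain (unreachable states, missing terminals) before ever generating JSON.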

Giving Back to the Community

Once I had a working version, I didn't want to keep it in a silo. I forked the SST repository, integrated my new StepFunctions component, and used it to successfully fix the Glim inbox sync. It handled 100,000+ emails without breaking a sweat.

Seeing it work in production, I opened a Pull Request.

After several rounds of refinement, the code was merged. Today, that construct—born out of a desperate need to sync 100,000 emails without hitting a Google rate limit—is available for everyone in the SST ecosystem to use.

Lessons Learned

  1. Queues aren't Orchestrators: SQS is great for decoupling, but when you need to "stop the world" to wait for a rate limit, Step Functions are superior.
  2. Understand the Quota: If I hadn't dug into the 20-unit weight of a messages.get request, I would have kept guessing why my Lambda was failing.
  3. Scratch your own itch: The best open-source contributions come from solving real-world production problems.

Now, whether you're syncing a massive inbox or orchestrating a complex AI pipeline, you can do it natively within SST. Happy coding!
