Event-driven media processing pipeline

The original flow was synchronous: client uploads a file, the API calls MediaConvert, waits for a response, and returns a job ID. Works fine at low volume. At higher concurrency it falls apart: Lambda timeouts, blocked API threads, and MediaConvert throttling errors that the client has to retry manually.

The goal was to make the flow async without changing the existing API contract. Clients already expected a job ID back immediately. The challenge was wiring up completion events without having the backend poll MediaConvert for job status.

The revised architecture

Client → API (POST /upload)
           ↓ upload to S3
           ↓ publish SNS message
           ↓ return job ID immediately
                    ↓
         SNS → SQS queue (fan-out)
                    ↓
           Lambda: validate + transcode
                    ↓
              MediaConvert job
                    ↓
           EventBridge: job complete
                    ↓
           Lambda: update job record
                    ↓
           Webhook to client (optional)

The API becomes a thin entry point: it validates the upload, puts the file in S3, publishes to SNS, and returns immediately. All the heavy work happens downstream.
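
Here is a minimal sketch of that thin entry point, assuming boto3-style clients and a DynamoDB table object are passed in. The function name, bucket, topic ARN, key layout, and item fields are all hypothetical; the notes don't show the real handler.

```python
import json
import uuid


def handle_upload(file_bytes, filename, s3, sns, jobs_table,
                  bucket="media-uploads",  # hypothetical bucket name
                  topic_arn="arn:aws:sns:us-east-1:123456789012:media-uploaded"):  # hypothetical ARN
    """Store the upload, record a pending job, announce it, and return at once."""
    if not file_bytes:
        raise ValueError("empty upload")

    job_id = str(uuid.uuid4())
    key = f"uploads/{job_id}/{filename}"  # hypothetical key layout

    # 1. Put the file in S3.
    s3.put_object(Bucket=bucket, Key=key, Body=file_bytes)

    # 2. Persist the job so GET /jobs/:id has something to return.
    jobs_table.put_item(Item={"job_id": job_id, "status": "pending", "s3_key": key})

    # 3. Publish to SNS; everything downstream does the heavy work.
    sns.publish(TopicArn=topic_arn,
                Message=json.dumps({"job_id": job_id, "bucket": bucket, "key": key}))

    # 4. Return immediately, without waiting on MediaConvert.
    return {"job_id": job_id}
```

Passing the clients in as arguments keeps the handler easy to exercise with fakes; in a deployed function they would more likely be created once at module load.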

What made this possible without a contract break

The job ID was already being persisted in a DynamoDB table. The API returned it before processing was complete. Clients polled a GET /jobs/:id endpoint to check status.

We kept that endpoint, kept the job IDs, and changed only what writes to the status field. Before: the API wrote "completed" after MediaConvert returned. After: an EventBridge rule triggers a Lambda that writes "completed" when MediaConvert emits its completion event.
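
A sketch of what the completion side might look like. The event pattern matches MediaConvert's job state change events; the handler name, the table key, and the idea of carrying the API's job ID in the job's user metadata are assumptions, not something these notes confirm.

```python
# Event pattern for the EventBridge rule (a Python dict here; deployed as JSON).
COMPLETION_PATTERN = {
    "source": ["aws.mediaconvert"],
    "detail-type": ["MediaConvert Job State Change"],
    "detail": {"status": ["COMPLETE"]},
}


def handle_job_complete(event, jobs_table):
    """Write "completed" to the job record when MediaConvert reports COMPLETE."""
    detail = event["detail"]
    if detail.get("status") != "COMPLETE":
        return None  # the rule should already filter these out

    # Assumption: the job was submitted with UserMetadata={"job_id": ...},
    # so the API's own job ID comes back on the completion event.
    job_id = detail["userMetadata"]["job_id"]

    jobs_table.update_item(
        Key={"job_id": job_id},  # hypothetical key schema
        UpdateExpression="SET #s = :s",
        ExpressionAttributeNames={"#s": "status"},  # "status" is a DynamoDB reserved word
        ExpressionAttributeValues={":s": "completed"},
    )
    return job_id
```

A second rule (or a broader pattern) on the ERROR status would be the natural place to write a failed state, so jobs don't sit in pending forever.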

The client experience is identical. The API just became a lot faster.

Tradeoffs that were worth it

Observability got harder. A synchronous flow fails in one place. An async pipeline can fail in five. We added structured logging at each Lambda boundary and built a dead-letter queue dashboard to surface stuck jobs.
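
For illustration, structured logging at a Lambda boundary can be as small as one JSON line per stage, keyed by job ID so a stuck job can be traced across functions. The helper name and field names below are hypothetical, not the ones we actually used.

```python
import json
import sys
import time


def log_event(stage, job_id, outcome, stream=sys.stdout, **fields):
    """Emit one JSON line per Lambda boundary crossing and return the record."""
    record = {
        "ts": time.time(),
        "stage": stage,      # e.g. "validate", "transcode", "update_job_record"
        "job_id": job_id,
        "outcome": outcome,  # e.g. "started", "ok", "error"
        **fields,
    }
    stream.write(json.dumps(record) + "\n")
    return record
```

Calling `log_event("transcode", job_id, "error", reason="throttled")` at the failure site makes each of the five possible failure points searchable by the same `job_id` field.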

Local development got harder. You can't run this on your laptop without mocking SNS, SQS, and EventBridge. We used LocalStack for integration tests and kept unit tests focused on the Lambda handlers in isolation.
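
One way to point the same code at LocalStack during integration tests is to build the client arguments from the environment. This is a sketch: the `LOCALSTACK_ENDPOINT` variable name and the helper are hypothetical, while `endpoint_url` is a standard `boto3.client` argument.

```python
import os


def aws_client_kwargs(env=None):
    """Build keyword arguments for boto3.client(); target LocalStack when configured."""
    env = os.environ if env is None else env
    kwargs = {"region_name": env.get("AWS_REGION", "us-east-1")}
    endpoint = env.get("LOCALSTACK_ENDPOINT")  # hypothetical variable name
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs


# Usage in an integration test (needs boto3 and a running LocalStack):
#   import boto3
#   sqs = boto3.client("sqs", **aws_client_kwargs(
#       {"LOCALSTACK_ENDPOINT": "http://localhost:4566"}))
```

Unit tests never need this: if the handlers accept their clients as arguments, a fake object with the right method names is enough.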

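Retry logic moved into the infrastructure. SQS redelivers a message when its visibility timeout expires before the consumer deletes it, and a redrive policy caps the number of attempts before the message lands in the dead-letter queue. That's good, since it's one less thing to code, but it also means you need to understand SQS visibility timeouts and what happens when a Lambda crashes mid-process.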
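
To make the retry behavior concrete, here is a sketch of an SQS-triggered handler that reports only the failed messages back to Lambda, so the rest of the batch isn't redelivered. It assumes the event source mapping has ReportBatchItemFailures enabled; `handle_sqs_batch` and the `process` callback are hypothetical names.

```python
import json


def handle_sqs_batch(event, process):
    """Process each SQS record; report only the failed ones for redelivery."""
    failures = []
    for record in event.get("Records", []):
        try:
            envelope = json.loads(record["body"])
            # With SNS -> SQS and raw message delivery off, the payload sits in "Message".
            payload = json.loads(envelope["Message"]) if "Message" in envelope else envelope
            process(payload)
        except Exception:
            # Anything listed here becomes visible again after the visibility
            # timeout and is retried, up to the queue's maxReceiveCount.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

If the function crashes outright instead of returning, every message in the batch reappears once the visibility timeout expires, so `process` needs to be safe to run twice for the same job, and the timeout should be longer than the function's own timeout.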

What I'd do differently

The SNS-to-SQS fan-out was added early in anticipation of multiple consumers. We never got a second consumer. In hindsight, we could have gone straight from S3 events to SQS and avoided the SNS layer. The architecture should earn its complexity.
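
For comparison, a sketch of what the consumer would parse in the simpler S3-to-SQS design, where the SQS message body is the S3 event notification itself. The helper name is hypothetical, and this is not code we shipped.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_record(record):
    """Extract (bucket, key) pairs from an S3 event notification delivered via SQS."""
    body = json.loads(record["body"])
    objects = []
    # "Records" is absent on the s3:TestEvent message S3 sends when the
    # notification is first configured, so default to an empty list.
    for entry in body.get("Records", []):
        bucket = entry["s3"]["bucket"]["name"]
        key = unquote_plus(entry["s3"]["object"]["key"])  # keys arrive URL-encoded
        objects.append((bucket, key))
    return objects
```

The catch with this design is that the consumer only sees a bucket and a key, so the job ID would have to be recoverable from the object key or its metadata rather than from a message the API composes.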