Scaling AI Products on AWS Bedrock: Getting Past the Quota Wall
If you've built anything non-trivial on AWS Bedrock, you've probably hit the quota wall. As a single developer working on Brain — a self-hosted AI assistant using Claude models via Bedrock — I can blow through the daily token limits in a few heavy sessions. That raises an obvious question: how could anyone build a real product on top of these limits?
The short answer is that the on-demand quotas are essentially dev/prototype tier. AWS expects production workloads to use different mechanisms entirely.
The Problem
Here's what you get out of the box with Claude Opus 4.6 on Bedrock (on-demand, cross-region):
| Quota | Value | Adjustable? |
|---|---|---|
| Requests per minute | 25 | Yes |
| Tokens per minute | 2,000,000 | Yes |
| Daily token cap | 2,592,000 | No |
You can request increases for the per-minute limits, but the daily token cap is fixed and non-negotiable. At ~2.6 million tokens per day, a single power user running complex coding sessions can exhaust this in an afternoon.
Sonnet fares better with a 3.6 billion daily token cap — roughly 1,400x more headroom — but if your product needs frontier-level reasoning, you're stuck.
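To make that headroom concrete, a quick back-of-envelope sketch (the cap figures come from the table above; the per-session token count is my own assumption):

```python
# Back-of-envelope: how many heavy sessions fit under each daily cap?
OPUS_DAILY_CAP = 2_592_000        # tokens/day (from the quota table above)
SONNET_DAILY_CAP = 3_600_000_000  # tokens/day

# Assumption: a "heavy" coding session burns ~500k tokens
# (long context, many turns, tool results echoed back each turn).
TOKENS_PER_HEAVY_SESSION = 500_000

opus_sessions = OPUS_DAILY_CAP // TOKENS_PER_HEAVY_SESSION      # 5 sessions/day
sonnet_sessions = SONNET_DAILY_CAP // TOKENS_PER_HEAVY_SESSION  # 7,200 sessions/day
headroom = SONNET_DAILY_CAP / OPUS_DAILY_CAP                    # ~1,389x

print(opus_sessions, sonnet_sessions, round(headroom))
```

Five heavy sessions per day, total, across your whole user base: that is the entire Opus budget.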
Strategy 1: Provisioned Throughput
This is what AWS actually wants you to buy for production. You purchase "model units" that give you dedicated capacity with no RPM, TPM, or daily caps. You pay hourly whether you use it or not, but you get predictable, uncapped throughput.
The pricing is steep — think thousands per month per model unit — but it's the only way to remove the daily ceiling entirely. You can auto-scale model units based on demand.
If you have paying users justifying the cost, this is the straightforward path.
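As a sketch, the purchase itself is one boto3 call to `create_provisioned_model_throughput`; the name, model ID, and unit count below are placeholders:

```python
def provisioned_throughput_request(name: str, model_id: str, model_units: int) -> dict:
    """Parameters for bedrock.create_provisioned_model_throughput.

    Omitting commitmentDuration gives no-commitment hourly billing; passing
    "OneMonth" or "SixMonths" trades a commitment for a lower hourly rate.
    """
    return {
        "provisionedModelName": name,
        "modelId": model_id,
        "modelUnits": model_units,
    }

params = provisioned_throughput_request("brain-chat", "MODEL_ID_PLACEHOLDER", 1)
# With credentials in place, the actual purchase would look like:
#   bedrock = boto3.client("bedrock")
#   bedrock.create_provisioned_model_throughput(**params)
```

Once created, you invoke the provisioned model by its ARN instead of the on-demand model ID, and the shared quotas no longer apply to that traffic.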
Strategy 2: Intelligent Model Routing
Not every request needs your most expensive model. A well-designed routing layer can send the majority of traffic to cheaper, higher-quota models:
- Haiku for classification, routing, and simple Q&A
- Sonnet for most conversational turns (with its vastly higher daily cap)
- Opus only for complex reasoning, code generation, and planning
In Brain, I route memory extraction and context compaction to Sonnet, and title generation to Haiku. The primary chat gets Opus. Most production systems send less than 10% of traffic to the top-tier model.
A lightweight classifier — even a fine-tuned Haiku — can decide which tier handles each request automatically.
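The tiering above can be sketched as a simple dispatcher. The keyword rules here are a cheap stand-in for that classifier, and the task labels and length threshold are my assumptions:

```python
def route(task: str, prompt: str) -> str:
    """Pick a model tier for a request. Heuristic stand-in for a learned classifier."""
    # Cheap, high-quota tier for mechanical tasks.
    if task in {"classification", "routing", "title_generation", "simple_qa"}:
        return "haiku"
    # Reserve the top tier for genuinely hard work.
    if task in {"code_generation", "planning"} or len(prompt) > 20_000:
        return "opus"
    # Default tier: Sonnet, with its ~1,400x larger daily cap.
    return "sonnet"

print(route("title_generation", "Summarize this chat in 5 words"))  # haiku
print(route("chat", "What's on my calendar today?"))                # sonnet
```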
Strategy 3: Prompt Caching
Bedrock supports prompt caching. If your system prompt and tool definitions are large (Brain's are), you're paying for and counting those tokens on every single turn. Marking that static prefix as cacheable dramatically reduces token consumption for a few lines of request-building code.
This is the lowest-effort, highest-impact optimization for most applications.
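As a sketch, here is the request shape for the Converse API with a cache checkpoint after the static system prompt. Exact cache support varies by model, so treat the field layout as illustrative:

```python
def converse_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Build a Converse API request body with a cache checkpoint.

    The large static system prompt sits before the cachePoint block, so
    repeat calls within the cache window read it from the cache instead of
    reprocessing (and re-billing) it in full.
    """
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},              # large static prefix
            {"cachePoint": {"type": "default"}},  # everything above is cached
        ],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }

req = converse_request("MODEL_ID_PLACEHOLDER", "You are Brain, an assistant...", "hi")
# With credentials: boto3.client("bedrock-runtime").converse(**req)
```

Tool definitions can get the same treatment with a `cachePoint` entry at the end of the tool list.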
Strategy 4: Multi-Account Distribution
Each AWS account gets its own quotas. Within an AWS Organization, you can:
- Provision multiple member accounts, each with its own daily Opus cap
- Shard users across those accounts
- Use a routing layer (API Gateway + Lambda) to distribute traffic
This is a common pattern for startups that haven't committed to provisioned throughput pricing yet. It's not elegant, but it works.
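A minimal sketch of the sharding step, with placeholder account identifiers. Hashing on the user ID keeps each user pinned to one account, so their traffic always draws from the same daily cap:

```python
import hashlib

# Placeholder identifiers, one per member account in the Organization.
ACCOUNT_PROFILES = ["brain-prod-a", "brain-prod-b", "brain-prod-c"]

def account_for_user(user_id: str) -> str:
    """Pin each user to one account via a stable hash.

    Stable (unlike Python's built-in hash(), which is salted per process),
    so the same user lands on the same account across restarts.
    """
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return ACCOUNT_PROFILES[digest % len(ACCOUNT_PROFILES)]
```

The real routing layer would then map the chosen account to cross-account credentials (e.g. an assumed role) before calling Bedrock.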
Strategy 5: Self-Hosted Models on SageMaker
For tasks that don't need frontier intelligence, deploy open-source models on SageMaker endpoints:
- Llama 3, Mistral, or similar models
- Zero quota limits — bounded only by the instances you provision
- Ideal for embeddings, summarization, memory extraction, and compaction
- Keep Bedrock for the primary chat where model quality matters most
Moving background tasks to self-hosted models frees up your entire Bedrock quota for actual user interaction.
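As a sketch of the self-hosted side: a request-body builder for a hypothetical summarization endpoint. The payload schema follows the common Hugging Face TGI serving format, and the endpoint name is a placeholder:

```python
def summarize_payload(text: str, max_new_tokens: int = 256) -> dict:
    """Request body for a text-generation endpoint hosted on SageMaker.

    Assumes the Hugging Face TGI container schema; adjust to match
    whatever serving container you actually deploy.
    """
    return {
        "inputs": f"Summarize the following:\n\n{text}",
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.2},
    }

# Invoking it looks roughly like (endpoint name is a placeholder):
#   runtime = boto3.client("sagemaker-runtime")
#   resp = runtime.invoke_endpoint(
#       EndpointName="brain-summarizer",
#       ContentType="application/json",
#       Body=json.dumps(summarize_payload(chunk_text)),
#   )
```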
Strategy 6: Request Queuing
Put an SQS queue in front of non-interactive Bedrock calls:
- Smooth out burst traffic to stay within RPM limits
- Background tasks (memory processing, indexing) run off-peak
- Users get streaming responses for interactive chat
- Everything else gets queued and processed when capacity is available
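The consumer-side throttle can be sketched as follows, assuming the queue worker calls `acquire()` before every Bedrock request so dequeued bursts never exceed the RPM quota:

```python
import time

class RpmLimiter:
    """Evenly spaces requests so a burst drained from SQS stays under RPM limits."""

    def __init__(self, requests_per_minute: int, clock=time.monotonic):
        self.interval = 60.0 / requests_per_minute  # seconds between request slots
        self.clock = clock
        self.next_slot = clock()

    def acquire(self) -> float:
        """Block until the next request slot is free; return the time waited."""
        now = self.clock()
        wait = max(0.0, self.next_slot - now)
        if wait:
            time.sleep(wait)
        self.next_slot = max(now, self.next_slot) + self.interval
        return wait
```

With the 25 RPM on-demand quota from the table above, `RpmLimiter(25)` spaces calls 2.4 seconds apart regardless of how many messages arrive at once.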
The Realistic Path
If I were taking Brain to market today, the progression would look like this:
1. Prompt caching — immediate win, minimal code changes
2. Move background tasks to SageMaker — frees Bedrock quota for chat
3. Aggressive Sonnet routing — Opus only when genuinely needed
4. Multi-account distribution — bridge solution while growing
5. Provisioned throughput — once revenue justifies the cost
The Uncomfortable Truth
The daily cap being non-adjustable is the tell. It's not a technical limitation — it's a business one. The on-demand limits are a pricing funnel. AWS makes them tight enough that any real product has to either buy provisioned throughput or get creative with the strategies above.
That doesn't make Bedrock a bad choice. The managed infrastructure, the model variety, and the AWS ecosystem integration are genuinely valuable. You just need to go in with eyes open about what "on-demand" actually means at scale.
