Scaling AI Products on AWS Bedrock: Getting Past the Quota Wall
If you've built anything non-trivial on AWS Bedrock, you've probably hit the quota wall. As a single developer working on Brain — a self-hosted AI assistant using Claude models via Bedrock — I can blow through the daily token limits in a few heavy sessions. That raises an obvious question: how could anyone build a real product on top of these limits?
The short answer is that the on-demand quotas are essentially dev/prototype tier. AWS expects production workloads to use different mechanisms entirely.
The Problem
Here's what you get out of the box with Claude Opus 4.6 on Bedrock (on-demand, cross-region):
| Quota | Value | Adjustable? |
|---|---|---|
| Requests per minute | 25 | Yes |
| Tokens per minute | 2,000,000 | Yes |
| Daily token cap | 2,592,000 | No |
You can request increases for the per-minute limits, but the daily token cap is fixed and non-negotiable. At ~2.6 million tokens per day, a single power user running complex coding sessions can exhaust this in an afternoon.
Sonnet fares better with a 3.6 billion daily token cap — roughly 1,400x more headroom — but if your product needs frontier-level reasoning, you're stuck.
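To make that headroom concrete, a quick back-of-envelope sketch (the cap figures come from the table above; the per-session token count is my own assumption):

```python
# Back-of-envelope: how many heavy sessions fit under each daily cap?
OPUS_DAILY_CAP = 2_592_000        # tokens/day (from the quota table above)
SONNET_DAILY_CAP = 3_600_000_000  # tokens/day

# Assumption: a "heavy" coding session burns ~500k tokens
# (long context, many turns, tool results echoed back each turn).
TOKENS_PER_HEAVY_SESSION = 500_000

opus_sessions = OPUS_DAILY_CAP // TOKENS_PER_HEAVY_SESSION      # 5 sessions/day
sonnet_sessions = SONNET_DAILY_CAP // TOKENS_PER_HEAVY_SESSION  # 7,200 sessions/day
headroom = SONNET_DAILY_CAP / OPUS_DAILY_CAP                    # ~1,389x

print(opus_sessions, sonnet_sessions, round(headroom))
```

Five heavy sessions per day, total, across your whole user base: that is the entire Opus budget.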
Strategy 1: Provisioned Throughput
This is what AWS actually wants you to buy for production. You purchase "model units" that give you dedicated capacity with no RPM, TPM, or daily caps. You pay hourly whether you use it or not, but you get predictable, uncapped throughput.
The pricing is steep — think thousands per month per model unit — but it's the only way to remove the daily ceiling entirely. You can auto-scale model units based on demand.
If you have paying users justifying the cost, this is the straightforward path.
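As a sketch, the purchase itself is one boto3 call to `create_provisioned_model_throughput`; the name, model ID, and unit count below are placeholders:

```python
def provisioned_throughput_request(name: str, model_id: str, model_units: int) -> dict:
    """Parameters for bedrock.create_provisioned_model_throughput.

    Omitting commitmentDuration gives no-commitment hourly billing; passing
    "OneMonth" or "SixMonths" trades a commitment for a lower hourly rate.
    """
    return {
        "provisionedModelName": name,
        "modelId": model_id,
        "modelUnits": model_units,
    }

params = provisioned_throughput_request("brain-chat", "MODEL_ID_PLACEHOLDER", 1)
# With credentials in place, the actual purchase would look like:
#   bedrock = boto3.client("bedrock")
#   bedrock.create_provisioned_model_throughput(**params)
```

Once created, you invoke the provisioned model by its ARN instead of the on-demand model ID, and the shared quotas no longer apply to that traffic.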
Strategy 2: Intelligent Model Routing
Not every request needs your most expensive model. A well-designed routing layer can send the majority of traffic to cheaper, higher-quota models:
- Haiku for classification, routing, and simple Q&A
- Sonnet for most conversational turns (with its vastly higher daily cap)
- Opus only for complex reasoning, code generation, and planning
In Brain, I route memory extraction and context compaction to Sonnet, and title generation to Haiku. The primary chat gets Opus. Most production systems send less than 10% of traffic to the top-tier model.
A lightweight classifier — even a fine-tuned Haiku — can decide which tier handles each request automatically.
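The tiering above can be sketched as a simple dispatcher. The keyword rules here are a cheap stand-in for that classifier, and the task labels and length threshold are my assumptions:

```python
def route(task: str, prompt: str) -> str:
    """Pick a model tier for a request. Heuristic stand-in for a learned classifier."""
    # Cheap, high-quota tier for mechanical tasks.
    if task in {"classification", "routing", "title_generation", "simple_qa"}:
        return "haiku"
    # Reserve the top tier for genuinely hard work.
    if task in {"code_generation", "planning"} or len(prompt) > 20_000:
        return "opus"
    # Default tier: Sonnet, with its ~1,400x larger daily cap.
    return "sonnet"

print(route("title_generation", "Summarize this chat in 5 words"))  # haiku
print(route("chat", "What's on my calendar today?"))                # sonnet
```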
Strategy 3: Prompt Caching
Bedrock supports prompt caching. If your system prompt and tool definitions are large (Brain's are), you're paying for and counting those tokens on every single turn. Marking that static prefix as cacheable dramatically reduces token consumption for a few lines of request-building code.
This is the lowest-effort, highest-impact optimization for most applications.
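As a sketch, here is the request shape for the Converse API with a cache checkpoint after the static system prompt. Exact cache support varies by model, so treat the field layout as illustrative:

```python
def converse_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Build a Converse API request body with a cache checkpoint.

    The large static system prompt sits before the cachePoint block, so
    repeat calls within the cache window read it from the cache instead of
    reprocessing (and re-billing) it in full.
    """
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},              # large static prefix
            {"cachePoint": {"type": "default"}},  # everything above is cached
        ],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }

req = converse_request("MODEL_ID_PLACEHOLDER", "You are Brain, an assistant...", "hi")
# With credentials: boto3.client("bedrock-runtime").converse(**req)
```

Tool definitions can get the same treatment with a `cachePoint` entry at the end of the tool list.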
Strategy 4: Multi-Account Distribution
Each AWS account gets its own quotas. Within an AWS Organization, you can:
- Provision multiple member accounts, each with its own daily Opus cap
- Shard users across those accounts
- Use a routing layer (API Gateway + Lambda) to distribute traffic
This is a common pattern for startups that haven't committed to provisioned throughput pricing yet. It's not elegant, but it works.
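A minimal sketch of the sharding step, with placeholder account identifiers. Hashing on the user ID keeps each user pinned to one account, so their traffic always draws from the same daily cap:

```python
import hashlib

# Placeholder identifiers, one per member account in the Organization.
ACCOUNT_PROFILES = ["brain-prod-a", "brain-prod-b", "brain-prod-c"]

def account_for_user(user_id: str) -> str:
    """Pin each user to one account via a stable hash.

    Stable (unlike Python's built-in hash(), which is salted per process),
    so the same user lands on the same account across restarts.
    """
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return ACCOUNT_PROFILES[digest % len(ACCOUNT_PROFILES)]
```

The real routing layer would then map the chosen account to cross-account credentials (e.g. an assumed role) before calling Bedrock.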
Strategy 5: Self-Hosted Models on SageMaker
For tasks that don't need frontier intelligence, deploy open-source models on SageMaker endpoints:
- Llama 3, Mistral, or similar models
- Zero quota limits — bounded only by the instances you provision
- Ideal for embeddings, summarization, memory extraction, and compaction
- Keep Bedrock for the primary chat where model quality matters most
Moving background tasks to self-hosted models frees up your entire Bedrock quota for actual user interaction.
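As a sketch of the self-hosted side: a request-body builder for a hypothetical summarization endpoint. The payload schema follows the common Hugging Face TGI serving format, and the endpoint name is a placeholder:

```python
def summarize_payload(text: str, max_new_tokens: int = 256) -> dict:
    """Request body for a text-generation endpoint hosted on SageMaker.

    Assumes the Hugging Face TGI container schema; adjust to match
    whatever serving container you actually deploy.
    """
    return {
        "inputs": f"Summarize the following:\n\n{text}",
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.2},
    }

# Invoking it looks roughly like (endpoint name is a placeholder):
#   runtime = boto3.client("sagemaker-runtime")
#   resp = runtime.invoke_endpoint(
#       EndpointName="brain-summarizer",
#       ContentType="application/json",
#       Body=json.dumps(summarize_payload(chunk_text)),
#   )
```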
Strategy 6: Request Queuing
Put an SQS queue in front of non-interactive Bedrock calls:
- Smooth out burst traffic to stay within RPM limits
- Background tasks (memory processing, indexing) run off-peak
- Users get streaming responses for interactive chat
- Everything else gets queued and processed when capacity is available
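The consumer-side throttle can be sketched as follows, assuming the queue worker calls `acquire()` before every Bedrock request so dequeued bursts never exceed the RPM quota:

```python
import time

class RpmLimiter:
    """Evenly spaces requests so a burst drained from SQS stays under RPM limits."""

    def __init__(self, requests_per_minute: int, clock=time.monotonic):
        self.interval = 60.0 / requests_per_minute  # seconds between request slots
        self.clock = clock
        self.next_slot = clock()

    def acquire(self) -> float:
        """Block until the next request slot is free; return the time waited."""
        now = self.clock()
        wait = max(0.0, self.next_slot - now)
        if wait:
            time.sleep(wait)
        self.next_slot = max(now, self.next_slot) + self.interval
        return wait
```

With the 25 RPM on-demand quota from the table above, `RpmLimiter(25)` spaces calls 2.4 seconds apart regardless of how many messages arrive at once.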
The Realistic Path
If I were taking Brain to market today, the progression would look like this:
1. Prompt caching — immediate win, minimal code changes
2. Move background tasks to SageMaker — frees Bedrock quota for chat
3. Aggressive Sonnet routing — Opus only when genuinely needed
4. Multi-account distribution — bridge solution while growing
5. Provisioned throughput — once revenue justifies the cost
The Uncomfortable Truth
The daily cap being non-adjustable is the tell. It's not a technical limitation — it's a business one. The on-demand limits are a pricing funnel. AWS makes them tight enough that any real product has to either buy provisioned throughput or get creative with the strategies above.
That doesn't make Bedrock a bad choice. The managed infrastructure, the model variety, and the AWS ecosystem integration are genuinely valuable. You just need to go in with eyes open about what "on-demand" actually means at scale.
