Model routing across Bedrock and Foundry Models

The policy kernel emits a tier, not a model id. Dispatch resolves the tier to a concrete model on a concrete provider, per tenant. Bedrock and Foundry Models are peers; failover is a config knob.

Tier → provider → model

The same logical tier names route to different concrete models on each cloud. This table is the per-tenant default; tenant policy can override on a per-call basis.

| Tier | Bedrock model id | Foundry deployment | Failover order |
| --- | --- | --- | --- |
| haiku | anthropic.claude-haiku-4-5-20251001-v1:0 | claude-haiku-4-5 | Bedrock → Foundry → Anthropic direct |
| sonnet | anthropic.claude-sonnet-4-6 (inference profile) | claude-sonnet-4-6 | Bedrock → Foundry → Anthropic direct |
| opus | anthropic.claude-opus-4-7 (inference profile) | claude-opus-4-6 | Bedrock → Foundry → Anthropic direct |
| gpt-fast | openai.gpt-oss-120b-1:0 | gpt-4.1-mini · gpt-oss-120b | Foundry → OpenAI direct |
| gpt-reason | (none) | o4 · gpt-5 family | Foundry → OpenAI direct |
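
A minimal sketch of how dispatch might resolve a tier to a concrete provider and model, with a per-tenant override. The names (`resolve_tier`, `DEFAULT_ROUTES`, `TENANT_OVERRIDES`, the tenant `"acme"`) are illustrative, not the real dispatch API; the model ids mirror the table above.

```python
# Hypothetical routing table: tier -> per-provider deployment + failover order.
DEFAULT_ROUTES = {
    "haiku": {
        "bedrock": "anthropic.claude-haiku-4-5-20251001-v1:0",
        "foundry": "claude-haiku-4-5",
        "failover": ["bedrock", "foundry", "anthropic"],
    },
    "gpt-fast": {
        "foundry": "gpt-4.1-mini",
        "failover": ["foundry", "openai"],
    },
}

# Tenant policy overrides a route without touching code -- e.g. a tenant
# pinned Foundry-first for data-residency reasons.
TENANT_OVERRIDES = {
    "acme": {"haiku": {"failover": ["foundry", "bedrock", "anthropic"]}},
}

def resolve_tier(tier: str, tenant: str) -> tuple[str, str]:
    """Return (provider, model_id) for the first configured provider in failover order."""
    route = {**DEFAULT_ROUTES[tier], **TENANT_OVERRIDES.get(tenant, {}).get(tier, {})}
    for provider in route["failover"]:
        model = route.get(provider)
        if model is not None:  # providers without a configured deployment are skipped
            return provider, model
    raise LookupError(f"no provider configured for tier {tier!r}")
```

Because the override merges at the tier level, "swap providers for one tenant" is a one-line config change, which is the property the failover column is encoding.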

Two surfaces, peer capabilities

| Dimension | AWS Bedrock | Azure AI Foundry Models |
| --- | --- | --- |
| Surface | One control plane (bedrock) + one data plane (bedrock-runtime) | One Foundry resource (Microsoft.CognitiveServices/accounts, kind AIServices) |
| Unified API | Converse / ConverseStream normalizes prompts + tool schemas across providers | OpenAI-compatible /openai/v1/chat/completions and /embeddings |
| Tool schemas | Normalized via toolConfig.tools[].toolSpec across Claude, Cohere, Llama, Nova | Provider-native inside Responses/Chat Completions; Maestro adapts at the boundary |
| Pricing tiers | On-demand · Provisioned Throughput · Batch (−50%) · Priority (+75%) · Flex (−50%) · Reserved | MaaS pay-per-token · PTU (Global / Data Zone / Regional Provisioned) · Managed Compute (dedicated GPU) |
| Cross-region | Cross-Region Inference profiles: Geographic (us./eu./apac.) and Global (≈ −10% vs Geographic), no routing surcharge | Per-deployment regional config; Data Zone PTU spans a geography |
| Cost attribution | Application Inference Profiles for per-app tagging; IAM-principal allocation in CUR 2.0 | Azure Cost Management; partner models via Marketplace, MACC-eligible |
| Guardrails | Bedrock Guardrails inline via guardrailConfig; PII + denied topics + contextual grounding + Automated Reasoning | Azure AI Content Safety at the deployment; six categories, togglable severity |
| RAG primitive | Knowledge Bases over OpenSearch Serverless / Neptune / Kendra GenAI | Azure AI Search hybrid; Foundry Knowledge tools |
| Auth | IAM SigV4 · IRSA | Entra ID (DefaultAzureCredential) · Managed Identity · API key |
| Private network | PrivateLink (com.amazonaws.&lt;region&gt;.bedrock(-runtime)); IAM gates with aws:sourceVpce | Azure Private Link, VNet, public-access flag |
| Latency levers | Latency-Optimized Inference (Haiku 3.5, Llama 70B/405B, Nova Pro); Intelligent Prompt Routing within a family; prompt caching via cachePoint | PTU deployment routing; Foundry Intelligent Prompt Routing (OpenAI family) |
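
The two data-plane request shapes differ in small but load-bearing ways; a sketch of the payloads dispatch would build for each surface. The function names are illustrative, and the shapes follow the public APIs (Bedrock Converse takes typed content blocks, the OpenAI-compatible route takes flat string content):

```python
def bedrock_converse_request(model_id: str, prompt: str) -> dict:
    # Body for a bedrock-runtime Converse call: content is a list of typed blocks.
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 1024},
    }

def foundry_chat_request(deployment: str, prompt: str) -> dict:
    # Body for POST {endpoint}/openai/v1/chat/completions: flat string content,
    # deployment name in the "model" field.
    return {
        "model": deployment,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }
```

Keeping these shapes behind one builder per provider is what lets the tier table above stay a pure data problem.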

Cost levers the kernel can pull

  • Bedrock Cross-Region Inference (CRIS). Geographic and Global profiles. Global ≈ 10% cheaper than Geographic with no routing surcharge. Set per-call when latency budget allows.
  • Bedrock service tiers. default | priority (+75%) | flex (−50%) | reserved. Deferred ticks go to flex; user-waiting ticks go to priority.
  • Bedrock latency-optimized inference. Haiku 3.5, Llama 70B/405B, Nova Pro. Used for mode=sync ticks where p50 matters.
  • Bedrock prompt caching via cachePoint blocks. Field manifests that don't change tick-over-tick are obvious caching candidates.
  • Foundry PTU. Global / Data Zone / Regional Provisioned. For tenants with steady predictable load — Cargill at scale — PTU at the right zone amortizes well below per-token MaaS.
  • Foundry Intelligent Prompt Routing within the OpenAI family. Used when the tier is gpt-fast and the kernel hasn't committed to a specific deployment.
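
The lever choices above can be sketched as a pure function of a tick's mode and latency budget. The field names, thresholds, and the function itself are hypothetical; only the lever semantics (flex for deferred work, priority plus latency-optimized inference when a user is waiting, the Global CRIS discount) come from the bullets above.

```python
def pick_levers(mode: str, deferred: bool, latency_budget_ms: int) -> dict:
    """Map a tick's urgency to Bedrock cost levers (illustrative thresholds)."""
    levers = {"service_tier": "default", "cris_profile": None, "latency_optimized": False}
    if deferred:
        levers["service_tier"] = "flex"        # -50%: deferred ticks tolerate queueing
    elif mode == "sync" and latency_budget_ms < 500:
        levers["service_tier"] = "priority"    # +75%: a user is waiting on this tick
        levers["latency_optimized"] = True     # p50 matters, eligible models only
    if latency_budget_ms >= 2000:
        levers["cris_profile"] = "global"      # ~10% cheaper than Geographic routing
    return levers
```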

Tool-schema normalization

Bedrock's Converse API normalizes toolConfig.tools[].toolSpec across providers; Foundry's Responses API has a different shape. The dispatch layer (core/maestro/dispatch/) exposes a single internal tool-call ABI; to_provider_tool_schema() and from_provider_tool_call() adapt at the boundary. Every provider is conformance-tested against the same suite — same inputs, same expected toolUse envelope out.
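
A minimal sketch of what the two boundary adapters could look like, assuming an internal ABI where a tool is (name, description, JSON-schema params). The `Tool` dataclass and the exact internal shape are assumptions; the `toolSpec` / `toolUse` envelopes follow the Converse API, and the `function` envelope follows the OpenAI-compatible surface.

```python
import json
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    params: dict  # JSON schema for the tool's input object

def to_provider_tool_schema(tool: Tool, provider: str) -> dict:
    """Project the internal tool ABI into a provider-native schema."""
    if provider == "bedrock":
        return {"toolSpec": {
            "name": tool.name,
            "description": tool.description,
            "inputSchema": {"json": tool.params},
        }}
    # OpenAI-compatible surface (Foundry)
    return {"type": "function", "function": {
        "name": tool.name,
        "description": tool.description,
        "parameters": tool.params,
    }}

def from_provider_tool_call(raw: dict, provider: str) -> tuple[str, dict]:
    """Normalize a provider tool call back into (tool_name, arguments)."""
    if provider == "bedrock":
        block = raw["toolUse"]            # Converse returns arguments pre-parsed
        return block["name"], block["input"]
    fn = raw["function"]                  # OpenAI-style arguments arrive as a JSON string
    return fn["name"], json.loads(fn["arguments"])
```

The conformance suite mentioned above then only has to assert one thing per provider: the same `Tool` in, the same `(name, arguments)` pair out.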

Governance and guardrails

  • Bedrock Guardrails applied inline on the Converse call (guardrailConfig) for Bedrock-routed traffic. PII, denied topics, contextual grounding, Automated Reasoning checks.
  • Azure AI Content Safety filters at the Foundry deployment for Foundry-routed traffic. Six categories with togglable severity.
  • Provider verdicts are normalized into a single safety_outcome enum on the model_call event so the audit trail reads the same regardless of route.
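
A sketch of that normalization, assuming simplified verdict shapes on each side. The enum values, the `"categories"` severity map, and the severity threshold are illustrative assumptions, not the real `model_call` event schema; `GUARDRAIL_INTERVENED` is the intervention signal Bedrock Guardrails reports.

```python
from enum import Enum

class SafetyOutcome(Enum):
    PASSED = "passed"
    BLOCKED = "blocked"
    MASKED = "masked"   # e.g. PII redacted but the call proceeded

def normalize_verdict(provider: str, verdict: dict) -> SafetyOutcome:
    """Collapse provider-specific safety verdicts into one enum for the audit trail."""
    if provider == "bedrock":
        action = verdict.get("action", "NONE")
        return {"GUARDRAIL_INTERVENED": SafetyOutcome.BLOCKED,
                "ANONYMIZED": SafetyOutcome.MASKED}.get(action, SafetyOutcome.PASSED)
    # Content Safety style: per-category severities; block above a threshold.
    severities = verdict.get("categories", {}).values()
    return SafetyOutcome.BLOCKED if any(s >= 4 for s in severities) else SafetyOutcome.PASSED
```

The point is that downstream audit queries filter on `safety_outcome` alone and never branch on which cloud served the call.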

Why this is the right shape

The kernel sits above the providers. Adding a third inference surface — or swapping Claude on Bedrock for Claude on Foundry for a given tenant — is a config change, not a code change. The architectural bet is that the model market commoditizes faster than control planes do; the value lives in continuations, the field, and the kernel — not in the model vendor.
