Type: Feature — Infrastructure + Substrate
Priority: P0 (blocks P2-006 of maestro-build-plan.yaml)
Branch: feature/foundry-tiered-dispatch
Depends on: P1-001 (Substrate interface), P1-002 (Continuation data model)
Blocks: P2-005 (Policy Kernel v0), P2-006 (Model dispatch), P2-007 (Budget enforcement)
Maestro’s Policy Kernel routes every continuation tick through a tiered model decision: opus for planning/reasoning, sonnet for synthesis/drafting, haiku for classification/extraction. The current BarnOS OpusService calls a single Claude model through Azure Foundry. This task extends the infrastructure to support all three tiers — plus the Anthropic Advisor Strategy — through the Foundry proxy endpoint, without using the Foundry Agent SDK’s thread-based pattern.
Foundry Agents use a thread lifecycle: createThread → addMessage → createRun → poll → getMessages → deleteThread. This is chat-shaped — it assumes request/response cycles that complete in seconds. Maestro continuations run for weeks. The continuation store owns session state; model calls must be stateless ticks. Each tick is a fresh Messages API call with context assembled from the continuation’s state and field manifest.
The Foundry proxy endpoint (https://{resource}.services.ai.azure.com/anthropic/v1/messages) exposes the raw Anthropic Messages API format. This is the correct integration point for Maestro — it gives us full API surface (tool_use, extended thinking, prompt caching, Advisor Strategy) while keeping billing inside Azure Marketplace.
┌───────────────── Traditional Agent Session ──────────────────┐
│ │
│ Model Provider owns the session: │
│ Thread created → messages accumulate → thread deleted │
│ Duration: minutes. State: inside the provider. │
│ Provider outage = session lost. │
│ │
└───────────────────────────────────────────────────────────────┘
┌───────────────── Maestro Continuation ───────────────────────┐
│ │
│ Continuation Store owns the session: │
│ Continuation persisted in Cosmos/DynamoDB with full state. │
│ Each tick: assemble context → stateless model call → persist │
│ Duration: weeks. State: in the continuation store. │
│ Provider outage = retry on next tick or failover to backup. │
│ │
│ Model call is a PURE FUNCTION: │
│ f(goal_frame, field_manifest, state, tools) → response │
│ │
│ The model has no memory of prior ticks. │
│ The continuation store has all the memory. │
│ The field manifest has all the context. │
│ The model just reasons over what it's given. │
│ │
└───────────────────────────────────────────────────────────────┘
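The pure-function framing in the diagram above can be sketched as a request builder that reads only continuation data. This is a minimal illustration, not the Maestro implementation: `TickInput` and `assemble_request` are hypothetical names, and the layout of the user message is an assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TickInput:
    """Everything one tick needs. All names here are hypothetical."""
    goal_frame: str                # what the continuation is trying to achieve
    field_manifest: list[str]      # ranked context snippets for this tick
    state: dict                    # persisted continuation state
    tools: list[dict] = field(default_factory=list)


def assemble_request(tick: TickInput) -> dict:
    """Build a Messages-API-style payload from continuation data only.

    No provider-side session is consulted, so the same input always
    yields the same request.
    """
    context = "\n\n".join(tick.field_manifest)
    user_text = (
        f"Context:\n{context}\n\n"
        f"State:\n{json.dumps(tick.state, sort_keys=True)}"
    )
    request = {
        "system": tick.goal_frame,
        "messages": [{"role": "user", "content": user_text}],
    }
    if tick.tools:
        request["tools"] = list(tick.tools)
    return request
```

Because the builder is deterministic, a failed tick can be retried or sent to a different provider without any reconciliation of remote state.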
This decoupling is why Maestro can:
These steps must be completed in the Azure portal before any code runs.
Foundry Project: Must be in a supported region (East US 2 or Sweden Central).
Deploy each model via Model Catalog → search “Claude” → Deploy → Global Standard:
| # | Deployment Name | Model | Tier | Purpose |
|---|---|---|---|---|
| 1 | claude-opus-4-6 | Claude Opus 4.6 | P0 — Critical | Planning, synthesis, complex reasoning, KB authoring |
| 2 | claude-sonnet-4-6 | Claude Sonnet 4.6 | P1 — Standard | Drafting, analysis, tool selection, evidence summaries |
| 3 | claude-haiku-4-5 | Claude Haiku 4.5 | P2 — Batch | Classification, tagging, field reranking, extraction |
Verification: After deployment, confirm each model responds:
# Test each deployment (replace {resource} and {key})
for MODEL in claude-opus-4-6 claude-sonnet-4-6 claude-haiku-4-5; do
curl -s -w "\n%{http_code}" \
"https://{resource}.services.ai.azure.com/anthropic/v1/messages" \
-H "x-api-key: {key}" \
-H "anthropic-version: 2023-06-01" \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL\",
\"max_tokens\": 50,
\"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]
}"
echo "--- $MODEL ---"
done
All three should return HTTP 200 with a valid response.
Azure Marketplace: Ensure Marketplace purchasing is enabled at the subscription level. Claude models are billed through Marketplace, not Azure compute. Each model tier has independent pay-as-you-go metering.
File: core/maestro/substrate/interface.py
Extend the Substrate interface to support tiered, metered, advisor-aware model dispatch.
# Addition to existing Substrate Protocol in P1-001
from typing import Protocol, runtime_checkable

from maestro.models.budget import BudgetVector


@runtime_checkable
class ModelDispatch(Protocol):
    """
    Stateless model invocation interface.

    Every call is independent: no session, no thread, no memory
    between calls. The continuation store provides all state;
    the context field provides all context; the model just reasons.

    This is the core architectural bet: model calls are pure functions.
    Duration lives in the continuation store, not the model provider.
    """

    async def call(
        self,
        tier: str,                   # "opus" | "sonnet" | "haiku"
        messages: list[dict],        # assembled from continuation state + field
        tools: list[dict] | None,    # MCP tools available for this tick
        budget: BudgetVector,        # remaining allowances
        mode: str,                   # "sync" | "async" | "batch"
        use_advisor: bool = False,   # invoke Advisor Strategy (Opus as advisor)
        system: str | None = None,   # system prompt from goal frame + policy
        thinking: str = "adaptive",  # "enabled" | "disabled" | "adaptive"
    ) -> "ModelCallResult":
        ...

    async def health_check(self) -> dict[str, bool]:
        """Check reachability of each deployed tier."""
        ...

    def supported_tiers(self) -> list[str]:
        """Return list of available model tiers."""
        ...
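The protocol returns a `ModelCallResult`, whose fields are listed in the tasks for this section. One possible shape is sketched below, using the field names that appear elsewhere in this spec. The `Any` stand-in for `ContentBlock` and the definition of `total_tokens` (executor plus advisor tokens, cache counters excluded) are assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ModelUsage:
    """Token counts for one tick, split by executor and advisor."""
    executor_input: int = 0
    executor_output: int = 0
    advisor_input: int = 0
    advisor_output: int = 0
    cache_read: int = 0
    cache_creation: int = 0
    model_id: str = ""
    tier: str = ""

    @property
    def total_tokens(self) -> int:
        # Assumption: "total" covers executor + advisor tokens only.
        return (
            self.executor_input + self.executor_output
            + self.advisor_input + self.advisor_output
        )


@dataclass
class ModelCallResult:
    """Result of one stateless model call."""
    content: list[Any]        # stand-in for list[ContentBlock]
    usage: ModelUsage
    cost_dollars: float
    tier_used: str
    model_id: str
    advisor_consulted: bool
    latency_ms: int
    stop_reason: str
```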
Tasks:
- Add ModelDispatch protocol to core/maestro/substrate/interface.py
- Add ModelCallResult in core/maestro/substrate/types.py:
  - content: list[ContentBlock]
  - usage: ModelUsage (executor + advisor tokens separated)
  - cost_dollars: float
  - tier_used: str
  - model_id: str
  - advisor_consulted: bool
  - latency_ms: int
  - stop_reason: str
- Add ModelUsage with explicit executor/advisor split

File: substrate-azure/maestro_azure/model_dispatch.py
"""
Azure Foundry ModelDispatch — calls Claude models through the Foundry
Anthropic proxy endpoint. Stateless. No threads. No sessions.
Architecture:
Continuation Store (Cosmos DB) → owns state, sleeps, wakes
Context Field (AI Search) → provides ranked context
THIS SERVICE → pure model invocation
Policy Kernel → decides tier + mode
The model has no memory. The continuation has all the memory.
"""
import time

import anthropic

from maestro.models.budget import BudgetVector
from maestro.substrate.interface import ModelDispatch
from maestro.substrate.types import ModelCallResult, ModelUsage

# ModelDispatchError (see Tasks) must also be imported here; this spec does
# not say which module defines it.


class AzureFoundryModelDispatch(ModelDispatch):
    def __init__(self, config):
        self.endpoint = config.anthropic_foundry_endpoint
        # e.g., "https://cdsopenaigpt4.services.ai.azure.com"
        # Async client: call() and health_check() are coroutines, so a sync
        # client would block the event loop.
        # The SDK appends "/v1/messages" itself, so base_url stops at "/anthropic".
        self.client = anthropic.AsyncAnthropic(
            api_key=config.anthropic_foundry_key,
            base_url=f"{self.endpoint}/anthropic",
        )
        # Deployment names must match Foundry portal deployments
        self.tier_to_model: dict[str, str] = {
            "opus": config.opus_model_id or "claude-opus-4-6",
            "sonnet": config.sonnet_model_id or "claude-sonnet-4-6",
            "haiku": config.haiku_model_id or "claude-haiku-4-5",
        }
        # Advisor configuration
        self.advisor_model = config.advisor_model or "claude-opus-4-6"
        self.advisor_enabled = config.advisor_enabled  # default: True
        # Pricing in USD per 1M tokens (for cost computation).
        # Confirm against the current Marketplace price list before relying on these.
        self._pricing = {
            "claude-opus-4-6": {"input": 15.0, "output": 75.0},
            "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
            "claude-haiku-4-5": {"input": 0.80, "output": 4.0},
        }

    async def call(
        self,
        tier: str,
        messages: list[dict],
        tools: list[dict] | None,
        budget: BudgetVector,
        mode: str,
        use_advisor: bool = False,
        system: str | None = None,
        thinking: str = "adaptive",
    ) -> ModelCallResult:
        model_id = self.tier_to_model[tier]
        request_tools = list(tools or [])
        betas = []
        # Inject Advisor tool for Sonnet/Haiku ticks; skip it when no advisor calls remain
        advisor_uses = min(3, budget.remaining_advisor_calls)
        if use_advisor and tier != "opus" and self.advisor_enabled and advisor_uses > 0:
            request_tools.append({
                "type": "advisor_20260301",
                "name": "advisor",
                "model": self.advisor_model,
                "max_uses": advisor_uses,
            })
            betas.append("advisor-tool-2026-03-01")
        # Compute token limit from budget
        max_tokens = min(
            budget.remaining_tokens_for_tier(tier),
            16384,  # safety cap per tick
        )
        start_ms = time.monotonic_ns() // 1_000_000
        try:
            kwargs = {
                "model": model_id,
                "messages": messages,
                "max_tokens": max_tokens,
            }
            if system:
                kwargs["system"] = system
            if request_tools:
                kwargs["tools"] = request_tools
            # Extended thinking for Opus 4.6 and Sonnet 4.6
            if thinking != "disabled" and tier in ("opus", "sonnet"):
                if thinking == "enabled":
                    # budget_tokens must stay below max_tokens (minimum 1024)
                    if max_tokens > 2048:
                        kwargs["thinking"] = {
                            "type": "enabled",
                            "budget_tokens": min(8192, max_tokens - 1024),
                        }
                else:
                    # Assumption: adaptive mode takes no budget_tokens
                    kwargs["thinking"] = {"type": thinking}
            if betas:
                # The betas parameter is only accepted by the beta Messages endpoint
                response = await self.client.beta.messages.create(betas=betas, **kwargs)
            else:
                response = await self.client.messages.create(**kwargs)
        except anthropic.APIError as e:
            # Emit error event; continuation store will decide retry vs failover.
            # Connection errors carry no status_code; treat them as recoverable.
            status = getattr(e, "status_code", None)
            raise ModelDispatchError(
                tier=tier,
                model_id=model_id,
                error=str(e),
                recoverable=status is None or status in (429, 500, 502, 503),
            ) from e
        elapsed_ms = (time.monotonic_ns() // 1_000_000) - start_ms
        # Parse usage: separate executor vs advisor tokens
        usage = ModelUsage(
            executor_input=response.usage.input_tokens,
            executor_output=response.usage.output_tokens,
            advisor_input=getattr(response.usage, "advisor_input_tokens", 0) or 0,
            advisor_output=getattr(response.usage, "advisor_output_tokens", 0) or 0,
            cache_read=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
            cache_creation=getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
            model_id=model_id,
            tier=tier,
        )
        return ModelCallResult(
            content=response.content,
            usage=usage,
            cost_dollars=self._compute_cost(usage),
            tier_used=tier,
            model_id=model_id,
            advisor_consulted=use_advisor and usage.advisor_output > 0,
            latency_ms=elapsed_ms,
            stop_reason=response.stop_reason,
        )

    def _compute_cost(self, usage: ModelUsage) -> float:
        """Compute dollar cost from token usage and pricing table."""
        model_price = self._pricing.get(usage.model_id, self._pricing["claude-sonnet-4-6"])
        executor_cost = (
            (usage.executor_input * model_price["input"] / 1_000_000)
            + (usage.executor_output * model_price["output"] / 1_000_000)
        )
        advisor_cost = 0.0
        if usage.advisor_input > 0 or usage.advisor_output > 0:
            # Price advisor tokens at the configured advisor model, falling back to Opus
            advisor_price = self._pricing.get(self.advisor_model, self._pricing["claude-opus-4-6"])
            advisor_cost = (
                (usage.advisor_input * advisor_price["input"] / 1_000_000)
                + (usage.advisor_output * advisor_price["output"] / 1_000_000)
            )
        return executor_cost + advisor_cost

    async def health_check(self) -> dict[str, bool]:
        """Ping each tier with a minimal request."""
        results = {}
        for tier, model_id in self.tier_to_model.items():
            try:
                resp = await self.client.messages.create(
                    model=model_id,
                    max_tokens=10,
                    messages=[{"role": "user", "content": "health"}],
                )
                results[tier] = resp.stop_reason is not None
            except Exception:
                results[tier] = False
        return results

    def supported_tiers(self) -> list[str]:
        return list(self.tier_to_model.keys())
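As a worked example of the cost arithmetic in `_compute_cost`, the sketch below prices one Sonnet tick that consulted an Opus advisor, using the Sonnet and Opus per-million-token rates from the pricing table in the dispatch code. The token counts are made up for illustration, and `token_cost` is a hypothetical helper, not part of the dispatch class.

```python
# USD per 1M tokens, copied from the spec's pricing table.
PRICING_PER_MTOK = {
    "claude-opus-4-6": {"input": 15.0, "output": 75.0},
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0},
}


def token_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model's share of a tick."""
    price = PRICING_PER_MTOK[model_id]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


# Executor: Sonnet reads 10,000 tokens and writes 2,000.
executor = token_cost("claude-sonnet-4-6", 10_000, 2_000)  # 0.03 + 0.03
# Advisor: Opus reads 1,000 tokens and writes 500.
advisor = token_cost("claude-opus-4-6", 1_000, 500)        # 0.015 + 0.0375
total = executor + advisor                                 # about 0.1125
```

In this example the advisor accounts for nearly half the cost of the tick despite handling far fewer tokens, which is why the executor/advisor split is tracked separately in `ModelUsage`.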
Tasks:
- substrate-azure/maestro_azure/model_dispatch.py
- substrate-azure/maestro_azure/config.py with config model:
  - anthropic_foundry_endpoint: str
  - anthropic_foundry_key: str (from Key Vault)
  - opus_model_id: str = "claude-opus-4-6"
  - sonnet_model_id: str = "claude-sonnet-4-6"
  - haiku_model_id: str = "claude-haiku-4-5"
  - advisor_enabled: bool = True
  - advisor_model: str = "claude-opus-4-6"
- ModelDispatchError exception with recoverable flag
- anthropic_foundry_key

File: substrate-aws/maestro_aws/model_dispatch.py
Same interface, AWS path. Uses boto3 Bedrock Runtime invoke_model.
class BedrockModelDispatch(ModelDispatch):
    def __init__(self, config):
        import boto3

        self.bedrock = boto3.client("bedrock-runtime", region_name=config.region)
        self.tier_to_model = {
            "opus": "anthropic.claude-opus-4-6-20250514",
            "sonnet": "anthropic.claude-sonnet-4-6-20250514",
            "haiku": "anthropic.claude-haiku-4-5-20251001",
        }
        # Advisor Strategy: NOT natively supported on Bedrock
        # Must use Anthropic direct API as fallback for advisor calls
        self.advisor_fallback = AnthropicDirectDispatch(config.anthropic_direct_key)
Tasks:
- substrate-aws/maestro_aws/model_dispatch.py
- advisor_20260301 tool type)

Extend the Policy Kernel’s Layer B to support a five-tier gradient instead of three:
Quality/Cost gradient (highest to lowest):
opus → sonnet+advisor → sonnet → haiku+advisor → haiku
File: policy-kernel/kernel/layers/b_routing.py
def route_tier(decision_input: DecisionInput) -> RouteDecision:
    """
    Layer B: pick the cheapest model tier likely to make progress.

    Five-tier gradient:
        opus:           complex planning, multi-step reasoning
        sonnet+advisor: synthesis with strategic guidance
        sonnet:         standard drafting, analysis
        haiku+advisor:  extraction with quality check
        haiku:          classification, tagging, field reranking
    """
    capability = capability_estimator(
        intent=decision_input.continuation.goal_frame.intent,
        field_top_score=decision_input.context_field.top_score,
        open_questions=decision_input.continuation.state.open_questions,
    )
    # Base routing by capability
    if capability in ("plan", "reason", "evaluate_evidence"):
        tier = "opus"
        use_advisor = False
    elif capability in ("synthesize", "draft", "compare"):
        tier = "sonnet"
        # Use advisor when entering novel territory
        use_advisor = decision_input.context_field.novelty_vs_prior_tick > 0.5
    elif capability in ("extract", "classify", "tag", "rerank"):
        tier = "haiku"
        use_advisor = False
    else:
        tier = "sonnet"
        use_advisor = False
    # Budget-aware downshift with advisor compensation
    budget = decision_input.budget
    if budget.remaining_dollars < budget.soft_cap * 0.3:
        # Deep budget pressure: drop everything to haiku
        tier = "haiku"
        use_advisor = False
    elif budget.remaining_dollars < budget.soft_cap * 0.6:
        # Moderate pressure: downshift one level, compensate with advisor
        if tier == "opus":
            tier = "sonnet"
            use_advisor = True  # sonnet+advisor ≈ 85% opus quality at 30% cost
        elif tier == "sonnet":
            tier = "haiku"
            use_advisor = True  # haiku+advisor ≈ sonnet quality at 40% cost
    return RouteDecision(
        tier=tier,
        use_advisor=use_advisor,
        rationale=f"capability={capability}, tier={tier}, advisor={use_advisor}, "
                  f"budget_remaining=${budget.remaining_dollars:.2f}",
    )
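The budget-aware downshift in `route_tier` can be isolated into a small function to make its thresholds easy to unit test. The sketch below mirrors the 0.3 and 0.6 soft-cap fractions from the routing code. The function name `apply_budget_pressure` and its flat argument list are assumptions made for illustration.

```python
def apply_budget_pressure(
    tier: str,
    use_advisor: bool,
    remaining_dollars: float,
    soft_cap: float,
) -> tuple[str, bool]:
    """Return the (tier, use_advisor) pair after budget downshifting."""
    if remaining_dollars < soft_cap * 0.3:
        # Deep pressure: everything runs on haiku without an advisor.
        return "haiku", False
    if remaining_dollars < soft_cap * 0.6:
        # Moderate pressure: step down one tier and add the advisor.
        if tier == "opus":
            return "sonnet", True
        if tier == "sonnet":
            return "haiku", True
    # No pressure, or already on haiku: keep the base routing.
    return tier, use_advisor
```

For example, with a soft cap of 100 and 50 remaining, an opus request becomes sonnet with advisor, while a haiku request is left unchanged because there is no lower tier to move to.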
Tasks:
- use_advisor: bool to Decision dataclass
- capability_estimator stub (rules-based for v0, GBT in v1)

If the primary Foundry endpoint fails, Maestro must fail over transparently.
File: core/maestro/dispatch/router.py
class FailoverModelRouter(ModelDispatch):
    """
    Wraps multiple ModelDispatch implementations in a failover chain.
    The continuation doesn't know or care which provider actually served it.
    """

    def __init__(self, providers: list[ModelDispatch]):
        self.providers = providers  # ordered by preference

    async def call(self, **kwargs) -> ModelCallResult:
        last_error = None
        for provider in self.providers:
            try:
                result = await provider.call(**kwargs)
                return result
            except ModelDispatchError as e:
                if not e.recoverable:
                    raise  # non-recoverable errors skip failover
                last_error = e
                continue
        raise ModelDispatchError(
            tier=kwargs["tier"],
            model_id="failover-exhausted",
            error=f"All {len(self.providers)} providers failed. Last: {last_error}",
            recoverable=False,
        )
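The failover behaviour can be exercised without any cloud SDK by substituting fake providers. The sketch below is a simplified stand-in for the router above: `DispatchError`, `FlakyProvider`, `HealthyProvider`, and `call_with_failover` are hypothetical names, and the providers return plain dicts rather than `ModelCallResult`.

```python
import asyncio


class DispatchError(Exception):
    """Stand-in for ModelDispatchError; only the recoverable flag matters here."""

    def __init__(self, message: str, recoverable: bool):
        super().__init__(message)
        self.recoverable = recoverable


class FlakyProvider:
    """Always fails with a recoverable error, like a 503 from the primary."""

    async def call(self, **kwargs):
        raise DispatchError("503 from primary", recoverable=True)


class HealthyProvider:
    """Always succeeds and reports which tier it served."""

    async def call(self, **kwargs):
        return {"served_by": "backup", "tier": kwargs["tier"]}


async def call_with_failover(providers, **kwargs):
    """Try providers in order; move on only after recoverable errors."""
    last_error = None
    for provider in providers:
        try:
            return await provider.call(**kwargs)
        except DispatchError as exc:
            if not exc.recoverable:
                raise
            last_error = exc
    raise DispatchError(f"all providers failed; last: {last_error}", recoverable=False)


result = asyncio.run(
    call_with_failover([FlakyProvider(), HealthyProvider()], tier="sonnet")
)
```

Here the primary raises a recoverable error, so the call is served by the second provider. If every provider fails, the final error is marked non-recoverable so the caller does not loop.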
Azure failover chain:
- services.ai.azure.com/anthropic): primary
- api.anthropic.com): backup

AWS failover chain:
Tasks:
- core/maestro/dispatch/router.py
- core/maestro/dispatch/providers/anthropic_direct.py

Every model call emits a budget_charge event to the continuation event log.
# In worker tick, after model call:
event = ContinuationEvent(
    kind="budget_charge",
    payload={
        "tier": result.tier_used,
        "model_id": result.model_id,
        "executor_tokens": {
            "input": result.usage.executor_input,
            "output": result.usage.executor_output,
        },
        "advisor_tokens": {
            "input": result.usage.advisor_input,
            "output": result.usage.advisor_output,
        },
        "cache_tokens": {
            "read": result.usage.cache_read,
            "creation": result.usage.cache_creation,
        },
        "cost_dollars": result.cost_dollars,
        "latency_ms": result.latency_ms,
        "advisor_consulted": result.advisor_consulted,
    },
    cost={
        "tokens": result.usage.total_tokens,
        "dollars": result.cost_dollars,
        "wall_ms": result.latency_ms,
    },
)
await event_log.append(continuation_id, event)
# Deduct from budget
budget.charge(result.cost_dollars, result.usage)
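A minimal sketch of what the budget deduction might track is shown below. It is not the real `BudgetVector`: the class name `BudgetSketch`, the `tokens_by_tier` field, and the flattened `charge` signature are assumptions, chosen so the example stays self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetSketch:
    """Hypothetical, simplified stand-in for BudgetVector."""
    remaining_dollars: float
    remaining_advisor_calls: int
    tokens_by_tier: dict[str, int] = field(default_factory=dict)

    def charge(
        self,
        cost_dollars: float,
        tier: str,
        tokens: int,
        advisor_consulted: bool,
    ) -> None:
        """Deduct one tick's cost and record its tokens against the tier."""
        self.remaining_dollars -= cost_dollars
        self.tokens_by_tier[tier] = self.tokens_by_tier.get(tier, 0) + tokens
        # Advisor calls are a separate allowance; never let it go negative.
        if advisor_consulted and self.remaining_advisor_calls > 0:
            self.remaining_advisor_calls -= 1


budget = BudgetSketch(remaining_dollars=10.0, remaining_advisor_calls=3)
budget.charge(0.25, tier="sonnet", tokens=1200, advisor_consulted=True)
```

Keeping the advisor allowance separate from the dollar balance lets the dispatch layer cap `max_uses` on the advisor tool independently of overall spend.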
Tasks:
- BudgetVector.charge() with per-tier token tracking
- remaining_advisor_calls decrement
- fact_model_call materialized view for cost analytics

# Foundry Anthropic proxy (existing, extend)
ANTHROPIC_FOUNDRY_ENDPOINT=https://cdsopenaigpt4.services.ai.azure.com
ANTHROPIC_FOUNDRY_KEY=<from-key-vault>
# Model deployment IDs (match Foundry portal deployment names)
MAESTRO_OPUS_MODEL_ID=claude-opus-4-6
MAESTRO_SONNET_MODEL_ID=claude-sonnet-4-6
MAESTRO_HAIKU_MODEL_ID=claude-haiku-4-5
# Advisor Strategy
MAESTRO_ADVISOR_ENABLED=true
MAESTRO_ADVISOR_MODEL=claude-opus-4-6
MAESTRO_ADVISOR_MAX_USES_PER_TICK=3
# Failover
ANTHROPIC_DIRECT_API_KEY=<from-key-vault> # backup: direct Anthropic API
MAESTRO_FAILOVER_ENABLED=true
# Bedrock (primary on AWS)
AWS_BEDROCK_REGION=us-east-1
MAESTRO_OPUS_MODEL_ID=anthropic.claude-opus-4-6-20250514
MAESTRO_SONNET_MODEL_ID=anthropic.claude-sonnet-4-6-20250514
MAESTRO_HAIKU_MODEL_ID=anthropic.claude-haiku-4-5-20251001
# Failover
ANTHROPIC_DIRECT_API_KEY=<from-secrets-manager>
MAESTRO_FAILOVER_ENABLED=true
Tasks:
- substrate-azure/maestro_azure/config.py
- substrate-aws/maestro_aws/config.py
- docs/runbooks/model_deployment.md

| File | Language | Purpose |
|---|---|---|
| substrate-azure/maestro_azure/model_dispatch.py | Python | Azure Foundry proxy dispatch |
| substrate-azure/maestro_azure/config.py | Python | Azure substrate configuration |
| substrate-aws/maestro_aws/model_dispatch.py | Python | AWS Bedrock dispatch |
| core/maestro/dispatch/router.py | Python | Failover chain router |
| core/maestro/dispatch/providers/anthropic_direct.py | Python | Direct API fallback |
| core/maestro/dispatch/metering.py | Python | Cost computation + budget charging |
| policy-kernel/kernel/layers/b_routing.py | Python | Five-tier routing with advisor |
| eval/sql/fact_model_call.sql | SQL | Warm store materialized view |
| docs/runbooks/model_deployment.md | Markdown | Foundry portal setup guide |
| docs/adr/0002-stateless-model-dispatch.md | Markdown | ADR: why model calls are stateless |
| File | Changes |
|---|---|
| core/maestro/substrate/interface.py | Add ModelDispatch protocol |
| core/maestro/substrate/types.py | Add ModelCallResult, ModelUsage |
| core/maestro/models/budget.py | Add per-tier tracking, advisor call counting |
| core/maestro/worker/tick.py | Wire model dispatch + budget charging |
Phase 0: Foundry Portal deployments ← manual, no code, do first
Phase 1: Substrate interface extension ← protocol only, no cloud SDK
Phase 2: Azure Foundry implementation ← primary dispatch path
Phase 3: AWS Bedrock implementation ← parallel with Phase 2
Phase 4: Policy Kernel routing ← depends on Phase 1
Phase 5: Failover chain ← depends on Phase 2 + 3
Phase 6: Metering + budget events ← depends on Phase 2
Phase 7: Environment config + IaC ← depends on Phase 2 + 3
Estimated effort: 8–10 days
Risk: Phase 0 is the only blocking manual step — everything else is code.
Decision: Maestro model calls are pure functions with no provider-side state.
Context: Foundry Agents, OpenAI Assistants, and Bedrock Agents all offer thread-based stateful sessions. These sessions bind reasoning state to the model provider.
Rationale: Maestro continuations run for weeks. Binding session state to a model provider means:
By making model calls stateless, the Continuation Store becomes the single source of truth for all session state. The model is a commodity compute resource. The intelligence is in the continuation + field + policy kernel.
Consequences: Every model call must carry its full context (system prompt + field manifest + recent state). This costs more tokens per call but eliminates all provider coupling. Prompt caching (supported through Foundry proxy) amortizes the context cost for frequently-recomputed fields.
Status: Accepted.
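The prompt caching mentioned under Consequences can be sketched as follows. The example marks the end of the stable prefix of the system prompt with the Messages API `cache_control` field, so that later ticks can reuse it. The function name `cached_system_blocks` and the choice to cache the field manifest along with the goal frame are assumptions, and whether a given deployment honours the cache marker should be verified.

```python
def cached_system_blocks(goal_frame: str, field_manifest: str) -> list[dict]:
    """Build system content blocks with a cache breakpoint after the manifest.

    The breakpoint covers everything up to and including the block that
    carries cache_control, so stable content should come first.
    """
    return [
        {"type": "text", "text": goal_frame},
        {
            "type": "text",
            "text": field_manifest,
            "cache_control": {"type": "ephemeral"},
        },
    ]


system_blocks = cached_system_blocks(
    "Goal: shortlist three vendors.",
    "Ranked context for this tick goes here.",
)
```

The resulting list would be passed as the `system` argument of a model call, with per-tick state kept in `messages` so it does not invalidate the cached prefix.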