TASK — Azure Foundry Tiered Model Dispatch for Maestro Continuations

Type: Feature — Infrastructure + Substrate
Priority: P0 (blocks P2-006 of maestro-build-plan.yaml)
Branch: feature/foundry-tiered-dispatch
Depends on: P1-001 (Substrate interface), P1-002 (Continuation data model)
Blocks: P2-005 (Policy Kernel v0), P2-006 (Model dispatch), P2-007 (Budget enforcement)

Context

Maestro’s Policy Kernel routes every continuation tick through a tiered model decision: opus for planning/reasoning, sonnet for synthesis/drafting, haiku for classification/extraction. The current BarnOS OpusService calls a single Claude model through Azure Foundry. This task extends the infrastructure to support all three tiers — plus the Anthropic Advisor Strategy — through the Foundry proxy endpoint, without using the Foundry Agent SDK’s thread-based pattern.

Why Not Foundry Agent SDK Threads?

Foundry Agents use a thread lifecycle: createThread → addMessage → createRun → poll → getMessages → deleteThread. This is chat-shaped — it assumes request/response cycles that complete in seconds. Maestro continuations run for weeks. The continuation store owns session state; model calls must be stateless ticks. Each tick is a fresh Messages API call with context assembled from the continuation’s state and field manifest.

The Foundry proxy endpoint (https://{resource}.services.ai.azure.com/anthropic/v1/messages) exposes the raw Anthropic Messages API format. This is the correct integration point for Maestro — it gives us full API surface (tool_use, extended thinking, prompt caching, Advisor Strategy) while keeping billing inside Azure Marketplace.

The Core Principle: Decoupled Model Calls

┌───────────────── Traditional Agent Session ──────────────────┐
│                                                               │
│  Model Provider owns the session:                             │
│  Thread created → messages accumulate → thread deleted        │
│  Duration: minutes. State: inside the provider.               │
│  Provider outage = session lost.                              │
│                                                               │
└───────────────────────────────────────────────────────────────┘

┌───────────────── Maestro Continuation ───────────────────────┐
│                                                               │
│  Continuation Store owns the session:                         │
│  Continuation persisted in Cosmos/DynamoDB with full state.   │
│  Each tick: assemble context → stateless model call → persist │
│  Duration: weeks. State: in the continuation store.           │
│  Provider outage = retry on next tick or failover to backup.  │
│                                                               │
│  Model call is a PURE FUNCTION:                               │
│    f(goal_frame, field_manifest, state, tools) → response     │
│                                                               │
│  The model has no memory of prior ticks.                      │
│  The continuation store has all the memory.                   │
│  The field manifest has all the context.                      │
│  The model just reasons over what it's given.                 │
│                                                               │
└───────────────────────────────────────────────────────────────┘

This decoupling is why Maestro can:

Sleep for weeks between ticks without maintaining a model session
Migrate between workers without losing state
Fail over between providers mid-journal (Foundry → direct Anthropic → Bedrock)
Budget-shift from Opus to Haiku mid-run without session reconstruction
Replay any tick deterministically from the event log
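
The tick loop in the diagram above can be sketched as follows. This is a minimal illustration, not the Maestro implementation: `Continuation`, `assemble_context`, `tick`, and the two fake model functions are hypothetical names, and the in-memory object stands in for the Cosmos/DynamoDB store.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Continuation:
    """Hypothetical in-memory stand-in for a persisted continuation."""
    goal_frame: str
    state: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

# A model call is treated as a plain function: messages in, text out.
ModelFn = Callable[[list], str]

def assemble_context(cont: Continuation, field_manifest: list) -> list:
    """Build the full prompt from stored state; nothing lives in the provider."""
    prompt = (
        f"Goal: {cont.goal_frame}\n"
        f"State: {cont.state}\n"
        f"Context: {'; '.join(field_manifest)}"
    )
    return [{"role": "user", "content": prompt}]

def tick(cont: Continuation, field_manifest: list, model: ModelFn) -> Continuation:
    """One tick: assemble context, make a stateless call, persist the result."""
    messages = assemble_context(cont, field_manifest)
    reply = model(messages)
    cont.events.append({"kind": "model_reply", "messages": messages, "reply": reply})
    cont.state["last_reply"] = reply
    return cont

def fake_opus(messages: list) -> str:
    return f"opus saw {len(messages[0]['content'])} chars"

def fake_haiku(messages: list) -> str:
    return "haiku: ok"

cont = Continuation(goal_frame="draft migration plan")
tick(cont, ["doc-17", "doc-42"], fake_opus)
tick(cont, ["doc-42"], fake_haiku)  # different model, no session to rebuild
```

Because each tick rebuilds its prompt from the continuation, swapping `fake_opus` for `fake_haiku` between ticks needs no session reconstruction.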

Phase 0: Azure Foundry Portal — Deploy Three Model Tiers

Prerequisites (Manual — Foundry Portal)

These steps must be completed in the Azure portal before any code runs.

Foundry Project: Must be in a supported region (East US 2 or Sweden Central).

Deploy each model via Model Catalog → search “Claude” → Deploy → Global Standard:

#	Deployment Name	Model	Tier	Purpose
1	`claude-opus-4-6`	Claude Opus 4.6	P0 — Critical	Planning, synthesis, complex reasoning, KB authoring
2	`claude-sonnet-4-6`	Claude Sonnet 4.6	P1 — Standard	Drafting, analysis, tool selection, evidence summaries
3	`claude-haiku-4-5`	Claude Haiku 4.5	P2 — Batch	Classification, tagging, field reranking, extraction

Verification: After deployment, confirm each model responds:

# Test each deployment (replace {resource} and {key})
# The label is printed before each response so the output reads top-down.
for MODEL in claude-opus-4-6 claude-sonnet-4-6 claude-haiku-4-5; do
  echo "--- $MODEL ---"
  curl -s -w "\nHTTP %{http_code}\n" \
    "https://{resource}.services.ai.azure.com/anthropic/v1/messages" \
    -H "x-api-key: {key}" \
    -H "anthropic-version: 2023-06-01" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"$MODEL\",
      \"max_tokens\": 50,
      \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]
    }"
done

All three should return HTTP 200 with a valid response.

Azure Marketplace: Ensure Marketplace purchasing is enabled at the subscription level. Claude models are billed through Marketplace, not Azure compute. Each model tier has independent pay-as-you-go metering.

Phase 1: Substrate ModelDispatch Interface

File: core/maestro/substrate/interface.py

Extend the Substrate interface to support tiered, metered, advisor-aware model dispatch.

# Addition to existing Substrate Protocol in P1-001

from typing import Protocol, runtime_checkable
from maestro.models.budget import BudgetVector

@runtime_checkable
class ModelDispatch(Protocol):
    """
    Stateless model invocation interface.
    
    Every call is independent — no session, no thread, no memory
    between calls. The continuation store provides all state;
    the context field provides all context; the model just reasons.
    
    This is the core architectural bet: model calls are pure functions.
    Duration lives in the continuation store, not the model provider.
    """

    async def call(
        self,
        tier: str,                    # "opus" | "sonnet" | "haiku"
        messages: list[dict],         # assembled from continuation state + field
        tools: list[dict] | None,     # MCP tools available for this tick
        budget: BudgetVector,         # remaining allowances
        mode: str,                    # "sync" | "async" | "batch"
        use_advisor: bool = False,    # invoke Advisor Strategy (Opus as advisor)
        system: str | None = None,    # system prompt from goal frame + policy
        thinking: str = "adaptive",   # "enabled" | "disabled" | "adaptive"
    ) -> "ModelCallResult":
        ...

    async def health_check(self) -> dict[str, bool]:
        """Check reachability of each deployed tier."""
        ...

    def supported_tiers(self) -> list[str]:
        """Return list of available model tiers."""
        ...
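
A conformance check for the protocol above might look like the sketch below. It redeclares a trimmed copy of `ModelDispatch` so it runs on its own; `FakeDispatch` and `Incomplete` are hypothetical test doubles. Note that `isinstance` against a `runtime_checkable` protocol only verifies that the methods exist, not that their signatures match.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ModelDispatch(Protocol):
    # Trimmed copy of the protocol above so the example is self-contained.
    async def call(self, tier: str, messages: list, **kwargs) -> object: ...
    async def health_check(self) -> dict: ...
    def supported_tiers(self) -> list: ...

class FakeDispatch:
    """Hypothetical test double; it does not inherit from ModelDispatch."""

    async def call(self, tier, messages, **kwargs):
        return {"tier": tier, "echo": messages[-1]["content"]}

    async def health_check(self):
        return {t: True for t in self.supported_tiers()}

    def supported_tiers(self):
        return ["opus", "sonnet", "haiku"]

class Incomplete:
    """Missing call() and health_check(), so it should not conform."""

    def supported_tiers(self):
        return []

# Structural check: method presence only, signatures are not inspected.
conforms = isinstance(FakeDispatch(), ModelDispatch)
rejects = not isinstance(Incomplete(), ModelDispatch)
```

Signature-level conformance would still need a static type checker or explicit call tests.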

Tasks:

Add ModelDispatch protocol to core/maestro/substrate/interface.py
Define ModelCallResult in core/maestro/substrate/types.py:
- content: list[ContentBlock]
- usage: ModelUsage (executor + advisor tokens separated)
- cost_dollars: float
- tier_used: str
- model_id: str
- advisor_consulted: bool
- latency_ms: int
- stop_reason: str
Define ModelUsage with explicit executor/advisor split
Unit tests: protocol conformance checks
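
One possible shape for the two result types, following the field list above. This is a sketch under assumptions: `content` is typed as `list[Any]` because `ContentBlock` is not defined here, and `total_tokens` is assumed to sum executor and advisor tokens while leaving cache tokens out of the total.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class ModelUsage:
    executor_input: int
    executor_output: int
    advisor_input: int = 0
    advisor_output: int = 0
    cache_read: int = 0
    cache_creation: int = 0
    model_id: str = ""
    tier: str = ""

    @property
    def total_tokens(self) -> int:
        # Assumption: executor + advisor tokens; cache tokens tracked separately.
        return (
            self.executor_input
            + self.executor_output
            + self.advisor_input
            + self.advisor_output
        )

@dataclass(frozen=True)
class ModelCallResult:
    content: list[Any]  # provider content blocks, passed through untouched
    usage: ModelUsage
    cost_dollars: float
    tier_used: str
    model_id: str
    advisor_consulted: bool
    latency_ms: int
    stop_reason: str

usage = ModelUsage(
    executor_input=1000,
    executor_output=200,
    advisor_input=300,
    advisor_output=50,
    model_id="claude-sonnet-4-6",
    tier="sonnet",
)
```

Frozen dataclasses keep results immutable once they are written to the event log.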

Phase 2: Azure Foundry ModelDispatch Implementation

File: substrate-azure/maestro_azure/model_dispatch.py

"""
Azure Foundry ModelDispatch — calls Claude models through the Foundry
Anthropic proxy endpoint. Stateless. No threads. No sessions.

Architecture:
  Continuation Store (Cosmos DB) → owns state, sleeps, wakes
  Context Field (AI Search)      → provides ranked context
  THIS SERVICE                   → pure model invocation
  Policy Kernel                  → decides tier + mode

The model has no memory. The continuation has all the memory.
"""

import anthropic
import time
from maestro.substrate.interface import ModelDispatch
from maestro.substrate.types import ModelCallResult, ModelUsage
from maestro.models.budget import BudgetVector
# ModelDispatchError is created in this phase (see Tasks below); import it
# from wherever it ends up being defined.


class AzureFoundryModelDispatch(ModelDispatch):

    def __init__(self, config):
        self.endpoint = config.anthropic_foundry_endpoint
        # e.g., "https://cdsopenaigpt4.services.ai.azure.com"

        # Async client: call() is a coroutine, so a blocking client would
        # stall the event loop. The SDK appends "/v1/messages" itself, so
        # base_url stops at "/anthropic".
        self.client = anthropic.AsyncAnthropic(
            api_key=config.anthropic_foundry_key,
            base_url=f"{self.endpoint}/anthropic",
        )

        # Deployment names must match Foundry portal deployments
        self.tier_to_model: dict[str, str] = {
            "opus":   config.opus_model_id   or "claude-opus-4-6",
            "sonnet": config.sonnet_model_id or "claude-sonnet-4-6",
            "haiku":  config.haiku_model_id  or "claude-haiku-4-5",
        }

        # Advisor configuration
        self.advisor_model = config.advisor_model or "claude-opus-4-6"
        self.advisor_enabled = config.advisor_enabled  # default: True

        # Pricing per 1M tokens (for cost computation).
        # NOTE: the Opus and Haiku figures match earlier-generation list
        # prices; confirm current Marketplace rates before relying on
        # cost_dollars.
        self._pricing = {
            "claude-opus-4-6":   {"input": 15.0, "output": 75.0},
            "claude-sonnet-4-6": {"input": 3.0,  "output": 15.0},
            "claude-haiku-4-5":  {"input": 0.80, "output": 4.0},
        }

    async def call(
        self,
        tier: str,
        messages: list[dict],
        tools: list[dict] | None,
        budget: BudgetVector,
        mode: str,
        use_advisor: bool = False,
        system: str | None = None,
        thinking: str = "adaptive",
    ) -> ModelCallResult:

        model_id = self.tier_to_model[tier]
        request_tools = list(tools or [])
        betas = []

        # Inject Advisor tool for Sonnet/Haiku ticks while advisor budget remains
        advisor_uses = min(3, budget.remaining_advisor_calls)
        if use_advisor and tier != "opus" and self.advisor_enabled and advisor_uses > 0:
            request_tools.append({
                "type": "advisor_20260301",
                "name": "advisor",
                "model": self.advisor_model,
                "max_uses": advisor_uses,
            })
            betas.append("advisor-tool-2026-03-01")

        # Compute token limit from budget
        max_tokens = min(
            budget.remaining_tokens_for_tier(tier),
            16384,  # safety cap per tick
        )
        if max_tokens <= 0:
            # Nothing left to spend: fail fast instead of sending an invalid request
            raise ModelDispatchError(
                tier=tier,
                model_id=model_id,
                error="token budget exhausted for tier",
                recoverable=False,
            )

        kwargs = {
            "model": model_id,
            "messages": messages,
            "max_tokens": max_tokens,
        }
        if system:
            kwargs["system"] = system
        if request_tools:
            kwargs["tools"] = request_tools

        # Extended thinking for Opus 4.6 and Sonnet 4.6
        if tier in ("opus", "sonnet"):
            if thinking == "enabled" and max_tokens > 1024:
                # budget_tokens must be at least 1024 and below max_tokens
                kwargs["thinking"] = {
                    "type": "enabled",
                    "budget_tokens": min(8192, max_tokens - 1),
                }
            elif thinking == "adaptive":
                # Assumes the API accepts type "adaptive" without budget_tokens
                kwargs["thinking"] = {"type": "adaptive"}

        start_ms = time.monotonic_ns() // 1_000_000

        try:
            if betas:
                # Beta features go through client.beta.messages; the stable
                # messages.create() has no `betas` parameter.
                response = await self.client.beta.messages.create(betas=betas, **kwargs)
            else:
                response = await self.client.messages.create(**kwargs)

        except anthropic.APIError as e:
            # Continuation store decides retry vs failover from `recoverable`.
            # Connection errors carry no status_code; treat them as retryable.
            status = getattr(e, "status_code", None)
            raise ModelDispatchError(
                tier=tier,
                model_id=model_id,
                error=str(e),
                recoverable=status is None or status in (429, 500, 502, 503),
            ) from e

        elapsed_ms = (time.monotonic_ns() // 1_000_000) - start_ms

        # Parse usage: separate executor vs advisor tokens.
        # `or 0` covers fields that are present but None.
        usage = ModelUsage(
            executor_input=response.usage.input_tokens,
            executor_output=response.usage.output_tokens,
            advisor_input=getattr(response.usage, "advisor_input_tokens", 0) or 0,
            advisor_output=getattr(response.usage, "advisor_output_tokens", 0) or 0,
            cache_read=getattr(response.usage, "cache_read_input_tokens", 0) or 0,
            cache_creation=getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
            model_id=model_id,
            tier=tier,
        )

        return ModelCallResult(
            content=response.content,
            usage=usage,
            cost_dollars=self._compute_cost(usage),
            tier_used=tier,
            model_id=model_id,
            advisor_consulted=use_advisor and usage.advisor_output > 0,
            latency_ms=elapsed_ms,
            stop_reason=response.stop_reason,
        )

    def _compute_cost(self, usage: ModelUsage) -> float:
        """Compute dollar cost from token usage and pricing table.

        Cache read/creation tokens are recorded in ModelUsage but are not
        priced here.
        """
        model_price = self._pricing.get(usage.model_id, self._pricing["claude-sonnet-4-6"])
        executor_cost = (
            (usage.executor_input * model_price["input"] / 1_000_000)
            + (usage.executor_output * model_price["output"] / 1_000_000)
        )
        advisor_cost = 0.0
        if usage.advisor_input > 0 or usage.advisor_output > 0:
            opus_price = self._pricing["claude-opus-4-6"]
            advisor_cost = (
                (usage.advisor_input * opus_price["input"] / 1_000_000)
                + (usage.advisor_output * opus_price["output"] / 1_000_000)
            )
        return executor_cost + advisor_cost

    async def health_check(self) -> dict[str, bool]:
        """Ping each tier with a minimal request."""
        results = {}
        for tier, model_id in self.tier_to_model.items():
            try:
                resp = await self.client.messages.create(
                    model=model_id,
                    max_tokens=10,
                    messages=[{"role": "user", "content": "health"}],
                )
                results[tier] = resp.stop_reason is not None
            except Exception:
                results[tier] = False
        return results

    def supported_tiers(self) -> list[str]:
        return list(self.tier_to_model.keys())
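
As a worked example of the cost formula in `_compute_cost`, consider a Sonnet tick that consulted the advisor. The rates are copied as given from the pricing table above; the token counts are made up for illustration, and `token_cost` is a hypothetical helper.

```python
# Prices per 1M tokens, copied as given from the pricing table above.
PRICING = {
    "claude-opus-4-6":   {"input": 15.0, "output": 75.0},
    "claude-sonnet-4-6": {"input": 3.0,  "output": 15.0},
}

def token_cost(input_tokens: int, output_tokens: int, price: dict) -> float:
    """Dollar cost of one side of a call (executor or advisor)."""
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Executor: 10,000 input + 2,000 output tokens on Sonnet -> 0.03 + 0.03
executor = token_cost(10_000, 2_000, PRICING["claude-sonnet-4-6"])
# Advisor: 1,000 input + 500 output tokens on Opus -> 0.015 + 0.0375
advisor = token_cost(1_000, 500, PRICING["claude-opus-4-6"])
total = executor + advisor

print(f"executor=${executor:.4f} advisor=${advisor:.4f} total=${total:.4f}")
# prints: executor=$0.0600 advisor=$0.0525 total=$0.1125
```

In this example the advisor accounts for nearly half the tick's cost despite using a fraction of the tokens, which is why advisor tokens are metered separately.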

Tasks:

Create substrate-azure/maestro_azure/model_dispatch.py
Create substrate-azure/maestro_azure/config.py with config model:
- anthropic_foundry_endpoint: str
- anthropic_foundry_key: str (from Key Vault)
- opus_model_id: str = "claude-opus-4-6"
- sonnet_model_id: str = "claude-sonnet-4-6"
- haiku_model_id: str = "claude-haiku-4-5"
- advisor_enabled: bool = True
- advisor_model: str = "claude-opus-4-6"
Create ModelDispatchError exception with recoverable flag
Wire Key Vault secret for anthropic_foundry_key
Integration test: call each tier, verify response + usage parsing
Test advisor injection: Sonnet call with advisor, verify advisor tokens reported
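
The `ModelDispatchError` task could be approached along these lines. The constructor arguments mirror how the exception is raised in the dispatch code above; `RETRYABLE_STATUS` and `is_recoverable` are hypothetical helpers, and treating a missing status code (connection error or timeout) as retryable is an assumption rather than a stated requirement.

```python
from typing import Optional

class ModelDispatchError(Exception):
    """Raised when a model call fails; `recoverable` drives retry and failover."""

    def __init__(self, tier: str, model_id: str, error: str, recoverable: bool):
        super().__init__(f"[{tier}/{model_id}] {error}")
        self.tier = tier
        self.model_id = model_id
        self.error = error
        self.recoverable = recoverable

# Status codes the dispatch code above treats as transient.
RETRYABLE_STATUS = frozenset({429, 500, 502, 503})

def is_recoverable(status_code: Optional[int]) -> bool:
    """None means no HTTP response was received; assumed retryable here."""
    return status_code is None or status_code in RETRYABLE_STATUS

err = ModelDispatchError(
    tier="sonnet",
    model_id="claude-sonnet-4-6",
    error="503 Service Unavailable",
    recoverable=is_recoverable(503),
)
```

Keeping the classification in one helper lets each provider implementation share the same retry semantics.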

Phase 3: AWS Bedrock ModelDispatch Implementation

File: substrate-aws/maestro_aws/model_dispatch.py

Same interface, AWS path. Uses boto3 Bedrock Runtime invoke_model.

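import boto3
# AnthropicDirectDispatch is the direct-API provider created in Phase 5.

class BedrockModelDispatch(ModelDispatch):

    def __init__(self, config):
        self.bedrock = boto3.client("bedrock-runtime", region_name=config.region)
        # Read IDs from config (MAESTRO_*_MODEL_ID) and fall back to defaults.
        # The default IDs are placeholders: confirm the exact model or
        # inference-profile IDs in the Bedrock console before use.
        self.tier_to_model = {
            "opus":   config.opus_model_id   or "anthropic.claude-opus-4-6-20250514",
            "sonnet": config.sonnet_model_id or "anthropic.claude-sonnet-4-6-20250514",
            "haiku":  config.haiku_model_id  or "anthropic.claude-haiku-4-5-20251001",
        }
        # Advisor Strategy: NOT natively supported on Bedrock
        # Must use Anthropic direct API as fallback for advisor calls
        self.advisor_fallback = AnthropicDirectDispatch(config.anthropic_direct_key)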

Tasks:

Create substrate-aws/maestro_aws/model_dispatch.py
Handle Bedrock-specific auth (IRSA + STS)
Implement advisor fallback via direct Anthropic API (Bedrock doesn’t support advisor_20260301 tool type)
Same test suite as Phase 2 must pass
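
For the Bedrock path, the request body differs slightly from the direct Messages API: the model is passed as `modelId` on the call rather than inside the body, and the body carries an `anthropic_version` of `bedrock-2023-05-31`. The sketch below builds and parses payloads without touching AWS; `build_bedrock_body` and `parse_bedrock_usage` are hypothetical helper names.

```python
import json
from typing import Optional

def build_bedrock_body(
    messages: list,
    max_tokens: int,
    system: Optional[str] = None,
    tools: Optional[list] = None,
) -> str:
    """Serialize a Messages-style request for Bedrock Runtime invoke_model."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": messages,
    }
    if system:
        body["system"] = system
    if tools:
        body["tools"] = tools
    return json.dumps(body)

def parse_bedrock_usage(payload: dict) -> tuple:
    """Extract (input_tokens, output_tokens) from a decoded response payload."""
    usage = payload.get("usage", {})
    return usage.get("input_tokens", 0), usage.get("output_tokens", 0)

# Intended use with boto3 (not executed here; requires AWS credentials):
#   resp = bedrock.invoke_model(modelId=model_id, body=build_bedrock_body(...))
#   payload = json.loads(resp["body"].read())
request = json.loads(build_bedrock_body([{"role": "user", "content": "ping"}], 50))
tokens = parse_bedrock_usage({"usage": {"input_tokens": 8, "output_tokens": 3}})
```

Since boto3 clients are synchronous, the real `call` would likely need to run `invoke_model` off the event loop, for example via `asyncio.to_thread`.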

Phase 4: Policy Kernel Tier Routing with Advisor Strategy

Extend the Policy Kernel’s Layer B to support a five-tier gradient instead of three:

Quality/Cost gradient (highest to lowest):
  opus → sonnet+advisor → sonnet → haiku+advisor → haiku

File: policy-kernel/kernel/layers/b_routing.py

def route_tier(decision_input: DecisionInput) -> RouteDecision:
    """
    Layer B: pick the cheapest model tier likely to make progress.
    
    Five-tier gradient:
      opus            — complex planning, multi-step reasoning
      sonnet+advisor  — synthesis with strategic guidance
      sonnet          — standard drafting, analysis
      haiku+advisor   — extraction with quality check
      haiku           — classification, tagging, field reranking
    """
    capability = capability_estimator(
        intent=decision_input.continuation.goal_frame.intent,
        field_top_score=decision_input.context_field.top_score,
        open_questions=decision_input.continuation.state.open_questions,
    )

    # Base routing by capability
    if capability in ("plan", "reason", "evaluate_evidence"):
        tier = "opus"
        use_advisor = False
    elif capability in ("synthesize", "draft", "compare"):
        tier = "sonnet"
        # Use advisor when entering novel territory
        use_advisor = decision_input.context_field.novelty_vs_prior_tick > 0.5
    elif capability in ("extract", "classify", "tag", "rerank"):
        tier = "haiku"
        use_advisor = False
    else:
        tier = "sonnet"
        use_advisor = False

    # Budget-aware downshift with advisor compensation
    budget = decision_input.budget
    if budget.remaining_dollars < budget.soft_cap * 0.3:
        # Deep budget pressure — drop everything to haiku
        tier = "haiku"
        use_advisor = False
    elif budget.remaining_dollars < budget.soft_cap * 0.6:
        # Moderate pressure — downshift one level, compensate with advisor
        if tier == "opus":
            tier = "sonnet"
            use_advisor = True   # sonnet+advisor ≈ 85% opus quality at 30% cost
        elif tier == "sonnet":
            tier = "haiku"
            use_advisor = True   # haiku+advisor ≈ sonnet quality at 40% cost

    return RouteDecision(
        tier=tier,
        use_advisor=use_advisor,
        rationale=f"capability={capability}, tier={tier}, advisor={use_advisor}, "
                  f"budget_remaining=${budget.remaining_dollars:.2f}",
    )

Tasks:

Add use_advisor: bool to the RouteDecision dataclass (the type returned by route_tier)
Implement five-tier routing logic in Layer B
Add capability_estimator stub (rules-based for v0, GBT in v1)
Log routing decisions to event log for warm-store training data
Unit tests: each capability → expected tier, budget downshift paths
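
The unit-test task could start from a table of cases like the one below. To stay self-contained, `route` is a reduced copy of the branching in `route_tier` that takes plain values instead of `DecisionInput`; the signature and the case values are illustrative assumptions.

```python
def route(capability: str, novelty: float, remaining: float, soft_cap: float) -> tuple:
    """Reduced copy of route_tier's branching; returns (tier, use_advisor)."""
    if capability in ("plan", "reason", "evaluate_evidence"):
        tier, use_advisor = "opus", False
    elif capability in ("synthesize", "draft", "compare"):
        tier, use_advisor = "sonnet", novelty > 0.5
    elif capability in ("extract", "classify", "tag", "rerank"):
        tier, use_advisor = "haiku", False
    else:
        tier, use_advisor = "sonnet", False

    if remaining < soft_cap * 0.3:
        tier, use_advisor = "haiku", False      # deep budget pressure
    elif remaining < soft_cap * 0.6:
        if tier == "opus":
            tier, use_advisor = "sonnet", True  # downshift with advisor
        elif tier == "sonnet":
            tier, use_advisor = "haiku", True
    return tier, use_advisor

# (capability, novelty, remaining_dollars, soft_cap) -> expected (tier, use_advisor)
CASES = [
    (("plan",  0.0, 100.0, 100.0), ("opus",   False)),
    (("draft", 0.9, 100.0, 100.0), ("sonnet", True)),
    (("draft", 0.1, 100.0, 100.0), ("sonnet", False)),
    (("tag",   0.0, 100.0, 100.0), ("haiku",  False)),
    (("plan",  0.0,  50.0, 100.0), ("sonnet", True)),   # moderate pressure
    (("draft", 0.1,  50.0, 100.0), ("haiku",  True)),   # moderate pressure
    (("plan",  0.0,  20.0, 100.0), ("haiku",  False)),  # deep pressure
]

failures = [(args, route(*args), want) for args, want in CASES if route(*args) != want]
```

The table makes one property of the routing visible: under this logic `haiku+advisor` is only reachable through the budget downshift, never from base routing.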

Phase 5: Failover Chain

If the primary Foundry endpoint fails, Maestro must fail over transparently.

File: core/maestro/dispatch/router.py

class FailoverModelRouter(ModelDispatch):
    """
    Wraps multiple ModelDispatch implementations in a failover chain.
    The continuation doesn't know or care which provider actually served it.
    """

    def __init__(self, providers: list[ModelDispatch]):
        self.providers = providers  # ordered by preference

    async def call(self, **kwargs) -> ModelCallResult:
        last_error = None
        for provider in self.providers:
            try:
                result = await provider.call(**kwargs)
                return result
            except ModelDispatchError as e:
                if not e.recoverable:
                    raise  # non-recoverable errors skip failover
                last_error = e
                continue
        raise ModelDispatchError(
            tier=kwargs["tier"],
            model_id="failover-exhausted",
            error=f"All {len(self.providers)} providers failed. Last: {last_error}",
            recoverable=False,
        )

Azure failover chain:

Foundry proxy (services.ai.azure.com/anthropic) — primary
Direct Anthropic API (api.anthropic.com) — backup
Bedrock Claude (cross-cloud) — last resort

AWS failover chain:

Bedrock — primary
Direct Anthropic API — backup
Foundry proxy (cross-cloud) — last resort

Tasks:

Create core/maestro/dispatch/router.py
Create core/maestro/dispatch/providers/anthropic_direct.py
Wire failover chain in substrate config per cloud
Integration test: simulate Foundry 503 → verify fallback to direct API
Emit failover events to continuation event log
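
The integration test for a simulated Foundry 503 could be prototyped with fakes before any cloud wiring exists. In this sketch `FakeProvider` and `call_with_failover` are hypothetical; the loop mirrors `FailoverModelRouter.call`, and the exception is a reduced stand-in for `ModelDispatchError`.

```python
import asyncio

class ModelDispatchError(Exception):
    """Reduced stand-in carrying only the flag the router inspects."""

    def __init__(self, error: str, recoverable: bool):
        super().__init__(error)
        self.recoverable = recoverable

class FakeProvider:
    """Hypothetical provider that either succeeds or raises a preset error."""

    def __init__(self, name: str, fail_with: Exception = None):
        self.name = name
        self.fail_with = fail_with
        self.calls = 0

    async def call(self, **kwargs):
        self.calls += 1
        if self.fail_with is not None:
            raise self.fail_with
        return {"served_by": self.name, "tier": kwargs["tier"]}

async def call_with_failover(providers: list, **kwargs):
    """Same control flow as FailoverModelRouter.call, reduced to essentials."""
    last_error = None
    for provider in providers:
        try:
            return await provider.call(**kwargs)
        except ModelDispatchError as e:
            if not e.recoverable:
                raise
            last_error = e
    raise ModelDispatchError(f"all providers failed; last: {last_error}", recoverable=False)

foundry = FakeProvider("foundry", fail_with=ModelDispatchError("503", recoverable=True))
direct = FakeProvider("anthropic-direct")
bedrock = FakeProvider("bedrock")

result = asyncio.run(
    call_with_failover([foundry, direct, bedrock], tier="sonnet", messages=[])
)
```

The call counters confirm that the chain stops at the first healthy provider and never reaches the last-resort one.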

Phase 6: Metering and Budget Charge Events

Every model call emits a budget_charge event to the continuation event log.

# In worker tick, after model call:
event = ContinuationEvent(
    kind="budget_charge",
    payload={
        "tier": result.tier_used,
        "model_id": result.model_id,
        "executor_tokens": {
            "input": result.usage.executor_input,
            "output": result.usage.executor_output,
        },
        "advisor_tokens": {
            "input": result.usage.advisor_input,
            "output": result.usage.advisor_output,
        },
        "cache_tokens": {
            "read": result.usage.cache_read,
            "creation": result.usage.cache_creation,
        },
        "cost_dollars": result.cost_dollars,
        "latency_ms": result.latency_ms,
        "advisor_consulted": result.advisor_consulted,
    },
    cost={
        "tokens": result.usage.total_tokens,
        "dollars": result.cost_dollars,
        "wall_ms": result.latency_ms,
    },
)
await event_log.append(continuation_id, event)

# Deduct from budget
budget.charge(result.cost_dollars, result.usage)

Tasks:

Extend BudgetVector.charge() with per-tier token tracking
Add advisor token tracking: remaining_advisor_calls decrement
Wire budget_charge events into CDC → warm store pipeline
Warm store SQL: fact_model_call materialized view for cost analytics
Alert when continuation hits 80% of soft cap
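
One way to approach the `BudgetVector.charge()` tasks is sketched below. Several things are assumptions: the flattened `charge` signature (the worker code above passes a usage object instead), the starting value of `remaining_advisor_calls`, one decrement per advisor-consulted call, and the reading of "80% of soft cap" as spend reaching 80% of `soft_cap`.

```python
from dataclasses import dataclass, field

@dataclass
class BudgetVector:
    soft_cap: float
    remaining_dollars: float
    remaining_advisor_calls: int = 10  # assumed starting allowance
    tokens_by_tier: dict = field(default_factory=dict)
    alerted: bool = False

    def charge(self, cost_dollars: float, tier: str, tokens: int,
               advisor_consulted: bool = False) -> bool:
        """Deduct one call. Returns True the first time spend crosses 80% of soft cap."""
        self.remaining_dollars -= cost_dollars
        self.tokens_by_tier[tier] = self.tokens_by_tier.get(tier, 0) + tokens
        if advisor_consulted:
            self.remaining_advisor_calls = max(0, self.remaining_advisor_calls - 1)

        spent = self.soft_cap - self.remaining_dollars
        if not self.alerted and spent >= 0.8 * self.soft_cap:
            self.alerted = True  # fire the alert once, not on every later charge
            return True
        return False

budget = BudgetVector(soft_cap=10.0, remaining_dollars=10.0)
first = budget.charge(7.0, "opus", 1000)
second = budget.charge(1.5, "sonnet", 500, advisor_consulted=True)
third = budget.charge(0.1, "haiku", 100)
```

Returning the alert flag from `charge` keeps the alerting decision next to the state change; the caller decides how to emit it.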

Phase 7: Environment Configuration

Azure Environment Variables

# Foundry Anthropic proxy (existing, extend)
ANTHROPIC_FOUNDRY_ENDPOINT=https://cdsopenaigpt4.services.ai.azure.com
ANTHROPIC_FOUNDRY_KEY=<from-key-vault>

# Model deployment IDs (match Foundry portal deployment names)
MAESTRO_OPUS_MODEL_ID=claude-opus-4-6
MAESTRO_SONNET_MODEL_ID=claude-sonnet-4-6
MAESTRO_HAIKU_MODEL_ID=claude-haiku-4-5

# Advisor Strategy
MAESTRO_ADVISOR_ENABLED=true
MAESTRO_ADVISOR_MODEL=claude-opus-4-6
MAESTRO_ADVISOR_MAX_USES_PER_TICK=3

# Failover
ANTHROPIC_DIRECT_API_KEY=<from-key-vault>     # backup: direct Anthropic API
MAESTRO_FAILOVER_ENABLED=true

AWS Environment Variables

# Bedrock (primary on AWS)
AWS_BEDROCK_REGION=us-east-1
MAESTRO_OPUS_MODEL_ID=anthropic.claude-opus-4-6-20250514
MAESTRO_SONNET_MODEL_ID=anthropic.claude-sonnet-4-6-20250514
MAESTRO_HAIKU_MODEL_ID=anthropic.claude-haiku-4-5-20251001

# Failover
ANTHROPIC_DIRECT_API_KEY=<from-secrets-manager>
MAESTRO_FAILOVER_ENABLED=true

Tasks:

Add all env vars to substrate-azure/maestro_azure/config.py
Add all env vars to substrate-aws/maestro_aws/config.py
Document in docs/runbooks/model_deployment.md
Add Key Vault / Secrets Manager provisioning to IaC
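
Loading the Azure variables into the config model could look like this sketch. It uses a standard-library dataclass; `FoundryConfig`, `load_config`, and `_as_bool` are hypothetical names, and the `advisor_max_uses_per_tick` field is an assumed addition to hold `MAESTRO_ADVISOR_MAX_USES_PER_TICK`. The endpoint and key in the usage example are placeholders.

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class FoundryConfig:
    anthropic_foundry_endpoint: str
    anthropic_foundry_key: str
    opus_model_id: str = "claude-opus-4-6"
    sonnet_model_id: str = "claude-sonnet-4-6"
    haiku_model_id: str = "claude-haiku-4-5"
    advisor_enabled: bool = True
    advisor_model: str = "claude-opus-4-6"
    advisor_max_uses_per_tick: int = 3

def _as_bool(value: str) -> bool:
    return value.strip().lower() in ("1", "true", "yes")

def load_config(env: Mapping[str, str] = os.environ) -> FoundryConfig:
    """Build the config from environment variables; required keys raise KeyError."""
    return FoundryConfig(
        anthropic_foundry_endpoint=env["ANTHROPIC_FOUNDRY_ENDPOINT"].rstrip("/"),
        anthropic_foundry_key=env["ANTHROPIC_FOUNDRY_KEY"],
        opus_model_id=env.get("MAESTRO_OPUS_MODEL_ID", "claude-opus-4-6"),
        sonnet_model_id=env.get("MAESTRO_SONNET_MODEL_ID", "claude-sonnet-4-6"),
        haiku_model_id=env.get("MAESTRO_HAIKU_MODEL_ID", "claude-haiku-4-5"),
        advisor_enabled=_as_bool(env.get("MAESTRO_ADVISOR_ENABLED", "true")),
        advisor_model=env.get("MAESTRO_ADVISOR_MODEL", "claude-opus-4-6"),
        advisor_max_uses_per_tick=int(env.get("MAESTRO_ADVISOR_MAX_USES_PER_TICK", "3")),
    )

cfg = load_config({
    "ANTHROPIC_FOUNDRY_ENDPOINT": "https://example.services.ai.azure.com/",
    "ANTHROPIC_FOUNDRY_KEY": "placeholder-key",
    "MAESTRO_ADVISOR_ENABLED": "false",
})
```

Passing the environment as a mapping keeps the loader testable without mutating `os.environ`.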

File Inventory

New Files

File	Language	Purpose
`substrate-azure/maestro_azure/model_dispatch.py`	Python	Azure Foundry proxy dispatch
`substrate-azure/maestro_azure/config.py`	Python	Azure substrate configuration
`substrate-aws/maestro_aws/model_dispatch.py`	Python	AWS Bedrock dispatch
`core/maestro/dispatch/router.py`	Python	Failover chain router
`core/maestro/dispatch/providers/anthropic_direct.py`	Python	Direct API fallback
`core/maestro/dispatch/metering.py`	Python	Cost computation + budget charging
`policy-kernel/kernel/layers/b_routing.py`	Python	Five-tier routing with advisor
`eval/sql/fact_model_call.sql`	SQL	Warm store materialized view
`docs/runbooks/model_deployment.md`	Markdown	Foundry portal setup guide
`docs/adr/0002-stateless-model-dispatch.md`	Markdown	ADR: why model calls are stateless

Modified Files

File	Changes
`core/maestro/substrate/interface.py`	Add `ModelDispatch` protocol
`core/maestro/substrate/types.py`	Add `ModelCallResult`, `ModelUsage`
`core/maestro/models/budget.py`	Add per-tier tracking, advisor call counting
`core/maestro/worker/tick.py`	Wire model dispatch + budget charging

Execution Order

Phase 0: Foundry Portal deployments       ← manual, no code, do first
Phase 1: Substrate interface extension     ← protocol only, no cloud SDK
Phase 2: Azure Foundry implementation      ← primary dispatch path
Phase 3: AWS Bedrock implementation        ← parallel with Phase 2
Phase 4: Policy Kernel routing             ← depends on Phase 1
Phase 5: Failover chain                    ← depends on Phase 2 + 3
Phase 6: Metering + budget events          ← depends on Phase 2
Phase 7: Environment config + IaC          ← depends on Phase 2 + 3

Estimated effort: 8–10 days
Risk: Phase 0 is the only blocking manual step — everything else is code.

ADR: Why Model Calls Are Stateless (Summary)

Decision: Maestro model calls are pure functions with no provider-side state.

Context: Foundry Agents, OpenAI Assistants, and Bedrock Agents all offer thread-based stateful sessions. These sessions bind reasoning state to the model provider.

Rationale: Maestro journals run for weeks. Binding session state to a model provider means:

Provider outage = session lost
No migration between workers
No failover between providers
No budget-driven tier switching mid-session
No replay from event log

By making model calls stateless, the Continuation Store becomes the single source of truth for all session state. The model is a commodity compute resource. The intelligence is in the continuation + field + policy kernel.

Consequences: Every model call must carry its full context (system prompt + field manifest + recent state). This costs more tokens per call but eliminates all provider coupling. Prompt caching (supported through Foundry proxy) amortizes the context cost for frequently-recomputed fields.
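
To illustrate the prompt-caching point, a request can mark the stable prefix (system prompt plus field manifest) with `cache_control` so that repeated ticks reuse it, while the per-tick state stays uncached. This is a sketch: `build_cached_request` is a hypothetical helper, the split between stable and volatile content is an assumption, and caching only takes effect above the API's minimum cacheable prompt length.

```python
def build_cached_request(
    model: str,
    system_prompt: str,
    field_manifest: str,
    recent_state: str,
    max_tokens: int = 1024,
) -> dict:
    """Build Messages API kwargs with a cache breakpoint after the stable prefix."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": field_manifest,
                # Everything up to and including this block is cacheable.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Volatile per-tick content goes after the breakpoint, uncached.
        "messages": [{"role": "user", "content": recent_state}],
    }

request = build_cached_request(
    model="claude-sonnet-4-6",
    system_prompt="You are the Maestro executor.",
    field_manifest="doc-17: summary; doc-42: summary",
    recent_state="Open question: which region hosts the warm store?",
)
```

Cache hits then show up as `cache_read_input_tokens` in the response usage, which is what the `cache_read` field in `ModelUsage` records.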

Status: Accepted.