TASK — Azure Foundry Tiered Model Dispatch for Maestro Continuations

Type: Feature — Infrastructure + Substrate
Priority: P0 (blocks P2-006 of maestro-build-plan.yaml)
Branch: feature/foundry-tiered-dispatch
Depends on: P1-001 (Substrate interface), P1-002 (Continuation data model)
Blocks: P2-005 (Policy Kernel v0), P2-006 (Model dispatch), P2-007 (Budget enforcement)


Context

Maestro’s Policy Kernel routes every continuation tick through a tiered model decision: opus for planning/reasoning, sonnet for synthesis/drafting, haiku for classification/extraction. The current BarnOS OpusService calls a single Claude model through Azure Foundry. This task extends the infrastructure to support all three tiers — plus the Anthropic Advisor Strategy — through the Foundry proxy endpoint, without using the Foundry Agent SDK’s thread-based pattern.

Why Not Foundry Agent SDK Threads?

Foundry Agents use a thread lifecycle: createThread → addMessage → createRun → poll → getMessages → deleteThread. This is chat-shaped — it assumes request/response cycles that complete in seconds. Maestro continuations run for weeks. The continuation store owns session state; model calls must be stateless ticks. Each tick is a fresh Messages API call with context assembled from the continuation’s state and field manifest.

The Foundry proxy endpoint (https://{resource}.services.ai.azure.com/anthropic/v1/messages) exposes the raw Anthropic Messages API format. This is the correct integration point for Maestro: it exposes the full API surface (tool_use, extended thinking, prompt caching, Advisor Strategy) while keeping billing inside Azure Marketplace.

The Core Principle: Decoupled Model Calls

┌───────────────── Traditional Agent Session ──────────────────┐
│                                                               │
│  Model Provider owns the session:                             │
│  Thread created → messages accumulate → thread deleted        │
│  Duration: minutes. State: inside the provider.               │
│  Provider outage = session lost.                              │
│                                                               │
└───────────────────────────────────────────────────────────────┘

┌───────────────── Maestro Continuation ───────────────────────┐
│                                                               │
│  Continuation Store owns the session:                         │
│  Continuation persisted in Cosmos/DynamoDB with full state.   │
│  Each tick: assemble context → stateless model call → persist │
│  Duration: weeks. State: in the continuation store.           │
│  Provider outage = retry on next tick or failover to backup.  │
│                                                               │
│  Model call is a PURE FUNCTION:                               │
│    f(goal_frame, field_manifest, state, tools) → response     │
│                                                               │
│  The model has no memory of prior ticks.                      │
│  The continuation store has all the memory.                   │
│  The field manifest has all the context.                      │
│  The model just reasons over what it's given.                 │
│                                                               │
└───────────────────────────────────────────────────────────────┘
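The tick loop in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not Maestro's implementation: `Continuation`, `InMemoryStore`, `assemble_context`, `run_tick`, and `echo_model` are hypothetical stand-ins for the continuation record, the Cosmos/DynamoDB store, field assembly, the worker tick, and the model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Continuation:
    """Hypothetical continuation record; the store owns all session state."""
    goal_frame: str
    state: dict = field(default_factory=dict)
    tick_count: int = 0


class InMemoryStore:
    """Stand-in for the Cosmos/DynamoDB continuation store."""

    def __init__(self) -> None:
        self._rows: dict[str, Continuation] = {}

    def load(self, cid: str) -> Continuation:
        return self._rows[cid]

    def save(self, cid: str, cont: Continuation) -> None:
        self._rows[cid] = cont


def assemble_context(cont: Continuation, field_manifest: list[str]) -> list[dict]:
    """Build the full message list from scratch on every tick."""
    prompt = (
        f"Goal: {cont.goal_frame}\n"
        f"State: {cont.state}\n"
        f"Context: {'; '.join(field_manifest)}"
    )
    return [{"role": "user", "content": prompt}]


def run_tick(
    store: InMemoryStore,
    cid: str,
    field_manifest: list[str],
    model: Callable[[list[dict]], str],
) -> str:
    cont = store.load(cid)                             # 1. load persisted state
    messages = assemble_context(cont, field_manifest)  # 2. assemble context
    reply = model(messages)                            # 3. stateless model call
    cont.state["last_reply"] = reply                   # 4. persist the result
    cont.tick_count += 1
    store.save(cid, cont)
    return reply


def echo_model(messages: list[dict]) -> str:
    """Fake model: sees only what is passed in and remembers nothing."""
    return f"seen {len(messages[0]['content'])} chars"


store = InMemoryStore()
store.save("c1", Continuation(goal_frame="draft report"))
first = run_tick(store, "c1", ["doc A"], echo_model)
second = run_tick(store, "c1", ["doc A"], echo_model)
```

Because the model function receives everything it needs as arguments, swapping `echo_model` for a different provider changes nothing in the store or the tick loop.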

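This decoupling is why Maestro can run a continuation for weeks, retry a failed tick on the next wake, and fail over to a backup provider without losing session state.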


Phase 0: Azure Foundry Portal — Deploy Three Model Tiers

Prerequisites (Manual — Foundry Portal)

These steps must be completed in the Azure portal before any code runs.

Foundry Project: Must be in a supported region (East US 2 or Sweden Central).

Deploy each model via Model Catalog → search “Claude” → Deploy → Global Standard:

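| # | Deployment Name   | Model             | Tier          | Purpose                                                 |
|---|-------------------|-------------------|---------------|---------------------------------------------------------|
| 1 | claude-opus-4-6   | Claude Opus 4.6   | P0 (Critical) | Planning, synthesis, complex reasoning, KB authoring    |
| 2 | claude-sonnet-4-6 | Claude Sonnet 4.6 | P1 (Standard) | Drafting, analysis, tool selection, evidence summaries  |
| 3 | claude-haiku-4-5  | Claude Haiku 4.5  | P2 (Batch)    | Classification, tagging, field reranking, extraction    |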

Verification: After deployment, confirm each model responds:

# Test each deployment (replace {resource} and {key})
for MODEL in claude-opus-4-6 claude-sonnet-4-6 claude-haiku-4-5; do
  curl -s -w "\n%{http_code}" \
    "https://{resource}.services.ai.azure.com/anthropic/v1/messages" \
    -H "x-api-key: {key}" \
    -H "anthropic-version: 2023-06-01" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"$MODEL\",
      \"max_tokens\": 50,
      \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]
    }"
  echo "--- $MODEL ---"
done

All three should return HTTP 200 with a valid response.

Azure Marketplace: Ensure Marketplace purchasing is enabled at the subscription level. Claude models are billed through Marketplace, not Azure compute. Each model tier has independent pay-as-you-go metering.


Phase 1: Substrate ModelDispatch Interface

File: core/maestro/substrate/interface.py

Extend the Substrate interface to support tiered, metered, advisor-aware model dispatch.

# Addition to existing Substrate Protocol in P1-001

from typing import Protocol, runtime_checkable
from maestro.models.budget import BudgetVector

@runtime_checkable
class ModelDispatch(Protocol):
    """
    Stateless model invocation interface.
    
    Every call is independent — no session, no thread, no memory
    between calls. The continuation store provides all state;
    the context field provides all context; the model just reasons.
    
    This is the core architectural bet: model calls are pure functions.
    Duration lives in the continuation store, not the model provider.
    """

    async def call(
        self,
        tier: str,                    # "opus" | "sonnet" | "haiku"
        messages: list[dict],         # assembled from continuation state + field
        tools: list[dict] | None,     # MCP tools available for this tick
        budget: BudgetVector,         # remaining allowances
        mode: str,                    # "sync" | "async" | "batch"
        use_advisor: bool = False,    # invoke Advisor Strategy (Opus as advisor)
        system: str | None = None,    # system prompt from goal frame + policy
        thinking: str = "adaptive",   # "enabled" | "disabled" | "adaptive"
    ) -> "ModelCallResult":
        ...

    async def health_check(self) -> dict[str, bool]:
        """Check reachability of each deployed tier."""
        ...

    def supported_tiers(self) -> list[str]:
        """Return list of available model tiers."""
        ...
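
A protocol like this is straightforward to satisfy with an in-memory double for unit tests. The sketch below is illustrative only: `MiniDispatch` is a trimmed, hypothetical copy of the interface (it drops the budget, mode, advisor, and thinking parameters to stay self-contained), and `FakeDispatch` is an assumed test helper, not part of the plan.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class MiniDispatch(Protocol):
    """Trimmed, hypothetical copy of the ModelDispatch shape."""

    async def call(self, tier: str, messages: list[dict]) -> dict: ...

    async def health_check(self) -> dict[str, bool]: ...

    def supported_tiers(self) -> list[str]: ...


class FakeDispatch:
    """In-memory double: no network, deterministic replies."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def call(self, tier: str, messages: list[dict]) -> dict:
        if tier not in self.supported_tiers():
            raise ValueError(f"unknown tier: {tier}")
        self.calls.append(tier)
        return {"tier": tier, "text": f"{tier} saw {len(messages)} message(s)"}

    async def health_check(self) -> dict[str, bool]:
        return {tier: True for tier in self.supported_tiers()}

    def supported_tiers(self) -> list[str]:
        return ["opus", "sonnet", "haiku"]


fake = FakeDispatch()
# runtime_checkable only verifies that the methods exist, not their signatures.
conforms = isinstance(fake, MiniDispatch)
result = asyncio.run(fake.call("haiku", [{"role": "user", "content": "tag this"}]))
health = asyncio.run(fake.health_check())
```

Note that `isinstance` against a runtime-checkable protocol is a structural presence check, so signature drift between implementations still needs a type checker or contract tests.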

Tasks:


Phase 2: Azure Foundry ModelDispatch Implementation

File: substrate-azure/maestro_azure/model_dispatch.py

"""
Azure Foundry ModelDispatch — calls Claude models through the Foundry
Anthropic proxy endpoint. Stateless. No threads. No sessions.

Architecture:
  Continuation Store (Cosmos DB) → owns state, sleeps, wakes
  Context Field (AI Search)      → provides ranked context
  THIS SERVICE                   → pure model invocation
  Policy Kernel                  → decides tier + mode

The model has no memory. The continuation has all the memory.
"""

import anthropic
import time
from maestro.substrate.interface import ModelDispatch
# ModelDispatchError is assumed to live alongside the other dispatch types.
from maestro.substrate.types import ModelCallResult, ModelDispatchError, ModelUsage
from maestro.models.budget import BudgetVector


class AzureFoundryModelDispatch(ModelDispatch):

    def __init__(self, config):
        self.endpoint = config.anthropic_foundry_endpoint
        # e.g., "https://cdsopenaigpt4.services.ai.azure.com"

        # Async client, so call() does not block the event loop.
        # The SDK appends "/v1/messages" itself, so base_url stops at "/anthropic".
        self.client = anthropic.AsyncAnthropic(
            api_key=config.anthropic_foundry_key,
            base_url=f"{self.endpoint}/anthropic",
        )

        # Deployment names must match Foundry portal deployments
        self.tier_to_model: dict[str, str] = {
            "opus":   config.opus_model_id   or "claude-opus-4-6",
            "sonnet": config.sonnet_model_id or "claude-sonnet-4-6",
            "haiku":  config.haiku_model_id  or "claude-haiku-4-5",
        }

        # Advisor configuration
        self.advisor_model = config.advisor_model or "claude-opus-4-6"
        self.advisor_enabled = config.advisor_enabled  # default: True

        # Pricing in USD per 1M tokens (for cost computation).
        # Confirm these figures against the current Marketplace price list.
        self._pricing = {
            "claude-opus-4-6":   {"input": 15.0, "output": 75.0},
            "claude-sonnet-4-6": {"input": 3.0,  "output": 15.0},
            "claude-haiku-4-5":  {"input": 0.80, "output": 4.0},
        }

    async def call(
        self,
        tier: str,
        messages: list[dict],
        tools: list[dict] | None,
        budget: BudgetVector,
        mode: str,
        use_advisor: bool = False,
        system: str | None = None,
        thinking: str = "adaptive",
    ) -> ModelCallResult:

        model_id = self.tier_to_model[tier]
        request_tools = list(tools or [])
        betas = []

        # Inject Advisor tool for Sonnet/Haiku ticks while advisor calls remain
        if (
            use_advisor
            and tier != "opus"
            and self.advisor_enabled
            and budget.remaining_advisor_calls > 0
        ):
            request_tools.append({
                "type": "advisor_20260301",
                "name": "advisor",
                "model": self.advisor_model,
                "max_uses": min(3, budget.remaining_advisor_calls),
            })
            betas.append("advisor-tool-2026-03-01")

        # Compute token limit from budget
        max_tokens = min(
            budget.remaining_tokens_for_tier(tier),
            16384,  # safety cap per tick
        )

        start_ms = time.monotonic_ns() // 1_000_000

        try:
            kwargs = {
                "model": model_id,
                "messages": messages,
                "max_tokens": max_tokens,
            }
            if system:
                kwargs["system"] = system
            if request_tools:
                kwargs["tools"] = request_tools

            # Extended thinking for Opus 4.6 and Sonnet 4.6
            if thinking != "disabled" and tier in ("opus", "sonnet"):
                kwargs["thinking"] = {"type": thinking}
                if thinking == "enabled":
                    # budget_tokens must stay below max_tokens. Adaptive thinking
                    # is assumed to size its own budget, so none is sent for it.
                    kwargs["thinking"]["budget_tokens"] = min(8192, max_tokens - 1)

            # Beta features are only accepted on the SDK's beta Messages surface.
            if betas:
                response = await self.client.beta.messages.create(betas=betas, **kwargs)
            else:
                response = await self.client.messages.create(**kwargs)

        except anthropic.APIError as e:
            # Surface the failure; the continuation store decides retry vs. failover.
            # Connection errors carry no status code and are treated as recoverable.
            status = getattr(e, "status_code", None)
            raise ModelDispatchError(
                tier=tier,
                model_id=model_id,
                error=str(e),
                recoverable=status is None or status in (429, 500, 502, 503),
            ) from e

        elapsed_ms = (time.monotonic_ns() // 1_000_000) - start_ms

        # Parse usage, separating executor and advisor tokens.
        # Optional counters may be absent or None, so both cases fall back to 0.
        raw = response.usage
        usage = ModelUsage(
            executor_input=raw.input_tokens,
            executor_output=raw.output_tokens,
            advisor_input=getattr(raw, "advisor_input_tokens", 0) or 0,
            advisor_output=getattr(raw, "advisor_output_tokens", 0) or 0,
            cache_read=getattr(raw, "cache_read_input_tokens", 0) or 0,
            cache_creation=getattr(raw, "cache_creation_input_tokens", 0) or 0,
            model_id=model_id,
            tier=tier,
        )

        return ModelCallResult(
            content=response.content,
            usage=usage,
            cost_dollars=self._compute_cost(usage),
            tier_used=tier,
            model_id=model_id,
            advisor_consulted=use_advisor and usage.advisor_output > 0,
            latency_ms=elapsed_ms,
            stop_reason=response.stop_reason,
        )

    def _compute_cost(self, usage: ModelUsage) -> float:
        """Compute dollar cost from token usage and pricing table.

        Cache read/creation tokens are not priced here.
        """
        model_price = self._pricing.get(usage.model_id, self._pricing["claude-sonnet-4-6"])
        executor_cost = (
            (usage.executor_input * model_price["input"] / 1_000_000)
            + (usage.executor_output * model_price["output"] / 1_000_000)
        )
        advisor_cost = 0.0
        if usage.advisor_input > 0 or usage.advisor_output > 0:
            # Price advisor tokens at the configured advisor model (Opus by default)
            advisor_price = self._pricing.get(
                self.advisor_model, self._pricing["claude-opus-4-6"]
            )
            advisor_cost = (
                (usage.advisor_input * advisor_price["input"] / 1_000_000)
                + (usage.advisor_output * advisor_price["output"] / 1_000_000)
            )
        return executor_cost + advisor_cost

    async def health_check(self) -> dict[str, bool]:
        """Ping each tier with a minimal request."""
        results = {}
        for tier, model_id in self.tier_to_model.items():
            try:
                resp = await self.client.messages.create(
                    model=model_id,
                    max_tokens=10,
                    messages=[{"role": "user", "content": "health"}],
                )
                results[tier] = resp.stop_reason is not None
            except Exception:
                results[tier] = False
        return results

    def supported_tiers(self) -> list[str]:
        return list(self.tier_to_model.keys())
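
As a worked example of the cost arithmetic, the sketch below prices one Sonnet tick that also consulted the advisor, using the per-million-token figures from the `_pricing` table above. The token counts and the `token_cost` helper are made up for illustration.

```python
# Prices in USD per 1M tokens, copied from the pricing table in the class above.
PRICING_PER_MTOK = {
    "claude-opus-4-6":   {"input": 15.0, "output": 75.0},
    "claude-sonnet-4-6": {"input": 3.0,  "output": 15.0},
    "claude-haiku-4-5":  {"input": 0.80, "output": 4.0},
}


def token_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: tokens times price, scaled from per-million."""
    price = PRICING_PER_MTOK[model_id]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


# Executor: 10,000 in + 2,000 out on Sonnet -> 0.03 + 0.03
executor = token_cost("claude-sonnet-4-6", 10_000, 2_000)
# Advisor: 4,000 in + 500 out on Opus -> 0.06 + 0.0375
advisor = token_cost("claude-opus-4-6", 4_000, 500)
total = executor + advisor
```

With these hypothetical counts the advisor accounts for more than half of the tick's cost, which is why advisor tokens are metered separately from executor tokens.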

Tasks:


Phase 3: AWS Bedrock ModelDispatch Implementation

File: substrate-aws/maestro_aws/model_dispatch.py

Same interface, AWS path. Uses boto3 Bedrock Runtime invoke_model.

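import boto3
from maestro.substrate.interface import ModelDispatch
# Module path follows the file inventory entry
# core/maestro/dispatch/providers/anthropic_direct.py; the class name is assumed.
from maestro.dispatch.providers.anthropic_direct import AnthropicDirectDispatch


class BedrockModelDispatch(ModelDispatch):

    def __init__(self, config):
        self.bedrock = boto3.client("bedrock-runtime", region_name=config.region)
        self.tier_to_model = {
            "opus":   "anthropic.claude-opus-4-6-20250514",
            "sonnet": "anthropic.claude-sonnet-4-6-20250514",
            "haiku":  "anthropic.claude-haiku-4-5-20251001",
        }
        # Advisor Strategy: NOT natively supported on Bedrock
        # Must use Anthropic direct API as fallback for advisor calls
        self.advisor_fallback = AnthropicDirectDispatch(config.anthropic_direct_key)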

Tasks:


Phase 4: Policy Kernel Tier Routing with Advisor Strategy

Extend the Policy Kernel’s Layer B to support a five-tier gradient instead of three:

Quality/Cost gradient (highest to lowest):
  opus → sonnet+advisor → sonnet → haiku+advisor → haiku

File: policy-kernel/kernel/layers/b_routing.py

def route_tier(decision_input: DecisionInput) -> RouteDecision:
    """
    Layer B: pick the cheapest model tier likely to make progress.
    
    Five-tier gradient:
      opus            — complex planning, multi-step reasoning
      sonnet+advisor  — synthesis with strategic guidance
      sonnet          — standard drafting, analysis
      haiku+advisor   — extraction with quality check
      haiku           — classification, tagging, field reranking
    """
    capability = capability_estimator(
        intent=decision_input.continuation.goal_frame.intent,
        field_top_score=decision_input.context_field.top_score,
        open_questions=decision_input.continuation.state.open_questions,
    )

    # Base routing by capability
    if capability in ("plan", "reason", "evaluate_evidence"):
        tier = "opus"
        use_advisor = False
    elif capability in ("synthesize", "draft", "compare"):
        tier = "sonnet"
        # Use advisor when entering novel territory
        use_advisor = decision_input.context_field.novelty_vs_prior_tick > 0.5
    elif capability in ("extract", "classify", "tag", "rerank"):
        tier = "haiku"
        use_advisor = False
    else:
        tier = "sonnet"
        use_advisor = False

    # Budget-aware downshift with advisor compensation
    budget = decision_input.budget
    if budget.remaining_dollars < budget.soft_cap * 0.3:
        # Deep budget pressure — drop everything to haiku
        tier = "haiku"
        use_advisor = False
    elif budget.remaining_dollars < budget.soft_cap * 0.6:
        # Moderate pressure — downshift one level, compensate with advisor
        if tier == "opus":
            tier = "sonnet"
            use_advisor = True   # sonnet+advisor ≈ 85% opus quality at 30% cost
        elif tier == "sonnet":
            tier = "haiku"
            use_advisor = True   # haiku+advisor ≈ sonnet quality at 40% cost

    return RouteDecision(
        tier=tier,
        use_advisor=use_advisor,
        rationale=f"capability={capability}, tier={tier}, advisor={use_advisor}, "
                  f"budget_remaining=${budget.remaining_dollars:.2f}",
    )
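
The budget-pressure step can be isolated and checked on its own. The sketch below copies the 0.3 and 0.6 soft-cap thresholds from `route_tier`; the function name `apply_budget_pressure` and the sample dollar amounts are illustrative assumptions.

```python
def apply_budget_pressure(
    tier: str,
    use_advisor: bool,
    remaining_dollars: float,
    soft_cap: float,
) -> tuple[str, bool]:
    """Downshift a base routing decision according to remaining budget."""
    if remaining_dollars < soft_cap * 0.3:
        # Deep pressure: everything drops to plain haiku.
        return "haiku", False
    if remaining_dollars < soft_cap * 0.6:
        # Moderate pressure: one level down, compensated by the advisor.
        if tier == "opus":
            return "sonnet", True
        if tier == "sonnet":
            return "haiku", True
    # No pressure, or already at haiku: keep the base decision.
    return tier, use_advisor


SOFT_CAP = 100.0
no_pressure = apply_budget_pressure("opus", False, 80.0, SOFT_CAP)
moderate_opus = apply_budget_pressure("opus", False, 50.0, SOFT_CAP)
moderate_sonnet = apply_budget_pressure("sonnet", False, 50.0, SOFT_CAP)
moderate_haiku = apply_budget_pressure("haiku", False, 50.0, SOFT_CAP)
deep = apply_budget_pressure("opus", False, 20.0, SOFT_CAP)
```

Under this logic `haiku+advisor` is only reachable through a budget downshift from sonnet; base routing never selects it directly.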

Tasks:


Phase 5: Failover Chain

If the primary Foundry endpoint fails, Maestro must fail over transparently.

File: core/maestro/dispatch/router.py

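from maestro.substrate.interface import ModelDispatch
# ModelDispatchError is assumed to be defined alongside ModelCallResult.
from maestro.substrate.types import ModelCallResult, ModelDispatchError


class FailoverModelRouter(ModelDispatch):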
    """
    Wraps multiple ModelDispatch implementations in a failover chain.
    The continuation doesn't know or care which provider actually served it.
    """

    def __init__(self, providers: list[ModelDispatch]):
        self.providers = providers  # ordered by preference

    async def call(self, **kwargs) -> ModelCallResult:
        last_error = None
        for provider in self.providers:
            try:
                result = await provider.call(**kwargs)
                return result
            except ModelDispatchError as e:
                if not e.recoverable:
                    raise  # non-recoverable errors skip failover
                last_error = e
                continue
        raise ModelDispatchError(
            tier=kwargs["tier"],
            model_id="failover-exhausted",
            error=f"All {len(self.providers)} providers failed. Last: {last_error}",
            recoverable=False,
        )

Azure failover chain:

  1. Foundry proxy (services.ai.azure.com/anthropic) — primary
  2. Direct Anthropic API (api.anthropic.com) — backup
  3. Bedrock Claude (cross-cloud) — last resort

AWS failover chain:

  1. Bedrock — primary
  2. Direct Anthropic API — backup
  3. Foundry proxy (cross-cloud) — last resort

Tasks:


Phase 6: Metering and Budget Charge Events

Every model call emits a budget_charge event to the continuation event log.

# In worker tick, after model call:
event = ContinuationEvent(
    kind="budget_charge",
    payload={
        "tier": result.tier_used,
        "model_id": result.model_id,
        "executor_tokens": {
            "input": result.usage.executor_input,
            "output": result.usage.executor_output,
        },
        "advisor_tokens": {
            "input": result.usage.advisor_input,
            "output": result.usage.advisor_output,
        },
        "cache_tokens": {
            "read": result.usage.cache_read,
            "creation": result.usage.cache_creation,
        },
        "cost_dollars": result.cost_dollars,
        "latency_ms": result.latency_ms,
        "advisor_consulted": result.advisor_consulted,
    },
    cost={
        "tokens": result.usage.total_tokens,
        "dollars": result.cost_dollars,
        "wall_ms": result.latency_ms,
    },
)
await event_log.append(continuation_id, event)

# Deduct from budget
budget.charge(result.cost_dollars, result.usage)
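
The event above reads `result.usage.total_tokens` and then calls `budget.charge(...)`, neither of which is defined in this document. One possible shape is sketched below. `UsageSketch` and `BudgetSketch` are hypothetical, and the choices to exclude cache tokens from the total and to count one advisor call per tick are assumptions, not decisions from the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSketch:
    """Hypothetical mirror of the ModelUsage token fields."""
    executor_input: int = 0
    executor_output: int = 0
    advisor_input: int = 0
    advisor_output: int = 0
    cache_read: int = 0
    cache_creation: int = 0

    @property
    def total_tokens(self) -> int:
        # Assumption: cache tokens are reported separately and not summed here.
        return (
            self.executor_input
            + self.executor_output
            + self.advisor_input
            + self.advisor_output
        )


@dataclass
class BudgetSketch:
    """Hypothetical budget with the allowances referenced in this task."""
    remaining_dollars: float
    remaining_tokens: int
    remaining_advisor_calls: int

    def charge(self, cost_dollars: float, usage: UsageSketch) -> None:
        """Deduct one model call from every allowance it touched."""
        self.remaining_dollars -= cost_dollars
        self.remaining_tokens -= usage.total_tokens
        if usage.advisor_output > 0:
            # Assumption: at most one advisor call is counted per tick.
            self.remaining_advisor_calls -= 1


usage = UsageSketch(
    executor_input=1_000,
    executor_output=200,
    advisor_input=300,
    advisor_output=50,
    cache_read=5_000,
)
budget = BudgetSketch(
    remaining_dollars=10.0, remaining_tokens=100_000, remaining_advisor_calls=3
)
budget.charge(0.25, usage)
```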

Tasks:


Phase 7: Environment Configuration

Azure Environment Variables

# Foundry Anthropic proxy (existing, extend)
ANTHROPIC_FOUNDRY_ENDPOINT=https://cdsopenaigpt4.services.ai.azure.com
ANTHROPIC_FOUNDRY_KEY=<from-key-vault>

# Model deployment IDs (match Foundry portal deployment names)
MAESTRO_OPUS_MODEL_ID=claude-opus-4-6
MAESTRO_SONNET_MODEL_ID=claude-sonnet-4-6
MAESTRO_HAIKU_MODEL_ID=claude-haiku-4-5

# Advisor Strategy
MAESTRO_ADVISOR_ENABLED=true
MAESTRO_ADVISOR_MODEL=claude-opus-4-6
MAESTRO_ADVISOR_MAX_USES_PER_TICK=3

# Failover
ANTHROPIC_DIRECT_API_KEY=<from-key-vault>     # backup: direct Anthropic API
MAESTRO_FAILOVER_ENABLED=true

AWS Environment Variables

# Bedrock (primary on AWS)
AWS_BEDROCK_REGION=us-east-1
MAESTRO_OPUS_MODEL_ID=anthropic.claude-opus-4-6-20250514
MAESTRO_SONNET_MODEL_ID=anthropic.claude-sonnet-4-6-20250514
MAESTRO_HAIKU_MODEL_ID=anthropic.claude-haiku-4-5-20251001

# Failover
ANTHROPIC_DIRECT_API_KEY=<from-secrets-manager>
MAESTRO_FAILOVER_ENABLED=true

Tasks:


File Inventory

New Files

| File                                                 | Language | Purpose                             |
|------------------------------------------------------|----------|-------------------------------------|
| substrate-azure/maestro_azure/model_dispatch.py      | Python   | Azure Foundry proxy dispatch        |
| substrate-azure/maestro_azure/config.py              | Python   | Azure substrate configuration       |
| substrate-aws/maestro_aws/model_dispatch.py          | Python   | AWS Bedrock dispatch                |
| core/maestro/dispatch/router.py                      | Python   | Failover chain router               |
| core/maestro/dispatch/providers/anthropic_direct.py  | Python   | Direct API fallback                 |
| core/maestro/dispatch/metering.py                    | Python   | Cost computation + budget charging  |
| policy-kernel/kernel/layers/b_routing.py             | Python   | Five-tier routing with advisor      |
| eval/sql/fact_model_call.sql                         | SQL      | Warm store materialized view        |
| docs/runbooks/model_deployment.md                    | Markdown | Foundry portal setup guide          |
| docs/adr/0002-stateless-model-dispatch.md            | Markdown | ADR: why model calls are stateless  |

Modified Files

| File                                 | Changes                                      |
|--------------------------------------|----------------------------------------------|
| core/maestro/substrate/interface.py  | Add ModelDispatch protocol                   |
| core/maestro/substrate/types.py      | Add ModelCallResult, ModelUsage              |
| core/maestro/models/budget.py        | Add per-tier tracking, advisor call counting |
| core/maestro/worker/tick.py          | Wire model dispatch + budget charging        |

Execution Order

Phase 0: Foundry Portal deployments       ← manual, no code, do first
Phase 1: Substrate interface extension     ← protocol only, no cloud SDK
Phase 2: Azure Foundry implementation      ← primary dispatch path
Phase 3: AWS Bedrock implementation        ← parallel with Phase 2
Phase 4: Policy Kernel routing             ← depends on Phase 1
Phase 5: Failover chain                    ← depends on Phase 2 + 3
Phase 6: Metering + budget events          ← depends on Phase 2
Phase 7: Environment config + IaC          ← depends on Phase 2 + 3

Estimated effort: 8–10 days
Risk: Phase 0 is the only blocking manual step — everything else is code.


ADR: Why Model Calls Are Stateless (Summary)

Decision: Maestro model calls are pure functions with no provider-side state.

Context: Foundry Agents, OpenAI Assistants, and Bedrock Agents all offer thread-based stateful sessions. These sessions bind reasoning state to the model provider.

Rationale: Maestro journals run for weeks. Binding session state to a model provider means:

By making model calls stateless, the Continuation Store becomes the single source of truth for all session state. The model is a commodity compute resource. The intelligence is in the continuation + field + policy kernel.

Consequences: Every model call must carry its full context (system prompt + field manifest + recent state). This costs more tokens per call but eliminates all provider coupling. Prompt caching (supported through Foundry proxy) amortizes the context cost for frequently-recomputed fields.
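
To illustrate how prompt caching can offset the repeated context, the sketch below marks the stable prefix of a tick with the Messages API `cache_control` field. The helper names and the split between stable and per-tick content are assumptions; how much of a field manifest is stable enough to cache depends on how often the field is recomputed.

```python
def build_cached_system(goal_frame_prompt: str, field_manifest: str) -> list[dict]:
    """Return system blocks with a cache breakpoint after the stable prefix.

    The breakpoint goes on the last stable block, so everything up to and
    including it can be reused by later ticks that send the same prefix.
    """
    return [
        {"type": "text", "text": goal_frame_prompt},
        {
            "type": "text",
            "text": field_manifest,
            "cache_control": {"type": "ephemeral"},
        },
    ]


def build_tick_request(
    model_id: str,
    goal_frame_prompt: str,
    field_manifest: str,
    tick_state: str,
    max_tokens: int = 1024,
) -> dict:
    """Assemble keyword arguments for one stateless Messages call."""
    return {
        "model": model_id,
        "max_tokens": max_tokens,
        "system": build_cached_system(goal_frame_prompt, field_manifest),
        # Volatile per-tick state stays outside the cached prefix.
        "messages": [{"role": "user", "content": tick_state}],
    }


request = build_tick_request(
    model_id="claude-sonnet-4-6",
    goal_frame_prompt="You are advancing a long-running research continuation.",
    field_manifest="[ranked context documents would be listed here]",
    tick_state="Open questions: 2. Last action: drafted section 3.",
)
```

Cached prefixes expire after a short idle period, so the saving applies mainly to ticks that fire close together rather than to continuations that sleep for days between ticks.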

Status: Accepted.