Optimization & Replay
HatiData V2 captures rich execution telemetry that feeds back into model routing, prompt optimization, and agent training. Every completed attempt generates structured reward signals that can be used for offline analysis, A/B testing, and model fine-tuning.
Reward Signals
A reward signal is generated after every attempt completion. It captures the outcome alongside the decisions that led to it:
Reward Record Schema
| Field | Type | Description |
|---|---|---|
| `attempt_id` | UUID | The attempt that generated this signal |
| `task_class` | String | Task classification (e.g., `Codegen`, `AuthoritySpec`) |
| `model_id` | String | Model used (e.g., `deepseek-ai/deepseek-v3.2-maas`) |
| `capability_class` | String | Model capability class (`StrongGeneral`, `FastCheap`, etc.) |
| `provider` | String | Provider (Deepseek, Gemini, Claude) |
| `input_tokens` | Integer | Total input tokens |
| `output_tokens` | Integer | Total output tokens |
| `cost_usd` | Float | Total cost for this attempt |
| `latency_ms` | Integer | Wall-clock duration |
| `outcome` | String | `success`, `partial_success`, or `failed` |
| `verification_status` | String | `passed`, `failed`, or `not_verified` |
| `recovery_level` | String | Highest recovery level reached (`none`, `L1`, `L2`, `L3`, `L4`) |
| `human_override` | Boolean | Whether a human overrode the outcome |
| `created_at` | Timestamp | When the signal was generated |
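For client-side analysis, a record with this shape can be modeled as a plain value type. A minimal Python sketch (the class is illustrative, not a shipped SDK; the `reward()` mapping follows the 1.0 / 0.5 / 0.0 convention used by the bandit observation log below):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RewardSignal:
    # Field names mirror the reward record schema above.
    attempt_id: str
    task_class: str
    model_id: str
    capability_class: str
    provider: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: int
    outcome: str              # success | partial_success | failed
    verification_status: str  # passed | failed | not_verified
    recovery_level: str       # none | L1 | L2 | L3 | L4
    human_override: bool
    created_at: str

    def reward(self) -> float:
        """Scalar reward: 1.0 success, 0.5 partial, 0.0 otherwise."""
        return {"success": 1.0, "partial_success": 0.5}.get(self.outcome, 0.0)

failed = RewardSignal("att-1", "Codegen", "model-a", "StrongGeneral",
                      "Deepseek", 1200, 300, 0.004, 2100,
                      "failed", "failed", "L2", False, "2025-01-01T00:00:00Z")
print(failed.reward())  # 0.0
```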
Querying Reward Signals
```sql
-- Model performance by task class
SELECT
  task_class,
  model_id,
  COUNT(*) AS attempts,
  AVG(cost_usd) AS avg_cost,
  AVG(latency_ms) AS avg_latency,
  SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS success_rate
FROM v_reward_signals
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY task_class, model_id
ORDER BY task_class, success_rate DESC;

-- Cost efficiency: cost per successful artifact
SELECT
  model_id,
  SUM(cost_usd) AS total_cost,
  SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) AS successes,
  SUM(cost_usd) / NULLIF(SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END), 0) AS cost_per_success
FROM v_reward_signals
GROUP BY model_id
ORDER BY cost_per_success;
```
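The cost-per-success aggregation can also be computed client-side over exported rows. A minimal Python sketch of the same computation, where the `NULLIF(..., 0)` guard becomes a `None` result (the row tuples are made up for illustration):

```python
from collections import defaultdict

def cost_per_success(rows):
    """rows: iterable of (model_id, outcome, cost_usd) tuples.
    Returns {model_id: cost per success, or None if no successes},
    mirroring SUM(cost_usd) / NULLIF(successes, 0)."""
    cost = defaultdict(float)
    wins = defaultdict(int)
    for model_id, outcome, cost_usd in rows:
        cost[model_id] += cost_usd
        wins[model_id] += 1 if outcome == "success" else 0
    return {m: (cost[m] / wins[m] if wins[m] else None) for m in cost}

rows = [
    ("model-a", "success", 0.02),
    ("model-a", "failed", 0.01),
    ("model-b", "failed", 0.005),
]
print(cost_per_success(rows))  # model-a pays ~0.03 per success; model-b has none
```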
Thompson Sampling Bandit Router
HatiData V2 includes an optional Thompson Sampling bandit that learns from reward signals to optimize model routing over time.
The bandit router is gated behind `BANDIT_ENABLED=true`. When disabled (the default), the static routing policy from `llm_policy.rs` is used.
How It Works
- Arms: Each (task_class, model_id) combination is a bandit arm
- Observations: Success/failure outcomes from reward signals update the arm's Beta distribution
- Selection: Thompson Sampling draws from each arm's posterior to select the next model
- Exploration: Natural exploration — uncertain arms (few observations) get sampled more often
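The selection step above can be sketched in a few lines. A minimal illustration in Python, assuming `random.betavariate` as a stand-in for the runtime's sampler (model names and counts are placeholders):

```python
import random

def select_arm(arms):
    """Thompson Sampling: draw once from each arm's Beta(alpha, beta)
    posterior and pick the arm with the largest draw.
    arms: {model_id: (alpha, beta)} for a single task_class."""
    draws = {m: random.betavariate(a, b) for m, (a, b) in arms.items()}
    return max(draws, key=draws.get)

# Arms carry the Beta(1, 1) prior plus observed successes/failures.
arms = {
    "model-a": (9.0, 2.0),  # mostly successes: narrow posterior near 0.8
    "model-b": (1.0, 1.0),  # unexplored: wide posterior, still sampled often
}
print(select_arm(arms))
```

Because each call draws fresh samples, the uncertain arm keeps winning a share of selections until its posterior narrows, which is the "natural exploration" noted above.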
Bandit State Schema
```sql
-- Bandit arm state (Thompson Sampling)
CREATE TABLE hd_runtime.bandit_arms (
  task_class  TEXT NOT NULL,
  model_id    TEXT NOT NULL,
  alpha       FLOAT NOT NULL DEFAULT 1.0,  -- Success count + prior
  beta        FLOAT NOT NULL DEFAULT 1.0,  -- Failure count + prior
  total_pulls INTEGER NOT NULL DEFAULT 0,
  last_pulled TIMESTAMP,
  PRIMARY KEY (task_class, model_id)
);

-- Observation log
CREATE TABLE hd_runtime.bandit_observations (
  id         UUID PRIMARY KEY,
  task_class TEXT NOT NULL,
  model_id   TEXT NOT NULL,
  attempt_id UUID NOT NULL,
  reward     FLOAT NOT NULL,  -- 1.0 = success, 0.0 = failure, 0.5 = partial
  mode       TEXT NOT NULL,   -- 'shadow' or 'live'
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
```
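Given that schema, folding one reward signal into an arm's posterior is a one-line update per parameter. A minimal sketch, assuming the 0.5 partial reward splits fractionally across both Beta parameters (an assumption; the doc only fixes the reward values themselves):

```python
def record_observation(arm, reward):
    """Fold one reward signal into an arm's Beta posterior.
    arm: dict with alpha, beta, total_pulls (as in hd_runtime.bandit_arms).
    reward: 1.0 success, 0.5 partial, 0.0 failure."""
    arm["alpha"] += reward         # success mass
    arm["beta"] += 1.0 - reward    # failure mass
    arm["total_pulls"] += 1
    return arm

arm = {"alpha": 1.0, "beta": 1.0, "total_pulls": 0}  # fresh prior
record_observation(arm, 1.0)   # one success
record_observation(arm, 0.5)   # one partial success
print(arm)  # {'alpha': 2.5, 'beta': 1.5, 'total_pulls': 2}
```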
Shadow Mode
The bandit runs in shadow mode by default — it observes and records what it would have chosen, but doesn't affect actual model routing. This allows you to validate the bandit's decision quality before switching to live mode.
```sql
-- Compare the bandit's shadow decisions vs actual outcomes
SELECT
  bo.model_id AS bandit_would_choose,
  md.primary_model AS actually_used,
  bo.reward AS bandit_reward
FROM hd_runtime.bandit_observations bo
JOIN hd_runtime.model_decisions md ON md.attempt_id = bo.attempt_id
WHERE bo.mode = 'shadow'
ORDER BY bo.created_at DESC
LIMIT 20;
```
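One simple way to judge shadow quality before going live is the agreement rate between the bandit's choice and the static policy. A minimal sketch over rows shaped like that query's result (the example pairs are made up):

```python
def shadow_agreement(rows):
    """rows: (bandit_would_choose, actually_used) pairs from the
    shadow-mode comparison. Returns the fraction of attempts where
    the bandit agreed with the static routing policy, or None if empty."""
    if not rows:
        return None
    agree = sum(1 for bandit, actual in rows if bandit == actual)
    return agree / len(rows)

rows = [
    ("model-a", "model-a"),
    ("model-a", "model-b"),  # disagreement
    ("model-b", "model-b"),
    ("model-a", "model-a"),
]
print(shadow_agreement(rows))  # 0.75
```

A low agreement rate is not necessarily bad: the interesting cases are disagreements where the bandit's preferred arm has the higher observed reward.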
Counterfactual Replay
Test a new prompt, model, or configuration against historical failures without affecting production.
Creating a Replay Session
```http
POST /v2/replay
{
  "source_attempt_id": "att-failed-001",
  "overrides": {
    "model_id": "gemini-2.5-pro",
    "prompt_version": "v2.1",
    "temperature": 0.2
  },
  "dry_run": true
}
```
Response:

```json
{
  "replay_id": "replay-abc",
  "source_attempt": {
    "model_id": "deepseek-ai/deepseek-v3.2-maas",
    "outcome": "failed",
    "failure_kind": "InvalidOutputSchema"
  },
  "replay_result": {
    "model_id": "gemini-2.5-pro",
    "outcome": "success",
    "cost_usd": 0.0045,
    "latency_ms": 3200
  },
  "comparison": {
    "cost_delta": "+$0.0014",
    "outcome_improved": true,
    "recommendation": "Switch to gemini-2.5-pro for AuthoritySpec tasks"
  }
}
```
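The comparison block is derivable from the two results. A hedged sketch of that derivation in Python: field names follow the response shape above, while the outcome ranking and the source attempt's cost are illustrative assumptions (the source cost here is chosen so the delta matches the example):

```python
def compare_replay(source, replay):
    """Build a comparison dict from a source attempt and a replay result.
    The ordering of outcomes is an assumption for illustration."""
    rank = {"failed": 0, "partial_success": 1, "success": 2}
    delta = replay["cost_usd"] - source["cost_usd"]
    return {
        "cost_delta": f"{'+' if delta >= 0 else '-'}${abs(delta):.4f}",
        "outcome_improved": rank[replay["outcome"]] > rank[source["outcome"]],
    }

source = {"outcome": "failed", "cost_usd": 0.0031}   # hypothetical source cost
replay = {"outcome": "success", "cost_usd": 0.0045}
print(compare_replay(source, replay))  # {'cost_delta': '+$0.0014', 'outcome_improved': True}
```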
Replay Schema
```sql
CREATE TABLE hd_runtime.counterfactual_runs (
  id                UUID PRIMARY KEY,
  source_attempt_id UUID NOT NULL,
  override_model    TEXT,
  override_prompt   TEXT,
  result_outcome    TEXT,
  result_cost_usd   FLOAT,
  result_latency_ms INTEGER,
  created_at        TIMESTAMP NOT NULL DEFAULT NOW()
);
```
Training Data Export
Export reward signals and execution traces for model fine-tuning or offline analysis.
Export to Parquet
```sql
-- Export the last 30 days of reward signals
COPY (
  SELECT * FROM v_reward_signals
  WHERE created_at > NOW() - INTERVAL '30 days'
) TO '/tmp/rewards.parquet' (FORMAT PARQUET);
```
Export to S3
```bash
# Via the HatiData CLI
hati export rewards \
  --since 30d \
  --format parquet \
  --destination s3://my-bucket/training-data/
```
Training Loop Integration
The exported data includes:
| Column | Use in Training |
|---|---|
| `prompt_hash` | Input feature — group by prompt version |
| `model_id` | Stratification — per-model performance |
| `outcome` | Label — binary (success/fail) or continuous (reward) |
| `cost_usd` | Constraint — optimize within budget |
| `recovery_level` | Severity — weight failures by recovery cost |
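Putting those columns together, one plausible preprocessing step maps each exported row to a weighted training example. A sketch in Python; the recovery-level weights below are illustrative assumptions, not shipped defaults:

```python
def to_training_example(row):
    """Map an exported reward row to (features, label, sample_weight).
    Failures that climbed higher recovery levels get larger weights,
    so the model learns most from the costliest mistakes."""
    recovery_weight = {"none": 1.0, "L1": 1.5, "L2": 2.0, "L3": 3.0, "L4": 4.0}
    features = {"prompt_hash": row["prompt_hash"], "model_id": row["model_id"]}
    label = {"success": 1.0, "partial_success": 0.5}.get(row["outcome"], 0.0)
    weight = recovery_weight.get(row["recovery_level"], 1.0)
    return features, label, weight

row = {"prompt_hash": "abc123", "model_id": "model-a",
       "outcome": "failed", "recovery_level": "L3"}
print(to_training_example(row))  # ({'prompt_hash': 'abc123', 'model_id': 'model-a'}, 0.0, 3.0)
```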
Adaptive Scheduling
V2 includes an optional adaptive scheduler that uses reward signals to prioritize task dispatch:
- Tasks with higher historical success rates get dispatched first
- Tasks that consistently fail get flagged for human review
- Budget-aware: downgrade model class when spend approaches limits
The adaptive scheduler is gated behind `SCHEDULER_V2_ACTIVE=true` and is disabled by default.
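The three rules above can be sketched as a pure scoring function. This is an illustration only: the thresholds (a 0.2 review cutoff, an 80% budget trigger) are assumptions, though the capability class names come from the reward schema:

```python
def schedule(task, spend_usd, budget_usd):
    """Apply the three adaptive-scheduling rules to one task.
    task: dict with historical success_rate and current model_class."""
    decision = {
        "priority": task["success_rate"],                      # higher rate dispatches first
        "model_class": task["model_class"],
        "needs_human_review": task["success_rate"] < 0.2,      # consistent failures flagged
    }
    # Budget-aware: downgrade the model class as spend nears the limit.
    if spend_usd >= 0.8 * budget_usd and task["model_class"] == "StrongGeneral":
        decision["model_class"] = "FastCheap"
    return decision

task = {"success_rate": 0.9, "model_class": "StrongGeneral"}
print(schedule(task, spend_usd=85.0, budget_usd=100.0))
# {'priority': 0.9, 'model_class': 'FastCheap', 'needs_human_review': False}
```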
Next Steps
- Tasks & Attempts — The lifecycle that generates reward data
- Lineage & Explainability — Understanding what to optimize
- Prometheus Metrics — Monitoring optimization health