Optimization & Replay
HatiData V2 captures rich execution telemetry that feeds back into model routing, prompt optimization, and agent training. Every completed attempt generates structured reward signals that can be used for offline analysis, A/B testing, and model fine-tuning.
Reward Signals
A reward signal is generated after every attempt completion. It captures the outcome alongside the decisions that led to it:
Reward Record Schema
| Field | Type | Description |
|---|---|---|
| `attempt_id` | UUID | The attempt that generated this signal |
| `task_class` | String | Task classification (e.g., `Codegen`, `AuthoritySpec`) |
| `model_id` | String | Model used (e.g., `deepseek-ai/deepseek-v3.2-maas`) |
| `capability_class` | String | Model capability class (`StrongGeneral`, `FastCheap`, etc.) |
| `provider` | String | Provider (Deepseek, Gemini, Claude) |
| `input_tokens` | Integer | Total input tokens |
| `output_tokens` | Integer | Total output tokens |
| `cost_usd` | Float | Total cost for this attempt |
| `latency_ms` | Integer | Wall-clock duration |
| `outcome` | String | `success`, `partial_success`, or `failed` |
| `verification_status` | String | `passed`, `failed`, or `not_verified` |
| `recovery_level` | String | Highest recovery level reached (`none`, `L1`, `L2`, `L3`, `L4`) |
| `human_override` | Boolean | Whether a human overrode the outcome |
| `created_at` | Timestamp | When the signal was generated |
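For client-side analysis, a record with this shape can be modeled as a plain value type. A minimal Python sketch (the class is illustrative, not a shipped SDK; the `reward()` mapping follows the 1.0 / 0.5 / 0.0 convention used by the bandit observation log below):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RewardSignal:
    # Field names mirror the reward record schema above.
    attempt_id: str
    task_class: str
    model_id: str
    capability_class: str
    provider: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: int
    outcome: str              # success | partial_success | failed
    verification_status: str  # passed | failed | not_verified
    recovery_level: str       # none | L1 | L2 | L3 | L4
    human_override: bool
    created_at: str

    def reward(self) -> float:
        """Scalar reward: 1.0 success, 0.5 partial, 0.0 otherwise."""
        return {"success": 1.0, "partial_success": 0.5}.get(self.outcome, 0.0)

failed = RewardSignal("att-1", "Codegen", "model-a", "StrongGeneral",
                      "Deepseek", 1200, 300, 0.004, 2100,
                      "failed", "failed", "L2", False, "2025-01-01T00:00:00Z")
print(failed.reward())  # 0.0
```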
Querying Reward Signals
```sql
-- Model performance by task class
SELECT
  task_class,
  model_id,
  COUNT(*) AS attempts,
  AVG(cost_usd) AS avg_cost,
  AVG(latency_ms) AS avg_latency,
  SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS success_rate
FROM v_reward_signals
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY task_class, model_id
ORDER BY task_class, success_rate DESC;

-- Cost efficiency: cost per successful artifact
SELECT
  model_id,
  SUM(cost_usd) AS total_cost,
  SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) AS successes,
  SUM(cost_usd) / NULLIF(SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END), 0) AS cost_per_success
FROM v_reward_signals
GROUP BY model_id
ORDER BY cost_per_success;
```
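The cost-per-success aggregation can also be computed client-side over exported rows. A minimal Python sketch of the same computation, where the `NULLIF(..., 0)` guard becomes a `None` result (the row tuples are made up for illustration):

```python
from collections import defaultdict

def cost_per_success(rows):
    """rows: iterable of (model_id, outcome, cost_usd) tuples.
    Returns {model_id: cost per success, or None if no successes},
    mirroring SUM(cost_usd) / NULLIF(successes, 0)."""
    cost = defaultdict(float)
    wins = defaultdict(int)
    for model_id, outcome, cost_usd in rows:
        cost[model_id] += cost_usd
        wins[model_id] += 1 if outcome == "success" else 0
    return {m: (cost[m] / wins[m] if wins[m] else None) for m in cost}

rows = [
    ("model-a", "success", 0.02),
    ("model-a", "failed", 0.01),
    ("model-b", "failed", 0.005),
]
print(cost_per_success(rows))  # model-a pays ~0.03 per success; model-b has none
```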
Thompson Sampling Bandit Router
HatiData V2 includes an optional Thompson Sampling bandit that learns from reward signals to optimize model routing over time.
The bandit router is gated behind `BANDIT_ENABLED=true`. When disabled (the default), the static routing policy from `llm_policy.rs` is used.
How It Works
- Arms: Each (task_class, model_id) combination is a bandit arm
- Observations: Success/failure outcomes from reward signals update the arm's Beta distribution
- Selection: Thompson Sampling draws from each arm's posterior to select the next model
- Exploration: Natural exploration — uncertain arms (few observations) get sampled more often
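The selection step above can be sketched in a few lines. A minimal illustration in Python, assuming `random.betavariate` as a stand-in for the runtime's sampler (model names and counts are placeholders):

```python
import random

def select_arm(arms):
    """Thompson Sampling: draw once from each arm's Beta(alpha, beta)
    posterior and pick the arm with the largest draw.
    arms: {model_id: (alpha, beta)} for a single task_class."""
    draws = {m: random.betavariate(a, b) for m, (a, b) in arms.items()}
    return max(draws, key=draws.get)

# Arms carry the Beta(1, 1) prior plus observed successes/failures.
arms = {
    "model-a": (9.0, 2.0),  # mostly successes: narrow posterior near 0.8
    "model-b": (1.0, 1.0),  # unexplored: wide posterior, still sampled often
}
print(select_arm(arms))
```

Because each call draws fresh samples, the uncertain arm keeps winning a share of selections until its posterior narrows, which is the "natural exploration" noted above.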
Bandit State Schema
```sql
-- Bandit arm state (Thompson Sampling)
CREATE TABLE hd_runtime.bandit_arms (
  task_class  TEXT NOT NULL,
  model_id    TEXT NOT NULL,
  alpha       FLOAT NOT NULL DEFAULT 1.0,  -- Success count + prior
  beta        FLOAT NOT NULL DEFAULT 1.0,  -- Failure count + prior
  total_pulls INTEGER NOT NULL DEFAULT 0,
  last_pulled TIMESTAMP,
  PRIMARY KEY (task_class, model_id)
);

-- Observation log
CREATE TABLE hd_runtime.bandit_observations (
  id         UUID PRIMARY KEY,
  task_class TEXT NOT NULL,
  model_id   TEXT NOT NULL,
  attempt_id UUID NOT NULL,
  reward     FLOAT NOT NULL,  -- 1.0 = success, 0.0 = failure, 0.5 = partial
  mode       TEXT NOT NULL,   -- 'shadow' or 'live'
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
```
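Given that schema, folding one reward signal into an arm's posterior is a one-line update per parameter. A minimal sketch, assuming the 0.5 partial reward splits fractionally across both Beta parameters (an assumption; the doc only fixes the reward values themselves):

```python
def record_observation(arm, reward):
    """Fold one reward signal into an arm's Beta posterior.
    arm: dict with alpha, beta, total_pulls (as in hd_runtime.bandit_arms).
    reward: 1.0 success, 0.5 partial, 0.0 failure."""
    arm["alpha"] += reward         # success mass
    arm["beta"] += 1.0 - reward    # failure mass
    arm["total_pulls"] += 1
    return arm

arm = {"alpha": 1.0, "beta": 1.0, "total_pulls": 0}  # fresh prior
record_observation(arm, 1.0)   # one success
record_observation(arm, 0.5)   # one partial success
print(arm)  # {'alpha': 2.5, 'beta': 1.5, 'total_pulls': 2}
```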
Shadow Mode
The bandit runs in shadow mode by default — it observes and records what it would have chosen, but doesn't affect actual model routing. This allows you to validate the bandit's decision quality before switching to live mode.
```sql
-- Compare the bandit's shadow decisions vs actual outcomes
SELECT
  bo.model_id AS bandit_would_choose,
  md.primary_model AS actually_used,
  bo.reward AS bandit_reward
FROM hd_runtime.bandit_observations bo
JOIN hd_runtime.model_decisions md ON md.attempt_id = bo.attempt_id
WHERE bo.mode = 'shadow'
ORDER BY bo.created_at DESC
LIMIT 20;
```
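One simple way to judge shadow quality before going live is the agreement rate between the bandit's choice and the static policy. A minimal sketch over rows shaped like that query's result (the example pairs are made up):

```python
def shadow_agreement(rows):
    """rows: (bandit_would_choose, actually_used) pairs from the
    shadow-mode comparison. Returns the fraction of attempts where
    the bandit agreed with the static routing policy, or None if empty."""
    if not rows:
        return None
    agree = sum(1 for bandit, actual in rows if bandit == actual)
    return agree / len(rows)

rows = [
    ("model-a", "model-a"),
    ("model-a", "model-b"),  # disagreement
    ("model-b", "model-b"),
    ("model-a", "model-a"),
]
print(shadow_agreement(rows))  # 0.75
```

A low agreement rate is not necessarily bad: the interesting cases are disagreements where the bandit's preferred arm has the higher observed reward.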
Counterfactual Replay
Test a new prompt, model, or configuration against historical failures without affecting production.
Creating a Replay Session
```http
POST /v2/replay
{
  "source_attempt_id": "att-failed-001",
  "overrides": {
    "model_id": "gemini-2.5-pro",
    "prompt_version": "v2.1",
    "temperature": 0.2
  },
  "dry_run": true
}
```
Response:

```json
{
  "replay_id": "replay-abc",
  "source_attempt": {
    "model_id": "deepseek-ai/deepseek-v3.2-maas",
    "outcome": "failed",
    "failure_kind": "InvalidOutputSchema"
  },
  "replay_result": {
    "model_id": "gemini-2.5-pro",
    "outcome": "success",
    "cost_usd": 0.0045,
    "latency_ms": 3200
  },
  "comparison": {
    "cost_delta": "+$0.0014",
    "outcome_improved": true,
    "recommendation": "Switch to gemini-2.5-pro for AuthoritySpec tasks"
  }
}
```
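The comparison block is derivable from the two results. A hedged sketch of that derivation in Python: field names follow the response shape above, while the outcome ranking and the source attempt's cost are illustrative assumptions (the source cost here is chosen so the delta matches the example):

```python
def compare_replay(source, replay):
    """Build a comparison dict from a source attempt and a replay result.
    The ordering of outcomes is an assumption for illustration."""
    rank = {"failed": 0, "partial_success": 1, "success": 2}
    delta = replay["cost_usd"] - source["cost_usd"]
    return {
        "cost_delta": f"{'+' if delta >= 0 else '-'}${abs(delta):.4f}",
        "outcome_improved": rank[replay["outcome"]] > rank[source["outcome"]],
    }

source = {"outcome": "failed", "cost_usd": 0.0031}   # hypothetical source cost
replay = {"outcome": "success", "cost_usd": 0.0045}
print(compare_replay(source, replay))  # {'cost_delta': '+$0.0014', 'outcome_improved': True}
```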
Replay Schema
```sql
CREATE TABLE hd_runtime.counterfactual_runs (
  id                UUID PRIMARY KEY,
  source_attempt_id UUID NOT NULL,
  override_model    TEXT,
  override_prompt   TEXT,
  result_outcome    TEXT,
  result_cost_usd   FLOAT,
  result_latency_ms INTEGER,
  created_at        TIMESTAMP NOT NULL DEFAULT NOW()
);
```
Training Data Export
Export reward signals and execution traces for model fine-tuning or offline analysis.
Export to Parquet
```sql
-- Export the last 30 days of reward signals
COPY (
  SELECT * FROM v_reward_signals
  WHERE created_at > NOW() - INTERVAL '30 days'
) TO '/tmp/rewards.parquet' (FORMAT PARQUET);
```
Export to S3
```bash
# Via the HatiData CLI
hati export rewards \
  --since 30d \
  --format parquet \
  --destination s3://my-bucket/training-data/
```
Training Loop Integration
The exported data includes:
| Column | Use in Training |
|---|---|
| `prompt_hash` | Input feature — group by prompt version |
| `model_id` | Stratification — per-model performance |
| `outcome` | Label — binary (success/fail) or continuous (reward) |
| `cost_usd` | Constraint — optimize within budget |
| `recovery_level` | Severity — weight failures by recovery cost |
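Putting those columns together, one plausible preprocessing step maps each exported row to a weighted training example. A sketch in Python; the recovery-level weights below are illustrative assumptions, not shipped defaults:

```python
def to_training_example(row):
    """Map an exported reward row to (features, label, sample_weight).
    Failures that climbed higher recovery levels get larger weights,
    so the model learns most from the costliest mistakes."""
    recovery_weight = {"none": 1.0, "L1": 1.5, "L2": 2.0, "L3": 3.0, "L4": 4.0}
    features = {"prompt_hash": row["prompt_hash"], "model_id": row["model_id"]}
    label = {"success": 1.0, "partial_success": 0.5}.get(row["outcome"], 0.0)
    weight = recovery_weight.get(row["recovery_level"], 1.0)
    return features, label, weight

row = {"prompt_hash": "abc123", "model_id": "model-a",
       "outcome": "failed", "recovery_level": "L3"}
print(to_training_example(row))  # ({'prompt_hash': 'abc123', 'model_id': 'model-a'}, 0.0, 3.0)
```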
Adaptive Scheduling
V2 includes an optional adaptive scheduler that uses reward signals to prioritize task dispatch:
- Tasks with higher historical success rates get dispatched first
- Tasks that consistently fail get flagged for human review
- Budget-aware: downgrade model class when spend approaches limits
The adaptive scheduler is gated behind `SCHEDULER_V2_ACTIVE=true` and is disabled by default.
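The three rules above can be sketched as a pure scoring function. This is an illustration only: the thresholds (a 0.2 review cutoff, an 80% budget trigger) are assumptions, though the capability class names come from the reward schema:

```python
def schedule(task, spend_usd, budget_usd):
    """Apply the three adaptive-scheduling rules to one task.
    task: dict with historical success_rate and current model_class."""
    decision = {
        "priority": task["success_rate"],                      # higher rate dispatches first
        "model_class": task["model_class"],
        "needs_human_review": task["success_rate"] < 0.2,      # consistent failures flagged
    }
    # Budget-aware: downgrade the model class as spend nears the limit.
    if spend_usd >= 0.8 * budget_usd and task["model_class"] == "StrongGeneral":
        decision["model_class"] = "FastCheap"
    return decision

task = {"success_rate": 0.9, "model_class": "StrongGeneral"}
print(schedule(task, spend_usd=85.0, budget_usd=100.0))
# {'priority': 0.9, 'model_class': 'FastCheap', 'needs_human_review': False}
```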
Next Steps
- Tasks & Attempts — The lifecycle that generates reward data
- Lineage & Explainability — Understanding what to optimize
- Prometheus Metrics — Monitoring optimization health