Prometheus Metrics
HatiData exposes Prometheus-compatible metrics at /metrics on port 9090. V2 adds runtime lifecycle metrics alongside the existing V1 proxy metrics.
Scrape Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'hatidata'
    static_configs:
      - targets: ['hatidata-proxy:9090']
    scrape_interval: 15s
    metrics_path: /metrics
V2 Runtime Metrics
Task Lifecycle
| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_tasks_created_total | Counter | agent_type, task_class | Tasks created |
| v2_tasks_completed_total | Counter | agent_type, task_class, status | Tasks completed (done/failed) |
| v2_task_duration_seconds | Histogram | agent_type, task_class | Task wall-clock time (create → complete) |
| v2_task_queue_depth | Gauge | task_class | Current queue depth by task class |
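The task lifecycle metrics can be condensed into recording rules. The following is a minimal sketch, assuming the `status` label takes the values `done` and `failed` as listed in the table. The file name, group name, rule names, and the 5m window are illustrative choices, not part of HatiData. A file like this would be loaded through `rule_files` in prometheus.yml.

```yaml
# rules/hatidata-v2-tasks.yml (hypothetical file name)
groups:
  - name: hatidata_v2_tasks
    rules:
      # Fraction of completed tasks that ended with status="done", per task class.
      - record: task_class:v2_tasks_completed:success_ratio_rate5m
        expr: |
          sum by (task_class) (rate(v2_tasks_completed_total{status="done"}[5m]))
          /
          sum by (task_class) (rate(v2_tasks_completed_total[5m]))
      # P95 task wall-clock time per task class, estimated from the histogram buckets.
      - record: task_class:v2_task_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (task_class, le) (rate(v2_task_duration_seconds_bucket[5m]))
          )
```

Aggregating both sides of the ratio with the same `by` clause keeps the label sets aligned, which PromQL requires for the division to match series.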
Attempt Lifecycle
| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_attempts_total | Counter | agent_type, status | Attempts created |
| v2_attempt_duration_seconds | Histogram | agent_type, task_class | Per-attempt duration |
| v2_attempts_per_task | Histogram | task_class | Retry count distribution |
| v2_lease_claims_total | Counter | agent_type | Successful lease claims |
| v2_lease_expirations_total | Counter | agent_type | Leases that expired (no heartbeat) |
| v2_lease_renewals_total | Counter | agent_type | Heartbeat renewals |
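Lease expirations relative to lease claims give a rough signal of workers that stop heartbeating. The alert below is a sketch only: the alert name, the 5% threshold, the 15m window, and the `severity` label are assumptions to be tuned per deployment.

```yaml
# rules/hatidata-v2-leases.yml (hypothetical file name)
groups:
  - name: hatidata_v2_leases
    rules:
      - alert: HatiDataLeaseExpirationRatioHigh
        # Expired leases as a fraction of claimed leases, per agent type.
        # The 0.05 threshold is an assumed starting point, not a HatiData default.
        expr: |
          sum by (agent_type) (rate(v2_lease_expirations_total[15m]))
          /
          sum by (agent_type) (rate(v2_lease_claims_total[15m]))
          > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of claimed leases are expiring for agent type {{ $labels.agent_type }}"
```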
Model Routing
| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_model_decisions_total | Counter | task_class, model_id, provider | Routing decisions made |
| v2_model_fallbacks_total | Counter | task_class, from_model, to_model | Fallback triggered |
| v2_llm_invocation_duration_seconds | Histogram | provider, model_id | Per-invocation latency |
| v2_llm_invocation_cost_usd | Histogram | provider, model_id, task_class | Per-invocation cost |
| v2_llm_invocation_tokens_total | Counter | provider, model_id, direction | Tokens (direction: input/output/cached) |
| v2_llm_cache_hits_total | Counter | provider, model_id | Prompt cache hits |
| v2_llm_errors_total | Counter | provider, model_id, error_type | LLM call failures |
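Because `v2_llm_invocation_cost_usd` is a histogram, Prometheus also exposes `_sum` and `_count` series for it, and spend can be derived from those. A sketch of two recording rules follows; the rule names, file name, and the 1h window are assumptions.

```yaml
# rules/hatidata-v2-llm.yml (hypothetical file name)
groups:
  - name: hatidata_v2_llm
    rules:
      # Approximate spend in USD per hour: per-second rate of the cost sum, times 3600.
      - record: provider_model:v2_llm_invocation_cost_usd:per_hour
        expr: |
          sum by (provider, model_id) (rate(v2_llm_invocation_cost_usd_sum[1h])) * 3600
      # Mean cost of a single invocation over the last hour (sum divided by count).
      - record: provider_model:v2_llm_invocation_cost_usd:mean_1h
        expr: |
          sum by (provider, model_id) (rate(v2_llm_invocation_cost_usd_sum[1h]))
          /
          sum by (provider, model_id) (rate(v2_llm_invocation_cost_usd_count[1h]))
```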
Artifact Health
| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_artifacts_produced_total | Counter | artifact_kind, agent_type | Artifacts created |
| v2_artifacts_validated_total | Counter | artifact_kind, status | Validation outcomes (passed/failed) |
| v2_artifacts_verified_total | Counter | artifact_kind, verifier_type | Verifier outcomes |
| v2_lineage_orphan_artifacts_total | Gauge | none | Artifacts with broken lineage chain |
| v2_artifact_confidence_score | Histogram | artifact_kind | Confidence score distribution |
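The orphan gauge is expected to stay at 0, so it is a natural candidate for an alert. The sketch below also alerts on the validation failure ratio. Alert names, thresholds, durations, and the `severity` values are assumptions.

```yaml
# rules/hatidata-v2-artifacts.yml (hypothetical file name)
groups:
  - name: hatidata_v2_artifacts
    rules:
      - alert: HatiDataOrphanArtifacts
        # Any non-zero value means at least one artifact has a broken lineage chain.
        expr: v2_lineage_orphan_artifacts_total > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} artifacts have a broken lineage chain"
      - alert: HatiDataArtifactValidationFailures
        # Share of validations that failed, per artifact kind (0.10 is an assumed threshold).
        expr: |
          sum by (artifact_kind) (rate(v2_artifacts_validated_total{status="failed"}[30m]))
          /
          sum by (artifact_kind) (rate(v2_artifacts_validated_total[30m]))
          > 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Over 10% of {{ $labels.artifact_kind }} artifacts are failing validation"
```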
Recovery
| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_recovery_actions_total | Counter | level, action_type | Recovery actions (L1-L4 + blocked) |
| v2_recovery_escalations_total | Counter | from_level, to_level | Escalation transitions |
| v2_human_reviews_pending | Gauge | entity_type, urgency | Pending review queue depth |
| v2_human_reviews_completed_total | Counter | entity_type, decision | Reviews completed (approved/rejected) |
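A sustained human review backlog can be surfaced with a gauge-based alert. This is a sketch under assumptions: the backlog limit of 20, the 30m hold time, and the alert name are placeholders, and no particular `urgency` values are assumed.

```yaml
# rules/hatidata-v2-recovery.yml (hypothetical file name)
groups:
  - name: hatidata_v2_recovery
    rules:
      - alert: HatiDataHumanReviewBacklog
        # Fires when the pending queue for one entity type and urgency stays above the limit.
        expr: sum by (entity_type, urgency) (v2_human_reviews_pending) > 20
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Review backlog for {{ $labels.entity_type }} ({{ $labels.urgency }}) has stayed above 20 for 30 minutes"
```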
Gate Evaluation
| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_gate_evaluations_total | Counter | gate_id, result | Gate evaluations (auto_approved/pending_manual/failed) |
| v2_gate_duration_seconds | Histogram | gate_id | Time from gate creation to resolution |
| v2_gate_predicate_failures_total | Counter | gate_id, predicate_key | Individual predicate failures |
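To see which gates fail most often and which predicates drive those failures, the counters can be turned into per-gate rates. The recording rules below are a sketch that relies on the `result="failed"` value listed in the table; rule names and windows are assumptions.

```yaml
# rules/hatidata-v2-gates.yml (hypothetical file name)
groups:
  - name: hatidata_v2_gates
    rules:
      # Fraction of evaluations per gate that ended with result="failed".
      - record: gate_id:v2_gate_evaluations:failed_ratio_rate1h
        expr: |
          sum by (gate_id) (rate(v2_gate_evaluations_total{result="failed"}[1h]))
          /
          sum by (gate_id) (rate(v2_gate_evaluations_total[1h]))
      # Per-predicate failure rate, useful for ranking predicates inside a gate.
      - record: gate_id_predicate_key:v2_gate_predicate_failures:rate1h
        expr: sum by (gate_id, predicate_key) (rate(v2_gate_predicate_failures_total[1h]))
```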
Bandit Router (Optional)
| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_bandit_pulls_total | Counter | task_class, model_id, mode | Bandit arm pulls (shadow/live) |
| v2_bandit_regret_estimate | Gauge | task_class | Estimated cumulative regret |
V1 Proxy Metrics (Existing)
These metrics continue to be exported alongside V2:
| Metric | Type | Description |
|---|---|---|
| hatidata_queries_total | Counter | SQL queries processed |
| hatidata_query_duration_seconds | Histogram | Query execution time |
| hatidata_query_errors_total | Counter | Query failures |
| hatidata_cache_hits_total | Counter | SQL hash cache hits |
| hatidata_cache_misses_total | Counter | SQL hash cache misses |
| hatidata_connections_active | Gauge | Active client connections |
| hatidata_memory_operations_total | Counter | store/search/load operations |
| hatidata_cot_steps_total | Counter | Reasoning steps logged |
| hatidata_trigger_fires_total | Counter | Semantic trigger activations |
| hatidata_branch_operations_total | Counter | Branch create/merge/delete |
| hatidata_abac_decisions_total | Counter | ABAC allow/deny decisions |
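The V1 counters combine into the usual ratios. The sketch below records a cache hit ratio and a query error ratio. It assumes `hatidata_queries_total` counts failed queries as well as successful ones, which the table does not confirm; rule names are hypothetical.

```yaml
# rules/hatidata-v1-proxy.yml (hypothetical file name)
groups:
  - name: hatidata_v1_proxy
    rules:
      # SQL hash cache hit ratio: hits / (hits + misses).
      - record: job:hatidata_cache:hit_ratio_rate5m
        expr: |
          sum(rate(hatidata_cache_hits_total[5m]))
          /
          (
            sum(rate(hatidata_cache_hits_total[5m]))
            + sum(rate(hatidata_cache_misses_total[5m]))
          )
      # Query error ratio, assuming failures are included in hatidata_queries_total.
      - record: job:hatidata_queries:error_ratio_rate5m
        expr: |
          sum(rate(hatidata_query_errors_total[5m]))
          /
          sum(rate(hatidata_queries_total[5m]))
```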
Sample Grafana Dashboards
V2 Runtime Overview
Import from: HatiData/dev/config/openobserve/dashboards/v2-runtime.json
Panels:
- Task Throughput: rate(v2_tasks_created_total[5m]) by task_class
- Success Rate: v2_tasks_completed_total{status="done"} / v2_tasks_completed_total
- Model Cost: sum(rate(v2_llm_invocation_cost_usd_sum[1h])) by provider
- Attempt Distribution: histogram_quantile(0.95, v2_attempts_per_task_bucket)
- Lease Health: v2_lease_expirations_total vs v2_lease_claims_total
- Recovery Funnel: v2_recovery_actions_total by level (stacked bar)
- Queue Depth: v2_task_queue_depth by task_class (gauge)
- Orphan Alert: v2_lineage_orphan_artifacts_total (should always be 0)
Model Performance Comparison
Panels:
- Latency by Model: histogram_quantile(0.95, v2_llm_invocation_duration_seconds_bucket) by model_id
- Cost by Model: sum(v2_llm_invocation_cost_usd_sum) by model_id
- Success Rate by Model: v2_artifacts_validated_total{status="passed"} / v2_artifacts_validated_total by model_id
- Cache Hit Rate: v2_llm_cache_hits_total / (v2_llm_cache_hits_total + v2_llm_errors_total{error_type="cache_miss"})
PromQL Examples
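# Cost per successful artifact (last 24h)
# v2_artifacts_validated_total has no task_class label, so both sides are aggregated to a single series.
sum(increase(v2_llm_invocation_cost_usd_sum[24h]))
/
sum(increase(v2_artifacts_validated_total{status="passed"}[24h]))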
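# Model fallback rate (aggregated by task_class, the only label the two counters share)
sum by (task_class) (rate(v2_model_fallbacks_total[1h]))
/
sum by (task_class) (rate(v2_model_decisions_total[1h]))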
# P95 ExplainBundle latency
histogram_quantile(0.95,
  rate(v2_attempt_duration_seconds_bucket{status="completed_verified"}[5m])
)
# Human review backlog (should trend toward 0)
v2_human_reviews_pending
Next Steps
- SLO Reference — Latency targets and alerting thresholds
- Deployment & Monitoring — Setting up the metrics pipeline
- Cost Model — How costs are calculated