Prometheus Metrics

HatiData exports Prometheus-compatible metrics on the :9090/metrics endpoint. V2 adds runtime lifecycle metrics alongside the existing V1 proxy metrics.

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'hatidata'
    static_configs:
      - targets: ['hatidata-proxy:9090']
    scrape_interval: 15s
    metrics_path: /metrics
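
To sanity-check what a scrape of this endpoint returns, the sketch below parses Prometheus text exposition format using only the standard library. The sample payload, its label values (`planner`, `summarize`), and its numbers are made up for illustration and are not real HatiData output. The parser is deliberately simplified and does not handle escaped quotes inside label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
# Simplification: escaped quotes inside label values are not handled.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical scrape output; label values and numbers are illustrative only.
SAMPLE_SCRAPE = """\
# HELP v2_tasks_created_total Tasks created
# TYPE v2_tasks_created_total counter
v2_tasks_created_total{agent_type="planner",task_class="summarize"} 42
v2_task_queue_depth{task_class="summarize"} 3
hatidata_connections_active 7
"""

for name, labels, value in parse_metrics(SAMPLE_SCRAPE):
    print(name, labels, value)
```

In practice the text would come from an HTTP GET against the scrape target (for example with `urllib.request.urlopen`) rather than from an inline string.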

V2 Runtime Metrics

Task Lifecycle

Metric	Type	Labels	Description
`v2_tasks_created_total`	Counter	`agent_type`, `task_class`	Tasks created
`v2_tasks_completed_total`	Counter	`agent_type`, `task_class`, `status`	Tasks completed (done/failed)
`v2_task_duration_seconds`	Histogram	`agent_type`, `task_class`	Task wall-clock time (create → complete)
`v2_task_queue_depth`	Gauge	`task_class`	Current queue depth by task class

Attempt Lifecycle

Metric	Type	Labels	Description
`v2_attempts_total`	Counter	`agent_type`, `status`	Attempts created
`v2_attempt_duration_seconds`	Histogram	`agent_type`, `task_class`	Per-attempt duration
`v2_attempts_per_task`	Histogram	`task_class`	Retry count distribution
`v2_lease_claims_total`	Counter	`agent_type`	Successful lease claims
`v2_lease_expirations_total`	Counter	`agent_type`	Leases that expired (no heartbeat)
`v2_lease_renewals_total`	Counter	`agent_type`	Heartbeat renewals

Model Routing

Metric	Type	Labels	Description
`v2_model_decisions_total`	Counter	`task_class`, `model_id`, `provider`	Routing decisions made
`v2_model_fallbacks_total`	Counter	`task_class`, `from_model`, `to_model`	Fallback triggered
`v2_llm_invocation_duration_seconds`	Histogram	`provider`, `model_id`	Per-invocation latency
`v2_llm_invocation_cost_usd`	Histogram	`provider`, `model_id`, `task_class`	Per-invocation cost
`v2_llm_invocation_tokens_total`	Counter	`provider`, `model_id`, `direction`	Tokens (direction: input/output/cached)
`v2_llm_cache_hits_total`	Counter	`provider`, `model_id`	Prompt cache hits
`v2_llm_errors_total`	Counter	`provider`, `model_id`, `error_type`	LLM call failures

Artifact Health

Metric	Type	Labels	Description
`v2_artifacts_produced_total`	Counter	`artifact_kind`, `agent_type`	Artifacts created
`v2_artifacts_validated_total`	Counter	`artifact_kind`, `status`	Validation outcomes (passed/failed)
`v2_artifacts_verified_total`	Counter	`artifact_kind`, `verifier_type`	Verifier outcomes
`v2_lineage_orphan_artifacts_total`	Gauge	—	Artifacts with broken lineage chain
`v2_artifact_confidence_score`	Histogram	`artifact_kind`	Confidence score distribution

Recovery

Metric	Type	Labels	Description
`v2_recovery_actions_total`	Counter	`level`, `action_type`	Recovery actions (L1-L4 + blocked)
`v2_recovery_escalations_total`	Counter	`from_level`, `to_level`	Escalation transitions
`v2_human_reviews_pending`	Gauge	`entity_type`, `urgency`	Pending review queue depth
`v2_human_reviews_completed_total`	Counter	`entity_type`, `decision`	Reviews completed (approved/rejected)

Gate Evaluation

Metric	Type	Labels	Description
`v2_gate_evaluations_total`	Counter	`gate_id`, `result`	Gate evaluations (auto_approved/pending_manual/failed)
`v2_gate_duration_seconds`	Histogram	`gate_id`	Time from gate creation to resolution
`v2_gate_predicate_failures_total`	Counter	`gate_id`, `predicate_key`	Individual predicate failures

Bandit Router (Optional)

Metric	Type	Labels	Description
`v2_bandit_pulls_total`	Counter	`task_class`, `model_id`, `mode`	Bandit arm pulls (shadow/live)
`v2_bandit_regret_estimate`	Gauge	`task_class`	Estimated cumulative regret

V1 Proxy Metrics (Existing)

These metrics continue to be exported alongside V2:

Metric	Type	Description
`hatidata_queries_total`	Counter	SQL queries processed
`hatidata_query_duration_seconds`	Histogram	Query execution time
`hatidata_query_errors_total`	Counter	Query failures
`hatidata_cache_hits_total`	Counter	SQL hash cache hits
`hatidata_cache_misses_total`	Counter	SQL hash cache misses
`hatidata_connections_active`	Gauge	Active client connections
`hatidata_memory_operations_total`	Counter	store/search/load operations
`hatidata_cot_steps_total`	Counter	Reasoning steps logged
`hatidata_trigger_fires_total`	Counter	Semantic trigger activations
`hatidata_branch_operations_total`	Counter	Branch create/merge/delete
`hatidata_abac_decisions_total`	Counter	ABAC allow/deny decisions

Sample Grafana Dashboards

V2 Runtime Overview

Import from: HatiData/dev/config/openobserve/dashboards/v2-runtime.json

Panels:

Task Throughput — sum by (task_class) (rate(v2_tasks_created_total[5m]))
Success Rate — sum(rate(v2_tasks_completed_total{status="done"}[5m])) / sum(rate(v2_tasks_completed_total[5m]))
Model Cost — sum by (provider) (rate(v2_llm_invocation_cost_usd_sum[1h])) (USD per second; multiply by 3600 for hourly spend)
Attempt Distribution — histogram_quantile(0.95, sum by (le) (rate(v2_attempts_per_task_bucket[5m])))
Lease Health — sum(rate(v2_lease_expirations_total[5m])) vs sum(rate(v2_lease_claims_total[5m]))
Recovery Funnel — v2_recovery_actions_total by level (stacked bar)
Queue Depth — v2_task_queue_depth by task_class (gauge)
Orphan Alert — v2_lineage_orphan_artifacts_total (should always be 0)
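
Several of the panels above are built from counters, which only ever increase and restart from zero when the process restarts. The sketch below shows, under simplifying assumptions, how a per-second rate can be derived from raw counter samples while tolerating such resets. It is not Prometheus's exact `rate()` algorithm, which also extrapolates to the edges of the range window. The sample values are made up.

```python
def counter_increase(samples):
    """Total increase across (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as fresh increase from zero.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # reset: the counter restarted from zero
    return total


def counter_rate(samples):
    """Average per-second increase over the span covered by the samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed


# Hypothetical samples at a 15s scrape interval, with a reset before t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(counter_increase(samples))  # 30 + 30 + 10 + 30
print(counter_rate(samples))
```

This is why the panels wrap counters in `rate()` or `increase()` instead of plotting the raw cumulative values, which jump back to zero on every restart.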

Model Performance Comparison

Panels:

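Latency by Model — histogram_quantile(0.95, sum by (le, model_id) (rate(v2_llm_invocation_duration_seconds_bucket[5m])))
Cost by Model — sum by (model_id) (increase(v2_llm_invocation_cost_usd_sum[$__range])) (spend over the dashboard time range)
Success Rate by Model — sum by (model_id) (rate(v2_artifacts_validated_total{status="passed"}[5m])) / sum by (model_id) (rate(v2_artifacts_validated_total[5m])) (assumes a model_id label on this metric, which the Artifact Health table does not list)
Cache Hit Rate — sum by (model_id) (rate(v2_llm_cache_hits_total[5m])) / (sum by (model_id) (rate(v2_llm_cache_hits_total[5m])) + sum by (model_id) (rate(v2_llm_errors_total{error_type="cache_miss"}[5m])))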

PromQL Examples

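# Cost per successful artifact (last 24h)
# The cost histogram and the validation counter share no labels in the tables above,
# so both sides are aggregated to a single series before dividing.
sum(increase(v2_llm_invocation_cost_usd_sum[24h]))
/
sum(increase(v2_artifacts_validated_total{status="passed"}[24h]))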

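# Model fallback rate (per task class)
# Aggregate both sides to task_class; the raw series carry different labels and would not match.
sum by (task_class) (rate(v2_model_fallbacks_total[1h]))
/
sum by (task_class) (rate(v2_model_decisions_total[1h]))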

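# P95 ExplainBundle latency
# Assumes the histogram carries a status label; the Attempt Lifecycle table lists only agent_type and task_class.
histogram_quantile(0.95,
  sum by (le) (rate(v2_attempt_duration_seconds_bucket{status="completed_verified"}[5m]))
)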

# Human review backlog (should trend toward 0)
v2_human_reviews_pending
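
The quantile queries above depend on how `histogram_quantile` estimates a value from cumulative buckets. The sketch below illustrates the idea: find the bucket containing the target rank, then interpolate linearly inside it. It is an approximation of Prometheus's behaviour for classic histograms with non-negative bounds, not a copy of its implementation. The bucket boundaries and counts are hypothetical, not HatiData's real bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, as Prometheus histograms do.
    Assumes non-negative bounds (durations, costs, retry counts).
    """
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound  # empty bucket, nothing to interpolate
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative buckets for a duration histogram (seconds).
buckets = [(0.5, 10), (1.0, 60), (2.5, 90), (5.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # about 4.06: rank 95 lies in the (2.5, 5.0] bucket
print(histogram_quantile(0.50, buckets))  # about 0.9: rank 50 lies in the (0.5, 1.0] bucket
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest. A P95 that lands in a wide bucket is only known to within that bucket's range.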

Next Steps

SLO Reference — Latency targets and alerting thresholds
Deployment & Monitoring — Setting up the metrics pipeline
Cost Model — How costs are calculated
