Prometheus Metrics

HatiData exports Prometheus-compatible metrics on the :9090/metrics endpoint. V2 adds runtime lifecycle metrics alongside the existing V1 proxy metrics.

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'hatidata'
    static_configs:
      - targets: ['hatidata-proxy:9090']
    scrape_interval: 15s
    metrics_path: /metrics

V2 Runtime Metrics

Task Lifecycle

| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_tasks_created_total | Counter | agent_type, task_class | Tasks created |
| v2_tasks_completed_total | Counter | agent_type, task_class, status | Tasks completed (done/failed) |
| v2_task_duration_seconds | Histogram | agent_type, task_class | Task wall-clock time (create → complete) |
| v2_task_queue_depth | Gauge | task_class | Current queue depth by task class |

Attempt Lifecycle

| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_attempts_total | Counter | agent_type, status | Attempts created |
| v2_attempt_duration_seconds | Histogram | agent_type, task_class | Per-attempt duration |
| v2_attempts_per_task | Histogram | task_class | Retry count distribution |
| v2_lease_claims_total | Counter | agent_type | Successful lease claims |
| v2_lease_expirations_total | Counter | agent_type | Leases that expired (no heartbeat) |
| v2_lease_renewals_total | Counter | agent_type | Heartbeat renewals |

Model Routing

| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_model_decisions_total | Counter | task_class, model_id, provider | Routing decisions made |
| v2_model_fallbacks_total | Counter | task_class, from_model, to_model | Fallback triggered |
| v2_llm_invocation_duration_seconds | Histogram | provider, model_id | Per-invocation latency |
| v2_llm_invocation_cost_usd | Histogram | provider, model_id, task_class | Per-invocation cost |
| v2_llm_invocation_tokens_total | Counter | provider, model_id, direction | Tokens (direction: input/output/cached) |
| v2_llm_cache_hits_total | Counter | provider, model_id | Prompt cache hits |
| v2_llm_errors_total | Counter | provider, model_id, error_type | LLM call failures |

Artifact Health

| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_artifacts_produced_total | Counter | artifact_kind, agent_type | Artifacts created |
| v2_artifacts_validated_total | Counter | artifact_kind, status | Validation outcomes (passed/failed) |
| v2_artifacts_verified_total | Counter | artifact_kind, verifier_type | Verifier outcomes |
| v2_lineage_orphan_artifacts_total | Gauge | (none) | Artifacts with broken lineage chain |
| v2_artifact_confidence_score | Histogram | artifact_kind | Confidence score distribution |

Recovery

| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_recovery_actions_total | Counter | level, action_type | Recovery actions (L1-L4 + blocked) |
| v2_recovery_escalations_total | Counter | from_level, to_level | Escalation transitions |
| v2_human_reviews_pending | Gauge | entity_type, urgency | Pending review queue depth |
| v2_human_reviews_completed_total | Counter | entity_type, decision | Reviews completed (approved/rejected) |

Gate Evaluation

| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_gate_evaluations_total | Counter | gate_id, result | Gate evaluations (auto_approved/pending_manual/failed) |
| v2_gate_duration_seconds | Histogram | gate_id | Time from gate creation to resolution |
| v2_gate_predicate_failures_total | Counter | gate_id, predicate_key | Individual predicate failures |

Bandit Router (Optional)

| Metric | Type | Labels | Description |
|---|---|---|---|
| v2_bandit_pulls_total | Counter | task_class, model_id, mode | Bandit arm pulls (shadow/live) |
| v2_bandit_regret_estimate | Gauge | task_class | Estimated cumulative regret |

V1 Proxy Metrics (Existing)

These metrics continue to be exported alongside V2:

| Metric | Type | Description |
|---|---|---|
| hatidata_queries_total | Counter | SQL queries processed |
| hatidata_query_duration_seconds | Histogram | Query execution time |
| hatidata_query_errors_total | Counter | Query failures |
| hatidata_cache_hits_total | Counter | SQL hash cache hits |
| hatidata_cache_misses_total | Counter | SQL hash cache misses |
| hatidata_connections_active | Gauge | Active client connections |
| hatidata_memory_operations_total | Counter | store/search/load operations |
| hatidata_cot_steps_total | Counter | Reasoning steps logged |
| hatidata_trigger_fires_total | Counter | Semantic trigger activations |
| hatidata_branch_operations_total | Counter | Branch create/merge/delete |
| hatidata_abac_decisions_total | Counter | ABAC allow/deny decisions |

Sample Grafana Dashboards

V2 Runtime Overview

Import from: HatiData/dev/config/openobserve/dashboards/v2-runtime.json

Panels:

  1. Task Throughput: sum by (task_class) (rate(v2_tasks_created_total[5m]))
  2. Success Rate: sum(rate(v2_tasks_completed_total{status="done"}[5m])) / sum(rate(v2_tasks_completed_total[5m]))
  3. Model Cost: sum by (provider) (rate(v2_llm_invocation_cost_usd_sum[1h]))
  4. Attempt Distribution: histogram_quantile(0.95, sum by (le) (rate(v2_attempts_per_task_bucket[5m])))
  5. Lease Health: sum(rate(v2_lease_expirations_total[5m])) vs sum(rate(v2_lease_claims_total[5m]))
  6. Recovery Funnel: sum by (level) (v2_recovery_actions_total) (stacked bar)
  7. Queue Depth: v2_task_queue_depth, one series per task_class (gauge)
  8. Orphan Alert: v2_lineage_orphan_artifacts_total (should always be 0)
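
Since the orphan gauge is expected to stay at 0 and lease expirations are only meaningful relative to claims, both panels also lend themselves to alerts. The following is a minimal sketch of a Prometheus alerting rule file, not a shipped HatiData configuration. The file name, group name, alert names, `for` durations, the 10% ratio threshold, and the severity labels are all illustrative assumptions; only the metric names come from the tables above.

```yaml
# hatidata-v2-alerts.yml (hypothetical file name; reference it from rule_files in prometheus.yml)
groups:
  - name: hatidata-v2-runtime            # hypothetical group name
    rules:
      # Fires when any artifact has had a broken lineage chain for 5 minutes.
      - alert: HatiDataOrphanArtifacts   # hypothetical alert name
        expr: v2_lineage_orphan_artifacts_total > 0
        for: 5m                          # assumed hold time
        labels:
          severity: warning              # assumed severity
        annotations:
          summary: "Artifacts with a broken lineage chain detected"

      # Fires when more than 10% of claimed leases are expiring without a heartbeat.
      - alert: HatiDataLeaseExpirationsHigh   # hypothetical alert name
        expr: |
          sum(rate(v2_lease_expirations_total[15m]))
            /
          sum(rate(v2_lease_claims_total[15m])) > 0.10
        for: 15m                         # assumed hold time
        labels:
          severity: warning              # assumed severity
        annotations:
          summary: "Lease expirations exceed 10% of lease claims"
```

Both sides of the ratio are aggregated with `sum` so the division does not depend on the two counters sharing identical label sets.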

Model Performance Comparison

Panels:

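  1. Latency by Model: histogram_quantile(0.95, sum by (le, model_id) (rate(v2_llm_invocation_duration_seconds_bucket[5m])))
  2. Cost by Model: sum by (model_id) (v2_llm_invocation_cost_usd_sum)
  3. Success Rate by Model: sum by (model_id) (rate(v2_artifacts_validated_total{status="passed"}[5m])) / sum by (model_id) (rate(v2_artifacts_validated_total[5m])) (assumes the exporter attaches a model_id label to v2_artifacts_validated_total; the Artifact Health table lists only artifact_kind and status)
  4. Cache Hit Rate: sum(rate(v2_llm_cache_hits_total[5m])) / (sum(rate(v2_llm_cache_hits_total[5m])) + sum(rate(v2_llm_errors_total{error_type="cache_miss"}[5m])))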
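
Quantile and cost queries over histograms can be expensive to evaluate on every dashboard refresh, so one option is to precompute them with recording rules. The sketch below assumes a separate rule file; the group name, evaluation interval, and recorded series names are hypothetical (the names follow the common `level:metric:operation` convention) and are not metrics HatiData itself exports.

```yaml
# hatidata-v2-recording.yml (hypothetical file name)
groups:
  - name: hatidata-v2-model-performance   # hypothetical group name
    interval: 1m                          # assumed evaluation interval
    rules:
      # P95 LLM invocation latency per model over a 5-minute window.
      - record: model_id:v2_llm_invocation_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, model_id) (rate(v2_llm_invocation_duration_seconds_bucket[5m])))

      # Spend rate in USD per second per model over a 1-hour window.
      - record: model_id:v2_llm_invocation_cost_usd:rate1h
        expr: sum by (model_id) (rate(v2_llm_invocation_cost_usd_sum[1h]))
```

A dashboard panel can then query the recorded series directly instead of re-aggregating the raw buckets.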

PromQL Examples

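# Cost per successful artifact (last 24h)
# v2_artifacts_validated_total carries no task_class label, so both sides are summed to a single series
sum(increase(v2_llm_invocation_cost_usd_sum[24h]))
/
sum(increase(v2_artifacts_validated_total{status="passed"}[24h]))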

# Model fallback rate, per task class
sum by (task_class) (rate(v2_model_fallbacks_total[1h]))
/
sum by (task_class) (rate(v2_model_decisions_total[1h]))

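# P95 ExplainBundle latency
# The status selector assumes the exporter attaches a status label to this histogram;
# the Attempt Lifecycle table lists only agent_type and task_class.
histogram_quantile(0.95,
  sum by (le) (rate(v2_attempt_duration_seconds_bucket{status="completed_verified"}[5m]))
)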

# Human review backlog (should trend toward 0)
v2_human_reviews_pending
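
The backlog and fallback queries above can also drive alerts. This is a hedged sketch only: the file name, group name, alert names, the backlog threshold of 20, the 5% fallback threshold, and the `for` durations are assumptions chosen for illustration and should be tuned to the actual workload.

```yaml
# hatidata-v2-review-routing-alerts.yml (hypothetical file name)
groups:
  - name: hatidata-v2-review-and-routing   # hypothetical group name
    rules:
      # Fires when the total pending human-review queue stays above 20 for 30 minutes.
      - alert: HatiDataReviewBacklogHigh    # hypothetical alert name
        expr: sum(v2_human_reviews_pending) > 20   # threshold is an assumption
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Human review backlog is not draining"

      # Fires when more than 5% of routing decisions for a task class end in a fallback.
      - alert: HatiDataModelFallbackRateHigh   # hypothetical alert name
        expr: |
          sum by (task_class) (rate(v2_model_fallbacks_total[1h]))
            /
          sum by (task_class) (rate(v2_model_decisions_total[1h])) > 0.05
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Model fallback rate above 5% for {{ $labels.task_class }}"
```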
