SLO Reference
HatiData V2 defines explicit Service Level Objectives (SLOs) for every runtime operation. Use these targets for alerting configuration and capacity planning.
Operation Latency Targets
Write Operations
| Operation | P50 | P95 | P99 | Max | Notes |
|---|---|---|---|---|---|
| Task create | 3ms | 15ms | 50ms | 200ms | Single INSERT |
| Attempt claim | 5ms | 25ms | 80ms | 300ms | SELECT FOR UPDATE SKIP LOCKED |
| Heartbeat renew | 2ms | 10ms | 30ms | 100ms | Single UPDATE on lease |
| Attempt complete | 5ms | 20ms | 60ms | 200ms | UPDATE + event INSERT |
| Attempt fail | 5ms | 20ms | 60ms | 200ms | UPDATE + event + recovery INSERT |
| Memory store | 8ms | 30ms | 80ms | 300ms | DELETE + INSERT (upsert pattern) |
| Reasoning step | 5ms | 20ms | 50ms | 200ms | INSERT with hash chain validation |
| Branch create | 10ms | 40ms | 100ms | 500ms | Metadata only (copy-on-write) |
| Branch merge | 50ms | 200ms | 500ms | 2s | Depends on divergence size |
Read Operations
| Operation | P50 | P95 | P99 | Max | Notes |
|---|---|---|---|---|---|
| Memory search | 10ms | 50ms | 150ms | 500ms | ILIKE + optional vector similarity |
| Memory exact | 3ms | 10ms | 30ms | 100ms | Primary key lookup |
| ExplainBundle | 20ms | 100ms | 300ms | 1s | Joins across 5 tables |
| Task list | 5ms | 25ms | 80ms | 300ms | Indexed by project_id |
| Attempt chain | 10ms | 40ms | 100ms | 300ms | Indexed by task_id |
| Gate evaluation | 15ms | 60ms | 150ms | 500ms | Predicate evaluation + fact bag |
| Reward signals | 20ms | 80ms | 200ms | 500ms | View materialization |
SQL Proxy (V1 Path)
| Operation | P50 | P95 | P99 | Notes |
|---|---|---|---|---|
| Simple query | 5ms | 20ms | 50ms | Direct DuckDB execution |
| Transpiled query | 10ms | 40ms | 100ms | Snowflake → DuckDB transpilation |
| Vector search | 15ms | 60ms | 200ms | Embedding + cosine similarity |
| Cache hit | 1ms | 3ms | 10ms | SQL hash cache |
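
One way to compare measured latencies against the targets above is sketched below. It is a minimal illustration, not HatiData's own tooling: the function names, the nearest-rank percentile method, and the sample data are assumptions, and only the numeric targets come from the "Attempt claim" row.

```python
# Targets in milliseconds, copied from the "Attempt claim" row above.
ATTEMPT_CLAIM_SLO = {"p50": 5, "p95": 25, "p99": 80, "max": 300}


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed method)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_latency(samples_ms, slo):
    """Return {target_name: (observed_ms, target_ms, within_target)}."""
    observed = {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }
    return {name: (observed[name], slo[name], observed[name] <= slo[name])
            for name in slo}


if __name__ == "__main__":
    # Hypothetical measurements: mostly fast, with a slow tail.
    samples = [4] * 90 + [20] * 8 + [90, 400]
    for name, (got, target, ok) in check_latency(samples, ATTEMPT_CLAIM_SLO).items():
        print(f"{name}: observed={got}ms target={target}ms ok={ok}")
```

In a real deployment the percentiles would more likely come from histogram metrics than from raw samples; the comparison logic stays the same.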
View Freshness Targets
V2 views are derived from the runtime tables. Freshness targets define how quickly changes propagate:
| View | Target Freshness | Refresh Trigger | Staleness Alert |
|---|---|---|---|
| v_task_summary | < 5s | On task state change | > 30s |
| v_attempt_chain | < 5s | On attempt completion | > 30s |
| v_branch_divergence | < 30s | On memory write to branch | > 120s |
| v_reward_signals | < 60s | On verification complete | > 300s |
| v_lineage_graph | < 10s | On artifact state change | > 60s |
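
The freshness table can be read as two thresholds per view. The sketch below classifies a view's lag against them. The status labels (`fresh`, `lagging`, `stale`), the function name, and the idea of tracking a last-refresh timestamp are assumptions made for illustration; only the view names and second values come from the table.

```python
from datetime import datetime, timedelta, timezone

# (target freshness, staleness alert) in seconds, from the table above.
VIEW_FRESHNESS = {
    "v_task_summary": (5, 30),
    "v_attempt_chain": (5, 30),
    "v_branch_divergence": (30, 120),
    "v_reward_signals": (60, 300),
    "v_lineage_graph": (10, 60),
}


def freshness_status(view, last_refresh, now=None):
    """Classify a view's lag. Labels are hypothetical, not HatiData terms."""
    now = now or datetime.now(timezone.utc)
    lag = (now - last_refresh).total_seconds()
    target, alert = VIEW_FRESHNESS[view]
    if lag > alert:
        return "stale"    # beyond the staleness alert threshold
    if lag < target:
        return "fresh"    # within the freshness target
    return "lagging"      # missed the target, not yet alert-worthy


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(freshness_status("v_task_summary", now - timedelta(seconds=10), now))
```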
Availability Targets
| Component | Target | Measurement Window | Downtime Budget |
|---|---|---|---|
| SQL Proxy | 99.95% | Monthly | 21.9 min/month |
| Control Plane API | 99.9% | Monthly | 43.8 min/month |
| MCP Server | 99.9% | Monthly | 43.8 min/month |
| Runtime API (V2) | 99.9% | Monthly | 43.8 min/month |
| Lease Worker | 99.99% | Monthly | 4.4 min/month |
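
The downtime budgets above follow from the availability targets by simple arithmetic. The sketch below reproduces them, assuming an average month of 365.25 / 12 days (43,830 minutes); that month length is an inference that matches the table's figures, not something the table states.

```python
# Assumed average month length: 365.25 / 12 days = 43,830 minutes.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60


def downtime_budget_minutes(availability_pct):
    """Minutes of allowed downtime per month for a given availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_MONTH


if __name__ == "__main__":
    for target in (99.95, 99.9, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/month")
```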
Lease Worker SLO
The lease worker has the highest availability target because lease expiry directly affects agent throughput. If the worker is down for more than 3 minutes, all leased tasks expire and return to the queue, which causes unnecessary retries and added cost.
Error Budget Policy
| SLO Violation | Action |
|---|---|
| < 1% budget remaining | Feature freeze: no deploys until the SLO recovers |
| Budget exhausted | Incident declared: roll back to last known good |
| 3 consecutive SLO misses | Post-incident review required |
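
The policy table can be expressed as a small decision function. This is a sketch under assumptions: the table does not say how the rules combine, so the code treats budget exhaustion as taking precedence over the feature freeze and treats the review rule as independent. The function name and action strings are hypothetical.

```python
def error_budget_actions(budget_remaining_pct, consecutive_misses=0):
    """Map error-budget state to the actions listed in the policy table.

    Assumes exhaustion supersedes the freeze and that the review rule
    applies independently; the table itself does not specify precedence.
    """
    actions = []
    if budget_remaining_pct <= 0:
        actions.append("incident: roll back to last known good")
    elif budget_remaining_pct < 1:
        actions.append("feature freeze")
    if consecutive_misses >= 3:
        actions.append("post-incident review")
    return actions


if __name__ == "__main__":
    print(error_budget_actions(0.5))
    print(error_budget_actions(0, consecutive_misses=3))
```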
Alerting Thresholds
Critical (Pages On-Call)
| Condition | Threshold | Window |
|---|---|---|
| SQL proxy down | 0 healthy instances | Instant |
| Lease worker down | No heartbeats for 3 min | 3 min |
| ExplainBundle P99 > 5s | Sustained for 5 min | 5 min |
| Orphan artifacts detected | v2_lineage_orphan_artifacts_total > 0 | Instant |
Warning (Slack Notification)
| Condition | Threshold | Window |
|---|---|---|
| Task claim P95 > 100ms | Sustained for 10 min | 10 min |
| Memory search P95 > 200ms | Sustained for 10 min | 10 min |
| View freshness > 2x target | Any view | 5 min |
| Error rate > 1% | Any V2 endpoint | 5 min |
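
Several conditions above fire only when a threshold is breached for a sustained window. The sketch below shows one possible interpretation: every sample in the trailing window must exceed the threshold, and the data must cover the whole window. The function name, the sample format, and the coverage rule are assumptions; an alerting system may evaluate "sustained" differently.

```python
def sustained_breach(samples, threshold, window_s):
    """True if every sample in the trailing window exceeds the threshold.

    samples: list of (timestamp_seconds, value) pairs sorted by time.
    Returns False when the data does not yet cover the full window.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start = end - window_s
    # Require a sample at or before the window start, so the window is covered.
    if samples[0][0] > start:
        return False
    return all(value > threshold for t, value in samples if t >= start)


if __name__ == "__main__":
    # Hypothetical task claim P95 readings (ms), one per minute for 10 minutes.
    readings = [(t * 60, 120) for t in range(11)]
    print(sustained_breach(readings, threshold=100, window_s=600))
```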
Capacity Planning
| Metric | Current | Warning | Scale Trigger |
|---|---|---|---|
| Tasks created/min | 50 | 200 | 500 |
| Concurrent leases | 10 | 50 | 100 |
| Memory entries/project | 500 | 5,000 | 10,000 |
| ExplainBundle joins/sec | 20 | 100 | 200 |
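
The capacity table can be checked mechanically. In the sketch below, the metric keys are hypothetical identifiers invented for illustration, and treating the thresholds as inclusive is an assumption; the numeric values come from the Warning and Scale Trigger columns.

```python
# (warning, scale trigger) per metric, from the table above.
# The dictionary keys are hypothetical names, not real metric identifiers.
CAPACITY_THRESHOLDS = {
    "tasks_created_per_min": (200, 500),
    "concurrent_leases": (50, 100),
    "memory_entries_per_project": (5_000, 10_000),
    "explain_bundle_joins_per_sec": (100, 200),
}


def capacity_status(metric, value):
    """Return 'ok', 'warning', or 'scale' (thresholds assumed inclusive)."""
    warning, scale = CAPACITY_THRESHOLDS[metric]
    if value >= scale:
        return "scale"
    if value >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(capacity_status("tasks_created_per_min", 50))
    print(capacity_status("concurrent_leases", 75))
```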
Next Steps
- Prometheus Metrics — The metrics that feed these SLOs
- Deployment & Monitoring — Setting up observability
- Performance — Architecture-level optimization