SLO Reference

HatiData V2 defines explicit Service Level Objectives (SLOs) for every runtime operation. Use these targets for alerting configuration and capacity planning.

Operation Latency Targets

Write Operations

Operation	P50	P95	P99	Max	Notes
Task create	3ms	15ms	50ms	200ms	Single INSERT
Attempt claim	5ms	25ms	80ms	300ms	SELECT FOR UPDATE SKIP LOCKED
Heartbeat renew	2ms	10ms	30ms	100ms	Single UPDATE on lease
Attempt complete	5ms	20ms	60ms	200ms	UPDATE + event INSERT
Attempt fail	5ms	20ms	60ms	200ms	UPDATE + event + recovery INSERT
Memory store	8ms	30ms	80ms	300ms	DELETE + INSERT (upsert pattern)
Reasoning step	5ms	20ms	50ms	200ms	INSERT with hash chain validation
Branch create	10ms	40ms	100ms	500ms	Metadata only (copy-on-write)
Branch merge	50ms	200ms	500ms	2s	Depends on divergence size

Read Operations

Operation	P50	P95	P99	Max	Notes
Memory search	10ms	50ms	150ms	500ms	ILIKE + optional vector similarity
Memory exact	3ms	10ms	30ms	100ms	Primary key lookup
ExplainBundle	20ms	100ms	300ms	1s	Joins across 5 tables
Task list	5ms	25ms	80ms	300ms	Indexed by project_id
Attempt chain	10ms	40ms	100ms	300ms	Indexed by task_id
Gate evaluation	15ms	60ms	150ms	500ms	Predicate evaluation + fact bag
Reward signals	20ms	80ms	200ms	500ms	View materialization
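
To show how these targets can be checked, here is a minimal sketch that compares measured latencies against the table values. It is not part of HatiData. The operation keys, the function names, and the choice of the nearest-rank percentile method are all assumptions made for the example. Only the millisecond targets come from the tables above.

```python
import math

# Targets in milliseconds, copied from the tables above for two operations.
# The dictionary keys are hypothetical names chosen for this sketch.
TARGETS_MS = {
    "task_create": {"p50": 3, "p95": 15, "p99": 50, "max": 200},
    "memory_search": {"p50": 10, "p95": 50, "p99": 150, "max": 500},
}


def nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_latency(operation, samples_ms):
    """Return {level: (observed_ms, target_ms, within_target)} for one operation."""
    targets = TARGETS_MS[operation]
    observed = {
        "p50": nearest_rank(samples_ms, 50),
        "p95": nearest_rank(samples_ms, 95),
        "p99": nearest_rank(samples_ms, 99),
        "max": max(samples_ms),
    }
    return {level: (observed[level], targets[level], observed[level] <= targets[level])
            for level in targets}


if __name__ == "__main__":
    # Synthetic samples: most requests are fast, one outlier exceeds the Max target.
    samples = [2] * 90 + [12] * 8 + [40, 250]
    for level, row in check_latency("task_create", samples).items():
        print(level, row)
```

In production, percentiles would normally come from the metrics backend rather than raw samples. The sketch only illustrates that each percentile is compared against its own target.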

SQL Proxy (V1 Path)

Operation	P50	P95	P99	Notes
Simple query	5ms	20ms	50ms	Direct DuckDB execution
Transpiled query	10ms	40ms	100ms	Snowflake → DuckDB transpilation
Vector search	15ms	60ms	200ms	Embedding + cosine similarity
Cache hit	1ms	3ms	10ms	SQL hash cache

View Freshness Targets

V2 views are derived from the runtime tables. Freshness targets define how quickly changes propagate:

View	Target Freshness	Refresh Trigger	Staleness Alert
`v_task_summary`	< 5s	On task state change	> 30s
`v_attempt_chain`	< 5s	On attempt completion	> 30s
`v_branch_divergence`	< 30s	On memory write to branch	> 120s
`v_reward_signals`	< 60s	On verification complete	> 300s
`v_lineage_graph`	< 10s	On artifact state change	> 60s
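
As an illustration of how the freshness table might be applied, the sketch below classifies a view's observed lag. The view names and second values come from the table. The "> 2x target" warning level is taken from the Alerting Thresholds section. The function name and the four status labels are hypothetical.

```python
# (target_seconds, staleness_alert_seconds) copied from the table above.
VIEW_FRESHNESS = {
    "v_task_summary": (5, 30),
    "v_attempt_chain": (5, 30),
    "v_branch_divergence": (30, 120),
    "v_reward_signals": (60, 300),
    "v_lineage_graph": (10, 60),
}


def classify_freshness(view, lag_seconds):
    """Classify a view's lag. The status labels are hypothetical, not HatiData terms."""
    target, alert = VIEW_FRESHNESS[view]
    if lag_seconds > alert:
        return "stale"      # beyond the Staleness Alert column
    if lag_seconds > 2 * target:
        return "warning"    # beyond 2x the freshness target
    if lag_seconds >= target:
        return "late"       # missed the target, but below the warning level
    return "fresh"


if __name__ == "__main__":
    for lag in (3, 7, 12, 31):
        print("v_task_summary", lag, classify_freshness("v_task_summary", lag))
```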

Availability Targets

Component	Target	Measurement Window	Downtime Budget
SQL Proxy	99.95%	Monthly	21.9 min/month
Control Plane API	99.9%	Monthly	43.8 min/month
MCP Server	99.9%	Monthly	43.8 min/month
Runtime API (V2)	99.9%	Monthly	43.8 min/month
Lease Worker	99.99%	Monthly	4.4 min/month
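
The downtime budgets follow directly from the availability targets. The sketch below reproduces the column, assuming an average month of 365 × 24 × 60 / 12 = 43,800 minutes. That month length is inferred from the table figures and is not stated in the source. The function name is hypothetical.

```python
# Assumption: an average month of 43,800 minutes, which matches the table's figures.
MINUTES_PER_MONTH = 365 * 24 * 60 / 12


def downtime_budget_minutes(availability_pct):
    """Allowed downtime per month, in minutes, for a given availability target."""
    return (100 - availability_pct) / 100 * MINUTES_PER_MONTH


if __name__ == "__main__":
    for component, target in [
        ("SQL Proxy", 99.95),
        ("Control Plane API", 99.9),
        ("Lease Worker", 99.99),
    ]:
        print(f"{component}: {downtime_budget_minutes(target):.1f} min/month")
```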

Lease Worker SLO

The lease worker has the highest availability target because lease expiry directly affects agent throughput. If the worker is down for more than 3 minutes, all leased tasks expire and return to the queue, causing unnecessary retries and added cost.

Error Budget Policy

SLO Violation	Action
< 1% budget remaining	Feature freeze — no deploys until SLO recovers
Budget exhausted	Incident declared — rollback to last known good
3 consecutive SLO misses	Post-incident review required
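
One possible reading of this policy as code is sketched below. The function names are hypothetical. The sketch also assumes that an exhausted budget supersedes the feature freeze, and that the review requirement is evaluated independently of the other two rows. The source does not spell out either point.

```python
def budget_remaining_pct(budget_minutes, downtime_minutes):
    """Percentage of the error budget still unspent (negative if overspent)."""
    return (budget_minutes - downtime_minutes) / budget_minutes * 100


def policy_actions(remaining_pct, consecutive_misses):
    """Map budget state to the actions in the table above (one interpretation)."""
    actions = []
    if remaining_pct <= 0:
        # Assumption: exhaustion supersedes the feature freeze.
        actions.append("incident declared: roll back to last known good")
    elif remaining_pct < 1:
        actions.append("feature freeze: no deploys until SLO recovers")
    if consecutive_misses >= 3:
        actions.append("post-incident review required")
    return actions


if __name__ == "__main__":
    # 43.8 min/month is the budget for a 99.9% monthly target.
    remaining = budget_remaining_pct(43.8, 43.5)
    print(round(remaining, 2), policy_actions(remaining, consecutive_misses=0))
```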

Alerting Thresholds

Critical (Pages On-Call)

Condition	Threshold	Window
SQL proxy down	0 healthy instances	Instant
Lease worker down	No heartbeats for 3 min	3 min
ExplainBundle P99 > 5s	Sustained for 5 min	5 min
Orphan artifacts detected	`v2_lineage_orphan_artifacts_total > 0`	Instant

Warning (Slack Notification)

Condition	Threshold	Window
Task claim P95 > 100ms	Sustained for 10 min	10 min
Memory search P95 > 200ms	Sustained for 10 min	10 min
View freshness > 2x target	Any view	5 min
Error rate > 1%	Any V2 endpoint	5 min
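
The "Sustained for N min" conditions can be read as "every sample in the window breaches the threshold". The sketch below implements that reading. It is an assumption about the semantics, not HatiData's alerting implementation. The function name and the sample format are hypothetical.

```python
def sustained_breach(samples, threshold, window_s, now):
    """samples: list of (timestamp_seconds, value) pairs.

    True only if data covers the whole window and every sample inside it
    exceeds the threshold.
    """
    start = now - window_s
    in_window = [value for ts, value in samples if start <= ts <= now]
    # Require at least one sample at or before the window start,
    # so a freshly started series cannot fire early.
    has_full_window = any(ts <= start for ts, _ in samples)
    return has_full_window and bool(in_window) and all(v > threshold for v in in_window)


if __name__ == "__main__":
    # Task claim P95 of 120 ms, sampled every 60 s for 10 minutes.
    series = [(t, 120) for t in range(0, 601, 60)]
    print(sustained_breach(series, threshold=100, window_s=600, now=600))
```

A single sample at or below the threshold inside the window resets the condition, which keeps short spikes from notifying anyone.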

Capacity Planning

Metric	Current	Warning	Scale Trigger
Tasks created/min	50	200	500
Concurrent leases	10	50	100
Memory entries/project	500	5,000	10,000
ExplainBundle joins/sec	20	100	200
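
A small sketch of how the capacity table might drive a status check follows. The metric keys and status labels are hypothetical, and treating each threshold as inclusive is an assumption. The numeric values are copied from the table.

```python
# (warning, scale_trigger) copied from the table above; key names are hypothetical.
CAPACITY = {
    "tasks_created_per_min": (200, 500),
    "concurrent_leases": (50, 100),
    "memory_entries_per_project": (5_000, 10_000),
    "explainbundle_joins_per_sec": (100, 200),
}


def capacity_status(metric, value):
    """Return 'ok', 'warning', or 'scale' for an observed metric value."""
    warning, scale = CAPACITY[metric]
    if value >= scale:      # assumption: thresholds are inclusive
        return "scale"
    if value >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(capacity_status("tasks_created_per_min", 50))   # the Current value in the table
    print(capacity_status("concurrent_leases", 75))
```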

Next Steps

Prometheus Metrics — The metrics that feed these SLOs
Deployment & Monitoring — Setting up observability
Performance — Architecture-level optimization
