Core Concepts
This page covers the fundamental architecture of HatiData: the two-plane design, the 13-step query pipeline, caching tiers, agent identification, and SQL transpilation.
Two-Plane Architecture
HatiData separates compute and control into two independent planes connected via AWS PrivateLink (Enterprise) or direct HTTPS (Cloud).
Data Plane
The data plane runs where your data lives -- either on your machine (Local), in HatiData-managed infrastructure (Cloud), or inside your VPC (Enterprise). It consists of:
- Proxy (`hatidata-proxy`, port 5439) -- Accepts connections via the Postgres wire protocol. Parses incoming SQL, runs it through the 13-step pipeline, and returns results.
- DuckDB engine -- Embedded analytical database that executes transpiled queries. Optimized for columnar scans and aggregations.
- Cache layers -- Three-tier caching (L1 in-memory, L2 NVMe disk, L3 S3/Iceberg) to minimize redundant computation.
Control Plane
The control plane manages everything that is not query execution:
- API server (`hatidata-control-plane`, port 8080) -- Axum-based REST API with 100+ endpoints covering authentication, policy management, billing, audit, federation, and multi-tenancy.
- Authentication -- JWT tokens, API keys (with scopes and IP allowlists), and federated identity (AWS STS, Azure AD, Google Cloud Identity).
- Dashboard -- React/Vite web application for query audit, policy management, API key management, billing, and environment configuration.
How They Connect
AI Agents → HatiData (SQL + Memory + Triggers) → Results
In Local mode, both planes are embedded in the CLI process. In Cloud mode, the data plane runs on HatiData-managed infrastructure. In Enterprise mode, the data plane runs in your VPC and connects to the control plane over PrivateLink.
The 13-Step Query Pipeline
Every SQL query that enters HatiData flows through 13 steps before a result is returned. This is not optional -- security, compliance, and observability are built into every query path.
Step 1: Concurrency Control
A semaphore-based admission controller limits the number of concurrent queries. Queries that exceed the limit are queued, not rejected. This prevents resource exhaustion on the DuckDB engine.
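The admission pattern is essentially a counting semaphore in front of the engine. A minimal sketch in Python with asyncio, assuming a configurable limit (HatiData's actual controller lives in the Rust proxy, so this is only an analogy):

```python
import asyncio

MAX_CONCURRENT_QUERIES = 8  # assumed configurable limit
admission = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)

async def execute_on_engine(sql: str) -> list:
    await asyncio.sleep(0.1)  # stand-in for real DuckDB execution
    return []

async def run_query(sql: str) -> list:
    # Queries beyond the limit wait here (queued) instead of being rejected.
    async with admission:
        return await execute_on_engine(sql)
```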
Step 2: Table Extraction
An AST walker traverses the parsed SQL to identify all tables referenced in the query. This list feeds into policy evaluation, row-level security, and metering.
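For illustration, the same idea can be reproduced in Python with the sqlglot library (HatiData's walker is built into the proxy, so treat this as an analogy, not the actual implementation):

```python
from sqlglot import exp, parse_one

def extract_tables(sql: str) -> set[str]:
    # Walk the parsed AST and collect every table reference,
    # including tables hidden inside joins and subqueries.
    tree = parse_one(sql, read="snowflake")
    return {table.name for table in tree.find_all(exp.Table)}

print(extract_tables(
    "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id"
))  # e.g. {'orders', 'customers'}
```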
Step 3: Policy Evaluation
ABAC (Attribute-Based Access Control) policies are evaluated against a rich context that includes the user's role, agent ID, agent framework, query origin, time of day, environment, and the tables being accessed. Policies can block, warn, or log.
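Conceptually, each policy is a predicate over that context plus an action. A simplified sketch with an illustrative policy shape (the attribute names mirror the list above; the real policy format may differ):

```python
from dataclasses import dataclass, field

@dataclass
class QueryContext:
    role: str
    agent_id: str
    framework: str
    origin: str
    hour: int                 # local hour of day, 0-23
    environment: str          # e.g. "prod", "staging"
    tables: set[str] = field(default_factory=set)

def evaluate(policies: list[dict], ctx: QueryContext) -> str:
    # Return the strongest action among matching policies:
    # "block" > "warn" > "log" > "allow".
    strength = {"allow": 0, "log": 1, "warn": 2, "block": 3}
    decision = "allow"
    for policy in policies:
        if policy["condition"](ctx) and strength[policy["action"]] > strength[decision]:
            decision = policy["action"]
    return decision

# Example: block LangChain agents from a (hypothetical) PII table after hours.
after_hours_pii = {
    "action": "block",
    "condition": lambda c: c.framework == "langchain"
    and "pii_customers" in c.tables
    and not (9 <= c.hour < 18),
}
```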
Step 4: Cost Estimation
A heuristic cost estimator assigns a credit cost to the query based on table sizes, join complexity, and aggregation depth. This cost is used for quota enforcement and billing.
Step 5: Quota Enforcement
Per-organization credit limits are checked. If the query would exceed the org's monthly quota, it is rejected with a QUOTA_EXCEEDED error.
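A rough sketch of how these two steps fit together, with made-up weights and a hypothetical quota check (the real coefficients and quota storage are internal to HatiData):

```python
def estimate_cost(row_counts: list[int], join_count: int, agg_depth: int) -> float:
    # Heuristic: scanned rows dominate, joins and aggregations multiply the work.
    # The weights below are illustrative, not HatiData's real coefficients.
    scan_credits = sum(row_counts) / 1_000_000
    return scan_credits * (1 + 0.5 * join_count) * (1 + 0.25 * agg_depth)

class QuotaExceeded(Exception):
    pass

def enforce_quota(org_usage: float, org_limit: float, query_cost: float) -> None:
    if org_usage + query_cost > org_limit:
        raise QuotaExceeded("QUOTA_EXCEEDED: monthly credit limit reached")
```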
Step 6: Row-Level Security
WHERE clauses are injected into the query based on the user's role and agent context. Filters support agent-aware placeholders like {agent_id}, {user_id}, and {org_id}, which are resolved at query time.
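A naive string-level sketch of placeholder resolution (HatiData rewrites the parsed query rather than the raw text, and the filter templates below are invented, so treat this purely as an illustration):

```python
RLS_FILTERS = {
    # Illustrative filter templates per table; real filters are policy-defined.
    "orders": "tenant_org = '{org_id}'",
    "agent_memory": "owner_agent = '{agent_id}'",
}

def apply_rls(sql: str, table: str, ctx: dict) -> str:
    template = RLS_FILTERS.get(table)
    if template is None:
        return sql
    # Resolve {agent_id}, {user_id}, {org_id} from the connection context.
    predicate = template.format(**ctx)
    joiner = " AND " if " where " in sql.lower() else " WHERE "
    return sql + joiner + predicate

print(apply_rls(
    "SELECT * FROM agent_memory",
    "agent_memory",
    {"agent_id": "analytics-agent-1", "user_id": "u1", "org_id": "org42"},
))
# SELECT * FROM agent_memory WHERE owner_agent = 'analytics-agent-1'
```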
Step 7: SQL Transpilation
The Snowflake-compatible SQL is rewritten to DuckDB-native SQL. This includes:
- Function mapping -- NVL to COALESCE, IFF to IF, DATEDIFF to DATE_DIFF, LISTAGG to STRING_AGG, ARRAY_AGG to LIST, and more.
- Type mapping -- VARIANT to JSON, TIMESTAMP_NTZ to TIMESTAMP, NUMBER to NUMERIC, and more.
- Construct rewriting -- FLATTEN to UNNEST, OBJECT_CONSTRUCT to json_object, colon notation to json_extract_string, and more.
Transpilation results are cached by SQL hash in a DashMap, so repeated queries skip this step entirely.
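A minimal sketch of the hash-keyed cache idea, using a plain Python dict in place of the proxy's concurrent DashMap and a toy stand-in for the rewrite step:

```python
import hashlib

def transpile(sql: str) -> str:
    # Toy stand-in for the real rewrite pipeline (function/type/construct mapping).
    return sql.replace("NVL(", "COALESCE(")

_transpile_cache: dict[str, str] = {}  # stands in for the DashMap

def transpile_cached(snowflake_sql: str) -> str:
    key = hashlib.sha256(snowflake_sql.encode()).hexdigest()
    cached = _transpile_cache.get(key)
    if cached is not None:
        return cached  # repeated queries skip parsing and rewriting entirely
    duckdb_sql = transpile(snowflake_sql)
    _transpile_cache[key] = duckdb_sql
    return duckdb_sql
```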
Step 8: Snapshot Pinning
Queries are pinned to an Iceberg snapshot, providing repeatable reads within a session. This ensures that concurrent data loads do not affect in-flight queries.
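Conceptually, the first query that touches a table resolves its current snapshot, and every later query in the session reuses that snapshot. A toy sketch, with the catalog lookup passed in as a callable because the real pinning happens inside the proxy's Iceberg integration:

```python
class Session:
    """Pins each table to the snapshot that was current at first use."""

    def __init__(self, current_snapshot_id):
        # current_snapshot_id: callable returning the latest snapshot id for a
        # table (a stand-in for the proxy's Iceberg catalog lookup).
        self.current_snapshot_id = current_snapshot_id
        self.pinned: dict[str, int] = {}

    def snapshot_for(self, table: str) -> int:
        # Later queries reuse the pinned snapshot even if a concurrent data
        # load has already created a newer one.
        return self.pinned.setdefault(table, self.current_snapshot_id(table))

session = Session(current_snapshot_id=lambda table: 42)  # toy catalog
assert session.snapshot_for("orders") == session.snapshot_for("orders")
```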
Step 9: DuckDB Execution
The transpiled SQL executes on the embedded DuckDB engine. DuckDB is optimized for analytical workloads: columnar storage, vectorized execution, and automatic parallelism.
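The embedded engine is the same DuckDB you can run from Python, so it is easy to get a feel for this step in isolation (this connects directly to DuckDB and bypasses HatiData entirely):

```python
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE events AS SELECT range AS id, range % 3 AS kind FROM range(1000)")
rows = con.execute(
    "SELECT kind, COUNT(*) AS n FROM events GROUP BY kind ORDER BY kind"
).fetchall()
print(rows)  # [(0, 334), (1, 333), (2, 333)]
```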
Step 10: AI Healing
If the query fails at execution (for example, due to an edge case the transpiler missed), HatiData can optionally call an external AI API with the error message and original SQL. The AI suggests a corrected query, which is retried automatically. This is disabled by default and requires configuring HATIDATA_AI_HEALER_ENDPOINT.
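In outline, the healer is a retry hook around execution. A hedged sketch, assuming a JSON request/response shape (the actual payload format expected by the HATIDATA_AI_HEALER_ENDPOINT service is not documented here; `con` is any connection object with execute/fetchall, such as the DuckDB one above):

```python
import os
import requests

def execute_with_healing(con, sql: str):
    try:
        return con.execute(sql).fetchall()
    except Exception as err:
        endpoint = os.environ.get("HATIDATA_AI_HEALER_ENDPOINT")
        if not endpoint:
            raise  # healing is disabled by default
        # Assumed request/response fields, shown only to illustrate the flow.
        suggestion = requests.post(
            endpoint, json={"sql": sql, "error": str(err)}, timeout=30
        ).json().get("corrected_sql")
        if not suggestion:
            raise
        return con.execute(suggestion).fetchall()
```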
Step 11: Column Masking
Sensitive columns are masked post-execution based on the user's role. Masking levels range from full visibility (Admin/Owner) to partial masking (Analyst sees john***@example.com) to full redaction (Agent sees [REDACTED]).
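A sketch of role-based masking for an email column, matching the three levels described above (the masking rules shown are illustrative):

```python
def mask_email(value: str, role: str) -> str:
    if role in {"admin", "owner"}:
        return value                         # full visibility
    if role == "analyst":
        local, _, domain = value.partition("@")
        return local[:4] + "***@" + domain   # partial masking
    return "[REDACTED]"                      # agents see nothing

print(mask_email("john.doe@example.com", "analyst"))  # john***@example.com
```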
Step 12: Metering
Credit usage is recorded for billing. Prometheus metrics are emitted for latency, throughput, error rates, and per-agent usage.
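The per-agent metrics could be emitted with any Prometheus client; in Python the equivalent instrumentation looks roughly like this (metric names are illustrative, not HatiData's actual metric names):

```python
from prometheus_client import Counter, Histogram

QUERY_CREDITS = Counter(
    "hatidata_query_credits_total", "Credits consumed per agent", ["agent_id"]
)
QUERY_LATENCY = Histogram(
    "hatidata_query_latency_seconds", "Query latency per agent", ["agent_id"]
)
QUERY_ERRORS = Counter(
    "hatidata_query_errors_total", "Failed queries per agent", ["agent_id"]
)

def record(agent_id: str, credits: float, latency_s: float, failed: bool) -> None:
    QUERY_CREDITS.labels(agent_id=agent_id).inc(credits)
    QUERY_LATENCY.labels(agent_id=agent_id).observe(latency_s)
    if failed:
        QUERY_ERRORS.labels(agent_id=agent_id).inc()
```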
Step 13: Audit
A PII-redacted, tamper-evident audit log entry is written. Audit events are hash-chained (SHA-256) so that any modification to a past event breaks the chain, making tampering detectable.
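Hash chaining itself is simple to illustrate: each entry stores a hash computed over the previous entry's hash plus its own content, so editing any past entry invalidates every later hash. A minimal sketch (the event fields are illustrative):

```python
import hashlib
import json

def chain_hash(prev_hash: str, event: dict) -> str:
    # Canonicalize the event, then hash it together with the previous hash.
    payload = prev_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

log = []
prev = "0" * 64  # genesis hash
for event in [{"agent_id": "analytics-agent-1", "action": "SELECT", "table": "orders"}]:
    prev = chain_hash(prev, event)
    log.append({"event": event, "hash": prev})

# Verification: recompute the chain and compare hashes; any edited event breaks it.
```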
Three-Tier Caching
HatiData uses a three-tier cache hierarchy to minimize query latency:
| Tier | Backing | Capacity | Latency | Eviction |
|---|---|---|---|---|
| L1 | In-memory (moka) | Configurable (default 10,000 entries) | Sub-microsecond | TTL + LRU |
| L2 | NVMe disk | Configurable (default 1 GB) | Sub-millisecond | TTL + size cap |
| L3 | Object storage (S3/GCS/Azure Blob) | Unlimited (Iceberg format) | 10-100ms | Snapshot-based |
Cache lookups proceed from L1 to L3. Cache invalidation is snapshot-aware: when a new Iceberg snapshot is created (due to data loading), cached results for affected tables are evicted.
The transpilation cache is separate from the data cache. It stores SQL rewrites keyed by input SQL hash, so the parse-and-rewrite pipeline is skipped for repeated queries.
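A toy read path for the three tiers, with in-process structures standing in for moka, the NVMe tier, and object storage (capacities, TTLs, and the L3 fetch are placeholders):

```python
import time

class TieredCache:
    def __init__(self, fetch_from_l3, l1_ttl_s: float = 300.0):
        self.l1: dict[str, tuple[float, object]] = {}  # key -> (expiry, value)
        self.l2: dict[str, object] = {}                # stands in for the NVMe tier
        self.fetch_from_l3 = fetch_from_l3             # e.g. an Iceberg/S3 scan
        self.l1_ttl_s = l1_ttl_s

    def get(self, key: str):
        entry = self.l1.get(key)
        if entry and entry[0] > time.monotonic():      # L1 hit, still fresh
            return entry[1]
        if key in self.l2:                              # L2 hit: promote to L1
            value = self.l2[key]
        else:                                           # miss: go to object storage
            value = self.fetch_from_l3(key)
            self.l2[key] = value
        self.l1[key] = (time.monotonic() + self.l1_ttl_s, value)
        return value

cache = TieredCache(fetch_from_l3=lambda key: f"result for {key}")  # toy L3
print(cache.get("orders:daily_totals"))
```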
Agent Identification
Agents identify themselves to HatiData through Postgres startup parameters. When an agent connects, it passes:
| Parameter | Purpose |
|---|---|
| `agent_id` | Unique agent identifier (e.g., analytics-agent-1) |
| `framework` | AI framework name (e.g., langchain, crewai, autogen, custom) |
| `request_id` | Optional per-request trace ID for reasoning chain tracking |
| `priority` | Query priority (low, normal, high, critical) |
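Because these travel as ordinary Postgres startup parameters, most drivers can supply them. For example, asyncpg's server_settings sends arbitrary key/value pairs in the startup message (connection details below are placeholders; check your own driver for the equivalent option):

```python
import asyncio
import asyncpg

async def connect_as_agent():
    return await asyncpg.connect(
        host="localhost",
        port=5439,                      # hatidata-proxy default port
        user="analytics",
        password="change-me",           # placeholder credentials
        database="analytics",
        server_settings={               # sent as Postgres startup parameters
            "agent_id": "analytics-agent-1",
            "framework": "langchain",
            "request_id": "req-123",
            "priority": "high",
        },
    )

conn = asyncio.run(connect_as_agent())
```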
These parameters flow through the entire 13-step pipeline:
- Policy evaluation uses `agent_id` and `framework` for ABAC rules (e.g., "LangChain agents cannot query PII tables after hours").
- Row-level security resolves `{agent_id}` placeholders in WHERE filters.
- Metering tracks per-agent credit usage for billing.
- Audit records agent identity in every log entry.
- Scheduling uses `priority` to order queued queries.
Snowflake SQL Transpilation
HatiData accepts Snowflake-compatible SQL and transpiles it to DuckDB in five stages:
- Pre-parse -- Snowflake colon notation (`col:field`) is rewritten to `json_extract_string(col, 'field')` before the SQL parser sees it.
- Parse -- The SQL is parsed using the Snowflake dialect of sqlparser.
- Function rewrites -- Snowflake functions are mapped to DuckDB equivalents (NVL to COALESCE, IFF to IF, DATEDIFF to DATE_DIFF, etc.).
- Type rewrites -- Snowflake data types are mapped to DuckDB types (VARIANT to JSON, TIMESTAMP_NTZ to TIMESTAMP, etc.).
- Construct rewrites -- Snowflake-specific constructs are transformed (FLATTEN to UNNEST, OBJECT_CONSTRUCT to json_object, etc.).
Each query receives a confidence score indicating the reliability of the transpilation. Queries that cannot be transpiled cleanly return an error with details about the unsupported construct.
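You can approximate the effect of these stages with sqlglot's Snowflake-to-DuckDB transpilation. This is an analogy for illustration only, not HatiData's transpiler, and the exact output depends on the sqlglot version:

```python
import sqlglot

snowflake_sql = "SELECT NVL(payload:customer_id, 'unknown') AS cid FROM raw_events"
duckdb_sql = sqlglot.transpile(snowflake_sql, read="snowflake", write="duckdb")[0]
print(duckdb_sql)
# Typically NVL becomes COALESCE and the colon access becomes a DuckDB JSON
# extraction; unsupported constructs raise an error instead.
```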
For the full list of supported functions, types, and constructs, see SQL Compatibility.
Deployment Tiers
HatiData offers three deployment tiers. Moving between them takes a single command, hati push --target cloud|vpc -- your queries and schemas stay the same.
Local (Free)
Everything runs on your machine. The hati CLI embeds the DuckDB engine, the transpiler, and a local proxy. Data is stored in a .hati/ directory. No network calls, no telemetry, no cloud account.
Best for: development, prototyping, offline work, CI pipelines.
See Local Mode for details.
Cloud ($29/month)
HatiData hosts a managed proxy and control plane. You push your local data with hati push --target cloud and receive a connection string. Includes a dashboard for query audit, policy management, and billing.
Best for: small teams, startups, shared environments, production workloads that do not require VPC isolation.
See Cloud Mode for details.
Enterprise (Custom Pricing)
The data plane deploys inside your AWS/GCP/Azure VPC. The control plane communicates over PrivateLink. Data never leaves your network. Includes SLAs, dedicated support, Shadow Mode for migration validation, and HatiData Shield IP protection.
Best for: regulated industries, large-scale production, organizations with strict data residency requirements.
See Enterprise for details.
Next Steps
- Quickstart -- Install and run your first query in under 5 minutes
- Local Mode -- Deep dive into offline development
- Cloud Mode -- Dashboard, connection strings, and team access
- Enterprise -- In-VPC deployment with PrivateLink
- SQL Compatibility -- Full list of supported functions and types
- Security Overview -- RBAC, ABAC, encryption, and audit