Core Concepts

This page covers the fundamental architecture of HatiData: the two-plane design, the 13-step query pipeline, caching tiers, agent identification, SQL transpilation, and deployment tiers.

Two-Plane Architecture

HatiData separates compute and control into two independent planes connected via AWS PrivateLink (Enterprise) or direct HTTPS (Cloud).

Data Plane

The data plane runs where your data lives -- either on your machine (Local), in HatiData-managed infrastructure (Cloud), or inside your VPC (Enterprise). It consists of:

  • Proxy (hatidata-proxy, port 5439) -- Accepts connections via the Postgres wire protocol. Parses incoming SQL, runs it through the 13-step pipeline, and returns results.
  • DuckDB engine -- Embedded analytical database that executes transpiled queries. Optimized for columnar scans and aggregations.
  • Cache layers -- Three-tier caching (L1 in-memory, L2 NVMe disk, L3 S3/Iceberg) to minimize redundant computation.

Control Plane

The control plane manages everything that is not query execution:

  • API server (hatidata-control-plane, port 8080) -- Axum-based REST API with 100+ endpoints covering authentication, policy management, billing, audit, federation, and multi-tenancy.
  • Authentication -- JWT tokens, API keys (with scopes and IP allowlists), and federated identity (AWS STS, Azure AD, Google Cloud Identity).
  • Dashboard -- React/Vite web application for query audit, policy management, API key management, billing, and environment configuration.

How They Connect

AI Agents → HatiData (SQL + Memory + Triggers) → Results

In Local mode, both planes are embedded in the CLI process. In Cloud mode, the data plane runs on HatiData-managed infrastructure. In Enterprise mode, the data plane runs in your VPC and connects to the control plane over PrivateLink.

The 13-Step Query Pipeline

Every SQL query that enters HatiData flows through 13 steps before a result is returned. This is not optional -- security, compliance, and observability are built into every query path.

Step 1: Concurrency Control

A semaphore-based admission controller limits the number of concurrent queries. Queries that exceed the limit are queued, not rejected. This prevents resource exhaustion on the DuckDB engine.
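
The admission controller behaves like a counting semaphore placed in front of the engine. The sketch below is illustrative Python (HatiData's proxy is not Python), and the limit of 8 is a placeholder, not the product default:

    import asyncio

    MAX_CONCURRENT_QUERIES = 8                    # placeholder limit, not the product default
    admission = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)

    async def run_query(execute, sql: str):
        # Queries beyond the limit wait here (queued) rather than being rejected.
        async with admission:
            return await execute(sql)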

Step 2: Table Extraction

An AST walker traverses the parsed SQL to identify all tables referenced in the query. This list feeds into policy evaluation, row-level security, and metering.
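
HatiData's walker runs against its own parsed AST; the snippet below only illustrates the idea in Python using the sqlglot library, which is not what HatiData uses internally:

    import sqlglot
    from sqlglot import exp

    def referenced_tables(sql: str) -> set[str]:
        # Walk the parsed AST and collect every table reference.
        return {t.name for t in sqlglot.parse_one(sql, read="snowflake").find_all(exp.Table)}

    print(referenced_tables("SELECT o.id FROM orders o JOIN customers c ON c.id = o.customer_id"))
    # e.g. {'customers', 'orders'}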

Step 3: Policy Evaluation

ABAC (Attribute-Based Access Control) policies are evaluated against a rich context that includes the user's role, agent ID, agent framework, query origin, time of day, environment, and the tables being accessed. Policies can block, warn, or log.
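
A minimal sketch of what an ABAC decision might look like. The context shape and the rules are invented for illustration; HatiData's actual policy schema is not shown here:

    from dataclasses import dataclass, field

    @dataclass
    class QueryContext:                          # hypothetical context shape
        role: str
        agent_id: str
        framework: str
        tables: set = field(default_factory=set)
        hour_utc: int = 12

    def evaluate(ctx: QueryContext) -> str:
        # Example rule: agents may not touch PII tables outside business hours.
        if ctx.role == "agent" and "pii_customers" in ctx.tables and not (9 <= ctx.hour_utc < 18):
            return "block"
        if ctx.framework == "custom":
            return "warn"                        # unknown frameworks are allowed but flagged
        return "log"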

Step 4: Cost Estimation

A heuristic cost estimator assigns a credit cost to the query based on table sizes, join complexity, and aggregation depth. This cost is used for quota enforcement and billing.
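
Conceptually, the estimator is a simple function of query shape. The weights below are made up for illustration and are not HatiData's actual pricing heuristics:

    def estimate_credits(table_rows: list[int], join_count: int, aggregation_depth: int) -> float:
        # Illustrative weights only -- the real heuristics are internal to HatiData.
        scan_cost = sum(rows / 1_000_000 for rows in table_rows)   # ~1 credit per million rows scanned
        join_cost = 0.5 * join_count
        agg_cost = 0.25 * aggregation_depth
        return round(scan_cost + join_cost + agg_cost, 2)

    estimate_credits(table_rows=[5_000_000, 200_000], join_count=1, aggregation_depth=2)  # 6.2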

Step 5: Quota Enforcement

Per-organization credit limits are checked. If the query would exceed the org's monthly quota, it is rejected with a QUOTA_EXCEEDED error.

Step 6: Row-Level Security

WHERE clauses are injected into the query based on the user's role and agent context. Filters support agent-aware placeholders like {agent_id}, {user_id}, and {org_id}, which are resolved at query time.
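
Placeholder resolution can be pictured as a template substitution followed by a WHERE-clause rewrite. The filter, session values, and table below are invented, and a real rewriter would merge the filter with any existing WHERE clause rather than assume there is none:

    def resolve_filter(filter_template: str, session: dict) -> str:
        # {agent_id}, {user_id}, and {org_id} are replaced with values from the session.
        return filter_template.format(**session)

    session = {"agent_id": "analytics-agent-1", "user_id": "u_42", "org_id": "org_7"}
    rls_filter = resolve_filter("owner_agent = '{agent_id}' AND org_id = '{org_id}'", session)
    sql = f"SELECT * FROM orders WHERE ({rls_filter})"
    # SELECT * FROM orders WHERE (owner_agent = 'analytics-agent-1' AND org_id = 'org_7')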

Step 7: SQL Transpilation

The Snowflake-compatible SQL is rewritten to DuckDB-native SQL. This includes:

  • Function mapping -- NVL to COALESCE, IFF to IF, DATEDIFF to DATE_DIFF, LISTAGG to STRING_AGG, ARRAY_AGG to LIST, and more.
  • Type mapping -- VARIANT to JSON, TIMESTAMP_NTZ to TIMESTAMP, NUMBER to NUMERIC, and more.
  • Construct rewriting -- FLATTEN to UNNEST, OBJECT_CONSTRUCT to json_object, colon notation to json_extract_string, and more.

Transpilation results are cached by SQL hash in a DashMap, so repeated queries skip this step entirely.
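
A rough Python equivalent of that hash-keyed cache, ignoring eviction and concurrency (whether HatiData hashes the raw or a normalized SQL string is not specified here):

    import hashlib

    _transpile_cache: dict[str, str] = {}

    def transpile_cached(sql: str, transpile) -> str:
        key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if key not in _transpile_cache:
            _transpile_cache[key] = transpile(sql)   # pay the parse-and-rewrite cost only once
        return _transpile_cache[key]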

Step 8: Snapshot Pinning

Queries are pinned to an Iceberg snapshot, providing repeatable reads within a session. This ensures that concurrent data loads do not affect in-flight queries.

Step 9: DuckDB Execution

The transpiled SQL executes on the embedded DuckDB engine. DuckDB is optimized for analytical workloads: columnar storage, vectorized execution, and automatic parallelism.
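
To get a feel for the execution layer, you can run the same kind of transpiled SQL against a local embedded DuckDB from Python (this is plain DuckDB, not HatiData's proxy):

    import duckdb

    con = duckdb.connect()                       # in-memory, embedded engine
    con.execute("CREATE TABLE events AS SELECT range AS id, range % 3 AS bucket FROM range(1000000)")
    print(con.execute("SELECT bucket, COUNT(*) FROM events GROUP BY bucket ORDER BY bucket").fetchall())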

Step 10: AI Healing

If the query fails at execution (for example, due to an edge case the transpiler missed), HatiData can optionally call an external AI API with the error message and original SQL. The AI suggests a corrected query, which is retried automatically. This is disabled by default and requires configuring HATIDATA_AI_HEALER_ENDPOINT.
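
A conceptual sketch of that retry loop. Only the HATIDATA_AI_HEALER_ENDPOINT setting comes from the description above; the request and response shapes used here are assumptions, not a documented HatiData contract:

    import json, os, urllib.request

    def execute_with_healing(execute, sql: str) -> list:
        try:
            return execute(sql)
        except Exception as err:
            endpoint = os.environ.get("HATIDATA_AI_HEALER_ENDPOINT")
            if not endpoint:
                raise                                      # healing is disabled by default
            # Assumed request/response shape for illustration only.
            payload = json.dumps({"sql": sql, "error": str(err)}).encode("utf-8")
            req = urllib.request.Request(endpoint, data=payload,
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                corrected_sql = json.load(resp)["corrected_sql"]
            return execute(corrected_sql)                  # retry once with the suggested fix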

Step 11: Column Masking

Sensitive columns are masked post-execution based on the user's role. Masking levels range from full visibility (Admin/Owner) to partial masking (Analyst sees john***@example.com) to full redaction (Agent sees [REDACTED]).
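
A sketch of role-based masking for an email column; the per-role rules are simplified here:

    def mask_email(value: str, role: str) -> str:
        if role in ("admin", "owner"):
            return value                                   # full visibility
        if role == "analyst":
            local, _, domain = value.partition("@")
            return f"{local[:4]}***@{domain}"              # partial masking
        return "[REDACTED]"                                # agents see nothing

    mask_email("john.doe@example.com", "analyst")          # 'john***@example.com'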

Step 12: Metering

Credit usage is recorded for billing. Prometheus metrics are emitted for latency, throughput, error rates, and per-agent usage.
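
If you wanted to emit comparable per-agent metrics from your own Python service, the prometheus_client library would look roughly like this (the metric names are invented, not HatiData's):

    from prometheus_client import Counter, Histogram

    QUERY_LATENCY = Histogram("query_latency_seconds", "Query latency", ["agent_id"])
    QUERY_ERRORS = Counter("query_errors_total", "Failed queries", ["agent_id"])
    CREDITS_USED = Counter("credits_used_total", "Credits consumed", ["agent_id", "org_id"])

    def record(agent_id: str, org_id: str, latency_s: float, credits: float, failed: bool) -> None:
        QUERY_LATENCY.labels(agent_id=agent_id).observe(latency_s)
        CREDITS_USED.labels(agent_id=agent_id, org_id=org_id).inc(credits)
        if failed:
            QUERY_ERRORS.labels(agent_id=agent_id).inc()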

Step 13: Audit

A PII-redacted, tamper-evident audit log entry is written. Audit events are hash-chained (SHA-256) so that any modification to a past event breaks the chain, making tampering detectable.
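
Hash chaining means each audit entry stores the hash of the previous one, so verification is a single pass over the log. A minimal illustration (the real entry fields, serialization, and genesis value are HatiData internals):

    import hashlib, json

    def chain_hash(prev_hash: str, entry: dict) -> str:
        body = json.dumps(entry, sort_keys=True)
        return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

    def verify(log: list[dict]) -> bool:
        prev = "0" * 64                                    # illustrative genesis value
        for record in log:
            if chain_hash(prev, record["entry"]) != record["hash"]:
                return False                               # a past event was modified
            prev = record["hash"]
        return True

    # Build a tiny chain and check it.
    log, prev = [], "0" * 64
    for entry in ({"agent_id": "analytics-agent-1", "sql_hash": "abc123"},):
        prev = chain_hash(prev, entry)
        log.append({"entry": entry, "hash": prev})
    assert verify(log)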

Three-Tier Caching

HatiData uses a three-tier cache hierarchy to minimize query latency:

Tier | Backing                            | Capacity                              | Latency         | Eviction
L1   | In-memory (moka)                   | Configurable (default 10,000 entries) | Sub-microsecond | TTL + LRU
L2   | NVMe disk                          | Configurable (default 1 GB)           | Sub-millisecond | TTL + size cap
L3   | Object storage (S3/GCS/Azure Blob) | Unlimited (Iceberg format)            | 10-100 ms       | Snapshot-based

Cache lookups proceed from L1 to L3. Cache invalidation is snapshot-aware: when a new Iceberg snapshot is created (due to data loading), cached results for affected tables are evicted.
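
The lookup order can be sketched as a chain of progressively slower stores. The objects below are stand-ins, not HatiData's cache API, and the promote-to-L1-on-hit behavior is an assumption:

    def cache_get(key: str, l1: dict, l2_disk, l3_object_store):
        # L1: in-memory hit, sub-microsecond.
        if key in l1:
            return l1[key]
        # L2: local NVMe, sub-millisecond.
        value = l2_disk.get(key)
        if value is None:
            # L3: object storage (Iceberg), tens of milliseconds.
            value = l3_object_store.get(key)
        if value is not None:
            l1[key] = value                    # promote so the next lookup is an L1 hit
        return value

    cache_get("q1", l1={}, l2_disk={}, l3_object_store={"q1": [("row",)]})  # [('row',)]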

The transpilation cache is separate from the data cache. It stores SQL rewrites keyed by input SQL hash, so the parse-and-rewrite pipeline is skipped for repeated queries.

Agent Identification

Agents identify themselves to HatiData through Postgres startup parameters. When an agent connects, it passes:

Parameter  | Purpose
agent_id   | Unique agent identifier (e.g., analytics-agent-1)
framework  | AI framework name (e.g., langchain, crewai, autogen, custom)
request_id | Optional per-request trace ID for reasoning chain tracking
priority   | Query priority (low, normal, high, critical)
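
From Python, one way to send custom Postgres startup parameters is asyncpg's server_settings argument. The host, credentials, and table below are placeholders; the exact connection details depend on your deployment tier:

    import asyncio
    import asyncpg

    async def main():
        conn = await asyncpg.connect(
            host="localhost", port=5439,               # hatidata-proxy (placeholder host)
            user="analyst", password="<password>", database="analytics",
            server_settings={                          # sent as Postgres startup parameters
                "agent_id": "analytics-agent-1",
                "framework": "langchain",
                "request_id": "req-7f3a",
                "priority": "normal",
            },
        )
        rows = await conn.fetch("SELECT COUNT(*) FROM orders")
        print(rows)
        await conn.close()

    asyncio.run(main())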

These parameters flow through the entire 13-step pipeline:

  • Policy evaluation uses agent_id and framework for ABAC rules (e.g., "LangChain agents cannot query PII tables after hours").
  • Row-level security resolves {agent_id} placeholders in WHERE filters.
  • Metering tracks per-agent credit usage for billing.
  • Audit records agent identity in every log entry.
  • Scheduling uses priority to order queued queries.

Snowflake SQL Transpilation

HatiData accepts Snowflake-compatible SQL and transpiles it to DuckDB in five stages:

  1. Pre-parse -- Snowflake colon notation (col:field) is rewritten to json_extract_string(col, 'field') before the SQL parser sees it.
  2. Parse -- The SQL is parsed using the Snowflake dialect of sqlparser.
  3. Function rewrites -- Snowflake functions are mapped to DuckDB equivalents (NVL to COALESCE, IFF to IF, DATEDIFF to DATE_DIFF, etc.).
  4. Type rewrites -- Snowflake data types are mapped to DuckDB types (VARIANT to JSON, TIMESTAMP_NTZ to TIMESTAMP, etc.).
  5. Construct rewrites -- Snowflake-specific constructs are transformed (FLATTEN to UNNEST, OBJECT_CONSTRUCT to json_object, etc.).
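
Stages 1 and 3 can be illustrated with a toy string-level rewrite in Python. HatiData's transpiler is far more complete, and its function rewrites operate on the parsed AST rather than on raw text; the mapping table below is a small excerpt:

    import re

    FUNCTION_MAP = {"NVL": "COALESCE", "IFF": "IF", "DATEDIFF": "DATE_DIFF", "LISTAGG": "STRING_AGG"}

    def pre_parse(sql: str) -> str:
        # Stage 1: col:field  ->  json_extract_string(col, 'field')
        return re.sub(r"\b(\w+):(\w+)\b", r"json_extract_string(\1, '\2')", sql)

    def rewrite_functions(sql: str) -> str:
        # Stage 3: map Snowflake function names to DuckDB equivalents.
        for snowflake_fn, duckdb_fn in FUNCTION_MAP.items():
            sql = re.sub(rf"\b{snowflake_fn}\s*\(", f"{duckdb_fn}(", sql, flags=re.IGNORECASE)
        return sql

    print(rewrite_functions(pre_parse("SELECT NVL(payload:status, 'unknown') FROM events")))
    # SELECT COALESCE(json_extract_string(payload, 'status'), 'unknown') FROM events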

Each query receives a confidence score indicating the reliability of the transpilation. Queries that cannot be transpiled cleanly return an error with details about the unsupported construct.

For the full list of supported functions, types, and constructs, see SQL Compatibility.

Deployment Tiers

HatiData offers three deployment tiers. Moving between them is a single command, hati push --target cloud|vpc -- your queries and schemas stay the same.

Local (Free)

Everything runs on your machine. The hati CLI embeds the DuckDB engine, the transpiler, and a local proxy. Data is stored in a .hati/ directory. No network calls, no telemetry, no cloud account.

Best for: development, prototyping, offline work, CI pipelines.

See Local Mode for details.

Cloud ($29/month)

HatiData hosts a managed proxy and control plane. You push your local data with hati push --target cloud and receive a connection string. Includes a dashboard for query audit, policy management, and billing.

Best for: small teams, startups, shared environments, production workloads that do not require VPC isolation.

See Cloud Mode for details.

Enterprise (Custom Pricing)

The data plane deploys inside your AWS/GCP/Azure VPC. The control plane communicates over PrivateLink. Data never leaves your network. Includes SLAs, dedicated support, Shadow Mode for migration validation, and HatiData Shield IP protection.

Best for: regulated industries, large-scale production, organizations with strict data residency requirements.

See Enterprise for details.
