
Arrow-Native Queries

HatiData supports returning query results in Apache Arrow IPC stream format -- a zero-copy columnar binary encoding that eliminates the serialize-to-text-then-parse-back overhead that plagues traditional database integrations. AI agents using pandas, polars, or any Arrow-compatible library can load results directly into typed DataFrames without parsing a single string.

Why Arrow?

Traditional query protocols (Postgres wire, JSON REST) serialize all values as text. An integer 42 becomes the string "42", a timestamp becomes "2026-02-18T10:30:00Z", and a boolean becomes "true". The client then parses every value back into its native type. For an agent processing thousands of rows per query, this serialization round trip wastes CPU cycles on both sides.

Arrow IPC delivers results as typed, columnar binary data. Integers arrive as contiguous arrays of 32-bit or 64-bit values. Timestamps arrive as microsecond offsets. Booleans arrive as bit-packed arrays. The client loads the data into a DataFrame with zero parsing -- the bytes are the data.
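
To make that concrete, here is a small pyarrow sketch (illustrative only, not HatiData-specific) showing that Arrow columns are typed buffers rather than text:

import pyarrow as pa

# Integers live in one contiguous typed buffer -- no digit strings to parse.
ints = pa.array([42, 7, 1999], type=pa.int64())
print(ints.type)  # int64

# Timestamps are int64 microsecond offsets, not ISO-8601 text.
ts = pa.array([0], type=pa.timestamp("us", tz="UTC"))
print(ts[0])  # 1970-01-01 00:00:00+00:00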

| Method | Format | Type Fidelity | Parse Overhead |
| --- | --- | --- | --- |
| Postgres wire protocol | Text rows | None (all strings) | High |
| REST API (JSON) | JSON objects | Partial (numbers, strings) | Medium |
| Arrow IPC | Columnar binary | Full (all types preserved) | None |

HTTP Endpoint

POST /v1/query/arrow

Execute a SQL query and return results as an Arrow IPC stream. Available on the MCP server port (default: 5440).

Request

curl -X POST http://localhost:5440/v1/query/arrow \
  -H "Authorization: Bearer hd_live_a1b2c3d4..." \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT customer_id, name, revenue, signup_date FROM customers ORDER BY revenue DESC LIMIT 10000"
  }'

Request Body:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| sql | string | Yes | SQL query to execute |
| timeout_ms | integer | No | Query timeout in milliseconds |
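
A minimal example passing both fields from the table above:

import requests

# Cap the query at five seconds; past this the server returns 408 QUERY_TIMEOUT.
resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders", "timeout_ms": 5000},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)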

Response

The response body is raw Arrow IPC stream bytes (not JSON). Response headers provide metadata:

| Header | Description |
| --- | --- |
| content-type | application/vnd.apache.arrow.stream |
| x-execution-time-ms | Query execution time in milliseconds |
| x-rows-affected | Number of rows in the result |
| x-cache-hit | Whether the result was served from cache |
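
The headers are plain strings, so you can log timing and cache behavior without touching the body. A sketch:

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT 1 AS one"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()

# Metadata rides in the headers; the body is the Arrow stream itself.
print(resp.headers["x-execution-time-ms"])
print(resp.headers.get("x-cache-hit"))
table = ipc.open_stream(resp.content).read_all()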

Python Example (pyarrow)

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders WHERE total > 100"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()

# Load Arrow IPC directly -- zero parsing
reader = ipc.open_stream(resp.content)
table = reader.read_all()

# Convert to pandas DataFrame (zero-copy where possible)
df = table.to_pandas()
print(df.dtypes) # Types are preserved: int64, float64, datetime64, etc.
print(df.describe())
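
For result sets too large to buffer in memory, you can hand the raw response stream straight to pyarrow and process record batches incrementally (a sketch; assumes the response body is not compressed in transit, and handle is a hypothetical per-batch function):

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM events"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
    stream=True,  # let requests leave the body on the socket
)
resp.raise_for_status()

# pyarrow reads record batches off the file-like object as they arrive.
reader = ipc.open_stream(resp.raw)
for batch in reader:
    handle(batch)  # hypothetical per-batch handler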

Python Example (polars)

import polars as pl
import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT region, SUM(revenue) as total FROM sales GROUP BY region"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()

reader = ipc.open_stream(resp.content)
table = reader.read_all()

# Polars loads Arrow natively -- truly zero-copy
df = pl.from_arrow(table)
print(df)

TypeScript Example (Apache Arrow JS)

import { tableFromIPC } from "apache-arrow";

const resp = await fetch("http://localhost:5440/v1/query/arrow", {
  method: "POST",
  headers: {
    "Authorization": "Bearer hd_live_a1b2c3d4...",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM customers LIMIT 1000",
  }),
});

const buffer = await resp.arrayBuffer();
const table = tableFromIPC(new Uint8Array(buffer));

console.log(`${table.numRows} rows, ${table.numCols} columns`);
for (const row of table) {
  console.log(row.toJSON());
}

MCP Tool: run_sql_arrow

The same Arrow-native path is available as an MCP tool for agents. Since MCP tool results are JSON text, the Arrow IPC bytes are base64-encoded.

// Tool call
{
  "name": "run_sql_arrow",
  "arguments": {
    "sql": "SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at DESC"
  }
}

// Tool result
{
  "format": "arrow_ipc_base64",
  "data": "QVJST1cxAAAAAAoAAAAAAAD...",
  "rows": 847,
  "execution_time_ms": 8
}

Decode in Python:

import base64
import pyarrow.ipc as ipc
import io

ipc_bytes = base64.b64decode(tool_result["data"])
reader = ipc.open_stream(io.BytesIO(ipc_bytes))
table = reader.read_all()
df = table.to_pandas()

See the full MCP tool reference at MCP Tools > run_sql_arrow.

Security Pipeline

Arrow queries pass through the same 13-step security pipeline as all other queries. The stages most relevant to Arrow results:

  1. Authentication (Bearer token or API key)
  2. ABAC policy evaluation
  3. Cost estimation and quota checks
  4. Row-level security filter injection
  5. SQL transpilation (Snowflake syntax supported)
  6. DuckDB execution (Arrow-native path)
  7. Column masking applied directly on Arrow arrays
  8. Metering and audit logging

Column masking operates directly on Arrow RecordBatch arrays. Masked columns are converted to UTF-8 string arrays with the masking function applied (full redaction, partial redaction, SHA-256 hash, or null). The masked column's data type changes to Utf8 in the Arrow schema.
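
As an illustration of that type change (a sketch of the idea, not HatiData's internal implementation), SHA-256 masking a single column of a RecordBatch looks roughly like this:

import hashlib
import pyarrow as pa

def mask_sha256(batch: pa.RecordBatch, column: str) -> pa.RecordBatch:
    # Hash every value to a hex digest; nulls stay null.
    idx = batch.schema.get_field_index(column)
    hashed = pa.array(
        [None if v is None else hashlib.sha256(str(v).encode()).hexdigest()
         for v in batch.column(idx).to_pylist()],
        type=pa.string(),
    )
    arrays = batch.columns  # returns a fresh list; safe to mutate
    arrays[idx] = hashed
    # The masked column's field type becomes utf8, as described above.
    fields = [pa.field(column, pa.string()) if i == idx else f
              for i, f in enumerate(batch.schema)]
    return pa.RecordBatch.from_arrays(arrays, schema=pa.schema(fields))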

When to Use Arrow vs. JSON

| Scenario | Recommended | Why |
| --- | --- | --- |
| Agent analyzing data with pandas/polars | Arrow | Zero parsing overhead, full type fidelity |
| Large result sets (1K+ rows) | Arrow | Columnar format is more compact and faster to process |
| Feeding results to ML pipelines | Arrow | Arrow is the lingua franca of the Python data stack |
| Small lookups (< 100 rows) | JSON | Simpler to work with; overhead is negligible |
| Displaying results to humans | JSON | Easier to render in UIs |
| Debugging agent behavior | JSON | Human-readable |

Error Handling

Arrow endpoint errors return JSON (not Arrow IPC) with standard error codes:

{
  "error": "POLICY_DENIED",
  "message": "Agent 'analyst' is not permitted to access table 'salary_data'"
}

| Status | Code | Description |
| --- | --- | --- |
| 400 | QUERY_PARSE_ERROR | SQL syntax error |
| 401 | UNAUTHORIZED | Missing or invalid authentication |
| 403 | POLICY_DENIED | Query blocked by ABAC policy |
| 408 | QUERY_TIMEOUT | Query exceeded timeout |
| 503 | SERVICE_UNAVAILABLE | Pipeline not initialized |
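
Because a success is Arrow bytes and a failure is JSON, clients should branch on the content type before decoding. A sketch:

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM salary_data"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)

ctype = resp.headers.get("content-type", "")
if ctype.startswith("application/vnd.apache.arrow.stream"):
    table = ipc.open_stream(resp.content).read_all()
else:
    err = resp.json()  # e.g. {"error": "POLICY_DENIED", "message": "..."}
    raise RuntimeError(f"{err['error']}: {err['message']}")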
