Arrow-Native Queries
HatiData can return query results in the Apache Arrow IPC stream format -- a zero-copy, columnar binary encoding that eliminates the serialize-to-text-then-parse-back overhead of traditional database integrations. AI agents using pandas, polars, or any Arrow-compatible library can load results directly into typed DataFrames without parsing a single string.
Why Arrow?
Traditional query protocols (Postgres wire, JSON REST) serialize all values as text. An integer 42 becomes the string "42", a timestamp becomes "2026-02-18T10:30:00Z", and a boolean becomes "true". The client then parses every value back into its native type. For an agent processing thousands of rows per query, this serialization round trip wastes CPU cycles on both sides.
Arrow IPC delivers results as typed, columnar binary data. Integers arrive as contiguous arrays of 32-bit or 64-bit values. Timestamps arrive as microsecond offsets. Booleans arrive as bit-packed arrays. The client loads the data into a DataFrame with zero parsing -- the bytes are the data.
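To see the difference concretely, here is a minimal, self-contained sketch (independent of any HatiData endpoint) contrasting the two paths: JSON round-trips every value through text, while Arrow IPC keeps values in typed buffers that load back with their types intact.

import json
import pyarrow as pa

# Text path: every value is serialized to a string and parsed back.
payload = json.dumps([{"id": 42, "total": 19.99}])
rows = json.loads(payload)  # each number is re-parsed from text
print(type(rows[0]["id"]))  # <class 'int'> -- recovered only by parsing

# Arrow path: values live in typed columnar buffers; no parsing step.
table = pa.table({"id": pa.array([42], pa.int64()),
                  "total": pa.array([19.99], pa.float64())})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
loaded = pa.ipc.open_stream(sink.getvalue()).read_all()
print(loaded.schema)  # id: int64, total: double -- types arrive intact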
| Method | Format | Type Fidelity | Parse Overhead |
|---|---|---|---|
| Postgres wire protocol | Text rows | None (all strings) | High |
| REST API (JSON) | JSON objects | Partial (numbers, strings) | Medium |
| Arrow IPC | Columnar binary | Full (all types preserved) | None |
HTTP Endpoint
POST /v1/query/arrow
Execute a SQL query and return results as an Arrow IPC stream. Available on the MCP server port (default: 5440).
Request
curl -X POST http://localhost:5440/v1/query/arrow \
  -H "Authorization: Bearer hd_live_a1b2c3d4..." \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT customer_id, name, revenue, signup_date FROM customers ORDER BY revenue DESC LIMIT 10000"
  }'
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
| sql | string | Yes | SQL query to execute |
| timeout_ms | integer | No | Query timeout in milliseconds |
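For instance, to bound a long-running query at five seconds (the 5000 ms value is illustrative), pass timeout_ms alongside sql:

import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={
        "sql": "SELECT * FROM orders WHERE total > 100",
        "timeout_ms": 5000,  # returns QUERY_TIMEOUT (408) if exceeded
    },
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)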
Response
The response body is raw Arrow IPC stream bytes (not JSON). Response headers provide metadata:
| Header | Description |
|---|---|
| content-type | application/vnd.apache.arrow.stream |
| x-execution-time-ms | Query execution time in milliseconds |
| x-rows-affected | Number of rows in the result |
| x-cache-hit | Whether the result was served from cache |
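Any HTTP client can read this metadata without touching the Arrow body; a short sketch with Python's requests (printed values are illustrative):

import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders WHERE total > 100"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()
# Metadata lives in headers so the body can remain pure Arrow IPC bytes.
print(resp.headers["content-type"])         # application/vnd.apache.arrow.stream
print(resp.headers["x-execution-time-ms"])  # e.g. "12"
print(resp.headers["x-rows-affected"])      # e.g. "4821"
print(resp.headers["x-cache-hit"])          # e.g. "false"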
Python Example (pyarrow)
import pyarrow.ipc as ipc
import requests
resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders WHERE total > 100"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()
# Load Arrow IPC directly -- zero parsing
reader = ipc.open_stream(resp.content)
table = reader.read_all()
# Convert to pandas DataFrame (zero-copy where possible)
df = table.to_pandas()
print(df.dtypes) # Types are preserved: int64, float64, datetime64, etc.
print(df.describe())
Python Example (polars)
import polars as pl
import pyarrow.ipc as ipc
import requests
resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT region, SUM(revenue) as total FROM sales GROUP BY region"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()
reader = ipc.open_stream(resp.content)
table = reader.read_all()
# Polars loads Arrow natively -- truly zero-copy
df = pl.from_arrow(table)
print(df)
TypeScript Example (Apache Arrow JS)
import { tableFromIPC } from "apache-arrow";
const resp = await fetch("http://localhost:5440/v1/query/arrow", {
  method: "POST",
  headers: {
    "Authorization": "Bearer hd_live_a1b2c3d4...",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM customers LIMIT 1000",
  }),
});
const buffer = await resp.arrayBuffer();
const table = tableFromIPC(new Uint8Array(buffer));
console.log(`${table.numRows} rows, ${table.numCols} columns`);
for (const row of table) {
  console.log(row.toJSON());
}
MCP Tool: run_sql_arrow
The same Arrow-native path is available as an MCP tool for agents. Since MCP tool results are JSON text, the Arrow IPC bytes are base64-encoded.
// Tool call
{
  "name": "run_sql_arrow",
  "arguments": {
    "sql": "SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at DESC"
  }
}
// Tool result
{
  "format": "arrow_ipc_base64",
  "data": "QVJST1cxAAAAAAoAAAAAAAD...",
  "rows": 847,
  "execution_time_ms": 8
}
Decode in Python:
import base64
import io
import pyarrow.ipc as ipc

# The tool result's "data" field holds base64-encoded Arrow IPC bytes.
ipc_bytes = base64.b64decode(tool_result["data"])
reader = ipc.open_stream(io.BytesIO(ipc_bytes))
table = reader.read_all()
df = table.to_pandas()
See the full MCP tool reference at MCP Tools > run_sql_arrow.
Security Pipeline
Arrow queries pass through the same 13-step security pipeline as all other queries. Key steps include:
- Authentication (Bearer token or API key)
- ABAC policy evaluation
- Cost estimation and quota checks
- Row-level security filter injection
- SQL transpilation (Snowflake syntax supported)
- DuckDB execution (Arrow-native path)
- Column masking applied directly on Arrow arrays
- Metering and audit logging
Column masking operates directly on Arrow RecordBatch arrays. Masked columns are converted to UTF-8 string arrays with the masking function applied (full redaction, partial redaction, SHA-256 hash, or null). The masked column's data type changes to Utf8 in the Arrow schema.
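HatiData's masking functions run server-side, but the shape of the operation can be sketched with pyarrow: build a Utf8 array from the masking function's output and swap it into the table. The column names below are illustrative; SHA-256 is one of the documented masking modes.

import hashlib
import pyarrow as pa

def mask_sha256(column: pa.ChunkedArray) -> pa.Array:
    # Hash each value's string form; nulls stay null. Output type is Utf8.
    return pa.array(
        [None if v is None else hashlib.sha256(str(v).encode()).hexdigest()
         for v in column.to_pylist()],
        type=pa.string(),
    )

table = pa.table({"customer_id": [1, 2], "ssn": ["123-45-6789", None]})
idx = table.schema.get_field_index("ssn")
table = table.set_column(idx, "ssn", mask_sha256(table["ssn"]))
print(table.schema.field("ssn").type)  # string -- the schema now carries Utf8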
When to Use Arrow vs. JSON
| Scenario | Recommended | Why |
|---|---|---|
| Agent analyzing data with pandas/polars | Arrow | Zero parsing overhead, full type fidelity |
| Large result sets (1K+ rows) | Arrow | Columnar format is more compact and faster to process |
| Feeding results to ML pipelines | Arrow | Arrow is the lingua franca of the Python data stack |
| Small lookups (< 100 rows) | JSON | Simpler to work with, overhead is negligible |
| Displaying results to humans | JSON | Easier to render in UIs |
| Debugging agent behavior | JSON | Human-readable |
Error Handling
Arrow endpoint errors return JSON (not Arrow IPC) with standard error codes:
{
"error": "POLICY_VIOLATION",
"message": "Agent 'analyst' is not permitted to access table 'salary_data'"
}
| Status | Code | Description |
|---|---|---|
| 400 | QUERY_PARSE_ERROR | SQL syntax error |
| 401 | UNAUTHORIZED | Missing or invalid authentication |
| 403 | POLICY_DENIED | Query blocked by ABAC policy |
| 408 | QUERY_TIMEOUT | Query exceeded timeout |
| 503 | SERVICE_UNAVAILABLE | Pipeline not initialized |
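Because a successful response is raw Arrow bytes while a failure is JSON, clients should branch on the HTTP status (or the content-type header) before decoding; a minimal sketch:

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders WHERE total > 100"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
if resp.ok:
    table = ipc.open_stream(resp.content).read_all()  # Arrow IPC body
else:
    err = resp.json()  # JSON error body
    raise RuntimeError(f"{err['error']}: {err['message']}")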
Next Steps
- MCP Tools Reference -- All 24 MCP tools
- Query Proxy -- Postgres wire protocol and REST API
- Python SDK -- HatiData Python client