
Arrow-Native Queries

HatiData supports returning query results in Apache Arrow IPC stream format -- a zero-copy columnar binary encoding that eliminates the serialize-to-text-then-parse-back overhead that plagues traditional database integrations. AI agents using pandas, polars, or any Arrow-compatible library can load results directly into typed DataFrames without parsing a single string.

Why Arrow?

Traditional query protocols (Postgres wire, JSON REST) serialize all values as text. An integer 42 becomes the string "42", a timestamp becomes "2026-02-18T10:30:00Z", and a boolean becomes "true". The client then parses every value back into its native type. For an agent processing thousands of rows per query, this serialization round trip wastes CPU cycles on both sides.

Arrow IPC delivers results as typed, columnar binary data. Integers arrive as contiguous arrays of 32-bit or 64-bit values. Timestamps arrive as microsecond offsets. Booleans arrive as bit-packed arrays. The client loads the data into a DataFrame with zero parsing -- the bytes are the data.
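
To make that concrete, here is a small pyarrow sketch (illustrative only, not HatiData-specific) showing that Arrow columns are typed buffers rather than text:

import pyarrow as pa

# Integers live in one contiguous typed buffer -- no digit strings to parse.
ints = pa.array([42, 7, 1999], type=pa.int64())
print(ints.type)  # int64

# Timestamps are int64 microsecond offsets, not ISO-8601 text.
ts = pa.array([0], type=pa.timestamp("us", tz="UTC"))
print(ts[0])  # 1970-01-01 00:00:00+00:00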

| Method | Format | Type Fidelity | Parse Overhead |
| --- | --- | --- | --- |
| Postgres wire protocol | Text rows | None (all strings) | High |
| REST API (JSON) | JSON objects | Partial (numbers, strings) | Medium |
| Arrow IPC | Columnar binary | Full (all types preserved) | None |

HTTP Endpoint

POST /v1/query/arrow

Execute a SQL query and return results as an Arrow IPC stream. Available on the MCP server port (default: 5440).

Request

curl -X POST http://localhost:5440/v1/query/arrow \
  -H "Authorization: Bearer hd_live_a1b2c3d4..." \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT customer_id, name, revenue, signup_date FROM customers ORDER BY revenue DESC LIMIT 10000"
  }'

Request Body:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| sql | string | Yes | SQL query to execute |
| timeout_ms | integer | No | Query timeout in milliseconds |
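
A minimal example passing both fields from the table above:

import requests

# Cap the query at five seconds; past this the server returns 408 QUERY_TIMEOUT.
resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders", "timeout_ms": 5000},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)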

Response

The response body is raw Arrow IPC stream bytes (not JSON). Response headers provide metadata:

| Header | Description |
| --- | --- |
| content-type | application/vnd.apache.arrow.stream |
| x-execution-time-ms | Query execution time in milliseconds |
| x-rows-affected | Number of rows in the result |
| x-cache-hit | Whether the result was served from cache |
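
The headers are plain strings, so you can log timing and cache behavior without touching the body. A sketch:

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT 1 AS one"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()

# Metadata rides in the headers; the body is the Arrow stream itself.
print(resp.headers["x-execution-time-ms"])
print(resp.headers.get("x-cache-hit"))
table = ipc.open_stream(resp.content).read_all()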

Python Example (pyarrow)

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM orders WHERE total > 100"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()

# Load Arrow IPC directly -- zero parsing
reader = ipc.open_stream(resp.content)
table = reader.read_all()

# Convert to pandas DataFrame (zero-copy where possible)
df = table.to_pandas()
print(df.dtypes) # Types are preserved: int64, float64, datetime64, etc.
print(df.describe())
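
For result sets too large to buffer in memory, you can hand the raw response stream straight to pyarrow and process record batches incrementally (a sketch; assumes the response body is not compressed in transit, and handle is a hypothetical per-batch function):

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM events"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
    stream=True,  # let requests leave the body on the socket
)
resp.raise_for_status()

# pyarrow reads record batches off the file-like object as they arrive.
reader = ipc.open_stream(resp.raw)
for batch in reader:
    handle(batch)  # hypothetical per-batch handler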

Python Example (polars)

import polars as pl
import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT region, SUM(revenue) as total FROM sales GROUP BY region"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)
resp.raise_for_status()

reader = ipc.open_stream(resp.content)
table = reader.read_all()

# Polars loads Arrow natively -- truly zero-copy
df = pl.from_arrow(table)
print(df)

TypeScript Example (Apache Arrow JS)

import { tableFromIPC } from "apache-arrow";

const resp = await fetch("http://localhost:5440/v1/query/arrow", {
  method: "POST",
  headers: {
    "Authorization": "Bearer hd_live_a1b2c3d4...",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    sql: "SELECT * FROM customers LIMIT 1000",
  }),
});

const buffer = await resp.arrayBuffer();
const table = tableFromIPC(new Uint8Array(buffer));

console.log(`${table.numRows} rows, ${table.numCols} columns`);
for (const row of table) {
  console.log(row.toJSON());
}

MCP Tool: run_sql_arrow

The same Arrow-native path is available as an MCP tool for agents. Since MCP tool results are JSON text, the Arrow IPC bytes are base64-encoded.

// Tool call
{
  "name": "run_sql_arrow",
  "arguments": {
    "sql": "SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at DESC"
  }
}

// Tool result
{
  "format": "arrow_ipc_base64",
  "data": "QVJST1cxAAAAAAoAAAAAAAD...",
  "rows": 847,
  "execution_time_ms": 8
}

Decode in Python:

import base64
import pyarrow.ipc as ipc
import io

ipc_bytes = base64.b64decode(tool_result["data"])
reader = ipc.open_stream(io.BytesIO(ipc_bytes))
table = reader.read_all()
df = table.to_pandas()

See the full MCP tool reference at MCP Tools > run_sql_arrow.

Security Pipeline

Arrow queries pass through the same 13-step security pipeline as all other queries. The stages most relevant to Arrow results:

  1. Authentication (Bearer token or API key)
  2. ABAC policy evaluation
  3. Cost estimation and quota checks
  4. Row-level security filter injection
  5. SQL transpilation (Snowflake syntax supported)
  6. DuckDB execution (Arrow-native path)
  7. Column masking applied directly on Arrow arrays
  8. Metering and audit logging

Column masking operates directly on Arrow RecordBatch arrays. Masked columns are converted to UTF-8 string arrays with the masking function applied (full redaction, partial redaction, SHA-256 hash, or null). The masked column's data type changes to Utf8 in the Arrow schema.
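
As an illustration of that type change (a sketch of the idea, not HatiData's internal implementation), SHA-256 masking a single column of a RecordBatch looks roughly like this:

import hashlib
import pyarrow as pa

def mask_sha256(batch: pa.RecordBatch, column: str) -> pa.RecordBatch:
    # Hash every value to a hex digest; nulls stay null.
    idx = batch.schema.get_field_index(column)
    hashed = pa.array(
        [None if v is None else hashlib.sha256(str(v).encode()).hexdigest()
         for v in batch.column(idx).to_pylist()],
        type=pa.string(),
    )
    arrays = batch.columns  # returns a fresh list; safe to mutate
    arrays[idx] = hashed
    # The masked column's field type becomes utf8, as described above.
    fields = [pa.field(column, pa.string()) if i == idx else f
              for i, f in enumerate(batch.schema)]
    return pa.RecordBatch.from_arrays(arrays, schema=pa.schema(fields))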

When to Use Arrow vs. JSON

| Scenario | Recommended | Why |
| --- | --- | --- |
| Agent analyzing data with pandas/polars | Arrow | Zero parsing overhead, full type fidelity |
| Large result sets (1K+ rows) | Arrow | Columnar format is more compact and faster to process |
| Feeding results to ML pipelines | Arrow | Arrow is the lingua franca of the Python data stack |
| Small lookups (< 100 rows) | JSON | Simpler to work with; overhead is negligible |
| Displaying results to humans | JSON | Easier to render in UIs |
| Debugging agent behavior | JSON | Human-readable |

Error Handling

Arrow endpoint errors return JSON (not Arrow IPC) with standard error codes:

{
  "error": "POLICY_DENIED",
  "message": "Agent 'analyst' is not permitted to access table 'salary_data'"
}

| Status | Code | Description |
| --- | --- | --- |
| 400 | QUERY_PARSE_ERROR | SQL syntax error |
| 401 | UNAUTHORIZED | Missing or invalid authentication |
| 403 | POLICY_DENIED | Query blocked by ABAC policy |
| 408 | QUERY_TIMEOUT | Query exceeded timeout |
| 503 | SERVICE_UNAVAILABLE | Pipeline not initialized |
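
Because a success is Arrow bytes and a failure is JSON, clients should branch on the content type before decoding. A sketch:

import pyarrow.ipc as ipc
import requests

resp = requests.post(
    "http://localhost:5440/v1/query/arrow",
    json={"sql": "SELECT * FROM salary_data"},
    headers={"Authorization": "Bearer hd_live_a1b2c3d4..."},
)

ctype = resp.headers.get("content-type", "")
if ctype.startswith("application/vnd.apache.arrow.stream"):
    table = ipc.open_stream(resp.content).read_all()
else:
    err = resp.json()  # e.g. {"error": "POLICY_DENIED", "message": "..."}
    raise RuntimeError(f"{err['error']}: {err['message']}")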
