# Arrow Query API
The Arrow Query API provides a high-performance HTTP endpoint that returns query results in Apache Arrow stream format. This is the recommended path for analytical workloads, ML feature pipelines, and any use case where columnar data is preferred over row-based Postgres wire protocol results.
## Endpoint

```
POST /v1/query/arrow
```

Base URL: `http://<proxy-host>:<proxy-port>` (default: `http://localhost:5439`)
## Request

### Headers

| Header | Required | Description |
|---|---|---|
| `Authorization` | Yes | `Bearer <api_key>` |
| `Content-Type` | Yes | `application/json` |
| `X-Agent-Id` | No | Override the agent ID for this query (default: derived from the API key) |
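For instance, one service-level API key can attribute queries to different agents by setting `X-Agent-Id`. A minimal sketch of the request shape (the request is built but not sent; the agent ID value is hypothetical):

```python
import requests

# Build (but don't send) a query request that attributes the query to a
# specific agent via X-Agent-Id; "reporting-agent" is a hypothetical ID.
req = requests.Request(
    "POST",
    "http://localhost:5439/v1/query/arrow",
    json={"sql": "SELECT 1"},
    headers={
        "Authorization": "Bearer hd_live_your_api_key",
        "X-Agent-Id": "reporting-agent",
    },
).prepare()

print(req.headers["X-Agent-Id"])  # the agent this query is attributed to
```

Note that `requests` sets `Content-Type: application/json` automatically when the `json=` argument is used.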
### Body

```json
{
  "sql": "SELECT * FROM orders WHERE total > 100 LIMIT 1000",
  "parameters": [],
  "timeout_ms": 30000
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `sql` | string | Yes | The SQL query to execute |
| `parameters` | array | No | Positional parameters for prepared statements |
| `timeout_ms` | integer | No | Query timeout in milliseconds (default: 30000) |
### Example Request

```bash
curl -X POST http://localhost:5439/v1/query/arrow \
  -H "Authorization: Bearer hd_live_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "sql": "SELECT customer_id, SUM(total) AS revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 100"
  }' \
  --output result.arrow
```
## Response

### Success (200 OK)

The response body is an Arrow stream (binary). Response headers include metadata:

| Header | Description |
|---|---|
| `Content-Type` | `application/vnd.apache.arrow.stream` |
| `X-Row-Count` | Total number of rows in the result |
| `X-Column-Count` | Number of columns in the result |
| `X-Query-Id` | Unique query identifier for audit |
| `X-Credits-Consumed` | Query credits consumed |
| `X-Cache-Hit` | `true` if the transpilation cache was hit |
| `X-Duration-Ms` | Server-side query execution time in milliseconds |
### Error Responses

| Status | Body | Description |
|---|---|---|
| 400 | `{"error": "SQL parse error: ..."}` | Invalid SQL |
| 401 | `{"error": "Invalid API key"}` | Authentication failed |
| 403 | `{"error": "Permission denied"}` | Agent lacks the required permissions |
| 408 | `{"error": "Query timeout"}` | Query exceeded `timeout_ms` |
| 429 | `{"error": "Rate limit exceeded"}` | Too many requests |
| 500 | `{"error": "Internal server error"}` | Server error |
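Because error bodies are JSON while success bodies are binary Arrow, clients should branch on the status code before decoding. A minimal sketch of that pattern, with retry on 429 (the backoff policy here is illustrative, not prescribed by the API):

```python
import time
import requests

def query_with_retry(session, url: str, payload: dict,
                     headers: dict, max_retries: int = 3) -> bytes:
    """POST a query, retrying only on 429 and raising on other errors."""
    for attempt in range(max_retries):
        response = session.post(url, json=payload, headers=headers)
        if response.status_code == 200:
            return response.content              # binary Arrow stream
        if response.status_code == 429 and attempt < max_retries - 1:
            time.sleep(2 ** attempt)             # simple exponential backoff
            continue
        # Error bodies are JSON: {"error": "..."}
        raise RuntimeError(f"{response.status_code}: {response.json().get('error')}")
    raise RuntimeError("retries exhausted")
```

Passing the session as an argument keeps the helper testable and lets callers reuse a pooled `requests.Session`.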
## Python Client

### PyArrow

```python
import pyarrow as pa
import pyarrow.ipc as ipc
import requests

def arrow_query(sql: str, host: str = "localhost", port: int = 5439, api_key: str = "") -> pa.Table:
    """Execute a SQL query and return an Arrow Table."""
    response = requests.post(
        f"http://{host}:{port}/v1/query/arrow",
        json={"sql": sql},
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    response.raise_for_status()
    reader = ipc.open_stream(response.content)
    table = reader.read_all()

    # Metadata from response headers
    row_count = response.headers.get("X-Row-Count")
    duration_ms = response.headers.get("X-Duration-Ms")
    print(f"Rows: {row_count}, Duration: {duration_ms}ms")
    return table
```
### Polars

```python
import polars as pl

def polars_query(sql: str, **kwargs) -> pl.DataFrame:
    """Execute a SQL query and return a Polars DataFrame."""
    return pl.from_arrow(arrow_query(sql, **kwargs))

df = polars_query(
    "SELECT * FROM orders WHERE order_date >= '2025-01-01'",
    api_key="hd_live_your_api_key",
)
print(df.describe())
```
### Pandas

```python
import pandas as pd

def pandas_query(sql: str, **kwargs) -> pd.DataFrame:
    """Execute a SQL query and return a Pandas DataFrame."""
    table = arrow_query(sql, **kwargs)
    return table.to_pandas(types_mapper=pd.ArrowDtype)
```
## TypeScript Client

```typescript
import { tableFromIPC } from 'apache-arrow';

async function arrowQuery(sql: string): Promise<any> {
  const response = await fetch('http://localhost:5439/v1/query/arrow', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HATIDATA_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ sql }),
  });
  if (!response.ok) {
    throw new Error(`Query failed: ${response.status}`);
  }
  const buffer = await response.arrayBuffer();
  const table = tableFromIPC(new Uint8Array(buffer));
  return {
    table,
    rowCount: Number(response.headers.get('X-Row-Count')),
    durationMs: Number(response.headers.get('X-Duration-Ms')),
  };
}

const { table, rowCount, durationMs } = await arrowQuery(
  'SELECT customer_id, SUM(total) AS revenue FROM orders GROUP BY customer_id'
);
console.log(`${rowCount} rows in ${durationMs}ms`);
```
## Parameterized Queries

Use positional parameters to prevent SQL injection:

```python
response = requests.post(
    "http://localhost:5439/v1/query/arrow",
    json={
        "sql": "SELECT * FROM orders WHERE customer_id = $1 AND total > $2",
        "parameters": ["cust-123", 100.0],
    },
    headers={"Authorization": f"Bearer {api_key}"},
)
```
Parameters are bound server-side before query execution. All standard parameter types are accepted: strings, integers, floats, booleans, and timestamps.
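A sketch of assembling a parameterized body; how timestamps travel on the wire is an assumption here (ISO-8601 strings), since JSON has no native timestamp type:

```python
import json
from datetime import datetime, timezone

def build_query_body(sql: str, *params) -> str:
    """Serialize a query body, encoding datetimes as ISO-8601 strings
    (an assumed convention -- JSON itself has no timestamp type)."""
    encoded = [
        p.isoformat() if isinstance(p, datetime) else p
        for p in params
    ]
    return json.dumps({"sql": sql, "parameters": encoded})

body = build_query_body(
    "SELECT * FROM orders WHERE customer_id = $1 AND order_date >= $2",
    "cust-123",
    datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```

Strings, integers, floats, and booleans map directly to their JSON equivalents and need no special handling.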
## Semantic Search via Arrow

The Arrow endpoint supports the `semantic_match()` and `semantic_rank()` functions:
```python
table = arrow_query("""
    SELECT
        memory_id,
        content,
        semantic_rank(content, 'revenue growth patterns') AS relevance
    FROM _hatidata_agent_memory
    WHERE agent_id = 'analyst-agent'
      AND semantic_match(content, 'revenue growth patterns', 0.7)
    ORDER BY relevance DESC
    LIMIT 50
""")

df = pl.from_arrow(table)
print(df.head(10))
```
## Performance Notes
| Aspect | Detail |
|---|---|
| Format overhead | Arrow is a zero-copy format -- no serialization/deserialization overhead |
| Compression | Not compressed by default; use gzip Accept-Encoding for network savings |
| Max result size | 100 MB default (configurable) |
| Concurrency | Same semaphore as Postgres wire protocol queries |
| Caching | Transpilation cache applies; query engine result cache applies |
For datasets larger than the max response size, use `LIMIT`/`OFFSET` pagination or export to Parquet.
## Related Concepts
- Arrow & Polars Integration -- Setup and usage patterns
- Arrow Query Recipes -- Advanced recipes
- SQL Functions & Types -- Full SQL reference
- Query Pipeline -- How queries flow through the proxy
- Postgres Drivers -- Row-based alternative