Hybrid Search (Vector SQL)
HatiData extends standard SQL with vector-aware syntax that lets agents join structured data with vector memory, filter rows by semantic similarity, and rank results by meaning rather than keywords. These extensions compose naturally with regular SQL -- you can use WHERE, GROUP BY, ORDER BY, and aggregates alongside vector operations.
Quick Reference
| Function | Returns | Use In | Description |
|---|---|---|---|
| `semantic_match(col, 'text' [, threshold])` | BOOLEAN | WHERE, JOIN ON | True if cosine similarity exceeds threshold |
| `vector_match(col, 'text' [, threshold])` | BOOLEAN | WHERE, JOIN ON | Alias for `semantic_match` |
| `semantic_rank(col, 'text')` | FLOAT | SELECT, ORDER BY | Cosine similarity score (0.0 to 1.0) |
| `JOIN_VECTOR table ON ...` | -- | FROM | Joins structured data with vector memory |
semantic_match(column, text [, threshold])
Returns TRUE if the cosine similarity between the column's embedding and the text's embedding exceeds the threshold.
Signature:
semantic_match(column_name, 'search text' [, threshold FLOAT])
| Parameter | Type | Default | Description |
|---|---|---|---|
| `column_name` | identifier | (required) | Column containing text or embeddings |
| `search text` | VARCHAR | (required) | Natural language query to match against |
| `threshold` | FLOAT | 0.7 | Minimum cosine similarity (0.0 to 1.0) |
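The comparison that semantic_match performs can be sketched in plain Python. This is an illustration of the semantics only, not HatiData's implementation; the `cosine_similarity` and `semantic_match` helpers here are hypothetical names, and in the real system the embeddings come from the configured provider rather than being passed in directly.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_match(col_embedding, query_embedding, threshold=0.7):
    """Mirrors the SQL semantics: TRUE when similarity exceeds the threshold."""
    return cosine_similarity(col_embedding, query_embedding) > threshold

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(semantic_match([1.0, 0.0], [1.0, 0.0]))  # True  (similarity 1.0 > 0.7)
print(semantic_match([1.0, 0.0], [0.0, 1.0]))  # False (similarity 0.0)
```

The default threshold of 0.7 is what applies when the third argument is omitted.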
Examples
Filter rows by semantic similarity:
SELECT ticket_id, subject, description
FROM support_tickets
WHERE semantic_match(description, 'billing dispute refund request')
AND status = 'open';
With an explicit threshold:
SELECT memory_id, content, importance
FROM _hatidata_agent_memory
WHERE semantic_match(content, 'customer pricing preferences', 0.85)
AND agent_id = 'support-agent';
In a JOIN ON clause:
SELECT o.order_id, o.total, m.content AS related_memory
FROM orders o
JOIN _hatidata_agent_memory m
ON semantic_match(m.content, CAST(o.description AS VARCHAR), 0.75);
vector_match(column, text [, threshold])
An alias for semantic_match. The two are interchangeable -- use whichever reads more naturally in your query.
-- These two queries are identical:
SELECT * FROM docs WHERE semantic_match(content, 'deployment guide');
SELECT * FROM docs WHERE vector_match(content, 'deployment guide');
semantic_rank(column, text)
Returns the cosine similarity score as a FLOAT between 0.0 and 1.0. Use this in SELECT projections and ORDER BY clauses to rank results by semantic relevance.
Signature:
semantic_rank(column_name, 'search text')
| Parameter | Type | Description |
|---|---|---|
| `column_name` | identifier | Column containing text or embeddings |
| `search text` | VARCHAR | Natural language query to rank against |
Examples
Rank and select the most relevant results:
SELECT
memory_id,
content,
semantic_rank(content, 'revenue growth Q4') AS relevance
FROM _hatidata_agent_memory
WHERE agent_id = 'analyst'
ORDER BY relevance DESC
LIMIT 10;
Combine with aggregation:
SELECT
customer_id,
MAX(semantic_rank(notes, 'churn risk')) AS max_churn_signal
FROM customer_notes
GROUP BY customer_id
HAVING MAX(semantic_rank(notes, 'churn risk')) > 0.8
ORDER BY max_churn_signal DESC;
JOIN_VECTOR
Joins a base table with a vector memory table using semantic matching. This is syntactic sugar that combines a JOIN with semantic_match in a single, readable clause.
Syntax:
SELECT ...
FROM base_table
JOIN_VECTOR memory_table ON semantic_match(column, 'text' [, threshold])
Example
SELECT
t.ticket_id,
t.subject,
m.content AS related_memory,
semantic_rank(m.content, t.subject) AS relevance
FROM support_tickets t
JOIN_VECTOR _hatidata_agent_memory m
ON semantic_match(m.content, t.subject, 0.7)
WHERE t.status = 'open'
ORDER BY relevance DESC;
This is equivalent to:
SELECT
t.ticket_id,
t.subject,
m.content AS related_memory,
semantic_rank(m.content, t.subject) AS relevance
FROM support_tickets t
JOIN _hatidata_agent_memory m
ON semantic_match(m.content, t.subject, 0.7)
WHERE t.status = 'open'
ORDER BY relevance DESC;
JOIN_VECTOR makes intent explicit: the reader immediately knows this is a vector-aware join rather than a standard equi-join.
How It Works
When HatiData encounters vector SQL functions, the transpiler and proxy collaborate to resolve them:
- Parse -- The transpiler identifies `semantic_match`, `vector_match`, `semantic_rank`, and `JOIN_VECTOR` nodes in the SQL AST.
- Embed -- The proxy sends the search text to the configured embedding provider (OpenAI, Cohere, or a local model) and receives a float vector.
- Rewrite -- The transpiler rewrites vector functions to `list_cosine_similarity()` calls with the resolved embedding vector.
- Execute -- DuckDB evaluates `list_cosine_similarity(embedding_col, [0.12, -0.34, ...])` against stored embeddings.
- Filter/Rank -- For `semantic_match`, results are filtered by the threshold comparison. For `semantic_rank`, the raw similarity score is returned.
Agent SQL Transpiled DuckDB SQL
───────────── ──────────────────────
semantic_match( list_cosine_similarity(
content, → content_embedding,
'billing issue', [0.12, -0.34, ...] -- resolved by proxy
0.8 ) > 0.8
)
semantic_rank( list_cosine_similarity(
content, → content_embedding,
'billing issue' [0.12, -0.34, ...]
) )
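The rewrite step above can be sketched as a small string transformation. This is a toy illustration, not the actual transpiler: the regex handles only the simplest call shape, `fake_embed` stands in for the real provider round-trip, and the `col -> col_embedding` column mapping is assumed from the diagram above.

```python
import re

def transpile_semantic_match(sql, resolve_embedding):
    """Rewrite semantic_match(col, 'text'[, threshold]) into a
    list_cosine_similarity() comparison, roughly as described above."""
    pattern = re.compile(
        r"semantic_match\(\s*(\w+)\s*,\s*'([^']*)'\s*(?:,\s*([0-9.]+))?\s*\)"
    )

    def rewrite(m):
        col, text, threshold = m.group(1), m.group(2), m.group(3) or "0.7"
        vec = resolve_embedding(text)  # the proxy calls the provider here
        vec_sql = "[" + ", ".join(str(v) for v in vec) + "]"
        # Assumes the text column `col` has a companion `col_embedding` column.
        return f"list_cosine_similarity({col}_embedding, {vec_sql}) > {threshold}"

    return pattern.sub(rewrite, sql)

fake_embed = lambda text: [0.12, -0.34]  # stand-in for a real embedding vector
print(transpile_semantic_match(
    "SELECT * FROM docs WHERE semantic_match(content, 'billing issue', 0.8)",
    fake_embed,
))
# -> SELECT * FROM docs WHERE list_cosine_similarity(content_embedding, [0.12, -0.34]) > 0.8
```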
Threshold Tuning Guide
The default threshold of 0.7 works well for general-purpose retrieval. Adjust it based on your use case:
| Threshold | Precision | Recall | Best For |
|---|---|---|---|
| 0.9+ | Very high | Low | Exact semantic match, deduplication |
| 0.8 | High | Medium | Targeted retrieval, specific facts |
| 0.7 (default) | Medium | Medium-High | General search, RAG context |
| 0.6 | Low-Medium | High | Exploratory search, broad context |
| 0.5 | Low | Very high | Catch-all, brainstorming, related topics |
Rules of thumb:
- Start with 0.7 and adjust based on result quality.
- If you get too many irrelevant results, increase to 0.8.
- If you miss results you expected, decrease to 0.6.
- For safety-critical applications (fraud detection, compliance), use 0.85+ and review false negatives separately.
- Short search texts tend to need lower thresholds; longer, more specific texts work better with higher thresholds.
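The precision/recall tradeoff in the table can be seen concretely by sweeping a threshold over a fixed result set. The scores below are made up for illustration:

```python
# Hypothetical similarity scores for one query against five stored memories.
results = [
    ("refund policy for annual plans", 0.91),
    ("customer asked about a billing refund", 0.84),
    ("pricing page copy for enterprise tier", 0.72),
    ("meeting notes on roadmap priorities", 0.61),
    ("office snack preferences", 0.33),
]

# Raising the threshold keeps fewer, more relevant rows (higher precision);
# lowering it keeps more rows at the cost of noise (higher recall).
for threshold in (0.9, 0.8, 0.7, 0.6):
    kept = [text for text, score in results if score > threshold]
    print(f"threshold {threshold}: {len(kept)} result(s)")
# threshold 0.9: 1 result(s)
# threshold 0.8: 2 result(s)
# threshold 0.7: 3 result(s)
# threshold 0.6: 4 result(s)
```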
Real-World Examples
Fintech: Fraud Pattern Detection
An agent detects suspicious transactions by joining transaction records with known fraud pattern memories:
-- Find open transactions that resemble known fraud patterns
SELECT
tx.transaction_id,
tx.amount,
tx.merchant,
tx.timestamp,
m.content AS fraud_pattern,
semantic_rank(m.content, CONCAT(tx.merchant, ' ', tx.description)) AS similarity
FROM transactions tx
JOIN_VECTOR _hatidata_agent_memory m
ON semantic_match(m.content, CONCAT(tx.merchant, ' ', tx.description), 0.85)
WHERE m.memory_type = 'fact'
AND m.metadata->>'category' = 'fraud_pattern'
AND tx.status = 'pending'
AND tx.amount > 1000
ORDER BY similarity DESC
LIMIT 20;
Support: Contextual Memory Recall
A support agent retrieves relevant past interactions before responding to a customer:
-- Pull relevant memories for the current customer conversation
SELECT
content,
memory_type,
importance,
semantic_rank(content, 'customer wants to upgrade plan but has billing concerns') AS relevance,
created_at
FROM _hatidata_agent_memory
WHERE agent_id = 'support-bot'
AND semantic_match(content, 'customer wants to upgrade plan but has billing concerns', 0.7)
AND importance >= 0.5
ORDER BY relevance DESC, importance DESC
LIMIT 5;
Supply Chain: Supplier Risk Assessment
An operations agent correlates supplier performance data with risk intelligence memories:
-- Match supplier delivery records against risk intelligence
SELECT
s.supplier_name,
s.region,
s.avg_lead_time_days,
m.content AS risk_signal,
semantic_rank(m.content, CONCAT('supplier risk ', s.supplier_name, ' ', s.region)) AS risk_score
FROM supplier_metrics s
JOIN_VECTOR _hatidata_agent_memory m
ON semantic_match(m.content, CONCAT('supplier risk ', s.supplier_name, ' ', s.region), 0.75)
WHERE m.memory_type = 'fact'
AND s.avg_lead_time_days > 14
ORDER BY risk_score DESC;
Multi-Agent Knowledge Sharing
Multiple agents query a shared memory pool to find insights from other agents:
-- Agent B searches for research findings stored by Agent A
SELECT
agent_id AS source_agent,
content,
memory_type,
semantic_rank(content, 'competitive analysis enterprise pricing') AS relevance,
created_at
FROM _hatidata_agent_memory
WHERE semantic_match(content, 'competitive analysis enterprise pricing', 0.7)
AND agent_id != 'agent-b' -- exclude own memories
AND importance >= 0.6
ORDER BY relevance DESC
LIMIT 10;
Limitations
- Vector functions require an embedding provider to be configured on the proxy. Without one, queries using `semantic_match` or `semantic_rank` return an error.
- `JOIN_VECTOR` is syntactic sugar -- it does not create a special join algorithm. For very large tables, consider filtering with standard `WHERE` clauses first to reduce the join cardinality.
- Embedding resolution adds latency to the first occurrence of each unique search text in a query. Repeated identical texts within the same query are cached.
- The embedding model's quality directly affects search relevance. Use a model appropriate for your domain.
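The per-query caching of repeated search texts mentioned above can be sketched with simple memoization. This illustrates the behavior only; the proxy's actual cache implementation and the `resolve_embedding` name are assumptions.

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def resolve_embedding(text):
    """Stand-in for the proxy's embedding-provider call, memoized per text."""
    calls["count"] += 1  # counts real provider round-trips
    return (0.12, -0.34)  # a real provider returns a high-dimensional vector

# Two vector functions with the same search text in one query
# (e.g. semantic_match + semantic_rank) cost one provider call:
resolve_embedding("churn risk")
resolve_embedding("churn risk")
print(calls["count"])  # 1
```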
Next Steps
- Function Reference -- Snowflake-to-DuckDB function mappings
- Agent Memory -- How the memory system works under the hood
- MCP Tools Reference -- All 24 MCP tools including memory operations
- SQL Compatibility -- Full transpilation pipeline overview