
Hybrid Search (Vector SQL)

HatiData extends standard SQL with vector-aware syntax that lets agents join structured data with vector memory, filter rows by semantic similarity, and rank results by meaning rather than keywords. These extensions compose naturally with regular SQL -- you can use WHERE, GROUP BY, ORDER BY, and aggregates alongside vector operations.

Quick Reference

| Function | Returns | Use In | Description |
|---|---|---|---|
| semantic_match(col, 'text' [, threshold]) | BOOLEAN | WHERE, JOIN ON | True if cosine similarity exceeds threshold |
| vector_match(col, 'text' [, threshold]) | BOOLEAN | WHERE, JOIN ON | Alias for semantic_match |
| semantic_rank(col, 'text') | FLOAT | SELECT, ORDER BY | Cosine similarity score (0.0 to 1.0) |
| JOIN_VECTOR table ON ... | -- | FROM | Joins structured data with vector memory |

semantic_match(column, text [, threshold])

Returns TRUE if the cosine similarity between the column's embedding and the text's embedding exceeds the threshold.

Signature:

semantic_match(column_name, 'search text' [, threshold FLOAT])

| Parameter | Type | Default | Description |
|---|---|---|---|
| column_name | identifier | (required) | Column containing text or embeddings |
| search text | VARCHAR | (required) | Natural language query to match against |
| threshold | FLOAT | 0.7 | Minimum cosine similarity (0.0 to 1.0) |

Examples

Filter rows by semantic similarity:

SELECT ticket_id, subject, description
FROM support_tickets
WHERE semantic_match(description, 'billing dispute refund request')
AND status = 'open';

With an explicit threshold:

SELECT memory_id, content, importance
FROM _hatidata_agent_memory
WHERE semantic_match(content, 'customer pricing preferences', 0.85)
AND agent_id = 'support-agent';

In a JOIN ON clause:

SELECT o.order_id, o.total, m.content AS related_memory
FROM orders o
JOIN _hatidata_agent_memory m
ON semantic_match(m.content, CAST(o.description AS VARCHAR), 0.75);

vector_match(column, text [, threshold])

An alias for semantic_match. The two are interchangeable -- use whichever reads more naturally in your query.

-- These two queries are identical:
SELECT * FROM docs WHERE semantic_match(content, 'deployment guide');
SELECT * FROM docs WHERE vector_match(content, 'deployment guide');

semantic_rank(column, text)

Returns the cosine similarity score as a FLOAT between 0.0 and 1.0. Use this in SELECT projections and ORDER BY clauses to rank results by semantic relevance.

Signature:

semantic_rank(column_name, 'search text')

| Parameter | Type | Description |
|---|---|---|
| column_name | identifier | Column containing text or embeddings |
| search text | VARCHAR | Natural language query to rank against |

Examples

Rank and select the most relevant results:

SELECT
  memory_id,
  content,
  semantic_rank(content, 'revenue growth Q4') AS relevance
FROM _hatidata_agent_memory
WHERE agent_id = 'analyst'
ORDER BY relevance DESC
LIMIT 10;

Combine with aggregation:

SELECT
  customer_id,
  MAX(semantic_rank(notes, 'churn risk')) AS max_churn_signal
FROM customer_notes
GROUP BY customer_id
HAVING MAX(semantic_rank(notes, 'churn risk')) > 0.8
ORDER BY max_churn_signal DESC;

JOIN_VECTOR

Joins a base table with a vector memory table using semantic matching. This is syntactic sugar that combines a JOIN with semantic_match in a single, readable clause.

Syntax:

SELECT ...
FROM base_table
JOIN_VECTOR memory_table ON semantic_match(column, 'text' [, threshold])

Example

SELECT
  t.ticket_id,
  t.subject,
  m.content AS related_memory,
  semantic_rank(m.content, t.subject) AS relevance
FROM support_tickets t
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, t.subject, 0.7)
WHERE t.status = 'open'
ORDER BY relevance DESC;

This is equivalent to:

SELECT
  t.ticket_id,
  t.subject,
  m.content AS related_memory,
  semantic_rank(m.content, t.subject) AS relevance
FROM support_tickets t
JOIN _hatidata_agent_memory m
  ON semantic_match(m.content, t.subject, 0.7)
WHERE t.status = 'open'
ORDER BY relevance DESC;

JOIN_VECTOR makes intent explicit: the reader immediately knows this is a vector-aware join rather than a standard equi-join.

How It Works

When HatiData encounters vector SQL functions, the transpiler and proxy collaborate to resolve them:

  1. Parse -- The transpiler identifies semantic_match, vector_match, semantic_rank, and JOIN_VECTOR nodes in the SQL AST.
  2. Embed -- The proxy sends the search text to the configured embedding provider (OpenAI, Cohere, or a local model) and receives a float vector.
  3. Rewrite -- The transpiler rewrites vector functions to list_cosine_similarity() calls with the resolved embedding vector.
  4. Execute -- DuckDB evaluates list_cosine_similarity(embedding_col, [0.12, -0.34, ...]) against stored embeddings.
  5. Filter/Rank -- For semantic_match, results are filtered by the threshold comparison. For semantic_rank, the raw similarity score is returned.
Agent SQL:

  semantic_match(
    content,
    'billing issue',
    0.8
  )

Transpiled DuckDB SQL:

  list_cosine_similarity(
    content_embedding,
    [0.12, -0.34, ...]  -- resolved by proxy
  ) > 0.8

Agent SQL:

  semantic_rank(
    content,
    'billing issue'
  )

Transpiled DuckDB SQL:

  list_cosine_similarity(
    content_embedding,
    [0.12, -0.34, ...]
  )
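
Putting the pieces together, the first support_tickets example above would come out of the transpiler roughly as follows. This is an illustrative sketch: the embedding column name (description_embedding) and the placeholder vector are assumptions, since the actual column layout and vector depend on your schema and the configured embedding provider.

-- Agent SQL (what you write)
SELECT ticket_id, subject, description
FROM support_tickets
WHERE semantic_match(description, 'billing dispute refund request')
  AND status = 'open';

-- Approximate transpiled DuckDB SQL (embedding column name is illustrative)
SELECT ticket_id, subject, description
FROM support_tickets
WHERE list_cosine_similarity(
        description_embedding,
        [0.12, -0.34, ...]  -- full vector resolved by the proxy
      ) > 0.7               -- default threshold
  AND status = 'open';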

Threshold Tuning Guide

The default threshold of 0.7 works well for general-purpose retrieval. Adjust it based on your use case:

| Threshold | Precision | Recall | Best For |
|---|---|---|---|
| 0.9+ | Very high | Low | Exact semantic match, deduplication |
| 0.8 | High | Medium | Targeted retrieval, specific facts |
| 0.7 (default) | Medium | Medium-high | General search, RAG context |
| 0.6 | Low-medium | High | Exploratory search, broad context |
| 0.5 | Low | Very high | Catch-all, brainstorming, related topics |

Rules of thumb:

  • Start with 0.7 and adjust based on result quality, as shown in the calibration sketch after this list.
  • If you get too many irrelevant results, increase to 0.8.
  • If you miss results you expected, decrease to 0.6.
  • For safety-critical applications (fraud detection, compliance), use 0.85+ and review false negatives separately.
  • Short search texts tend to need lower thresholds; longer, more specific texts work better with higher thresholds.
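
A simple way to calibrate is to look at raw semantic_rank scores before committing to a cutoff. The sketch below reuses the support_tickets example from earlier; the table and search text are illustrative, not prescribed.

-- 1. Inspect the score distribution for a representative query
SELECT
  ticket_id,
  subject,
  semantic_rank(description, 'billing dispute refund request') AS score
FROM support_tickets
ORDER BY score DESC
LIMIT 50;

-- 2. Once relevant and irrelevant results visibly separate (say, around 0.8),
--    bake that cutoff into the filter
SELECT ticket_id, subject
FROM support_tickets
WHERE semantic_match(description, 'billing dispute refund request', 0.8)
  AND status = 'open';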

Real-World Examples

Fintech: Fraud Pattern Detection

An agent detects suspicious transactions by joining transaction records with known fraud pattern memories:

-- Find pending transactions that resemble known fraud patterns
SELECT
  tx.transaction_id,
  tx.amount,
  tx.merchant,
  tx.timestamp,
  m.content AS fraud_pattern,
  semantic_rank(m.content, CONCAT(tx.merchant, ' ', tx.description)) AS similarity
FROM transactions tx
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, CONCAT(tx.merchant, ' ', tx.description), 0.85)
WHERE m.memory_type = 'fact'
  AND m.metadata->>'category' = 'fraud_pattern'
  AND tx.status = 'pending'
  AND tx.amount > 1000
ORDER BY similarity DESC
LIMIT 20;

Support: Contextual Memory Recall

A support agent retrieves relevant past interactions before responding to a customer:

-- Pull relevant memories for the current customer conversation
SELECT
  content,
  memory_type,
  importance,
  semantic_rank(content, 'customer wants to upgrade plan but has billing concerns') AS relevance,
  created_at
FROM _hatidata_agent_memory
WHERE agent_id = 'support-bot'
  AND semantic_match(content, 'customer wants to upgrade plan but has billing concerns', 0.7)
  AND importance >= 0.5
ORDER BY relevance DESC, importance DESC
LIMIT 5;

Supply Chain: Supplier Risk Assessment

An operations agent correlates supplier performance data with risk intelligence memories:

-- Match supplier delivery records against risk intelligence
SELECT
  s.supplier_name,
  s.region,
  s.avg_lead_time_days,
  m.content AS risk_signal,
  semantic_rank(m.content, CONCAT('supplier risk ', s.supplier_name, ' ', s.region)) AS risk_score
FROM supplier_metrics s
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, CONCAT('supplier risk ', s.supplier_name, ' ', s.region), 0.75)
WHERE m.memory_type = 'fact'
  AND s.avg_lead_time_days > 14
ORDER BY risk_score DESC;

Multi-Agent Knowledge Sharing

Multiple agents query a shared memory pool to find insights from other agents:

-- Agent B searches for research findings stored by Agent A
SELECT
  agent_id AS source_agent,
  content,
  memory_type,
  semantic_rank(content, 'competitive analysis enterprise pricing') AS relevance,
  created_at
FROM _hatidata_agent_memory
WHERE semantic_match(content, 'competitive analysis enterprise pricing', 0.7)
  AND agent_id != 'agent-b'  -- exclude own memories
  AND importance >= 0.6
ORDER BY relevance DESC
LIMIT 10;

Limitations

  • Vector functions require an embedding provider to be configured on the proxy. Without one, queries using semantic_match or semantic_rank return an error.
  • JOIN_VECTOR is syntactic sugar -- it does not create a special join algorithm. For very large tables, consider filtering with standard WHERE clauses first to reduce the join cardinality (see the sketch after this list).
  • Embedding resolution adds latency to the first occurrence of each unique search text in a query. Repeated identical texts within the same query are cached.
  • The embedding model's quality directly affects search relevance. Use a model appropriate for your domain.
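
For example, here is a sketch of the pre-filtering pattern mentioned above, reusing the support-tickets join from earlier. The created_at column and the seven-day window are assumptions for illustration, and this presumes ordinary CTEs pass through the transpiler unchanged.

-- Narrow the base table with cheap structured filters first,
-- then run the vector join on the smaller result set
WITH recent_open_tickets AS (
  SELECT ticket_id, subject
  FROM support_tickets
  WHERE status = 'open'
    AND created_at > NOW() - INTERVAL 7 DAY   -- created_at is assumed for illustration
)
SELECT
  t.ticket_id,
  t.subject,
  m.content AS related_memory,
  semantic_rank(m.content, t.subject) AS relevance
FROM recent_open_tickets t
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, t.subject, 0.75)
ORDER BY relevance DESC;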
