
Hybrid Search (Vector SQL)

HatiData extends standard SQL with vector-aware syntax that lets agents join structured data with vector memory, filter rows by semantic similarity, and rank results by meaning rather than keywords. These extensions compose naturally with regular SQL -- you can use WHERE, GROUP BY, ORDER BY, and aggregates alongside vector operations.

Quick Reference

| Function | Returns | Use In | Description |
|---|---|---|---|
| semantic_match(col, 'text' [, threshold]) | BOOLEAN | WHERE, JOIN ON | True if cosine similarity exceeds threshold |
| vector_match(col, 'text' [, threshold]) | BOOLEAN | WHERE, JOIN ON | Alias for semantic_match |
| semantic_rank(col, 'text') | FLOAT | SELECT, ORDER BY | Cosine similarity score (0.0 to 1.0) |
| JOIN_VECTOR table ON ... | -- | FROM | Joins structured data with vector memory |

semantic_match(column, text [, threshold])

Returns TRUE if the cosine similarity between the column's embedding and the text's embedding exceeds the threshold.

Signature:

semantic_match(column_name, 'search text' [, threshold FLOAT])

| Parameter | Type | Default | Description |
|---|---|---|---|
| column_name | identifier | (required) | Column containing text or embeddings |
| search text | VARCHAR | (required) | Natural language query to match against |
| threshold | FLOAT | 0.7 | Minimum cosine similarity (0.0 to 1.0) |

Examples

Filter rows by semantic similarity:

SELECT ticket_id, subject, description
FROM support_tickets
WHERE semantic_match(description, 'billing dispute refund request')
AND status = 'open';

With an explicit threshold:

SELECT memory_id, content, importance
FROM _hatidata_agent_memory
WHERE semantic_match(content, 'customer pricing preferences', 0.85)
AND agent_id = 'support-agent';

In a JOIN ON clause:

SELECT o.order_id, o.total, m.content AS related_memory
FROM orders o
JOIN _hatidata_agent_memory m
ON semantic_match(m.content, CAST(o.description AS VARCHAR), 0.75);

vector_match(column, text [, threshold])

An alias for semantic_match. The two are interchangeable -- use whichever reads more naturally in your query.

-- These two queries are identical:
SELECT * FROM docs WHERE semantic_match(content, 'deployment guide');
SELECT * FROM docs WHERE vector_match(content, 'deployment guide');

semantic_rank(column, text)

Returns the cosine similarity score as a FLOAT between 0.0 and 1.0. Use this in SELECT projections and ORDER BY clauses to rank results by semantic relevance.

Signature:

semantic_rank(column_name, 'search text')

| Parameter | Type | Description |
|---|---|---|
| column_name | identifier | Column containing text or embeddings |
| search text | VARCHAR | Natural language query to rank against |

Examples

Rank and select the most relevant results:

SELECT
  memory_id,
  content,
  semantic_rank(content, 'revenue growth Q4') AS relevance
FROM _hatidata_agent_memory
WHERE agent_id = 'analyst'
ORDER BY relevance DESC
LIMIT 10;

Combine with aggregation:

SELECT
  customer_id,
  MAX(semantic_rank(notes, 'churn risk')) AS max_churn_signal
FROM customer_notes
GROUP BY customer_id
HAVING MAX(semantic_rank(notes, 'churn risk')) > 0.8
ORDER BY max_churn_signal DESC;

JOIN_VECTOR

Joins a base table with a vector memory table using semantic matching. This is syntactic sugar that combines a JOIN with semantic_match in a single, readable clause.

Syntax:

SELECT ...
FROM base_table
JOIN_VECTOR memory_table ON semantic_match(column, 'text' [, threshold])

Example

SELECT
  t.ticket_id,
  t.subject,
  m.content AS related_memory,
  semantic_rank(m.content, t.subject) AS relevance
FROM support_tickets t
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, t.subject, 0.7)
WHERE t.status = 'open'
ORDER BY relevance DESC;

This is equivalent to:

SELECT
  t.ticket_id,
  t.subject,
  m.content AS related_memory,
  semantic_rank(m.content, t.subject) AS relevance
FROM support_tickets t
JOIN _hatidata_agent_memory m
  ON semantic_match(m.content, t.subject, 0.7)
WHERE t.status = 'open'
ORDER BY relevance DESC;

JOIN_VECTOR makes intent explicit: the reader immediately knows this is a vector-aware join rather than a standard equi-join.

How It Works

When HatiData encounters vector SQL functions, the transpiler and proxy collaborate to resolve them:

  1. Parse -- The transpiler identifies semantic_match, vector_match, semantic_rank, and JOIN_VECTOR nodes in the SQL AST.
  2. Embed -- The proxy sends the search text to the configured embedding provider (OpenAI, Cohere, or a local model) and receives a float vector.
  3. Rewrite -- The transpiler rewrites vector functions to list_cosine_similarity() calls with the resolved embedding vector.
  4. Execute -- DuckDB evaluates list_cosine_similarity(embedding_col, [0.12, -0.34, ...]) against stored embeddings.
  5. Filter/Rank -- For semantic_match, results are filtered by the threshold comparison. For semantic_rank, the raw similarity score is returned.
Agent SQL:

  semantic_match(
    content,
    'billing issue',
    0.8
  )

Transpiled DuckDB SQL:

  list_cosine_similarity(
    content_embedding,
    [0.12, -0.34, ...]  -- resolved by proxy
  ) > 0.8

Agent SQL:

  semantic_rank(
    content,
    'billing issue'
  )

Transpiled DuckDB SQL:

  list_cosine_similarity(
    content_embedding,
    [0.12, -0.34, ...]
  )
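
Putting the pieces together, the first support_tickets example above would come out of the transpiler roughly as follows. This is an illustrative sketch: the embedding column name (description_embedding) and the placeholder vector are assumptions, since the actual column layout and vector depend on your schema and the configured embedding provider.

-- Agent SQL (what you write)
SELECT ticket_id, subject, description
FROM support_tickets
WHERE semantic_match(description, 'billing dispute refund request')
  AND status = 'open';

-- Approximate transpiled DuckDB SQL (embedding column name is illustrative)
SELECT ticket_id, subject, description
FROM support_tickets
WHERE list_cosine_similarity(
        description_embedding,
        [0.12, -0.34, ...]  -- full vector resolved by the proxy
      ) > 0.7               -- default threshold
  AND status = 'open';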

Threshold Tuning Guide

The default threshold of 0.7 works well for general-purpose retrieval. Adjust it based on your use case:

| Threshold | Precision | Recall | Best For |
|---|---|---|---|
| 0.9+ | Very high | Low | Exact semantic match, deduplication |
| 0.8 | High | Medium | Targeted retrieval, specific facts |
| 0.7 (default) | Medium | Medium-high | General search, RAG context |
| 0.6 | Low-medium | High | Exploratory search, broad context |
| 0.5 | Low | Very high | Catch-all, brainstorming, related topics |

Rules of thumb:

  • Start with 0.7 and adjust based on result quality, as shown in the calibration sketch after this list.
  • If you get too many irrelevant results, increase to 0.8.
  • If you miss results you expected, decrease to 0.6.
  • For safety-critical applications (fraud detection, compliance), use 0.85+ and review false negatives separately.
  • Short search texts tend to need lower thresholds; longer, more specific texts work better with higher thresholds.
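
A simple way to calibrate is to look at raw semantic_rank scores before committing to a cutoff. The sketch below reuses the support_tickets example from earlier; the table and search text are illustrative, not prescribed.

-- 1. Inspect the score distribution for a representative query
SELECT
  ticket_id,
  subject,
  semantic_rank(description, 'billing dispute refund request') AS score
FROM support_tickets
ORDER BY score DESC
LIMIT 50;

-- 2. Once relevant and irrelevant results visibly separate (say, around 0.8),
--    bake that cutoff into the filter
SELECT ticket_id, subject
FROM support_tickets
WHERE semantic_match(description, 'billing dispute refund request', 0.8)
  AND status = 'open';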

Real-World Examples

Fintech: Fraud Pattern Detection

An agent detects suspicious transactions by joining transaction records with known fraud pattern memories:

-- Find pending transactions that resemble known fraud patterns
SELECT
  tx.transaction_id,
  tx.amount,
  tx.merchant,
  tx.timestamp,
  m.content AS fraud_pattern,
  semantic_rank(m.content, CONCAT(tx.merchant, ' ', tx.description)) AS similarity
FROM transactions tx
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, CONCAT(tx.merchant, ' ', tx.description), 0.85)
WHERE m.memory_type = 'fact'
  AND m.metadata->>'category' = 'fraud_pattern'
  AND tx.status = 'pending'
  AND tx.amount > 1000
ORDER BY similarity DESC
LIMIT 20;

Support: Contextual Memory Recall

A support agent retrieves relevant past interactions before responding to a customer:

-- Pull relevant memories for the current customer conversation
SELECT
  content,
  memory_type,
  importance,
  semantic_rank(content, 'customer wants to upgrade plan but has billing concerns') AS relevance,
  created_at
FROM _hatidata_agent_memory
WHERE agent_id = 'support-bot'
  AND semantic_match(content, 'customer wants to upgrade plan but has billing concerns', 0.7)
  AND importance >= 0.5
ORDER BY relevance DESC, importance DESC
LIMIT 5;

Supply Chain: Supplier Risk Assessment

An operations agent correlates supplier performance data with risk intelligence memories:

-- Match supplier delivery records against risk intelligence
SELECT
  s.supplier_name,
  s.region,
  s.avg_lead_time_days,
  m.content AS risk_signal,
  semantic_rank(m.content, CONCAT('supplier risk ', s.supplier_name, ' ', s.region)) AS risk_score
FROM supplier_metrics s
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, CONCAT('supplier risk ', s.supplier_name, ' ', s.region), 0.75)
WHERE m.memory_type = 'fact'
  AND s.avg_lead_time_days > 14
ORDER BY risk_score DESC;

Multi-Agent Knowledge Sharing

Multiple agents query a shared memory pool to find insights from other agents:

-- Agent B searches for research findings stored by Agent A
SELECT
  agent_id AS source_agent,
  content,
  memory_type,
  semantic_rank(content, 'competitive analysis enterprise pricing') AS relevance,
  created_at
FROM _hatidata_agent_memory
WHERE semantic_match(content, 'competitive analysis enterprise pricing', 0.7)
  AND agent_id != 'agent-b'  -- exclude own memories
  AND importance >= 0.6
ORDER BY relevance DESC
LIMIT 10;

Limitations

  • Vector functions require an embedding provider to be configured on the proxy. Without one, queries using semantic_match or semantic_rank return an error.
  • JOIN_VECTOR is syntactic sugar -- it does not create a special join algorithm. For very large tables, consider filtering with standard WHERE clauses first to reduce the join cardinality (see the sketch after this list).
  • Embedding resolution adds latency to the first occurrence of each unique search text in a query. Repeated identical texts within the same query are cached.
  • The embedding model's quality directly affects search relevance. Use a model appropriate for your domain.
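
For example, here is a sketch of the pre-filtering pattern mentioned above, reusing the support-tickets join from earlier. The created_at column and the seven-day window are assumptions for illustration, and this presumes ordinary CTEs pass through the transpiler unchanged.

-- Narrow the base table with cheap structured filters first,
-- then run the vector join on the smaller result set
WITH recent_open_tickets AS (
  SELECT ticket_id, subject
  FROM support_tickets
  WHERE status = 'open'
    AND created_at > NOW() - INTERVAL 7 DAY   -- created_at is assumed for illustration
)
SELECT
  t.ticket_id,
  t.subject,
  m.content AS related_memory,
  semantic_rank(m.content, t.subject) AS relevance
FROM recent_open_tickets t
JOIN_VECTOR _hatidata_agent_memory m
  ON semantic_match(m.content, t.subject, 0.75)
ORDER BY relevance DESC;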
