# Embedding Pipeline
HatiData's embedding pipeline converts text content into vector representations for semantic search. The pipeline is designed for high throughput and low latency: embeddings are generated asynchronously, stored in a vector index for approximate nearest neighbor (ANN) retrieval, and joined back to structured metadata by memory_id UUID.
## Architecture Overview

```
Content (store_memory / log_step)
      |
      v
Embedding Service
      |
      +-- Cloud provider (e.g., OpenAI text-embedding-3-small)
      +-- Local provider (for air-gapped deployments)
      +-- Custom provider (bring your own)
      |
      v
Vector Index (ANN)
      |
      v
Structured Metadata Store
      |
      +-- Joined by memory_id at query time
```
## Embedding Service

HatiData defines a pluggable interface for embedding backends. All providers implement the same contract: accept a batch of text strings and return one vector per input.
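A minimal sketch of that contract, assuming Python and illustrative names (`EmbeddingProvider`, `embed` are not HatiData's actual API); the deterministic mock mirrors the seeded-RNG test provider listed below:

```python
import random
from typing import Protocol

class EmbeddingProvider(Protocol):
    """Hypothetical contract: a batch of texts in, one vector per text out."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class MockProvider:
    """Deterministic seeded-RNG provider, in the spirit of the Mock row below."""
    def __init__(self, dimensions: int = 384, seed: int = 42):
        self.dimensions = dimensions
        self.seed = seed

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            # Seed per text: the same input always yields the same vector.
            rng = random.Random(f"{self.seed}:{text}")
            vectors.append([rng.uniform(-1.0, 1.0) for _ in range(self.dimensions)])
        return vectors
```

Because the mock is deterministic, tests can assert on stable vectors without any network dependency.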
### Available Providers
| Provider | Model | Dimensions | Latency (p50) | Use Case |
|---|---|---|---|---|
| Cloud (OpenAI) | text-embedding-3-small | 1536 | ~80ms | Cloud, production |
| Local | BAAI/bge-small-en-v1.5 | 384 | ~15ms | Local, air-gapped |
| Mock | Deterministic seeded RNG | Configurable | <1ms | Testing |
The embedding provider is configurable at deployment time. Cloud deployments default to the OpenAI provider; local and air-gapped deployments use the bundled local provider.
### Local Embedding Provider

For local and air-gapped deployments, HatiData includes a local embedding service. The service exposes an HTTP endpoint compatible with the OpenAI embeddings API format and is included in the dev Docker Compose stack.

```bash
# Test the local embedding service
curl -X POST http://localhost:8090/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["Hello world"], "model": "bge-small-en-v1.5"}'
```
### Fallback Chain

```
Cloud provider --> Local provider --> Error
```

If the primary provider fails (network error, rate limit, invalid API key), the proxy attempts the local embedding service before returning an error. This keeps semantic search functional during cloud API outages.
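A minimal sketch of this chain in Python (the provider callables and error aggregation are illustrative; the actual proxy logic may differ):

```python
def embed_with_fallback(texts, providers):
    """Try each provider in order (cloud -> local); raise only if all fail.

    `providers` is a list of callables, each taking a batch of texts and
    returning one vector per text (hypothetical shape).
    """
    errors = []
    for provider in providers:
        try:
            return provider(texts)
        except Exception as exc:  # network error, rate limit, bad API key, ...
            errors.append(exc)
    raise RuntimeError(f"all embedding providers failed: {errors}")
```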
## Asynchronous Processing

Embedding generation is decoupled from content storage. When `store_memory` is called, the content is written to the metadata store immediately and the embedding is generated asynchronously in the background. This means `store_memory` calls return instantly while embeddings are computed and indexed behind the scenes.

The embedding pipeline batches multiple requests together for efficient processing, and the resulting vectors are indexed for search. The `has_embedding` flag on each memory entry indicates whether the embedding has been generated yet.
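The store-then-embed flow can be sketched as a toy in-process worker (the queue, store shape, and all names other than `has_embedding` are illustrative, not HatiData's internals):

```python
import queue

class EmbeddingWorker:
    """Sketch: metadata is written on submit; vectors land later, in batches."""
    def __init__(self, embed_fn, store, batch_size=32):
        self.embed_fn = embed_fn    # batch of texts -> one vector per text
        self.store = store          # memory_id -> record dict
        self.batch_size = batch_size
        self.pending = queue.Queue()

    def submit(self, memory_id, text):
        # store_memory path: persist content first, embed later.
        self.store[memory_id] = {"content": text, "has_embedding": False}
        self.pending.put((memory_id, text))

    def drain_once(self):
        """One background pass: embed up to batch_size pending texts."""
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return 0
        vectors = self.embed_fn([text for _, text in batch])
        for (memory_id, _), vec in zip(batch, vectors):
            self.store[memory_id]["embedding"] = vec
            self.store[memory_id]["has_embedding"] = True
        return len(batch)
```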
## Embedding Sampling
Not all content needs to be embedded. The CoT ledger supports configurable sampling to reduce embedding costs:
| Content Type | Default Sample Rate | Rationale |
|---|---|---|
| Agent memory | 100% | All memories should be searchable |
| CoT steps (observation) | 10% | High volume, low search need |
| CoT steps (reasoning) | 10% | High volume, moderate search need |
| CoT steps (conclusion) | 100% | Always embed for compliance search |
| CoT steps (escalation) | 100% | Always embed for audit search |
The sampling rate is configurable per step type. Content can also force embedding by setting `metadata.force_embed = true`.
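A sampling decision in this style might look like the following (rates mirror the table above; the function and parameter names are hypothetical):

```python
import random

# Default sample rates from the table above (step type -> probability).
DEFAULT_RATES = {
    "memory": 1.0,
    "observation": 0.1,
    "reasoning": 0.1,
    "conclusion": 1.0,
    "escalation": 1.0,
}

def should_embed(step_type, metadata=None, rates=DEFAULT_RATES, rng=random):
    """force_embed always wins; otherwise sample at the configured rate."""
    if metadata and metadata.get("force_embed"):
        return True
    return rng.random() < rates.get(step_type, 1.0)
```

Injecting `rng` keeps the decision testable: a fixed stub stands in for `random` in tests.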
## Vector Storage
Vectors are stored in a vector index with per-organization isolation. Each organization gets its own index namespace, ensuring complete tenant separation at the vector storage level.
At query time, `semantic_match()` and `semantic_rank()` are resolved by:
- Embedding the search text using the configured provider
- Running an approximate nearest neighbor (ANN) search within the agent's organization namespace
- Returning matching `memory_id` UUIDs and similarity scores
- Joining these results back to the structured metadata store by `memory_id`
This hybrid approach — vector search for semantic similarity, then a structured join for full metadata — combines the strengths of both search paradigms.
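The resolution steps above can be sketched end to end, using an exhaustive cosine scan as a stand-in for the ANN index (all names are illustrative; `index` and `metadata` model one organization's namespace):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_match(query, embed_fn, index, metadata, top_k=10):
    """Embed -> nearest-neighbor search -> join rows by memory_id.

    `index` maps memory_id -> vector; `metadata` maps memory_id -> row dict.
    A real deployment would use an ANN index instead of this exact scan.
    """
    qvec = embed_fn([query])[0]
    scored = sorted(
        ((cosine(qvec, vec), mid) for mid, vec in index.items()),
        reverse=True,
    )[:top_k]
    return [{"memory_id": mid, "score": score, **metadata[mid]}
            for score, mid in scored]
```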
## Performance Characteristics
| Metric | Value | Conditions |
|---|---|---|
| Embedding latency (cloud) | ~80ms p50 | Batch of 32 texts |
| Embedding latency (local) | ~15ms p50 | Batch of 32 texts |
| ANN search | ~2ms p50 | 1M vectors, top-10 |
| End-to-end semantic_match() | ~5ms p50 (local), ~90ms p50 (cloud) | Including embed + search + join |
| Background embedding throughput | ~3,200 texts/sec (local) | 2 workers, batch size 32 |
## Related Concepts

- Hybrid SQL -- `semantic_match` and `semantic_rank` functions
- Persistent Memory -- Memory storage and indexing
- Semantic Triggers -- Trigger evaluation pipeline
- Query Pipeline -- Where embedding resolution fits
- Multi-Cloud Deployment -- Cloud-specific embedding providers