# Embedding Pipeline
HatiData's embedding pipeline converts text content into vector representations for semantic search. The pipeline is designed for high throughput and low latency: embeddings are generated asynchronously, stored in a vector index for approximate nearest neighbor (ANN) retrieval, and joined back to structured metadata by memory_id UUID.
## Architecture Overview

```
Content (store_memory / log_step)
      |
      v
Embedding Service
      |
      +-- Cloud provider (e.g., OpenAI text-embedding-3-small)
      +-- Local provider (for air-gapped deployments)
      +-- Custom provider (bring your own)
      |
      v
Vector Index (ANN)
      |
      v
Structured Metadata Store
      |
      +-- Joined by memory_id at query time
```
## Embedding Service

HatiData defines a pluggable interface for embedding backends. All providers implement the same contract: accept a batch of text strings and return one vector per input.
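A minimal sketch of that contract, assuming Python and illustrative names (`EmbeddingProvider`, `embed` are not HatiData's actual API); the deterministic mock mirrors the seeded-RNG test provider listed below:

```python
import random
from typing import Protocol

class EmbeddingProvider(Protocol):
    """Hypothetical contract: a batch of texts in, one vector per text out."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class MockProvider:
    """Deterministic seeded-RNG provider, in the spirit of the Mock row below."""
    def __init__(self, dimensions: int = 384, seed: int = 42):
        self.dimensions = dimensions
        self.seed = seed

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            # Seed per text: the same input always yields the same vector.
            rng = random.Random(f"{self.seed}:{text}")
            vectors.append([rng.uniform(-1.0, 1.0) for _ in range(self.dimensions)])
        return vectors
```

Because the mock is deterministic, tests can assert on stable vectors without any network dependency.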
### Available Providers
| Provider | Model | Dimensions | Latency (p50) | Use Case |
|---|---|---|---|---|
| Cloud (OpenAI) | text-embedding-3-small | 1536 | ~80ms | Cloud, production |
| Local | BAAI/bge-small-en-v1.5 | 384 | ~15ms | Local, air-gapped |
| Mock | Deterministic seeded RNG | Configurable | <1ms | Testing |
The embedding provider is configurable at deployment time. Cloud deployments default to the OpenAI provider; local and air-gapped deployments use the bundled local provider.
### Local Embedding Provider

For local and air-gapped deployments, HatiData includes a local embedding service. The service exposes an HTTP endpoint compatible with the OpenAI embeddings API format and is included in the dev Docker Compose stack.

```bash
# Test the local embedding service
curl -X POST http://localhost:8090/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["Hello world"], "model": "bge-small-en-v1.5"}'
```
### Fallback Chain

```
Cloud provider --> Local provider --> Error
```

If the primary provider fails (network error, rate limit, invalid API key), the proxy attempts the local embedding service before returning an error. This keeps semantic search functional during cloud API outages.
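A minimal sketch of this chain in Python (the provider callables and error aggregation are illustrative; the actual proxy logic may differ):

```python
def embed_with_fallback(texts, providers):
    """Try each provider in order (cloud -> local); raise only if all fail.

    `providers` is a list of callables, each taking a batch of texts and
    returning one vector per text (hypothetical shape).
    """
    errors = []
    for provider in providers:
        try:
            return provider(texts)
        except Exception as exc:  # network error, rate limit, bad API key, ...
            errors.append(exc)
    raise RuntimeError(f"all embedding providers failed: {errors}")
```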
## Asynchronous Processing

Embedding generation is decoupled from content storage. When `store_memory` is called, the content is written to the metadata store immediately and the embedding is generated asynchronously in the background. This means `store_memory` calls return instantly while embeddings are computed and indexed behind the scenes.

The embedding pipeline batches multiple requests together for efficient processing, and the resulting vectors are indexed for search. The `has_embedding` flag on each memory entry indicates whether the embedding has been generated yet.
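The store-then-embed flow can be sketched as a toy in-process worker (the queue, store shape, and all names other than `has_embedding` are illustrative, not HatiData's internals):

```python
import queue

class EmbeddingWorker:
    """Sketch: metadata is written on submit; vectors land later, in batches."""
    def __init__(self, embed_fn, store, batch_size=32):
        self.embed_fn = embed_fn    # batch of texts -> one vector per text
        self.store = store          # memory_id -> record dict
        self.batch_size = batch_size
        self.pending = queue.Queue()

    def submit(self, memory_id, text):
        # store_memory path: persist content first, embed later.
        self.store[memory_id] = {"content": text, "has_embedding": False}
        self.pending.put((memory_id, text))

    def drain_once(self):
        """One background pass: embed up to batch_size pending texts."""
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return 0
        vectors = self.embed_fn([text for _, text in batch])
        for (memory_id, _), vec in zip(batch, vectors):
            self.store[memory_id]["embedding"] = vec
            self.store[memory_id]["has_embedding"] = True
        return len(batch)
```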
## Embedding Sampling
Not all content needs to be embedded. The CoT ledger supports configurable sampling to reduce embedding costs:
| Content Type | Default Sample Rate | Rationale |
|---|---|---|
| Agent memory | 100% | All memories should be searchable |
| CoT steps (observation) | 10% | High volume, low search need |
| CoT steps (reasoning) | 10% | High volume, moderate search need |
| CoT steps (conclusion) | 100% | Always embed for compliance search |
| CoT steps (escalation) | 100% | Always embed for audit search |
The sampling rate is configurable per step type. Content can also force embedding by setting `metadata.force_embed = true`.
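A sampling decision in this style might look like the following (rates mirror the table above; the function and parameter names are hypothetical):

```python
import random

# Default sample rates from the table above (step type -> probability).
DEFAULT_RATES = {
    "memory": 1.0,
    "observation": 0.1,
    "reasoning": 0.1,
    "conclusion": 1.0,
    "escalation": 1.0,
}

def should_embed(step_type, metadata=None, rates=DEFAULT_RATES, rng=random):
    """force_embed always wins; otherwise sample at the configured rate."""
    if metadata and metadata.get("force_embed"):
        return True
    return rng.random() < rates.get(step_type, 1.0)
```

Injecting `rng` keeps the decision testable: a fixed stub stands in for `random` in tests.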
## Vector Storage
Vectors are stored in a vector index with per-organization isolation. Each organization gets its own index namespace, ensuring complete tenant separation at the vector storage level.
At query time, `semantic_match()` and `semantic_rank()` are resolved by:
- Embedding the search text using the configured provider
- Running an approximate nearest neighbor (ANN) search within the agent's organization namespace
- Returning matching `memory_id` UUIDs and similarity scores
- Joining these results back to the structured metadata store by `memory_id`
This hybrid approach — vector search for semantic similarity, then a structured join for full metadata — combines the strengths of both search paradigms.
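The resolution steps above can be sketched end to end, using an exhaustive cosine scan as a stand-in for the ANN index (all names are illustrative; `index` and `metadata` model one organization's namespace):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_match(query, embed_fn, index, metadata, top_k=10):
    """Embed -> nearest-neighbor search -> join rows by memory_id.

    `index` maps memory_id -> vector; `metadata` maps memory_id -> row dict.
    A real deployment would use an ANN index instead of this exact scan.
    """
    qvec = embed_fn([query])[0]
    scored = sorted(
        ((cosine(qvec, vec), mid) for mid, vec in index.items()),
        reverse=True,
    )[:top_k]
    return [{"memory_id": mid, "score": score, **metadata[mid]}
            for score, mid in scored]
```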
## Performance Characteristics
| Metric | Value | Conditions |
|---|---|---|
| Embedding latency (cloud) | ~80ms p50 | Batch of 32 texts |
| Embedding latency (local) | ~15ms p50 | Batch of 32 texts |
| ANN search | ~2ms p50 | 1M vectors, top-10 |
| End-to-end semantic_match() | ~5ms p50 (local), ~90ms p50 (cloud) | Including embed + search + join |
| Background embedding throughput | ~3,200 texts/sec (local) | 2 workers, batch size 32 |
## Related Concepts

- Hybrid SQL -- `semantic_match` and `semantic_rank` functions
- Persistent Memory -- Memory storage and indexing
- Semantic Triggers -- Trigger evaluation pipeline
- Query Pipeline -- Where embedding resolution fits
- Multi-Cloud Deployment -- Cloud-specific embedding providers