# Agentic DataOps
Build a data operations pipeline where AI agents create isolated branches for speculative queries, test schema changes safely, and merge results back -- all without affecting production data. HatiData's branch isolation uses DuckDB schema-based separation with copy-on-write semantics for instant branch creation.
## Architecture

```
+---------------+      MCP / SQL       +----------------+
|    DataOps    | -------------------> |    HatiData    |
|     Agent     |                      |     Proxy      |
+---------------+                      +--------+-------+
                                                |
                                         +------+------+
                                         |   DuckDB    |
                                         +-------------+
                                         | main branch |
                                         |  +- feature |
                                         |  +- staging |
                                         |  +- test    |
                                         +-------------+
```
Each branch is implemented as a separate DuckDB schema. On creation, branches use zero-copy views of the main tables. On first write, the affected table is materialized in the branch schema (copy-on-write). This means branch creation is instant and storage-efficient.
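The copy-on-write model can be sketched in plain Python. This is a toy illustration of the schema mechanics described above (view until first write, then a full table copy), not HatiData's actual implementation:

```python
import copy

class BranchStore:
    """Toy copy-on-write branch store: a new branch starts as views
    (references) into main's tables and copies a table only on first write."""

    def __init__(self, main_tables):
        self.main = main_tables            # {table_name: list of row dicts}
        self.branches = {}                 # {branch_name: {table: copy or None}}

    def branch_create(self, name):
        # Instant: record a view marker per table, copy no rows.
        self.branches[name] = {table: None for table in self.main}

    def read(self, branch, table):
        materialized = self.branches[branch][table]
        return self.main[table] if materialized is None else materialized

    def write(self, branch, table, row):
        if self.branches[branch][table] is None:
            # First write materializes a full copy of the table (COW).
            self.branches[branch][table] = copy.deepcopy(self.main[table])
        self.branches[branch][table].append(row)

store = BranchStore({"users": [{"id": 1, "active": True}]})
store.branch_create("test-schema-migration")
store.write("test-schema-migration", "users", {"id": 2, "active": False})
```

After the write, the branch sees two rows while main still holds one, which is exactly the isolation guarantee the proxy relies on.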
## What's Included

- HatiData Proxy -- Query engine with instant branch isolation
- Demo App -- Python DataOps agent demonstrating branch-based workflows
## Quick Start

```sh
# Clone and start
git clone https://github.com/marviy/hatidata.git
cd hatidata/playbooks/agentic-dataops
docker compose up -d

# Wait for services to be healthy
docker compose ps

# Run the demo
docker compose exec demo-app python app.py
```
## What the Demo Does

1. Sets up main data -- Creates tables with production-like data in the main branch
2. Creates a branch -- Agent creates an isolated branch for testing schema changes
3. Tests schema changes -- Runs `ALTER TABLE` and `ADD COLUMN` safely on the branch
4. Runs speculative queries -- Agent tests new queries against the branched data
5. Compares results -- Demonstrates that the main branch remains untouched
6. Merges or discards -- Agent decides whether to keep or discard the changes
## Key Concepts

### Creating a Branch

Branches are created instantly using copy-on-write semantics. No data is copied until a write operation occurs on the branch.

```python
# Create an instant isolated copy for testing
client.call_tool("branch_create", {
    "name": "test-schema-migration",
    "from": "main"
})

# Branches are copy-on-write -- instant creation, minimal storage
```
### Working on a Branch

Once a branch is active, all queries in the session target the branch schema. Destructive operations are safe because they only affect the branch.

```python
# All queries in this session target the branch
client.call_tool("switch_branch", {"name": "test-schema-migration"})

# Safe to run destructive operations
client.query("ALTER TABLE users ADD COLUMN last_login TIMESTAMP")
client.query("UPDATE users SET last_login = CURRENT_TIMESTAMP")
```
### Comparing Branches

Check what changed between a branch and main:

```python
# Check what changed
diff = client.call_tool("diff_branch", {
    "branch": "test-schema-migration",
    "base": "main"
})
# Shows: added columns, modified rows, new tables
```
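The kind of diff described here can be approximated over plain dict snapshots. This is illustrative only; the actual shape of the `diff_branch` result is defined by HatiData:

```python
def diff_tables(base, branch):
    """Toy branch diff over {table: list-of-row-dicts} snapshots:
    reports new tables, added columns, and row-count deltas."""
    new_tables = sorted(set(branch) - set(base))
    added_columns, row_delta = {}, {}
    for table in set(base) & set(branch):
        base_cols = set(base[table][0]) if base[table] else set()
        branch_cols = set(branch[table][0]) if branch[table] else set()
        extra = sorted(branch_cols - base_cols)
        if extra:
            added_columns[table] = extra
        if len(branch[table]) != len(base[table]):
            row_delta[table] = len(branch[table]) - len(base[table])
    return {"new_tables": new_tables, "added_columns": added_columns,
            "row_delta": row_delta}

main_snapshot = {"users": [{"id": 1}, {"id": 2}]}
branch_snapshot = {
    "users": [{"id": 1, "last_login": None}, {"id": 2, "last_login": None}],
    "audit": [{"event": "migration"}],
}
report = diff_tables(main_snapshot, branch_snapshot)
```

Here the report flags the new `audit` table and the `last_login` column added by the migration, while the unchanged row count keeps the delta empty.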
### Merging Branches

When changes look good, merge them back to main:

```python
# Merge branch changes into main
result = client.call_tool("branch_merge", {
    "branch": "test-schema-migration",
    "target": "main",
    "strategy": "branch_wins"  # or "main_wins", "manual", "abort"
})
```
### Merge Strategies

| Strategy | Behavior |
|---|---|
| `branch_wins` | Branch data overwrites main on conflict |
| `main_wins` | Main data preserved on conflict; non-conflicting branch changes applied |
| `manual` | Returns conflict details for human resolution |
| `abort` | Fails if any conflicts are detected |
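The four strategies can be made concrete with a toy row-level merge keyed by `id`. HatiData merges at the table level, so treat this purely as an illustration of the conflict semantics in the table above:

```python
def merge_rows(main_rows, branch_rows, strategy):
    """Toy merge mirroring the strategy table: detect conflicting ids,
    then resolve them according to the chosen strategy."""
    main_by_id = {r["id"]: r for r in main_rows}
    branch_by_id = {r["id"]: r for r in branch_rows}
    conflicts = sorted(i for i in main_by_id.keys() & branch_by_id.keys()
                       if main_by_id[i] != branch_by_id[i])
    if conflicts and strategy == "abort":
        raise ValueError(f"conflicts on ids {conflicts}")
    if conflicts and strategy == "manual":
        return {"conflicts": conflicts}        # hand back for human resolution
    merged = dict(main_by_id)
    for i, row in branch_by_id.items():
        if i in conflicts and strategy == "main_wins":
            continue                           # keep main's version on conflict
        merged[i] = row                        # branch_wins / non-conflicting
    return {"rows": [merged[i] for i in sorted(merged)]}

main_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
branch_rows = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
```

With this data, id 2 conflicts: `branch_wins` takes `"B"`, `main_wins` keeps `"b"`, `manual` returns the conflict list, and `abort` raises. The non-conflicting id 3 is applied in both non-failing merge modes.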
### Discarding Branches

If the changes are not needed, discard the branch to clean up resources:

```python
# Discard branch and free resources
client.call_tool("branch_discard", {
    "branch": "test-schema-migration"
})
```
## Shadow Mode Testing

Run production queries against multiple branches simultaneously to compare results without affecting production:

```python
# Run queries against both branches simultaneously
result = client.call_tool("shadow_query", {
    "sql": "SELECT COUNT(*) FROM users WHERE active = true",
    "branches": ["main", "test-schema-migration"]
})
# Compare results without affecting production
```
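The essence of a shadow comparison is "same query, every branch, flag divergence". A minimal sketch over in-memory branch snapshots (a stand-in for the real `shadow_query` tool, whose result shape is not specified here):

```python
def shadow_count(branches, predicate):
    """Run the same count against several branch snapshots and flag
    divergence. Each branch is a list of row dicts."""
    results = {name: sum(1 for row in rows if predicate(row))
               for name, rows in branches.items()}
    return {"results": results, "match": len(set(results.values())) == 1}

snapshots = {
    "main": [{"active": True}, {"active": True}, {"active": False}],
    "test-schema-migration": [{"active": True}, {"active": True}, {"active": False}],
}
out = shadow_count(snapshots, lambda row: row["active"])
```

A `match` of `True` means the schema change did not alter the query's answer, which is the signal an agent needs before merging.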
## DataOps Workflows

### Schema Migration Testing

Test schema changes on a branch before applying them to production:

1. Create a branch from main
2. Run `ALTER TABLE` statements on the branch
3. Run existing queries to verify they still work
4. Compare results with main
5. Merge to main or iterate
### Data Backfill Preview

Preview the impact of a data backfill before running it:

1. Create a branch from main
2. Run `INSERT` or `UPDATE` statements on the branch
3. Verify data integrity with validation queries
4. Check row counts and data distributions
5. Merge when satisfied
### Query Optimization

Compare query plans and performance across different data layouts:

1. Create a branch from main
2. Apply indexes, materialized views, or denormalization on the branch
3. Run benchmark queries on both branches
4. Compare execution times
5. Merge the winning approach
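The comparison step can be sketched in miniature: two in-memory layouts stand in for the two branches (a raw scan versus a precomputed "materialized" list), and a best-of-N timer stands in for the benchmark harness. Assumptions throughout; none of this reflects DuckDB's actual planner:

```python
import time

def best_time(query_fn, repeats=5):
    """Return the best wall-clock time over several runs, plus the result."""
    best, result = float("inf"), None
    for _ in range(repeats):
        start = time.perf_counter()
        result = query_fn()
        best = min(best, time.perf_counter() - start)
    return best, result

rows = [{"id": i, "active": i % 2 == 0} for i in range(10_000)]
active_rows = [row for row in rows if row["active"]]   # "optimized branch" layout

scan_t, scan_n = best_time(lambda: sum(1 for r in rows if r["active"]))
index_t, index_n = best_time(lambda: len(active_rows))
```

The important check before comparing times is that both layouts return the same answer; only then is the faster branch the winning approach.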
### A/B Data Testing

Create branches with different data transformations to compare outcomes:

1. Create branches `transform-a` and `transform-b` from main
2. Apply different transformations to each
3. Run the same analytical queries on both
4. Compare results to determine the better approach
5. Merge the winning branch
## Branch Implementation Details

### Storage Model

| Operation | Implementation |
|---|---|
| Branch create | New DuckDB schema with views pointing to main tables |
| First write to a table | Table materialized in the branch schema (full copy) |
| Subsequent writes | Writes go to the materialized branch table |
| Branch discard | Drop the branch schema and all its objects |
| Branch merge | Copy modified tables back to the main schema |
### Garbage Collection

Branches have automatic lifecycle management:

- Reference counting -- Tracks active sessions using each branch
- TTL expiry -- Branches are automatically discarded after a configurable timeout (default: 24 hours)
- Periodic cleanup -- A background task runs every 5 minutes to collect expired branches
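The interaction of these rules (collect only branches that are both past their TTL and unreferenced) can be captured in a few lines. This is a sketch of the policy, not HatiData's collector:

```python
import time

def collect_expired(branches, ttl_seconds=24 * 3600, now=None):
    """Return branch names whose TTL has lapsed and that no session
    references. `branches` maps name -> {"created": epoch s, "refcount": int}."""
    now = time.time() if now is None else now
    return [name for name, meta in branches.items()
            if meta["refcount"] == 0 and now - meta["created"] > ttl_seconds]

registry = {
    "stale":  {"created": 0,       "refcount": 0},  # old and unreferenced
    "pinned": {"created": 0,       "refcount": 2},  # old but still in use
    "fresh":  {"created": 100_000, "refcount": 0},  # within TTL
}
expired = collect_expired(registry, now=100_500)
```

Only `stale` is collected: `pinned` is past its TTL but still referenced by sessions, and `fresh` is unreferenced but younger than the timeout.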
### Concurrency
- Up to 100 concurrent branches are supported
- Each branch operates independently with no cross-branch locking
- Merge operations acquire a brief lock on the target branch
## Configuration

| Variable | Default | Description |
|---|---|---|
| `BRANCH_ISOLATION_ENABLED` | `true` | Enable branch isolation |
| `HATIDATA_SHADOW_MODE_ENABLED` | `true` | Enable shadow mode queries |
| `HATIDATA_DEV_MODE` | `true` | Skip auth for local development |
## Cleanup

```sh
docker compose down -v  # Remove containers and volumes
```
## Next Steps

- Agentic RAG -- Semantic memory for AI agents
- Agentic Compliance -- Tamper-evident audit trails