Agentic DataOps

Build a data operations pipeline where AI agents create isolated branches for speculative queries, test schema changes safely, and merge results back -- all without affecting production data. HatiData's branch isolation uses DuckDB schema-based separation with copy-on-write semantics for instant branch creation.

Architecture

+---------------+     MCP / SQL      +----------------+
|  DataOps      | ------------------> |  HatiData      |
|  Agent        |                     |  Proxy         |
+---------------+                     +--------+-------+
                                               |
                                        +------+------+
                                        |   DuckDB    |
                                        +-------------+
                                        | main branch |
                                        |  +- feature |
                                        |  +- staging |
                                        |  +- test    |
                                        +-------------+

Each branch is implemented as a separate DuckDB schema. On creation, a branch holds only zero-copy views of the main tables. On the first write to a table, that table is materialized in the branch schema (copy-on-write). Branch creation is therefore instant, and a branch consumes extra storage only for the tables it modifies.
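The schema-per-branch layout can be sketched as the SQL statements a proxy might issue. This is an illustration under stated assumptions, not HatiData's actual implementation: the helper names `branch_create_sql` and `materialize_sql` are hypothetical, each branch is assumed to map to a schema of the same name, and the main tables are assumed to live in DuckDB's default `main` schema.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier so names like test-schema-migration are valid."""
    return '"' + name.replace('"', '""') + '"'


def branch_create_sql(branch: str, tables: list[str], base: str = "main") -> list[str]:
    """Statements that create a branch schema holding only views onto base tables."""
    b, m = quote_ident(branch), quote_ident(base)
    stmts = [f"CREATE SCHEMA {b}"]
    for table in tables:
        t = quote_ident(table)
        stmts.append(f"CREATE VIEW {b}.{t} AS SELECT * FROM {m}.{t}")
    return stmts


def materialize_sql(branch: str, table: str, base: str = "main") -> list[str]:
    """Statements that swap a branch view for a full copy before the first write."""
    b, m, t = quote_ident(branch), quote_ident(base), quote_ident(table)
    return [
        f"DROP VIEW {b}.{t}",
        f"CREATE TABLE {b}.{t} AS SELECT * FROM {m}.{t}",
    ]


if __name__ == "__main__":
    # Hypothetical table names, used only for illustration.
    for stmt in branch_create_sql("test-schema-migration", ["users", "orders"]):
        print(stmt)
    for stmt in materialize_sql("test-schema-migration", "users"):
        print(stmt)
```

Quoting the identifiers matters here because branch names such as `test-schema-migration` contain hyphens, which are not valid in unquoted SQL identifiers.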

What's Included

HatiData Proxy -- Query engine with instant branch isolation
Demo App -- Python DataOps agent demonstrating branch-based workflows

Quick Start

# Clone and start
git clone https://github.com/marviy/hatidata.git
cd hatidata/playbooks/agentic-dataops
docker compose up -d

# Wait for services to be healthy
docker compose ps

# Run the demo
docker compose exec demo-app python app.py

What the Demo Does

Sets up main data -- Creates tables with production-like data in the main branch
Creates a branch -- Agent creates an isolated branch for testing schema changes
Tests schema changes -- Runs ALTER TABLE ... ADD COLUMN statements safely on the branch
Runs speculative queries -- Agent tests new queries against the branched data
Compares results -- Demonstrates that the main branch remains untouched
Merges or discards -- Agent decides whether to keep or discard the changes

Key Concepts

Creating a Branch

Branches are created instantly using copy-on-write semantics. No data is copied until a write operation occurs on the branch.

# Create an instant isolated copy for testing
client.call_tool("branch_create", {
    "name": "test-schema-migration",
    "from": "main"
})
# Branches are copy-on-write -- instant creation, minimal storage

Working on a Branch

Once a branch is active, all queries in the session target the branch schema. Destructive operations are safe because they only affect the branch.

# All queries in this session target the branch
client.call_tool("switch_branch", {"name": "test-schema-migration"})

# Safe to run destructive operations
client.query("ALTER TABLE users ADD COLUMN last_login TIMESTAMP")
client.query("UPDATE users SET last_login = CURRENT_TIMESTAMP")

Comparing Branches

Check what changed between a branch and main:

# Check what changed
diff = client.call_tool("diff_branch", {
    "branch": "test-schema-migration",
    "base": "main"
})
# Shows: added columns, modified rows, new tables
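To make the three categories in that comment concrete, here is a small self-contained sketch of a diff between two snapshots. The shape of the real `diff_branch` result is not documented on this page, so the `Snapshot` layout and the keys of the returned report are assumptions for illustration only. Rows deleted on the branch are not reported by this sketch.

```python
# Assumed layout: table name -> (column names, rows keyed by primary key).
Snapshot = dict[str, tuple[list[str], dict[int, tuple]]]


def diff_snapshots(base: Snapshot, branch: Snapshot) -> dict:
    """Report new tables, added columns, and modified rows in branch relative to base."""
    report = {"new_tables": [], "added_columns": {}, "modified_rows": {}}
    for table, (cols, rows) in branch.items():
        if table not in base:
            report["new_tables"].append(table)
            continue
        base_cols, base_rows = base[table]
        added = [c for c in cols if c not in base_cols]
        if added:
            report["added_columns"][table] = added
        # Compare only the columns both sides share, so a new column alone
        # does not mark every row as modified.
        shared = [c for c in cols if c in base_cols]
        changed = []
        for key, row in rows.items():
            if key not in base_rows:
                changed.append(key)  # row inserted on the branch
                continue
            new_vals = [row[cols.index(c)] for c in shared]
            old_vals = [base_rows[key][base_cols.index(c)] for c in shared]
            if new_vals != old_vals:
                changed.append(key)
        if changed:
            report["modified_rows"][table] = sorted(changed)
    return report


if __name__ == "__main__":
    main = {"users": (["id", "name"], {1: (1, "ada"), 2: (2, "bob")})}
    branch = {
        "users": (["id", "name", "last_login"],
                  {1: (1, "ada", None), 2: (2, "robert", None)}),
        "audit": (["id"], {}),
    }
    print(diff_snapshots(main, branch))
```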

Merging Branches

When changes look good, merge them back to main:

# Merge branch changes into main
result = client.call_tool("branch_merge", {
    "branch": "test-schema-migration",
    "target": "main",
    "strategy": "branch_wins"  # or "main_wins", "manual", "abort"
})

Merge Strategies

Strategy	Behavior
`branch_wins`	Branch data overwrites main on conflict
`main_wins`	Main data preserved on conflict; non-conflicting branch changes applied
`manual`	Returns conflict details for human resolution
`abort`	Fails if any conflicts are detected
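The four strategies can be modelled as a three-way merge over rows keyed by primary key. The sketch below is not HatiData's merge code. It assumes that a conflict means both main and the branch changed the same row since the branch point, to different values, and that `manual` applies non-conflicting changes while returning the conflicts for review. The names `merge_rows` and `MergeConflict` are hypothetical.

```python
class MergeConflict(Exception):
    """Raised by the "abort" strategy when any conflict is found."""


STRATEGIES = {"branch_wins", "main_wins", "manual", "abort"}


def _apply(rows, key, value):
    """Set a row, or delete it when value is None (row absent)."""
    if value is None:
        rows.pop(key, None)
    else:
        rows[key] = value


def merge_rows(base, main, branch, strategy):
    """Three-way merge of {primary_key: row} dicts; returns (merged, conflicts).

    Assumption: a conflict is a key that main and branch both changed since
    base, to different values. A missing key means the row is absent.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    merged = dict(main)
    conflicts = {}
    for key in set(base) | set(main) | set(branch):
        b, m, br = base.get(key), main.get(key), branch.get(key)
        if br == b or br == m:
            continue  # branch left the row alone, or both sides agree
        if m == b:
            _apply(merged, key, br)  # only the branch changed it
        else:
            conflicts[key] = {"main": m, "branch": br}
    if conflicts and strategy == "abort":
        raise MergeConflict(f"conflicting keys: {sorted(conflicts)}")
    if strategy == "branch_wins":
        for key, sides in conflicts.items():
            _apply(merged, key, sides["branch"])
    # "main_wins" and "manual" keep main's value for conflicting keys; with
    # "manual" the caller is expected to resolve the returned conflicts.
    return merged, conflicts
```

Under these assumptions, `abort` is the conservative choice for automated agents, since it never changes main when both sides have touched the same row.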

Discarding Branches

If the changes are not needed, discard the branch to clean up resources:

# Discard branch and free resources
client.call_tool("branch_discard", {
    "branch": "test-schema-migration"
})

Shadow Mode Testing

Run production queries against multiple branches simultaneously to compare results without affecting production:

# Run queries against both branches simultaneously
result = client.call_tool("shadow_query", {
    "sql": "SELECT COUNT(*) FROM users WHERE active = true",
    "branches": ["main", "test-schema-migration"]
})
# Compare results without affecting production
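The idea behind a shadow query can be tried locally without a HatiData server. The sketch below uses two in-memory SQLite databases from Python's standard library as stand-ins for two branches. It is not the `shadow_query` tool itself, the return shape is an assumption, and it runs the queries one after another rather than simultaneously.

```python
import sqlite3


def shadow_query(sql: str, branches: dict[str, sqlite3.Connection]) -> dict:
    """Run one read-only query on every branch and report whether results agree."""
    results = {name: conn.execute(sql).fetchall() for name, conn in branches.items()}
    distinct = {tuple(rows) for rows in results.values()}
    return {"results": results, "match": len(distinct) <= 1}


def make_branch(rows):
    """Build an in-memory database that plays the role of one branch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, active INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    branches = {
        "main": make_branch([(1, 1), (2, 0)]),
        "test-schema-migration": make_branch([(1, 1), (2, 1)]),
    }
    out = shadow_query("SELECT COUNT(*) FROM users WHERE active = 1", branches)
    print(out["results"], out["match"])
```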

DataOps Workflows

Schema Migration Testing

Test schema changes on a branch before applying to production:

Create branch from main
Run ALTER TABLE statements on the branch
Run existing queries to verify they still work
Compare results with main
Merge to main or iterate
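The five steps above can be strung together in one function. This is a sketch, not part of the demo app: `run_migration_trial` and `RecordingClient` are hypothetical names, the client is assumed to expose the `call_tool` and `query` methods used in the earlier examples, and a failing query is assumed to raise an exception. Because the shape of the diff result is not documented here, the merge decision is delegated to a caller-supplied `approve` callback.

```python
class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def call_tool(self, name, args):
        self.calls.append(("tool", name, args))
        return {}

    def query(self, sql):
        self.calls.append(("query", sql))
        return []


def run_migration_trial(client, branch, ddl, checks, approve):
    """Create a branch, apply DDL, run checks, diff, then merge if approved."""
    client.call_tool("branch_create", {"name": branch, "from": "main"})
    client.call_tool("switch_branch", {"name": branch})
    try:
        for stmt in ddl:
            client.query(stmt)
        for sql in checks:
            client.query(sql)  # assumed to raise if the query no longer works
        diff = client.call_tool("diff_branch", {"branch": branch, "base": "main"})
    except Exception:
        # A broken migration never reaches main; drop the branch and re-raise.
        client.call_tool("branch_discard", {"branch": branch})
        raise
    if approve(diff):
        client.call_tool(
            "branch_merge",
            {"branch": branch, "target": "main", "strategy": "abort"},
        )
        return "merged"
    return "iterate"  # branch is kept so the agent can keep working on it


if __name__ == "__main__":
    client = RecordingClient()
    outcome = run_migration_trial(
        client,
        "test-schema-migration",
        ddl=["ALTER TABLE users ADD COLUMN last_login TIMESTAMP"],
        checks=["SELECT COUNT(*) FROM users"],
        approve=lambda diff: True,
    )
    print(outcome, len(client.calls))
```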

Data Backfill Preview

Preview the impact of a data backfill before running it:

Create branch from main
Run INSERT or UPDATE statements on the branch
Verify data integrity with validation queries
Check row counts and data distributions
Merge when satisfied

Query Optimization

Compare query plans and performance across different data layouts:

Create branch from main
Apply indexes, materialized views, or denormalization on the branch
Run benchmark queries on both branches
Compare execution times
Merge the winning approach
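For the benchmarking steps, a small timing harness is usually enough. The sketch below assumes each branch is represented by a callable that runs a SQL string against that branch. How that callable is built (for example, a client session already switched to the branch) is left to the caller, and the names `benchmark` and `compare_layouts` are hypothetical.

```python
import statistics
import time


def benchmark(run_query, sql, repeats=5):
    """Return the median wall-clock seconds of run_query(sql) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_layouts(runners, sql, repeats=5):
    """Benchmark the same query per branch; runners maps branch name -> callable."""
    medians = {name: benchmark(fn, sql, repeats) for name, fn in runners.items()}
    winner = min(medians, key=medians.get)
    return winner, medians
```

The median is used rather than the mean so that a single slow run, such as a cold cache on the first execution, does not decide the comparison.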

A/B Data Testing

Create branches with different data transformations to compare outcomes:

Create branches transform-a and transform-b from main
Apply different transformations to each
Run the same analytical queries on both
Compare results to determine the better approach
Merge the winning branch

Branch Implementation Details

Storage Model

Operation	Implementation
Branch create	New DuckDB schema with views pointing to main tables
First write to table	Table materialized in branch schema (full copy)
Subsequent writes	Writes go to the materialized branch table
Branch discard	Drop the branch schema and all its objects
Branch merge	Copy modified tables back to main schema
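The five operations in the table can be modelled in a few lines of Python. This in-memory sketch mirrors the behavior described above (reads fall through to main until the first write, which copies the whole table). It is a teaching model with hypothetical names, not the proxy's code, and its merge step overwrites main without any conflict handling.

```python
import copy


class BranchStore:
    """In-memory model: tables are lists of row dicts, branches are overlays."""

    def __init__(self, main_tables):
        self.main = main_tables
        self.branches = {}  # branch name -> {table name: materialized rows}

    def create(self, branch):
        # No data is copied; reads fall through to main.
        self.branches[branch] = {}

    def read(self, branch, table):
        return self.branches[branch].get(table, self.main[table])

    def write(self, branch, table, mutate):
        overlay = self.branches[branch]
        if table not in overlay:
            # First write: materialize a full copy (or start a new table).
            overlay[table] = copy.deepcopy(self.main.get(table, []))
        mutate(overlay[table])

    def discard(self, branch):
        del self.branches[branch]

    def merge(self, branch):
        # Copy only the tables the branch actually modified back to main.
        for table, rows in self.branches[branch].items():
            self.main[table] = copy.deepcopy(rows)


if __name__ == "__main__":
    store = BranchStore({"users": [{"id": 1}]})
    store.create("test-schema-migration")
    store.write("test-schema-migration", "users", lambda rows: rows.append({"id": 2}))
    print(len(store.main["users"]), len(store.read("test-schema-migration", "users")))
```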

Garbage Collection

Branches have automatic lifecycle management:

Reference counting -- Tracks active sessions using each branch
TTL expiry -- Branches are automatically discarded after a configurable timeout (default: 24 hours)
Periodic cleanup -- Background task runs every 5 minutes to collect expired branches
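How these three mechanisms interact can be sketched as follows. The page does not say whether the TTL is measured from creation or from last use, or whether an expired branch that is still in use is collected. This sketch assumes the TTL runs from creation and that branches with active sessions are skipped. `BranchRegistry` is a hypothetical name, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from dataclasses import dataclass

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # default TTL stated above
CLEANUP_INTERVAL_SECONDS = 5 * 60   # how often a background task would call collect()


@dataclass
class BranchInfo:
    created_at: float
    ref_count: int = 0  # number of sessions currently using the branch


class BranchRegistry:
    def __init__(self, ttl_seconds=DEFAULT_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._branches = {}

    def create(self, name):
        self._branches[name] = BranchInfo(created_at=self._clock())

    def acquire(self, name):
        self._branches[name].ref_count += 1

    def release(self, name):
        info = self._branches[name]
        info.ref_count = max(0, info.ref_count - 1)

    def collect(self):
        """Drop expired branches that no session is using; return their names."""
        now = self._clock()
        expired = [
            name
            for name, info in self._branches.items()
            if info.ref_count == 0 and now - info.created_at >= self._ttl
        ]
        for name in expired:
            del self._branches[name]
        return expired
```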

Concurrency

Up to 100 concurrent branches are supported
Each branch operates independently with no cross-branch locking
Merge operations acquire a brief lock on the target branch
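One way to picture these rules is a limiter that caps the number of live branches and serializes merges per target while leaving all other branch work unlocked. This is only a sketch with hypothetical names. Whether main counts toward the limit of 100 is not stated on this page, and here it is assumed not to.

```python
import threading

MAX_BRANCHES = 100  # limit stated above


class BranchLimiter:
    """Caps the number of live branches and serializes merges per target."""

    def __init__(self, max_branches=MAX_BRANCHES):
        self._max = max_branches
        self._branches = set()
        self._guard = threading.Lock()  # protects the bookkeeping below
        self._merge_locks = {}          # target name -> lock

    def create(self, name):
        with self._guard:
            if name in self._branches:
                raise ValueError(f"branch already exists: {name}")
            if len(self._branches) >= self._max:
                raise RuntimeError("branch limit reached")
            self._branches.add(name)

    def discard(self, name):
        with self._guard:
            self._branches.discard(name)

    def merge(self, target, apply_changes):
        """Hold the target's lock only while apply_changes() runs."""
        with self._guard:
            lock = self._merge_locks.setdefault(target, threading.Lock())
        with lock:
            return apply_changes()
```

Two merges into the same target wait for each other, while merges into different targets and ordinary branch queries proceed in parallel.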

Configuration

Variable	Default	Description
`BRANCH_ISOLATION_ENABLED`	`true`	Enable branch isolation
`HATIDATA_SHADOW_MODE_ENABLED`	`true`	Enable shadow mode queries
`HATIDATA_DEV_MODE`	`true`	Skip auth for local development
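A script that runs alongside the proxy can read the same variables to decide which features to exercise. The sketch below uses only the variable names and defaults from the table. The set of spellings accepted as true is an assumption of this sketch, not documented proxy behavior, and `load_config` is a hypothetical helper.

```python
import os

# Names and defaults taken from the table above.
DEFAULTS = {
    "BRANCH_ISOLATION_ENABLED": True,
    "HATIDATA_SHADOW_MODE_ENABLED": True,
    "HATIDATA_DEV_MODE": True,
}

# Assumed truthy spellings; the proxy's own parsing may differ.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default, environ=None):
    """Read a boolean flag from the environment, falling back to default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def load_config(environ=None):
    """Return every documented flag with its effective value."""
    return {name: env_flag(name, default, environ) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    print(load_config())
```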

Cleanup

docker compose down -v  # Remove containers and volumes

Next Steps

Agentic RAG -- Semantic memory for AI agents
Agentic Compliance -- Tamper-evident audit trails
