Agentic DataOps

Build a data operations pipeline where AI agents create isolated branches for speculative queries, test schema changes safely, and merge results back -- all without affecting production data. HatiData's branch isolation uses DuckDB schema-based separation with copy-on-write semantics for instant branch creation.

Architecture

+---------------+     MCP / SQL     +----------------+
|    DataOps    | ----------------> |    HatiData    |
|     Agent     |                   |     Proxy      |
+---------------+                   +--------+-------+
                                             |
                                      +------+------+
                                      |   DuckDB    |
                                      +-------------+
                                      | main branch |
                                      |  +- feature |
                                      |  +- staging |
                                      |  +- test    |
                                      +-------------+

Each branch is implemented as a separate DuckDB schema. On creation, a branch contains only zero-copy views of the main tables. On the first write to a table, that table is materialized in the branch schema as a full copy (copy-on-write at table granularity). Branch creation is therefore instant, and a branch consumes storage only for the tables it has modified.
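To make the copy-on-write behavior concrete, here is a minimal in-memory model in plain Python. It is a sketch of the idea only, not HatiData's implementation: the `Branch` class and its `read`/`write` methods are hypothetical names, and tables are modeled as lists of row dicts rather than DuckDB objects.

```python
from dataclasses import dataclass, field

Table = list[dict]  # a table is modeled as a list of row dicts


@dataclass
class Branch:
    """Toy copy-on-write branch over a dict of main tables (illustrative only)."""

    main: dict[str, Table]
    owned: dict[str, Table] = field(default_factory=dict)

    def read(self, table: str) -> Table:
        # Unmodified tables are served straight from main (the "view" case).
        return self.owned.get(table, self.main[table])

    def write(self, table: str, row: dict) -> None:
        # The first write materializes a private copy of the whole table.
        if table not in self.owned:
            self.owned[table] = [dict(r) for r in self.main[table]]
        self.owned[table].append(row)


main_tables = {"users": [{"id": 1, "name": "ada"}]}
branch = Branch(main_tables)

# Before any write, the branch reads the very same object as main.
shared_before_write = branch.read("users") is main_tables["users"]

# After a write, the branch owns a copy and main is unchanged.
branch.write("users", {"id": 2, "name": "lin"})
```

Creating the `Branch` copies nothing, which is why creation is cheap; the cost is paid per table on its first write.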

What's Included

  • HatiData Proxy -- Query engine with instant branch isolation
  • Demo App -- Python DataOps agent demonstrating branch-based workflows

Quick Start

# Clone and start
git clone https://github.com/marviy/hatidata.git
cd hatidata/playbooks/agentic-dataops
docker compose up -d

# Wait for services to be healthy
docker compose ps

# Run the demo
docker compose exec demo-app python app.py

What the Demo Does

  1. Sets up main data -- Creates tables with production-like data in the main branch
  2. Creates a branch -- Agent creates an isolated branch for testing schema changes
  3. Tests schema changes -- Runs ALTER TABLE ... ADD COLUMN statements safely on the branch
  4. Runs speculative queries -- Agent tests new queries against the branched data
  5. Compares results -- Demonstrates that the main branch remains untouched
  6. Merges or discards -- Agent decides whether to keep or discard the changes
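The six steps can be sketched as one function. The tool names and arguments below are the ones used later on this page; `RecordingClient` is a hypothetical stand-in that only records calls, so the sketch runs without a HatiData proxy, and `run_migration_trial` and its `keep` callback are illustrative names rather than part of the demo app.

```python
class RecordingClient:
    """Hypothetical stand-in: records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def call_tool(self, name, args):
        self.calls.append(("tool", name, args))
        return {"ok": True}

    def query(self, sql):
        self.calls.append(("query", sql))
        return []


def run_migration_trial(client, branch, statements, keep):
    """Create a branch, apply statements on it, then merge or discard."""
    client.call_tool("branch_create", {"name": branch, "from": "main"})
    client.call_tool("switch_branch", {"name": branch})
    for sql in statements:
        client.query(sql)
    diff = client.call_tool("diff_branch", {"branch": branch, "base": "main"})
    if keep(diff):
        return client.call_tool(
            "branch_merge",
            {"branch": branch, "target": "main", "strategy": "abort"},
        )
    return client.call_tool("branch_discard", {"branch": branch})


client = RecordingClient()
run_migration_trial(
    client,
    "test-schema-migration",
    ["ALTER TABLE users ADD COLUMN last_login TIMESTAMP"],
    keep=lambda diff: False,  # always discard in this dry run
)
tool_names = [call[1] for call in client.calls if call[0] == "tool"]
```

Passing the decision in as a callback keeps the merge-or-discard policy separate from the mechanics of branching.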

Key Concepts

Creating a Branch

Branches are created instantly using copy-on-write semantics. No data is copied until a write operation occurs on the branch.

# Create an instant isolated copy for testing
client.call_tool("branch_create", {
    "name": "test-schema-migration",
    "from": "main"
})
# Branches are copy-on-write -- instant creation, minimal storage

Working on a Branch

Once a branch is active, all queries in the session target the branch schema. Destructive operations are safe because they only affect the branch.

# All queries in this session target the branch
client.call_tool("switch_branch", {"name": "test-schema-migration"})

# Safe to run destructive operations
client.query("ALTER TABLE users ADD COLUMN last_login TIMESTAMP")
client.query("UPDATE users SET last_login = CURRENT_TIMESTAMP")

Comparing Branches

Check what changed between a branch and main:

# Check what changed
diff = client.call_tool("diff_branch", {
    "branch": "test-schema-migration",
    "base": "main"
})
# Shows: added columns, modified rows, new tables
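The response format of diff_branch is not documented here. As a rough illustration of the three categories it reports, the sketch below compares two in-memory snapshots. `diff_tables`, the snapshot layout, and the report keys are all assumptions made for this example.

```python
def diff_tables(base, branch, key="id"):
    """Compare two {table_name: [row dicts]} snapshots (illustrative only)."""
    report = {"new_tables": [], "added_columns": {}, "modified_rows": {}}
    for name, rows in branch.items():
        if name not in base:
            report["new_tables"].append(name)
            continue
        base_cols = {col for row in base[name] for col in row}
        branch_cols = {col for row in rows for col in row}
        added = sorted(branch_cols - base_cols)
        if added:
            report["added_columns"][name] = added
        # A row counts as modified if it is new or differs in a base column.
        # Deleted rows are not tracked in this sketch.
        base_by_key = {row[key]: row for row in base[name]}
        changed = []
        for row in rows:
            old = base_by_key.get(row[key])
            if old is None or any(old.get(c) != row.get(c) for c in base_cols):
                changed.append(row[key])
        if changed:
            report["modified_rows"][name] = changed
    return report


base = {"users": [{"id": 1, "name": "ada"}, {"id": 2, "name": "lin"}]}
branch = {
    "users": [
        {"id": 1, "name": "ada", "last_login": None},
        {"id": 2, "name": "Lin", "last_login": "2024-01-01"},
    ],
    "audit": [{"id": 1, "event": "created"}],
}
report = diff_tables(base, branch)
```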

Merging Branches

When changes look good, merge them back to main:

# Merge branch changes into main
result = client.call_tool("branch_merge", {
    "branch": "test-schema-migration",
    "target": "main",
    "strategy": "branch_wins"  # or "main_wins", "manual", "abort"
})

Merge Strategies

Strategy       Behavior
branch_wins    Branch data overwrites main on conflict
main_wins      Main data preserved on conflict; non-conflicting branch changes applied
manual         Returns conflict details for human resolution
abort          Fails if any conflicts are detected
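The four strategies can be modeled as a small three-way merge over keyed values. This is a sketch under stated assumptions, not HatiData's merge algorithm: a conflict is taken to mean a key that both main and the branch changed, to different values, since the branch was created; `manual` is assumed to apply nothing while conflicts remain; deletions are ignored. `merge` and `MergeConflict` are hypothetical names.

```python
class MergeConflict(Exception):
    """Raised by the 'abort' strategy when any conflict is found."""


def merge(base, main, branch, strategy):
    """Three-way merge of {key: value} maps; returns (merged, conflicts)."""
    if strategy not in {"branch_wins", "main_wins", "manual", "abort"}:
        raise ValueError(f"unknown strategy: {strategy}")
    clean, conflicts = {}, {}
    for key, branch_value in branch.items():
        base_value, main_value = base.get(key), main.get(key)
        if branch_value == base_value or branch_value == main_value:
            continue  # branch did not change the key, or already agrees with main
        if main_value == base_value:
            clean[key] = branch_value  # only the branch changed it
        else:
            conflicts[key] = {"main": main_value, "branch": branch_value}
    if conflicts and strategy == "abort":
        raise MergeConflict(sorted(conflicts))
    if conflicts and strategy == "manual":
        return dict(main), conflicts  # assumption: nothing applied until resolved
    merged = {**main, **clean}
    if strategy == "branch_wins":
        merged.update({k: v["branch"] for k, v in conflicts.items()})
    return merged, conflicts


base = {1: "a", 2: "b", 3: "c"}
main = {1: "a", 2: "B-main", 3: "c"}
branch = {1: "a", 2: "B-branch", 3: "C-branch"}

merged_branch_wins, found = merge(base, main, branch, "branch_wins")
merged_main_wins, _ = merge(base, main, branch, "main_wins")
```

In this example key 2 conflicts and key 3 does not, so both strategies apply the branch's value for key 3 and differ only on key 2.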

Discarding Branches

If the changes are not needed, discard the branch to clean up resources:

# Discard branch and free resources
client.call_tool("branch_discard", {
    "branch": "test-schema-migration"
})

Shadow Mode Testing

Run production queries against multiple branches simultaneously to compare results without affecting production:

# Run queries against both branches simultaneously
result = client.call_tool("shadow_query", {
    "sql": "SELECT COUNT(*) FROM users WHERE active = true",
    "branches": ["main", "test-schema-migration"]
})
# Compare results without affecting production
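The shape of the shadow_query result is not shown above. To illustrate the comparison itself, the sketch below runs one read-only query against two in-memory SQLite databases that stand in for branches. SQLite replaces DuckDB here only so the example runs with the standard library; the `shadow_query` function and its return keys are assumptions, and the branches are queried one after another rather than simultaneously.

```python
import sqlite3


def make_branch(rows):
    """Build an in-memory SQLite database standing in for one branch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, active INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


def shadow_query(sql, branches):
    """Run the same read-only SQL on every branch and compare the results."""
    results = {name: conn.execute(sql).fetchall() for name, conn in branches.items()}
    distinct = {tuple(rows) for rows in results.values()}
    return {"results": results, "match": len(distinct) == 1}


branches = {
    "main": make_branch([(1, 1), (2, 0)]),
    "test-schema-migration": make_branch([(1, 1), (2, 1)]),
}
# SQLite stores booleans as integers, hence "active = 1".
outcome = shadow_query("SELECT COUNT(*) FROM users WHERE active = 1", branches)
```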

DataOps Workflows

Schema Migration Testing

Test schema changes on a branch before applying to production:

  1. Create branch from main
  2. Run ALTER TABLE statements on the branch
  3. Run existing queries to verify they still work
  4. Compare results with main
  5. Merge to main or iterate

Data Backfill Preview

Preview the impact of a data backfill before running it:

  1. Create branch from main
  2. Run INSERT or UPDATE statements on the branch
  3. Verify data integrity with validation queries
  4. Check row counts and data distributions
  5. Merge when satisfied

Query Optimization

Compare query plans and performance across different data layouts:

  1. Create branch from main
  2. Apply indexes, materialized views, or denormalization on the branch
  3. Run benchmark queries on both branches
  4. Compare execution times
  5. Merge the winning approach

A/B Data Testing

Create branches with different data transformations to compare outcomes:

  1. Create branches transform-a and transform-b from main
  2. Apply different transformations to each
  3. Run the same analytical queries on both
  4. Compare results to determine the better approach
  5. Merge the winning branch

Branch Implementation Details

Storage Model

Operation              Implementation
Branch create          New DuckDB schema with views pointing to main tables
First write to table   Table materialized in branch schema (full copy)
Subsequent writes      Writes go to the materialized branch table
Branch discard         Drop the branch schema and all its objects
Branch merge           Copy modified tables back to main schema
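Each row of the table maps naturally onto a few DuckDB statements. The helpers below only build SQL strings and are a guess at what such statements could look like, not the statements HatiData actually issues. The function names are hypothetical, and the sketch assumes the main branch lives in DuckDB's default `main` schema.

```python
def quote(ident):
    """Quote an identifier so names like test-schema-migration are valid SQL."""
    return '"' + ident.replace('"', '""') + '"'


def create_branch_sql(branch, tables):
    """Branch create: a new schema holding one view per main table."""
    stmts = [f"CREATE SCHEMA {quote(branch)}"]
    for table in tables:
        stmts.append(
            f"CREATE VIEW {quote(branch)}.{quote(table)} "
            f"AS SELECT * FROM main.{quote(table)}"
        )
    return stmts


def materialize_sql(branch, table):
    """First write: replace the view with a full copy of the table."""
    target = f"{quote(branch)}.{quote(table)}"
    return [
        f"DROP VIEW {target}",
        f"CREATE TABLE {target} AS SELECT * FROM main.{quote(table)}",
    ]


def merge_sql(branch, modified_tables):
    """Branch merge: copy each modified table back over its main counterpart."""
    return [
        f"CREATE OR REPLACE TABLE main.{quote(t)} "
        f"AS SELECT * FROM {quote(branch)}.{quote(t)}"
        for t in modified_tables
    ]


def discard_branch_sql(branch):
    """Branch discard: drop the schema and everything in it."""
    return [f"DROP SCHEMA {quote(branch)} CASCADE"]


statements = create_branch_sql("test-schema-migration", ["users"])
```

Quoting matters because the branch names used on this page contain hyphens, which are not valid in unquoted identifiers.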

Garbage Collection

Branches have automatic lifecycle management:

  • Reference counting -- Tracks active sessions using each branch
  • TTL expiry -- Branches are automatically discarded after a configurable timeout (default: 24 hours)
  • Periodic cleanup -- Background task runs every 5 minutes to collect expired branches

Concurrency

  • Up to 100 concurrent branches are supported
  • Each branch operates independently with no cross-branch locking
  • Merge operations acquire a brief lock on the target branch
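These three properties can be illustrated with one lock per merge target and no lock on ordinary branch writes. The sketch is a toy model, not HatiData's locking scheme: `BranchManager` is a hypothetical class, branches are plain lists, and treating the 100-branch limit as excluding main is an assumption.

```python
import threading

MAX_BRANCHES = 100  # limit quoted above; excluding main is an assumption


class BranchManager:
    def __init__(self):
        self._registry_lock = threading.Lock()
        self._merge_locks = {"main": threading.Lock()}
        self.rows = {"main": []}

    def create(self, name):
        with self._registry_lock:
            if len(self.rows) - 1 >= MAX_BRANCHES:
                raise RuntimeError("branch limit reached")
            self.rows[name] = []
            self._merge_locks[name] = threading.Lock()

    def write(self, name, row):
        # Writes touch only the branch's own data, so no cross-branch lock.
        self.rows[name].append(row)

    def merge(self, source, target="main"):
        # Only the target is locked, and only for the duration of the copy.
        with self._merge_locks[target]:
            self.rows[target].extend(self.rows[source])


manager = BranchManager()


def agent(name):
    for i in range(10):
        manager.write(name, (name, i))
    manager.merge(name)


names = [f"agent-{n}" for n in range(4)]
for name in names:
    manager.create(name)
threads = [threading.Thread(target=agent, args=(name,)) for name in names]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
merged_rows = len(manager.rows["main"])
```

Each agent writes to its own branch in parallel and contends with the others only during the brief merge into main.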

Configuration

Variable                       Default   Description
BRANCH_ISOLATION_ENABLED       true      Enable branch isolation
HATIDATA_SHADOW_MODE_ENABLED   true      Enable shadow mode queries
HATIDATA_DEV_MODE              true      Skip auth for local development
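A client-side script may want to read the same flags, for example to skip shadow queries when they are disabled. The sketch below applies the defaults from the table. The set of strings accepted as "true" is an assumption, since the proxy's own parsing rules are not documented here, and `load_flags` is a hypothetical helper.

```python
import os

# Defaults taken from the configuration table.
DEFAULTS = {
    "BRANCH_ISOLATION_ENABLED": True,
    "HATIDATA_SHADOW_MODE_ENABLED": True,
    "HATIDATA_DEV_MODE": True,
}

# Assumption: which strings count as true is not specified by the docs.
TRUTHY = {"1", "true", "yes", "on"}


def load_flags(env=None):
    """Read the boolean flags from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    flags = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        flags[name] = default if raw is None else raw.strip().lower() in TRUTHY
    return flags


defaults_only = load_flags({})
dev_mode_off = load_flags({"HATIDATA_DEV_MODE": "false"})
```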

Cleanup

docker compose down -v  # Remove containers and volumes
