# Tracing Artifacts to Prompts
This tutorial walks through a real forensic debugging scenario: an architecture artifact cost $0.50 across 3 attempts. We trace it back to understand what went wrong with the first two attempts and why the third succeeded.
## The Scenario
Your Architect agent produced an api_contract artifact for project proj-abc. The project's cost dashboard shows $0.50 for this single artifact — significantly more than the expected $0.05 for a single DeepSeek V3 call.
Goal: Understand why it took 3 attempts and $0.50 instead of 1 attempt and $0.05.
## Step 1: Find the Artifact
Start from the artifact you're investigating:
```sql
SELECT
    id,
    memory_key,
    state,
    produced_by_agent,
    content_hash,
    created_at
FROM hd_runtime.artifact_instances
WHERE memory_key = 'proj:abc:api_contract'
ORDER BY created_at DESC
LIMIT 1;
```
Result:

```text
id:         art-final-001
state:      verified
agent:      Architect
hash:       sha256:d4e5f6...
created_at: 2026-03-28T10:05:00Z
```
## Step 2: Find All Attempts
The artifact links to a task — find all attempts for that task:
```sql
SELECT
    ta.id AS attempt_id,
    ta.status,
    ta.failure_kind,
    ta.created_at,
    ta.completed_at,
    md.primary_model,
    md.budget_mode,
    li.cost_usd,
    li.latency_ms,
    li.outcome
FROM hd_runtime.task_attempts ta
JOIN hd_runtime.model_decisions md ON md.attempt_id = ta.id
JOIN hd_runtime.llm_invocations li ON li.decision_id = md.id
WHERE ta.task_id = (
    SELECT task_id FROM hd_runtime.task_attempts
    WHERE id = 'att-003'  -- the attempt that produced the artifact
)
ORDER BY ta.created_at;
```
Result:

```text
Attempt 1: att-001 | terminal_failed    | InvalidOutputSchema | deepseek-v3.2-maas | $0.003 | 2340ms
Attempt 2: att-002 | terminal_failed    | InvalidOutputSchema | claude-4-sonnet    | $0.045 | 4100ms
Attempt 3: att-003 | completed_verified | (none)              | deepseek-v3.2-maas | $0.003 | 2100ms
```
Discovery: attempt 1 failed schema validation, the L2 recovery level escalated to Claude Sonnet, which also failed, and L1 then retried DeepSeek, which succeeded. Total: $0.051 across these 3 invocations.

But the dashboard showed $0.50. Let's check for hidden invocations.
## Step 3: Check All Invocations
```sql
SELECT
    li.id,
    li.provider,
    li.model_id,
    li.input_tokens,
    li.output_tokens,
    li.cost_usd,
    li.outcome,
    li.prompt_hash
FROM hd_runtime.llm_invocations li
WHERE li.decision_id IN (
    SELECT id FROM hd_runtime.model_decisions
    WHERE attempt_id IN ('att-001', 'att-002', 'att-003')
)
ORDER BY li.called_at;
```
Result:

```text
inv-001: DeepSeek | 4200 in, 1800 out | $0.003 | success | sha256:aaa...
inv-002: DeepSeek | 4200 in,    0 out | $0.001 | error   | sha256:aaa... (timeout)
inv-003: Claude   | 4200 in, 3200 out | $0.045 | success | sha256:bbb...
inv-004: Claude   | 4200 in, 3400 out | $0.048 | success | sha256:bbb... (schema still wrong)
inv-005: DeepSeek | 6800 in, 1800 out | $0.003 | success | sha256:ccc... (new prompt with error context)
```
Closer to the root cause: there were 5 invocations, not 3. Attempt 1 had two invocations (the first timed out; the second returned output that failed schema validation). Attempt 2 had two invocations (Claude was called twice; both outputs failed validation). Total: $0.003 + $0.001 + $0.045 + $0.048 + $0.003 = $0.100, closer to the dashboard's $0.50 but still short.
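As a quick sanity check, the five per-invocation costs above can be summed directly; nothing here goes beyond the numbers already in the result:

```python
# Per-invocation costs from Step 3, in USD
invocation_costs = [0.003, 0.001, 0.045, 0.048, 0.003]

total = round(sum(invocation_costs), 3)
print(f"Total across {len(invocation_costs)} invocations: ${total}")

# The dashboard figure still isn't fully explained
dashboard_total = 0.50
print(f"Unaccounted for: ${round(dashboard_total - total, 3)}")
```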
## Step 4: Check the Prompt Hash
The prompt hash changed between attempts:
| Attempt | Prompt Hash | Input Tokens |
|---|---|---|
| 1 | sha256:aaa... | 4,200 |
| 2 | sha256:bbb... | 4,200 |
| 3 | sha256:ccc... | 6,800 |
Attempt 3 had 6,800 input tokens vs 4,200 — the RepairAgent injected failure context into the prompt, increasing the input size by 62%.
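The 62% figure comes straight from the token counts in the table:

```python
# Input tokens: attempts 1-2 vs the enriched attempt 3
baseline_in, enriched_in = 4200, 6800

growth = (enriched_in - baseline_in) / baseline_in
print(f"Input grew by {growth:.0%}")  # 62%
```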
## Step 5: Use the ExplainBundle
Instead of manual SQL, get the full picture in one call:
```python
bundle = client.explain(attempt_id="att-003")

# Quick summary
print(f"Attempts: {bundle.totals.attempts}")
print(f"Total cost: ${bundle.totals.cost_usd}")
print(f"Model: {bundle.model_decision.primary_model}")

# Recovery chain
for r in bundle.recovery:
    print(f"  {r.level}: {r.action} — {r.failure_kind}")
```
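If you don't have the client handy, the same rollup can be reproduced from raw invocation rows. This is an illustrative sketch only: the row shape below is modeled on the Step 3 result, not on the actual ExplainBundle schema:

```python
from collections import defaultdict

# (attempt_id, cost_usd, outcome) rows, copied from the Step 3 result
rows = [
    ("att-001", 0.003, "success"),
    ("att-001", 0.001, "error"),
    ("att-002", 0.045, "success"),
    ("att-002", 0.048, "success"),
    ("att-003", 0.003, "success"),
]

# Group invocation costs by attempt
per_attempt = defaultdict(float)
for attempt_id, cost, _ in rows:
    per_attempt[attempt_id] += cost

print(f"Attempts: {len(per_attempt)}")
print(f"Total cost: ${round(sum(per_attempt.values()), 3)}")
```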
## Step 6: The Fix
Now you know:
- DeepSeek V3 fails schema validation for this agent's output format ~50% of the time
- Claude Sonnet also fails, suggesting the contract schema is too strict
- The successful attempt used an enriched prompt with error context
Actions:
- Relax the `api_contract` schema validation (it currently requires an exact OpenAPI 3.1 structure)
- Or update the Architect prompt to be more explicit about the output format
- Monitor via:

```sql
SELECT model_id, success_rate
FROM v_reward_signals
WHERE task_class = 'AuthoritySpec'
  AND agent_type = 'architect';
```
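Conceptually, the `success_rate` in that view is just verified completions over total attempts per model. A minimal sketch of that aggregation (the view's real logic is an assumption here), using the attempt statuses from Step 2:

```python
from collections import Counter

# (model_id, status) pairs as they would come from task_attempts
attempts = [
    ("deepseek-v3.2-maas", "terminal_failed"),
    ("claude-4-sonnet", "terminal_failed"),
    ("deepseek-v3.2-maas", "completed_verified"),
]

totals, wins = Counter(), Counter()
for model, status in attempts:
    totals[model] += 1
    if status == "completed_verified":
        wins[model] += 1

for model in totals:
    print(f"{model}: {wins[model] / totals[model]:.0%} success")
```

With this tiny sample, DeepSeek lands at 50%, matching the "~50% of the time" observation above.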
## Forensic Query Cheat Sheet
| Question | Query |
|---|---|
| "How much did this artifact cost?" | Sum `llm_invocations.cost_usd` for all attempts of the parent task |
| "Why was this model selected?" | Check `model_decisions.routing_reason` and `budget_mode` |
| "What prompt was used?" | Match `llm_invocations.prompt_hash` — same hash = same prompt |
| "What changed between attempts?" | Compare `prompt_hash` and `input_tokens` across attempts |
| "Who approved this?" | Check `artifact_validations.verifier_type` and `status` |
| "Was cache used?" | Check `llm_invocations.cache_status` — a hit saves ~90% on input cost |
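To make the cache row concrete: a 90% discount on cached input tokens shrinks only the input portion of a call's cost. The per-token rates below are placeholders for illustration, not real provider pricing:

```python
def call_cost(input_tokens, output_tokens, in_rate, out_rate, cache_hit=False):
    """Estimate one invocation's cost; a cache hit discounts input by ~90%."""
    in_cost = input_tokens * in_rate * (0.1 if cache_hit else 1.0)
    return in_cost + output_tokens * out_rate

# Placeholder per-token rates in USD, not actual pricing
IN_RATE, OUT_RATE = 3e-6, 15e-6

# Token counts borrowed from inv-003 in Step 3
miss = call_cost(4200, 3200, IN_RATE, OUT_RATE)
hit = call_cost(4200, 3200, IN_RATE, OUT_RATE, cache_hit=True)
print(f"cache miss: ${miss:.4f}, cache hit: ${hit:.4f}")
```

Note the savings are bounded by the output cost, which the cache does not touch.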
## Next Steps
- Anatomy of an ExplainBundle — Full JSON structure reference
- Tasks & Attempts — The lifecycle model
- Optimization & Replay — Use reward signals to prevent these failures