# Tracing Artifacts to Prompts
This tutorial walks through a real forensic debugging scenario: an architecture artifact cost $0.50 across 3 attempts. We trace it back to understand what went wrong with the first two attempts and why the third succeeded.
## The Scenario
Your Architect agent produced an api_contract artifact for project proj-abc. The project's cost dashboard shows $0.50 for this single artifact — significantly more than the expected $0.05 for a single DeepSeek V3 call.
Goal: Understand why it took 3 attempts and $0.50 instead of 1 attempt and $0.05.
## Step 1: Find the Artifact
Start from the artifact you're investigating:
```sql
SELECT
    id,
    memory_key,
    state,
    produced_by_agent,
    content_hash,
    created_at
FROM hd_runtime.artifact_instances
WHERE memory_key = 'proj:abc:api_contract'
ORDER BY created_at DESC
LIMIT 1;
```
Result:

```text
id:         art-final-001
state:      verified
agent:      Architect
hash:       sha256:d4e5f6...
created_at: 2026-03-28T10:05:00Z
```
## Step 2: Find All Attempts
The artifact links to a task — find all attempts for that task:
```sql
SELECT
    ta.id AS attempt_id,
    ta.status,
    ta.failure_kind,
    ta.created_at,
    ta.completed_at,
    md.primary_model,
    md.budget_mode,
    li.cost_usd,
    li.latency_ms,
    li.outcome
FROM hd_runtime.task_attempts ta
JOIN hd_runtime.model_decisions md ON md.attempt_id = ta.id
JOIN hd_runtime.llm_invocations li ON li.decision_id = md.id
WHERE ta.task_id = (
    SELECT task_id FROM hd_runtime.task_attempts
    WHERE id = 'att-003'  -- the attempt that produced the artifact
)
ORDER BY ta.created_at;
```
Result:

```text
Attempt 1: att-001 | terminal_failed    | InvalidOutputSchema | deepseek-v3.2-maas | $0.003 | 2340ms
Attempt 2: att-002 | terminal_failed    | InvalidOutputSchema | claude-4-sonnet    | $0.045 | 4100ms
Attempt 3: att-003 | completed_verified | (none)              | deepseek-v3.2-maas | $0.003 | 2100ms
```
Discovery: attempt 1 failed schema validation, the L2 recovery level escalated to Claude Sonnet, which also failed, and L1 then retried DeepSeek, which succeeded. Total: $0.051 across these 3 invocations.

But the dashboard showed $0.50. Let's check for hidden invocations.
## Step 3: Check All Invocations
```sql
SELECT
    li.id,
    li.provider,
    li.model_id,
    li.input_tokens,
    li.output_tokens,
    li.cost_usd,
    li.outcome,
    li.prompt_hash
FROM hd_runtime.llm_invocations li
WHERE li.decision_id IN (
    SELECT id FROM hd_runtime.model_decisions
    WHERE attempt_id IN ('att-001', 'att-002', 'att-003')
)
ORDER BY li.called_at;
```
Result:

```text
inv-001: DeepSeek | 4200 in, 1800 out | $0.003 | success | sha256:aaa...
inv-002: DeepSeek | 4200 in,    0 out | $0.001 | error   | sha256:aaa... (timeout)
inv-003: Claude   | 4200 in, 3200 out | $0.045 | success | sha256:bbb...
inv-004: Claude   | 4200 in, 3400 out | $0.048 | success | sha256:bbb... (schema still wrong)
inv-005: DeepSeek | 6800 in, 1800 out | $0.003 | success | sha256:ccc... (new prompt with error context)
```
Closer to the root cause: there were 5 invocations, not 3. Attempt 1 had two invocations (the first timed out; the second returned output that failed schema validation). Attempt 2 had two invocations (Claude was called twice; both outputs failed validation). Total: $0.003 + $0.001 + $0.045 + $0.048 + $0.003 = $0.100, closer to the dashboard's $0.50 but still short.
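As a quick sanity check, the five per-invocation costs above can be summed directly; nothing here goes beyond the numbers already in the result:

```python
# Per-invocation costs from Step 3, in USD
invocation_costs = [0.003, 0.001, 0.045, 0.048, 0.003]

total = round(sum(invocation_costs), 3)
print(f"Total across {len(invocation_costs)} invocations: ${total}")

# The dashboard figure still isn't fully explained
dashboard_total = 0.50
print(f"Unaccounted for: ${round(dashboard_total - total, 3)}")
```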
## Step 4: Check the Prompt Hash
The prompt hash changed between attempts:
| Attempt | Prompt Hash | Input Tokens |
|---|---|---|
| 1 | sha256:aaa... | 4,200 |
| 2 | sha256:bbb... | 4,200 |
| 3 | sha256:ccc... | 6,800 |
Attempt 3 had 6,800 input tokens vs 4,200 — the RepairAgent injected failure context into the prompt, increasing the input size by 62%.
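The 62% figure comes straight from the token counts in the table:

```python
# Input tokens: attempts 1-2 vs the enriched attempt 3
baseline_in, enriched_in = 4200, 6800

growth = (enriched_in - baseline_in) / baseline_in
print(f"Input grew by {growth:.0%}")  # 62%
```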
## Step 5: Use the ExplainBundle
Instead of manual SQL, get the full picture in one call:
```python
bundle = client.explain(attempt_id="att-003")

# Quick summary
print(f"Attempts: {bundle.totals.attempts}")
print(f"Total cost: ${bundle.totals.cost_usd}")
print(f"Model: {bundle.model_decision.primary_model}")

# Recovery chain
for r in bundle.recovery:
    print(f"  {r.level}: {r.action} — {r.failure_kind}")
```
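If you don't have the client handy, the same rollup can be reproduced from raw invocation rows. This is an illustrative sketch only: the row shape below is modeled on the Step 3 result, not on the actual ExplainBundle schema:

```python
from collections import defaultdict

# (attempt_id, cost_usd, outcome) rows, copied from the Step 3 result
rows = [
    ("att-001", 0.003, "success"),
    ("att-001", 0.001, "error"),
    ("att-002", 0.045, "success"),
    ("att-002", 0.048, "success"),
    ("att-003", 0.003, "success"),
]

# Group invocation costs by attempt
per_attempt = defaultdict(float)
for attempt_id, cost, _ in rows:
    per_attempt[attempt_id] += cost

print(f"Attempts: {len(per_attempt)}")
print(f"Total cost: ${round(sum(per_attempt.values()), 3)}")
```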
## Step 6: The Fix
Now you know:
- DeepSeek V3 fails schema validation for this agent's output format ~50% of the time
- Claude Sonnet also fails, suggesting the contract schema is too strict
- The successful attempt used an enriched prompt with error context
Actions:
- Relax the `api_contract` schema validation (it currently requires an exact OpenAPI 3.1 structure)
- Or update the Architect prompt to be more explicit about the output format
- Monitor via:

```sql
SELECT model_id, success_rate
FROM v_reward_signals
WHERE task_class = 'AuthoritySpec'
  AND agent_type = 'architect';
```
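Conceptually, the `success_rate` in that view is just verified completions over total attempts per model. A minimal sketch of that aggregation (the view's real logic is an assumption here), using the attempt statuses from Step 2:

```python
from collections import Counter

# (model_id, status) pairs as they would come from task_attempts
attempts = [
    ("deepseek-v3.2-maas", "terminal_failed"),
    ("claude-4-sonnet", "terminal_failed"),
    ("deepseek-v3.2-maas", "completed_verified"),
]

totals, wins = Counter(), Counter()
for model, status in attempts:
    totals[model] += 1
    if status == "completed_verified":
        wins[model] += 1

for model in totals:
    print(f"{model}: {wins[model] / totals[model]:.0%} success")
```

With this tiny sample, DeepSeek lands at 50%, matching the "~50% of the time" observation above.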
## Forensic Query Cheat Sheet
| Question | Query |
|---|---|
| "How much did this artifact cost?" | Sum `llm_invocations.cost_usd` for all attempts of the parent task |
| "Why was this model selected?" | Check `model_decisions.routing_reason` and `budget_mode` |
| "What prompt was used?" | Match `llm_invocations.prompt_hash` — same hash = same prompt |
| "What changed between attempts?" | Compare `prompt_hash` and `input_tokens` across attempts |
| "Who approved this?" | Check `artifact_validations.verifier_type` and `status` |
| "Was cache used?" | Check `llm_invocations.cache_status` — a hit saves ~90% on input cost |
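To make the cache row concrete: a 90% discount on cached input tokens shrinks only the input portion of a call's cost. The per-token rates below are placeholders for illustration, not real provider pricing:

```python
def call_cost(input_tokens, output_tokens, in_rate, out_rate, cache_hit=False):
    """Estimate one invocation's cost; a cache hit discounts input by ~90%."""
    in_cost = input_tokens * in_rate * (0.1 if cache_hit else 1.0)
    return in_cost + output_tokens * out_rate

# Placeholder per-token rates in USD, not actual pricing
IN_RATE, OUT_RATE = 3e-6, 15e-6

# Token counts borrowed from inv-003 in Step 3
miss = call_cost(4200, 3200, IN_RATE, OUT_RATE)
hit = call_cost(4200, 3200, IN_RATE, OUT_RATE, cache_hit=True)
print(f"cache miss: ${miss:.4f}, cache hit: ${hit:.4f}")
```

Note the savings are bounded by the output cost, which the cache does not touch.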
## Next Steps
- Anatomy of an ExplainBundle — Full JSON structure reference
- Tasks & Attempts — The lifecycle model
- Optimization & Replay — Use reward signals to prevent these failures