ScenarioBench: Trace-Grounded Compliance Evaluation for Text-to-SQL and RAG

Abstract

ScenarioBench is a policy-grounded, trace-aware benchmark for evaluating Text-to-SQL and retrieval-augmented generation (RAG) in compliance settings. Each YAML scenario ships with a no-peek gold-standard package (expected decision, minimal witness trace, governing clause set, and canonical SQL), enabling end-to-end scoring of both what a system decides and why it decides so. Systems must justify outputs with clause IDs retrieved from the same policy canon, making explanations falsifiable and audit-ready. The evaluator reports core signals: decision and trace quality, retrieval and SQL correctness (by result-set equivalence), latency, and a hallucination rate. A normalized Scenario Difficulty Index (SDI) and its budgeted variant (SDI-R) aggregate these signals while pricing retrieval difficulty and time. Unlike Spider/BIRD or KILT/RAG-style setups, ScenarioBench enforces clause-ID grounding under a strict no-peek rule. On a seed synthetic suite (N=16), accuracy saturates (Acc = Macro-F1 = 1.000), and a single budgeted reflection closes trace gaps (TrC and Coverage 0.541→1.000) with zero hallucination at roughly +1 ms, showing that marginal gains come from why-quality under explicit time budgets. Artifact & Code (v0.1.0): https://github.com/ShabnamAtf/ScenarioBench/releases/tag/v0.1.0
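The abstract scores SQL correctness by result-set equivalence rather than string match. A minimal sketch of that check, assuming a SQLite backend and an order-insensitive, multiset comparison of rows; the function name and backend are illustrative assumptions, not the benchmark's actual harness:

```python
import sqlite3
from collections import Counter

def result_set_equivalent(db_path: str, gold_sql: str, pred_sql: str) -> bool:
    """Compare two queries by their result sets: same rows with the same
    multiplicities, ignoring row order. Illustrative sketch only."""
    con = sqlite3.connect(db_path)
    try:
        gold = Counter(map(tuple, con.execute(gold_sql).fetchall()))
        pred = Counter(map(tuple, con.execute(pred_sql).fetchall()))
    finally:
        con.close()
    return gold == pred
```

Under this notion of equivalence, a predicted query that returns the canonical rows in a different order still scores as correct, while one returning extra or missing rows does not.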

Publication
In 2025 IEEE International Conference on Collaborative Advances in Software and COmputiNg (CASCON) (pp. 615-620). IEEE.