Example Output: Public Policy Evaluation and Red-Team Harness
Inputs used
- Project context: a city service chatbot pilot for multilingual resident support
- Target audience: policy teams, public sector product leads, civic technologists
- Success metrics: activation, quality, and risk reduction
- Available tools and data: policy library, service dashboard, public comment tracker, translation QA
- Desired depth: Production-ready
- Output tone: Clear operator memo
Generated Result
Evaluation matrix, adversarial cases, grading rubric, and release threshold.
Success criteria
The evaluation is complete when it covers at least 12 golden tasks: 6 normal cases, 3 edge cases, and 3 adversarial cases targeting unequal access. To pass, a result must cite its evidence source and state its confidence.
Golden tasks
Create at least 12 golden tasks: 6 normal cases, 3 edge cases, and 3 adversarial cases targeting opaque decisions. To pass, a result must cite its evidence source and state its confidence.
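The composition rule and the pass condition above can be checked mechanically. The following is a minimal sketch, not a definitive implementation. The names `GoldenTask`, `Result`, `missing_categories`, and `passes`, and the free-text confidence field, are assumptions made for illustration. The check covers only the two stated pass requirements (evidence cited, confidence stated); it does not judge whether the answer is correct.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Minimum counts per category, taken from the text above (6 + 3 + 3 = 12).
REQUIRED_COUNTS = {"normal": 6, "edge": 3, "adversarial": 3}


@dataclass
class GoldenTask:
    task_id: str
    category: str  # "normal", "edge", or "adversarial"
    prompt: str


@dataclass
class Result:
    task_id: str
    answer: str
    evidence_source: Optional[str] = None  # hypothetical field, e.g. "policy library"
    confidence: Optional[str] = None       # hypothetical scale, e.g. "high" / "low"


def missing_categories(tasks: List[GoldenTask]) -> Dict[str, int]:
    """Return how many tasks each category still needs; empty if the set is complete."""
    counts = Counter(task.category for task in tasks)
    return {
        category: needed - counts[category]
        for category, needed in REQUIRED_COUNTS.items()
        if counts[category] < needed
    }


def passes(result: Result) -> bool:
    """A result passes only if it cites an evidence source and states a confidence."""
    has_evidence = bool(result.evidence_source and result.evidence_source.strip())
    has_confidence = bool(result.confidence and result.confidence.strip())
    return has_evidence and has_confidence
```

Running `missing_categories` before each evaluation cycle makes a shortfall visible early, for example a set with only two adversarial cases, instead of discovering it at review time.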
Adversarial tasks
Use service metrics as evidence, apply the constraint "plain-language communication", and explicitly note how the plan reduces procurement lock-in. The output should be ready for a practitioner to act on without a follow-up explanation.
Rubric
Use procurement constraints as evidence, apply the constraint "human appeal path", and explicitly note how the plan reduces policy overreach. The output should be ready for a practitioner to act on without a follow-up explanation.
Sampling plan
Release in three gates: internal dry run, limited pilot, then measured expansion. Each gate must show evidence that public accountability holds in practice, not only in documentation.
Release decision
Release in three gates: internal dry run, limited pilot, then measured expansion. Each gate must show evidence that accessibility holds in practice, not only in documentation.
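The three-gate sequence can be sketched as a simple ordered check. This is a sketch under stated assumptions: the `Evidence` record, its `observed` flag, and the function names are hypothetical. The `observed` flag is one possible way to separate evidence gathered in practice from evidence that exists only in documentation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Gate order as stated in the text above.
GATES = ("internal dry run", "limited pilot", "measured expansion")


@dataclass
class Evidence:
    description: str
    observed: bool  # True if measured in practice; False if documentation only


def gate_cleared(items: List[Evidence]) -> bool:
    """A gate is cleared only when at least one item was observed in practice."""
    return any(item.observed for item in items)


def current_gate(evidence: Dict[str, List[Evidence]]) -> Optional[str]:
    """Return the first gate, in order, that is not yet cleared; None if all are."""
    for gate in GATES:
        if not gate_cleared(evidence.get(gate, [])):
            return gate
    return None
```

Because the gates are checked in order, evidence recorded for a later gate does not let the release skip an earlier one that is still open.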
Recommended Decision
Proceed with a narrow pilot focused on policy text and constituent feedback. Treat unequal access as the primary launch blocker. The first milestone should prove that the workflow produces a usable policy brief, pilot plan, and accountability checklist with clear evidence, named owners, and a review path for ambiguous cases.
Expected quality checks
- The result is specific to AI-assisted policy analysis, public service workflows, community engagement, and accountability plans.
- It includes the required sections: Success criteria, Golden tasks, Adversarial tasks, Rubric, Sampling plan, Release decision.
- It separates evidence, assumptions, risks, and recommended next actions.
- It includes practical verification steps, not only generic advice.
- It names the most important failure mode for this domain: unequal access.
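Two of the checks above, the required sections and the named failure mode, lend themselves to an automated pre-review pass. The sketch below assumes each section heading appears on its own line, as it does in this example. The function names are hypothetical, and the remaining checks still need a human reader.

```python
from typing import List

# Section names and failure mode as listed in the quality checks above.
REQUIRED_SECTIONS = [
    "Success criteria",
    "Golden tasks",
    "Adversarial tasks",
    "Rubric",
    "Sampling plan",
    "Release decision",
]
PRIMARY_FAILURE_MODE = "unequal access"


def missing_sections(output_text: str) -> List[str]:
    """Return required section headings that do not appear on a line of their own."""
    lines = {line.strip() for line in output_text.splitlines()}
    return [section for section in REQUIRED_SECTIONS if section not in lines]


def names_failure_mode(output_text: str) -> bool:
    """Case-insensitive check that the primary failure mode is mentioned."""
    return PRIMARY_FAILURE_MODE in output_text.lower()
```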
Reuse note
Before copying the output into production work, replace all default variables with your real data and have a human review any high-impact decisions.