Why we built Audit Grill-Me, an open-source methodology for auditing AI-enabled workflows
Ask a team in a regulated company to prove how one of their AI-supported decisions was made three months ago. Watch what happens.
They can usually show you the output. Often they can show you the input. Almost never can they show you the prompt that was active at the time, the model version that was running, the retrieval source the system drew from, or the name of the person who reviewed the result before it became final. The decision happened. The evidence to reconstruct it did not survive.
This is the part of AI risk that gets the least attention, and in our experience it is where most of the real exposure sits.
The conversation about AI risk is aimed at the wrong target
When organizations talk about AI risk, they tend to talk about model risk. Is the model accurate? Is it biased? Does it hallucinate? Those questions matter and they belong in any serious assessment. But in regulated environments, the model is rarely where the damage originates.
The damage originates in the workflow around the model. We see the same failure patterns across ERP, CRM, internal knowledge, compliance, and operational processes once AI is introduced into them:
- No clear owner for the AI-enabled step, or ownership split between IT and the business with neither fully accountable
- Prompts that are edited directly in a console with no version history and no change control
- Retrieval sources that were never formally approved, including draft and deprecated documents that quietly influence live decisions
- No record of which model version produced a given output
- Human review that exists on paper but operates as a rubber stamp
- Exceptions that are noticed but never investigated to a root cause
- Logs that cannot reconstruct what the system actually did
- Vendor and third-party AI tools entering workflows without going through the controls that any other system change would face
None of these are model problems. Every one of them is a workflow problem, and every one of them sits on familiar audit territory: change management, segregation of duties, evidence retention, exception handling, access control. The audit profession already knows how to deal with these things. The difficulty is that the technical surface of an AI workflow tends to deflect the ordinary audit instinct, which is simply to say “walk me through it.”
We call the consequence of these failures the blast radius. When an AI workflow goes wrong, the blast radius is rarely the model output itself. It is the inability to explain the decision afterward, the inability to scope how many other decisions were affected, and the inability to produce evidence when a regulator or an audit committee asks for it.
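To make the evidence gap concrete, here is a minimal sketch of the kind of record that would let a team reconstruct a decision later. It is an illustration only, not code from the repository: the DecisionRecord class, its field names, and the missing_evidence helper are all hypothetical, chosen to mirror the items named above (prompt, model version, retrieval source, reviewer).

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    """Evidence captured when an AI-supported decision is made.

    Every field name here is hypothetical and exists only for illustration.
    """
    decision_id: str
    input_text: Optional[str] = None
    output_text: Optional[str] = None
    prompt_version: Optional[str] = None   # e.g. a commit hash or release tag
    model_version: Optional[str] = None    # the exact model identifier in use
    retrieval_source_ids: Optional[tuple[str, ...]] = None  # documents drawn from
    reviewer: Optional[str] = None         # named person who reviewed the result


def missing_evidence(record: DecisionRecord) -> list[str]:
    """Return the names of evidence fields that are absent or empty."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, "", ())
    ]


# The common case described earlier: input and output survive, nothing else does.
record = DecisionRecord(
    decision_id="D-001",
    input_text="refund request",
    output_text="approved",
)
print(missing_evidence(record))
# ['prompt_version', 'model_version', 'retrieval_source_ids', 'reviewer']
```

A check like this does not make a decision reproducible on its own. It only shows, per decision, which pieces of evidence would be missing if someone asked for them three months later.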
What we built
To make that blast radius smaller, we built and open-sourced Audit Grill-Me, a structured walkthrough methodology and Claude Skills package for AI workflow assurance.
The method adapts a pattern that the software engineering community already knows. Matt Pocock’s “grill me” skills interview a developer about a design until every branch of the decision tree is resolved. Audit Grill-Me takes that same idea of relentless, structured questioning and turns it toward a different audience. Instead of interviewing a developer about a design, it interviews a process owner about an AI-enabled workflow, and it does not let a branch close until that branch has a defensible audit disposition.
The structure is deliberately ordinary. An AI workflow is walked across nine branches, in order:
1. Use case and decision impact
2. Governance and accountability
3. Workflow walkthrough and data lineage
4. Risk assessment and control objectives
5. Data, prompt, and retrieval controls
6. Model, vendor, and change management
7. Access, segregation, and operational security
8. Output review, human oversight, and exception handling
9. Logging, monitoring, and reproducibility
Each branch runs through the same sequence. State the audit objective. Ask the anchor question. Capture what the process owner claims. Request the evidence. Assess whether that evidence is sufficient and appropriate. Assess design effectiveness, then operating effectiveness. Assign a disposition. Produce the workpaper entry.
Each branch ends in one of five dispositions: Pass, Design Gap, Operating Issue, Governance Finding, or Scope Limitation and Follow-up. The discipline that holds the whole method together is a single rule:
If management cannot explain it, show it, evidence it, reproduce it, or assign ownership for it, the branch does not close. It becomes a finding, a scope limitation, or a follow-up procedure.
That rule is the point. It is what prevents a confident verbal answer from passing as a control. It is also what makes the output of a walkthrough genuinely inspectable, because every disposition carries a reasoning trail from claim to evidence to conclusion.
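The rule can be sketched in a few lines of Python. This is an interpretation for illustration, not the state machine shipped in the skills: the BranchAssessment class, its boolean fields, and check_disposition are hypothetical names, and the only logic encoded is that a branch cannot be recorded as Pass while any of the five conditions is unmet. Which non-Pass disposition applies remains the auditor's judgment, so the sketch does not try to decide it.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    PASS = "Pass"
    DESIGN_GAP = "Design Gap"
    OPERATING_ISSUE = "Operating Issue"
    GOVERNANCE_FINDING = "Governance Finding"
    SCOPE_LIMITATION = "Scope Limitation and Follow-up"


@dataclass
class BranchAssessment:
    """What management was able to do for one branch (hypothetical structure)."""
    branch: str
    explained: bool
    shown: bool
    evidenced: bool
    reproduced: bool
    owner_assigned: bool

    def unmet_conditions(self) -> list[str]:
        """List the conditions of the closure rule that were not satisfied."""
        checks = {
            "explain": self.explained,
            "show": self.shown,
            "evidence": self.evidenced,
            "reproduce": self.reproduced,
            "assign ownership": self.owner_assigned,
        }
        return [name for name, ok in checks.items() if not ok]


def check_disposition(assessment: BranchAssessment, proposed: Disposition) -> Disposition:
    """Reject a Pass when any condition is unmet; otherwise accept the proposal."""
    unmet = assessment.unmet_conditions()
    if proposed is Disposition.PASS and unmet:
        raise ValueError(
            f"{assessment.branch}: cannot close as Pass, management could not "
            + ", ".join(unmet)
        )
    return proposed


# A confident verbal answer with no evidence and no reproduction behind it.
verbal_only = BranchAssessment(
    branch="Data, prompt, and retrieval controls",
    explained=True, shown=True, evidenced=False,
    reproduced=False, owner_assigned=True,
)
print(verbal_only.unmet_conditions())
# ['evidence', 'reproduce']
```

Calling check_disposition(verbal_only, Disposition.PASS) raises a ValueError, which is the point of the rule: the explanation was given, but the branch cannot close on it.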
How it surfaces what other reviews miss
A short illustration. In one of the worked examples in the repository, a customer-support chatbot retrieves answers from an internal knowledge base. The model is fine. The prompt is version-controlled. Most reviews would stop there and call it clean.
The Branch 5 walkthrough asks a different question: not “is retrieval enabled” but “show me what is actually in the retrieval corpus right now, and how you know.” Running ten real queries and inspecting the retrieved content surfaces a draft refund policy that a product manager uploaded for a pilot and never removed. The chatbot had been approving refunds on the draft policy’s terms. The failure was upstream of the model entirely, in a category of input that most AI risk frameworks describe in the abstract and never push to its operational conclusion.
That is the kind of thing the method is built to catch. Not by being clever, but by refusing to accept “retrieval is governed” without asking what is actually in the index.
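As a rough sketch of that test, assume the retrieval layer can be called with a query and returns documents that carry a lifecycle status. Both assumptions are ours for illustration: RetrievedDoc, the status labels, unapproved_hits, and the stub retriever below are hypothetical, and a real system would need its own adapter to the actual index.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class RetrievedDoc:
    doc_id: str
    status: str  # hypothetical lifecycle label: "approved", "draft", "deprecated"


def unapproved_hits(
    queries: Iterable[str],
    retrieve: Callable[[str], list[RetrievedDoc]],
    approved_status: str = "approved",
) -> dict[str, list[str]]:
    """Run each query and report any retrieved document that is not approved."""
    flagged: dict[str, list[str]] = {}
    for query in queries:
        bad = [d.doc_id for d in retrieve(query) if d.status != approved_status]
        if bad:
            flagged[query] = bad
    return flagged


def stub_retrieve(query: str) -> list[RetrievedDoc]:
    """Stand-in for a real retriever: matches documents by keyword."""
    corpus = {
        "refund": [
            RetrievedDoc("refund-policy-v3", "approved"),
            RetrievedDoc("refund-policy-pilot-draft", "draft"),
        ],
        "shipping": [RetrievedDoc("shipping-policy-v2", "approved")],
    }
    return [doc for key, docs in corpus.items() if key in query.lower() for doc in docs]


sample_queries = ["How do I get a refund?", "What are the shipping times?"]
print(unapproved_hits(sample_queries, stub_retrieve))
# {'How do I get a refund?': ['refund-policy-pilot-draft']}
```

The check depends on realistic queries and on inspecting what actually comes back, not on the configuration saying that retrieval is governed. The flagged document IDs then become the evidence attached to the branch.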
The skills
Audit Grill-Me ships as eight Claude Skills that take a walkthrough through to real audit work product:
- audit-grill-me runs the branch-by-branch interview and applies the state machine
- audit-grill-with-docs challenges a process owner’s claims against AI policy, the risk register, the risk-control matrix, prior findings, and the system and vendor inventories
- to-audit-plan converts a completed walkthrough into scope, objectives, an RCM, a PBC evidence request list, testing procedures, sample logic, findings, and a management action plan
- audit-triage classifies an unresolved branch so that ambiguity does not automatically become a finding
- evidence-request-builder produces a prioritized, owner-assigned evidence request list
- rcm-builder builds the risk-control matrix from branch outputs
- finding-writer writes findings in the standard condition, criteria, cause, effect, recommendation, and management action plan structure
- root-cause-diagnose separates the symptom of an exception from its underlying cause
The methodology is grounded in frameworks auditors already use, including NIST AI RMF and its Generative AI Profile, ISO/IEC 42001, COSO, COBIT, the IIA AI Auditing Framework, the GAO Green Book, and SOC 2. Regulatory mapping is applied after a branch is dispositioned rather than before, so the walkthrough surfaces what is actually wrong before any framework checklist gets a chance to narrow the field of view.
What it is, and what it is not
Audit Grill-Me does not replace professional audit judgment, and it is not a compliance checklist. It does not test controls. It will not produce a conclusion an experienced auditor could not have written by hand. What it does is enforce the discipline of not closing a branch until the disposition is defensible, and turn that discipline into a portable, repeatable process. It makes AI workflow risk inspectable. The judgment stays with the auditor.
Why we are sharing it
We work on software and systems delivery for regulated and enterprise environments, and a growing share of that work involves AI being introduced into processes that already carry real compliance weight. The recurring problem we see is not a shortage of AI ambition. It is the gap between a vague governance concern and a concrete set of controls, evidence, and implementation steps that an audit committee or a regulator would accept.
Audit Grill-Me is one open-source artifact that reflects how we think about that gap. We are publishing it because the methodology is more useful in the open, and because the audit, IT audit, compliance, and AI governance communities will make it sharper than we can alone.
If you work in internal audit, IT audit, compliance, risk, or AI governance, especially in a regulated industry, we would value your review. Use it on a real workflow, pressure-test the branch logic, tell us where a finding rule produces the wrong disposition, or contribute a worked example from your own experience. Credit underwriting, clinical documentation, and HR screening are scenarios we would particularly like to see exercised.
The repository is here: https://github.com/invariantengineering/audit-grill-me
The method is published under the MIT License. It is professional methodology, not regulated audit advice, and the professional judgment of your own engagement always governs.