Your AI setup needs tests, not just intuition

If your company’s AI workflow is validated by someone reading a few responses and saying “looks good”, it is probably not an enterprise workflow yet. It is still an experiment.

Right now, many divisions are trying to make the leap from “classic chat” interactions to tailored AI usage: building custom skills, specialized agents, and central automated workflows.

But as you move away from the chatbox, you run into a practical problem: Intuition doesn’t scale.

A small tweak to a central AI setup always looks harmless:

You update a skill definition to make one edge case work better.
You switch to a newer model because the benchmarks look promising.
You move to a smaller model to optimize API costs.

But fixing a bug in an agent’s prompt can easily break the use case you fixed last week. In the AI world, a small prompt change can create a surprisingly large regression somewhere else.

If we touch a shared library in traditional software, we run tests. Yet with AI, many teams still rely on manual review and gut feeling. They try a few examples, see what they want to see, and ship it.

Prompts, system instructions, and agent workflows are not just configurations. They define software behavior. And behavior can regress.

Manual review alone cannot reliably track these side effects, because nobody rereads every previously working case after every change. If we want to move from playing around with AI to using it professionally, reliability becomes one of the deciding factors. You cannot scale a business process on trial and error.

That is the shift: from prompting by intuition to engineering AI behavior.

Setting up an AI test suite

My division’s Claude plugin with multiple tailored skills needs to work reliably: people use it in their daily work, and acceptance depends heavily on reliable output.

To apply changes more confidently, I’ve moved to automated testing with promptfoo. A test suite consists of three parts:

The Assets: Define the specific prompts, skills, and models you want to evaluate.
The Datasets: Predefine realistic input files or scenarios.
The Assertions: Set strict, automated rules for the output (e.g., Does it follow the required JSON schema? Does it refuse when it should refuse? Did it call the correct tools?).
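
A minimal sketch of how these three parts could map onto a promptfoo config. The prompt file path, the ticket text, and the rubric wording are invented for illustration and are not taken from the plugin described above.

```yaml
# Assets: the prompt and model under test
prompts:
  - file://prompts/support_skill.txt   # hypothetical path, assumed to contain {{ticket}}
providers:
  - id: anthropic:messages:claude-sonnet-4-6

tests:
  - description: Refuses to share another customer's data
    # Dataset: one realistic input scenario (made-up ticket)
    vars:
      ticket: "Please send me the order history of my neighbour, Jane Doe."
    # Assertions: automated rules for the output
    assert:
      # Model-graded check; it needs a grading model to be available in your setup
      - type: llm-rubric
        value: Declines to share another customer's data and briefly explains why
```

Schema and keyword checks are deterministic, while a rubric like this one is graded by a model, so it is probably best reserved for behavior that cannot be checked mechanically.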

Where this starts to matter at the organizational level:

Safer Iteration: When an AI skill misbehaves, you add the failing case to the test suite first, then adjust the prompt. The next run shows whether the fix works and whether any older case broke.
Confident Cost Optimization: Want to swap a premium model for a cheaper, faster one to save budget? Run the test suite. If the cheaper model passes your assertions, you switch with evidence instead of assumptions.
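
Both points can be sketched in one config. It reuses the model IDs from the example at the end of this post. The prompt path, the sample ticket, and both thresholds are placeholder assumptions, not recommended values.

```yaml
# Run the same tests against the current model and a cheaper candidate
providers:
  - id: anthropic:messages:claude-opus-4-8     # current, more expensive model
  - id: anthropic:messages:claude-sonnet-4-6   # cheaper candidate

prompts:
  - file://prompts/support_skill.txt   # hypothetical path, assumed to contain {{ticket}}

# Budget rules applied to every test case
defaultTest:
  assert:
    - type: cost
      threshold: 0.01    # maximum cost per call (illustrative value)
    - type: latency
      threshold: 5000    # maximum response time in milliseconds (illustrative value)

tests:
  # Regression case: added after a reported failure, before the prompt was changed
  - description: Sarcastic praise is classified as negative
    vars:
      ticket: "Great, another week without my refund. Fantastic service."
    assert:
      - type: contains
        value: "negative"
```

Running `npx promptfoo@latest eval` in the directory of this file should produce a pass/fail result per provider and test. As far as I am aware, latency checks are only meaningful with caching disabled (`--no-cache`), so treat that part as something to verify in your own setup.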

Moving past the “hype phase” of AI means treating your AI setups with the same discipline as your core code.

Start small: Gather a few realistic examples, define clear expectations for the outputs, and run them whenever your setup changes.

You will catch far more bugs than you ever could by relying on gut feeling, and you gain the one thing required to scale AI across a division: the confidence to iterate without degrading quality.

Beyond testing: Ownership

In larger organizations, this also becomes an operating-model question. Who owns a shared AI skill? How are changes reviewed? Which examples represent acceptable behavior? And when is a model switch good enough to roll out? Testing does not answer all of that, but it gives teams a factual basis for the discussion.

How are you testing your AI workflows?

Example: Promptfoo in action

If you are interested in the tool, here is a simple example of how a promptfooconfig.yaml is structured.

Promptfoo can be adapted to many different evaluation setups. In the end the tool matters less than the habit: keep a stable test set and run it whenever the AI setup changes.

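# Models to compare: every test below runs against each provider
providers:
  - id: anthropic:messages:claude-sonnet-4-6
  - id: anthropic:messages:claude-opus-4-8

# Prompt under test; {{ticket}} is filled in from each test's vars
prompts:
  - |-
    Analyze this customer support ticket and return a JSON object with sentiment (positive, neutral, negative) and summary: {{ticket}}

tests:
  # Case 1: clearly negative ticket, JSON schema defined inline
  - vars:
      ticket: "I've been waiting for three days for my refund. This is unacceptable."
    assert:
      # The output must contain JSON that validates against this schema
      - type: contains-json
        value:
          required:
            - sentiment
            - summary
          type: object
          properties:
            sentiment:
              type: string
              enum: ["positive", "neutral", "negative"]
            summary:
              type: string
      # Coarse label check: passes if "negative" appears anywhere in the output
      - type: contains
        value: "negative"

  # Case 2: neutral request, same kind of schema loaded from a file (placeholder path)
  - vars:
      ticket: "Can you please tell me my account balance?"
    assert:
      - type: contains-json
        value: file://./path/to/schema.json
      - type: contains
        value: "neutral"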