Start With Boring Workflows

We’ve all watched the demo. An agent takes a vague instruction, plans a project, browses the web, writes the code, updates three systems, and posts a summary to Slack. It’s genuinely impressive. It also sets an expectation that quietly poisons most enterprise AI roadmaps.

Because the demo skips every question that actually decides whether an agent ships: What is it allowed to touch? What can it change? Who’s accountable when it’s wrong? How do you even know if it’s wrong? In a keynote you can wave those away. In a regulated, audited, real company, they are the project.

For most organisations, the right first agent is not the autonomous one that “does my job.” It’s a narrow, slightly dull workflow that already happens every day, has clear inputs and outputs, can be checked in seconds, and fails in ways you can recover from.

Start boring. Boring is how you earn the right to do interesting later.

“Agent” is doing too much work

Part of the problem is the word itself. Sometimes “agent” means a chatbot with tool access. Sometimes it means an automation with a model in the middle. Sometimes it means a system that reasons, plans, calls APIs, and decides its own next move. When one word covers all three, stakeholders hear the most autonomous version and budget for a level of independence the organisation isn’t ready to operate.

Worth saying plainly: this is not me calling enterprise agents overhyped. I think they’ll become a real productivity layer across a lot of knowledge work. But a useful enterprise agent isn’t defined by how autonomous it looks. It’s defined by whether it does one valuable thing reliably, inside clear boundaries. Those are very different design goals, and the demo optimises for the wrong one.

Two words that decide everything: bounded and reviewable

A good first agent use case is bounded and reviewable.

Bounded means the scope is small and explicit. Not “handle customer support” – classify incoming tickets against the taxonomy we already use. Not “manage releases” – draft release notes from the merged pull requests. The narrower the better; you can always widen later, and widening from a working base is a far easier conversation than recovering from a broad agent that misfired.

Reviewable means a human can look at the output and decide in seconds whether it’s right, useful, or safe to ship – and crucially, the agent still saves time even with that human in the loop. If reviewing the agent’s work takes as long as doing the work, you’ve built a toy.

Almost everything else that makes a workflow a good first candidate follows from those two:

a clear input shape
a defined output shape
historical examples of what “good” looks like
recoverable mistakes
some way to measure quality
an escalation path for when the agent isn’t sure

Tick those and the use case will look modest – because it is. It’s taking one repetitive slice of cognitive work and making it faster and more consistent. That modesty is the point.
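As a rough illustration, the checklist above can be written down as a tiny data structure that a team fills in per candidate workflow. This is only a sketch: the class name, the field names, and the example values are all made up here, not part of any tool or standard.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FirstAgentChecklist:
    """One flag per criterion from the list above (field names are illustrative)."""
    clear_input_shape: bool
    defined_output_shape: bool
    historical_examples: bool
    recoverable_mistakes: bool
    measurable_quality: bool
    escalation_path: bool


def missing_criteria(checklist: FirstAgentChecklist) -> list[str]:
    """Return the names of the criteria a candidate workflow does not yet meet."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


# Hypothetical assessment: everything is in place except an escalation path.
release_notes = FirstAgentChecklist(True, True, True, True, True, False)
print(missing_criteria(release_notes))  # ['escalation_path']
```

An empty result is not proof that the workflow is a good candidate, but a non-empty one is a cheap signal that the scope needs more work before anyone builds anything.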

Two boring workflows worth more than they look

I’ll resist the urge to give you the full catalogue. Two examples make the case better than ten, and these are the two most of my audience will recognise from their own week.

Ticket triage. Plenty of teams burn real hours sorting support, IT, or security tickets – category, urgency, affected system, right team. An agent can do the first pass and flag what’s missing from the request. Nobody puts this on a conference slide. But better triage cuts handoff delays and makes workload reporting honest, and it’s about as low-risk a place to start as exists: a misrouted ticket is visible and trivially fixable.
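A minimal sketch of what a bounded first pass could look like, assuming a classifier that returns a label and a confidence score. The taxonomy, the required fields, the threshold, and the keyword-based stand-in for the model call are all hypothetical; the point is only that the agent either proposes a category from the existing taxonomy or escalates.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical taxonomy and required fields; substitute the ones your team already uses.
TAXONOMY = {"access", "hardware", "network", "security"}
REQUIRED_FIELDS = ("affected_system", "description")


@dataclass
class TriageResult:
    category: Optional[str]  # None means "escalate to a human"
    missing_fields: list[str] = field(default_factory=list)
    reason: str = ""


def triage(ticket: dict, classify: Callable[[str], tuple[str, float]],
           min_confidence: float = 0.8) -> TriageResult:
    """First-pass triage: propose a category or escalate, and flag missing information."""
    missing = [name for name in REQUIRED_FIELDS if not ticket.get(name)]
    label, confidence = classify(ticket.get("description") or "")
    if label not in TAXONOMY:
        return TriageResult(None, missing, f"label {label!r} is outside the taxonomy")
    if confidence < min_confidence:
        return TriageResult(None, missing, f"confidence {confidence:.2f} below threshold")
    return TriageResult(label, missing, "ok")


def keyword_classifier(text: str) -> tuple[str, float]:
    """Stand-in for a model call so the sketch runs without any external service."""
    if "vpn" in text.lower():
        return "network", 0.93
    return "unknown", 0.30


result = triage({"description": "VPN drops every hour"}, keyword_classifier)
print(result.category, result.missing_fields)  # network ['affected_system']
```

Anything the classifier cannot place inside the taxonomy, or places with low confidence, comes back with `category=None`, which is the escalation path in code form.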

Release notes. Close to a perfect first agent. The inputs are structured and already exist – merged PRs, issue entries, commit messages, labels. The output follows a standard shape – features, fixes, breaking changes, migration notes. A release manager still owns the final wording, but the blank page is gone and fewer changes slip through. Bounded inputs, reviewable output, recoverable mistakes. Textbook.
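To make the input and output shapes concrete, here is a small sketch of the deterministic scaffolding around such an agent. It assumes merged PRs arrive as dictionaries with `number`, `title`, and `labels` keys, and the label-to-section mapping is invented for illustration. The model step that polishes the wording is deliberately left out.

```python
# Hypothetical label-to-section mapping; the section names follow the shape described above.
SECTIONS = {
    "feature": "Features",
    "bug": "Fixes",
    "breaking": "Breaking changes",
    "migration": "Migration notes",
}
UNSORTED = "Needs review"


def draft_release_notes(merged_prs: list[dict]) -> str:
    """Group merged PRs under standard headings; unlabelled PRs are surfaced, never dropped."""
    grouped: dict[str, list[str]] = {name: [] for name in [*SECTIONS.values(), UNSORTED]}
    for pr in merged_prs:
        line = f"- {pr['title']} (#{pr['number']})"
        sections = [SECTIONS[label] for label in pr.get("labels", []) if label in SECTIONS]
        for section in sections or [UNSORTED]:
            grouped[section].append(line)
    blocks = [f"{name}\n" + "\n".join(lines) for name, lines in grouped.items() if lines]
    return "\n\n".join(blocks)


prs = [
    {"number": 101, "title": "Add CSV export", "labels": ["feature"]},
    {"number": 102, "title": "Fix login timeout", "labels": ["bug"]},
    {"number": 103, "title": "Bump dependencies", "labels": []},
]
print(draft_release_notes(prs))
```

The “Needs review” bucket is the design choice worth keeping: a change the draft cannot categorise is shown to the release manager instead of silently disappearing.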

The pattern under both: the work is real, the friction is real, and almost none of it needs creative judgement every single time. That’s the sweet spot. None of it requires a digital employee. All of it requires careful integration into work people already do.

Boundaries first, autonomy later

Enterprise agents need boundaries before they need independence. A boundary just answers three questions cleanly: what may it access, what may it do, and when must it stop? Most first-generation agents should run in plain “human approves” mode – they prepare the work, show their reasoning or sources, and a person confirms, edits, or rejects. That’s not a limitation to apologise for. It’s how an organisation learns safely, and it’s how the security, legal, and ops teams build real intuition about access, logging, and data handling on a concrete case instead of in the abstract.
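One way to sketch the three boundary questions and the “human approves” mode in code. Every name here (`Boundary`, `Proposal`, `check_boundary`, `apply_review`, the source and action strings) is hypothetical, and a real deployment would enforce access at the platform level as well, not only in application code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"


@dataclass(frozen=True)
class Boundary:
    """Answers the three questions: what may it access, what may it do, when must it stop."""
    readable_sources: frozenset[str]
    allowed_actions: frozenset[str]
    max_actions_per_run: int


@dataclass
class Proposal:
    action: str
    source: str
    draft: str


def check_boundary(boundary: Boundary, proposal: Proposal, actions_so_far: int) -> None:
    """Raise before anything happens if the proposal falls outside the boundary."""
    if proposal.source not in boundary.readable_sources:
        raise PermissionError(f"source not allowed: {proposal.source}")
    if proposal.action not in boundary.allowed_actions:
        raise PermissionError(f"action not allowed: {proposal.action}")
    if actions_so_far >= boundary.max_actions_per_run:
        raise RuntimeError("action budget exhausted; stop and hand over to a human")


def apply_review(proposal: Proposal, decision: Decision,
                 edited: Optional[str] = None) -> Optional[str]:
    """Nothing ships without a human decision; returns the text to publish, or None."""
    if decision is Decision.APPROVE:
        return proposal.draft
    if decision is Decision.EDIT:
        if edited is None:
            raise ValueError("an EDIT decision needs the edited text")
        return edited
    return None


boundary = Boundary(frozenset({"merged_prs"}), frozenset({"draft_release_notes"}), 5)
proposal = Proposal("draft_release_notes", "merged_prs", "Fixes\n- Fix login timeout")
check_boundary(boundary, proposal, actions_so_far=0)
print(apply_review(proposal, Decision.APPROVE))
```

Keeping the boundary as explicit data, separate from the prompt, gives security and ops teams something concrete to read, log, and argue about.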

The part teams skip is evaluation, and it’s the part I’d argue hardest for. “The output looks good” is not a quality gate. You need something you can re-run. Classification agents can be measured against your own labelled history. Summary agents can be checked for completeness and for claims the source doesn’t support. Release-note agents can be diffed against the set of changes that actually shipped.
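Two of those checks are simple enough to sketch in a few lines. The function names and the sample data below are illustrative assumptions; the idea is only that both checks are plain, re-runnable computations, not a matter of opinion.

```python
def accuracy(predicted: list[str], labelled: list[str]) -> float:
    """Share of tickets where the agent's category matches the historical label."""
    if len(predicted) != len(labelled) or not labelled:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == l for p, l in zip(predicted, labelled)) / len(labelled)


def release_note_diff(shipped_prs: set[int], mentioned_prs: set[int]) -> dict[str, set[int]]:
    """Compare the PRs that shipped with the PRs the draft mentions."""
    return {
        "missing": shipped_prs - mentioned_prs,      # shipped but not in the notes
        "unsupported": mentioned_prs - shipped_prs,  # in the notes but never shipped
    }


# Invented sample data: three of four predictions match the labelled history.
print(accuracy(["network", "access", "security", "access"],
               ["network", "access", "hardware", "access"]))  # 0.75
print(release_note_diff({101, 102, 103}, {101, 103, 104}))
# {'missing': {102}, 'unsupported': {104}}
```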

This is the same discipline I’ve written about for AI workflows generally: keep a stable set of realistic examples and run it whenever the prompt, the model, or the data changes. It matters more for agents, not less, because everything underneath them moves. Models update. Prompts get tweaked. Source systems shift. Without a regression check, yesterday’s reliable agent becomes tomorrow’s silent source of errors – and “silent” is the dangerous word. A bounded, well-evaluated agent fails loudly and recoverably.
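A regression check of this kind can be very small. The sketch below assumes a classification agent exposed as a plain function, a hand-written golden set (the three examples are invented), and a baseline accuracy recorded from an earlier run; none of these names come from a specific framework.

```python
from typing import Callable

# A small, stable set of realistic examples (contents here are invented for illustration).
GOLDEN_SET = [
    ("VPN drops every hour", "network"),
    ("Cannot reset my password", "access"),
    ("Laptop fan is very loud", "hardware"),
]


def regression_check(classify: Callable[[str], str], baseline_accuracy: float,
                     tolerance: float = 0.0) -> float:
    """Re-run the golden set and fail loudly if accuracy drops below the baseline."""
    correct = sum(classify(text) == expected for text, expected in GOLDEN_SET)
    score = correct / len(GOLDEN_SET)
    if score < baseline_accuracy - tolerance:
        raise AssertionError(
            f"regression: accuracy {score:.2f} < baseline {baseline_accuracy:.2f}"
        )
    return score


def current_agent(text: str) -> str:
    """Stand-in for the real agent so the sketch runs without a model."""
    lowered = text.lower()
    if "vpn" in lowered:
        return "network"
    if "password" in lowered:
        return "access"
    return "hardware"


print(regression_check(current_agent, baseline_accuracy=1.0))  # 1.0
```

Running this whenever the prompt, the model, or the source data changes turns a silent drift into a failing check that someone has to look at.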

Related article: “Your AI setup needs tests, not just intuition” (sebastianstoehr.de). Moving past basic chat interfaces to custom enterprise AI workflows requires scaling beyond simple intuition. It explores why manual reviews fail to catch critical regressions when updating prompts, models, or agent behaviors, and how to treat your AI setups like core software and build operational confidence through automated testing.

Boring is how trust compounds

Trust in enterprise AI isn’t created by one impressive demo. It’s created by the same narrow workflow working, visibly, the fiftieth time.

Boring agents are easy to observe, which is exactly why they build trust. Their scope is small enough to understand, their output is checkable, their failure modes can be documented, and their value can be measured. And every one you ship teaches the organisation something durable: how to design these workflows, how to evaluate them, how access and approval actually behave in production. Employees stop experiencing AI as a vague threat to their role and start experiencing it as the thing that drafts their release notes. That’s a much easier adoption story than “here is the agent that replaces your function.”

Then, and only then, you widen the scope – more sources, more actions, more automation – because you now have the evidence to justify it. Autonomy should be earned through evidence, not assumed because the technology is exciting.

Starting with boring workflows doesn’t mean thinking small. It means choosing the right first step.

The question that gets you there isn’t “how autonomous can we make this?” It’s “where can we make one real workflow more reliable, faster, and easier to trust?”

Which boring workflow would your team automate first?
