Richard Batt

AI Agents Just Hit the Trough of Disillusionment: Now What?

Tags: AI Agents, Industry Trends

What the Trough Actually Means (Beyond the Hype Cycle Jargon)

Gartner's Hype Cycle isn't just academic chart-pushing. It's a map of technology maturity with real business implications. Every transformative technology follows the same pattern: hype, disillusionment, gradual success, plateau.

Key Takeaways

  • What the Trough Actually Means (Beyond the Hype Cycle Jargon).
  • The Failure Modes Nobody Talks About in Demo Videos, and why to study them before building anything.
  • Why the Demo-to-Production Gap Is Absurdly Wide for Agents and what to do about it.
  • What IS Actually Working Right Now (The Honest Assessment).
  • My Consulting Experience: Where Agent Deployments Actually Succeeded.

AI agents hit the Trough of Disillusionment in early 2026. This means: the demo worked beautifully in the pitch meeting, but the thing won't handle your actual workload.

I watched this happen in real time with three of my clients. One deployed an agent for customer support escalation in December 2025. By February 2026, they'd rolled it back to human-only. The agent made good decisions 78% of the time and customers noticed the 22% that were wrong.

This isn't failure. It's actually the beginning of realism replacing hype. And realism is where smart money builds real capabilities.

The Failure Modes Nobody Talks About in Demo Videos

Research from Anthropic and Carnegie Mellon published in late 2025 laid bare what researchers have been quietly discovering: AI agents make too many errors for high-stakes work. The studies measured error rates across multi-step agentic tasks and found something sobering: current agents fail in catastrophic ways on 15-25% of complex task sequences.

Let me be specific about what I'm seeing in the field. I recently deployed an agent to help a UK manufacturer process purchase orders: categorising them, flagging exceptions, and suggesting approval workflows. In testing, it worked perfectly: 97% accuracy on the test set.

First week in production: 78% accuracy on live orders. Why? The real orders had ambiguities the test set never saw. Suppliers using slightly different SKU formats. Quantities that implied bulk discounts the agent didn't factor in. Currency mixing across geographies. Partial orders, cancellations, special notes in unstructured text.

This is the core problem with AI agents right now:

  • They hallucinate steps in multi-step processes (inventing "solutions" that don't exist, like mentioning supplier discount tiers they never actually looked up)
  • They can't recover gracefully from mistakes (one error cascades into worse errors, like processing a malformed order and then doubling down on the malformation)
  • They don't know when to ask for help (confidence divorced from competence, they confidently process edge cases they're actually terrible at)
  • They fail silently on edge cases, then confidently report success (the worst failure mode)
  • They struggle with context windows and lose track of earlier information in long workflows

The last one is the killer. An agent that crashes is easy to debug. An agent that processes something wrong and looks like it worked? That's a business problem.

Why the Demo-to-Production Gap Is Absurdly Wide for Agents

I built a framework to explain this to non-technical clients. Demos work because they're constrained:

You show an agent handling Scenario A perfectly. Scenario A is the thing you engineered the system to handle. The agent has worked through exactly this task, in exactly this environment, with exactly these edge cases eliminated. The demo usually runs on clean, well-formatted data. The demo has a person standing by to fix problems if they arise.

Production is everything else. Real data is messy. Real customers do things your training data didn't anticipate. Real workflows have exceptions layered on top of exceptions. Real systems run 24/7 with nobody standing by to fix things.

With traditional software, you solve this through exhaustive testing and specification. You document every rule. You test every path. You lock things down. You have guardrails everywhere.

With AI agents, you can't document every rule because the system is too flexible. You can't test every path because the path space is infinite. You can't lock things down because the whole point is flexibility. You can't have guardrails everywhere because that defeats the purpose of using an agent.

This is why the gap exists, and why it's not closing as fast as people expected. The fundamental architecture of agents (flexible, adaptive, reasoning-based) is inherently at odds with the bulletproofing you need for production systems.

What IS Actually Working Right Now (The Honest Assessment)

Don't misread me. I'm not saying AI agents are broken. I'm saying expectations need recalibration. There are legitimate use cases where agents add real value.

What's working:

Narrow-scope agents: I deployed an agent for one UK insurance company to help triage customer complaints. It had one job: read each complaint and assign a category tag, nothing else. No multi-step workflows. No decision gates. Just categorisation. Accuracy held up at 91% in production. The agent was useful immediately.

Human-in-the-loop design: I worked with a publisher to build an agent that suggested content improvements. The agent never published anything directly. It suggested changes, humans reviewed them, humans approved. Over 6 months, accuracy on suggestions improved to 89%, and the human reviewers learned to trust it for certain change types. That's how you deploy AI agents successfully right now.

Specific use cases with low error costs: I helped a recruitment firm automate initial CV screening. Missing a few good CVs was acceptable. Incorrectly approving unqualified ones was not. We built the agent to be conservative: it approved only when confident and always flagged borderline cases for human review. It cut human screening time in half without adding risk.

Information retrieval and assembly agents: I worked with a financial services firm where the agent's job was to gather information (pull relevant compliance documents, find regulatory precedents, assemble research). The agent didn't make decisions. It gathered and presented. Humans made decisions. This pattern works because the cost of imperfect information gathering is low.

Pattern: narrow scope, humans in the loop, low error costs, specific domain expertise baked into the system design.

My Consulting Experience: Where Agent Deployments Actually Succeeded

I've deployed roughly 15 AI agent systems in the last 18 months. Seven worked well enough that clients kept them running and expanded them. Eight either got rolled back or stayed narrowly scoped. That's less than a 50% success rate, and the ones that succeeded all followed a pattern.

The seven that worked:

1. Internal knowledge bot for a financial advisory firm (answering employee questions about company policy): humans still answered some queries, but the agent handled 40% without escalation.

2. Data quality flagging for a UK retailer (identifying obviously wrong product records): worked well because the error pattern was consistent and human review was built in.

3. Scheduling assistant for a consulting firm (suggesting meeting times based on calendar availability): very narrow scope, zero customer-facing impact if wrong, a perfect use case.

4. Email sorting system for a customer success team (categorising incoming mail and routing it to the right department): again, humans could override, and errors were cheap.

5. Content tagging for a publishing client (labelling articles with topic tags): fast, accurate enough, easy for humans to correct, and easy to improve over time.

6. Initial triage agent for IT support (categorising tickets and suggesting a category, nothing else): worked immediately.

7. Anomaly flagging for a logistics company (identifying suspicious shipments): flagged items for human investigation and never made autonomous decisions.

What they had in common: narrow scope (typically one decision or categorisation), human review in the loop, domains where the error pattern was learnable, zero catastrophic failure modes.

The eight that didn't work? They were trying to replicate complex human workflows autonomously. Multi-step decisions. Judgment calls. Scenarios the agent had never seen. No human safety net. Multi-stakeholder workflows where errors affected multiple parties.

Practical Advice for Companies Considering AI Agents in 2026

If you're thinking about deploying an AI agent, ask yourself honestly:

First: Is this narrow enough? Can I describe the agent's job in one sentence? If the sentence is more than 15 words, it's probably too broad.

Second: What happens if the agent is wrong? If the answer is "we lose money" or "we violate regulations" or "customers get hurt", you need a human in the loop. Full stop.

Third: Can I verify the result? After the agent acts, can I (or my customer, or my system) check whether it worked? If yes, you've got a chance. If no, you need a rethink.

Fourth: Is this better than the baseline? Not "better than the dream of AI agents solving everything." Better than current reality. If automation handles 60% of cases and humans need to touch the other 40%, is that still a win? Only you can answer that.

Fifth: Do I have domain expertise embedded? The best agent systems I've built involved subject-matter experts helping design the agent's decision logic. Generic agents fail. Domain-aware agents work.

I helped a client work through this framework for automating vendor approval. Scope: narrow (approve/reject based on existing vendor criteria). What happens if wrong: financial and compliance risk. Can we verify? Yes, every approved vendor has a 30-day review. Better than baseline? Currently humans handle 200 approvals/month, spending 20 hours. Agent could cut that to 5 hours with humans reviewing flagged cases. Clear win.

They deployed it. It works. Not because agents are magical. Because the scope and constraints were realistic.

Why the Trough Is Actually the Best Time to Invest

Here's the contrarian take that I genuinely believe: the trough is when you start building real advantages.

During the hype phase, everyone's exploring. Vendors are overselling. Expectations are ridiculous. Every company trying agents is trying the same things, simultaneously failing at the same problems. You're fighting noise and hype.

The trough is when three things happen:

First, expectations finally match reality. You stop waiting for agents that can handle anything and start building agents that handle one thing well. You stop asking "can we automate this entire process?" and start asking "what's the one bottleneck we can remove?"

Second, the technology actually improves. Anthropic, OpenAI, and others are studying failure modes and shipping better models. Error rates are dropping. Recovery capabilities are improving. Better tooling for building agents is emerging.

Third, competitive advantage emerges. Early movers who figure out "what agents are actually good for right now" will have built muscle memory, lessons learned, and working systems by the time the technology genuinely matures. You'll be two years ahead of companies still waiting for perfection.

I'm advising clients right now: start small. Pick one narrow use case. Get it working. Learn what AI agents are actually good for in your business. The companies that do this in 2026 will have a two-year head start over the ones waiting for the perfect agent technology.

The trough isn't the end of the story. It's the beginning of the real story.

How to Pick Your First Agent: The Right Use Case

If you're going to deploy an agent in 2026, picking the right use case is everything. Most teams pick wrong and burn credibility. Here's how to get it right.

The agent use-case selection matrix:

Scope: Can it be described in one sentence? Narrow beats broad.

Cost of error: What happens if the agent fails? Low cost (user wasted 10 minutes) beats high cost (company lost £100,000).

Verifiability: Can you check if the agent worked? Easy checking beats hard checking.

Baseline comparison: Is this better than what you're doing now? Faster, cheaper, more consistent? If not, why build the agent?

Domain expertise: Does anyone on your team understand this domain deeply? If yes, you can tune the agent. If no, it will fail.

I helped a logistics company pick their first agent. They wanted to automate their entire order-to-delivery workflow. Way too broad. I suggested they pick just the initial order classification step (route by destination, by package size, by urgency). Single sentence. Low error cost (humans review, agent just suggests). Easy to verify (does the suggestion match what the human chose?). Better than baseline (cuts human work 40%). Domain expertise high (they've been doing this for 20 years).

That agent works. They're now expanding to other narrow steps in the workflow. Building incrementally, learning constantly.

The Error Modes Framework: Know What Failures Actually Look Like

Before you deploy an agent, you need to understand specifically how it will fail so you can guard against those failures.

Hallucination failures: The agent invents information that doesn't exist. A scheduling agent proposes a meeting in a room that doesn't exist; a procurement agent orders from a supplier the company doesn't work with. Build guardrails: agents should only choose from predefined lists, not invent options.
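
A minimal sketch of that guardrail in Python, with a hypothetical room list and booking scenario:

```python
# Minimal sketch of the "choose only from a predefined list" guardrail.
# The room list and the agent's suggestion are hypothetical.

ALLOWED_ROOMS = {"Boardroom", "Meeting Room 1", "Meeting Room 2"}

def validate_room(agent_suggestion: str) -> str:
    """Pass the agent's choice through only if it actually exists."""
    if agent_suggestion in ALLOWED_ROOMS:
        return agent_suggestion
    # A hallucinated option never reaches the booking system.
    raise ValueError(f"Agent proposed unknown room: {agent_suggestion!r}")
```

The same shape works for suppliers, categories, or any other closed set: the agent ranks options, but a plain validation layer owns the list.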

Cascade failures: One error leads to worse errors downstream. A categorisation error routes work to the wrong team, which tries to process malformed data, which creates corrupted records. Build isolation: make each agent decision independently verifiable rather than a link in a dependency chain.
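
Here's one way to sketch that isolation in Python: verify each stage's output before the next stage consumes it, so one bad categorisation can't cascade. The team names and order format are hypothetical.

```python
# Sketch of step isolation: a verification gate sits between the agent's
# categorisation and the routing step. Names are illustrative.

VALID_TEAMS = {"billing", "support", "logistics"}

def categorise(order: dict) -> str:
    # Stand-in for the agent's categorisation call.
    return order.get("suggested_team", "unknown")

def route(order: dict) -> str:
    team = categorise(order)
    # The chain stops here if the output is bad, not three systems
    # downstream with corrupted records.
    if team not in VALID_TEAMS:
        return "human_review"
    return team
```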

Silent failures: The agent processes something wrong, reports success, and you don't notice until it's too late. A content moderation agent approves something it should reject, and the report says "moderation complete". Build monitoring: log everything, check samples regularly, set alerts for unexpected patterns.
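
That monitoring pattern can be sketched in a few lines of Python. The 5% sample rate and the expected-rate band are illustrative assumptions; real thresholds come from your own baseline data.

```python
# Sketch of the monitoring pattern: log every decision, sample a fraction
# for human spot-checks, and alert when the approval rate drifts.

import random

decision_log: list[dict] = []

def record(decision_id: str, outcome: str) -> None:
    decision_log.append({"id": decision_id, "outcome": outcome})

def sample_for_review(rate: float = 0.05) -> list[dict]:
    """Pick roughly 5% of logged decisions for a human to re-check."""
    return [d for d in decision_log if random.random() < rate]

def approval_rate_alert(expected: float = 0.7, tolerance: float = 0.15) -> bool:
    """True when the share of approvals drifts outside the expected band."""
    if not decision_log:
        return False
    approved = sum(d["outcome"] == "approved" for d in decision_log)
    return abs(approved / len(decision_log) - expected) > tolerance
```

The drift alert is what catches silent failures: an agent that suddenly approves everything looks "successful" in its own reports but trips the monitor.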

Confidence without competence: The agent is confident about things it doesn't understand. It recommends something it's completely wrong about and sounds authoritative. Build human fallback: for novel scenarios, escalate rather than decide.

Understanding these patterns lets you design around them. The best agent systems I've seen don't try to eliminate all errors. They make errors visible, containable, and recoverable.

The Metrics That Actually Matter

When you deploy an agent, what do you measure? Most teams measure accuracy. That's not enough.

Accuracy: Did the agent make the right decision? Important, but not sufficient.

Coverage: What percentage of cases can the agent handle autonomously? 60% coverage is useful. 100% coverage is suspicious (you're probably missing edge cases).

Precision vs recall trade-off: For screening tasks, precision (low false positives) and recall (low false negatives) matter differently. A resume screener missing good candidates (recall) is different from wrongly approving unqualified candidates (precision). Know which matters for your use case.

Human escalation rate: What percentage of cases require human review? If it's 0%, you're missing edge cases. If it's 80%, the agent isn't adding much value.

Human override rate: When humans do review, what percentage do they override the agent? High override rate suggests the agent isn't calibrated correctly.

Time-to-decision: Is the agent actually faster than the baseline? If it's slower because you're adding human review, the calculus changes.

Cost per decision: Total cost (agent cost + human review cost) versus baseline. This is the number that matters to finance.
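
Most of these metrics fall out of a single decision log. A hedged Python sketch; the field names and the per-decision costs are illustrative assumptions, not real pricing:

```python
# Compute accuracy, coverage, escalation rate, override rate, and cost
# per decision from a list of decision records. All figures illustrative.

def agent_metrics(decisions: list[dict],
                  agent_cost: float = 0.05,    # assumed cost per agent call
                  review_cost: float = 2.00) -> dict:  # assumed cost per human review
    n = len(decisions)
    escalated = [d for d in decisions if d["escalated"]]
    overridden = [d for d in escalated if d["overridden"]]
    total_cost = n * agent_cost + len(escalated) * review_cost
    return {
        "accuracy": sum(d["correct"] for d in decisions) / n,
        "coverage": 1 - len(escalated) / n,   # handled autonomously
        "escalation_rate": len(escalated) / n,
        "override_rate": len(overridden) / max(len(escalated), 1),
        "cost_per_decision": total_cost / n,
    }
```

Note that cost per decision blends agent and human costs, which is why a high escalation rate quietly erodes the business case even when accuracy looks fine.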

One client I worked with tracked all seven metrics. Their agent had 87% accuracy, but accuracy wasn't the binding constraint. What mattered: it handled 65% of cases autonomously, cost 40% less than human-only, and had 8% override rate (meaning humans trusted it). That's a successful agent.

The Honest Assessment: Most Companies Will Fail With Agents in 2026

This is the real talk. Most teams deploying agents in 2026 will end up with something that doesn't work and gets rolled back. The trough exists for a reason. Agents are hard.

But the companies that succeed will be the ones that approached it realistically: pick narrow scope, expect humans in the loop, measure relentlessly, improve incrementally, don't oversell internally.

Those teams will have a two-year head start over companies waiting for "better" agents. And that head start becomes competitive advantage.

Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.

Frequently Asked Questions

How long does it take to build AI automation in a small business?

Most single-process automations take 1-5 days to build and start delivering ROI within 30-90 days. Complex multi-system integrations take 2-8 weeks. The key is starting with one well-defined process, proving the value, then expanding.

Do I need technical skills to automate business processes?

Not for most automations. Tools like Zapier, Make.com, and N8N use visual builders that require no coding. About 80% of small business automation can be done without a developer. For the remaining 20%, you need someone comfortable with APIs and basic scripting.

Where should a business start with AI implementation?

Start with a process audit. Identify tasks that are high-volume, rule-based, and time-consuming. The best first automation is one that saves measurable time within 30 days. Across 120+ projects, the highest-ROI starting points are usually customer onboarding, invoice processing, and report generation.

How do I calculate ROI on an AI investment?

Measure the hours spent on the process before automation, multiply by fully loaded hourly cost, then subtract the tool cost. Most small business automations cost £50-500/month and save 5-20 hours per week. That typically means 300-1000% ROI in year one.
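
As a worked example of that arithmetic, using illustrative figures from the middle of the quoted ranges (10 hours/week saved, £40/hour fully loaded cost, £200/month tool cost):

```python
# Worked ROI example. All numbers are illustrative, not client data.

hours_saved_per_week = 10
hourly_cost = 40             # £, fully loaded
tool_cost_per_month = 200    # £

monthly_saving = hours_saved_per_week * 52 / 12 * hourly_cost
annual_net = (monthly_saving - tool_cost_per_month) * 12
annual_tool_cost = tool_cost_per_month * 12
roi_percent = annual_net / annual_tool_cost * 100   # ≈ 767% in year one
```

That lands inside the 300-1000% range quoted above; the result is most sensitive to the hours-saved estimate, so measure that before automating rather than guessing after.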

Which AI tools are best for business use in 2026?

It depends on the use case. For content and communication, Claude and ChatGPT lead. For data analysis, Gemini and GPT work well with spreadsheets. For automation, Zapier, Make.com, and N8N connect AI to your existing tools. The best tool is the one your team will actually use and maintain.

Put This Into Practice

I use versions of these approaches with my clients every week. The full templates, prompts, and implementation guides, covering the edge cases and variations you will hit in practice, are available inside the AI Ops Vault. It is your AI department for $97/month.

Want a personalised implementation plan first? Book your AI Roadmap session and I will map the fastest path from where you are now to working AI automation.
