---
title: I tested 8 AI chatbots with 8 UK small businesses. Three worked. Five wasted their money.
description: "Eight chatbot platforms, eight real UK SMBs, three workflows each, four weeks of running. Three of the eight produced measurable wins on tier-one support deflection, lead qualification, and FAQ handling. Five did not. Here's the table of which platform fit which workflow, what the deployment actually looked like, and the four questions that would have saved every losing client about £4,000 in licence fees."
canonical: https://richardbatt.com/blog/ai-chatbots-uk-small-business-8-tested
date: 2026-05-05
author: Richard Batt
tags: [AI Chatbots, Customer Service, SMB AI, Tool Reviews]
type: blog_post
---

# I tested 8 AI chatbots with 8 UK small businesses. Three worked. Five wasted their money.

_Eight chatbot platforms, eight real UK SMBs, three workflows each, four weeks of running. Three of the eight produced measurable wins on tier-one support deflection, lead qualification, and FAQ handling. Five did not. Here's the table of which platform fit which workflow, what the deployment actually looked like, and the four questions that would have saved every losing client about £4,000 in licence fees._

**Richard Batt** — AI implementation specialist. 120+ projects across 15+ industries, serving SMBs (5-200 employees) from Middlesbrough, UK, working globally. Contact: richard@richardbatt.com · https://richardbatt.com

Search "best AI chatbot for small business UK" and Google's AI Overview hands you the same five names every time: Tidio, Freshchat, Worktual, Marlie, plus a passing mention of Intercom. The list is not wrong. But it is not useful either, because none of those five is the right answer for every UK small or mid-sized business (SMB), and most of the listicles miss the question that actually matters.

Here is the question. Will the chatbot deflect tier-one tickets, qualify inbound leads, or handle FAQ traffic without making your customers angry enough to phone, email, and Trustpilot you in the same afternoon? That is the test. Anything else is a feature comparison.

After 120+ AI projects across 15+ industries, I ran the test properly. Eight chatbot platforms. Eight real UK SMBs across e-commerce, professional services, lettings, hospitality, B2B SaaS, training, healthcare, and a regional charity. Four weeks per pilot. Three workflows per platform: tier-one support deflection, inbound lead qualification, and FAQ handling. Same scoring grid for every one. Three platforms produced measurable wins. Five did not, and the failure modes were instructive.

**The short version**

- The three winners: Intercom Fin (B2B SaaS, qualification), ChatGPT Custom GPT (professional services, FAQ), Tidio (small e-commerce, deflection).
- Five lost money on the pilot: Freshchat Freddy, Drift, Manychat, Worktual, and Marlie. None of them is a bad product in general; each was the wrong fit for the business I tested it with.
- Cost-per-deflected-ticket ranged from £0.18 (winners) to £4.40 (losers).
- Time-to-deploy ranged from 3 days (Tidio out-of-the-box) to 7 weeks (Drift, mostly because the integration with the client's CRM kept failing).
- The single biggest predictor of success was not the platform. It was whether the SMB had a clean FAQ document going in.

## How the test ran

Every platform was given the same brief, the same data set, and the same human reviewer. The three workflows were:

1. Tier-one support deflection: can the bot resolve a routine support query (returns, opening hours, order status, simple "how do I" questions) without escalating to a human?
2. Lead qualification: can the bot ask 4 to 6 qualifying questions on a website inbound and route the lead to the right salesperson with the answers attached?
3. FAQ handling: can the bot answer a known list of 40 to 60 frequently asked questions accurately, with no hallucinations?

The scoring grid measured four things over the four-week pilot: deflection rate (percentage of conversations the bot closed without human pickup), customer satisfaction score (CSAT, captured via post-conversation 1-to-5 rating), false-resolution rate (percentage of "resolved" conversations that resulted in the same customer raising the same query within 7 days), and total cost per deflected ticket.
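
For concreteness, here is the shape of the weekly calculation. A minimal Python sketch with invented figures, not any client's data; the field names are mine, and one judgement call worth flagging is that I only count a deflection toward the cost figure if it did not bounce back within 7 days.

```python
from dataclasses import dataclass

@dataclass
class PilotWeek:
    conversations: int       # total bot conversations that week
    bot_closed: int          # closed with no human pickup
    repeat_within_7d: int    # "resolved", but the same customer raised the same query within 7 days
    csat_scores: list[int]   # post-conversation 1-to-5 ratings (not every customer responds)

def pilot_metrics(weeks: list[PilotWeek], licence_cost: float, setup_cost: float) -> dict:
    conversations = sum(w.conversations for w in weeks)
    deflected = sum(w.bot_closed for w in weeks)
    repeats = sum(w.repeat_within_7d for w in weeks)
    ratings = [s for w in weeks for s in w.csat_scores]
    return {
        "deflection_rate": deflected / conversations,
        "csat": sum(ratings) / len(ratings),
        # false resolutions as a share of what the bot claimed to close
        "false_resolution_rate": repeats / deflected,
        # total pilot cost spread over the deflections that stuck
        "cost_per_deflected": (licence_cost + setup_cost) / (deflected - repeats),
    }

# Illustrative four-week pilot; every number is invented for the example
weeks = [PilotWeek(90, 64, 4, [4, 5, 4, 3, 5])] * 4
print(pilot_metrics(weeks, licence_cost=29.0, setup_cost=0.0))
```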

Eight businesses, eight platforms, run in parallel. No two SMBs ran the same platform, so this is not a head-to-head platform comparison. It is a fit comparison: which platform fit which kind of business.

## The three that worked

### Intercom Fin, at a 40-person B2B SaaS company

The SaaS company was handling 280 inbound demo requests a month. Their existing process routed every form fill to a sales development rep (SDR) who manually scored each one and assigned it to a salesperson. Average response time was 4 hours. Conversion-to-meeting was 22%.

Intercom Fin replaced the form. Visitors got a chat interface. The bot asked six qualifying questions, captured the answers, and routed the lead to the right SDR with the conversation transcript attached. Average response time dropped to 90 seconds for high-fit leads, because Fin pinged the assigned SDR in Slack in real time. Conversion-to-meeting went from 22% to 31% over the pilot.
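
Fin's qualification and routing live in Intercom's workflow builder rather than in code, but the decision logic we configured was simple enough to write down. Here is a Python sketch of that logic; the question keys, fit signals, routing table, and Slack webhook URL are all hypothetical stand-ins, not Fin's actual configuration or API.

```python
import requests  # used only for the hypothetical Slack webhook below

# Hypothetical answers to the six qualifying questions
answers = {
    "team_size": 45,
    "current_tool": "spreadsheets",
    "timeline": "this_quarter",
    "budget_confirmed": True,
    "role": "head_of_ops",
    "region": "uk_north",
}

def fit_score(a: dict) -> int:
    """Crude ICP fit: one point per signal that matches the profile."""
    return (
        (a["team_size"] >= 20)
        + (a["current_tool"] != "competitor_x")
        + (a["timeline"] in ("this_quarter", "next_quarter"))
        + a["budget_confirmed"]
        + (a["role"] in ("head_of_ops", "coo", "founder"))
    )

SDR_BY_REGION = {"uk_north": "alice", "uk_south": "ben"}  # hypothetical routing table

def route(a: dict, transcript: str) -> None:
    sdr = SDR_BY_REGION.get(a["region"], "round_robin")
    if fit_score(a) >= 4:
        # High-fit leads ping the SDR's Slack immediately; the URL is a placeholder
        requests.post(
            "https://hooks.slack.com/services/EXAMPLE",
            json={"text": f"High-fit lead for {sdr}:\n{transcript}"},
        )
    # Everything else drops into the normal CRM queue (not shown)
```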

Cost was £540 a month for the seat, plus £400 of one-off implementation work to wire the routing rules. Cost-per-deflected qualifying conversation: roughly £0.42. Net win: roughly 8 extra meetings a month at an average deal size of £14,000.

What made it work: the SaaS had a tight ideal customer profile (ICP) document and clean Salesforce data. Fin had something to ground its scoring in.

### ChatGPT Custom GPT, at a 22-person UK professional services firm

The firm was a UK accountancy practice. They had a 56-page client handbook covering tax deadlines, invoice procedures, fee structure, and onboarding steps. New clients asked the same 30 to 40 questions every onboarding cycle. The senior accountant was spending 6 hours a week on these conversations.

We built a Custom GPT grounded in the handbook, the firm's privacy policy, and a year of anonymised client correspondence (a Custom GPT is not retrained; it answers from the documents and instructions you attach). We deployed it as a private link given to new clients during onboarding, with a clear note: "this is an AI assistant, here is what it can answer, here is when to email Karen instead."
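
The prompt writing was most of the one-off work (costed below). The firm's actual instructions stay theirs, but the skeleton is reproducible; here it is as a Python constant with every specific replaced by a placeholder.

```python
# Skeleton of the Custom GPT instructions. Bracketed values are placeholders,
# not the firm's real prompt.
SYSTEM_PROMPT = """
You are the onboarding assistant for [FIRM NAME], a UK accountancy practice.

Answer ONLY from the attached client handbook, privacy policy, and approved
correspondence examples. If the answer is not in those documents, say so and
direct the client to email [STAFF NAME] at [EMAIL].

Rules:
- Quote deadlines and fees exactly as written; never estimate them.
- If a question needs advice specific to the client's tax situation,
  do not answer it; escalate to a human.
- Say in your first reply that you are an AI assistant.
- Keep answers under 150 words unless the client asks for more detail.
"""
```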

Over four weeks, the GPT answered 87% of incoming questions accurately. The senior accountant's time on onboarding queries dropped from 6 hours a week to 50 minutes. CSAT was 4.4 out of 5 (clients particularly liked the speed of response). False-resolution rate was 4%, mostly cases where the GPT answered the wrong version of a question that had a recent rule change.

Cost was £20 a month per ChatGPT Team seat, plus £900 of one-off work to write and tune the system prompt. Cost-per-deflected conversation: roughly £0.18.

What made it work: the firm had genuine documentation. The handbook existed. The GPT had something real to reason from.

### Tidio, at a 6-person UK e-commerce homeware brand

Small e-commerce. £180k revenue. 12 to 18 inbound support queries a day, mostly "where is my order," "how do I return this," and "is this in stock in size L." The founder was answering all of them herself between 8pm and 11pm because the day was for everything else.

Tidio's out-of-the-box AI bot, with the order-status integration switched on and the returns FAQ pre-loaded, deflected 71% of the queries. The founder's evening support load dropped from roughly 90 minutes to 25 minutes. CSAT was 4.1 out of 5.

Cost was £29 a month for the relevant Tidio plan, plus 3 days of setup time (mostly the founder writing FAQ answers in the bot's training console). Cost-per-deflected ticket: roughly £0.21.

What made it work: a small business with a tightly bounded set of customer questions and a simple Shopify integration. Tidio was built for that shape.

## The five that did not work

### Freshchat Freddy, at a 14-person UK lettings agency

The agency wanted Freddy to qualify maintenance enquiries from tenants and route to the right contractor. The problem: maintenance enquiries are messy. Tenants describe the same problem 30 different ways ("the boiler isn't working," "no hot water," "pressure is at zero," "the orange light is flashing"). Freddy's intent recognition couldn't reliably tell which contractor category the issue fell into. False-resolution rate hit 31% inside the first fortnight.

The agency manager pulled the bot at week three after a tenant got routed to a plumber for what turned out to be an electrical fault. Cost-per-deflected ticket worked out at £4.40 once you factor in the human time spent unpicking the wrong routings. Net result: dropped, returned to manual triage.

The lesson: chatbots are not yet good at messy domains where the customer's words are ambiguous and the cost of a wrong answer is real.

### Drift, at an 18-person UK industrial supplier

Drift was supposed to qualify B2B leads on the supplier's website. The implementation took 7 weeks, mostly because the integration with their on-premises CRM kept failing. By the time it was live, the founder had lost confidence and was no longer pushing the team to use it. Lead capture during the pilot was 9 leads in four weeks. The previous process (a contact form) was capturing 14 a month.

Drift is a strong product for the right buyer. The right buyer has a clean cloud CRM, a marketing-ops person who owns the implementation, and a sales team that already runs a structured qualification motion. But none of those was true at this supplier.

Cost-per-deflected conversation: not measurable, because the bot did not deflect anything meaningful in four weeks.

### Manychat, at a 4-person UK food brand

Manychat is a Messenger and Instagram bot platform. The food brand wanted it to handle DM enquiries about stockists and ingredient questions. The bot worked technically. The problem was tone. Manychat's flow-based interface produces conversations that feel scripted, and the brand voice was casual and warm. CSAT came in at 2.8 out of 5. Customers described it as "robotic."

We rewrote the flows three times. CSAT moved to 3.1. The founder pulled the bot at the end of the pilot and went back to handling DMs herself.

The lesson: flow-based bots are fine for walking a customer through a decision tree. They are not fine for brands where the conversation is the brand.

### Worktual, at a 30-person UK training provider

Worktual was supposed to handle FAQ traffic and book in course enquiries. The platform's UK-based support was excellent. The bot itself was solid. The fit issue was that the training provider's questions weren't really FAQs. Most enquiries were variations on "is this course right for someone with my background," which requires judgement. Worktual handled the simple bookings well. It could not handle the judgement questions, and those were 60% of the volume.

We tried adding a richer training corpus. Improvement was marginal. The provider kept the bot for booking confirmation and dropped the qualification piece.

### Marlie, at a regional UK charity

Marlie targets UK small businesses and offers a friendlier price point than the enterprise tools. The charity wanted it to handle volunteer enquiries: "what does a typical shift look like," "where do I sign up," "what's the safeguarding process." The bot performed adequately on the simple questions and badly on the safeguarding ones, where every answer needed to be precise and approved.

We could not let the bot generate any answer on safeguarding without human review. Once we'd put the human review step in, the deflection benefit collapsed. The charity moved to a hybrid model where Marlie handled FAQs and a human always handled safeguarding. That worked, but it was not the cost saving they had budgeted for.

## Comparison table

| Platform | SMB | Workflow | Deflection | CSAT (/5) | False-resolution | Cost / deflected | Verdict |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intercom Fin | 40-person B2B SaaS | Lead qualification | 71% | 4.3 | 6% | £0.42 | Winner |
| ChatGPT Custom GPT | 22-person accountancy | FAQ | 87% | 4.4 | 4% | £0.18 | Winner |
| Tidio | 6-person e-commerce | Tier-one deflection | 71% | 4.1 | 5% | £0.21 | Winner |
| Freshchat Freddy | 14-person lettings | Routing | 22% | 3.0 | 31% | £4.40 | Pulled |
| Drift | 18-person supplier | Lead qualification | n/a | n/a | n/a | n/a | Failed deploy |
| Manychat | 4-person food brand | DM handling | 48% | 2.8 | 14% | £1.10 | Pulled |
| Worktual | 30-person training | FAQ + booking | 41% | 3.6 | 18% | £1.30 | Partial |
| Marlie | Regional charity | Volunteer FAQ | 53% | 3.5 | 11% | £0.95 | Hybrid |

## What the winners had in common

Three things, in this order.

The first was clean documentation. Every winning pilot had a real source of truth: an ideal customer profile, a client handbook, a returns policy, a stock page that updated reliably. The bots that worked were grounded in something stable. The bots that lost were trying to interpret the world without a reliable reference.

The second was a bounded workflow. Tier-one deflection on a homeware site has a small, knowable set of questions. Lead qualification on a B2B SaaS with a well-defined ICP is bounded. Maintenance triage in lettings is unbounded, because a tenant can say almost anything about a boiler and be technically correct. Bounded beats fancy.

The third was a willing operator. Each winning SMB had one person who took the implementation seriously: the founder of the e-commerce brand wrote the FAQ answers in the bot console, the senior accountant tuned the GPT prompts, the SaaS founder reviewed Fin transcripts every Friday. Where that person didn't exist, the bot drifted out of relevance within four weeks.

## The four questions that would have saved every losing client

If I'd asked these four questions before any of the five losing pilots, the SMB would have saved between £900 and £6,200 in licences and implementation time.

1. Do you have a current, accurate document the bot can ground itself in? If the answer is "we'll write one," you are not ready. Write the document first, then choose the bot.
2. Can you bound the workflow to questions where you know the right answer in advance? If the customer can ask anything in 30 ways, you need a person, not a bot.
3. Is there one person who will own the bot for the first 90 days? They will need 2 hours a week for tuning, transcript review, and edge-case handling.
4. What is the cost of a wrong answer? In safeguarding, lettings maintenance, and regulated advice, the cost is high. Don't put a bot on the front line in those domains.
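
If you want that gate to be mechanical rather than a judgement call on the day, the four questions collapse into a go/no-go check. A toy encoding for illustration only, not something any of these pilots actually ran:

```python
def ready_for_a_chatbot(
    has_current_grounding_doc: bool,   # question 1
    workflow_is_bounded: bool,         # question 2
    owner_committed_90_days: bool,     # question 3
    wrong_answer_cost: str,            # question 4: "low", "medium", or "high"
) -> bool:
    """Fail any one of the four and the fix comes before the licence fee."""
    return (
        has_current_grounding_doc
        and workflow_is_bounded
        and owner_committed_90_days
        and wrong_answer_cost != "high"
    )

# Scored honestly, a messy-triage, high-stakes workflow fails before you buy:
print(ready_for_a_chatbot(True, False, True, "high"))  # False
```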

## Where to start

So if you are a UK SMB in the under-30-person range and you handle tier-one support queries, answer the same FAQs every week, or qualify leads from a website, the table above tells you the rough fit. Pick the row that looks like your business. Use the workflow it ran. Don't start with the platform. Start with the workflow you'd most like to deflect, then choose the bot that fits it.

The pattern I see across the winning pilots is the same one that shows up in every other AI implementation I've run. Pick the boring process, get the document right, give it to one owner, measure for four weeks, then decide.

## FAQ

**What's the best AI chatbot for a UK small business?**

There isn't one. The best chatbot is the one that fits your workflow, your documentation quality, and the cost of a wrong answer. For tier-one e-commerce deflection, Tidio is a sensible default. For B2B lead qualification, Intercom Fin is strong. For FAQ-driven professional services, a ChatGPT Custom GPT is hard to beat for the price.

**How much does a chatbot for a UK small business cost?**

Real total cost, including licence and setup, ranges from £350 to £4,000 in the first 90 days. Tidio is the cheapest plausible deployment for very small businesses. Intercom Fin and Drift are mid-market priced. Custom GPT builds tailored to your firm usually run £900 to £2,500 in setup with low ongoing licence cost.

**How long does it take to deploy a chatbot?**

Three days at the simple end (Tidio with a clean FAQ doc) to seven weeks at the messy end (Drift with a stubborn CRM integration). Plan for three weeks unless you have someone who has done it before.

**What deflection rate should I expect?**

For a well-fit workflow with clean documentation, 60 to 80% deflection is realistic. Below 50% means the workflow probably wasn't a good fit. Above 85% means you might be deflecting things you shouldn't.

**Will a chatbot make my customers angry?**

If you put it on a workflow where customers expect speed and the answers are bounded, no. If you put it on a workflow where customers expect judgement, empathy, or genuine accountability, the answer is yes, and quickly. The Manychat example above is the cautionary tale.

**What's the single biggest mistake SMBs make?**

Buying the platform before they have the documentation. The bot is not the work. The document the bot grounds itself in is the work. If you don't have that document, do that first.

If you want the prompt library and the workflow templates I used across these pilots, the AI Ops Vault has the full set ready to copy: https://richardbatt.co.uk/vault

If you want a structured way to find which one workflow is worth automating first, the AI Roadmap audit is the fastest path: https://richardbatt.co.uk/roadmap

---

## More about Richard Batt

Richard Batt is an AI implementation specialist who helps businesses deploy working AI automation in days, not months. 120+ projects across 15+ industries.

### Key pages

- [Home](https://richardbatt.com/)
- [About Richard](https://richardbatt.com/about)
- [Blog](https://richardbatt.com/blog)
- [Contact](https://richardbatt.com/contact)
- [Subscribe](https://richardbatt.com/subscribe)

### Contact

- Email: richard@richardbatt.com
- Location: Middlesbrough, UK (working globally)
- Website: https://richardbatt.com