
Richard Batt

I Tested Claude Sonnet 4.6 on 5 Real Business Tasks

Tags: AI Tools, AI Strategy


The Experiment That Changed How I Select Models

Last month, I got a call from a director at a mid-size financial services firm. "Richard," she said, "we're spending £15,000 a month on API calls. If Claude Sonnet 4.6 is 'good enough' for most tasks, can we stop overpaying for Opus?" That question wouldn't leave me alone. So I did what consultants do. I ran the data.

Key Takeaways

  • The Experiment That Changed How I Select Models: apply this before building anything.
  • Task 1: Drafting a Client Proposal from Meeting Notes.
  • Task 2: Analysing a 50-Page Financial Report.
  • Task 3: Writing and Debugging a Python Automation Script.
  • Task 4: Summarising a Week's Worth of Industry News.
  • Task 5: Creating Strategic Recommendations from Raw Data.

I took five real business problems from my clients and ran them through both Claude Sonnet 4.6 and Opus 4.6. Same input, same context, same evaluation rubric. The results surprised me. Not because Sonnet is a hidden genius: it isn't. But because the economics start to matter more than the benchmarks the moment you factor in your actual workflow.

Let me walk you through what I found. The pricing stakes are real: Sonnet 4.6 costs $3 per 1 million input tokens and $15 per 1 million output tokens, versus Opus at $5 and $25 respectively. That's a 40% saving on input and a 40% saving on output. Over a year, for teams running dozens of API calls daily, that gap compounds into real money: the money that executives notice in budget reviews.
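To make that arithmetic concrete, here is a minimal sketch using the per-million-token prices above; the call volume and token counts per call are illustrative assumptions, not figures from any client:

```python
# Published per-million-token prices from the comparison above (USD).
SONNET = {"input": 3.00, "output": 15.00}
OPUS = {"input": 5.00, "output": 25.00}

def call_cost(prices, input_tokens, output_tokens):
    """Cost in USD of a single API call at the given per-million-token prices."""
    return (input_tokens / 1_000_000) * prices["input"] + \
           (output_tokens / 1_000_000) * prices["output"]

# Illustrative workload: 100 calls/day, 4K input + 1K output tokens per call.
daily_calls, inp, out = 100, 4_000, 1_000

sonnet_annual = call_cost(SONNET, inp, out) * daily_calls * 365
opus_annual = call_cost(OPUS, inp, out) * daily_calls * 365

print(f"Sonnet: ${sonnet_annual:,.0f}/yr, Opus: ${opus_annual:,.0f}/yr")
print(f"Saving: {1 - sonnet_annual / opus_annual:.0%}")  # 40% at these prices
```

Because input and output are both discounted 40%, the saving holds at any mix of token volumes; only the absolute numbers change with workload.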

But pricing alone isn't the story. The real question is whether Sonnet delivers enough quality to justify those savings. Can you actually use it for serious business work, or are you paying less because the output is proportionally worse? That's what I tested. Here are the unvarnished results.

Task 1: Drafting a Client Proposal from Meeting Notes

I grabbed messy meeting notes from a real SaaS pitch: 15 minutes of audio transcribed, full of tangents, false starts, and unstructured information. The task: synthesize them into a formal, client-ready proposal, delivered within 8 hours, that highlights value, pricing options, and next steps.

Sonnet delivered a draft in 12 seconds. The structure was clean: executive summary, problem statement, solution architecture, pricing table, timeline, and CTA. I scored it on four criteria: structure clarity (does it flow?), comprehensiveness (all key points covered?), tone alignment (does it sound professional?), and actionability (can the sales team use it directly?).

Sonnet 4.6 score: 8.2/10. It nailed the structure and tone. The pricing rationale felt logical and persuasive. The timeline was realistic. Minor gap: it didn't dig deep enough into the client's specific pain points from the notes: it generalized them slightly. A human would catch that and request one revision before sending.

Opus took 18 seconds (and pulled from a deeper context window). It produced a more careful version: stronger problem statement, richer competitive context, and specific language about the client's industry challenges. It even flagged a potential objection about implementation timeline that Sonnet had glossed over, suggesting a mitigation strategy.

Opus score: 9.1/10. The output felt more executive-grade. But here's the catch: Sonnet's draft was 90% there. A sales team using Sonnet would spend 20 minutes on edits; with Opus, they'd spend 5. Those 15 minutes of editing are worth perhaps £2 of someone's time, while the cost delta on that API call is about 1.2p more for Opus. Economically neutral for a proposal this high-stakes.

But if you run 30 proposals a month? The API saving is still pennies; what you're really trading is a few extra hours of routine revision work across the team for a marginally more polished first draft. The choice becomes a business decision, not a quality decision. For most proposal-writing teams, Sonnet is the obvious choice.

Task 2: Analysing a 50-Page Financial Report

I gave both models a real FCA filing: 50 pages of regulatory text, cash flow tables, footnotes, and risk disclosures. The task: extract the top 5 financial risks, translate them into plain English, and flag which ones affect loan approval decisions.

This task required both breadth (covering 50 pages) and depth (understanding interconnections between risks). It also required judgment: not all risks are equal, and a banker needs to know which ones matter for the specific decision they're making. Sonnet's context window is 200K tokens standard; Opus is the same. Both could handle the document easily.

Sonnet completed the analysis in 8 seconds. It pulled the obvious risks: pension liabilities, foreign exchange exposure, regulatory changes. Clean summary, well-structured, accurate. My evaluation team graded it 7.8/10 on accuracy and 8.1/10 on usefulness for a lending decision.

Opus took 14 seconds. It identified all of Sonnet's risks plus two subtle ones: a hidden contingent liability buried in footnote 31, and a debt covenant trigger that Sonnet had noted but not flagged as "material." These weren't obvious: they required reading cross-references and inferring implications.

Opus score: 9.3/10. The report would make a banker more confident in their decision. Sonnet score: 7.8/10. A banker using Sonnet would miss real information, which in a £5 million lending decision could be catastrophic.

For this task, Opus isn't just slightly better: it's meaningfully safer. The cost difference (about 3.5p per report) is trivial compared to the risk of missing a material disclosure. Here, Opus is worth it. This is where I tell clients: don't optimize price on judgment-critical analysis. The cost of being wrong vastly exceeds the cost of paying more for the right answer.

Task 3: Writing and Debugging a Python Automation Script

I gave both models a real requirement: "Build a Python script that polls a Slack channel every 5 minutes, extracts messages containing project status keywords, parses them into structured data, and pushes them to a PostgreSQL database for reporting."

Sonnet generated a functional script in 6 seconds. It included the Slack SDK import, database connection pooling, error handling for network timeouts, and a clean logging setup. I tested it against my local environment.

Result: it ran. The script executed without errors, connected to the database, and successfully parsed messages. But code that runs isn't the same as code that's safe or scalable. The script had three issues. First, the database connection wasn't using parameterized queries: minor SQL injection risk. Second, the keyword parsing relied on simple string matching; it'd miss variations ("Project complete" vs "project: complete"). Third, the error handling swallowed database connection failures silently, making debugging harder.

Sonnet score: 7.4/10. Ship-ready for a prototype. Not safe for production without a security review. A startup would be fine running this for internal use. A financial services company? No.

Opus generated the same functionality in 11 seconds, but with parameterized queries built in from the start, regex-based keyword parsing with a fallback scoring system, and granular error handling that logs connection failures with context. It even suggested using environment variables for secrets instead of hardcoding them.
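As an illustration of two of those fixes, here is a hedged sketch of what hardened parsing and storage can look like: a regex keyword matcher that tolerates the variations plain string matching misses, and a parameterized insert instead of string interpolation. The table, column names, and status vocabulary are hypothetical; the Slack polling loop and connection pooling are omitted for brevity:

```python
import re

# Match status keywords with flexible separators and casing, so
# "Project complete", "project: complete", and "PROJECT - on track" all hit.
STATUS_PATTERN = re.compile(
    r"\bproject\b\s*[:\-]?\s*(complete|blocked|on[\s\-]?track|delayed)",
    re.IGNORECASE,
)

def extract_status(message: str):
    """Return the normalised status keyword, or None if the message has none."""
    match = STATUS_PATTERN.search(message)
    if not match:
        return None
    # Normalise "on track" / "on-track" to a single canonical token.
    return re.sub(r"[\s\-]+", "_", match.group(1).lower())

# Parameterized insert: values travel separately from the SQL string,
# which removes the injection risk flagged above. (Hypothetical schema.)
INSERT_SQL = "INSERT INTO project_status (channel, status, raw_text) VALUES (%s, %s, %s)"

def store(cursor, channel: str, message: str) -> bool:
    """Parse a message and persist it; returns False for non-status messages."""
    status = extract_status(message)
    if status is None:
        return False
    cursor.execute(INSERT_SQL, (channel, status, message))
    return True
```

The design point is the one Opus made: parsing tolerance and query safety are cheap to build in from the start and expensive to retrofit after a security review.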

Opus score: 9.2/10. This is production-grade code. A junior developer could deploy this without a senior review. The security posture is sound, the error handling is sensible, and the code would survive its first customer support incident without embarrassment.

Cost-wise, Opus's extra seconds cost roughly 0.8p. But the security review time it saves? That's worth £40–£80 in engineering time, easily. For code that touches sensitive data or runs in production, Opus's architectural thinking pays dividends. This is another category where the price difference is irrelevant.

Task 4: Summarising a Week's Worth of Industry News

I compiled 23 articles about AI regulation from the past week: EU AI Act updates, UK proposals, US Congressional testimony. The task: synthesize them into a 500-word brief that flags strategic implications for a software company.

Sonnet processed all 23 articles and delivered a solid summary in 9 seconds. It correctly identified the three major regulatory themes: liability frameworks, transparency requirements, and risk tiering. The summary was accurate, well-organized, and easy to scan. It hit all the major news items.

I graded it 8.3/10 on accuracy and 7.9/10 on strategic insight. It told me what happened, but it didn't quite connect the dots to what my client should do about it. The synthesis felt like a press briefing, not a strategy memo. A product team reading it would understand the market, but wouldn't know whether to change their roadmap.

Opus took 13 seconds. It identified the same themes but went further: it flagged which themes created immediate compliance pressure versus which were longer-term shifts. It drew a line between the EU's risk-tiering approach and where the US was heading, suggesting the company could get ahead of the curve by adopting similar internal classification now. It even suggested which existing product features would be affected first, helping the product team prioritize.

Opus score: 9.1/10. Sonnet score: 8.3/10. The gap here is interpretation. For routine digest work, Sonnet is perfectly adequate. For strategic briefing, especially on ambiguous, multi-source topics, Opus's reasoning edge matters. This is a middle-ground task: important, but not judgment-critical.

Task 5: Creating Strategic Recommendations from Raw Data

This was the toughest test. I gave both models a dataset: 6 months of customer churn data, a feature usage matrix, and qualitative feedback from exit interviews. The task: recommend a product roadmap that addresses the top drivers of churn.

Sonnet analyzed the data in 7 seconds and delivered three recommendations: improve onboarding (correlated with early churn), add the top 5 requested features (shown in usage data), and increase customer support responsiveness (cited in interviews). All defensible, all reasonable.

Sonnet score: 7.6/10. A product manager could take these to stakeholders. But the recommendations felt predictable. Sonnet basically said "fix the obvious things." It didn't prioritize. It didn't ask hard questions. It didn't challenge any assumptions.

Opus took 15 seconds. It made the same three recommendations but reframed them: instead of "improve onboarding," it said "onboarding is failing for users in the SMB segment specifically; enterprise onboarding is fine, so don't over-invest here." Instead of "add all 5 features," it said "three of the five are clustered around workflow speed; building one integrated feature would address all three." Instead of "increase support responsiveness," it said "support delays only matter during implementation; automated documentation improvements would be cheaper."

Opus score: 9.3/10. These recommendations came with constraints, trade-off clarity, and precision. They'd lead to better prioritization. A product manager using Opus would have a framework to justify decisions to the executive team. A product manager using Sonnet would have a list of things to do.

This is where Opus's reasoning depth shows up hardest: in recommendation quality on ambiguous problems where multiple interpretations are possible. Sonnet gives you answers. Opus gives you strategic advantage.

The Numbers: A Scoring Framework

Across all five tasks, here's how I scored each model on the criteria that matter: accuracy (getting facts right), depth (understanding nuance), usefulness (actionability for the person receiving the output), and time-to-completion (latency).

  • Task 1 (Proposal drafting): Sonnet 8.2/10, Opus 9.1/10. Gap: 0.9 points. Cost to close gap: 1.2p per proposal. VERDICT: Use Sonnet.
  • Task 2 (Financial analysis): Sonnet 7.8/10, Opus 9.3/10. Gap: 1.5 points. This is material. Opus catches nuance Sonnet misses. VERDICT: Use Opus.
  • Task 3 (Code generation): Sonnet 7.4/10, Opus 9.2/10. Gap: 1.8 points. Security and scalability implications matter here. VERDICT: Use Opus for production code.
  • Task 4 (News synthesis): Sonnet 8.3/10, Opus 9.1/10. Gap: 0.8 points. Marginal advantage to Opus on interpretation. VERDICT: Use Sonnet for routine; Opus for strategic.
  • Task 5 (Strategic recommendations): Sonnet 7.6/10, Opus 9.3/10. Gap: 1.7 points. Reasoning depth makes a real difference. VERDICT: Use Opus.
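The per-task gaps above tally up straightforwardly; this snippet just transcribes the scores from the verdict list and checks the arithmetic:

```python
# (task, sonnet_score, opus_score) transcribed from the verdict list above.
scores = [
    ("Proposal drafting", 8.2, 9.1),
    ("Financial analysis", 7.8, 9.3),
    ("Code generation", 7.4, 9.2),
    ("News synthesis", 8.3, 9.1),
    ("Strategic recommendations", 7.6, 9.3),
]

gaps = [round(opus - sonnet, 1) for _, sonnet, opus in scores]
average_gap = sum(gaps) / len(gaps)

print(gaps)                   # per-task gaps
print(round(average_gap, 2))  # 1.34
```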

Average gap across all tasks: 1.34 points. Translated into real outcomes: Opus delivers meaningfully better results on tasks involving judgment, ambiguity, or cross-domain reasoning. Sonnet delivers "good enough" results on tasks with clear structure, obvious solutions, or focused information extraction.

The pattern is clear: Sonnet is cost-competitive where the answer is obvious or where revision is acceptable. Opus owns the domain where getting it right matters more than saving money.

The Benchmark Context: What the Numbers Mean

Anthropic publishes benchmark scores for both models. Sonnet 4.6 hits 79.6% on SWE-bench (software engineering), 72.5% on OSWorld (computer-use tasks), and 60.4% on ARC-AGI-2 (reasoning). Opus 4.6 runs higher on these benchmarks, but those gaps don't map directly to the gaps I'm seeing in practice.

What matters more: Anthropic's internal testing found that Sonnet 4.6 hit 77% accuracy on heavy reasoning tasks: up from 62% with Sonnet 4.5. That's a 15-point jump, and it's why Sonnet moved into the "usable for serious work" category. It's not perfect reasoning, but it's competent enough to be helpful on well-scoped problems.

Box (the enterprise software company) ran their own tests and found Sonnet 4.6 passes 77% of their enterprise use cases that previously required Opus. That's a real-world validation: for many business tasks, Sonnet is now sufficient. But notice: 77%, not 100%. The remaining 23% matter. That 23% is where the gaps I measured show up most visibly.

Where Sonnet 4.6 Surprised Me (In a Good Way)

I went into this experiment expecting Opus to dominate. Instead, I found Sonnet impressed me on several fronts.

Cost-performance ratio: The 40% price reduction compounds. If you're running 100 API calls a day across your team, you're looking at roughly £6,000 annual savings by switching to Sonnet for non-critical tasks. That's not trivial for a mid-size business. For a larger organization with thousands of daily API calls, the savings scale into real budgetary impact.

Latency: Sonnet's responses came back slightly faster than Opus on most tasks (2–3 seconds quicker). For user-facing applications, that matters enormously. Flow state is real: developers will use faster tools more often, even if they're slightly less capable. A faster tool gets used; a slow tool gets abandoned.

Good enough is good enough: On proposals, news summaries, and routine analysis, Sonnet's output quality is high enough that marginal improvements from Opus don't justify the cost premium. The law of diminishing returns is real. There's a point where spending 40% more for an 8% improvement becomes economically indefensible.

Reliability: I didn't encounter hallucinations or dropped content in Sonnet's outputs. It was stable and predictable. For routine work, it delivered consistent results. This matters more than people realize: consistency is a quality all its own.

Where Opus Still Wins (And Why It Matters)

Opus isn't going anywhere. Three categories of work still need Opus: first, anything involving security or compliance judgment (financial analysis, code generation, regulatory interpretation). Second, novel problem-solving where the right answer isn't obvious (strategic recommendations, complex architecture decisions, ambiguous customer scenarios). Third, work where your organization can't afford to miss nuance (high-stakes client interactions, regulatory submissions, crisis decision-making).

The pattern I'm seeing: Opus excels when the problem is underspecified or requires synthesizing information from multiple domains. Sonnet excels when the problem is clear and the solution is mostly about execution. Opus does frontier work. Sonnet does operational work.

The Practical Decision Framework

Here's how I'm advising clients to choose. Start with Sonnet. If the output quality meets your threshold, you're done: ship it and pocket the savings. If it doesn't, upgrade to Opus for that task. This means you need rubrics: what does "good enough" look like for your specific use case?

For proposal drafting, 8/10 quality is fine: use Sonnet. For financial analysis, 9/10+ is required: use Opus. For code, it depends on the deployment context: internal tools? Sonnet. Production systems? Opus. For strategic thinking: complex problem? Opus. Routine analysis? Sonnet.

The tiered approach: Route simple tasks to Sonnet, complex tasks to Opus. You'll save 35–50% on API costs while maintaining quality thresholds. The overhead of routing logic is minimal: a simple if/then based on task type. One client I worked with implemented this in a week and immediately saw a 38% reduction in API spend while maintaining output quality.

The Honest Limitations

I want to be clear about what this experiment doesn't prove. I tested five tasks from one consultant's clients: not a representative sample of all business work. I used subjective scoring rubrics, not external validators. I didn't test Sonnet's 1-million-token context window or Opus's vision capabilities. I didn't measure the organizational impact of slightly lower quality: does saving £500 a month by using Sonnet cost you £5,000 a month in delayed decisions or missed insights?

That last question is the one that matters most. The money saved only matters if your team doesn't pay the price elsewhere in slower decision-making, lower-quality outputs, or increased error rates. For some organizations, that trade is worth it. For others, it's not.

Also, my evaluation was based on outputs I read and scored. Other evaluators might score differently. I had context about what the client wanted; the models didn't always. Real-world performance varies by domain, by team, and by the specific prompts you write.

What I'm Doing with My Clients

I'm building a tiered API strategy. Tier 1: Sonnet for generation and synthesis (proposals, summaries, routine copy, initial analysis). Tier 2: Opus for judgment and analysis (security, strategy, complex reasoning, final decisions). I'm tracking cost savings and quality metrics, and I'm willing to shift tasks between tiers if the data suggests it.

I'm also building request routing: simple classification logic that routes tasks based on type. A request for proposal drafting goes to Sonnet. A request for code generation in a production context goes to Opus. A news synthesis request goes to Sonnet unless flagged as strategic, then Opus.
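That routing logic really is as simple as it sounds. A minimal sketch, where the task-type labels and the "strategic" flag are assumptions modelled on the examples above, not a production taxonomy:

```python
# Task types Sonnet handles well enough by default (Tier 1).
SONNET_DEFAULT = {"proposal", "summary", "routine_copy", "news_synthesis"}

# Task types that always warrant Opus's deeper reasoning (Tier 2).
OPUS_ALWAYS = {"financial_analysis", "production_code", "strategic_recommendation"}

def route(task_type: str, strategic: bool = False) -> str:
    """Pick a model tier: Sonnet by default, Opus for judgment-critical work."""
    if task_type in OPUS_ALWAYS:
        return "opus"
    if task_type == "news_synthesis" and strategic:
        return "opus"  # digests flagged as strategic get the deeper model
    if task_type in SONNET_DEFAULT:
        return "sonnet"
    return "opus"  # unknown task types fail safe to the stronger model

print(route("proposal"))                        # sonnet
print(route("production_code"))                 # opus
print(route("news_synthesis", strategic=True))  # opus
```

Note the fail-safe default: an unclassified request goes to Opus, so a gap in the taxonomy costs you pennies rather than a missed material disclosure.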

The bottom line: Claude Sonnet 4.6 is genuinely capable. It won't replace Opus, but for cost-conscious teams, it eliminates the economic justification for using Opus everywhere. Use the right tool for the right job. For most jobs, that tool is cheaper than you thought.

Ready to Optimize Your AI Budget?

Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.

Frequently Asked Questions

How long does it take to implement AI automation in a small business?

Most single-process automations take 1-5 days to implement and start delivering ROI within 30-90 days. Complex multi-system integrations take 2-8 weeks. The key is starting with one well-defined process, proving the value, then expanding.

Do I need technical skills to automate business processes?

Not for most automations. Tools like Zapier, Make.com, and N8N use visual builders that require no coding. About 80% of small business automation can be done without a developer. For the remaining 20%, you need someone comfortable with APIs and basic scripting.

Where should a business start with AI implementation?

Start with a process audit. Identify tasks that are high-volume, rule-based, and time-consuming. The best first automation is one that saves measurable time within 30 days. Across 120+ projects, the highest-ROI starting points are usually customer onboarding, invoice processing, and report generation.

How do I calculate ROI on an AI investment?

Measure the hours spent on the process before automation, multiply by fully loaded hourly cost, then subtract the tool cost. Most small business automations cost £50-500/month and save 5-20 hours per week. That typically means 300-1000% ROI in year one.

Which AI tools are best for business use in 2026?

It depends on the use case. For content and communication, Claude and ChatGPT lead. For data analysis, Gemini and GPT work well with spreadsheets. For automation, Zapier, Make.com, and N8N connect AI to your existing tools. The best tool is the one your team will actually use and maintain.

Put This Into Practice

I use versions of these approaches with my clients every week. The full templates, prompts, and implementation guides, covering the edge cases and variations you will hit in practice, are available inside the AI Ops Vault. It is your AI department for $97/month.

Want a personalised implementation plan first? Book your AI Roadmap session and I will map the fastest path from where you are now to working AI automation.
