What's the right deflection KPI?

The right deflection KPI for an ecommerce chatbot is per ticket category, not blended. Vendors quote 90%. The data quotes 8% on refunds and 60-70% on order status.


Justin Thompson · 6 min read

Most chatbot dashboards want to give you one clean deflection number.

That number is useful, but only up to a point. It blends simple order-status tickets with refund disputes, complaints, cancellations, and customers who are already annoyed. Those are not the same job.

So the right deflection KPI is not one blended rate across the inbox. It is deflection by ticket category. The gap between those categories is wider than any single number admits.

If the bot is in the loop for 86% of conversations but only finishes the job on a fraction of them, “deflection rate” stops describing what the AI actually solved. It starts describing what it touched.

Why is 90% deflection the wrong target?

The 90% target is inherited from a different category. Voice contact centres in the 2010s optimised for average handle time, and IVR deflection got measured against call volume diverted from agents. One queue, one cost line, one number.

That math travels badly into a modern DTC inbox where one ticket is “where is my order” and the next is “your subscription charged me twice after I cancelled.”

Vendors quote the headline because it sells. The ticket mix sitting underneath it is rarely disclosed. Gorgias’s own data is the clearest cut available, and it splits the inbox cleanly: on refunds, AI alone resolves 8%.

A blended 90% across an inbox that contains 8% on refunds is either selective reporting or a brand that is trying to deflect the wrong work and has not yet seen the Trustpilot lag show up. Both are recoverable. Neither is a target to chase.
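The arithmetic behind that claim can be sketched in a few lines. Assuming, generously, that every non-refund ticket is deflected at 100%, the function below solves for the largest share of refund tickets an inbox could hold and still blend to 90%. The function name and the 100% assumption are illustrative and do not come from any vendor's reporting.

```python
def max_low_category_share(target, low_rate, other_rate=1.0):
    """Largest ticket share the low-deflection category can hold while the
    blended rate still reaches `target`.

    Solves  s * low_rate + (1 - s) * other_rate >= target  for s.
    """
    if other_rate <= low_rate:
        raise ValueError("other_rate must exceed low_rate")
    if target > other_rate:
        raise ValueError("target is unreachable at any ticket mix")
    return min(1.0, (other_rate - target) / (other_rate - low_rate))


# Assumption: refunds deflect at 8% and everything else at a perfect 100%.
refund_share_cap = max_low_category_share(target=0.90, low_rate=0.08)
print(f"Refunds can be at most {refund_share_cap:.1%} of tickets")
```

Under those assumptions refunds could make up roughly 11% of tickets at most, and only with a flawless bot on everything else. Set `other_rate` to 0.7 and the function raises, because a 90% blend is then unreachable at any mix.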

What does the data say about real deflection ceilings?

Category by category, the ceilings are knowable.

| Ticket category | Real deflection ceiling | Why the ceiling sits there |
| --- | --- | --- |
| Order status, tracking, password resets | 60-70% | Stable answers, structured data, low emotional load |
| FAQ and policy basics | 50-60% | The bot has to refuse questions that look like policy but aren’t (Air Canada is the cautionary tale) |
| Refunds, returns, subscription disputes | Under 10% | Gorgias: 8% AI-alone. Customer signals they want a human |
| Damage, complaints, emotional escalations | Under 5% | Empathy load, brand-reputational risk on the failure tail |

The first two rows are the half that pays for the deflection deployment. The bottom two are where customers are usually signalling that they want a human, regardless of how good the bot’s reading comprehension is.

Blended across a typical DTC ticket mix, the real ceiling lands somewhere in the 30-45% range. That number is not prescriptive. The shape is what matters: simple tickets approach a real ceiling, complex tickets have a hard floor of human involvement, and brands chasing one number across both are optimising for a metric the customer is not experiencing.
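A blended ceiling is a weighted average of the per-category ceilings, weighted by each category's share of tickets. The sketch below uses the midpoints (or stated bounds) from the table above. The ticket shares are hypothetical, chosen only to show the arithmetic, so substitute your own mix.

```python
# name: (share_of_tickets, deflection_ceiling)
# Shares are ILLUSTRATIVE ASSUMPTIONS, not data from the article.
# Ceilings are midpoints or bounds taken from the table above.
CATEGORIES = {
    "order_status": (0.40, 0.65),      # 60-70% ceiling
    "faq_policy": (0.25, 0.55),        # 50-60% ceiling
    "refunds_disputes": (0.25, 0.08),  # Gorgias: 8% AI-alone
    "complaints": (0.10, 0.05),        # under 5%
}


def blended_deflection(categories):
    """Weighted average of per-category deflection rates by ticket share."""
    total_share = sum(share for share, _ in categories.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("ticket shares must sum to 1")
    return sum(share * rate for share, rate in categories.values())


blended = blended_deflection(CATEGORIES)
print(f"Blended ceiling: {blended:.1%}")
```

With this assumed mix the blend comes out at about 42%, inside the 30-45% range. Shift more of the inbox toward refunds and complaints and the blended number drops, even though no individual category got worse.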

What did Klarna actually overshoot?

The Klarna walk-back is the canonical case, and the lesson in it is narrower than the headline suggests. Sebastian Siemiatkowski’s February 2024 announcement claimed that the AI assistant handled two thirds of chats in its first month, work equivalent to 700 FTEs. No category split was disclosed.

Fourteen months later he told Fortune that “cost was a too predominant evaluation factor” and Klarna was rehiring humans. Read that line carefully. The deflection mechanics worked. The KPI did not. Cost per ticket was the measured variable, and customer-experience cost was the missing one. Applied uniformly across an inbox that contained both simple and complex tickets, the same target that printed savings on the simple ones printed quiet damage on the complex ones.

Klarna is the named case because Siemiatkowski was unusually open about both sides. The pattern is industry-wide. Any brand that lets a vendor quote one deflection number across all tickets is on the same arc.

What should you measure instead?

Two metrics, not one. Run them in parallel and report them separately.

Per-category deflection. Set the ceiling per ticket type, not across the inbox. Order status sits in one bucket with its own number. Refunds, cancellations, and complaints sit in another with a much lower one. If your vendor can’t break the deflection rate down this way, that itself is the finding.

Customer-side lag indicators. Deflection rates report instantly. Customer behaviour lags 3-6 months. Track Trustpilot velocity, repeat-contact rate within 7 days, and the rate at which customers try to escape the bot on chat. The Register reported the trailing version of this metric for the industry.
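Both metrics can be computed from a plain ticket export. The sketch below is one possible approach, not a vendor API: the record layout, the sample tickets, and the definition of repeat contact (a ticket followed by another from the same customer within 7 days) are all assumptions to adapt to your own helpdesk data.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# HYPOTHETICAL records: (customer_id, category, opened_at, resolved_by_ai).
# Field layout and values are made up for illustration.
TICKETS = [
    ("c1", "order_status", datetime(2024, 3, 1), True),
    ("c1", "refund", datetime(2024, 3, 4), False),
    ("c2", "order_status", datetime(2024, 3, 2), True),
    ("c3", "order_status", datetime(2024, 3, 5), False),
    ("c3", "refund", datetime(2024, 3, 20), False),
]


def deflection_by_category(tickets):
    """Return {category: share of tickets the AI resolved without a human}."""
    resolved = defaultdict(int)
    total = defaultdict(int)
    for _, category, _, resolved_by_ai in tickets:
        total[category] += 1
        resolved[category] += int(resolved_by_ai)
    return {category: resolved[category] / total[category] for category in total}


def repeat_contact_rate(tickets, window=timedelta(days=7)):
    """Share of tickets followed by another ticket from the same customer
    within `window` (assumed definition of repeat contact)."""
    by_customer = defaultdict(list)
    for customer_id, _, opened_at, _ in tickets:
        by_customer[customer_id].append(opened_at)
    repeats = 0
    for opened in by_customer.values():
        opened.sort()
        # Count each ticket whose next ticket arrives inside the window.
        repeats += sum(
            1
            for current, following in zip(opened, opened[1:])
            if following - current <= window
        )
    return repeats / len(tickets) if tickets else 0.0


per_category = deflection_by_category(TICKETS)
repeat_rate = repeat_contact_rate(TICKETS)
print(per_category)
print(f"7-day repeat-contact rate: {repeat_rate:.0%}")
```

In the sample data, customer `c1` has an order-status ticket marked as resolved by the AI and then opens a refund ticket three days later. The per-category rate counts the first ticket as a success, and the repeat-contact rate is the number that flags it for a second look, which is why the two are reported side by side.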

The other half of the conversation is the work the AI does not finish. Once a brand accepts that a real ceiling exists, the question shifts from “how do we deflect more?” to “how do we make the tickets that still need a human better than they were before AI?”

That is where team-side AI sits, measured on enrichment quality and reply time rather than deflection.

Deflect what makes sense. Enrich the rest. One number hides the trade-off. Two numbers describe what the customer actually experiences.
