If you’ve tried voice automation – via Siri, Alexa, or a conversational IVR – chances are, you’ve had a bad experience. Maybe you were misunderstood or not understood at all, maybe you had to keep repeating yourself or maybe you made a mistake and couldn’t get the damn machine to go back and amend it.
All of these issues are avoidable. A well-designed voice assistant will be a pleasure to converse with. It won’t make you work extra hard to communicate and will let you lead the conversation.
In this post, we’ll look at some of the most common reasons that conversations with voice assistants go badly, and explain how you can avoid these pitfalls when deploying voice technologies in your business.
Keyword reliance and the dreaded conversational IVR
For years, conversational IVRs have been the only voice technology that companies trusted to deploy in their contact centers. Instead of the dreaded ‘press 1 for billing, press 2 for…etc’, customer support calls were answered with a recorded message asking callers to explain their reason for calling.
The problem with conversational IVRs is that they can only identify the right way to route a call if the caller uses the right keywords. But customers often speak in unexpected ways.
We found some great examples of this when we worked with BP on an automated solution to route calls from gas station workers with tech problems. Gas station workers would call the number when they had problems with the card machine, but more often than not, they didn’t use the words ‘card machine’. They said ‘card swiper’, ‘chip and pin’, ‘pin pad’, and sometimes went as far as ‘the machine customers put their card in to pay’.
It’s important to remember that callers don’t know what you expect from them. By using a conversational IVR, you’re not letting them browse the range of options they can choose from.
A 2019 report found that 51% of consumers abandon businesses because of a poor IVR system, resulting in companies losing $262 per customer every year – that’s a significant chunk of profit per customer wasted.
Takeaway: Don’t rely on technology that requires callers to say specific keywords. Look for technologies that understand colloquial and slang language by default.
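To make the difference concrete, here is a deliberately simplified sketch in Python. The keyword router misses colloquial phrasings, while matching against many example utterances per intent (a crude stand-in for a trained NLU model, not PolyAI’s actual approach) catches them. All routes and phrases are invented for illustration.

```python
# Exact-keyword routing: only fires if the caller says the magic words.
KEYWORD_ROUTES = {"card machine": "payments_support"}

# Instead of one keyword, store many example phrasings per intent.
INTENT_EXAMPLES = {
    "payments_support": [
        "card machine", "card swiper", "chip and pin", "pin pad",
        "the machine customers put their card in to pay",
    ],
    "fuel_pump_support": ["pump", "fuel dispenser", "the petrol pump"],
}

def keyword_route(utterance):
    """Return a route only if an exact keyword appears in the utterance."""
    for keyword, route in KEYWORD_ROUTES.items():
        if keyword in utterance.lower():
            return route
    return None

def overlap_route(utterance):
    """Route to the intent whose examples share the most words with the caller."""
    words = set(utterance.lower().split())
    def best_overlap(intent):
        return max(len(words & set(ex.split())) for ex in INTENT_EXAMPLES[intent])
    return max(INTENT_EXAMPLES, key=best_overlap)

print(keyword_route("my card swiper is broken"))   # None – the keyword match fails
print(overlap_route("my card swiper is broken"))   # payments_support
```

A real system would use a statistical model rather than word overlap, but the principle is the same: route on meaning, not on magic words.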
Speech-to-Text: ‘I’m sorry, I didn’t understand that’
Speech recognition failure is one of the biggest culprits of poor experiences with voice assistants. I’m sure I don’t need to tell you how infuriating it is when you say one thing and the machine hears another.
Voice assistants typically use a speech-to-text process to transform spoken utterances (speech) into written transcriptions (text) which can then be processed by a machine learning model. Many understanding issues arise due to a discrepancy between what the user says and what the machine transcribes.
There are a number of reasons why speech-to-text fails. Let’s look at a few of the most common…
Different people pronounce the same words differently. Even for humans, it can be really difficult to understand people with strong or unfamiliar accents. Call center workers receive accent training to understand 35 different variations of English alone.
Different speech recognition providers perform better or worse with different accents, so it’s important to find a solution that works best for your customers. At PolyAI, we test different speech recognition solutions for each project and language to find the best option and achieve the highest level of understanding.
Just like human hearing, speech recognition is never going to be perfect. The best way to improve the accuracy of speech recognition is to layer it with machine learning technology that can apply contextual understanding. So let’s have a look at context…
In many cases, what we hear depends on the context. The words ‘ate’ and ‘eight’ sound identical, but if you hear that sound in the context of a phone number, you will understand it as ‘eight’.
It’s important that voice assistants are tuned to listen for certain words based on expected inputs. At PolyAI, our voice assistants record every possible transcription of words and sentences and then re-rank them based on context.
Imagine you’re ordering a sweater over the phone. The voice assistant asks you what color you’d like and you say ‘red’. PolyAI’s technology will record ‘red’ and ‘read’, and, depending on pronunciation, it may also hear ‘bread’ or ‘tred’.
Instead of picking one at random, PolyAI’s voice assistants will rank all responses in the context of which is most likely to be the actual input, given that the assistant is listening for a color, and will choose ‘red’.
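The idea of re-ranking can be sketched in a few lines of Python. Assume a hypothetical n-best list from a speech recognizer, where each hypothesis carries an acoustic confidence, and boost hypotheses that fit the expected slot (here, a color). The scores, word lists, and boost value are invented for illustration and are not PolyAI’s actual model.

```python
EXPECTED_COLORS = {"red", "blue", "green", "black"}

def rerank(n_best, expected_words, context_boost=0.5):
    """Combine acoustic confidence with a simple context prior."""
    def score(hypothesis):
        word, acoustic = hypothesis
        return acoustic + (context_boost if word in expected_words else 0.0)
    return sorted(n_best, key=score, reverse=True)

# 'read' edges out 'red' on acoustics alone, but context flips the ranking.
n_best = [("read", 0.41), ("red", 0.38), ("bread", 0.15), ("tred", 0.06)]
best_word, _ = rerank(n_best, EXPECTED_COLORS)[0]
print(best_word)  # red
```

Production systems use far richer context signals than a fixed word list, but the mechanism, rescoring recognition hypotheses against what the dialogue expects, is the same.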
Understanding people is difficult, and the phone often makes it even more so. Connections can be patchy, a caller might move away from the mouthpiece and people who are already prone to mumbling become altogether incoherent.
Thanks to our exposure to vast amounts of conversations, the human brain is pretty good at processing language, even when the signals are weak.
You’ve probably seen this on Facebook at some point…
Aoccdrnig to rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
Known as Typoglycemia, this phenomenon isn’t based on real scientific research, but it’s a good example of how our brains are able to understand jumbled words based on extensive exposure to written language.
Speech recognition alone can do nothing about faulty signals. If the line cuts out and the system hears ‘pig’ when you said ‘dog’, there’s no way a pure speech recognition system can apply logic to correct that.
That’s why speech recognition should be used in conjunction with machine learning models that have been trained on vast amounts of data. By exposing machines to billions of conversational examples, they’re able to look at a sentence as a whole and derive the intent, or meaning, based on both speech transcriptions and past ‘experience’.
Takeaway: Speech recognition is far from perfect, and is unlikely ever to be. Blend speech recognition solutions with machine learning models to most accurately understand customers.
Multi-Turn vs Single-Turn
Have you ever tried following up on a query with a voice assistant? For example, you might have asked Siri to text someone, but when you ask later ‘did you send that text?’, it doesn’t know what you mean.
A single conversational turn typically comprises one utterance followed by one response. A question and answer, for example. Household voice assistants like Siri and Alexa can typically handle 2 or 3 turns to get a task done.
The more turns a conversation has, the more complicated it is to programme a voice assistant to hold it.
We recently built an assistant to troubleshoot router issues for a major UK telco. The assistant asks the customer a number of questions to identify the issue, and calls can last over 10 minutes with 20+ conversational turns.
To handle multi-turn conversations well, a voice assistant needs to have impeccable short term memory. It must be able to recall information given earlier in the conversation and use this to make decisions about where the conversation should go next. Customers should be able to change their minds or interrupt a flow by asking questions.
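That short-term memory can be sketched as simple slot-filling state that later turns are allowed to overwrite. This is a minimal illustration, assuming an invented sweater-ordering dialogue; the slot names and flow are hypothetical, not a real PolyAI implementation.

```python
class DialogueState:
    """Tracks what the caller has said so far across many turns."""

    def __init__(self, required_slots):
        self.required = required_slots
        self.slots = {}

    def update(self, slot, value):
        # Later turns overwrite earlier ones, so 'actually, make it blue'
        # simply replaces the remembered color rather than breaking the flow.
        self.slots[slot] = value

    def next_question(self):
        for slot in self.required:
            if slot not in self.slots:
                return f"What {slot} would you like?"
        return None  # everything collected – ready to confirm

state = DialogueState(["item", "color", "size"])
state.update("item", "sweater")
state.update("color", "red")
state.update("color", "blue")     # the caller changes their mind mid-flow
print(state.next_question())      # What size would you like?
print(state.slots["color"])       # blue
```

A rigid decision tree would have no clean way to handle that mid-flow correction; state that persists across turns is what makes 20-turn conversations tractable.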
All of this is next to impossible with a decision-tree style dialogue design, which is what systems like Google Dialogflow and Amazon Lex rely on.
Takeaway: Look for a solution that allows for however much back-and-forth it takes for a customer’s query to be resolved, even if that means handling 20 conversational turns.
There’s No Excuse For Building Bad Experiences
How well your voice assistant performs is down to two factors:
- The quality of the technology you use
- The skill and experience behind your dialogue design
If you’d like to learn more about how to build voice experiences that your customers will love, get in touch with PolyAI today.