Posts Tagged ‘speech recognition’

“I Wanna Go Tuh Cleveland…”

An interesting trend developed in telephony a few years back.

There was an active and vehement move away from the robotic, automaton sound of early voice recordings on telephones — recordings which seem to have been deliberately done that way, in order to eliminate any confusion as to whether or not the caller had reached an actual, live human being or a “machine” — and a move more towards a relaxed, natural, conversational cadence. A tone which says: “Yes, you’ve definitely reached a self-serve system — but think of me as just another fellow human being. I hate these things, too!” The thinking behind it is: if the caller feels the voice behind the system is welcoming without using up too much of their time; reassuring without being obsequious; and — all the better — if the voice can sound like the caller’s best friend or neighbor, the caller will “engage” the system, follow the instructions accurately, not hang up in frustration, and not have a whole new veneer of annoyance on top of the issue they’re calling in about, by the time they *do* make it to an actual rep. And even if they have managed to turnkey themselves into a solution (made a reservation, checked their Visa balance) and never had to actually speak to a live operator, their opinion of that company or the transaction can still be made or broken by that automated voice alone.

Solid thinking. And personally, having voiced telephone prompts for companies internationally and for a wide variety of industries, I applauded that trend, and I still do. Rather than having to “put on” a voice which isn’t natural for me to speak in, I’m allowed (nay, encouraged) to sound like an actual, real person (what luck! I happen to *be* one…). Real people hesitate slightly when they’re trying to think of just the right word; there’s a certain…pause…which seems normal in everyday conversational rhythms; and there’s almost a “stumbling” effect which many clients want me to do when I’m voicing, so that I’ll sound like a real person. Actual speech is full of slurs, imperfections, and natural flaws which we all try to avoid in everyday conversation, and it’s those natural “artifacts” which are big in IVR right now. (On the extreme end of that scale was a company that produced on-hold messages and encouraged me, if at all possible, to come up with a yawn or sneeze in the middle of the script, just to reinforce that a “real” person took the time to voice it. I talked them out of that. That’s just a little too “real”.)

However, consider this: humans, being quintessentially social, are infinitely comfortable taking on the mannerisms, rhythms, and traits of the other humans with whom they’re interacting. Watch a pair of humans introducing themselves to one another, and the intricate ballet which ensues. They will, without even thinking about it, mirror the other’s mannerisms and the “rate” or speed at which they converse, driven by an innate need to “match” their conversational partner. It’s the reason why it’s nearly impossible *not* to absorb an accent as you speak with a native of Scotland, for example. It’s why dating coaches actively encourage their clients to make a point of deliberately matching their prospective mate’s every mannerism, move for move. The simpatico that mirroring creates is not only beneficial to our harmony with others; it’s automatic and almost impossible *not* to engage in.

How that relates to IVR is simple: whether or not you’re aware of it, you mirror the “tone” set by an IVR you call into. In many ways, that voice dictates the formality or informality of the transaction. It tells you everything you need to know about the company and even gives you an idea of the level of service and attentiveness you can expect when your issue or problem is eventually dealt with. And the degree of “precision” apparent in the voice is likely the degree of precision with which *you* will respond.

Think, for example, of an IVR voice saying, in a no-frills, somewhat flat-toned delivery: “Please tell me, clearly and slowly, the city to which you’d like to travel. Please press pound when finished.” If you’re the caller, hoping to book a flight to Cleveland, you’ll probably take that instruction quite seriously and deliberately slow your roll as you enunciate, much more slowly and clearly than you’d normally be inclined to: “I’d like to travel to Cleveland, please.” (Even mirroring the two “pleases” which were in their command.) Or you might even just intone: “Cleveland.” In stark contrast would be the “modern” style of IVR: “Great. I can help you book your trip.” (Playfully) “Why don’t you tell me where you wanna go..?”

Naturally, you’re going to reply (playfully) “I wanna go tuh Cleveland.”

Fun, yes? And while this style of recording is accessible, young, modern, and warm, it didn’t take long for data to surface which found fault in that casual, almost *too* relaxed call and reply: speech recognition software struggles to fit the acoustic patterns of “informalspeak”, and delivers a less than perfect hit-rate when callers match a “lazy” IVR’s cue. Also, whereas with “traditional” clipped, more severe IVRs the caller would be more likely to just say “Cleveland”, a chatty, off-the-cuff IVR might incline callers to respond in kind, or to elaborate more than they would under the parameters of a “stiff” automated system. With less accuracy comes confusion, more time burned up, and a greater chance that the customer will either pull the plug on the call, or be so annoyed with the ongoing attempts to repeat their selection that they’ll be stoked with a refreshed supply of vitriol for the poor CSR to whom the call eventually gets transferred.
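For the technically curious, here is a toy sketch of one half of that problem: the longer, chattier reply falling outside what a narrow recognizer expects. It works on typed text rather than audio, so it says nothing about the slurred-pronunciation side of things, and every name in it (`CITIES`, `strict_match`, `relaxed_match`) is made up for illustration rather than taken from any real IVR product.

```python
import re

# Hypothetical city list; a real IVR grammar would be far larger.
CITIES = {"cleveland", "chicago", "denver"}


def strict_match(utterance):
    """Accept only a bare city name, the reply a clipped prompt invites."""
    word = utterance.strip().lower().rstrip(".!?")
    return word.title() if word in CITIES else None


def relaxed_match(utterance):
    """Scan the whole phrase for any known city (simple keyword spotting)."""
    for word in re.findall(r"[a-z']+", utterance.lower()):
        if word in CITIES:
            return word.title()
    return None


if __name__ == "__main__":
    for reply in ["Cleveland.", "I wanna go tuh Cleveland."]:
        print(repr(reply), strict_match(reply), relaxed_match(reply))
```

The strict matcher only succeeds on the one-word reply; the chatty reply needs the more forgiving matcher, and the more forgiving a matcher gets, the more room there is for it to latch onto the wrong word.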

While I’m a fan of a more relaxed, conversational tone — both as a caller and as a voice of IVR systems — the dangers of “under-enunciating” are vast and very real. I like to strike a balance between the friendly and natural, and also being a clear enunciator (while I keep my diction as clear as possible when I’m working, anyone who has spoken to me over the phone after a long day in the booth can testify that I slur like Tom Brokaw). To be relaxed and conversational, and yet authoritative enough to make sure people “hit” the speech recognition utility is always my goal.

Perhaps — to maintain the integrity and accuracy of speech recognition utilities — a certain amount of formality is required in an IVR.  It could be argued that there’s no getting away from a steady, even-toned delivery, if it means a clean, well-running match-up of vocal input whose ultimate goal is getting callers to the right department.

I’m very excited about my next blog, where I interview the legendary Emily Yellin, arguably the world’s foremost expert in customer relations metrics. We had a great chat about what companies desperately need to know about designing effective telephone systems, and I’ll bring you that interview in about two weeks’ time.

As always, thanks for reading. If you have any comments or insights about what you’ve read, feel free to leave a comment!

Allison Smith is a professional telephone voice, who can be heard voicing systems for telephone systems and private companies throughout the world, including platforms for Verizon, Qwest, Cingular, Sprint, Bell Canada, Hawai’ian Telcom, and Asterisk.  Her website is

Do You Know What I Mean?

The stories are now legendary: the mis-firings of speech recognition utilities (when a phrase is uttered into a system, and something seemingly random is repeated back to the originating voice) are entertaining as well as fear-inducing. Imagine having your completely coherent and well-thought-out message to a colleague turn out sounding like this: “Hey don’t forget your Dad killed her by name. Be careful on the way. Read some pretty clear down here bomb within like 130 to be careful. Bye.” Or: “Hi again this is Michael. So calling from Ralph there. Volkswagen lasagna.”

When speech recognition programs (also known as Automatic Speech Recognition or Computer Speech Recognition), designed to convert speech to text, go wrong or misinterpret what is said, there seems to follow some sort of perverse satisfaction in machines being not quite as intuitive as we are. Much like when the IBM computer persona “Watson” competed on Jeopardy! this last week, we took just a little too much glee in his failures and felt just slightly too much angst when he actually trumped what we know to be a very capable human.

Having voiced many prompts to build text-to-speech applications (where typed words are converted to the spoken word), I have also been an actual human being on the other side of it, where I have attempted to order items via automated systems, following prompts which I, myself, have voiced, and have had the automated version of “me” say things like: “Great. I think you said: International Sales” when I clearly intoned: “Visa Payment”. Or there was the time I got my first voice-enabled dialing feature on a cell phone years ago, distinctly told it to dial “Kelsey”, and had it repeat back to me: “OK, I think you said…..JEROME…”

Gerd Graumann, Director of Business at Lumenvox, one of the leading providers of speech development products, filled me in on some background and history of Speech Recognition: “AT&T Bell Laboratories developed a primitive device that could recognize speech as far back as the ’40s, and even back then, researchers knew that the widespread use of speech recognition would depend on the ability to accurately and consistently perceive complex verbal input,” explains Graumann.

“In the ’60s, researchers turned their focus towards creating a device that would use discrete speech, verbal stimuli punctuated by small pauses,” further explains Graumann. “However, in the 1970s, continuous speech recognition, which does not require the user to pause between words, began. The technology became functional in the 1980s, and is still being developed and refined today.”

In 1982, Kurzweil Applied Intelligence released speech recognition products, and by 1985, their software had a vocabulary of 1,000 words, uttered one word at a time. In just two years, its lexicon reached 20,000 words, entering the realm of actual human vocabularies, which typically range from 10,000 to 150,000 words. Despite that healthy base, the recognition accuracy was still only 10% in 1993. Two years later, the error rate crossed below 50% (that is, more than half of the words were being recognized correctly). In 2001, the recognition accuracy reached a plateau of 80%, no longer growing with data or computer power. When, in 2006, Google published a trillion-word corpus, Carnegie Mellon University researchers found no significant increase in recognition accuracy.

Ever-increasing processor speed, overall system performance and improved algorithms now enable speech recognition systems to run more effectively than ever and deliver the results of massive probability calculations within fractions of a second. Even the stumbling block which was at one time considered to be close to insurmountable, the challenge of speakers with accents, has been largely overcome. Current-generation speech recognition systems learn over time to “understand” various speakers with accents and strong regionalities from the data they are being trained with. Gerd Graumann further clarifies this point: “The training data that goes into the acoustic model makes all the difference. With today’s models, the spectrum is fairly broad, and many non-native speakers are part of the training data to reflect how people from many different backgrounds speak. Of course,” warns Graumann, “there is always the end of the spectrum.”

When it comes to the words people use to interact with automated systems, the latest technology already allows for the systems to interpret what the person is saying. This is achieved by the use of statistical linguistic models, a new technology that tries to understand the intent of what is being said, versus the exact words that were spoken. It’s not unlike texting with an SMS utility, which remembers likely words you might mean as you type a text. And it’s also not unlike how the actual human brain works.
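To make “intent versus exact words” a little more concrete, here is a deliberately tiny sketch. It is not how Lumenvox or any other vendor actually does it; the training phrases, the intent names, and the functions `train` and `guess_intent` are all invented for illustration. The idea is simply that each word in the caller’s phrase “votes” for the intent it showed up with most often in past examples, so the caller never has to say the exact menu label.

```python
from collections import Counter

# Hypothetical example phrases per intent; a real system would learn
# from a very large number of recorded calls.
TRAINING = {
    "visa_payment": [
        "pay my visa bill",
        "make a visa payment",
        "visa balance payment",
    ],
    "international_sales": [
        "international sales please",
        "talk to sales overseas",
    ],
}


def train(examples):
    """Count how often each word appears under each intent."""
    return {
        intent: Counter(word for phrase in phrases for word in phrase.split())
        for intent, phrases in examples.items()
    }


def guess_intent(utterance, model):
    """Score each intent by the relative frequency of the caller's words."""
    words = utterance.lower().split()
    scores = {}
    for intent, counts in model.items():
        total = sum(counts.values())
        scores[intent] = sum(counts[word] / total for word in words)
    best = max(scores, key=scores.get)
    # If no word was ever seen in training, admit we have no idea.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    model = train(TRAINING)
    print(guess_intent("I need to pay my Visa", model))
```

Here the caller never says “Visa Payment” word for word, yet the words “pay”, “my” and “Visa” together point to that intent far more strongly than the lone “to” points to sales.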

The applications for speech recognition are vast. Medical and legal uses abound, not the least of which involve transcription and real-time dictation, which are made considerably more efficient when digital dictation systems are routed through speech recognition utilities (known as Deferred SR). Speech recognition is aggressively being implemented in high-performance military fighter aircraft, with the capability to set radio frequencies, command the autopilot system, set steer-point coordinates and weapon release parameters, and control flight displays. From enhancing the lives of people with disabilities to training air traffic controllers and even improving the experience of video games, speech recognition’s uses and applications are immense and growing continuously. And hopefully, with the refinement of the technology, the likelihood is minimal of receiving the following cryptic voice mail transcription: “I just wanted to let you know so that you weren’t surprised if you come back for shower tomorrow that cousin is girlfriend, maybe..” Or how about “Kelly” receiving a message from her father: “Hi, Kelly, Death calling…”

Next week, I’m excited to blog about those fascinating — and largely subliminal — short “flurry” sound effects you sometimes hear when accessing a telephone system…they’re almost like a trademark musical “scale” which can become closely associated with a telephone company’s identity — and they’re *very* hard to find! I’ll discuss how they’ve become a big boon to my business, and why the sounds which I own are closely guarded.

Thanks for reading!
