Do You Know What I Mean?

The stories are now legendary: the misfirings of speech recognition utilities, when a phrase is uttered into a system and something seemingly random is repeated back to the originating voice, are entertaining as well as fear-inducing. Imagine having your completely coherent and well-thought-out message to a colleague turn out sounding like this: “Hey don’t forget your Dad killed her by name. Be careful on the way. Read some pretty clear down here bomb within like 130 to be careful. Bye.” Or: “Hi again this is Michael. So calling from Ralph there. Volkswagen lasagna.”

When speech recognition programs (also known as Automatic Speech Recognition or Computer Speech Recognition), designed to convert speech to text, go wrong or misinterpret what is said, we seem to take some sort of perverse satisfaction in machines being not quite as intuitive as we are. It was much the same when the IBM computer persona “Watson” competed on Jeopardy! this last week: we took just a little too much glee in his failures and felt just slightly too much angst when he actually trumped what we know to be a very capable human.

Having voiced many prompts to build text-to-speech applications (where typed words are converted to the spoken word), I have also been an actual human being on the other side of it. I have attempted to order items via automated systems, following prompts which I, myself, have voiced, and have had the automated version of “me” say things like: “Great. I think you said: International Sales” when I clearly intoned: “Visa Payment”. Or there was the time, years ago, when I got my first voice-enabled dialing feature on a cell phone, distinctly told it to dial “Kelsey”, and heard it repeat back to me: “OK, I think you said…..JEROME…”

Gerd Graumann, Director of Business at Lumenvox, one of the leading providers of speech development products, filled me in on some background and history of speech recognition: “AT&T Bell Laboratories developed a primitive device that could recognize speech as far back as the 40’s, and even back then, researchers knew that the widespread use of speech recognition would depend on the ability to accurately and consistently perceive complex verbal input,” explains Graumann.

“In the 60’s, researchers turned their focus towards creating a device that would use discrete speech: verbal stimuli punctuated by small pauses,” further explains Graumann. “However, in the 1970’s, continuous speech recognition, which does not require the user to pause between words, began. The technology became functional in the 1980’s, and is still being developed and refined today.”

In 1982, Kurzweil Applied Intelligence released speech recognition products, and by 1985, their software had a vocabulary of 1,000 words — uttered one word at a time. In just two years, its lexicon reached 20,000 words — entering the realm of actual human vocabularies, which typically range from 10,000 to 150,000 words. Despite that healthy base, the recognition accuracy was still only 10% in 1993. Two years later, the error rate crossed below 50%. In 2001, the recognition accuracy reached a plateau of 80%, no longer growing with data or computer power. When, in 2006, Google published a trillion word corpus, Carnegie Mellon University researchers found no significant increase in recognition accuracy.
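For anyone curious how figures like these can be calculated: one common yardstick is the word error rate, which counts how many words had to be substituted, inserted, or deleted to turn what the system heard into what was actually said. I don't know that the numbers above were measured this way, so treat the following as a minimal sketch of the general idea; the function name `word_error_rate` and the sample phrases are my own illustrations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical example: one of the two spoken words was misheard.
print(word_error_rate("dial kelsey", "dial jerome"))
```

A score of 0.0 means every word was heard correctly. Because inserted words count as errors too, the score can climb above 1.0 when the system "hears" far more than was said.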

Ever-increasing processor speed, overall system performance and improved algorithms now enable speech recognition systems to run more effectively than ever and deliver the results of massive probability calculations within fractions of a second. Even the stumbling block which was at one time considered to be close to insurmountable, the challenge of speakers with accents, has been largely eradicated. Current-generation speech recognition systems learn over time to “understand” various speakers with accents and strong regionalities from the data they are trained on. Gerd Graumann further clarifies this point: “The training data that goes into the acoustic model makes all the difference. With today’s models, the spectrum is fairly broad, and many non-native speakers are part of the training data to reflect how people from many different backgrounds speak. Of course,” warns Graumann, “there is always the end of the spectrum.”

When it comes to the words people use to interact with automated systems, the latest technology already allows the systems to interpret what the person is saying. This is achieved through statistical linguistic models, a new technology that tries to understand the intent of what is being said, rather than the exact words that were spoken. It is not unlike texting with an SMS utility, which remembers the words you are likely to mean as you type. It is also not unlike how the actual human brain works.
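To make the “intent versus exact words” idea a little more concrete, here is a toy sketch of one simple statistical approach (a naive Bayes word-count model with add-one smoothing). It is not how any particular vendor's system works; the intents, the training phrases, and the function names `train` and `guess_intent` are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical training phrases for two intents (invented for this example).
TRAINING = {
    "visa_payment": [
        "pay with visa",
        "visa payment",
        "i want to pay by card",
    ],
    "international_sales": [
        "international sales",
        "sales outside the country",
        "talk to international sales",
    ],
}


def train(examples):
    """Count how often each word appears under each intent."""
    counts = {
        intent: Counter(word for phrase in phrases for word in phrase.split())
        for intent, phrases in examples.items()
    }
    vocab = {word for counter in counts.values() for word in counter}
    return counts, vocab


def guess_intent(utterance, counts, vocab):
    """Return the intent whose word statistics best explain the utterance."""
    words = utterance.lower().split()
    best_intent, best_score = None, -math.inf
    for intent, counter in counts.items():
        total = sum(counter.values())
        # Add-one smoothing so unseen words ("um", "please") don't zero out a score.
        score = sum(
            math.log((counter[word] + 1) / (total + len(vocab))) for word in words
        )
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


counts, vocab = train(TRAINING)
print(guess_intent("um i would like to make a payment on my visa", counts, vocab))
```

The caller never says the exact phrase “Visa Payment”, yet the words “payment” and “visa” tip the balance toward that intent. Real systems are trained on vastly more data, but the principle of weighing likelihoods instead of demanding an exact match is the same.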

The applications for speech recognition are vast. Medical and legal uses include transcription and real-time dictation, which is made considerably more efficient when digital dictation systems are routed through speech recognition utilities (known as Deferred SR). Speech recognition is aggressively being implemented in high-performance military fighter aircraft, with the capability to set radio frequencies, command the autopilot system, set steer-point coordinates and weapon release parameters, and control flight displays. From enhancing the lives of people with disabilities, to training Air Traffic Controllers, to improving the experience of video games, speech recognition’s uses and applications are immense and growing continuously. And hopefully, with the refinement of the technology, the likelihood is minimal of receiving the following cryptic voice mail transcription: “I just wanted to let you know so that you weren’t surprised if you come back for shower tomorrow that cousin is girlfriend, maybe..” Or how about “Kelly” receiving a message from her Father: “Hi, Kelly, Death calling…”

Next week, I’m excited to blog about those fascinating — and largely subliminal — short “flurry” sound effects you sometimes hear when accessing a telephone system…they’re almost like a trademark musical “scale” which can become closely associated with a telephone company’s identity — and they’re *very* hard to find! I’ll discuss how they’ve become a big boon to my business, and why the sounds which I own are closely guarded.

Thanks for reading!

Allison Smith is a professional telephone voice, who can be heard voicing systems for telephone systems and private companies throughout the world, including platforms for Verizon, Qwest, Cingular, Sprint, Bell Canada, Hawai’ian Telcom, and Asterisk.  Her website is

