I was thrilled a couple of years ago when I was approached by Cepstral, one of the premier architects of high-quality, natural-sounding voice synthesis products, to be one of their text-to-speech voices… and I was even thrilled by their very public “proposal”. They did a presentation at Astricon one year, and while they were discussing their range of available voices, a slide appeared on the screen which read: “Coming soon: The Allison Voice!”

Geez, give a girl some notice. At least we’re not capturing the event on a jumbotron.

Text to Speech (TTS) synthesis is basically the artificial production of human speech. Most people’s first thought will gravitate immediately to Stephen Hawking, whose Text to Speech voice has become a part of his persona. Legend has it that Cepstral, who designed his initial TTS utility, has offered him numerous “upgrades” and more current and evolved versions throughout the years for him to experiment with. He has turned them all down. His early, rudimentary “voice” works well; it is recognizable, and most significantly, it has practically become a part of who he is. Text to Speech products immeasurably enhance the lives of those unable to speak, and it’s imperative that the user and voice connect on a visceral level.

A Text to Speech system converts normal language text into speech by concatenating pieces of recorded speech which are stored in a database. Phonemes are simply broken-down sound “fragments”; the system recognizes the typed word, works out which of those sounds its letters (the graphemes) should correspond to, and strings the matching fragments together. Storing entire words and even sentences allows for high-quality output, but is laborious and time-intensive to record.
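For the programmers in the audience, here is a very rough sketch of that “look it up and stitch it together” idea. It is only a toy, not how Cepstral's engine works: the databases, the loosely ARPAbet-style phoneme labels, and the `synthesize` function are all made up for illustration, and the “recordings” are just short lists of numbers standing in for audio.

```python
# Hypothetical databases of recorded snippets. A real system stores audio;
# here each snippet is a short list of made-up sample values.
WORD_RECORDINGS = {"hello": [0.9, 0.8, 0.7]}
PHONEME_RECORDINGS = {
    "HH": [0.1], "AY": [0.2, 0.3],
    "S": [0.4], "M": [0.5], "IH": [0.6], "TH": [0.7],
}

# Hypothetical pronunciation lexicon: typed word -> phoneme labels.
LEXICON = {"hi": ["HH", "AY"], "smith": ["S", "M", "IH", "TH"]}


def synthesize(text):
    """Concatenate stored snippets for each word in `text`.

    A whole-word recording is used when one exists; otherwise the word
    is built from its phoneme snippets. A word missing from both
    tables raises KeyError in this toy version.
    """
    samples = []
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in WORD_RECORDINGS:
            samples.extend(WORD_RECORDINGS[word])
        else:
            for phoneme in LEXICON[word]:
                samples.extend(PHONEME_RECORDINGS[phoneme])
    return samples


print(synthesize("Hi Smith!"))
```

The whole-word lookup comes first because a word recorded in one piece will sound smoother than one assembled from fragments, which is exactly why the script was so thick.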

Tell me about it.

Cepstral’s goal, when they proposed the idea of working together, was to build a very robust TTS engine — possibly the most robust they’d ever designed. Due to the prevalence of my voice not only on the Asterisk Open Source PBX but with many other telephony platforms, they saw the advantages in recording volumes more “sounds” than usually required to build a typical TTS system, so as to create as seamless as possible an interface which would dovetail well with pre-installed stock prompts and custom-recorded prompts alike — all voiced by me. As a way of achieving that, a script arrived which had the breadth (and thickness) of a typical major-city white pages telephone book. No problem!

In this script, I found thousands upon thousands of single words, and just as many pages of random and often nonsensical sentences (“During the period, the company continued to benefit from favorable tax effects”, or “But oh what a hit it could be”, as examples). From larger sentences, phonemes can be farmed (think of the single sounds and combinations of sounds which could be extracted from the sentence: “Julie put on her red coat and made it to the train station by nine”) and stored for retrieval when the system perceives that the “fragment” is needed. It’s not flawless, though: at a subsequent Astricon after the Allison TTS Voice was launched, one of the Digium staffers was very eager to unveil the Cepstral Allison Voice; he typed in “Hi! I’m Allison Smith!” and out of the computer I spoke: “Hi! I’m Allison Smeeeeth!” I find it hard to believe we didn’t capture the “ih” sound that the “I” in “Smith” makes, but there you have it. (One of the most difficult words to capture in a TTS application is, oddly enough, “of”. Widely used in the English language, it’s one of the few words where “f” is pronounced “v”. Naturally, this creates problems for TTS utilities.)
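To see why a word like “of” causes trouble, here is a small sketch of one common way to handle it: a table of letter-to-sound rules plus a list of exceptions that gets checked first. This is an assumption about how such a system might be organized, not a description of Cepstral's engine; the tiny rule table, the phoneme labels, and the `pronounce` function are invented for the example.

```python
# Hypothetical letter-to-sound rules, covering only the letters used below.
LETTER_TO_SOUND = {"o": "AH", "f": "F", "i": "IH"}

# Hypothetical exception list for words that break the rules.
EXCEPTIONS = {"of": ["AH", "V"]}


def pronounce(word, exceptions=EXCEPTIONS):
    """Return phoneme labels for `word`, checking the exception list first."""
    word = word.lower()
    if word in exceptions:
        return list(exceptions[word])
    # Fall back to one sound per letter; unknown letters raise KeyError.
    return [LETTER_TO_SOUND[letter] for letter in word]


print(pronounce("if"))                 # the rules alone are fine here
print(pronounce("of"))                 # the exception supplies the "v" sound
print(pronounce("of", exceptions={}))  # without it, "of" comes out with an "f"
```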

I devoted about three hours a day for several weeks to getting the project recorded, and managed to soldier through it, not only voicing all the words and sentences, but editing them into individual sound files. Apparently it was worth it: the Cepstral Allison TTS voice is the number one selling voice for Cepstral, and is offered as a very useful add-on for purchasers of the Asterisk PBX. For the speaking-disabled, TTS allows clear, real-time communication; its other applications, in converting the written word to audio format, are immeasurably vast and key to its growth and evolution. While it will never “replace” me (I’ve had a few clients who have tried doing longer paragraphs, and one client who even tried, unsuccessfully, to “forge together” an entire on-hold system using strictly my TTS voice), the text-to-speech utility is ideal for filling in gaps, smithing together proper and place names, or simply bridging together prompts which need integration. While the Allison TTS voice, just by the volume of material which built it, is a formidable and extensive TTS utility, it will always be identifiable as “mechanized” and is never apt to be mistaken for an organic recording.

…type anything in, and I’ll say it. Yes, anything. My husband is prone to typing in things like: “You are correct 100% of the time!” or “There are no chores for you today!”; hearing them in a slightly robotic, manufactured style is better than not hearing them at all….

Thanks for reading! Next blog, I’ll dig deeper into the voices which tell you when to turn — the occasionally vexing world of GPS voices!

