Text To Speech

I was thrilled a couple of years ago when I was approached by Cepstral, one of the premier architects of high-quality, natural-sounding voice synthesis products, to be one of their text-to-speech voices, and I was even thrilled by their very public “proposal”. They did a presentation at Astricon one year, and while they were discussing their range of available voices, a slide appeared on the screen which read: “Coming soon: The Allison Voice!”

Geez, give a girl some notice. At least we’re not capturing the event on a jumbotron.

Text to Speech (TTS) synthesis is basically the artificial production of human speech. Most people’s first thought will gravitate immediately to Stephen Hawking, whose Text to Speech voice has become a part of his persona. Legend has it that Cepstral, who designed his initial TTS utility, has offered him numerous “upgrades” and more current and evolved versions throughout the years for him to experiment with. He has turned them all down. His early, rudimentary “voice” works well; it is recognizable, and most significantly, it has practically become a part of who he is. Text to Speech products immeasurably enhance the lives of those unable to speak, and it’s imperative that the user and voice connect on a visceral level.

A Text to Speech system converts normal language text into speech by concatenating pieces of recorded speech which are stored in a database. Phonemes are the individual sounds of a language, and graphemes are the letters or letter groups which spell them; the system matches the graphemes of each typed word to the phonemes to which they should correspond, and retrieves the recorded sound “fragments” for those phonemes. The storage of entire words and even sentences allows for high-quality output, but is laborious and time-intensive to record.
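For the technically curious, here is a minimal sketch of that concatenation idea in Python. It is a toy and not how Cepstral's engine works: the lexicon, the phoneme labels, the "recorded" sample numbers, and the names `LEXICON`, `UNITS`, `WORD_RECORDINGS` and `synthesize` are all invented for illustration. It prefers a whole-word recording when one exists and otherwise strings together one fragment per phoneme.

```python
# Toy concatenative synthesis. All data below is made up for illustration.

# Hypothetical pronunciation lexicon: typed word -> phoneme sequence.
LEXICON = {
    "hi": ["HH", "AY"],
    "smith": ["S", "M", "IH", "TH"],
}

# Hypothetical unit database: phoneme -> "recorded" audio samples.
UNITS = {
    "HH": [1, 2],
    "AY": [3, 4, 5],
    "S": [6],
    "M": [7, 8],
    "IH": [9],
    "TH": [10, 11],
}

# Hypothetical whole-word recordings, used first because they sound best.
WORD_RECORDINGS = {"hi": [100, 101, 102]}


def synthesize(text):
    """Return one list of samples by concatenating stored fragments."""
    samples = []
    for raw_word in text.lower().split():
        word = raw_word.strip(".,!?")
        if word in WORD_RECORDINGS:
            # A full recording of the word exists, so use it as-is.
            samples.extend(WORD_RECORDINGS[word])
            continue
        # Otherwise look up the phonemes and join one fragment per phoneme.
        for phoneme in LEXICON[word]:
            samples.extend(UNITS[phoneme])
    return samples


print(synthesize("Hi Smith!"))
```

A word missing from the toy lexicon simply raises a `KeyError` here; a real system needs a fallback for words it has never seen.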

Tell me about it.

Cepstral’s goal, when they proposed the idea of working together, was to build a very robust TTS engine, possibly the most robust they’d ever designed. Due to the prevalence of my voice not only on the Asterisk Open Source PBX but on many other telephony platforms, they saw the advantages in recording far more “sounds” than are usually required to build a typical TTS system, so as to create as seamless an interface as possible, one which would dovetail well with pre-installed stock prompts and custom-recorded prompts alike, all voiced by me. As a way of achieving that, a script arrived which had the breadth (and thickness) of a typical major-city white pages telephone book. No problem!

In this script, I found thousands upon thousands of single words, and just as many pages of random and often nonsensical sentences (“During the period, the company continued to benefit from favorable tax effects”, or “But oh what a hit it could be”, as examples). From larger sentences, phonemes can be farmed (think of the single sounds and combinations of sounds which could be extracted from the sentence: “Julie put on her red coat and made it to the train station by nine”) and stored for retrieval when the system perceives that the “fragment” is needed. It’s not flawless, though: at a subsequent Astricon after the Allison TTS Voice was launched, one of the Digium staffers was very eager to unveil the Cepstral Allison Voice; he typed in “Hi! I’m Allison Smith!” and out of the computer I spoke: “Hi! I’m Allison Smeeeeth!” I find it hard to believe we didn’t capture the “ih” sound that the “I” in “Smith” makes, but there you have it. (One of the most difficult sounds to capture in a TTS application is, oddly enough, the word “of”. It is widely used in the English language, and it’s one of the few words where “f” is pronounced “v”. Naturally, this creates problems for TTS utilities.)
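Two of the ideas above can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the phoneme labels, the letter rules, and the names `harvest_units`, `naive_phonemes`, `to_phonemes`, `LETTER_RULES` and `EXCEPTIONS` are hypothetical and not taken from any real TTS product. The first function "farms" fragments out of a recorded sentence that has already been split into labelled sounds; the second shows why a word like “of” needs its own entry instead of the usual letter-to-sound rules.

```python
# Toy illustration only: labels, rules and sample numbers are invented.

def harvest_units(aligned_sentence):
    """Collect fragments from one recorded sentence.

    aligned_sentence is a list of (phoneme, samples) pairs, i.e. the
    recording already cut up and labelled sound by sound.
    """
    database = {}
    for phoneme, samples in aligned_sentence:
        # Keep every example of a sound so one can be picked later.
        database.setdefault(phoneme, []).append(samples)
    return database


# Hypothetical, heavily simplified letter-to-sound rules.
LETTER_RULES = {"i": "IH", "f": "F", "o": "AA"}

# Words that break the rules get an explicit pronunciation.
EXCEPTIONS = {"of": ["AH", "V"]}


def naive_phonemes(word):
    """Apply the letter rules blindly, one letter at a time."""
    return [LETTER_RULES[ch] for ch in word.lower() if ch in LETTER_RULES]


def to_phonemes(word):
    """Check the exception list first, then fall back to the rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    return naive_phonemes(word)


print(naive_phonemes("of"))  # rules alone give an "f" sound
print(to_phonemes("of"))     # the exception entry gives the "v" sound
```

With only the letter rules, “of” comes out with the same “f” as “if”; the exception table is what lets the toy produce the “v”.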

I devoted about three hours a day for several weeks to getting the project recorded, and managed to soldier through it, not only voicing all words and sentences, but editing them into individual sound files. Apparently it was worth it: the Cepstral Allison TTS voice is the number one selling voice for Cepstral, and is offered as a very useful add-on for purchasers of the Asterisk PBX. For those who are unable to speak, TTS allows clear, real-time communication; its other applications, in converting the written word to audio format, are immeasurably vast and key to its growth and evolution. While it will never “replace” me (I’ve had a few clients who have tried doing longer paragraphs and one client who even tried to “forge together” an entire on-hold system using strictly my TTS voice, unsuccessfully), the text-to-speech utility is ideal for filling in gaps, smithing together proper and place names, or simply bridging together prompts which need integration. While the Allison TTS voice, just by the volume of material which built it, is a formidable and extensive TTS utility, it will always be identifiable as “mechanized” and never apt to be mistaken for an organic recording.

Check out the Cepstral Allison Voice at: www.cepstral.com/demos

…type anything in, and I’ll say it. Yes, anything. My husband is prone to typing in things like: “You are correct 100% of the time!” or “There are no chores for you today!”; hearing them in a slightly robotic, manufactured style is better than not hearing them at all…

Thanks for reading! Next blog, I’ll dig deeper into the voices which tell you when to turn — the occasionally vexing world of GPS voices!

Allison Smith is a professional telephone voice who can be heard on telephone systems for carriers and private companies throughout the world, including platforms for Verizon, Qwest, Cingular, Sprint, Bell Canada, Hawai’ian Telcom, and Asterisk. Her website is www.theivrvoice.com.

4 Comments »

  1. John Todd Said:

    Speaking of (ha!) synthesized voices being a part of someone’s persona: here’s a really interesting speech by Roger Ebert on the loss of his voice, and some of the trials and tribulations of re-creating it from prior recordings: http://www.ted.com/talks/roger_ebert_remaking_my_voice.html

    JT

    • voicegal Said:

      JT:

      That was a *great* presentation — thank you for posting that. I think he expressed beautifully his feelings of isolation and detachment, even though he had the means to type out his feelings — it is time-intensive, and makes real-time and well-timed conversational flow all but impossible. What an extraordinary man!

  2. Tim Said:

    Thank you for the article! I know now how text to speech works.

    I use the freeware Panopreter Basic; it reads aloud with Cepstral voices too.

    • voicegal Said:

      Thanks for the feedback! I think there’s huge room to grow for TTS….

