Text-to-Speech Innovation in the 21st Century

Dr. Hossein Eslambolchi
March 2012

Speech technologies are developing faster than the growth predicted by Moore’s law. By 2020 or 2030, the accuracy of these technologies will approach 100 percent, and they will find application in many different arenas, such as airline reservation services and call centers.

These applications will be deployed globally, in any language necessary.

KEY POINTS

• Text-to-speech (TTS) systems have offered intelligible but unnatural and unpleasant speech for many years. Since 2001, text-to-speech has become more natural in sound due to groundbreaking work at various research labs across the globe and increasing computer power and memory. The industry has largely adopted this technology in the meantime.

• The ultimate goal for TTS synthesis is to produce speech that is indistinguishable from human speech. At present, this is possible only in limited applications.

• The highest-quality TTS depends on large databases of speech recordings. The latest TTS recordings reflect specific emotions, or carefully designed personae.

• TTS can be developed for limited vocabularies and specific applications with quality that approaches recorded human speech. Speech recognition technologies also achieve higher accuracy when tailored to specific applications.

• High-quality TTS can be used in network applications – for instance, reading names and addresses back to customers. Lower-quality TTS, requiring less memory, can be used for automotive systems, PDAs and mobile devices.

• TTS is achieving wide distribution. Applications in the network include synthetic voice reading of email messages, the textual content of websites and content from up-to-the-minute information services.

• TTS presents tremendous opportunities, including uses in motor vehicles, PDAs, wireless devices, games and toys.

THE TECHNOLOGY

TTS synthesis enables computers to transform text into spoken language. Speech synthesizers have a long history: the first speech synthesis system, called Voder, was introduced at the 1939 New York World’s Fair. Progress has continued ever since.

Until the late 1980s, the state-of-the-art TTS system was the Digital Equipment Corporation’s DECTalk, which was often characterized as sounding like a drunken Swede. However, in the last 10 to 15 years, new technology has developed to make speech much more natural. Advances since the late 1990s take advantage of increased computer power to create much more natural-sounding synthetic voices.

Older TTS systems produce speech using software based on linguistic rules and models constructed by analyzing human speech. The older technology requires limited memory and only moderate computing resources. It produces voices with high intelligibility, but yields speech with a mechanical and often unpleasant sound. Newer TTS systems, developed since 1990, employ technology producing synthetic speech from recordings of human speech.

The leading-edge Natural Voices system uses “unit selection synthesis”, a newer, recording-based technique that stores segments of speech in a database, ranging from sub-syllable units to words and phrases, and even entire sentences. Unit selection dynamically combines units of different sizes to create synthetic speech that has a natural sound. This approach requires large databases of recorded speech. Older technology uses smaller databases with fewer units, which requires extensive modification of the units by signal processing. That modification is one of the causes of the unnatural voice quality.
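The selection step can be illustrated with a small sketch. It is not the Natural Voices implementation; it assumes the commonly described formulation in which each candidate unit has a "target cost" (how well it matches what is wanted) and a "join cost" (how smoothly it follows the previous unit), and a dynamic-programming search picks the cheapest sequence. The Unit fields, the pitch-only cost functions and the select_units name are all hypothetical, chosen only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str       # the sound or word this recorded segment covers
    pitch_hz: float  # toy acoustic feature (assumption: pitch only)
    unit_id: int     # position in the hypothetical recording database


def target_cost(unit, wanted_pitch):
    # How far the recorded unit is from the pitch we want (assumed metric).
    return abs(unit.pitch_hz - wanted_pitch)


def join_cost(prev, nxt):
    # Units that were adjacent in the original recording join for free;
    # otherwise penalize the pitch jump at the boundary (assumed metric).
    if nxt.unit_id == prev.unit_id + 1:
        return 0.0
    return abs(prev.pitch_hz - nxt.pitch_hz)


def select_units(targets, database, join_weight=1.0):
    """Return the lowest-cost unit sequence for targets.

    targets is a list of (label, wanted_pitch) pairs.
    """
    candidates = [[u for u in database if u.label == label]
                  for label, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("database has no unit for some target label")

    # costs[j] = best total cost of any path ending in candidates[i][j]
    costs = [target_cost(u, targets[0][1]) for u in candidates[0]]
    back = []  # back[i-1][j] = index of the best predecessor at step i-1
    for i in range(1, len(targets)):
        prev_units = candidates[i - 1]
        new_costs, pointers = [], []
        for u in candidates[i]:
            totals = [costs[j] + join_weight * join_cost(p, u)
                      for j, p in enumerate(prev_units)]
            best_j = min(range(len(totals)), key=totals.__getitem__)
            new_costs.append(totals[best_j] + target_cost(u, targets[i][1]))
            pointers.append(best_j)
        costs = new_costs
        back.append(pointers)

    # Trace back from the cheapest final unit.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

With a larger database, the search has more chances to find long runs of units that were recorded together, so fewer joins are needed. This is one way to see why the approach benefits from large recording databases.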

TTS synthesis can be employed in any application where information is delivered by voice or where a system queries a user with a voice prompt. There are many applications of TTS in telecommunications, including voice rendering of dynamic content like names and addresses, email messages and text information on web pages. TTS is one of the enabling technologies for virtual agents, and is an absolute necessity for next-generation CRM and IVR systems that aspire to solve customer problems instead of merely route calls.

TTS can be used to provide voice output for information stored in databases, including contact information, navigation and directions, restaurant locations and menus, movie guides, talking travel guidebooks and many other services.

There are a myriad of other uses for TTS, including access to large information stores such as talking books, online catalogues, encyclopedias, reference books, law volumes and even talking appliances.

Producing customized voices is a new capability of TTS technology. Synthetic voices with desirable characteristics will be owned by a particular company or used in a particular application. The voice of a well-known person can be copied, and a synthetic version can be created that possesses the characteristics of the original. With enough recorded speech as a resource, the voice of someone from the past might even be simulated. This has profound implications for “voice branding”: virtual spokespersons that represent a particular product or service.

The latest developments in TTS aspire to create more expressive speech. TTS R&D groups around the globe are now trying to control expressiveness and emotions in the synthesized voice and to get away from the standard and somewhat monotonous “newsreader” style. Data-driven methods that rely on large databases of training material are taking the early lead here as well. Synthetic agents will soon express a full range of emotions and/or expressions.

TTS systems will soon be able to produce synthetic speech that is more natural than today’s best systems. The increasing availability of computer memory, even on small devices, and decreasing memory costs will make this improvement possible.

Natural-sounding TTS will play an even more important role in the future as new infrastructure, such as VoiceXML and SALT, is introduced. This will make it easier to integrate TTS into Internet and telephony applications.

THE PLAYERS

The number of companies active in the TTS marketplace continues to grow as the technology matures and the marketplace expands. Higher-quality voices, support for additional languages and accents, and products designed for small devices and desktops are becoming available at a quickening pace.

• ScanSoft offers the RealSpeak platform, originally developed by Lernout & Hauspie. It offers TTS in 19 languages in versions designed for automobiles and mobile devices, personal computers and multimedia uses, and telecommunications applications.

• Microsoft has concentrated its TTS R&D in Beijing, China, with a group of about 20 people. The company’s Chinese Mandarin TTS is good, but TTS in other languages is lagging behind in voice quality and intelligibility. Expect Microsoft to look outside TTS for other sources to fill the growing need for natural-sounding translation in multiple languages.

• IBM offers TTS synthesis as part of its WebSphere Voice Server using old “formant” technology and newer “concatenative” technology. It has licensed the ScanSoft RealSpeak platform. It supports high-quality, limited-vocabulary TTS using what it calls phrase splitting. IBM offers its ViaVoice TTS SDK for Windows platforms through Wizard Software. Brazilian Portuguese, Canadian French, French, Finnish, German, Italian, Mexican Spanish, Spanish, Simplified Chinese, U.K. English and U.S. English are supported.

• Fonix offers TTS in 11 languages and for embedded devices. It now owns rights to the DECTalk technology.

• Cepstral is a spin-off of Carnegie Mellon University that focuses on TTS. It has developed TTS capabilities for devices such as wireless phones and PDAs for games, education and other applications. Besides US English, Cepstral supports French Canadian TTS.

• Acapela was created in 2004 by a merger between Babel Technologies and Elan Speech as a spin-off of Mons Technical University in Belgium. Babel earlier had acquired a Swedish company, Infovox, a specialist in speech synthesis using older technology. The Acapela TTS is currently available in twenty-three languages: Arabic, Czech, Danish, Dutch, Dutch, English (UK), English (US), Faroe, French, Finnish, German, Greek, Icelandic, Italian, Norwegian, Polish, Portuguese, Portuguese (Brazil), Russian, Spanish, Spanish (South America), Swedish and Turkish. The offering targets telecommunications applications, PC multimedia and cellular phones.

• Mindmaker has a TTS system called FlexVoice, which uses a hybrid technology approach requiring limited memory resources. Different flavors of FlexVoice are aimed at desktop devices, small devices such as mobile phones, Internet-based applications and telephony applications. FlexVoice is currently available in English and Hungarian, and work is underway for Czech, Slovak, Croatian, Malay, and Southeast Asian-accented English.

• NeoSpeech offers a TTS system called VoiceText, available in different configurations for use on small devices, personal computers or servers. VoiceText is available for US English, Mandarin Chinese, Japanese and Korean, with many other languages in development. NeoSpeech seems to be the fastest-growing TTS vendor.

• Sakrament, a company in Belarus, offers TTS for Russian, Belarusian, Ukrainian and U.K. English, with voices in U.S. English and Polish in the works. Sakrament TTS supports a wide range of platforms.