Speech Translation in the 21st Century

By Dr. Hossein Eslambolchi
March 2012

A decade ago, I predicted that speech translation would dominate technological innovation in this decade. With the advent of smartphones and more capable processors and memory, that prediction seems to be coming true.

In the 1980s, Moore’s Law claimed that the capabilities of electronic devices double about every 18 months at a fixed price. Optical technologies are bucking Moore’s Law, developing somewhat faster than expected – 1.5 times faster than electronics. And speech technologies are leaving Moore’s Law in the dust, developing 10 times faster than the law predicts they should.
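
To make these rates concrete, here is a rough sketch that compounds each one over a fixed period. It assumes that "N times faster" means a doubling period N times shorter than the 18-month baseline, which is only one possible reading of the comparison. The function and variable names are illustrative.

```python
# Assumption: "N times faster than Moore's Law" is read as a doubling
# period N times shorter than the 18-month baseline.
MOORE_DOUBLING_MONTHS = 18.0


def growth_factor(years, speedup=1.0):
    """Capability multiple after `years`, doubling every 18/speedup months."""
    doubling_months = MOORE_DOUBLING_MONTHS / speedup
    return 2.0 ** (years * 12.0 / doubling_months)


# Compare the three technologies named in the text over a three-year span.
for label, speedup in [("electronics", 1.0), ("optical", 1.5), ("speech", 10.0)]:
    print(f"{label:12s} x{growth_factor(3, speedup):.3g} over 3 years")
```

Other readings of "faster" (for example, a larger multiple per 18-month period) would give different numbers, so the output is best treated as an illustration of compounding rather than a forecast.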

This rapid development has made a range of speech technologies possible across many end points, including automobiles, medical and smartphone applications and, as the 21st century unfolds, robotics.

What should we look for in the development of speech technology in the coming decade?

• Speech-to-Speech Translation (SST) maps spoken words from one language into another. It combines three components: Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text-to-Speech Synthesis (TTS).

• Projections indicate that language or text translation will become a $50 billion business by the end of 2012, according to a report by Allied Business Intelligence. Only a small fraction – $250 million, or about 0.5 percent – is attributed to computerized language translation sales. The rest is essentially human translation.

• SST has been demonstrated with reasonable success in limited applications, such as travel and hotel reservations. Open domain SST for fluent speech remains a challenging research problem.

• There are significant applications for SST, especially in the medical, military, legal and tourism sectors. SST will help facilitate speech mining of large multilingual data sources and offer real-time translation for on-demand media captions and network-based two-way communication.

• Tier 1 researchers have been conducting research in SST for more than 15 years, and have secured novel patents that apply the concept of stochastic translation (i.e. learning from examples).

• There are currently no commercial systems for unrestricted SST, although there are numerous products for text translation. Given the global market leadership service providers enjoy in voice, data and IP services, it seems natural that service providers will lead the research and commercialization of SST services.
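
The idea of stochastic translation mentioned above, learning from examples, can be illustrated with a toy model. The sketch below is not the patented approach; it simply estimates translation probabilities by relative frequency from a handful of aligned phrase pairs. The example data and function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical aligned phrase pairs (English -> Spanish), invented for illustration.
EXAMPLES = [
    ("good morning", "buenos días"),
    ("good morning", "buenos días"),
    ("good morning", "buen día"),
    ("thank you", "gracias"),
]


def train(pairs):
    """Estimate P(target | source) by relative frequency over the examples."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    return {
        source: {t: c / sum(targets.values()) for t, c in targets.items()}
        for source, targets in counts.items()
    }


def translate(model, source):
    """Return the most probable target phrase, or None if the source is unseen."""
    candidates = model.get(source)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


model = train(EXAMPLES)
print(translate(model, "good morning"), model["good morning"])
```

A real system would learn from far larger corpora and handle phrases it has never seen, which is part of why open domain SST remains hard.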


SST allows people who speak different languages to communicate more effectively by removing the language barrier. It poses three basic challenges:

1. ASR: Ability to recognize unrestricted conversational speech by different speakers with unique speaking styles, under varying recording conditions;

2. SLT: Learning to accurately map spontaneous spoken phrases from a source language onto a target language in the presence of ASR errors;

3. TTS: Performing expressive speech synthesis that simulates the human voice and conveys the intent of the original speaker.
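
The three stages above chain naturally into a pipeline. The following is a minimal sketch of that structure, assuming each stage can be treated as a simple function. The class name, the toy stand-in functions and the phrase table are all hypothetical, and the "audio" here is just UTF-8 bytes rather than real speech.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechTranslator:
    """Chains the three SST stages: ASR -> SLT -> TTS."""
    recognize: Callable[[bytes], str]   # ASR: source audio -> source text
    translate: Callable[[str], str]     # SLT: source text -> target text
    synthesize: Callable[[str], bytes]  # TTS: target text -> target audio

    def __call__(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        # In a real system this stage would have to tolerate ASR errors.
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stand-ins (hypothetical): "audio" is just the UTF-8 bytes of the words.
PHRASES = {"where is the hotel": "dónde está el hotel"}


def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8").lower().strip()


def toy_slt(text: str) -> str:
    # Fall back to the untranslated input when no phrase matches.
    return PHRASES.get(text, text)


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


sst = SpeechTranslator(toy_asr, toy_slt, toy_tts)
print(sst(b"Where is the hotel ").decode("utf-8"))
```

Keeping the stages separate makes it possible to swap in a better recognizer or synthesizer without touching the translation step, though it also means errors from one stage flow directly into the next.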

Language translation has been driven largely by the globalization of the world economy and the commercialization of the Internet. The language translation market, which includes both SST and Machine Translation (MT) from text, has been projected to reach $50 billion by the end of 2012. About $5 billion of this estimate is attributed to translation services in China alone.

There are numerous applications for SST: legal, commercial and technical translation; medical and web mining services; translation of government records; real-time on-demand captioning of television and radio programs; and “simple” two-way human-to-human communication.

The web provides a major business opportunity for MT given that non-English content will soon comprise 60 percent of the total information on the web, and is generally growing faster than English-language content. According to Global Promote, over 57 percent of the more than one billion Internet users speak a language other than English.

Languages Spoken by Number of Internet Users:

  • Chinese: 120 million
  • Japanese: 100 million
  • Spanish: 65 million
  • German: 55 million
  • Korean: 39 million
  • Italian: 36 million
  • French: 36 million
  • English: 35 million
  • Portuguese: 30 million
  • Russian: 30 million
  • Dutch: 26 million

Although the idea of using computers to translate languages was first investigated in the early 1950s, the problem is far from solved. Research in SST only began about 15 years ago when it became possible to recognize and synthesize continuous speech with reasonable quality.

Despite the huge opportunities that exist today, no commercial products for unrestricted SST are currently on the market. For more restricted applications, VoxTec, a division of Marine Acoustics, offers a translation device called the Phraselator that costs approximately $2,000 and can translate more than 15,000 phrases in 53 languages.

There are numerous research programs exploring SST:

Apple is becoming one of the largest providers of text translation software in the world through its Siri application on Apple smartphones. It invests in the development of tools and technologies that enhance translation. Some of these product lines will be incorporated into robotics technology in the 21st century.

Google offers both text and document translation for about ten languages. Google relies on a large corpus of texts that are available in multiple languages. Its translation service is available at no charge on its website.

IBM offers text and document translation solutions for the enterprise with its WebSphere Translation Server. The server version, launched in 2001, can support text translation for about ten languages. Customers of this product include Deutsche Bank, Cable and Wireless, the United States Social Security Administration, GE Financial and Sutter Health. Nuance supplies IBM's main speech engine. IBM’s advances are helping the technology to exceed Moore’s Law, and IBM will continue to be a formidable player in the industry.

Global service providers realized the importance of language translation services back in the 1980s. In 1989, global service providers purchased a service named Communications and Language Line (CALL), which later became Language Line. Language Line provided over-the-phone human interpretation from English into more than 150 languages and was available around the clock. It was sold in 1999 to Providence Equity Partners. SST research in France began in 1989 with the introduction of the VEST system – a two-way English/Spanish translation project with Telefonica in Spain. Since then, global service providers have pioneered a number of key technical innovations, including the use of head transducers and finite state transducers.

Recently, these innovations have been demonstrated on a number of prototype systems for instant messaging, spoken language dialog, call center automation, real-time broadcast news and two-way translation, most notably at the Joint Warrior Interoperability Demonstration (JWID) forum in 2003.

In 2004, the finite state approach was demonstrated for the first time on a real-time television news broadcast using an English-to-Spanish speech translation system. These innovations were central to the global service providers' $36 million proposal for the DARPA GALE BAA program in 2005.

In 2012, we are already seeing significant improvements in speech-based technologies, and can expect the rate of innovation to increase dramatically throughout the year. Several leading companies, including Apple, IBM, Microsoft and Google, are investing heavily in R&D in this area.