IBM edges closer to human speech recognition

BI Intelligence

This story was delivered to BI Intelligence Apps and Platforms Briefing subscribers.

IBM has taken the lead in the race to build speech recognition with an error rate on par with that of humans.

As of last week, IBM’s speech-recognition team had achieved a 5.5% error rate, a marked improvement over its previous record of 6.9%.

Digital voice assistants like Apple's Siri and Microsoft's Cortana must meet or outdo human speech recognition — which is estimated to have an error rate of 5.1%, according to IBM — in order to see wider consumer adoption. Voice assistants are expected to be the next major computing interface for smartphones, wearables, connected cars, and home hubs.

While digital voice assistants are far from perfect, competition among tech companies is improving voice-recognition capabilities overall. IBM is locked in a race with Microsoft, which last year developed a voice-recognition system with an error rate of 5.9%, according to Microsoft's Chief Speech Scientist Xuedong Huang; that beat IBM's then-record of 6.9% by a full percentage point.

Despite the progress, there is no industry standard for measuring voice recognition, which makes it difficult to truly gauge advances in the technology. IBM tested a combination of “Long Short-Term Memory” (LSTM) models, a type of artificial neural network, and WaveNet language models from Google-owned DeepMind against SWITCHBOARD, a collection of recorded human conversations. And while SWITCHBOARD has been regarded as a benchmark for speech recognition for more than two decades, other test sets, like “CallHome,” are more difficult for machines to transcribe, IBM notes. On CallHome, the company achieved a 10.3% error rate.
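For readers unfamiliar with how such figures are produced, the sketch below shows one common way an error rate can be computed. It assumes the percentages cited here are word error rates, meaning the number of word substitutions, insertions, and deletions needed to turn the machine transcript into the human reference, divided by the length of the reference. The function name and sample sentences are hypothetical and for illustration only; this is not IBM's or Microsoft's scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: real benchmark scoring also normalizes punctuation,
    contractions, and filler words before comparing transcripts.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was missed
                curr[j - 1] + 1,     # insertion: extra word in the transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of five.
sample_wer = word_error_rate("the quick brown fox jumps",
                             "the quick brown box jumps")
print(f"{sample_wer:.1%}")
```

By this measure, a 5.5% error rate corresponds to roughly one wrong word in every 18 spoken.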

Moreover, voice assistants need to overcome several hurdles before mass adoption occurs:

  • They need to surpass “as close as humanly possible.” Despite recent advancements, speech recognition needs to reach roughly 99% accuracy for voice to be considered the most efficient form of computing input, according to Kleiner Perkins analyst Mary Meeker. That’s because expectations for automated services are much less forgiving than they are for human error. In fact, when a panel of US smartphone owners was asked what they most wanted voice assistants to do better, "understand the words I am saying" received 44% of votes, according to MindMeld.
  • Consumer behavior needs to change. For voice to truly replace text or touch as the primary interface, consumers need to be more willing to use the technology in all situations. Yet relatively few consumers regularly use voice assistants; just 33% of consumers aged 14-17 did so in 2016, according to an Accenture report.
  • Voice assistants need to be more helpful. Opening third-party apps to voice assistants will be key to giving consumers a use case more in line with future expectations of a truly helpful assistant. Voice assistants like Siri, Google Assistant, and Echo are only just beginning to gain access to these apps, enabling users to carry out more actions, such as ordering a car.