YouTube’s AI can now describe sound effects

by Micah Singleton (@MicahSingleton), Mar 24, 2017, 3:17pm EDT

YouTube has had automated captions for its videos since 2009, and now it’s expanding the feature to include captions for sound effects. The video service uses machine learning to detect sound effects in videos and add the captions [APPLAUSE], [MUSIC], and [LAUGHTER] to millions of videos.

While those three were some of the most frequent manually captioned sounds, YouTube says it’s only in the early stages of making improvements for its deaf and hard of hearing users. The company says captions like ringing, barking, and knocking are next in line, but those require more deciphering than simple laughter or music.

The improved captions are now available on YouTube.

IBM edges closer to human speech recognition

BI Intelligence

IBM has taken the lead in the race to build speech recognition that matches the error rate of human listeners.

As of last week, IBM’s speech-recognition team achieved a 5.5% error rate, a vast improvement from its previous record of 6.9%.

Digital voice assistants like Apple's Siri and Microsoft's Cortana must meet or outdo human speech recognition — which is estimated to have an error rate of 5.1%, according to IBM — in order to see wider consumer adoption. Voice assistants are expected to be the next major computing interface for smartphones, wearables, connected cars, and home hubs.

While digital voice assistants are far from perfect, competition among tech companies is bolstering overall voice-recognition capabilities. IBM is locked in a race with Microsoft, which last year developed a voice-recognition system with an error rate of 5.9%, according to Microsoft's Chief Speech Scientist Xuedong Huang; at the time, this beat IBM's 6.9% by an entire percentage point.

Despite progress, however, there is no industry standard for measuring voice recognition, which makes it difficult to truly gauge advances in the technology. IBM tested a combination of “Long Short-Term Memory” (LSTM), a type of artificial neural network, and Google-owned DeepMind’s WaveNet language models against SWITCHBOARD, a series of recorded human discussions. And while SWITCHBOARD has been regarded as a benchmark for speech recognition for more than two decades, other test sets, like “CallHome,” are more difficult for machines to transcribe, IBM notes. Using CallHome, the company achieved a 10.3% error rate.
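The error rates quoted here are word error rates: the number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. A minimal Python sketch of that calculation follows. The function name and the sample sentences are illustrative assumptions, not IBM's or Microsoft's scoring tools, and real benchmark scoring also normalizes text in ways this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript (deletion)
                curr[j - 1] + 1,     # extra word in the transcript (insertion)
                prev[j - 1] + cost,  # match, or wrong word (substitution)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word and one dropped word out of eight.
reference = "the improved captions are now available on youtube"
hypothesis = "the improved caption are now available youtube"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 25.0%
```

By this measure, a 5.5% error rate means roughly one word in 18 is wrong, missing, or extra.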

Moreover, voice assistants need to overcome several hurdles before mass adoption occurs:

  • They need to surpass “as close as humanly possible.” Despite recent advancements, speech recognition needs to reach roughly 95% accuracy for voice to be considered the most efficient form of computing input, according to Kleiner Perkins analyst Mary Meeker. That’s because expectations for automated services are much less forgiving than they are for human error. In fact, when a panel of US smartphone owners was asked what they most wanted voice assistants to do better, "understand the words I am saying" received 44% of votes, according to MindMeld.
  • Consumer behavior needs to change. For voice to truly replace text or touch as the primary interface, consumers need to be more willing to use the technology in all situations. Yet relatively few consumers regularly employ voice assistants; just 33% of consumers aged 14-17 regularly used voice assistants in 2016, according to an Accenture report.
  • Voice assistants need to be more helpful. Opening third-party apps to voice assistants will be key to giving consumers a use case more in line with future expectations of a truly helpful assistant. Voice assistants like Siri, Google Assistant, and Echo are only just beginning to gain access to these apps, enabling users to carry out more actions like ordering a car.

The Google Hangout that will change the way I will view communication forever


Last week, I had the great honor to speak with three very awesome people in a Google+ Hangout: Christian Vogler, director of the Technology Access Program at Gallaudet; Andrew Phillips, policy counsel at the NAD; and Willie King, the director of product management at ZVRS.

These three folks have one thing in common – they are deaf.

I can’t understand sign language, and I speak way too fast for anyone to be able to read my lips. How did it go without a hitch? Thanks to a Google+ Hangout app announced by the Google Accessibility team last week, a sign language expert and a fantastic CART transcriptionist, Laura Brewer. All in real-time, all virtually.

Full story...

Launching a Business in 54 Hours

December 23, 2011 | 4:04 pm by Anne Fisher

“I was skeptical at first about whether you can really accomplish much in just one weekend,” said Ryan Flynn, a graphic designer who worked on voice-recognition technologies at Motorola before moving to New York from Chicago in 2010. Mr. Flynn is the founder of a fledgling enterprise called Closed Capp, whose first product is a mobile app for real-time closed captioning, aimed at giving the hearing-impaired access to cellphone conversations.

Mr. Flynn’s initial doubts soon dissolved: Startup Weekend introduced him to two fellow techies who “helped me figure out a workable business model and refine the technology,” he said. “It was great to have two solid days to focus on solutions to things I’d been stuck on. We had a working prototype by the end of the weekend.” He also had several promising conversations with interested investors, he added....

VR on Android's Ice Cream Sandwich is Awesome



...It’s a pretty substantial upgrade, as significant as Apple Inc.’s recent rollout of its improved iPhone software. Up to now, Google has delivered separate versions of Android for phones and tablet computers. Ice Cream Sandwich combines valuable features from both versions and adds a lot of welcome improvements.

One of the biggest is in speech recognition. No, Android phones still haven’t caught up with Apple’s voice-controlled personal assistant Siri. But Ice Cream Sandwich makes it far easier to dictate e-mails and text messages. With earlier versions, you basically spoke one phrase or sentence, then waited for the software to do its stuff. Now, the process is continuous. Just keep talking, and remember to verbally add punctuation marks, like “comma” and “period.” The software transcribes sentence after sentence with surprising speed and impressive accuracy. It’s so good that you might start dictating all your text communications....

This upgrade works with Closed Capp and is awesome once you get used to speaking your punctuation. So exciting!

Closed Capp update - Now with Keyboard Entry


We updated the Android App yesterday to include a keyboard entry mode as well. A new button on the screen easily switches you back and forth between modes so you can continue the conversation easily. Text that you type appears large on screen for others to read.

If you have used Closed Capp in your daily life, we would love to hear about your experience. Please send us your stories or find us on Twitter (@ClosedCapp) and Facebook.

Closed Capp wins the Twilio Sponsors' Prize for Startup Weekend NYC!


Closed Capp was chosen to win the Twilio Sponsors' Prize for Startup Weekend NYC, November 20, 2011. Thanks to Mason Du and Seth Hosko for helping push this idea forward, and thank you to Twilio for the award!

It was a great weekend and we were able to make this app much more functional, letting the speaker keep a more natural pace during a conversation. We hope you find this app useful & we are working to make it even better in the future! 

Can you hear me now? 1 in 5 in U.S. suffers hearing loss

By Linda Carroll for MSNBC, November 14th, 2011

Nearly one in five Americans has significant hearing loss, far more than previously estimated, a first-ever national analysis finds.

That means more than 48 million people across the United States have impairments so severe that it’s impossible for them to make out what a companion is saying over the din of a crowded restaurant, said Dr. Frank Lin, author of a new study published in the latest issue of the Archives of Internal Medicine.

“It’s pretty jaw-dropping how big it is,” said Lin, an assistant professor of otolaryngology and epidemiology at the Johns Hopkins School of Medicine.

Previous estimates had pegged the number affected by hearing loss at between 21 million and 29 million.

Lin and other researchers were surprised at the magnitude of the problem, but the significance of the findings goes beyond the “wow” factor, he said.

That’s because other studies have shown that hearing decline is often accompanied by losses in cognition and memory. Further, Lin said, some studies have associated hearing loss with a greater risk of dementia.

Lin’s study is the first to look at hearing loss in a national sample of Americans aged 12 and older who have actually had their hearing tested. Earlier studies were smaller or depended on people’s self-reports of hearing loss.

Full Story...

Microsoft to unveil a ‘breakthrough’ in speech recognition


29 August '11, 08:56pm


Earlier today, Microsoft Research released a blog post promising that at Interspeech 2011, an event that is underway, the company would unveil a ‘breakthrough’ in speech recognition.

Importantly, the development does not deal with speech recognition that requires the user to ‘train’ the system, but instead involves “real-time, speaker-independent, automatic speech recognition.” In other words, true recognition of human speech.

Microsoft claims that it has managed to “dramatically improve the potential” of this sort of technology becoming commercially functional. Through the use of deep neural networks, the company has managed to improve the accuracy of ‘on the go’ speech recognition, something that is a near holy grail of technology. How the team managed to execute the breakthrough is exceptionally technical, so we will not summarize it here; it is a topic that requires extensive background knowledge to follow. Microsoft’s blog post has all the information, if you’re curious.

As for the results of what Microsoft Research has built, this is the crucial revelation: “The subsequent benchmarks achieved an astonishing word-error rate of 18.5 percent, a 33-percent relative improvement compared with results obtained by a state-of-the-art conventional system.” The company claims that this has “brought fluent speech-to-speech applications much closer to reality.”
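A “relative improvement” is easy to misread as a drop in percentage points, so a little arithmetic helps. The Python sketch below works backward from the two figures in the quote to the baseline error rate they imply. The function names are hypothetical, and the baseline it prints is back-of-the-envelope arithmetic on the rounded numbers above, not a figure reported by Microsoft.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Fraction of the baseline error rate that the new system removed."""
    return (baseline - improved) / baseline


def implied_baseline(improved: float, relative_gain: float) -> float:
    """Baseline error rate implied by a new rate and its relative gain."""
    return improved / (1.0 - relative_gain)


new_wer = 18.5  # percent, from the quoted blog post
gain = 0.33     # 33 percent relative improvement, from the same quote

# Assumption: both quoted figures are rounded, so this is approximate.
baseline = implied_baseline(new_wer, gain)
print(f"implied baseline: {baseline:.1f}%")                   # 27.6%
print(f"absolute drop: {baseline - new_wer:.1f} points")      # 9.1 points
```

In other words, a 33 percent relative improvement here corresponds to a drop of about nine percentage points, from roughly 27.6 percent to 18.5 percent.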

That said, this remains very much a research project. The company made that abundantly clear in the discussion of its progress.

This project is not simply an interesting technical problem, but something that Microsoft desperately needs solved. The company is forging ahead with what it calls Natural User Interface integration (think the Kinect, voice to text, and so forth), and so it needs a better voice solution. The company must have its eyes on its Research division, pushing them towards a commercially viable product that can be integrated across the world of its products.

For now, this is one step, albeit an important one.

How Speech Technologies Will Transform Mobile Use

by Phil Hendrix

 Jul. 7, 2009

Mobile is enjoying what may be called a virtuous spiral, driven by compelling new devices, innovative new applications and faster networks that are making mobile broadband a reality. At the center of this phenomenon are user-friendly interfaces, especially the touch screen, which have fueled adoption and use of mobile apps. While much improved, mobile use still demands considerable attention: for example, viewing displays, entering text, and navigating through the UI. Advances in speech technologies will fundamentally alter the way in which users experience mobile devices and apps: devices secured with voice authentication; individuals (including the sighted and visually impaired) enjoying mobile content and apps without ever having to touch or view a device; new applications, from search to language translation and others, enabled by speech recognition; and in many other ways. Despite these advances, no OEM or operator to date has delivered a functional, easy-to-use, well-integrated speech solution, or ensured that users are aware of, understand how to use, and benefit from speech functionality; current devices are merely “speech equipped.” This briefing describes these important developments and outlines opportunities for operators and developers to introduce and capitalize on innovative new, speech-optimized solutions.