[Updated 5/13/16, 9:56 a.m. See below.] Computer scientists have spent years trying to train software-powered machines to reliably understand and respond to human speech. They seem to have turned a corner recently.
Two of the latest examples were highlighted Thursday in Boston, as researchers from Amazon and Mitsubishi Electric Research Laboratories (MERL) briefly discussed what they’ve been working on at their companies’ local outposts in Cambridge, MA. The speeches were part of the Re-Work Deep Learning Summit, a two-day business conference paired with Re-Work’s Connected Home Summit.
The talks were notable because Amazon and Mitsubishi are usually quiet about the work under way at their local research offices. Xconomy last dug into MERL in 2007, when it went through a significant restructuring that involved a wave of layoffs and resignations. The corporate research group currently employs 83 people, plus some visiting scientists, CEO Richard Waters said in an e-mail to Xconomy. [This paragraph updated with latest staff total.]
Amazon, meanwhile, is starting to talk more openly about the activities at its growing research lab in Kendall Square, which includes developing the speech recognition tech behind the Echo and Amazon’s other voice-controlled products (pictured above). The Echo—an Internet-connected speaker powered by the company’s virtual voice assistant, Alexa—has become a hit with consumers and grabbed tons of press in recent months.
Amazon has done a lot of work in “far-field automatic speech recognition”—meaning the Echo can pick up voice commands from someone across the room. Understandably, a lot of challenges go into that, Amazon principal research scientist Spyros Matsoukas said at the summit Thursday.
The hurdles include tuning out background noise, overcoming degraded acoustics that result from the reverberation of sound waves, and learning to understand different accents and speaking styles. Then there are hurdles that might be less obvious, like how Alexa distinguishes between words and phrases that sound the same (“Sundays” versus “sun daze”) or predicts the spoken forms of unconventionally spelled names (think musical artist P!nk).
“The goal is to convert spoken audio into text” that software programs can understand “with high accuracy,” Matsoukas said. The Echo isn’t perfect, but the general consensus among reviewers is that it does a good job of understanding users.
Meanwhile, John Hershey—a MERL senior principal research scientist who leads its speech and audio team—talked about how he and his colleagues have been trying to “crack the cocktail party problem.” That means creating software that can parse speech when there are multiple voices speaking simultaneously. “This is a really old problem,” Hershey acknowledged.
One way researchers have approached it is by trying to better understand the way the human brain processes cluttered sounds, Hershey said. It’s still a matter of debate, but he agrees with scientists who say that the brain separates the various sounds into different signals in order to focus on one source. “We’re recognizing so much about the sounds that we must be separating” them, he said.
It’s difficult to develop software that can figure out “where one speaker’s signal stops and where another begins,” but Hershey’s team has been working on an algorithm that uses deep learning techniques to do exactly that. The software can take a single-channel audio recording of multiple unknown speakers, separate the voices, and render clean audio of each one. “Our performance is getting very good,” he said.
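Hershey didn’t walk through the details onstage, but many single-channel separation systems, deep-learning ones included, build on the idea of time-frequency masking: estimate which parts of the spectrum belong to each speaker, then zero out everything else and reconstruct. A toy sketch in Python illustrates the framework, with synthetic tones standing in for real speech and hand-written “oracle” masks standing in for what a trained neural network would learn to predict:

```python
import numpy as np

# Two synthetic "speakers" occupying different frequency bands,
# mixed into a single-channel recording one second long.
rate = 8000
t = np.arange(rate) / rate
speaker_a = np.sin(2 * np.pi * 220 * t)   # low-pitched source
speaker_b = np.sin(2 * np.pi * 1760 * t)  # high-pitched source
mixture = speaker_a + speaker_b           # what the microphone hears

# Move to the frequency domain.
spectrum = np.fft.rfft(mixture)
freqs = np.fft.rfftfreq(len(mixture), 1 / rate)

# Oracle binary masks: assign bins below 1 kHz to speaker A,
# the rest to speaker B. A real system would *learn* these masks.
mask_a = freqs < 1000
est_a = np.fft.irfft(spectrum * mask_a)
est_b = np.fft.irfft(spectrum * ~mask_a)

# With well-separated tones, reconstruction is near-exact.
err_a = np.max(np.abs(est_a - speaker_a))
err_b = np.max(np.abs(est_b - speaker_b))
print(err_a < 1e-6, err_b < 1e-6)  # → True True
```

Real speech overlaps heavily in frequency, of course, which is exactly why the masks can’t be hand-written and a deep network is trained to infer them from the mixture itself.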
Such technology would have uses in robotics, assistive listening, data analysis, sound engineering, audio forensics, and more, he said.