The Journey to a Machine That Transcribes Speech as Well as Humans

As a student at the elite Tsinghua University in the early 1980s, Xuedong Huang confronted the same challenge as all other Chinese computer users.

“In China, typing was fairly difficult with a Western keyboard,” says Huang, now a Microsoft (NASDAQ: MSFT) distinguished engineer and its chief speech scientist.

To answer that challenge, he helped develop a Chinese dictation prototype, launching him on a three-decade quest that yielded, earlier this fall, a speech recognition system capable of matching human transcriptionists for accuracy.

“Having a natural user interface—that vision always did inspire many people to pursue advanced speech recognition, so I never stopped since 1982,” says Huang, who will speak at Xconomy’s upcoming Intersect event in Seattle on Dec. 8. (See the full agenda and registration details.)

The system he and his Microsoft Research colleagues developed achieved a word error rate of 5.9 percent on the benchmark “Switchboard” task—automatically transcribing hundreds of recorded human-to-human phone conversations. An IBM team previously held the record with a 6.9 percent word error rate, announced earlier this year.
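Word error rate, the yardstick behind both numbers, counts the substitutions, deletions, and insertions needed to turn a system’s transcript into a reference transcript, divided by the number of words in the reference. The benchmark’s real scoring tools handle additional details such as text normalization; the sketch below, using made-up sentences rather than actual Switchboard material, shows only the core calculation.

```python
# Minimal word error rate (WER) sketch: word-level edit distance against a reference.
# The example sentences are illustrative, not taken from the Switchboard corpus.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("so how do you feel about the weather",
          "so how you feel about a weather"))        # 1 deletion + 1 substitution over 8 words = 0.25
```

A 5.9 percent word error rate, in other words, works out to roughly one mistake for every 17 words of conversation.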

Huang marvels at the technological advances and the collective effort of the speech research community that reached this “historic milestone.”

“That’s a really big moment,” Huang says. “It’s a celebration of the collective efforts over the last 25 years for everyone in the speech research community, everyone in the speech industry, working together, sharing the knowledge.”

As a journalist, I marvel at this achievement, too. Transcription is a necessary part of my job. After interviewing Huang last week in a quiet conference room at Microsoft Research headquarters in Redmond, WA, I paid careful attention to what I actually do when I listen back to a spontaneous conversation and convert it to text. I rewind certain passages repeatedly, trying to decipher what was said through cross-talk or mumbles; pause to look up unfamiliar terms, acronyms, and proper names; draw on my knowledge of context and my understanding of colloquialisms; and adjust to an individual’s accent and patterns of speech. (More on this at the bottom.)

That machines can now do this as well as flesh-and-blood professionals—at least in certain situations—shows just how far we’ve come in giving computers human-like capabilities.

While the Microsoft team achieved a new best for machine transcription, its claim of “human parity” is in part based on a better understanding of actual human performance on the same task. It had previously been thought that the human word error rate was around 4 percent, but the source of that claim was ill-defined. Microsoft had a third-party professional transcription service undertake the same Switchboard task in the course of its normal activities. The humans erred at the same rate as the system, making many of the same kinds of mistakes.

The system that hit the human parity mark for transcription is no ordinary machine, of course. It begins with a hardware layer, codenamed Project Philly, that consists of a massive, distributed computing cluster outfitted with Nvidia GPU—graphics processing unit—chips. (GPUs, originally designed for handling video and gaming, have become workhorses of the artificial intelligence world.)

On top of that is Microsoft’s Cognitive Toolkit, an open source deep learning framework, updated last month, that makes efficient use of all that computing power.

The next layer is an ensemble of 10 complementary neural network models. Six perform the acoustic evaluation—the work of recognizing the speech—and four focus on word understanding, parsing things like context and punctuation, Huang says.
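The paper spells out how those models are combined; the basic idea of an ensemble, though, is that each model scores the same stretch of audio and the scores are pooled, so one model’s mistakes tend to be outvoted by the rest. The snippet below is a hypothetical sketch of simple score averaging with made-up dimensions, not the team’s actual combination scheme.

```python
import numpy as np

# Hypothetical score-level fusion for an ensemble of acoustic models.
# Each model emits per-frame posterior probabilities over the same set of
# phonetic units; averaging them (optionally with weights) smooths out
# errors that any single model makes on its own.

rng = np.random.default_rng(0)
num_models, num_frames, num_units = 6, 100, 40    # stand-in sizes, not the paper's

# Fake posteriors: one (frames x units) matrix per model, each row summing to 1.
posteriors = rng.dirichlet(np.ones(num_units), size=(num_models, num_frames))

weights = np.full(num_models, 1.0 / num_models)   # equal weights; tuned on held-out data in practice
combined = np.einsum("m,mfu->fu", weights, posteriors)

best_units = combined.argmax(axis=1)              # per-frame best guess, before language-model rescoring
print(best_units[:10])
```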

The various models were trained on the Switchboard conversations and several other commonly used conversational datasets, ranging in size from a few hundred thousand words to 191 million words (the University of Washington conversational Web corpus). Huang says the system relies mostly on machine learning to improve its accuracy, but it also includes what he calls “semi-supervised rules.” For example, it has access to a dictionary, which provides word pronunciations.

“That’s the knowledge people have accumulated, and we find that actually is useful to not relearn everything,” Huang says. “With human knowledge and machine learning combined, that will give us the best performance so far.”
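The dictionary he refers to is essentially a lookup table from words to phoneme sequences, so the system does not have to rediscover how every word is pronounced. Below is a toy sketch; the entries are written in the style of the public CMU Pronouncing Dictionary (stress markers omitted), and a real recognizer would also need a fallback for words the lexicon has never seen.

```python
# Toy pronunciation lexicon: accumulated human knowledge the models don't have to relearn.
# Entries follow the style of the public CMU Pronouncing Dictionary (ARPAbet phonemes,
# stress markers omitted).
LEXICON = {
    "speech":      ["S", "P", "IY", "CH"],
    "recognize":   ["R", "EH", "K", "AH", "G", "N", "AY", "Z"],
    "switchboard": ["S", "W", "IH", "CH", "B", "AO", "R", "D"],
}

def pronounce(word: str) -> list[str]:
    """Return the phoneme sequence for a word, or flag it as out-of-vocabulary."""
    try:
        return LEXICON[word.lower()]
    except KeyError:
        # A production system would back off to a grapheme-to-phoneme model here.
        raise KeyError(f"'{word}' is out of vocabulary; no pronunciation available")

print(pronounce("Switchboard"))   # ['S', 'W', 'IH', 'CH', 'B', 'AO', 'R', 'D']
```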

(It should be noted that this is a simplified description of some amazingly complex technology. Here’s a PDF of the paper in which the Microsoft Research team explains their process and results in detail.)

While the Switchboard task is in English, Huang says the system that achieved the human parity milestone is language independent—provided it is trained with enough data. “So whether it’s German or Thai or Chinese, it’s really just amazing how powerful this [is],” he says.

He’s careful to note several caveats about this scientific achievement: The system is still very expensive, benefiting from essentially unlimited computing resources and more than 20 years of Microsoft research and development. The team spent more than a year focused on the Switchboard task to reach human parity. And it doesn’t transcribe speech in real time.

There’s still lots of work to do to bring this capability from the realm of research to a production system that could improve Microsoft products like Xbox and Cortana.

Meanwhile, Huang is confident that hardware advances will continue. “Don’t worry,” he says, holding up his iPhone and describing the increasingly powerful computers available in the cloud. Beginning in December, Microsoft will offer GPU-powered machines in its Azure cloud computing service. “You have a cloud and client working together. So that trend will not stop,” he says.

The next big scientific challenge is tackling the “cocktail party” problem. Computers still struggle to capture speech in settings that have multiple speakers who may be far from the microphone; echoes and background noise such as a television or music; and other complications.

“Humans have no problem,” Huang says. “They can just adapt, zero in, and intelligently have a good understanding. Even the best Microsoft human parity system performs badly with that kind of open environment.”

He says improvements in microphone technology will help, noting devices such as the Amazon Echo, with its seven microphones that can pinpoint a distant speaker.
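Amazon has not published how the Echo processes its array, but the underlying principle, beamforming, is simple to sketch: sound from a given direction reaches each microphone at a slightly different time, so delaying each channel by the right amount before summing reinforces that direction while off-axis noise partially cancels. Here is a minimal delay-and-sum sketch under a far-field assumption, with made-up geometry and signals.

```python
import numpy as np

# Minimal delay-and-sum beamformer: a sketch of how a microphone array can
# "aim" at a distant talker. The geometry and signals below are invented.

SPEED_OF_SOUND = 343.0   # meters per second
SAMPLE_RATE = 16_000     # samples per second

def delay_and_sum(signals: np.ndarray, mic_positions: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """signals: (num_mics, num_samples); mic_positions: (num_mics, 2) in meters;
    direction: unit vector pointing toward the talker (far-field plane-wave assumption)."""
    lead = mic_positions @ direction / SPEED_OF_SOUND   # how much earlier each mic hears the wavefront
    shifts = np.round((lead - lead.min()) * SAMPLE_RATE).astype(int)
    # Delay the earlier-arriving channels so every copy of the wavefront lines up
    # (np.roll's wrap-around at the edges is ignored in this sketch).
    aligned = [np.roll(sig, shift) for sig, shift in zip(signals, shifts)]
    return np.mean(aligned, axis=0)   # in-phase speech adds up; off-axis noise partially cancels

# Toy example: four mics in a line, "listening" 30 degrees off axis, noise as the input.
mics = np.array([[0.00, 0.0], [0.05, 0.0], [0.10, 0.0], [0.15, 0.0]])
look = np.array([np.cos(np.radians(30)), np.sin(np.radians(30))])
fake_signals = np.random.default_rng(1).standard_normal((4, SAMPLE_RATE))
enhanced = delay_and_sum(fake_signals, mics, look)
print(enhanced.shape)   # (16000,)
```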

Over Thanksgiving, my family had a great time trading knock-knock jokes with Alexa, the intelligence underlying the Amazon device. It was competent, if inconsistent, even in a crowded kitchen.

Huang emphasizes the difference between the Switchboard task—the transcription of conversations between human strangers on an assigned topic—and other computer speech contexts. “When you talk to a computer, you know you are talking to a computer so the way you articulate is different,” Huang says.

Apart from hardware, Huang says improvements in natural language understanding will help solve the cocktail party problem. Humans can better understand the signal in the noise, thanks to common sense and contextual knowledge. But machine understanding “is far from being solved,” Huang says.

So when will that milestone be reached? Reflecting on the journey to the transcription milestone, Huang says he “totally underestimated” the advancements that would be made over the course of his career. He paraphrases a famous Bill Gates quote: “He thought most people… overestimate what they can achieve in a year and underestimate what the community can achieve in 10 years. So, that can be applied to my own prediction. I’m not going to predict what’s going to happen, but it’s just phenomenal.”

How One Human Transcribes

In my career as a journalist, I’ve spent uncounted hours—lost days, perhaps weeks of my life—transcribing. I record many of my interviews and then go about the time-consuming process of turning the spoken words into text. The goal is an error rate of zero. We’re trying not to misquote people here, which is a big impetus for recording interviews in the first place, rather than typing or writing notes in real time.

I’d never tracked exactly how time-consuming transcription is. That’s in part because it’s not something I do as a defined activity, separate from writing the story. For me, it’s part of the writing process. I often stop transcribing to fit a fact or a quote into my draft as I encounter it, rather than transcribing the whole thing and then going back to pick out the bits that will actually appear in the story.

For my interview with computer speech expert Xuedong Huang, which ran just shy of 32 minutes, I measured how long it took me to transcribe it with a running stopwatch: one hour, 23 minutes. Put another way, each minute of conversation took me about 2.6 minutes to transcribe. I consider myself a fast typist, but I regularly have to rewind the recording to make sure I heard a word or phrase correctly. This interview took place in a quiet meeting room in Building 99, headquarters of Microsoft Research in Redmond, WA. There were still patches that were difficult to hear on the recording—when both Huang and I were talking simultaneously, for example—or difficult for me to understand, such as when Huang used acronyms, proper names, and other linguistic and computer science terms that were new to me.
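For anyone checking that arithmetic (and treating the interview as a flat 32 minutes):

```python
# Rough real-time factor for my own transcription of the Huang interview.
transcribing_minutes = 60 + 23   # one hour, 23 minutes on the stopwatch
interview_minutes = 32           # the recording ran just shy of this
print(round(transcribing_minutes / interview_minutes, 1))   # 2.6
```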

I broke the interview up into 12 segments to make it more manageable. Even for an interesting interview, transcription is a tedious process; breaks are necessary, interruptions inevitable. This presented an opportunity to measure whether my transcription speed increased over the course of the interview. I assumed it would as I gained experience, training my own neural network with some high-quality data, accumulating valuable context from earlier in the interview, and adjusting to new vocabulary, Huang’s accent, and his speaking rhythm. By the end of the interview, a minute of conversation took only 2.3 minutes to transcribe.

That said, a production system capable of accurately transcribing spontaneous human-to-human conversations can’t come soon enough.
