TranscribeMe, Nuance Fine-Tuning Audio Transcription For Hot Market

In 1999, the futurist and speech recognition technology pioneer Ray Kurzweil predicted that by 2009, deafness would be a mere inconvenience rather than a disability.

That’s because deaf people would be carrying small machines that would listen to their companions and display real-time text transcripts of their conversations, Kurzweil imagined in his book The Age of Spiritual Machines.

It’s five years past 2009, and we’re not there yet.

No software company has yet offered a product that can deliver highly accurate speech-to-text transcription of multiple voices. Such technology would be a boon not only to the deaf, but also to a host of business customers, such as companies that record their meetings, lawyers who record depositions for court cases, and journalists who publish interviews and quotes. But in the meantime, significant business opportunities remain for next-generation transcription companies such as Berkeley, CA-based TranscribeMe.

Human beings still do most of the professional transcribing work on multiple-voice audio files, but TranscribeMe was founded in 2011 to improve on services provided by the traditional lone worker with a transcription machine at home. The company uses a combination of its own speech recognition software and a network of about 30,000 freelance transcriptionists, with the aim of increasing efficiency while controlling costs, says CEO and co-founder Alexei Dunayev.

“That gets us the speed of the computer as well as the quality of people,” Dunayev says.

Although Kurzweil’s forecast was premature, speech recognition software has improved substantially since 1999, making the voice a key element of digital communication. Smartphone users can have short conversations with their digital assistants and receive text versions of their friends’ voicemail messages, even though the transcriptions are sometimes hilariously off the mark. Plenty of people now speak to their computers rather than type when they’re creating longer text documents, using transcription software such as the Dragon products made by Burlington, MA-based Nuance Communications (NASDAQ: NUAN). But those programs must still be “trained” to recognize their owners’ speech patterns before they can produce accurate text copies.

The greater challenge—and one not yet overcome by software companies—is the transcription of conversations involving two or more speakers. Transcriptions get muddled if a speech-to-text program is confronted with a mixture of different voices, rather than the familiar voice of the software owner alone. “The accuracy will plummet for the non-primary speaker,” says Peter Mahoney, Nuance’s chief marketing officer.

Nuance’s labs are working on the problem. “It certainly is an important area for us to do research on,” Mahoney says. (More on Nuance’s efforts later.) But in the meantime, companies such as TranscribeMe are trying to make the most of what technology can already do.

TranscribeMe can’t transcribe a cocktail party conversation in real time for a deaf guest, but Dunayev says the company can turn out a transcript of a business conference session in about three hours. “Typically it would take three times as long for a single worker,” he says.

Here’s how the TranscribeMe system works: Audio files without a lot of background noise are put through the company’s proprietary speech recognition program to get a first draft as a starting point, Dunayev says. Poor-quality recordings skip that step, because software can’t glean much from them. All audio files submitted by customers are sent to human transcribers who sign into TranscribeMe’s online workroom. But first, the files are split up into many slices only a few minutes long each—and sometimes less than a minute, Dunayev says.

Speed is the first reason for dividing the files up. If many transcribers work at the same time on audio snippets, they can produce a full document faster than a single person tackling the whole file from start to finish. TranscribeMe’s software later stitches the scattered text passages together in the right order. The transcribers, if they like, can choose to work for short stretches of time, rather than committing to complete a lengthy assignment. “We let people monetize their downtime,” Dunayev says.

Confidentiality is the second reason for fracturing the files into small segments, Dunayev says. No transcriber hears a full version of a client’s audio file, he says.
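
The split-and-stitch workflow described above can be illustrated with a minimal Python sketch. It is not TranscribeMe's actual software: the function names, the two-minute slice length, and the thread pool standing in for a crowd of human transcribers are all assumptions made for illustration. The point is only that slices can be finished in any order and still be reassembled correctly if each one carries its position in the recording.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_slices(duration_s, slice_s):
    """Return (index, start, end) tuples covering the whole recording."""
    if slice_s <= 0:
        raise ValueError("slice_s must be positive")
    slices = []
    start = 0
    index = 0
    while start < duration_s:
        end = min(start + slice_s, duration_s)
        slices.append((index, start, end))
        start = end
        index += 1
    return slices


def transcribe_slice(audio_slice):
    """Hypothetical stand-in for one transcriber working on one slice."""
    index, start, end = audio_slice
    return index, f"[text for {start}s-{end}s]"


def transcribe_recording(duration_s, slice_s=120, workers=8):
    """Transcribe slices concurrently, then stitch them back in order."""
    slices = split_into_slices(duration_s, slice_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transcribe_slice, s) for s in slices]
        # Results arrive in completion order, not recording order.
        results = [f.result() for f in as_completed(futures)]
    # Stitch: sort by slice index to restore the original sequence.
    results.sort(key=lambda r: r[0])
    return " ".join(text for _, text in results)


if __name__ == "__main__":
    # A five-minute recording cut into two-minute slices.
    print(transcribe_recording(300, slice_s=120))
```

Because each worker in this sketch receives only its own slice, no single worker ever handles the full recording, which mirrors the confidentiality argument Dunayev makes.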

TranscribeMe aims for an accuracy rate of 98 percent or better, and offers options such as editing to correct speakers’ grammar mistakes or to remove stuttering. The company’s quality assurance staffers do a final review of each transcript. Its customers include lawyers, law enforcement agencies, insurance companies, medical centers, conference attendees, and researchers who do a lot of interviews, Dunayev says. “The real advantage of our model is quick delivery and almost any volume,” he says.

Although thousands of U.S. transcription companies compete in a market that has existed for decades, Dunayev says demand is growing as the amount of audio and video production rises and transcription prices drop. For example, media producers often publish full texts of their productions so that viewers and listeners can find them using search engines. “Google doesn’t index MP3 files,” he says.

Smartphones are also market catalysts, because they feature high-quality recording devices with enough memory to store audio from meetings two or three hours long, Dunayev says. TranscribeMe offers mobile apps for Windows, Android, and iPhones that steer the process of recording and uploading audio to its customer portal. There, customers can store their audio files and place transcription orders.

TranscribeMe is now preparing to allow clients to order transcripts of files they’ve stored in other content hosting platforms such as YouTube and Dropbox. TranscribeMe charges $2 per transcribed audio minute when the file involves two or more speakers. The price for a single speaker transcript is $1 per minute.
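
As a rough illustration of those per-minute rates, the short Python sketch below estimates the cost of an order. The function name is hypothetical, and billing fractional minutes pro rata is an assumption; the article does not say how partial minutes are charged.

```python
# Rates as quoted in the article, in dollars per transcribed audio minute.
RATE_SINGLE_SPEAKER = 1.00
RATE_MULTI_SPEAKER = 2.00


def estimate_cost(audio_minutes, speakers):
    """Estimate the price of a transcript from its length and speaker count."""
    if audio_minutes < 0 or speakers < 1:
        raise ValueError("need non-negative minutes and at least one speaker")
    rate = RATE_MULTI_SPEAKER if speakers >= 2 else RATE_SINGLE_SPEAKER
    # Assumption: partial minutes are billed proportionally.
    return audio_minutes * rate


if __name__ == "__main__":
    print(estimate_cost(90, speakers=4))  # a 90-minute meeting
    print(estimate_cost(30, speakers=1))  # a 30-minute dictation
```

At these rates, a 90-minute meeting with several speakers would come to $180, and a 30-minute single-speaker recording to $30.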

Dunayev says TranscribeMe has been growing rapidly since its service came online in 2012, but the private company doesn’t disclose its revenues. The company raised $900,000 in November 2012 from investors including Tech Coast Angels, Sierra Angels, TA Ventures, TEC Ventures, ICE Angels, and Maverick Angels, bringing its total funds raised to $1.5 million. TranscribeMe has 30 staffers in Berkeley and an operations branch in New Zealand.

While startups continue to develop methods to work around the limitations of multiple-voice transcription software, tech giants are amassing big databases of digitized speech and matching text that may help researchers enhance the accuracy of voice recognition programs across a range of accents and vocal quirks. Microsoft, Google, and Apple are trying to improve their services based on speech recognition, such as voicemail-to-text transcription and voice-activated computer commands.

Nuance, by supplying voicemail-to-text technology to phone companies, has already made some headway in transcribing the speech of people its programs haven’t been specifically trained to interpret. At Nuance, research divisions are now trying to develop high-quality transcribing capabilities for lengthy, multiple-voice audio files, Mahoney says.

The first hurdle is to build software that can simply recognize that a new person has begun to speak—a feat that can be challenging even for human beings when two speakers on the same audio file have similar voices. Mahoney says Nuance is working on voice identification software that weighs a number of speech characteristics to sort out different speakers.

The future Nuance software would then split up the audio file so that each individual speaker’s statements could be transcribed separately and later re-assembled, Mahoney says. The new process is being designed as a service for enterprise customers such as businesses that record their meetings, not as a consumer software product. The service, while it would rely on improved speech recognition software, would still make some use of human transcriptionists, the company says. No timeline has been released for its launch.
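
The general idea of splitting audio by speaker and reassembling the results can be sketched in a few lines of Python. This is not Nuance's design, which the company has not published. The sketch assumes the hard part, labeling each segment with a speaker, has already been done, and every name in it is hypothetical, including the placeholder recognizer.

```python
from collections import defaultdict


def split_by_speaker(segments):
    """Group (start_s, speaker, clip) segments by speaker label."""
    by_speaker = defaultdict(list)
    for start_s, speaker, clip in segments:
        by_speaker[speaker].append((start_s, clip))
    return dict(by_speaker)


def transcribe_for_speaker(speaker, clip):
    """Placeholder for a recognizer handling one voice at a time."""
    return f"transcript of {clip}"


def reassemble(by_speaker):
    """Transcribe each speaker's clips, then merge them by start time."""
    lines = []
    for speaker, clips in by_speaker.items():
        for start_s, clip in clips:
            text = transcribe_for_speaker(speaker, clip)
            lines.append((start_s, speaker, text))
    # Sorting on start time restores the order of the conversation.
    lines.sort(key=lambda line: line[0])
    return lines


if __name__ == "__main__":
    segments = [(0.0, "A", "clip-1"), (4.2, "B", "clip-2"), (9.8, "A", "clip-3")]
    for start_s, speaker, text in reassemble(split_by_speaker(segments)):
        print(f"{start_s:>5.1f}s  {speaker}: {text}")
```

Keeping the start time attached to every clip is what lets the per-speaker transcripts be interleaved again into a single conversation.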

TranscribeMe’s Dunayev says he fully expects speech recognition software to gain added power, no matter which big company eventually meets the challenge of multiple-voice transcription. He says he doesn’t fear the competition.

“We actually count on it happening,” Dunayev says. Rather than undercutting TranscribeMe’s business, voice-to-text technology advances could allow the company to improve its service by reducing costs and lowering prices for customers, he says. Dunayev doubts that software will soon eliminate the need for human transcribers to perfect computer-generated transcripts.

“There’s still a need for that last person to do that quality validation,” Dunayev says.
