Waiting for the Speakularity
In late 2010, the Nieman Journalism Lab surveyed reporters for their predictions about what 2011 would bring for the future of journalism. My favorite prediction came from Matt Thompson, an editorial product manager at National Public Radio and a widely respected evangelist for digital journalism. (I happened to meet Thompson in person around the same time at News Foo, a future-of-news conference in Phoenix sponsored by Google and O’Reilly Media. He’s an amazing guy, brimming with about a dozen great ideas per minute.)
Anyway, his prediction was this: Soon—perhaps not in 2011, but in the near future—automatic speech recognition and transcription services would become “fast, free, and decent.” In a jocular reference to the Singularity, Ray Kurzweil’s label for the moment when we’ll be able to upload our minds to computers and live forever, Thompson called the arrival of free and accurate speech transcription “the Speakularity.” He predicted it would be a watershed moment for journalism, since it would mean that all of the verbal information reporters collect—everything said in interviews, press conferences, courtrooms, city council meetings, broadcasts, and the like—could easily be turned into text and made searchable.
“Obscure city meetings could be recorded and auto-transcribed,” Thompson wrote. “Interviews could be published nearly instantly as Q&As; journalists covering events could focus their attention on analyzing rather than capturing the proceedings. Because text is much more scannable than audio, recordings automatically indexed to a transcript would be much quicker to search through and edit.”
The implications are obviously immense. But what excited me personally about Thompson’s concept was the prospect that, as a reporter, I might finally be able to start thinking more about the content of my interviews (analyzing) and less about taking notes (capturing).
Not that I have a problem taking notes. If I had to reveal my superpower, it’s this: I can type extremely fast. I’m talking Clark Kent fast. So fast that I walk away from most interviews with a verbatim transcript. There are always typos in the text, but nothing that can’t be easily deciphered.
It’s a great skill to have, because it means I don’t have to record interviews and waste time transcribing them later. But it comes at a cost. If I’m transcribing during an interview, my brain is divided into three separate operations. First, I’m typing whatever the speaker said a few seconds ago; to use a computational analogy, you might say my finger movements on the keyboard are drawing from the bottom of my short-term memory buffer. Second, my ears are listening to the speaker’s words in the current moment, and adding them to the top of the buffer. Third, I’m trying to comprehend the content and think ahead to the next question I want to ask.
This procedure usually works fine, but it’s exhausting. And if I stop concentrating for even a second, I suffer from buffer overflow, which is just as disastrous for me as it is for a computer program. With automatic and accurate speech transcription, I’d be able to dispense with all the typing and focus fully on the interviewee and their ideas, which would be heavenly.
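The divided-attention loop I just described really is a bounded first-in, first-out queue: listening pushes phrases onto the top, typing drains them from the bottom, and a lapse in concentration overflows it. Here is a minimal sketch of that analogy in Python (the class and method names are purely illustrative, not any real transcription tool):

```python
from collections import deque

class NoteTakingBuffer:
    """Toy model of a reporter's short-term memory while transcribing:
    a bounded FIFO queue fed by listening and drained by typing."""

    def __init__(self, capacity: int = 5):
        self.capacity = capacity
        self.buffer = deque()

    def hear(self, phrase: str) -> None:
        """Listening adds the speaker's latest words to the top of the
        buffer; if attention lapses and the buffer is full, words are lost."""
        if len(self.buffer) >= self.capacity:
            raise OverflowError("buffer overflow -- please rewind!")
        self.buffer.append(phrase)

    def type_oldest(self) -> str:
        """Typing draws from the bottom of the buffer: the phrase spoken
        a few seconds ago, not the one being spoken now."""
        return self.buffer.popleft()
```

The exhaustion comes from running `hear` and `type_oldest` concurrently while also doing a third job, comprehension, that this sketch leaves out entirely.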
So, how far off is the Speakularity? The idea itself is not nearly as outlandish as the Singularity (which still has plenty of skeptics, even within the irrationally optimistic population of startup entrepreneurs). Continuous dictation software has been available since the 1990s from companies like Dragon Systems, which is now part of Nuance Communications. The problem is that it has never quite met all three of Thompson’s criteria. If it was fast and decent, it wasn’t free, and if it was close to free, it wasn’t decent.
Lately, though, that’s been changing. Today’s mobile devices have both powerful internal processors and broadband connections to external, cloud-based speech transcription engines. Nuance introduced its “Dragon Dictation” app for Apple iOS devices in 2009, giving users the ability to dictate short stretches of text—about a paragraph. Smartphones with Google’s Android operating system have had a built-in Voice Actions feature since 2010. In 2011, Apple came out with the iPhone 4S, which had dictation capabilities, not to mention the speech-driven Siri virtual personal assistant, baked in. And this year, Apple put dictation into both the third-generation iPad and the Mountain Lion update of its Mac OS X operating system.
One of the big constraints in all of these systems, right now, is on the length of the passage that can be transcribed. The Google, Nuance, and Apple technology works great for dictating reminders, text messages, short e-mails, and the like, but it can’t handle continuous speech. I’m guessing that’s because all of the heavy lifting (identifying speech sounds and probabilistically assigning text to them) is happening in the cloud, and there’s a limit on the size of the sound files that can be uploaded and deciphered in one go.
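If the limit really is on upload size, the obvious workaround is to split a long recording into short segments and transcribe each in turn. A minimal sketch of that splitting step, with a purely hypothetical size limit and no real cloud API involved:

```python
def chunk_audio(samples: list, chunk_size: int) -> list:
    """Split a long recording into fixed-size chunks so that each
    upload stays under a service's size limit. chunk_size is measured
    in samples; the last chunk may be shorter."""
    return [samples[i:i + chunk_size]
            for i in range(0, len(samples), chunk_size)]
```

The catch, of course, is that naive splitting can cut a word in half at a chunk boundary, which is one reason stitching short-dictation services into continuous transcription is harder than it looks.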
Another, bigger hurdle is that today’s commercial speech recognition technology still has a very hard time dealing with multiple voices, especially if they’re talking over one another (as humans routinely do). The Holy Grail would be a service that provided continuous, speaker-independent transcription of conversations between two or more people. The finished transcripts would be fodder not just for search engines but for a new wealth of newspaper, magazine, and blog stories.
Thompson predicted that Google would be the first to bring together all the elements of the vision, and I think that’s a good bet, given the company’s enormous computational resources, its experience with services like Google 411 and automatic YouTube captioning, and the depth of its bench in areas like natural language processing and machine translation. But you can’t count out Nuance or Apple (which uses Nuance’s technology in Siri and the iOS dictation feature), or research institutions such as SRI International, which are also thinking hard about this stuff.
I’m ready for the Speakularity now—but realistically, I’ll probably have to keep taking manual notes for the next few years. Just cut me a break if I’m interviewing you, my buffer flows over, and I have to ask you to rewind.