For its next act, the Silicon Valley giant wants to put a supercomputer in your pocket, the better to sense, search, and interpret your personal surroundings. We talked to the scientists who are making it happen.
Wade Roush 2/28/2011
1: New Frontiers in Speech Recognition
Already, it’s hard for anyone with a computer to get through a day without encountering Google, whether that means doing a traditional Web search, visiting your Gmail inbox, calling up a Google map, or just noticing an ad served up by Google AdSense. And as time goes on, it’s going to get a lot harder.
That’s in part because the Mountain View, CA-based search and advertising giant has spent years building and acquiring technologies that extend its understanding beyond Web pages to other genres of information. I’m not just talking about the obvious, high-profile Google product areas such as browsers and operating systems (Chrome, Android), video (YouTube and the nascent Google TV), books (Google Book Search, Google eBooks), maps (Google Maps and Google Earth), images (Google Images, Picasa, Picnik), and cloud utilities (Google Docs). One layer below all of that, Google has also been pouring resources into fundamental technologies that make meaning more machine-tractable—including software that recognizes human speech, translates written text from one language to another, and identifies objects in images. Taken together, these new capabilities promise to make all of Google’s other products more powerful.
The other reason Google will become harder to avoid is that many of the company’s newest capabilities are now being introduced and perfected first on mobile devices rather than the desktop Web. Already, our mobile gadgets are usually closest at hand when we need to find something out. And their ubiquity will only increase: it’s believed that 2011 will be the year when sales of smartphones and tablet devices finally surpass sales of PCs, with many of those new devices running Android.
That means you’ll be able to tap Google’s services in many more situations, from the streets of a foreign city, where Google might keep you oriented and feed you a stream of factoids about the surrounding landmarks, to the restaurant you pick for lunch, where your phone might translate your menu (or even your waiter’s remarks) into English.
Google CEO Eric Schmidt says the company has adopted a “mobile first” strategy. And indeed, many Googlers seem to think of mobile devices and the cameras, microphones, touchscreens, and sensors they carry as extensions of our own awareness. “We like to say a phone has eyes, ears, skin, and a sense of location,” says Katie Watson, head of Google’s communications team for mobile technologies. “It’s always with you in your pocket or purse. It’s next to you when you’re sleeping. We really want to leverage that.”
This is no small vision, no tactical marketing ploy—it’s becoming a key part of Google’s picture of the future. In a speech last September at the IFA consumer electronics fair in Berlin, Schmidt talked about “the age of augmented humanity,” a time when computers remember things for us, when they save us from getting lost, lonely, or bored, and when “you really do have all the world’s information at your fingertips in any language”—finally fulfilling Bill Gates’ famous 1990 forecast. This future, Schmidt says, will soon be accessible to everyone who can afford a smartphone—one billion people now, and as many as four billion by 2020, in his view.
It’s not that phones themselves are all that powerful, at least compared to laptop or desktop machines. But more and more of them are backed up by broadband networks that, in turn, connect to massively distributed computing clouds (some of which, of course, are operated by Google). “It’s like having a supercomputer in your pocket,” Schmidt said in Berlin. “When we do voice translation, when we do picture identification, all [the smartphone] does is send a request to the supercomputers that then do all the work.”
And the key thing about those supercomputers—though Schmidt alluded to it only briefly—is that they’re stuffed with data, petabytes of data about what humans say and write and where they go and what they like. This data is drawn from the real world, generated by the same people who use all of Google’s services. And the company’s agility when it comes to collecting, storing, and analyzing it is perhaps its greatest but least appreciated capability.
The power of this data was the one consistent theme in a series of interviews I conducted in late 2010 with Google research directors in the fundamental areas of speech recognition, machine translation, and computer vision. It turns out that many of the problems that have stymied researchers in cognitive science and artificial intelligence for decades—understanding the rules behind grammar, for instance, or building models of perception in the visual cortex—give way before great volumes of data, which can simply be mined for statistical connections.
Unlike the large, structured language corpuses used by the speech-recognition or machine-translation experts of yesteryear, this data doesn’t have to be transcribed or annotated to yield insights. The structure and the patterns arise from the way the data was generated, and the contexts in which Google collects it. It turns out, for example, that meaningful relationships can be extracted from search logs—the more people who search for “IBM stock price” or “Apple Computer stock price,” the clearer it becomes that there is a class of things, i.e. companies, with an attribute called “stock price.” Google’s algorithms glean this from Google’s own users in a process computer scientists call “unsupervised learning.”
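To make the search-log idea concrete, here is a minimal sketch in Python. It is not Google's method: the query list is invented, and the function name `candidate_attributes` and the threshold `min_entities` are hypothetical. It simply splits each query into a leading phrase and a trailing phrase, then keeps the trailing phrases that recur after several different leading phrases.

```python
from collections import defaultdict

def candidate_attributes(queries, min_entities=2):
    """Find trailing phrases that follow several distinct leading phrases."""
    heads_by_tail = defaultdict(set)
    for query in queries:
        words = query.lower().split()
        # Try every split point: "ibm | stock price", "ibm stock | price", ...
        for i in range(1, len(words)):
            head = " ".join(words[:i])
            tail = " ".join(words[i:])
            heads_by_tail[tail].add(head)
    return {
        tail: sorted(heads)
        for tail, heads in heads_by_tail.items()
        if len(heads) >= min_entities
    }

# Invented stand-in for a query log; no labels or annotations are supplied.
queries = [
    "IBM stock price",
    "Apple Computer stock price",
    "ibm stock price",
    "Paris weather",
    "IBM headquarters",
]

attributes = candidate_attributes(queries)
print(attributes["stock price"])
```

On this toy log the phrase "stock price" is linked to both "apple computer" and "ibm", with no one having told the program what a company is. The sketch also surfaces the shorter tail "price", which hints at why a real system needs far more data and far better statistics to separate useful patterns from noise.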
“This is a form of artificial intelligence,” Schmidt observed in Berlin. “It’s intelligence where the computer does what it does well and it helps us think better…The computer and the human, together, each does something better because the other is helping.”
In a series of three articles this week, I’ll look more closely at this human-computer symbiosis and how Google is exploiting it, starting with the area of speech recognition. (Subsequent articles will examine machine translation and computer vision.) Research in these areas is advancing so fast that the outlines of Schmidt’s vision of augmented humanity are already becoming clear, especially for owners of Android phones, where Google deploys its new mobile technologies first and most deeply.
Obviously, Google has competition in the market for mobile information services. Over time, its biggest competitor in this area is likely to be Apple, which controls one of the world’s most popular smartphone platforms and recently acquired, in the form of a startup called Siri, a search and personal-assistant technology built on many of the same machine-learning principles espoused by Google’s researchers.
But Google has substantial assets in its favor: a large and talented research staff, one of the world’s largest distributed computing infrastructures, and most importantly, a vast trove of data for unsupervised learning. It seems likely, therefore, that much of the innovation making our phones more powerful over the coming years will emerge from Mountain View.
The Linguists and the Engineers
Today Michael Cohen leads Google’s speech technology efforts. But he actually started out as a composer and guitarist, making a living for seven years writing music for piano, violin, orchestra, and jazz bands. As a musician, he says, he was always interested in the mechanics of auditory perception: why certain kinds of sound make musical sense to the human brain, while others are just noise.
A side interest in computer music eventually led him into computer science proper. “That very naturally led me, first of all, to wanting to work on something relating to perception, and second, related to sounds,” Cohen says today. “And the natural thing was speech recognition.”
Cohen started studying speech at Menlo Park’s SRI International in 1984, as the principal investigator in a series of DARPA-funded studies in acoustic modeling. By that time, a fundamental change in the science of speech was already underway, he says. For decades, early speech researchers had hoped that it would be possible to teach computers to understand speech by giving them linguistic knowledge—general rules about word usage and pronunciation. But starting in the 1970s, an engineering-oriented camp had emerged that rejected this approach as impractical. “These engineers came along, saying, ‘We will never know everything about those details, so let’s just write algorithms that can learn from data,'” Cohen recounts. “There was friction between the linguists and the engineers, and the engineers were winning by quite a bit.”
But around the mid-1980s, Cohen says, “the linguists and the engineers started talking to each other.” The linguists realized that their rules-based approach was too complex and inflexible, while the engineers realized their statistical models needed more structure. One result was the creation of context-dependent statistical models of speech that, for the first time, could take “co-articulation” into account—the fact that the pronunciation of each phoneme, or sound unit, in a word is influenced by the preceding and following phonemes. There would no longer be just one statistical profile for the sound waves constituting a long “a” sound, for example; there would be different models for “a” for all of the contexts in which it occurs.
“The engineers, to this day, still follow the fundamental statistical, machine-learning, data-driven approaches,” Cohen says. “But by learning a bit about linguistic structure—that words are built in phonemes and that particular realizations of these phonemes are context-dependent—they were able to build richer models that could learn much more of the fine details about speech than they had before.”
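A small Python sketch can illustrate what "context-dependent" means here. It assumes a commonly used "left-center+right" labeling for context-dependent units and ARPAbet-style phoneme symbols; the function name `context_units` and the "sil" silence marker are illustrative choices, not details from Cohen's work.

```python
def context_units(phonemes, boundary="sil"):
    """Label each phoneme with its left and right neighbors."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

# "ey" is the long "a" sound in both words (ARPAbet-style symbols).
cake = context_units(["k", "ey", "k"])
date = context_units(["d", "ey", "t"])

print(cake)  # ['sil-k+ey', 'k-ey+k', 'ey-k+sil']
print(date)  # ['sil-d+ey', 'd-ey+t', 'ey-t+sil']
```

The long "a" in "cake" becomes the unit `k-ey+k`, while the same sound in "date" becomes `d-ey+t`. A recognizer built this way can fit a separate statistical model to each unit instead of a single model for every long "a".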
Cohen took much of that learning with him when he co-founded Nuance, a Menlo Park, CA-based spinoff of SRI International, in 1994. (Much later, SRI would also spin off Siri, the personal assistant startup bought last year by Apple.) He spent a decade building up the company’s strength in telephone-based voice-response systems for corporate call centers—the kind of technology that lets customers get flight status updates from airlines by speaking the flight numbers, for example.
The Burlington, MA-based company now called Nuance Communications was formerly a Nuance competitor called ScanSoft, and it adopted the Nuance name after it acquired the Menlo Park startup in 2005. But by that time Cohen had left Nuance for Google. He says several factors lured him in. One was the fact that statistical speech-recognition models were inherently limited by computing speed and memory, and by the amount of training data available. “Google had way more compute power than anybody had, and over time, the ability to have way more data than anybody had,” Cohen says. “The biggest bottleneck in the research being, ‘How can we build a much bigger model?,’ it was definitely an opportunity.”
But there were other aspects to this opportunity. After 10 years working on speech recognition for landline telephone systems at Nuance, Cohen wanted to try something different, and “mobile was looking more and more important as a platform, as a place where speech technology would be very important,” he says. That’s mainly because of the user-interface problem: phones are small and it’s inconvenient to type on them.
“At the time, Google had barely any effort in mobile, maybe four people doing part-time stuff,” Cohen says. “In my interviews, I said, ‘I realize you can’t tell me what your next plans are, but if you are not going to be serious about mobile, don’t make me an offer, because I won’t be interested in staying.’ I felt at the time that mobile was going to be a really important area for Google.”
As it turned out, of course, Cohen wasn’t the only one who felt that way. Schmidt and Google co-founders Larry Page and Sergey Brin also believed mobile phones would become key platforms for browsing and other search-related activities, which helped lead to the company’s purchase of mobile operating system startup Android in 2005.
Cohen built a whole R&D group around speech technology. Its first product was goog-411, a voice-driven directory assistance service that debuted in 2007. Callers to 1-800-GOOG-411 could request business listings for all of the United States and Canada simply by speaking to Google’s computers. The main reason for building the service, Cohen says, was to make Google’s local search service available over the phone. But the company also logged all calls to goog-411, which made it “a source of valuable training data,” Cohen says: “Even though goog-411 was a subset of voice search, between the city names and the company names we covered a great deal of phonetic diversity.”
And there was a built-in validation mechanism: if Google’s algorithms correctly interpreted the caller’s prompt, the caller would go ahead and place an actual call. It’s in many such unobtrusive ways (as Schmidt pointed out in his Berlin speech) that Google recruits users themselves to help its algorithms learn.
Google shut down goog-411 in November 2010—but only because it had largely been supplanted by newer products from Cohen’s team such as Voice Search, Voice Input, and Voice Actions. Voice Search made its first appearance in November 2008 as part of the Google Mobile app for the Apple iPhone. (It’s now available on Android phones, BlackBerry devices, and Nokia S60 phones as well.) It allows mobile phone users to enter Google search queries by speaking them into the phone. It’s startlingly accurate, in part because it learns from users. “The initial models were based on goog-411 data and they performed very well,” Cohen says. “Over time, we’ve been able to train with more Voice Search data and get improvements.”
Google isn’t the only company building statistical speech-recognition models that learn from data; Cambridge, MA, startup Vlingo, for example, has built a data-driven virtual assistant for iPhone, Android, BlackBerry, Nokia, and Windows Phone platforms that uses voice recognition to help users with mobile search, text messaging, and other tasks.
But Google has a big advantage: it’s also a search company. Before Cohen joined Google, he says, “they hadn’t done voice search before—but they had done search before, in a big way.” That meant Cohen’s team could use the logs of traditional Web searches at Google.com to help fine-tune its own language models. “If the last two words I saw were ‘the dog’ and I have a little ambiguity about the next word, it’s more likely to be ‘ran’ than ‘pan,'” Cohen explains. “The language models tell you the probabilities of all possible next words. We have been able to train enormous language models for Voice Search because we have so much textual data from Google.com.”
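Cohen's "the dog" example can be sketched as a tiny trigram model in Python. This is an illustration only: the four-sentence corpus is invented, the function names are hypothetical, and a production language model would add smoothing and train on vastly more text.

```python
from collections import Counter, defaultdict

def train_trigrams(sentences):
    """Count which word follows each pair of consecutive words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second, third in zip(words, words[1:], words[2:]):
            counts[(first, second)][third] += 1
    return counts

def next_word_prob(counts, context, word):
    """Estimate P(word | previous two words) from raw counts (no smoothing)."""
    following = counts[tuple(context)]
    total = sum(following.values())
    return following[word] / total if total else 0.0

# Invented stand-in for text gathered from typed search queries.
corpus = [
    "the dog ran home",
    "the dog ran away",
    "the dog barked",
    "wash the pan",
]

model = train_trigrams(corpus)
p_ran = next_word_prob(model, ("the", "dog"), "ran")
p_pan = next_word_prob(model, ("the", "dog"), "pan")
print(p_ran, p_pan)
```

Here "ran" follows "the dog" in two of the three cases seen, so it gets a probability of two-thirds, while "pan" never follows "the dog" and gets zero. When the acoustic evidence is ambiguous between the two words, those probabilities tip the decision.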
Over time, speech recognition capabilities have popped up in more and more Google products. When Google Voice went public in the spring of 2009, it included a voicemail transcription feature courtesy of Cohen’s team. Early in 2010, YouTube began using Google’s transcription engine to publish written transcripts alongside every YouTube video, and YouTube viewers now have the option of seeing the transcribed text on screen, just like closed-captioning on television.
But mobile is still where most of the action is. Google’s Voice Actions service, introduced last August, lets Android users control their phones via voice—for instance, they can initiate calls, send e-mail and text messages, call up music, or search maps on the Web. (This feature is called Voice Commands on some phones.) And the Voice Input feature on certain Android phones adds a microphone button to the virtual keypad, allowing users to speak within any app where text entry is required.
“In general, our vision for [speech recognition on] mobile is complete ubiquity,” says Cohen. “That’s not where we are now, but it is where we are trying to get to. Anytime the user wants to interact by voice, they should be able to.” That even includes interacting with speakers of other languages: Cohen says Google’s speech recognition researchers work closely with their colleagues in machine translation—the subject of the next article in this series—and that the day isn’t far off when the two teams will be able to release a “speech in, speech out” application that combines speech recognition, machine translation, and speech synthesis for near-real-time translation between people speaking different languages.
“The speech effort could be viewed as something that enhances almost all of Google’s services,” says Cohen. “We can organize your voice mails, we can show you the information on the audio track of a YouTube video, you can do searches by voice. A large portion of the world’s information is spoken—that’s the bottom line. It was a big missing piece of the puzzle, and it needs to be included. It’s an enabler of a much wider array of usage scenarios, and I think that what we’ll see over time is all kinds of new applications that people would never have thought of before,” all of them powered by user-provided training data. Which is precisely what Schmidt had in mind in Berlin when he quoted sci-fi author William Gibson: “Google is made of us, a sort of coral reef of human minds and their products.”
2: Changing the Equation in Machine Translation
When science fiction fans think about language translation, they have two main reference points. One is the Universal Translator, software built into the communicators used by Star Trek crews for simultaneous, two-way translation of alien languages. The other is the Babel fish from Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, which did the same thing from its home in the listener’s auditory canal.
When AltaVista named its Web-based text translation service after the Babel fish in 1997, it was a bit of a stretch: the tool’s translations were often hilariously bad. For a while, in fact, it seemed that the predictions of the Star Trek writers—that the Universal Translator would be invented sometime around the year 2150—might be accurate.
But the once-infant field of machine translation has grown up quite a bit in the last half-decade. It’s been nourished by the same three trends that I wrote about on Monday in the first part of this week’s series about Google’s vision of “augmented humanity.” One is the gradual displacement of rules-based approaches to processing speech and language by statistical, data-driven approaches, which have proved far more effective. Another is the creation of a distributed cloud-computing infrastructure capable of holding the statistical models in active memory and crunching the numbers on a massive scale. Third, and just as important, has been the profusion of real-world data for the models to learn from.
In machine translation, just as in speech recognition, Google has unique assets in all three of these areas—assets that are allowing it to build a product-development lead that may become more and more difficult for competitors to surmount. Already, the search giant offers a “Google Translate” app that lets an Android user speak to his phone in one language and hear speech-synthesized translations in a range of languages almost instantly. In on-stage previews, Google has been showing off a “conversation-mode” version of the app that does the same thing for two people. (Check out Google employees Hugo Barra and Kay Oberbeck carrying out a conversation in English and German in this section of a Google presentation in Berlin last September.)
While still experimental, the conversation app is eerily reminiscent of the fictional Universal Translator. Suddenly, the day seems much closer when anyone with an Internet-connected smartphone will be able to make their way through a foreign city without knowing a word of the local language.
In October, I met with Franz Josef Och, the head of Google’s machine translation research effort behind the Translate app, and learned quite a bit about how Google approaches translation. Och’s long-term vision is similar to that of Michael Cohen, who leads Google’s efforts in speech recognition. Cohen wants to eliminate the speech-text dichotomy as an impediment, so that it’s easier to communicate with and through our mobile devices; Och wants to take away the problem of language incomprehension. “The goal right from the beginning was to say, what can we do to break down the language barrier wherever it appears,” Och says.
This barrier is obviously higher for many Americans than it is for others, present company included—I’m functionally monolingual despite years of Russian, French, and Spanish classes. (“It’s always a shock to Americans,” Google CEO Eric Schmidt quipped during the Berlin presentation, but “people actually don’t all speak English.”) So a Babel fish in my ear—or in my phone, at any rate—would definitely count as a step toward the augmented existence Schmidt describes.
But in the big picture, Google’s machine translation work is really just a subset of its larger effort to make the world’s information “universally accessible and useful.” After all, quite a bit of this information is in languages other than those you or I may understand.
The Magic Is in the Data
Given the importance of language understanding in military affairs, from intelligence-gathering to communicating with local citizens in conflict zones, it isn’t surprising that Och, like Cohen, found his way to Google by way of the U.S. Defense Advanced Research Projects Agency (DARPA). The German native, who had done master’s work in statistical machine translation at the University of Nuremberg and PhD work at the University of Aachen, spent the early 2000s doing DARPA-funded research at USC’s Information Sciences Institute. His work there focused on automated systems for translating Arabic and Chinese records into English, and he entered the software in yearly machine translation “bake-offs” sponsored by DARPA. “I got very good results, and people at Google saw that and said ‘We should invite that guy,'” Och says.
Och was getting his results in part by setting aside the old notion that computers should translate expressions between languages based on rules. In rules-based translation, Och says, “What you write down is dictionaries. This word translates into that. Some words have multiple translations, and based on the context you might have to choose this one or that one. The overall structure might change: the morphology, the extensions, the cases. But you write down the rules for that too. The problem is that language is so enormously complex. It’s not like a computer language like C++ where you can always resolve the ambiguities.”
It was the heady success of British and American cryptographers and cryptanalysts at breaking Japanese and German codes during World War II, Och believes, that set the stage for the early optimism about rule-based translation. “If you look 60 years ago, people said ‘In five years, we’ll have solved that, like we solved cryptography.'” But coming up with rules to capture all the variations in the ways people express things turned out to be a far thornier problem than experts expected. “It didn’t take five years [to start to solve it], it took 60 years,” Och says. “And the way we are doing it is different. For us, it’s a computer science problem, not a linguistics problem.”
The pioneers in statistical machine translation in the 1990s, Och says, came from the field of speech recognition, where it was already clear that it would be easier to bootstrap machine-learning algorithms by feeding them lots of recordings of people actually speaking than to codify all the rules behind speech production.
The more such data researchers have, the faster their systems can learn. “Data changes the equation,” says Och. “The system figures out on its own what is correlated. Because we feed it billions of words, it learns billions of rules. The magic comes from these massive amounts of data.”
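How a system can "figure out on its own what is correlated" is easier to see in a toy example. The Python sketch below uses a textbook word-alignment procedure (expectation-maximization in the style of IBM Model 1), which is a classic technique from the statistical translation literature and not a description of Google's system. The three German-English sentence pairs and the function names are invented for illustration.

```python
from collections import defaultdict

def train_word_translations(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] from sentence pairs alone."""
    target_vocab = {word for _, target in pairs for word in target}
    # Start with no knowledge: every target word is equally likely.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for source, target in pairs:
            for e in target:
                # Share the credit for e among the source words in this pair.
                norm = sum(t[(e, f)] for f in source)
                for f in source:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the probabilities from the shared-out counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return dict(t)

def best_translation(t, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)

# Invented parallel text: no dictionary and no grammar rules are supplied.
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

table = train_word_translations(pairs)
print(best_translation(table, "das"), best_translation(table, "buch"))
```

Nothing in the input says that "das" means "the". The pairing emerges because "the" is the only English word that appears every time "das" does, and each pass through the data sharpens that estimate. Scaling the same idea from three sentences to billions of words is where the computing budget comes in.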
But back in 2004, when Google was taking a look at Och’s DARPA bake-off entry, the magic was still slow, limited by his team’s computation budget at USC. “We had a few machines, and the goal was to translate a given test sentence, and it would take a few days to translate just that sentence,” he says. Translating random text? Forget it. “Building a real system would have needed much bigger computational resources. We were CPU-constrained, RAM-constrained.”
But Google wasn’t. Access to the search company’s data centers, Och figured, would advance his project by a matter of several years overnight. Then there was Google’s ability to crawl the Web, collecting examples of already-translated texts—which are the key to bootstrapping any statistical machine translation system. But what clinched the deal when Google finally hired Och in early 2004, he says, was the opportunity to work on technology that would reach so many people: “The idea of being able to build a real system that millions of people might use to break down the language barrier, that was a very exciting thought.”
Och and the machine-translation team he started to build at Google began with his existing systems for translating Chinese, Arabic, and Russian into English, fanning out over the Web to find as many examples as possible of human-translated texts. The sheer scope of Google’s areas of interest was a help here—it turns out that many of the millions of books Google was busily scanning for its Book Search project have high-quality translations. But another Google invention called MapReduce was even more important. The more statistical associations that a translation system can remember, the faster and more accurately it can translate new text, Och says—so the best systems are those that can hold hundreds of gigabytes of data in memory all at once. MapReduce, the distributed-computing framework that Google engineers developed to parallelize the work of mapping, sorting, and retrieving Web links, turned out to be perfect for operating on translation data. “It was never built with machine translation in mind, but our training infrastructure uses MapReduce at many many places,” says Och. “It helps us, as part of building those gigantic models, to manage very large amounts of data.”
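The map-and-reduce pattern itself is simple enough to sketch in a few lines of single-machine Python. This is not Google's MapReduce framework or its API; the function names and the two sentence pairs are hypothetical, and the job shown (counting how often a source word and a target word appear in the same sentence pair) is just one plausible example of a statistic a translation system might collect.

```python
from collections import defaultdict
from itertools import product

def map_phase(record):
    """Emit a (key, 1) pair for every source/target word combination."""
    source, target = record
    for pair in product(source.split(), target.split()):
        yield pair, 1

def shuffle(mapped):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def run_job(records):
    mapped = (kv for record in records for kv in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Invented parallel sentences standing in for a huge translated corpus.
records = [("das haus", "the house"), ("das buch", "the book")]
counts = run_job(records)
print(counts[("das", "the")])
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both steps can be spread across many machines, which is the property that makes the pattern useful for very large data sets.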
By October of 2007, Och’s team was able to replace the third-party, rules-based translation software Google had been licensing from Systran—the same technology behind AltaVista’s Babel Fish (which now lives at babelfish.yahoo.com)—with its own, wholly statistical models. Today the Web-based Google Translate system works for 58 languages, including at least one—Latin—that isn’t even spoken anymore. “There are not so many Romans out there any more, but there are a lot of Latin books out there,” Och explains. “If you want to organize all the world’s information and make it universally accessible and useful, then the Latin books are certainly part of that.”
And new languages are coming online every month. Human-translated documents are still the starting point—“For languages where those don’t exist, we cannot build systems,” Och says—but over time, the algorithms have gotten smarter, meaning they can produce serviceable translations using less training data. “We have Yiddish, Icelandic, Haitian Creole—a bunch of very small languages for which it’s hard to find those kinds of documents,” Och says.
Today Google Translate turns up in a surprising number of places across the Google universe, starting with the plain old search box. You can translate a phrase at Google.com just by typing a phrase like “translate The Moon is made of cheese to French” (the translation offered sounds credible to me: La Lune est faite de fromage). For those times when you know the best search results are likely to be in some other language, there’s the “Translated Search” tool. And when results come up that aren’t in your native language, Google can translate entire Web pages on the fly. (If you’re using Google’s Chrome browser or the Google Toolbar for Internet Explorer or Firefox, page translation happens automatically.) Google can translate your Gmail messages, your Google Talk chats, and your Google Docs. It can even render the captions on foreign-language YouTube videos into your native tongue. And, of course, there are mobile-friendly versions of the Google translation tools, including the Android app.
In September, Hugo Barra, Google’s director of mobile products, hinted that the conversation-mode version of the Google Translate Android app would be available “in a few months,” meaning it could arrive any day now. But whenever it appears, Och and his team aren’t likely to stop there. In fact, they’re already thinking about how to get around some of the barriers that remain between speakers of different languages—even if they have an Android smartphone to mediate between them.
“If I were speaking German now, ideally we should just be able to communicate, but there are some user interface questions and issues,” Och points out. There’s the delay, for one thing—while the Google cloud sends back translations quickly, carrying on a conversation is still an awkward process of speaking, pushing a button, waiting while the other person listens and responds, and so on. Then there’s the tinny computer voice that issues the translation. “It would be my voice going in, but isn’t my voice coming out,” Och says. “So no one knows how [true simultaneous machine translation] would work, until we get to the Babel fish or the Universal Translator from Star Trek—where it just works, and language is never an issue when they go to a different planet.”
3: Computer Vision Puts a “Bird on Your Shoulder”
It’s a staple of every film depiction of killer androids since Terminator: the moment when the audience watches through the robot’s eyes as it scans a human face, compares the person to a photo stored in its memory, and targets its unlucky victim for elimination.
That’s computer vision in action—but it’s actually one of the easiest examples, from a computational point of view. It’s a simple case of testing whether an acquired image matches a stored one. What if the android doesn’t know whether its target is a human or an animal or a rock, and it has to compare everything it sees against the whole universe of digital images? That’s the more general problem in computer vision, and it’s very, very hard.
But just as we saw with the case of statistical machine translation in Part 2 of this series, real computer science is catching up with, and in some cases outpacing, science fiction. And here, again, Google’s software engineers are helping to push the boundaries of what’s possible. Google made its name helping people find textual data on the Web, and it makes nearly all of its money selling text-based ads. But the company also has a deep interest in programming machines to comprehend the visual world: not so that they can terminate people more easily (not until Skynet takes over, anyway) but so that they can supply us with more information about all the unidentified or under-described objects we come across in our daily lives.
I’ve already described how Google’s speech recognition tools help you initiate searches by speaking to your smartphone rather than pecking away at its tiny keyboard. With Google Goggles, a visual search tool that debuted on Android mobile phones in December 2009 and on the Apple iPhone in October 2010, your phone’s built-in camera becomes the input channel, and the images you capture become the search queries. For limited categories of things—bar codes, text on signs or restaurant menus, book covers, famous paintings, wine labels, company logos—Goggles already works extremely well. And Google’s computer vision team is training its software to recognize many more types of things. In the near future, according to Hartmut Neven, the company’s technical lead manager for image recognition, Goggles might be able to tell a maple leaf from an oak leaf, or look at a chess board and suggest your next move.
Goggles is the most experimental, and the most audacious, of the technologies that Google CEO Eric Schmidt described in a recent speech in Berlin as the harbingers of an age of “augmented humanity.” Even more than the company’s speech recognition or machine translation tools, the software that Neven’s team is building—which is naturally tailored for smartphones and other sensor-laden mobile platforms—points toward a future where Google may be at hand to mediate nearly every instance of human curiosity.
“It is indeed not many years out where you can have this little bird looking over your shoulder, interpreting the scenes that you are seeing and pretty much for every piece in the scene—art, buildings, the people around you,” Neven told me in an interview late last year. “You can see that we will soon approach the point where the artificial system knows much more about what you are looking at than you know yourself.”
Neven, like most of the polymaths at Google, started out studying subjects completely unrelated to search. In his case, it was classical physics, followed by a stint in theoretical neurobiology, where he applied methods from statistical physics to understanding how the brain makes sense of information from the nervous system.
“One of the most fascinating objects of study in nature is the human brain, understanding how we learn, how we perceive,” Neven says. “Conscious experience is one of the big riddles in science. I am less and less optimistic that we will ever solve them—they’re probably not even amenable to the scientific method. But any step toward illuminating those questions, I find extremely fascinating.”
He sees computer vision as one of the steps. “If you have a theory about how the brain may recognize something, it’s surely nice if you can write a software program that does something similar,” he says. “That by no means proves that the brain does it the same way, but at least you have reached an understanding of how, in principle, it could be done.”
It’s pretty clear that the brain doesn’t interpret optical signals by starting from abstract definitions of what constitutes an edge, a curve, an angle, or a color. Nor does it have the benefit of captions or other metadata. The point—which I won’t belabor again here, since we’ve already seen it at work in the cases of Google’s efforts in speech recognition and machine translation—is that Neven’s approach to image recognition was data-driven from the start, relying on computers to sift through the huge piles of 1s and 0s that make up digital images and sniff out the statistical similarities between them. “We have, early on, and sooner than other groups, banked very heavily on machine learning as opposed to model-based vision,” he says.
Trained in Germany, Neven spent the late 1990s and early 2000s at the University of Southern California, in labs devoted to computational vision and human-machine interfaces. After tiring of the grant-writing treadmill, he struck out on his own, co-founding a company called Eyematic around a unique and very specific application of computer vision: using video from a standard camcorder to “drive” computer-generated characters in 3D. When that technology failed to pay off, Neven started Neven Vision, which began from the same foundation—facial feature tracking—but wound up exploring areas as diverse as biometric tools for law enforcement and visual searches for mobile commerce. “What Goggles is today, we started out working on at Neven Vision on a much smaller scale,” he says. “Take an image of a Coke can, and be entered in a sweepstakes. Simple, early applications that would generate revenue.”
How much revenue Neven Vision actually generated isn’t on record—but the company did have a reputation for building some of the most accurate face recognition software on the market, which was Google’s stated reason for acquiring the company in 2006. The team’s first assignment, Neven says, was to put face recognition into Picasa—the photo management system Google had purchased a couple of years before.
Given how far his team’s computer vision tools have evolved since then, Neven Vision probably should have held out for more money in the acquisition, Neven jokes today. “We said, ‘We can do more than face recognition—one of our main products is visual mobile search.’ They knew it, but they kept a poker face and said, ‘All we want is the face recognition, we are just going to pay for that.’”
Once the Picasa project was done, Neven’s team had to figure out what to do next. His initial pitch to his managers was to build a visual search app for packaged consumer goods. That was when Google’s poker face came off. “We said, ‘Let’s do a verticalized app that supports users in finding information about products.’ And then one of our very senior engineers, Udi Manber, came to the meeting and said, ‘No, no, it shouldn’t be vertical. It’s in Google’s DNA to go universal. We understand if you can’t quite do it yet, but that should be the ambition.’” The team was being told, in other words, to build a visual search tool that could identify anything.
That was “a little bit of a scary prospect,” Neven says. But on the other hand, the team had already developed modules or “engines” that were pretty good at recognizing things within a few categories, such as famous structures (the Eiffel Tower, the Golden Gate Bridge). And it had seen the benefits of doing things at Google scale. Neven Vision’s original face recognition algorithm had achieved a “significant jump in performance” simply because the team was now able to train it using tens of millions of images, instead of tens of thousands, and to parallelize the work across thousands of computers.
“Data is the key for pretty much everything we do,” Neven says. “It’s often more critical than the innovation on the algorithmic side. A dumb algorithm with more data beats a smart algorithm with less data.”
In practice, Neven’s team has been throwing both algorithms and data at the general computer vision problem. Goggles isn’t built around a single statistical model, but a variety of them. “A modern computer vision algorithm is a complex building with many stories and little towers on the side,” Neven says. “Whenever I visit a university and I see a piece that I could add, we try to find an arrangement with the researchers to bring third-party recognition software into Goggles as we go. We have the opposite of ‘Not Invented Here’ syndrome. If we find something good, we will add it.”
Goggles is really good at reading text (and translating it, if asked); it can work wonders with a business card or a wine label. If it has a good, close-up image to work with, it’s not bad at identifying random objects—California license plates, for example. And if it can’t figure out what it’s looking at, it can, at the very least, direct you to a collection of images with similar colors and layouts. “We call that internally the Fail Page, but it gives the user something, and over time this will show up less and less,” Neven says.
As even Neven acknowledges, Goggles isn’t yet a universal visual search tool; that’s why it’s still labeled as a Google Labs project, not an officially supported Google product. Its ability to identify nearly 200,000 works by famous painters, for example, is a computational parlor trick that, in truth, doesn’t add much to its everyday utility. The really hard work—getting good at identifying random objects that don’t have their own Wikipedia entries—is still ahead. “What keeps me awake at night is, ‘What are the honest-to-God use cases that we can deliver,’ where it’s not just an ‘Oh, wow,’” Neven says. “We call it the bar of daily engagement. Can we make it useful enough that every day you will take out Goggles and do something with it?”
But given the huge amount of learning material Google collects from the Web every day, the company’s image recognition algorithms are likely to clear that bar more and more often. They have savant-like skill in some areas: they can tell Amur leopards from clouded leopards, based on their spot patterns. They can round up images not just of tulips but of white tulips. The day isn’t all that far away, it seems clear, when Goggles will come close to fulfilling Neven’s image of the bird looking over your shoulder, always ready to tell you what you’re seeing.
The Next Great Stage of Search
What reaching this point might mean on a sociocultural level—in areas like travel and commerce, learning and education, surveillance and privacy—is a question that we’ll probably have to confront sooner than we expected. Why? Because it’s very clear that this is where Google wants to go.
Here’s how Schmidt put it in his speech: “When I walk down the streets of Berlin, I love history, [and] what I want is, I want the computer, my smartphone, to be doing searches constantly. ‘Did you know this occurred here, this occurred there?’ Because it knows who I am, it knows what I care about, and it knows roughly where I am.” And, as Schmidt might have added, the smartphone will know what he’s seeing. “So this notion of autonomous search, the ability to tell me things that I didn’t know but I probably am very interested in, is the next great stage, in my view, of search.”
This type of always-on, always-there search is, by definition, mobile. Indeed, Schmidt says Google search traffic from mobile devices grew by 50 percent in the first half of 2010, faster than every other kind of search. And by sometime between 2013 and 2015, analysts agree, the number of people accessing the Web from their phones and tablet devices will surpass the number using desktop and laptop PCs.
By pursuing a data-driven, cloud-based, “mobile first” strategy, therefore, Google is staking its claim in a near-future world where nearly every computing device will have its own eyes and ears, and where the boundaries of the searchable will be much broader. “Google works on the visual information in the world, the spoken and textual and document information in the world,” says Michael Cohen, Google’s speech technology leader. So in the long run, he says, technologies like speech recognition, machine translation, and computer vision “help flesh out the whole long-term vision of organizing literally all the world’s information and making it accessible. We never want you to be in a situation where you wish you could get at some of this information, but you can’t.”
Whatever you’re looking for, in other words, Google wants to help you find it—in any language, via text, sound, or pictures. (And if it can serve up a few ads in the process, so much the better.) That’s the real promise of having a “supercomputer in your pocket,” as Schmidt put it. But what we do with these new superpowers is up to us.