Diffbot Challenges Google Supremacy With Rival Knowledge Graph

When you do a Google search at your desktop for a common health condition, you’ll get links to tons of webpages you can sift through in hopes of finding the specific facts you want.

But if you’ve searched Google from a mobile device recently, you may have been rewarded with a summary of important facts about the disorder, culled for you from many websites. You’re tapping into what Google calls its Knowledge Graph.

Palo Alto, CA-based artificial intelligence startup Diffbot has based its whole business on that second kind of search—ferreting out data points scattered across many websites and pulling them together into Big Data resources that can be queried, combined, and rearranged. The upstart company—14 engineers in a backyard bungalow—now says its own data mega-map, called its Global Index, is a bigger database than Google’s Knowledge Graph of billions of facts.

This kind of structured data—Web facts organized into a searchable database—is the resource behind the most popular mobile apps, says Diffbot founder and CEO Mike Tung. Such apps can answer questions like, “What is the best Thai restaurant in this neighborhood?” Diffbot’s mission is to capture everything online—articles, images, videos, comments, reviews, the works—and keep it updated.
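To make the idea of "structured data" concrete, here is a minimal sketch of how an app might answer the Thai restaurant question once Web facts have been organized into records. The `Restaurant` fields, the sample entries, and the `best_restaurant` helper are all hypothetical, invented for illustration; they do not reflect Diffbot's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class Restaurant:
    # Hypothetical record shape; a real index would carry many more fields.
    name: str
    cuisine: str
    neighborhood: str
    rating: float


def best_restaurant(records, cuisine, neighborhood):
    """Return the highest-rated record matching cuisine and neighborhood, or None."""
    matches = [
        r for r in records
        if r.cuisine == cuisine and r.neighborhood == neighborhood
    ]
    return max(matches, key=lambda r: r.rating, default=None)


# Made-up sample data standing in for facts extracted from many websites.
RECORDS = [
    Restaurant("Thai Orchid", "thai", "Downtown", 4.6),
    Restaurant("Bangkok Corner", "thai", "Downtown", 4.2),
    Restaurant("Siam House", "thai", "Midtown", 4.8),
    Restaurant("Pasta Place", "italian", "Downtown", 4.9),
]

best = best_restaurant(RECORDS, "thai", "Downtown")
print(best.name if best else "no match")
```

The point of the sketch is that once facts sit in uniform fields, the question becomes a simple filter and sort rather than a hunt through a list of links.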

“We are working to create a structured version of the Web,” Tung says. “We’re quite serious about that.”

The company developed elements of its Web-crawling methods as it served customer needs over the past few years, but only started proactively spidering the Web for its own purposes in the past few months. Its Global Index now contains more than 600 million objects (this can be anything from a celebrity to an Ikea chair model) and 19 billion facts. Diffbot clocks Google’s Knowledge Graph at about 570 million objects and 18 billion facts.

Diffbot, founded in 2008, is already covering its operating expenses by enhancing other search engines including Microsoft’s Bing and DuckDuckGo, and by powering apps for companies such as Cisco and AOL, Tung says. Diffbot subscribers can build apps based on narrowly targeted searches that answer questions such as, “What’s the best price in my region for Nike cross trainers?”

But Diffbot has larger ambitions, and it’s raising money to support them. The company just banked $500,000 from Bloomberg Beta, bringing an angel round up to $3 million. It wouldn’t be surprising to see a Series A round raised this year, Tung says. Here’s a hint about Diffbot’s ultimate interests: its CEO says the company may help answer the long-debated question of whether computers can ever duplicate human intelligence.

Diffbot has been exploring the art of teaching machines to function like a human researcher—compiling facts from multiple online sources so they can be combined and compared for many purposes. The company began building its Global Index by storing results from URL searches requested by customers, but in recent months Diffbot has been analyzing websites to build its index at a rate of up to 15 million pages a day.

Its artificial intelligence bots are doing the work without human supervision. Tung says Google’s Knowledge Graph, by contrast, has relied significantly on human curation.

“Our approach is fairly radical in that there’s no human behind the curtain,” Tung says. “This is why we were able to catch up in such a short time.”

Diffbot assembles its own servers at its Palo Alto bungalow. They’re not the kind you can rent from a cloud storage outfit. The company’s standard crawling and indexing machines have 32 terabytes of solid-state storage, 192GB of RAM, and 40 CPU cores. Diffbot now has 100 servers in a guarded co-location space in Fremont, CA, where fiber optic cables link them to all the Internet service providers in the world, Tung says.

The crawlbots are adding millions of new objects to Diffbot’s index every day. Tung is envisioning adding thousands more servers, or tens of thousands.

“If we just throw more resources at it, we can generate structured data at true scale,” Tung says.

Last year, Xconomy’s much-missed San Francisco editor Wade Roush asked the question, “Could a Little Startup Called Diffbot Be the Next Google?” in his article about Diffbot’s mission to cover the Web more fully by taking search further than conventional search engines. Diffbot had developed bots that can “read” a webpage the way humans do, distinguishing among different parts of the layout such as headlines, main text, side columns, and so on.

With this computer vision, the machines can tell the difference between types of Web pages—article pages, home pages, and product offerings where prices are displayed. They can reshuffle these layout elements to reformat a Web page for mobile device screens—a chore that companies pay Diffbot to take on, and one of its early sources of revenue.

The bots also “learn” where they’re likely to find certain information on a page, such as prices or author names on articles. They can extract information from images, videos, blogs, and the discussion threads that follow published articles. The company’s newest product, Discussion API, has become a tool for marketers who want to check brand reputations, Tung says.

Companies also come to Diffbot to assemble information about competitors and suppliers, or people they might want to hire.

“All these entities leave footprints on the Web,” Tung says. Diffbot is now developing frameworks for machine analysis of new kinds of pages, such as events, locations, and profile pages, to further expand its Global Index.

If you’re a high school freshman with an English term paper due tomorrow, you might be wondering how you can log in to this new type of search to get all the facts you need about Charles Dickens by midnight tonight, painlessly assembled by a machine.

For the most part, though, consumers can’t yet directly tap into the structured Web data compiled by Diffbot and Google. Diffbot shares its data resources with consumers indirectly by selling its services to search engines and app developers. A Bing search about a product, for example, will show the traditional list of website links, but in the upper right-hand corner, it may display an image, price, and other facts assembled about the product—structured data.

Google is also making some of its Knowledge Graph findings available through mobile searches. But Tung speculates that Google may limit these kinds of search returns in favor of traditional Web page listings, because those listings expose consumers to more advertising, the tech titan’s source of revenue.

In his 2014 article, Roush pondered what would happen if Diffbot remained an independent company, grew to 10,000 employees, and vied with Google to control our online existence. But Roush, wistfully, found it more likely that Diffbot would be “acqui-hired” by Google at some point.

Tung brought up this prediction when I talked to him this week—mainly to refute it. He didn’t sound like a guy who was ready to let somebody else discover his company’s full potential.

“We have received a lot of acquisition offers from pretty much all of the large technology companies,” Tung says. Diffbot’s response to the offers, he says, is to convert its suitors into customers. He declines to say whether Google belongs to either category.

So what future is Tung aiming for with an independent Diffbot?

My sense is that Tung is an artificial intelligence researcher at heart, captivated by questions about what machines could do if humans knew how to equip their silicon brains and train them expertly.

Here’s part of his shorter-term vision: Technologists are not just organizing Web data to better inform humans so they can decide what to do or to think. They’re also structuring the Web to better inform machines, so they can take action themselves and work with other machines.

The first products along these lines may be as cozy as recipe apps. You might be able to point your mobile phone at an unfurnished corner of your new living room, so that in a few minutes it will tell you what chair on the market would fit in the space, and look good with the rest of your decor, Tung says.

He gave another example: The printer in your office runs out of ink, knows whether it needs a black or color cartridge, knows which manufacturers’ products are compatible, taps into the Web, compares prices, and executes the order.
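The printer scenario boils down to a filter over structured offers followed by a price comparison. The sketch below illustrates that logic under stated assumptions: the `Offer` record, the printer model name, the sellers, and the prices are all made up, and the final ordering step is left out because it would depend on a retailer's own system.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    # Hypothetical structured record for one cartridge listing found on the Web.
    seller: str
    cartridge_type: str          # "black" or "color"
    compatible_models: frozenset  # printer models this cartridge fits
    price: float


def choose_offer(offers, printer_model, needed_type):
    """Return the cheapest compatible offer of the needed type, or None."""
    candidates = [
        o for o in offers
        if o.cartridge_type == needed_type
        and printer_model in o.compatible_models
    ]
    return min(candidates, key=lambda o: o.price, default=None)


# Made-up listings standing in for data gathered from several sellers.
OFFERS = [
    Offer("ShopA", "black", frozenset({"PX-100", "PX-200"}), 24.99),
    Offer("ShopB", "black", frozenset({"PX-100"}), 21.50),
    Offer("ShopC", "color", frozenset({"PX-100"}), 19.00),
    Offer("ShopD", "black", frozenset({"ZZ-9"}), 9.99),
]

pick = choose_offer(OFFERS, printer_model="PX-100", needed_type="black")
print(pick.seller if pick else "no compatible offer")
```

A machine can make this choice on its own only if cartridge type, compatibility, and price are available as clean fields, which is the kind of data the structured Web is meant to supply.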

“We think that’s the exciting future,” Tung says. “Everything’s intelligent, and they all need access to information.”

But beyond those limited chores, there’s still a big question out there: Will machines ever duplicate human intelligence? For example, could they exercise judgment by balancing the benefits and consequences of two different courses of action, such as choices between medical treatments or business strategies?

“Progress toward human intelligence is still quite a rocky road ahead,” Tung says. The artificial intelligence community hasn’t yet hit on the missing link that would make that possible, he says. “It’s going to require a breakthrough that’s still unknown.”

But the Diffbot team is trying to make the training of computers more sophisticated by rendering the Web, the digital repository of a growing swath of human knowledge, readable by machines.

The big successes already gained in artificial intelligence have not necessarily come from new programming wizardry, but through the application of old algorithms to dense data sets, Tung and fellow Diffbot executive John Davi point out.

Achievements in computer image classification have built on the burgeoning population of digitized images that were not available in earlier decades, they say. The IBM computer Watson performed feats of medical diagnosis by drawing from a concentrated trove of human-curated knowledge in that relatively narrow realm of data, Tung and Davi add.

“Our working theory is, if we can assemble enough structured, labeled data, we can simulate all aspects of human intelligence,” Tung says. The inflection point in artificial intelligence may be assembling trillions of objects from the Web that machines can read, he says.

“We’re working on what could be the missing piece, which is the data,” Tung says.
