Neo Aims to Reorder the Business World with Graph Databases
Today I’m going to tell you about an up-and-coming Silicon Valley startup, Neo Technology, that sells a radical new type of database.
I know, databases are about as exciting to the average technology user as carburetors and doorstops. But before your eyes glaze over and you click on to the next article, let me explain why you should care.
It’s pretty clear that the biggest winners in Silicon Valley in the past decade have been the companies that understood and exploited connections—between Web pages, in the case of Google, or between people, in the cases of Facebook, LinkedIn, and Twitter. To build their empires, all of these companies had to painstakingly develop several new types of databases capable of representing and sorting through such connections. One type is called a graph database. I wrote about an important example, Google’s Knowledge Graph, back in December.
Neo’s database, called Neo4j, is the first commercial, off-the-shelf graph database. Any company can use it; no longer do you have to build your own graph database to take advantage of connected data.
Do I have your attention yet?
“There are two types of data: atomic data about single individuals, and connective data about how various elements are connected,” argues Neo’s co-founder and CEO, Emil Eifrem. “There are a bunch of industries that have only exploited atomic data so far. And what we are seeing—what has played out in several industries—is that when a guy or girl comes along who starts exploiting the connections, it revolutionizes that industry.”
Of course, every startup CEO talks about how his company’s technology is revolutionizing the world. But Eifrem, an uncharacteristically brash Swede, goes even farther. He thinks the companies that fail to understand the connections in their data will inevitably be left behind. “Whoever you are, you are eventually going to have to exploit connected data in your industry, or you are going to go out of business, because somebody else will,” he says.
For many applications, analyzing connections at large scale means abandoning the relational database model that has dominated the computer industry for the last 40 years. It’s not that relational databases can’t hold connective data; they’re just not very good at it. Neo4j, by contrast, was designed from the ground up to represent relationships between entities, right down to the way the data is recorded on a disk.
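To make the relational side of that comparison concrete, here is a minimal sketch of how "friends of friends" might be looked up when friendships live in a table. It uses Python's built-in sqlite3 module as a stand-in for a production relational database, and the `friendship` table, its columns, and the sample names are all assumptions made for illustration. The point is only that each hop in the network becomes a join of the table with itself.

```python
import sqlite3

# Hypothetical schema: one row per directed "knows" link between two people.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendship (person TEXT, friend TEXT)")
links = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
# Store each friendship in both directions so the link is symmetric.
conn.executemany(
    "INSERT INTO friendship VALUES (?, ?)",
    links + [(b, a) for a, b in links],
)

def friends_of_friends(person):
    """People reachable through one intermediate friend (one self-join)."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.friend
        FROM friendship AS f1
        JOIN friendship AS f2 ON f2.person = f1.friend
        WHERE f1.person = ? AND f2.friend <> ?
        ORDER BY f2.friend
        """,
        (person, person),
    )
    return [row[0] for row in rows]

print(friends_of_friends("ann"))  # ['cat']
print(friends_of_friends("bob"))  # ['dan']
```

Reaching a friend of a friend of a friend in this style would typically mean adding yet another join to the query, which is one reason deep relationship queries can get expensive on tables.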
To see the power of a graph database, consider this example. Eifrem says a big social network that he isn’t allowed to name approached his company to ask for a demonstration of Neo4j. It handed the startup a sample dataset representing 1,000 people connected in a network; each person had an average of 50 friends.
The assignment: select any two people at random and find out if they know each other directly, or are connected by a mutual friend, or a friend of a friend, or a friend of a friend of a friend. That’s the kind of thing that sites like LinkedIn or Facebook need to do all the time, by the way. And when they do, they usually need the answer in less than half a second, or users get impatient.
When Neo’s engineers loaded the network data into a standard MySQL relational database and ran the query, it took two whole seconds to get an answer. When they put the data into a Neo4j graph database, the query took 2 milliseconds—a thousandth as long as the old method.
And when they expanded the sample to a million people, the response time was still the same: 2 milliseconds. (The math is complicated, but the fundamental advantage of a graph database is that the data is stored in a way that makes traversing a web of connections lightning-fast, almost regardless of the size of the web.)
“Any time you enable things that are not just 10 or 20 percent better, but 1,000 times better, then your entire world changes and you can do completely new things,” says Eifrem. “We ended up winning that customer.”
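For readers who want to see the shape of the assignment described above, here is a rough sketch in Python. It is not Neo4j's implementation, which the article does not describe; it is a plain breadth-first search over an in-memory adjacency list, with a hypothetical function name (`degrees_apart`) and made-up sample data. It illustrates the general idea that when each person's connections are stored right alongside that person, a search only touches the neighborhood it walks through.

```python
from collections import deque

def degrees_apart(graph, start, goal, max_hops=4):
    """Return the number of hops from start to goal, or None if the two
    people are not connected within max_hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not look beyond the hop limit
        for friend in graph.get(person, ()):
            if friend == goal:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None

# Hypothetical sample network: each person maps to a list of direct friends.
network = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat", "eve"],
    "eve": ["dan", "fay"],
    "fay": ["eve"],
}

print(degrees_apart(network, "ann", "dan"))  # 3
print(degrees_apart(network, "ann", "fay"))  # None (five hops away)
```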
Graph databases aren’t magical, and in some ways they’re harder to work with than relational databases or other types of so-called “NoSQL” databases. But companies in numerous industries are starting to put the technology to work.
Eifrem says his San Mateo-based startup, which is backed by $24 million in venture funding from Fidelity Growth Partners Europe, Sunstone Capital and Conor Venture Partners, has paying customers in areas like hardware (Cisco), professional networking (Glassdoor and Viadeo), publishing (Bloomberg), telecommunications (Deutsche Telekom, Telenor, and SFR), office equipment (Pitney Bowes), and content management (Adobe Systems).
“Ten years ago the big Web giants like Facebook, Google, and Twitter had to hire the best and brightest out of Stanford to build tools to process this data,” says Eifrem. “Now you can buy it from us the same way you can buy a relational database from Oracle.”
Eifrem himself is one of Sweden’s best and brightest. He says he taught himself to program as a teenager by building a text-based role-playing game. (He confesses that he thought spending 18 hours a day programming would make him more attractive to “hot Swedish blond chicks.” He didn’t say how that theory worked out.)
Eifrem’s mandatory stint in the Swedish Armed Forces in the late 1990s ended in the middle of the academic year, and to kill time before starting at university he joined a startup called Windh Technologies that was building a Web-based enterprise content management system, or CMS.
Soon enough, Eifrem became Windh’s chief technology officer, in charge of improving the company’s core database. And that’s where his exposure to big databases began. Hundreds of companies were using the system, each with their own groups and subgroups. Each group owned an array of documents, which had to be accessible to people in some groups but not others. It was all built on a standard Oracle relational database, where everything is stored in tables of rows and columns.
By 2000, Eifrem says, trying to keep track of all the hierarchies and permissions had turned into “a big mess.” So he ended up building a layer on top of the Oracle database “to shield us from the database and abstract all of this into nodes and relationships,” also called edges—the key elements of a graph database.
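As a loose illustration of what "nodes and relationships" can look like for a permissions problem like Windh's, here is a small Python sketch. Everything in it is assumed for the example: the group and document names, the relationship types (`HAS_SUBGROUP`, `OWNS`, `MEMBER_OF`), and the access rule that a person may read documents owned by their own group or by any parent group. The article does not say how Windh's layer actually modeled access.

```python
# Each relationship is a (source node, type, target node) triple.
relationships = [
    ("acme", "HAS_SUBGROUP", "acme-sales"),
    ("acme-sales", "HAS_SUBGROUP", "acme-sales-emea"),
    ("acme", "OWNS", "handbook.doc"),
    ("acme-sales", "OWNS", "price-list.doc"),
    ("kim", "MEMBER_OF", "acme-sales-emea"),
    ("lee", "MEMBER_OF", "acme"),
]

def targets(source, rel_type):
    """Nodes that `source` points to through relationships of rel_type."""
    return [t for s, r, t in relationships if s == source and r == rel_type]

def sources(target, rel_type):
    """Nodes that point to `target` through relationships of rel_type."""
    return [s for s, r, t in relationships if t == target and r == rel_type]

def readable_documents(person):
    """Assumed rule: collect documents owned by the person's groups
    and by every parent group above them."""
    docs = set()
    visited = set()
    pending = targets(person, "MEMBER_OF")
    while pending:
        group = pending.pop()
        if group in visited:
            continue
        visited.add(group)
        docs.update(targets(group, "OWNS"))
        pending.extend(sources(group, "HAS_SUBGROUP"))  # walk up to parents
    return sorted(docs)

print(readable_documents("kim"))  # ['handbook.doc', 'price-list.doc']
print(readable_documents("lee"))  # ['handbook.doc']
```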
The scheme worked, but translating queries from the graph layer to the relational layer caused enough new difficulties that the Windh team was soon pining for a “native” graph database where the documents and their properties could be stored directly as nodes and relationships. So they built one, and then rebuilt it. By 2003, Windh’s whole CMS was running on top of one of the world’s first native graph databases. They called it the NEO Node Space Engine.
In the rush of that accomplishment, the Windh team wanted to tell everyone about NEO. “We were young and arrogant and said shit like ‘the world deserves this,’” Eifrem recalls. Unfortunately, 2003 was a bad time to be promoting a new kind of database. Not only was the tech world hung over from the 2001 crash, but there was also widespread cynicism in the computing community after object-oriented databases, a previous alternative to the relational database, had failed to live up to dot-com-era hype.
“There was zero acceptance to bringing a new database into the market,” Eifrem says. So the Windh team went back to the CMS business and “honed our skills, polished the database, and learned more about how to build applications that use it.”
But by 2007 or 2008, some of the new database approaches being pioneered inside companies like Amazon, Google, Yahoo, and Facebook had begun to attract the attention of developers. The huge collections of information on user behavior that these businesses were generating began to be described as “Big Data,” and a lot of this data was going into a new generation of non-relational, NoSQL databases like Amazon’s Dynamo, Google’s BigTable, Facebook’s Cassandra, and LinkedIn’s Voldemort.
To understand where Windh’s technology fit in, a crash course on the various families of NoSQL databases is needed. First there’s the “key-value store,” where data is stored in tall, skinny tables consisting of just two columns—a key and a value. Dynamo and Voldemort belong to this family; key-value stores are especially good at holding simple data.
Then there’s the column family, inspired by Google’s BigTable. In a column database, data is recorded primarily in columns rather than rows. Each row can have a different number of columns, which means column databases are good for holding data with varying amounts of structure. Cassandra and the Hadoop HBase database are column databases.
Document databases are the third type of NoSQL database. They have no tables, rows, or columns. They’re just collections of documents, each with an arbitrary number of fields. They’re great for storing and retrieving variegated data like, well, documents. I profiled 10gen, a Palo Alto company that promotes the MongoDB document database, in September 2011.
Finally there are graph databases, which are best for storing interconnected elements where the types of connections might change over time (making it impossible to define a fixed schema of rows and columns). Google has a graph-processing system called Pregel, as well as the Knowledge Graph, and Twitter’s FlockDB acts like a graph database, though it’s actually a MySQL database under the hood.
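One way to picture the four families side by side is to write the same tiny dataset in each shape. The sketch below uses ordinary Python dictionaries and lists as stand-ins; none of it is the storage format or API of Dynamo, BigTable, MongoDB, or Neo4j, and the names and fields are invented for illustration.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"user:1": "Ada", "user:2": "Bo"}

# Column family: each row key maps to its own set of columns.
column_family = {
    "user:1": {"name": "Ada", "city": "Oslo"},
    "user:2": {"name": "Bo"},  # rows need not share the same columns
}

# Document store: a collection of documents with arbitrary, nested fields.
documents = [
    {"_id": 1, "name": "Ada", "tags": ["admin"], "address": {"city": "Oslo"}},
    {"_id": 2, "name": "Bo"},
]

# Graph: nodes plus typed relationships between them.
nodes = {1: {"name": "Ada"}, 2: {"name": "Bo"}}
edges = [(1, "KNOWS", 2)]

def known_by(node_id):
    """Names of the people that the given node has a KNOWS link to."""
    return [nodes[t]["name"] for s, rel, t in edges if s == node_id and rel == "KNOWS"]

print(known_by(1))  # ['Bo']
```

Only the last form records the connection between Ada and Bo as data in its own right; in the other three, a link like that would have to be tucked inside a value or a field.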
Eifrem felt that Windh’s database had advantages that other types of NoSQL databases didn’t, especially when it came to handling highly complex data with lots of embedded relationships. NoSQL databases are built to perform well at large scale, but the more complex the data, the less easily they scale up, Eifrem says. Only graph databases still perform well at scale when the data is complex, he argues.
“If you put data into a key-value store, it will be easier to get to scale [across many machines], but then you have chopped it up into these small pieces,” he says. “Whereas a graph database says, ‘Fuck it, the world is connected, let’s embrace that and allow you to express your domain in as rich a way as possible.’”
In 2009, Windh spun out its NEO technology as an open-source database called Neo4j, and Eifrem set up Neo Technology to sell it. A Series A investment from Fidelity in 2011 gave the startup the resources it needed to move from Sweden to San Mateo.
In contrast to companies like Red Hat or 10gen, which make money on consulting and support around open-source software, Neo owns the intellectual property behind Neo4j and is the exclusive contributor to the database’s core code. Outside programmers contribute “at the fringes,” but customers must pay Neo to embed Neo4j in their applications, Eifrem says.
At first, Eifrem thought the Neo4j graph database would have just four main uses. Social networking, where companies must track connections between people, was a no-brainer. So was geographical data, where locations can be stored as nodes and the routes between them can be represented as edges.
Third was network management, where people in operations centers must track millions of switches and routers and data pipes and continuously recalculate the most efficient path for data through the network, depending on which parts of the network are most congested.
Finally, there was master data management, the process big companies use to make sure they’re working from a single, consistent set of data. Cisco, for example, chose Neo4j to represent its entire sales hierarchy, including all sales reps, whom they report to, and whom they’re trying to sell to.
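The routing and network-management cases above both come down to finding the cheapest path through a graph whose edge costs can change. Here is a minimal sketch using Dijkstra's algorithm, a standard textbook method; the article does not say which algorithm Neo4j or its customers use, and the node names and link costs below are invented. Raising a link's cost stands in for congestion on that link.

```python
import heapq

def cheapest_path(links, source, target):
    """Dijkstra's algorithm over a dict of {node: {neighbor: cost}}.
    Returns (total_cost, path) or None if target is unreachable."""
    best = {source: 0}
    previous = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        for neighbor, weight in links.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]

# Hypothetical network: routers a to d with per-link costs.
links = {
    "a": {"b": 1, "c": 4},
    "b": {"c": 1, "d": 5},
    "c": {"d": 1},
    "d": {},
}

print(cheapest_path(links, "a", "d"))  # (3, ['a', 'b', 'c', 'd'])

links["b"]["c"] = 10  # the b-to-c link becomes congested
print(cheapest_path(links, "a", "d"))  # (5, ['a', 'c', 'd'])
```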
But beyond those initial cases, Eifrem says customers are coming up with all sorts of uses he never foresaw, including fraud detection (fraudsters tend to have fewer legitimate connections in a social graph) and drug discovery in the pharmaceutical industry (where interactions between proteins, genes, and compounds that modify their expression can readily be represented as graphs).
Eifrem says the query language that Neo designed to go with Neo4j makes it easy for non-programmers to work with graph databases. In this language, called Cypher, relationships are represented in a user-friendly way that Eifrem compares to ASCII art. If person A loves person B, for example, the relationship would be notated in Cypher this way: (A)-[:Loves]->(B). “You’d never expect a CEO to write an SQL query, but with Cypher, if you can draw it on a whiteboard, you can make a query out of it,” says Adam Frankl, Neo’s vice president of marketing.
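To give a feel for why that arrow notation maps so directly onto a query, here is a toy Python sketch that reads a single pattern of the form shown above and finds every pair of nodes joined by that relationship. It is emphatically not Cypher or any part of Neo4j: the `match` function, the regular expression, and the sample data are all made up for illustration, and the sketch handles only this one pattern shape.

```python
import re

# Recognizes only the single-hop shape (X)-[:TYPE]->(Y) used in the article.
PATTERN = re.compile(r"\((\w+)\)-\[:(\w+)\]->\((\w+)\)")

def match(pattern, relationships):
    """Return one {variable: node} binding per relationship of the given type."""
    parsed = PATTERN.fullmatch(pattern.strip())
    if parsed is None:
        raise ValueError("unsupported pattern: " + pattern)
    left, rel_type, right = parsed.groups()
    return [
        {left: source, right: target}
        for source, rel, target in relationships
        if rel == rel_type
    ]

# Hypothetical data: (source, relationship type, target) triples.
relationships = [
    ("romeo", "Loves", "juliet"),
    ("paris", "Loves", "juliet"),
    ("romeo", "Knows", "paris"),
]

for binding in match("(A)-[:Loves]->(B)", relationships):
    print(binding["A"], "loves", binding["B"])
# romeo loves juliet
# paris loves juliet
```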
As Neo tries to grow, though, one of its big challenges is raising awareness—that is, getting CEOs and CIOs to realize that much of the data their companies own is inherently connective rather than atomic, and that a graph database might help them ask different types of questions about it.
“No one has to ask what a relational database is, but with a graph database not everyone knows what problems it solves,” Eifrem says. “A lot of people think it’s just for social data. But what we see is that it’s much more broadly applicable. It’s also for life sciences and media and the general market. We have, objectively, the most elegant and expressive mode [for storing connective data], and I can prove it with equations and shit, but we are not yet good enough at exposing that.”
On top of that, Eifrem says the 41-employee company is still working on making its product as easy to set up, populate, and use as other NoSQL databases.
“If you are a developer it is not as easy to use Neo4j today as it is to use MongoDB,” he says. “We have been very focused on making the model super-awesome in performance and robustness, and we have not been as focused on ease of use.” But Eifrem also says the company will be “100 percent focused on usability” in 2013, especially for Ruby, C#, and PHP developers.
Once businesses realize that they can do “completely new things” with graph databases, Eifrem thinks, there’s nothing stopping Neo from growing to the scale of previous database giants like Oracle.
“There is the opportunity to make it really big,” he says, with just a lingering touch of the young and arrogant programmer of 2003. “I’m talking about being one of the biggest, most profound companies of this decade. Because if you just exploit your atomic data, you are going to go out of business.”