Facebook Doesn’t Have Big Data. It Has Ginormous Data.
One thing that makes Facebook different from most other consumer Internet services is the vast scale of the data it must manage. Start with a billion-plus members, each with 140 friends, on average. Add some 240 billion photos, with 350 million more being uploaded every day. Mix in messages, status updates, check-ins, and targeted ads, and make sure every page is personalized and continuously updated for every user, and you’ve definitely got a first-class big data challenge.
But the data itself is not what makes Facebook (NASDAQ: FB) unique or successful. After all, there are organizations that own larger databases—think Google, Amazon, the CIA, the NSA, and the major telecom companies. But none of them can claim to keep customers fixated on their sites for an astounding 7 hours per month. To do that, you have to understand your users and what they want. And the true source of innovation at Facebook, as I’ve been learning lately, is the data it has about all the data it has.
Every move you make on Facebook leaves a digital trail. When you log in, log out, “like” a friend’s photo, click on an ad, visit the fan page for a band or a TV show, or try a new feature, Facebook takes note. It adds these behavioral tidbits to its activity logs, which take up hundreds of petabytes and are stored in giant back-end databases for analytics purposes. (To be specific, the logs live in giant custom-built data centers, on clusters of servers running the open-source Hadoop distributed computing framework.)
This back end is separate from, but just as important as, the front-end systems that store your personal data and generate Facebook’s public user interface. At least 1,000 of Facebook’s 4,600 employees use the back end every day, mainly to monitor and understand the results of the tens of thousands of tests that are being run on the site at all times.
Which leads to a larger point: there is no single “Facebook.”
As a product, Facebook is about as Protean as a non-mythological entity can get. It takes a constantly shifting form depending on what new features or designs Facebook’s engineers are trying out at any given hour, in any given geography around the world. “You and your friends are seeing subtly different Facebook pages, you just don’t know it,” says Santosh Janardhan, Facebook’s manager of database administration and storage systems.
Analytics, and the infrastructure that supports it, are the key to Facebook’s constant self-optimization. In conversations with Janardhan and other top Facebook engineers, I’ve been getting an introduction to the company’s analytics back end, which is arguably the most complex and sophisticated on the consumer Web. Indeed, if you want a glimpse of how other consumer-facing tech companies may be managing and exploiting big data in the future, it would be smart to look first to Facebook, which logs so much information on user behavior that it’s had to build its own storage hardware and data management software to handle it all.
“Everything we do here is a big data problem,” says Jay Parikh, Facebook’s vice president of infrastructure engineering. “There is nothing here that is small. It’s either big or ginormous. Everything is at an order of magnitude where there is not a packaged solution that exists out there in the world, generally speaking.”
Unlike Google, which stays mostly mum about the details of its infrastructure, Facebook shares much of what it’s learning about managing its ginormous data stores. It has released many of its custom database and analytics tools to the open source community, and through the Open Compute Project, it’s even giving away its hardware tricks, sharing the specifications of its home-grown servers, storage devices, and data centers.
There’s probably an aspect of karma to this open source strategy; Facebook shares what it learns “in hopes of accelerating innovation across all these different levels of the stack,” Parikh says. Of course, the sharing also bolsters Facebook’s image as a cool place for engineers to work, even post-IPO. (Parikh’s team has been busy on the media circuit lately, appearing in a Wired magazine feature as well as this article.)
But whatever the company’s motivations for opening up, the world should be watching and learning. Facebook is the new definition of a data-driven company. How big data is actually used to shape a big business is a question still shrouded in mystery for most observers. At Facebook, an answer is emerging: it involves using detailed data on user behavior to guide product decisions, and—just as telling—building a lot of new software and hardware to store and handle that data. The engineers who oversee that process hold the keys to the company’s growth.
Making the Machine Do the Work
If Facebook were to show up on a cable-TV reality series, it would probably be Hoarders. The starting point of the company’s philosophy about analytics is “keep everything.”
That’s different from the historical norm in the analytics or business intelligence sectors. Because old or “offline” data is expensive to store and difficult to retrieve, IT departments generally either throw it out, archive it on tape, or filter and reduce it into data warehouses designed to …