Greenplum Purchase Gets EMC into the Big Data Game

[Corrected July 8, 2010, 10:20 a.m.; see below] Boston is already a powerhouse in “big data.” It’s home to companies like Netezza, Dataupia, Vertica, and Lightwolf Technologies, which all help enterprises manage and mine the huge databases used in business intelligence applications. It was the site of the first “Boston Big Data Summit” last fall. And now, with the acquisition of San Mateo, CA-based Greenplum by Hopkinton, MA-based EMC, the region will be even bigger into big data.

Greenplum is probably best known as the provider of the multi-petabyte data warehouse that auction site eBay formerly used to analyze the behavior of site visitors. EBay users generate a reported 150 billion individual event records per day as they skim the site and place bids. That’s information eBay can use to optimize the site’s performance and serve customers better—but doing so requires sifting through trillions of records overall. This huge task requires a massively parallel processing approach, which is what Greenplum’s database software, built on top of the open-source Postgres object-relational database system, is optimized to do. [Update and correction: Oliver Ratzesberger, who is in charge of the analytics platform at eBay, wrote to say that the company now uses a different technology for analytics.]

The main difference between Greenplum’s technology and other database software schemes has to do with how data is accessed. In traditional database management systems built by companies like Oracle and Microsoft, different query processing jobs generally share access to the same disks, which can slow down individual queries. But Greenplum’s so-called “shared-nothing” system divides data across multiple servers, or segments, each of which has its own connection to a disk drive. That means a single database query can be run against many segments of data simultaneously, which is perfect for the analytics applications run by Greenplum customers like eBay, Fox Interactive Media, NASDAQ, the New York Stock Exchange, Skype, and T-Mobile.
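To make the shared-nothing idea concrete, here is a minimal Python sketch. It is an illustration only, not Greenplum’s actual implementation: the segment count, the function names, and the sample events are all hypothetical, and threads stand in for what would be separate servers with their own disks. Each row is assigned to exactly one segment by hashing a key, every segment answers the query using only its own rows, and a coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical number of segments, chosen for illustration


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments: int = NUM_SEGMENTS):
    """Split rows into per-segment lists so no row lives on two segments."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row["user"], num_segments)].append(row)
    return segments


def local_count(segment_rows):
    """Work done on one segment: count events per user from local rows only."""
    return Counter(row["user"] for row in segment_rows)


def parallel_count(segments):
    """Coordinator: run the same query on every segment, then merge partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(local_count, segments))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the per-segment counts together
    return total


# Made-up event records, loosely modeled on site-visitor activity.
events = [
    {"user": "alice", "action": "view"},
    {"user": "bob", "action": "bid"},
    {"user": "alice", "action": "bid"},
    {"user": "carol", "action": "view"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "view"},
]

segments = distribute(events)
result = parallel_count(segments)
print(dict(result))
```

Because rows are distributed on the same key the query groups by, each user’s events land on a single segment and the merge step is a simple sum. A query grouped on a different column would require moving data between segments first, which is one of the main design trade-offs in systems of this kind.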

Announced Tuesday, the all-cash acquisition of Greenplum (terms weren’t given) means that EMC will now have a data computing product division that allows it to compete directly with suppliers of large-scale data warehousing systems like Netezza and Vertica, not to mention database giants like Oracle and Teradata. It’s perhaps surprising that EMC would reach all the way to California for an acquisition in the big-data sector, given that there were several options within Route 128. But it’s easy to understand why EMC would want to be a player in this area, considering that it sells much of the storage and networking hardware that many large companies’ data warehouses live on.

“EMC and Greenplum are partners already,” says Dave Farmer, an EMC public relations spokesperson. “So when we were looking at this space and deciding that we wanted to make a more substantive offering in the space, we were already very close to Greenplum, and as we looked closer and deeper, they very quickly rose to the top of the potential candidates.”

Greenplum’s 140 employees will remain based in San Mateo, Farmer says. The seven-year-old startup will form the core of a new data computing products division headed by Greenplum CEO Bill Cook, who will report directly to Pat Gelsinger, the president and chief operating officer of EMC’s Information Infrastructure Products division. (Gelsinger basically oversees all EMC products except for the VMware virtualization software subsidiary.)

Farmer says EMC will invest in expanding the sales and R&D staffs at Greenplum, and may even acquire more companies in the data warehousing area. EMC isn’t saying how much it spent to buy the startup, which had raised at least $61 million from a syndicate of venture firms and strategic partners including Dawntreader Ventures, EDF Ventures, Hudson Ventures, Meritech Capital Partners, Mission Ventures, SAP Ventures, Sierra Ventures, and Sun Microsystems (now part of Oracle). But the acquisition price isn’t enough to have a material impact on EMC’s earnings-per-share numbers, Farmer says.

Gelsinger said in a statement yesterday that “The data warehousing world is about to change…Greenplum’s massively parallel, scale-out architecture, along with its self-service consumption model, has enabled it to separate itself from the incumbent players and emerge as the leader in this industry shift toward ‘big-data’ analytics.”

Wade Roush is a freelance science and technology journalist and the producer and host of the podcast Soonish.
