Scientists have been complaining for years about massive amounts of data piling up on their servers as gene sequencing instruments have become ever better, faster, and more efficient. Yet even as San Diego-based Illumina built a business worth $5 billion as the market leader in these instruments and the chemical reagents they need, nobody has come along with a dominant software program to help researchers analyze and visualize what these billions of genomic data points mean for human health.
Sundquist, 30, has seen the data logjam get worse through his years in graduate school. He was 23 and just embarking on his career in computational biology when the original Human Genome Project was completed in 2003. He studied computer science and electrical engineering at MIT, and then went on to get his doctorate in computational biology from Stanford five years later. Those years were marked by extraordinary progress from companies like Illumina and Life Technologies that have brought the cost of sequencing a human genome below $10,000. One ambitious startup, Menlo Park, CA-based Pacific Biosciences, is racing to make a machine that can perform the task in as little as 15 minutes.
While all that innovation was happening, Sundquist and his peers at Stanford saw that the software to analyze all this data was still produced by a cottage industry. The bioinformatics field mainly consists of custom-made programs at individual labs, some open source programs, and a few small private companies like Seattle-based Geospiza and Westborough, MA-based GenomeQuest. The market is still tiny, probably in the “tens of millions” of dollars, Sundquist says. But it is bound to become more lucrative over time as fast, cheap sequencing becomes the norm and researchers have to analyze their data better to delve deeper into what genomes can tell us about human health and disease.
“We saw a huge trend,” Sundquist says. “We believe there’s going to be a large market in sequence analysis.”
DNAnexus got going last July with a $1.55 million seed financing, led by First Round Capital out of San Francisco. Sundquist was joined by a pair of co-founders from the Stanford faculty: Arend Sidow, an associate professor of pathology and genetics, and Serafim Batzoglou, an associate professor of computer science. The company debuted its commercial product in April at the Bio-IT World conference in Boston, and has signed up “hundreds” of users since, Sundquist says.
The idea at DNAnexus is to offer a low-cost, user-friendly way for researchers to do some basic analysis of their sequencing runs. Typically, a researcher will run a sequencing instrument to get the data he or she wants, and then spend six to 12 months “playing around with the data,” Sundquist says. That means finding out what tools are available, downloading them, and figuring out how to make formats compatible with their instrument. Usually the person doing this is a biologist, who doesn’t have training in computer science or math. So the researcher fumbles around looking for someone, often a graduate student, who knows enough about computer science to help. “They are completely at a loss,” Sundquist says.
This whole model, Sundquist says, ought to be turned around. DNAnexus has built a system that takes all that genomic data running off the instruments, keeps it stored on a cloud computing platform run by Amazon Web Services, and displays the results in a Web 2.0-style browser interface.
This system requires the researcher to have access to an expensive instrument and the chemicals to run it. But using the cloud computing service from Amazon means that researchers don’t need to spend money on servers or use resources of their own server cluster on campus, which may or may not have enough horsepower to store and process the genomic data over time. And DNAnexus is set up so that researchers pay for the Web-based program and storage on a pay-as-you-go basis, instead of locking them into monthly recurring fees or annual software licenses.
“We want people to start using the technology at such a low cost that it’s a no-brainer. We want next-generation sequence analysis accessible to everyone,” Sundquist says. “You shouldn’t have to invest a lot of money to get access to it. Nor should you need to have a bioinformatics PhD to start using these technologies.”
The DNAnexus interface was designed to look a little like Gmail, Sundquist says. A user logs in, and can see the sample data. If the user is running a centralized core sequencing facility shared by many researchers on a campus, then the user can get a quick look at quality stats that show how well the sequencing runs were performed, or to what extent there may have been errors in sample preparation, Sundquist says.
Sharing is one of the key pieces of the puzzle. Researchers can, Web 2.0-style, click and drag on whatever they want, and zoom in or out to the degree of resolution they want on their sequencing run. If you zoom all the way in, you can see the individual letters A, C, G, and T, the chemical units of DNA, Sundquist says.
I wondered if all this genomic data, given that each genome has 6 billion data points, might clog up the network connections at some campuses that aren’t really equipped to send this much data to a remote server run by Amazon. This isn’t an issue for the customers using DNAnexus now, Sundquist says, because it usually takes a week to do a sequencing run and an hour or less to transmit the data across the Internet. But as I noted in a feature story yesterday, Amazon has set up a conventional alternative for researchers who prefer to save their data on a disk and ship it back and forth to Amazon by FedEx, rather than transmit it over the web.
As more and more researchers start producing more and more gene sequencing runs, bandwidth could be an issue, Sundquist says. “It’s not a problem now,” he says. “Five or 10 years from now, it could be a real problem.”
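To put the "hour or less" figure in perspective, here is a rough back-of-the-envelope sketch of the transfer-time arithmetic. The function name, the 45 GB run size, and the 100 megabit-per-second campus link are all illustrative assumptions, not figures from Sundquist or DNAnexus.

```python
def transfer_hours(data_gigabytes: float, link_megabits_per_s: float) -> float:
    """Estimate hours needed to upload a data set over a network link.

    Both arguments are hypothetical inputs chosen by the reader; decimal
    units are used (1 GB = 8e9 bits, 1 Mbit/s = 1e6 bits per second).
    Protocol overhead and shared-link congestion are ignored.
    """
    if data_gigabytes < 0 or link_megabits_per_s <= 0:
        raise ValueError("size must be non-negative and link speed positive")
    total_bits = data_gigabytes * 8e9
    seconds = total_bits / (link_megabits_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed example: a 45 GB sequencing run over a 100 Mbit/s link.
    print(f"{transfer_hours(45, 100):.2f} hours")
```

Under those assumed numbers the upload takes about one hour, against a week for the sequencing run itself. The same arithmetic shows why the picture could change: if output per run grows much faster than campus bandwidth, the upload starts to take a meaningful share of the time.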
Another real problem, which Microsoft has been grappling with lately, is how to create a standardized program that’s useful for researchers who might ask completely different questions. This is one of the reasons there are so many open-source and custom-made programs, and no single dominant for-profit vendor, Sundquist says.
Some of the more specialized questions will always be part of bioinformatics, and there will probably always be a place for the custom-made bioinformatics programs, Sundquist says. The DNAnexus program is designed to be good at some very common questions that researchers look for, like single nucleotide polymorphisms (SNPs) that occur in the genome and might be associated with a disease.
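For readers unfamiliar with SNPs, the toy sketch below shows the basic idea: compare a sample sequence to a reference, letter by letter, and report the positions where a single letter differs. This is only an illustration under simplifying assumptions (two already-aligned strings of equal length, a hypothetical function name `find_snps`). It is not how DNAnexus works; real analysis has to align millions of short, error-prone reads and weigh the evidence statistically.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length.
    Positions are 0-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    print(find_snps("ACGTACGT", "ACGTTCGT"))  # [(4, 'A', 'T')]
```

In this made-up example, the sample differs from the reference at one position, where an A has been replaced by a T.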
Researchers have traditionally leaned on home-brewed software in an era when a tiny number of complete human genomes are thought to have been sequenced worldwide. But over the next few years, that number is expected to skyrocket to 1 million genomes. If that happens, the data deluge will be hard to fathom. The entrepreneurs of a decade ago who said there was gold in bioinformatics may just have been a little too far ahead of their time, Sundquist says.
“This is so much larger a scale of anything from the past, it’s forcing a shift,” Sundquist says. “This is going to be the dominant problem.”