There is a major transformational step underway for managing the growing amount of human genomic data. To date, the focus has been on amassing databanks of genomes and then developing new tools to analyze this information. In essence, the emphasis has been on breaking bottlenecks for analyzing the data.
Now, there is an opportunity to take progress in a new direction, to move beyond amassing genomic data and enable researchers to share genomic knowledge worldwide, and ultimately, at the point of patient care. This next era will dramatically change how genomic data can be accessed, shared, and interpreted on a global scale.
The demand for distributed, global access to information is evident from our day-to-day interaction with data. With access to data in nearly everything we do, from using Google Maps for directions to getting the weather forecast, we know that the norm is to get information in real time, from a browser interface. The ability to do so with large sets of genetic data represents the next wave of progress in the genomic era.
Many of us in this field are captivated by the opportunities to use genomic insights to advance medicine. So, let’s take a look at the progress made in managing these data for clinical use:
Early genome sequencing and analysis: The concept of genomics in medicine was catalyzed by the Human Genome Project in the early 2000s, after which the research community, governments, and industry spent nearly a decade understanding how to sequence genomes, find useful information, and build equipment to facilitate more efficient processes. The progress led to a host of important discoveries about the genome and its impact on disease risk and treatment response. I personally was involved in these early pioneering efforts, working with deCODE Genetics on a population-scale platform in Iceland that has compiled the largest collection of whole-genome variation data in the world. Even back then, that genomic engine identified scores of important genetic variations associated with common diseases, groundbreaking discoveries that laid the foundation for gauging the inherited risk of conditions like prostate cancer and heart attacks.
Mainstreaming genome sequencing: With more experience and success in sequencing, industry leaders set out to make genomics vastly more accessible with improved technologies that made sequencing cheaper, faster, and easier. We’ve now reached the threshold of the $1,000 genome, enabling major centers to integrate sequencing and use genome-guided information to advance their research and generate new ideas about the genomic causes of a range of challenging diseases.
Analyzing genomic data: Harnessing this progress, we have witnessed in recent years a whole new generation of genomics players, with a range of companies introducing new analytics tools and software to aid with sequencing and analysis. This flurry of activity has generated new techniques with a range of purposes—from identifying single variants to broad patterns—that can impact everything from a single rare disease to an entire therapeutic category. While many of these tools will continue to be used for niche applications, some (particularly those that are successfully scaling up) will likely become industry standards that can manage large-scale datasets and support broad, cross-institution genomic research in the near future.
Amassing genomic datasets and liberating them: Today, as organizations amass vast amounts of genomic data, they are looking for ways to share this information and work with one another to consolidate the data and generate reliable, consistent conclusions about disease causes, risks, and responses. Based on these efforts, new databases are emerging that catalogue genetic variation, offering invaluable “Big Data” style resources for the world’s research community. Just two weeks ago, at the annual meeting of the American Society of Human Genetics, several new databases were announced, including datasets from the Haplotype Reference Consortium and the Exome Aggregation Consortium. These, like others in the works, aim to accelerate studies to identify genetic variants and translate them into clinically useful markers.
Distributed global access: While a growing amount of genomic data is being amassed in various repositories across many locations, what matters is the ability to access and share the knowledge derived from these data to inform patient-driven and crowd-sourced solutions. In order to manage and distribute massive volumes of raw genomic information (a single whole genome sequence contains some 100 gigabytes of data), new technologies are being designed to annotate, characterize, and organize it so that it can be searched and analyzed for relevant patterns, markers, and variants that might predict disease risk or response to a treatment.
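To give a sense of that scale, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: it takes the roughly 100 gigabytes per genome quoted above at face value, and the 1 gigabit-per-second link speed and the cohort sizes are assumptions chosen for illustration, not figures from any particular repository.

```python
GB = 10**9   # decimal gigabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes

# Per-genome size quoted in the text (approximate).
GENOME_BYTES = 100 * GB


def total_storage_tb(n_genomes: int, bytes_per_genome: int = GENOME_BYTES) -> float:
    """Raw storage needed for n_genomes, in terabytes."""
    return n_genomes * bytes_per_genome / TB


def transfer_days(n_genomes: int,
                  link_gbit_per_s: float = 1.0,  # assumed link speed
                  bytes_per_genome: int = GENOME_BYTES) -> float:
    """Days needed to move n_genomes over one fully used link."""
    bits = n_genomes * bytes_per_genome * 8
    seconds = bits / (link_gbit_per_s * 10**9)
    return seconds / 86_400


if __name__ == "__main__":
    # Cohort sizes are hypothetical examples.
    for n in (1, 2_000, 100_000):
        print(f"{n:>7} genomes: {total_storage_tb(n):>9.1f} TB, "
              f"{transfer_days(n):>8.2f} days at 1 Gbit/s")
```

Under these assumptions, 2,000 genomes come to about 200 terabytes and would take roughly 18 and a half days to move over a single such link, which is why sharing searchable, annotated results through a browser is more practical than shipping raw files.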
New technologies are emerging that have two key capabilities: first, the ability to manipulate and mine massive volumes of genome data; and second, the ability to share these large amounts of data instantly, via web browsers. Together, these capabilities will open up a range of new opportunities:
• Researchers in rare diseases will have the critical mass of genomes needed to power their studies and, potentially, to find new clues and cures.
• Researchers in common diseases will have access to large enough reference databases to find meaningful genomic signals and augment their patient data.
• Access to broader genomic data can help companies design clinical trials with better patient criteria, leading to more efficient development of new treatments.
• Over time, instant access to such data will enable physicians everywhere to bring the best genomic insights to the point of patient care.
From availability to accessibility: We predict the start of an accelerated pace of innovation for new technologies that enable distributed global access to genomic data. Here are some examples of how large collections of genomes are becoming more easily accessible:
• The NextCODE Exchange: In collaboration with rare disease researchers from medical institutions in the U.S., Europe, Australia, and Japan, NextCODE Health has developed a browser-based Exchange that allows genomic data to be shared, instantly, across the globe. The purpose of this Exchange, first and foremost, is to help researchers crack more difficult diagnostic cases and rare diseases, and to help accelerate new discoveries for common conditions. Through the Exchange, for instance, the Simons Foundation Autism Research Institute is providing researchers with real-time access to 10,000 exomes and phenotypic data from 2,600 families with one child on the autism spectrum.
• The 1000 Genomes Project: This project was launched in January 2008 as an international research effort to establish a detailed catalogue of common human genetic variation, and now includes more than 2,000 genomes. Data generated by the 1000 Genomes Project are widely used by the genetics community, and all the sequencing data (including variant calls) are freely available and can be downloaded via standard file transfer protocol (FTP) from the project’s website.
• The Exome Aggregation Consortium: This group, led by the Broad Institute of MIT and Harvard, has just released allele frequency data derived from 63,000 exomes. This large collection of exomes provides very useful reference data. The raw data is aggregated from 25 institutions, and while it is not currently possible to query that data, it may be possible to do so in the future.
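For readers less familiar with the term used in the last example, an allele frequency is simply the share of all sequenced copies of a position in the genome that carry a given variant. A minimal sketch in Python, assuming diploid genotypes; the function name and the genotype counts are hypothetical and are not taken from any consortium's release:

```python
def allele_frequency(hom_ref: int, het: int, hom_alt: int) -> float:
    """Frequency of the alternate allele at one site, from genotype counts.

    hom_ref: individuals carrying two reference copies
    het:     individuals carrying one reference and one alternate copy
    hom_alt: individuals carrying two alternate copies
    """
    # Each heterozygote contributes one alternate copy, each
    # alternate homozygote contributes two.
    allele_count = het + 2 * hom_alt
    # Every diploid individual contributes two copies in total.
    allele_number = 2 * (hom_ref + het + hom_alt)
    if allele_number == 0:
        raise ValueError("no genotypes called at this site")
    return allele_count / allele_number


if __name__ == "__main__":
    # Hypothetical counts for a rare variant in a 63,000-person cohort.
    af = allele_frequency(hom_ref=62_900, het=98, hom_alt=2)
    print(f"alternate allele frequency: {af:.6f}")
```

With these made-up counts the variant appears on 102 of 126,000 sequenced copies, or about 0.08 percent. Reference sets of this size matter because variants that rare cannot be measured reliably in small cohorts.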
As this kind of global knowledge continues to be made available in a distributed, accessible way – supported by advanced information technology – it will enable researchers and clinicians to effectively work with global partners and will lead to new insights and discoveries about diseases.
We are on the cusp of the next wave of data sharing in genomics. It promises to open up new frontiers for collaboration, and revolutionize how we use genomics in medicine.