First Comes The $1,000 Genome, Then Comes The $10,000 Analysis


Xconomy Seattle — 

While everyone in the world of sequencing has heard this cynical joke, as an industry the joke is about to be on us. We have focused our time, energy, and investment on the ability to sequence a genome in a day, and have largely ignored the computational power required to extract and analyze the genomic data in a timeframe that keeps it relevant. This disparity creates significant risk of slowing the pace of research, and creates a scenario where the joke becomes “the genome in a day, but with the year-long analysis.”

The vast majority of academic and core laboratories are not prepared for the massive wave of data that will come off of the multiple new “genome-in-a-day” technology offerings. Most labs are already computationally resource-constrained, struggling just to store the volumes of data they already generate. A typical analysis pipeline of basic alignment, consensus calling, variant detection, and filtering takes 14 days at most sequencing centers, even with a large compute cluster. For most, the exorbitant cost of building more computing infrastructure is not a realistic short-term solution, and ultimately more hardware alone will not shorten computation time.

The answer that many labs have been dabbling with is going to “the cloud.” Unfortunately, as many researchers have discovered, the cloud is not as welcoming as its white, fluffy exterior suggests. At the outset of a project, the learning curve just to start a computing instance is relatively steep. Additionally, there are many misconceptions about what the cloud can actually do to speed up a computation. Many researchers assume that if they put their existing analysis software in “the cloud” and purchase analysis time on 1,000 machines, the computation will automatically scale. Wrong: software written for a single machine does not spread its work across a thousand machines on its own.
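The scaling misconception above can be made concrete with a toy sketch. This is not any real pipeline, and all function names here are invented for illustration; it only shows that the input must be explicitly sharded before extra machines (or, in this stand-in, extra processes) can help. A serial tool copied unchanged onto a big cloud instance still runs the serial path.

```python
# Toy illustration: counting G/C bases across a pile of reads,
# serially versus with the work explicitly partitioned into shards.
# Function and variable names are invented for this sketch.
from multiprocessing import Pool

def gc_count(read):
    """Count G and C bases in a single read."""
    return sum(base in "GC" for base in read)

def analyze_serial(reads):
    # A serial tool "moved to the cloud" unchanged still runs like this,
    # no matter how many machines were rented.
    return sum(gc_count(r) for r in reads)

def analyze_parallel(reads, workers=4):
    # Only after the input is explicitly split into shards can each
    # worker take a shard; the per-shard results are then merged.
    shards = [reads[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(analyze_serial, shards))

reads = ["GATTACA", "CCGGTA", "ATATAT"] * 1000
assert analyze_serial(reads) == analyze_parallel(reads)
```

The partition-and-merge step is exactly what legacy analysis software lacks, and it is why renting 1,000 machines does nothing for a program that was never taught to shard its input.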

As you might imagine, researchers are sorely disappointed with this outcome while at the same time struggling with a deluge of data that needs to be analyzed. Concurrently, the need for rapid, large-scale genomic analysis continues to rise, so the industry is at a technical impasse. Instrument producers, while perfecting “the genome in a day,” have not offered solutions to this analytics crisis, which raises the question: why would any lab purchase a genome-in-a-day sequencing system if it can’t analyze the results in a day?

There is a new generation of cloud-based bioinformatics available. To date, the industry has taken a “wait and see” approach to the next generation of technology and now we have a pending bottleneck. It is critical for the industry to share its best practices, identify core investment dollars and support continued aggressive development of bioinformatics software that works in a cloud environment. Without industry support and bioinformatics technology allowing for analysis in a day, lab managers, researchers and sequencing instrument companies are running headlong into a data disaster.


4 responses to “First Comes The $1,000 Genome, Then Comes The $10,000 Analysis”

  1. Adina Mangubat is correct that the interpretation of genomics information is partly limited by the availability of computational power for its analysis, but the development of improved bioinformatics software that operates better in a cloud environment is only part of the solution. Greater emphasis should be directed towards the collection and consolidation of large data sets from high throughput genomics, proteomics and metabolomics measurements of clinical specimens, and the development of predictive biochemistry algorithms that can efficiently interrogate this meta-data. This would be an ideal problem for synthetic intelligence research and development as humans are ill equipped to quickly recognize important correlations in such large and diverse data sets.

    Our knowledge of the activities and interactions of the metabolites and the macromolecules that are encoded by the genome remains extremely rudimentary and deficient, even a decade after completion of the sequencing of the human genome. For about 40% of human proteins, we do not know what they do, never mind how they are regulated. While the challenge of mapping molecular interactions is already formidable, the added complexity of understanding the consequences of individual nucleotide and amino acid changes in genes and proteins is overwhelming. Yet the point of sequencing the genomes of individuals is to uncover these genetic differences to gain insight into their contributions to the aetiology of diseases.

    It is possible that the successful recognition of patterns of mutations in individual genomes may be sufficient for disease prediction and diagnosis. However, with some 60 million single nucleotide polymorphisms apparently existing in the genomes of healthy people, and the more likely scenario of multiple, complementary mutations required for the vast majority of common diseases, simple pattern recognition will be insufficient in most cases. What is needed is a true understanding of the composition and architecture of the metabolic and signalling protein networks that support healthy cells and how pathogenic genetic mutations and environmental toxins (including those from viruses and bacteria) compromise their operations.

    While tremendous advancements have been achieved in genomics research with strong public and private support through governments and industry, the situation for proteomics and metabolomics research is quite different. The costs of genomics analyses have plummeted in the last 20 years, because this is where most of the basic research support has been placed. When societies finally recognize the importance of providing comparable support to studies of proteins and metabolites, it is likely that better technologies can be developed and applied for their analyses.

    It is intriguing that one of the biggest concerns of government and industry is that despite the doubling of biomedical research support over the last decade, the actual number of new disease diagnostic tests and therapeutic drugs approved annually has markedly declined. This is better known as the “translational research gap,” which has been widening. However, disease phenotype correlates better with the specific levels of active proteins than with total protein, total mRNA, or individually mutated genes. mRNA sequences are “translated” into protein sequences by ribosomes. Maybe the real “translational gap” is too much emphasis in research on genes rather than the proteins that they encode.

  2. “The Dreaded DNA Data Deluge” (see the warning on YouTube, 2008) became the “data disaster” of today. All major DNA sequencing companies have lately lost the lion’s share of their valuation because of an oversupply of sequences unmatched by adequate analytics. This predictable, unsustainable supply-demand imbalance upsets Adina’s logic. Though the production price of DNA sequences will continue to drop, sequencing companies that sustain heavy losses will actually have to increase the price tag just to stay in business by charging higher margins.

    A single-minded focus on just the “sequencing” half of the industrialization of genomics, fixated on a mythical $1,000 DNA sequence, is actually devastating. The real worth of even a full human DNA sequence is exactly “zero dollars” if it is not matched by the “one million dollar analysis” (George Church). Why should any entity put in an order for more DNA if the sequences already received cannot be adequately analyzed? Yes, you can buy “analytics” for even extremely partial genome interrogation (SNPs by microarrays), but “you get what you pay for” – and presently there is no industrial analytics available at any price (not even for a million dollars, let alone the $10k Adina would like to see) that would, for example, prioritize the thousands of chemotherapies to tell which is most suitable to fight a given cancerous genome.

    If the demand side is shot, increasing supply at even lower (dumped) prices not only won’t help, it will make the equation even more lopsided. Foreign cars could be sold in the USA because we built the Interstate system and an ample network of gas stations. Likewise, the demand side of the industrialization of genomics needs to be strengthened by massive investment to produce affordable and actionable DNA analytics. Once that analytics drops from the present “one million dollar” expense toward the few-$10k price that Adina envisions, it will create the demand that brings the supply side down (and not in the imagined reverse order). Till we change strategy, we’ll get a further and even worse data deluge, now coming down from the clouds. – @HolGenTech

  3. Nathan Meryash says:

    One option is to farm the analysis out to the public via distributed computing. If it’s a for-profit sequencing run, then distributed-computing participants could receive payment for leasing time on their machines. Yes, sequencing algorithms would need to be written for parallel processing, but I don’t see why this would be such a problem.
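The farm-it-out idea in the comment above amounts to a map-and-merge pattern: split the read set into independent chunks, hand each chunk to a participant machine, and combine the partial results. The sketch below is a minimal stand-in, assuming a trivially parallel task (k-mer tallying) and using local threads in place of volunteer machines; the function names are invented, and a real deployment would ship chunks over the network instead.

```python
# Sketch of distributed "farming out": independent chunks of reads are
# processed by participants, then merged. Threads stand in for the
# participants' machines; names here are invented for illustration.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def participant_job(chunk, k=3):
    """Work a volunteer machine could do on its own: tally k-mers."""
    counts = Counter()
    for read in chunk:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def farm_out(reads, n_chunks=4):
    # Split the reads into independent chunks; order doesn't matter
    # because k-mer counts merge by simple addition.
    chunks = [reads[i::n_chunks] for i in range(n_chunks)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        for partial in pool.map(participant_job, chunks):
            merged.update(partial)  # merging is the only serial step
    return merged

reads = ["GATTACA", "GATTACA", "TTTT"]
print(farm_out(reads)["GAT"])  # prints 2: one "GAT" per GATTACA read
```

The design point the commenter raises is real: the algorithm must be expressible as chunk-local work plus an associative merge. Tasks with that shape distribute easily; tasks with global state do not.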