Scientists have warned that the world of genomics is headed for a data bottleneck.
A team of mathematicians and computer scientists found that the data created by genomic studies will soon overtake that of social media giants such as YouTube and Twitter. Even astronomy, a high-tech and computationally demanding field, does not currently generate as much data as genomics, they report in PLoS Biology.
YouTube, the current leader in the field of data generation, has 100 petabytes (100 million gigabytes) of video uploaded to its servers every year - over a thousand times what the average home computer could store. By comparison, genomics is currently generating 25 petabytes a year, but the rate at which the data is produced is doubling every seven months, mostly because sequencing techniques are becoming more refined and cheaper.
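To put the doubling rate in perspective, the figures above can be run through a rough back-of-the-envelope calculation. This is only an illustrative sketch, not the researchers' own model: it assumes that genomic output keeps doubling every seven months without slowing, and that YouTube's uploads stay flat at 100 petabytes a year, neither of which is claimed in the study. The function names are made up for this example.

```python
import math

# Figures quoted in the article (petabytes per year).
GENOMICS_PB_PER_YEAR = 25
YOUTUBE_PB_PER_YEAR = 100
DOUBLING_MONTHS = 7


def projected_rate(current_rate, months, doubling_months):
    """Rate after `months`, assuming steady exponential doubling (an assumption)."""
    return current_rate * 2 ** (months / doubling_months)


def months_until(current_rate, target_rate, doubling_months):
    """Months for a steadily doubling rate to grow from current_rate to target_rate."""
    return doubling_months * math.log2(target_rate / current_rate)


# Assumption: YouTube's upload rate stays fixed while genomics keeps doubling.
crossover_months = months_until(
    GENOMICS_PB_PER_YEAR, YOUTUBE_PB_PER_YEAR, DOUBLING_MONTHS
)
print(f"Genomics would match YouTube's rate after {crossover_months:.0f} months")
```

Under these simplified assumptions, 25 petabytes a year needs only two doublings to reach 100, so genomics would match YouTube's current upload rate after 14 months. In reality YouTube is also growing, so the true crossover would come later.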
'As genome-sequencing technologies improve and costs drop, we are expecting an explosion of genome sequencing that will cause a huge flood of data,' said Professor Gene Robinson, director of the Carl R Woese Institute for Genomic Biology at the University of Illinois.
By 2025 it is estimated that up to two billion people will have had their genomes sequenced, meaning the level of genomic data could hit exabyte levels - or billions of gigabytes. This huge influx of data raises the problem not just of how to store it, but of how to acquire, distribute and analyse it. The researchers say that all four of these challenges must be tackled if we are to solve the 'genomics data problem'.
Professor Robinson said, 'Genomics will soon pose some of the most severe computational challenges that we have ever experienced.
'If genomics is to realise the promise of having a transformative positive impact on medicine, agriculture, energy production and our understanding of life itself, there must be dramatic innovations in computing. Now is the time to start.'
According to an editorial appearing this week in Nature, perhaps one such innovation could be a more collaborative use of cloud storage.
An international group of prominent researchers, headed by Dr Lincoln Stein, has put out a call to the community to collectively fund a cloud computing network that would take the strain off the private networks of individual institutions. The group argues that the challenge of accessing large datasets is blocking scientists' progress, particularly when it comes to building on or replicating previous work.
They propose that funding bodies should pay for large genomic datasets to be stored and accessed in the cloud, meaning that researchers can save time and money by not having to download or process the data on local computers.
'We have now reached a stage where these data sets are too large to move around - cloud computing offers us the flexibility to hold the data in one virtual location and unleash the world's researchers on it all together,' said co-author Dr Peter Campbell, head of cancer genomics at the Wellcome Trust Sanger Institute.