
Compression Software for Genomic Analysis Being Designed

Genetic testing illustration (National Institute of General Medical Sciences, NIH)

6 July 2015. Engineers at the University of Illinois in Urbana and Stanford University in California are tackling the problem of massive data files generated by genomic analyses, an emerging issue as precision medicine harnessing genomics takes hold. A team led by engineering faculty Olgica Milenkovic at Illinois and Tsachy Weissman at Stanford is funded by a three-year, $1.3 million grant from the National Institutes of Health.

Precision medicine is the term given to the use of detailed information about human genomic variations to guide clinical decisions, particularly in identifying the best drugs or biologics for treatments. Cancer is considered a prime near-term candidate for precision medicine, given the disease's high incidence, particularly as populations age. Precision medicine also offers the hope of finding treatments for cancer and other diseases that cause fewer adverse side effects than current chemotherapy and radiation therapies.

Relying on genomic data to provide the detailed diagnostics guiding precision medical decisions means making use of massive biological databases and high-powered analytical tools. Files containing the raw data for a genome's 3 billion base pairs, the paired nucleotide bases labeled A, G, C, and T, can run as large as 200 gigabytes for a single human. Adding in related protein and metabolite analyses can balloon an individual's data files even further.
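As a rough illustration of where those gigabytes come from (the coverage depth and per-byte costs below are assumptions for this sketch, not figures from the article), the genome itself fits in well under a gigabyte when each base is stored in two bits; the bulk of a raw sequencing file comes from storing many overlapping reads plus a quality score for every base call:

    # Back-of-the-envelope file sizes (assumed figures, for illustration only).
    GENOME_BASES = 3_000_000_000   # roughly 3 billion base pairs in a human genome
    COVERAGE = 30                  # assumed sequencing depth: reads covering each position

    # Bare genome, 2 bits per base (A, C, G, and T each fit in 2 bits):
    packed_genome_gb = GENOME_BASES * 2 / 8 / 1e9
    print(f"2-bit packed genome: {packed_genome_gb:.2f} GB")   # about 0.75 GB

    # Raw sequencer output in FASTQ style: 1 byte per base call plus 1 byte per
    # quality score, multiplied by coverage (read names and newlines ignored):
    raw_reads_gb = GENOME_BASES * COVERAGE * 2 / 1e9
    print(f"Raw reads with quality scores at {COVERAGE}x: {raw_reads_gb:.0f} GB")  # about 180 GB

Under those assumptions the raw read files land in the same neighborhood as the 200 gigabyte figure above, which is why compressing the quality scores matters as much as compressing the base calls themselves.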

Milenkovic, Weissman and colleagues seek to apply tools from computer science that reduce the size of these files, while maintaining access to their fine-grained details. The researchers plan to study the nature of genomic data to find techniques for applying lossless compression, where each bit of data can be restored exactly, combined with restricted forms of lossy compression that allow for tradeoffs between quality and file size. The team then plans to review current data compression algorithms applied to biological computing for meeting these requirements, and develop new algorithms as needed.
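A minimal sketch of those two ideas, under assumptions of my own rather than anything published by the project: the base sequence is packed losslessly at two bits per base and can be restored exactly, while a simple lossy step coarsens quality scores into bins, trading precision for a more compressible file.

    # Minimal sketch: lossless 2-bit packing of bases plus lossy quality-score
    # binning. Illustrative only; not the algorithms developed by the project.

    BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

    def pack_bases(seq: str) -> bytes:
        """Losslessly pack a base string into 2 bits per base."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            group = seq[i:i + 4]
            for base in group:
                byte = (byte << 2) | BASE_TO_BITS[base]
            byte <<= 2 * (4 - len(group))   # left-align a partial final group
            out.append(byte)
        return bytes(out)

    def unpack_bases(packed: bytes, n_bases: int) -> str:
        """Exactly restore the original base string."""
        bases = []
        for byte in packed:
            for shift in (6, 4, 2, 0):
                bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
        return "".join(bases[:n_bases])

    def bin_quality(scores: list[int], bin_size: int = 10) -> list[int]:
        """Lossy step: collapse Phred quality scores into coarse bins."""
        return [(q // bin_size) * bin_size for q in scores]

    seq = "ACGTACGTGA"
    assert unpack_bases(pack_bases(seq), len(seq)) == seq   # lossless round trip
    print(bin_quality([37, 32, 12, 8, 40]))                 # [30, 30, 10, 0, 40]

Real genomic compressors exploit far more structure than this, for example reference-based encoding of reads and statistical models of quality scores, but the exact round trip on the base sequence and the deliberate loss of precision on the scores capture the lossless versus lossy distinction the researchers describe.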

The researchers anticipate the project developing, as Milenkovic describes in a university statement, “a suite of software solutions for the next generation of biological data repositories and labs, which are currently facing enormous challenges with data storage, transfer, visualization, and wrangling.” The software would compress data from DNA sequencing, quality scores for sequences, and functional genomic analysis that describes gene and protein interactions from the transcripts revealing the molecular composition of cells and tissue.

The project is funded under NIH’s Big Data to Knowledge (BD2K) initiative that began in 2012. One objective of BD2K is to conduct research and develop the methods, software, and tools needed to analyze biomedical big data.


*     *     *
