Software Speeds Database Sequence Searches

DNA fragment (Wikimedia Commons)

Computational biologists at Ludwig-Maximilians Universität (LMU) in Munich, Germany have developed software that enables a new search method for identifying proteins with similar sequences in databases. The software, which the developers say is faster and can discover twice as many evolutionarily related proteins as previous methods, is described online in the journal Nature Methods (paid subscription required).

A basic process in genomic research is the sequence search, in which a protein’s sequence is compared with millions of sequences with annotated structures and functions in public databases, many of which are accessible to scientists. Because a protein’s sequence is related to its function, the structure and function of a given protein can be predicted by comparing its sequence with those of other proteins whose structures and functions are known.

The team led by Johannes Söding of LMU’s Genzentrum (Gene Center) developed the software, called HHblits (short for HMM-HMM–based lightning-fast iterative sequence search), which uses different statistical models from those in current sequence search tools. The models used in HHblits, called Hidden Markov Models (HMMs), include the probabilities of mutations derived from sequence alignments, which the developers say increases the sensitivity and precision of the search for sequence similarities.
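To illustrate how position-specific probabilities drawn from an alignment can sharpen a search, the following minimal sketch builds a simple per-column probability profile and scores sequences against it. It is not the HHblits implementation and not a full HMM (it has no insert or delete states); the function names, the toy alignment, the pseudocount, and the uniform background frequency are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return per-column amino acid probabilities with additive smoothing.

    `alignment` is a list of equal-length aligned sequences; gap
    characters are ignored. The pseudocount value is an assumption.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(sequence, profile, background=1.0 / 20):
    """Sum log2(P(residue in column) / background) over ungapped positions."""
    return sum(
        math.log2(column[residue] / background)
        for residue, column in zip(sequence, profile)
    )


# Toy family alignment (hypothetical data, three columns).
family = ["LKD", "IKD", "LRD"]
profile = build_profile(family)

member_score = log_odds_score("LKD", profile)    # resembles the family
outsider_score = log_odds_score("WWW", profile)  # does not
```

In this toy case the family-like sequence scores above zero and the unrelated one below zero, which is the kind of separation a profile-based model aims for.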

The software also has a filtering process that identifies proteins with similar amino acid compositions, which reduces the amount of data to be searched. According to Söding, this filter cuts processing by as much as 2,500-fold. Current search algorithms rely on pairwise comparisons of protein sequences. These comparisons produce alignments in which mostly identical or similar amino acids are paired up in the same columns.
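As a rough illustration of what a pairwise comparison reports, the sketch below counts how many columns of an already-aligned pair hold identical amino acids. The function name and the example sequences are hypothetical, and real search tools score similar (not only identical) residues with substitution matrices, which this sketch omits.

```python
def pairwise_identity(aligned_a, aligned_b):
    """Fraction of ungapped columns where both sequences share a residue.

    Both inputs must already be aligned to the same length, with '-'
    marking gaps. Columns containing a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap column: nothing to compare
        compared += 1
        if a == b:
            identical += 1
    return identical / compared if compared else 0.0


# Hypothetical aligned pair: five ungapped columns, four of them identical.
identity = pairwise_identity("MKT-LV", "MRTALV")
```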

HHblits assembles similar sequences from the database into multiple sequence alignments, and assigns one of 219 identifiers to each alignment column, so that columns with similar amino acid compositions have the same identifier. “By translating the multiple sequence alignments into sequences composed of these 219 letters,” says Söding, “we can replace the time-consuming pairwise comparison of HMMs by the comparison of simple sequences.”
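The idea of turning alignment columns into letters can be sketched as follows, assuming each column is summarized by its amino acid frequencies and assigned to the closest of a fixed set of prototype compositions. This is only an illustration under stated assumptions: the three-letter toy alphabet stands in for the 219 identifiers, and the prototype values, the squared-distance measure, and all function names are hypothetical rather than taken from HHblits.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical prototype compositions, one per letter of a toy alphabet.
TOY_PROTOTYPES = {
    "h": {"L": 0.4, "I": 0.3, "V": 0.3},  # hydrophobic-like column
    "p": {"K": 0.5, "R": 0.5},            # positively charged column
    "n": {"D": 0.5, "E": 0.5},            # negatively charged column
}


def column_profile(column):
    """Amino acid frequencies for one alignment column, ignoring gaps."""
    residues = [c for c in column if c in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def nearest_letter(profile, prototypes):
    """Letter whose prototype has the smallest squared distance to the profile."""
    def distance(letter):
        proto = prototypes[letter]
        return sum((profile[aa] - proto.get(aa, 0.0)) ** 2 for aa in AMINO_ACIDS)
    return min(prototypes, key=distance)


def alignment_to_letters(alignment, prototypes):
    """Translate a multiple sequence alignment into one letter per column."""
    return "".join(
        nearest_letter(column_profile(column), prototypes)
        for column in zip(*alignment)
    )


# Toy alignment (hypothetical): three sequences, three columns.
alignment = ["LKD", "IRE", "VKD"]
letters = alignment_to_letters(alignment, TOY_PROTOTYPES)
```

Once every alignment is reduced to a string like this, two alignments can be compared with ordinary string comparison methods, which is the speed-up Söding describes.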

A Web-enabled and open-source version of HHblits is available on the Gene Center Web site.
