Recursive Dynamic Markov clustering: A novel approach for classifying protein families


Meeting Abstract

52-4  Friday, Jan. 5 11:00 – 11:15  Recursive Dynamic Markov clustering: A novel approach for classifying protein families BOND, SR*; BAXEVANIS, AD; NHGRI, NIH, Bethesda, MD; NHGRI, NIH, Bethesda, MD steve.bond@nih.gov https://github.com/biologyguy/RD-MCL

Inferred orthology (i.e., homology via speciation) among genes is commonly used to predict gene product function. Orthology is also a key consideration when classifying genes coherently and consistently across taxa, but the granularity of current prediction tools is too coarse to resolve clusters of orthologs (i.e., orthogroups) within specific gene families. As a result, classification is generally at the discretion of individual curators manually inspecting gene trees. Here, we present a method that improves granularity and greatly assists with classification.

This new method is called Recursive Dynamic MCL (RD-MCL), and it extends the popular Markov clustering (MCL) algorithm used for identifying clusters in all-by-all similarity graphs. RD-MCL features four key innovations: a shift away from BLASTP-based similarity metrics in favor of more information-rich multiple sequence alignments, a scoring system to assess the quality of orthogroups, dynamic selection of optimized MCL parameters, and recursive decomposition of orthogroups to account for heterogeneous rates of evolution.
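For readers unfamiliar with the underlying algorithm, the following is a minimal sketch of plain Markov clustering on a small all-by-all similarity matrix. It is not the RD-MCL implementation: it omits the alignment-based similarity scores, orthogroup scoring, dynamic parameter selection, and recursion described above. The function names (`mcl`, `normalize_columns`, `matmul`), the fixed inflation value of 2.0, the convergence tolerance, and the toy similarity values are all assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-6):
    """Basic MCL: alternate expansion and inflation until the matrix settles.

    `similarity` is a symmetric matrix of pairwise scores; the diagonal is
    replaced with self-loops of weight 1.0 (an illustrative choice).
    """
    n = len(similarity)
    m = [[1.0 if i == j else similarity[i][j] for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: simulate two-step random walks
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )  # inflation: strengthen strong flows, weaken weak ones
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows that retain flow act as attractors; the columns they reach
    # form one cluster. Identical clusters are reported once.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(sorted(c) for c in clusters)


# Toy similarity matrix (made-up values): sequences 0-2 resemble each
# other, as do sequences 3-4, with no similarity between the two groups.
sim = [
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
print(mcl(sim))
```

In this sketch the inflation exponent controls cluster granularity: larger values tend to split the graph into more, smaller clusters. That is the kind of parameter the abstract describes RD-MCL as selecting dynamically rather than fixing in advance.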

Simulation studies reveal improved precision when RD-MCL is compared to popular orthogroup prediction software such as OrthoMCL, OrthoFinder, and ProteinOrtho. These improvements are possible because RD-MCL has been developed specifically for analyzing individual protein families rather than the genomic-scale datasets these other methods excel at. Furthermore, applying RD-MCL to real protein families (including the innexin/pannexin superfamily and caspases) clearly delineates the internal evolutionary structure of these families and illustrates current weaknesses in the naming of genes in public databases, particularly in taxa outside the Chordata.

RD-MCL is open source and available for reuse without restriction.

Society for Integrative & Comparative Biology