Abstract: Estimating the Tree of Life will likely involve a two-step procedure, where in the first step trees are estimated on many genes, and then the gene trees are combined into a tree on all the taxa. The computational problems involved here are substantial and interesting – even if the gene trees are correctly estimated, because the individual gene trees may not include all the species, finding the true tree that contains all the species is NP-hard. Furthermore, estimated gene trees may not be correct, making the estimation problem additionally challenging. Finally, the true gene trees may not agree with each other or with the species tree, due to biological processes such as deep coalescence, gene duplication and loss, and horizontal gene transfer.
In this talk, I will present new algorithms for these problems. The first algo- rithm is a method called SuperFine (Swenson et al., Syst Biol 2012 and Nguyen et al., J Alg Mol Biol 2012) that handles the case where there should not be con- flict between true gene trees. SuperFine is a “booster” for supertree methods, and produces highly accurate supertrees on large datasets. We then address the case where gene trees can differ form the species tree (and from each other) due to incomplete lineage sorting. For this problem, we present algorithms for the MDC (minimize deep coalescence) problem, taking gene tree estimation error into account (Yu et al., RECOMB and J Comp Biol 2011, Bayzid et al., J Comp Biol 2012). We also explore the use of “bin-and-conquer” to improve the ac- curacy and/or scalability of coalescent-based species tree methods (Bayzid and Warnow, Bioinformatics 2012).
(By invitation of the hiring committee)