There are many algorithms and programs for phylogenetic inference from a multiple sequence alignment. Many programs let users choose a number of parameters, for example the evolutionary model in maximum likelihood programs. Different programs and different parameters often produce different results. Comparisons of phylogenetic methods usually either rely on simulated alignments as input data or evaluate the results by a computed quantity such as the log likelihood. Our aim was to create a benchmark that allows such comparisons on large sets of natural orthologous protein sequences, using species trees as reference trees.

The benchmark consists of protein sequence alignments and of reference trees for these alignments. We used sequences of evolutionary protein domains extracted from the Pfam [1] database. To select the sequences, we first fixed a taxonomic group of species. Then, using in-house scripts, we formed orthologous groups of the domains from proteins of those species. The species tree was built starting from the NCBI Taxonomy; unresolved nodes were resolved using trees inferred from all the orthologous groups obtained. This procedure was repeated several times, for different taxonomic groups covering all major taxa of cellular organisms.

To test the benchmark, we compared trees inferred from the real domain alignments with trees inferred from artificially damaged alignments. To compare inferred trees with reference trees, we tried several tree comparison measures and chose the one that gave the highest statistical significance in this test. By this criterion, the Robinson–Foulds distance [2] proved to be the best tree comparison measure. We demonstrate a statistically significant difference between the results obtained from real and damaged alignments, which confirms the applicability of our benchmark.

The benchmark, i.e. the alignments and reference trees, is available online at http://mouse.belozersky.msu.ru/phylobench/pb.html together with a service that lets users compare their own trees with trees inferred by three methods: maximum parsimony, maximum likelihood and BioNJ. Using the benchmark, we performed a number of comparisons of phylogenetic methods and their parameters. In particular, we confirmed recent results that alignment filtering does not improve the accuracy of phylogenetic inference [3] and that distance methods, such as minimum evolution, are superior to maximum likelihood and maximum parsimony [4].

The work was supported by the Russian Science Foundation, grant no. 21-14-00135.

1. J. Mistry et al. (2021). Pfam: The protein families database in 2021. Nucleic Acids Research, 49(D1):D412–D419.
2. D.F. Robinson, L.R. Foulds (1981). Comparison of phylogenetic trees. Mathematical Biosciences, 53(1–2):131–147.
3. G. Tan et al. (2015). Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. Systematic Biology, 64(5):778–791.
4. G.H. Gonnet (2012). Surprising results on phylogenetic tree building methods based on molecular sequences. BMC Bioinformatics, 13:148.
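
As an illustration of the Robinson–Foulds distance [2] used above to compare inferred trees with reference trees, here is a minimal sketch in Python. It is not the benchmark's own implementation. It assumes that trees are written as nested tuples of leaf names, that both trees have the same leaf set, and that they are compared as unrooted trees; the function names `bipartitions` and `robinson_foulds` are hypothetical.

```python
def _collect_clades(tree, clades):
    """Return the leaf set under `tree`; append each internal node's leaf set to `clades`."""
    if isinstance(tree, str):  # a leaf is a plain species name
        return frozenset([tree])
    leaves = frozenset().union(*(_collect_clades(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def bipartitions(tree):
    """Return (leaf set, set of non-trivial splits) of a tree viewed as unrooted.

    Each split is stored as the side that does NOT contain a fixed anchor
    leaf, so the same split gets the same key wherever the root is placed.
    """
    clades = []
    all_leaves = _collect_clades(tree, clades)
    anchor = min(all_leaves)
    splits = set()
    for clade in clades:
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return all_leaves, splits


def robinson_foulds(tree_a, tree_b):
    """Number of non-trivial splits present in exactly one of the two trees."""
    leaves_a, splits_a = bipartitions(tree_a)
    leaves_b, splits_b = bipartitions(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must be defined on the same set of leaves")
    return len(splits_a ^ splits_b)


# Toy example with hypothetical species labels A..E.
reference = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(reference, inferred)
```

In this toy example the two trees share the split separating D and E from the rest and differ in one split each, so `distance` is 2. A tree that differs from the reference only in where the root is placed has distance 0.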