Core Concepts
Cognate data with synonyms can be effectively represented using probabilistic character matrices to maximize the phylogenetic signal during maximum likelihood tree inference.
Abstract
The authors investigate the impact of synonym selection on maximum likelihood (ML) phylogenetic tree inference using the RAxML-NG tool. They find that manually selecting synonyms can lead to substantially different tree topologies compared to using the full dataset with all synonyms. To address this issue, the authors introduce two types of probabilistic character matrices beyond the standard binary matrices: probabilistic binary and probabilistic multi-valued.
The key highlights and insights are:
Performing ML tree inference on the full dataset with all synonyms included is preferable to manual synonym selection, which can change the inferred tree topology by up to 100%.
The authors introduce probabilistic binary and probabilistic multi-valued character matrices as alternatives to the standard binary matrices for representing cognate data with synonyms.
Which character matrix type (deterministic binary, probabilistic binary, or probabilistic multi-valued) yields the ML tree closest to the gold standard reference tree depends on the dataset.
The degree of rate heterogeneity and the difficulty of the phylogenetic inference task can indicate which character matrix type best suits a given dataset.
The authors provide a Python interface for generating all the discussed character matrix types from cognate data in the Cross-Linguistic Data Format (CLDF).
Overall, the study demonstrates that probabilistic character matrices can effectively capture the phylogenetic signal in cognate data with synonyms and, on some datasets, yield trees closer to the gold standard than the standard binary matrices do.
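To make the matrix types discussed above more concrete, the following is a minimal Python sketch, not the authors' interface. The toy data, the function names, and the rule of giving each of a language's k synonymous cognate classes a presence probability of 1/k are all assumptions made for illustration; the paper's exact definitions may differ. The sketch also treats a concept that a language lacks as absence (0) rather than as missing data, and it omits the probabilistic multi-valued variant.

```python
import random

# Hypothetical toy data: (language, concept) -> attested cognate classes.
# More than one class for a pair means the language has synonyms.
COGNATES = {
    ("English", "big"): {"big_1", "big_2"},
    ("German", "big"): {"big_1"},
    ("Dutch", "big"): {"big_2"},
}


def _axes(cognates):
    """Return sorted row labels (languages) and column labels (classes)."""
    languages = sorted({language for language, _ in cognates})
    classes = sorted({c for attested in cognates.values() for c in attested})
    return languages, classes


def binary_matrix(cognates):
    """Deterministic binary matrix: 1 if the language attests the class."""
    languages, classes = _axes(cognates)
    matrix = {language: [0] * len(classes) for language in languages}
    for (language, _), attested in cognates.items():
        for cognate_class in attested:
            matrix[language][classes.index(cognate_class)] = 1
    return classes, matrix


def probabilistic_binary_matrix(cognates):
    """Probability of presence per class.

    ASSUMPTION: the k synonyms of a language for one concept share the
    probability mass uniformly (1/k each). This is an illustrative choice.
    """
    languages, classes = _axes(cognates)
    matrix = {language: [0.0] * len(classes) for language in languages}
    for (language, _), attested in cognates.items():
        for cognate_class in attested:
            column = classes.index(cognate_class)
            matrix[language][column] = 1.0 / len(attested)
    return classes, matrix


def sample_synonyms(cognates, rng):
    """Mimic synonym selection: keep one random class per language/concept."""
    return {
        key: {rng.choice(sorted(attested))}
        for key, attested in cognates.items()
    }


if __name__ == "__main__":
    print(binary_matrix(COGNATES))
    print(probabilistic_binary_matrix(COGNATES))
    print(binary_matrix(sample_synonyms(COGNATES, random.Random(0))))
```

In this toy example, English has two synonyms for "big", so the deterministic matrix marks both classes as present, the probabilistic matrix assigns each a weight of 0.5, and a sampled matrix keeps only one of them.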
Stats
The authors report the following key statistics:
The average generalized quartet (GQ) distance between the best-scoring trees inferred on the deterministic binary character matrices and the gold standard is 0.22.
The average GQ distance between the best-scoring trees inferred on the probabilistic character matrices and the gold standard is 0.23.
The standard deviation of the GQ distances of the trees inferred on the sampled character matrices (with randomly selected synonyms) can exceed 0.2 for some datasets, indicating substantial variation in the inferred trees.
For five datasets, the maximum Robinson-Foulds (RF) distance between the tree inferred on the full dataset and the trees inferred on the sampled character matrices is 1.0, indicating completely different topologies.
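An RF (Robinson-Foulds) distance of 1.0 means that two trees share no nontrivial split of the languages. The following Python sketch shows one common way to compute a normalized RF distance, assuming trees are given as nested tuples of leaf names and compared as unrooted trees. The tree encoding and function names are hypothetical, and the study itself may have used a different implementation and normalization.

```python
def leaf_set(tree):
    """Collect all leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the nontrivial splits (bipartitions) of an unrooted tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on both sides.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Fraction of nontrivial splits found in exactly one of the trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaf set")
    splits_a, splits_b = splits(tree_a), splits(tree_b)
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0


if __name__ == "__main__":
    full = (("A", "B"), "C", ("D", "E"))
    similar = (("A", "C"), "B", ("D", "E"))
    different = (("A", "D"), "C", ("B", "E"))
    print(normalized_rf(full, similar))    # one of two splits is shared
    print(normalized_rf(full, different))  # no split is shared
```

Because the distance counts only splits, rerooting a tree does not change the result, which matches the usual treatment of ML trees as unrooted.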
Quotes
"Given these large potential discrepancies, we advise against manual selection."
"We therefore strongly advise against manual synonym selection. Instead, we recommend to consider all known synonyms when inferring phylogenetic trees."