
Limitations of Principal Component Analysis in Geometric Morphometrics: A Supervised Machine Learning Approach for Accurate Classification and Outlier Detection


Core Concepts
PCA outcomes in geometric morphometrics are artefacts of the input data and are neither reliable, robust, nor reproducible. Supervised machine learning classifiers and outlier detection methods outperform PCA in accurately classifying samples and detecting new taxa.
Abstract
This summary covers the limitations of Principal Component Analysis (PCA) in geometric morphometrics. Its key points are:
The standard approach in geometric morphometrics comprises two steps: Generalized Procrustes Analysis (GPA) followed by PCA. PCA projects the superimposed landmark data onto uncorrelated principal components (PCs), which are then used to visually assess patterns of shape variation.
Researchers often interpret the proximity and clustering of samples in PC scatterplots in terms of origins, relatedness, evolution, gene flow, speciation, and phenotypic/genotypic variation. However, these interpretations are subjective and can be inconsistent across different PC plots.
The authors developed MORPHIX, a Python package for processing superimposed landmark data with supervised machine learning classifiers and outlier detection methods, which can provide more accurate and robust results than PCA.
The authors evaluated PCA and the alternative methods on a benchmark dataset of papionin crania. They found that PCA outcomes are heavily influenced by the input data and are neither reliable, robust, nor reproducible. In contrast, supervised machine learning classifiers such as Nearest Neighbours, Logistic Regression, Gaussian Process, and Support Vector Classifier classified the samples more accurately.
The authors also examined the effects of missing taxa, samples, and landmarks on PCA and the alternative methods. They found that PCA-based interpretations can be significantly biased by these alterations, leading to misclassifications and inconsistent conclusions about evolutionary relationships and taxonomic affiliations.
The authors emphasize the need to reevaluate the large body of literature that has relied on PCA-based findings in geometric morphometrics, as these findings may be unreliable and biased.
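The two-step pipeline described above (GPA, then PCA) can be sketched in NumPy as follows. This is a minimal illustration on synthetic 2D landmarks, not the authors' MORPHIX code: the function names `gpa` and `pca_scores`, the landmark count, and the noise level are all assumptions made for the example.

```python
import numpy as np

def gpa(shapes, n_iter=10):
    """Generalized Procrustes Analysis on an (n_samples, n_landmarks, n_dims) array."""
    x = shapes - shapes.mean(axis=1, keepdims=True)        # remove position
    x = x / np.linalg.norm(x, axis=(1, 2), keepdims=True)  # remove size (unit centroid size)
    mean = x[0]                                            # initial reference shape
    for _ in range(n_iter):
        aligned = []
        for s in x:
            # Orthogonal Procrustes: best rotation of s onto the current mean shape.
            # (This simple version does not forbid reflections.)
            u, _, vt = np.linalg.svd(s.T @ mean)
            aligned.append(s @ (u @ vt))
        x = np.stack(aligned)
        mean = x.mean(axis=0)
        mean = mean / np.linalg.norm(mean)
    return x

def pca_scores(aligned):
    """Project superimposed landmarks onto principal components via SVD."""
    flat = aligned.reshape(len(aligned), -1)
    centred = flat - flat.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u * s, s**2 / np.sum(s**2)  # PC scores, explained-variance ratios

# Synthetic data (hypothetical): one base shape, randomly perturbed,
# rotated, scaled, and translated to mimic raw landmark configurations.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 2))
raw = []
for _ in range(20):
    shape = base + 0.05 * rng.normal(size=base.shape)
    angle = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle), np.cos(angle)]])
    raw.append(rng.uniform(0.5, 2.0) * shape @ rot + rng.normal(size=2))
raw = np.stack(raw)

aligned = gpa(raw)
scores, explained = pca_scores(aligned)
print("variance explained by PC1 and PC2:", explained[:2])
```

The first two columns of `scores` are what a typical PC scatterplot would display; everything the plot shows depends on which samples and landmarks went into `raw`.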
Stats
PCA can explain up to 74% of the variance in the benchmark dataset, yet the clusters still exhibit significant overlap, leading to potential misclassifications.
Removing one taxon from the dataset can result in dramatic changes in the PC scatterplots, with taxa that were previously separated now overlapping.
Removing samples from the dataset can also lead to significant changes in the PC scatterplots, altering the perceived evolutionary relationships between taxa.
Removing landmarks from the dataset can distort the shape of the convex hulls and the distances between taxa in the PC scatterplots, leading to different interpretations of evolutionary relationships.
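One simple way to probe this kind of sensitivity is to recompute the PCA after dropping each taxon in turn and check how far the first principal axis moves. The sketch below does this on synthetic data; the taxon labels, the helper names `first_pc` and `pc1_stability`, and the choice of absolute cosine similarity as the stability measure are assumptions for illustration, not the procedure used in the study.

```python
import numpy as np

def first_pc(X):
    """Unit vector of the first principal axis of X (rows are samples)."""
    centred = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[0]

def pc1_stability(X, labels):
    """Absolute cosine between PC1 of the full data and PC1 with one taxon removed.

    Values near 1 mean PC1 barely moved; values near 0 mean the axis
    changed direction almost completely once that taxon was dropped.
    """
    full = first_pc(X)
    return {str(t): float(abs(full @ first_pc(X[labels != t])))
            for t in np.unique(labels)}

# Hypothetical data: three taxa, 15 samples each, 6 shape variables.
rng = np.random.default_rng(1)
taxa = np.repeat(["A", "B", "C"], 15)
centres = rng.normal(size=(3, 6))
X = np.repeat(centres, 15, axis=0) + 0.3 * rng.normal(size=(45, 6))

stability = pc1_stability(X, taxa)
for taxon, value in stability.items():
    print(f"PC1 similarity after removing taxon {taxon}: {value:.3f}")
```

The same loop can be run over removed samples or removed landmark columns to cover the other two alterations listed above.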
Quotes
"PCA outcomes are artefacts of the input data and are neither reliable, robust, nor reproducible as field members may assume and that supervised machine learning classifiers are more accurate both for classification and detecting new taxa."
"Our findings raise concerns about PCA-based findings in 18,000 to 32,900 studies."

Deeper Inquiries

How can the limitations of PCA in geometric morphometrics be addressed in the broader context of evolutionary biology and phylogenetic reconstruction?

In evolutionary biology and phylogenetic reconstruction, addressing the limitations of PCA in geometric morphometrics requires a multi-faceted approach. One way to mitigate these limitations is to complement PCA with statistical methods that are less prone to biases and artefacts. For example, supervised machine learning classifiers such as the Nearest Neighbour classifier, and outlier detection methods such as the Local Outlier Factor, have shown promising results in accurately classifying samples and detecting new taxa. By incorporating these alternative methods into the analysis pipeline, researchers can reduce their reliance on PCA and improve the accuracy and reliability of their results.
Furthermore, researchers should exercise caution when interpreting PCA outcomes and avoid drawing definitive conclusions from PCA scatterplots alone. It is essential to consider the known weaknesses of PCA, such as its sensitivity to variations in the input data and the subjective nature of interpreting the results. By critically evaluating PCA outcomes alongside other analytical tools, researchers can gain a more comprehensive understanding of shape variation and evolutionary relationships among samples and taxa.
Additionally, promoting transparency and reproducibility in morphometrics studies is crucial. Researchers should openly share their data, methodologies, and code to allow independent verification and validation of results. A culture of openness and collaboration helps the community overcome the challenges posed by PCA in geometric morphometrics and advance evolutionary biology and phylogenetic reconstruction.
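As a rough illustration of the supervised alternative mentioned above, the following sketch evaluates a 1-Nearest-Neighbour classifier with leave-one-out cross-validation on synthetic shape variables. It is written in plain NumPy and is not the MORPHIX implementation; the function name, the data, and the group separation are assumptions. In practice, scikit-learn's `KNeighborsClassifier` combined with `LeaveOneOut` would be a more usual route.

```python
import numpy as np

def loo_nearest_neighbour_accuracy(X, y):
    """Leave-one-out accuracy of a 1-Nearest-Neighbour classifier.

    Each sample is classified by the label of its closest *other* sample,
    so every prediction is made without access to the sample's own label.
    """
    dist = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                       # a sample cannot be its own neighbour
    predicted = y[np.argmin(dist, axis=1)]
    return float(np.mean(predicted == y))

# Hypothetical data: three well-separated taxa, 20 samples each,
# 6 variables standing in for flattened, superimposed landmark coordinates.
rng = np.random.default_rng(2)
y = np.repeat(np.array([0, 1, 2]), 20)
centres = 6.0 * np.eye(3, 6)
X = centres[y] + rng.normal(size=(60, 6))

accuracy = loo_nearest_neighbour_accuracy(X, y)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Unlike reading clusters off a scatterplot, this yields a single held-out accuracy figure that can be compared across datasets and classifiers.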

How can the potential biases and pitfalls in using phylogenetically-informed methods like Phy-PCA and PACA that still rely on PCA as a core component be addressed?

While phylogenetically-informed methods like Phy-PCA and PACA offer valuable insights into shape variations and evolutionary relationships by incorporating phylogenetic information, they still rely on PCA as a core component, which can introduce biases and pitfalls. To address these challenges, researchers can consider the following strategies:
Validation and Sensitivity Analysis: Conduct thorough validation and sensitivity analyses to assess the robustness of the results obtained from phylogenetically-informed methods. By testing the methods on different datasets and varying parameters, researchers can evaluate the stability and reliability of the outcomes.
Integration of Alternative Methods: Complement Phy-PCA and PACA with alternative statistical approaches that are less susceptible to the limitations of PCA. By diversifying the analytical toolkit, researchers can cross-validate results and mitigate the biases introduced by PCA.
Interdisciplinary Collaboration: Foster collaboration between experts in evolutionary biology, phylogenetics, and statistical modelling to ensure a comprehensive and nuanced interpretation of the results. By leveraging diverse expertise, researchers can address the inherent biases in phylogenetically-informed methods and enhance the accuracy of phylogenetic reconstructions.
Transparency and Reproducibility: Promote transparency in data collection, analysis, and reporting to facilitate reproducibility and peer review. By making data and methodologies openly accessible, researchers can enhance the credibility and trustworthiness of phylogenetically-informed analyses that rely on PCA.
By implementing these strategies, researchers can navigate the biases and pitfalls associated with phylogenetically-informed methods that rely on PCA and enhance the reliability and validity of their evolutionary analyses.

How can the insights from this study be applied to improve the analysis and interpretation of fossil remains in palaeoanthropology and palaeontology?

The insights from this study can improve the analysis and interpretation of fossil remains in palaeoanthropology and palaeontology in the following ways:
Enhanced Methodological Rigor: Researchers can adopt a more critical and cautious approach when using PCA in morphometric analyses of fossil remains. By acknowledging the limitations and biases of PCA, researchers can supplement their analyses with alternative methods, such as supervised machine learning classifiers and outlier detection techniques, to improve the accuracy and reliability of their interpretations.
Improved Taxonomic Classification: The application of outlier detection methods, like the Local Outlier Factor, can aid in identifying novel taxa or outliers within fossil datasets. By incorporating these methods, researchers can enhance their ability to classify and differentiate fossil specimens accurately, especially when dealing with incomplete or fragmentary remains.
Validation of Phylogenetic Relationships: Researchers can use the insights from this study to validate and refine phylogenetic reconstructions based on morphometric data. By critically evaluating the outcomes of phylogenetically-informed methods that rely on PCA, researchers can ensure that their interpretations of evolutionary relationships among fossil taxa are robust and well-supported.
Promotion of Open Science Practices: Emphasizing transparency, reproducibility, and data sharing in palaeoanthropology and palaeontology studies can facilitate collaboration and peer review, leading to more reliable and trustworthy interpretations of fossil remains. By following open science practices, researchers can enhance the credibility and impact of their research in the field.
By applying these insights and strategies, researchers in palaeoanthropology and palaeontology can advance their analyses of fossil remains, refine their interpretations of evolutionary relationships, and contribute to the broader understanding of human evolution and biodiversity.
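To make the outlier detection idea concrete, here is a minimal NumPy sketch of the Local Outlier Factor used for novelty detection: known specimens form the reference set, and a new specimen is scored by how sparse its neighbourhood is relative to its neighbours' own neighbourhoods. The function name `lof_novelty`, the choice of `k=5`, and the synthetic data are assumptions for illustration and do not reproduce the study's setup. scikit-learn offers a maintained implementation as `LocalOutlierFactor` with `novelty=True`.

```python
import numpy as np

def lof_novelty(train, query, k=5):
    """Local Outlier Factor of each query point relative to a reference set.

    Scores close to 1 indicate a point about as densely surrounded as its
    neighbours; scores well above 1 indicate a candidate outlier.
    """
    # k nearest neighbours and k-distance of every reference point
    d = np.linalg.norm(train[:, None] - train[None], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    nn_dist = np.take_along_axis(d, nn, axis=1)
    kdist = nn_dist[:, -1]
    # local reachability density of the reference points
    lrd = 1.0 / np.maximum(nn_dist, kdist[nn]).mean(axis=1)
    # same quantities for the query points, using reference neighbours only
    dq = np.linalg.norm(query[:, None] - train[None], axis=2)
    nq = np.argsort(dq, axis=1)[:, :k]
    reach_q = np.maximum(np.take_along_axis(dq, nq, axis=1), kdist[nq])
    lrd_q = 1.0 / reach_q.mean(axis=1)
    return lrd[nq].mean(axis=1) / lrd_q

# Hypothetical data: 60 known specimens in a 4-variable shape space,
# one query near the centre of the known group and one far outside it.
rng = np.random.default_rng(3)
known = rng.normal(size=(60, 4))
queries = np.array([[0.0, 0.0, 0.0, 0.0],
                    [8.0, 8.0, 8.0, 8.0]])

lof_scores = lof_novelty(known, queries)
print("LOF of central specimen:", lof_scores[0])
print("LOF of distant specimen:", lof_scores[1])
```

The threshold above which a specimen counts as a potential new taxon is a judgement call and would need to be calibrated on data where the answer is known.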