Core Concepts
PCA outcomes in geometric morphometrics are artefacts of the input data and are neither reliable, robust, nor reproducible. Supervised machine learning classifiers and outlier detection methods outperform PCA in accurately classifying samples and detecting new taxa.
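As a rough illustration of the two alternatives named above, the sketch below cross-validates four scikit-learn classifiers and fits one outlier detector on synthetic data. It is not the MORPHIX API and not the authors' experiment. The feature matrix is a hypothetical stand-in for flattened, superimposed landmark coordinates, the classifier families are those listed in the Abstract, and LocalOutlierFactor is only one assumed choice of outlier detection method.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, LocalOutlierFactor
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical data: 3 taxa, 40 specimens each, 12 features
# (e.g. 6 landmarks x 2 coordinates after superimposition).
n_per_taxon, n_features = 40, 12
centres = rng.normal(scale=1.0, size=(3, n_features))
X = np.vstack([c + rng.normal(scale=0.3, size=(n_per_taxon, n_features))
               for c in centres])
y = np.repeat(np.arange(3), n_per_taxon)

# Supervised classifiers evaluated with 5-fold cross-validation
# on all features rather than on a two-dimensional PC plot.
classifiers = {
    "Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Process": GaussianProcessClassifier(),
    "Support Vector Classifier": SVC(),
}
accuracy = {name: cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in classifiers.items()}

# Outlier (novelty) detection: fit on known taxa, then ask whether
# new specimens look like anything seen before (-1 means "outlier").
detector = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)
known_new = centres[0] + rng.normal(scale=0.3, size=(20, n_features))
unseen = centres.mean(axis=0) + 5.0 + rng.normal(scale=0.3, size=(20, n_features))
novel_flagged = np.mean(detector.predict(unseen) == -1)
known_flagged = np.mean(detector.predict(known_new) == -1)

for name, acc in accuracy.items():
    print(f"{name}: mean CV accuracy {acc:.2f}")
print(f"unseen-taxon specimens flagged as outliers: {novel_flagged:.2f}")
print(f"known-taxon specimens flagged as outliers: {known_flagged:.2f}")
```

Cross-validated accuracy gives a number that can be compared across methods and datasets, which a visual reading of a scatterplot does not.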
Abstract
The study examines the limitations of Principal Component Analysis (PCA) in geometric morphometrics. Its key points are:
The standard approach in geometric morphometrics comprises two steps: Generalized Procrustes Analysis (GPA), which superimposes the landmark configurations, followed by PCA. PCA projects the superimposed landmark data onto uncorrelated principal components (PCs), which are then used to visually assess patterns of shape variation.
Researchers often interpret the proximity and clustering of samples in PC scatterplots in terms of origins, relatedness, evolution, gene flow, speciation, and phenotypic/genotypic variation. However, these interpretations are subjective and can be inconsistent across different PC plots.
The authors developed MORPHIX, a Python package for processing superimposed landmark data with a range of supervised machine learning classifiers and outlier detection methods, which can provide more accurate and robust results than PCA.
The authors evaluated PCA and the alternative methods on a benchmark dataset of papionin crania. They found that PCA outcomes are heavily influenced by the input data and are neither reliable, robust, nor reproducible. In contrast, supervised machine learning classifiers such as Nearest Neighbours, Logistic Regression, Gaussian Process, and Support Vector Classifier outperformed PCA in accurately classifying the samples.
The authors also examined how missing taxa, samples, and landmarks affect PCA and the alternative methods. They found that these alterations can significantly bias PCA-based interpretations, leading to misclassifications and inconsistent conclusions about evolutionary relationships and taxonomic affiliations.
The authors emphasize the need to reevaluate a large corpus of the literature that has relied on PCA-based findings in geometric morphometrics, as these may be unreliable and biased.
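The two-step pipeline described in the key points above (GPA, then PCA on the superimposed coordinates) can be sketched in plain NumPy. This is a minimal illustration on synthetic 2-D landmarks, not the authors' code and not the MORPHIX API. The function names align, gpa, and pca, the landmark count, and the noise level are all assumptions made for the example.

```python
import numpy as np


def align(shape, reference):
    """Rotate a centred, unit-size shape (k x d) onto the reference by least squares."""
    u, _, vt = np.linalg.svd(shape.T @ reference)
    # Force a proper rotation (determinant +1) so shapes are never mirrored.
    u[:, -1] *= np.sign(np.linalg.det(u @ vt))
    return shape @ (u @ vt)


def gpa(landmarks, n_iter=10, tol=1e-10):
    """Generalized Procrustes Analysis on an (n, k, d) array of landmark sets."""
    # Remove translation, then scale every configuration to unit centroid size.
    shapes = landmarks - landmarks.mean(axis=1, keepdims=True)
    shapes = shapes / np.linalg.norm(shapes, axis=(1, 2), keepdims=True)
    mean = shapes[0]
    for _ in range(n_iter):
        # Rotate every shape onto the current mean, then update the mean.
        shapes = np.stack([align(s, mean) for s in shapes])
        new_mean = shapes.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        converged = np.linalg.norm(new_mean - mean) < tol
        mean = new_mean
        if converged:
            break
    return shapes, mean


def pca(X):
    """PCA via SVD: returns PC scores, principal axes, and explained variance ratios."""
    Xc = X - X.mean(axis=0)
    u, s, vt = np.linalg.svd(Xc, full_matrices=False)
    return u * s, vt, s**2 / np.sum(s**2)


# Synthetic specimens: one base shape with small noise, then an arbitrary
# rotation, scale, and translation applied to each copy.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
raw = []
for _ in range(30):
    theta = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    noisy = base + rng.normal(scale=0.05, size=base.shape)
    raw.append(rng.uniform(0.5, 2.0) * noisy @ rot + rng.normal(size=2))
raw = np.stack(raw)

aligned, mean_shape = gpa(raw)
scores, axes, explained = pca(aligned.reshape(len(aligned), -1))
print("variance share of PC1 and PC2:", explained[:2])
```

The columns of scores are the uncorrelated PCs that would be drawn in a scatterplot, and explained gives the share of variance carried by each.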
Stats
PCA can explain up to 74% of the variance in the benchmark dataset, yet the clusters still exhibit significant overlap, leading to potential misclassifications.
Removing one taxon from the dataset can result in dramatic changes in the PC scatterplots, with taxa that were previously separated now overlapping.
Removing samples from the dataset can also lead to significant changes in the PC scatterplots, altering the perceived evolutionary relationships between taxa.
Removing landmarks from the dataset can distort the shape of the convex hulls and the distances between taxa in the PC scatterplots, leading to different interpretations of evolutionary relationships.
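The sensitivity to missing taxa listed above follows from the fact that principal axes are computed from whichever samples are present. The toy sketch below uses three synthetic groups (hypothetical taxa A, B, and C in an assumed three-feature space, not the papionin data) and compares PC1 with and without taxon C.

```python
import numpy as np


def leading_axes(X, n=2):
    """Return the first n principal axes (as rows) of the data matrix X."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:n]


rng = np.random.default_rng(1)

# Hypothetical taxa: A and B differ along the second feature,
# while C lies far away along the first feature.
centres = {"A": [0.0, 0.0, 0.0], "B": [0.0, 2.0, 0.0], "C": [10.0, 1.0, 0.0]}
groups = {name: np.array(c) + rng.normal(scale=0.3, size=(40, 3))
          for name, c in centres.items()}

full = np.vstack([groups["A"], groups["B"], groups["C"]])
reduced = np.vstack([groups["A"], groups["B"]])  # taxon C removed

pc1_full = leading_axes(full)[0]
pc1_reduced = leading_axes(reduced)[0]

# Angle between the two PC1 directions (the sign of an axis is arbitrary).
cosine = abs(pc1_full @ pc1_reduced)
angle = np.degrees(np.arccos(np.clip(cosine, 0.0, 1.0)))


def gap_on(axis):
    """Distance between the A and B group means when projected onto an axis."""
    return abs(groups["A"].mean(axis=0) @ axis - groups["B"].mean(axis=0) @ axis)


gap_full = gap_on(pc1_full)
gap_reduced = gap_on(pc1_reduced)

print(f"angle between PC1 with and without taxon C: {angle:.1f} degrees")
print(f"A-B separation on PC1, all taxa: {gap_full:.2f}")
print(f"A-B separation on PC1, C removed: {gap_reduced:.2f}")
```

In this constructed case the same two taxa look nearly identical or clearly distinct on PC1 depending only on whether a third taxon was sampled, which is the kind of instability the listed statistics describe.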
Quotes
"PCA outcomes are artefacts of the input data and are neither reliable, robust, nor reproducible as field members may assume and that supervised machine learning classifiers are more accurate both for classification and detecting new taxa."
"Our findings raise concerns about PCA-based findings in 18,000 to 32,900 studies."