Core Concepts
The authors revisit datasets and evaluation criteria for symbolic regression (SR) with a focus on its potential for scientific discovery, proposing new datasets and metrics that address shortcomings in existing benchmarks.
Abstract
The paper argues for the importance of symbolic regression in scientific discovery and highlights shortcomings of existing datasets and evaluation metrics. The authors propose new Symbolic Regression for Scientific Discovery (SRSD) datasets and introduce a novel evaluation metric based on the normalized edit distance (NED) between predicted and true equations.
The paper emphasizes the need for more realistic datasets whose sampling ranges reflect the physical properties of each formula. It also introduces dummy variables to assess the robustness of SR methods against irrelevant features, as sketched below. The study then evaluates various baseline methods on the proposed SRSD datasets, revealing their performance and limitations.
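A minimal sketch of the dummy-variable idea, assuming NumPy and standard-normal dummies; the paper's actual SRSD protocol samples dummy variables from problem-specific ranges, so the distribution below is an illustrative assumption:

```python
import numpy as np

def add_dummy_variables(X, n_dummy, seed=0):
    """Append n_dummy irrelevant feature columns to tabular data X
    (shape: n_samples x n_features). A robust SR method should omit
    these columns from its predicted equation."""
    rng = np.random.default_rng(seed)
    # Assumption: dummies drawn from a standard normal distribution.
    dummies = rng.normal(size=(X.shape[0], n_dummy))
    return np.hstack([X, dummies])

# Example: 3 real features plus 2 dummies -> 5 columns total.
X = np.random.default_rng(1).normal(size=(100, 3))
assert add_dummy_variables(X, n_dummy=2).shape == (100, 5)
```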
Key findings: uDSR and PySR perform best on the SRSD-Feynman datasets, yet none of the baselines proves robust to dummy variables. The results underline the value of measuring structural closeness to the true equation via NED when judging solution quality.
Stats
For each problem, we use a validation tabular dataset to choose the best trained SR model.
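A minimal sketch of this selection step, assuming the trained models expose a scikit-learn-style predict() and using validation R^2 as the score (the scoring function is an assumption, not necessarily the paper's):

```python
from sklearn.metrics import r2_score

def select_best_model(trained_models, X_val, y_val):
    """Return the fitted SR model with the highest validation R^2."""
    best_model, best_score = None, float("-inf")
    for model in trained_models:
        score = r2_score(y_val, model.predict(X_val))
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```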
The edit distance computes the minimum cost of transforming the tree representation of the predicted equation into that of the true equation.
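One way to compute this with off-the-shelf tools, assuming SymPy for parsing equations and the zss package (Zhang-Shasha algorithm) for a unit-cost tree edit distance; the paper's exact tree encoding and edit costs may differ:

```python
import sympy
from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

def sympy_to_tree(expr):
    """Convert a SymPy expression into a zss tree whose labels are
    operator names for internal nodes (e.g. 'Add', 'Mul') and the
    printed symbol/constant for leaves (e.g. 'x1', '3')."""
    node = Node(expr.func.__name__ if expr.args else str(expr))
    for arg in expr.args:
        node.addkid(sympy_to_tree(arg))
    return node

pred = sympy.sympify("x1 * x2 + 3")
true = sympy.sympify("x1 * x2 + sin(x1)")
# Minimum number of node insertions, deletions, and renamings.
print(simple_distance(sympy_to_tree(pred), sympy_to_tree(true)))
```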
The proposed metric, normalized edit distance (NED), rescales the edit distance between predicted and true equations so that scores are comparable across problems with equations of different sizes.
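Continuing the sketch above, a hedged take on the normalization: dividing the raw edit distance by a tree-size term and clipping to [0, 1]. The normalizer used here (the larger of the two tree sizes) is an assumption; consult the paper for the exact definition.

```python
def tree_size(node):
    """Count the nodes in a zss tree."""
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def normalized_edit_distance(pred_expr, true_expr):
    """NED-style score in [0, 1]; 0 means an exact structural match.
    Reuses sympy_to_tree from the previous sketch."""
    t_pred, t_true = sympy_to_tree(pred_expr), sympy_to_tree(true_expr)
    d = simple_distance(t_pred, t_true)
    return min(1.0, d / max(tree_size(t_pred), tree_size(t_true)))

print(normalized_edit_distance(pred, true))
```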