Core Concepts
Large language models have memorized many popular tabular datasets verbatim, leading to inflated few-shot learning performance estimates on those datasets. However, they also exhibit non-trivial few-shot learning abilities on novel tabular datasets, which are largely driven by their world knowledge rather than in-context statistical learning.
Abstract
The paper investigates the memorization and few-shot learning capabilities of large language models (LLMs), specifically GPT-3.5 and GPT-4, on tabular data.
The key findings are:
Memorization Tests:
The authors develop various tests to detect whether an LLM has seen a tabular dataset during training.
The results show that GPT-3.5 and GPT-4 have memorized many popular tabular datasets, such as Iris, Wine, Adult, and Housing, verbatim.
There is no evidence of memorization for novel datasets released after 2022.
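To make the idea of such a test concrete, here is a minimal sketch of one possible memorization probe, a row-completion check: show the model the first rows of a canonical CSV file and see whether it reproduces the next row verbatim. This is an illustration, not the paper's exact protocol, and it assumes the official `openai` Python client (>= 1.0) with an API key in the environment.

```python
# Sketch of a row-completion memorization probe (illustrative, not the paper's exact test).
# Idea: show the model the first rows of a canonical CSV file and ask it to continue;
# a verbatim match with the true next row is evidence the dataset was seen in training.
from openai import OpenAI  # assumes the official openai>=1.0 client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def row_completion_probe(csv_rows: list[str], n_context: int = 10,
                         model: str = "gpt-3.5-turbo") -> bool:
    """Return True if the model reproduces the next CSV row verbatim."""
    context = "\n".join(csv_rows[:n_context])
    target = csv_rows[n_context].strip()
    prompt = (
        "Below are the first rows of a CSV file. "
        "Continue the file with the next row, and output only that row.\n\n"
        f"{context}\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    completion = response.choices[0].message.content.strip()
    return completion == target
```

Running the probe on many offsets of a dataset and counting exact matches gives a rough memorization score; datasets released after the training cutoff should score near zero.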
Few-Shot Learning Performance:
The authors compare the few-shot learning performance of GPT-3.5 and GPT-4 on datasets that were seen during training versus novel datasets.
On the memorized datasets, the LLMs exhibit impressive few-shot learning performance, often outperforming logistic regression. However, this performance drops significantly when the data is presented in a perturbed or transformed format.
On the novel datasets, the LLMs' few-shot learning performance is consistent across different data formats; this contrast suggests that the strong results on the memorized datasets reflect memorization-driven overfitting rather than genuine few-shot learning.
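The paper's exact perturbations and serialization format may differ, but the following sketch illustrates the general idea: add small noise to numeric features and re-serialize the rows, so that verbatim recall of training rows no longer helps the model.

```python
# Sketch of the kind of perturbation used to separate memorization from learning:
# add small relative noise to numeric features before serializing rows into a
# few-shot prompt, so verbatim recall of memorized rows no longer helps.
import numpy as np
import pandas as pd


def perturb_numeric(df: pd.DataFrame, rel_noise: float = 0.01,
                    seed: int = 0) -> pd.DataFrame:
    """Add small relative Gaussian noise to every numeric column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        scale = rel_noise * out[col].abs().mean()
        out[col] = (out[col] + rng.normal(0.0, scale, size=len(out))).round(3)
    return out


def to_few_shot_prompt(df: pd.DataFrame, target: str) -> str:
    """Serialize labeled rows as 'feature = value, ...  ->  label' lines."""
    lines = []
    for _, row in df.iterrows():
        feats = ", ".join(f"{c} = {row[c]}" for c in df.columns if c != target)
        lines.append(f"{feats}  ->  {target} = {row[target]}")
    return "\n".join(lines)
```

Comparing few-shot accuracy on the original versus the perturbed serialization is one way to expose the accuracy drop reported for memorized datasets.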
In-Context Statistical Learning:
The authors investigate the LLMs' ability to act as in-context statistical predictors, without fine-tuning.
They find that the LLMs' in-context statistical learning capabilities are limited, especially as the dimensionality of the problem increases.
This suggests that the few-shot learning performance on novel datasets is largely driven by the LLMs' world knowledge rather than their in-context statistical learning abilities.
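One way to probe in-context statistical learning in isolation is to use synthetic data the model cannot have memorized. The sketch below (an assumption about the general setup, not the paper's exact experiment) draws points from a random linear decision boundary in d dimensions and serializes them as few-shot examples; repeating this for increasing d shows how the ability degrades with dimensionality.

```python
# Sketch of an in-context statistical learning probe on synthetic data the model
# cannot have memorized: draw points from a random d-dimensional linear classifier,
# serialize them as few-shot examples, and ask for the label of a held-out point.
import numpy as np


def make_linear_task(n: int, d: int, seed: int = 0):
    """Generate n labeled points from a random linear classifier in d dimensions."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    X = rng.normal(size=(n, d)).round(2)
    y = (X @ w > 0).astype(int)
    return X, y


def build_prompt(X: np.ndarray, y: np.ndarray, x_query: np.ndarray) -> str:
    """Serialize the labeled points and append an unlabeled query point."""
    lines = [
        f"x = [{', '.join(map(str, xi))}], label = {yi}" for xi, yi in zip(X, y)
    ]
    lines.append(f"x = [{', '.join(map(str, x_query))}], label =")
    return "\n".join(lines)


X, y = make_linear_task(n=21, d=4)
prompt = build_prompt(X[:-1], y[:-1], X[-1])
print(prompt)  # the model's completion would be compared against the held-out label y[-1]
```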
Random Sampling:
The authors demonstrate that GPT-3.5 can draw random samples from tabular datasets it has seen during training, without any fine-tuning.
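A hedged sketch of what such sampling might look like in practice: simply ask the model, at non-zero temperature, to produce a random record from a named dataset and parse the reply. The prompt wording here is hypothetical and the paper's sampling prompts may differ; the `openai` client usage is as above.

```python
# Sketch of sampling from a memorized dataset: ask the model for a random record
# at non-zero temperature and parse the reply. Hypothetical prompt wording.
from openai import OpenAI

client = OpenAI()


def sample_record(dataset_name: str = "Iris", model: str = "gpt-3.5-turbo") -> str:
    prompt = (
        f"Draw a random sample from the {dataset_name} dataset. "
        "Answer with a single CSV row (no header) and nothing else."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()


print(sample_record())  # prints one model-generated row for the named dataset
```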
Overall, the paper highlights the importance of testing for training data contamination when evaluating the few-shot learning performance of LLMs on tabular data, as memorization can lead to inflated and invalid performance estimates.
Stats
"GPT-4 can consistently generate the entire Iris and Wine datasets from the UCI machine learning repository."
"GPT-3.5 and GPT-4 exhibit remarkable performance differences when predicting time series data for time periods prior to and after the LLM's training data cutoff."
"Adding small amounts of noise and other re-formatting techniques leads to an average accuracy drop of 6 percentage points on the memorized datasets."
Quotes
"Our investigation reveals that GPT-3.5 and GPT-4 have memorized many popular tabular datasets verbatim."
"We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting."
"We find that GPT-3.5 and GPT-4 can still perform in-context statistical classification better than random but struggle as the dimension of the problem increases."