The paper explores using large language models (LLMs) for data preprocessing (DP), focusing on instruction-tuning local LLMs so that they can serve as universal DP task solvers. It introduces the Jellyfish dataset, which enables instructions for DP tasks to be crafted manually and improves model interpretability. In the reported experiments, Jellyfish models outperform state-of-the-art methods on both seen and unseen tasks, which indicates that they are competitive and generalize well. The paper also analyzes how tuning on different datasets affects DP performance and finds that multi-task tuning is important for improving overall performance.
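To make the idea of instruction-tuning for DP more concrete, the following is a minimal sketch of how one DP example might be turned into an instruction-tuning record. It is not taken from the paper: the choice of an entity-matching task, the prompt wording, the function name `build_instruction_record`, and the `task` / `instruction` / `output` field names are all assumptions made for illustration.

```python
import json


def build_instruction_record(task, instruction, record_a, record_b, answer):
    """Serialize one hypothetical DP example into an instruction-tuning record.

    The layout below (prompt wording, field names) is an illustrative
    assumption, not the format used by the Jellyfish dataset.
    """

    def serialize(rec):
        # Flatten a row into "attribute: value" pairs, keeping column order.
        return ", ".join(f"{key}: {value}" for key, value in rec.items())

    prompt = (
        f"{instruction}\n"
        f"Record A: [{serialize(record_a)}]\n"
        f"Record B: [{serialize(record_b)}]\n"
        "Answer with Yes or No."
    )
    return {"task": task, "instruction": prompt, "output": answer}


# Hypothetical example rows; the values are made up for illustration.
example = build_instruction_record(
    task="entity_matching",
    instruction="Decide whether the two product records refer to the same entity.",
    record_a={"name": "Acme Phone X 64GB", "price": "299"},
    record_b={"name": "Acme PhoneX (64 GB)", "price": "295"},
    answer="Yes",
)

# One JSON object per line is a common way to store such records.
print(json.dumps(example))
```

Records for several DP tasks could be built the same way and mixed into one training file, which is one plausible way to set up the multi-task tuning that the summary describes.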
Source: arxiv.org