Core Concepts
LLM-based agents can automate the data standardization process efficiently.
Abstract
1. Introduction
Data standardization is crucial in data science.
Transformation of heterogeneous data formats into a unified format is essential.
An example is provided for illustration.
2. Challenges with Traditional Methods
Pandas requires extensive coding for data standardization.
Different column types necessitate bespoke code for each type.
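To make the "bespoke code per column type" point concrete, here is a minimal sketch in plain Python (standard library only, not taken from the paper). The function names, the list of accepted date formats, and the US-style phone layout are all assumptions chosen for illustration; the same pattern appears when such functions are applied column by column in Pandas.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed input formats; a real dataset would likely need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")


def standardize_date(value: str) -> Optional[str]:
    """Try each known format and return YYYY-MM-DD, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def standardize_us_phone(value: str) -> Optional[str]:
    """Keep digits only, drop a leading country code 1, return NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

Each new column type (email, address, country, and so on) would need another hand-written function of this kind, which is the overhead this section describes.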
3. Role of Large Language Models (LLMs)
LLMs like ChatGPT can aid in automating standardization tasks.
Challenges remain in prompt crafting and multi-turn dialogues.
4. Proposed Solution
Introduce a Python library with declarative APIs for standardizing column types.
Simplify the LLM's task to translating natural-language instructions into concise API calls.
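The idea of a declarative API can be sketched as follows. This is an illustrative stand-in written with the standard library only, not the library's actual interface: `clean_column`, `STANDARDIZERS`, and the two helper functions are hypothetical names. The caller states what type a column holds, and the library decides how to standardize it.

```python
import re
from datetime import datetime


def _to_iso_date(value):
    """Hypothetical helper: parse a few assumed formats into YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return None


def _to_plain_phone(value):
    """Hypothetical helper: keep a 10-digit phone number as digits only."""
    digits = re.sub(r"\D", "", value)
    return digits if len(digits) == 10 else None


# Registry mapping a declared column type to its standardizer.
STANDARDIZERS = {"date": _to_iso_date, "phone": _to_plain_phone}


def clean_column(rows, column, column_type):
    """Declarative entry point: name the column and its type, nothing else."""
    standardize = STANDARDIZERS[column_type]
    return [{**row, column: standardize(row[column])} for row in rows]


rows = [{"joined": "03/15/2021", "phone": "(555) 123-4567"}]
rows = clean_column(rows, "joined", "date")   # one call per column
rows = clean_column(rows, "phone", "phone")
```

Under this design, an LLM only has to produce short calls such as `clean_column(rows, "joined", "date")` rather than the parsing logic itself, which leaves less room for generated code to go wrong.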
5. Dataprep.Clean and CleanAgent Framework
Dataprep.Clean simplifies specific column type standardization with one line of code.
CleanAgent automates the process by integrating Dataprep.Clean and LLM-based agents.
6. Workflow of CleanAgent
CleanAgent is composed of four agents: Chat Manager, Column-type Annotator, Python Programmer, and Code Executor.
The agents communicate with one another to complete data standardization automatically.
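The hand-off between the four agents can be sketched roughly as below. This is a simplified illustration under strong assumptions, not CleanAgent's implementation: the LLM-backed agents are replaced by plain functions (a heuristic stands in for the annotator, a string template for the programmer), and `clean_email` is a hypothetical cleaning function defined here, not a call into Dataprep.Clean.

```python
import re


def clean_email(table, column):
    """Hypothetical cleaning function: trim and lower-case; invalid -> None."""
    pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    cleaned = []
    for row in table:
        value = str(row[column]).strip().lower()
        cleaned.append({**row, column: value if pattern.match(value) else None})
    return cleaned


def annotate_columns(table):
    """Column-type Annotator stand-in: a simple heuristic instead of an LLM."""
    types = {}
    for column in table[0]:
        if all("@" in str(row[column]) for row in table):
            types[column] = "email"
    return types


def write_code(column_types):
    """Python Programmer stand-in: emit one API call per annotated column."""
    return "\n".join(
        f"table = clean_{ctype}(table, {column!r})"
        for column, ctype in column_types.items()
    )


def execute_code(code, table):
    """Code Executor stand-in: run generated code with only known names.

    Note: exec with an emptied __builtins__ is NOT a real sandbox.
    """
    namespace = {"table": table, "clean_email": clean_email}
    exec(code, {"__builtins__": {}}, namespace)
    return namespace["table"]


def chat_manager(table):
    """Chat Manager stand-in: route the task through the other three agents."""
    column_types = annotate_columns(table)
    code = write_code(column_types)
    cleaned = execute_code(code, table)
    return column_types, code, cleaned


table = [
    {"name": "Ann", "email": " Ann@Example.COM "},
    {"name": "Bob", "email": "bob@site"},
]
column_types, code, cleaned = chat_manager(table)
```

In the real system each step would be a conversational turn between agents, and a failed execution could be sent back to the programmer for another attempt; this sketch runs a single pass only.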
7. Demonstration Scenarios
User interface allows uploading CSV files for cleaning.
Steps include annotation, code generation, execution, and result verification.
8. Conclusion and Future Work
CleanAgent automates data standardization using Dataprep.Clean and LLM-based agents.
Cooperating LLM-based agents have the potential to automate the entire data science life cycle.
Stats
Dataprep.Clean makes it possible to standardize a specific column type with a single line of code.
CleanAgent integrates Dataprep.Clean with LLM-based agents to automate the data standardization process.