The paper presents a comprehensive study on the string processing capability of large language models (LLMs). The authors first propose StringLLM, a method to construct datasets for benchmarking the string processing capability of LLMs. Using StringLLM, they create a series of datasets called StringBench, covering a wide range of string processing tasks and different types of strings.
The authors then conduct extensive experiments to evaluate the performance of various LLMs on the StringBench datasets, using three prompt engineering techniques: raw instructions, Chain of Thought (CoT), and Program of Thought (PoT). The results show that LLMs struggle with string processing tasks compared to humans, achieving a maximum average accuracy of only 48.89% using raw instructions. LLMs' performance varies across datasets, with random strings being the most challenging.
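To make the PoT setting concrete: instead of answering directly, the model emits a short program, and the program's output is taken as the answer. The following is a minimal sketch only. The task, the generated program, the helper names `run_program` and `exact_match_accuracy`, and the use of exact-match scoring are all illustrative assumptions, not details taken from the paper or from StringBench.

```python
# Hypothetical task (not from StringBench): reverse the input string.
task_input = "a1b2c3"
reference_answer = "3c2b1a"

# Hypothetical model output under PoT: a program rather than a direct answer.
model_program = "result = s[::-1]"


def run_program(program, s):
    """Run a generated program with `s` bound to the input; return its `result`."""
    scope = {"s": s}
    # exec is unsafe for untrusted code; a real harness would sandbox this step.
    exec(program, {}, scope)
    return scope.get("result")


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference (assumed metric)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


predictions = [run_program(model_program, task_input)]
accuracy = exact_match_accuracy(predictions, [reference_answer])
print(accuracy)
```

Under this setup the string manipulation is carried out by the interpreter on the actual characters rather than by the model over tokens.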
To understand why LLMs struggle with string processing, the authors analyze two underlying mechanisms of LLMs: tokenization and token embedding. They find that tokenization does not split strings into individual characters and that token embeddings lack character-level information, which together limit LLMs' understanding of strings.
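The tokenization point can be illustrated with a toy example. The sketch below uses a hypothetical hand-written vocabulary and a simple greedy longest-match rule. Real LLM tokenizers learn much larger vocabularies (for example via byte-pair encoding) and may split the same word differently, so this shows the general effect only, not any specific model's behavior.

```python
# Hypothetical toy vocabulary; real vocabularies are learned and far larger.
TOY_VOCAB = {"straw", "berry", "str", "aw", "s", "t", "r", "a", "w", "b", "e", "y"}


def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


word = "strawberry"
tokens = greedy_tokenize(word, TOY_VOCAB)
print(tokens)           # multi-character tokens, not individual characters
print(len(word))        # number of characters in the string
print(word.count("r"))  # a character-level fact not visible from the token boundaries
```

Here the ten-character string becomes two tokens, so a question such as "how many times does r occur" concerns units the model never receives as separate inputs.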
To address this limitation, the authors propose an effective fine-tuning approach that significantly enhances LLMs' string processing capability, without substantially degrading their foundational capabilities on general-purpose benchmarks.
Source: arxiv.org