This study investigates how document characteristics affect the performance of Retrieval-Augmented Generation (RAG) systems. It compares two document splitting methods, the Recursive Character Splitter (RCS) and the Token-based Text Splitter (TTS), on how well each preserves contextual integrity and supports retrieval accuracy.
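To make the contrast between the two splitting strategies concrete, the following is a minimal sketch and not the study's implementation. The function names, the separator order, and the use of whitespace-delimited words as a stand-in for real tokenizer tokens are all assumptions made for illustration.

```python
from typing import List, Sequence


def recursive_character_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split on the coarsest separator first and recurse to finer ones
    only for pieces that are still longer than chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_character_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


def token_split(text: str, tokens_per_chunk: int) -> List[str]:
    """Fixed-size token windows. Assumption: whitespace-separated words
    stand in for the tokens a real tokenizer would produce."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + tokens_per_chunk])
        for i in range(0, len(tokens), tokens_per_chunk)
    ]


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first."
)
rcs_chunks = recursive_character_split(sample, chunk_size=30)
tts_chunks = token_split(sample, tokens_per_chunk=4)
```

In this toy input the recursive splitter keeps the first paragraph as its own chunk, while the fixed token window runs across the paragraph break. That is one plausible reading of why a structure-aware splitter might preserve context better, though the study's actual configuration may differ.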
The analysis shows that the RCS consistently outperforms the TTS across all document types examined: textbooks, articles, and novels. Textbooks and articles, with their structured and concise content, generally achieve higher retrieval scores than the more complex, narrative-driven novels.
The study also evaluates the performance of two retrieval methods, OpenAI and LM Studio, and finds that OpenAI's approach outperforms LM Studio in capturing semantic nuances, particularly for articles. However, LM Studio demonstrates stronger handling of technical terminology and complex structures, making it more suitable for textbooks.
Exploratory data analysis techniques, including descriptive statistics, ANOVA, and pairwise comparisons, are employed to provide insights into the performance variations across document types and splitting methods. The findings highlight the importance of adaptive retrieval strategies that consider the unique characteristics of each document type to optimize accuracy and efficiency.
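As a rough illustration of the statistical comparison described above, the sketch below computes per-group descriptive statistics and a one-way ANOVA F statistic using only the standard library. The scores are made-up placeholder values, not results from the study, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def describe(groups: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return (mean, sample standard deviation) for each group."""
    return {name: (mean(vals), stdev(vals)) for name, vals in groups.items()}


def one_way_anova_f(groups: Dict[str, List[float]]) -> float:
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values()
    )
    ss_within = sum(
        (v - mean(vals)) ** 2 for vals in groups.values() for v in vals
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Placeholder retrieval scores per document type (illustrative only).
scores = {
    "textbook": [0.71, 0.75, 0.78, 0.74],
    "article": [0.69, 0.73, 0.72, 0.76],
    "novel": [0.55, 0.61, 0.58, 0.60],
}
summary = describe(scores)
f_stat = one_way_anova_f(scores)
```

A large F value suggests that at least one document type differs in mean score; pairwise comparisons would then be needed to say which ones. Obtaining a p-value requires the F distribution, which a statistics library such as SciPy provides.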
The study introduces a novel evaluation methodology that uses an open-source model to generate a comprehensive dataset of question-and-answer pairs, simulating realistic retrieval scenarios. This approach, combined with a weighted scoring framework that incorporates metrics such as SequenceMatcher, BLEU, METEOR, and BERTScore, provides a multi-metric assessment of the RAG system's performance.
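A weighted scoring framework of this kind might be sketched as follows. This is an assumption-laden illustration: the weights are hypothetical, the summary does not state the study's actual weighting, and a simple clipped unigram precision is swapped in for BLEU, METEOR, and BERTScore so that the example runs with the standard library alone. Only `difflib.SequenceMatcher` corresponds directly to a metric named above.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, Dict

Metric = Callable[[str, str], float]


def sequence_similarity(reference: str, candidate: str) -> float:
    """Character-level similarity in [0, 1] from difflib.SequenceMatcher."""
    return SequenceMatcher(None, reference, candidate).ratio()


def unigram_precision(reference: str, candidate: str) -> float:
    """Simplified stand-in for an n-gram metric (NOT full BLEU): share of
    candidate words that also occur in the reference, with clipped counts."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    overlap = Counter(cand_tokens) & Counter(reference.lower().split())
    return sum(overlap.values()) / len(cand_tokens)


def weighted_score(
    reference: str,
    candidate: str,
    metrics: Dict[str, Metric],
    weights: Dict[str, float],
) -> float:
    """Weighted average of metric values, normalized by the total weight."""
    total_weight = sum(weights[name] for name in metrics)
    weighted_sum = sum(
        weights[name] * metric(reference, candidate)
        for name, metric in metrics.items()
    )
    return weighted_sum / total_weight


# Hypothetical weights chosen for illustration only.
METRICS: Dict[str, Metric] = {
    "sequence": sequence_similarity,
    "unigram": unigram_precision,
}
WEIGHTS = {"sequence": 0.4, "unigram": 0.6}

score = weighted_score(
    "The splitter keeps paragraphs intact.",
    "The splitter keeps each paragraph intact.",
    METRICS,
    WEIGHTS,
)
```

Normalizing by the total weight keeps the combined score in the same 0 to 1 range as the individual metrics, so additional metrics can be plugged into the `METRICS` dictionary without rescaling.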
The research provides valuable insights for the development and optimization of RAG systems, emphasizing the need for tailored content strategies, including chunk size optimization, key term analysis, and text complexity adjustments, to enhance retrieval outcomes across diverse document types.
Source: arxiv.org