Core Concepts
Different table-to-text generation methods significantly affect the performance of LLM-based question answering systems that leverage domain hybrid data: the LLM-based and TPLM-based methods perform best under the DSFT (domain-specific fine-tuning) paradigm, while the LLM-based and Markdown methods excel under the RAG (retrieval-augmented generation) paradigm.
Abstract
This paper explores the impact of different table-to-text generation methods on enhancing LLM-based question answering (QA) systems with domain hybrid data. The authors are the first to integrate table-to-text generation into the framework of improving LLM-based QA systems with domain-specific data. They then conduct extensive experiments on two types of QA systems, DSFT (domain-specific fine-tuning) and RAG (retrieval-augmented generation), with four representative table-to-text methods: Markdown format, Template serialization, TPLM-based, and LLM-based.
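To make the four methods concrete, below is a minimal Python sketch of the two rule-based ones, Markdown format and Template serialization; the table contents and function names are illustrative assumptions, not data or code from the paper. The TPLM-based and LLM-based methods replace these rules with a model that generates free-form text from the table.

```python
# Minimal sketch of the two rule-based serializations (Markdown format and
# Template serialization). The table content, column names, and helper
# functions are hypothetical examples, not taken from the paper's data.

header = ["Product", "Bandwidth", "Latency"]
rows = [
    ["Router-A", "10 Gbps", "2 ms"],
    ["Router-B", "40 Gbps", "1 ms"],
]

def to_markdown(header, rows):
    """Markdown format: keep the table's structure as a Markdown table."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

def to_template(header, rows):
    """Template serialization: flatten each cell into a fixed
    'The <attribute> of <entity> is <value>.' sentence."""
    sentences = []
    for row in rows:
        entity, values = row[0], row[1:]
        for attribute, value in zip(header[1:], values):
            sentences.append(f"The {attribute} of {entity} is {value}.")
    return " ".join(sentences)

print(to_markdown(header, rows))
print(to_template(header, rows))
# The TPLM-based and LLM-based methods would instead generate free-form
# descriptions, e.g. by feeding the table to a fine-tuned table-to-text
# model or by prompting a large language model.
```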
The key findings are:
Table-to-text methods significantly impact the performance of QA systems, with relative score differences ranging from 2.8% to 9.0% in human evaluation and 4.8% to 16% in GPT-4 evaluation.
In the DSFT paradigm, the LLM-based and TPLM-based methods consistently outperform the others across various model settings.
In the RAG paradigm, the LLM-based method still performs strongly, but the Markdown method shows unexpected effectiveness.
Two factors appear pivotal in explaining the performance gaps across the two paradigms: how frequently each method's output contains domain-specific terms and verbs, and the quality of the semantic representations of the generated text chunks (a rough illustration of the frequency analysis follows this list).
The authors provide practical suggestions for choosing table-to-text methods based on the trade-offs between performance, resource requirements, and text diversity.
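As a rough illustration of the term-frequency factor mentioned above, the sketch below computes the share of domain-specific terms in the chunks each method produces; the lexicon and chunks are hypothetical stand-ins, not the paper's actual term list or corpus.

```python
from collections import Counter
import re

# Hypothetical domain lexicon and generated chunks; the paper's actual
# term list and ICT-DATA corpus are not reproduced here.
DOMAIN_TERMS = {"bandwidth", "latency", "throughput", "router"}

chunks_by_method = {
    "markdown":  ["| Router-A | 10 Gbps | 2 ms |"],
    "llm_based": ["Router-A offers 10 Gbps of bandwidth with 2 ms latency, "
                  "giving it higher throughput than Router-B."],
}

def domain_term_rate(chunks):
    """Fraction of tokens in the chunks that belong to the domain lexicon."""
    tokens = [t for chunk in chunks
              for t in re.findall(r"[a-z]+", chunk.lower())]
    counts = Counter(tokens)
    hits = sum(counts[term] for term in DOMAIN_TERMS)
    return hits / max(len(tokens), 1)

for method, chunks in chunks_by_method.items():
    print(f"{method}: {domain_term_rate(chunks):.3f}")
```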
Stats
The text data in tables accounts for approximately 18% of the total content in the ICT-DATA dataset.
Quotes
"Table-to-text methods significantly impact the performance of QA systems, with relative score differences ranging from 2.8% to 9.0% in human evaluation and 4.8% to 16% in GPT-4 evaluation."
"In the DSFT paradigm, LLM-based and TPLM-based consistently outperform others across various model settings, demonstrating their superiority."
"In the RAG paradigm, while the LLM-based method still performs excellently, the Markdown has shown unexpected effectiveness."