The paper proposes LayoutLLM, an LLM/MLLM-based method for document understanding that incorporates a layout instruction tuning strategy to enhance comprehension of document layouts.
The key highlights are:
Layout-aware Pre-training: Three groups of pre-training tasks are introduced - document-level, region-level, and segment-level - to enable LayoutLLM to learn comprehensive document understanding from global to local details.
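To make the layout-aware idea concrete, here is a hypothetical sketch of how OCR segments and their bounding boxes might be serialized into a single layout-tagged prompt. The tag format, function names, and sample values are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical serialization of OCR segments with layout coordinates.
# The inline "<x0,y0,x1,y1>" tag format is an illustrative assumption,
# not the format used in the LayoutLLM paper.

def serialize_segment(text: str, box: tuple) -> str:
    """Render a text segment with its bounding box coordinates inline."""
    x0, y0, x1, y1 = box
    return f"<{x0},{y0},{x1},{y1}>{text}"

def build_document_prompt(segments: list, question: str) -> str:
    """Concatenate layout-tagged segments and append an instruction."""
    body = "\n".join(serialize_segment(t, b) for t, b in segments)
    return f"Document:\n{body}\nQuestion: {question}\nAnswer:"

# Toy example: two segments from a mock invoice.
segments = [
    ("Invoice No: 1042", (12, 8, 240, 28)),
    ("Total: $93.50", (12, 310, 180, 330)),
]
prompt = build_document_prompt(segments, "What is the total amount?")
```

A serialization like this lets a text-only LLM condition on spatial position, which is what distinguishes the document-, region-, and segment-level tasks from plain-text pre-training.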
Layout-aware Supervised Fine-tuning (SFT): A novel module called LayoutCoT is designed to enable LayoutLLM to focus on relevant document regions and leverage their layout characteristics to generate accurate answers. LayoutCoT provides a certain degree of interpretability.
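The interpretability claim can be illustrated with a hedged sketch of a LayoutCoT-style staged response, where intermediate steps (which region was consulted, what its layout implies) are emitted before the final answer. The step names and formatting below are assumptions for illustration, not the paper's exact schema.

```python
# Hypothetical LayoutCoT-style staged output. The three step names are
# illustrative assumptions, not the paper's exact intermediate stages.

COT_STEPS = ("Relevant region", "Layout analysis", "Answer")

def format_layout_cot(region: str, analysis: str, answer: str) -> str:
    """Assemble intermediate reasoning steps into one readable response."""
    parts = zip(COT_STEPS, (region, analysis, answer))
    return "\n".join(f"{name}: {content}" for name, content in parts)

response = format_layout_cot(
    region="bottom-left block <12,310,180,330>",
    analysis="the total appears as a labeled key-value pair",
    answer="$93.50",
)
```

Surfacing the consulted region and its layout reading is what gives answers of this style their degree of interpretability: a reader can check whether the cited region actually supports the answer.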
Experiments on standard benchmarks show that LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding in the zero-shot setting.
Key ideas extracted from arxiv.org, by Chuwei Luo, Y..., 04-09-2024
https://arxiv.org/pdf/2404.05225.pdf