Core Concepts
BuDDIE is a new dataset of 1,665 real-world business documents that supports three key tasks in visually-rich document understanding: document classification, key entity extraction, and visual question answering.
Abstract
The BuDDIE dataset consists of 1,665 publicly available structured business documents from US state government websites. It is unique in that it tackles multiple distinct visually-rich document understanding tasks: document classification, key entity extraction, and visual question answering.
For document classification, the dataset defines five document classes: amendment documents, application/articles, business entity details, certificates/statements, and periodic reports. Annotators achieved high agreement on the document class labels.
The key entity extraction task features a rich ontology of 69 fine-grained entity types across 7 super categories, including business entities, key personnel, file attributes, government officials, and more. The annotations were validated to ensure high quality.
For visual question answering, the dataset includes two question types: span questions, whose answer is a key entity extracted from the document, and boolean questions, which ask whether a given property of an entity is true or false. The questions cover a diverse range of the annotated key entities.
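The two question types can be illustrated with a minimal sketch. The summary does not say how BuDDIE's questions were actually written, so the templates, the Entity class, the label name, and the example value below are hypothetical, and the boolean question here checks only one simple property (whether the entity's text matches a candidate value).

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """One annotated key entity (hypothetical structure, not BuDDIE's schema)."""
    label: str  # entity type, e.g. "business_entity_name" (illustrative label)
    text: str   # the text span annotated in the document


def span_question(entity: Entity) -> dict:
    """Build a span question whose answer is the entity's text."""
    readable = entity.label.replace("_", " ")
    return {
        "type": "span",
        "question": f"What is the {readable}?",
        "answer": entity.text,
    }


def boolean_question(entity: Entity, candidate: str) -> dict:
    """Build a boolean question asking whether the entity equals a candidate value."""
    readable = entity.label.replace("_", " ")
    return {
        "type": "boolean",
        "question": f'Is the {readable} "{candidate}"?',
        "answer": entity.text == candidate,
    }


if __name__ == "__main__":
    ent = Entity(label="business_entity_name", text="Acme LLC")  # made-up example
    print(span_question(ent))
    print(boolean_question(ent, "Other Inc"))
```

Deriving both question types from the same entity annotation is one plausible way to tie the VQA task to the entity extraction task; it is shown here only to make the distinction between span and boolean answers concrete.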
Overall, BuDDIE provides a comprehensive multi-task benchmark for visually-rich document understanding, with the potential to support additional downstream tasks like multi-turn QA and instruction tuning in the future.
Stats
The dataset contains 1,665 business documents from US state government websites.
There are 38,906 annotated key entities across the dataset.
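Taken together, the two figures above imply an average annotation density per document. The short calculation below derives it from the stated totals only; the per-document distribution is not given here, so this is a mean and nothing more.

```python
# Totals as stated in the Stats section.
NUM_DOCUMENTS = 1_665
NUM_ENTITIES = 38_906

# Mean number of annotated key entities per document.
avg_entities_per_doc = NUM_ENTITIES / NUM_DOCUMENTS

print(f"{avg_entities_per_doc:.1f} entities per document on average")  # 23.4
```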