Key concepts
This paper proposes a set of reading order independent metrics tailored to evaluating information extraction performance in handwritten documents, addressing the limitations of existing metrics, which are sensitive to reading order errors.
Summary
The paper introduces a set of reading order independent metrics for evaluating information extraction (IE) in handwritten documents. Existing metrics used for this task often rely on word positions or text alignment, which can introduce biases and do not reflect the expected final application of IE systems.
The proposed metrics include:
- OIECER and OIEWER: Adaptations of the ECER and EWER metrics that use a non-sequential alignment between ground truth and predicted named entities.
- OINerval: An adaptation of the Nerval metric that computes precision, recall, and F1-score using a non-sequential alignment.
- Bag-of-tagged-words and bag-of-entities metrics: Alternative representations that compute error rates and precision/recall/F1 without relying on alignment.
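The bag-of-entities idea above can be sketched without any alignment step: treat the extracted (label, text) pairs as multisets and score their overlap. The function name and entity representation here are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def bag_of_entities_scores(ground_truth, predicted):
    """Precision, recall and F1 over multisets of (label, text) entities,
    ignoring the order in which entities were extracted."""
    gt = Counter(ground_truth)
    pred = Counter(predicted)
    # Multiset intersection: an entity matches regardless of reading order.
    matched = sum((gt & pred).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gt.values()) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The prediction lists the same entities in a different order, with one error.
truth = [("PER", "Jean Dupont"), ("LOC", "Paris"), ("DATE", "1914")]
pred = [("LOC", "Paris"), ("PER", "Jean Dupont"), ("DATE", "1915")]
p, r, f = bag_of_entities_scores(truth, pred)  # 2 of 3 entities match
```

Because only multiset counts are compared, swapping the reading order of the predictions leaves the scores unchanged, which is exactly the property these metrics target.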
The authors evaluate these metrics on four public datasets and a real-world use case, and provide an in-depth analysis of their behavior. They recommend the use of OIECER, OIEWER, and OINerval as the most effective reading order independent metrics for IE evaluation.
The key insights are:
- Existing metrics that rely on word positions or reading order are sensitive to segmentation errors and do not reflect the expected application of IE systems.
- The proposed reading order independent metrics provide a more reliable and unbiased evaluation of IE performance.
- The recommended OIECER, OIEWER, and OINerval metrics offer a comprehensive assessment of both text recognition and semantic labeling.
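To illustrate why a non-sequential alignment removes reading order bias, here is a minimal brute-force sketch: it pairs ground-truth and predicted entities so that the total edit distance is minimal, regardless of the order in which entities were read. The cost model, label-mismatch penalty, and function names are assumptions for illustration, not the actual definitions of OIECER/OIEWER:

```python
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def best_alignment_cost(gt, pred):
    """Minimum total edit distance over all one-to-one pairings of
    ground-truth and predicted (label, text) entities. Brute force,
    so only suitable for a handful of entities per document."""
    assert len(gt) == len(pred), "sketch assumes equal entity counts"
    best = None
    for perm in permutations(pred):
        cost = 0
        for (g_label, g_text), (p_label, p_text) in zip(gt, perm):
            if g_label == p_label:
                cost += edit_distance(g_text, p_text)
            else:
                # Illustrative penalty for a label mismatch.
                cost += max(len(g_text), len(p_text))
        best = cost if best is None else min(best, cost)
    return best

# The prediction reads the entities in the opposite order and has one typo.
truth = [("PER", "Marie Curie"), ("LOC", "Warsaw")]
pred = [("LOC", "Warsaw"), ("PER", "Marie Cure")]
cost = best_alignment_cost(truth, pred)  # order-free pairing: cost is 1
```

A sequential alignment would pair "Marie Curie" with "Warsaw" and inflate the error; the order-free pairing charges only the single-character typo. In an ECER-style metric, this cost would then be normalized, e.g. by the total ground-truth length.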
Statistics
The IAM dataset contains 747 training, 116 validation, and 336 test pages with 18 named entity types.
The Simara dataset contains 3778 training, 811 validation, and 804 test pages with 7 fields.
The Esposalles dataset contains 75 training, 25 validation, and 25 test pages with 35 combined named entity types.
The POPP dataset contains 128 training, 16 validation, and 16 test pages with 10 column names as named entities.
The real-world French Military Records dataset contains 118 training and 29 validation full-page images with 5 named entity types.
Quotes
"Information Extraction (IE) refers to the identification of parts of digital text that contain specific knowledge."
"Coupled approaches that solve the task in one step have also seen successful application in recent years at solving the task, mainly when applied to historical documents."
"When performing IE with the intent of building a knowledge database, the goal is to identify which Named Entities appear in each document. Therefore, as long as the segmentation errors do not lead to splitting Named Entities, the order in which they are recognized should not matter."