
Improving Automatic Extraction of Entities from Historical Death Certificates in Curaçao


Core Concepts
The authors develop a six-step pipeline that automatically extracts key entities such as names, dates, and professions from handwritten text in historical death certificates, and they explore ways to improve the accuracy of name recognition.
Abstract
The REE-HDSC project aims to improve the quality of named entities extracted automatically from texts generated by hand-written text recognition (HTR) software. The authors present a six-step processing pipeline and test it on 19th and 20th century death certificates from the civil registry of Curaçao. The key findings are:
The pipeline extracts dates with high precision (90%) but the precision of person name extraction is low (33%).
To improve name extraction, the authors retrain HTR models with more name examples, post-process the output, and identify/remove incorrect names.
Experiments show that ChatGPT outperforms regular expressions for both name and date extraction.
The authors discuss limitations of the current HTR technology and suggest future work to further improve the pipeline.
Stats
The data set contains 77,352 scans of death certificates from Curaçao for the years 1831-1950. After cleaning, the data set was reduced to 68,520 unique scans. The certificates are in three main formats: three-column (1831-1869), early two-column (1869-1934), and late two-column (1935-1950). An accompanying Excel file contains data for over 129,000 persons from 78,564 certificates, some of which are duplicates.
Quotes
"We find that the pipeline extracts dates with high precision but that the precision of person name extraction is low."
"Next we show how name extraction precision can be improved by retraining HTR models with names, post-processing and by identifying and removing incorrect names."

Key Insights Distilled From

by Erik Tjong K... at arxiv.org 04-08-2024

https://arxiv.org/pdf/2401.02972.pdf
REE-HDSC

Deeper Inquiries

How could the authors leverage the existing manual annotations in the accompanying Excel file to further improve the automatic entity extraction?

To use the manual annotations in the accompanying Excel file to improve automatic entity extraction, the authors could systematically compare the two sources. Comparing the entities extracted automatically from the handwritten text with the entities listed in the Excel file reveals recurring error patterns and discrepancies, which can guide refinement of the regular expressions and machine learning models used for extraction. The manual annotations can also serve as a held-out validation set for measuring the accuracy of the automated process, and the remaining discrepancies can be used to fine-tune the models and improve the overall performance of the pipeline.
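A minimal sketch of such a comparison follows. It is an illustration under stated assumptions, not the authors' method: it assumes the Excel annotations and the pipeline output have both already been loaded into dictionaries keyed by a certificate identifier, and it uses fuzzy string matching from Python's standard library so that small HTR spelling errors still count as matches. The function names, the dictionary layout, and the 0.85 similarity threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(name.lower().split())


def is_match(extracted: str, gold: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same if their similarity ratio reaches the threshold.

    The threshold is an assumed value; it would need tuning on real data.
    """
    ratio = SequenceMatcher(None, normalize(extracted), normalize(gold)).ratio()
    return ratio >= threshold


def evaluate(extracted_by_cert: dict, gold_by_cert: dict, threshold: float = 0.85):
    """Return (precision, recall) of extracted names against manual annotations.

    Both arguments map a certificate id to a list of person names.
    Only certificates present in the manual annotations are scored.
    """
    tp = fp = fn = 0
    for cert_id, gold_names in gold_by_cert.items():
        remaining = list(gold_names)  # gold names not yet matched
        for name in extracted_by_cert.get(cert_id, []):
            hit = next((g for g in remaining if is_match(name, g, threshold)), None)
            if hit is None:
                fp += 1  # extracted name with no counterpart in the annotations
            else:
                tp += 1
                remaining.remove(hit)  # each gold name can be matched only once
        fn += len(remaining)  # annotated names the pipeline missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Scoring per certificate, rather than over the whole collection at once, keeps a common name on one certificate from being credited as a match for a different person on another. The unmatched items in each category can then be inspected to decide whether the HTR step or the extraction step needs attention.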

What other types of historical records, beyond death certificates, could benefit from the authors' handwritten text recognition and entity extraction pipeline?

Beyond death certificates, various types of historical records could benefit from the authors' handwritten text recognition and entity extraction pipeline. Some examples include:
Census Records: Extracting information such as names, ages, occupations, and family relationships from census records can help in genealogical research and historical analysis.
Land Deeds and Property Records: Identifying names of property owners, transaction dates, and property descriptions from land deeds can aid in studying land ownership patterns and historical land use.
Court Documents: Extracting names of litigants, judges, case details, and verdicts from court documents can assist in legal research and understanding historical legal proceedings.
Ship Manifests: Capturing names of passengers, departure and arrival dates, and destinations from ship manifests can be valuable for studying migration patterns and historical transportation.
By applying the handwritten text recognition and entity extraction pipeline to a diverse range of historical records, researchers can uncover valuable insights and facilitate historical research across various domains.

How might the authors' approach be adapted to handle multilingual or multi-script historical documents, beyond the Dutch-language death certificates from Curaçao?

Adapting the authors' approach to handle multilingual or multi-script historical documents involves several considerations:
Language Identification: Implementing language identification algorithms to determine the language of the text before applying language-specific models for handwritten text recognition and entity extraction.
Multi-Script Recognition: Developing models that can recognize and process multiple scripts within the same document, enabling the extraction of entities from documents containing different writing systems.
Training Data Diversity: Ensuring that the training data for the models includes a diverse range of languages and scripts to improve the models' ability to handle multilingual and multi-script documents.
Post-Processing Techniques: Implementing post-processing techniques that can handle variations in language and script structures to enhance the accuracy of entity extraction in multilingual documents.
By incorporating these adaptations, the authors can extend their approach to effectively handle the complexities of multilingual and multi-script historical documents, opening up opportunities for broader historical research and analysis.
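The language identification step could be sketched as follows. This is a deliberately simple illustration, not part of the authors' pipeline: it guesses the language by counting common function words and then routes the text to a language-specific extractor. The stopword lists, the choice of languages, and the `identify_language` and `route` functions are all assumptions made for the example; a production system would more likely use a trained language identification model.

```python
import re

# Tiny, hand-picked function-word lists (hypothetical; real lists would be larger).
STOPWORDS = {
    "nl": {"de", "het", "een", "van", "en", "in", "op", "is"},
    "es": {"el", "la", "los", "de", "y", "en", "que", "es"},
    "en": {"the", "of", "and", "in", "to", "is", "a", "on"},
}


def identify_language(text: str):
    """Return the language code whose stopwords occur most often, or None.

    Ties are resolved in favor of the language listed first in STOPWORDS.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # alphabetic tokens only
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def route(text: str, extractors: dict):
    """Pick the extractor registered for the detected language and apply it.

    `extractors` maps a language code to a function that takes the text and
    returns a list of entities. If no extractor is registered for the detected
    language, an empty list is returned.
    """
    lang = identify_language(text)
    extractor = extractors.get(lang)
    if extractor is None:
        return lang, []
    return lang, extractor(text)
```

Short or noisy HTR output contains few function words, so a counter like this is least reliable exactly where recognition quality is poor. Returning `None` when nothing matches lets the caller fall back to a default model instead of silently guessing.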