Silva, M. V. (2024). Attribute-Based Semantic Type Detection and Data Quality Assessment.
This paper introduces a novel approach to data quality assessment that uses the semantic information embedded in attribute labels to detect potential data quality issues before the traditional data cleaning process begins. It asks two questions: whether attribute labels can be used effectively for semantic type detection and subsequent data quality assessment, and how well this approach identifies data quality issues across diverse datasets.
The researchers developed a two-step methodology: Attribute-Based Semantic Type Detection and Attribute-Based Data Quality Assessment. They first created a semantic type classification system and then used it to analyze 50 datasets from the UCI Machine Learning Repository. The analysis involved identifying potential data formats for each attribute by analyzing target words and abbreviations in attribute labels, cross-referencing them with curated Formats and Abbreviations Dictionaries, and validating the content against expected formats.
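The two steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the dictionary entries, the regular expressions, and the function names `detect_semantic_type` and `assess_column` are all hypothetical stand-ins for the paper's curated Formats and Abbreviations Dictionaries.

```python
import re

# Hypothetical, tiny stand-ins for the paper's curated dictionaries.
ABBREVIATIONS = {"dob": "date of birth", "qty": "quantity", "pct": "percent"}
FORMATS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "age": re.compile(r"\d{1,3}"),
    "quantity": re.compile(r"\d+"),
    "percent": re.compile(r"\d+(\.\d+)?"),
}


def detect_semantic_type(label):
    """Step 1: map an attribute label to a semantic type via its target words."""
    words = []
    for token in re.split(r"[^a-z]+", label.lower()):
        if token:
            # Expand known abbreviations, then collect the individual words.
            words.extend(ABBREVIATIONS.get(token, token).split())
    for word in words:
        if word in FORMATS:
            return word
    return None  # label carries no recognized target word


def assess_column(label, values):
    """Step 2: flag values that do not match the format expected for the label."""
    semantic_type = detect_semantic_type(label)
    if semantic_type is None:
        return None, []
    pattern = FORMATS[semantic_type]
    issues = [
        (index, value)
        for index, value in enumerate(values)
        if not pattern.fullmatch(str(value).strip())
    ]
    return semantic_type, issues


semantic_type, issues = assess_column("patient_age", ["34", "?", "51", ""])
print(semantic_type, issues)
```

In this toy run the label `patient_age` resolves to the hypothetical `age` type, and the placeholder `?` and the empty string are flagged because neither matches the expected numeric format. A real dictionary would need far more entries and more careful label tokenization (for example, camelCase labels are not split here).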
The study found that attribute labels can be effectively used for semantic type detection, with a 99.35% success rate in classifying 922 columns across the datasets. The approach proved particularly effective in identifying missing values, which constituted 76.4% of the 106 data quality issues detected. Compared with a traditional data profiling tool, YData Profiling, the proposed method detected far more missing values: 81 across the datasets, against only one for YData Profiling.
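As a quick consistency check, the reported percentages can be converted back into counts. The sketch below uses only the figures quoted above; the count of classified columns is an inference from rounding, not a number stated in this summary.

```python
total_columns = 922
success_rate = 0.9935      # reported classification success rate
total_issues = 106
missing_share = 0.764      # reported share of issues that are missing values

# Inferred count (assumption: the rate was rounded to two decimals).
classified_columns = round(total_columns * success_rate)
# Should agree with the 81 missing values reported in the comparison.
missing_values = round(total_issues * missing_share)

print(classified_columns, total_columns - classified_columns)  # 916 6
print(missing_values)                                          # 81
print(round(missing_values / total_issues * 100, 1))           # 76.4
```

The 76.4% share and the 81 missing values quoted in the tool comparison agree with each other, and the success rate corresponds to roughly 916 of the 922 columns being classified.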
The research concludes that leveraging semantic information from attribute labels significantly enhances data quality assessment and streamlines the data cleaning process. This approach offers a practical and effective solution for identifying a wide range of data quality issues across diverse datasets and domains.
The main contribution to data quality management is a method for detecting data quality issues early. By raising data quality and reducing the time and resources required for data cleaning, the approach has the potential to improve data-driven decision-making across various domains.
The study acknowledges the limitation of using datasets primarily from a single repository and suggests expanding the analysis to datasets from diverse sources. Future research directions include incorporating machine learning for automated semantic type detection, expanding the analysis to database tables, and aligning the methodology with international data quality standards.
Source: arxiv.org