The paper presents a data-centric approach to improving printed mathematical expression recognition (MER) models. The key contributions are:
LaTeX Normalization: The authors identified several problematic aspects of the ground truth (GT) in the widely used im2latex-100k dataset: variations in mathematical fonts, whitespace, curly brackets, sub- and superscript order, tokens, and array structures. They proposed a normalization process that addresses these issues and reduces undesired variation in the GT.
im2latexv2 Dataset: The authors created an enhanced version of the im2latex-100k dataset, called im2latexv2, which includes 30 different fonts in the training set and 59 fonts in the validation and test sets. This addresses the limitation of using a single font in the original dataset.
realFormula Dataset: The authors also introduced a new real-world test set, realFormula, containing 121 manually annotated mathematical expressions extracted from research papers.
MathNet Model: The authors developed a new printed MER model, MathNet, based on a convolutional vision transformer encoder and a transformer decoder. MathNet outperforms the previous state-of-the-art models on all four test sets (im2latex-100k, im2latexv2, realFormula, and InftyMDB-1) by up to 88.3% in terms of Edit score.
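To make the LaTeX normalization contribution more concrete, here is a minimal Python sketch of the kind of rewriting it involves. It is not the authors' implementation: the function name `normalize_latex`, the three rules shown (collapsing whitespace, bracing script arguments, placing subscripts before superscripts), and the chosen canonical order are all assumptions made for illustration, and nested brace groups are deliberately left untouched.

```python
import re

# A script argument: a brace group without nesting, a command, or a single character.
_ARG = r"(\{[^{}]*\}|\\[A-Za-z]+|[^\s{}\\_^])"


def _brace(arg: str) -> str:
    """Wrap a script argument in curly brackets unless it already has them."""
    return arg if arg.startswith("{") else "{" + arg + "}"


def normalize_latex(src: str) -> str:
    """Hypothetical normalizer: maps equivalent spellings to one canonical string."""
    # Rule 1 (assumed): collapse runs of whitespace to a single space.
    s = re.sub(r"\s+", " ", src).strip()
    # Rule 2 (assumed): always brace script arguments, e.g. x^2 -> x^{2}.
    s = re.sub(r"([_^]) ?" + _ARG, lambda m: m.group(1) + _brace(m.group(2)), s)
    # Rule 3 (assumed): write the subscript before the superscript.
    s = re.sub(r"\^(\{[^{}]*\}) ?_(\{[^{}]*\})", r"_\2^\1", s)
    return s


if __name__ == "__main__":
    # Two spellings that render identically collapse to one ground-truth string.
    print(normalize_latex("x^2_i   +  y"))
    print(normalize_latex("x_{i}^{2} + y"))
```

Both calls return the same string, which is the point of normalization: the model no longer has to guess which of several visually identical spellings the ground truth happens to use.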
The authors conducted extensive experiments to analyze the impact of their data-centric approach and the model architecture. They found that the LaTeX normalization process contributed two-thirds of the performance improvement, while the use of multiple fonts accounted for the remaining one-third. The authors also identified challenges, such as the array structure and the absence of mathematical fonts in the training data, which negatively impact the model's performance on the realFormula test set.
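For readers unfamiliar with the Edit score mentioned above, the following Python sketch shows one common way such a metric is computed on token sequences: one minus the Levenshtein distance divided by the longer sequence length, expressed as a percentage. This definition is an assumption; the paper's exact formula (for example, whether it is aggregated per sample or over the whole test set) may differ. The names `levenshtein` and `edit_score` are illustrative.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def edit_score(pred: Sequence[str], gt: Sequence[str]) -> float:
    """Assumed definition: 100 * (1 - distance / max(len(pred), len(gt)))."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(pred, gt) / longest)


if __name__ == "__main__":
    gt = "x _ { i } ^ { 2 }".split()
    pred = "x _ { i } ^ { 3 }".split()
    # One wrong token out of nine.
    print(round(edit_score(pred, gt), 2))
```

Because the score is computed against the ground-truth string, two predictions that render identically can still score differently unless the ground truth has been normalized first.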
Source: arxiv.org