Core Concepts
A BERT-based deep learning model can effectively extract multiple questions from academic images and text, outperforming rule-based and layout-based approaches in accuracy and efficiency.
Abstract
The paper presents a method for extracting multiple questions from academic images and text using a BERT-based deep learning model. The key highlights are:
Providing fast and accurate resolution to student queries is a critical goal for online education systems. Students often submit queries through a chatbot-like interface, which can include complex equations, tables, images, or other relevant information.
While allowing students to upload images of their queries eliminates the need for them to type out complex information, it also introduces challenges. Images may contain multiple questions or extra textual noise, which can lower the accuracy of existing single-query answering solutions.
The authors propose using a BERT-based deep learning model for extracting questions from text and images. They compare this approach to rule-based and layout-based (LayoutLM) methods.
The BERT-based model outperforms the other approaches in accuracy, reaching 96% precision and 83% recall on the validation dataset. It is also smaller and faster than the LayoutLM model (107M vs. 133M parameters; 205 ms vs. 526 ms per query).
The BERT-based model is also easier to fine-tune, works well with data augmentation, and is better suited to adoption in a large-scale question-answering pipeline.
The authors also discuss potential future extensions, such as applying the model to other languages and exploring OCR-free transformer models for further improvements.
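The summary above does not say how the model's output is turned into separate questions. One common framing for this kind of task is token classification, where each token of the OCR text gets a tag and contiguous tagged runs are grouped into spans. The sketch below illustrates only that grouping step, under stated assumptions: the tag names `B-Q`, `I-Q`, and `O` and the function `decode_question_spans` are hypothetical and are not taken from the paper, and the labels are written by hand rather than predicted by a model.

```python
from typing import List


def decode_question_spans(tokens: List[str], labels: List[str]) -> List[str]:
    """Group tokens tagged B-Q / I-Q into question strings.

    Hypothetical tag scheme (an assumption, not the paper's):
      B-Q = first token of a question
      I-Q = continuation of the current question
      O   = noise (headers, page numbers, stray text)
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    questions: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B-Q":
            # A new question starts; flush the previous one if it is open.
            if current:
                questions.append(" ".join(current))
            current = [token]
        elif label == "I-Q" and current:
            current.append(token)
        else:
            # "O", or an I-Q with no open question: treat as noise
            # and close any question that is currently open.
            if current:
                questions.append(" ".join(current))
                current = []
    if current:
        questions.append(" ".join(current))
    return questions


# Hand-labelled example: two questions surrounded by textual noise.
tokens = "Homework 3 Solve x + 2 = 5 . Find the area of a circle . Page 4".split()
labels = (
    ["O", "O"]
    + ["B-Q"] + ["I-Q"] * 6
    + ["B-Q"] + ["I-Q"] * 6
    + ["O", "O"]
)

for question in decode_question_spans(tokens, labels):
    print(question)
```

In a real pipeline the labels would come from the fine-tuned model rather than being written by hand. Each decoded span could then be passed on to a single-query answering system, which is the use case the summary describes.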
Stats
Around 30% of images submitted still contain textual noise or multiple questions.
The BERT-based model achieved a precision of 96% and recall of 83% on the validation dataset.
The BERT-based model is significantly smaller (107M parameters) and faster (205 ms per query) than the LayoutLM model (133M parameters, 526 ms per query).
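The figures above can be combined into a few derived quantities that make the comparison easier to read. The short sketch below computes the F1 score implied by the reported precision and recall, together with the relative latency and parameter count of the two models. These derived values are not stated in the summary; they follow arithmetically from the reported numbers, and the helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported figures for the BERT-based model on the validation dataset.
f1 = f1_score(precision=0.96, recall=0.83)

# Reported latency (ms per query) and size (millions of parameters).
bert_ms, layoutlm_ms = 205, 526
bert_params, layoutlm_params = 107, 133

speedup = layoutlm_ms / bert_ms                    # how many times faster BERT is
param_reduction = 1 - bert_params / layoutlm_params  # fraction of parameters saved

print(f"F1: {f1:.2f}")                            # F1: 0.89
print(f"Speedup: {speedup:.2f}x")                 # Speedup: 2.57x
print(f"Fewer parameters: {param_reduction:.1%}")  # Fewer parameters: 19.5%
```

So the reported precision and recall correspond to an F1 of about 0.89, and the BERT-based model answers a query roughly 2.6 times faster with about a fifth fewer parameters.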
Quotes
"Deep learning-based models, such as BERT and LayoutLMv3 have shown to be highly effective in capturing contextual information."
"BERT based model successfully extracts questions from raw text without image input while being significantly smaller and faster than layoutLM model."