Core Concepts
Recommendation models can be trained to suggest which content in Stack Overflow answers to highlight with formatting types such as Bold, Italic, Code, and Heading.
Abstract
The study investigates the prevalence and usage of information highlighting in Stack Overflow (SO) answers, and explores the feasibility of automatically recommending highlighted content using machine learning models.
Key findings:
- Information highlighting is prevalent on SO, with 47.6% of answers using formatting types like Bold, Italic, Code, Heading, and Delete to highlight content.
- Code formatting is the most commonly used (38.5% of answers), mainly to highlight source code elements like identifiers, keywords, and statements. Code is also used to highlight non-code content like software names, equations, and terminology.
- Bold and Italic are also frequently used to highlight source code, along with content related to caveats, references, and terminology.
- The authors developed CNN-based and BERT-based models to recommend highlighted content automatically. The CNN models achieve precision between 0.71 and 0.82, and the Code model performs best, with an F1 score of 0.71.
- Analysis of failure cases shows that most are missed identifications, where the model fails to flag content that should be highlighted. The models tend to learn frequently highlighted words and struggle with less frequent (long-tail) content.
The findings offer insights for future research on automatic information highlighting and for using highlighted content in downstream tasks such as answer summarization and API documentation enrichment.
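To make the precision and F1 figures above concrete, here is a minimal sketch of how such scores could be computed for one answer. It assumes the task is framed as token-level binary labeling (1 = should be highlighted, 0 = should not). That framing and the function name `precision_recall_f1` are illustrative assumptions, not the authors' evaluation code.

```python
def precision_recall_f1(gold, predicted):
    """Token-level scores for one answer.

    gold and predicted are equal-length lists of 0/1 labels
    (1 = token is highlighted). This framing is an assumption
    made for illustration.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # correctly highlighted
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # highlighted by mistake
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # missed highlights

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall,
    # written here in its equivalent count form.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1


# Hypothetical example: three tokens should be highlighted,
# but the model only finds the most common one.
tokens = ["Call", "list.append", "not", "list.extend", "or", "insert"]
gold = [0, 1, 0, 1, 0, 1]
predicted = [0, 1, 0, 0, 0, 0]
print(precision_recall_f1(gold, predicted))
```

In this made-up example every predicted highlight is correct, so precision is perfect, but two of the three gold highlights are missed, which pulls recall and F1 down. This is the pattern the failure analysis describes: a model that is reliable on frequent terms but misses long-tail content.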
Stats
"Information highlighting is prevalent on SO, i.e., 47.6% of the answers use the studied formatting to highlight information."
"38.5% of the answers use Code, which is the most frequently used format, followed by Bold (11.3%) and Italic (7.2%)."
"The median length of the content highlighted with Code, Bold, Italic, Deleting, and Heading are 1, 1, 1, 8, and 2 words, respectively."
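Statistics like these could be gathered with a sketch along the following lines. It assumes answer bodies are available as rendered HTML and that the tag-to-type mapping in `TAG_TO_TYPE` is a reasonable approximation of the studied formatting types. Both are assumptions for illustration, as is the choice to skip `<pre>` blocks so that only inline code counts. The study's actual extraction procedure may differ.

```python
from html.parser import HTMLParser
from statistics import median

# Assumed mapping from HTML tags to formatting types (illustrative only).
TAG_TO_TYPE = {
    "code": "Code",
    "strong": "Bold", "b": "Bold",
    "em": "Italic", "i": "Italic",
    "del": "Delete", "s": "Delete", "strike": "Delete",
    "h1": "Heading", "h2": "Heading", "h3": "Heading",
}


class HighlightExtractor(HTMLParser):
    """Collect (formatting type, text) pairs from one answer body."""

    def __init__(self):
        super().__init__()
        self.spans = []      # finished (type, text) pairs
        self._open = []      # stack of [type, text parts] for open tags
        self._pre_depth = 0  # > 0 while inside a <pre> code block

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._pre_depth += 1
        elif tag in TAG_TO_TYPE and self._pre_depth == 0:
            self._open.append([TAG_TO_TYPE[tag], []])

    def handle_endtag(self, tag):
        if tag == "pre":
            self._pre_depth = max(0, self._pre_depth - 1)
        elif tag in TAG_TO_TYPE and self._pre_depth == 0 and self._open:
            kind, parts = self._open.pop()
            self.spans.append((kind, "".join(parts).strip()))

    def handle_data(self, data):
        # Text belongs to every currently open highlight (handles nesting).
        for entry in self._open:
            entry[1].append(data)


def extract_highlights(html):
    parser = HighlightExtractor()
    parser.feed(html)
    parser.close()
    return parser.spans


def highlight_prevalence(answers):
    """Fraction of answers containing at least one highlighted span."""
    if not answers:
        return 0.0
    return sum(1 for a in answers if extract_highlights(a)) / len(answers)


def median_words_by_type(answers):
    """Median word count of highlighted spans, per formatting type."""
    lengths = {}
    for html in answers:
        for kind, text in extract_highlights(html):
            lengths.setdefault(kind, []).append(len(text.split()))
    return {kind: median(vals) for kind, vals in lengths.items()}
```

Calling `highlight_prevalence` and `median_words_by_type` on a list of answer bodies yields figures of the same kind as those quoted above. The sketch assumes well-formed HTML; mismatched tags would need more careful handling.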
Quotes
"Code is mainly used to highlight source code elements, such as identifiers (63.5%), programming language keywords (9.9%), and statements (7.0%)."
"Code is also used to highlight content other than source code, such as Software (4.9%), Terminology (1.8%), Equation (5.2%), and Version (0.5%)."
"Both Bold and Italic formatting are most frequently used to highlight content related to source code."