Core Concepts
This article proposes a taxonomy of intentions for technical forum posts and develops an automated intention detection framework that uses the textual and structural features of a post to classify its underlying purpose.
Abstract
The authors conducted a qualitative study to understand the composition and arrangement of content in technical forum posts. They found that posts often contain various supplementary materials beyond natural language descriptions, such as code snippets, error messages, configurations, and command lines. These content types are frequently organized within code blocks.
Based on the findings from the qualitative study and a review of prior work, the authors devised a taxonomy of seven intention categories for technical forum posts: Discrepancy, Explicit Error, Review, Conceptual, Learning, How-to, and Other. They manually annotated a dataset of 784 posts according to this taxonomy and measured high inter-rater agreement.
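The summary does not say which agreement statistic the authors used. As an illustration only, the sketch below assumes Cohen's kappa for two annotators, a common choice for this kind of labeling task. The function name and the toy labels are hypothetical and are not taken from the annotated dataset.

```python
from collections import Counter

# The seven intention categories named in the taxonomy.
CATEGORIES = [
    "Discrepancy", "Explicit Error", "Review", "Conceptual",
    "Learning", "How-to", "Other",
]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same posts.

    Assumption: the paper's agreement metric is not stated in this
    summary; kappa is used here purely as an example.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of posts")

    n = len(labels_a)
    # Observed agreement: fraction of posts given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: computed from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical): the annotators disagree on the last post.
rater_a = ["How-to", "How-to", "Review", "Review"]
rater_b = ["How-to", "How-to", "Review", "How-to"]
kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the annotators agree on three of four posts (observed agreement 0.75) and chance agreement is 0.5, so kappa is 0.5. Values closer to 1 indicate stronger agreement.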
The authors then proposed an intention detection framework that leverages pre-trained transformer-based language models to generate embeddings for the title and description of posts. In addition to the textual features, the framework also utilizes the content categories of code blocks as a structural feature. The authors experimented with different pre-trained models and fine-tuning strategies, demonstrating that their framework outperforms state-of-the-art baselines in intention detection.
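The summary does not describe how the textual and structural features are combined. The sketch below shows one plausible arrangement under stated assumptions: the title and description embeddings are concatenated with a multi-hot vector marking which content categories appear in the post's code blocks. The hashed bag-of-words function is a stand-in for a pre-trained transformer encoder, and every name here is hypothetical.

```python
import zlib

# Content categories of code blocks mentioned in the qualitative study.
# Assumption: the paper's full category list may differ from these four.
CODE_BLOCK_CATEGORIES = [
    "code snippet", "error message", "configuration", "command line",
]


def toy_text_embedding(text, dim=8):
    """Placeholder for a transformer embedding: hashed bag of words.

    A real pipeline would call a pre-trained language model here.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def code_block_features(block_categories):
    """Multi-hot vector: 1.0 for each content category present in the post."""
    present = set(block_categories)
    return [1.0 if c in present else 0.0 for c in CODE_BLOCK_CATEGORIES]


def post_features(title, description, block_categories, dim=8):
    """Concatenate title embedding, description embedding, and structure."""
    return (
        toy_text_embedding(title, dim)
        + toy_text_embedding(description, dim)
        + code_block_features(block_categories)
    )


features = post_features(
    title="NullPointerException when saving entity",
    description="The save call fails with the stack trace below",
    block_categories=["error message"],
)
```

The resulting vector would then feed a classifier over the seven intention categories. Keeping the structural part as a separate, fixed-length segment makes it easy to ablate when comparing text-only and combined models.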
The key insights from this work include:
- An account of how content is composed and arranged in technical forum posts, with code blocks serving as a common container for supplementary materials.
- The taxonomy of seven intention categories that capture the diverse purposes behind technical forum posts.
- The effectiveness of combining textual and structural features, particularly the content categories of code blocks, for accurately detecting the intentions of technical forum posts.
- Guidance on the selection and fine-tuning of pre-trained language models for processing technical forum data.
Stats
The average length of post descriptions is 112.1 tokens, with a median of 83 tokens.
26.8% of posts contain code snippets, 15.9% contain error messages, and 10.4% contain images.
90.6% of posts with code snippets use code blocks to present them, while 33.3% of posts with inline code do not mark it correctly.
55.0% of posts with stack traces arrange them in code blocks, while 65.7% of shorter error messages are mixed with natural language descriptions.
Quotes
"Most tags are only focused on the technical perspective (e.g., program language, platform, tool). In most cases, forum posts in online developer communities reveal the author's intentions to solve a problem, ask for advice, share information, etc. The modeling of the intentions of posts can provide an extra dimension to the current tag taxonomy."
"Efficient recommendations with tags have the potential to enhance the visibility of a question, increasing the likelihood of a swift response from domain experts."