Core Concepts
This work proposes a taxonomy of downstream tasks that captures how researchers and practitioners have been using prompts to elicit the emergent capabilities of Large Language Models (LLMs) for software testing, verification, and related problems.
Abstract
The paper investigates how the software testing and verification research communities have been using prompts to leverage the capabilities of Large Language Models (LLMs). The authors first assess whether the concept of "downstream tasks" is adequate to convey the blueprint of prompt-based solutions. They then develop a novel taxonomy of downstream tasks to identify patterns and commonalities across a varied spectrum of software engineering problems, including testing, fuzzing, debugging, vulnerability detection, static analysis, and program verification.
The taxonomy is organized hierarchically, with top-level categories that capture high-level conceptual operations such as Generative, Evaluative, Extractive, Abstractive, Executive, and Consultative tasks. These categories are further divided into more specific task families based on the type of software artifacts being processed and the nature of the operations performed.
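To make the two-level hierarchy concrete, here is a minimal Python sketch. Only the six Operation names come from the paper; the TaskFamily fields and the example entry are illustrative assumptions of ours, not the paper's schema.

```python
from dataclasses import dataclass
from enum import Enum

class Operation(Enum):
    """Top-level conceptual operations of the taxonomy."""
    GENERATIVE = "Generative"
    EVALUATIVE = "Evaluative"
    EXTRACTIVE = "Extractive"
    ABSTRACTIVE = "Abstractive"
    EXECUTIVE = "Executive"
    CONSULTATIVE = "Consultative"

@dataclass
class TaskFamily:
    """A more specific task family nested under one conceptual operation."""
    operation: Operation
    name: str                    # illustrative family name, not the paper's
    input_artifacts: list[str]   # software artifacts the prompt consumes
    output_artifacts: list[str]  # artifacts expected in the LLM's response

# Hypothetical entry showing how one family might be classified; the real
# family names and artifact types are defined in the paper's tables.
unit_test_generation = TaskFamily(
    operation=Operation.GENERATIVE,
    name="unit test generation",
    input_artifacts=["focal method", "surrounding class context"],
    output_artifacts=["unit test case"],
)
```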
The authors provide detailed tables that summarize the downstream tasks elicited by prompts in various LLM-enabled approaches for software testing, fuzzing, debugging, vulnerability detection, static analysis, and program verification. These tables describe the input-output relationships of the tasks, as well as the integration and orchestration of the LLM components within the overall approach.
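As a rough illustration of what "integration and orchestration of the LLM components" can mean in practice, the sketch below wires a single LLM call into a fuzzing loop. The llm callable, the prompt wording, and the seed-mutation task are all invented for illustration and are not taken from the paper's tables.

```python
def fuzzing_iteration(llm, grammar: str, seed: str) -> str:
    """One hypothetical LLM call embedded in a fuzzing loop.

    `llm` is any callable mapping prompt text to completion text;
    the prompt and the seed-mutation task are invented examples.
    """
    prompt = (
        "Given this input grammar:\n" + grammar + "\n"
        "Mutate the following seed into a new, still-valid input:\n" + seed
    )
    candidate = llm(prompt)   # downstream task: seed mutation
    return candidate.strip()  # parsed output feeds back into the fuzzing loop

# Usage with a stand-in model; a real approach would call an LLM API here.
fake_llm = lambda prompt: "  mutated-seed-0  "
print(fuzzing_iteration(fake_llm, grammar="<expr> ::= <num> '+' <num>", seed="1+2"))
```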
The proposed taxonomy helps identify patterns, trends, and unexplored areas in the use of LLMs for software engineering problems. It also provides a framework for discussing design patterns of LLM-enabled approaches, characteristics of task families, and opportunities for future research and development.
Stats
"Prompting has become one of the main approaches to leverage emergent capabilities of Large Language Models [Brown et al. NeurIPS 2020, Wei et al. TMLR 2022, Wei et al. NeurIPS 2022]."
"We were able to recover from the 80 reported papers their downstream tasks and present them homogeneously no matter how sophisticated the underlying probabilistic program is."
"Identified downstream tasks end up being rich in terms nature and functional features and, to the best of our knowledge, some of them were not previously identified in existing taxonomies."
Quotes
"Taxonomies may result in rigid concepts that do not favour the use of versatility of concrete concepts and phenomena like, in this case, inference elicited by prompts. However, we believe abstract organization is worth its risks: one could see patterns, trends, unexplored spots, and a way to recognize when one is in front of a 'brand new specimen' or category of things."
"Even in the case were fine-tuned neural paradigms were already in place (e.g., vulnerability detection), there is an apparent gap between expected LLMs proficiency and the nature of problems and 'classical' solutions (and, thus, one would expect ingenious LLM-enabled solutions)."