Core Concepts
Large Language Models can be prompted to suggest diverse mutants that resemble real-world bugs, complementing traditional mutation testing approaches.
Summary
The paper presents LLMorpheus, a mutation testing technique that uses Large Language Models (LLMs) to suggest mutants. Traditional mutation testing tools apply a fixed set of mutation operators, which limits their ability to generate mutants that resemble real-world bugs. LLMorpheus addresses this limitation by inserting placeholders into the source code and prompting an LLM to suggest code fragments that could replace them.
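To make the placeholder idea concrete, the sketch below shows one way such a prompt could be constructed. This is an illustrative approximation, not the paper's actual implementation: the function name, placeholder token, and prompt wording are all assumptions.

```python
# Hypothetical sketch of placeholder-based prompt construction in the
# style of LLMorpheus. All names and wording are illustrative.

def make_mutation_prompt(source: str, target: str) -> str:
    """Mask one expression with a placeholder and ask an LLM what
    code fragments could stand in its place."""
    # Replace only the first occurrence of the target expression.
    masked = source.replace(target, "<PLACEHOLDER>", 1)
    return (
        "Consider the following JavaScript code:\n"
        f"{masked}\n"
        "Suggest three code fragments that could replace <PLACEHOLDER>, "
        "each of which compiles but may change the program's behavior."
    )

code = "function max(a, b) { return a > b ? a : b; }"
prompt = make_mutation_prompt(code, "a > b")
```

Each fragment the LLM suggests would then be substituted back for the placeholder to produce a candidate mutant, which is compiled and run against the test suite as in conventional mutation testing.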
The key highlights and insights are:
- LLMorpheus is capable of generating a diverse set of mutants, some of which resemble real-world bugs that cannot be created by traditional mutation operators.
- The majority (63.2%) of surviving mutants produced by LLMorpheus reflect behavioral differences, 8.5% are equivalent to the original code, and 9.7% are near-equivalent.
- Using higher temperature settings for the LLM results in more variable mutant generation, while lower temperatures produce more stable results.
- The default prompting strategy used by LLMorpheus generally produces the largest number of mutants and surviving mutants, and removing different parts of the prompt degrades the results to varying degrees.
- The codellama-34b-instruct LLM generally produces the most mutants and surviving mutants, but LLMorpheus remains effective when using the codellama-13b-instruct and mixtral-8x7b-instruct models.
- The cost of running LLMorpheus, in terms of time and tokens used, is practical for real-world use.
Statistics
The paper reports the following key statistics:
- LLMorpheus generated between 89 and 2035 mutants across the 13 subject applications.
- Of the surviving mutants, 63.2% reflected behavioral differences, 8.5% were equivalent, and 9.7% were near-equivalent.
- Running LLMorpheus took between 430.53 and 5,241.46 seconds across the subject applications.
- The total number of tokens used by LLMorpheus was 6,563,096.