Key Concepts
This paper introduces TARGET, a novel approach using pre-trained code language models to automatically repair broken test cases, and demonstrates its effectiveness through extensive evaluation using a newly created benchmark, TARBENCH.
Abstract
Bibliographic Information:
Yaraghi, A. S., Holden, D., Kahani, N., & Briand, L. (2024). Automated Test Case Repair Using Language Models. arXiv preprint arXiv:2401.06765v2.
Research Objective:
This paper aims to address the challenge of automatically repairing broken test cases in software development by leveraging the power of pre-trained code language models (CLMs).
Methodology:
The researchers developed TARGET, a two-step approach that first identifies and prioritizes code changes in the System Under Test (SUT) relevant to the broken test case, forming a repair context. Then, it utilizes this context to fine-tune a pre-trained CLM for test repair, treating it as a language translation task. They evaluated TARGET's effectiveness using TARBENCH, a comprehensive benchmark they created, comprising 45,373 broken test repairs across 59 open-source projects. The study explored different input-output formats for the CLM, compared its performance against baselines, and investigated its generalizability and the reliability of its generated repairs.
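The translation framing above means fine-tuning reduces to standard sequence-to-sequence training on pairs of (broken test plus repair context, repaired test). Below is a minimal sketch of how such an input might be assembled and passed to a pre-trained code LM; the [BROKEN_TEST]/[SUT_CHANGE] tags, the build_repair_input helper, and the Salesforce/codet5-small checkpoint are illustrative assumptions, not TARGET's actual input-output format.

```python
# Hypothetical sketch: repairing a broken test with a seq2seq code LM.
# The tags and helper below are illustrative, not TARGET's exact format.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "Salesforce/codet5-small"  # placeholder for the fine-tuned CLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def build_repair_input(broken_test, sut_changes, max_changes=3):
    """Concatenate the broken test with the highest-priority SUT changes.

    TARGET ranks SUT changes by relevance to the broken test; here we simply
    assume the list is already ordered and truncate it to fit the context.
    """
    context = "\n".join(f"[SUT_CHANGE] {c}" for c in sut_changes[:max_changes])
    return f"[BROKEN_TEST] {broken_test}\n{context}"

broken_test = 'assertEquals("v1", client.getApiVersion());'
sut_changes = ['- public String getApiVersion()', '+ public Version getApiVersion()']

inputs = tokenizer(build_repair_input(broken_test, sut_changes),
                   return_tensors="pt", truncation=True, max_length=512)
# A checkpoint fine-tuned on broken-to-repaired test pairs would be used here.
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

In practice, the model would first be fine-tuned on the benchmark's broken-to-repaired test pairs; without that step, the generated output is not a meaningful repair.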
Key Findings:
- TARGET, when fine-tuned with the appropriate input-output format, achieves a 66.1% exact match accuracy and an 80% plausible repair accuracy, significantly outperforming baselines.
- The study highlights the importance of carefully selecting and prioritizing repair context information for optimal CLM performance.
- A model trained on data from specific projects can be effectively applied to other projects, demonstrating the generalizability of the approach.
- The researchers introduce a method to predict the reliability of TARGET's generated repairs, helping developers decide whether to trust them (a simple confidence proxy is sketched after this list).
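The paper's reliability predictor is not detailed in this summary. One cheap, commonly used proxy is to score a candidate repair by the model's own likelihood of producing it and flag low-confidence repairs for manual review; the sketch below illustrates only that proxy and should not be read as TARGET's actual method.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative confidence proxy only; not the paper's reliability predictor.
MODEL_NAME = "Salesforce/codet5-small"  # placeholder for the fine-tuned CLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def repair_confidence(repair_input, candidate_repair):
    """Average-likelihood proxy: how probable the model finds its own repair."""
    enc = tokenizer(repair_input, return_tensors="pt",
                    truncation=True, max_length=512)
    labels = tokenizer(candidate_repair, return_tensors="pt",
                       truncation=True, max_length=128).input_ids
    with torch.no_grad():
        loss = model(**enc, labels=labels).loss  # mean cross-entropy over repair tokens
    return float(torch.exp(-loss))  # closer to 1.0 means higher model confidence

# Repairs scoring below a tuned threshold could be flagged for manual review.
```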
Main Conclusions:
This research demonstrates the potential of leveraging pre-trained CLMs for automated test case repair, offering a promising solution to a significant challenge in software development. The creation of TARBENCH provides a valuable resource for future research in this area.
Significance:
This work significantly contributes to the field of automated software engineering by presenting a novel and effective approach for automated test case repair using language models. The comprehensive benchmark and the insights gained from the study pave the way for further advancements in this domain.
Limitations and Future Research:
The study acknowledges that while SUT code changes are crucial for repair, additional context might be beneficial. Future research could explore incorporating more comprehensive context information and investigate alternative techniques for context prioritization.
Statistics
Broken test cases account for 14% to 22% of test case failures in open-source projects.
TARGET achieves a 66.1% exact match accuracy.
TARGET achieves a plausible repair accuracy of 80%.
TARGET outperforms the baseline by 37.4 percentage points in exact match (EM) accuracy.
TARBENCH comprises 45,373 broken test repairs across 59 distinct projects.
Quotes
"Existing automated test repair studies show limitations from both methodological and evaluation perspectives, restricting their applicability across diverse software systems and repair scenarios."
"Motivated by these studies, we propose TARGET (TEST REPAIR GENERATOR), an approach that leverages language models to automatically repair broken test cases."
"We conduct an extensive experimental analysis and assess TARGET by creating TARBENCH, a benchmark comprising 45,373 broken test repairs across 59 distinct projects, making it by far the most comprehensive benchmark to date in this application context."