Evaluating Code Similarity Across Programming Languages Using Abstract Syntax Tree Edit Distance and Large Language Models

Core Concepts
This study explores the effectiveness of Abstract Syntax Tree (AST) edit distance and large language model-based similarity metrics in evaluating code similarity across multiple programming languages, and compares their performance to traditional semantic similarity measures.
The paper revisits recent code similarity evaluation metrics, focusing on the application of Abstract Syntax Tree (AST) edit distance across diverse programming languages. The authors examine the usefulness of these metrics and compare them to traditional sequence similarity metrics such as BLEU score and Jaccard similarity. The key highlights and insights are:

- The authors demonstrate the adaptability of the TSED (Tree Similarity of Edit Distance) metric beyond SQL, showing its effectiveness in languages such as Java, Python, and Kotlin.
- TSED exhibits a strong correlation with BLEU score, Jaccard similarity, and GPT-based similarity, indicating its ability to capture intricate code structures.
- The evaluation compares TSED, GPT-based similarity, BLEU score, and Jaccard similarity against execution match, a metric that assesses whether generated code and the ground truth produce consistent execution outcomes. Both TSED and GPT-based similarity predict execution match more accurately than the semantic metrics.
- The authors discuss the limitations of these metrics, including the unstable nature of GPT-based similarity scoring and the influence of parameter optimization on TSED, and highlight the need to balance performance and stability in code similarity assessment across languages.
- The paper proposes and publishes an adaptable TSED-based metric that is effective across all tested languages, an enhanced version of the original TSED approach.

Overall, the study provides valuable insights into the strengths and weaknesses of different code similarity evaluation techniques and offers a comprehensive comparison of their performance across multiple programming languages.
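To make the core idea concrete, the sketch below parses two Python snippets with the standard `ast` module, computes a simplified top-down tree edit distance, and normalizes by the larger tree's node count. This is an illustration only: the recursion here is a lighter-weight stand-in for the full tree-edit-distance algorithms such metrics use in practice, and the helper names are not from the paper.

```python
import ast

def tree(node):
    """Convert a Python AST node to a (label, children) tuple tree."""
    return (type(node).__name__, [tree(c) for c in ast.iter_child_nodes(node)])

def size(t):
    """Total number of nodes in a (label, children) tree."""
    return 1 + sum(size(c) for c in t[1])

def dist(a, b):
    """Simplified top-down tree edit distance (not full Zhang-Shasha):
    relabel cost plus a Levenshtein-style alignment of the child subtrees."""
    cost = 0 if a[0] == b[0] else 1  # rename cost at this node
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])        # delete whole subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])        # insert whole subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),          # delete child subtree
                d[i][j - 1] + size(cb[j - 1]),          # insert child subtree
                d[i - 1][j - 1] + dist(ca[i - 1], cb[j - 1]),  # match recursively
            )
    return cost + d[m][n]

def tsed(src_a, src_b):
    """TSED-style score: 1 - distance / max tree size, floored at 0."""
    ta, tb = tree(ast.parse(src_a)), tree(ast.parse(src_b))
    return max(1.0 - dist(ta, tb) / max(size(ta), size(tb)), 0.0)

print(tsed("x = 1", "x = 1"))       # identical code -> 1.0
print(tsed("x = 1", "y = [1, 2]"))  # structurally different -> below 1.0
```

Because the trees are built from node type names only, identifier renames do not affect the score here; a production implementation would use a language-agnostic parser such as tree-sitter rather than Python's own `ast`.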
The paper presents the following key statistics:

- TSED exhibits a Pearson correlation of 0.6 to 0.8 with BLEU score and Jaccard similarity across the tested programming languages.
- TSED and GPT-based similarity are strongly correlated, especially for Java and Python, in the CoderEval test.
- At their optimal thresholds for matching execution-match results, TSED and GPT-based similarity achieve F1 scores of 0.50 to 0.68 and accuracy of 0.53 to 0.68 across the tested languages.
- GPT-based similarity scoring exhibits limited stability, with mean squared errors of 0.0527 to 0.0628 and mean absolute errors of 0.1825 to 0.1996 across multiple runs.
- The penalty weight for the 'Insert' operation in TSED has a significant impact on the correlation with GPT-based similarity, with a sweet spot around 0.8 for the MBXP/Java dataset.
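The "optimal threshold" statistics can be illustrated with a simple sweep: binarize a continuous similarity score at each candidate cut-off and keep the one that best matches the execution-match labels by F1. The scores and labels below are invented for illustration and are not the paper's data.

```python
# Hypothetical per-sample similarity scores and execution-match labels.
scores = [0.91, 0.40, 0.75, 0.22, 0.68, 0.85, 0.30, 0.55]
exec_match = [1, 0, 1, 0, 1, 1, 0, 0]

def f1_at(threshold):
    """F1 of predicting execution match by thresholding the similarity score."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(pred, exec_match))
    fp = sum(p and not y for p, y in zip(pred, exec_match))
    fn = sum((not p) and y for p, y in zip(pred, exec_match))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Sweep thresholds 0.00 .. 1.00 and keep the best by F1.
best = max((t / 100 for t in range(101)), key=f1_at)
print(best, f1_at(best))
```

The paper's 0.50 to 0.68 F1 range reflects this kind of search run per language against real execution outcomes.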
"TSED exhibits a commendable correlation ranging from 0.6 to 0.8 with BLEU score and Jaccard similarity, underscoring its sensitivity to code structure."

"GPT similarity demonstrates a slightly superior F1 score and TSED gives good results on accuracy in matching with Execution-Match."

"GPT scoring exhibits limited stability in the context of code similarity evaluation."

Deeper Inquiries

How can the stability of GPT-based similarity metrics be improved to ensure more reliable and consistent code assessments?

To enhance the stability of GPT-based similarity metrics for more reliable and consistent code assessments, several strategies can be implemented:

- Fine-tuning models: Continuously fine-tuning the GPT models on a diverse set of code samples can improve their stability. Exposure to a wide range of programming constructs and patterns helps the models produce more consistent similarity scores.
- Regular updates: Keeping the models up to date with the latest programming languages and coding practices helps maintain stability, addressing performance drift and keeping the models aligned with current coding standards.
- Ensemble methods: Combining the outputs of multiple GPT models mitigates the instability of individual models; aggregated predictions yield more robust and reliable similarity scores.
- Error analysis and a feedback loop: Identifying patterns in unstable predictions provides insights for model improvement, and a feedback loop that retrains on incorrect predictions can enhance stability over time.
- Regular monitoring: Tracking the performance of the metric over time and alerting on significant deviations makes it possible to detect and address stability issues promptly, allowing proactive adjustments.

By implementing these strategies, the stability of GPT-based similarity metrics can be improved, leading to more reliable and consistent code assessments.
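The ensemble idea can be sketched with the standard library alone: average several (here, hypothetical) judge runs into an ensemble score, then quantify each run's deviation from the ensemble with the same MSE/MAE style of measure the paper uses to report instability. All numbers below are invented for illustration.

```python
from statistics import mean

# Hypothetical similarity scores from three repeated LLM-judge runs
# over the same three code pairs (illustrative values only).
runs = [
    [0.82, 0.45, 0.66],
    [0.78, 0.52, 0.61],
    [0.85, 0.48, 0.70],
]

# Ensemble score per code pair: the mean across runs.
ensemble = [mean(col) for col in zip(*runs)]

def mse(a, b):
    """Mean squared error between two score lists."""
    return mean((x - y) ** 2 for x, y in zip(a, b))

def mae(a, b):
    """Mean absolute error between two score lists."""
    return mean(abs(x - y) for x, y in zip(a, b))

# How far each individual run strays from the ensemble.
for r in runs:
    print(round(mse(r, ensemble), 4), round(mae(r, ensemble), 4))
```

Averaging damps per-run noise at the cost of extra API calls; the monitoring strategy above amounts to tracking these MSE/MAE values over time.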

What other factors, beyond the AST structure and semantic similarity, could be incorporated into a comprehensive code similarity evaluation framework?

In addition to AST structure and semantic similarity, several other factors can be incorporated into a comprehensive code similarity evaluation framework:

- Execution trace analysis: Comparing the execution paths and outcomes of code snippets offers insight into their functional behavior, assessing practical similarity beyond syntactic and structural aspects.
- Code comments and documentation: Comments and documentation provide context for similarity evaluation; natural language processing techniques can compare their textual content to deepen the understanding of code similarities.
- Code complexity metrics: Metrics such as cyclomatic complexity, nesting depth, and code churn add a quantitative dimension, giving a more nuanced picture of similarities and differences.
- Version control history: Changes, commits, and branches across versions reveal evolution patterns and help identify commonalities and variations in code.
- Runtime performance analysis: Comparing the efficiency and speed of code execution can expose similarities in algorithmic approach.

Incorporating these additional factors yields a more comprehensive, multi-dimensional assessment of code similarity.
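As one example of a complexity signal, a rough cyclomatic complexity for Python code can be derived from its AST as 1 plus the number of decision points. The set of node types counted below is a simplification of the full metric (e.g. a chained `or` counts once), so treat it as illustrative.

```python
import ast

# Node types treated as decision points (a simplified selection).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.ExceptHandler, ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source):
    """Rough cyclomatic complexity: 1 + number of decision-point nodes."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(ast.parse(source)))

code_a = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
code_b = "def g(x):\n    return abs(x)\n"
print(cyclomatic_complexity(code_a), cyclomatic_complexity(code_b))  # 2 1
```

Two snippets with equal TSED scores but very different complexity values are likely not interchangeable, which is why such signals complement structural metrics.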

How can the TSED metric be further optimized and generalized to handle a wider range of programming languages and code complexity levels?

To optimize and generalize the Tree Similarity of Edit Distance (TSED) metric for a wider range of programming languages and code complexity levels, the following approaches can be considered:

- Enhanced parser flexibility: Extending the AST parser used in TSED to handle diverse syntax and semantics is crucial for applying the metric across more languages.
- Parameter tuning: Fine-tuning the penalty weights for delete, insert, and rename operations, and adapting them to the characteristics of each language, can improve how well the metric captures structural similarities.
- Normalization techniques: Refining normalization to account for the size and intricacy of code structures enables TSED to handle code at different complexity levels.
- Benchmarking and validation: Extensive benchmarking across diverse programming languages and codebases is essential to verify the metric's robustness and generalizability.
- Integration of machine learning: Machine learning models that adaptively adjust TSED parameters to the input code's language and patterns could further improve performance.

With these strategies, TSED can be optimized and generalized into a more versatile and effective tool for code similarity evaluation across languages and complexity levels.
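The parameter-tuning point can be sketched as follows: operation counts from a tree-edit alignment are combined with tunable per-operation weights before normalization. The 0.8 insert weight echoes the sweet spot the paper reports for MBXP/Java; the function names and the alignment input format are illustrative, not the published implementation.

```python
# Hypothetical per-operation penalty weights for a TSED-style metric.
# The 0.8 insert weight mirrors the reported MBXP/Java sweet spot.
WEIGHTS = {"delete": 1.0, "insert": 0.8, "rename": 1.0}

def weighted_distance(ops):
    """ops: list of (operation, count) pairs from a tree-edit alignment."""
    return sum(WEIGHTS[op] * n for op, n in ops)

def tsed_score(ops, max_nodes):
    """Normalize the weighted distance by the larger tree size, floored at 0."""
    return max(1.0 - weighted_distance(ops) / max_nodes, 0.0)

# E.g. an alignment needing 3 inserts and 1 rename on trees of <= 20 nodes:
print(tsed_score([("insert", 3), ("rename", 1)], 20))  # about 0.83
```

Sweeping each weight over a grid per language, as in the threshold search above for F1, is one straightforward way to find language-specific optima.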