The paper investigates the syntactic robustness of LLM-based code generation, focusing on prompts that contain mathematical formulas. It defines syntactic robustness as the degree to which semantically equivalent prompts (with syntactically different formulas) elicit semantically equivalent code responses from the LLM.
The authors first demonstrate that GPT-3.5 and GPT-4 are not syntactically robust by showing examples in which small syntactic changes to the formula in the prompt lead to semantically different generated code. They then propose a systematic approach to evaluate syntactic robustness:
They define a set of code generation prompts based on linear, quadratic, trigonometric, and logarithmic equations, and develop a set of mutation rules to generate syntactically different but semantically equivalent versions of these prompts.
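As a rough illustration, the sketch below applies a few hand-rolled mutation rules to a linear equation and uses sympy to sanity-check that each mutant stays semantically equivalent. The specific rules and the sympy-based check are assumptions for illustration, not the paper's exact rule set or equivalence procedure.

```python
# Illustrative mutation rules for a linear equation a*x + b = c.
# These rules and the sympy-based check are assumptions, not the paper's exact setup.
import sympy as sp

x, a, b, c = sp.symbols("x a b c")

def swap_sides(eq):
    # a*x + b = c  ->  c = a*x + b
    return sp.Eq(eq.rhs, eq.lhs)

def move_constant(eq):
    # a*x + b = c  ->  a*x = c - b
    return sp.Eq(eq.lhs - b, eq.rhs - b)

def scale_both_sides(eq, k=2):
    # a*x + b = c  ->  2*a*x + 2*b = 2*c
    return sp.Eq(k * eq.lhs, k * eq.rhs)

original = sp.Eq(a * x + b, c)
for mutate in (swap_sides, move_constant, scale_both_sides):
    mutant = mutate(original)
    # Semantic equivalence: solving each equation for x yields the same expression.
    assert sp.simplify(sp.solve(original, x)[0] - sp.solve(mutant, x)[0]) == 0
    print(mutant)
```

Composing several such rules yields prompts at increasing syntactic distance from the original while preserving meaning.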
They implement a reference code solution for each prompt and use differential fuzzing to check the equivalence of the code generated by the LLMs against the reference code.
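The differential-fuzzing check can be pictured roughly as below. The solve_quadratic-style interface, input ranges, trial count, and tolerance are illustrative assumptions rather than the paper's actual harness.

```python
# Minimal differential-fuzzing sketch: compare an LLM-generated solver against a
# reference implementation on many random inputs. Interface and parameters are
# illustrative assumptions, not the paper's harness.
import math
import random

def reference_solve_quadratic(a, b, c):
    """Reference solution: real roots of a*x^2 + b*x + c = 0, sorted ascending."""
    d = b * b - 4 * a * c
    if d < 0:
        return []
    return sorted([(-b - math.sqrt(d)) / (2 * a), (-b + math.sqrt(d)) / (2 * a)])

def differentially_equivalent(candidate, reference, trials=10_000, tol=1e-6):
    """Return True if candidate and reference agree on randomly sampled inputs."""
    for _ in range(trials):
        a = random.uniform(-100, 100) or 1.0  # avoid a == 0 (degenerate quadratic)
        b = random.uniform(-100, 100)
        c = random.uniform(-100, 100)
        expected = reference(a, b, c)
        actual = sorted(candidate(a, b, c))
        if len(expected) != len(actual):
            return False
        if any(abs(e - r) > tol for e, r in zip(expected, actual)):
            return False
    return True
```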
They define the syntactic robustness degree as the fraction of mutated prompts for which the generated code is semantically equivalent to the reference implementation.
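In code, the metric amounts to a simple ratio over the mutated prompts; the names below (one candidate function per mutated prompt) are assumed for illustration and reuse the fuzzing check sketched above.

```python
# Robustness degree: fraction of mutated prompts whose generated code passes the
# differential-fuzzing equivalence check. Names are assumed for illustration.
def syntactic_robustness_degree(generated_solvers, reference):
    equivalent = sum(
        1 for candidate in generated_solvers
        if differentially_equivalent(candidate, reference)
    )
    return equivalent / len(generated_solvers)
```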
The experimental results show that the syntactic robustness degree decreases as the syntactic distance (number of mutations) increases, indicating that both GPT-3.5 and GPT-4 are not syntactically robust.
To improve syntactic robustness, the authors propose a prompt pre-processing step that uses a set of reduction rules to simplify the mathematical formulas in the prompts without changing their semantics. Their experiments show that this approach can achieve 100% syntactic robustness for both GPT-3.5 and GPT-4.
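A minimal sketch of the idea, using sympy as a stand-in for the paper's reduction rules: the formula is rewritten into one canonical simplified form before it is placed in the prompt, so that syntactic variants collapse to the same prompt text.

```python
# Sketch of formula pre-processing: normalize the equation to a canonical form
# before building the prompt. sympy stands in for the paper's reduction rules;
# the prompt wording below is an illustrative assumption.
import sympy as sp

def canonicalize(equation):
    # Move every term to the left-hand side and simplify, so that e.g.
    # a*x + b = c and a*x = c - b both become a*x + b - c = 0.
    lhs = sp.simplify(sp.expand(equation.lhs - equation.rhs))
    return sp.Eq(lhs, 0)

def build_prompt(equation):
    canonical = canonicalize(equation)
    return f"Write a function that solves {canonical.lhs} = 0 for x."
```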
Source: Laboni Sarke... et al., arxiv.org, 04-03-2024, https://arxiv.org/pdf/2404.01535.pdf