The paper investigates multi-lingual bias in current large code models (LCMs) for the task of code generation. The authors first construct a multi-lingual benchmark, X-HumanEval-X, to systematically evaluate the extent of this bias across nine popular LCMs.
The experiments reveal two key findings regarding the multi-lingual bias in LCMs:
Multi-natural language (multi-NL) bias: When provided with instructions in Chinese, the average Pass@1 rate of LCMs decreases by at least 13% compared to English instructions.
Multi-programming language (multi-PL) bias: The performance of LCMs varies significantly across different programming languages, with the gap between Python and C++ reaching as high as 20.9%.
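Both findings are reported in terms of Pass@1, the fraction of problems solved by a single sampled generation. The standard unbiased pass@k estimator (as used with HumanEval-style benchmarks) can be sketched as follows:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total generations sampled, c: generations that pass the tests,
    k: evaluation budget. Returns the probability that at least one
    of k samples drawn (without replacement) from the n is correct.
    """
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct ones, pass@1 reduces to c/n:
print(pass_at_k(10, 3, 1))  # → 0.3
```

For k = 1 the estimator is simply c/n, so the Pass@1 gaps cited above compare average per-problem solve rates directly.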
To mitigate the observed biases, the authors explore two approaches:
Prompting strategies: Translating Chinese instructions into English using one-step or multi-step translation can reduce the multi-NL bias from 17.2% to as low as 3.8%. However, self-translation by the LCMs themselves leads to a drastic 62.3% decrease in performance.
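The one-step translation strategy can be sketched as a translate-then-generate pipeline. The `translate` and `generate` callables below are hypothetical stand-ins (the paper does not prescribe a specific API); the point is that translation happens before the code model sees the prompt, rather than asking the LCM to self-translate.

```python
def translate_then_generate(instruction_zh: str, translate, generate) -> str:
    """One-step translation prompting (a sketch, assuming hypothetical
    `translate` and `generate` callables): render the Chinese
    instruction into English, then prompt the code model with it."""
    instruction_en = translate(instruction_zh, source="zh", target="en")
    return generate(instruction_en)

# Toy stand-ins for illustration only; real use would call an MT
# system and a code model here.
fake_translate = lambda text, source, target: f"[en] {text}"
fake_generate = lambda prompt: f"# solution for: {prompt}"
print(translate_then_generate("写一个排序函数", fake_translate, fake_generate))
```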
Instruction tuning: The authors construct a multi-lingual dataset, Multi-EvolInstruct-Code (MEIC), containing instructions and solutions in two natural languages (English and Chinese) and over 20 programming languages. Instruction tuning with MEIC substantially reduces the multi-NL bias by up to 84% and the multi-PL bias by up to 40%, while also enhancing the overall code generation performance by 31%-46%.
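A MEIC-style training example might look like the record below. The field names and prompt template are assumptions for illustration (the summary does not give the dataset's exact schema); what matters is that each record pairs a natural language, a target programming language, an instruction, and a reference solution for supervised fine-tuning.

```python
# Hypothetical MEIC-style record: instruction and solution tagged with
# a natural language (en/zh) and one of 20+ programming languages.
record = {
    "nl": "en",                      # natural language of the instruction
    "pl": "python",                  # target programming language
    "instruction": "Write a function that reverses a string.",
    "solution": "def reverse(s):\n    return s[::-1]",
}

def format_example(rec: dict) -> str:
    """Render one record into a prompt/response pair for fine-tuning."""
    return (f"### Instruction ({rec['nl']}, target: {rec['pl']}):\n"
            f"{rec['instruction']}\n"
            f"### Response:\n{rec['solution']}")

print(format_example(record))
```

Mixing both natural languages and many programming languages in one tuning corpus is what lets a single fine-tuning run attack the multi-NL and multi-PL gaps at once.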
The findings provide valuable insights for researchers and developers aiming to mitigate the multi-lingual bias and improve the code generation capabilities of large code models.
Key insights extracted from the paper by Chaozheng Wa... on arxiv.org, 05-01-2024: https://arxiv.org/pdf/2404.19368.pdf