The paper investigates the multi-lingual bias that exists in current large code models (LCMs) for the task of code generation. The authors first construct a multi-lingual benchmark, X-HumanEval-X, to systematically evaluate the extent of multi-lingual bias in nine popular LCMs.
The experiments reveal two key findings regarding the multi-lingual bias in LCMs:
Multi-natural language (multi-NL) bias: When provided with instructions in Chinese, the average Pass@1 rate of LCMs decreases by at least 13% compared to English instructions.
Multi-programming language (multi-PL) bias: The performance of LCMs varies significantly across different programming languages, with the gap between Python and C++ reaching as high as 20.9%.
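Both findings are stated in terms of Pass@1. As a point of reference, the sketch below shows the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks; the per-language numbers are made up for illustration and are not the paper's measurements.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k
    # samples drawn from n generations (c of which are correct) passes
    # all unit tests. For k = 1 this reduces to c / n.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-language scores illustrating a multi-PL gap;
# the values are assumptions, not results from the paper.
python_pass1 = pass_at_k(10, 6, 1)   # 6 of 10 samples correct
cpp_pass1 = pass_at_k(10, 4, 1)      # 4 of 10 samples correct
multi_pl_gap = python_pass1 - cpp_pass1
```

Comparing such per-language scores across a shared task set is how a multi-PL gap like the reported 20.9% would be quantified.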
To mitigate the observed biases, the authors explore two approaches:
Prompting strategies: Translating Chinese instructions into English using one-step or multi-step translation can reduce the multi-NL bias from 17.2% to as low as 3.8%. However, self-translation by the LCMs themselves leads to a drastic 62.3% decrease in performance.
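The one-step translation strategy above can be sketched as a simple pipeline: translate first with an external tool, then prompt the code model with the English text. `translator` and `code_model` here are hypothetical callables standing in for real APIs, which the summary does not name.

```python
def translate_then_generate(instruction_zh, translator, code_model):
    # One-step translation: convert the Chinese instruction to English
    # with an external translator, then prompt the code model with the
    # translated text instead of the original.
    instruction_en = translator(instruction_zh)
    return code_model(instruction_en)

# Toy stand-ins to show the control flow only.
toy_translator = lambda text: "Return the sum of a list of integers."
toy_model = lambda prompt: "def solve(xs):\n    return sum(xs)"

generated = translate_then_generate(
    "返回整数列表的总和。", toy_translator, toy_model
)
```

The key design point is that translation happens outside the code model; the self-translation variant, where the model translates its own input, is what the paper found to degrade performance sharply.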
Instruction tuning: The authors construct a multi-lingual dataset, Multi-EvolInstruct-Code (MEIC), containing instructions and solutions in two natural languages (English and Chinese) and over 20 programming languages. Instruction tuning with MEIC substantially reduces the multi-NL bias by up to 84% and the multi-PL bias by up to 40%, while also enhancing the overall code generation performance by 31%-46%.
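A multi-lingual instruction-tuning pair of the kind MEIC contains might look like the record below. The summary only states that the dataset spans two natural languages and over 20 programming languages; the field names and example content here are assumptions, not the actual MEIC schema.

```python
# Hypothetical layout of one training record; field names are
# illustrative assumptions, not the published MEIC format.
record = {
    "natural_language": "en",
    "programming_language": "python",
    "instruction": "Write a function fib(n) returning the n-th Fibonacci number.",
    "solution": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a"
    ),
}

# Sanity-check that the reference solution actually runs.
namespace = {}
exec(record["solution"], namespace)
```

Pairing the same task across both natural languages and many programming languages is what lets tuning target the two biases at once.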
The findings provide valuable insights for researchers and developers aiming to mitigate the multi-lingual bias and improve the code generation capabilities of large code models.
Key insights distilled from the source by Chaozheng Wa... at arxiv.org, 05-01-2024
https://arxiv.org/pdf/2404.19368.pdf