Key Concept
Compiler intermediate representations (IRs) can enhance the multilingual capabilities of Code-LMs, leading to improved code generation.
Abstract
Code understanding and generation applications of language models are popular.
Research on multilingual aspects of Code-LMs is limited.
Leveraging compiler intermediate representations can improve multilingual capabilities.
SLTrans dataset created to train Code-LMs on paired source code and compiler IR.
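To make the source/IR pairing concrete, here is a minimal sketch of how one might compile a snippet to LLVM IR and format a paired training example. The `compile_to_llvm_ir` helper, the `<source>`/`<ir>` separator tokens, and the record layout are illustrative assumptions, not the actual SLTrans pipeline or format.

```python
import os
import shutil
import subprocess
import tempfile
from typing import Optional


def compile_to_llvm_ir(c_source: str) -> Optional[str]:
    """Compile a C snippet to textual LLVM IR via clang, if available.

    Returns None when clang is not installed, so callers can skip
    the example. (Hypothetical helper; SLTrans's actual pipeline
    may compile differently.)
    """
    if shutil.which("clang") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "snippet.c")
        with open(src, "w") as f:
            f.write(c_source)
        result = subprocess.run(
            ["clang", "-S", "-emit-llvm", "-o", "-", src],
            capture_output=True, text=True, check=True,
        )
        return result.stdout


def make_training_pair(source: str, ir: str) -> str:
    """Join a source snippet and its IR into one training record.

    The separator tokens below are assumed for illustration;
    real IR-grounding setups define their own formatting.
    """
    return f"<source>\n{source}\n<ir>\n{ir}"


c_code = "int add(int a, int b) { return a + b; }"
ir_text = compile_to_llvm_ir(c_code) or "; clang unavailable - IR omitted"
record = make_training_pair(c_code, ir_text)
print(record.splitlines()[0])  # first line of the paired record
```

Continued pre-training would then consume such paired records so the model learns to align surface syntax across languages with a shared, language-neutral representation.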
Continued training on IR-grounded data shows significant gains in code generation tasks.
IR grounding improves prompt robustness, multilingual code completion, code understanding, and instruction following.
IRCoder models outperform base models in multilingual tasks.
Limitations include variations in IR dialects and constraints on model application.
Ethical risks include potential for generating malicious code.
Statistics
The SLTrans dataset spans diverse programming languages and contains nearly 4 million training examples.
Continued training on IR-grounded data yields significant gains on code generation tasks.
Quotes
"Most mainstream Code-LMs have been pre-trained on source code files alone."
"Compiler intermediate representations can improve the multilingual capabilities of Code-LMs."