IRCoder investigates the use of compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs. The study builds SLTrans, a parallel dataset pairing source code with its corresponding IR, and demonstrates substantial gains in prompt robustness, multilingual code completion, code understanding, and instruction following. By grounding heterogeneous source languages in a shared IR, IRCoder achieves improvements across many different programming languages.
The paper highlights the challenges mainstream Code-LMs face due to skewed language distributions in training corpora and rapid shifts in programming language popularity. It argues for leveraging IR as an interlingua that grounds source code understanding across diverse languages, and shows that continued pre-training on paired source-IR data significantly improves Code-LM performance on a variety of tasks.
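To make "paired source-IR data" concrete, here is a minimal sketch of how a self-contained source snippet might be lowered to textual LLVM IR with clang to form one training pair. The helper name, the optimization level, and the pair layout are illustrative assumptions, not the authors' SLTrans pipeline; it assumes clang is installed and on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path

def source_to_llvm_ir(source: str, suffix: str = ".c") -> str:
    """Compile a source snippet to human-readable LLVM IR with clang.

    The snippet must be self-contained and compilable; a dataset
    pipeline along these lines would keep only units that compile.
    (Illustrative sketch, not the IRCoder release.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"unit{suffix}"
        src.write_text(source)
        out = Path(tmp) / "unit.ll"
        # -S -emit-llvm produces textual IR instead of an object file;
        # the optimization level here is an arbitrary choice.
        subprocess.run(
            ["clang", "-S", "-emit-llvm", "-O1", str(src), "-o", str(out)],
            check=True,
            capture_output=True,
        )
        return out.read_text()

# One source-IR training pair: the original code plus its IR "translation".
c_snippet = "int add(int a, int b) { return a + b; }"
pair = {"source": c_snippet, "ir": source_to_llvm_ir(c_snippet)}
```

Because the IR is shared across source languages, the same kind of pair can be produced for any language with an LLVM frontend, which is what makes it usable as an interlingua during continued pre-training.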
Furthermore, IRCoder's experiments show that grounding in IR improves robustness to prompt perturbations and strengthens multilingual code understanding. The gains are consistent across a wide range of tasks and programming languages, underscoring the potential of IR for cross-lingual transfer and alignment.
Overall, leveraging intermediate representations for multilingual code generation proves effective at improving both the performance and the robustness of Code-LMs.