HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
Basic Concepts
HumanEval-XL introduces a comprehensive benchmark for multilingual code generation, addressing the gap in evaluating the cross-lingual natural language (NL) generalization of LLMs.
Summary
Abstract:
Large language models (LLMs) have shown progress in generating code from textual prompts.
Existing benchmarks focus on English-centric code generation, leaving a gap in evaluating multilingual NL-to-code generation.
HumanEval-XL connects 23 NLs and 12 programming languages (PLs) with 22,080 prompts for multilingual LLM evaluation.
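The prompt count follows from the parallel design: 23 NLs × 12 PLs × 80 parallel problems per language pair = 22,080 prompts. A minimal sketch of how such a parallel benchmark could be traversed, assuming a hypothetical record layout (the field names below are illustrative, not the dataset's actual schema):

from collections import defaultdict

# Hypothetical entry in a parallel, multilingual benchmark:
# 23 natural languages x 12 programming languages x 80 problems = 22,080 prompts.
record = {
    "task_id": "HumanEval-XL/0",  # shared across all NL/PL variants of a problem
    "nl": "ru",                   # natural language of the prompt/docstring
    "pl": "python",               # target programming language
    "prompt": "...",              # function signature plus NL description
    "tests": "...",               # unit tests used for execution-based scoring
}

def group_by_language_pair(records):
    """Bucket entries by (NL, PL) so each language pair can be scored separately."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["nl"], r["pl"])].append(r)
    return buckets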
Introduction:
LLMs have advanced in code generation, but existing benchmarks lack evaluation across natural languages.
HumanEval-XL pioneers a massively multilingual benchmark for comprehensive assessment of cross-lingual NL-to-code generation.
Related Work:
Previous benchmarks concentrated on English-centric Python generation.
HumanEval-XL surpasses existing benchmarks by connecting multiple NLs and PLs.
HumanEval-XL:
Design Principles:
Task Complexity: Focus on challenging code generation tasks.
Language Diversity: Incorporate diverse NLs and PLs for unbiased comparisons.
Accessibility: Use data with permissive licenses for research purposes.
Dataset Construction:
Iterative process using GPT-4 to create a robust benchmark across 23 NLs and 12 PLs.
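A hedged sketch of what an iterative, GPT-4-assisted translation-and-verification loop could look like; this is not the authors' exact procedure, and the translator callables, similarity check, and threshold are assumptions for illustration:

from difflib import SequenceMatcher

def iterative_translate(prompt_en, target_nl, translate, back_translate,
                        threshold=0.9, max_rounds=3):
    """Translate an English prompt into target_nl and accept the result only if
    an automatic back-translation check passes; otherwise retry.

    `translate` and `back_translate` are caller-supplied functions (e.g. thin
    wrappers around a GPT-4 API call); the similarity test is illustrative.
    """
    for _ in range(max_rounds):
        candidate = translate(prompt_en, target_nl)
        recovered = back_translate(candidate, "en")
        if SequenceMatcher(None, prompt_en, recovered).ratio() >= threshold:
            return candidate
    return None  # flagged for regeneration or manual review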
Experiments:
Experimental Setup:
Top-p (nucleus) sampling with consistent decoding parameters across all evaluated models.
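As a concrete illustration of such a setup, the snippet below configures top-p (nucleus) sampling for an open code model via Hugging Face transformers; the model name and parameter values are placeholders, not the paper's reported configuration:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoder2-3b"  # illustrative choice of a causal code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,      # sampling rather than greedy decoding
    top_p=0.95,          # nucleus sampling: keep the smallest token set with mass >= 0.95
    temperature=0.2,     # assumed low temperature, typical for pass@1-style evaluation
    max_new_tokens=256,
)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)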
Results:
GPT-4 outperforms other models consistently across different PLs and NLs.
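Execution-based benchmarks in the HumanEval family are typically compared with the unbiased pass@k estimator of Chen et al. (2021); assuming that standard protocol (the summary above does not name the metric), per-problem scoring reduces to:

from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: with n sampled completions of which c pass the
    unit tests, estimate P(at least one of k samples is correct)
    as 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 of them passing the tests
print(pass_at_k(200, 30, 1))   # 0.15
print(pass_at_k(200, 30, 10))  # ~0.81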