The paper examines how well state-of-the-art large language models (LLMs) can generate parallel code. The authors build a benchmark called ParEval, consisting of 420 prompts that cover 12 computational problem types across 7 execution models, evaluate several open-source and closed-source LLMs on it, and introduce new metrics for assessing both the correctness and the runtime performance of the generated parallel code.
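For reference, the pass@1 scores cited below follow the standard pass@k estimator used for code-generation benchmarks: sample n completions per prompt, count the c that are functionally correct, and take the unbiased estimate

```latex
\mathrm{pass@}k \;=\; \mathbb{E}_{\text{prompts}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
```

The paper's new performance metrics apply the same sampled-completion setup to measured speedup and efficiency rather than a pass/fail indicator; the exact definitions are given in the paper itself.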
The key findings are:
Closed-source models like GPT-3.5 and GPT-4 outperform open-source models like CodeLlama and StarCoderBase on parallel code generation, achieving pass@1 scores of 39.6 and 37.8 respectively, compared to 10.2-32.0 for the open-source models.
The LLMs perform best on simple, structured parallel patterns such as transform and reduction (a representative reduction kernel is sketched after this list), but struggle with more complex parallel algorithms and sparse or unstructured problems such as sparse linear algebra, sorting, and FFT.
The parallel code generated by the LLMs achieves poor speedup and parallel efficiency, and the models that produce the most correct parallel code are not necessarily the ones that produce the fastest code (the sketch after this list also shows how speedup and efficiency are typically measured).
Providing an LLM with a correct implementation in one execution model can improve its ability to generate correct code in another execution model, particularly for the smaller open-source models (an example of such a cross-model pair is sketched at the end of this summary).
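To make the "reduction" pattern and the speedup/efficiency measurements concrete, here is a minimal OpenMP sketch in the spirit of the benchmark's structured problems. It is illustrative only: the function name, problem size, and timing harness are assumptions for this example and are not taken from ParEval.

```cpp
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>
#include <omp.h>

// Illustrative reduction kernel (not an actual ParEval prompt or solution):
// sum the elements of x with an OpenMP parallel reduction.
double parallelSum(const std::vector<double> &x) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (std::size_t i = 0; i < x.size(); ++i) {
        total += x[i];
    }
    return total;
}

int main() {
    std::vector<double> x(1 << 24, 1.0);

    // Serial baseline.
    double t0 = omp_get_wtime();
    double serial = std::accumulate(x.begin(), x.end(), 0.0);
    double tSerial = omp_get_wtime() - t0;

    // Parallel version.
    t0 = omp_get_wtime();
    double parallel = parallelSum(x);
    double tParallel = omp_get_wtime() - t0;

    // Speedup = serial time / parallel time; efficiency = speedup / threads.
    int threads = omp_get_max_threads();
    double speedup = tSerial / tParallel;
    std::cout << "results match: " << (serial == parallel) << "\n"
              << "speedup:       " << speedup << "\n"
              << "efficiency:    " << speedup / threads << "\n";
    return 0;
}
```

Compiled with `-fopenmp`, this is the kind of regular, structured kernel the paper finds the models handle most reliably; the speedup and efficiency printed at the end use the standard definitions (serial time over parallel time, and speedup per thread) that underlie the performance findings above.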
Overall, the results suggest that while LLMs show promise for code generation, they still struggle significantly with generating correct and efficient parallel code, especially for more complex parallel algorithms and unstructured problems.
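To illustrate what generating code "in another execution model" means for the same computation, below is an MPI counterpart of the reduction sketched above; again a hypothetical example rather than a ParEval sample. Pairing a correct implementation like the OpenMP one with a request for a version like this is the kind of cross-model prompt for which the paper reports improvements.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>
#include <mpi.h>

// Illustrative MPI version of the same reduction (not a ParEval sample):
// each rank sums its local portion, then partial sums are combined
// across ranks with MPI_Allreduce.
double parallelSum(const std::vector<double> &localX, MPI_Comm comm) {
    double localTotal = 0.0;
    for (std::size_t i = 0; i < localX.size(); ++i) {
        localTotal += localX[i];
    }
    double globalTotal = 0.0;
    MPI_Allreduce(&localTotal, &globalTotal, 1, MPI_DOUBLE, MPI_SUM, comm);
    return globalTotal;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Each rank holds an equal share of the data (all 1.0 for simplicity).
    std::vector<double> localX(1 << 20, 1.0);
    double total = parallelSum(localX, MPI_COMM_WORLD);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        std::printf("global sum: %f\n", total);
    }
    MPI_Finalize();
    return 0;
}
```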
Source: arxiv.org