Core Concepts
Large language models can be adapted to optimize program performance automatically through techniques such as retrieval-based few-shot prompting, performance-conditioned generation, and synthetic data augmentation.
Abstract
The paper introduces a novel benchmark called Performance Improving Edits (PIE) to enable the evaluation of large language models (LLMs) for program optimization. The PIE dataset consists of over 77,000 pairs of C++ programs where one program is a performance-improving edit of the other, along with extensive unit tests and execution time annotations obtained using the gem5 CPU simulator.
The authors evaluate a variety of prompting and fine-tuning strategies for adapting pre-trained LLMs like CODELLAMA and GPT-3.5 to optimize program performance:
Prompting approaches, including instruction-only, few-shot, and chain-of-thought prompting, show limited effectiveness without leveraging the PIE dataset.
Retrieval-based few-shot prompting, where relevant optimization examples are dynamically retrieved from the training set, significantly improves performance.
Fine-tuning strategies, such as using a smaller high-quality subset of the PIE dataset, performance-conditioned generation, and synthetic data augmentation via self-play, further boost optimization capabilities.
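The retrieval-based few-shot strategy above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a simple token-set Jaccard similarity as a stand-in retriever (the actual system may use a learned code-embedding retriever), and the prompt template is hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a simple stand-in for a real code retriever."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_fewshot_prompt(query_src: str, train_pairs: list[tuple[str, str]], k: int = 2) -> str:
    """Retrieve the k training pairs whose slow program is most similar
    to the query, then format them as few-shot (slow, fast) examples
    followed by the query program to be optimized."""
    ranked = sorted(train_pairs, key=lambda p: jaccard(query_src, p[0]), reverse=True)
    parts = []
    for slow, fast in ranked[:k]:
        parts.append(f"# slower version:\n{slow}\n# optimized version:\n{fast}\n")
    parts.append(f"# slower version:\n{query_src}\n# optimized version:\n")
    return "\n".join(parts)
```

The key idea is that examples are chosen per query rather than fixed, so the model sees optimizations applied to structurally similar code.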
The best-performing model, a fine-tuned version of GPT-3.5 augmented with synthetic data, achieves an average speedup of 6.86x on the test set, outperforming the fastest human solutions (4.06x average speedup). The authors also provide a detailed analysis of the types of optimizations performed by the models, including algorithmic changes, input/output optimizations, and data structure modifications.
Stats
The fastest human solutions achieve an average speedup of 4.06x.
The fine-tuned GPT-3.5 model augmented with synthetic data achieves an average speedup of 6.86x.
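These statistics can be reproduced from per-program execution times (e.g., gem5 measurements) as sketched below. This is an assumption about the aggregation: the paper's exact protocol (e.g., best-of-k sampling, handling of failed generations) may differ.

```python
def speedup(old_time: float, new_time: float) -> float:
    """Speedup of the edited program over the original (times in seconds)."""
    return old_time / new_time

def summarize(pairs: list[tuple[float, float]], threshold: float = 1.10):
    """Average speedup over all (old, new) time pairs, plus the fraction
    of programs sped up by at least `threshold` (1.10 = the 10% bar)."""
    ups = [speedup(o, n) for o, n in pairs]
    avg = sum(ups) / len(ups)
    frac = sum(s >= threshold for s in ups) / len(ups)
    return avg, frac
```

For example, the "87.68% of the test set by at least 10%" figure corresponds to the fraction of programs with speedup >= 1.10 under this definition.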
Quotes
"With the waning of Moore's law, optimizing program performance has become a major focus of software research."
"To address this challenge, we measure program performance using the gem5 (Binkert et al., 2011) full system detailed microarchitectural simulator of state-of-the-art processors."
"Our best model, GPT-3.5 augmented with synthetic data obtained from self-play, achieves an average speedup of 6.86×, and optimizes 87.68% of the test set by at least 10%."