Core Concepts
Large language models (LLMs) are used to automatically generate tests for compiler validation, focusing on OpenACC implementations.
1. Abstract:
LLMs such as Codellama, Deepseek Coder, and GPT models are fine-tuned and prompted to generate validation tests for OpenACC compiler implementations.
Various prompt engineering techniques are explored to enhance test generation capabilities.
2. Introduction:
Pre-trained LLMs can be adapted to specific tasks such as code generation.
The study investigates whether LLM-generated tests can validate compiler implementations of OpenACC.
3. Motivation:
Compiler implementations often diverge from the specification due to misinterpretations and ambiguities, necessitating validation tests.
The complexity of compiler features requires a systematic approach to test generation.
4. Overview of LLMs:
LLMs are pre-trained on large, unlabeled datasets with self-supervised learning and can be fine-tuned for specific tasks like code generation.
Prompt engineering techniques are used to enhance LLM performance in generating tests.
5. Prompt Engineering Techniques:
Various prompt engineering techniques like one-shot prompting and retrieval-augmented generation are employed to improve test generation quality.
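A minimal sketch of what one-shot prompting could look like in this setting: a single worked example test is embedded in the prompt before the new request. The example test, function name, and prompt wording are illustrative assumptions, not taken from the paper.

```python
# Hypothetical one-shot prompt builder for OpenACC test generation.
# The embedded example test and wording are illustrative assumptions.

EXAMPLE_TEST = """\
// Example: validation test for `acc parallel loop`.
// Returns 0 on success, non-zero on failure.
#include <stdio.h>
int main(void) {
    int a[100], err = 0;
    #pragma acc parallel loop
    for (int i = 0; i < 100; i++) a[i] = i * 2;
    for (int i = 0; i < 100; i++)
        if (a[i] != i * 2) err++;
    return err;
}
"""

def build_one_shot_prompt(feature: str) -> str:
    """Build a prompt with one worked example followed by the new request."""
    return (
        "You are generating validation tests for an OpenACC compiler.\n"
        "Each test must return 0 if the feature works and non-zero otherwise.\n\n"
        f"Example test:\n{EXAMPLE_TEST}\n"
        f"Now write a similar C test for the `{feature}` directive."
    )

prompt = build_one_shot_prompt("acc kernels")
```

Retrieval-augmented generation would extend the same idea: instead of a fixed example, relevant snippets (e.g. from the OpenACC specification) are retrieved and inserted into the prompt.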
6. Fine-tuning of LLMs:
Fine-tuning LLMs on domain-specific datasets improves their performance in generating accurate tests.
Fine-tuning involves updating all model parameters for task-specific learning.
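The key point above is that full fine-tuning updates every parameter, unlike parameter-efficient methods (e.g. LoRA) that freeze most weights. A toy pure-Python illustration of "every parameter gets a gradient step", using a made-up two-parameter linear model:

```python
# Toy illustration of full-parameter fine-tuning: every parameter of the
# model receives a gradient update. The model and data are made up.

def finetune_step(params, grads, lr=0.05):
    """One SGD step that updates *all* parameters, not a frozen subset."""
    return {name: params[name] - lr * grads[name] for name in params}

# Fit y = w*x + b to the single point (x=2, y=5) with squared loss.
params = {"w": 0.0, "b": 0.0}
x, y = 2.0, 5.0
for _ in range(200):
    pred = params["w"] * x + params["b"]
    # Gradients of (pred - y)^2 with respect to each parameter.
    grads = {"w": 2 * (pred - y) * x, "b": 2 * (pred - y)}
    params = finetune_step(params, grads)
```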
7. LLM Benchmarks:
Performance of LLMs is evaluated using benchmarks like HumanEval and MBPP for code generation tasks.
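Benchmarks like HumanEval report the pass@k metric: the probability that at least one of k sampled solutions passes the tests. The standard unbiased estimator (from the HumanEval paper, Chen et al., 2021) can be computed as:

```python
# Unbiased pass@k estimator used by code-generation benchmarks such as
# HumanEval: given n samples of which c pass, estimate the probability
# that at least one of k drawn samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 pass, pass@1 is 0.5.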
LLMs are compared based on their performance in generating validation tests for OpenACC.
8. Types of errors:
Generated tests are evaluated and categorized by error type: parsing errors, compile-time failures, and runtime errors.
The intended outcome is a test that exits with code 0 when the tested feature works correctly and non-zero when it does not.
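The error taxonomy above can be sketched as a classifier over the outcomes of each pipeline stage. The function name and the exact exit-code conventions here are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the error taxonomy for LLM-generated tests: classify a test by
# the outcome of each stage. Exit-code conventions are assumptions.

def classify(parsed: bool, compile_rc: int, run_rc):
    """Map stage outcomes to an error category."""
    if not parsed:
        return "parsing error"    # LLM output was not extractable as code
    if compile_rc != 0:
        return "compile failure"  # compiler rejected the generated test
    if run_rc is None:
        return "runtime error"    # crash, timeout, or signal during the run
    if run_rc != 0:
        return "test failure"     # non-zero exit: feature behaved incorrectly
    return "pass"                 # exit code 0: feature validated
```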
9. Three Stages:
The development process is divided into three stages for test generation and evaluation.
Each stage focuses on refining the test generation process and improving the quality of generated tests.
Stats
LLMs like Codellama-34b-Instruct and Deepseek-Coder-33b-Instruct produced the most passing tests.
GPT-4-Turbo showed competitive performance in generating tests for OpenACC compiler validation.
Quotes
"LLMs are pre-trained on large, unlabeled datasets with self-supervised learning."
"Fine-tuning involves updating all of the model’s parameters for domain-specific learning."