Core Concepts
Large language models can generate code documentation that outperforms the original human-written documentation on several quality parameters, and closed-source models perform better than open-source alternatives.
Summary
The study presents a comprehensive comparative analysis of leading large language models (LLMs) in their ability to generate code documentation at different levels of granularity, including inline, function, and class-level. The evaluation employs a rigorous checklist-based system to assess the documentation on parameters such as accuracy, completeness, relevance, understandability, and readability.
The key findings are:
- Except for Starchat, all LLMs consistently outperform the original human-written documentation across various parameters.
- Closed-source models like GPT-3.5, GPT-4, and Bard exhibit superior performance compared to open-source/source-available LLMs like Llama2 and Starchat.
- File-level documentation performs considerably worse on all parameters (except time taken) than inline and function-level documentation.
- Statistical analysis using ANOVA confirms the significant impact of the model choice on completeness, relevance, and time taken for documentation generation.
The study highlights the potential of LLMs in automating and enhancing code documentation, while also identifying areas for further improvement, particularly in file-level documentation and the performance gap between closed-source and open-source models.
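To make the ANOVA finding above more concrete, here is a minimal sketch of a one-way ANOVA F-statistic, which compares the variance between groups (here, models) with the variance within each group. The function name `one_way_anova_f`, the placeholder group names, and the ratings are all hypothetical and chosen for illustration only; they are not the study's data or code.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists; each inner list holds the scores
    that one model received on a single parameter.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical completeness ratings for three placeholder models
# (illustrative numbers only, NOT results from the study).
scores = {
    "model_a": [4, 5, 4, 5],
    "model_b": [3, 4, 3, 4],
    "model_c": [2, 3, 2, 3],
}

f_stat, df_b, df_w = one_way_anova_f(list(scores.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # prints "F(2, 9) = 12.00"
```

A large F value means the differences between model averages are large relative to the spread of scores within each model. Turning F into a p-value requires the F distribution; in practice a library routine such as `scipy.stats.f_oneway` would likely be used instead of this hand-rolled version.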
Statistics
GPT-4 took the longest to generate documentation, followed by Llama2 and then Bard; ChatGPT (GPT-3.5) and Starchat had comparable generation times.
File-level documentation performed worse on all parameters (except time taken) than inline and function-level documentation.
Quotes
"Closed-source models, including GPT-3.5, GPT-4, and Bard, consistently outperform their open-source counterparts, Llama2 and Starchat, across a majority of parameters in our evaluation rubric."
"Additionally, file level documentation had a considerably worse performance across all parameters (except for time taken) as compared to inline and function level documentation."