Core Concepts
The study evaluates the performance of SVM, LSTM, and two ChatGPT models (gpt-3.5-turbo and gpt-4) in classifying software requirements into functional and non-functional categories, finding that the optimal technique depends on the specific requirements class.
Abstract
The study explores the use of AI techniques for automated requirements classification, focusing on the differentiation between Functional Requirements (FR) and Non-Functional Requirements (NFR). It compares the performance of three models: Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and two versions of ChatGPT (gpt-3.5-turbo and gpt-4).
The key findings are:
There is no single best technique for all requirements classifications. The optimal model varies depending on the specific class of requirements (IsFunctional, IsQuality, OnlyFunctional, OnlyQuality).
GPT-3.5 generally outperforms GPT-4, except for the "OnlyFunctional" classification, where GPT-4 Zero-Shot setting shows superior performance.
The few-shot setting is particularly beneficial when the zero-shot performance is significantly low, as it can lead to marked improvements in the GPT-3.5 model's ability to classify "OnlyFunctional" and "OnlyQuality" requirements.
The LSTM model performs best in the "IsQuality" classification, but its effectiveness is limited in generalizing to other datasets, indicating a potential lack of robustness.
The study highlights the importance of evaluating multiple AI techniques and the need for high-quality benchmark datasets to comprehensively assess the capabilities of Large Language Models (LLMs) in Requirements Engineering tasks.
Stats
The datasets used in the study include:
PROMISE: 625 requirements
Dronology: 97 requirements
ReqView: 87 requirements
Leeds Library: 85 requirements
WASP: 62 requirements
Quotes
"There is no single best technique for all requirements classifications. The best technique varies depending on the specific requirement classification."
"GPT-3.5 is generally more effective than GPT-4, except when it comes to 'OnlyFunctional' requirements classification, where GPT-4's higher cost may be justified by its enhanced performance."
"The few-shot setting has been found to be beneficial primarily in scenarios where zero-shot performance is notably weak."