
Evaluating AI Techniques for Automated Requirements Classification: A Comparative Study of SVM, LSTM, and ChatGPT

Core Concepts
The study evaluates the performance of SVM, LSTM, and two ChatGPT models (gpt-3.5-turbo and gpt-4) in classifying software requirements into functional and non-functional categories, finding that the optimal technique depends on the specific requirements class.
The study explores the use of AI techniques for automated requirements classification, focusing on the differentiation between Functional Requirements (FR) and Non-Functional Requirements (NFR). It compares the performance of four model variants: Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and two versions of ChatGPT (gpt-3.5-turbo and gpt-4). The key findings are:

- There is no single best technique for all requirements classifications. The optimal model varies depending on the specific class of requirements (IsFunctional, IsQuality, OnlyFunctional, OnlyQuality).
- GPT-3.5 generally outperforms GPT-4, except for the "OnlyFunctional" classification, where GPT-4 in the zero-shot setting shows superior performance.
- The few-shot setting is particularly beneficial when zero-shot performance is significantly low: it leads to marked improvements in GPT-3.5's ability to classify "OnlyFunctional" and "OnlyQuality" requirements.
- The LSTM model performs best in the "IsQuality" classification, but its effectiveness is limited when generalizing to other datasets, indicating a potential lack of robustness.
- The study highlights the importance of evaluating multiple AI techniques and the need for high-quality benchmark datasets to comprehensively assess the capabilities of Large Language Models (LLMs) in Requirements Engineering tasks.
The datasets used in the study include:

- PROMISE: 625 requirements
- Dronology: 97 requirements
- ReqView: 87 requirements
- Leeds Library: 85 requirements
- WASP: 62 requirements
"There is no single best technique for all requirements classifications. The best technique varies depending on the specific requirement classification."

"GPT-3.5 is generally more effective than GPT-4, except when it comes to 'OnlyFunctional' requirements classification, where GPT-4's higher cost may be justified by its enhanced performance."

"The few-shot setting has been found to be beneficial primarily in scenarios where zero-shot performance is notably weak."
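The zero-shot and few-shot settings discussed above differ only in whether labeled demonstrations are prepended to the classification prompt. The sketch below illustrates that distinction; the label names, instruction wording, and example requirements are hypothetical, not the study's actual prompts.

```python
# Sketch: building zero-shot vs. few-shot classification prompts for a
# ChatGPT-style model. Labels and examples are illustrative placeholders.

def build_prompt(requirement, examples=()):
    """Assemble a classification prompt; non-empty `examples` = few-shot."""
    parts = [
        "Classify the software requirement as 'OnlyFunctional', "
        "'OnlyQuality', or 'Both'. Answer with the label only."
    ]
    for text, label in examples:  # few-shot demonstrations, if any
        parts.append(f"Requirement: {text}\nLabel: {label}")
    parts.append(f"Requirement: {requirement}\nLabel:")
    return "\n\n".join(parts)

# Zero-shot: the bare instruction plus the requirement to classify.
zero_shot = build_prompt("The system shall export reports as PDF.")

# Few-shot: a handful of labeled demonstrations precede the query.
few_shot = build_prompt(
    "The system shall export reports as PDF.",
    examples=[
        ("The UI shall respond within 200 ms.", "OnlyQuality"),
        ("The user shall be able to delete an account.", "OnlyFunctional"),
    ],
)
```

The resulting string would be sent as the model input; only the prompt assembly is shown here, since API details vary by model version.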

Deeper Inquiries

How can the performance of these AI techniques be further improved for requirements classification tasks?

To enhance the performance of AI techniques for requirements classification tasks, several strategies can be implemented:

- Feature Engineering: Improving the quality and relevance of features used in the classification models can significantly impact performance. This can involve extracting more meaningful features from the requirements text, such as syntactic and semantic features, to provide richer information for the models to learn from.
- Ensemble Methods: Combining multiple AI models, such as SVM, LSTM, and ChatGPT, through ensemble methods can leverage the strengths of each model and mitigate their individual weaknesses. Ensemble learning techniques like stacking or boosting can lead to more robust and accurate classification results.
- Fine-tuning Models: Fine-tuning pre-trained language models like ChatGPT on domain-specific requirements data can enhance their understanding of the nuances in requirements classification. This domain adaptation process can improve the models' performance on specific types of requirements.
- Data Augmentation: Increasing the diversity and size of the training data through data augmentation techniques can help AI models generalize better to unseen requirements. Techniques like back-translation, synonym replacement, or adding noise to the text can enrich the training data and improve model performance.
- Hyperparameter Tuning: Optimizing the hyperparameters of the AI models, such as learning rates, batch sizes, or dropout rates, can fine-tune the models for better performance. Grid search or random search methods can be employed to find the optimal hyperparameter configurations.
- Interpretable Models: Utilizing interpretable AI models, such as decision trees or rule-based systems, alongside complex models like LSTM and ChatGPT, can provide insights into the classification process. Interpretable models can help in understanding the reasoning behind the classification decisions and improve trust in the AI systems.
By implementing these strategies, the performance of AI techniques for requirements classification tasks can be further improved, leading to more accurate and reliable results.
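Of the strategies above, the simplest ensemble method to sketch is majority voting over the predictions of the individual models. The model names and labels below are hypothetical stand-ins for the study's SVM, LSTM, and GPT-based classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models (ties broken by
    whichever tied label was counted first)."""
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]

# Hypothetical per-model predictions for one requirement.
preds = {"svm": "IsFunctional", "lstm": "IsQuality", "gpt-3.5": "IsFunctional"}
print(majority_vote(preds))  # IsFunctional
```

A stacking ensemble would instead train a meta-classifier on these per-model outputs rather than voting directly, which can learn that, say, LSTM is more trustworthy on "IsQuality" inputs.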

What are the potential biases and limitations of the datasets used in this study, and how might they impact the generalizability of the findings?

The datasets used in the study exhibit several potential biases and limitations that could impact the generalizability of the findings:

- Imbalanced Classes: The datasets show a significant class imbalance, with certain classes having more instances than others. This imbalance can lead to biased model performance, where the models favor the majority class and struggle with minority classes.
- Limited Diversity: The datasets may lack diversity in the types of requirements included, limiting the models' exposure to a wide range of requirements scenarios. This lack of diversity can hinder the models' ability to generalize to unseen data and real-world requirements.
- Dataset Size: The size of the datasets may not be sufficient to capture the full complexity of requirements classification tasks. Larger datasets with more varied requirements would provide a more comprehensive basis for training and evaluation.
- Annotation Quality: Inaccuracies or inconsistencies in labeling can introduce noise and bias into the training data, misleading the AI models and degrading their performance on unseen requirements.
- Domain Specificity: The datasets may be specific to certain domains or industries. Models trained on domain-specific data may struggle to generalize to requirements from other domains.

Together, these biases and limitations could limit how well the findings transfer to real-world requirements classification tasks.
Addressing these issues through careful dataset curation, augmentation, and validation can help mitigate biases and improve the robustness of the models.
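One standard mitigation for the class imbalance noted above is inverse-frequency class weighting, which up-weights minority-class examples during training. The sketch below uses the common formula weight_c = N / (num_classes * count_c); the label counts are illustrative, not taken from the study's datasets.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: weight_c = N / (k * count_c),
    matching e.g. scikit-learn's class_weight='balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical imbalanced label distribution: 80 FR vs. 20 NFR.
labels = ["FR"] * 80 + ["NFR"] * 20
weights = class_weights(labels)
# FR: 100 / (2 * 80) = 0.625; NFR: 100 / (2 * 20) = 2.5
```

The minority class receives the larger weight, so misclassifying it costs the model proportionally more during training.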

How can the interpretability and explainability of the AI models be enhanced to provide deeper insights into the requirements classification process?

Enhancing the interpretability and explainability of AI models for requirements classification is crucial for gaining deeper insights into the classification process. Here are some strategies to achieve this:

- Feature Importance: Conducting feature importance analysis to identify the most influential features in the classification process can reveal which aspects of the requirements text drive the classification decisions. Techniques like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) can help in understanding the model's decision-making process.
- Attention Mechanisms: For models like LSTM and transformer-based models (e.g., ChatGPT), visualizing attention mechanisms can reveal which parts of the input text are being focused on during classification. Attention maps can highlight the words or phrases that are critical for the model's predictions, offering transparency into the model's inner workings.
- Rule Extraction: Extracting rules or decision paths from complex models like LSTM can create interpretable representations of the classification logic. Rule-based explanations can provide a human-understandable rationale for why a requirement is classified in a certain way.
- Confidence Scores: Incorporating confidence scores or uncertainty estimates into the model predictions can help assess the model's reliability. Calibrating the confidence scores to reflect the model's actual certainty can aid in identifying when the model is likely to be making errors.
- Interactive Visualizations: Developing interactive visualizations or dashboards that allow users to explore the model's predictions can make complex models more accessible and facilitate deeper insights into the requirements classification process.
By implementing these strategies, the interpretability and explainability of AI models for requirements classification can be enhanced, providing stakeholders with deeper insights into how the models make decisions and improving trust and transparency in the classification process.
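The core idea behind model-agnostic explainers like LIME can be illustrated with a toy "leave-one-word-out" perturbation: drop each word and measure how much the classifier's confidence changes. The `score` function below is a hypothetical stand-in for a real model's confidence that a requirement is quality-related; real explainers sample many richer perturbations and fit a local surrogate model instead.

```python
# Hypothetical cue-word "classifier" used only to make the sketch runnable.
CUE_WORDS = {"fast", "secure", "reliable", "usable"}

def score(text):
    """Toy confidence that `text` describes a quality requirement."""
    words = text.lower().split()
    return sum(w in CUE_WORDS for w in words) / max(len(words), 1)

def word_importance(text):
    """Leave-one-word-out: importance = drop in confidence when removed."""
    words = text.split()
    base = score(text)
    return {
        w: base - score(" ".join(words[:i] + words[i + 1:]))
        for i, w in enumerate(words)
    }

imp = word_importance("the system must be fast and secure")
# 'fast' and 'secure' receive the largest positive importance scores
```

Highlighting the highest-importance words next to each classified requirement is one lightweight way to surface these explanations to requirements engineers.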