Existing large language models and deep learning-based automated program repair techniques fix only a small number of real-world Java security vulnerabilities; among the models studied, Codex shows the best fixing capability. Fine-tuning language models on general program repair data can improve their vulnerability-fixing ability, but the fine-tuned models still struggle with many complex vulnerability types.
Fine-tuned large language models, particularly Mistral, show promise for automating code vulnerability repair: they outperform existing methods even under stricter evaluation metrics. The results also highlight the importance of dataset integrity for accurately assessing model performance.