Core Concepts
Large Language Models can effectively assess the realism of driving scenarios, with GPT-3.5 showing the highest robustness.
Abstract
The study evaluates the ability of Large Language Models (LLMs), namely GPT-3.5, Llama2-13B, and Mistral-7B, to assess the realism of driving scenarios. Across 576 tested scenarios, GPT-3.5 exhibited the highest overall robustness, followed by Llama2-13B and then Mistral-7B. The research highlights the potential of incorporating LLMs into autonomous driving testing techniques to generate realistic scenarios efficiently.
Stats
GPT-3.5 achieved a robustness score of 12.59 out of 20.
Llama2-13B achieved a robustness score of 9.48 out of 20.
Mistral-7B achieved a robustness score of 5.60 out of 20.
Quotes
"Large Language Models have great potential in assessing the realism of driving scenarios."
"GPT-3.5 demonstrated the highest robustness compared to other models."
"Mistral-7B consistently performed the worst in assessing scenario realism."