Core Concepts
Large language models lack proper instruction-data separation, leading to potential vulnerabilities and malfunctions. The author proposes a formal measure to quantify this separation and introduces a dataset for evaluation.
Abstract
Large language models face challenges in separating instructions from data, impacting their functionality and security. The study introduces a formal measure and dataset to assess the level of separation in various state-of-the-art models.
Accepted for ICLR 2024 Workshop on Secure and Trustworthy Large Language Models. Egor Zverev, Sahar Abdelnabi, Mario Fritz, Christoph H. Lampert. LLMs lack safety features like instruction-data separation, leading to vulnerabilities. A new measure is proposed to quantify this gap with empirical evaluation on existing models.
Most previous safety work on LLMs focused on "jailbreaks," ignoring the fundamental issue of instruction-data separation. The study introduces a formal definition of this separation and evaluates existing models using a proposed measure.
The ability to separate instructions from data is crucial for the reliable functioning of large language models across various applications. Existing models show low levels of separation according to the proposed measure.
On an architectural level, current LLMs lack a formal separation between passive data and active instructions, posing security risks similar to historical issues like SQL injections in databases.
The study emphasizes the importance of defining desirable properties like instruction-data separation for building reliable systems based on experiences from other computer science domains.
The empirical evaluation reveals that all evaluated LLMs struggle to achieve high levels of instruction-data separation according to the proposed measure.
Stats
Model Separation Score: GPT-4 - 0.225 ± 0.005
Model Separation Score: GPT-3.5 - 0.653 ± 0.006
Quotes
"LLMs lack elementary safety features such as the separation between instructions and data." - Egor Zverev et al.
"We introduce a formal measure to quantify the phenomenon of instruction-data separation." - Egor Zverev et al.
"All evaluated LLMs fail to achieve a high amount of separation according to our measure." - Egor Zverev et al.