
Exploiting Timing Side Channels in LLM Serving Systems to Infer Confidential Prompts and User Requests


Core Concepts
Timing side channels introduced by performance optimization techniques in LLM serving systems can be exploited to infer confidential system prompts and sensitive user requests.
Abstract

The paper presents the first security analysis of performance optimization techniques used by modern LLM systems that serve multiple users or applications simultaneously. It uncovers significant information leakage through unique timing side channels introduced by these techniques.

The key findings are:

  1. Timing side channels arise from sharing the semantic cache and the KV cache across users to reduce inference costs. These shared caches can be exploited to infer proprietary system prompts or sensitive prompts from peer users.

  2. The paper proposes novel attack strategies to exploit these side channels, enabling two attacks: a prompt stealing attack and a peeping neighbor attack.

  3. Experimental validations on open-source projects and popular online LLM services demonstrate the feasibility and effectiveness of the attacks.

  4. Preliminary solutions are proposed to mitigate these risks, such as sharing the KV cache in larger units and anonymizing privacy-related information in user inputs before semantic search (the former is sketched below).

The findings underscore the urgent need to address potential information leakage in LLM serving infrastructures as they become widely deployed.
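
To make the first mitigation concrete, here is a minimal sketch assuming a toy in-memory cache; BlockPrefixCache, BLOCK_SIZE, and all names are illustrative, not the paper's implementation. Sharing KV state only in whole blocks coarsens what a timing probe can learn: at most one signal per BLOCK_SIZE tokens rather than one per token.

```python
BLOCK_SIZE = 32  # tokens per shared unit; illustrative value


class BlockPrefixCache:
    """KV prefix cache shared only at whole-block granularity."""

    def __init__(self):
        # Maps a block-aligned token prefix to its cached KV state
        # (a stand-in object here instead of real KV tensors).
        self._prefixes = {}

    def match_prefix(self, tokens):
        """Number of leading tokens whose KV state can be reused.

        Because only whole blocks are shared, an attacker extending a
        guessed prefix token by token observes a timing change at most
        once per BLOCK_SIZE tokens, not on every token.
        """
        matched = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) not in self._prefixes:
                break
            matched = end
        return matched

    def insert(self, tokens):
        """Cache every complete block-aligned prefix; the ragged tail stays private."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self._prefixes[tuple(tokens[:end])] = object()  # stand-in for KV tensors
```

The obvious trade-off is a lower hit rate: prefixes that match only part of a block are no longer reused, so legitimate requests pay more prefill cost in exchange for coarser leakage.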


Stats
Prefilling a single token takes about 0.045 ms on a cache miss versus 0.35 µs on a cache hit for a Llama-7B model running on an A100 GPU. The timing difference is more pronounced on larger models: for Llama-2-70B-Chat-GPTQ, prefilling one token takes about 0.45 ms on a miss and 0.22 µs on a hit.
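
That per-token gap is what the attacks measure end to end, via time to first token (TTFT). Below is a minimal sketch of such a probe, assuming a hypothetical OpenAI-compatible streaming endpoint; the URL, model name, and threshold are placeholders, not from the paper.

```python
import time

import requests  # assumed HTTP client; any streaming-capable client works

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical serving endpoint


def time_to_first_token(prompt: str) -> float:
    """Wall-clock seconds until the first streamed chunk arrives.

    A prefix that hits the shared KV cache skips part of prefilling,
    which shows up as a lower time-to-first-token.
    """
    start = time.perf_counter()
    with requests.post(
        ENDPOINT,
        json={"model": "llama-2-7b", "prompt": prompt, "max_tokens": 1, "stream": True},
        stream=True,
        timeout=30,
    ) as resp:
        for _ in resp.iter_lines():
            return time.perf_counter() - start  # first chunk received
    return float("inf")  # no response at all


def looks_cached(prompt: str, baseline_s: float, trials: int = 20) -> bool:
    """Compare the median TTFT of a candidate prefix against a known-cold baseline."""
    samples = sorted(time_to_first_token(prompt) for _ in range(trials))
    return samples[trials // 2] < 0.8 * baseline_s  # illustrative threshold
```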
Quotes
"The wide deployment of Large Language Models (LLMs) has given rise to strong demands for optimizing their inference performance. Today's techniques serving this purpose primarily focus on reducing latency and improving throughput through algorithmic and hardware enhancements, while largely overlooking their privacy side effects, particularly in a multi-user environment." "Exploiting these side channels can reveal private prompts from other users or applications."

Deeper Inquiries

How can the proposed attacks be extended to target other types of caching mechanisms beyond the KV cache and semantic cache?

The proposed attacks, particularly the prompt stealing attack (PSA) and the peeping neighbor attack (PNA), can be extended to other caching mechanisms by leveraging the same fundamental principles: timing side channels and shared-resource contention.

In systems that rely on disk caching, such as web servers or database management systems, attackers could exploit differences in response times between cached and non-cached data. By sending crafted requests that are likely to hit or miss the cache, an attacker could infer the presence of sensitive data from the observed latencies.

In systems employing object caching (e.g., Redis or Memcached), similar timing attacks apply: an attacker monitors the response times of requests designed to access specific keys. A significantly lower response time suggests the data was cached, revealing the existence of particular data entries.

In shared-memory systems, such as multi-threaded applications, timing attacks could target shared data structures. By measuring the time taken to access shared variables, an attacker could infer their state, potentially exposing sensitive information.
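
To illustrate the object-cache case, here is a minimal sketch of a timing probe against a hypothetical service that uses a look-aside cache (e.g., Redis or Memcached) in front of a database; the URLs and threshold are made up for illustration.

```python
import statistics
import time

import requests  # assumed HTTP client; the service below is hypothetical


def median_latency(url: str, trials: int = 50) -> float:
    """Median response latency in milliseconds over repeated identical requests."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# In a look-aside caching setup, a cache miss adds a database round trip
# that a cache hit avoids. Comparing a candidate resource against a
# known-uncached control reveals whether someone recently requested it.
candidate = median_latency("http://service.example/item/1234")   # hypothetical
control = median_latency("http://service.example/item/999999")   # likely uncached
if candidate < 0.8 * control:  # illustrative threshold
    print("item 1234 appears to be cached, i.e., recently accessed by someone")
```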

What are the potential countermeasures that LLM service providers can implement to mitigate the risks of timing side channels beyond the solutions discussed in the paper?

To mitigate the risks of timing side channels in LLM serving infrastructures, service providers can implement several countermeasures beyond those discussed in the paper:

  1. Randomized response times: Introducing artificial delays or randomization in response times can obscure the timing differences that attackers exploit. Adding a controlled amount of noise makes it harder to discern whether a cache hit or miss occurred (a minimal sketch follows this list).

  2. Cache partitioning: Strictly partitioning caches between users or applications prevents shared access to sensitive data. With a dedicated cache space per user, timing side channels due to shared cache access are largely eliminated.

  3. Access control and rate limiting: Limiting the number of requests a user can make in a given timeframe reduces an attacker's opportunity to probe the system for timing differences.

  4. Cache eviction policies: More aggressive eviction, e.g., a least-recently-used (LRU) policy with shorter retention times, helps ensure that sensitive data is not cached for extended periods.

  5. Monitoring and anomaly detection: Systems that detect unusual access patterns or latencies can identify potential timing attacks in real time, letting providers take proactive measures against ongoing attacks.
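
As a minimal sketch of the first countermeasure, the wrapper below adds uniform random delay to every response. The jitter bound is an assumed tuning parameter, and this is an illustration rather than a vetted defense.

```python
import asyncio
import random

JITTER_MAX_MS = 5.0  # assumed bound; should exceed the hit/miss gap being hidden


async def serve_with_jitter(handle_request, request):
    """Wrap a request handler with a uniform random delay.

    Drawing the delay independently of the cache outcome widens both the
    hit and miss latency distributions until they overlap. An attacker
    averaging many trials can still recover a shifted mean, so padding
    every response to a fixed deadline is a stronger (and costlier)
    variant of the same idea.
    """
    response = await handle_request(request)
    await asyncio.sleep(random.uniform(0.0, JITTER_MAX_MS) / 1000.0)
    return response
```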

How can the insights from this work be applied to improve the security and privacy of other types of shared computing systems beyond LLM serving infrastructures?

The insights from this work apply broadly to shared computing systems, since they emphasize understanding and mitigating timing side channels wherever resources are shared:

  1. Shared cloud services: In multi-tenant cloud environments, the findings highlight the need for robust isolation mechanisms. Strict resource allocation and monitoring can prevent one tenant from inferring another's sensitive information through timing attacks.

  2. Multi-tenant databases: Where multiple applications access shared data, executing queries in a manner that obfuscates timing differences protects sensitive information from inference by malicious users.

  3. Virtualized environments: Where multiple virtual machines (VMs) share physical resources, the insights can inform hypervisor design and resource management strategies, e.g., ensuring that VMs do not share caches or memory.

  4. Collaborative computing platforms: In shared workspaces or development platforms, awareness of timing side channels motivates better access controls and data anonymization techniques, ensuring that sensitive information remains protected.

  5. General software development: Developers can account for timing side channels during design and implementation, including threat modeling and security assessments that target timing-related vulnerabilities.

By applying these insights, organizations can build more secure and privacy-preserving shared computing environments, ultimately reducing the risk of information leakage through timing side channels.