Core Concepts
The authors propose SPA (Side Plugin Adaptation), a lightweight architecture for fast on-device inference and privacy retention that couples pretrained LLMs with additive parameters kept on the device. The approach aims to work within on-device computational constraints while improving cost efficiency.
Abstract
The paper discusses the challenges of deploying large language models (LLMs) on resource-constrained devices and introduces SPA as a solution. SPA keeps adapters separate from the pre-trained model, allowing efficient on-device deployment while preserving model performance. The paper highlights the benefit of using a classifier to choose between the original and the adapted model outputs, reducing latency and enabling user-specific features.
Large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, but deploying them on edge devices is difficult due to memory and compute constraints. The proposed Side Plugin Adaptation (SPA) addresses this by decoupling adapters from the pre-trained model, improving inference speed while retaining performance. By using a classifier to select between the two model outputs, SPA improves efficiency without sacrificing task performance.
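The gist of the classifier-based selection can be sketched as follows. This is a minimal toy illustration, not the paper's actual implementation: the layer sizes, the sigmoid gate, and the function names (`spa_forward`, `w_gate`) are all hypothetical, and the real classifier operates on model hidden states rather than a single linear probe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_side = 16, 4  # hypothetical feature sizes

# Frozen pretrained projection (stands in for the cloud LLM) and a small
# additive side plugin (the on-device adapter), plus a tiny gate classifier.
W_base = rng.standard_normal((d_model, d_model)) * 0.1
W_down = rng.standard_normal((d_model, d_side)) * 0.1  # adapter down-projection
W_up = rng.standard_normal((d_side, d_model)) * 0.1    # adapter up-projection
w_gate = rng.standard_normal(d_model) * 0.1            # classifier weights

def spa_forward(x):
    """Return either the original features or adapter-augmented features."""
    h_base = x @ W_base                          # pretrained branch (frozen)
    h_side = np.tanh(h_base @ W_down) @ W_up     # additive side-plugin branch
    gate = 1.0 / (1.0 + np.exp(-(h_base @ w_gate)))  # classifier score in (0, 1)
    # Choose between the inherent capabilities of the original model and
    # the feature information generated by the adapter.
    h_out = h_base + h_side if gate >= 0.5 else h_base
    return h_out, float(gate)

y, g = spa_forward(rng.standard_normal(d_model))
```

The key design point mirrored here is that the adapter branch is additive and cheap, so the device only ever computes the small projections, while the heavy base computation can stay on the cloud.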
The study evaluates SPA across several datasets and shows it outperforms baseline parameter-efficient approaches such as LST (Ladder Side-Tuning). Comparing different configurations of the side plugin, the paper finds that the parallel setting significantly improves generative task performance. An analysis of training-data size further shows that larger datasets improve the model's generalization ability.
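One plausible reading of the sequential-versus-parallel comparison can be sketched as below. This is an illustrative assumption, not the paper's code: the weight shapes and the function names (`sequential_plugin`, `parallel_plugin`) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # hypothetical feature size
W_layer = rng.standard_normal((d, d)) * 0.1  # frozen base layer
W_adapt = rng.standard_normal((d, d)) * 0.1  # small side adapter

def sequential_plugin(x):
    # Adapter stacked after the frozen layer: it only sees the layer's output.
    h = x @ W_layer
    return h + np.tanh(h @ W_adapt)

def parallel_plugin(x):
    # Adapter runs alongside the frozen layer on the same input, and the
    # two branches are summed -- the setting the summary reports as
    # stronger on generative tasks.
    return x @ W_layer + np.tanh(x @ W_adapt)

x = rng.standard_normal(d)
y_seq, y_par = sequential_plugin(x), parallel_plugin(x)
```

The parallel form also lets the adapter branch be computed concurrently with the base layer rather than waiting on it, which matches the latency motivation.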
Overall, SPA offers an efficient way to deploy large language models on edge devices in collaboration with cloud-based resources, with gains in inference speed, user-specific adaptation, and privacy.
Stats
Model                  XSum   CNN-DM  CoQA   SciQ
LLaMA-7B + one-shot    17.32  22.36   15.32  17.21
LLaMA-7B + LST         28.18  32.15   31.24  23.24
LLaMA-7B + SPA         35.52  39.22   37.30  25.38
Quotes
"Our method establishes an interaction between pretrained LLMs on-cloud and additive parameters on-devices."
"SPA significantly reduces the difficulty of fully deploying large models on the edge."
"The classifier allows us to choose between leveraging the inherent capabilities of the original model or integrating feature information generated by adapters."