Evaluating the Potential and Limitations of Parallel Deep Learning Inference on Heterogeneous Mobile Processors
Key Concepts
Parallel execution of deep learning inference across heterogeneous mobile processors holds potential to accelerate on-device intelligence, but its practical effectiveness is limited by unsupported operators, processor fallbacks, and the need to balance resource utilization with overall system performance.
Summary
This paper presents a comprehensive empirical study to assess the capabilities and challenges associated with parallel deep learning (DL) inference on heterogeneous mobile processors. The key findings are:
- Existing strategies for parallel inference across mobile processors are not adequately effective due to limitations such as unsupported operators and processor fallbacks. There are opportunities for cross-level optimization that integrates frontend and backend compilation techniques.
- Parallel inference across processors is not always beneficial: the granularity of parallel scheduling and the presence of competing processes can significantly affect overall system performance, so maximizing resource utilization for DL inference alone may not yield optimal results in real-world mobile contexts.
- Offline latency profiling is insufficient, as non-stationary dynamics in mobile environments widen the gap between estimated and actual latency; runtime profiling is needed to capture mobile resource dynamics (see the profiling sketch at the end of this summary).
- Exploiting the regularity of mobile DL inference, such as data reuse across frames, can further optimize backend compilation and improve the accuracy-efficiency tradeoff.
The insights from this study can facilitate the development of effective parallel inference strategies that adapt to the inherent diversity and dynamics of the mobile device ecosystem.
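As a concrete illustration of the runtime-profiling finding, the sketch below (not taken from the paper) shows one way an online latency estimator could blend offline-profiled priors with observed runtime latencies via an exponential moving average; the class, method names, and smoothing factor are illustrative assumptions.

```python
import time

class RuntimeLatencyProfiler:
    """Online latency estimator per (operator, processor) pair.

    Blends an offline-profiled prior with runtime observations using an
    exponential moving average, so estimates track resource dynamics
    (thermal throttling, competing processes) that offline profiling misses.
    """

    def __init__(self, offline_profile, alpha=0.3):
        # offline_profile: {(op_name, processor): latency_ms} measured offline
        self.estimates = dict(offline_profile)
        self.alpha = alpha  # weight given to the newest observation

    def record(self, op_name, processor, observed_ms):
        key = (op_name, processor)
        prior = self.estimates.get(key, observed_ms)
        # EMA update: recent runtime conditions dominate the estimate.
        self.estimates[key] = self.alpha * observed_ms + (1 - self.alpha) * prior

    def estimate(self, op_name, processor):
        return self.estimates.get((op_name, processor), float("inf"))

    def timed_run(self, op_name, processor, run_fn):
        # run_fn executes the operator on the given processor and returns its output.
        start = time.perf_counter()
        out = run_fn()
        self.record(op_name, processor, (time.perf_counter() - start) * 1e3)
        return out
```

A scheduler could consult `estimate()` before each placement decision so that partitioning reflects current contention rather than idle-device measurements.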
Original paper: Deep Learning Inference on Heterogeneous Mobile Processors: Potentials and Pitfalls (arxiv.org)
Statistics
For ResNet-50 on the Xiaomi 9, the ratio of unsupported operators during parallel inference across the CPU and GPU reaches approximately 48%.
Parallel inference of Fast Style Transfer on CPU+GPU using μLayer slows down by 1.9× to 3.1× compared to running on the GPU alone, due to scheduling overhead.
With 3 competing processes, the latency on the CPU increases by 10.2×, while the latency on the GPU and DSP increases by only 2.1× and 1.8×, respectively.
Running VGG-16 with Mace on the GPU (using the buffer memory type) caused frame drops exceeding 90% in a concurrently running video playback app.
Quotes
"Existing strategies agnostic to DL models often exhibit overparameterization that can be optimized before DAG conversion."
"Dedicating excessive computational resources to DL inference can impair the functionality of other components, e.g., UI responsiveness in smartphones and AR apps."
"Runtime latency profiling is desired to integrate non-stationary mobile resource dynamics."
Deeper Questions
How can the design space of parallel inference strategies be expanded beyond the current focus on backend compilation-level optimization to better adapt to the diversity and dynamics of mobile environments?
To expand the design space of parallel inference strategies beyond backend compilation-level optimization, it is essential to incorporate frontend compilation optimizations that reduce redundancy in DL models. Techniques such as pruning, low-rank decomposition, and parameter/activation quantization can lower resource demand and improve resource utilization. Integrating these frontend optimizations with backend compilation strategies like operator fusion and parallelism enables a more holistic, cross-level approach: the design space is explored more broadly, critical operations can be prioritized, and less crucial ones managed efficiently. It is also crucial to account for unsupported operators and processor fallbacks at runtime; adaptively redistributing computations based on operator support across heterogeneous processors can improve resource utilization and mitigate processor underutilization.
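To make the operator-support point concrete, here is a minimal Python sketch (not from the paper) of how a partitioner might assign each operator to the most preferred processor that supports it and then merge adjacent assignments into sub-graphs to limit cross-processor transfers; the support table, operator names, and function are hypothetical.

```python
# Hypothetical operator-support table: which backends can execute each op type.
OP_SUPPORT = {
    "conv2d":           {"cpu", "gpu", "dsp"},
    "depthwise_conv2d": {"cpu", "gpu", "dsp"},
    "resize_bilinear":  {"cpu", "gpu"},
    "custom_nms":       {"cpu"},   # unsupported on accelerators -> CPU fallback
}

def partition_by_support(ops, preferred_order=("dsp", "gpu", "cpu")):
    """Assign each operator to the most preferred processor that supports it,
    then merge consecutive ops with the same assignment into sub-graphs to
    limit cross-processor data transfers."""
    assignment = []
    for name, op_type in ops:  # ops: topologically ordered (name, op_type) pairs
        supported = OP_SUPPORT.get(op_type, {"cpu"})
        target = next(p for p in preferred_order if p in supported)
        assignment.append((name, target))

    subgraphs = []
    for name, target in assignment:
        if subgraphs and subgraphs[-1][0] == target:
            subgraphs[-1][1].append(name)
        else:
            subgraphs.append((target, [name]))
    return subgraphs

# Example: the custom NMS op forces a fallback sub-graph on the CPU.
print(partition_by_support([("c1", "conv2d"), ("r1", "resize_bilinear"), ("n1", "custom_nms")]))
```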
What are the potential tradeoffs and design considerations in balancing the resource utilization for deep learning inference and the overall system performance in multi-process mobile applications?
Balancing resource utilization for deep learning inference while maintaining overall system performance in multi-process mobile applications involves several tradeoffs and design considerations. One key consideration is the granularity of parallel scheduling, where the level of parallelism can impact the efficiency of utilizing multiple processors. Coarse-grained parallelism, such as sub-graph parallelism, may lead to idle processor cores, while fine-grained parallelism, like intra-operator parallelism, can result in process suspensions and increased data transmission delays. It is crucial to adapt the parallel scheduling granularity based on the workload and competing processes to optimize system performance.
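The following sketch (illustrative only, not the paper's scheduler) shows how scheduling granularity might be adapted to contention: independent DAG branches run in parallel on different processors when the system is lightly loaded, but execution is serialized onto a single processor once several competing processes are present. The backend dispatch is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subgraph(subgraph, processor, frame):
    # Placeholder for dispatching a sub-graph to a backend runtime; here it
    # simply records where the sub-graph ran so the sketch stays self-contained.
    return {"subgraph": subgraph, "processor": processor, "frame": frame}

def schedule(branches, competing_processes, frame):
    """Pick scheduling granularity based on contention: parallelize independent
    DAG branches across processors when the device is idle, but serialize onto
    one processor under heavy competition to avoid process suspensions and
    cross-processor transfer overhead."""
    if competing_processes >= 3:
        # Contended: sequential execution on a single processor.
        return [run_subgraph(b, "gpu", frame) for b in branches]

    processors = ["gpu", "dsp", "cpu"]
    with ThreadPoolExecutor(max_workers=max(1, len(branches))) as pool:
        futures = [
            pool.submit(run_subgraph, b, processors[i % len(processors)], frame)
            for i, b in enumerate(branches)
        ]
        return [f.result() for f in futures]
```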
Another tradeoff lies in the allocation of computations across processors. While maximizing resource utilization for deep learning inference is important, dedicating excessive computational resources to DL tasks can negatively impact the functionality of other components in the system, such as UI responsiveness. Identifying optimal resource utilization levels based on the dynamic nature of runtime resource availability and competing process demands is essential for enhancing overall system performance without compromising other processes.
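As one hedged illustration of this tradeoff, a simple heuristic could cap the CPU threads handed to the DL runtime based on foreground activity and competing load; the thresholds below are assumptions, not values reported in the paper.

```python
import os

def choose_inference_threads(ui_active, competing_processes, max_threads=None):
    """Cap the CPU threads given to the DL runtime so foreground work
    (UI rendering, other apps) keeps enough headroom.

    - With an active UI/AR session, leave roughly half the cores free.
    - Each competing compute-heavy process further reduces the budget.
    """
    total = max_threads or os.cpu_count() or 4
    budget = total // 2 if ui_active else total
    budget -= competing_processes
    return max(1, budget)

# Example: 8-core phone, UI in the foreground, two competing processes
# -> hand the DL runtime only 2 threads instead of all 8.
print(choose_inference_threads(ui_active=True, competing_processes=2, max_threads=8))
```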
How can the regularity of mobile deep learning inference, such as data reuse across frames, be further leveraged to optimize the accuracy-efficiency tradeoff in parallel execution across heterogeneous processors?
The regularity of mobile deep learning inference, particularly data reuse across frames, presents an opportunity to optimize the accuracy-efficiency tradeoff in parallel execution across heterogeneous processors. By leveraging data reuse at both the frame-level and layer-level, parallel inference can benefit from reduced latency and improved efficiency. Implementing mechanisms to identify when previous intermediate results will be accessed for the final time can help optimize resource allocation and release resources in a timely manner. This approach can lead to a reduction in latency while maintaining acceptable levels of visual quality. Additionally, integrating data reuse strategies into the parallel execution workflow can enhance the overall performance of deep learning inference on mobile devices by minimizing redundant computations and maximizing resource utilization.
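Below is a minimal sketch of frame-level reuse, assuming a heavy backbone and a cheap head (both hypothetical callables): when consecutive frames differ by less than a small threshold, the cached backbone features are reused and only the head is recomputed, trading a bounded accuracy loss for lower latency.

```python
import numpy as np

class FrameReuseRunner:
    """Frame-level data reuse: if consecutive frames are nearly identical,
    reuse the cached backbone features and recompute only the lightweight
    head, reducing latency at a small cost in accuracy."""

    def __init__(self, backbone_fn, head_fn, reuse_threshold=0.02):
        self.backbone_fn = backbone_fn        # heavy feature extractor (hypothetical)
        self.head_fn = head_fn                # cheap task-specific head (hypothetical)
        self.reuse_threshold = reuse_threshold
        self.prev_frame = None
        self.cached_features = None

    def _similar(self, frame):
        if self.prev_frame is None or self.cached_features is None:
            return False
        # Mean absolute pixel difference, assuming 0-255 frames.
        diff = np.mean(np.abs(frame.astype(np.float32) -
                              self.prev_frame.astype(np.float32))) / 255.0
        return diff < self.reuse_threshold

    def run(self, frame):
        if self._similar(frame):
            features = self.cached_features   # reuse: skip the backbone entirely
        else:
            features = self.backbone_fn(frame)
            self.cached_features = features   # refresh the cache on a "fresh" frame
        self.prev_frame = frame
        return self.head_fn(features)
```

The reuse threshold controls the accuracy-efficiency tradeoff: a larger value skips the backbone more often and saves more latency, at the cost of stale features when the scene changes quickly.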