Core Concepts
Deploying black-box models across mobile, edge, and cloud tiers with combinations of partitioning, quantization, and early-exit operators can optimize the trade-off between inference latency and model accuracy.
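As a concrete illustration of the Partitioning operator, an ONNX model can be split into a head that runs on one tier and a tail that runs on another. The following is a minimal sketch using the `onnx` package's model-extraction utility; the model path and the tensor names ("input", "block3_out", "output") are hypothetical and depend on the actual model graph:

```python
import onnx.utils

# Head partition: runs on the mobile tier, from the model input up to a
# hypothetical intermediate tensor chosen as the split point.
onnx.utils.extract_model(
    "model.onnx", "model_head.onnx",
    input_names=["input"], output_names=["block3_out"],
)

# Tail partition: runs on the edge or cloud tier, consuming the
# intermediate tensor transmitted over the network.
onnx.utils.extract_model(
    "model.onnx", "model_tail.onnx",
    input_names=["block3_out"], output_names=["output"],
)
```

At inference time, only the intermediate tensor crosses the network boundary, so the choice of split point determines both the transmitted payload size and how much computation stays on the mobile tier.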
Abstract
The study empirically assesses the accuracy vs. inference-time trade-off of different black-box Edge AI deployment strategies, i.e., combinations of deployment operators (Partitioning, Quantization, Early Exit) and deployment tiers (Mobile, Edge, Cloud).
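As one common black-box realization of the Quantization operator, ONNX Runtime offers dynamic quantization, which converts a model's weights to 8-bit integers without retraining. A minimal sketch, assuming placeholder file names:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are stored as INT8 and activations are
# quantized on the fly at inference time. File names are placeholders.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```

The quantized model is typically smaller and faster on CPU-bound tiers at the cost of some accuracy, which is precisely the trade-off the study measures.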
Directory:
Background
Deep Learning Architecture
Monolithic Edge AI Deployment
Multi-tier Edge AI Partitioning
Early Exiting
Quantization
ONNX Runtime for Inference
Related Work
Partitioning
Early Exiting
Quantization
Approach
Subjects
Study Design
Experimental Setup
Results
RQ1: Impact of single-tier deployment
RQ2: Impact of Quantization operator
RQ3: Impact of Early Exiting operator
RQ4: Impact of Model Partitioning operator
RQ5: Impact of Hybrid operators
The key findings suggest that:
Edge deployment with the hybrid Quantization + Early Exit operator can be preferred over non-hybrid operators when low latency is the priority and a moderate accuracy loss is acceptable (see the sketch after this list).
When minimizing accuracy loss is the priority, MLOps engineers should prefer the Quantization operator alone on the edge tier.
In mobile-constrained scenarios, Partitioning across the mobile and edge tiers is preferable to mobile-only deployment.
For models with small input data, cloud deployment can be a better alternative to Mobile/Edge deployment and Partitioning strategies even when the network is constrained.
For models with large input data, an edge tier with greater network and computational capacity than the Cloud/Mobile tiers can be a more viable option than Partitioning or Mobile/Cloud deployment strategies.
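A minimal sketch of the hybrid Quantization + Early Exit idea behind the first finding, assuming a quantized exit branch and a quantized full model exported as separate single-output ONNX files; the file names, input tensor name, and confidence threshold are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical two-stage pipeline: a cheap quantized exit branch answers
# confident inputs; the rest fall through to the quantized full model.
exit_sess = ort.InferenceSession("exit_branch_int8.onnx",
                                 providers=["CPUExecutionProvider"])
full_sess = ort.InferenceSession("model_int8.onnx",
                                 providers=["CPUExecutionProvider"])

CONF_THRESHOLD = 0.9  # illustrative; tuned against the latency/accuracy budget

def predict(x: np.ndarray) -> int:
    (logits,) = exit_sess.run(None, {"input": x})
    # Numerically stable softmax over the exit branch's logits.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    if probs.max() >= CONF_THRESHOLD:
        return int(probs.argmax())  # early exit: skip the full model
    (logits,) = full_sess.run(None, {"input": x})
    return int(logits.argmax())
```

Inputs that the exit branch classifies confidently never reach the full model, which is where the latency savings at moderate accuracy loss come from.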