The paper presents the design and evaluation of Piper, a hardware accelerator for efficient tabular data preprocessing in machine learning pipelines. Key highlights:
Tabular data preprocessing is a crucial but computationally intensive step in ML training; because it typically runs on CPUs while model training runs on ever-faster GPUs, the widening performance gap between the two often makes preprocessing the major bottleneck.
Piper adopts a column-wise pipelined execution mechanism with specialized hardware processing elements, avoiding the costly synchronization overheads of CPU-based solutions.
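The column-wise idea is easier to see in software terms. Below is a minimal Python sketch (not Piper's hardware, whose processing elements are fixed-function circuits) in which each column owns its own chain of stages and columns never synchronize with one another while streaming; the log-normalize and hash-to-vocabulary transforms are illustrative assumptions in the spirit of DLRM preprocessing.

```python
# Software analogy of column-wise pipelined preprocessing.
# Each column has its own stage chain; no cross-column barriers
# are needed until results are gathered.
import math

def dense_stage(values):
    # Dense features: log(1 + x) normalization, a common DLRM transform.
    for v in values:
        yield math.log1p(max(v, 0.0))

def categorical_stage(tokens, vocab_size):
    # Categorical features: hash raw tokens into a fixed vocabulary,
    # producing embedding-table indices.
    for t in tokens:
        yield hash(t) % vocab_size

def run_column_pipelines(batch, vocab_size=1000):
    # One independent pipeline per column, unlike row-wise CPU loops
    # that must synchronize across all fields of each row.
    outputs = {}
    for name, column in batch.items():
        if name.startswith("dense"):
            outputs[name] = list(dense_stage(column))
        else:
            outputs[name] = list(categorical_stage(column, vocab_size))
    return outputs

batch = {
    "dense_0": [0.0, 3.5, 12.0],
    "cat_0": ["ad_42", "ad_7", "ad_42"],
}
print(run_column_pipelines(batch))
```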
Piper integrates a high-performance parallel UTF-8 decoding unit and leverages high-bandwidth memory (HBM) to achieve high memory throughput.
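The property that makes parallel UTF-8 decoding feasible is that continuation bytes are self-identifying: a byte b continues a code point iff (b & 0xC0) == 0x80, so every character boundary can be found with an independent per-byte test. The sketch below illustrates that generic idea, which SIMD and hardware decoders exploit; it is not a description of Piper's decoding unit.

```python
# Finding UTF-8 code-point boundaries with independent per-byte tests.
# Each test touches only one byte, so in hardware every position can
# be checked in the same cycle.
def codepoint_starts(data: bytes) -> list[int]:
    # A byte starts a code point unless it matches 10xxxxxx.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

s = "héllo, 世界".encode("utf-8")
starts = codepoint_starts(s)
assert len(starts) == len("héllo, 世界")  # one start per code point
print(starts)
```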
Piper can be deployed as a network-attached accelerator, avoiding host-side overhead and enabling streaming processing of datasets larger than the FPGA's memory capacity.
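As a software analogy for that streaming mode, the sketch below consumes input in fixed-size chunks so that only one chunk is resident at a time, which is how an accelerator can handle inputs larger than its on-board memory. The chunk size and the `process` callback are illustrative assumptions, not Piper's interface.

```python
# Streaming a dataset larger than device memory, one chunk at a time.
def stream_process(path, process, chunk_bytes=1 << 20):
    """Yield process(chunk) for fixed-size chunks of the file at path."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)  # only one chunk resident at a time
            if not chunk:
                break
            yield process(chunk)

# e.g. count total bytes without ever loading the whole dataset:
# total = sum(stream_process("clicks.tsv", len))
```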
Evaluation on production DLRM preprocessing pipelines shows that Piper outperforms optimized CPU baselines by 4.7x to 71.3x and GPUs by 4.8x to 20.3x, depending on the input data format and vocabulary size.
Beyond raw performance, Piper's network-attached design offers flexibility, scalability, and seamless integration into future ML systems.
Key insights drawn from the paper by Yu Zhu, Wenq... at arxiv.org, 09-24-2024: https://arxiv.org/pdf/2409.14912.pdf