Optimized Data Placement for Accelerating GEMV Computations in Generative AI with Processing-In-Memory
Optimized data placement is critical to harnessing the full potential of PIM acceleration for the GEMV (general matrix-vector multiplication) computations that dominate Generative AI inference. The proposed PIMnast methodology balances multiple factors to identify data placements that deliver up to 6.86x speedup for GEMVs, which translates into up to 5x end-to-end speedup in per-token latency for Generative AI.
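To make the notion of data placement concrete, the following is a minimal sketch, not the PIMnast methodology itself. It assumes a simple round-robin assignment of matrix rows to memory banks; the names `place_rows`, `banked_gemv`, and `num_banks` are hypothetical and chosen only for illustration. The sketch shows that a placement decides which bank computes which output element, while the GEMV result stays the same.

```python
def gemv(matrix, vector):
    """Reference GEMV: y[i] = sum over j of matrix[i][j] * vector[j]."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def place_rows(num_rows, num_banks):
    """Hypothetical placement: row i is stored in bank i % num_banks."""
    banks = [[] for _ in range(num_banks)]
    for i in range(num_rows):
        banks[i % num_banks].append(i)
    return banks


def banked_gemv(matrix, vector, num_banks):
    """Each bank produces the output elements for the rows it holds."""
    y = [0] * len(matrix)
    for rows in place_rows(len(matrix), num_banks):
        # This loop is sequential here; in a PIM system the banks
        # would be expected to operate on their rows concurrently.
        for i in rows:
            y[i] = sum(a * x for a, x in zip(matrix[i], vector))
    return y


matrix = [[1, 2], [3, 4], [5, 6], [7, 8]]
vector = [1, 1]
# The placement changes where work happens, not the result.
assert banked_gemv(matrix, vector, num_banks=2) == gemv(matrix, vector)
```

Round-robin row placement is only one of many possible layouts; which layout performs best on real PIM hardware depends on factors this sketch does not model.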