RandMScan: accelerating parallel scan via matrix computation and random-jump strategy

RandMScan:通过矩阵计算和随机跳跃策略加速并行扫描

阅读:1

Abstract

Parallel scan is a fundamental primitive widely used in a broad range of workloads, including parallel sorting, graph algorithms, and sampling in large language model inference. Although GPU-optimized parallel scan algorithms have been extensively studied, their reliance on vector units makes them inefficient on modern AI accelerators, which typically lack such units but incorporate abundant processing element (PE) arrays tailored for matrix operations. Existing matrix-based approaches, however, suffer from excessive bandwidth consumption due to auxiliary matrix loading as well as costly inter-block communication, thereby limiting their scalability in practical deployments. In this paper, we propose RandMScan, a two-stage parallel scan framework specifically designed for AI accelerators equipped with PE arrays, to address the aforementioned challenges. The first stage employs an efficient matrix-based local chunk scan algorithm that fully exploits fine-grained parallelism within PE arrays, while the second stage introduces a lightweight Random-Jump strategy to coordinate global aggregation with reduced synchronization overhead. Together, these techniques enable scalable execution over long sequences while effectively mitigating the substantial communication overhead that typically arises in prior solutions. Extensive evaluations on state-of-the-art AI accelerators demonstrate that our method achieves more than 80% speedup compared to existing matrix-based implementations for long input sequences, and can reduce the end-to-end latency by 15%-26% in representative downstream applications of scan.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。