Abstract
Parallel scan is a fundamental primitive widely used in a broad range of workloads, including parallel sorting, graph algorithms, and sampling in large language model inference. Although GPU-optimized parallel scan algorithms have been extensively studied, their reliance on vector units makes them inefficient on modern AI accelerators, which typically lack such units but instead provide abundant processing element (PE) arrays tailored for matrix operations. Existing matrix-based approaches, however, suffer from excessive bandwidth consumption due to auxiliary matrix loading, as well as costly inter-block communication, limiting their scalability in practical deployments. In this paper, we propose RandMScan, a two-stage parallel scan framework designed for AI accelerators equipped with PE arrays, to address these challenges. The first stage employs an efficient matrix-based local chunk scan algorithm that fully exploits fine-grained parallelism within PE arrays, while the second stage introduces a lightweight Random-Jump strategy that coordinates global aggregation with reduced synchronization overhead. Together, these techniques enable scalable execution over long sequences while mitigating the substantial communication overhead of prior solutions. Extensive evaluations on state-of-the-art AI accelerators demonstrate that our method achieves more than an 80% speedup over existing matrix-based implementations on long input sequences, and reduces end-to-end latency by 15%–26% in representative downstream applications of scan.
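To make the two-stage structure described above concrete, the following is a minimal sequential reference sketch of a chunked inclusive scan: local per-chunk scans (Stage 1) followed by a global offset pass (Stage 2). This illustrates only the generic chunked-scan pattern; the function name and chunking scheme are illustrative and do not reproduce RandMScan's matrix-based kernels or its Random-Jump aggregation.

```python
def chunked_inclusive_scan(xs, chunk_size=4):
    """Reference chunked inclusive prefix sum (sequential illustration)."""
    # Stage 1: scan each chunk independently.
    # On a PE-array accelerator, these local scans run in parallel.
    chunks = [xs[i:i + chunk_size] for i in range(0, len(xs), chunk_size)]
    local = []
    for chunk in chunks:
        acc, scanned = 0, []
        for v in chunk:
            acc += v
            scanned.append(acc)
        local.append(scanned)

    # Stage 2: propagate each chunk's running total as an offset
    # to all later chunks (the global aggregation step).
    out, offset = [], 0
    for scanned in local:
        out.extend(v + offset for v in scanned)
        offset += scanned[-1]
    return out
```

For example, `chunked_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8])` yields `[1, 3, 6, 10, 15, 21, 28, 36]`, matching a plain sequential prefix sum.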