As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high-performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, register pressure is reduced through variable reuse via shared memory, and data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance is optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm.
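A minimal sketch of two of the compute-bound strategies the abstract names, register reduction via shared memory and heavier per-thread workloads. This is a hypothetical illustration, not the paper's actual kernel; the kernel name, the `ELEMS_PER_THREAD` factor, and the placeholder computation are all assumptions introduced for the example.

```cuda
#define ELEMS_PER_THREAD 4

__global__ void scaleKernel(const float *in, float *out, int n, float gain)
{
    // Register reduction via shared memory: the block holds one shared
    // copy of a derived constant instead of each thread computing and
    // keeping its own copy in a register.
    __shared__ float coeff;
    if (threadIdx.x == 0)
        coeff = gain * 0.5f;   // placeholder computation
    __syncthreads();

    // Heavier thread workload: each thread processes ELEMS_PER_THREAD
    // elements per launch, so fewer threads (and fewer total register
    // sets) cover the same data.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS_PER_THREAD;
    for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
        int idx = base + i;
        if (idx < n)
            out[idx] = coeff * in[idx];
    }
}
```

Whether the shared-memory trade pays off depends on the kernel's occupancy limits; profiling register usage per thread (e.g. with `nvcc --ptxas-options=-v`) is the usual way to check.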
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.
Authors: Lee Daren, Dinov Ivo, Dong Bin, Gutman Boris, Yanovsky Igor, Toga Arthur W
| Journal: | Computer Methods and Programs in Biomedicine | Impact factor: | 4.800 |
| Year: | 2012 | Issue/pages: | 2012 Jun;106(3):175-87 |
| DOI: | 10.1016/j.cmpb.2010.10.013 | | |
