Abstract
Deep neural networks (DNNs) are computationally intensive and can be optimized in different ways. Compiler-based optimizations for DNNs can match, or even exceed, the performance of manual optimizations, but they usually require prohibitively long tuning times. In this paper, we propose a new method that accelerates the tuning process significantly without performance penalties. In particular, we use a Roofline-like cost model, namely ROFT (Roofline for Fast AutoTune), to evaluate the performance of schedules. The ROFT model can be easily implemented on different microarchitectures, e.g., NVIDIA GPUs and Huawei Ascend NPUs. Based on this cost model, we implement a flexible two-stage search algorithm that significantly reduces tuning time. Experiments on typical DNNs show that the ROFT method speeds up the tuning process by about 4X compared with AutoTVM on NVIDIA GPUs and by about 10X compared with the AutoTune of Huawei's Tensor Boost Engine (TBE) on Huawei Ascend310 NPUs. It also improves the inference time of some DNNs by up to 7%.
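As background for the "Roofline-like" cost model named above, the classic Roofline model bounds a kernel's attainable performance by the minimum of the hardware's peak compute rate and its peak memory bandwidth times the kernel's arithmetic intensity. The sketch below shows that standard formula only; the peak numbers are hypothetical, and the paper's actual ROFT model is not reproduced here.

```python
# Minimal sketch of a classic Roofline performance estimate; ROFT is
# described as "Roofline-like", but its actual scoring is not shown here.

def roofline_estimate(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Attainable performance (FLOP/s) for a kernel performing `flops`
    operations and moving `bytes_moved` bytes, on hardware with the
    given peak compute and bandwidth."""
    intensity = flops / bytes_moved  # arithmetic intensity (FLOP/byte)
    return min(peak_flops, peak_bandwidth * intensity)

# Hypothetical accelerator: 10 TFLOP/s peak compute, 500 GB/s bandwidth.
peak_flops = 10e12
peak_bw = 500e9

# Memory-bound kernel (4 FLOP/byte): limited by the bandwidth roof.
print(roofline_estimate(4e9, 1e9, peak_flops, peak_bw))   # 2e12 FLOP/s

# Compute-bound kernel (40 FLOP/byte): hits the compute roof.
print(roofline_estimate(40e9, 1e9, peak_flops, peak_bw))  # 1e13 FLOP/s
```

A model of this shape is cheap to evaluate per schedule, which is why a Roofline-style estimate can stand in for costly on-device measurements during tuning.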