Title: 在NVIDIA圖形處理器上管理暫存器以增加線程級並行處理
Increasing Thread-Level Parallelism with Register Resource Management for NVIDIA GPUs
Authors: 游本永
Yu, Pen-Yung
游逸平
You, Yi-Ping
Institute of Computer Science and Engineering
Keywords: Compiler optimization; register allocation; thread-level parallelism; GPU; OpenCL; CUDA
Issue Date: 2013
Abstract: Graphics processing units (GPUs) are equipped with a large number of arithmetic processors running in a single-instruction, multiple-data fashion, producing a throughput of tera floating-point operations per second, which is tens or even hundreds of times higher than the throughput of central processing units. GPUs rely on massive hardware multithreading to hide off-chip memory latencies, which are approximately 400–800 cycles. However, the number of threads that can run in parallel on a GPU is tightly restricted by the resources each thread requires, especially registers. In this thesis, we propose a thread-level-parallelism-aware register-pressure reduction framework that reduces the register usage of threads on GPGPUs, thereby increasing thread-level parallelism. The framework includes two register-pressure reduction methods: (1) register rematerialization and (2) spilling registers to on-chip memory. The experimental results demonstrate that the proposed framework is effective, reducing the execution time of OpenCL kernel programs by an average of 5.7% and by up to 27%.
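The dependence of thread-level parallelism on register usage noted in the abstract can be illustrated with a rough occupancy estimate. The sketch below is not taken from the thesis: the function name `resident_threads` and the default hardware limits (32768 registers and 1536 threads per multiprocessor, warps of 32 threads) are assumptions chosen for illustration, and real hardware additionally rounds register allocations to a granularity that is ignored here.

```python
def resident_threads(regs_per_thread,
                     regs_per_sm=32768,        # assumed register file size per multiprocessor
                     max_threads_per_sm=1536,  # assumed hardware thread limit
                     warp_size=32):            # threads scheduled together as one warp
    """Estimate how many threads can be resident on one multiprocessor.

    Hypothetical simplified model: only the register file and the hardware
    thread limit are considered; allocation granularity, shared memory,
    and block-size limits are ignored.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    regs_per_warp = regs_per_thread * warp_size
    warps_by_regs = regs_per_sm // regs_per_warp   # warps that fit in the register file
    max_warps = max_threads_per_sm // warp_size    # warps allowed by the thread limit
    return min(warps_by_regs, max_warps) * warp_size


if __name__ == "__main__":
    # Fewer registers per thread allow more resident threads under this model.
    for regs in (63, 32, 21):
        print(regs, resident_threads(regs))
```

Under these assumed limits, lowering a kernel from 32 to 21 registers per thread raises the estimate from 1024 to 1536 resident threads, which is the kind of gain that register rematerialization or spilling to on-chip memory aims for.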
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070056013
http://hdl.handle.net/11536/73916
Appears in Collections: Thesis