Title: Computation and Communication Aware Task Graph Scheduling on Multi-GPGPU Systems (在多圖形處理器系統上考量運算負載與資料傳遞之工作排程方法)
Authors: Wang, Yun-Ting (王允廷)
Lai, Bo-Cheng (賴伯承)
Department of Electronics Engineering and Institute of Electronics
Keywords: graphics processing unit; computation load; data transfer; GPGPU; Computation; Communication
Issue date: 2013
Abstract: With their massive parallel computation capability, GPGPUs have emerged as popular throughput computing platforms, and there is growing interest in exploiting systems with multiple GPGPUs. However, attaining superior performance in a multi-GPGPU system involves three main design challenges. The first is to balance the load of tasks assigned to each GPGPU; an imbalanced load across the system can leave some GPGPUs idle and degrade overall performance. The second is to use memory resources effectively by fully leveraging data reuse between threads as well as between kernels; poor data reuse causes excessive data accesses and transfers. The third is how efficiently a program can hide data transfer overhead by overlapping computation and communication [1]. This thesis addresses these design issues by proposing Computation and Communication Aware task graph Scheduling (CCAS) for multi-GPGPU systems. CCAS adopts an effective heuristic algorithm that considers both data reuse and load balance and their effect on the performance of multi-GPGPU systems. For multi-graph applications, a pre-scan method clusters disjoint task graphs and assigns the clusters to GPGPUs based on the characteristics of the graphs. CCAS achieves an average performance improvement of 22.15% over a previous work. In multi-graph applications, the proposed pre-scan clustering method achieves good performance scaling when the system size is increased from 2 to 4 GPGPUs.
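The idea of clustering disjoint task graphs onto GPGPUs can be illustrated with a minimal sketch. This is not the CCAS algorithm or the pre-scan method of the thesis, whose details the abstract does not give. It assumes that each task has a known scalar cost, that disjoint graphs are the weakly connected components of the dependency edges, and that a simple greedy rule (heaviest graph first, onto the least-loaded device) is an acceptable stand-in for load balancing. The function names `disjoint_graphs` and `assign_to_gpus` are hypothetical.

```python
from collections import defaultdict


def disjoint_graphs(tasks, edges):
    """Group tasks into disjoint task graphs (weakly connected components)."""
    adj = defaultdict(set)
    for u, v in edges:
        # Treat dependencies as undirected for the purpose of grouping.
        adj[u].add(v)
        adj[v].add(u)
    seen, groups = set(), []
    for t in tasks:
        if t in seen:
            continue
        stack, group = [t], []
        seen.add(t)
        while stack:
            u = stack.pop()
            group.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        groups.append(sorted(group))
    return groups


def assign_to_gpus(groups, cost, num_gpus):
    """Assumed greedy rule: heaviest graph first, onto the least-loaded GPU."""
    load = [0.0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    by_weight = sorted(groups, key=lambda g: sum(cost[t] for t in g), reverse=True)
    for g in by_weight:
        gpu = min(range(num_gpus), key=lambda i: load[i])
        placement[gpu].append(g)
        load[gpu] += sum(cost[t] for t in g)
    return placement, load


# Illustrative input (made-up costs, not data from the thesis).
tasks = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
cost = {"a": 4, "b": 2, "c": 1, "d": 3, "e": 3, "f": 5}

groups = disjoint_graphs(tasks, edges)
placement, load = assign_to_gpus(groups, cost, num_gpus=2)
print(placement, load)
```

Keeping each disjoint graph on a single device means that dependent tasks never exchange data across GPGPUs in this sketch, which is one simple way to trade load balance against data transfer.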
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070050220
http://hdl.handle.net/11536/74585
Appears in Collections: Thesis