Title: Capability-Aware Workload Partition on Multi-GPU Systems
Authors: Chao, Yen-Ting (趙硯廷)
You, Yi-Ping (游逸平)
Institute of Computer Science and Engineering
Keywords: GPU; OpenCL; GPU abstraction; workload distribution; load balance; GPGPU; device abstraction
Issue Date: 2016
Abstract: Using multiple graphics processing units (GPUs) to accelerate applications has become increasingly popular in recent years, with the assistance of multi-GPU abstraction techniques. However, an application that contains only dependent kernels derives no benefit from multiple GPUs, since its kernels cannot run simultaneously on those GPUs, which decreases GPU utilization. Applications with a 'big' kernel, which launches a huge number of threads to process massively parallel data, can also lower the overall throughput of a multi-GPU system, because under the OpenCL execution model a single kernel instance runs on only one device. Such an application requires programmers to manually divide the kernel into several 'small' kernels and dispatch them to different GPUs in order to utilize multiple GPU resources, but this imposes an extra burden on programmers. In this paper, we present XVirtCL, an extension of VirtCL (a GPU abstraction framework), for automatically balancing the workload of a kernel among multiple GPUs while considering the varying compute capability levels of the GPUs and minimizing the data transferred among them. XVirtCL comprises (1) a kernel analyzer that determines whether the workload of a kernel is suitable for partitioning, (2) a workload scheduling algorithm that balances the workload of a kernel among multiple GPUs while accounting for their compute capability levels, and (3) a workload partitioner that splits a kernel into multiple sub-kernels with disjoint sub-NDRange spaces. Preliminary experimental results indicate that the proposed framework maximizes the throughput of multiple GPUs for applications with big, regular kernels.
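To illustrate the general idea of partitioning one big kernel into sub-kernels with disjoint sub-NDRange spaces, the following minimal sketch enqueues the same OpenCL kernel on each GPU's command queue over a contiguous slice of a 1-D NDRange, with slice sizes proportional to a per-device capability weight. This is only an illustration under assumed inputs, not XVirtCL's actual implementation: the function name launch_partitioned, the queues array, and the weight array (e.g., obtained by profiling each device's throughput) are hypothetical, and data placement/transfer between devices is not shown.

/* Sketch: capability-proportional 1-D NDRange partitioning across GPUs.
 * Each device receives a disjoint sub-range via the global work offset of
 * clEnqueueNDRangeKernel, so get_global_id() inside the kernel still yields
 * the work-item's true global index. Not XVirtCL's actual code. */
#include <CL/cl.h>

cl_int launch_partitioned(cl_kernel kernel,
                          cl_command_queue *queues,  /* one queue per GPU (hypothetical) */
                          const double *weight,      /* relative capability per GPU      */
                          cl_uint num_devices,
                          size_t global_size)        /* total 1-D NDRange size           */
{
    double total = 0.0;
    for (cl_uint i = 0; i < num_devices; ++i)
        total += weight[i];

    size_t offset = 0;
    for (cl_uint i = 0; i < num_devices; ++i) {
        /* The last device takes the remainder so the slices cover the whole range. */
        size_t slice = (i == num_devices - 1)
                           ? global_size - offset
                           : (size_t)(global_size * (weight[i] / total));
        if (slice == 0)
            continue;

        cl_int err = clEnqueueNDRangeKernel(queues[i], kernel,
                                            1,        /* work_dim                        */
                                            &offset,  /* global_work_offset: sub-range start */
                                            &slice,   /* global_work_size: sub-range size */
                                            NULL,     /* let the runtime choose local size */
                                            0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;
        offset += slice;
    }

    /* Wait for all sub-kernels before the host consumes the results. */
    for (cl_uint i = 0; i < num_devices; ++i)
        clFinish(queues[i]);
    return CL_SUCCESS;
}

Using the global work offset keeps the kernel source unchanged while giving each GPU a disjoint sub-NDRange; a full system such as the one described above would additionally decide whether a kernel is worth partitioning and manage the buffers each sub-kernel actually touches.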
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070356029
http://hdl.handle.net/11536/140124
Appears in Collections: Thesis