Title: Automatic Intra-Task Parallelization of OpenCL Programs on Multi-GPU Systems (多重圖形處理系統上之OpenCL程式內核的自動平行化排程)
Authors: 宋禹
Sung, Vincent
游逸平
You, Yi-Ping
Institute of Computer Science and Engineering
Keywords: OpenCL;GPGPU abstraction;GPGPU task scheduling;Memory management;Program flow analysis;Heterogeneous multi-core systems
Issue Date: 2013
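Abstract: As heterogeneous multi-core architectures and the OpenCL parallel programming model have matured, cooperative computing between conventional processors (CPUs) and general-purpose graphics processing units (GPGPUs) has been widely adopted in scientific fields such as bioinformatics, financial forecasting, and atmospheric simulation. Driven by advances in hardware manufacturing processes and by the demands of software design, hardware systems with multiple GPGPU or CPU devices have become commonplace, from large cloud-scale server clusters down to home computers and even handheld devices. However, current software systems cannot optimally manage and allocate the computing resources of multiple devices, and because of the design of the OpenCL programming model, legacy applications do not gain performance on multiple devices either. The main goal of this research is to provide an abstraction layer for OpenCL that comprises a static analyzer and a runtime system. The former analyzes the dependencies among the compute kernels of a legacy application; the latter allocates computing resources according to the statically analyzed information and includes a memory synchronization and management system and a kernel scheduling and dispatching system. Together they allow legacy applications to be distributed across multiple devices while exposing a single-device interface for API calls, which simplifies resource allocation and synchronization across devices and thus reduces the programming burden. We compared our implementation with a native single-device OpenCL runtime: the mean efficiency when running on one, two, and four devices was 95%, 92%, and 87%, respectively. The overall overhead averaged about 5% and came mainly from data transfer and synchronization between the application and the runtime system.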
Open Computing Language (OpenCL), which enables GPUs to be programmed for general-purpose computation, has made GPU systems widely applicable platforms for parallel computing. In recent years, systems with multiple GPU devices have become increasingly common, from cloud computing environments to consumer and entry-level platforms. However, the OpenCL programming model requires programmers to explicitly identify and schedule computations and communications among multiple GPU devices. A legacy OpenCL program, which is usually written for a system with only a single GPU device, does not directly benefit from multiple computing resources. Furthermore, resource management in such a system, such as workload distribution and synchronization among discrete memory spaces, increases the programmer’s burden. In this thesis, we propose ViCL, a runtime abstraction layer that acts as a high-level agent between programmers and the vendor’s OpenCL runtime system to untangle these problems. ViCL thereby provides backward compatibility for legacy programs and leads to more productive development of OpenCL programs. ViCL comprises a static analyzer and a runtime system. The static analyzer extracts dependency information between the kernel functions of an OpenCL program. The runtime system provides a front-end library and a runtime manager: the front-end library presents a single virtual device to programmers and delivers OpenCL API calls as commands to the runtime manager, which schedules the commands onto appropriate GPU devices according to the statically analyzed dependency information so as to automatically maximize the parallelism of OpenCL applications. The experimental results show that the mean abstraction overhead of ViCL with a single device is about 5%, with most of the overhead incurred by data communication between applications and the runtime system. ViCL also makes legacy OpenCL applications scalable: the mean efficiency when running on one-GPU, two-GPU, and four-GPU platforms was 95%, 92%, and 87%, respectively.
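The idea of scheduling kernels onto devices from dependency information can be illustrated with a small sketch. This is not ViCL's implementation, which the abstract does not detail. It assumes each kernel is described by the buffers it reads and writes, derives dependencies from read/write conflicts in program order, and places each kernel greedily on the device where it can start earliest. The names `Kernel`, `build_dependencies`, `schedule`, and the `cost` estimate are all hypothetical, and data-transfer time between devices is ignored.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    """Hypothetical kernel description: buffer names read and written."""
    name: str
    reads: frozenset
    writes: frozenset
    cost: float = 1.0  # assumed run-time estimate, in arbitrary units


def build_dependencies(kernels):
    """Map each kernel name to the earlier kernels it must wait for."""
    deps = {k.name: set() for k in kernels}
    for i, later in enumerate(kernels):
        for earlier in kernels[:i]:
            raw = earlier.writes & later.reads    # read after write
            war = earlier.reads & later.writes    # write after read
            waw = earlier.writes & later.writes   # write after write
            if raw or war or waw:
                deps[later.name].add(earlier.name)
    return deps


def schedule(kernels, num_devices):
    """Greedy placement: each kernel goes to the device where it starts earliest."""
    deps = build_dependencies(kernels)
    device_free = [0.0] * num_devices  # time at which each device becomes idle
    finish = {}                        # kernel name -> finish time
    placement = {}                     # kernel name -> (device index, start time)
    for k in kernels:  # program order is already a valid topological order
        ready = max((finish[d] for d in deps[k.name]), default=0.0)
        dev = min(range(num_devices), key=lambda d: max(device_free[d], ready))
        start = max(device_free[dev], ready)
        finish[k.name] = start + k.cost
        device_free[dev] = finish[k.name]
        placement[k.name] = (dev, start)
    return placement, max(finish.values(), default=0.0)


# Two independent kernels feed a third one.
kernels = [
    Kernel("a", frozenset({"x"}), frozenset({"t1"})),
    Kernel("b", frozenset({"y"}), frozenset({"t2"})),
    Kernel("c", frozenset({"t1", "t2"}), frozenset({"z"})),
]
print(schedule(kernels, 1)[1])  # 3.0: all three kernels run back to back
print(schedule(kernels, 2)[1])  # 2.0: "a" and "b" overlap on separate devices
```

In this toy example the third kernel cannot start until both producers finish, so a second device shortens the schedule from three time units to two. A real runtime would also have to account for the cost of moving buffers between discrete device memories, which the abstract identifies as a main source of overhead.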
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070056111
http://hdl.handle.net/11536/73740
Appears in Collections: Thesis