Title: | Capability-Aware Workload Partition on Multi-GPU Systems |
Authors: | Chao, Yen-Ting; You, Yi-Ping; Institute of Computer Science and Engineering |
Keywords: | GPU; OpenCL; GPU abstraction; GPGPU; device abstraction; workload distribution; load balance |
Issue Date: | 2016 |
Abstract: | Using multiple graphics processing units (GPUs) to accelerate applications has become increasingly popular in recent years, with the assistance of multi-GPU abstraction techniques. However, an application that has only dependent kernels derives no benefit from the power of multiple GPUs, since the kernels within the application cannot run simultaneously on those GPUs, thereby decreasing GPU utilization. Applications that have a ‘big’ kernel, which launches a huge number of threads for processing massively parallel data, can also lower the overall throughput of a multi-GPU system, because under the OpenCL execution model a single kernel can execute on only one device. Such an application requires programmers to manually divide the kernel into several ‘small’ kernels and dispatch them on different GPUs so as to utilize multiple GPU resources, but this imposes an extra burden on programmers. In this paper, we present XVirtCL, an extension of VirtCL (a GPU abstraction framework) for automatically balancing the workload of a kernel among multiple GPUs while considering the variety of compute capability levels of the GPUs and minimizing the data transferred among them.
XVirtCL involves (1) a kernel analyzer for determining whether the workload of a kernel is suitable for partitioning, (2) a workload scheduling algorithm for balancing the workload of a kernel among multiple GPUs while considering the variety of compute capability levels of the GPUs, and (3) a workload partitioner for splitting a kernel into multiple sub-kernels with disjoint sub-NDRange spaces. The preliminary experimental results indicate that the proposed framework maximized the throughput of multiple GPUs for applications with big, regular kernels. |
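The abstract does not spell out how the workload partitioner divides an NDRange in proportion to device capability, so the following is only a minimal illustrative sketch, not the thesis's actual algorithm. It assumes a one-dimensional NDRange and a hypothetical per-GPU capability score, and splits the global range into disjoint, contiguous sub-ranges (offset, size) sized proportionally to each score, as the disjoint sub-NDRange description suggests:

```python
def partition_ndrange(global_size, capabilities):
    """Split a 1-D NDRange of `global_size` work-items into disjoint,
    contiguous (offset, size) sub-ranges, one per GPU, sized in
    proportion to each GPU's relative capability score.

    `capabilities` is a hypothetical list of positive scores, e.g.
    measured throughput per device; the last device absorbs any
    rounding remainder so the sub-ranges exactly cover the range.
    """
    total = sum(capabilities)
    sub_ranges = []
    start = 0
    for i, cap in enumerate(capabilities):
        if i == len(capabilities) - 1:
            size = global_size - start  # remainder goes to the last GPU
        else:
            size = global_size * cap // total
        sub_ranges.append((start, size))
        start += size
    return sub_ranges


# Example: two GPUs where the second is three times as capable.
parts = partition_ndrange(1024, [1, 3])
print(parts)  # [(0, 256), (256, 768)]
```

In a real OpenCL dispatcher, each (offset, size) pair would map to one `clEnqueueNDRangeKernel` call using the `global_work_offset` and `global_work_size` arguments, so every device processes a disjoint slice of the original index space; a production scheme would also round sizes to a multiple of the work-group size.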
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070356029 http://hdl.handle.net/11536/140124 |
Appears in Collections: | Thesis |