標題: 最大化不規則相依迴圈平行度的有效切割方式
Effective Partitioning Mechanisms to Maximize Parallelism of Nested Loops with Non-uniform Dependences
作者: 平德林
Der-Lin Pean
陳正
Cheng Chen
資訊科學與工程研究所
關鍵字: 編譯器;不規則相依性;迴圈平行化;平行處理;相依不規則區塊;compilers;non-uniform dependence;loop parallelization;parallel processing;dependence convex hull
公開日期: 1999
摘要: 使用平行處理系統來計算科學上的應用以便達到有效的運算是目前最普遍的一種方式。在一些應用程式中,例如流體力學(fluid mechanisms)、結構分析(structural analysis)及固態模擬(solid state simulations)等都存在所謂的跨迴圈相依關係(loop-carried dependence)。而這種相依關係有可能是規則性的陣列存取或是不規則性(non-uniform)的函數型矩陣座標。然而目前存在的迴圈切割方法中,對於不規則相依迴圈的切割往往效能相當差。因此,本論文提出了一個結合多種一般化且最佳化的不規則迴圈切割(partition)方式,希望針對各種不同形態的迴圈都能有效的開發其平行度(parallelism)。 首先,我們提出的技巧是根據不規則相依迴圈中不規則的特性,希望能針對此特性來開發其平行度。藉著這種方法,目前高度平行化的多處理機系統諸如多引線(multithreaded)及叢集式(clustered)的多處理機系統都能充分使用到。這些技術包括平行部份切割(Parallelization Part Splitting, PPS)、部份平行化分解(Partial Parallelization Decomposition, PPD)、不規則迴圈交換(Irregular Loop Interchange, ILI)及成長性格式偵測(Growing Pattern Detection, GPD)等方法。他們不但能偵測出特殊的迴圈型式而且能利用迴圈轉換(loop transformation)的方式來充分開發不規則相依迴圈的平行度。而這些方法不但能適用於特殊型式的迴圈相依距離也能應用於下列提出的其他方式中。然而,某些迴圈相依向量並不一定具有特殊而可有效切割的相依型式;此時,如何提供一個一般化的方式,來針對一般不規則相依迴圈型式加以有效切割,便成為一個相當重要的課題。 第二點,根據前面所提的需求,我們提出一個一般化的切割方式,此方法稱為最佳化的三態切割方式。此方法是以相依多邊區塊(dependence convex hull)理論為基礎而將迴圈切割成三個大小不一的區塊。再者、我們提出的演算法也利用了複製重新命名(copy-renaming)及最佳化的切割技術。此方法較適合用在原始序列區塊較小的迴圈中,因為切割後的三個區塊中的其中一個就是所謂的序列區塊,而其餘兩區塊則可被完全平行執行。然而,仍然有為數不少的情況是序列區塊佔了迴圈極大的部分。此時解決的途徑有二;一是可以提出和序列區塊大小獨立的方式,二是緊接本方法而提出一個有效切割此方法所無法切割的序列區塊。而此兩種解決方式,我們都能提出並且證明他們是有效的。 第三點、首先,我們承接上面方法的缺點而提出了一個和序列區塊大小相互獨立的方式,稱為最佳化的相依多邊形區塊切割方式,此方法利用了複製重新命名及最佳化整數程式化(integer programming)切割技術來將迴圈空間切割成最小數目的完全平行區塊。此方法較適用於序列區塊很大的情況。雖然此方式比我們前面提及的方法有較高的複雜度,但在此特定情況下,此方式是有效而且必要的。 最後、我們整合了三區塊切割技術及相依多邊形的切割方式即所謂的二階段的切割方式,而此方法既沒有最佳化的相依多邊形區塊切割方式的複雜度,也可以有效解決最佳化的三態切割方式所無法克服的完全序列區塊部份的切割。因此,此方式所適用的對象也較廣;舉凡所有完全序列區塊小於整個迴圈空間時都可以適用。而最佳化的相依多邊形區塊切割方式及部份平行化分解方式則可針對完全序列區塊佔滿整個迴圈空間的狀況來加以有效的切割。若再更一步整合可並用的方式如平行部份切割及不規則迴圈交換等,效果將更加顯著。和其他著名的切割技巧相比較,我們的方法在實際的機器上針對程式模型及實際的程式核心片段加以評估,都有顯著而更好的效果。
There are many methods for nested loop partitioning. However, most of them per-form poorly when partitioning loops with non-uniform dependencies. This dissertation proposes several generalized and optimized partitioning mechanisms to exploit paral-lelism from nested loops with non-uniform dependencies. By this way, current highly parallel multiprocessor systems can be fully utilized. First, we propose new techniques adaptable in extracting parallelism of loop nests with non-uniform dependencies, by which more parallelism is explored effec-tively using their irregularity. These mechanisms are parallelization part splitting (PPS), partial parallelization decomposition (PPD), irregular loop interchange(ILI) and grow-ing pattern detection(GPD). They not only use transformation techniques but also de-tect special parallelization patterns for non-uniform dependence nested loops. The abo-ve mechanisms can not only be applied to loops with special dependence patterns but also be combined with following new techniques. However, most of loops does not have such special dependence vector patters. Thus, a new loop partitioning mechanism that can handle loops with general dependence vector patterns is necessary. Second, we propose a loop partitioning mechanism called the Optimized Three Region Partitioning (OTRP) method based on dependence convex theory to divide the loop that has general dependence vector patterns into variable size partitions. It parti-tions the loop into two parallel regions and one serial region. However, there are still one serial region left in this mechanism. Thus, we will propose two approaches to resol-ve this situation. One way is to parallelize the above serial region effectively and the other is to develop a new method whose performance is independent of the size of the inherent serial region. Third, we propose a method whose efficiency is independent of the size of the in-herent serial region called the Optimized Dependence Convex Hull Partitioning (ODCHP) method. This mechanism is suitable for loops with large inherent serial re-gion. Although the algorithm of this mechanism has higher complexity, it is more effec-tive to resolve the above case directly. Finally, we develop a new mechanism called the two stage partitioning (TSP) mechanism based on dependence convex theory and three region partitioning technique can parallelize the serial region of the OTRP method effectively. Compared with other popular techniques, our schemes show a dramatic improvement in performance on pou-plar program models and real program code segments.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880392085
http://hdl.handle.net/11536/65485
顯示於類別:畢業論文