Title: 在多處理機系統上的執行時期平行化方法
An Efficient Run-Time Parallelizing Method for Multiprocessor Systems
Authors: 謝明輝
Hsieh, Ming-Huei
曾憲雄
Shian-Shyong Tseng
資訊科學與工程研究所
Keywords: 執行時期;波前;平行度;run-time;wavefront;parallelism
Issue Date: 1995
Abstract: 迴圈在一程式中存在有大量的平行度,為了將此程式平行化,平行編譯器
利用靜態資料相依性分析來獲得迴圈的平行度。然而,有些迴圈則無法於
編譯時期取得資料相依性的資訊。例如,在稀疏矩陣計算上,陣列述語內
通常包含了間接陣列或函式,而無法利用靜態資料相依性分析。故便保守
的將程式循序的執行,而犧牲了潛在的平行度。因此,在此論文中則提出
了一個兩階段 (偵測階段及執行階段) 的執行時期平行化方法於執行時期
擷取出迴圈中潛在的平行度。偵測階段經由建立一DEF-USE表而決定出可
平行執行的迴圈輪替集合-波前,此外,此偵測階段本身可以被完全的平
行化以減少因決定波前所照成的額外負擔。而經改良的執行階段則根據波
前來執行迴圈並且使用auto-adapted函式來獲得合適的Thread數量而非傳
統固定的指定Thread數量。實驗的結果顯示,這個平行偵測演算法能處理
較複雜的資料相依性而且能明顯縮短本身執行時間。此外,在執行階段所
利用的新策略能提高整個執行時期平行化的效率並且增加多處理機系統的
利用度。
Loop-level parallelism is the most common resource to be
exploited by parallelizing compiler. To parallelize a sequential
loop, a parallelizing compiler must compute a parallel schedule
of the iterations based on a static data dependenceanalysis at
compile-time. Some loops, however, may contain parallelism not
detectable in this way. For example, insparse matrix
computations, array subscripts often involve indirection arrays
and thus defy static analysis. In conservatively, the loop
iterations in such examples will be performed sequentially.
Motivated by these concerns, a run-time technique based on
inspector-executor scheme is proposed for finding available
parallelism on loops in this thesis. Our inspector can determine
the wavefronts by building DEF-USE table. Additionally, the
inspector is fully parallel without any synchronization for
reducing overhead that indicates the wavefronts. Our improved
executor performs the loop iterations concurrently for each
wavefront in a loop by using auto-adapted function to get a
tailored thread number rather than using fixed thread number.
Experimental results show that our new parallel inspector
algorithm can handle complex data dependency patterns that
cannot be performed by the previousresearches and reduce itself
running time obviously. Besides, the new strategyfor executor
can also achieve high system utilization and improve the
performance of run-time parallelization.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840394006
http://hdl.handle.net/11536/60445
Appears in Collections:Thesis