Title: Memory latency reduction in loop scheduling on modern multiprocessors
Authors: Wang, Yi-Min
Chang, Ruei-Chuan
Department: Institute of Computer Science and Engineering
Keywords: Shared memory; Loop scheduling; Memory latency reduction; Affinity scheduling
Issue Date: 1995
Abstract: Loop scheduling is one of the most important issues in the development of parallelizing compilers for shared-memory multiprocessors. This thesis proposes loop scheduling algorithms suited to current and future shared-memory multiprocessors. We build a realistic multiprocessor simulator that models many modern multiprocessor architectures and use it to evaluate the performance of various loop scheduling algorithms. Traditional loop scheduling algorithms become inefficient on the new generation of large-scale shared-memory multiprocessors because of rising communication costs. The affinity scheduling algorithm (AFS) addresses this by increasing the chance that memory accesses hit in the cache, but it spends too much time on synchronization and does not fully exploit affinity. We therefore first propose a modified affinity scheduling algorithm (MAFS) that remedies both weaknesses: when load imbalance occurs, an idle processor migrates the expected remaining workload of a loaded processor, which minimizes synchronization overhead and enhances the affinity effect. Running many applications on the simulator shows that MAFS has shorter execution times than AFS and, in most cases, reduces the number of migration operations by at least half. Second, we propose a clustered affinity scheduling algorithm (CAFS) for large-scale NUMA machines. CAFS partitions the processors into clusters, each processor belonging to a dedicated cluster; when imbalance occurs, an idle processor migrates a fraction of iterations only from the most loaded processor in its own cluster, lowering synchronization overhead and the cache-miss ratio. Our results show that in most applications CAFS reduces both remote reads and synchronous writes to the work queues by at least one third; it also improves cache hit ratios and balances the workload across processors. Finally, we propose a hierarchical affinity scheduling policy for clustered NUMA machines that improves earlier affinity algorithms such as AFS and MAFS; the resulting algorithms are called HAFS and HMAFS. Under this policy, work migration among processors proceeds hierarchically. The results show that the hierarchical algorithms reduce inter-cluster remote memory accesses, decrease locking of the work queues, and effectively balance the workload, with HMAFS performing best among the evaluated loop scheduling algorithms.
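The clustered migration step described above (affinity-first partitioning, then intra-cluster stealing from the most loaded neighbor) can be sketched as a small simulation. This is an illustrative sketch only: the function names (`partition`, `steal_within_cluster`) and the stealing fraction of one half are assumptions for demonstration, not the thesis's exact CAFS implementation.

```python
# Hypothetical sketch of the clustered-affinity idea: iterations are first
# statically partitioned per processor (for cache affinity); an idle
# processor then steals a fraction of iterations, but only from the most
# loaded processor inside its own cluster.

def partition(n_iters, n_procs):
    """Statically assign iterations to per-processor queues (affinity phase)."""
    base, rem = divmod(n_iters, n_procs)
    queues, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < rem else 0)
        queues.append(list(range(start, start + size)))
        start += size
    return queues

def steal_within_cluster(queues, idle, clusters, fraction=2):
    """Idle processor takes 1/fraction of the largest queue in its own cluster."""
    cluster = next(c for c in clusters if idle in c)
    victim = max((p for p in cluster if p != idle), key=lambda p: len(queues[p]))
    k = len(queues[victim]) // fraction
    if k == 0:
        return victim, 0          # nothing worth migrating
    stolen = queues[victim][-k:]  # take the tail, preserving the victim's affinity prefix
    del queues[victim][-k:]
    queues[idle].extend(stolen)
    return victim, k

# Example: 100 iterations, 4 processors, two clusters {0,1} and {2,3}.
queues = partition(100, 4)
queues[0].clear()               # processor 0 finishes its local work early
victim, n = steal_within_cluster(queues, 0, [{0, 1}, {2, 3}])
# processor 0 steals only from processor 1 (same cluster), never 2 or 3
```

Restricting the victim search to the idle processor's cluster is what keeps both the queue synchronization and the migrated data local, which is the mechanism the abstract credits for the reduction in remote reads and synchronous queue writes.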
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT840394082
http://hdl.handle.net/11536/60530
Appears in Collections: Thesis