Title: A Cache Hierarchy Aware Thread Mapping Methodology for Throughput Processors
Authors: 郭玹凱 (Kuo, Hsien-Kai)
周景揚 (Jou, Jing-Yang)
賴伯承 (Lai, Bo-Cheng)
Department of Electronics Engineering and Institute of Electronics
Keywords: Multithreaded processors; Cache memories; Shared memory; Performance Analysis and Design Aids
Issue Date: 2013
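Abstract: Cache memory systems have become an effective way to alleviate the memory bottleneck in throughput processors such as general-purpose graphics processing units. To better exploit the data locality of applications, state-of-the-art throughput-oriented architectures have implemented a multi-level shared cache system. Compared with conventional multi-core processors, the design philosophy of throughput processors allocates most of the chip area to processing cores, which results in a relatively small cache shared by a large number of cores. Proper thread mapping is the key to gaining constructive cache sharing and avoiding resource contention among thousands of threads. However, because of significant differences in architectures and programming models, existing multi-core thread mapping approaches do not perform effectively on throughput processors. In throughput processor research, using the maximum possible thread parallelism is a common thread mapping policy. This greedy policy, however, usually causes serious cache contention and significantly degrades system performance. The trade-off between thread parallelism and cache contention has therefore become a critical performance factor for throughput processors. This dissertation first proposes a model that captures the behavior of threads and of the multi-level shared cache. With appropriate proofs, the model provides a solid theoretical foundation, on which this dissertation builds a cache hierarchy aware thread mapping method. In the experiments, the proposed thread mapping method successfully improves cache data reuse in throughput processors and achieves an average speedup of 2.3x to 4.3x over existing approaches. This dissertation further analyzes the performance impact of cache contention in throughput processors. Based on the analysis and findings, it formulates a thread scheduling problem and proposes a series of thread scheduling algorithms. These algorithms reduce cache contention and improve the overall performance of throughput processors. Compared with a widely used thread clustering scheme, the proposed thread scheduling scheme achieves a 61.6% performance improvement. Compared with a state-of-the-art data reuse technique, the improvement in execution time can reach 47.4%.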
Deploying a cache system is an effective way to alleviate the memory bottleneck in modern throughput processors, such as GPGPUs. Recently proposed throughput-oriented architectures have added a multi-level hierarchy of shared caches to better exploit the data locality of general-purpose applications. The design philosophy of throughput processors allocates most of the chip area to processing cores, which results in a relatively small cache shared by a large number of cores when compared with conventional multi-core CPUs. Applying a proper thread mapping scheme is crucial for benefiting from constructive cache sharing and avoiding resource contention among thousands of threads. However, because of the significant differences in architectures and programming models, the existing thread mapping approaches for multi-core CPUs do not perform as effectively on throughput processors. In throughput processors, a commonly used thread mapping policy is to pursue the maximum possible Thread-Level Parallelism (TLP). However, this greedy policy often causes serious cache contention on the Shared Last-Level Cache (SLLC) and significantly degrades system performance. A careful trade-off between thread-level parallelism and cache contention is therefore a critical performance factor in the thread mapping of a throughput processor. This dissertation first proposes a formal model that captures both the characteristics of threads and the cache sharing behavior of a multi-level shared cache. With appropriate proofs, the model forms a solid theoretical foundation for the proposed cache hierarchy aware thread mapping methodology for GPGPUs with multi-level shared caches. The experiments reveal that the three-stage thread mapping methodology successfully improves data reuse at each cache level of GPGPUs and achieves an average runtime improvement of 2.3x to 4.3x over existing approaches.
This dissertation then extends the discussion to characterize and analyze the performance impact of cache contention in the SLLC of throughput processors. Based on the analyses and findings on cache contention and its performance pitfalls, this dissertation formally formulates the Aggregate-Working-Set-Size Constrained Thread Scheduling Problem, which constrains the aggregate working-set size of concurrent threads. After proving the problem to be NP-hard, this dissertation integrates a series of algorithms that minimize cache contention and enhance overall system performance on GPGPUs. Simulation results on NVIDIA's Fermi architecture show that the proposed thread scheduling scheme achieves up to 61.6% execution time improvement over a widely used thread clustering scheme. Compared with the state-of-the-art technique that exploits the data reuse of applications, the improvement in execution time can reach 47.4%. Notably, the execution time improvement of the proposed thread scheduling scheme is only 2.6% away from that of an exhaustive search scheme.
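The constraint described above, keeping the aggregate working-set size of concurrently running threads within the shared cache capacity, can be illustrated with a small sketch. This is not the dissertation's algorithm: it is a generic first-fit-decreasing heuristic chosen only for illustration, and the function name `schedule_batches`, the block names, and all sizes are hypothetical.

```python
from typing import Dict, List


def schedule_batches(working_set: Dict[str, int], cache_size: int) -> List[List[str]]:
    """Group thread blocks into batches that run concurrently.

    Each batch keeps its aggregate working-set size within cache_size.
    Heuristic: first-fit decreasing (an illustrative assumption, not the
    method proposed in the dissertation).
    """
    batches: List[List[str]] = []
    loads: List[int] = []  # aggregate working-set size of each batch

    # Consider the largest working sets first; break ties by name so the
    # result is deterministic.
    ordered = sorted(working_set.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        for i, load in enumerate(loads):
            if load + size <= cache_size:
                batches[i].append(name)
                loads[i] += size
                break
        else:
            # No existing batch has room. A block larger than the cache
            # also lands here and ends up running alone.
            batches.append([name])
            loads.append(size)
    return batches


# Hypothetical working-set sizes (in KB) and a hypothetical 768 KB cache.
sizes = {"b0": 300, "b1": 200, "b2": 500, "b3": 256, "b4": 100}
result = schedule_batches(sizes, cache_size=768)
print(result)
```

Blocks in the same batch are meant to run concurrently, and batches run one after another. Fewer batches mean more parallelism, while the capacity check limits contention, which is the trade-off the abstract describes. Because the underlying problem is stated to be NP-hard, a heuristic like this gives no optimality guarantee.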
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079711670
http://hdl.handle.net/11536/74270
Appears in Collections: Thesis