Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kuo, Hsien-Kai | en_US |
dc.contributor.author | Lai, Bo-Cheng Charles | en_US |
dc.contributor.author | Jou, Jing-Yang | en_US |
dc.date.accessioned | 2015-07-21T11:21:09Z | - |
dc.date.available | 2015-07-21T11:21:09Z | - |
dc.date.issued | 2014-11-01 | en_US |
dc.identifier.issn | 1084-4309 | en_US |
dc.identifier.uri | http://dx.doi.org/10.1145/2676550 | en_US |
dc.identifier.uri | http://hdl.handle.net/11536/123923 | - |
dc.description.abstract | Deploying a Shared Last-Level Cache (SLLC) is an effective way to alleviate the memory bottleneck in modern throughput processors, such as GPGPUs. A common scheduling policy for throughput processors is to expose the maximum possible thread-level parallelism. However, this greedy policy often causes serious contention on the SLLC and significantly degrades system performance. Thread scheduling on a throughput processor must therefore carefully trade off thread-level parallelism against cache contention. This article characterizes and analyzes the performance impact of cache contention in the SLLC of throughput processors. Based on these analyses and findings, this article formally formulates the aggregate working-set-size-constrained thread scheduling problem, which bounds the aggregate working-set size of concurrently executing threads. After proving the problem NP-hard, this article integrates a series of algorithms that minimize cache contention and enhance overall system performance on GPGPUs. Simulation results on NVIDIA's Fermi architecture show that the proposed thread scheduling scheme achieves up to 61.6% execution-time improvement over a widely used thread clustering scheme. Compared to a state-of-the-art technique that exploits the data reuse of applications, the execution-time improvement reaches 47.4%. Notably, the proposed scheme comes within only 2.6% of an exhaustive search scheme. (An illustrative sketch of the working-set-size constraint appears after the metadata table below.) | en_US |
dc.language.iso | en_US | en_US |
dc.subject | Algorithms | en_US |
dc.subject | Performance | en_US |
dc.subject | Throughput processors | en_US |
dc.subject | thread-level parallelism | en_US |
dc.subject | cache contention | en_US |
dc.subject | shared last-level cache | en_US |
dc.subject | thread scheduling | en_US |
dc.title | Reducing Contention in Shared Last-Level Cache for Throughput Processors | en_US |
dc.type | Article | en_US |
dc.identifier.doi | 10.1145/2676550 | en_US |
dc.identifier.journal | ACM TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS | en_US |
dc.citation.volume | 20 | en_US |
dc.contributor.department | Published under the name of National Chiao Tung University (交大名義發表) | zh_TW |
dc.contributor.department | National Chiao Tung University | en_US |
dc.identifier.wosnumber | WOS:000345523400012 | en_US |
dc.citation.woscount | 0 | en_US |
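
The abstract above centers on one algorithmic idea: cap the aggregate working-set size of concurrently scheduled threads at the SLLC capacity, sacrificing some thread-level parallelism to reduce contention. The paper's actual algorithms are not reproduced in this record, so the following is only a minimal sketch of that constraint, assuming a first-fit-decreasing packing heuristic and hypothetical names (`ThreadBlock`, `schedule_batches`, `sllc_capacity`), none of which come from the paper itself.

```python
# Hypothetical sketch: group thread blocks into concurrent batches whose
# aggregate working-set size fits within the SLLC capacity. This is NOT the
# authors' algorithm, only an illustration of the constraint the abstract
# formulates; names and the packing heuristic are assumptions.

from dataclasses import dataclass

@dataclass
class ThreadBlock:
    tid: int
    working_set: int  # estimated working-set size in bytes


def schedule_batches(blocks, sllc_capacity):
    """Pack thread blocks into batches so that each batch's aggregate
    working-set size stays within the SLLC capacity, trading some
    thread-level parallelism for reduced cache contention."""
    batches = []
    # First-fit-decreasing: place larger working sets first.
    for blk in sorted(blocks, key=lambda b: b.working_set, reverse=True):
        for batch in batches:
            if sum(b.working_set for b in batch) + blk.working_set <= sllc_capacity:
                batch.append(blk)
                break
        else:
            # No existing batch has room; this block starts a new batch,
            # which runs after the earlier batches finish.
            batches.append([blk])
    return batches


if __name__ == "__main__":
    blocks = [ThreadBlock(i, ws) for i, ws in enumerate([300, 200, 500, 150, 400])]
    for i, batch in enumerate(schedule_batches(blocks, sllc_capacity=768)):
        print(f"batch {i}: blocks {[b.tid for b in batch]}")
```

First-fit-decreasing is a standard heuristic for packing problems of this kind, which is consistent with the abstract's note that the constrained scheduling problem is NP-hard and is therefore attacked with a series of heuristic algorithms rather than solved exactly.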
Appears in Collections: Articles