Improvement for Reducing Latency of Load Instructions

Title:	Improvement for Reducing Latency of Load Instructions (為減少載入指令之延遲所作之改善)
Authors:	Liu, Wen-Jun (劉文俊); Chen, Chang-Jiu (陳昌居); Institute of Computer Science and Engineering
Keywords:	Load Instruction; Load Latency; Superscalar; Out of order; LTAPB; FAC-like
Publication date:	1997
Abstract:	Memory system design is one of the most challenging aspects of computer architecture. Because memory technology has not advanced as quickly as CPUs, designers must re-evaluate existing designs in light of changes in implementation technology, workload, and processor architecture to obtain even small gains in overall performance. One important challenge in memory system design is continually lengthening load latency: as load latency grows, overall system performance steadily degrades, so reducing load latency has become a key factor in system performance. The invention of cache memory narrowed the gap between the CPU and the memory system. However, many newer workloads (such as multimedia, compression, and encryption) lack the locality needed to perform well on traditional memory hierarchies. In addition, processors that emphasize instruction-level parallelism issue multiple loads and stores per cycle, which increases the bandwidth demands on the memory system and further increases the impact of load latency on overall performance. In this thesis, we first study T. M. Austin's FAC (Fast Address Calculation) method, which predicts the effective address of a load instruction before the actual effective address computation and thereby eliminates one cycle of load latency. We then propose FAC-like, an improvement on FAC that minimizes its misprediction rate. We also propose our own design, loosely based on the BTB (branch target buffer) and named LTAPB, which removes two cycles of load latency. Finally, we combine the two methods (the Hybrid Method) to obtain the best performance gain. We use the execution cycle count of each benchmark to evaluate the overall performance improvement, and we observe the prediction miss rate of each design on each benchmark. We use the Spec95 benchmarks and the SimpleScalar tool set, an execution-driven simulator, for the simulations. The results show that FAC-like outperforms FAC on all benchmarks and that the Hybrid Method achieves the best overall performance improvement.
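The abstract describes predicting a load's effective address before the real base-plus-offset addition completes, and measuring a prediction miss rate. The record itself gives no implementation details, so the following is only a simplified sketch under stated assumptions: it guesses the address with a carry-free addition (a bitwise OR of base and offset), which matches the true sum whenever the two operands share no set bits. The function names, the 32-bit address width, and the sample loads are all hypothetical and are not taken from the thesis.

```python
# Illustrative sketch only: a carry-free guess at base + offset.
# Names, the 32-bit width, and the sample data are assumptions,
# not the design evaluated in the thesis.

MASK32 = 0xFFFFFFFF  # assumed 32-bit address space


def predict_address(base: int, offset: int) -> int:
    """Guess the effective address without carry propagation (bitwise OR)."""
    return (base | offset) & MASK32


def actual_address(base: int, offset: int) -> int:
    """Compute the real effective address with a full addition."""
    return (base + offset) & MASK32


def misprediction_rate(loads) -> float:
    """Fraction of (base, offset) pairs whose guess differs from the real sum."""
    loads = list(loads)
    if not loads:
        return 0.0
    misses = sum(
        1 for base, offset in loads
        if predict_address(base, offset) != actual_address(base, offset)
    )
    return misses / len(loads)


# An aligned base with a small offset shares no bits, so the guess is right;
# overlapping bits generate a carry, so the guess is wrong.
sample_loads = [(0x1000, 0x10), (0x1008, 0x8)]
rate = misprediction_rate(sample_loads)
```

In this toy model a wrong guess would have to be detected by comparing against the full addition and the access replayed, which is why the prediction miss rate is reported alongside the execution cycle count.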
URI:	http://140.113.39.130/cdrfb3/record/nctu/#NT860392059 http://hdl.handle.net/11536/62793
Appears in collections:	Thesis