Title: High Issue Rate X86 Microprocessor Design and Simulation Techniques (高派發率X86微處理機設計與模擬技術)
Authors: R-Ming Shiu (徐日明)
Chung-Ping Chung (鍾崇斌)
Institute of Computer Science and Engineering
Keywords: X86 microprocessor; superscalar; CISC; instruction decode; memory access prediction; single-pass trace simulation
Date issued: 1999
Abstract:
In recent x86 microprocessors, superscalar techniques are widely used to achieve higher performance by executing multiple instructions in parallel. To exploit more of the instruction-level parallelism in current commercial programs on x86 superscalar microprocessors, we study critical high issue rate topics in x86 microprocessors. These topics include i) instruction decoding at a high issue rate and ii) predictive data load/store scheduling, both of which differ substantially from their counterparts in RISC processors. Furthermore, to build an efficient simulation environment for x86 research, we develop iii) single-pass trace simulation techniques for x86 superscalar micro-architectures.
In the first topic, high issue rate decoding, we examine the strategies for translating x86 instructions into primitive operations (POPs) and the decoding rules, in order to achieve a higher degree of parallel execution. The semantics of x86 instructions are complex, so the decoders need to translate the instructions into POPs. There are two POP translation strategies: one merges address generation into the load/store operations, and the other emits separate address generation operations. Simulation results show that, in high issue rate decoders, the latter strategy improves performance by 20% to 25%. In addition, we find that giving the UMAB the ability to snoop the result buses exposes a still higher degree of parallel execution. Considering the tradeoff between hardware cost and performance, we recommend a cost-effective decoding rule.
In the second topic, predictive load/store scheduling, we develop several predictive load/store scheduling policies suited to x86 superscalar processors. The proportion of memory access instructions on x86 microprocessors is relatively high, so exploiting parallelism among memory accesses becomes crucial at high superscalar degrees.
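The two POP translation strategies from the first topic can be made concrete with a small sketch. It translates one register-memory instruction, `add eax, [ebx+esi*4+8]`, both ways. The POP mnemonics (`AGEN`, `LOAD`, `ADD`), the temporary names (`t0`, `t1`), and the helper `translate_add_reg_mem` are hypothetical, chosen only for illustration; the abstract does not give the thesis's actual POP encoding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemOperand:
    """An x86 memory operand of the form [base + index*scale + disp]."""
    base: str
    index: Optional[str] = None
    scale: int = 1
    disp: int = 0

    def expr(self) -> str:
        parts = [self.base]
        if self.index is not None:
            parts.append(f"{self.index}*{self.scale}")
        if self.disp:
            parts.append(str(self.disp))
        return "+".join(parts)


def translate_add_reg_mem(dst: str, mem: MemOperand, separate_agen: bool) -> List[str]:
    """Translate `add dst, [mem]` into a list of hypothetical POPs.

    separate_agen=False: address generation is merged into the load POP.
    separate_agen=True:  address generation is its own POP, so it can be
                         issued independently of the memory access.
    """
    if separate_agen:
        return [
            f"AGEN t0 <- {mem.expr()}",
            "LOAD t1 <- [t0]",
            f"ADD {dst} <- {dst}, t1",
        ]
    return [
        f"LOAD t0 <- [{mem.expr()}]",
        f"ADD {dst} <- {dst}, t0",
    ]


operand = MemOperand(base="ebx", index="esi", scale=4, disp=8)
merged = translate_add_reg_mem("eax", operand, separate_agen=False)
split = translate_add_reg_mem("eax", operand, separate_agen=True)
```

The split form produces one more POP per memory instruction, but the address computation no longer has to wait for a load/store unit, which is one plausible reason a wide decoder could benefit from it.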
Traditional prediction techniques developed for RISC processors suffer from a lengthened misprediction penalty and thus cannot work effectively when applied to x86 processors. To increase the prediction accuracy, we develop new address and dependency prediction policies. We improve dependency prediction by adding forwarding prediction, refining the predictions with a 2-bit counter, and filtering out predictions that are likely to be wrong with another 2-bit counter. To reduce the misprediction penalty, we consider the pipeline stage at which the prediction is made and the strategies for handling loaded data. Experimental results show that, by reducing the misprediction penalty and increasing the prediction accuracy, the predictive scheduling proposed in this work significantly improves performance.
In the third topic, single-pass trace simulation, we develop single-pass techniques to build an efficient trace-driven simulator for whole x86 superscalar processors. Single-pass trace simulation techniques were developed to evaluate many design configurations in one simulation run. However, these techniques are suited only to storage structures that have the inclusion property. The major difficulty in this topic is that neither the pipeline with an out-of-order mechanism nor the branch prediction buffer (BTB) has the inclusion property. Thus, we develop single-pass simulation techniques for the BTB and the out-of-order mechanism separately. For the out-of-order mechanism, we put the incoming instructions in a unified instruction progression queue and enumerate the possible pipeline states in a pipeline state vector. For the BTB, the difficulty arises because the prediction information in the BTB has no inclusion property. We propose the state vector method and the state link method to overcome this difficulty. The state vector method enumerates the states of all the possibilities, whereas the state link method records only the locations in the state vector where the state changes.
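The 2-bit counters mentioned for dependency prediction are the standard saturating-counter building block. The following is a minimal sketch of one used to predict whether a load depends on an earlier store. The table organization (one counter per load address, initial value 1, the class and method names) is an assumption made for illustration; the abstract does not describe how the thesis organizes its predictor.

```python
from typing import Dict


class TwoBitCounter:
    """Saturating counter in the range 0..3; values 2 and 3 predict 'yes'."""

    def __init__(self, value: int = 1) -> None:
        self.value = value

    def predict(self) -> bool:
        return self.value >= 2

    def update(self, outcome: bool) -> None:
        if outcome:
            self.value = min(3, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


class DependencyPredictor:
    """Hypothetical per-load table: does this load depend on an earlier store?"""

    def __init__(self) -> None:
        self.table: Dict[int, TwoBitCounter] = {}

    def _entry(self, load_pc: int) -> TwoBitCounter:
        return self.table.setdefault(load_pc, TwoBitCounter())

    def predict_dependent(self, load_pc: int) -> bool:
        return self._entry(load_pc).predict()

    def update(self, load_pc: int, was_dependent: bool) -> None:
        self._entry(load_pc).update(was_dependent)


predictor = DependencyPredictor()
history = []
for outcome in (True, True, False, True):
    history.append(predictor.predict_dependent(0x401000))
    predictor.update(0x401000, outcome)
```

Because the counter has to be pushed across the midpoint before the prediction flips, a single atypical outcome does not change a well-established prediction, which is the hysteresis that makes 2-bit counters useful for refining or filtering predictions.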
By integrating the single-pass simulation techniques we developed for the out-of-order mechanism and the BTB with the traditional single-pass simulation for caches, our simulator becomes a complete single-pass simulation platform for whole x86 superscalar microprocessors. When 10 sets of configurations are evaluated, this single-pass simulation is 4.15 times faster than conventional simulation in terms of simulation time.
We further apply the state vector method to the single-pass simulation of multi-processor (MP) cache coherence protocols. By inserting a bubble state to imitate the inclusion property, we develop a single-pass simulation that measures not only performance, as traditional simulation does, but also the bus traffic of MP caches with various coherence protocols and sizes.
With the critical topics discussed in this dissertation addressed, an efficient simulation environment, and from it a high issue rate x86 micro-architecture, can be built. We hope the efforts in this research can contribute to the design of future high issue rate x86 microprocessors.
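As background for the inclusion property that the abstract relies on, the sketch below shows the classic stack-distance approach to single-pass cache simulation. It is not the thesis's simulator. It assumes a fully associative LRU cache, for which the contents of a smaller cache are always a subset of a larger one, so one scan of the trace yields miss counts for every size. The function name `lru_miss_counts` and the toy trace are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List


def lru_miss_counts(trace: Iterable[int], sizes: List[int]) -> Dict[int, int]:
    """Return {cache size in blocks: miss count} from one pass over the trace."""
    stack: List[int] = []        # LRU stack, most recently used block first
    distance_hist: Counter = Counter()  # stack distance -> number of references
    cold_misses = 0              # first-time references miss in every size

    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1   # 1-based stack distance
            distance_hist[depth] += 1
            stack.remove(block)
        else:
            cold_misses += 1
        stack.insert(0, block)

    # A reference at stack distance d hits in any cache holding at least d blocks.
    return {
        size: cold_misses + sum(n for d, n in distance_hist.items() if d > size)
        for size in sizes
    }


misses = lru_miss_counts([1, 2, 3, 1, 2, 3], sizes=[1, 2, 3])
```

Structures such as a BTB's prediction state or an out-of-order pipeline do not satisfy this subset relation across configurations, which is why the abstract describes separate techniques for them.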
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880392090
http://hdl.handle.net/11536/65491
Appears in collections: Thesis