Title: 具复杂运算单元之低功率多执行绪资料路径的研究与设计
Study on Improving Utilization for Low-Power Multithreaded Datapath with Composite Functional Units
Authors: 卓毅
Yi Cho
刘志尉
Chih-Wei Liu
Institute of Electronics
Keywords: low-power; multithreaded; datapath; functional unit; hardware utilization
Issue Date: 2007
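Abstract: Observing how processors have evolved in recent years, we find that reduced instruction set computer (RISC) processors have become a mainstream design. Their simple, regular instruction sets make it easy to pipeline instruction execution and raise processor performance. However, because each issued instruction performs only one operation, hardware utilization is low. Multi-issue processors, namely very long instruction word (VLIW) processors, exploit instruction-level parallelism (ILP) to raise hardware utilization, but their register file area grows sharply as functional units are added, which carries a heavy hardware cost. In this thesis, we propose a datapath with composite functional units (composite FUs) that cascades several functional units in a customized order, so that a single instruction processes several consecutive primitive operations and hardware utilization rises. The composite FU eases the VLIW problem of register file area growing sharply as functional units (FUs) are added. Because it can also perform several operations after fetching its operands and before writing the result back, it reduces the total number of register accesses and thereby lowers power. In addition, we use an integrated pipelining design flow to raise overall performance (operating frequency), together with an interleaved multithreaded architecture that fully hides the instruction latency introduced by pipelining. We also propose an automated composite FU generator that analyzes the data-flow graph of a user-supplied application and automatically produces an optimized composite FU. From an analysis of several typical DSP applications, the hardware utilization (operations per cycle) of the MSA composite FU (a cascade of a multiplier M, a shifter S, and an adder A) rises to 1.35, compared with 1.00 for a RISC processor. In synthesis with the TSMC 0.13 um process at equal computing performance, the composite FU occupies about 10% more area than RISC but about 50% less than VLIW. The composite FU consumes 16.6% to 31.6% less power than RISC and VLIW.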
Observing the evolution of processors in recent years, we find that Reduced Instruction Set Computer (RISC) processors have become the mainstream design style. The simplicity and regularity of RISC suit pipelined designs that boost performance. However, hardware utilization is low because each issued instruction executes only one operation. Multi-issue (VLIW) processors take advantage of Instruction Level Parallelism (ILP) to raise hardware utilization, but the register file (RF) area of a VLIW processor grows sharply as the number of functional units increases, which imposes a large hardware overhead. In this thesis, we propose a datapath with composite functional units (FUs). It cascades several functional units in a customized order to perform multiple consecutive primitive operations in a single cycle, raising hardware utilization. The number of register file read and write ports for composite FUs increases by only 1 or remains unchanged, which relieves the pressure of large RF area. In addition, a composite FU can perform several operations after fetching its operands and only then write back. The resulting reduction in total register accesses lowers power. Furthermore, pipelining is integrated to boost performance, and an Interleaved Multithreaded (IMT) architecture is used to hide completely the instruction latency that pipelining introduces. We also propose a recursive composite FU generator that automatically generates the best composite FU by analyzing a Data Flow Graph (DFG) supplied by the user. From the analysis of several classic DSP kernels, the hardware utilization of the MSA-ordered composite FU (a cascade of a multiplier, a shifter, then an adder) is 1.35, compared with 1.00 for RISC. Synthesis analysis uses the TSMC 0.13 um process. At the same performance, the register file area of the composite FU is 10% larger than that of RISC and 50% smaller than that of VLIW.
Compared with RISC and VLIW, the composite FU reduces power consumption by 16.6% to 31.6%.
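The register-access saving described in the abstract can be illustrated with a small counting model. The sketch below is not taken from the thesis: the `RegisterFile` class, the function names `run_risc` and `run_msa`, and the choice of an immediate shift amount are all assumptions made for illustration. It computes `((a * b) >> shift) + c` once as three single-operation RISC-style instructions and once as a single MSA-style composite instruction, and counts register file reads and writes in each case.

```python
from dataclasses import dataclass


@dataclass
class RegisterFile:
    """Toy register file that counts every read and write access."""
    regs: list
    reads: int = 0
    writes: int = 0

    def read(self, idx):
        self.reads += 1
        return self.regs[idx]

    def write(self, idx, value):
        self.writes += 1
        self.regs[idx] = value


def run_risc(rf, a, b, shift, c, dst):
    """Hypothetical RISC sequence: each instruction does one operation
    and writes its intermediate result back to the register file."""
    rf.write(dst, rf.read(a) * rf.read(b))   # MUL dst, a, b
    rf.write(dst, rf.read(dst) >> shift)     # SHR dst, dst, #shift (immediate)
    rf.write(dst, rf.read(dst) + rf.read(c)) # ADD dst, dst, c
    return 3  # instructions issued


def run_msa(rf, a, b, shift, c, dst):
    """Hypothetical MSA composite instruction: multiplier, shifter and adder
    are cascaded, so intermediates never return to the register file."""
    product = rf.read(a) * rf.read(b)        # multiplier stage
    shifted = product >> shift               # shifter stage (immediate amount)
    rf.write(dst, shifted + rf.read(c))      # adder stage, single write-back
    return 1  # instructions issued


# Usage: same operands, same result, different access counts.
risc_rf = RegisterFile([6, 7, 0, 5, 0])
msa_rf = RegisterFile([6, 7, 0, 5, 0])
risc_issued = run_risc(risc_rf, a=0, b=1, shift=1, c=3, dst=4)
msa_issued = run_msa(msa_rf, a=0, b=1, shift=1, c=3, dst=4)
print("RISC:", risc_issued, "instr,", risc_rf.reads, "reads,", risc_rf.writes, "writes")
print("MSA: ", msa_issued, "instr,", msa_rf.reads, "reads,", msa_rf.writes, "writes")
```

In this toy model the composite instruction needs three read ports rather than two, in line with the abstract's statement that the port count grows by at most one. The power and area figures reported above come from synthesis and cannot be reproduced by a counting model like this one.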
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009411619
http://hdl.handle.net/11536/80532
Appears in Collections: Thesis


Files in This Item:

  1. 161901.pdf
