# National Chiao Tung University, College of Electrical and Computer Engineering, Industrial Technology R & D Master Program on IC Design

#### Master's Thesis

# Ultra Low Power Flip Flop Design for Network-on-chip and Viterbi Decoder Application

Chinese title: 低功率正反器之設計應用於晶片網路以及維特比解碼器

Student: Yin-Ling Wang (王尹伶)

Advisor: Dr. Wei Hwang (黃威)

September 2008 (ROC year 97)

A thesis submitted to the College of Electrical and Computer Engineering, National Chiao Tung University, in partial fulfillment of the requirements for the degree of Master in the Industrial Technology R & D Master Program on IC Design.

August 2008, Hsinchu, Taiwan, Republic of China

#### Abstract (translated from the Chinese version)

This thesis uses low-power circuit design techniques to implement clocked storage elements. An edge-triggered flip-flop suited to a low-swing clock (LCSFF) is proposed, and it is designed and laid out in UMC 90nm standard cell technology. This single-edge-triggered flip-flop uses a low-voltage clock delay circuit to generate the triggering waveform, and it uses transistor stacking to reduce leakage current for low power.

The low clock swing flip-flop is well suited to systems that need a large number of storage elements. In this thesis it is applied to the serializer and deserializer of a network-on-chip and to the survivor memory unit of a Viterbi decoder. Simulation results show that these applications reduce power consumption by at least 27.5%.

#### Abstract

Clocked storage elements using low-power techniques are realized in this thesis. A low clock swing edge-triggered flip-flop (LCSFF), suitable for applications with low switching activity, is proposed, simulated in UMC 90nm technology, and laid out as a UMC 90nm standard cell. The single edge-triggered flip-flop uses a low-swing-voltage delay chain to generate the transparency window, which reduces power consumption. The flip-flop also uses the power gating technique to reduce leakage current. The LCSFF is suitable for systems that use a large number of storage elements. In this thesis it is applied to the serializer and deserializer in a network-on-chip and to the survivor memory unit in a Viterbi decoder.
The simulation results show that these applications can save more than 27.5% of the power.

### Acknowledgments

First, I would like to thank my advisor, Professor Wei Hwang, who taught me with all his effort in both foundational coursework and research direction. He offered not only conceptual guidance but also many research resources, so that I gained a great deal of professional knowledge and also learned much about how to conduct myself and deal with others.

Next, I would like to thank the senior students and classmates in the laboratory. Whether we were discussing coursework projects or research directions, they never hesitated to give their greatest help and encouragement. Special thanks go to Professor Chen-Yi Lee (李鎮宜) of Si2lab and his students for their assistance with research resources.

Finally, I thank my parents, family, and friends, who supported my daily life so that I could devote myself fully to finishing the thesis without worries, and who gave me their greatest love and care when I felt lost and uncertain.

Lastly, I would like to share the results of this thesis with everyone who has helped me.

## **Contents**

- Chapter 1 Introduction
- Chapter 2 Low Power Digital Circuit Design Concepts and Overview of Network on Chip and Channel Coding
  - 2.1 Introduction
  - 2.2 Device Characteristic
    - 2.2.1 Size Issue
    - 2.2.2 Power Issue
  - 2.3 Introduction to the Abstraction Levels of Network on Chip
  - 2.4 Introduction of Channel Coding
- Chapter 3 Low Power Pulse-based Flip-Flop Design
  - 3.1 Introduction
  - 3.2 Flip-Flop Characterization Events
    - 3.2.1 Timing Factors
    - 3.2.2 Energy Factors
  - 3.3 Low Power Techniques of Flip-Flop
    - 3.3.1 Clock Gating Technique
    - 3.3.2 Data Gating Technique
    - 3.3.3 Power Gating Technique
  - 3.4 Conventional Edge-Triggered Flip-Flop
    - 3.4.1 Master Slave Flip-Flop
    - 3.4.2 Pulse Triggered Flip-Flop
      - 3.4.2.1 Explicit-Pulsed Flip-Flop
      - 3.4.2.2 Implicit-Pulsed Flip-Flop
  - 3.5 Proposed Edge-Triggered Flip-Flop Design and Simulation Result
    - 3.5.1 The Motivation of the Proposed New Flip-Flop
    - 3.5.2 Proposed Pulse Generator of Flip-Flop
    - 3.5.3 Proposed Low Clock Swing Flip-Flop
    - 3.5.4 Simulation Result and Comparisons
    - 3.5.5 Layout and Post Simulation Result
- Chapter 4 Low Clock Swing Flip-Flop Design for Serializer/Deserializer in Network on Chip
  - 4.1 Introduction
  - 4.2 Practical Application in NoC Framework
    - 4.2.1 Triplication Error Correction Coding Stage
    - 4.2.2 Green Bus Coding Stage for Crosstalk Avoidance
  - 4.3 Serializer and Deserializer Design for NoC
    - 4.3.1 Introduction of Serializer and Deserializer
- Chapter 5 Low Clock Swing Flip-Flop Application of Viterbi Decoder
  - 5.1 Introduction
  - 5.2 The Design of Proposed Viterbi Decoder
    - 5.2.1 Implementation of SST
    - 5.2.2 Radix-2x2 ACS Structure
    - 5.2.3 Implementation of Path Merging Detection Unit
  - 5.3 Simulation and Implementation Results
- Chapter 6 Conclusion and Future Work
- Bibliography

## **List of Figures**

- Fig 2.1 The die size predicted by S. Borkar
- Fig 2.2 The gate length predicted by ITRS
- Fig 2.3 The growth trend in power
- Fig 2.4 The leakage sources for the static CMOS transistor
- Fig 2.5 The abstraction levels of NoC
- Fig 2.6 Block diagram of a digital communication system
- Fig 3.1 Waveform diagram of setup time and hold time
- Fig 3.2 The diagram of setup time and hold time
- Fig 3.3 The diagram of clock gating technique
- Fig 3.4 The conditional capture flip-flop (CCFF)
- Fig 3.5 The diagram of data gating technique
- Fig 3.6 (a) Delay cell (Dly) for producing CKDB from CK; (b) AND gate for producing mapped input X from AND operation of D and QB1; (c) Conditional Data Mapping Flip-Flop (CDMFF)
- Fig 3.7 (a) Conditional precharge technique; (b) Conditional discharge technique
- Fig 3.8 Conditional discharge double-edge triggered flip-flop (CDFF)
- Fig 3.9 The Transmission-Gate Flip-Flop (TGFF)
- Fig 3.10 The race condition in TGFF
- Fig 3.11 The C2MOS Flip-Flop
- Fig 3.12 The diagram of explicit-pulsed flip-flop and the prevalent pulse generator circuit
- Fig 3.13 Dual-edge triggered static pulsed flip-flop: (a) DESPFF; (b) Pulse generator and waveform
- Fig 3.14 The Clock Gated Static Pulsed Flip-Flop (CGSPFF): (a) The flip-flop part of CGSPFF; (b) The pulse generator part of CGSPFF
- Fig 3.15 The diagram of implicit-pulsed flip-flop and the prevalent pulse generator circuit
- Fig 3.16 The Hybrid Latch Flip-Flop (HLFF)
- Fig 3.17 Clock Branch Sharing Implicit Pulse Flip-Flop (CBS_IP) and waveform
- Fig 3.18 Proposed Low Swing Inverter Chain
- Fig 3.19 Proposed Low Clock Swing Flip-Flop (LCSFF)
- Fig 3.20 The Waveform of Transparent Window
- Fig 3.21 The Data Tolerance Simulation Result
- Fig 3.22 The layout view of LCSFF in UMC90nm standard cell
- Fig 3.23 The post simulation of setup time and hold time
- Fig 4.1 Traditional Synchronous Bus
- Fig 4.2 Network-on-Chip Architecture
- Fig 4.3 A simple architecture of Network on Chip
- Fig 4.4 Interconnect delay and gate delay under different ...
- Fig 4.5 A joint bus and error correction coding scheme with serializers and deserializer in network-on-chip
- Fig 4.6 Triplication error correction coding scheme
- Fig 4.7 Design flow of green bus coding
- Fig 4.8 (a) 4-to-5 Green bus coding scheme; (b) Original set and converted set of Green bus code
- Fig 4.9 Circuit implementation of green bus coding: (a) Encoder; (b) Decoder
- Fig 4.10 K bit -to- (K/N) bit serialization with N:1 ratio
- Fig 4.11 (a) Energy and (b) area of an NoC according to serialization ratio
- Fig 4.12 Power Simulation Result of Different Numbers of Wire
- Fig 4.13 The tree-based serializer and waveform
- Fig 4.14 The shift-register serializer and operation waveform
- Fig 4.15 The mixed structure serializer and operation waveform
- Fig 4.16 The shift-register description and operation waveform
- Fig 4.17 SPICE simulation waveform of 4-1 serializer
- Fig 4.18 SPICE simulation waveform of 1-4 deserializer
- Fig 5.1 The (2, 1, 2) convolutional encoder
- Fig 5.2 The trellis diagram of the convolutional encoder in Fig 5.1
- Fig 5.3 Path merging phenomenon in Viterbi decoding over a noisy channel
- Fig 5.4 The conventional block diagram of Viterbi decoder
- Fig 5.5 The block diagram of proposed Viterbi decoder
- Fig 5.6 The convolutional encoder of MB-OFDM UWB system
- Fig 5.7 The pre-decoder for the convolutional encoder
- Fig 5.8 The 4-state radix-4 and radix-2x2 trellis diagrams: (a) 4-state radix-4 trellis diagram; (b) 4-state radix-2x2 trellis diagram
- Fig 5.9 The radix-4 and radix-2x2 ACS units: (a) Radix-4 ACS unit; (b) Radix-2×2 ACS unit
- Fig 5.10 A RE-based survivor memory with variable truncation length
- Fig 5.11 The implementation of variable truncation length
- Fig 5.12 The power simulation results in different channel conditions: (a) The power of whole Viterbi decoder; (b) The power of the survivor memory

## **List of Tables**

- Table 3.1 The simulation result compared with typical inverter chain
- Table 3.2 The comparisons of proposed LCSFF with other flip-flops
- Table 3.3 The presimulation of setup time and hold time of LCSFF
- Table 3.4 The simulation result of LCSFF
- Table 3.5 The post simulation of setup time and hold time of LCSFF
- Table 4.1 Design background of simulation of serializer
- Table 4.2 Design background of simulation of deserializer
- Table 5.1 Comparison of complexity between radix-4 and radix-2×2 ACS units
- Table 5.2 The gate counts of different comparators and multiplexers
- Table 5.3 The gate counts of different implementations
- Table 5.4 Design parameters of the proposed Viterbi decoder
- Table 5.5 Simulation result compared with other Viterbi decoders

#### **Chapter 1**

#### Introduction

In the deep-submicron or nanometer era, power dissipation is becoming a major design requirement, not only in portable applications but also in high-performance VLSI systems. In a highly synchronous system, the clock network contributes up to 45% of the overall system power dissipation, and the clocked storage elements consume 90% of the clock network's power. A low-power clocked storage element can therefore reduce the total system power considerably, which makes it an important concept in digital circuit design. In this thesis, I focus on the design of low-power clocked storage elements by means of reduced clock loading and a low-swing technique.

In chapter 2, low-power digital circuit design concepts are discussed, with emphasis on device characteristics and circuit design concepts. The evolution of modern flip-flops is discussed in chapter 3, where the new design is also developed. A low clock swing flip-flop (LCSFF) is proposed.
This flip-flop uses a low-swing-voltage technique, a conditional capture technique, and a transistor stacking technique that reduces the leakage current in the flip-flop. It can reduce both dynamic and static power efficiently. Chapter 4 proposes a serializer and deserializer with a self-corrected green coding scheme for Network-on-Chip. Chapter 5 presents the application of the low clock swing flip-flop in the survivor memory unit of a low-power Viterbi decoder for wireless communication systems. Chapter 6 gives the conclusions and future work.

#### **Chapter 2**

# Low Power Digital Circuit Design Concepts and Overview of Network on Chip and Channel Coding

#### 2.1 Introduction

The power consumption of circuits and systems is critically important in modern VLSI, especially for low-power applications; hence, power optimization techniques are applied at different levels of digital design. Designing low-power logic is one of the most important tasks in minimizing the power consumption of digital circuits. Although analog IC, RF IC, and wireless applications have become hot research topics in recent years, digital circuits still make up the major part of a chip: in a conventional IC, 90% of the area is digital. A good IC designer therefore has to study the basic digital circuit design concepts first. Section 2.2 presents some device characteristics and discusses the basic concepts of device size, power, and performance [1]. The remaining sections introduce Network on Chip and channel coding.

#### 2.2 Device Characteristic

#### 2.2.1 Size Issue

As technology scales in modern VLSI design, more and more transistors are integrated into a chip, which makes it possible to embed more and more function blocks in a system. Moore's Law predicted that semiconductor technologies would double their effectiveness every 18 months, and it still holds.
Fig 2.1 shows the transistor counts of the Intel Pentium series. The die size grows exponentially, which indirectly supports Moore's Law.

Fig 2.1 The die size predicted by S. Borkar

Fig 2.2 shows the gate length trend predicted by the International Technology Roadmap for Semiconductors (ITRS) in 2003.

Fig 2.2 The gate length predicted by ITRS

#### 2.2.2 Power Issue

In the deep-submicron or nanometer era, integrating more and more transistors into one chip brings out the power problem. Fig 2.3 shows the trend in processor power consumption. The power consumption is increasing exponentially and may reach 18 kW in 2008. High power consumption shortens battery life in portable applications and makes the system hard to cool. To resolve the thermal problem, more and more low-power techniques and technologies have been proposed, some of which are introduced in chapter 3.

Power consumption has a dynamic part and a static part. The dynamic part is mainly attributed to the charging and discharging of capacitances. The static part is associated with the leakage current of the transistors. As CMOS technology scales down, the high leakage current in deep-submicron regimes has become a significant contributor to the power dissipation of CMOS circuits, in addition to the dynamic power consumption [1]. Fig 2.4 shows a cross-sectional view of a MOSFET that illustrates the leakage currents. Subthreshold leakage power is expected to become a significant fraction of the total power in sub-100 nm CMOS technology, where reducing it is crucial.

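
To give a feel for why subthreshold leakage matters so much as devices scale, the following sketch uses the common textbook approximation $I_{sub} \approx I_0 \, e^{(V_{GS} - V_{th})/(n \, v_T)}$. This model is an assumption brought in for illustration and is not stated in the thesis; the prefactor `i0`, the slope factor `n`, and the threshold voltages are hypothetical values.

```python
import math

BOLTZMANN = 1.380649e-23            # J/K
ELECTRON_CHARGE = 1.602176634e-19   # C


def thermal_voltage(temp_kelvin: float = 300.0) -> float:
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def subthreshold_current(i0: float, vgs: float, vth: float,
                         n: float = 1.5, temp_kelvin: float = 300.0) -> float:
    """Assumed textbook model: I = I0 * exp((Vgs - Vth) / (n * vT)).

    i0 and n are hypothetical fitting parameters, not thesis data.
    """
    return i0 * math.exp((vgs - vth) / (n * thermal_voltage(temp_kelvin)))


def leakage_power(i_leak: float, vdd: float) -> float:
    """Static power drawn by a leakage current from the supply."""
    return i_leak * vdd


# An OFF transistor (Vgs = 0) with two hypothetical threshold voltages.
high_vth = subthreshold_current(i0=1e-6, vgs=0.0, vth=0.3)
low_vth = subthreshold_current(i0=1e-6, vgs=0.0, vth=0.2)
leakage_ratio = low_vth / high_vth
print(f"Lowering Vth by 100 mV multiplies leakage by about {leakage_ratio:.1f}")
```

Under these assumed parameters, a 100 mV reduction in threshold voltage raises the OFF current by roughly an order of magnitude, which is consistent with the concern that leakage becomes a significant fraction of total power as supply and threshold voltages scale down.
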
Fig 2.4 The leakage sources for the static CMOS transistor

- I1: PN junction reverse bias current
- I2: Weak inversion
- I3: Drain-induced barrier lowering (DIBL)
- I4: Gate-induced drain leakage (GIDL)
- I5: Punchthrough
- I6: Narrow width effect
- I7: Gate oxide tunneling
- I8: Hot carrier injection

More details of low-power design are presented in the remaining chapters.

#### 2.3 Introduction to the Abstraction Levels of Network on Chip

For a micro-network, the protocol stack is reduced to the physical layer, the data-link layer, the network and transport layers, and the software layer [35], as shown in Fig 2.5. The characteristics of each layer are described in this section. The NoC protocols are described bottom-up, from the physical layer up to the software layer.

Fig 2.5 The abstraction levels of NoC

In the physical layer, global wires are the physical implementation of the communication channels. Traditional rail-to-rail voltage signaling with capacitive termination is not well suited to high-speed, low-energy communication on future global interconnect. Reduced swing can significantly reduce communication power dissipation while preserving the speed of data communication. Nevertheless, as technology trends lead to smaller voltage swings and capacitances, the upset probabilities will rise. A well-balanced design is important, because the overhead in performance, energy efficiency, and modularity may otherwise be too high. Physical layer design should find a compromise between competing quality metrics and provide a clean and complete abstraction of the channel characteristics to the micro-network layers above.

The data-link layer abstracts the physical layer as an unreliable digital link, where the probability of bit upsets is non-null. Furthermore, reliability can be traded off for energy.
The main purpose of data-link protocols is to increase the reliability of the link up to a minimum required level, under the assumption that the physical layer by itself is not sufficiently reliable. At the data-link layer, error correction can be complemented by several packet-based error detection and correction protocols. Several protocol parameters can be adjusted to achieve maximum performance at a specified residual error probability within given energy consumption bounds.

At the network layer, packet data transmission can be customized by the choice of switching and routing algorithms. The NoC designer establishes the connection path from a packet's source to its destination. Switching and routing heavily affect performance and energy consumption. Robustness and fault tolerance are also highly desirable.

At the transport layer, algorithms deal with the decomposition of messages into packets at the source and their reassembly at the destination. Packetization granularity is a critical design decision, because the behavior of most network control algorithms is very sensitive to packet size. Packet size can be application-specific in SoCs, as opposed to general networks.

The software layers comprise system and application software, which includes processing element and network operating systems. The system software provides an abstraction of the underlying hardware platform. Moreover, policies implemented at the system software layer request either specific protocols or parameters at the lower layers to achieve the appropriate information flow. The hardware abstraction is coupled to the design of wrappers for processor cores, which act as network interfaces between the cores and the NoC architecture.

#### 2.4 Introduction of Channel Coding

A communication system connects an information source to a destination through a channel. The physical channel may be a wireline cable, a microwave link, or even a storage medium.
Fig 2.6 shows a typical digital communication system. The transmitting end is composed of a source encoder, a channel encoder, and a modulator. The receiving end is composed of a demodulator, a channel decoder, and a source decoder.

Fig 2.6 Block diagram of a digital communication system

A signal is distorted by effects such as noise, interference, and fading as it passes through the channel. To overcome the channel effects, the channel encoder introduces some redundancy into the output of the source encoder, which is called the information sequence. Next, the modulator converts the new sequence with redundancy, called the codeword sequence, into analog signals that are transmitted through the channel. In the receiver, the demodulator estimates the transmitted signal and makes some errors because of channel noise. The demodulated sequence is called the received sequence, which may not match the codeword sequence because of these errors. The channel decoder uses the redundancy in the codeword to correct the errors in the received sequence and produces an estimate of the information sequence. The subject dealing with the design of the channel encoder and channel decoder, referred to as channel coding or error control coding, was developed to improve the performance of the overall system.

There are two main types of channel code: block codes and convolutional codes. For block codes, the encoder transforms a block of k information symbols into a block of n symbols called a codeword. These codes are usually referred to as (n, k) block codes. The (n-k) redundancy symbols, also termed parity symbols, depend only on the corresponding k information symbols and not on other information symbols. This means that a block code is memoryless. Some commonly used block codes are the Hamming code, the BCH code, the Reed-Solomon (RS) code, and the low-density parity-check (LDPC) code.

For convolutional codes, the encoder contains memory elements.
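
To make the role of the memory elements concrete, here is a minimal sketch of a rate-1/2 convolutional encoder with one input, two outputs, and two memory elements. The generator polynomials (7, 5) in octal are an assumption chosen because they are a common textbook example; this section of the thesis does not specify them, and the function name `conv_encode` is hypothetical.

```python
def conv_encode(bits, generators=(0b111, 0b101), memory=2):
    """Encode a bit list with a feed-forward convolutional encoder.

    generators: one tap mask per output (assumed (7, 5) octal here).
    memory:     number of memory elements holding previous input bits.
    """
    state = 0  # the previous `memory` input bits, most recent in the top bit
    encoded = []
    for bit in bits:
        # Shift register contents: current bit followed by the stored bits.
        register = (bit << memory) | state
        for taps in generators:
            # Each output is the XOR (parity) of the tapped register bits.
            encoded.append(bin(register & taps).count("1") % 2)
        # Drop the oldest bit; the current bit becomes the newest memory bit.
        state = register >> 1
    return encoded


# Each input bit produces two output bits that also depend on earlier inputs.
print(conv_encode([1, 0, 1, 1]))  # [1, 1, 1, 0, 0, 0, 0, 1]
```

Unlike a block code, the two output bits for a given input bit change with the encoder state, so the same input bit can map to different output pairs depending on what preceded it.
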
An (n, k, m) convolutional encoder has k inputs, n outputs, and m memory elements. A convolutional code converts the entire data stream into one single codeword by means of a linear shift-register circuit that performs a convolution operation on the information sequence. The encoded bits depend not only on the current k input bits but also on the previous bits. The Viterbi algorithm [1], proposed by A. J. Viterbi in 1967, is used to decode convolutional codes. Forney [2] later proved that the Viterbi algorithm is a maximum likelihood (ML) decoding algorithm. The Viterbi algorithm remains the optimal solution for decoding convolutional codes and has become an important algorithm in communication systems.

#### **Chapter 3**

#### Low Power Pulse-based Flip-Flop Design

#### 3.1 Introduction

As technology scales down in VLSI design, it is possible to build chips consisting of millions of transistors. Scaling brings both higher speed and larger power consumption, especially in deep-submicron technologies. Portable applications such as notebook PCs, personal digital assistants (PDA), mobile phones, and portable media players (PMP) are now widely used in modern society, and the public looks forward to longer battery lifetime. Furthermore, modern high-technology products are configured with multiple cores, which means that the most important issue in deep-submicron design will be power. Power dissipation is becoming a limiting factor in both high-performance and mobile applications [1][2].

The power consumption in CMOS circuits comes from static and dynamic parts:

- Static dissipation due to
  - Subthreshold conduction through OFF transistors
  - Tunneling current through gate oxide
  - Leakage through reverse-biased diodes
  - Contention current in ratioed circuits
- Dynamic dissipation due to
  - Charging and discharging of load capacitance
  - "Short-circuit" current while both pMOS and nMOS networks are partially ON

$$P_{total} = P_{static} + P_{dynamic} \tag{3.1}$$

$$P_{static} = I_{static} \cdot V_{DD} = (I_{dc} + I_{leakage}) \cdot V_{DD} \tag{3.2}$$

$$P_{dynamic} = \alpha \cdot C \cdot V_{DD}^2 \cdot f \tag{3.3}$$

- $\alpha$: switching probability in one clock cycle
- $f$: the clock frequency
- $C$: the load capacitance

Formula (3.3) shows that dynamic power dissipation is proportional to the square of the supply voltage, so reducing the supply voltage ($V_{DD}$) saves power efficiently. Furthermore, the dynamic part is mainly attributed to the charging and discharging of capacitances, so cutting down the load capacitance is the most significant improvement pursued in this work.

The principal source of power consumption in digital systems is the clock tree, which may consume up to 45% of the system power [3][4]. The clock system, which consists of the clock distribution network and the timing elements (flip-flops and latches), is one of the most power-consuming components in a VLSI system [5]. It accounts for 30% to 60% of the total power dissipation in a system [6]. Latches and flip-flops are pervasive in sequential circuit design, especially in pipelined circuits, signal processing, and communication systems. In summary, reducing the clock load capacitance in flip-flops will have a deep impact on the total power consumed.

The rest of this chapter is organized as follows. Section 3.2 describes flip-flop characterization events.
Section 3.3 describes the low-power design techniques for flip-flops. Section 3.4 describes conventional edge-triggered flip-flops. Section 3.5 proposes a low-power pulse-based flip-flop and presents the simulation results and comparisons.

#### 3.2 Flip-Flop Characterization Events

#### 3.2.1 Timing Factors

Flip-flops and latches are crucial design elements for which delay and energy matter greatly. The basic flip-flop timing parameters are setup time, hold time, and clock-to-Q delay. These are the indexes used to judge the performance of a flip-flop. Setup time is the time before the clock edge during which the data input must be stable to be latched correctly by the flip-flop. Hold time is the time after the clock edge during which the data input must remain stable to be latched correctly. Clock-to-Q delay is the delay from the active clock edge to the output. A simple diagram is shown in Fig 3.1.

Fig 3.1 Waveform diagram of setup time and hold time

A digital system has to satisfy equations (3.4) and (3.5) to avoid timing violations. Equation (3.4) gives the hold time margin: the sum of the clock-to-Q delay and the logic delay must be greater than or equal to the sum of the hold time and the relative clock skew. Equation (3.5) gives the setup time margin: the sum of the clock cycle time and the relative clock skew must be greater than or equal to the sum of the setup time, the clock-to-Q delay, and the logic delay. Fig 3.2 uses a simple example to explain the hold time and setup time margins.

$$(T_{q1} + T_d) \ge T_{hold} + (T_{c2} - T_{c1}) \tag{3.4}$$

$$(T_{cycle} + T_{c2} - T_{c1}) \ge T_{setup} + (T_{q1} + T_d) \tag{3.5}$$

Fig 3.2 The diagram of setup time and hold time

- $T_q$: clock-to-Q delay time of the flip-flop
- $T_d$: combinational logic delay time
- $T_c$: clock delay time
- $T_{hold}$: hold time
- $T_{setup}$: setup time
- $T_{cycle}$: cycle period time

#### 3.2.2 Energy Factors

Another crucial point in flip-flop design is power consumption, which includes dynamic and static parts.
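
As a rough numerical illustration of the dynamic part, the sketch below evaluates Eq. (3.3) for a given data pattern. It assumes that the switching activity $\alpha$ can be estimated as the fraction of adjacent bit pairs that differ; the function names and the component values (10 fF, 1.0 V, 100 MHz) are hypothetical and are not taken from the thesis.

```python
def switching_activity(pattern: str) -> float:
    """Estimate alpha as the fraction of adjacent bit pairs that differ."""
    if len(pattern) < 2:
        return 0.0
    toggles = sum(a != b for a, b in zip(pattern, pattern[1:]))
    return toggles / (len(pattern) - 1)


def dynamic_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """Eq. (3.3): P_dynamic = alpha * C * VDD^2 * f, in watts."""
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical operating point: 10 fF load, 100 MHz clock.
C_LOAD = 10e-15
FREQ = 100e6

for pattern in ("0101010101", "0011001100", "0000000000"):
    alpha = switching_activity(pattern)
    full_swing = dynamic_power(alpha, C_LOAD, 1.0, FREQ)
    half_swing = dynamic_power(alpha, C_LOAD, 0.5, FREQ)
    print(f"{pattern}: alpha={alpha:.2f}, "
          f"P(1.0 V)={full_swing:.2e} W, P(0.5 V)={half_swing:.2e} W")
```

Because of the quadratic voltage term, halving the swing in this model cuts the dynamic power to one quarter for any pattern, while a pattern that never toggles contributes no dynamic power at all and leaves only the static part.
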
When the input data is held constant, the power consumed is called static power consumption. If the next data value is the inverse of the previous one, the power consumed is called dynamic power consumption, as in equation (3.3). I therefore define the flip-flop switching probability as the "switching activity $\alpha$". I applied several input patterns to test the power efficiency of the flip-flop. The pattern "0101010101" ($\alpha = 1$), in which the input data always switches, reflects the maximum active power consumption. The patterns "0000000000" and "1111111111" ($\alpha = 0$) correspond to static power consumption and reflect the leakage power. The dynamic power consumption is always larger than the static power consumption, and the switching activity of a flip-flop differs among applications. Employing low-power techniques can reduce the switching activity considerably. In addition, using a flip-flop suited to the application is one way to meet the low-power target.

#### 3.3 Low Power Techniques of Flip-Flop

Most of the flip-flops presented here are dynamic in nature, and some internal nodes are precharged in each cycle without producing any useful activity at the output when the input is stable. Reducing this redundant switching activity has a profound effect on power dissipation, and many techniques have been presented in the literature for this purpose [5]-[8]. A brief survey of such techniques is conducted in this work, and the main techniques are classified into clock gating, data gating, and power gating.

#### 3.3.1 Clock Gating Technique

Fig 3.3 shows the general scheme of the clock gating technique. This technique is mainly applied to implicit pulse-triggered flip-flops such as the CCFF [7], which is shown in Fig 3.4. This flip-flop employs the internal clock-gating method. Briefly, in this technique, an output-signal-controlled gate (Q-controlled gate) is inserted on the path of the delayed clock to the first stage, as in Fig 3.3.

Fig 3.3 The diagram of clock gating technique

Flip-flops in this category feature a transparent window period that is used to sample the input. This transparent window, created by an implicit pulse generator, is determined by the time when both clocked transistors in the first stage are simultaneously on. In Fig 3.4, after a HIGH state is sampled at the input, the output **Q** will be HIGH. Note the NOR gate: this output state can be used to shut the transparent window for as long as it is HIGH, which prevents redundant activity on the internal node **X**.

Fig 3.4 The conditional capture flip-flop (CCFF)

In Fig 3.4, the conditional capture flip-flop (CCFF) is introduced to reduce redundant power at the internal node **X**. This flip-flop employs a scheme much like the JK-type flip-flop, but compared with the HLFF [8] it adds one more gate that switches with the clock. This addition increases the power consumed by the clock system, and it may offset the savings gained from reducing the internal redundant switching power.

#### 3.3.2 Data Gating Technique

Fig 3.5 shows the general scheme of the data gating technique. This technique is mainly applied to implicit pulse-triggered flip-flops such as the CDMFF [9], which is shown in Fig 3.6. This flip-flop employs the internal data-gating method. Briefly, in this technique, an output-signal-controlled gate (Q-controlled gate) is inserted on the path of the input data to the first stage, as in Fig 3.5.

Fig 3.5 The diagram of data gating technique

In Fig 3.6, I take the Conditional Data Mapping Flip-Flop (CDMFF) as a representative of the data gating technique. Fig 3.6(a) illustrates a delay cell "Dly", which produces **CKDB** from **CK** (the clock signal). The cell consumes less power than the conventional one, which comprises three inverters connected in series. Fig 3.6(b) illustrates a pass-gate version of the AND gate used to construct the data mapping, which decreases the latency of the data path.
Note that state-of-the-art cell libraries [10] [11] allow the use of pass-gate inputs for high performance and design flexibility. The inputs can be driven by a wide choice of logic gates.

Fig 3.6 (a) Delay cell (Dly) for producing CKDB from CK. (b) AND gate for producing mapped input X from AND operation of D and QB1. (c) Conditional Data Mapping Flip-Flop (CDMFF)

Fig. 3.6(c) illustrates the schematic of the CDMFF with a single-ended structure (s-CDMFF). The flip-flop consists of three stages. In the first stage, an AND gate generates a mapped input $\mathbf{X}$ from $\mathbf{D}$ and $\mathbf{QB}$. In the second stage, a single-ended pulse generator produces a reverse pulse $\mathbf{SB}$ from $\mathbf{X}$. In the third stage, a latch generates the output $\mathbf{Q}$ from $\mathbf{SB}$ and $\mathbf{DB}$.

#### 3.3.3 Power Gating Technique

The power gating technique is used by most low power flip-flops. I subdivide it into two parts, the conditional precharge technique and the conditional discharge technique, whose general schemes are shown in Fig. 3.7 (a) and (b), respectively.

Fig. 3.7 (a) shows the general scheme of the conditional precharge technique. The general idea is that the precharging path is controlled so that the internal node is not precharged while D stays HIGH. Without the PMOS precharge control, when D stays HIGH for a long time the discharge path is on during every evaluation period, causing node X to discharge after each precharging phase. To eliminate these charging/discharging activities, a PMOS transistor is inserted in the precharging path, which prevents the precharging of node X while the data input is stable HIGH.
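The effect of the conditional precharge control can be illustrated with a small behavioral model. The sketch below is not taken from the circuit in Fig. 3.7; it assumes an idealized cycle made of a precharge phase followed by an evaluation phase, and the function name `count_node_x_transitions` and the test patterns are hypothetical. It only counts how often node X is charged or discharged.

```python
def count_node_x_transitions(data_bits, conditional_precharge):
    """Count charge/discharge events on node X for a stream of D values.

    Hypothetical behavioral model, one entry of data_bits per clock cycle:
    a precharge phase (X pulled HIGH) is followed by an evaluation phase
    (X is discharged if D is HIGH).
    """
    x = 1  # node X starts precharged
    transitions = 0
    for d in data_bits:
        # Precharge phase: blocked by the extra PMOS while D is HIGH.
        precharge_allowed = not (conditional_precharge and d == 1)
        if precharge_allowed and x == 0:
            x = 1
            transitions += 1
        # Evaluation phase: the discharge path conducts when D is HIGH.
        if d == 1 and x == 1:
            x = 0
            transitions += 1
    return transitions


stable_high = [1] * 8      # D stays HIGH for eight cycles
alternating = [0, 1] * 4   # D toggles every cycle

for name, pattern in (("stable HIGH", stable_high), ("alternating", alternating)):
    plain = count_node_x_transitions(pattern, conditional_precharge=False)
    gated = count_node_x_transitions(pattern, conditional_precharge=True)
    print(f"{name}: {plain} transitions without control, {gated} with control")
```

Under these assumptions, the stable-HIGH pattern causes 15 transitions on node X without the control transistor and only 1 with it, while the alternating pattern causes 7 transitions in both cases. The saving therefore comes entirely from cycles in which the input does not change.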
Fig 3.7 (a) Conditional precharge technique (b) Conditional discharge technique

The clock-gating technique spends redundant power in the gate that controls the delivery of the delayed clock to the flip-flop. As a result, the conditional precharge technique outperforms the conditional capture technique in reducing the flip-flop's energy-delay product (EDP) [12]. However, the conditional precharge technique is difficult to apply to a double-edge triggered structure, because it would require many transistors.

In the conditional discharge technique, the extra switching activity is eliminated by controlling the discharge path when the input is stable HIGH. In this scheme, an NMOS transistor controlled by the output $\mathbf{Qb}$ is inserted in the discharge path of the stage with the high switching activity. When the input undergoes a LOW to HIGH transition, the output $\mathbf{Q}$ changes to HIGH and $\mathbf{Qb}$ to LOW. This transition at the output switches off the discharge path of the first stage, preventing it from discharging or evaluating in succeeding cycles as long as the input is stable HIGH.

Fig 3.8 Conditional discharge double-edge triggered flip-flop (CDFF)

Fig 3.8 shows the conditional discharge flip-flop (CDFF). It uses a pulse generator as in [13], which is suitable for double-edge sampling. The flip-flop is made up of two stages. The conditional discharging scheme is employed in the CDFF as follows. Stage one is responsible for capturing the LOW to HIGH transition. To reduce the redundant switching power, a discharge control transistor $\mathbf{N5}$ is placed in the discharge path of the first stage. When $\mathbf{Qb}$ = HIGH, which means $\mathbf{Q}$ = LOW and $\mathbf{X}$ = HIGH, $\mathbf{N5}$ turns on and the discharge path is enabled.
If the input $\mathbf{D}$ makes a LOW to HIGH transition while $\mathbf{CLK\_pulse}$ is HIGH, $\mathbf{N1}$, $\mathbf{N5}$ and $\mathbf{N3}$ switch on, the internal node $\mathbf{X}$ is discharged to LOW, and $\mathbf{Q}$ is pulled up to HIGH with $\mathbf{Qb}$ pulled down to LOW, which shuts off the NMOS stack in the first stage. For this LOW to HIGH transition of $\mathbf{D}$, $\mathbf{X}$ is discharged only once; that is, consecutive HIGH levels at $\mathbf{D}$ are not sampled, because the discharging path is inhibited by $\mathbf{N5}$, which is controlled by $\mathbf{Qb}$. If D was LOW during the sampling period, the first stage is disabled and node X retains its precharged state. Node Y, in contrast, is HIGH, so the discharge path in the second stage is enabled during the sampling period, allowing the output node to discharge and to capture the input data correctly.

## 3.4 Conventional Edge-Triggered Flip-Flop

Flip-flops are widely used not only in synchronous, high performance systems but also in system-on-chip (SoC) designs. Many kinds of flip-flops have been proposed to satisfy different system requirements. In this section I give an overview of each conventional flip-flop and illustrate its advantages and disadvantages. The section is organized in order of development.

## 3.4.1 Master Slave Flip-Flop

The master-slave flip-flop is the typical edge-triggered flip-flop and is designed to be immune to race conditions. It is composed of two latches: the master is sensitive to logic level 0, and the slave is sensitive to logic level 1. When the clock is low, the master is active and stores the input data, while the slave keeps the previous data in its feedback loop. When the clock is high, the state of the master is frozen by the clock, and the slave becomes active and captures the data from the master latch within a short time.
Because of this cascaded structure, the circuit works as an edge-triggered flip-flop.

Fig 3.9 The Transmission-Gate Flip-Flop (TGFF)

Fig 3.9 shows the transmission-gate master-slave flip-flop (TGFF), which is used in the PowerPC 603 [14]. The TGFF uses transmission gates, driven by opposite clock levels, to isolate the master part from the slave part. The transmission gates also shorten the data pass-through time and reduce the power consumption. Another feature is that the TGFF adds input gate isolation, which gives better noise immunity. The advantages of the TGFF are a short direct path and a low power feedback, but it has a large clock load, which affects the power consumption of the clock tree.

Using transmission gates as the isolation circuit has several merits, but it also causes some problems. The TGFF may suffer a race condition when the true clock and the inverted clock overlap, as illustrated in Fig 3.10.

(a) The clock with 1 - 1 overlap (b) The clock with 0 - 0 overlap

Fig 3.10 The race condition in TGFF

If the two control clocks have a 1-1 overlap, the data may pass through the NMOS gates from input to output. Similarly, the data may pass through the PMOS gates from input to output under a 0-0 overlap. To overcome the race problem, the TGFF was improved into the C<sup>2</sup>MOS flip-flop [15], shown in Fig 3.11, which replaces the transmission gates and inverters with C<sup>2</sup>MOS circuits. Compared with the TGFF, the C<sup>2</sup>MOS flip-flop is slower but more robust. There are other improved versions of the TGFF [16] [17]. They improve the performance or reduce the power consumption, but they are usually less robust.

Fig 3.11 The C<sup>2</sup>MOS Flip-Flop

# 3.4.2 Pulse Triggered Flip-Flop

Compared with latches, the master-slave flip-flop has a larger clock load and a larger latency.
The larger clock load increases the capacitance of the clock tree and therefore the power consumption of the overall system. The larger latency reduces the performance. To resolve these problems in high performance systems, the pulse-triggered latch was proposed. The negative setup time of the pulsed latch also allows a critical circuit to borrow time from the next cycle. In 130nm processes and beyond, static power is becoming a primary design issue, and reducing the leakage current is an efficient way to address it. Therefore many pulse-based flip-flops have been proposed specifically for low power.

The master-slave flip-flop has the hard-edge property, whereas pulsed flip-flops allow cycle stealing and are skew tolerant. A pulsed flip-flop works as an edge-triggered flip-flop by means of a pulse generator. In general there are two ways to generate the transparent window applied to a pulsed flip-flop, distinguished by the pulse generating method: explicit-pulsed flip-flops (ep-FF) and implicit-pulsed flip-flops (ip-FF).

#### 3.4.2.1 Explicit-Pulsed Flip-Flop

The diagram of the explicit-pulsed flip-flop and the prevalent pulse generator circuit are shown in Fig 3.12. The ep-FF uses a pulse generator outside the latching part, which generates the clock pulse from the normal clock. It generates a real pulse by using a delay inverter chain and a NAND gate, as also illustrated in Fig 3.12.

Fig 3.12 The diagram of explicit-pulsed flip-flop and the prevalent pulse generator circuit

In this part, I describe two classical explicit-pulsed flip-flops in detail. One is the Dual-Edge triggered Static Pulsed Flip-Flop (DESPFF) [18] and the other is the Clock Gated Static Pulsed Flip-Flop (CGSPFF) [19] [20]. They are illustrated in Fig 3.13 and Fig 3.14, respectively.

Edge-triggered flip-flops create a narrow sampling window to overcome the race problem found in simpler flip-flop structures. Double-edge triggered flip-flops can latch the data on both the rising and the falling edge of the clock.
Thus, the clock frequency can be reduced by half while the data throughput is preserved. In Fig 3.13(b), the pulse generator consists of four inverters, which generate the delayed and inverted clock signals CLK2 and CLK3, along with two NMOS transistors for pulse generation. The delayed clock signal CLK2, which is the inverse of CLK, is applied to the drain of MN8, while the clock (CLK) controls the gate of MN8. When the rising edge of the clock begins, CLK2 is still high and the pass transistor MN8 charges the PULS node, so a narrow sampling window is generated at the rising edge of the clock signal. The delayed clock signals CLK3 and CLK1 and the pass transistor MN9 create another pulse at the falling edge of the clock signal in the same manner. In Fig 3.13(a), the PULS signal applied to the NMOS transistor MN1 creates a narrow transparency window in which the data inputs can affect the state of the static nodes SB and S via the NMOS transistors MN2 and MN3. The PMOS transistor MP5 (MP4) pulls node S (SB) up to Vdd.

Fig 3.13 Dual-edge triggered static pulsed flip-flop (a) DESPFF (b) Pulse generator and waveform

When the input remains the same in two successive clock cycles, the flip-flop does not need to be triggered, because the output does not change either. However, even when the input remains unchanged in consecutive clock cycles, the pulse generator is still activated. These transitions in the clock pulse generator are redundant and cause unnecessary power consumption. If the clock to the pulse generator can be deactivated while the input remains unchanged, this power can be saved. Furthermore, removing MN6 and MN7 saves more power, because the leakage current and the capacitance of nodes SB and S are reduced. Therefore the Clock Gated Static Pulsed Flip-Flop (CGSPFF) was proposed, whose circuit is shown in Fig 3.14.
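The redundancy described above can be made concrete with a minimal behavioral sketch. It is not a model of the transistor-level circuits in Fig 3.13 or Fig 3.14; it assumes one data sample per clock edge and simply counts how many clock pulses are generated with and without gating on the condition D ≠ Q. The function name `count_pulses` and the data pattern are hypothetical.

```python
def count_pulses(data_bits, clock_gated, q_initial=0):
    """Count generated clock pulses for one data sample per clock edge.

    Hypothetical behavioral model: without gating, every clock edge produces
    a pulse; with gating, a pulse is produced only when D differs from Q.
    Returns (number_of_pulses, final_q).
    """
    q = q_initial
    pulses = 0
    for d in data_bits:
        enable = (d ^ q) if clock_gated else 1  # XOR of D and Q acts as the enable
        if enable:
            pulses += 1
            q = d  # the latch captures D during the pulse
    return pulses, q


data = [0, 0, 0, 1, 1, 1, 1, 0]
print(count_pulses(data, clock_gated=False))  # pulse on every sample
print(count_pulses(data, clock_gated=True))   # pulse only when D changes the state
```

For this pattern the ungated generator produces 8 pulses and the gated one produces 2, with the same final output. In this simplified model the saving grows as the switching activity of the data falls.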
(b) The pulse generator part of CGSPFF

Fig 3.14 The Clock Gated Static Pulsed Flip-Flop (CGSPFF)

As shown in Fig 3.14(a), the main block of the CGSPFF is similar to that of the DESPFF except that MN6 and MN7 are eliminated. Eliminating these transistors reduces the leakage power consumption. According to the simulation results, using a simpler structure for the pulse generator and the main block, together with comparing the previous data with the new data, improves the performance and reduces the leakage power. Note that MP4 and MP5 form a cross-coupled structure, which improves the clock-to-Q delay.

In Fig 3.14(b), the comparison between the input D and the output Q is performed by an XOR gate (transistors MN8 and MN9). XOR = 0 when D = Q, and XOR = 1 when D $\neq$ Q. The output of the XOR gate is used as an enable signal for the clock pulse generator of the static flip-flop. The gating logic is a single NMOS transistor (MN10). When D is equal to Q, the clock signal is disabled via MN10, and MN11 turns on and pulls CLK1 down, while CLK3 and CLK4 are low and high, respectively. In this state, MP12 and MN13 are OFF while MP14 and MN15 are ON, making PULS low. On the other hand, when D and Q are different, MN10 passes the input clock: CLK1 changes to high, but CLK3 remains low for a period equal to the delay of two inverters, during which CLK4 remains high. This enables MP14 and MN15 to pass CLK1 to PULS, producing a low to high transition on PULS. After CLK3 changes state and becomes high, CLK4 becomes low, so MP14 and MN15 turn OFF while MN13 turns ON. PULS is thus pulled down by MN13, generating the narrow pulse required for the operation of the flip-flop.

#### 3.4.2.2 Implicit Pulsed Flip-Flop

The diagram of the implicit-pulsed flip-flop and the prevalent pulse generator circuit are shown in Fig 3.15. Implicit-pulsed flip-flops use two series devices embedded in the logic branch, receiving a clock and a delayed clock, respectively.
They generate a virtual pulse, which works like a real pulse, by using a delay inverter chain and stacked NMOS gates, as illustrated in Fig 3.15.

Fig 3.15 The diagram of implicit-pulsed flip-flop and the prevalent pulse generator circuit

In this part, I describe two classical implicit-pulsed flip-flops in detail. One is the Hybrid Latch Flip-Flop (HLFF) [21] and the other is the Clock Branch Sharing Implicit Pulse Flip-Flop (CBS\_IP) [22]. They are illustrated in Fig 3.16 and Fig 3.17, respectively.

The Hybrid Latch Flip-Flop (HLFF) [21], shown in Fig 3.16, is one of the fastest flip-flops. It was the first design to build a pulsed-latch mechanism to achieve high performance. The HLFF is composed of two stages: the first stage is a 3-input NAND gate driven by the clock, the input, and the delayed clock; the second stage is a static latch that works as a C<sup>2</sup>MOS latch. The node X of the HLFF is always precharged except during the transparent window. If the input D is high in the evaluation phase, node X is discharged through the pull-down path, and the PMOS of the second stage charges the output to the high level. If the input D is low in the evaluation phase, the NAND gate keeps node X at the high level, and the output is pulled down to level 0 through the pull-down path of the second stage.

Fig 3.16 The Hybrid Latch Flip-Flop (HLFF)

Explicit-pulsed flip-flops use external clock pulse generators, which increase the power because of the additional transistors and the larger clock loading capacitance. In addition, the ep-FF cannot work with dynamic logic. To overcome the large clock load of previous implicit-pulsed flip-flops, a novel clock branch sharing topology was proposed. In this clock branch sharing scheme, Fig 3.17, the two groups of clocked branches of the previous clock branch separating scheme (DECPFF [22]) are merged: (N1, N3) and (N2, N4) are shared by the first stage and the second stage (in the dotted circle). The advantage of this sharing concept is a reduction in the number of transistors required to implement the clocking branch of the double-edge triggered implicit-pulsed flip-flop. Without this sharing, the number of clocked transistors would be much larger. Recall that clocked transistors have a 100% activity factor and consume a large amount of power, so reducing their number is an efficient way to decrease the power. The discharging path stays ON only for a short while, yielding only a small short-circuit current. An inverter is placed after Q, providing protection from direct noise coupling.

The double-edge triggering operation of the flip-flop, Fig 3.17, is as follows. Q\_fdbk is used to control N7. When CLK rises, CLKB stays high for a short interval equal to one inverter delay. During this period, the clocked branch (N1 and N3) turns on and the flip-flop is in the evaluation period. Note that the other clocked branch (N2 and N4) is disconnected. When CLK falls, CLKB rises, and CLKB\_delay stays HIGH for one inverter delay, during which the transistors N2 and N4 are both on and the flip-flop is in the evaluation mode.

The first stage of the design is responsible for capturing the 0->1 transitions of the input D. The internal node X discharges, causing the outputs Q and Qb to become HIGH and LOW, respectively, and N7 is turned off by Q\_fdbk = 0. If the input D stays at "1", the first stage is disconnected from ground in later evaluations, preventing node X from experiencing redundant switching activity. The second stage, on the other hand, is responsible for capturing the 1->0 transitions of the input.
In this case, the falling transition of the input turns ON the pull-down network of the second stage, forcing the output nodes Q and Qb to 0 and 1, respectively. Before the rising/falling clock edge, the outputs of I1/I2 turn on N1 and N2, respectively, so the internal nodes A and B are discharged to ground before evaluation, which reduces the discharge time.

Fig 3.17 Clock Branch Sharing Implicit Pulse Flip-Flop (CBS\_IP) and waveform

# 3.5 Proposed Edge-Triggered Flip-Flop Design and Simulation Result

In this section, a new low power flip-flop is proposed. The new flip-flop combines several low power techniques, including reduced clock swing and low leakage current. I compare the pre-simulation results with those of other conventional flip-flops. The section ends with the post-simulation results after layout in TSMC 130nm CMOS technology and in UMC 90nm CMOS standard cell technology.

#### 3.5.1 The Motivation of Proposed The New Flip-Flop

In general, the pulsed flip-flop has a lower clock load, which reduces the power consumption of the clock tree. However, the pulsed flip-flop pays a penalty for the pulse generator. The pulse generator usually uses a delay inverter chain to produce the transparent window. The delay chain switches whenever the clock switches, and it consumes a large part of the total power even if the data has no transition. In deep sub-micron IC design, leakage has become the largest part of the static power consumption. The inverter delay chain of the pulsed flip-flop provides additional leakage paths from the supply voltage to ground, which increases the static power consumption.

The explicit-pulsed flip-flops (ep-FF) and implicit-pulsed flip-flops (ip-FF) have different features. First, in the ep-FF the pulse generator can be shared by neighboring flip-flops, a technique that cannot be used in the ip-FF.
This sharing helps distribute the power overhead of the pulse generator across many explicit-pulsed flip-flops. Note that the pulse generator of the ep-FF has many more transistors than that of the ip-FF. Second, the ep-FF can achieve better performance, since the height of its NMOS stack is less than that of the ip-FF. On the other hand, the stack structure can reduce the leakage current. However, the ep-FF cannot be used with dynamic logic.

To design a low power flip-flop, I improve the delay chain of the implicit pulse generator, and combine the reduced swing technique with the conditional capture technique to reduce both the dynamic power and the static power of the flip-flop.

#### 3.5.2 Proposed Pulse Generator of Flip-Flop

To reduce the leakage current and static power consumption of the delay chain, the stack transistor technique is used. The delay chain of the pulsed latch is designed only to generate a transparent window, so a slight loss in performance can be tolerated. The stack transistor technique reduces not only the static power but also the dynamic power, by reducing the clock swing of the flip-flop. The clock power is proportional to the supply voltage of the clock system. I propose a reduced swing inverter chain based on a circuit technique that does not need an additional voltage supply.

The new reduced swing inverter chain is illustrated in Fig 3.18. It inserts the NMOS transistor M72 between the NMOS transistors of the inverters and ground. Transistor M72 produces the low swing clock by being connected to the source terminals of both M2 and M4. After the clock passes through the first two inverters, it no longer swings between Vdd and Gnd but between Vdd and Vtn. The source terminal of M6 remains connected to ground. This ensures that the NMOS driven by CLKB turns off completely.
This structure has two advantages: first, it uses the n-type stack technique to reduce the static power; second, it reduces the swing voltage of the inverter chain. The simulation results, compared with a typical inverter chain, are shown in Table 3.1.

Fig 3.18 Proposed Low Swing Inverter Chain

| | Avg. Static Power | Avg. Dynamic Power |
|--------------------------|-------------------|--------------------|
| Typical inverter chain | 7.64E-08 | 1.78E-06 |
| Low Swing Inverter Chain | 6.34E-08 | 1.26E-06 |

Table 3.1 The simulation result compared with a typical inverter chain

According to the simulation results, the proposed low swing inverter chain reduces the static power of the delay chain by about 17% and its dynamic power by about 30%.

#### 3.5.3 Proposed Low Clock Swing Flip-Flop

The proposed low swing conditional capture flip-flop is illustrated in Fig 3.19. It consists of a low swing implicit pulse generator and a cross-coupled flip-flop. The low swing implicit pulse generator is based on the clock delay chain described in section 3.5.2. By using the cascaded NMOS transistors M8 and M9, driven by CLKB and CLK, a transparent window is obtained that activates the pulsed latch to sample the data. The waveform of the transparent window is shown in Fig 3.20. The source terminal of M6 is not connected to M72, because otherwise CLKB, with a weak logic level 0, might cause unnecessary leakage current and even the loss of the data stored in QB.

Fig 3.19 Proposed Low Clock Swing Flip-Flop (LCSFF)

Fig 3.20 The Waveform of Transparent Window

Note that the transistor driven by the delayed clock signal CLKB is stacked above the one driven by CLK; this arrangement is chosen deliberately for performance. When the input data changes 0->1, the node QB must discharge through M8 and M9. Before the positive clock edge arrives to turn on M9, CLKB has already turned on M8. That means the charge on node QB can discharge faster. I tried to merge M8, M9 with M81, M91 to reduce the number of transistors.
Although the flip-flop itself worked in post-simulation, it did not work correctly when applied to the deserializer, so I abandoned the merged discharge path.

The main body of the flip-flop is a cross-coupled edge-triggered flip-flop. The cross-coupled structure is designed for performance. When the input data is logic level 1 and a sampling window is generated, the node QB is discharged along the pull-down path M8, M9. The node QQ is charged by M15, and the output Q changes after one inverter delay. When the input data is logic level 0 and a sampling window is generated, the node QQ is discharged along the pull-down path M13, M81, M91. The node QB is charged by M14, and the output Q changes after one inverter delay. The typical operation waveform of the flip-flop is illustrated in Fig 3.21, which shows the result of the data tolerance test simulation.

Fig 3.21 The Data Tolerance Simulation Result

#### 3.5.4 Simulation Result and Comparisons

The comparison of the proposed LCSFF with other flip-flops is shown in Table 3.2. The results are simulated in UMC 90nm CMOS technology at 500 MHz, with a supply voltage of 1.0 V, a loading capacitance of 0.1 fF and an operating temperature of 25 degrees centigrade. The transistor sizing is optimized for both speed and power consumption. The pre-simulation results for setup time and hold time are shown in Table 3.3.
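As a cross-check on Table 3.2, the short sketch below recomputes the power-delay product from the tabulated delay and power values. It assumes that the PDP is the clock-to-Q delay multiplied by the average of the static and dynamic power; this definition is inferred from the numbers in the table and is not stated explicitly in the text. The names `pdp` and `relative_change` are hypothetical helpers.

```python
# Values copied from Table 3.2: (Clk-to-Q delay in ps, static power in uW, dynamic power in uW)
FLIP_FLOPS = {
    "TGFF": (80.27, 5.0750, 8.8399),
    "HLFF": (55.40, 11.2605, 18.4795),
    "LCSFF": (61.88, 5.4973, 7.5908),
}


def pdp(delay_ps, static_uw, dynamic_uw):
    """Power-delay product, assuming average power = (static + dynamic) / 2."""
    return delay_ps * (static_uw + dynamic_uw) / 2.0


def relative_change(value, reference):
    """Percentage change of value with respect to reference."""
    return 100.0 * (value / reference - 1.0)


results = {name: pdp(*values) for name, values in FLIP_FLOPS.items()}
for name, value in results.items():
    vs_tgff = relative_change(value, results["TGFF"])
    vs_hlff = relative_change(value, results["HLFF"])
    print(f"{name}: PDP = {value:.2f}, vs TGFF {vs_tgff:+.2f}%, vs HLFF {vs_hlff:+.2f}%")
```

With this assumed definition, the computed values agree with the table to within rounding, including the 27.49% PDP reduction of the LCSFF relative to the TGFF. With power in µW and delay in ps, the product is in units of $10^{-18}$ J.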
| | TGFF | HLFF | LCSFF |
|-------------------------------|--------|---------|---------|
| Number of clocked transistors | 12 | 11 | 10 |
| Number of transistors | 24 | 20 | 19 |
| Clk to Q delay (ps) | 80.27 | 55.40 | 61.88 |
| Static power (µW) | 5.0750 | 11.2605 | 5.4973 |
| Dynamic power (µW) | 8.8399 | 18.4795 | 7.5908 |
| Power and delay product (PDP) | 558.48 | 823.80 | 404.95 |
| PDP relative to TGFF | 1 | +47.5% | -27.49% |
| PDP relative to HLFF | -32.2% | 1 | -50.8% |

Table 3.2 The comparisons of proposed LCSFF with other flip-flops

| Corner | Data low to high: setup time (ps) | Data low to high: hold time (ps) | Data high to low: setup time (ps) | Data high to low: hold time (ps) |
|--------|--------|-------|--------|-------|
| FF | -3.10 | 27.88 | -2.89 | 11.95 |
| SS | -55.26 | 79.26 | -44.44 | 51.39 |
| TT | -20.13 | 42 | -14.54 | 21.49 |
| SNFP | -25.07 | 49.13 | -21.94 | 27.01 |
| FNSP | -16.13 | 36.81 | -11.55 | 17.64 |

Table 3.3 The presimulation of setup time and hold time of LCSFF

#### 3.5.5 Layout and Post Simulation Result

In this section I present the layout in UMC 90nm CMOS technology, illustrated in Fig 3.22. The layout is drawn as a standard cell, so there are many restrictions, such as the cell height, the usable metal layers and the positions of the input and output pins; even the Vdd and Gnd pins follow strict rules. The standard cell height in UMC 90nm CMOS technology is $2.5\,\mu\text{m}$, and only metal one can be used. These standard cell restrictions make the layout more difficult. The total area is $2.5 \times 5.32\,\mu\text{m}^2$. Table 3.4 shows the post-simulation power results of the LCSFF. Table 3.5 and Fig 3.23 show the post-simulation setup time and hold time of the LCSFF.
Fig 3.22 The layout view of LCSFF in UMC 90nm standard cell

| Parameter | Value |
|--------------------------------------------|-----------------------------|
| Technology | UMC 90nm CMOS Standard Cell |
| Supply voltage | 1.0 V |
| Clock frequency | 1 GHz |
| Width | 5.32 $\mu$m |
| power00 (previous data is 0, new data is 0) | 7.2492 µW |
| power11 (previous data is 1, new data is 1) | 6.9723 µW |
| power00 and power11 compared with presimulation result | +29.5% |
| power01 (previous data is 0, new data is 1) | 10.425 µW |
| power10 (previous data is 1, new data is 0) | 9.9499 µW |
| power01 and power10 compared with presimulation result | +34.2% |

Table 3.4 The post simulation result of LCSFF

| Corner | Data low to high: setup time (ps) | Data low to high: hold time (ps) | Data high to low: setup time (ps) | Data high to low: hold time (ps) |
|--------|--------|--------|--------|-------|
| FF | 20.17 | 30.02 | 3.18 | 8.87 |
| SS | -70.30 | 107.34 | -53.71 | 53.80 |
| TT | -18.97 | 52.72 | -16.32 | 18.74 |
| SNFP | -23.17 | 61.12 | -17.68 | 24.78 |
| FNSP | -13.69 | 46.47 | -6.65 | 12.73 |

Table 3.5 The post simulation of setup time and hold time of LCSFF

Fig 3.23 The post simulation of setup time and hold time

Note that the setup time at the FF corner (both NMOS and PMOS are the fast type) is negative in pre-simulation but becomes positive in post-simulation after layout.

# **Chapter 4**

# Low Clock Swing Flip-Flop Design for Serializer/Deserializer in Network on Chip

#### 4.1 Introduction

System-on-chip (SoC) designs provide an integrated solution to the challenging design problems of multi-IP systems. SoC designs become more complex as the number of transistors grows exponentially [1-3].

Fig 4.1 Traditional Synchronous Bus

The traditional on-chip bus platform is shown in Fig 4.1. The shared bus architecture becomes a limiting factor as the number of intellectual property (IP) blocks increases.
The required on-chip communication bandwidth is growing beyond that provided by standard on-chip buses [4]. Existing bus architectures and techniques are unable to meet leading edge complexity and performance requirements.

In nanoscale technologies, the increased coupling between interconnects not only worsens the power-delay metrics but also degrades the signal integrity through capacitive and inductive crosstalk noise. Several options have been proposed to reduce the inter-wire capacitance:

1. Widening the pitch between bus lines.
2. Using P&R (place & route) tools to avoid routing the bus lines side by side.
3. Changing the geometric shape of the bus lines.
4. Adding a shielding line (VDD/Ground) between two adjacent signal lines.
5. Reducing power through bus encoding schemes [5-7].

However, in SoC design the interconnection and routing are complex, and it is hard to minimize the coupling capacitance. A further disadvantage of these methods is the increased area, since the cross-sectional area of a bus line is fixed. On-chip physical interconnections will become a limiting factor for performance and energy consumption. Encoding schemes addressing low power and reliability are proposed in [8][9].

Both the system design and the performance are limited by the complexity of interconnecting the different modules and blocks in a single clocked design. Different data transfer speeds are required, as well as parallel transmission, and traditional system buses may not be suitable for such a system. The solution to these problems is a segmented bus design combined with the globally asynchronous locally synchronous (GALS) system architecture [10-12]. Asynchronous design can make the circuits resilient to delay variation. The Network-on-Chip architecture shown in Fig 4.2 is based on a homogeneous and scalable switch fabric network.
The motivation for establishing the NoC platform is to achieve performance from a system-level view of communication. The core of NoC technology is the active switching fabric that manages multi-purpose data packets within complex, IP-laden designs. The most important characteristics of the NoC architecture can be summarized as a packet-switched approach, a flexible and user-defined topology, and a globally asynchronous locally synchronous (GALS) implementation.

Fig 4.2 Network-on-Chip Architecture

A simple Network-on-Chip architecture is shown in Fig 4.3. This work focuses on the physical layer and data-link layer design of the NoC protocols. The goal is to achieve a low latency, low power and reliable interconnect architecture. The architecture applies to each transmission stage between two adjacent switches in the network interface. The design of NoC protocols should consider the properties of all stages together to achieve better performance. Based on this concept, we adopt the serialization technique to implement packet-based transmission, which is the most significant difference between the Network-on-Chip architecture and other architectures.

Fig 4.3 A simple architecture of Network on Chip

Traditional rail-to-rail voltage signaling will no longer be suitable for low power interconnect design. Reducing the signal swing can significantly reduce the power consumption on link wires. However, as technology leads to smaller voltage swings and larger coupling capacitance effects, power will be the most important issue in the future, and each metric must be carefully designed and traded off. Guaranteeing reliability is a necessary and important issue, especially in future nanometer designs. Following this concept, we want to save more energy on link wires. The whole architecture should tolerate process variation and ensure that the circuit functions correctly in different cases.
According to the ITRS (International Technology Roadmap for Semiconductors) prediction illustrated in Fig 4.4, the gap between the interconnect delay and the gate delay will increase to 9:1 in 65nm technology [13]. Power dissipation also increases because of the charging and discharging of the interconnect wires on a chip. Soon the interconnect will dominate performance metrics such as power consumption, speed and area. This means that in future SoC designs the interconnect will affect the system more than the logic circuits on the chip.

Fig 4.4 Interconnect delay and gate delay under different technology

# 4.2 Practical Application in NoC Framework

Signal integrity is an important issue for on-chip interconnect. The degradation comes from many different sources, both the intrinsic RLC nature of the circuit and extrinsic noise. Intrinsic RLC effects include RC delay, LC ringing and attenuation of the wave amplitude. Extrinsic noise includes crosstalk noise and inter-symbol interference (ISI). Overlapping waves cause ISI and lead to a higher error rate on the interconnect.

The most significant noise in DSM (deep sub-micron) technology is crosstalk noise. Through the coupling capacitance, the delay and skew of the signals on a victim line can be affected by adjacent lines (aggressor lines). Crosstalk noise makes it difficult to estimate the exact delay deviation. Interconnect capacitance and resistance have become comparable to those of the gates in today's technology.

Inductance effects will become significant with technology scaling in the future. In DSM technology, the inductance of the wires becomes a noise source. When aggressor lines switch simultaneously, the electromagnetic field may induce noise on the victim lines. For most on-chip interconnect, the effect of inductance can be neglected if $f \ll R/(2\pi L)$. According to the simulation results of [14], inductance effects change the worst-case transition (in terms of delay) for nanoscale interconnect wires only when the operating frequency is over 1 GHz.
We will ignore the inductance effect of on-chip interconnect in this thesis.

Previous NoC works suffer from encoder/decoder hardware overhead, and the encoder/decoder needs a significant propagation delay as the number of uncoded bits increases. Besides, without serialization and deserialization for the link wires, a large phit size increases network area and energy consumption. To achieve a low-power and reliable interconnect, we propose a joint bus and error correction coding scheme with 4-to-1 serializers and deserializers, shown in Fig 4.5.

Fig 4.5 A joint bus and error correction coding scheme with serializers/deserializer in network-on-chip

The self-corrected green coding scheme is divided into two stages: the triplication error correction coding (ECC) stage and the green bus coding stage. The green bus coding is developed from the joint triplication bus power model to achieve more energy reduction for the triplication ECC. Joint bus and error correction coding is an elegant and effective technique for solving the crosstalk effect, and it further provides a reliability bound for on-chip interconnect.

In this scheme, I am responsible for the 4-to-1 serializer and the 1-to-4 deserializer. Hence I explain the other parts only concisely in this section and describe my part in the rest of chapter 4. Readers interested in the bus and error correction coding parts are referred to [15-19].

#### 4.2.1 Triplication Error Correction Coding Stage

The triplication error correction coding (ECC) scheme shown in Fig 4.6 is a single-error-correcting code that triplicates each bit. From information theory, it is well known that a code with Hamming distance $h$ can detect $h-1$ errors and correct $\lfloor (h-1)/2 \rfloor$ errors. For triplication error correction coding, the Hamming distance of each bit is 3. Therefore, each bit can be corrected by itself if there is no more than one error bit among its three triplicated copies.
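The triplication code can be sketched behaviorally in a few lines of Python. This is only an illustrative model, not the circuit of Fig 4.6: the function names are hypothetical, and the decoder is a plain bitwise majority vote over the three copies of each bit.

```python
def encode_triplication(bits):
    """Repeat every data bit three times (Hamming distance 3 per bit)."""
    return [b for b in bits for _ in range(3)]


def majority(a, b, c):
    """Majority gate: the output is 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def decode_triplication(coded):
    """Recover each data bit by a majority vote over its three copies."""
    if len(coded) % 3 != 0:
        raise ValueError("coded length must be a multiple of 3")
    return [majority(*coded[i:i + 3]) for i in range(0, len(coded), 3)]


data = [1, 0, 1, 1]
received = encode_triplication(data)
received[1] ^= 1  # one error in the first triple
received[9] ^= 1  # one error in the fourth triple
assert decode_triplication(received) == data
```

One error per triple is always corrected. Two errors in the same triple flip the majority and are decoded wrongly, which is the limit set by a Hamming distance of 3.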
The error bit can be corrected by a majority gate, whose function is shown in Fig 4.6. Compared to other error correction mechanisms, the critical delay of the decoder is the constant delay of one majority gate, which is much smaller than that of other ECCs. In other words, it corrects rapidly because correction happens at the bit level. Therefore, triplication error correction coding suits network-on-chip, where a small encode/decode propagation delay is needed.

Fig 4.6 Triplication error correction coding scheme

#### 4.2.2 Green Bus Coding Stage for Crosstalk Avoidance

The purpose of green bus coding is to minimize the value of $\alpha$ (the coefficient that captures coupling effects and switching activity) by encoding the signals when $\lambda>2$, where the parameter $\lambda$ is defined as the ratio of the coupling capacitance to the load capacitance $C_L$. The design flow of green bus coding is shown in Fig 4.7.

Fig 4.7 Design flow of green bus coding

The correspondence between 4-bit data words and 5-bit codewords is shown in Fig 4.8(a). According to this correspondence, the data words can be grouped into two sets, the original set and the converted set. When the transmitted data is in the converted set, the green bus coding converts it to the original set by the one-to-one mapping in Fig 4.8(b). Meanwhile, the converted bit, c4, is asserted, and c0 and c2 are inverted and mapped to the original set. X1 and X2 are never modified. The circuit implementation of green bus coding, including the encoder and the decoder, is shown in Fig 4.9 [24][25].

Fig 4.8 (a) 4-to-5 Green bus coding scheme (b) Original set and converted set of Green bus code

Fig 4.9 Circuit implementation of green bus coding (a) Encoder (b) Decoder

Thanks to the joint triplication bus model, the circuitry of green bus coding is simpler and more effective than that of other approaches. Between two adjacent 5-bit codewords, it is unnecessary to add an extra shielding line to reduce the coupling effect.
This is because the boundary bits of the 5-bit codeword are almost always set to 0. Certainly, more energy can be saved by inserting a grounded shielding line; it is a trade-off between wiring area and energy consumption. The proposed green bus coding has the following properties:

1. C4 is used as the detection bit to decide Y0 and Y2. This simplifies the encoder and decoder circuitry, especially the decoder.
2. The encoded bit always equals the data bit at certain bit positions, namely positions 1 and 3.
3. Within the joint bus and error correction coding scheme, the self-corrected green coding scheme avoids the forbidden overlap condition (FOC) and the forbidden pattern condition (FPC) and reduces the forbidden transition condition (FTC) to save more power.
4. It is unnecessary to add extra shielding lines to reduce the coupling effect between two adjacent codewords as the number of coded bits increases.

# 4.3 Serializer and Deserializer Design for NoC

#### 4.3.1 Introduction of Serializer and Deserializer

The main motivation for using on-chip networking is to achieve high performance from a system-level view of communication. Indeed, it is the trend toward larger-scale on-chip multiprocessing that demands on-chip networking solutions. The physical transfer unit (phit) is the unit into which a packet is divided for transmission through the NoC. Simply speaking, the phit size is the bit width of a link. A large phit size increases network area and energy consumption, especially for the switching circuits and buffering units in the switch fabric. A small phit size reduces the number of link wires, so the spacing between link wires can be widened to decrease the coupling capacitance. If the phit size is smaller than the interface bit width of the processing units, serialization must be performed by the factor given in equation (4.1):

$$\text{Serialization ratio (SERR)} = \frac{\text{I/O bit width}}{\text{phit size}} \tag{4.1}$$

For high-speed network transmission, a low-power serializer becomes even more important in the bus architecture.
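As a small worked example of equation (4.1), the sketch below computes the serialization ratio and the link clock needed to keep the throughput unchanged. The 32-bit interface and 8-bit phit are hypothetical values chosen only for illustration, and the function names are not from the thesis.

```python
def serialization_ratio(io_bit_width, phit_size):
    """Equation (4.1): SERR = (I/O bit width) / (phit size)."""
    if io_bit_width % phit_size != 0:
        raise ValueError("I/O bit width must be a multiple of the phit size")
    return io_bit_width // phit_size


def link_frequency_hz(core_frequency_hz, serr):
    """Link clock required to carry the same throughput on SERR times fewer wires."""
    return core_frequency_hz * serr


# Hypothetical example: 32-bit interface, 8-bit phit, 250 MHz parallel data.
serr = serialization_ratio(32, 8)
link_clock = link_frequency_hz(250e6, serr)
print(serr, link_clock)
```

With a ratio of 4, parallel data at 250 MHz needs a 1 GHz link clock, so the wire count shrinks by a factor of four while the link frequency grows by the same factor.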
Fig 4.10 shows a K-bit to (K/N)-bit serialization with an N:1 ratio. The serialization ratio is defined as the input bit width divided by the phit size; in this diagram, the phit size is K/N. The serializer and deserializer reduce the phit size and thereby reduce area and energy consumption by diminishing the coupling capacitance. On-chip serialization is a crucial technique for NoC implementation. It reduces the overall network area and optimizes power consumption, as explained in [22][23].

Fig 4.10 K bit -to- (K/N) bit serialization with N:1 ratio

# 4.3.2 Design Serialization Ratio by Wire Simulation Result

However, to achieve the same throughput, serialization increases the operating frequency of the interconnection network. Therefore we should choose the phit size carefully. A journal paper [21] reports simulations of the serializer ratio. The operating frequency without serialization is set to 200 MHz, and the detailed designs of the blocks are based on implementation results. Fig 4.11 shows the analysis of the energy consumption per packet transmission and of the area of the building blocks in a star-topology NoC. In Fig 4.11(a), the energy consumed in a switch and in the links decreases as the SERR increases, as mentioned above. In most cases, a serialization ratio of 4 minimizes the overall power consumption. As shown in Fig 4.11(b), serialization also reduces the overall NoC area effectively, and a serialization ratio of 4 is optimal.

Fig 4.11 (a) Energy and (b) area of an NoC according to Serialization ratio.

Besides surveying the literature, I also simulated the power consumption for different numbers of wires. The simulation is based on UMC 90nm CMOS technology. I assume that the transmitted data throughput is the same regardless of the number of wires, and that the load capacitance of each wire is 0.01 pF. Regardless of loading, the power consumption decreases as the serializer ratio increases at low operating frequency.
Unfortunately, at higher operating frequency the power consumption increases with the serializer ratio, because a large driver is needed to provide high driving ability. From [21] and the simulation results, a 4:1 serializer is the optimized ratio for energy saving.

Fig 4.12 Power Simulation Result of Different Numbers of Wire

#### 4.3.3 Conventional Serializer and Deserializer

As optical communication and network technology improve, the required transmission data rate has risen to the gigabit-per-second range. High-speed I/O interface design has become an important issue. In recent years, serial link transmission has been used extensively in high-speed communication. I will show the principal structures of the serializer. Previous serializers can be separated into two categories, the tree-based structure and the shift-register structure [15-17]. Fig 4.13 and Fig 4.14 show these structures respectively.

Fig 4.13 The tree-based serializer and waveform

Consider the tree-based structure illustrated in Fig 4.13(a). Although the MUX can operate at 1/4 of the clock frequency, it has a jitter problem. Poor clock quality or input data delay induces jitter, and because of the jitter we must insert D flip-flops between the 2-to-1 MUXs. The practical circuit therefore looks like Fig 4.13(b), and the advantage of the tree-based structure no longer exists. Moreover, the tree-based structure requires the number of inputs to be a power of 2. Fig 4.13(c) illustrates the operation waveform.

Now consider the shift-register structure illustrated in Fig 4.14(a)(b). The input data and the select signal all operate at frequency f/4, while the output data operates at frequency f, like the flip-flops. The advantage of the shift-register serializer is that it supports any number of inputs. Its drawbacks are the long D flip-flop chain and the need to synchronize clock and data.
Fig 4.14 The shift-register serializer and operation waveform

In the latest literature that I surveyed, a mixed structure was proposed [18-20]. Fig 4.15(a) shows the mixed serializer structure, and Fig 4.15(b) shows the waveforms of the control signal and the output signal.

Fig 4.15 The mixed structure serializer and operation waveform

Considering the problems mentioned above and the simulation results in section 4.3.2, I choose a serialization ratio of 4:1 and the shift-register structure for my application. Although the tree-based structure may occupy an area similar to that of the shift-register serializer, its control signals all depend on the system clock, which has high switching activity, and high switching activity leads to high power consumption. The MUX structure uses not only the system clock but also clocks of other frequencies, so it needs extra circuitry to generate them, which greatly increases the clock loading of the system. The mixed-structure serializer uses four flip-flops and three MUXs, which costs more area than the shift-register serializer. For these reasons, the serializer used in my application is the shift-register structure illustrated with its waveform in Fig 4.14, and the deserializer is the shift-register structure shown with its waveform in Fig 4.16.

Fig 4.16 The shift-register deserializer and operation waveform

## 4.4 Simulation Result of Serializer and Deserializer

In this section I present my simulation results in UMC 90nm CMOS technology. Table 4.1 shows the simulation setup. Fig 4.17 shows the SPICE simulation waveform of the 4-to-1 serializer (Fig 4.14). When the signal SELECT=1, input data A is transmitted to the output and, at the same time, the other data B, C, D are captured into the flip-flops. Therefore, when the next clock edge arrives, data B is transmitted to the output.
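The load-and-shift behavior described above can be illustrated with a cycle-level Python model. This is a behavioral sketch under simplifying assumptions, not the transistor-level circuit of Fig 4.14 or Fig 4.16: the class and function names are hypothetical, SELECT is assumed to be high for exactly one cycle per word, and the deserializer simply regroups the serial stream into words.

```python
class ShiftRegisterSerializer:
    """Cycle-level model of an N:1 shift-register serializer (default 4:1)."""

    def __init__(self, ratio=4):
        self.ratio = ratio
        self.regs = [0] * (ratio - 1)  # flip-flops holding the pending bits

    def clock(self, select, parallel_in=None):
        """Advance one fast-clock cycle and return the serial output bit."""
        if select:
            if parallel_in is None or len(parallel_in) != self.ratio:
                raise ValueError("a full parallel word is required when select=1")
            # The first bit (A) passes to the output; B, C, D are captured.
            self.regs = list(parallel_in[1:])
            return parallel_in[0]
        out = self.regs[0]
        self.regs = self.regs[1:] + [0]  # shift the remaining bits forward
        return out


def serialize(words, ratio=4):
    """Serialize a list of parallel words, asserting select once per word."""
    ser = ShiftRegisterSerializer(ratio)
    stream = []
    for word in words:
        stream.append(ser.clock(True, word))
        for _ in range(ratio - 1):
            stream.append(ser.clock(False))
    return stream


def deserialize(stream, ratio=4):
    """Collect `ratio` serial bits, then release them as one parallel word."""
    if len(stream) % ratio != 0:
        raise ValueError("stream length must be a multiple of the ratio")
    return [stream[i:i + ratio] for i in range(0, len(stream), ratio)]


words = [[1, 0, 1, 1], [0, 0, 1, 0]]
assert deserialize(serialize(words)) == words
```

The model makes the frequency relationship visible: each parallel word consumes `ratio` fast-clock cycles, so the serial side must run `ratio` times faster than the parallel side.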
Table 4.1 Design background of simulation of serializer

| Technology | UMC 90nm |
|----------------------------|----------|
| Supply Voltage | 1.0 V |
| Output and Clock Frequency | 1GHz |
| Input Frequency | 250MHz |

Fig 4.17 SPICE simulation waveform of 4-1 serializer

Table 4.2 shows the simulation setup. Fig 4.18 shows the SPICE simulation waveform of the 1-to-4 deserializer (Fig 4.16). During the first four cycles of CLK\_In, the input data A, B, C, D are stored in the lower flip-flops; when CLK\_Out triggers, A, B, C, D are transmitted to the outputs D1, D2, D3, D4.

Table 4.2 Design background of simulation of deserializer

| Technology | UMC 90nm |
|-----------------------|----------|
| Supply Voltage | 1.0 V |
| Output data Frequency | 250MHz |
| Input data Frequency | 1GHz |

Fig 4.18 SPICE simulation waveform of 1- 4 deserializer

# **Chapter 5**

# Low Clock Swing Flip-Flop Application of Viterbi decoder

## 5.1 Introduction

Convolutional code is a widely used error control code in modern communication systems such as DVB-T, IEEE 802.11, IEEE 802.16, and MB-OFDM UWB systems. A convolutional encoder is composed of several shift registers and modulo-2 adders (XOR operations). Fig 5.1 shows a (2, 1, 2) convolutional encoder with two shift registers and three modulo-2 adders. It produces a 2-bit codeword for each 1-bit input.

Fig 5.1 The (2, 1, 2) convolutional encoder

The Viterbi algorithm [1], proposed by A.J. Viterbi in 1967, is used to decode convolutional codes. Forney [2] later proved that the Viterbi algorithm is a maximum likelihood (ML) decoding algorithm. In fact, optimally decoding a convolutional code is equivalent to finding the maximum likelihood path in the trellis diagram. The trellis diagram of the convolutional encoder is shown in Fig 5.2.
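To illustrate the encoder and the search for the maximum likelihood path, the following Python sketch encodes a short message with a (2, 1, 2) code and decodes it with a hard-decision Viterbi search. The generator polynomials $1+D+D^2$ and $1+D^2$ (octal 7 and 5) are an assumption chosen because they use two registers and three modulo-2 adders; the exact taps of Fig 5.1 are not given in the text. All function names are hypothetical.

```python
def conv_encode(bits):
    """Encode with the assumed (7, 5) generators, then flush with two zeros."""
    s1 = s2 = 0  # s1 holds the previous input, s2 the one before it
    out = []
    for u in list(bits) + [0, 0]:
        out.append((u ^ s1 ^ s2, u ^ s2))  # three modulo-2 adders in total
        s2, s1 = s1, u
    return out


def viterbi_decode(received, n_info):
    """Hard-decision Viterbi decoding of a flushed sequence of (v0, v1) pairs."""
    inf = float("inf")
    metrics = [0, inf, inf, inf]  # state = 2*s1 + s2, the encoder starts in 0
    paths = [[], [], [], []]
    for r0, r1 in received:
        new_metrics = [inf] * 4
        new_paths = [None] * 4
        for state in range(4):
            if metrics[state] == inf:
                continue
            s1, s2 = state >> 1, state & 1
            for u in (0, 1):
                v0, v1 = u ^ s1 ^ s2, u ^ s2
                branch = (v0 != r0) + (v1 != r1)  # Hamming branch metric
                nxt = (u << 1) | s1
                cand = metrics[state] + branch  # add
                if cand < new_metrics[nxt]:  # compare
                    new_metrics[nxt] = cand  # select
                    new_paths[nxt] = paths[state] + [u]
        metrics, paths = new_metrics, new_paths
    return paths[0][:n_info]  # the flushed encoder ends in state 0


message = [1, 0, 1, 1]
coded = conv_encode(message)
noisy = list(coded)
noisy[0] = (noisy[0][0] ^ 1, noisy[0][1])  # first channel error
noisy[3] = (noisy[3][0], noisy[3][1] ^ 1)  # second channel error
assert viterbi_decode(noisy, len(message)) == message
```

Under this assumption the code has free distance 5, so any two channel errors in the sequence are corrected. The sketch stores complete paths per state, which corresponds to the register-exchange style of survivor storage rather than trace-back.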
Until now, the Viterbi algorithm is still the optimal solution for decoding convolutional codes and has become an important algorithm in communication systems [3].

Fig 5.2 The trellis diagram of the convolutional encoder in Fig 5.1

In early research on the Viterbi decoder, low complexity and high throughput were the two main concerns in VLSI design. As modern communication systems are required to transmit information at high data rates, power dissipation has also become an important issue. Mobile and wireless applications are increasingly popular, so low-power design is the key point of the overall system. Convolutional code is a common error control code in practical communication systems, and the Viterbi decoder consumes much of the receiver power because of its computational complexity. Therefore, applying low-power techniques to the Viterbi decoder effectively reduces the power consumption of the whole system. In this thesis, we propose a low-power Viterbi decoder for wireless communication systems.

As the codeword sequence is transmitted through a noisy channel, the received sequence may not match the original codeword sequence because of channel noise. We assume the received sequence, including two bit errors, is (11, 11, 00, 01, 10, 00, 11). The errors are marked by underlines. Fig 5.3 shows the four survivor paths corresponding to each state after the decoding process. Fig 5.3 also shows the path merging property of the Viterbi algorithm. In this example, all survivor paths merge into the survivor path with the minimum path metric after the merged point. In other words, the decoded data is determined once all survivor paths merge, whether or not the trace-back operation starts from the best state [4][5].

Fig 5.3 Path merging phenomenon in Viterbi decoding over a noisy channel

# 5.2 The Design of Proposed Viterbi Decoder

First, I introduce the hardware implementation of the Viterbi algorithm. Fig 5.4 shows the main blocks of a Viterbi decoder.
A Viterbi decoder is usually composed of four basic units.

Fig 5.4 The conventional block diagram of Viterbi decoder

- Branch Metric Unit (BM Unit): According to the received sequence, compute the branch metrics for the different branches in the trellis diagram.
- Add-Compare-Select Unit (ACS Unit): Accumulate the branch metrics recursively and perform the comparison to generate the path metric for each state. Decide the survivor corresponding to each state according to the comparison result.
- Path Metric Unit (PM Unit): Store the path metrics at each time instant.
- Survivor Memory: Store the survivors from the ACS unit, then use the register-exchange approach or the trace-back approach to decode the maximum likelihood information sequence [4-6].

We propose a low-power Viterbi decoder that combines the scarce-state-transition (SST) algorithm with a variable truncation length. The ACS computation and the survivor memory are the most power-critical blocks, consuming about 90% of the power in the Viterbi decoder. Therefore, most low-power designs focus on these two blocks. The SST algorithm reduces the switching activity of the input sequence to reduce the dynamic power. In addition to applying SST, we propose a modified register-exchange approach that adjusts the truncation length dynamically. With a variable truncation length, access to the survivor memory becomes more efficient [7-12].

The proposed Viterbi decoder targets the Multi-band OFDM UWB [13] system. This system uses a 64-state convolutional code and has a high throughput requirement of up to 480Mbps. Fig 5.5 shows the block diagram of the proposed Viterbi decoder.

Fig 5.5 The block diagram of proposed Viterbi decoder

# 5.2.1 Implementation of SST

To apply the SST algorithm in the Viterbi decoder, it is necessary to implement the pre-decoder and the re-encoder.
Fig 5.6 shows the convolutional encoder of the MB-OFDM UWB system [14].

Fig 5.6 The convolutional encoder of MB-OFDM UWB system

As shown in Fig 5.7, the pre-decoder and the re-encoder are both composed only of shift registers and modulo-2 adders. Therefore, the hardware overhead of these two additional blocks for the SST algorithm is small.

Fig 5.7 The pre-decoder for the convolutional encoder

#### 5.2.2 Radix-2x2 ACS Structure

The throughput requirement of the MB-OFDM UWB system is up to 480Mbps. The ACS unit is the speed bottleneck of the Viterbi decoder because of its data-dependent feedback loop. For high-speed applications, one often applies a high-radix or multi-dimensional ACS to improve the throughput. Radix-4 ACS and radix-2x2 ACS both complete the operations of two trellis stages in one clock cycle. In 0.13µm CMOS technology, both the radix-4 and the radix-2x2 ACS structures can achieve the throughput requirement. Fig 5.8 shows a 4-state radix-4 trellis and a 4-state radix-2x2 trellis. The structures of the radix-4 and radix-2x2 ACS units for state S0 are shown in Fig 5.9.

(a) 4-state radix-4 trellis diagram (b) 4-state radix-2x2 trellis diagram

Fig 5.8 The 4-state radix-4 and radix-2x2 trellis diagrams

Fig 5.9 The radix-4 and radix-2x2 ACS units

The complexity analysis of the radix-4 and radix-2×2 ACS units for a 64-state Viterbi decoder is summarized in Table 5.1, based on simulation in UMC 0.13µm technology. The main differences between these two ACS structures are the comparators and the multiplexers. Table 5.2 lists their gate counts to show the hardware costs. Although its critical path is longer, the radix-2x2 ACS can achieve the throughput requirement with lower complexity. To design a low-power Viterbi decoder, we use the radix-2x2 ACS structure in the proposed design.
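To make the complexity comparison concrete, the short Python sketch below totals the comparator and multiplexer gates of both ACS structures from the unit counts in Table 5.1 and the per-unit gate counts in Table 5.2, as read from those tables. Registers and adders are omitted because both structures use the same number of them (64 registers and 4·64 adders). The dictionary and function names are illustrative only.

```python
# Units per 64-state decoder (Table 5.1): comparators and multiplexers only.
UNIT_COUNTS = {
    "ACS-4": {"cmp4": 64, "mux4": 64},
    "ACS-2x2": {"cmp2": 2 * 64, "mux2": 2 * 64},
}

# Gate count of a single unit (Table 5.2).
GATES_PER_UNIT = {"cmp2": 28, "cmp4": 173, "mux2": 17, "mux4": 33}


def select_logic_gates(structure):
    """Total comparator plus multiplexer gates for one ACS structure."""
    counts = UNIT_COUNTS[structure]
    return sum(n * GATES_PER_UNIT[unit] for unit, n in counts.items())


radix4 = select_logic_gates("ACS-4")      # 64*173 + 64*33 = 13184
radix2x2 = select_logic_gates("ACS-2x2")  # 128*28 + 128*17 = 5760
print(radix4, radix2x2, radix4 - radix2x2)
```

Under these numbers the compare-select logic of the radix-2x2 structure needs 5760 gates against 13184 for radix-4, a difference of 7424 gates, which is the lower complexity referred to above.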
Table 5.1 Comparison of complexity between radix-4 and radix-2×2 ACS units

| | registers | adders | 2-way comparator | 4-way comparator | 2-to-1 multiplexer | 4-to-1 multiplexer |
|---------|-----------|----------|------------------|------------------|--------------------|--------------------|
| ACS-4 | 64 | 4·64 | - | 64 | - | 64 |
| ACS-2×2 | 64 | (2+2)·64 | 2·64 | - | 2·64 | - |

Table 5.2 The gate counts of different comparators and multiplexers

| | 2-way comparator | 4-way comparator | 2-to-1 multiplexer | 4-to-1 multiplexer |
|------------|------------------|------------------|--------------------|--------------------|
| Gate count | 28 | 173 | 17 | 33 |

## **5.2.3 Implementation of Path Merging Detection Unit**

Once all survivor paths merge, it is more efficient to store the merged path rather than all paths. Based on this principle, we propose a low-power scheme for the Viterbi decoder called variable truncation length. Fig 5.10 illustrates a 64-state, radix-2x2, RE-based survivor memory with variable truncation length [16].

Fig 5.10 A RE-based survivor memory with variable truncation length

D0 to D63 are the decisions provided by the ACS units for selecting survivor paths. During decoding, the contents of the registers corresponding to the 64 states tend to become equal from the left stages to the right stages. The registers of each stage are connected to the path merging detection unit. The path merging detection unit finds the merged stage in the memory and generates the clock gating signals of each stage to eliminate unnecessary data movement.

The variable truncation length scheme is based on the path merging property of the Viterbi algorithm. Once all survivor paths merge, the survivor memory stores the merged path rather than all paths to eliminate unnecessary data movement. To implement variable truncation length, it is necessary to find the merged stage of the survivor memory.
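The idea of detecting merged stages can be sketched in Python as follows. This is a simplified behavioral model with assumed conventions, not the detection unit of Fig 5.10: the memory is a list of stages, each stage holds one bit per state, a stage counts as merged when all of its bits are equal, and a gating signal of 1 means that the clock of that stage is gated. The names are hypothetical, and a toy memory with four states stands in for the 64-state design.

```python
def merged_stages(memory):
    """Return, per stage, whether all states hold the same bit."""
    return [len(set(stage)) == 1 for stage in memory]


def clock_gating_signals(memory):
    """Assumed policy: gate (1) every merged stage, clock (0) the others."""
    return [1 if merged else 0 for merged in merged_stages(memory)]


# Toy survivor memory: 5 stages, 4 states per stage.
memory = [
    [0, 1, 1, 0],  # survivors still differ
    [1, 1, 0, 1],  # survivors still differ
    [1, 1, 1, 1],  # merged
    [0, 0, 0, 0],  # merged
    [1, 1, 1, 1],  # merged
]
print(clock_gating_signals(memory))  # [0, 0, 1, 1, 1]
```

In a merged stage every state holds the same bit, so a single copy is enough and the remaining registers no longer need to be clocked.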
After detecting the merged point, we can shift out the data on the merged path directly and apply clock gating to the registers corresponding to the other paths.

Obviously, all survivor paths have merged when the contents of the 64 states are equal at the same stage. However, checking the equality of all 64 states concurrently is too complex. To reduce the hardware complexity, our proposal detects path merging by dividing the 64 states into several groups that are verified separately. For the radix-2x2 trellis, there are four source states corresponding to each state. Therefore, we divide the 64 states into 16 groups of 4 states each.

Fig 5.11 illustrates the implementation of variable truncation length. Because we use the SST algorithm in the proposed Viterbi decoder, the decoded data is obtained from state 0, which is most likely the best state. As the figure shows, the equality of each group is checked separately. The verification results of each stage are connected to the path merging detection unit. The signals Gi and Si generated by the path merging detection unit are the clock gating control of each stage and the selection signal of state 0, respectively. With the clock gating control signal Gi, the register clocks in the shaded region of Fig 5.11 are gated to reduce the power consumption. The selection signal Si controls whether the content of state 0 is updated by direct shifting or by register exchange.

Fig 5.11 The implementation of variable truncation length

# 5.3 Simulation and Implementation Results

This section shows simulation and implementation results. The performance simulations use an AWGN channel and BPSK modulation. We adopt the (3, 1, 6) convolutional code of the MB-OFDM UWB system with 3-bit soft decision and a code rate of 1/3. As the variable truncation length scheme is based on the path merging property, it is necessary to choose a proper truncation length to ensure that all survivor paths merge with high probability.
According to the simulation results in [1][2], the performance improvement saturates even if the truncation length keeps increasing. We select 64 as the maximum truncation length in the proposed design. Table 5.4 lists the design parameters of the proposed Viterbi decoder. To demonstrate that the proposed schemes reduce the power consumption, we implement three versions of the Viterbi decoder: the conventional register-exchange structure, the SST scheme only, and the proposed structure. Table 5.3 lists the gate counts of these three implementations.

Table 5.3 The gate counts of different implementations

| Implementation | Gate count |
|-----------------|------------|
| Conventional RE | 57.8k |
| SST | 58.2k |
| Proposed | 65.1k |

Table 5.4 Design parameters of the proposed Viterbi decoder

| Technology | UMC 0.13-μm process |
|-------------------|---------------------|
| State number | 64 |
| Code rate | 1/3 |
| Soft-decision | 8-levels |
| BM width | 6 bits |
| PM width | 9 bits |
| Truncation length | 64 (max) |
| ACS structure | radix-2x2 |

Fig 5.12 shows the power simulation results under different channel conditions. The operating frequency is 250MHz and the corresponding data rate is 500Mbps. For the conventional structure, the channel conditions have no effect on the power dissipation. In the SST-only implementation, the decoder power dissipation is reduced in high-SNR environments. In the proposed design, which combines SST and the variable truncation length, the decoder power is reduced noticeably, as shown in Fig 5.12(a). Fig 5.12(b) shows only the survivor memory power to highlight the effect of the dynamic truncation length.

(a) The power of whole Viterbi decoder (b) The power of the survivor memory

Fig 5.12 The power simulation results in different channel conditions

Table 5.5 compares the proposed design with another design, including design details and power consumption.
Compared with the Viterbi decoder built from automatically synthesized registers, the version using the LCSFF achieves 56.5% energy saving.

Table 5.5 Simulation result compares with other Viterbi decoder

| | Intel | New (synthesized) | New (LCSFF) |
|----------------------|-----------------------|-------------------|-------------|
| Technology | 90nm | 90nm | |
| State NO. | 64 | 64 | |
| Area (mm2) | ACS: 0.048, TB: 0.133 | 0.25 | 0.23 |
| Soft decision | | 3bit | |
| PM width | 10-bit | 9bit | |
| Truncation length | 96 | 64(max) | |
| Clock rate 100M (Hz) | 2G | 250M | |
| Data rate 200M (bps) | 500M | 500M | |
| Power | 40mW | 28.52mW | **12.4mW** |

# **Chapter 6**

# **Conclusion and Future Work**

This thesis has presented the LCSFF, which is designed with the conditional capture technique. The flip-flop applies the low-swing voltage technique, the conditional capture technique, and the transistor stacking technique to reduce both dynamic and static power efficiently, so it benefits applications with low switching activity. According to the simulation results, the LCSFF reduces the power-delay product by at least 27.5%, and it can operate at 3GHz.

Joint coding schemes are considered an effective way to reduce power consumption while providing a reliable interconnect. However, both crosstalk avoidance codes and error correction codes enlarge the physical transfer unit (phit) in network-on-chip. A large phit size is inefficient in terms of network-on-chip cost, i.e., area and power consumption. In contrast, with a small number of link wires, the spacing between the wires can be widened to decrease the coupling capacitance, so area and power consumption can be reduced. In chapter 4, a joint bus and error correction coding, the self-corrected green coding scheme with serializer and deserializer, is presented to construct a reliable and green interconnect for NoC.
In addition, the circuitry of green bus coding is simpler and more effective. The simulation results show that self-corrected green coding achieves 34.4% energy saving compared with uncoded words in UMC 90nm CMOS technology.

Chapter 5 applies the low clock swing flip-flop to the survivor memory unit of a proposed low-power Viterbi decoder for the MB-OFDM UWB system. The proposed design combines the SST and variable truncation length schemes. SST is a low-power technique that reduces the state transition activity at low hardware cost. Based on the path merging property of the Viterbi algorithm, we propose a modified memory management that adjusts the truncation length dynamically according to the channel conditions. Consequently, redundant data movement is eliminated. With the variable truncation length scheme, access to the survivor memory becomes more efficient. Experimental results indicate that the power reduction of the whole decoder and of the survivor memory unit exceeds 14% and 53% respectively when Eb/N0 is larger than 4dB, at the cost of a 13% gate count overhead for the additional control logic. The proposed low-power schemes reduce the power dissipation of the survivor memory significantly. However, the ACS unit is still power critical. Compared with the Viterbi decoder built from automatically synthesized registers, the version using the LCSFF achieves 56.5% energy saving.

Several related topics remain for future research. In the SoC era, more and more function units are embedded into a chip, and each function unit may operate at a different supply voltage. The voltage island concept has become a hot topic in recent years. To resolve the link problem between function blocks with different supply voltages, the level converter flip-flop has been presented. The level converter flip-flop embeds the level conversion function and resolves this problem with small overhead. Because of its good data tolerance, the LCSFF can be applied as a level converter.
The second topic is the energy crisis. It may be possible to use solar energy as the power supply, which certainly requires design for low power consumption. For example, the leakage current in the keeper could be used to maintain the output data, or the design could be combined with the data gating technique in the flip-flop.

# **Bibliography**

- [3.1] Chi-Ken Tsai and Wei Huang, "Low Power Flip-Flop and Reconfigurable FIFO Design," Department of Electronics Engineering & Institute of Electronics, College, National Chiao Tung University, Sept. 2004.
- [3.2] Neil Weste and David Harris, "CMOS VLSI Design," third edition, Addison Wesley, 2005.
- [3.3] G. Palumbo, F. Pappalardo and S. Sannella, "Evaluation on power reduction applying gated clock approaches," ISCAS, vol. 4, pp. IV-85 - IV-88, 2002.
- [3.4] A. S. Seyedi, S. H. Rasouli, et al, "Low power low leakage clock gated static pulsed flip-flop," ISCAS, pp. 21-24, May 2006.
- [3.5] P. Zhao, T. Darwish and M. Bayoumi, "High-performance and low-power conditional discharge flip-flop," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 477-484, May 2004.
- [3.6] H. Kawaguchi and T. Sakurai, "A reduced clock-swing flip-flop (RCSFF) for 63% power reduction," IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 807-811, May 1998.
- [3.7] B. Kong, S. Kim, and Y. Jun, "Conditional-capture flip-flop for statistical power reduction," IEEE J. Solid-State Circuits, vol. 36, pp. 1263-1271, Aug. 2001.
- [3.8] H. Partovi, R. Burd, et al, "Flow-through latch and edge-triggered flip-flop hybrid elements," in Proc. Dig. ISSCC, pp. 138-139, Feb. 1996.
- [3.9] Chen Kong Teh, Mototsugu Hamada, et al, "Conditional Data Mapping Flip-Flops for Low-Power and High-Performance Systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 12, Dec. 2006.
- [3.10] S. D. Naffziger, "The implementation of the Itanium 2 microprocessor," IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1448-1460, Nov. 2002.
- [3.11] J. D.
Warnock et al., "The circuit and physical design of the POWER4 microprocessor," IBM J. Res. Dev., vol. 46, pp. 27-51, Jan. 2002.
- [3.12] N. Nedovic, M. Aleksic, and V. G. Oklobdzija, "Conditional techniques for small power consumption flip-flops," in Proc. 8th IEEE Int. Conf. Electronics, Circuits Systems, Malta, Spain, pp. 803-806, Sept. 2-5, 2001.
- [3.13] J. Tschanz, S. Narendra, et al, "Comparative delay and energy of single edge-triggered & dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. ISLPED'01, Huntington Beach, CA, pp. 207-212, Aug. 2001.
- [3.14] G. Gerosa, S. Gary, et al, "A 2.2 W, 80 MHz superscalar RISC microprocessor," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, Dec. 1994.
- [3.15] Y. Suzuki, K. Odagawa, and T. Abe, "Clocked CMOS calculator circuitry," IEEE Journal of Solid-State Circuits, vol. 8, no. 6, pp. 462-469, Dec. 1973.
- [3.16] Qiu Xiaohai, Chen Hongyi, "Discussion on the low-power CMOS latches and flip-flops," Solid-State and Integrated Circuit Technology, Proceedings, 1998 5th International Conference, pp. 477-480, Oct. 1998.
- [3.17] D. Markovic, B. Nikolic and Brodersen, "Analysis and design of low-energy flip-flops," Low Power Electronics and Design, International Symposium, pp. 52-55, Aug. 2001.
- [3.18] Aliakbar Ghadiri and Hamid Mahmoodi, "Dual-Edge Triggered Static Pulsed Flip-Flops," Proceedings of the 18th International Conference on VLSI Design, pp. 846-849, 2005.
- [3.19] A. S. Seyedi, S. H. Rasouli, and A. Afzali-Kusha, "Clock Gated Static Pulsed Flip-Flop (CGSPFF) in sub 100nm Technology," Proceedings of the 2006 Emerging VLSI Technologies and Architectures, 2006.
- [3.20] A. S. Seyedi, S. H. Rasouli, and A. Amirabadi, "Low Power Low Leakage Clock Gated Static Pulsed Flip-Flop," ISCAS, pp. 3658-3661, 2006.
- [3.21] H. Partovi, R.
Burd, et al, "Flow-through latch and edge-triggered flip-flop hybrid elements," Solid-State Circuits Conference, Digest of Technical Papers. 43rd ISSCC, pp.138-139, 8-10 Feb. - [3.22] Peiyi Zhao, Jason McNeely, et al, "Low Power Clock Branch Sharing Double Edge Triggered Flip Flop," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on vol. 15, no. 3, pp. 338-345, March 2007. - [4.1] L. Zhang, J. Wilson, R. Bashirullah, et al, "A 32Gb/s On-chip Bus with Driver Pre-emphasis Signaling," IEEE Custom Integrated Circuits Conference, pp. 265-268,2006. - [4.2] R. Bashirullah, W.T. Liu, et al, "A 16 Gb/s adaptive bandwidth on-chip bus based on hybrid current/voltage mode signaling," IEEE Journal of Solid-State Circuits, Vol. 41, Issue 2, pp. 461-473, Feb. 2006. - [4.3] S.R. Sridhara and N.R. Shanbhag, "Coding for Reliable - On-Chip Buses: A Class of Fundamental Bounds and Practical Codes," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, Issue 5, pp. 977-982, May 2007. - [4.4] M. Copploa, "Trends and trade-offs in designing highly robust throughput on chip communication network," IEEE International On-Line Testing Synposium, July 2006 - [4.5] T. Lv, J. Henkel, H. Lekatsas and W. Wolf, "An adaptive dictionary encoding scheme for SOC data buses," in proceedings of Design, Automation and Test in Europe Conference and Exhibition DATE, pp. 1059-1064, 2002. - [4.6] Kang Min Lee, Se-Joong Le and Hoi-Jun Yoo, "Low energy transmission coding for on-chip serial communications," in Proceeding of System-on-Chip Conference, 2004, pp. 177-178. - [4.7] R. Srinivasa and R. Naresh, "Coding For System-On-Chip Network: A Unified Framework," Design and Automation Confernece, 2004, pp. 103-106. - [4.8] N.K. Patel and I. L. Markov, "Error-Correction and Crosstalk Avoidance in DSM Busses", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 12, NO.10, Oct. 2004, pp. 1076 -1080. 
- [4.9] Rung-Bin Lin, Chi-Ming Tsai, "Weight-based bus-invert coding for low-power applications", VLSID, 2002, pp. 121-125 - [4.10] Jens Muttersbach, Thomas Villiger, and Wolfgang Fichtner, "Practical design of globally asynchronous locally synchronous systems," In Proceeding of International Symposium on Advanced Research in Asynchronous Circuits and Systems, April 2000, pp. 52-59. - [4.11] E. Beigne and P. Vivet, "Design of On-chip and Off-chip Interfaces for a GALS NoC Architecture", IEEE International Symposium on Asynchronous Circuits and Systems, 2006 - [4.12] Z. Shengxian, W. Li, J. Carlsson, K. Palmkvist, and L. Wanhammar, "An asynchronous wrapper with novel handshake circuits for GALS systems", IEEE Proceeding of International conference on communications, circuits and systems, Vol. 2, 2002, pp. 1521-1525. - [4.13] Availble: http://public.itrs.net/files/2003ITRS/Home2003.htm - [4.14] Tzu-Wei Lin; Shang-Wei Tu; Jing-Yang Jou, "On-Chip Bus Encoding for Power Minimization Under Delay Constraint," VLSI Design, Automation and Test, 2007. VLSI-DAT 2007. International Symposium on 25-27, April 2007, pp.1 4 - [4.15] F. Tobajas, R. Esper-Chain, et al, "A Low Power 2.5 Gbps 1:32 Descrializer in SiGe BiCMOS Technology", IEEE Design and Diagnostics of Electronic Circuits and systems, pp. 19 24, April 18-21, 2006 - [4.16] Kouichi Kanda, Daisuke Yamazaki, et al, "40Gb/s 4:1 MUX/1:4 DEMUX in 90nm Standard CMOS", IEEE Solid-State Circuits Conference, Volume 1, pp. 152-590, Feb. 6-10, 2005 - [4.17] Zhigong Wang, Jingfeng Ding and Wencai Lu, "2.5-Gb/s 0.25-µm CMOS Lower Power 1:16 Demultiplexer", Asia-Pacific Microwave Conference Proceedings, Volume 1, Dec 4-7,2005 - [4.18] Ping-Lin Yang and Yarsun Hsu, "The High Speed Quarter-Rate 16/20:1 Serializer with the Area-Saving RF Devices," Department of Electrical Engineering, National Tsing Hua University, June 2007. 
- [4.19] Ming-Hao Lu and Yarsun Hsu, "The Quarter-Rate 1:16/20 - Deservation, "Department of Electrical Engineering, National Tsing Hua University, June 2007. - [4.20] Yu-Hao Hsu, Min-Sheng Kao, et al, "A 20 Gbps Scalable Load Balanced Birkhoff-von Neumann Symmetric TDM Switch IC with SERDES Interfaces", Design Automation Conference. ASP-DAC '07. Asia and South Pacific 23-26, pp.102-103, Jan. 2007 - [4.21] Se-Joong Lee; Kangmin Lee; Seong-Jun Song and Hoi-Jun Yoo, "Packet-switched on-chip interconnection network for system-on-chip applications," Circuits and Systems II: Express Briefs, IEEE Transactions on [see also Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on], Volume 52, Issue 6, pp.308-312, June 2005. - [4.22] K. Lee et al., "A 51mW 1.6GHz On-Chip Network for Low Power Heterogeneous SoC Platform," IEEE International Solid-State Circuits Conference, Feb. 2004, pp. 152-153. - [4.23] Se-Joong Lee, Kangmin. Lee and Hoi-Jun Yoo, "Analysis and Implemenation of Practical, Cost-Effective Networks on Chips," IEEE Design & Test Computers, Vol. 22, 2005, pp. 422-433. - [4.24] Po-Tsang Huang, Wei-Li Fang, Yin-Ling Wang, Wei Hwang, "Low Power and Reliable Interconnection with Self-Corrected Green Coding Scheme for Network-on-Chip," Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium, pp.77-83, April 2008 - [4.25] Wei-Li Fang and Wei Hwang, "Low Power and Reliable Interconnection with Self-Corrected Green Coding Scheme and Self-Calibrated Voltage Scaling Technique for Network-on-Chip," Department of Electronics Engineering & Institute of Electronics College National Chiao Tung - University August 2008. - [5.1] A. J. Viterbi, "Error bounds for convolutional codes and asymptotically optimal decoding algorithm," IEEE Trans. Inform. Theory, vol. IT-13, no. 2, pp. 206-269, Apr. 1967. - [5.2] G. D. Forney JR., "The Viterbi algorithm," Proc. IEEE, vol.61, no. 3, pp.268-278, Mar. 1973. - [5.3] C. B. Shung, P. H. Siegel, G. 
Ungerboeck and H. K. Thapar, "VLSI architectures for metric normalization in the Viterbi algorithm," SUPERCOMM/ICC '90. Conference Recoder., IEEE, vol. 4, pp. 1723-1728, Apr. 1990. - [5.4] A. P. Hekstra, "An Alternative to Metric Rescaling in Viterbi Decoders," IEEE Trans. Commun., vol. 37, no. 11, pp. 1220-1222, Nov. 1989. - [5.5] S. B. Wicker, Error Control Systems for Digital Communication and Storage, Prentice Hall, 1995. - [5.6] A. M. Obeid, A. Garcia, M. Petrov, and M. Glesner, "A Multi-path High Speed Viterbi Decoder," ICECS 2003. Proceeding of the 2003 10th IEEE International Conference on Electronics. Circuit and Systems, vol. 3, pp. 1160-1163, Dec. 2003. - [5.7] T. Ishitani, K. Tansho, N. Miyahara, S. Kubota and S. Kato, "A scarce-state-transition Viterbi decoder VLSI for bit error correction," IEEE Journal of Solid-State Circuits, Aug. 1987. - [5.8] S. Kubota and S. Kato, "Novel Viterbi Decoder VLSI Implementation and its Performance," IEEE Trans. Commun., vol. 41, no. 8, pp. 1170-1178, Aug. 1993. - [5.9] L. H. C. Lee, D. J. Tait, and P. G. Farrell, "Scarce-State-Transition Syndrome-Former Error-Trellis Decoding of (n, n-1) Convolutional Codes," IEEE Trans. Commun., vol. 44, no. 1, pp. 7-9, Jan. 1996. - [5.10] R. Henning and C. Chakrabarti, "An approach for adaptively approximating the Viterbi algorithm to reduce power consumption while decoding convolutional codes," Transactions on Signal Processing, vol. 52, pp. 1443-1451, May 2004. - [5.11] M. H. Chan, W. T. Lee, M. C. Lin, and L. G. Chen, "IC design of an adaptive Viterbi decoder," IEEE Transactions on Consumer Electronics, vol. 42, pp. 52-62, Feb. 1996. - [5.12] S. J. Simmons, "Breadth-first trellis decoding with adaptive effort," IEEE Transactions on Communications, vol. 38, pp. 3-12, Jan. 1990. - [5.13] A. Batra et al, "Multi-band OFDM physical layer proposal for IEEE 802.15 task group 3a," submitted to IEEE P802.15 working group for WPANs, Sept. 2004. - [5.14] F. Sun and T. 
Zhang, "Low-power State Parallel Relaxed Adaptive Viterbi Decoder," IEEE Trans. Circuits and Syst. I, vol. 54, no. 5, pp. 1060-1068, May 2007. - [5.15] M. Anders, S. Mathew, R. Krishnamurthy, and S. Borker, "A 64-state 2GHz 500Mbps 40mW Viterbi Accelerator in 90nm CMOS," in Sympo. VLSI Circuits Dig. Tech. Papers, 2004, pp.174-175. - [5.16] Dah-Jia Lin, Chen-Yi Lee, "A Low-power Viterbi Decoder Based on Scarce State Transition and Variable Truncation Length," Department of Electronics Engineering & Institute of Electronics College National Chiao Tung University August 2008. # Vita # PERSONAL INFORMATION Name: Yin-Ling Wang Birth Date: January 3, 1983 Birth Place: Taipei, Taiwan, R.O.C. Address: Department of Electronics Engineering National Chiao Tung University 1001 Ta-Hsueh Road Hsin-chu, Taiwan 30010, R.O.C. E-Mail Address: tost1121@yahoo.com.tw ## **EDUCATION** B.S. [2005] Department of Electronics Engineering, National Chi Nan University M.A.[2008] Institute of Electronics, National Chiao-Tung University.