# A Low Power Pulsed Edge-Triggered Latch for Survivor Memory Unit of Viterbi Decoder

Wei-Li Su and Herming Chiueh System-on-Chip Design Lab, Department of Communication Engineering, College of Electrical Engineering, National Chiao Tung University, Hsinchu 300, Taiwan. pippenbasket@yahoo.com.tw, chiueh@ieee.org

*Abstract*: **A low power pulsed edge-triggered latch based on the static edge-triggered latch (ETL) is presented for the survivor memory unit (SMU) of a Viterbi decoder in low power, high speed wireless local area network (WLAN) applications. By reducing the clock loading and the transistor count, the proposed low swing static ETL achieves less clock loading, a smaller cell area, and a smaller power-delay product than the traditional master-slave register. Moreover, a stage-reduced SMU is introduced to save both area and power. The proposed low swing static ETL and the stage-reduced SMU are designed and simulated in the TSMC 0.13um standard CMOS process at an operating clock frequency of 1GHz.**

## I. INTRODUCTION

In modern digital communication systems, information must be transmitted at high data rates, especially in wireless local area networks (WLAN) [1]. This raises both power dissipation and system complexity. To enhance system performance, an efficient error-control code is often employed. Convolutional codes are widely used in communication systems because they provide superior error-correction capability while maintaining reasonable coding complexity. The Viterbi algorithm is one of the optimal solutions for decoding convolutional codes with modest computing resources [2-3]. However, as higher transmission rates increase the power dissipation of the system, the error-control mechanism also becomes an additional source of power dissipation in the system implementation. In modern very large-scale integration (VLSI) design for digital communication, power dissipation has become more important than in the past for two reasons: the limited battery life of portable mobile systems, and the high cost of the packaging and cooling required for reliability in deep-submicron technology. These trends motivate building systems with low power features [4].

The Viterbi decoder is constructed from three major units [5]: transition metric unit (TMU), add-compare-select unit (ACSU), and survivor memory unit (SMU) as illustrated in Fig. 1.

Po-Tsang Huang and Wei Hwang

Low Power System-on-Chip Lab, Department of Electrical Engineering & Institute of Electronic, Microelectronics and Information Systems Research Center, National Chiao Tung University, Hsinchu 300, Taiwan. bug.ee91g@nctu.edu.tw, Hwang@mail.nctu.edu.tw



Figure 1. Block diagram of Viterbi decoder.

The TMU calculates the transition metrics (TM) from the input data. The ACSU accumulates the transition metrics recursively as path metrics (PM) and makes decisions to select the most likely state transition sequence. Finally, the SMU traces the decisions to extract this sequence. There are two main ways to build an SMU: traceback and register-exchange [6]. The former is built from embedded memory elements such as static random-access memory (SRAM), and the latter is composed of many registers and multiplexers. The traceback approach is power efficient, but it is not suitable for high speed applications because of the limited bandwidth of embedded memory [7]. The register-exchange approach traces the most likely state transition sequence more directly and intuitively, and it is easier to operate at higher speed. However, its power consumption is proportional to its size and increases with data throughput. In [8], the SMU occupies about 73% of the physical area of the whole Viterbi accelerator chip. Therefore, it is worthwhile to improve the SMU of the Viterbi decoder.
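The add-compare-select recursion described above can be sketched in a few lines. This is a minimal illustration, not the ACSU of this design: it assumes that a smaller metric means a more likely path, and the function name `acs_step`, the `predecessors` table, and the toy numbers are all hypothetical.

```python
from typing import List, Tuple

def acs_step(path_metrics: List[float],
             transition_metrics: List[List[float]],
             predecessors: List[List[int]]) -> Tuple[List[float], List[int]]:
    """One add-compare-select update over all states.

    path_metrics[p]          : accumulated metric of state p
    transition_metrics[j][k] : metric of the k-th incoming branch of state j
    predecessors[j][k]       : state at the start of that branch
    Returns the new path metrics and, per state, the index k of the winner.
    Assumption: the smallest metric marks the most likely path.
    """
    new_metrics, decisions = [], []
    for j, preds in enumerate(predecessors):
        # Add: extend every incoming path by its transition metric.
        candidates = [path_metrics[p] + transition_metrics[j][k]
                      for k, p in enumerate(preds)]
        # Compare and select: keep the best candidate and remember which one.
        best = min(range(len(candidates)), key=candidates.__getitem__)
        new_metrics.append(candidates[best])
        decisions.append(best)
    return new_metrics, decisions

# Toy 2-state example with made-up numbers.
pm = [0.0, 3.0]
preds = [[0, 1], [0, 1]]
tm = [[1.0, 0.0], [4.0, 0.5]]
new_pm, dec = acs_step(pm, tm, preds)
print(new_pm, dec)
```

The per-state decisions returned here play the role of the select signals that the SMU consumes; in a radix-4 decoder each state has four incoming branches, so each decision is a 2-bit value.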

In this paper, a low power pulsed edge-triggered latch is proposed for a low power, high speed SMU of the Viterbi decoder. Section II introduces related low power register designs. Section III discusses the architecture of the SMU and the proposed low power pulsed edge-triggered latch, and points out the design concept of the low power SMU. Section IV presents the simulation results of the low power pulsed edge-triggered latch and the SMU, together with the stage-reduced SMU. Section V gives the conclusion and future work.

This research was supported by the National Science Council, Taiwan (contract numbers: NSC 94-2220-E-009-016, NSC 95-2220-E-009-016) and the Ministry of Education, Taiwan (MoE ATU program). The authors would also like to acknowledge the design parameters provided by the NCTU-TSMC joint project.

## II. RELATED LOW POWER REGISTERS

Latches and registers are widely used in sequential circuit design because they store data. The main difference between them is their timing properties. A latch is a level-sensitive device, whereas a register is an edge-triggered storage element; an edge-triggered register is often referred to as a flip-flop as well [9]. There are three major types of registers: the master-slave flip-flop, the pulse register, and the sense-amplifier-based register. A master-slave flip-flop cascades a negative latch (master) with a positive latch (slave) to trigger at the clock edge. In a sense, sense-amplifier-based registers are similar in operation to pulse registers (i.e., the first stage generates the pulse, and the second latches it). However, sense-amplifier-based registers are used extensively in memory cores and in low-swing bus drivers to sense small input signals and amplify them to generate rail-to-rail swings [9]. The designs in [10] and [11] have complex circuit schemes and a larger transistor count, which runs against our goals, so we do not consider sense-amplifier-based registers in this SMU design.

Besides the most common multiplexer-based flip-flop, there is a modified low power register named the transmission-gate flip-flop (TGFF) [12], as illustrated in Fig. 2(b). The TGFF splits the "keeper" inverter into two different feedback paths to hold the data during the remaining half cycle. Because of the stacking effect [13] in these two feedback loops, a small amount of power can be saved. However, classical CMOS master-slave flip-flops employ two cascaded transparent latches controlled by true and inverted clocks. Because of this, they compare unfavorably with transparent latches in terms of speed, power, and area.

The advantage of pulse registers is the reduced clock loading and the smaller number of transistors required [9]. In our design experience, merely combining a pulse generator with a latch to form a pulse register is not enough to reduce the total power consumption. Although the external clock loading of the pulse register is reduced, the internal node capacitance at the latch's clock ports (i.e., the output of the pulse generator) does not decrease. As a result, the pulse generator consumes more power than expected, and the total power dissipation of the pulse register is not reduced as much as expected.


Figure 2. (a) Multiplexer-based flip-flop. (b) Transmission-gate flip-flop.

Recently, another type of pulse register called the edge-triggered latch (ETL) [14] has been designed. Instead of using a pulse generator, pulsed clock signals are generated locally in the registers and are used to trigger the transparent latch. The latches are transparent only during a small pulse window, so they effectively act as edge-triggered flip-flops. The hybrid latch flip-flop (HLFF) [16] is a well-known ETL, and the HLFF family is discussed thoroughly in [15]. However, in our experience, in addition to their transistor count, sizing their transistors to avoid race problems results in a larger cell area.
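The difference between a level-sensitive latch and a latch that is transparent only during a short pulse window can be illustrated with a small behavioral model. The sketch below is not a circuit description of any ETL in [14] or [15]. It assumes a discrete-time simulation in which the local pulse is formed as the clock ANDed with an inverted, delayed copy of itself; the names `level_latch`, `pulsed_latch`, and the `window` parameter are hypothetical.

```python
from typing import List

def level_latch(clk: List[int], d: List[int]) -> List[int]:
    """Positive level-sensitive latch: Q follows D whenever clk is high."""
    q, trace = 0, []
    for c, x in zip(clk, d):
        if c:
            q = x
        trace.append(q)
    return trace

def pulsed_latch(clk: List[int], d: List[int], window: int = 1) -> List[int]:
    """Latch that is transparent only for `window` samples after a rising edge.

    The local pulse is modelled as clk AND NOT(clk delayed by `window`).
    Assumption: the clock is low before the first sample.
    """
    q, trace = 0, []
    for t, (c, x) in enumerate(zip(clk, d)):
        delayed = clk[t - window] if t >= window else 0
        if c and not delayed:      # inside the short transparency window
            q = x
        trace.append(q)
    return trace

clk = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1]
d   = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(level_latch(clk, d))   # data changes during the high phase pass through
print(pulsed_latch(clk, d))  # only the value at each rising edge is captured
```

With a one-sample window the pulsed latch ignores the data changes that occur later in the high phase of the clock, which is why it behaves like an edge-triggered flip-flop.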

## III. SURVIVOR MEMORY UNIT AND PROPOSED DESIGN

#### *A. Survivor Memory Unit*

The radix-4 SMU of the Viterbi decoder based on the register-exchange method is illustrated in Fig. 3(a). The primary inputs of this SMU are 64 2-bit constants and 64 2-bit select signals (i.e., S00 to S63), which are calculated by the ACSU. The 64 2-bit constants consist of sixteen 00s, sixteen 01s, sixteen 10s, and sixteen 11s in sequence (i.e., {16[00], 16[01], 16[10], 16[11]}). The truncation length is the number of stages used to trace the decisions and extract the expected sequence; it depends on the system specification and the properties of the transmission channel.



Figure 3. (a) Block diagram of survivor memory unit. (b) Basic D element.



Figure 4. (a) Dual-rail static ETL. (b) Proposed low swing static ETL type 1. (c) Proposed low swing static ETL type 2.

In our system, the truncation length is 16 or 32 for practical implementation. One column represents one stage (i.e., 16 or 32 columns in our system), and each column of the SMU contains 64 basic D elements. Each D element has one 2-bit 4-to-1 multiplexer and one 2-bit register, as illustrated in Fig. 3(b). Finally, the primary output of this SMU is the output of the first D element in the last stage.
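A behavioral sketch of the basic D element and of one column may make this structure concrete. It only models the multiplexer-plus-register behavior described above. The class and function names are hypothetical, and the `wiring` table (which four outputs of the previous column feed each D element) is left as a parameter because the connection pattern of Fig. 3(a) is not spelled out in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class DElement:
    """One basic D element: a 2-bit 4-to-1 multiplexer feeding a 2-bit register."""
    q: int = 0  # current 2-bit register content (0..3)

    def next_value(self, inputs: Sequence[int], select: int) -> int:
        """Multiplexer output, i.e. the value captured at the next clock edge."""
        return inputs[select] & 0b11

def clock_column(column: List[DElement],
                 prev_outputs: Sequence[int],
                 wiring: Sequence[Sequence[int]],
                 selects: Sequence[int]) -> None:
    """Clock one column of the SMU.

    prev_outputs : outputs of the previous column (or the 2-bit constants)
    wiring[j]    : the four indices into prev_outputs that feed element j
                   (hypothetical parameter, depends on the trellis)
    selects[j]   : 2-bit select signal of element j from the ACSU
    """
    # Evaluate every multiplexer first, then update all registers together,
    # as happens on a common clock edge.
    captured = [elem.next_value([prev_outputs[i] for i in wiring[j]], selects[j])
                for j, elem in enumerate(column)]
    for elem, value in zip(column, captured):
        elem.q = value

# Toy column of four elements; the real SMU uses 64 per column.
column = [DElement() for _ in range(4)]
clock_column(column, prev_outputs=[0, 1, 2, 3],
             wiring=[[0, 1, 2, 3]] * 4, selects=[3, 2, 1, 0])
print([e.q for e in column])
```

For a truncation length of 16, the full structure would hold 16 such columns of 64 elements each, that is, 16 × 64 × 2 = 2048 register bits.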

The register-exchange based SMU contains a large number of registers, all of which need the external clock signal to operate. If we can reduce the input capacitance of the clock port of each register, the clock loading of the whole SMU is scaled down and the total power consumption of the SMU is reduced as well. In the next part of Section III, we focus on which type of low power register offers less clock loading and a smaller transistor count, so as to limit the area of the whole SMU. Because the SMU contains so many registers, registers with fewer transistors noticeably decrease its total area.
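The scaling argument can be made concrete with a first-order estimate. The sketch below is an illustration under stated assumptions, not a result of this work: it assumes each 2-bit register is built from two 1-bit cells, takes the per-cell clock loads reported in Table I (1.76 fF for the Mux-based flip-flop, 0.75 fF for the proposed type 2) as inputs, ignores wiring and clock-buffer capacitance, and uses the standard dynamic power relation P = C·V²·f for a full-swing clock net. The function names are hypothetical.

```python
def smu_clock_load_fF(columns: int,
                      clock_load_per_bit_fF: float,
                      states: int = 64,
                      bits_per_register: int = 2) -> float:
    """Total register clock-pin capacitance of the SMU in fF.

    Assumption: one 1-bit register cell per stored bit, no wire capacitance.
    """
    return columns * states * bits_per_register * clock_load_per_bit_fF

def clock_power_mW(load_fF: float, vdd: float = 1.2, freq_hz: float = 1e9) -> float:
    """First-order power needed to toggle that load once per cycle: C * V^2 * f."""
    return load_fF * 1e-15 * vdd ** 2 * freq_hz * 1e3

for name, load in [("Mux-based FF", 1.76), ("Proposed type 2", 0.75)]:
    total = smu_clock_load_fF(columns=16, clock_load_per_bit_fF=load)
    print(f"{name}: {total:.1f} fF, about {clock_power_mW(total):.2f} mW of clock-pin power")
```

The estimate scales linearly with both the number of columns and the per-cell clock load, which is why the two quantities are the main levers considered in this section.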

#### *B. Proposed Low Swing Static Edge-Triggered Latch*

The low swing conditional capture scheme we proposed in [17] realizes a low power edge-triggered latch by detecting the input switching activity and reducing the clock swing, as shown in Fig. 4(a). It uses NMOS transistors cascoded on top of the inverter chain to restrict the voltage swing to $(V_{DD} - V_t)$ and save power. However, the data switching activity in the Viterbi decoder is channel-dependent, so we decided not to add the conditional capture component, which also saves the peripheral XOR gate. To reduce cell area, we use only one diode-connected NMOS, M3, to limit the voltage swing of the inverter chain in Fig. 4(b), and only one output inverter. Another difference between Fig. 4(a) and Fig. 4(b) is that the gate signals of M1 and M2 are exchanged so that M1 has better driving ability than M2 under the limited voltage swing (i.e., $(V_{DD} - V_t)$ to ground). However, the gate terminal of M2 then receives a weak logic one, which results in a weaker pull-down network for the ETL. Since we do not need the conditional capture component, we move the diode-connected NMOS M3 to the bottom of the inverter chain, like a footer power-gating device, in Fig. 4(c). Therefore, M1 and M2 have equal driving ability, and the pull-down network is stronger than that of Fig. 4(b). In addition, the strong logic level at the gate of M2 reduces the clk-to-Q delay. Moreover, after sizing the gate widths of the pull-down network in Fig. 4(c), type 2 has a smaller cell size than type 1. Besides the cell size, the power consumption is also reduced because of the smaller driving current in the pull-down network.
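The benefit of the reduced swing can be estimated to first order. The relation below is a textbook approximation rather than a result of this work: it assumes a lumped node capacitance $C$ in the inverter chain that is charged from the supply $V_{DD}$ once per clock cycle, and it neglects the body effect on $V_t$, short-circuit current, and leakage. The symbols $C$, $E_{\text{full}}$, and $E_{\text{low}}$ are introduced here for illustration only.

$$
E_{\text{full}} = C\,V_{DD}^{2}, \qquad
E_{\text{low}} = C\,V_{DD}\,(V_{DD} - V_t), \qquad
\frac{E_{\text{low}}}{E_{\text{full}}} = 1 - \frac{V_t}{V_{DD}}
$$

Under these assumptions the saving per node is proportional to $V_t / V_{DD}$, and it is the same whether the diode-connected device sits at the top of the inverter chain, as in Fig. 4(b), or at the bottom, as in Fig. 4(c), because the charge drawn from the supply per cycle is $C\,(V_{DD} - V_t)$ in both cases.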

## IV. SIMULATION RESULTS

In this section, we present the simulation results of the proposed low swing static ETL and the SMU. In the last part of this section, we also describe the stage-reduced SMU, which saves area and power by exploiting the data transmission behavior of the SMU.

#### *A. Low Swing Static Edge-Triggered Latch*

The simulation environment of each register includes an input buffer, a clock buffer, and an output load capacitance of 0.7fF, which corresponds to the input capacitance of a single basic inverter. The clock frequency is 1GHz, and the process is TSMC 0.13um CMOS technology.

As summarized in Table I, the total power includes the power consumption of the register (core power) and of the buffers (peripheral power). Our type 1 design shows that reducing the clock loading is a suitable approach for low power applications: its peripheral power is lower than that of both conventional flip-flops. Our proposed type 2 design has both low power and small area. Although the core power of type 2 is not lower than that of the TGFF in idle mode and at 50% switching activity, the total power of type 2 is lower than that of the TGFF thanks to its smaller clock loading. The type 2 design also has the smallest power-delay product.
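The power-delay products in Table I can be reproduced from the listed total power and average C-Q delay. The short sketch below copies those two columns from the table; the dictionary layout and the function name `power_delay_product_fJ` are hypothetical, and the unit conversion assumes the product is reported in units of 10^-15 J (fJ).

```python
# Total power (uW) and average C-Q delay (ps) copied from Table I.
TABLE_I = {
    "Mux-based FF":    {"delay_ps": 96.0,  "total_uW": {"idle": 9.35, "50%": 14.75, "100%": 20.15}},
    "TGFF":            {"delay_ps": 102.5, "total_uW": {"idle": 9.19, "50%": 14.15, "100%": 19.06}},
    "Proposed type 1": {"delay_ps": 124.0, "total_uW": {"idle": 9.86, "50%": 14.07, "100%": 18.19}},
    "Proposed type 2": {"delay_ps": 101.5, "total_uW": {"idle": 8.52, "50%": 12.48, "100%": 16.44}},
}

def power_delay_product_fJ(total_power_uW: float, delay_ps: float) -> float:
    """uW * ps = 1e-18 J, so multiply by 1e-3 to express the product in fJ."""
    return total_power_uW * delay_ps * 1e-3

def best_register(activity: str) -> str:
    """Register with the smallest power-delay product at the given activity."""
    return min(TABLE_I, key=lambda name: power_delay_product_fJ(
        TABLE_I[name]["total_uW"][activity], TABLE_I[name]["delay_ps"]))

for name, row in TABLE_I.items():
    pdps = {act: round(power_delay_product_fJ(p, row["delay_ps"]), 3)
            for act, p in row["total_uW"].items()}
    print(name, pdps)
print("Smallest PDP at 50% activity:", best_register("50%"))
```

Recomputing the column this way also makes the trade-off visible: type 1 has the lowest clock load after type 2 but the longest delay, so its power-delay product is the largest of the four.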

#### *B. Stage Reduction and Simulation Results of SMU*

Based on the properties of the primary inputs and the register-exchange method, the output patterns of the first three stages always stay in the same order regardless of the select signals from the ACSU. Therefore, we can remove the first three stages.

| Register type   | Total gate width (um) | Clock load (fF) | Switching activity | Total power (uW) | Core power (uW) | Peripheral power (uW) | Average C-Q delay (ps) | Power-delay product (fJ) |
|-----------------|-----------------------|-----------------|--------------------|------------------|-----------------|-----------------------|------------------------|--------------------------|
| Mux-based FF    | 4.05                  | 1.76            | Idle               | 9.35             | 3.00            | 6.35                  | 96                     | 0.898                    |
|                 |                       |                 | 50%                | 14.75            | 6.93            | 7.82                  |                        | 1.416                    |
|                 |                       |                 | 100%               | 20.15            | 10.82           | 9.33                  |                        | 1.934                    |
| TGFF            | 4.05                  | 1.76            | Idle               | 9.19             | 2.98            | 6.21                  | 102.5                  | 0.942                    |
|                 |                       |                 | 50%                | 14.15            | 6.36            | 7.79                  |                        | 1.450                    |
|                 |                       |                 | 100%               | 19.06            | 9.70            | 9.36                  |                        | 1.953                    |
| Proposed type 1 | 3.3                   | 1.07            | Idle               | 9.86             | 4.11            | 5.75                  | 124                    | 1.223                    |
|                 |                       |                 | 50%                | 14.07            | 7.18            | 6.89                  |                        | 1.745                    |
|                 |                       |                 | 100%               | 18.19            | 10.10           | 8.09                  |                        | 2.256                    |
| Proposed type 2 | 2.7                   | 0.75            | Idle               | 8.52             | 3.68            | 4.84                  | 101.5                  | 0.865                    |
|                 |                       |                 | 50%                | 12.48            | 6.42            | 6.06                  |                        | 1.267                    |
|                 |                       |                 | 100%               | 16.44            | 9.12            | 7.28                  |                        | 1.669                    |

TABLE I. SIMULATION RESULTS OF REGISTERS

The select signals from the ACSU then become the primary inputs of the stage-reduced SMU in a specified order. For example, in our simulation an SMU with a truncation length of 16 is reduced to 13 stages. This saves area and power in proportion to the number of removed stages. The simulated power of the whole SMU after stage reduction is 13.26 mW at an operating voltage of 1.2V and 9.05 mW at 1.0V for low voltage operation. The clock frequency is 1 GHz, which is fast enough for our Viterbi decoder specification.
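The stage-reduction property can be checked with a small register-exchange simulation. The sketch below rests on an assumption that is not stated in the text: the trellis wiring is taken to be that of a 6-bit shift register whose two newest bits enter at the top, so that the predecessor of state `j` chosen by select value `s` is `((j & 0xF) << 2) | s`. This wiring is consistent with the constant ordering {16[00], 16[01], 16[10], 16[11]}, but it is not necessarily the exact netlist of Fig. 3(a). All function names are hypothetical.

```python
import random

NUM_STATES = 64
# The 64 2-bit constants: sixteen 00s, 01s, 10s and 11s in sequence.
CONSTANTS = [state >> 4 for state in range(NUM_STATES)]

def predecessor(state: int, select: int) -> int:
    """ASSUMED radix-4 trellis wiring (see the note above the code)."""
    return ((state & 0xF) << 2) | select

def clock_smu(stages, selects):
    """Advance all columns by one clock and return the new register contents."""
    new_stages = []
    prev_column = CONSTANTS          # the first column reads the constants
    for column in stages:
        new_stages.append([prev_column[predecessor(j, selects[j])]
                           for j in range(NUM_STATES)])
        prev_column = column         # old contents feed the next column
    return new_stages

def first_three_columns_are_fixed(cycles: int = 200, seed: int = 0) -> bool:
    """Check that columns 1 to 3 carry no information beyond the select signals."""
    rng = random.Random(seed)
    stages = [[0] * NUM_STATES for _ in range(16)]
    for cycle in range(cycles):
        selects = [rng.randrange(4) for _ in range(NUM_STATES)]
        stages = clock_smu(stages, selects)
        # Column 1 is a fixed pattern for any select values.
        if stages[0] != [(j >> 2) & 3 for j in range(NUM_STATES)]:
            return False
        # Column 2 is a fixed pattern once column 1 has settled.
        if cycle >= 1 and stages[1] != [j & 3 for j in range(NUM_STATES)]:
            return False
        # Column 3 simply holds the select signals of the current clock.
        if cycle >= 2 and stages[2] != selects:
            return False
    return True

print(first_three_columns_are_fixed())
```

Under the assumed wiring, the first two columns are constant and the third column is a registered copy of the select signals, which is consistent with the statement that the select signals can serve as the primary inputs of a 13-stage SMU.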

## V. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a stage-reduced survivor memory unit with a low swing static ETL for the Viterbi decoder. The proposed register has less clock loading, a smaller area, and a smaller power-delay product than the traditional Mux-based flip-flop. It can be very useful in digital systems with high switching activity. In the future, we will integrate the other Viterbi decoder components, synthesized with EDA tools, with the full-custom SMU hard macro in a physical implementation for low power, high speed WLAN applications.

#### **REFERENCES**

- [1] C.C. Lin, Y.H. Shih, H.C. Chang, and C.Y. Lee, "Design of a Power-Reduction Viterbi Decoder for WLAN Applications," *IEEE Transactions on Circuits and Systems*, vol. 52, no. 6, JUNE 2005.
- [2] A. J. Viterbi, "Error bounds for convolutional codes and asymptotically optimum decoding algorithm," *IEEE Trans. Inf. Theory*, vol. IT-13, no. 2, pp. 260–269, Apr. 1967.
- [3] G. D. Forney, Jr., "The Viterbi algorithm," *Proc. IEEE*, vol. 61, no. 3, pp. 268–278, Mar. 1973.
- [4] W.M. Chan; Herming Chiueh, "A block-level optimization of comprehensive thermal-aware power management for SoC integration in nano-scale CMOS technology," *Circuits and Systems, 2005. 48th Midwest Symposium on,* August 7-10, 2005.
- [5] G. Fettweis and H. Meyr, "A 100 MBit/s Viterbi decoder chip: Novel architecture and its realization," in *Proc. IEEE Int. Conf. Communications (ICC)*, vol. 2, Aug. 1990, pp. 463–467.
- [6] Ranpara, S.; Dong Sam Ha, "A low-power Viterbi decoder design for wireless communications applications," *ASIC/SOC Conference, 1999. Proceedings. Twelfth Annual IEEE International*, 15-18 Sept. 1999
- [7] C. M. Rader, "Memory management in a Viterbi decoder," *IEEE Trans. Commun.*, vol. 29, no. 9, pp. 1399–1401, Sep. 1981.
- [8] M. Anders, S. Mathew, R. Krishnamurthy and S. Borkar, "A 64 state 2GHz 500Mbps 40mW Viterbi Accelerator in 90nm CMOS," *2004 Symposium On VLSI Circuits Digest of Technical Papers*.
- [9] J.M. Rabaey, "Digital Integrated Circuits: A Design Perspective," Prentice Hall, New Jersey, 1996.
- [10] S.-D. Shin and B.-S. Kong, "Variable sampling window flip-flop for low power high-speed VLSI," *IEE Proc.Circuits Devices Syst.* Vol. 152, no. 3, June 2005.
- [11] H. Zhang, P. Mazumder, "Design of a new sense amplifier flip-flop with improved power-delay product," *ISCAS2005 (IEEE International Symposium on Circuits and Systems)*, May 23-26, 2005, pp. 1262-1265, vol. 2.
- [12] G. Gerosa et al., "A 2.2 W, 80 MHz superscalar RISC microprocessor," *IEEE J. Solid-State Circuits*, vol. 29, Dec. 1994.
- [13] V. De, Y. Ye, A. Keshavarzi, S. Narendra, J. Kao, D. Somasekhar, R. Nair, and S. Borkar, "Techniques for leakage power reduction," in *Design of High-Performance Microprocessor Circuits*, A. Chandrakasan, W. Bowhill, and F. Fox, Eds. Piscataway, NJ: IEEE, 2001, ch. 3, pp. 52–55.
- [14] Ding Li, P. Mazumder, N. Srinivas, "A dual-rail static edge-triggered latch," *ISCAS'01*, pp. 645-648, vol. 2.
- [15] S. H. Rasouli, A. Khademzadeh, A. Afzali-Kusha, and M. Nourani, "Low power single- and double-edge-triggered flip-flops for high speed applications," *IEE Proc.Circuits Devices Syst.* Vol. 152, no. 2, April 2005.
- [16] H. Partovi, et al., "Flow-Through Latch and Edge-Triggered Flip-Flop Hybrid Elements," *IEEE International Solid-State Circuits Conference*, pp. 138-139, Feb. 1996.
- [17] Chi-Ken Tsai , Po-Tsang Huang and Wei Hwang, "Low Power Pulsed Edge-Triggered Latches Design," *16th VLSI Design/CAD Symposium*, 2005.