# National Chiao Tung University

# Department of Electronics Engineering & Institute of Electronics

## Master's Thesis

Energy-Efficient TCAM Design for IP Lookup Tables in 40nm LP CMOS Process

> Student: Shu-Lin Lai Advisor: Prof. Wei Hwang

## September 2012

# Energy-Efficient TCAM Design for IP Lookup Tables in 40nm LP CMOS Process

Student: Shu-Lin Lai

Advisor: Prof. Wei Hwang



Submitted to Department of Electronics Engineering & Institute of Electronics College of Electrical Engineering and Computer Engineering National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Master

in

**Electronics Engineering** 

September 2012

Hsinchu, Taiwan



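## Energy-Efficient TCAM Design for IP Lookup Tables in 40nm LP CMOS Process

Student: Shu-Lin Lai

Advisor: Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University

## Abstract (Chinese)

Although ternary content addressable memory (TCAM) is a power-hungry on-chip design, it is still widely used for IP address lookup. This thesis proposes an energy-efficient TCAM circuit design implemented in a 40nm process. Because the on-current of n-type transistors is small in advanced processes, we adopt p-type comparison circuits in the 16T TCAM cell to increase the on-current of the dynamic circuitry. In addition, the butterfly match-line scheme connected with AND gates not only reduces the wire capacitance on the dynamic nodes but also ensures that the dynamic-node capacitance is the same in every segment. To reduce power consumption further, we propose a ripple bit-line read scheme and a ripple search-line scheme that exploit the don't-care property of the TCAM. Compared with the conventional hierarchical architecture, this approach reduces the switching power of the search-lines and also saves the long-wire capacitance of the bit-lines and search-lines. Furthermore, we propose a column-based data-aware circuit to strengthen write ability, while a dynamic power-supply design based on the don't-care property reduces leakage current and improves the static noise margin. During a write, to keep the stored data from being corrupted and to make the data-aware control circuit robust against environmental variation, our design also provides a replica circuit that controls the switching time of the dynamic power supply. Using a 40nm process, we combine these low-power circuit techniques in 256x40 and 256x144 TCAMs. Post-layout simulation shows that, operating at 400 MHz and 1 V, the design saves 28.9% of leakage power and 31.74% of search-line power, and each TCAM cell consumes only 0.461 fJ on average.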


## Energy-Efficient TCAM Design for IP Lookup Tables in 40nm LP CMOS Process

Student: Shu-Lin Lai

Advisor: Prof. Wei Hwang

Department of Electronics Engineering & Institute of Electronics National Chiao-Tung University

## **ABSTRACT**

Ternary content addressable memory (TCAM) is extensively adopted in the routing tables of network systems and accounts for a large share of their energy consumption. In this thesis, energy-efficient TCAM macros of sizes 256x40 and 256x144 are designed and realized in a 40nm LP CMOS process. Because the drain current is small in the 40nm LP CMOS process, a 16T AND-type TCAM cell with p-type comparison circuits is used to increase the I<sub>on</sub>/I<sub>off</sub> ratio of the dynamic circuitry. Additionally, the butterfly match-line scheme with AND gates is designed to reduce the wire loading on the evaluation nodes and to ensure that the capacitance of the evaluation nodes is the same in all segments. To further reduce energy consumption in nano-scale technologies, don't-care-based ripple search-lines and ripple bit-lines are realized to decrease both the switching activity and the wire capacitance of the search-lines and bit-lines. Moreover, column-based data-aware power control is employed to reduce leakage power and to improve write-ability and static noise margin (SNM) through power-gating devices. Replica circuitry makes the timing of the power switching tolerant to PVT variation and Vt scatter. The energy-efficient 256x40 and 256x144 TCAM macros are implemented in UMC 40nm LP CMOS technology, and the experimental results demonstrate a leakage power reduction of 28.9%, a search-line power reduction of 31.74%, and an energy metric of 0.461 fJ/bit/search for the TCAM macro.

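## Acknowledgements

There are many, many people to thank for the completion of this thesis. First, I would like to thank my advisor, Prof. Wei Hwang. Under his guidance I learned the right attitude and methods for research, and he constantly pointed me in the right research direction. He also provided an excellent research environment and ample resources, which allowed me to make full use of my abilities to complete this thesis. Next, I would like to thank Prof. 莊景德 for his frequent guidance on my research; the industry-academia collaboration project also brought me many unexpected rewards.

I would also like to thank my senior labmate 黃柏蒼, who mentored me, kept offering new perspectives and directions for me to learn from throughout this research, and patiently guided me through difficulties. I also thank the three doctoral students 張銘宏, 楊皓義, and 謝維致 for their help and discussions. And of course I thank my lab partners 林弘璋 and 陳美維; our mutual support and encouragement along the way were a great help in my research life as a master's student.

Finally, I would like to thank my dearest parents and my sister, who always gave me the greatest encouragement and support when I faced setbacks or felt exhausted. Their timely care gave me boundless motivation to keep going and allowed me to complete my master's thesis research. I also thank my many good friends, who always stood behind me and became my most important emotional support. Limited words cannot express unlimited gratitude, but I, Shu-Lin, sincerely thank you all.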


## Contents

- Chapter 1 Introduction
  - 1.1 Background
  - 1.2 Motivation
  - 1.3 Thesis Organization
- Chapter 2 Overview of Low Power CAM/TCAM Design
  - 2.1 Applications & Architecture of CAM/TCAM
    - 2.1.1 Conventional CAM Architecture
    - 2.1.2 Applications of CAM/TCAM
      - 2.1.2.1 Cache Memory
      - 2.1.2.2 Translation Look-aside Buffer
      - 2.1.2.3 ATM Switches
      - 2.1.2.4 Packet Forwarding Using CAM
  - 2.2 Design of CAM/TCAM Cells
    - 2.2.1 Binary CAM Cell
      - 2.2.1.1 NOR-type CAM Cell
      - 2.2.1.2 AND-type CAM Cell
    - 2.2.2 Ternary CAM Cell
      - 2.2.2.1 NOR-type TCAM Cell
      - 2.2.2.2 AND-type TCAM Cell
  - 2.3 Low Power Match-Line Scheme
    - 2.3.1 Conventional Match-line Structure
      - 2.3.1.1 NOR-type Match-line
      - 2.3.1.2 AND-type Match-line
    - 2.3.2 Selective Pre-charge Scheme
    - 2.3.3 Pipelined Hierarchical Search Scheme
    - 2.3.4 Current-Saving Scheme
    - 2.3.5 Wide-AND Match-line Scheme
    - 2.3.6 Tree Style AND-type Match-line Scheme
  - 2.4 Low Power Search-line Schemes
    - 2.4.1 Hierarchical Search-line Scheme
    - 2.4.2 Charge-Recycling Search-line Scheme
    - 2.4.3 Two-Level Don't-Care Gating Scheme
    - 2.4.4 Low Swing Search-line Scheme
  - 2.5 Low Power Design Techniques for CAM/TCAM Macro
    - 2.5.1 Power-Gated ML Sensing
    - 2.5.2 Self-Disable Sensing Technique
    - 2.5.3 Dynamic Power Source (DPS) Technique
    - 2.5.4 Variability-Tolerance CAM Cells with NOR-type Match-lines
- Chapter 3 Energy-Efficient Match-Line Schemes
  - 3.1 Conventional NAND-Type Match-Line Schemes
  - 3.2 AND-Type TCAM Cell with P-Type Comparison Circuits
  - 3.3 XOR-based Conditional Keeper
    - 3.3.1 Circuit Implementation
    - 3.3.2 Design Analysis
  - 3.4 Butterfly Match-Line Scheme
    - 3.4.1 Organization
    - 3.4.2 Design Consideration
  - 3.5 Summary
- Chapter 4 Column-Based Low Power Design Techniques
  - 4.1 Ripple Bit-Line Scheme for Read/Write Operation
    - 4.1.1 Circuit Implementation & Operation
    - 4.1.2 Design Consideration
  - 4.2 Don't-Care-Based Ripple Search-Line Scheme
    - 4.2.1 Circuit Implementation
    - 4.2.2 Design Consideration
  - 4.3 Column-Based Data-Aware Power Control
    - 4.3.1 Basic Concept
    - 4.3.2 Replica Timing Control Circuit
  - 4.4 Simulation Results and Analysis
    - 4.4.1 Performance Comparison of HSL and RSL
    - 4.4.2 Simulation Result for DAPC
  - 4.5 Summary
- Chapter 5 Implementation of 256x40 and 256x144 Energy-Efficient TCAM Macro in UMC 40nm LP CMOS Process
  - 5.1 Specification of Energy-Efficient TCAM Macro
  - 5.2 Architecture & Floor-planning of TCAM Macro
  - 5.3 Butterfly Match-Line Design for 256x40 and 256x144
  - 5.4 Design Implementation in UMC 40nm LP CMOS Process
    - 5.4.1 Shared BL/DL
    - 5.4.2 Interleaving Vertical Lines
    - 5.4.3 Cell Layout
  - 5.5 Simulation Results and Analysis
    - 5.5.1 Simulation Results of 256x144 TCAM Macro
    - 5.5.2 Simulation Results of 256x40 TCAM Macro
  - 5.6 Summary
- Chapter 6 Conclusions and Future Works
  - 6.1 Conclusions
  - 6.2 Future Work
- Bibliography
  - Chapter 1
  - Chapter 2
  - Chapter 3
  - Chapter 4
  - Chapter 5
- Vita



# **List of Figures**

- Fig. 1.1 Binary CAM (BCAM) cell and ternary CAM (TCAM) cell.
- Fig. 2.1 Conventional CAM architecture.
- Fig. 2.2 A simple cache memory.
- Fig. 2.3 A simple virtual memory system.
- Fig. 2.4 ATM switch with CAM.
- Fig. 2.5 Packet forwarding by an address-lookup table in network routers.
- Fig. 2.6 NOR-type binary CAM cell. (a) 9T BCAM cell and (b) 10T BCAM cell.
- Fig. 2.7 AND-type 9-transistor binary CAM cell.
- Fig. 2.8 Static NOR-type ternary CAM cell.
- Fig. 2.9 Dynamic NOR-type ternary CAM cell.
- Fig. 2.10 AND-type ternary CAM cell.
- Fig. 2.11 Structure of conventional NOR-type match-line.
- Fig. 2.12 Structure of conventional AND-type match-line.
- Fig. 2.13 Word structure of the selective pre-charge scheme.
- Fig. 2.14 Pipelined MLs reduce power by shutting down after a miss in a stage.
- Fig. 2.15 Current-saving match-line sensing scheme.
- Fig. 2.16 64 bits sequential AND plan and HS-AND match circuit.
- Fig. 2.17 (a) Parallel, (b) 3-level tree, and (c) 2-level tree AND-type match lines.
- Fig. 2.18 Schematic of the hierarchical search-line architecture.
- Fig. 2.19 Charge-recycling search-line driver.
- Fig. 2.20 (a) L1 gating node (GNL1) implementation. (b) L2 DCG example.
- Fig. 2.21 Operation of a NOR-cell in the LSSL-CAM. (a) Mismatch. (b) Match.
- Fig. 2.22 Schematic of the NOR-cell block in the LSSL CAM.
- Fig. 2.23 Row-based ML sense amplifier and new CAM architecture.
- Fig. 2.24 The differential NAND CAM cell. The block M and block DMLSA denote a memory cell to store the data bit and differential ML sense amplifier.
- Fig. 2.25 DMLSA.
- Fig. 2.26 (a) DPSVDD implementation. (b) DPSGND implementation.
- Fig. 2.27 (a) NVT-BCAM cell with NOR-type match-line. (b) Read/Write timing sequence of the NVT-BCAM cell.
- Fig. 3.1 PF-CDPD AND-type match-line scheme.
- Fig. 3.2 Transfer dynamic logic into clock-and-data pre-charge dynamic (CDPD) circuits.
- Fig. 3.3 Pseudo-footless clock-and-data pre-charge dynamic (PF-CDPD) circuits.
- Fig. 3.4 16T AND-type TCAM cell with P-type comparison circuits.
- Fig. 3.5 Drain current versus gate voltage for different technology.
- Fig. 3.6 AND-type match-line with XOR-based conditional keeper.
- Fig. 3.7 The diagram of XOR-based conditional keeper.
- Fig. 3.8 (a) Search time (b) Power consumption versus UNG margin for different keepers.
- Fig. 3.9 Butterfly match-line scheme.
- Fig. 3.10 Modification of NOR gate in TCAM segment.
- Fig. 3.11 Butterfly connection style with P-type comparison circuit and XOR-based conditional keeper.
- Fig. 4.1 Packet routing based on longest prefix matching mechanism.
- Fig. 4.2 The ripple bit-line scheme and timing waveforms of read operation.
- Fig. 4.3 Power and delay comparisons of the local bit-line scheme.
- Fig. 4.4 (a) A simplified architecture (b) Circuit implementation of don't-care based ripple search-line scheme.
- Fig. 4.5 Delay of ripple search-line scheme versus number of TCAM cells on each local search-line.
- Fig. 4.6 The architecture of column-based data-aware power control in TCAM macro.
- Fig. 4.7 Cell connection of column-based data-aware power control scheme.
- Fig. 4.8 Control Circuits for (a) storage cells (b) don't-care cells.
- Fig. 4.9 An adaptive replica timing control circuit for data-aware scheme.
- Fig. 4.10 The diagram of adaptive write-time tracing replica.
- Fig. 4.11 (a) Hierarchical search-line structure. (b) Ripple search-line structure.
- Fig. 4.12 Search-line power consumption under different don't-care patterns.
- Fig. 4.13 Analysis of the search-line delay under different search-line schemes.
- Fig. 4.14 Leakage power consumption under different don't-care pattern when Flag=1.
- Fig. 4.15 Leakage power consumption under different don't-care pattern when Flag=0.
- Fig. 5.1 Timing diagram of writing storage/don't-care cells.
- Fig. 5.2 Timing diagram of reading storage/don't-care cells.
- Fig. 5.3 Timing diagram of search operation.
- Fig. 5.4 Block diagram of 256x40 TCAM macro.
- Fig. 5.5 The floor-plan of energy-efficient 256x40 TCAM macro.
- Fig. 5.6 Butterfly match-line scheme for 144-bit TCAM cells.
- Fig. 5.7 (a) Two-stage (b) Three-stage butterfly match-line scheme for 40-bit TCAM cells.
- Fig. 5.8 (a) Typical TCAM cell. (b) TCAM cell with shared BL/DL.
- Fig. 5.9 Coupling capacitance.
- Fig. 5.10 (a) Coupling Effect of conventional vertical lines. (b) Interleaving vertical lines.
- Fig. 5.11 Layout view of 1-bit TCAM cell.
- Fig. 5.12 Timing analysis of search operation.
- Fig. 5.13 Layout view of a TCAM segment with 5-bit TCAM cells.
- Fig. 5.14 A 256x40-bit layout of the proposed energy-efficient TCAM.
- Fig. 6.1 The concept of the Internet of Things (IoT).



## **List of Tables**

- Table 2.1 Truth table of NOR-type binary CAM cell.
- Table 2.2 Truth table of AND-type binary CAM cell.
- Table 2.3 State assignments and truth table for static TCAM cell.
- Table 2.4 State assignments for TCAM cell.
- Table 3.1 Control organism of XOR-based conditional keeper.
- Table 3.2 Comparison of search delay with NOR gate and AND gate in TCAM segment (Unit: ns).
- Table 4.1 Key signals of replica-column scheme.
- Table 4.2 The corresponding virtual source voltage under different operations.
- Table 4.3 The truth table of the control signals for storage cells.
- Table 4.4 The truth table of the control signals for don't-care cells.
- Table 5.1 Descriptions of input pins.
- Table 5.2 Descriptions of output pins.
- Table 5.3 Truth table of three modes.
- Table 5.4 Pre-simulation result of 256x144 TCAM macro.
- Table 5.5 Summary of the 256x144 TCAM macro.
- Table 5.6 Post simulation result of 256x40 TCAM macro.
- Table 5.7 Features summary and comparisons.

# Chapter 1 Introduction

## **1.1 Background**

Content-addressable memory (CAM), also called associative memory, executes the lookup-table function in a single clock cycle using dedicated comparison circuitry. A CAM compares input search data against a table of stored information and returns the address of the matching data. Accordingly, CAM cells contain storage memories and comparison circuits. CAM cells are of two types, binary content addressable memory (BCAM) and ternary content addressable memory (TCAM), depending on their comparison function, as presented in Fig. 1.1.



Fig. 1.1 Binary CAM (BCAM) cell and ternary CAM (TCAM) cell.

A BCAM cell has two states: the "one" state and the "zero" state. A BCAM cell contains a 1-bit storage memory and a 1-bit comparison circuit. A TCAM cell has three states: logic 0, logic 1, and don't-care X. The third state, don't-care X, which is used in masking, makes TCAM suitable for network router applications. Hence, the difference between BCAM and TCAM is that a TCAM cell contains an extra SRAM cell to store the don't-care state. If the datum in the don't-care cell is 1, the match-line (ML) bypasses that cell and is discharged to ground without performing any comparison. If the datum in the don't-care cell is 0, the TCAM cell functions in the same way as a BCAM cell.

Owing to its fast search capability, CAM has been employed in numerous applications that require high search speed. Over the past decades, these applications have included parametric curve extraction [1.1], Hough transformation [1.2], Lempel–Ziv compression [1.3], image coding [1.4], the human body communication controller [1.5], the periodic event generator [1.6], and the virus-detection processor [1.7]. At present, CAM is popular in network routers for packet forwarding, packet classification, asynchronous transfer mode (ATM) switching, and other functions.

## **1.2 Motivation**

As the range of CAM applications grows, power consumption has become one of the critical challenges. The trade-off among power, speed, and area is the most important issue in recent research on large-capacity CAMs. The primary commercial application of CAMs today is the classification and forwarding of Internet Protocol (IP) packets in network routers. To overcome the dwindling unallocated address space, Internet Protocol Version 6 (IPv6) has become mandatory for building new Internet networks and services. The IP address space is expanded from 32-bit to 128-bit identifiers for interfaces and sets of interfaces [1.8]-[1.11]. With existing implementations ranging from 32 to 144 bits, the large CAM word results in a long search delay path in network routers for packet forwarding applications. Accordingly, high speed and low power are the two major goals of TCAM design for IP-address forwarding applications, especially in nano-scale technologies.

With silicon technology entering the sub-65nm regime, leakage power increasingly dominates the overall power consumption. However, previous investigations of low-power TCAM have focused only on dynamic power consumption [1.12]-[1.15]. The data-aware power control scheme is proposed to reduce both leakage current and dynamic power dissipation and to further improve write-ability. The other serious issue, the I<sub>on</sub>/I<sub>off</sub> ratio, is expected to worsen further with technology scaling, degrading search performance and TCAM cell functionality, particularly in low-power processes. We have developed the AND-type TCAM cell with P-type comparison circuits to address these problems.

For high-density circuit design, the ripple bit-line scheme and ripple search-line scheme are proposed not only to enhance area efficiency but also to save the additional process cost of a layer for hierarchical lines. Based on continuous don't-care patterns, the don't-care-based ripple search-line scheme also reduces dynamic power effectively. Moreover, the noise-tolerant XOR-based conditional keeper and butterfly match-line scheme are adopted to further reduce the power consumption and the length of critical paths [1.16]. In this work, 256x40 and 256x144 TCAM macros are implemented using UMC 40nm low-power technology. The details of the energy-efficient techniques and their analysis are also included.

## **1.3 Thesis Organization**

The organization of this thesis is as follows. Chapter 2 gives an overview of CAM, presenting a conventional CAM architecture including CAM cells and CAM word schemes, together with the applications and prior low-power methodologies of CAM. Chapter 3 presents the noise-tolerant butterfly match-line scheme with the XOR-based conditional keeper, and also describes the AND-type TCAM cell with P-type comparison circuits. Chapter 4 presents the ripple bit-line scheme, the don't-care-based ripple search-line scheme, and the data-aware power control scheme. By exploiting the regular don't-care pattern of the table, both dynamic and leakage power can be saved. The energy-efficient 256x40 and 256x144 ternary CAM arrays are implemented in Chapter 5. That chapter also presents further layout considerations that reduce area overhead and coupling effects, including the shared BL/DL and interleaving vertical global lines techniques. Finally, the overall results and conclusions are drawn in Chapter 6.



# Chapter 2 Overview of Low Power CAM/TCAM Design

This chapter surveys CAM design techniques at the circuit level and at the architectural level. Typical CAM/TCAM architectures and their applications are described in Section 2.1. The basic operation, cell circuits, and word schemes of CAM are presented in Section 2.2. Low-power match-line schemes and low-power search-line driving approaches are presented in Sections 2.3 and 2.4, respectively. At the architecture level, Section 2.5 reviews several design techniques for reducing the power consumption of CAM/TCAM macros.

## 2.1 Applications & Architecture of CAM/TCAM

### 2.1.1 Conventional CAM Architecture

A conventional CAM architecture is usually composed of data memories, address decoders, bit-line pre-charge circuits, word match schemes, read sense amplifiers, address priority encoders, and so on [2.1]-[2.7]. Fig. 2.1 shows a simplified block diagram of a CAM. Generally, a CAM has three operation modes: write, read, and search. In write and read operations, a CAM behaves just like an ordinary memory; that is, data is manipulated in the CAM array in the same way as in an SRAM array. Unlike SRAM, a CAM has a special mode: search mode. The input in Fig. 2.1, called the search word, is broadcast onto the search-lines to the table of stored data. The number of bits in a CAM word is usually large, with existing implementations ranging from 36 to 144 bits. A typical CAM employs a table size ranging from a few hundred entries to 32K entries, corresponding to an address space ranging from 7 bits to 15 bits. Each stored word has a match-line that indicates whether the search word and the stored word are identical (the match case) or different (a mismatch case, or miss). The match-lines are fed to an encoder that generates a binary match location corresponding to the match-line that is in the match state. An encoder is used in systems where only a single match is expected. In CAM applications where more than one word may match, a priority encoder is used instead of a simple encoder. A priority encoder selects the highest-priority matching location to map to the match result, with words in lower address locations receiving higher priority. The overall function of a CAM is to take a search word and return the matching memory location. One can think of this operation as a fully programmable arbitrary mapping from the large space of the input search word to the smaller space of the output match location.



Fig. 2.1 Conventional CAM architecture.
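The search mode described above can be illustrated with a small behavioral sketch. It is only a software model under stated assumptions, not the circuit of this thesis: the names `cam_search` and `priority_encode` and the 4-entry table are hypothetical, and the loop stands in for comparisons that the hardware performs on all words in parallel.

```python
from typing import List, Optional


def cam_search(table: List[int], search_word: int) -> List[bool]:
    """Model the match-lines: one boolean per stored word (True = match)."""
    # Hardware compares every stored word at once; the comprehension only models the result.
    return [stored == search_word for stored in table]


def priority_encode(match_lines: List[bool]) -> Optional[int]:
    """Return the lowest matching address (lower address = higher priority), or None on a miss."""
    for address, matched in enumerate(match_lines):
        if matched:
            return address
    return None


# Hypothetical 4-entry, 4-bit CAM contents, chosen only for illustration.
table = [0b1011, 0b0110, 0b0110, 0b1111]
match_lines = cam_search(table, 0b0110)   # two words match
location = priority_encode(match_lines)   # the lower address wins
```

With two matching words at addresses 1 and 2, the priority encoder reports address 1, which mirrors the rule that words in lower address locations receive higher priority.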

### 2.1.2 Applications of CAM/TCAM

CAMs have long been widely used in cache memory systems and in the translation look-aside buffer (TLB) of virtual memory systems. The primary commercial application of CAMs today is to classify and forward Internet Protocol (IP) packets in network routers [2.8]-[2.12]. In networks like the Internet, a message such as an e-mail or a Web page is transferred by first breaking up the message into small data packets of a few hundred bytes, then sending each data packet individually through the network. These packets are routed from the source, through the intermediate nodes of the network (called routers), and reassembled at the destination to reproduce the original message. The function of a router is to compare the destination address of a packet to all possible routes in order to choose the appropriate one. A CAM is a good choice for implementing this lookup operation because of its fast search capability.

#### 2.1.2.1 Cache Memory

The cache plays an important role in the memory hierarchy [2.13], [2.14]. Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU. The term is also used to refer to any storage managed to take advantage of locality of access. A cache provides fast access to recently used portions of instructions or data. When the CPU finds a wanted data item in the cache, the event is called a cache hit. Conversely, if the CPU does not find a needed data item in the cache, it is called a cache miss.

An example of a direct-mapped cache is illustrated in Fig. 2.2. The address has 32 bits and is divided into three parts. The first part is the byte offset, which occupies two bits. The second part is the index, and the third part is the tag. The number of index bits determines the capacity of the cache: if there are N index bits, the cache has 2<sup>N</sup> entries in which data items can be stored. A lookup first locates the entry selected by the index and reads out the tag stored there. This stored tag is then compared with the tag field of the address. If they are the same and the valid bit is one, a hit signal and the corresponding data are sent out. If they are not the same, a miss occurs. The valid bit indicates whether an entry contains a valid address. The tag entries are implemented as a CAM array.



Fig. 2.2 A simple cache memory.
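The lookup steps above can be sketched in a few lines. This is a minimal behavioral model, assuming a 32-bit address, the two-bit byte offset from the text, and an arbitrarily chosen N = 4 index bits; `CacheEntry`, `split_address`, and `lookup` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BYTE_OFFSET_BITS = 2  # from the text: the byte offset occupies two bits
INDEX_BITS = 4        # assumed N for illustration; the cache has 2**N = 16 entries


@dataclass
class CacheEntry:
    valid: bool = False
    tag: int = 0
    data: int = 0


def split_address(address: int) -> Tuple[int, int, int]:
    """Split an address into (tag, index, byte_offset)."""
    byte_offset = address & ((1 << BYTE_OFFSET_BITS) - 1)
    index = (address >> BYTE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_offset


def lookup(cache: List[CacheEntry], address: int) -> Optional[int]:
    """Return the cached data on a hit, or None on a miss."""
    tag, index, _ = split_address(address)
    entry = cache[index]                    # locate the entry selected by the index
    if entry.valid and entry.tag == tag:    # hit: valid bit set and tags equal
        return entry.data
    return None                             # miss


cache = [CacheEntry() for _ in range(1 << INDEX_BITS)]
tag, index, _ = split_address(0x12345678)
cache[index] = CacheEntry(valid=True, tag=tag, data=0xCAFE)  # illustrative fill
```

A second address that maps to the same index but carries a different tag misses, which is the case the tag comparison exists to detect.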

#### 2.1.2.2 Translation Look-aside Buffer

The translation look-aside buffer (TLB) is widely used in virtual memory systems. A TLB is like a cache that holds only page-table mappings [2.13], [2.14]. Its function is to provide fast translation from the virtual address to the physical address. Once the physical address is obtained, it can be used to access the data stored in memory (such as cache, DRAM, or disk). The TLB therefore speeds up address translation in processors with virtual memory, cutting down access time and lowering miss rates.



Fig. 2.3 shows a simple virtual memory system. The TLB contains a subset of the virtual-to-physical page mappings that are in the page table. Because the TLB is a cache, it must have a tag field, which consists of CAMs. When a virtual page number (VPN) is sent to the TLB, the VPN is compared with all valid tags in the TLB. If the VPN matches a tag in the TLB, the corresponding physical page address is read from the matching entry. However, if there is no matching entry in the TLB for a page, the page table must be examined. The page table either supplies a physical page number for the page or indicates that the page resides on disk, in which case a page fault occurs. Since the page table has an entry for every virtual page, no tag field is needed.
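The translation flow just described (TLB first, then the page table, then a page fault) can be modeled as follows. This is a hedged sketch: the 12-bit page offset, the `translate` function, the `PageFault` exception, and the table contents are all assumptions made for illustration and do not come from the thesis.

```python
from typing import List, Optional, Tuple

PAGE_OFFSET_BITS = 12  # assumed 4 KiB pages, for illustration only

TlbEntry = Tuple[bool, int, int]  # (valid, tag = virtual page number, physical page number)


class PageFault(Exception):
    """Raised when the page table indicates that the page resides on disk."""


def translate(virtual_address: int, tlb: List[TlbEntry],
              page_table: List[Optional[int]]) -> Tuple[int, bool]:
    """Return (physical_address, tlb_hit)."""
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    # CAM-style tag search: the VPN is compared with every valid tag.
    for valid, tag, ppn in tlb:
        if valid and tag == vpn:
            return (ppn << PAGE_OFFSET_BITS) | offset, True
    # TLB miss: the page table is indexed directly by VPN, so it needs no tag field.
    ppn = page_table[vpn]
    if ppn is None:
        raise PageFault(f"virtual page {vpn:#x} is on disk")
    return (ppn << PAGE_OFFSET_BITS) | offset, False


# Hypothetical contents: the TLB holds a subset of the page-table mappings.
tlb = [(True, 0x2, 0x80), (False, 0x3, 0x99)]
page_table = [0x10, None, 0x80, 0x40]
```

The invalid TLB entry for page 0x3 is ignored, so that translation falls back to the page table, while page 0x1 raises `PageFault` because its page-table entry is empty.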

#### 2.1.2.3 ATM Switches

In ATM switching networks, a CAM can be adopted as a translation table. Virtual circuits are an important part of ATM networks, and because ATM networks are connection-oriented, they must be set up across the network before any data transfer. There are two types of ATM virtual circuits: the virtual path (identified by a virtual path identifier [VPI]) and the virtual channel (identified by a virtual channel identifier [VCI]). Each segment of the total connection has a unique VPI/VCI combination, and the VPI/VCI value of an ATM cell is changed into the value for the next segment of the connection as the cell passes through a switch [2.15], [2.16].

*(Figure: a CAM block labeled "Current Connection" maps stored VPI/VCI data to an address; that address indexes a RAM block labeled "Next Connection", which returns the VPI/VCI data used by the ATM switch.)*

Fig. 2.4 ATM switch with CAM.

A CAM is applied to an ATM switch as an address translator and can quickly perform the VPI/VCI translation. In the translation process, the CAM takes the incoming VPI/VCI values in the ATM cell headers and generates an address that is used to access data in a RAM. A CAM/RAM combination realizes multi-megabit translation tables with fully parallel search capability. The VPI/VCI fields from the ATM cell header are compared against the list of current connections stored in the CAM array. As a result, the CAM generates an address that is used to access an external RAM, where the VPI/VCI mapping data and other connection information are stored. The ATM controller uses the VPI/VCI data from the RAM to modify the cell header, and the cell is then sent to the switch, as depicted in Fig. 2.4.



#### 2.1.2.4 Packet Forwarding Using CAM

Fig. 2.5 Packet forwarding by an address-lookup table in network routers.

In recent years, TCAMs have been widely used in network routers for packet forwarding and packet classification. Network routers forward data packets from an incoming port to an outgoing port using an address-lookup function [2.17]-[2.20]. Fig. 2.5 schematically depicts a simplified block diagram of a TCAM macro. The address-lookup function examines the destination address of the packet and selects the output port associated with that address. The router maintains a list, called the routing table, which contains destination addresses and their corresponding output ports. The search data are broadcast on the search-lines to the table of stored data. For example, suppose the packet destination address 01101 is input to the TCAM. As indicated by the table, two entries match, and the priority encoder chooses the upper entry and generates the matching location 01. This matching location is the address input to a RAM that contains a list of output ports, as shown in Fig. 2.5. A read operation on the RAM outputs the port designation, port B, to which the incoming packet is forwarded. The match location output of the CAM can thus be viewed as a pointer that retrieves the associated word from the RAM. In the particular case of packet forwarding, the associated word is the designation of the output port. This TCAM/RAM system fully implements an address-lookup engine for packet forwarding.
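The TCAM/RAM lookup in this example can be modeled behaviorally as follows. The sketch is an assumption-laden illustration: the table contents `TABLE` and `PORTS` are hypothetical values chosen so that the key 01101 hits two entries and the upper one maps to port B, as in the text; the function names are not taken from any referenced design.

```python
from typing import List, Optional, Tuple


def tcam_match(entry: str, key: str) -> bool:
    """One TCAM word: every stored bit must equal the key bit or be 'X'."""
    return len(entry) == len(key) and all(e in ("X", k) for e, k in zip(entry, key))


def lookup(table: List[str], ports: List[str], key: str) -> Optional[Tuple[int, str]]:
    """Search all words, then let the priority encoder pick the upper match."""
    match_lines = [tcam_match(entry, key) for entry in table]  # parallel compare
    for location, hit in enumerate(match_lines):  # lowest index has priority
        if hit:
            return location, ports[location]      # RAM read at the match location
    return None


# Hypothetical routing table (assumption, not read from Fig. 2.5).
TABLE = ["101XX", "0110X", "011XX", "10011"]
PORTS = ["A", "B", "C", "D"]

location, port = lookup(TABLE, PORTS, "01101")
print(format(location, "02b"), port)
```

Placing the more specific entry (0110X) above the less specific one (011XX) is what lets a simple priority encoder return the longest-prefix match.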

## 2.2 Design of CAM/TCAM Cells

In this section, conventional CAM and TCAM cells are introduced. A CAM cell serves two basic functions: bit storage (as in RAM) and bit comparison (unique to CAM). Two types of CAM cell are described below: the binary CAM (BCAM) cell and the ternary CAM (TCAM) cell.

#### 2.2.1 Binary CAM Cell

Depending on how they operate in search mode, CAM cells are classified into two kinds: the NOR-type CAM cell and the AND-type CAM cell [2.21], [2.22]. Their differences are described below.

## 2.2.1.1 NOR-type CAM Cell



Fig. 2.6 NOR-type binary CAM cell. (a) 9-transistor BCAM cell and (b) 10-transistor BCAM cell.

| State    | Qi | SL | ML       |
|----------|----|----|----------|
| Zero (0) | 0  | 0  | floating |
|          | 0  | 1  | 0        |
| One (1)  | 1  | 0  | 0        |
|          | 1  | 1  | floating |

Table 2.1 Truth table of NOR-type binary CAM cell.

Fig. 2.6 depicts the NOR-type CAM cells that have been widely used in CAM designs. Fig. 2.6 (a) is a 9-transistor structure and Fig. 2.6 (b) is a 10-transistor structure. Table 2.1 shows the truth table of a NOR-type CAM cell. The 9T CAM cell consists of a traditional 6T SRAM cell and a PTL-type compare circuit; the 10T CAM cell is composed of an ordinary 6T SRAM cell and pull-down XOR comparison circuits. In a write operation, both the 9T and the 10T CAM cells work the same as an SRAM cell: while the word-line is active, the complementary data are forced onto the bit-lines and stored in the latch formed by the two inverters. In a read operation, the bit-lines are pre-charged high first, and whether a bit-line discharges to ground depends on the stored data. After passing through the read sense amplifier, the correct data are sent to the output stage. For the 9T CAM cell, the match-line is pre-charged high first in the search operation. If the search data equal the stored data, node X becomes low, the NMOS Mn is turned off, and the match-line remains floating. On the other hand, if the search data do not match the stored data, node X becomes high and turns on the NMOS Mn, so the match-line is discharged to ground. The 10T CAM cell follows the same principle. During a search operation, the match-line is pre-charged high first. If the search data differ from the stored data, there is a path from the match-line to ground, and the match-line is discharged through this path, in agreement with Table 2.1.





Fig. 2.7 AND-type 9-transistor binary CAM cell.

An AND-type CAM cell works like the 9-transistor CAM cell in both write and read operations. The only difference from the 9T CAM cell is the match-line scheme. Fig. 2.7 depicts an AND-type CAM cell and Table 2.2 gives its truth table. In a search operation, the match-line is pre-charged high first. In contrast to the NOR-type cell, the match-line stays floating when the search data do not match the stored data, and it is discharged to ground only when the search data and the stored data match.

| State    | Qi | SL | ML       |
|----------|----|----|----------|
| Zero (0) | 0  | 0  | 0        |
|          | 0  | 1  | floating |
| One (1)  | 1  | 0  | floating |
|          | 1  | 1  | 0        |

Table 2.2 Truth table of AND-type binary CAM cell.
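The two binary truth tables differ only in which comparison result pulls the match-line down. A minimal behavioral sketch in Python (the function names are illustrative assumptions; the return strings mirror the ML column of Tables 2.1 and 2.2):

```python
def nor_cell_ml(q: int, sl: int) -> str:
    """NOR-type BCAM cell: a mismatch discharges the match-line (Table 2.1)."""
    return "0" if q != sl else "floating"


def and_cell_ml(q: int, sl: int) -> str:
    """AND-type BCAM cell: a match discharges the match-line (Table 2.2)."""
    return "0" if q == sl else "floating"


# Print both truth tables side by side: Qi, SL, NOR-type ML, AND-type ML.
for q in (0, 1):
    for sl in (0, 1):
        print(q, sl, nor_cell_ml(q, sl), and_cell_ml(q, sl))
```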

## 2.2.2 Ternary CAM Cell

For CAM circuit design, the ternary CAM (TCAM) provides a more powerful data search function [2.1]. Unlike the binary CAM, which has two states, one (1) and zero (0), the TCAM cell has an additional don't care (X) state. Like the binary CAM cell, the TCAM cell is classified into two kinds: the NOR-type TCAM cell and the AND-type TCAM cell.

## 2.2.2.1 NOR-type TCAM Cell



Fig. 2.8 Static NOR-type ternary CAM cell.

| State       | Qi | Qj | SL | ML       |
|-------------|----|----|----|----------|
| Zero (0)    | 0  | 1  | 0  | floating |
|             | 0  | 1  | 1  | 0        |
| One (1)     | 1  | 0  | 0  | 0        |
|             | 1  | 0  | 1  | floating |
| Don't care  | 0  | 0  | 0  | floating |
| (X)         | 0  | 0  | 1  | floating |
| Not allowed | 1  | 1  | 0  | -        |
|             | 1  | 1  | 1  | -        |

Table 2.3 State assignments and truth table for static TCAM cell.

Fig. 2.8 shows a static NOR-type TCAM cell. It consists of two SRAM cells and comparison circuits. This TCAM cell is designed to store three states, namely zero (0), one (1) and don't care (X). These three states are set by Qi and Qj. Table 2.3 illustrates how the three states are stored in this TCAM cell and gives the truth table of the static NOR-type TCAM cell. When Qi is low and Qj is high, the TCAM cell is in the "zero" state. In the search operation, as in the BCAM cell, the match-line is pre-charged high first. If the search data is low, the NMOS M1 and M4 are not turned on, so the ML remains floating. On the other hand, when the search data is high, the NMOS M1 and M2 are turned on at the same time, and the match-line is discharged to ground. When the TCAM cell is in the "one" state (Qi high and Qj low), the match-line stays high if the search data is high and is discharged if the search data is low. When Qi and Qj are both low, the TCAM cell is in the "don't care" state: whether the search data is high or low, the NMOS M1 and M3 are not turned on, so the match-line remains floating. Note that Qi and Qj cannot be high simultaneously; this state is not allowed.

There is also a dynamic NOR-type TCAM cell [2.23]-[2.26], shown in Fig. 2.9. The major difference between the static and the dynamic TCAM cell is that the two SRAM cells used for storage in the static cell are replaced by two capacitors in the dynamic cell. The dynamic TCAM cell works like the static TCAM cell, and Table 2.3 also describes its state assignments and truth table.



Fig. 2.10 AND-type ternary CAM cell.

| State          | Qi | Qj | SL | ML       |
|----------------|----|----|----|----------|
| Zero (0)       | 0  | 0  | 0  | 0        |
|                | 0  | 0  | 1  | floating |
| One (1)        | 1  | 0  | 0  | floating |
|                | 1  | 0  | 1  | 0        |
| Don't Care (X) | 0  | 1  | 0  | 0        |
|                | 0  | 1  | 1  | 0        |
|                | 1  | 1  | 0  | 0        |
|                | 1  | 1  | 1  | 0        |

Table 2.4 State assignments for TCAM cell.

Fig. 2.10 illustrates a 16-transistor AND-type TCAM cell, which includes two SRAM cells and a comparison circuit composed of three NMOS transistors. The state assignments and truth table of this TCAM cell are given in Table 2.4. The AND-type TCAM cell behaves like a 9-transistor AND-type BCAM cell when it is in the zero (0) or one (1) state. However, when the cell is in the don't care (X) state (Qj is high), the match-line is discharged whether the search data is high or low.
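The two ternary encodings in Tables 2.3 and 2.4 can be written as Boolean conditions on Qi, Qj and SL. The following Python sketch is a behavioral model only, with illustrative function names; the conditions were derived from the table rows rather than from the transistor-level schematics.

```python
def nor_tcam_ml(qi: int, qj: int, sl: int) -> str:
    """NOR-type TCAM cell (Table 2.3): zero=(0,1), one=(1,0), X=(0,0)."""
    if qi == 1 and qj == 1:
        raise ValueError("Qi = Qj = 1 is not allowed in the NOR-type cell")
    # A pull-down path exists when the stored bit disagrees with the key.
    discharged = (qi == 1 and sl == 0) or (qj == 1 and sl == 1)
    return "0" if discharged else "floating"


def and_tcam_ml(qi: int, qj: int, sl: int) -> str:
    """AND-type TCAM cell (Table 2.4): Qi holds the data, Qj = 1 marks X."""
    # The cell passes (discharges ML) on a match or whenever it is masked.
    passes = (qj == 1) or (qi == sl)
    return "0" if passes else "floating"


# Don't-care rows: the NOR-type cell never discharges, the AND-type always does.
print([nor_tcam_ml(0, 0, sl) for sl in (0, 1)])
print([and_tcam_ml(0, 1, sl) for sl in (0, 1)])
```

In both encodings a don't-care cell never blocks a match: it leaves a NOR-type match-line high and lets an AND-type chain conduct.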

## 2.3 Low Power Match-line Schemes

The dynamic power consumed by a single match-line that misses is due to the rising edge during pre-charge and the falling edge during evaluation, and is given by Eq. (2.1), where $C_{ML}$ is the match-line capacitance and f is the frequency of search operations. In the case of a match, the power consumption associated with a single match-line depends on the previous state of the match-line. Typically, only a small number of match-lines match, so this power consumption can be neglected. Accordingly, the overall match-line power consumption of a CAM block with w match-lines is given by Eq. (2.2).

$$P_{miss} = C_{ML} \cdot V_{DD}^2 \cdot f \tag{2.1}$$

$$P_{ML} = w \cdot P_{miss} = w \cdot C_{ML} \cdot V_{DD}^{2} \cdot f \tag{2.2}$$
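As a quick numerical illustration of Eq. (2.1) and Eq. (2.2), the sketch below evaluates the match-line power of a 256-word block at 1 V and 400 MHz. The match-line capacitance of 10 fF is a hypothetical value picked only to make the arithmetic concrete; it is not a measured figure from this work.

```python
def match_line_power(w: int, c_ml: float, vdd: float, f: float) -> float:
    """Total match-line power in watts, following Eq. (2.1) and Eq. (2.2)."""
    p_miss = c_ml * vdd ** 2 * f   # Eq. (2.1): one match-line that misses
    return w * p_miss              # Eq. (2.2): all w match-lines assumed to miss


# Assumed values: C_ML = 10 fF is hypothetical.
p_ml = match_line_power(w=256, c_ml=10e-15, vdd=1.0, f=400e6)
print(f"P_ML = {p_ml * 1e3:.3f} mW")
```

Because the power scales linearly with $C_{ML}$ and w and quadratically with $V_{DD}$, the schemes in the following subsections aim at reducing the switched capacitance, the number of active match-lines, or the voltage swing.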

As technology advances, noise increases the soft-error rate of dynamic circuits. Therefore, a low-power, high-speed and noise-tolerant TCAM is desired. A large variety of techniques have been proposed to reduce the power consumption of match-lines; they are categorized as follows.

#### 2.3.1 Conventional Match-line Structure

In the conventional CAM architecture, the CAM word circuits adopt dynamic CMOS circuits to improve data-matching performance and reduce hardware cost. The conventional NOR-type and AND-type match-line schemes based on dynamic CMOS circuits are shown in Fig. 2.11 and Fig. 2.12, respectively [2.27], [2.28].

#### 2.3.1.1 NOR-type Match-line



Fig. 2.11 Structure of conventional NOR-type match-line.

Fig. 2.11 depicts, in schematic form, how NOR-type cells are connected in parallel to form a NOR-type match-line. While CAM cells are shown in the figure, the description of match-line operation applies to both CAM and TCAM. A typical NOR search cycle operates in three phases: search-line pre-charge, match-line pre-charge, and match-line evaluation. First, the search-lines are discharged to disconnect the match-lines from ground by disabling the pull-down paths in each CAM cell. Second, with the pull-down paths disconnected, the match-lines are pre-charged by $M_{pre}$. Finally, the search-lines are driven to the search values, triggering the match-line evaluation phase. In the case of a match, the ML voltage stays high as there is no discharging path to ground. In the case of a miss, there is at least one path to ground that discharges the ML. The match-line sense amplifier senses the voltage on the ML and generates a corresponding full-rail search result. The main feature of the NOR-type match-line is its high speed of operation. In the slowest case of a one-bit miss in a word, the critical evaluation path is through the two series transistors in the cell that form the pull-down path. Even in this worst case, NOR-type evaluation is faster than AND-type (NAND) evaluation, where between 8 and 16 transistors form the evaluation path.



Fig. 2.12 Structure of conventional AND-type match-line.

Fig. 2.12 shows the structure of the AND-type match-line. A number of AND-type CAM cells are cascaded to form the ML (this is, in fact, a floating node, but for consistency we will refer to it as ML). On the right of the figure, the pre-charge PMOS transistor, $M_{pre}$, sets the initial voltage of the ML to the supply voltage. Then, the evaluation NMOS transistor, Np, is turned on. In the case of a match, all NMOS transistors are on, effectively creating a path to ground from the ML and hence discharging the ML to ground. In the case of a mismatch, at least one of the series NMOS transistors is off, leaving the ML voltage high. The AND match-line has an explicit evaluation transistor, Np, unlike the NOR match-line, where the CAM cells themselves perform the evaluation.

There is a potential charge-sharing problem in the AND-type match-line. Charge sharing occurs between the ML and the intermediate nodes. Referring to Fig. 2.12, if all bits match except for the leftmost bit, there is charge sharing between the ML and nodes $Nd_{n-1}$ through $Nd_{1}$ during evaluation. This charge sharing may cause the ML voltage to drop low enough that the output inverter detects a false match. One technique that eliminates charge sharing is to pre-charge the intermediate match nodes high in addition to the ML. This eliminates charge sharing, since the intermediate match nodes and the ML node are initially shorted, but it increases the power consumption. Two further drawbacks of the AND match-line are a quadratic delay dependence on the number of cells and a low noise margin.
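The size of the charge-sharing drop can be estimated to first order from charge conservation. The sketch below is an idealized estimate under stated assumptions: the pass transistors are treated as ideal switches (the NMOS threshold drop is ignored, so the result is a pessimistic bound), and the capacitance values are hypothetical.

```python
from typing import Sequence


def ml_after_charge_sharing(vdd: float,
                            c_ml: float,
                            c_nodes: Sequence[float],
                            v_nodes_init: float = 0.0) -> float:
    """Final ML voltage after the ML shares charge with intermediate nodes."""
    c_int = sum(c_nodes)
    # Total charge before sharing: ML pre-charged to VDD, intermediate nodes
    # at v_nodes_init. After sharing, all nodes settle at one voltage.
    charge = c_ml * vdd + c_int * v_nodes_init
    return charge / (c_ml + c_int)


# Hypothetical capacitances: 2 fF on the ML, seven 0.5 fF intermediate nodes.
nodes = [0.5e-15] * 7
print(ml_after_charge_sharing(1.0, 2e-15, nodes))                    # nodes start at 0 V
print(ml_after_charge_sharing(1.0, 2e-15, nodes, v_nodes_init=1.0))  # nodes pre-charged
```

With the intermediate nodes pre-charged high, the model shows no voltage drop, which is the remedy described above; the extra energy spent pre-charging those nodes is the cost.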

## 2.3.2 Selective Pre-charge Scheme

Selective pre-charge performs a match operation on the first few bits of a word before activating the search of the remaining bits. As shown in Fig. 2.13, the CAM word structure separates the search operation into two comparison processes. A subset of the n data bits is selected for the first comparison. If these bits of the input data mismatch those of a stored word, then the input data mismatch the stored word and the remaining bits need not be compared [2.29]. Therefore, only very few words consume power in the second comparison.



Fig. 2.13 Word structure of the selective pre-charge scheme.


In the selective pre-charge scheme, two different kinds of CAM cell are used in the two segments. In SEG\_1, the CAM cells are implemented as XNOR-type and their pull-down transistors are arranged in NAND fashion. The NAND-type block is connected to ground only when all the CAM cells of SEG\_1 match. In contrast, SEG\_2 uses XOR-type CAM cells whose pull-down transistors are arranged in NOR fashion. The NOR-type block is disconnected from ground only when all the CAM cells of SEG\_2 match [2.30]. Selective pre-charge is perhaps the most common method used to save power on match-lines [2.31]-[2.33], since it is simple to implement and can reduce power by a large amount in many CAM applications.

### 2.3.3 Pipelined Hierarchical Search Scheme

In selective pre-charge, the match-line is divided into two segments. More generally, an implementation may divide the match-line into any number of segments, where a match in a given segment results in a search operation in the next segment but a miss terminates the match operation for that word. A design that uses multiple match-line segments in a pipelined fashion is the pipelined match-line scheme [2.34]-[2.37]. Fig. 2.14 shows a pipelined match-line in which the match-line is broken into four segments that are evaluated serially. If any stage misses, the subsequent stages are shut off, resulting in power saving. The drawbacks of this scheme are the increased latency and the area overhead due to the pipeline stages. By itself, a pipelined match-line scheme is not as compelling as basic selective pre-charge; however, pipelining enables the use of hierarchical search-lines, thus saving power.
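The power benefit of segmenting the match-line comes from the segments that are never evaluated. A minimal Python sketch of this early-termination behavior is given below; it counts evaluated segments as a rough proxy for match-line activity. The function name, the binary-only words, and the stored data are illustrative assumptions.

```python
from typing import List, Tuple


def segmented_search(words: List[str], key: str, seg_len: int) -> Tuple[List[int], int]:
    """Return (matching word indices, number of segment evaluations)."""
    matches, evaluated = [], 0
    for index, word in enumerate(words):
        hit = True
        for start in range(0, len(key), seg_len):
            evaluated += 1  # this segment of this word is activated
            if word[start:start + seg_len] != key[start:start + seg_len]:
                hit = False  # a miss shuts off all later segments of the word
                break
        if hit:
            matches.append(index)
    return matches, evaluated


# Hypothetical 8-bit words split into four 2-bit segments.
words = ["10110100", "10111111", "01110100", "00000000"]
matches, evaluated = segmented_search(words, "10110100", seg_len=2)
print(matches, evaluated, "of", len(words) * 4)
```

Setting `seg_len` so that each word has exactly two segments gives the selective pre-charge case; the latency cost of the serial evaluation is not modeled here.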



Fig. 2.14 Pipelined match-lines reduce power by shutting down after a miss in a stage


## 2.3.4 Current-Saving Scheme



Fig. 2.15 Current-saving match-line sensing scheme

The current-saving scheme [2.38]-[2.44] is another data-dependent match-line sensing scheme and is a modified form of the current-race sensing scheme. The key improvement of the current-saving scheme is to allocate a different amount of current for a match than for a miss: matches are allocated a larger current and misses a smaller current. Since almost every match-line has a miss, the scheme saves power overall. Fig. 2.15 shows a simplified schematic of the current-saving scheme. The current-control block is the mechanism by which a different amount of current is allocated, based on a match or a miss. The input to this block is the match-line voltage, $V_{ML}$, and the output is a control voltage that determines the current, $I_{ML}$, which charges the match-line. The current-control block provides positive feedback, since a higher $V_{ML}$ results in a higher $I_{ML}$, which, in turn, results in a higher $V_{ML}$. In this scheme, the match-line is pre-charged low. The amount of current is initially the same for all match-lines, but the current control reduces the current provided to match-lines that miss (which have a resistance to ground) and maintains the current to match-lines that match (which have no path to ground). The current-control block increases the current as the voltage on the ML rises, and the voltage on the ML rises faster for a larger ML resistance. Since the amount of current decreases with the number of misses, the power dissipated on the match-line also depends on the number of bits that miss.

#### 2.3.5 Wide-AND Match-line Scheme

A pipelined hierarchical search scheme improves search throughput, but results in high area cost and large power consumed by flip-flops and clock drivers. Fig. 2.16 illustrates the 64-bit dynamic AND ML technique in each bank, which is built from four 16-bit-wide dynamic AND gates connected sequentially. Only the first 16-bit AND gate is controlled by the global ML clock (clk0), while the remaining gates are triggered by the outputs of the preceding gates (clk1~clk3). Each 16-bit-wide AND ML circuit consists of two 8-bit-wide footed domino NOR-gate local MLs (lml0, lml1). A footed domino is used to gate the evaluation of the next 16-bit-wide AND ML. This also enables a static circuit implementation of the SLs, saving significant SL switching power. The inherent logic function in each wide AND ML is a 16-bit NOR, which is complemented through a clock-gated NAND followed by an inverter, yielding a fast, domino-compatible 16-bit-wide AND function. This eliminates the wide AND paths and realizes the complete critical path with 8-way dynamic OR circuits. Therefore, the wide-AND ML technique achieves NOR-type CAM performance with AND-type CAM power [2.45].



Fig. 2.16 64-bit sequential AND plan with swapped XOR cell and HS-AND match circuit.

#### 2.3.6 Tree Style AND-type Match-line Scheme

In the original m-stage PF-CDPD AND-type match-line circuit, if the comparison result of the first stage is a match in the evaluation phase, the first output goes low to enable the comparison in the second stage. Moreover, dividing the match-line into many segments means that the comparison transistors need not be made excessively large to meet the same search-time target. Nevertheless, the speed enhancement comes at a cost. To further speed up the PF-CDPD scheme, three versions of tree-style match-lines are shown in Fig. 2.17. Fig. 2.17 (a) uses two short parallel MLs in each half plane and merges the outputs from both planes into a 4-input AND gate to generate the final matching result. The designs in Fig. 2.17 (b) and (c) use an 8-input and a 4-input AND gate, respectively, to generate the final matching result. The parallel, 3-level tree and 2-level tree designs are nearly 30% faster in search than the original cascaded design. However, compared with the 233.6 $\mu$W power consumption of the cascaded design, the parallel design and the 3-level tree design consume about 20% more power, and the 2-level tree design consumes only 9% more power, owing to slightly more complex interconnection [2.46], [2.47].



Fig. 2.17 (a) Parallel, (b) 3-level tree, and (c) 2-level tree AND-type match lines.

## 2.4 Low Power Search-line Schemes

Eliminating the search-line pre-charge phase is the common method of saving search-line power. It reduces the toggling of the search-lines, thus reducing power. Other schemes that reduce search-line power are described below.

#### 2.4.1 Hierarchical Search-line Scheme

The basic idea of hierarchical search-lines is to exploit the fact that few match-lines survive the first segment of the pipelined match-lines. With the conventional search-line approach, even though only a small number of match-lines survive the first segment, all search-lines are still driven. Instead, the hierarchical search-line scheme divides the search-lines into a two-level hierarchy of global search-lines (GSLs) and local search-lines (LSLs) [2.24]-[2.25], [2.48]-[2.51]. Fig. 2.18 shows a simplified hierarchical search-line scheme. In the figure, each LSL feeds only a single match-line (for simplicity), but the number of match-lines per LSL can be 64 to 256. The GSLs are active every cycle, but the LSLs are active only when necessary. Activating an LSL is necessary when at least one of the match-lines fed by the LSL is active. In many cases, an LSL has no active match-lines in a given cycle, so there is no need to activate it, which saves power. The overall power consumption on the search-lines is given by Eq. (2.6), where $\alpha$ is the activity rate of the LSLs. $C_{GSL}$ primarily consists of wiring capacitance, whereas $C_{LSL}$ consists of wiring capacitance and the gate capacitance of the SL inputs of the CAM cells. The factor $\alpha$, which can be as low as 25% in some cases, is determined by the search data and the data stored in the CAM. We see from Eq. (2.6) that $\alpha$ determines how much power is saved on the LSLs, but the cost of this saving is the power dissipated by the GSLs. Thus, the power dissipated by the GSLs must be sufficiently small so that the overall search-line power is lower than that of the conventional approach.

$$P_{SL} = \left(C_{GSL} \cdot V_{DD}^{2} + \alpha \cdot C_{LSL} \cdot V_{DD}^{2}\right) \cdot f \qquad (2.6)$$

$$P_{SL} = 2n \cdot \left(C_{GSL} \cdot V_{LOW}^{2} + \alpha \cdot C_{LSL} \cdot V_{DD}^{2}\right) \cdot f \tag{2.7}$$

If wiring capacitance is small compared to the parasitic transistor capacitance [2.49], then the scheme saves power. However, as transistor dimensions scale down, it is expected that wiring capacitance will increase relative to transistor parasitic capacitance. In the situation where wiring capacitance is comparable or larger than the parasitic transistor capacitance,  $C_{GSL}$  and  $C_{LSL}$  will be similar in size, resulting in no power savings. In this case, small-swing signaling on the GSLs can reduce the power of the GSLs compared to that of the full-swing LSLs. This power equation of the modified search-line scheme is derived in Eq. (2.7). This scheme requires an amplifier to convert the low-swing GSL signal to the full-swing signals on the LSLs. Fortunately, there is only a small number of these amplifiers per search-line, so that the area and power overhead of this extra circuitry is small.
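To make the trade-off between GSL and LSL power concrete, the sketch below evaluates the bracketed term of Eq. (2.6) and its low-swing counterpart from Eq. (2.7) for a single search-line. The $2n$ prefactor of Eq. (2.7) is deliberately left out so that both results are per search-line, and all capacitance and voltage values are hypothetical.

```python
from typing import Optional


def hierarchical_sl_power(c_gsl: float, c_lsl: float, alpha: float,
                          vdd: float, f: float,
                          v_gsl: Optional[float] = None) -> float:
    """Per-search-line power in watts for the hierarchical scheme.

    With v_gsl=None the GSL swings full rail, as in Eq. (2.6); otherwise the
    GSL swings to v_gsl (V_LOW in Eq. (2.7)) while the LSLs stay full swing.
    """
    if v_gsl is None:
        v_gsl = vdd
    return (c_gsl * v_gsl ** 2 + alpha * c_lsl * vdd ** 2) * f


# Hypothetical values: C_GSL = 20 fF, C_LSL = 80 fF, alpha = 25 %.
full_swing = hierarchical_sl_power(20e-15, 80e-15, 0.25, vdd=1.0, f=400e6)
low_swing = hierarchical_sl_power(20e-15, 80e-15, 0.25, vdd=1.0, f=400e6, v_gsl=0.5)
print(f"full-swing GSL: {full_swing * 1e6:.1f} uW, low-swing GSL: {low_swing * 1e6:.1f} uW")
```

In this example the GSL term equals the LSL term at full swing, so halving the GSL swing removes three quarters of the GSL contribution; the power of the amplifiers that restore full swing on the LSLs is not included.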



Fig. 2.18 Schematic of the hierarchical search-line architecture.

### 2.4.2 Charge-Recycling Search-line Driver

The SLs consume power only when the search data change. In general, the transition probability of an SL is smaller than one half, so non-pre-charged SLs consume less than half the power of pre-charged SLs [2.52]. The driver further saves SL power by recycling the charge of the SLs. Fig. 2.19 shows the Charge-Recycling Search-Line Driver (CRSLD) architecture. Initially, SLCR is "0", so transistor P1 turns on and CR is "0". The transmission gates T1 and T2 are off, the two tri-state drivers D1 and D2 drive the SL pair, and the latch holds the data of the SLs. When SLCR becomes "1", transistor N1 turns on. If the search data change from "0" to "1", transistors N2 and N3 turn on; if the search data change from "1" to "0", transistors N4 and N5 turn on. In either case, CR becomes "1". T1 then turns on and the two SLs share their charge, so both SLs settle at $V_{DD}/2$. T2 also turns on and the latch updates its data. If the search data do not change, CR remains at "0". When SLCR returns to "0", CR becomes "0" and T1 and T2 turn off. D1 and D2 then drive the SL pair from $V_{DD}/2$ to $V_{DD}$ or ground [2.53]. Without loss of memory utilization, the CRSLD reduces the SL power by recycling the charge of the SLs without SL pre-charge.
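A first-order estimate of the saving follows from the energy drawn from the supply when an SL is charged. The expressions below are an idealized sketch, not a result from [2.53]: they assume that both lines of the pair have the same capacitance, written here as $C_{SL}$ (a symbol introduced only for this estimate), that charge sharing is lossless in settling both lines at $V_{DD}/2$, and that the control circuitry consumes no energy. Without recycling, the rising SL is charged from ground to $V_{DD}$; with recycling, it only needs to be charged from $V_{DD}/2$ to $V_{DD}$:

$$E_{conv} = C_{SL} \cdot V_{DD}^{2}$$

$$E_{CR} = C_{SL} \cdot V_{DD} \cdot \left(V_{DD} - \frac{V_{DD}}{2}\right) = \frac{1}{2} \cdot C_{SL} \cdot V_{DD}^{2}$$

Under these assumptions, each SL transition draws about half the energy from the supply; the overhead of the recycling control logic reduces the saving in practice.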



Fig. 2.19 Charge-recycling search-line driver

#### 2.4.3 Two-Level Don't-Care Gating Scheme

The two-level DCG scheme exploits the vertically continuous "don't-care" feature [2.54]. Therefore, to gate the search data from being broadcast over the entire SL, this design inserts the gating nodes to break the entire SL into several segments. As shown in Fig. 2.20 (a), the level-1 (L1) gating node ( $GN_{L1}$ ) is implemented as an inverter that is controlled by the corresponding mask bit. For example, a  $GN_{L1}$  is located in the ith cell. When the ith cell is "X",  $M_i=1$  will cut off both the power and ground sources to disable the inverter from transmitting the search data.



Fig. 2.20 (a) L1 gating node (GNL1) implementation. (b) L2 DCG example.

The L1 DCG reduces the SL power only when the first TCAM cell in a segment is "X". If the first cell is not "X", the segment is driven to perform the search operation even if all the remaining cells are "X". To improve the power efficiency of the L1 DCG, the level-2 (L2) gating scheme further exploits the vertically continuous "X" property within an L1 segment. Fig. 2.20 (b) shows an L2 gating example, in which the L2 granularity ($G_{L2}$) is 4. Similar to $GN_{L1}$, the function of the L2 gating node ($GN_{L2}$) is to either gate or transmit the search data from the L1 segment. Instead of a transmission gate, $GN_{L2}$ is implemented as in Fig. 2.20 (b) and is controlled by the mask value of the first cell (M0) in the corresponding L2 segment. Thus, all side effects incurred by the floating Q and Qb nodes can be completely eliminated.
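The limitation of L1 gating described above can be illustrated with a small behavioral model of one SL column. The sketch rests on assumptions that are not spelled out in the text: one gating node sits at the first cell of every segment, the nodes are chained in series along the SL so that a blocked node also leaves all later segments undriven, and a mask bit of 1 means the cell stores "X". The function name and the mask column are hypothetical.

```python
from typing import List


def driven_l1_segments(mask_column: List[int], seg_len: int) -> int:
    """Count the SL segments that are driven along one column."""
    driven = 0
    for start in range(0, len(mask_column), seg_len):
        if mask_column[start] == 1:
            # The first cell of this segment is "X": its gating node stops
            # the search data, so this and all later segments stay idle.
            break
        driven += 1
    return driven


# Hypothetical column of 16 mask bits with vertically continuous "X" cells.
column = [0, 0, 0, 0,  0, 1, 1, 1,  1, 1, 1, 1,  1, 1, 1, 1]
print(driven_l1_segments(column, seg_len=4), "of", len(column) // 4, "segments driven")
```

The second segment is still driven although three of its four cells are "X", which is the inefficiency that the finer-grained L2 gating is meant to address.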

#### 2.4.4 Low Swing Search-line Scheme



Fig. 2.22 Schematic of the NOR-cell block in the LSSL\_CAM.

By comparing stored data with the low swing search data on the search-lines, a low power CAM using low swing search-lines is presented. Fig. 2.21 shows the operation of a NOR-cell in the low swing search line (LSSL) CAM [2.55]. In order to compare the stored data with the low swing search data on a SL pair, M0 and a bias current  $I_{BIAS}$  are needed. M0, M1 and  $I_{BIAS}$  act like a differential amplifier. In a mismatch case, the voltages of SL and SLb are  $V_H$  (= $V_{REF}+\Delta V$ ) and  $V_L$  (= $V_{REF}-\Delta V$ ), respectively. The gate voltage of M1 is  $\Delta V$  higher than that of M0. Therefore,  $I_{BIAS}$ flows through M1. The ML is thereby discharged to ground. In a match case,  $I_{BIAS}$ flows through M0 and the ML remains at  $V_{DD}$ . Thus, the SL swing voltage ( $\Delta V_{SL}=2x\Delta V$ ) in the LSSL-CAM is much smaller than the full swing voltage in the conventional CAMs.

Fig. 2.22 shows the schematic of the NOR-cell block in the LSSL-CAM. In order to reduce the ML power consumption,  $I_{BIAS}$  is dynamically controlled by the ML\_EN and ML\_ON signals. Initially, the ML\_ON signal is set to the logic "1". After the SL voltages change, the ML\_EN signal is set to "1". This enables  $I_{BIAS}$  to flow. If all the NOR-cells are matched, the ML remains at V<sub>DD</sub>. If not, the  $I_{BIAS}$  flowing through the mismatched NOR-cells decreases the ML voltage. The  $I_{BIAS}$  control scheme reduces the power consumption in all of the mismatched MLs by limiting the  $\Delta V_{ML}$ . Only a matched ML consumes the  $I_{BIAS}$  until the ML\_EN signal returns to "0".

## 2.5 Low Power Design Techniques for CAM/TCAM Macro

#### 2.5.1 Power-Gated ML Sensing

Due to parallel match-line comparison, CAM is power-hungry. Thus, robust, high-speed and low-power ML sense amplifiers are highly sought after in CAM designs. An effective gated-power technique for ML sensing reduces the peak and average power consumption and enhances the robustness of the design against process variations [2.32], [2.33]. The new CAM architecture and the row-based ML sense amplifier are depicted in Fig. 2.23. The CAM cell has the same number of transistors as the conventional NOR CAM cell and uses a similar ML structure. However, the comparison unit and the SRAM unit are powered by two separate metal rails, namely $V_{DDML}$ and $V_{DDC}$, respectively. $V_{DDML}$ is independently controlled by a power transistor (Px) and a feedback loop that can automatically turn off the ML current to save power. The purpose of separating the power rails is to completely isolate the SRAM cell from any possible power disturbance during the compare cycle.



Fig. 2.23 Row-based ML sense amplifier and new CAM architecture.


As shown in Fig. 2.23, the gated-power transistor Px is controlled by a feedback loop, denoted "Power Control", which automatically turns off Px once the voltage on the ML reaches a certain threshold. At the beginning, the ML is initialized by a global control signal EN: EN is set low and the power transistor Px is turned off. After that, EN goes high and initiates the compare phase. If one or more mismatches occur in the CAM cells, the ML is charged up. When the voltage of the ML reaches the threshold voltage of M8, the voltage at node C1 toggles and the power transistor Px is turned off again. As a result, the ML is not fully charged to $V_{DD}$, but is limited to a voltage slightly above the threshold voltage of M8, $V_{th8}$.

#### 2.5.2 Self-Disable Sensing Technique

In order to resolve the design dilemma of the prior NOR-type and NAND-type CAMs described in Section 2.2, a differential NAND-type CAM is presented in Fig. 2.24. Notably, MSi is turned on only if BL$\langle i \rangle$ and Q are logically opposite. That is, SML$\langle i \rangle$ is charged by ML$\langle i \rangle$ when the search key is opposite to the corresponding bit of the word. The voltage drop between ML$\langle i \rangle$ and SML$\langle i \rangle$ is sensed by the differential MLSA (DMLSA). The comparison is sped up by the parallel charging paths. Most importantly, the dc grounding path is removed to reduce the static power consumption [2.56].



Fig. 2.24 The differential NAND CAM cell. The block M and block DMLSA denote a memory cell to store the data bit and differential ML sense amplifier

The detailed schematic of the DMLSA for the differential NAND-type CAM is shown in Fig. 2.25. The DMLSA senses the voltages on ML$\langle i \rangle$ and SML$\langle i \rangle$ to tell whether the word is a "match" or a "mismatch," and then automatically disables the charge path to save power. Notably, a signal sets the DMLSA into an initial state, where ML$\langle i \rangle$ = SML$\langle i \rangle$ = 0 and SP = 0, before the searching process. The operation of the DMLSA in the searching process is described as follows.



Fig. 2.25 DMLSA.

1) "Mismatch": SEARCH = SEARCH\_EN is pulled to high at the beginning of the searching process. Then, MN1 is turned on to charge the ML<i> such that KP will be discharged but not totally pulled down to 0. If there is any "mismatch" CAM cell, MSi is turned on to make a current path between ML<i> and SML<i>. When the voltage of SML<i> is high enough to turn off MP3, the voltage of KP will be pulled down such that MATCHB is equal to logic 1 (mismatch). By two feedback paths, MATCHB turns MN3 on and MP1 off, respectively, such that the current path of MP1 is shut off to choke the charge current of ML<i>. Therefore, the power consumption is reduced after the searching process.

2) "Match": If all of the CAM cells are "match," ML<i> and SML<i> are isolated without any current path. The voltage difference between ML<i> and SML<i> creates an output current of the differential pair (MP2 and MP3) that charges KP and SP. As soon as KP is charged high, MATCHB becomes logic 0 (match). After SP is raised high, SEARCH becomes logic 0 and turns off MN1 to choke the charge current to ML<i>.

In short, the charge current to ML<i> is choked once the comparison result has been decided, regardless of what the result is. Choking the current in this way removes unnecessary dc currents, so the power consumption is significantly reduced. Moreover, the comparison is accelerated by a positive feedback loop, which further reduces unwanted power dissipation.



#### 2.5.3 Dynamic Power Source (DPS) Technique

Fig. 2.26 DPS technique. (a) DPSVDD implementation. (b) DPSGND implementation.

Leakage-suppressed designs can be grouped into state-preserved [2.57] and state-destructive strategies. Because the prefix data are unnecessary for determining the match result when a cell stores "X", this technique uses the state-destructive strategy to reduce the leakage power dissipated in the "X" TCAM cells [2.58].

Similar to the traditional power-gating technique, there are two implementations of the DPS design. Fig. 2.26 (a) shows the DPS<sub>VDD</sub> implementation, where the sources of P1 and P2 are connected to the Mb node of the mask SRAM. If the TCAM cell is "X", then Mb is 0, which lowers the D voltage and destroys the stored prefix data. To improve on DPS<sub>VDD</sub>, Fig. 2.26 (b) shows the DPS<sub>GND</sub> implementation, which connects the sources of both N1 and N2 to the M node of the mask SRAM. If the TCAM cell is "X", then M=1 raises the voltage of Db and destroys the stored prefix data. Otherwise, M=0 retains the stored prefix data in the "care" state. For an "X" TCAM cell, DPS<sub>GND</sub> achieves roughly a 58% leakage reduction, which is the best among all possible implementations.
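As a rough worked illustration of what the per-cell figure implies at array level, suppose (hypothetically) that a fraction $f_X$ of the TCAM cells store "X", that every cell leaks equally before gating, and that "care" cells are unaffected. Under these assumptions, which are not taken from [2.58], the relative leakage saving of the array scales linearly with $f_X$:

$$\frac{\Delta P_{leak}}{P_{leak}} \approx 0.58 \, f_X$$

For example, an assumed $f_X = 0.25$ would give a saving of about $0.58 \times 0.25 = 14.5\%$.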

#### 2.5.4 Variability-Tolerant CAM Cells with NOR-Type Match-Lines





Within-chip variability has become a serious problem in modern nano-scale technologies, which is particularly true for semiconductor memory designs. The variability-tolerant BCAM cell separates the read port from the write port so that the sizing for read static noise margin and write trip voltage is decoupled [2.59]. Fig. 2.27 (a) shows the N-type variability-tolerant BCAM (NVT-BCAM) cell with a NOR-type match-line. A transistor $M_{N4}$ is added to the comparator to perform the read operation. Fig. 2.27 (b) shows the timing sequence of the read and write operations of the NVT-BCAM cell. For a write operation, the WWL is pulled high and Din is placed on the bit-lines, so that Din is stored in the SRAM storage. For a read operation, the bit-lines are first pre-charged to VDD and then the RWL is enabled.

If the state of the SRAM storage is logic 1, then $M_{N2}$ and $M_{N4}$ are turned on, so the bit-line Bb is discharged to logic 0 through the path $M_{N2} \rightarrow M_{N4} \rightarrow VSS$ and data 1 is read. Conversely, if the state of the SRAM storage is logic 0, the bit-line B is discharged to logic 0 through the path $M_{N1} \rightarrow M_{N4} \rightarrow VSS$ and data 0 is read. As a consequence, the sizing of a cell for read static noise margin and write trip voltage is decoupled by separating the read port from the write port of the cell.

Moreover, by reusing the comparison logic of a BCAM cell as the read port, only one additional transistor and a read word-line are needed. Experimental results show that the NVT-BCAM cells provide good read static noise margin and write trip voltage at a lower area cost than the typical CAM cell.

## Chapter 3 Energy-Efficient Match-Line Schemes

The NOR-type match-line scheme provides high search performance at the cost of large power dissipation. In contrast, the AND-type match-line scheme trades performance for reduced switching activity. As the range of CAM/TCAM applications grows, energy efficiency becomes the critical issue. Thus, we propose the 16T AND-type TCAM cell with P-type comparison circuits to save match-line power while maintaining good search speed, especially in a low-power process.

On the other hand, leakage currents, charge sharing, and coupling noise all increase the soft-error rate of dynamic circuits with the advance of technology. They worsen not only the performance but also the functionality of the TCAM macro. To conquer these problems, our TCAM design employs the XOR-based conditional keepers and butterfly match-line scheme to support noise-tolerant, high-speed and low power TCAM. In addition, we also modify the NOR gate of butterfly match-line scheme to improve stability of dynamic circuits. These designs are described in detail below.

## **3.1 Conventional NAND-Type Match-Line Schemes**

In the typical NAND-type (AND-type) match-line structure, a number of cells are cascaded to form the match-line. In the case of a match, all bits of the stored data match all bits of the search data and the match-line is discharged to ground. In the case of a mismatch, at least one bit of the stored data differs from the search data and the match-line remains at VDD. Generally, a match is less probable than a mismatch. Therefore, the NAND-type scheme consumes little power but has a longer search time owing to the deep fan-in circuits. To solve this problem, many works have been presented to increase the search speed of the NAND-type match-line, including the pseudo-footless clock-data pre-charge dynamic (PF-CDPD) match-line scheme [3.1] shown in Fig. 3.1, the range matching scheme [3.2], and the tree-style NAND-type match-line scheme [3.3]. Their common concept is to separate the match-line into several segments. The details of PF-CDPD are reviewed in this section.
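The logical behaviour of a NAND-type match-line can be summarized in a short functional sketch. The string encoding of the stored word ('0', '1', 'X') and the function names are assumptions made for illustration; only the rule that the line discharges when every cascaded cell matches comes from the text.

```python
def cell_matches(stored, key_bit):
    """Ternary compare: a stored 'X' matches either search bit."""
    return stored == "X" or stored == key_bit

def nand_match_line(stored_word, search_key):
    """Return the final match-line level of a NAND-type (AND-type) word.

    The series chain conducts only when every cell matches, so the line
    is discharged ('GND') on a match and stays at 'VDD' on a mismatch.
    """
    all_match = all(cell_matches(s, k) for s, k in zip(stored_word, search_key))
    return "GND" if all_match else "VDD"
```

Because most words mismatch in a typical search, most match-lines stay at VDD and do not switch, which is the source of the low-power feature.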



Fig. 3.1 PF-CDPD AND-type match-line scheme.

Fig. 3.2 illustrates how a four-input dynamic AND gate is transformed into clock-and-data pre-charge dynamic (CDPD) circuits [3.4], [3.5]. The n-stage pseudo-footless clock-and-data pre-charge dynamic (PF-CDPD) circuit, shown in Fig. 3.3, combines the operation and characteristics of CDPD and the AND-type match-line scheme to decrease power and search delay. Typically, the search operation is divided into two phases, pre-charge and evaluation. At the beginning of the pre-charge phase, the clock drives the floating node ($C_1$) of the first stage high instead of the global match-line. The output of the first stage then goes low to trigger the next floating node ($C_2$). Therefore, all floating nodes ($C_1 \sim C_n$) are charged high in the pre-charge phase. During the evaluation phase, the output of each stage depends only on the result of the previous stage. For instance, if the comparison result of the first stage is a match, all NMOS transistors of this stage are turned on and the first output goes high to enable the second comparison stage. On the other hand, if the comparison result of the first stage is a mismatch, the first output stays low to disable the second comparison stage.



Fig. 3.2 Transfer dynamic logic into clock-and-data pre-charge dynamic (CDPD) circuits.



Fig. 3.3 Pseudo-footless clock-and-data pre-charge dynamic (PF-CDPD) circuits.

Accordingly, a CAM/TCAM that adopts PF-CDPD circuits has several advantages. First, because the match-line is divided into many segments, the serial comparison transistors do not need to be large to meet the same search-time criterion. Second, the switching capacitance is reduced effectively because of the smaller comparison transistors; the switching capacitance of the match-line is also decreased because the match-line is separated into segments rather than forming one deep logic chain. Third, the evaluation of a PF-CDPD match-line stage is enabled or disabled depending on the output of the preceding stage. That is, if the stored data and search data mismatch, the output disables all comparison operations in the subsequent stages to avoid unnecessary switching. Consequently, PF-CDPD circuits shorten the search time and save power.
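The stage-by-stage enabling of the PF-CDPD match-line can be sketched functionally as follows. The list-of-booleans interface and the function name are hypothetical; the sketch only counts how many segments evaluate before a mismatch stops the chain.

```python
def pf_cdpd_search(segment_matches):
    """Evaluate a segmented match-line in series order.

    segment_matches: one bool per segment (True = all cells in it match).
    Returns (word_matches, segments_evaluated).
    """
    evaluated = 0
    for seg_ok in segment_matches:
        evaluated += 1                 # stage was enabled by its predecessor
        if not seg_ok:
            return False, evaluated    # output stays low: later stages stay idle
    return True, evaluated
```

A mismatch in an early segment therefore leaves all later segments unevaluated, which is where the switching-power saving comes from.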

## 3.2 AND-Type TCAM Cell with P-Type Comparison Circuits

As process technology scales down, the design of reliable circuits faces many challenges, including the charge sharing effect, increasing leakage current and a decreasing $I_{on}$-$I_{off}$ ratio, all of which limit circuit operation. Especially in a low-power process, the conventional AND-type CAM/TCAM cell is no longer suitable because the decreasing $I_{on}$-$I_{off}$ ratio destroys the functionality of the search operation.

Many works have been devoted to the design of CAM/TCAM cells to increase the voltage swing or reduce the search delay of the comparison circuits. The work in [3.4] uses the 4-transistor CMOS XNOR function to restore full voltage swing, but increases the capacitance of the storage node and of the single bit-line. Also to drive the local match-line with full swing, the design in [3.6] utilizes an XOR CAM cell with transmission gates, and swapping the inputs of the XOR gate improves the slopes on the SLs and shortens the compare delay. However, the two extra PMOS transistors incur area overhead, and the swapped XOR cell requires more complex wire routing. The novel 16T AND-type TCAM cell with P-type comparison circuits (shown in Fig. 3.4) provides a larger voltage swing in the comparison circuits without additional transistors. The following analysis is based on the UMC 40nm low-power CMOS process.



Fig. 3.4 16T AND-type TCAM cell with P-type comparison circuits.

The 16T AND-type TCAM cell is composed of two traditional 6T SRAM cells, which are typically minimum-size to maintain high cell density, and four comparison transistors, M1 through M4, that implement the comparison between the stored data and the search data. In our design, PMOS transistors (M1 and M2) drive the pass transistor (M3) of the local match-line (LML). This change addresses the decreasing $I_{on}$-$I_{off}$ ratio. Fig. 3.5 depicts the drain current versus gate voltage for the 65nm standard process (65nm SP) and the 40nm low-power process (40nm LP); the shaded region shows the upper-bound and lower-bound currents of the keeper. For a reliable TCAM cell, the turn-on current of the LML should be larger than the upper bound, whereas the turn-off current of the LML should be below the lower-bound current of the keeper. In the standard process, the $I_{on}$-$I_{off}$ ratio is large enough to ensure robust search operation. In the low-power process, however, the high threshold voltage decreases the drain current, and the shrinking $I_{on}$-$I_{off}$ ratio makes the match and mismatch states hard to distinguish. The current curves in Fig. 3.5 show that the NMOS has a smaller matching current than the PMOS. Additionally, in 40nm LP, both the matching and mismatching currents of the NMOS are smaller than the lower-bound current of the keeper, resulting in search errors. Accordingly, the two NMOS comparison transistors are replaced by two PMOS transistors.



Fig. 3.5 Drain current versus gate voltage for different technology.

Although this cell is similar to the conventional TCAM cell, the PMOS comparison circuits function correctly and reduce the search output delay by improving the $I_{on}$-$I_{off}$ ratio. During a matching search, the PMOS transistors deliver full VDD to M3, which discharges the LML with a stronger drain current. In the mismatch state, the cell also suppresses the charge sharing effect on the AND-type match-line because large pass transistors (M3 and M4) are not needed. Consequently, the noise and variation tolerance of the match-line keeper is increased. Compared to the conventional TCAM cell, the AND-type TCAM cell with P-type comparison circuits does not increase the cell area. Moreover, it can be applied to either the binary CAM or the ternary CAM.

## **3.3 XOR-based Conditional Keeper**



Fig. 3.6 AND-type match-line with XOR-based conditional keeper.

| CLK  | Floating Node | Control Signal on Gate of Keeper                                             |
|------|---------------|------------------------------------------------------------------------------|
| Low  | Low           | Low, to speed up the pre-charge process                                      |
| Low  | High          | High, to avoid the impact on performance at the very beginning of evaluation |
| High | Low           | High, keeper should be off                                                   |
| High | High          | Low, keeper should be activated to enhance noise immunity                    |

Table 3.1 Control mechanism of XOR-based conditional keeper.
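The four rows of Table 3.1 reduce to a single XOR of the clock and the floating node. The sketch below checks this, assuming logic levels are encoded as booleans (True = high) and that the keeper is on when its gate is low, as the table rows indicate; the function names are illustrative.

```python
def keeper_gate(clk, floating_node):
    """Gate signal of the keeper: CLK XOR floating node (True = high)."""
    return clk != floating_node

def keeper_is_on(clk, floating_node):
    """The keeper conducts when its gate is low (rows 1 and 4 of Table 3.1)."""
    return not keeper_gate(clk, floating_node)

# (CLK, floating node) -> expected gate level, copied from Table 3.1
TABLE_3_1 = {
    (False, False): False,  # pre-charge in progress: keeper helps charging
    (False, True): True,    # pre-charge done: keeper off
    (True, False): True,    # match, node discharged: keeper off
    (True, True): False,    # mismatch, node held high: keeper on
}
```

Every table entry agrees with `keeper_gate`, which is why a single XOR gate suffices to generate the control signal.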

The AND-type match-line scheme has to cope with high fan-in circuits, for which conventional keepers perform poorly in terms of propagation delay and power consumption. Accordingly, an XOR-based conditional keeper has been presented in [3.7]. The main idea of the XOR-based conditional keeper is to ensure that the keeper is not turned on in the dynamic circuit at the beginning of the evaluation phase. Fig. 3.6 and Table 3.1 present the control signals and their corresponding keeper states.



Fig. 3.7 The diagram of XOR-based conditional keeper.


The match-line starts the pre-charge cycle by setting both the pre-charge signal and floating node to low voltage. Concurrently, the conditional keeper should be turned on to accelerate the pre-charge procedure. When the match-line pre-charge signal is low and the floating node goes high, the pre-charge process completes and the circuit is ready to be evaluated. Since the match-line is pre-charged high, the conditional keeper should be turned off, preventing any impact on the delay and any unnecessary power consumption.

Evaluation of the match-line starts when the pre-charge signal goes from low to high. At the beginning of the evaluation, the floating node remains at high voltage, but it eventually settles at the appropriate voltage as long as the delay of the XOR gate exceeds the propagation delay of the dynamic circuit. Because the delay of the dynamic circuit is shorter than that of the XOR gate, the conditional keeper is only slightly turned on at the beginning of the evaluation. At the end of the evaluation phase, the conditional keeper is fully turned on or off as determined by the final search output stored on the floating node. If the floating node stays high, reflecting a mismatch, the conditional keeper is turned on to help hold the voltage high. When the match-line is in the match state, the pre-charge signal is high and the floating node is pulled toward ground, so the final value stored on the floating node is low and the conditional keeper is fully turned off. An XOR gate is required to generate the desired control signals. The timing diagram for the XOR-based conditional keeper is shown in Fig. 3.7.



Fig. 3.8 (a) Search time (b) Power consumption versus UNG margin for different keepers.

We take the design of an 8-bit AND-type match-line as an example. Four different match-line schemes are compared. The first is the match-line scheme with a conventional keeper, whose configuration is shown in Fig. 3.1. The second applies a weak keeper to the match-line scheme. The third reduces search delay using the twin-transistor technique. The last is the proposed AND-type match-line scheme with the XOR-based conditional keeper. In the noise-tolerance comparison, our concern is not the actual size of the keeper device or of the twin transistors but the ability to resist noise.

This ability is verified by the widely used unity noise gain (UNG) margin [3.8]. Fig. 3.8 (a) and Fig. 3.8 (b) summarize the simulation results, showing the search time and power consumption, respectively, versus UNG margin for the four types of AND-type match-line. At a UNG of 810mV, the XOR-based keeper achieves a 19.2% improvement in search time and a 3.5% power saving compared to conventional keeper up-sizing. Under the same condition, compared to the weak keeper, it achieves a 27.1% improvement in search time and an 8.9% power saving. Even though the twin-transistor technique is suitable for deep fan-in dynamic circuits, its performance is worse than that of the XOR-based keeper: according to the simulation results, its search delay is 16.3% longer and its power consumption 8.9% higher than those of the XOR-based keeper at a UNG of 810mV. The XOR-based conditional keeper is therefore a good tradeoff, because it costs only 1.8% and 1.0% area overhead compared to the conventional keeper and the weak keeper, respectively.

## **3.4 Butterfly Match-Line Scheme**

In this section, the butterfly match-line scheme is presented. By increasing the parallelism of the search operation, the butterfly match-line scheme improves search performance. Meanwhile, it reduces power consumption through the interlaced pipeline, since the butterfly connection turns off more TCAM segments than the PF-CDPD match-line does. In addition, we enhance the noise tolerance of the dynamic circuits by adjusting the NOR gate in the butterfly match-line.

#### **3.4.1 Organization**



Fig. 3.9 Butterfly match-line scheme.

Fig. 3.9 shows the simplified butterfly match-line circuit, which is based on the PF-CDPD match-line scheme [3.1]. Each circle represents a TCAM segment, which contains six TCAM cells and a dynamic circuit. The degree of parallelism is double that of the conventional PF-CDPD match-line scheme. In the figure, the match-line is folded into four sub match-lines of six stages for 144-bit TCAM cells. Hence, the number of segments is reduced from twelve to six, and the critical delay of a match-line falls from $(12T_{seg}+T_{AND2})$ to $(6T_{seg}+5T_{NOR2}+T_{NOR4})$ compared to the conventional PF-CDPD scheme. Here $T_{seg}$ is the discharging time of a TCAM segment; note that $T_{seg}$ is much larger than the delay of the NOR gates. To reduce the power consumption, a butterfly connection is made among these four independent sub match-lines through the interlaced connections shown in Fig. 3.9. The first stage of the match-line (Seg-1 to Seg-4) is active in every evaluation phase. When at least one TCAM segment mismatches the search data, the mismatch signal is propagated to the subsequent sub match-lines through the butterfly connection and the search operations behind the mismatched segment are terminated. Thus, the butterfly match-line scheme turns off more TCAM segments than the conventional PF-CDPD match-line scheme does.
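The two critical-delay expressions can be compared numerically with a small sketch. The delay values used in the usage note are hypothetical placeholders in picoseconds, chosen only so that $T_{seg}$ is much larger than the gate delays; they are not simulation results from this work.

```python
def pf_cdpd_critical_delay(t_seg, t_and2):
    """Conventional PF-CDPD: twelve segments in series plus one AND2."""
    return 12 * t_seg + t_and2

def butterfly_critical_delay(t_seg, t_nor2, t_nor4):
    """Butterfly scheme: six segments, five NOR2 gates and one NOR4."""
    return 6 * t_seg + 5 * t_nor2 + t_nor4
```

With assumed values of 100 ps per segment, 20 ps for the AND2, 15 ps for a NOR2 and 25 ps for the NOR4, the expressions give 1220 ps and 700 ps, respectively. The saving comes almost entirely from halving the number of series segments, because the added gate delays are small compared with $T_{seg}$.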

The butterfly match-line scheme not only achieves high performance through a high degree of parallelism but also reduces power by exploiting the interlaced connections. Such a match-line could instead be implemented using full connections between two stages, thereby feeding the mismatch information into the subsequent stages. However, this requires a NOR gate with four fan-ins to collect the information from the previous stage regardless of the state of each sub segment. Furthermore, this NOR gate must provide a large driving capacity to trigger the four segments in the subsequent stage. Although the full connection can turn off two more segments than the butterfly match-line scheme can, the power and performance overheads of the NOR gates with four fan-ins and four fan-outs would dominate the critical path of the match-line. Accordingly, the butterfly connection turns off the segments behind the mismatching segment most efficiently.

The power analysis of the butterfly match-line scheme is as follows. Before the power formulas of the butterfly match-line schemes can be derived, some assumptions are made for simplicity.

- The power consumption of the search operation is the same in all segments ($P_{seg}$) when all of the TCAM cells in a TCAM segment match the search data.
- The matching probability of TCAM cell [i] is denoted $p_i$ (i = 1 to 144). The probability $p_i$ is defined as one when i < 1.

$$
P_{144\text{-bit}} = P_{seg} \left\{
\begin{aligned}
&\sum_{j=0}^{2} \left[ \prod_{i=0}^{4+8(j-1)} PS_{i} \times \left( PS_{8j-3}PS_{8j-1}\left( PS_{8j+1} + PS_{8j+2} \right) + PS_{8j-2}PS_{8j}\left( PS_{8j+3} + PS_{8j+4} \right) \right) \right] \\
&+ \sum_{k=0}^{2} \left[ \prod_{i=0}^{8k} PS_{i} \times \left( PS_{8k+1}PS_{8k+2}\left( PS_{8k+5} + PS_{8k+6} \right) + PS_{8k+3}PS_{8k+4}\left( PS_{8k+7} + PS_{8k+8} \right) \right) \right]
\end{aligned}
\right\} \tag{3.1}
$$

where

$$
PS_{n} = \begin{cases} \prod_{i=6n-5}^{6n} p_{i}, & n \ge 1 \\ 1, & n \le 0 \end{cases}
$$

is the probability product of segment n, and a product whose upper limit is below its lower limit is taken as one.

Eq. (3.1) is the power formula for the butterfly match-line scheme with 144-bit TCAM cells. $P_{seg}$ denotes the power consumed when a TCAM segment is discharged in the evaluation cycle. The power consumed by charge sharing in the dynamic circuit when a TCAM segment mismatches is neglected. The probability of segment n, $PS_n$, is the probability that TCAM segment n matches the search data. Each stage consists of four TCAM segments, and the segments in stage 1 are Seg-1 to Seg-4. The indices j and k refer to the odd and even stages of the butterfly match-line scheme, respectively. Eq. (3.1) shows that the butterfly match-line scheme achieves higher power saving since more TCAM segments are not turned on in the evaluation period. For instance, if Seg-9 in Fig. 3.9 is mismatched, the segments with the gray background in stages 4, 5 and 6 will not be activated.
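Eq. (3.1) can be evaluated numerically with the following sketch, which returns the expected power normalized to $P_{seg}$. It rests on two assumptions that should be checked against the original derivation: the prefix product of the odd-stage sum is taken to run up to segment $8j-4$ (all segments two stages earlier, so that no index exceeds the 24 available segments), and an empty product equals one. The function names and the list interface are illustrative.

```python
from math import prod

def segment_prob(p, n):
    """PS_n: probability that segment n (cells 6n-5 .. 6n) fully matches."""
    if n <= 0:
        return 1.0
    return prod(p[6 * n - 6:6 * n])   # p[0] holds p_1, so cells 6n-5..6n

def normalized_power(p):
    """Expected P_144-bit / P_seg for 144 per-cell match probabilities."""
    ps = lambda n: segment_prob(p, n)
    prefix = lambda upper: prod(ps(i) for i in range(0, upper + 1))  # empty -> 1
    total = 0.0
    for j in range(3):   # odd stages 1, 3, 5 (assumed upper limit 8j-4)
        total += prefix(8 * j - 4) * (
            ps(8 * j - 3) * ps(8 * j - 1) * (ps(8 * j + 1) + ps(8 * j + 2))
            + ps(8 * j - 2) * ps(8 * j) * (ps(8 * j + 3) + ps(8 * j + 4)))
    for k in range(3):   # even stages 2, 4, 6
        total += prefix(8 * k) * (
            ps(8 * k + 1) * ps(8 * k + 2) * (ps(8 * k + 5) + ps(8 * k + 6))
            + ps(8 * k + 3) * ps(8 * k + 4) * (ps(8 * k + 7) + ps(8 * k + 8)))
    return total
```

As a sanity check, if every cell matches with probability one, all 24 segments discharge and the result is 24; if every cell mismatches, no segment discharges and the result is zero, consistent with neglecting the charge-sharing power of mismatched segments.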

#### 3.4.2 Design Consideration

Even though one match-line with butterfly connection is divided into four sub match-lines, in the layout all of the sub match-lines are placed serially along the same line instead of in parallel on four different lines. Hence, the NOR gate of each segment suffers from large capacitance because of the long interconnections, and its input signals travel different lengths. In the mismatch state, when the match-line pre-charge signal goes high, the evaluation phase starts and the floating node is discharged slowly due to the large capacitance. On the other hand, the 144-bit TCAM cells adopt the butterfly connection rather than full connections between two stages, since NOR2 requires less driving capacity than NOR4. The 40-bit TCAM cells, however, are implemented with NOR3 gates as a compromise between power and performance. With a NOR3 gate, the wider fan-in NMOS pull-down leaks the charge stored on the floating-node capacitance through sub-threshold leakage [3.9]. Accordingly, the large capacitance and increased sub-threshold leakage of the floating node limit the speed of the dynamic circuit regardless of the state of the match-line [3.10]. To ensure the search performance, we adjust the segment circuit as described below.



Fig. 3.10 Modification of NOR gate in TCAM segment.

Fig. 3.10 illustrates the modification of the NOR gate in the TCAM segment. $C_{in}$ represents the capacitance of the floating node. To improve the worst-case delay and delay uncertainty, inverters are inserted between the floating node and the match-line output, since a static CMOS circuit offers stronger driving capability to trigger the next stage than a dynamic circuit does. In addition, the preceding segments with NOR gates have large and unequal floating-node capacitances, which diminishes the variation immunity of the keeper sizing. Thus, these inverters not only shorten the interconnection to enhance the search speed of the dynamic circuit but also equalize the capacitance of each floating node to provide higher variation tolerance for the keeper. To preserve the functionality of the butterfly match-line, the NOR gate is replaced by an AND gate according to De Morgan's laws. Based on the UMC 40nm low-power process with a 1.0V supply voltage, Table 3.2 compares the search delay of the TCAM segment with the NOR gate and with the AND gate at 400MHz. At 25°C in the TT corner, the segment with the AND gate shows a 41% search-delay improvement (1.465 ns versus 1.039 ns, measured relative to the AND-gate delay), and it still operates at the worst-case corners where the NOR-gate segment fails. Clearly, the overall improvement in search performance more than compensates for the longer logic delay of the AND gate.
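The gate substitution relies on the identity NOR(a, b, ...) = AND(NOT a, NOT b, ...). A brute-force check over all input combinations is sketched below for the two-, three- and four-input gates mentioned in this section; the function names are illustrative.

```python
from itertools import product

def nor(*inputs):
    """Output of a NOR gate: high only when every input is low."""
    return not any(inputs)

def and_of_inverted(*inputs):
    """Inverter on each input followed by an AND gate."""
    return all(not x for x in inputs)

def equivalent(n_inputs):
    """Exhaustively compare both gates over all 2**n input patterns."""
    return all(nor(*bits) == and_of_inverted(*bits)
               for bits in product([False, True], repeat=n_inputs))
```

Because the two structures agree on every pattern, inserting an inverter after each floating node and replacing the NOR gate with an AND gate leaves the logic of the butterfly connection unchanged.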

Table 3.2 Comparison of search delay with NOR gate and AND gate in TCAM segment (Unit: ns).

| Corner | -40 °C, NOR | -40 °C, AND | 25 °C, NOR | 25 °C, AND | 125 °C, NOR | 125 °C, AND |
|--------|-------------|-------------|------------|------------|-------------|-------------|
| SS     | Fail        | 1.795       | Fail       | 1.625      | Fail        | 1.575       |
| SNFP   | Fail        | 1.278       | Fail       | 1.178      | 1.452       | 1.004       |
| TT     | 1.571       | 1.084       | 1.465      | 1.039      | 1.276       | 0.908       |
| FNSP   | 1.311       | 0.938       | 1.299      | 0.895      | 1.144       | 0.790       |
| FF     | 0.942       | 0.786       | 0.952      | 0.646      | 0.886       | 0.616       |
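The relative improvement at each corner can be computed directly from Table 3.2. The sketch below copies the entries for which the NOR-gate segment does not fail and offers two ways of expressing the change; which of the two a quoted percentage refers to is an interpretation, not something stated in the table.

```python
# (NOR delay, AND delay) in ns, copied from Table 3.2; "Fail" entries omitted.
DELAYS_NS = {
    ("SNFP", 125): (1.452, 1.004),
    ("TT", -40): (1.571, 1.084),
    ("TT", 25): (1.465, 1.039),
    ("TT", 125): (1.276, 0.908),
    ("FNSP", -40): (1.311, 0.938),
    ("FNSP", 25): (1.299, 0.895),
    ("FNSP", 125): (1.144, 0.790),
    ("FF", -40): (0.942, 0.786),
    ("FF", 25): (0.952, 0.646),
    ("FF", 125): (0.886, 0.616),
}

def reduction_vs_nor(nor_ns, and_ns):
    """Delay saved by the AND version as a fraction of the NOR delay."""
    return (nor_ns - and_ns) / nor_ns

def excess_vs_and(nor_ns, and_ns):
    """Extra delay of the NOR version as a fraction of the AND delay."""
    return (nor_ns - and_ns) / and_ns
```

For the TT, 25 °C entry, the first measure is about 29% and the second about 41%.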

## **3.5 Summary**

In this chapter, a noise-tolerant, energy-efficient match-line is presented, which employs the co-design of architecture and circuit, as shown in Fig. 3.11. To provide a reliable TCAM cell, a P-type comparison circuit is utilized. Furthermore, the butterfly match-line scheme with an XOR-based conditional keeper achieves high performance by increasing the dependence among the four parallel match-lines. It also saves power by using the XOR-based conditional keeper and turning off the segments behind a mismatched segment. Consequently, these match-line techniques simultaneously reduce the search time and power consumption while remaining resilient against noise.



Fig. 3.11 Butterfly connection style with P-type comparison circuit and XOR-based conditional keeper.

# Chapter 4 Column-Based Low Power Design Techniques

Owing to its high-speed search function, TCAM is widely used in Internet Protocol routers to forward packets toward their final destinations. For the IP address-lookup function, TCAM implements the masking function by storing "X" as a mask. The "X" value is a don't-care state that represents both "0" and "1" and allows a wildcard operation. The wildcard operation is a feature of packet forwarding in Internet routers: a cell that stores an "X" value yields a match regardless of the search data. Additionally, the list of routing tables can be rearranged and maintained using rule-table management with continuous don't-care "X" patterns, as presented in Fig. 4.1. Based on this feature, the column-based ripple search-line scheme and column-based data-aware power control are proposed. For energy-efficient design, the ripple bit-line scheme for read/write operation is also discussed in this chapter.



Fig. 4.1 Packet routing based on longest prefix matching mechanism
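Longest prefix matching with don't-care entries can be sketched in software as follows. The 8-bit addresses, the next-hop labels and the function names are hypothetical and chosen for brevity; a hardware TCAM compares all entries in parallel and resolves priority with an encoder, whereas this sketch simply orders entries by prefix length.

```python
def entry_matches(pattern, address):
    """A stored 'X' matches either address bit (wildcard operation)."""
    return all(p == "X" or p == a for p, a in zip(pattern, address))

def longest_prefix_match(table, address):
    """Return the next hop of the matching entry with the longest prefix.

    table: list of (pattern, next_hop) with patterns over '0', '1', 'X',
    where the X bits are contiguous at the end of each pattern.
    """
    # Fewer X bits means a longer prefix, so those entries are tried first.
    for pattern, next_hop in sorted(table, key=lambda e: e[0].count("X")):
        if entry_matches(pattern, address):
            return next_hop
    return None

ROUTES = [
    ("10XXXXXX", "port A"),
    ("1011XXXX", "port B"),
    ("XXXXXXXX", "default"),
]
```

For example, the address `10110001` matches all three entries in `ROUTES`, and the entry with the four-bit prefix takes priority.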

## 4.1 Ripple Bit-Line Scheme for Read/Write Operation

In the conventional hierarchical bit-line scheme, data are propagated between the local bit-line (LBL) and the global bit-line (GBL). However, GBLs and LBLs cannot share the same metal layer because of the area constraints of ultra-high-density cell design, so the GBLs need an additional metal layer [4.1]. Furthermore, the long GBLs run through the entire array, which degrades the performance and increases the power consumption of read/write operations. The proposed ripple bit-line (RBL) scheme transfers data through the divided LBLs step by step without requiring a GBL, and provides a better sensing margin for read operations through the isolated short LBLs.

## 4.1.1 Circuit Implementation & Operation

The configuration of the ripple bit-line read scheme is shown in Fig. 4.2. Each local read circuit is composed of a propagation inverter, gated by the bank select signal, between two local bit-lines and one PMOS transistor (M5) that acts as a keeper controlled by the column enable (*Col\_en*) signal. Both bank select and column enable are derived from the pre-decoder. In stand-by and write modes, all of the related signals are set high, and the bank select signal (*Bank\_sel*) cuts off both the power and ground sources to disable the propagation inverter. Otherwise, the data on the LBL would disturb the values stored in the TCAM cells. Thus, the separated capacitance and better noise immunity of the LBL improve the noise margin of the storage cell.

In read mode, the read cycle is divided into pre-charge (CLK\_0) and evaluation (CLK\_1) phases. During the pre-charge phase, each LBL is charged high by the write circuit, which includes the pre-charge circuit, and M5 is turned on by Col\_en to assist the pre-charge operation. In addition, the voltage of each LBL is equalized by the active propagation inverters to ensure that all LBLs have the same voltage level. During the evaluation phase, in contrast, the pre-charge circuit is turned off by the bit-line pre-charge (BL\_pre) signal. At the same time, the selected bank is controlled by its corresponding signal to begin the read evaluation. For example, if the data to be read is located in Bank2, Col\_en goes high to terminate the charging path of the LBL, and Bank\_sel2 is pulled up to disconnect Bank2 from the previous bank by disabling the propagation inverter between Bank3 and Bank2. Accordingly, the stored data is transmitted from LBL2 to LBL0 sequentially, and the output data is finally converted by the multiplexer. The timing diagram of the read operation is also shown in Fig. 4.2, and the control signals of the ripple bit-line scheme for the different operations are listed in Table 4.1.



Fig. 4.2 The ripple bit-line scheme and timing waveforms of read operation.

| Control signal | WRITE (CLK_0) | WRITE (CLK_1) | READ (CLK_0) | READ (CLK_1)                        | HOLD |
|----------------|---------------|---------------|--------------|-------------------------------------|------|
| BL_pre         | High          | High          | Low          | High                                | High |
| Col_en         | High          | High          | Low          | Low (unselected)<br>High (selected) | High |
| Bank_sel       | High          | High          | Low          | Low (unselected)<br>High (selected) | High |

Table 4.1 Key signals of ripple bit-line scheme.
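The signal levels in Table 4.1 can be restated as a small behavioural lookup. The following Python sketch is illustrative only: the function name `ripple_bitline_controls` and its arguments are hypothetical, and the levels are taken directly from the table (WRITE and HOLD keep all three signals high; READ pre-charges in CLK\_0 and evaluates in CLK\_1).

```python
HIGH, LOW = 1, 0


def ripple_bitline_controls(mode, phase="CLK_0", selected=False):
    """Return the BL_pre, Col_en and Bank_sel levels listed in Table 4.1.

    mode     : "WRITE", "READ" or "HOLD"
    phase    : "CLK_0" (pre-charge) or "CLK_1" (evaluation); only used for READ
    selected : True for the bank/column addressed by the read
    """
    if mode in ("WRITE", "HOLD"):
        # All related signals stay high, which disables the propagation inverter.
        return {"BL_pre": HIGH, "Col_en": HIGH, "Bank_sel": HIGH}
    if mode != "READ":
        raise ValueError("mode must be WRITE, READ or HOLD")
    if phase == "CLK_0":
        # Pre-charge phase: pre-charge circuit and keeper M5 are on.
        return {"BL_pre": LOW, "Col_en": LOW, "Bank_sel": LOW}
    if phase == "CLK_1":
        # Evaluation phase: only the selected bank/column is pulled high.
        level = HIGH if selected else LOW
        return {"BL_pre": HIGH, "Col_en": level, "Bank_sel": level}
    raise ValueError("phase must be CLK_0 or CLK_1")


if __name__ == "__main__":
    print(ripple_bitline_controls("READ", "CLK_1", selected=True))
```

Such a lookup is only a bookkeeping aid for the control levels; it does not model the analog behaviour of the bit-lines.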

However, reading "0" is the worst case of the read operation, since the LBLs have been charged high in the pre-charge phase and it is much harder to transmit data "0" than data "1" to the following LBL. A lower threshold voltage in the pull-down transistors benefits reading "0" because it increases the drain current. Therefore, the pull-down transistors M3 and M4 use low-threshold-voltage (low-Vt) devices instead of regular-threshold-voltage devices. Moreover, the longer channel length of the pull-down transistors M3 and M4 also helps discharge the LBL when reading "0".

### 4.1.2 Design Consideration

The length of the LBL can be adjusted according to the size of the TCAM macro. In the ripple bit-line scheme, all banks are cascaded (as shown in Fig. 4.2), and the farthest bank lies on the critical path for charging and discharging, so the number of cells per LBL is chosen by trading off power against performance. Fig. 4.3 compares the power and delay of the ripple bit-line (RBL) scheme as a function of the number of TCAM cells in a local bank (M). As M increases, both the area overhead and the power of the RBL decrease, because fewer ripple bit-line buffers are needed. Even though power dissipation falls, the RBL delay grows significantly because the larger capacitance and higher leakage current of the LBL result in slow signal transitions. On the other hand, if the LBL is too short, the number of sub-banks (256/M) increases, even though the LBL capacitance is partitioned into smaller pieces. Bit-line power and read delay then rise rapidly, since the number of buffers dominates the power and the number of bank stages, rather than the capacitance, dominates the delay. These competing trends in power and delay therefore result in an intermediate bit number at which the total energy is minimized.



Fig. 4.3 Power and delay comparisons of the local bit-line scheme.

From Fig. 4.3, placing 16 TCAM cells on each local bit-line gives the minimum delay, whereas the lowest power-delay product occurs at 32 cells. Considered in isolation, the local bank size would therefore be set to 32 bits. However, with the ripple search-line scheme (discussed in Section 4.2), a smaller local bank saves more power because of the continuous don't-care pattern. In addition, the power consumption at 16 bits is close to that at 32 bits, so the number of cells on each local bit-line is chosen to be 16. With an appropriate partition of the bit-line length, the ripple bit-line scheme uses simple circuitry without extra timing control to save power and the additional metal layer of the GBL, and it further improves area efficiency for high-density design.

## 4.2 Don't-Care-Based Ripple Search-Line Scheme

Owing to the parallel data comparison performed in every search operation, power dissipation is an important issue in TCAM. The three major power consumers in a TCAM are the match-lines, the search-lines, and the clock control. The butterfly match-line scheme saves power effectively, but the search-lines still contribute 54%-82% of the total power consumption [4.2]. Recently, many works have been devoted to search-line schemes that reduce power dissipation, such as the two-level don't-care gating scheme [4.2], the low-swing search-line scheme [4.3], and the don't-care-based hierarchical search-line scheme [4.4]. However, the long and wide global search-lines (GSLs) of the hierarchical architecture adopted by these works induce large capacitance and non-negligible switching power. Furthermore, the delay of the global search-lines increases significantly in advanced technologies. To decrease power without a performance penalty and to reduce the number of vertical wires for high-density design, the don't-care-based ripple search-line scheme is proposed. The details of the ripple search-line concept are described below.

### **4.2.1 Circuit Implementation**

If a TCAM cell stores don't-care, its match result is independent of the search data, and the search-lines can be disabled to save power. Based on this feature, the ripple search-line scheme divides each search-line into several local search-lines (LSLs), which are activated depending on the don't-care cells rather than in every search cycle. Fig. 4.4 (a) shows the simplified architecture of the don't-care-based ripple search-line scheme. The entire TCAM macro is partitioned into n sub-blocks, each composed of numerous match-lines and one set of ripple search-line buffers. In IP address lookup tables implemented with TCAM, the prefixes are grouped and arranged in order of prefix length. The longest prefix is located at the top of the table. Accordingly, if the lower TCAM cell stores a don't-care datum, then the upper cells must be don't-care. By exploiting this regularity of the don't-care pattern, the ripple search-line buffers are controlled by the data in the don't-care cells of the bottom word of each block.



Fig. 4.4 (a) A simplified architecture (b) Circuit implementation of don't-care based ripple search-line scheme.

The circuit implementation of the don't-care-based ripple search-line scheme is presented in Fig. 4.4 (b). The logic function implemented in each ripple search-line buffer is a pair of NOR gates controlled by the local search-lines and the don't-care bit. During a search operation, the search data is first sent into the ripple search-line buffers of the bottommost bank. Whether the search data is then propagated onto the local search-lines depends on the don't-care data. All following banks are connected in the same way, and the search data propagates consecutively from the bottommost bank to the topmost bank. For instance, if the don't-care state is true, the local search-line pair is always discharged to ground and has no voltage swing. In this scenario, the comparisons between the search data and the prefix data stored in the storage cells are redundant. Moreover, only the capacitances before this bank need to be charged to the required voltage level, so the reduced effective capacitance lowers the search-line power consumption. In contrast, if the don't-care state is false, the local search-line broadcasts the search data to the entire bank and to the next ripple search-line buffers. In this way, the local search-line pair of the farthest bank is critical to the search-line delay, but it consumes the least switching power thanks to the continuous don't-care pattern.
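The bank-by-bank propagation described above can be sketched as a small behavioural model. This is a minimal Python sketch under stated assumptions, not the circuit of Fig. 4.4 (b): the function names are hypothetical, bank 0 is taken to be the bottommost bank, and the NOR pair is assumed to be cross-wired so that a stage is non-inverting when the don't-care bit is 0 and drives both lines low when it is 1.

```python
from __future__ import annotations


def nor(a: int, b: int) -> int:
    """Two-input NOR on 0/1 values."""
    return 0 if (a or b) else 1


def ripple_search_line(search_bit: int, dont_care_bits: list[int]) -> list[tuple[int, int]]:
    """Return the (LSL, LSLb) levels of every bank for one search column.

    dont_care_bits[k] is the mask bit stored in the bottom word of bank k,
    with bank 0 the bottommost bank (an assumption of this sketch).
    """
    # Continuous don't-care property: once a bank is masked, all banks above are masked.
    if any(lower > upper for lower, upper in zip(dont_care_bits, dont_care_bits[1:])):
        raise ValueError("don't-care bits must be continuous from the first masked bank upward")

    sl, slb = search_bit, 1 - search_bit
    levels = []
    for dc in dont_care_bits:
        # Assumed cross-wired NOR pair: passes the data when dc == 0,
        # discharges both local search-lines when dc == 1.
        sl, slb = nor(slb, dc), nor(sl, dc)
        levels.append((sl, slb))
    return levels


def active_banks(levels: list[tuple[int, int]]) -> int:
    """Count the banks whose local search-line pair actually carries data."""
    return sum(1 for sl, slb in levels if sl or slb)


if __name__ == "__main__":
    lv = ripple_search_line(1, [0, 0, 1, 1])
    print(lv, active_banks(lv))
```

In this model only the banks below the first masked bank see any search-line activity, which is the source of the switching-power saving discussed in the text.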

### **4.2.2 Design Consideration**

Unlike the hierarchical search-line scheme, the ripple search-line scheme eliminates the global search-lines and thereby saves their large wiring capacitance. However, all local search-line pairs are connected in series, so the number of banks has a great influence on the speed of search-line propagation. To measure the worst-case propagation delay, we set the don't-care bits low, which activates all local search-lines. The ripple search-line delay with respect to the number of TCAM cells in each bank (M) is simulated, and the results are depicted in Fig. 4.5.



Fig. 4.5 Delay of ripple search-line scheme versus number of TCAM cells on each local search-line.

The hierarchical search-line scheme has the same delay regardless of the number of banks (256/M), whereas the delay of the ripple search-line scheme depends on it. The simulation results confirm this trend. When M=16, the TCAM macro is partitioned into 16 sub-banks. As M increases, the area overhead declines since fewer ripple search-line buffers are needed. Even though area efficiency improves, the RSL delay rises rapidly because a highly capacitive local search-line induces high leakage current. On the other hand, if the local search-line is too short, the increased number of banks also leads to a long propagation delay. However, more switching power can be saved by further exploiting the vertically continuous don't-care property. Consequently, by setting the number of TCAM cells on each local search-line to 16, the don't-care-based ripple search-line scheme reduces power consumption and provides a large margin for timing yield.

## 4.3 Column-Based Data-Aware Power Control

As process technology scales down, the contribution of leakage power to the total TCAM power is expected to increase substantially, particularly for large table sizes. Several techniques for leakage power reduction have been reported [4.5], [4.6]. However, [4.5] incurs both delay and area penalties to reduce leakage. Although [4.6] can reduce both the TCAM cell leakage and the search-line dynamic power without any search performance penalty, its dynamic power source (DPS) design is very sensitive to the write order. In this work, the column-based data-aware power control (DAPC) scheme is investigated for the TCAM macro, based on the characteristic of continuous don't-care (X) patterns. It reduces both leakage and dynamic power consumption and further enhances write-ability, write margin, and time-to-write. Moreover, the order in which data is written is flexible.

To support various TCAM applications, our design has an additional control signal, Flag. In applications without read operations, Flag is set low (logic "0"). In that case, if the don't-care state of the lowest entry in the same bank (in the same column) is true, the datum in the storage cell will be destroyed, so the readout datum is unknown while Flag is low. In contrast, if Flag is high (logic "1"), the data stored in the storage cells is protected from disturbance, and the read data can be propagated successfully through the ripple bit-line scheme (discussed in Section 4.1). The operation of the proposed data-aware power control scheme therefore depends on the Flag signal. The power gating approaches are described and analyzed below.

### 4.3.1 Basic Concept



Fig. 4.6 The architecture of column-based data-aware power control in TCAM macro.

Fig. 4.6 schematically depicts the dynamic power implementation of the column-based data-aware scheme in the TCAM macro. In each TCAM bank, the storage cells and the don't-care cells are each controlled by one power switch pair, two pairs in total, with little area overhead for the control circuits. Each power switch pair serves the left and right half-cells individually. Hence, dynamic power adjustment can not only gate the power source to reduce power but also enhance the write margin according to the input datum. The circuitry associated with the dynamic data-aware power control of a TCAM cell is shown in Fig. 4.7. Table 4.2 lists the corresponding voltages of the virtual power sources ($V_{VDD1}$, $V_{VDD2}$, $V_{VDD3}$ and $V_{VDD4}$) under different operations.



Fig. 4.7 Cell connection of column-based data-aware power control scheme.

Table 4.2 The corresponding virtual source voltage under different operations.

| Mode                                    | $V_{VDD1}$ | $V_{VDD2}$ | $V_{VDD3}$ | $V_{VDD4}$ |
|-----------------------------------------|------------|------------|------------|------------|
| Write 0                                 | < VDD      | VDD        | < VDD      | VDD        |
| Write 1                                 | VDD        | < VDD      | VDD        | < VDD      |
| Data retention & Read & Search (mask=0) | VDD        | VDD        | VDD        | VDD        |
| Data retention & Read & Search (mask=1) | VDD        | VDD        | VDD        | < VDD      |
| Cut off                                 | < VDD      | < VDD      |            |            |

As shown in Fig. 4.7, the power gating PMOS transistors M1 and M2 of the storage cell are controlled by P1 and P2. The other power gating pair, M3 and M4, is connected to the don't-care cell and controlled by P3 and P4. In write mode, the write datum on the bit-line turns off the supply of one half-cell to facilitate writing. To write logic "0" into the cell node, WL\_c (WL\_d) is pulled up to VDD. P1 (P3) then goes high to turn off M1 (M3), which lowers the voltage of the floating node $V_{VDD1}$ ($V_{VDD3}$). To write logic "1" into the cell node, P2 (P4) is set high to cut off the power source of the right half-cell (M2/M4 is turned off), so that $V_{VDD2}$ ($V_{VDD4}$) floats, as shown in Table 4.2. In this way, the floating VDD on one side provides a large margin that supports write-ability and improves the time-to-write. This concept is employed in both the storage and the don't-care cells.

Unlike the don't-care cells with their continuous don't-care pattern, the data in the storage cells follows no regular pattern that could be used to arrange the write pattern. Thus, the don't-care cells and the storage cells have different power gating operations and cannot share the same control circuits. The corresponding control circuits for the storage and don't-care cells are shown in Fig. 4.8 (a) and Fig. 4.8 (b), respectively.



Fig. 4.8 Control Circuits for (a) storage cells (b) don't-care cells.

For storage cells, the power control circuit has four modes: write\_0, write\_1, cut off, and data retention. Since read and search operations do not affect the data in the cells, both are categorized under data retention mode. If the Flag signal is low and the don't-care state in the bottommost row is true (mask=1), the storage cells enter cut-off mode. In cut-off mode, P1 and P2 are high, and the power gating transistors M1 and M2 are both turned off to reduce leakage current. Over time, the data stored in the cells will be destroyed because the power source is cut off. However, this is not a concern, because the search output always matches regardless of the data in the storage cells, and the TCAM macro does not perform read operations when Flag is low. On the other hand, when the storage cells are in data retention mode, P1 and P2 remain low to turn on M1 and M2 and provide sufficient noise margin for the stored data. The truth table of the control signals in the four modes is given in Table 4.3.

| Mode           | P1   | P2   |
|----------------|------|------|
| Cut off        | High | High |
| Write_0        | High | Low  |
| Write_1        | Low  | High |
| Data retention | Low  | Low  |

Table 4.3 The truth table of the control signals for storage cells.

For don't-care cells, the power gating control circuit has three modes: write\_0, write\_1, and data retention. Since the don't-care pattern is the basis of the comparison operation regardless of the Flag signal, the control circuit of the don't-care cells has no cut-off mode. When the don't-care cells enter data retention mode and the bottom TCAM cell of the column is masked, P3 stays low and P4 is pulled up to turn off M4. This improves the static noise margin and reduces power. Conversely, if the mask bit is low, both P3 and P4 remain low and the power gating transistors (M3 and M4) are on to supply full power for holding data, as in the retention mode of the storage cells.

Accordingly, the data-aware power control reduces the static power and improves the SNM of the don't-care cells. Moreover, the more mask bits each bank contains, the greater the power saving. The truth table of the control signals in the three modes is given in Table 4.4.

| Mode                                    | Mask bit | P3   | P4   |
|-----------------------------------------|----------|------|------|
| Write_0                                 | X        | High | Low  |
| Write_1                                 | X        | Low  | High |
| Data retention & Read & Search (mask=0) | Low      | Low  | Low  |
| Data retention & Read & Search (mask=1) | High     | Low  | High |

Table 4.4 The truth table of the control signals for don't-care cells.
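The gating-control behaviour described in this subsection can be summarised as lookup functions. The Python sketch below is illustrative only: the function names and mode strings are hypothetical, and the levels follow the description above (P1/P2 for storage cells, P3/P4 for don't-care cells, cut-off only when Flag is low and the bottom-row mask bit is 1).

```python
HIGH, LOW = "High", "Low"


def storage_cell_controls(mode):
    """Return (P1, P2) for the storage cells in the given mode."""
    table = {
        "cut_off": (HIGH, HIGH),        # both M1 and M2 off to cut leakage
        "write_0": (HIGH, LOW),         # M1 off: left virtual supply floats
        "write_1": (LOW, HIGH),         # M2 off: right virtual supply floats
        "data_retention": (LOW, LOW),   # also covers read and search
    }
    return table[mode]


def dont_care_cell_controls(mode, mask=0):
    """Return (P3, P4) for the don't-care cells; there is no cut-off mode."""
    if mode == "write_0":
        return (HIGH, LOW)
    if mode == "write_1":
        return (LOW, HIGH)
    if mode == "data_retention":
        # With the bottom-row cell masked, M4 is turned off as well.
        return (LOW, HIGH) if mask else (LOW, LOW)
    raise ValueError("unknown mode for don't-care cells")


def storage_idle_mode(flag, mask):
    """Mode of the storage cells when no write is in progress."""
    return "cut_off" if (flag == 0 and mask == 1) else "data_retention"


if __name__ == "__main__":
    mode = storage_idle_mode(flag=0, mask=1)
    print(mode, storage_cell_controls(mode), dont_care_cell_controls("data_retention", mask=1))
```

Keeping the storage-cell and don't-care-cell tables separate mirrors the point made in the text that the two cell types cannot share one control circuit.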





Fig. 4.9 An adaptive replica timing control circuit for data-aware scheme.

To track the real read/write access time and the PVT variation of the bit cells, replica circuits are often used in memory timing control circuits [4.7], [4.8]. Since the data-aware scheme in our design is used to assist write-ability, the dynamic power control gates the power supply of the left or right half-cell according to the input data during a write operation. However, if the pulse width of the data-aware enable signal (Dawa\_en) is too large, the voltage difference between the individual power sources grows due to leakage, and the data stored in the cells is at risk of being destroyed. To determine when the data-aware scheme should be released, we use the replica timing control circuit shown in Fig. 4.9 to mitigate the effects of leakage and process variation.



Fig. 4.10 The diagram of adaptive write-time tracing replica.

Referring to Fig. 4.9, when the TCAM is not performing a write operation, the pre-charge circuits in the replica hold the dummy bit-line pair and the power supply of the dummy cells at VDD. The timing diagram of the associated control signals in the replica is shown in Fig. 4.10. During write cycles, the write enable signal (Wen\_c/Wen\_d) goes high and signals the dummy word-line (DWL) to turn off the pre-charge circuits of the dummy bit-line pair and the power source of the left-half dummy cells.

Meanwhile, the write datum is transferred to the dummy bit-line pair to start a write "1" operation, and Dawa\_en is pulled high. Note that write "1" is the worst case of the write operation when all dummy cells store logic "0". When the stored datum in the write replica cell (WRcell) rises to logic "1", the write operation is complete and Dawa\_en is pulled down to restore the power supply of the cells in the bank. As a result, the replica timing control circuit tracks the PVT conditions to ensure that the Dawa\_en pulse is wide enough to optimize the write performance.

## **4.4 Simulation Results and Analysis**



### 4.4.1 Performance Comparison of HSL and RSL

Fig. 4.11 (a) Hierarchical search-line structure. (b) Ripple search-line structure.

Fig. 4.11 (a) and (b) illustrate the simplified hierarchical and ripple search-line structures, respectively. In the hierarchical scheme, the search-lines are divided into two levels: global search-lines and local search-lines. The basic idea of the don't-care-based hierarchical search-line is that the global search-lines are active every cycle, whereas the local search-lines are active only when necessary. In many cases, the continuous don't-care property means there is no need to activate a local search-line. Thus, compared with the conventional search-line driving approach, the hierarchical search-line scheme reduces power consumption by decreasing the switching activity of the local search-lines. The overall power consumption of the conventional and hierarchical search-lines is given in Eq. (4.1) and Eq. (4.2), where M is the number of bit cells on each local search-line, N is the number of banks, and $\alpha$ denotes the activity rate of the local search-lines. Eq. (4.2) shows that $\alpha$ determines how much power is saved on the local search-lines, but the cost of this saving is the power dissipated by the global search-lines. Referring to Fig. 4.11 (b), the ripple search-line scheme adopted in our TCAM design eliminates the global search-lines, so the power of the large global driver and the large global capacitance ($C_{GSL}$) is saved. The dynamic power in this case is given in Eq. (4.3). Since the ripple search-line power is proportional to $\alpha$, a lower $\alpha$ yields a larger power reduction.

$$P_{Conventional\,SL} = f \cdot (C_{LSL} + (N \cdot M) \cdot C_{cell}) \cdot V_{dd}^{2} \quad (4.1)$$

$$P_{Hierarchical\,SL} = f \cdot C_{GSL} \cdot V_{dd}^{2} + \alpha \cdot N \cdot f \cdot (C_{LSL} + M \cdot C_{cell}) \cdot V_{dd}^{2} \quad (4.2)$$

$$P_{Ripple\,SL} = \alpha \cdot N \cdot f \cdot (C_{LSL} + M \cdot C_{cell}) \cdot V_{dd}^{2} \quad (4.3)$$
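Eq. (4.1) to Eq. (4.3) can be evaluated numerically to see how the global search-line term separates the hierarchical and ripple schemes. The Python sketch below is a minimal illustration: the function names are hypothetical, and every capacitance value in the usage example is an arbitrary placeholder rather than a figure from this work. Only the 400 MHz frequency, the 1 V supply, and N = M = 16 are taken from the text.

```python
def conventional_sl_power(f, c_lsl, c_cell, n, m, vdd):
    """Eq. (4.1): conventional search-line power."""
    return f * (c_lsl + n * m * c_cell) * vdd ** 2


def hierarchical_sl_power(f, c_gsl, c_lsl, c_cell, n, m, alpha, vdd):
    """Eq. (4.2): global search-line power plus gated local search-line power."""
    return f * c_gsl * vdd ** 2 + alpha * n * f * (c_lsl + m * c_cell) * vdd ** 2


def ripple_sl_power(f, c_lsl, c_cell, n, m, alpha, vdd):
    """Eq. (4.3): local search-line power only, scaled by the activity rate alpha."""
    return alpha * n * f * (c_lsl + m * c_cell) * vdd ** 2


if __name__ == "__main__":
    # Placeholder capacitances (assumptions for illustration only).
    params = dict(f=400e6, c_lsl=20e-15, c_cell=1e-15, n=16, m=16, vdd=1.0)
    c_gsl = 300e-15
    for alpha in (1.0, 0.5, 0.25):
        p_h = hierarchical_sl_power(c_gsl=c_gsl, alpha=alpha, **params)
        p_r = ripple_sl_power(alpha=alpha, **params)
        print(f"alpha={alpha:.2f}  hierarchical={p_h:.3e} W  ripple={p_r:.3e} W")
```

Under these equations the two schemes always differ by the constant term $f \cdot C_{GSL} \cdot V_{dd}^{2}$, so the relative saving of the ripple scheme grows as $\alpha$ falls.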

To show the power improvement of the ripple search-line, Fig. 4.12 compares the search-line power of the hierarchical and ripple schemes under different don't-care patterns. The simulated results include the power dissipated in the search-lines, the search-line driver circuits, and the local search-line buffers. As the percentage of don't-care data increases, the power dissipation of both the hierarchical and ripple schemes falls because more local search-lines are disabled. When the proportion of don't-care data is 50%, the ripple search-line scheme saves 31.74% of the power compared with the hierarchical scheme. When the proportion of don't-care data is 75%, the power reduction reaches 52.13%.



Fig. 4.12 Search-line power consumption under different don't-care patterns.

Rather than broadcasting the search data to the local search-lines directly, the ripple search-line scheme propagates the search data bank by bank. Hence, the delay of the ripple scheme may be longer than that of the hierarchical scheme. Based on the Elmore delay model, the delays of the hierarchical and ripple search-lines are derived in Eq. (4.4) and Eq. (4.5), where L is the length of the search-line and $C_w$ is the total wiring capacitance of the search-line. According to these equations, the propagation time of the hierarchical scheme is composed of the global and local search-line delays. In the ripple scheme, the propagation time depends on the number of local search-lines (N), since the separated search-lines are connected in series. However, the delay of the global search-lines increases rapidly in advanced technologies. Thus, if N is chosen in an appropriate range, the ripple search-line scheme can propagate the search data faster than the hierarchical search-line scheme.

$$\tau_{hierarchical} = R_{G\_driver} \cdot \left(C_{w} \cdot L + N \cdot C_{L\_driver}\right) + \frac{R_{w} \cdot C_{w} \cdot L^{2}}{2} + R_{L\_driver} \cdot \left(C_{w} \cdot \frac{L}{N} + M \cdot C_{cell}\right) + \frac{R_{w} \cdot C_{w} \cdot \left(\frac{L}{N}\right)^{2}}{2} \quad (4.4)$$

$$\tau_{ripple} = N \cdot R_{L\_driver} \cdot \left( C_w \cdot \frac{L}{N} + M \cdot C_{cell} + C_{L\_driver} \right) + N \cdot \frac{R_w \cdot C_w \cdot \left(\frac{L}{N}\right)^2}{2} \quad (4.5)$$

The overall search-line delay of the ripple scheme is compared with that of the hierarchical scheme in Fig. 4.13. For a fair comparison, both schemes are simulated with an identical pattern in which the percentage of don't-care data is 0% and the number of local search-lines is 16. In simulation, the hierarchical scheme takes 607.73 ps to transfer the search data, whereas the ripple scheme takes only 520.95 ps to complete the propagation from the bottommost bank to the topmost one. Consequently, the don't-care-based ripple search-line scheme achieves a 14.28% improvement in search-line delay over the hierarchical search-line scheme.
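The delay expressions of Eq. (4.4) and Eq. (4.5), and the relative improvement quoted above, can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and $R_w$ and $C_w$ are treated as per-unit-length quantities because that is how they enter the equations. No device parameters are supplied here because none are given in the text; only the two simulated delays are reused.

```python
def tau_hierarchical(r_g_driver, r_l_driver, r_w, c_w, c_l_driver, c_cell, length, n, m):
    """Eq. (4.4): global search-line delay plus one local search-line delay."""
    local_len = length / n
    global_part = r_g_driver * (c_w * length + n * c_l_driver) + r_w * c_w * length ** 2 / 2
    local_part = r_l_driver * (c_w * local_len + m * c_cell) + r_w * c_w * local_len ** 2 / 2
    return global_part + local_part


def tau_ripple(r_l_driver, r_w, c_w, c_l_driver, c_cell, length, n, m):
    """Eq. (4.5): n identical local stages connected in series."""
    local_len = length / n
    stage = r_l_driver * (c_w * local_len + m * c_cell + c_l_driver) + r_w * c_w * local_len ** 2 / 2
    return n * stage


def improvement_percent(t_reference, t_new):
    """Relative delay reduction of t_new with respect to t_reference, in percent."""
    return (t_reference - t_new) / t_reference * 100.0


if __name__ == "__main__":
    # Simulated delays reported in the text, in picoseconds.
    print(round(improvement_percent(607.73, 520.95), 2))
```

Note that the distributed wire term of the ripple scheme, $N \cdot R_w C_w (L/N)^2 / 2$, equals $R_w C_w L^2 / (2N)$ and therefore shrinks as N grows, while the driver term grows with N. This is the trade-off behind choosing N in an appropriate range.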



Fig. 4.13 Analysis of the search-line delay under different search-line schemes.

### 4.4.2 Simulation Result for DAPC

The standby power with and without power gating is analyzed. Fig. 4.14 shows the standby power under different don't-care patterns when Flag=1. To protect the stored data from disturbance, the gating transistors cannot be turned off unless the don't-care state is true. Hence, only one side of the power source of the don't-care cells floats during data retention mode. As the figure shows, the leakage power increases slightly when the TCAM cells hold no don't-care data, owing to the additional power switch circuits. Nevertheless, when half of the data is in the don't-care state, the data-aware power control still achieves a 7.8% leakage power reduction compared with conventional TCAM cells without power gating.



Fig. 4.14 Leakage power consumption under different don't-care pattern when Flag=1.

On the other hand, when Flag=0, the TCAM macro does not perform read operations. Therefore, the data-aware power control cuts off all power sources of the storage cells to save standby power, regardless of whether the stored data is destroyed. Fig. 4.15 shows the standby power under different don't-care patterns when Flag=0. Because the dynamic power gating devices of both the storage and the don't-care cells depend on the don't-care bits of the bottom row of the bank, the leakage power decreases as the percentage of don't-care data increases, as Fig. 4.15 confirms. When half of the data is don't-care, the data-aware power control scheme achieves 28.9% lower leakage power than a conventional TCAM. Moreover, when the proportion of don't-care data is 75%, the leakage power reduction exceeds 40%. From Fig. 4.14 and Fig. 4.15, we conclude that the column-based data-aware power control saves standby power effectively whether the Flag signal is high or low.



Fig. 4.15 Leakage power consumption under different don't-care pattern when Flag=0.

## 4.5 Summary

This chapter has described the column-based low-power design, which comprises the ripple bit-line scheme, the don't-care-based ripple search-line scheme, and the column-based data-aware power control. For dynamic power, the ripple bit-line and ripple search-line schemes reduce power without a performance penalty and, by connecting the banks in series, save the additional process cost of global search-lines. Furthermore, the don't-care-based ripple search-line scheme exploits the continuous don't-care pattern to decrease the switching activity and switching capacitance of the local search-lines, which saves more power. As technologies advance, leakage current increasingly dominates the overall power consumption in nano-scale technologies. Accordingly, the column-based data-aware power control is employed to reduce static power through power gating devices. Based on the don't-care bits and the input data, the power control dynamically adjusts the supply voltages of the left and right half-cells of both the don't-care cells and the storage cells. It therefore also improves write-ability and the SNM for read, search, and data-retention operations. Finally, the replica circuitry makes the timing of the power switching tolerant to PVT variation and Vt scatter.

# Chapter 5 Implementation of 256x40 and 256x144 Energy-Efficient TCAM Macro in UMC 40nm LP CMOS Process

As the shortcomings of the existing Internet Protocol have become manifest, a new protocol known as IPv6 (IP version 6) has been defined to ultimately replace IPv4 [5.1]. Addresses in the new Internet protocol are 128 or 144 bits long, whereas they are 32 bits long in the current IPv4-based Internet. Therefore, we have designed a 256x144 TCAM macro with power saving techniques for IPv6 applications. These power saving techniques are also applied in a 256x40 TCAM macro. In this chapter, the 256x40 and 256x144 energy-efficient TCAM macros are implemented in the UMC 40nm low-power (LP) CMOS process. The specification and floor-planning of the 256x40 TCAM macro are described in Sections 5.1 and 5.2. For TCAM arrays of different sizes, the butterfly connection of the match-line scheme requires some modification. To further reduce power consumption, we also use shared BL/DL and interleaved vertical global-line techniques. These design implementations are discussed in Sections 5.3 and 5.4. Section 5.5 presents the simulation results and analysis of the 256x40 and 256x144 TCAM macros. Finally, Section 5.6 concludes the chapter.

## 5.1 Specification of Energy-Efficient TCAM Macro

The TCAM macro is 256 words x 40 bits, i.e., it uses 256x40 TCAM cells, and the TCAM array is divided into 16 banks. Each TCAM cell is composed of two SRAM cells: the storage cell and the don't-care cell. In this TCAM macro, the 8-bit address, *Addr* [7:0], is used to access one of the 256 entries in a read or write operation. *Addr* [7:4] selects one of the 16 banks in the TCAM array, and *Addr* [3:0] selects one of the 16 words in the bank. In addition, each TCAM entry contains 40 TCAM cells. Accordingly, the write-in data (*In* [39:0]), read-out data (*DOUT* [39:0]), and search data (*Sin* [39:0]) are all 40 bits wide. During a search operation, all 256 entries are compared with the search data within one cycle, and 256 comparison results are generated simultaneously as the search outputs (*SOUT* [255:0]). The input and output pins of this TCAM macro are listed in Table 5.1 and Table 5.2.

| Input Pin Name | Description |
|----------------|-------------|
| Vdd, Gnd       | Power pins |
| Addr [7:0]     | 8-bit address signals for accessing one of the 256 entries (words) during the read operation or the write operation |
| MODE           | $M_{S}/M_{D}$ selection (accessing the storage cells or the don't-care cells) in the selected entry during the read operation or the write operation |
| In [39:0]      | Data input for the write operation |
| Sin [39:0]     | Search input for the search operation |
| CEN            | *Chip enable*: the three operations, read/write/search, are activated when CEN is high |
| SEN            | *Search enable*: the search operation is activated when SEN is high, and the read/write operations are activated when SEN is low |
| WEN            | *Read/write selection*: the write operation is activated when WEN is high, and the read operation is activated when WEN is low |
| FLAG           | *Readout flag*: if the flag is low, the data in the storage cell will be disturbed if the don't-care cell on the lowest entry in the same bank (in the same column) is true. The readout data will be unknown while the flag is low. |

Table 5.1 Descriptions of input pins.

| Output Pin Name | Description                                                             |
|-----------------|-------------------------------------------------------------------------|
| DOUT [39:0]     | 40-bit read-out data                                                    |
| SOUT [255:0]    | 256-bit search output while comparing 256 entries in a search operation |

Table 5.2 Descriptions of output pins.

Because the bit-lines and don't-care lines are shared (shared BL/DL), a complete read or write of a TCAM entry cannot be finished within one cycle. An extra bit, *Mode*, selects whether the storage cells or the don't-care cells of the entry are accessed: when *Mode* is high, the don't-care cells are read or written; when *Mode* is low, the storage cells are selected. To support different TCAM applications, our design has another control signal, *Flag*. Based on the continuous don't-care X pattern and the prefix pattern, *Flag* allows the storage data to be destroyed when the don't-care data is 1 and the storage data will never be read. When *Flag* is low, for applications without read operations, the datum in a storage cell is destroyed if the don't-care cell of the lowest entry in the same bank (in the same column) is true. The destroyed storage data do not affect the search result, because under the continuous don't-care X pattern the don't-care cell of that TCAM cell is also true. In contrast, when *Flag* is high, the data in the storage cells are protected from disturbance, so the read data can be propagated correctly through the ripple bit-line scheme.

Generally, the TCAM macro operates in three modes: Write Mode, Read Mode and Search Mode. In the write and read operations, the TCAM macro behaves like an ordinary memory; that is, data are manipulated in the TCAM array in the same way as in an SRAM array. Unlike an SRAM, the TCAM array has an extra operation mode, Search Mode. In a search operation, the input data are sent into the TCAM array and compared with all the stored data simultaneously. All rows that match the input data are then sent to the address priority encoder. When multiple matched rows reach the address priority encoder, the address corresponding to the longest prefix is sent to the output. Thus, in the TCAM architecture, a large number of comparison operations are active during every search operation in order to examine all the data stored in the TCAM array.

The three operations are controlled by three signals (*CEN*, *SEN*, *WEN*). When *CEN* is low, the TCAM macro is in standby mode. The priority of the three control signals is CEN > SEN > WEN. Table 5.3 lists the truth table of the three modes, and the timing diagrams of the corresponding signals for the different operations are shown in Fig. 5.1, Fig. 5.2 and Fig. 5.3.
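The control priority CEN > SEN > WEN, together with the pin descriptions in Table 5.1, can be sketched as a small behavioral decoder. This is only an illustrative model, not the macro's control logic; the function name `decode_mode` and the mode strings are assumptions introduced here.

```python
def decode_mode(cen: int, sen: int, wen: int) -> str:
    """Return the operating mode selected by (CEN, SEN, WEN).

    Priority follows the text: CEN > SEN > WEN.
    """
    if not cen:
        return "standby"  # CEN low: macro is idle regardless of SEN/WEN
    if sen:
        return "search"   # SEN high selects search and overrides WEN
    # SEN low: WEN chooses between write (high) and read (low)
    return "write" if wen else "read"


if __name__ == "__main__":
    for cen in (0, 1):
        for sen in (0, 1):
            for wen in (0, 1):
                print(cen, sen, wen, decode_mode(cen, sen, wen))
```

In a full model, the *Mode* bit would additionally select the storage cells or the don't-care cells for the read and write modes.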



Table 5.3 Truth table of three modes.

Fig. 5.1 Timing diagram of writing storage/don't-care cells.



Fig. 5.2 Timing diagram of reading storage/don't-care cells.



Fig. 5.3 Timing diagram of search operation.

# 5.2 Architecture & Floor-planning of TCAM Macro

In recent years, TCAMs have been widely used in network routers for packet forwarding and packet classification. Network routers forward data packets from an incoming port to an outgoing port using an address-lookup function [5.2]. Fig. 5.4 shows a simplified block diagram of the proposed TCAM macro for IP lookup tables. The search data are broadcast on the search-lines to the TCAM array. Each stored word has a match-line that indicates whether the search word is identical to the stored word (a match) or not (a mismatch, or "a miss"). The match-lines are fed to an encoder that generates a binary match location corresponding to the most direct route. In TCAM applications, where more than one word may match, a priority encoder is employed instead of a simple encoder. A priority encoder selects the matching location with the highest priority, such that words in lower address locations have higher priority. The overall function of a TCAM is thus to take a search word and return the matching memory location. In our TCAM design, however, the address priority encoder is not included in the implementation of the



Fig. 5.4 Block diagram of 256x40 TCAM macro.
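The search flow described above (broadcast the search word, compare it against every stored word, then pick the lowest matching address) can be sketched behaviorally as follows. This is a minimal software model under stated assumptions: each entry is represented as a hypothetical `(value, mask)` pair in which a mask bit of 1 means don't-care, and the names `matches`, `search` and `priority_encode` are illustrative. In the macro itself only the match vector (SOUT) is produced; the priority encoder is shown here just to complete the picture.

```python
from typing import List, Optional, Tuple

# (stored value, don't-care mask); a mask bit of 1 marks a don't-care position
Entry = Tuple[int, int]


def matches(entry: Entry, key: int) -> bool:
    """A word matches when every bit that is not don't-care equals the key bit."""
    value, mask = entry
    return (value ^ key) & ~mask == 0


def search(table: List[Entry], key: int) -> List[bool]:
    """Compare the key with all entries; one match-line result per entry."""
    return [matches(entry, key) for entry in table]


def priority_encode(match_lines: List[bool]) -> Optional[int]:
    """Return the lowest matching address (highest priority), or None on a miss."""
    for address, hit in enumerate(match_lines):
        if hit:
            return address
    return None


if __name__ == "__main__":
    # 8-bit toy table, longest prefix stored at the lowest address
    table = [
        (0b10110000, 0b00001111),  # prefix 1011xxxx
        (0b10100000, 0b00011111),  # prefix 101xxxxx
        (0b10000000, 0b01111111),  # prefix 1xxxxxxx
    ]
    lines = search(table, 0b10101111)
    print(lines, priority_encode(lines))
```

Storing longer prefixes at lower addresses is what lets a lowest-address-wins encoder return the longest-prefix match.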

Fig. 5.5 shows the floor-plan of the proposed 256x40 TCAM macro. With continued scaling of CMOS technology, the growing contribution of parasitic capacitances and series resistances has become a challenge, because the parasitic capacitances are charged and discharged whenever the devices switch [5.3], [5.4]. The global control circuitry is therefore placed in the center of the macro to shorten the global routing. In the ripple bit-line and ripple search-line schemes, the 256 entries are divided into 16 local banks. Every 16 bit-cells in the same column are grouped with a local evaluation circuit and ripple buffers to cope with the leakage current problem and to support the don't-care-based power control. To reduce the propagation time of the ripple schemes, the search input data and write input data are driven from the middle of the array, which shortens the propagation distance.



Fig. 5.5 The floor-plan of energy-efficient 256x40 TCAM macro.

# 5.3 Butterfly Match-Line Design for 256x40 and 256x144

As mentioned in Section 3.4, the basic idea of the butterfly match-line scheme is to increase the parallelism of the search operation. Both the number of stages and the number of cells per segment affect the critical delay of the comparison, so match-lines of different widths require different butterfly connections to optimize search performance. For the 144-bit TCAM word, the match-line is folded into four sub match-lines of six stages each, as shown in Fig. 5.6. Each circle denotes a TCAM segment, which contains six TCAM cells and a dynamic circuit. If one of the TCAM segments mismatches the search data, the mismatch signal propagates and turns off more TCAM segments than the conventional PF-CDPD match-line does, and all search operations behind the mismatched segment are terminated. Accordingly, the interlaced connection of Fig. 5.6 increases the dependence among the four parallel sub match-lines of the 144-bit match-line and thereby reduces power consumption.



Fig. 5.6 Butterfly match-line scheme for 144-bit TCAM cells.

For the 40-bit TCAM word, there are two ways to realize the butterfly match-line scheme, shown in Fig. 5.7 (a) and Fig. 5.7 (b). Each TCAM segment of the 40-bit match-line contains five TCAM cells and a dynamic circuit.

The match-line in Fig. 5.7 (a) uses four parallel segments in each stage and merges the segment outputs in a four-input NOR gate to generate the final matching result. Hence, the critical delay of the two-stage match-line is $(2T_{seg} + 2T_{NOR4})$. The match-line in Fig. 5.7 (b), on the other hand, adopts a three-stage butterfly connection. Since all match-lines in the same bank enter the search operation simultaneously, the degree of parallelism of the first stage is two instead of three, which reduces the load on the match-line pre-charge signal. Although the three-stage butterfly match-line has one more $T_{seg}$ delay than the two-stage match-line, the delays of NOR2 and NOR3 are both shorter than that of NOR4. Thus, the difference in search delay between the two connections is slight.



Fig. 5.7 (a) Two-stage (b) Three-stage butterfly match-line scheme for 40-bit TCAM cells.
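The delay comparison between the two connections can be written out explicitly. The first line is the expression given in the text. The second line is an assumption made here for illustration: it supposes that the eight five-cell segments are arranged as 2, 3 and 3 segments per stage and that each stage is followed by one NOR gate of matching fan-in.

```latex
\begin{aligned}
T_{\text{two-stage}}   &= 2T_{seg} + 2T_{NOR4} \\
T_{\text{three-stage}} &\approx 3T_{seg} + T_{NOR2} + 2T_{NOR3} \qquad \text{(assumed gate arrangement)} \\
T_{\text{three-stage}} - T_{\text{two-stage}} &\approx T_{seg} - \bigl(2T_{NOR4} - T_{NOR2} - 2T_{NOR3}\bigr)
\end{aligned}
```

Under this assumption, the extra $T_{seg}$ of the three-stage connection is partly offset by the smaller NOR delays, which is consistent with the statement that the difference in search delay is slight.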

However, the NOR gates of the butterfly connection have to collect the mismatch information of the previous stage. The four-input NOR gate used in the two-stage match-line needs a large driving capability to trigger the four segments of the subsequent stage, especially in a low-power CMOS process; otherwise, the slow output slew of the NOR4 degrades search performance. Hence, the power and area overheads of a NOR gate with four fan-ins and four fan-outs are larger than those of NOR2 and NOR3. Moreover, if a TCAM segment in the first stage mismatches, the three-stage match-line turns off two more segments than the two-stage match-line does. Therefore, we adopt the three-stage butterfly match-line scheme in the 256x40 TCAM design, trading a small increase in search delay for more power saving.
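The segment counts behind this trade-off can be checked with a short sketch. It assumes that a mismatch in one stage disables every segment of all later stages (as the merged NOR outputs of the 40-bit connections suggest) and that the three-stage connection uses 2, 3 and 3 segments per stage; that arrangement is inferred from the eight five-cell segments and the first-stage parallelism of two, not stated explicitly in the text. The function names are illustrative.

```python
from typing import List


def total_bits(stage_widths: List[int], cells_per_segment: int) -> int:
    """Word width covered by a match-line with the given segments per stage."""
    return sum(stage_widths) * cells_per_segment


def segments_turned_off(stage_widths: List[int], mismatch_stage: int) -> int:
    """Segments that never evaluate when a mismatch occurs in `mismatch_stage`.

    Assumes a mismatch disables all segments of every later stage.
    """
    return sum(stage_widths[mismatch_stage + 1:])


if __name__ == "__main__":
    two_stage = [4, 4]       # Fig. 5.7 (a): four parallel segments per stage
    three_stage = [2, 3, 3]  # Fig. 5.7 (b): assumed arrangement

    print(total_bits(two_stage, 5), total_bits(three_stage, 5))
    print(segments_turned_off(two_stage, 0), segments_turned_off(three_stage, 0))

    # 144-bit word: four sub match-lines of six stages, six cells per segment
    print(total_bits([4] * 6, 6))
```

With these assumptions, a first-stage mismatch leaves four segments idle in the two-stage connection and six in the three-stage connection, which reproduces the "two more segments" stated above.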

# 5.4 Design Implementation in UMC 40nm LP CMOS Process



Fig. 5.8 (a) Typical TCAM cell. (b) TCAM cell with shared BL/DL.

A binary CAM cell stores either a logic "0" or a logic "1". Unlike a binary CAM, a ternary CAM has three possible states: logic "0", logic "1" and don't-care X. To store a ternary value, a TCAM cell contains two bits of storage memory and a 1-bit comparison circuit. Fig. 5.8 (a) shows a typical AND-type TCAM cell. To write the stored datum and the don't-care datum at the same time, the typical TCAM cell has, in addition to the bit-line and search-line pairs, a pair of complementary don't-care lines that transfer the don't-care data. Hence, each cell has three pairs of vertical lines.

To reduce the cell area and save an additional metal layer for high-density design, the bit-line and the don't-care line are combined in our TCAM cell, as shown in Fig. 5.8 (b). At the same time, the word-line formerly shared by the two memory cells is split into WL\_c and WL\_d for the storage cell and the don't-care cell, respectively. Other TCAM cell designs often merge the don't-care line with the search-line, which worsens the propagation delay of the search data because of the increased capacitance. Although sharing increases the capacitance of the bit-lines, the search operation, rather than the read/write operations, is the main task of a TCAM. In addition, the numbers of vertical control circuits and input registers are reduced. Hence, the shared BL/DL saves area not only in the TCAM cell but also in the global control circuits.





Fig. 5.9 Coupling capacitance.

Moore's law continues to drive technology scaling to deliver increased density and integration in CMOS technology. Interconnect noise grows with scaling because of coupling capacitance and other factors. A coupling capacitance between two conductors, as shown in Fig. 5.9, introduces noise that degrades signal integrity: it induces a spurious pulse on a neighboring wire that holds a static value, or delays the transition of a neighboring wire that is switching. Crosstalk is determined not only by the mutual capacitance itself but also by the ratio of the mutual capacitance to the sum of the self capacitance (to ground) and the mutual capacitance. As technology scales, the spacing between conductors decreases, which increases crosstalk and other sources of interconnect noise as the wires become more compact and closer to one another [5.5]. This high-density TCAM design involves long interconnections and a large number of vertical lines, both of which can increase crosstalk. Crosstalk is a major source of timing uncertainty in circuits, and it is more prevalent than process variation.
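The capacitance ratio mentioned above can be made concrete with the usual first-order capacitive-divider estimate. The symbols are assumptions introduced here: $C_m$ is the mutual (coupling) capacitance, $C_g$ is the victim wire's self capacitance to ground, and the victim is taken to be floating (undriven) while the aggressor switches by $\Delta V_{agg}$.

```latex
\Delta V_{victim} \approx \frac{C_m}{C_m + C_g}\,\Delta V_{agg}
```

As the wire spacing shrinks, $C_m$ grows relative to $C_g$, so the induced noise approaches the full aggressor swing. A driven victim sees a smaller, transient disturbance that also depends on its driver strength.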



Fig. 5.10 (a) Coupling effect of conventional vertical lines. (b) Interleaving vertical lines.

Fig. 5.10 (a) illustrates the coupling effect in a conventional TCAM design. The cells in the same column have at least four long vertical wires: one bit-line pair and one search-line pair. The bit-line pair (BL and BLB) is implemented in one metal layer, and the search-line pair (SL and SLB) shares another metal layer. As Fig. 5.10 (a) shows, each SL runs close to the SLB of the neighboring column, which induces coupling. Because of this coupling capacitance, switching of the search-line pairs causes functional degradation and extra power consumption.

To decrease the coupling effect at limited process cost, the technique of interleaving vertical lines is presented, as shown in Fig. 5.10 (b). Instead of staying in the same metal layer as SLB, the SL exchanges metal layers with the BL, so that SL shares a metal layer with BLB. This reduces the capacitance of the search-line pairs without area overhead and mitigates the interconnect noise, because the distance between neighboring wires increases.


# 5.4.3 Cell Layout

Fig. 5.11 shows the layout of a 1-bit TCAM cell. A TCAM cell is composed of two SRAM cells and a comparison circuit. Because of the power gating used by the data-aware power control, each TCAM cell needs two additional metal layers to route the extra virtual power sources for the storage cell and the don't-care cell, respectively. To reduce power dissipation, the ripple bit-line and ripple search-line schemes propagate the write data and search data through local line pairs, bank by bank, without global bit-lines or global search-lines. Moreover, the shared BL/DL technique needs no don't-care line pair, since the write operation is completed in two cycles. Therefore, three metal layers are saved in the proposed TCAM cell layout. As the figure shows, the word-lines and the match-line run along the horizontal axis in M1 and M2. Two vertical metal layers, M3 and M4, are reserved for the interleaved bit-line pair and search-line pair. M4 also carries the power lines of the storage cell, and the other power lines are routed in M5. The overall size of the proposed TCAM cell is $1.77 \times 1.365 \ \mu m^2$, which is one-third of the 65nm TCAM cell presented in [5.6]. Consequently, the energy-efficient schemes provide not only power reduction but also area efficiency.



Fig. 5.11 Layout view of 1-bit TCAM cell.

# **5.5 Simulation Results and Analysis**

# 5.5.1 Simulation Results of 256x144 TCAM Macro

Unlike the hierarchical search-line, the don't-care-based ripple search-line saves power by eliminating the global search-line. Even though the search-line is divided into 16 local search-lines, the search data only need to pass through four segments because of the placement of the search input latches, as shown in Fig. 5.5. Fig. 5.12 shows the timing diagram of the search operation. The search-lines are activated when the clock is high, and the delay path runs through the D flip-flop and four ripple search-line buffers. For a clock with a 50% duty cycle, the delay of four ripple buffers is clearly shorter than half a clock cycle, so the critical path depends on the match-lines. With the butterfly connection, the comparison delay of the 144-bit match-line runs through six stages and six NOR gates. Before the negative edge of the clock, the match-lines are pre-charged high and the search data have been completely transferred to the local search-lines. Therefore, the search time, which depends on the buffer delay (ml\_pre\_buf delay) and the search delay of the match-lines, dominates the clock period.



Fig. 5.12 Timing analysis of search operation.

A 256x144 TCAM macro has been implemented in 40nm LP CMOS technology with the circuit techniques presented above. To verify the proposed architecture, Table 5.4 lists the simulated search time at 1.0V under different process corners and temperatures. The simulated maximum clock frequency reaches 400MHz, and the energy metric is 0.526 fJ/bit/search. These features are summarized in Table 5.5.

| Search Delay (ps) | -40 °C   | 25 °C    | 125 °C   |
|-------------------|----------|----------|----------|
| SS                | 1654.776 | 1524.562 | 1328.667 |
| SNFP              | 1392.382 |          | 1107.660 |
| TT                | 1144.217 | 1099.913 | 994.3913 |
| FNSP              | 1035.800 | 1011.407 | 925.0000 |
| FF                | 789.2581 | 787.5862 | 750.1724 |

Table 5.4 Pre-simulation result of 256x144 TCAM macro.

Table 5.5 Summary of the 256x144 TCAM macro



# 5.5.2 Simulation Results of 256x40 TCAM Macro



Fig. 5.13 Layout view of a TCAM segment with 5-bit TCAM cells



Fig. 5.14 A 256x40-bit layout of the proposed energy-efficient TCAM.

Fig. 5.13 shows the layout of a TCAM segment that consists of five TCAM cells, an XOR-based conditional keeper and a local match-line circuit. The keeper is placed at the right-hand side of the TCAM segment, and the AND-type match-line circuit at the left-hand side. The size of one TCAM segment is $1.985 \times 7.74\ \mu m^2$. Because of the longer length of the weak keeper, the segment is slightly taller than a TCAM cell. Fig. 5.14 shows the layout block diagram of the 256x40 TCAM macro. The TCAM array is divided into 16 banks for both performance and power efficiency. The sub-banks of 16 rows x 40 columns are serially connected by ripple SL and ripple BL buffers. The total area of the TCAM macro is $460.475 \times 159.81\ \mu m^2$. The data presented in the following discussion are obtained from post-layout simulation.
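The area figures quoted in this chapter can be cross-checked with a few lines of arithmetic. The sketch below uses only the dimensions given in the text (cell size from Section 5.4.3, macro size from this section); the "cell fraction" it reports is a rough figure of merit introduced here for illustration, not a number reported in the thesis.

```python
# Dimensions quoted in the text, in micrometres
CELL_DIMS_UM = (1.77, 1.365)      # 1-bit TCAM cell
MACRO_DIMS_UM = (460.475, 159.81)  # 256x40 TCAM macro
ROWS, BITS = 256, 40


def area_um2(dims):
    """Area of a rectangle given as (side_a, side_b) in micrometres."""
    return dims[0] * dims[1]


def cell_fraction(cell_dims, macro_dims, rows, bits):
    """Share of the macro area taken by the bare TCAM cells alone."""
    return rows * bits * area_um2(cell_dims) / area_um2(macro_dims)


if __name__ == "__main__":
    macro_mm2 = area_um2(MACRO_DIMS_UM) / 1e6  # 1 mm^2 = 1e6 um^2
    print(f"cell area  : {area_um2(CELL_DIMS_UM):.3f} um^2")
    print(f"macro area : {macro_mm2:.4f} mm^2")
    print(f"cell share : {cell_fraction(CELL_DIMS_UM, MACRO_DIMS_UM, ROWS, BITS):.1%}")
```

The remaining area is occupied by the keepers, local match-line circuits, ripple buffers and global control circuitry described above.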

The 256x40 TCAM array is implemented in 40nm LP CMOS technology and operates at 400MHz, a frequency limited by the match-line. Table 5.6 lists the post-layout simulation results of the search time at 1.0V under five process corners and three temperatures.

| Search Delay (ps) | -40 °C   | 25 °C    | 125 °C   |
|-------------------|----------|----------|----------|
| SS                | 1686.549 | 1511.961 | 1265.679 |
| SNFP              | 1175.913 | 1066.847 | 899.666  |
| TT                | 997.18   | 948.77   | 844.189  |
| FNSP              | 872.914  | 857.724  | 769.0888 |
| FF                | 592.2322 | 600.1365 | 571.1151 |

Table 5.6 Post simulation result of 256x40 TCAM macro.
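As a quick sanity check on the 400MHz figure, the slowest corner of Table 5.6 can be compared with the clock period. The sketch below is only illustrative: comparing the search delay against the full clock period is a simplifying assumption made here, since the real timing budget also includes the pre-charge phase and the buffer delay discussed in Section 5.5.1.

```python
# Search delay in picoseconds, copied from Table 5.6 (corner -> temperature in degC -> delay)
SEARCH_DELAY_PS = {
    "SS":   {-40: 1686.549, 25: 1511.961, 125: 1265.679},
    "SNFP": {-40: 1175.913, 25: 1066.847, 125: 899.666},
    "TT":   {-40: 997.18,   25: 948.77,   125: 844.189},
    "FNSP": {-40: 872.914,  25: 857.724,  125: 769.0888},
    "FF":   {-40: 592.2322, 25: 600.1365, 125: 571.1151},
}


def worst_case(table):
    """Return (delay_ps, corner, temperature_c) of the slowest entry."""
    return max(
        (delay, corner, temp)
        for corner, row in table.items()
        for temp, delay in row.items()
    )


def clock_period_ps(freq_mhz: float) -> float:
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1e6 / freq_mhz


if __name__ == "__main__":
    delay, corner, temp = worst_case(SEARCH_DELAY_PS)
    period = clock_period_ps(400.0)
    print(f"worst case: {delay} ps at {corner}, {temp} degC")
    print(f"period    : {period} ps, margin {period - delay:.3f} ps")
```

The slowest case is the SS corner at -40 °C, which is the expected worst case for a design whose delay falls with rising temperature at this supply voltage.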

Table 5.7 summarizes the performance of the proposed energy-efficient TCAM macro and of several recently published low-power designs. The energy metric of this work is about three times that of the previous butterfly TCAM design, for three reasons. First, the TCAM design in [5.6] does not perform the read operation, and the power of its write buffers is excluded from the simulation. Second, since the conventional (N-type) comparison circuit causes functional errors in advanced technologies, especially in low-power processes, a P-type comparison circuit is adopted; it has a larger leakage current, which degrades performance and adds power consumption during the evaluation phase. Third, the proposed TCAM design is a commercial IP macro. Therefore, it is reasonable that, compared with related low-power TCAM designs, the proposed energy-efficient TCAM macro has the lowest energy per bit per search except for [5.6]. The search speed of the proposed TCAM is also competitive.

|                               | PF-CDPD [5.7] (JSSC 2006) | Tree-style [5.8] (JSSC 2008) | Range Match [5.9] (TCASI 2009) | Low Swing [5.10] (TCASI 2011) | Butterfly [5.6] (JSSC 2011) | This Work    |
|-------------------------------|---------------------------|------------------------------|--------------------------------|-------------------------------|-----------------------------|--------------|
| Configuration                 | 256x128                   | 256x128                      | 256x40                         | 128x144                       | 256x144                     | 256x40       |
| Technology                    | 0.18 µm                   | 0.18 µm                      | 0.13 µm                        | 0.18 µm                       | 65 nm                       | 40 nm        |
| Core area (mm²)               | 1.21 x 0.56               | 0.84 x 0.92                  | 0.395 x 0.915                  | 0.53 x 1.85                   | 1.01 x 0.43                 | 0.156 x 0.46 |
| Supply voltage (V)            | 1.8                       | 1.8                          | 1.2                            | 1.8                           | 1.0                         | 1.0          |
| Search time / frequency       | 2.10 ns                   | 1.56 ns                      | 1.99 ns                        | 217 MHz                       | 400 MHz                     | 400 MHz      |
| Energy metric (fJ/bit/search) | 2.33                      | 1.42                         | 1.20                           | 2.82                          | 0.165                       | 0.461        |

Table 5.7 Features summary and comparisons.

# **5.6 Summary**

The number of bits in a TCAM word is usually large, with existing implementations ranging from 36 to 144 bits. Thus, both 256x40 and 256x144 energy-efficient TCAM macros are presented in this chapter. To optimize search performance, different butterfly connections for the 40-bit and 144-bit match-lines are discussed. By utilizing the shared BL/DL, ripple BL and ripple SL schemes, additional metal layers are saved, which enhances area efficiency. In ultra-high-density circuit design, however, the spacing between interconnections keeps decreasing, and the resulting increase in coupling capacitance affects both search speed and power. Hence, the metal layers of the vertical lines are interleaved to decrease the coupling effect without area penalty.

The layout of the proposed TCAM macro is implemented in 40nm LP technology. The size of the TCAM cell is $1.77 \times 1.365 \ \mu\text{m}^2$, and the overall area of the 256x40 TCAM macro is $460.475 \times 159.81 \ \mu\text{m}^2$. The simulation results show that the maximum clock frequency reaches 400MHz under a 1.0V power supply, and that the energy metrics of the 256x40 and 256x144 macros are 0.461 fJ/bit/search and 0.526 fJ/bit/search, respectively.
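To relate the per-bit energy metrics to whole-macro numbers, the short sketch below scales them by the array size, taking the metrics as per-bit, per-search figures as in Table 5.7. Two assumptions are made here and are not stated in the thesis: the metric is taken as total search energy divided by the number of bits in the array, and the power estimate assumes one search in every clock cycle.

```python
def energy_per_search_pj(rows: int, bits: int, fj_per_bit_per_search: float) -> float:
    """Energy of one search over the whole array, in picojoules."""
    return rows * bits * fj_per_bit_per_search / 1000.0


def search_power_mw(energy_pj: float, freq_mhz: float) -> float:
    """Average power in milliwatts, assuming one search per clock cycle.

    pJ x MHz gives microwatts, so divide by 1000 to obtain milliwatts.
    """
    return energy_pj * freq_mhz / 1000.0


if __name__ == "__main__":
    for rows, bits, metric in ((256, 40, 0.461), (256, 144, 0.526)):
        energy = energy_per_search_pj(rows, bits, metric)
        power = search_power_mw(energy, 400.0)
        print(f"{rows}x{bits}: {energy:.3f} pJ/search, {power:.3f} mW at 400 MHz")
```

Under these assumptions the 256x40 macro spends roughly 4.7 pJ per search, and the 256x144 macro roughly 19.4 pJ.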



# Chapter 6 Conclusions and Future Works

## **6.1 Conclusions**

Content addressable memory (CAM) is widely utilized to execute lookup-table functions for routing packets in interconnection networks. Furthermore, ternary content addressable memory (TCAM) is extensively adopted in network systems. Network routers forward data packets from an incoming port to an outgoing port, using an address-lookup function. The number of bits in a TCAM word is generally large, and existing implementations range from 36 to 144 bits. A typical TCAM utilizes a table size that has between a few hundred entries and 32K entries, corresponding to an address space that ranges from 7 bits to 15 bits. As routing tables become larger, energy consumption and leakage current become increasingly important issues in the design of TCAM in nano-scale technologies.
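The relation between table size and address width quoted above is simply the number of bits needed to index every entry. A minimal sketch (the function name `address_bits` is illustrative):

```python
def address_bits(entries: int) -> int:
    """Number of address bits needed to index `entries` locations."""
    if entries < 1:
        raise ValueError("entries must be positive")
    # (entries - 1).bit_length() equals ceil(log2(entries)) without floating point
    return (entries - 1).bit_length()


if __name__ == "__main__":
    for entries in (128, 256, 32 * 1024):
        print(entries, address_bits(entries))
```

A table of 128 entries needs 7 address bits and one of 32K entries needs 15, matching the range given above; the 256-entry macros of this thesis use the 8-bit Addr [7:0] input.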

This thesis presents an energy-efficient TCAM design approach that exploits the co-design of architecture and circuit. To achieve a low-power, high-performance TCAM architecture, the butterfly match-line with AND gates is designed to reduce the wire loading on the evaluation nodes and to ensure that the capacitance of the evaluation nodes is the same in all segments, which makes the keeper sizing less sensitive to variation. Additionally, a 16T AND-type TCAM cell with p-type comparison circuits is utilized to increase the $I_{on}/I_{off}$ ratio of the dynamic circuitry, given the small drain current in the 40nm LP CMOS process. The increased $I_{on}/I_{off}$ ratio widens the design margin of the keeper size and increases the tolerance to the contention between charge sharing and evaluation.

To further reduce the energy consumption in nano-scale technologies, column-based low-power design techniques are employed: the ripple bit-line scheme, the don't-care-based ripple search-line scheme and the column-based data-aware power control. The ripple bit-line and ripple search-line schemes achieve both power reduction and performance improvement by dividing the long search-lines and bit-lines, so that both the switching activity and the wire capacitance of the search-lines and bit-lines are decreased. The switching activity of the ripple search-lines is further reduced by the continuous don't-care pattern. Moreover, the column-based data-aware power control uses power gating devices to reduce leakage power and to improve the write-ability and the static noise margin (SNM) for read, search and data-retention operations. Based on the don't-care bits and input data, the power control dynamically adjusts the voltages of the left and right half-cells of both the don't-care cells and the storage cells. In addition, replica circuitry makes the timing of the power switching tolerant to PVT (process, voltage, temperature) variation and $V_t$ scatter. The energy-efficient 256x40 and 256x144 TCAM macros are implemented in UMC 40nm LP CMOS technology, and the experimental results demonstrate a leakage power reduction of 28.9%, a search-line power reduction of 31.74% and an energy metric of 0.461 fJ/bit/search for the TCAM macro.

## **6.2 Future Work**

As nano-scale technologies advance below 40nm CMOS, circuit design will be influenced by high-k metal gates. Moreover, PVT variations increase rapidly and further degrade the functionality of our designs. For the dynamic circuitry, PVT variation aggravates the contention between charge sharing and evaluation and shrinks the design margin of the keeper size. Therefore, a PVT-tolerant circuit design needs to be considered to provide a reliable TCAM macro. First of all, both the length and the width of the keeper should be increased, and a PVT monitor is essential to track the current ratio of the keeper and the comparison circuits at run-time.



Fig. 6.1 The concept of the Internet of Things (IoT).

The Internet of Things (IoT) refers to uniquely identifiable objects (things) and their virtual representations in an Internet-like structure, as shown in Fig. 6.1. The concept of the IoT first became popular through the Auto-ID Center and related market analysts' publications. The IoT has attracted much attention recently and paints an appealing picture of future life. A typical IoT architecture can be divided into three domains: the sensing domain, the network domain and the application domain. In the network domain, TCAM macros will be widely adopted in the IoT gateways for data transmission between the sensing domain and the application domain. One of the critical requirements of IoT gateways is extremely low power consumption. Therefore, reducing the operating voltage of TCAM macros to the sub-threshold or near-threshold region is the most efficient design technique for meeting the power budget of IoT gateways. However, reducing the voltage of TCAM macros significantly increases the design challenge of the dynamic circuitry and decreases the tolerance to PVT variation. The search operation may therefore have to be realized with other logic families, at the cost of a large area overhead.



## **Bibliography**

- [1.1] M. Meribout, T. Ogura, and M. Nakanishi, "On using the CAM concept for parametric curve extraction," *IEEE Transactions on Image Processing*, Vol. 9, No. 12, pp. 2126-2130, Dec. 2000.
- [1.2] M. Nakanishi and T. Ogura, "Real-time CAM-based Hough transform and its performance evaluation," *Machine Vision Application*, Vol. 12, No. 2, pp. 59-68, Aug. 2000.
- [1.3] D.J. Craft, "A fast hardware data compression algorithm and some algorithmic extensions," *IBM Journal of Research Development*, Vol. 42, No. 6, pp. 733-745, Nov. 1998.
- [1.4] L.-Y. Liu, J.-F. Wang, R.-J. Wang, and J.-Y. Lee, "CAM-based VLSI architectures for dynamic Huffman coding," *IEEE Transactions on Consumer Electronics*, Vol. 40, No. 3, pp. 282-289, Aug. 1994.
- [1.5] S. Choi, S.-J. Song, K. Sohn, H. Kim, J. Kim, N. Cho, J.-H. Woo, J. Yoo and H.-J. Yoo, "A 24.2-µW Dual-Mode Human Body Communication Controller for Body Sensor Network," *Proceeding of IEEE European Solid-State Circuits Conference*, pp. 227-230, 2006.
- [1.6] S. Choi, K. Sohn, J. Kim, J. Yoo and H.-J. Yoo, "A TCAM-based Periodic Event Generator for Multi-Node Management in the Body Sensor Network," *Proceeding of IEEE Asian Solid-State Circuits Conference*, pp. 307-310, 2006.
- [1.7] C.-C. Wang, C.-J. Cheng, T.-F. Chen, and J.-S. Wang, "An Adaptively Dividable Dual-Port BiTCAM for Virus-Detection Processors in Mobile Devices," *IEEE Journal of Solid-State Circuits*, Vol. 44, No. 5, pp. 1571-1581, Jan. 2009.
- [1.8] F. Pong, N.-F. Tzeng, "Concise Lookup Tables for IPv4 and IPv6 Longest Prefix Matching in Scalable Routers," *IEEE/ACM Transactions on Networking*, Vol.20, No.3, pp.729-741, June 2012.
- [1.9] G. Wood, "IPv6: Making Room for the World on the Future Internet," *IEEE Internet Computing*, Vol.15, No.4, pp.88-89, July-Aug. 2011.

- [1.10] J.-W. Hui, D.-E. Culler, "IPv6 in Low-Power Wireless Networks," *Proceedings of the IEEE*, Vol.98, No.11, pp.1865-1878, Nov. 2010.
- [1.11] S. Park, J. Jeong, C. Hong, "DNS Configuration in IPv6: Approaches, Analysis and Deployment Scenarios," *IEEE Internet Computing*, No.99.
- [1.12] L.T. Clark, V. Chaudhary, "Fast low power translation lookaside buffers using hierarchical NAND match lines," *Proceedings of IEEE International Symposium on Circuits and Systems*, pp.3493–3496, June 2 2010.
- [1.13] A. Agarwal, S. Hsu, S. Mathew, M. Anders, H. Kaul, F. Sheikh, R. Krishnamurthy, "A 128×128b high-speed wide-and match-line content addressable memory in 32nm CMOS," Proceedings of the ESSCIRC, pp.12–16, Sept. 2011.
- [1.14] C.-C. Wang, J.-S. Wang, C.-W. Yeh, "High-Speed and Low-Power Design Techniques for TCAM Macros," *IEEE Journal of Solid-State Circuits*, Vol.43, No.2, pp.530–540, Feb. 2008.
- [1.15] O. Tyshchenko, A. Sheikholeslami, "Match Sensing Using Match-Line Stability in Content-Addressable Memories (CAM)," *IEEE Journal of Solid-State Circuits*, Vol.43, No.9, pp.1972–1981, Sep. 2008.
- [1.16] P.-T. Huang, W. Hwang, "A 65 nm 0.165 fJ/Bit/Search 256x144 TCAM Macro Design for IPv6 Lookup Tables," IEEE Journal of Solid-State Circuits, Vol.46, No.2, pp.507-519, Feb. 2011.

- [2.1] K. Pagiamtzis, A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: a tutorial and survey," *IEEE Journal of Solid-State Circuits*, Vol.41, No.3, pp. 712- 727, March 2006.
- [2.2] P.F. Lin and J.B. Kuo, "1-V 128-kb Four-Way Set-Associative CMOS Cache Memory Using Wordline-Oriented Tag-Compare (WLOTC) Structure with the Content Addressable-Memory (CAM) 10-transistor Tag Cell," *IEEE Journal of Solid-State Circuits*, Vol. 36, No. 4, pp.666-675, April 2001.
- [2.3] H. Miyatake and M. Tanaka and Y. Mori, "A Design for High-Speed Low-Power CMOS Fully Parallel Content-Addressable Memory Macros," *IEEE Journal of Solid-State Circuits*, Vol. 36, No. 6, pp.956-968, June 2001.

- [2.4] K. Eshraghian, K.-R. Cho, O. Kavehei, S.-K. Kang, D. Abbott, S.-M. Steve Kang, "Memristor MOS Content Addressable Memory (MCAM): Hybrid Architecture for Future High Performance Search Engines," *IEEE Transactions on Very Large Scale Integration Systems*, Vol.19, No.8, pp.1407-1417, Aug. 2011.
- [2.5] B. Agrawal, T. Sherwood, "Ternary CAM Power and Delay Model: Extensions and Uses," *IEEE Transactions on Very Large Scale Integration Systems*, Vol.16, No.5, pp.554-564, May 2008.
- [2.6] Y.-J. Hu, J.-F. Li, Y.-J. Huang, "3-D Content Addressable Memory Architectures," IEEE International Workshop on Memory Technology, Design, and Testing, pp.59-64, Sept. 2009.
- [2.7] S. Pontarelli, M. Ottavi, A. Salsano, "Error Detection and Correction in Content Addressable Memories," *IEEE 25th International Symposium on Defect and Fault Tolerance in VLSI Systems*, pp.420-428, Oct. 2010.
- [2.8] F. Pong, N.-F. Tzeng, "Concise Lookup Tables for IPv4 and IPv6 Longest Prefix Matching in Scalable Routers," *IEEE/ACM Transactions on Networking*, Vol.20, No.3, pp.729-741, June 2012.
- [2.9] A.J. McAuley, P. Francis, "Fast routing table lookup using CAMs," Proceedings of the IEEE INFOCOM, Vol.3, pp.1382-1391, 1993.
- [2.10] S.K. Maurya, L.T. Clark, "A Dynamic Longest Prefix Matching Content Addressable Memory for IP Routing," *IEEE Transactions on Very Large Scale Integration Systems*, Vol.19, No.6, pp.963-972, June 2011.
- [2.11] S. Park, J. Jeong, C. Hong, "DNS Configuration in IPv6: Approaches, Analysis and Deployment Scenarios," *IEEE Internet Computing*, No.99.
- [2.12] P.-K. Chen, C.-W. Lu, Q. Wu, "IPv6 Rapid Deployment in Taiwan Academic Network (TANet)," International Conference on Advanced Communication Technology, pp.694-697, Feb. 2012.
- [2.13] D. A. Patterson, J. L. Hennessy, "Computer Organization and Design," Morgan Kaufmann, 2<sup>nd</sup> edition.
- [2.14] J. L. Hennessy and D. A. Patterson, "Computer Architecture," *Morgan Kaufmann*, 3<sup>rd</sup> edition.
- [2.15] <u>http://www.commsdesign.com/main/1999/11/9911feat3.htm</u>: Using Content Addressable Memory for Networking Applications.

- [2.16] <u>http://www.eecg.toronto.edu/~pagiamt/</u>: Content-Addressable Memory (CAM).
- [2.17] H. Che, Z. Wang, K. Zheng and B. Liu, "DRES: Dynamic Range Encoding Scheme for TCAM Coprocessors," *IEEE Transactions on Computers*, Vol. 57, No. 7, pp. 902-915, Jul. 2008.
- [2.18] J.-W. Hui, D.-E. Culler, "IPv6 in Low-Power Wireless Networks," *Proceedings of the IEEE*, Vol.98, No.11, pp.1865-1878, Nov. 2010.
- [2.19] Y. Sun, H. Liu, M.-S. Kim, "Using TCAM efficiently for IP route lookup," *IEEE Consumer Communications and Networking Conference*, pp.816-817, Jan. 2011.
- [2.20] G. Wood, "IPv6: Making Room for the World on the Future Internet," *IEEE Internet Computing*, Vol.15, No.4, pp.88-89, July-Aug. 2011.
- [2.21] I. Arsovski, A. Sheikholeslami, "A Mismatch-Dependent Power Allocation Technique for Match-Line Sensing in Content-Addressable Memories," IEEE Journal of Solid-State Circuits, Vol. 38, No. 11, Nov. 2003.
- [2.22] F. Alibart, T. Sherwood, D.-B. Strukov, "Hybrid CMOS/nanodevice circuits for high throughput pattern matching applications," NASA/ESA Conference on Adaptive Hardware and Systems, 2011, pp.279-286, June 2011.
- [2.23] J.G. Delgado-Frias, A. Yu, J. Nyathi, "A Dynamic Content Addressable Memory Using a 4-transistor Cell," International Workshop on Design of Mixed-Mode Integrated Circuits and Applications, pp.110-113, July 1999.
- [2.24] H. Noda, K. Inoue, M. Kuroiwa, F. Igaue, K. Yamamoto, H.-J. Mattausch, T. Koide, A. Amo, A. Hachisuka, S. Soeda, I. Hayashi, F. Morishita, K. Dosaka, K. Arimoto, K. Fujishima, K. Anami, T. Yoshihara, "A Cost-Efficient High-Performance Dynamic TCAM with Pipelined Hierarchical Searching and Shift Redundancy Architecture," *IEEE Journal of Solid-State Circuits*, Vol. 40, No. 1, pp. 245-253, Jan. 2005.
- [2.25] I. Arsovski, T. Chandler and A. Sheikholeslami, "A Ternary Content-Addressable Memory (TCAM) Based on 4T Static Storage and Including a Current-Race Sensing Scheme," *IEEE Journal of Solid-State Circuits*, Vol. 38, No. 1, pp.155-158, Jan. 2003.
- [2.26] M. Chae, J.-W. Lee, S.-H. Hong, "Decoupled 4T dynamic CAM suitable for high density storage," *Electronics Letters*, Vol.47, No.7, pp.434-436, March 2011.

- [2.27] V. Nagarjuna, H.-M. Kittur, "Low power, low area and high Performance Hybrid Type DYNAMIC CAM design," International Conference on Signal Processing, Communication, Computing and Networking Technologies, pp.430-435, July 2011.
- [2.28] S. Choi, K. Sohn, H.-J. Yoo, "A 0.7-fJ/bit/search 2.2-ns search time hybrid-type TCAM architecture," *IEEE Journal of Solid-State Circuits*, Vol.40, No.1, pp. 254-260, Jan. 2005.
- [2.29] K.-H. Cheng, C.-H. Wei, S.-Y. Jiang, "Static divided word matching line for low-power Content Addressable Memory design," *Proceedings of International Symposium on Circuits and Systems*, pp. 629-632, May 2004.
- [2.30] Y.-J. Chang, Y.-H. Liao, "Hybrid-Type CAM Design for Both Power and Performance Efficiency," IEEE Transactions on Very Large Scale Integration Systems, Vol.16, No.8, pp.965-974, Aug. 2008.
- [2.31] S.-H. Yang, Y.-J. Huang, J.-F. Li, "A Low-Power Ternary Content Addressable Memory With Pai-Sigma Matchlines," IEEE Transactions on Very Large Scale Integration Systems, No.99.
- [2.32] A.-T. Do, S.-S. Chen, Z.-H. Kong, K-S Yeo, "A low-power CAM with efficient power and delay trade-off," *IEEE International Symposium on Circuits and Systems*, pp.2573-2576, May 2011.
- [2.33] A.-T. Do, S.-S. Chen, Z.-H. Kong, K-S Yeo, "A High Speed Low Power CAM With a Parity Bit and Power-Gated ML Sensing," IEEE Transactions on Very Large Scale Integration Systems, No.99.
- [2.34] K. Pagiamtzis, A. Sheikholeslami, "A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme," *IEEE Journal of Solid-State Circuits*, Vol.39, No.9, pp. 1512-1519, Sept. 2004.
- [2.35] V. Chaudhary, L.-T. Clark, "Low-power high-performance NAND match line content addressable memories," *IEEE Transactions on Very Large Scale Integration Systems*, Vol.14, No.8, pp.895-905, Aug. 2006.
- [2.36] L.-T. Clark, V. Chaudhary, "Fast low power translation lookaside buffers using hierarchical NAND match lines," Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pp.3493-3496, June 2010.

- [2.37] S. Baeg, "Low-Power Ternary Content-Addressable Memory Design Using a Segmented Match Line," IEEE Transactions on Circuits and Systems I: Regular Papers, Vol.55, No.6, pp.1485-1494, July 2008.
- [2.38] I. Arsovski, A. Sheikholeslami, "A current-saving match-line sensing scheme for content-addressable memories," *IEEE International Solid-State Circuits Conference (ISSCC)*, Vol.1, pp.304-494, 2003.
- [2.39] G. Kasai, Y. Takarabe, K. Furumi, M. Yoneda, "200 MHz/200 MSPS 3.2 W at 1.5 V Vdd, 9.4 Mbits ternary CAM with new charge injection match detect circuits and bank selection scheme," Proc. IEEE Custom Integrated Circuits Conf., pp. 387-390, 2003.
- [2.40] M. Sultan, M. Siddiqui, Sonika, G.-S. Visweswaran, "A low-power ternary content addressable memory (TCAM) with segmented and non-segmented matchlines," *Conference TENCON*, pp.1-5, Nov. 2008.
- [2.41] N. Mohan, W. Fung, D. Wright, M. Sachdev, "Match Line Sense Amplifiers with Positive Feedback for Low-Power Content Addressable Memories," *IEEE Conference on Custom Integrated Circuits*, pp.297-300, Sept. 2006.
- [2.42] N. Mohan, W. Fung, D. Wright, M. Sachdev, "A Low-Power Ternary CAM With Positive-Feedback Match-Line Sense Amplifiers," IEEE Transactions on Circuits and Systems I: Regular Papers, Vol.56, No.3, pp.566-573, March 2009.
- [2.43] S.-I. Ali, M.-S. Islam, "A high-speed and low-power ternary CAM design using match-line segmentation and feedback in sense amplifiers," *International Conference on Computer and Information Technology*, pp.221-226, Dec. 2010.
- [2.44] M.-M. Hasan, A.-B.-M.-H. Rashid, M.-M. Hussain, "A novel match-line selective charging scheme for high-speed, low-power and noise-tolerant content-addressable memory," *International Conference on Intelligent and Advanced Systems*, pp.1-4, June 2010.
- [2.45] A. Agarwal, S. Hsu, S. Mathew, M. Anders, H. Kaul, F. Sheikh, R. Krishnamurthy, "A 128×128b high-speed wide-and match-line content addressable memory in 32nm CMOS," *Proceedings of the ESSCIRC*, pp.12-16, Sept. 2011.

- [2.46] C.-C. Wang, J.-S. Wang, C.-W. Yeh, "High-Speed and Low-Power Design Techniques for TCAM Macros," *IEEE Journal of Solid-State Circuits*, Vol.43, No.2, pp.530-540, Feb. 2008.
- [2.47] H.-Y. Li, C.-C. Chen, J.-S. Wang, and C. Yeh, "An AND-type match-line scheme for high-performance energy-efficient content addressable memories," *IEEE J. Solid-State Circuits*, Vol. 41, No. 5, pp.1108–1119, May 2006.
- [2.48] K. Pagiamtzis, A. Sheikholeslami, "Pipelined match-lines and hierarchical search-lines for low-power content-addressable memories," *Proceedings* of the IEEE Custom Integrated Circuits Conference, pp. 383-386, Sept. 2003.
- [2.49] H. Noda, K. Inoue, M. Kuroiwa, A. Amo, A. Hachisuka, H.-J. Mattausch, T. Koide, S. Soeda, K. Dosaka, K. Arimoto, "A 143MHz 1.1W 4.5Mb dynamic TCAM with hierarchical searching and shift redundancy architecture," *IEEE International Solid-State Circuits Conference (ISSCC)*, Vol.1, pp. 208-523, Feb. 2004.
- [2.50] C.-C Wang, C.-J. Cheng, T.-F. Chen, J.-S. Wang, "An Adaptively Dividable Dual-Port BiTCAM for Virus-Detection Processors in Mobile Devices," *IEEE Journal of Solid-State Circuits*, Vol.44, No.5, pp.1571-1581, May 2009.
- [2.51] Y.-D. Kim, H.-S. Ahn, S. Kim, D.-K. Jeong, "A High-Speed Range-Matching TCAM for Storage-Efficient Packet Classification," *IEEE Transactions on Circuits and Systems I: Regular Papers*, Vol.56, No.6, pp.1221-1230, June 2009.
- [2.52] B. Gamache, Z. Pfeffer, S.-P. Khatri, "A fast ternary CAM design for IP networking applications," *Proceedings of International Conference on Computer Communications and Networks*, pp. 434-439, Oct. 2003.
- [2.53] B.-D. Yang, L.-S. Kim, "A low-power CAM using pulsed NAND-NOR match-line and charge-recycling search-line driver," *IEEE Journal of Solid-State Circuits*, Vol.40, No.8, pp. 1736-1744, Aug. 2005.
- [2.54] Y.-J. Chang, "A High-Performance and Energy-Efficient TCAM Design for IP-Address Lookup," IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 56, No. 6, pp.479-483, June 2009.
- [2.55] B.-D. Yang, Y.-K. Lee, S.-W. Sung, J.-J. Min, J.-M. Oh, H.-J. Kang, "A Low Power Content Addressable Memory Using Low Swing Search Lines," *IEEE Transactions on Circuits and Systems I: Regular Papers*, No.99.

- [2.56] C.-C. Huang, J.-H. Wu, C.-C. Wang, "A self-disable sense technique with differential NAND cell for content-addressable memories," *IEEE International Conference on Electronics, Circuits and Systems*, pp.590-593, Sept. 2008.
- [2.57] N. Mohan, M. Sachdev, "Low-Leakage Storage Cells for Ternary Content Addressable Memories," *IEEE Transactions on Very Large Scale Integration Systems*, Vol.17, No.5, pp.604-612, May 2009.
- [2.58] Y.-J. Chang, "Using the Dynamic Power Source Technique to Reduce TCAM Leakage Power," IEEE Transactions on Circuits and Systems II: Express Briefs, pp.888-892, Nov. 2010.
- [2.59] S.-P. Yong, J.-F. Li, Y.-J. Huang, "Variability-Tolerant Binary Content Addressable Memory Cells," International Workshop on Memory Technology, Design, and Testing, pp.44-49, Aug.-Sept. 2009.

- [3.1] H.-Y. Li, C.-C. Chen, J.-S. Wang, and C. Yeh, "An AND-type match-line scheme for high-performance energy-efficient content addressable memories," *IEEE J. Solid-State Circuits*, Vol. 41, No. 5, pp.1108–1119, May 2006.
- [3.2] Y.-D. Kim, H.-S. Ahn, S. Kim, D.-K. Jeong, "A High-Speed Range-Matching TCAM for Storage-Efficient Packet Classification," *IEEE Transactions on Circuits and Systems I: Regular Papers*, Vol.56, No.6, pp.1221-1230, June 2009.
- [3.3] C.-C. Wang, J.-S. Wang, C.-W. Yeh, "High-Speed and Low-Power Design Techniques for TCAM Macros," *IEEE Journal of Solid-State Circuits*, Vol.43, No.2, pp.530-540, Feb. 2008.
- [3.4] K.-H. Cheng, C.-H. Wei, S.-Y. Jiang, "Static divided word matching line for low-power Content Addressable Memory design," *Proceedings of International Symposium on Circuits and Systems*, pp. 629-632, May 2004.
- [3.5] K. Pagiamtzis, A. Sheikholeslami, "A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme," *IEEE Journal of Solid-State Circuits*, Vol.39, No.9, pp. 1512-1519, Sept. 2004.
- [3.6] A. Agarwal, S. Hsu, S. Mathew, M. Anders, H. Kaul, F. Sheikh, R. Krishnamurthy, "A 128×128b high-speed wide-and match-line content addressable memory in 32nm CMOS," *Proceedings of the ESSCIRC*, pp.12–16, Sept. 2011.

- [3.7] C.-H. Hua, C.-K. Chen, W. Hwang, "Noise-tolerant XOR-based conditional keeper for high fan-in dynamic circuits," IEEE International Symposium on Circuits and Systems, pp. 444-447, May 2005.
- [3.8] K. L. Shepard, V. Narayanan, and R. Rose, "Harmony: A methodology for noise analysis in deep submicron digital integrated circuits," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, Vol. 18, No. 6, pp. 1132-1150, Aug. 1999.
- [3.9] P. Meher, K.-K. Mahapatra, "Ultra low-power and noise tolerant CMOS dynamic circuit technique," *IEEE Conference on TENCON*, pp.1175-1179, Nov. 2011.
- [3.10] K. Yelamarthi, C.-I.H. Chen, "Timing Optimization and Noise Tolerance for Dynamic CMOS Susceptible to Process Variations," IEEE Transactions on Semiconductor Manufacturing, Vol.25, No.2, pp.255-265, May 2012.

### **Chapter 4**

- [4.1] K. Kushida, A. Suzuki, G. Fukano, A. Kawasumi, O. Hirabayashi, Y. Takeyama, T. Sasaki, A. Katayama, Y. Fujimura, T. Yabe, "A 0.7V single-supply SRAM with 0.495μm<sup>2</sup> cell in 65nm technology utilizing self-write-back sense amplifier and cascaded bit line scheme," *IEEE Symposium on VLSI Circuits*, pp.46-47, June 2008.

- [4.2] Y.-J. Chang, "A High-Performance and Energy-Efficient TCAM Design for IP-Address Lookup," IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 56, No. 6, pp.479-483, June 2009.
- [4.3] B.-D. Yang, Y.-K. Lee, S.-W. Sung, J.-J. Min, J.-M. Oh, H.-J. Kang, "A Low Power Content Addressable Memory Using Low Swing Search Lines," *IEEE Transactions on Circuits and Systems I: Regular Papers*, No.99.
- [4.4] P.-T. Huang, W. Hwang, "A 65 nm 0.165 fJ/Bit/Search 256x144 TCAM Macro Design for IPv6 Lookup Tables," *IEEE Journal of Solid-State Circuits*, Vol.46, No.2, pp.507-519, Feb. 2011.

- [4.5] N. Mohan, M. Sachdev, "Novel Ternary Storage Cells and Techniques for Leakage Reduction in Ternary CAM," in Proc. Int. SOC Conf., pp. 311-314, 2006.
- [4.6] Y.-J. Chang, "Using the Dynamic Power Source Technique to Reduce TCAM Leakage Power," IEEE Transactions on Circuits and Systems II: Express Briefs, pp.888-892, Nov. 2010.
- [4.7] H.-I. Yang, S.-C. Yang, W. Hwang, C.-T. Chuang, "Impacts of NBTI/PBTI on Timing Control Circuits and Degradation Tolerant Design in Nanoscale CMOS SRAM," IEEE Transactions on Circuits and Systems I: Regular Papers, Vol.58, No.6, pp.1239-1251, June 2011.
- [4.8] K. Nii, M. Yabuuchi, Y. Tsukamoto, S. Ohbayashi, S. Imaoka, H. Makino, Y. Yamagami, S. Ishikura, T. Terano, T. Oashi, K. Hashimoto, A. Sebe, S. Okazaki, K. Satomi, H. Akamatsu, H. Shinohara, "A 45-nm Bulk CMOS Embedded SRAM With Improved Immunity Against Process and Temperature Variations," *IEEE Journal of Solid-State Circuits*, Vol.43, No.1, pp.180-191, Jan. 2008.

- [5.1] <u>http://ieeexplore.ieee.org/xpl/ebooks/bookPdfWithBanner.jsp?fileName=5628409.pdf&bkn=5628393</u> T. Rooney, 2011. Internet Protocol Version 6 (IPv6) [Document on-line].
- [5.2] H. Che, Z. Wang, K. Zheng and B. Liu, "DRES: Dynamic Range Encoding Scheme for TCAM Coprocessors," *IEEE Transactions on Computers*, Vol. 57, No. 7, pp. 902–915, Jul. 2008.
- [5.3] W. Zhao, X. Li, S. Gu, S.-H. Kang, M.-M. Nowak, Y. Cao, "Field-Based Capacitance Modeling for Sub-65-nm On-Chip Interconnect," *IEEE Transactions on Electron Devices*, Vol.56, No.9, pp.1862-1872, Sept. 2009.
- [5.4] L. Wei, F. Boeuf, T. Skotnicki, H.-S.P. Wong, "Parasitic Capacitances: Analytical Models and Impact on Circuit-Level Performance," IEEE Transactions on Electron Devices, Vol.58, No.5, pp.1361-1370, May 2011.
- [5.5] N. Ekekwe, "Power dissipation and interconnect noise challenges in nanometer CMOS technologies," *IEEE Potentials*, Vol.29, No.3, pp.26-31, May-June 2010.

- [5.6] P.-T. Huang, W. Hwang, "A 65 nm 0.165 fJ/Bit/Search 256x144 TCAM Macro Design for IPv6 Lookup Tables," *IEEE Journal of Solid-State Circuits*, Vol.46, No.2, pp.507-519, Feb. 2011.
- [5.7] H.-Y. Li, C.-C. Chen, J.-S. Wang, and C. Yeh, "An AND-type match-line scheme for high-performance energy-efficient content addressable memories," *IEEE J. Solid-State Circuits*, Vol. 41, No. 5, pp.1108–1119, May 2006.
- [5.8] C.-C. Wang, J.-S. Wang, C.-W. Yeh, "High-Speed and Low-Power Design Techniques for TCAM Macros," *IEEE Journal of Solid-State Circuits*, Vol.43, No.2, pp.530-540, Feb. 2008.
- [5.9] Y.-D. Kim, H.-S. Ahn, S. Kim, D.-K. Jeong, "A High-Speed Range-Matching TCAM for Storage-Efficient Packet Classification," *IEEE Transactions on Circuits and Systems I: Regular Papers*, Vol.56, No.6, pp.1221-1230, June 2009.
- [5.10] B.-D. Yang, Y.-K. Lee, S.-W. Sung, J.-J. Min, J.-M. Oh, H.-J. Kang, "A Low Power Content Addressable Memory Using Low Swing Search Lines," *IEEE Transactions on Circuits and Systems I: Regular Papers*, No.99.



## Vita

### Shu-Lin Lai 賴淑琳

#### PERSONAL INFORMATION

- Birth Date: March 8, 1988
- Birth Place: Taichung, Taiwan
- Email: <u>sharon.ee99g@nctu.edu.tw</u>

#### **EDUCATION**

- 09/2010 - 08/2012: M.S. in Electronics Engineering, National Chiao Tung University. Thesis: Energy-Efficient TCAM Design for IP Lookup Tables in 40nm LP CMOS Process
- 09/2006 - 06/2010: B.S. in Communications Engineering, National Chung Cheng University

#### PUBLICATIONS

Ming-Hung Chang, Yi-Te Chiu, Shu-Lin Lai and Wei Hwang, "A 1kb 9T Subthreshold SRAM with Bit-interleaving Scheme in 65nm CMOS," *IEEE International Symposium on Low Power Electronics and Design (ISLPED)*, Aug. 2011.

#### PATENTS

Shu-Lin Lai, Po-Tsang Huang, Ching-Te Chuang, and Wei Hwang, "AND-Type CAM/TCAM Cell with P-Type Comparison Circuits," US/TW Patent Pending

Po-Tsang Huang, Shu-Lin Lai, Ching-Te Chuang, and Wei Hwang, "Data-Aware Power Control Scheme for TCAM Cell," US/TW Patent Pending