# A 40 nm 512 kb Cross-Point 8 T Pipeline SRAM With Binary Word-Line Boosting Control, Ripple Bit-Line and Adaptive Data-Aware Write-Assist

Nan-Chun Lien, Li-Wei Chu, Chien-Hen Chen, Hao-I. Yang, Ming-Hsien Tu, Paul-Sen Kan, Yong-Jyun Hu, Ching-Te Chuang, Fellow, IEEE, Shyh-Jye Jou, Senior Member, IEEE, and Wei Hwang, Life Fellow, IEEE

Abstract: This paper presents a cross-point 512 kb 8 T pipeline static random-access memory (SRAM). The cross-point structure eliminates write half-select disturb to facilitate bit-interleaving architecture for enhanced soft error immunity. The design employs a boosted word-line (WL) to improve both read performance and write-ability. A ripple bit-line (RiBL) structure provides 30%–44% read access performance improvement and $2\times$–$3.5\times$ better variation immunity at 0.7 V compared with the conventional hierarchical bit-line (HiBL) schemes. An adaptive data-aware write-assist (ADAWA) with VCS tracking is employed to further enhance the write-ability while ensuring adequate stability for half-selected cells on the selected bit-lines. An adaptive voltage detector (AVD) with binary boosting control is used to mitigate gate dielectric over-stress. The design is implemented in UMC 40 nm low-power (40LP) CMOS technology. The 512 kb test chip operates from 1.5 V to 0.65 V, with a maximum operation frequency of 800 MHz@1.1 V and 200 MHz@0.65 V. The measured power consumption is 0.5 mW/MHz (active) and 4.4 mW (standby) at 1.1 V, and 0.107 mW/MHz (active) and 0.367 mW (standby) at 0.65 V.

Index Terms: Adaptive data-aware write-assist (ADAWA), adaptive voltage detector (AVD), ripple bit-line, static random-access memory (SRAM), write-ability.

### I. INTRODUCTION

For portable and hand-held devices, low-power large-capacity memories are needed in processors and memory-rich SoCs to contain power dissipation and extend battery life. SRAM, with its logic process compatibility, has been the prime candidate for on-chip cache and embedded memory.
Manuscript received February 08, 2014; revised May 03, 2014 and June 16, 2014; accepted July 01, 2014. Date of publication September 04, 2014; date of current version November 21, 2014. This work was supported by the Ministry of Economic Affairs in Taiwan under Contract 100-EC-17-A-01-S1-124, the National Science Council of Taiwan under Contract NSC 102-2218-E-009-025-, and the Ministry of Education in Taiwan under ATU Program. This paper was recommended by Associate Editor M. Seok. N.-C. Lien, L.-W. Chu, C.-H. Chen, H.-I. Yang, C.-T. Chuang, S.-J. Jou, and W. Hwang are with the Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University, Hsinchu 30050, Taiwan (e-mail: nclien@gmail.com; b9427106@gmail.com; kc-nevo4@gmail.com; haoiyang@gmail.com; b9427106@gmail.com; kc-nevo4@gmail.nctu.edu.tw; hwang@mail.nctu.edu.tw). M.-H. Tu, P.-S. Kan, and Y.-J. Hu are with the Faraday Technology Corporation, Hsinchu 300, Taiwan (e-mail: huberttu@faraday-tech.com; pskan@faraday-tech.com; angelohu@faraday-tech.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSI.2014.2336531

In deep sub-100 nm technologies with sub-1 V supply voltage, the shrinking design window due to increasing leakage and variation has become a major challenge for SRAM design. For the conventional 6 T SRAM cell [Fig. 1(a)], the conflicting read and write requirements and the read-disturb and half-select disturb further limit the attainable minimum operation voltage. Many alternative cell structures have been proposed for low voltage operation, including the conventional single-ended 8 T cell [Fig. 1(b)] [1], the differential data-aware power supplied (D2AP) 8 T cell [Fig. 1(c)] [2], the large $\sigma V_{TH}/V_{DD}$ tolerant zigzag 8 T (Z8T) cell [Fig. 1(d)] [3], the Schmitt-triggered (ST) based cell [Fig. 1(e)] [4], the column-decoupled (CDC) 8 T cell [Fig.
1(f)] [5], and the column-line assist (CLA) 10 T cell [Fig. 1(g)] [6]. The conventional 8 T cell [1] utilizes a dedicated read port to decouple the read current from the cell storage node to eliminate read-disturb. However, due to its 6 T-like write operation, half-select cells on the selected word-line perform a dummy read during write operation, thus experiencing storage node disturb similar to the read-disturb in the 6 T SRAM cell, which makes the cell unsuitable for bit-interleaving architecture. The D2AP 8 T cell [2] improves the write-ability, yet read-disturb persists. The ST cell [4] is formed by using a half Schmitt trigger in the pull-down path of the cell. The feedback mechanism of the half Schmitt trigger raises the trip voltage of the cross-coupled cell inverters uni-directionally, thereby reducing read-disturb. Nevertheless, its area overhead is large. The CDC 8 T cell [5] utilizes an additional inverter powered by the column select signal to form a cross-point structure with the word-line to mitigate half-select disturb. However, the read-disturb persists. The CLA 10 T cell [6] uses a cross-point structure to eliminate the read-disturb and an additional discharge path to improve write-ability, yet the area overhead is large.

In this paper, we present a 40 nm 512 kb pipeline 8 T SRAM based on our proposed cross-point 8 T SRAM cell in [7]. The design utilizes a boosted WL to improve both read performance and write-ability. To mitigate gate dielectric over-stress due to boosting, an adaptive voltage detector (AVD) with binary boosting control and inherent corner tracking capability is used. A ripple bit-line structure [8] improves both read performance and process variation immunity for low voltage operation. An adaptive data-aware write-assist (ADAWA) with variation-tolerant VCS (array cell power supply) tracking is employed to further enhance write-ability while ensuring adequate stability for half-selected cells on the selected bit-lines.
The remainder of this paper is organized as follows. Section II describes the cross-point 8 T SRAM cell and stability evaluation. Sections III and IV present the pipeline structure and the read-/write-assist schemes, respectively. Section V discusses the test chip implementation and measurement results. The conclusion of the paper is given in Section VI.

Fig. 1. (a) Conventional 6 T cell, (b) conventional 8 T cell [1], (c) D2AP 8 T cell [2], (d) Z8T cell [3], (e) ST cell [4], (f) CDC 8 T cell [5], and (g) CLA 10 T cell [6].

Fig. 2. (a) Schematic and (b) layout of the cross-point 8 T cell.

# II. CROSS-POINT 8 T SRAM CELL

# A. Cell Structure and Operation

Fig. 2(a) and Fig. 2(b) show the schematic and layout of the cross-point 8 T SRAM cell, respectively. The cell size is $1.44~\mu m \times 0.59~\mu m$ in layout view based on logic rules in UMC 40 nm low-power (40 LP) CMOS technology. The actual size on silicon is shrunk by $0.9\times$ from the layout view. The cell features a cross-point write word-line structure with double-layer pass-gate. The word-line (WL) is row-based, while the write word-line (WWL), write word-line bar (WWLB), read bit-line (RBL), and VVSS are all column-based.

TABLE I TRUTH TABLE OF CROSS-POINT 8 T SRAM

| MODE | Standby | Write 0 (Q) | Write 1 (Q) | Read |
|------|---------|-------------|-------------|----------|
| WL | 0 | 1 | 1 | 1 |
| WWL | 0 | 0 | 1 | 0 |
| WWLB | 0 | 1 | 0 | 0 |
| VVSS | 0 | 0 | 1 | 0 |
| RBL | 1 | 0 | 0 | Floating |
| VDD1 | 1 | Floating | 1 | 1 |
| VDD2 | 1 | 1 | Floating | 1 |

VVSS is "High" in standby mode to reduce the leakage of the reading stack (M7/M8), and "Low" in read mode to provide a large read current to discharge RBL. The read operation resembles that of a conventional 8 T cell and is hence free of read-disturb. During write operation, the RBL goes "Low," the row-based WL turns on, and either WWL or WWLB (column-based) is turned on depending on the applied input data (Data-in) to be written into the selected cell.
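The write columns of Table I can be captured in a few lines of code. The following is a minimal behavioral sketch, not the authors' control logic: the names `WriteSignals`, `write_signals`, and `has_write_path` are hypothetical, and the cross-point condition is reduced to a simple logical AND of the row-based WL with the column-based WWL/WWLB, which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSignals:
    """Control levels at the selected cell during a write (1 = High, 0 = Low)."""
    wl: int    # row-based word-line
    wwl: int   # column-based write word-line
    wwlb: int  # column-based write word-line bar
    vvss: int  # column-based virtual ground of the read stack
    rbl: int   # column-based read bit-line


def write_signals(data_in: int) -> WriteSignals:
    """Return the Table I levels for the 'Write 0' or 'Write 1' column."""
    if data_in not in (0, 1):
        raise ValueError("data_in must be 0 or 1")
    # WL is on, RBL is Low, WWL/WWLB are complementary and follow Data-in,
    # and VVSS takes the same level as WWL (as listed in Table I).
    return WriteSignals(wl=1, wwl=data_in, wwlb=1 - data_in, vvss=data_in, rbl=0)


def has_write_path(wl: int, wwl: int, wwlb: int) -> bool:
    """Simplified cross-point condition (assumption of this sketch):
    a cell can be written only if its row WL and one of its column
    write word-lines are both active."""
    return wl == 1 and (wwl == 1 or wwlb == 1)


if __name__ == "__main__":
    print(write_signals(0))
    print(write_signals(1))
```

Under this simplification, a cell on the selected column but an unselected row (`wl = 0`), or on the selected row but an unselected column (`wwl = wwlb = 0`), has no write path.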
In write-mode, VVSS has exactly the same phase as WWL to eliminate the discharge path for QB = 1 storage nodes in half-selected cells on the selected column. As a result, only the selected cell at the cross-point, where the selected row intersects the selected column, has a discharge path to ground for its cell storage node. It is important to ensure that VVSS is ready before WWL/WWLB is activated. In our pipeline design, VVSS is set/ready in the "negative half-cycle" preceding the turning on of WWL/WWLB. Table I summarizes the operation of the proposed SRAM cell. Table II compares the features of the proposed cross-point 8 T cell with those of the conventional 6 T cell and various 8 T and 10 T cells [1]–[6]. The proposed 8 T SRAM cell is immune to read disturb and write half-select disturb, and its area is smaller than that of other disturb-free cells.

TABLE II SRAM BIT CELL COMPARISON

| | Conv. 6T | Conv. 8T [1] | D2AP 8T [2] | Z8T [3] | ST cell [4] | CDC 8T [5] | CLA 10T [6] | Proposed 8T |
|----------------------------|----------|----------------|-------------|----------------|-------------|----------------|----------------|---------------------|
| Read<br>Disturb Free | × | ○ | × | ○ | × | × | ○ | ○ |
| Write HS<br>Disturb Free | × | × | × | × | × | ○ | ○ | ○ |
| Word-Line Num. | 1(row) | 2(row) | 2(row) | 2(row) | 2(row) | 2(row) | 1(row) | 1(row)<br>2(column) |
| Bit-Line Num. <sup>1</sup> | 2-BL | 2-WBL<br>1-RBL | 2-BL | 2-WBL<br>2-RBL | 2-BL | 2-WBL<br>1-RBL | 2-WBL<br>2-RBL | 1-RBL |
| Area | 0.77× | 0.95× | 1.1× | 1× | 1.1× | 1.2× | 1.3× | 1× |

### B. Stability Comparison

For stability comparison, both the conventional 8 T cell and the proposed cross-point 8 T cell are constructed using the same logic devices with minimum size devices for M1–M6. The cell size for the proposed cross-point 8 T cell is $0.59~\mu m \times 1.44~\mu m = 0.8496~\mu m^2$ as shown in Fig.
2(b), while that for the conventional 8 T cell is $0.46~\mu m \times 1.75~\mu m = 0.805~\mu m^2$ ($0.95\times$ of the proposed cross-point 8 T cell). Fig. 3(a) shows the (row) write half-select static noise margin (SNM) [9] of the conventional 8 T cell and the proposed cross-point 8 T cell at PSNF, 125 °C (where half-select SNM is the worst). The conventional 8 T cell suffers the same write half-select disturb as the conventional 6 T cell. The write half-select SNM of the cross-point 8 T cell is essentially the hold SNM. Fig. 3(b) compares the write margin (WM) of the two cells at PFNS, $-40~^{\circ}\mathrm{C}$ (where write-ability is the worst). Here the WM of the cross-point 8 T cell is defined as the highest voltage level of the low-going RBL that causes the selected cell to flip during write operation, similar to the definition of WM for the conventional 6 T cell. Due to the series-connected write-access transistors, the WM of the proposed 8 T cell is worse than that of the conventional 8 T cell. To ensure adequate WM and dynamic write-ability [10] through the double-layer pass-gate structure of the cross-point 8 T cell, write-assist [11]–[14] is necessary and will be discussed in the next section.

Fig. 3(c) compares the $V_{\mathrm{disturb}}$, the highest disturb voltage level at the cell "0" storage node of the write half-select cell, of the three cells from transient simulations at PSNF, 125 °C. For the conventional 8 T cell, the $V_{\mathrm{disturb}}$ is close to that of the conventional 6 T cell. For the cross-point 8 T cell, minor write half-select disturb occurs in the column direction. For a column write half-select cell, M7 for the row-based WL is "Off," thus isolating RBL from the common drain node of M5/M6/M8 (source node of M7). If WWL = 1 (VVSS = 1) and QB = "0", M5 and M8 are "Off."
Thus, the disturb at "0" cell storage node QB comes from the charge re-distribution from the common drain node of M5/M6/M8 through the "On" M6, which is significantly less than the row write half-select disturb in the conventional 8 T cell (coming from the "High" write BL). If WWL = 0 (VVSS = 0) and Q = "0", M5 and M8 are "On," the common drain node of M5/M6/M8 will be at "0," the same voltage as the "0" cell storage node Q, and there is no disturb for node Q. As can be seen in Fig. 3(c), the column write half-select $V_{\mathrm{disturb}}$ of the cross-point 8 T cell is 60% (75%) less than the row write half-select $V_{\mathrm{disturb}}$ of the conventional 8 T cell at 1.1 V (0.7 V). Notice that if non-interleaving architecture is used, the SNM of the conventional 8 T cell would equal its hold SNM, same as the proposed cross-point 8 T cell. The "dynamic" write-ability (with finite WL pulse width) of the proposed 8 T cell would be slightly worse since the proposed 8 T cell performs write operation through double-layer pass-gate.

Fig. 3. (a) Half-select static noise margin (SNM) at PSNF, 125 $^{\circ}\mathrm{C}$, (b) write margin (WM) at PFNS, $-40\,^{\circ}\mathrm{C}$, and (c) $V_{\mathrm{disturb}}$ at PSNF, 125 $^{\circ}\mathrm{C}$ versus VDD for the conventional 8 T cell and the cross-point 8 T cell.

<sup>1</sup>BL (bit-line), WBL (Write bit-line), RBL (Read bit-line).

# III. PIPELINE STRUCTURE

Fig. 4. Two-stage pipeline structure.

TABLE III PIPELINE STRUCTURE CHARACTERISTICS

| Improvement/Overhead | Frequency Performance | Area | Active Power | Standby Power |
|----------------------|-----------------------|-------|--------------|---------------|
| Pipeline | +78% | +1.2% | +0.8% | +2.3% |

Fig. 4 shows the 2-stage pipeline structure [15], [16]. The conventional L1–L2 (Master-Slave) latches with non-overlapping clock are used for the Input Latch and Output Latch. The Middle Latch consists of L1 only, and is followed by an AND gate which drives the WL [16].
Functions are performed during the "positive half-cycles" and the "negative half-cycles" are used for capturing data and precharge. The first "positive half-cycle" is used for Decode. The second "positive half-cycle" is used for WL activation, data-sensing through local bit-line (LBL) and RiBL to global evaluation, and data latching into global latch and data-out (DO) latch. The second "positive half-cycle" is the cycle-time gating period. To minimize the clock skew and jitter among the latches, H-tree clock distribution commonly used in processor designs is employed. Table III summarizes the performance improvement, and area and power overhead of the pipeline structure. The pipeline structure improves the frequency performance by 78% with area overhead of 1.2%. The overheads for active power and standby power are 0.8% and 2.3%, respectively.

# IV. READ- AND WRITE-ASSIST

In the 512 kb test chip design, the row-based WL is boosted to improve both read performance and write-ability. To mitigate gate dielectric over-stress due to boosting, an AVD with binary boosting control and inherent corner tracking capability is used. To further enhance read performance and process variation immunity for low voltage operation, RiBL structure [8] is adopted. An ADAWA with variation-tolerant VCS (array cell power supply) tracking is employed to further enhance write-ability while ensuring adequate stability for half-selected cells on the selected bit-lines.

# A. Row-Based Word-Line Boosting

Fig. 5(a) and 5(b) show the boosting circuit and pertinent waveforms for the row-based WL, respectively. The local WL width is 32 bits. One boost unit is shared among 32 WL drivers for area efficiency.

Fig. 5. (a) Word-line boosting circuit and (b) pertinent waveforms.

Fig. 6. (a) Read current and (b) Write Margin (WM) improvement due to WL boosting.

Fig. 6(a) and 6(b) show the read current and WM improvement due to WL boosting, respectively.
The read current improves by 30.6% and the WM improves by 40.9% at 0.6 V.

# B. Ripple Bit-Line (RiBL) Scheme

For low-voltage operation in deeply scaled technology, the delay is dominated by wire delay and process variation. To further improve read performance and process variation immunity, the RiBL structure [8] is adopted. In essence, the RiBL structure resembles the buffer insertion scheme in reducing the delay of long wires commonly used in logic circuits or processor design.

Fig. 7. (a) Conventional HiBL structure with NMOS mux, (b) conventional HiBL structure with PMOS mux, (c) RiBL structure and (d) simulated read waveforms at 0.7 V, PSNS, 125 °C.

Fig. 7(a) depicts the conventional hierarchical bit-line (HiBL) structure with an NMOS muxing the LBL signal into a precharged "High" GBL, whereas Fig. 7(b) shows the conventional HiBL structure with a PMOS muxing the LBL signal into a pre-discharged "Low" GBL. The RiBL structure is shown in Fig. 7(c), where short LBL segments are isolated by a simple ripple buffer consisting of an inverter and NMOS M2. Sensing signal propagates uni-directionally in domino-like fashion through ripple buffer from segment to segment. Non-active LBL segments are completely isolated at all times.

Fig. 8. (a) Mean and (b) sigma $(\sigma)$ of data evaluation delay versus VDD with global and local variations from Monte Carlo simulations.

Fig. 7(d) shows the simulation waveforms of RiBL and HiBL for read operation at 0.7 V, PSNS, 125 $^{\circ}$C. With the same LBL length of 32 bits and GBL length of 128 bits (4 segments partition), the data evaluation delay of RiBL is faster than that of HiBL in Fig. 7(a) by 44% and that of HiBL in Fig. 7(b) by 30% at 0.7 V. Fig. 8 compares the mean and sigma $(\sigma)$ of the data evaluation delay versus $V_{\rm DD}$ from Monte Carlo simulations considering both global and local variations. The $\sigma$ of data evaluation delay for HiBL in Fig. 7(a) and Fig.
7(b) is $3.5\times$ and $2\times$ of that for RiBL at 0.7 V, respectively. The RiBL structure provides significantly better data evaluation delay and process variation immunity compared with the HiBL structures. The conventional HiBL structures in Fig. 7(a) and 7(b) utilize a single NMOS and a single PMOS, respectively, to mux the LBL signal into GBL. The RiBL structure requires a sensing inverter and M2/M3 in each LBL segment for signal propagation. Nevertheless, the improvement in performance and variation tolerance of RiBL is very significant, especially for low voltage operation.

# C. Adaptive Voltage Detector (AVD) for Boosting Control

The row-based WL is boosted to a level higher than VDD to enhance read performance and write-ability. However, the boosting efficiency tends to be low at low operating voltage (where boosting is most needed), and high at normal or high operating voltage. As such, optimization of boosting circuit/efficiency for low-voltage operation may cause over-boosting at normal/high operating voltage, resulting in gate dielectric over-stress and degrading the device reliability. Recently, a boosting attenuation circuit [14] with decreasing boosting efficiency as the operating voltage increases has been proposed. In our design, we employ an AVD [Fig. 9(a)] with binary boosting control. If VDD is higher (lower) than a pre-determined voltage, the boosting action will be "Off" ("On"). The control signal, ST, is a pulse generated from the leading edge of CSB (Chip Select Bar, not shown). When the chip is selected, ST goes "High" with a pre-determined duration (i.e., pulse width) to enable AVD. CLK and CLKB enable M5 and M2, respectively. The voltage at VD0, set by the diode voltage of PMOS M0, is then compared with the trip voltage $V_{\rm trip}$ of inverter M3/M4, and the result VD1 is latched. The latched result is buffered to generate BST\_EN which triggers the boosting circuit in Fig. 5(a).
If VD0 is lower (higher) than $V_{\rm trip}$, BST\_EN will be "Low" ("High") and the boosting action will be "Off" ("On"). Fig. 9(b) shows the VD0 versus VDD characteristic of the AVD at typical corner (TC), slow corner (SC) and fast corner (FC). The effectiveness of this boosting control is based on the distinct VDD dependence of VD0 and $V_{\rm trip}$. The $V_{\rm trip}$ is determined by the inverter N/P strength ratio. As VDD increases, $V_{\rm trip}$ follows and increases approximately linearly with VDD. VD0, on the other hand, has a highly non-linear dependence on VDD. At low VDD, most of the voltage drops across the diode-connected PMOS M0, and VD0 would be higher than $V_{\rm trip}$. At high VDD, most of the voltage drops across the load resistor, and VD0 would be lower than $V_{\rm trip}$. The scheme is inherently variation-tolerant across process corners. The $V_{\rm trip}$ remains about the same for fast and slow corners, whereas the diode voltage (hence VD0) is lower (higher) at fast (slow) corner. Hence, the boosting action would be off at lower (higher) VDD for fast (slow) corner. The cross-over voltage of VD0 and $V_{\rm trip}$ is programmable with ${\rm OSPD}\langle 0:2\rangle$.

Fig. 9. (a) Adaptive voltage detector (AVD) circuit. (b) VD0 versus VDD characteristic of AVD.

Fig. 10. Variation analysis of AVD from Monte Carlo simulations with $3\sigma$ local random variation of $V_{\rm trip}$ and VD0 at different corners.

Fig. 10 illustrates the variation analysis of AVD from Monte Carlo simulations with $3\sigma$ local random variation at different corners. VDD is swept and the variations of VD0 and $V_{\rm trip}$ at each VDD are calculated. The lower bound (L. B.) and upper bound (U. B.) of the intersection of VD0 and $V_{\rm trip}$ represent the range where the boosting decision may occur. For example, for the fast corner (F. C.), the VDD at which boosting is disabled lies between 1.02 V and 1.15 V, while for the slow corner (S. C.) it lies between 1.22 V and 1.38 V.
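The binary boosting decision and its corner-dependent uncertainty band can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the AVD circuit itself: the function name `boost_decision` and the constants `FAST_CORNER` and `SLOW_CORNER` are hypothetical, and the quoted voltage ranges are interpreted here as the band in which the VD0/$V_{\rm trip}$ cross-over may fall under local variation.

```python
# Cross-over bands (lower bound, upper bound) in volts, taken from the
# Monte Carlo ranges quoted in the text for the fast and slow corners.
FAST_CORNER = (1.02, 1.15)
SLOW_CORNER = (1.22, 1.38)


def boost_decision(vdd: float, lower_bound: float, upper_bound: float) -> str:
    """Behavioral model of binary boosting control.

    Below the band, VD0 > Vtrip for every sample, so boosting is "on".
    Above the band, VD0 < Vtrip for every sample, so boosting is "off".
    Inside the band the outcome depends on the individual die.
    """
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    if vdd < lower_bound:
        return "on"
    if vdd > upper_bound:
        return "off"
    return "die-dependent"


if __name__ == "__main__":
    for vdd in (0.65, 1.1, 1.2, 1.5):
        print(vdd,
              "FC:", boost_decision(vdd, *FAST_CORNER),
              "SC:", boost_decision(vdd, *SLOW_CORNER))
```

In this model a fast-corner die stops boosting at a lower VDD than a slow-corner die, which mirrors the corner tracking described above.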
# D. Adaptive Data-Aware Write-Assist (ADAWA) With VCS Tracking

To further enhance the write-ability through the double-layer pass-gate, the design employs an ADAWA [Fig. 11(a)]. The column-based cell supply is split into 2 virtual supply lines, one for the right-half cells (VDD1), and the other for the left-half cells (VDD2). Each virtual supply line is controlled by two power-switches. The inner power-switch (M1/M2) is controlled by ADAWA\_WE (Write Enable). Thus, M1 and M2 will be turned off by ADAWA\_WE during Write operation to weaken the virtual column supply. Depending on data-in, the high-going WWL or WWLB turns off either M3 or M4, causing the corresponding virtual supply node (VDD2 or VDD1) to drop, thus reducing the $V_{GS}$ and contention of the corresponding cell holding PMOS to enhance write-ability and WM. The opposite half-cell inverter is unaffected and maintains its strength and feedback action to facilitate the pull-up of the opposite cell storage node. The timing of half-cell supply switching is initiated directly from the high-going WWL or WWLB, and is thus tolerant to PVT variations and $V_{\rm T}$ scatter. Data-aware switching of half-cell supplies reduces dynamic supply switching power and noise to half, and improves supply switching speed. The scheme adds minimal loading to WWL/WWLB (only 1 extra $C_{GATE}$), and is area efficient (only 4 PMOS per column).

The ADAWA scheme requires much smaller area overhead compared with negative bit-line (NBL) write-assist [12], [14], which incurs large area overhead due to the large boosting capacitor and complicated control on each column. Compared with the floating power-line write-assist scheme [13], the ADAWA scheme reduces dynamic supply switching power and noise to half, and offers faster dynamic supply switching speed, faster pull-up of the opposite cell storage node and faster time-to-write. Disparity between VDD1 and VDD2 results in asymmetrical Hold SNM for half-selected cells on the selected column.
To ensure adequate half-select stability, a VCS (array cell power supply) tracking circuit [Fig. 11(b)] is used to control the pulse width of ADAWA\_WE. In this VCS tracking circuit, NMOS M6/M7 track the cell access NMOS [M5/M6/M7 in Fig. 2(a)], while PMOS M2/M3/M4/M5 track the cell holding PMOS. The contention between NMOS M6/M7 and PMOS M2/M3/M4/M5 mimics the writing action in a cell with 4-bit (OSD$\langle 0:3 \rangle$) programmability. A Replica Cell Load, which mimics the capacitive load of an array column, is added at the source node (virtual column supply node) of the PMOSs. The source node of M7, RBLS, is controlled by the WEN (Write Enable) signal. WLE (WL Enable) is the logic combination of WWL-OR-WWLB (i.e., either WWL or WWLB activated) and is true/activated in write operation. During write cycle, WLE goes "High," and RBLS in the selected bank goes "Low" at the leading/rising edge of WEN, thus turning on M6/M7 to pull down node RLS. Once the voltage at node RLS falls below the trip voltage of the succeeding inverter SA, ADAWA\_WE [which controls the inner power-switch M1/M2 in Fig. 11(a)] goes "Low" to end the write-assist. Fig. 11(c) shows the pertinent waveforms of VCS tracking.

Fig. 11. (a) Adaptive data-aware write-assist (ADAWA) circuit, (b) VCS tracking circuit for ADAWA, and (c) pertinent waveforms of VCS-Tracking.

Fig. 12 shows the cumulative yield of write operation from Monte Carlo simulations at PFNS corner (where write-ability is the worst) with $3\sigma$ local random variation at 25 °C. The write VDD<sub>MIN</sub> is improved by 200 mV with ADAWA.

Fig. 12. Cumulative yield of write operation from Monte Carlo simulations at PFNS corner with $3\sigma$ local random variation at 25 $^{\circ}$C.

To illustrate the effectiveness of VCS tracking in mitigating the process variation and ensuring adequate half-selected stability, Fig. 13 shows T\_CellW (cell write time), T\_VCST (time span during
which the inner power-switch is "Off" using VCS tracking) and T\_INVD (time span during which the inner power-switch is "Off" using the conventional inverter chain delay tracking) across different process corners. T\_VCST and T\_INVD are designed based on T\_CellW at the worst write-ability corner, i.e., the NSPF (SF) corner. As can be seen, T\_VCST tracks T\_CellW well across all corners, whereas T\_INVD completely loses its tracking capability at other corners.

Fig. 13. Tracking of T\_VCST (VCS tracking) and T\_INVD (Inverter chain delay tracking) with T\_CellW (cell write time) across different process corners.

Fig. 14. Minimum cell data retention time for 90% of 1 Mb of the proposed 8 T cell and ADAWA power collapse duration (PCD) versus VDD from Monte Carlo simulations at PFNF, 125 $^{\circ}$C (where leakage and data retention are the worst) with $3\sigma$ local random variation.

To ensure adequate data-retention for half-selected cells on the selected column, we adopt the design criterion in [17] with a timing guard band between the minimum data retention time for 90% of 1 Mb of the cell and ADAWA power collapse duration across the intended $V_{\rm DD}$ range at the worst corner with local random variation as shown in Fig. 14. As can be seen, from Monte Carlo simulations at PFNF, 125 $^{\circ}{\rm C}$ (where leakage and data retention are the worst) with $3\sigma$ local random variation, a guard band of 0.4 ns at 0.65 V is observed.

Fig. 15. (a) Layout view and (b) die photo of the 512 kb test chip.

## V. TEST CHIP IMPLEMENTATION AND MEASUREMENT

A 512 kb test chip is implemented in UMC 40 LP CMOS. The 8 T SRAM cell size is $1.44\times0.59~\mu\mathrm{m}^2$. The 512 kb array is organized into 8192 words $\times$ 64 bits with a 16-bit interleaving architecture. The data I/O width is 64 bits. The local word-line (LWL) width is 32 bits and the LBL length is 32 bits. The layout view and die photo are shown in Fig. 15(a) and Fig. 15(b), respectively.
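The organization figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `array_organization` is hypothetical, the 128-bit GBL length is taken from the RiBL discussion, and the columns-per-row figure assumes that each of the 64 I/O bits selects one of 16 interleaved columns within a single physical row, which the paper does not state explicitly.

```python
def array_organization(words: int = 8192,
                       bits_per_word: int = 64,
                       interleave: int = 16,
                       lbl_cells: int = 32,
                       gbl_cells: int = 128) -> dict:
    """Derive basic array figures from the stated organization."""
    total_bits = words * bits_per_word
    return {
        "total_bits": total_bits,
        # 1 kb is taken as 1024 bits here.
        "capacity_kb": total_bits // 1024,
        # Number of local bit-line segments chained along one global bit-line.
        "lbl_segments_per_gbl": gbl_cells // lbl_cells,
        # Assumption of this sketch: one physical row holds `interleave`
        # words, so each I/O bit muxes `interleave` adjacent columns.
        "columns_per_row": bits_per_word * interleave,
    }


if __name__ == "__main__":
    for name, value in array_organization().items():
        print(f"{name}: {value}")
```

With the default arguments, the capacity evaluates to 512 kb and the GBL spans 4 LBL segments, matching the figures given in the text.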
The core chip area is 947 $\mu\mathrm{m} \times 2810~\mu\mathrm{m}$ in layout view. The actual size on silicon is shrunk by $0.9\times$ from the layout view.

Fig. 16. (a) Measured error free full functionality die yield (without redundancy) versus VDD for FF (58 dies), TT (65 dies) and SS (53 dies) corners at room temperature. (b) Measured frequency-voltage Shmoo plot. (c) Measured write failure bit count (FBC) improvement with boosted WL versus VDD at 3 corners. (d) Measured operation power and Standby power versus VDD.

The row-based WL boosting circuits are placed in the row decoder with centralized control. The RiBL circuits and ADAWA power switches are distributed in each column. The VCS tracking circuit is placed adjacent to the bank decoder. The area overheads for row-based WL boosting, AVD, RiBL, and ADAWA with VCS tracking are 4.73%, 0.01%, 2.73%, and 2.94%, respectively.

Fig. 16(a) shows the measured die yield (without redundancy) versus VDD at room temperature for FF (58 dies), TT (65 dies), and SS (53 dies) corners. Dies are tested with full suites of industry standard SRAM compiler product qualification patterns including CHECKERBOARD, MARCH C- and MARCH C+ test patterns with all high/low-Read/Write combinations. At $V_{\rm DD} =$ 0.65 V, the error-free ("perfect") die yield still exceeds 50% for FF dies. Fig. 16(b) shows the measured frequency Shmoo of the test chip. The 512 kb test chip operates from 1.5 V to 0.65 V, with maximum operation frequency of 800 MHz@1.1 V and 200 MHz@0.65 V at room temperature. As discussed in Section IV, boosting of the row-based WL significantly enhances the read current and write-ability. Fig. 16(c) shows the measured bit failure count improvement with boosted WL versus $V_{\mathrm{DD}}$. Depending on process corners, bit failure count improvements of 1.5 orders up to 4 orders of magnitude are observed from 0.7 V to 0.5 V. The improvement is particularly significant for SS corner, where the read current and write-ability are the worst. Fig.
16(d) shows the measured operation power and Standby power versus $V_{\rm DD}$. The power consumption is 0.5 mW/MHz (Active) and 4.4 mW (Standby) at 1.1 V, TT, 25 $^{\circ}{\rm C}$ and 0.107 mW/MHz (Active) and 0.367 mW (Standby) at 0.65 V, TT, 25 $^{\circ}{\rm C}$. The characteristics of the chip are summarized in Table IV.

TABLE IV CHARACTERISTICS OF 512 kb TEST CHIP

| Item | Feature |
|--------------------------|------------------------------------|
| Technology | 40 nm Low Power (LP) CMOS |
| Bit Cell | Cross-point 8T cell |
| Cell Size (Layout View) | $1.44~\mu m \times 0.59~\mu m$ |
| Organization | 512 kb (8192 words × 64 bits) |
| Chip Area (Layout View) | 947 μm × 2810 μm |
| DI/DO | 64 bit |
| Bit-interleaving | 16 bit |
| Local WL Width | 32 bit |
| Local BL Length | 32 bit |
| Global BL Length | 128 bit |
| Measured Max Freq (25 °C) | 800 MHz@1.1 V<br>200 MHz@0.65 V |
| Measured Power (1.1 V) | Active: 0.5 mW/MHz<br>Standby: 4.4 mW |
| Measured Power (0.65 V) | Active: 0.107 mW/MHz<br>Standby: 0.367 mW |
| $VDD_{MIN}$ | 0.65 V (w/o Redundancy) |

### VI. CONCLUSION

We presented a 512 kb cross-point 8 T SRAM in UMC 40 LP CMOS. Cross-point cell structure mitigated write half-select disturb to facilitate bit-interleaving architecture for enhanced soft error immunity with error correction code (ECC). Pipeline design enabled high-frequency operation with low-power low-leakage technology. Boosting of row-based WL improved read current by 30.6% and WM by 40.9% at 0.6 V. RiBL structure enhanced data evaluation delay by 30%–44% and process variation immunity by $2\times$–$3.5\times$ at 0.7 V compared with the conventional HiBL structures. ADAWA with VCS tracking provided 200 mV improvement of write VDD<sub>MIN</sub>. AVD with binary boosting control was used to mitigate gate dielectric over-stress. Error free full functionality operation was achieved from 1.5 V down to 0.65 V without redundancy for FF dies.
The measured maximum operation frequency was 800 MHz@1.1 V and 200 MHz@0.65 V at room temperature. The measured power consumption was 0.5 mW/MHz (Active) and 4.4 mW (Standby) at 1.1 V, TT, 25 $^{\circ}\mathrm{C}$ and 0.107 mW/MHz (Active) and 0.367 mW (Standby) at 0.65 V, TT, 25 $^{\circ}\mathrm{C}$.

# REFERENCES

- [1] L. Chang et al., "An 8 T-SRAM for variability tolerance and low-voltage operation in high-performance caches," *IEEE J. Solid-State Circuits*, vol. 43, no. 4, pp. 956–963, Apr. 2008.
- [2] M. F. Chang et al., "A differential data-aware power-supplied (D2AP) 8 T SRAM cell with expanded write/read stabilities for low VDDmin applications," *IEEE J. Solid-State Circuits*, vol. 45, no. 6, pp. 1234–1245, Jun. 2010.
- [3] J. J. Wu et al., "A large $\sigma V_{TH}/VDD$ tolerant zigzag 8 T SRAM with area-efficient decoupled differential sensing and fast write-back scheme," *IEEE J. Solid-State Circuits*, vol. 46, no. 4, pp. 815–827, Apr. 2011.
- [4] J. P. Kulkarni and K. Roy, "Ultralow-voltage process-variation-tolerant Schmitt-trigger-based SRAM design," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 2, pp. 319–332, Feb. 2012.
- [5] R. V. Joshi et al., "A novel column-decoupled 8 T cell for low-power differential and domino-based SRAM design," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 19, no. 5, pp. 869–882, May 2011.
- [6] S. Okumura et al., "A 0.56-V 128 Kb 10 T SRAM using column line assist (CLA) scheme," in Proc. IEEE Int. Symp. Quality Electron. Design (ISQED), Mar. 16–18, 2009, pp. 659–663.
- [7] Y. W. Lin et al., "A 55 nm 0.5 V 128 Kb cross-point 8 T SRAM with data-aware dynamic supply write-assist," in Proc. IEEE Int. SOC Conf. (SOCC), Sep. 2012, pp. 218–223.
- [8] C. Y. Lu et al., "A 0.33 V, 500 KHz, 3.94 μW 40 nm 72 Kb 9 T subthreshold SRAM with ripple bit-line structure and negative bit-line write-assist," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 59, no. 12, pp. 863–867, Dec. 2012.
- [9] E. Seevinck et al., "Static noise margin analysis of CMOS SRAM cells," *IEEE J. Solid-State Circuits*, vol. 22, no. 5, pp. 748–754, Oct. 1987.
- [10] S. Nalam et al., "Dynamic write limited minimum operating voltage for nanoscale SRAMs," in Proc. Des. Autom. Test Eur., 2011, pp. 1–6.
- [11] R. W. Mann et al., "Impact of circuit assist methods on margin and performance in 6 T SRAM," *Solid-State Electron.*, vol. 54, no. 11, pp. 1398–1407, Nov. 2010.
- [12] S. Mukhopadhyay et al., "SRAM write-ability improvement with transient negative bit-line voltage," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 19, no. 1, pp. 24–32, Jan. 2011.
- [13] M. Yamaoka et al., "Low-power embedded SRAM modules with expanded margins for writing," in Dig. Tech. Papers Int. Solid-State Circuits Conf. (ISSCC), 2005, pp. 480–481.
- [14] H. Pilo et al., "A 64 Mb SRAM in 32 nm high-k metal-gate SOI technology with 0.7 V operation enabled by stability, write-ability and readability enhancements," *IEEE J. Solid-State Circuits*, vol. 47, no. 1, pp. 97–106, Jan. 2012.
- [15] D. W. Plass and Y. H. Chan, "IBM POWER6 SRAM arrays," *IBM J. Res. Develop.*, vol. 51, no. 6, pp. 747–756, Nov. 2007.
- [16] J. Pille et al., "Implementation of the cell broadband engine in 65 nm SOI technology featuring dual power supply SRAM arrays supporting 6 GHz at 1.3 V," *IEEE J. Solid-State Circuits*, vol. 43, no. 1, pp. 163–171, Jan. 2008.
- [17] E. Karl et al., "A 4.6 GHz 162 Mb SRAM design in 22 nm Tri-Gate CMOS technology with integrated read and write," *IEEE J. Solid-State Circuits*, vol. 48, no. 1, pp. 150–158, Jan. 2013.

**Nan-Chun Lien** received the B.S. degree in electro-physics and the M.S. degree in electronics engineering from National Chiao Tung University, Taiwan, in 1995 and 1997, respectively. He is working toward the Ph.D. degree in electronics engineering at National Chiao Tung University, Taiwan.
From 1999 to 2001, he was with TSMC, Taiwan. From 2001 to 2011, he was with Faraday Technology Corporation, Taiwan, working on embedded memory design. In 2011, he joined M31 Technology Corporation, Hsinchu, Taiwan. His research interests include advanced VLSI circuit design, high-speed and low-power memory design, and embedded memory compiler development.

**Li-Wei Chu** received the B.S. degree in electrical engineering from Chang Gung University, Taiwan, in 2009 and the M.S. degree in electronics engineering from National Chiao Tung University, Taiwan, in 2012. In 2012, she joined M31 Technology Corporation, Hsinchu, Taiwan, working on next-generation embedded memory compiler development.

**Chien-Hen Chen** received the B.S. degree in electrical engineering from National Cheng Kung University, Taiwan, in 2009 and the M.S. degree in electronics engineering from National Chiao Tung University, Taiwan, in 2011. He is currently working at UMC, Hsinchu, Taiwan. His research interest focuses on low-power VLSI circuit design.

**Hao-I. Yang** (S'09) received the B.S. and M.S. degrees in electrical engineering from National Cheng Kung University, Taiwan, in 2003 and 2005, respectively, and the Ph.D. degree in electronics engineering from National Chiao Tung University, Taiwan, in 2011. In 2011, he joined TSMC, Hsinchu, Taiwan. He currently works on nanoscale high-speed and low-power SRAM design.

**Ming-Hsien Tu** received his B.S. and M.S. degrees in electrical engineering from National Central University, Taiwan, in 2004 and 2006, respectively. He received his Ph.D. degree in electronics engineering from National Chiao Tung University, Taiwan, in 2011. His research interests include noise suppression design technologies, embedded measurement circuit design, and ultra-low-power SRAM design.

**Paul-Sen Kan** received the B.S. degree in electrical engineering from Minghsin University of Science and Technology, Taiwan, in 2007 and the M.S. degree in electronics from Chung Hua University, Taiwan, in 2009. In 2011, he joined Faraday Technology Corporation, Hsinchu, Taiwan. His research interests focus on VLSI circuit and embedded memory testing.

**Yong-Jyun Hu** received the B.S. degree in electrical engineering from National Chung Cheng University, Taiwan, in 2007 and the M.S. degree in electrical engineering from National Central University, Taiwan, in 2010. In 2010, he joined Faraday Technology Corporation, Hsinchu, Taiwan.

**Ching-Te Chuang** (S'78–M'82–SM'91–F'94) received the B.S.E.E. degree from National Taiwan University, Taipei, Taiwan, in 1975 and the Ph.D. degree in electrical engineering from University of California, Berkeley, CA, USA, in 1982. From 1982 to 2008, he worked at IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, holding various technical and management positions. He joined the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 2008, where he is currently a Life Chair Professor. Prof. Chuang has authored or coauthored over 380 papers. He holds 54 U.S. patents with another 20 pending.

**Shyh-Jye Jou** received his B.S. degree in electrical engineering from National Cheng Kung University, Taiwan, in 1982, and the M.S. and Ph.D. degrees in electronics engineering from National Chiao Tung University, Taiwan, in 1984 and 1988, respectively. He is currently a Professor in the Department of Electronics Engineering, National Chiao Tung University. Since August 2011, he has been the Dean of the Office of International Affairs, National Chiao Tung University. He has authored or coauthored over 100 papers.

**Wei Hwang** (F'01–LF'09) received the B.Sc. degree from National Cheng Kung University, Taiwan, the M.S. degree from National Chiao Tung University, Taiwan, and the M.S. and Ph.D. degrees in electrical engineering from University of Manitoba, Winnipeg, MB, Canada, in 1970 and 1974, respectively.
From 1975 to 1978, he was an Assistant Professor with the Department of Electrical Engineering, Concordia University, Montreal, QC, Canada. From 1979 to 1984, he was an Associate Professor with the Department of Electrical Engineering, Columbia University, New York. From 1984 to 2002, he was a Research Staff Member with IBM T. J. Watson Research Center, Yorktown Heights, NY, USA. In 2002, he joined the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, where he is currently a Life Chair Professor. Prof. Hwang has authored or coauthored over 200 papers and holds over 170 international patents (including 66 U.S. patents).