# **Chapter 1 Introduction**



# 1.1 Why Clock and Data Recovery

Communication is becoming ever more important in the modern world, and innumerable messages are exchanged in daily life. At the sender, a message is transformed into signals and transmitted to the receiver, which then recovers the message from the received signals. Whether this process succeeds depends primarily on the performance of the transmitter and the receiver. Both need clocks to process data, so the receiver has to extract a clock signal from the data to make its circuits work synchronously.

Therefore, the clock and data recovery (CDR) circuit is one of the building blocks at the receiver end of wired communication systems such as SONET, Gigabit Ethernet, Gigabit Passive Optical Network (GPON), and Ethernet Passive Optical Network (EPON). Figure 1-1 shows the functionality of a CDR circuit: a clock recovery circuit in the receiver reconstructs the clock. Because the data arriving at a receiver are asynchronous and noisy, a clock must be extracted for synchronous operation. The data also need to be retimed so that the jitter accumulated during transmission can be removed. Since the performance of the receiver depends on the CDR circuit, it must be designed carefully.

As mentioned above, a CDR circuit is an important building block in wired communication systems, especially in optical networks. In terms of network topology, optical networks can be classified as point-to-point systems and point-to-multipoint systems. Both kinds of system need CDR circuits to recover the clock and data at the receiver end. However, they use different kinds of CDR circuits in different ways.



Fig. 1-1 The Functionality of Clock and Data Recovery (CDR) Circuit

## 1.2 Passive Optical Network Introduction

The rapid growth of IP traffic has spurred the development of low-cost and convenient broadband access services. A passive optical network (PON) is suitable for the construction of economical fiber-to-the-home (FTTH) systems for high-speed optical subscriber networks. ATM-based PON (APON), broadband PON (BPON), and Ethernet PON (EPON) with transmission speeds of 50–600 Mbps were developed early on. More recently, gigabit Ethernet PON (GEPON) systems have attracted a great deal of attention as a way of exceeding 1 Gbps in a subscriber network.

Fig. 1-2 shows the general architecture of a GEPON system [1], which has been standardized under the auspices of the IEEE 802.3ah committee [2]. All transmissions in a PON are performed between an optical line terminal (OLT) and multiple optical network units (ONUs). The ONUs, located at the subscribers' premises, are connected to the OLT through a single optical fiber and a tree network based on a 1:N passive star coupler. The OLT broadcasts serial data with headers that identify the particular ONUs that should receive the packets, using the 1490-nm wavelength for downstream transmission. The ONUs extract the clock from the downstream traffic, which frequency-synchronizes the PON system. Each ONU can send burst packets on demand, using the 1310-nm wavelength in the upstream direction. The burst packets sent from the ONUs are managed using a time-division multiple-access (TDMA) scheme so that the packets do not interfere with each other once the data are coupled onto the common fiber. Therefore, in the downstream direction (from the OLT to the ONUs) a PON is a point-to-multipoint (P2MP) network, and in the upstream direction it is a multipoint-to-point (MP2P) network. Because one fiber and one passive splitter are shared by all users, a PON has the lowest cost. Nonetheless, media access control (MAC) is an important problem in such a topology. Unlike a point-to-point (P2P) network or a curb-switched network, a PON needs to serve many users with only one fiber and a passive splitter (star coupler). There are mainly three ways to accomplish media access. One is wavelength-division multiplexing (WDM). While simple, it leads to a high-cost access network, because a tunable receiver or a receiver array would be required in either the OLT or the ONUs; WDM is therefore not an attractive solution.

Contention-based media access (similar to CSMA/CD) is difficult to implement because ONUs cannot detect a collision at the OLT, owing to the directional properties of the optical splitter/combiner. An OLT could detect a collision and inform the ONUs by sending a jam signal; however, propagation delays in a PON, which can exceed 20 km in length, greatly reduce the efficiency of such a scheme. Most designers believe time-sharing is the preferred method of optical channel sharing in an access network because it allows for a single upstream wavelength (e.g., 1310 nm) and a single transceiver in the OLT, resulting in a cost-effective solution. All ONUs are synchronized to a common time reference, and each ONU is allocated a time slot. Each time slot is capable of carrying several Ethernet frames. An ONU buffers frames from a subscriber until its time slot arrives. When its time slot arrives, the ONU "bursts" all stored frames at full channel speed (the standard Ethernet rate). If there are not enough frames in the buffer to fill the entire time slot, idles are transmitted. The possible time-slot allocation schemes range from static allocation (fixed time-division multiple access, TDMA) to a dynamically adapting scheme based on the instantaneous queue size in every ONU (statistical multiplexing). More allocation schemes are possible, including schemes utilizing traffic priority and QoS, service level agreements (SLAs), and oversubscription ratios.
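The time-slot behavior described above can be illustrated with a minimal sketch. It assumes a static round-robin allocation and measures slot capacity in bytes; the names `slot_owner` and `burst_in_slot`, the slot size, and the frame sizes are hypothetical and are not taken from any PON standard.

```python
from collections import deque


def slot_owner(slot_index, num_onus):
    """Static TDMA: slots are handed to the ONUs in fixed round-robin order."""
    return slot_index % num_onus


def burst_in_slot(queue, slot_bytes):
    """Send the buffered frames that fit into one time slot.

    queue      -- deque of frame sizes (bytes) waiting in the ONU buffer
    slot_bytes -- assumed capacity of one time slot in bytes
    Returns (sent_frames, idle_bytes); unused capacity is filled with idles.
    """
    sent = []
    remaining = slot_bytes
    # Frames are sent whole and in order; a frame that does not fit waits
    # for the next slot owned by this ONU.
    while queue and queue[0] <= remaining:
        frame = queue.popleft()
        sent.append(frame)
        remaining -= frame
    return sent, remaining


if __name__ == "__main__":
    buffered = deque([1500, 64, 1500])
    sent, idle = burst_in_slot(buffered, slot_bytes=2000)
    print("owner of slot 5:", slot_owner(5, num_onus=4))
    print("sent:", sent, "idle bytes:", idle, "still buffered:", list(buffered))
```

A dynamic scheme would replace `slot_owner` with a function of the reported queue sizes; the burst logic inside a slot would stay the same.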



Fig. 1-2 General Architecture of A Gigabit Ethernet PON System

Therefore, the general physical layer for a PON is similar to Fig. 1-3. The transmitter includes a laser driver and a laser diode, while the receiver includes a photodiode, amplifiers, and a clock and data recovery (CDR) circuit. To separate upstream and downstream traffic, a WDM multiplexer is employed. Because the upstream data in a PON are composed of bursts of packets, the receiver in the OLT and the transmitter in the ONU must support burst-mode operation. That is, all the blocks, such as burst-mode TIAs and burst-mode CDRs, must settle within a fixed time after a packet arrives.



Fig. 1-3 Burst-Mode Transceiver Architecture

In conventional long-distance optical networks, a network is composed of several sections of optical fiber connected by repeaters. Data transmitted from point A to point B may pass through tens of repeaters. Each repeater consists of amplifiers and a CDR circuit. The amplifiers amplify and limit the signal to eliminate amplitude noise, while the CDR circuit samples the data with a "clean" clock extracted from the data to eliminate timing noise (jitter).

# 1.3 Specifications

To understand burst-mode CDR circuits further, the specifications for PON cannot be overlooked. There are two popular PONs: Gigabit PON (GPON) and Ethernet PON (EPON). Although both have many physical-layer specifications, only those related to the CDR are included in this section: data rate, lock time, jitter performance, and the eye-diagram mask for upstream transmission.

#### 1.3.1 Data Rate

This table shows their data rates:

| PON              | Data Rate ( Mbps ) |
|------------------|--------------------|
| GPON             | 155.52             |
|                  | 622.08             |
|                  | 1244.16            |
|                  | 2488.32            |
| EPON 1000BASE-PX | 1250               |

Table 1-1 Specifications of Data Rate

#### 1.3.2 Lock Time

Different data rates correspond to different lock times:

| PON              | Data Rate ( Mbps ) | Bits | Time ( ns ) |
|------------------|--------------------|------|-------------|
| GPON             | 155.52             | 10   | 64          |
|                  | 622.08             | 28   | 44.8        |
|                  | 1244.16            | 44   | 35.2        |
|                  | 2488.32            | 108  | 43.2        |
| EPON 1000BASE-PX | 1250               | 500  | 400         |

Table 1-2 Specifications of Lock Time
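As a quick consistency check on Table 1-2, the lock time in nanoseconds is the lock time in bits multiplied by the bit period. The sketch below recomputes it from the tabulated data rates; the name `lock_time_ns` is illustrative. The GPON entries in the table come out slightly smaller than the computed values, which would be consistent with rounded bit periods, although that interpretation is an assumption.

```python
# (data rate in Mbps, lock time in bits, lock time in ns) copied from Table 1-2
TABLE_1_2 = [
    (155.52, 10, 64.0),
    (622.08, 28, 44.8),
    (1244.16, 44, 35.2),
    (2488.32, 108, 43.2),
    (1250.0, 500, 400.0),
]


def lock_time_ns(bits, rate_mbps):
    """Bit count times bit period; at 1 Mbps one bit lasts 1000 ns."""
    return bits * 1000.0 / rate_mbps


if __name__ == "__main__":
    for rate, bits, tabulated in TABLE_1_2:
        computed = lock_time_ns(bits, rate)
        print(f"{rate:8.2f} Mbps: {bits:3d} bits -> {computed:6.1f} ns "
              f"(table: {tabulated} ns)")
```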

From Tables 1-1 and 1-2, we can observe that a conventional receiver for continuous-mode transmission, with its slow response and long settling time, is not suitable for the OLT. Owing to the fundamental difference between burst-mode and continuous-mode operation, a burst-mode receiver has a strict specification on lock time while a continuous-mode receiver does not. In addition, state-of-the-art burst-mode receivers already exceed 20 Gbps [3] and 33 Gbps [4], but they are fabricated in standard 90-nm CMOS technology, whereas state-of-the-art continuous-mode receivers already exceed 40 Gbps [5] in standard 0.18-µm CMOS technology. The OLT needs a burst-mode receiver to recover the burst signal.

## 1.3.3 Jitter Performance

Because most people are much more familiar with continuous-mode CDR circuits than with burst-mode CDR circuits, jitter performance may be one of the most confusing topics when a burst-mode CDR circuit is discussed.

Generally speaking, jitter performance comprises three items: jitter transfer, jitter tolerance, and jitter generation. Continuous-mode CDR circuits, as mentioned before, are applied in a series of data repeaters to periodically suppress the effect of fiber nonlinearities, so their jitter accumulates along the whole data path. The jitter specifications must therefore be very stringent to minimize the accumulated jitter, as in OC-192 or IEEE 802.3ae [2], whereas such constraints do not appear in papers on burst-mode CDR circuits.

Burst-mode CDR circuits are applied in the receiver of the OLT, which terminates the upstream path, so their jitter does not accumulate (or multiply). As a result, PON specifications constrain only the CDR circuits in the ONUs (the downstream direction), which are still continuous-mode, and the jitter performance of a burst-mode CDR circuit is labeled "N/A" (not available) in the documents.

## 1.3.4 Mask of Eye Diagram for Upstream Transmission

This mask mainly determines the limit on the static phase error of the CDR circuit, as shown in Table 1-3. Fortunately, the mask is not difficult for most CDR architectures to meet.



|       | 155.52Mbps | 622.08Mbps | 1244.16Mbps | 2488.32Mbps |
|-------|------------|------------|-------------|-------------|
| x1/x4 | 0.10/0.90  | 0.20/0.80  | 0.22/0.78   | N/A         |
| x2/x3 | 0.35/0.65  | 0.40/0.60  | 0.40/0.60   | N/A         |
| y1/y4 | 0.13/0.87  | 0.15/0.85  | 0.17/0.83   | N/A         |
| y2/y3 | 0.20/0.80  | 0.20/0.80  | 0.20/0.80   | N/A         |

Table 1-3 Mask of Eye Diagram for Upstream Transmission

# 1.4 Target Specification of Proposed CDR

According to the Passive Optical Network (PON) specifications, we set the target specification of the proposed burst-mode CDR in standard $0.18\mu m$ CMOS technology, as listed in Table 1-4.


| Parameter                           | Target                                   |
|-------------------------------------|------------------------------------------|
| Input Data Rate                     | 1.25 ~ 6 Gbps                            |
| Output Data Rate                    | 312.5 ~ 1500 Mbps                        |
| Locking Time                        | < 16 Bit Time                            |
| Static Phase Offset ( Phase Error ) | < 1/32 UI                                |
| PLL Frequency                       | 260 ~ 1750 Mbps                          |
| Bit Error Rate (BER)                | < $10^{-12}$ @ Input Sensitivity 30 mV   |
| Maximum Zero Block Run              | > 15 Bits                                |
| Frequency Offset Tolerance          | > 2000 ppm                               |
| Jitter Tolerance                    | Pass OC-48 Mask                          |
| Output Peak-to-peak Jitter          | < 50 psec                                |

Table 1-4 Target Specification of Proposed Burst-Mode CDR

# 1.5 Thesis Organization

Chapter 2 introduces the main categories of burst-mode CDR circuits and examines which architectures are suitable for high-speed operation and good jitter performance. Chapter 3 presents the proposed 1.25~6 Gbps burst-mode CDR and describes its design in detail. Chapter 4 describes a wide-range PLL with automatic band selection. Chapter 5 presents the experimental results of the CDR. Finally, Chapter 6 gives the conclusion and future work.



# Chapter 2

# Categories of Burst-Mode Clock and Data Recovery



The concept of the clock and data recovery (CDR) circuit has been developed for decades, and countless CDR circuits have been proposed. With the substantial growth in bandwidth demand, the operating speed of CDR circuits must keep improving.

Indeed, a few years ago a great many papers on high-speed CDR circuits were published [6]-[10], but few of them focused on burst-mode CDR circuits because the PON was not yet practicable. Most of the proposed high-speed CDR circuits were therefore dedicated to SONET [11][12], Gigabit Ethernet, or other optical communication systems [13]. Only with the emergence of the PON have high-speed burst-mode CDR schemes become an attractive topic.

Although plenty of papers on burst-mode CDR circuits have appeared by now, we categorize them into three main types according to their architectures: "Phase-Locked-Loop-Based Burst-Mode Clock and Data Recovery" (PLL-based BMCDR), "Oversampling-Based Burst-Mode Clock and Data Recovery" (Oversampling-based BMCDR), and "Gated-Voltage-Controlled-Oscillator-Based Burst-Mode Clock and Data Recovery" (GVCO-based BMCDR). The first and second types originate from continuous-mode CDR but can also be adapted to burst-mode CDR, while the third is specialized for burst-mode CDR.

# 2.1 PLL-Based Burst-Mode CDR

A common phase-locked-loop-based (PLL-based) continuous-mode CDR topology is shown in Fig. 2-1 [5-7][14]. It is composed of a phase detector (PD), a low-pass filter (LPF), a voltage-controlled oscillator (VCO), and a de-multiplexer (Demux) with a decision circuit (commonly just a D flip-flop). The phase-locked loop aligns the phase of the recovered clock with the incoming data, so that the 90-degree clock samples midway between data transitions in the decision circuit to minimize the bit-error rate (BER). Sampling removes the noise and jitter accumulated in the incoming data stream, and the recovered data and clock are passed on to the following circuits.



Fig. 2-1 Phase-Locked-Loop-Based CDR Topology

Unfortunately, the conventional phase-locked loop (PLL) [15] and the timing circuit with a resonator [16], which are commonly used for clock recovery, are not suitable for burst-mode transmission. The phase-locked loop works properly only when the VCO frequency deviates from the data rate by less than the loop bandwidth, which may be on the order of megahertz. Moreover, the long time constant of the loop filter needed for PLL stability and the large resonator Q factor needed for low-jitter oscillation prevent convergence within the short burst packet header. A header long enough for synchronization would degrade data transmission efficiency and increase the service cost for subscribers.

## 2.2 Oversampling-Based Burst-Mode CDR

A typical oversampling-based CDR topology is shown in Fig. 2-2 [17][18]. This architecture can be decomposed into a CDR part and a PLL part. The upper loop performs clock and data recovery, while the lower loop serves as a clock generator.



Fig. 2-2 Oversampling-Based Burst-Mode CDR Topology

The PLL part is a conventional charge-pump phase-locked loop, except that a multi-phase voltage-controlled oscillator (MP-VCO) is adopted. A phase-frequency detector (PFD) compares the phase of the reference clock with that of the divider output and sends up/down signals to a charge pump (CP). After a low-pass filter removes the high-frequency components, the resulting control voltage adjusts the MP-VCO. Finally, the oscillation frequency settles at N times the reference clock frequency. The MP-VCO provides multi-phase clock signals to the CDR part to operate its digital circuits.

The CDR part consists of "Samplers", "Edge Detector", and "Control Logic". At the samplers, the data is sampled by the multi-phase clock, as illustrated in Fig. 2-3. This process is generally called "oversampling". In this case the data is 3X oversampled, which means that it is sampled by three clock phases within one data period. Furthermore, to relax the performance requirements of each circuit, the MP-VCO frequency is one-fourth of the data rate, so 12 phases are needed from the MP-VCO.

Using the sampler outputs, the edge detector detects data edges. As shown in Fig. 2-3, the output of the edge detector is high when there is a data transition and low when there is none. Intuitively, the edge detector can be an array of exclusive-OR (XOR) gates.

With these transition markers, the control logic extracts the phase of the data. Four sets of clocks, each composed of three phases, are connected to four multiplexers individually, and the control logic chooses one of the three phases in each set as the recovered clock so as to obtain the lowest bit-error rate. At the decision circuits, the data is retimed and de-multiplexed by the four recovered clocks. Because the optimum phase of the MP-VCO is picked as the recovered clock, this topology is also called "Phase-Picking-Based CDR".
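The sampler, edge-detector, and phase-picking steps described above can be sketched behaviorally as follows. This is a simplified model under stated assumptions, not the circuit of Fig. 2-2: the data is ideal (no noise or jitter), each bit is represented by exactly three samples, and the function names and the majority-vote rule in `pick_phase` are illustrative choices.

```python
OVERSAMPLING = 3  # samples per bit, as in the 3X example above


def oversample(bits, skew=0):
    """Model the samplers: repeat each bit three times, then drop `skew`
    samples to emulate an unknown phase between data and sampling clocks."""
    samples = [b for b in bits for _ in range(OVERSAMPLING)]
    return samples[skew:]


def detect_edges(samples):
    """XOR of adjacent samples: 1 marks a data transition between them."""
    return [a ^ b for a, b in zip(samples, samples[1:])]


def pick_phase(samples):
    """Choose the sampling phase farthest from the most frequent edge slot."""
    counts = [0] * OVERSAMPLING
    for i, edge in enumerate(detect_edges(samples)):
        counts[i % OVERSAMPLING] += edge
    edge_slot = counts.index(max(counts))  # edges lie between slot and slot+1
    return (edge_slot + 2) % OVERSAMPLING  # the remaining phase is mid-bit


def recover(samples):
    """Retime the data by keeping only the samples of the picked phase."""
    return samples[pick_phase(samples)::OVERSAMPLING]


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 0, 0, 1]
    for skew in range(OVERSAMPLING):
        print("skew", skew, "->", recover(oversample(data, skew)))
```

With a skew of two samples only one sample of the first bit survives, so the recovered stream starts at the second bit; the other two cases return the full pattern.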



Fig. 2-3 Timing Diagram of Oversampling-Based CDR

To tell whether a circuit can be applied to burst-mode systems, its locking time should be examined. In Fig. 2-2, the signal passes through "Samplers", "Edge Detector", "Control Logic", "Mux", and "Decision Circuit". Thus, the total locking time is

$$\begin{split} T_{Samplers} + T_{Edge} + T_{Control} + T_{MUX} + T_{Decision} \\ = 1 \times T_{Clock} + 1 \times T_{Clock} + C \times T_{Clock} + 1 \times T_{Clock} + 1 \times T_{Clock} \\ = (4 + C) \times T_{Clock} = (4 + C) \times R \times T_{Data} \end{split} \tag{2.1}$$

where $T_{Clock}$ is the period of the clock and $T_{Data}$ is the period of the data. "C" is the total number of cycles spent in "Control Logic", and "R" is the ratio of the data rate to the oscillation frequency.

Obviously, different specifications of locking time and different frequency ratio "R" will directly affect the maximum of "C". Their relationship is illustrated in Fig. 2-4. According to Fig. 2-4, we can observe two trends:

- (1) The higher the data rate, the larger the maximum cycles of "Control Logic".
- (2) The higher the "R", the smaller the maximum cycles of "Control Logic".

Consequently, 155 Mbps is hard to achieve no matter what "R" is. Even for the other three specifications, "R" should be 2 or 4. Considering the trade-off between lock time and the speed requirements of the digital circuits, "R" is reasonably chosen as 4. The faster the required lock, the more complex the digital logic becomes, so a trade-off between lock time and hardware complexity is inherent in this oversampling-based CDR topology. It also implies that the topology occupies a large chip area and dissipates significant power in the digital processor. Moreover, in a noisy environment, reducing the phase offset between the sampling phase and the input data makes the hardware much more complex and difficult to operate at high speed.
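The bound on "C" follows directly from Eq. (2.1): if the lock time must not exceed a budget of $L$ data bits, then $(4 + C) \times R \le L$. The sketch below evaluates the largest integer "C" for each case, assuming the GPON lock-time budgets of Table 1-2; the name `max_control_cycles` and the choice of ratios are illustrative.

```python
# GPON lock-time budget in bits for each data rate in Mbps (from Table 1-2)
GPON_LOCK_BITS = {155.52: 10, 622.08: 28, 1244.16: 44, 2488.32: 108}


def max_control_cycles(lock_bits, ratio):
    """Largest integer C satisfying (4 + C) * R <= lock_bits.

    A negative result means that even C = 0 misses the lock-time budget.
    """
    return lock_bits // ratio - 4


if __name__ == "__main__":
    for rate, bits in GPON_LOCK_BITS.items():
        row = {r: max_control_cycles(bits, r) for r in (1, 2, 4, 8)}
        print(f"{rate:8.2f} Mbps: {row}")
```

For example, with R = 4 the 155.52 Mbps case yields a negative value, while the 2488.32 Mbps case leaves 23 cycles for the control logic, which matches the two trends listed above.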



Fig. 2-4 Maximum Cycles of Control Logic "C" of Oversampling-Based CDR (within Different Specifications of Data Rate for GPON)

# 2.3 GVCO-Based Burst-Mode CDR

Unlike the PLL-based and oversampling-based circuits, the GVCO-based circuit was conceived for burst-mode acquisition; its typical architecture is shown in Fig. 2-5. This topology is also a more popular and suitable approach to achieving instantaneous phase alignment for clock recovery in burst-mode transmission.

This architecture can also be decomposed into one PLL part and one CDR core part. Though the PLL part is still a conventional PLL, this time, the VCO is replaced by a GVCO. "GVCO" is the abbreviation of Gated Voltage-Controlled-Oscillator [19]–[25].



Fig. 2-5 Gated-VCO-Based Burst-Mode CDR Topology

The VCO is well known, but the gated VCO is less common. "Gated" means that the VCO is composed of logic gates. For example, NAND gates can be connected as an oscillator, as illustrated in Fig. 2-6. When "VCO\_EN" is high, the GVCO oscillates as a five-stage VCO; when "VCO\_EN" is low, the GVCO stops. In short, a GVCO is a stoppable VCO controlled by the signal "VCO\_EN", and this stoppable VCO is the key to fast locking. Rather than acting as a clock generator, the PLL part in Fig. 2-5 serves as a "voltage generator". To make Gated-VCO1 oscillate at the desired frequency, it must be fed with an adequate control voltage. Therefore, Gated-VCO2 in the PLL part should be a replica of Gated-VCO1.
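The stop-and-restart behavior of a gated ring oscillator can be illustrated with a unit-delay logic model. The sketch below is a behavioral abstraction and not the circuit of Fig. 2-6: it assumes that only the first stage uses "VCO\_EN" as its second NAND input, that all other stages act as inverters, and that every gate has the same delay; the function names are hypothetical.

```python
STAGES = 5  # five-stage ring, as in the example above


def step(state, vco_en):
    """Advance the ring by one gate delay (all gates update together).

    Stage 0 is a NAND of the fed-back output and VCO_EN; the other stages
    have their second input tied high, so they behave as inverters.
    """
    new_state = [1 - (state[-1] & vco_en)]
    new_state += [1 - state[i - 1] for i in range(1, STAGES)]
    return new_state


def run(state, vco_en, steps):
    """Return the output waveform (last stage) and the final ring state."""
    wave = []
    for _ in range(steps):
        state = step(state, vco_en)
        wave.append(state[-1])
    return wave, state


if __name__ == "__main__":
    # With VCO_EN low the ring settles to a fixed pattern and stops.
    _, stopped = run([0] * STAGES, vco_en=0, steps=10)
    print("stopped state:", stopped)
    # With VCO_EN high it oscillates with a period of 2 * STAGES gate delays.
    wave, _ = run(stopped, vco_en=1, steps=20)
    print("output:", wave)
```

Because the ring always settles to the same state while disabled, every restart begins from the same phase, which is what allows the oscillator to align to a data edge immediately.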

To understand the operation, refer to Fig. 2-7, starting from "data burst in". When a data stream arrives, the system sends a "Burst" signal to the CDR circuit. After "Burst" is set high, "VCO\_EN" is low and Gated-VCO1 is stopped while the data is low. Once the data goes high, "VCO\_EN" goes high and Gated-VCO1 starts to oscillate in phase with the data.



Fig. 2-6 Gated-VCO Composed of NAND Gates with "VCO\_EN" Control

Once all of the above actions are complete, the recovered clock and data are obtained. When the data burst ends, "Burst" is set low. Then "VCO\_EN" goes low, Gated-VCO1 is free-running, and the control voltage $V_{ctrl}$ is shared with Gated-VCO2 in the PLL loop. Thus, after phase alignment, the circuit works as a continuous-mode CDR.



Fig. 2-7 Control State Flow Diagram of GVCO-Based CDR

It is important that the clock aligns with the incoming burst data smoothly. If not, the control voltage of the GVCO would be disturbed and the lower loop would have to spend time settling again. Because the circuit must be ready to receive data at any time, the control logic is responsible for a smooth transition; otherwise, reception of the next burst is corrupted. Moreover, dual VCO cores are needed in this topology. For wide-range applications, the frequency offset and noise coupling between the two gated VCOs become critical issues for reliable operation.

The largest problem with the GVCO-based CDR topology is discussed in [1][26]. A timing circuit with a GVCO is a compact way to generate the recovered clock, for example by combining two signals from GVCOs triggered by the data transitions. However, undesirable microscopic pulses on the recovered clock are reported when there is a small difference between the oscillation frequency of the GVCO and the bit rate of the incoming data [27]. It is also pointed out that pulse-width distortion in the incoming data changes the duty ratio of the recovered clock, because the GVCO is consistently triggered at both the rising and falling edges of the data [28]. Such a change in the duty ratio is an issue for a timing circuit that regenerates the data at the optimum phase in a GEPON OLT receiver, because the circuit has to tolerate a pulse-width distortion of up to 0.44 unit intervals (UI) in the optical data to comply with the IEEE 802.3ah standard [2]. This is a real disadvantage of the GVCO-based CDR topology.

# **Chapter 3**

# A Burst-Mode Clock and Data Recovery Circuit



# 3.1 Proposed CDR Topology

Fig. 3-1 shows the whole architecture of the CDR [29][30]. To lower the speed requirements, a CDR architecture based on a "four-channel" and "quarter-rate" clock is used. Both the positive and the negative transitions of the quarter-rate recovered clock (Clk45, Clk90, Clk135, Clk180, Clk225, Clk270, Clk315, Clk0) are used for sampling. The eight phases, provided to the whole CDR system, are generated by an on-chip wide-range phase-locked loop (PLL). The quarter-rate clock technique is exploited to ease the design of the 8-phase voltage-controlled oscillator (VCO). By means of the four-channel topology, the received serial data stream is recovered and converted to four parallel data streams. This topology thus functions as a 1:4 de-multiplexer, eliminating the need for an extra de-multiplexer and thereby reducing power consumption.

A pre-amplifier is applied at the data input of the CDR. Its task is to provide the optimum common-mode level for the bang-bang phase detector (BBPD) [31] and to amplify the incoming data to increase the timing margin of the samplers. The PD output data is further de-multiplexed and retimed, and then fed to the four-channel output at a maximum data rate of 1.5 Gbps per channel.

The phase of the recovered clock is adjusted in the phase interpolator (PI) core. Since the binary PD samples the incoming data at its transition to derive the required phase information, two orthogonal clocks are provided by the PI core. The phase interpolator interpolates the clock phase in each phase quadrant by a factor of 32. This results in a total of 128 phase steps in the $360^{\circ}(2\pi)$ phase circle of the quarter-rate clock, and in 32 phase steps with respect to the full data rate. It is important that the delays of the clock buffers are well matched, because any phase shift of the recovered clock relative to the other clocks reduces the jitter tolerance.
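The step counts quoted above can be checked with a short calculation. The symbols $\Delta\phi$ (one PI step in degrees of the quarter-rate clock) and $T_{Clk}$ (the quarter-rate clock period) are introduced here only for illustration:

$$\Delta\phi = \frac{360^{\circ}}{4 \times 32} = \frac{360^{\circ}}{128} = 2.8125^{\circ}, \qquad T_{Clk} = 4\,T_{Bit} \;\Rightarrow\; \frac{128\ \text{steps}}{4\ \text{bits}} = 32\ \text{steps per UI}$$

One PI step therefore corresponds to 1/32 UI at the full data rate, which is the resolution that the static phase offset target in Table 1-4 refers to.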

For proper functionality, the PI needs input signals with a sufficiently low slew rate; this is achieved by placing a low-pass filter at its input. The phase of the recovered clock is controlled by the digital dynamic loop filter. The proposed digital dynamic loop filter uses a digital counter as the filter for the polarity signals from the bang-bang phase detector. Furthermore, a binary search method is used to make the CDR lock faster and so meet the burst-mode lock-time specification. The binary search operates four times to guarantee that the static phase error is less than 1/32 UI in the lock condition.
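One way to see why four binary-search iterations can reach 1/32 UI is the following sketch. It is an idealized model under assumptions that are not stated in the text: the initial phase error lies within ±1/2 UI, the bang-bang detector reports only the sign of the error, and the phase is corrected by 1/4, 1/8, 1/16, and 1/32 UI in successive iterations. The function name and the step schedule are hypothetical.

```python
def binary_search_lock(initial_error_ui, iterations=4):
    """Return the residual phase error (in UI) after the binary search.

    Each iteration applies a correction of half the previous size in the
    direction indicated by the sign of the error (bang-bang decision).
    """
    error = initial_error_ui
    step = 0.25  # first correction, in UI (assumed)
    for _ in range(iterations):
        if error > 0:
            error -= step
        else:
            error += step
        step /= 2
    return error


if __name__ == "__main__":
    for start in (-0.5, -0.3, 0.0, 0.2, 0.5):
        print(f"start {start:+.3f} UI -> residual "
              f"{binary_search_lock(start):+.5f} UI")
```

In this model the error bound halves with every iteration, from 1/2 UI down to 1/32 UI after the fourth one; the remaining error would then be handled by the normal tracking loop.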

The digital dynamic loop filter uses 31 thermometer codes to encode the phase update signals in terms of four quadrants and 32 phase steps inside each quadrant. The DAC block converts the 2-bit quadrant and 5-bit step signals into tail currents that steer the phase interpolator; the equivalent resolution of the DAC is 7 bits. The two quadrant bits, which determine the current quadrant, are controlled by the finite state machine (FSM). By monitoring the 31 thermometer codes from the digital dynamic loop filter and the polarity signal from the bang-bang phase detector, the FSM guarantees that the phase of the quarter-rate clock can track the data over the whole $360^{\circ}(2\pi)$ phase circle.
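The relationship between the thermometer code, the quadrant bits, and the 7-bit phase word can be sketched as follows. This is a behavioral abstraction with assumed conventions: the thermometer code is a list of 31 bits whose number of ones gives the step, polarity 1 moves the phase up by one step, and a step overflow or underflow moves to the adjacent quadrant. The function names are illustrative and do not describe the actual FSM.

```python
def thermometer_to_step(code):
    """Decode a 31-bit thermometer code (list of 0/1) into a step 0..31."""
    if len(code) != 31:
        raise ValueError("expected 31 thermometer bits")
    return sum(code)


def dac_word(quadrant, step):
    """Pack the 2-bit quadrant and the 5-bit step into a 7-bit word."""
    return (quadrant << 5) | step


def update(quadrant, step, polarity):
    """Move one phase step up (polarity=1) or down (polarity=0).

    The 7-bit index wraps modulo 128, so leaving the last step of a
    quadrant enters the neighboring quadrant and the phase can rotate
    around the full 360-degree circle.
    """
    index = (dac_word(quadrant, step) + (1 if polarity else -1)) % 128
    return index >> 5, index & 0x1F


if __name__ == "__main__":
    step = thermometer_to_step([1] * 31)
    print("step:", step, "word:", dac_word(0, step))
    print("next:", update(0, step, polarity=1))
```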



Fig. 3-1 Whole CDR Topology

## 3.2 Bang-Bang Phase Detector

When the CDR operates at high speed, almost all blocks run at one quarter of the data rate, except for the phase detector, which must operate at the full data rate. The phase detector therefore plays an important role in the whole CDR system. Our target is to design an improved phase detector that operates at high speed.

## 3.2.1 Conventional Bang-Bang Phase Detector

Phase detectors can be classified into linear [32] and binary ones [31][33]. The linear type is more suitable for high-speed operation when implemented in current-mode logic (CML). However, for SoC integration, the binary type has more advantages than the linear type. We therefore choose the Alexander (bang-bang) phase detector [31].

The conventional Alexander PD is shown in Fig. 3-2. The PD employs three flip-flops to strobe the data. The three flip-flops sample the incoming data with three of the eight clock phases (Clk45, Clk90, Clk135). With quarter-rate sampling, the hold time of all the logic can be four times that required in full-rate operation.

The PD also employs two XOR gates. Each compares two consecutive samples and generates an Up or Down output pulse, which carries the phase information, if an edge has occurred. If no transition is sampled in the incoming data, neither an Up nor a Down pulse is generated, and Enable is not asserted to drive the digital dynamic loop filter.



Fig. 3-2 Conventional Alexander Phase Detector Architecture

### **CDR Tracking Method**

The phase lead and lag are defined as shown in Fig. 3-3. Suppose the data "0" is sampled by Clk45, Clk90, and Clk135, giving A1, A2, and A3 in Fig. 3-2. If A1 $\oplus$ A2=1 and A2 $\oplus$ A3=0, we define "Clk Late" and the polarity=0 (Down=1), as shown in Fig. 3-3(a). The recovered clock interpolated by the PI is then shifted toward the in-phase clock. On the other hand, if A1 $\oplus$ A2=0 and A2 $\oplus$ A3=1, we define "Clk Early" and the polarity=1 (Up=1), as shown in Fig. 3-3(b). The recovered clock interpolated by the PI is then shifted toward the quadrature clock. When the CDR is in the lock condition, Clk45, Clk90, and Clk135 are aligned with the data center, edge, and center, respectively, as shown in Fig. 3-3(c).
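The decision rule above can be written as a small truth-table sketch. It models only the logic of the conventional detector (two XOR gates and an OR gate for Enable), not its timing; the function name `alexander_pd` is illustrative.

```python
def alexander_pd(a1, a2, a3):
    """Return (up, down, enable) from three consecutive samples (0 or 1).

    a1, a2, a3 are the samples taken by Clk45, Clk90, and Clk135.
    """
    down = a1 ^ a2        # transition between A1 and A2: "Clk Late"
    up = a2 ^ a3          # transition between A2 and A3: "Clk Early"
    enable = up | down    # no transition -> loop filter is not driven
    return up, down, enable


if __name__ == "__main__":
    for samples in [(0, 1, 1), (0, 0, 1), (0, 0, 0), (1, 1, 1)]:
        print(samples, "->", alexander_pd(*samples))
```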





Fig. 3-3 (a) Clk Early Condition (b) Clk Late Condition (c) Clk Alignment Condition

### **Timing Diagram of Conventional Alexander PD**

Fig. 3-4 illustrates the timing diagram of the conventional Alexander phase detector. $T_{Ck-Q}$ is the delay from clock sampling to data out, $T_{XOR}$ and $T_{OR}$ are gate delays, and $T_{Bit}$ is one bit time, or unit interval (UI). For the "Clk Early" case in Fig. 3-4(a), the delay from clock sampling to the polarity (or Up/Down) signal, $T_{D-Polarity}$, is:

$$T_{D-Polarity}(\text{Clk Early}) = T_{Bit} + T_{Ck-Q} + T_{XOR} \tag{3.1}$$

The delay from clock sampling to the enable signal, $T_{D-Enable}$, is:

$$T_{D-Enable}(\text{Clk Early}) = T_{Bit} + T_{Ck-Q} + T_{XOR} + T_{OR} \tag{3.2}$$

Moreover, for the "Clk Late" case in Fig. 3-4(b), $T_{D-Polarity}$ and $T_{D-Enable}$ are:

$$T_{D-Polarity}(\text{Clk Late}) = T_{Bit}/2 + T_{Ck-Q} + T_{XOR} \tag{3.3}$$

$$T_{D-Enable}(\text{Clk Late}) = T_{Bit} + T_{Ck-Q} + T_{XOR} + T_{OR} \tag{3.4}$$



Fig. 3-4 Timing Diagram of Conventional Alexander Phase Detector (a) Clk Early Condition (b) Clk Late Condition

## 3.2.2 Proposed Bang-Bang Phase Detector

The bottleneck of the conventional bang-bang phase detector is its operating frequency. Its speed limit is about 4.5 Gbps in $0.18\mu m$ CMOS technology because of the long gate delay. Therefore, we have to improve the speed by reducing the gate delay.

Observe Fig. 3-3 again. To obtain the polarity, we only need to compare the first two samples (sampled by Clk45 and Clk90). If the two sample values are different, we define "Clk Late" and the polarity=0, as shown in Fig. 3-3(a). On the other hand, if the two sample values are the same, we define "Clk Early" and the polarity=1 (Up=1), as shown in Fig. 3-3(b). The Up and Down signals of the conventional Alexander PD are thus replaced by the polarity signal and its inverse. Likewise, for the Enable signal we only need to compare the first and the third samples (sampled by Clk45 and Clk135) to detect whether a data transition has occurred. Consequently, once the second sample is ready, the polarity and enable signals follow shortly; they are final, however, only after all three samples are ready. This method shortens the operating time. The proposed Alexander phase detector (type-I) is illustrated in Fig. 3-5.
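The simplified rule can be checked exhaustively against the conventional one. The sketch below models the proposed logic as described above and compares its polarity with the conventional Up output (A2 $\oplus$ A3) for every sample pattern that contains a single transition between the first and third samples; the function names are illustrative.

```python
from itertools import product


def proposed_pd(a1, a2, a3):
    """Return (polarity, enable) of the simplified detector.

    polarity = 1 ("Clk Early") when the first two samples are equal,
    polarity = 0 ("Clk Late") when they differ; enable flags a transition
    between the first and the third sample.
    """
    polarity = 1 - (a1 ^ a2)
    enable = a1 ^ a3
    return polarity, enable


def conventional_up(a1, a2, a3):
    """Up output of the conventional detector, used here as the reference."""
    return a2 ^ a3


def agree_on_transitions():
    """True if both rules give the same polarity whenever enable is set."""
    return all(
        proposed_pd(a1, a2, a3)[0] == conventional_up(a1, a2, a3)
        for a1, a2, a3 in product((0, 1), repeat=3)
        if a1 != a3
    )


if __name__ == "__main__":
    print("rules agree when a transition is present:", agree_on_transitions())
```

When Enable is low the polarity value is ignored, so the two rules only need to agree on the patterns tested here.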

There is another important reason why the polarity signal and its inverse replace the Up and Down signals of the conventional Alexander PD: they allow tight integration with the dynamic digital loop filter that follows the proposed bang-bang phase detector.



Fig. 3-5 Proposed Alexander Phase Detector (Type-I)

The second improvement is the proposed TSPC-type D flip-flop (DFF) with an embedded XOR gate. It needs only a single clock phase and performs the XOR operation without adding extra delay, so the second and third DFFs can each be merged with an XOR gate. The proposed Alexander phase detector (type-II) is shown in Fig. 3-6, and the proposed TSPC-type DFF with embedded XOR gate is shown in Fig. 3-7(b). The XOR operation is embedded in the first stage, and the fourth stage (an inverter) is eliminated. Compared with the conventional TSPC-type DFF shown in Fig. 3-7(a), the delay from clock sampling to data out ($T_{Ck-Q}$) of the DFF is almost the same. Nevertheless, one XOR gate delay is removed, which raises the overall operating frequency of the bang-bang phase detector.



Fig. 3-6 Proposed Alexander Phase Detector (Type-II)



Fig. 3-7 (a) Conventional TSPC-Type DFF (b) Proposed TSPC-Type DFF with Embedded XOR Gate

#### **Timing Diagram of Proposed Alexander PD**

Fig. 3-8 illustrates the timing diagram of the proposed Alexander phase detector. For the "Clk Early" case in Fig. 3-8(a), the delay from clock sampling to the polarity output, $T_{D-Polarity}$, is:

$$T_{D-Polarity}(\text{Clk Early}) = T_{Bit}/2 + T_{Ck-Q} \tag{3.5}$$

The delay from clock sampling to the enable output, $T_{D-Enable}$, is:

$$T_{D-Enable}(\text{Clk Early}) = T_{Bit} + T_{Ck-Q} \tag{3.6}$$

For the "Clk Late" case in Fig. 3-8(b), $T_{D-Polarity}$ and $T_{D-Enable}$ are:

$$T_{D-Polarity}(\text{Clk Late}) = T_{Bit}/2 + T_{Ck-Q} \tag{3.7}$$

$$T_{D-Enable}(\text{Clk Late}) = T_{Bit} + T_{Ck-Q} \tag{3.8}$$






Fig. 3-8 Timing Diagram of Proposed Alexander Phase Detector

(a) Clk Early Condition (b) Clk Late Condition

## 3.2.3 Delay Time Comparison

Comparing (3.1) and (3.2) with (3.5) and (3.6), the improvements in $T_{D-Polarity}$ and $T_{D-Enable}$ for the "Clk Early" case are:

$$T_{D-Polarity-Improve}(\text{Clk Early}) = T_{Bit}/2 + T_{XOR} \tag{3.9}$$

$$T_{D-Enable-Improve}(\text{Clk Early}) = T_{XOR} + T_{OR} \tag{3.10}$$

Likewise, comparing (3.3) and (3.4) with (3.7) and (3.8), the improvements for the "Clk Late" case are:

$$T_{D-Polarity-Improve}(\text{Clk Late}) = T_{XOR} \tag{3.11}$$

$$T_{D-Enable-Improve}(\text{Clk Late}) = T_{XOR} + T_{OR} \tag{3.12}$$

Comparing the timing diagram of the proposed Alexander PD with that of the conventional one shows a clear reduction in the overall PD delay. In simulation, the operating speed of the BBPD rises to more than 6 Gbps in $0.18\,\mu\text{m}$ CMOS technology.
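As a quick consistency check, the delay savings can be recomputed from the conventional and proposed delay expressions. The sketch below is illustrative only: the four delay values are placeholders chosen for the arithmetic and are not taken from the simulations in this work.

```python
# Placeholder delays in picoseconds; NOT values from this work.
T_BIT, T_CKQ, T_XOR, T_OR = 200.0, 80.0, 60.0, 50.0

conventional = {
    ("polarity", "early"): T_BIT + T_CKQ + T_XOR,          # (3.1)
    ("enable", "early"): T_BIT + T_CKQ + T_XOR + T_OR,     # (3.2)
    ("polarity", "late"): T_BIT / 2 + T_CKQ + T_XOR,       # (3.3)
    ("enable", "late"): T_BIT + T_CKQ + T_XOR + T_OR,      # (3.4)
}
proposed = {
    ("polarity", "early"): T_BIT / 2 + T_CKQ,              # (3.5)
    ("enable", "early"): T_BIT + T_CKQ,                    # (3.6)
    ("polarity", "late"): T_BIT / 2 + T_CKQ,               # (3.7)
    ("enable", "late"): T_BIT + T_CKQ,                     # (3.8)
}

# Saving = conventional delay minus proposed delay, per signal and case.
improvement = {key: conventional[key] - proposed[key] for key in conventional}

for (signal, case), saved in improvement.items():
    print(f"{signal:8s} {case:5s} saves {saved:6.1f} ps")
```

Whatever values are substituted, the differences reduce to the expressions in (3.9) to (3.12), since $T_{Ck-Q}$ cancels in every case.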

## 3.3 Dynamic Digital Loop Filter

## 3.3.1 Proposed Dynamic Digital Loop Filter

The digital dynamic loop filter controls the interpolated phase, that is, how far it leans toward the in-phase or the quadrature clock. The proposed digital dynamic loop filter, shown in Fig. 3-9(a), uses a conventional digital counter [34][36] as a filter for the polarity signals from the bang-bang phase detector. The digital counter acts as a first-order path that tracks the instantaneous phase error. The overall CDR loop bandwidth is primarily determined by this first-order path.

For better control of the loop dynamics, Ref. [37] suggests adding an integral path in parallel with the existing proportional path in the CDR loop, which expands the first-order loop into a second-order bang-bang loop. A second-order loop was not considered here for the sake of design simplicity and a reduced risk of instability, and because the required performance is achieved with a simple first-order loop.

Most importantly, our CDR must meet the strict lock-time specification of the passive optical network (PON) system. A digital counter alone is not enough to achieve this target. Therefore, we include a binary search method in our dynamic loop filter. The binary search operates four times to guarantee that the static phase error is less than 1/32 UI in the lock condition. The binary/linear search dual-mode operation not only reduces the lock time but also increases the data tracking ability; these two properties are traded off against each other in the conventional PLL-based CDR topology.

The binary search method is composed of two blocks, a 5-bit successive approximation register (SAR) controller and a 31-bit digital-to-analog converter (DAC) controller, as illustrated in Fig. 3-9(a). Our design target is to lock to the data within 16 bit times (UI). In other words, for the four-channel design, the clock in each channel must lock to the data within only 4 bit times. This implies four binary search iterations, which take place during the first 16 bit times just after the burst-mode enable signal arrives. This condition is called "Binary Search Mode", shown in Fig. 3-9(b). In binary search mode, the 5-bit SAR controller counts the binary search iterations and directs the 31-bit DAC controller to execute the binary search. The 31-bit DAC controller delivers the 31-bit thermometer code D[30:0] to the next block, the current-steering DAC, which is introduced in the next section. The circuit schematics are presented later.

After the four iterations are done, the fifth count of the 5-bit SAR controller is the "Lock Detect" bit. "Lock Detect" switches the loop filter from the upper path to the lower path, as shown in Fig. 3-9(c). The CDR is now in the lock condition and starts to track the data step by step. This condition is called "Linear Search Mode". The digital counter directs the 31-bit DAC controller to execute the linear search and is programmable through the "Loop\_Program" bit to adjust the overall CDR loop bandwidth.





Fig. 3-9 (a) The Architecture of Proposed Dynamic Digital Loop Filter

- (b) Proposed Dynamic Digital Loop Filter operates at Binary Search Mode
- (c) Proposed Dynamic Digital Loop Filter operates at Linear Search Mode

## 3.3.2 Binary Search Algorithm

Consider an example of a four-step binary search. If the optimum interpolated point is at the seventh position in quadrant I, the search sequence is:

$$16 \xrightarrow{-8} 8 \xrightarrow{-4} 4 \xrightarrow{+2} 6 \xrightarrow{+1} 7 \tag{3.13}$$

Fig. 3-10(a) shows this example on the I-Q constellation circle. A1 to A5 mark the successive binary search iterations and are counted by the 5-bit SAR controller. Our implementation is illustrated in Fig. 3-10(b). We use 31 registers to store the 31 thermometer-code bits, which represent the 31 interpolated phase positions in a quadrant in Fig. 3-10(a). The initial position is the middle point, the 16th position, expressed as 15 registers holding "0" and 16 registers holding "1". The number of registers holding "1" denotes the phase position. We execute the binary search by shifting the "1"s stored in the 31 registers. Thus, in this example, seven registers hold "1" after the four-step binary search.
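The search sequence in (3.13) can be reproduced with a short behavioral sketch. The function names are illustrative, and the comparison `position > target` stands in for the polarity decision of the phase detector, which is an assumption of this sketch rather than a description of the circuit.

```python
def binary_search_positions(target: int, start: int = 16,
                            steps=(8, 4, 2, 1)) -> list[int]:
    """Return the positions visited, beginning with the start position.

    `position > target` replaces the phase-detector polarity decision
    (an assumption made to keep the sketch self-contained).
    """
    position = start
    visited = [position]
    for step in steps:
        position += -step if position > target else step
        visited.append(position)
    return visited


def thermometer(position: int, width: int = 31) -> list[int]:
    """Thermometer code with `position` ones (placing the ones first is an assumption)."""
    return [1] * position + [0] * (width - position)


print(binary_search_positions(7))   # [16, 8, 4, 6, 7]
print(sum(thermometer(7)))          # 7
```

The five visited positions correspond to the counts A1 to A5, and the final thermometer code holds seven ones, as in the example above.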





Fig. 3-10 (a) An Example of 4-times Binary Search Expression in I-Q Circle

(b) An Example of 4-times Binary Search Expression with 31 Registers

#### 3.3.3 5-bit SAR Controller

The 5-bit successive approximation register (SAR) controller is shown in Fig. 3-11. Fig. 3-11(a) and (b) show the switch-type and the mux-type architectures, respectively. The function of the SAR controller is to count the binary search iterations, so it can act as a shift register, as in Fig. 3-11(c). The initial value of A1-A5 is "10000", and the "1" is then shifted from A1 to A5.

The "Load" signal emulates the burst-mode enable signal. The "Program" bit controls the pulse width of the counting bit and thereby the period of A1-A5. This guards against interpolation errors when the phase interpolator settles slowly. Fig. 3-11(d) and (e) illustrate the operation of the 5-bit programmable SAR controller for Program = 0 and Program = 1, respectively. When Program = 1, the period of each counting bit is doubled.
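The shift-register behavior described above can be sketched as follows. The function name `sar_states` is illustrative, and two details are assumptions of this sketch: the exact clocking relative to "Load", and holding the final state once A5 is set.

```python
def sar_states(program: int, clocks: int) -> list[str]:
    """One-hot A1..A5 pattern on each clock after "Load".

    program = 0: the "1" advances on every clock.
    program = 1: the "1" advances on every second clock.
    Holding the last state once A5 is set is an assumption.
    """
    hold = 2 if program else 1
    states = []
    for n in range(clocks):
        index = min(n // hold, 4)          # position of the "1" (0 = A1)
        states.append("".join("1" if i == index else "0" for i in range(5)))
    return states


print(sar_states(program=0, clocks=5))
print(sar_states(program=1, clocks=10))
```

With Program = 1 each one-hot state appears twice in a row, which gives the phase interpolator twice as long to settle before the next binary search step.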





Fig. 3-11 (a) The Switch-Type Architecture of 5-bit Programmable SAR Controller

(b) The Mux-Type Architecture of 5-bit Programmable SAR Controller

(c) The 5-bit Programmable SAR Controller operates as Shift Register

(d) The 5-bit Programmable SAR Controller operates when Program = 0

(e) The 5-bit Programmable SAR Controller operates when Program = 1

#### 3.3.4 Proposed 31-bit DAC Controller

A conventional binary search is implemented with a 5-bit full adder. In addition, an extra binary-to-thermometer-code converter, or 5-bit decoder, is necessary to drive the 5-bit current-steering DAC. However, this approach is not suitable for high-speed operation. We propose the 31-bit DAC controller whose schematic is shown in Fig. 3-12(c). It is composed of 31 cells, and the corresponding chart is Fig. 3-10(b). As mentioned before, the initial value is 15 registers holding "0" and 16 registers holding "1". We execute the binary search by shifting the "1"s stored in the 31 registers. The first binary search iteration shifts by 8 bits; the second, third, and fourth iterations shift by 4, 2, and 1 bit(s), respectively. A switch array selects the step size according to the binary search iteration, as shown in Fig. 3-12(a). Fig. 3-12(b) shows the switch array implementation for Fig. 3-12(a).
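To illustrate why the direct shifting approach can replace the adder and decoder, the sketch below runs both in parallel and checks that they agree after every step. It is a behavioral model only; the helper names `decode` and `shift` and the bit ordering (ones stored at the low indices) are assumptions.

```python
WIDTH = 31

def decode(value: int) -> list[int]:
    """Binary-to-thermometer decoder, as needed by the conventional approach."""
    return [1] * value + [0] * (WIDTH - value)

def shift(code: list[int], step: int, up: bool) -> list[int]:
    """Shift `step` ones into (up) or out of (down) the register chain."""
    if up:
        return [1] * step + code[:WIDTH - step]
    return code[step:] + [0] * step

value = 16            # conventional: 5-bit arithmetic, decoded afterwards
code = decode(16)     # proposed: operate directly on the thermometer code

# Step sizes 8, 4, 2, 1 with the directions of the example in (3.13).
for step, up in [(8, False), (4, False), (2, True), (1, True)]:
    value += step if up else -step
    code = shift(code, step, up)
    assert code == decode(value)   # both approaches agree after each step

print(value, sum(code))
```

The shifting version needs no arithmetic and no decoding stage, which is the property the proposed controller relies on for high-speed operation.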





Fig. 3-12 (a) Switch Array to choose different steps according to Binary Search Times

- (b) Switch Array Implementation for (a)
- (c) The schematic of 31-bit DAC Controller Implementation

#### 3.3.5 Programmable Counter

The programmable counter sets the overall CDR loop bandwidth after lock, while the CDR operates in linear search mode. The counter size ranges from 4 to 8, and changing the counter size changes the loop bandwidth. The programmable counter acts as a low-pass filter that filters out jitter in the CDR system. This implies that the jitter tolerance and jitter transfer performance are controlled by the programmable counter. Appendix A introduces and analyzes the jitter tolerance and jitter transfer of our CDR system in detail.

The "Loop-Program" bit selects whether the output updates every 4 counts or every 8 counts, and the overall CDR loop bandwidth changes with this bit. The programmable counter operation controlled by "Loop-Program" is illustrated in Fig. 3-13(a), and its implementation schematic is shown in Fig. 3-13(b).
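The filtering effect of the counter can be illustrated with a small behavioral model. This is one plausible reading of "updates every 4 counts or every 8 counts": an up/down counter that emits a phase step when it reaches plus or minus its size and then resets. The function name and the reset behavior are assumptions, not details confirmed by the schematic.

```python
def counter_filter(polarities, size=4):
    """Up/down counter that emits a phase step when it reaches +/-size.

    polarities: iterable of 1 (up) / 0 (down) phase-detector decisions.
    Returns +1, -1 or 0 for each input decision.
    Resetting the count to zero after each emitted step is an assumption.
    """
    count = 0
    steps = []
    for p in polarities:
        count += 1 if p else -1
        if count >= size:
            steps.append(+1)
            count = 0
        elif count <= -size:
            steps.append(-1)
            count = 0
        else:
            steps.append(0)
    return steps


consistent = [1] * 8          # phase detector keeps asking for "up"
dithering = [1, 0] * 4        # jitter-like alternating decisions

print(counter_filter(consistent, size=4))   # two steps
print(counter_filter(consistent, size=8))   # one step
print(counter_filter(dithering, size=4))    # no steps
```

In this model a larger counter size passes fewer phase updates for the same input, which corresponds to a lower loop bandwidth and stronger jitter filtering.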





Fig. 3-13 (a) Programmable Counter Expression controlled by Loop-Program

(b) The schematic of Programmable Counter Implementation

## 3.4 Phase Interpolator with DAC

#### 3.4.1 Phase Interpolator Core

The circuit schematic of the phase interpolator (PI) core is shown in Fig. 3-14 [34] [35]. The phase interpolator performs phase mixing by the weighted summation of the quadrature clock pair (differential I-CK and Q-CK). The four differential pairs are switched on or off by the control signals from the digital-to-analog converter (DAC) and the quadrant control. Inductive peaking is required to extend the PI bandwidth, since half of the differential pairs act as capacitive load and do not amplify the clock signal. Only two differential pairs are turned on at any given time. The interpolator settling time is improved by never applying zero tail current to the interpolator branches. By selecting the proper two tail currents (I<sub>1</sub>-DAC or I<sub>2</sub>-DAC, and Q<sub>1</sub>-DAC or Q<sub>2</sub>-DAC), 128 interpolated phase steps on the 360° phase circle at quarter rate can be achieved.



Fig. 3-14 Phase Interpolator Core

#### 3.4.2 DAC and Quadrant Selection

The phase interpolator must be precise so as not to degrade the timing position of the recovered clock. Therefore, the interpolator uses a current-steering digital-to-analog converter (DAC) that supplies four tail currents (I<sub>1</sub>-DAC, I<sub>2</sub>-DAC, Q<sub>1</sub>-DAC, and Q<sub>2</sub>-DAC) to the four differential pairs, as shown in Fig. 3-15. The circuit selects the polarity of the phases (quadrant selection) and then interpolates between them to generate 32 phase positions within each quadrant, for a total of 128 on the 360° phase circle at quarter rate. The quadrant selection is implemented by two switched differential pairs and controlled by the finite state machine (FSM) introduced in the next section.

The resolution of the phase interpolator is 5 bits within each quadrant because it generates 32 phase positions. Thus, the digital-to-analog converter is also 5-bit (excluding the 2-bit quadrant selection), and the binary search operates four times so that the static phase error is less than 1/32 UI in the lock condition. In clock and data recovery, the PI steps up and down the phase trajectory in an effort to align the clock phase with the data phase. Selecting the optimal PI resolution involves a tradeoff. A higher phase resolution results in smaller static phase errors. However, it also means a reduced loop bandwidth and, therefore, a smaller tracking range.
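One way to see where the 1/32 UI figure comes from is the arithmetic below. It assumes that one period of the quarter-rate clock spans four unit intervals and that the 128 phase steps are treated as uniform; the constant names are illustrative.

```python
from fractions import Fraction

QUADRANTS = 4
STEPS_PER_QUADRANT = 32          # 5-bit resolution within each quadrant
UI_PER_CLOCK_PERIOD = 4          # quarter-rate clock: one period spans 4 UI

total_steps = QUADRANTS * STEPS_PER_QUADRANT
step_degrees = Fraction(360, total_steps)              # of the quarter-rate clock
step_ui = Fraction(UI_PER_CLOCK_PERIOD, total_steps)   # in unit intervals

print(total_steps, float(step_degrees), step_ui)
```

Under these assumptions a single phase step equals 1/32 UI, so resolving the phase to within one step keeps the static phase error within that bound.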

To reduce the nonlinearity in phase steps, the size of the bias transistor in each DAC cell must be fine tuned during simulation. The 31 steering DAC cells are not uniform. Their relative sizing, with the largest cells being switched near the quadrant boundaries, is optimized for the most linear relationship between digital control code and rotator output phase. Thus, the integral nonlinearity (INL) error will be reduced.



Fig. 3-15 Digital-to-Analog Converter and Quadrant Selection

#### 3.4.3 Orthogonal Interpolated Phase Analysis

Since simple orthogonal I-Q phase interpolation is applied in each phase quadrant, the interpolated clock signal, mixed by the weighted summation of the PLL quadrature clock pair, can be expressed as [36]:

$$S_{INT} = r \times \sin(\omega t + \theta) = I_i \times \sin(\omega t) + Q_i \times \sin(\omega t + \pi/2) \tag{3.14}$$

$$I_i = r \cdot \cos \theta = r \cdot \cos\left(\arctan\left(\frac{k}{32 - k}\right)\right) \tag{3.15}$$

$$Q_i = r \cdot \sin \theta = r \cdot \sin\left(\arctan\left(\frac{k}{32 - k}\right)\right) \tag{3.16}$$

$$\tan \theta = \frac{k}{32 - k} \tag{3.17}$$

where k is the number of interpolation steps taken on the 90° arc of the phase circle and ranges from 0 to 32. The parameter r denotes the amplitude to which the interpolated signal is limited. The orthogonally interpolated signal is represented in Fig. 3-16, which shows both the interpolated phase trace and the uniform phase trace. From the relationship between the two traces, we can derive (3.15), (3.16), and the orthogonal angle in (3.17).
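The identity in (3.14) can be checked numerically for every interpolation step. The sketch below is a verification aid only; the function names are illustrative and r is set to 1 for convenience.

```python
import math

def weights(k: int, r: float = 1.0, steps: int = 32) -> tuple[float, float, float]:
    """Return (theta, I_i, Q_i) for interpolation step k per (3.15)-(3.17)."""
    # atan2 equals arctan(k / (32 - k)) and stays defined at k = 32.
    theta = math.atan2(k, steps - k)
    return theta, r * math.cos(theta), r * math.sin(theta)

def interpolated(k: int, wt: float, r: float = 1.0) -> float:
    """Right-hand side of (3.14): weighted sum of the quadrature clocks."""
    _, i_w, q_w = weights(k, r)
    return i_w * math.sin(wt) + q_w * math.sin(wt + math.pi / 2)

# Compare both sides of (3.14) for every step and several time points.
worst = 0.0
for k in range(33):
    theta, _, _ = weights(k)
    for n in range(16):
        wt = 2 * math.pi * n / 16
        worst = max(worst, abs(interpolated(k, wt) - math.sin(wt + theta)))

print(f"largest mismatch: {worst:.2e}")
```

The mismatch stays at floating-point rounding level, which confirms that the weights in (3.15) and (3.16) produce a sinusoid of amplitude r shifted by the angle in (3.17).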




Fig. 3-16 The Relationship between Orthogonal Interpolated Phase and Uniform Phase

The orthogonal interpolated phase steps are compared with uniform phase steps in Fig. 3-17(a). As the 31 steering DAC cells switch on one by one, both the orthogonal interpolated angle and the uniform phase angle increase monotonically, but along slightly different trends. The phase error between them is shown in Fig. 3-17(b).
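The comparison can be reproduced from (3.17) alone. The sketch below computes the orthogonal interpolated angle and the ideal uniform angle for each step and reports where they differ most. It models only the idealized relation in (3.17), not the sized DAC cells of the actual circuit, and the function names are illustrative.

```python
import math

STEPS = 32

def orthogonal_angle(k: int) -> float:
    """Angle in degrees from (3.17), tan(theta) = k / (32 - k)."""
    return math.degrees(math.atan2(k, STEPS - k))

def uniform_angle(k: int) -> float:
    """Ideal angle in degrees for equally spaced steps over 90 degrees."""
    return 90.0 * k / STEPS

errors = [orthogonal_angle(k) - uniform_angle(k) for k in range(STEPS + 1)]
worst_k = max(range(STEPS + 1), key=lambda k: abs(errors[k]))

print(f"largest deviation: {errors[worst_k]:+.2f} degrees at k = {worst_k}")
```

In this idealized model the two angles coincide at k = 0, 16, and 32, and the deviation changes sign at the middle of the quadrant.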

Relative to uniform quantization, the orthogonal interpolation produces an additional non-uniform quantization phase error of up to 4.1 between these phases. This additional quantization phase error is a factor of three smaller than the uniform quantization phase error. Consequently, the non-uniform phase interpolation error can be ignored, and the simple orthogonal interpolation method, which simplifies the design, can be used. For more than 32 interpolation steps, the non-uniform interpolation phase error would have to be corrected by a more complicated design to maintain equal minimum phase steps. As a consequence, 32 interpolation phase steps and the chosen interpolation scheme are a good tradeoff between performance and implementation complexity.






Fig. 3-17 (a) The Comparison of Orthogonal Interpolated Phase Steps with Uniform Phase Steps

(b) The Phase Error between Orthogonal Interpolated Angle and Uniform Phase Angle

## 3.4.4 Phase Interpolator Core and PLL Interface

Fig. 3-18 shows the topology of the phase interpolator core and the PLL interface. A matched pair of CML differential buffers is inserted to compensate for the signal loss before the clocks enter the PI mixer core. The buffers also reshape these signals to be more sinusoidal, which ensures adequate overlap of the clock edges being interpolated.

The duty cycle correction output buffer follows the PI mixer core. The amplitude variation of the interpolated clock is not crucial, because only the zero crossing of the clock matters for sampling the data. In addition, some of the amplitude variation is filtered by this duty cycle correction clock buffer between the PI and the PD, because the buffer acts as a limiting amplifier. Moreover, this simple circuit works fairly well for sinusoidal clock signals. Since both edges of the in-phase and quadrature clocks are used to sample the received signal, the duty-cycle correction circuits play an important role in reducing systematic jitter on the recovered clock.





Fig. 3-18 (a) The Topology of Phase Interpolator Core and PLL Interface

- (b) The Schematic of Phase Interpolator Input CML Buffer
- (c) The Schematic of Duty Cycle Correction Output Buffer

## 3.5 Finite State Machine for Phase Rotation

#### 3.5.1 Phase Rotation Method

We assume the interpolated clock is initially in quadrant I. While the interpolated clock tracks the data, its angle will inevitably rise above $90^{\circ}$ ($\pi/2$) or fall below $0^{\circ}$. In other words, the interpolated clock must move to quadrant II or IV. The orthogonal clocks, I-CK and Q-CK, must then be inverted accordingly, as illustrated in Fig. 3-19.



Fig. 3-19 Phase Rotation Condition

Fig. 3-20 shows the overall architecture of the phase rotation. The finite state machine (FSM) controls the two quadrant bits (Quadrant Bit [1:0], or Q-Bit [1:0]) in the current-steering digital-to-analog converter (DAC), shown in Fig. 3-15, which determine the current quadrant. When the rotator steps across a quadrant boundary, the interpolation ratio stays constant, so only the polarity of one input phase needs to be switched.

Whether the rotator steps across a quadrant boundary is determined by checking the 31 thermometer-code bits D[30:0]. The "All one" signal, also called "Overflow", is high when all 31 thermometer-code bits are high. Similarly, the "All zero" signal, also called "Underflow", is high when all 31 thermometer-code bits are low.

If the "All one" signal is high and the polarity from the PD is also high, the interpolated phase must move from quadrant I to quadrant II. The two quadrant bits (Q-Bit [1:0]), initially "00", then become "10" and change the switching pairs of the current-steering DAC in Fig. 3-15. As a result, one of the two differential pairs that are turned on at any given time in the phase interpolator core in Fig. 3-14 is also changed.

Similarly, if the "All zero" signal is high and the polarity from the PD is low, the interpolated phase must move from quadrant I to quadrant IV. The two quadrant bits (Q-Bit [1:0]), initially "00", then become "01" and change the switching pairs of the current-steering DAC in Fig. 3-15. As a result, one of the two differential pairs that are turned on at any given time in the phase interpolator core in Fig. 3-14 is also changed.

Finally, if the quadrant changes (whether from an odd quadrant to an even quadrant or not), the "Inversion" signal becomes high to invert the polarity from the BBPD through a simplified switch.

By examining the 31 thermometer-code bits from the dynamic digital loop filter and the polarity signal from the bang-bang phase detector, the FSM guarantees that the phase of the quarter-rate clock can track the data over the whole $360^{\circ}$ ($2\pi$) phase circle.
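The rotation rule can be sketched as a small behavioral model. Several details are assumptions of this sketch rather than facts from the schematic: the encoding of quadrant III as "11", the reading of "Inversion" as reversing the counting direction in the even quadrants, and the idealized uniform 128-step circle used to check continuity. The function names are illustrative.

```python
# Assumed encoding of Q-Bit[1:0]; the text gives only I = "00", II = "10", IV = "01".
QUADRANT_BITS = {1: "00", 2: "10", 3: "11", 4: "01"}

def rotate(quadrant: int, ones: int, polarity: int) -> tuple[int, int]:
    """Apply one phase step; polarity 1 advances the phase, 0 retards it.

    `ones` is the number of high thermometer bits in D[30:0] (0..31).
    """
    inversion = quadrant in (2, 4)            # assumed: even quadrants count the other way
    count_up = (polarity == 1) != inversion   # polarity after the inversion switch
    overflow, underflow = ones == 31, ones == 0
    if (count_up and overflow) or (not count_up and underflow):
        # Boundary crossing: the code is kept, only the quadrant changes.
        direction = 1 if polarity == 1 else -1
        return (quadrant - 1 + direction) % 4 + 1, ones
    return quadrant, ones + (1 if count_up else -1)

def phase_index(quadrant: int, ones: int) -> int:
    """Position on the 128-step circle (idealized, uniform steps)."""
    within = ones if quadrant in (1, 3) else 31 - ones
    return 32 * (quadrant - 1) + within

state = (1, 16)
for _ in range(128):                          # one full turn in the "up" direction
    state = rotate(*state, polarity=1)
print(state, QUADRANT_BITS[state[0]])
```

Starting in quadrant I, an overflow with high polarity leads to quadrant II ("10") and an underflow with low polarity leads to quadrant IV ("01"), matching the two transitions described in the text. The remaining transitions follow from the assumed encoding.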



Fig. 3-20 The Architecture of the Phase Rotation controlled by FSM

#### 3.5.2 Finite State Machine Implementation

This phase rotation method can be represented by the state diagram illustrated in Fig. 3-21(a). The four states represent the four quadrants and are denoted "$Q_1Q_2$" (Quadrant Bit [1:0], or Q-Bit [1:0]). The three transition bits, denoted "POU", are the polarity from the BBPD, "Overflow" for the all-one condition, and "Underflow" for the all-zero condition, respectively. Therefore, this finite state machine (FSM) is implemented with five variables: "$Q_1$", "$Q_2$", "P", "O", and "U". Fig. 3-21(b), (c), and (d) show the K-maps of the state, Q-Bit[1] ("$Q_1$"), and Q-Bit[0] ("$Q_2$"), respectively. Finally, the logic implementation of the finite state machine is shown in Fig. 3-21(e).





Q₁: Quadrant Bit [1] Q₂: Quadrant Bit [0] P : Polarity O : Overflow U : Underflow






Fig. 3-21 (a) The State Diagram of Phase Rotation Method (b) K-Map of the State

- (c) K-map of Q-Bit [1]  $(Q_1)$  (d) K-map of Q-Bit [0]  $(Q_2)$
- (e) Finite State Machine Implementation

## 3.6 Pre-Amplifier

A pre-amplifier is utilized at the data input of the CDR. Its task is to provide the optimum common-mode level for bang-bang phase detector (BBPD) and to amplify the incoming data to full swing. Therefore, the timing margin of the samplers increases.

The pre-amplifier is composed of three stages, as shown in Fig. 3-22(a). The first stage is a level shifter that changes the common-mode level of the input Rx data. Inductive peaking is employed to enhance the bandwidth. Two cascaded broadband limiting amplifiers (LA) follow the level shifter. According to simulation, the two-stage limiting amplifier achieves an input sensitivity of about 30 mVp-p. The two-stage LA also removes amplitude noise. Fig. 3-22(b) shows the circuit schematic of the LA. It consists of two cascaded differential pairs with active feedback for wide bandwidth. The third stage is a CML-to-CMOS buffer, whose circuit schematic is illustrated in Fig. 3-22(c). It reshapes the signal into digital form to suit the four-channel bang-bang phase detector (BBPD).





Fig. 3-22 (a) The Architecture of the Pre-Amplifier

- (b) The Circuit Schematic of Limiting Amplifier (LA)
- (c) The Circuit Schematic of CML to CMOS Buffer

## 3.7 Data Retiming and Output Buffer

#### 3.7.1 Data Retiming and Synchronization

Data retiming and synchronization are added to further synchronize the data and to minimize bit errors due to D flip-flop metastability. The configuration of the implemented data retiming circuit is shown in Fig. 3-23(a). It is composed of eight D flip-flops and uses three of the eight recovered clock phases: Clk45, Clk135, and Clk225. Since the input signals of the retiming circuit, A, B, C, and D, are already sampled in the PD, there is no systematic timing skew between phase detection and data retiming. The timing diagram in Fig. 3-23(b) shows the detailed operation of the data retiming. Each DFF cell operates with a wide timing margin, which guarantees robust data retiming. Because every signal passes through the same number of DFFs (two), the retimed data, DATA[3:0], are all aligned synchronously with one phase of the recovered clock, Clk135, as shown in Fig. 3-23(b).



Fig. 3-23 (a) The Configuration of Data Retiming and Synchronization (b) Timing Diagram of Data Retiming and Synchronization

#### 3.7.2 Data Output Buffer

At the quarter-rate data and clock outputs,  $50\Omega$ -CML buffers [38] are included to drive test and measurement equipment. The schematic is two differential pairs with common resistive load. The last stage is matched to  $50\Omega$ , as shown in Fig. 3-24.



Fig. 3-24 The Schematic of Data Output Buffer

## 3.8 Phase Arrangement for Whole Four-channel CDR System

The phase arrangement must be considered carefully because we use eight phases at quarter rate. Our proposed bang-bang phase detector (BBPD) is designed to use three phases to sample the data. Besides the BBPD, the other blocks must also be triggered by a clock. Thus, each channel uses four phases. In other words, each of the eight phases is used twice, and the capacitive load on each phase is balanced. Table 3-1 presents the phase arrangement for the whole four-channel CDR system. The detailed topology of the whole four-channel CDR system with the phase arrangement is shown in Fig. 3-25.

| Channel                      | A      | B      | C      | D      |
|------------------------------|--------|--------|--------|--------|
| BBPD sampling phase 1        | Clk45  | Clk135 | Clk225 | Clk315 |
| BBPD sampling phase 2        | Clk90  | Clk180 | Clk270 | Clk0   |
| BBPD sampling phase 3        | Clk135 | Clk225 | Clk315 | Clk45  |
| Clock phase for other blocks | Clk180 | Clk270 | Clk0   | Clk90  |

Table 3-1 Phase Arrangement for Whole Four-Channel CDR System
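The load-balancing claim can be checked directly against Table 3-1. The sketch below copies the table into a dictionary (the key names `bbpd` and `other` are illustrative) and counts how often each of the eight phases is used.

```python
from collections import Counter

# Phase assignment copied from Table 3-1 (clock phases in degrees).
PHASES = {
    "A": {"bbpd": [45, 90, 135], "other": 180},
    "B": {"bbpd": [135, 180, 225], "other": 270},
    "C": {"bbpd": [225, 270, 315], "other": 0},
    "D": {"bbpd": [315, 0, 45], "other": 90},
}

usage = Counter()
for channel in PHASES.values():
    usage.update(channel["bbpd"])     # three sampling phases per channel
    usage[channel["other"]] += 1      # one phase for the remaining blocks

print(sorted(usage.items()))
```

Every phase from Clk0 to Clk315 appears exactly twice, so each clock phase drives the same number of loads.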



Fig. 3-25 The Topology of Whole Four-Channel CDR System with Phase Arrangement

## 3.9 CDR Behavior Model

The whole CDR system is complicated and hard to simulate, so it is inadvisable to simulate the whole CDR loop only at the transistor level. Because the majority of the CDR blocks are implemented digitally, behavioral verification is necessary before transistor-level simulation, and it costs less simulation time than transistor-level simulation alone. We verify the behavior of the whole CDR loop with two tools, Matlab Simulink and Verilog.

#### 3.9.1 CDR Behavior Verification by Matlab Simulink

Fig. 3-26(a) shows the behavioral model of the whole CDR built in Matlab Simulink. It consists of the phase detector in (b), the programmable counter in (c), the SAR controller in (d), the DAC controller in (e) and (f), the DAC and phase interpolator in (g), and the PRBS generator in (h).










Fig. 3-26 (a) Whole CDR Behavior Building Model by Simulink

- (b) Bang-Bang Phase Detector Behavior Building Model
- (c) Programmable Counter Behavior Building Model
- (d) SAR Controller Behavior Building Model
- (e) One Cell of DAC Controller Behavior Building Model
- (f) DAC Controller Behavior Building Model
- (g) DAC and Phase Interpolator Behavior Building Model
- (h) PRBS Generator Behavior Building Model

Fig. 3-27 illustrates the lock time verification of the whole CDR in Matlab Simulink. The interpolated clocks sample the data and adjust their position at each binary search iteration (the gray dashed line). After the binary search is done, the interpolated clocks align to the data center (the gray dashed line) or the data edge (the light gray dashed line), and the CDR is in the lock condition. The lock time is below 16 bit times according to the behavioral simulation.



Fig. 3-27 The Lock Time Verification of Whole CDR by Matlab Simulink

## 3.9.2 CDR Behavior Verification by Verilog

Fig. 3-28(a) shows the lock time verification of the whole CDR in Verilog. The result agrees with the Matlab Simulink result in Fig. 3-27: the lock time is below 16 bit times. Fig. 3-28(b) shows the frequency offset tolerance verification of the whole CDR in Verilog. In linear search mode, the frequency offset tolerance reflects the data tracking ability. According to Fig. 3-28(b), the maximum frequency offset tolerance is more than 2000 ppm. Finally, another important indicator for a CDR system is its maximum zero-block tolerance. The CDR can make errors when the phase detector sees no data transition for too long, which happens during a long run of consecutive "1" or "0" bits. Fig. 3-28(c) shows that, in the Verilog simulation, our CDR can tolerate up to 17 "zero" (or "one") blocks at best.





Fig. 3-28 (a) The Lock Time Verification of Whole CDR by Verilog

- (b) Frequency Offset Tolerance Verification of Whole CDR by Verilog
- (c) Maximum Zero-Block Tolerance Verification of Whole CDR by Verilog

## 3.10 CDR Simulation Result

After verifying the CDR loop behavior, we finally simulate the whole CDR circuit at the transistor level with HSPICE. Fig. 3-29(a) shows the lock time verification of the whole CDR in HSPICE. The result agrees with the Matlab Simulink and Verilog results in Fig. 3-27 and Fig. 3-28(a), respectively. The correctness of the four parallel recovered data streams (after retiming) is verified in Fig. 3-29(b), and the eye diagram of the recovered data (after the de-multiplexer) is shown in Fig. 3-29(c). Finally, the simulated peak-to-peak jitter (Jp-p) performance at three process corners (TT, FF, SS) is given in Table 3-2.










Fig. 3-29 (a) The Lock Time Verification of Whole CDR by HSPICE

- (b) The Correct Four-Parallel Recovered Data (after Retiming)
- (c) The Eye Diagram for Recovered Data ( after De-multiplexer )

| Corner | Peak-to-peak Jitter (Jp-p) |
|--------|----------------------------|
| TT     | 20 psec @ 1.5 Gbps         |
| FF     | 15 psec @ 1.5 Gbps         |
| SS     | 27 psec @ 1.5 Gbps         |

Table 3-2 Simulated Peak-to-peak Jitter (Jp-p) Performance in Three Corners

