Publisher: Taylor & Francis

## International Journal of Electronics

Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tetn20

# Design techniques for high-speed multirate multistage FIR digital filters

M.-C. Lin<sup>a</sup>, H.-Y. Chen<sup>b</sup> & S.-J. Jou<sup>b</sup> <sup>a</sup> Dept. of Electrical Engineering, National Central University, Jhongli, 320, Taiwan, ROC <sup>b</sup> Dept. of Electronics Engineering, National Chiao Tung University, Hsinchu, 300, Taiwan, ROC Published online: 20 Feb 2007.

To cite this article: M.-C. Lin , H.-Y. Chen & S.-J. Jou (2006) Design techniques for high-speed multirate multistage FIR digital filters, International Journal of Electronics, 93:10, 699-721, DOI: 10.1080/00207210600810838

To link to this article: <u>http://dx.doi.org/10.1080/00207210600810838</u>





M.-C. LIN\*<sup>†</sup>, H.-Y. CHEN<sup>‡</sup> and S.-J. JOU<sup>‡</sup>

 †Dept. of Electrical Engineering, National Central University, Jhongli, 320, Taiwan, ROC
 ‡Dept. of Electronics Engineering, National Chiao Tung University, Hsinchu, 300, Taiwan, ROC

(Received 17 October 2005; in final form 26 March 2006)

This paper presents architecture design techniques for implementing both single-rate and multirate high-speed finite impulse response (FIR) digital filters, with emphasis on multirate multistage interpolated FIR (IFIR) digital filters. Well-known techniques for achieving high speed and low power in the single-rate digital FIR architecture are summarized, followed by the introduction of variable filter order selection, optimal filter decomposition, memory-saving and mirror symmetric filter pairs techniques, which offer further gains in both performance and complexity reduction for the multirate multistage digital FIR architecture. A filter design example with TSMC 0.25 µm standard cells for a 64-QAM baseband demodulator shows that the area is reduced by 39% for low-complexity applications. Moreover, for high-speed applications, the chip can operate at 714 MHz. Finally, a decimator designed for CDMA cellular use shows that the area is reduced by 70% compared with the conventional approach.

Keywords: FIR digital filter; Multirate; Interpolated FIR filter

#### 1. Introduction

The applications of digital finite impulse response (FIR) filters and up/down sampling techniques are everywhere in modern electronic products such as multimedia, modems and mobile personal communications. For every electronic product, lower circuit complexity is always an important design target since it reduces the cost. Furthermore, for portable applications, a low-power low-complexity implementation is also the key factor. On the other hand, the trend of increasing data rates in digital signal processing (DSP) systems has pushed the development and implementation of high-speed digital FIR filters. High-speed and low-power applications require both increased parallelism and reduced complexity in order to meet both sampling rate and power dissipation goals.

Multirate signal processing (Crochiere and Rabiner 1983, Vaidyanathan 1993) uses different sample rates within a system to achieve computational efficiencies that are impossible to obtain with a system that operates on a single fixed sample rate. The two key components in multirate systems are the decimator and the interpolator. Multirate systems use high-speed decimators/interpolators to convert the sampling rate so that complicated processing may be performed at a lower data rate. The polyphase structure provides an efficient architecture for the realization of multirate systems through a bank of filters operating in parallel (Crochiere and Rabiner 1983, Vaidyanathan 1993). For narrow-band systems, the interpolated FIR (IFIR) filter has been introduced to reduce hardware complexity (Neuvo *et al.* 1984, Saramäki *et al.* 1988).

<sup>\*</sup>Corresponding author. Email: mclin@ee.ncu.edu.tw

In this paper, a general-purpose multirate multistage digital FIR filter based on the canonic signed digit (CSD) code representation (Reitwiesner 1966) and the IFIR filter design methodology is proposed. Various techniques are described that provide sufficient parallelism for high-speed operation while constraining the solution to a small hardware implementation. The widely known techniques are briefly summarized as an introductory tutorial in §2. New techniques which further reduce complexity are presented in §3. Design examples which use these architectures and techniques are presented and compared in §4. Conclusions are drawn in §5.

#### 2. FIR filter structure and design methods

The choice of structure for an FIR filter depends on factors such as hardware complexity, desired data throughput and power consumption. Many different structures exist, most of which provide some trade-off between complexity and throughput. For a dedicated application, the design choice then becomes the structure with minimal hardware complexity that can achieve a given throughput rate. The candidates can typically be classified into two architectures: single-rate and multirate digital FIR architectures. The single-rate digital FIR architecture typically uses a pipelined transpose direct form structure with CSD multipliers and carry-save adders (CSA) to realize a linear phase FIR filter (Lin et al. 2003, Jheng et al. 2004). In the multirate digital FIR architecture, a polyphase structure is used to achieve an efficient multirate filter implementation (Bellanger et al. 1976, Hawley et al. 1996, Jou et al. 1998). Basically, the polyphase multirate filter is a single-stage digital FIR architecture, although it is usually cascaded with an interpolation or a decimation stage for up/down sampling by a factor M. The polyphase structure results in an efficient use of the filter hardware since each subfilter operates at one-*M*th of the required throughput rate. For narrow-band systems, an IFIR filter has been used to further reduce hardware complexity. A sharp narrow passband filter can be decomposed into a periodic model filter and an image suppressor by using the IFIR filter design methodology (Neuvo et al. 1984, Saramäki et al. 1988, Jovanovic-Dolecek and Mitra 2005, Mehrnia and Daneshrad 2005). Therefore, a filter, a decimator and an interpolator can each be constructed as a multirate multistage digital FIR architecture using the polyphase and IFIR filter structures. Each subfilter of a multirate single-stage or multistage digital FIR architecture can still be realized using any of the many available single-rate digital FIR architectures, several of which are described below.

#### 2.1 Linear phase transpose direct form FIR filter structure

The transpose direct form structure of an FIR filter repositions the delay elements of the direct form structure such that the input is fed to every tap and the results are accumulated over N sample periods, where N is the tap length of the filter. It retains the regularity of the direct form structure, but the critical path is only one multiplication and one addition. Therefore, the system throughput rate is independent of the filter length. One of the primary disadvantages of this structure is the large loading on the input data broadcast bus, since all multipliers are fed in parallel. This effect can be reduced by using appropriate data buffers and distributing the input bus as a tree-like structure. Another disadvantage is that the delay elements are larger, since they hold the accumulated sum instead of the input signal.

It is well known that an FIR filter is guaranteed to have an exact linear phase response if its coefficients are either symmetric or anti-symmetric about the centre tap. This coefficient symmetry can then be exploited by sharing the multipliers between the (anti-)symmetric taps. An example of a linear phase FIR filter with an even number of taps in transpose direct form is shown in figure 1.
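The multiplier sharing described above can be illustrated with a minimal behavioural sketch in Python. It assumes integer samples, an even number of symmetric taps and zero initial register contents; the function names `fir_direct` and `fir_transposed_symmetric` are hypothetical and model only the arithmetic, not the hardware of figure 1.

```python
def fir_direct(x, h):
    """Reference output y[n] = sum_k h[k] * x[n - k] with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_transposed_symmetric(x, half):
    """Transposed direct form for a symmetric filter with an even tap count.

    `half` holds h[0] .. h[N/2 - 1]; the full response is half + reversed(half).
    Each input sample is multiplied once per distinct coefficient and the
    product is reused by the two mirrored taps.
    """
    n_taps = 2 * len(half)
    regs = [0] * (n_taps - 1)              # registers hold accumulated partial sums
    y = []
    for sample in x:
        prods = [c * sample for c in half]  # only N/2 multiplications per sample
        taps = prods + prods[::-1]          # mirrored taps share the products
        y.append(taps[0] + regs[0])
        # Move partial sums one register closer to the output.
        for i in range(n_taps - 2):
            regs[i] = taps[i + 1] + regs[i + 1]
        regs[n_taps - 2] = taps[n_taps - 1]
    return y
```

For example, `fir_transposed_symmetric(x, [1, -2, 3])` produces the same samples as `fir_direct(x, [1, -2, 3, 3, -2, 1])` while computing three products per input sample instead of six.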

#### 2.2 Carry-save addition

Figure 1. Linear phase transpose direct form structure of a linear phase FIR filter.

The delay time of a carry-propagation adder (CPA) is linearly dependent on the word length of the adder. A CPA also generates many glitches before the real carry propagates from the least significant bit to the most significant bit. In order to avoid the long critical path delay of the adder, the adder in each tap of the FIR filter is converted to a carry-save adder. In carry-save addition, both a sum bit and a carry bit are produced for each bit position of the word, so the carry propagation problem inside a carry-propagation adder is avoided. The drawback of the carry-save scheme is that it doubles the number of registers within the filter core. This increases the filter core area, but the system can achieve a higher throughput rate. At the final stage of the filter, a single high-speed CPA, a so-called vector merge adder (VMA), is required to add the sum and carry outputs together to form the final output. The critical path delay $(T_{FIR})$ of a transposed direct form FIR filter is

$$T_{FIR} = \max\{T_{mul} + T_{CSA}, T_{VMA}\}\tag{1}$$

where $T_{mul}$, $T_{CSA}$ and $T_{VMA}$ denote the multiplier, CSA and VMA delays, respectively. A high-complexity high-speed adder such as a carry-select adder or a carry-lookahead adder (CLA) may be used to reduce $T_{VMA}$. The transpose direct form structure with carry-save addition of an FIR filter is shown in figure 2.
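As a rough illustration of carry-save accumulation followed by a vector merge, the sketch below models a 3:2 compressor on non-negative Python integers. The names `carry_save_add` and `accumulate_carry_save` are hypothetical, and the model ignores word length and timing; it only shows that no carry propagates until the final addition.

```python
def carry_save_add(s, c, x):
    """3:2 compressor: reduce three operands to a (sum, carry) pair.

    Every bit position is handled independently, so no carry ripples
    through the word. The invariant is new_sum + new_carry == s + c + x.
    """
    new_sum = s ^ c ^ x                               # bitwise sum without carries
    new_carry = ((s & c) | (s & x) | (c & x)) << 1    # majority bits, weighted by 2
    return new_sum, new_carry


def accumulate_carry_save(values):
    """Accumulate non-negative integers in carry-save form, then merge once."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c    # the only carry-propagating addition (vector merge adder)
```

The pair `(s, c)` corresponds to the doubled register count mentioned above: both words must be stored at every tap until the final merge.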

#### 2.3 CSD multiplier

Because an FIR filter has a set of fixed coefficients, the multiplication can be reduced to a series of hard-wired shifts and additions by recoding these fixed coefficients in the CSD code representation (Lim and Parker 1983, Samueli 1989, MacLeod and Dempster 2005). A CSD-encoded multiplier is implemented simply by combining bus shifts and two's complement adders. Because no general multiplier is used, such structures are often referred to as multiplierless filters. A transpose direct form structure with CSD coefficients and CSAs of an FIR filter is shown in figure 3 (Lin *et al.* 2003). One CSA is required for each non-zero term of the CSD coefficients. In this structure, one compensation vector (CV), generated by collecting the constant ones produced by each shift in the CSD multipliers, is pre-summed. This is the so-called "MSB Fix" scheme (Wong and Samueli 1991). The CV can be added at the first tap of the filter. This structure provides excellent layout regularity and a short critical path of

$$T_{FIR} = \max\{D_{\max} \cdot T_{CSA}, T_{VMA}\}\tag{2}$$

where $D_{\max} = \max_n \{D_n\}$ and $D_n$ is the number of signed power-of-two (SPT) terms of the *n*th coefficient. The delay of a CSA is only that of a one-bit full adder (FA). The use of carry-save arithmetic takes full advantage of the CSD coefficients and reduces the delay of the multiplication-accumulation in a tap to a few one-bit full adder delays without heavy pipelining. Finally, an architecture for a four-digit CSD linear phase tap using carry-save addition and pipelining to a two-adder delay is shown in figure 4.
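The following sketch shows one common way to recode an integer coefficient into CSD digits and to multiply by shifts and additions only. It assumes integer (already quantized) coefficients; the helper names `to_csd`, `csd_multiply` and `d_max` are hypothetical, and the recoding loop is a standard non-adjacent-form conversion rather than the specific procedure of the cited papers.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least significant digit first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    while value != 0:
        if value % 2 == 0:
            digits.append(0)
        else:
            d = 2 - (value % 4)    # +1 if value = 1 (mod 4), -1 if value = 3 (mod 4)
            digits.append(d)
            value -= d
        value //= 2
    return digits


def csd_multiply(x, digits):
    """Multiply x by a CSD-coded constant using only shifts and add/subtract."""
    return sum(d * (x << i) for i, d in enumerate(digits))


def d_max(coefficients):
    """Largest number of non-zero SPT terms in any coefficient (D_max in eq. (2))."""
    return max(sum(1 for d in to_csd(c) if d != 0) for c in coefficients)
```

For instance, `to_csd(23)` uses three non-zero digits (32 - 8 - 1) where the plain binary code 10111 needs four, which is one adder fewer in the tap.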



Figure 2. Transpose direct form structure with carry-save addition of an FIR filter.



Figure 3. Transpose direct form structure with CSD coefficients and CSAs of an FIR filter.



Figure 4. Symmetric transpose direct form structure using carry-save addition for four-digit CSD code.

#### 2.4 Pipelining

Architectures that adopt the CSD multipliers and carry-save addition greatly reduce the critical path of the filter. However, the critical path can be reduced further through pipelining of the structure. An example of pipelining to a single adder delay is shown in figure 5.

#### 2.5 Polyphase representation

Figure 6 shows an example of the reconstruction of a decimator with M = 2. The polyphase implementation shown in figure 6(c) (Bellanger *et al.* 1976) is much more efficient than the direct implementation shown in figure 6(a). Although there is some hardware overhead due to the decimator, $H_0(z)$ and $H_1(z)$ operate at a lower rate. However, decomposing a linear phase filter, which has symmetric coefficients, into subfilters usually destroys the symmetry of the subfilters. This may increase the hardware complexity compared with the original symmetric filter without polyphase representation. Since the decomposition into subfilters is accomplished by sampling every *M*th coefficient of the original impulse response, only those subfilters whose sampled coefficients are symmetric about the centre tap are linear phase; the other subfilters are not. At most two subfilters are linear phase, as summarized in table 1 (Hawley *et al.* 1996). Therefore, when the sampling rate conversion ratio (SRCR) is even, a filter with even tap length *N* can be redesigned with N+1 taps to obtain two linear phase subfilters instead of none, which reduces the hardware complexity.

Figure 5. Transpose direct form structure with one-adder delay pipelining.



Figure 6. Reconstruction of a decimator with M = 2.

| Filter length | Sampling rate conversion ratio | Number of linear phase subfilters |
|---------------|--------------------------------|-----------------------------------|
| Even          | Even                           | 0                                 |
| Even          | Odd                            | 1                                 |
| Odd           | Odd                            | 1                                 |
| Odd           | Even                           | 2                                 |

Table 1. Number of linear phase subfilters if the prototype filter is linear phase.
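The polyphase decomposition and the counts in table 1 can be checked with a small sketch. It assumes a causal filter with zero initial state and a prototype with generic (distinct) symmetric coefficients; all function names are hypothetical and the code is a behavioural model, not the structure of figure 6.

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def decimate_direct(x, h, m):
    """Filter at the input rate, then keep every m-th output sample."""
    return convolve(x, h)[:len(x)][::m]


def polyphase_components(h, m):
    """Subfilter k takes every m-th coefficient of h, starting at index k."""
    return [h[k::m] for k in range(m)]


def decimate_polyphase(x, h, m):
    """Downsample first, then run each subfilter at the low rate and sum."""
    n_out = (len(x) + m - 1) // m
    y = [0] * n_out
    for k, e_k in enumerate(polyphase_components(h, m)):
        # Branch input: x delayed by k samples, then downsampled by m.
        u = [x[n * m - k] if n * m - k >= 0 else 0 for n in range(n_out)]
        for n in range(n_out):
            y[n] += sum(e_k[j] * u[n - j] for j in range(len(e_k)) if n - j >= 0)
    return y


def count_linear_phase_subfilters(h, m):
    """Count subfilters that are symmetric or anti-symmetric."""
    def is_linear_phase(sub):
        rev = sub[::-1]
        return sub == rev or sub == [-c for c in rev]
    return sum(1 for sub in polyphase_components(h, m) if is_linear_phase(sub))
```

With the symmetric prototype `[1, 2, 3, 4, 4, 3, 2, 1]` (N = 8) and M = 2 no subfilter is symmetric, whereas the 9-tap prototype `[1, 2, 3, 4, 5, 4, 3, 2, 1]` yields two symmetric subfilters, in agreement with the first and last rows of table 1.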

#### 2.6 Interpolated FIR filter

The basic idea of the IFIR filter is to implement the filter H(z) as a cascade of two FIR sections

$$H(z) = G(z^{L}) \cdot I(z) \tag{3}$$

where $G(z^L)$ is a periodic model filter whose impulse response is sparse, with only every *L*th sample being non-zero, and I(z) is an image suppressor which can be implemented with only a few arithmetic operations. In the frequency domain, $G(z^L)$ has a periodic frequency response with period $2\pi/L$ and is designed to perform passband, transition band and stopband shaping in the vicinity of the passband, while I(z) is designed to attenuate the unwanted passband images created by $G(z^L)$. If $\delta_p$ denotes the passband deviation and $\delta_s$ denotes the stopband deviation, the overall IFIR filter must satisfy

$$1 - \delta_p \le \left| G(z^L) \cdot I(z) \right| \le 1 + \delta_p \tag{4}$$

in the passband and

$$\left|G(z^{L}) \cdot I(z)\right| \le \delta_{s} \tag{5}$$

in the stopband. Time and frequency domain behaviours of the IFIR approach used for a low-pass filter design with L=3 are illustrated in figure 7. With careful selection of the interpolation factor L, the number of stages and the method used to implement the subfilters, an IFIR design with lower hardware complexity than the direct design can be obtained.

Figure 7. Time and frequency domain behaviours of IFIR low-pass filter with L=3.
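The cascade of equation (3) can be sketched in the time domain as follows. The tap values in the usage note are toy numbers chosen for readability, not a designed filter, and the names `upsample_taps`, `convolve` and `ifir_impulse_response` are hypothetical.

```python
def upsample_taps(g, L):
    """Impulse response of G(z^L): insert L - 1 zeros between the taps of G(z)."""
    out = [0] * ((len(g) - 1) * L + 1)
    out[::L] = g
    return out


def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


def ifir_impulse_response(g, i_taps, L):
    """Overall impulse response of H(z) = G(z^L) * I(z)."""
    return convolve(upsample_taps(g, L), i_taps)
```

With `g = [1, 2, 1]`, `i_taps = [1, 1, 1]` and `L = 3`, the overall response has nine taps, yet only the three taps of G and the three taps of I need arithmetic, since the inserted zeros cost nothing.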

#### 3. Architectural techniques for hardware reduction

In many applications, it is usually necessary to design a decimator/interpolator with a large decimation/interpolation ratio. Although this can be done by designing a filter directly and using the polyphase structure to save the arithmetic operations, it is more efficient to design in multiple stages using the IFIR technique and implement with multirate techniques. This section will introduce some new techniques which can further reduce filter complexity and improve performance for the multirate multistage digital FIR architecture.

#### 3.1 Variable filter order selection

In digital FIR filter design using CSD multipliers, the order of the FIR filter and its floating-point coefficients are first calculated for the given specification. For a fixed filter order, the design task is then to keep the filter's frequency response within the specification while minimizing the total number of SPT terms and keeping the number of SPT terms per coefficient within a specified bound (Lim and Parker 1983). Bhattacharya and Saramäki (2003) observed that one can instead start with a filter that exceeds the given criteria, at the cost of an acceptable increase in the filter order, and arrive at a design with far fewer non-zero digits than the initial design. Moreover, we find that the filter order increases if a stricter passband or stopband ripple is specified without changing the band-edge frequencies. The resulting set of coefficients still satisfies the filter's specification but leaves more room for minimizing the total number of non-zero digits in the CSD representation, at the expense of some additional storage due to the larger number of taps. In some implementations, such as FPGAs, storage elements like registers are free. In addition, the variable filter order selection scheme can be applied to the polyphase representation to change the tap length of an FIR filter from even, as initially obtained for a given specification, to odd. This yields two more linear phase subfilters and can reduce the hardware complexity when the sampling rate conversion ratio is even.

As an example, consider a single-rate digital FIR filter with a $0.3\pi$ passband edge, a $0.5\pi$ stopband edge and -50 dB normalized peak ripple (NPR). Table 2 compares the mixed integer linear programming algorithm (Lim and Parker 1983), the local search algorithm (Samueli 1989) and variable filter order selection. When the maximum allowed number of SPT terms per coefficient is limited to four, the filter designed by our method saves $22\% (21\% \sim 24\%)$ of the SPT terms at the cost of 5% $(4\% \sim 7\%)$ additional tap length. If an application requires the maximum number of SPT terms per coefficient to be limited to three for a higher throughput rate, the filter designed using the local search algorithm fails to reach -50 dB NPR, whereas our proposed method saves 16% of the SPT terms at the cost of 4% additional tap length. Thus, variable filter order selection reduces the hardware complexity further than the other two methods.
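The quoted savings can be reproduced from the entries of table 2 with a few lines of arithmetic. The helper names below are hypothetical; the numbers are the SPT-term and tap counts listed in table 2.

```python
def saving_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


def overhead_percent(baseline, proposed):
    """Relative increase of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Max. four SPT terms per coefficient: proposed 30-tap design with 52 SPT terms.
vs_milp = saving_percent(68, 52)         # against MILP (68 terms, 28 taps)
vs_local = saving_percent(66, 52)        # against local search (66 terms, 28 taps)
extra_taps_4 = overhead_percent(28, 30)

# Max. three SPT terms per coefficient: proposed 29-tap design with 57 SPT terms.
vs_milp_3 = saving_percent(68, 57)
extra_taps_3 = overhead_percent(28, 29)
```

Rounded to whole percentages, these give 24%, 21% and 7% for the four-term case and 16% and 4% for the three-term case, matching the ranges stated in the text.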

#### 3.2 Multirate multistage filter design

Table 2. Minimum number of SPT terms required to attain -50 dB NPR.

| Algorithm                      | Number of SPT terms | Number of taps |
|--------------------------------|---------------------|----------------|
| Max. SPT per coefficient $= 4$ |                     |                |
| MILP (Lim and Parker 1983)     | 68                  | 28             |
| Samueli (Samueli 1989)         | 66                  | 28             |
| Proposed                       | 64                  | 28             |
| Proposed                       | 54                  | 29             |
| Proposed                       | 52                  | 30             |
| Max. SPT per coefficient $= 3$ |                     |                |
| MILP (Lim and Parker 1983)     | 68                  | 28             |
| Samueli (Samueli 1989)         | Cannot reach -50 dB |                |
| Proposed                       | 66                  | 28             |
| Proposed                       | 57                  | 29             |

Consider the decimator shown in figure 8(a). The lowpass filter H(z) becomes a narrow-band filter as the decimation ratio M becomes large, so the IFIR technique can be used to reduce the hardware complexity of H(z). If the interpolation factor L of the periodic model filter $G(z^L)$ is chosen to be $M_1$, as shown in figure 8(b), the structure of the decimator can be rearranged into figure 8(c) by the noble identity. With this structure, the decimator is divided into two sections, the image suppressor I(z) and the model filter G(z), both of which can be implemented in polyphase representation with fewer filter coefficients, as shown in figure 8(d). An interpolator can be designed in the same way, as shown in figure 9. Furthermore, the multistage IFIR decimator/interpolator structure can be extended to three or more stages. Figure 10 shows the derivation of the structure with three-stage decomposition.

Figure 8. Multistage IFIR decimator design.
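The noble identity used to move from figure 8(b) to figure 8(c) states that filtering with $G(z^M)$ followed by downsampling by M gives the same output as downsampling by M followed by filtering with G(z). A minimal numerical check, assuming causal filters with zero initial state and hypothetical helper names, could look like this:

```python
def fir(x, h):
    """Causal FIR filtering with zero initial state; output has len(x) samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def upsample_taps(g, m):
    """Impulse response of G(z^m): insert m - 1 zeros between the taps of G(z)."""
    out = [0] * ((len(g) - 1) * m + 1)
    out[::m] = g
    return out


def filter_then_downsample(x, g, m):
    """G(z^m) at the high rate, followed by downsampling by m."""
    return fir(x, upsample_taps(g, m))[::m]


def downsample_then_filter(x, g, m):
    """Downsampling by m, followed by G(z) at the low rate."""
    return fir(x[::m], g)
```

Both functions return identical sequences, but the second one runs the model filter at one-*M*th of the input rate, which is the source of the hardware saving.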

Figure 9. Multistage IFIR interpolator design.

Figure 10. Multistage IFIR decimator with three-stage decomposition.

In this subsection, the optimal decomposition of IFIR filters is discussed for both single-stage and multistage implementations of the image suppressor I(z). We first estimate the minimum subfilter orders. With the estimated values, it is possible to find a nearly optimum decomposition. The optimal filter decomposition depends on the stopband edge as well as on the normalized transition width of the filter. In the following, we consider the three cases shown in table 3 as design examples (Saramäki *et al.* 1988).

Table 3. Three cases for filter specifications.

| Case | $\omega_p$ | $\omega_s$ | $\delta_p$ | $\delta_s$ |
|------|------------|------------|------------|------------|
| I    | $0.05\pi$  | $0.10\pi$  | 0.01       | 0.001      |
| II   | $0.09\pi$  | $0.10\pi$  | 0.01       | 0.001      |
| III  | $0.01\pi$  | $0.02\pi$  | 0.01       | 0.001      |

Figure 11 shows the total number of taps required, in these three cases, to implement I(z), $G(z^L)$ and the overall filter as a function of the interpolated factor L for a single-stage implementation of I(z). The interpolated factor L=1 corresponds to the conventional direct form FIR filter. As shown in these figures, the IFIR filters provide significant reductions in the number of taps over conventional direct form designs. As L increases, the number of taps of $G(z^L)$ decreases exponentially while the number of taps of I(z) increases exponentially. We can increase L until the decrease in the number of taps of $G(z^L)$ is offset by the increase in the number of taps of I(z), at which point the minimum total number of taps of the overall filter is obtained. The maximum interpolated factor is limited to $L_{\max} = \lfloor \pi/\omega_s \rfloor$, so $L_{\max}$ for cases I, II and III is 10, 10 and 50, respectively. Comparing the results for case I and case II shows that, for the same stopband edge, a smaller normalized transition bandwidth leads to a larger optimum value $L_{\text{opt}}$ of L. Comparing case II and case III shows that, for the same transition width, the filter with the smaller stopband edge has a larger tap contribution from I(z), and $L_{\text{opt}}/L_{\max}$ decreases. Figure 12 shows the total number of taps required, in case III, to implement I(z), $G(z^{L})$ and the overall filter as a function of the interpolated factor $L (= L_1 \cdot L_2)$ for the two-stage implementation of $I(z) (= I_1(z) \cdot I_2(z^{L_1}))$. In case III, the two-stage implementation of I(z) saves significantly more taps in the overall filter than the single-stage implementation. This is because the two-stage implementation of I(z) requires considerably fewer taps and the optimum decomposition occurs at a relatively large value of $L (= L_1 \cdot L_2)$, which in turn decreases the number of taps in $G(z^{L})$. When the single-stage implementation of I(z) already has very few taps, a multistage implementation of I(z) provides only a slight saving over the single-stage implementation.
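The bound $L_{\max} = \lfloor \pi/\omega_s \rfloor$ can be evaluated for the three cases of table 3 with exact rational arithmetic. In this sketch the stopband edges are stored as fractions of $\pi$ to avoid floating-point rounding at the floor; the dictionary and function names are hypothetical.

```python
from fractions import Fraction
from math import floor

# Band edges of table 3 as exact multiples of pi: (passband edge, stopband edge).
CASES = {
    "I": (Fraction(5, 100), Fraction(10, 100)),
    "II": (Fraction(9, 100), Fraction(10, 100)),
    "III": (Fraction(1, 100), Fraction(2, 100)),
}


def l_max(stopband_edge):
    """Maximum interpolated factor floor(pi / w_s) with w_s given in units of pi."""
    return floor(1 / stopband_edge)


def transition_width(passband_edge, stopband_edge):
    """Transition width in units of pi."""
    return stopband_edge - passband_edge


L_MAX = {name: l_max(ws) for name, (wp, ws) in CASES.items()}
WIDTH = {name: transition_width(wp, ws) for name, (wp, ws) in CASES.items()}
```

`L_MAX` evaluates to 10, 10 and 50 for cases I, II and III, and `WIDTH` confirms that cases II and III share the same transition width of $0.01\pi$ while cases I and II share the same stopband edge.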

Table 4 summarizes the optimal IFIR filters with single-stage and two-stage implementations of I(z) in these three cases. A filter with a narrow passband and a narrow transition band benefits most from the IFIR implementation, with a significant reduction in the total number of taps of the overall filter. The two-stage implementation of I(z) reduces the total number of taps further than the one-stage implementation: in case III, the number of taps is reduced to 12% of that of the conventional single-stage design. Compared with Saramäki *et al.* (1988), the optimal IFIR filters obtained here for cases I to III cost 12% additional taps. This is because the IFIR filters are decomposed with more stringent specifications, which guarantees that the final design satisfies the system specification with fewer overall SPT terms at the cost of more taps in the overall design.
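The entries of table 4 are related by $N_H = N_I + N_G$ and $R = N_H / N_{\rm CON}$. The short sketch below recomputes both from the tabulated tap counts; the tuple layout and names are hypothetical, and the numbers are copied from table 4.

```python
# (label, N_CON, N_I, N_G) taken from table 4.
TABLE4 = [
    ("I, single-stage", 103, 15, 28),
    ("II, single-stage", 510, 39, 87),
    ("III, single-stage", 510, 41, 47),
    ("III, two-stage", 510, 15 + 16, 29),   # N_I = N_I1 + N_I2
]


def summarize(rows):
    """Return {label: (N_H, R in percent)} with N_H = N_I + N_G."""
    result = {}
    for label, n_con, n_i, n_g in rows:
        n_h = n_i + n_g
        result[label] = (n_h, round(100 * n_h / n_con))
    return result


SUMMARY = summarize(TABLE4)
```

The recomputed pairs are (43, 42), (126, 25), (88, 17) and (60, 12), which agree with the $N_H$ and R columns of table 4.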

#### 3.3 Structures of the multirate multistage FIR filter

Figure 11. The number of taps versus interpolated factor L for the periodic model filter $G(z^{L})$, the image suppressor I(z) and the overall filter H(z).

Figure 12. In case III, the decompositions of two-stage designs of I(z) for various values of $L_1$ and $L_2$.

Table 4. Summaries of the optimal IFIR filters with single-stage and two-stage implementations of $I(z)$.

| Case | $N_{\rm CON}$ | $N_H$ | $N_I$ | $N_G$ | L | R (%) |
|------|---------------|-------|-------|-------|---|-------|
| Single-stage implementations of $I(z)$ | | | | | | |
| I    | 103 | 43  | 15 | 28 | 4     | 42 |
| II   | 510 | 126 | 39 | 87 | 6     | 25 |
| III  | 510 | 88  | 41 | 47 | 11-13 | 17 |
| Two-stage implementations of $I(z)$ | | | | | | |
| III  | 510 | 60  | 31 $(N_{\rm I1} = 15; N_{\rm I2} = 16)$ | 29 | 20 $(L_1 = 4; L_2 = 5)$ | 12 |

Note: $N_{\text{CON}}$ is the number of taps of the conventional one-stage filter; $N_H$ is the minimum number of taps of the overall multistage filter; $N_I$ is the number of taps of I(z); $N_G$ is the number of taps of $G(z^L)$; L is the interpolated factor; R is the ratio of the multistage total taps to the conventional design taps.

Both the decimator and the interpolator can be built in either direct form or transposed direct form. When the transposed direct form is used for decimators and the direct form for interpolators, registers can be shared between the subfilters, as shown in figure 13(b) and (c) for the example of N=9 and M=3 or L=3. This is the so-called memory-saving technique. The other option is to use the direct form for the decimator and the transposed direct form for the interpolator. These structures allow multipliers to be shared between the subfilters in each mirror symmetric pair, as shown in figure 13(a) and (d) for the example of N=10 and M=3 or L=3. This is the so-called mirror symmetric filter pairs technique (Hawley *et al.* 1996). The registers in structures (b) and (d) store internal signals, so their word length is longer than that of the registers in structures (a) and (c), which store the input signal. With mirror symmetric filter pairs, structures (a) and (d) have only about half of the multipliers of structures (b) and (c). However, structures (b) and (c), using the memory-saving technique, need only approximately 1/M of the registers of structures (a) and (d). Although no structure is absolutely better than the others, the critical path of the transposed direct form is shorter than that of the direct form. For high-speed applications, therefore, structures (b) and (d) are selected. Table 5 summarizes the hardware complexity of the four structures.

Figure 13. (a) Direct form decimator with mirror symmetric filter pairs. (b) Transposed direct form decimator with memory-saving technique. (c) Direct form interpolator with memory-saving technique. (d) Transposed direct form interpolator with mirror symmetric filter pairs.

Table 5. Hardware complexity of the four structures shown in figure 13 of decimator/interpolator.

| Structure | Addition requirements per unit time (APU)                                                    | Storage requirements per unit time (SPU)                                                     |
|-----------|----------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| (a)       | $\sum_{i=1}^{K} \left[ \left( \frac{D_i}{2} + M_i - 1 \right) \prod_{j=i+1}^{K} M_j \right]$ | $\sum_{i=1}^{K} \left[ (N_i + M_i^2 - M_i) \prod_{j=i+1}^{K} M_j \right]$                    |
| (b)       | $\sum_{i=1}^{K} \left( D_i \prod_{j=i+1}^{K} M_j \right)$                                    | $\sum_{i=1}^{K} \left[ \left( M_i^2 + \frac{N_i}{M_i} \right) \prod_{j=i+1}^{K} M_j \right]$ |
| (c)       | $\sum_{i=1}^{K} \left( D_i \prod_{j=i+1}^{K} L_j \right)$                                    | $\sum_{i=1}^{K} \left( \frac{N_i}{L_i} \prod_{j=i+1}^{K} L_j \right)$                        |
| (d)       | $\sum_{i=1}^{K} \left( \frac{D_i + N_i}{2} \prod_{j=i+1}^{K} L_j \right)$                    | $\sum_{i=1}^{K} \left( N_i \prod_{j=i+1}^{K} L_j \right)$                                    |

*Notes.* K: Total number of stages; $N_i$: Tap length; $D_i$: Total non-zero digits; $M_i$: Down conversion ratio; $L_i$: Up conversion ratio.
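The expressions of table 5 can be evaluated mechanically for a candidate decomposition. The sketch below transcribes the four rows as written; the function name `complexity` and the stage parameters in the usage note are hypothetical illustration values, not figures from the design examples.

```python
from fractions import Fraction
from math import prod


def complexity(structure, taps, digits, ratios):
    """Return (APU, SPU) of table 5 for structure 'a', 'b', 'c' or 'd'.

    taps[i], digits[i] and ratios[i] are N_i, D_i and M_i (or L_i) of stage i + 1.
    """
    apu = spu = Fraction(0)
    for i, (n, d, r) in enumerate(zip(taps, digits, ratios)):
        tail = prod(ratios[i + 1:])    # product over stages j = i+1 .. K (1 if none)
        if structure == "a":
            a, s = Fraction(d, 2) + r - 1, n + r * r - r
        elif structure == "b":
            a, s = d, r * r + Fraction(n, r)
        elif structure == "c":
            a, s = d, Fraction(n, r)
        elif structure == "d":
            a, s = Fraction(d + n, 2), n
        else:
            raise ValueError("structure must be one of 'a', 'b', 'c', 'd'")
        apu += a * tail
        spu += s * tail
    return apu, spu
```

For a hypothetical single stage with N = 9, D = 20 and M = 3, `complexity("a", [9], [20], [3])` gives 12 additions and 15 storage units, while `complexity("b", [9], [20], [3])` gives 20 additions and 12 storage units, showing the adder-versus-register trade-off between the two decimator structures.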

#### 4. Design and implementation examples

For the single-rate digital FIR architecture, a filter design example for a 64-QAM baseband demodulator is given (Jou *et al.* 1999). In the following examples, the architecture is described in Verilog HDL. A commercial synthesis tool and standard cells of the TSMC 0.25 $\mu$m CMOS process are then used to synthesize the design to gate level. Finally, gate-level simulation tools with timing and power analysis capability are used to analyse the performance. The results show that, for low-complexity applications, the proposed techniques reduce the area by 39% while doubling the operating frequency and halving the power consumption. Moreover, for high-speed applications, the design with the pipelining scheme can operate at 714 MHz. The synthesis results are summarized in table 6. The area is measured in equivalent 2-input NAND gates.

The second example is an IFIR filter design whose specifications are those of the first version of the CDMA cellular system proposed by Qualcomm (Jou *et al.* 1999). The specifications are shown in table 7. A conventional filter designed with the Parks-McClellan algorithm would require an order N=69. Based on the algorithm given in the multirate multistage filter design section, we use the optimal interpolation factor L=4 for the IFIR design with a single-stage implementation of I(z), and $L_1 \cdot L_2 = 2 \cdot 2$ for the IFIR design with a two-stage implementation of I(z).

|                          | Work#1  | Work#2  | Wong and Samueli (1991) |
|--------------------------|---------|---------|-------------------------|
| Technology (µm)          | 0.25    | 0.25    | 0.25                    |
| Max. operating frequency | 714 MHz | 146 MHz | 72 MHz                  |
| Total gate count         | 11117   | 5155    | 8477                    |
| Combinational area       | 5011    | 2496    | 5938                    |
| Non-combinational area   | 6106    | 2659    | 2539                    |
| Power dissipation (mW)   | 520.05  | 6.83    | 13.31                   |

Table 6. Synthesis results of the example for 64-QAM baseband demodulator.

*Notes.* Work#1: specifications under high-speed constraint; Work#2: specifications under low-complexity constraint.
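
The headline figures quoted for the low-complexity design can be read off table 6 by comparing Work#2 with the Wong and Samueli (1991) column. This reading assumes that the 39% figure refers to the total gate count:

$$
1 - \frac{5155}{8477} \approx 0.39, \qquad \frac{146\ \text{MHz}}{72\ \text{MHz}} \approx 2.0, \qquad \frac{6.83\ \text{mW}}{13.31\ \text{mW}} \approx 0.51 .
$$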

Table 7. Specifications of the CDMA cellular proposed by Qualcomm.

| Parameter               | Value         |
|-------------------------|---------------|
| Sampling frequency      | 19.6608 MHz   |
| Passband edge frequency | $0.064087\pi$ |
| Stopband edge frequency | $0.125\pi$    |
| Passband ripple         | 0.1 dB        |
| Stopband attenuation    | 40 dB         |
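
To see why stretching the model filter helps, the band edges of table 7 can be put through Kaiser's well-known order estimate for equiripple FIR filters. This estimate is not part of the design procedure of this paper, and the sketch makes two simplifying assumptions of its own: the 0.1 dB passband ripple is read as a ±0.1 dB deviation, and the full ripple budget is given to every subfilter. The band edges for G(z) and I(z) follow the usual IFIR construction (Neuvo *et al.* 1984): G(z) is designed with edges L times wider, and I(z) must reject the first image of $G(z^L)$, which begins at $2\pi/L - \omega_s$.

```python
import math

def kaiser_order(wp, ws, ripple_db, atten_db):
    """Kaiser's estimate of the equiripple FIR order; wp, ws are band edges in units of pi."""
    dp = (10 ** (ripple_db / 20) - 1) / (10 ** (ripple_db / 20) + 1)  # passband deviation
    ds = 10 ** (-atten_db / 20)                                        # stopband deviation
    delta_f = (ws - wp) / 2        # transition width normalised to the sampling rate
    return (-20 * math.log10(math.sqrt(dp * ds)) - 13) / (14.6 * delta_f)

wp, ws, L = 0.064087, 0.125, 4     # table 7 edges (units of pi) and L from the text

direct = kaiser_order(wp, ws, 0.1, 40)          # conventional single-rate filter
model = kaiser_order(L * wp, L * ws, 0.1, 40)   # model filter G(z), edges stretched by L
image = kaiser_order(wp, 2 / L - ws, 0.1, 40)   # image suppressor I(z)

print(f"direct ~ {direct:.1f}, G ~ {model:.1f}, I ~ {image:.1f}")
```

Under these assumptions the estimate for the direct design lands in the mid-60s, close to the order of 69 quoted above, while each IFIR subfilter is estimated at well under 20. The estimated order of G(z) is exactly 1/L of the direct estimate, because only the transition width changes.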

Figure 14 shows the frequency responses of the conventional filter and of the IFIR filter with single-stage I(z) and L=4, as well as the frequency responses of the subfilters I(z) and $G(z^4)$. Note that the system $G(z^4)I(z)$ has the linear phase property since G(z) and I(z) both have this property. Figure 15 shows the corresponding responses for the IFIR filter with two-stage I(z) and $L_1 \cdot L_2 = 2 \cdot 2$, together with the frequency responses of the subfilters $I_1(z)$, $I_2(z^2)$ and $G(z^4)$. The synthesis results of the IFIR filter designs and the conventional filter design are summarized in table 8. The filter I(z) has a very low gate count, whereas $G(z^4)$ costs about half as much as the conventional design (7080 versus 14839 gates). When the timing constraints of the conventional filter and the IFIR filters are equal, the areas of the IFIR filters with single-stage and two-stage I(z) are reduced by 38% and 42%, respectively, compared with the conventional filter. Moreover, the power dissipations are reduced by 20% and 24%, respectively.
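
The linear phase claim can be checked numerically. Inserting $L-1$ zeros between the taps of a symmetric G(z) keeps the impulse response symmetric, and the convolution of two symmetric sequences is again symmetric. The coefficient values below are arbitrary placeholders, not the filters of this example.

```python
def upsample_taps(h, L):
    """Impulse response of H(z^L): L-1 zeros inserted between consecutive taps of h."""
    out = []
    for k, c in enumerate(h):
        out.append(c)
        if k < len(h) - 1:
            out.extend([0] * (L - 1))
    return out

def convolve(a, b):
    """Direct linear convolution of two finite sequences."""
    y = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

def is_symmetric(h):
    """True when h equals its own reversal (the linear phase condition used here)."""
    return h == h[::-1]

g = [1, 3, 5, 3, 1]       # placeholder symmetric model filter G(z)
i_taps = [2, 7, 7, 2]     # placeholder symmetric image suppressor I(z)

g4 = upsample_taps(g, 4)  # taps of G(z^4)
overall = convolve(g4, i_taps)
print(len(g4), is_symmetric(g4), is_symmetric(overall))
```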

Finally, the designs of multirate multistage decimators with the same specifications as in example 2 and a decimation factor of eight are summarized in table 9. The frequency responses of the conventional single-stage decimator, which uses the polyphase structure to save arithmetic operations, and of the multistage multirate decimators designed by the proposed method are shown in figure 16. When the timing constraints of the conventional decimator and the multistage multirate decimators are equal, the area of the two-stage multirate decimator is reduced by 70% compared with the conventional decimator. Owing to the multirate design, the conventional decimator already dissipates about 6.6 times less power than the conventional filter in table 8 (21.30 mW versus 140.01 mW). Moreover, as listed in table 9, the two-stage multirate implementation reduces the power consumption to about 51% of that of the conventional single-stage implementation with only the polyphase structure.
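
The saving offered by the polyphase structure can be illustrated with a short sketch. A direct decimator evaluates the filter only at every M-th input sample but still needs the full-length filter at the input rate, whereas the polyphase form splits h into M branches that each run at the output rate. The function names and test data are illustrative assumptions; the sketch shows only that the two forms give identical outputs and does not model the hardware structures of figure 13.

```python
def fir_then_decimate(x, h, M):
    """Reference: y[n] = sum_k h[k] x[nM - k], computed directly."""
    y = []
    for n in range(0, len(x), M):
        acc = 0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y

def polyphase_decimate(x, h, M):
    """Same output from M polyphase branches e_p[m] = h[m*M + p]."""
    branches = [h[p::M] for p in range(M)]
    n_out = (len(x) + M - 1) // M
    y = []
    for n in range(n_out):
        acc = 0
        for p, e in enumerate(branches):
            for m, coeff in enumerate(e):
                idx = (n - m) * M - p      # input sample feeding branch p, tap m
                if idx >= 0:
                    acc += coeff * x[idx]
        y.append(acc)
    return y

x = list(range(1, 17))          # arbitrary test input
h = [1, 2, 3, 4, 5, 6, 7]       # arbitrary filter taps
M = 4

print(fir_then_decimate(x, h, M) == polyphase_decimate(x, h, M))
```

Each branch holds roughly len(h)/M taps and is clocked at the output rate, which is where the reduction in arithmetic operations comes from.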



Figure 14. Frequency responses of the IFIR filters with single-stage I(z) and L=4, (a) I(z) of order 15 and  $G(z^4)$  of order 19, (b) the overall IFIR filter and the conventional filter.

#### 5. Conclusions

This paper has presented high-speed and low-complexity design techniques at the architectural level for the single-rate digital FIR architecture. It uses a pipelined transpose direct form structure with CSD multipliers and carry-save adders to design a linear phase FIR filter. A filter design example for a 64-QAM baseband demodulator shows that the area is reduced by 39% for low-complexity applications. Moreover, for high-speed applications, it can operate at 714 MHz.

Figure 15. Frequency responses of the IFIR filters with two-stage I(z) and L=4, (a)  $I_1(z)$  of order 7,  $I_2(z^2)$  of order 7 and  $G(z^4)$  of order 19, (b) the overall IFIR filter and the conventional filter.

IFIR filter with single-stage $I(z)$ (technology: TSMC 0.25 μm; input frequency: 200 MHz)

|                        | Conventional filter | $I(z)$ | $G(z^4)$ | Reduction ratio, $R$ (%) |
|------------------------|---------------------|--------|----------|--------------------------|
| Tap number             | 69                  | 15     | 19       | 49%                      |
| Total gate count       | 14839               | 2049   | 7080     | 62%                      |
| Combinational area     | 9190                | 1118   | 1684     | 31%                      |
| Noncombinational area  | 5649                | 931    | 5396     | 112%                     |
| Power dissipation (mW) | 140.01              | 25.78  | 86.57    | 80%                      |

IFIR filter with two-stage $I(z)$ (technology: TSMC 0.25 μm; input frequency: 200 MHz)

|                        | $I_1(z)$ | $I_2(z^2)$ | $G(z^4)$ | Reduction ratio, $R$ (%) |
|------------------------|----------|------------|----------|--------------------------|
| Tap number             | 7        | 7          | 19       | 48%                      |
| Total gate count       | 721      | 1369       | 6523     | 58%                      |
| Combinational area     | 275      | 504        | 1540     | 25%                      |
| Noncombinational area  | 446      | 865        | 4983     | 111%                     |
| Power dissipation (mW) | 8.89     | 17.31      | 80.70    | 76%                      |

Table 8. The synthesis results of the conventional filter and the IFIR filters.

Table 9. The synthesis results of the conventional decimator and the multirate decimator (decimation ratio, M = 8).

Multirate decimator with two stages (technology: TSMC 0.25 μm; input frequency: 200 MHz)

|                        | Conventional decimator $(M=8)$ | Stage#1 $(M_1 = 4)$ | Stage#2 $(M_2 = 4)$ | Reduction ratio, $R$ (%) |
|------------------------|--------------------------------|---------------------|---------------------|--------------------------|
| Total gate count       | 13742                          | 1580                | 2554                | 30%                      |
| Combinational area     | 12567                          | 1162                | 1792                | 24%                      |
| Non-combinational area | 1174                           | 418                 | 761                 | 100%                     |
| Power dissipation (mW) | 21.30                          | 6.59                | 4.29                | 51%                      |

Multirate decimator with three stages (technology: TSMC 0.25 μm; input frequency: 200 MHz)

|                        | Stage#1 $(M_1 = 2)$ | Stage#2 $(M_2 = 2)$ | Stage#3 $(M_3 = 2)$ | Reduction ratio, $R$ (%) |
|------------------------|---------------------|---------------------|---------------------|--------------------------|
| Total gate count       | 638                 | 808                 | 2417                | 28%                      |
| Combinational area     | 336                 | 496                 | 1706                | 20%                      |
| Non-combinational area | 301                 | 312                 | 710                 | 113%                     |
| Power dissipation (mW) | 4.48                | 3.15                | 4.42                | 57%                      |
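
The reduction ratios in tables 8 and 9 can be reproduced from the other columns if R is taken to be the sum over the subfilters (or stages) expressed as a percentage of the conventional design. That reading is an inference from the tabulated numbers rather than a definition stated in the text; under it, the saving relative to the conventional design is 100 - R. A minimal check, with the helper name chosen for illustration:

```python
def ratio_percent(parts, conventional):
    """Sum of the sub-block figures as a rounded percentage of the conventional design."""
    return round(100 * sum(parts) / conventional)

# (sub-block values, conventional value) copied from tables 8 and 9
table8_single = {"gates": ([2049, 7080], 14839), "power": ([25.78, 86.57], 140.01)}
table8_two = {"gates": ([721, 1369, 6523], 14839), "power": ([8.89, 17.31, 80.70], 140.01)}
table9_two = {"gates": ([1580, 2554], 13742), "power": ([6.59, 4.29], 21.30)}
table9_three = {"gates": ([638, 808, 2417], 13742), "power": ([4.48, 3.15, 4.42], 21.30)}

for name, rows in [("table 8, single-stage I(z)", table8_single),
                   ("table 8, two-stage I(z)", table8_two),
                   ("table 9, two-stage", table9_two),
                   ("table 9, three-stage", table9_three)]:
    r = {key: ratio_percent(*val) for key, val in rows.items()}
    print(f"{name}: R(gates) = {r['gates']}%, R(power) = {r['power']}%")
```

For the two-stage IFIR filter, for example, R = 58% for the gate count corresponds to the 42% area reduction quoted in the text.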

In the multirate digital FIR architecture, we construct a multirate multistage digital FIR architecture with the polyphase structure and IFIR filter design. To keep the symmetric property of the subfilters, a variable filter order selection scheme is proposed, which also reduces the hardware complexity. For an example single-rate digital FIR filter, the filter designed by the variable filter order selection method saves about 22% of the SPT terms at the cost of 5% additional tap length. For IFIR filter implementation, the hardware complexity can be reduced considerably by carefully decomposing the interpolation factor of the periodic model filter. An optimal filter decomposition, which depends on the stopband edge as well as on the normalized transition width of the filter, is proposed. For IFIR filters designed to the specifications of the first version of the CDMA cellular system, the areas of the IFIR filters with single-stage and two-stage image suppressors are reduced by 38% and 42%, respectively, compared with the conventional filter. Moreover, the power dissipations are reduced by 20% and 24%, respectively. Finally, a multirate multistage multiplierless decimator with a decimation factor of eight is designed. The area of the multistage multirate decimator is reduced by 70% compared with a conventional decimator. Moreover, the multistage implementation reduces the power consumption to about 51% of that of the conventional single-stage implementation with only the polyphase structure.

![](_page_22_Figure_1.jpeg)

Figure 16. The frequency responses of the conventional FIR and multirate IFIR filters.

#### References

- M. Bellanger, G. Bonnerot and M. Coudreuse, "Digital filtering by polyphase network: application to sample rate alteration and filter banks", *IEEE Trans. Acoust., Speech, Signal Processing*, ASSP-24, pp. 109–114, 1976.
- M. Bhattacharya and T. Saramäki, "Some observations on multiplierless implementation of linear phase FIR filters", *IEEE ISCAS*, 4, pp. 193–196, 2003.
- "The CDMA network engineering handbook, volume 1: concepts in CDMA", Qualcomm Inc., 1993.
- R.E. Crochiere and L.R. Rabiner, *Multirate Digital Signal Processing*, Englewood Cliffs, NJ: Prentice Hall, 1983.
- R.A. Hawley, B.C. Wong, T.-J. Lin, J. Laskowski and H. Samueli, "Design techniques for silicon compiler implementations of high-speed FIR digital filters", *IEEE JSSC*, 31, pp. 656–667, 1996.
- K.Y. Jheng, S.J. Jou and A.Y. Wu, "A design flow for multiplierless linear-phase FIR filters from system specification to verilog code", *IEEE Int. Symp. Cir. Syst.*, 5, pp. 293–296, 2004.
- S.J. Jou, S.Y. Wu and C.K. Wang, "Low-power multirate architecture for IF digital frequency down converter", *IEEE Trans. Circuits Syst. II*, 45, pp. 1487–1494, 1998.
- S.J. Jou, C.H. Kuo, M.T. Shiau, J.Y. Heh and C.K. Wang, "VLSI implementation of timing recovery and carrier recovery for QAM/VSB dual mode", *Int. Symp. VLSI Technology, Syst. Applications*, pp. 159–162, 1999.
- G. Jovanovic-Dolecek and S.K. Mitra, "Multiplier-free FIR filter design based on IFIR structure and rounding", *48th Midwest Symp. Circuits and Sys.*, pp. 559–562, 2005.
- Y.C. Lim and S.R. Parker, "FIR filter design over a discrete powers-of-two coefficient space", *IEEE Trans. Acoust., Speech, Signal Processing*, 31, pp. 583–591, 1983.
- M.C. Lin, C.L. Chen, D.Y. Hsin, C.H. Lin and S.J. Jou, "Multiplierless FIR filter architecture synthesizer based on CSD code", *J. Chin. Institute of Electrical Eng.*, 10(2), pp. 155–163, 2003.
- M.D. MacLeod and A.G. Dempster, "Multiplierless FIR filter design algorithms", *IEEE Signal Process. Lett.*, 12, pp. 186–189, 2005.
- A. Mehrnia and B. Daneshrad, "A low-complexity multi-rate channel selector transmit filter bank with reconfigurable bandwidth", *Aerospace, IEEE Conference*, pp. 1–11, 2005.
- Y. Neuvo, C.Y. Dong and S.K. Mitra, "Interpolated finite impulse response filters", *IEEE Trans. Acoust., Speech, Signal Processing*, ASSP-32, pp. 563–570, 1984.
- G.W. Reitwiesner, "Binary arithmetic", in *Advances in Computers*, Vol. 1, NY: Academic, 1966, pp. 231–308.
- T. Saramäki, Y. Neuvo and S.K. Mitra, "Design of computationally efficient interpolated FIR filters", *IEEE Trans. Circuits Syst.*, 35, pp. 563–570, 1988.
- H. Samueli, "An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients", *IEEE Trans. Circuits Syst.*, 36, pp. 1044–1047, 1989.
- P.P. Vaidyanathan, *Multirate Systems and Filter Banks*, Englewood Cliffs, NJ: Prentice Hall, 1993.
- B.C. Wong and H. Samueli, "A 200-MHz all-digital QAM modulator and demodulator in 1.2 µm CMOS for digital radio applications", *IEEE JSSC*, 26, pp. 1970–1979, 1991.
