# A Vario-Power ME Architecture Using Content-Based Subsample Algorithm

Hsien-Wen Cheng and Lan-Rong Dung, Member, IEEE

Abstract — The Motion estimator is a key element in many video compression systems and it tends to dominate the power consumption in them. With increasing demand of portable, power-aware multimedia devices, an architecture that can be flexible in both power consumption and compression quality is essential. To meet this requirement, this paper presents a novel power-aware architecture, called the Vario-Power Architecture, for the motion estimation. Based on a semisystolic array with the content-based subsample algorithm, the architecture real-time disables some processing elements to reduce power consumption. By performing the edge extraction first, a threshold is then set as the criterion of whether to enable or disable processing elements and thus the switch activities of the system can be reduced. As the simulation shows, the architecture may operate at different power consumption modes according to the remaining capacity of the battery pack giving little quality degradation and the power overhead under 0.36%.

Index Terms — motion estimation, VLSI architecture, video compression, power-aware architecture, subsample.

### I. INTRODUCTION

The technique of motion estimation has been widely used in the video compression system for years. As the demand of portable, power-aware multimedia devices increases, an architecture that can be flexible in both power consumption and compression quality is essential. This paper presents a poweraware architecture, called the Vario-Power Architecture, which provides real-time switching of power consumption mode for conserving battery life<sup>2</sup> while it maintains the compression quality in high level. The proposed architecture is driven by the content-based subsample algorithm which maintain acceptable quality degradation when the system works at different power consumption modes; that is, the proposed architecture can perform trade-offs between power consumption and compression quality as the battery status changes. Because the control mechanism and data sequences at different power consumption modes are the same, the vario-power architecture can perform the switching of power consumption mode very smoothly. The block diagram shown in Fig.1 illustrates a typical

The authors are with the Department of Electrical and Control Engineering, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu 300, Taiwan, R.O.C. (email: jswcheng.ece89g@nctu.edu.tw; lennon@cn.nctu.edu.tw). <sup>2</sup> The idea is motivated from the SpeedStep<sup>TM</sup> technology of Intel [1].



Fig. 1. A typical application of the vario-power ME architecture in a video compression system.

application of the vario-power architecture for motion estimation in a video compression system. The host processor monitors the remaining capacity of the battery pack and switches the operation mode of the vario-power element to maintain the performance of the compression system better.

Lots of published papers presented efficient algorithms for the VLSI implementation of motion estimator either on high performance or low power design. Yet, most proposed algorithms cannot adapt their system to different power consumption modes. Among these proposed algorithms, the Full-Search Block-matching (FSBM) algorithm with Sum of Absolute Difference (SAD) criteria is the most popular approach for motion estimation because of its considerably good quality. There are many types of architecture which have been proposed for the implementation of FSBM algorithms [2] [3] [4] [5]. However, a huge number of comparison/difference operations result in high computational load and significant power consumption. To reduce the computational complexity of FSBM, researchers have proposed various fast algorithms by reducing the searching steps [6] [7] [8] [9] [10]. Unfortunately, these fast algorithms suffer from irregular block-matching scheme and much worse quality than FSBM, and they are not suitable for the implementation of vario-power architecture. Papers in [11] and [12] may reduce the computational load without degrading the compression quality; nevertheless, they both require additional operations for each search step and cannot adapt themselves to different power consumption modes. The Novel Early-Jump-Out (NEJO) technique in [13] addressed the low-power architecture with some quality degradation. However, the EJO also requires extra operations for each search step and is not feasible enough for vario-power architecture.

<sup>&</sup>lt;sup>1</sup> This work was supported in part by Taiwan MOE Program for Promoting Academic Excellent of Universities under the grant number 91-E-FA06-4-4 and the National Science Council, R.O.C., under the grant number NSC 92-2220-E-009-003.

Articles in [14] and [15] present subsample algorithms to significantly reduce the computation cost with low quality degradation. The reduction of computational cost implies saving of the power consumption, and the power consumption can be reduced by simply increasing the subsample rate; thus, the subsample algorithms are very suitable for implementing the vario-power architecture. However, applying the subsample algorithms for the vario-power architecture may suffer from aliasing problem for high frequency band. The aliasing problem makes compression quality degrading rapidly as the subsample rate increases. To alleviate the problem, we extend the traditional subsample algorithm to a novel algorithm, which is called the content-based subsample algorithm. In our approach, we first use edge extraction techniques to separate the high-frequency band from a macro-block and then perform subsampling within the low-frequency band only. Based on the architecture proposed by Hsieh and Lin [4], we present a semisystolic architecture which is driven by the content-based subsample algorithm. The proposed architecture can real-time alter the subsample rate as the power consumption mode changes. As the result shows, our methodology successfully switches the power mode while the quality degradation is a little.

## **II. GENERAL SUBSAMPLE ALGORITHM**

The FSBM algorithm with SAD criteria is the most popular approach for motion estimation because of its considerably good quality and many published papers presented efficient architectures for the VLSI implementation of it [2] [3] [4] [5]. To determine the motion vector, the algorithm uses (1) and (2) to compare each current macro-block (CMB) of the frame with all the reference macro-blocks (RMB) in searching area.

$$SAD(u,v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |S(i+u, j+v) - R(i, j)|,$$
  
where  $-p \le u, v \le p-1.$  (1)

$$SAD_{\overline{MV}} = \min_{-p \le u, v \le p-1} SAD(u, v)$$
  
and  $\overline{MV} = (u, v) \Big|_{\min_{-p \le u, v \le p-1} SAD(u, v)}$  (2)

where the macro-block size is *N*-by-*N*, R(i,j) is the luminance value at (i,j) in the current macro-block (CMB). The S(i+u,j+v) is the luminance value at (i,j) of the reference macro-block (RMB) which is offset (u,v) from the CMB in the searching area which size is 2p-by-2p.

Many researches addressed subsample techniques for motion estimation to reduce the computation load of FSBM [14] [15]. Liu and Zaccarin, as pioneers of the subsample algorithms, significantly reduced the computational load by applying the 4-to-1 alternative subsampling to FSBM. As the simulation results show, the subsample algorithm reduces the computational load by a factor of 8 or 16 while the quality is similar to that of exhaustive search [14]. Here, we present the general subsample algorithm in which the subsample rate ranges from 4-to-1 to 1-to-1 for the vario-power architecture.

The general subsample algorithm shown in (3) uses the matching criterion which is called subsample sum of absolute difference (SSAD). In (3), the  $SM_{8:m}$  which is generated from (4) is the subsample mask for the subsample rate 8-to-m.

$$SSAD_{8:m}(u,v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| SM_{8:m}(i,j) \cdot \left( S(i+u,j+v) - R(i,j) \right) \right|$$
  
for  $-p \le u, v \le p - 1$ , (3)  
where  $SM_{8:m}(i,j)$  is the mask for  $8 - to - m$  subsample and

 $SM_{8:m}(i, j) = BM_{8:m}(i \mod 4, j \mod 4)$ 

$$BM_{8m} = \begin{bmatrix} u(m-2) & u(m-5) & u(m-2) & u(m-6) \\ u(m-3) & u(m-7) & u(m-4) & u(m-8) \\ u(m-2) & u(m-5) & u(m-2) & u(m-6) \\ u(m-3) & u(m-7) & u(m-4) & u(m-8) \end{bmatrix}$$
(4)

where u(n) is a step function; that is,  $u(n) = \begin{cases} 1, \text{ for } n \ge 0\\ 0, \text{ for } n < 0 \end{cases}$ 

Using the general subsample algorithm, the power consumption can be reduced by simply increasing the subsample rate in which way the reduction of computation cost implies the saving of power consumption. Obviously, the general subsample algorithm is very suitable to implement the vario-power architecture. However, the algorithm is rather content independent and suffers from aliasing problem in high frequency band. To enhance the performance, we introduce a content-dependent technique to the general subsample algorithm as addressed in the following section.

# **III. CONTENT-BASED SUBSAMPLE ALGORITHM**

As mentioned above, the general subsample algorithm has aliasing problem when it is in high subsample rate. The aliasing problem leads to considerable quality degradation because the high frequency band is messed up. To alleviate the problem, we use edge extraction techniques to separate the edge pixels from a macro-block and then perform subsampling to the remaining pixels.

Fig. 2 describes the procedure of the content-based subsample algorithm. The algorithm first determines edge pixels of the current macro-block and then generates the content-based subsample mask (CSM). Upon the CSM generated, we are able to calculate SSAD values and find the best motion vector. The determination of edge pixels starts from applying gradient filter in the current macro-block. In this paper, we use three popular gradient filters which are the high-pass filter, the Sobel filter and the morphological gradient filter to exercise the performance of the content-based algorithm [16]. Equations (5)-(8) illustrate their gradient calculations.

## **Input current and reference** *W*×*H* **frames;**

for(y = 0; y < W/N; y + +){
 for(x = 0; x < H/N; x + +){</pre>

Perform gradient filtering;

Calculate the edge threshold accroding to power mode;

Determine edge pixels and edge mask;

Generate content - based subsample mask (CSM);

$$SSAD_{\min}(x, y) = \infty;$$
  

$$for(u = -p; u < p; u + +) \{$$
  

$$for(v = -p; v < p; v + +) \{$$
  

$$SSAD(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} |CSM(i, j) \cdot (S(i+u, j+v) - R(i, j))|;$$
  

$$if SSAD_{\min}(x, y) > SSAD(u, v)$$
  

$$\{SSAD_{\min}(x, y) = SSAD(u, v); MV(x, y) = (u, v); \}$$
  

$$\} // for loop index v$$
  

$$\} // for loop index x$$
  

$$\} // for loop index x$$

Fig. 2. The procedure of the content-based subsample algorithm.

# **High-Pass Filter:**

$$G_{hpf}(i, j) = |MF(HPF_{mask}, R(i, j))|,$$
  
where  $HPF_{mask} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}.$  (5)

# **Sobel Filter:**

$$G_{sobel}(i,j) = |MF(SX_{mask}, R)(i,j)| + |MF(SY_{mask}, R)(i,j)|,$$
  
where  $SX_{mask} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, SY_{mask} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$  (6)

In (5) and (6), the  $MF(\cdot)$  function is the mask filter operation which is shown in (7).

$$MF(M, R(i, j)) = \sum_{p=-1}^{1} \sum_{q=-1}^{1} M(p+1, q+1) \cdot R(i+p, j+q),$$
  
where M is a 3-by-3 mask (7)

and R(i, j) is the luminance value at (i, j)

# Morphological Gradient filter:

$$G_{morpho \log ical} = (R \oplus B) - (R \Theta B),$$
  
where  $B = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$  (8)

The operation ' $\oplus$ ' and ' $\Theta$ ' denote the morphological operation of dilation and erosion respectively.

After obtaining the gradient value G, we first determine the edge threshold of this CMB as defined in (9).

threshold = 
$$m_1 \cdot \max\{G(i, j)\} + m_2 \cdot \min\{G(i, j)\},\$$
  
where  $m_1$  and  $m_2$  are set according to power mode. (9)

Then, the algorithm uses the threshold value as a condition to pick the edge pixels and produces the edge mask by (10).

$$EdgeMask(i, j) = \begin{cases} 1, & for G(i, j) \ge threshold \\ 0, & otherwise \end{cases}$$
(10)

Finally, the contend-based subsample mask (CSM) is generated from (11) and, therefore, the content-based subsample rate (CSR) is  $N^2$ -to-(the number of 1's in CSM).

$$CSM(i, j) = SM_{8m}(i, j) \text{ OR } EdgeMask(i, j)$$
  
for  $0 \le i, j \le N - 1$ . (11)

Since the higher the threshold value is, the less the edge pixels will be, the CSM is highly dependent on the threshold. Thus, the switching of power mode can be done by adjusting the threshold parameters m1 and m2.

### **IV. THE VARIO-POWER ARCHITECTURE**

According to the content-based subsample algorithm, we present a semi-systolic architecture which is based on the architecture proposed by Hsieh and Lin [4]. The vario-power architecture shown in Fig. 3 contains an edge-extraction unit (EXU), an array of processing elements (PEs), a parallel adder tree (PAT), a shift register array (SRA), and a motion-vector selector (MVS). Given the power consumption mode, the EXU extracts high frequency pixels (or edge pixels) from the current macro-block (CMB) and generates 0-1 content-based subsample masks (CSM) for the PE array to disable or enable processing elements (PEs). As shown in Fig. 4, the structure of PE array, which is used to accumulate the absolute differences of pixels column by column while the parallel adder tree sum up all the results to generate the subsample sum of absolute difference (SSAD). Finally, The MVS performs compare-andselect operation to select the best motion vector which has a minimum SSAD.



Fig. 3. The system block diagram of the content-based subsample algorithm.



Fig. 4. The architecture of PE array and shift register array.

Since the adder operations which PE achieves dominant significant power consumption in the system, the proposed architecture driven by the content-based subsample algorithm can disable some processing elements to save the switch activity of PE; that is, it can save the power consumption of the system. By performing the edge extraction first, a threshold was then set as the criterion of how many processing elements will be turn-off. Figure 5 shows the structure of the Processing Element and explains how the CSM disables or enables the PE. The absolute difference (AD) unit, which is denoted as |a-b|, calculates the absolute difference of CMB and RMB pixels. Then the adder unit accumulates the absolute difference value with the partial sum from the previous PE and conveys the results to the next PE in the same column. In order to reduce the switch activity, the PE receives the CSM signals from the EXU to disable or enable the blocking registers, which is abbreviated as 'breg' in Fig. 5, to decide the PE is active or inactive. When the blocking registers are enabled, the data paths of AD unit and the adder unit remains still, that is, the switch activity of this inactive PE is reduced. Thus, the power consumption can be saved.

The edge-extraction unit contains two main blocks which are gradient filter and CSM generator. The implementation of



Fig. 5. The PE structure for the vario-power architecture.



Fig. 6. The architecture to implement the high-pass gradient filter.



gradient filter is based on one of the (5), (6) and (8). Figure 6, for instance, is the structure of the implementation for the high-pass gradient filter. The multiplexers are used to prevent the boundary error for border pixels of the CMB. The black-dot in each multiplexer indicates the switching path when the filter is processing a border pixel. The CSM generator, whose structure is illustrated in Fig. 7, figures out the maximum and minimum of the gradient value in the macro-block and then determines the threshold value according to the power mode as shown in (9). Finally, it generates the CSM by OR-merging the regular subsample pattern and the edge pattern.

The execution of the vario-power architecture has five phases: initial CMB phase, filtering phase, edge-determination phase, initial RMB phase, and SSAD calculation phase. The initial CMB phase is for loading the CMB data into PE array and the initial RMB phase is for filling up PE array in full with RMB data to start the SSAD calculation. Figure 8 illustrates the timing of data flow. Since the execution of edge-extraction





Fig. 8. The data flow of the vario-power architecture.

unit, which includes filtering phase and edge-determination phase, is parallel to the initial RMB phase, the behavior of the proposed architecture is the same as the architecture without vario-power function proposed by Hsieh and Lin [4].

## V. ESTIMATION OF POWER CONSUMPTION

From (12), the power consumption of digital VLSI is in proportion to the switch activity,  $f \cdot \gamma(0 \leftrightarrow 1)$ .

$$P_{gate} = f \cdot C_{gate} \cdot V_{DD}^2 \cdot \gamma(0 \leftrightarrow 1),$$
  
where  $\gamma(0 \leftrightarrow 1)$  is the switching rate of the gate. (12)

Because the addition is the majority of the motion estimation architecture, by referring to [11], this paper uses the number of equivalent additions, denoted as  $\varepsilon_{adder}$ , as the power measure unit to estimate the power consumption. As per Fig. 5, the calculation of an absolute difference nearly requires  $2\varepsilon_{adder}$  and each PE consumes  $3\varepsilon_{adder}$  in each iteration. When the PE array operates at the content-based subsample rate of  $R_s$ , it requires  $R_s^{-1} \cdot N^2 \cdot (2p)^2 \cdot 3\varepsilon_{adder}$ . Since the calculation of PAT requires  $(N-1) \cdot (2p)^2 \cdot \varepsilon_{adder}$ , the power consumption of calculating SSADs for all RMBs in searching area is  $12R_s^{-1}N^2p^2\varepsilon_{adder} + 4(N-1)p^2\varepsilon_{adder}$ .

As regard to the EXU, three gradient filters which are mentioned in this paper consume  $6N^2 \varepsilon_{adder}$ ,  $9N^2 \varepsilon_{adder}$  and  $8N^2 \varepsilon_{adder}$ , respectively, and the edge-determination consumes  $3N^2 \varepsilon_{adder}$ . So the power consumption of the content-based subsample algorithm ( $P_{CS4}$ ) can be expressed as (13).

$$P_{CSA} \cong 12R_s^{-1}N^2p^2\varepsilon_{adder} + 4(N-1)p^2\varepsilon_{adder} + \alpha N^2\varepsilon_{adder},$$
(13)  
where  $\alpha \in \{9, 12, 11\}.$ 



Fig. 9. The PSNR vs. power consumption of the News clip.



Fig. 10. The PSNR vs. power consumption of the Weather clip.

From (13), we can learn that the power overhead of EXU is a little. For the worst case, when the subsample rate is 4-to-1 and the gradient filter is Sobel-type, the power overhead is only 0.36% for the motion estimation with N=16 and p=32.

#### **VI. SIMULATION RESULTS**

Figure 9 and 10 demonstrate the simulation results of two 352-by-288 MPEG clips for N=16 and p=32. The dashed lines are the results of the general subsample algorithm and the subsample rates at the bullets are (4:1), (8:3), (2:1), (8:5), (4:3), (8:7), and (1:1), respectively from left to right. The solid lines are the results of the content-based subsample algorithm with three gradient filters and the subsample rate is from (4:1) to (1:1). The threshold parameters,  $(m_1, m_2)$  pairs, at the bullets on solid lines are (1,0), (0.75,0.25), (0.5,0.5), (0.4,0.6), (0.3,0.7), (0.2,0.8), (0.1,0.9), and (0,1), from left to right respectively. As the results, the power consumption can be significantly reduced while the quality degradation is little and the power mode can be dynamically switched by simply changing the threshold parameters.

IEEE Transactions on Consumer Electronics, Vol. 50, No. 1, FEBRUARY 2004

# VII. CONCLUSION

We proposed the vario-power architecture based on a novel content-based subsample algorithm. The vario-power architecture provides the real-time capability for the switching of power consumption mode while the quality degrades a little. In the proposed architecture, the edge extraction unit plays a key role to dynamically adjust the power consumption mode and its power overhead can be neglected. As shown in the simulation results, the proposed algorithm successfully improves the compression quality of the general subsample algorithm and switches the power consumption mode by the content dependent technique.

## REFERENCES

- [1] "Mobile Pentium III Processor in BGA2 and Micro-PGA2 Packages Datasheet," Intel Corporation, p55.
- [2] K. M. Yang, M T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 36, no. 10, pp. 1317-1325, Oct. 1989.
- [3] Yeong-Kang Lai, and Liang-Gee Chen, "A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm," *IEEE Trans. Circuits syst. Video Technol.*, Vol. 8, no. 2, pp. 124-127, Apr. 1998.
- [4] Chaur-Heh Hsieh, and Ting-Pang Lin, "VLSI Architecture for Block-Matching Motion Estimation Algorithm," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 2, no. 2, pp. 169-175, Jun 1992.
- [5] Jen-Chieh Tuan, Tian-Sheuan Chang, and Chein-Wei Jen, "On the Data Reuse and Memory Bandwidth Analysis for Full-Search Block-Matching VLSI Architecture," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 12, no. 1, pp. 61-72, Jan. 2002.
- [6] Mei-Juan Chen, Liang-Gee Chen, and Tzi-Dar Chiueh, "Onedimensional full search motion estimation algorithm for video coding," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 4, no. 5, pp. 504-509, Oct. 1994.
- [7] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishigura, "Motion compensated interframe coding for video conferencing," in *Proc. NTC*'81, New Orleans, LA, pp. G5.3.1-G5.3.5, Nov. 1981.
- [8] J. R. Jain, and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. Commun. Vol. COM-29, pp. 1799-1808, Dec. 1981.
- [9] Renxiang Li, Bing Zeng, and Ming L. Liou, "A new three-step search algorithm for block motion estimation," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 4, no. 4, pp. 438-442, Aug. 1994.
- [10] Ken Sauer, and Brian Schwartz, "Efficient block motion estimation using integral projections," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 6, no. 5, pp. 513-518, Oct. 1996.

- [11] Viet L. Do, and Kenneth Y. Yun, "A Low-Power VLSI Architecture for Full-Search Block-Matching Motion Estimation," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 8, no. 4, pp. 393-398, Aug. 1998.
- [12] W. Li, and E. Salari, "Succesive elimination algorithm for motion estimation," *IEEE Trans. Image Processing*, Vol. 4, no. 1, pp. 105-107, Jan. 1995.
- [13] Wujian Zhang, Runde Zhou, and Kondo, T., "Low-power motionestimation architecture based on a novel early-jump-out technique," *The IEEE International Symposium on Circuits and Systems*, Vol. 5, pp. 187-190, 2001.
- [14] Bede Liu, and A. Zaccarin, "New fast algorithms for the estimation of block motion vectors," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 3, no. 2, pp. 148-157, Arp. 1993.
- [15] Chok-Kwan Cheung, and Lai-Man Po, "Normalized partial distortion search algorithm for block motion estimation," *IEEE Trans. Circuits Syst. Video Technol.*, Vol. 10, no. 3, pp. 417-422, Arp. 2000.
- [16] Rafael C. Gonzalez, and Richard E. Woods, "Digital Image Processing," Addison Wesley, Sep. 1993.



**Hsien-Wen Cheng** received the B.S. degree in Control Engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C. in 1992. He joined the AVerMedia Technologies Inc. from 1994 to 2001.

He is currently working toward the Ph.D degree in the Electrical and Control Engineer, National Chiao Tung University. His research interests are video/image

compression, motion estimation, VLSI architecture and digital signal processing.



**Lan-Rong Dung** received a BSEE and the Best Student Award from Feng Chia University, Taiwan, in 1988, an MS in electronics engineering from National Chiao Tung University, Taiwan, in 1990, and Ph.D. in electrical and computer engineering from Georgia Institute of Technology, in 1997.

From 1997 to 1999 he was with Rockwell Science Center, Thousand Oaks, CA, as a Member of the Technical Staff. He joined

the faculty of National Chiao Tung University, Taiwan in 1999 where he is currently an assistant professor in the Department of Electrical and Control Engineering. He received the VHDL International Outstanding Dissertation Award celebrating in Washington DC in October, 1997. His current research interests include VLSI design, digital signal processing, hardware-software codesign, and System-on-Chip architecture. He is a member of Computer and Signal Processing societies of the IEEE.