## Low Power and Area Efficient 2-D DWT Using 9/7 Filter based on NEDA Technique

Ambikesh Prasad Gupta, M.Tech Scholar, IES College of Tech., Bhopal

Shweta Singh, Associate Professor, IES College of Tech., Bhopal

Nitin Meena, Assistant Professor, IES College of Tech., Bhopal

#### **ABSTRACT**

In this paper, a new efficient distributed arithmetic (NEDA) technique is applied to a word-serial pipeline architecture for the 2-D discrete wavelet transform (DWT), increasing the speed and reducing the computation time. The proposed word-serial pipeline architecture is able to compute a complete 2-D DWT binary tree in an on-line fashion and is easily configurable to compute any required 2-D DWT subtree. The architecture is free of ROM, multiplication and subtraction, and applies NEDA to the 9-tap low-pass and 7-tap high-pass filters concurrently. The proposed NEDA architecture is 30% faster than the existing architecture and has 100% hardware utilization efficiency.

#### **KEYWORDS**

2-D Discrete Wavelet Transform (DWT), NEDA, Synopsys Simulation.

#### **1. INTRODUCTION**

The Fourier transform (FT), with its fast algorithms (FFT), is an important tool for the analysis and processing of many natural signals. However, the FT is limited in its ability to characterize natural signals that are non-stationary (e.g. speech). Although a time-varying, overlapping-window FT, namely the short-time FT (STFT), is well known in speech processing applications, the time-scale based wavelet transform (WT) is a powerful mathematical tool for non-stationary signals.

Owing to the remarkable advantages of the discrete wavelet transform (DWT) over unitary transforms such as the discrete cosine transform (DCT), discrete sine transform (DST) and discrete Fourier transform (DFT), the DWT is widely used in multimedia applications for efficient utilization of bandwidth with enhanced performance [1, 2]. The amount of multimedia data to be processed is huge, and many applications require real-time processing. To meet the timing requirements, the DWT is implemented in a VLSI system [3, 4].

Numerous wavelet applications have given rise to algorithms and architectures for wavelet transforms. Since digital signal processors (DSPs) are general-purpose architectures, and are therefore not optimal for a specific algorithm such as the WT, we designed and implemented a specialized, parallel architecture. With such a design, the calculations can to a great extent run in parallel, which increases the overall speed.

The 2-D discrete wavelet transform (DWT) has been widely used in many areas of science and engineering, e.g. signal and image processing, bio-informatics, geophysics and meteorology, for applications involving compression and analysis of various forms of data. The well-known image coding standards MPEG-4 and JPEG2000 have adopted the DWT as the transform coder due to its remarkable advantages over other transforms.

Several designs have been proposed for the multiplier-less implementation of the 2-D DWT based on the principle of distributed arithmetic (DA) [5, 6]. One such structure distributes the bits of the fixed coefficients instead of the bits of the input samples. Consequently, the adder complexity of that structure depends on the DA matrix of the fixed coefficients [5].

Martina *et al.* [3] approximated the 9/7 filter coefficients and expressed the 9/7 filter outputs in terms of 5/3 filter outputs. With that approach, they significantly reduced the adder complexity of the 9/7 DWT. Gaurav Tewari *et al.* [9] suggested an LUT-less DA-based design for the implementation of the 1-D DWT. They eliminated the ROM cells required by DA-based structures at the cost of additional adders and multiplexers. The adder complexity of this structure is significantly higher than that of the other multiplier-less structures.

In this paper, we propose an efficient scheme to derive NEDA-based word-serial parallel structures that are free of ROM, multiplication and subtraction.

The remainder of the paper is organized as follows. The mathematical derivation of the NEDA technique is presented in Section 2. The proposed multiplier-less architecture for the 9/7 wavelet filter using NEDA is presented in Section 3. Simulation results for the proposed structures are discussed and compared with the existing structures in Section 4. The conclusion and future scope are presented in Section 5.

#### **2. MATHEMATICAL DERIVATION OF NEDA**

Consider the following sum of products [7]:

$$Z = \sum_{k=1}^{L} X_k Y_k \tag{1}$$

where $X_k$ are the fixed coefficients and $Y_k$ are the input data words. Equation (1) can also be written in the form of a matrix product as:

$$Z = \begin{bmatrix} X_1 & X_2 & \dots & X_L \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_L \end{bmatrix} \tag{2}$$

Both $X_k$ and $Y_k$ are in two's complement format. The two's complement representation of $X_k$ may be expressed as

$$X_{k} = -X_{k}^{M} 2^{M} + \sum_{i=N}^{M-1} X_{k}^{i} 2^{i} \tag{3}$$

where $X_k^i \in \{0, 1\}$ for $i = N, N+1, \dots, M$; $X_k^M$ is the sign bit and $X_k^N$ is the least significant bit (LSB).

Substituting (3) into (1) and exchanging the order of summation gives

$$Z = -Z^{M} 2^{M} + \sum_{i=N}^{M-1} Z^{i} 2^{i} \tag{4}$$

where $Z^i = \sum_{k=1}^{L} X^i_k Y_k$ for $i = N, N+1, \dots, M$.
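The identity in (3) and (4) can be checked numerically. The following is a minimal Python sketch, assuming integer coefficients (so N = 0) and a word length `width` chosen by the caller, with the sign bit at index M = `width` - 1. The helper names `bit` and `neda_inner_product` are illustrative and do not come from the NEDA literature; the sketch models the arithmetic only, not the hardware.

```python
def bit(value, i, width):
    """Return bit i of `value` in `width`-bit two's complement."""
    return ((value & ((1 << width) - 1)) >> i) & 1


def neda_inner_product(coeffs, data, width):
    """Evaluate sum(X_k * Y_k) bit plane by bit plane, as in (3) and (4).

    Assumes every coefficient fits in `width`-bit two's complement,
    so the sign bit has index M = width - 1 and the LSB has index N = 0.
    """
    m = width - 1
    lo, hi = -(1 << m), (1 << m) - 1
    if any(not lo <= c <= hi for c in coeffs):
        raise ValueError("coefficient does not fit in the chosen width")
    # Z^i: add the data words whose coefficient has a 1 in bit position i.
    planes = [sum(y for c, y in zip(coeffs, data) if bit(c, i, width))
              for i in range(width)]
    # Weighted recombination; the sign-bit plane carries a negative weight.
    z = -planes[m] * (1 << m)
    z += sum(planes[i] * (1 << i) for i in range(m))
    return z


print(neda_inner_product([3, -5], [-2, 7], width=4))  # 3*(-2) + (-5)*7
```

Only additions of data words and fixed power-of-two weights appear, which is why the scheme needs neither multipliers nor a ROM.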

#### **3. PROPOSED ARCHITECTURE**

In this paper, we propose a multiplier-less 2-D architecture for the 9/7 wavelet filter using NEDA. The filter coefficients of the 9/7 wavelet filter are given in Table 1. For simplicity, the coefficients are multiplied by 100 and reduced to integers. The computation of the 1-D low-pass output is explained by an example.

Table 1: Filter Coefficients of 9/7 Wavelet Filter.

|       | Coefficients | Multiplied by 100 | 6-bit binary representation | 2's complement of negative no. |
|-------|--------------|-------------------|-----------------------------|--------------------------------|
| $h_0$ | 0.6029       | 60                | 111100                      | -                              |
| $h_1$ | 0.2668       | 26                | 011010                      | -                              |
| $h_2$ | -0.0782      | -7                | -                           | 001001                         |
| $h_3$ | -0.0168      | -1                | -                           | 000011                         |
| $h_4$ | 0.0267       | 2                 | 000010                      | -                              |
| $g_0$ | 0.5575       | 55                | 110111                      | -                              |
| $g_1$ | -0.2956      | -29               | -                           | 100011                         |
| $g_2$ | -0.0287      | -2                | -                           | 000110                         |
| $g_3$ | 0.0456       | 4                 | 000100                      | -                              |

Where  $h_0$ ,  $h_1$ ,  $h_2$ ,  $h_3$ ,  $h_4$  are the Low pass filter coefficients and  $g_0$ ,  $g_1$ ,  $g_2$ ,  $g_3$  are the High pass filter coefficients.
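The integer column of Table 1 is consistent with scaling each coefficient by 100 and truncating toward zero. This rounding rule is inferred from the tabulated values rather than stated in the text, so the Python sketch below should be read under that assumption; the function name `scale_coefficients` is illustrative.

```python
import math

LOW_PASS = [0.6029, 0.2668, -0.0782, -0.0168, 0.0267]   # h0..h4 from Table 1
HIGH_PASS = [0.5575, -0.2956, -0.0287, 0.0456]          # g0..g3 from Table 1


def scale_coefficients(coeffs, factor=100):
    """Scale by `factor` and drop the fractional part (truncate toward zero)."""
    return [math.trunc(c * factor) for c in coeffs]


print(scale_coefficients(LOW_PASS))   # [60, 26, -7, -1, 2]
print(scale_coefficients(HIGH_PASS))  # [55, -29, -2, 4]
```

Truncation rather than rounding to nearest matters here: rounding would give, for example, -8 instead of -7 for $h_2$.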

If we take the low pass coefficients  $h_0$ ,  $h_1$ ,  $h_2$ ,  $h_3$ ,  $h_4$  and multiply by r(1), r(2), r(3), r(4) and r(5) then we get the low pass output  $Y_L$  of the 9/7 filter as [6]:

$$Y_{L} = \begin{bmatrix} h_{0} & h_{1} & h_{2} & h_{3} & h_{4} \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \\ r(3) \\ r(4) \\ r(5) \end{bmatrix} \tag{5}$$

Where

$$r(1) = X(n) + X(n-8) \tag{6}$$

$$r(2) = X(n-1) + X(n-7) \tag{7}$$

$$r(3) = X(n-2) + X(n-6) \tag{8}$$

$$r(4) = X(n-3) + X(n-5) \tag{9}$$

$$r(5) = X(n-4) \tag{10}$$
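Equations (6) to (10) exploit the symmetry of the 9-tap filter: samples that share a tap weight are added first, so five products replace nine. The Python sketch below illustrates this folding under two assumptions: the nine most recent samples are held in a list `window` with `window[j]` standing for X(n-j), and the weight list `w` is ordered so that `w[k]` is the weight shared by the samples in r(k+1). All function names are illustrative.

```python
def fold_window(window):
    """Return [r(1), ..., r(5)] from nine samples, window[j] = X(n - j)."""
    if len(window) != 9:
        raise ValueError("a 9-tap filter needs nine samples")
    # Pair the samples that lie symmetrically around the centre X(n-4).
    r = [window[j] + window[8 - j] for j in range(4)]
    r.append(window[4])  # the centre sample has no partner
    return r


def folded_output(w, window):
    """Five multiply-adds on the folded inputs, in the form of (5)."""
    return sum(wk * rk for wk, rk in zip(w, fold_window(window)))


def direct_output(w, window):
    """Reference: the unfolded symmetric filter, tap j uses w[min(j, 8 - j)]."""
    return sum(w[min(j, 8 - j)] * window[j] for j in range(9))


samples = [3, 1, 4, 1, 5, 9, 2, 6, 5]
weights = [2, -1, -7, 26, 60]
print(fold_window(samples))                                              # [8, 7, 6, 10, 5]
print(folded_output(weights, samples), direct_output(weights, samples))  # 527 527
```

The folded and direct forms agree for any input, which is what allows the adder stage to precede the NEDA stage.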

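Taking the scaled coefficients from Table 1 and, as an example, r(1) = 1, r(2) = 2, r(3) = 3, r(4) = 4 and r(5) = 5, the direct computation gives

$$Y_{L} = \begin{bmatrix} 60 & 26 & -7 & -1 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix} = 97 \tag{11}$$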

Now if we implement this with NEDA, then

$$Y_{L} = \begin{bmatrix} 60 & 26 & -7 & -1 & 2 \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \\ r(3) \\ r(4) \\ r(5) \end{bmatrix} \tag{12}$$

$$Y_{L} = \begin{bmatrix} 111100 & 011010 & 001001 & 000011 & 000010 \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \\ r(3) \\ r(4) \\ r(5) \end{bmatrix} \tag{13}$$

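The DA matrix is now formed from the bits of the filter coefficients in (13): each column holds one coefficient and each row one bit position, with the least significant bit in the first row:

$$\begin{bmatrix} X_k \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \tag{14}$$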

And thus

$$Z_{M} = \begin{bmatrix} 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \\ r(3) \\ r(4) \\ r(5) \end{bmatrix} \tag{15}$$

$$= \begin{bmatrix} r(3) + r(4) \\ r(2) + r(4) + r(5) \\ r(1) \\ r(1) + r(2) + r(3) \\ r(1) + r(2) \\ r(1) \end{bmatrix} \tag{16}$$
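The construction in (14) to (16) can be reproduced mechanically. The Python sketch below takes the five 6-bit patterns of (13) as strings, builds the DA matrix with the LSB row first, and forms the adder-array sums for a given input vector. The names `PATTERNS`, `da_matrix` and `adder_array` are illustrative.

```python
PATTERNS = ["111100", "011010", "001001", "000011", "000010"]  # from (13)


def da_matrix(patterns):
    """Row i holds bit i (LSB first) of every coefficient pattern."""
    width = len(patterns[0])
    return [[int(p[width - 1 - i]) for p in patterns] for i in range(width)]


def adder_array(matrix, r):
    """Each row adds the inputs r(k) selected by its 1 entries, as in (16)."""
    return [sum(rk for b, rk in zip(row, r) if b) for row in matrix]


matrix = da_matrix(PATTERNS)
for row in matrix:
    print(row)
print(adder_array(matrix, [1, 2, 3, 4, 5]))  # [7, 11, 1, 6, 3, 1]
```

Each row needs only as many additions as it has 1 entries minus one, so the adder count is fixed by the coefficient bits rather than by the input word length.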

Assume r(1) = 1, r(2) = 2, r(3) = 3, r(4) = 4 and r(5) = 5. The NEDA computation illustrated in Figure 1 then proceeds as follows.

Step 1: all inputs are converted to binary numbers:

$$r(1) = 001,\ r(2) = 010,\ r(3) = 011,\ r(4) = 100,\ r(5) = 101$$

Step 2: all binary inputs are sign-extended:

$$s(1) = 0001,\ s(2) = 0010,\ s(3) = 0011,\ s(4) = 0100,\ s(5) = 0101$$

Step 3: the sign-extended inputs are applied to the adder array, which forms the sums in (16). The last row belongs to the sign-bit position and is therefore negated:

$$m(1) = 0111,\ m(2) = 1011,\ m(3) = 0001,\ m(4) = 0110,\ m(5) = 0011,\ m(6) = \mathrm{not}(r(1)) + 1 = 1111$$

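Step 4: the adder-array outputs are applied to the MUX and accumulated, with a 1-bit right shift of the running sum before each addition.

The first adder-array output m(1) is shifted right by 1 bit (the 0 before the apostrophe is the bit shifted in):

MUX (1) = 0'0111 = $Y_P(0)$

Adding MUX (2) gives $Y_P(1)$:

$$\begin{array}{rl} & 0\,0111 \\ + & 1011\phantom{0} \\ \hline = & 11101 \end{array}$$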



Figure 1: Mathematical calculation of the NEDA Technique of the Low-pass Wavelet Filter Output

The output $Y_P(1)$ is again shifted right by 1 bit and MUX (3) is added:

$$\begin{array}{rl} & 0\,11101 \\ + & 0001\phantom{00} \\ \hline = & 100001 \end{array}$$

$$Y_P(1) + MUX(3) = Y_P(2)$$

The output $Y_P(2)$ is again shifted right by 1 bit and MUX (4) is added:

$$\begin{array}{rl} & 0\,100001 \\ + & 0110\phantom{000} \\ \hline = & 1010001 \end{array}$$

$$Y_P(2) + MUX(4) = Y_P(3)$$

The output $Y_P(3)$ is again shifted right by 1 bit and MUX (5) is added:

$$\begin{array}{rl} & 0\,1010001 \\ + & 0011\phantom{0000} \\ \hline = & 10000001 \end{array}$$

$$Y_P(3) + MUX(5) = Y_P(4)$$

The output $Y_P(4)$ is again shifted right by 1 bit and MUX (6) is added:

$$\begin{array}{rl} & 0\,10000001 \\ + & 1111\phantom{00000} \\ \hline = & 1\,001100001 \end{array}$$

The leading carry is rejected, so the total output is $Y_P(5)$ = 001100001 = 97, in agreement with (11).
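The shift-and-add accumulation above can be replayed in software. Shifting the running sum right before each addition is equivalent to weighting the i-th adder-array output by $2^i$, which the Python sketch below does with left shifts on integers. The function name `shift_accumulate`, the 4-bit adder word and the 9-bit output word are assumptions read off the worked example, not parameters given by the paper.

```python
def shift_accumulate(m, word_bits=4, out_bits=9):
    """Combine adder-array outputs m (LSB row first) into one result.

    The last row belongs to the sign-bit position, so it enters as the
    two's complement of its value within `word_bits` bits (not(m) + 1).
    Any carry beyond `out_bits` bits is discarded at the end.
    """
    mask = (1 << word_bits) - 1
    terms = list(m[:-1]) + [(-m[-1]) & mask]   # e.g. 1 becomes 1111
    partial = []
    acc = 0
    for i, term in enumerate(terms):
        acc += term << i       # same alignment as shift-right-then-add
        partial.append(acc)
    return partial, acc & ((1 << out_bits) - 1)


partial, result = shift_accumulate([7, 11, 1, 6, 3, 1])
print([format(p, "b") for p in partial])
print(result)  # 97
```

The printed partial sums are 111, 11101, 100001, 1010001, 10000001 and 1001100001, matching $Y_P(0)$ to $Y_P(4)$ and the final sum before the carry is rejected.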

#### **4. RESULT AND SIMULATION**

We have implemented the multiplier-less 9/7 filter for the two-dimensional discrete wavelet transform (2-D DWT) using NEDA. The modified NEDA structure includes a MUX stage and requires less time than the existing architecture, which yields a good computation speed.

 
Table 2: Theoretical comparison between existing and proposed architectures

| Author                           | Multiplier | Adder / Subtractor | Shift register | MUX 2×1 | ROM        |
|----------------------------------|------------|--------------------|----------------|---------|------------|
| Gaurav Tewari *et al.* [9] (2-D) | 0          | 80                 | 68             | 9       | 16 (4-bit) |
| Proposed Architecture (1-D)      | 0          | 36                 | 24             | 0       | 0          |
| Proposed Architecture (2-D)      | 0          | 60                 | 24             | 9       | 0          |

Table 3: Synopsys result comparison between existing and proposed architectures

| Author                           | Required time (ns) | Power (µW) | Area (µm²)  | Area-delay product (µm²·ns) |
|----------------------------------|--------------------|------------|-------------|-----------------------------|
| Gaurav Tewari *et al.* [9] (2-D) | 54.40              | 297.5338   | 39620.65283 | 2155363.514                 |
| Proposed Architecture (1-D)      | 19.80              | 43.314     | 9553.80     | 189165.24                   |
| Proposed Architecture (2-D)      | 34.80              | 270.7449   | 34304.71688 | 1193804.147                 |

The theoretical comparison between the architecture of Gaurav Tewari *et al.* [9] and the proposed multiplier-less NEDA-based architectures is shown in Table 2.

The architecture of Gaurav Tewari *et al.* [9] has been implemented and simulated in VHDL, and its functionality has been verified by RTL and gate-level simulation. To estimate the timing, area and power of an ASIC design, we used Synopsys Design Compiler to synthesize the design to gate level.

The comparison between the architecture of Gaurav Tewari *et al.* [9] and the proposed multiplier-less NEDA-based architectures on the basis of area, delay and power is shown in Table 3. The output waveforms of the full adder, the shift register and the 2-D 9/7 filter are shown in Figures 2 to 4.
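The area-delay products and the relative changes implied by Table 3 follow from simple arithmetic. The short Python check below uses the tabulated delay, power and area figures; the dictionary layout and the function names are illustrative.

```python
# (delay in ns, power in uW, area in um^2), copied from Table 3
TABLE3 = {
    "ref9_2d": (54.40, 297.5338, 39620.65283),
    "proposed_1d": (19.80, 43.314, 9553.80),
    "proposed_2d": (34.80, 270.7449, 34304.71688),
}
DELAY, POWER, AREA = 0, 1, 2


def area_delay_product(name):
    """Area multiplied by delay, in um^2 * ns."""
    row = TABLE3[name]
    return row[AREA] * row[DELAY]


def reduction_percent(metric, new="proposed_2d", old="ref9_2d"):
    """How much lower `new` is than `old` for one metric, in percent."""
    return 100.0 * (1.0 - TABLE3[new][metric] / TABLE3[old][metric])


for name in TABLE3:
    print(name, round(area_delay_product(name), 3))
for label, metric in (("delay", DELAY), ("power", POWER), ("area", AREA)):
    print(label, round(reduction_percent(metric), 1))
```

For the two proposed designs the computed products agree with the last column of Table 3, and `reduction_percent` gives the relative saving of the proposed 2-D design for each metric.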



Figure 2: Output waveform of the full adder



Figure 3: Output waveform of the shift register



Figure 4: Output waveform of the 2-D 9/7 filter using the NEDA technique

#### **5. CONCLUSION**

In this paper, a multiplier-less VLSI architecture is proposed using the new distributed arithmetic algorithm named NEDA. A mathematical derivation is given to establish the validity of the NEDA scheme.

NEDA is a distributed arithmetic paradigm [7] for the VLSI implementation of digital signal processing (DSP), image and video algorithms that involve inner products of vectors. The proposed NEDA architecture is 30% faster than the DA-based architecture [9] and, in the synthesis results of Table 3, also requires less area and power.

If a lifting-based technique is applied to the proposed architecture, the area and power are expected to be reduced further and the performance improved. The architecture is also expected to be efficient in applications such as image compression and speech denoising.

#### **6. REFERENCES**

- [1] M. Alam, C. A. Rahman, and G. Jullian, "Efficient distributed arithmetic based DWT architectures for multimedia applications," in *Proc. IEEE Workshop on SoC for Real-Time Applications*, pp. 333-336, 2003.
- [2] X. Cao, Q. Xie, C. Peng, Q. Wang and D. Yu, "An efficient VLSI implementation of distributed architecture for DWT," in *Proc. IEEE Workshop on Multimedia and Signal Process.*, pp. 364-367, 2006.
- [3] M. Martina and G. Masera, "Low-complexity, efficient 9/7 wavelet filters VLSI implementation," *IEEE Trans. on Circuits and Syst. II, Express Briefs*, vol. 53, no. 11, pp. 1289-1293, Nov. 2006.
- [4] M. Martina and G. Masera, "Multiplierless, folded 9/7-5/3 wavelet VLSI architecture," *IEEE Trans. on Circuits and Syst. II, Express Briefs*, vol. 54, no. 9, pp. 770-774, Sep. 2007.
- [5] C.-C. Cheng, C.-T. Huang, C.-Y. Cheng, C.-Jr. Lian and L.-G. Chen, "On-chip memory optimization scheme for VLSI implementation of line-based two dimensional discrete wavelet transform," *IEEE Trans. on Circuits and Systems for Video Technology*, vol. 17, no. 7, pp. 814-822, July 2007.
- [6] P. K. Meher, B. K. Mohanty and J. C. Patra, "Hardware-efficient systolic-like modular design for two-dimensional discrete wavelet transform," *IEEE Trans. on Circuits and Syst. II, Express Briefs*, vol. 55, no. 2, Feb. 2008.
- [7] A. M. Shams, A. Chidanandan, W. Pan and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," *IEEE Trans. on Signal Processing*, vol. 54, no. 3, Mar. 2006.
- [8] A. Chidanandan and M. Bayoumi, "Area-efficient NEDA architecture for the 1-D DCT/IDCT," ICASSP 2006.
- [9] G. Tewari, S. Sardar and K. A. Babu, "High-speed & memory efficient 2-D DWT on Xilinx Spartan3A DSP using scalable polyphase structure with DA for JPEG2000 standard," ISBN 978-1-4244-8679-3, IEEE, 2011.