# Simulation and Synthesis of 2048 Point FFT/IFFT for mobile Wimax 802.16e

Unnati C Mehta Electronics Department, NOIDA Institute of Engineering Technology Noida, India

### ABSTRACT

This paper represents 2048 point Fast Fourier Transform and its inverse (FFT/IFFT) for Mobile Wi-MAX. Modified architecture also provides concept of local ROM module and variable length support from 128~2048 point for FFT/IFFT. UMC 0.18µm is used to design the same. FFT/IFFT chip consumes 266.81mW at 40MHz, 130.74mW at 20MHz and 65mW at 10 MHz for length of 2048 point. Its core size is 2.6mm x 2.6mm. Its latency is 2050 clock cycle with maximum clock frequency 40MHz. Start up time for the chip is N/2 clock cycle where N is the length of FFT/IFFT. 16 bit word length with fixed point precision is used for entire implementation. As Wi-MAX is used for Metropolitan Area Network, it uses Orthogonal Frequency Division Multiple Access scheme.

Keywords FFT, IFFT, Pipelined Architecture, SDF, Wi-MAX

# 1. INTRODUCTION

Mobile Wi-MAX (IEEE 802.16e standard) is widely used for configuring wireless MAN (Metropolitan Area Network), which uses OFDMA (Orthogonal Frequency Division Multiple Access) [1]. FFT/IFFT is one of the important blocks used in the OFDMA systems. By adjusting length of FFT from 128 to 2048, it can be used for different applications. We have targeted our design for Mobile Wi-MAX, so preference is given 2048 point FFT/IFFT. FFT architectures are mainly classified in two ways: Memory based architecture and pipelined based architecture. Popular pipelined architectures are based on radix-2 or radix-4 algorithms. Radix-4 has less computational complexity compared to radix-2 architecture. Even higher radix algorithm can also be used for the implementation to decrease the computational complexity in the design. Higher the radix, lower the computational complexity. Tradeoff for the selection of architecture is the requirement of length of FFT for the particular application.

Pipelined architectures are further classified in two types: (1) SDF (Single Delay Path Feedback) (2) MDC (Multiple Delay Path Commutator). Both have their personal advantages. MDC are widely used for MIMO (Multiple Input Multiple Output) OFDM techniques as one can simultaneously operate on channels. Also MDC utilization of delay elements is low in MDC as compared to SDF [2].

MDC uses more number of storage elements as compared to SDF. Single stage block diagram of R2SDF (Radix-2 SDF) and R4MDC (Radix-4 MDC) is shown in fig. 1, where F is FIFO. BF2 and BF4 stands for two point and four point butterfly unit respectively. D is the delay element.

In this paper, we represent high speed and high throughput VLSI architecture using R2SDF architecture. Rest of the paper is organized as follows: Section 2 describes basic FFT algorithm and their dataflow graph. Section 3 emphasis on conventional SDF and proposed architecture with local ROM module and also details each block in proposed architecture.

Satyendra Sharma Electronics Department, NOIDA Institute of Engineering Technology Noida, India

Section 4 contains implementation using proposed scheme. Results have been discussed in and compared with earlier implementations in section 5



Fig 1(a): R2SDF basic module



Fig 1(b): R4MDC basic module

# 2. BASIC FFT ALGORITHM

The DFT (Discrete Cosine Transform) for N point sequence x(n) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n). W_N^{nk} \text{ where, } W_N = e^{-j2\pi/N} \text{ and } k = 0 \dots N - 1$$
(1)

Where x(n) and X(k) can be the complex number. Here *n* is the time index and *k* is the frequency index. For the calculation of this transform requires N2 complex multiplication and N(N-1) complex addition. In order to reduce the computational complexity, Cooley and Tukey proposed and efficient algorithm [3]. Afterwards radix-2, radix-4, radix-22[4], split radix [5] and extended split radix [6] have been developed to reduce the computational complexity of FFT. SFG (Signal Flow Graph) of 8-point FFT using radix-2 algorithm is shown in fig. 2. As twiddle factor is multiplied after each stage, the flow graph shown in fig. 2 can easily be implemented by R2SDF.



Fig2: Signal flow graph of 8point FFT using radix-2 algorithm

# **3. PROPOSED ARCHITECTURE 3.1A Single Delay Path Feedback (SDF)**

Basic module of SDF is shown in fig. 1(a). Where first N/2 input samples are stored in FIFO and operation starts when N/2+1st data is available at the input to the butterfly unit [7]. For designing N-stage pipelined FFT/IFFT architecture same SDF module is repeated for log2N time. It gives flexibility in design of any point FFT/IFFT. SFG shown in fig. 2 can be implemented using R2SDF. Block diagram for the same is shown in fig. 3



Fig 3: Pipelined architecture of 16 point FFT using R2SDF

Where 8, 4, 2 and 1 written in the FIFO field of basic module show the size of the FIFO required at different stages. Generalized pipelined stage based on SDF is given below. It includes butterfly structure with Address generator and ROM. Output is a complex multipled output.



Fig 4: General pipelined stage based on SDF [5]

#### **3.2 Proposed Architecture**

Modification in generalized pipeline stage is done by modifying FIFO design to enhance the throughput of the design. At different stages of the design, requirement of FIFO is different. Length of FIFO depends upon in which stage of pipeline it is used, where width of FIFO depends upon word length used. In proposed architecture we have chosen word length as 2 x 16 bits. So, width of FIFO is 32 bits. As, suggested in [7], we need to store first N/2 input points in FIFO. This brings the clarity to design different sized FIFO. For N-point FFT/IFFT size of FIFO for the first stage is N/2 x 32 bits, for second stage N/4 x 32 bits, so on and so forth till the end of pipeline architecture. Read and write clock of FIFO is taken on different events. This helps to improve the throughput for the pipelined architecture. Read and write operations are explained in more detail in section 5.



Butterfly module complex multiplier

# Fig 5: Proposed architecture for writing different output to FIFO

In generalized pipeline stage based on SDF, summation output of butterfly unit is stored in FIFO, and subtraction output is passed for multiplication with twiddle factor on each clock event as shown in fig. 4. In proposed architecture subtraction output is stored in FIFO and summation output is passed to the next stage without any multiplication. For the N-point FFT/IFFT first N/2 points are stored in the FIFO. N/2+1st input is directly given as the second input to butterfly module and first input comes from the FIFO as shown in fig. 5. Summation output of the butterfly is passed directly to the next stage, while subtraction output is stored back to the FIFO. When next set of input comes, subtraction outputs of butterfly unit which was stored in FIFO are passed as an input of complex multiplier to multiply necessary twiddle factor Outputs which are first fed to the input of the next stage are propagated first in the pipeline. Idea behind this change in proposed architecture is to arrange the output as per order as required in SFG for N point radix-2 FFT/IFFT as shown in fig. 2. RTL diagram for the proposed architecture is shown in fig. 6. AGU (Address Generation Unit) which generates read address for the ROM. Through which particular twiddle factor is passed for the multiplication



Fig 6: RTL diagram of proposed Architecture.

# 4. IMPLEMENTATION 4.1 AGU & ROM Design

AGU generates different sets of addresses for performing FFT and IFFT. Counter is implemented inside AGU. One control signal for the selection of FFT or IFFT is provided. Depending upon that selection bit AGU will generate read addresses for the ROM. ROM contains all twiddle factors for performing FFT/IFFT. Block diagram for the same is shown in fig. 7, where twiddle block is the storage of twiddle in an array.



Fig7: Block diagram of twiddle generation

# 4.2 Read and Write Operation of FIFO Architecture

Reading and writing of FIFO cannot be done at single clock event due to combinational path delay of two point butter fly unit. So, reading and writing of FIFO is done on two different clock events. N-point FFT/IFFT pipelined architecture using proposed scheme is shown in fig. 9. Each stage has dedicated unit of AGU and ROM, which help to increase the speed of operation with the cost of area. Disadvantage of this architecture is that, it takes N/2 clock cycle as its start up time because it requires to store the input data in FIFO to start first operation.



Fig 9: N Point FFT/IFFT Pipelined Architecture

# 5. RESULTS & COMPARISION

Design for 2048-point is implemented using 0.18  $\mu$ m Faraday standard cell library from UMC. Gate level netlist` generated from design vision is imported in Encounter tool from Cadence. Remaining flow is completed using encounter. Table 1 lists all implementation results of the chip. Table 2 shows the comparison results of proposed architecture to existing architectures.

As compared to memory based architecture suggested in [10], our architecture takes more hardware but execution time is drastically reduced in proposed architecture, because memory based architecture uses only one butterfly unit and one complex multiplier. So, it helps to reduce the area, but at the same time execution speed is affected by large amount.

| <b>TABLE 1. Implementation Results of Proposed</b> |  |  |  |  |
|----------------------------------------------------|--|--|--|--|
| Architecture                                       |  |  |  |  |

| Implementation Results of Proposed Architecture |                           |  |  |  |
|-------------------------------------------------|---------------------------|--|--|--|
| Technology                                      | UMC 0.18 µm Faraday       |  |  |  |
| Voltage                                         | 1.8 V                     |  |  |  |
| World Length                                    | 16 bits                   |  |  |  |
| Gate count <sup>*1</sup>                        | 526047                    |  |  |  |
| Core size <sup>*1</sup>                         | 2.6 x 2.6 mm <sup>2</sup> |  |  |  |
| Maximum frequency                               | 40 MHz                    |  |  |  |
| Power                                           | 266.81 mW @ 40MHz         |  |  |  |
|                                                 | 130.74 mW @ 20 MHz        |  |  |  |
|                                                 | 65.05 mW @ 10 MHz         |  |  |  |
| Supported length of FFT/IFFT                    | 128~2048                  |  |  |  |

#### TABLE II COMPARISON OF PROPOSED ARCHITECTURE TO EXISTING ARCHITECTURES

|                              | Proposed<br>Architect<br>ure | H.<br>harikrish<br>nan [14]<br>2011 | Simeng<br>Li,Huxio<br>ng Xu<br>[12] | Chin-<br>Long<br>Wey[10]<br>2007 |
|------------------------------|------------------------------|-------------------------------------|-------------------------------------|----------------------------------|
| Technolo<br>gy               | 0.18 µm                      | 0.6 µm                              | 0.13µm                              | 0.18 µm                          |
| Voltage                      | 1.8V                         |                                     | 1.3V                                | 1.8V                             |
| Length<br>Support            | 128~2048                     | 512~1024                            | 128~256                             | 2048~819<br>2                    |
| Word<br>length               | 32 bits                      | 16 bits                             | 10 bits                             | 24 bits                          |
| Gate<br>Count                | 526047*1                     |                                     |                                     |                                  |
| Maximu<br>m<br>Frequenc<br>y | 40 Mhz                       | 92.36Mhz                            | 125 Mhz                             | 173 Mhz                          |
| Executio<br>n time           | 51.25 μs                     | 65.89µs @                           |                                     | 882.22 us                        |
| Power                        | 266.81m<br>W<br>2K@40<br>MHz |                                     |                                     | 40.80 mW<br>2K@55<br>MHz         |

\*1 Gate counting including all local ROM Module

\*2 Area after Synthesis

\*3 Excluding RAM

Even word length of proposed architecture is taken higher than existing ones [12] and [14] to improve the precision in operation. Post synthesis area for 512 point FFT/IFFT suggested in [14] is way higher than post synthesis area of proposed architecture. Post synthesis area for the proposed architecture is 6.76 mm2. Architecture in [12] has variable length of 128 and 256 point only, while proposed architecture has all variable length from 128 to 2048 point. Although the gate count of proposed architecture is higher than others because of dedicated ROM module for each stage of pipeline, yet it certainly helps to improve the execution time. In our proposed architecture we have focused on the execution time which is less as compare to others [10], [12] and [14]. Hence the execution time has been improved for the given FFT/IFFT size. Here in my proposed architecture the input word size if 32 bits hence the area is little bit more than others [12], [10] yet area is also improved if we compare our result for the current paper [14].

# 6. ACKNOWLEDGEMENT

To Discover analyze and to present something new is to venture on an untrodden path towards an unexplored destination is an arduous adventure unless one gets a true torchbearer to show the way. This enlightening guidance, found in revered guide **Mr. Satyendra Sharma**, Associate Professor, Electronics & Communication Engineering Department, (NIET, Greater NOIDA). It gives immense pleasure to thank Dr. **M.K. Dewan** Head of the Department, Electronics & Communication Engineering Department, NIET Greater NOIDA. This project would have not been possible without support of high end EDA tools, which had used in **CADENCE Design System NOIDA**. Would like to thank Cadence Company which allowed to use its platform for Academic purpose

#### 7. CONCLUSION

Implementation Concept of 2048 point FFT/IFFT has been been done for mobile Wi-MAX. In near future would like to go for

Layout design of the same using Fab lab, and implement a prototype for the same. Aim of this project is to design ASIC for "2048-point FFT/IFFT processor for MobileWi-MAX". Following are objectives of next stage of project.

- Static Timing Analysis of Synthesized netlist.
- Static Timing Analysis of Laid out Design
- Apply DFT for testing of the design
- Testing of the Chip after Fabrication.

#### 8. REFERENCES

- [1] WiMAX Forum, Mobile WiMAX-Part I: A technical overview and performance evaluations, Feb. 21, 2006
- [2] Bo Fu, Paul Ampadu, "An Area Efficient FFT/IFFT Processor for MIMO-OFDM WLAN 802.11n" J Sign Process Syst (2009) 56:59–68
- [3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," *Math. Comput.*, vol. 19, Apr. 1965, Page(s):297 301.
- [4] M. Vetterli and P. Duhamel, "Split-radix algorithms for length- pn DFT's," *IEEE Trans. Acoust., Speech, Signal Processing*, vol. 37,pp. 57–64, Jan. 1989
- [5] Soo-Chang Pei, Tzyy-Liang Luo, "Split-radix generalized fast Fourier transform," *Signal Processing*, 54(1996) 137-151.
- [6] Daisuke Takahashi, May 2001 "An Extended Split-Radix FFT Algorithm," *IEEE Signal Processing Letters*," vol. 8, no.5, pp. 145-147
- [7] Erling H. Wold "Pipelined and Parallel-pipelined FFT FFT processor for VLSI Implementation" 0018-9340/84/0500-0414\$01 .00 © 1984 IEEE
- [8] Hyun-Yong Lee, In-Cheol Park "Balanced Binary-Tree Decomposition for Area-Efficient Pipelined FFT Processing" 8328/\$25.00 © 2007 IEEE
- [9] Jen-Chih Kuo, Ching-Hua Wen, Chih-Hsiu Lin, An-Yeu Wu, "VLSI Design of Variable –Length FFT/IFFT Processor for OFDM-Based Communication Systems" *EURASIP Journal on Applied Signal Processing* 2003:13, 1306-1316
- [10] Chin-Long Wey, Wei-Chien ang and Shin-Yo Lin "Efficient VLSI Implementation of Memory-Based FFT processors for DVB-T Applications" *IEEE Computer* Society Annual Symposium on VLSI(ISVLSI'07)
- [11] B. Sklar, "Digital Communications, Fundamentals and Applications, Second Edition, New Delhi, Pearson Education, 2004.
- [12] Simeng li, Huxiong Xu, Wenhua Fan, Yun Chen, Xiaoyang Zeng 2010"A 128/256-Point Pipeline FFT/IFFT Processor for MIMO OFDM System IEEE 802.16e" 978-1-4244-5309-2/10/\$26.00 ©2010 IEEE
- [13] J.A. Heller and I. M. Jacobs, "Viterbi Decoding for Satellite and Space Communications", *IEEE Trans. Commun. Technol.*, vol. COM19, no. 5, 1971
- [14] K.Harikrishna 1, T. Rama Rao 2, Vladimir A. Labay 2011"FPGA Implementation of FFT Algorithm for IEEE 802.16e (Mobile WiMAX)" International Journal of Computer Theory and Engineering, Vol. 3, No. 2, April 2011