# Design Space Exploration of DSP Techniques for Hardware Software Co-Design: An OFDM Transmitter Case Study

Mahendra Vucha, Asst. Professor, Dept. ECE, Christ University, Bengaluru, India

Lincy Sara Varghese, PG Scholar, Dept. ECE, Christ University, Bengaluru, India

#### ABSTRACT

This paper explores the design parameters of widely used Digital Signal Processing (DSP) algorithms and techniques. Selected DSP algorithms and techniques are executed on soft-core processors, namely a General Purpose Processor (GPP) and a Digital Signal Processing processor, and on a hard-core processor, a Field Programmable Gate Array (FPGA). After execution, design parameters such as execution time and area (number of slices required on the FPGA) are acquired for each computing architecture. These parameters play a crucial role in selecting resources for optimum real-time execution. The acquired design parameters are presented as a resource utilization chart of DSP techniques for hardware-software co-design, which can help in designing an optimized computing architecture for DSP applications. Finally, the methodology is evaluated using an OFDM transmitter, a real-time DSP application, as a case study, and an optimized computing platform for the OFDM transmitter is proposed.

#### **Keywords**

Design Space Exploration, Hardware-Software co-design, TMS320C6713 DSK, Virtex-5 FPGA.

#### 1. INTRODUCTION

With advances in technology, real-time applications such as image processing, speech synthesis, control and radar use complex and fast DSP algorithms. User demands also change rapidly, which makes it challenging to satisfy them and bring a product to market in a short span of time. A software implementation of an application provides high flexibility, whereas a hardware implementation provides higher performance. Neither realization alone offers both flexibility and performance, so hardware-software co-design is a good choice for balancing the two. Hardware-software co-design is a mixed hardware and software implementation of an application. It requires hardware-software partitioning, which allots some functions of the application to software and the others to hardware. The partitioning is based on the real-time specifications in terms of power, area, time and cost. To arrive at a partitioning decision, the design space of the algorithms must therefore be explored by implementing them individually on soft-core and hard-core architectures.

The DSP algorithms considered in this paper are linear and circular convolution, the digital FIR filter, FFT and IFFT. These algorithms are implemented on three platforms: a general purpose processor and the TMS320C6713 DSK DSP processor (acting as soft-core processors) and a Xilinx Virtex-5 FPGA (acting as the hard-core processor), to explore the parameters required for hardware and software realization. The parameters considered are area (number of logic slices) and computation time.

Several related studies exist. The authors of [6] propose a framework for optimization at the hardware architecture level for signal processing applications such as the FIR filter. Different FIR filter structures are considered for the design space exploration, and a bottom-up modular design methodology with pre-synthesized arithmetic blocks is adopted to reduce synthesis time. In [3], a design space exploration algorithm is proposed that uses Simulink models to perform macro- and micro-architecture exploration. The Simulink model of the time delay of arrival algorithm (the case study) is partitioned into atomic subsystems that are mapped to SystemC threads, and a set of hardware implementations is then generated through high-level synthesis. In [8], a 16-bit FIR filter is co-designed on the Triscend E5 CSoC device; to let the processor and hardware operate in full parallelism, double buffering is applied at both the inputs and outputs of the FIR filter hardware engine. The authors of [7] discuss a fabric model for implementing super data flow graphs; the granularity of the interconnections is closely examined and power consumption is addressed by varying the bit size of the ALUs and multiplexers. In [1], a heterogeneous architecture is proposed that supports hardware/software co-design and provides a cost-effective solution for reaching the task schedulable bound. The case study in [4] describes various software and hardware profiling methodologies and proposes a resource optimization methodology for heterogeneous computing systems.

In this paper, a methodology for design space exploration of DSP techniques is proposed to support hardware-software co-design.

This paper is organized as follows: section 2 briefly describes the chosen DSP algorithms, section 3 discusses the target platforms, section 4 explains the implementation scheme and the OFDM transmitter with its realization specifications based on IEEE 802.11a, section 5 presents the experimental results, and section 6 gives the conclusion and future scope.

## 2. DESCRIPTION OF DSP ALGORITHMS FOR DESIGN SPACE EXPLORATION

The DSP algorithms considered are explained as follows:

#### **2.1 Convolution**

In DSP, convolution is used to predict the output of a system from samples of the input signal and the impulse response of the system. The impulse response is known in advance, so the response of the system to an input signal can be predicted. A convolution involves folding, shifting, multiplication and summation operations. There are two types of convolution: linear convolution and circular convolution. Linear convolution is used for aperiodic and infinite signals. Its symbolic and mathematical representations are given by equations 1 and 2.

$$y[n] = x[n]*h[n] \tag{1}$$

$$y[n] = \sum_{k=-\infty}^{\infty} h[k]\,x[n-k] \tag{2}$$

where h[k] is the impulse response of the system and x[n-k] is the input sample delayed by k units.

Circular convolution is used for periodic and finite signals. It is a sum over one period. The circular convolution of two periodic discrete-time sequences x1[n] and x2[n] with a period of N samples is defined in equation 3.

$$y[n] = \sum_{m=0}^{N-1} x_1[m]\, x_2[(n-m) \bmod N] \tag{3}$$

Circular convolution is easier to compute than linear convolution, and its folding and shifting operations are performed in a circular fashion.
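As a minimal sketch of the two definitions, the following Python functions evaluate the linear convolution sum for finite-length sequences and the N-point circular convolution as a sum over m of x1[m] x2[(n-m) mod N]. The function names and the sample sequences are illustrative assumptions and are not taken from the authors' C or HDL implementations.

```python
def linear_convolution(x, h):
    """Linear convolution of two finite sequences (samples outside them are zero)."""
    y = [0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(h)):
            if 0 <= n - k < len(x):      # x[n-k] is zero outside its support
                y[n] += h[k] * x[n - k]
    return y


def circular_convolution(x1, x2):
    """N-point circular convolution; both sequences must have length N."""
    if len(x1) != len(x2):
        raise ValueError("both sequences must have the same length N")
    n_points = len(x1)
    return [
        sum(x1[m] * x2[(n - m) % n_points] for m in range(n_points))
        for n in range(n_points)
    ]


# Hypothetical sample data, chosen only for illustration.
print(linear_convolution([1, 2, 3, 4], [1, 1]))          # [1, 3, 5, 7, 4]
print(circular_convolution([1, 2, 3, 4], [1, 1, 0, 0]))  # [5, 3, 5, 7]
```

In this example the last linear-convolution sample (4) wraps around and adds to the first (1), which gives the first circular-convolution sample (5); this is the circular shifting that the text refers to.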

#### 2.2 Finite Impulse Response (FIR) filter

One of the most common DSP operations is filtering. Filtering can be used for noise suppression, signal enhancement, and removal or attenuation of specific frequencies. An FIR filter is one whose impulse response is limited to a finite number of points, hence the name Finite Impulse Response filter. An FIR filter is a discrete linear time-invariant system whose output is a weighted sum of the present and past inputs, expressed as:

$$y[n] = \sum_{k=0}^{N-1} b_k\, x[n-k] \tag{4}$$

For an FIR system, the filter coefficients $b_k$ describe the impulse response h[k] of the filter, so linear convolution can be applied for FIR filtering. An FIR filter is implemented using delay blocks, each of which stores a value for one sample period.
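The delay-block structure can be illustrated with a short Python sketch of a direct-form FIR filter. The coefficient values (a 4-tap moving average) and the input samples are hypothetical, and the function name `fir_filter` is an assumption made for this example; it is not the filter used in the experiments.

```python
def fir_filter(b, x):
    """Direct-form FIR filter: y[n] = sum_k b[k] * x[n-k], with x[n] = 0 for n < 0."""
    delay_line = [0.0] * len(b)          # one storage element per tap
    y = []
    for sample in x:
        # Shift the delay line by one sample period and insert the new input.
        delay_line = [sample] + delay_line[:-1]
        y.append(sum(bk * xk for bk, xk in zip(b, delay_line)))
    return y


# Hypothetical 4-tap moving-average coefficients and input samples.
b = [0.25, 0.25, 0.25, 0.25]
x = [4.0, 8.0, 12.0, 16.0, 16.0]
print(fir_filter(b, x))  # [1.0, 3.0, 6.0, 10.0, 13.0]
```

Each output sample costs one multiplication and one addition per tap, which is why the number of taps largely determines both the software computation time and the hardware area of the filter.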

#### 2.3 Fast Fourier Transform (FFT)

The FFT is an important tool in DSP. By calculating the frequency spectrum of a signal, it can be used to analyze information encoded in the frequency, phase or amplitude of its sinusoidal components. The FFT computes the Discrete Fourier Transform (DFT), which is defined by equation (5).

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-\frac{j2\pi kn}{N}}, \quad \text{for } k = 0, 1, 2, \dots, N-1 \tag{5}$$

Compared with direct evaluation of the DFT, which needs on the order of $N^2$ complex multiplications, the FFT is much faster: it takes only $(N/2)\log_2N$ complex multiplications and $N\log_2N$ complex additions. The FFT is implemented here using radix-2 decimation in time, where an N-point signal is decomposed into N single-point time-domain signals.
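The radix-2 decimation-in-time decomposition can be sketched in Python as a recursive function that splits the input into even-indexed and odd-indexed samples until single-point signals remain. This is an illustrative sketch under the assumption that N is a power of two; the names `dft`, `fft_radix2` and `fft_complex_multiplications` are introduced here and do not come from the authors' code. A direct DFT is included only as a reference for checking the result.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition (reference, order N^2 operations)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 0 or n_points & (n_points - 1):
        raise ValueError("length must be a power of two")
    if n_points == 1:
        return list(x)                   # a 1-point signal is its own spectrum
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    half = n_points // 2
    out = [0j] * n_points
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n_points) * odd[k]
        out[k] = even[k] + twiddle       # butterfly, upper output
        out[k + half] = even[k] - twiddle  # butterfly, lower output
    return out


def fft_complex_multiplications(n_points):
    """(N/2) * log2(N) complex multiplications for a power-of-two N."""
    return (n_points // 2) * (n_points.bit_length() - 1)


samples = [1, 2, 3, 4, 0, -1, -2, -3]    # hypothetical 8-point input
spectrum = fft_radix2(samples)
print(max(abs(a - b) for a, b in zip(spectrum, dft(samples))))  # small rounding error
print(fft_complex_multiplications(8), fft_complex_multiplications(16))  # 12 32
```

For the 8-point and 16-point transforms considered in this paper, the formula gives 12 and 32 complex multiplications, compared with 64 and 256 for direct evaluation.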

#### 2.4 Inverse Fast Fourier Transform (IFFT)

The IFFT converts a frequency-domain signal back into the time domain. The IFFT of a sequence X[k] of length N is defined as

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{\frac{j 2 \pi k n}{N}}, \quad \text{for } n = 0, 1, 2, \dots, N-1 \tag{6}$$

## 3. RESOURCE ARCHITECTURES USED IN HARDWARE SOFTWARE CO-DESIGN

In this research, the General Purpose Processor (GPP) and the Digital Signal Processing (DSP) processor act as computing resources for the software implementation of an application, and the FPGA acts as the computing resource for the hardware implementation. The architectural specifications of these computing resources are described in this section.

#### 3.1 General purpose processor architecture

The general purpose processor considered is the Pentium dual-core processor. As the name suggests, it has two cores, with a CPU clock rate of 1.3 GHz to 2.6 GHz. The core architecture of the processor provides efficient decoding, execution units, caches and buses. It can run an operating system that supports a high-level language compiler to run the DSP applications.

#### 3.2 TMS320C6713 DSP processor architecture

A digital signal processor has an architecture and instruction set optimized for real-time digital signal processing. Such optimizations include the Harvard architecture and hardware support for circular and bit-reversed addressing. The TMS320C6713 is a digital signal processor from the high-performance Texas Instruments (TI) C6000 DSP family. It supports both floating-point and fixed-point operation and operates at 225 MHz with low power consumption. The C6713 DSK (DSP Starter Kit) is selected because of its widespread use and compatibility with DSP designs. It uses Code Composer Studio, an Integrated Development Environment (IDE), as the software development tool. The DSK operates from an external +5 V power supply, which onboard switching voltage regulators convert to +1.26 V for the DSP core and +3.3 V for the I/O buffers. The C6713 DSK also features an AIC23 stereo codec for audio input and output, 16 Mbytes of synchronous DRAM, 512 Kbytes of non-volatile flash memory, an embedded JTAG emulator, 4 LEDs and DIP switches.

#### **3.3 Xilinx FPGA- Virtex 5 architecture**

The Virtex-5 FPGA contains fixed-function hardware for multipliers, memories, microprocessor cores, FIFO and ECC logic, DSP blocks and high-speed serial transceivers. The Virtex-5 LX and LXT are developed for logic-intensive applications, while the Virtex-5 SXT targets DSP applications. Virtex-5 features 6-input LUTs, dedicated user-controlled multiplexers for combinational logic and four 1-bit registers that can be configured as flip-flops or latches. It also contains dedicated arithmetic logic gates.

#### 4. IMPLEMENTATION SCHEME

This section describes the methodology followed for executing the DSP techniques on the various computing resources. The data flow diagram of the methodology is shown in figure 1. The behavioral specifications of each DSP algorithm are described in C as well as in HDL. The C code is cross-compiled for the soft-core processor architecture to generate an executable bit file. Similarly, the HDL code is synthesized to produce the gate-level netlist, which contains the area and timing specifications, in the form of an executable bit stream file. The bit file and bit stream file are then mapped to their respective computing resources in the heterogeneous architecture environment and executed. As soon as execution starts, the time analyzer begins counting the number of clock cycles consumed by the computing platform. The number of clock cycles is multiplied by the clock period of the computing resource to give the computation time of the algorithm. The computation time and the required area (number of slices) are extracted from the time analyzer and the synthesis report of the HDL code respectively, and the resource utilization chart is then prepared. The resource utilization chart helps the designer develop an optimized computing platform for a DSP application. To estimate the effectiveness of the design space exploration module, an OFDM transmitter is considered as a real-time application. The behavior and architecture of the OFDM transmitter are described in the next subsection.



Figure 1: Data flow diagram of Design Space Exploration Module (DSEM)
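The conversion from a cycle count to a computation time described in this section can be sketched as follows. The 225 MHz clock is the TMS320C6713 frequency stated in section 3.2; the cycle count of 1800 is a hypothetical value chosen only to make the arithmetic easy to follow, and the function name is an assumption.

```python
def computation_time_us(clock_cycles, clock_hz):
    """Computation time in microseconds: cycle count multiplied by the clock period."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    clock_period_s = 1.0 / clock_hz
    return clock_cycles * clock_period_s * 1e6


DSP_CLOCK_HZ = 225e6        # TMS320C6713 clock frequency (section 3.2)
measured_cycles = 1800      # hypothetical cycle count reported by a time analyzer

# 1800 cycles at 225 MHz correspond to approximately 8 microseconds.
print(computation_time_us(measured_cycles, DSP_CLOCK_HZ))
```

The same relation applies to each platform; only the clock frequency and the measured cycle count change.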

## 4.1 An OFDM Transmitter for Hardware-Software Co-design

Orthogonal Frequency Division Multiplexing (OFDM) is a transmission technique that provides high-speed data, video and multimedia communications. OFDM is a multicarrier modulation scheme in which a high-rate data stream is split into lower-rate data streams, each modulated on a separate sub-carrier. The sub-carriers are orthogonal, so they do not interfere with one another even when their spectra overlap. OFDM inserts a guard interval between symbols to remove inter-symbol interference (ISI); for this, the guard interval must be longer than the multipath delay spread. Because the sub-carrier spectra can overlap, OFDM is more spectrum efficient than conventional FDM [2][5].

A typical OFDM transmitter block diagram is shown in figure 2. The transmitter includes a source generator, a serial to parallel (S/P) converter, a constellation mapper, an IFFT, a parallel to serial (P/S) converter and a cyclic prefix insertion block.



Figure 2: An OFDM Transmitter block diagram

The source generator produces a serial binary input sequence, which the S/P converter passes on in parallel groups whose size depends on the modulation scheme used for constellation mapping. The modulation chosen here is Quadrature Amplitude Modulation (QAM). The input bits are mapped to QAM symbols and passed in parallel to the N-point IFFT block. The output of the IFFT, N complex points, is fed to the cyclic prefix block through the P/S converter. The cyclic prefix block inserts a cyclic prefix into the incoming data so that a guard interval is introduced for every OFDM symbol. For the hardware-software co-design of this OFDM transmitter, the specifications are taken from IEEE 802.11a WLAN, with 192 data bits per OFDM symbol using 16-QAM and a guard interval of 800 ns.
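The transmitter chain described above (S/P grouping, 16-QAM mapping, IFFT, cyclic prefix insertion) can be sketched in a few lines of Python. Several details here are assumptions made for illustration: the Gray-coded level assignment in `GRAY_LEVEL`, the 16-point transform size, the prefix length of 4 samples and the absence of any power normalization. The exact bit-to-symbol mapping and symbol dimensions of IEEE 802.11a should be taken from the standard, and the inverse transform is evaluated directly from its definition rather than with a fast algorithm.

```python
import cmath

# Assumed Gray-coded mapping from a bit pair to an amplitude level.
GRAY_LEVEL = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}


def map_16qam(bits):
    """Group the serial bits four at a time (S/P) and map each group to a 16-QAM point."""
    if len(bits) % 4:
        raise ValueError("number of bits must be a multiple of 4")
    symbols = []
    for i in range(0, len(bits), 4):
        b0, b1, b2, b3 = bits[i:i + 4]
        symbols.append(complex(GRAY_LEVEL[(b0, b1)], GRAY_LEVEL[(b2, b3)]))
    return symbols


def idft(spectrum):
    """Inverse DFT evaluated directly from its definition."""
    n_points = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_points)
            for k in range(n_points)) / n_points
        for n in range(n_points)
    ]


def add_cyclic_prefix(samples, prefix_len):
    """Copy the last prefix_len samples to the front of the symbol (guard interval)."""
    if not 0 < prefix_len <= len(samples):
        raise ValueError("prefix length must be between 1 and the symbol length")
    return samples[-prefix_len:] + samples


def ofdm_symbol(bits, n_subcarriers=16, prefix_len=4):
    """Build one OFDM symbol: 16-QAM mapping, N-point inverse transform, cyclic prefix."""
    symbols = map_16qam(bits)
    if len(symbols) != n_subcarriers:
        raise ValueError("expected 4 * n_subcarriers bits")
    return add_cyclic_prefix(idft(symbols), prefix_len)


bits = [0, 1, 1, 0] * 16                 # hypothetical 64-bit input for 16 sub-carriers
tx = ofdm_symbol(bits)
print(len(tx))                            # 20 samples: 4 prefix + 16 data
print(tx[:4] == tx[-4:])                  # True: the prefix repeats the symbol tail
```

Because the prefix is a copy of the symbol tail, its insertion involves no arithmetic, only data movement.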

## 5. RESULTS AND DISCUSSION

This section presents the results obtained from the design space exploration module (DSEM). The DSP algorithms and the techniques used in the OFDM transmitter are applied to the DSEM, and the parameters obtained, such as area and computation time on the soft-core and hard-core processors, are shown as a resource utilization chart in table 1.

| DSP Technique        | Software realization: general purpose processor (dual core), computation time (µs) | Software realization: DSP processor TMS320C6713 DSK, computation time (µs) | Hardware realization: Virtex-5 FPGA, computation time (ns) | Hardware realization: Virtex-5 FPGA, no. of used bit slices |
|----------------------|------------------------------------------|----------------------------------|----------------------------------------------|---------------------------|
| Linear Convolution   | 5                                        | 7.87                             | 6.55                                         | 381                       |
| Circular Convolution | 4                                        | 6.26                             | 8.7                                          | 145                       |
| 8-FFT/16-FFT         | 45/81.21                                 | 36.1/87.24                       | 4.994/4.41                                   | 734/1276                  |
| 8-IFFT/16-IFFT       | 72.24/143.5                              | 77/142.20                        | 4.338/8.6                                    | 167/334                   |
| FIR Filter           | 4                                        | 203.08                           | 19.67                                        | 536                       |
| Binary generator     | 500                                      | 1270                             | 576                                          | 192                       |
| S/P conversion       | 300.5                                    | 8.012                            | 1.216                                        | 13                        |
| 16-QAM               | 4.5                                      | 0.864                            | 6                                            | 16                        |
| P/S conversion       | 300.5                                    | 8.012                            | 4.14                                         | 74                        |
| Cyclic Prefix        | 6                                        | 8                                | 4.73                                         | 1917                      |

#### Table 1: Resource utilization chart for DSP techniques

#### Table 2: Design parameters of OFDM Transmitter

| DSP Technique           | Software realization: general purpose processor (dual core), computation time (µs) | Software realization: DSP processor TMS320C6713 DSK, computation time (µs) | Hardware realization: Virtex-5 FPGA, computation time (ns) | Hardware realization: Virtex-5 FPGA, no. of used bit slices |
|-------------------------|------------------------------------------|----------------------------------|----------------------------------------------|---------------------------|
| Binary generator        | 500                                      | 1270                             | 576                                          | 192                       |
| S/P conversion          | 300.5                                    | 8.012                            | 1.216                                        | 13                        |
| 16-QAM                  | 4.5                                      | 0.864                            | 6                                            | 16                        |
| 16-IFFT                 | 143.5                                    | 142.20                           | 8.646                                        | 334                       |
| P/S conversion          | 300.5                                    | 8.012                            | 4.14                                         | 74                        |
| Cyclic Prefix Insertion | 6                                        | 8                                | 4.73                                         | 1917                      |
| Total                   |                                          | 1437.08                          | 600.73                                       |                           |

Table 1 shows the execution time of various DSP techniques on the soft-core processors (the dual-core GPP and the TMS320C6713 DSP processor) and on the hard-core processor (FPGA). Computation on the hard-core processor is faster than on the soft-core processors: the FPGA times are in nanoseconds, whereas the processor times are in microseconds. Implementing DSP techniques on the FPGA would therefore be cost effective and would accelerate execution, but the area required varies with the complexity of each technique, as table 1 shows. The DSP processor runs at 225 MHz, whereas the dual-core GPP runs at 1.3 GHz to 2.6 GHz, so the GPP dissipates more power than the DSP processor, requires a heat sink and thus occupies more area. Hence, the DSP processor would provide a cost-effective and high-speed solution for DSP applications. From this analysis, the DSP processor and the FPGA together provide an effective solution for the hardware-software co-design implementation of DSP applications.

To analyze the effectiveness of the design space exploration methodology, a DSP application, the OFDM transmitter, is analyzed for optimized implementation. The block diagram of the OFDM transmitter is shown in figure 2 and its design parameters are given in table 2, which lists the computation time of its functional blocks on the DSP processor and the FPGA. The computation time of the OFDM transmitter is 1437.08 µs on the TMS320C6713 DSK and 600.73 ns on the FPGA, so the computation is about 2392 times faster on the FPGA than on the DSP processor. In practice, however, the required area may not be available on the FPGA, so soft-core processors such as DSP processors are also needed. The hard-core FPGA in combination with the soft-core DSP processor can therefore act as the computing elements of what can be called a heterogeneous computing architecture. Hardware-software co-design supports the execution of an application on this kind of heterogeneous computing architecture. In this research, the OFDM transmitter would have optimized computation when the binary input generator and cyclic prefix insertion are implemented on the TMS320C6713 DSK, because they require more area on the FPGA, and the remaining functional blocks are implemented on the FPGA. Tables 3 (a) and (b) show the hardware-software partitioning of the functional blocks and their computation times.

| Platform                         | Functions               | Computation time (in µs) |
|----------------------------------|-------------------------|--------------------------|
| TMS320C6713 DSK                  | Binary input generation | 1270                     |
|                                  | Cyclic Prefix Insertion | 8                        |
| Virtex-5 FPGA                    | S/P conversion          | 0.00414                  |
|                                  | 16-QAM                  | 0.006                    |
|                                  | IFFT                    | 0.0087                   |
|                                  | P/S conversion          | 0.0042                   |
| TMS320C6713 DSK + Virtex-5 FPGA  | OFDM                    | 1278                     |

#### Table 3: (a) Hardware software partitioning of OFDM transmitter functions

#### Table 3: (b) Hardware software partitioning of OFDM transmitter functions

| Platform                         | Functions               | Computation time (in µs) |
|----------------------------------|-------------------------|--------------------------|
| TMS320C6713 DSK                  | Cyclic Prefix Insertion | 8                        |
| Virtex-5 FPGA                    | S/P conversion          | 0.00414                  |
|                                  | 16-QAM                  | 0.006                    |
|                                  | IFFT                    | 0.0087                   |
|                                  | P/S conversion          | 0.0042                   |
|                                  | Binary input generation | 0.576                    |
| TMS320C6713 DSK + Virtex-5 FPGA  | OFDM                    | 8.59                     |

From table 3, it is clear that the computation time of the application depends on the computation time of the tasks that are mapped to the TMS320C6713 DSK.
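The effect of a partitioning choice on the total computation time can be reproduced with a short Python sketch that sums the per-block times from table 2. It assumes, as the totals in the tables do, that the blocks execute one after another and that communication between the DSP processor and the FPGA costs nothing; the dictionary keys and the function name are illustrative. Because the per-block values are read from table 2, the results may differ from table 3 in the last digits.

```python
# Per-block computation times taken from table 2.
DSP_TIME_US = {
    "binary_generator": 1270.0,
    "sp_conversion": 8.012,
    "qam16": 0.864,
    "ifft16": 142.20,
    "ps_conversion": 8.012,
    "cyclic_prefix": 8.0,
}
FPGA_TIME_NS = {
    "binary_generator": 576.0,
    "sp_conversion": 1.216,
    "qam16": 6.0,
    "ifft16": 8.646,
    "ps_conversion": 4.14,
    "cyclic_prefix": 4.73,
}


def partition_time_us(blocks_on_dsp):
    """Total time in microseconds when the named blocks run on the DSP, the rest on the FPGA."""
    unknown = set(blocks_on_dsp) - set(DSP_TIME_US)
    if unknown:
        raise ValueError(f"unknown blocks: {sorted(unknown)}")
    total = 0.0
    for block in DSP_TIME_US:
        if block in blocks_on_dsp:
            total += DSP_TIME_US[block]
        else:
            total += FPGA_TIME_NS[block] / 1000.0   # ns to microseconds
    return total


all_dsp = partition_time_us(set(DSP_TIME_US))            # about 1437.09 microseconds
all_fpga = partition_time_us(set())                      # about 0.60 microseconds
option_a = partition_time_us({"binary_generator", "cyclic_prefix"})  # about 1278.02
option_b = partition_time_us({"cyclic_prefix"})          # about 8.60
print(all_dsp, all_fpga, option_a, option_b)
print(all_dsp / all_fpga)                                # speed-up of roughly 2392
```

In both partitions the blocks assigned to the DSP processor account for almost all of the total, which is the dependence noted in the text.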

## 6. CONCLUSION AND FUTURE SCOPE

In this paper, the design space of widely used DSP algorithms is explored and a design parameters chart is prepared from the results. The chart is then used for hardware-software co-design, which helps to optimize application execution. Finally, the methodology is applied to an OFDM transmitter and an optimized execution model is prepared. The result analysis shows that implementing an application purely in software or purely in hardware does not meet the time and area constraints. An application meets the design constraints when it is realized on a heterogeneous computing architecture (a combination of the hard-core FPGA and the soft-core DSP processor), which is supported by hardware-software co-design. As future scope, this research can be extended to other real-time applications and to other design parameters such as communication overheads and memory utilization.

#### 7. REFERENCES

- [1] Mahendra Vucha and Arvind Rajawat, An effective dynamic scheduler for reconfigurable high speed computing system, Advance Computing Conference (IACC), 2014 IEEE International, pp. 766-773, 21-22 Feb. 2014.
- [2] Naveen Kumar N, Rohith S and H Venkatesh Kumar, FPGA Implementation of OFDM Transceiver using Verilog-Hardware Description Language, International Journal of Computer Applications (0975-8887), Volume 102, No. 6, September 2014.
- [3] Shahzad Ahmad Butt, Luciano Lavagno, Design Space Exploration and Synthesis for Digital Signal Processing Algorithms from Simulink Models, Design and Test Symposium, 2013 8th International.
- [4] Mahendra Vucha, Rajendra Patel and Arvind Rajawat, Dynamic Profiling Methodology for Resource Optimization in Heterogeneous Computing Systems, International Conference on Emerging Research in Computing, Information, Communication & Applications, August 2013, Bangalore.
- [5] Abhijit D. Palekar and Dr. Prashant V. Ingole, OFDM System Using FFT and IFFT, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 12, December 2013.
- [6] Ramsey Hourani, Ravi Jenkal, W. Rhett Davis, Winser Alexander, Automated Design Space Exploration for DSP Applications, Journal of Signal Processing Systems, pp. 199-216, 2009.
- [7] Gayatri Mehta, Raymond R Hoare, Justin Stander, Alex K Jones, Design Space Exploration for Low-Power Reconfigurable Fabrics, International Parallel and Distributed Processing Symposium (IPDPS), 2006, pp. 227, doi:10.1109/IPDPS.2006.1639484
- [8] Massoud Hashempour, Shervin Sharifi, Maziar Gudarzi, Shaahin Hessabi, Rapid Design Space Exploration of DSP Applications using Programmable SoC devices: a case study, ASIC/SOC Conference, 2002, 15th Annual IEEE International, pp. 273-277.