# A Novel Design of Low Power, High Speed SAMM and its FPGA Implementation

Anuja George, Assistant Professor, St. Joseph's College of Engineering & Technology, Palai

## ABSTRACT

Matrix multiplication is a computationally intensive problem and a prerequisite in various image processing applications such as spatial and frequency filtering, edge detection and convolution. Because it is a core part of many applications in portable devices such as mobile phones, the demand for high speed and low power consumption is extremely high. This work demonstrates an effective design and efficient implementation of matrix multiplication using systolic architecture and ancient (Vedic) mathematics. Integer arithmetic was used for efficient implementation and maximum speed-up. The three main steps of the work, i.e. design, simulation and implementation, were accomplished. Verilog HDL was used for design and simulation. The design was simulated using ModelSim 10.1d and synthesized using Xilinx PlanAhead 12.1. The work also compares three design approaches to matrix multiplication using systolic architecture: the first uses array multipliers, the second uses Wallace tree multipliers, and the third bases the matrix multiplier design on the ancient (Vedic) multiplication technique.

## **General Terms**

Systolic Architecture, Vedic mathematics, Image processing

## Keywords

Systolic Architecture, VLSI, Vedic mathematics.

## **1. INTRODUCTION**

A wide gamut of applications, including signal, image and video processing and numerical analysis, involve matrix operations as the kernel operation. Matrix multiplication is a prerequisite in various image processing applications such as spatial and frequency filtering, edge detection and convolution. Because it is a core part of many applications in portable devices such as mobile phones, the demand for high speed and low power consumption is extremely high. Matrix multiplication is also a computationally intensive problem. Hence its design and efficient implementation on an FPGA, where resources are very limited, is of great interest. FPGA-based designs are usually evaluated using three performance metrics: speed, area, and power.

The paper is organized as follows. Section 2 surveys the literature on various multipliers in VLSI and presents the fundamental technical aspects behind the three design approaches to systolic matrix multiplication. Section 3 presents the proposed matrix multiplication using systolic architecture and ancient mathematics and its FPGA implementation. Section 4 presents results and discussion based on the synthesis of systolic matrix multiplication using the three design approaches, together with the results of the FPGA implementation of the proposed design. Section 5 concludes the work based on the important results obtained.

## 2. PREVIOUS WORK

This section summarizes the literature survey conducted on various multipliers in VLSI. It covers the fundamental technical aspects behind the design approaches of the proposed systolic matrix multiplication.

In an array multiplier, the multiplication of two binary numbers is obtained in one micro-operation by a combinational circuit that forms all the product bits at once. The only delay is the time for the signals to propagate through the gates that form the multiplication array [1]. The array multiplier needs an optimum number of components, but its delay is comparatively large and its power consumption is higher. The large number of gates it requires also increases the area. Hence the array multiplier is less economical.
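The structure described above can be illustrated with a small behavioural model. This is a minimal sketch in Python rather than the Verilog used in this work; the function name `array_multiply` and the default `width` of 8 bits are assumptions made for illustration only.

```python
def array_multiply(a: int, b: int, width: int = 8) -> int:
    """Behavioural model of an unsigned array multiplier.

    Each multiplier bit gates one row of AND-ed partial-product bits;
    the rows are shifted by their bit weight and accumulated.
    """
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must be unsigned and fit in `width` bits")

    product = 0
    for i in range(width):                 # one array row per multiplier bit
        b_bit = (b >> i) & 1
        row = 0
        for j in range(width):             # AND gates form the row's bits
            a_bit = (a >> j) & 1
            row |= (a_bit & b_bit) << j
        product += row << i                # adder row: weight 2**i
    return product
```

In hardware all rows exist at once as combinational logic; the loops here only enumerate the gates, they do not model timing.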

A fast process for the multiplication of two numbers, developed by Wallace, is referred to as Wallace tree multiplication [1]. It is a three-step process: the bit products are formed; the bit product matrix is reduced to a two-row matrix in which the sum of the rows equals the sum of the bit products; and the two resulting rows are summed with a fast adder to produce the final product. Three bit signals are passed to a one-bit full adder, which is called a three-input Wallace tree circuit. Its sum output is supplied to the next-stage full adder at the same bit position, and its carry output is supplied to the next-stage full adder located one bit position higher [1].
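The three steps can be sketched as a behavioural model. The code below is an illustrative Python sketch, not the design synthesized in this work: the names `full_adder` and `wallace_multiply` are hypothetical, only full adders are used in the reduction (practical designs also use half adders), and ordinary integer addition stands in for the final fast adder.

```python
def full_adder(x: int, y: int, z: int):
    """Three-input Wallace tree circuit: returns (sum, carry)."""
    return x ^ y ^ z, (x & y) | (y & z) | (x & z)


def wallace_multiply(a: int, b: int, width: int = 8) -> int:
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must be unsigned and fit in `width` bits")

    # Step 1: form the bit products; cols[k] holds the bits of weight 2**k.
    cols = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            cols[i + j].append(((a >> j) & 1) & ((b >> i) & 1))

    # Step 2: reduce every column to at most two bits with full adders.
    while any(len(col) > 2 for col in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for k, col in enumerate(cols):
            idx = 0
            while len(col) - idx >= 3:
                s, c = full_adder(col[idx], col[idx + 1], col[idx + 2])
                nxt[k].append(s)          # sum stays at the same bit position
                nxt[k + 1].append(c)      # carry moves one position higher
                idx += 3
            nxt[k].extend(col[idx:])      # leftover bits pass through
        cols = nxt

    # Step 3: add the two remaining rows (stand-in for the fast adder).
    row0 = sum(col[0] << k for k, col in enumerate(cols) if len(col) >= 1)
    row1 = sum(col[1] << k for k, col in enumerate(cols) if len(col) >= 2)
    return row0 + row1
```

Each full adder replaces three bits by two without changing the weighted sum, which is why the two final rows still add up to the product.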

Another improvement to the multiplier is to reduce the number of partial products generated. The Booth recoding multiplier is one such multiplier; it scans the multiplier bits two at a time to reduce the number of partial products [1]. Booth's multiplication algorithm multiplies two signed binary numbers in two's complement notation. The algorithm was invented by Andrew Donald Booth in 1951.
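As an illustration of the recoding, the sketch below models radix-2 Booth multiplication in Python. It is a behavioural sketch under stated assumptions: the function name `booth_multiply` and the 8-bit default `width` are hypothetical, and the pair of bits examined at each step is the current multiplier bit together with the bit to its right.

```python
def booth_multiply(a: int, b: int, width: int = 8) -> int:
    """Radix-2 Booth multiplication of two signed `width`-bit integers."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in signed `width` bits")

    b_bits = b & ((1 << width) - 1)   # two's complement pattern of multiplier
    product = 0
    prev = 0                          # implicit bit to the right of bit 0
    for i in range(width):
        cur = (b_bits >> i) & 1
        # Pair (cur, prev): 01 -> add, 10 -> subtract, 00 or 11 -> skip.
        digit = prev - cur
        if digit:
            product += digit * (a << i)
        prev = cur
    return product
```

Runs of identical multiplier bits produce a recoded digit of zero, so no partial product is added for them; this is where the saving comes from.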

## 3. MATRIX MULTIPLICATION USING SYSTOLIC ARCHITECTURE AND ANCIENT MATHEMATICS

The data (pixel values) displayed in the HyperTerminal window of the PC are entered into the proposed systolic array matrix multiplication (SAMM) unit through a UART interface. The 64 pixels are buffered before being sent to the SAMM unit. The output pixels of the resultant matrix are collected in a FIFO and displayed in the HyperTerminal window of the PC through the UART interface.

The proposed matrix multiplication enhances computation speed in two ways. Firstly, the systolic architecture doubles the speed of matrix multiplication relative to the conventional method. Secondly, applying the Vedic algorithm within the systolic array architecture further improves the computation speed.



Fig 1: Block diagram of FPGA implementation of proposed work

## **3.1 Systolic Architecture**

A systolic architecture is an array of processing elements (PEs), each called a cell. Each cell is connected to a small number of nearest neighbours in a mesh-like topology and performs a sequence of operations on the data that flows between cells. At each step, a PE takes input data from one or more neighbours (e.g. left and top), processes it and, in the next step, outputs the results in the opposite direction (right and bottom). The proposed two-dimensional systolic architecture for 3 by 3 matrices is given in Fig 2.







Fig 3: Block diagram of processing element in systolic architecture
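The data flow described in this section can be sketched with a cycle-by-cycle software model. The following Python sketch assumes an output-stationary arrangement in which each PE multiplies its left and top inputs, accumulates the result locally, and forwards the inputs to the right and bottom; the skewed input schedule and the name `systolic_matmul` are illustrative assumptions and are not taken from Fig 2 or Fig 3.

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing C = A * B."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]     # accumulator inside each PE
    a_reg = [[0] * n for _ in range(n)]   # values travelling left to right
    b_reg = [[0] * n for _ in range(n)]   # values travelling top to bottom

    for t in range(3 * n - 2):            # cycles until the last PE finishes
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:                # left edge: row i is delayed by i cycles
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:                     # take the left neighbour's value
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                # top edge: column j is delayed by j cycles
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:                     # take the top neighbour's value
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b

        for i in range(n):                # every PE does one multiply-accumulate
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

For 3 by 3 matrices the model needs 7 cycles. In the proposed design the multiplication inside each PE is where the array, Wallace or Vedic multiplier unit would sit.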

## 3.2 Vedic Multiplier

The Vedic multiplier is an efficient multiplier based on ancient mathematics. The Urdhva Tiryakbhyam Sutra, a concept in ancient mathematics, is the soul of the Vedic multiplier design. It means "Vertically and Crosswise" [4]. The digits at the two ends of each line are multiplied and the result is added to the previous carry. When a step contains more than one line, all the results are added to the previous carry [4]. The least significant digit of each sum becomes the next digit of the final product, while the remaining digits act as the carry for the next step. Initially the carry is taken to be zero.

The Vedic multiplier has the advantage that, as the number of bits increases, gate delay and area increase very slowly compared to other multipliers [3]. Adopting the Vedic multiplier therefore reduces the time, space and power of the design. All the partial products are calculated in parallel, and the associated delay is mainly the time taken by the carry to propagate through the adders that form the multiplication array [4]. The hardware realization of the 8x8 bit multiplication unit is shown in Fig 4.
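The vertical and crosswise procedure can be followed step by step in software. The sketch below is an illustrative Python model for non-negative integers; the function name `urdhva_multiply` and the `base` parameter (10 for decimal digits, 2 for bits) are assumptions introduced here.

```python
def urdhva_multiply(x: int, y: int, base: int = 10) -> int:
    """Multiply two non-negative integers with the Urdhva Tiryakbhyam steps."""
    if x < 0 or y < 0:
        raise ValueError("operands must be non-negative")

    def digits(v: int):
        """Digits of v, least significant first."""
        out = []
        while True:
            v, r = divmod(v, base)
            out.append(r)
            if v == 0:
                return out

    dx, dy = digits(x), digits(y)
    n = max(len(dx), len(dy))
    dx += [0] * (n - len(dx))             # pad to equal length
    dy += [0] * (n - len(dy))

    carry = 0                             # initially the carry is zero
    result_digits = []
    for step in range(2 * n - 1):
        total = carry
        for i in range(n):                # every line of this step
            j = step - i
            if 0 <= j < n:
                total += dx[i] * dy[j]    # multiply the digits at the line ends
        carry, digit = divmod(total, base)
        result_digits.append(digit)       # least significant digit is kept

    value = carry                         # final carry gives the top digits
    for d in reversed(result_digits):
        value = value * base + d
    return value
```

For 12 x 13 the three steps give 2x3 = 6, then 2x1 + 1x3 = 5, then 1x1 = 1, which read from the last step to the first form 156.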



RESULT = (s15-s8) & (s7-s4) & (s3-s0)

Fig 4: Hardware realization of 8x8 bit multiplication
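One common way to obtain an 8x8 bit Vedic multiplier is to combine four 4x4 bit Vedic blocks. The Python sketch below assumes that decomposition; it is not confirmed by Fig 4, and the names `vedic_4x4` and `vedic_8x8` are hypothetical. The three result fields are named after the expression RESULT = (s15-s8) & (s7-s4) & (s3-s0), where & is read as concatenation.

```python
def vedic_4x4(a: int, b: int) -> int:
    """4x4 bit Urdhva Tiryakbhyam product using bitwise crosswise sums."""
    carry, result = 0, 0
    for step in range(7):                     # 2*4 - 1 vertical/crosswise steps
        total = carry
        for i in range(4):
            j = step - i
            if 0 <= j < 4:
                total += ((a >> i) & 1) & ((b >> j) & 1)
        result |= (total & 1) << step         # keep the least significant bit
        carry = total >> 1                    # the rest is the next carry
    return result | (carry << 7)


def vedic_8x8(a: int, b: int) -> int:
    """8x8 bit product built from four 4x4 blocks (assumed decomposition)."""
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must be unsigned 8-bit values")
    a_lo, a_hi = a & 0xF, a >> 4
    b_lo, b_hi = b & 0xF, b >> 4

    q0 = vedic_4x4(a_lo, b_lo)                # the four blocks work in parallel
    q1 = vedic_4x4(a_hi, b_lo)
    q2 = vedic_4x4(a_lo, b_hi)
    q3 = vedic_4x4(a_hi, b_hi)

    s3_s0 = q0 & 0xF                          # low nibble comes straight from q0
    middle = (q0 >> 4) + q1 + q2              # adder stage for the middle field
    s7_s4 = middle & 0xF
    s15_s8 = q3 + (middle >> 4)               # upper field absorbs the carry

    return (s15_s8 << 8) | (s7_s4 << 4) | s3_s0   # concatenate the fields
```

The adders that combine `q0` to `q3` are the part of the structure through which the carry propagates.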

## 4. RESULTS AND DISCUSSION

Matrix multiplication plays a vital role in image and signal processing. Systolic matrix multipliers using array, Wallace and Vedic multiplication units as their core were simulated using ModelSim 10.1d and synthesized using Xilinx PlanAhead 12.1. The results of the comparison between the systolic matrix multipliers using the three approaches, in terms of resource utilization, power and speed, are as follows:

Table 1. Comparison between different SAMM design approaches

| SAMM design  | Fmax (MHz) | Resource utilization (%) | Power utilization (mW) |
|--------------|------------|--------------------------|------------------------|
| SAMM_ARRAY   | 63.2       | 53                       | 186.3                  |
| SAMM_WALLACE | 86.9       | 42                       | 186.6                  |
| SAMM_VEDIC   | 98.7       | 49                       | 180.2                  |

Fig 5: Graph showing maximum frequency (Fmax) of systolic array matrix multiplier using different multiplier units

SAMM\_WALLACE shows a 37.5% increase in speed compared to SAMM\_ARRAY. SAMM\_VEDIC has the highest frequency, with increases in speed of 56.17% and 13.58% compared to SAMM\_ARRAY and SAMM\_WALLACE respectively.



Fig 6: Graph showing resource utilization of systolic array matrix multiplier using different multiplier units

SAMM\_WALLACE shows a 20.75% decrease in resource utilization compared to SAMM\_ARRAY. SAMM\_VEDIC shows a 7.55% decrease in resource utilization compared to SAMM\_ARRAY.



Fig 7: Graph showing power utilization of systolic array matrix multiplier using different multiplier units

SAMM\_WALLACE shows a 0.16% increase in power utilization compared to SAMM\_ARRAY. SAMM\_VEDIC shows a 3.27% decrease in power utilization compared to SAMM\_ARRAY.
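The speed and power percentages quoted above follow from the Table 1 values as a relative change with respect to the reference design. A short Python check is given below; the helper name `percent_change` and the dictionary names are introduced here for illustration only.

```python
def percent_change(new: float, ref: float) -> float:
    """Relative change of `new` with respect to `ref`, in percent."""
    return (new - ref) / ref * 100.0


# Fmax (MHz) and power (mW) values taken from Table 1.
fmax = {"SAMM_ARRAY": 63.2, "SAMM_WALLACE": 86.9, "SAMM_VEDIC": 98.7}
power = {"SAMM_ARRAY": 186.3, "SAMM_WALLACE": 186.6, "SAMM_VEDIC": 180.2}

speed_wallace_vs_array = percent_change(fmax["SAMM_WALLACE"], fmax["SAMM_ARRAY"])
speed_vedic_vs_array = percent_change(fmax["SAMM_VEDIC"], fmax["SAMM_ARRAY"])
speed_vedic_vs_wallace = percent_change(fmax["SAMM_VEDIC"], fmax["SAMM_WALLACE"])

# A positive value is an increase, a negative value is a decrease.
power_wallace_vs_array = percent_change(power["SAMM_WALLACE"], power["SAMM_ARRAY"])
power_vedic_vs_array = percent_change(power["SAMM_VEDIC"], power["SAMM_ARRAY"])

for name, value in [
    ("Fmax, WALLACE vs ARRAY", speed_wallace_vs_array),
    ("Fmax, VEDIC vs ARRAY", speed_vedic_vs_array),
    ("Fmax, VEDIC vs WALLACE", speed_vedic_vs_wallace),
    ("Power, WALLACE vs ARRAY", power_wallace_vs_array),
    ("Power, VEDIC vs ARRAY", power_vedic_vs_array),
]:
    print(f"{name}: {value:+.2f}%")
```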

From the above graphs, it was concluded that the systolic matrix multiplier based on ancient Vedic mathematics is the most suitable. A high-speed, low-power matrix multiplier exploiting the Vedic algorithm was implemented on the Spartan-6 FPGA. The results of the implementation, which estimate power, resource utilization and maximum available frequency, are shown below:

Table 2. Results of FPGA implementation of proposed work

| SAMM design   | Fmax (MHz) | Resource utilization (%) | Power utilization (mW) |
|---------------|------------|--------------------------|------------------------|
| Proposed SAMM | 116.519    | 35                       | 39.2                   |

## 5. CONCLUSIONS

Matrix multiplication plays a vital role in image and signal processing. Systolic matrix multipliers using array, Wallace and Vedic multiplication units as their core were simulated using ModelSim 10.1d and synthesized using Xilinx PlanAhead 12.1. The systolic matrix multipliers using the three approaches were compared in terms of resource, power and speed estimates. The systolic matrix multiplier exploiting ancient mathematics proved to be the best. A high-speed, low-power systolic matrix multiplier was designed, simulated and implemented on a Spartan-6 FPGA (device xc6slx45tfgg484-3), and estimates of power, resource utilization and maximum frequency were obtained.

## 6. REFERENCES

- [1] Sumit Vaidya and Deepak Dandekar, "Delay-Power Performance Comparison of Multipliers in VLSI Circuit Design", International Journal of Computer Networks & Communications (IJCNC), Vol. 2, No. 4, July 2010.
- [2] R. C. Gonzalez and R. E. Woods, "Digital Image Processing", 2nd Ed., Prentice Hall, 2002.
- [3] Himanshu Thapliyal and Hamid R. Arabnia, "A Time-Area-Power Efficient Multiplier and Square Architecture Based on Ancient Indian Vedic Mathematics", July 2000.
- [4] S. S. Kerur, Prakash Narchi, Jayashree C N, Harish M Kittur and Girish V A, "Implementation of Vedic Multiplier for Digital Signal Processing", International Conference on VLSI, Communication & Instrumentation (ICVCI) 2011, Proceedings published by International Journal of Computer Applications (IJCA).
- [5] Jagadguru Swami Sri Bharati Krsna Tirthaji, "Vedic Mathematics or Sixteen Simple Sutras from the Vedas", Motilal Banarsidass, Varanasi (India), 1986.
- [6] Mahendra Vucha and Arvind Rajawat, "Design and FPGA Implementation of Systolic Array Architecture for Matrix Multiplication", International Journal of Computer Applications, Vol. 26, No. 3, July 2011.
- [7] H. T. Kung, "Why systolic architectures?", IEEE Computer, vol. 15, pp. 37, Jan. 1982.
- [8] Ziad Al-Qadi and Musbah Aqel, "Performance Analysis of Parallel Matrix Multiplication Algorithms Used in Image Processing", World Applied Sciences Journal 6 (1): 45-52, 2009.
- [9] M. Ramalatha, K. D. Dayalan, P. Dharani and S. D. Priya, "High Speed Energy Efficient ALU Design using Vedic Multiplication Technique", Lebanon, pp. 600-603, 2009.
- [10] Asmita Haveliya, "A Novel Design for High Speed Multiplier for Digital Signal Processing Applications", International Journal of Technology and Engineering System (IJTES), Vol. 2, No. 1, Jan-March 2011.
- [11] Himanshu Thapliyal and M. B. Srinivas, "Very Large Scale Integration (VLSI) Implementation of RSA Encryption System using Ancient Indian Vedic Mathematics", Proceedings of International Conference on Security Management, June 2005.
- [12] Manoranjan Pradhan, Rutuparna Panda and Sushanta Kumar Sahu, "Speed Comparison of 16x16 Vedic Multipliers", International Journal of Computer Applications (0975-8887), Vol. 21, No. 6, May 2011.