Virtex 4 FPGA Implementation of Viterbi Decoded 64-bit RISC for High Speed Application using Xilinx

Ritam Dutta
Dept. of Electronics & Communication Engineering
SIEM, West Bengal University of Technology
Siliguri, India

Charles Van Der Mast
Dept. of Electrical Engg., Mathematics & Comp. Sc
Delft University of Technology
Delft, The Netherlands

ABSTRACT
In the modern era of electronics and communication decoding and encoding of any data(s) using VLSI technology requires low power, less area and high speed constrains. The viterbi decoder using survivor path with necessary parameters for wireless communication is an attempt to reduce the power and cost and at the same time increase the speed compared to normal decoder. This paper presents three objectives. Firstly, an orthodox viterbi decoder is designed and simulated. For faster process application, the Gate Diffused Input Logic (GDIL) based viterbi decoder is designed using Xilinx ISE. simulated and synthesized successfully. The new proposed GDIL viterbi provides very less path delay with low power simulation results. Secondly, the GDIL viterbi is again compared with our proposed technique, which comprises a Survivor Path Unit (SPU) implements a trace back method with DRAM. This proposed approach of incorporating DRAM stores the path information in a manner which allows fast read access without requiring physical partitioning of the DRAM. This leads to a comprehensive gain in speed with low power effects. Thirdly, all the viterbi decoders are compared, simulated, synthesized and the proposed approach shows the best simulation and synthesis results for low power and high speed application in VLSI design. The Add-Compare-Select (ACS) and Trace Back (TB) units and its sub circuits of the decoder(s) have been operated in deep pipelined manner to achieve high transmission rate. Although the register exchange based survivor unit has better throughput when compared to trace back unit, but in this paper by introducing the RAM cell between the ACS array and output register bank, a significant amount of reduction in path delay has been observed. All the designing of viterbi is done using Xilinx ISE 12.4 and synthesized successfully in the FPGA Virtex 4 target device operated at 64.516 MHz clock frequency, reduces almost 41% of total path delay.

Keywords
Viterbi decoder, GDIL technique, SPU, DRAM, Add Compare Select (ASC), Trace Back (TB), Xilinx, FPGA.

1. INTRODUCTION
The Viterbi decoding algorithm, proposed by Viterbi, is a decoding process for convolutional codes in memory-less noise. The algorithm can be applied to a host of problems encountered in the design of communication systems. The Viterbi Algorithm finds the most-likely state transition sequence in a state diagram, given a sequence of symbols. The Viterbi algorithm is used to find the most likely noiseless finite-state sequence, given a sequence of finite-state signals that are corrupted by noise.

Generally, a viterbi decoder consists of three basic computation units: Branch Metric Unit (BMU), Add-Compare-Select Unit (ACSU) and Trace Back Unit (TBU). The BMU calculates the branch metrics by the hamming distance or Euclidean distance and the ACSU calculates a summation of the branch metric from the BMU and previous state metrics, which are called the path metrics.

After this summation, the value of each state is updated and then the survivor path is chosen by comparing path metrics. The TBU processes the decisions made in the BMU and ACSU and outputs the decoded data. The feedback loop of the ACSU is a major critical path for the viterbi decoder. The operations of the convolutional encoder and decoder for constraint length of 3 are explained as follows.

The convolution encoder has a rate of \( \frac{k}{n} \) with a constraint length of 3. With an encoder, the 3 bit shift register provides the memory and two modulo-2 adders provide convolution operations. For each bit in the input sequence, two bits are output, one from each of the two modulo-2 adders. The decoding procedure compares the received sequence with all the possible sequences that may be obtained with the respective encoder and then selects the sequence that is closest to the received sequence. There are always two paths merging at each node and the path selected is the one with the minimum hamming distance, the other is simply terminated. The retained paths are known as survivor paths and the final path selected is the one with the continuous path through the trellis with a minimum aggregate hamming distance.

2. EARLIER RESEARCH WORK
Y. Zhu and M. Benaissa (2003) presented a novel ACS scheme that enables high speeds to be achieved in area efficient viterbi decoders without compromising for area and power efficiency. Multilevel pipelining has been introduced into the ACS feedback loop.

Arkadiy Morgenshtein et al (2004) used Gate Diffusion Input circuits for asynchronous design and compared the designs with CMOS asynchronous design [1]. Dalia A. El-Dib and Mohamed I. Elmasry (2004) discussed the implementation of a viterbi decoder based on modified register-exchange (RE) method [2]. Song Li and Qing-Ming Yi (2006) proposed a scheme based on Verilog language for the implementation of high-speed and low power consumption bi-directional viterbi decoder [3]. The decoding was done in both positive and negative direction and the delay was half of that of the unilaterallism decoder and the decoding speed was greatly improved. Yun-Nan Chang and Yu-Chung Ding (2006) presented a low power design for viterbi decoder based on a novel survivor path trace mechanism. Lupin Chen et al (2007)
presented a low-power trace-back (TB) scheme for high constraint length viterbi decoder. Xuan-zhong Li et al (2008) discussed a high speed viterbi decoder which was based on parallel radix-4 architecture and bit level carry-save algorithm. Seongjoo Lee (2009) presented an efficient implementation method for parallel processing viterbi decoders in UWB systems.

3. ORTHODOX VITERBI LIMITATION

From the literature survey, viterbi decoder is mainly used in all communication techniques. Logic styles like CMOS, Pseudo NMOS and Dynamic logic design of circuits at ACS level are done [4] but the switching activity in these logic styles are high and hence lead to high power dissipation.

4. GDIL BASED VITERBI

Gate Diffusion Input Logic (GDIL) is a technique of low power digital for circuit design which allows reducing power consumption, delay and area of the digital circuit. The basic GDIL cell is similar to the standard CMOS inverter, the differences are: (1) GDIL cell contains three inputs (2) Bulks of both NMOS and PMOS are connected to N or P, so it can be randomly biased at contrast with CMOS inverter. The GDIL contains four terminals – G (common gate input of the nMOS and pMOS transistors), P (outer diffusion node of the pMOS transistor), N (outer diffusion node of the nMOS transistor) and D node (common diffusion of both transistors). The GDIL approach [5] allows implementation of a wide range of complex logic functions using only two transistors.

This GDIL method is suitable for design of fast, low-power circuits using a reduced number of transistors (as compared to CMOS and existing Pass Transistor Logic techniques), while improving logic level swing and static power characteristics and allowing simple top-down design by using small cell library. A simple change of the input configuration of the simple GDIL cell corresponds to very different Boolean functions [6]. This GDIL undoubtedly reduces area as lesser number of LUTs and CLBs are used in FPGA prototyping [7].

GDIL viterbi decoder consists of three blocks. They are Branch Metric Unit (BMU), Add Compare Select Unit (ACSU) and Survivor Memory Unit (SMU). All these blocks are designed using GDIL technology, simulated and synthesized using Xilinx ISE.

4.1 GDIL based Branch Metric Unit (BMU)

The branch metric computation block compares the received code symbol with the expected code symbol and counts the number of differing bits. It consists of EXOR gate and counter. The Branch Metric Unit is designed using the EXOR gate and the 3-bit counter. The output of the EXOR gate is fed as the clock input to the 3-bit counter. 3-bit counter is designed by cascading the D FF and the output of the one flip flop is given as clock input for the next flip flop. Further the D input for all the flip flops are tied to HIGH input. The preset and clear input is used to make the counter working as asynchronous counter. The RTL schematic diagram of the Branch Metric Unit is shown in figure 2.

4.2 Add Compare Select Unit (ACSU)

The Add Compare Select Unit (ACSU) which adds the Branch Metrics (BM) to the corresponding Path Metrics (PM) compares the new PMs and then stores the selected PMs in the Path Metric Memory (PMM). At the same time, the ACSU stores the associated survivor path decisions in the Survivor Memory Unit (SMU). The PM of the survivor path of each state is updated and stored back into the PMM. The Block diagram of the Add Compare and Select unit is shown in the figure 3.

State metric (SMi,j) and Branch Metric (BMi,j) are the two inputs to the Adder unit. Each butterfly wing is usually implemented by a module called ACS module. The two adders compute the partial path metric of each branch. The
comparator compares the two partial metrics and the selector selects an appropriate branch. The new partial path metric updates the state metric of state p and the survivor path-tracking block records the survivor path.

The adder unit which is proposed in the design consists of two full adders and one half-adder. Output of the Branch metric unit (BMU) is added with the previous path metric and the obtained output is the new path metric for the next branch. The input sequence from a0 (1100), b0 (1001), a1 (1011), b1 (1101), a2 (0111) and b2 (0011) are added to the adder unit. The 1st bit of a0 and b0, i.e., 11, is the input of half-adder and produces the output of the half adder operation sum1 as ‘0’ and carry as ‘1’.

4.3 Selector Unit (SU) using GDIL

The output of the comparator is given as the select signal for the multiplexer which is used to select the minimum path metric of the decoded message bit in the Viterbi decoder. The selector unit consists of four 2x1 MUX and the select signal for all the multiplexers are from the A<B output of the comparator. Hence, the selector selects the minimum path metric value. The 4-bit input is a0=1101, b0=1111, a1=1101, b1=1000, a2=1110, b2=1110, a3=1001, b3=1110 and select line value is 1010. When the select line value is high i.e. ‘1’, the output value z0 will be b0, i.e., ‘1’. When the select line value is ‘0’, the output value z0 will be a0. Likewise all other input values z1, z2 and z3 are taken and the outputs are obtained.

4.4 Survivor Memory Unit (SMU) using GDIL

The Survivor Memory Unit is designed by using the serial-in-serial-out shift register and the length of the shift register depends on the length of the convolution encoder. The figure 4 shows the 4x4 memory unit to store the minimum surviving path and the clock signal of the SMU are the ACSU output for a constraint length of 3.

4.5 Complete Integration of GDIL based Viterbi decoder

The viterbi decoder using GDIL technique is designed by integrating all the units like BMU, ACSU and SMU. The proposed design using GDIL is shown in the figure 5.

Fig 5: The RTL design of GDIL Viterbi Decoder

Here two Branch Metric Units are used since two possible changes from one state to another. This BMU calculates the branch metric between the expected sequence (original random input which is ‘a’) and the received sequence (introduced errors which is ‘b’). Then adder unit adds branch metric with the previous path metric and comparator compares the two paths and select the least path using selector. The survivor memory unit stores the path metric value and its corresponding states using the 2x1 multiplexer and 2 bit shift register to get the decoded output.

5. PROPOSED VITERBI USING DRAM

The GDIL viterbi is also compared with our proposed technique, which comprises a Survivor Path Unit (SPU) implements a trace back method with DRAM. This proposed approach of incorporating DRAM stores the path information in a manner which allows fast read access without requiring physical partitioning of the DRAM. This leads to a comprehensive gain in speed with low power effects.

Fig 6: The RTL design of Proposed Viterbi using DRAM

As shown in figure 6 the proposed DRAM based GDIL viterbi is designed in Register Transfer Logic style. As in earlier inventions of viterbi, the main block is divided into two main sub-units, i.e; Add-Compare-Select (ACS) array and a Survivor Path Unit (SPU). Here in this proposed system the metric calculation, addition, weight comparison and survivor path selection everything take place in the ACS array (ACSU). Thus ACS array contains the weight values at each
state, as a progression along the survivor path. The weight values are very necessary when comparing with other weights of other paths, to determine the survivor path. The compilation and comparison functions, which takes place within the ACS array actually determines the path extensions. Signals indicating these extensions to the survivor paths are passed from the ACS array to the SPU, which then updates the survivor paths.

Till now two methods exist for implementing the SPU: the register exchange method and the trace back method. Though the register exchange based survivor unit has better throughput when compared to trace back unit, but in this project by introducing the RAM cell between the ACS array and output register bank, a significant amount of reduction in path delay (almost 41%) has been observed.

6. VITERBI IMPLEMENTATION ON RISC PROCESSOR

Finally the fastest GDIL based viterbi decoder is implemented on the 64-bit RISC (Reduced Instruction Set Computer) processor. RISC is basically a modern style of microprocessor with a few old tricks from the mainframe world. In todays market most microprocessors are based on either RISC or CISC (Complex Instruction Set Computer) architecture technologies. Research has shown that RISC architecture immensely boost up computer speed by using simplified machine instructions for frequently used functions. In this case the instruction set is etched into logic circuits using HDLs like Verilog HDL. This instruction set is reduced to basic, often used commands that can be executed in a single machine cycle. For general purpose computers the data collection shows that up to 85% is spent executing simple instructions like LOAD, STORE and BRANCH. For further complexity the compiler should generate several simple instructions. The more complex instructions consist of a very small amount of overall execution time of the CPU. Therefore for optimization of RISC processing power and to avoid complex instructions, no complex addressing is allowed and all instruction sets act on internal register set. These fruitful factors make RISC an irresistible choice and most modern processors are built on this form factor.

6.1 RISC Design Architecture

The ultimate goal of the project is to design and implement a RISC (Reduced Instruction Set Computer) using a FPGA (here Virtex 4 – Target device) for high speed application. The RISC is characterized by 64 bit architecture having 64 bit registers, ALU, RAM, Counters, Control Unit, Display Unit and most importantly Decoders. Here the proposed GDIL based Viterbi decoder is replaced by ordinary decoder, and then implemented and synthesized by Xilinx 12.4 ISE software.

6.2 Complete RTL Design of RISC with GDIL Implementation

Finally the target device of 64-bit RISC is designed in block level and synthesized. The control unit having orthodox decoders is replaced by proposed Viterbi using proposed GDIL technology.
The figure 8 shows the complete RTL design of the RISC processor with modifications of viterbi decoder for high speed applications. The backend coding is done using Verilog HDL and the layout is extracted by Xilinx ISE. Finally the proposed processor is successfully implemented in Virtex 4 FPGA.

7. SIMULATION RESULTS

The simulation results of orthodox viterbi, GDIL viterbi and proposed GDIL viterbi using DRAM are shown below. The simulation is mainly performed by Xilinx ISE 12.4 and synthesized in the FPGA [8, 9] target device: Virtex 4. The major constrains for VLSI design is speed, power and area. In this whole project all three constrains have been taken into account. The detail timing report and power report describes the delay analysis for high speed applications and low power for proposed Virtex [10] based viterbi decoder.

7.1 Timing Analysis

<table>
<thead>
<tr>
<th>Timing Report</th>
<th>Orthodox Viterbi</th>
<th>GDIL Viterbi</th>
<th>Proposed GDIL Viterbi with DRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Target Device</td>
<td>XA9536XL (Virtex 4)</td>
<td>XA9536XL (Virtex 4)</td>
<td>XA9536XL (Virtex 4)</td>
</tr>
<tr>
<td>Max Clock Frequency</td>
<td>64.516 MHz</td>
<td>64.516 MHz</td>
<td>64.516 MHz</td>
</tr>
<tr>
<td>Min Clock Period</td>
<td>21.400 ns</td>
<td>15.500 ns</td>
<td>15.500 ns</td>
</tr>
<tr>
<td>Clock to Set up Cycle Time</td>
<td>21.400 ns</td>
<td>15.500 ns</td>
<td>15.500 ns</td>
</tr>
<tr>
<td>Set up to Clock Pad Delay</td>
<td>87.663 ns</td>
<td>52.600 ns</td>
<td>20.800 ns</td>
</tr>
<tr>
<td>Clock Pad to Output Pad Delay</td>
<td>16.635 ns</td>
<td>5.800 ns</td>
<td>5.800 ns</td>
</tr>
</tbody>
</table>

7.2 Power Analysis

<table>
<thead>
<tr>
<th>Power Report</th>
<th>Orthodox Viterbi</th>
<th>GDIL Viterbi</th>
<th>Proposed GDIL Viterbi with DRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Target Device</td>
<td>XA9536XL (Virtex 4)</td>
<td>XA9536XL (Virtex 4)</td>
<td>XA9536XL (Virtex 4)</td>
</tr>
<tr>
<td>On-Chip IO Pads</td>
<td>0.982 µw</td>
<td>0.001 µw</td>
<td>0.001 µw</td>
</tr>
<tr>
<td>Leakage Power</td>
<td>3.791 µw</td>
<td>1.010 µw</td>
<td>1.007 µw</td>
</tr>
<tr>
<td>Quiescent Power</td>
<td>2.315 µw</td>
<td>1.110 µw</td>
<td>1.007 µw</td>
</tr>
<tr>
<td>Dynamic Power</td>
<td>0.739 µw</td>
<td>0.022 µw</td>
<td>0.003 µw</td>
</tr>
</tbody>
</table>

4. Here in this Virtex 4 family many different devices were available in the Xilinx ISE tool. In order to implement this modified viterbi with DRAM, the device named as “XA9536XL” has been chosen and the package as “FG320” with the device speed as “– 4 ”. The design of modified viterbi decoded RISC for low power [13] and high speed [14] is synthesized successfully and its results are analyzed as shown in the figure 9 below.

Fig 9: Synthesize Report of Proposed Viterbi Decoded RISC

After completion of synthesize the entire circuit model is processed through Translate, Map, Place and Route successfully. Finally a UCF file of the target object is created which is prototyped with hardware FPGA Virtex 4 hardware [15], through parallel connection using JTAG. Boundary Scan is performed followed by generation of PROM file. Finally the modified viterbi with RAM cell is successfully configured into FPGA Virtex 4 shown in figure 10.

Figure 10: FPGA prototyping of Proposed Viterbi decoded 64-Bit RISC

The layout of the proposed modified GDIL Viterbi decoded RISC is obtained shown in figure 11.

Figure 11: Layout of proposed GDIL Viterbi decoded RISC
The graphical analysis is also performed for all three viterbi decoders to have a complete performance summary. The timing - delay analysis and power analysis are shown in figure 12 (a) and 12 (b) respectively.

9. CONCLUSION & FUTURE SCOPE
The Add-Compare-Select (ACS) and Trace Back (TB) units and its sub circuits of all the three type decoder(s) have been operated in deep pipelined manner to achieve high transmission rate. Although the register exchange based survivor unit has better throughput when compared to trace back unit, but in this paper by introducing the RAM cell between the ACS array and output register bank, a significant amount of reduction in path delay has been observed. All the designing of viterbi is done using Xilinx ISE 12.4 and synthesized successfully in the FPGA Virtex 4 target device operated at 64.516 MHz clock frequency and reduces almost 41% of total path delay with considerable low power results.

The fast decoding can be achieved by our modified viterbi approach, which shows significant results in high speed applications. In future this research work can be extended by 32nm CMOS process technology for more precise fast decoding results.

10. REFERENCES
[12] www.digilentinc.com/Products/Detail.cfm?NavPath=ATLYS