Evaluating FPGA Virtex-II Board using Dynamic Partial Reconfiguration

Imran Hashmi
M.Sc Student
Department of Electrical Engineering, UET Taxila, Pakistan.

Habibullah Jamal
Professor
Department of Electrical Engineering, UET Taxila, Pakistan.

Tahir Muhammad
Lecturer
Department of Electrical Engineering, UET Taxila, Pakistan.

Abstract

Field Programmable Gate Arrays (FPGAs) offer flexibility and performance through reconfigurable hardware, but they consume more power than Application Specific Integrated Circuits (ASICs). Run-time reconfiguration of FPGA hardware can be economical and can make FPGAs a real alternative to ASICs. Designers are nevertheless reluctant to use Dynamic Partial Reconfiguration (DPR) because vendors do not provide adequate tools, although DPR has been in academic use for more than a decade. DPR offers reductions in power consumption, area, and cost as well as gains in flexibility, efficiency, and fault tolerance, but it carries an application-dependent overhead. In this work the performance of DPR is evaluated up front on a Xilinx Virtex-II Pro, so that its suitability for an application can be judged early rather than at a later, more complex stage of a system design that already employs DPR. The evaluation is based on reconfiguration speed and resource utilization. The DPR design lowers slice utilization by 22.5 percentage points and also shows a speedup in comparison with the non-DPR design.

Keywords

Dynamic Partial Reconfiguration, Field Programmable Gate Array, Reconfigurable Architecture.

1 INTRODUCTION

Owing to their flexibility and growing logic capacity, Field Programmable Gate Arrays are becoming a sustainable and attractive option for digital system design. They are used in application domains such as bioinformatics and motion detection [1]. High-end FPGAs, e.g. Virtex II, Virtex 4, and Virtex 5, provide millions of logic gates for very large system implementations. In the last decade, increasing attention has been given to new methodologies and techniques for implementing reliable systems. Applications that require long-distance communication, fault relief, quick recovery, and low-power design are now common, e.g. in space technology and telecommunication. The advantages offered by FPGAs, such as programmability, flexibility, debugging, prototyping, and parallel computation, have made them an alternative to ASICs. SRAM-based FPGAs are configured by programming SRAM bits that are connected to the configuration points of the chip [1]. Run-time reconfiguration is used to re-design applications dynamically, offering improvements in several respects such as speed, power, area, cost, and fault tolerance. High-end FPGAs have the appealing feature of DPR, and interest in DPR is growing [2-4]. Because vendors have provided few tools and little support, DPR has been used mainly in academia for the past decade. Applications using DPR have shown power reduction, speed improvement, time-sharing of resources, lower cost, and chip-area reduction, as reported in [5-10].

DPR can be used to reconfigure only a desired part of the reconfigurable portion of the device while the rest of the application continues to run. In the reconfiguration process a bitstream holding the reconfiguration data is loaded into the FPGA’s configuration memory, overwriting the previous design. The speed of reconfiguration is a key factor in the performance of time-critical applications, because execution is delayed while the bitstream is downloaded to configuration memory. The reconfiguration times published in previous work typically account only for the chip’s configuration port [11]. Frameworks for performance evaluation have been developed so that the partial reconfiguration design flow can be assessed at the initial steps of development [12]. For experiments on a real system, all system components that contribute to the reconfiguration overhead have to be considered. When the assumptions are unrealistic, the theoretical reconfiguration overhead can deviate by orders of magnitude [13-14]. A framework developed in [11] has shown the performance of DPR for various bitstream sizes using PLB-HWICAP. Other design architectures are possible for the reconfiguration port, the Internal Configuration Access Port (ICAP), such as DMA-HWICAP, BRAM-ICAP, and PLB-HWICAP (cache enabled); these are evaluated in this work and are shown to improve the reconfiguration speed.

The contributions of this paper are as follows:

✓ Reconfiguration Time comparison for different design architectures of ICAP

✓ Maximum speed offered by ICAP design architectures

✓ Device Utilization for Power and Area concerns of chip

This paper is organized as follows. Section 2 gives a short overview of related work on applications using DPR and discusses several reconfiguration designs. Section 3 describes the experimental setup, with subsections detailing the individual processes that make up the work presented in this paper. Section 4 presents and discusses the experimental results, and Section 5 gives the concluding remarks.

2 RELATED WORK

Previous evaluations of DPR on FPGAs for power reduction and chip size (area) are reported in [7-8, 15-16]. Two ICAP cores are available for reconfiguration in Xilinx designs: the low-performance OPB-HWICAP and the high-performance XPS-HWICAP. Several reconfiguration methods, such as ICAP, JTAG, and SelectMAP, have been compared for speed [7]. Good reconfiguration speed has been shown by optimizing the PLB ICAP design, but resource consumption is not presented [17-18]. This work presents a complete evaluation of ICAP design architectures, namely DMA-HWICAP, BRAM-HWICAP, PLB-HWICAP, and OPB-HWICAP.

3 EXPERIMENTAL SETUP

The experiments were performed on a Xilinx Virtex-II Pro XC2VP30 with the aid of a logic analyzer and PCs. The experimental setups are used to measure the time components that contribute to the total reconfiguration time. Section 3.1, Creating Partial Bitstream, discusses the HDL design flow and bitstream generation using Xilinx ISE 10.1. Section 3.2, System Operational Flow, describes the execution of the reconfiguration process. Section 3.3, ICAP Design, explains the design architectures of the ICAP reconfiguration port.

3.1 Creating Partial Bitstream

Xilinx supports two approaches for generating a partial bitstream: i) difference based and ii) module based [19]. The difference-based approach is intended for smaller designs, while the module-based approach is intended for larger and more complex designs. Our design uses the difference-based approach. The partial bitstream is created using the Xilinx ISE 10.1 design tools and PlanAhead. The HDL design flow is shown in Figure 1. The design is described hierarchically in Verilog. The Top-Level module contains the system design, the Static Modules, and the Partial Reconfigurable Module (PRM). All design I/O, Bus Macros (BM), GCLK, DCMs, buffers, and the static and partial modules of the design are instantiated as black boxes in the Top-Level module.

![HDL Design Flow Diagram](image)

<table>
<thead>
<tr>
<th>Top-Level Module (Synthesis Description)</th>
<th>Static Modules Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Defining Design Constraints</td>
<td>Implementation of PR Modules Separately</td>
</tr>
<tr>
<td>Implementation of Non-PR Design</td>
<td>Merging Static &amp; Partial Modules</td>
</tr>
<tr>
<td></td>
<td>Timing Analysis &amp; Routing &amp; Placing of Non-PR Design</td>
</tr>
<tr>
<td></td>
<td>Bit stream Generation</td>
</tr>
</tbody>
</table>

After synthesis of the design, a description (.ngc) file is created, which is used during place and route to define timing and area constraints. The non-DPR design is implemented for verification and for comparison of resource utilization. Timing and area constraints are verified on the non-DPR design before moving on to the more complex DPR design. The static_used file generated when the static module is implemented is used later, when the PRM is implemented, to avoid overlapping areas. The static and partial modules are then merged, and a bitstream is generated for each individual PRM.

3.2 System Operational Flow

The reconfiguration flow is shown in Figure 2. The FPGA executes the initially configured application and waits for a reconfiguration call. As soon as a reconfiguration event is detected, the PowerPC reads the data from the memory in which the bitstream was initially placed and transfers it to PowerPC memory. The amount of data transferred in a single transaction depends on the size of the processor array. The data is then transferred from the processor to the configuration cache (BRAM) of the HWICAP. As soon as the BRAM is full, the data is written to the FPGA’s configuration memory through the ICAP. The end of configuration is indicated by a pad frame. As soon as the configuration is complete, the pipeline terminates and the FPGA executes the application using the newly downloaded PRM.
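
The flow described above can be sketched as a small software model. This is only an illustration, not the firmware used in the experiments: the buffer sizes `PROCESSOR_ARRAY_BYTES` and `CONFIG_CACHE_BYTES` and the function names are assumptions chosen for the example, and the ICAP write is modeled as appending a burst to a list.

```python
from typing import Iterator, List

# Assumed sizes, chosen only for illustration (not taken from the paper).
PROCESSOR_ARRAY_BYTES = 512   # data moved from memory per processor transaction
CONFIG_CACHE_BYTES = 2048     # capacity of the HWICAP configuration cache (BRAM)


def chunks(data: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `data`, each at most `size` bytes long."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def reconfigure(bitstream: bytes) -> List[bytes]:
    """Model the path memory -> processor array -> configuration cache -> ICAP.

    Returns the bursts written to configuration memory, in order.
    """
    icap_writes: List[bytes] = []
    cache = bytearray()
    for array in chunks(bitstream, PROCESSOR_ARRAY_BYTES):  # one transaction each
        cache.extend(array)                                 # processor -> BRAM cache
        if len(cache) >= CONFIG_CACHE_BYTES:                # cache full: write via ICAP
            icap_writes.append(bytes(cache))
            cache.clear()
    if cache:                                               # write the last partial cache
        icap_writes.append(bytes(cache))
    return icap_writes


if __name__ == "__main__":
    demo = bytes(range(256)) * 20                           # 5120-byte stand-in bitstream
    bursts = reconfigure(demo)
    print([len(b) for b in bursts])
```

In this model, a larger processor array means fewer transactions and a larger cache means fewer ICAP bursts. That is one way to see why both sizes influence the measured reconfiguration speed.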

3.3 ICAP Design

Two cores, OPB-HWICAP and XPS-HWICAP, are mainly used for DPR in Xilinx designs. A bridge is required to interface the low-performance OPB with the high-performance PLB, as shown in Figure 3. The data is stored in the configuration cache (BRAM), a dual-port memory, over the OPB. The ICAP is then used to load the data into the FPGA’s configuration memory. With XPS-HWICAP, the dual-port BRAM is replaced by write and read FIFOs and a group of registers, as shown in Figure 4. To raise the data transfer rate from the processor to the ICAP port, XPS-HWICAP is used together with DMA. The DMA function on XPS-HWICAP, shown in Figure 5, increases the efficiency of data transfer from the embedded processor to the ICAP.

The DMA has two interfaces, one for commands and the other for data transfer. Communication overhead can be reduced further by using the burst transmission support of the PLB interface. Figure 6 shows the configuration efficiency of the ICAP primitive, evaluated using a dedicated BRAM for bitstream storage. The BRAM must be large enough to hold the bitstream. The bitstreams are loaded into the dedicated BRAM instead of memory, so the data is immediately available for reconfiguration and the time needed to transfer the PRM data from memory to the HWICAP is eliminated. A PLB IP (Intellectual Property) core is used to load the data into the BRAM. The highest reconfiguration speed is achieved with the BRAM-HWICAP design. This design architecture is well suited to small designs and to applications that require fast switching of design modules. The speed comes at the cost of extra resource utilization, however, so power consumption and chip area also increase.

4 RESULTS AND DISCUSSION

4.1 Reconfiguration Time Measurements

The experiments were carried out using a PowerPC and a MicroBlaze running at 300 MHz and 100 MHz respectively. Enabling the ICache and DCache of the PowerPC increases the speed. The reconfiguration time (RT) is proportional to the size of the bitstream; bitstreams of 548 KB, 125 KB, and 99 KB were used. The sizes of the processor array and of the configuration cache also affect the configuration speed. Here both are kept constant, so a linear increase in RT is observed. Rows 3, 4, 5 and 6 of Table 1 show that the speed is higher with the PowerPC than with the MicroBlaze: the hardcore processor is much faster than the softcore processor. The DMA function on XPS-HWICAP gives a much higher speed than the basic ICAP design architectures. Moreover, the BRAM-HWICAP design architecture provides the highest speed, but at the cost of additional physical resources. Graphical comparisons of speed and reconfiguration time are shown in Figure 7 and Figure 8.
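
The relation between bitstream size, reconfiguration time, and speed can be checked with a few lines of arithmetic. The sketch below rests on assumptions that the text does not state explicitly: that the 548 KB, 125 KB, and 99 KB bitstreams correspond to Bitstream 1, 2, and 3 of Table 1 in that order, that 1 MB is taken as 1000 KB, and that the average speed is the arithmetic mean of the three per-bitstream speeds. Because the bitstream sizes are rounded, the computed values come close to the tabulated speeds but do not always match them exactly.

```python
# Bitstream sizes from the text, assumed to map to Bitstream 1, 2, 3 in Table 1.
BITSTREAM_KB = (548, 125, 99)

# Reconfiguration times in ms, copied from three rows of Table 1.
RECONF_TIME_MS = {
    "OPB-ICAP (Cache_disabled)": (344, 73, 56.5),
    "PLB-ICAP (Cache_enabled)": (59.8, 14.46, 9.6),
    "BRAM-ICAP": (2.3, 0.4835, 0.4283),
}


def speeds_mb_per_s(times_ms):
    """Per-bitstream speed; KB per ms equals MB per s when 1 MB = 1000 KB."""
    return [size / t for size, t in zip(BITSTREAM_KB, times_ms)]


def summarize(times_ms):
    """Return (average speed, maximum speed) over the three bitstreams."""
    speeds = speeds_mb_per_s(times_ms)
    return sum(speeds) / len(speeds), max(speeds)


if __name__ == "__main__":
    for design, times in RECONF_TIME_MS.items():
        avg, peak = summarize(times)
        print(f"{design}: avg {avg:.2f} MB/s, max {peak:.2f} MB/s")
```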

4.2 Device Utilization

The device utilization results show that DPR reduces the chip area and resource requirements. From Table 2, the slice utilization of the non-DPR design is 22.5 percentage points higher than that of the DPR design. The utilization listed under DPR includes both the static and the partial modules. LUT and FF utilization is also lower. BRAM utilization, however, is close to that of the non-DPR design. The reason is that using BRAM as the memory that holds the bitstream increases the speed greatly, but the cost is paid in higher resource utilization and therefore a larger chip area.

Table 1: Reconfiguration Time and Speed of Bitstreams

<table>
<thead>
<tr>
<th>Sr.No</th>
<th>ICAP Design</th>
<th>Bitstream 1 (548 KB)</th>
<th>Bitstream 2 (125 KB)</th>
<th>Bitstream 3 (99 KB)</th>
<th>Avg. Speed</th>
<th>Max. Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Reconf. Time (ms)</td>
<td>Reconf. Time (ms)</td>
<td>Reconf. Time (ms)</td>
<td>MB/s</td>
<td>MB/s</td>
</tr>
<tr>
<td>1</td>
<td>OPB-ICAP (Cache_disabled)</td>
<td>344</td>
<td>73</td>
<td>56.5</td>
<td>1.67</td>
<td>1.75</td>
</tr>
<tr>
<td>2</td>
<td>PLB-ICAP (Cache_disabled)</td>
<td>267</td>
<td>53</td>
<td>42.8</td>
<td>2.24</td>
<td>2.32</td>
</tr>
<tr>
<td>3</td>
<td>OPB-ICAP (Cache_enabled)</td>
<td>128</td>
<td>30</td>
<td>19.9</td>
<td>4.52</td>
<td>4.98</td>
</tr>
<tr>
<td>4</td>
<td>PLB-ICAP (Cache_enabled)</td>
<td>59.8</td>
<td>14.46</td>
<td>9.6</td>
<td>9.37</td>
<td>10.31</td>
</tr>
<tr>
<td>5</td>
<td>OPB-ICAP (MicroBlaze)</td>
<td>170.6</td>
<td>43.2</td>
<td>28.8</td>
<td>3.2</td>
<td>3.5</td>
</tr>
<tr>
<td>6</td>
<td>PLB-ICAP (MicroBlaze)</td>
<td>79.7</td>
<td>23.8</td>
<td>13.4</td>
<td>6.5</td>
<td>7.3</td>
</tr>
<tr>
<td>7</td>
<td>DMA-ICAP</td>
<td>4.8</td>
<td>0.8544</td>
<td>0.7654</td>
<td>129.78</td>
<td>143.96</td>
</tr>
<tr>
<td>8</td>
<td>BRAM-ICAP</td>
<td>2.3</td>
<td>0.4835</td>
<td>0.4283</td>
<td>242.6</td>
<td>254.4</td>
</tr>
</tbody>
</table>
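
The gains in Table 1 can also be read as speedup factors. The sketch below is only illustrative: it uses the Bitstream 1 reconfiguration times from Table 1 and a hypothetical helper `speedup`, with OPB-ICAP (Cache_disabled) chosen as the baseline for this example.

```python
# Bitstream 1 reconfiguration times in ms, copied from Table 1.
RECONF_TIME_MS_BITSTREAM_1 = {
    "OPB-ICAP (Cache_disabled)": 344,
    "PLB-ICAP (Cache_disabled)": 267,
    "OPB-ICAP (Cache_enabled)": 128,
    "PLB-ICAP (Cache_enabled)": 59.8,
    "OPB-ICAP (MicroBlaze)": 170.6,
    "PLB-ICAP (MicroBlaze)": 79.7,
    "DMA-ICAP": 4.8,
    "BRAM-ICAP": 2.3,
}

# Baseline chosen for this example only.
BASELINE = "OPB-ICAP (Cache_disabled)"


def speedup(design: str, baseline: str = BASELINE) -> float:
    """Ratio of baseline time to design time; values above 1 mean faster."""
    times = RECONF_TIME_MS_BITSTREAM_1
    return times[baseline] / times[design]


if __name__ == "__main__":
    for name in RECONF_TIME_MS_BITSTREAM_1:
        print(f"{name}: {speedup(name):.1f}x")
```

Passing a different `baseline` compares any two rows directly, for example the cache-enabled PowerPC design on PLB against the MicroBlaze design on PLB.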

Power consumption, cost, and chip area increase with resource utilization. A graphical representation of device utilization is shown in Figure 9. Some values overlap slightly in the figure, but they can be verified from Table 2.

Figure 8: Reconfiguration Time Vs Bitstream

Figure 9: Device Utilization

Table 2: Device Utilization

<table>
<thead>
<tr>
<th>Sr.No</th>
<th>Resource</th>
<th>DPR Design Utilization (Static + Partial)</th>
<th>Non PR Utilization</th>
<th>Total Available Resource</th>
<th>Non PR %Utilization</th>
<th>DPR %Utilization</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Slices</td>
<td>1410+2056=3466</td>
<td>6545</td>
<td>13696</td>
<td>47.8</td>
<td>25.3</td>
</tr>
<tr>
<td>2</td>
<td>LUTs</td>
<td>2045+3503=5548</td>
<td>10819</td>
<td>27392</td>
<td>39.5</td>
<td>20.3</td>
</tr>
<tr>
<td>3</td>
<td>FFs</td>
<td>1581+1003=2584</td>
<td>5015</td>
<td>27392</td>
<td>18.3</td>
<td>9.5</td>
</tr>
<tr>
<td>4</td>
<td>IOs</td>
<td>35+0=35</td>
<td>64</td>
<td>416</td>
<td>15.4</td>
<td>8.3</td>
</tr>
<tr>
<td>5</td>
<td>BRAMs</td>
<td>47+38=85</td>
<td>97</td>
<td>136</td>
<td>71.3</td>
<td>62.5</td>
</tr>
</tbody>
</table>
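
The percentages in Table 2 follow directly from the resource counts. As a minimal sketch, assuming the percentage columns are simply the used count divided by the total available count, the figures can be recomputed as follows. The helper name `utilization` is hypothetical, and a few recomputed values may differ from the table in the last digit because of rounding.

```python
# resource: (static, partial, non-PR, total available), copied from Table 2.
TABLE_2 = {
    "Slices": (1410, 2056, 6545, 13696),
    "LUTs": (2045, 3503, 10819, 27392),
    "FFs": (1581, 1003, 5015, 27392),
    "IOs": (35, 0, 64, 416),
    "BRAMs": (47, 38, 97, 136),
}


def utilization(resource: str):
    """Return (DPR %, non-PR %, difference in percentage points)."""
    static, partial, non_pr, available = TABLE_2[resource]
    dpr_pct = 100 * (static + partial) / available
    non_pr_pct = 100 * non_pr / available
    return dpr_pct, non_pr_pct, non_pr_pct - dpr_pct


if __name__ == "__main__":
    for name in TABLE_2:
        dpr, non_pr, diff = utilization(name)
        print(f"{name}: DPR {dpr:.1f} %, non-PR {non_pr:.1f} %, difference {diff:.1f} points")
```

For slices this gives a difference of about 22.5 percentage points, which is the figure quoted in the text.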
5 CONCLUSIONS

This paper has presented an evaluation of the Virtex-II Pro using DPR. The evaluation can help designers decide whether the architecture is suitable for their intended applications. The reconfiguration time was observed to be proportional to the size of the bitstream. BRAM-HWICAP offers the highest speed but increases resource utilization, which makes it suitable for small designs that need fast reconfiguration. A hardcore processor (PowerPC) also offers more speed than a softcore processor (MicroBlaze). Reductions in power, area, and cost are possible when resource utilization is reduced. Overall, the DPR design uses fewer resources than the non-DPR design.

6 REFERENCES


