# Design and Implementation of Efficient Permutation Clos Network Design for Mpnoc

Poornima Jain P.J. PG scholar, VLSI & Embedded systems Dept. of Electronics & Communication SJCE,Mysuru.

### ABSTRACT

Transmitting data from source to destination without congestion, with low latency and high throughput, is a central challenge in the design of on-chip networks for multiprocessor systems on chip (MPNOC).

The conventional packet-switching approach spends a large amount of power and area on queuing buffers. Topologies such as mesh and torus [10] are intuitively feasible for physical layout on a 2-D chip. In contrast, the high wiring irregularity and large router radix of indirect topologies such as Benes or Butterfly [11] pose a challenge for physical implementation.

The present work describes the silicon-proven design of a novel on-chip network that supports guaranteed traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage network topology. The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach guarantees delivery of the permuted data, and its compact overhead makes it practical to stack multiple networks. The design was developed using Xilinx ISE 12.4, simulated in ModelSim 6.3f, and implemented on a Spartan 3 FPGA device. It achieves high throughput, low latency, and low cost.

#### **General Terms**

Throughput, Latency, network on chip

### Keywords

Permutation network, circuit switching, dynamic path-setup scheme, system on chip, network topology.

#### 1. INTRODUCTION

In NoC design, all the classical networking issues have to be addressed. Addressing and routing schemes must be organized so that packets traversing the same links can be routed to different destinations. Timely delivery of certain types of traffic on the chip is important for performance, and meeting quality-of-service (QoS) requirements is also essential. Similarly, a NoC should support congestion control in order to accommodate excessive traffic conditions.

The proposed work presents a new silicon-proven design of an on-chip permutation network that supports arbitrary permutations. Instead of the conventional packet-switching approach, a circuit-switching mechanism with a dynamic path-setup scheme under a multistage network topology is used. The dynamic path setup handles path arrangement for conflict-free permuted data, and the dedicated data paths enable guaranteed throughput. By eliminating the excessive overhead of queuing buffers, a compact implementation is achieved.

Shreekanth T. Assistant professor, Dept. of Electronics & Communication, SJCE, Mysuru.

### 2. OVERVIEW OF MPNOC

Owing to advances in IC technology, more and more transistors are being integrated onto a single chip [7]. Today's chips can contain 100M transistors, with transistor gate lengths measured in nanometres. According to Moore's law, the number of transistors approximately doubles every 18 months; as a result, components that were once connected on a printed circuit board can be integrated into a single IC chip.

# 2.1 SOC

Advances in VLSI manufacturing technology have made it possible to place many transistors on a single die. This allows designers to build systems on a chip (SoCs) that eventually move everything from the board onto the chip. An SoC can contain a high-performance microprocessor, which can be programmed with the instructions needed to perform the desired task. An SoC integrates heterogeneous silicon IPs, such as memory, random logic, and analog circuitry, on the same chip.

# 2.2 MPSOC

Multiprocessor systems-on-chips (MPSoCs) are the latest incarnation of very large scale integration (VLSI) technology. A single integrated circuit can contain over 100 million transistors, and the International Technology Roadmap for Semiconductors predicts that chips with a billion transistors are within reach. The demands placed on these chips by applications require designers to face problems not confronted by traditional computer architecture: real-time deadlines, very low-power operation, and so on. These opportunities and challenges make MPSoC design an important field of research.

MPSOC is simply a system-on-chip that contains multiple instruction-set processors (CPUs). In practice, most SoCs are MPSoCs because it is too difficult to design a complex system-on-chip without making use of multiple CPUs.

#### 2.3 Network on Chip

Network on Chip (NoC) is a new paradigm for SoC design: it applies networking methods to on-chip interconnection, and it offers roughly a threefold increase in performance over conventional bus systems [1].

#### 2.4 Need for Network-on-Chip

As technology advances, millions of transistors are integrated on a single chip, so a network on chip relies on arbitration and routing techniques to manage communication. The performance of a NoC architecture does not degrade as the network scales, because the aggregate bandwidth scales with the network size. A given NoC system can route data over multiple hops, which increases the performance of the architecture. Modules communicate with each other by sending packetized data over the network. As in a computer network, there are devices that use the network, routers that direct data between the devices, and wires that connect routers to devices and routers to routers.

#### 2.5 Circuit switching

Circuit switching is a technique that directly connects the sender and receiver through an unbroken path [4]. Telephone switching equipment, for example, establishes a path that connects the caller's telephone to the receiver's telephone by making a physical connection. With this technique, once the connection is established, a dedicated path exists between the sender and receiver until the connection is terminated. Routing decisions must be made when the circuit is first established, but no decisions are made after that time.

#### 2.6 Clos Network

A Clos network is a multistage network topology used in switching to transfer data through three stages. The network shown in Fig 1 has twelve inputs and outputs, and each path is selected dynamically according to the input given. The main advantage of the network is that connections between a large number of input and output ports can be made using only small switches.



#### **3. METHODOLOGY**

The proposed design is a three-stage Clos network consisting of 4X3 switches.

The design is based mainly on the design of the switch.

#### 3.1 Switch design

The switch module, shown in Fig 2, consists of four data inputs (din0, din1, din2, din3), four data outputs (dout0, dout1, dout2, dout3), four request input signals (req0, req1, req2, req3), four grant signals, a crossbar, an arbiter, and a 4:1 decoder.

#### 3.1.1 Crossbar

A crossbar is a switch that connects multiple inputs to multiple outputs in a matrix manner. The crossbar is designed with a mux-tree architecture: at least four multiplexers are needed to generate the outputs, and because there are four inputs, four 4:1 multiplexers are used.
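The mux-based crossbar can be illustrated with a short behavioural sketch in Python. This is not the authors' HDL; the function names `mux4` and `crossbar4x4` and the `sel` list (one select value per output) are assumptions made for illustration only.

```python
def mux4(inputs, sel):
    """4:1 multiplexer: return the input chosen by the select value (0..3)."""
    if len(inputs) != 4 or not 0 <= sel <= 3:
        raise ValueError("mux4 expects 4 inputs and a select value in 0..3")
    return inputs[sel]


def crossbar4x4(din, sel):
    """4x4 crossbar modelled as four 4:1 muxes, one per output port.

    din -- values on din0..din3
    sel -- sel[k] is the index of the input routed to output dout_k
    """
    return [mux4(din, sel[k]) for k in range(4)]


# Example: reverse the order of the four inputs.
dout = crossbar4x4([0xA, 0xB, 0xC, 0xD], [3, 2, 1, 0])
```

With `sel = [3, 2, 1, 0]`, din3 appears on dout0 and din0 on dout3, so the four inputs leave the switch in reversed order.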



Fig 2. Switch based architecture

#### 3.1.2 Arbiter design

When many input ports request access to a common physical channel, an arbiter is required to determine how the channel is shared among the requestors.

The proposed arbiter, shown in Fig 3, has four request signals and four grant signals; a grant signal is selected according to the request signals. The arbitration scheme is a fixed-priority arbiter: each input port has its own fixed priority level, and the arbiter grants the active request signal with the highest priority.



Fig 3. Design of Arbiter module

For instance, if req0 has the highest priority among N requests and req0 is active, it is granted regardless of the other request signals. If req0 is not active, the request signal with the next highest priority is granted. In other words, a request is served only if every higher-priority request is either absent or has already been served.
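The fixed-priority rule described above can be sketched in a few lines of Python. This is a behavioural illustration, not the authors' implementation; the function name `fixed_priority_arbiter` is hypothetical, and index 0 is assumed to carry the highest priority, as in the req0 example.

```python
def fixed_priority_arbiter(req):
    """Return a one-hot grant list for the request list `req`.

    req[0] has the highest priority; at most one grant is asserted,
    and no grant is asserted when no request is active.
    """
    grant = [0] * len(req)
    for i, r in enumerate(req):
        if r:
            grant[i] = 1  # first active request in priority order wins
            break
    return grant
```

For example, `fixed_priority_arbiter([0, 1, 1, 0])` returns `[0, 1, 0, 0]`: req1 and req2 are both active, and req1 wins because it has the higher priority.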

#### 3.1.3 Decoder Design

In the decoder module, the output of the arbiter is the input to the decoder; a 4:1 decoder is used. The decoder output drives the select lines of the crossbar. According to the value on the select lines, the corresponding mux in the crossbar is selected and produces the output.
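One way to picture the step from grant signals to crossbar select lines is the following Python sketch. It assumes the grant vector is one-hot and that the select value is simply the index of the asserted grant; the paper does not give the decoder's truth table, so the function name `grant_to_select` and this mapping are assumptions.

```python
def grant_to_select(grant):
    """Map a one-hot grant vector to the index used as a mux select value.

    Returns None when no grant is asserted; rejects vectors with more
    than one asserted grant.
    """
    if sum(1 for g in grant if g) > 1:
        raise ValueError("grant vector must be one-hot")
    for i, g in enumerate(grant):
        if g:
            return i
    return None
```

For example, a grant vector of `[0, 0, 1, 0]` yields the select value 2, which would steer input 2 through the selected mux.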

#### 3.2 Proposed on chip network topology

The Clos network is a family of multistage networks. A typical three-stage Clos network is defined as C(n, m, p) [8], where n is the number of inputs of each of the p first-stage switches and m is the number of second-stage switches.
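To make the C(n, m, p) notation concrete, the Python sketch below computes a few basic figures for a symmetric three-stage Clos network. It assumes p first-stage switches of size n x m, m middle switches of size p x p, and p last-stage switches of size m x n. The function name `clos_summary` is hypothetical, and reading the network in this paper as C(4, 4, 4) (four 4x4 switches per stage) is an assumption, since the text does not state the parameters explicitly.

```python
def clos_summary(n, m, p):
    """Basic figures for a symmetric three-stage Clos network C(n, m, p)."""
    ports = n * p
    # Two outer stages of p switches (n x m each) plus m middle switches (p x p each).
    crosspoints = 2 * p * n * m + m * p * p
    return {
        "ports": ports,
        "crosspoints": crosspoints,
        "single_crossbar_crosspoints": ports * ports,
        "rearrangeable": m >= n,                 # rearrangeably non-blocking condition
        "strictly_nonblocking": m >= 2 * n - 1,  # Clos's strict-sense condition
    }


summary = clos_summary(4, 4, 4)
```

For C(4, 4, 4) this gives 16 ports and 192 crosspoints, against 256 for a single 16x16 crossbar. With m = n the network is rearrangeably non-blocking but not strictly non-blocking, which is why a path-setup step is needed whenever the permutation changes.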



Fig 4. Proposed design for on chip network topology

The whole on-chip network topology has only four request signals (i.e. req\_0, req\_1, req\_2, req\_3). These requests are given to the first switch, sw1; the grants of sw1 are the requests for sw2, the grants of sw2 are the requests for sw3, the grants of sw3 are the requests for sw4, the grants of sw4 are the requests for sw5, and so on up to sw12. The number of grant signals is larger (i.e. 64). The data moves through the whole proposed design according to the grant signals.

The network topology carries 16-bit data, and the path-setup scheme is the key point of the proposed design: it supports runtime path arrangement when the permutation is changed. Each path setup starts from an input and finds a path leading to its corresponding output. Fig 4 shows the proposed on-chip network topology.

#### 3.3 Switch based test wrapper

Fig 5 shows the proposed design for the switch based test wrapper module. The test wrapper checks whether all the data in the switch is moved properly. It uses two switches, and between the two switches four FIFOs are used for holding the data.



Fig 5. Proposed design for switch based test wrapper module

In each FIFO, data is transmitted on a first-in, first-out basis: whichever data arrives first is transmitted first. Each FIFO contains 8 memory locations, and each location holds 4 bits of data.
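The FIFO behaviour can be modelled with a minimal Python sketch, using the depth of 8 locations and width of 4 bits stated above. The class name `Fifo`, the choice to mask writes to the data width, and the behaviour on overflow and underflow (refuse the write, return `None` on an empty read) are assumptions; the paper does not describe these cases.

```python
from collections import deque


class Fifo:
    """First-in, first-out buffer with a fixed depth and data width."""

    def __init__(self, depth=8, width=4):
        self.depth = depth
        self.mask = (1 << width) - 1  # keep only `width` bits per word
        self._q = deque()

    def write(self, value):
        """Store a word; return False without storing it if the FIFO is full."""
        if len(self._q) >= self.depth:
            return False
        self._q.append(value & self.mask)
        return True

    def read(self):
        """Return the oldest stored word, or None if the FIFO is empty."""
        return self._q.popleft() if self._q else None
```

Writing the words 0 to 7 fills the buffer, a ninth write is refused, and successive reads return 0, 1, 2 and so on, in the order the words arrived.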

# 4. RESULTS AND DISCUSSIONS

This section presents the results and discussion of the developed on-chip network.

# 4.1 Simulation results of proposed switch module design

The architecture consists of four in signals, four out signals, an arbiter, a crossbar, and a decoder, as shown in Fig 6. The in and out signals are the input and output pins of the switch. The arbiter connects the in and out signals with the help of the grant signals; only through the arbiter can the switch inputs request an output link. The arbiter also resolves contention by serving the input that requested earlier when several input pins request the same output pin. In this architecture, the finite state machine is implemented only in the arbiter.

Fig 7 shows the waveform for the single-switch architecture of the permutation network. The arbiter has four inputs, req0 to req3, and four outputs, grant0 to grant3. If a request arrives at any input, for example req1, the arbiter grants it and moves to the next state. If no request is present, it moves to the idle state. If several requests arrive at the same time, it compares them and grants the output according to priority.



Fig 6: RTL Schematic view of single switch architecture for a proposed network



Fig 7: Simulation waveform for switch architecture

The arbiter output is the input to the decoder; a 4:1 decoder is used. The decoder output drives the select lines of the crossbar. According to the value on the select lines, the corresponding mux in the crossbar is selected and produces the output.

# 4.2 Simulation results of proposed network design

Fig 8 shows the RTL view of the network topology. The network carries 16-bit data, and the path-setup scheme is the key point of the proposed design: it supports path arrangement when the permutation is changed. Each path setup starts from an input and finds a path leading to its corresponding output. The simulation waveform of the proposed design is shown in Fig 9. The proposed design requires a minimum latency of 8 clock cycles and a maximum latency of 15 clock cycles, as shown in Fig 9.

Compared with the other designs considered, the proposed design has low latency.
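The idea of finding, for every input, a path to its corresponding output can be illustrated in software. The Python sketch below assigns a middle-stage switch to each connection of a permutation so that no link between stages is used twice. It is a centralized backtracking search written for illustration, not the distributed hardware path-setup scheme of the proposed design; the function name `setup_paths` and the default of four ports per outer switch and four middle switches are assumptions.

```python
def setup_paths(perm, n=4, m=4):
    """Assign a middle-stage switch to every connection src -> perm[src].

    Returns a list `mid` with mid[src] = chosen middle switch, or None if
    no conflict-free assignment exists. Uses plain backtracking search.
    """
    size = len(perm)
    in_used = [set() for _ in range(size // n)]   # middle switches taken per first-stage switch
    out_used = [set() for _ in range(size // n)]  # middle switches taken per last-stage switch
    mid = [None] * size

    def place(src):
        if src == size:
            return True
        a, b = src // n, perm[src] // n  # first-stage and last-stage switch indices
        for k in range(m):
            if k not in in_used[a] and k not in out_used[b]:
                in_used[a].add(k)
                out_used[b].add(k)
                mid[src] = k
                if place(src + 1):
                    return True
                # Undo the choice and try the next middle switch.
                in_used[a].discard(k)
                out_used[b].discard(k)
                mid[src] = None
        return False

    return mid if place(0) else None
```

Calling `setup_paths(list(range(16)))` for the identity permutation gives each group of four inputs the middle switches 0, 1, 2, 3 in order. The search tries alternatives when a first choice blocks a later connection, which plays the role that rearrangement plays in hardware.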

#### 4.2.1 Timing constraints result

The timing-constraint result gives the clock frequency of the proposed design. According to this result, the clock frequency of the proposed design is 320.446 MHz. From this, the throughput of the proposed design is calculated as

$$\text{Throughput} = \text{clock frequency} \times \text{number of inputs} \times \text{number of outputs} \quad (\text{in Hz})$$

Then,

$$\text{Throughput} = 320.446\ \text{MHz} \times 16 \times 16 = 82.034\ \text{GHz}$$

Compared with the other designs considered, the proposed design has high throughput.



Fig 8. RTL schematic view for proposed network



Fig 9. Simulation waveform of proposed network design

# 4.3 Simulation results of switch based test wrapper

Fig 5 shows the FIFO based test wrappers interfaced with the proposed on-chip network; here only two switches are used to test end-to-end synchronous data.


Figure 3.9 shows the simulation waveform of the switch based wrapper circuit. The circuit takes the inputs din0 to din3 together with the clock and reset; data write and data read operations are performed, and the output is produced according to the requests, as shown in the figure.

The RTL view of the switch based test wrapper module is shown in Fig 3.6. Four FIFOs are used for holding the data of the four input signals.



Fig 3.6. RTL view of switch based test wrapper module

The simulation result of the switch based test wrapper module is shown in Fig 3.7.



Fig 3.7. Simulation waveform of switch based test wrapper module

# 4.4 Comparisons With Other Existing Design

Table 1 compares the proposed design with other designs. Because of differences in switching technique, data width, topology, and particularly the evaluation level, it is difficult to make a direct comparison with other networks. However, Table 1 indicates a compact implementation resulting from the proposed approach.

Table 1: Comparison of related on-chip networks

| Design | [9] | [10] | [11] | [11] | [8] | This work |
|--------|-----|------|------|------|-----|-----------|
| Topology | 4×4 2-D mesh | De Bruijn | Butterfly | Benes | 3-stage Clos | 3-stage Clos |
| Number of inputs × outputs | 16x16 | 16x16 | 16x8 | 16x8 | 16x16 | 16x16 |
| Min/max latency (cycles) | NA | NA | 5/21 | 8/22 | 16/28 | 8/15 |
| Measured frequency | - | - | - | 110 MHz | 140 MHz | 82.034 GHz (throughput figure from Section 4.2.1) |
| Cost for system implementation | High | High | High | High | High | Low |

The delay bound of each path setup is comparable to the maximum packet latency of packet-switching approaches, for example 22 cycles in the Benes 2N-N network [11]. Note that data delivery in the proposed network is guaranteed owing to the use of circuit switching, whereas this feature is not evident in the packet-switching approaches of [10]-[11]. As another example, suppose an MPSoC that is computing under one full permutation then needs to switch to another permutation. A fast or even zero switching time can be achieved with stacking if a standby network is rearranged in parallel with the current network's operation and is ready for the runtime switch. Regarding system scalability, the Clos topology is scalable, as shown by its use in large commercial systems. The proposed path-setup scheme operates in a distributed manner, which suggests scalability in computing the guaranteed routes at runtime compared with static (pre-computed) or centralized approaches. However, runtime path-arrangement optimization and physical design issues for the scaled networks need further consideration in future research. The proposed on-chip network has high throughput, low latency, and low cost. Low cost is achieved by developing the design in Xilinx 12.4, simulating it in ModelSim 6.3f, and implementing it on a Spartan 3 FPGA device [2].

# 5. CONCLUSIONS

This work has presented an on-chip network design supporting traffic permutations in MPSoC applications. By combining a circuit-switching approach with a path-setup scheme under a Clos network topology, the proposed design supports arbitrary traffic permutations with reduced implementation overhead.

The design is implemented using Xilinx ISE 12.4 on an FPGA board of the Spartan 3 family. Synthesis results for delay and a power analysis were obtained, and they indicate improved efficiency compared with existing systems.

The efficient permutation Clos network design for MPNOC transfers data from source to destination using a 4x3 Clos network with reduced latency and increased throughput; the operating frequency has been determined, and the design uses fewer resources.

#### 6. FUTURE SCOPE

The ultimate goal of this project is to develop a 4x3 NoC. Future work includes extending the router architectures and increasing the number of inputs and outputs of the Clos network to construct an efficient NoC. A CMOS implementation of the NoC can also be carried out.

#### 7. REFERENCES

- [1] K. Goossens, J. Dielissen, and A. Radulescu, "Aethereal network on chip: concepts, architectures, and implementations," Design & Test of Computers, IEEE, vol. 22.
- [2] S. Jovanovic, C. Tanougast, and S. Weber, "CuNoC: A Scalable Dynamic NoC for Dynamically Reconfigurable FPGAs," in IEEE, 2007.
- [3] Jingcao Hu and Radu Marculescu, "Energy- and Performance-Aware Mapping for Regular NoC Architectures," in IEEE, 2005.
- [4] C. Hilton and B. Nelson, "PNoC: a flexible circuit switched NoC for FPGA-based systems," Computers and Digital Techniques, IEE Proceedings, vol. 153, 2006.
- [5] Mikkel B. Stensgaard and Jens Sparsø, "ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology," in IEEE, 2008.
- [6] Phi-Hung Pham, Jongsun Park, Phuong, "Design and Implementation of Backtracking Wave-Pipeline Switch to Support Guaranteed Throughput in Network-on-Chip," in IEEE, 2010.
- [7] Jingcao Hu and Radu Marculescu, "Energy- and Performance-Aware Mapping for Regular NoC Architectures," in IEEE, 2005.
- [8] Phi-Hung Pham, Junyoung Song, Jongsun Park, and Chulwoo Kim, "Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip," in IEEE, 2013.
- [9] C. Neeb, M. J. Thul, and N. Wehn, "Network-on-chip-centric approach to interleaving in high throughput channel decoders," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2005, pp. 1766–1769.
- [10] H. Moussa, A. Baghdadi, and M. Jezequel, "Binary de Bruijn on-chip network for a flexible multiprocessor LDPC decoder," in Proc. ACM/IEEE Design Autom. Conf. (DAC), 2008, pp. 429–434.
- [11] H. Moussa, O. Muller, A. Baghdadi, and M. Jezequel, "Butterfly and Benes-based on-chip communication networks for multiprocessor turbo decoding," in Proc. Design, Autom. Test in Euro. (DATE), 2007, pp. 654–659.