# High Throughput LFSR Design for BCH Encoder using Sample Period Reduction Technique for MLC NAND based Flash Memories

Manikandan.S.K Assistant Professor (Sr.Gr.), Department of EEE, Velalar College of Engineering and Technology Erode, Tamilnadu

Nisha Angeline. M Assistant Professor (Sr.Gr.), Department of ECE, Velalar College of Engineering and Technology Erode, Tamilnadu

# ABSTRACT

Error correction is one of the important technique for detecting and correcting errors in communication channels, memories etc., Errors are associated with all types of memories. But the NAND FLASH memories are competing in the market due to its low power, high density, cost effectiveness and design scalability. As far as the memory is concerned the testing should not consume more time. So, some DSP algorithms are used to overcome the delays by increasing the sampling rate. BCH codes are widely been used for error detection and correction. The generated check bits of the BCH encoder are appended with the message bits to form a codeword. This codeword is sent to the receiver to detect any error during the transmission. One of the main components of BCH encoder is LFSR (Linear Feedback Shift Register). LFSR find its wider application in Built-in-Self-Test, signature analyzer etc., whereas here it is used to form parity bits to concatenate with message bits for the formation of a codeword. The main advantage of LFSR is that it is simple to construct and it operates at very high clock speed, but its main drawback is that the inputs are given in bit serial. To overcome these drawbacks, DSP algorithms such as unfolding and parallel processing are used by selecting the unfolding factor based on some design criteria. Selecting a better unfolding value reduces the sample period, decreases the clock cycle, and increases the speed, power and the throughput.

# **KEYWORDS**

Bose Chaudhuri-Hocquengham (BCH), Cyclic Redundancy Check (CRC), Computational Time (CT), Galois Field (GF), LFSR, MLC (Multi Level Cell), unfolding, sample period reduction.

# 1. INTRODUCTION

NAND Flash's high-density, low power, cost effective, and scalable design make it an ideal choice to fuel the explosion of new multimedia products that are entering the market. Advances in system design techniques also enable the Sharmitha.E.K PG Student, Department of ECE, Velalar College of Engineering and Technology Erode, Tamilnadu

Palanisamy.C Professor and Head, Department of IT, Bannari Amman College of Technology, Sathyamangalam, Tamilnadu

more cost effective NAND Flash to replace NOR Flash in a significant percentage of traditionally NOR Flash applications. NAND flash performs read and write simultaneously. Due to the efficient architecture of the NAND Flash, its cell size is almost half the size of a NOR cell. This enable NAND Flash architecture to offer higher densities with larger capacity on a given die size, in combination with a simpler production process [1].

This paper focus on NAND Flash memories since they have lower erase times, less chip area per cell which allows greater storage density, and lower cost per bit than NOR Flash memories. NAND is used or best suited for sequential data application. Physically, the NAND architecture uses smaller transistors, because it doesn't have to "pull-down" a whole bitline. A NAND bit line is a series of transistors so each transistor only has to pass a small amount of current.

There are some inherent limitations of NAND Flash memories [2]. These include write/read disturbs, data retention errors, bad block accumulation, limited number of writes, program disturb errors, soft errors and stress-induced leakage current. Because of these errors that occur in MLC NAND based flash memories various types of coding techniques can be applied for error detection and correction.

There are various codes available for error detection and correction. The codes are broadly classified into two types; they are a).block codes and b).conventional codes. Cyclic code is one of the classifications of block codes. The subset of the block code is BCH code.BCH code initially forms a generator polynomial by the use of finite field (GF) concept [3] and generates a parity (check) bits to be appended to the message bits to form a codeword [4]. The main component of the encoder is simply a LFSR for generating the parity bit. The components used to form LFSR are simply registers and exor gates. Series combination of both registers and exor gates forms a LFSR. The main advantage of LFSR is it is simple to construct and it operates at very high clock speed, But the main drawback of the LFSR is that the bit stream applied to LFSR should be in serial. Hence, high-speed data transmission cannot be made possible. In order to increase the throughput and speed, parallel processing can be applied by

unfolding concept. Parallel processing increases the number of message bits to be processed in a clock cycle (sample rate), but increases the area also.

Unfolding is a transformation technique, which describes J consecutive iterations of the original DSP program. Unfolding increases the Iteration bound  $T_{\infty}$  to  $JT_{\infty}$ . In order to reduce the sampling period it is important to calculate the iteration bound before unfolding the system to select the unfolding factor. Many important cases such as  $CT > T_{\infty}$  and  $T_{\infty}$  is not an integer must be analyzed before selecting the unfolding factor. The selected unfolding value must make  $CT < T_{\infty}$  and  $T_{\infty}$  is an integer, which automatically reduces the sampling period. Unfolding is a transformation technique that can be applied to any DSP program to create a new program, which describes more than one iteration of the original program [5]. Large number of iterations of an original program can be made by unfolding it by an unfolding factor.

The rest of the paper is organized as follows. Section II gives the brief summary of the Existing System. Section III contains the design procedure of LFSR for BCH (31, 16) and criteria for selecting the unfolding factor to propose a new unfolded structure to reduce the sample period and gives the steps for unfolding the DFG of the LFSR. Section IV contains the algorithm for unfolding and proposed a new architecture for LFSR. Section V analyses the data flow, area and clock cycle of LFSR for different unfolding

# 2. EXISTING SYSTEM

There are various recursive formulas developed in past to achieve the parallel architecture for CRC hardware [6]. Highspeed architectures for parallel long BCH encoders are developed in [7] for a particular generator polynomial of lower order. However, the unfolding factor is not selected by analyzing various cases. Resource Sharing and power optimization techniques are applied in [6] to achieve low power high-throughput BCH error correction in VLSI for multi-level cell NAND flash memories. Novel look ahead techniques are used to improve the throughput for the generator polynomial of lower order [8] without considering the important criteria for selecting the unfolding factor. Retiming and unfolding of CRC architectures are introduced for lower order generator polynomial to increase the speed [9] without selecting the unfolding factor by considering some important criteria. The normal architecture and the unfolded architecture [5] are shown in Figure 1 and 2.



Figure 1: CRC architecture for  $g(x) = 1 + y + y^3 + y^5$ 



Figure 2: Two-Parallel CRC architecture for  $g(x) = 1 + y + y^3 + y^5$  for J = 2

# **3. PROPOSED SYSTEM**

The basic architecture of the proposed system is unfolded LFSR, for generating the parity bit. Compare to normal LFSR, the proposed unfolded LFSR architecture uses sample period reduction technique to achieve more speed. In order to achieve this objective, some important criteria has to be considered for selecting the unfolding factor to improve the design methodology. Data flow table for the proposed system and conventional system is tabulated and compared in Table 1,2 and 3. Then, area and power analysis of the proposed and conventional LFSR is graphically shown in Figure 7,8 and 9 to analyze the depth on the hardware and power overhead. Different levels of unfolding factors are introduced to check the hardware complexity and speed of the design.

# 3.1 Design of LFSR for BCH (31,16)

BCH codes are subset of the Block codes. BCH codes belong to a powerful class of multiple error correcting codes [10]. BCH codes are based on well-defined mathematical properties. These mathematical properties are based on the Galois Field or finite fields [3]. The Finite field has the property that any arithmetic operations on field elements always have results in the field only [3]. To provide an excellent error correcting capability, the roots of the generator polynomial of the BCH codes have to be specified carefully. With a generator polynomial of g(x), a t- error correcting cyclic codes is the binary BCH codes, with a condition that g(x) must be the least degree polynomial over Galois Field GF(2). Steps for designing the generator polynomial of BCH (31,16) is explained below,

- i) Choose an irreducible polynomial  $p(x) = x^5 + x^2 + 1$
- ii) Construct GF (2<sup>5</sup>)

iii) Construct the minimal polynomial using the relation 
$$\varphi_{\mathsf{R}}(\mathbf{X}) = \prod_{i=1}^{l-1} (x + \beta^{2^i})$$
 (1)

$$\alpha: x^{5} + x^{2} + 1 = m1(x)$$
  

$$\alpha^{3}: x^{5} + x^{4} + x^{3} + x^{2} + 1 = m2(x)$$
  

$$\alpha^{5}: x^{5} + x^{4} + x^{2} + x + 1 = m3(x)$$

where  $\beta$  be a non-zero element of GF(2<sup>*m*</sup>).

iv). Form the generator polynomial using the relation g(x)=LCM(m1(x),m2(x),m3(x)). (2)

In this proposed work, BCH (31, 16) is taken as an example and an encoder is designed for it using the generator polynomial,  $g(x)=x^{15}+x^{11}+x^{10}+x^9+x^8+x^7+x^5+x^2+x^2+x+1$ 

and LFSR is unfolded by an unfolding factor which is selected based on some design criteria discussed in theorem[5] to improve the design methodology. LFSR architecture for BCH encoding is shown in Figure 3. Initially the 16-bit information must be made equal to the degree of the generator polynomial. Hence, the resultant message bit is 22 bit long, which is divided by the generator polynomial to form parity bits.

The parity bits that are obtained from LFSR is 100101000100010. This is systematic encoding, because information and check bits are arranged together so that they can be recognized in the resulting codeword.

General equation for codeword is,  $i(x).x^{n-k}=q(x).g(x)+r(x)$  (3)

#### Where,

i (x):information bit polynomial.

- q (x):quotient bit polynomial.
- g (x):generator polynomial. r (x):remainder polynomial.

Encoder of BCH(31,16) comprises of parallel in serial out shift register followed by the LFSR.



Figure 3: LFSR Architecture for  $g(x) = x^{15} + x^{11} + x^{10} + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1$ 

#### 3.2 Unfolding LFSR

Unfolding algorithm is applied only to the LFSR architecture in this proposed design to increase the speed by reducing the sampling period.

The steps to be followed to unfold the encoder are,

- i) Convert normal LFSR into DFG
- ii) Calculate the iteration bound for the DFG
- iii) If  $T_{\infty} < CT$  of a node, apply unfolding to make the sampling period to be equal to  $T_{\infty}$ . This technique is called as sample period reduction.

#### 3.2.1 Formation of DFG for the LFSR

Often a DSP program is represented using the DFG. Here the nodes represent the computation and each of the node has its own computation time. The communication between the nodes is represented using edges. The DFG for the LFSR is shown in Figure 4.



Figure 4: DFG of LFSR for  $g(x) = x^{15} + x^{11} + x^{10} + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1$ 

Many of the DSP algorithms contain feedback loops, which impose an inherent fundamental lower bound on the achievable iteration or sample period. This bound is referred to as iteration bound. Iteration bound is the representation of the algorithm in the form of a DFG. Same algorithm but with different representations lead to different iteration bound. Iteration bound is defined as,

$$T_{\infty} = \max_{l \in L} \left\{ \frac{l}{w_l} \right\}$$
(4)

Iteration bound is the maximum of loop bound

$$T_{\infty} = max\{\frac{1}{4}, \frac{2}{5}, \frac{3}{6}, \frac{4}{7}, \frac{5}{8}, \frac{6}{10}, \frac{7}{12}, \frac{8}{13}, \frac{9}{14}, \frac{10}{15}\}$$
$$T_{\infty} = \frac{10}{15} = \frac{2}{2}.$$

#### 3.2.3 Sample Period Reduction

The loop  $10 \rightarrow 9 \rightarrow 8 \rightarrow 7 \rightarrow 6 \rightarrow 5 \rightarrow 4 \rightarrow 3$  $\rightarrow 2 \rightarrow 1$  has the maximum loop bound. Synthesis report reveals that all the nodes of the DFG has CT of 1u.t. Since  $T_{\infty} < CT$ , iteration period cannot be made equal to  $T_{\infty}$ . In such a case retiming can be applied but it cannot be used to reduce the CT of the critical path of the DFG to  $T_{\infty}$ . Selection of the unfolding factor is an important criterion in sample period reduction. Unfolding factor is chosen using the relation,  $J = \int \frac{t_u}{T_{\infty}} \gamma = 2$ . Iteration bound of the unfolded DFG changes from  $T_{\infty}$  to  $JT_{\infty}$ . Where J stands for the unfolding factor. Similarly the sample period of the unfolded DFG is  $\frac{T_{\infty}}{J}$ . One more case exist is, if  $T_{\infty}$  is not an integer. The LFSR of g(x)  $= x^{15} + x^{11} + x^{10} + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1$ 

satisfies both the cases. Because its CT is greater than the iteration bound and the iteration bound is not an integer. Hence J must be selected in such a way that  $|T_{\infty}|$  is an integer and  $|T_{\infty}\rangle$  node CT. The only value of J that satisfies both the condition is 3. This is clearly specified by a theorem [5].

# 4. UNFOLDING ALGORITHM

It is a transformation technique that can be applied to a DSP program in order to create a new program describing more than one iteration of the original program. Unfolding a DSP program is done by selecting an unfolding factor J, which describes J consecutive iterations of the original program. Loop unrolling is also called as unfolding [5].

#### 4.1 Algorithm Steps

- For each node U in the original DFG, draw J nodes U<sub>0</sub>, U<sub>1</sub>, U<sub>2</sub>,...,U<sub>j-1</sub>
- 2. For each edge  $U \rightarrow V$  with w delays in the original DFG, draw the J edges  $U_i \rightarrow U_{(i+w)\%J}$  with  $\lfloor \frac{i+w}{J} \rfloor$  delays for  $i = 0, 1, \dots, J-1$ . (5)

By this technique the speed of the LFSR is increased automatically by reducing the clock cycle. The main drawback of unfolding is that the area of the system increases and choosing a large value of unfolding factor leads to hardware complexity. After applying the unfolding technique with unfolding factor J=3 and 2 for Figure 3, three parallel and two parallel architectures are obtained and it is shown in Figure 5 and 6.



Figure 5: Three parallel LFSR for  $g(x) = x^{15} + x^{11} + x^{10} + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1$  after Unfolding by a factor of 3



Figure 6: Three parallel LFSR for  $g(x) = x^{15} + x^{11} + x^{10} + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1$  after Unfolding by a factor of 2

After the application of unfolding, the sample period is reduced and the iteration bound is increased from 0.66 to 1.32. So that  $J T_{cc} > CT$ . This sample period reduction is one of the application of the unfolding algorithm.

#### 5. RESULTS AND ANALYSIS

Initially the codeword is formed by the generation of parity bits 100100010100010. This parity bit formation is coded in VHDL. Each of the unfolded architecture is coded in VHDL, simulated and implemented using Xilinx92i to analyze the area and speed. For the message bits: 0000000001000001 and for the generator polynomial  $g(x) = x^{15} + x^{11} + x^{10} + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x^{+1}$  the normal BCH encoder simulation result is shown in Figure 10. The same architecture but with unfolding factor of 2 and 3 is simulated and verified as shown in Figure 11 and 12. From the results shown in Table 1, 2, 3 and 4, it is clear that the clock cycle decreases from 22 to 8 because of sample period reduction technique. Hence unfolding speed up the LFSR operation by decreasing the clock cycle. As far the memory is concerned, the error detection and correction must not take much time because it decreases the throughput of the system.

# 5.1 Data Flow Table

Table 1: Data flow table for normal LFSR

| Clock | Message bit | y(14 to 0)                              |
|-------|-------------|-----------------------------------------|
| _     | _           |                                         |
| 1     | 1           | 000000000000000000000000000000000000000 |
| 2     | 0           | 000000000000010                         |
| 3     | 0           | 00000000000100                          |
| 4     | 0           | 00000000001000                          |
| 5     | 0           | 00000000010000                          |
| 6     | 0           | 00000000100000                          |
| 7     | 1           | 00000001000001                          |
| 8     | 0           | 00000010000010                          |

| 9  | 0 | 00000100000100  |
|----|---|-----------------|
| 10 | 0 | 000001000001000 |
| 11 | 0 | 000010000010000 |
| 12 | 0 | 000100000100000 |
| 13 | 0 | 001000001000000 |
| 14 | 0 | 010000010000000 |
| 15 | 0 | 100000100000000 |
| 16 | 0 | 000110110101111 |
| 17 | 0 | 001101101011110 |
| 18 | 0 | 011011010111100 |
| 19 | 0 | 110110101111000 |
| 20 | 0 | 101010101011111 |
| 21 | 0 | 010010100010001 |
| 22 | 0 | 100101000100010 |
|    |   |                 |
|    |   |                 |

Table 2. Data flow table for LFSR with unfolding factor of

3

| Clock                                | m(3k)<br>m(3k+1)<br>m(3k+2)                          | y(14 to 0)                                                                                                                                |
|--------------------------------------|------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| 1<br>2<br>3<br>4<br>5<br>6<br>7<br>8 | 100<br>000<br>100<br>000<br>000<br>000<br>000<br>000 | $\begin{array}{c} 000000000000100\\ 00000000100000\\ 00000100000100\\ 000100000100000\\ 10000010000000\\ 011011010111100\\ 0100101000100$ |

Table 3: Data flow table for LFSR with unfolding factor of2

| Clock | m(2k)<br>m(2k+1) | y(14 to 0)      |
|-------|------------------|-----------------|
|       |                  |                 |
| 1     | 10               | 000000000000010 |
| 2     | 00               | 00000000001000  |
| 3     | 00               | 00000000100000  |
| 4     | 10               | 000000010000010 |
| 5     | 00               | 000001000001000 |
| 6     | 00               | 000100000100000 |
| 7     | 00               | 001000001000000 |
| 8     | 00               | 000110110101111 |
| 9     | 00               | 011011010111100 |
| 10    | 00               | 110010101011111 |
| 11    | 00               | 100101000100010 |
|       |                  |                 |

# 5.2 Area, Clock Cycle and Power Analysis

The design is analyzed for different levels of unfolding factors in order to discuss the hardware and power overhead involving in different parallelism levels. The power analysis and device utilization analysis is done by implementing this LFSR in XilinxISE9.2i is graphically shown in Figure 7, 8 and 9.



Figure 7: Comparison of total power for different unfolding factors



Figure 8: Comparison of various powers for different unfolding factors



Figure 9: Comparison of Area and Speed parameters for different unfolding factors

# 5.3 Screen Shots



Figure 10: LFSR before Unfolding



Figure 11: LFSR after Unfolding with J=3



Figure 12: LFSR after Unfolding with J=3

# 6. CONCLUSION

Since the NAND FLASH memories require less delay encoders, a high throughput encoder is designed by unfolding the LFSR of the BCH encoder by checking design criteria for selecting the unfolding factor .Moreover area, clock cycle and power is analyzed by simulating the design in ModelSim tool by VHDL language and implemented the design in XilinxISE9.2i.The obtained results reveal that unfolding increases the throughput, this in turn decreases the clock cycle which automatically increases the speed but it increases the area and power.

# 7. FUTURE WORK

Different-pipelining techniques can be introduced to reduce the critical path of the encoder of BCH. Retiming also can be applied to further increase the speed and to reduce the power consumption and area.

# 8. ACKNOWLEDGMENT

The authors acknowledge the contributions of the students, faculty of Velalar College of Engineering and Technology for helping in the design of test circuitry, and for tool support. The authors also thank the anonymous reviewers for their thoughtful comments that helped to improve this paper. The authors would like to thank the anonymous reviewers for their constructive critique from which this paper greatly benefited.

# 9. REFERENCES

- [1] R. Bez, E. Camerlenghi, A. Modelli, and A. Visconti, "Introduction to flash memory," Proc. IEEE, vol. 91, no. 4, pp. 489–502, Apr. 2003.
- [2] J. Cooke, "The inconvenient truths about NAND flash memory," presented at the Micron MEMCON Presentation, Santa Clara, CA, 2007.
- [3] William Stallings, "Cryptography and Network Security-Principles and Practices, Introduction to Finite Fields", 3rd edition, 2004.
- [4] Ranjan Bose, "Information Theory, Coding and Cryptography".
- [5] K.K.Parhi "VLSI Digital Signal Processing Systems-Design And Implementation".
- [6] Wei Liu, Junrye Rho, and Wongong Sung, "Low- Power High throughput BCH error correction VLSI Design for Multi-Level cell NAND Flash Memories".
- [7] Keshab K. Parhi, "Eliminating the Fan out Bottleneck in Parallel Long Bch Encoders" in proc IEEE, vol.51.No.3, march 2004.
- [8] Chao Cheng and Keshab Parhi, "High-Speed Parallel CRC Implementation Based on Unfolding, Pipelining And Retiming", in proc, IEEE, vol.53, No.10, October 2006.
- [9] Naresh Reddy, B.Kiran Kumar and K.monisha Sirisha," On the Design of High Speed Parallel CRC Circuits Using DSP Algorithms" in IJCSIT, vol.3 (5), 2012.
- [10] John G.Proakis Masoud Salehi,"Digital-Communications-Linear block codes, cyclic codes, BCH codes, Reed-Solomon codes," 5<sup>th</sup> Edition, 2008.

# **AUTHOR'S PROFILE**

Manikandan.S.K has received B.E degree in Electronics and Communication Engineering from Annai Mathammal Sheela Engineering College, Namakkal District, 2003 and M.E – Embedded Systems from Sasthra University, Tanjore District in the year 2006. He is working as a Professor (Sr.Gr.) in the department of EEE, Velalar College of Engineering and Technology, Erode. Currently he is doing Ph.D under Anna University in the area of VLSI Design. He has published three papers in International Journals and presented five papers in National and International Conferences. His areas of interest are Microprocessor and Microcontroller, ARM Processor, VLSI architectures.

**Sharmitha.E.K** has received B.E degree in Electrical and Electronics Engineering from Anna University, Coimbatore, 2011. She is currently pursuing Master of Engineering in VLSI Design in Velalar College of Engineering and Technology under Anna University, Chennai. She has published one paper in international journal. Her areas of interest in research are VLSI Signal Processing, VLSI architectures.

**Nisha Angeline. M** has received B.E degree in Electronics and Instrumentation Engineering from the Indian Engineering College, Thirunelveli District in 2003 and M.E –VLSI Design in Kongu Engineering College, Perundurai in the year 2006. She received first rank in PG course for her academic performance. She is working as a Professor (Sr.Gr) in the department of ECE, Velalar College of Engineering and Technology, Erode. Currently she is doing Ph.D under Anna University in the area of VLSI Design. She has published three papers in International Journals and presented five papers in National and International Conferences. Her areas of interest are Digital Electronics, VLSI Architecture, and VLSI Signal Processing.

Dr.C.Palanisamy received B.E degree in Electronics and Communication Engineering in 1998 from the Madras University, Chennai and the M.E Communication Systems in 2000 from Madurai Kamaraj University, Madurai. He received his Ph.D degree in Information and Communication Engineering from Anna University, Chennai in 2009. He is a gold medalist in M.E., (Communication Systems). He has 12 years of teaching experience. Formerly he worked as a faculty member in the Department of Information Technology, PSG College of Technology, Coimbatore, India. Currently he is a Professor and Head of the Department of Information Technology, Bannari Amman Institute of Technology, Sathyamangalam, India.He has published 25 papers in international, national journals and conference proceedings. His areas of research include data mining, digital image processing and computer networking. He has delivered several special lectures in workshops and seminars. He has also organized 4 national level conferences and 4 national level workshops with funding from government agencies including, MIT, DRDO, CSIR and AICTE.He has successfully completed one research project from DRDO, New Delhi and one project from AICTE, New Delhi is under progress.