1. Introduction
When a neuron is created, the training process, the inference process, and the data source must share the same data type; this gives the neuron the ability to perform its task with sufficient precision. A system running on an FPGA has been developed in which an effort has been made to improve the precision of the operations. This effort involved the design and implementation of several floating-point formats in which the size of the fraction has been increased. Several experiments were developed to understand the consequences of increasing the fraction size.
In recent years, several manufacturers have produced large-scale computing infrastructures to train and apply neural networks. Some of these products take advantage of the resiliency of neural network architectures and their learning algorithms, which allows the programmer to relax the precision requirements. The resulting products use fixed- and floating-point values of less than 32 bits in length.
The fixed-point data type is known for its low precision and low latency. The main problem identified by the authors in [1] is that rounding to the nearest value causes training to fail. A way to overcome this obstacle is a process called stochastic rounding, which selects between the two fixed-point neighbors of a floating-point value through a probability calculation. The authors presented compelling graphs of results generated with a Xilinx Kintex-325T FPGA. The data types used were 12-bit and 14-bit fixed point.
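As a minimal illustration of the idea (the helper name and the 5-bit fractional grid below are our choices, not the authors' implementation), the following Python sketch rounds each value up or down with probability proportional to its distance from the two neighboring grid points, which makes the rounding unbiased in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_fixed(x, frac_bits):
    scaled = np.asarray(x) * (1 << frac_bits)   # position on the grid
    low = np.floor(scaled)
    p_up = scaled - low                         # fractional remainder
    up = rng.random(scaled.shape) < p_up        # round up with prob. p_up
    return (low + up) / (1 << frac_bits)

# 0.3 lies between 9/32 and 10/32; over many draws the mean tends to 0.3.
print(stochastic_round_fixed(np.full(8, 0.3), 5))
```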
In the case of floating point, vendors tend to follow the packing structure defined in IEEE-754 [2]. The structure includes a sign bit, bits for the exponent, and bits for the fraction. The exponent is expressed in excess (biased) notation. Since there is always a "1" bit to the left of the binary point, the stored fraction is reduced to the bits referred to as fractional information. The meaning of this packing format is shown in the following formula, where $s$ is the sign bit, $e$ the stored exponent, $f$ the fraction, and $\mathrm{bias}$ the excess of the exponent notation:

$v = (-1)^{s} \times 1.f \times 2^{\,e - \mathrm{bias}}$
Regarding the data types used in the calculation accelerator market, these types are developed according to the needs of each manufacturer. Some such types are IEEE 32-bit floating point (fp32) [2], IEEE 16-bit floating point (fp16) [2], Google 16-bit floating point (bfloat16) [3], and NVIDIA 19-bit floating point (tf32) [4].
Figure 1 illustrates the aforementioned data packing structures.
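To make these packing structures concrete, the following Python sketch (our illustration, not code from any of the cited works) lists the (sign, exponent, fraction) field widths of the four formats and decodes a normal number according to the formula above; subnormals, infinities, and NaN patterns are deliberately omitted.

```python
# Field widths (sign, exponent, fraction) of the formats in Figure 1.
FORMATS = {
    "fp32":     (1, 8, 23),
    "fp16":     (1, 5, 10),
    "bfloat16": (1, 8, 7),
    "tf32":     (1, 8, 10),   # 19 bits in total
}

def decode(bits: int, fmt: str) -> float:
    """Decode a normal number: v = (-1)^s * 1.f * 2^(e - bias)."""
    _, exp_bits, frac_bits = FORMATS[fmt]
    bias = (1 << (exp_bits - 1)) - 1
    s = bits >> (exp_bits + frac_bits)
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    f = bits & ((1 << frac_bits) - 1)
    return (-1.0) ** s * (1.0 + f / (1 << frac_bits)) * 2.0 ** (e - bias)

# 0x3F800000 is 1.0 in fp32; its top 16 bits (0x3F80) are 1.0 in bfloat16.
print(decode(0x3F800000, "fp32"), decode(0x3F80, "bfloat16"))
```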
Authors in [5] convey the increasing use of 16-bit floating-point formats in machine learning applications. The two formats studied are the half-precision format from IEEE-754 and the bfloat16 format from Google. Since these data types entail fewer electronic components, the consequences are lower latency, lower energy usage, and reduced communication costs. The problem addressed relates to the need of scientists to know the properties of 16-bit data types and to understand the results that arithmetic operations on them can produce. The article was therefore oriented to solving LU factorization, computing the harmonic series 1 + 1/2 + 1/3 + ..., solving an ordinary differential equation using Euler's method, and other operations. The test restrictions implied rounding after each product, and one degree of freedom was provided by indicating whether or not subnormal data were used in the tests. In conclusion, rounding after each product and using a data type with more bits to hold subnormal numbers reduce the errors in the calculations.
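One of the listed experiments is easy to reproduce in software. The loop below (a simplification, not the authors' test harness) sums the harmonic series in IEEE half precision and reports where the partial sums stagnate because 1/n falls below half an ulp of the running total:

```python
import numpy as np

s, n = np.float16(0), 1
while True:
    t = np.float16(s + np.float16(1) / np.float16(n))
    if t == s:                       # adding 1/n no longer changes the sum
        break
    s, n = t, n + 1
print(f"fp16 harmonic sum stagnates at n = {n}: {s}")
```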
In [6], the authors presented a new definition of floating-point data. This new data type requires one bit for the sign, a 5-bit exponent, and a 3-bit fraction. Since three bits cannot hold enough fractional information, another floating-point definition was used for internal operations. This second format requires 14 bits to store trailing bits and thus preserve some accuracy in the result. This article is of special interest because of the formulations used to calculate the trailing bits needed to store the fraction for internal operations. The formulations are based on the number of multiplications that must be performed when evaluating a neural network. The advantage of this hybrid data format, according to the authors, is that no additional regularization or normalization of the input data is required. A comment in this article implies that increasing the number of operands will produce a greater deviation from the expected result.
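The division of labor between a narrow storage format and a wider internal format can be sketched as follows; the rounding helper, the vector values, and the exact use of the widths here are our illustration of the idea (exponents are left unbounded for simplicity), not the formulations of [6].

```python
import math

def round_to_frac_bits(x: float, frac_bits: int) -> float:
    # Round x to a significand with `frac_bits` fractional bits
    # (nearest), leaving the exponent unbounded for simplicity.
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (frac_bits + 1)    # +1 because frexp's mantissa is < 1
    return math.ldexp(round(m * scale) / scale, e)

# Products between 3-fraction-bit operands, accumulated with 14 trailing
# bits, echoing the storage/internal split described above.
a = [0.8, -0.35, 0.6, 0.1]
b = [0.25, 0.9, -0.4, 0.7]
acc = 0.0
for x, y in zip(a, b):
    p = round_to_frac_bits(x, 3) * round_to_frac_bits(y, 3)
    acc = round_to_frac_bits(acc + p, 14)
print(acc)
```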
In [7], a study of three floating-point adders was carried out. Each one has a lower latency than the previous one, but with a resource consumption that grows as the latency shrinks. The product of this work also offers the possibility of defining the number of bits of the fields that make up the floating-point data.
In [8], the author implements a fusion of the multiplication and accumulation circuits following the linear form of the IEEE-754 standard to save resources in the FPGA.
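The numerical point of fusing the two circuits is that the product is not rounded before the accumulation. The Python snippet below (our emulation using exact rational arithmetic, not the design of [8]) shows a case where the single final rounding of a fused unit preserves a result that separately rounded operations lose:

```python
from fractions import Fraction

def fused_mul_add(a: float, b: float, c: float) -> float:
    # Emulate a fused multiply-add: keep a*b exact, round only the sum.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a, b, c = 1.0 + 2**-30, 1.0 - 2**-30, -1.0
print(a * b + c)               # 0.0: the product 1 - 2**-60 rounds to 1.0 first
print(fused_mul_add(a, b, c))  # -2**-60: survives the single rounding
```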
To reduce calculation times, it is necessary to characterize multiplier units by latency, power, and size. In [9], three circuits were combined to make up a multiplier: a Booth model generates the partial products, a Wallace tree adds these partial products, and a Kogge-Stone circuit operates on the exponents, all within a four-stage pipeline synthesized for a Xilinx Spartan board. Tests were performed, and the results show a delay of 29.72 ns; when the design is pipelined, the latency is reduced to 7.92 ns.
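Of the three circuits, the Kogge-Stone adder is the simplest to sketch in software. The following Python model (a bit-level illustration, not the synthesized design of [9]) combines generate/propagate pairs over log2(width) levels instead of rippling a carry:

```python
def kogge_stone_add(a: int, b: int, width: int = 16) -> int:
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]  # propagate
    s = p[:]                      # original propagate bits give the sum
    d = 1
    while d < width:              # log2(width) prefix-combination levels
        g, p = (
            [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(width)],
            [p[i] & p[i - d] if i >= d else p[i] for i in range(width)],
        )
        d *= 2
    carry = [0] + g[:-1]          # carry into bit i = group generate of 0..i-1
    return sum((s[i] ^ carry[i]) << i for i in range(width))

assert kogge_stone_add(29, 43) == 72
```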
In [10], it was considered that current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication. GEMM algorithms map these heavy matrices to GPUs (regular parallelism, high TFLOP/s), so GPUs are widely used to accelerate DNNs. The research focused on showing that newer FPGAs (Arria 10 and Stratix 10) can overcome GPUs by providing thousands of arithmetic circuits (more than 5,000 32-bit floating-point units) and high-bandwidth memory blocks (M20K). The Stratix 10 FPGA stands out because it offers 5.4x better performance than the Titan X Pascal GPU. A disadvantage of GPUs is that the data type is always the same; in contrast, an FPGA can offer a custom data type with variable throughput. This means better power efficiency and a superior computation rate compared to the GPU.
In [11], a high-speed IEEE-754-compliant 32-bit floating-point arithmetic unit designed in VHDL was presented, and the addition, subtraction, multiplication, and division operations were tested and successfully verified on Xilinx tools. SIMULINK-MATLAB was used to test the performance of the proposed implementations of the arithmetic operations. The article provides complete flowcharts of the arithmetic operations together with written descriptions of those flowcharts.
The 32-bit PowerPC 440 (PPC440) was a processor that had a soft version in Xilinx FPGAs, and some scientists still use this processor for research purposes. The floating-point unit required by the processor can be a custom implementation, or the one provided in the FPGA can be used. In [12], the dilemma of choosing between these floating-point units was discussed. In the Virtex-5, the soft floating-point unit has a lower latency than the hard floating-point circuit; however, the soft unit consumes more FPGA resources. To justify the use of soft data formats, a comparison was made between implementations of the PowerPC processor with the soft floating-point unit and with the floating-point circuit provided in the FPGA.
What about the effects of quantizing input data to fixed or floating point? Chapter 3 of [13] contains a probabilistic treatment of the rounding effects that arise when input data must be expressed in a fixed- or floating-point data type. The studies were carried out for the quantization of real numbers and for the result of multiplying two data, since this arithmetic operation produces a result with more fractional bits. The deviations between real numbers and quantized values were treated as a stationary, ergodic, uniformly distributed random variable. The formulations in the text can be applied to data types with any number of fractional bits. The chapter is mainly oriented to FIR and IIR filters, but it is possible to extend the formulations to neural networks.
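The uniform-noise model is easy to check numerically. The simulation below (our sketch; the 8-bit fractional grid is an arbitrary choice) compares the empirical variance of the rounding error against the theoretical q^2/12 of a uniform distribution on [-q/2, q/2]:

```python
import numpy as np

rng = np.random.default_rng(0)
frac_bits = 8
q = 2.0 ** -frac_bits                   # quantization step
x = rng.uniform(-1.0, 1.0, 1_000_000)   # stand-in for real input data
err = np.round(x / q) * q - x           # round-to-nearest quantization error
print(err.var(), q * q / 12)            # empirical vs. theoretical variance
```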
In 2017, Gustafson et al. [14] proposed a new general-purpose floating-point representation called posits. Posits are claimed to have a wider dynamic range, to be more accurate, and to contain fewer unused (e.g., Not-a-Number) states than IEEE-754 formats. An OpenCL prototype linear algebra library based on posits was created; this prototype implements the three most common arithmetic operations on FPGAs. The tests quantified the performance achieved by the FPGA implementation of posits and the performance achieved by posits emulated with a library written in C++, and the comparison came out in favor of the FPGA implementation. It is worth mentioning that the arithmetic operations were separated into operational units working in pipeline mode; the pipeline consists of 30 stages, with which the authors achieve operating frequencies of hundreds of MHz.
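For readers unfamiliar with the format, the sketch below decodes a posit bit pattern into its sign, regime, exponent, and fraction fields; it is our scalar illustration (default width n = 8 and es = 1 chosen arbitrarily), not the pipelined OpenCL units of [14].

```python
def decode_posit(bits: int, n: int = 8, es: int = 1) -> float:
    """Decode an n-bit posit with es exponent bits."""
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")              # NaR, the single Not-a-Real state
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & ((1 << n) - 1)  # two's complement of negatives
    # Regime: run of identical bits after the sign, ended by the opposite bit.
    first, run, i = (bits >> (n - 2)) & 1, 0, n - 2
    while i >= 0 and ((bits >> i) & 1) == first:
        run, i = run + 1, i - 1
    k = run - 1 if first else -run
    i -= 1                               # skip the terminating regime bit
    exp = 0
    for _ in range(es):                  # missing exponent bits read as 0
        exp <<= 1
        if i >= 0:
            exp |= (bits >> i) & 1
            i -= 1
    frac, weight = 1.0, 0.5              # hidden leading 1
    while i >= 0:
        frac += weight * ((bits >> i) & 1)
        weight, i = weight / 2, i - 1
    return sign * 2.0 ** (k * (1 << es) + exp) * frac

print(decode_posit(0x40), decode_posit(0x60), decode_posit(0xC0))  # 1.0 4.0 -1.0
```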
In [15], the authors designed their own 16-bit floating-point algorithms for addition and multiplication. They took advantage of a specialized fixed-point library to increase the operating frequency of the arithmetic circuits; it is worth mentioning that fixed-point libraries take advantage of circuits already implemented in the FPGA. The arithmetic circuits generated after compilation for the FPGA work in a seven-stage pipeline, reaching an operating frequency of 793 MHz.