SenseTime jointly proposes an FPGA-based Winograd algorithm: reducing algorithm complexity and improving CNN performance on FPGAs

The SenseTime Algorithm Platform team and the Peking University Energy Efficiency Lab have jointly proposed a fast, FPGA-based Winograd algorithm that significantly reduces algorithmic complexity and improves CNN performance on FPGAs. The experiments in the paper used CNN architectures that were state of the art as early as 2016, and achieved the best performance and energy efficiency reported for FPGA acceleration.

Abstract

In recent years, convolutional neural networks (CNNs) have been applied ever more widely in computer vision tasks. With their high performance, low power consumption, and reconfigurability, FPGAs have attracted attention as effective hardware accelerators for CNNs. However, previous FPGA solutions based on the conventional convolution algorithm are often limited by the FPGA's compute capacity (e.g., the number of DSPs). This paper presents a fast Winograd algorithm that significantly reduces algorithmic complexity and improves CNN performance on FPGAs. We first propose a novel architecture for implementing the Winograd algorithm on an FPGA. Our design uses a line buffer structure to efficiently reuse feature-map data across different tiles, and we architect an efficient Winograd PE engine, launching multiple PEs through parallelization. At the same time, there is a complex design space to explore. We propose analytical models to predict resource usage and performance, and use them to guide fast design space exploration. Experiments using state-of-the-art CNNs show that we achieve the best performance and energy efficiency on FPGAs. On the Xilinx ZCU102 platform, we achieve an average of 1006.4 GOP/s over the convolutional layers and 854.6 GOP/s overall for AlexNet, and an average of 3044.7 GOP/s over the convolutional layers and 2940.7 GOP/s overall for VGG16.

Introduction

Deep convolutional neural networks (CNNs) have achieved excellent performance on many computer vision tasks, including image classification, object detection, and semantic segmentation [1, 2]. The high accuracy of CNNs comes at the cost of enormous computational complexity, since every region of the feature map must be evaluated exhaustively [3, 4]. To cope with this computational pressure, researchers use hardware accelerators such as GPUs, FPGAs, and ASICs to accelerate CNNs [5–17]. Among these, FPGAs are an effective solution thanks to their high performance, low power consumption, and reconfigurability. More importantly, high-level synthesis (HLS) from C or C++ significantly lowers the FPGA programming barrier and improves productivity [18–20].

A CNN usually contains multiple layers, with the output feature map of each layer serving as the input feature map of the next. Previous studies have found that the computation of current state-of-the-art CNNs is dominated by the convolutional layers [6, 7]. Under the conventional convolution algorithm, each element of the output feature map is computed independently through a long sequence of multiply-accumulate operations. Although earlier FPGA solutions based on the conventional algorithm achieved initial success [5–9, 11], a more efficient algorithm would make such solutions more efficient still. This paper shows how convolution using the Winograd algorithm [21] can greatly reduce algorithmic complexity and improve CNN performance on FPGAs. The Winograd algorithm generates a whole tile of output-feature-map elements at once, exploiting the structural similarity among them. This reduces the number of multiplications and hence the complexity of the algorithm. Prior work has shown that the fast Winograd algorithm is well suited to deriving efficient algorithms for CNNs with small filters [16].
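As a concrete illustration of how this structural similarity cuts multiplications, consider the one-dimensional minimal filtering case F(2, 3), which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 required by direct convolution. The following C sketch is illustrative only (the function name and test values are ours, not the paper's):

```c
#include <stdio.h>

/* 1D Winograd minimal filtering F(2,3): two outputs of a 3-tap filter
 * with 4 multiplications instead of the direct method's 6. */
static void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    /* Filter-side transform; in a CNN this is precomputed once per filter. */
    float G0 = g[0];
    float G1 = 0.5f * (g[0] + g[1] + g[2]);
    float G2 = 0.5f * (g[0] - g[1] + g[2]);
    float G3 = g[2];
    /* The 4 multiplications. */
    float m1 = (d[0] - d[2]) * G0;
    float m2 = (d[1] + d[2]) * G1;
    float m3 = (d[2] - d[1]) * G2;
    float m4 = (d[1] - d[3]) * G3;
    y[0] = m1 + m2 + m3;   /* = d0*g0 + d1*g1 + d2*g2 */
    y[1] = m2 - m3 - m4;   /* = d1*g0 + d2*g1 + d3*g2 */
}

int main(void) {
    float d[4] = {1, 2, 3, 4}, g[3] = {1, 0, -1}, y[2];
    winograd_f2_3(d, g, y);
    printf("y = [%g, %g]\n", y[0], y[1]);  /* prints y = [-2, -2] */
    return 0;
}
```

Nesting this 1D algorithm with itself yields the 2D tile algorithm F(m × m, r × r) used for convolutional layers.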

More importantly, the current trend in CNNs is toward deep topologies with small filters. For example, all convolutional layers of AlexNet except the first use 3 × 3 and 5 × 5 filters [3], and VGG16 uses only 3 × 3 filters [22]. This creates an opportunity to implement CNNs efficiently with the Winograd algorithm. However, while using the Winograd algorithm on an FPGA is attractive, several issues remain. First, the design must not only minimize memory bandwidth requirements but also match the throughput of the compute engine to that of the memory. Second, mapping the Winograd algorithm onto an FPGA opens a large design space, and it is difficult to predict which design choices will improve performance and which will hurt it.

This paper designs a line buffer structure to cache feature maps for the Winograd algorithm, allowing different tiles to reuse data as the convolution proceeds. The computation of the Winograd algorithm mixes general matrix multiplication (GEMM), for the matrix transforms, with element-wise matrix multiplication (EWMM). We then design an efficient Winograd PE and launch multiple PEs through parallelization. Finally, we develop analytical models to estimate resource usage and predict performance, and use them to explore the design space and determine the optimal design parameters.
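To make the mix of GEMM and EWMM concrete, the sketch below computes one F(2 × 2, 3 × 3) Winograd tile in plain C using the standard transform matrices B^T, G, and A^T. It models the arithmetic a Winograd PE performs; it is not the authors' HLS code, and the helper names are ours:

```c
#include <stdio.h>

/* C = A (ra x ca) times B (ca x cb), all row-major. */
static void matmul(const float *A, const float *B, float *C,
                   int ra, int ca, int cb) {
    for (int i = 0; i < ra; i++)
        for (int j = 0; j < cb; j++) {
            float acc = 0;
            for (int k = 0; k < ca; k++) acc += A[i*ca + k] * B[k*cb + j];
            C[i*cb + j] = acc;
        }
}

/* Standard Winograd transform matrices for F(2x2, 3x3). */
static const float BT[4][4] = {{1,0,-1,0},{0,1,1,0},{0,-1,1,0},{0,1,0,-1}};
static const float G[4][3]  = {{1,0,0},{.5f,.5f,.5f},{.5f,-.5f,.5f},{0,0,1}};
static const float AT[2][4] = {{1,1,1,0},{0,1,-1,-1}};

/* One tile: d is a 4x4 input tile, g a 3x3 filter, y the 2x2 output tile.
 * Y = AT * [ (G g G^T) .* (BT d B) ] * A                                 */
static void winograd_tile(const float d[4][4], const float g[3][3],
                          float y[2][2]) {
    float U[4][4], V[4][4], M[4][4], t43[4][3], t44[4][4], t24[2][4];
    /* GEMM part 1: filter transform U = G g G^T (precomputable offline). */
    matmul(&G[0][0], &g[0][0], &t43[0][0], 4, 3, 3);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float acc = 0;
            for (int k = 0; k < 3; k++) acc += t43[i][k] * G[j][k];
            U[i][j] = acc;
        }
    /* GEMM part 2: input transform V = BT d B. */
    matmul(&BT[0][0], &d[0][0], &t44[0][0], 4, 4, 4);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float acc = 0;
            for (int k = 0; k < 4; k++) acc += t44[i][k] * BT[j][k];
            V[i][j] = acc;
        }
    /* EWMM: 16 multiplications replace the 36 of direct 3x3 convolution. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) M[i][j] = U[i][j] * V[i][j];
    /* GEMM part 3: output transform Y = AT M A. */
    matmul(&AT[0][0], &M[0][0], &t24[0][0], 2, 4, 4);
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            float acc = 0;
            for (int k = 0; k < 4; k++) acc += t24[i][k] * AT[j][k];
            y[i][j] = acc;
        }
}

int main(void) {
    float d[4][4], g[3][3] = {{1,0,-1},{2,0,-2},{1,0,-1}}, y[2][2];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) d[i][j] = (float)(i + j);
    winograd_tile(d, g, y);
    printf("%g %g / %g %g\n", y[0][0], y[0][1], y[1][0], y[1][1]);
    /* For this linear ramp and Sobel-like filter, all four outputs are -8. */
    return 0;
}
```

In a hardware PE the filter transform U is precomputed, and the per-element products of the EWMM stage are what map onto the FPGA's DSP slices.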

The contributions of this paper are as follows:

An architecture that efficiently implements CNNs with the Winograd algorithm on FPGAs, using a line buffer structure, a Winograd PE built from general and element-wise matrix multiplication, and PE parallelization.

Analytical resource and performance models, used to explore the design space and determine the optimal parameters (a simplified sketch of such a model follows this list).

Rigorous validation of the technique on state-of-the-art CNNs such as AlexNet and VGG16.
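As a rough illustration of what these models look like, the sketch below estimates DSP usage and throughput from the tile size n and the channel parallelization factors Tm and Tn, then scans a toy design space under a DSP budget. The formulas are simplified assumptions in the spirit of the paper's models, not its actual equations:

```c
#include <math.h>
#include <stdio.h>

/* Rough DSP estimate: one multiplier per element of the n x n transformed
 * tile, replicated across Tn input-channel x Tm output-channel pairs. */
static double estimate_dsps(int n, int Tm, int Tn) {
    return (double)Tm * Tn * n * n;
}

/* Rough throughput (GOP/s) for an H x W layer with N input and M output
 * channels and a K x K filter, assuming one tile per cycle per PE group
 * and perfect overlap of computation with data transfer. */
static double estimate_gops(int H, int W, int N, int M, int K,
                            int n, int Tm, int Tn, double freq_mhz) {
    int m = n - K + 1;                                 /* output tile size */
    double tiles  = ceil((double)H / m) * ceil((double)W / m);
    double cycles = tiles * ceil((double)N / Tn) * ceil((double)M / Tm);
    double ops    = 2.0 * H * W * N * M * K * K;       /* multiplies + adds */
    return ops * freq_mhz / cycles / 1e3;
}

int main(void) {
    /* Toy DSE over {n, Tm, Tn} for one hypothetical 3x3 layer, mirroring
     * what the DSEE does across the whole network. 2520 is the ZCU102's
     * DSP count; 200 MHz is an assumed clock. */
    int bn = 0, bm = 0, bt = 0;
    double best = 0;
    for (int n = 4; n <= 8; n += 2)
        for (int Tm = 1; Tm <= 16; Tm *= 2)
            for (int Tn = 1; Tn <= 16; Tn *= 2) {
                if (estimate_dsps(n, Tm, Tn) > 2520) continue;
                double g = estimate_gops(56, 56, 128, 128, 3, n, Tm, Tn, 200);
                if (g > best) { best = g; bn = n; bm = Tm; bt = Tn; }
            }
    printf("best {n,Tm,Tn} = {%d,%d,%d}: %.1f GOP/s\n", bn, bm, bt, best);
    return 0;
}
```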


Figure 1: Comparison of the conventional convolution algorithm and the Winograd convolution algorithm. We assume the stride S of the Winograd algorithm is 1.

Architecture design

Figure 2: Architecture diagram

Figure 2 shows the FPGA architecture for the convolutional layers based on the Winograd algorithm. The researchers identified data-reuse opportunities between the feature maps of adjacent tiles, which a line buffer implements naturally. As shown in Figure 1, the input feature map has multiple channels; each row of the line buffer stores the same row across all channels. The Winograd PE reads its data from the line buffer: given an n × n input tile, it generates an m × m output tile. The researchers launch an array of PEs by parallelizing the processing across multiple channels, and use double buffering to overlap data transfer with computation. All input data (input feature maps, filters) initially reside in external memory; input and output feature maps are streamed to and from the FPGA through FIFOs. However, total filter size grows significantly with network depth, making it impractical to load all filters into on-chip memory. In this design, the researchers divide the input and output channels into multiple groups, each containing only a portion of the filters, and load filters group by group as needed. For ease of exposition, the following assumes a single group.
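A minimal software model of the line buffer's row-reuse scheme is sketched below for a single channel with hypothetical sizes (n = 4, m = 2). It demonstrates the mechanism only: consecutive vertical tile positions share n − m rows, so only m new rows per channel are fetched from external memory at each step:

```c
#include <stdio.h>
#include <string.h>

/* Software model of the line buffer's row reuse (an assumed structure
 * consistent with the description, not the authors' code). */
enum { H = 8, W = 8, n = 4, m = 2 };

static float image[H][W];     /* stand-in for one input channel in DRAM */
static float line_buf[n][W];  /* on-chip buffer holding n rows          */

int main(void) {
    int fetched = 0;
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j++) image[i][j] = (float)(i * W + j);

    /* Initial fill: load the first n rows. */
    memcpy(line_buf, image, sizeof line_buf);
    fetched += n;

    for (int r = m; r + n <= H; r += m) {
        /* Shift the n - m overlapping rows up... */
        memmove(line_buf[0], line_buf[m], (n - m) * W * sizeof(float));
        /* ...and fetch only m new rows from external memory. */
        memcpy(line_buf[n - m], image[r + (n - m)], m * W * sizeof(float));
        fetched += m;
        /* Here each n x n window sliding horizontally across line_buf
         * would be sent to the Winograd PE array. */
    }
    int positions = (H - n) / m + 1;
    printf("rows fetched: %d (vs %d without reuse)\n", fetched, positions * n);
    return 0;
}
```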

Figure 3: Winograd PE design illustration

Automated tool flow

The researchers designed an automated tool flow that maps a CNN onto the FPGA, as shown in Figure 4. The flow includes a design space exploration engine (DSEE). The CNN structure is described in Caffe prototxt [24]; the FPGA configuration parameters include memory bandwidth, number of DSPs, logic cells, and on-chip memory capacity. The output of the DSEE is the optimal solution {n, Tm, Tn}. In step 2, based on this solution, a code generation engine (CGE) automatically generates a Winograd convolution function. The function describes the entire accelerator structure, including the line buffer, buffer management, and the Winograd PEs, and is emitted as HLS-compatible C code, with compiler directives such as memory-partitioning factors, the loop-unrolling factors Tm and Tn, and FIFO interfaces inserted into it. In step 3, the researchers use the Xilinx HLS tool to synthesize the code to register-transfer level. Finally, they use the Xilinx SDSoC (software-defined system-on-chip) toolchain to generate the bitstream.

Figure 4: Automated tool flow
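The fragment below suggests what a piece of the generated, HLS-compatible C might look like for the EWMM stage of the PE array. The pragma set (pipelining, unrolling by Tm and Tn, array partitioning) follows standard Xilinx HLS practice; the names and example values are ours rather than the CGE's actual output:

```c
/* Hypothetical CGE output for the EWMM stage of the PE array. TM and TN
 * stand in for the Tm/Tn factors chosen by the DSEE (example values). */
#define TM 4
#define TN 8

void winograd_ewmm(const float U[TM][TN][4][4],  /* transformed filters   */
                   const float V[TN][4][4],      /* transformed input     */
                   float M[TM][4][4]) {          /* accumulators (zeroed) */
#pragma HLS ARRAY_PARTITION variable=U complete dim=2
#pragma HLS ARRAY_PARTITION variable=V complete dim=1
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
#pragma HLS PIPELINE II=1
            for (int tm = 0; tm < TM; tm++) {
#pragma HLS UNROLL
                float acc = 0;
                for (int tn = 0; tn < TN; tn++) {
#pragma HLS UNROLL
                    acc += U[tm][tn][i][j] * V[tn][i][j];
                }
                M[tm][i][j] += acc;
            }
        }
}
```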

Experimental evaluation

Table 1: Design parameters

Table 2: Performance comparison of AlexNet

Table 3: Performance comparison of VGG

Table 4: GPU platform comparison
