M.Sc. Student: Liss Natan
Subject: Neural Network Quantization for Integer-Only Inferencing on an FPGA
Department: Department of Electrical Engineering
Supervisor: Professor Avi Mendelson
Convolutional Neural Networks (CNNs) are very popular in many fields, including computer vision, speech recognition, and natural language processing, to name a few.
Although deep learning achieves groundbreaking performance in these domains, the networks used are computationally demanding and far from real-time even on a GPU, which is not power efficient and therefore does not suit low-power systems such as mobile devices.
To overcome this challenge, solutions have been proposed that quantize the weights and activations of these networks, which accelerates runtime significantly. This acceleration comes at the cost of higher error. In this work we survey several approaches to tackling these challenges. First, we introduce noise injection as a means of mitigating quantization error, investigating both uniform and non-uniform noise injection and quantization. Second, we clamp both weights and activations before quantization as a means of dynamic-range reduction; these clamp values are calculated before model training from the model's weight and activation statistics. Third, we use a gradual quantization training scheme to achieve smoother fine-tuning from a full-precision pre-trained model, which ultimately improves the model's convergence. These techniques lead to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification and ISP regression with architectures such as ResNet-18/34/50, with weights and activations as low as 3 bits.

Since FPGAs are a natural device choice when dealing with custom data types and low power, we demonstrate FPGA implementations for both regression and classification tasks while maintaining integer-only arithmetic. We implemented the DeepISP regression network on a single FPGA as a representative of the serial layer-execution model. For the parallel layer-execution model, we implemented binarized AlexNet and ResNet-18 architectures, for classification tasks, on Maxeler's multi-FPGA dataflow system.
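To make the clamping and noise-injection ideas above concrete, the following is a minimal NumPy sketch of symmetric uniform quantization with a pre-computed clamp, plus a training-time noise surrogate for rounding. The percentile-based clamp heuristic and all function names are illustrative assumptions, not the thesis's exact method.

```python
import numpy as np

def clamp_value(x, pct=99.0):
    # Illustrative stand-in for the statistics-based clamp: the thesis computes
    # clamp values from weight/activation statistics before training; here we
    # simply take a high percentile of |x| as the dynamic-range bound.
    return np.percentile(np.abs(x), pct)

def uniform_quantize(x, n_bits, clamp):
    # Symmetric uniform quantization to n_bits after clamping to [-clamp, clamp].
    x = np.clip(x, -clamp, clamp)
    levels = 2 ** (n_bits - 1) - 1      # e.g. 3 levels per sign for 3-bit signed
    scale = clamp / levels
    return np.round(x / scale) * scale  # dequantized values on the integer grid

def quantize_with_noise(x, n_bits, clamp, rng):
    # Training-time surrogate: inject uniform noise of one quantization step
    # instead of hard rounding, so the operation seen by gradients is smoother.
    x = np.clip(x, -clamp, clamp)
    scale = clamp / (2 ** (n_bits - 1) - 1)
    noise = rng.uniform(-0.5, 0.5, size=x.shape) * scale
    return x + noise

rng = np.random.default_rng(0)
w = rng.normal(size=1000)           # toy weight tensor
c = clamp_value(w)                  # clamp chosen from statistics, before training
wq = uniform_quantize(w, 3, c)      # 3-bit weights, as in the low-bit experiments
```

Because the grid values are integer multiples of a single scale, inference can be carried out with integer-only arithmetic, deferring the scale to a final rescaling step.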