1 Introduction
With the development of deep neural networks (DNNs), DNNs have achieved impressive performance in many applications. However, substantial computing resources and memory footprints are required when deeper networks are used to solve various problems. With the rapid development of chip technologies, especially GPUs and TPUs, computational frequency and efficiency have been greatly improved, and most researchers use GPUs as the basic hardware platform for network training due to their excellent acceleration capability. However, for low-power platforms (e.g., mobile phones, embedded devices and smart chips) with limited resources, it is hard to achieve satisfactory performance in industrial applications. As one of the typical methods for model compression and acceleration, model quantization usually quantizes full-precision (32-bit) parameters to low bit-widths (e.g., 8-bit and 4-bit). In the extreme, weights and activation values can be constrained to binary {-1, +1} [18, 40] or ternary {-1, 0, +1} [61] values, which can be computed by bitwise operations. This kind of logic calculation is more suitable for implementation on FPGAs and other service-oriented computing platforms. As implemented in [27], binarized networks achieve a large speedup over CPUs and are faster than GPUs in the peak condition.
In recent years, many methods [8, 22, 26, 30, 46, 58] have been proposed to improve the performance of low-precision models. However, the bit-precisions of most quantization models are set manually based on experience, and in general all layers share the same quantization precision. Some studies [5, 7, 10, 11] show that different layers have different sensitivities to quantization. Therefore, mixed-precision quantization [36, 48, 6] can achieve better performance by matching the characteristics of the network. Besides, some recent smart chips also support mixed-precision DNN inference, e.g., Apple A12 Bionic [12], Nvidia Turing GPU [37], BitFusion [44] and BISMO [49].
With the successful application of Neural Architecture Search (NAS), [6, 33, 50] converted the mixed-precision quantization problem into a NAS task and used reinforcement learning or gradient-based methods to search for an ideal solution. Differentiable neural architecture search methods [51, 55] search over a supernet that contains all candidate architectures, whose feature maps must all reside in memory during searching. Fig. 1 (a) shows the GPU memory usage of searching on multiple models based on the DARTS [32] framework. Because the search space grows exponentially with the number of network layers, the searching process requires a lot of hardware resources and is very time-consuming. In order to relieve the pressure on hardware resources and speed up the searching process, some methods [51, 14] restricted the search to a specific small search space, and BP-NAS [55] was proposed to search on small datasets and then extend the model to large-dataset tasks. However, these methods rely on an incomplete definition of the mixed-precision quantization task and easily fall into a local optimum. This task actually seeks a balance between the task-dependent accuracy and given constraints (e.g., energy consumption, hardware resources, quantization precision, model size and bitwise operations). HAQ [50] and DNAS [51] incorporated the complexity cost into the loss function and tuned the corresponding balance weight, which usually takes multiple searches to get an appropriate model. HAWQ [11] manually chose the bit precision within a reduced search space, and HAWQ-V2 [10] developed a Pareto-frontier-based method for selecting the exact bit precision. In these methods, it is therefore difficult to control the search direction towards the given constraints to meet deployment requirements.
In order to address the above issues, we propose a novel differentiable sequential single path search (SSPS) method, which can quickly find an ideal mixed-precision model of a specific network (e.g., ResNet-20, 18, 34, 50 and MobileNetV2) satisfying the given constraints (e.g., average weight bit-width and average operation bit-width). The advantages of our method are as follows:

• Save Resources. We propose a novel differentiable single path search cell, where only one candidate is sampled at a time to carry out calculations. That is, it avoids caching all the candidates in memory or having them participate in computation together, thus saving hardware resources. Fig. 1 (a) shows the GPU memory usage of our SSPS method, which is significantly less than that of the DARTS-based method.

• Purposeful Search. We use the average weight bit-width and average operation bit-width to measure the given constraints (e.g., model size and bit operations) and introduce them into our constrained loss function. By penalizing quantized candidates that deviate from the objective constraints to guide the search direction, we greatly alleviate the problem that hyper-parameters need to be adjusted many times to get a satisfactory solution.

• Fast Search. We use entropy to evaluate the selection certainty of each search cell, and determine the quantization bit-precision of cells sequentially during the searching process. Therefore, the complexity is reduced exponentially as layer quantization bit-precisions are determined, and the searching process is significantly accelerated. Our method takes less than 7 hours on 4 V100 GPUs to complete a search for ResNet-18 on ImageNet, which is faster than DNAS (40 GPU-hours).

• State-of-the-art Results. With our proposed techniques applied to a variety of models (e.g., ResNet-20, 18, 34, 50 and MobileNetV2) and tasks (e.g., classification and detection), the mixed-precision quantization models we search are clearly better than their counterparts under similar constraints.
2 Related Works
2.1 Model Quantization
Model quantization compresses and accelerates a model by replacing the full-precision weights or activation values in DNNs with fixed-precision values. [18, 40] used bitwise operations (e.g., xnor and bitcount) to effectively compute matrix multiplications and achieved outstanding efficiency and performance. To further improve the representation capability, [46, 60] used multi-bit quantization to approximate the full-precision weights and activation values. Most multi-bit quantizers can be categorized into a few modalities, such as uniform quantizers [8, 30, 50], logarithmic quantizers [35] and adaptive quantizers [22, 26]. According to the quantization granularity of DNNs, model quantization can be divided into network-wise quantization [46, 19], layer-wise quantization [53, 50] and kernel-wise quantization [33, 57]. Mixed-precision quantization [6, 36, 48] can match the sensitivity of each layer in DNNs with an appropriate combination of quantization bit-widths, and it can achieve better results under the same constraints. HAQ [50] added feedback on acceleration, evaluated by a hardware simulator, to the training loop and used reinforcement learning to determine the quantization strategy automatically. [14, 51] converted the quantization task into a NAS problem and optimized the network weights and architecture parameters by back-propagation. However, the pipeline of these methods behaves like DARTS [32] at the beginning and also requires high-end hardware resources; [51] spent 40 GPU (V100) hours to complete the search for ResNet-18 in a specific small search space. By generating distilled data, ZeroQ [5] can fine-tune models with arbitrary quantization precisions without using any training or validation datasets. In [56, 16], the trained model can match a variety of quantization precisions without any fine-tuning or calibration, which leads to some performance loss.
2.2 Neural Architecture Search
The emergence of Neural Architecture Search (NAS) breaks the bottleneck of designing neural architectures manually and achieves better performance than human-invented architectures on many tasks, such as image classification [64, 31], object detection [64], semantic segmentation [7], and language modeling [63, 39, 9]. The success of NAS requires a variety of search spaces and huge amounts of computing resources, which makes the optimization of the network a difficult problem. Commonly used optimization methods are mainly divided into three types: reinforcement learning [2, 64, 59], evolutionary algorithms [42, 41], and gradient-based methods [32, 52, 1]. Besides searching for computation operators, NAS methods also search for the width and spatial resolution of each block in the network structure [13]. NAS can also be used for channel pruning [17] or to search for filter numbers [47]. In [4], the network latency and sparsity are incorporated into the search objective, and architectures can be searched for different tasks (e.g., CIFAR-10 and ImageNet) and different hardware platforms (e.g., GPUs, CPUs and mobile phones).
3 Method
In this paper, we model the mixed-precision quantization task as a NAS problem. The goal of this task is to find an ideal mixed-precision quantization model for a specific network under some given constraints to meet real-world requirements. Specifically, the learning procedure of the architectural parameters is formulated as the following bilevel optimization problem:
\min_{\alpha, \beta} \; \mathcal{L}_{val}\left(W^{*}(\alpha, \beta), \alpha, \beta\right) + \lambda \, \mathcal{L}_{C}\left(\alpha, \beta; C_{T}\right)  (1)

\text{s.t.} \quad W^{*}(\alpha, \beta) = \arg\min_{W} \mathcal{L}_{train}\left(W, \alpha, \beta\right)  (2)

where \alpha \in \mathcal{A} and \beta \in \mathcal{B} represent the architecture parameters for searching the quantization of activation values and weights, and \mathcal{A} and \mathcal{B} denote the corresponding architecture spaces. W and W^{*} denote the supernet parameters and the selected model weights, and \mathcal{L}_{val} and \mathcal{L}_{train} represent the task-dependent losses (e.g., the cross-entropy loss) on the validation and training datasets, respectively. \mathcal{L}_{C} measures the constraint loss of the quantization network determined by \alpha and \beta, \lambda is a hyper-parameter, and C_{T} is the target vector of the given constraints (e.g., average weight bit-width and average operation bit-width). Fig. 2 shows the framework of our proposed SSPS method.
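As a concrete illustration, the bilevel problem in Eqs. (1)-(2) can be approximated by first-order alternating updates in the style of DARTS-like methods. The sketch below is not our released code; the module and variable names (supernet, arch_params, constrained_loss) are illustrative assumptions.

```python
# A minimal sketch of first-order alternating optimization for Eqs. (1)-(2):
# the supernet weights W are updated on the sub-training split, while the
# architecture parameters (alpha, beta) are updated on the validation split
# together with the constrained loss L_C weighted by lambda.
import torch

def search_epoch(supernet, arch_params, train_loader, val_loader,
                 w_optimizer, arch_optimizer, constrained_loss, lam=1.0):
    ce = torch.nn.CrossEntropyLoss()
    for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
        # Lower level (Eq. 2): update supernet weights W on the training split.
        w_optimizer.zero_grad()
        ce(supernet(x_tr), y_tr).backward()
        w_optimizer.step()

        # Upper level (Eq. 1): update architecture parameters on the validation
        # split, adding the constraint penalty L_C (see Section 3.3).
        arch_optimizer.zero_grad()
        loss = ce(supernet(x_val), y_val) + lam * constrained_loss(arch_params)
        loss.backward()
        arch_optimizer.step()
```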
In order to search for the ideal mixed-precision quantization model effectively and quickly, we propose the SSPS method and introduce the given constraints into its loss function to guide the searching process. In this section, we first describe the search space, and then propose a differentiable single path search cell that composes a fully differentiable search supernet. Finally, we describe how we use the average weight bit-width and average operation bit-width to evaluate the given constraints and introduce them into our loss function to guide the searching process. The searching process itself is described in the next section.
3.1 Weight and Activation Search Cells
Many recently proposed NAS methods [32, 4, 24] focus on cell search (i.e., normal cells and reduction cells). Once the cell architectures are confirmed, many copies of these discovered cells are stacked to make up a deep neural network. The purpose of our task is instead to find a mixed-precision quantization of a specific network (e.g., ResNet and MobileNet), in which different layers have different quantization precisions. As shown in Fig. 2, each layer operation contains two search cells (i.e., a weight search cell and an activation value search cell).
Suppose we use x_l and x_{l+1} to represent the input and output data of the l-th layer, w_l denotes its weights, and f_l denotes the calculation operation (e.g., a fully-connected or convolutional layer). The computation of the layer can be formulated as follows:

x_{l+1} = f_l\left(\hat{w}_l, \hat{x}_l\right)  (3)

where \hat{w}_l denotes the quantized values of w_l at the bit-width selected by the weight search cell, and \hat{x}_l denotes the quantized values of x_l at the bit-width selected by the activation value search cell. The general search space for the l-th search cell is as follows:

\mathcal{S}_l = \{b_1, b_2, \dots, b_K, \text{FP16}, \text{FP32}\}  (4)

where the integers b_k represent fixed-point bit-precisions, FP16 denotes the half-precision floating-point format and FP32 denotes the single-precision floating-point format. Thus, the search space size of a whole layer is |\mathcal{S}_l|^2. Obviously, the search space of this task is exponential in the number of model layers L, i.e., |\mathcal{S}|^{2L}.
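To make Eq. (3) concrete, the toy sketch below shows how a convolutional layer could wrap a weight search cell and an activation search cell, both treated as callables that return a quantized tensor and the chosen bit-width (one possible cell is sketched in the next subsection). The class and attribute names are assumptions, not our exact implementation.

```python
# A toy sketch of Eq. (3): each layer owns a weight search cell and an activation
# search cell; the layer operation f_l runs on the quantized tensors they return.
import torch.nn as nn
import torch.nn.functional as F

class MixedPrecisionConv(nn.Module):
    def __init__(self, conv: nn.Conv2d, weight_cell, act_cell):
        super().__init__()
        self.conv = conv                # f_l: the original convolution
        self.weight_cell = weight_cell  # returns (quantized weights, chosen bit)
        self.act_cell = act_cell        # returns (quantized activations, chosen bit)

    def forward(self, x):
        w_q, self.w_bit = self.weight_cell(self.conv.weight)
        x_q, self.a_bit = self.act_cell(x)
        return F.conv2d(x_q, w_q, self.conv.bias,
                        self.conv.stride, self.conv.padding)
```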
3.2 Differentiable Single Path Search Cell
Because of the huge search space, reinforcement learning techniques and evolutionary algorithms are computationally expensive and very time-consuming. DARTS-based methods need all candidate architectures and their feature maps to reside in memory, so they require multiple GPUs with large memory and a small batch size to search. Fig. 1 (a) shows the GPU memory usage of the DARTS-based method for searching on multiple models.
In order to save hardware resources and speed up the searching process, we propose a differentiable single path search cell to compose the supernet. We take activation value quantization as an example, and the search cell is shown in Fig. 3. The input x denotes the full-precision activation values, and the outputs \hat{x} and b_a denote the quantized values and the selected bit-width, respectively. We introduce the Gumbel-Softmax [21, 34] to control the search strategy. It approximates sampling from a categorical distribution by reparameterization, which provides an efficient way to draw samples from a discrete probability distribution. With this approximation, we can transform the non-differentiable sampling problem into a differentiable computation. Here, we use p to represent the sampling probability vector, whose i-th element p_i is formulated as follows:

p_i = \frac{\exp\left((\alpha_i + g_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left((\alpha_j + g_j)/\tau\right)}  (5)

where \alpha_i is the i-th element of an N-dimensional learnable architecture parameter vector \alpha, g_i is a random variable drawn from the Gumbel distribution (g_i = -\log(-\log(u_i)) with u_i \sim U(0, 1)), and \tau is the temperature coefficient used to control the smoothness of sampling. Therefore, the activation value search cell can be expressed as follows:

h = \text{one\_hot}\left(\arg\max_i p_i\right)  (6)

b_a = \sum_{i=1}^{N} h_i \, \mathcal{S}_i  (7)

\hat{x} = Q_{b_a}(x)  (8)

where h_i is the i-th element of the one-hot vector h, \mathcal{S}_i denotes the i-th element of the search space, and Q_b(\cdot) denotes the b-bit quantization function (e.g., the linear quantizer used in our experiments). From Eq. (7), we can get the selected bit-precision, which will be used as the input of the constrained loss. In general, the argmax function is used to select the most probable index. However, since our goal is to sample from a discrete probability distribution, we cannot back-propagate gradients through the argmax function to optimize \alpha. Here, we use the straight-through estimator (STE) [3] to back-propagate through Eq. (6). During the searching process, the real discrete distribution can be approached by gradually reducing the temperature \tau: the higher the temperature, the smoother the distribution; the lower the temperature, the closer the generated distribution is to a discrete one. At the beginning of the search, sampling is close to uniform; with the decrease of \tau, it becomes probability sampling concentrated on the most likely candidates. The resource saving of our search cell can be seen clearly in Fig. 1 (a).
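A minimal sketch of such a single-path activation search cell is given below, using PyTorch's built-in gumbel_softmax for Eqs. (5)-(6). The candidate bit-widths and the quantize callable are placeholders rather than our exact configuration.

```python
# A simplified sketch of the differentiable single-path search cell in Eqs. (5)-(8).
# A hard one-hot sample is drawn with Gumbel-Softmax, so only ONE candidate bit-width
# is evaluated per forward pass; the straight-through estimator carries gradients
# back to the architecture parameters alpha.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SinglePathActCell(nn.Module):
    def __init__(self, candidates=(2, 3, 4, 6, 8), quantize=None):
        super().__init__()
        self.candidates = candidates   # search space S (illustrative values)
        self.quantize = quantize       # quantize(x, bit) -> quantized tensor
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # arch. params
        self.tau = 5.0                 # temperature, annealed during the search

    def forward(self, x):
        # Eqs. (5)-(6): Gumbel-Softmax probabilities hardened to a one-hot vector;
        # the soft probabilities stay in the graph (straight-through estimator).
        h = F.gumbel_softmax(self.alpha, tau=self.tau, hard=True)
        bits = torch.tensor(self.candidates, dtype=h.dtype, device=h.device)
        bit = (h * bits).sum()                    # Eq. (7): selected bit-precision
        # Eq. (8): only the sampled candidate is quantized and computed.
        x_q = self.quantize(x, int(bit.item()))
        # h.max() equals 1 in value but keeps the path differentiable w.r.t. alpha.
        return h.max() * x_q, bit
```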
3.3 Constrained Loss Function
Besides the task-dependent loss (e.g., the cross-entropy loss), hardware resources, energy consumption, model size and computational complexity are also important factors affecting real-world applications. These factors can be effectively controlled by restricting the quantization precision of weights and activation values [44], and the relevant factors differ across hardware platforms. In order to formulate these factors, we introduce the average weight bit-width and the average operation bit-width to evaluate model size and bitwise operations, which are commonly used to evaluate the given constraints [55, 65]. We introduce both quantities into our constrained loss function to guide the searching process.
Taking the constraint on the target model size as an example, we focus on weight quantization to compress the model. Generally, model parameters are stored in the 32-bit floating-point format; when we quantize the weights, the model size and storage requirements are reduced. Suppose we have a model of L layers, and P_l represents the number of parameters in the l-th layer. The average weight bit-width can be defined as follows:
B_w(\beta) = \frac{\sum_{l=1}^{L} P_l \, b_w^l}{\sum_{l=1}^{L} P_l}  (9)
where \beta denotes the weight architecture parameter vector, and b_w^l is the output of the weight search cell that denotes the selected quantization precision of the l-th layer; the computation of b_w^l is similar to that of b_a, as shown in Eqs. (5)-(7).
The second goal is to constrain the quantization precision of the weights and activation values to achieve a specific computational complexity, which is also one of the important factors affecting industrial applications. Here, we use the average operation bit-width to evaluate the bitwise-operation computational complexity. We use F_l to denote the number of floating-point operations in the l-th layer. The average operation bit-width is related to the architecture parameters \alpha and \beta, and it is formulated as follows:
B_{op}(\alpha, \beta) = \sqrt{\frac{\sum_{l=1}^{L} F_l \, b_w^l \, b_a^l}{\sum_{l=1}^{L} F_l}}  (10)
where \alpha denotes the architecture parameter vector of the activation value search cells, and b_a^l is the output of the activation value search cell that denotes the selected quantization precision of the l-th layer.
Based on the above definition, we define a constrained loss function as follows:
\mathcal{L}_{C}(\alpha, \beta; C_T) = \left(B_w(\beta) - C_w\right)^2 + \left(B_{op}(\alpha, \beta) - C_{op}\right)^2  (11)
where C_w and C_{op} represent the target average weight bit-width and average operation bit-width, respectively.
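The sketch below shows one way Eqs. (9)-(11) could be computed from the per-layer bit-widths returned by the search cells. The names are illustrative, and the exact aggregation in Eq. (10) and the squared-penalty form of Eq. (11) follow the formulations above rather than a released reference implementation.

```python
# A sketch of the constrained loss: per-layer selected bit-widths are aggregated
# into the average weight bit-width (Eq. 9) and the average operation bit-width
# (Eq. 10, assumed form), and the squared deviation from the targets is penalized
# (Eq. 11, assumed form).
import torch

def constrained_loss(w_bits, a_bits, params_per_layer, flops_per_layer,
                     target_w, target_op):
    # w_bits / a_bits: lists of per-layer bit tensors returned by the search cells
    p = torch.tensor(params_per_layer, dtype=torch.float32)
    f = torch.tensor(flops_per_layer, dtype=torch.float32)
    w_bits = torch.stack(list(w_bits))
    a_bits = torch.stack(list(a_bits))

    avg_w = (p * w_bits).sum() / p.sum()                        # Eq. (9)
    avg_op = torch.sqrt((f * w_bits * a_bits).sum() / f.sum())  # Eq. (10), assumed
    return (avg_w - target_w) ** 2 + (avg_op - target_op) ** 2  # Eq. (11), assumed
```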
4 Searching Process
Entropy is commonly used to measure the uncertainty of a distribution. In this paper, we use entropy to evaluate the selection certainty of search cells: different entropy values correspond to different selection certainties, and the smaller the entropy, the stronger the selection certainty. Taking the l-th activation value search cell as an example, its probability distribution is computed as follows:

\pi_i^l = \frac{\exp(\alpha_i^l)}{\sum_{j=1}^{N} \exp(\alpha_j^l)}  (12)

where \alpha^l denotes the activation value architecture parameter vector of this cell, and the entropy of this cell is defined as:

E^l = -\sum_{i=1}^{N} \pi_i^l \log \pi_i^l  (13)
Fig. 1 (b) shows the entropy variation curves of some layers of ResNet-20 on CIFAR-10 based on our single path search cell. We can see that the entropies of different layers have different convergence speeds, and many layers gradually converge to a steady state during the searching process. If we gradually fix the quantization precision of such layers during the searching process, the search space decreases exponentially. Therefore, we propose a sequential single path search method, which divides this task into sub-tasks over iterations and optimizes them sequentially. Once the decision condition is satisfied, we prioritize the cell with the highest selection certainty and use its selected quantization precision to replace the original search cell in the subsequent searching process. This yields a new, smaller search sub-problem; as more search cells are fixed, the search space decreases exponentially. Finally, a mixed-precision model satisfying the given constraints is obtained through these iterative decisions. The iterative procedure is shown in Algorithm 1. During the searching process, the quantization precision of each search cell gradually stabilizes and its entropy decreases through the continuous iterative updating of the architecture parameters.
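A sketch of the sequential decision step is given below. It computes the entropy of every still-searchable cell from its softmax probabilities (Eqs. (12)-(13)) and, when a decision condition is met, fixes the most certain cell to its most probable bit-width. The threshold-based condition and the function names are assumptions for illustration; the full procedure corresponds to Algorithm 1.

```python
# A sketch of the sequential decision step: measure selection certainty by entropy
# and, when certain enough, replace the most certain search cell with a fixed bit.
import torch
import torch.nn.functional as F

def decide_most_certain_cell(cells, entropy_threshold=0.1):
    """cells: dict {name: cell} of searchable cells with .alpha and .candidates."""
    entropies = {}
    for name, cell in cells.items():
        p = F.softmax(cell.alpha, dim=0)                       # Eq. (12)
        entropies[name] = float(-(p * torch.log(p + 1e-12)).sum())  # Eq. (13)

    name = min(entropies, key=entropies.get)                   # highest certainty
    if entropies[name] < entropy_threshold:                    # assumed decision rule
        cell = cells.pop(name)
        chosen_bit = cell.candidates[int(cell.alpha.argmax())]
        return name, chosen_bit   # the caller replaces this cell with the fixed bit
    return None
```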
5 Experiments
In this section, we search for mixed-precision quantization models to verify the effectiveness of our method on two image classification benchmarks (CIFAR-10 and ImageNet) and an object detection benchmark (COCO). We first describe the details of our experimental implementation, and then present the results of our method in comparison with state-of-the-art methods.
5.1 Implementation Details
We implement our method in PyTorch [38], in which we can easily implement and debug quantization functions and NAS algorithms. We use a hardware-friendly quantization function as the quantizer; therefore, the inference process can be efficiently implemented by bitwise operations (e.g., xnor and bitcount) to achieve model compression, computational acceleration and resource saving. We quantize the weights linearly into b bits, which can be formulated as follows:

Q_b(w) = \text{round}\left(\text{clamp}(w, -a, a) / s\right) \cdot s  (14)

where the clamp function is used to truncate all values into the range [-a, a], a is a learned parameter of the l-th search cell, and the scaling factor is defined as s = a / (2^{b-1} - 1). The search space of each search cell contains five candidate bit-widths.
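A minimal sketch of the linear quantizer in Eq. (14) is shown below, assuming a learned clipping value a per search cell and a straight-through estimator for the rounding operation; handling of the gradient with respect to a is simplified here.

```python
# A sketch of the hardware-friendly linear quantizer of Eq. (14) with a learned
# clipping parameter `a` and a straight-through estimator for round().
import torch

def linear_quantize(w, a, bit):
    # Clamp into [-a, a]; written with min/max so gradients also reach `a`.
    w_c = torch.min(torch.max(w, -a), a)
    s = a / (2 ** (bit - 1) - 1)     # scaling factor s = a / (2^(b-1) - 1)
    w_q = torch.round(w_c / s) * s   # Eq. (14): map onto the b-bit grid
    # Straight-through estimator: forward returns w_q, backward treats round()
    # as the identity function.
    return w_c + (w_q - w_c).detach()
```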
In our implementation, we combine the weight search cell and the activation value search cell into one layer-level search cell; therefore, there are 25 candidates for each layer. The architecture parameters are optimized by Adam, and the network parameters are updated by SGD with weight decay. For ImageNet, the same batch size is used for all networks. Following [62, 18, 60], we quantize the first convolutional layer and the last fully-connected layer to a fixed bit-width. We apply the pre-trained full-precision model to initialize the supernet and then adopt a warm-up strategy. After searching for the desired network, we fine-tune the selected mixed-precision quantization model to get the final parameters. For COCO detection, we use the mixed-precision architecture obtained on the ImageNet classification task as the backbone. Our network is fine-tuned by SGD for 50K iterations with a batch size of 16 on 8 V100 GPUs, and the learning rate is decayed by a factor of 10 at iterations 30K and 40K.
5.2 Experimental Results
5.2.1 CIFAR-10
We first search for a mixed-precision quantization model of ResNet-20 on the CIFAR-10 dataset under given average weight bit-width and average operation bit-width constraints. This dataset has 50K training images and 10K testing images. We divide the training images into a sub-training dataset (25K images) and a validation dataset (25K images). The sub-training dataset is used to update the weights of the supernet, and the validation dataset is used to update the architecture parameters. After the searching process, we use all the training images to fine-tune the selected model.
| Methods | W-Bits | A-Bits | Top-1 | W-Comp | Ave-Bits |
|---|---|---|---|---|---|
| Baseline | 32 | 32 | 92.37 | 1.00× | 32.00 |
| DoReFa [60] | 3 | 3 | 89.90 | 10.67× | 3.00 |
| PACT [8] | 3 | 3 | 91.10 | 10.67× | 3.00 |
| LQ-Nets [58] | 3 | 3 | 91.60 | 10.67× | 3.00 |
| HAWQ [11] | M | 4 | 92.22 | 13.11× | – |
| BP-NAS [55] | M | M | 92.12 | 10.74× | 3.30 |
| SSPS | M | M | 92.54 | 10.74× | 3.04 |
| Models | Methods | W-Bits | A-Bits | Top-1 | W-Comp | Ave-Bits |
|---|---|---|---|---|---|---|
| ResNet-18 | Baseline | 32 | 32 | 70.20 | 1.00× | 32.00 |
| | PACT [8] | 3 | 3 | 68.10 | 10.67× | 3.00 |
| | LQ-Nets [58] | 3 | 3 | 68.20 | 10.67× | 3.00 |
| | DSQ [15] | 3 | 3 | 68.66 | 10.67× | 3.00 |
| | QIL [23] | 3 | 3 | 69.20 | 10.67× | 3.00 |
| | SSPS | M | M | 69.64 | 10.65× | 2.99 |
| | PACT [8] | 4 | 4 | 69.20 | 8.00× | 4.00 |
| | LQ-Nets [58] | 4 | 4 | 69.30 | 8.00× | 4.00 |
| | DSQ [15] | 4 | 4 | 69.56 | 8.00× | 4.00 |
| | QIL [23] | 4 | 4 | 70.10 | 8.00× | 4.00 |
| | AutoQ [33] | M | M | 68.20 | 6.91× | – |
| | SSPS | M | M | 70.70 | 7.95× | 3.95 |
| ResNet-34 | Baseline | 32 | 32 | 73.80 | 1.00× | 32.00 |
| | ABC-Net [30] | 3 | 3 | 66.70 | 10.67× | 3.00 |
| | LQ-Nets [58] | 3 | 3 | 71.90 | 10.67× | 3.00 |
| | DSQ [15] | 3 | 3 | 72.54 | 10.67× | 3.00 |
| | QIL [23] | 3 | 3 | 73.10 | 10.67× | 3.00 |
| | SSPS | M | M | 73.49 | 10.69× | 3.06 |
| | BCGD [54] | 4 | 4 | 70.81 | 8.00× | 4.00 |
| | DSQ [15] | 4 | 4 | 72.76 | 8.00× | 4.00 |
| | QIL [23] | 4 | 4 | 73.70 | 8.00× | 4.00 |
| | SSPS | M | M | 74.30 | 7.99× | 4.01 |
| ResNet-50 | Baseline | 32 | 32 | 77.15 | 1.00× | 32.00 |
| | AutoQ [33] | M | M | 63.21 | 9.12× | – |
| | HAQ [50] | M | M | 75.48 | – | 3.60 |
| | HAWQ [11] | M | M | 75.30 | – | 4.00 |
| | BP-NAS [55] | M | M | 76.67 | – | 3.80 |
| | SSPS | M | M | 76.22 | 8.00× | 3.98 |
| MobileNetV2 | Baseline | 32 | 32 | 71.87 | 1.00× | 32.00 |
| | DSQ [15] | 4 | 4 | 64.80 | 8.00× | 4.00 |
| | TQT [20] | 4 | 4 | 67.79 | 8.00× | 4.00 |
| | HAQ [50] | M | M | 66.99 | – | – |
| | AutoQ [33] | M | M | 69.02 | 7.58× | – |
| | SSPS | M | M | 69.10 | 7.99× | 4.02 |
For each compared method, we report its average weight bit-width, average activation value bit-width, Top-1 accuracy, model size compression rate and average operation bit-width. The results are shown in Table 1. Compared with the full-precision model (Baseline), our model outperforms it by 0.17% while still achieving a 10.74× compression ratio for weights. Compared with the fixed-precision quantization methods DoReFa, PACT and LQ-Nets, the Top-1 accuracy of our method increases by 2.64%, 1.44% and 0.94%, respectively. Our method also has clear advantages over the mixed-precision quantization methods: its Top-1 accuracy is 0.32% and 0.42% higher than that of HAWQ and BP-NAS, respectively.
5.2.2 ImageNet
In order to verify the search ability of our method on large-scale datasets and deep networks, we evaluate ResNet-18, 34, 50 and MobileNetV2 on the ImageNet (ILSVRC 2012) dataset. We choose three-quarters of the training dataset as the sub-training dataset to update the weights of the supernet. The remaining one-quarter of the training dataset is used as the validation dataset to update the architecture parameters.
Table 2 shows the experimental results, where 'M' denotes mixed-precision quantization. Similar to other methods, our experiments mainly focus on average 3-bit and 4-bit quantization. In the searching process, we set the target average weight bit-width and average operation bit-width to 3 or 4 to control the search direction. From Table 2, we can see that our selected mixed-precision quantization models of ResNet-18 and ResNet-34 achieve the best accuracies, which are even higher than their full-precision counterparts. For ResNet-50, we compare our method with several mixed-precision quantization methods. HAQ and AutoQ apply reinforcement learning to search for mixed-precision quantized architectures; they spend more time on training and their results are still worse than ours. HAWQ manually chooses the bit precision within a reduced search space, and its result is 0.92% lower than ours. BP-NAS uses small sampled datasets to complete the searching process and then transfers the result to ResNet-50; our method is still comparable with BP-NAS, although its results are obtained after 150 epochs of fine-tuning with label smoothing. As a lightweight network, MobileNetV2 eliminates many redundant computations, so quantizing it causes a large accuracy loss. Even so, compared with DSQ, TQT, HAQ and AutoQ, our method converges well and outperforms them by 4.3%, 1.31%, 2.11% and 0.08%, respectively.
5.2.3 COCO Detection
We further explore the effectiveness of our mixed-precision model on detection tasks using the COCO benchmark [29], one of the most popular large-scale benchmark datasets for object detection. This dataset consists of images from 80 object categories. We use the trainval35k split for training and the minival split for validation. Both the one-stage RetinaNet [28] detector and the two-stage Faster R-CNN [43] detector are used to verify the effectiveness of our selected mixed-precision model; that is, we use the mixed-precision ResNet-50 model selected in Section 5.2.2 as the backbone. For Faster R-CNN, the RPN and RoI head are quantized to 4-bit. For RetinaNet, the feature pyramid and detection heads are quantized to 4-bit, except that the last layer in the detection heads is quantized to 8-bit.
ResNet-50 + Faster R-CNN

| Methods | W/A-Bits | Ave-Bits | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| Baseline | 32/32 | 32.00 | 37.7 | 59.3 | 40.9 | 22.0 | 41.5 | 48.9 |
| FQN [25] | 4/4 | 4.00 | 33.1 | 54.0 | 35.5 | 18.2 | 36.2 | 43.6 |
| BP-NAS [55] | M/M | 4.00 | 35.8 | 57.9 | 38.3 | 21.7 | 39.8 | 47.4 |
| SSPS | M/M | 4.00 | 37.4 | 58.1 | 40.6 | 22.1 | 40.4 | 47.9 |

ResNet-50 + RetinaNet

| Methods | W/A-Bits | Ave-Bits | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| Baseline | 32/32 | 32.00 | 37.8 | 58.0 | 40.8 | 23.8 | 41.6 | 48.9 |
| FQN [25] | 4/4 | 4.00 | 32.5 | 51.5 | 34.7 | 17.3 | 35.6 | 42.6 |
| Auxi [62] | 4/4 | 4.00 | 36.1 | 55.8 | 38.9 | 21.2 | 39.9 | 46.3 |
| SSPS | M/M | 4.00 | 36.4 | 55.8 | 38.6 | 20.8 | 39.9 | 47.6 |
We compare the performance of our method with FQN [25], Auxi [62] and BP-NAS [55], where FQN and Auxi are fixed-precision methods and BP-NAS is a mixed-precision method. Table 3 shows the experimental results. As we can see, model quantization obviously affects the detection results. For the Faster R-CNN detector, our selected mixed-precision model demonstrates better performance: our 4-bit detector with a ResNet-50 backbone outperforms FQN and BP-NAS by 4.3 and 1.6 AP, respectively. For the RetinaNet detector, our model also shows the best results among the quantized models.
5.3 Effective and Fast
5.3.1 Convergence Analysis
We take the mixed-precision search of ResNet-18 as an example for convergence analysis, where we set the expected average weight bit-width and average operation bit-width to 4. Fig. 4 shows the convergence curves of the searching loss, average operation bit-width and average weight bit-width. As an ablation, the blue line represents the convergence curve of our SSPS method, and the orange line represents the convergence curve without the decision operations (steps 6-10 in Algorithm 1), which we call SPS. From Fig. 4 (a), we can see that our decision operations significantly improve the convergence of the searching process: they reduce the search space and the number of factors affecting the target, thus increasing the stability of the search. With the decrease of the temperature coefficient and the convergence of the training loss, the fluctuations become smaller and smaller until the curves approach the target constraints. Fig. 5 shows the selected quantization policy for ResNet-18. The histogram represents the quantization precision of each layer in our model: the upper part shows the weight precision of different layers, and the bottom part shows the quantization precision of the activation values in different layers. In particular, we mark the order in which the precision of each layer was decided during the search.
5.3.2 Comparison with Related Work
In this subsection, we discuss the differences between our method and two similar methods, DNAS [51] and BP-NAS [55].
Compared with DNAS: (1) By using the Gumbel-Softmax with an annealing temperature, the pipeline of DNAS behaves very similarly to DARTS at the beginning, in which multiple candidates participate in the calculation. Therefore, it takes up a lot of memory resources just like DARTS, as shown in Fig. 1 (a). In contrast, only one candidate is allowed to pass through our search cell, thus saving hardware resources. (2) DNAS introduces the cost of candidate structures into the loss function to encourage lower-precision weights and activations. Because there is no target setting, DNAS needs multiple searches to get an appropriate model; our method needs only one search to return an optimal model under the given constraints. (3) Our sequential search method exponentially reduces the search space during the search, thus improving the search speed and convergence stability, as shown in Fig. 4. Our method takes less than 28 GPU-hours to complete a search for ResNet-18 on ImageNet, which is much faster than DNAS (40 GPU-hours). (4) DNAS is a block-wise mixed-precision method in which all layers of one block use the same precision, whereas our method is a layer-wise mixed-precision method, which allows different precisions within the same block.
Compared with BP-NAS: (1) BP-NAS applies DARTS to address the optimization problem, which leads to a sharp increase in resource demand. (2) For ImageNet, BP-NAS randomly samples 10 categories and takes 5000 images as the training dataset for searching. This approach relies on an incomplete definition of the mixed-precision quantization task and easily falls into a local optimum. (3) Similar to DNAS, BP-NAS is also a block-wise mixed-precision method, so its search space is much smaller than ours.
6 Concluding Remarks
In this paper, we proposed a novel SSPS method for mixed-precision quantization search and introduced constraints into its loss function to guide the searching process. The resulting model is fully differentiable, and the searching process can be optimized by gradient descent methods. During searching, we determine the quantization precision according to the selection certainty of the search cells, which reduces the search space exponentially and accelerates search convergence. Experimental results demonstrate that our proposed SSPS method achieves better performance under similar constraints than state-of-the-art methods on CIFAR-10, ImageNet and COCO. Our future work will focus on mixed-precision quantization architecture search without training datasets and on training a universal model that can support multiple quantization precisions to meet more industrial demands.
References
 [1] Karim Ahmed and Lorenzo Torresani. Maskconnect: Connectivity learning by gradient descent. In ECCV, pages 349–365, 2018.
 [2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
 [3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [4] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
 [5] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Zeroq: A novel zero shot quantization framework. arXiv preprint arXiv:2001.00281, 2020.
 [6] Zhaowei Cai and Nuno Vasconcelos. Rethinking differentiable search for mixedprecision neural networks. arXiv preprint arXiv:2004.05795, 2020.
 [7] LiangChieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multiscale architectures for dense image prediction. In NeurIPS, pages 8699–8710, 2018.
 [8] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce IJen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.

 [9] Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In CVPR, pages 1761–1770, 2019.
 [10] Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852, 2019.
 [11] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixedprecision. In ICCV, pages 293–302, 2019.
 [12] EENews. Apple describes 7nm a12 bionic chips. 2018.
 [13] Jiemin Fang, Yuzhu Sun, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. Densely connected search space for more flexible neural architecture search. arXiv preprint arXiv:1906.09607, 2019.

 [14] Chengyue Gong, Zixuan Jiang, Dilin Wang, Yibo Lin, Qiang Liu, and David Z Pan. Mixed precision neural architecture search for energy efficient deep learning. In ICCAD, pages 1–7. IEEE, 2019.
 [15] Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In ICCV, pages 4852–4861, 2019.
 [16] Luis Guerra, Bohan Zhuang, Ian Reid, and Tom Drummond. Switchable precision neural networks. arXiv preprint arXiv:2002.02815, 2020.
 [17] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, LiJia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In ECCV, pages 784–800, 2018.
 [18] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran ElYaniv, and Yoshua Bengio. Binarized neural networks. In NeurIPS, pages 4107–4115, 2016.
 [19] Benoit Jacob, Skirmantas Kligys, Matthew Chen, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integerarithmeticonly inference. In CVPR, pages 2704–2713, 2018.
 [20] Sambhav R Jain, Albert Gural, Michael Wu, and Chris Dick. Trained uniform quantization for accurate and efficient neural network inference on fixedpoint hardware. arXiv preprint arXiv:1903.08066, 6, 2019.
 [21] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144, 2016.
 [22] Qing Jin, Linjie Yang, and Zhenyu Liao. Adabits: Neural network quantization with adaptive bitwidths. arXiv preprint arXiv:1912.09666, 2019.
 [23] Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, JaeJoon Han, Youngjun Kwak, Sung Ju Hwang, and Changkyu Choi. Learning to quantize deep networks by optimizing quantization intervals with task loss. In CVPR, pages 4350–4359, 2019.
 [24] Guohao Li, Guocheng Qian, Itzel C Delgadillo, Matthias Muller, Ali Thabet, and Bernard Ghanem. Sgas: Sequential greedy architecture search. In CVPR, pages 1620–1630, 2020.
 [25] Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. Fully quantized network for object detection. In CVPR, pages 2810–2819, 2019.
 [26] Yuhang Li, Xin Dong, and Wei Wang. Additive powersoftwo quantization: An efficient nonuniform discretization for neural networks. In ICLR, 2019.
 [27] Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, and Shaojun Wei. Fpbnn: Binarized neural network on fpga. Neurocomputing, 275, 2017.
 [28] TsungYi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
 [29] TsungYi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.

 [30] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In NeurIPS, pages 345–353, 2017.
 [31] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
 [32] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [33] Qian Lou, Feng Guo, Lantao Liu, Minje Kim, and Lei Jiang. Autoq: Automated kernelwise neural network quantization. arXiv preprint arXiv:1902.05690, 2019.
 [34] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 [35] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
 [36] SR Nandakumar, Manuel Le Gallo, Irem Boybat, Bipin Rajendran, Abu Sebastian, and Evangelos Eleftheriou. Mixedprecision architecture based on computational memory for training deep neural networks. In ISCAS, pages 1–5. IEEE, 2018.

 [37] Nvidia. Nvidia tensor cores. 2018.
 [38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
 [39] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 [40] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNORNet: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542, 2016.

 [41] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In AAAI, volume 33, pages 4780–4789, 2019.
 [42] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In ICML, pages 2902–2911. JMLR.org, 2017.
 [43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
 [44] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. Bit fusion: Bitlevel dynamically composable architecture for accelerating deep neural network. In ISCA, pages 764–775. IEEE, 2018.
 [45] Qigong Sun, Fanhua Shang, Xiufang Li, Kang Yang, Peizhuo Lv, and Licheng Jiao. Efficient computation of quantized neural networks by {−1, +1} encoding decomposition. 2018.
 [46] Qigong Sun, Fanhua Shang, Kang Yang, Xiufang Li, Yan Ren, and Licheng Jiao. Multi-precision quantized neural networks via encoding decomposition of {−1, +1}. In AAAI, volume 33, pages 5024–5032, 2019.
 [47] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platformaware neural architecture search for mobile. In CVPR, pages 2820–2828, 2019.
 [48] Stefan Uhlich, Lukas Mauch, Fabien Cardinaux, Kazuki Yoshiyama, Javier Alonso Garcia, Stephen Tiedemann, Thomas Kemp, and Akira Nakamura. Mixed precision dnns: All you need is a good parametrization. arXiv preprint arXiv:1905.11452, 2019.
 [49] Yaman Umuroglu, Lahiru Rasnayake, and Magnus Själander. Bismo: A scalable bitserial matrix multiplication overlay for reconfigurable computing. In FPL, pages 307–3077. IEEE, 2018.
 [50] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardwareaware automated quantization with mixed precision. In CVPR, pages 8612–8620, 2019.
 [51] Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018.
 [52] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
 [53] Amir Yazdanbakhsh, Ahmed T Elthakeb, FatemehSadat Pilligundla, and Hadi Esmaeilzadeh. Releq: An automatic reinforcement learning approach for deep quantization of neural networks. arXiv preprint arXiv:1811.01704, 2018.
 [54] Penghang Yin, Shuai Zhang, Jiancheng Lyu, Stanley Osher, Yingyong Qi, and Jack Xin. Blended coarse gradient descent for full quantization of deep neural networks. Research in the Mathematical Sciences, 6(1):14, 2019.
 [55] Haibao Yu, Qi Han, Jianbo Li, Jianping Shi, Guangliang Cheng, and Bin Fan. Search what you want: Barrier panelty nas for mixed precision quantization. arXiv preprint arXiv:2007.10026, 2020.
 [56] Haichao Yu, Haoxiang Li, Honghui Shi, Thomas S Huang, and Gang Hua. Anyprecision deep neural networks. arXiv preprint arXiv:1911.07346, 2019.
 [57] Linghua Zeng, Zhangcheng Wang, and Xinmei Tian. Kcnn: kernelwise quantization to remarkably decrease multiplications in convolutional neural network. In IJCAI, pages 4234–4242. AAAI Press, 2019.
 [58] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lqnets: Learned quantization for highly accurate and compact deep neural networks. In ECCV, pages 365–382, 2018.
 [59] Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and ChengLin Liu. Practical blockwise neural network architecture generation. In CVPR, pages 2423–2432, 2018.
 [60] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [61] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In ICLR, 2017.
 [62] Bohan Zhuang, Lingqiao Liu, Mingkui Tan, Chunhua Shen, and Ian Reid. Training quantized neural networks with a fullprecision auxiliary module. In CVPR, pages 1488–1497, 2020.
 [63] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
 [64] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, pages 8697–8710, 2018.
 [65] Yochai Zur, Chaim Baskin, Evgenii Zheltonozhskii, Brian Chmiel, Itay Evron, Alex M Bronstein, and Avi Mendelson. Towards learning of filterlevel heterogeneous compression of convolutional neural networks. arXiv preprint arXiv:1904.09872, 2019.