Wildfire Smoke Detection based on Depthwise Separable Convolutions and Target-Awareness

Since smoke usually appears before flames, smoke detection is especially important for early fire warning systems. In this paper, we propose DSATA (Depthwise Separability And Target Awareness), a smoke detection algorithm that combines depthwise separable convolutions with target-aware deep features. Existing deep learning methods usually rely on convolutional neural networks pretrained on large datasets for generic object recognition. For smoke detection, however, collecting large quantities of smoke data is difficult, so smoke is effectively a small-sample object class. Moreover, the objects of interest can belong to arbitrary classes and take arbitrary forms. Target-aware deep feature maps selected from a pretrained network are therefore used to model such objects and to distinguish them from unpredictable and complex environments, and this paper applies that scheme to smoke detection. A depthwise separable method with a fixed convolution kernel replaces the training iterations and speeds up the algorithm, which helps meet the detection-speed requirements imposed by rapidly spreading fires. The experimental results demonstrate that the proposed algorithm can detect early smoke, is superior to the compared state-of-the-art methods in accuracy and speed, and can realize real-time smoke detection.


Introduction
Smoke detection is a promising method for fire alarm systems, especially in wide-open forest environments. Automatic fire detection systems play an important role in the early detection of, and response to, unpredictable events [1]. Video smoke detection often struggles to reach ideal performance because smoke varies in shape, sways, and changes colour tone, and because forest scenes suffer from changing illumination and low image resolution. Traditional video smoke detection methods based on pattern recognition [2] and digital image processing [3] depend on extracting rich dynamic texture [4], colour features [5] [6], optical flow [7] and spatial features [8] [9]. Gubbi et al. [10] adopted a pattern recognition method that divides each smoke video frame into blocks of 32×32 pixels and detects smoke using wavelets [11] and support vector machines [12]. In [13], the CIELAB colour space was used to cluster smoke chromatic features and analyse smoke colour. In [14], histogram of oriented gradient (HOG) [15] [16] descriptors were used to extract spatial features of smoke. Xiong et al. [17] used an adaptive Gaussian mixture model (GMM) for background modelling. Pixels that did not match the background Gaussians were grouped into moving blobs using connected component analysis to detect smoke. In recent years, deep learning approaches have made great progress on many machine vision tasks in realistic scenarios and have improved performance across public benchmark datasets. As a result, video smoke detection with relatively deep networks has attracted a large number of researchers.
Smoke detection methods based on deep learning adopt mainstream deep learning frameworks. In [18], a deep normalization and convolutional neural network (DNCNN) was applied to detect smoke in video. In [19], a multichannel convolutional neural network was proposed to extract deep features of fire for fire detection. Sharma [20] used two pretrained convolutional neural networks (CNNs), VGG16 and ResNet50, to detect early fires. Muhammad [21] proposed a cost-effective CNN to balance computational complexity and accuracy. Xu [22] used synthetic smoke images to compensate for the lack of CNN training data. In [23], a background subtraction algorithm was proposed to preprocess smoke video so that smoke areas stand out clearly, and a deep belief network was used to classify smoke.
Deep learning frameworks based on region proposals are a class of CNN architectures combined with a region proposal method. The region-based CNN (R-CNN) [24] is a CNN extension that uses selective search to detect objects. In later variants, a region proposal network (RPN) is added to a typical CNN to anchor the object regions of interest. Faster R-CNN [25] combines a pretrained VGG16 with an RPN to classify objects and regress bounding boxes. In [26], a Faster R-CNN was adopted to coarsely extract smoke areas, and a 3D-CNN was used to classify smoke video.
Deep saliency networks for smoke detection aim to emphasize the most important object regions in video frames. In [27], salient convolutional neural networks were used to extract smoke saliency maps at both the pixel level and the object level. In [28], a saliency detection model was applied to segment smoke regions based on pixel colour and motion features.
In this paper, an end-to-end framework for video smoke detection is proposed. Within a correlation filter framework, deep features extracted by a CNN are processed by target awareness to reduce their dimensionality. To meet the real-time requirements of smoke detection, a depthwise separable method with a fixed convolution kernel replaces the traditional convolution. The maximum value of the response map predicts the position of the detection area, and a multiscale scheme determines the rectangle of the smoke area.
This paper is organized as follows. In section 2, the related works are reviewed. In section 3, DSATA is introduced. The experimental results are presented in section 4, and the conclusion is presented in section 5.

Related work
Smoke detection based on deep learning differs from traditional image processing methods. A deep learning algorithm can extract many classes of features and is not limited to one or two typical image processing features. In [29], fully convolutional networks (FCNs) were used to realize semantic segmentation. A deep smoke segmentation network was also proposed to segment blurry smoke images by training on high-quality segmentation masks. Traditional vision-based smoke video detection methods [30] [31] [32] divide each video frame into blocks and extract stable features in each block to classify it as smoke or non-smoke. The strong performance of these methods usually relies on robust visual object forms that clearly separate smoke from a well-differentiated background. However, fires are usually accompanied by complex backgrounds and blurry real-time video, so high-quality, high-contrast footage is rarely available. In addition, it is currently difficult to collect the large quantities of data that video detection requires for small-sample objects such as smoke. In [33], synthetic smoke images were proposed to meet dataset requirements. However, in visual detection, the objects of interest can belong to arbitrary classes and take arbitrary forms, so it is impossible to cover all realistic scenarios with training data. As a result, deep feature maps obtained from generic pretraining are weak at modelling objects of arbitrary forms and at distinguishing them from unpredictable and complex environments.
In this paper, building on the target-aware deep tracking (TADT) algorithm [34], DSATA is proposed with a target-aware strategy that selects useful deep features for object representation. Target awareness is realized according to a regression loss. In [35], a t-SNE visualization showed the difference between target-aware features and original features. Pretrained deep features are less effective than target-aware deep features at discriminating different objects that share the same semantic label. The main contributions of DSATA are as follows:
• Adaptive target-aware deep features for object detection do not require complex pretraining of CNNs, so object detection with deep networks can be realized with only a small amount of data. The TADT algorithm compensates for the inability of pretrained deep models to account for arbitrary object forms in visual detection.
• We adopt the depthwise separable method to reduce the number of computations associated with the correlation of each frame, which significantly improves the speed of the algorithm.
• We use a fixed depthwise separable convolutional kernel to avoid the time spent on backpropagation iterations.

Target-aware deep tracking
The TADT algorithm introduces a target-aware method that computes weights expressing how important each deep feature is for object detection. Gradients of a ridge loss are used to rank the deep features by importance, and a ranking loss is combined with the ridge component to select scale-sensitive features, using 3 scales to handle variation in smoke shape. The TADT algorithm includes 4 parts: a pretrained CNN, target awareness, correlation filtering, and a Siamese matching network. Fig. 1 shows the framework of TADT. VGG16 has 16 weight layers (13 convolutional layers and 3 fully connected layers), together with 5 max-pooling layers, an input layer and an output layer. Smoke video frames are the input to the VGG16 model, which yields 512 deep feature maps as input to the target-aware model. Target awareness uses the ridge loss to rank the importance of the 512 deep feature maps and selects 300 of them.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 3 April 2020 doi:10.20944/preprints202004.0027.v1
Target awareness uses the ridge loss to identify the convolution kernels that extract object-specific information. These kernels respond to particular object categories to different degrees. In the target-aware model, the feature weights acquired by minimizing the ridge loss reflect the importance of the 512 feature maps captured from the pretrained VGG16. This means that we do not need to retrain the VGG16 network to obtain effective feature representations for arbitrary objects in unknown scenes, which avoids unnecessary bulk smoke video collection and complex network training. The ridge loss is defined as follows:
$$L_{reg} = \left\| Y(i, j) - W * x_{i,j} \right\|^2 + \lambda \left\| W \right\|^2$$
where {Y(i, j)} is a Gaussian label described as follows:
$$Y(i, j) = e^{-\frac{i^2 + j^2}{2\sigma^2}}$$
where σ is the kernel width, * represents a convolution, λ is the regularization coefficient of the ridge term, and W is the regression weight trained to compute the contribution of the feature maps. The weights updated by backpropagation represent the importance of the feature maps, and the chain rule is used to compute the derivative of L_reg with respect to x_{i,j}:
$$\frac{\partial L_{reg}}{\partial x_{i,j}} = \frac{\partial L_{reg}}{\partial X_o(i, j)} \times \frac{\partial X_o(i, j)}{\partial x_{i,j}} = 2\left(X_o(i, j) - Y(i, j)\right) \times W$$
where X_o(i, j) = W * x_{i,j} is the output feature map. The pretrained model extracts 512 feature maps, which are sent to a regression network, and the gradient at each pixel is computed. Finally, a global average pooling layer over the gradients yields one importance weight per feature map, and the 300 most useful feature maps are selected by comparing these weights. The global average pooling of the gradients is defined as follows:
$$\Delta_i = GAP\left(\frac{\partial L_{reg}}{\partial z_i}\right)$$
where GAP is the global average pooling function and ∂L_reg/∂z_i is the derivative of the loss function L_reg with respect to the i-th output feature map z_i, obtained by training the convolutional regression model.
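The selection step can be illustrated with a small NumPy sketch. It is not the TADT or DSATA implementation: it assumes a 1×1 regression kernel (one scalar weight per channel), ignores the regularization term because it does not depend on the features, and ranks channels by the absolute value of the pooled gradient. The function names `gaussian_label`, `channel_importance` and `select_top_channels` are hypothetical.

```python
import numpy as np


def gaussian_label(height, width, sigma):
    """Gaussian label Y(i, j) centred on the map, with kernel width sigma."""
    i = np.arange(height) - (height - 1) / 2.0
    j = np.arange(width) - (width - 1) / 2.0
    ii, jj = np.meshgrid(i, j, indexing="ij")
    return np.exp(-(ii ** 2 + jj ** 2) / (2.0 * sigma ** 2))


def channel_importance(features, weights, label):
    """Global average pooling of dL_reg/dx for every channel.

    features: (C, H, W) deep feature maps
    weights:  (C,) weights of a 1x1 regression kernel (simplifying assumption)
    label:    (H, W) Gaussian label
    """
    # Regression output X_o(i, j) = sum_c W_c * x_c(i, j)
    output = np.tensordot(weights, features, axes=1)          # (H, W)
    # dL/dX_o for L = ||Y - X_o||^2
    grad_out = 2.0 * (output - label)                         # (H, W)
    # Chain rule: dL/dx_c(i, j) = dL/dX_o(i, j) * W_c
    grad_in = weights[:, None, None] * grad_out[None, :, :]   # (C, H, W)
    # Global average pooling over the spatial dimensions
    return grad_in.mean(axis=(1, 2))                          # (C,)


def select_top_channels(importance, k):
    """Indices of the k channels with the largest |importance| (assumed rule)."""
    order = np.argsort(-np.abs(importance), kind="stable")[:k]
    return np.sort(order)
```

With 512 input channels, `select_top_channels(importance, 300)` would correspond to the 300 feature maps kept by the regression branch.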
Fire smoke moves and takes irregular shapes under the influence of wind and other environmental conditions. The algorithm therefore needs a scale-sensitive component that trains kernel filters to adapt to scale changes. In [36], a ranking loss is proposed as follows:
$$L_{rank} = \sum_{(x_i, x_j) \in \Omega} \log\left(1 + \exp\left(-\left(f(x_i) - f(x_j)\right)\right)\right)$$
where (x_i, x_j) is a pair of scaled samples generated from the frame with a 2-pixel stride, f(·) is the output of the rank model, and Ω is the set of pairs (x_i, x_j). The loss L_rank is minimized to match the variation in smoke shape. In TADT, a model is trained to learn the scale filter, which limits the computational cost of selecting scale-sensitive features. Stochastic gradient descent (SGD) is adopted to minimize the rank loss, and 80 scale-sensitive deep features are selected according to the rank loss model. The chain rule is used to compute the gradient, defined as follows:
$$\frac{\partial L_{rank}}{\partial x_{i,j}} = \frac{\partial L_{rank}}{\partial f(x_{i,j})} \times \frac{\partial f(x_{i,j})}{\partial x_{i,j}} = \frac{\partial L_{rank}}{\partial f(x_{i,j})} \times W$$
where W is the convolutional kernel weight of the rank loss model and ∂L_rank/∂f(x_{i,j}) is the gradient of L_rank with respect to f(x_{i,j}). In this section, scale-sensitive features are extracted from the rank network: 80 deep feature maps are selected for the smoke video. Combining the regression and rank losses yields 380 deep feature maps in total, which represent both the object characteristics and the scale-sensitive information.
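As a rough illustration of the pairwise ranking loss, the sketch below evaluates the logistic form log(1 + exp(−(f(x_i) − f(x_j)))) and its derivative with respect to f(x_i). This form is an assumption about the loss used in [36], and the scores passed in are hypothetical rank-model outputs rather than values from the paper.

```python
import numpy as np


def rank_loss(scores_i, scores_j):
    """Sum over pairs of log(1 + exp(-(f(x_i) - f(x_j))))."""
    diff = np.asarray(scores_i, dtype=float) - np.asarray(scores_j, dtype=float)
    # logaddexp(0, -d) = log(exp(0) + exp(-d)) = log(1 + exp(-d)), numerically stable
    return float(np.sum(np.logaddexp(0.0, -diff)))


def rank_loss_grad(scores_i, scores_j):
    """dL/df(x_i) for each pair; dL/df(x_j) is the negative of this value."""
    diff = np.asarray(scores_i, dtype=float) - np.asarray(scores_j, dtype=float)
    return -1.0 / (1.0 + np.exp(diff))
```

The loss shrinks as f(x_i) exceeds f(x_j) by a larger margin, which is what pushes the model to score the better-matching scale higher.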

Depth-wise separable convolutions
In [37] [38] [39], MobileNets were proposed for lightweight mobile and embedded vision applications. A depthwise separable strategy was built, in which two convolutional kernels are used to balance latency and accuracy. MobileNet is a streamlined architecture whose depthwise separable construction is a form of factorized convolution. It is composed of a spatial convolutional kernel, called depthwise, that is applied to each input channel, and a 1×1 convolutional kernel, called pointwise, that is applied to the output of the depthwise convolution. In other words, the depthwise separable convolution is divided into a depthwise step and a pointwise step: the depthwise step filters the input maps, and the pointwise step combines the output feature maps of the depthwise step. This factorization greatly reduces the computations and decreases model complexity. Fig. 2 shows the typical convolution operation. The depthwise separable algorithm factorizes the kernel filter into a depthwise branch (Fig. 3) and a 1×1 pointwise branch (Fig. 4). A typical convolutional layer takes D_F × D_F × M input feature maps and produces D_F × D_F × N output feature maps. The computational cost of a typical convolution is defined as follows:
$$W \times H \times M \times N \times D_F \times D_F$$
where W is the width of the convolutional kernel, H is the height of the convolutional kernel, M is the number of channels of the convolutional filter, and N is the number of output feature maps. The depthwise computational cost is defined as follows:
$$W \times H \times M \times D_F \times D_F$$
The pointwise computational cost is defined as follows:
$$M \times N \times D_F \times D_F$$
The total computational cost of depthwise separable convolutions is defined as follows:
$$W \times H \times M \times D_F \times D_F + M \times N \times D_F \times D_F$$
The reduction in computation obtained by combining the depthwise and pointwise steps is as follows:
$$\frac{W \times H \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{W \times H \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{W \times H}$$
In TADT, the cross-correlation filter [40] method speeds up the computations by using the fast Fourier transform (FFT) to move the convolutional kernel and the input features to the frequency domain, where element-wise multiplication replaces the spatial convolution. However, the FFT does not change the dimensionality of the data: because many feature maps must be correlated for each frame, the computational cost remains considerable. The depthwise separable algorithm reduces the convolution computations by splitting the operation into depthwise and pointwise steps, which decreases the dimensionality of the kernel. The cross-correlation method is still used to speed up the computations. This paper applies depthwise separability to reduce the dimensionality of the kernel and thereby improve the architecture. Fig. 5 shows the framework of DSATA. In Fig. 5, the depthwise separable algorithm divides the example sample, which is the first smoke frame in DSATA, into two parts: a depthwise convolutional kernel and a pointwise convolutional kernel. The convolution first correlates each follow-up frame with the depthwise kernel to obtain depthwise feature maps and then combines these maps with the pointwise kernel. In TADT, M×N×512 feature maps are extracted by VGG16. Three scales are added to the input smoke video. Therefore, M×N×512×3 feature maps are extracted by VGG16.
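To make the saving concrete, the sketch below evaluates the multiply-accumulate counts of a standard convolution and of its depthwise separable factorization, assuming the usual MobileNet cost expressions (kernel W×H, M input channels, N output maps, D_F×D_F spatial size). The function names and the example sizes are illustrative and not taken from the experiments.

```python
def conv_costs(kernel_w, kernel_h, in_channels, out_channels, feature_size):
    """Return (standard, depthwise, pointwise) multiply-accumulate counts."""
    spatial = feature_size * feature_size
    standard = kernel_w * kernel_h * in_channels * out_channels * spatial
    depthwise = kernel_w * kernel_h * in_channels * spatial
    pointwise = in_channels * out_channels * spatial
    return standard, depthwise, pointwise


def reduction_ratio(kernel_w, kernel_h, in_channels, out_channels, feature_size):
    """Cost of the separable convolution divided by the standard cost."""
    standard, depthwise, pointwise = conv_costs(
        kernel_w, kernel_h, in_channels, out_channels, feature_size
    )
    return (depthwise + pointwise) / standard


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map
    print(reduction_ratio(3, 3, 512, 512, 14))  # equals 1/512 + 1/9
```

For a 3×3 kernel the ratio is dominated by the 1/(W×H) term, so the separable form needs roughly one ninth of the computation once N is large.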
The target-aware method processes the feature maps to obtain M×N×380×3 scale-sensitive and representational feature maps. In DSATA, we average the 380 example feature maps produced by the target-aware strategy across channels to obtain an M×N×1×3 depthwise kernel, and we average each feature map spatially to obtain a 1×1×512×3 pointwise kernel. In this way, the target-aware feature maps of the example are split into depthwise filters and pointwise filters. Once the example sample is selected in TADT, it is not changed again. Therefore, we can fix the depthwise and pointwise filters instead of training them with deep neural networks, which also avoids the computational cost of training these filters. Experimental results show that the fixed depthwise separable kernel pairs achieve approximately the same detection accuracy while significantly increasing the detection speed.
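A minimal NumPy sketch of this idea follows, for a single scale. It assumes the example and search features are stored as (channels, height, width) arrays, builds the fixed depthwise kernel by averaging over channels and the fixed pointwise kernel by averaging over space, and takes the peak of the response map as the predicted position. The function names are hypothetical, and the explicit loops stand in for the FFT-based correlation used in the paper.

```python
import numpy as np


def fixed_separable_kernels(example):
    """Build fixed kernels from example features of shape (C, h, w)."""
    depthwise = example.mean(axis=0)        # (h, w): average over channels
    pointwise = example.mean(axis=(1, 2))   # (C,): spatial average per channel
    return depthwise, pointwise


def separable_response(search, depthwise, pointwise):
    """Correlate search features (C, H, W) with the fixed kernel pair."""
    channels, height, width = search.shape
    kh, kw = depthwise.shape
    out_h, out_w = height - kh + 1, width - kw + 1
    depthwise_out = np.empty((channels, out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            window = search[:, y:y + kh, x:x + kw]
            # Depthwise step: the same spatial kernel is applied to each channel
            depthwise_out[:, y, x] = (window * depthwise).sum(axis=(1, 2))
    # Pointwise step: 1x1 combination of the channels
    return np.tensordot(pointwise, depthwise_out, axes=1)  # (out_h, out_w)


def peak_position(response):
    """Row and column of the maximum response."""
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)
```

Because both kernels are computed once from the first frame, no backpropagation is needed at detection time; each new frame only requires the two correlation steps.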

Fire smoke video datasets
In this section, we select 8 fire smoke video sequences to verify the performance of the proposed depthwise separable algorithm, DSATA. Smoke video samples are shown in Fig. 6. These smoke videos are collected from web sources and standard datasets, and some of them are chosen under different conditions to test the algorithm. The selection considers many factors, including climatic conditions, the resolution of the camera equipment, and interference from smoke-like objects, as well as the violent swaying of smoke under the influence of wind. The fire smoke video sequences are selected according to these requirements, and the smoke video information is shown in Table 1. The videos were deliberately chosen to include background interference that resembles smoke. In video 1, the blurry frames are captured by a low-cost image acquisition device, and similar objects, such as white clouds, are present to test robustness. In video 4, long-distance image acquisition makes the images even blurrier. The smoke position must still be detected accurately at very low pixel resolution, which increases the difficulty of the experiment. The remaining recognized datasets are selected to cover the effects of different experimental environments on smoke detection.

Experimental performance analysis
The experiments are implemented in Ubuntu 16.04 with TensorFlow on a PC with 32 GB of memory, an Intel i7 3.7 GHz CPU, and a GTX 1080 GPU. Smoke video collection and pre-processing are implemented in Windows 10 with MATLAB 2018a. In this section, we use the smoke videos to compute the precision of TADT and DSATA. The visualization results of TADT are shown in Fig. 7. The frames shown are sampled from the smoke videos while the algorithm runs. The visualization of DSATA is shown in Fig. 8. Table 2 shows the precision of TADT and DSATA. In Table 2, the precision is defined as follows:
$$Precision = \frac{T_p}{T_p + F_p}$$
where T_p is the number of frames with true positives and F_p is the number of frames with false positives. According to Table 2, DSATA achieves the best performance on Videos 1, 2, 3, and 5. The improved TADT used in smoke detection achieves the best performance on Videos 4, 6, 7, and 8. There is a large difference in detection accuracy between TADT and DSATA on Video 5 and Video 6. On Video 5, the accuracy of TADT is lower than that of DSATA, which may be due to differences between the deep features that VGG16 extracts from the example and those it extracts from the other smoke frames. On Video 6, the accuracy of TADT is higher than that of DSATA. The likely reason is that wind changes the smoke, which reduces how well the deep features of the example discriminate smoke in the other frames. Except for Video 5 and Video 6, the difference in detection accuracy on the other videos is not large, and the difference in mean detection accuracy between TADT and DSATA is also small. This is because the two algorithms use the same feature extraction strategy: target-aware deep features are extracted to capture robust and semantic information.
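The precision measure used for Table 2 can be computed per video as sketched below. The frame counts in the example are hypothetical placeholders, not results from the paper.

```python
def precision(true_positive_frames, false_positive_frames):
    """Precision = T_p / (T_p + F_p), computed over the frames of one video."""
    total = true_positive_frames + false_positive_frames
    if total == 0:
        raise ValueError("precision is undefined when there are no detections")
    return true_positive_frames / total


if __name__ == "__main__":
    # Hypothetical counts for a single video (not taken from Table 2)
    print(precision(95, 5))
```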
The target-aware deep features are robust to appearance and scale changes. Table 3 shows the precision of several smoke detection algorithms based on deep learning architectures and on traditional pattern recognition. In Table 3, the accuracy of DSATA is similar to that of TADT but higher than that of the other algorithms, such as DBN (93.4%) and Faster R-CNN (91.88%). The data in Table 3 show that DSATA performs better than the other deep learning algorithms and the traditional smoke detection algorithms. The advantage of DSATA is that it obtains effective deep features without training a deep network, whereas Faster R-CNN and saliency detection must be trained on a large number of samples. Table 3 also shows that DSATA is more accurate than TADT under the blur and interference of video 3. Accordingly, DSATA can be used for smoke detection in real scenes. Fig. 9 plots curves for the TADT algorithm and for the DSATA algorithm with depthwise separability to show that our method obtains better performance. In Fig. 9, the DSATA curve is smoother than the TADT curve because depthwise separability sharply reduces the computations, whereas the complete computations may introduce more uncertainty into the response feature maps. In Fig. 9, the location error is the Euclidean distance between the centre of the predicted bounding box and the centre of the ground-truth bounding box. Table 4 compares the smoke detection speed of DSATA and TADT on the smoke video dataset, where DSATA performs favourably against the improved TADT. According to Table 4, the FPS of DSATA is approximately twice that of TADT because DSATA introduces the depthwise separable method to enhance real-time performance. The minimum frame rate of DSATA is approximately 86 FPS.
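The centre location error and a precision curve over error thresholds, as plotted in figures of this kind, can be computed roughly as follows. The sketch assumes boxes are given as (x, y, width, height) with (x, y) the top-left corner, which is an assumption about the annotation format, and the function names are hypothetical.

```python
import numpy as np


def centre_errors(predicted_boxes, ground_truth_boxes):
    """Euclidean distance between box centres, one value per frame.

    Boxes are (x, y, width, height) with (x, y) the top-left corner (assumed).
    """
    pred = np.asarray(predicted_boxes, dtype=float)
    truth = np.asarray(ground_truth_boxes, dtype=float)
    pred_centres = pred[:, :2] + pred[:, 2:] / 2.0
    truth_centres = truth[:, :2] + truth[:, 2:] / 2.0
    return np.linalg.norm(pred_centres - truth_centres, axis=1)


def precision_curve(errors, thresholds):
    """Fraction of frames whose centre error is within each threshold."""
    errors = np.asarray(errors, dtype=float)
    return np.array([(errors <= t).mean() for t in thresholds])
```

Sweeping the threshold from 0 upwards yields one curve per algorithm, so two trackers can be compared at every tolerance rather than at a single cut-off.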
The experimental results show that DSATA can realize real-time smoke detection. This demonstrates the effectiveness of DSATA proposed in this paper. Overall, all experimental results demonstrate that DSATA performs well in terms of accuracy, robustness, and running speed.

Conclusion
In this paper, we propose an algorithm with a target-aware and depthwise separable mechanism to realize fire smoke detection. The target-aware method extracts the most useful deep features, which are robust to appearance and scale changes. The depthwise separable mechanism is composed of depthwise and pointwise convolutions and enhances real-time performance. Unlike mainstream methods such as CNN classifiers, region proposal methods, and saliency detection, our method applies a target-tracking algorithm to object detection. Target-aware methods reduce the work of dataset collection, and adaptive target-aware deep features for object detection do not require complex pretraining of CNNs. The depthwise separable method reduces the number of computations associated with the correlation of each frame, which significantly improves the speed of the algorithm, and the fixed depthwise separable convolutional kernel avoids the time spent on backpropagation iterations. The experimental results show that our DSATA algorithm has excellent performance compared with other detection algorithms.