A Deep Learning Approach for Wi-Fi based People Localization

People localization is a key building block in many applications. In this paper, we propose a deep learning based approach that significantly improves the localization accuracy and reduces the runtime of Wi-Fi based localization systems. Three variants of the deep learning approach are proposed: a sub-task architecture, an end-to-end architecture, and an architecture that incorporates prior knowledge. The performance of the three architectures is evaluated under different conditions, and their significant improvement over existing approaches is demonstrated.


Introduction
People localization is a key building block in many applications such as surveillance, activity classification, and elderly people monitoring. Video-based systems suffer from many limitations that prevent them from operating in many real-world situations: they require users to be in the camera's line of sight, they cannot work in the dark or through walls or smoke, and they violate users' privacy. Furthermore, video-based tracking and localization algorithms suffer from low localization accuracy and high computational cost.
Wi-Fi provides an accessible source of opportunity for people localization. It does not have the limitations of video-based systems; furthermore, it has reasonable bandwidth and transmitted power.
The potential to localize people using this ubiquitous source of opportunity, without transmitting any additional signal or requiring cooperation from the users, provides interesting opportunities. The users are localized based on the available Wi-Fi signals that are reflected from their bodies.
Recently, a number of Wi-Fi based localization systems were proposed [1][2][3] and [7][8][9][10]. A multiperson localization system was proposed by Adib et al. [1]. They determined users' locations based on the reflections of Wi-Fi signals from their bodies; the results show that their system was able to localize up to five people at the same time with an average accuracy of 11.7 cm. Colone et al. [2] studied the use of Wi-Fi signals for people localization and conducted an ambiguity function analysis for Wi-Fi signals. They also studied the range resolution for both direct sequence spread spectrum (DSSS) and orthogonal frequency division multiplexing (OFDM) frames. In both the range and the Doppler dimensions, large sidelobes were detected, which explains the masking of closely spaced users. Chetty et al. [3] conducted an experiment in a high clutter indoor environment using Wi-Fi signals and were able to detect one moving person behind a wall.
Compressive Sensing (CS) is a convenient approach to improve the accuracy and to detect closely spaced users, which are difficult to separate using conventional methods. CS can also work at a lower rate than the Nyquist rate. The use of CS in radar has recently been investigated in [4,5]. Anitori et al. [6] presented an architecture for adaptive CS radar detection with Constant False Alarm Rate (CFAR) properties; they also provided a methodology to predict the performance of the proposed detector. Researchers in [7] and [8] have shown that compressive sensing can detect objects with high accuracy using Wi-Fi signals.
Recently, there has been growing interest in using Deep Learning (DL) in communication and signal processing due to its ability to adapt to the many imperfections that exist in real-world environments.
DL has recently shown promising results in image recognition and classification [11][12][13]. The key factors behind these results are high-performance computing systems and the use of large amounts of data, such as the ImageNet dataset [14], which contains more than one million images.
In [22], different deep learning architectures such as the Deep Neural Network (DNN), the Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) were used for signal detection in a molecular communication system. Simulation results demonstrated that all these architectures were able to outperform the baseline, while the LSTM based detector showed promising performance in the presence of Inter-Symbol Interference (ISI).
In [23], a deep learning based detector called DetNet was proposed. The aim was to reconstruct a transmitted signal x using the received signal y. To test the performance of the proposed approach in complex channel environments, two scenarios were considered: the fixed channel model and the varying channel model. DetNet was compared with two algorithms, approximate message passing (AMP) and semi-definite relaxation (SDR), the latter of which provides close to optimal detection accuracy. In the fixed channel scenario, the simulation results showed that DetNet was able to outperform AMP and achieved accuracy comparable to SDR with a significant reduction of the computational cost (about 30 times faster). Similarly, in the varying channel scenario, DetNet was 30 times faster than SDR and showed close accuracy.
In [24], a five-layer fully connected DNN was used for channel estimation and detection in an OFDM system by considering the channel as a black box. In the training phase, the data are passed through a channel model. The frequency domain signal representing the data information is then fed to the DNN detector, which reconstructs the transmitted data. Compared with the conventional minimum mean square error (MMSE) method, the DNN detector was able to achieve comparable performance. It also showed better performance when fewer pilots were used or when clipping distortion was introduced to decrease the peak-to-average power ratio.
In radar, Yonel et al. [27] have recently used deep learning for the inverse problem in radar imaging. They designed a recurrent neural network architecture and used it as an inverse solver. The results show that the proposed approach was able to outperform conventional methods in terms of the computation time and the reconstructed image quality.
Deep learning has also recently been used for compressive sensing [28][29][30], where the training needs to be done only once.
Recently, there has been a trend of using deep learning for Wi-Fi based localization systems [31][32][33][34][35][54]. Fang and Lin [31] proposed a system that uses a neural network with a single hidden layer to extract features from the Received Signal Strength (RSS). It was able to reduce the localization error to below 2.5 m, which is a 17% improvement over state-of-the-art approaches. A system called DeepFi was proposed in [33] with a four-layer neural network. DeepFi was able to improve the accuracy by 20% over the FIFS system, which uses a probability based model. A system called CiFi was proposed in [34]; it used a convolutional network for indoor localization based on Wi-Fi signals. First, the phase data is extracted from the channel state information (CSI); then the phase data is used to estimate the angle of arrival (AOA), which is used as an input to the convolutional network. The results show that CiFi has an error of less than 1 m for 40% of the test locations, while for other approaches it is 30%. In addition, it has an error of less than 3 m for 87% of the test locations, while for DeepFi it is 73%. In [35], a system called ConFi was proposed, which is a CNN based Wi-Fi localization technique that uses CSI as features. The CSI was organized as a CSI feature image, where the CSIs at different times and different subcarriers were arranged into a matrix. The CNN consists of three convolutional layers and two fully connected layers. The network is trained using the CSI feature images. ConFi was able to reduce the mean error by 9.2% and 21.64% over DeepFi and DANN, respectively.
These results show the significant improvement in the runtime and the accuracy of deep learning based systems. However, RSS provides only coarse-grained information about the wireless channel variations. CSI can capture fine-grained variations in the wireless channel; it also contains the amplitude and the phase measurements for each OFDM subcarrier. Using the signals reflected from the bodies of the users, however, can capture more valuable information such as the Doppler shift; therefore, the latter approach is used in this work.
Section 8 discusses the results and future research directions, and the paper is concluded in Section 9.

Wi-Fi Signal Model
The IEEE 802.11 Wi-Fi standards [36] use DSSS modulation with 11 MHz bandwidth in the 802.11b standard and OFDM with 20 MHz bandwidth in the newer a/g/n standards. In OFDM, the signal is divided into Ns symbols, and these symbols are modulated onto multiple subcarriers. The duration of each OFDM symbol is T, the subcarrier spacing is ∆f = 1/T, and the bandwidth is B = Ns∆f. The carrier frequency is fc, and fm = fc + m∆f is the frequency of the mth subcarrier. A cyclic prefix (CP) of length Tcp is used to avoid inter-symbol interference. One OFDM symbol in the baseband is given by (1), where s[m] is the symbol of the mth subcarrier and q(t) is a rectangular window of length Tcp + T. We consider a uniform linear array with L elements on which P signals impinge from directions θ1, θ2, ..., θP, respectively. The received Wi-Fi signals are expressed by (2), where w(t) is white Gaussian noise, Ap is the attenuation, which includes the path loss and the reflection, τp is the delay, ap is the range rate of the pth path divided by the speed of light, and x(t) is an OFDM symbol. The steering vector a(θp) is expressed by (3), where λ is the signal wavelength, d is the array inter-element spacing, and L is the number of antennas.
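The equations of this section are not reproduced in this text. As a rough illustration only, the sketch below evaluates one commonly used form that is consistent with the definitions above: a baseband OFDM symbol built as a windowed sum of subcarriers, and the steering vector of a uniform linear array. The window support [-Tcp, T], the sign of the steering phase, and the function names are assumptions made here, not the authors' formulation.

```python
import cmath
import math

def ofdm_symbol(symbols, T, T_cp, t):
    """Baseband OFDM sample at time t (assumed form: sum_m s[m] exp(j 2 pi m df t) q(t))."""
    delta_f = 1.0 / T                       # subcarrier spacing, as defined in the text
    if not (-T_cp <= t <= T):               # q(t): rectangular window of length T_cp + T (assumed support)
        return 0j
    return sum(s * cmath.exp(2j * math.pi * m * delta_f * t)
               for m, s in enumerate(symbols))

def steering_vector(theta_rad, num_antennas, d, wavelength):
    """Uniform-linear-array steering vector (assumed sign and reference-element convention)."""
    phase = -2.0 * math.pi * d * math.sin(theta_rad) / wavelength
    return [cmath.exp(1j * phase * k) for k in range(num_antennas)]

# Toy values: one active subcarrier, four antennas at half-wavelength spacing.
x_sample = ofdm_symbol([1, 0, 0, 0], T=1.0, T_cp=0.25, t=0.3)
broadside = steering_vector(0.0, num_antennas=4, d=0.5, wavelength=1.0)
```

With only the zero-frequency subcarrier active, the symbol is constant inside the window, and a signal arriving from broadside produces the same phase at every antenna.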

Compressive Sensing
Consider a discrete-time signal x of length N. x can be represented in terms of basis vectors ψi as
$$x = \sum_{i=1}^{N} s_i \psi_i \tag{4}$$
where si are the weighting coefficients. When x is a linear combination of a small number K of basis vectors, with K < N, i.e., only K of the coefficients si in (4) are non-zero, compressive sensing allows x to be sampled with fewer measurements than the Nyquist rate requires. In the reconstruction problem (6), ε bounds the noise in the signal.
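To make sparse recovery concrete, the following is a minimal sketch of Orthogonal Matching Pursuit, a greedy alternative to the convex formulation that is also used later in the paper [52]. It is a real-valued toy version written for illustration: the measurement matrix, the signal, and the helper names are made up here, and the least-squares step uses plain normal equations rather than a numerically robust solver.

```python
def solve_least_squares(columns, y):
    """Solve min ||y - A c||_2 via the normal equations, with A given as a list of columns."""
    k = len(columns)
    # Augmented Gram system [A^T A | A^T y].
    g = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(a * b for a, b in zip(columns[i], y))] for i in range(k)]
    for col in range(k):                    # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(g[r][col]))
        g[col], g[pivot] = g[pivot], g[col]
        for r in range(k):
            if r != col:
                factor = g[r][col] / g[col][col]
                g[r] = [u - factor * v for u, v in zip(g[r], g[col])]
    return [g[i][k] / g[i][i] for i in range(k)]

def omp(phi_columns, y, sparsity):
    """Greedy recovery of a `sparsity`-sparse x from y = Phi x."""
    residual, support, coeffs = list(y), [], []
    for _ in range(sparsity):
        # Pick the unused column most correlated with the current residual.
        scores = [abs(sum(a * b for a, b in zip(col, residual))) for col in phi_columns]
        for j in support:
            scores[j] = -1.0
        support.append(max(range(len(phi_columns)), key=scores.__getitem__))
        chosen = [phi_columns[j] for j in support]
        coeffs = solve_least_squares(chosen, y)
        # Residual = y minus its projection onto the chosen columns.
        residual = [yi - sum(c * col[i] for c, col in zip(coeffs, chosen))
                    for i, yi in enumerate(y)]
    x_hat = [0.0] * len(phi_columns)
    for j, c in zip(support, coeffs):
        x_hat[j] = c
    return x_hat

# Toy problem: M = 4 measurements, N = 6 unknowns, K = 2 non-zero entries.
r = 2 ** -0.5
phi_columns = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
               [r, r, 0, 0], [0, 0, r, r]]
y = [2.0, 0.0, -3.0, 0.0]                   # equals Phi x for x = [2, 0, -3, 0, 0, 0]
x_hat = omp(phi_columns, y, sparsity=2)
```

In this noiseless toy case the two non-zero entries are recovered from four measurements of a length-six vector; with noise or highly correlated columns, recovery is not guaranteed.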

Deep Learning
Deep learning [37,38] is inspired by neural systems in biology, where the weighted sum of many inputs is fed to an activation function, such as the sigmoid function, to produce an output. A neural network is then built by linking many neurons to form a layered architecture. A loss function, such as the mean square error, measures the difference between the expected output and the output of the network, and the weights are chosen to minimize it. Optimization algorithms such as Gradient Descent (GD) are typically used in training to find the best parameters. In [39], it has been shown that a neural network can be used as a universal function approximator by introducing hidden layers between the input and the output layers.
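The neuron and training loop described above can be sketched in a few lines. This is an illustrative toy, not the network used in the paper: a single sigmoid neuron trained by plain gradient descent on the squared error of one example, with made-up inputs, target, and learning rate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def gd_step(weights, bias, inputs, target, lr):
    """One gradient-descent step on the squared error (out - target)^2."""
    out = neuron(weights, bias, inputs)
    # Chain rule: dLoss/dz = 2 (out - target) * sigmoid'(z), with sigmoid'(z) = out (1 - out).
    grad_z = 2.0 * (out - target) * out * (1.0 - out)
    new_weights = [w - lr * grad_z * x for w, x in zip(weights, inputs)]
    return new_weights, bias - lr * grad_z

weights, bias = [0.0, 0.0], 0.0
inputs, target = [1.0, 2.0], 1.0
loss_before = (neuron(weights, bias, inputs) - target) ** 2
for _ in range(100):
    weights, bias = gd_step(weights, bias, inputs, target, lr=0.5)
loss_after = (neuron(weights, bias, inputs) - target) ** 2
```

Starting from zero weights the neuron outputs 0.5, and each step moves the output toward the target, so the loss after training is lower than before.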
In the fully connected feedforward neural network, each neuron is linked to all neurons in the adjacent layers.
Efficient algorithms such as backpropagation were proposed for training such networks. Many problems can arise during the training process, such as converging to a local minimum. To address this problem, many adaptive learning algorithms such as the Adam algorithm were proposed.
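For readers unfamiliar with Adam, the update rule can be sketched as follows. This is a minimal scalar version applied to a toy quadratic loss, which is an assumption made for illustration; the default decay rates and epsilon are the commonly used ones, and the learning rate of 0.01 matches the value reported later in the Method section.

```python
import math

def adam_minimize(grad, w, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam update rule for a single scalar parameter w."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g          # running mean of the gradients
        v = beta2 * v + (1.0 - beta2) * g * g      # running mean of the squared gradients
        m_hat = m / (1.0 - beta1 ** t)             # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss (w - 3)^2 with gradient 2 (w - 3); its minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
```

The step size adapts to the gradient history: early steps are close to the learning rate in magnitude, and they shrink as the gradient becomes small relative to its running magnitude.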
However, although the trained network can perform well on the training data, it might perform very poorly on the testing data because of overfitting. Many techniques, such as dropout, have been proposed to reduce overfitting.
The Recurrent Neural Network (RNN) was introduced to provide neural networks with memory, since in many situations the outputs also need to depend on the inputs from previous time steps.
One example is translation, where knowledge of the previous words in the sentence significantly helps in producing a better translation of the current word. Recently used RNN architectures that show promising results include the Gated Recurrent Unit (GRU) and the LSTM.
The convolutional neural network is another promising architecture. The basic idea of the CNN is to use convolutional and pooling layers before the fully connected network. In the convolutional layer, a number of filters are learned to represent local spatial patterns along the input channels. The pooling layer performs down-sampling, so the number of parameters is significantly decreased before the fully connected layers.
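The convolution and pooling steps can be illustrated on a one-dimensional signal. The sketch below is a toy with a hand-picked kernel and made-up input values; in a CNN the kernels are learned, and many of them run in parallel to produce several feature maps.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D convolution as used in CNNs (cross-correlation, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; trailing samples that do not fill a window are dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 4.0, 1.0]
edge_kernel = [-1.0, 0.0, 1.0]              # hand-picked here; a CNN would learn its kernels
feature_map = relu(conv1d(signal, edge_kernel))
pooled = max_pool1d(feature_map, 2)
```

Pooling with a window of two halves the length of the feature map, which is how the number of inputs to the fully connected layers is reduced.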
With the promising results of the CNN architecture in computer vision, many researchers have attempted to improve the CNN architecture proposed by Krizhevsky et al. [11] to achieve better accuracy. For example, the highest accuracy architecture submitted to the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2013 [12] used a smaller stride and a smaller window size for the convolutional layers. In [40], the researchers addressed another important aspect of architecture design, the network depth. Recent evidence [40,41] shows that the network depth is of crucial importance. However, the main challenge of using deeper networks is the vanishing gradients problem [42,43], which affects the convergence significantly. This problem has been addressed by normalized initialization [43,44] and batch normalization layers [45], which enabled networks with tens of layers to begin converging. However, with increased network depth, the accuracy gets saturated and then rapidly degrades [46,47]. In [46], the researchers addressed the degradation problem by proposing a deep residual learning approach. To maximize the information flow, skip connections were introduced. The 152-layer residual network was applied to the ImageNet dataset and won first place in the ILSVRC 2015 competition. To ensure maximum information flow between different layers of the network, all layers are connected to each other directly in [49], where each layer passes its output to all subsequent layers and receives inputs from all preceding layers.

Wi-Fi based localization Using Compressive Sensing
In this section, we describe a Wi-Fi localization method using compressive sensing, where the works of [7] and [8] are extended to also include the angle of arrival estimation. The number of objects is often very small compared to the number of points in the scene. This implies that the scene is sparse, which enables us to formulate a CS reconstruction problem and solve it using convex optimization.
The received signal should be matched to delay-Doppler-angle combinations corresponding to object detections. A sufficient delay-Doppler-angle resolution should be considered; however, a very high resolution may lead to a large number of combinations, many of which are highly correlated. The delay-Doppler-angle scene is divided into a P × V × Z grid, in which each point represents a unique delay-Doppler-angle combination. The sparse vector x is composed of P data points in the range dimension and Z data points in the angle dimension at all considered Doppler shifts, with V data points in the Doppler dimension. The size of the vector x is Q = PVZ. The entry of x with index pvz is nonzero if an object exists at the point (p, v, z). The measurement vector y contains the data from L antennas at time tl. The measurement matrix Φ is generated by creating time-shifted versions of the transmitted signal (represented by (7)) for each Doppler frequency and each angle of arrival.
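The mapping between a grid point (p, v, z) and its position in the length-Q vector x can be sketched as below. The ordering of the three dimensions inside x is not stated in the text, so the layout used here (angle index fastest, then range, then Doppler) is an assumption, as are the function names and the value chosen for Z.

```python
def flat_index(p, v, z, P, V, Z):
    """Position of grid point (p, v, z) in the length-Q vector x, with Q = P * V * Z."""
    # Assumed ordering: angle index varies fastest, then range, then Doppler.
    assert 0 <= p < P and 0 <= v < V and 0 <= z < Z
    return (v * P + p) * Z + z

def grid_point(q, P, V, Z):
    """Inverse mapping from a flat index q back to (p, v, z)."""
    vp, z = divmod(q, Z)
    v, p = divmod(vp, P)
    return p, v, z

P, V, Z = 30, 30, 8        # 30 delay and 30 Doppler samples as in the simulations; Z = 8 is a placeholder
Q = P * V * Z
x = [0] * Q
x[flat_index(p=12, v=3, z=5, P=P, V=V, Z=Z)] = 1   # a single object at grid point (12, 3, 5)
```

Even for this modest grid, Q is in the thousands while only a handful of entries are non-zero, which is the sparsity that the CS formulation relies on.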
We assume that s(t) is known. To improve the detection probability, the results of 10 signals are combined before the threshold step: the final value assigned to an object equals the number of reconstructed scenes, out of the 10, in which it appears.
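The combining step can be sketched as a simple vote count over the reconstructed scenes. The example below uses three tiny binary scenes instead of ten, and the threshold value is an assumption, since the text does not state which threshold is applied.

```python
def combine_scenes(scenes, threshold):
    """Count how often each grid point is detected across scenes, then threshold the counts."""
    counts = [sum(column) for column in zip(*scenes)]
    detections = [q for q, c in enumerate(counts) if c >= threshold]
    return counts, detections

# Three toy binary scenes over five grid points (the paper combines 10 scenes).
scenes = [
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
]
counts, detections = combine_scenes(scenes, threshold=2)
```

A spurious detection that appears in only one scene (grid point 3 here) is rejected, while points that recur across scenes are kept.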

Method
Most signal processing techniques in communications and radar have solid foundations in information theory and statistics, and are optimal under assumptions such as linearity and Gaussian statistics. However, many imperfections exist in real-world environments. Deep learning is a very appealing option because it can adapt to real-world imperfections, which cannot always be captured by analytical models. In challenging environments where there are multipath, weak signals, and multiple people, the performance of conventional localization techniques degrades rapidly.
Therefore, we investigate the use of deep learning to address these challenges.
Deep learning has shown a remarkable ability to extract useful features from speech and visual data and to use these features in classification tasks. We want to extract the location information from the received signal, which is distorted by noise, multipath, and reflections from other people. In conventional localization techniques, such as the technique described in the previous section, the Time of Arrival (TOA) of the reflected signal is used to determine the object range and the Doppler information is used to suppress clutter. Here, we want to test the ability of deep learning to determine the location information in low SNR situations and in the presence of multiple people.
We also want to test its ability to use the Doppler information to suppress clutter. We formulate the localization problem as a classification problem; however, instead of classifying the object into a particular category, the object is classified to the most likely location. The DL approach uses the received signal to build a grid, where each object is matched to a particular location.
Choosing an architecture and parameters that best suit the problem is an important question. We tried many architectures with different numbers and sizes of layers; the best performing architecture is shown in Table 1. The roles of the different parameters of the proposed architecture are evaluated in the next section. The network has three convolutional layers and three fully connected layers. The input of the network is the received signal y. Different kernels (filters) can detect different features from the input signal and construct different feature maps.
Using 50 kernels of size 5 was found to work best in our model. For the fully connected layers, the width of each layer is 800, and a 25% dropout is used to avoid overfitting. Dropout [48] means temporarily removing units from the network along with all their connections; the choice of which units to remove is random. This makes each unit more robust and reduces its dependence on other units to create useful features. To introduce non-linearity into the network, the ReLU is used as the activation function. ReLU has shown higher performance than the sigmoid function and is more plausible in biological systems. The Adam optimizer is used to train the network, and the learning rate is set to 0.01. The accuracy metric used is given by
$$\text{Accuracy} = \frac{TP}{P} \tag{9}$$
where TP is the number of correct detections and P is the number of positive cases. To be able to compare the results of the deep learning approach with the compressive sensing approach, the same accuracy metric is also used to evaluate the performance of the compressive sensing approach.
Three variants of the above architecture are used. The first one seeks to simplify the problem and reduce its dimensionality by using several copies of the above network to estimate the location of each user alone: the first network is trained to estimate the location of the first user, the second network is trained to estimate the location of the second user, and so on. The second variant uses an end-to-end approach where the performance of the whole system can be optimized. The above network is used to estimate the locations of all users at the same time; however, several output layers are added to estimate the locations of the different users. The third variant introduces prior knowledge to the network by feeding the pilot signal used as an input to the network: the above network is modified by adding one more input layer for the pilot signal, followed by three convolutional layers; the two branches are then merged and the same fully connected layers are used. Fig. 1 shows the modified architecture. The performance of these three variants is compared in the next section. Similar to the CS based approach, the outputs of the network for 10 signals are combined.
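As a small illustration of the classification view of localization, the sketch below turns a vector of output scores into per-location probabilities with softmax and evaluates the detection metric as TP / P. The scores and location indices are made-up toy values, and reading the metric as the ratio of correct detections to positive cases is an assumption based on the definitions of TP and P given above.

```python
import math

def softmax(logits):
    """Numerically stable softmax: one probability per candidate location."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_location(logits):
    """Index of the most likely location."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

def detection_accuracy(predicted, actual):
    """TP / P: correctly detected locations over the number of true locations (assumed reading)."""
    true_positives = len(set(predicted) & set(actual))
    return true_positives / len(set(actual))

probs = softmax([2.0, 0.5, -1.0, 0.5])
best = predict_location([2.0, 0.5, -1.0, 0.5])
accuracy = detection_accuracy(predicted=[4, 17, 23], actual=[4, 17, 30])
```

Here two of the three true locations are found, so the metric evaluates to two thirds; applying the same function to the thresholded CS output would give a like-for-like comparison.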

Results
Computer simulations were performed to evaluate the proposed approach. We consider the 2.4 GHz industrial, scientific, and medical (ISM) band. The delay profile is represented by 30 samples, and the Doppler dimension is also represented by 30 samples. The proposed approaches are used to localize users with random positions in the scene under different conditions. Training of the network took 12 hours on a standard Intel i3-4030U processor. First, we compare the deep learning approach with existing methods; then the performance of the proposed architectures is compared. After that, the performance of the deep learning approach is evaluated in the presence of multipath propagation.
Then, the role of each parameter of the training set is evaluated, and finally, the effect of each parameter of the network is investigated.

A. Comparison with other methods in terms of accuracy and runtime
To compare the proposed deep learning approach with the compressive sensing approach described in Section 5, 1000 Monte Carlo runs were performed to evaluate the compressive sensing approach under different SNR values, with the locations of the users generated randomly. Both Orthogonal Matching Pursuit (OMP) [52] and the Interior Point Method (IPM) [53] were used to reconstruct the scene. Each reconstructed scene is the result of combining 10 signals. The same accuracy metric described in Section 6 is used to evaluate the CS approach. Two DL approaches are compared: the first tries to simplify the problem and reduce its dimensionality by estimating the location of each user alone, as described in Section 6; the second is an end-to-end approach where the locations of all users are estimated at the same time.
The end-to-end approach has shown better performance, which suggests that the gain from dividing this particular problem into simpler sub-tasks is lower than the gain from the overall optimization of the whole problem. Fig. 3 shows the probability of correctly detecting the users under different SNR values for the two approaches.

C. Comparing with an approach where prior knowledge is incorporated
Here we compare the end-to-end architecture with an architecture where prior knowledge is fed to the network. The pilot signal used is also fed as an input to the network to see whether it improves the performance of the network. The two approaches showed comparable results, with a very small improvement for the prior knowledge approach. This means that there is not much gain from using additional information as an input to the network, and that the network is able to extract the needed information from the received signal. Fig. 4 shows the probability of correctly detecting the users under different SNR values for the two architectures.

D. The effect of multipath
To investigate the effect of multipath signals, the proposed approach is evaluated when 4, 8, and 12 multipath signals are added to the received signal. Fig. 5 shows that the end-to-end approach is relatively robust to multipath propagation: the network was able to cancel the multipath effect and correctly detect the users.

The effect of dropout
Here we compare the performance of the network for four cases: no dropout, 10% dropout, 25% dropout, and 40% dropout. Fig. 11 shows that increasing the dropout rate improved the ability of the network to create useful features, with 25% and 40% dropout showing slightly higher accuracy.
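The dropout operation being varied here can be sketched as follows. This is the common "inverted dropout" formulation, in which the surviving units are rescaled during training so that no change is needed at test time; whether the paper's implementation rescales in this way is not stated, so that detail is an assumption.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
layer_output = [1.0] * 800                  # width of one fully connected layer in the text
dropped = dropout(layer_output, rate=0.25, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
```

With a rate of 0.25, roughly a quarter of the 800 units are silenced on each training pass, and a different random subset is chosen every time.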

Discussion
The proposed deep learning approach has shown higher performance with less runtime in comparison with the CS approach. The proposed approach has also shown a high ability to adapt to challenging environments. For the studied problem, using deep learning for each sub-task, and hence reducing the curse of dimensionality, gave less accurate results than the end-to-end approach, where the performance of the whole system is optimized. Introducing prior knowledge by using the pilot signal as an input to the network did not result in much improvement in the accuracy; the network seems to be able to extract the needed information from the received signal. The proposed approach has also been shown to be relatively robust in multipath environments.
Increasing the number of examples in the training stage resulted in higher accuracy; however, the improvement was very small beyond 250,000 examples. Using training examples from different SNR values gave more accurate results than using the same SNR value for all the examples, whether that SNR is low or high. The results have also shown the role of different network parameters in improving the accuracy. This work, along with many other recent works, has shown that deep learning has many potential applications in future signal processing, communication, and radar systems where conventional approaches are challenged. It represents a promising research direction that is still in its early stage. Some challenges are still worth further investigation. Further research must be conducted to propose deep learning architectures that best suit signal processing, communication, and radar systems.

Conclusions
This paper has presented a Wi-Fi based localization technique based on deep learning, where three different architectures were proposed. Simulation results demonstrated the significant improvement in the accuracy and in the runtime of the proposed approaches over existing approaches. The end-to-end architecture was found to be more accurate than the other two architectures. The proposed approach has also been shown to be relatively robust in multipath environments. Future work will investigate further improvement in the localization accuracy by building architectures that best suit localization systems.
In challenging environments where there are multipath, weak signals, and multiple people, the performance of conventional localization techniques degrades rapidly. Therefore, in this paper, we investigate the use of deep learning for Wi-Fi based localization systems. The main contribution of this paper is a deep learning based Wi-Fi localization technique that significantly improves the accuracy and reduces the runtime in comparison with existing techniques. Three architectures are proposed: an end-to-end architecture, a sub-task architecture, and an architecture that introduces prior knowledge. The performance of the proposed approach is evaluated in the presence of multipath propagation. The role of each parameter of the training set and the effect of each parameter of the network are also investigated. The paper is organized as follows: Section 2 describes the Wi-Fi signal model. An overview of compressive sensing is given in Section 3. An overview of deep learning is given in Section 4. Section 5 describes Wi-Fi based localization using compressive sensing. The proposed localization technique is introduced in Section 6. Simulation results are listed in Section 7.
Compressive sensing samples x with a smaller number of measurements than the Nyquist rate. Measurements y with M < N are performed by linear projections
$$y = \Phi x + n \tag{5}$$
with a measurement matrix Φ and additive noise n. When x is sparse with only a small number K < N of non-zero entries, compressive sensing can reconstruct x given that the measurement matrix Φ is incoherent with the basis ψ, i.e., the vectors {φj} cannot sparsely represent the vectors {ψi}. The compressive sensing reconstruction problem can then be formulated as a convex optimization problem
$$\hat{x} = \arg\min_{\tilde{x}} \|\tilde{x}\|_1 \quad \text{subject to} \quad \|y - \Phi\tilde{x}\|_2 \le \varepsilon \tag{6}$$
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 7 November 2018 | doi:10.20944/preprints201809.0213.v2

Fig 1. A DL architecture where prior knowledge is incorporated.

Fig 2. The percentage of correctly detected persons for the OMP and the IPM versus the DL approach for different SNR values and different numbers of combined signals.

Fig 3. The probability of correctly detecting the users under different SNR values for the sub-task approach and the end-to-end approach.

Fig 4. The probability of correctly detecting the users under different SNR values for the end-to-end approach and the approach when prior knowledge is incorporated.

Fig 11. The probability of correctly detecting the users under different SNR values when different percentages of dropout are used.

Although compressive sensing has revolutionized signal processing, the main challenge facing it is the slow convergence of current reconstruction algorithms, which limits the applicability of CS systems. In [28], a new signal reconstruction framework called DeepInverse was introduced. DeepInverse uses a convolutional network to learn the inverse transformation from measurements to signals.

To address the vanishing gradient problem, new activation functions such as the Rectified Linear Unit (ReLU) were introduced instead of the sigmoid function. The problem was also addressed by introducing normalized initialization.
The measurement matrix Φ establishes a linear relation between the measurements at multiple antennas [y1, y2, …, yL] and the range profile [x1, x2, …, xQ] at different Doppler shifts ωv and different angles θz.

To accelerate the training process and to further reduce overfitting, batch normalization [45] is used in the proposed architecture. Vanishing gradients or getting trapped in a local minimum may occur when using a high learning rate. However, by normalizing the activations throughout the network, small changes are prevented from amplifying into large changes in the activations and gradients. Batch normalization has also shown promising results in reducing overfitting. Softmax is used as the activation function in the output layer. Softmax takes advantage of the fact that the locations are mutually exclusive, i.e., an object can be at only one location, and it outputs a probability for each location.
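The normalization step at the heart of batch normalization can be sketched for a single unit as follows. The activations are toy values; the scale gamma and shift beta are learned in practice but fixed here, and the running statistics used at inference time are omitted for brevity.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's activations across a mini-batch, then apply a scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]          # one unit's activations over a batch of four
normalized = batch_norm(activations)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
```

After normalization the batch has approximately zero mean and unit variance, regardless of the scale of the incoming activations.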

Table 1. The architecture of the network.

Table 2 shows the runtime for the end-to-end approach versus the two CS approaches using a standard Intel i3-4030U processor. The DL approach has a significantly lower runtime than the CS based approaches: once the network is trained and the weights are calculated, predicting a new output involves relatively simple calculations.

Table 2. The runtime for the DL, the OMP, and the IPM methods.