Sensor-fusion Location Tracking System using Hybrid Multimodal Deep Neural Network †

Abstract: Many engineered approaches have been proposed over the years to solve the hard problem of indoor localization. However, specialising these solutions for edge cases remains challenging. Here we propose a solution with zero hand-engineered features, in which everything is learned directly from data. We use a modality-specific neural architecture to extract preliminary features, which are then integrated by cross-modality neural network structures. We show that each modality-specific branch can estimate the location with good accuracy on its own, but that a cross-modality network fusing these early modality-specific representations achieves better accuracy. Our multimodal neural network, MM-Loc, is effective because it allows the uniform flow of gradients across modalities during training. Because it is a data-driven approach, complex feature representations are learned rather than hand-engineered.


Introduction
With a growing number of mobile applications requiring contextual information to adapt their services to user needs, location estimation is becoming an important feature. GPS is a system-level navigation method relying on satellite signals; however, indoor environments are often shielded from satellite signals. As a result, researchers have proposed alternative indoor positioning methods relying on signals such as WiFi, Bluetooth and infrared RSS, and on movement-tracking sensors (accelerometer, gyroscope, barometer) [1]. Most of these solutions rely on careful hand-engineering, which is increasingly hard to adapt to edge cases.
There are two main approaches at the core of most proposed indoor localization systems: Pedestrian Dead-Reckoning (PDR) and WiFi-based location estimation (the popular version being WiFi Fingerprinting). In PDR, the starting location is assumed to be known, and specially engineered techniques (e.g., well-defined sets of formulations) estimate the travelled distance and the direction of movement to predict the following position. WiFi Fingerprinting systems compare the received signal strength with pre-recorded WiFi radio maps to estimate the most similar location.
Most current indoor localization systems build on these two techniques [2]. They operate on a firm set of mathematical formulations defining the possible mobility patterns, such as step counting, direction estimation and WiFi fingerprint matching. However, their accuracy is usually affected by variations in the signal distribution, so these formulations need frequent updating. This adds to the complexity of indoor positioning systems and impedes their easy deployment.
These specially engineered systems also fail when facing imperfect inertial sensor data (accelerometer, gyroscope and magnetometer). Drift and noise in the sensors lower the accuracy of location estimates. This limitation has encouraged various creative engineering solutions, such as particle filters [2,3], Kalman filters, graph conditions [4] and constraint modelling [5]. These systems rely on mathematical formulations that make firm assumptions about the possible conditions based on isolated observations of sensor state. To enable these services in complex environments, systems are tuned to each deployment environment [4,6]. Nevertheless, such hand-engineered mathematical models lose efficiency when the indoor environment changes, potentially with a modified radio propagation distribution. No indoor system has yet reached the level of acceptance that GPS has outdoors, owing to the uncertainty of indoor environments: it is costly to accommodate all possible conditions through manual system recalibration.
Many other approaches aim to combine the two techniques, with ingenious solutions for fusing sensors and independent estimations. These environment-oriented engineered positioning systems make firm assumptions about the possible conditions based on isolated observations in data, often restricting users' possible activities because these are harder to model with current mathematical formulations; irregular movement samples may even invalidate the initial models. As a result, such hand-engineered models fail when facing unpredictable situations that were never modelled into the system.
Instead of the aforementioned tediously engineered algorithms, we aim for a more convenient approach by relying on an end-to-end multimodal deep neural network. We believe that moving the focus from minutely understanding mobility patterns to learning cross-sensor patterns automatically from data is a more attractive proposition. This will allow positioning systems to be more flexible and robust, as models are continuously updated by training on fresh data. To our knowledge, this is the first proposed system using hybrid multimodal deep neural networks to perform a half-way feature fusion of the different sensing modalities available on modern mobile phones (WiFi and inertial sensors).
The contributions of this work are as follows:
• We replace the traditional methods of operating on sensor data for location estimation with end-to-end machine learning. This avoids hand-tuning models to approximate the data; instead, data representations are learned automatically.
• We deploy a recurrent neural network of Long Short-Term Memory (LSTM) units to model the sequential chain of estimations. This performs similarly to PDR: estimating a sequence of locations, starting from a known point and inferring the following points in the sequence from sensor data.
• We introduce the first use of modality fusion that applies a different neural network construction to each sensing modality to extract intra-modality features, which are then combined by higher layers across modalities.

Motivation and Related Work
Position estimation of smartphones inside buildings is difficult because GPS is unreliable in environments shielded by walls and ceilings. At the same time, other radio signals with deeper penetration (cellular and FM) are limited in the granularity of position estimation they can offer [7]. Alternative methods have been proposed to take advantage of the broader range of sensors available on smartphones [8]. However, none has produced a robust and scalable system for efficient indoor position estimation. We believe the reasons are: i) indoor spaces are too complex to model with limited and fragmented observations of the environment (limited data); ii) signal distortion occurs during propagation across the spectrum, from light and sound to radio frequency and magnetic fields; and iii) current systems rely on human-interpreted features extracted from data (e.g., engineered solutions to estimate the number of steps and the direction of movement).
Machine learning is among the most promising options, with validated performance on inference from noisy data in several fields, including computer vision, natural language processing and pattern recognition. It offers the advantage of automatically learning the correlation between hidden features of hyper-dimensional sensory data and target labels.
Using machine learning for indoor location tracking, i.e., predicting target locations from sensed data, has become a popular research topic. However, many approaches investigate only a single modality, which limits their applicability. Given the uncertainty of signal distributions in indoor environments, modelling signal propagation from scarce observations is challenging with simplistic techniques. Moreover, the indoor environment manifests through multiple modalities, so a mono-modal view limits the representation of indoor conditions.
To enhance the ability of the model to understand the location tracking task from different perspectives, we propose a multimodal machine learning approach that allows neural networks to capture in-depth features of natural representations and to learn correspondences within fused sensor data, merging inertial sensor data and RSS data into a uniform input for flexible and robust inference.
We believe that by relying on models with high generalisation that learn directly from data, and by taking advantage of growing data volumes, deep neural networks can tackle the aforementioned long-standing problems that limit indoor localization.
Our task is therefore to convert an engineering problem into a machine learning one: shifting the focus from manually identifying patterns and mathematically fitting propagation models to delegating this work to multimodal machine learning.

Dead Reckoning on inertial sensors
PDR builds on inertial sensors to estimate displacement distance and direction of movement. However, these sensors have limitations. Sensor drift is one of the most notorious problems, making it hard to double-integrate acceleration to estimate displacement [9]. The same problem affects the estimation of movement direction. Figure 1 shows the drift in gyroscope readings when estimating the direction of movement along a straight line. In just a few seconds, the accumulated sensor drift is substantial, changing the estimated direction of movement. Relying on the gyroscope alone is known to be insufficient, so engineered solutions use other sensors for recalibration [10,11]. To replace rigid engineered solutions, other researchers use machine learning to identify sampling characteristics in inertial sensors, for example for step-size estimation [12].

WiFi Fingerprinting on Received Signal Strength
The WiFi Fingerprinting localization approach consists of two phases: i) the training phase, commonly known as the offline phase, which collects samples to build the WiFi fingerprint dataset; and ii) the real-time phase, or online phase, which produces estimations from incoming observations [2].
Indoor spaces present a challenging radio propagation environment for WiFi, with multi-path effects, shadowing, signal fading and other forms of degradation and distortion. It is hard to model all possible states of the WiFi environment in the offline phase, so the online phase often encounters states of the environment it was never trained on, leading to erroneous estimations.
The main challenge in exploiting these sensor data is that erroneous signal silencing occurs during both data collection and real-time sampling. Figure 2 shows the complexity of WiFi samples through the histogram of a long scan at a random indoor location. Although many WiFi-based positioning systems model RSS as a Gaussian process [13,14], none of the five AP histograms fits a normal distribution tightly. In fact, AP2 shows a bi-modal distribution; AP4 and AP5 are right-skewed and so close in RSS that they would likely interfere with each other if on close channels; and AP1 has a wide distribution of observed RSS values, spanning almost 20 dBm. The difficulty of modelling WiFi environments is also exposed in [4]. We sampled the RSS from one AP at a fixed location. Figure 3 shows the fluctuating strength of APs; each histogram is drawn from one AP at a fixed location over a short period of time. A single polynomial density function cannot easily capture the wide variation of these histograms. As these AP signal distributions show, erroneous signal silencing occurs irregularly, and any slight change in the environment hinders accurate estimation. Unlike mathematics-based solutions, neural networks can treat these noisy data as specific characteristics of particular locations, rather than simply cancelling such observations out as noise or outliers. A model should therefore assimilate information from new data easily and capture more unexpected variations. Others have used deep neural networks on WiFi fingerprint signal strength for indoor localization [15], on WiFi signals with a formulation of the propagation model known as EZ [16,17], and more recently on Channel State Information (CSI) [18].

Multimodal Approaches
Multimodal approaches make estimations from multiple perspectives of cross-modality data. Filtering methods such as particle filters and Kalman filters have been proposed to address multiple data modalities. Specifically, HiMLoc uses particle filters to integrate inertial sensors with WiFi fingerprints, based on prior Gaussian-process observations for direction estimation, distance estimation, the correlation between samples and locations in buildings, and admissible human activity [2]. Similarly, WiFi-SLAM and Zee build on particle filters, emphasising their importance for random system initialisation [3], while Kalman filters have been used to integrate inertial sensing modalities [5]. Other engineered approaches include UnLoc, which combines sensing modalities based on empirical observations that some locations are unique across one or more sensors [6]; MapCraft, which uses conditional random fields [19]; and LiFS, which uses graph constraints to map position estimations onto the trajectory [4]. Similarly, WILL builds a connected graph to estimate location at room level [20].
However, when samples span multiple modalities, machine learning solutions show their advantage: they can learn correspondences between complementary modalities and capture in-depth features from natural representations, instead of focusing on a single modality without alternative feature inputs.
Neural networks across sensing modalities have not been used for indoor localization before, although they have been used for other context recognition tasks, such as human activity recognition [21]. We aim to customise an end-to-end multimodal deep neural network for the indoor localization task, producing location estimates from inertial sensor and WiFi fingerprint data. Training directly on data has its drawbacks, shifting the challenge to the quality of the training dataset and to cross-sensor modality alignment, although this can eventually be automated by other systems, such as vision-based ones [22].

Dead Reckoning with Recurrent Neural Networks
Dead Reckoning estimates continuous location by starting from an assumed known position and estimating consecutive locations from displacement and direction of movement. Recurrent Neural Networks (RNN) perform a similar role, with the advantage of memorising previous steps rather than relying entirely on new observations entering the system, which can be affected by local noise. Specifically, an RNN is an artificial neural network whose connections between nodes flow along a sequence. This structure gives the RNN the ability to capture temporal information in time-series data.
RNNs have proven their advantage on sequential data in tasks such as speech recognition, image captioning and machine translation. An RNN is similar to a feedforward neural network, except that recurrent connections feed a neuron's state at one step back into the network at the next step. This lets the model 'remember' the features from the previous loop [23]. Because the RNN transfers its state across loops, it can handle sequential data such as the inertial sensor data used in our experiment.
The RNN structure is shown in Figure 4: an unfolded basic RNN in which output and hidden nodes are organised into successive layers, connected in one direction to the next layer. Each neuron has an activation that varies along the time sequence. Errors are calculated at each step of the sequence, and the total error is the sum of the deviations from the target label values over the sequence. However, the basic RNN suffers from vanishing gradients when fed long sequences: it cannot capture dependencies between samples that are far apart. Long Short-Term Memory (LSTM) networks were introduced to solve this problem, and perform similarly to Dead Reckoning when estimating positions from streaming inertial sensor data [24].
LSTM is an optimised RNN model that solves the vanishing gradient problem of the basic RNN. This is achieved by adding a forget gate, which prevents the vanishing or explosion of backpropagated errors. Figure 5 shows the internal structure of the LSTM unit. Each unit has not only an input gate and an output gate, but also a forget gate that controls whether the 'memory' propagates into the next step or is forgotten in the current step [25]. The value in the current state is controlled by the forget gate f: the value is kept when the gate is set to 1 and forgotten when the gate is set to 0. Whether a new input is received or the state is propagated is determined by the input gate and the output gate, respectively [26]. Equations (1) to (6) give the numerical definitions; the block output is the element-wise product of the output gate value and the activated cell value. Figure 6 illustrates the unrolled chain of the LSTM network, where C_t is the long-term memory at time t and h_t is the block output at time t, or short-term memory, both transmitted to the following LSTM block in the chain.
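The referenced equations are not reproduced in this text. As a sketch, the standard LSTM gate equations consistent with the forget/input/output gate description above (common notation: σ is the sigmoid function, ⊙ the element-wise product, x_t the input, and [h_{t-1}, x_t] the concatenation of the previous output and the current input) are:

```latex
\begin{align}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate} \tag{1}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input gate} \tag{2}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate} \tag{3}\\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{candidate cell state} \tag{4}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{long-term memory} \tag{5}\\
h_t &= o_t \odot \tanh(C_t) && \text{short-term memory / block output} \tag{6}
\end{align}
```

Equation (6) is the product of the output gate value and the activated cell value, matching the C_t and h_t roles described for Figure 6.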
As the sensor data is presented as a time sequence, the LSTM model is well suited to location estimation. The LSTM reads the time-sequential inertial sensor data in samples of fixed size (Timestep × Features). The features of each data point are the magnitude values of the acceleration, gyroscope and magnetic field data. The number of data points per sample is determined by the chosen time window (here one second, sampled every 100 milliseconds). Each sample is assigned a target position in coordinates (X_i, Y_i), and the regression output of the LSTM model is the estimated position in coordinates (X_est, Y_est). During training, the LSTM updates its weights and state each time it reads new incoming sensor data and returns a location estimate. In this experiment, we propose a regression LSTM model, since the prediction consists of numerical latitude and longitude coordinates.
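The data flow above can be sketched as a minimal numpy forward pass: a 1 s window sampled every 100 ms gives 10 timesteps of 3 magnitude features, stepped through an LSTM cell whose output feeds a two-unit regression head. The weights here are random and untrained; the hidden size of 16 is a hypothetical choice for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, C, W, b):
    """One LSTM step: W maps the concatenated [h, x] to four stacked gate pre-activations."""
    z = W @ np.concatenate([h, x]) + b
    H = h.size
    f = sigmoid(z[:H])            # forget gate
    i = sigmoid(z[H:2 * H])       # input gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:])        # candidate cell state
    C = f * C + i * g             # long-term memory C_t
    h = o * np.tanh(C)            # short-term memory / block output h_t
    return h, C

# One sample: 1 s window at 100 ms steps -> 10 timesteps,
# 3 features per point (accelerometer, gyroscope, magnetometer magnitudes).
timesteps, n_feat, hidden = 10, 3, 16
window = rng.standard_normal((timesteps, n_feat))

W = 0.1 * rng.standard_normal((4 * hidden, hidden + n_feat))
b = np.zeros(4 * hidden)
W_out = 0.1 * rng.standard_normal((2, hidden))  # regression head -> (X_est, Y_est)

h, C = np.zeros(hidden), np.zeros(hidden)
for t in range(timesteps):
    h, C = lstm_step(window[t], h, C, W, b)

xy_est = W_out @ h  # estimated coordinates (untrained, for shape illustration only)
```

In practice this recurrence would be provided by a deep learning framework; the sketch only makes the (Timestep × Features) input and two-unit regression output concrete.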

WiFi Fingerprinting with Deep Neural Networks
Indoor positioning systems use an anchoring mechanism for recalibration, estimating the location from instant snapshots of the environment sampled by sensors. One such independent estimation can be achieved with WiFi fingerprints, which are regarded as unique at some positions [6], providing localization without continuous sensing. Here we show how WiFi Fingerprinting can be achieved with deep neural networks.
For periodic recalibration, WiFi is a reliable anchoring signal source, used extensively in previous research [2,3,6,20]. The deep neural network takes the RSS values of the access points observed at each sampling timestep as input features, with the (X_i, Y_i) coordinates of where the fingerprint was collected as target variables during the training phase, and produces two numerical outputs (X_est, Y_est) as the geographical location estimate during the online inference phase.
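The mapping from an RSS vector to coordinates can be sketched as a small feedforward regressor. Everything below is a hypothetical, untrained illustration: the AP count of 50, the 64-unit hidden layer and the random weights are assumptions, not the paper's configuration; only the [-100, -40] dBm normalisation range follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_aps = 50  # hypothetical number of unique APs observed in the building

def relu(z):
    return np.maximum(z, 0.0)

def normalise_rss(scan_dbm, lo=-100.0, hi=-40.0):
    """Linearly scale RSS from the observed [lo, hi] dBm range to [0, 1]."""
    return np.clip((np.asarray(scan_dbm, dtype=float) - lo) / (hi - lo), 0.0, 1.0)

# Untrained MLP regressor: n-dimensional RSS vector -> (X_est, Y_est)
W1 = 0.1 * rng.standard_normal((64, n_aps)); b1 = np.zeros(64)
W2 = 0.1 * rng.standard_normal((2, 64));     b2 = np.zeros(2)

scan = np.full(n_aps, -100.0)          # missing APs default to -100 dBm
scan[:3] = [-55.0, -70.0, -88.0]       # three hypothetical observed APs
x = normalise_rss(scan)
xy_est = W2 @ relu(W1 @ x + b1) + b2   # two continuous coordinate outputs
```

Training would fit the weights against the (X_i, Y_i) targets; the sketch only fixes the input/output shapes described above.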

Sensor Fusion by Multimodal Deep Neural Networks
By joining the inertial sensor and WiFi fingerprint modalities, their unique perspectives can contribute to more robust estimations. Similar to our previous work on multimodal deep learning for context recognition [21], here we explore the capacity of such a construction to combine the two aforementioned neural networks operating on each sensing modality. The MM-Loc neural network first reads the time-sequential inertial sensor data through the LSTM sub-network and the WiFi RSS data through the DNN sub-network synchronously, where the sample size at the LSTM side is (Timestep × Sensor_num) and at the DNN side (Timestep × AP_num). Each branch maps its modality to a 128-dimensional hidden output. These two parallel 128-unit hidden outputs are then concatenated into 256 units. The 256-dimensional joint vectors are fed through three fully connected (FC) layers with input sizes of 256, 128 and 64, and then into the top prediction layer. This operates as a regression, producing continuous two-dimensional outputs (X_est, Y_est) from the joint cross-modality hidden layers.
The LSTM contributes to the cross-modality component through its multiple LSTM layers, which work as a feature extractor passing hidden sensor features to the higher layers; the WiFi component likewise contributes as a WiFi feature extractor. We replace the two-unit regression outputs of the original single-modality networks with additional fully connected layers that transfer the extracted hyper-dimensional vectors to the joint layers. Both components pass their individual modality's hidden features to the cross-modality layers, where they are combined element-wise and passed to the final layer for the location regression. MM-Loc, containing the three sub-components of LSTM, DNN and cross-modality networks, is trained synchronously, computing the loss in the forward pass and splitting the gradients between the branches in the backward pass.
What is unique about this construction is that it can handle missing inputs from the WiFi modality. WiFi scans are produced at a much lower rate than inertial sensor readings, so when no WiFi scan is available the input is a vector of zeros (the normalised value of -100 dBm); the WiFi branch then contributes negligibly to the cross-modality component, and the majority of the contribution comes from the inertial sensor branch. When both modalities have inputs, the same multimodal architecture automatically adjusts the cross-modality weights to produce estimations based on both contributions.
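The fusion head described above (two 128-d branch outputs, concatenated to 256, passed through FC layers into a two-unit regression) can be sketched in numpy. The weights are random and untrained; the FC output sizes 128 and 64 are read off the stated input sizes (256, 128, 64, then prediction). A zero vector stands in for a timestep with no WiFi scan.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(z):
    return np.maximum(z, 0.0)

def dense(n_out, n_in):
    """Random (untrained) weight matrix for a fully connected layer."""
    return 0.1 * rng.standard_normal((n_out, n_in))

W_fc1, W_fc2, W_out = dense(128, 256), dense(64, 128), dense(2, 64)

def mm_loc_head(lstm_feat, wifi_feat):
    """Fuse the two 128-d branch outputs and regress (X_est, Y_est)."""
    joint = np.concatenate([lstm_feat, wifi_feat])  # 256-d joint vector
    h = relu(W_fc1 @ joint)                         # FC, input 256 -> 128
    h = relu(W_fc2 @ h)                             # FC, input 128 -> 64
    return W_out @ h                                # prediction layer -> 2

lstm_feat = rng.standard_normal(128)  # inertial branch hidden features
wifi_feat = np.zeros(128)             # timestep with no WiFi scan available
xy_est = mm_loc_head(lstm_feat, wifi_feat)
```

With a zero WiFi vector the corresponding columns of W_fc1 contribute nothing, so the estimate is driven by the inertial branch, mirroring the missing-input behaviour described above.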

Data
In this experiment we build the sensor fusion dataset ourselves, from data gathering through preprocessing to the final training dataset. The cross-sensor data are collected in two scenarios with ground truth locations. With light human intervention (interpolation, normalisation, overlapping, down-sampling and data alignment), we derive a uniform machine learning dataset for the proposed multimodal deep neural network.

Data Collection
Multiple modalities of sensor data are collected with an Android application designed specially for this task. The application can be configured to collect sensor data (accelerometer, magnetometer, gyroscope and WiFi scans) continuously in the foreground (Figure 9(b)), with a visual interface accepting user inputs, or in background mode, allowing the phone to be carried in a pocket with the screen off. To collect data with ground truth location information, the same application runs synchronously on two mobile phones. One collects ground truth locations as user input through a visual interface displaying the building map aligned to Google Maps coordinates (Figure 9(a)); a long tap on the map stores the latitude and longitude of the indicated location as ground truth. The second phone runs the same application configured to collect sensor data continuously in the background, so it can be placed in any position without user interaction while collecting data, resembling the perspective of sensors in natural motion. During data gathering, ground truth labels are transferred to the sensing device to annotate sensor samples with the user-provided location of their provenance. The first campaign involved walking along the corridor at different speeds, starting from one corner of the building, performing a circular trajectory and returning to the starting point. Ground truth locations were provided sporadically; to obtain location information at a finer granularity, locations were interpolated between consecutive inputs, assuming locally constant walking speed.
Specifically, during one round of data collection we build two datasets synchronously: Dataset 1, collected while walking along the corridor, contains the samples from the inertial sensors and WiFi scans; Dataset 2 contains the ground truth geographical labels, logged synchronously by tapping the screen to record latitude and longitude when passing key locations such as corners.
To train our model with better generalisation, we collect data in two scenarios: the middle floors of two crowded office buildings. Both are typical indoor environments, with multiple human activities increasing indoor complexity in terms of noise and variability. These complications include people frequently walking by, which affects how we walk along the corridors during data gathering; various electronic devices (elevators, computers, printers and portable devices) generating electromagnetic radiation; and building materials (reinforced concrete, metal and glass) influencing the signal propagation paths. Furthermore, considering that in practice people use different hardware for localization, we gather data with multiple mobile devices of different sensor sensitivities and sampling rates to add complexity to the dataset. Regarding motion during collection, we walk at different speeds and with different gestures to add variation to the samples. By integrating 14 rounds of sampling files collected by ourselves, we derive the multimodal machine learning datasets for both scenarios, illustrated in Table 1. Using the Android app, we export the raw log files of sensor fusion data from both scenarios. However, the Android API provides sensor samples on an event basis, when a value is considered to have changed, leading to uneven intervals between samples; the accelerometer, gyroscope and magnetometer each have their own refresh rate. We apply linear interpolation to fill the missing values between every two hardware-sensed values, generating continuous time-sequential data. These interpolated data are grouped into time windows, discussed later, each associated with one location.
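The resampling step above can be sketched with numpy's one-dimensional linear interpolation: event-based readings at uneven timestamps are evaluated on a uniform grid. The timestamps and values below are hypothetical.

```python
import numpy as np

# Event-based readings arrive at uneven times; resample to a uniform grid
# so every time window contains the same number of points.
t_event = np.array([0.0, 7.0, 19.0, 33.0, 41.0, 52.0])  # event timestamps (ms)
v_event = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 1.0])      # sensor readings

t_uniform = np.arange(0.0, 51.0, 10.0)   # uniform grid: 0, 10, ..., 50 ms
v_uniform = np.interp(t_uniform, t_event, v_event)
```

The same call can interpolate ground truth coordinates between two tapped locations, which corresponds to the locally-constant-walking-speed assumption used for the label interpolation.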
As we only record locations by tapping the map when passing distinctive points (corners, elevators, the kitchen, meeting rooms and other notable places along the corridors), we interpolate the geographical location for each time window between every two recorded locations.
To allow the phone to be placed in any orientation, in a pocket or a bag, without restricting motion, we exploit the fact that the inertial magnitude is the same irrespective of phone orientation. Position-invariant samples are obtained by computing, for each inertial sensor (accelerometer, gyroscope and magnetometer), the magnitude over the three orthogonal axes, |a| = sqrt(a_x^2 + a_y^2 + a_z^2). We thus derive the magnitude values of the inertial sensor samples for the time-sequential machine learning dataset. As the LSTM loads the time-sequential data by time window, a proper window selection contributes to lower computational cost and on-device power consumption while keeping acceptable inference quality. We explore the time window settings along three main factors. First, a short time window allows location updates at a higher frequency. The second consideration is inference time: although a large time window might yield better estimates, it requires more computational resources per inference, since more artificial neurons (more connections) are needed for a larger input size. At the other extreme is estimation accuracy, which benefits from a longer time window capturing more relevant information from the signal.
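The orientation-invariant magnitude can be computed as follows; the readings are hypothetical, and the second call shows the same reading with axes permuted and negated (as if the phone were rotated), giving the same magnitude.

```python
import numpy as np

def magnitude(xyz):
    """Orientation-invariant magnitude over the three orthogonal axes."""
    xyz = np.asarray(xyz, dtype=float)
    return float(np.sqrt(np.sum(xyz ** 2)))

m1 = magnitude([3.0, 4.0, 12.0])    # hypothetical accelerometer reading
m2 = magnitude([-12.0, 3.0, 4.0])   # same reading after a rotation of the phone
```

Both calls return 13.0, which is why the magnitude features do not depend on how the phone sits in the pocket.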
We explored four time-window settings of 10ms, 100ms, 1,000ms and 2,000ms, chosen to balance fast updates on location changes (inferring on a smaller time window) against capturing more information for better accuracy (larger time windows). In addition, we considered overlapping time windows (10%, 50%, 90%) to increase the frequency of location updates for a responsive system. Window overlapping also suits the LSTM model, since information from previous time windows is reinforced and emphasised by the overlap, strengthening correspondences between samples. Although grouping and overlapping enlarge the machine learning dataset with richer information for training, the side effect is an increased processing demand. Hence, to improve feed-forward speed with a smaller input size without losing too much information, we investigate down-sampling to compress the dataset. Figure 10 compares the original data and the down-sampled data: it shows the three magnitude values of the accelerometer, gyroscope and magnetometer with 1,000 data points (1,000ms) alongside the values down-sampled by as much as 90% linear compression. The loss of information is minimal across the two time windows, with the down-sampled signal still following the trend of the walking activity.
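Window grouping, overlapping and down-sampling can be sketched as plain slicing operations; the 1 kHz signal below and the helper names are hypothetical, with the 1 s window, 90% overlap and 90% compression taken from the settings discussed above.

```python
import numpy as np

def windows(signal, size, overlap):
    """Slice a 1-D signal into fixed-size windows with fractional overlap."""
    step = max(1, int(round(size * (1.0 - overlap))))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def downsample(window, keep_ratio):
    """Linear compression: keep roughly keep_ratio of the points."""
    factor = max(1, int(round(1.0 / keep_ratio)))
    return window[::factor]

signal = np.arange(3000)                        # 3 s of 1 kHz samples (hypothetical)
wins = windows(signal, size=1000, overlap=0.9)  # 1 s windows, 90% overlap
small = downsample(wins[0], keep_ratio=0.1)     # 90% compression: 1000 -> 100 points
```

With 90% overlap each new window advances by only 100 ms, which is what raises the location update frequency; down-sampling then shrinks each window's input size before it is fed to the network.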
The comparison in Figure 10 indicates that the main features are maintained even at high compression, reducing the sample frequency to 1 Hz. For this reason, we apply down-sampling in the experiments presented in the evaluation section.
Therefore, by applying time-window grouping, overlapping and down-sampling, we derive the ready-to-use inertial sensor dataset with a 1-second time window, a 90% overlap rate and 90% down-sampling compression, as explained later.

WiFi Fingerprint Data
Inputs to the neural network are vectors of the RSS values of each AP mounted inside the building. To construct this vector, we first scan the whole WiFi log file to identify all unique APs observed throughout the data collection process (a total of n APs inside the building), as well as the minimum and maximum signal strengths encountered, which are used to normalise the vector input to the [0,1] interval by linear scaling. By observation, the min-max interval is [-100,-40] dBm. Hence, APs missing from a WiFi scan are assigned a value of -100 in the n-dimensional input vectors. To keep the original features of the sampled data without unnecessary human intervention, we keep the occasionally seen personal hotspots, considered noise, in the dataset; this adds complexity and simulates real environments with changing WiFi signal distributions.
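The vector construction can be sketched as follows: APs found in the log file are given fixed indices, a scan fills in observed RSS values, missing APs default to -100 dBm, and the result is linearly scaled to [0,1]. The AP identifiers and the helper name are hypothetical.

```python
import numpy as np

def build_rss_vector(scan, ap_index, missing=-100.0, lo=-100.0, hi=-40.0):
    """Turn one scan {ap_id: rss_dbm} into a fixed n-dim vector scaled to [0, 1]."""
    v = np.full(len(ap_index), missing)      # unseen APs stay at -100 dBm
    for ap_id, rss in scan.items():
        if ap_id in ap_index:
            v[ap_index[ap_id]] = rss
    return (v - lo) / (hi - lo)              # linear scaling to [0, 1]

# Hypothetical AP list identified by scanning the whole log file
ap_index = {"ap:01": 0, "ap:02": 1, "ap:03": 2}
vec = build_rss_vector({"ap:01": -55.0, "ap:03": -40.0}, ap_index)
```

Because the index map is built from the whole log file, occasionally seen hotspots simply become additional dimensions, matching the decision above to keep them in the dataset.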

Sensor Fusion Dataset Alignment
The main challenge in aligning inertial sensor data with WiFi data is that the WiFi sampling frequency is significantly lower than the inertial sensing rate, due to hardware limitations.
Observing the original (event-based) refresh rates in the log files, the inertial sensors update on average every few milliseconds, while WiFi updates at intervals of (non-integer) seconds. The amount of time-series sensor data is therefore significantly larger than the number of WiFi samples, even after grouping the inertial sensor data into one-second samples.
If we simply combine the two modalities (one second of grouped inertial sensor data with one WiFi scan as a cross-sensor datapoint), the cross-sensor dataset is sparse and unbalanced, because WiFi scans arrive event-based with gaps of several seconds between updates. To eliminate this sparsity and make the WiFi data compatible with the grouped sensor data, we squeeze the WiFi scans from their original sampling distribution into 100-millisecond time bins, which matches the LSTM time-window setting. For instance, if the first 0.1-second interval of the raw log file contains three WiFi scans across all accessible APs, we compress these three scans into a single sample for that interval. We thus obtain a denser WiFi fingerprint dataset that can be aligned with the grouped inertial sensor data, as illustrated in table 2.
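The binning step can be sketched as below. The text does not specify how multiple scans within one bin are merged, so averaging the per-AP RSS is an assumption made for illustration.

```python
# Sketch: squeeze event-based WiFi scans into fixed 100 ms bins so they
# align with the LSTM time windows. Timestamps are in milliseconds.
# Merging by per-AP average is an assumed rule, not stated in the paper.
from collections import defaultdict

BIN_MS = 100

def compress_scans(scans):
    """scans: list of (timestamp_ms, {ap: rss}); returns {bin_index: {ap: rss}}."""
    bins = defaultdict(list)
    for ts, scan in scans:
        bins[ts // BIN_MS].append(scan)
    merged = {}
    for b, group in bins.items():
        acc = defaultdict(list)
        for scan in group:
            for ap, rss in scan.items():
                acc[ap].append(rss)
        merged[b] = {ap: sum(v) / len(v) for ap, v in acc.items()}
    return merged

out = compress_scans([(10, {"ap1": -60}), (40, {"ap1": -70}), (150, {"ap1": -65})])
print(out)  # bin 0 averages the two early scans; bin 1 keeps the third
```

Any datapoint falling in the same 100 ms bin is collapsed into one sample, producing the denser fingerprint sequence used for alignment.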

Table 2. WiFi Fingerprint Dataset Compression
For time steps that contain only inertial sensor samples, with no accessible WiFi signal from any AP, we use -100 dBm to represent the missing WiFi scans of all APs. These entries are added in parallel so that the multimodal dataset has a uniform size.
As the two synchronously logged datasets contain not only inertial sensor and WiFi RSS samples but also ground-truth location information for the same time duration, the time record is used to match the modalities with geographical labels. Location labels are normalised to the extreme boundaries chosen for the building and scaled to the [0,1] interval. Neural network estimates are converted back into coordinates in metres, and the estimation error is measured as the euclidean distance between predicted and target locations. Table 3 illustrates the components of the cross-sensor dataset after alignment.
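The label scaling and error metric can be sketched as follows; the building extents used here are hypothetical placeholders for the "extreme boundaries chosen for the building".

```python
import math

# Sketch of label scaling to [0,1] and the euclidean error in metres.
# X_MIN/X_MAX/Y_MIN/Y_MAX are assumed building bounds, not from the paper.
X_MIN, X_MAX = 0.0, 80.0   # building extent in metres (one axis)
Y_MIN, Y_MAX = 0.0, 40.0   # building extent in metres (other axis)

def normalise(x, y):
    """Map metre coordinates into the [0,1] training interval."""
    return (x - X_MIN) / (X_MAX - X_MIN), (y - Y_MIN) / (Y_MAX - Y_MIN)

def denormalise(nx, ny):
    """Map a network output back to metre coordinates."""
    return nx * (X_MAX - X_MIN) + X_MIN, ny * (Y_MAX - Y_MIN) + Y_MIN

def error_metres(pred_norm, target_norm):
    """Euclidean distance between prediction and target, in metres."""
    px, py = denormalise(*pred_norm)
    tx, ty = denormalise(*target_norm)
    return math.hypot(px - tx, py - ty)

print(normalise(40.0, 20.0))                       # centre of the building
print(error_metres((0.5, 0.5), (0.5, 0.25)))       # vertical offset in metres
```

Working in normalised coordinates keeps the regression targets well-conditioned, while errors are always reported back in metres.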

Evaluation
This section presents the evaluation of each independent modality-specific neural network architecture, followed by the evaluation of multimodal deep neural network implementations with different fusion architectures that combine the features extracted by each modality branch.

Recurrent Neural Network on Inertial Sensors
Here we present our exploration to identify the model structure and parameters needed to calibrate the LSTM for the best positioning performance, in terms of time-window settings, overlapping ratio and data compression.

Time Window
As the LSTM reads the data in time windows, a well-selected time window allows the model to capture enough detail to accurately estimate movement. A larger time window trades granular detail for larger-scale observations: it includes more information about a range of movements, but is computationally demanding for on-device inference and slower in providing location updates. In contrast, a smaller time window captures minimal information and cannot discriminate between different walking speeds or between very similar activities, such as moving on a flat surface and climbing stairs, although it is more computationally friendly to mobile devices since the input layer is smaller.
Here we trained the LSTM model with time windows of 10 ms, 100 ms, 1 second and 2 seconds, with the model settings listed in table 4. Figure 11 shows the LSTM model performance for the various input time windows, presented as Cumulative Distribution Function (CDF) charts. As a general observation, the time-window setting has little impact on inference accuracy. The training observation is confirmed on the test set, as seen in the CDF plot 11(b), showing that the chosen time windows have minimal impact on detection accuracy. The 1,000 ms model performs similarly to the others, and this larger time window allows the model to capture variations and activities relevant when transitioning between different indoor areas (elevator, cafe). The estimation error is still high, since this is only the first parameter we optimise. We therefore select the 1,000 ms time window for the following explorations.

Overlapping Ratio
This evaluation presents the outcome of changing the overlapping ratio with the time window fixed at 1 second. The overlapping ratios we experiment with are 30%, 50% and 90%, which increase the amount of training data by 1.3×, 2× and 9× respectively.
There are two main reasons for using overlapping. The first is to strengthen the dependency between consecutive time windows through the repeated information in the overlapping portion; for the LSTM, this enhances the memory across adjacent windows, as the model experiences portions of the recent action through multiple inputs. The second is to synthetically enlarge the training dataset, since larger training sets are beneficial when training neural networks. Figure 12 shows the training process of the three models with the enhanced training set and their performance on the validation set. During training, the model trained with 90% overlapping data consistently outperforms the models trained on 30% and 50% overlapping data. In the test-set CDF shown in figure 12(b), the 90% overlapping model performs similarly to how it did in training. This is an impressive result, considering it is based on nothing more than inertial sensor data, with little to calibrate on apart from occasional well-located changes of direction (e.g., when going around a corner). The route covered long stretches of straight-line corridors (up to 60 metres) walked at various speeds, conditions that the LSTM estimator identifies well. These results illustrate that by increasing the overlapping rate, models learn features better because they have more data to train on. Compared with models without window overlapping, performance improves, and the 90% overlapping model performs best. We use this enhancement for the following model exploration.
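The window segmentation with an overlap ratio can be sketched as follows; the step size derived from the ratio is the standard sliding-window formulation, and the sample counts here are illustrative.

```python
# Sketch: sliding-window segmentation with an overlap ratio. A 90% overlap
# on a 1 s window steps forward by only 100 ms, multiplying the number of
# training samples extracted from the same recording.
def sliding_windows(samples, window, overlap):
    """samples: a sequence of readings; window: samples per window;
    overlap: fraction in [0, 1) shared between consecutive windows."""
    step = max(1, int(window * (1 - overlap)))
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]

data = list(range(20))
print(len(sliding_windows(data, 10, 0.0)))   # no overlap: few windows
print(len(sliding_windows(data, 10, 0.9)))   # 90% overlap: many more
```

The same recording yields roughly 1/(1 - overlap) times as many windows, which is the synthetic data increase described above.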

Data Compression
We implement simple downsampling and Principal Component Analysis (PCA) strategies for data compression to reduce training and inference time. As shown in figure 10, simple downsampling has little impact on the original signal representation. Hence, we apply downsampling (picking one datapoint every 100 ms) to the training data with the 90% overlapping ratio, so that one sample is reduced from size (1000 * 3) to (10 * 3). For PCA compression, the dimension is reduced to the same (10 * 3) size per sample: new variables in a lower dimension are computed from the eigenvalues and eigenvectors and replace the original higher-dimensional variables. This avoids feeding the LSTM a heavy time-step input of 1,000 samples, which would increase the model's complexity and learning time. Figure 13 compares the downsampled and PCA-based models against the uncompressed LSTM model with 90% overlapping. Both CDFs show that the downsampled model performs better than the uncompressed model, reaching 80% prediction accuracy at a precision of 8 metres on the test set. The PCA-based model performs slightly better than the uncompressed model on both the validation and test sets. In general, the downsampled model has the highest accuracy and reliability compared with the PCA-based and uncompressed models. By downsampling the overlapping data, the model reduces its complexity through a smaller time step, saving computation time, and it can handle data with a larger time window without losing significant features.
Specifically, a model with a time step of 10 can cover data with a time window of 1,000 ms, which allows the time window to be increased further with little added model complexity, potentially letting the model process a wider variety of activities. It also widens the model's tolerance for a more extensive training dataset within a shorter time span, compared with feeding in uncompressed data. This is vital when integrating the model into mobile devices, as improved prediction efficiency reduces application response time and saves power. Figure 14 summarises all models explored in this experiment. Starting with selecting a suitable time window from 10 ms, 100 ms, 1 second and 2 seconds, we chose 1 second as it allows more data variation. We then improved estimation accuracy by increasing the training-set size through overlapping; with 90% overlapping, the model shows a significant improvement in localisation accuracy. To reduce model complexity and training time, we compressed the training data by downsampling and PCA dimension reduction. We selected the LSTM model with downsampling on top of the 90% overlapping ratio, shown by the blue line, as the inertial sensor localisation model that balances estimation accuracy and efficiency.
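The two compression strategies can be sketched as below. The dataset shape is a toy assumption, and the per-axis PCA formulation (learning components across windows) is one plausible reading of the (1000, 3) to (10, 3) reduction described above.

```python
import numpy as np

# Sketch: compress (1000, 3) inertial windows to (10, 3), either by keeping
# one datapoint every 100 ms or by PCA. The dataset here is random toy data.
rng = np.random.default_rng(0)
windows = rng.standard_normal((200, 1000, 3))  # 200 one-second 3-axis windows

# Strategy 1 -- downsampling: keep one datapoint every 100 samples (100 ms).
down = windows[:, ::100, :]                     # shape (200, 10, 3)

# Strategy 2 -- PCA per axis: project each 1000-step axis signal onto the
# top-10 principal components learned across the training windows.
def pca_axis(X, k=10):
    """X: (n_windows, 1000) signals for one axis; returns (n_windows, k)."""
    Xc = X - X.mean(axis=0)
    # right singular vectors of the centred data = principal directions
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

pca = np.stack([pca_axis(windows[:, :, a]) for a in range(3)], axis=-1)
print(down.shape, pca.shape)                    # both (200, 10, 3)
```

Either way, the LSTM sees 10 time steps instead of 1,000, which is the complexity reduction the text motivates.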

Deep Neural Network on WiFi Fingerprints
WiFi scans are received through the Android API at an irregular, event-based frequency, with an average update rate of about one second. We use a Deep Neural Network (DNN) as our WiFi fingerprinting model, which takes the WiFi scans from sensed access points at each sampling time step as input features and produces a location estimate as output.

Model Structure
Based on the same training dataset, we explore the impact of DNN model depth with 3-layer, 6-layer and 9-layer settings. As we observe from the CDF results in figure 15, the 3-layer DNN regression model produces better inference accuracy than the deeper networks. The models show extremely similar estimation results in the testing CDF 15(b), which confirms the models' ability to generalise.
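A forward pass through such a 3-layer regressor can be sketched as below; the hidden widths, AP count and random weights are assumptions for illustration, not the paper's trained parameters.

```python
import numpy as np

# Sketch of a 3-layer DNN regressor on WiFi fingerprints: an n-dimensional
# normalised RSS vector -> two hidden ReLU layers -> linear (x, y) output.
# Layer widths (128, 64) and n_aps are assumed; weights are random.
rng = np.random.default_rng(42)
n_aps = 120                                   # assumed number of APs

def dense(n_in, n_out):
    """He-initialised weight matrix and zero bias for one layer."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2 / n_in), np.zeros(n_out)

layers = [dense(n_aps, 128), dense(128, 64), dense(64, 2)]

def forward(x):
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)            # ReLU on the hidden layers
    return x                                  # linear output: normalised (x, y)

batch = rng.random((8, n_aps))                # 8 normalised fingerprints
print(forward(batch).shape)                   # (8, 2)
```

The final layer is linear rather than activated, matching the regression formulation of the location output.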

Overview of DNN for WiFi Fingerprints
The WiFi-based estimator is modelled with a deep neural network regression model. By exploring the model structure with parameter tuning, we determine a three-layer DNN on WiFi Fingerprinting. We use this WiFi model architecture as the sub-component network integrated into the fusion multimodal deep neural network.

Multimodal Deep Neural Networks on Sensor Fusion
Having determined the model structure for each modality, we move our focus to fusing the modality-specific network representations into a unified multimodal deep neural network (MDNN).

MDNN Integration
To explore the best fusion architecture for the multimodal deep neural network (MDNN) that learns the modality-specific features of the LSTM and DNN sub-networks, figure 8 presents our candidate feature-fusion designs. We customise four types of fusion network: two hybrid element-wise fusions (concatenation and multiplication), a hybrid residual-connection fusion, and a late-fusion structure.
Element-wise Fusion: The MDNN with element-wise fusion architecture is shown in table 6. The fusion layer takes the 128-dimensional modality-specific hidden outputs from the LSTM and DNN sub-networks and combines them element-wise, either by concatenation (128 * 2) or by multiplication (128). The fused matrix then passes through fully-connected joint layers of size 128 and 64 and is finally regressed to two outputs, the location prediction (X_est, Y_est).

Residual Connection Fusion: Because the information content of the WiFi feature input is relatively incompatible with that of the time-sequential inertial sensor data, we add a residual connection that transfers the hidden output of the WiFi branch's penultimate fully-connected layer (128) to the joint layer, fusing it with the LSTM and DNN last FC-layer outputs (128 * 2). This gives a 128 * 3 representation for the sensor-fusion component, which performs the final location estimation.

Late Fusion: Table 2 presents the MDNN with late-fusion architecture. We combine the outputs of two separate LSTM and DNN models, which produce the coordinate estimates (X_Sensor, Y_Sensor) and (X_WiFi, Y_WiFi) respectively, and build additional neural network layers on top of these predictions. The estimates form a four-dimensional feature space, which provides the representations the top layers need to estimate the final latitude and longitude (X_Fusion, Y_Fusion).

The evaluation results are shown in figure 16. We evaluate these models on the aligned multimodal dataset collected from two buildings (deployment scenarios), splitting the datasets 65%, 25% and 10% for training, validation and testing, respectively. The concatenation fusion model performs best, with 1.98 metres precision at the 80th percentile, followed by the residual fusion and multiplication fusion models. Although the late fusion model has a relatively poor accuracy of 3.7 metres, it is still approximately 2× better than the sensor-based estimator.
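The concatenation-fusion joint layers can be sketched as below; the 128/64 widths and the (128 * 2) concatenation follow the text, while the random weights and batch size are illustrative assumptions.

```python
import numpy as np

# Sketch of the concatenation-fusion joint network: the 128-d LSTM and
# 128-d DNN branch representations are concatenated (128 * 2) and passed
# through 128 -> 64 -> 2 fully-connected layers. Weights are random here.
rng = np.random.default_rng(0)

def fc(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.05, np.zeros(n_out)

w1, b1 = fc(256, 128)
w2, b2 = fc(128, 64)
w3, b3 = fc(64, 2)

def fuse(h_lstm, h_dnn):
    """h_lstm, h_dnn: (batch, 128) modality-specific hidden outputs."""
    h = np.concatenate([h_lstm, h_dnn], axis=1)   # (batch, 256)
    h = np.maximum(h @ w1 + b1, 0.0)              # joint FC 128 + ReLU
    h = np.maximum(h @ w2 + b2, 0.0)              # joint FC 64 + ReLU
    return h @ w3 + b3                            # (batch, 2): (X_est, Y_est)

xy = fuse(rng.random((4, 128)), rng.random((4, 128)))
print(xy.shape)                                   # (4, 2)
```

Because both branches feed one joint network, gradients from the location loss flow uniformly back into the LSTM and DNN sub-networks during training, which is the property MM-Loc relies on.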
In terms of single-modality estimator performance, the WiFi model performs significantly better than the sensor model, with 2.6× better accuracy: at the 80th percentile, the WiFi model has an error of 2.6 metres while the sensor model has a 6.9-metre prediction error.
Based on the evaluation results on the scenario B dataset shown in figure 16(b), we observe similar performance from these models, with the best performance again achieved by the concatenation fusion MDNN. Regarding single-modality accuracy, the WiFi model has an error of 3 metres and the sensor-based model an error of 6.2 metres.

Overview of MDNN on Sensor Fusion
By evaluating the performance of the MDNN under the different fusion architectures (concatenation, multiplication, residual connection and late fusion), we observe that the error of the hybrid fusion multimodal model is significantly lower than that of the other fusion models and of the modality-specific sensor and WiFi estimators: the median error over the test set is only 1.98 metres. The CDF plot in figure 16 shows that 90% of errors are below 4 metres in both scenarios. The hybrid concatenation MDNN has the best estimation accuracy of all models in both scenarios, which gives us confidence in the generalisation power of the models. We use concatenation of features as the default fusion method of MM-Loc for the rest of this work.

MM-Loc Evaluation
Here we evaluate the performance of all models on data collected from the two scenarios, but using different WiFi sampling frequencies as input. Specifically, for scenario A, where the default WiFi sampling rate sourced from the system is 10 Hz, we reduce the scanning frequency of the dataset from 10 Hz to 5 Hz and 1 Hz by applying a filter. The purpose of adjusting the WiFi frequency is to assess the impact of this energy-saving, scan-reduction strategy on location estimation accuracy; it also shows how our model behaves in systems where a high refresh rate is not available. For scenario B, we decrease the WiFi sampling frequency from the original 1 Hz to 0.1 Hz and even 0.05 Hz for the same reasons. Figure 18 compares the performance of MM-Loc running at different WiFi sampling frequencies against the single-modality baseline models. Generally, MM-Loc with the highest sampling rate performs best: its median accuracy is within 2 metres error for 80% of predictions, which is 3.5× better than the sensor baseline model. Comparing figures 18(a) and 18(b), we observe that in scenario A, MM-Loc reaches a better accuracy of 2.6 metres median precision at the intersection point with the WiFi model; however, the WiFi model performs consistently well even in the extreme cases, with a maximum error better than that of any other model. In scenario B, MM-Loc performs better than any other model. We also observe that as the sampling rate decreases, the multimodal model's prediction accuracy follows the same trend; even with intermediate refresh-rate data, MM-Loc still predicts with approximately 4 metres precision. Hence, a properly chosen sampling rate keeps on-device computing cost and power consumption low while maintaining reliable positioning accuracy. Figure 20 presents the footpath predicted by MM-Loc in the two scenarios.
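The rate-reduction filter can be sketched as below; the exact filter used in the paper is not specified, so keeping the first scan per target period is one simple, assumed implementation.

```python
# Sketch: simulate a lower WiFi refresh rate by keeping only the first
# scan in each target period (e.g. 10 Hz -> 1 Hz with period_s = 1.0).
# This is an assumed filter; the paper does not detail its implementation.
def decimate_scans(scans, period_s):
    """scans: list of (timestamp_s, scan), timestamps ascending;
    returns the first scan observed in each period of length period_s."""
    kept, last_bin = [], None
    for ts, scan in scans:
        b = int(ts // period_s)
        if b != last_bin:             # first scan seen in this period
            kept.append((ts, scan))
            last_bin = b
    return kept

scans = [(0.0, "a"), (0.1, "b"), (0.2, "c"), (1.0, "d")]
print(decimate_scans(scans, 1.0))     # -> [(0.0, 'a'), (1.0, 'd')]
```

Applying the same filter with different periods yields the 5 Hz, 1 Hz, 0.1 Hz and 0.05 Hz variants evaluated above without re-collecting any data.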
The errors are presented as red lines showing the distance between the ground-truth coordinates and the coordinates predicted by MM-Loc. In scenario A, the accuracies at the first quartile (Q1), second quartile (Q2) and third quartile (Q3) are 0.332, 0.697 and 1.562 metres respectively, while in scenario B, Q1, Q2 and Q3 are 0.521, 1.028 and 2.145 metres respectively. Box plots of MM-Loc in both scenarios are shown in figure 19.

MM-Loc Visualisation
We observe that MM-Loc predicts the footpath along the corridor with high fidelity, with clear estimation boundaries. However, some predictions are over 5 metres away from the ground truth, especially at the corners of corridors. This is likely an effect of the WiFi component's difficulty in making observations near corners. The other source of error is the magnetic interference present in some places along the pathway (elevators and heavy iron in the building materials).

Discussion
In this work we show that traditional smartphone indoor localisation methods can be modelled by an end-to-end multimodal deep neural network over diverse sensor streams. This starts with individual feature extractors specific to each modality (using RNNs and DNNs), whose representations are then fused for the final inference through a joint neural network architecture. With this, we move the effort from engineering each modality component (step counting, direction estimation) and common integration methods (particle filters, Kalman filters and graph-based constraints) to a purely data-driven ML approach.
Our data-driven fusion approach builds entirely on the quality and volume of data, without engineering preliminary features or making assumptions about how the system is used. Previous systems fail when ported to new environments because of built-in assumptions about the scene; by learning only from data, our system is generalisable, requiring just labelled data.
Because the fusion methodology merges the distinct modality features extracted by the sub-components, the system can potentially be extended to accommodate more variation and additional signal sources for positioning (such as light, environmental noise, humidity and air pressure).
There is still room for improvement in estimation accuracy. Despite our exploration using a low volume of training data, we show that even this limited training set is enough for our end-to-end multimodal DNN solution. This method moves the effort entirely onto the quality of the training data. Although data collection remains a hard challenge, we believe this is the only way to capture the fine details commonly missed by traditional modelling approaches; that information would always be there to observe and train on if larger volumes of data became available.
In the future, this data collection can be automated, either by robots roaming the indoor space to update WiFi radio maps or by mass unlabelled data collection from users roaming naturally in the environment. With labelling solutions based on computer vision [22], this data can be used fruitfully.

Conclusions
In this work, we show how the task of indoor localisation can be modelled with multimodal deep neural networks for sensor fusion. We model the traditional methods for indoor localisation, WiFi fingerprinting and dead reckoning, with neural networks. These are capable of performing location estimation independently, achieving 2.8 metres with the WiFi DNN model and 6.5 metres with the inertial RNN model. We use a multimodal structure to bridge features from the independent models into a more robust location estimation approach. Our multimodal deep neural network achieves a performance of 1.9 metres and eases deployment by learning directly from data.