Enhancing Handover for 5G Mobile Networks using Jump Markov Linear System and Deep Reinforcement Learning

Abstract: Fifth Generation (5G) mobile networks use millimeter waves (mmWaves) to offer gigabit data rates. However, unlike microwaves, mmWave links are prone to user and topographic dynamics. They are easily blocked and end up forming irregular cell patterns in 5G, which in turn causes too-early, too-late, or wrong handoffs (HOs). To mitigate HO challenges, sustain connectivity, and avert unnecessary HOs, we propose a HO scheme based on the Jump Markov Linear System (JMLS) and Deep Reinforcement Learning (DRL). JMLS is widely known to account for abrupt changes in system dynamics, while DRL has emerged as an artificial intelligence technique for learning high-dimensional and time-varying behaviors. We combine the two techniques to account for time-varying, abrupt, and irregular changes in mmWave link behaviour by predicting likely deterioration patterns of target links. The prediction is optimized by meta-training techniques, which also reduce the required training sample size. The JMLS-DRL platform thus formulates intelligent and versatile HO policies for 5G. Results show that our proposed prediction of target link behavior post HO is highly reliable. The scheme also averts unnecessary HOs and thereby supports longer dwell times.


Introduction
Fifth generation (5G) mobile users will need uninterrupted connectivity while consuming large amounts of data and media content when commuting [1]. The millimeter wave (mmWave) bands (i.e., 30−300 GHz of the radio spectrum) hold great potential to enable 5G mobile users to experience gigabit rates and networks to meet traffic demands. However, a caveat is that mmWave communication is very susceptible to topographic and user dynamics. Common materials like concrete, water, and even human bodies/movements, among others [2], severely alter its cell patterns and ultimately its performance. This level of vulnerability in the mmWave bands severely impacts mobility management in 5G mobile networks. To reduce that impact, research on efficient mobility management in 5G mmWave communication continues to gain momentum.
In the recent past, 5G mobility management solutions have been awash with machine learning and artificial intelligence (AI) solutions, including Deep Learning and Reinforcement Learning (RL) handoffs (HOs). The challenge is that most previous HO works [5,7] select target cells based on initial maximum network performance values, yet an optimum initial value does not always guarantee the reliability of the connection after HO. For instance, selecting mmWave target links based on the highest SINR values [4][5][6][7] does not always reveal the reliability of the link after a HO event. In most cases, HOs end up being executed too early, too late, wrongly, or wastefully. As a result, 5G mobile network performance is punctuated with gradual and abrupt changes. To reduce inconsistencies in network performance, selecting the best target links requires understanding not just the immediate behavior after HO but also the long-term behavior, i.e., post HO.
To that effect, we propose a HO scheme that learns not just the immediate behavior of target links but also their likely behavior/pattern post HO. In this regard, we learn to predict the deterioration patterns of potential target links post HO. We use the Jump Markov Linear System (JMLS) and Deep Reinforcement Learning (DRL) to learn the feasible optimal deterioration pattern that chosen target links must adhere to for them to avoid wasteful HOs. JMLS is known to account for abrupt changes [7] in system dynamics. We exploit this capability to predict the likely deterioration pattern of the power receivable from target links at the user. We strategically update the initial JMLS deterioration pattern with online DRL and meta-training techniques. Meta-training is a technique that reuses similar past training data sets to make new decisions. This reduces requests for new training data sets when making new decisions in novel locations. At HO, the predicted deterioration pattern of a target link is then compared against an optimal global desired deterioration pattern to gauge the reliability of the target link and select the most stable one.

Contributions
• We propose to use JMLS to model the deterioration behavior/pattern of mmWave target links and to formulate HO policies for 5G mmWave networks. Given JMLS's ability to account for abrupt changes [7], we analyze the pattern and learn to predict the extent of abrupt performance changes in the chosen target mmWave links before HO.
• We use DRL to update and optimize JMLS deterioration pattern predictions and learning. To help reduce the number of training samples, and thus have ample time to track pattern changes of a rapidly varying channel in real time, we propose using meta-learning techniques. Meta-learning is a technique that automatically adapts and reuses training data from related past tasks or neighbours to make new decisions. This reduces the need for new CSI/training data sets to make new decisions.

• We use the Kaiser-Meyer-Olkin (KMO) test to measure the expected divergence of target links from the optimum deterioration pattern post HO and thereby gauge their reliability.

Related Works
The surging role/potential of mmWave bands in mobile networks such as 5G and beyond cannot be ignored. However, neither can its challenges, particularly in the mobility management support of 5G networks. The authors in [23], for instance, claim that the higher propagation losses inherent in mmWaves must be addressed to sustain connectivity, especially at ranges beyond 100 meters and in non-line-of-sight (NLOS) settings. The authors in [24] take four directions to tackle the crucial problem of distance limitation owing to the high spreading loss and molecular absorption that often limit mmWave transmission distance and coverage range: a physical-layer distance-aware design, ultramassive MIMO communication, reflect arrays, and intelligent surfaces. These approaches use machine learning and AI for 5G. The author in [24] suggests a move from centralized (used in most 4G systems) to decentralized mobility management algorithms using DRL. DRL in 5G can ably learn and build knowledge about the different dynamics of mmWave channels. For instance, by interacting with environment data, the authors in [11] utilized DRL to observe the available resources at network edges and provide a resource allocation scheme. This enhances user mobility management at the edge given user mobility context, transitions, and signaling exchange.
Exploiting actor-critic DRL, the authors in [8] proposed to jointly solve offloading and resource allocation problems in fog networks. The authors in [12] used a deep Q-learning based task offloading scheme to select optimal BSs for users and maximize task offloading utility. In [13], Q-learning integrates a Mobility Robustness Optimization (MRO) scheme with a Mobility Load Balancing (MLB) scheme to tackle traffic load and speed effects in 5G. However, none of these schemes adequately considers highly mobile and dynamic users. Additionally, DRL requires thousands of samples to gradually learn useful policies [15]. Besides, DRL behaves very unstably/stochastically when learning systems with large local variances [16].
Thus, to guarantee continuous connectivity for 5G mobility, i.e., by not just satisfying channel input/state bounds but also considering abrupt and continuous disturbances, control approaches using Markov systems have been proposed in the literature. For instance, [20] uses JMLS with Expectation Maximization (EM) to predict abrupt deterioration behaviour. It then enhances predictions using Viterbi algorithms. The Viterbi algorithm, however, requires accurate Channel State Information (CSI) to converge. In such cases, paper [26] argues that inaccurate training gradually cripples the accuracy of predictions, particularly at low signal-to-noise ratios (SNRs). To that effect, the author combines it with meta-data training, making the proposed Viterbi approach more reliable and less dependent on the variability and accuracy of the data. In [18], to tackle distributed decision-making scenarios, the author extends the JMLS formulation into a game-theoretic technique. Similarly, [17] incorporates particle-filter-based RL in JMLS to predict a finite number of disturbances within a randomly chosen sample of trajectories. This allows the scheme to track/adjust to time-varying conditions in real time.

Organization
The remainder of this paper is organized as follows. Section II describes the proposed framework and its operation. Section III describes the resource allocation and optimization problems. Sections IV and V present the adoption of the JMLS-DRL solution. Sections VI and VII present the simulation results and the conclusion, respectively.

Proposed Framework
We propose to use the likely received power pattern, supplemented with SINR values, to determine the best mmWave target cell/link. We first learn to predict and then analyze the received power deterioration pattern for four different types of users with respect to mmWave BSs: cars, pedestrians, cyclers, and ebikers. For each user type, prior to HO selection, the scheme learns the likely deterioration pattern of the mmWave user's received power given the effects of speed, topography, and channel state. The best target link is one whose likely deterioration pattern with distance is gradual and follows the global deterioration pattern generated from aggregated data samples from multiple mmWave BSs. The received power deterioration pattern is modelled using JMLS. It models how the received power will likely deteriorate for a user given NLOS and distance effects on the mmWave channel. Thus, in the first instance, the model learns and determines the desired optimal received power deterioration patterns for different user types using Expectation Maximization (EM) [6]. EM automatically infers missing values of the link deterioration pattern over some states. Even though EM is robust, it does not anticipate dynamic channel changes [10]. The EM estimates are thus optimized using DRL and meta-training techniques.
Meta-learning is loosely defined as an automatic learning and adaptation mechanism that improves accuracy by typically acquiring training from related tasks/users. The scheme only requires new training samples when the prediction error is bigger than the assumed prediction threshold. At HO, we have two deterioration patterns to consider: a global deterioration pattern formulated with aggregated data from all mmWave BSs, and a current local deterioration pattern formulated using local/individual BS channel data. Owing to the larger data variance analyzed, the global pattern is regarded as more accurate. Thus, at HO, KMO test index values are used to determine the similarity between the global and local deterioration patterns for target links whose SINR is above the threshold. The level of divergence between the target link's deterioration behavior and the global pattern determines how reliable the target link is post HO. This is vital because mmWave links have a tendency of deteriorating from excellent to very poor performance immediately after HO. Thus, understanding the long-term connectivity endurance post HO is paramount to a reliable connection.

Manhattan Grid Mobility Model
A Manhattan grid model is used to model the road network with streets and intersections (as shown in Fig. 1) in an urban scenario. The road network area is 500 m x 100 m. We have four types of users: pedestrians, cyclers, ebikers, and cars. A quarter are pedestrians with speeds of 1.4 m/s; another quarter are cyclers with speeds of 3-7 m/s; another quarter are ebikers with speeds of 8-9 m/s; and the remaining quarter are car users with velocities of 10-14 m/s. Cars keep a spacing of 3 m or more from each other and adjust their velocities by 1-3 m/s to avert crashes. Car speeds are updated every 3 s to decrease/increase. Each street consists of right and left lanes for each user type.
Given user directions, i.e., {moving towards/away from a mmWave BS}, users traverse different streets. The probability of a channel link recovering just after being blocked, ℙr, and of remaining blocked, ℙb, is given in [12] in terms of the total number of samples, the number of possible blockings within it, and the mean non-blocking and blocking windows within a transmission range. The rates at which channel links switch from blocked to recovered, and vice versa, are the reciprocals of the mean non-blocking and blocking windows, respectively. Otherwise, ℙr is binary and assumed to be 1 when the user is moving towards a target BS; ℙb is assumed to be 0, since the chance of recovering the connection over the serving cell is minimal if the user is moving away. The argument is that link recovery chances are high if a user is moving toward the direction of the mmWave BS.
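The two-state blocked/recovered link dynamics described above can be sketched as a simple Markov chain. The window lengths `mu_nb` and `mu_b` below are illustrative placeholders for the mean non-blocking and blocking windows, not values from the paper:

```python
import random

# Hypothetical two-state Markov model of an mmWave link: CLEAR <-> BLOCKED.
# Per-step switch rates are the reciprocals of the mean non-blocking (mu_nb)
# and blocking (mu_b) windows, mirroring the 1/mu switching rates above.
def simulate_link(mu_nb=8.0, mu_b=2.0, steps=5000, seed=0):
    p_block = 1.0 / mu_nb      # CLEAR -> BLOCKED per step
    p_recover = 1.0 / mu_b     # BLOCKED -> CLEAR per step
    rng = random.Random(seed)
    blocked, blocked_steps = False, 0
    for _ in range(steps):
        if blocked:
            blocked_steps += 1
            if rng.random() < p_recover:
                blocked = False
        elif rng.random() < p_block:
            blocked = True
    return blocked_steps / steps   # empirical fraction of time blocked

frac = simulate_link()
# The stationary blocked fraction is roughly mu_b / (mu_nb + mu_b).
```

For these illustrative windows the link spends roughly a fifth of the time blocked, consistent with the ratio of the two mean windows.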

Outage Probability
Assuming Θ is a set of optimization parameters for a given access policy, the outage probability for the observable set of signals can be defined as in [2] and [11], where and ̂ are the measured and target SINR, respectively, and is the targeted data rate given channel state ∈ . is the bandwidth of the given channel link. We assume that all mmWave BSs directionally transmit equal maximum power, and that all users have the same receiver sensitivity. Thus, each serving mmWave BS (with either a LOS or NLOS link) must satisfy an average received power of at least the receiver sensitivity. Moreover, given a cutoff threshold 0 greater than the sensitivity, any user-mmWave BS link that requires transmit power exceeding 0 will not be established or will lose connection, i.e., such a connection experiences a truncation outage at a given distance (2). α is the path loss exponent, taking separate LOS and NLOS values [25]. Equally, given the cutoff threshold 0 , LOS and NLOS users located beyond the corresponding distances from the target BS are unable to communicate owing to insufficient received power. The data rate is defined as: where = 2 sin is the normalized central angle of arrival for beam p, is the user velocity (under 50 km/h), and is the carrier frequency. |ℎ | 2 is the channel gain.
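The truncation-outage rule above can be illustrated with a rough sketch. The receiver sensitivity, 1 m reference path loss, cutoff `p0_dbm`, and path-loss exponents below are assumed illustrative numbers, not the paper's parameters:

```python
import math

# Hedged sketch of the truncation-outage rule: a user-BS link is only kept if
# the transmit power needed to reach the receiver sensitivity over distance
# d_m (path-loss exponent alpha) stays below the cutoff p0_dbm.
def required_tx_power_dbm(d_m, alpha, p_rx_min_dbm=-80.0, pl_1m_db=61.4):
    # log-distance path loss: PL(d) = PL(1m) + 10*alpha*log10(d)
    path_loss_db = pl_1m_db + 10.0 * alpha * math.log10(max(d_m, 1.0))
    return p_rx_min_dbm + path_loss_db

def in_truncation_outage(d_m, alpha, p0_dbm=30.0):
    return required_tx_power_dbm(d_m, alpha) > p0_dbm

# At 100 m, a LOS-like exponent (~2) stays connected while an NLOS-like
# exponent (~3.3) already exceeds the cutoff and is truncated.
los_ok = not in_truncation_outage(100.0, 2.0)
nlos_out = in_truncation_outage(100.0, 3.3)
```

This mirrors the point in the text that, for the same cutoff, NLOS users lose connectivity at much shorter distances than LOS users.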

Resource Allocation Problem
The minimum rate, ℝ, requirement problem given outage and power constraints at distance from a BS is defined as: where | and | are the LOS and NLOS conditional outage probabilities for a user in the ℎ state, respectively, and is the maximum attainable data rate at user-BS distance . The target receivable power +1 needed to meet ℝ in condition (2a), given outage constraints (1a)-(2b), is (3b): where is the current received power in LOS, and ̂ and are the targeted and measured SINR needed to satisfy ℝ. It must be noted that if there exists an infeasible SINR target in a certain user state, the resulting power demand +1 by users may diverge to infinity. This is because each user link attempts to meet its own required SINR no matter how high the power consumption becomes. Thus, and are power and SINR scaling factors, respectively, used to keep the deviations of +1 reasonable in NLOS. The corresponding energy consumption for a given +1 is (3c) [24]: where denotes the price per unit energy consumption, ( − ) denotes the actual number of packets received by the user at t during window w, and ( − ) ℝ ⁄ is the latency. is the current received power at time t, 0 is the unit energy per packet, and 0 * ( − ) denotes the energy lost due to lost packets (the expected number less the actual number of received packets) at t during window w. Given receivable, , and transmittable, , power constraints (see Section 2A), for optimum packet delivery latency, the maximum link utility problem is formulated as [19]: where is the expected latency scaling factor given +1 within w, and is the latency discrepancy following the change to +1 as the user moves away from the serving BS. We learn to predict the long-term deterioration pattern { , … , } of the target links to ascertain their reliability in meeting the desired data rate prior to the next HO. We utilize JMLS properties to predict the likely gradual/abrupt deterioration behavior of target links [7].

JMLS System Definition
We first reformulate the resource allocation problem in (3a)-(3d) into a JMLS learning form, with system state, action, and reward defining the deterioration pattern.

The JMLS Representation
We propose the deterioration pattern learning algorithm and use JMLS to describe (3a)-(3d) as in (4a): where ∈ is the current received power in LOS given state, , and ∈ is the estimated received power discrepancy due to blockage/NLOS effects. It is related to by: = − where is the control factor of the power and SINR scaling factors in (3b); ( ) and ( ) are the SINR/power coefficient matrices in (3b). ~(0, ( )) and ~(0, ( )) are the data rate and received power measurement noise, respectively. Measurement noises are influenced by the competing effects of changes in gain, angular and linear transmission distance, user speed, etc., for the same SINR requirements, where: Following a transition to +1 , the immediate reward, ( ), for the observed signal, ∈ , is defined as a function of energy efficiency: where ( , ) is a data rate greater than ℝ in (3a). The likely rate discrepancy between user and mmWave BS is: where is the scaling factor of the rate discrepancy for each state, s, at time , given the maximum rate ( , ). The transition probability between states and +1 is: We assume samples from different mmWave BSs at time are collected within each window and arranged in ascending order of the users' distance from serving BSs. The transmission energy cost function is defined in (4f): where the first and second factors in (4f) represent the sum weighted-norm energy costs for received packets and lost packets over the window w, respectively.
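A minimal sketch of a jump Markov linear system of the kind described above, assuming illustrative two-mode (LOS vs. blocked/NLOS) scalar dynamics in place of the paper's coefficient matrices:

```python
import random

# Minimal JMLS sketch: the received-power quantity evolves linearly with
# mode-dependent coefficients and noise, while the discrete mode (e.g. LOS
# vs NLOS blockage) jumps according to a Markov transition matrix.
# All coefficients are illustrative placeholders, not the paper's A(s), C(s).
def jmls_rollout(steps=50, seed=1):
    A = {0: 0.98, 1: 0.85}          # per-mode linear dynamics
    noise = {0: 0.05, 1: 0.4}       # per-mode process-noise std
    P = {0: [0.95, 0.05],           # mode transition probabilities
         1: [0.30, 0.70]}
    rng = random.Random(seed)
    mode, x = 0, 1.0
    traj = []
    for _ in range(steps):
        x = A[mode] * x + rng.gauss(0.0, noise[mode])   # linear step
        mode = 0 if rng.random() < P[mode][0] else 1    # Markov jump
        traj.append((mode, x))
    return traj

traj = jmls_rollout()
```

The abrupt changes the paper attributes to JMLS appear here as jumps between the slowly decaying LOS mode and the fast-decaying, noisy blocked mode.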

Initial Deterioration Path Training
Let , , and denote sequences of observed data rates { 1 , … , }, corresponding receivable power values { 0 , … }, and states { 1 , … , }, respectively, up to time . The JMLS learning problem for each user type is to define the likely sequence and parameter set Θ that maximize the likelihood function ( |Θ, ) given a finite observation in over for all 1 ,.., 2 ∈ at distances 1 ≤ 2 . The initial deterioration pattern estimator upon which we design our framework for the received power pattern X is the EM algorithm in [12]. EM uses Bayesian inference to automatically infer the optimal value set of Θ at each step, as seen in Fig. 2, and the value function can be written as: where Θ(k) is the current parameter estimate at iteration . The change, ∆ ( ), between the and +1 states must satisfy condition (5d) to avoid abrupt changes or shocks in data rate: where ( ) is the averaged data-rate discrepancy between states and +1 for a user with velocity v. In (5d), the smaller the difference, the lower the change between and +1 , defining the deterioration pattern . ( +1 ) and ( ) can be chosen independently. Obtaining full or accurate CSI to determine the pattern may be difficult owing to rapid changes in mmWave channels. Besides, EM cannot handle such switching dynamics [12]. Thus, instead of recomputing steps (5a)-(5c) to refine pattern X as more CSI is obtained, we use online DRL with the EM estimate of X as initial experience to determine user target data rates, as will be seen in Fig. 3.
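As a hedged stand-in for the EM initialization step, the toy example below fits a two-mode (LOS/NLOS) Gaussian model of received power by alternating E (responsibilities) and M (mean re-estimation) steps. The fixed unit variance and synthetic data are assumptions for illustration only:

```python
import math, random

# Toy EM: infer the two mode means of a received-power mixture by iterating
# E-step (per-sample responsibilities) and M-step (responsibility-weighted
# mean updates). A heavily simplified stand-in for the JMLS EM in the text.
def em_two_modes(samples, iters=50):
    mu = [min(samples), max(samples)]        # crude initialization
    sigma = 1.0                              # assumed fixed, for simplicity
    for _ in range(iters):
        resp = []
        for x in samples:                    # E-step
            w = [math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in mu]
            z = sum(w)
            resp.append([wi / z for wi in w])
        for k in range(2):                   # M-step
            num = sum(r[k] * x for r, x in zip(resp, samples))
            den = sum(r[k] for r in resp) or 1e-9
            mu[k] = num / den
    return sorted(mu)

rng = random.Random(0)
data = ([rng.gauss(-90, 1) for _ in range(100)]    # NLOS-like powers (dBm)
        + [rng.gauss(-70, 1) for _ in range(100)]) # LOS-like powers (dBm)
mu_nlos, mu_los = em_two_modes(data)
```

With well-separated modes the iteration recovers means near the two generating levels, which is the sense in which EM "infers missing values over states" here.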

Deep Reinforcement Learning in EM-Estimates
As seen in (5b), the EM estimates give the maximum obtainable received power, x, i.e., the upper bound of desirable received power in each state needed to obtain a high SINR, ∈ , and hence data rate, y, efficiently. The role of DRL, given the optimum maximum receivable power − * per state, is to determine the minimum/lower bound of receivable power, * , needed to obtain the same JMLS value, ∈ , efficiently in the same state. It must be noted and emphasized that the power at the receiver can vary randomly with time, space, and frequency. This may trigger erroneous reception at the receiver. Rectifying or averting the errors may need high transmit power (which is energy inefficient and beyond the limit) to meet the desirable receivable power and receive the same amount of user data within a given QoS/SINR requirement. However, if the channel gain is high at its peak, even if the received power is lower (e.g., in NLOS), this permits using lower receivable power to receive the same/similar amount of data while maintaining the same QoS/SINR. Thus, knowing the pattern of not just the maximum but also the minimum receivable power prior to the HO decision is vital. Hence the need for DRL: to determine the minimum desirable power given the maximum from the EM estimation. Here DRL uses EM data as initial experience (meta data) to determine the least expected receivable power needed to give ∈ . In that case, the DRL agent has to consider only the SINR value, ∈ , possible for -x in EM and find the power, x, that gives the highest directly obtainable reward plus the expected accumulated future reward of the resulting states, s. The EM Q value for the (− , )-pair is used as meta data by the agent to find the SINR that gives the smallest DRL value with a function value, . The optimal value function * is obtained by solving for each given − in Fig. 2. Technically, for a given optimum pattern, − * , in (5d), the algorithm uses the corresponding optimized parameter sets and policy ( | ) as input to DRL.
The DRL scheme then determines the minimum desirable value, , needed to achieve ( ). It uses the corresponding maximum value, − , determined by EM in each state as initial experience and improves it by minimizing the expected energy cost, ( ). The policy, ( | ), is defined as: where ( |− , ) → [0, 1] denotes the probability of transitioning from − to without change, ∈ , with the least possible energy cost * ( ) in . The optimal policy derives the smallest possible value of (− , , | ), hence * ( ) in (4f), satisfying the following Bellman equations, where * are goal states in which condition (5d) is satisfied.
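The Bellman recursion above can be illustrated with tabular value iteration over a tiny chain of deterioration states, where the goal state (condition (5d) satisfied) terminates and the per-state energy costs are illustrative:

```python
# Hedged value-iteration sketch of the Bellman equations above: each
# non-goal state pays an illustrative energy cost and moves one state
# toward the goal; the goal state (condition (5d) met) has zero value.
def value_iteration(costs, goal, gamma=0.9, tol=1e-6):
    n = len(costs)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            if s == goal:
                continue                      # terminal goal state
            nxt = s + 1 if s < goal else s - 1
            v_new = costs[s] + gamma * V[nxt] # Bellman backup
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Costs fall as states approach the goal; values do too.
V = value_iteration(costs=[1.0, 0.5, 0.2, 0.0], goal=3)
```

The monotone decrease of the converged values toward the goal state reflects the "smallest possible energy cost" objective in the text.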

Deep Deterministic Policy Gradient (DDPG)
We use the Deep Deterministic Policy Gradient (DDPG) to improve the accuracy of the pattern. DDPG is combined with DQN on the basis of the EM algorithm in order to further enhance the stability and effectiveness of network training. This makes it better suited to problems with continuous state and action spaces. Technically, DDPG uses DQN's experience replay memory and target network to solve the problem of non-convergence when approximating the EM function values with neural networks. It is thus an actor-critic, model-free algorithm. It learns policies over high-dimensional observation and action spaces. In this respect, agents use three modules: the primary network, the target network, and the replay memory.
The primary network matches actions (SINR ratios in the JMLS parameter sets) with expected received power via a policy gradient method. It consists of two deep neural networks, namely the primary-actor and primary-critic neural networks. The target network, on the other hand, sets target values, , for the optimal receivable power pattern, X, given by the EM estimations. The replay memory stores the tuple experience from the EM Bayesian estimators and the environment via the actor network given the condition in (5d). Experience tuples include the current and next state, the SINR ratio value following the transition between states, and the reward for choosing the received power level in . Replay memory updates are randomly sampled for training the primary critic network and setting targets in the target network for eqn. (5d) eventualities.
Given the EM parameter set and policy ( | ), the cost policy gradient ∇ gives the values of ∈ ∀ with a minimum change in ∇ ( , | ) between − and , and a corresponding maximum change ∆ (− , , , ) for each value transitioning from − , defined as: The optimal value * ( ) gives the highest possible expected future reward and the lowest discrepancy from target values for each state. The policy gradient is explored by the primary actor neural network, and the value function for the ( , )-pair is used by the agent to find the SINR ratio, , and received power that give the lowest value and highest reward. Value iteration in DDPG terminates when ∀s ∈ S, | (x) − (k)| ≤ , and termination is guaranteed for > 0. This is similar to an ε-greedy strategy with probability 1 − ε [27]; here ε decays as more iterations, hence experience, are gained. The primary critic network updates by minimizing the loss function ( ), defined as: where ̂ is the target network value, obtained by: Here ( , ( +1 | )| ) is obtained through the target network, i.e., the network with parameters from EM with -X values and from X generated over time for the minimum desirable receivable power. The new values of (5d), hence pattern , are updated by minimizing the loss in (7b). The gradient of ( ) over is calculated by its first derivative, which can be denoted as in [14]. According to (7d), the parameters of the primary critic neural network can be updated. Specifically, at each training step, a mini-batch of experiences < , , , +1 >, ∈ {1, . . . , }, is randomly sampled from replay memory. For each point in , the target network value is regarded as the previous and current version of EM parameters and . At each iteration, and in (7c) and (7d) are updated with a weighted combination of the previous state. The prediction of the target path takes the form of a weighted combination of models : where ∈ [0, 1] is a weight computed using a Gaussian kernel parameterized by the transmission distance metric ∈ ̃: = exp(−0.5( − ) ( − )) , (7f).
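The two update rules described above, the TD target used in the critic loss (7b)-(7c) and the slow "soft replace" of the target parameters, can be sketched with scalars standing in for the network weights (a hedged illustration, not the paper's implementation):

```python
# Hedged DDPG-update sketch: the critic is trained toward the TD target
# y = r + gamma * Q_target(s', a'), while the target parameters trail the
# primary parameters via slow soft-replace updates, as described above.
def td_target(r, gamma, q_target_next):
    return r + gamma * q_target_next

def soft_update(target_params, primary_params, tau=0.01):
    # target <- tau * primary + (1 - tau) * target  ("soft replace")
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_params, primary_params)]

y = td_target(r=1.0, gamma=0.95, q_target_next=2.0)
theta_t = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
```

The small `tau` is what makes the target values change slowly, which is the stability mechanism the text attributes to the target network.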
The target neural networks generate target or ideal values for training and re-optimizing the deterioration pattern from − based on EM and replay updates. Thus, the EM estimations in each iteration are used as meta data for DDPG. The target neural network has a structure similar to the primary network, i.e., a similar neural network structure and initialization parameters. In the training process, the parameters of the target actor and critic networks are updated through slow change (soft replace) by the EM-estimated values. Here, instead of directly and randomly training the parameters of the primary actor and critic networks, to further enhance the stability of the training process we copy the EM estimations as ideal initial values. The replay memory stores the EM experience tuples formulating , and each value update ∈ includes a tuple < − , , , > update. Fig. 3 shows the structure of the proposed JMLS-DDPG algorithm. The DDPG algorithm takes the EM parameter data set and maximum receivable power values, -X, as initial input to determine the minimum receivable power values of a pattern. Given that power effects on SINR can be reduced in high-channel-gain locations, the DDPG agent then outputs the minimum receivable power values X needed to maintain the same SINR ratio previously predicted and set by the EM estimations for . The corresponding reward of in EM is copied; an SINR that helps the agent achieve the goal yields a positive reward and, on the contrary, a negative reward if condition (5d) is not fulfilled. The current state information, the SINR ratio, the reward, and the state information of the next minimum desirable receivable power are stored in the replay pool. Meanwhile, the neural network trains on this experience and continuously adjusts the SINR strategy by randomly extracting sample data from the EM pool, using the gradient descent approach to update and iterate the network parameters, so as to further enhance the stability of pattern X and the accuracy of the algorithm.
Using EM experiences as initial training data input to DDPG restricts the search range for optimal minimum receivable power values. Thus, any observed mmWave BS data rate not meeting the corresponding receivable power is immediately discarded for training or consideration. This in itself reduces the training sample size for DRL, and hence the convergence time. Ultimately, the improved DRL HO is obtained by combining DDPG with EM predictions acting as the meta-training sample. Finally, the pattern model is integrated into the HO platform for HOs.

Online Update of Target Deterioration Path
DDPG subdivides the training network structure into an online network and a target network (see Fig. 3). The online network is used to output the minimum expected received power in real time, evaluate SINR ratio values, and update network parameters through online training; it includes the online (primary) actor network and the online critic network. The target network includes the target actor network and the target critic network, which are updated by EM values. The target actor network, however, does not carry out online training. For each user type, the estimated path is only re-estimated from new training samples when the pattern prediction error based on the EM estimates is too large relative to the minimal desired received power pattern. It therefore follows that when the error in energy efficiency is small enough that the channel gain compensates for the power loss to maintain the desired SINR, the corresponding EM information used to generate the received power pattern is regarded as providing a reliable training sample for the target network in DDPG. EM data is thus re-encoded to generate new training samples for the DRL and to set new targets over ̃ henceforth as meta-training. If indeed the pattern of link deterioration is successfully followed by the target mmWave network, then ̃ represents the true channel link deterioration behaviour from which is obtained. Consequently, the corresponding pair ̃ and parameter set ( ) can continue being used to re-train DDPG instead of requesting new CSI from the environment (see Fig. 3). The model can thus be efficiently and quickly retrained with a relatively small number of new training samples. A natural drawback of decision-directed approaches such as the Bayesian one in EM is their sensitivity to decision errors. For example, if the link fails to successfully sustain connectivity, then the meta-training samples −̃ of ̃ over ̃ do not accurately represent the channel behaviour in .
In such cases, the inaccurate training sequence may gradually deteriorate the accuracy of the DDPG predictions, making the proposed approach unreliable, particularly in low-SINR areas where link deterioration pattern errors occur frequently. Nonetheless, when pattern errors in EM are few, the decision estimate errors of , namely the number of errors in a pattern, can be used to decide when to generate meta-training. For instance, we re-train DDPG with new training samples only when the number of errors is larger than some threshold. Using this approach, only accurate meta-training data is used, and the effect of decision errors is controlled. When using new training samples, we focus attention on states with unconverged pattern values, i.e., where (5d) is not fulfilled. Our online training mechanism is summarized in the proposed Algorithm 1.
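The retraining trigger described above can be sketched as follows; the error threshold and per-state error counts are hypothetical values for illustration:

```python
# Hedged sketch of the retraining trigger: EM-derived meta samples are
# reused until the number of pattern errors in a state exceeds a threshold,
# at which point fresh CSI is requested only for the unconverged states
# (those where condition (5d) is not fulfilled).
def select_training_source(errors_per_state, threshold=3):
    bad_states = [s for s, e in enumerate(errors_per_state) if e > threshold]
    if not bad_states:
        return "meta", []              # keep reusing EM meta-training data
    return "fresh_csi", bad_states     # request new samples for these states

mode, states = select_training_source([0, 1, 5, 0, 4], threshold=3)
```

Only the states whose error counts exceed the threshold trigger a CSI request, which is how the effect of decision errors is kept controlled.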

Global Path and Local Path Optimization Formulation
The local pattern is formulated based on local CSI from one mmWave BS. The local agent thus considers only the SINR ratio ∈ and the corresponding received power values x possible in the local environment over the given states ̃. The long-term value function for the local deterioration pattern is: where δ ∈ (0,1) is the discount factor, which approaches 1 with more training samples. The global deterioration pattern is formulated based on collective SINR ratio and received power values from different mmWave BSs over ̃. The value function is: where ( | ) is the probability of receiving x given in state under EM, and is the learning rate over the samples in EM.
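The long-term (discounted) pattern values above can be sketched as follows, with the local value computed from one BS's reward trace and the global value averaged over traces from several BSs; the traces and discount factor are illustrative:

```python
# Hedged sketch: a discounted return over a per-state reward trace stands in
# for the long-term local value function; the global value aggregates traces
# from multiple mmWave BSs, mirroring the local/global split in the text.
def discounted_value(rewards, delta=0.9):
    v = 0.0
    for t, r in enumerate(rewards):
        v += (delta ** t) * r      # sum of delta^t * r_t
    return v

local = discounted_value([1.0, 0.5, 0.25])           # one BS's trace
traces = [[1.0, 0.5, 0.25], [0.8, 0.6, 0.4]]         # several BSs
global_v = sum(discounted_value(tr) for tr in traces) / len(traces)
```

Averaging over many BS traces is what gives the global pattern its lower variance, which is why the text treats it as the more accurate reference at HO.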

Hand-Off Considerations.
We use the Kaiser-Meyer-Olkin (KMO) test [25] to measure how much each individual/local mmWave target link's expected deterioration pattern, given the user speed, deviates from the optimized global deterioration pattern. The global deterioration pattern is formulated by collecting training samples from all mmWave BSs with respect to user type/speed, just like the complete report table (CRT) in [4]. The local deterioration pattern is based on data gathered from an individual BS's local environment with respect to a user's type. It is similar to the report table (RT) user data in [4]. Given all the target BSs with SINR at least 3 dB above the threshold, the KMO indexing test is used to find the level of correlation between the optimized global deterioration pattern and that of a target link at the time of the HO request. The overall KMO index correlation is defined in (10), where ∈ is the optimum lower-bound target link value of received power at state .
∈ is the minimum expected user-BS link distance ̂ and ̂ are values for the global deterioration path. KMO test takes values between 0 and 1 and Table I summarizes index values. The general rule for interpreting measurements are in Table I. In this study, we select the target cells with KMO index of 0.751. If the KMO index value is less than 0.7, most likely the target link is not suitable for HO consideration though it might have the highest initial SINR. Additionally, during HO phase, If the serving BS still has a SINR value of 3dB, the user maintains the connection to the serving gNB. This avoids wasteful HOs. Otherwise, we execute the HO process and then go back to prediction phase.
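The target-filtering step can be sketched as below. This is a hedged illustration, not the paper's exact KMO computation: a plain Pearson correlation stands in for the local/global pattern agreement score, and the candidate tuple layout, threshold values, and function names are assumptions.

```python
# Sketch of KMO-style target selection: keep only target BSs whose SINR is
# at least 3 dB above threshold AND whose local deterioration pattern agrees
# strongly (index >= 0.75) with the optimized global pattern.

def correlation(xs, ys):
    """Pearson correlation used as a stand-in agreement index in [0, 1] for
    well-aligned patterns (assumes non-constant patterns)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)


def eligible_targets(candidates, global_pattern, sinr_threshold_db, kmo_min=0.75):
    """candidates: (bs_id, sinr_db, local_pattern) tuples. Filters by the
    3 dB SINR margin, then by local/global pattern agreement."""
    chosen = []
    for bs_id, sinr_db, local_pattern in candidates:
        if sinr_db < sinr_threshold_db + 3:
            continue  # not enough SINR margin for HO consideration
        if correlation(local_pattern, global_pattern) >= kmo_min:
            chosen.append(bs_id)
    return chosen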

Measurement Definition
We measured the number of repeated HOs to ascertain whether the HO scheme can reduce the number of wasteful HOs. A repeated HO means that the HO scheme reselects, for another HO, the same serving BS to which the user is already connected. This is wasteful because there is no need to reselect the same BS for HO rather than simply maintaining the link. We also analyzed the sum data rate of mmWave BSs under the different HO schemes, as well as the HO overhead of each scheme; the principle is that the higher the overhead, the more bandwidth the HO scheme wastes. Lastly, we analyzed the performance of our proposed scheme against a scheme dubbed the DDPG-only scheme. The DDPG-only scheme does not use the meta-training technique and does not consider condition (5d); in particular, it uses random training samples rather than EM-refined samples. We also compared performance against an existing soft-HO DC model HO scheme in [3], which selects the best target cell simply by averaging SINR/data rate.
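The repeated-HO metric defined above is straightforward to compute from a handover log; the sketch below assumes a hypothetical log format of (serving BS, target BS) pairs recorded at each HO decision.

```python
# Simple sketch of the repeated-HO metric: count HO decisions that reselect
# the BS the user is already attached to (a wasteful HO by the definition
# in the text). The (serving, target) log format is an assumption.

def count_repeated_hos(ho_log):
    """ho_log: iterable of (serving_bs, target_bs) pairs, one per HO
    decision. A repeated HO has target == current serving BS."""
    return sum(1 for serving, target in ho_log if serving == target)


# Example log: the second and fourth decisions reselect the serving BS.
log = [(1, 2), (2, 2), (2, 3), (3, 3)]
repeated = count_repeated_hos(log)
```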

Simulation Results
We use the DC LTE-mmWave model introduced by NYU and the University of Padova in our simulation [1]. The LTE BSs in the DC model manage the mmWave BSs. The model carefully captures end-to-end mmWave cellular network performance. It uses the ns-3 simulator and features the 3GPP channel model for frequencies above 6 GHz and a 3GPP-like cellular protocol stack [1]. The JMLS-DRL algorithm is developed using the OpenAI Gym toolkit [24]. OpenAI Gym is an RL development tool and is integrable with the ns-3 simulator: it supports teaching agents for a variety of network applications, including those in ns-3. We investigated the performance using system-level simulations. Data were collected over 1000 s of simulation time with a resolution of one Transmission Time Interval (TTI) of 1 ms; the simulation parameters are summarized in Table II. For a more detailed review of simulators, refer to [15]. Figs. 3 and 4 compare the number of wasteful HOs against the number of training episodes for the JMLS Viterbi HO scheme and the JMLS-DDPG HO scheme, respectively. The former gets new training samples from the environment, once the initial pattern has been defined by EM estimations, for every other episode, while the latter uses EM-estimated data as training samples so that condition (5d) is satisfied; it only requests new training samples when EM data estimates fail to meet condition (5d). Results show that our proposed scheme reduces the number of wasted HOs more quickly than the DDPG-only HO scheme. For instance, it requires 250 episodes to reduce repeated HOs to minimal levels of fewer than five, whereas the DDPG-only scheme requires close to 400 episodes. This also entails that it can strategically predict deterioration patterns using fewer training samples. That such a scheme is more reliable than one that keeps acquiring new training samples is justified in [4].
The authors in [4] argue that the angles of arrival and received power vary slowly with speed because they are shaped by the large-scale scattering environment and do not change abruptly. This difference is also seen when we compare the deterioration-pattern prediction after 200 episodes in Fig. 7 with that of Fig. 8 at 500 episodes. Fig. 8 shows a more accurate prediction of the likely received power for different user types than Fig. 7, which uses 200 episodes or observations in our proposed JMLS-DRL-empowered HO algorithm. Secondly, while the DDPG scheme converges independently for each user type, as seen in Fig. 5, the proposed JMLS-DRL scheme converges to an almost common and higher reward for all user types. The implication is that after 200 training episodes, the JMLS-DRL algorithm can follow one common/global deterioration pattern regardless of user type, whereas under the DDPG HO scheme each user type needs to follow a different deterioration pattern. This makes the expected target-link behaviour easier to predict with our proposed scheme than with the latter. In both schemes, a HO is only issued when the received power at a particular state/distance from the serving BS drops below the corresponding value of the expected local deterioration pattern. In this case, the global and local deterioration patterns in KMO are compared within a range of at least 80 m from a serving mmWave BS. While we could still try to predict beyond 80 m, the computation cost would be too high. Thus, a selected target link is deemed reliable if it can sustain connectivity within the 80 m transmission range. Beyond 80 m, HOs are invoked if the SINR drops to within 3 dB of the threshold. Therefore, HOs select a link on the basis that it is expected to sustain connectivity for at least the assumed 80 m coverage of the mmWave BS.
We also analysed a soft-HO DC-based scheme [4] using only SINR [2] and a DDPG-based scheme [3] for HO comparisons; the former acted as the baseline in Fig. 9 and Fig. 10. In Fig. 9, we compare the sum rate against the number of BSs for the three HO schemes. The SINR-based scheme, as explained in [4], simply compares the SINR of the target and serving cell/link. The other scheme, as said earlier, gets new updates every episode, whilst our proposed scheme uses both new and old CSI. We can see that the proposed scheme uses and selects BSs efficiently, while the other two schemes begin to saturate after 35-40 BSs. This can be attributed to the low training-sample requirement and thorough analysis of CSI in our scheme: reuse of training samples gives it ample time to analyse link behaviour. At the same time, having a small number of mmWave BSs deprives the proposed scheme of the chance to learn more about target-link deterioration patterns, as seen in the smaller sum data rate recorded at 5 to 15 mmWave BSs; the more mmWave BSs there are, the more diverse the data examined in each episode. On the other hand, even with a very small number of BSs, the DDPG-only HO scheme's acquisition of new training samples in each episode improves target-link path prediction, but because the samples change fast, prediction inaccuracies quickly manifest.
Another criterion for evaluating the proposed HO methods is the generated overhead. Fig. 10 shows the variation of the induced overhead for the three HO methods. The SINR-based HO clearly induces more overhead, since at each attachment to a new BS a number of new measurement reports must be exchanged to allocate new subcarrier resources. With the DDPG-only based handover and our proposed HO scheme, less overhead is experienced because the past link data needed to achieve reliability is reusable and exchanged in advance of the HO. Our proposed scheme performs even better because it can switch measurement data sources depending on condition (5d) (see Fig. 3). Hence the proposed scheme outperforms both the DDPG-only and SINR-based HO schemes.

Conclusion and Future Works
This paper proposed a new HO scheme given the distinct propagation characteristics of mmWaves in a HetNet structure. A resource allocation problem that considers the joint utilization of mmWave bands with LTE bands in a multi-user setup was considered. We considered a downlink LTE-mmWave HetNet scenario, with a mmWave-link-behaviour pattern-analysis scheme applied to address the HO challenges. The resulting optimization solution modelled the link behaviour using JMLS, DRL, and meta-training techniques; the selection of the optimal HO link then used KMO test principles. Simulation results showed that our HO scheme outperformed the DDPG-only HO scheme and the SINR-only based HO scheme, demonstrating the vital role deterioration-pattern analysis can play in addressing mmWave link selection in 5G networks. Principally, we can conclude that our pattern-analysis HO scheme performs long-term behaviour analysis of mmWave target links before HO execution, unlike classic HO schemes, where only the instantaneous behaviour of target links is analyzed prior to choosing the best target link. In future works, it would be interesting to consider the competing effects of pathloss, channel gain, and transmission power when determining the receivable deterioration pattern of the target link, given the impact their variation has on the data rate. Further, while highly directional beam antennas are needed at the PHY layer to achieve acceptable link quality, how to effectively handle or dodge the adverse effects of both mobile and static blockages when choosing mmWave links in HO schemes could be an interesting subject for future behaviour-pattern projection studies of target links. Finally, studying backhaul configurations that can efficiently support the proposed HO scheme would also be particularly interesting.