Forecasting high-frequency financial time series: an adaptive learning approach with the order book data

This paper proposes a forecast-centric adaptive learning model that engages with past studies on order book and high-frequency data, with applications to hypothesis testing. In line with the literature, we produce bracketed summary statistics from the high-frequency bid and ask data in the CSI 300 Index Futures market and aim to forecast the one-step-ahead price. Traditional time series issues, e.g. ARIMA order selection and stationarity, together with potential financial applications, are covered in the exploratory data analysis, which paves the way to the adaptive learning model. Designing and running the learning model, we find that it performs well compared with the top fixed models, and that some specifications improve forecasting accuracy by being more stable and resilient to non-stationarity. Applications to hypothesis testing are shown with a rolling window, and further potential applications to finance and statistics are outlined.


Introduction and Literature Review
A time series is a sequence of observations indexed by time which, by its nature, can be separated into the past and the future. The study of predicting the future from past information is called forecasting, and it is of great importance to society: forecasting financial time series, e.g. the price of an asset, can influence the decisions of both the private and the public sectors. As the computerisation of financial markets develops, observations of variables are taken at higher frequencies and can be analysed. Consequently, forecasting such high-frequency objects becomes increasingly important.
At a higher level, the statistical approach to learning from big data and obtaining better predictions has evolved in recent decades, in both theory (Shalev-Shwartz and Ben-David 2014) and applications, e.g. LSTM and deep learning (Hochreiter and Schmidhuber 1997; Goodfellow, Bengio, and Courville 2016). Various methods of learning, e.g. clustering, neural networks and other synthetic models, have been developed, and many of them have helped to solve socio-economic problems (Aghabozorgi, Shirkhorshidi, and Wah 2015; Chakraborty and Joseph 2017). The application of statistical learning algorithms to dynamically assess forecasts has also contributed to the empirical time series econometrics literature (Yang 2020).
However, because a time series dataset is indexed by time, and because many structures (mathematically, quantified by functional forms and parameters) change over time, one needs to pay particular attention when applying generically developed learning methods in a financial time series environment (Sirignano and Cont 2018). Econometricians refer to this time series issue as "time-varying parameters", which also relates to the stationarity of a model, essentially questioning whether the variables remain bounded over time.¹ Recent proposals on dealing with these have been made by Andres and Harvey (2012) and Harvey (2013), with some empirical studies following (Harvey and Sucarrat 2014). Additionally, adapting learning methods to improve the forecasts of traditional ARIMA models has been studied empirically (Z. Li, Han, and Song 2020). In this paper, we use the classical approach of window estimation, thereby focusing on the contemporary relationship between variables to allow for time variability. The choice of window size varies: as shown later in the adaptive learning, different sizes can be preferred from time to time.
In terms of generating features (explanatory variables) to help forecast the price, the order book data is particularly helpful. In this paper, the order book data consists of the quantities and prices for the best bid and ask, i.e. the prices at which the asset can immediately be sold and bought, respectively. These statistics can be summarised into the order flow imbalance, which synthesises the prices and quantities on both sides, or the order imbalance, which uses solely the quantities on both sides. The former has been studied by Cont, Stoikov, and Talreja (2010) and Cont, Kukanov, and Stoikov (2014), and the latter by Avellaneda, Reed, and Stoikov (2011) and Stoikov (2014).

¹ One may refer to Klenke (2013) for a more rigorous definition of stationarity.
While more sophisticated learning models could be used, e.g. deep learning models (Sirignano and Cont 2018), we start from a traditional time series modelling and statistical learning perspective and subsequently propose learning models that can adapt to the past (hence "adaptive learning"). Such a model has better interpretability and numerous potential applications, e.g. hypothesis testing. General references on the foundation of model selection are Akaike (1974) and Hastie, Tibshirani, and Friedman (2001). Additionally, penalisation plays a key role in the study of time series model selection (Cai and Wang 2014; Zbonakova, X. Li, and Härdle 2018), and functional penalisations, for instance the MDL criterion, have also shown empirical usefulness (Rubio et al. 2007). These motivate the formulation of the adaptive learning proposed here.
Standard mathematical concepts and statistical notations are used in this paper. The concept of functional sets, as used in Vapnik (2000), is highlighted in section 3, where it is used frequently. Standard time series notation is used throughout, with the main references on stationarity issues being Banerjee et al. (1993) and Harris and Sollis (2003), and on other ARIMA modelling Harvey (1993) and Fuller (1996). Bayesian hypothesis testing is adopted at the level of Koop, Poirier, and Tobias (2007), with standard frequentist testing assumed at the level of Casella and Berger (2008).
Details of the data cleaning, feature generation, and their exploratory analysis are written in section 2. Modelling and learning proposals, together with the results are presented in section 3, followed by an application to hypothesis testing in section 4 and the discussions in section 5.
In terms of the key contributions of this paper, we engage with the existing methods on feature generation and traditional ARIMA formulation, then propose a forecast-centric learning model, the adaptive learning, to approach model selection and post-estimation penalisation. Such a learning model helps to confirm that a certain level of forecasting accuracy is reachable, improves forecasting in volatile and non-stationary markets, and can also be applied further, for instance to variations of its formulation and to hypothesis testing.

Data
2.1 Data description, data cleaning and computing deployment

Description, time brackets and VWM
We base the statistical modelling on the intra-day price data of the CSI 300 Index Futures (hereafter "the asset") provided by CIFCO Guangzhou. We focus on the time range from 10th November 2017 to 17th April 2018, which contains 105 trading days.
On a usual trading day, there are two trading sessions: one in the morning (0930-1130) and the other in the afternoon (1300-1500).² In the dataset, we expect one or two raw entries within each second, though some omissions occur throughout. In each raw entry, the best bid and ask data are observed together with the latest traded price and volume. To summarise the data, we divide each session into 24 brackets of 5-minute slots and index them by the end time, e.g. 0935 refers to the bracket from the first second of 0930 to the last second of 0934.³ This choice is supported by the noisiness and sparsity of the data: minute-level summaries would often be empty or lack transactions, while any bracket larger than 5 minutes would hardly qualify as a high-frequency time series, as each session only lasts 2 hours.
A summary of statistics and a histogram of the number of observations within each bracket are available in Table 11 and Figure 18 respectively in the appendix. Table 1 below serves as a translation between the brackets (thereafter "observations"), the time, and the financial environment. Within each bracket, we compute the Volume-Weighted-Mean (VWM) of the asset price: the arithmetic sum of the products of trading volume and price over the raw entries, divided by the total volume. A summary of statistics is supplied in Table 11 in the appendix, and a line plot of the VWM is supplied in Figure 1 below.

Figure 1: Plot of the VWM of the CSI 300 Index Futures

³ In HHMMSS format, that is from 093001 to 093500.
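The bracket construction and VWM computation described above can be sketched in a few lines of pandas. The column names ts, price, and volume are illustrative assumptions, not the dataset's actual schema.

```python
import pandas as pd

def vwm_by_bracket(df: pd.DataFrame) -> pd.Series:
    """Volume-Weighted-Mean price per 5-minute bracket, labelled by end time."""
    g = df.set_index("ts").groupby(pd.Grouper(freq="5min", label="right"))
    # sum(price * volume) / sum(volume) within each bracket
    vwm = g.apply(lambda b: (b["price"] * b["volume"]).sum() / b["volume"].sum())
    return vwm.dropna()  # drop brackets with no trades
```

The `label="right"` option indexes each bracket by its end time, matching the paper's convention that 0935 labels the 0930-0934 slot.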


Computing deployment
Due to the computing complexity, cloud resources with parallel computing techniques are utilised.
In particular, we deploy multi-core parallel computing tasks on machines from Google Colab and AWS: we centralise the function, distribute the parameters we wish to compute over different CPU cores, and then collect the results individually. Code for execution and visualisation is written in Python 3.
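A minimal sketch of this parallelisation pattern with Python's standard multiprocessing: one centralised function is mapped over the parameter grid, one tuple per task. fit_and_forecast is a hypothetical placeholder for the actual per-model estimation, not the paper's code.

```python
from multiprocessing import Pool

def fit_and_forecast(params):
    """Placeholder for windowed estimation and one-step-ahead forecasting."""
    w, p, d, q = params
    # ... the real task would fit the model h(iota, w, p, d, q) here ...
    return (params, w + p + d + q)  # dummy result for illustration

# One (w, p, d, q) combination per task, matching the grid used later
grid = [(w, p, d, q) for w in (12, 24, 48, 96)
        for p in (0, 1, 2) for d in (1, 2) for q in (0, 1, 2)]

if __name__ == "__main__":
    with Pool() as pool:                      # one worker per CPU core
        results = pool.map(fit_and_forecast, grid)
    print(len(results))                       # 72 parameter combinations
```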

Feature generation from the order book
As suggested by the literature review, we generate the Order Imbalance (OIB) and the Order Flow Imbalance (OFI) for each of the raw entries as follows:

e_s^OIB = (BQ_s − AQ_s) / (BQ_s + AQ_s),

e_s^OFI = I(BP_s ≥ BP_{s−1}) BQ_s − I(BP_s ≤ BP_{s−1}) BQ_{s−1} − I(AP_s ≤ AP_{s−1}) AQ_s + I(AP_s ≥ AP_{s−1}) AQ_{s−1},

where s is the index label for the time of the raw entry, I(·) is the indicator function, BQ_s, AQ_s are the best bid and ask quantities respectively, and BP_s, AP_s are the best bid and ask prices respectively.
We provide a general interpretation without going deeply into the theory here. For the OIB, when the bid quantity is relatively high, the OIB is more positive, and vice versa when the ask quantity is relatively high. The OFI can be seen as the signed contribution of order book events to the supply of or demand for the asset. Say someone buys passively at the current bid price; then e_s = BQ_s − BQ_{s−1} represents the size of that order cancellation. If the bid price changes, then depending on the direction, e_s can represent the size of a price-improving order (e_s = BQ_s if BP_s > BP_{s−1}), a quantity for a rise in demand, or the last order in the queue that was removed (e_s = −BQ_{s−1} if BP_s < BP_{s−1}), a quantity for a drop in demand. The ask side is symmetric: an increase in AP, for example, signifies a decrease in the supply of the asset.
Within each time bracket, we need representative summary statistics for the behaviour of each of the two features, technically of {e_s^OIB, e_s^OFI | t − 1 < t(s) ≤ t}, where t(s) indicates the time bracket that entry s belongs to. The mean within each bracket is the usual choice and is consistent with intuition. In addition, we consider a p-score defined as

p_t = Φ( mean({e_s}) / sd({e_s}) ),

where Φ(·) is the normal CDF and mean(·) and sd(·) are the mean and standard deviation of the sequence within the bracket. Summary statistics of all of the generated features are presented in Table 11, with a line plot below in Figure 2. The benefit of the normal transformation, as seen from the plot or summary, is that the value is restricted to a small range (theoretically [0, 1]), which deals with potential spiky moves of the fraction; the cost is the decrease of variance that accompanies the increase of stability, which is not a great concern as it also brings time series models the benefit of stationarity.

Here we draw particular attention to the first and last observations of each session. Economically speaking, large underlying events can occur between the two trading sessions, while the market is not open, causing potential price movements. This creates a high potential for a large difference between the closing price of the previous session and the opening price of the current session. We investigate these differences below.

As shown in Table 2, the day gap, i.e. the difference between the first VWM observation in a morning session and the last in the previous afternoon session, is distributed much more widely and has extreme values compared with the rest. This is additionally supported by the histogram in Figure 18 in the appendix. While the lunch gap, i.e. the gap between the start of the afternoon session and the end of the morning session, is small, to ensure consistency we exclude both of these gaps from estimation.
This is done by adding dummies when the time lands at these points.
To additionally ensure the stability of the forecasting models, and that they have the ability to learn the new environment within each session before making forecasts, we exclude the first 6 observations of each session, i.e. 30 minutes, from forecasting.
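The per-entry imbalances and the bracket-level p-score can be sketched as follows. The column names are illustrative assumptions, and the p-score is implemented here as the normal-CDF transform of the in-bracket mean scaled by the standard deviation; the paper's exact scaling may differ.

```python
from statistics import NormalDist

import pandas as pd

def add_imbalances(df: pd.DataFrame) -> pd.DataFrame:
    """df has columns BP, BQ, AP, AQ (best bid/ask price and quantity)."""
    df = df.copy()
    df["OIB"] = (df["BQ"] - df["AQ"]) / (df["BQ"] + df["AQ"])
    dBP, dAP = df["BP"].diff(), df["AP"].diff()
    # signed order-book event contribution e_s; the first entry is NaN
    df["OFI"] = ((dBP >= 0) * df["BQ"] - (dBP <= 0) * df["BQ"].shift()
                 - (dAP <= 0) * df["AQ"] + (dAP >= 0) * df["AQ"].shift())
    return df

def p_score(x: pd.Series) -> float:
    """Normal-CDF transform of the in-bracket mean over the s.d. (assumed form)."""
    return NormalDist().cdf(x.mean() / x.std())
```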

Rolling ADF tests
One crucial concern with a time series is its stationarity. The approach we take to test the null hypothesis of a unit root against the alternative of stationarity is the Augmented Dickey-Fuller (ADF) test. Here, however, we are interested in modelling the temporal relationships between variables, so we are more concerned with the stationarity of the dependent variable, the VWM of the price, within a short window. Hence we introduce the rolling ADF test, for which we collect the p-value, i.e. the probability, conditional on the null being true, of observing a test statistic at least as extreme as the one obtained⁴, over time.
Let y_t be the VWM of the price at bracket t. Then, given a choice of window size w, we run an ADF test on the set {y_τ}_{τ=t−w}^{t−1} for every t, and collect the result as p_t(d; w), where d is the level of differencing.⁵ The result is interpreted by comparing p_t(d; w) against a critical value, which we take as 0.05 as usual.
Starting from d = 0, if p_t(0; w) < 0.05, we conclude that {y_τ}_{τ=t−w}^{t−1} is stationary; otherwise we seek a higher order iteratively, incrementing d and re-testing. We run this for three window sizes (12, 48, 96) and levels of differencing (0, 1, 2) and draw, in Figure 3, the line plots and histograms of the p-values for each combination.
The choppiness of the p-values for the small window (w = 12) is clearly visible, while the larger windows appear stable, with occasions where p_t(1; w) > 0.05, meaning the second difference is occasionally required for a stationary model.
These exploratory results help to decide how the time series model should be formed, as detailed in section 3.

Figure 3: Results from rolling ADF tests for different window sizes: 12 (top two), 48 (centre two), and 96 (bottom two).

The SR statistics: a trading perspective
A natural extension of a financial time series model is its profitability in trading. A good model should produce a reasonable return while maintaining suitable risk. This performance can be evaluated by the Sharpe Ratio (SR). Here we explain the construction of this statistic and provide baseline and feature-based results.
Let P_t be the price of the asset at time t. Then the return for buying it at time t and selling it at time t + 1 is (P_{t+1} − P_t)/P_t. For each trading session after the forecasting exemption, we have 17 such opportunities. Hence, given a theoretically zero-mean time series feature α_t whose sign indicates the forecasted direction, we set the profit or loss in trading session s as

PL_s^session = Σ_{t=t_s+1}^{t_s+17} sign(α_t) (P_{t+1} − P_t)/P_t,

where t_s + 1 locates the time index of the start of the session, and accordingly t_s + 18 is the last observation of the session. For standard reporting of daily profits or losses and the SR computation, we also produce the profit or loss in trading day d as

PL_d^day = PL_{s_d}^session + PL_{s_d+1}^session,

where s_d locates the morning session of trading day d. The annualised SR is then defined from mean(PL_d^day) and sd(PL_d^day), the mean and standard deviation of PL_d^day respectively.

In the baseline situation, we consider a buy-and-hold treatment, hence PL_s^session is defined by simply buying at the start and selling at the last observation, i.e. PL_s^session = (P_{t_s+18} − P_{t_s+1})/P_{t_s+1}. The other statistics follow.
The results of these are plotted in Figure 4, which clearly shows the inability of the features by themselves to achieve positive returns, while the baseline also performs badly.
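A sketch of the session PL and annualised SR constructions above. The √252 annualisation factor (252 trading days per year) is an assumption, as the day-count convention is not stated here.

```python
import numpy as np

def session_pl(prices, alpha):
    """Sum of signed one-step returns: sign(alpha_t) * (P_{t+1} - P_t) / P_t."""
    prices = np.asarray(prices, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    returns = np.diff(prices) / prices[:-1]
    return float(np.sum(np.sign(alpha[:-1]) * returns))

def annualised_sr(daily_pl, days_per_year=252):
    """sqrt(days_per_year) * mean(PL_day) / sd(PL_day); day count assumed."""
    daily_pl = np.asarray(daily_pl, dtype=float)
    return float(np.sqrt(days_per_year) * daily_pl.mean() / daily_pl.std(ddof=1))
```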
Figure 4: Plot of the cumulative PL_d^day over time (baseline, OFI mean, OIB mean)

Table 3: Summary of statistics of PL_d^day and SR

As a remark for when the SR serves as a performance metric later, α_t is naturally set as P_{t+1|t} − P_t, where P_{t+1|t} is the forecast of P_{t+1} at time t, and Equation 3 can then be interpreted as the trading profit or loss if one buys whenever the next price is forecasted to rise and sells otherwise.

Methodology and Results

General setting
Let y t be the one-dimensional discrete time series of interest (the dependent variable), and let x t be the multi-dimensional discrete time series of features (explanatory variables). We are interested in forecasting the one-step-ahead future of the dependent variable conditional on the information up to time t, namely y t+1|t .
In common time series practice, one studies a model of the underlying process and then uses the model to conduct forecasting (e.g. Prado and West (2010)). Here we take a different approach: we first view the conditional forecast as the value of a map that takes the information set (Φ_t) and functional parameters (θ_t, h_t), then build models to learn the appropriate parameters from previous observations. Mathematically,

y_{t+1|t} = f(Φ_t; θ_t, h_t),

where θ_t ∈ Θ(h_t) is the parameter to be specified in the function, and h_t ∈ H specifies the functional form, thus determining the parameter space Θ(h_t). We manually design sensible models to construct H, and sensible learning methods on θ_t and h_t, to reduce the forecasting error; we consider the usual MSE and MAE as the performance indicators, where MSE has a better theoretical foundation while MAE is more interpretable. We also consider the SR statistic, which serves as an interpretable performance indicator in the context of financial time series.
At each of time t, x t is four dimensional: the first and second entries are, respectively, the mean of OIB and OFI; the third and fourth entries are, respectively, the p-score of OIB and OFI.
We index each h ∈ H by h(ι, w, p, d, q), which controls an ARIMAX(p,d,q)-type forecasting model with w the window size and ι the choice of explanatory variables. For given p, d, q, we consider a forecasting formula

y_{t+1|t}(h) = c_t + g_{1,t}({y_τ}) + g_{2,t}({ε_τ(h)}) + g_{3,t}(x_t),

where c_t is a constant and g_1 and g_2 are the appropriate ARIMAX operator functions: g_1 is specified by the autoregressive lag p and the difference parameter d, and g_2 by the moving-average lag q. ε_t(h) are the residuals from the model h, and g_{3,t}(x_t) summarises the explanatory variables' contribution to the forecast.
In what follows, we first elaborate each specification of h with the associated method to pin down θ_t, the "fixed models", and then discuss adaptive learning models where h_t can be time-varying by learning from the past. A general computing approach to obtain the results is shown in Algorithm 1.
Algorithm 1: Obtaining the forecasts with a fixed h (fixed models)
Input: Data {Φ_t}_{t∈T}, specification of h, desired forecasting index set T, and validation data {y_{t+1}}_{t∈T}.
Output: Forecasts {y_{t+1|t}(h)}_{t∈T} and the performance metric.
1. For t ∈ T, repeat: estimate θ*_t on the w-windowed dataset and produce the forecast y_{t+1|t}(h).
2. Evaluate the performance metric.
3.2 The fixed models: the univariate and the multivariate

Univariate framework
In univariate models, the strategy to train the parameters θ*_t ∈ Θ(h) is rather classical: for a given h(ι, w, p, d, q) with ι ∈ {0, ..., 6}, we fit the following model on a w-windowed dataset: for τ ∈ {t − w + 1, ..., t},

Γ_t(p) ∆(d) y_τ = μ_t + <β_t(ι), x_{τ−1}> + Φ_t(q) ε_τ + dum_τ,

where Γ_t(p), ∆(d), Φ_t(q) are the lag operator functions under an ARIMAX(p,d,q) specification with constant μ_t, e.g.

Γ_t(p) = 1 − γ_{t,1}L − ... − γ_{t,p}L^p, ∆(d) = (1 − L)^d, Φ_t(q) = 1 + φ_{t,1}L + ... + φ_{t,q}L^q,

where L is the lag operator, i.e. Ly_t = y_{t−1}. The dum_τ term dynamically adds the number of required dummies as proposed in subsection 2.3.
We fit the model using a Maximum Likelihood Estimation (MLE) based on the specified dataset at each time t, thus obtain the relevant parameters to implement forecasting in Equation 6.
In the next paragraph we give an example to clarify the relationship between Equation 6 and Equation 7.

Univariate example
In this example we take p = d = q = 1 and ι = 3. Then Equation 7 becomes

(1 − γ_t L)(1 − L) y_τ = μ_t + <β_t(3), x_{τ−1}> + (1 + φ_t L) ε_τ + dum_τ,

and we may also write <β_t(3), x_{τ−1}> = β_{t,1} x_{τ−1,1} + β_{t,2} x_{τ−1,2} in scalar form. With these specifications, we can summarise all the parameters¹⁰ to estimate as θ_t = (μ_t, γ_t, φ_t, β_{t,1}, β_{t,2}, σ_t²) ∈ Θ(h) ⊆ R⁶, and in fact, by the standard time series setup we can pin down the specified parameter region: |γ_t| < 1, |φ_t| < 1, and σ_t² > 0. Moving on to forecasting: once we have obtained the appropriate θ*_t ∈ Θ(h), we proceed to Equation 6, which becomes

y_{t+1|t}(h) = y_t + μ*_t + γ*_t (y_t − y_{t−1}) + φ*_t ε_t + <β*_t(3), x_t>.

Univariate choices of parameter
So far we have explained the structure and strategy to train θ*_t ∈ Θ(h). Here we specify the choices of the model parameters. For the ARIMAX parameters, we put p, q ∈ {0, 1, 2} and d ∈ {1, 2}, with window sizes w ∈ {12, 24, 48, 96}. We therefore have 72 models for each of the seven univariate model groups, hence 504 models in total.
The reason for the window choices comes from their financial meanings: as one may note from the initial data cleaning (Table 1 in particular), 12 observations correspond to one trading hour and 24 to a session; likewise, 48 and 96 correspond to one and two trading days respectively. The parameters p and q may vary, but rather restrictively due to the degrees of freedom, especially for smaller window sizes, hence the choice. The choice of d is motivated both by the literature and by the rolling-ADF observations made previously (Figure 3). While d = 1 may usually be sufficient, on many occasions d > 1 is needed for stationarity, hence the two potential values of d.

Multivariate framework
In the multivariate model groups, we aim for the same forecasting formula as Equation 6 but implement a vector training strategy: for a given h(ι, w, p, d, q) with ι ∈ {7, ..., 12}, we train θ*_t ∈ Θ(h) by a VARMA(p, q) on a stacked vector S_t := (∆(d)y_t, x_t) ∈ R⁵. We fit the following model on a w-windowed dataset:

¹⁰ Apart from the dummies' term, which is straightforward to estimate.

At each time τ ∈ {t − w + 1, ..., t}:

Γ_t(p) M(ι) S_τ = μ_t + Φ_t(q) ε_τ + dum_τ.

Here we first note the role of M(ι): it transforms the stacked vector S_τ into another vector, on which we subsequently perform the VARMA. In particular, M(ι) ∈ {0, 1}^{n×5}, where n ∈ {2, 3} is the number of variables we plan to keep. Accordingly, Γ_t(p) and Φ_t(q) are the lag operator functions under a VARMA(p,q) specification with an n-dimensional variable, and the dum_τ term dynamically adds the number of required dummies as proposed in subsection 2.3.
The specification of M(ι) serves the same purpose as β_t(ι) in Equation 7: it selects the relevant entries of x_t to interact with y_t, which eventually contribute to the g_{3,t} part of the forecast. For model groups 7 and 8, the mean OIB and mean OFI, respectively, are the sole interactions being investigated; that is, M(7), M(8) ∈ {0, 1}^{2×5} with M(7)_{1,1} = M(7)_{2,2} = 1 and the remaining entries zero, and M(8)_{1,1} = M(8)_{2,3} = 1 with the rest zero.
Similarly, training is done by MLE, and we proceed with an example.

Multivariate choices of parameter
For the VARMA parameters, we put p = 1, q ∈ {0, 1}, d ∈ {1, 2}, with w ∈ {48, 96}. Hence 48 models are constructed in total. One may recognise this as a more restricted choice of parameters: the choices of w are limited to the larger sizes due to the degrees of freedom. In the previous example, the parameters to estimate are equivalent to 14 dimensions, which is not realistic to implement when window sizes are small. For the same reason we cap p, q ≤ 1, and since p = 0 gives little meaning in the vector models, p is fixed at 1 while q may take either of the two values.
¹¹ Ignoring the dummy variables.

Results from the fixed models
As a summary of the results thus far, we first plot the scatter and histograms in Figure 5, then, in Table 4, Table 5, and Figure 6, we produce tables and plots for the top-performing models under the MSE ranking and the SR. Table 12 in the appendix is also produced to summarise the relationship between model groups and explanatory variable(s).
For a general view of the results, we make scatter plots and histograms for all models except outliers: those with an MSE greater than 100 are excluded from the plot. As can be observed from Figure 5, large-window models in general produce lower MSE, potentially benefiting from their overall stability, while an outstanding small-window model may also achieve a small MSE with a large SR. The linear relationship between MSE and SR is weakly negative, with many points far below or above the fitted line. This supports the discrepancy observed later, namely that some models may only perform well in one of the two metrics.

For the top-performing models, as seen from Table 4, the "top performers" vary with the metric used: models without any features (the top 2 of the upper table) perform well in MSE or MAE but have rather low SR, while models with features and slightly worse MSE and MAE, here in particular the ones with either the OFI mean or the OFI p-score, can obtain a high SR, as seen from the lower table.
An interesting observation about window size may also be made: all of the models listed above have a size equal to or greater than 48, and a similar out-performance may be observed from the histograms of MSE in Figure 5. This corroborates the classical statistical concern about stability: models with smaller window sizes may have unstable estimates that occasionally induce large errors, and therefore perform badly in MSE or MAE, but not necessarily in SR. We take particular notice of one 12-window model with (p, d, q) = (0, 1, 1) from model group 4: it has the fifth-highest SR, with MSE, MAE, and SR reported as 34.22, 4.22, and 4.28 respectively. From another viewpoint, this shows the importance of having another performance metric: while small-window models produce drastic forecasts from time to time, their overall ability to forecast, or at least to forecast the direction (the SR statistic is constructed so that it depends on the sign rather than the magnitude of the forecast), may still be good.
In fact, when looking at the vector models below, we note this phenomenon to be rather significant, as shown in the top row of the lower table of Table 5. Indeed, smaller window sizes¹² cause instability, to the extent that 2 outliers of the 3600¹³ forecasts contribute largely to the badly performing MSE and MAE. In terms of the top-performing vector models ranked by MSE, they are close to the univariate results with a slightly higher SR, while the top-performing vector models ranked by SR do not outperform the ones from the univariate groups. This provides evidence that vector models do not perform outstandingly well in the context of one-step-ahead forecasting and windowed estimation.
We also plot the cumulative PL of each of the top models in Figure 6, along with the aforementioned 12-window model and the baseline. As can be seen, the univariate models can perform better than the vector models, while the large-window univariate model, i.e. the one with the lowest MSE, has a modest level of cumulative PL throughout. This could be explained by its ability to produce low errors, benefiting from the stability of larger degrees of freedom, while not being able to catch the time-varying changes of the underlying parameters; hence its ability to forecast the sign and make trading profits is rather low.
¹² Here we note that w = 48 is relatively small in the context of vector models, due to the dimension of the parameters they need to estimate. One may observe from Figure 5 that the distribution of vector models for w = 48 is indeed much wider than that of univariate models with the same window, and several extreme points exist. ¹³ As checked in the details of their distribution.

Principle of learning
In the above fixed model groups, we hold h_t constant throughout. Here, we consider an adaptive method to learn h_t. We also notice that, in h(11, 48, 1, 1, 1) for example, some drastic outliers could worsen the performance, especially when evaluated by the MSE. Hence we first shrink the functional set H at each time t to a subset H̃_t that excludes models whose recent forecasts are outliers (Equation 11), assigning a certain error- and outlier-handling ability. This makes sure the set of functions we select from contains no outliers. Now, let the time index W be large enough that we can run all the previous model groups and obtain a suitable number of forecasting errors ε_{τ+1}(h) from each model h, defined as follows. At any t ≥ W, we have access to the models in groups 0 to 12, which produce forecasts y_{τ+1|τ}(h) for each h ∈ H, W ≤ τ ≤ t − 1. We write ε_{τ+1}(h) := y_{τ+1} − y_{τ+1|τ}(h), and the adaptive learning aims to learn from the errors available up to time t, together with other available information, to decide on the model to employ at time t.
A general strategy to train h_t at time t is to construct a loss function ℓ(Φ_t, h, H̃_t \ {h}) and solve the appropriate minimisation problem (Equation 12):

h*_t = argmin_{h ∈ H̃_t} ℓ(Φ_t, h, H̃_t \ {h}).

The associated computing procedure is supplied in Algorithm 2.
Algorithm 2: Obtaining the forecasts with a time-varying h_t
Input: Data {Φ_t}_{t∈T}, functional set H, specification of the loss function ℓ(·, ·), desired forecasting index set T, and validation data {y_{t+1}}_{t∈T}.
Output: Forecasts {y_{t+1|t}(h*_t)}_{t∈T} with the associated functions {h*_t}_{t∈T}, and the performance metric.
1. For t ∈ T, repeat: (a) produce H̃_t according to Equation 11; (b) evaluate and execute the minimisation given by Equation 12, obtaining h*_t and the forecast y_{t+1|t}(h*_t).
2. Evaluate the performance metric.
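The selection step of Algorithm 2 reduces to a minimisation over the shrunk functional set. A minimal sketch follows, with the loss left as a pluggable function and a simple absolute-error cap standing in for the Equation 11 shrinkage; both are illustrative assumptions.

```python
import numpy as np

def select_model(errors_by_h, loss, outlier_cap=None):
    """Pick h*_t = argmin over H-tilde of loss(past errors of h).

    errors_by_h: dict mapping model label h -> array of recent errors eps_tau(h).
    outlier_cap: optional bound used to shrink H to the non-outlier subset.
    """
    candidates = errors_by_h
    if outlier_cap is not None:
        candidates = {h: e for h, e in errors_by_h.items()
                      if np.max(np.abs(e)) <= outlier_cap}
    return min(candidates, key=lambda h: loss(candidates[h]))
```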
In what follows, we consider two groups of specifications of the loss function, the first of which is motivated by Yang (2020).

Adaptive learning on errors
In model group 13, we construct¹⁴

ℓ_t(h) = Σ_{j=0}^{47} λ^j L_local(ε_{t−j}(h)),

which defines the loss by focusing solely on the errors in the past 48 observations, discounted geometrically at a rate λ ∈ (0, 1]. The local loss function L_local(·) is specified as a continuously differentiable combination of the zero, square, and absolute losses, switching at constants C_1(t) < C_2(t). Benefiting from the high-frequency dataset, and differently from Yang (2020), we place time-varying constants at which the losses switch: we trial different quantiles of {|ε_t(h)| : h ∈ H} to set C_1(t) and C_2(t); e.g. C_1(t) can be the 25% quantile of {|ε_t(h)| : h ∈ H} and C_2(t) the 50% or 75% quantile. Note that quantiles have the benefit of being outlier-resilient, as they relate to the distribution rather than the expectation of the set. These time-varying constants allow the local penalisation, and thus the minimisation process, to be done in a time-varying manner.
For the selection of parameters, we consider λ ∈ {0.8, 0.85, 0.9, 0.95, 0.99, 1} and the 25%, 50%, 75% quantiles for C_1, C_2. Some interpretation of λ can be made from its logarithmic and exponential properties: the half-life of 0.95, for example, is around 13.5, so λ = 0.95 essentially reviews the past 14 or so observations with little role played by the more distant ones, while λ = 0.8 is more extreme, as its half-life is only about 3.1. A general table is supplied in Table 13.
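A sketch of the exponentially discounted local loss. The C¹-smooth pasting between the zero, square, and absolute regimes shown here is one possible construction consistent with the description above; the paper's exact constants may differ.

```python
import numpy as np

def local_loss(e, c1, c2):
    """Zero for |e| <= c1, quadratic up to c2, linear beyond; C1-smooth pasting."""
    a = abs(e)
    if a <= c1:
        return 0.0
    if a <= c2:
        return 0.5 * (a - c1) ** 2                        # square-loss regime
    return 0.5 * (c2 - c1) ** 2 + (c2 - c1) * (a - c2)    # absolute-loss regime

def weighted_loss(errors, lam, c1, c2):
    """Geometric discounting: most recent error last, weight lam^j for age j."""
    losses = np.array([local_loss(e, c1, c2) for e in errors])
    weights = lam ** np.arange(len(losses) - 1, -1, -1)
    return float(weights @ losses)
```

The derivative matches at both thresholds: zero at c1 on either side, and (c2 − c1) at c2, so the switch is continuously differentiable.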

Adaptive learning with functional awards and penalties
One concern about focusing purely on the forecasting error is the potential to misfit, as the functional form plays a role in the degrees of freedom and may also be important when selecting which model to adopt as the most appropriate h*_t. The classical approach of adding a penalisation to the model selection criterion goes back to Akaike (1974), and later to the lasso methods, in particular the fused lasso (Tibshirani et al. 2005). We differ from these methods in that we focus on the out-of-sample loss rather than the in-sample likelihood.
Here, in model group 14, we take a more time-varying approach: we consider a penalisation or reward, depending on the particular design, between the function of concern h and the previous choice h*_{t-1}, written D(h, h*_{t-1}).
In what follows we first consider, in type-1, purely penalising the difference in each of the variables that form h, i.e. the time series parameters and the window sizes.
As specified, we pool p, d, q together and, due to the size difference, treat w separately. Pooling p, d, q can be appreciated as penalising the change in complexity, in particular the total number of lags and differences. Another possibility is to separate them and penalise each change individually. It is also interesting to note that we may still prefer large-window models for their stability, hence an additional reward can be made to encourage switches into smaller window sizes and difference orders; this gives the type-2 and type-3 specifications, in both of which C_3 > 0 and C_4, C_5 < 0.
Both specifications carry a strictly increasing reward for smaller window sizes, as small-window models obtain a more negative value on the C_4(48 − w) term, which becomes beneficial at the minimisation stage. The difference between type-2 and type-3 is that in type-3 the penalisation on the p, q terms is applied only if d stays the same; this paves the way for potential switches in d, which may be desirable when instability breaks out and later subsides.
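Since the displayed equations for the three types are not reproduced here, the following is only a hedged sketch of how such a penalty D(h, h*_{t-1}) could look: pooling p, d, q, treating w separately, adding a C_4(48 − w) reward for small windows, and, in type-3, applying the p, q penalty only when d is unchanged. The dictionary keys, default constants, and the exact functional form are our assumptions, not the paper's definitions:

```python
def D(h, h_prev, dtype=1, C3=1.0, C4=-0.5, C5=-0.5):
    """Sketch of a functional-change penalty between candidate h and the last choice.

    h, h_prev: dicts with keys 'p', 'd', 'q' (ARIMA orders) and 'w' (window size).
    dtype 1: penalise all changes (pooled p, d, q plus window).
    dtype 2: scaled pooled penalty plus a reward C4*(48 - w) for small windows.
    dtype 3: like 2, but the p, q penalty applies only when d is unchanged.
    """
    pooled = (abs(h['p'] - h_prev['p']) + abs(h['d'] - h_prev['d'])
              + abs(h['q'] - h_prev['q']))
    if dtype == 1:
        return pooled + abs(h['w'] - h_prev['w'])
    reward = C4 * (48 - h['w'])  # more negative (beneficial) for smaller w
    if dtype == 2:
        return C3 * pooled + reward + C5 * h['d']
    # dtype == 3: p, q changes are free when d switches, opening the door to
    # difference-order moves during unstable periods.
    pq = abs(h['p'] - h_prev['p']) + abs(h['q'] - h_prev['q'])
    return (C3 * pq if h['d'] == h_prev['d'] else 0.0) + reward + C5 * h['d']

h_prev = {'p': 1, 'd': 1, 'q': 1, 'w': 48}
penalty = D({'p': 1, 'd': 1, 'q': 1, 'w': 12}, h_prev, dtype=2)
```

The key property the sketch preserves is the one stated in the text: a smaller w makes the C_4(48 − w) term more negative, so small-window candidates enter the minimisation at an advantage, and in type-3 a change in d suspends the p, q penalty.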
We inherit the parametrisation of model group 13, while for C_3, C_4, C_5 we consider a time-varying parametrisation with C_3(t), C_4(t), C_5(t) ∝ l*_t, for a reference loss level l*_t, with particular fractions for type-1 and for types 2 and 3. The rationale behind these particular fractions is that we aim to control the maximum loss added by each term of D(h, h*_{t-1}) to be half of l*_t, and likewise for the magnitude of the rewards in types 2 and 3.

For each type in model group 14, and throughout model group 13, we select the best model ranked by either MSE or SR and list them in Table 6. Compared against Table 4, we see that the adaptively learnt models may achieve similar, though not better, results than the fixed models, both when measured in MSE and in SR. This motivates the later subsection where we look closer at periods in which adaptively learnt models outperform the best of the fixed model groups. For a more financial comparison, Figure 7 plots the cumulative PL, first among the learning models themselves and then against the top fixed models. Compared to the best model measured in MSE, which performs poorly in SR, the adaptively learnt models obtain higher cumulative returns, though not as high as the best individual models measured in SR.

Note about the number of days: after some data is used for the initialisation of models, we have 99 trading days starting from 20th November 2017, which explains the small difference between the baseline data here and that in Table 3. Note about the model codes: fixed models are coded by model group, window size, and p, d, q parameters; learning models are coded by model group, C_1, C_2 in percentage, type number (if applicable), and λ.
Another financial comparison can be made from Table 7: the standard deviation, minimum, and maximum of the daily returns from the adaptive learning models are all in line with the top-performing fixed models, which intuitively reflects the stability of these adaptively learnt models.
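A minimal sketch of the kind of cumulative-PnL and daily-return summary used in these comparisons, assuming a simple directional one-step PnL (our own construction, not necessarily the paper's exact PL definition; all data below is made up):

```python
import math

def daily_pnl(prices, forecasts):
    """One-step directional PnL: long if the forecast exceeds the current price."""
    pnl = []
    for t in range(len(forecasts)):
        direction = 1.0 if forecasts[t] > prices[t] else -1.0
        pnl.append(direction * (prices[t + 1] - prices[t]) / prices[t])
    return pnl

def summary(returns):
    """Std, min, max, and a Sharpe-style ratio of a return series."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    sd = math.sqrt(var)
    return {'std': sd, 'min': min(returns), 'max': max(returns),
            'sr': mean / sd if sd > 0 else float('nan')}

prices = [100.0, 100.5, 100.2, 100.8, 101.0]
forecasts = [100.3, 100.1, 100.9, 100.9]  # one-step-ahead price forecasts
rets = daily_pnl(prices, forecasts)
cum_pnl = [sum(rets[:i + 1]) for i in range(len(rets))]
stats = summary(rets)
```

Comparing `stats['std']`, `stats['min']`, and `stats['max']` across models is exactly the Table 7 style of comparison, while `cum_pnl` corresponds to the curves in Figure 7.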

Results from adaptive learning model groups
A benefit of adaptive learning models is that one can look into the formation of each model. As shown in Figure 8, we can see the preferences over the model parameters through the period: univariate models account for a majority, d = 2 is occasionally visited, and there is a good blend of usage of explanatory variables and the other time series parameters.
The choices of window sizes, depending on the design, may vary widely; in particular, M4 has a strong preference for the smaller ones and, in terms of explanatory variables, prefers to use none of them.
In the following subsection, we take a closer look at periods when adaptive learning results outperform the fixed models.

A closer look at adaptive learning results
We first observe a period when prices are volatile and most models obtain large forecasting errors. We zoom into a five-trading-day period starting from 8th February 2018. As plotted at the top of Figure 9, there is a large drop followed by big fluctuations. We label M_0 as the best-performing individual model in terms of MSE, which is the window-96, (p, d, q) = (0, 1, 1) model with no explanatory variable; as we see from the second plot of Figure 9, its forecasting errors can be spiky and occasionally large, contributing to a large MAE and MSE on average, as reported in Table 9.

As observed from Table 9, the MAE and MSE in this period are generally high, while the adaptively learnt models reduce them to a certain extent. The second plot of Figure 9 shows the absolute forecasting errors of M_0 and M_2, and the third row of plots in Figure 9 shows the level of improvement (if positive) or worsening (if negative) from M_0 to M_1 on the left, and from M_0 to M_2 on the right.
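The improvement/worsening series plotted in Figure 9 can be computed as a difference of absolute errors, positive when the adaptive model beats the fixed baseline at that step. A sketch with made-up error values:

```python
def improvement(errors_fixed, errors_adaptive):
    """Positive entries: the adaptive model beat the fixed baseline at that step."""
    return [abs(e0) - abs(e1) for e0, e1 in zip(errors_fixed, errors_adaptive)]

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    return sum(e * e for e in errors) / len(errors)

e_M0 = [0.1, -2.5, 0.3, 1.8, -0.2]   # spiky baseline errors (toy values)
e_M2 = [0.2, -0.9, 0.4, 0.6, -0.3]   # adaptive model: smaller extremes
imp = improvement(e_M0, e_M2)
```

Note that the adaptive model can be slightly worse on many quiet steps yet still win on MAE and MSE, because avoiding the occasional spike dominates the averages; this is the pattern described around Table 9.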
The selection of window sizes may also explain the source of the improvement: shorter ones are selected during a few periods when prices are relatively unstable, and the relevant p, q parameters change throughout. These details can be seen further at the bottom of Figure 9. Between M_1 and M_2, some difference can also be observed in the variable selection and training method, though most of the time no explanatory variable and univariate training are preferred.
The tendency of adaptive learning models to perform better in the non-stationary part of the data motivates another review of their performance. Here we consider the non-stationary data observed in subsection 2.3: there were times when, at the window-12 level, we fail to reject the null hypothesis of the ADF tests at the 5% significance level for each of 0-diff, 1-diff, and 2-diff, meaning that y_t is not stationary even after differencing twice. There are 152 observations of this nature that intersect with the forecasting set, and in Table 10 we provide summary statistics for the forecasting errors.

Table 10: Summary of the absolute forecasting errors (columns 1 to 2) and squared forecasting errors (columns 3 to 4)

In Table 10, we see that the best fixed model (M_0) has moderate MAE and MSE, while the adaptively learnt one (M_1) has a slightly smaller MSE, benefiting from its avoidance of large errors. The histogram in Figure 10 also supports this evidence: errors from M_1 have less mass in the right tail. The formation of the adaptively learnt model is shown in Figure 11: it has a high proportion of model group 0, meaning the vast majority of models are purely ARIMA without explanatory variables, while the window sizes tend to be the smaller ones, avoiding 96. The difference order is usually taken at 1, similar to the others observed previously.

Having adaptively learnt models combining different explanatory variables, training methods, ARIMA parameters, and window sizes, one may want to statistically test the significance of certain functional classes within the selection over certain periods. Here we provide two approaches: a simple Bayesian framework, where the prior is set proportional to the inverse of the size of the hypothesis class and the likelihood is updated by simple counting, and a frequentist framework, where a binomial distribution can be assumed and p-values produced.

A simple Bayesian hypothesis testing
For subsets of functions H_1 ⊆ H_0 ⊆ H, consider the following hypothesis: H_0: functions in the set H_1 have the same or lower chance of being selected than the ones in H_0 \ H_1.
H A :Functions in the set H 1 have a higher chance of being selected than the ones in H 0 \ H 1 .
Note that here, "being selected" refers to a particular function in the set H_0 or H_1 being selected by the adaptive learning model.
We consider a Bayesian framework to test the above hypothesis: write π(H 0 ) and π(H A ) as the prior for H 0 and H A , respectively, and π(y t |H 0 ), π(y t |H A ) as the (conditional) likelihood for a testing observation value y t based on the hypothesis H 0 and H A respectively.
We define the corresponding probabilities accordingly. As a result, given a period of data y := {y_t}_{t=T_0}^{T_1}, the Bayes factor for H_A is computed as the likelihood ratio π(y|H_A)/π(y|H_0). When zero appears in the denominator, we assign the Bayes factor infinity and plot a high value for illustration.
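Since the displayed probability definitions are not reproduced here, the following is only one possible counting-based instantiation, our assumption rather than the paper's exact formulas: under H_0 a selection lands in H_1 with probability |H_1|/|H_0| (uniform over H_0), under H_A we use the empirical frequency of H_1 selections, and the Bayes factor is the resulting likelihood ratio over the period:

```python
def bayes_factor(n1, n_total, size_H1, size_H0):
    """Counting-based Bayes factor sketch for 'H_1 is over-selected'.

    n1: number of steps t with h*_t in H_1, out of n_total selections.
    Under H_0 each selection lands in H_1 with p0 = |H_1|/|H_0|; under H_A we
    plug in the empirical frequency p_hat. A zero denominator maps to infinity,
    matching the plotting convention described in the text.
    """
    p0 = size_H1 / size_H0
    p_hat = n1 / n_total
    n0 = n_total - n1
    num = (p_hat ** n1) * ((1 - p_hat) ** n0)
    den = (p0 ** n1) * ((1 - p0) ** n0)
    return float('inf') if den == 0 else num / den
```

With 180 selections per 5-day period, even a modest excess of H_1 choices over the uniform rate produces a very large factor, which is consistent with the "gigantic" Bayes factors reported later.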

A frequentist hypothesis testing
Another, more common, frequentist way to test the hypothesis is to assume that, under H_0, a model in H_1 is chosen with probability |H_1|/|H_0|, as we assume no better performance under the null.
Suppose we have a period of test data y := {y_t}_{t=T_0}^{T_1}, and write n_1 for the number of t such that h*_t ∈ H_1. Under the null hypothesis we have n_1 ∼ Bin(T_1 − T_0 + 1, |H_1|/|H_0|), as there are T_1 − T_0 + 1 observations, each with a likelihood of at most |H_1|/|H_0| of being chosen. We can therefore set the relevant critical value for the hypothesis test, as well as the p-value. As usual, a close-to-zero (usually less than 0.05) p-value indicates a rejection of the null in favour of the alternative H_A.
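Under the null, the upper-tail binomial probability gives the one-sided p-value directly; a self-contained sketch using only the standard library (the toy numbers are ours):

```python
from math import comb

def binomial_pvalue(n1, n, p):
    """P(X >= n1) for X ~ Bin(n, p): one-sided p-value against H_A."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n1, n + 1))

# Toy example: 8 or more 'successes' out of 10 trials with p = 0.5.
p_val = binomial_pvalue(8, 10, 0.5)  # = 56/1024 ~ 0.0547
```

In the paper's setting one would take n = T_1 − T_0 + 1 = 180 and p = |H_1|/|H_0| for the chosen functional sets, rejecting the null when the p-value falls below 0.05.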

Result
We use the above approach to test seven hypotheses, each over a period of five trading days, within which there are 180 forecasting samples. We use three adaptive learning models to test simultaneously: we label M_1 as the one with λ = 1 from model group 13, M_2 as the one with λ = 0.8 from model group 13, and M_3 as the one with λ = 1 from model group 14, type 1. All have (C_1, C_2) = (50%, 75%). M_1 and M_3 are chosen for their good performance in MSE (see Table 6), while M_2 is picked as a representative of low λ, corresponding to a shorter-memory selection that aids the statistical conclusions here, though it performs relatively poorly in the MSE ranking. It is also worth noting that M_3 may be slightly questionable for the frequentist test, as the set-up of model group 14 could lead to highly correlated model choices due to the term D(h, h*_{t-1}), which contradicts the binomial assumption underlying the null hypothesis.
The first hypothesis test concerns the window size: whether it is 96 or not. Hence the functional set is H_1 = {h(ι, w, p, d, q) | w = 96} and H_0 = H. As shown in Figure 12, this is significant on many days for both tests: a high Bayes factor can be observed for all models in all periods but one. Likewise, a close-to-zero, if not zero, p-value can be observed for almost all periods. Such frequent rejection of the null hypothesis means that the adaptive learning models still rely heavily on the large-window models.

We now consider the hypothesis of whether the model group is zero. Hence H_1 = {h(ι, w, p, d, q) | ι = 0} and H_0 = H. We see from Figure 13 that the significance is high in most periods, so we may conclude that these adaptive learning models are heavily occupied by model group zero, and therefore that the use of explanatory variables is low in many periods.

We next question the training method: whether the parameters are trained in univariate or vector models. Here H_1 = {h(ι, w, p, d, q) | ι ≤ 6} and H_0 = H. From the previous examples, the univariate choice may be intuitively true, as multivariate models are rarely used, and the testing results confirm this intuition. Indeed, as shown in Figure 14, significance appears in all periods for both the Bayesian and the frequentist tests, and the magnitude of the Bayes factor is gigantic.

Now, we ask whether small-window models are significant, compared to the already-tested significant large-window models (from Figure 12). In this case H_1 = {h(ι, w, p, d, q) | w = 12} and H_0 = {h(ι, w, p, d, q) | w ∈ {12, 96}}. The p-values are almost all 1, and the Bayes factors, shown in the left panel of Figure 15, are rather low, meaning the small-window models are not picked significantly often compared to the large ones. We also attempt to "zoom in" by using a 1-day period, for which the Bayes factors are plotted in the right panel.
The p-values are mostly 1 or close to 1, while the Bayes factors, as can be seen, are somewhat spiky. This can be due to the way the Bayes factor is constructed, but it suggests some instability, as the value is spiky for most models. This means that, although the small-window models are non-significant at a larger scale, they may occasionally play a part, as suggested by the Bayes factors over a 1-day period.
Likewise, in the following we test the significance of certain choices of variables compared to a wider set. In particular, we first consider the sole choice of the OFI mean compared against its combination with the OIB mean or solely the OIB mean, which leads to H_1 = {h(ι, w, p, d, q) | ι ∈ {2, 8}} and H_0 = {h(ι, w, p, d, q) | ι ∈ {1, 2, 3, 7, 8, 9}}. The test results, shown in Figure 16, suggest no dominance by the OFI mean for all but one period from a frequentist viewpoint, while the Bayes factor is occasionally large, meaning significance may exist in some periods. This is because the set in the null hypothesis (H_0) may not occupy a large proportion of the functions being chosen, contributing to a spiky and occasionally large Bayes factor.

Figure 15: Bayesian testing result in a 5-day period (left) and in a 1-day period (right). Note on the right panel: infinity is plotted as 3 here.
Figure 16: Bayesian testing result (left) and frequentist testing result (right). Note on the left panel: infinity is plotted as 220 here, with a logarithmic scale (log 10) being used.
We now run the same test for the OFI p-score compared against its combination with the OIB p-score or solely the OIB p-score. Hence H_1 = {h(ι, w, p, d, q) | ι ∈ {5, 11}} and H_0 = {h(ι, w, p, d, q) | ι ∈ {4, 5, 6, 10, 11, 12}}. As shown in Figure 19 in the appendix, the significance is not large, although the Bayes factor may be occasionally high; this, as explained above, can be purely due to the small number of samples occupied by H_0.
Finally, we consider the choice of the difference order, thus H_1 = {h(ι, w, p, d, q) | d = 2} and H_0 = H. As shown in Figure 17, when testing over a 5-day period one may easily conclude the insignificance of H_1 from its consistently low Bayes factor and, likewise, its p-value sitting almost constantly at unity; once zoomed into the 1-day period, the result becomes spiky in both the Bayesian and the frequentist tests, implying that the significance of the 2nd difference being chosen is occasionally high.

Figure 17: Bayesian testing result (left) and frequentist testing result (right) with a 5-day period (top) and a 1-day period (bottom). Note on the lower left panel: infinity is plotted as 18 here.
In conclusion from these tests, we see a significant component of large-window models and models without explanatory variables being selected by the top learning models, while instability and spikiness occur in every aspect of the model choices, especially when we zoom into a 1-day period: occasionally, statistical significance can be found for the 2nd difference, the small-window models, and groups of some explanatory variables.

Discussion, Extension, and Conclusion
As a general discussion of adaptive learning, we reflect first on the statistical intuition: what is h_t, and why do we care to learn it? The ever-changing structure, as frequently studied, requires a certain awareness by the model of the time-variability, not only of the parameters but also of the functional forms. Classical treatments of time series, e.g. Harvey (1989) and Harvey (1993), allow for parameter variation, while modern deep learning models, e.g. Sirignano and Cont (2018), can help in learning h_t.
However, some issues may occur in learning h_t: interpretability is one, and ultimately the design could be questionable. Here we design a model-adapting and time-adaptive learning regime to offer higher interpretability and to allow testing to be undertaken. Statistical conclusions may also be drawn from the adaptive learning models; for example, the explanatory variables may not be of much use on many occasions, as concluded in the testing.
The design, particularly Equation 12 and Equation 15, allows H to be potentially infinite. For instance, one could let H contain infinitely many p but restrict each move to be at most 1, i.e. D(h, h*_{t-1}) takes infinity if |p − p*_{t-1}| > 1. This allows the model to be theoretically more flexible and contributes to the ultimate learning of h_t. Another more computationally expensive, but also important, extension is to dynamically learn the "hyper-parameters". Here we have a variety of parameters, C_1(t) and C_2(t) for example, set as constant proportions of certain statistics from the past; these constants are ones the model could itself have learnt, though such a search would require high computational power. An interesting extension would be to learn these parameters and discuss the improvement in the learning.
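The infinite-H extension with a unit-move constraint on p could be sketched as follows (the function and parameter names are ours, for illustration only):

```python
def move_penalty(p, p_prev, base_penalty=1.0):
    """D-style penalty letting H contain all p >= 0 while restricting moves.

    Any jump of more than 1 in p from the previous choice receives an infinite
    penalty, so the per-step search stays local even though the class of
    candidate orders is unbounded.
    """
    if abs(p - p_prev) > 1:
        return float('inf')
    return base_penalty * abs(p - p_prev)

# Only p in {p_prev - 1, p_prev, p_prev + 1} survives the minimisation.
feasible = [p for p in range(0, 100) if move_penalty(p, 5) < float('inf')]
```

The same device applies to any coordinate of h: the hypothesis class can grow without bound while the adaptive step at each t remains a small, cheap neighbourhood search.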
We also note the financial applicability of such an adaptive learning model: extra parameters can be introduced to engage with practical trading applications, e.g. the loss function in Equation 12 could take a specification related to a rolling-averaged profit and loss, or a more realistic profit and loss with trading barriers, e.g.

Σ_{t=t_s+1}^{t_s+17} sign(α_t) 1[α_t ∈ R(Φ_t, H)] (P_{t+1} − P_t)/P_t,

where R(Φ_t, H) is a time-varying region in which the signal is strong enough to trade, and which can itself be one of the parameters being learnt. Such an extension may also engage with contemporary econometric methods in conditional heteroskedasticity (Harvey 2013).
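The barrier PnL in the formula above could be sketched as follows, with a simple symmetric threshold |α_t| > c standing in for the region R(Φ_t, H) (the threshold form, names, and toy data are our assumptions):

```python
def barrier_pnl(prices, signals, threshold):
    """Sum of sign(alpha_t) * 1[alpha_t in R] * (P_{t+1} - P_t) / P_t.

    The region R is approximated by |alpha_t| > threshold: trade only when the
    signal is strong enough, in the direction of its sign.
    """
    total = 0.0
    for t in range(len(signals)):
        if abs(signals[t]) > threshold:
            direction = 1.0 if signals[t] > 0 else -1.0
            total += direction * (prices[t + 1] - prices[t]) / prices[t]
    return total

prices = [100.0, 101.0, 100.5, 100.7]
signals = [0.8, -0.1, 0.6]   # alpha_t: signed signal strength (toy values)
pnl = barrier_pnl(prices, signals, threshold=0.5)
```

Raising the threshold shrinks the trading region, so the learnt barrier trades off participation against signal quality, which is exactly the parameter one would want the adaptive scheme to select.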
An additional direction is to engage with the study of penalisation, in particular penalisation in estimation. In time series, due to the moving average terms, MLE is mostly inevitable, thus the penalisation must be done in the fashion of penalised MLE (pMLE). Here we focus on the penalisation of functional forms post-estimation, while using MLE at the first stage. An extension would be to engage with the theory of pMLE, e.g. from Cole, Chu, and Greenland (2013) and Spokoiny (2018), and use the adaptive learning to appreciate the value of penalisation in the context of forecasting. A particular modelling issue, as shown here, is the distaste for multivariate models; this may be due to a failure to capture the underlying time-varying parameters combined with low degrees of freedom. This has also been studied by Wilms et al. (2017), and further work could consider penalisation and potentially "smart identification" using past information.
In conclusion, we propose a forecast-centric learning model that aims to adapt to past information in a time series context. Such a model takes as inputs different functional forms, as well as training methods to produce parametric estimation and forecasts; these are handled by traditional ARIMA models and explanatory variables generated from the order book. The result of the learning model is comparable to the top models had the models been fixed, and can outperform the fixed models in relatively volatile and non-stationary market conditions. Additionally, greater stability is ensured, and error handling, functional penalisation, and potentially other criteria can be encoded as part of the model selection process. Since the process is intuitive and interpretable, many extensions and applications can be made, of which we have shown an application to statistical testing in both a Bayesian and a frequentist context.

Table 11: Summary of statistics for features (columns 1 to 4), the VWM (column 5), and the number of observations per 5-minute bracket (column 6).