4.1. LASSO
Inflation forecasting typically employs large datasets containing many variables, some of which may be irrelevant for prediction. The Least Absolute Shrinkage and Selection Operator (LASSO) selects only the most important covariates, discarding irrelevant information while keeping the prediction error as small as possible (Freijeiro-González et al., 2022).
LASSO combines properties of both subset selection and ridge regression. This allows it to produce interpretable models (like subset selection) while being as stable as ridge regression. The LASSO minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Because of this constraint, LASSO tends to shrink the coefficients of some predictor variables to exactly 0, thus yielding interpretable models (Tibshirani, 1996). The LASSO model is set up as follows:
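The data are $(\mathbf{x}^i, y_i)$, $i = 1, 2, \dots, N$, where $\mathbf{x}^i = (x_{i1}, \dots, x_{ip})^T$ are the predictor variables and $y_i$ are the responses. We either assume that the observations are independent or that the $y_i$s are conditionally independent given the $x_{ij}$s. We assume that the $x_{ij}$s are standardized so that:

$$\frac{1}{N}\sum_{i} x_{ij} = 0, \qquad \frac{1}{N}\sum_{i} x_{ij}^2 = 1.$$

Letting $\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_p)^T$, the lasso estimate $(\hat{\alpha}, \hat{\beta})$ is defined by:

$$(\hat{\alpha}, \hat{\beta}) = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j} \beta_j x_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j} |\beta_j| \leq t.$$

Here $t \geq 0$ is a tuning parameter. Now, for all $t$, the solution for $\alpha$ is $\hat{\alpha} = \bar{y}$. We can assume without loss of generality that $\bar{y} = 0$ and hence omit $\alpha$.

The parameter $t \geq 0$ controls the amount of shrinkage that is applied to the estimates. Let $\hat{\beta}_j^{o}$ be the full least squares estimates and let $t_0 = \sum_j |\hat{\beta}_j^{o}|$. Values of $t < t_0$ will cause shrinkage of the solutions toward 0, and some coefficients may be exactly equal to 0. For example, if $t = t_0/2$, the effect will be roughly similar to finding the best subset of size $p/2$. The design matrix does not need to be of full rank (Tibshirani, 1996).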
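As an illustration of how the constraint produces coefficients that are exactly zero, the following is a minimal sketch of a LASSO fit by cyclic coordinate descent. It is not the estimation routine used in this study. It assumes the equivalent penalised form of the problem, in which $\frac{1}{2N}$ times the residual sum of squares plus `lam` times the sum of absolute coefficients is minimised, and it assumes standardized predictors and a centred response. The function names, the penalty name `lam` and the toy data are illustrative assumptions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2N) * RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    Assumes every column of X has mean 0 and mean square 1, and y has mean 0.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual that excludes feature j.
            rho = 0.0
            for i in range(n):
                fitted_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_others)
            beta[j] = soft_threshold(rho / n, lam)
    return beta


# Toy data (hypothetical): two standardized, orthogonal predictors.
X_toy = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y_toy = [2.1, -1.9, 1.9, -2.1]  # equals 2.0 * x1 + 0.1 * x2
print(lasso_coordinate_descent(X_toy, y_toy, lam=0.5))
```

In this toy example the weak second predictor is dropped entirely while the strong first predictor is shrunk but retained, which is the selection behaviour described above.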
The reason for including LASSO in our model is to tackle the problems of overfitting and optimism bias. A LASSO regression tries to identify the variables and corresponding regression coefficients that constitute a model minimizing prediction error. This is done by imposing a constraint on the model parameters that shrinks the regression coefficients towards zero, forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, which is controlled by the tuning parameter $\lambda$. After the shrinkage, variables with regression coefficients equal to zero are excluded from the model (Ranstam and Cook, 2018).
We use an automated k-fold cross-validation approach for choosing $\lambda$. To this end, the dataset is randomly partitioned into k sub-samples of equal size. In each round, k-1 sub-samples are used for developing a prediction model and the remaining sub-sample is used to validate this model. This is done k times, with each of the k sub-samples in turn being used for validation and the others for model development. By combining the k separate validation results for a range of $\lambda$ values and choosing the preferred $\lambda$, we obtain the results that are used to determine the final model.
This technique reduces overfitting without the need to reserve a subset of the dataset exclusively for internal validation. A disadvantage of the LASSO approach is that one may not be able to reliably interpret the regression coefficients in terms of independent risk factors, since the focus is on the best combined prediction and not on the accuracy of the estimation (Ranstam and Cook, 2018).
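The cross-validation procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the routine used in this study: to keep the example self-contained it uses a single predictor without intercept, for which the LASSO slope has a closed form, and all function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition range(n) into k folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def fit_lasso_1d(x, y, lam):
    """Closed-form lasso slope for one predictor and no intercept."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    scale = sum(a * a for a in x) / n
    shrunk = max(abs(rho) - lam, 0.0) * (1 if rho > 0 else -1)
    return shrunk / scale


def cv_error(x, y, lam, folds):
    """Mean squared validation error for one lambda, pooled over all folds."""
    total, count = 0.0, 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        beta = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - beta * x[i]) ** 2 for i in held_out)
        count += len(held_out)
    return total / count


def choose_lambda(x, y, lambdas, k=5, seed=0):
    """Return the candidate lambda with the smallest k-fold validation error."""
    folds = k_fold_indices(len(x), k, seed)
    return min(lambdas, key=lambda lam: cv_error(x, y, lam, folds))
```

Each observation is used for validation exactly once and for model development k-1 times, so no part of the sample has to be set aside permanently.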
4.2. LSTM
As depicted in Figure 3, the LSTM model is a variant of recurrent neural networks (RNNs) (Almosova and Andresen, 2023). Unlike other neural networks, a recurrent neural network updates its state at each time step. This means that the model adjusts its forecasts based on previous time steps. RNN models have proven particularly useful for sequence-dependent tasks such as time series analysis, natural language processing and sound recognition (Mullainathan and Spiess, 2017). For example, in music recognition one could observe a pattern in the sound, making it possible to predict what comes next or which song is playing (Bishop, 2006). For such models it is crucial that there is a pattern in the data, and that earlier values in the sequence are informative about later values.
The RNN model is able to update its memory based on previous steps and to consider long-term trends and patterns in the data (Tsui et al., 2018). Consider an abnormal drop in inflation in one month that deviates from previous time steps in the data. The RNN takes into account the underlying pattern in the data based on previous observations, and treats the fall in inflation as an abnormality. What makes inflation behavior abnormal, and which patterns the model detects in order to label the drop in inflation as abnormal, is inherently difficult to grasp.
Figure 3. Classification of neural networks. LSTM is a specific type of recurrent neural network (RNN) within the broader group of neural networks (Almosova and Andresen, 2023).
LSTM, on the other hand, differs from other RNNs in that it has an enhanced capability for capturing long-term trends in the data (Tsui et al., 2018). Consider an inflationary event in the 1970s that has a similar pattern to one observed recently. The LSTM will observe the similarities in the pattern of the two events, and take this into account when making its next prediction. It is important to state that the event occurring in the 1970s will not be fully weighted, but adjusted for short-term events seen in the data. LSTM thus has the ability to consider both distant and recent events when making its predictions (Lenza et al., 2023).
LSTM has proven to be highly efficient for sequential data and has been used to compute univariate forecasts of monthly US CPI inflation. LSTM slightly outperforms autoregressive (AR) models, neural networks (NN), and Markov-switching models, but its performance is on par with the SARIMA model (Almosova and Andresen, 2023). Recently, it has become harder to outperform naive univariate random walk-type forecasts of US inflation, although since the mid-1980s inflation has also become less volatile and easier to predict. Atkeson and Ohanian (2001) show that averaging over the last 12 months gives a more accurate forecast of 12-month-ahead inflation than a backward-looking Phillips curve. The macroeconomic literature argues that the inflation process might be changing over time, making a nonlinear model more precise in predicting inflation. According to Almosova and Andresen (2023) there are four main advantages of the LSTM method.
1. LSTMs are flexible and data-driven. This means that researchers do not have to specify the exact form of the non-linearity. Instead, the LSTM infers it from the data itself.
2. Under some mild regularity conditions, LSTMs, and neural networks of any type in general, can approximate any continuous function arbitrarily accurately. At the same time, these models are more parsimonious than many other nonlinear time series models.
3. LSTMs were developed specifically for the analysis of sequential data and have proved to be very successful at this task.
4. Recent developments in optimization routines for NNs, and in libraries that make use of GPUs, have made the training of NNs and recurrent neural networks significantly more feasible.
In contrast to classical time-series models, the LSTM-network does not suffer from data instabilities or unit root problems. Nor does it suffer from the vanishing gradient problem of general RNNs, which can destroy the long-term memory of these networks. LSTM may be applied to forecasting any macroeconomic time-series, provided that there are enough observations to estimate the model.
Theoretically, Convolutional Neural Networks (CNNs), originally developed for images, could also be used for time series forecasting if one treats the input as a one-dimensional image. LSTMs perform particularly well at long horizons and during periods of high macroeconomic uncertainty. This is due to their lower sensitivity to temporary and sudden price changes compared to traditional models in the literature. One should note, however, that their performance is not outstanding, for instance compared to the random forest model (Lenza et al., 2023). Neural networks likewise demonstrate competitive, but not outstanding, performance against common benchmarks, including other machine learning methods. A simplified visual representation of an LSTM recurrent structure is provided in Figure 4.
Figure 4. Representation of LSTM recurrent structure. LSTM has a cell state $c_t$ and a hidden state $h_t$. As $t$ increases, more information $x_t$ is put into the cell state and hidden state. This new information in the cell and hidden states contributes to the prediction $\hat{y}_t$ (Almosova and Andresen, 2023).
A common weakness of machine learning techniques, including neural networks, is the lack of interpretability (Mullainathan and Spiess, 2017). For inflation in particular this could be a problem, since much of the effort is devoted to understanding the underlying inflation process, sometimes at the expense of marginal increases in forecasting gains. LSTM is on average less affected by sudden, short-lived movements in prices compared to other models. Random forest has proved sensitive to the downward pressure on prices caused by the global financial crisis (GFC). Machine learning models are more prone to instabilities in performance due to their sensitivity to model specification (Almosova and Andresen, 2023). This also applies to the LSTM network. Lastly, LSTM-implied factors display high correlation with business cycle indicators, which points to the usefulness of such signals as inflation predictors.
The LSTM model can be described by the cell state function and the internal memory. A visual representation of an LSTM cell is provided in Figure 5. These functions start out at their initial values, before information from new observations enters and changes their values. We apply the sigmoid and tanh functions, giving us the updated values of the internal memory and cell state.
Figure 5. The figure illustrates the schematic of an LSTM cell. The cell state $c_{t-1}$ and hidden state $h_{t-1}$ from the previous time step, along with the current input $x_t$, are processed through forget, input, and output gates. The forget gate determines how much of the previous cell state should be retained, while the input gate decides how much new information should be added. These combined results update the cell state $c_t$. The output gate determines the next hidden state $h_t$, which, combined with the updated cell state, forms the output $\hat{y}_t$. Activation functions like tanh and sigmoid are used to regulate the flow of information within the cell, ensuring that the LSTM effectively captures long-term dependencies in the data (Almosova and Andresen, 2023).
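The gate mechanics described in the caption can be sketched as a single forward step of an LSTM cell. This is a minimal illustration using the standard LSTM update equations with one input and one hidden unit, not the network trained in this study. The weight layout in `params` and all names are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One forward step of a scalar LSTM cell (one input, one hidden unit).

    params maps each gate name to hypothetical weights (w_x, w_h, b).
    Returns the new hidden state h_t and the new cell state c_t.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("forget"))        # share of old cell state kept
    i_t = sigmoid(pre_activation("input"))         # share of new candidate admitted
    g_t = math.tanh(pre_activation("candidate"))   # candidate cell content
    o_t = sigmoid(pre_activation("output"))        # share of cell state exposed

    c_t = f_t * c_prev + i_t * g_t                 # updated cell state
    h_t = o_t * math.tanh(c_t)                     # updated hidden state
    return h_t, c_t


# Hypothetical weights: a forget gate that is almost fully open and an
# input gate that is almost closed, so the old cell state is carried forward.
demo_params = {
    "forget": (0.0, 0.0, 50.0),
    "input": (0.0, 0.0, -50.0),
    "candidate": (0.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
print(lstm_cell_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=demo_params))
```

Because the cell state is updated additively and scaled by the forget gate, information can be carried across many time steps, which is the mechanism behind the long-term memory discussed above. In a forecasting network, the prediction would typically be obtained from the hidden state through an additional output layer, which is omitted here.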
4.3. LASSO-LSTM
The LASSO-LSTM model is an integrated machine learning and neural network model that combines the strengths of LASSO and LSTM. The initial step is LASSO, for feature selection. The predictors are fitted by minimizing the residual sum of squares, in a similar fashion to OLS, while LASSO applies a shrinkage parameter ($\lambda$) to the coefficients, shrinking the coefficients of less significant predictors. The size of the hyper-parameter $\lambda$ is important, as it decides the number of predictors that the LSTM model will be trained on.
The regularisation term of LASSO performs feature selection. The predictors determined to be most significant do not receive large penalties on their coefficients and are therefore retained as important for forecasting inflation. In this study the regularisation term is set to three sizes, so that LASSO-LSTM is constructed with three sizes of architecture: large, medium and small. Increasing the size of the regularisation term results in a smaller LASSO-LSTM architecture. This approach helps in assessing how to balance underfitting and overfitting in the context of macroeconomic forecasting.
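The link between the size of the regularisation term and the size of the architecture can be illustrated with a small sketch. It relies on a simplifying assumption that does not hold for real macroeconomic data: for an orthonormal design, the LASSO coefficient is the soft-thresholded least squares coefficient, so a predictor survives only if its least squares coefficient exceeds the penalty in absolute value. The predictor names, coefficient values and penalty sizes below are hypothetical and are not results from this study.

```python
def selected_features(ols_coefs, lam):
    """Names of predictors whose soft-thresholded coefficient is non-zero.

    Valid as the lasso selection rule only for an orthonormal design,
    which is an assumption of this sketch.
    """
    return [name for name, b in ols_coefs.items() if abs(b) > lam]


# Hypothetical least squares coefficients for four predictors.
toy_coefs = {
    "unemployment": 0.90,
    "oil_price": -0.40,
    "money_growth": 0.15,
    "housing_starts": 0.05,
}

# Hypothetical penalties: a small penalty yields the large architecture.
penalties = {"large": 0.10, "medium": 0.30, "small": 0.60}

for size, lam in penalties.items():
    print(size, selected_features(toy_coefs, lam))
```

The three feature sets are nested, so the small architecture is trained on a subset of the predictors passed to the medium and large ones.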
When dealing with medium-sized samples and high-dimensional data, the LSTM model, while known to handle dimensionality well, can run into problems of overfitting. Work on LSTM for macroeconomic forecasting has shown that larger architectures do not necessarily outperform smaller ones (Paranhos, 2024). Feature selection performed by LASSO detects the features that can contribute to the forecasting performance of the LSTM.
The features considered most important after regularisation proceed to the LSTM input layer. Different sizes of architecture have different numbers of layers. Larger architectures, being more prone to overfitting, receive fewer layers of fully connected nodes and are given dropout layers. Smaller architectures, being less prone to overfitting, can have more layers and/or fewer dropout layers. The LSTM layer structures can then be trained to forecast inflation based on the predictors deemed most important by LASSO. The LASSO-LSTM model, as an augmented version of the LSTM model that integrates feature selection, contributes to model regularization.
An alternative approach commonly used for feature selection is principal component analysis (PCA) (Tsui et al., 2018). The two approaches differ in their goals. PCA constructs components that capture the most variance among the predictors, whereas LASSO, by shrinking coefficients, retains the original variables considered important for prediction. Thus LASSO-LSTM retains some interpretability, as forecasts are based on identifiable factors, which is of interest to central banks in their decision making.
4.4. ARIMA and SARIMA
SARIMA, Seasonal Autoregressive Integrated Moving Average, is an extension of ARIMA that supports direct modeling of the seasonal component of a time series. ARIMA does not support time series with a repeating cycle; it expects the data to be either non-seasonal or to have the seasonal component removed, for example through seasonal differencing (Dubey et al., 2021).
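An ARIMA(p, d, q) model can be represented by Equation (1) below:

$$y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t \qquad (1)$$

Here $y'_t$ is the series after differencing $d$ times, $c$ is a constant, $\phi_1, \dots, \phi_p$ are the coefficients of the autoregressive part with p lags, $\theta_1, \dots, \theta_q$ are the coefficients of the moving average part with q lags and $\varepsilon_t$ is the error term at time t. The error terms are typically assumed to be i.i.d. variables drawn from a normal distribution with zero mean.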
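To make the recursion concrete, the following minimal sketch computes a one-step-ahead forecast from an ARMA(p, q) model applied to an already differenced series, taking the coefficients as given rather than estimating them. The names `c`, `phi` and `theta` for the constant, the autoregressive coefficients and the moving average coefficients are assumptions of the example, as is the crude convention of setting pre-sample values and errors to zero.

```python
def arma_one_step_forecast(series, c, phi, theta):
    """One-step-ahead forecast from an ARMA(p, q) model with known coefficients.

    series: the (already differenced) observations, oldest first.
    phi:    list of p autoregressive coefficients.
    theta:  list of q moving average coefficients.
    Pre-sample observations and errors are treated as 0 (a simplification).
    """
    p, q = len(phi), len(theta)
    errors = []  # in-sample one-step forecast errors, rebuilt recursively
    for t in range(len(series) + 1):
        ar_part = sum(phi[i] * series[t - 1 - i]
                      for i in range(p) if t - 1 - i >= 0)
        ma_part = sum(theta[j] * errors[t - 1 - j]
                      for j in range(q) if t - 1 - j >= 0)
        prediction = c + ar_part + ma_part
        if t == len(series):
            return prediction  # forecast for the first unobserved period
        errors.append(series[t] - prediction)


# Pure AR(1) example with hypothetical coefficients.
print(arma_one_step_forecast([2.0, 4.0], c=1.0, phi=[0.5], theta=[]))
```

The moving average part cannot be read directly off the data: the past errors have to be reconstructed from the past forecasts, which is why the loop runs through the whole sample before producing the final forecast.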
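The model involves a linear combination of lags, and the goal is to identify the optimal p, d, and q values. The differencing order d is chosen as the minimum number of differences at which the autocorrelation decays to zero. The order p of the AR term should equal the number of lags in the partial autocorrelation (PAC) that significantly cross the significance limit. Equation (2) shows the PAC, where $y$ is considered the response variable and $x_1$, $x_2$, and $x_3$ are the predictor variables. The PAC between $y$ and $x_3$ is calculated as the correlation between the residuals of the regression of $y$ on $x_1$ and $x_2$ and the residuals of the regression of $x_3$ on $x_1$ and $x_2$:

$$\frac{\operatorname{Cov}(y, x_3 \mid x_1, x_2)}{\sqrt{\operatorname{Var}(y \mid x_1, x_2)\,\operatorname{Var}(x_3 \mid x_1, x_2)}} \qquad (2)$$

The $h$-th order partial autocorrelation can be represented as (3):

$$\frac{\operatorname{Cov}(y_t, y_{t-h} \mid y_{t-1}, \dots, y_{t-h+1})}{\sqrt{\operatorname{Var}(y_t \mid y_{t-1}, \dots, y_{t-h+1})\,\operatorname{Var}(y_{t-h} \mid y_{t-1}, \dots, y_{t-h+1})}} \qquad (3)$$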
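The order q is determined from the autocorrelation (AC) and denotes the number of lagged forecast errors in the model:

$$\rho_h = \frac{\operatorname{Cov}(y_t, y_{t-h})}{\sqrt{\operatorname{Var}(y_t)\,\operatorname{Var}(y_{t-h})}} \qquad (4)$$

Here, $\operatorname{Cov}$ and $\operatorname{Var}$ denote the covariance and the variance, and $h$ is the lag.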
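As an illustration of how these quantities can be obtained from data, the sketch below computes sample autocorrelations and then derives the partial autocorrelations from them with the Durbin-Levinson recursion. This is one standard way of computing the PAC and is not necessarily the routine used in this study. The function names are assumptions of the example.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - h] for t in range(h, n)) / denom
            for h in range(max_lag + 1)]


def sample_pacf(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson recursion)."""
    r = sample_acf(series, max_lag)
    pacf = []
    phi_prev = []  # coefficients of the AR(h-1) fit implied by r
    for h in range(1, max_lag + 1):
        num = r[h] - sum(phi_prev[j] * r[h - 1 - j] for j in range(h - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(h - 1))
        phi_hh = num / den  # the lag-h partial autocorrelation
        phi_new = [phi_prev[j] - phi_hh * phi_prev[h - 2 - j]
                   for j in range(h - 1)]
        phi_new.append(phi_hh)
        pacf.append(phi_hh)
        phi_prev = phi_new
    return pacf


print(sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], 2))
print(sample_pacf([1.0, 2.0, 3.0, 4.0, 5.0], 2))
```

In practice, the estimated values are compared with an approximate significance limit of $\pm 1.96/\sqrt{T}$, where $T$ is the number of observations. Candidate values of p are then read off the PAC and candidate values of q off the AC.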
If the time series exhibits seasonal patterns, a seasonal term can be added, which produces a SARIMA model. This model can be written as (5):

$$\text{SARIMA}\,(p, d, q) \times (P, D, Q)_S \qquad (5)$$

Here (p, d, q) represents the non-seasonal part and (P, D, Q) the seasonal part of the model. S represents the number of periods in a season. In this study we employ SARIMA as we assume that inflation data exhibit seasonality.
A seasonal ARIMA model uses differencing at a lag equal to the number of seasons (s) to remove additive seasonal effects. As with lag 1 differencing to remove a trend, the lag s differencing introduces a moving average term. The seasonal ARIMA model includes autoregressive and moving average terms at lag s. The trend elements can be chosen through careful analysis of ACF and PACF plots, looking at the correlations at recent time steps. Similarly, ACF and PACF plots can be analyzed to specify values for the seasonal model by looking at the correlations at seasonal lag time steps.
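The effect of differencing at the seasonal lag can be seen in a short sketch. The series below is artificial and is built only for illustration: a linear trend plus an additive pattern that repeats every four periods. The function name and the choice of a period of four are assumptions of the example.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Artificial series: linear trend plus an additive pattern of period s = 4.
seasonal_pattern = [3.0, -1.0, 2.0, -4.0]
toy_series = [0.5 * t + seasonal_pattern[t % 4] for t in range(12)]

# Differencing at lag s removes the repeating pattern; what remains is the
# trend increase over one full season, which is constant here.
deseasonalised = difference(toy_series, lag=4)
print(deseasonalised)

# A further lag 1 difference removes the remaining constant drift.
print(difference(deseasonalised, lag=1))
```

Each differencing step shortens the series by the lag used, so seasonal differencing of monthly data costs twelve observations at the start of the sample.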
In short, SARIMA supports univariate time series data with a seasonal component, and adds three new hyper-parameters to specify the autoregression (AR), differencing (I) and moving average (MA) for the seasonal component of the series, as well as an additional parameter for the period of the seasonality. The reason for comparing NNs with SARIMA is that their celebrated performance might be due to their ability to capture seasonality. Consequently NNs should be compared to a linear seasonal model. Often economic time series variables evolve in a cyclical pattern through time, i.e., exhibit seasonality. In relation to inflation, sales, holidays, and production cycles can cause seasonal price variations that affect the Consumer Price Index.
According to most of the literature on inflation forecasting, SARIMA is the top performing classical model and usually outperforms VAR, AR and ARIMA (Paranhos, 2024). Moreover, compared to newer machine learning methods such as recurrent neural networks, LSTM and feed-forward neural networks, SARIMA performs on par or better. This makes the SARIMA model a natural choice for our main benchmark, as we want to compare machine learning methods to classical methods, as well as look for ways to improve these methods.
4.6. Network Training
4.6.1. LSTM
We start by splitting the data into training data, two validation sets, and an out-of-sample set. The training and validation data cover the period from 1960 to 1997, and are used to train the model. The out-of-sample data ranges from 2010 to the end of 2023.
The first step is the tuning of the model. It starts with a set of hyper-parameters, with which the model is trained for thousands of epochs.
The epochs are tested on the first validation set, and the best performing epoch is tested on a second validation set. This procedure is repeated several times with different sets of hyper-parameters. All selected epochs are compared on a second validation set, and the best tuned one is retained and applied to the out-of-sample set.
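The two-stage selection described above can be sketched generically. In this illustration a "model" is any object that the two loss functions can score; the function name, the dictionary layout and the toy numbers are assumptions of the example and do not reproduce the actual training code.

```python
def select_model(candidates, val1_loss, val2_loss):
    """Two-stage model selection with two validation sets.

    candidates: dict mapping a hyper-parameter label to the list of model
                snapshots saved after each epoch for that configuration.
    val1_loss:  function scoring a snapshot on the first validation set.
    val2_loss:  function scoring a snapshot on the second validation set.
    """
    finalists = {}
    for label, snapshots in candidates.items():
        # Stage 1: within one configuration, keep the best epoch on set 1.
        finalists[label] = min(snapshots, key=val1_loss)
    # Stage 2: across configurations, keep the best finalist on set 2.
    best_label = min(finalists, key=lambda lab: val2_loss(finalists[lab]))
    return best_label, finalists[best_label]


# Toy illustration: each "snapshot" is just a constant forecast.
toy_candidates = {"small": [0.0, 1.0, 2.0], "large": [5.0, 3.0]}
best = select_model(
    toy_candidates,
    val1_loss=lambda m: (m - 1.2) ** 2,  # hypothetical validation target 1.2
    val2_loss=lambda m: (m - 2.5) ** 2,  # hypothetical validation target 2.5
)
print(best)
```

Using a second validation set for the comparison across hyper-parameter sets limits the risk that the chosen configuration merely fits the first validation set well by chance, and it leaves the out-of-sample period untouched until the final evaluation.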
We refer readers to Figure 6 for the specifications related to the different models.
Feature selection occurs prior to model tuning, and is performed using LASSO and PCA. The features for both LASSO-LSTM and PCA-LSTM are based on the training data and the first validation sample. The specification of the LSTM model can be divided into four distinct parts:
1) Feature selection: Features are selected based on their relevance to the data and the problem at hand. This is an independent step that occurs before training and optimization.
2) Model configuration: This involves setting a range of structure and parameters for the LSTM model, including the incorporation of lagged versions, the number and order of layers, the number of dropout layers, the dropout rate percentage, and the learning rate.
3) Training and optimization: This step includes setting the number of epochs, batch sizes, and validation strategies.
4) Model evaluation: This involves comparing different versions of the LSTM model and selecting the best one based on performance metrics.
Figure 6.
The table presents the optimal specifications for the applied LSTM models, showing the best values for hyper-parameters such as lags, layers, dropout layers, dropout rate, learning rate, epochs, batch size, and validation sample.
4.6.2. Other Machine Learning Models
The data for the other machine learning models employed in this study are split into three parts: training, validation and out-of-sample. Data from 1960 to 2010 are used for training and validation. Figure 7 gives an overview of the optimal hyper-parameters selected for these models.
The first step is embedded feature selection, where features are selected to be included in the model using training data and the validation sample. This process is crucial for the Random Forest algorithm.
Unlike the other models, Random Forest has been specified to sequentially update. This means that each forecast utilizes all available data up to a certain point in time. As new forecasts are made, more data is incorporated into the model. The process of sequential updating allows the Random Forest to continually fit the available data, ensuring that the model remains up-to-date with the most recent information. However, the initial feature selection and hyper-parameters chosen during training remain constant throughout the forecasting period. This approach ensures that the Random Forest model is both dynamic and robust, adapting to new data while maintaining a consistent set of features and hyper-parameters.
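Sequential updating of this kind can be sketched as an expanding-window loop. To keep the example self-contained, the Random Forest is replaced by a placeholder model that simply forecasts the historical mean. This substitution, the function names and the toy series are assumptions of the example and are not part of the study.

```python
def expanding_window_forecasts(series, first_forecast_index, fit, predict):
    """Refit on all data observed before each date, then forecast that date.

    fit and predict are supplied by the caller. The hyper-parameters and the
    feature set are assumed to be fixed inside fit, so only the fitted
    parameters change from one forecast origin to the next.
    """
    forecasts = []
    for t in range(first_forecast_index, len(series)):
        history = series[:t]              # everything available at time t
        model = fit(history)
        forecasts.append(predict(model, history))
    return forecasts


def fit_mean(history):
    """Placeholder 'model': the mean of the observed history."""
    return sum(history) / len(history)


def predict_mean(model, history):
    """Placeholder forecast: return the fitted mean."""
    return model


print(expanding_window_forecasts([1.0, 2.0, 3.0, 4.0], 2, fit_mean, predict_mean))
```

Each forecast uses only observations dated before the period being forecast, so the exercise remains a genuine out-of-sample test even though the model is refitted repeatedly.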
LASSO and Ridge regression models are also trained and fitted using the training and validation samples, with the penalty term optimized based on the validation sample performance.
After determining the best model specifications, these models are tested on out-of-sample data.
Figure 7.
Optimal hyper-parameters and model specifications for Random Forest, LASSO, and Ridge regression models. Random Forest uses 500 trees and 4 variables per split, with sequential updating and Embedded Feature Selection (EFS) incorporating 18 features. LASSO applies a lambda range of 0.018-0.35 with an L1 penalty, using Penalized Regression Method (PRM) with 10-50 features and no sequential updating. Ridge regression uses a lambda range of 0.219-11.5 with an L2 penalty, employing PRM with 126 features and no sequential updating.
4.6.3. Univariate Time Series Models
The specification of the AR(p) model was based on results from the ACF, PACF and BIC. The SARIMA model was determined based on the same tests. Both models are estimated by maximum likelihood; other estimation approaches were not tested. Both time series models are sequentially updated as forecasts are made, adjusting only the coefficients of the model, not the hyper-parameters. The optimal hyper-parameters selected for these models are shown in Figure 8.
Figure 8.
The table presents the optimal hyper-parameters and model specifications for SARIMA and AR(p) models. Selection criteria include ACF and PACF for SARIMA but not for AR(p). Both models use BIC for selection, with maximum likelihood estimation (MLE) and sequential updating.