Variational Bayesian Learning and Semiparametric Models on the Double Exponential Family

In this paper, we focus on variational Bayesian learning, a deterministic optimization method, for inference in biparametric exponential models in which the parameters follow semiparametric regression structures. This combination of data models and algorithms helps solve real-world problems while reducing computation time. It allows both the rapid exploration of many data models and the accurate estimation of the mean and variance functions through the connection between generalized linear models and graph theory. A simulation study was carried out to assess the performance of the deterministic approximation. Finally, we present an application using macroeconomic data to emphasize the benefits of the proposed approach.


Introduction
Variational Bayesian learning is an optimization-based framework that provides computationally tractable posterior distributions over hidden or latent variables to approximate the true joint distribution. The complexity of statistical and probabilistic models has continuously increased due to the demand for new applications driven by advances in information technology. This expansion includes methods such as hidden Markov models, probabilistic graphical models, and mixed and latent variable models. Furthermore, empirical applications are linked to various analyses in the statistics and machine learning cultures, for example, estimating uncertainty, handling non-IID data, and hyperparameter tuning, among others. Summaries of variational approximation research may be found in [2], [11], and [12]. A recent introduction from the statistical perspective was provided in [13]. The implementation of these models requires the solution of high-dimensional integrals in likelihood functions and posterior probability densities, which emerge in each setting. Variational Bayesian learning (VBL), which relies on calculus-of-variations optimization techniques, approximates the parameters' values by minimizing suitable objective functions. This optimization determines a sufficiently accurate approximation to the posterior distribution while avoiding overfitting, with the benefit of faster and easier scaling to large datasets through the use of stochastic and distributed optimization techniques.
The general VBL setup can be divided into three main stages. First, a family of approximating densities Q is proposed for the set of latent variables. Then, the member of that family that minimizes a functional, shaped by a finite number of unknown parameters, is found. This functional is related to the posterior distribution and is known as the free energy. Finally, the posterior is approximated with the optimized member of that family, r*. In other words, the goal of VBL is to minimize the free energy, an upper bound on the surprise − log p(D); equivalently, to maximize the evidence lower bound (ELBO). This optimization strategy makes use of the conditional conjugacy assumption, which allows practical models to be built by combining basic distributions.
Even though complex Bayesian modeling has benefited from Markov chain Monte Carlo (MCMC) and its variants, there are cases where this framework faces difficulties as a result of large datasets and the complex geometry of the posterior distributions. These approximations may then need massive computing resources, which can lead to slow convergence and inaccurate conditional and predictive posterior distributions. Therefore, VBL can be seen as a complementary alternative to MCMC methods.
Recent studies including generalized linear models and semiparametric regression have considered the presence of nonconjugate models arising when one or more conditional posteriors are unknown. Models of this type were studied by [6], who relied on the mixed model formulation of penalized spline regression. The variational methodology for heteroscedastic semiparametric regression models in the GLM context was initially developed in [13]. On the other hand, [10] achieved simultaneous nonparametric function estimation using a Gaussian process approach. Next, [11] developed a mean field variational approximation where both the mean and variance are smooth functions of the predictors in a nonparametric sense, and the nonconjugate structure was addressed using a fixed-form Gaussian update. Furthermore, semiparametric mean-field variational Bayes was introduced in [7] as a combination of the Kullback-Leibler divergence and the mean-field restriction that includes parametric and nonparametric densities. For the heteroscedastic regression model, [3] implemented a mean-field variational approximation with an embedded Laplace approximation to account for the nonconjugate structure of semiparametric regression in the presence of errors with nonconstant variance. From the applied point of view, variational approximations have been implemented, for example, in latent variable models, low-rank clustering, and time series nowcasting involving mixture models.
The main novelty of this paper is the connection of the variational Bayesian learning theory developed in [12] with the biparametric exponential family to model regression structures for the mean and variance parameters. This strategy was developed under the framework of generalized linear models (GLIM) with penalized splines. In this context, we designed an algorithm to simultaneously estimate the mean and variance functions for heteroscedastic regression.
This paper proceeds as follows. In Section 2, we describe the main issues behind the variational Bayesian learning methodology. In Section 3, the biparametric exponential family applied to the heteroscedastic semiparametric model and its approximation methodology are explained. In Section 4, we provide a simulation study and an application to real data. Finally, in Section 5, we present our conclusions and propose future lines of research. The technical results are presented in the Appendix.

Variational Bayesian Learning
In this section, we describe the components of variational Bayesian learning and the optimization framework required for free-energy minimization. Variational Bayesian learning (VBL) can be seen as an optimization setup that approximates the posterior density by solving a constrained minimization problem through the application of the calculus of variations. This procedure relies on the ideas of both free energy and conditional conjugacy.

Free energy
The free energy, F(r), is a functional that depends on an arbitrary trial distribution r. This concept provides an upper bound on the Bayes free energy, the negative logarithm of the marginal likelihood, − log p(D), where D represents the observed data. Free-energy minimization can be summarized in the following three steps. First, consider the Kullback-Leibler (KL) divergence from an arbitrary trial distribution r(ω) on the parameter ω to the posterior p(ω|D), given by

KL(r(ω) ‖ p(ω|D)) = ∫ r(ω) log [ r(ω) / p(ω|D) ] dω.

On the basis of the work in [12] and [2], and by the properties of the KL divergence, the minimizer of this functional corresponds to the Bayes posterior,

r*(ω) = p(ω|D).

Second, according to [12], an equivalent formulation of this optimization is

r* = argmin_r F(r),

where the functional F(r) is known as the free-energy functional [4]. It is obtained by replacing the posterior distribution p(ω|D) with the proportional joint distribution p(ω, D) in the KL divergence, according to Bayes' theorem:

F(r) = ∫ r(ω) log [ r(ω) / p(ω, D) ] dω = KL(r(ω) ‖ p(ω|D)) − log p(D),

where the normalizing factor p(D) does not depend on ω. The functional F(r) is an upper bound of the traditional Bayes free energy − log p(D), and −F(r) is called the evidence lower bound (ELBO). Finally, to guarantee that the evaluation of the functional is tractable at the optimum r*(ω), the following problem is solved by restricting the search space to G:

r* = argmin_{r ∈ G} F(r),

where G is a chosen tractable family of distributions. However, thanks to conditional conjugacy, a weaker constraint suffices to restrict the optimal distribution to a tractable class.
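The bound F(r) ≥ − log p(D), with equality exactly when r is the Bayes posterior, can be verified numerically on a conjugate toy model. The sketch below is our own illustration (not part of the paper's model): a Gaussian likelihood with known variance and a Gaussian prior on the mean, for which the posterior and the marginal likelihood are both available in closed form.

```python
import numpy as np

# Toy check of F(r) = KL(r || posterior) - log p(D) >= -log p(D):
# y_i ~ N(mu, sigma2) with sigma2 known and a Gaussian prior mu ~ N(0, tau2),
# so the exact posterior and marginal likelihood have closed forms.
rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 4.0
y = rng.normal(1.5, np.sqrt(sigma2), size=20)
n = len(y)

def free_energy(m, v):
    """F(r) for the Gaussian trial distribution r(mu) = N(m, v)."""
    e_log_r = -0.5 * np.log(2 * np.pi * np.e * v)                  # E_r[log r]
    e_log_prior = -0.5 * np.log(2 * np.pi * tau2) - (m**2 + v) / (2 * tau2)
    e_log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                       - ((y - m)**2 + v) / (2 * sigma2))
    return e_log_r - e_log_prior - e_log_lik

# Exact conjugate posterior N(m_post, v_post)
v_post = 1.0 / (n / sigma2 + 1.0 / tau2)
m_post = v_post * y.sum() / sigma2

# Exact Bayes free energy -log p(D), since marginally y ~ N(0, sigma2*I + tau2*J)
cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
neg_log_evidence = 0.5 * (n * np.log(2 * np.pi) + logdet
                          + y @ np.linalg.solve(cov, y))

print(free_energy(m_post, v_post) - neg_log_evidence)  # ~0: bound is tight
print(free_energy(0.0, 1.0) - neg_log_evidence)        # > 0 for any other r
```

Evaluating the free energy at the exact posterior reproduces − log p(D) up to round-off, while any other trial distribution gives a strictly larger value.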

Conditional conjugacy
In various empirical applications, the model likelihood has no conjugate prior distribution, which has turned Bayesian analysis toward numerical approximation algorithms, such as MCMC and its variants. However, the notion of conditional conjugacy is crucial for casting the inference problem in terms of deterministic optimization algorithms. The idea of conditional conjugacy is to divide the set of unknown parameters into two parts, ω = (ω_1, ω_2). If, with ω_2 held fixed as a given constant, the conditional posterior of ω_1 lies in the same distribution family as the prior p(ω_1), that is,

p(ω_1 | D, ω_2) ∝ p(D | ω_1, ω_2) p(ω_1),

then the prior p(ω_1) is called the conditionally conjugate prior of the likelihood p(D|ω) with respect to the parameter ω_1, given the fixed parameter ω_2.
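As a concrete instance (our own toy example, not taken from the paper), consider ω = (μ, λ) in a Normal model with unknown mean and precision and independent priors. Holding either coordinate fixed, the other has a conjugate conditional posterior in the same family as its prior:

```python
import numpy as np

# Conditional conjugacy with omega = (mu, lam): y_i ~ N(mu, 1/lam),
# independent priors mu ~ N(m0, 1/k0) and lam ~ Gamma(a0, rate=b0).
rng = np.random.default_rng(1)
y = rng.normal(2.0, 0.5, size=100)
n, m0, k0, a0, b0 = len(y), 0.0, 1e-2, 1.0, 1.0

def mu_given_lam(lam):
    """With lam held fixed, the Normal prior on mu is conjugate:
    the conditional posterior is again Normal(mean, var)."""
    prec = k0 + n * lam
    return (k0 * m0 + lam * y.sum()) / prec, 1.0 / prec

def lam_given_mu(mu):
    """With mu held fixed, the Gamma prior on lam is conjugate:
    the conditional posterior is again Gamma(shape, rate)."""
    return a0 + n / 2.0, b0 + 0.5 * np.sum((y - mu)**2)

print(mu_given_lam(4.0))   # Normal parameters
print(lam_given_mu(2.0))   # Gamma parameters
```

Each conditional update is available in closed form even though the joint posterior of (μ, λ) under these independent priors is not a standard distribution.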

Constraint design
The design of a tractable VBL algorithm is straightforward once conditional conjugacy is found for every unknown parameter. The procedure considers the set of parameters ω = (ω_1, ..., ω_S) such that, for each s = 1, ..., S, the model likelihood p(D|ω) = p(D | ω_s, {ω_{s'}}_{s'≠s}) has a conditionally conjugate prior p(ω_s) with respect to ω_s, given {ω_{s'}}_{s'≠s} as fixed constants. Then, if the prior distribution is set in this way, the conditional posterior is, as a function of ω_s, in the same distribution family as the prior p(ω_s). Furthermore, the moments and characteristics of the posterior distribution are tractable if the other parameters, {ω_{s'}}_{s'≠s}, are given.
Consequently, in order to take into account this conditional conjugacy property, the independence constraint between the parameter groups should be imposed on the approximate posterior.
The constraint allows us to compute the moments with respect to ω_s and to optimize each factor {r_s}_{s=1}^{S} separately. This factorized form of variational inference corresponds to a framework known as mean field theory. See [?] for details.
As a result, the VB posterior is defined in the factorized form

r(ω) = ∏_{s=1}^{S} r_s(ω_s).

Optimizing each factor requires the theory of the calculus of variations. From this, the free energy, F(r), is expressed as an explicit function of a finite number of unknown parameters. In other words, this framework provides stationary conditions of the functional with respect to the posterior.

Calculus of variations
According to [14], the introduction of variational principles leads to a clearer and more compact modeling strategy. For instance, quantities such as entropy or energy, which take probability distributions as inputs, can be found as the result of an optimization setup. Moreover, VBL provides a structure for solving the problem using variational methods focused on minimizing integral functionals. In ordinary calculus, maxima and minima of a function f occur where its derivatives vanish, i.e., where the gradient of f equals zero at the stationary points (x_0, y_0). The calculus of variations extends this idea to the extrema of functionals, which are mappings that depend on the path of one or more functions and whose domain is a set of admissible functions. Therefore, determining the stationary points of functionals is the fundamental problem of the calculus of variations, from which different solution strategies can be implemented. For instance, reducing the variational problem to differential calculus to obtain the Euler equations is popular among economists for predicting future patterns of key variables.
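The stationarity of a functional can be made concrete numerically. The following sketch (our own illustration, not from the paper) minimizes the discretized functional J[u] = ½ ∫₀¹ u′(x)² dx subject to u(0) = 0 and u(1) = 1 by gradient descent; the Euler equation is u″(x) = 0, so the minimizer is the straight line u(x) = x.

```python
import numpy as np

# Discretized calculus-of-variations toy problem: minimize the functional
# J[u] = 0.5 * integral_0^1 u'(x)^2 dx  subject to u(0) = 0, u(1) = 1.
# The Euler equation is u''(x) = 0, so the minimizer is the line u(x) = x.
n = 101
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
u = np.zeros(n)
u[-1] = 1.0                                   # boundary conditions held fixed

for _ in range(30000):                        # gradient descent on J
    grad = np.zeros(n)
    grad[1:-1] = (2 * u[1:-1] - u[:-2] - u[2:]) / h   # dJ/du_i
    u -= 0.004 * grad

print(np.max(np.abs(u - x)))                  # ~0: recovers the straight line
```

The interior gradient is exactly the negative discrete second derivative, so driving it to zero is the finite-difference analogue of solving the Euler equation.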

Variational Bayesian Learning in action
In this section, we summarize the algorithm and its components for solving the VBL optimization problem. First, to approximate the posterior according to mean field theory, the free-energy functional is minimized subject to the independence constraint between groups of parameters:

r* = argmin_{r} F(r)   subject to   r(ω) = ∏_{s=1}^{S} r_s(ω_s).

Then, a conditionally conjugate prior that factorizes accordingly is assumed:

p(ω) = ∏_{s=1}^{S} p(ω_s).

Furthermore, on the basis of the calculus of variations, the stationary conditions are computed. The free energy is stated as

F(r) = ∫ ∏_{s=1}^{S} r_s(ω_s) log [ ∏_{s=1}^{S} r_s(ω_s) / p(ω, D) ] dω,

and the derivatives of the free energy with respect to each r_s(ω_s) are set to zero, i.e., ∂F/∂r_s = 0. From this first-order condition, the following stationary conditions as functions of ω_s are obtained:

r_s(ω_s) ∝ exp( ⟨ log p(D, ω) ⟩_{∏_{s'≠s} r_{s'}(ω_{s'})} ),   s = 1, ..., S.

The VBL posterior, r_s(ω_s), has a parametric form depending on the variational parameters. The stationary conditions found above are used to update these variational parameters through an iterative algorithm that ends when the free energy is minimized. In other words, this computational design provides a local minimizer, r, of the free-energy functional, which is taken as the VBL posterior. From this posterior distribution, the moments and the marginal, conditional, and predictive functions can be computed. For example, the mean VBL estimator can be defined as ω̂ = ⟨ω⟩_{r(ω)}.
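This iterative scheme can be sketched on a simple Normal toy model (our own illustration, not the paper's heteroscedastic model): each pass substitutes the current moments of one factor into the stationary condition of the other, the standard coordinate-ascent mean-field recipe.

```python
import numpy as np

# Coordinate-ascent mean-field updates for a Normal toy model:
# y_i ~ N(mu, 1/lam), priors mu ~ N(m0, 1/k0), lam ~ Gamma(a0, rate=b0),
# and the factorization r(mu, lam) = r(mu) r(lam).
rng = np.random.default_rng(2)
y = rng.normal(2.0, 0.5, size=500)
n, m0, k0, a0, b0 = len(y), 0.0, 1e-2, 1.0, 1.0

a, b = a0, b0                           # initialize r(lam) = Gamma(a, b)
for _ in range(50):
    e_lam = a / b                       # <lam> under the current r(lam)
    prec = k0 + n * e_lam               # stationary condition for r(mu)
    m, v = (k0 * m0 + e_lam * y.sum()) / prec, 1.0 / prec
    a = a0 + n / 2.0                    # stationary condition for r(lam)
    b = b0 + 0.5 * (np.sum((y - m)**2) + n * v)

print(m, a / b)                         # approximate posterior means
```

Each update uses only the most recent moments of the other factor, exactly the coordinate-wise substitution described above.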
Even though there are cases where a model does not have a conditionally conjugate prior, variational Bayesian learning can be adapted to handle such intractable functions. Various methods have been proposed depending on the analyzed model. In this paper, we follow the work in [3], where the nonconjugate posterior was approximated by the Laplace method. This method is briefly described below.

Laplace approximation
In the Laplace approximation, the posterior is approximated by a Gaussian,

r(ω) = N(ω; ω̂, Σ),

with variational parameters λ = (ω̂, Σ). VBL locates these parameters by minimizing the free energy F(r); the MAP mean and the covariance estimators are then located at the posterior mode.
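A minimal sketch of the idea (our own toy example, not the paper's variance-equation update): the mode ω̂ is found by Newton's method and Σ is taken as the negative inverse Hessian of the log joint at the mode.

```python
import numpy as np

# Laplace approximation: fit a Gaussian N(w_hat, Sigma) to a nonconjugate
# posterior by locating the MAP mode and the negative inverse Hessian there.
# Toy model: y_i ~ Poisson(exp(w)) with prior w ~ N(0, tau2).
rng = np.random.default_rng(3)
tau2 = 10.0
y = rng.poisson(np.exp(1.2), size=200)
n, s = len(y), y.sum()

def grad(w):     # d/dw log p(w, D)
    return s - n * np.exp(w) - w / tau2

def hess(w):     # d^2/dw^2 log p(w, D)
    return -n * np.exp(w) - 1.0 / tau2

w = 0.0
for _ in range(50):                    # Newton iterations to the MAP mode
    w -= grad(w) / hess(w)

Sigma = -1.0 / hess(w)                 # Gaussian covariance at the mode
print(w, Sigma)                        # w close to log(mean(y))
```

With a weak prior, the located mode is close to log of the sample mean, and Sigma shrinks at rate 1/n, as expected for a Laplace fit.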

Empirical Variational Bayesian Learning
When the model involves hyperparameters η in the likelihood and/or the priors, the joint distribution is specified as

p(D, ω | η) = p(D | ω, η) p(ω | η),

and the hyperparameters are estimated by minimizing the free energy F(r) with respect to η, as detailed below for the heteroscedastic model.

Biparametric exponential model
Frequently, in empirical applications, the implicit mean-variance relation, in which the variance is specified as a function of the mean, is not validated by the data. Thus, extending this condition to a larger collection of models has been proposed, exploiting the advantages of the exponential family framework [5]. In this section, we outline the VBL framework and derive the variational posterior distribution for the semiparametric heteroscedastic model under the GLM architecture. Following [15], the doubly semiparametric stochastic generalized linear model with splines as random effects can be stated as

y_i ~ DE(μ_i, τ_i),   i = 1, ..., n,

where DE(μ_i, τ_i) denotes the double exponential distribution with mean μ_i and variance τ_i. Under the semiparametric framework, the mean and variance functions depend on both the predictors of the model and the basis functions from the smoothing technique.
As a result, the entire model can be stated through link functions g and h as

g(μ_i) = x_iᵀβ + z_iᵀu,   h(τ_i) = x_iᵀγ + z_iᵀv,

with the doubled random effect

(uᵀ, vᵀ)ᵀ ~ N( 0, blockdiag(σ_u² I, σ_v² I) ).

Then, by combining the fixed and random coefficients for the mean and variance functions into single vectors, we write

θ = (βᵀ, uᵀ)ᵀ,   θ_v = (γᵀ, vᵀ)ᵀ.

On the other hand, concatenating the fixed-effects and random-effects design components, the design matrix can be written as

C = [X  Z],

so that the model equations simplify to

g(μ) = Cθ,   h(τ) = Cθ_v.

Taking advantage of the connection between GLM models and graphical models, the corresponding directed acyclic graph (DAG), which identifies the dependence structure among the parameters of the semiparametric heteroscedastic model, is presented below.
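The construction of the combined design matrix from fixed and random (spline) components can be sketched as follows. The truncated-line basis and quantile knot placement are our assumptions for illustration; any penalized spline basis could be substituted.

```python
import numpy as np

# Sketch of the design matrices for the doubly semiparametric model, using
# a truncated-line spline basis with knots at sample quantiles (assumed
# choices for illustration only).
rng = np.random.default_rng(4)
n, K = 200, 10                                  # sample size, interior knots
x = np.sort(rng.uniform(1, 10, size=n))
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])

X = np.column_stack([np.ones(n), x])            # fixed-effects design
Z = np.maximum(x[:, None] - knots[None, :], 0)  # random-effects (spline) part
C = np.hstack([X, Z])                           # combined design matrix

print(X.shape, Z.shape, C.shape)
```

The same matrix C enters both the mean equation, g(μ) = Cθ, and the variance equation, h(τ) = Cθ_v.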

Prior specification
The prior beliefs for the parameters of the model are described in this section. First, the fixed-effects parameters β and γ are a priori independent, with

β ~ N(0, σ_β² I),   γ ~ N(0, σ_γ² I),

where σ_β² and σ_γ² are hyperparameters with very large values. In the applications, we set them to 10⁶, as was the case in [3] and [6].
On the other hand, the priors for the dispersion parameters are specified as inverse-gamma distributions,

σ_u² ~ IG(A_u, B_u),   σ_v² ~ IG(A_v, B_v),

with hyperparameters A_u, B_u, A_v, and B_v. The covariance matrices for the fixed and random effects are then formed as

Σ_θ = blockdiag(σ_β² I, σ_u² I),   Σ_{θ_v} = blockdiag(σ_γ² I, σ_v² I).

Conditional posterior distributions
We now specify the conditional posterior distributions for each unknown parameter in the semiparametric Bayesian context defined by the DAG graphical model. Under the VBL framework, the majority of these distributions have a closed form, but some of them may be intractable. These distributions can be approximated using causal learning ideas that link the directed acyclic graph (DAG) with the following concept: the Markov blanket of a node consists of the variables relevant to that node, namely its parents, its children, and its children's other parents (spouses). In other words, the Markov blanket contains all variables that carry information about the node that cannot be obtained from any other variable; for instance, p(δ_i | ·) = p(δ_i | Markov blanket). This framework is explained in detail in [8]. Furthermore, it is worth noting that for the semiparametric model, the work of [6] provided conditional posteriors with closed and tractable standard forms for the mean regression parameters, p(θ|·), and for the variance parameters of both the mean equation, p(σ_u²|·), and the variance equation, p(σ_v²|·). In contrast, the distribution of the parameters of the variance equation, p(θ_v|·), is nonconjugate, and tractability is not easy to achieve; it can be accomplished by imposing the Laplace approximation, as in [3]. In summary, the conditional posteriors are stated for the fixed effects in the mean equation, the variance of the random effects in the mean equation, the variance of the random effects in the variance equation, and the fixed effects in the variance equation.
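The Markov blanket rule is easy to compute on any DAG. The toy graph below is an assumed structure that loosely mimics the model's parameter DAG, for illustration only:

```python
# Markov blanket of a node in a DAG: its parents, its children, and the
# children's other parents. The graph below is an assumed structure that
# loosely mimics the model's parameter DAG, for illustration only.
edges = {                            # parent -> list of children
    "sigma2_u": ["theta"],
    "sigma2_v": ["theta_v"],
    "theta":    ["y"],
    "theta_v":  ["y"],
}

def markov_blanket(node):
    parents = {p for p, cs in edges.items() if node in cs}
    children = set(edges.get(node, []))
    coparents = {p for c in children
                 for p, cs in edges.items() if c in cs and p != node}
    return parents | children | coparents

print(sorted(markov_blanket("theta")))   # ['sigma2_u', 'theta_v', 'y']
```

In this toy graph, theta's blanket contains its parent (sigma2_u), its child (y), and the child's other parent (theta_v), which is exactly the conditioning set used in p(θ | ·).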

Variational Bayesian learning of the heteroscedastic model
The VBL implementation is detailed in this section. The work of [3] described a variational approximation for heteroscedastic semiparametric regression via spline bases for both the mean and variance functions. The mean field factorizes the approximate posterior as

r(θ, θ_v, σ_u², σ_v²) = r(θ) r(θ_v) r(σ_u²) r(σ_v²).

According to the VBL framework, the solution is derived from the optimization problem

r* = argmin_{r} F(r)

under the mean field constraint, and the stationary conditions below are obtained from the method described above.
r_θ(θ): optimal distribution for the fixed and random effects in the mean equation. By substituting the conditional posterior into the stationary condition and completing the square, the posterior is seen to be Gaussian. The optimal distributions for the variance of the random effects in the mean equation, for the variance of the random effects in the variance equation, and for the fixed effects in the variance equation are obtained in the same way. Therefore, the free energy as a function of the unknown variational parameters can be obtained by substituting the proposal densities into the free-energy functional.
Additionally, the empirical variational Bayesian procedure is performed by minimizing the free energy with respect to the hyperparameters. Thus, the relevant derivatives of the free energy are (∂F(r)/∂A_u, ∂F(r)/∂B_u, ∂F(r)/∂A_v, ∂F(r)/∂B_v). Technical details are presented in the Appendix. Using the r-densities described above, the variational Bayesian learning algorithm consists of updating the parameters associated with each r-density until r approximates the posterior; in each step of the algorithm, the most recent parameter values are used. This is called the coordinate ascent (CA) method.

Algorithm 1 Learning for heteroscedastic semiparametric regression
1: Initialize the variational parameters (μ_{r(θ)}, Σ_{r(θ)}, B_{r(σ_u²)}, B_{r(σ_v²)}) and the hyperparameters (A_u, B_u, A_v, B_v) in the DAG. 2-3: Apply the stationary conditions (substitute the right-hand side into the left-hand side) to update each r-density. 4: Evaluate the free energy. 5: Iterate steps 2 through 4 until convergence (until the free-energy decrease becomes smaller than a threshold of 10⁻⁴). 6: Construct parameter estimates using the means of the variational approximations.
Simulation study

A simulation study was conducted with x ~ U[1, 10] and n = 50, 200, 500 points. The model was fitted using penalized splines for both the mean and variance levels, and performance was assessed with the root mean squared error (RMSE). The hyperparameters, including the variance hyperparameters related to the random effects, were set according to the prior specification described above.

Real application
We illustrate the learning methodology with a quarterly dataset of gasoline consumption determinants from 1947:Q1 to 1997:Q4, which is used to explain US economic growth. In this type of macroeconomic study, the goal is to estimate the effects of key variables (price, income) on the dependent variable (gasoline demand). These types of growth regressions are described in [1]. The scatterplots of per capita US gasoline demand vs. price and per capita income can be seen in panels a) and b) of Figure 1. From these, we infer that a linear relationship is not easy to justify. In panel c), we implemented the variational Bayesian algorithm to smooth the scatterplot between price and gasoline demand. Finally, the residual plot indicates the effect that per capita income has on gasoline demand, suggesting a nonlinear relationship.
The proposed Bayesian learning model was applied to fit this relation with the heteroscedastic semiparametric regression methodology. The partial residual plot, introduced by [9], is a device commonly used by econometricians in multivariate regression. It assesses the importance of an independent variable, in the presence of all other independent variables, in predicting the dependent variable. Thus, it represents the multivariate regression results as bivariate scatterplots by first netting out the effect of the other variables. According to the Frisch-Waugh theorem, the same coefficient and standard deviation can be obtained for a given covariate by using either the partial residual regression or the multivariate regression. Additionally, this partial residual plot assesses the importance of nonlinearity. We adapted this tool to our model and found a curvilinear relationship, which differs from the literature, suggesting a much more precise and accurate assessment of the effects.
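The Frisch-Waugh equality behind the partial residual plot can be verified numerically. The sketch below uses a synthetic dataset (the actual gasoline-demand data are not reproduced here): the coefficient on one regressor from the full regression equals the one obtained by regressing residualized y on the residualized regressor.

```python
import numpy as np

# Numerical check of the Frisch-Waugh result behind the partial residual
# plot, on a synthetic linear model with two regressors.
rng = np.random.default_rng(6)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]   # full multivariate OLS

W = np.column_stack([np.ones(n), x1])    # all regressors except x2
P = W @ np.linalg.pinv(W)                # projection onto the span of W
ry, rx = y - P @ y, x2 - P @ x2          # net out the other variables
beta_partial = (rx @ ry) / (rx @ rx)     # bivariate regression on residuals

print(beta_full[2], beta_partial)        # identical up to round-off
```

The two estimates agree to machine precision, which is exactly why the partial residual scatterplot can be read as if it came from the full multivariate fit.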

Conclusions
In this article, we connected variational Bayesian learning theory with the machine learning literature to tackle inference in heteroscedastic semiparametric regression models. The setup was based on the generalized linear model (GLM) framework with penalized splines. In this context, we designed an algorithm for simultaneously estimating the mean and variance functions for heteroscedastic regression. Thus, on the basis of variational Bayesian learning, a deterministic approximation of the posterior density in heteroscedastic semiparametric regression models is provided. Furthermore, closed formulae for estimating the mean and variance functions in the GLM architecture were derived. It is our belief that, with the increasing availability of structured and unstructured data and advances in computing, the field of statistics needs to adapt to new demands. From the Bayesian point of view, MCMC and its variants are excellent tools; however, faster approximations are required for various real-world applications.
To this end, the variational Bayesian learning methodology provides an accurate solution.

The empirical variational Bayesian procedure
The algorithm is completed by minimizing the free energy with respect to the hyperparameters:

∂F(r)/∂A_u = − log(B_u) + log(A_u) + 1/(2A_u) + log(B_{q(σ_u²)}) + log(A_u + …