Integrated Model Selection and Scalability in Functional Data Analysis through Bayesian Learning

Preprint – not peer-reviewed. Submitted: 07 March 2025; posted: 10 March 2025. A peer-reviewed article of this preprint also exists.

Abstract
Functional data, including one-dimensional curves and higher-dimensional surfaces, have become increasingly prominent across scientific disciplines. They offer a continuous perspective that captures subtle dynamics and richer structures compared to discrete representations, thereby preserving essential information and facilitating more natural modeling of real-world phenomena, especially in sparse or irregularly sampled settings. A key challenge lies in identifying low-dimensional representations and estimating covariance structures that capture population statistics effectively. We propose a novel Bayesian framework with a nonparametric kernel expansion and a sparse prior, enabling direct modeling of measured data and avoiding the artificial biases from regridding. Our method, Bayesian scalable functional data analysis (BSFDA), automatically selects both subspace dimensionalities and basis functions, reducing computational overhead through an efficient variational optimization strategy. We further propose a faster approximate variant that maintains comparable accuracy but accelerates computations significantly on large-scale datasets. Extensive simulation studies demonstrate that our framework outperforms conventional techniques in covariance estimation and dimensionality selection, showing resilience to high dimensionality and irregular sampling. The proposed methodology proves effective for multidimensional functional data and showcases practical applicability in biomedical and meteorological datasets. Overall, BSFDA offers an adaptive, continuous, and scalable solution for modern functional data analysis across diverse scientific domains.

1. Introduction

The emergence of big data across diverse fields, such as biomedicine, finance, and physical modeling, has catalyzed the need for advanced analytical methodologies capable of handling complex, high-dimensional datasets that conventional discrete-data analysis approaches cannot always process effectively. Such datasets often require analysis that captures and interprets their continuous and potentially high-dimensional complexities–a central promise of functional data analysis (FDA) [1,2]. Foundational work established FDA’s capacity to treat each observation as an entire function [3], be it a curve, surface, or higher-dimensional structure, thereby extracting richer insights than conventional discrete-point analyses. Over the past decade, FDA’s scope has widened significantly to accommodate high-dimensional and multivariate applications with theoretical and computational advances emerging across various contexts [4,5,6].
A pivotal technique within FDA is functional principal component analysis (fPCA), which serves as a dimension-reduction tool similar to classical PCA and factor analysis. Unlike classical PCA, however, fPCA operates in principle in an infinite-dimensional function space to capture dominant modes of variation and reduce complexity [7]. Despite its conceptual elegance, existing fPCA and similar FDA models commonly assume that data are observed on a shared, finite grid, relying on heuristic imputation or posterior estimation to handle any missing entries [8,9,10,11,12,13,14,15,16,17,18]. This assumption conveniently facilitates the adoption of established linear algebraic methods, but it compromises the integrity of FDA by introducing significant information loss and high computational demands in high-dimensional applications.
Ideally, each function would be represented according to its naturally sampled measurement points rather than forcing all observations onto a shared grid, thus preserving crucial information and avoiding the need for heuristic resampling. This point is critical when considering that, given only a finite number of data points, infinitely many functions can interpolate these points, each reflecting different inductive biases about smoothness or shape [19]. Conventional smoothing or regridding methods (e.g., polynomial interpolation) introduce biases that may distort the underlying function’s actual behavior. In contrast, we will achieve more accurate and unbiased predictions by concurrently updating the function estimation and the population-level statistics governing the estimation, such as those encoded in the covariance operator. Such an approach requires directly modeling the function from its original measurement points rather than imposing artificial grids.
To mitigate these limitations, several studies have proposed alternative strategies. For instance, [3] developed a nonparametric technique for estimating the mean and covariance of functional data under smoothness assumptions while also discussing a continuous formulation and the discretization necessary in practical applications. The authors of [20,21] extended fPCA to sparse and irregular longitudinal designs by smoothing the covariance estimate and then discretizing. Nonetheless, classical discretization steps often result in significant information loss and computational burdens.
As functional data grew in size and complexity, researchers turned to flexible basis expansions, including sinusoids (Fourier), wavelets, polynomials, and B-splines, for a finite-dimensional representation of functional data that is convenient and accurate in computation, avoiding the drawbacks of explicit approximation and resampling [2,22,23,24,25,26]. For example, [27] utilized basis function approximations to manage irregular grids. However, a core challenge remains in selecting a suitable model: researchers must choose the number and form (e.g., smoothness) of the basis functions, along with the dimensionality of the representational subspace. The placement of basis functions is also essential. Evenly spaced nodes remain popular for their simplicity but may be suboptimal; alternative allocations, such as Chebyshev nodes for superior accuracy [28] or sparse grids that curb the combinatorial growth of computational complexity [29], may be preferable.
Existing studies tend to rely on manual choices of the hyperparameters [7,10,11,30] or on cross-validation [3,25,31,32], which is known to be computationally prohibitive. Others employ approximate cross-validation [22,24] or the marginal likelihood [14] but still require exhaustive testing of all candidate models. Methods with sparse Bayesian priors [8,33,34] enable model selection within a single optimization. The authors of [35,36] use shrinkage or sparse priors for data-adaptive basis selection to ensure minimal but effective sets of basis functions. Notably, [37] proposed modified Akaike and Bayesian information criteria, demonstrating state-of-the-art performance in simulation studies for sparse and dense functional data.
In addition, probabilistic FDA has emerged as a sophisticated adaptation of probabilistic methods, incorporating the flexibility of latent variable models to manage functional data. A Bayesian latent factor regression model (LFRM) [34], for example, extends conventional regression to accommodate complex structures and dependencies, providing a robust framework for the complexities inherent in functional data. However, these Bayesian approaches are often limited by the computational demands of Monte Carlo methods in high dimensions [14]. To address increasingly high-dimensional FDA problems, recent efforts have emphasized scalability. For instance, [7,30,38,39] introduced FDA for 2D and 3D images with a fixed basis or grid. The authors of [32] further reduced complexity in 2D fPCA via tensor-product B-splines, and [40] applied a Bayesian framework with basis expansion, adaptive regularization, and Gibbs sampling to 2D functional data in the form of EEG studies on children with autism. Further, [41] leverages a parsimonious basis representation and variational Bayes to achieve computational efficiency, making it suitable for 3D brain imaging data.
In parallel, the broader field of principal component analysis (PCA) remains a fundamental and effective tool. Classical PCA, rooted in eigen-decomposition [42], effectively extracts dominant modes of variation in many settings but does not inherently accommodate the probabilistic nature of real-world data and its inherent uncertainties. Thus, [43] introduced probabilistic PCA (PPCA), which incorporates a probability distribution to manage these uncertainties more effectively. PCA has since evolved to address missing data, model selection, and complex data types [44,45,46,47]. In the context of functional data, these concepts motivate new approaches that unify probabilistic methodologies, latent factor models, and kernel expansions for continuous domains [1].
Within Bayesian machine learning, various priors have been proposed for sparse or robust formulations of PCA [48,49,50,51]. Specifically, sparse Bayesian learning (SBL) [52,53,54], with its mechanism of automatic relevance determination (ARD) [55,56], has proven adept at promoting parsimonious solutions [57,58,59,60,61]. SBL has been employed in Bayesian PCA [45,62,63], where an iterative method evaluates the relevance of each component and selects the internal dimensionality by disregarding redundant ones. [64] applied SBL to optimize the combination of base kernels to enhance model performance. These methods often exploit variational techniques or accelerated optimization [17,64,65,66,67,68,69,70], thereby balancing model complexity with computational tractability. In functional data contexts, where representations are infinite-dimensional, SBL offers a compelling framework for advanced FDA methods by efficiently handling sparse expansions and adaptively adjusting model complexity.
In summary, despite these efforts to advance functional data analysis, several challenges persist. Existing methods often exhibit limitations in accuracy and efficiency when sampling is sparse, automatic model selection is essential, and dimensionality is high [41]. Concurrently, probabilistic PCA and SBL frameworks illustrate powerful strategies to incorporate versatility and adaptivity for such data complexities while their adaptation to FDA is still evolving. These gaps underscore a need for a robust, flexible, and computationally feasible approach, unifying ideas from FDA, PPCA, and SBL, that manages the continuous and high-dimensional intricacy of modern datasets.

1.1. Contributions

This manuscript proposes a novel Bayesian framework for functional principal component analysis that leverages nonparametric kernel expansions, sparse Bayesian learning for model selection, and efficient variational inference (VI). We abbreviate the proposed method as BSFDA (Bayesian Scalable Functional Data Analysis). BSFDA addresses critical gaps in existing FDA techniques concerning irregular sampling, high-dimensional scalability, and the selection of both basis functions and principal components. Specifically, our approach offers:
  • Joint selection of optimum latent factors and sparse basis functions: This eliminates constraints on parametric representation dimensionality, avoids information loss from discretization, and extends naturally to higher dimensions or non-Euclidean spaces through nonparametric kernel expansion. It further enhances interpretability by adaptively choosing model complexity without testing multiple models separately. We achieve these improvements using a Bayesian paradigm that provides robust and accurate posterior estimates while supporting uncertainty quantification [1].
  • Scalability across domain dimensionality and data size: The proposed method uses VI for faster computation compared to Markov chain Monte Carlo (MCMC) methods, while still being accurate. BSFDA reduces overall computation by partitioning the parameters into smaller update groups and introducing a slack variable to further subdivide the weighting matrix of the kernel expansion into even smaller parts [34], updating fewer blocks at a time while still considering all model options. Introducing a slack variable makes the optimization more efficient by separating different variable groups. This approach scales well with data size and works efficiently even on large, complex datasets. We demonstrate this on the 4D global oceanic temperature data set (ARGO), which consists of 127 million data points spanning the globe over 27 years at depths up to 200 meters [71].

1.2. Outline

Together, these contributions position our work at the intersection of functional principal component analysis [25] and sparse Bayesian learning [52], enabling robust, flexible, and computationally feasible analysis of high-dimensional functional data. The remainder of this paper is organized as follows. We first describe the proposed Bayesian functional PCA framework in detail, highlighting the nonparametric kernel expansions and sparse Bayesian priors. Next, we discuss the variational inference procedure and the reduced active block updating step, illustrating how these techniques jointly provide scalability and accuracy. We then present extensive empirical studies demonstrating factor selection accuracy, covariance operator estimation, and performance in large-scale 4D applications. Finally, we conclude with a discussion of potential extensions and open directions, emphasizing the broader implications of our work for large-scale, high-dimensional functional data analysis.

2. Formulation

The aim of the proposed method is the estimation of functions $y_i : \mathbb{R}^M \to \mathbb{R}$ that are outcomes of an $M$-dimensional stochastic process. The observed data are $P$ independent, noisy samples of the functions $\{y_i\}_{i=1}^P$ at indices $\{X_i \in \mathbb{R}^{N_i \times M}\}_{i=1}^P$, where $N_i$ is the number of measured samples for the $i$-th function $y_i$ and $X_{in} \in \mathbb{R}^M$ is the location of the $n$-th measurement in the domain of the sample. The observations are $\{Y_i\}_{i=1}^P$, where $Y_{in} = y_i(X_{in}) + E_{in}$ and $E_{in}$ is white Gaussian noise of variance $\sigma^2$.

2.1. Generative Model

We assume that $y_i$ is in a class of functions that can be approximated as a weighted summation of $K$ kernel functions $\{\phi_k\}_{k=1}^K$:
$$y_i(x) = \sum_{k=1}^{K} w_{ik}\,\phi_k(x),$$
where $\phi_k(x) = \mathcal{K}(x, x_k)$, $\mathcal{K}$ is the kernel function, $x_k$ is the $k$-th center location, and $\{w_{ik}\}_{k=1}^K$ are the $K$ coefficients of $y_i$. We also assume that the functions span a low-dimensional subspace of dimension $J \ll K$. We model this stochastically by assuming that the weights $w_i \in \mathbb{R}^K$ are given by $w_{ik} = \sum_{j=1}^{J} Z_{ij} W_{jk} + \bar{Z}_k$, where $W \in \mathbb{R}^{J \times K}$ are the principal component loadings and $Z_i \in \mathbb{R}^J$ are standard normal variables. This model is therefore:
$$Y_{in} = \sum_{k=1}^{K} \Big( \sum_{j=1}^{J} Z_{ij} W_{jk} + \bar{Z}_k \Big) \phi_k(X_{in}) + E_{in} = (Z_i W + \bar{Z})\,\Phi_{i \cdot n} + E_{in},$$
where $\Phi_{i \cdot n} = [\phi_1(X_{in}), \ldots, \phi_K(X_{in})]^T$ are the evaluations of the basis functions at the $n$-th index of the $i$-th sample function.
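For concreteness, the following minimal sketch simulates data from this generative model; the Gaussian kernel choice, the constants, and the variable names are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, P, sigma = 20, 3, 50, 0.3                 # bases, components, sample functions, noise sd (assumed)
centers = np.linspace(0, 1, K)                  # kernel center locations x_k (assumed)
ls = 0.1                                        # kernel length-scale (assumed)

def phi(x):
    """Evaluate the K Gaussian kernel basis functions at points x of shape (N,)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / ls) ** 2)  # (N, K)

W = rng.normal(size=(J, K))                     # principal component loadings
Z_bar = rng.normal(size=K)                      # mean-function coefficients

data = []
for i in range(P):
    N_i = rng.integers(5, 15)                   # irregular number of measurements
    X_i = rng.uniform(0, 1, size=N_i)           # irregular measurement locations
    Z_i = rng.normal(size=J)                    # standard-normal component scores
    Phi_i = phi(X_i)                            # (N_i, K) basis evaluations
    Y_i = (Z_i @ W + Z_bar) @ Phi_i.T + sigma * rng.normal(size=N_i)
    data.append((X_i, Y_i))
```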

2.2. Sparse Prior

For effective model selection, we introduce a sparse prior over the coefficients of the basis functions. The sparse prior in the proposed model is based on automatic relevance determination (ARD) [62]. ARD evaluates the importance of a feature with a precision parameter estimated from the data. The model uses the precisions $\{\alpha_j\}_{j=1}^J$ for the components and $\{\beta_k\}_{k=1}^K$ for the basis functions, respectively, while $\eta$ signifies the overall magnitude of the mean coefficients:
$$\bar{Z}_k \sim \mathcal{N}\big(0,\ \eta^{-1}\beta_k^{-1}\big), \quad k = 1{:}K,$$
$$W_{jk} \sim \mathcal{N}\big(0,\ \alpha_j^{-1}\beta_k^{-1}\big), \quad j = 1{:}J,\ k = 1{:}K.$$
In the model, $\alpha_j$, $\beta_k$, $\eta$, and $\sigma^{-2}$ are all precision parameters, which naturally come with a conjugate Gamma prior. The probabilistic graphical model is depicted in Figure 1. Setting $a_0$, $b_0$ to a small value yields a vague Gamma prior that approximates a noninformative (Jeffreys-type) prior.
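Putting the pieces together, the full generative hierarchy can be summarized compactly (the Gamma shape/rate parametrization is assumed here):
$$Z_{ij} \sim \mathcal{N}(0, 1), \qquad W_{jk} \sim \mathcal{N}\big(0,\ \alpha_j^{-1}\beta_k^{-1}\big), \qquad \bar{Z}_k \sim \mathcal{N}\big(0,\ \eta^{-1}\beta_k^{-1}\big),$$
$$\alpha_j,\ \beta_k,\ \eta,\ \sigma^{-2} \sim \Gamma(a_0, b_0), \qquad Y_{in} \mid Z_i, W, \bar{Z}, \sigma \sim \mathcal{N}\big((Z_i W + \bar{Z})\,\Phi_{i\cdot n},\ \sigma^2\big).$$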

3. Methods

Based on the proposed formulation in Section 2, we estimate $\Pr[\Theta \mid X, Y, a_0, b_0]$, the posterior of the unobserved values $\Theta = \{Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta\}$. This inference gives the point estimates of $\Theta$ and the posterior predictive distribution of new data. For notational convenience, $X$, $a_0$, $b_0$ are omitted.
Using Bayes' theorem, $\Pr[\Theta \mid Y] = \frac{\Pr[Y \mid \Theta]\,\Pr[\Theta]}{\Pr[Y]}$, but the exact posterior distribution is intractable because the evidence $\Pr[Y] = \int \Pr[\Theta, Y]\, d\Theta$ is intractable. Therefore, an approximate inference strategy is proposed. To facilitate this, we utilize variational inference (VI) [72], choosing a surrogate density from a parameterized family, denoted as $Q$, to approximate the posterior. Compared with classical methods like Markov chain Monte Carlo (MCMC) sampling, VI is typically faster [72]. In our experiments, VI is about 85 times faster for the original Bayesian PCA formulation [45], as shown in Section 12.2 in the supplements.

3.1. Variational Bayesian Inference

Variational inference optimizes $Q$ by maximizing the lower bound $\mathcal{L}$ (minimizing the KL divergence between the actual and surrogate distributions):
$$\mathcal{L} = \mathbb{E}_Q\!\left[\ln\frac{\Pr[\Theta, Y]}{Q(\Theta)}\right] = -\,\mathrm{KL}\big(Q(\Theta)\,\|\,\Pr[\Theta \mid Y]\big) + \ln \Pr[Y].$$
Since $\ln \Pr[Y]$ does not depend on $Q$, maximizing $\mathcal{L}$ is equivalent to minimizing $\mathrm{KL}\big(Q(\Theta)\,\|\,\Pr[\Theta \mid Y]\big)$.
The mean-field variational family is used for $Q$. It simplifies the optimization by assuming the surrogate posterior distributions are independent, allowing each variable in the posterior to be optimized independently: $Q_\Theta = \prod_i Q_{\Theta_i}$. The posterior for each variable is chosen to be conjugate, further simplifying the optimization. Thus, the posteriors of the component scores $Z$, the weighting matrix $W$, and the mean weights $\bar{Z}$ are normal distributions. Here $W$ is vectorized via $\mathrm{vec}(W)$ without altering its normality assumption. Meanwhile, the posteriors of the precisions of the noise ($\sigma^{-2}$), the components ($\alpha$), the basis functions ($\beta$), and the mean weights ($\eta$) are Gamma distributions:
$$Q_Z(Z) = \prod_i Q_{Z_i}(Z_i) = \prod_i \mathcal{N}(Z_i \mid \mu_{Z_i}, \Sigma_{Z_i})$$
$$Q_W(W) = \mathcal{N}\big(\mathrm{vec}(W) \mid \mu_{\mathrm{vec}(W)}, \Sigma_{\mathrm{vec}(W)}\big)$$
$$Q_{\bar{Z}}(\bar{Z}) = \mathcal{N}(\bar{Z} \mid \mu_{\bar{Z}}, \Sigma_{\bar{Z}})$$
$$Q_\sigma(\sigma) = \Gamma(\sigma^{-2} \mid a_\sigma, b_\sigma)$$
$$Q_\alpha(\alpha) = \prod_j Q_{\alpha_j} = \prod_j \Gamma(\alpha_j \mid a_{\alpha_j}, b_{\alpha_j})$$
$$Q_\beta(\beta) = \prod_k Q_{\beta_k} = \prod_k \Gamma(\beta_k \mid a_{\beta_k}, b_{\beta_k})$$
$$Q_\eta(\eta) = \Gamma(\eta \mid a_\eta, b_\eta)$$

3.1.1. Update Steps

In the mean-field approximation using the surrogate posterior $Q_\Theta = \prod_i Q_{\Theta_i}$ conditioned on the observations $Y$, the lower bound is maximized with respect to each unknown $\Theta_i$. With the conjugate prior, the optimal updates (denoted with "$\leftarrow$") make the moments of $Q_{\Theta_i}$ equal to the moments conditioned on the remaining parts of $Q_\Theta$ [72]:
$$Q_{\Theta_i} \leftarrow \frac{\exp\!\big(\mathbb{E}_{Q/\Theta_i}[\ln \Pr[Y, \Theta]]\big)}{\int \exp\!\big(\mathbb{E}_{Q/\Theta_i}[\ln \Pr[Y, \Theta]]\big)\, d\Theta_i}$$
From Equation (13), detailed update rules for each variable are presented subsequently, and the derivations of these formulas are in the supplementary material.
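Structurally, the resulting optimization is a standard coordinate-ascent loop over the factors of $Q$. The sketch below is a schematic Python illustration; the `updates` and `lower_bound` callables are placeholders for the rules listed below, not the authors' implementation.

```python
def cavi(data, q, updates, lower_bound, max_iter=500, tol=1e-6):
    """Generic coordinate-ascent variational inference loop (sketch).

    q        : container of all variational parameters (moments of each factor)
    updates  : callables, one per factor (alpha, eta, beta, Z_bar, W, Z_i, sigma),
               each recomputing that factor's moments given the others
    """
    prev_elbo = float("-inf")
    for _ in range(max_iter):
        for update in updates:          # one sweep of the update rules listed below
            update(q, data)
        elbo = lower_bound(q, data)     # the variational lower bound L
        if abs(elbo - prev_elbo) < tol:
            break                       # converged
        prev_elbo = elbo
    return q
```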
Updates for the parameters of the posterior for the precision of components $Q_{\alpha_j}$, $j = 1{:}J$:
$$a_{\alpha_j} \leftarrow a_0 + \frac{K}{2},$$
$$b_{\alpha_j} \leftarrow b_0 + \frac{1}{2}\sum_{k=1}^{K} \mathbb{E}_{Q/\alpha_j}\big[W_{jk}^2\,\beta_k\big] = b_0 + \frac{1}{2}\sum_{k=1}^{K} \big(\Sigma_{W_{jk}} + \mu_{W_{jk}}^2\big)\frac{a_{\beta_k}}{b_{\beta_k}},$$
where Equation (14) calculates the corrected degrees of freedom and Equation (15) calculates the corrected sum of squares. As $a_0$ and $b_0$ approach 0, the expectation of the precision $\alpha_j$, which is $\mathbb{E}_{Q_{\alpha_j}}[\alpha_j] = \frac{a_{\alpha_j}}{b_{\alpha_j}}$, is exactly the inverse of the empirical or sample variance.
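In code, this update is a one-liner per component; the sketch below assumes `mu_W` and `var_W` hold the current posterior means and marginal variances of $W_{jk}$, and `a_beta`, `b_beta` the Gamma parameters of $\beta$ (these names are ours, not the paper's).

```python
import numpy as np

def update_alpha(a0, b0, mu_W, var_W, a_beta, b_beta):
    """ARD update for the component precisions alpha_j (sketch).

    mu_W, var_W    : (J, K) posterior means and marginal variances of W_jk
    a_beta, b_beta : (K,) Gamma parameters of the basis-function precisions
    """
    J, K = mu_W.shape
    E_beta = a_beta / b_beta                                         # E[beta_k]
    a_alpha = np.full(J, a0 + K / 2.0)                               # corrected degrees of freedom
    b_alpha = b0 + 0.5 * ((var_W + mu_W**2) * E_beta).sum(axis=1)    # corrected sum of squares
    return a_alpha, b_alpha
```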
Updates for the parameters of the posterior of the precision of the mean weights $Q_\eta$:
$$a_\eta \leftarrow a_0 + \frac{K}{2},$$
$$b_\eta \leftarrow b_0 + \frac{1}{2}\sum_{k=1}^{K} \mathbb{E}_{Q/\eta}\big[\bar{Z}_k^2\,\beta_k\big] = b_0 + \frac{1}{2}\sum_{k=1}^{K} \big(\Sigma_{\bar{Z}_{kk}} + \mu_{\bar{Z}_k}^2\big)\frac{a_{\beta_k}}{b_{\beta_k}}.$$
Updates for the parameters of the posterior of the precision of basis functions $Q_{\beta_k}$, $k = 1{:}K$:
$$a_{\beta_k} \leftarrow a_0 + \frac{J+1}{2}, \qquad b_{\beta_k} \leftarrow b_0 + \frac{1}{2}\,\mathbb{E}_{Q/\beta_k}\!\left[\bar{Z}_k^2\,\eta + \sum_{j=1}^{J} W_{jk}^2\,\alpha_j\right]$$
$$= b_0 + \frac{1}{2}\left[\big(\Sigma_{\bar{Z}_{kk}} + \mu_{\bar{Z}_k}^2\big)\frac{a_\eta}{b_\eta} + \sum_{j=1}^{J}\big(\Sigma_{W_{jk}} + \mu_{W_{jk}}^2\big)\frac{a_{\alpha_j}}{b_{\alpha_j}}\right].$$
Updates for the parameters of the posterior of the mean weights $Q_{\bar{Z}}$:
$$\Sigma_{\bar{Z}} \leftarrow \left(\mathbb{E}_{Q/\bar{Z}}\!\left[\sigma^{-2}\sum_{i=1}^{P}\Psi_i + \eta\,\mathrm{diag}(\beta)\right]\right)^{-1} = \left(\frac{a_\sigma}{b_\sigma}\sum_{i=1}^{P}\Psi_i + \frac{a_\eta}{b_\eta}\,\mathrm{diag}\!\left(\frac{a_\beta}{b_\beta}\right)\right)^{-1},$$
$$\mu_{\bar{Z}} \leftarrow \mathbb{E}_{Q/\bar{Z}}\!\left[\sigma^{-2}\sum_{i=1}^{P}\big(Y_i - Z_i W\,\Phi_i\big)\Phi_i^T\right]\Sigma_{\bar{Z}} = \frac{a_\sigma}{b_\sigma}\sum_{i=1}^{P}\big(Y_i - \mu_{Z_i}\mu_W\Phi_i\big)\Phi_i^T\,\Sigma_{\bar{Z}},$$
where $\mathrm{diag}(\beta)$ denotes the diagonal matrix with diagonal entries given by $\beta$. Equation (20) indicates that the eigenvectors of $\Sigma_{\bar{Z}}$ are solely determined by the sum of Gram matrices $\sum_{i=1}^{P}\Psi_i$, while the eigenvalues of $\Sigma_{\bar{Z}}$ have a negative correlation with the scale of $\sum_{i=1}^{P}\Psi_i$, the prior $\eta\,\mathrm{diag}(\beta)$, and the data-dependent term $\sigma^{-2}$. It is sensible because, for instance, large noise would result in large uncertainty in $\bar{Z}$. In Equation (21), the data residuals, excluding the component scores, are projected into the $K$-dimensional space through the inner product with $\Phi_i$ and summed over all sample functions to calculate the mean weights.
Updates for the parameters of the posterior of the weights $Q_W$:
$$\Sigma_{\mathrm{vec}(W)} \leftarrow \left(\mathbb{E}_{Q/W}\!\left[\sigma^{-2}\sum_{i=1}^{P}\Psi_i \otimes \big(Z_i^T Z_i\big) + \mathrm{diag}(\beta)\otimes\mathrm{diag}(\alpha)\right]\right)^{-1}$$
$$= \left(\frac{a_\sigma}{b_\sigma}\sum_{i=1}^{P}\Psi_i \otimes \big(\mu_{Z_i}^T\mu_{Z_i} + \Sigma_{Z_i}\big) + \mathrm{diag}\!\left(\frac{a_\beta}{b_\beta}\right)\otimes\mathrm{diag}\!\left(\frac{a_\alpha}{b_\alpha}\right)\right)^{-1},$$
$$\mu_{\mathrm{vec}(W)} \leftarrow \mathbb{E}_{Q/W}\!\left[\sigma^{-2}\sum_{i=1}^{P}\mathrm{vec}\!\Big(\big(\Phi_i(Y_i^T - \Phi_i^T\bar{Z}^T)\,Z_i\big)^T\Big)^{\!T}\right]\Sigma_{\mathrm{vec}(W)} = \frac{a_\sigma}{b_\sigma}\sum_{i=1}^{P}\mathrm{vec}\!\Big(\big(\Phi_i(Y_i^T - \Phi_i^T\mu_{\bar{Z}}^T)\,\mu_{Z_i}\big)^T\Big)^{\!T}\Sigma_{\mathrm{vec}(W)}.$$
Equation (22) is similar to Equation (20): the precision involves $\Phi_i$, the prior $\mathrm{diag}(\beta)\otimes\mathrm{diag}(\alpha)$, and the data-dependent terms $\sigma^{-2}$ and $Z_i$. In Equation (23), the data residual excluding the mean function is used to estimate the expectation of $W$.
Updates for the parameters of the posterior of the component scores $Q_{Z_i}$:
$$H_{i,jk} \leftarrow \mathbb{E}_{Q/Z_i}\big[W_{j\cdot}\,\Psi_i\,W_{k\cdot}^T\big] = \mathrm{Tr}\big(\mathbb{E}_{Q/Z_i}[W_{k\cdot}^T W_{j\cdot}]\,\Psi_i\big) = \mathrm{Tr}\big(\big(\Sigma_{[W_{k\cdot},\,W_{j\cdot}]} + \mu_{W_{j\cdot}}^T\mu_{W_{k\cdot}}\big)\Psi_i\big), \quad j = 1{:}J,\ k = 1{:}J,$$
$$\Sigma_{Z_i} \leftarrow \left(\mathbb{E}_{Q/Z_i}\big[\sigma^{-2}\,W\Psi_i W^T + I\big]\right)^{-1} = \left[\frac{a_\sigma}{b_\sigma}H_i + I\right]^{-1},$$
$$\mu_{Z_i} \leftarrow \mathbb{E}_{Q/Z_i}\big[\sigma^{-2}(Y_i - \bar{Z}\Phi_i)\Phi_i^T W^T\big]\,\Sigma_{Z_i} = \frac{a_\sigma}{b_\sigma}\big(Y_i - \mu_{\bar{Z}}\Phi_i\big)\Phi_i^T(\mu_W)^T\,\Sigma_{Z_i},$$
where $H_i$ is a temporary variable denoting the Gram matrix of the weighted kernel functions $W\Phi_i$, and $\Sigma_{[W_{k\cdot},\,W_{j\cdot}]}$ denotes the covariance between $W_{k\cdot}^T$ and $W_{j\cdot}$ in $Q$.
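The component-score update is equally direct; a minimal numpy rendering (with assumed array shapes noted in the docstring) is:

```python
import numpy as np

def update_Z_i(Y_i, Phi_i, mu_W, mu_Zbar, H_i, a_sigma, b_sigma):
    """Update of the posterior of the component scores Z_i (sketch).

    Y_i : (N_i,) observations;  Phi_i : (K, N_i) basis evaluations
    mu_W : (J, K) mean loadings;  mu_Zbar : (K,) mean weights
    H_i : (J, J) Gram matrix of the weighted kernel functions W Phi_i
    """
    E_prec = a_sigma / b_sigma                                   # E[sigma^{-2}]
    Sigma_Zi = np.linalg.inv(E_prec * H_i + np.eye(H_i.shape[0]))
    resid = Y_i - mu_Zbar @ Phi_i                                # residual excluding the mean function
    mu_Zi = E_prec * (resid @ Phi_i.T @ mu_W.T) @ Sigma_Zi
    return mu_Zi, Sigma_Zi
```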
Updates for the parameters of the posterior of the noise $Q_\sigma$:
$$a_\sigma \leftarrow a_0 + \frac{1}{2}\sum_i N_i, \qquad b_\sigma \leftarrow b_0 + \frac{1}{2}\,\mathbb{E}_{Q/\sigma}\!\left[\sum_i \big\|Y_i - (Z_i W + \bar{Z})\Phi_i\big\|_2^2\right]$$
$$= b_0 + \frac{1}{2}\sum_i\Big(Y_i Y_i^T - 2\,Y_i\big(\mu_{Z_i}\mu_W\Phi_i\big)^T - 2\,Y_i\big(\mu_{\bar{Z}}\Phi_i\big)^T + 2\,\mu_{Z_i}\mu_W\Psi_i(\mu_{\bar{Z}})^T + \mathrm{Tr}\big(\big(\Sigma_{\bar{Z}} + (\mu_{\bar{Z}})^T\mu_{\bar{Z}}\big)\Psi_i\big)\Big) + \frac{1}{2}\,\mathrm{vec}(H^T)^T\sum_i \mathrm{vec}\!\Big(\mathrm{vec}(\Psi_i)\,\mathrm{vec}\big(\Sigma_{Z_i} + \mu_{Z_i}^T\mu_{Z_i}\big)^T\Big),$$
where $H$ is a temporary variable whose row indexed by the pair $(j,k)$ is updated by
$$H_{(j,k)\,\cdot} \leftarrow \mathbb{E}_{Q/\sigma}\!\left[\mathrm{vec}\big(W_{k\cdot}^T W_{j\cdot}\big)^T\right] = \mathrm{vec}\big(\Sigma_{[W_{k\cdot},\,W_{j\cdot}]} + \mu_{W_{j\cdot}}^T\mu_{W_{k\cdot}}\big)^T, \quad j = 1{:}J,\ k = 1{:}J.$$
Nearly noninformative (vague) priors, i.e., with almost zero $a_0$, $b_0$, introduce an inherent identifiability ambiguity in our formulation, specifically in the product of the precision parameters $\alpha$, $\beta$, and $\eta$ (Equations 20 and 22). In our model, scaling $\alpha$ and $\eta$ by a specific factor while inversely scaling $\beta$ leaves the product (and hence the lower bound in Equation 5) unchanged. This inherent ambiguity can lead $\alpha$, $\beta$, and $\eta$ to converge to extreme values, thereby challenging numerical stability during optimization. To mitigate this issue, we adopt a heuristic constraint to ensure that the smallest values of $\alpha$ and $\beta$ remain within one order of magnitude of each other. Specifically, we enforce $\big|\log_{10}\big(\min(\alpha)/\min(\beta)\big)\big| \leq 1$. If an update to any $\alpha_j$ or $\beta_k$ would violate this constraint, that particular update is skipped while the rest of the parameters remain updated. This strategy does not alter the algorithm's overall structure but stabilizes the optimization by curbing unnecessary flexibility in the precision parameters.
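The check itself is trivial to state; a minimal sketch of the skip rule (our naming) is:

```python
import numpy as np

def within_one_order(alpha, beta):
    """Heuristic stability constraint: the smallest alpha and smallest beta must
    stay within one order of magnitude, |log10(min(alpha) / min(beta))| <= 1."""
    return abs(np.log10(alpha.min() / beta.min())) <= 1.0

# A proposed update to any alpha_j or beta_k is applied only if the resulting
# (alpha, beta) still satisfies within_one_order; otherwise that update is skipped.
```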

3.2. Scalable Update Strategy

The scalability of our algorithm so far is primarily challenged by the need to optimize the variational lower bound $\mathcal{L}$ over $K$ basis functions. As indicated by Equation (22), the time complexity is $O(K^6)$ (or, alternatively, $O(K^2 P \max_i N_i)$, typically dominated by the former), which becomes prohibitive when $K$ is large. In practice, however, only a small subset of these basis functions is necessary for an accurate representation–those with non-negligible weights under our sparse prior.
To address this, we focus the updates on the subspace of active basis functions, denoted as $K^{(a)}$, which comprises only those functions with non-negligible weights. The remaining basis functions, whose influence is minimal, are held fixed during optimization. Furthermore, the number of active principal components is denoted $J^{(a)}$ and set equal to $K^{(a)}$, ensuring that the model spans the full range of possible ranks from 1 to $K^{(a)}$. Consequently, we optimize $Q^{(a)}$ using updates derived w.r.t. the $K^{(a)}$-dimensional lower bound $\mathcal{L}^{(a)}$ as an efficient surrogate for the full updates of $Q$ w.r.t. the full lower bound $\mathcal{L}$, using only the $K^{(a)}$ active basis functions. Meanwhile, the active dimensionality of the model is adjusted dynamically during optimization by activating or deactivating basis functions based on their precision parameters. For clarity, variables associated with the active subspace are annotated with the superscript $(a)$ (e.g., $a_{\alpha_j}^{(a)} = a_0 + \frac{K^{(a)}}{2}$ versus $a_{\alpha_j} = a_0 + \frac{K}{2}$).

3.2.1. Implicit Factorization

For notational clarity, we reorder the rows and columns of our parameter matrices to separate active components from inactive ones. Specifically, we partition as follows:
$$Z_i = \begin{bmatrix} Z_{iA} & Z_{iB} \end{bmatrix}, \quad \bar{Z} = \begin{bmatrix} \bar{Z}_A & \bar{Z}_B \end{bmatrix}, \quad \alpha = \begin{bmatrix} \alpha_A & \alpha_B \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_A & \beta_B \end{bmatrix}, \quad W = \begin{bmatrix} W_A & W_B \\ W_C & W_D \end{bmatrix}, \quad \Phi_i = \begin{bmatrix} \Phi_{iA} \\ \Phi_{iB} \end{bmatrix},$$
Here, the subscript $A$ denotes variables belonging to the active subspace (i.e., those corresponding to the $K^{(a)}$ basis functions), while $B$, $C$, and $D$ denote the inactive components. Notably, the cross terms $W_B$ and $W_C$ involve both active and inactive components; these are updated implicitly, as proved in the supplements.
Following the strategy in [73], a component or basis function is deemed inactive if its precision exceeds a high threshold, i.e., $\alpha_j > \epsilon^{-1}$ and $\beta_k > \epsilon^{-1}$ as $\epsilon \to 0$. In the limit, the inactive basis functions decouple from the active ones, leading to the following mean-field factorization:
$$Q_W = Q_{W_A}\,Q_{W_B}\,Q_{W_C}\,Q_{W_D}$$
$$Q_{\bar{Z}} = Q_{\bar{Z}_A}\,Q_{\bar{Z}_B}$$
$$Q_{Z_i} = Q_{Z_{iA}}\,Q_{Z_{iB}}.$$
The factorization of $\alpha$ and $\beta$ was already obtained in Equations (10) and (11). These factorizations allow us to decouple the updates for the active subspace, with the proof provided in the supplementary material.
It implies that only updates for $Q_{Z_{iA}}$, $Q_{W_A}$, $Q_{W_B}$, $Q_{W_C}$, $Q_{\bar{Z}_A}$, $Q_{\alpha_A}$, $Q_{\beta_A}$, $Q_\sigma$, and $Q_\eta$ are required, as shown in Figure 2. This strategy reduces the computational complexity from $O(K^6)$ to $O\big((K^{(a)})^6\big)$. Moreover, the active dimensions $K^{(a)}$ are initialized using a modified, multi-instance version of the relevance vector machine [52], as detailed in Section 11 in the supplementary materials.

3.2.2. Low-Dimensional Lower Bound

This section shows how to optimize these active surrogates, e.g., $Q_{\bar{Z}_A}$, using updates of $Q^{(a)}$ w.r.t. the $K^{(a)}$-dimensional lower bound $\mathcal{L}^{(a)}$, which ultimately optimizes the full lower bound $\mathcal{L}$. To distinguish between the two, we denote the active surrogate posterior for the full model as $Q_{\bar{Z}_A}$ and that for the reduced $K^{(a)}$-dimensional model as $Q_{\bar{Z}_A}^{(a)}$. The active Gaussian surrogate posteriors are shared, e.g., $Q_{\bar{Z}_A} = Q_{\bar{Z}_A}^{(a)} = \mathcal{N}(\bar{Z}_A \mid \mu_{\bar{Z}_A}, \Sigma_{\bar{Z}_A})$. This implies that updating $Q^{(a)}$ is equivalent to updating $Q$, so we set the moments of the active distributions of the full model to match those of the reduced model. However, the surrogate posterior Gamma distributions differ between the two models. For example, the update of $\mathbb{E}_{Q^{(a)}}[\alpha_A]$ depends solely on $Q_{W_A}$, whereas $\mathbb{E}_{Q}[\alpha_A]$ also incorporates a cross term $Q_{W_B}$ corresponding to the remaining $(K - K^{(a)})$ dimensions. This difference is reflected in how the scale parameters depend on the number of active versus total basis functions, as shown in Equations (14), (16) and (18). Nonetheless, we prove that in the limit $\epsilon \to 0$, the fixed point of the $K^{(a)}$-dimensional updates of the complete surrogate $Q$ equals that of the reduced surrogate $Q^{(a)}$. Consequently, the updates for $Q_{\alpha_A}$, $Q_{\beta_A}$, and $Q_\eta$ are derived directly from the expectations of the reduced model $Q_{\alpha_A}^{(a)}$, $Q_{\beta_A}^{(a)}$, $Q_\eta^{(a)}$:
$$\mathbb{E}_Q[\alpha_A] \leftarrow \mathbb{E}_{Q^{(a)}}[\alpha_A]: \quad \frac{a_{\alpha_j}}{b_{\alpha_j}} \leftarrow \frac{a_{\alpha_j}^{(a)}}{b_{\alpha_j}^{(a)}}, \quad j \in J^{(a)},$$
$$\mathbb{E}_Q[\beta_A] \leftarrow \mathbb{E}_{Q^{(a)}}[\beta_A]: \quad \frac{a_{\beta_k}}{b_{\beta_k}} \leftarrow \frac{a_{\beta_k}^{(a)}}{b_{\beta_k}^{(a)}}, \quad k \in K^{(a)},$$
$$\mathbb{E}_Q[\eta] \leftarrow \mathbb{E}_{Q^{(a)}}[\eta]: \quad \frac{a_\eta}{b_\eta} \leftarrow \frac{a_\eta^{(a)}}{b_\eta^{(a)}}.$$
These updates, Equations (34)-(36), are proven to optimize $\mathcal{L}$ in Theorems 1 and 2 in the supplements.

3.2.3. Heuristic for Activating Basis Functions

The proposed method selects a relatively small set of basis functions from a potentially extensive set of possibilities. The computational costs are kept in check by recognizing that inactive basis functions do not interact with those active (with non-negligible weights). Due to computational constraints, we consider functions for activation sequentially rather than all at once. Thus, we propose Algorithm 1 to introduce unseen basis functions into the active set using a selective strategy akin to the heuristic approach described in [73].
The algorithm selects the top function, $\phi_{B\tilde{k}}$, from the inactive basis functions $\{\phi_{Bk}\}_k$ by gauging their correlation with the residuals and applying an angle-based threshold $\tau_{\mathrm{ang}}$ relative to the subspace of $\phi_A$. The correlation with the residuals for $\phi_{Bk}$ is measured by $\sum_i \big(\Phi_{iBk}\big(Y_i - \mathbb{E}_{Q^{(a)}}[Z_{iA}W_A + \bar{Z}_A]\,\Phi_{iA}\big)^T\big)^2$. The angle-based threshold ensures a meaningful distinction from the active functions. Next, the current active surrogate posterior is expanded by one dimension for $\phi_{B\tilde{k}}$, initiating optimization from the numerical maximum precision $\tau_{\max}$. After this trial optimization, the function is retained if its precision falls below $\tau_{\max}$; otherwise, the algorithm terminates. For efficiency, the trial optimization replaces an existing function whose precision equals $\tau_{\max}$, if one is present.
Algorithm 1 Search for new basis functions to activate
  • Sort the inactive basis functions $\{\phi_{Bk}\}_k$ by correlation with the residuals.
  • Filter through $\{\phi_{Bk}\}_k$, selecting the most correlated one as $\phi_{B\tilde{k}}$.
  • Copy the current active surrogate posterior $Q^{(a)}(\Theta)$ to $Q_{\tilde{k}}^{(a)}(\Theta)$.
  • Expand a dimension in $Q_{\tilde{k}}^{(a)}(\Theta)$ for $\phi_{B\tilde{k}}$.
  • Optimize $Q_{\tilde{k}}^{(a)}(\Theta)$ for 3 iterations using the mean field approximation.
  • if expected precision is within threshold then
  •      $Q^{(a)}(\Theta) \leftarrow Q_{\tilde{k}}^{(a)}(\Theta)$.
  • end if
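A schematic Python rendering of Algorithm 1 is given below; the correlation score follows the expression above, while `passes_angle_test`, `trial_optimize`, `expected_precision`, and the `expand` method of the surrogate are hypothetical stand-ins for routines described in the text and supplements.

```python
def activation_score(phi_Bk_evals, residuals):
    """Correlation of an inactive basis function with the residuals:
    sum_i ( Phi_iBk @ (Y_i - E[Z_iA W_A + Zbar_A] Phi_iA)^T )^2."""
    return sum(float(p @ r) ** 2 for p, r in zip(phi_Bk_evals, residuals))

def search_new_basis(q_active, candidates, residuals, passes_angle_test,
                     trial_optimize, expected_precision, tau_max):
    """Sketch of Algorithm 1; `candidates` maps a basis index k to its evaluations {Phi_iBk}_i."""
    # Sort inactive basis functions by correlation with the residuals, most correlated first.
    ranked = sorted(candidates,
                    key=lambda k: activation_score(candidates[k], residuals),
                    reverse=True)
    # Filter with the angle-based threshold and select the top remaining candidate.
    chosen = next((k for k in ranked if passes_angle_test(k, q_active)), None)
    if chosen is None:
        return q_active
    # Trial: copy the surrogate, expand one dimension at precision tau_max, optimize briefly.
    q_trial = q_active.expand(chosen, init_precision=tau_max)
    q_trial = trial_optimize(q_trial, iters=3)
    # Retain the expansion only if the expected precision drops below the numerical maximum.
    return q_trial if expected_precision(q_trial, chosen) < tau_max else q_active
```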

4. Faster Variant

To enhance the computational efficiency of our primary algorithm, we introduce a faster variant, denoted as BSFDA Fast . This approach leverages conditional independence among the columns of W, enabling separate updates and thereby reducing computational complexity. Similar strategies have been described in [34,62]. The model is defined with an introduced variable ζ i for the coefficient noise as follows:
$$\theta_i = Z_i W + \zeta_i,$$
$$\zeta_{ik} \sim \mathcal{N}\big(0,\ \varsigma_k^2\,\beta_k^{-1}\big).$$
Similar to before, we assign a conjugate Gamma prior to the precision:
$$\varsigma_k^{-2} \sim \Gamma(a_0, b_0).$$
This formulation ensures that the columns of $W$ are conditionally independent, allowing the variational distribution to factorize as $Q_W = \prod_k Q_{W_{\cdot k}}$, thereby facilitating separate updates for each column. Consequently, the time complexity is reduced from $O\big((K^{(a)})^6\big)$ to $O\big((K^{(a)})^3\big)$.
To align with the original model, it is necessary for ζ and the associated variance parameters ς to approach zero. Having ς too high would allow the coefficient noise to corrupt the signal, biasing the model toward underestimating the true signal levels, particularly because this noise operates in the coefficient space where it introduces smooth, correlated variation (low entropy, like signals) that is harder to eliminate than high-frequency white noise (maximum entropy). Injecting the same amount of noise leads to an unbiased estimation of the signals but increases the estimation variance. Conversely, as ς decreases, the columns of W become dependent, violating the independence assumption inherent in variational inference. This dependency degrades the approximation quality and slows down the optimization process. Such dependency issues are well documented in both variational inference and MCMC literature–with recent efforts addressing them via structured VI [72] or blocked/collapsed Gibbs sampling [74]. Empirical validations of this noise impact are conducted with both BSFDA Fast in Section 5 and with Bayesian PCA [45,62] in Section 12.2 in the supplements.
To balance the trade-off between optimization speed and accuracy, we adopt a strategy of gradually decreasing the values of $\varsigma_k$ during the optimization iterations. Specifically, we initialize $\varsigma_k$ with a relatively large value and linearly decrease it from $10^{-2}$ to $10^{-5}$ over the first half of the iterations. After reaching $10^{-5}$, $\varsigma_k$ is fixed for the remaining iterations. This gradual reduction lets the algorithm start efficiently, benefiting from the reduced interdependency among the columns of $W$ to accelerate convergence, while later preserving the quality of the approximation by preventing the noise from obscuring signal components. We unify the scales by scaling the basis functions so that $Z_i$ is standard normal and $W$ is an identity matrix at initialization. Empirical evaluations indicate that the strategy above is effective in most applications.
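A minimal sketch of this annealing schedule follows; whether "linearly" means linear in the value or in log-space is not specified above, so a linear ramp in the value is shown as one possible reading.

```python
def varsigma_schedule(t, total_iters, start=1e-2, end=1e-5):
    """Coefficient-noise scale at iteration t: ramp from `start` down to `end`
    over the first half of the iterations, then hold fixed."""
    half = total_iters // 2
    if t >= half:
        return end
    return start + (t / half) * (end - start)
```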
By implementing these modifications, BSFDA Fast offers a practical solution that substantially accelerates the algorithm without significant loss in accuracy, making it well-suited for large-scale, high-dimensional functional data analysis.

5. Results

The proposed method proves its effectiveness through simulations and applications to observed data sets.

5.1. Simulation Results

In simulations, we evaluate functional data analysis performance in model selection, estimated covariance accuracy, and extendability to multi-dimensional domains.
The model selection metric is the accuracy in estimating the number of principal components, which is the dimension of the compact subspace of signal variations. The configuration of the simulations in this section aligns with that established in [37], covering various scenarios. Simulated data sets derive from a latent generative model with variables $Z_i$ of dimension $r$ for the $i$-th sample function and noise corruption with a standard deviation of $\sigma$: $Y_i = \sum_{j=1}^{r} Z_{ij} f_j(X_i) + g(X_i) + E_i$, $Z_{ij} \sim \mathcal{N}(0, v_j)$, $E_i \sim \mathcal{N}(0, \sigma^2 I)$, where $\{f_j\}_{j=1}^{r}$ represent the eigenfunctions, $\{v_j\}_{j=1}^{r}$ are the eigenvalues, and $g : \mathbb{R} \to \mathbb{R}$ signifies the mean function. Here, we consider five scenarios.
Scenario 1: Data generated with $g(x) = 5(x - 0.6)^2$, $r = 3$, $v = (0.6, 0.3, 0.1)$, $\sigma^2 = 0.2$, $f_1(x) = 1$, $f_2(x) = \sqrt{2}\sin(2\pi x)$, $f_3(x) = \sqrt{2}\cos(2\pi x)$. Here $v_3 < \sigma^2$, i.e., the noise has a larger variance than the smallest signal.
Scenario 2: Similar to Scenario 1, but the third eigenfunction is replaced by a function with higher frequencies, $f_3(x) = \sqrt{2}\cos(4\pi x)$, and the principal component scores follow a skewed Gaussian mixture model. Specifically, the $j$-th component score has 1/3 probability of following a $\mathcal{N}(2\sqrt{v_j}/3,\ v_j/3)$ distribution, and 2/3 probability of following $\mathcal{N}(-\sqrt{v_j}/3,\ v_j)$, for $j = 1, 2, 3$.
Scenario 3: Data generated with $g(x) = 12.5(x - 0.5)^2 - 1.25$, $r = 3$, $v = (4, 2, 1)$, $\sigma^2 = 0.5$, $f_1(x) = 1$, $f_2(x) = \sqrt{2}\cos(2\pi x)$, $f_3(x) = \sqrt{2}\sin(4\pi x)$.
Scenario 4: Same as Scenario 3, but the component scores are generated from a Gaussian mixture model as in Scenario 2.
Scenario 5: Data from $g(x) = 12.5(x - 0.5)^2 - 1.25$, $r = 6$, $v = (4, 3.5, 3, 2.5, 2, 1.5)$, $\sigma^2 = 0.5$, $f_1(x) = 1$, $f_{2k}(x) = \sqrt{2}\sin(2k\pi x)$ for $k = 1, 2, 3$, $f_{2k+1}(x) = \sqrt{2}\cos(2k\pi x)$ for $k = 1, 2$, with the $j$-th component score following $\mathcal{N}(0, v_j)$.
In each scenario, simulations produced 200 sample functions. We investigated three cases with sparse, medium, and dense sampling by assigning the number of observations per sample function $N_i = \{5, 10, 50\}$. Each case in each scenario is repeated 200 times. The method's performance was compared to fpca from [22], AIC and BIC in the 2022 release of pace [20], modified AIC and BIC in [37], and all the competing methods in [37]. For fpca, we set the candidate numbers of basis functions as [8,10,15,20], and the candidate dimensions of the process as [2,3,4,5] for Scenarios 1-4 and [4,5,6,7,8] for Scenario 5. The other parameters are all set to the defaults. Due to its consistent overestimation of the true number of components–likely resulting from interference by correlated noise and less sparse precision priors–we excluded LFRM [34] from further comparisons (see Section 12.1.1 in the supplements).
Each estimation used ten length-scales for the basis functions, selected using cross-validation and k-means clustering. This adaptive strategy allows the algorithm to choose distinct length-scales at different locations of the domain, thereby accommodating the varying smoothness characteristics inherent in complex functional data–a flexibility that is not possible when using a regular grid that forces a single length-scale across the entire domain [34]. Sparse sampling in Scenario 5 used five length-scales to avoid over-fitting. Figure 3 shows the length-scales and centers of the selected kernel basis functions for three different numbers of sample points, $N_i$, in a random repetition of Scenario 5. The results reveal that the selected length-scales mainly concentrate around 0.07, with a few as high as 0.35–suggesting that the lower length-scales capture finer, high-frequency variations, while the higher length-scales model the overall, lower-frequency quadratic mean structure and the constant baseline component. Furthermore, the estimated density functions of the selected length-scales exhibit consistent patterns across the three sampling densities, and the method selects 9, 11, and 12 basis functions, respectively, demonstrating the algorithm's adaptive fidelity and complexity based on the available observations. The supplements showcase the uncertainty evaluation in Figure 12.
Table 1, Table 2, Table 3, Table 4 and Table 5 show the results. Results for the first five methods are from [37]. Out of 15 cases, the proposed BSFDA exhibits the highest accuracy in 12. In the other 3 cases, the accuracy of BSFDA is comparable to the best result and is always above 0.950. BSFDA Fast demonstrates performance comparable to BSFDA when applied to medium-density and dense datasets with significantly higher efficiency which we detail in Figure 5 later. However, its efficacy diminishes with sparse data. This limitation arises because the parameter ς can bias model estimation in scenarios with insufficient data evidence, leading to an underestimation of signal variance. Consequently, BSFDA Fast tends to underestimate the number of components, particularly those capturing nuanced variations, in the presence of sparse observations. Nonetheless, with adequate data, BSFDA Fast achieves performance on par with the original model.

5.1.1. Mean Squared Error in Covariance Operator

The mean squared error across $X_{\mathrm{grid}}$, a grid of 1000 index points,
$$\frac{\big\|\mathrm{cov}(X_{\mathrm{grid}}, X_{\mathrm{grid}}) - \widehat{\mathrm{cov}}(X_{\mathrm{grid}}, X_{\mathrm{grid}})\big\|_F^2}{1000 \times 1000},$$
where $\|\cdot\|_F$ is the Frobenius norm, measures the accuracy of the estimated covariance. The quadratic measure of error with the Frobenius norm for covariance estimators has been used by [75]. Methods compared include fpca of [22], pace of [20] with AIC and BIC, and refund-sc of [21]. Only cases in Scenario 5 were used because of time constraints (e.g., refund-sc takes 6 hours for 20 repetitions with 50 points in Scenario 5). As the most challenging, Scenario 5 should provide the most compelling comparison. The results in Table 6 demonstrate that the proposed method is comparable to the best work in terms of estimated covariance accuracy. Specifically, dense sampling becomes prohibitive for refund-sc. The results highlight the benefit of continuous formulations, as seen in both fpca and the proposed method, over the grid-based optimization in conventional methods. BSFDA Fast again performed comparably well given adequate data.
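The metric reduces to a few lines of numpy; the sketch below assumes `true_cov` and `est_cov` are the two covariance matrices evaluated on the shared 1000-point grid (grid construction omitted).

```python
import numpy as np

def covariance_mse(true_cov, est_cov):
    """Mean squared error between true and estimated covariance operators on a shared grid:
    squared Frobenius norm of the difference divided by the number of grid entries."""
    diff = np.asarray(true_cov) - np.asarray(est_cov)
    return np.sum(diff**2) / diff.size   # = ||.||_F^2 / (1000 * 1000) for a 1000 x 1000 grid
```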

5.1.2. Multidimensional Functional Data Simulation

A simulation experiment with a four-dimensional index set reveals the proposed method's advantages for high-dimensional data, where the gridding strategies of previous methods are impractical. The settings are as follows, with a length-scale $l_s = 0.33$:
$$Z_i \sim \mathcal{N}(0, I) \in \mathbb{R}^{1 \times 3}$$
$$\phi_0(x) = (\pi l_s^2)^{-2}\exp\!\left(-\frac{1}{2}\left\|\frac{x - [0.5, 0.5, 0.5, 0.5]}{l_s}\right\|_2^2\right)$$
$$\phi_1(x) = (\pi l_s^2)^{-2}\exp\!\left(-\frac{1}{2}\left\|\frac{x - [0.4, 0.4, 0.4, 0.4]}{l_s}\right\|_2^2\right)$$
$$\phi_2(x) = (\pi l_s^2)^{-2}\exp\!\left(-\frac{1}{2}\left\|\frac{x - [0.6, 0.6, 0.6, 0.6]}{l_s}\right\|_2^2\right)$$
$$y_i(x) = Z_{i0}\cdot 0.6\cdot\big(\phi_0(x) - \phi_1(x)\big) + Z_{i1}\cdot 0.3\cdot\phi_1(x) + Z_{i2}\cdot 0.4\cdot\phi_2(x)$$
The observations include additive noise with a standard deviation of 4.472e-01. Cross-validation selects a length-scale of 0.405. The estimated noise standard deviation is 4.637e-01. The proposed method correctly estimates the number of principal components as 3 and selects 31 basis functions. As shown in Figure 4, the eigenfunctions are correctly estimated. In addition, the estimated mean function is zero, which is accurate.
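For reference, the following sketch generates data under these settings; the difference reading of $\phi_0 - \phi_1$ above, the uniform sampling of index points, and the number of points per function are our assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(1)
ls = 0.33
centers = [np.full(4, 0.5), np.full(4, 0.4), np.full(4, 0.6)]

def phi(x, c):
    """4-D Gaussian kernel with center c and length-scale ls (normalization as in the text)."""
    return (np.pi * ls**2) ** -2 * np.exp(-0.5 * np.sum(((x - c) / ls) ** 2, axis=-1))

def sample_function(n_points=200, noise_sd=0.4472):
    """Draw one noisy sample function on random 4-D index points."""
    Z = rng.normal(size=3)
    X = rng.uniform(0.0, 1.0, size=(n_points, 4))
    y = (Z[0] * 0.6 * (phi(X, centers[0]) - phi(X, centers[1]))
         + Z[1] * 0.3 * phi(X, centers[1])
         + Z[2] * 0.4 * phi(X, centers[2]))
    return X, y + noise_sd * rng.normal(size=n_points)
```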
Next, we present a convergence comparison between BSFDA and BSFDA Fast under four different schedules for the coefficient noise $\varsigma_k$. Specifically, we compare the default diminishing schedule from $10^{-2}$ to $10^{-5}$ with three fixed settings: $10^{-2}$, $10^{-3}$, and $10^{-5}$. We evaluate the covariance error and the discrepancy between the estimated and true dimensionality in one replicate of each sampling density in Scenario 5, and in the 4D simulation. For the 4D case, we adopt a default initial $\varsigma_k$ of $10^{-3}$. As illustrated in Figure 5, BSFDA Fast achieves comparable accuracy to BSFDA while converging significantly faster in terms of both covariance errors and component estimation for medium and densely sampled data. In the 4D case, BSFDA Fast converges in covariance estimation after approximately 10,000 seconds and in dimensionality after around 4,000 seconds, compared to roughly 100,000 seconds and 13,000 seconds, respectively, for BSFDA. However, for sparse data, BSFDA Fast exhibits reduced estimation accuracy and underestimates the number of components by one. A similar decline in accuracy is observed in the 4D simulation when data sparsity is high. This limitation arises because the introduction of the coefficient noise $\varsigma$ biases the model towards eliminating signals that are deemed insignificant. Moreover, when comparing the three fixed-$\varsigma_k$ variants of the fast algorithm, a clear trade-off emerges: smaller $\varsigma_k$ reduces the overall error but slows down the optimization due to increased dependency among variables. These results collectively demonstrate the effectiveness of our chosen $\varsigma_k$ schedule in BSFDA Fast, as it balances both efficiency and accuracy.
Figure 5. Convergence plots for Scenario 5 of [37] and the 4D simulation. The upper row displays the covariance error against time, while the lower row illustrates the difference between the estimated and true number of components.

5.2. Results on Public Data Sets

The proposed method’s practicality was validated with 2 application data sets, CD4 and wind speed measurements.

5.2.1. CD4

CD4 data, a classical form of functional data, received attention in [1,20,22]. CD4 cell counts gauge the immune system's response to human immunodeficiency virus (HIV) infection, which leads to a progressive reduction in CD4 cell counts. The Multicenter AIDS Cohort Study (MACS) [76] provided the CD4 data. This dataset consists of CD4 percentages from 283 male subjects who were HIV-positive, each with 1 to 14 repeated measurements over time in years. Subjects were scheduled for reevaluation at least semiannually; however, missed visits caused a sparse and uneven distribution of measurements. The proposed method used five length-scales selected from cross-validation and k-means clustering. Finally, the model selected 9 basis functions. Figure 6 displays the estimated mean function, eigenfunctions, and curves of the observations. The mean function reflects the overall decreasing tendency with the progression of the disease. The eigenfunctions are obtained by applying singular value decomposition to the covariance operator, which is discretized (for visualization purposes only) with a grid of 50 evenly spaced points over the whole timeline. The first eigenfunction is relatively flat and mainly captures the subject-specific average magnitude of the CD4 counts, consistent with the findings of [1,20,22]. The second eigenfunction captures the simple linear trend of the variations, as described in [22]. The third eigenfunction captures a piece-wise linear time trend with a breakpoint near 2.5 years since baseline. [1,20] found similar eigenfunctions.

5.2.2. Wind Speed

Wind-speed data were collected from 110 locations across Utah's Salt Lake Valley, with between 11 and 1440 measurements per location. The proposed method leverages ten length-scales selected from cross-validation and k-means clustering. Figure 7 illustrates the estimated mean function, curves of the observations, eigenfunctions, and covariance. The horizontal axis represents seconds starting from 12:00 AM Greenwich Mean Time (GMT) on June 15, 2023, which corresponds to 6:00 PM in Salt Lake City. In Figure 7a, the estimated mean function depicts two pronounced peaks observed approximately at 8:00 PM and 6:00 AM, as well as two troughs around 12:00 AM and 12:00 PM. This pattern aligns with the diurnal cycle, particularly highlighting the thermal activities associated with sunset and sunrise. The peaks during sunset and sunrise are due to the interplay of topographical features, which result in specific breezes, such as the land breeze near the Great Salt Lake and the distinct mountain and valley breezes. The troughs, on the other hand, reflect moments when the atmosphere is at its most stable, with minimal thermal activity disrupting wind patterns. The complexity of the data is distilled and represented using 12 descriptors with 17 basis functions. As Figure 7b shows, the primary eigenfunction is relatively level, indicating that the most significant variation is the location-specific average magnitude. Its profile echoes the influence of sunrise and sunset observed in the mean function, with elevations around 7:00 PM and 5:00 AM and subdued patterns during other times, indicative of a similar atmospheric stability. The estimated covariance in Figure 7c highlights variance peaks around 8 PM and 5 AM, as well as a strong correlation between these periods. This underscores the effects of location-specific topographic factors on wind speed.

5.2.3. Modeling Large-Scale, Dynamic, Geospatial Data

Here, we demonstrate the scalability of our framework with respect to both the number of measurements and the domain dimensionality. For this, we apply it to the ARGO dataset, which consists of ocean temperature measurements from more than 4,000 locations, at multiple depths and time points [71]. ARGO is a nearly global observing system for ocean temperature, salinity, and other key variables via autonomous profiling floats. As of 2019, ARGO had generated over 338 gigabytes of data from 15,231 floats [71]. We focused on high-quality ("research" mode option in the database API) data from 1998 to 2024 for depths between 0–200 meters in the open-access snapshot of the Argo GDAC of November 9, 2024 [77]. The number of measurement points per year varies widely–from 38,931 up to over 11 million, with 127 million in total. Figure 8 illustrates a global map of sea surface temperature measurements from February 2021, highlighting the dataset's extensive spatial coverage.
In our modeling, each year's data is treated as a single underlying function of four variables: latitude, longitude (on the spherical Earth), depth, and intra-annual time (modeled as a periodic variable). Note that the spatial data lie on a sphere and the time is a circle, assuming periodicity over the time of year. Our approach models these measurements holistically–without resorting to moving windows or sub-modeling–thereby preserving the continuous nature of the data and enabling the extraction of meaningful global, seasonal, and depth-dependent trends. Furthermore, the unique geospatial and temporal structure of Argo data, with spatial coordinates on a sphere and time exhibiting periodicity, necessitates specialized modeling techniques. Given that our model is 4-dimensional, the 4D kernel is defined as a product of the following kernels, following the design strategy for climatological data in [19]. The geospatial kernel on the sphere is a radial basis function (RBF) on geodesic distances. To ensure periodicity, the temporal kernel is an Exp-Sine-Squared kernel, $k(x, x') = \exp\!\left(-\frac{2\sin^2(\pi|x - x'|)}{l_s^2}\right)$, where $l_s$ is the length-scale. For depth, we use a Gaussian kernel.
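A plain numpy sketch of this product kernel is shown below; the haversine formula for the geodesic distance and the exact parametrization are our assumptions, with the length-scale values taken from those quoted in the next paragraph.

```python
import numpy as np

R_EARTH_KM = 6371.0

def geodesic_rbf(lat1, lon1, lat2, lon2, ls_km=2000.0):
    """RBF kernel on the great-circle (haversine) distance between two points on the sphere."""
    phi1, lam1, phi2, lam2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((phi2 - phi1) / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2
    d = 2.0 * R_EARTH_KM * np.arcsin(np.sqrt(a))
    return np.exp(-0.5 * (d / ls_km) ** 2)

def periodic_time(t1, t2, ls=3.0, period=1.0):
    """Exp-Sine-Squared kernel for intra-annual time (in years), periodic with period 1."""
    return np.exp(-2.0 * np.sin(np.pi * abs(t1 - t2) / period) ** 2 / ls**2)

def depth_rbf(d1, d2, ls_m=70.0):
    """Gaussian kernel on depth (in meters)."""
    return np.exp(-0.5 * ((d1 - d2) / ls_m) ** 2)

def argo_kernel(x1, x2):
    """Product kernel over (lat, lon, depth, time-of-year) index points."""
    return (geodesic_rbf(x1[0], x1[1], x2[0], x2[1])
            * depth_rbf(x1[2], x2[2])
            * periodic_time(x1[3], x2[3]))
```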
The numeric data (excluding metadata) as input to the model was approximately 4 GB. For length-scale selection, we used Gaussian process regression on a small subset of 2,000 randomly selected data in 2016 (medium size of measurements) for a cross validated RMSE which we optimize with a grid search. The specific length scales were set as follows: geodesic length scale of 2 × 10 3  km, depth length scale of 70 m, time length scale of 3, and periodicity of 1. For evaluation, we held out 10% of the depth profiles (a single round trip of a buoy from surface to a depth at the same coordinate) from each year as testing data, following [78]. The total training set contained roughly 114 million points. Because the sample spacing is typically small relative to the selected length-scales, we apply agglomerative clustering to 10,000 randomly chosen index points, reducing them to 2,000 candidate basis functions. These candidate basis functions–precomputed for efficiency–took roughly 1.7 TB of memory. Computations were performed with 24 threads on a server equipped with 192 Intel® Xeon® Platinum 8360H CPUs @ 3.00GHz and 3TB RAM. Initialization was conducted using the modified RVM for 200 iterations for initial basis functions, using a stochastic optimization with a 1,000-batch size per year. Then BSFDA Fast executed for 10,000 iterations, where the heuristic to include new bases also used a 1,000-batch size per year. With these computational strategies and heuristics, the entire modeling process was completed in 15 hours.
The proposed approach selected 163 effective basis functions and condensed them into 16 principal components. The final model occupies merely 50 MB of storage. The interpolation yielded a root mean square error (RMSE) of 1.95 and an $R^2$ of 94.2% on testing data, reflecting a reasonable balance between global dimension reduction and fidelity. The estimated white noise level was also 1.95, indicating that the training data adequately cover the underlying variability in the ARGO observations and the final model is reasonably generalizable.
Figure 9 presents 2D visualizations of geospatial interpolations at three depths (in decibars, roughly meters) and a specific time (May 29, 2021) around 1°S and 30°W, each with three views. We have picked one measurement as the central point, denoted as the red circle, and selected a narrow window (±1 decibar, ±1 day) around this center. The cyan and fuchsia circles represent training and testing data, respectively, within this window. Their sizes indicate distance along the unplotted dimensions (depth and time here), reflecting variations in these dimensions. The visualizations show that temperatures are warmer near the equator and decrease with depth. The match between interpolated values and actual measurements demonstrates consistency in capturing broad spatial and vertical variations.
Figure 10 complements this by illustrating interpolation in depth-time slices while holding the geospatial coordinates fixed, focusing on mixed layer characteristics. The "mixed layer" refers to a region of nearly uniform temperature, which is crucial for understanding thermodynamic potential and nutrient cycling [79]. Here, the plot uses a window of 50 km to include actual measurements, and the circle sizes denote geodesic distance from the chosen center. We plot every fifth measurement vertically to reduce overlap and improve clarity. Figure 10a uses the same center point, 1°S and 30°W, as in Figure 9, exhibiting a shallow mixed layer with pronounced vertical gradients. In contrast, Figure 10b adopts a center at a higher latitude, 49°N and 29°W, where the model reveals a deeper mixed layer. The temperature there remains relatively stable below the surface. The dominant variations are cyclic seasonal changes, which are warmer near the surface around September. As shown, the vertical sequence of the center and the nearby testing sequence match the interpolation closely. These results confirm that the mid-latitudes exhibit a stronger seasonal cycle [79], and that BSFDA Fast accurately approximates the actual measurements.
To our knowledge, this is the first time the ARGO dataset has been modeled in a full 4-dimensional principal component model with the correct domain topology. We incorporate the entire period of 27 years rather than shorter spans (e.g., 2004-2008 or 2007-2016) [78,80,81]. Instead of segmenting the dataset into localized spatiotemporal windows, we process the entire 4D domain (latitude, longitude, depth, and intra-annual time) in a single holistic framework. Previous studies were typically tailored for ARGO datasets and handled each depth, month, or spatial region separately, restricting correlation estimates to limited windows (e.g., 1000 km and three months) while excluding data with large offsets [78,81]. In addition, they require repeated on-demand model fitting that can hinder scalability. Our kernel-based framework, by contrast, is broadly applicable to general functional data, only requiring kernel definitions for the domain. Although global dimension-reduction inevitably introduces some residual noise, the kernel-based design is extensible to finer spacing or multiple length scales if higher precision is needed. Furthermore, inference with our model is simply the evaluation of the 163 active basis functions weighted by the 16 principal components. Interpolation over a 300×300 grid takes only about two seconds. By contrast, previous methods with Gaussian process regression require a weighted sum of all the measured data within a certain window. The parametric representation also facilitates straightforward derivative and integration calculations, which are essential for investigating ocean temperature stratification and heat content [78]. In summary, the Argo dataset provides an ideal testbed for our method, as it captures the dynamic behavior of high-dimensional geospatial data in a continuous framework. A more comprehensive study of ARGO is beyond this paper's scope. Nonetheless, the results here confirm the clear advantages of the proposed method for large-scale, high-dimensional functional data.

6. Discussion

This paper proposed BSFDA, a novel framework for functional data analysis with irregular sampling, integrating model selection and scalability in one unique, coherent, and effective algorithm. Our extensive empirical studies, including both simulations and real-world applications, show that BSFDA offers superior covariance estimation accuracy with remarkable efficiency.
In terms of accuracy, our method excels in model selection, consistently achieving top-tier performance. The accuracy of the covariance operator estimation also rivals that of the best existing methodologies in the field. This shows that our approach not only handles large and complex datasets but also ensures high accuracy and precision in the results it produces. Our method's superiority over existing techniques is expected owing to the inherently joint, iterative nature of data smoothing and covariance estimation in our approach.
In terms of scalability, our method demonstrates a linear growth of time complexity with the size of the dataset, and, impressively, the computations are executed in a small, $K^{(a)}$-dimensional subspace. This ensures that as datasets grow larger and more complex, the performance of our model remains robust and efficient. Additionally, we introduced a faster variant, BSFDA Fast, which performs similarly to BSFDA on medium and dense datasets with significantly reduced computational cost. This leap in efficiency enabled a full 4-dimensional functional modeling, for the first time, of the large-scale oceanic temperature dataset spanning 27 years (ARGO) [71]. Although BSFDA Fast can underestimate signal strength under very sparse sampling, the vanilla BSFDA effectively complements and alleviates this issue.
Looking ahead, it would be interesting to explore how extensions of regular PCA, such as simplified PCA and robust PCA [42], can be integrated within our proposed framework. These extensions will enhance the flexibility and robustness of our method, further improving its adaptability to various data conditions. In addition, we see potential in examining the extensions of functional PCA, such as time warping, dynamics, and manifold learning [1]. In particular, shape analysis emerges as a direct application of time warping. Such extensions would push the boundaries of what our proposed method could achieve, potentially enabling it to handle an even wider array of data structures and complexities.
In conclusion, our research findings affirm the proposed framework’s effectiveness and adaptability in advanced functional data analysis. Nonetheless, the method’s potential remains broad, and future work promises to widen its scope and refine its performance. By unifying sparse Bayesian learning, kernel-based expansions, and efficient variational inference, BSFDA offers a powerful foundation for large-scale, high-dimensional FDA challenges.

7. System of Notation

Table 7 summarizes the notation used in Section 1.2 and Section 3.2.3, providing a reference for the derivations. All vectors in the table are represented as row vectors.
Table 7. Symbol definitions in formulation.
Symbol  Meaning
$y_i$  The $i$-th sample function
$x \in \mathbb{R}^{M}$  One $M$-dimensional index
$M$  Dimension of the index set
$K$  Number of all basis functions
$J$  Number of all components
$P$  Number of sample functions
$N_i$  Number of measurements of the $i$-th sample function
$X_i \in \mathbb{R}^{N_i \times M}$  Index set of the $i$-th sample function
$Y_i \in \mathbb{R}^{N_i}$  Measurements of the $i$-th sample function
$Z_i \in \mathbb{R}^{J}$  Component scores of the $i$-th sample function
$\bar{Z} \in \mathbb{R}^{K}$  Coefficients of basis functions in the mean function
$E_i \in \mathbb{R}^{N_i}$  Measurement errors of the $i$-th sample function
$W \in \mathbb{R}^{J \times K}$  Weighting matrix of basis functions in the eigenfunctions
$W_{j\cdot} \in \mathbb{R}^{K}$, $W_{\cdot k}^{T} \in \mathbb{R}^{J}$  The $j$-th row and the $k$-th column of $W$
$\mathcal{K}$  The kernel function
$\alpha_j$  The scale parameter of $W_{j\cdot}$ (the $j$-th component)
$\beta_k$  The scale parameter of $W_{\cdot k}$ (the $k$-th basis function)
$\sigma$  The standard deviation of the measurement errors
$\eta$  The communal scale parameter of $\bar{Z}$
$\{\phi_k: \mathbb{R}^{M} \to \mathbb{R}\}_{k=1}^{K}$  The union of all the centered kernel functions
$\Phi_{ikj} = \phi_k(X_{ij\cdot}) \in \mathbb{R}$  Value of the centered kernel function $\phi_k$ at $X_{ij\cdot}$
$\theta_i \in \mathbb{R}^{K}$  Coefficients of the $i$-th sample function
$\zeta_i \in \mathbb{R}^{K}$  Coefficient noise of the $i$-th sample function
$\varsigma_k$  The scale parameter of the $k$-th coefficient noise
Table 8 summarizes the notation used in Section 2.2.
Table 8. Notation used in formulating the optimization.
Symbol  Meaning
$\Theta$  All the latent variables
$Q_{\cdot}$  The surrogate posterior distribution of variable $\cdot$
$Q_{/\cdot}$  The joint surrogate posterior distribution of all variables except $\cdot$
$\mu_{\cdot}$, $\Sigma_{\cdot}$  The mean and covariance of $\cdot$ in $Q$, e.g., $\mu_{\mathrm{vec}(W)} \in \mathbb{R}^{JK}$, $\Sigma_{\mathrm{vec}(W)} \in \mathbb{R}^{JK \times JK}$
$a_{\cdot}$, $b_{\cdot}$  The shape and rate parameters of $Q_{\cdot}$, e.g., $a_{\beta_k}$, $b_{\beta_k}$
$\mathbb{E}_{Q}[\cdot]$  The expectation of variable $\cdot$ over the density $Q$
$\mathcal{L}$  The lower bound of the surrogate posterior $Q$ with $K$ basis functions
$\Psi_i$  Gram matrix of the kernel functions for the $i$-th sample function, $\Phi_i \Phi_i^{T}$
$K^{(a)}$, $K^{(e)}$  Number of active/effective basis functions
$J^{(a)}$, $J^{(e)}$  Number of active/effective components
$\mathcal{P}_i$  Log likelihood of $Y_i$ in the multi-sample relevance vector machine
$C_i$  Covariance of $Y_i$ in the multi-sample relevance vector machine
$S_i$  Posterior covariance of $Z_i$ in the multi-sample relevance vector machine
$\mathcal{P}_{Z_i}$  Log likelihood of $(Y_i, Z_i)$ in the multi-sample relevance vector machine
$\epsilon$  The infinitesimal number ($\epsilon \to 0$)
$\tau_{\cdot}$  Threshold/tolerance of $\cdot$

8. Variational Update Formulae

As defined in Section 1.2, we consider the following priors and conditional distributions:
\[
\Pr[Y \mid Z, W, \bar{Z}, \sigma] = \prod_i \mathcal{N}\big(Y_i \mid (Z_i W + \bar{Z})\Phi_i,\ \sigma^2 I\big)
\]
\[
\Pr[Z] = \prod_i \mathcal{N}(Z_i \mid 0, I)
\]
\[
\Pr[W \mid \alpha, \beta] = \prod_{j,k} \mathcal{N}\big(W_{jk} \mid 0,\ \alpha_j^{-1}\beta_k^{-1}\big)
\]
\[
\Pr[\bar{Z}] = \prod_k \mathcal{N}\big(\bar{Z}_k \mid 0,\ \eta^{-1}\beta_k^{-1}\big)
\]
\[
\Pr[\sigma]\Pr[\alpha]\Pr[\beta]\Pr[\eta] = \Gamma(\sigma^{-2} \mid a_0, b_0)\ \prod_{j=1}^{J}\Gamma(\alpha_j \mid a_0, b_0)\ \prod_{k=1}^{K}\Gamma(\beta_k \mid a_0, b_0)\ \Gamma(\eta \mid a_0, b_0)
\]
For brevity, the joint posterior is shown with the vague Gamma prior parameters a 0 , b 0 , and the observation index X omitted:
\[
\Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta \mid X, Y, a_0, b_0] = \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta \mid Y] = \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]\,\big(\Pr[Y]\big)^{-1}
\]
\[
\Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y] = \Pr[Y \mid Z, W, \bar{Z}, \sigma]\ \Pr[Z]\ \Pr[W \mid \alpha, \beta]\ \Pr[\bar{Z} \mid \eta, \beta]\ \Pr[\sigma]\Pr[\alpha]\Pr[\beta]\Pr[\eta]
\]
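For intuition, the following minimal sketch simulates data from the generative model above; the Gaussian bumps used for the basis functions $\phi_k$ and all hyperparameter values are illustrative assumptions, not the kernels or settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
P, J, K, sigma = 30, 3, 20, 0.3                # functions, components, bases, noise sd
centers = np.linspace(0.0, 1.0, K)
phi = lambda x: np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2 / 0.05 ** 2)  # (N, K)

alpha = rng.gamma(2.0, 1.0, size=J)            # per-component precisions
beta = rng.gamma(2.0, 1.0, size=K)             # per-basis precisions
eta = 1.0
W = rng.standard_normal((J, K)) / np.sqrt(np.outer(alpha, beta))  # W_jk ~ N(0, 1/(alpha_j beta_k))
z_bar = rng.standard_normal(K) / np.sqrt(eta * beta)              # Zbar_k ~ N(0, 1/(eta beta_k))

X, Y = [], []
for i in range(P):
    N_i = int(rng.integers(5, 15))             # irregular per-function sample size
    x_i = np.sort(rng.uniform(size=N_i))
    z_i = rng.standard_normal(J)               # Z_i ~ N(0, I)
    y_i = (z_i @ W + z_bar) @ phi(x_i).T + sigma * rng.standard_normal(N_i)
    X.append(x_i)
    Y.append(y_i)
```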
Derivation of Equations (14) and (15):
According to Equation (13) and the posterior in Equation (50), the update formulae for the surrogate distribution Q α j is:
\begin{align*}
Q_{\alpha_j} &\propto \exp\big(\mathbb{E}_{Q/\alpha_j}[\ln \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]]\big) \\
&\propto \exp\big(\mathbb{E}_{Q/\alpha_j}[\ln(\Pr[W \mid \alpha, \beta]\,\Pr[\alpha])]\big) \\
&\propto \exp\Big(\mathbb{E}_{Q/\alpha_j}\Big[\tfrac{1}{2}\sum_{k=1}^{K}\big(\ln\alpha_j - W_{jk}^2\alpha_j\beta_k\big) + (a_0-1)\ln\alpha_j - b_0\alpha_j\Big]\Big) \\
&\propto \exp\Big(\big(\tfrac{K}{2}+a_0-1\big)\ln\alpha_j - \alpha_j\Big(\tfrac{1}{2}\sum_{k=1}^{K}\mathbb{E}_{Q/\alpha_j}\big[W_{jk}^2\beta_k\big] + b_0\Big)\Big)
\end{align*}
where we have omitted terms that $\alpha_j$ is conditionally independent of. By definition,
\[
Q_{\alpha_j} = \Gamma(\alpha_j \mid a_{\alpha_j}, b_{\alpha_j}) \propto \exp\big((a_{\alpha_j}-1)\ln\alpha_j - b_{\alpha_j}\alpha_j\big).
\]
By equating Equations (51) and (52), the updates for $Q_{\alpha_j}$ are
\[
a_{\alpha_j} \leftarrow \frac{K}{2} + a_0, \qquad
b_{\alpha_j} \leftarrow \frac{1}{2}\sum_{k=1}^{K}\mathbb{E}_{Q/\alpha_j}\big[W_{jk}^2\beta_k\big] + b_0.
\]
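As an illustration, a direct transcription of these two updates might look as follows; the array layout (per-entry variances of $W$ collected in `var_W`) is an assumption made only to keep the sketch short.

```python
import numpy as np

def update_q_alpha_j(j, a0, b0, mu_W, var_W, a_beta, b_beta):
    """Gamma update for Q(alpha_j) following the two formulae above.

    mu_W:  (J, K) posterior means of W.
    var_W: (J, K) posterior variances of the individual W_jk entries.
    a_beta, b_beta: (K,) Gamma parameters of Q(beta_k), so E[beta_k] = a_beta/b_beta.
    Under the mean-field factorization, E[W_jk^2 beta_k] = E[W_jk^2] E[beta_k]."""
    K = mu_W.shape[1]
    E_W2 = var_W[j] + mu_W[j] ** 2          # E[W_jk^2] = Var(W_jk) + E[W_jk]^2
    E_beta = a_beta / b_beta
    a_j = a0 + K / 2.0
    b_j = b0 + 0.5 * np.sum(E_W2 * E_beta)
    return a_j, b_j                          # E[alpha_j] = a_j / b_j
```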
Derivation of Equations (16) and (17):
According to Equation (13) and the posterior Equation (50), the update formulae for Q η is:
\begin{align*}
Q_{\eta} &\propto \exp\big(\mathbb{E}_{Q/\eta}[\ln \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]]\big) \\
&\propto \exp\big(\mathbb{E}_{Q/\eta}[\ln(\Pr[\bar{Z} \mid \eta, \beta]\,\Pr[\eta])]\big) \\
&\propto \exp\Big(\mathbb{E}_{Q/\eta}\Big[\tfrac{1}{2}\sum_{k=1}^{K}\big(\ln\eta - \bar{Z}_k^2\eta\beta_k\big) + (a_0-1)\ln\eta - b_0\eta\Big]\Big) \\
&\propto \exp\Big(\big(\tfrac{K}{2}+a_0-1\big)\ln\eta - \eta\Big(\tfrac{1}{2}\sum_{k=1}^{K}\mathbb{E}_{Q/\eta}\big[\bar{Z}_k^2\beta_k\big] + b_0\Big)\Big)
\end{align*}
where we have omitted terms that $\eta$ is conditionally independent of. By definition,
\[
Q_{\eta} = \Gamma(\eta \mid a_{\eta}, b_{\eta}) \propto \exp\big((a_{\eta}-1)\ln\eta - b_{\eta}\eta\big).
\]
By equating Equations (55) and (56), the updates for $Q_{\eta}$ are
\[
a_{\eta} \leftarrow \frac{K}{2} + a_0, \qquad
b_{\eta} \leftarrow \frac{1}{2}\sum_{k=1}^{K}\mathbb{E}_{Q/\eta}\big[\bar{Z}_k^2\beta_k\big] + b_0.
\]
Derivation of Equations (18) and (19):
According to Equation (13) and the posterior Equation (50), the update formulae for Q β k is:
\begin{align*}
Q_{\beta_k} &\propto \exp\big(\mathbb{E}_{Q/\beta_k}[\ln \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]]\big) \\
&\propto \exp\big(\mathbb{E}_{Q/\beta_k}[\ln(\Pr[W_{\cdot k} \mid \alpha, \beta_k]\,\Pr[\bar{Z}_k \mid \eta, \beta_k]\,\Pr[\beta])]\big) \\
&\propto \exp\Big(\big(\tfrac{J+1}{2}+a_0-1\big)\ln\beta_k - \beta_k\Big(\tfrac{1}{2}\Big(\mathbb{E}_{Q/\beta_k}\big[\bar{Z}_k^2\eta\big] + \sum_{j=1}^{J}\mathbb{E}_{Q/\beta_k}\big[W_{jk}^2\alpha_j\big]\Big) + b_0\Big)\Big)
\end{align*}
where we have omitted terms that $\beta_k$ is conditionally independent of. By definition,
\[
Q_{\beta_k} = \Gamma(\beta_k \mid a_{\beta_k}, b_{\beta_k}) \propto \exp\big((a_{\beta_k}-1)\ln\beta_k - b_{\beta_k}\beta_k\big).
\]
By equating Equations (59) and (60), the updates for $Q_{\beta_k}$ are
\[
a_{\beta_k} \leftarrow \frac{J+1}{2} + a_0, \qquad
b_{\beta_k} \leftarrow \frac{1}{2}\Big(\mathbb{E}_{Q/\beta_k}\big[\bar{Z}_k^2\eta\big] + \sum_{j=1}^{J}\mathbb{E}_{Q/\beta_k}\big[W_{jk}^2\alpha_j\big]\Big) + b_0.
\]
Derivation of Equations (20), (21):
According to Equations (13) and the posterior Equation (50), the update formulae for Q Z ¯ is:
\begin{align*}
Q_{\bar{Z}} &\propto \exp\big(\mathbb{E}_{Q/\bar{Z}}[\ln \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]]\big) \\
&\propto \exp\big(\mathbb{E}_{Q/\bar{Z}}[\ln(\Pr[Y \mid Z, W, \bar{Z}, \sigma]\,\Pr[\bar{Z} \mid \eta, \beta])]\big) \\
&\propto \exp\Big(\!-\tfrac{1}{2}\Big(\bar{Z}\,\mathbb{E}_{Q/\bar{Z}}\Big[\sigma^{-2}\sum_{i=1}^{P}\Psi_i + \eta\,\mathrm{diag}(\beta)\Big]\bar{Z}^{T}
- 2\,\mathbb{E}_{Q/\bar{Z}}\Big[\sigma^{-2}\sum_{i=1}^{P}\big(Y_i - Z_iW\Phi_i\big)\Phi_i^{T}\Big]\bar{Z}^{T}\Big)\Big)
\end{align*}
where we have omitted terms that $\bar{Z}$ is conditionally independent of. By definition,
\[
Q_{\bar{Z}} = \mathcal{N}(\bar{Z} \mid \mu_{\bar{Z}}, \Sigma_{\bar{Z}}) \propto \exp\Big(\!-\tfrac{1}{2}\big(\bar{Z}\Sigma_{\bar{Z}}^{-1}\bar{Z}^{T} - 2\mu_{\bar{Z}}\Sigma_{\bar{Z}}^{-1}\bar{Z}^{T}\big)\Big).
\]
By equating Equations (63) and (64), the updates for $Q_{\bar{Z}}$ are
\[
\Sigma_{\bar{Z}} \leftarrow \mathbb{E}_{Q/\bar{Z}}\Big[\sigma^{-2}\sum_{i=1}^{P}\Psi_i + \eta\,\mathrm{diag}(\beta)\Big]^{-1}, \qquad
\mu_{\bar{Z}} \leftarrow \mathbb{E}_{Q/\bar{Z}}\Big[\sigma^{-2}\sum_{i=1}^{P}\big(Y_i - Z_iW\Phi_i\big)\Phi_i^{T}\Big]\Sigma_{\bar{Z}}.
\]
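A sketch of this Gaussian update, assuming the mean-field moments are stored as plain NumPy arrays and that the cross expectation $\mathbb{E}_{Q/\bar{Z}}[Z_iW]$ factorizes into $\mu_{Z_i}\mu_W$ under the mean-field posterior:

```python
import numpy as np

def update_q_zbar(Phi, Y, mu_Z, mu_W, E_inv_sigma2, E_eta, E_beta):
    """Gaussian update for Q(Zbar) following the two formulae above.

    Phi: list of (K, N_i) design matrices; Y: list of (N_i,) measurements;
    mu_Z: list of (J,) posterior mean scores; mu_W: (J, K) posterior mean of W;
    E_inv_sigma2, E_eta: scalars; E_beta: (K,) expected precisions."""
    K = mu_W.shape[1]
    precision = E_eta * np.diag(E_beta)
    rhs = np.zeros(K)
    for Phi_i, y_i, mu_z_i in zip(Phi, Y, mu_Z):
        precision += E_inv_sigma2 * Phi_i @ Phi_i.T                    # Psi_i = Phi_i Phi_i^T
        rhs += E_inv_sigma2 * (y_i - mu_z_i @ mu_W @ Phi_i) @ Phi_i.T  # E[Z_i W] = mu_Z_i mu_W
    Sigma_zbar = np.linalg.inv(precision)
    mu_zbar = rhs @ Sigma_zbar
    return mu_zbar, Sigma_zbar
```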
Derivation of Equations (22), (23):
According to Equation (13) and the posterior Equation (50), the update formulae for Q W is:
\begin{align*}
Q_{W} &\propto \exp\big(\mathbb{E}_{Q/W}[\ln \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]]\big) \\
&\propto \exp\big(\mathbb{E}_{Q/W}[\ln(\Pr[Y \mid Z, W, \bar{Z}, \sigma]\,\Pr[W \mid \alpha, \beta])]\big) \\
&\propto \exp\Big(\!-\tfrac{1}{2}\Big(\mathrm{vec}(W)^{T}\,\mathbb{E}_{Q/W}\Big[\sigma^{-2}\sum_{i=1}^{P}\Psi_i\otimes(Z_i^{T}Z_i) + \mathrm{diag}(\beta)\otimes\mathrm{diag}(\alpha)\Big]\mathrm{vec}(W) \\
&\qquad\qquad\quad + 2\,\mathbb{E}_{Q/W}\Big[\sigma^{-2}\sum_{i=1}^{P}\mathrm{vec}\big(\Phi_i(\Phi_i^{T}\bar{Z}^{T} - Y_i^{T})Z_i^{T}\big)^{T}\Big]\mathrm{vec}(W)\Big)\Big)
\end{align*}
where we have omitted terms that $W$ is conditionally independent of, and collected the quadratic and linear terms in $\mathrm{vec}(W)$. By definition,
\[
Q_{W} = \mathcal{N}\big(\mathrm{vec}(W) \mid \mu_{\mathrm{vec}(W)}, \Sigma_{\mathrm{vec}(W)}\big) \propto \exp\Big(\!-\tfrac{1}{2}\big(\mathrm{vec}(W)^{T}\Sigma_{\mathrm{vec}(W)}^{-1}\mathrm{vec}(W) - 2\mu_{\mathrm{vec}(W)}\Sigma_{\mathrm{vec}(W)}^{-1}\mathrm{vec}(W)\big)\Big).
\]
By equating Equations (67) and (68), the updates for $Q_{W}$ are
\[
\Sigma_{\mathrm{vec}(W)} \leftarrow \mathbb{E}_{Q/W}\Big[\sigma^{-2}\sum_{i=1}^{P}\Psi_i\otimes(Z_i^{T}Z_i) + \mathrm{diag}(\beta)\otimes\mathrm{diag}(\alpha)\Big]^{-1},
\]
\[
\mu_{\mathrm{vec}(W)} \leftarrow -\,\mathbb{E}_{Q/W}\Big[\sigma^{-2}\sum_{i=1}^{P}\mathrm{vec}\big(\Phi_i(\Phi_i^{T}\bar{Z}^{T} - Y_i^{T})Z_i^{T}\big)^{T}\Big]\Sigma_{\mathrm{vec}(W)}.
\]
Derivation of Equations (24), (25) and (26):
According to Equation (13) and the posterior Equation (50), the update formulae for Q Z i is:
\begin{align*}
Q_{Z_i} &\propto \exp\big(\mathbb{E}_{Q/Z_i}[\ln \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]]\big) \\
&\propto \exp\big(\mathbb{E}_{Q/Z_i}[\ln(\Pr[Y_i \mid Z_i, W, \bar{Z}, \sigma]\,\Pr[Z_i])]\big) \\
&\propto \exp\Big(\!-\tfrac{1}{2}\Big(Z_i\,\mathbb{E}_{Q/Z_i}\big[\sigma^{-2}W\Psi_iW^{T} + I\big]Z_i^{T}
- 2\,\mathbb{E}_{Q/Z_i}\big[\sigma^{-2}(Y_i - \bar{Z}\Phi_i)\Phi_i^{T}W^{T}\big]Z_i^{T}\Big)\Big)
\end{align*}
where we have omitted terms that $Z_i$ is conditionally independent of. By definition,
\[
Q_{Z_i} = \mathcal{N}(Z_i \mid \mu_{Z_i}, \Sigma_{Z_i}) \propto \exp\Big(\!-\tfrac{1}{2}\big(Z_i\Sigma_{Z_i}^{-1}Z_i^{T} - 2\mu_{Z_i}\Sigma_{Z_i}^{-1}Z_i^{T}\big)\Big).
\]
By equating Equations (70) and (71), the updates for $Q_{Z_i}$ are
\[
\Sigma_{Z_i} \leftarrow \mathbb{E}_{Q/Z_i}\big[\sigma^{-2}W\Psi_iW^{T} + I\big]^{-1}, \qquad
\mu_{Z_i} \leftarrow \mathbb{E}_{Q/Z_i}\big[\sigma^{-2}(Y_i - \bar{Z}\Phi_i)\Phi_i^{T}W^{T}\big]\Sigma_{Z_i}.
\]
Derivation of Equations (27), (28) and (29):
According to Equation (13) and the posterior Equation (50), the update formulae for Q σ is:
\begin{align*}
Q_{\sigma} &\propto \exp\big(\mathbb{E}_{Q/\sigma}[\ln \Pr[Z, W, \bar{Z}, \sigma, \alpha, \beta, \eta, Y]]\big) \\
&\propto \exp\big(\mathbb{E}_{Q/\sigma}[\ln(\Pr[Y \mid Z, W, \bar{Z}, \sigma]\,\Pr[\sigma])]\big) \\
&\propto \exp\Big(\Big(a_0 + \tfrac{1}{2}\sum_i N_i - 1\Big)\ln\sigma^{-2} - \sigma^{-2}\Big(b_0 + \tfrac{1}{2}\,\mathbb{E}_{Q/\sigma}\Big[\sum_i\big\|Y_i - (Z_iW + \bar{Z})\Phi_i\big\|_2^2\Big]\Big)\Big)
\end{align*}
where we have omitted terms that $\sigma$ is conditionally independent of. By definition,
\[
Q_{\sigma} = \Gamma(\sigma^{-2} \mid a_{\sigma}, b_{\sigma}) \propto \exp\big((a_{\sigma}-1)\ln\sigma^{-2} - b_{\sigma}\sigma^{-2}\big).
\]
By equating Equations (74) and (75), the updates for $Q_{\sigma}$ are
\[
a_{\sigma} \leftarrow a_0 + \frac{1}{2}\sum_i N_i, \qquad
b_{\sigma} \leftarrow b_0 + \frac{1}{2}\,\mathbb{E}_{Q/\sigma}\Big[\sum_i\big\|Y_i - (Z_iW + \bar{Z})\Phi_i\big\|_2^2\Big].
\]
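The corresponding noise update can be sketched as below; for brevity the expected squared residual is approximated by plugging in posterior means only, whereas the exact update also carries trace terms from $\Sigma_{Z_i}$, $\Sigma_{\mathrm{vec}(W)}$, and $\Sigma_{\bar{Z}}$.

```python
import numpy as np

def update_q_sigma(a0, b0, Phi, Y, mu_Z, mu_W, mu_zbar):
    """Gamma update for the noise precision, with the expected squared residual
    approximated by plugging in posterior means only (the exact update adds trace
    terms from Sigma_Z_i, Sigma_vec(W), and Sigma_Zbar)."""
    a_sigma = a0 + 0.5 * sum(len(y_i) for y_i in Y)
    sq_res = 0.0
    for Phi_i, y_i, mu_z_i in zip(Phi, Y, mu_Z):
        resid = y_i - (mu_z_i @ mu_W + mu_zbar) @ Phi_i
        sq_res += resid @ resid
    b_sigma = b0 + 0.5 * sq_res
    return a_sigma, b_sigma          # E[sigma^{-2}] = a_sigma / b_sigma
```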

9. Scalable Update for BSFDA

9.1. Implicit Factorization

We initialize the inactive precision parameters as:
\[
\mathbb{E}_{Q_{\alpha_j}}[\alpha_j] = \epsilon^{-1}, \quad \forall\, j > J^{(a)}
\]
\[
\mathbb{E}_{Q_{\beta_k}}[\beta_k] = \epsilon^{-1}, \quad \forall\, k > K^{(a)}
\]
Under these settings and subsequent variational updates (using Equations (78) and (79)), in the limit as $\epsilon \to 0$, the surrogate distributions satisfy:
\[
\mu_{Z_i^B} = 0, \quad \Sigma_{Z_i}[B,B] = \epsilon I_{(J - J^{(a)})}, \quad \Sigma_{Z_i}[A,B] = \Sigma_{Z_i}[B,A]^{T} = 0
\]
\[
\mu_{\bar{Z}^B} = 0, \quad \Sigma_{\bar{Z}}[B,B] = \epsilon I_{(K - K^{(a)})}, \quad \Sigma_{\bar{Z}}[A,B] = \Sigma_{\bar{Z}}[B,A]^{T} = 0
\]
\[
\mu_{\mathrm{vec}(W)^B} = 0, \quad \mu_{\mathrm{vec}(W)^C} = 0, \quad \mu_{\mathrm{vec}(W)^D} = 0,
\]
\[
\Sigma_{\mathrm{vec}(W)}[B,B] = \epsilon I_{(K^{(a)}J - K^{(a)}J^{(a)})}, \quad
\Sigma_{\mathrm{vec}(W)}[C,C] = \epsilon I_{(J^{(a)}K - J^{(a)}K^{(a)})}, \quad
\Sigma_{\mathrm{vec}(W)}[D,D] = \epsilon I_{(JK + J^{(a)}K^{(a)} - JK^{(a)} - J^{(a)}K)},
\]
\[
\Sigma_{\mathrm{vec}(W)}[x,y] = 0, \quad \forall\,(x,y) \notin \{(A,A),\,(B,B),\,(C,C),\,(D,D)\}
\]
For convenience, we initialize Q with the above properties.
Lemma 1.
If Q α j [ α j ] = ϵ , j J ( a ) and Q β k [ β k ] = ϵ , k K ( a ) , then the variational distribution over W factorizes as Q W = Q W A Q W B Q W C Q W D in the limit as ϵ 0 .
Proof. 
We express the distribution as
Q W = N ( vec ( W ) | μ vec ( W ) , Σ vec ( W ) ) = exp 1 2 ln | 2 π Σ vec ( W ) | + μ vec ( W ) Σ vec ( W ) 1 μ vec ( W ) T ,
The factorization holds if the off-diagonal block matrices in Σ vec ( W ) , e.g. Σ [ W A , W B ] , are all zero, i.e., the blocks are mutually independent. Initially, this is ensured by the definition for the initial status in Equation (82). Thus, we only need to show the statement remains true after Q W is updated, i.e., after Equation (22) is applied with the inactive scale parameters Q α j [ α j ] and Q β k [ β k ] fixed at ϵ . First we regard Σ [ W A B C ] , i.e., the covariance of the union of W A , W B , W C after vectorization, as one block. By the block matrix inversion formula, we get Σ [ W A B C , W D ] ϵ 2 0 and consequently Q W = Q W A B C Q W D . Next, apply block matrix inversion formula to Σ [ W A B C ] in Equation (22) and we get ( Σ [ W B , W A ] , Σ [ W B , W C ] , Σ [ W C , W A ] , Σ [ W C , W B ] ) ϵ 0 , yielding the desired factorization.    □
Lemma 2.
If Q β k [ β k ] = ϵ , k K ( a ) , then the implicit factorization Q Z ¯ = Q Z ¯ A Q Z ¯ B holds in the limit as ϵ 0 .
Proof. 
It is similar to the proof of Lemma 1. Because $Q_{\bar{Z}} = \mathcal{N}(\mu_{\bar{Z}}, \Sigma_{\bar{Z}})$, we only need the off-diagonal block to be zero, i.e., $\Sigma_{\bar{Z}}[A,B] = 0$. Initially, this is ensured by the definition of the initial status in Equation (81). $Q_{\bar{Z}}$ is updated by Equation (20); applying the block matrix inversion formula with the inactive $Q_{\beta_k}[\beta_k]$, we get $\Sigma_{\bar{Z}}[A,B] \in O(\epsilon) \to 0$, establishing the factorization.    □
Lemma 3.
If j J ( a ) or k K ( a ) , then E Q / Z i [ W k j W j k ] O ( ϵ ) , j = 1 : J , k = 1 : K in the limit as ϵ 0 .
Proof. 
For initial status, apparently the largest E Q / Z i [ W k j W j k ] is E Q / Z i [ W k j 2 ] = ϵ . Because either Q α j [ α j ] = ϵ or Q β k [ β k ] = ϵ , after updates from Equations (22) and (23) are applied, E Q / Z i [ W k j W j k ] = Σ [ W k j , W j k ] + μ vec ( W ) k j μ vec ( W ) j k O ( ϵ ) by Woodbury matrix identity.    □
Lemma 4.
If E Q α j [ α j ] = a α j b α j = ϵ 1 , j J ( a ) , then the implicit factorization Q Z i = Q Z i A Q Z i B holds in the limit as ϵ 0 .
Proof. 
It is similar to the proof of Lemma 1. Because $Q_{Z_i} = \mathcal{N}(\mu_{Z_i}, \Sigma_{Z_i})$, only $\Sigma_{Z_i}[A,B] = 0$ is needed. Initially, this is ensured by the definition of the initial status in Equation (80). $Q_{Z}$ is updated by Equations (24) and (25). In Equation (24), when $j > J^{(a)}$ or $k > K^{(a)}$, every term of $C_{ijk} = \mathrm{Tr}\big(\mathbb{E}_{Q/Z_i}[W_{k\cdot}^{T}W_{j\cdot}]\Psi_i\big)$ is of the form $\mathbb{E}_{Q/Z_i}[W_{jk'}W_{kj'}](\Psi_i)_{k'j'} \in O(\epsilon)$ by Lemma 3, so $C_{ijk} \to 0$. Applying the block matrix inversion formula to Equation (25), $\Sigma_{Z_i}[A,B] \in O(\epsilon) \to 0$, thus proving the implicit factorization.    □

9.2. Scale Parameters

Here we state the theorems that justify using the updating rules for $Q_{\alpha_j}^{(a)}$ based on $\mathcal{L}^{(a)}$ to update $Q_{\alpha_j}$ (and similarly, $Q_{\beta_k}^{(a)}$ for $Q_{\beta_k}$, and $Q_{\eta}^{(a)}$ for $Q_{\eta}$), and that doing so ultimately maximizes $\mathcal{L}$.
Lemma 5.
W j k W B W C , i.e., either ( j > J ( a ) ) or ( k > K ( a ) ) , after updating Q W B and Q W C by Equations (22) and (23), E Q [ W j k 2 ] = b α j b β k a α j a β k .
Proof. 
According to Equations (78) and (79), if ( j > J ( a ) ) or ( k > K ( a ) ) , either E Q [ α j ] = ϵ 1 or E Q [ β k ] = ϵ 1 respectively.
In the limit as ϵ 0 , using Equation (22) and block matrix inversion formula we get
Σ W j k lim ϵ 0 E Q / W [ diag ( β ) diag ( α ) + σ 2 i ( Ψ i ) ( Z i T Z i ) ] 1 [ j + k M , j + k M ] = lim ϵ 0 E Q / W [ α j β k ] 1 + O ( ϵ 2 ) = E Q / W [ α j β k ] 1 = b α j b β k a α j a β k
In the limit as ϵ 0 and using Equation (23)
μ W j k lim ϵ 0 a σ b σ i vec Φ i ( μ Z ¯ Φ i Y i ) T μ Z i T T Σ vec ( W ) [ 1 , j + k M ] = a σ b σ i vec Φ i ( μ Z ¯ Φ i Y i ) T μ Z i T T Σ vec ( W ) · ( j + k M ) O ( ϵ )
Equation (84) uses the fact that elements in Σ vec ( W ) · ( j + k M ) are all O ( ϵ ) based on block matrix inversion formula. Thus,
lim ϵ 0 E Q [ W j k 2 ] = lim ϵ 0 Σ W j k + ( μ W j k ) 2 = lim ϵ 0 Σ W j k + O ( ϵ 2 ) = lim ϵ 0 b α j b β k a α j a β k + O ( ϵ 2 ) = b α j b β k a α j a β k
   □
Lemma 6.
k > K ( a ) , after updating Q Z ¯ B by Equations (20) and (21), E Q [ Z ¯ k 2 ] = b η a η ϵ .
Proof. 
If k > K ( a ) , E Q [ β k ] = ϵ 1 .
Then using Equation (20) and block matrix inversion formula we get
Σ Z ¯ k k lim ϵ 0 E Q / Z ¯ i = 1 P σ 2 Ψ i + η diag ( β ) 1 k k = lim ϵ 0 i = 1 P a σ b σ Ψ i + a η b η diag ( a b ) 1 k k = lim ϵ 0 b η b β k a η a β k + O ( ϵ 2 ) = b η b β k a η a β k
Using Equation (21)
μ Z ¯ k lim ϵ 0 E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i W Φ i ) Φ i T Σ Z ¯ 1 k = lim ϵ 0 a σ b σ i = 1 P ( Y μ Z i μ W Φ i ) Φ i T Σ Z ¯ 1 k = lim ϵ 0 a σ b σ i = 1 P ( Y μ Z i μ W Φ i ) Φ i T Σ Z ¯ · k O ( ϵ )
Equation (87) uses the fact that elements in Σ Z ¯ · k are all O ( ϵ ) .
E Q [ Z ¯ k 2 ] = lim ϵ 0 Σ Z ¯ k k + μ Z ¯ k 2 = lim ϵ 0 Σ Z ¯ k k + O ( ϵ 2 ) = lim ϵ 0 b η b β k a η a β k + O ( ϵ 2 ) = lim ϵ 0 b η a η ϵ + O ( ϵ 2 ) = b η a η ϵ
   □
Theorem 1.
j J ( a ) , updates of Q α j and Q W B will converge at E Q α j [ α j ] = E Q α j ( a ) [ α j ] given E Q β k [ β k ] = E Q β k ( a ) [ β k ] , k K ( a ) , a 0 = b 0 = 0 and conditions in Equations (80)-(79) are satisfied in the limit as ϵ 0 .
Proof. 
Assume Q α j ( a ) has just been updated using Equations (14), (15), i.e., j J ( a )
a α j ( a ) = a 0 + K ( a ) 2 b α j ( a ) = b 0 + 1 2 k = 1 K ( a ) E Q / α j ( a ) [ W j k 2 β k ]
= b 0 + 1 2 k = 1 K ( a ) Σ W j k + μ W j k 2 a β k ( a ) b β k ( a )
The updates for Q α derived from L are
b α j b 0 + 1 2 k = 1 K Σ W j k + μ W j k 2 a β k b β k = b 0 + 1 2 k = 1 K ( a ) Σ W j k + μ W j k 2 a β k b β k + 1 2 k = K ( a ) + 1 K Σ W j k + μ W j k 2 a β k b β k = b α j ( a ) + 1 2 k = K ( a ) + 1 K Σ W j k + μ W j k 2 a β k b β k
It involves $W_{jk}$, $\forall\, k > K^{(a)}$, and therefore these terms need to be kept updated. Applying Lemma 5 to Equation (91), we get
b α j b α j ( a ) + 1 2 k = K ( a ) + 1 K b α j b β k a α j a β k a β k b β k = b α j ( a ) + 1 2 ( K K ( a ) ) b α j a α j
Applying Equation (92) in an iterative manner, we will get a sequence of updates for a α j . Solving
b α j = b α j ( a ) + 1 2 ( K K ( a ) ) b α j K 2
b α j = ( 1 1 2 ( K K ( a ) ) 2 K ) 1 b α j ( a ) = K K ( a ) b α j ( a )
Thus, we can get that the sequence will converge at
b α j K K ( a ) b α j ( a )
As a result, E Q α j [ α j ] = a α j b α j = a α j ( a ) b α j ( a ) = E Q α j ( a ) [ α j ] .    □
Theorem 2.
k K ( a ) , updates of Q β k and Q W C will converge at E Q β k [ β k ] = E Q β k ( a ) [ β k ] given E Q α j [ α j ] = E Q α j ( a ) [ α j ] , j J ( a ) , a 0 = b 0 = 0 and conditions in Equations (80)-(79) are satisfied in the limit as ϵ 0 .
Proof. 
Assume Q β k ( a ) has just been updated using Equations (18) and (19), i.e.,
a β k ( a ) = a 0 + K ( a ) + 1 2 b β k ( a ) b 0 + 1 2 E Q / β k ( a ) [ Z ¯ k 2 + j = 1 J ( a ) W j k 2 α j ]
= b 0 + 1 2 Σ Z ¯ k k + μ Z ¯ k 2 + j = 1 J ( a ) Σ W j k + μ W j k 2 a α j ( a ) b α j ( a )
The updates for Q β k derived from L is
b β k b 0 + 1 2 Σ Z ¯ k k + μ Z ¯ k 2 + j = 1 J Σ W j k + μ W j k 2 a α j b α j = b 0 + 1 2 Σ Z ¯ k k + μ Z ¯ k 2 + j = 1 J ( a ) Σ W j k + μ W j k 2 a α j b α j + 1 2 j = J ( a ) + 1 J Σ W j k + μ W j k 2 a α j b α j = b β k ( a ) + 1 2 j = J ( a ) + 1 J Σ W j k + μ W j k 2 a α j b α j
It involves $W_{jk}$, $\forall\, j > J^{(a)}$, and therefore these terms need to be kept updated. Applying Lemma 5 to Equation (98), we get
b β k b β k ( a ) + 1 2 j = J ( a ) + 1 J b β k b α j a β k a α j a α j b α j
= b β k ( a ) + 1 2 ( K K ( a ) ) b β k a β k
Applying Equation (99) in an iterative manner, we will get a sequence of b β k . Solving
b β k = b β k ( a ) + 1 2 ( K K ( a ) ) b β k K + 1 2
b β k = ( 1 1 2 ( K K ( a ) ) 2 K + 1 ) 1 b β k ( a ) = K + 1 K ( a ) + 1 b β k ( a )
Thus, we can get that the sequence will converge at
b β k K + 1 K ( a ) + 1 b β k ( a )
As a result, E Q [ β k ] = a β k b β k = a β k ( a ) b β k ( a ) = E Q ( a ) [ β k ] .    □
Theorem 3.
Updates of Q η and Q Z ¯ B will converge at E Q η [ η ] = E Q η ( a ) [ η ] given E Q β k [ β k ] = E Q β k ( a ) [ β k ] , k K ( a ) , a 0 = b 0 = 0 and conditions in Equations (80)-(79) are satisfied in the limit as ϵ 0 .
Proof. 
Assume Q η ( a ) has just been updated using Equations (16) and (17), i.e.,
a η ( a ) a 0 + K ( a ) 2 b η ( a ) b 0 + 1 2 k = 1 K ( a ) E Q / η [ Z ¯ k 2 β k ]
= b 0 + 1 2 k = 1 K ( a ) Σ Z ¯ k + μ Z ¯ k 2 a β k ( a ) b β k ( a )
The updates for Q η derived from L is
b η b 0 + 1 2 k = 1 K Σ Z ¯ k + μ Z ¯ k 2 a β k b β k = b 0 + 1 2 k = 1 K ( a ) Σ Z ¯ k + μ Z ¯ k 2 a β k b β k + 1 2 k = K ( a ) + 1 K Σ Z ¯ k + μ Z ¯ k 2 a β k b β k = b η ( a ) + 1 2 k = K ( a ) + 1 K Σ Z ¯ k + μ Z ¯ k 2 a β k b β k
It involves Z ¯ k , k > K ( a ) and therefore they need to be kept updated. Apply Lemma 6 for Equation (106) and we can get
b η b η ( a ) + 1 2 k = K ( a ) + 1 K b η b β k a η a β k a β k b β k
= b η ( a ) + 1 2 ( K K ( a ) ) b η a η
Applying Equation (108) in an iterative manner, we will get a sequence of updates for b η . Solving
b η = b η ( a ) + 1 2 ( K K ( a ) ) b η K 2
b η = ( 1 1 2 ( K K ( a ) ) 2 K ) 1 b η ( a ) = K K ( a ) b η ( a )
Thus, we can get that the sequence will converge at
b η K K ( a ) b η ( a )
As a result, E Q [ η ] = b η a η = b η ( a ) a η ( a ) = E Q ( a ) [ η ] .    □
In practice, due to the limitations of numerical representation, we cap the active precision parameter estimates so that they do not diverge to infinity:
\[
\mathbb{E}_{Q_{\alpha_j}}[\alpha_j] \le \tau_{\max}, \quad \forall\, j \le J^{(a)}
\]
\[
\mathbb{E}_{Q_{\beta_k}}[\beta_k] \le \tau_{\max}, \quad \forall\, k \le K^{(a)}
\]

9.3. Weights and Noise

Here is how to update $Q_{Z^A}$, $Q_{\bar{Z}^A}$, $Q_{W^A}$, and $Q_{\sigma}$ in a scalable manner, using computations in the $K^{(a)}$-dimensional subspace only.
Theorem 4.
L and L ( a ) share the same update rule for Z i A , i.e.,
H i A j k E Q / Z i [ W A j Φ i A Φ i A T W A k T ] = Tr ( E Q / Z i [ W A k T W A j ] Φ i A Φ i A T )
= T r Σ [ W A k , W A j ] + μ [ W A j ] T μ [ W A k ] Φ i A Φ i A T , j = 1 : J ( a ) , k = 1 : K ( a )
Σ Z i A E Q / Z i [ σ 2 W A Φ i A Φ i A T W A T + I ] 1 = [ a σ b σ H i A + I ] 1
μ i A E Q / Z i [ σ 2 ( Y i Z ¯ Φ i A ) Φ i A T W A T ] Σ Z i A = a σ b σ ( Y i μ Z ¯ A Φ i A ) Φ i A T ( μ W A ) T Σ Z i A
Proof. 
Apply Lemma 3 to Equation (24) we get
H i A j k T r Σ [ W A k , W A j ] + μ [ W A j ] T μ [ W A k ] Φ i A Φ i A T + O ( ϵ ) T r Σ [ W A k , W A j ] + μ [ W A j ] T μ [ W A k ] Φ i A Φ i A T , j = 1 : J ( a ) , k = 1 : K ( a )
Apply block matrix inversion formula to Equation (25) we get
Σ Z i A E Q / Z i [ σ 2 W A Φ i A Φ i A T W A T + I ] 1 + O ( ϵ 2 ) E Q / Z i [ σ 2 W A Φ i A Φ i A T W A T + I ] 1 = [ a σ b σ H i A + I ] 1
Apply block matrix multiplication and Theorem 5 to Equation (26) conditioned on Equation (81) we get
μ i A E Q / Z i [ σ 2 ( Y i Z ¯ Φ i A ) Φ i A T W A T ] Σ Z i A + O ( ϵ ) a σ b σ ( Y i μ Z ¯ A Φ i A ) Φ i A T ( μ W A ) T Σ Z i A
   □
Theorem 5.
L and L ( a ) share the same update rule for Z ¯ A , i.e.,
Σ Z ¯ A E Q / Z ¯ i = 1 P σ 2 Φ i A Φ i A T + η diag ( β A ) 1
= i = 1 P a σ b σ Φ i A Φ i A T + a η b η diag ( a A b A ) 1 μ Z ¯ A E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i A W A Φ i A ) Φ i A Σ Z ¯ A
= a σ b σ i = 1 P ( Y μ i A μ W A Φ i A ) Φ i A Σ Z ¯ A
Proof. 
Apply block matrix inversion formula to Equation (20) conditioned on E Q / Z ¯ [ β k ] = ϵ 1 , k > K ( a ) and we get
Σ Z ¯ A E Q / Z ¯ i = 1 P σ 2 Φ i A Φ i A T + η diag ( β A ) 1 + O ( ϵ ) E Q / Z ¯ i = 1 P σ 2 Φ i A Φ i A T + η diag ( β A ) 1
Apply block matrix multiplication and Theorem 5 to Equation (21) conditioned on Equation (80) we get
μ Z ¯ A E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i A W A Φ i A ) Φ i A Σ Z ¯ A + O ( ϵ ) E Q / Z ¯ σ 2 i = 1 P ( Y E Q / Z ¯ Z i A W A Φ i A ) Φ i A Σ Z ¯ A
   □
Theorem 6.
L and L ( a ) share the same update rule for W A , i.e.,
Σ vec ( W ) E Q / W σ 2 i = 1 P ( Φ i A T Φ i A ) ( Z i A T Z i A ) + diag ( β A ) diag ( α A ) 1
= a σ b σ i = 1 P ( Φ i A T Φ i A ) ( μ i A T μ i A + Σ Z i A ) + diag a A b A diag c A d A 1 μ vec ( W ) E Q / W σ 2 i = 1 P vec Φ i A ( Φ i A T Z ¯ A T Y i T ) Z i A T T Σ vec ( W ) A
= a σ b σ i = 1 P vec Φ i A ( Φ i A T μ Z ¯ A T Y i T ) μ i A T T Σ vec ( W ) A
Proof. 
Apply block matrix inversion formula to Equation (22) conditioned on E Q / Z ¯ [ β k ] = ϵ 1 , k > K ( a ) and E Q / Z ¯ [ α j ] = ϵ 1 , j > J ( a ) , we get
Σ vec ( W ) E Q / W σ 2 i = 1 P ( Φ i A T Φ i A ) ( Z i A T Z i A ) + diag ( β A ) diag ( α A ) 1 + O ( ϵ ) E Q / W σ 2 i = 1 P ( Φ i A T Φ i A ) ( Z i A T Z i A ) + diag ( β A ) diag ( α A ) 1
Apply block matrix multiplication and Theorem 5 to Equation (23) conditioned on Equation (80) and Equation (81), we get
μ vec ( W ) E Q / W σ 2 i = 1 P vec Φ i A ( Φ i A T Z ¯ A T Y i T ) Z i A T T Σ vec ( W ) A + O ( ϵ ) E Q / W σ 2 i = 1 P vec Φ i A ( Φ i A T Z ¯ A T Y i T ) Z i A T T Σ vec ( W ) A
   □
Theorem 7.
L and L ( a ) share the same update rule for σ, i.e.,
a σ a 0 + 1 2 i N i b σ b 0 + 1 2 E Q / σ i | | Y i ( Z i A W A + Z ¯ A ) Φ i A | | 2 2 = b 0 + 1 2 i ( Y i Y i T 2 Y i μ i A μ W A Φ i A T 2 Y i μ Z ¯ A Φ i A T + 2 μ i A μ W A Φ i A Φ i A T ( μ Z ¯ A ) T + Tr Σ Z ¯ A + ( μ Z ¯ A ) T μ Z ¯ A Φ i A Φ i A T )
+ 1 2 vec ( G A T ) T i vec vec ( Φ i A Φ i A T ) vec ( Σ Z i A + μ i A T μ i A ) T ,
where
G A ( j + k M ) E Q / σ vec ( W A k W A j T ) T = vec ( Σ [ W A k , W A j ] + μ vec ( W ) [ W A j ] T μ vec ( W ) [ W A k ] ) T , j = 1 : K ( a ) , k = 1 : K ( a )
Proof. 
Apply block matrix multiplication and Theorem 5 to Equation (28) conditioned on Equations (80) and (81), we get
b σ b 0 + 1 2 E Q / σ i | | Y i ( Z i A W A + Z ¯ A ) Φ i A | | 2 2 + O ( ϵ ) b 0 + 1 2 E Q / σ i | | Y i ( Z i A W A + Z ¯ A ) Φ i A | | 2 2
   □
We have shown that $Q_{Z_i^A}$, $Q_{W^A}$, $Q_{\bar{Z}^A}$, and $Q_{\sigma}$ share the same update formulae as those derived from the low-dimensional lower bound: $Q_{Z_i^A}^{(a)}$, $Q_{W^A}^{(a)}$, $Q_{\bar{Z}^A}^{(a)}$, and $Q_{\sigma}^{(a)}$. Thus, in practice, it suffices to update $Q^{(a)}$; we can then increase $K^{(a)}$ by including new basis functions. This process provably maximizes $\mathcal{L}$ over the full $Q$ implicitly.

9.4. Low-Dimensional Lower Bound

We now have updating formulae for the parameters in the active subspace. $Q_{Z_i^A}$ is updated by Equations (114), (115), and (116). $Q_{W^A}$ is updated by Equations (124) and (125). $Q_{\bar{Z}^A}$ is updated by Equations (120) and (121). $Q_{\alpha^A}$, $Q_{\beta^A}$, and $Q_{\eta}$ are updated by Theorems 1, 2, and 3, accompanied by the implicit updates of $Q_{W^B}$ and $Q_{W^C}$. $Q_{\sigma}$ is updated by Equations (130), (128), and (129). All the updating rules are identical to those derived from the low-dimensional lower bound $\mathcal{L}^{(a)}$ with $K^{(a)}$ basis functions. Therefore, in practice all we need is to optimize $\mathcal{L}^{(a)}$, with time complexity $O\big(K^{(a)2}\max(K^{(a)4},\, P\max_i(N_i))\big)$, as described in Theorem 8, and then check whether a new basis function should be included in the model.
For numerical stability, we scale ϕ , b such that min k ( E Q β [ β k ] ) = min k ( c k d k ) = 1 at the beginning of Algorithm 2.
Theorem 8.
The lower bound $\mathcal{L}$ can be optimized using Algorithm 2 with time complexity $O\big(K^{(a)2}\max(K^{(a)4},\, P\max_i(N_i))\big)$.
Proof. 
It is a consequence of Theorems 1, 2, 3, 4, 5, 6, 7.    □
Algorithm 2 Variational inference
  • Require:  μ Z i , Σ Z i , μ vec ( W ) , Σ vec ( W ) , μ Z ¯ , Σ Z ¯ , a σ , b σ , a α j , b α j , a β k , b β k , i , j , k ▹ Multi sample RVM
  •   while True do
  •      L ( a ) lowerbound ( Q ( a ) )
  •     Update Q ( a ) with respect to all parameters using mean field approximation
  •     if  lowerbound ( Q ( a ) ) L ( a ) < τ con  then ▹ Insignificant increase
  •         Search for new basis functions using Algorithm 1
  •         if not found then ▹ Converged
  •            break
  •         end if
  •     end if
  •     Remove dimensions associated with the precision of the maximum values
  • end while
  • Get rid of dimensions associated with α j min j ( α j ) τ eff and β k min k ( β k ) τ eff

10. Scalable Update for BSFDAFast

For brevity, we denote the covariance of $\zeta_i$ as $S$, i.e., $\zeta_i \sim \mathcal{N}(0, S)$, where $S$ is diagonal with $S_{kk} = \varsigma_k^2\beta_k^{-1}$. The variational update formulae are as follows:
\begin{align*}
\Sigma_{\theta_i} &\leftarrow \mathbb{E}_{Q/\theta_i}\big[\Phi_i\Phi_i^{T}\sigma^{-2} + S^{-1}\big]^{-1} \\
\mu_{\theta_i} &\leftarrow \mathbb{E}_{Q/\theta_i}\big[(Y_i - \bar{Z}\Phi_i)\Phi_i^{T}\sigma^{-2} + Z_iWS^{-1}\big]\Sigma_{\theta_i} \\
\Sigma_{Z_i} &\leftarrow \mathbb{E}_{Q/Z_i}\big[WS^{-1}W^{T} + I\big]^{-1} \\
\mu_{Z_i} &\leftarrow \mathbb{E}_{Q/Z_i}\big[\theta_iS^{-1}W^{T}\big]\Sigma_{Z_i} \\
a_{\varsigma_k} &\leftarrow a_0 + \frac{P}{2} \\
b_{\varsigma_k} &\leftarrow \mathbb{E}_{Q/\varsigma_k}\Big[b_0 + \frac{1}{2}\sum_i(\theta_{ik} - Z_iW_{\cdot k})^2\beta_k\Big] \\
\Sigma_{W_{\cdot k}} &\leftarrow \mathbb{E}_{Q/W_{\cdot k}}\Big[\varsigma_k^{-2}\beta_k\sum_iZ_i^{T}Z_i + \beta_k\,\mathrm{diag}(\alpha)\Big]^{-1} \\
\mu_{W_{\cdot k}} &\leftarrow \mathbb{E}_{Q/W_{\cdot k}}\Big[\varsigma_k^{-2}\beta_k\sum_i(\theta_{ik}Z_i)\Big]\Sigma_{W_{\cdot k}} \\
a_{\beta_k} &\leftarrow a_0 + \frac{1 + K + P}{2} \\
b_{\beta_k} &\leftarrow \mathbb{E}_{Q/\beta_k}\Big[b_0 + \frac{1}{2}\Big(\bar{Z}_k^2\eta + \sum_j\big(W_{jk}^2\alpha_j\big) + \sum_i(\theta_{ik} - Z_iW_{\cdot k})^2\varsigma_k^{-2}\Big)\Big] \\
\Sigma_{\bar{Z}} &\leftarrow \mathbb{E}_{Q/\bar{Z}}\Big[\sigma^{-2}\sum_i\big(\Phi_i\Phi_i^{T}\big) + \eta\,\mathrm{diag}(\beta)\Big]^{-1} \\
\mu_{\bar{Z}} &\leftarrow \mathbb{E}_{Q/\bar{Z}}\Big[\sigma^{-2}\sum_i(Y_i - \theta_i\Phi_i)\Phi_i^{T}\Big]\Sigma_{\bar{Z}} \\
a_{\sigma^{-2}} &\leftarrow a_0 + \frac{1}{2}\sum_iN_i \\
b_{\sigma^{-2}} &\leftarrow \mathbb{E}_{Q/\sigma}\Big[b_0 + \frac{1}{2}\sum_i\big\|Y_i - (\bar{Z} + \theta_i)\Phi_i\big\|_2^2\Big]
\end{align*}
Notably, the columns of $W$ become conditionally independent with the introduction of the slack variable $\theta$, akin to the strategy described in [34,62]. The surrogate posterior of $W$ then factorizes over the columns, so only the covariance of each column needs to be computed, rather than that of the entire $W$ at once. Thus, the computational complexity is significantly reduced. This factorization is introduced on top of the existing factorizations, so the low-dimensional optimization strategy of BSFDA also applies to BSFDA_Fast.
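The following toy sketch illustrates the computational gain (it is not the BSFDA_Fast implementation): with the column-wise factorization, the update of each $\Sigma_{W_{\cdot k}}$ inverts a $J \times J$ matrix, so $K$ such inversions cost $O(KJ^3)$ rather than the $O((JK)^3)$ of a joint update of $\mathrm{vec}(W)$; the expectation placeholders `E_ZtZ`, `E_alpha`, `E_beta`, and `E_inv_varsigma2` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
J, K, P = 10, 50, 100
E_ZtZ = np.eye(J) * P                          # stand-in for E[sum_i Z_i^T Z_i]
E_alpha = rng.gamma(2.0, 1.0, size=J)          # stand-in for E[alpha_j]
E_beta = rng.gamma(2.0, 1.0, size=K)           # stand-in for E[beta_k]
E_inv_varsigma2 = rng.gamma(2.0, 1.0, size=K)  # stand-in for E[varsigma_k^{-2}]

Sigma_cols = []
for k in range(K):   # K inversions of J x J matrices: O(K J^3), not O((J K)^3)
    prec_k = E_inv_varsigma2[k] * E_beta[k] * E_ZtZ + E_beta[k] * np.diag(E_alpha)
    Sigma_cols.append(np.linalg.inv(prec_k))
print(len(Sigma_cols), Sigma_cols[0].shape)    # 50 covariance blocks of shape (10, 10)
```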

11. Fast Initialization

In order to efficiently obtain a good initialization for the unknowns to be estimated, e.g., $Z$, $\bar{Z}$, $\beta$, and $\sigma$, we approximate the model so that we can adopt a fast marginal-likelihood maximization strategy based on direct differentiation, similar to [73]. This initial $\beta$ serves to select the $K^{(a)}$ basis functions to start with.
We introduce $\tilde{Z}$ for easier marginalization:
\[
Y_i = \tilde{Z}_i\Phi_i + E_i
\]
\[
\tilde{Z}_{ik} = Z_{ik}\beta_k^{-1/2} + \bar{Z}_k \sim \mathcal{N}\big(\bar{Z}_k,\ \beta_k^{-1}\big)
\]
\[
\bar{Z}_k \sim \mathcal{N}\big(0,\ \beta_k^{-1}\big)
\]
\[
\beta_k \sim \Gamma(\beta_k \mid a_0, b_0), \qquad \sigma^{-2} \sim \Gamma(\sigma^{-2} \mid a_0, b_0)
\]
\[
E_i \sim \mathcal{N}(0,\ \sigma^2 I)
\]
The approximated probabilistic graphical model is shown in Figure 11.
Figure 11. Probabilistic graphical model for the simplified model.

11.1. Maximum Likelihood Estimation

We apply maximum likelihood estimation for point estimates of Z ¯ , β , σ .
Z ¯ * , β * , σ * arg min Z ¯ , β , σ P ,
where P = ln Pr [ Y | Z ¯ , β , σ ] . Conditioned on these estimates, we can calculate the expectation of Z.
Optimization of β , Z ¯
We set the derivative to zero, i.e., $\partial\mathcal{P}/\partial\beta_k = 0$, and get:
\[
\beta_k \leftarrow
\begin{cases}
\theta_k, & \text{if } \theta_k > 0, \\
\infty, & \text{otherwise,}
\end{cases}
\]
where
\[
\theta_k = \frac{\sum_{i=1}^{P} s_{ik}^2}{\sum_{i=1}^{P}\big(q_{ik}^2 - s_{ik}\big)}, \qquad
q_{ik} = \Phi_{ik}C_{i/k}^{-1}(Y_i - \bar{Z}\Phi_i)^{T}, \qquad
s_{ik} = \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T},
\]
\[
C_{i/k} = C_i - \Phi_{ik}^{T}\beta_k^{-1}\Phi_{ik}, \qquad
C_i = \Phi_i^{T}\mathrm{diag}(\beta^{-1})\Phi_i + \sigma^2 I = \sum_{k=1}^{K}\Phi_{ik}^{T}\beta_k^{-1}\Phi_{ik} + \sigma^2 I.
\]
We differentiate $\mathcal{P}$ with respect to $\bar{Z}$ and set the derivative to zero, i.e., $\partial\mathcal{P}/\partial\bar{Z} = 0$, to get:
\[
\bar{Z} \leftarrow \sum_{i=1}^{P} Y_iC_i^{-1}\Phi_i^{T}\ \Big(\sum_{i=1}^{P}\Phi_iC_i^{-1}\Phi_i^{T}\Big)^{-1}
\]
We approximate Equation (158) by $\bar{Z}^{A} \leftarrow \sum_{i=1}^{P} Y_iC_i^{-1}\Phi_i^{A\,T}\big(\sum_{i=1}^{P}\Phi_i^{A}C_i^{-1}\Phi_i^{A\,T}\big)^{-1}$ and $\bar{Z}^{B} \leftarrow 0$. This way we can apply the update using only the active basis functions.
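A minimal sketch of the resulting $\beta_k$ update, assuming the sparsity and quality factors for one candidate basis function have already been collected into the hypothetical arrays `s` and `q`:

```python
import numpy as np

def beta_update(s, q):
    """Point update for one beta_k from the sparsity/quality factors.

    s, q: length-P arrays of s_ik and q_ik for a single candidate basis function k.
    Returns the new precision; np.inf means the basis function is pruned."""
    denom = np.sum(q ** 2 - s)
    if denom <= 0:
        return np.inf                    # theta_k <= 0: prune
    return np.sum(s ** 2) / denom        # theta_k > 0: keep with finite precision
```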
Optimization of σ :
We use EM to optimize $\sigma$. In the E-step:
\begin{align*}
\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i] &\leftarrow \sigma^{-2}(Y_i - \bar{Z}\Phi_i)\Phi_i^{T}S_i \\
\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i^{T}\tilde{Z}_i] &\leftarrow S_i + \mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i]^{T}\,\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i],
\end{align*}
where $S_i = \big(\Psi_i\sigma^{-2} + \mathrm{diag}(\beta)\big)^{-1}$.
In the M-step:
\[
\sigma^2 \leftarrow \frac{\sum_{i=1}^{P}\mathbb{E}_{Q_{\tilde{Z}}}\big[\|Y_i - (\tilde{Z}_i + \bar{Z})\Phi_i\|_2^2\big]}{\sum_{i=1}^{P}N_i}
= \frac{\sum_{i=1}^{P}\Big[(Y_i - \bar{Z}\Phi_i)\big(Y_i - \bar{Z}\Phi_i - 2\Phi_i^{T}\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i]^{T}\big)^{T} + \mathrm{Tr}\big(\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i^{T}\tilde{Z}_i]\Psi_i\big)\Big]}{\sum_{i=1}^{P}N_i}
\]
The optimization iterates between the E-step Equations (159) and (160), and the M-step Equation (161).
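A compact sketch of this EM loop, assuming $\Phi_i$ is stored as a $(K, N_i)$ array restricted to the active basis functions and that $\beta$ and $\bar{Z}$ are held fixed during the $\sigma^2$ refinement; the function name and argument layout are hypothetical.

```python
import numpy as np

def em_sigma2(Phi, Y, z_bar, beta, sigma2, n_iter=20):
    """EM refinement of the noise variance in the simplified initialization model.

    Phi: list of (K, N_i) design matrices (active bases only); Y: list of (N_i,)
    measurements; z_bar: (K,) mean coefficients; beta: (K,) precisions."""
    for _ in range(n_iter):
        num, den = 0.0, 0
        for Phi_i, y_i in zip(Phi, Y):
            Psi_i = Phi_i @ Phi_i.T
            S_i = np.linalg.inv(Psi_i / sigma2 + np.diag(beta))       # E-step covariance
            m_i = (y_i - z_bar @ Phi_i) @ Phi_i.T @ S_i / sigma2      # E-step mean
            r_i = y_i - z_bar @ Phi_i
            num += r_i @ (r_i - 2.0 * m_i @ Phi_i) \
                   + np.trace((S_i + np.outer(m_i, m_i)) @ Psi_i)
            den += len(y_i)
        sigma2 = num / den                                            # M-step
    return sigma2
```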
In practice, we need only $\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i^{A}]$, $\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i^{A\,T}\tilde{Z}_i^{A}]$, and $S_i^{A}$, and they can be calculated using the $K^{(a)}$ active basis functions. Thus, similar to [73], all computations involve only the active basis functions, which makes the procedure computationally efficient. This is described in Algorithm 3.
\[
\mathcal{P} = \ln \Pr[Y \mid \bar{Z}, \beta, \sigma] = \sum_{i=1}^{P}\ln \Pr[Y_i \mid \bar{Z}, \beta, \sigma] = \sum_{i=1}^{P}\mathcal{P}_i
\]
\[
\mathcal{P}_i = \ln\int \Pr[Y_i \mid \tilde{Z}_i, \bar{Z}, \beta, \sigma]\,\Pr[\tilde{Z}_i \mid \bar{Z}, \beta]\,d\tilde{Z}_i = \ln \mathbb{E}_{\tilde{Z}_i \sim \mathcal{N}(\bar{Z},\, \mathrm{diag}(\beta)^{-1})}\big[\Pr[Y_i \mid \tilde{Z}_i, \sigma]\big] = \ln \mathcal{N}(Y_i \mid \bar{Z}\Phi_i, C_i)
\]
\[
\Pr[Y_i \mid \tilde{Z}_i, \bar{Z}, \beta, \sigma] = \mathcal{N}\big(Y_i \mid (\tilde{Z}_i + \bar{Z})\Phi_i,\ \sigma^2 I\big)
\]
Algorithm 3 Multi-sample relevance vector machine
  • while P is not converged do
  •      k a random number that satisfies CosSim ( ϕ k , ϕ A ) τ sim O ( K ( a ) 3 )
  •      s i k Φ i k C i / k 1 Φ i k T , i ▹ Sparsity factor. O P max ( K ( a ) 3 , max i ( N i ) 2 )
  •      q i k Φ i k C i / k 1 ( Y Z ¯ A Φ i A ) T , i ▹ Quality factor. O P max i ( N i ) max ( K ( a ) , max i ( N i ) )
  •      θ k i = 1 P s i k 2 i = 1 P ( q i k 2 s i k )
  •     if  θ > 0  then
  •          β k θ k ▹ Precision is finite
  •     else
  •          β k ▹ Precision is infinite and the dimension is removed
  •     end if
  •      Φ i A All Φ i k that has β k < , i
  •      C i = β k < Φ i k T β k 1 Φ i k + σ 2 I , i
  •      Z ¯ A i = 1 P Y i C i 1 Φ i A T i = 1 P ( Φ i A C i 1 Φ i A T ) 1 O P K ( a ) max K ( a ) , max i ( N i ) 2
  •      S i A ( Φ i A Φ i A T σ 2 + diag ( β A ) ) 1 , i
  •      E Q Z ˜ [ Z ˜ i A ] σ 2 ( Y i Z ¯ A Φ i A ) Φ i A T S i A , i O ( P K ( a ) 2 max i ( N i ) )
  •      E Q Z ˜ [ Z ˜ i A T Z ˜ i A ] S i A + E Q Z ˜ [ Z ˜ i A ] T E Q Z ˜ [ Z ˜ i A ] , i
  •      σ i = 1 P ( Y i Z ¯ A Φ i A ) ( Y i Z ¯ A Φ i A 2 Φ i A T E Q Z ˜ Z ˜ i A T ) T + Tr ( E Q Z ˜ Z ˜ i A T Z ˜ i A Φ i A Φ i A T ) i = 1 P N i
  • end while
We apply Sylvester's determinant theorem to Equation (157) and get
\[
|C_i| = |C_{i/k}|\,\big|1 + \beta_k^{-1}\Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big|
\]
We apply the Woodbury matrix identity to Equation (157) and get
\[
C_i^{-1} = C_{i/k}^{-1} - C_{i/k}^{-1}\Phi_{ik}^{T}\big(\beta_k + \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big)^{-1}\Phi_{ik}C_{i/k}^{-1}
\]
We first expand P i
\begin{align*}
\mathcal{P}_i &= \ln \Pr[Y_i \mid \bar{Z}, \sigma, \beta]
= -\tfrac{1}{2}\Big(\ln|2\pi C_i| + (Y_i - \bar{Z}\Phi_i)C_i^{-1}(Y_i - \bar{Z}\Phi_i)^{T}\Big) \\
&= -\tfrac{1}{2}\Big(N_i\ln(2\pi) + \ln|C_{i/k}| + \ln\big|1 + \beta_k^{-1}\Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big| \\
&\qquad\quad + (Y_i - \bar{Z}\Phi_i)C_{i/k}^{-1}(Y_i - \bar{Z}\Phi_i)^{T} - \big(\beta_k + \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big)^{-1}\big\|\Phi_{ik}C_{i/k}^{-1}(Y_i - \bar{Z}\Phi_i)^{T}\big\|_2^2\Big) \\
&= \mathcal{P}_{i/k} + \tfrac{1}{2}\Big(\ln\beta_k - \ln(\beta_k + s_{ik}) + \frac{q_{ik}^2}{\beta_k + s_{ik}}\Big)
\end{align*}
where we plug in Equations (165) and (166) and define $q_{ik}$, $s_{ik}$ in a similar way to [73]. The sparsity factor $s_{ik}$ measures the extent to which the basis function $\phi_k$ overlaps those already present in the model, under the measurements at the index set $X_i$. The quality factor $q_{ik}$ measures the alignment of $\phi_k$ with the error of the model at $X_i$ when that basis function is excluded. Because we represent the mean function using only the active basis functions, i.e., $\bar{Z}_k = 0$ when $\beta_k = \infty$, Equations (154) and (155) involve only the $K^{(a)}$ active basis functions.
For computational efficiency, we can compute $s_{ik}$ and $q_{ik}$ from $S_{ik} = \Phi_{ik}C_i^{-1}\Phi_{ik}^{T}$ and $Q_{ik} = \Phi_{ik}C_i^{-1}(Y_i - \bar{Z}\Phi_i)^{T}$ in a similar way to [73], as follows:
\begin{align*}
s_{ik} &= \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T} = S_{ik} + \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big(\beta_k + \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big)^{-1}\Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T} \\
&= S_{ik} + s_{ik}(\beta_k + s_{ik})^{-1}s_{ik}
\;\;\Longrightarrow\;\;
s_{ik} = \Big(1 - \frac{1}{\beta_k}S_{ik}\Big)^{-1}S_{ik} = \frac{\beta_kS_{ik}}{\beta_k - S_{ik}} \\
q_{ik} &= \Phi_{ik}C_{i/k}^{-1}(Y_i - \bar{Z}\Phi_i)^{T} = Q_{ik} + \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big(\beta_k + \Phi_{ik}C_{i/k}^{-1}\Phi_{ik}^{T}\big)^{-1}\Phi_{ik}C_{i/k}^{-1}(Y_i - \bar{Z}\Phi_i)^{T} \\
&= Q_{ik} + s_{ik}(\beta_k + s_{ik})^{-1}q_{ik}
\;\;\Longrightarrow\;\;
q_{ik} = \frac{\beta_k + s_{ik}}{\beta_k}Q_{ik} = \frac{\beta_kQ_{ik}}{\beta_k - S_{ik}}
\end{align*}

11.2. Optimization of β,Z ¯

Derivation of Equation (152):
We differentiate P with respect to β k
\[
\frac{\partial\mathcal{P}}{\partial\beta_k} = \sum_{i=1}^{P}\frac{1}{2}\Big(\beta_k^{-1} - (\beta_k + s_{ik})^{-1} - q_{ik}^2(\beta_k + s_{ik})^{-2}\Big)
= \frac{1}{2}\beta_k^{-1}\sum_{i=1}^{P}(\beta_k + s_{ik})^{-2}\Big(\beta_k\big(s_{ik} - q_{ik}^2\big) + s_{ik}^2\Big)
\]
We further adopt the approximation $s_{1k} \approx s_{2k} \approx \cdots \approx s_{Pk}$. Because $s_{ik}$ is a discrete measure of the overlap between basis functions, it should remain roughly invariant across different sampling grids $X_i$, provided the numbers of measurements are adequate and similar. Alternatively, an expectation-maximization scheme could be applied, which is guaranteed to increase the likelihood $\mathcal{P}$ at each iteration until convergence. However, we opt for this approximate gradient-based update for its speed, since the goal is only a reasonable initialization. Setting the approximate derivative to zero gives
\[
\frac{\partial\mathcal{P}}{\partial\beta_k} \approx \frac{1}{2}\beta_k^{-1}(\beta_k + s_{1k})^{-2}\sum_{i=1}^{P}\Big(\beta_k\big(s_{ik} - q_{ik}^2\big) + s_{ik}^2\Big) = 0
\]
\[
\beta_k \leftarrow \theta_k = \frac{\sum_{i=1}^{P}s_{ik}^2}{\sum_{i=1}^{P}\big(q_{ik}^2 - s_{ik}\big)}
\]
Because $\beta_k$ is a scale parameter, we need $\beta_k > 0$. Consequently, the optimal value of $\beta_k$ that maximizes $\mathcal{P}$ depends on the sign of $\theta_k$. When $\theta_k > 0$, the maximum of $\mathcal{P}$ is achieved at $\beta_k = \theta_k$.
On the other hand, when $\theta_k \le 0$, $\mathcal{P}$ is monotonically increasing with respect to $\beta_k$, and therefore we should let $\beta_k \to \infty$ in order to maximize $\mathcal{P}$.
More intuitively, Equation (174) can be regarded as a weighted summation of the estimation of β k using each individual sample function and it automatically assigns more weights to those with more measurements. Therefore, this optimization strategy is supposed to provide reasonable estimates even when the sampled functions have different numbers of measurements.
Derivation of Equation (158):
We differentiate $\mathcal{P}$ with respect to $\bar{Z}$ and set the derivative to zero to get
\[
\frac{\partial\mathcal{P}}{\partial\bar{Z}} = -\frac{1}{2}\sum_{i=1}^{P}\Big(\!-2Y_iC_i^{-1}\Phi_i^{T} + 2\bar{Z}\Phi_iC_i^{-1}\Phi_i^{T}\Big) = 0
\]
\[
\bar{Z} \leftarrow \sum_{i=1}^{P}Y_iC_i^{-1}\Phi_i^{T}\ \Big(\sum_{i=1}^{P}\Phi_iC_i^{-1}\Phi_i^{T}\Big)^{-1}
\]

11.3. Optimization of σ

Derivation of Equations (159) and (160):
We use the expectation-maximization strategy with latent variables $\tilde{Z}_i$, similar to that used in [52]. It introduces a surrogate function, the expected complete-data log likelihood $\mathbb{E}_{Q_{\tilde{Z}}}[\mathcal{P}_{\tilde{Z}}]$, that is easier to optimize, and in theory the process ultimately maximizes $\mathcal{P}$.
For the E-step, we calculate the posterior of Z ˜ i .
\begin{align*}
\ln \Pr[\tilde{Z}_i \mid Y_i, \bar{Z}, \sigma, \beta] &= \ln\frac{\Pr[Y_i \mid \tilde{Z}_i, \bar{Z}, \sigma, \beta]\,\Pr[\tilde{Z}_i \mid \beta]}{\Pr[Y_i \mid \bar{Z}, \sigma, \beta]} \\
&\propto -\frac{1}{2}\Big(\tilde{Z}_i\big(\Psi_i\sigma^{-2} + \mathrm{diag}(\beta)\big)\tilde{Z}_i^{T} - 2\sigma^{-2}(Y_i - \bar{Z}\Phi_i)\Phi_i^{T}\tilde{Z}_i^{T}\Big)
\end{align*}
Therefore,
\[
\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i] \leftarrow \sigma^{-2}(Y_i - \bar{Z}\Phi_i)\Phi_i^{T}S_i, \qquad
\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i^{T}\tilde{Z}_i] \leftarrow S_i + \mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i]^{T}\,\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i],
\]
where
\[
S_i = \big(\Psi_i\sigma^{-2} + \mathrm{diag}(\beta)\big)^{-1}.
\]
Derivation of Equation (161):
In the M-step, we need to maximize $\mathbb{E}_{Q_{\tilde{Z}}}[\mathcal{P}_{\tilde{Z}}]$ conditioned on $Q_{\tilde{Z}}$ with respect to $\sigma^2$,
\[
\mathcal{P}_{\tilde{Z}} = \sum_{i=1}^{P}\ln \Pr[Y_i, \tilde{Z}_i \mid \bar{Z}, \sigma, \beta] = \sum_{i=1}^{P}\ln\big(\Pr[Y_i \mid \tilde{Z}_i, \bar{Z}, \sigma]\,\Pr[\tilde{Z}_i \mid \beta]\big)
= -\frac{1}{2}\sum_{i=1}^{P}\Big(N_i\ln(2\pi\sigma^2) + \sigma^{-2}\big\|Y_i - (\tilde{Z}_i + \bar{Z})\Phi_i\big\|_2^2 + \sum_{k=1}^{K}\ln\big(2\pi\beta_k^{-1}\big) + \mathrm{Tr}\big(\tilde{Z}_i\,\mathrm{diag}(\beta)\tilde{Z}_i^{T}\big)\Big)
\]
We differentiate E Q Z ˜ [ P Z ˜ ] with respect to σ 2 and set to 0
\[
\frac{\partial\,\mathbb{E}_{Q_{\tilde{Z}}}[\mathcal{P}_{\tilde{Z}}]}{\partial\sigma^2} = \mathbb{E}_{Q_{\tilde{Z}}}\Big[\!-\frac{1}{2}\sum_{i=1}^{P}\Big(N_i\sigma^{-2} - \sigma^{-4}\big\|Y_i - (\tilde{Z}_i + \bar{Z})\Phi_i\big\|_2^2\Big)\Big] = 0
\]
\[
\sigma^2 \leftarrow \frac{\sum_{i=1}^{P}\mathbb{E}_{Q_{\tilde{Z}}}\big[\|Y_i - (\tilde{Z}_i + \bar{Z})\Phi_i\|_2^2\big]}{\sum_{i=1}^{P}N_i}
= \frac{\sum_{i=1}^{P}\Big[(Y_i - \bar{Z}\Phi_i)\big(Y_i - \bar{Z}\Phi_i - 2\Phi_i^{T}\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i]^{T}\big)^{T} + \mathrm{Tr}\big(\mathbb{E}_{Q_{\tilde{Z}}}[\tilde{Z}_i^{T}\tilde{Z}_i]\Psi_i\big)\Big]}{\sum_{i=1}^{P}N_i}
\]

12. Experiments

12.1. Benchmark Simulation

Figure 12 presents the application of the proposed BSFDA to the simulation benchmark (Scenario 1) outlined in [37]. While prior analyses have utilized this benchmark, the current experimental configuration is specifically adapted to highlight the method’s capacity for uncertainty quantification. The experimental design consists of 20 functional observations, each sampled at either 3 points (with a 20% probability) or 10 points (with an 80% probability), determined via random assignment. The number of sampled functions is decreased from 200 to 20 to underscore the effect and estimation of uncertainties. The actual white noise standard deviation is 0.4472, while the estimated standard deviation is 0.4839. The component number is also correctly estimated as 3. The figure depicts the true underlying function, the discrete observational data, and the corresponding functional estimates, accompanied by their respective 95% truncated uncertainty intervals.
Notably, the uncertainty associated with sparsely sampled functions exhibits substantial inflation in regions devoid of observations. In contrast, in sampled regions, the uncertainty aligns closely with that of densely sampled functions, approximating twice the standard deviation of the white noise. Additionally, the uncertainty bounds for the estimated mean function are presented, demonstrating reduced variability relative to individual function estimates.
Figure 12. Application of the proposed BSFDA to the simulation benchmark from [37], illustrating the true mean function (blue), observed measurements from two functions sampled at different densities (light blue for sparse, orange for dense), and the corresponding functional estimates with 95% truncated uncertainty intervals.
Table 9. Distributions of the estimated component number r ^ for Scenario 1 (r=3).
N_i   r̂   AIC_PACE   AIC   BIC   PC_p1   IC_p1   AIC_PACE2022   BIC_PACE2022   fpca   BSFDA   BSFDA_Fast
5 ≤1 0.000 0.000 0.155 0.005 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.008 0.405 0.335 0.565 0.215 0.000 0.000 0.000 0.000 0.985
=3 0.000 0.580 0.380 0.410 0.735 0.650 0.880 0.645 0.995 0.015
=4 0.121 0.010 0.115 0.010 0.045 0.335 0.120 0.235 0.005 0.000
≥5 0.870 0.005 0.015 0.010 0.005 0.015 0.000 0.120 0.000 0.000
10 ≤1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.005 0.040 0.040 0.005 0.000 0.000 0.000 0.000 0.075
=3 0.000 0.980 0.670 0.955 0.985 0.880 0.920 0.645 1.000 0.910
=4 0.000 0.015 0.255 0.000 0.010 0.120 0.080 0.235 0.000 0.015
≥5 1.000 0.000 0.035 0.005 0.000 0.000 0.000 0.120 0.000 0.000
50 ≤1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=3 0.000 1.000 0.830 1.000 1.000 1.000 1.000 0.890 0.980 0.945
=4 0.000 0.000 0.150 0.000 0.000 0.000 0.000 0.060 0.020 0.050
≥5 1.000 0.000 0.020 0.000 0.000 0.000 0.000 0.050 0.000 0.005
Table 10. Distributions of the estimated component number r ^ for Scenario 2 (r=3).
N_i   r̂   AIC_PACE   AIC   BIC   PC_p1   IC_p1   AIC_PACE2022   BIC_PACE2022   fpca   BSFDA   BSFDA_Fast
5 ≤1 0.000 0.000 0.230 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.205 0.395 0.000 0.140 0.050 0.075 0.000 0.000 0.960
=3 0.005 0.630 0.245 0.375 0.605 0.570 0.620 0.475 1.000 0.040
=4 0.125 0.155 0.110 0.440 0.210 0.345 0.275 0.350 0.000 0.000
≥5 0.870 0.010 0.020 0.185 0.045 0.035 0.030 0.175 0.000 0.000
10 ≤1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.000 0.170 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=3 0.000 0.710 0.665 0.570 0.805 0.825 0.850 0.640 1.000 0.995
=4 0.005 0.260 0.135 0.355 0.185 0.175 0.150 0.235 0.000 0.005
≥5 0.995 0.030 0.030 0.075 0.010 0.000 0.000 0.125 0.000 0.000
50 ≤1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=3 0.000 0.630 0.795 0.955 0.945 1.000 1.000 0.950 1.000 0.950
=4 0.000 0.320 0.185 0.045 0.055 0.000 0.000 0.020 0.000 0.050
≥5 1.000 0.050 0.020 0.000 0.000 0.000 0.000 0.030 0.000 0.000
Table 11. Distributions of the estimated component number r ^ for Scenario 3 (r=3).
N_i   r̂   AIC_PACE   AIC   BIC   PC_p1   IC_p1   AIC_PACE2022   BIC_PACE2022   fpca   BSFDA   BSFDA_Fast
5 ≤1 0.000 0.000 0.335 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.025 0.035 0.260 0.220 0.005 0.000 0.005 0.000 0.000 0.025
=3 0.005 0.720 0.325 0.640 0.590 0.320 0.400 0.450 0.995 0.945
=4 0.130 0.170 0.080 0.075 0.280 0.640 0.565 0.360 0.005 0.030
≥5 0.840 0.075 0.000 0.065 0.125 0.030 0.030 0.190 0.000 0.000
10 ≤1 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.015 0.000 0.035 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=3 0.000 0.580 0.770 0.965 0.665 0.740 0.755 0.440 0.995 1.000
=4 0.000 0.400 0.145 0.030 0.320 0.260 0.245 0.380 0.005 0.000
≥5 0.985 0.020 0.045 0.005 0.015 0.000 0.000 0.180 0.000 0.000
50 ≤1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.015 0.000
=3 0.000 1.000 0.775 1.000 1.000 1.000 1.000 0.765 0.980 0.920
=4 0.000 0.000 0.200 0.000 0.000 0.000 0.000 0.110 0.005 0.050
≥5 1.000 0.000 0.025 0.000 0.000 0.000 0.000 0.125 0.000 0.030
Table 12. Distributions of the estimated component number r ^ for Scenario 4 (r=3).
N_i   r̂   AIC_PACE   AIC   BIC   PC_p1   IC_p1   AIC_PACE2022   BIC_PACE2022   fpca   BSFDA   BSFDA_Fast
5 ≤1 0.000 0.000 0.315 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.015 0.020 0.180 0.160 0.015 0.000 0.000 0.000 0.000 0.000
=3 0.015 0.710 0.410 0.640 0.560 0.515 0.575 0.370 1.000 0.975
=4 0.145 0.185 0.070 0.095 0.260 0.450 0.390 0.515 0.000 0.025
≥5 0.825 0.085 0.025 0.105 0.165 0.035 0.035 0.115 0.000 0.000
10 ≤1 0.000 0.000 0.010 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=3 0.000 0.830 0.775 0.920 0.900 0.750 0.760 0.350 0.995 0.990
=4 0.000 0.150 0.190 0.045 0.085 0.250 0.240 0.380 0.005 0.010
≥5 1.000 0.020 0.020 0.035 0.015 0.000 0.000 0.270 0.000 0.000
50 ≤1 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 0.000
=3 0.000 0.945 0.835 1.000 1.000 1.000 1.000 0.730 0.950 0.935
=4 0.000 0.055 0.140 0.000 0.000 0.000 0.000 0.160 0.040 0.055
≥5 1.000 0.000 0.025 0.000 0.000 0.000 0.000 0.110 0.000 0.010
Table 13. Distributions of the estimated component number r ^ for Scenario 5 (r=6).
N_i   r̂   AIC_PACE   AIC   BIC   PC_p1   IC_p1   AIC_PACE2022   BIC_PACE2022   fpca   BSFDA   BSFDA_Fast
5 ≤4 0.005 0.165 0.835 0.580 0.060 0.000 0.000 0.010 0.000 0.060
=5 0.005 0.330 0.020 0.345 0.335 0.575 0.590 0.010 0.075 0.515
=6 0.705 0.470 0.090 0.070 0.545 0.425 0.410 0.855 0.925 0.160
=7 0.245 0.035 0.050 0.005 0.060 0.000 0.000 0.115 0.000 0.160
≥8 0.040 0.000 0.005 0.000 0.000 0.000 0.000 0.010 0.000 0.105
10 ≤4 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=5 0.000 0.000 0.030 0.145 0.000 0.425 0.425 0.000 0.000 0.000
=6 0.065 0.570 0.525 0.775 0.705 0.575 0.575 0.500 1.000 0.930
=7 0.475 0.280 0.165 0.020 0.185 0.000 0.000 0.405 0.000 0.035
≥8 0.455 0.150 0.030 0.060 0.110 0.000 0.000 0.095 0.000 0.035
50 ≤4 0.000 0.000 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
=5 0.065 0.000 0.000 0.000 0.000 0.130 0.130 0.005 0.000 0.000
=6 0.000 0.260 0.590 0.980 0.965 0.870 0.770 0.695 0.995 0.925
=7 0.000 0.405 0.325 0.010 0.035 0.000 0.000 0.250 0.005 0.045
≥8 0.935 0.335 0.080 0.010 0.000 0.000 0.000 0.050 0.000 0.030

12.1.1. Performance of LFRM

To compare the latent factor regression model (LFRM) [34] as a dimension reduction model to ours, Bayesian scalable functional data analysis (BSFDA), we set the covariates in LFRM to zero, thus assigning standard Gaussian priors to the latent variables, analogous to our approach. We followed the simulation benchmark in [37] for selecting the number of components, focusing on Scenario 1 with 50 measurements per function (the densest data). Because LFRM does not estimate a mean function, we omitted the mean from the simulation run here.
The following hyperparameters of LFRM need to be determined:
  • Gamma prior for white noise and correlated noise
  • Length-scale
  • Number of basis functions
  • Number of iterations
LFRM, with its default white-noise prior, correctly identified the white-noise variance (true value 0.2) in all tests. We thus retained that default. We tested different Gamma priors for correlated noise: the default prior, a noninformative-like (vaguer) prior (same mean but 100 times the variance), and a low-noise prior (same variance but 100 times the mean). We maintained the number of locations for basis functions at 10, which is the default setting. For the length-scale in LFRM, we first used the best estimate from our cross-validation (CV). We then tried all 10 CV-selected length scales, producing 100 basis functions in total. However, this required substantial time, so we performed only two repeated runs for that setting. We kept LFRM's default of 5000 burn-in iterations (25,000 total) with thinning at intervals of 5, verifying convergence through trace plots in line with [34]. Meanwhile, BSFDA was run 200 times as in Section 5, LFRM (10 length scales) 2 times, and all other settings 10 times.
Across repeated trials, LFRM consistently overestimated the true number of components (which is 3). Specifically:
  • Standard LFRM estimated 10–14 components.
  • LFRM with 10 length-scales estimated 6–8 components.
  • LFRM with a low correlated-noise prior estimated 8–15 components.
  • LFRM with a noninformative-like correlated-noise prior estimated 10–14 components.
In contrast, our method BSFDA produced a clear gap in the distribution of the precision parameters, effectively separating effective dimensions from redundant ones.
Several factors may explain LFRM’s performance:
  • Correlated noise interference: The correlated noise can obscure the true signal.
  • Prior specification: LFRM's precision parameter priors are potentially less noninformative and not as sparsity-inducing as the sparse Bayesian learning priors [52] used in BSFDA.
  • Element-wise vs. Column-wise Precision: The element-wise precision parameters in LFRM might compensate in a way that reduces overall sparsity.

12.2. Variational Inference vs. MCMC

We conducted experiments using both Gibbs sampling (MCMC) and mean-field approximation (Variational Inference, VI) for the Bayesian PCA simulation [45,62] under varying noise levels. In our experiments, "satisfactory estimation" is defined as the point at which the 4th smallest precision (i.e., the inverse of the variance) is at least 100 times smaller than the 5th, indicating that the four true signal dimensions (with variances [5, 4, 3, 2]) have been correctly identified. For computational tractability, we capped VI at 200,000 iterations (approximately 200 seconds) and MCMC sampling at 20,000 iterations (about 20 minutes), with a burn-in period of 200 iterations and thinning set to 10.
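For concreteness, the stopping criterion can be written as a small check like the following (a hypothetical helper, not code from our implementation):

```python
import numpy as np

def correct_dimension_found(precisions, n_signal=4, ratio=100.0):
    """Return True if the n_signal-th smallest precision is at least `ratio`
    times smaller than the (n_signal+1)-th smallest, i.e., the signal and
    noise dimensions are clearly separated."""
    p = np.sort(np.asarray(precisions))
    return p[n_signal] >= ratio * p[n_signal - 1]

# Four signal dimensions (variances [5, 4, 3, 2] -> small precisions) plus noise dims.
print(correct_dimension_found([1/5, 1/4, 1/3, 1/2, 80.0, 95.0, 120.0]))  # True
```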
Figure 13 illustrates the runtime for VI and MCMC to identify the correct components. Our key findings are as follows:
1.
When the noise level is close to the signal, neither MCMC nor VI found the true dimension within the limited iterations (and probably never would), because the data are heavily polluted by noise.
2.
As the noise level decreases toward zero, the number of iterations (and runtime) required for satisfactory estimation increases dramatically; VI begins to fail around a noise level of $1 \times 10^{-4}$, and MCMC sampling around $1 \times 10^{-3}$, within the set time constraints.
3.
Across the 10 noise levels (about $3 \times 10^{-3}$ to $2 \times 10^{-1}$) where both successfully identified the correct dimensionality, VI consistently completes much faster than MCMC sampling: at least roughly 20 times faster, and 85.57 ± 50.24 times faster on average, with speedups ranging from 32.46 to 189.12.
Figure 13. Time for variational inference and MCMC to identify the correct components in Bayesian PCA.
These results indicate that both MCMC sampling and VI become slower as the noise decreases, due to strong dependencies in the posterior. We hypothesize that both MCMC and VI suffer from the dependency introduced by low noise, a known long-standing issue with ongoing research on remedies, e.g., structured VI [72] or blocked/collapsed Gibbs samplers [74]. Both nevertheless work well provided there are sufficient iterations. This behavior suggests that the dependency induced by very low noise levels creates an optimization challenge rather than a fundamental modeling issue.
In summary: (1) VI is significantly faster than MCMC, (2) both methods slow down as the noise level decreases, and (3) both fail to recover the correct components when the noise is excessively high.

References

  1. Wang, J.L.; Chiou, J.M.; Müller, H.G. Functional Data Analysis. Annual Review of Statistics and Its Application 2016, 3, 257–295. [CrossRef]
  2. Ramsay, J.O.; Silverman, B.W. Applied functional data analysis: methods and case studies; Springer, 2002.
  3. Rice, J.A.; Silverman, B.W. Estimating the Mean and Covariance Structure Nonparametrically When the Data are Curves. Journal of the Royal Statistical Society Series B: Statistical Methodology 1991, 53, 233–243. [CrossRef]
  4. Górecki, T.; Krzyśko, M.; Waszak, Ł.; Wołyński, W. Selected statistical methods of data analysis for multivariate functional data. Statistical Papers 2018, 59, 153–182. [CrossRef]
  5. Aneiros, G.; Cao, R.; Fraiman, R.; Genest, C.; Vieu, P. Recent advances in functional data analysis and high-dimensional statistics. Journal of Multivariate Analysis 2019, 170, 3–9. [CrossRef]
  6. Li, Y.; Qiu, Y.; Xu, Y. From multivariate to functional data analysis: Fundamentals, recent developments, and emerging areas. Journal of Multivariate Analysis 2022, 188, 104806. [CrossRef]
  7. Happ, C.; Greven, S. Multivariate Functional Principal Component Analysis for Data Observed on Different (Dimensional) Domains. Journal of the American Statistical Association 2018, 113, 649–659. [CrossRef]
  8. Kowal, D.R.; Canale, A. Semiparametric Functional Factor Models with Bayesian Rank Selection. Bayesian Analysis 2023, 18. [CrossRef]
  9. Paulon, G.; Müller, P.; Sarkar, A. Bayesian Semiparametric Hidden Markov Tensor Models for Time Varying Random Partitions with Local Variable Selection. Bayesian Analysis 2024, 19. [CrossRef]
  10. Goldsmith, J.; Zipunnikov, V.; Schrack, J. Generalized Multilevel Function-on-Scalar Regression and Principal Component Analysis. Biometrics 2015, 71, 344–353. [CrossRef]
  11. Nguyen, X.; Gelfand, A.E. The Dirichlet labeling process for clustering functional data. Statistica Sinica 2011, 21, 1249–1289. [CrossRef]
  12. Petrone, S.; Guindani, M.; Gelfand, A.E. Hybrid Dirichlet Mixture Models for Functional Data. Journal of the Royal Statistical Society Series B: Statistical Methodology 2009, 71, 755–782. [CrossRef]
  13. Berrendero, J.; Justel, A.; Svarc, M. Principal components for multivariate functional data. Computational Statistics & Data Analysis 2011, 55, 2619–2634. [CrossRef]
  14. Suarez, A.J.; Ghosal, S. Bayesian Estimation of Principal Components for Functional Data. Bayesian Analysis 2017, 12. [CrossRef]
  15. Yang, J.; Zhu, H.; Choi, T.; Cox, D.D. Smoothing and Mean–Covariance Estimation of Functional Data with a Bayesian Hierarchical Model. Bayesian analysis 2016/09, 11, 649–670. [CrossRef]
  16. Fox, E.B.; Dunson, D.B. Bayesian nonparametric covariance regression. The Journal of Machine Learning Research 2015, 16, 2501–2542.
  17. Goldsmith, J.; Wand, M.P.; Crainiceanu, C. Functional regression via variational Bayes. Electronic journal of statistics 2011/01/01, 5, 572–602. [CrossRef]
  18. Sun, T.Y.; Kowal, D.R. Ultra-Efficient MCMC for Bayesian Longitudinal Functional Data Analysis. Journal of Computational and Graphical Statistics 2024, pp. 1–13. [CrossRef]
  19. Rasmussen, C.E.; Williams, C.K.I. Gaussian processes for machine learning; Adaptive computation and machine learning, MIT Press: Cambridge, Mass, 2006; p. 248.
  20. Yao, F.; Müller, H.G.; Wang, J.L. Functional Data Analysis for Sparse Longitudinal Data. Journal of the American Statistical Association 2005, 100, 577–590. [CrossRef]
  21. Di, C.Z.; Crainiceanu, C.M.; Caffo, B.S.; Punjabi, N.M. Multilevel functional principal component analysis. The Annals of Applied Statistics 2009, 3, 458–488. [CrossRef]
  22. Peng, J.; Paul, D. A geometric approach to maximum likelihood estimation of the functional principal components from sparse longitudinal data. Journal of Computational and Graphical Statistics 2009, 18, 995–1015.
23. Goldsmith, J.; Schwartz, J.E. Variable selection in the functional linear concurrent model. Statistics in Medicine 2017, 36, 2237–2250. [CrossRef]
  24. Xiao, L.; Li, C.; Checkley, W.; Crainiceanu, C. Fast covariance estimation for sparse functional data. Statistics and Computing 2018, 28, 511–522. [CrossRef]
  25. Chiou, J.M.; Yang, Y.F.; Chen, Y.T. Multivariate functional principal component analysis: A normalization approach. Statistica Sinica 2014. [CrossRef]
  26. Jacques, J.; Preda, C. Model-based clustering for multivariate functional data. Computational Statistics & Data Analysis 2014, 71, 92–106. [CrossRef]
27. Yang, J.; Cox, D.D.; Lee, J.S.; Ren, P.; Choi, T. Efficient Bayesian Hierarchical Functional Data Analysis with Basis Function Approximations Using Gaussian–Wishart Processes. Biometrics 2017, 73, 1082–1091. [CrossRef]
  28. Trefethen, L.N. Approximation theory and approximation practice, extended edition; SIAM, 2019.
  29. Bungartz, H.J.; Griebel, M. Sparse grids. Acta numerica 2004, 13, 147–269.
  30. Zipunnikov, V.; Caffo, B.; Yousem, D.M.; Davatzikos, C.; Schwartz, B.S.; Crainiceanu, C. Multilevel functional principal component analysis for high-dimensional data. Journal of Computational and Graphical Statistics 2011, 20, 852–873. [CrossRef]
  31. Yao, F.; Lei, E.; Wu, Y. Effective dimension reduction for sparse functional data. Biometrika 2015, 102, 421–437. [CrossRef]
  32. Shi, H.; Yang, Y.; Wang, L.; Ma, D.; Beg, M.F.; Pei, J.; Cao, J. Two-Dimensional Functional Principal Component Analysis for Image Feature Extraction. Journal of Computational and Graphical Statistics 2022, 31, 1127–1140. [CrossRef]
  33. Van Der Linde, A. Variational Bayesian functional PCA. Computational Statistics & Data Analysis 2008, 53, 517–533. [CrossRef]
  34. Montagna, S.; Tokdar, S.T.; Neelon, B.; Dunson, D.B. Bayesian Latent Factor Regression for Functional and Longitudinal Data. Biometrics 2012, 68, 1064–1073. [CrossRef]
35. Kowal, D.R.; Bourgeois, D.C. Bayesian Function-on-Scalars Regression for High-Dimensional Data. Journal of Computational and Graphical Statistics 2020, 29, 629–638. [CrossRef]
36. Sousa, P.H.T.O.; Souza, C.P.E.d.; Dias, R. Bayesian adaptive selection of basis functions for functional data representation. Journal of Applied Statistics 2024, 51, 958–992. [CrossRef]
  37. Li, Y.; Wang, N.; Carroll, R.J. Selecting the Number of Principal Components in Functional Data. Journal of the American Statistical Association 2013, 108, 1284–1294. [CrossRef]
  38. Huang, L.; Reiss, P.T.; Xiao, L.; Zipunnikov, V.; Lindquist, M.A.; Crainiceanu, C.M. Two-way principal component analysis for matrix-variate data, with an application to functional magnetic resonance imaging data. Biostatistics 2017, 18, 214–229. [CrossRef]
  39. Lynch, B.; Chen, K. A test of weak separability for multi-way functional data, with application to brain connectivity studies. Biometrika 2018, 105, 815–831. [CrossRef]
40. Shamshoian, J.; Şentürk, D.; Jeste, S.; Telesca, D. Bayesian analysis of longitudinal and multidimensional functional data. Biostatistics 2022, 23, 558–573. [CrossRef]
41. Huo, S.; Morris, J.S.; Zhu, H. Ultra-Fast Approximate Inference Using Variational Functional Mixed Models. Journal of Computational and Graphical Statistics 2023, 32, 353–365. [CrossRef]
  42. Jolliffe, I.T.; Cadima, J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 2016, 374, 20150202. [CrossRef]
  43. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology 1999, 61, 611–622.
  44. Ilin, A.; Raiko, T. Practical approaches to principal component analysis in the presence of missing values. The Journal of Machine Learning Research 2010, 11, 1957–2000.
45. Bishop, C.M. Bayesian PCA. Advances in neural information processing systems 1998, 11.
  46. Tipping, M.E.; Bishop, C.M. Mixtures of probabilistic principal component analyzers. Neural computation 1999, 11, 443–482. [CrossRef]
  47. Minka, T. Automatic choice of dimensionality for PCA. Advances in neural information processing systems 2000, 13.
  48. Li, J.; Tao, D. On Preserving Original Variables in Bayesian PCA With Application to Image Analysis. IEEE Transactions on Image Processing 2012, 21, 4830–4843. [CrossRef]
  49. Bouveyron, C.; Latouche, P.; Mattei, P.A. Bayesian variable selection for globally sparse probabilistic PCA. Electronic Journal of Statistics 2018, 12. [CrossRef]
  50. Ning, Y.C.B.; Ning, N. Spike and slab Bayesian sparse principal component analysis. Statistics and Computing 2024, 34, 118.
  51. Bhattacharya, A.; Dunson, D.B. Sparse Bayesian infinite factor models. Biometrika 2011, 98, 291–306. [CrossRef]
  52. Tipping, M.E. Sparse Bayesian learning and the relevance vector machine. Journal of machine learning research 2001, 1, 211–244.
  53. Tipping, M. The relevance vector machine. Advances in neural information processing systems 1999, 12.
54. Faul, A.C.; Tipping, M.E. Analysis of Sparse Bayesian Learning. In Advances in Neural Information Processing Systems 14; The MIT Press, 2002; pp. 383–390.
55. MacKay, D.J. Bayesian methods for backpropagation networks. In Models of neural networks III: association, generalization, and representation; Springer, 1996; pp. 211–254.
  56. Neal, R.M. Bayesian learning for neural networks; Vol. 118, Springer Science & Business Media, 2012.
  57. Palmer, J.; Rao, B.; Wipf, D. Perspectives on sparse Bayesian learning. Advances in neural information processing systems 2003, 16.
  58. Wipf, D.; Nagarajan, S. A new view of automatic relevance determination. Advances in neural information processing systems 2007, 20.
59. Wipf, D.; Rao, B. ℓ0-norm minimization for basis selection. In Proceedings of the 17th International Conference on Neural Information Processing Systems, 2004, pp. 1513–1520.
  60. Wipf, D.P.; Rao, B.D.; Nagarajan, S. Latent Variable Bayesian Models for Promoting Sparsity. IEEE Transactions on Information Theory 2011, 57, 6236–6255. [CrossRef]
  61. Wipf, D.; Rao, B. Sparse Bayesian Learning for Basis Selection. IEEE Transactions on Signal Processing 2004, 52, 2153–2164. [CrossRef]
62. Bishop, C.M. Variational Principal Components. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN’99); IEE, 1999; pp. 509–514.
63. Nakajima, S.; Sugiyama, M.; Babacan, D. On Bayesian PCA: Automatic dimensionality selection and analytic solution. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 497–504.
64. Girolami, M.; Rogers, S. Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 241–248.
65. Bishop, C.M.; Tipping, M.E. Variational relevance vector machines. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000, pp. 46–53.
  66. Cheng, L.; Yin, F.; Theodoridis, S.; Chatzis, S.; Chang, T.H. Rethinking Bayesian Learning for Data Analysis: The Art of Prior and Inference in Sparsity-Aware Modeling. IEEE Signal Processing Magazine 2022, 39, 18–52. [CrossRef]
67. Qi, Y.A.; Minka, T.P.; Picard, R.W.; Ghahramani, Z. Predictive automatic relevance determination by expectation propagation. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), Banff, Alberta, Canada, 2004; p. 85. [CrossRef]
  68. Babacan, S.D.; Luessi, M.; Molina, R.; Katsaggelos, A.K. Sparse Bayesian Methods for Low-Rank Matrix Estimation. IEEE Transactions on Signal Processing 2012, 60, 3964–3977. [CrossRef]
69. Zhao, Q.; Meng, D.; Xu, Z.; Zuo, W.; Yan, Y. L1-Norm Low-Rank Matrix Factorization by Variational Bayesian Method. IEEE Transactions on Neural Networks and Learning Systems 2015, 26, 825–839. [CrossRef]
  70. Guan, Y.; Dy, J. Sparse probabilistic principal component analysis. In Proceedings of the Artificial Intelligence and Statistics. PMLR, 2009, pp. 185–192.
  71. Wong, A.P.S.; Wijffels, S.E.; Riser, S.C.; Pouliquen, S.; Hosoda, S.; Roemmich, D.; Gilson, J.; Johnson, G.C.; Martini, K.; Murphy, D.J.; et al. Argo Data 1999-2019: Two Million Temperature-Salinity Profiles and Subsurface Velocity Observations From a Global Array of Profiling Floats. Frontiers in Marine Science 2020, 7, 700. [CrossRef]
  72. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association 2017, 112, 859–877. [CrossRef]
  73. Tipping, M.E.; Faul, A.C. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of the International workshop on artificial intelligence and statistics. PMLR, 2003, pp. 276–283.
  74. Park, T.; Lee, S. Improving the Gibbs sampler. Wiley Interdisciplinary Reviews: Computational Statistics 2022, 14, e1546.
  75. Ledoit, O.; Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 2004, 88, 365–411. [CrossRef]
76. Kaslow, R.A.; Ostrow, D.G.; Detels, R.; Phair, J.P.; Polk, B.F.; Rinaldo Jr., C.R.; Multicenter AIDS Cohort Study. The Multicenter AIDS Cohort Study: rationale, organization, and selected characteristics of the participants. American Journal of Epidemiology 1987, 126, 310–318. [CrossRef]
  77. Argo float data and metadata from Global Data Assembly Centre (Argo GDAC) - Snapshot of Argo GDAC of November 09st 2024, 2024. [CrossRef]
  78. Yarger, D.; Stoev, S.; Hsing, T. A functional-data approach to the Argo data. The Annals of Applied Statistics 2022, 16, 216–246. [CrossRef]
  79. de Boyer Montégut, C.; Madec, G.; Fischer, A.S.; Lazar, A.; Iudicone, D. Mixed layer depth over the global ocean: An examination of profile data and a profile-based climatology. Journal of Geophysical Research: Oceans 2004, 109.
  80. Roemmich, D.; Gilson, J. The 2004–2008 mean and annual cycle of temperature, salinity, and steric height in the global ocean from the Argo Program. Progress in oceanography 2009, 82, 81–100. [CrossRef]
  81. Kuusela, M.; Stein, M.L. Locally stationary spatio-temporal interpolation of Argo profiling float data. Proceedings of the Royal Society A 2018, 474, 20180400. [CrossRef]
Figure 1. Probabilistic graphical model for the full model.
Figure 2. Diagrams of the variational inference algorithm for all the parameters. The top three diagrams each have a closed loop and a closed-form overall transfer function.
Figure 3. Length-scales and centers of the selected kernel basis functions in a random repetition for three different m values in Scenario 5.
Figure 4. Cross-sectional visualization of eigenfunctions (eigenvalues) of the 4D simulation.
Figure 6. Outcomes from the proposed method applied to the MACS CD4 data set. (top): Estimated curves for a random selection of 9 sampled functions and the mean function; (bottom): Estimated eigenfunctions (eigenvalues).
Figure 7. Outcomes from the proposed method applied to a wind speed data set. (a) Estimated curves for a random selection of 9 sampled functions and the mean function; (b) Estimated eigenfunctions (eigenvalues), denoted as EF; (c) Estimated covariance.
Figure 8. Temperature measurements in February 2021 near the sea surface in the ARGO dataset.
Figure 9. Geodesic interpolation from BSFDA Fast vs. actual ARGO global oceanic measurements at 1, 200, and 300 decibars, at 1°S and 30°W on May 29. Measurements are represented by circles, with fill color indicating temperature. Circle sizes show the distance in depth and time from the central point.
Figure 10. Depth–time interpolation from BSFDA Fast vs. actual ARGO global oceanic measurements at two sites, focusing on mixed-layer behavior. Measurements are represented by circles, with fill color indicating temperature. Circle sizes show the distance in geodesic space from the central point.
Table 1. Proportion of accurate estimations for Scenario 1 (r=3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE 2022 | BIC_PACE 2022 | fpca | BSFDA | BSFDA Fast
5 | 0.000 | 0.580 | 0.380 | 0.410 | 0.735 | 0.650 | 0.880 | 0.645 | 0.995 | 0.015
10 | 0.000 | 0.980 | 0.670 | 0.955 | 0.985 | 0.880 | 0.920 | 0.645 | 1.000 | 0.910
50 | 0.000 | 1.000 | 0.830 | 1.000 | 1.000 | 1.000 | 1.000 | 0.890 | 0.980 | 0.945
Table 2. Proportion of accurate estimations for Scenario 2 (r=3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE 2022 | BIC_PACE 2022 | fpca | BSFDA | BSFDA Fast
5 | 0.005 | 0.630 | 0.245 | 0.375 | 0.605 | 0.570 | 0.620 | 0.475 | 1.000 | 0.040
10 | 0.000 | 0.710 | 0.665 | 0.570 | 0.805 | 0.825 | 0.850 | 0.640 | 1.000 | 0.995
50 | 0.000 | 0.630 | 0.795 | 0.955 | 0.945 | 1.000 | 1.000 | 0.950 | 1.000 | 0.950
Table 3. Proportion of accurate estimations for Scenario 3 (r=3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE 2022 | BIC_PACE 2022 | fpca | BSFDA | BSFDA Fast
5 | 0.005 | 0.720 | 0.325 | 0.640 | 0.590 | 0.320 | 0.400 | 0.450 | 0.995 | 0.945
10 | 0.000 | 0.580 | 0.770 | 0.965 | 0.665 | 0.740 | 0.755 | 0.440 | 0.995 | 1.000
50 | 0.000 | 1.000 | 0.775 | 1.000 | 1.000 | 1.000 | 1.000 | 0.765 | 0.980 | 0.920
Table 4. Proportion of accurate estimations for Scenario 4 (r=3).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE 2022 | BIC_PACE 2022 | fpca | BSFDA | BSFDA Fast
5 | 0.015 | 0.710 | 0.410 | 0.640 | 0.560 | 0.515 | 0.575 | 0.370 | 1.000 | 0.975
10 | 0.000 | 0.830 | 0.775 | 0.920 | 0.900 | 0.750 | 0.760 | 0.350 | 0.995 | 0.990
50 | 0.000 | 0.945 | 0.835 | 1.000 | 1.000 | 1.000 | 1.000 | 0.730 | 0.950 | 0.935
Table 5. Proportion of accurate estimations for Scenario 5 (r=6).
N_i | AIC_PACE | AIC | BIC | PC_p1 | IC_p1 | AIC_PACE 2022 | BIC_PACE 2022 | fpca | BSFDA | BSFDA Fast
5 | 0.705 | 0.470 | 0.090 | 0.070 | 0.545 | 0.425 | 0.410 | 0.855 | 0.925 | 0.160
10 | 0.065 | 0.570 | 0.525 | 0.775 | 0.705 | 0.575 | 0.575 | 0.500 | 1.000 | 0.930
50 | 0.000 | 0.260 | 0.590 | 0.980 | 0.965 | 0.870 | 0.770 | 0.695 | 0.995 | 0.925
Table 6. Mean squared error of covariance (Error_CovFunc) for Scenario 5.
N_i | AIC_PACE 2022 | BIC_PACE 2022 | fpca | refund.sc | BSFDA | BSFDA Fast
5 | 12.373 ± 4.026 | 12.377 ± 4.031 | 5.192 ± 6.166 | 8.833 ± 4.730 | 5.814 ± 3.535 | 10.292 ± 12.717
10 | 10.391 ± 2.521 | 10.391 ± 2.521 | 2.098 ± 1.425 | 5.314 ± 3.501 | 2.068 ± 1.427 | 2.656 ± 1.712
50 | 9.054 ± 1.683 | 9.054 ± 1.683 | 1.642 ± 1.240 | N/A | 1.638 ± 1.247 | 1.770 ± 1.275
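For readers who want a concrete sense of a covariance-error metric such as Error_CovFunc, the following is a minimal illustrative sketch, assuming the error is the mean squared pointwise difference between an estimated covariance surface and the true one on a regular grid; the function and variable names (covariance_mse, cov_true, cov_est, grid) and the squared-exponential toy covariances are hypothetical and are not taken from the paper, whose exact definition and evaluation grid may differ.

import numpy as np

def covariance_mse(cov_true, cov_est, grid):
    # Evaluate both covariance functions on a grid x grid mesh and
    # return the mean squared pointwise difference.
    s, t = np.meshgrid(grid, grid, indexing="ij")
    diff = cov_true(s, t) - cov_est(s, t)
    return float(np.mean(diff ** 2))

# Toy usage: two squared-exponential covariances with different length-scales.
grid = np.linspace(0.0, 1.0, 101)
cov_true = lambda s, t: np.exp(-((s - t) ** 2) / (2 * 0.20 ** 2))
cov_est = lambda s, t: np.exp(-((s - t) ** 2) / (2 * 0.25 ** 2))
print(covariance_mse(cov_true, cov_est, grid))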
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.