Approximate Information and Acceleration for High-throughput Heterogeneous Data Analysis with Linear Mixed Models

Linear mixed models are frequently used for analysing heterogeneous data in a broad range of applications. The restricted maximum likelihood method is often preferred for estimating the co-variance parameters in such models because it yields unbiased estimates of the underlying variance parameters. The restricted log-likelihood function involves log determinants of a complicated co-variance matrix. Efficient statistical estimation of the underlying model parameters, and quantification of the accuracy of that estimation, require the first and second derivatives of the restricted log-likelihood function, i.e., the observed information. Standard approaches to computing the observed information and its expectation, the Fisher information, are computationally prohibitive for linear mixed models with thousands of random and fixed effects. Customized algorithms are in high demand to keep mixed model analysis scalable for ever-growing high-throughput heterogeneous data sets. In this paper, we explore how to leverage an averaged information splitting technique and dedicated matrix transforms to significantly reduce the computation and accelerate the estimation. Together with a fill-in reducing multi-frontal sparse direct solver, the averaged information splitting approach substantially improves the performance of the computation.


Introduction
Real-world data, like those in animal/plant breeding, clinical trials, ecology and evolution, genome-wide association studies and many other fields, are often heterogeneous. For example, one cannot control how many offspring an animal sires each time. In clinical trials, the groups are usually not of equal size. In repeated measurements, missing data are very common, which also results in heterogeneous data. These cases differ from those in controlled experiments (in a laboratory), where the data usually enjoy a nicely controlled block structure. For such controlled, structured block data, the classical analysis of variance (ANOVA) approach gives a statistically efficient estimate. In contrast, the restricted maximum likelihood (REML) method introduced in [1] is an attractive method for heterogeneous data analysis. This approach is conceptually simple, is widely used in animal/plant breeding [2], clinical trials, ecology and evolution [3], and genome-wide association studies [4][5][6][7], and is receiving increasing attention. In this approach, the underlying observation y ∈ R^{n×1} is modeled by the mixed model

y = Xτ + Zu + ε,

where τ ∈ R^{p×1} and u ∈ R^{b×1} are the fixed effects and random effects respectively, and ε is the noise. The random effects and the noise are independently normally distributed, such that E(u) = 0, E(ε) = 0, u ∼ N(0, σ²G), ε ∼ N(0, σ²R), where G = G(γ) and R = R(φ) are parametric co-variance matrices.
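As an illustration, the model above can be simulated directly. This is a minimal sketch; the dimensions and parameter values below are arbitrary choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 200, 3, 10              # observations, fixed effects, random effects
sigma2, gamma = 1.0, 0.5          # noise variance sigma^2 and variance ratio gamma

X = rng.standard_normal((n, p))               # fixed-effect design matrix
Z = np.eye(b)[rng.integers(0, b, size=n)]     # random-effect incidence matrix (group indicator)
tau = np.array([1.0, -0.5, 2.0])              # "true" fixed effects

G = gamma * np.eye(b)                         # G = G(gamma)
R = np.eye(n)                                 # R = R(phi)
u = rng.multivariate_normal(np.zeros(b), sigma2 * G)     # u ~ N(0, sigma^2 G)
eps = rng.multivariate_normal(np.zeros(n), sigma2 * R)   # eps ~ N(0, sigma^2 R)

y = X @ tau + Z @ u + eps                     # the linear mixed model
```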
Efficient statistical estimates τ̃ and ũ of the fixed and random effects, and uncertainty quantification for such estimates, require an estimate of the unknown co-variance parameter θ = (σ², γ, φ). Here τ̃ and ũ satisfy the mixed model equations [8]

[ X^T R^{-1} X        X^T R^{-1} Z          ] [τ̃]   [X^T R^{-1} y]
[ Z^T R^{-1} X        Z^T R^{-1} Z + G^{-1} ] [ũ] = [Z^T R^{-1} y].    (3)

The uncertainty of the estimates is quantified by σ²C^{-1}, where C is the coefficient matrix of the mixed model equations. For heterogeneous and unbalanced data sets, restricted maximum likelihood (REML) is preferred because it gives an unbiased or less biased estimate of the covariance parameter [1]. In contrast to the standard maximum likelihood estimate, which maximizes a log-likelihood function directly, the REML estimate maximizes a marginal log-likelihood function. Alternatively, one can view REML as maximizing the log-likelihood in a restricted subspace, which removes the redundant or correlated information used in estimating the fixed effects. The benefit is an unbiased or less biased estimate, as articulated in [9]. The restricted maximum likelihood method results in a restricted log-likelihood function of the form [10] (up to an additive constant)

ℓ_R(θ) = -1/2 { log det V + log det(X^T V^{-1} X) + y^T P y },

where V(θ) = σ²(R + ZGZ^T) and

P = V^{-1} − V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}.    (5)
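A dense sketch of the mixed model equations (3), assuming σ² = 1 and using an illustrative helper name; a production implementation would exploit sparsity, as discussed later in the paper.

```python
import numpy as np

def solve_mme(X, Z, y, G, R):
    """Assemble and solve the mixed model equations for (tau~, u~)."""
    Ri = np.linalg.inv(R)
    C = np.block([
        [X.T @ Ri @ X, X.T @ Ri @ Z],
        [Z.T @ Ri @ X, Z.T @ Ri @ Z + np.linalg.inv(G)],
    ])
    rhs = np.concatenate([X.T @ Ri @ y, Z.T @ Ri @ y])
    sol = np.linalg.solve(C, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]   # estimated fixed effects, predicted random effects
```

A quick sanity check on the solution: the first block row of (3) implies X^T R^{-1} e = 0 and the second implies Z^T R^{-1} e = G^{-1} ũ, where e = y − Xτ̃ − Zũ is the fitted residual.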

Objective
For many realistic problems, the variance-covariance matrices of the random effects u and the noise ε are unknown. In these cases, to quantify the estimates of the fixed and random effects, one has to estimate the variance parameters θ first. A problem of great interest is to obtain a statistically efficient estimate of the variance parameter θ for high-throughput biological data sets according to the maximum (restricted) likelihood principle. The first derivative of the log-likelihood is called the score function, which is given by [10, p.252]

Score for θ_i :  s_i(θ) = ∂ℓ_R/∂θ_i = -1/2 { tr(P ∂V/∂θ_i) − y^T P (∂V/∂θ_i) P y }.

This formula is standard; for a detailed self-contained derivation, the reader is directed to [9]. The negative of the Jacobian matrix of the score is usually referred to as the observed information. The expected value of the observed information is called the Fisher information.
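A minimal dense sketch of the score formula above, with a finite-difference check against the restricted log-likelihood; all names and test dimensions are illustrative.

```python
import numpy as np

def projection(V, X):
    """P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}."""
    Vi = np.linalg.inv(V)
    ViX = Vi @ X
    return Vi - ViX @ np.linalg.solve(X.T @ ViX, ViX.T)

def reml_score(P, y, dV):
    """Score for one parameter: -1/2 [tr(P dV) - y^T P dV P y]."""
    Py = P @ y
    return -0.5 * (np.trace(P @ dV) - Py @ dV @ Py)
```

The score can be validated numerically: differentiating ℓ_R(θ) = -1/2 [log det V + log det(X^T V^{-1} X) + y^T P y] by central differences should reproduce `reml_score`.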

Main Challenges
The objective function ℓ_R involves log determinant terms and is computationally prohibitive for high-throughput biological data sets. Challenges exist for the derivative Newton approach [11], the derivative-free approach and the Fisher-scoring approach. For a derivative (Newton-Raphson) approach, the elements of the observed information are [10][9]

(I_O)_{ij} = 1/2 tr(P V_{ij}) − 1/2 tr(P V_i P V_j) + y^T P V_i P V_j P y − 1/2 y^T P V_{ij} P y,

where V_i = ∂V/∂θ_i and V_{ij} = ∂²V/∂θ_i∂θ_j. This is too complicated for practical use. A derivative-free approach often requires more total time, even though it takes less time per iterate, because the method converges very slowly and may even fail to converge for some difficult problems [12]. The Fisher-scoring approach employs the Fisher information matrix instead of the observed information matrix. Elements of the Fisher information matrix are given by [13,14]

I_{ij} = 1/2 tr(P V_i P V_j).
The Fisher-scoring approach is a quasi-Newton method. It enjoys a simpler formula than the exact Newton-Raphson approach does. But still, it is not scalable for high-throughput biological data sets because of the four matrix-matrix products required for each element of the Fisher information matrix I.
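The cost claim above can be made concrete with a naive dense evaluation of one Fisher information element (illustrative code, not the paper's implementation):

```python
import numpy as np

def fisher_element(P, dVi, dVj):
    """I_{ij} = 1/2 tr(P dV_i P dV_j).

    Naive evaluation requires n x n matrix-matrix products for every (i, j)
    pair, i.e. O(n^3) work per element -- this is what makes Fisher scoring
    unscalable when n (the number of observations) is large.
    """
    return 0.5 * np.trace(P @ dVi @ P @ dVj)
```

By the cyclic property of the trace, the result is symmetric in (i, j), so only the upper triangle of I needs to be computed, but each element still costs dense matrix-matrix products.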

Averaged Information splitting
When V depends linearly on the variance parameter θ, i.e. ∂²V/∂θ_i∂θ_j = 0, it was observed that the average of the observed and Fisher information enjoys a simpler form [15],

(I_A)_{ij} = 1/2 y^T P V_i P V_j P y,

which can be efficiently evaluated by four matrix-vector multiplications and one inner product. This formula was used in [16] for the general case without any proof. For more general cases, we prove the following. [Averaged Information Splitting Theorem [17]] Let I_O and I be the observed information and the Fisher information for the restricted log-likelihood of the linear mixed model, respectively. Then the average of the observed and the Fisher information can be split as (I_O + I)/2 = I_A + I_Z, such that the expectation of I_A is the Fisher information matrix and E(I_Z) = 0.
Theorem 1 indicates that the I_Z part, which involves intensive computations, is negligible in expectation. This builds the foundation for using I_A as a good approximation to the negative Jacobian matrix in general cases. As a byproduct, it shows that I_A can be used as an alternative to the Fisher information matrix. In plain language, the theorem reads: the average of the Fisher information and the observed information of the restricted log-likelihood for the linear mixed model can be split into a simpler matrix, whose expectation is the Fisher information, plus a random matrix with zero expectation.
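The computational advantage of I_A can be sketched as follows: with Py precomputed, every element reduces to matrix-vector products and one inner product, avoiding the matrix-matrix products of the Fisher information. This is an illustrative dense implementation, not the paper's sparse one.

```python
import numpy as np

def ai_matrix(P, y, dVs):
    """Averaged-information matrix: (I_A)_{ij} = 1/2 y^T P dV_i P dV_j P y.

    Each entry needs only matrix-vector multiplications and one inner
    product, in contrast to the matrix-matrix products in tr(P dV_i P dV_j).
    """
    Py = P @ y
    cols = [P @ (dV @ Py) for dV in dVs]     # P dV_j P y for each parameter
    k = len(dVs)
    IA = np.empty((k, k))
    for i in range(k):
        qi = dVs[i] @ Py                     # dV_i P y
        for j in range(k):
            IA[i, j] = 0.5 * qi @ cols[j]
    return IA
```

Since P is symmetric, the resulting I_A is symmetric as well.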

Matrix transforms
The next problem to be handled is that the formula for the matrix P in (5) is too complicated to evaluate directly. Instead, we solve the following mixed model equations with multiple right-hand sides.
The choice of information matrix (or negative Jacobian matrix) distinguishes the exact Newton-Raphson method, the Fisher-scoring algorithm and the averaged information splitting approach. The differences between them are illustrated in Algorithm 2.
Such an approximation significantly reduces the computation and speeds up linear mixed model analysis, together with sparse techniques for sparse linear systems [18].
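The three methods share the same outer iteration; only the information matrix differs. A generic sketch (illustrative names, not the paper's Algorithm 2):

```python
import numpy as np

def scoring_iteration(theta0, score_fn, info_fn, tol=1e-8, max_iter=50):
    """Generic scoring update: theta <- theta + info^{-1} score.

    Newton-Raphson, Fisher scoring and the averaged information splitting
    approach differ only in the matrix returned by info_fn: the observed
    information I_O, the Fisher information I, or the approximation I_A.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        s = score_fn(theta)
        if np.linalg.norm(s) < tol:          # stationary point of l_R reached
            break
        theta = theta + np.linalg.solve(info_fn(theta), s)
    return theta
```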

Accelerating techniques: sparse factorization and fill-in reducing algorithms
One of the main computational kernels is computing the negative Jacobian matrix. In the averaged information splitting (AIS) approach, computing the approximate information matrix I_A depends heavily on an efficient factorization of the coefficient matrix in the mixed model equations (3), since the coefficient matrix C is often very sparse and the factorization is reused to solve the linear system with multiple right-hand sides (11). The amount of computation in the factorization depends on the sparsity of the factor L. Therefore, it is of great importance to keep L as sparse as possible to reduce the overall computation. An algorithm for this purpose is often referred to as a fill-in reducing algorithm; it preprocesses the factorization by exchanging rows and columns of the coefficient matrix C.
Another function of a fill-in reducing algorithm is to make the factorization process more scalable. The LDL^T factorization is based on the Gaussian elimination process, which was often believed to be essentially sequential. Fill-in reducing is also a key technique for increasing the parallelism as much as possible. Figure 1 compares the sparsity of the LDL^T factor L of a matrix C with that of the reordered matrix. Here we employ the approximate minimum degree ordering [19] and the parallel multi-frontal LDL^T factorization algorithm [20].
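The effect of a fill-in reducing ordering is easy to demonstrate. The sketch below uses scipy's SuperLU with a COLAMD column ordering as a readily available stand-in for the AMD ordering and multi-frontal LDL^T factorization used in the paper; the "arrow" matrix is a textbook worst case for the natural elimination order.

```python
from scipy import sparse
from scipy.sparse.linalg import splu

# A sparse "arrow" matrix: dense first row/column plus a strong diagonal.
n = 300
C = sparse.lil_matrix((n, n))
C.setdiag(10.0)
C[0, :] = 1.0
C[:, 0] = 1.0
C[0, 0] = 11.0
C = C.tocsc()

# Natural ordering: eliminating the dense node first fills in the whole factor.
lu_nat = splu(C, permc_spec="NATURAL")
# COLAMD, a fill-in reducing ordering, eliminates the dense node last.
lu_amd = splu(C, permc_spec="COLAMD")

fill_nat = lu_nat.L.nnz + lu_nat.U.nnz   # close to fully dense, ~n^2 entries
fill_amd = lu_amd.L.nnz + lu_amd.U.nnz   # stays sparse, ~O(n) entries
```

The same reordering principle also shortens the elimination tree, which is what exposes parallelism in the multi-frontal algorithm.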
Table 1. Data sets for the benchmark problem. The column titles are the number of years (y), the number of centres (c), the number of varieties (v), the number of levels of the cross terms (y.c, y.v, v.c), the average number of varieties per year (v/year), the average number of years per variety (y/v), and the number of control varieties grown in all years.

Numerical examples
A series of plant breeding benchmark problems is used to verify the performance of the algorithms implemented here. These examples are based on a second-stage analysis of a set of variety trials with linear mixed models, i.e. based on variety predicted values from each trial. Trials are conducted over a number of years across a number of locations (centres). The data in Table 1 are generated by a program which allows one to specify the number of years, the total number of centres and the proportion of centres used per year, the number of control varieties (used every year), the number of test varieties entering the system per year and their average persistence, and the proportion of missing varieties per trial. Wherever proportions are specified, the corresponding items are sampled at random, and the life of each variety is generated from a Poisson distribution. This gives a three-way crossed structure (year*variety*site) with some imbalance. In the current model, all terms except a grand mean are fitted as random. The random terms are generated as independent and identically distributed normal variables, with variance components generated from a test program with a structure similar to that used for the original SAS REML program, so it is just a variance components model.
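A rough sketch of such a generator; the parameter names and values below are invented for illustration, as the paper does not give the program's defaults.

```python
import numpy as np

rng = np.random.default_rng(42)
n_years, n_centres, prop_centres = 5, 20, 0.6    # illustrative values
n_control, new_per_year, mean_life = 10, 30, 2.0

varieties = []     # (id, first_year, last_year) for test varieties
records = []       # one (year, centre, variety) cell per trial mean
for year in range(n_years):
    # a random subset of the centres is used each year
    centres = rng.choice(n_centres, size=int(prop_centres * n_centres),
                         replace=False)
    # test varieties entering this year, with Poisson-distributed persistence
    for _ in range(new_per_year):
        varieties.append((f"test{len(varieties)}", year,
                          year + rng.poisson(mean_life)))
    active = [v for v, first, last in varieties if first <= year <= last]
    active += [f"control{c}" for c in range(n_control)]   # controls run every year
    for c in centres:
        for v in active:
            records.append((year, c, v))
```

Because varieties enter and leave at random times and only a subset of centres runs each year, the resulting year*variety*site table is unbalanced, which is the heterogeneity the benchmark is designed to stress.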
The main part of the computation is the reordering and the factorization. Table 2 compares the algorithms implemented here with their counterparts in an existing commercial software package: the second column gives the number of effects, and the third to fifth columns give the speedup of the algorithms implemented here over the counterparts in the existing software package.

Discussion and future work
The aim of averaged information splitting is to remove computationally expensive but negligible terms so that a simpler approximate information matrix is obtained. Such a splitting keeps the essential information and can be used as a good approximation to the observed information matrix required by a derivative Newton method. The resulting formula is significantly simpler than that used in the software package reported in Nature Genetics [7, p.825, eq.8]. Together with the fill-in reducing and multi-frontal sparse factorization techniques, the splitting can significantly improve the performance of a quasi-Newton approach for estimating the co-variance parameters in linear mixed models. Part of the result has been implemented in a leading commercial breeding software package by VSN International (VSNi) Ltd. On a suite of test examples, the method described here ran 10 times faster than the counterpart in VSNi's existing software [18]. The splitting approach builds a framework for analysing unbalanced data sets modeled by linear mixed models.
Mathematically, a theoretical proof of the convergence order of the quasi-Newton method based on the averaged information splitting deserves further investigation.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 7 April 2017 doi:10.20944/preprints201704.0044.v1

The formula for P is also too complicated, and the matrix-vector multiplication Py still involves considerable computation. The next result shows that Py is equivalent to a simpler formula; for a detailed derivation, the reader is directed to [9, Theorem 6]: Py = R^{-1}e, where e = y − Xτ̃ − Zũ, and τ̃ and ũ are the solution to the mixed model equations (3). Theorem 3 states the equivalence between the projection Py of the observations and the weighted fitted residual R^{-1}e. The variance-covariance matrix R often enjoys a very simple structure, for example a diagonal matrix. The residual e can be obtained by solving the mixed model equations (3) of order (p + b) × (p + b). The number of unknowns in the mixed model equations is often much smaller than the order of P, i.e., the number of observations. According to Theorem 1 and Theorem 3, the approximate information matrix can be computed according to Algorithm 1.

Algorithm 1. Main steps to compute I_A
1: Let C be the coefficient matrix in (3), W = [X, Z]
2: Factorize C = LDL^T
3: Solve (3) with the help of the LDL^T factorization
4: Calculate η
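The steps above can be sketched in dense form, assuming σ² = 1 and using an illustrative function name; a Cholesky factorization stands in for the sparse multi-frontal LDL^T of the paper.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def projection_times_y(X, Z, y, G, R):
    """Sketch of Algorithm 1: factor the MME coefficient matrix C once,
    solve for (tau~, u~), and form Py = R^{-1} e as in Theorem 3."""
    W = np.hstack([X, Z])
    Ri = np.linalg.inv(R)
    p = X.shape[1]
    C = W.T @ Ri @ W
    C[p:, p:] += np.linalg.inv(G)           # add G^{-1} to the random-effect block
    factor = cho_factor(C)                  # dense stand-in for the sparse LDL^T
    sol = cho_solve(factor, W.T @ Ri @ y)   # stacked solution [tau~; u~]
    e = y - W @ sol                         # fitted residual
    return Ri @ e                           # Theorem 3: Py = R^{-1} e
```

The factorization of the (p + b) × (p + b) matrix C is computed once and reused for the multiple right-hand sides needed by I_A, rather than ever forming the n × n matrix P.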

Figure 1. Illustration of the sparse structure of the matrix C and its factor L in (a), and the sparse structure of the reordered matrix and its Cholesky factor L in (c). (b) and (d) are the elimination trees indicating the elimination processes corresponding to the matrix C and its reordered counterpart. The Cholesky elimination starts from the leaf nodes and ends at the root node. The height of the elimination tree corresponds to the number of sequential steps in the elimination, and the width of the elimination tree corresponds to the available parallelism.

Therefore, finding the REML estimate requires finding the stationary point of the log-likelihood function ℓ_R, i.e., the solution of the nonlinear score equation S(θ̃) = 0. This usually requires the negative Jacobian matrix of the score.

Table 2.
Speedup of the algorithms implemented here over their counterparts in an existing commercial software package.

It is of great interest to leverage this method in actuarial science and other areas of econometrics where linear mixed models are used intensively.