On a skew t–discrete gamma alternative to the normal-Poisson model for clustered count outcomes

The normal and Poisson distribution assumptions in the normal-Poisson mixed effects regression model are often too restrictive for many real count data. Several works have independently relaxed the Poisson conditional distribution assumption for counts or the normal distribution assumption for random effects. This work couples some recent advances in these two regards to develop a skew t–discrete gamma regression model in which the count outcomes have full dispersion flexibility and random effets can be skewed and heavy tailed. Inference in the model is achieved by maximum likelihood using pseudo-adaptive Gaussian quadature. The use of the proposal is demonstrated on a popular owl sibling negotiation data. It appears that, for this example, the proposed approach outperforms models based on normal random effects and the Poisson or negative binomial count distribution.


Introduction
The Normal-Poisson (NP) model incorporating normal random effects in Poisson regression [1] is commonly used to analyze clustered, longitudinal or hierarchical count outcomes which arise routinely from applications in e.g. agricultural, environmental, biomedical, economic, and social sciences. Statistical softwares such as the R freeware [2], Stata [3] and SAS [4] provide routines returning for a count dataset, a NP model fit which gives a theoretical generating mechanism and provides statistics to guide decisions, relying on the relative flexibility of the NP model for handling a wide range of count data [5]. There is however a rich literature on the lack of robustness against departures from distributional assumptions underliying the model, namely the Poisson equidispersion assumption (variance equal to mean) and the normality of random effects [6][7][8][9].
For the control of dispersion in observed counts, overdispersion (variance greater than mean) models [6] are the most popular alternative count models among which, Poisson mixtures (e.g. the negative binomial [10], discrete Poisson-Lindley [11]) play a central role. Conway-Maxwell-Poisson (CMP) [12][13][14], double Poisson [15,16] and Consul's generalized Poisson [17] are also well known models for overdispersed count data. On the other hand, underdispersion (variance less than mean) is often considered rarer [18]. Indeed, underdispersion has received much less attention [19] and pure underdispersion models such as the positive binomial [20], Morse's [21] and Haight's generalized Poisson models [22,23] are rarely used. Sellers and Shmueli [24] have shown that underdispersion may has been hidden alongside Poisson and overdispersion models, and the rarity of underdispersion may thus be linked to the scarcity of statistical routines for fitting underdispersion and group-level dispersion models. This suggests that models for the control of dispersion should have full dispersion flexibility, i.e. allow under-, equi-as well as overdispersion. Although some existing models can be used to handle underdispersed, equidispersed as well as overdispersed data, they yield quasi-models for underdispersed counts (e.g. Quasi-Poisson [25], and extended Poisson-Tweedy [26]) or lack a simple mean parameterization (e.g. CMP [12], double Poisson [15], gamma count [27], semi nonparametric Poisson [28], Faddy [29] and discrete Weibull [30]) rendering inference only approximate [31,32]. An old but reviving approach to flexible count models identification is the discretization of flexible continuous distributions [33]. In this frame, Tovissodé et al. [34] have recently introduced the discrete gamma (DG) regression model for counts using the balanced discretization method [32]. The DG model interestingly yields a full dispersion regression model free of the above mentionned shortcomings.
In count data with a natural structure (clustered, longitudinal or hierarchical), withinsubject dispersion may differ from between-subject dispersion [35]. For the sake of flexibility in modelling between-subject variability, various distributions have been considered for random effects in mixed count models, e.g. Semi-nonparametric [36], skew normal [37], log normal [38], generalized log-gamma [39] and skew t [40]. In particular, skew t distributions [41] have been satisfactorily used in various areas of mixed modelling, e.g. linear mixed models [42], generalized linear mixed models [40,43] and nonlinear mixed models [44,45], allowing both skewness and kurtosis while keeping the normal distribution as a limiting special case. However, except works which treated overdispersion in Poisson regression by introducing observation-level random effects [46][47][48][49], there has been, so far, very few attempts to explicitely couple advances in the control of dispersion and the use of flexible random effects distributions. For instance, Khan et al. [50] introduced a CMP model with serial correlation and log-gamma random effects for clustered longitudinal counts. As noted above, the CMP distribution [12] has no simple closed form expression for the mean count in terms of model parameters, and thus leads to hardly interpretable regression coefficients [14,34,51].
In order to bridge advances in control of dispersion and relaxation of normality, this work proposes an extension of the NP model by introducing skew t [41] distributed random effects in the DG count regression model [34]. The resulting mixed DG regression model offers a very flexible approach to the modelling of correlated counts, allowing an unlimited dispersion level for the data and skewed as well as thick tailed random effects.
For clarity, we present in Section 2 some properties of the balanced discrete gamma and the skew t distributions. Then, we define in Section 3 the skew t-discrete gamma regression model and present a maximum likelihood estimation procedure. In Section 4, the proposed model is illustrated on a real count data and compared to some competing approaches. Concluding remarks are given in Section 5.

Preliminaries
In this section, we introduce the balanced discrete gamma (BDG) and the multivariate skew t (ST) distributions as ingredients to define the skewt t-discrete gamma model. Let us denote N, the set of natural numbers, R, the set of reals, R + , the set of positive reals, R * , the set of non null reals, and x , the integral part of any real x. For a ∈ R + and x ∈ R + , let γ(x, a) = x 0 u a−1 e −u du/Γ(a) the lower incomplete gamma ratio. We shall denote Gamma(a, b) the gamma distribution with scale a ∈ R + and shape b ∈ R + , which has mean value b/a.

The multivariate skew t distribution
A random variable Y is said to follow a q-variate skew t (ST) distribution with location parameter µ ∈ R q , scale Ω ∈ R q×q (positive definite), shape λ ∈ R q and degrees of freedom ν if for a value y ∈ R q of Y, the pdf is [42]: where t q (y|µ, Ω, ν) = Γ((ν+q)/2) is the pdf of the q-variate t distribution, y 0 = Ω −1/2 (y − µ), α = λ y 0 , and T (·|ν) is the cdf of the standard univariate t distribution with ν degrees of freedom. A random vector Y with the pdf (5) is denoted Y ∼ ST q (µ, Ω, λ, ν) and has the stochastic representation [42]: where and Z 0 , Z and U are independent. SettingŨ Γ(ν/2) for ν > 2 and c = √ 2/π, the mean vector and the covariance matrix of Y are respectively given by [42]:

The skew t-discrete gamma model
We introduce in the section a flexible mixed count model which combines full dispersion flexibility in counts with a flexible distribution for random effects. We then describe a quadrature based procedure for maximum likelihood (ML) estimation of model parameters.

Model specification
Let us consider a population in which, the count response Y ij from the jth measurement on an individual i has mean value µ ij and a variance σ 2 ij depending on µ ij and a positive scalar a i given a q-vector of random effects b i = (b 1i , · · · , b qi ) . The conditional mean count µ ij is log-linearly related to a set of covariates (constants) contains population effects, β j ∈ R representing the marginal effect of the covariate X j on the log-average count, log µ i , so that, conditional on b i , a unit change in X ij induces ceteris paribus a β j % variation in the mean count µ i as in NP regression. In addition, the scalar a i has the log-linear form For such a population, a skew t-discrete gamma (STDG) model is specified as: andD is a positive definite scale matrix for random effects with lower triangular log-Cholesky factor R, i.e.D =RR whereR has the same off diagonal elements as R but has diagonal elementsR kk = exp(R kk ) (diagonal elements ofR constrained to be positive to ensure the uniqueness of R for fixedD) [52].
The STDG model (9) has as special cases the Normal-discrete gamma (NDG) model obtained when d = 0 and ν → ∞, the Student t-discrete gamma (TDG) model obtained with the restriction d = 0, and the skew normal-discrete gamma (SNDG) model when d = 0. Although the traditional Normal-Poisson mixed model [1] is not a special case of the model (9), it can be approximated with the special case obtained with the restrictions δ = 0, d = 0 and ν → ∞ (Normal-discrete equidispersed gamma model). By (7), the random effects b i has a null mean vector, and on using (8), a covariance matrix

Marginal likelihood
Given a sample of n independent individuals, let us denote y i = (y i1 , · · · , y in i ) the n i -vector of the measurements from an individual i, i = 1, 2, · · · , n. The likelihood of the parameters β, δ, d, R and ν is by Bayes'rule: where θ = β , δ , d , vech(R) , ν with vech the usual vech operator which stacks the lower triangle elements of a matrix, y = (y 1 , · · · , y n ) , and since conditional on random effects b i , the individual measurements y ij are independent, the joint conditional probability mass of y i is given by As for generalized linear mixed models in general, the q-dimensional integral in (11) is not analytically tractable and there is thus no closed form solution for the ML estimates for the fixed effects β, the dispersion parameter δ, and the skewness parameter d, the scale parameter R and the degrees of freedom ν of random effects. We develop in the next section a pseudo-adaptive quadrature based estimation procedure.

Estimation based on pseudo-adaptive Gaussian quadrature 3.3.1. Likelihood approximation
Multi-dimensional Gaussian quadrature has proved useful for approximating multivariable integrals of the form R q f (x) exp(x x)dx which emerges in the ML estimation of generalized linear mixed models [53,54]. To improve the accuracy of Gaussian quadrature while using a small number m of nodes per integration dimension, Pinheiro and Bates [55] have introduced adaptive Gaussian quadrature. However, this technique is impraticalble for models with complex structures such as the STDG, since it requires a repeated search of the mode of the integrand. We instead consider the pseudo-adaptive procedure introduced by Rizopoulos [56] where b i is centered and rescaled about its empirical Bayes estimates obtained from assuming normal random effects [56][57][58][59]. In this setting, we make the following change of variable: where, r i ∈ R q ; b i is the empirical Bayes estimate of b i and R i is the lower triangular Cholesky factor of the covariance matrix estimate of b i given y i , obtained by fitting the NP model using e.g. the glmer routine of the package lme4 [60] in the R freeware [2]. The likelihood function (11) can then be rewritten as where we have set On using (14), the pseudo-adaptive Gaussian quadrature based approximation to the logarithm of the likelihood function (11) iŝ where m is a pre-specified number of nodes per integration dimension, and w k = q j=1 w kj and r k = (r k1 , r k2 , · · · , r kq ) are, respectively, the multidimensional integration nodes and the associated weights. The components r kj and w kj are available for instance in the R package fastGHQuad [61]. However, the total number of evaluations of the function g y (y i |·, θ) required in (15) increases exponentially in the number of random effects (m q function evaluations). In order to further reduce the curse of dimensionality when q ≥ 3, we consider sparse-grid integration based on the Smolyak rule [62]. a generic approach for extending any univariate quadrature rule to multivariate settings without the exponentially growing computational cost of the product rule in (15). It has been successfully used, for instance, on likelihhod related estimation problems in econometrics [63][64][65], on Bayesian inverse problems [66] and in general optimization theory [67]. The advantage of this quadrature rule resides in the number of nodes which increases only polynomially in the number of dimensions [64]. Instead of the number m of nodes used per dimension in (15), the Smolyak rule requires an accuracy level ς ∈ N which then determines the total number Q(ς) of nodes. The sparse-grid pseudo-adaptive quadrature based approximation to the logarithm of the likelihood function (11) iŝ Although the sparse-grid integration nodes r k and the associated weights w k are not trivial to determine, they are available in the R package SparseGrid [64] and the total number Q(ς) of nodes increases only polynomially (Q(m) < m q for q ≥ 3).

Maximum likelihood estimates, standard errors and model selection
The maximum likelihood estimate θ = β , δ , d , vech( R),ν of θ is obtained by maximizing the (sparse-grid) pseudo-adaptive Gaussian approximationˆ to the loglikelihood function (11), using a numerical optimization software such as the optim function in R. Although such softwares run faster when analytical derivative of the log-likelihood function are supplied, we follow [53] and avoid the use of this option since numerical integration will be required for these derivatives too and they are unfortunately often more poorly approximated by quadrature based on the same nodes used for approximating the marginal likelihood itself [68]. It is worth noting that the approximate log-likelihood function may be unbounded in the degrees of freedom parameter ν. This can occurs in Student t distribution related likelihood based inference [69], preventing the convergence of optimization routines and usually resulting in non-negative definite hessian of the loglikelihood function. In the event of such unboundedness issues, we advocate the profile likelihood approach described by [70]. It consists in setting a grid of feasible values of ν (e.g. ν k = ν k−1 + 0.5; k = 1, 2, · · · , 55; ν 0 = 2.5) and obtaining a sequence of estimates θ k while keeping ν = ν k during each maximization of (11). Then, the vector θ k which maximizes (11) is taken as the ML estimate θ.
After the maximization of (11), the ML estimate of the scale matrixD can be recovered as D = R R with R the matrix R with exponentiated diagonal elements (see (9)). The ML estimate of the random effects covariance matrix is then given by where Γ(ν/2) . Interestingly, many optimization routines (e.g. optim in R) computes, on request, the hessian matrix H n (θ) of the log-likelihood function evaluated at θ. For inference on the model parameters, we assume that θ − θ asymptotically follows a normal distribution with null mean and covariance matrix C n ( θ) = − H n θ −1 .
The appropriate distribution (normal, Student t, skew normal or skew t) for random effects is a priori unknown for a specific dataset. In practice, the best fit can be determine after fitting the full version of the STDG model (9) as well as its three special cases (normal with the constraints d = 0 and ν → ∞; Student t with d = 0; and skew normal with ν → ∞). Indeed, when the boundary restriction ν → ∞ is not involved, an appropriate statistical test (e.g. Wald test [71] or likelihood ratio test [72]) can be used to identify the Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 25 August 2021 doi:10.20944/preprints202108.0490.v1 best fit to a specific dataset. When on the contrary the restriction ν → ∞ is involved, we consider information criteria such as the Akaike's Information Criterion given for a fit with N p estimated model parameters by AIC = −2 ( θ) + 2N p (the lower, the better) [73]. When the best fit to a specific dataset involves skew t (ν < ∞) or Student t (d = 0 and ν < ∞) random effects, this fit can be used to identify possible outliers in the dataset (see Appendix A).

Estimation of random effects
In the Gaussian linear mixed models framework, Best Linear Unbiased Predictors (BLUPs) of random effects are routinely used in application fields like ecology, evolutionary biology and genetics [74,75] where they have become very important, for instance, for breeding values based selection. However, the validity of the BLUPs relies on, at least, an approximate adequation between the studied population and the assumed model [75][76][77][78]. BLUPs are equivalent to Empirical Bayes (EB) estimates in Gaussian models. As a result, in models with more complex structures designed to capture more latent data features than the traditional Gaussian model, EB estimates are developped as appropriate alternatives to BLUPs to give an overview on latent unobservable subject-specific quantities such as intercept and slopes [79][80][81]. Based on the STDG model (9), we provide in this section expressions for predicting random effects. From (11), the conditional density of the random effects b i given the observed data is cŨ 1 d, D, λ, ν)db i is the contribution of the individual i to the sample likelihood (11). The minimum mean-squared error estimator of b i correspond to its conditional mean given y i [82] and can thus be approximated by a weighted average as |r k , θ). Similarly, the conditional covariance matrix of b i can be approximated as The EB estimator of b i is then obtained by substituting the ML estimate of θ into (18), i.e. b i = b i ( θ). The covariance of these estimates are also obtained by substituting θ for θ in (19). However, this error covariance matrix does not account for errors due to the estimation of θ. This indicates that variances extracted from this estimated covariance matrix tend to underestimate the true error variances [80].

Application to the owl sibling negotiation data 4.1. Data and statistical analyses
To illustrate the use of the proposed model, we analyse in this section the owl sibling negotiation data from Roulin and Bersier [83]. This popular data is relative to the vocal begging behaviour before the owl parents bring prey to their nest. Young owls indeed use begging in the absence of parents to share information on the different levels of hunger between them, so that to save energy to compete for food when there is the highest probability of success, the least hungry young owl avoids competing for food against the hungriest owls when a parent brings a prey [84]. Roulin and Bersier [83] investigated in nestling barn owls whether the intensity of vocal begging behaviour differs in the presence of the father as compared to the presence of the mother.
The interest response variable is sibling negotiation which is defined as the number of calls made by all offspring in the nearest 30 seconds interval before the arrival of a parent. The explanatory variables are the sex of the parent, the arrival time of the parent, and a food treatment defined as: half of nests were given extra prey ("food-satiated"), and prey were removed from the other half ("food-deprived"). The data collected from n = 27 nests is available in the R package glmmADMB [85] and some descriptive statistics are given in Appendix B. From analyses performed by Zuur et al. [84], the response variable (sibling negotiation) is overdispersed (ID = 6.6) and correlated within nest (ρ = 0.51 between two sequential parent arrivals). A dispersion model is indeed required to adequately model sibling negotiation. On the contrary, the need of a flexible random effect distribution is still undocumented. Following Zuur et al. [84], we consider the random intercept mixed effects model with conditional mean count µ ij defined through where ij indexes observation j from a nest i, BroodSize is the number of nestlings (used as a model offset) and b i is the random intercept associated to all observations from nest i. For comparison purposes, we fitted in addition to the STDG model (9) and its special cases including the NDG (Normal-discrete gamma), the TDG (Student t-discrete gamma) and the SNDG (Skew normal-discrete gamma) models; the Normal-Poisson (NP) and the Normal-negative binomial (NNB) models to the owl sibling negotiation data. The STDG model and its special cases were fitted following the procedures described in Section 3 and implemented in R software (see Appendix C for some linear algebra details on the implementation). The NP and NNB models were fitted using the R package lme4 [60]. The significance level for statistical tests was set to 0.05.

Results
Tables 1-3 present the estimates of population parameters and AIC values for the six fitted models. Based on the AIC values in Tables 1-2, it appears that with normally distributed random effects, the discrete gamma model (AIC = 3420.8) is better than the Poisson (AIC = 5020.3) and the negative binomial (AIC = 3472.9) mixed models for the owl sibling negotiation data. There are important differences between estimates of fixed effects from these three models, but there is no switch in the 5% statistical significance of fixed effects. More importantly, the use of the Poisson distribution results in a random intercept variance estimate ofσ 2 b = 0.2 whereas the use of flexible count distributions results in much lower random intercept variance estimates (< 10 −4 ). This indicates that much of the between-nests variability identified by the NP model is actually a within-nest variability.
Further considering discrete gamma models with flexible distributions for random effects, the AIC values in Tables 2-3, indicate that the skew normal model (AIC = 3394.1) provides the best fit as compared to the normal, the Student t (AIC = 3422.6) and the skew t models (AIC = 3396.1). It can be observed that the estimate (δ 0 = −2.23) of the skewness parameter of the nest related random intercept is significantly different from zero as indicated by the associated Wald test (p-value < 0.001). On the contrary, the random intercept distribution is not heavy tailled as indicated by the large degrees on freedom parameter estimates for both the Student t and the skew t models ( ν > 9987). It can be noted that there are only slight and non significant differences between estimates of both fixed effects and random intercept variance from these three models. Based on the skew normal-discrete gamma model fit, it appears that sibling negotiation was significantly lower (82%) for food satiated young owls as compared to food deprived young owls, and each unit Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 25 August 2021 doi:10.20944/preprints202108.0490.v1      increase in the arrival time reduced sibling negotiation by 13% on average. But the sex of the owl parent did not affect significantly sibling negotiation.

Conclusion
Ignoring data dispersion or shape may lead to biased, inefficient and inconsistent estimates of population parameters in mixed count regression models. This work introduces a large class of mixed effects models for nested, clustered or longitudinal count data. The proposal combines flexible count distribution with a flexible distribution for random effects, and estimates model parameters using pseudo-adaptive Gaussian quadrature. The application on owl sibling negotiation data demonstrates the usefulness of the approach in detecting and controling dispersion in counts and skewness in random effects distribution. Further investigations will target the performance of the proposal when data comes from various count and random effects distributions in use in agricultural, environmental, biomedical, economic, and social sciences.

Appendix A. Individual-specific weights
Appendix A. 1

. Estimation of individual weights for outlier detection
When the best fit to a specific dataset involves skew t (ν < ∞) or Student t (d = 0 and ν < ∞) random effects, this fit can be used to identify possible outliers in the dataset. Indeed, by (6), the STDG model (9) has the stochastic representation The mixing variable U i in (A1) is an individual-specific stochastic weight which serves for outlier detection using models built under the Student t and related distributions [43,81,86]. Using (A1), the conditional mean of U i given y i turns out to be (see (A5)) where D * = ν ν+2 D. Thus, on setting the conditional mean E{U i |y i , θ} can be approximated as The EB estimator of the weight U i of individual i is then given by u i = U i ( θ). Unusual or outlying individuals are indicated by relatively low weights ( u i < 1) [43,81].

Appendix C. Linear algebra details
This Appendix gives some linear algebra details useful for the implementation of models based on the skew t distribution defined by (5). Indeed, the skew t pdf for y ∈ R q involves the quantities y 0 y 0 , α = λ y 0 and |D| 1/2 where y 0 = D −1/2 (y − µ). For the STDG model (9), we additionally have D =D + dd , λ = (1 + d D −1 d) 1/2 D −1/2 d, and D =RR . We thus have by the Sherman-Morrison identity [87], Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 25 August 2021 doi:10.20944/preprints202108.0490.v1 This results on setting z =R −1 y 0 and z α =R −1 d in Likely, from the identity D =D + dd =R I q +R −1 dd R − R , the determinant of D is given by |D| = |R| 2 × |I q +R −1 dd R − |. Hence, by the Sherman-Morrison formula for the determinant [87], Finally, note that |R| is the product of the diagonal elements ofR (sinceR is lower triangular).