Bayesian Nonlinear Models for Repeated Measurement Data: An Overview, Implementation, and Applications

Nonlinear mixed effects models have become a standard platform for analysis when data are in the form of continuous, repeated measurements of subjects from a population of interest, and the temporal profiles of subjects commonly follow a nonlinear tendency. While frequentist analysis of nonlinear mixed effects models has a long history, Bayesian analysis of the models received comparatively little attention until the late 1980s, due primarily to the time-consuming nature of Bayesian computation. Since the early 1990s, Bayesian approaches for the models began to emerge to leverage rapid developments in computing power, and recently they have received significant attention due to (1) their superiority in quantifying the uncertainty of parameter estimation; (2) their utility in incorporating prior knowledge into the models; and (3) their flexibility in matching the increasing complexity of scientific research arising from diverse industrial and academic fields. This review article presents an overview of modeling strategies to implement Bayesian approaches for the nonlinear mixed effects models, ranging from designing a scientific question out of real-life problems to practical computations.


INTRODUCTION
One of the common challenges in biological, agricultural, environmental, epidemiological, financial, and medical applications is to make inferences on characteristics underlying profiles of continuous, repeated measures data from multiple individuals within a population of interest [67,204,206,268]. By 'repeated measures data' we mean the data type generated by observing a number of individuals repeatedly under differing experimental conditions where the individuals are assumed to constitute a random sample from a population of interest. A common type of repeated measures data is longitudinal data such that the observations are ordered by time [83,316].
Linear mixed effects models for repeated measures data have become popular due to their straightforward interpretations, flexibility in allowing correlation structure among the observations, and utility in accommodating unbalanced and multi-level data structures (i.e., clustered designs that vary among individuals) [79,260]. The modeling framework is also intuitively appealing: the central idea that individuals' responses are governed by a linear model with slope or intercept parameters that vary among individuals seems appropriate in many scientific problems (e.g., see [125,242]). It also allows practitioners to test and evaluate multivariate causal relationships by conducting regression analysis at the population level. By preserving the multi-level structure in a single model, estimation or prediction for the analyses can take advantage of information borrowing [93].
For many applications, researchers often want to theorize that time courses of individual response commonly follow a certain nonlinear function dictated by a finite number of parameters [257]. These nonlinear functions are based on reasonable scientific hypotheses, typically represented as a differential equation system. By tuning the parameters, the shape of the function in terms of curvature, steepness, scale, height, etc., may change, which serves as the rationale for describing heterogeneity between subjects. Nonlinear mixed effects models, also referred to as hierarchical nonlinear models, have gained broad acceptance as a suitable framework for these purposes [74,75,189]. Analyses based on these models are now routinely reported in various industrial problems, which is, in part, enabled by breakthrough developments in software [21,102,265,266,298]. Excellent books and review papers on the subject include [74,75,286].
Although their works were published more than 20 years ago, they still provide statisticians, programmers, and researchers with many pedagogical insights about the modeling framework, implementations, and practical applications of using the nonlinear mixed effects models.
While frequentist analysis of nonlinear mixed effects models has a long history, Bayesian analysis of the models was a relatively dormant field until the late 1980s. This is due primarily to the time-consuming nature of the calculations required to implement a Bayesian model [181]. Since the early 1990s, Bayesian approaches began to re-emerge, motivated both by exploitation of rapid developments in computing power and by the growing desire to quantify the uncertainty associated with parameter estimation and prediction [46,77,315]. Since then, Bayesian nonlinear mixed effects models, also called Bayesian hierarchical nonlinear models, have been extensively used in diverse industrial and academic research, endowed with new computational tools providing a far more flexible framework for statistical inference that matches the increasing complexity of scientific research [25,41,111,176,177,291].
The objective of this article is to present an updated look at Bayesian nonlinear mixed effects models. Although the works of [74,75] discuss some Bayesian approaches for the nonlinear mixed effects models, the main perspective adopted in those works is much more oriented to the frequentist framework, and the prior distributions and Bayesian computing strategies explained there are quite outdated. In the literature, it is striking that very few research works provide an updated overview of Bayesian methodologies for the nonlinear mixed effects models. Motivated by this, in this article we endeavor to present an overview of modeling strategies to implement Bayesian approaches for the nonlinear mixed effects models, ranging from designing a scientific question out of real-life problems to practical computations. The novelty of this paper is as follows: I. Guidance for a Bayesian workflow to solve a real-life problem is provided for domain experts to facilitate efficient collaboration with quantitative researchers; II. Recently developed prior distributions and Bayesian computation techniques for a basic model and its extensions are illustrated for statisticians to develop more complex models built on the basic model; III. The illustrated methodologies can be directly exploited in diverse applications, ranging from small-data to big-data problems, by quantitative researchers, modeling scientists, and professional programmers working in diverse industries.
This article is organized as follows. In Section 2, we explore trends and workflow in the use of Bayesian nonlinear mixed effects models. In Section 3, we motivate readers to understand why it is necessary to use the Bayesian nonlinear mixed effects model by illustrating four real-life problems, which will be conceptualized as a statistical problem. To solve the statistical problem, we suggest a basic version of the Bayesian nonlinear mixed effects models in Section 4, and its likelihood is analyzed in Section 5, wherein frequentist computations are briefly discussed. Section 6 describes modern Bayesian computation strategies to implement the basic model. Popularly used prior distributions are presented in Section 7. Section 8 discusses model selection, and Section 9 reviews recent advances and extensions that build on the basic model. Finally, Section 10 concludes the article.

TRENDS AND WORKFLOW OF BAYESIAN NONLINEAR MIXED EFFECTS MODELS
2.1 Rise in the use of Bayesian approaches for the nonlinear mixed effects models

From January 1970 to December 2021, a PubMed.gov (pubmed.ncbi.nlm.nih.gov/-) search of "nonlinear mixed-effect models" yielded 6,288 publications. Among the published articles, nearly 94% used frequentist approaches (5,929 articles), while only 6% adopted Bayesian approaches (359 articles). Figure 1 displays a bar plot based on the published articles, categorized by frequentist and Bayesian approaches over time. In the panel, it is observed that, until the late 1980s, Bayesian research was nearly dormant, but since the early 1990s Bayesian works began to re-emerge, and the gap between frequentist and Bayesian works has gradually narrowed over time.
The dormancy of the Bayesian approaches until the late 1980s is mainly due to the time-consuming nature of the calculations based on sampling schemes, which were previously infeasible given the limitations of computing power. Fortunately, a breakthrough in computer processors (e.g., the Am386 in 1991 and the Pentium processor in 1993) took place in the early 1990s, driving the computing revolution needed to solve computationally intense problems, and this has given the statistical community the ability to answer statistical questions using Bayesian methods. This timeline also aligns with the widespread adoption of Markov chain Monte Carlo (MCMC) sampling techniques in the Bayesian community [127,132]. Since then, the Bayesian community has been gradually gaining momentum to leverage the rapidly growing developments in computing power, and now assorted Bayesian software packages (e.g., JAGS [235], BUGS [194], and STAN [265]) are available for researchers to answer scientific questions arising from industrial and academic research.
To understand the rise of the Bayesian approaches, it helps to first understand the advantages of using Bayesian methods over frequentist methods in the context of nonlinear mixed effects models. As the primary focus of this review paper is to provide readers with some insight into methodologies and practical implementation of Bayesian approaches, our comparison and exposition below are described from an operational viewpoint. Table 1 summarizes modeling strategies of the frequentist and Bayesian approaches for the nonlinear mixed effects models. Broadly speaking, the usual estimation method of frequentist computation is optimization, while that of Bayesian computation is sampling.
Normally, it is known that the former is much faster than the latter. This is not surprising because a sampling scheme, by its nature, needs to explore a wide range of the parameter space, whereas the optimization only needs to find the best point estimate, which is often described by the maximum likelihood estimate. In many practical problems, widely used frequentist optimization algorithms are the first-order approximation [22], Laplace approximation [307], and stochastic approximation of expectation-maximization algorithm [78].
They will be briefly discussed in Subsection 5.4. As for the Bayesian sampling algorithms, combinations of the Gibbs sampler [175], Metropolis-Hastings algorithm [246], Hamiltonian Monte Carlo [220], and No-U-Turn sampler [142] are popularly used, among many others [90,196,219]. We explain these in detail in Section 6.

Table 1: Comparison of modeling strategies used in frequentist and Bayesian approaches for the nonlinear mixed effects models from an implementational viewpoint.

Characteristic | Frequentist | Bayesian
Estimation objective | Maximize a likelihood [74,75,286] | Sample from a posterior [176,181,291]
Computation algorithm | First-order approximation [22], Laplace approximation [307], stochastic approximation of EM algorithm [78] | Gibbs sampler [175], Metropolis-Hastings algorithm [246], Hamiltonian Monte Carlo [220], No-U-Turn sampler [142]
Software | SAS [150], NONMEM [23], MONOLIX [171], NLMIXR [102] | JAGS [195], BUGS [194], STAN [265], BRMS [44]

Advantages. The starkest difference between the frequentist and Bayesian approaches may be the procedure for describing the uncertainty underlying parameter estimation for the nonlinear mixed effects models. Here, the parameters of primary interest are the population-level parameters (also called fixed effects), the typical values of the individual-level parameters. In many cases, frequentist 95% confidence intervals for the parameters of the models are constructed by assuming that the asymptotic normality of the maximum likelihood estimator holds in a finite-sample study, which is most accurate in large-sample scenarios [61]. Most frequentist software packages, such as NONMEM [23,189], MONOLIX [171], and NLMIXR [102], by default print out a 95% confidence interval of the form "Estimate ± 1.96 × Standard Error" (or some transformation of the lower and upper bounds, if necessary), where the Standard Error is calculated using the (observed) Fisher information matrix [39,166,294]. Using such a scheme in small-sample studies is likely to overlook the gap between the reality of the data and the idealized asymptotic situation.
In contrast, for Bayesian approaches, large-sample theory is not needed for uncertainty quantification, and the procedure for obtaining 95% posterior credible intervals is considerably easier than that for obtaining 95% confidence intervals (see Chapter 4 of [112]). Furthermore, Bayesian credible intervals based on percentiles of posterior samples accommodate strongly skewed distributions, whereas frequentist confidence intervals (based on large-sample theory) may incur a non-negligible approximation error due to deviation from asymptotic normality. Along with that, Bayesian methods are highly valued when researchers want to incorporate prior knowledge from previous studies into the model, so that posterior inference provides the researchers with an updated view on the problem, possibly with more accurate estimation. Using prior information is particularly useful in small-sample contexts [251,258].
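As a minimal illustration of the percentile-based credible interval described above, the sketch below uses synthetic right-skewed draws in place of real MCMC output and contrasts them with a Wald-style "Estimate ± 1.96 × Standard Error" interval; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC draws of a positive, right-skewed parameter
# (e.g., a clearance rate); 4000 hypothetical posterior samples.
draws = rng.lognormal(mean=0.0, sigma=0.8, size=4000)

# Bayesian 95% credible interval: simply take posterior percentiles.
ci_bayes = np.percentile(draws, [2.5, 97.5])

# Wald-style 95% interval built from the same draws for comparison;
# being symmetric, it can dip below zero for a skewed positive quantity.
est, se = draws.mean(), draws.std()
ci_wald = (est - 1.96 * se, est + 1.96 * se)

print("percentile CI:", ci_bayes)  # respects positivity and skew
print("Wald CI:", ci_wald)         # symmetric approximation
```

The percentile interval requires no normality assumption: it is read directly off the sampled posterior.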

2.2 Bayesian workflow
We outline the first two steps in the workflow of using Bayesian nonlinear mixed effects models described in Figure 2. The panel includes some mathematical notations that are used consistently throughout the paper; these notations will become clear later. The aim of our explanation at this point is to provide readers with a blueprint for implementing Bayesian modeling strategies for the nonlinear mixed effects models. We assume that readers are familiar with basic concepts and the generic workflow in Bayesian statistics; see [112,141,225] for those basic concepts, and refer to the review paper by [282] and references therein for detailed concepts and general terminology used in the workflow, such as prior and posterior predictive checks, prior elicitation, etc.
The first step of the Bayesian research cycle is (a) the standard research cycle [31,223]. Activities at the beginning of this step involve reviewing the literature, defining a problem, and specifying a research question and a hypothesis. After that, researchers specify which analytic strategy will be taken to answer the research question and suggest possible model types, followed by data collection. The data arising in this process may include a response variable and some covariates that are grouped longitudinally, which then constitute repeated measures data for a population of interest. Furthermore, if there appears to be some nonlinear temporal tendency for each subject, then a possible model type for the analysis is a nonlinear mixed effects model [74,233].
The second step of the Bayesian research cycle is (b) the Bayesian-specific workflow. Logically, the first thing to do at this step is to determine prior distributions (see Step (b)-(i) in Figure 2). The selection of priors is often viewed as one of the most crucial choices that a researcher makes when implementing a Bayesian model, as it can have a substantial impact on the final results [116]. As exemplified earlier in the context of Bayesian medical device trials, using a prior in small-sample studies may improve estimation accuracy, but an unthoughtful choice of priors can lead to significant bias in estimation. Prior elicitation may require Bayesian expertise to formulate a domain expert's knowledge in probabilistic form [107]. Strategies for prior elicitation include asking domain experts to provide suitable values for the hyperparameters of the prior [143,226]. After the prior is specified, one can check the appropriateness of the priors through prior predictive checking [183]. For almost all practical problems, the prior distribution of a Bayesian nonlinear mixed effects model can be hierarchically represented with two components: (1) a prior for the individual-level parameters used in subject-specific modeling; and (2) a prior for the parameters used in the population-level model and the likelihood. It is important to note that the former type of prior distribution (that is, (1)) is also required to implement frequentist approaches for the nonlinear mixed effects model, under the name 'distribution for random effects'.

[Figure 2 caption: The resulting posterior inference can be used to start a new research cycle. Distributions for prior, likelihood, and posterior are colored blue, yellow, and violet, respectively. Notation: Θ, model matrix; σ², error variance parameter; α, intercepts; B, coefficient matrix; Ω, covariance matrix; p(·), probability distribution; π(·), prior or posterior probability distribution.]
Essentially, the defining factor of the Bayesian framework is the latter type of prior distribution (that is, (2)), whose parameters are treated as fixed quantities ('fixed effects') in the frequentist framework. Some prior options of the latter type will be discussed in Section 7.
The second task is to determine the likelihood function (see Step (b) -(ii) in the panel). At this time, the raw dataset collected in (a) standard research cycle should be cleaned and preprocessed. Before embarking on more serious statistical modeling, it is a common practice to get some insight about the research question via exploratory data analysis and have a discussion with domain experts such as clinical pharmacologists, clinicians, physicians, engineers, etc. To some extent, eventually, all these efforts are to determine a nonlinear function (denoted as f in this paper) that best represents the temporal profiles of all subjects.
This nonlinear function is a known function because it must be specified by the researchers. In other words, the branch of nonlinear mixed effects models belongs to parametric statistics. However, one technical challenge is that, in many problems, such a nonlinear function is represented as the solution of a differential equation system [35,296], and therefore there is no guarantee that we can conveniently work with a closed-form expression of the nonlinear function. For example, if researchers wish to work with nonlinear differential equations [105,284], then some approximation via a differential equation solver [65,84] may be needed to calculate the nonlinear function. As such, most software packages dedicated to implementing nonlinear mixed effects models, or more generally Bayesian hierarchical models, are equipped with several built-in differential equation solvers [23,201,265]. For instance, visit the website (mc-stan.org/-) to see the functionality supported in Stan [265].
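As a sketch of the kind of numerical ODE solving mentioned above, the following uses SciPy's general-purpose solver on a hypothetical one-compartment absorption system (a gut compartment feeding a central compartment); the dose and rate constants are illustrative values, not taken from any study.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical parameters: dose D, bioavailability F, absorption rate ka,
# volume V, and clearance Cl (illustrative values only).
D, F, ka, V, Cl = 320.0, 1.0, 1.5, 32.0, 2.8

def rhs(t, y):
    # y[0]: drug amount at absorption site; y[1]: amount in central compartment
    a_gut, a_c = y
    return [-ka * a_gut, ka * a_gut - (Cl / V) * a_c]

t_eval = np.linspace(0.0, 24.0, 97)
sol = solve_ivp(rhs, (0.0, 24.0), [F * D, 0.0],
                t_eval=t_eval, rtol=1e-8, atol=1e-10)
conc = sol.y[1] / V  # concentration = amount / volume

print(conc.max())  # numerical peak concentration for this parameter set
```

In a nonlinear mixed effects model, a solve of this kind is performed inside the likelihood evaluation for every subject, which is why built-in, efficient ODE solvers matter.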
Finally, the likelihood is combined with the prior to form the posterior distribution (see Step (b)-(iii) in the panel). Given the important roles that the prior and the likelihood play in determining the posterior, this step must be conducted with care. The implementational challenge at this step is to construct an efficient MCMC sampling algorithm. The basic idea behind MCMC is the construction of a sampler that simulates a Markov chain converging to the posterior distribution. One can use software packages if the prior distributions to be implemented exist in the list of prior options available in the packages. Otherwise, professional programmers and Bayesian statisticians are needed to write code manually; this review paper will be useful for that purpose. Another important activity at this step is to compare multiple models with different priors and nonlinear functions, specified in Steps (b)-(i) and (ii), and select the best model among them. This topic is broadly called model selection [63], which will be discussed in Section 8.
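To make the MCMC idea concrete, here is a minimal random-walk Metropolis sampler targeting a toy one-dimensional "posterior" (a standard normal stand-in; real nonlinear mixed effects posteriors are high-dimensional and are handled by the software packages mentioned above).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy log-posterior: standard normal target, standing in for the
# (usually intractable) posterior of a nonlinear mixed effects model.
def log_post(theta):
    return -0.5 * theta**2

# Random-walk Metropolis: propose a local move, accept it with
# probability min(1, posterior ratio); the chain converges to the target.
theta, chain = 0.0, []
for _ in range(20000):
    prop = theta + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    chain.append(theta)

samples = np.array(chain[2000:])  # discard burn-in draws
print(samples.mean(), samples.std())  # should be near 0 and 1
```

Gibbs, Hamiltonian Monte Carlo, and the No-U-Turn sampler refine this same accept/propose idea to explore high-dimensional posteriors far more efficiently.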

The setting
To exemplify circumstances for which the nonlinear mixed effects model is a suitable modeling framework, we review challenges from several diverse applications. Table 2 summarizes four real-life problems that will be illustrated in the next subsections.

Example 1: Theophylline study [34]
Figure 3 shows theophylline concentration in the plasma as a function of time after oral administration of the same amount of the anti-asthmatic drug theophylline to 12 subjects. (The data considered here are courtesy of Dr. Robert A. Upton of the University of California, San Francisco.) As seen in the panel, the concentration trajectories have a similar functional shape for all individuals. However, C_max and t_max (the peak concentration and the time at which it is achieved), as well as the absorption and elimination phases, differ substantially across subjects. Clinical pharmacologists believe that these differences are attributable to between-subject variation in the underlying pharmacokinetic processes, explained by Absorption, Distribution, Metabolism, and Excretion (ADME), the understanding of which is crucial in new drug development in the pharmaceutical industry. In pharmacokinetic analysis, often abbreviated 'PK analysis', it is routine to use compartmental modeling to describe the amount of drug in the body by dividing the whole body into one or more compartments [250]. For theophylline, a one-compartment model is normally used, which assumes that the entire body acts like a single, uniform compartment (see page 30 of [106] for a detailed explanation of the model):

C(t) = (D F k_a) / (V (k_a − Cl/V)) × (exp(−(Cl/V) t) − exp(−k_a t)),

where C(t) is the drug concentration at time t for a single subject following oral dose D at t = 0. Here, F is the bioavailability, which expresses the proportion of the drug that gains access to the systemic circulation.
k_a is the absorption rate constant, describing how quickly the drug is absorbed from the gut into the systemic circulation. V is the volume of the central compartment. Cl is the clearance, representing the volume of plasma from which the drug is eliminated per unit time. Eventually, the pharmacokinetic process for a given subject is summarized by the 4-dimensional vector of 'PK parameters' (F, k_a, V, Cl). Of course, it is at the modeler's discretion to proceed with a more complex PK model, such as a three-compartment model with nonlinear clearance, to fit the data, but in that case over-parameterization should be carefully examined [85].
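Assuming the standard closed form of the one-compartment model with first-order absorption, the concentration profile can be evaluated directly; the dose and PK parameter values below are purely illustrative, not fitted to the theophylline data.

```python
import numpy as np

def conc_1cmt_oral(t, F, ka, V, Cl, D):
    """One-compartment model with first-order absorption (closed form)."""
    ke = Cl / V  # elimination rate constant
    return (F * D * ka) / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Illustrative (not fitted) PK parameters for a hypothetical 320 mg oral dose.
t = np.linspace(0.0, 24.0, 241)
c = conc_1cmt_oral(t, F=1.0, ka=1.5, V=32.0, Cl=2.8, D=320.0)

t_max = t[np.argmax(c)]  # time of peak concentration on the grid
print(t_max, c.max())    # empirical t_max and C_max
```

Between-subject variability in (F, k_a, V, Cl) translates directly into the differing C_max and t_max values visible across the subjects' trajectories.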
Typically, the dataset collected in a drug development program includes demographic and clinical covariates obtained from each subject, e.g., body weight, height, age, sex, creatinine clearance, albumin, etc.; furthermore, one can also involve genetic information relevant to an individual's response to drugs. Most covariates are measured at baseline, before assigning the drug, while some covariates can be measured at every sampling time. One of the crucial goals of PK analysis is to assess the effect of such covariates on the PK parameters [311]. The causal relationship inferred by the covariate analysis can be used to support physicians in making the necessary judgments about the medicines they prescribe, tailored to individual patients [248].
In a PK report for a new drug application to government authorities such as the U.S. Food and Drug Administration (FDA) or the European Medicines Agency (EMA), PK parameters are summarized by mean or median and, very importantly, by estimates of parameter precision. Estimates of parameter precision can provide valuable information regarding the adequacy of the data to support those parameters [130]. Parameter uncertainty can be estimated through several methods, including bootstrap procedures [91], log-likelihood profiling [40], or the asymptotic standard errors of parameter estimates; recently, Bayesian approaches have drawn considerable attention from the pharmaceutical industry [20]. In particular, the Bayesian approach to population PK analysis can be very useful when there is prior knowledge about PK parameters learned from preclinical studies, published works, etc., and one wants to incorporate it into the prior specification for the PK parameters [291].

Example 2: Decline curve analysis
The US shale boom, a product of technological advances in horizontal drilling and hydraulic fracturing that unlocked new stores of energy, has greatly benefited the growth of the US economy. Horizontal drilling is a directional drilling technology in which a well is drilled parallel to the reservoir bedding plane [122].
Well productivity of a horizontal well is known to be often 3 to 5 times greater than that of a vertical well [5,214], but a horizontal well also costs 1.5 to 2.5 times more than a vertical well [164]. Therefore, the eventual success of a drilling project for unconventional shale wells depends to a large degree on well construction costs [280].
Because of very low permeability and a flow mechanism very different from that of conventional reservoirs, estimates of shale well construction costs often carry high levels of uncertainty. For this reason, one of the crucial tasks of petroleum engineers is to quantify the uncertainty associated with the process of oil or gas production in order to reduce the extra initial risk of such projects. Figure 4 displays oil production rate trajectories of shale wells from the Eagle Ford Shale of South Texas, studied by [177]. The declining pattern manifested in the trajectories is commonly observed in almost all oil production rate time series following well completion. (Here, 'completion' is petroleum engineering terminology for the process of making a well ready for initial production [24].) Decline curve analysis (DCA), introduced by [185] around 100 years ago, is one of the methods most widely used by petroleum engineers. Its purpose is to (i) theorize a curve describing the declining pattern, (ii) analyze the declining production rates, (iii) characterize the well productivity, and (iv) forecast the future performance of oil and gas wells. In particular, estimation and uncertainty quantification of the estimated ultimate recovery (EUR) (here, EUR is jargon for the approximate quantity of oil from a well that is potentially recoverable by the end of its producing life [69]) is the most important task and a starting point in the decision-making process for future drilling projects. Oil and gas companies must also comply with financial regulations about EUR outlined by the U.S. Securities and Exchange Commission: see www.sec.gov/- for the regulations.
Most curves used in DCA are derived by solving certain differential equations that describe a hidden dynamic of the production rate trajectory [14,64,89,149,281,303]; see [6,101,133,211] for an overview of such curves. [177] studied the Arps hyperbolic, stretched exponentiated decline, Duong, and Weibull curves to fit the trajectories shown in Figure 4. In particular, the Duong model was developed for unconventional reservoirs with very low permeability:

P(t) = q_1 t^{−m} exp[(a / (1 − m)) (t^{1−m} − 1)],

where P(t) is the production rate at time t for a single well following completion, q_1 is the initial rate coefficient, and m and a are additional model parameters. We note that the parameters q_1, m, and a have their own meanings in terms of well productivity: see [89] for the interpretation. That being said, the well productivity for a given well is summarized by the 3-dimensional parameter vector (q_1, m, a). From a modeling perspective, the variation of well productivity across different wells is attributable to different values of (q_1, m, a). To explain this variability, one can regress the values (q_1, m, a) on well-design parameters such as true vertical depth, measured depth, etc. The causal relationship inferred by the covariate analysis can be used in future drilling projects. Geological information on the wells can also be incorporated to make a spatial prediction of the EUR at a new location, as researched by [177].
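The Duong decline curve can be evaluated in a few lines; the well-productivity parameters below are hypothetical and chosen only to show the characteristic steep early decline.

```python
import numpy as np

def duong_rate(t, q1, m, a):
    """Duong decline curve: production rate at time t (t > 0, m != 1)."""
    return q1 * t**(-m) * np.exp(a / (1.0 - m) * (t**(1.0 - m) - 1.0))

# Illustrative well-productivity parameters (hypothetical values).
t = np.arange(1.0, 61.0)  # e.g., months since completion
q = duong_rate(t, q1=800.0, m=1.3, a=1.1)
print(q[:3])  # early rates decline steeply for m > 1
```

Integrating such a rate curve over the well's producing life gives the EUR, which is why uncertainty in (q_1, m, a) propagates directly into uncertainty about EUR.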

Example 3: Yield curve modeling
Macroeconomists, financial economists, and market participants all attempt to build good models of the 'yield curve' [87]. The yield curve on a given day shows the interest rates across different maturities (1 month, 1 year, 5 years, etc.) for a similar debt contract. It determines the interest rate pattern (i.e., the cost of borrowing), which can be used to calculate a bond's price [131]. Figure 5 shows daily Treasury par yield curve rates spanning from January 3rd to January 13th, 2022, with maturities up to 30 years. The data source is the U.S. Department of the Treasury (www.treasury.gov/-).
As seen from the panel, the shape of the yield curve displays a slightly delayed humped shape. Economists believe that such a shape of the yield curve has important implications for economic growth [314]. The Nelson-Siegel model [222] is a very popular model in the literature for fitting the term structure:

Y(τ) = β_0 + β_1 (1 − exp(−λτ))/(λτ) + β_2 [(1 − exp(−λτ))/(λτ) − exp(−λτ)],

where Y(τ) denotes the (zero-coupon) yield evaluated at τ, and τ denotes the time to maturity. The model parameters have specific financial meanings: β_0, β_1, and β_2 are related to the long-term, short-term, and medium-term effects on the interest rate, respectively, and λ is referred to as a decay factor [81]. Each yield curve is summarized by the 4-dimensional parameter (β_0, β_1, β_2, λ), and it is known that the model can capture a wide range of possible shapes of the yield curve [60,81,137,222]. Therefore, the Nelson-Siegel model is extensively used by central banks and monetary policymakers [2]. For example, the Federal Reserve updates estimates of (β_0, β_1, β_2, λ) once per week: visit the website (www.federalreserve.gov/-).
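A minimal sketch of evaluating the Nelson-Siegel curve (in its standard three-factor form with decay factor λ) on a grid of maturities; the parameter values are illustrative, not estimates from the Treasury data.

```python
import numpy as np

def nelson_siegel(tau, b0, b1, b2, lam):
    """Nelson-Siegel yield at time-to-maturity tau (tau > 0)."""
    g = (1.0 - np.exp(-lam * tau)) / (lam * tau)  # shared loading factor
    return b0 + b1 * g + b2 * (g - np.exp(-lam * tau))

# Illustrative parameter values (hypothetical, not fitted).
tau = np.array([1 / 12, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0])
y = nelson_siegel(tau, b0=0.03, b1=-0.02, b2=0.02, lam=0.6)
print(np.round(y, 4))
```

Note the limiting behavior that motivates the parameter names: as τ → 0 the yield tends to β_0 + β_1 (the short end), and as τ → ∞ it tends to β_0 (the long-run level).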
In recent years, there has been a great deal of interest in the uncertainty quantification of the Nelson-Siegel parameters over time, and in their relationship with macroeconomic variables such as inflation and real activity in financial applications: refer to [60,68,82,136] for some of those works.

Example 4: COVID-19 pandemic modeling
Figure 6 displays daily infection trajectories describing the cumulative numbers of infected cases for 40 countries, spanning from January 22nd to May 14th, 2020, studied by [176]. In general, during an early phase of a pandemic, information regarding the disease is very limited and scattered, even when it exists. Despite this, it is crucial to predict future cases of infection or death. In such a situation, one consideration is to use data integration (also called 'borrowing information'), combining data from diverse sources and eliciting useful information from a unified view of them. Additionally, it is very important to find risk factors relevant to the disease. Reliable and early risk assessment of a developing infectious disease outbreak allows policymakers to make swift and well-informed decisions needed to ensure epidemic control. Quantifying uncertainty about the final epidemic size is also very important.
The Richards growth curve [243], also called the generalized logistic curve [221], is a popular growth curve for population studies in situations where growth is not symmetric about the point of inflection [10,255]. There are various reparameterized forms of the Richards curve in the literature [29,48,57,165], and one frequently used form is

I(t) = a / [1 + ξ exp(−b(t − c))]^{1/ξ},

where I(t) is the cumulative number of infected cases at time t. Here, the epidemiological meanings of the parameters a, b, and c are the final epidemic size, infection rate, and lag phase of the trajectory, respectively. The parameter ξ is a shape parameter, and it appears to have no clear epidemiological meaning [299].
Each infection trajectory in Figure 6 can be characterized by the 4-dimensional parameter (a, b, c, ξ) if the Richards curve is used. Due to the flexibility afforded by the shape parameter ξ, the Richards curve has been widely used in epidemiology for real-time prediction of disease outbreaks, possibly at an early phase of a pandemic when there is no second wave. Examples include SARS [145,147], dengue fever [144,148], pandemic influenza H1N1 [146], and the COVID-19 outbreak [176,310].
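A common reparameterization of the Richards curve (the one with final size a, infection rate b, lag phase c, and shape ξ described above) can be evaluated as follows; the epidemic parameter values are hypothetical.

```python
import numpy as np

def richards(t, a, b, c, xi):
    """Richards (generalized logistic) cumulative growth curve."""
    return a / (1.0 + xi * np.exp(-b * (t - c)))**(1.0 / xi)

# Illustrative epidemic parameters: final size a, infection rate b,
# lag phase c, shape xi (hypothetical values only).
t = np.arange(0, 120)
cases = richards(t, a=50000.0, b=0.12, c=45.0, xi=0.8)
print(cases[-1])  # approaches the final epidemic size a
```

The sigmoid trajectory saturates at a, which is exactly why uncertainty in the estimate of a translates into uncertainty about the final epidemic size.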

Statistical problem
In the previous subsections, we presented a range of examples in which nonlinear mixed effects models can be exploited. Each poses its own challenges, which are representative of issues many researchers have to deal with in other areas: for example, (1) how to describe a possibly nonlinear clearance with a limited number of patients; (2) how to handle an enormously large number of shale oil wells and make a spatial prediction of EUR at a new location; (3) how to describe the dynamics of the financial parameters over time; and (4) how to integrate data from different sources to produce a more accurate forecast of the epidemic size.
An issue that accompanies all of these problems, raised by researchers, government entities, domain experts, etc., is how to quantify the uncertainty associated with parameter estimation and prediction.
Although the traditional nonlinear mixed effects models, based on the maximum likelihood method, can provide confidence intervals and statistical tests, their calculation generally involves approximations that are most accurate for large sample sizes, as discussed in Subsection 2.1. On the other hand, in the Bayesian approach (in which the prior automatically imposes the parameter constraints), inferences about parameter values based on the posterior distribution usually require integration rather than maximization, and no further approximation is involved. For that reason, the Bayesian approach is often suggested as a viable alternative to the frequentist approach to solving these problems.
We now formulate these problems as a statistical problem. First, we summarize common features of the dataset for the analysis.
(1) There exist repeated measures of a continuous response over time for each subject; (2) There exists variation in individual observations over time; (3) There exists subject-to-subject variation in trajectories; (4) There exist covariates measured at baseline for each subject.
The scope of sampling units considered in the statistical analysis is quite broad. We have seen that a unit can be a patient, a shale oil well, a particular date, or a country. As the unique identifier, we assign the index i to each individual. Denoting by N the number of individuals (i.e., the sample size), the index i takes an integer from 1 to N. The sample size N available for the data analysis varies substantially across different industrial problems as well as subfields within the same industry. For example, the number of shale oil wells on the Eagle Ford Shale Play can be as large as 6,000 [177]. As for the pharmaceutical industry, in phase I cancer clinical trials, the number of cancer patients N may be strictly confined to 25 [178], but for phase III trials of non-oncology drug studies, N can be as large as 2,000 [88].
Here the term 'time' is meant in the broadest sense. It can be a calendar time, a nominal time, a time after some event (e.g., the time after dose from Figure 3 and the time after well completion from Figure 4), or a time to some event (e.g., the time to maturity from Figure 5). Essentially, time can be defined as a physical quantity that can be indexed with consecutive integers to produce a temporal record. Another important characteristic of time is that each subject may have different time points at which observations are measured. In this article, we use t ij to represent the time point, where the integer j = 1, 2, · · · , M i indexes the time points from the earliest to the last observation. Thus, M i represents the number of repeated observations for the i-th individual. When M i is relatively small (or large), we say the repeated measures are sparsely observed (or densely observed). The theophylline data in Figure 3 and the yield curve data in Figure 5 provide examples. As for the repeated measures, y ij denotes the continuous response of the i-th subject at the time point t ij . We assume that y ij has already been pre-processed so that it is ready to be used for statistical modeling.
For most applications, it may be necessary first to transform the data into some new representation before training the model. For example, as seen from Figure 4, oil productions vary substantially across different wells. For that reason, the authors of [177] take a logarithm of the productions to derive the response y ij , followed by appropriate statistical modeling on the log scale. To some extent, data pre-processing may enhance the performance of the model.
Suppose that researchers collected P covariates at baseline from each subject i (i = 1, · · · , N). Here, the baseline refers to the time point t i1 (or possibly right before the time point t i1 ), at which the first response y i1 has not yet been observed. Let x ib denote the b-th covariate of the i-th subject (b = 1, · · · , P). In general, there are two types of covariates: time-invariant and time-varying covariates.
This article mainly concerns the former type. As with N, the number of covariates P varies substantially across industries and specific problems. For instance, in pharmacogenetic analysis, the number of protein-coding genes P would be around 20,000 [306]. In the oil and gas industry, if we consider most of the covariates obtained from the well completion procedure, P could be at least 100 [177].
In conclusion, the dataset for the statistical analysis can be represented by the collection of N triplets {(y i , t i , x i )} N i=1 . Here, for each subject i (i = 1, · · · , N), we formulated two M i -dimensional vectors y i = (y i1 , · · · , y ij , · · · , y iM i ) and t i = (t i1 , · · · , t ij , · · · , t iM i ), and a P-dimensional covariate vector x i = (x i1 , · · · , x iP ). The data structures described up to this point are commonly encountered in longitudinal data studies [79]. Essentially, the feature of the dataset motivating the use of nonlinear mixed effects models is that, for each subject i, the response vector y i displays some nonlinear tendency over time t i , as seen in Figures 3, 4, 5, and 6. To explain this nonlinearity, a researcher needs to theorize some nonlinear function, denoted f, such as the one compartment, Duong, Nelson-Siegel, and Richards models, depending on the context. The construction of such functions relies on the human modeler's abstraction of data into a suitable dynamical system, which is often represented by a differential equation. Such a differential equation has a finite number of parameters that control the dynamics of the solution of the system, understanding of which is vital for causal inference about the nature of the system by associating the parameters with covariates x i . Figure 7 displays a pictorial description of how a PK modeler would see the theophylline concentration trajectory from the modeling perspective, where she theorized that the one compartment model (1) would be suitable to describe the trajectories y i over time t i for each subject i (i = 1, · · · , N). Then the 10-dimensional vector y i is summarized by a 4-dimensional PK parameter vector (F i , k ai , V i , Cl i ); dimension reduction is intrinsically embedded in this process. As each of the parameters F i , k ai , V i , and Cl i has an important clinical meaning, it is very natural to ask how they are related to the P covariates x i to induce a causal relationship.
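The triplet structure {(y i , t i , x i )} with subject-specific numbers of observations M i can be held, for instance, as a list of per-subject records; the sizes and synthetic values below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 5, 3                               # toy numbers of subjects and covariates

dataset = []
for i in range(N):
    M_i = int(rng.integers(4, 9))         # unbalanced design: M_i varies by subject
    t_i = np.sort(rng.uniform(0, 10, size=M_i))   # subject-specific time points
    y_i = rng.normal(size=M_i)            # placeholder responses
    x_i = rng.normal(size=P)              # baseline (time-invariant) covariates
    dataset.append({"y": y_i, "t": t_i, "x": x_i})
```

Holding the data per subject, rather than in one rectangular array, directly accommodates the unbalanced M i emphasized in the text.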
For the purpose of modeling, it may be necessary to transform the model parameters so that the elements θ li (l = 1, · · · , K) are supported on the real line, for example by taking logarithmic or logit transformations. In (2), the conditional mean E[y ij |θ i , σ 2 ] = f (t ij ; θ i ) is a known function governing within-individual temporal behavior dictated by a K-dimensional parameter θ i = (θ 1i , θ 2i , · · · , θ li , · · · , θ Ki ) ∈ R K specific to the subject i. We assume that the residuals, ε ij , are normally distributed with mean zero and an unknown variance σ 2 .
• Stage 3: Prior Distributions in (4) are chosen to encapsulate any information or belief that has been formulated about the parameters. We suggest some popularly used prior options in Section 7.
A directed acyclic graph (DAG) representation of the basic model (2) - (4) is depicted in

Vectorized form of the basic model
We will often wish to write the hierarchy (2) - (4) for the i-th individual's entire response vector and represent it in an equivalent vector form. This turns out to be useful for developing relevant computational algorithms. We first introduce a K × N matrix frequently used throughout this article: Θ = [θ 1 · · · θ i · · · θ N ] ∈ R K×N . (5) The matrix (5) is referred to as the model matrix because it comprises the scalar model parameters from all subjects. Indeed, most computational techniques in the literature, whether frequentist or Bayesian, have been developed to overcome the obstacle posed by the nonlinear entry of the model matrix Θ into the mean function f.
In (5), the subject index i is stacked column-wise, while the model parameter index l is stacked row-wise, which differs from the convention adopted in most of the statistics literature. The column indexing for the subjects (i.e., stacking individual-based vectors horizontally) shown in (5) is often adopted in the modern computational theory of deep learning [126], and one of the main advantages of this indexing is that it may give some pedagogical insights on the use of vectorization of the entries {θ li } K,N l=1,i=1 to exploit parallel computation, stochastic updating, etc., in optimization or sampling techniques.
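A small numpy sketch illustrates the computational payoff of column-stacking: with subjects in columns, the population means α + Bx i for all N subjects come from a single matrix product. All dimensions and values here are toy choices, not taken from the paper.

```python
import numpy as np

K, N, P = 4, 10, 3                   # toy dimensions
rng = np.random.default_rng(1)
alpha = rng.normal(size=(K, 1))      # K-dimensional intercept, kept as a column
B = rng.normal(size=(K, P))          # K x P coefficient matrix
X = rng.normal(size=(P, N))          # covariates stacked column-wise, like Theta

# One matrix product yields the population means alpha + B x_i for ALL subjects
Mean = alpha + B @ X                 # K x N; column i corresponds to subject i

# Column-wise check against the per-subject formula
i = 7
assert np.allclose(Mean[:, i], alpha[:, 0] + B @ X[:, i])
```

This is exactly the vectorization advantage the text attributes to the deep-learning convention: per-subject loops collapse into dense linear algebra.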
We are now in a position to re-write the hierarchy (2) - (4) using the vector notations: • Stage 1: Individual-Level Model y i = f i (t i , θ i ) + ε i , (6) where f i (t i , θ i ) = (f (t i1 ; θ i ), · · · , f (t iM i ; θ i )) for the subject i. The vector ε i is distributed according to the M i -dimensional Gaussian distribution with mean 0 and covariance matrix σ 2 I.
• Stage 2': Population Model (i-indexing) θ i = α + Bx i + η i , η i ∼ N K (0, Ω). (8) Equation (8) is derived by incorporating each of the N columns of the model matrix (5). Here, α represents a K-dimensional vector α = (α 1 , α 2 , · · · , α K ), and B represents a K-by-P matrix with rows β l (l = 1, · · · , K). The K-dimensional vector Bx i on the right-hand side of (8) is mathematically identical to X i β, where X i = I K ⊗ x i ∈ R K×KP and β = (β 1 , β 2 , · · · , β K ) ∈ R KP (I K is the K-by-K identity matrix and ⊗ represents the Kronecker matrix product). The error vector η i is distributed according to the K-dimensional Gaussian distribution with mean 0 and covariance matrix Ω. • Stage 3: Prior Each of the parameter blocks in (σ 2 , α, B, Ω) is assumed to be independent a priori.
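The identity Bx i = X i β with X i = I K ⊗ x i can be verified numerically; the sketch below uses toy dimensions.

```python
import numpy as np

K, P = 3, 4                              # toy dimensions
rng = np.random.default_rng(2)
B = rng.normal(size=(K, P))              # K x P matrix with rows beta_l
x_i = rng.normal(size=P)                 # one subject's baseline covariates

beta = B.reshape(-1)                     # stack the rows beta_1, ..., beta_K
X_i = np.kron(np.eye(K), x_i[None, :])   # X_i = I_K (Kronecker) x_i, size K x KP

assert np.allclose(B @ x_i, X_i @ beta)  # B x_i  ==  X_i beta
```

The rewriting is useful because X i β has the familiar linear-regression design-matrix form, which many linear mixed model derivations assume.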
To summarize, we derived two equivalent vectorized formulations representing the basic model (2) - (4), according to whether the model matrix Θ (5) is vectorized row-wise (vector-form (a), l-indexing) or column-wise (vector-form (b), i-indexing).

Outline
In this section, we investigate the likelihood function based on the basic model (2) - (4). As illustrated in the Bayesian workflow in Subsection 2.2, likelihood theory is fundamental to Bayesian inference (see Figure 2). Therefore, it is worth spending time re-studying the formulation of the likelihood function. One caveat is that, due to the hierarchical nature of the nonlinear mixed effects model, the notion of the likelihood function depends on what part of the model specification is considered to be part of the likelihood, and what is not. Most papers directly consider the marginal likelihood that will be discussed in Subsection 5.4. In this paper, before marching there, we study two other formulations of the likelihood in Subsections 5.2 and 5.3 to gain some pedagogical insights. We also briefly discuss popularly used frequentist computing strategies in Subsection 5.4.

Likelihood based on Stage 1
As in most statistical models, a natural starting point for inference is maximum likelihood estimation.
We start by considering only Stage 1 from the basic model in Section 4.1, ignoring Stages 2 and 3 for now. Then the likelihood function for the i-th subject is L(θ i , σ 2 |y i ) = N M i (y i |f i (t i , θ i ), σ 2 I). Therefore, the likelihood function based on the N subjects y 1:N is the product of these N terms: L(Θ, σ 2 |y 1:N ) = ∏ N i=1 N M i (y i |f i (t i , θ i ), σ 2 I). (10) Now, we maximize the likelihood (10) with respect to the model matrix Θ (5) with σ 2 held fixed: Θ̂ = argmin Θ ∑ N i=1 ∥y i − f i (t i , θ i )∥ 2 2 , (11) where ∥ · ∥ 2 is the Euclidean norm.
The estimator Θ̂ = [ θ̂ 1 · · · θ̂ i · · · θ̂ N ] ∈ R K×N of the model matrix Θ (5) can be obtained by various optimization techniques such as the Newton-Raphson method or gradient descent [37]. Noting the summation across i in (11), the optimization separates by subject, so each θ̂ i can be obtained individually from the i-th subject's data. We can obtain an estimator of the variance σ 2 by plugging Θ̂ into the likelihood (10) and then maximizing with respect to σ 2 . To investigate a denoised temporal tendency for the trajectory y i , we can simply plug θ̂ i into the function f. To see a future pattern, we can extrapolate the function by extending the time index beyond the last time point t iM i . Essentially, the illustrated approach is traditional least squares estimation.
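As an illustration of this Stage-1-only least squares approach, the sketch below fits a hypothetical exponential-decay mean function to one subject's data with a hand-rolled Gauss-Newton loop. The mean function, starting value, and data are invented for the example; a real analysis would use a robust optimizer with safeguards.

```python
import numpy as np

def f(t, theta):
    """Invented mean function: exponential decay with amplitude A and rate k."""
    A, k = theta
    return A * np.exp(-k * t)

def jac(t, theta):
    """Jacobian of f with respect to (A, k)."""
    A, k = theta
    e = np.exp(-k * t)
    return np.column_stack([e, -A * t * e])

rng = np.random.default_rng(3)
t_i = np.linspace(0.5, 10, 8)
y_i = f(t_i, (5.0, 0.4)) + rng.normal(scale=0.05, size=t_i.size)

theta = np.array([4.0, 0.3])              # starting value near the truth
for _ in range(50):                       # Gauss-Newton iterations
    r = y_i - f(t_i, theta)
    J = jac(t_i, theta)
    step, *_ = np.linalg.lstsq(J, r, rcond=None)
    theta = theta + step
sigma2_hat = np.mean((y_i - f(t_i, theta))**2)   # plug-in variance estimate
```

Because the objective in (11) separates across subjects, the same loop would simply be repeated for each i, which is exactly the "individual-by-individual" estimation whose drawbacks the following paragraphs discuss.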
Unfortunately, there are three major drawbacks to this approach. First, it forfeits the opportunity to use 'information borrowing' [176] to improve predictive accuracy because Stage 2 is ignored. What Stage 2 (3) does is borrow strength across the N individuals to produce a better estimator for Θ (5) than an estimator based on individual data alone. A similar issue can be found in the Clemente problem from [93], where the James-Stein estimator [153] predicts better than an individual hitter-based estimator.
Another example, applied to epidemic data, can be found in [176]. Second, it is not well aligned with the generic motivation for using mixed effects models, whose primary purpose is to understand "typical" values of the model parameters in f, representative of the whole population of subjects; this should be addressed by making inference about the parameters α, B, and Ω. Third, it only produces point estimates for the parameters, failing to describe the underlying uncertainty.
A remedy for the first two drawbacks is to consider Stage 1 (2) and Stage 2 (3) hierarchically in a single model, leading to a frequentist version of nonlinear mixed effects models, which will be discussed in Subsections 5.3 and 5.4. To describe the relevant uncertainty within the frequentist framework, one may resort to bootstrap methods [91]. Another solution, resolving all three drawbacks at once, is to incorporate Stage 1 (2), Stage 2 (3), and Stage 3 (4) in a fully Bayesian way, resulting in a Bayesian version of nonlinear mixed effects models, which is the main topic of this paper.

Likelihood based on Stage 1 and 2 from vector-form (a)
A likelihood function based on vector-form (a) is derived here. More specifically, we examine the frequentist setting where the assumptions of Stage 1-(6) and Stage 2-(7) are retained, while the parameters introduced in Stage 3-(9) are treated as fixed (i.e., no prior assumptions).
The individual model in Stage 1 (6) yields a conditional density p(y i |θ i , σ 2 ) = N M i (y i |f i (t i , θ i ), σ 2 I) for each subject i = 1, · · · , N. Under the population assumption in Stage 2 (7), we also have a density for the rows of the model matrix. The joint density of (y 1:N , θ 1:K ) given the parameters σ 2 , α, B, and Ω is then a product-form distribution, where y 1:N = {y 1 , y 2 , · · · , y N } and θ 1:K = {θ 1 , θ 2 , · · · , θ K }. Now, the next step is to integrate out the latent model parameters θ 1:K (or equivalently, the model matrix Θ (5)) from this joint density to get a likelihood for (σ 2 , α, B, Ω): L(σ 2 , α, B, Ω|y 1:N ) = ∫ p(y 1:N , θ 1:K |σ 2 , α, B, Ω) dθ 1:K . (12) In most cases, the integral (12) is not tractable due to the non-linearity of the function f i (t i , θ i ) with respect to θ i . Although it may be possible to use numerical techniques for the evaluation of the integral (12), this might require enormous computational effort, which is rarely pursued in the literature due to the high dimensionality of the integral involving the KN-dimensional model parameters θ 1:K .

Likelihood based on Stage 1 and 2' from vector-form (b)
A likelihood function based on vector-form (b), adopting i-indexing, is derived here. As in Subsection 5.3, we preserve the assumptions of Stage 1-(6) and Stage 2'-(8), but treat the parameters in Stage 3-(9) as fixed. In this specification, for each index i = 1, · · · , N, the individual model in Stage 1 (6) and the population model in Stage 2' (8) lead to the densities p(y i |θ i , σ 2 ) = N M i (y i |f i (t i , θ i ), σ 2 I) and p(θ i |α, B, Ω) = N K (θ i |α + Bx i , Ω), respectively. Thus, the joint density of (y i , θ i ) given the parameters is p(y i , θ i |σ 2 , α, B, Ω) = p(y i |θ i , σ 2 ) p(θ i |α, B, Ω). Given the parameters (σ 2 , α, B, Ω), the ordered pairs in the collection {(y i , θ i )} N i=1 are conditionally independent across individuals. Therefore, a likelihood for (σ 2 , α, B, Ω) is based on the marginal density of y 1:N = {y 1 , y 2 , · · · , y N }: L(σ 2 , α, B, Ω|y 1:N ) = ∏ N i=1 ∫ p(y i |θ i , σ 2 ) p(θ i |α, B, Ω) dθ i (13) = ∏ N i=1 ∫ N M i (y i |f i (t i , α + Bx i + η i ), σ 2 I) N K (η i |0, Ω) dη i , (14) where the last equality is derived by using the change of variable η i = θ i − α − Bx i from (8). The last expression (14) is a standard mathematical formulation on which many frequentist computing strategies are constructed: see Equation (3.2) from [75].
As the model parameter θ i in (13) (or similarly, η i in (14)) enters the mean function f nonlinearly, the N integrals have no closed-form expression. We shall briefly discuss MLE computations. One approach would be to perform a multivariate numerical integration (e.g., Gauss-Hermite quadrature [192]) for each of the N integrals (13), and then obtain the MLE by maximizing the product of the N numerical integrals with respect to the parameters (σ 2 , α, B, Ω) [138]. This approach turns out to be computationally expensive and may have poor convergence properties for the following two reasons [290]. First, the numerical integration necessitates increasingly expensive iterative procedures within an MLE algorithm as the correlation of the model parameters (or equivalently, random effects) increases. Second, convergence may deteriorate severely when the number of model parameters K is large (i.e., a high-dimensional integral) and the number of sampling times M i is small (i.e., sparse data) due to the 'curse of dimensionality' [139].
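The following toy sketch shows the quadrature idea in the simplest possible case: a single scalar random effect η i ~ N(0, ω²) entering a Gaussian response linearly, so the Gauss-Hermite answer can be checked against the closed-form marginal. The model and all numbers are hypothetical, not from the paper.

```python
import numpy as np

# Toy one-dimensional illustration of Gauss-Hermite quadrature for the
# per-subject integral: y_ij | eta_i ~ N(mu + eta_i, sigma2), eta_i ~ N(0, omega2)
mu, sigma2, omega2 = 2.0, 0.5, 1.0
y_i = np.array([2.3, 1.8, 2.9])

nodes, weights = np.polynomial.hermite.hermgauss(100)

def cond_lik(eta):
    """p(y_i | eta) as a product of univariate Gaussian densities."""
    return np.prod(np.exp(-(y_i - mu - eta)**2 / (2 * sigma2))
                   / np.sqrt(2 * np.pi * sigma2))

# Substituting eta = sqrt(2*omega2) * x absorbs the N(0, omega2) density into
# the Gauss-Hermite weight function e^{-x^2}, up to the constant 1/sqrt(pi):
marg = sum(w * cond_lik(np.sqrt(2 * omega2) * x)
           for x, w in zip(nodes, weights)) / np.sqrt(np.pi)
```

Even this one-dimensional case needs many nodes once the conditional density is narrow relative to the prior, hinting at why the cost explodes for the K-dimensional, correlated integrals described above.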
A class of common approaches for the MLE computations is based on analytical approximation of each of the N integrals (14) [22,124,189,288,289], and some of them have been successfully adopted in industrial software like NONMEM [20,23] and SAS [150]. Here, we illustrate the key idea of the first-order method attributed to [22]. Let us define the mapping g i (η i ) = f i (t i , α + Bx i + η i ). By Taylor's theorem (page 375 of [202]), the best linear approximation of the mapping g i (η i ) at the origin 0 is g i (0) + Dg i (0)η i , where Dg i (0) is the Jacobian of g i at 0. Substituting this approximation into (14) for each i (i = 1, · · · , N) makes each integrand Gaussian, leading to a closed-form expression L(σ 2 , α, B, Ω|y 1:N ). (15) To summarize, a linearization was used to convert the nonlinear mixed effects model to a linear mixed effects model, in some sense equivalent to the Lindley-Smith form [187]. This enables us to integrate out the random vectors η i analytically, deriving a marginal likelihood (15) that approximates the exact marginal likelihood (13). The MLE (σ̂ 2 , α̂, B̂, Ω̂) can be obtained by jointly maximizing L(σ 2 , α, B, Ω|y 1:N ) (15), treating the approximation as exact.
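A numerical sketch of the first-order idea: linearize g i (η i ) = f i (t i , α + Bx i + η i ) at η i = 0 via a finite-difference Jacobian and compare with the exact value at a small η i . The exponential mean function and all values are invented for illustration.

```python
import numpy as np

def f(t, theta):
    """Invented nonlinear mean function (exponential decay)."""
    A, k = theta
    return A * np.exp(-k * t)

t_i = np.linspace(0.5, 8, 6)
theta_pop = np.array([4.0, 0.3])         # plays the role of alpha + B x_i

def g(eta):
    return f(t_i, theta_pop + eta)       # subject-level map g_i(eta_i)

# Jacobian Dg(0) by forward differences
eps = 1e-6
g0 = g(np.zeros(2))
Dg0 = np.column_stack([(g(eps * np.eye(2)[:, j]) - g0) / eps for j in range(2)])

eta = np.array([0.05, -0.02])            # a small random-effect value
linear = g0 + Dg0 @ eta                  # first-order approximation g(0) + Dg(0) eta
```

Because the approximation is linear in η i , the approximate model becomes a linear mixed effects model whose random effects integrate out in closed form, which is exactly the payoff described in the text.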
Another way to compute the MLE is through the expectation-maximization (EM) algorithm [80]. Borrowing terms from the EM updating process [207], y i , θ i , (y i , θ i ), (σ 2 , α, B, Ω), and p(y i , θ i |σ 2 , α, B, Ω) (i.e., the integrand in (13)) can be viewed as the observable incomplete data, missing data, complete data, unknown parameters, and density of the complete data, respectively, for the i-th subject. The goal is to maximize the exact marginal likelihood L(σ 2 , α, B, Ω|y 1:N ) (13) by iterating the E-step and M-step, leading to the MLE (σ̂ 2 , α̂, B̂, Ω̂). The E-step computes a conditional expected log-likelihood of (σ 2 , α, B, Ω) based on the hierarchy (2) - (3), followed by the M-step that maximizes this function with respect to (σ 2 , α, B, Ω). The nonlinearity associated with the model matrix Θ (5) makes the E-step intractable. As a remedy, variant versions of the EM algorithm have been proposed; see [7,168,252,292] for technical details applied to hierarchies similar to the basic model. Among them, the stochastic approximation EM algorithm proposed by [78], which splits the E-step into a simulation step and a stochastic approximation step, is widely used in many applications for its numerical stability, fast computation, and theoretical soundness [8,167]; it has been successfully deployed in industrial software including MONOLIX [171] as well as open-source software such as the R package NLMIXR [102].

Bayesian inference
We briefly overview the two contrasting workflows of the Bayesian and frequentist approaches for nonlinear mixed effects models before moving to technical details. Both settings allow randomness in the model matrix Θ (5); in the frequentist setting, this randomness enters only through the random effects distribution, and the parameters (σ 2 , α, B, Ω) are treated as fixed unknowns to be estimated. In contrast, to drive the Bayesian engine, one needs an appropriate prior π(σ 2 , α, B, Ω). After that, the entire collection of parameters (Θ, σ 2 , α, B, Ω) is updated through Bayes' theorem after observing the data y 1:N , leading to the posterior density π(Θ, σ 2 , α, B, Ω|y 1:N ) [26]. The essence of the Bayesian viewpoint is that there is no logical distinction between Θ and (σ 2 , α, B, Ω), which are associated with the random and fixed effects, respectively, from the frequentist perspective. In the Bayesian framework, both Θ and (σ 2 , α, B, Ω) are random quantities. It is important to point out that the likelihood principle is naturally incorporated in Bayes' theorem [188]. Clearly, modern data complications such as enormous volume, large dimensionality, and multi-level structures may necessitate sophistication in the prior specifications.
From a Bayesian perspective, all inferential problems regarding the parameters (Θ, σ 2 , α, B, Ω) may be addressed in terms of the posterior distribution π(Θ, σ 2 , α, B, Ω|y 1:N ) (16). Unfortunately, for almost all problems, the distribution is intractable. In such situations, we need to resort to approximation techniques, and these fall broadly into two classes, according to whether they rely on stochastic [53,218,219,220] or deterministic [32,212,239,295] approximations. See [9,319] for reviews of these techniques. In this article, we mainly focus on stochastic approximation. The basic idea behind the methodology is to construct a Markov chain whose stationary distribution is the posterior distribution π(Θ, σ 2 , α, B, Ω|y 1:N ) (16).

Parallel computation for model matrix
One of the most computer-intensive steps in implementing the Gibbs sampler of Subsection 6.2 is Step 1, sampling the model matrix Θ ∈ R K×N (5), or equivalently its entries {θ li } K,N l=1,i=1 , from the full conditional distribution π(Θ|σ 2 , α, B, Ω, y 1:N ) (17). Clearly, the nonlinear participation of the model parameters in the function f makes the conditional distribution intractable; hence, non-conjugate sampling is unavoidable, which may suffer from slow convergence. At the same time, due to the Markovian nature of the Gibbs algorithm, it is difficult to parallelize all the steps of the Gibbs sampler, which creates difficulties in slower languages like R [274]. Nevertheless, the increasing number of parallel cores available at very low prices drives more and more interest in 'parallel sampling algorithms' that can benefit from the available parallel processing units on computers [174,269].
We suggest a framework of parallel computations to efficiently update the model matrix Θ ∈ R K×N .
This framework is particularly valuable in Bayesian nonlinear mixed effects models when the number of subjects N is much larger than the number of model parameters K (N ≫ K).
The first version of the parallel sampling algorithm is based on scalar updating. For the derivation, we start by analyzing the full conditional posterior distribution of a single element θ li (l = 1, · · · , K; i = 1, · · · , N): π(θ li |−) = π(θ li |θ −li , σ 2 , α, B, Ω, y 1:N ) ∝ π(Θ|σ 2 , α, B, Ω, y 1:N ), where the notation θ −li represents all the entries of Θ except for θ li . Here, we used a conventional notation in Bayesian computation: 'π(θ li |−)' indicates the density π(θ li |θ −li , σ 2 , α, B, Ω, y 1:N ), where the '−' in π(θ li |−) stands for all the parameters in the basic model (2) - (4) except θ li , along with the N observations. Note that the proportional part of the full conditional π(θ li |−) (22) only involves the i-th column vector of the model matrix Θ (5), that is, θ i = (θ 1i , · · · , θ Ki ) ∈ R K , in its analytic expression. This implies that we can update the K entries of the vector θ i (i = 1, · · · , N) independently across subjects. A parallel sampling algorithm is completed by assigning a single CPU process to each subject i. Within the step sampling the K entries of the vector θ i , a Gibbs iterative procedure is required to update the scalar components. The authors of [176,177] applied this technique to update the model matrix of a Bayesian nonlinear mixed effects model trained on the datasets explained in Subsections 3.3 and 3.5. Figure 10 displays the schematic idea of the parallel sampling algorithm. Figure 10: A pictorial description of the use of parallel computation in the basic model (2) - (4) to update the model matrix Θ = [θ 1 · · · θ i · · · θ N ] ∈ R K×N (5). In the parallel computation, a single CPU is assigned to an individual subject i = 1, · · · , N.
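A minimal sketch of the column-wise parallel update, assuming a toy exponential mean function and a random-walk Metropolis move standing in for whatever within-column sampler is used; each worker touches only subject i's data, which is what makes the updates embarrassingly parallel.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(4)
K, N, sigma2 = 2, 8, 0.1

# Toy per-subject data and current state of the model matrix Theta (K x N)
t = np.linspace(0.5, 6, 5)
def f(t, theta):
    return theta[0] * np.exp(-theta[1] * t)   # invented mean function
Y = [f(t, (3.0, 0.5)) + rng.normal(scale=0.1, size=t.size) for _ in range(N)]
Theta = np.ones((K, N))
prior_mean, prior_var = np.array([2.0, 1.0]), 4.0

def update_column(i):
    """One random-walk Metropolis step for theta_i; reads only subject i's data."""
    local = np.random.default_rng(100 + i)    # independent stream per subject
    cur = Theta[:, i]
    prop = cur + 0.1 * local.normal(size=K)
    def logpost(th):
        return (-np.sum((Y[i] - f(t, th))**2) / (2 * sigma2)
                - np.sum((th - prior_mean)**2) / (2 * prior_var))
    return prop if np.log(local.uniform()) < logpost(prop) - logpost(cur) else cur

# The N column updates are conditionally independent, so they may run in parallel
with ThreadPoolExecutor() as pool:
    Theta = np.column_stack(list(pool.map(update_column, range(N))))
```

In a real implementation the worker would run the full within-column Gibbs or MH-type sweep rather than a single step, but the independence structure being exploited is the same.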
The second version of the parallel sampling algorithm is based on vector updating. We analyze the full conditional posterior distribution of the vector θ i (i = 1, · · · , N): π(θ i |−) = π(θ i |θ −i , σ 2 , α, B, Ω, y 1:N ) ∝ π(Θ|σ 2 , α, B, Ω, y 1:N ), where the notation θ −i represents all the column vectors of Θ except for θ i . As in the first version, we can use parallel computation to update the model matrix Θ by simultaneously sampling from the full-conditional density π(θ i |−) (23) across subjects, assigning one CPU to each individual.

Elliptical slice sampler
Due to the non-conjugacy involved in sampling from the univariate density π(θ li |−) (22) or the K-dimensional density π(θ i |−) (23), the choice of a suitable MCMC method, and further the choice of a proposal distribution, is crucial for fast convergence of the Markov chain simulated in Step 1 of the Gibbs sampler in Section 6.2. The Metropolis-Hastings (MH) algorithm [135,209] is the first solution to consider in such intractable situations: see Algorithm 1 from [245]. In practice, the performance of the MH algorithm depends strongly on the choice of the proposal density [62]. Over the past decades, numerous MH-type algorithms have been developed to improve computational efficiency, and these fall broadly into two classes, according to whether the proposal density reflects gradient information [86,90,196,220] or not [208,218]. Specifically, the gradient information here refers to the first-order derivative of the negative log of the target density (i.e., ∇U (θ li ) = −∇ log π(θ li |−) ∈ R or ∇U (θ i ) = −∇ log π(θ i |−) ∈ R K , where ∇ represents the gradient operator). Typically, gradient-based samplers are attractive for their rapid exploration of the state space, but the cost of the gradient computation can be prohibitive when the sample size N or model dimension K is extremely large [3,59]. Fortunately, this requirement can be made less burdensome by using automatic differentiation [128].
In the present subsection, we introduce an efficient gradient-free sampling technique, the elliptical slice sampler (ESS) proposed by [218], to simulate a Markov chain from the density π(θ li |−) (22). (The sampling logic can be directly applied to sampling from the density π(θ i |−) (23) by simply replacing θ li with θ i and changing the dimension of the relevant distributions, stochastic processes, etc., from 1 to K.) Conceptually, the MH and ESS algorithms are similar in that both comprise two steps: a proposal step and a criterion step. The difference between the two algorithms arises in the criterion step. If a new candidate does not pass the criterion, the MH algorithm takes the current state as the next state, whereas ESS shrinks its proposal bracket and re-proposes a new candidate until acceptance takes place, rendering the algorithm rejection-free. The utility of ESS is best appreciated when the analytic expression of the target density can be factored to include a Gaussian prior distribution. Unlike the traditional MH algorithm, which requires a proposal variance or density, ESS is fully automated: no tuning is required.
To adapt the ESS to our example, we re-write the density π(θ li |−) (22) in the following form: π(θ li |−) = Z −1 L(θ li ) N (θ li |µ li , ν li ), (24) where L(θ li ) = exp{−∥y i − f i (t i , θ 1i , · · · , θ li , · · · , θ Ki )∥ 2 2 /(2σ 2 )}, N (θ li |µ li , ν li ) is the conditional Gaussian prior of θ li implied by Stage 2, and Z is the normalization constant. Introducing the notation L(θ li ) signals our intention to treat this function as a likelihood part temporarily, only while sampling from the density (24). Alternatively, one can proceed with the choice L(θ li ) = N M i (y i |f i (t i , θ 1i , · · · , θ li , · · · , θ Ki ), σ 2 I) as the likelihood part to operate ESS, which changes the normalization constant accordingly. We recommend using the simplest possible functional form for the likelihood part to reduce computational cost.
Algorithm 1 summarizes the ESS in algorithmic form, where the situation is at the (s + 1)-th iteration of the Gibbs sampler. Therefore, the goal is to draw θ li (s+1) from π(θ li |−), based on an auxiliary draw from the Gaussian prior and an angle φ ∼ Unif(−π, π].
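A univariate sketch of the ESS update for a target proportional to L(θ)·N(θ|μ, ν²), following [218], with the bracket-shrinking loop that makes the step rejection-free. The Gaussian "likelihood" below is a toy stand-in for L(θ li ), chosen so the exact posterior is known.

```python
import numpy as np

def ess_step(theta_cur, log_lik, prior_mean, prior_sd, rng):
    """One elliptical slice sampling update for a univariate target
    proportional to L(theta) * N(theta | prior_mean, prior_sd^2)."""
    nu = rng.normal(0.0, prior_sd)                       # auxiliary prior draw
    log_u = log_lik(theta_cur) + np.log(rng.uniform())   # slice level
    phi = rng.uniform(0.0, 2 * np.pi)                    # initial angle, bracket
    phi_min, phi_max = phi - 2 * np.pi, phi
    while True:
        theta_prop = (prior_mean
                      + (theta_cur - prior_mean) * np.cos(phi) + nu * np.sin(phi))
        if log_lik(theta_prop) > log_u:                  # slice criterion passed
            return theta_prop
        # shrink the bracket toward phi = 0 and re-propose: rejection-free
        if phi < 0.0:
            phi_min = phi
        else:
            phi_max = phi
        phi = rng.uniform(phi_min, phi_max)

# Toy check: Gaussian "likelihood" times Gaussian prior has a known posterior
rng = np.random.default_rng(5)
log_lik = lambda th: -(th - 1.0)**2 / (2 * 0.5**2)       # stand-in for L(theta_li)
theta, draws = 0.0, []
for s in range(6000):
    theta = ess_step(theta, log_lik, prior_mean=0.0, prior_sd=1.0, rng=rng)
    if s >= 1000:
        draws.append(theta)
draws = np.asarray(draws)    # exact posterior here is N(0.8, 0.2)
```

Note that no step size or proposal variance appears anywhere, which is the "fully automated" property emphasized in the text.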

Metropolis adjusted Langevin algorithm
We introduce the Metropolis-adjusted Langevin algorithm (MALA) [90,196], which is popular for its problem-specific proposal distribution based on gradient information of the target density. The main idea of MALA is to use Langevin dynamics to construct the Markov chain. To adapt the sampling technique to our example, we re-write the density π(θ li |−) (22) in the following form: π(θ li |−) = Z −1 exp{−U (θ li )}, (25) where U (θ li ) = − log π(θ li |−) up to an additive constant and Z is the normalization constant. We then consider a stochastic differential equation [228] that characterizes the evolution of the Langevin diffusion, with the drift term set by the gradient of the log of the density (25): dθ li (t) = −∇U (θ li (t)) dt + √2 dW (t), (26) where {W (t) | t ≥ 0} is a standard 1-dimensional Wiener process, or Brownian motion [279]. In (26), t indexes a fictitious continuous time, and ∇ represents the gradient operator with respect to θ li . Under fairly mild conditions on the function U (θ li ), the equation (26) has a strong solution {θ li (t) | t ≥ 0} that is a Markov process [247]. Furthermore, it can be shown that the distribution of θ li (t) converges to the invariant distribution π(θ li |−) (25) as t → ∞.
Since solving the equation (26) exactly is very difficult, a first-order Euler-Maruyama discretization [15] is used to approximate the solution: θ li [s+1] = θ li [s] − δ ∇U (θ li [s] ) + √(2δ) Z s , Z s ∼ N (0, 1), where δ is the step size of the discretization, and [s] indexes the discrete time steps. This recursive update defines the Langevin Monte Carlo algorithm. Typically, to handle the discretization error and satisfy detailed balance [54] so that the Markov chain converges to the target distribution π(θ li |−) (25), the MH correction is needed. Algorithm 2 details MALA to sample from π(θ li |−) (l = 1, · · · , K; i = 1, · · · , N) (25): Algorithm 2: MALA to sample from π(θ li |−) (22) Goal: Sampling from the full conditional posterior distribution.
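A univariate sketch of MALA: an Euler-Maruyama proposal θ' = θ − δ∇U(θ) + √(2δ)Z followed by the MH correction with the asymmetric Gaussian proposal densities. The Gaussian toy target is invented so the answer can be checked.

```python
import numpy as np

def mala_step(theta, log_post, grad_log_post, delta, rng):
    """One Metropolis-adjusted Langevin step for a univariate target."""
    mean_fwd = theta + delta * grad_log_post(theta)      # drift: -delta * grad U
    prop = mean_fwd + np.sqrt(2 * delta) * rng.normal()
    mean_bwd = prop + delta * grad_log_post(prop)
    # Gaussian proposal log-densities q(prop|theta), q(theta|prop), sans constant
    log_q_fwd = -(prop - mean_fwd)**2 / (4 * delta)
    log_q_bwd = -(theta - mean_bwd)**2 / (4 * delta)
    log_alpha = log_post(prop) - log_post(theta) + log_q_bwd - log_q_fwd
    return prop if np.log(rng.uniform()) < log_alpha else theta

# Toy target: log pi(theta) = -(theta - 0.8)^2 / (2 * 0.2), i.e. N(0.8, 0.2)
rng = np.random.default_rng(6)
log_post = lambda th: -(th - 0.8)**2 / (2 * 0.2)
grad = lambda th: -(th - 0.8) / 0.2
theta, draws = 0.0, []
for s in range(20000):
    theta = mala_step(theta, log_post, grad, delta=0.05, rng=rng)
    if s >= 2000:
        draws.append(theta)
draws = np.asarray(draws)
```

The backward proposal density q(θ|θ') must be evaluated at the reverse drift, which is why the acceptance ratio is not the plain Metropolis ratio: the Langevin proposal is asymmetric.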

Hamiltonian Monte Carlo
We introduce the Hamiltonian Monte Carlo (HMC) algorithm, which employs Hamiltonian dynamics to efficiently explore the parameter space [86,220]. Among the many MH-type sampling algorithms, HMC has been recognized as one of the most effective due to its rapid mixing rate and small discretization error. For that reason, HMC has been deployed as the default sampler in many open packages such as STAN [50] and TENSORFLOW [1]. A key idea of HMC, distinctive from ESS and MALA, is the introduction of an auxiliary momentum variable, which is typically assumed to follow a Gaussian distribution independent of the target variable. By doing so, HMC can produce distant proposals for the target variable, thereby avoiding the slow exploration of the state space that results from the diffusive behavior of simple random-walk proposals.
We now adapt the HMC to our example. We shall first take a look at a joint density: π(θ li , φ li |−) = π(θ li |−) N (φ li |0, m li ), (28) where the auxiliary momentum variable φ li is distributed according to the univariate Gaussian distribution N (φ li |0, m li ) with variance m li . Note that ∫ π(θ li , φ li |−) dφ li = π(θ li |−) due to the independence between θ li and φ li . Therefore, our ultimate goal is to sample from the joint density π(θ li , φ li |−) (28) and keep only θ li , marginalizing out the momentum.
Noting from (28), the negative joint log-posterior is

H(θ li , φ li ) = − log π(θ li , φ li |−) = U (θ li ) + φ li 2 /(2m li ), (29)

where U (θ li ) = − log π(θ li |−), up to an additive constant. The physical analogy of the bivariate function H(θ li , φ li ) : R × R → R (29) is a Hamiltonian [86,182], which describes the sum of a potential energy U (θ li ), defined at the position θ li , and a kinetic energy φ li 2 /(2m li ), defined at the momentum φ li . The corresponding Hamiltonian equations of motion are

dθ li /dt = ∂H/∂φ li = φ li /m li , (30)
dφ li /dt = −∂H/∂θ li = −∇U (θ li ), (31)

where ∇ represents the gradient operator with respect to θ li .
The Hamiltonian system (30) - (31) has three nice properties. Assume that (θ li (t), φ li (t)) : [a, b] → R × R is a solution curve of the system, where a, b ∈ R ∪ {±∞}. Then, the following relationships hold: (a) preservation of total energy: H(θ li (t), φ li (t)) remains constant in t along the solution curve; (b) preservation of volume: the Hamiltonian flow preserves volume in the (θ li , φ li )-space; and (c) time reversibility: the mapping T s from the state at time t, (θ li (t), φ li (t)), to the state at time t + s, (θ li (t + s), φ li (t + s)), is one-to-one, and hence has an inverse T −s .
These three properties are eventually related to the following nice properties of the HMC: (a) a high probability of acceptance of proposals; (b) a simple analytic form of the acceptance ratio (no need to consider a hard-to-compute Jacobian factor); and (c) detailed balance with respect to the target density π(θ li |−). For a detailed description and extensive review, see [220].
For practical applications, the differential equation system (30) - (31) cannot be solved analytically, and numerical methods are required. As the Hamiltonian H in the system is separable (or equivalently, the joint density π(θ li , φ li |−) is factorizable), the leapfrog integrator, which involves a discretized step of the dynamics, is typically used to traverse the state space more efficiently. As in the construction of MALA, the discretization errors arising from the leapfrog integration are addressed by an MH correction step. Algorithm 3 details the HMC to sample from the target density π(θ li |−) = ∫ π(θ li , φ li |−) dφ li (22).
In the algorithm, (s) indexes the sampling iteration within the Gibbs sampler, while [d] represents the index introduced due to the discretization.
Algorithm 3: HMC to sample from π(θ li |−) = ∫ π(θ li , φ li |−) dφ li (22). Goal: Sampling from the full conditional posterior distribution.
i. Draw a fresh momentum: φ li ∼ N (0, m li ).
ii. Make a half step for the momentum: φ li ← φ li − (δ/2)∇U (θ li ).
iii. Alternate full steps for position and momentum: for d = 1, · · · , L, set θ li ← θ li + δ · φ li /m li and, except at the last step, φ li ← φ li − δ∇U (θ li ).
iv. Make a half step for the momentum: φ li ← φ li − (δ/2)∇U (θ li ).
v. Negate the momentum: φ li ← −φ li .
vi. Set the last pair of the solution curve as the proposal, which is accepted or rejected in an MH step based on the Hamiltonian H; after the decision, φ li is deleted and we will draw a new momentum in the next iteration.
This independent drawing of the momentum is the engine that enables HMC to produce distant proposals, but nevertheless maintains a high probability of acceptance.
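The leapfrog-plus-MH mechanics described above can be sketched in a few lines. As before, a standard normal target stands in for π(θ li |−), the mass m li is a scalar, and the function names are illustrative.

```python
import numpy as np

def hmc_step(theta, log_post, grad_log_post, delta, L, m, rng):
    """One HMC iteration: fresh momentum, leapfrog integration, MH correction."""
    U = lambda t: -log_post(t)           # potential energy U(theta)
    grad_U = lambda t: -grad_log_post(t)
    phi = rng.normal(0.0, np.sqrt(m))    # independent momentum draw
    t, p = theta, phi
    p = p - 0.5 * delta * grad_U(t)      # half step for momentum
    for d in range(L):                   # alternate full steps
        t = t + delta * p / m            # full step for position
        if d < L - 1:
            p = p - delta * grad_U(t)    # full step for momentum
    p = p - 0.5 * delta * grad_U(t)      # final half step for momentum
    p = -p                               # negate momentum (time reversibility)
    # Hamiltonian H = U + phi^2 / (2m); accept with prob exp(H_old - H_new)
    H_old = U(theta) + 0.5 * phi ** 2 / m
    H_new = U(t) + 0.5 * p ** 2 / m
    if np.log(rng.uniform()) < H_old - H_new:
        return t
    return theta

log_post = lambda t: -0.5 * t ** 2       # toy stand-in for pi(theta_li | -)
grad_log_post = lambda t: -t

rng = np.random.default_rng(1)
theta, hmc_draws = 3.0, []
for _ in range(3000):
    theta = hmc_step(theta, log_post, grad_log_post,
                     delta=0.3, L=10, m=1.0, rng=rng)
    hmc_draws.append(theta)
hmc_draws = np.array(hmc_draws)
```

Because the leapfrog integrator nearly conserves H, most proposals are accepted even though they land far from the current state.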
The naive HMC (Algorithm 3) requires users to specify at least three parameters: a step size δ, a number of leapfrog steps L, and a mass m li , with which to run the simulated Hamiltonian system. A poor choice of any of these parameters will result in a dramatic drop in the efficiency of HMC. The No-U-Turn Sampler (NUTS) developed by [142] is an extension of HMC designed to automatically tune the parameters (δ, L) while fixing m li = 1, making it possible to run NUTS with no hand-tuning at all. HMC and NUTS are general-purpose inference engines deployed in STAN.
We would like to highlight a difference between MALA (Algorithm 2) and HMC (Algorithm 3). Although both algorithms utilize the gradient information (that is, ∇U (θ li ) = −∇ log π(θ li |−)), the former is based on a stochastic differential equation (26) while the latter is based on an ordinary differential equation system (30) - (31). From the algorithmic perspective, MALA exhibits a single-loop structure and is designed to directly employ the discretization of the underlying Langevin dynamics: the index [s] resulting from the discretization in (27) coincides with the sampling iteration. In contrast, HMC exhibits a double-loop structure, where the leapfrog index [d] runs within each sampling iteration and the discretized dynamics serve only to generate a proposal [322]. Therefore, one can set the number of steps L for the leapfrog integrator to an arbitrary integer.

Priors for variance
Provided the assumption of the basic form (2) - (4), the random error terms {ε ij } (i = 1, · · · , N ; j = 1, · · · , M i ) in Stage 1 and {η li } in Stage 2 are the stochastic sources of the (remaining) intra-individual and inter-individual variabilities, respectively [260]. Both terms are assumed to follow univariate Gaussian distributions in the basic model. This assumption can be generalized to a multivariate Gaussian distribution, a t-distribution, a mixture of Gaussian distributions, etc., depending on what the data exhibit or on a prior guess about the perturbation associated with the model matrix Θ (5) [210].
Recall that the basic model assumes that the data-level errors ε ij are distributed according to N (0, σ 2 ) with variance σ 2 , independently across times j = 1, · · · , M i and subjects i = 1, · · · , N . (We discuss the η-term in Subsection 7.3.) Therefore, the standard deviation σ describes a vertical difference (i.e., measurement error) between the observation y ij and the theory f (t ij ; θ i ) across time and individuals. We can generalize the basic setting by replacing σ 2 with (a) σ 2 i (i = 1, · · · , N ) or (b) σ 2 ij (i = 1, · · · , N ; j = 1, · · · , M i ) to accommodate heterogeneity of the measurement error (a) across subjects or (b) across subjects and time, respectively, provided sufficiently many sampling times M i [291].
For any prior π(σ 2 ), the full conditional posterior distribution of σ 2 (18) is given as

π(σ 2 |−) ∝ π(σ 2 ) · (σ 2 )^(−Σ i M i /2) · exp{ −(1/(2σ 2 )) Σ i=1 N ‖y i − f i (t i ; θ i )‖ 2 2 },

where ‖a‖ 2 represents the Euclidean norm of the vector a.
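Under an inverse-gamma prior IG(a 0 , b 0 ), the full conditional of σ 2 is again inverse-gamma, so this Gibbs step reduces to a single gamma draw. A minimal sketch with simulated residuals (all numerical values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stacked residuals y_ij - f(t_ij; theta_i) across subjects/times
true_sigma2 = 0.25
resid = rng.normal(0.0, np.sqrt(true_sigma2), size=5000)

# Inverse-gamma prior IG(a0, b0); the full conditional is inverse-gamma:
# sigma^2 | - ~ IG(a0 + n/2, b0 + 0.5 * sum of squared residuals)
a0, b0 = 2.0, 1.0
n = resid.size
a_post = a0 + 0.5 * n
b_post = b0 + 0.5 * np.sum(resid ** 2)

# Draw sigma^2 by inverting gamma draws (inverse-gamma sampling)
sigma2_draws = 1.0 / rng.gamma(shape=a_post, scale=1.0 / b_post, size=2000)
```

With this much data, the posterior concentrates tightly around the residual variance used in the simulation.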

Priors for intercept and coefficient vector
One of the central goals of using nonlinear mixed effects models is to identify significant covariates among the P covariates x = (x 1 , · · · , x P ), explaining each of the model parameters θ l (l = 1, · · · , K). This is because the function f in Stage 1 (2) is typically derived from a differential equation system. Such a differential equation has model parameters {θ l } K l=1 which control the dynamics of the solution of the system, and how the parameters are related to covariates is vital to understanding causality. For example, in PK analysis, understanding whether and to what extent weight, renal status, disease status, etc., are associated with drug clearance may dictate how these factors are considered in a dosing schedule.
We explain popularly used priors for the intercept and coefficient vector by taking the vector-form (a) [Stage 1-(6), Stage 2-(7), and Stage 3-(9)] because it directly embeds the framework of linear regression.
A default choice for the prior π(α l ) is the flat prior π(α l ) ∝ 1, also called a uniform prior. Alternatively, a diffuse Gaussian prior π(α l ) = N (a l , b 2 l ) is also often used, fixing b l to be sufficiently large (say, b l = 10 or 100) and a l = 0. In either case, the full conditional density (19) enjoys a conjugate update; hence, Step 3 in the Gibbs sampler in Subsection 6.2 seldom imposes a computational burden. The idea behind the use of (nearly) non-informative priors for the intercept is that such priors induce an almost minimal degree of Bayesian shrinkage and, hence, allow the data to have a (nearly) maximal effect on the posterior estimate for the intercept [199].

Now, we discuss the Bayesian analysis for the coefficients. There are numerous choices for the prior of the coefficient vector β l = (β l1 , · · · , β lb , · · · , β lP ) ∼ π(β l ), which is not surprising because linear regression is arguably one of the most researched topics in statistics. Here, we suggest some popular priors whose main utility falls broadly into two settings, according to whether the design matrix X ∈ R N ×P in (33) is tall (N ≥ P ) or fat (N < P ). In regression theory, the former setting is referred to as low-dimensional regression, and the latter is called high-dimensional regression [99,276,323].
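For the Gaussian-type priors above, the conditional posterior of a coefficient vector in a linear regression is available in closed form. A sketch of the conjugate update under a diffuse Gaussian prior, with a hypothetical tall design (N = 500, P = 3) and the response standing in for the centered model parameters:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical tall design: N subjects, P covariates (N >> P)
N, P = 500, 3
X = rng.standard_normal((N, P))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.5
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=N)

# Diffuse Gaussian prior beta ~ N_P(0, s2 * I) gives a conjugate Gaussian
# posterior: beta | - ~ N(mean_post, cov_post)
s2 = 100.0
prec_post = X.T @ X / sigma2 + np.eye(P) / s2
cov_post = np.linalg.inv(prec_post)
mean_post = cov_post @ (X.T @ y) / sigma2
beta_draws = rng.multivariate_normal(mean_post, cov_post, size=1000)
```

Because N is large relative to P, the posterior mean essentially agrees with the least-squares estimate, illustrating the diminishing influence of the prior in the tall-design setting.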
Under the tall design (N ≥ P ), particularly when the number of subjects N is much larger than the number of covariates P (N ≫ P ), one important theoretical consideration is that Bernstein-von Mises type results [160,193,300] are expected to hold for the posterior inference on the coefficients β l . Roughly speaking, the theorem, sometimes called the "Bayesian Central Limit Theorem", states that the posterior distribution of β l is approximately a normal distribution, in accordance with likelihood theory, as the sample size N goes to infinity, for any prior choice π(β l ) satisfying certain regularity conditions. In reality, this asymptotic behavior may not be empirically observed due to an ill-conditioned design matrix X, a misspecification of the error distribution for η li , or a small sample size. But even in such cases, it is known that the influence of the prior distribution diminishes as N grows, which means that using different priors for β l may not sensitively change the resulting Bayesian inferences about β l ; furthermore, the inference outcomes obtained by Bayesian and frequentist methods agree in most instances under the tall design. For example, see the results of [73] and [291] for pharmacokinetic applications. Some possible options for the prior π(β l ) are (i) the flat prior π(β l ) ∝ 1; (ii) the Gaussian diffuse prior π(β l ) = N P (0, σ 2 β l I) with a large variance σ 2 β l [302]; and (iii) the g-prior π(β l ) = N P (0, g · ω 2 l [X X] −1 ) for some positive value g [317]. The suggested priors yield a conjugate update for Step 4 within the Gibbs sampler.

Now, we discuss some priors for π(β l ) in the linear regression (33) under the fat design setting (N ≪ P ). This setting applies to pharmacogenetics, where one of the main interests is to find important genes that may influence pharmacokinetics or pharmacodynamics [12,234,304]; there, the number of genes P may be a few thousand, while the number of patients N is confined to a few hundred.
A fundamental assumption in this setting is sparsity assumption on the coefficients β l . This means that many of the coefficients of β l = (β l1 , · · · , β lP ) are (close to) zero. The true non-zero coefficients in the β l are referred to as signal coefficients, while the remaining are called noise coefficients.
Statisticians have devised a number of penalized regression techniques for estimating β l under the sparsity assumption [134]. From a Bayesian point of view, sparsity-favoring mixture priors with separate control of the signal and noise coefficients, called 'spike-and-slab priors', have been proposed [118,159,213,313]. Although these priors often lead to attractive theoretical properties [55,56], computational issues and the consideration that many of the β lb 's (b = 1, · · · , P ) may be small but not exactly zero have led to a wide variety of 'continuous shrinkage priors' [51,52,129,231,277], which can be unified through a global-local scale mixture representation [236]. The following hierarchies describe the sparsity-favoring priors:

• Spike-and-slab priors. Each component of the coefficients β l is assumed to be drawn from

β lb |τ l ∼ (1 − τ l ) · δ 0 (β lb ) + τ l · f (β lb ), (l = 1, · · · , K; b = 1, · · · , P ),

where τ l = Pr[β lb ≠ 0]. The function δ 0 (β lb ) is the Dirac delta function at zero and f (β lb ) is a density supported on R; they are called the spike and slab densities, respectively. The spike density shrinks noise coefficients to exactly zero, while the slab density captures signal coefficients by allowing positive mass in the tail region [55,119,161].
• Continuous shrinkage priors. Each component of the coefficients β l is assumed to be drawn from

β lb |λ lb , τ l , ω 2 l ∼ N (0, λ 2 lb τ 2 l ω 2 l ), λ lb ∼ f, τ l ∼ g, ω 2 l ∼ h, (l = 1, · · · , K; b = 1, · · · , P ),

where f , g, and h are priors for λ lb , τ l , and ω 2 l , respectively, supported on (0, ∞). Here, λ lb and τ l are referred to as local-scale and global-scale parameters, respectively. The choices of f and g play a key role in controlling the effective sparsity and the concentration of the prior and posterior distributions [16,203,232,236,261,320].
Roughly speaking, the roles of τ l in both prior frameworks are similar in the sense that they control the degree of sparsity [236]. The slab density and the local-scale prior density are expected to put sufficient mass on the tail regions to detect signal coefficients and produce a robust Bayes estimator for β l [52,179]. For that reason, heavy-tailed densities (e.g., the generalized double Pareto distribution or Cauchy distributions [13,180]) are preferably used. Refer to [27,227] for comprehensive surveys on Bayesian variable selection.
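The global-local representation is easy to simulate from. The sketch below draws from the horseshoe prior, a well-known continuous shrinkage prior with half-Cauchy local scales λ lb ∼ C+(0, 1), holding the global scale τ fixed for simplicity, and contrasts it with a fixed-scale Gaussian; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
B = 100_000

# Horseshoe prior: global-local scale mixture with half-Cauchy local scales
tau = 0.1                                    # global scale: overall sparsity level
lam = np.abs(rng.standard_cauchy(size=B))    # local scales lambda ~ C+(0, 1)
beta_hs = rng.normal(0.0, 1.0, size=B) * lam * tau

# Gaussian prior with a matched, fixed scale for comparison
beta_normal = rng.normal(0.0, tau, size=B)

# The horseshoe places far more mass both near zero (shrinking noise
# coefficients) and in the tails (leaving signal coefficients unshrunk)
frac_near_zero_hs = np.mean(np.abs(beta_hs) < 0.01)
frac_near_zero_n = np.mean(np.abs(beta_normal) < 0.01)
frac_tail_hs = np.mean(np.abs(beta_hs) > 1.0)
frac_tail_n = np.mean(np.abs(beta_normal) > 1.0)
```

The spike-at-zero and heavy-tail behavior seen here is exactly the separate handling of noise and signal coefficients described above.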

Priors for covariance matrix
Consider a nonlinear function f (t; θ) indexed by a K-dimensional model parameter θ = (θ 1 , · · · , θ l , · · · , θ K ) to describe an individual trajectory. A basic assumption is that the components θ l are unrelated across l (l = 1, · · · , K), which is referred to as the uncorrelated design setting. In many practical problems, this setting is reasonably accepted, since one of the fundamental assumptions on f is that each component θ l has its own role in modifying the functional shape of f . A central goal of researchers in using nonlinear mixed models is to examine these roles mathematically, endowed with interpretations by domain experts in terms of physiology, epidemiology, pharmacology, etc. The basic model (2) - (4) that we have illustrated so far is designed under this assumption; recall that the covariance matrix Ω ∈ R K×K in Stage 2 was assumed to be diagonal, Ω = diag(ω 2 1 , · · · , ω 2 l , · · · , ω 2 K ). Under this uncorrelated design setting, possible options for the priors on the scale components ω 2 l (or ω l ) include (i) the Jeffreys prior π(ω 2 l ) ∝ 1/ω 2 l , (ii) the conjugate inverse-gamma prior, and (iii) a half-Cauchy prior on ω l . See the discussion by [114] of the prior options implemented in the 8-schools example.
Researchers often wish to work with the correlated design setting to examine whether a pair of model parameters, θ l 1 and θ l 2 , are physiologically (or epidemiologically, pharmacologically, financially, etc.) associated. Taking the term structure modeling discussed in Subsection 3.4 as an example, it is a valid question whether the Nelson-Siegel parameters are correlated, as they are all associated with the interest rate [81]. Statistically, a well-designed covariance structure can also improve the model fit and produce more reliable estimators for the model parameters compared to uncorrelated designs. To make a fully Bayesian inference about Ω (34), we need to specify an appropriate prior π(Ω), which we will discuss shortly. After that, we run the Gibbs sampler in Subsection 6.2, with some modifications if necessary.
For instance, to implement the parallel computation in Step 1, we recommend sampling from the joint density π(θ i |−) across i, instead of sampling from the individual π(θ li |−). In particular, among the five steps of the Gibbs sampler, the implementation of Step 5 needs special care in sampling from the full conditional posterior density of Ω. This step can be highly complicated depending on the chosen prior.
We briefly mention the challenge of choosing a workable prior π(Ω). Research on this subject is vast, growing, and deep. As with the obstacles encountered in classical covariance estimation [28,94,98,169,267], there are three major aspects, among many others, to consider in designing a thoughtful prior π(Ω) that produces a reliable Bayes estimator of Ω: (1) the sample size N ; (2) the number of model parameters K; and (3) the positive-definiteness of Ω [237]. The first two aspects are related to theoretical constraints.
In general, it is known that the estimation of the covariance Ω can be distorted unless the ratio K/N is sufficiently small (see, e.g., [173,238,308]). The third one is germane to the modeling consideration and computation strategies to estimate Ω, typically resolved via principal component analysis, Cholesky decomposition, and Gaussian graphical models, etc [18,190].
For any prior π(Ω), the full conditional density of Ω is analytically expressed as follows (see [108] for a similar derivation):

π(Ω|Θ, σ 2 , α, B, y 1:N ) ∝ π(Ω) · det(Ω)^(−N/2) · exp{ −(N/2) · tr(Ω −1 G) }, (35)

where φ i = θ i − Bx i (i = 1, · · · , N ), and φ̄ ∈ R K and G ∈ R K×K are defined by

φ̄ = (1/N ) Σ i=1 N φ i , G = (1/N ) Σ i=1 N (φ i − φ̄)(φ i − φ̄) .

In (35), we used the vector-form (b) (i.e., i-indexing) to express the prior for Θ. The notations det(A) and tr(A) denote the determinant and trace of a square matrix A, respectively. The matrix G is the 'latent' covariance matrix based on the model matrix Θ (5) and the coefficient matrix B ∈ R K×P , whose form resembles a sample covariance matrix computed as if the φ i were observed [308].
Traditionally used priors for the covariance matrix Ω (34) are the Jeffreys prior and the conjugate inverse-Wishart prior (see [43,186] for reviews of the earlier works):

• Jeffreys prior. The common non-informative prior has been the Jeffreys improper prior

π(Ω) ∝ det(Ω)^(−(K+1)/2) .

This prior was originally derived from an invariance argument by [156] for the case K = 1, 2; and it was considered for arbitrary K by [108,109,285] to develop Bayesian multivariate theory.
• Inverse-Wishart prior. The common informative prior is the inverse-Wishart prior [253]

π(Ω) = IW(V, d) = { det(V)^(d/2) / ( 2^(dK/2) Γ K (d/2) ) } · det(Ω)^(−(d+K+1)/2) · exp{ −(1/2) tr(V Ω −1 ) },

where Ω and V are K-by-K positive definite matrices, and Γ K (·) is the multivariate gamma function [152]. V is a scale matrix, and d (> K − 1) is the number of degrees of freedom. Conventionally, d is chosen to be as small as possible to reflect vague prior knowledge. A univariate specialization (K = 1) is the inverse-gamma distribution.
The success of Bayesian computation and MCMC in the late 1980s opened up the potential of using more flexible, non-conjugate priors for covariance matrices [71,309,312]. Limitations of the traditional priors studied by many statisticians also motivated the development of new priors. For example, some argued that the Jeffreys prior may not be truly non-informative, particularly in high-dimensional settings [18,270], and that the inverse-Wishart prior is very restrictive and lacks flexibility [72]. Among the many new priors developed for particular applications [259], the combination of the separation strategy developed by [18] and the LKJ prior [184] has been especially successful, heavily used in a variety of industrial problems, and supported by relevant software, including STAN [50,265].
In the separation strategy, the covariance matrix is decomposed into scale and correlation parts, Ω = diag(ω) · R · diag(ω), where ω is the vector of standard deviations and R is the correlation matrix. Then, the priors π(ω) and π(R) are supported on (0, ∞) K and the correlation-matrix space R K , respectively. Finally, we draw samples from each of the full conditional posterior distributions π(ω|R, −) and π(R|ω, −) in turn in Step 5 within the Gibbs sampler in Subsection 6.2. (This means that we are not directly sampling the covariance matrix Ω from π(Ω|−) (35), as we would if the Jeffreys or inverse-Wishart prior were used.) Standard prior options for the scale vector ω = (ω 1 , · · · , ω l , · · · , ω K ) are (i) log(ω) ∼ N K (a ω , B ω ), where log(ω) = (log ω 1 , · · · , log ω l , · · · , log ω K ) , with hyperparameters mean a ω ∈ R K and covariance B ω ∈ R K×K , which is often diagonal [18]; and (ii) ω l ∼ C + (0, b ω l ) with scale hyperparameter b ω l > 0 [265]. As for the prior distribution for the correlation matrix R ∈ R K×K , the LKJ prior proposed by [184] is popularly used:

• LKJ prior. The LKJ prior is supported over the correlation matrix space R K , or equivalently over the set of K × K Cholesky factors of real symmetric positive definite correlation matrices, with density

π(R) ∝ (det R)^(γ−1) , (37)

indexed by the shape parameter γ > 0; the normalizing constant can be written in terms of the beta function B(α, β). If γ = 1, the density is uniform over the space R K ; for γ > 1, the density increasingly concentrates mass around the identity matrix I ∈ R K×K (i.e., favoring less correlation); for γ < 1, the density increasingly concentrates mass in the other direction and has a trough at the identity matrix (i.e., favoring more correlation).
Note that the normalizing constant of the LKJ prior (37) does not depend on R; therefore, we may simply write π(R) ∝ (det R) γ−1 with the shape hyperparameter γ > 0. The behavior of the LKJ prior with γ = 1 (i.e., π(R) ∝ 1) was studied by [18], where the authors found that as K increases, the marginal correlations tend to concentrate around zero (see Figure 1 of [18]); hence, the model matrix Θ (5) is more likely to be treated as in the uncorrelated design setting.
As for the hyperparameter specification of the LKJ prior, the Stan Development Team [265] recommends using γ ≥ 1. This suggestion is also well aligned with the original intention of the separation strategy for building a variance-correlation structure in [18], in that (1) the authors intended to choose a diffuse prior π(R) to reflect weak knowledge about the correlation matrix R, while (2) prior knowledge, possibly informative, is placed on the scale parameters by specifying π(ω), as most statisticians are normally trained to do.
Computational algorithms and theory concerning the LKJ prior can be found in [120,157,184].
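The separation strategy itself is mechanically simple: given a scale vector ω and a correlation matrix R, the covariance is reassembled as Ω = diag(ω) R diag(ω). A sketch, where a fixed correlation matrix stands in for an LKJ draw (which we do not implement here) and the half-Cauchy scale draws are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
K = 3

# Scale vector omega: half-Cauchy draws C+(0, 1) for each standard deviation
omega = np.abs(rng.standard_cauchy(size=K))

# A fixed correlation matrix R stands in for an LKJ draw in this sketch
R = np.array([[1.0, 0.3, -0.2],
              [0.3, 1.0, 0.5],
              [-0.2, 0.5, 1.0]])

# Separation strategy: Omega = diag(omega) @ R @ diag(omega)
Omega = np.diag(omega) @ R @ np.diag(omega)

# Positive-definiteness is inherited from R; Cholesky succeeds iff Omega is PD
L_chol = np.linalg.cholesky(Omega)
```

Because the scale and correlation parts are separated, one can place an informative prior on ω while keeping π(R) diffuse, exactly as recommended above.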

Setting
The recent development of MCMC methods has made it possible to fit enormously large classes of models with the aim of exploring the real-world complexities of data [123]. This ability naturally leads us to compare several candidate models that vary substantially in complexity and to choose the best among them. For example, the authors of [177] compared four different rate-decline curves fitted to the production data from the 360 wells in Figure 4. Indeed, upstream petroleum engineers endeavor to find a rate-decline curve describing the production trajectories as accurately as possible so that the EUR can be accurately estimated. Another application of model selection can be found in PK analysis. Taking the theophylline data in Figure 3 as an example, PK modelers may debate whether they need a two- or three-compartment model with nonlinear clearance to describe the PK exposure, or whether a one-compartment model with linear clearance is sufficient.
In the current section, our primary focus is to illustrate a Bayesian approach to compare multiple Bayesian nonlinear mixed effects models explaining the data y 1:N . To that end, we lay out the setup that underlies our model selection procedure. Consider Stages 1 and 2 of the basic model (i.e., the hierarchy (2) - (3)), endowed with a joint prior π(σ 2 , α, B, Ω). Importantly, we do not impose the independence assumption on the prior as we did in Stage 3 (4): any prior assumption on the parameters (σ 2 , α, B, Ω) works in our framework for model comparison. Therefore, the basic model (2) - (4) is a subclass of the candidate models that we want to compare. In our framework, we can also consider the correlated design setting discussed in Subsection 7.3, where the covariance matrix Ω (34) is allowed to be any positive-definite matrix (i.e., it need not be diagonal), as one of the candidates. Ultimately, in our framework, modelers have freedom to choose (i) the nonlinear function f to describe the temporal profile y i and (ii) the prior distribution π(σ 2 , α, B, Ω).
Assume that a researcher wants to consider H candidate functions, denoted as f (t; θ [M h ] ) (h = 1, · · · , H), where each of the models M h has the corresponding model matrix Θ [M h ] .
Finally, Stage 3 of each of the H models comprises the corresponding joint prior on (σ 2 , α, B, Ω).
We describe three model comparison criteria that are popularly used in the literature: the deviance information criterion (DIC) [113,263], the widely applicable information criterion (WAIC) [301], and the posterior predictive loss criterion (PPLC) [110]. As with frequentist information criteria [4,45,92], the formulations of DIC, WAIC, and PPLC take two terms into consideration: goodness-of-fit and a penalty for model complexity. Because increasing (or decreasing) model complexity is accompanied by the risk of over-fitting (or under-fitting), models should be compared by trading off these two terms. In particular, as we are discussing a Bayesian hierarchical model, these criteria obviously depend on which part of the model specification is considered part of the likelihood and which is not. [263] refer to this as the focus issue. For example, in the general form of a Bayesian hierarchical model consisting of a top-level likelihood p(y|Ψ) for data y, a prior model π(Ψ|η), and a hyperprior π(η), one might choose as the likelihood either the conditional density p(y|Ψ) or the marginal density p(y|η) = ∫ p(y|Ψ)π(Ψ|η)dΨ. Following [263], the former situation is referred to as "focus on Ψ", while the latter situation is referred to as "focus on η".
Here, L((Ψ [M h ] ) i |y i ) is the likelihood (i.e., data distribution) based on the individual data y i with parameter (Ψ [M h ] ) i . In the next subsections, we provide brief summaries of the criteria and then adapt them to our context. In what follows, to simplify the notation, we suppress the argument [M h ] in the parameters. For a detailed explanation of the criteria, refer to [17,115].
8.2 Deviance information criterion

[113] suggested DIC, a generalized version of the Akaike information criterion [4] for Bayesian hierarchical models, given by

DIC = D(Ψ̄) + 2 p D , (38)

where Ψ̄ is the posterior mean of Ψ and p D is the effective number of parameters. In (38), the function D(Ψ) = −2 log L(Ψ|y 1:N ) is referred to as the deviance. The deviance is a goodness-of-fit statistic whose lower value indicates a better fit [58].
Returning to our examples, the DIC for each candidate is computed with the likelihood built from the individual-based data distributions, where only the mean function f i (t i , θ i ) differs across the H models M h , h = 1, · · · , H. In practice, the posterior mean Ψ̄ and the effective number of parameters p D are not expressed in closed form; hence, the DIC (38) is stochastically approximated through MCMC techniques [244].
8.3 Widely applicable information criterion

[301] introduced WAIC, which is regarded as a fully Bayesian version of the DIC (38) in the sense that its goodness-of-fit term exploits the entire posterior distribution. Note that the goodness-of-fit term of the DIC (38) is obtained by plugging the posterior mean Ψ̄ into the deviance D(Ψ), which lacks a fully Bayesian sense. It is known that WAIC is asymptotically equivalent to Bayesian cross-validation [283] and is also applicable to singular models.
WAIC is defined by

WAIC = −2 (LPPD − p W ), (39)

where the goodness-of-fit term LPPD is the log posterior predictive density, Σ i=1 N log E[L(Ψ i |y i ) | y 1:N ], and the effective number of parameters in the penalty term is p W = Σ i=1 N V[log L(Ψ i |y i ) | y 1:N ]. In practice, similar to the DIC (38), the WAIC (39) is obtained by stochastic approximation. Given posterior samples {Ψ (s) } S s=1 ∼ π(Ψ|y 1:N ), the LPPD and p W terms may be approximated by

LPPD ≈ Σ i=1 N log{ (1/S) Σ s=1 S L(Ψ i (s) |y i ) }, (40)
p W ≈ Σ i=1 N (1/(S − 1)) Σ s=1 S { log L(Ψ i (s) |y i ) − (1/S) Σ s'=1 S log L(Ψ i (s') |y i ) } 2 . (41)

Returning to our examples, we can approximate the WAIC corresponding to each of the H models as follows: first, set L(Ψ i |y i ) in (40) and (41) to the individual-based data distribution, where only the mean function f i (t i , θ i ) differs across the H candidate models; second, approximate the LPPD and p W using an MCMC method; and finally, obtain the approximation of the WAIC (39) for each model.
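Given an S × N matrix of pointwise log-likelihood evaluations from MCMC output, the approximations (40) - (41) take only a few lines of code. A sketch on a toy normal-mean model, where the approximate posterior draws are hypothetical stand-ins for Gibbs output:

```python
import numpy as np

def waic(loglik):
    """WAIC from an (S draws) x (N subjects) matrix of pointwise log-likelihoods."""
    # LPPD: log of the posterior-averaged likelihood per subject, summed
    # (computed stably by subtracting the per-column maximum before exp)
    max_ll = loglik.max(axis=0)
    lppd = np.sum(max_ll + np.log(np.mean(np.exp(loglik - max_ll), axis=0)))
    # Effective number of parameters: posterior variance of the log-likelihood
    p_w = np.sum(loglik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_w)

rng = np.random.default_rng(6)

# Hypothetical data y_i ~ N(theta, 1), with approximate posterior draws of theta
N, S = 50, 2000
y = rng.normal(1.0, 1.0, size=N)
theta_draws = rng.normal(y.mean(), 1.0 / np.sqrt(N), size=S)
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - theta_draws[:, None]) ** 2
waic_value = waic(loglik)
```

Lower WAIC indicates a better trade-off between fit and complexity, so in practice one computes `waic_value` for each of the H candidate models and ranks them.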

8.4 Posterior predictive loss criterion

[110] introduced PPLC as an alternative to the DIC (38) or WAIC (39). A notable feature of PPLC, different from DIC and WAIC, is its use of replicated observations, denoted by y rep i = (y rep i1 , y rep i2 , · · · , y rep iM i ) ∈ R M i , corresponding to the actual observations y i = (y i1 , y i2 , · · · , y iM i ) ∈ R M i , for each i = 1, · · · , N . Here, the replicate y rep i for subject i is drawn from its posterior predictive density

f (y rep i |y 1:N ) = ∫ p(y rep i |Ψ i ) · π(Ψ|y 1:N ) dΨ, (i = 1, · · · , N ), (42)

where p(y rep i |Ψ i ) is the data density for the i-th subject and π(Ψ|y 1:N ) is the posterior distribution. The idea of using replicates {y rep i } N i=1 for a criticism of the model in light of the observed data {y i } N i=1 is also advocated by [36].
The general rule of the PPLC is based on a balanced loss function [318]. Given a loss function l(·, ·) and a positive real number k, a balanced loss function is defined by

l(y rep i , a i ; y 1:N ) = l(y rep i , a i ) + k · l(y i , a i ), k > 0, i = 1, · · · , N, (43)

where a i is a non-stochastic action vector, k is a weight, and y rep i is a replicate of its observed counterpart y i .
Conceptually, the role of the action vector a i is to accommodate both y i and what we predict for y rep i . Note that the loss function on the left-hand side of (43) penalizes actions a i both for departure from the corresponding observed value (fit) and for departure from what we expect the replicate to be (smoothness) [17]. A generic version of PPLC, denoted D k , is defined by summing over subjects the minimized posterior predictive expectation of the balanced loss (43). By choosing the quadratic loss l(y, a) = ‖y − a‖ 2 2 in (43), the generic PPLC D k may be simplified as

D k = {k/(k + 1)} G + P, (44)

where G = Σ i=1 N ‖ν i − y i ‖ 2 2 with predictive means ν i = E[y rep i |y 1:N ], and P = Σ i=1 N ς 2 i with ς 2 i the total posterior predictive variance of y rep i . Finally, we adapt the PPLC (44) to our examples. Due to the definition of the notation Ψ i = (θ i , σ 2 ) (i = 1, · · · , N ), the posterior predictive distribution of y rep i (42) is obtained by integrating the data density p(y rep i |θ i , σ 2 ) against the posterior π(θ i , σ 2 |y 1:N ). To approximate D k (44) for each model: first, choose a number k, say k = 1; second, approximate ν i and ς 2 i through replicates y rep i drawn from the predictive density f (y rep i |y 1:N ) (42) for each i = 1, · · · , N ; and finally, compute G and P to obtain an approximation of D k (44).
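The quadratic-loss PPLC requires only posterior predictive replicates. A sketch comparing a well-centered predictive density with a deliberately biased one, both toy stand-ins for f (y rep i |y 1:N ):

```python
import numpy as np

def pplc(y_obs, y_rep, k=1.0):
    """Posterior predictive loss criterion (44) under quadratic loss.

    y_obs: (n,) observed values; y_rep: (S, n) posterior predictive replicates.
    """
    nu = y_rep.mean(axis=0)          # predictive means nu_i
    varsigma2 = y_rep.var(axis=0)    # predictive variances
    G = np.sum((nu - y_obs) ** 2)    # goodness-of-fit term
    P = np.sum(varsigma2)            # penalty term (predictive variability)
    return k / (k + 1.0) * G + P

rng = np.random.default_rng(7)
n, S = 40, 1000
y_obs = rng.normal(0.0, 1.0, size=n)

# Replicates from a well-centered vs. a deliberately biased predictive density
y_rep_true = rng.normal(0.0, 1.0, size=(S, n))
y_rep_biased = rng.normal(2.0, 1.0, size=(S, n))
d_true = pplc(y_obs, y_rep_true)
d_biased = pplc(y_obs, y_rep_biased)
```

As expected, the model whose predictive density tracks the data attains the smaller D k .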

Residual error models
In the basic version of the Bayesian nonlinear mixed effects model (2) - (4), we assume that the residual errors in the individual-level model are additive to the mean function f across all subjects and times. Under this assumption, the (conditional) variance of the i-th subject's trajectory, V[y ij |θ i ], is constant at σ 2 over time t ij (j = 1, · · · , M i ). The additive error model is the most standard assumption used in a variety of problems arising in industrial and academic research [176,177,217,287,291]. However, when there exists a systematic temporal trend in the volatility of the individual trajectories, for instance when the variance V[y ij |θ i ] seems to decrease over time t ij as shown in Figures 3 and 4, the additive residual assumption may not be adequate to fully account for the reality of the data. Table 3 lists popularly used residual error models that can be used in Stage 1 (2). Some of them are deployed as options for the user to choose in industrial software such as MONOLIX [171] and NONMEM [20,23], and in the open-source R package NLMIXR [102]. If we assume ϵ ij = ε ij = 0, then all the error models lead to the same deterministic equation y ij = f (t ij ; θ i ). That being said, if the variances of the residuals ϵ ij ∼ N (0, σ 2 ) and ε ij ∼ N (0, ς 2 ), that is, σ 2 and ς 2 , are quite small, then the inference outcomes based on each of the error models will be similar to each other. For example, the additive model in Table 3 is y ij = f (t ij ; θ i ) + ϵ ij , with conditional mean f (t ij ; θ i ) and conditional variance σ 2 .
[177] analyzed the shale oil data shown in Figure 4 in this formulation.
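The practical difference between additive errors and the proportional-type errors offered by MONOLIX/NONMEM-style software (where, in a standard proportional model, y ij = f (t ij ; θ i )(1 + ε ij )) is easy to see by simulation; the decaying exponential mean below is only an illustrative stand-in for a decline curve:

```python
import numpy as np

rng = np.random.default_rng(8)

t = np.linspace(0.1, 10.0, 200)
f = np.exp(-0.3 * t)               # decaying mean trajectory (illustrative)
sigma = 0.05

S = 2000
eps = rng.normal(0.0, sigma, size=(S, t.size))
y_additive = f + eps               # V[y|theta] = sigma^2, constant in t
y_proportional = f * (1.0 + eps)   # V[y|theta] = f(t)^2 sigma^2, decays with f

sd_add = y_additive.std(axis=0)    # flat noise band
sd_prop = y_proportional.std(axis=0)  # noise band shrinking with the mean
```

The proportional model reproduces exactly the shrinking volatility band visible in decaying production or concentration profiles, which the additive model cannot.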

Bayesian nonparametric methods
Recently, the use of Bayesian nonparametric (BNP) statistical models has received increasing attention in the statistical literature because they allow modelers to gain flexibility and robustness compared to parametric counterparts [140,216]. BNP methods can be applied to the formulation of the basic model (2) - (4) when the parametric specification of the error distributions is too restrictive to achieve a certain purpose of the analysis, or when inference results perform poorly due to an inappropriate parametric form.
Typically, BNP methods are applied to the population-level model, by extending or relaxing the parametric assumption on the random errors η li , while retaining the individual-level model as fully parametric [293].
A Gaussian process prior [197,240] or a Dirichlet process prior [95,96,100] is popularly used for such extension and relaxation. Mathematical concepts of the processes are explained in [100,240].
To illustrate some motivation behind the application of BNP methods, we take the shale oil production data in Figure 4, researched by [177], as an example. Their goal was to predict the EUR at a new location before the actual drilling takes place. Because the geological location is not stochastically incorporated into the basic model (2) - (4), the authors extended the linear regression in Stage 2 into a spatial linear regression as follows:

θ li = θ l (s i ) = α l + x i β l + υ l (s i ) + η l (s i ), (i = 1, · · · , N ; l = 1, · · · , K),

where η l (·) ∼ GP(0, ω 2 l I(·, ·)) represents a Gaussian white noise process with variance ω 2 l and indicator function I(·, ·). The stochastic process υ l (·) ∼ GP(0, K(·, ·)) is a newly introduced Gaussian process with a radial basis function kernel K(s i 1 , s i 2 ) = γ 2 l exp[−‖s i 1 − s i 2 ‖ 2 2 /{2ρ 2 l }] with variance γ 2 l and range parameter ρ l , and s i represents the (longitude, latitude) of the i-th shale oil well. The existence of υ l (·) enables spatial prediction of the EUR at a new location, taking advantage of geological proximity information, a technique called latent kriging.
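One ingredient of the latent kriging construction can be sketched directly: building the RBF covariance over hypothetical well coordinates and drawing the spatial process υ l at the observed sites. The jitter term below is a purely numerical stabilization, not part of the model.

```python
import numpy as np

def rbf_kernel(s1, s2, gamma2, rho):
    """RBF kernel: gamma^2 * exp(-||s1 - s2||^2 / (2 rho^2)), pairwise."""
    d2 = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)
    return gamma2 * np.exp(-d2 / (2.0 * rho ** 2))

rng = np.random.default_rng(9)

# Hypothetical (longitude, latitude) coordinates for N wells in a unit square
N = 30
sites = rng.uniform(0.0, 1.0, size=(N, 2))
K = rbf_kernel(sites, sites, gamma2=1.0, rho=0.2)  # Gram matrix of the GP

# One draw of the latent spatial process upsilon_l at the observed sites
jitter = 1e-6 * np.eye(N)  # numerical stabilization of the Cholesky factor
upsilon = np.linalg.cholesky(K + jitter) @ rng.standard_normal(N)
```

Prediction at a new site then follows the usual GP conditioning, using the cross-covariance `rbf_kernel(new_sites, sites, ...)`, which is exactly how geological proximity enters the EUR prediction.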
Another motivation for the use of BNP methods is the situation where the distribution $P$ of the model parameter vectors, $\{\theta_i\}_{i=1}^{N} \sim P$, is multimodal. Note that, in the basic model (2)-(4), the distribution $P$ is assumed to be a single multivariate normal distribution $\mathcal{N}_K(\alpha + Bx_i, \Omega)$. In the multimodal case, the population may consist of disparate subpopulations, and the single multivariate normal distribution of the basic model can perform poorly due to its lack of flexibility. A natural generalization to accommodate such multimodality is an extension to a finite mixture of multivariate normal distributions [205] or, further, to a countably infinite mixture of multivariate normal distributions [241]. In the latter case, if a Dirichlet process prior [100] is placed on the mixture components, the resulting infinite mixture model is generally called a Dirichlet process mixture (DPM) model [11,275]. The DPM model is one of the most studied topics in BNP methods in recent years [140]. See [154] for a survey of posterior computation using DPM models.
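As a hedged illustration of the DPM idea, the sketch below draws observations from a truncated stick-breaking approximation of a DP mixture of univariate normals; the truncation level, the base measure N(0, 3^2), and all names are illustrative choices of ours, not a specification from the cited works.

```python
import numpy as np

def stick_breaking_weights(alpha, T, rng):
    """Truncated stick-breaking DP weights: w_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=T)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

def sample_dpm_normal(n, alpha, T, rng):
    """Draw n observations from a truncated DP mixture of normals with base measure N(0, 3^2)."""
    w = stick_breaking_weights(alpha, T, rng)
    w = w / w.sum()                      # fold the truncation remainder back into the weights
    mu = rng.normal(0.0, 3.0, size=T)    # component means drawn from the base measure
    z = rng.choice(T, size=n, p=w)       # latent component assignments
    return rng.normal(mu[z], 1.0)        # observations around their component means
```

Because distinct components receive distinct means, draws from such a mixture naturally exhibit the multimodality that a single multivariate normal population model cannot capture.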
A number of authors have studied DPM models under the basic model (2)-(4) or similar forms with their own specifications [46,217,249,293]. For example, [293] placed a DPM model only on the model parameter vectors $\theta_i$ $(i = 1, \cdots, N)$, while the covariates $x_i$ are incorporated into the base measure of the Dirichlet process. In contrast, [217,249] used a DPM model jointly for the model parameter vectors and covariates, $(\theta_i, x_i)$ $(i = 1, \cdots, N)$, to induce a nonparametric regression function $E[\theta_i | x_i]$. [46] used a Dirichlet process prior only for a certain component $\theta_{li}$ $(i = 1, \cdots, N)$ corresponding to a block indicator in an analysis-of-variance setup. Refer to [215] and the references therein for a review of further specifications.

Software development
Recent years have seen the great success of Bayesian nonlinear mixed effects models, or more generally, Bayesian hierarchical models (BHM), in a variety of disciplines such as biology, medical research, physics, and the social and educational sciences [42,66,176,177,178]. This is partly due to the widespread introduction of non-commercial software packages that have enabled applied researchers to answer substantive research questions through applications of BHM [23,44,194,201,235,265]. Most Bayesian software packages, such as JAGS [235], BUGS [194], and STAN [265], are designed to require only a reasonable understanding of the MCMC sampling scheme. From the perspective of implementation, most Bayesian software packages share a similar spirit: researchers only need to designate a DAG structure [162,170] of a BHM. Such a DAG structure can be abstractly represented as the collection { data y, likelihood p(y|θ), prior π(θ|η), hyperprior π(η) }, which is specified textually or graphically, after which the software returns simulated Markov chains from the posterior distribution π(θ, η|y). See [194,195] for an overview of how Bayesian software operates.
From the algorithmic perspective, the performance of Bayesian software depends largely on two aspects: (i) whether the program has been designed to exploit the conditional independence structure arising from the hierarchy; and (ii) which sampling algorithms have been deployed to simulate Markov chains from a non-closed-form, possibly high-dimensional, distribution. As discussed in Subsection 6.3, conditional independence is inherent in the formulation of a BHM, and its proper exploitation can greatly improve computational efficiency [117,175]. This is mostly done by constructing a Gibbs sampling algorithm that takes a blocking strategy into consideration [191]. A general rule is that the convergence of the Gibbs sampler can be improved by grouping correlated latent variables into a single parameter block to be sampled as a whole [230]. On the other hand, when sampling from a non-closed-form distribution, a naive MH algorithm [209] requires the specification of a proposal density, which can be problematic in developing software. Therefore, fully automated sampling algorithms such as ESS (Algorithm 1) [218], NUTS [142], and the slice sampler [219] are valued as general-purpose inference engines in the development of Bayesian software when a conjugate update is infeasible.
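To make the automation point concrete, here is a minimal sketch of one elliptical slice sampling (ESS) update for a parameter with a mean-zero Gaussian prior: unlike a naive MH step, it requires no tuned proposal density, only the ability to draw from the prior and evaluate the log-likelihood. The function names are our own; this is a sketch of the standard ESS recipe, not production code.

```python
import numpy as np

def elliptical_slice(f_cur, log_lik, prior_sample, rng):
    """One ESS update for f ~ N(0, Sigma); prior_sample() must return a draw nu ~ N(0, Sigma)."""
    nu = prior_sample()                                  # auxiliary prior draw defining the ellipse
    log_y = log_lik(f_cur) + np.log(rng.uniform())       # slice height under the current state
    theta = rng.uniform(0.0, 2.0 * np.pi)                # initial angle on the ellipse
    lo, hi = theta - 2.0 * np.pi, theta                  # shrinking bracket for the angle
    while True:
        f_prop = f_cur * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_prop) > log_y:                      # accept: proposal lies on the slice
            return f_prop
        if theta < 0.0:                                  # reject: shrink the bracket toward 0
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

Because the bracket shrinks toward the current state, the loop always terminates, which is what makes the algorithm suitable as a fully automated inference engine.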
In contrast, STAN [265] implements HMC [86,220] and its extension, NUTS [142]. STAN is perhaps one of the most extensively used Bayesian software packages in recent years due to the fast convergence of its inference engines regardless of whether the priors are conjugate. For that reason, and for its great modeling flexibility, STAN has been used as a basic platform for higher-level packages such as BRMS [44] and TORSTEN [201].

Future research topics
We briefly mention two topics that have generated great recent interest in the Bayesian statistical community.
The first topic we highlight is the use of optimization-based Bayesian approximation techniques, namely variational inference [32,319] and expectation propagation [19,212], for Bayesian nonlinear mixed effects models.
These methods have received significant attention in the recent past because of their scalability to large-scale problems, enabling 'Big Bayesian Learning' [163,321]. Essentially, the main goal of these methods is to approximate the joint posterior density (16) via optimization rather than via sampling schemes such as MCMC, which may suffer from scalability problems. The basic idea behind them is to first posit a family of densities and then find the member of that family closest to the target density, where closeness is often measured by the Kullback-Leibler divergence [158]. See Chapter 10 of [30] for a general introduction to these methods. Although new algorithmic developments of these methods have been published for application to (generalized) linear mixed effects models [229,272,273], to our knowledge, there is no published research on their application to nonlinear mixed effects models.
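As a small worked illustration of the variational idea (our own toy example, not drawn from the cited works), the sketch below fits a Gaussian variational family q = N(m, s^2) to a known univariate Gaussian target by gradient descent on the closed-form KL divergence; the gradients are derived directly from that closed form.

```python
import numpy as np

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL(q || p) for univariate Gaussians q = N(m_q, s_q^2), p = N(m_p, s_p^2)."""
    return np.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5

def fit_gaussian_vi(m_p, s_p, steps=2000, lr=0.05):
    """Gradient descent on KL(q || p) over (m, log s); converges to q = p in this toy case."""
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s = np.exp(log_s)
        grad_m = (m - m_p) / s_p ** 2            # dKL/dm
        grad_log_s = s ** 2 / s_p ** 2 - 1.0     # dKL/d(log s)
        m -= lr * grad_m
        log_s -= lr * grad_log_s
    return m, np.exp(log_s)
```

In realistic mixed effects settings the target posterior is not Gaussian and the KL is not available in closed form, which is precisely where stochastic or coordinate-ascent variational algorithms come in; the toy case only shows the optimization-in-place-of-sampling mechanism.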
Another topic untapped in the literature is the development of Bayesian versions of mixed effects machine learning models [49,104,200,224]. This is a relatively new branch of the statistics and machine learning communities, in which most research has been published in the past five years. The central idea of these models is to estimate the nonlinear function f in the individual-level model (2) nonparametrically by using a variety of machine learning methods, rather than specifying a parametric function, while maintaining the mixed effects modeling framework. The framework is therefore similar to nonparametric regression, whose primary purpose is to estimate the unknown function f [240,254,278]. The main differences are that, in mixed effects machine learning models, (1) random variables are included to describe inter-subject variability, and (2) the curve fitting mechanism (i.e., estimation of f) is mostly carried out by machine learning models, including deep learning [126], random forests [38], gradient boosted machines [103], etc.

DISCUSSION
This review of Bayesian nonlinear mixed effects models is of necessity incomplete, as the literature is too vast to attempt even a moderate review. We have chosen to focus much of our attention on some of the most recent literature on Bayesian analysis of the underlying basic model, with an emphasis on the implementation of the model and an introduction to recently developed prior distributions. We hope that this review can serve as a guideline for developing computational algorithms for complex and realistic Bayesian hierarchical nonlinear models. We also hope that it offers readers familiar with frequentist analysis some pedagogical insight into the Bayesian approach, and provides those new to nonlinear mixed effects modeling a foundation in the implementation of Bayesian and frequentist computations for appreciating the models' ideas and utility. We look forward to continued methodological developments, software developments, and new applications of this rich class of models in industrial and academic research.