On the Effectiveness of Bayesian AutoML methods for Physics Emulators

The adoption of Machine Learning (ML) for building emulators for complex physical processes has seen an exponential rise in the recent years. While neural networks are good function approximators, optimizing the hyper-parameters of the network to reach a global minimum is not trivial, and often needs human knowledge and expertise. In this light, automatic ML or autoML methods have gained large interest as they automate the process of network hyper-parameter tuning. In addition, Neural Architecture Search (NAS) has shown promising outcomes for improving model performance. While autoML methods have grown in popularity for image, text and other applications, their effectiveness for high-dimensional, complex scientiﬁc datasets remains to be investigated. In this work, a data driven emulator for turbulence closure terms in the context of Large Eddy Simulation (LES) models is trained using Artiﬁcial Neural Networks and an autoML framework based on Bayesian Optimization, incorporating priors to jointly optimize the hyper-parameters as well as conduct a full neural network architecture search to converge to a global minima, is proposed. Additionally, we compare the effect of using different network weight initializations and optimizers such as ADAM, SGDM and RMSProp, to explore the best performing setting. Weight and function space similarities during the optimization trajectory are investigated, and critical differences in the learning process evolution are noted and compared to theory. We observe ADAM optimizer and Glorot initialization consistently performs better, while RMSProp outperforms SGDM as the latter appears to have been stuck at a local minima. Therefore, this autoML BayesOpt framework provides a means to choose the best hyper-parameter settings for a given dataset.


Introduction
Fluid turbulence is a multi-scale phenomenon and is an essential component of modeling engineeringrelevant flows. While solving the full Navier Stokes using Direct Numerical Simulation (DNS) results in the most accurate representation of the complicated, non-linear, non-local, multi-scale phenomenon, DNS is often computationally intractable. Engineering level solutions based on Reynolds Averaged Navier-Stokes (RANS) and Large Eddy Simulations (LES) alleviate this issue by resolving the larger integral length scales and modelling the smaller unresolved scales by introducing a linear operator to the Navier-Stokes equation to reduce the simulation complexity. These models however suffer from the curse of turbulence closure. The linear eddy-viscosity model represents one of the most popular methods for Reynolds stress closure for two-equation RANS as well as Smagronisky-LES models [1,2]. However, these approximates models are commonly phenomenological/heuristic in nature and thus require fitting to high fidelity DNS datasets for idealized flows [3].
With the incredible strides in the development of sophisticated Machine Learning (ML) algorithms made in the last decade it is only logical that these tools be widely adopted for use with scientific applications (refer to [4,5,6] for a review). In particular, Artificial Neural Networks (ANN) have shown great performance for function approximations [7] and recently Deep Neural Network (DNN)based approaches for fluid problems have gained wide attention [3,8,9,10,11,12,13,14,15]. Specifically, the use of ML methods for developing data-driven Reynolds stress closures has shown a lot of promise for canonical as well as complex engineering flows [5,6,16,17,18,19]. We extend this effort to emulate a data-driven closure term for compressible flows relevant to modeling advanced propulsion systems.
When training a neural network, one of the the main difficulties is setting the tunable hyper-parameters which need to be chosen and directly dictate the performance of the data-driven model. Manually tuning these hyper-parameters requires expertise and a-priori information about their sensitivities, which can be difficult to develop as more of these techniques are applied to large-scale scientific datasets. In addition, there are some challenges in transferring the best practices from other ML areas to Scientific ML. Primarily because the data used to develop these algorithms are different than scientific data as the latter can be high-dimensional, multi-modal, complex, structured and/or sparse [20]. Often, these suggested settings for hyper-parameters are incompatible due to the nature of the complex non-convex loss manifold for these datasets.
Automatic Machine Learning, or autoML, promises to be an important effort in this area where the optimization occurs without human intervention, and often times yield robust results. Along with tuning parameters, searching for optimal neural network architecture or Neural Architecture Search (NAS), in terms of layer depth and width, has previously shown performance boost [21,22].

Our Contributions
In this work, a data driven physics emulator is built for Reynolds Stress closure term using a simple feed-forward neural network and the a-priori performance reported. The effectiveness of using an autoML framework based on Bayesian Optimization to automatically discover optimal parameter as well as identify the best neural network architecture for emulating closure terms based on large resolved scale quantities are investigated. Different optimizers (RMSProp, ADAM and SGDM) and weight initialization (Glorot, He, Narrow-Normal) strategies are used to identify best performing settings that is hyper-parameter and architecture. In addition, the training process of the different optimizer and initialization strategies are compared using weight and function space similarities.

Physics of the Problem
The LES-filtered governing equations (using Favre-averaging) for the balance of mass and momentum are as below: where u represents the velocity, p is the pressure, ρ the fluid density, ν the dynamic the viscosity and τ the subgrid stress term. The effect of the sub-grid scale appears on the right hand side of the governing equations through the sub-grid scale stresses, τ ij , which are modelled using the Boussinesq approximation [23], and the assumption by Smagorinsky that the smallest scales are isotropic [2]. Based on Prandtl mixing length theory, the subgrid viscosity can be derived in terms of characteristic length and one velocity scale [24] as follows, therefore helping to close the Reynolds stress term where the superscript sgs stands for subgrid scale terms, ∆ is the filter width, and C s is the Smagorinsky constant, and S ij is the strain rate tensor and is calculated by taking off-diagnal gradients of velocities. In conclusion, the above approximation for the eddy viscosity assumes that changes in the resolved fields are slow, so that subgrid eddies can adjust themselves quickly to the rate-of-strain tensor. Thus, a closure based on a single constant is not universally true and the constant value may have to be adjusted [25], based on fitting the model parameters to high-fidelity data. Since some form of data-fitting is needed to optimize the subgrid scale model parameters even for this simplified approach, one can envisage a purely data-driven method to optimally approximate this changing constant based on large scale resolved terms, motivating our approach. More details about the LES implementation as well as the accuracy of results is reported in [26].
In this work we aim to derive a functional relationship between the large scale resolved flow features and the sub-grid scale unresolved terms, and specifically to approximate the subgrid scale viscosity where Re c is the Cell Reynolds number, S is the Strain-rate tensor and has six components, Ω is the rotation-rate tensor, and has three components, ∆K is the Kinetic energy gradient, and Y is a non-dimensional term that is a measure of the mesh resolution. The non-dimensionalized input features are chosen in order to impose Galliean-invariance [27,12].

Machine Learning Approach
Differential programming, based on principles of automatic differentiation, is a paradigm in deep learning where parameters of the neural network are trained by gradient-based optimizations [28,29]. This is indeed useful for scientific ML tasks and helps to improve the parameterization of the approximations of the neural network [30]. In this work, fully connected ANNs are used to find the functional mapping between the target (µ sgs ) to the non-dimensional input features. While the neural network implementation is fairly straightforward and done using MATLAB's Deep Learning Toolbox, the a-priori estimation of the best network hyper-parameters, and the architecture itself is non-trivial due to the multi-scale, non-local, non-linear nature of the dataset. To discover the best performing settings for our dataset, an Automatic Machine Learning (autoML) strategy is used and, while there are many available optimization methods [31], the Bayesian Optimization (BayesOpt) approach, which has shown good performance for other data-driven tasks ( [32,33]) is used. Our overall approach is shown in Figure 1 and described in detail in [18,34]. For the purposes of this paper we limit ourselves to identifying the best performing ANN, given by the BayesOpt algorithm. The coupling of the ANN to a CFD solver and comparing a-posteriori performance will be the subject of a subsequent article.

Automatic Hyper-parameter Tuning using Bayesian Optimization
We explore an automatic hyper-parameter tuning setup within the Bayesian Optimization framework, in which the learning algorithm's generalization performance is modeled as a sample from a Gaussian Process (GP). The posterior distribution induced by the GP leads to efficient use of information gathered by previous experiments, enabling optimal choices for what parameters to try next. To pick the hyper-parameters of the next experiment, one can optimize the expected improvement (EI) [35] over the current best result or the Gaussian process upper confidence bound (UCB) [36]. EI and UCB Figure 1: Schematic of the Bayesian Optimization based data-driven physical emulator workflow in identifying the best performing network hyper-parameters and network architecture have been shown to be efficient in the number of function evaluations required to find the global optimum of many multimodal black-box functions [36,37]. While there are many hyper-parameters that have an effect on the ANN performance and accuracy, we limit set of parameters to the ones in Table 1, due to their leading order effect on the network performance.

Acquisition Functions for Bayesian Optimization
The acquisition functions evaluate the effectiveness of a point, x, based on the posteriori distribution function, Q [33]. The Expected Improvement function is used due to its superior performance over other methods [33]. The Expected-Improvement acquisition function evaluates the expected amount of improvement in the objective function, ignoring values that cause an increase in the objective. In other words, it is shown mathematically as: where µ Q (X best ) is the lowest value of the posterior mean, and X best is the location of the lowest posterior mean.

Solvers and Initialization of Learnable Parameters
The functional mapping, using a neural network, can be viewed as a function of the learnable parameters such as weights and biases. In fact the training process of the neural network is often stochastic -very often two separate runs do not yield the same result. This can be attributed to the choice of optimizer, the precision of the learning process as well as the complex, non-convex loss manifold and the initialization of the weights. This last step has a determination on the performance of the neural network during the optimization process itself [38,39,40]. In this work we explore the effect of choosing different initialization, from well known ones including Glorot [38], He [39] and narrow-normal. In addition to the initialization of the weights the performance and evolution of the training process itself are studied by using different solvers such as ADAM [41], RMSProp [42] and SGDM [40], within the autoML framework.

Results
A well-resolved and validated Large Eddy Simulation dataset [26] is used with over 10 million data points, taken across four time-steps from a simulation of a compressible flow within the cylinder of an advanced propulsion system generated using OpenFOAM C++ libraries [43]. 20% of the dataset is set aside for a-priori testing. Fifty function evaluations are run for the BayesOpt workflow, for each case, and after each iteration of the BayesOpt run, an error is calculated, ε on the test dataset defined as: where y true is the test data point, and the y pred is the value predicted by the ANN. The best performing settings for all the possible combinations of optimizers and initializations are reported in Table 2.
For each of our network evaluation, the architecture is optimized by self-repeating blocks of Dense Layer -> Leaky ReLU activation -> Dropout Layer. The p value of the Dropout layer is set to 0.2, per best practices [44]. For the specific case of optimizing the network architecture, the best set of Layer Width, W , and Layer Depth, D, the latter optimizing the number of the self-repeating blocks for a given BayesOpt evaluation, are identified and reported.

Network Performance
The final objective of this effort is to not only identify the best settings for a given choice of solver and initialization as reported in Table 2, but also understand the effective cost one can expect to pay when this network is coupled to a CFD solver (see Figure 1) to make run-time inferences across multiple timesteps and grid points, in a realistic CFD simulation. This inference cost is directly proportional to the size of the network, width (W ) and depth (D). Therefore total number of terms based on the formulation N = I * W + D * W * W + W * O, where I refers to the input features, in this case 14, O refers to the output features, which is 1 is also reported in Table 2. It is found that the ADAM Glorot and ADAM He combinations perform the best in terms of absolute errors, although the ADAM Glorot configuration has the highest number of parameters. The RMPSProp on average, performs better than SGDM optimizer. This can be explained as RMSProp is an adaptive learning rate method and is capable in navigating regions of local optima and whereas that SGDM performs poorly navigating ravines and makes hesitant progress towards local optima [45]. It is observed that the Glorot initialization consistently has the best performance in terms of error, ε, which is an interesting result as He and ReLU activation have performed well for Convolutional Neural Networks [39].

Visualizing Weight and Function Space Similarity
We investigate the similarities in the weight and function space for similar as well as differently initialized trajectories, in an effort to better understand the commonalities in the optimization process.
In order to do that, the simulations are check-pointed after every epoch and the cosine similarity among the weights are computed, defined by cos(θ 1 , θ 2 ) = θ T 1 θ2 ||θ1||||θ2|| , where θ 1 and θ 2 are the vectorized weights of the ANN. From the left panel in Figure 2, it is observed the checkpoints along a trajectory are largely similar in the weight space, in this case for ADAM-Glorot configuration. However, when compared to the same initialization, He, and different optimizers, ADAM and SGDM we observe major differences in the trajectories in the weight space and this can be attributed to the optimization methods and the stochasticity of the learning process therein. Therefore we see that functions within a single trajectory exhibit higher similarity and this similarity map is optimizer-initialization specific. In addition to investigating the similarities in the weight space which are inherently high-dimensional and therefore non-intuitive, the use of dimensionality reduction methods are used, such as t-SNE [46,47,48] to observe the trajectory of the checkpoints in a 2D space in Figure 3. It is observed that the Glorot and He have overlapping similarities, which makes sense as they both have similar functional forms and theoretical analysis: they both find a good variance for the distribution from which the initial parameters are drawn and only differ in the type of distribution they use -Gaussian for He, and Uniform for Glorot.
On the other hand, the t-SNE trajectory of the narrow-normal initialization which independently samples from the normal distribution with zero mean and a standard deviation of 0.01, thereby not incorporating information from the data, only occupies a small region in the phase space. For the same initialization, and different solvers, it is observed all the three different solutions start off from the same point but quickly diverge and follow different trajectories, consistent with previously reported observations [49]. This shows that the functions explored by different optimizers are far away and this leads to the divergence and differences in predictions, while functions explored within a single trajectory tend to be much more similar. This observation is independently verified by comparing weight-space similarities in the first few dense layer for two separate fully converged networks, trained using the same initialization, and different solvers and observe no discernible common patterns (high similarity scores) in the map, see Figure 5 in the Appendix.
Lastly another dimensionality reduction method in Singular Value Decomposition (SVD) are used and the singular values for each set of runs plotted. It is observed that the narrow-normal initialization has the least significant number of indices, which can be explained due to the nature of the initialization. Whereas for the He and Glorot initializations RMSProp and SGDM tends to have higher singular values compared to ADAM, thereby making ADAM the most suitable candidate for a pruning-type operation to reduce the number of overall parameters in the network. This is important for the primary objective of this effort to help isolate and identify candidate network settings that can be successfully coupled to a non-linear PDE solver for run-time predictions.  Table 2. The right panel shows the trajectories of the different optimizer, given same weight initialization and shows the difference in the function space exploration of each optimizer.

Conclusions
In this work, the effectiveness of an automatic Machine Learning framework, based on Bayesian Optimization, is demonstrated in identifying the best network parameters in a differential programming paradigm relevant to scientific computing. We provide a measure of efficiency for using these settings in terms of expected advantage-to-compute. Comparing different initialization and solver settings, and dimensionality reduction techniques, the differences in the learning process are observed, and suitability in identifying candidate network settings for pruning tasks matrix reduction for cheaper run-time inference are noted. Figure 4: Singular values for different initialization and optimizers for the first dense layer in the trained network, showing key differences in the nature of the weight matrix. RMSProp and SGDM have higher singular values compared to ADAM for the same settings. Specifically, ADAM-Glorot stands as best candidate for a low-rank approximation pruning-type operation.