Abnormality Detection and Failure Prediction Using Explainable Bayesian Deep Learning: Methodology and Case Study of Real-World Gas Turbine Anomalies

Mistrust, amplified by numerous artificial intelligence (AI) related incidents, has caused the energy and industrial sectors to be amongst the slowest adopters of AI methods. Central to this issue is the black-box problem of AI, which impedes investment and is fast becoming a legal hazard for users. Explainable AI (XAI) is a recent paradigm intended to tackle this challenge. The prognostics and health management (PHM) domain, being the backbone of the industry, has recently been introduced to XAI. However, many deficiencies, particularly the lack of explanation assessment methods and of uncertainty quantification, plague this young field. In this paper, we elaborate a framework for explainable anomaly detection and failure prognostics, employing a Bayesian deep learning model to generate local and global explanations of the PHM tasks. An uncertainty measure of the Bayesian model is utilized as a marker for anomalies, expanding the prognostic explanation scope to include the model's confidence. The global explanation is also used to improve prognostic performance, an aspect neglected in the handful of existing PHM-XAI publications. The quality of the explanation is finally examined using the local accuracy and consistency properties. The method is tested on real-world gas turbine anomalies and on failure prediction with synthetic turbofan data. Seven of the eight tested anomalies were successfully identified. Additionally, the prognostic outcome showed a 19% improvement in statistical terms and achieved the highest prognostic score amongst the best published results on the topic.


Artificial Intelligence
Artificial intelligence (AI) is officially the hype of the century, unravelling possibilities that once resided only in our imagination. AI currently serves numerous fields and constantly breaks fresh boundaries. Its capabilities are consumed by the mass public and reach far into specialized domains. The intense race between world powers to harness its power stimulates a consistent stream of funds to support AI-based projects in all parts of the globe. With AI technology presently within reach of literally everyone, the age of AI has just begun.
How does one define AI? According to a survey conducted by the Artificial General Intelligence Sentinel Initiative (AGISI) in 2018, the definition of AI most agreed upon by experts is that stated in [1]. It describes AI as having the faculty of adaptation and improvisation despite limited knowledge and resources. The description further implies the autonomy and learning capacity of the system. The European Commission's portrayal of AI is similar to the former definition, without the concept of restriction, albeit carefully specifying the system's partial degree of autonomy [2]. These depictions paint the picture of a system capable of reasoning and operating with partial or no supervision at all, and thus potentially beneficial, or dangerous, to human beings. The types and tasks of AI methods are commonly classified into seven categories as follows: 1. Machine learning (ML): including deep learning (DL) and predictive analytics. 2. Natural language processing: translation, classification, information extraction. 3. Speech: speech-to-text and text-to-speech. 4. Expert systems: inference engine and knowledge base. 5. Planning, scheduling, optimization: reduction, classical, probabilistic, and temporal. 6. Robotics: reactive machines, limited memory, theory of mind, and self-awareness. 7. Vision: image recognition and computer/machine vision.
Such a vast catalogue of abilities naturally finds its worth in many applications. Globally, the impact of AI is most anticipated in key economic and social pillars such as manufacturing, transportation, healthcare, business analytics, finance, and retail [3,4]. Likewise, research on AI stretches over other niche domains like entertainment [5], law enforcement [6], security [7], safety [8], defense [9], construction [10], investment [11], and mining operations [12]. The list goes on with endless possibilities, with new fronts being opened by researchers on a daily basis.
ML and DL have emerged as the most popular and powerful tools for solving technical challenges. Their nonlinear modeling power, the ever-increasing data volume, the availability of open-source development tools within everyone's reach, together with enhanced and affordable computing power, push DL to the forefront of AI tools. Some notable DL achievements of the past decade are mentioned here. In speech recognition, DL outperformed Gaussian mixture modeling-based systems in automatic speech recognition with record accuracy [13]. AlphaGo, an AI game system, beat the world champions Lee Sedol and Ke Jie in Go matches in 2016 and 2017, respectively [14,15]. In robotics, the OpenAI Five system beat the world champion team in a Dota 2 tournament in 2019 [14]. In 2021, CoAtNet-7 achieved 90.88% accuracy on the ImageNet image classification dataset [16].
Charting the AI investment landscape, PricewaterhouseCoopers (PwC) estimated that AI could uplift global GDP by 14%, or $15.7 trillion, by 2030, with China and the United States as the biggest beneficiaries of this impact [17]. In 2019, the United States (US) held the most investment in the form of private AI companies, representing around 64% of the global share, followed by China. The rest of the world trails the US and China, with the gap in investment value widening by around 400% from 2015 to 2019. During that period, transportation, customer relations, and business analytics received the biggest specific investments in the US, while transportation, security, and arts attracted more investment in China. Globally, the transportation and business analytics sectors constitute important investment grounds [18]. Soon, AI is expected to join capital and labor as a new factor of production and a main driver of productivity [17]. The labor market will experience profound change, where a smaller workforce generating higher value will be required. To thrive, or merely survive, the competition, increasing AI assimilation to replace low-skilled work is expected to be on the future agenda of industry [19].
According to the World Intellectual Property Organization (WIPO), the number of new AI patents registered tripled from 2013 to 2017, mirroring the intensive efforts of the technical community in exploiting AI's potential to overcome challenges [20]. Geographically, the top world economies, mostly in Asia, occupy the biggest share of AI patent registrations, headed by Japan (43%), followed by the US (20%), the European Union (EU-28, 10%), China (10%), South Korea (10%), and Germany (3%). The primary sectors where patent and trademark registrations are concentrated are computers and electronics, machinery, information technology services, and transportation [21].
Surprisingly, the industrial, manufacturing, and energy sectors are amongst the slowest to adopt AI in their day-to-day operations [17]. Considering the culture of continuous improvement in these areas, this slowness seems counterintuitive. However, one can understand that industrial actors have a confidence issue with blindly accepting AI decisions.
Trust is thus the primary obstacle to AI implementation. In the aforementioned domains, this mistrust is mostly related to performance issues; in other fields, other types of problems may arise. The Center for Security and Emerging Technology (CSET) defines the categories of AI malfunction as follows [22]: 1. Failures of robustness: the system is subjected to unusual or unforeseen inputs, causing failure. 2. Failures of specification: the system attempts to do something subtly different from what the developer or user anticipated, which may result in surprising behavior or consequences. 3. Failures of assurance: in operational mode, the system cannot be fully supervised or regulated.
The AI Incident Database has documented the growing number of AI incidents since 2019 [23]. This repository exhibits several pieces of information worth noting. First, as of today, the top domains where incidents are reported are transportation, healthcare, manufacturing, and nuclear, as presented in Figure 1(a). Most of the incidents are caused by ML issues, as shown in Figure 1(b). These facts strengthen the belief about why the industrial and energy sectors are hesitant to use AI. On a more serious note, 8% of the incidents resulted in loss of lives. Second, there is AI regulation. The growing incidents and their consequences risk stalling investment and have prompted calls for regulation. The laws intend not to punish but to foster a responsible AI culture. A summary of global AI regulations can be found in [24]. Included is the General Data Protection Regulation (GDPR), the strictest regulation to date issued by the EU; it affects developers around the world whose systems' outputs relate to the EU. The proposed EU framework classifies AI systems into three categories: limited- and minimal-risk, high-risk, and unacceptable-risk. Unacceptable-risk systems will no longer be authorized, high-risk systems will be subject to strict requirements, and even minimal-risk systems will be subject to a few conditions. Offences could incur fines of up to €30 million or 6% of global revenue, with the use of illegal systems and the breach of data-governance requirements by hazardous systems incurring the heftiest penalties.
In brief, six system qualities are demanded for a responsible AI ecosystem: 1. Transparency: an AI system's mechanism should be understandable. 2. Reliability and safety: an AI system should work as intended and be safe to use. 3. Privacy and security: AI systems should respect confidentiality and be protected. 4. Fairness: AI systems should treat all human beings equally. 5. Inclusiveness: AI systems should inspire and promote human participation. 6. Accountability: responsibility measures must be available when an AI system malfunctions.
One can note that most of the provisions in these laws focus on the issues of transparency, fairness, privacy, and data security related to AI algorithms. Transparency here refers to the mechanism by which AI methods obtain their output. In fact, transparency is the main key to minimizing AI malfunctions and achieving the AI quality goals mentioned before, the core difficulty being the black-box characteristic of some AI techniques.
DL, currently the most powerful AI method, is a black-box model: it is opaque. Though very effective, its mechanism for generating forecasts is unknown. Naturally, this opacity thwarts AI dissemination in high-stakes areas, such as the industrial and energy sectors, where an incomprehensible outcome could lead to incorrect prediction. In turn, this can provoke disastrous effects in terms of lives, safety, and finances. Obviously, domain experts demand more than a mere point-estimate prediction to convince them to take the correct course of action. Thus, the ball lies in the research community's hands to diminish this mistrust. This is where explainable AI (XAI) enters.
XAI is a field dedicated to making AI models transparent to humans through various approaches. Though the notion has been known for decades, global attention to XAI has risen notably in recent years, reflected by increasing initiatives from various parties, including the Defense Advanced Research Projects Agency (DARPA), since 2016 [25]. This spike in interest in XAI is partly due to the emerging laws mentioned previously. The steady accumulation of general and specialized review articles on XAI translates the growing interest in XAI from the research community [26][27][28][29][30].
The advantages of XAI, however, extend far beyond regulatory compliance. XAI can: 1. Justify a model's decisions and detect its problems, especially during the trial period of an AI model, strengthening reliability and safety. 2. Help comply with regulations, through transparency that leads to accountability, enhanced security, and data privacy. 3. Help understand AI reasoning and decrease problems related to fairness in AI use. 4. Assist practitioners in verifying the required properties of an AI system from the developer. 5. Promote interactivity and expand human creativity by uncovering new perspectives on the model or the data. 6. Allow resources to be better optimized, avoiding wastage. 7. Foster collaboration between experts, data scientists, users, and stakeholders.
Several published articles have organized XAI approaches into distinct taxonomies [31][32][33]. This paper briefly describes the categorization according to [31], which falls into two general classes. The first is transparent models, which are directly interpretable thanks to their simple structure or comprehensible visualization, such as linear or logistic regression, decision trees, and rule-based methods. The second is post-hoc explainability, where the explanation is generated after the model to be explained has been trained. This category includes model-agnostic approaches, external methods that can be used with any AI model, as well as post-hoc techniques for shallow ML models (tree ensembles, random forests, multiple classifier systems, support vector machines). It also covers approaches specific to DL: model simplification and feature relevance for neural networks; techniques appropriate only for certain DL architectures such as convolutional neural networks (CNN) and recurrent neural networks (RNN), including layer-wise relevance propagation (LRP), class activation mapping (CAM), and gradient-weighted class activation mapping; and methods for hybrid transparent-opaque models (knowledge-based and case-based reasoning).
As the backbone of the industry, prognostics and health management (PHM) is a set of frameworks exploiting sensor signals to safeguard the health state of industrial assets by identifying and examining anomalies, tracking degradation, and estimating failure evolution [34]. To achieve this goal, three main activities are employed: anomaly detection, failure prognostics, and diagnostics. 1. The first consists of identifying outliers in the system's output data [35]. 2. The second encompasses the determination of the remaining useful life (RUL). 3. The last covers the classification and identification of the root cause of failure [36,37]. In recent years, AI has become a predominant tool in reliability-based research [38].
PHM-XAI is still a very young discipline. As testified by the recent systematic review on PHM-XAI presented in [39] and shown in Figure 2(a), the number of peer-reviewed journal articles treating the subject is still small but steadily rising. Several explainability approaches have been explored by PHM-XAI researchers. To forge trust in AI and facilitate its legal use in industry, it is urgent to disseminate XAI know-how to PHM players in both the research and industrial domains.

Research Gaps and Opportunities
The review presented in [39] further lists several deficiencies plaguing research in PHM-XAI that need to be remedied promptly: 1. Lack of human involvement: human engagement is crucial for assessing the generated explanation, as the latter is meant for humans. Furthermore, human-AI cooperation could contribute to the integration of human-related sciences, augmenting the PHM-XAI field. Human participation is also urged for the development of interactive AI, where experts and the AI system work hand in hand, providing more assurance in the AI system's output. 2. Lack of uncertainty quantification: uncertainty information helps users in trusting an AI method's predictions compared to point-estimation models; it is therefore inconceivable for a working AI system to be devoid of this feature. The review article accordingly summarized some research opportunities in PHM-XAI: 1. As shown in Figure 2(b), model-agnostic explainability, LRP, and logic analysis of data (LAD) are less explored, but they possess great potential, as they can be used with any black-box model without altering its performance. LAD can be combined with fault tree analysis for complex risk management. 2. While SHapley Additive exPlanations (SHAP) is an established model-agnostic method already employed in PHM-XAI works, it has not been exploited to improve PHM task performance.
Addressing these weaknesses and seizing these opportunities, this article demonstrates the application of the SHAP model-agnostic approach in explaining and improving anomaly detection and failure prognostic tasks, taking a case study related to gas turbine systems. Abrupt disturbances in a real-world gas turbine are tested for detection. Then, the root cause of degradation in a turbofan prognostic problem using simulated data is deciphered. SHAP local and global explanations are utilized to improve prognostic performance. Prediction uncertainty, specifically aleatoric uncertainty, issued from the DL model to be explained, serves a dual purpose: (i) as an anomaly indicator, monitored using cumulative sum (CUSUM) changepoint detection; and (ii) to bolster the explanation in terms of the model's confidence in its output. Additionally, these uncertainties were minimized through denoising and hyperparameter optimization, a crucial aspect often ignored in probabilistic DL articles. Decreased uncertainties amplified the anomaly detection ability and increased the accuracy of the prognostics. Finally, the explanation produced is evaluated using local accuracy and consistency metrics.
The main contributions of this work are as follows: 1. We combine SHAP and DL uncertainty to constitute a wider explanation scope, where the former explains the decision of the model while the latter describes its output confidence. 2. We demonstrate the SHAP global explanation's ability to improve prognostic task performance, which was absent from previous works. 3. We apply explanation evaluation metrics, which are clearly lacking in the previous PHM-XAI literature. 4. We show the potential of DL uncertainty as an anomaly indicator for a real-world industrial dataset, validating its capability. 5. We minimize DL uncertainties to enhance prognostic accuracy; additionally, the small aleatoric uncertainty makes the spiking effect caused by anomalous data more visible. The secondary contributions are the following: 6. We add a model-agnostic explainability work to the collection of PHM-XAI articles, which is still lacking at the moment. 7. We show that the local accuracy trait of the explanation validates the efficiency property of Shapley values, while confirming the consistency characteristic justifies the additivity and symmetry properties of these values.
The dynamic structure-adaptive symbolic approach (DSASA), a cross-domain life prediction model, is elaborated in [40] for slewing bearing RUL prediction. The DSASA presents internal model structures visibly, takes historical run-to-failure data into account, and dynamically adapts to real-time deterioration. In a nutshell, multi-signal-based health indicators are fed into three genetic programming algorithms for symbolic life modeling. This modeling visually displays the life process in the form of legible mapping relationships and obtains good RUL prediction results. The DSASA then reconstructs the original life expressions from the initial symbolic life model and uses dynamic coupling terms and their exponents to track real-time asset deterioration. The recorded performance is better than that of the method previously employed for the case study, and the XAI ability contributed to it.
An interpretable structured-effect neural network (SENN), consisting of a non-parametric baseline, a linear component of the current condition, and a recurrent component, is proposed in [41] for a turbofan prognostic application, with the model represented in (1). The first component is the non-parametric part, consisting of a lifetime probabilistic model. The second component is a linear form applied to the raw sensor readings, where the importance of features may be evaluated from the linear coefficients. The third component is a recurrent neural network with weights Θ. Thanks to this decomposition, the recurrent component needs to explain less of the data variance compared with a pure neural network structure. The performance of the model surpasses other traditional ML methods except the LSTM. However, XAI does not contribute to this performance.
An autoencoder with an explanation discriminator is employed in [42] for anomaly detection in continuous batch washing equipment. The autoencoder's reconstruction error, which serves as the anomaly indicator, is utilized by the discriminator to measure the precision and accuracy of the anomaly detection task. The discriminator rescales the reconstruction error using a sigmoidal function, giving 0 for normal, 1 for anomaly, and values between 0 and 1 as warnings. The performance of the proposed method is comparable to that of the best technique previously employed for the problem, isolation forest, assisted by the XAI approach.
The Fused-AI interpretabLe Anomaly Generation System (FLAGS), which combines knowledge-driven (KD) and data-driven (DD) abilities, is presented in [43] for anomaly detection (AD), failure recognition, and root cause analysis of trains. FLAGS consists of three stages: 1. In the first phase, both KD and DD fault recognition (FR) and root cause analysis (RCA), using data from failure mode and effects analysis (FMEA) and fault tree analysis (FTA), are employed simultaneously. Data streams and case-specific context data are used as inputs. Faults from the KD side or outliers from the DD side are produced with interpretations of the detected anomalies and stored inside a knowledge graph (KG). 2. In the second phase, the detected anomalies are shown in a dynamic dashboard, complete with the raw data and interpretation, where user modification is authorized; this feedback is also stored in the KG. 3. In the third phase, the information in the KG (the anomalies, the feedback, and all contextual meta-information) is used to improve the AD, FR, and RCA techniques on both the KD and DD sides. The reported anomaly detection accuracy is good, better than other standalone DD methods, partly because of the XAI approach.
Self-monitoring, analysis, and reporting technology (SMART) statistics are utilized in [44] to detect and predict failures in hard drives through the Attention-augMENted DEep aRchitecture (AMENDER) model. The daily record of SMART statistics is integrated into vectors by the feature integration layer. These vectors are then fed into the temporal dependency extraction layer, consisting of gated recurrent units (GRU), whose output can be considered a compact representation of the SMART temporal sequence over the observed days. An attention distribution is calculated from the healthy context vector, which is the high-level feature representation of healthy hard drives, and the SMART compact representation. The resulting distribution, together with the GRU hidden state, produces the attentional hidden state of the corresponding days. This attention mechanism enables the model to focus on failure progression. The attentional hidden state may then be used to determine the health of the hard drive for the associated day. The model's performance is better than the other tested methods in both hard-drive health status classification and prognostics. The attention mechanism contributed to this performance, besides being the mechanism for diagnostics.
A fouling prediction method for a crossflow heat exchanger, using a feed-forward neural network with LIME model-agnostic explainability, is described in [45]. The model is fed with operational data, such as inlet fluid temperatures, the ratio of fouled fluid flow rates to flow rates under clean conditions, and outlet fluid temperatures from the heat exchanger, and predicts the fouling resistance of the equipment with very good predictive accuracy.
A comprehensive visual explanation tool applied to turbofan engine prognostics is suggested in [46]. This online diagnostic, prognostic, and situation-awareness system works with streaming data and is divided into the following sections: (i) an ML-based classifier; (ii) a visualization dashboard for health state monitoring; (iii) a cybersecurity command centre; and (iv) high-performance local servers. The visualization dashboard displays real-time predictive analytics to reveal potential flaws, risks, and harmful attacks. Users may view the input and output in the form of heat maps, one for each sensor input and related engine at each time step. Practitioners may examine the network weights of each layer to see how each feature contributes to the output of the following layer. The network weights are represented by line thickness: the larger the weight value, the thicker the line. Practitioners may also customize model hyperparameters, such as the number of layers, hidden units, weights in each layer, regularizer types, and regularizer parameters, to integrate their expertise into the learning process.
This article is organized as follows. The methodology is described in Section 2. The case study is presented in Section 3, and the results and discussion in Section 4. Finally, concluding remarks are given in Section 5.

Multi Output Bayesian LSTM and Uncertainty Quantification Layers
A single-input, multi-output LSTM model is employed for the anomaly detection and RUL estimation tasks. The model, denoted M, comprises an input layer, where the input data are fed; a single LSTM layer; a fully connected, or dense, layer; and two output layers, as presented in Figure 3. The LSTM layer produces sequential predictions by employing a gating mechanism to retain important memories and forget negligible ones. This structure enables the accumulation of important information, a crucial ability in anomaly monitoring and degradation tracking. The matrix multiplication of the input data with the model's weights and the addition of its bias factors happen in the dense layer. The forecast, together with its uncertainty, is then enabled by the probabilistic nature of the output layers.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 12 January 2022 doi:10.20944/preprints202109.0034.v3
Two types of uncertainty are defined in DL models. The first type is aleatoric uncertainty, linked to noise, acquisition error, and randomness in the dataset. The first output layer therefore captures aleatoric uncertainty; it learns and predicts using the sequential output of the LSTM layer as input, parameterized by the mean and standard deviation of the output distributions, as depicted in layer "dense2" of Figure 3. The prediction range reflects the uncertainty.
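A common way to train such a mean/standard-deviation output head is to minimize the Gaussian negative log-likelihood, so that the predicted standard deviation absorbs the data noise, i.e., the aleatoric uncertainty. The following numpy sketch illustrates the idea under the assumption of a heteroscedastic Gaussian output; it is not the actual probabilistic layer of Figure 3, only a minimal illustration of the loss it would optimize.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Heteroscedastic Gaussian negative log-likelihood.

    Minimizing this loss trains a mean/std output head: the predicted
    sigma is pushed toward the actual noise level of the data, which is
    exactly the aleatoric uncertainty."""
    sigma = np.maximum(sigma, 1e-6)  # numerical floor to avoid log(0)
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                         + 0.5 * ((y - mu) / sigma) ** 2))

# Toy check: the loss is lower when the predicted sigma matches the
# true noise level than when it underestimates it.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=2.0, size=10_000)
loss_good = gaussian_nll(y, mu=0.0, sigma=2.0)   # correct noise level
loss_bad = gaussian_nll(y, mu=0.0, sigma=0.5)    # overconfident head
```

In a trained model, `mu` and `sigma` would come from the "dense2" layer; here they are fixed constants purely to show the behaviour of the loss.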
The second type is epistemic uncertainty, corresponding to the uncertainty in the weights of the DL model. Hence, the second output layer, also known as the dense variational layer, captures epistemic uncertainty. This layer learns and predicts the posterior distribution of the weights using variational inference, by maximizing the evidence lower bound (ELBO) objective stated in (2)-(3):

ELBO(q) = E_{q(w)}[log p(y | x, w)] − KL(q(w) ‖ p(w)),

where p(y | x, w) is the likelihood function that links the input x, the output y, and the weights w; p(w) is the prior, or initial, distribution of the weights; and q(w) is the approximated posterior distribution once the training of the DL model is completed. Predictions are obtained with weights sampled from q(w).

Minimization of Uncertainties, Anomaly Detection, and RUL Estimation
The Gaussian, or normal, distribution, a well-understood and commonly used probability model, was utilized to describe both types of uncertainty. The aleatoric and epistemic uncertainties are represented by the rolling standard deviation of the predicted distribution sequence. The only way to reduce the aleatoric uncertainty of recorded data is to remove its noise. Thus, the data are first denoised using a singular value decomposition algorithm, following the methodology stated in [48,49]. The denoised data are later utilized to optimize the DL hyperparameters with Bayesian hyperparameter optimization (BayesOpt), whose search limits are shown in Table A1 in the Appendix [50]. BayesOpt optimized the model and decreased the epistemic uncertainty.
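The SVD-based denoising step can be sketched as follows. This Hankel-matrix, low-rank reconstruction (in the style of singular spectrum analysis) is a generic illustration, not necessarily the exact algorithm of [48,49]; the `window` and `rank` parameters are illustrative assumptions.

```python
import numpy as np

def svd_denoise(signal, window=20, rank=2):
    """Low-rank SVD denoising: embed the series in a Hankel (trajectory)
    matrix, keep only the leading singular components, and reconstruct
    the 1-D series by averaging the anti-diagonals."""
    n = len(signal)
    k = n - window + 1
    # Trajectory matrix: each column is a length-`window` slice.
    H = np.column_stack([signal[i:i + window] for i in range(k)])
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    H_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-truncated H
    # Anti-diagonal averaging back to a 1-D series.
    out = np.zeros(n)
    cnt = np.zeros(n)
    for j in range(k):
        out[j:j + window] += H_low[:, j]
        cnt[j:j + window] += 1
    return out / cnt

# Toy check: a noisy sine is brought closer to the clean signal.
rng = np.random.default_rng(1)
t = np.linspace(0, 4 * np.pi, 400)
clean = np.sin(t)
noisy = clean + rng.normal(scale=0.4, size=t.size)
denoised = svd_denoise(noisy, window=40, rank=2)
```

A pure sinusoid has a rank-2 trajectory matrix, which is why `rank=2` suffices in the toy example; real sensor data would need the rank chosen from the singular value spectrum.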
For anomaly detection, the model is trained with healthy data, as the aleatoric uncertainty is expected to spike when the model encounters an abrupt anomalous observation. This spike, or changepoint, is detected using a CUSUM algorithm with a specified control limit h, as stated in [51]. The limit h is determined from the aleatoric uncertainty of the predictions on healthy data. Given AU_max, AU_mean, and AU_std, corresponding to the maximum, mean, and standard deviation of the standard deviations of the aleatoric uncertainties, respectively, the control limit is defined as h = (AU_max − AU_mean)/AU_std.
Given a sequence x_1, ..., x_n of process measurements with mean μ and standard deviation σ, the lower and upper cumulative process sums, S⁻ and S⁺, stated in (4) and (5), follow the standard tabular CUSUM form:

S⁺_i = max(0, S⁺_{i−1} + (x_i − μ) − kσ),  S⁻_i = min(0, S⁻_{i−1} + (x_i − μ) + kσ),

with S⁺_0 = S⁻_0 = 0 and slack parameter k. A deviation is detected at point i if S⁺_i > hσ or S⁻_i < −hσ. For prognostic purposes, the DL model is trained with both healthy and degradation data. The trend of the aleatoric uncertainty reflects the confidence of the model in its prediction: a rising aleatoric uncertainty trend mirrors growing uncertainty, while the contrary represents increasing confidence of the DL model.
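The CUSUM detection described above can be sketched in a few lines of numpy. The two-sided tabular form and the slack value `k=0.5` are conventional assumptions, since the original equations are not reproduced here; in the proposed method, `x` would be the sequence of aleatoric uncertainties and `h` the control limit derived from the healthy data.

```python
import numpy as np

def cusum_detect(x, mu, sigma, h, k=0.5):
    """Two-sided tabular CUSUM changepoint detector.

    Returns the first index at which the upper sum exceeds h*sigma or
    the lower sum falls below -h*sigma, or -1 if no change is flagged."""
    s_hi, s_lo = 0.0, 0.0
    for i, xi in enumerate(x):
        s_hi = max(0.0, s_hi + (xi - mu) - k * sigma)
        s_lo = min(0.0, s_lo + (xi - mu) + k * sigma)
        if s_hi > h * sigma or s_lo < -h * sigma:
            return i
    return -1

# Healthy-phase statistics set the reference mean/std; a level shift
# halfway through the sequence should be flagged shortly after it occurs.
rng = np.random.default_rng(2)
healthy = rng.normal(0.0, 1.0, 200)
faulty = rng.normal(3.0, 1.0, 200)
series = np.concatenate([healthy, faulty])
idx = cusum_detect(series, mu=healthy.mean(), sigma=healthy.std(), h=8.0)
```

With a 3σ level shift and `k=0.5`, each post-change sample adds roughly 2.5σ to the upper sum, so the `h=8` threshold is crossed within a handful of samples after index 200.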

Model Performance Assesment and SHAP Explainability
Two metrics are employed: the root mean squared error (RMSE) and an early prognostic metric. The first is applied to evaluate both the anomaly detection and prognostic tasks, while the second is used only for prognostics.
The RMSE is utilized to examine the model's predictive performance with aleatoric and epistemic uncertainties [52]. To obtain a meaningful measure, the mean performance over 100 predictions is calculated. The RMSE measures the spread of the errors between the predicted and true RULs and is defined as

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (RUL̂_i − RUL_i)² ),

where RUL̂_i and RUL_i are the predicted and true RUL of asset i. The metric stated in (7) penalizes late predictions more heavily than early predictions of the same amplitude, as overestimating the remaining life is the more dangerous error in failure estimation. Here, s_i is the individual asset's prognostic score and d_i = RUL̂_i − RUL_i is the individual asset's prognostic error. Again, the mean of 100 prediction scores, with aleatoric and epistemic uncertainties, is calculated.
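Assuming the metric in (7) takes the standard asymmetric exponential form used for turbofan RUL benchmarks (the time constants 13 for early and 10 for late predictions are the conventional values, assumed here since the equation is not reproduced), the two metrics can be sketched as:

```python
import numpy as np

def rmse(rul_true, rul_pred):
    """Root mean squared error between predicted and true RULs."""
    d = np.asarray(rul_pred, float) - np.asarray(rul_true, float)
    return float(np.sqrt(np.mean(d ** 2)))

def prognostic_score(rul_true, rul_pred, a_late=10.0, a_early=13.0):
    """Asymmetric prognostic score (lower is better): late predictions
    (d > 0, overestimated RUL) are penalized more heavily than early
    ones of the same amplitude."""
    d = np.asarray(rul_pred, float) - np.asarray(rul_true, float)
    s = np.where(d >= 0, np.exp(d / a_late) - 1.0, np.exp(-d / a_early) - 1.0)
    return float(np.sum(s))

true_rul = [20.0, 20.0]
late = prognostic_score(true_rul, [25.0, 25.0])   # 5 cycles late
early = prognostic_score(true_rul, [15.0, 15.0])  # 5 cycles early
```

Both hypothetical predictions have the same RMSE of 5 cycles, yet the late one receives the larger (worse) score, which is the asymmetry the metric is designed for.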
SHAP is a technique for explaining the output mechanism of any machine learning model, based on game theory [53]. It uses Shapley values to assess the contribution of each feature to the prediction. The Shapley value of feature j, defined in (8), is the average marginal contribution of feature j's value over all possible combinations of feature values with respect to the prediction:

φ_j = Σ_{S ⊆ F\{j}} [ |S|! (|F| − |S| − 1)! / |F|! ] ( val(S ∪ {j}) − val(S) ),

where F is the set of all features, S is a subset of the features, and x is the instance's vector to be explained. The value val(S), stated in (9), is the prediction for the feature values in set S, marginalized over the features excluded from S:

val(S) = E[ f̂(x) | x_S ] − E[ f̂(x) ],

where E[f̂(x)] is the expected value of all predictions. The SHAP explanation model g is described next. Given the coalition vector z′ ∈ {0,1}^M, with z′_j = 1 indicating that feature j is present in the coalition and z′_j = 0 the contrary, and M the maximum coalition size, we have

g(z′) = φ_0 + Σ_{j=1}^{M} φ_j z′_j,

where, as mentioned, φ_j expressed in (10) is the Shapley value of feature j.
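Equation (8) can be implemented literally when the number of features is small, by enumerating every coalition. The toy linear `value_fn` below is an illustrative assumption; for a linear value function, the Shapley value of each feature reduces exactly to its coefficient times its value.

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n_features):
    """Exact Shapley values by coalition enumeration (equation (8)).

    `value_fn(S)` must return the value of coalition S, given as a
    frozenset of feature indices. Cost grows as 2^n, so this is only
    feasible for small n; SHAP approximates the same quantity."""
    phi = np.zeros(n_features)
    for j in range(n_features):
        others = [f for f in range(n_features) if f != j]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                # Shapley kernel weight |S|! (n - |S| - 1)! / n!
                w = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                     / factorial(n_features))
                phi[j] += w * (value_fn(S | {j}) - value_fn(S))
    return phi

# Toy model: additive in features 0 and 1; feature 2 is irrelevant.
x = np.array([2.0, -1.0, 5.0])
coef = np.array([3.0, 1.0, 0.0])
def value_fn(S):
    return sum(coef[i] * x[i] for i in S)

phi = exact_shapley(value_fn, 3)  # → [6.0, -1.0, 0.0]
```

The irrelevant feature receives a Shapley value of exactly zero, which is the dummy/null-player property the method inherits from game theory.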
SHAP can explain both global and local outputs. However, it is not compatible with probabilistic DL and accepts only a single output vector for explanation. Thus, a workaround, in the form of a non-probabilistic model M′, is developed, as shown in Figure 4(a). M′ has the same layers and weights as those along the explanation path of the original model M, except for the weights in the dense2 layer of M: only the weights corresponding to the mean are transferred to M′, while the weights associated with the standard deviation are ignored. The output layer out3 in M′ slices only the first value of each sequence vector and arranges these values in a single vector for the SHAP explanation.

Explanation Visualization
Three means of visualization are used to illustrate the local and global explanations: 1. Local: the force plot and the waterfall plot, which highlight the positive or negative forces of the features influencing an instance's output. The former, as it shows only the feature input values and force directions, is used to explain anomalous instances. The latter, as it shows the feature contributions and force directions, is used to verify the local accuracy and consistency properties of the explanation, elaborated in the next subsection. 2. Global: the summary plot, which highlights the most contributing features in a sequence.
The plot orders the features according to their contributing power and force directions. Here, the explanation is exploited to enhance prognostic accuracy by employing only the most contributing features. The model is first tested with all the features and then with only the best 75% of them. The performances of the two settings are then analyzed and compared with published results.
The first explanation property to be verified is the local accuracy of SHAP, as stated in [54]. It establishes that the sum of the feature contributions equals the prediction for instance $x$ minus the average prediction: $\sum_{j=1}^{M}\phi_j=\hat{f}(x)-E[\hat{f}(X)]. \quad (12)$
The second property is consistency, which states that if a model is modified such that the marginal contribution of a feature stays unchanged or increases, the feature's Shapley value follows the marginal contribution's trend, as defined in [54].

Let $N$ be the complete set of features, and let $z'\setminus j$ denote $z'_j=0$, i.e., the absence of feature $j$ from the coalition, for two models $f$ and $f'$. With $f_x(z')=f(h_x(z'))$, where $h_x$ maps a coalition vector back to the original input space, if
$f'_x(z')-f'_x(z'\setminus j)\ \ge\ f_x(z')-f_x(z'\setminus j) \quad (13)$
for all $z'\in\{0,1\}^M$, then
$\phi_j(f',x)\ \ge\ \phi_j(f,x). \quad (14)$
Here $f_x(z')$ is calculated from the surrogate explanation model and $f'_x(z')$ from the model shown in Figure 4(b), which has the same layers as the surrogate but different weights. Observe that $f_x(z'\setminus j)$ and $f'_x(z'\setminus j)$ are obtained by removing the weights of feature $j$ from the respective models. To evaluate the expression presented in (13), a waterfall plot is used to obtain the values of $f_x(z')$, $f_x(z'\setminus j)$, $f'_x(z')$, and $f'_x(z'\setminus j)$, and to confirm $\phi_j(f,x)$ and $\phi_j(f',x)$ in the inequality formulated in (14).
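For a linear model with independent features, the Shapley values have the closed form $\phi_j=a_j(x_j-\mu_j)$, which lets both properties be checked numerically. The following is a didactic sketch under that assumption, not the paper's waterfall-plot procedure:

```python
import numpy as np

def linear_shap(a, x, mu):
    """Shapley values for a linear model f(x) = a . x with independent
    features: phi_j = a_j * (x_j - mu_j)."""
    a, x, mu = (np.asarray(v, float) for v in (a, x, mu))
    return a * (x - mu)

a  = np.array([2.0, -1.0, 0.5])   # model f coefficients
x  = np.array([1.0,  3.0, 2.0])   # instance to explain
mu = np.array([0.0,  1.0, 1.0])   # feature means (baseline)

phi = linear_shap(a, x, mu)
# Local accuracy (12): contributions sum to f(x) - E[f(X)] = f(x) - f(mu).
assert abs(phi.sum() - (a @ x - a @ mu)) < 1e-12

# Consistency (13)-(14): f' has a larger marginal for feature 0,
# so its Shapley value must not decrease.
a_prime = a.copy(); a_prime[0] += 1.0
assert linear_shap(a_prime, x, mu)[0] > phi[0]
```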

Case Study 1: Real Gas Turbine Anomaly Detection
Data from an 18.8 MW, twin-shaft industrial gas turbine on the Petronas Angsi oil platform in Terengganu, Malaysia, recorded over a one-year period (8737 hours), are used in this study. Note that 98 sensor signals, comprising various pressure, temperature, velocity, and positional readings, make up the largely healthy data. While the number of features is overwhelming, only several were used in modeling the gas turbine, as indicated in [55]. The inputs and outputs utilized are shown in Tables 1 and 2, respectively. Four Bayesian LSTM networks with the same architecture, one per output, are fed with all the inputs to predict each output.

The four outputs are the gas generator rotational speed (RPM), the compressor outlet pressure (bar), the gas generator turbine outlet pressure (bar), and the gas generator turbine outlet temperature (K). First, we preprocess the data. The anomaly part is separated from the dataset, and the healthy part is split into training, validation, and testing datasets as shown in Table 3.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 12 January 2022 doi:10.20944/preprints202109.0034.v3

The input and output sequence lengths are set to 24 hours. The only abrupt anomalies, instances of null sensor readings from 12 am to 1 am on 20-21/03/18 and from 11 pm to 12 am on 08-09/04/18, are chosen from the anomaly data collection and joined with the neighboring healthy data to assemble 24-hour sequences. Both anomalies are set to fall on the 12th and 13th instances of their sequences. The CUSUM parameters used are listed in Table 5.
The CUSUM charts for anomalies predicted by the Bayes_LSTM models, obtained with the parameters in Table 5, are shown in Figure 7. As illustrated, both aleatoric uncertainty spikes are detected by the CUSUM method with the formulated control limit, as is the case for all the other models except the one in Figure A4. In each case, the anomalous feature is identifiable by its negative normalized value.
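A one-sided (upper) tabular CUSUM over the aleatoric uncertainty series can be sketched as follows; the target level, allowance `k`, and control limit `h` are illustrative stand-ins for the values in Table 5:

```python
import numpy as np

def cusum_upper(x, target, k, h):
    """One-sided (upper) tabular CUSUM. Returns the indices where the
    cumulative sum of deviations above target + k exceeds the limit h."""
    s, alarms = 0.0, []
    for i, xi in enumerate(np.asarray(x, float)):
        s = max(0.0, s + xi - target - k)  # accumulate only upward drift
        if s > h:
            alarms.append(i)
    return alarms

# Hypothetical uncertainty series: stable around 0.1, spiking at instance 12.
series = [0.1] * 12 + [0.6, 0.6] + [0.1] * 5
print(cusum_upper(series, target=0.1, k=0.05, h=0.2))
```

With these illustrative parameters, the first alarm fires at instance 12, the onset of the spike; as the discussion notes, a higher control limit delays detection while a lower one raises the false alarm rate.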
From the figures mentioned, note that the influences of two of the features are initially positive in instance 11 and become negative in the 12th and 13th instances, except for one output's prediction. In addition, the positive influence of two other features grows in the 12th and 13th instances compared with the previous instances, again with one exception. In Figure 8, two features' forces become dominant in the 12th and 13th instances, pushing the predictions below the base value. In Figures A5 and A10, two features are generally the major forces, driving the outputs above the base value. From the illustrations, observe that most features exert a positive impact, pushing the output value higher.

Case Study 2: Turbofan Engines Failure Prognostic
The NASA turbofan run-to-failure FD001 dataset, produced by the NASA Prognostics Center of Excellence at Ames Research Center, is exploited for the prognostic study [56]. This synthetic time series data were generated by modeling a variety of operational scenarios and inserting defects with diverse degrees of deterioration. The original data comprise training, testing, and true RUL records for 100 turbofan engines, as summarized in Table 6. Thus, there are 100 turbofan records, each describing a turbofan whose health declined until breakdown after a given cycle, or failure start point (FSP). Note that 21 sensor signals, described in Table A2 in Appendix 1, recorded per cycle, together with three operating conditions (OC), form the data. The OCs correspond to diverse operating regimes, combinations of altitude, throttle resolver angle, and Mach number, that conceal the extent of degradation of each turbofan. On top of this, high-level noise is blended into the dataset [44]. As in case study 1, we first preprocess the data. Of the 21 signals, only the 14 sensors whose trends are strictly monotonic are selected, as they best represent degradation, contrary to irregular and unchanging signals. The total number of inputs, including the three OCs, is 17.
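The monotonic-trend selection can be sketched with a simple monotonicity index over the signal's first differences; real sensor signals are noisy, so a smoothing step would normally precede this check, and the threshold here is an assumption:

```python
import numpy as np

def monotonicity(signal):
    """Monotonicity index in [0, 1]: 1 means the signal always moves in
    the same direction (strictly monotonic trend)."""
    d = np.diff(np.asarray(signal, float))
    d = d[d != 0]
    if d.size == 0:
        return 0.0
    return abs(np.sum(d > 0) - np.sum(d < 0)) / d.size

def select_monotonic(signals, threshold=0.9):
    """Keep sensors whose degradation trend is (near-)strictly monotonic."""
    return [name for name, s in signals.items() if monotonicity(s) >= threshold]
```

For example, a steadily rising sensor scores 1.0 and survives the selection, while an oscillating one scores near 0 and is dropped.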

A piece-wise linear degradation assumption is adopted, where the RUL is assumed stable before the FSP and decreases linearly thereafter until failure. In the initial phase, the RUL equals the value at the recorded signal's last cycle and decreases linearly, as illustrated for turbofan 1 in Figure 9(a) without the FSP. The CUSUM method, with a threshold equal to 5 standard deviations, is then used to calculate the FSPs of the signals of the concerned turbofan, and the mean of these FSPs is set as the turbofan's FSP. Combining the linear degradation obtained earlier with the FSP yields the final RUL sequence shown in Figure 9(b). The obtained RULs are capped at 50 to ease the model's generalization. Note that some testing data with long sequence lengths are associated with very small true RULs that differ from the characteristics of the training data; hence, the model is anticipated to perform more poorly on these 'abnormal' data. The prognostic results of turbofan 1 and turbofan 18 are examined, as the former's data bear similarity to the training data's input-output nature while the latter's resemble the abnormal data.
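The piece-wise linear RUL target, flat before the FSP and capped at 50, can be sketched as:

```python
import numpy as np

def piecewise_rul(n_cycles, fsp, cap=50):
    """Piece-wise linear RUL target: a linear countdown to 0 at the last
    cycle, clipped to a plateau before the failure start point (FSP) and
    further capped (at 50 here) to ease generalization."""
    countdown = np.arange(n_cycles - 1, -1, -1, dtype=float)
    plateau = min(cap, n_cycles - 1 - fsp)  # RUL value at the FSP
    return np.minimum(countdown, plateau)

# 10-cycle toy record with the FSP at cycle 4:
print(piecewise_rul(10, 4))  # [5. 5. 5. 5. 5. 4. 3. 2. 1. 0.]
```

For a realistic record length (e.g. 200 cycles with an early FSP), the cap dominates and the plateau sits at 50.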
Next, we provide the results with 100% of the features. The RMSE and score results with aleatoric and epistemic uncertainties are presented in Table 7. The 3D representation of the turbofan 1 prognostic with aleatoric uncertainty is shown in Figure 10 to provide the full picture of the modeling. As this illustration shows, the prediction range, or uncertainty, decreases along the cycles, signaling the model's growing confidence in its predictions. For the rest of the work, only 2D representations are shown.
The 2D depictions of turbofan 1 and turbofan 18 with aleatoric and epistemic uncertainties are presented in Figure 11. Looking at the slope of the rolling standard deviation of the aleatoric uncertainty for each prediction, one can observe a decreasing trend for turbofan 1 and the contrary for turbofan 18. Hence, the model expresses increasing confidence in the former and decreasing confidence in the latter. The different aleatoric uncertainty outcomes are reflected in the model's prognostic outputs, which show better performance for turbofan 1 than for turbofan 18. In Figure 11, observe that the RUL prediction with aleatoric uncertainty agrees with the true RUL in the early cycles before showing degradation and failure earlier than the true RUL curve, which is a desirable quality for prognostic modeling. The prediction oscillates at the end of the degradation phase before stabilizing at the failure stage. Meanwhile, a small gap separates the RUL predictions with aleatoric and epistemic uncertainties during the early cycles before both seemingly coincide from the degradation phase onward until failure. This is not the case for turbofan 18, where both prognostics are far off from the true RUL. The global explanation for 100% of the features is provided next. The feature contributions and their directions, issued from the explanation model, are presented in the summary plots in Figure 12. Both plots look similar, but had they differed, one should prioritize the explanation from the more confident prediction, in this case turbofan 1. Though the top contributing features influence the predictions more negatively, most of the features have a positive impact on the estimates. The features, ordered by contributing power, are listed in Table 8, where the 75% of the original features (13 features) selected to improve the prognostic modeling are shown in italics.

Table 8 orders the 17 features by contribution as follows: S11, S13, S8, S12, S21, S4, S20, OC2, OC3, S7, OC1, S15, S2, S17, S9, S3, and S14. Now, the performance and prognostic results with 75% of the features are reported. The RMSE and score outcomes with the selected features are presented in Table 9. As observed, the RMSE results with aleatoric and epistemic uncertainties show drastic improvement over the previous results, while the score, however, is worse. Outcomes for turbofan 1 and turbofan 18 are depicted in Figures 13(a) and 13(b), respectively. The same trends in the aleatoric uncertainty slopes as in the previous results are observed, matching the prognostic outcomes. The turbofan 1 modeling shows improvement, as the oscillation at the end of the degradation phase decreases before stabilizing in the failure phase. The aleatoric uncertainty level for turbofan 18 generally improves over the previous result. In the global explanation for 75% of the features, the feature contributions and their directions, mostly positive impacts on the predictions, are presented in the summary plots in Figure 14. For the performance comparison, only the best RMSE and score with aleatoric uncertainty obtained previously are compared with the best published works, ordered by year of publication. As presented in Table 10, the results are on par with these methods, with the prognostic score occupying the top position amongst all the techniques.
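The 75% selection by mean absolute SHAP contribution can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def top_features(shap_values, names, keep_frac=0.75):
    """Rank features by mean |SHAP| over all explained instances and keep
    the top fraction (0.75 here, i.e. 13 of 17 features)."""
    importance = np.abs(np.asarray(shap_values, float)).mean(axis=0)
    order = np.argsort(importance)[::-1]          # most contributing first
    n_keep = int(round(keep_frac * len(names)))
    return [names[i] for i in order[:n_keep]]

# Toy SHAP matrix: 2 instances x 3 features.
sv = [[1.0, -3.0, 0.5],
      [2.0, -4.0, 0.1]]
print(top_features(sv, ["a", "b", "c"], keep_frac=2 / 3))  # ['b', 'a']
```

With 17 features, `round(0.75 * 17)` gives the 13 features retained in the study.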
For illustration purposes, we carry out an explanation evaluation using the waterfall plots of the real-world gas turbine predictions. The predictions issued from the original explanation model and from the modified model are presented in Figures 15 and 16, respectively, with one feature removed in the latter. Next, we calculate the local accuracy. Applying the expression given in (12) to Figure 15 and to Figure 16(b), the calculations confirm the local accuracy property of the explanation. Next, we evaluate the consistency property similarly: applying the expression given in (13) to Figures 15 and 16, and then the expression given in (14), we obtain $\phi_j(f,x) = 9.32$ and $\phi_j(f',x) = 1.1$, thus $\phi_j(f,x) > \phi_j(f',x)$. Therefore, the calculations also confirm the consistency property of the explanation.

Discussion
The insights gained from the study, as well as its limitations and future opportunities, are elaborated in this section.

Anomaly Detection
This paper first proposed an anomaly detection framework based on deep learning aleatoric uncertainty and CUSUM changepoint detection. Bayesian deep learning models, capable of generating uncertainties, were trained using only healthy data. It is thus expected that the aleatoric uncertainty, which is influenced by the input data quality, is stable on healthy data and spikes when encountering abrupt anomalies. As demonstrated in Figures 7, A4, and A9, the strategy yielded 87.5% success, or 7 out of 8 anomalies detected, on the real-world gas turbine dataset. The achievement was partly due to the minimization of aleatoric uncertainty by means of singular value decomposition denoising. As observed, the aleatoric uncertainty around healthy-data predictions is very small because of the denoising, except in Figure A4(a), where the aleatoric uncertainty variation was too large to be minimized by singular value decomposition denoising. Without this operation beforehand, the anomaly spikes risk being indistinguishable from the rest of the prediction's aleatoric uncertainty, hindering effective anomaly detection.
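The singular value decomposition denoising can be sketched as a basic singular spectrum analysis step: truncated SVD of the signal's Hankel (trajectory) matrix followed by diagonal averaging. The window length and rank below are illustrative choices, not the paper's settings:

```python
import numpy as np

def svd_denoise(x, window=20, rank=2):
    """Denoise a 1-D signal via truncated SVD of its Hankel matrix
    followed by diagonal averaging (basic SSA)."""
    x = np.asarray(x, float)
    n = len(x)
    k = n - window + 1
    # Hankel (trajectory) matrix: each column is a lagged window of x.
    H = np.column_stack([x[i:i + window] for i in range(k)])
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    H_r = (U[:, :rank] * s[:rank]) @ Vt[:rank]      # low-rank reconstruction
    # Diagonal averaging back to a 1-D series.
    out, counts = np.zeros(n), np.zeros(n)
    for j in range(k):
        out[j:j + window] += H_r[:, j]
        counts[j:j + window] += 1
    return out / counts
```

On a noisy sinusoid, the rank-2 reconstruction recovers the oscillation while suppressing most of the additive noise.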
The force plot for the local explanation uncovers the dynamics caused by an anomaly in the predictions. One feature appears to follow another's behavior, changing force direction from positive to negative and dragging the prediction lower. The influence of the two features appears amplified in instance 13 due to the consecutive anomaly. Also, observe that the positive influences of two other features rose, increasing the prediction. It is also learnt that most features exert a positive impact that pushes the output value higher. Nonetheless, whether the features influenced one another is not certain and could be investigated by other means, such as partial dependence plots, in the future.
Since the investigation only focused on abrupt anomalies, it is recommended to apply the technique on long consecutive anomalies and examine the generated explanation.
Additionally, this work defined the calculation of the control limit using the aleatoric uncertainty level. However, one can see from Figures 7, A4, and A9 in Appendix 2 that the anomalies were only identified at the 13th or later instance, even though the disturbances had already started at the 12th instance. Faster detection could be possible with a better definition of the control limit. One could lower the limit, but a risk of more false alarms exists, especially when the range of the aleatoric uncertainty is large, as in Figure A4(a). As can be seen in this figure, using one third of the original control limit to identify the anomaly at the 12th instance led to many erroneous detections.

Failure Prognostic
Secondly, the deep learning model was employed for failure prognostics. This time, it was fed with both the healthy and failure data. The aleatoric uncertainty in this task served as a confidence indicator, expressing the model's uncertainty in its output. Based on the graphical results in Figures 11, 13, and 15, the aleatoric uncertainty indicator matched all the prognostic modelings: it increased when the prediction was poor and decreased when the prognostic was good. This feature is vital in failure prediction, especially in the absence of the true RUL, since practitioners can then judge the quality of the prediction for important decision making.
The global explanation in the form of the summary plot helped to improve the performance of the deep learning model. Using only the best contributing features, the RMSE obtained was on par with the best published techniques for this problem. Interestingly, all the OCs played an important role in the prediction and made it to the final selection. While the results of the frequentist models may seem slightly better, this is mainly due to their more complex structures, as their designations suggest. The Bayesian deep learning model employed in this work consists of only a single LSTM layer and a dense layer, which limits its nonlinearity modeling power compared with the other methods. Furthermore, the frequentist models could hardly be utilized in real-life applications, and their usage scope is limited to experimental purposes, as they are devoid of uncertainty quantification. Hence, a more complex network could be incorporated into the existing Bayesian deep learning model in the future to enhance its performance. Moreover, feature selection could be approached from another angle, choosing features according to their influence direction rather than their contributing power, to investigate the effect on performance.
The aleatoric uncertainty indicator provided another dimension of explanation, indicating which predictions were reliable before the XAI approach took place. This enabled differentiating between explanations of reliable and unreliable outputs, helping users and developers obtain deeper insight into the artificial intelligence's decisions. The distinction allows users to prioritize explaining either output type for fast decision making, time being a natural constraint in such situations; the prioritization in turn leads to resource optimization for the task at hand. Furthermore, the distinction aided in selecting which global explanation to use for improving the model: obviously, it is wiser to choose the explanation from a more confident prediction than from a lesser one.
One can notice that the epistemic uncertainty level in all the plots hovered around the same range. This is normal, as the uncertainty of the weights is fixed once training is done.
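This behavior follows from the standard decomposition of predictive uncertainty over T stochastic forward passes, sketched below with hypothetical arrays of per-pass predictive means and standard deviations:

```python
import numpy as np

def decompose_uncertainty(mus, sigmas):
    """Given T stochastic forward passes, each returning a predictive mean
    and standard deviation per point: epistemic variance = variance of the
    means across passes; aleatoric variance = mean of the predicted
    variances across passes."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    epistemic = mus.var(axis=0)
    aleatoric = (sigmas ** 2).mean(axis=0)
    return aleatoric, epistemic

# 3 passes over 2 points with identical means: no epistemic spread.
mus    = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
sigmas = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
alea, epis = decompose_uncertainty(mus, sigmas)
```

Once the weight posterior is fixed after training, the spread of the sampled means, and hence the epistemic term, stays in the same range across plots, exactly as observed.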

Safeguarding Security and Explanation Evaluation
Uncertainty quantification excels at minimizing adversarial example risk. This issue arises when new, unseen data, either unintentionally generated or engineered by attackers, are fed to the network. Adversarial examples can fool deep learning models, and frequentist models are unable to detect this abnormality. A Bayesian model, however, can signal its presence in the form of rising uncertainty. While this work focused on the mechanical failure assumption, it is equally important to investigate failures due to adversarial examples as well.
The generated explanation conforms to the local accuracy and consistency properties. The former corresponds to the efficiency property of Shapley values. Certifying the latter also amounts to justifying the symmetry and additivity properties of Shapley values. Symmetry asserts that the Shapley values of two features should be equal if their contributions to all possible coalitions are equal. Additivity denotes that, for an ensemble prediction, one can compute a feature's Shapley value in each individual ensemble member, average them, and obtain that feature's Shapley value for the whole ensemble.
While the explanation fulfils several desirable general qualities, the need to evaluate explanations against PHM criteria such as security and safety, cost, and time is still present. This aspect is also echoed in [39]. Therefore, it is crucial for PHM-XAI researchers to develop explanation metrics satisfying PHM needs.

Conclusions
The opacity of artificial intelligence models constitutes an operational and legal risk that could potentially derail artificial intelligence investments in the energy and industrial sectors. To promote the adoption of artificial intelligence in real-world prognostic and health management applications, this article tackled the challenges afflicting the PHM-XAI domain, specifically the lack of explanation assessment and uncertainty quantification. PHM tasks relating to anomaly detection and failure prognostics of a gas turbine engine were investigated. The Shapley additive explanations model-agnostic approach was employed to generate local and global explanations from a Bayesian deep learning model, the former for anomaly explanation and the latter for failure prediction. The global explanation was also exploited to improve the prognostic performance. The deep learning model was able to predict with an uncertainty whose trend served as an anomaly marker, changing intensely with abnormal data. The anomaly detection strategy succeeded in identifying seven out of eight available abnormalities, while the best features selected from the global explanation enhanced the prognostic performance to be on par with the best results for this problem. The Shapley additive explanations were finally validated against the local accuracy and consistency properties of explanation.

Funding: Not applicable.
Data Availability Statement: The data and code presented in this study are openly available in https://github.com/AhmadNor (accessed on 9 January 2022).

Sensor descriptions and units:
S1 Total temperature at fan inlet (°R)
S2 Total temperature at low pressure compressor (LPC) outlet (°R)
S3 Total temperature at high pressure compressor (HPC) outlet (°R)
S4 Total temperature at low pressure turbine (LPT) outlet (°R)
S5 Pressure at fan inlet (psia)
S6 Total pressure in bypass-duct (psia)
S7 Total pressure at HPC outlet (psia)
S8 Physical fan speed (rpm)
S9 Physical core speed (rpm)
S10 Engine pressure ratio (P50/P2) (N/A)
S11 Static pressure at HPC outlet (psia)
S12 Ratio of fuel flow to Ps30 (pps/psi)
S13 Corrected fan speed (rpm)
S14 Corrected core speed (rpm)
S15 Bypass ratio (N/A)
S16 Burner fuel-air ratio (N/A)
S17 Bleed enthalpy (N/A)
S18 Demanded fan speed (rpm)
S19 Demanded corrected fan speed (rpm)
S20 HPT coolant bleed (lbm/s)
S21 LPT coolant bleed (lbm/s)