Application of Explainable AI (XAI) for Anomaly Detection and Prognostic of Gas Turbines with Uncertainty Quantification.

Abstract: XAI is presently in its early assimilation phase in the Prognostics and Health Management (PHM) domain. However, the handful of existing PHM-XAI articles suffer from various deficiencies, among others a lack of uncertainty quantification and of explanation evaluation metrics. This paper proposes an anomaly detection and prognostic method for gas turbines using a Bayesian deep learning (DL) model with SHapley Additive exPlanations (SHAP). SHAP was applied not only to explain both tasks, but also to improve the prognostic performance, the latter capability being undocumented in previous PHM-XAI works. The uncertainty measure serves to broaden the explanation scope and was also exploited as an anomaly indicator. Real gas turbine data was tested for the anomaly detection task, while NASA CMAPSS turbofan datasets were used for prognostic. The generated explanations were evaluated using two metrics: Local Accuracy and Consistency. All anomalies were successfully detected thanks to the uncertainty indicator. Meanwhile, the turbofan prognostic results show up to 9% improvement in RMSE and 43% enhancement in early prognostic due to SHAP, making the results comparable to the best published methods for the problem. XAI and uncertainty quantification offer a comprehensive explanation package, assisting decision making. Additionally, SHAP's ability to boost PHM performance solidifies its worth in AI-based reliability research.


Introduction
AI is a marvel of today's technological advancement. It marks the culmination of a decades-long effort by the technical community to imitate biological reasoning. The expansion of data volume, the availability of open source development tools, the easing of collaboration between AI players, and the countless unexplored opportunities push AI on a global scale. Backed by a steady flow of investment and enjoying support from tech-friendly authorities, AI-based projects flourish, replacing the old ways of doing things. AI brings optimization, automation, and efficiency to the table.
Nowadays, AI-powered applications are practically everywhere, whether apparent or hidden. AI penetration is not limited to social media, where it is probably most visible to the general public, but reaches far into niche areas. Much progress has been seen in recent years, especially in fields such as healthcare [1], defense [2], manufacturing [3], biology [4], robotics [5], and reliability [6].
Tech firms and external funders define the AI investment landscape at the moment, with machine learning startups being one of the most funded sectors since 2011 [7]. AI investment grew approximately 30% from 2010 to 2013 and 40% from 2013 to 2016 [8]. To give an idea of the scale this represents, around $26 to $39 billion were invested in 2016 alone.
PricewaterhouseCoopers (PwC) projects an equivalent of $15.7 trillion, or 14% of added GDP value, by 2030, fueled by the growth in productivity and consumer demand due to AI.

XAI is a discipline dedicated to making AI models discoverable and more transparent. While the term has existed for some time, it recently picked up steam as a result of rising scrutiny of AI usage [13]. The accumulation of publications and the surge of interest in the search term "Explainable AI" since 2016, shown here in Figure 1, reflect the growing interest in the field [14]. In 2017, DARPA launched the "Explainable AI (XAI)" initiative, while the Chinese government published "The Development Plan for New Generation of Artificial Intelligence" in the same year, both aiming to proliferate XAI [13].
The need for XAI transcends regulations. XAI could prove more rewarding than burdensome to the AI community. Some of the incentives for incorporating XAI are as follows: 1. Justifying decisions, detecting problems, and improving AI models. 2. Complying with regulations on bias, ethics, reliability, accountability, safety, and security of AI use. 3. Enabling users to verify a model's desirable properties, encouraging interactivity, gaining new insights into the model or the data, and augmenting human intuition. 4. Allowing users' tasks, effort, and resources to be more optimized and targeted. 5. Providing assurance when the cost of error is high or when the AI system is not yet proven reliable. 6. Fostering collaboration between experts, data scientists, users, and stakeholders.

The State of XAI in PHM
PHM is a maintenance and asset management strategy that exploits signals, measurements, models, and algorithms to anticipate, analyze, and track health deterioration in industrial assets [15]. PHM provides standards and protocols to ensure that assets are in good working order. It reduces hazards, maintenance costs, and workload, allowing maintenance operations to be optimized.
Failure prognostic, diagnostic, and anomaly detection are the three categories of PHM activities. Prognostic is the process of determining an asset's Remaining Useful Life (RUL), or leftover operating time before breakdown. Anomaly detection is the action of identifying unusual patterns going against the norm of operational indicators, whereas diagnostic is the action of classifying failure and discovering its detailed root cause. AI-based methods occupy a key position in PHM research, as shown in [6]. XAI, on the other hand, is somewhat of a novelty in PHM.
A systematic review conducted by the author in [16] summarizes the current state of XAI in PHM: 1. XAI assimilation in PHM is still in its early years. Nevertheless, it is gaining interest, with a spike in published works in 2020. 2. Interpretable models, rule & knowledge-based models, and attention mechanisms are the most commonly used XAI approaches in PHM at the moment, as presented in Figure 2. 3. XAI is fast becoming vital to PHM, as it can be adapted as a tool to execute PHM tasks, as seen in the majority of diagnostic and anomaly detection works. 4. PHM performance is unaltered by XAI. 5. Identified gaps in PHM-XAI research comprise a lack of human participation, explainability metrics, and uncertainty management. 6. Mostly real, industrial case studies were tested in previous works to demonstrate the effectiveness of XAI in the PHM domain.

XAI Approach Employed in PHM
An interpretable logistic regression model with elastic net regularization is employed in high pressure plunger pump anomaly detection in [17]. Data is first equally divided, and statistic measures are calculated on each division. A rolling window operation is then applied on the extracted features where flag is associated indicating if a failure will occur or not based on the statistical measure calculated before. The flagged representations, having the most relevant features associated with failure, serve as input to the regularized logistic regression. The relevance order of features to be included from the flagged representations is determined by considering the normal/failure feature distributions and measuring their Kolmogorov-Smirnov distance.
A graphical diagnosis technique based on Convolutional Neural Network (CNN) and extreme gradient boosting (XGBoost), applied to a gas turbine failure problem, is presented in [18]. It replaces portions of the CNN architecture with XGBoost, a machine learning approach for classification and regression, and makes the CNN training model interpretable. XGBoost is a boosting method that combines several weak classifiers into a single strong classifier. The Classification and Regression Tree (CART) is the weak classifier utilized by XGBoost. CART is a binary tree that splits by looking for the best segmentation feature and cut point using the GINI coefficient as a criterion. The time series data are fed into the CNN. When comparable signals are clustered together, the local features will improve, allowing the CNN to be more accurate. These signals may be sorted with XGBoost, improving feature order interpretability. To determine the accuracy, the original raw data obtained from the gas turbine is first fed into the CNN. The signal rankings from the initial raw data, as well as the accuracy gained by the CNN, are then trained in XGBoost to produce tree models that can choose the optimal feature-accuracy sorting combinations.
A K-margin-based intErpretable lEarNing (KEEN) is presented in [19] for interpretable aircraft structural damage diagnosis. This framework consists of a Residual Convolution Recurrent Neural Network (RCR-Net), a K-margin diagnostic method and a knowledge-directed interpretation approach. RCR-Net is a deep learning model that can automatically obtain features and deal with class skewness issues. As input, it accepts augmented data segments. After that, it divides the augmented segments into small fragments and outputs the segment's health-condition prognosis. The K-margin based diagnosis model is robust against noise. It focuses on the RCR-Net's most relevant segments automatically. Its health-condition detector uses segments with top-K confident to estimate the health status. Simultaneously, a knowledge-based interpretation approach automatically extracts features from the RCR-Net responsible for the fault.
A process diagnostic-explanation structure consisting of a knowledge discovery in databases (KDD) method and Fault Tree Analysis (FTA) is proposed in [20]. The KDD method, specifically Logical Analysis of Data (LAD), extracts patterns from the process dataset and produces a rule-based explanation describing the root cause of failure. This explanation is later translated into FTA logic reasoning. The ability of this method is demonstrated in an actuator system failure diagnosis.
The Spectrum Anomaly Detector with Interpretable Feature (SAIFE) is an Adversarial Autoencoder (AAE) based model applied to the problem of wireless spectrum anomaly detection [21]. An LSTM acts as the encoder, extracting interpretable features such as signal bandwidth, class, and center frequency via a linear layer and classifying the signal via a Softmax layer. A CNN acts as the decoder, reconstructing the input data from the extracted features. The AAE architecture is trained in a semi-supervised fashion for learning interpretable features, while the reconstruction is fully unsupervised. The model learns the features during semi-supervised training with partial data. During testing, an anomaly is detected based on the reconstruction error, the classification error, and the loss from the discriminator, which is part of the AAE generator-discriminator adversarial architecture. Anomaly localization is achieved by plotting the absolute reconstruction error.
TScatNet is proposed in [22] for bearing and drive train failure diagnosis. TScatNet collects domain-invariant features utilizing Morlet wavelet and uses these features for diagnosis purpose. TScatNet consists of a time-scattering (Scat) module of standard CNN having Morlet wavelets as convolutional filters and a Softmax module comprising of global averaging pooling (GAP) and Softmax layer. The Scat module transforms the input into scattering features maps. At testing phase, these maps are passed to the global averaging pooling (GAP) layer. The GAP layer aids in the simplification of testing processes and improves the stability of the derived scattering characteristic. The Softmax layer maps each scattering feature into the probability value of fault categories.
An emission control system fault diagnosis method based on PCA clustering is presented in [23]. The sensor data is first treated with PCA for dimensionality reduction. This sensor data is mapped to a relative air/fuel ratio target, which represents normal or degraded operation. The result of the PCA then undergoes PCA-based clustering (Vectorized PCA-VPCA, Multilinear Principal Component Analysis-MPCA, or Uncorrelated Multilinear PCA-UMPCA clustering). The PCA-based clusters isolate fault events in a restricted number of clusters (scenarios), each described by a reference pattern. Once the data have been partitioned into clusters (scenarios), practitioners analyze cluster patterns to gain more insight for fault diagnosis. This provides practitioners with an efficient and interpretable model of multichannel profile data in high-dimensional spaces to support diagnosis and root cause finding.
Classification of Linear Motion guide fault based on CNN applied to vibration signal and explainability with frequency domain-based Grad-CAM (FG-CAM) are proposed to analyze frequencies that have a significant impact on fault conditions [24].
A feed forward neural network (FFNN) together with SHAP (global) and LIME (local) are employed to predict and explain the damage of prismatic cantilever steel beam in [25]. The frequencies and associated damage percentage ranging from 0% to 75% are used as input features and the distance, corresponding to 194 positions of damage are used as target of the FFNN.
Diagnosis of induction motor fault using CNN and LRP is proposed in [26]. The vibration time series data segments used as input are transformed into time-frequency image using Continuous Wavelet Transform (CWT) with Morlet wavelet which is then processed by CNN for classification. LRP captures pixel-level representation of features contributing to the failure.

Research Objectives and Contributions
This work firstly elaborates how data uncertainty can be exploited as an anomaly indicator in the anomaly detection task. It then details a prognostic improvement method using SHAP global explanation. Both tasks' predictions were explained by SHAP. Additionally, the uncertainty also serves to strengthen the explanation by broadening its scope. Local Accuracy and Consistency metrics were used to assess the explanation. Real world data from a gas turbine and NASA CMAPSS turbofan datasets were respectively used for demonstrating the anomaly detection and prognostic capabilities.
The direct contributions of this work are fourfold: 1. Firstly, the uncertainty, together with XAI, forms a broader explanation scope, bridging the gap identified in 1.3. 2. Secondly, the SHAP ability to improve a PHM task, which was absent from previous works as explained in [16]. 3. Thirdly, the application of explanation metrics, which was nearly missing from former works as indicated in 1.3. 4. Finally, this paper reveals the practicality of deep learning uncertainty as an anomaly indicator using a real world dataset.
The supplementary contributions of this work are twofold: 1. This work adds to the AI-based PHM articles employing model-agnostic approaches, which are insufficiently explored as testified in [16]. 2. 100% of the anomalous data were successfully detected thanks to the uncertainty-based indicator. Additionally, the prognostic performance improved by around 6% to 9%, along with a 43% improvement in early prognostic, thanks to SHAP global explanation.

Uncertainties in Deep Learning
Uncertainty in DL linked to the quality of the input data is known as aleatoric uncertainty (AU). This uncertainty may arise from noise, data acquisition error, or stochasticity captured in the input data, which is the usual situation encountered in the real world. This type of uncertainty cannot be reduced by collecting more data unless the data acquisition technique itself is improved. Uncertainty linked to the chosen parameters (weights) of a DL model is called epistemic uncertainty (EU); unlike AU, it can in principle be reduced with more training data [27,28,29,30].

Multi-Outputs Bayesian LSTM
To enable the quantification of both uncertainties and generate explanations, a single-input, multi-output probabilistic LSTM was developed. The model consists of an input layer, an LSTM layer, and a fully connected layer, followed by the output layers. The first output layer is the AU layer, generating sequential outputs with data uncertainty. The second output layer is the EU layer, producing sequential outputs with parameter uncertainty. The last output layer produces the prediction to be explained; here, the outputs from the LSTM are sliced to obtain only the first value of each sequence, which are then grouped into a single explanation vector. For a simplified schematic of the whole model, refer to Structure 1 in Figure 3.
For anomaly detection, the model was fed with only healthy data while for prognostic, both healthy and failure inputs were involved.

Probabilistic Layers
The AU layer is a probabilistic layer that learns and predicts the mean and a variable standard deviation from the input coming from the LSTM layer, forming a prediction range that translates into an uncertainty distribution [31]. Thus, every point in an RUL sequence carries a distribution of RUL predictions. In this work, a normal distribution was used to model the uncertainty, as it is easily understood.
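A layer of this kind is typically trained by minimizing the Gaussian negative log-likelihood of the targets under the predicted mean and standard deviation. The following framework-agnostic numpy sketch illustrates the loss only (the paper's actual model uses a probabilistic layer as in [31]; the function name is ours):

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2), averaged over samples."""
    sigma = np.maximum(sigma, 1e-6)  # numerical floor on the predicted std
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2))

# Toy check: the loss prefers a predicted std that matches the actual residual spread.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=2.0, size=10_000)
mu = np.zeros_like(y)
loss_good = gaussian_nll(y, mu, np.full_like(y, 2.0))  # std matches the data
loss_bad = gaussian_nll(y, mu, np.full_like(y, 0.5))   # overconfident std
```

Minimizing this loss therefore makes the predicted standard deviation track the data noise, which is exactly what the AU layer exploits.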
The EU layer, called the Dense Variational Layer learns and predicts the weights distributions or the posterior distribution of the weights using variational inference by maximizing the ELBO (Evidence Lower BOund) objective, ℒ [32].
ℒ = E_q(w)[log p(Y|X, w)] − KL(q(w) ‖ p(w)), with p(w) the prior, q(w) the variational approximation of the posterior p(w|X, Y), and p(Y|X, w) the likelihood relating all inputs X and labels Y to the weights w. The weights distribution can then be sampled to produce the output for a given input.
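For a single Gaussian weight with a Gaussian prior, the ELBO can be estimated by Monte Carlo sampling of the likelihood term plus the closed-form Gaussian KL divergence. A minimal sketch under those assumptions (the model, prior, and noise level are illustrative, not the paper's):

```python
import numpy as np

def kl_gaussians(mu_q, s_q, mu_p, s_p):
    """Closed-form KL( N(mu_q, s_q^2) || N(mu_p, s_p^2) )."""
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

def elbo_estimate(x, y, mu_q, s_q, n_samples=2_000, noise=1.0, seed=0):
    """Monte Carlo ELBO for a one-weight model y = w*x + N(0, noise^2),
    prior w ~ N(0, 1), variational posterior q(w) = N(mu_q, s_q^2)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(mu_q, s_q, size=n_samples)        # sample weights from q
    resid = y[None, :] - w[:, None] * x[None, :]     # residuals per weight sample
    loglik = -0.5 * np.sum(np.log(2 * np.pi * noise**2) + resid**2 / noise**2, axis=1)
    return loglik.mean() - kl_gaussians(mu_q, s_q, 0.0, 1.0)

x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x                                          # data generated with w = 2
```

A variational posterior centered on the data-generating weight yields a higher ELBO than one centered elsewhere, which is what maximizing ℒ drives toward.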

Bayesian Hyperparameter Optimization (BayesOpt)
The hyperparameters for the model were obtained via Bayesian hyperparameter optimization (BayesOpt) [33]. Optimized hyperparameters help in reducing the EU. The explored hyperparameters and their search spaces are shown in Table 1.

Data Denoising & Uncertainty Visualization
Since noise could worsen the AU, data denoising was performed by applying Singular Value Decomposition (SVD) following the method shown in [34] [35].
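The exact SVD procedure of [34,35] is not reproduced here; a common variant embeds the series in a Hankel matrix, truncates the small singular values, and averages the anti-diagonals, as sketched below on a synthetic signal:

```python
import numpy as np

def svd_denoise(signal, window, rank):
    """Denoise a 1-D series: Hankel embedding -> rank truncation -> anti-diagonal averaging."""
    n = len(signal)
    cols = n - window + 1
    H = np.column_stack([signal[i:i + window] for i in range(cols)])  # Hankel matrix
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    H_r = (U[:, :rank] * s[:rank]) @ Vt[:rank]                        # low-rank approximation
    out = np.zeros(n)
    counts = np.zeros(n)
    for j in range(cols):                                             # average anti-diagonals
        out[j:j + window] += H_r[:, j]
        counts[j:j + window] += 1
    return out / counts

t = np.linspace(0, 4 * np.pi, 400)
clean = np.sin(t)                                                     # a sine is rank 2 in Hankel form
noisy = clean + np.random.default_rng(1).normal(0, 0.3, t.size)
denoised = svd_denoise(noisy, window=40, rank=2)
```

Keeping only the dominant singular components removes much of the broadband noise that would otherwise inflate the AU.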
The rolling standard deviation of the prediction distributions characterizes the uncertainty. An increasing trend in the standard deviation signifies decreasing confidence in the model's prediction, and vice versa.

CUSUM Changepoint Detection for Anomaly Detection
The uncertainty mirrors the model's confidence in predicting. Since the model was trained with only healthy data, the AU is expected to show a spike once anomalous input is tested, signaling that the distribution of data in question was not previously learned during the training phase. CUSUM changepoint detection was applied to identify the anomaly spikes with the appropriate control limit [36].
Given a sequence x_1, x_2, ..., x_n with mean μ and standard deviation σ, the upper and lower cumulative sums are:

S+_i = max(0, S+_{i-1} + (x_i − μ)/σ − k), S−_i = min(0, S−_{i-1} + (x_i − μ)/σ + k)

with k a slack parameter and S+_0 = S−_0 = 0. A process deviates at sample i if S+_i > C or S−_i < −C, with C the control limit. The predetermined control limit C is defined using the AU of healthy data predictions: with σ_max the maximum, σ_mean the mean, and σ_std the standard deviation of the rolling standard deviations of the AU, C is calculated from these statistics.
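The detection scheme can be sketched as follows on a synthetic AU trace; the slack parameter and control limit here are illustrative values, not the statistics-derived limit described above:

```python
import numpy as np

def cusum(x, control_limit, k=0.5):
    """Two-sided standardized CUSUM; returns the index of the first
    deviation beyond the control limit, or -1 if none is found."""
    z = (x - x.mean()) / x.std()
    s_hi = s_lo = 0.0
    for i, zi in enumerate(z):
        s_hi = max(0.0, s_hi + zi - k)   # upper cumulative sum with slack k
        s_lo = min(0.0, s_lo + zi + k)   # lower cumulative sum with slack k
        if s_hi > control_limit or s_lo < -control_limit:
            return i
    return -1

# Synthetic AU trace: flat rolling uncertainty with a spike injected at index 70,
# mimicking the model's reaction to anomalous input it never saw during training.
rng = np.random.default_rng(2)
au = rng.normal(1.0, 0.05, 100)
au[70:75] += 1.0
idx = cusum(au, control_limit=5.0)
```

The detector flags the sequence within a couple of samples of the injected uncertainty spike, which mirrors how the AU spike is caught in the case study.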

SHapley Additive exPlanations
SHAP is a game-theoretic approach to explain the output of any machine learning model [37]. The Shapley value φ_j for a feature j is:

φ_j = Σ_{S ⊆ F∖{j}} [ |S|! (p − |S| − 1)! / p! ] ( f_x(S ∪ {j}) − f_x(S) )

where F is the full feature set, S is a subset of the features used in the model, x is the vector of feature values of the instance to be explained, and p is the number of features. f_x(S) is the prediction for the feature values in set S, marginalized over the features that are not included in set S.
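For a small number of features, the formula can be evaluated exactly. The sketch below uses a single-reference baseline as a stand-in for marginalization (a common simplification; the SHAP library uses more elaborate background distributions), and checks it against the closed form for a linear model:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance; features outside the coalition S
    are replaced by a baseline value (single-reference marginalization)."""
    p = len(x)
    phi = np.zeros(p)
    for j in range(p):
        others = [m for m in range(p) if m != j]
        for size in range(p):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(p - size - 1) / factorial(p)
                z_with, z_without = baseline.copy(), baseline.copy()
                for m in S:
                    z_with[m] = x[m]
                    z_without[m] = x[m]
                z_with[j] = x[j]
                phi[j] += weight * (predict(z_with) - predict(z_without))
    return phi

w = np.array([3.0, -2.0, 0.5])
f = lambda z: float(w @ z)          # linear model: Shapley has a closed form
x = np.array([1.0, 2.0, -1.0])
base = np.zeros(3)
phi = shapley_values(f, x, base)    # equals w * (x - base) for a linear model
```

Exact enumeration is exponential in the number of features, which is why SHAP relies on efficient approximations in practice.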
However, SHAP only accepts non-probabilistic models. Thus, to generate explanations, another LSTM model was used whose structure and weights replicate the input and third output layers of the original model, as depicted in Structure 2 in Figure 3.
SHAP force plots and waterfall plots were used to explain instance predictions, while the SHAP summary plot, a global visualization, explains by identifying the most contributing features in a sequence. In a force plot, each feature value is represented as a positive or negative force pushing or dragging the prediction; in a waterfall plot, the features' contributions and their forces, linking the instance prediction to the average prediction, are depicted. In a summary plot, features are ordered according to their absolute Shapley values, with the most important features occupying the top positions. The force plot was used to explain anomaly instances, while the summary plot was exploited to explain and improve the prognostic performance. The waterfall plot, on the other hand, was employed to verify the Consistency property of the explanation, as described later in Section 2.7.2.

Model Predictive Performance
The average RMSE for 100 predictions was calculated between the predicted RUL (mean of RUL distribution) and the ground truth RUL [38,39].
RMSE = sqrt( (1/N) Σ_{i=1}^{N} ( RUL̂(i) − RUL(i) )² )

with RUL(i) the ground truth RUL for gas turbine i, RUL̂(i) the predicted RUL for gas turbine i, and N the total number of gas turbines.

Early Prediction Score
This metric was only applied in the prognostic task. The scoring function s penalizes late predictions more heavily than early ones for the same absolute error, as a late prediction is more consequential than an early one in any failure-related forecasting problem [40,41]. The average score over 100 predictions was calculated.
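Both metrics can be sketched together. The asymmetric score below uses the exponential form with constants a1 = 13 and a2 = 10, the variant widely used with the CMAPSS benchmark; we assume [40,41] use the same constants:

```python
import numpy as np

def rmse(rul_true, rul_pred):
    """Root mean squared error between predicted and ground truth RUL."""
    return float(np.sqrt(np.mean((np.asarray(rul_pred) - np.asarray(rul_true))**2)))

def phm_score(rul_true, rul_pred, a1=13.0, a2=10.0):
    """Asymmetric scoring function: late predictions (d > 0) are
    penalized more heavily than early ones (d < 0). Lower is better."""
    d = np.asarray(rul_pred) - np.asarray(rul_true)
    return float(np.sum(np.where(d < 0, np.exp(-d / a1) - 1, np.exp(d / a2) - 1)))

true = np.array([20.0, 20.0])
early = np.array([10.0, 10.0])   # predicts failure 10 cycles early
late = np.array([30.0, 30.0])    # predicts failure 10 cycles late
```

RMSE treats the early and late cases identically, while the score charges more for the late one, which is the asymmetry the text describes.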

Explanation Metrics
This subsection introduces the metrics for evaluating SHAP explanation [42].

Local Accuracy
This property states that the feature contributions must add up to the difference between the prediction for x and the average prediction. Starting from the SHAP additive notation g(x′) = φ_0 + Σ_j φ_j x′_j, by posing φ_0 = E[f̂(x)] and setting all x′_j = 1, the Shapley value Efficiency property is recovered:

Σ_j φ_j = f(x) − E[f̂(x)]

where f(x) is the prediction for x and E[f̂(x)] is the average prediction.

Consistency

This property states that if a model changes so that the marginal contribution of a feature value increases or stays the same, its Shapley value also increases or stays the same. Let f_x(z′∖j) denote f_x(z′) with z′_j = 0. For any two models f and f′, if

f′_x(z′) − f′_x(z′∖j) ≥ f_x(z′) − f_x(z′∖j)

for all inputs z′ ∈ {0, 1}^M, then φ_j(f′, x) ≥ φ_j(f, x). Here f(x) is the model with Structure 2 in Figure 3, while f′(x) is the same model but with different weights. f_x(z′∖j) and f′_x(z′∖j) are then the models with Structure 3 in Figure 3, having the same weights as f(x) and f′(x) respectively, except for the input of interest.
By validating this metric, the explanation also conforms to the Symmetry, Dummy, and Additivity properties of Shapley values.
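Local Accuracy can be checked directly for a linear model, where the Shapley values have the closed form φ_j = w_j (x_j − E[x_j]); their sum then equals the prediction minus the average prediction. A self-contained sketch (the model and data here are illustrative):

```python
import numpy as np

# Local Accuracy for a linear model f(z) = z @ w + b, where Shapley values
# are analytic: phi_j = w_j * (x_j - E[x_j]), so sum(phi) = f(x) - E[f(X)].
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))            # background data defining the average
w, b = np.array([1.5, -0.7, 2.0, 0.3]), 0.4
f = lambda Z: Z @ w + b

x = np.array([0.5, -1.0, 2.0, 0.1])      # instance to explain
phi = w * (x - X.mean(axis=0))           # analytic Shapley values
lhs = phi.sum()                          # total attributed contribution
rhs = f(x[None, :])[0] - f(X).mean()     # prediction minus average prediction
```

For the paper's LSTM this identity is verified empirically via the waterfall plot, since no closed form exists.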

Case Study 1: Anomaly Detection on Real Gas Turbine Data
One year's worth of data from a twin-shaft 18.8 MW industrial gas turbine was exploited. This equipment had been previously studied in [43]. The data consists predominantly of healthy operation, with some anomalies producing null (zero) and NaN sensor measurements. It comprises 98 features covering temperature, pressure, speed, and position, totaling 8737 hours of recorded measurements. However, as stated in [43], only several variables are useful for the DL model. The inputs and outputs are shown in Table 2. All the inputs were used to predict each of the outputs, as depicted in Figure 5, by four separate models, one per output. Figure 4 depicts a schematic diagram of the gas turbine under consideration.

Data Preparation
Anomalous data amounting to 377 hours was first removed from the dataset. The rest of the data was divided into training and testing datasets. A sequence was set to 24 hours; thus, the models were fed 24-hour inputs and produced predictions of the same length. Hourly data from 01/01/18 to 26/11/18, amounting to 7488 hours or 312 sequences, was used for training and validation. The data from 26/11/18 to 31/12/18, amounting to 816 hours or 34 sequences, was reserved for healthy-state testing.
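The 24-hour windowing above can be sketched as follows, assuming a single already-denoised channel (the stand-in array is ours; the real pipeline feeds multiple sensor channels):

```python
import numpy as np

SEQ_LEN = 24                              # one sequence = 24 hourly samples
hours = np.arange(7488, dtype=float)      # stand-in for one sensor channel (training span)

n_seq = len(hours) // SEQ_LEN             # 7488 / 24 = 312 sequences
sequences = hours[:n_seq * SEQ_LEN].reshape(n_seq, SEQ_LEN)
```

The same reshaping applied to the 816-hour test span yields the 34 held-out sequences reported in the text.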
The anomalous hours were combined with the healthy data from the periods before and after the anomaly to make up a sequence of 24 hours. The null anomaly on the 8th to 9th of April, from 11 pm to 12 am (the 6th to 7th instances), was considered.
The summary of the datasets is presented in Table 3.

Prediction with Null Anomaly
The predictions for the sequence containing anomalous inputs, with AU, for the four models are respectively presented in Figure 6, Figure 7, Figure 8, and Figure 9.

Control limit calculation
The variables and results for the control limits are listed in Table 5.

Anomaly Detection with CUSUM
The CUSUM chart for anomaly detection associated with the predictions and the control limit are presented in Figure 10. The coordinates featured in the chart belong to the identified anomalies.

Anomaly Sequence Force Plot Visualization
The SHAP force plots explaining the anomalies are shown in Figure 11. The marked areas corresponding to the 6th and 7th instances are the tested anomaly instances. For illustration purposes, only instances 1 to 9 are shown.

Case Study 2: Prognostic on NASA CMAPSS Turbofan Data

The chosen FD001 training and testing datasets each consist of 100 recorded turbofan degradations, as summarized in Table 6. A single record corresponds to a turbofan whose health condition deteriorated after a certain cycle, or failure start point, until breakdown [44]. Each turbofan fleet might be used in different operating conditions; as such, the extent of degradation differs from one to another. Each record is a time series comprising Time (Cycle), 3 Operating Conditions (OC), and 21 sensor signals, as presented in Appendix A. RUL targets for the training dataset are not provided; only the ground truth RULs for the testing dataset are given. The OC refers to different combinations of operating regimes of Altitude (0-42K ft.), Throttle Resolver Angle (20-100), and Mach Number (0-0.84) [44]. High-level noise is incorporated, and the faults encountered are hidden by the effects of the various operational conditions [45].
To obtain the RUL labels for training, piece-wise linear degradation was assumed [46,47]. Each fleet's health was considered stable in the beginning, followed by a linear deterioration after the failure start point until breakdown.
Originally, the RUL for a signal took the value of the recorded signal's last cycle, or the signal sequence length, and degraded linearly until zero, as shown for Fleet 1 in Figure 12(a). The failure start point for each signal was identified using CUSUM with the control limit set to 5 standard deviations. The mean of these failure start points was then calculated, in this case resulting in cycle 46. Combining the linear degradation obtained earlier and the mean failure start point, the transformed Fleet 1 RUL sequence is presented in Figure 12(b). To facilitate the model's generalization, all target RULs were capped at 50. The total signal sequence lengths and their respective RULs for the training and testing datasets are presented in Appendix B.
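The label construction above can be sketched as follows, using the cap of 50 and the Fleet 1 mean failure start point (cycle 46) from the text; the record length of 192 cycles is an assumed example:

```python
import numpy as np

def piecewise_rul(n_cycles, failure_start, cap=50):
    """Piece-wise linear RUL target: flat before the failure start point,
    then a linear ramp down to zero at the last cycle, capped at `cap`."""
    countdown = np.arange(n_cycles)[::-1]                   # cycles remaining until breakdown
    rul = np.minimum(countdown, n_cycles - 1 - failure_start)  # flat until failure start
    return np.minimum(rul, cap).astype(float)               # cap for generalization

labels = piecewise_rul(n_cycles=192, failure_start=46, cap=50)
```

The resulting target starts at the cap, never increases, and reaches zero at breakdown, matching Figure 12(b)'s shape.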

Prognostic Performance
The RUL predictions for Fleet 1 and Fleet 18 using the 17 features are illustrated in Figure 14 and Figure 15 respectively. These fleets were chosen because the former fleet's testing data length and ground truth RUL follow the same trend as the training data, whereas the latter's do not, as indicated in Appendix B. It is thus interesting to examine the difference in uncertainty behavior between the two.
The SHAP summary plots for the predictions of these fleets are depicted in Figure 13. In fact, almost all summary plots for the 100 testing fleets show the same feature order as Figure 13. One can thus choose the best set of features to improve predictive performance. Accordingly, the model was also tested with the best 13 features (75%) and the best 9 features (50%) of the original 17 features. Table 7 lists the feature combinations tested.

Performance Comparison with Published Methods
The results using 13 features with AU, compared with published methods, are presented in Table 9 and Table 10 respectively.

Local Accuracy Verification
The waterfall plot of the first instance on the first sequence of prediction is shown in Figure 16.

Consistency Verification
As an example, the contribution of one input variable to the first test data instance was investigated. For each output, the marginal differences and the corresponding Shapley contributions of the two models were calculated, confirming that the model whose marginal contribution is larger also receives the larger Shapley value, as required by Consistency. These results are illustrated in the waterfall plots in Figure 17.

Explainable Anomaly Detection
100% of the tested null anomalies were successfully detected with the help of the AU indicator and CUSUM changepoint detection, as illustrated in Figure 10. The AU spiked, representing the model's uncertainty when fed anomalous data, surpassing the healthy threshold limit at the anomaly instances for all outputs.
The force plots' local explanations linked to the anomaly instances, shown in Figure 11, highlight the fuel mass flow rate and N2, the power turbine rotational speed, as the features responsible for the anomaly. During the initial instances before the anomalies, all features contributed to the prediction. When the consecutive anomalies occurred, the forces of both features were amplified. In the 7th instance, all other feature forces were eclipsed, showing mostly these two features. However, at the 8th instance, the distribution of contributing forces returned to normal, with all features taking part in the prediction. A red bar in the plot pushes the prediction positively while a blue bar drags it negatively. The width of a bar represents the magnitude of its contributing force, and the values on the bars are the normalized test data values. The base value is the average output of the model during the training phase.
To improve the anomaly detection, one could lower the limit value, resulting in faster detection. However, by doing so, the risk of false alarms increases. Considering that the tested anomalies are merely stochastic disturbances rather than continuous ones, the present limit definition is deemed acceptable.

Figure 15 depicts the prognostic result for Fleet 18. As can be seen, the AU shows a rising trend, signaling that the model is increasingly uncertain of its prediction, reflecting the predicted RUL sequence being far from the ground truth RUL. The AU for the Fleet 1 prediction, however, indicates a decreasing trend, as presented in Figure 14, mirroring the good prediction the model made; the model becomes more and more confident of its sequential estimation. Meanwhile, the EU measure manifests very little change, on nearly the same scale for both fleets, which is expected for the weights' uncertainty: it should not be influenced much by changes in the input data.

Explainable Prognostic
The summary plot global explanation ordered the features according to their contribution to the sequence prediction, as shown in Table 7. The top 5 variables influencing the prediction are physical fan speed, static pressure at HPC outlet, total temperature at LPT outlet, corrected fan speed, and bypass ratio. The model's predictive performance improved by around 6% to 9%, and its early prediction showed a 43% improvement, with only the 13 most influential features. The model performed slightly worse with only 9 features than with 13, though still better than with all 17 inputs: compared to 13 features, the predictive power decreased by 0.5% with AU and increased by 3% with EU, while the early prediction ability decreased by 0.8% with AU and 2% with EU. However, considering that only 9 features were used instead of 13, this small performance drop is perfectly tolerable. Weighing all factors, one could even argue that the 9-feature model is preferable to the 13-feature one.
The enhanced result is comparable to the best methods' outcomes on the CMAPSS FD001 dataset. It is true that some works fare better than the proposed framework. This is firstly due to their adoption of more complex structures: the DCNN and RNN in [49], for example, have five convolutional and five recurrent layers respectively, while the BiLSTM in [50] possesses two BiLSTM layers and two fully connected layers. Secondly, the mentioned methods only produce point estimates, without any quantification of uncertainty; model generalization is obviously easier in that case. Consequently, without an uncertainty measure, these works remain experimental and cannot readily be applied in real-world applications.

Explanation Evaluation
This work demonstrated that the SHAP explanation satisfies the Local Accuracy and Consistency criteria. By fulfilling these properties, the explanation also conforms to the Efficiency, Symmetry, Dummy, and Additivity properties of Shapley values. Efficiency affirms that the sum of the feature contributions equals the difference between the instance prediction and the average prediction over all instances. Symmetry implies that two feature values' contributions should be identical if they contribute equally to all feasible coalitions. Dummy states that a feature that does not affect the predicted value should have a Shapley value of zero regardless of the coalition it is part of. Finally, Additivity denotes that for an ensemble prediction, one can calculate a feature's Shapley value in each individual ensemble member, average them, and obtain the feature's Shapley value for the whole ensemble.

Conclusions
This article elaborates the application of the SHAP model-agnostic approach in explaining the outputs of a Bayesian LSTM in anomaly detection and prognostic tasks for gas turbines, using real and simulated datasets. The forecast uncertainty generated by the Bayesian model broadens the explanation scope to include the model's confidence, strengthening the explanation; it was also exploited as an anomaly indicator. SHAP global explanation was used to enhance prognostic performance by identifying the most contributing features in the prediction. All the anomalous instances were detected owing to the uncertainty indicator. Moreover, the model's RMSE improved by around 6% to 9%, and its early prediction ability showed a 43% improvement, thanks to SHAP. These results are comparable to the best published methods for the problem. Finally, the generated explanation verifies the Local Accuracy and Consistency properties, and in doing so validates the Efficiency, Symmetry, Dummy, and Additivity properties of Shapley values. This paper shows how SHAP and deep learning uncertainty form a broader explanation scope while simultaneously demonstrating SHAP's ability to enhance PHM performance, highlighting its potential as an easy-to-use, flexible, and powerful XAI technique.

References

Appendix A

Sensor  Symbol     Description                      Unit
S1      T2         Total temperature at fan inlet   °R
S2      T24        Total temperature at LPC outlet  °R
S3      T30        Total temperature at HPC outlet  °R
S4      T50        Total temperature at LPT outlet  °R
S5      P2         Pressure at fan inlet            psia
S6      P15        Total pressure in bypass-duct    psia
S7      P30        Total pressure at HPC outlet     psia
S8      Nf         Physical fan speed               rpm
S9      Nc         Physical core speed              rpm
S10     epr        Engine pressure ratio (P50/P2)   N/A
S11     Ps30       Static pressure at HPC outlet    psia
S12     phi        Ratio of fuel flow to Ps30       pps/psi
S13     NRf        Corrected fan speed              rpm
S14     NRc        Corrected core speed             rpm
S15     BPR        Bypass ratio                     N/A
S16     farB       Burner fuel-air ratio            N/A
S17     htBleed    Bleed enthalpy                   N/A
S18     Nf_dmd     Demanded fan speed               rpm
S19     PCNfR_dmd  Demanded corrected fan speed     rpm
S20     W31        HPT coolant bleed                lbm/s
S21     W32        LPT coolant bleed                lbm/s

Appendix B