On the ROC Area of Ensemble Forecasts for Rare Events

The relative operating characteristic (ROC) curve is a popular diagnostic tool in forecast verification, with the area under the ROC curve (AUC) used as a verification metric measuring the discrimination ability of a forecast. Along with calibration, discrimination is deemed a fundamental probabilistic forecast attribute. In particular, in ensemble forecast verification, AUC provides a basis for the comparison of potential predictive skill of competing forecasts. While this approach is straightforward when dealing with forecasts of common events (e.g., probability of precipitation), the AUC interpretation can turn out to be oversimplistic or misleading when focusing on rare events (e.g., precipitation exceeding some warning criterion). How should we interpret AUC of ensemble forecasts when focusing on rare events? How can changes in the way probability forecasts are derived from the ensemble forecast affect AUC results? How can we detect a genuine improvement in terms of predictive skill? Based on verification experiments, a critical eye is cast on the AUC interpretation to answer these questions. As well as the traditional trapezoidal approximation and the well-known binormal fitting model, we discuss a new approach that embraces the concept of imprecise probabilities and relies on the subdivision of the lowest ensemble probability category.


Introduction
Over the past decades, the popularity of the relative operating characteristic (ROC) curve has steadily increased with applications in numerous fields (Gneiting and Vogel 2018). In meteorology, verification of weather forecasts based on signal detection theory has been in use since the seminal works of Mason (1982) and Harvey et al. (1992), and was recommended as a standard verification tool by the World Meteorological Organization in Stanski et al. (1989). In the framework of probabilistic forecast verification, the area under the ROC curve (AUC) is often used as a summary measure of forecast discrimination. Discrimination is the ability to distinguish between event and nonevent and, along with calibration, it is one of the key attributes of a probabilistic forecast (Murphy 1991). While calibration deals with the meaning of probabilities (its estimation is an attempt to measure whether taking a forecast at face value is an optimal strategy), discrimination appraises the existence of a signal in the forecast when an event materializes and its absence in the opposite situation.
Practically, the ROC plots the hit rate (HR) versus the false alarm rate (FAR) of an event for incremental decision thresholds. Examples are provided in Fig. 1 for probability forecasts derived from the 50-member ensemble run at the European Centre for Medium-Range Weather Forecasts (see more details about the data in section 2). Corresponding probability fields for 15 February 2021 over the British Isles are shown in Fig. 2. The targeted events correspond to precipitation exceeding the following thresholds: 1, 20, and 50 mm (24 h)⁻¹. A ROC curve is defined by the line joining successive ROC points, where each point corresponds to results for increasing decision thresholds, from the top right to the bottom left corners of the plot. When the decision variable is the number of members exceeding the event threshold (interpreted as a raw probability forecast), the issued forecast can take values in {0, 1/M, 2/M, …, 1} for an ensemble of size M. As a consequence, the resulting ROC curve is based on (up to) M + 1 points.
The area under the straight lines formed by connecting the M + 1 points [including the (0, 0) and the (1, 1) points] of the ROC plot corresponds to the AUC with the so-called trapezoidal approximation (T-AUC). This nomenclature comes from the fact that the area is estimated considering straight lines between two consecutive points of the plot and so as a sum of trapeziums. Interestingly, T-AUC is equivalent to the result of a two-alternative forced choice (2AFC) test for dichotomous events (Mason and Weigel 2009). The 2AFC test consists in checking whether, for two different observations in a verification sample, one event and one nonevent, the forecast associated with the former is larger than the forecast associated with the latter (provided that the decision variable is oriented so that large implies more likely). If we denote $x_1$ a forecast issued when the event occurs and $x_0$ a forecast issued when the event does not materialize, the scoring function associated with the test is
$$
I(x_0, x_1) =
\begin{cases}
0 & \text{if } x_1 < x_0, \\
0.5 & \text{if } x_1 = x_0, \\
1 & \text{if } x_1 > x_0.
\end{cases}
\tag{1}
$$
The result of the test is the average score over all event/nonevent pairs in the verification sample. In Eq. (1), the test returns a value of 0.5 when the forecasts are indistinguishable, that is, in our examples, when two probability forecasts are identical. An average score of 0.5 is also the expected mean result for a random forecast.
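The equivalence between T-AUC and the 2AFC score can be checked numerically. The following is a minimal sketch, not the code used for the results in this paper: the function names and the toy sample are illustrative assumptions, and the decision variable is taken to be the number of members exceeding the event threshold.

```python
from itertools import product


def roc_points(forecasts, observations):
    """Return the (FAR, HR) points of the empirical ROC, starting at (0, 0).

    A warning is issued when the decision variable is >= the decision
    threshold; every distinct forecast value is used as a threshold.
    """
    n_event = sum(1 for o in observations if o)
    n_nonevent = len(observations) - n_event
    points = [(0.0, 0.0)]
    # Decreasing thresholds move from the bottom-left to the top-right corner.
    for thr in sorted(set(forecasts), reverse=True):
        hits = sum(1 for f, o in zip(forecasts, observations) if o and f >= thr)
        false_alarms = sum(
            1 for f, o in zip(forecasts, observations) if not o and f >= thr
        )
        points.append((false_alarms / n_nonevent, hits / n_event))
    # The lowest threshold always warns, so the last point is (1, 1).
    return points


def trapezoidal_auc(points):
    """Area under the straight lines joining consecutive (FAR, HR) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def two_afc(forecasts, observations):
    """Average 2AFC score over all event/nonevent pairs (1, 0.5, or 0 per pair)."""
    x1s = [f for f, o in zip(forecasts, observations) if o]
    x0s = [f for f, o in zip(forecasts, observations) if not o]
    total = sum(
        1.0 if x1 > x0 else 0.5 if x1 == x0 else 0.0
        for x1, x0 in product(x1s, x0s)
    )
    return total / (len(x1s) * len(x0s))


# Toy sample (hypothetical): member counts for a 5-member ensemble.
counts = [0, 0, 0, 0, 1, 0, 2, 3, 0, 5]
events = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
print(trapezoidal_auc(roc_points(counts, events)), two_afc(counts, events))
```

In this toy sample, one event is paired with several nonevents that share the same zero count; each such tie contributes 0.5, which is how a heavily populated lowest category pulls T-AUC toward the score of a random forecast.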
For rare events, there is "a tendency for the points on the ROC to cluster toward the lower left corner of the graph" as noted by Casati et al. (2008) and illustrated in Fig. 1. When computing T-AUC, a straight line is drawn between the last meaningful point on the ROC curve and the top-right corner to close the ROC curve, giving the impression that part of the curve is missing. How much of the curve is "missing" depends on the lowest category, defined here by the ensemble size, and the base rate of the event. As a rule of thumb, half of the ROC curve (from the apex to the right corner) is missing when assessing the performance of an ensemble of size M focusing on an event with a base rate α of about 1/M. This rule is valid when the probability forecast is close to calibration, that is, when the probability can be taken at face value for decision-making. The rule is derived from the relationship between an optimal decision threshold, the slope of the line tangent to the ROC curve, and the event base rate [see e.g., Eq. (25) in Ben Bouallègue et al. 2015].
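The reasoning behind this rule of thumb can be sketched as follows. The notation is an assumption made for illustration (p is a probability threshold, α the event base rate), and the expression is one standard way of writing the slope of the ROC curve for a calibrated forecast, not necessarily the form used in the cited reference. For a calibrated forecast, Bayes' rule gives the likelihood ratio at forecast value p, which is also the local slope of the ROC curve:

```latex
\frac{\mathrm{d}\,\mathrm{HR}}{\mathrm{d}\,\mathrm{FAR}}\bigg|_{p}
  = \frac{\Pr(p \mid \text{event})}{\Pr(p \mid \text{nonevent})}
  = \frac{p}{1-p}\,\frac{1-\alpha}{\alpha},
\qquad\text{so that}\qquad
p = \alpha = \frac{1}{M}
\;\Longrightarrow\;
\frac{\mathrm{d}\,\mathrm{HR}}{\mathrm{d}\,\mathrm{FAR}} = 1 .
```

A slope of 1 means that the tangent is parallel to the diagonal, which identifies the apex of the curve. The lowest nonzero raw probability 1/M therefore sits at the apex when the base rate is close to 1/M, and the part of the curve to the right of the apex (slopes smaller than 1) is not resolved by the raw probabilities.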
To draw a "full" ROC curve, one can apply the so-called binormal model (Harvey et al. 1992; Wilson 2000; Atger 2004). The fitting of the ROC curve with the binormal model is based on the assumption that HR and FAR are integrations of normal distributions, a signal and a noise distribution, respectively. A closed form for the computation of the AUC exists [see Eqs. (2) and (3) in Harvey et al. 1992]. The fitting of the HR and FAR requires a Z-transformation based on the unit normal distribution. For this reason, the resulting AUC is denoted here Z-AUC. When applied to ensemble-derived probability forecasts for rare events, this approach consists effectively in an extrapolation to a hypothetical continuous decision variable based on the limited set of decision thresholds materially assessable. Because such a decision variable may not be achievable in practice, Z-AUC is sometimes considered as a measure of the potential discrimination that could be achieved, for example "for an unlimited ensemble size" (Bowler et al. 2006).
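A minimal sketch of a binormal AUC estimate is given below. It assumes the usual binormal relationship z(HR) = a + b z(FAR), for which the area is Φ(a / √(1 + b²)) with Φ the unit normal distribution function. The straight line is fitted here by ordinary least squares in the Z-transformed space, which is a simplifying assumption: the cited references may use a different fitting procedure. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_UNIT_NORMAL = NormalDist()  # mean 0, standard deviation 1


def binormal_auc(points):
    """Estimate Z-AUC from (FAR, HR) points.

    Points with a rate of exactly 0 or 1 cannot be Z-transformed and are
    dropped. A least-squares line z(HR) = a + b * z(FAR) is fitted and the
    closed form Phi(a / sqrt(1 + b^2)) is returned.
    """
    z = [
        (_UNIT_NORMAL.inv_cdf(far), _UNIT_NORMAL.inv_cdf(hr))
        for far, hr in points
        if 0.0 < far < 1.0 and 0.0 < hr < 1.0
    ]
    if len(z) < 2:
        raise ValueError("at least two interior ROC points are needed")
    n = len(z)
    mean_x = sum(x for x, _ in z) / n
    mean_y = sum(y for _, y in z) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in z)
    if sxx == 0.0:
        raise ValueError("ROC points share the same FAR; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in z)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return _UNIT_NORMAL.cdf(intercept / sqrt(1.0 + slope * slope))
```

When the ROC points cluster in the lower left corner, the fitted slope rests on points that are very close to each other in the Z-transformed space, so small sampling changes in HR or FAR can move the extrapolated curve substantially.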
T-AUC and Z-AUC summary metrics can provide very different comparative results. The statistics reported in Fig. 1c are striking in that respect. In gray, we report the verification results obtained when using the probability of exceeding 20 mm (24 h)⁻¹ to predict the occurrence of precipitation exceeding 50 mm (24 h)⁻¹. On the one hand, Z-AUC is (slightly) smaller than for the original forecast (in black): the interpretation is that the probability forecast for 50 mm (24 h)⁻¹ is potentially more informative than the probability forecast for 20 mm (24 h)⁻¹ when we focus on the higher event threshold. On the other hand, T-AUC statistics point toward a larger predictive skill of the low event-threshold probability forecast: in practice, users may benefit more from using the lower threshold unless additional postprocessing is carried out to realize the potential additional benefit from using the higher threshold implied by the Z-AUC. As illustrated in Fig. 2, in many cases no ensemble member exceeds the high event threshold. The small proportion of distinguishable forecasts explains the poor results of the original forecast in terms of T-AUC: the discrimination ability of two forecasts with the same value is equivalent to the discrimination ability of two random forecasts [see Eq. (1)].
When verifying ensemble forecasts focusing on rare events, AUC users face a dilemma: should they use T-AUC, which relies on a clustering of points in the bottom left corner of the ROC plot, or should they use Z-AUC, which extrapolates the results to compute scores based on a "full" ROC curve? The user's preference depends on the scientific question at hand, and in particular on whether the practical usefulness or the intrinsic information content of the ensemble forecast is the key aspect to be assessed. AUC assesses the discrimination ability of a decision variable, so special attention should be paid to how this decision variable is derived from the ensemble forecast. A decision variable defined as the number of ensemble members exceeding a threshold can appear appropriate for common events but may be less useful when forecasting rare events, as illustrated in Fig. 1.
Aiming at bridging the gap between T-AUC and Z-AUC results, we propose using a new decision variable that encompasses more information from the ensemble than simply the number of members exceeding the event threshold for the computation of T-AUC. Our approach is inspired by a suggestion in Casati et al. (2008): "one solution to this problem [the clustering of the ROC points] is to subdivide the lowest-valued forecast probability bins. The verification sample can usually support subdividing the lower-valued probability bins when fitting the ROC for low base-rates." In other words, in the situation where no ensemble member exceeds the event threshold, the probability of occurrence should (more than ever) be interpreted as an imprecise probability (IP), a probability over an interval. A refinement of the forecast probability on that interval is possible using additional information from the ensemble itself such that different levels of near 0 chances of event occurrence can be distinguished. In the following, we show how to draw additional information from the ensemble about low chances of occurrence when using the ensemble mean (EM). EM is called the "secondary" decision variable in this process and AUC estimated with this approach is referred to as IP/EM-AUC.
Having in mind the key question "How to interpret AUC results of ensemble forecasts when focusing on rare events?", we design a series of verification experiments in order to analyze T-AUC and Z-AUC in context. The verification experiments are chosen to show: 1) the loss of predictability with forecast lead time; 2) the impact of a postprocessing step, which accounts for subgrid variability; 3) the impact of increasing the forecast probability categories; 4) how to isolate the ensemble size effect with the help of a parametric model; and 5) the impact of subdividing the lowest category with the help of an ensemble summary statistic.
The verification experiments and derived results are described and presented in section 2 before drawing recommendations in section 3.

Verification experiments
a. Verification setup
1) DATASET
Forecasts of daily precipitation are used in the following verification experiments, but similar qualitative results can be obtained with other accumulation periods or weather variables. The probability forecasts are derived from the ensemble prediction system run operationally at the ECMWF. The interpretation of the 50-member ensemble in terms of probability follows a simple (but common) approach. It consists in counting the number of members exceeding a threshold. Observation measurements at surface synoptic observation stations over the globe are compared with forecasts at the nearest grid point over a verification period running from 1 September 2019 to 31 August 2020.
Probabilities derived from ensemble forecasts are interpreted as imprecise probabilities as we are in a situation where the source of probabilistic information is incomplete and imperfect (Bradley 2019). For example, when no member exceeds the event threshold of interest, the derived probability belongs to a probability interval close to 0. A ranking of the probability forecasts in that category can, however, be expressed with the help of an ensemble summary statistic such as the ensemble mean or an ensemble quantile. For illustration purposes, the ensemble mean is chosen as a secondary decision variable, which is used to refine the ensemble interpretation for the lowest probability interval. Other choices can be valid as well and further research is encouraged in order to determine if an optimal summary statistic as a secondary decision variable exists in such a context.
Statistical postprocessing of precipitation forecasts is also envisaged here and tested applying a parametric approach, that is, relying on a predefined type of probability distribution. In the following, censored shifted gamma distributions are used to appropriately describe precipitation forecast distributions (a detailed description of the statistical method can be found in Ben Bouallègue et al. 2020). Postprocessing aims here at correcting for the scale mismatch between forecasts (as model outputs on a grid) and observations (as point measurements at stations). We follow a so-called perturbed ensemble approach, which consists in adding uncertainty to the forecast in order to represent the larger uncertainty at a finer scale. Practically, random perturbations drawn from a parametric distribution are added to each ensemble member.
Other ensemble postprocessing techniques and their impact on the ROC are not investigated here. Techniques such as the neighborhood method or the use of lagged ensembles lead to an increase of the effective ensemble size at much lower computational cost than running additional ensemble members (Ben Bouallègue et al. 2013). The availability of more members allows in any case a finer probability discretization. In general terms, the impact of discretization is discussed below along with experiment III.

2) VERIFICATION METHODOLOGY
In our experiments, the central step of the verification process consists in populating contingency tables. The 2 × 2 tables are the raw material for generating:
• ROC curves (Mason 1982),
• performance diagrams plotting hit rate versus success ratio (Roebber 2009), and
• potential economic value (PEV) plotted as a function of the user's cost-loss ratio (Richardson 2000, 2011).
Contingency tables are populated for incremental decision thresholds. In the traditional ensemble verification case, the decision variable is the number of members exceeding the event threshold [e.g., 50 mm (24 h)⁻¹] and the number of decision thresholds is M + 1 with M the ensemble size. In our experiment dealing with imprecise probabilities, the "zero category" is subdivided in additional subcategories using the ensemble mean EM as a secondary decision variable. We are considering EM as a continuous decision variable rather than focusing on the binary forecast EM being greater than the event threshold. Dealing here with daily precipitation, the following (secondary) decision thresholds are used in this context: [0.1, 0.2, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 15] in millimeters per 24 h [mm (24 h)⁻¹]. The basic principle governing this choice of thresholds is to span decision thresholds over the range of values that the secondary decision variable can take.
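To make the link between a contingency table and the three diagnostics explicit, a minimal sketch is given below. The function names are hypothetical. The economic value follows the static cost-loss model, with expenses expressed in units of the loss; taking the maximum over decision thresholds to obtain the potential economic value is an assumption of this sketch, and the expression is only defined for cost-loss ratios strictly between 0 and 1.

```python
def contingency_table(forecasts, observations, threshold):
    """Return (hits, false alarms, misses, correct negatives).

    A warning is issued when the decision variable is >= threshold.
    """
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecasts, observations):
        warned = f >= threshold
        if warned and o:
            hits += 1
        elif warned:
            false_alarms += 1
        elif o:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def rates(hits, false_alarms, misses, correct_negatives):
    """Return (hit rate, false alarm rate, success ratio)."""
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_negatives)
    warnings = hits + false_alarms
    success_ratio = hits / warnings if warnings else float("nan")
    return hit_rate, false_alarm_rate, success_ratio


def economic_value(table, cost_loss):
    """Relative economic value for one table and one cost-loss ratio in (0, 1)."""
    hits, false_alarms, misses, correct_negatives = table
    n = hits + false_alarms + misses + correct_negatives
    base_rate = (hits + misses) / n
    # Mean expense per case, in units of the loss L.
    expense_forecast = cost_loss * (hits + false_alarms) / n + misses / n
    expense_climate = min(cost_loss, base_rate)  # always or never protect
    expense_perfect = cost_loss * base_rate      # protect only when needed
    return (expense_climate - expense_forecast) / (expense_climate - expense_perfect)


def potential_economic_value(tables, cost_loss):
    """Best value over all decision thresholds (assumed definition of PEV here)."""
    return max(economic_value(t, cost_loss) for t in tables)
```

With this formulation, a user with a cost-loss ratio below the lowest nonzero raw probability can only choose between the lowest available threshold and always protecting, which is why changes in the lowest probability categories show up at the low end of the PEV plot.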
So in practice, the following is performed:
1) choose an event threshold;
2) count the number of members exceeding this event threshold (this decision variable is called the raw probability forecast);
3) if no members exceed the event threshold, compute the ensemble mean EM and count the number k of the K secondary decision thresholds that EM exceeds;
4) adjust the raw decision variable by considering a probability of k/[M(K + 1)] (rather than 0) for that forecast; and
5) derive a contingency table for all distinct probability values in the forecast sample.
The trapezoidal approximation applied to the corresponding set of (HR, FAR) pairs corresponds to IP/EM-AUC. This acronym reflects that the ensemble mean is used as a secondary decision variable.
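The steps above can be sketched for a single forecast as follows. This is an illustrative sketch rather than the code used for the experiments; the function name is hypothetical, and "exceeding" is taken as strictly greater than the threshold. The secondary thresholds are the ones listed in the text.

```python
# Secondary decision thresholds for the ensemble mean, in mm (24 h)^-1.
EM_THRESHOLDS = [0.1, 0.2, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 15]


def ip_em_probability(members, event_threshold, em_thresholds=EM_THRESHOLDS):
    """Decision variable with the lowest probability category subdivided.

    If at least one member exceeds the event threshold, the raw probability
    (fraction of members exceeding it) is returned. Otherwise the value
    k / (M * (K + 1)) is returned, where k is the number of the K secondary
    thresholds exceeded by the ensemble mean.
    """
    m = len(members)
    n_exceed = sum(1 for x in members if x > event_threshold)
    if n_exceed > 0:
        return n_exceed / m
    ensemble_mean = sum(members) / m
    k = sum(1 for t in em_thresholds if ensemble_mean > t)
    return k / (m * (len(em_thresholds) + 1))


# Example with hypothetical 50-member forecasts and a 50 mm (24 h)^-1 event.
dry = [0.0] * 50
wet_but_below = [12.0] * 50
one_member_above = [60.0] + [0.0] * 49
for ens in (dry, wet_but_below, one_member_above):
    print(ip_em_probability(ens, 50.0))
```

Because k is at most K, the refined values stay strictly below 1/M, so the ordering of the original nonzero probability categories is preserved and only the "zero category" is subdivided.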
Scores are aggregated over different domains, but mainly results for the globe are shown. Event thresholds are defined in absolute terms with a focus on an event threshold of 50 mm (24 h)⁻¹. In the global verification sample, this event has a base rate of around 1%, but with more occurrences in certain regions of the world than others. Hamill and Juras (2006) recommend the use of thresholds expressed in relative terms (quantile of a climatology) in order to avoid over-interpretation of the AUC results by mixing the forecast ability to distinguish between wet and dry regions and genuine predictive skill. Results for precipitation exceeding the 99th percentile of the local climatology are also discussed as a final example.

b. Verification experiments results
Each of the following figures (Figs. 3-7) comprises three plots: a performance diagram, a ROC plot, and a potential economic value plot (with a log scale on the x axis). AUC estimates are indicated for both the trapezoidal approach (T-AUC) and the binormal method (Z-AUC). For each figure, two different sets of forecasts are compared. An asterisk indicates which of the two sets of forecasts (the red or the blue) has the best score in terms of T-AUC and Z-AUC. The relative superiority of the best forecast is indicated in percent. For all plots (except for experiment IV in Fig. 6), the results in red correspond to the raw ensemble-derived probability forecasts at day 5.

1) EXPERIMENT I: LEAD TIME
The first experiment compares the performance of forecasts at two different lead times. This experiment illustrates how different verification tools are impacted by a genuine change in forecast predictive skill. Figure 3 illustrates the visual impact one can expect to see in such a situation where a difference in verification results can be explained only by a difference in predictive skill.
In Fig. 3a, the performance diagram shows the blue points (day 1 results) lie closer to the top right corner than the red ones (day 5 results). In Fig. 3b, on the ROC plot, the blue points are distinct from the red ones and are closer to the top left corner. In Fig. 3c, on the PEV plot, the blue curve lies above the red one, except for cost-loss ratios close to 1 for which both short and medium range forecasts have no value. This first experiment provides typical results expected from an increase in forecast predictive skill and, as such, serves as a benchmark for the following experiments. In terms of summary metrics, the relative skill improvement between day 5 and day 1 is of the same order of magnitude for T-AUC and Z-AUC, with 4.8% and 4.2% measured improvement, respectively.
2) EXPERIMENT II: THE IMPACT OF A POSTPROCESSING STEP
In this experiment, ensemble postprocessing is applied in order to account for the scale mismatch between forecasts and observations. The method can be applied in a verification context to account for observation uncertainty, but also in a postprocessing context to provide a forecast valid for all points within a model grid box. The postprocessed forecast is derived by adding a random perturbation independently to each member. The random perturbation is drawn from a distribution described by the value of the forecast member (i.e., the predicted gridscale precipitation) and with fixed parameters for all forecasts (as in Ben Bouallègue et al. 2020). In other words, a constant piece of information is added in the process: the expected subgrid scale uncertainty expressed as a function of the gridscale precipitation value. The main impact on the forecast is a significant increase of the ensemble spread with, in particular, larger distributional tails of the postprocessed ensemble distribution compared with the original one while the ensemble size remains unchanged.
From Fig. 4 onward, both blue and red curves are valid for forecasts at day 5. All three plots in Fig. 4 show that the blue (postprocessed ensemble results) and red points (original ensemble results) are overlapping with the exception of a couple of points in each plot. In the case of the performance diagram and ROC curve, blue and red points belong to the same underlying curves. In the PEV plot, we see larger value for users with a cost-loss ratio smaller than or equal to 2%. In terms of Z-AUC, the results are identical before and after postprocessing up to the second decimal, indicating that the underlying forecast information content is not altered by the process. The information has increased in the sense that information about the subgrid uncertainty has been added to the forecast, but this bit of information is not different from one forecast to another. The improvement by postprocessing as measured by T-AUC is 4.5%, close to the improvement in discrimination ability measured between day 5 and day 1 (see Fig. 3b). While the latter is attributed to an improved predictability at shorter lead times, the former, the T-AUC improvement with postprocessing, is attributed to a change in the frequency of forecast events: postprocessed ensemble members exceed the event threshold more often due to the larger spread (as in the example in Fig. 2c).

3) EXPERIMENT III: THE IMPACT OF DISCRETIZATION
The postprocessing technique used in experiment II is based on a parametric approach. A random perturbation drawn from a parametric distribution is added to each member. The random draw is made from a distribution for which the parameters are known. Now, in experiment III, rather than a single perturbation for each member, we consider two random draws for each of the 50 raw ensemble members. So, the resulting postprocessed ensemble has a size of 100. Let us recall that the ensemble perturbations are based on random draws from a distribution whose form depends only on the forecast value itself. The model for the precipitation subgrid variability is a new but constant piece of information added to the forecast. Drawing additional random numbers from a parametric distribution better captures the form of the distribution but does not change the distribution itself or the quality of the underlying model. Figure 5 is similar to Fig. 4. The major difference between the two figures is a single additional point on each plot: on the ROC plot, for example, one point on the blue ROC curve lies closer to the top right corner when the forecast discretization is increased. In terms of T-AUC, the change in the number of probability thresholds from 51 to 101 leads to a jump from 4.5% to 7.9% of improvement with respect to the original forecast. The larger the number of members describing the underlying forecast distribution, the better the decision variable defined as the number of members exceeding a threshold. In terms of Z-AUC, the doubling of the number of categories as part of the postprocessing has no impact on performance. The underlying forecast distribution is the same: no additional information is provided that improves the forecast discrimination ability. With this experiment, we see that the choice regarding the categorization of the probability forecast considerably influences the T-AUC results but not the Z-AUC ones.
FIG. 3. Results of experiment I, the benchmark experiment comparing forecast performance at day 1 (blue) and at day 5 (red). On the performance diagram, dashed and solid lines refer to the forecast bias and critical success index, respectively.

4) EXPERIMENT IV: ENSEMBLE SIZE EFFECT
Here we again exploit the parametric nature of the postprocessing approach we have followed. The goal here is to assess the ensemble-size effect on the forecast discrimination independently from the forecast discretization effect. This experiment is designed to compare probability forecasts with the same discretization (the same decision thresholds) but derived from raw ensembles with different sizes. So, we distinguish the raw ensemble size (the source of the forecast information) and the postprocessed ensemble size, which is an arbitrary choice in our setting. In this experiment, we compare raw ensembles of size 10 and 50, both postprocessed in order to get 50 postprocessed "members" (drawing 5 and 1 random perturbations for each member, respectively). Figure 6 shows the positive impact of increasing the raw ensemble size from 10 to 50 members. Probabilities derived from the 50-member raw ensemble (with 50 postprocessed members in both cases) exhibit better performance on the three plots. In particular, as expected, users with smaller cost-loss ratios benefit most from the ensemble size increase. The discrimination ability as measured by T-AUC and Z-AUC shows an increase of 7.4% and 1.5%, respectively. These values can be compared with a 7.3% theoretical improvement of a proper score, the continuous ranked probability score, which encompasses both discrimination and calibration contributions.

5) EXPERIMENT V: LOWEST PROBABILITY CATEGORY
In many cases, none of the ensemble members forecast a rare event, i.e., precipitation exceeding 50 mm (24 h)⁻¹. The raw probability forecast is 0, but it is interpreted as an imprecise probability on an interval close to 0. Using a secondary decision variable, it is possible to distinguish between lower and higher chances within the pool of "0 probability" forecasts. In this experiment, we consider a forecaster who has access to the ensemble mean in addition to the probability forecast itself. In the situation where a forecast probability of 0 is issued, the forecaster infers that the chance of occurrence of an event is higher when the ensemble mean is higher. Quantitatively, within an interval close to 0, a larger probability is assigned to a forecast with a larger ensemble mean (see methodology in section 2). The verification results obtained with this ensemble interpretation are shown in Fig. 7 and compared with results obtained when using only the ensemble mean as a decision variable in Fig. 8. Recall that we consider here the ensemble mean EM as a continuous variable, not drawing information from a binary forecast derived as EM being greater than the event threshold. Figures 7a and 7b show the continuity between the results based on the original data (red points) and the ones when subdividing the lowest probability category with the help of the ensemble mean (blue points). The so-called IP/EM-AUC corresponds to T-AUC computed using the blue points. In Fig. 7c, PEV results diverge only for cost-loss ratio values smaller than 2%. These results share similarities with the ones obtained with postprocessing in Fig. 4. When using the ensemble mean as a secondary decision variable, we obtain T-AUC and Z-AUC results that are (almost) identical, as a "full" ROC curve is now available based on the enhanced interpretation of the ensemble forecast.
Interestingly, T-AUC with the lowest category subdivision is very close to the Z-AUC estimate for the original data. In other words, similar results can be obtained by extrapolation using Z-AUC or with T-AUC applied to an appropriate decision variable for rare events. On the ROC plots in Figs. 7 and 8, the blue points follow the red line derived with the binormal approximation: in one case they lie above it and in the other below. The results with the two approaches are consistent but subject to different levels of uncertainty, as discussed in section 2c.
The ensemble interpretation in two steps with the help of a secondary decision variable refines the imprecise probabilities close to 0%. The ensemble mean forecast serves as a secondary decision variable in our example. This approach is different from using the ensemble mean as a unique decision variable, as illustrated in Fig. 8. Results here are produced using the 99% quantile of the local climatology as an event threshold. This choice better highlights the major benefit of integrating the ensemble distribution into the decision variable rather than using the ensemble mean only. Indeed, not only the ROC plot but also the performance diagram and potential economic value plot show the overall poor results of the ensemble mean (in cyan) compared with the probability forecasts (in red and blue).
Focusing on the T-AUC results in Fig. 8b, we see that the ensemble mean EM appears as a better decision variable than the original ensemble interpretation as measured with the trapezoidal approximation. However, EM as a decision variable does not seem to perform better anywhere on the performance diagram (see Fig. 8a) or to benefit any user except possibly with very low cost-loss ratios (see Fig. 8c). This result illustrates how misleading conclusions can be drawn when comparing T-AUC results from different "sources." The ensemble size has an impact on both the raw probability and the ensemble mean estimates. The impact of the ensemble size on the discrimination ability as estimated when combining information from both ensemble aspects is not explored here.
c. Impact of the event base rate
As a final investigation, we analyze AUC estimations and corresponding uncertainty as a function of the event base rate. We consider three event thresholds: 20, 40, and 50 mm (24 h)⁻¹, and build verification datasets for four different geographical domains (global, Northern Hemisphere, Southern Hemisphere, Europe) and four different seasons (autumn 2019, winter 2019, spring 2020, and summer 2020). The event base rate is different for each domain and threshold combination. The score uncertainty associated with each AUC estimation is derived as the inter-quantile range (5%-95%) of the empirical bootstrap distribution of the score. Results are shown in Fig. 9 for T-AUC, Z-AUC, and IP/EM-AUC, the trapezoidal approximation when interpreting the ensemble in terms of imprecise probabilities and using the ensemble mean as secondary decision variable.
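A bootstrap estimate of the 5%-95% inter-quantile range can be sketched as below. The resampling unit (individual forecast/observation pairs, drawn with replacement), the number of bootstrap samples, and the function names are assumptions made for illustration; the setup used for Fig. 9 may differ, for example by resampling blocks of dates. Any scoring function taking forecasts and observations can be passed in; a compact 2AFC-based AUC is included to keep the sketch self-contained.

```python
import random


def auc_2afc(forecasts, observations):
    """AUC as the mean 2AFC score over all event/nonevent pairs."""
    x1s = [f for f, o in zip(forecasts, observations) if o]
    x0s = [f for f, o in zip(forecasts, observations) if not o]
    total = sum(
        1.0 if x1 > x0 else 0.5 if x1 == x0 else 0.0
        for x1 in x1s
        for x0 in x0s
    )
    return total / (len(x1s) * len(x0s))


def bootstrap_range(forecasts, observations, score=auc_2afc, n_boot=1000, seed=0):
    """Return the (5%, 95%) quantiles of the bootstrap distribution of a score."""
    n = len(forecasts)
    n_event = sum(1 for o in observations if o)
    if n_event == 0 or n_event == n:
        raise ValueError("the sample must contain events and nonevents")
    rng = random.Random(seed)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        obs = [observations[i] for i in idx]
        # Resamples without any event (or any nonevent) are discarded,
        # because the score is undefined for them.
        if 0 < sum(1 for o in obs if o) < n:
            values.append(score([forecasts[i] for i in idx], obs))
    values.sort()
    lower = values[int(0.05 * (n_boot - 1))]
    upper = values[int(0.95 * (n_boot - 1))]
    return lower, upper
```

For rare events, each resample contains only a handful of events, so the set of populated ROC points changes from one resample to the next; this is one way to picture why the range widens as the base rate decreases.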
In Fig. 9a, we observe a slight increase of the discrimination ability as a function of the rarity of the event, with the Z-AUC and IP/EM-AUC approaches displaying consistent estimates with respect to each other. However, a drop in discrimination is measured with T-AUC for event base rates smaller than 3%. When applied to ensemble probability forecasts, T-AUC appears base rate dependent. The drop is the result of the application of the trapezoidal approximation to raw probabilities derived from a finite-size ensemble: the AUC computation is confined to a smaller part of the full ROC curve as the event's base rate decreases. The drop in discrimination is also an indicator of when the ensemble interpretation into a decision variable by simply counting the number of members exceeding a threshold is no longer appropriate. Distinguishing between low probabilities of occurrence would require more ensemble members, some sort of statistical postprocessing, or simply a categorization of low probability forecasts based on an additional ensemble summary statistic such as the ensemble mean.
In Fig. 9b, larger score uncertainty is seen for more extreme events. The increase in uncertainty is visible for all AUC methods, but T-AUC and Z-AUC estimates are more impacted than IP/EM-AUC ones. The binormal model exhibits the largest level of uncertainty for rare events because the original points are too close to get good estimates of the slope in the Z-transformed space, but the trapezoidal approximation is also subject to large variations. In practice, for event base rates smaller than 1%, the level of uncertainty appears too large to draw any useful conclusion with T-AUC or Z-AUC applied to raw probability forecasts, while the results for the enhanced probability using the ensemble mean as secondary decision variable appear more robust and reliable.
FIG. 8. Results of experiment V as in Fig. 7, but using a threshold defined as the 99% quantile of the local climatology. In addition to the results for raw probability (red) and when subdividing the lowest probability category with the ensemble mean (blue), we show results when using only the ensemble mean as a decision variable (cyan).

Recommendations
T-AUC and Z-AUC provide complementary information about the discrimination ability of a forecasting system. It is important to understand the differences between them and to use each appropriately depending on the question under investigation. Z-AUC measures the inherent discrimination ability of a forecasting system, while T-AUC measures how well this is achieved by a given implementation. Computing IP/EM-AUC offers a practical way to bridge the two approaches.
Awareness of AUC properties is key in situations that combine 1) the assessment of probability forecasts derived from an ensemble and 2) a focus on rare events. Differences between T-AUC and Z-AUC are largest in such situations, and it is especially important to interpret the results carefully. The recommendations below target this specific type of situation. In other cases (for example, when focusing on more frequent events or when assessing probability forecasts derived from a parametric distribution), the different approaches for computing AUC converge to identical results, as illustrated for instance in Fig. 1a. T-AUC is not the only verification metric whose interpretation can be altered when focusing on rare events. Forecast verification of rare events has attracted increasing attention over the last decade, and the interested reader is referred to Ben Bouallègue et al. (2019) and references therein.
T-AUC, the AUC estimate obtained with the trapezoidal approximation, is the traditional way to measure forecast discrimination. As illustrated, differences in T-AUC can be attributed to multiple sources, for example the level of forecast discretization, the presence of ensemble forecast biases, or the way probabilities are derived from the ensemble forecast. So, when using the empirical ROC and summarizing performance using the T-AUC, it is important to consider the following:
• Comparing T-AUC results should be done carefully. For example, a positive difference in T-AUC should be scrutinized before being interpreted as an improvement in intrinsic discrimination ability (or as an improvement of the forecast performance in a broader sense).
• T-AUC is a measure of the performance of the forecasts using a particular discretization of a chosen decision variable. Therefore, both the discretization and the decision variable should be clearly described.
• The maximum available discretization should be used (e.g., each member rather than fixed percentage bins), to ensure the ROC is as complete as possible.
• A comparison of the practical benefits of two competing forecast configurations should follow the above approach, describing the decision variable and discretization used for each configuration. However, this approach should not be used to draw conclusions about the underlying discrimination ability of the different systems.
• A fair comparison of the underlying discrimination ability of different systems would rely on using the same decision variable and the same discretization. A sanity check that there are no significant differences in the underlying forecast bias is also important.
• Comparing competing forecasts with T-AUC can lead to misleading conclusions if the above is not accounted for.
• To evaluate the impact of changing the discretization, T-AUC should be used (Z-AUC will not be sensitive to this).
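The sensitivity of T-AUC to discretization noted in the list above can be illustrated with a minimal sketch. The function names (`roc_points`, `trapezoidal_auc`) and the small sample are hypothetical and chosen for illustration only; they are not the data or code used in the experiments.

```python
def roc_points(decision, obs):
    """Empirical ROC points (FAR, HR), one per distinct decision threshold.

    decision: decision variable per forecast (e.g., number of members
              exceeding the event threshold).
    obs:      1 if the event was observed, 0 otherwise.
    """
    n_event = sum(obs)
    n_nonevent = len(obs) - n_event
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one nonevent")
    points = [(0.0, 0.0)]
    # Issue a warning when decision >= t, sweeping t from high to low.
    for t in sorted(set(decision), reverse=True):
        hits = sum(1 for d, o in zip(decision, obs) if d >= t and o == 1)
        false_alarms = sum(1 for d, o in zip(decision, obs) if d >= t and o == 0)
        points.append((false_alarms / n_nonevent, hits / n_event))
    return points  # the lowest threshold yields the point (1, 1)


def trapezoidal_auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical sample: number of members (out of 3) exceeding the threshold,
# and whether the event was observed (1) or not (0).
member_count = [0, 0, 0, 0, 1, 1, 2, 3]
observed = [0, 0, 0, 1, 0, 1, 1, 1]

# Maximum available discretization: one category per member count.
fine = trapezoidal_auc(roc_points(member_count, observed))

# Coarser discretization: "at least two members" versus "fewer than two".
coarse_count = [1 if c >= 2 else 0 for c in member_count]
coarse = trapezoidal_auc(roc_points(coarse_count, observed))

print(fine, coarse)
```

For this sample, the same forecasts yield a T-AUC of 0.8125 with one category per member count and 0.75 with the binary categorization, although the underlying ensemble is unchanged.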
Z-AUC, the AUC estimate obtained with a binormal model, is an alternative to the traditional trapezoidal approach. Z-AUC is a measure of the "potential" discrimination ability of a system, in the sense that the extrapolation of the performance with the curve fitting has no practical meaning in terms of forecast usefulness. The binormal model is based on the assumption that HR and FAR behave as if derived from normally distributed variables. This assumption is not tested in our experiments, but our examples show good fits between model and data. In summary, we state the following conclusions:
• Results with Z-AUC are not sensitive to forecast discretization or to simple ensemble postprocessing, in contrast to the T-AUC results.
• Z-AUC should be used to compare the potential discrimination ability of different forecasting systems. It does not indicate how this can be realized, but it does show which system has the better underlying performance.
• Z-AUC is useful to compare, for example, ensemble forecasts with different numbers of members: it gives a better indication of the skill that could be achieved if sufficient discretization is available.
• Z-AUC cannot distinguish whether a chosen discretization is sufficient or can be improved; instead it shows what would be achieved if a sufficient discretization was available.
• Computing both T-AUC and Z-AUC allows a useful comparison of potential and actual discrimination ability.
• If T-AUC and Z-AUC are similar, the ensemble interpretation allows a discretization of the derived decision variable that is sufficient, and there is not much to be gained from postprocessing to generate a better interpretation or a finer discretization.
• If T-AUC is lower than Z-AUC, there is the potential to improve the forecast performance by a better ensemble interpretation. This will be most beneficial to users with low C/L and is most likely to happen for rare events (especially where low probability categories are not well resolved in the forecast).
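As an illustration of the comparison between potential and actual discrimination, the following sketch fits a straight line to the ROC points in the Z-transformed space and uses the standard binormal result AUC = Φ(a / sqrt(1 + b²)), where a and b are the intercept and slope of the fitted line. The use of ordinary least squares, the function names, and the synthetic points are assumptions made for illustration, not a description of the fitting procedure used in the experiments.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sigma 1)


def binormal_auc(points):
    """Z-AUC from (FAR, HR) points via a least-squares line in z-space."""
    # The z-transform is infinite at 0 and 1, so keep interior points only.
    z = [(STD_NORMAL.inv_cdf(far), STD_NORMAL.inv_cdf(hr))
         for far, hr in points if 0.0 < far < 1.0 and 0.0 < hr < 1.0]
    if len(z) < 2:
        raise ValueError("need at least two interior ROC points")
    n = len(z)
    mean_x = sum(x for x, _ in z) / n
    mean_y = sum(y for _, y in z) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in z)
    if sxx == 0.0:
        raise ValueError("ROC points share the same FAR; slope undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in z)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return STD_NORMAL.cdf(intercept / sqrt(1.0 + slope ** 2))


def trapezoidal_auc(points):
    """T-AUC: area under the piecewise-linear curve through the points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Synthetic rare-event case (assumed values): the available ROC points
# are confined to very small false alarm rates.
a_true, b_true = 1.5, 1.0
fars = [0.001, 0.005, 0.02]
interior = [(f, STD_NORMAL.cdf(a_true + b_true * STD_NORMAL.inv_cdf(f))) for f in fars]

z_auc = binormal_auc(interior)
t_auc = trapezoidal_auc([(0.0, 0.0)] + interior + [(1.0, 1.0)])
true_auc = STD_NORMAL.cdf(a_true / sqrt(1.0 + b_true ** 2))

print(z_auc, t_auc, true_auc)
```

In this synthetic case the points lie exactly on a binormal curve, so the fit recovers the underlying area, while the trapezoidal estimate is lower because a single straight segment spans most of the false alarm rate axis.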
We have demonstrated two methods to improve the ensemble interpretation and thus increase the forecast discretization in such situations. Postprocessing to account for subgrid-scale variability introduces a continuous distribution and allows arbitrarily fine discretization. The choice of discretization should be sufficient to generate the full ROC (as well as plotting the ROC, a simple way to check this is to compare the T-AUC and Z-AUC).
IP/EM-AUC refers to the AUC estimated with a new approach that subdivides the lowest probability category by ranking the forecasts in this category according to an ensemble summary statistic, here the ensemble mean. We showed that this approach can provide sufficient discretization to generate a full ROC curve based directly on the ensemble forecasts. The score estimates obtained with this method have been shown to be more robust than those obtained with the binormal model. They also represent the real skill of the forecast system, since users can act on each of the discretization categories, rather than the potential skill shown by the extrapolation using a statistical model. Using this method, we have shown that while Z-AUC is strictly only a measure of potential discrimination skill, it may actually be straightforward to achieve in practice.
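The subdivision of the lowest probability category can be sketched as follows. The composite sort key, the choice of treating each distinct ensemble mean as its own subcategory, the function names, and the six forecasts are all assumptions made for illustration. The area is computed as the fraction of (event, nonevent) pairs ranked correctly, with ties counted as one half, which equals the trapezoidal area under the empirical ROC.

```python
def auc_from_keys(keys, obs):
    """Trapezoidal AUC of the empirical ROC, computed pairwise.

    keys: comparable decision values, larger meaning "event more likely".
    obs:  1 if the event was observed, 0 otherwise.
    """
    events = [k for k, o in zip(keys, obs) if o == 1]
    nonevents = [k for k, o in zip(keys, obs) if o == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one nonevent")
    score = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5  # tied forecasts count as half
    return score / (len(events) * len(nonevents))


def raw_key(members, threshold):
    """Raw ensemble probability: number of members exceeding the threshold."""
    return (sum(1 for m in members if m > threshold),)


def ip_em_key(members, threshold):
    """Member count, with the zero-count category ranked by ensemble mean."""
    count = sum(1 for m in members if m > threshold)
    mean = sum(members) / len(members)
    # Only the lowest category is subdivided; the others are left unchanged.
    return (count, mean if count == 0 else 0.0)


# Hypothetical 3-member ensemble forecasts and outcomes, threshold of 10.
threshold = 10.0
ensembles = [[1, 2, 3], [2, 3, 4], [7, 8, 9], [5, 6, 7], [8, 9, 12], [3, 4, 11]]
observed = [0, 0, 1, 0, 1, 0]

raw_auc = auc_from_keys([raw_key(e, threshold) for e in ensembles], observed)
ip_em_auc = auc_from_keys([ip_em_key(e, threshold) for e in ensembles], observed)

print(raw_auc, ip_em_auc)
```

In this toy sample the event missed by all members has the highest ensemble mean within the zero-count category, so ranking that category by the ensemble mean raises the area from 0.625 to 0.8125.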
Other implications of this interpretation in the context of ensemble forecast verification and postprocessing are topics for future research.