The potential biases of robust location estimators in estimating the population mean have been noticed for more than two centuries [1], with numerous significant attempts made to address them. In calculating a robust estimator, the procedure of identifying and downweighting extreme values inherently necessitates the formulation of distributional assumptions. Previously, it was demonstrated that, due to the presence of infinite-dimensional nuisance shape parameters, the semiparametric approach struggles to consistently address distributions with shapes more intricate than symmetry. Newcomb (1886) provided the first modern approach to robust parametric estimation by developing a class of estimators that gives "less weight to the more discordant observations" [2]. In 1964, Huber [3] used the minimax procedure to obtain the
M-estimator for the contaminated normal distribution, which has played a pre-eminent role in the later development of robust statistics. However, as previously demonstrated, under growing asymmetric departures from normality, the bias of the Huber M-estimator (HM) increases rapidly. This is a common issue in parametric robust statistics. For example, He and Fung (1999) [4] constructed a robust M-estimator (HFM) for the two-parameter Weibull distribution, from which the mean and central moments can be calculated. Nonetheless, it is inadequate for other parametric distributions, e.g., the gamma, Pareto, lognormal, and generalized Gaussian distributions (SI Dataset S1). Another interesting approach is based on
L-estimators, such as percentile estimators. For examples of percentile estimators for the Weibull distribution, the reader is referred to the works of Menon (1963) [5], Dubey (1967) [6], Marks (2005) [7], and Boudt, Caliskan, and Croux (2011) [8]. At the outset of the study of percentile estimators, it was known that they arithmetically utilize the invariant structures of parametric distributions [5, 6]. An estimator is classified as an
I-statistic if it asymptotically satisfies
for the distribution for which it is consistent, where LEs are calculated with the use of
-statistics (defined in REDS II), I is defined using arithmetic operations and constants but may also incorporate transcendental functions and quantile functions, and
s are the population parameters it estimates. In this article, two subclasses of
I-statistics are introduced, recombined
I-statistics and quantile
I-statistics. Based on
-statistics,
I-statistics are naturally robust. Compared to probability density functions (pdfs) and cumulative distribution functions (cdfs), the quantile functions of many parametric distributions are more elegant. Since the expectation of an
L-estimator can be expressed as an integral of the quantile function,
I-statistics are often analytically obtainable. However, it is observed that even when the sample follows a gamma distribution, which belongs to the same larger family as the Weibull model (the generalized gamma distribution), a misassumption can still lead to substantial biases in the Marks percentile estimator (MP) for the Weibull distribution [7] (SI Dataset S1).
Invariant Moments
Most popular robust location estimators, such as the symmetric trimmed mean, symmetric Winsorized mean, Hodges-Lehmann estimator, Huber
M-estimator, and median of means, are symmetric. As shown in REDS I, a symmetric weighted Hodges-Lehmann mean (
) can achieve consistency for the population mean in any symmetric distribution with a finite mean. However, it falls considerably short of consistently handling other parametric distributions that are not symmetric. Shifting from semiparametrics to parametrics, if a robust estimator with a non-sample-dependent breakdown point (defined in Subsection ) is consistent simultaneously for both a semiparametric distribution and a parametric distribution that does not belong to that semiparametric distribution, it is named with the prefix ‘invariant’ followed by the name of the population parameter with which it is consistent. Here, the recombined
I-statistic is defined as
where
d is the key factor for bias correction,
is the
-statistic,
is the degree of the
U-statistic,
k is the degree of the
-statistic,
is the upper asymptotic breakdown point of the
-statistic. It is assumed in this series that, in the subscript of an estimator, if
,
k, and
are omitted,
,
,
are assumed; if just one
is indicated,
; if just one
is indicated,
; if
n is omitted, only the asymptotic behavior is considered; in the absence of subscripts, no assumptions are made. The subsequent theorem shows the significance of a recombined
I-statistic.
Theorem 1. Define the recombined mean as . Assuming finite means, is a consistent mean estimator for a location-scale distribution, where μ, , and are different location parameters from that location-scale distribution. If , , is also consistent for any symmetric distribution.
Proof. Finding the d that makes a consistent mean estimator is equivalent to finding the solution of . First consider the location-scale distribution. Since , . In REDS I, it was established that any can be expressed as for a location-scale distribution parameterized by a location parameter and a scale parameter , where is a function of , the quantile function of a standard distribution without any shifts or scaling, according to the definition of the weighted L-statistic. The simultaneous cancellation of and in ensures that the d in is always a constant for a location-scale distribution. The proof of the second assertion follows directly from the coincidence property. According to Theorem 17 in REDS I, for any symmetric distribution with a finite mean, . Then This completes the proof. □
For example, the Pareto distribution has a quantile function , where is the minimum possible value that a random variable following the Pareto distribution can take, serving as a scale parameter, and is a shape parameter. The mean of the Pareto distribution is given by . As can be expressed as a function of , one can set the two s in the d value of as two arbitrary quantiles and . For the Pareto distribution, . can be canceled out. Intriguingly, the quantile function of the exponential distribution is , . . Then, . Since , approaches , as , regardless of the type of weighted L-statistic used. That means that, for the Weibull, gamma, Pareto, lognormal, and generalized Gaussian distributions, is consistent for at least one particular case, where , , and are different location parameters from an exponential distribution. Let , , then , , ; the detailed formula is given in the SI Text. So, . The biases of for distributions with skewness between those of the exponential and symmetric distributions are tiny (SI Dataset S1). exhibits excellent performance for all these common unimodal distributions (SI Dataset S1).
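The calibration idea behind d can be sketched numerically. The following is a minimal illustration, not the exact estimator of this series: it substitutes two symmetric trimmed means for the weighted L-statistics of REDS I, calibrates d on the standard exponential distribution, and then applies the recombined estimate to an exponential sample with a different scale, where the same d still corrects the bias.

```python
import numpy as np

def trimmed_mean(x, eps):
    """Symmetric trimmed mean discarding a fraction eps from each tail."""
    xs = np.sort(np.asarray(x, dtype=float))
    g = int(np.floor(eps * len(xs)))
    return xs[g:len(xs) - g].mean()

def pop_trimmed_mean_exp(eps, grid=200_001):
    """Population trimmed mean of Exp(1), integrating its quantile function
    Q(p) = -ln(1 - p) over [eps, 1 - eps]."""
    p = np.linspace(eps, 1 - eps, grid)
    return np.trapz(-np.log1p(-p), p) / (1 - 2 * eps)

# Calibrate d on the standard exponential so that t1 + d * (t1 - t2) = mean = 1.
eps1, eps2 = 0.1, 0.25
t1, t2 = pop_trimmed_mean_exp(eps1), pop_trimmed_mean_exp(eps2)
d = (1.0 - t1) / (t1 - t2)

# The same d then corrects the bias for any scale parameter of the family.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200_000)  # true mean = 2
m1, m2 = trimmed_mean(x, eps1), trimmed_mean(x, eps2)
rm = m1 + d * (m1 - m2)  # recombined-mean-style estimate
print(d, rm)
```

Because d is determined by population quantities of the standard distribution only, it is scale-free, which is exactly the cancellation used in the proof of Theorem 1.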
The recombined mean is a recombined I-statistic. Consider an I-statistic whose LEs are percentiles of a distribution obtained by plugging -statistics into a cumulative distribution function, and where I is defined with arithmetic operations, constants, and quantile functions; such an estimator is classified as a quantile I-statistic. One version of the quantile I-statistic can be defined as where is , is the empirical cumulative distribution function of the kernel distribution, and is the quantile function of the kernel distribution.
Similarly, the quantile mean can be defined as
. Moreover, in extremely right-skewed heavy-tailed distributions, if the calculated percentile exceeds
, it will be adjusted to
. In a left-skewed distribution, if the obtained percentile is smaller than
, it will also be adjusted to
. Without loss of generality, in the following discussion, only the case where
is considered. The most popular method for computing the sample quantile function was proposed by Hyndman and Fan in 1996 [
9]. Another widely used method for calculating the sample quantile function involves employing linear interpolation of modes corresponding to the order statistics of the uniform distribution on the interval [0, 1], i.e.,
,
To minimize the finite-sample bias, the inverse function of
is deduced as
, based on Hyndman and Fan’s definition, or
, based on the latter definition, where
,
is the indicator of event
A.
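The mode-based definition and its inverse can be sketched as follows. The plotting positions (i-1)/(n-1) below are the modes of the uniform order statistics, which coincide with numpy's default interpolation method; the inverse simply swaps the roles of the positions and the order statistics, and the indicator-function boundary corrections described above are omitted in this sketch.

```python
import numpy as np

def sample_quantile(x, p):
    """Q-hat(p): linear interpolation of the order statistics at the plotting
    positions (i - 1) / (n - 1), the modes of the uniform order statistics."""
    xs = np.sort(np.asarray(x, dtype=float))
    pos = np.arange(len(xs)) / (len(xs) - 1)
    return float(np.interp(p, pos, xs))

def sample_quantile_inverse(x, q):
    """F-hat(q): inverse of sample_quantile, obtained by swapping the roles of
    the plotting positions and the order statistics (boundary and tie
    corrections are omitted in this sketch)."""
    xs = np.sort(np.asarray(x, dtype=float))
    pos = np.arange(len(xs)) / (len(xs) - 1)
    return float(np.interp(q, xs, pos))

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, size=1000)
q = sample_quantile(x, 0.3)
roundtrip = sample_quantile_inverse(x, q)  # recovers 0.3 up to float error
print(q, roundtrip)
```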
The quantile mean exploits location-scale invariance in a different way, as shown in the subsequent proof.
Theorem 2. is a consistent mean estimator for a location-scale distribution provided that the means are finite and , and are all within the range of , where μ and are location parameters from that location-scale distribution. If , is also consistent for any symmetric distribution.
Proof. When , the solution of is . The d value for the case where is the same. The definitions of the location and scale parameters are such that they must satisfy , then . It follows that the percentile of any weighted L-statistic is free of and for a location-scale distribution. Therefore, the d in is also invariably a constant. For the symmetric case, is valid for any symmetric distribution with a finite second moment, as the same values correspond to the same percentiles. Then, To avoid inconsistency due to post-adjustment, , and must reside within the range of . All results are now proven. □
The cdf of the Pareto distribution is . So, set the d value in with two arbitrary percentiles and , . The d value in for the exponential distribution is always identical to as , since and the cdf of the exponential distribution is , then . So, for the Weibull, gamma, Pareto, lognormal, and generalized Gaussian distributions, is also consistent for at least one particular case, provided that and are different location parameters from an exponential distribution and , and are all within the range of . Also let and , then , and are all within the range of . performs better in fat-tailed scenarios (SI Dataset S1). Theorems 1 and 2 show that and are both consistent mean estimators for any symmetric distribution and the exponential distribution with finite second moments. It is obvious that the asymptotic breakdown points of and are both . Therefore, they are both invariant means.
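The scale invariance underlying Theorem 2 is easy to verify numerically. In this sketch, the trimmed mean stands in for the weighted L-statistic (an assumption for illustration): the percentile at which it falls under the exponential distribution is the same for every scale parameter.

```python
import math

def pop_trimmed_mean_exp(scale, eps):
    """Population trimmed mean of Exp(scale): Q(p) = -scale * ln(1 - p),
    integrated in closed form over [eps, 1 - eps]."""
    F = lambda p: (1 - p) * math.log(1 - p) - (1 - p)  # antiderivative of -ln(1 - p)
    return scale * (F(1 - eps) - F(eps)) / (1 - 2 * eps)

def exp_cdf(x, scale):
    return 1 - math.exp(-x / scale)

# The percentile of the trimmed mean is free of the scale parameter.
percentiles = [exp_cdf(pop_trimmed_mean_exp(s, 0.1), s) for s in (0.5, 1.0, 7.0)]
print(percentiles)
```

All three percentiles agree, which is why the d value of the quantile mean, defined through such percentiles, is invariably a constant for a location-scale family.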
To study the impact of the choice of WLs in
and
, it is constructive to recall that a weighted
L-statistic is a combination of order statistics. While using a less-biased weighted
L-statistic can generally enhance performance (SI Dataset S1), there is a greater risk of violation in the semiparametric framework. However, the mean-
-
-median inequality is robust to slight fluctuations of the QA function of the underlying distribution. Suppose that, for a right-skewed distribution, the QA function is generally decreasing with respect to
in
, but increasing in
; since all quantile averages with breakdown points from
to
will be included in the computation of
, as long as
, and the other portions of the QA function satisfy the inequality constraints that define the
th
-orderliness on which the
is based, if
, the mean-
-
-median inequality still holds. This is because the violation of the
th
-orderliness is bounded when
, as shown in REDS I, and therefore cannot be extreme for unimodal distributions with finite second moments. For instance, the SQA function of the Weibull distribution is non-monotonic with respect to
when the shape parameter
, as shown in the SI Text of REDS I, and the violation of the second and third orderliness starts near this parameter as well; yet the mean-
-median inequality remains valid when
. Another key factor in determining the risk of violation of orderliness is the skewness of the distribution. In REDS I, it was demonstrated that in a family of distributions differing by a skewness-increasing transformation in van Zwet’s sense, the violation of orderliness, if it happens, only occurs as the distribution nears symmetry [
10]. When
, the over-corrections in
and
are dependent on the
-median difference, which can be a reasonable measure of skewness after standardization [
11,
12], implying that the over-correction is often tiny with moderate
d. This qualitative analysis suggests the general reliability of
and
based on the mean-
-
-median inequality, especially for unimodal distributions with finite second moments when
. Extending this rationale to other weighted
L-statistics is possible, since the
-
U-orderliness can also be bounded with certain assumptions, as discussed previously.
Consider two continuous distributions belonging to the same location–scale family; according to Theorem 3 in REDS II, their corresponding
th central moment kernel distributions differ only in scaling. Although strict complete
th orderliness is difficult to prove, following the same logic as discussed above, even if the inequality may be violated in a small range, the mean-
-median inequality remains valid, in most cases, for the central moment kernel distribution. Define the recombined
th central moment as
. Then, assuming finite
th central moment and applying the same logic as in Theorem 1,
is a consistent
th central moment estimator for a location-scale distribution, where
,
, and
are different
th central moment parameters from that location-scale distribution. Similarly, the quantile will not change after scaling. The quantile
th central moment is thus defined as
is also a consistent
th central moment estimator for a location-scale distribution provided that the
th central moment is finite and
,
and
are all within the range of
, where
and
are different
th central moment parameters from that location-scale distribution. According to Theorem 2 in REDS II, if the original distribution is unimodal, the central moment kernel distribution is always a heavy-tailed distribution, as the degree term amplifies its skewness and tailedness. From the better performance of the quantile mean in heavy-tailed distributions, the quantile
th central moments are generally better than the recombined
th central moments regarding asymptotic bias.
Finally, the recombined standardized
th moment is defined to be
The quantile standardized
th moment is defined similarly,
A Shape-Scale Distribution as the Consistent Distribution
In the previous section, parametric robust estimation was limited to a location-scale distribution, with the location parameter often being omitted for simplicity. For improved fit to observed skewness or kurtosis, shape-scale distributions with shape parameter () and scale parameter () are commonly utilized. The Weibull, gamma, Pareto, lognormal, and generalized Gaussian distributions (when is a constant) are all shape-scale unimodal distributions. Furthermore, if either the shape parameter or the skewness or kurtosis is constant, the shape-scale distribution is reduced to a location-scale distribution. Let denote the function to specify d values, where the first input is the absolute value of the skewness, the second input is the kurtosis, the third is the order of the central moment (if , the mean), the fourth is the type of estimator, the fifth is the type of consistent distribution, and the sixth input is the sample size. For simplicity, the last three inputs will be omitted in the following discussion. Bear in mind that since skewness and kurtosis are interrelated, specifying d values for a shape-scale distribution only requires either skewness or kurtosis, while the other may also be omitted. Since many common shape-scale distributions are always right-skewed (if not, only the right-skewed or left-skewed part is used for calibration, while the other part is omitted), the absolute value of the skewness should be the same as the skewness of these distributions. This setting also handles the left-skewed scenario well.
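A minimal sketch of such a d-value lookup, restricted to a single kurtosis input, is given below. The grid values are purely hypothetical; in this series, D is approximated through a Monte Carlo study of the consistent distribution and made continuous by linear interpolation.

```python
import numpy as np

# Hypothetical calibration grid: one (kurtosis, d) pair per shape parameter of
# the consistent distribution; the numbers below are placeholders only.
kurt_grid = np.array([2.7, 3.5, 5.0, 9.0])   # hypothetical kurtosis values
d_grid = np.array([0.30, 0.36, 0.41, 0.52])  # hypothetical matching d values

def D(kurtosis):
    """Linear interpolation keeps D continuous between grid points; inputs
    outside the tabulated domain are clamped to the boundary values."""
    return float(np.interp(kurtosis, kurt_grid, d_grid))

print(D(4.0))  # falls between the second and third grid points
```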
For recombined moments up to the fourth order, the objective of using a shape-scale distribution as the consistent distribution is to find solutions for the system of equations
, where
,
and
are the population second, third and fourth central moments.
and
should be the fixed points of the functions
and
. Clearly, this is an overdetermined nonlinear system of equations, given that the skewness and kurtosis are interrelated for a shape-scale distribution. Since an overdetermined system constructed with random coefficients is almost always inconsistent, it is natural to optimize them separately using the fixed-point iteration (see Algorithm 1, only
is provided; the others are the same).
Algorithm 1 for a shape-scale distribution
Require: D; ; ; ; ; ;
Ensure:
1:
2:   ▹ Using the maximum kurtosis available in D as an initial guess.
3: repeat
4:
5:
6:
7: until  or   ▹  is the maximum number of iterations,  is a small positive number.
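The loop of Algorithm 1 is an ordinary fixed-point iteration. The following generic sketch mirrors its structure, with a toy update function standing in for the composition of D with the kurtosis estimate (an assumption for illustration):

```python
import math

def fixed_point(g, x0, tol=1e-10, max_iter=100):
    """Generic fixed-point iteration mirroring the loop of Algorithm 1:
    iterate x <- g(x) until the change falls below tol or max_iter is hit,
    so the iteration terminates even when it stalls at a domain boundary."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x  # boundary / non-converged value after max_iter steps

# Toy contraction with a known attracting fixed point (the Dottie number).
root = fixed_point(math.cos, x0=1.0)
print(root)
```

The max_iter guard plays the same role as the maximum number of iterations in Algorithm 1: it returns the boundary value when no interior fixed point is reachable.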
The following theorem shows the validity of Algorithm 1.
Theorem 3. Assuming and s, where , are all equal to zero, and , defined as the largest attracting fixed points of the functions and , are consistent estimators of and for a shape-scale distribution whose th central moment kernel distributions are U-congruent, as long as they are within the domain of D, where and are the population skewness and kurtosis, respectively.
Proof. Without loss of generality, only is considered, while the logic for is the same. Additionally, the second central moments of the underlying sample distribution and consistent distribution are assumed to be 1, with other cases simply multiplying a constant factor according to Theorem 3 in REDS II. From the definition of D, , where the subscript D indicates that the estimates are from the central moment kernel distributions generated from the consistent distribution, while other estimates are from the underlying distribution of the sample.
Then, assuming the s are all equal to zero and , . Since are from the same fourth central moment kernel distribution as , according to the definition of U-congruence, an increase in will also result in an increase in . Combining with Theorem 3 in REDS II, is a measure of kurtosis that is invariant to location and scale, so . As a result, if there is at least one fixed point, let the largest one be , then it is attracting since for all , where is the maximum kurtosis available in D.
□
As a result of Theorem 3, assuming continuity,
s are all equal to zero, and
U-congruence of the central moment kernel distributions, Algorithm 1 is guaranteed to converge provided that a fixed point exists within the domain of
D. At this stage,
D can only be approximated through a Monte Carlo study. The continuity of
D can be ensured by using linear interpolation. One commonly encountered problem is that the domain of
D depends on both the consistent distribution and the Monte Carlo study, so the iteration may halt at the boundary if the fixed point is not within the domain. However, by setting a proper maximum number of iterations, the algorithm can return the optimal boundary value. For quantile moments, the logic is similar, provided the percentiles do not exceed the breakdown point. If they do, consistent estimation is impossible, and the algorithm will stop upon reaching the maximum number of iterations. The fixed-point iteration is, in principle, similar to the iterative reweighting in the Huber
M-estimator, but an advantage of this algorithm is that it is solely related to the inputs in Algorithm 1 and is independent of the sample size. Since they are consistent for a shape-scale distribution,
can specify
and
,
can specify
and
. Algorithm 1 enables the robust estimation of all four moments to reach a near-consistent level for common unimodal distributions (
Table 1, SI Dataset S1), using just the Weibull distribution as the consistent distribution.
Robustness
The measure of robustness to gross errors used in this series is the breakdown point proposed by Hampel [
34] in 1968. In REDS I, it was shown that the median of means (MoM) is asymptotically equivalent to the median Hodges-Lehmann mean; therefore it is also biased for any asymmetric distribution. However, since the concentration bound of MoM depends on
[
35], it is quite natural to deduce that it is a consistent robust estimator. The concept of a sample-dependent breakdown point is defined to avoid ambiguity.
Definition 2 (Sample-dependent breakdown point).
The breakdown point of an estimator is called sample-dependent if and only if the upper and lower asymptotic breakdown points, which are the upper and lower breakdown points when , are zero and the empirical influence function of is bounded. For a full formal definition of the empirical influence function, the reader is referred to the paper by Devlin, Gnanadesikan, and Kettenring (1975) [36].
Bear in mind that it differs from the "infinitesimal robustness" defined by Hampel, which is related to whether the asymptotic influence function is bounded [
37,
38,
39]. The proof of the consistency of MoM assumes that it is an estimator with a sample-dependent breakdown point since its breakdown point is
, where
b is the number of blocks, then
, if
b is a constant, and changing any one point of the sample cannot break down this estimator.
For the
-statistics, the asymptotic upper breakdown points are suggested by the following theorem, which extends the method in Donoho and Huber's (1983) proof of the breakdown point of the Hodges-Lehmann estimator [
40]. The proof is given in the SI Text.
Theorem 4. Consider a U-statistic associated with a symmetric kernel of degree . Then, assuming that, as , is a constant, the upper breakdown point of the -statistic is , where is the upper breakdown point of the corresponding -statistic.
Remark 1. If , , so this formula also holds for the -statistic itself. Here, to ensure the breakdown points of all four moments are the same, , since , the breakdown points of all -statistics for the second, third, and fourth central moment estimations are adjusted as , , , respectively.
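Assuming the formula in Theorem 4 takes the form 1 - (1 - ε)^(1/k), which reproduces Donoho and Huber's value 1 - 2^(-1/2) for the Hodges-Lehmann estimator, the adjustment in Remark 1 amounts to inverting this formula; this is a sketch under that assumption.

```python
def u_breakdown(eps_wl, k):
    """Assumed form of Theorem 4: the upper breakdown point of a degree-k
    U-statistic whose weighted L-statistic has upper breakdown point eps_wl."""
    return 1 - (1 - eps_wl) ** (1 / k)

def wl_breakdown_for_target(eps_target, k):
    """Inversion used for the adjustment: the WL breakdown point needed so
    that the degree-k U-statistic attains a common target breakdown point."""
    return 1 - (1 - eps_target) ** k

# Sanity check: the Hodges-Lehmann estimator (median kernel, degree 2).
hl = u_breakdown(0.5, 2)  # 1 - 2**(-1/2), about 0.293

# Equalize the breakdown points of the degree-2, -3, -4 moment estimators.
target = 0.1
adjusted = {k: wl_breakdown_for_target(target, k) for k in (2, 3, 4)}
print(hl, adjusted)
```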
Every statistic is based on certain assumptions. For instance, the sample mean assumes that the second moment of the underlying distribution is finite. If this assumption is violated, the variance of the sample mean becomes infinitely large, even if the population mean is finite. As a result, the sample mean not only has zero robustness to gross errors, but also has zero robustness to departures. To meaningfully compare the performance of estimators under departures from assumptions, it is necessary to impose constraints on these departures. Bound analysis [
1] is the first approach to studying the robustness to departures: although all estimators can be biased under departures from the corresponding assumptions, their standardized maximum deviations can differ substantially [
35,
41,
42,
43,
44,
45]. In REDS I, it is shown that another way to qualitatively compare estimators’ robustness to departures from the symmetry assumption is to construct and compare corresponding semiparametric models. While such a comparison is limited to a semiparametric model and is not universal, it is still valid for a wide range of parametric distributions. Bound analysis is a more universal approach since bounds can be deduced by assuming only regularity conditions [
35,
41,
42,
43,
45]. However, bounds are often hard to deduce for complex estimators. Also, sometimes there are discrepancies between maximum bias and average bias. Since the estimators proposed here are all consistent under certain assumptions, measuring their biases is also a convenient way of measuring the robustness to departures. Average standardized asymptotic bias is thus defined as follows.
Definition 3 (Average standardized asymptotic bias).
For a single-parameter distribution, the average standardized asymptotic bias (ASAB) is given by , where represents the estimation of θ, and σ denotes the standard deviation of the kernel distribution associated with the -statistic. If the estimator is not classified as an RI-statistic, QI-statistic, or -statistic, the corresponding U-statistic, which measures the same attribute of the distribution, is utilized to determine the value of σ. For a two-parameter distribution, the first step is setting the lower bound of the kurtosis range of interest , the spacing δ, and the bin count C. Then, the average standardized asymptotic bias is defined as
where is the kurtosis specifying the two-parameter distribution, denotes the expected value given fixed .
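For a concrete single-parameter illustration (an assumed simple case, not one of the invariant moments): using the median as a mean estimator for the standard exponential distribution gives ASAB = |ln 2 - 1| after standardizing by the standard deviation.

```python
import math

# ASAB for a single-parameter distribution: |E[estimate] - theta| / sigma.
# Assumed toy case: the sample median used as a mean estimator for Exp(1).
theta = 1.0           # population mean of Exp(1)
sigma = 1.0           # standard deviation of Exp(1)
median = math.log(2)  # asymptotic value of the sample median of Exp(1)
asab = abs(median - theta) / sigma
print(asab)  # 1 - ln 2, about 0.307
```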
Standardization plays a crucial role in comparing the performance of estimators across different distributions. Currently, several options are available, such as using the root mean square deviation from the mode (as in Gauss [
1]), the mean absolute deviation, or the standard deviation. The standard deviation is preferred due to its central role in standard error estimation. In
Table 1,
,
. For the Weibull, gamma, lognormal and generalized Gaussian distributions,
(there are two shape parameter solutions for the Weibull distribution, the lower one is used here). For the Pareto distribution,
. To provide a more practical and straightforward illustration, all results from the five distributions are further weighted by the number of Google Scholar search results. Within the kurtosis range considered, nearly all WLs and WHL
s proposed here reach, or at least come close to, their maximum biases (SI Dataset S1). The pseudo-maximum bias is thus defined as the maximum value of the biases within the kurtosis range considered for all five unimodal distributions. In most cases, the pseudo-maximum biases of invariant moments occur in the lognormal or generalized Gaussian distributions (SI Dataset S1), since, besides unimodality, the Weibull distribution differs entirely from them. Interestingly, the asymptotic biases of
and
, after averaging and weighting, are 0.107
and 0.066
, respectively, in line with the sharp bias bounds of
and
(a different subscript is used to indicate a sample size of 15, with the removal of the first and last order statistics), 0.173
and 0.126
, for any distributions with finite moments [
41,
42].