Preprint
Article

Robust Estimations from Distribution Structures: III. Invariant Moments

This version is not peer-reviewed. Submitted: 13 February 2024. Posted: 14 February 2024.


Abstract
Descriptive statistics for parametric models are currently highly sensitive to departures, gross errors, and/or random errors. Here, leveraging the structures of parametric distributions and their central moment kernel distributions, a class of estimators, consistent simultaneously for both a semiparametric distribution and a distinct parametric distribution, is proposed. These efficient estimators are robust to both gross errors and departures from parametric assumptions, making them ideal for estimating the mean and central moments of common unimodal distributions. This article also illuminates the common nature of probability distributions and the measures of them.

The potential biases of robust location estimators in estimating the population mean have been noticed for more than two centuries [1], with numerous significant attempts made to address them. In calculating a robust estimator, the procedure of identifying and downweighting extreme values inherently necessitates the formulation of distributional assumptions. Previously, it was demonstrated that, due to the presence of infinite-dimensional nuisance shape parameters, the semiparametric approach struggles to consistently address distributions with shapes more intricate than symmetry. Newcomb (1886) provided the first modern approach to robust parametric estimation by developing a class of estimators that gives "less weight to the more discordant observations" [2]. In 1964, Huber [3] used the minimax procedure to obtain the M-estimator for the contaminated normal distribution, which has played a pre-eminent role in the later development of robust statistics. However, as previously demonstrated, under growing asymmetric departures from normality, the bias of the Huber M-estimator (HM) increases rapidly. This is a common issue in parametric robust statistics. For example, He and Fung (1999) constructed [4] a robust M-estimator (HFM) for the two-parameter Weibull distribution, from which the mean and central moments can be calculated. Nonetheless, it is inadequate for other parametric distributions, e.g., the gamma, Pareto, lognormal, and generalized Gaussian distributions (SI Dataset S1). Another interesting approach is based on L-estimators, such as percentile estimators. For examples of percentile estimators for the Weibull distribution, the reader is referred to the works of Menon (1963) [5], Dubey (1967) [6], Marks (2005) [7], and Boudt, Caliskan, and Croux (2011) [8]. At the outset of the study of percentile estimators, it was known that they arithmetically utilize the invariant structures of parametric distributions [5,6]. An estimator is classified as an I-statistic if it asymptotically satisfies $I(\mathrm{LE}_1,\ldots,\mathrm{LE}_l) = (\theta_1,\ldots,\theta_q)$ for the distribution for which it is consistent, where the LEs are calculated with the use of $LU$-statistics (defined in REDS II), $I$ is defined using arithmetic operations and constants but may also incorporate transcendental functions and quantile functions, and the $\theta$s are the population parameters it estimates. In this article, two subclasses of I-statistics are introduced: recombined I-statistics and quantile I-statistics. Based on $LU$-statistics, I-statistics are naturally robust. Compared to probability density functions (pdfs) and cumulative distribution functions (cdfs), the quantile functions of many parametric distributions are more elegant. Since the expectation of an L-estimator can be expressed as an integral of the quantile function, I-statistics are often analytically obtainable. However, it is observed that even when the sample follows a gamma distribution, which belongs to the same larger family as the Weibull model, the generalized gamma distribution, a misassumption can still lead to substantial biases in Marks' percentile estimator (MP) for the Weibull distribution [7] (SI Dataset S1).
The purpose of this paper is to demonstrate that, in light of previous works, by utilizing the invariant structures of unimodal distributions, a suite of robust estimators can be constructed whose biases are typically smaller than their variances (as seen in Table 1 for $n = 5184$).

Invariant Moments

Most popular robust location estimators, such as the symmetric trimmed mean, symmetric Winsorized mean, Hodges-Lehmann estimator, Huber M-estimator, and median of means, are symmetric. As shown in REDS I, a symmetric weighted Hodges-Lehmann mean ($\mathrm{SWHLM}_{k,\epsilon}$) can achieve consistency for the population mean in any symmetric distribution with a finite mean. However, it falls considerably short of consistently handling other parametric distributions that are not symmetric. Shifting from semiparametrics to parametrics, consider a robust estimator with a non-sample-dependent breakdown point (defined in Subsection ) that is consistent simultaneously for both a semiparametric distribution and a parametric distribution that does not belong to that semiparametric distribution; such an estimator is named with the prefix "invariant" followed by the name of the population parameter with which it is consistent. Here, the recombined I-statistic is defined as
$$\mathrm{RI}_{d,h_k,k_1,k_2,\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,n,LU_1,LU_2} := \lim_{c\to\infty}\left(\frac{\left({LU_1}_{h_k,k_1,\mathbf{k}_1,\epsilon_1,\gamma_1,n}+c\right)^{d+1}}{\left({LU_2}_{h_k,k_2,\mathbf{k}_2,\epsilon_2,\gamma_2,n}+c\right)^{d}}-c\right),$$
where $d$ is the key factor for bias correction, $LU_{h_k,k,\mathbf{k},\epsilon,\gamma,n}$ is the $LU$-statistic, $k$ is the degree of the U-statistic, $\mathbf{k}$ is the degree of the LL-statistic, and $\epsilon$ is the upper asymptotic breakdown point of the $LU$-statistic. It is assumed in this series that, in the subscript of an estimator, if $k$, $\mathbf{k}$, and $\gamma$ are omitted, $k=1$, $\mathbf{k}=1$, $\gamma=1$ are assumed; if just one $k$ is indicated, $k_1=k_2$; if just one $\gamma$ is indicated, $\gamma_1=\gamma_2$; if $n$ is omitted, only the asymptotic behavior is considered; in the absence of subscripts, no assumptions are made. The subsequent theorem shows the significance of a recombined I-statistic.
Theorem 1. 
Define the recombined mean as $rm_{d,\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,n,WL_1,WL_2} := \mathrm{RI}_{d,h_k=x,k_1=1,k_2=1,\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,n,LU_1=WL_1,LU_2=WL_2}$. Assuming finite means, $rm_{d=\frac{\mu-{WL_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}}{{WL_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}-{WL_2}_{\mathbf{k}_2,\epsilon_2,\gamma_2}},\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,WL_1,WL_2}$ is a consistent mean estimator for a location-scale distribution, where $\mu$, ${WL_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}$, and ${WL_2}_{\mathbf{k}_2,\epsilon_2,\gamma_2}$ are different location parameters from that location-scale distribution. If $\gamma_1=\gamma_2=1$ and $WL=\mathrm{SWHLM}$, $rm$ is also consistent for any symmetric distribution.
Proof. 
Finding the $d$ that makes $rm_{d,\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,WL_1,WL_2}$ a consistent mean estimator is equivalent to finding the solution of $rm_{d,\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,WL_1,WL_2}=\mu$. First consider the location-scale distribution. Since $rm = \lim_{c\to\infty}\left(\frac{\left({WL_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}+c\right)^{d+1}}{\left({WL_2}_{\mathbf{k}_2,\epsilon_2,\gamma_2}+c\right)^{d}}-c\right) = (d+1){WL_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}-d\,{WL_2}_{\mathbf{k}_2,\epsilon_2,\gamma_2} = \mu$, it follows that $d = \frac{\mu-{WL_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}}{{WL_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}-{WL_2}_{\mathbf{k}_2,\epsilon_2,\gamma_2}}$. In REDS I, it was established that any $WL(\mathbf{k},\epsilon,\gamma)$ can be expressed as $\lambda WL_0(\mathbf{k},\epsilon,\gamma)+\mu$ for a location-scale distribution parameterized by a location parameter $\mu$ and a scale parameter $\lambda$, where $WL_0(\mathbf{k},\epsilon,\gamma)$ is a function of $Q_0(p)$, the quantile function of the standard distribution without any shifts or scaling, according to the definition of the weighted L-statistic. The simultaneous cancellation of $\mu$ and $\lambda$ in $\frac{(\lambda\mu_0+\mu)-\left(\lambda WL_{1,0}(\mathbf{k}_1,\epsilon_1,\gamma_1)+\mu\right)}{\left(\lambda WL_{1,0}(\mathbf{k}_1,\epsilon_1,\gamma_1)+\mu\right)-\left(\lambda WL_{2,0}(\mathbf{k}_2,\epsilon_2,\gamma_2)+\mu\right)}$ assures that the $d$ in $rm$ is always a constant for a location-scale distribution. The proof of the second assertion follows directly from the coincidence property. According to Theorem 17 in REDS I, for any symmetric distribution with a finite mean, ${\mathrm{SWHLM}_1}_{\mathbf{k}_1} = {\mathrm{SWHLM}_2}_{\mathbf{k}_2} = \mu$. Then $rm_{d,\mathbf{k}_1,\mathbf{k}_2,\epsilon_1,\epsilon_2,\mathrm{SWHLM}_1,\mathrm{SWHLM}_2} = \lim_{c\to\infty}\left(\frac{(\mu+c)^{d+1}}{(\mu+c)^{d}}-c\right)=\mu$. This completes the demonstration.   □
For example, the Pareto distribution has a quantile function $Q_{Par}(p) = x_m(1-p)^{-\frac{1}{\alpha}}$, where $x_m$ is the minimum possible value that a random variable following the Pareto distribution can take, serving as a scale parameter, and $\alpha$ is a shape parameter. The mean of the Pareto distribution is $\frac{\alpha x_m}{\alpha-1}$. As $WL(\mathbf{k},\epsilon,\gamma)$ can be expressed as a function of $Q(p)$, one can set the two $WL_{\mathbf{k},\epsilon,\gamma}$s in the $d$ value of $rm$ as two arbitrary quantiles, $Q_{Par}(p_1)$ and $Q_{Par}(p_2)$. For the Pareto distribution, $d_{Par,rm} = \frac{\mu_{Par}-Q_{Par}(p_1)}{Q_{Par}(p_1)-Q_{Par}(p_2)} = \frac{\frac{\alpha x_m}{\alpha-1}-x_m(1-p_1)^{-\frac{1}{\alpha}}}{x_m(1-p_1)^{-\frac{1}{\alpha}}-x_m(1-p_2)^{-\frac{1}{\alpha}}}$, from which $x_m$ cancels out. Intriguingly, the quantile function of the exponential distribution is $Q_{exp}(p) = \lambda\ln\frac{1}{1-p}$, $\lambda>0$, and $\mu_{exp}=\lambda$. Then, $d_{exp,rm} = \frac{\mu_{exp}-Q_{exp}(p_1)}{Q_{exp}(p_1)-Q_{exp}(p_2)} = \frac{\ln(1-p_1)+1}{\ln(1-p_2)-\ln(1-p_1)}$. Since $\lim_{\alpha\to\infty}\frac{\frac{\alpha}{\alpha-1}-(1-p_1)^{-\frac{1}{\alpha}}}{(1-p_1)^{-\frac{1}{\alpha}}-(1-p_2)^{-\frac{1}{\alpha}}} = \frac{\ln(1-p_1)+1}{\ln(1-p_2)-\ln(1-p_1)}$, $d_{Par,rm}$ approaches $d_{exp,rm}$ as $\alpha\to\infty$, regardless of the type of weighted L-statistic used. That means, for the Weibull, gamma, Pareto, lognormal, and generalized Gaussian distributions, $rm_{d=\frac{\mu-{\mathrm{SWHLM}_1}_{\mathbf{k}_1,\epsilon_1}}{{\mathrm{SWHLM}_1}_{\mathbf{k}_1,\epsilon_1}-{\mathrm{SWHLM}_2}_{\mathbf{k}_2,\epsilon_2}},\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\mathrm{SWHLM}_1,\mathrm{SWHLM}_2}$ is consistent for at least one particular case, where $\mu$, ${\mathrm{SWHLM}_1}_{\mathbf{k}_1,\epsilon_1}$, and ${\mathrm{SWHLM}_2}_{\mathbf{k}_2,\epsilon_2}$ are different location parameters from an exponential distribution. Let ${\mathrm{SWHLM}_1}_{\mathbf{k}_1,\epsilon_1,\gamma} = \mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}$ and ${\mathrm{SWHLM}_2}_{\mathbf{k}_2,\epsilon_2,\gamma} = m$; then $\mu=\lambda$, $m = Q\left(\frac{1}{2}\right) = \lambda\ln 2$, and $\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}} = \lambda\left(1+\ln C_{BM}\right)$, where $C_{BM}$ is a closed-form algebraic constant whose detailed formula is given in the SI Text. So, $d = \frac{\mu-\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}}{\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}-m} = \frac{\lambda-\lambda(1+\ln C_{BM})}{\lambda(1+\ln C_{BM})-\lambda\ln 2} = \frac{-\ln C_{BM}}{1+\ln C_{BM}-\ln 2} \approx 0.103$. The biases of $rm_{d\approx0.103,\nu=3,\epsilon=\frac{1}{24},\mathrm{BM},m}$ for distributions with skewness between those of the exponential and symmetric distributions are tiny (SI Dataset S1). $rm_{d\approx0.103,\nu=3,\epsilon=\frac{1}{24},\mathrm{BM},m}$ exhibits excellent performance for all these common unimodal distributions (SI Dataset S1).
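As a quick numerical illustration of the limit above, the following Python sketch (illustrative only; the quantile pair $p_1$, $p_2$ is an arbitrary choice, not from the paper) checks that $d_{Par,rm}$ converges to $d_{exp,rm}$ as $\alpha$ grows, and that the defining limit of the recombined I-statistic reduces to $(d+1)WL_1 - d\,WL_2$:

```python
import math

def d_pareto_rm(alpha, p1, p2):
    # d = (mu - Q(p1)) / (Q(p1) - Q(p2)) for the Pareto distribution;
    # the scale x_m cancels, so everything is computed relative to x_m.
    q = lambda p: (1 - p) ** (-1 / alpha)   # Q_Par(p) / x_m
    mu = alpha / (alpha - 1)                # Pareto mean / x_m
    return (mu - q(p1)) / (q(p1) - q(p2))

def d_exp_rm(p1, p2):
    # Limiting value for the exponential distribution.
    return (math.log(1 - p1) + 1) / (math.log(1 - p2) - math.log(1 - p1))

p1, p2 = 0.4, 0.5
for alpha in (2.0, 10.0, 100.0, 1000.0):
    print(alpha, d_pareto_rm(alpha, p1, p2))
print("exponential limit:", d_exp_rm(p1, p2))

# The defining limit of RI reduces to (d+1)*WL1 - d*WL2 for large c:
WL1, WL2, d, c = 0.97, 0.69, 0.103, 1e6
print((WL1 + c) ** (d + 1) / (WL2 + c) ** d - c, (d + 1) * WL1 - d * WL2)
```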
The recombined mean is a recombined I-statistic. Consider an I-statistic whose LEs are percentiles of a distribution obtained by plugging $LU$-statistics into a cumulative distribution function, and whose $I$ is defined with arithmetic operations, constants, and quantile functions; such an estimator is classified as a quantile I-statistic. One version of the quantile I-statistic can be defined as
$$\mathrm{QI}_{d,h_k,k,\mathbf{k},\epsilon,\gamma,n,LU} := \begin{cases} \hat{Q}_{n,h_k}\!\left(\left(\hat{F}_{n,h_k}(LU)-\tfrac{\gamma}{1+\gamma}\right)d+\hat{F}_{n,h_k}(LU)\right), & \hat{F}_{n,h_k}(LU)\ge\tfrac{\gamma}{1+\gamma},\\ \hat{Q}_{n,h_k}\!\left(\hat{F}_{n,h_k}(LU)-\left(\tfrac{\gamma}{1+\gamma}-\hat{F}_{n,h_k}(LU)\right)d\right), & \hat{F}_{n,h_k}(LU)<\tfrac{\gamma}{1+\gamma},\end{cases}$$
where $LU$ is $LU_{k,\mathbf{k},\epsilon,\gamma,n}$, $\hat{F}_{n,h_k}(x)$ is the empirical cumulative distribution function of the $h_k$ kernel distribution, and $\hat{Q}_{n,h_k}$ is the quantile function of the $h_k$ kernel distribution.
Similarly, the quantile mean can be defined as $qm_{d,\mathbf{k},\epsilon,\gamma,n,WL} := \mathrm{QI}_{d,h_k=x,k=1,\mathbf{k},\epsilon,\gamma,n,LU=WL}$. Moreover, in extremely right-skewed heavy-tailed distributions, if the calculated percentile exceeds $1-\epsilon$, it is adjusted to $1-\epsilon$; in a left-skewed distribution, if the obtained percentile is smaller than $\gamma\epsilon$, it is likewise adjusted to $\gamma\epsilon$. Without loss of generality, in the following discussion, only the case where $\hat{F}_n\left(WL_{\mathbf{k},\epsilon,\gamma,n}\right)\ge\frac{\gamma}{1+\gamma}$ is considered. The most popular method for computing the sample quantile function was proposed by Hyndman and Fan in 1996 [9]. Another widely used method for calculating the sample quantile function involves employing linear interpolation of the modes corresponding to the order statistics of the uniform distribution on the interval $[0,1]$, i.e., $\hat{Q}_n(p) = X_{\lfloor h\rfloor}+\left(h-\lfloor h\rfloor\right)\left(X_{\lceil h\rceil}-X_{\lfloor h\rfloor}\right)$, $h=(n-1)p+1$. To minimize the finite sample bias, here, the inverse function of $\hat{Q}_n$ is deduced as $\hat{F}_n(x) := \frac{1}{n}\left(\frac{x-X_{c_f}}{X_{c_f+1}-X_{c_f}}+c_f\right)$, based on Hyndman and Fan's definition, or $\hat{F}_n(x) := \frac{1}{n-1}\left(c_f-1+\frac{x-X_{c_f}}{X_{c_f+1}-X_{c_f}}\right)$, based on the latter definition, where $c_f = \sum_{i=1}^{n}\mathbf{1}_{X_i\le x}$ and $\mathbf{1}_A$ is the indicator of event $A$.
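A minimal sketch of this quantile/cdf pair, assuming the latter interpolation rule and distinct order statistics (function names are illustrative); the round trip $\hat{F}_n(\hat{Q}_n(p)) = p$ illustrates that $\hat{F}_n$ is indeed the inverse:

```python
import numpy as np

def q_hat(x_sorted, p):
    # Sample quantile with h = (n - 1) * p + 1 over 1-based order statistics.
    n = len(x_sorted)
    h = (n - 1) * p + 1
    lo = int(np.floor(h))
    hi = min(lo + 1, n)
    return x_sorted[lo - 1] + (h - lo) * (x_sorted[hi - 1] - x_sorted[lo - 1])

def f_hat(x_sorted, x):
    # Inverse: F_n(x) = (c_f - 1 + (x - X_cf) / (X_{cf+1} - X_cf)) / (n - 1),
    # with c_f = #{X_i <= x}; values outside the sample range are clamped.
    n = len(x_sorted)
    cf = int(np.searchsorted(x_sorted, x, side="right"))
    if cf == 0:
        return 0.0
    if cf >= n:
        return 1.0
    frac = (x - x_sorted[cf - 1]) / (x_sorted[cf] - x_sorted[cf - 1])
    return (cf - 1 + frac) / (n - 1)

rng = np.random.default_rng(0)
xs = np.sort(rng.exponential(size=100))
p = 0.37
print(p, f_hat(xs, q_hat(xs, p)))  # the round trip recovers p
```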
The quantile mean uses the location-scale invariance in a different way, as shown in the subsequent proof.
Theorem 2. 
$qm_{d=\frac{F(\mu)-F(WL_{\mathbf{k},\epsilon,\gamma})}{F(WL_{\mathbf{k},\epsilon,\gamma})-\frac{\gamma}{1+\gamma}},\mathbf{k},\epsilon,\gamma,WL}$ is a consistent mean estimator for a location-scale distribution provided that the means are finite and $F(\mu)$, $F(WL_{\mathbf{k},\epsilon,\gamma})$, and $\frac{\gamma}{1+\gamma}$ are all within the range $[\gamma\epsilon,1-\epsilon]$, where $\mu$ and $WL_{\mathbf{k},\epsilon,\gamma}$ are location parameters from that location-scale distribution. If $WL=\mathrm{SWHLM}$, $qm$ is also consistent for any symmetric distribution.
Proof. 
When $F(WL_{\mathbf{k},\epsilon,\gamma})\ge\frac{\gamma}{1+\gamma}$, the solution of $\left(F(WL_{\mathbf{k},\epsilon,\gamma})-\frac{\gamma}{1+\gamma}\right)d+F(WL_{\mathbf{k},\epsilon,\gamma})=F(\mu)$ is $d=\frac{F(\mu)-F(WL_{\mathbf{k},\epsilon,\gamma})}{F(WL_{\mathbf{k},\epsilon,\gamma})-\frac{\gamma}{1+\gamma}}$. The $d$ value for the case where $F(WL_{\mathbf{k},\epsilon,\gamma,n})<\frac{\gamma}{1+\gamma}$ is the same. The definitions of the location and scale parameters are such that they must satisfy $F(x;\lambda,\mu)=F\left(\frac{x-\mu}{\lambda};1,0\right)$; then $F(WL(\mathbf{k},\epsilon,\gamma);\lambda,\mu)=F\left(\frac{\lambda WL_0(\mathbf{k},\epsilon,\gamma)+\mu-\mu}{\lambda};1,0\right)=F(WL_0(\mathbf{k},\epsilon,\gamma);1,0)$. It follows that the percentile of any weighted L-statistic is free of $\lambda$ and $\mu$ for a location-scale distribution. Therefore, the $d$ in $qm$ is also invariably a constant. For the symmetric case, $F(\mathrm{SWHLM}_{\mathbf{k},\epsilon})=F(\mu)=F\left(Q\left(\frac{1}{2}\right)\right)=\frac{1}{2}$ is valid for any symmetric distribution with a finite second moment, as the same values correspond to the same percentiles. Then, $qm_{d,\mathbf{k},\epsilon,\mathrm{SWHLM}} = F^{-1}\left(\left(F(\mathrm{SWHLM}_{\mathbf{k},\epsilon})-\frac{1}{2}\right)d+F(\mu)\right)=F^{-1}\left(0+F(\mu)\right)=\mu$. To avoid inconsistency due to post-adjustment, $F(\mu)$, $F(WL_{\mathbf{k},\epsilon,\gamma})$, and $\frac{\gamma}{1+\gamma}$ must reside within the range $[\gamma\epsilon,1-\epsilon]$. All results are now proven.   □
The cdf of the Pareto distribution is $F_{Par}(x) = 1-\left(\frac{x_m}{x}\right)^{\alpha}$. So, setting the $d$ value in $qm$ with two arbitrary percentiles $p_1$ and $p_2$ gives $d_{Par,qm} = \frac{F_{Par}(\mu_{Par})-p_1}{p_1-p_2} = \frac{1-\left(\frac{\alpha-1}{\alpha}\right)^{\alpha}-p_1}{p_1-p_2}$. The $d$ value in $qm$ for the exponential distribution is identical to $d_{Par,qm}$ as $\alpha\to\infty$: since $\lim_{\alpha\to\infty}\left(\frac{\alpha-1}{\alpha}\right)^{\alpha}=\frac{1}{e}$ and the cdf of the exponential distribution is $F_{exp}(x)=1-e^{-\lambda^{-1}x}$, it follows that $d_{exp,qm} = \frac{1-\frac{1}{e}-p_1}{p_1-p_2}$. So, for the Weibull, gamma, Pareto, lognormal, and generalized Gaussian distributions, $qm_{d=\frac{F_{exp}(\mu)-F_{exp}(\mathrm{SWHLM}_{\mathbf{k},\epsilon})}{F_{exp}(\mathrm{SWHLM}_{\mathbf{k},\epsilon})-\frac{1}{2}},\mathbf{k},\epsilon,\mathrm{SWHLM}}$ is also consistent for at least one particular case, provided that $\mu$ and $\mathrm{SWHLM}_{\mathbf{k},\epsilon}$ are different location parameters from an exponential distribution and $F(\mu)$, $F(\mathrm{SWHLM}_{\mathbf{k},\epsilon})$, and $\frac{1}{2}$ are all within the range $[\epsilon,1-\epsilon]$. Also let $\mathrm{SWHLM}_{\mathbf{k},\epsilon,\gamma}=\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}$ and $\mu=\lambda$; then $d=\frac{F_{exp}(\mu)-F_{exp}\left(\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}\right)}{F_{exp}\left(\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}\right)-\frac{1}{2}} = \frac{e^{-1}\left(C_{BM}^{-1}-1\right)}{\frac{1}{2}-e^{-1}C_{BM}^{-1}} \approx 0.088$, where $C_{BM}$ is the same closed-form constant as above (SI Text). $F_{exp}(\mu)$, $F_{exp}\left(\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}\right)$, and $\frac{1}{2}$ are all within the range $\left[\frac{1}{24},\frac{23}{24}\right]$. $qm_{d\approx0.088,\nu=3,\epsilon=\frac{1}{24},\mathrm{BM}}$ works better in fat-tailed scenarios (SI Dataset S1). Theorems 1 and 2 show that $rm_{d\approx0.103,\nu=3,\epsilon=\frac{1}{24},\mathrm{BM},m}$ and $qm_{d\approx0.088,\nu=3,\epsilon=\frac{1}{24},\mathrm{BM}}$ are both consistent mean estimators for any symmetric distribution and for the exponential distribution with finite second moments. The asymptotic breakdown points of $rm_{d\approx0.103,\nu=3,\epsilon=\frac{1}{24},\mathrm{BM},m}$ and $qm_{d\approx0.088,\nu=3,\epsilon=\frac{1}{24},\mathrm{BM}}$ are clearly both $\frac{1}{24}$. Therefore, they are both invariant means.
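The analogous convergence for the quantile-mean correction factor can also be checked numerically; a minimal sketch with an arbitrary percentile pair (not from the paper):

```python
import math

def d_pareto_qm(alpha, p1, p2):
    # d = (F(mu) - p1) / (p1 - p2); F_Par(mu) = 1 - ((alpha - 1) / alpha)**alpha.
    return (1 - ((alpha - 1) / alpha) ** alpha - p1) / (p1 - p2)

def d_exp_qm(p1, p2):
    # Limiting value, using F_exp(mu) = 1 - 1/e.
    return (1 - math.exp(-1) - p1) / (p1 - p2)

p1, p2 = 0.55, 0.5
for alpha in (2.0, 10.0, 100.0, 1000.0):
    print(alpha, d_pareto_qm(alpha, p1, p2))
print("exponential limit:", d_exp_qm(p1, p2))
```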
To study the impact of the choice of $WL$s in $rm$ and $qm$, it is constructive to recall that a weighted L-statistic is a combination of order statistics. While using a less-biased weighted L-statistic can generally enhance performance (SI Dataset S1), there is a greater risk of violation in the semiparametric framework. However, the mean-$WA_{\epsilon,\gamma}$-$\gamma$-median inequality is robust to slight fluctuations of the QA function of the underlying distribution. Suppose that, for a right-skewed distribution, the QA function is generally decreasing with respect to $\epsilon$ in $[0,u]$ but increasing in $\left[u,\frac{1}{1+\gamma}\right]$. Since all quantile averages with breakdown points from $\epsilon$ to $\frac{1}{1+\gamma}$ are included in the computation of $WA_{\epsilon,\gamma}$, as long as $\frac{1}{1+\gamma}\ge u\ge\frac{1}{1+\gamma}-\gamma\epsilon$ and the other portions of the QA function satisfy the inequality constraints that define the $\nu$th $\gamma$-orderliness on which the $WA_{\epsilon,\gamma}$ is based, if $0\le\gamma\le1$, the mean-$WA_{\epsilon,\gamma}$-$\gamma$-median inequality still holds. This is because the violation of $\nu$th $\gamma$-orderliness is bounded when $0\le\gamma\le1$, as shown in REDS I, and therefore cannot be extreme for unimodal distributions with finite second moments. For instance, the SQA function of the Weibull distribution is non-monotonic with respect to $\epsilon$ when the shape parameter $\alpha > \frac{1}{1-\ln 2}\approx3.259$, as shown in the SI Text of REDS I, and the violation of the second and third orderliness starts near this parameter as well, yet the mean-$\mathrm{BM}_{\nu=3,\epsilon=\frac{1}{24}}$-median inequality remains valid when $\alpha\le3.387$. Another key factor in determining the risk of violation of orderliness is the skewness of the distribution. In REDS I, it was demonstrated that in a family of distributions differing by a skewness-increasing transformation in van Zwet's sense, the violation of orderliness, if it happens, only occurs as the distribution nears symmetry [10]. When $\gamma=1$, the over-corrections in $rm$ and $qm$ depend on the $\mathrm{SWA}_{\epsilon}$-median difference, which can be a reasonable measure of skewness after standardization [11,12], implying that the over-correction is often tiny with moderate $d$. This qualitative analysis suggests the general reliability of $rm$ and $qm$ based on the mean-$WA_{\epsilon,\gamma}$-$\gamma$-median inequality, especially for unimodal distributions with finite second moments when $0\le\gamma\le1$. Extending this rationale to other weighted L-statistics is possible, since the $\gamma$-U-orderliness can also be bounded under certain assumptions, as discussed previously.
Consider two continuous distributions belonging to the same location-scale family; according to Theorem 3 in REDS II, their corresponding $k$th central moment kernel distributions differ only in scaling. Although strict complete $\nu$th orderliness is difficult to prove, following the same logic as discussed above, even if the inequality may be violated in a small range, the mean-$\mathrm{SWA}_{\epsilon}$-median inequality remains valid, in most cases, for the central moment kernel distribution. Define the recombined $k$th central moment as $rkm_{d,\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,n,{WHLkm}_1,{WHLkm}_2} := \mathrm{RI}_{d,h_k=\psi_k,k_1=k,k_2=k,\mathbf{k}_1,\mathbf{k}_2,\epsilon_1,\epsilon_2,\gamma_1,\gamma_2,n,LU_1={WHLkm}_1,LU_2={WHLkm}_2}$. Then, assuming a finite $k$th central moment and applying the same logic as in Theorem 1, $rkm_{d=\frac{\mu_k-{WHLkm_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}}{{WHLkm_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}-{WHLkm_2}_{\mathbf{k}_2,\epsilon_2,\gamma_2}},\mathbf{k}_1,\mathbf{k}_2,\epsilon=\min(\epsilon_1,\epsilon_2),\gamma_1,\gamma_2,{WHLkm}_1,{WHLkm}_2}$ is a consistent $k$th central moment estimator for a location-scale distribution, where $\mu_k$, ${WHLkm_1}_{\mathbf{k}_1,\epsilon_1,\gamma_1}$, and ${WHLkm_2}_{\mathbf{k}_2,\epsilon_2,\gamma_2}$ are different $k$th central moment parameters from that location-scale distribution. Similarly, the quantile will not change after scaling. The quantile $k$th central moment is thus defined as
$$qkm_{d,\mathbf{k},\epsilon,\gamma,n,WHLkm} := \mathrm{QI}_{d,h_k=\psi_k,k=k,\mathbf{k},\epsilon,\gamma,n,LU=WHLkm}.$$
$qkm_{d=\frac{F_{\psi_k}(\mu_k)-F_{\psi_k}(WHLkm_{\mathbf{k},\epsilon,\gamma})}{F_{\psi_k}(WHLkm_{\mathbf{k},\epsilon,\gamma})-\frac{\gamma}{1+\gamma}},\mathbf{k},\epsilon,\gamma,WHLkm}$ is also a consistent $k$th central moment estimator for a location-scale distribution, provided that the $k$th central moment is finite and $F_{\psi_k}(\mu_k)$, $F_{\psi_k}(WHLkm_{\mathbf{k},\epsilon,\gamma})$, and $\frac{\gamma}{1+\gamma}$ are all within the range $[\gamma\epsilon,1-\epsilon]$, where $\mu_k$ and $WHLkm_{\mathbf{k},\epsilon,\gamma}$ are different $k$th central moment parameters from that location-scale distribution. According to Theorem 2 in REDS II, if the original distribution is unimodal, the central moment kernel distribution is always heavy-tailed, as the degree term amplifies its skewness and tailedness. Given the better performance of the quantile mean in heavy-tailed distributions, the quantile $k$th central moments are generally better than the recombined $k$th central moments regarding asymptotic bias.
Finally, the recombined standardized $k$th moment is defined to be
$$rskm_{\epsilon=\min(\epsilon_1,\epsilon_2),\mathbf{k}_1,\mathbf{k}_2,\mathbf{k}_3,\mathbf{k}_4,\gamma_1,\gamma_2,\gamma_3,\gamma_4,n,{WHLkm}_1,{WHLkm}_2,{WHLvar}_1,{WHLvar}_2} := \frac{rkm_{d,\mathbf{k}_1,\mathbf{k}_2,\epsilon_1,\gamma_1,\gamma_2,n,{WHLkm}_1,{WHLkm}_2}}{\left(rvar_{d,\mathbf{k}_3,\mathbf{k}_4,\epsilon_2,\gamma_3,\gamma_4,n,{WHLvar}_1,{WHLvar}_2}\right)^{k/2}}.$$
The quantile standardized $k$th moment is defined similarly,
$$qskm_{\epsilon=\min(\epsilon_1,\epsilon_2),\mathbf{k}_1,\mathbf{k}_2,\gamma_1,\gamma_2,n,WHLkm,WHLvar} := \frac{qkm_{d,\mathbf{k}_1,\epsilon_1,\gamma_1,n,WHLkm}}{\left(qvar_{d,\mathbf{k}_2,\epsilon_2,\gamma_2,n,WHLvar}\right)^{k/2}}.$$
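In code, these standardized moments are a one-liner on top of the recombined (or quantile) moment estimates; a trivial sketch with hypothetical inputs:

```python
def standardized_moment(central_moment_k, variance, k):
    # rskm (or qskm) = (k-th central moment) / variance**(k / 2);
    # k = 3 gives the standardized skewness, k = 4 the kurtosis.
    return central_moment_k / variance ** (k / 2)

# For Exp(1): mu_3 = 2, mu_2 = 1 -> skewness 2; mu_4 = 9 -> kurtosis 9.
print(standardized_moment(2.0, 1.0, 3), standardized_moment(9.0, 1.0, 4))
```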

A shape-scale distribution as the consistent distribution

In the last section, parametric robust estimation was limited to a location-scale distribution, with the location parameter often omitted for simplicity. For improved fit to observed skewness or kurtosis, shape-scale distributions with a shape parameter $\alpha$ and a scale parameter $\lambda$ are commonly utilized. The Weibull, gamma, Pareto, lognormal, and generalized Gaussian distributions (when $\mu$ is a constant) are all shape-scale unimodal distributions. Furthermore, if either the shape parameter $\alpha$ or the skewness or kurtosis is constant, the shape-scale distribution reduces to a location-scale distribution. Let $D(|skewness|,kurtosis,k,etype,dtype,n)=d_{ikm}$ denote the function that specifies $d$ values, where the first input is the absolute value of the skewness, the second is the kurtosis, the third is the order of the central moment (if $k=1$, the mean), the fourth is the type of estimator, the fifth is the type of consistent distribution, and the sixth is the sample size. For simplicity, the last three inputs will be omitted in the following discussion. Keep in mind that, since skewness and kurtosis are interrelated, specifying $d$ values for a shape-scale distribution requires only either skewness or kurtosis, while the other may also be omitted. Since many common shape-scale distributions are always right-skewed (if not, only the right-skewed or left-skewed part is used for calibration, while the other part is omitted), the absolute value of the skewness should be the same as the skewness of these distributions. This setting also handles the left-skewed scenario well.
For recombined moments up to the fourth ordinal, the objective of using a shape-scale distribution as the consistent distribution is to find solutions of the system of equations
$$\begin{cases} rm_{WHLM,\gamma_m,D(|rskew|,rkurt,1)}=\mu\\ rvar_{WHLvar,\gamma_{mvar},D(|rskew|,rkurt,2)}=\mu_2\\ rtm_{WHLtm,\gamma_{mtm},D(|rskew|,rkurt,3)}=\mu_3\\ rfm_{WHLfm,\gamma_{mfm},D(|rskew|,rkurt,4)}=\mu_4\\ rskew=\frac{\mu_3}{\mu_2^{3/2}}\\ rkurt=\frac{\mu_4}{\mu_2^{2}}\end{cases}$$
where $\mu_2$, $\mu_3$, and $\mu_4$ are the population second, third, and fourth central moments. $|rskew|$ and $rkurt$ should be the fixed points of the functions $\varsigma(|rskew|)=\frac{rtm_{WHLtm,\gamma_{mtm},D(|rskew|,3)}}{\left(rvar_{WHLvar,\gamma_{mvar},D(|rskew|,2)}\right)^{3/2}}$ and $\varkappa(rkurt)=\frac{rfm_{WHLfm,\gamma_{mfm},D(rkurt,4)}}{\left(rvar_{WHLvar,\gamma_{mvar},D(rkurt,2)}\right)^{2}}$. Clearly, this is an overdetermined nonlinear system of equations, given that the skewness and kurtosis are interrelated for a shape-scale distribution. Since an overdetermined system constructed with random coefficients is almost always inconsistent, it is natural to optimize them separately using fixed-point iteration (see Algorithm 1; only $rkurt$ is given, the others are analogous).
Algorithm 1 $rkurt$ for a shape-scale distribution

Require: $D$; $WHLvar$; $WHLfm$; $\gamma_{mvar}$; $\gamma_{mfm}$; $maxit$; $\delta$
Ensure: $rkurt_{i-1}$

1: $i \leftarrow 0$
2: $rkurt_i \leftarrow \varkappa(kurtosis_{max})$ ▹ Use the maximum kurtosis available in $D$ as the initial guess.
3: repeat
4:   $i \leftarrow i+1$
5:   $rkurt_{i-1} \leftarrow rkurt_i$
6:   $rkurt_i \leftarrow \varkappa(rkurt_{i-1})$
7: until $i > maxit$ or $|rkurt_i - rkurt_{i-1}| < \delta$ ▹ $maxit$ is the maximum number of iterations; $\delta$ is a small positive number.
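A minimal Python sketch of this loop, assuming `kappa` is a callable that evaluates $\varkappa$ through the Monte-Carlo-calibrated $D$ function (the function and parameter names are illustrative, not from the paper):

```python
def rkurt_fixed_point(kappa, kurtosis_max, maxit=100, delta=1e-6):
    # Algorithm 1: fixed-point iteration started from the maximum
    # kurtosis available in D.
    cur = kappa(kurtosis_max)
    for _ in range(maxit):
        prev, cur = cur, kappa(cur)
        if abs(cur - prev) < delta:
            break
    # If no fixed point lies in the domain of D, the loop exhausts maxit
    # and returns a boundary value (see the discussion after Theorem 3).
    return cur
```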
The following theorem shows the validity of Algorithm 1.
Theorem 3. 
Assuming $\gamma=1$ and that the $m_{km}$s, where $2\le k\le4$, are all equal to zero, $|rskew|$ and $rkurt$, defined as the largest attracting fixed points of the functions $\varsigma(|rskew|)$ and $\varkappa(rkurt)$, are consistent estimators of $\tilde{\mu}_3$ and $\tilde{\mu}_4$ for a shape-scale distribution whose $k$th central moment kernel distributions are U-congruent, as long as they are within the domain of $D$, where $\tilde{\mu}_3$ and $\tilde{\mu}_4$ are the population skewness and kurtosis, respectively.
Proof. 
Without loss of generality, only $rkurt$ is considered; the logic for $|rskew|$ is the same. Additionally, the second central moments of the underlying sample distribution and the consistent distribution are assumed to be 1, with other cases simply multiplying a constant factor according to Theorem 3 in REDS II. From the definition of $D$,
$$\frac{\varkappa(rkurt_D)}{rkurt_D} = \frac{\frac{fm_D-SWHLfm_D}{SWHLfm_D-m_{fm_D}}\left(SWHLfm-m_{fm}\right)+SWHLfm}{rkurt_D\left(\frac{var_D-SWHLvar_D}{SWHLvar_D-m_{var_D}}\left(SWHLvar-m_{var}\right)+SWHLvar\right)^{2}},$$
where the subscript $D$ indicates that the estimates are from the central moment kernel distributions generated from the consistent distribution, while the other estimates are from the underlying distribution of the sample.
Then, assuming the $m_{km}$s are all equal to zero and $var_D=1$ (so that $rkurt_D = fm_D$),
$$\frac{\varkappa(rkurt_D)}{rkurt_D} = \frac{\frac{fm_D-SWHLfm_D}{SWHLfm_D}SWHLfm+SWHLfm}{rkurt_D\left(\frac{SWHLvar}{SWHLvar_D}\right)^{2}} = \frac{\left(\frac{fm_D-SWHLfm_D}{SWHLfm_D}+1\right)\frac{SWHLfm}{fm_D}}{\left(\frac{SWHLvar}{SWHLvar_D}\right)^{2}} = \frac{SWHLfm\,SWHLvar_D^{2}}{SWHLfm_D\,SWHLvar^{2}} = \frac{SWHLfm/SWHLvar^{2}}{SWHLfm_D/SWHLvar_D^{2}} = \frac{SWHLkurt}{SWHLkurt_D}.$$
Since $SWHLfm_D$ is from the same fourth central moment kernel distribution as $fm_D = rkurt_D\,var_D^{2}$, according to the definition of U-congruence, an increase in $fm_D$ will also result in an increase in $SWHLfm_D$. Combining this with Theorem 3 in REDS II, $SWHLkurt$ is a measure of kurtosis that is invariant to location and scale, so $\lim_{rkurt_D\to\infty}\frac{\varkappa(rkurt_D)}{rkurt_D}<1$. As a result, if there is at least one fixed point, let the largest one be $fix_{max}$; it is attracting, since $\left|\frac{\partial\varkappa(rkurt_D)}{\partial\,rkurt_D}\right|<1$ for all $rkurt_D\in[fix_{max},kurtosis_{max}]$, where $kurtosis_{max}$ is the maximum kurtosis available in $D$.   □
As a result of Theorem 3, assuming continuity, that the $m_{km}$s are all equal to zero, and U-congruence of the central moment kernel distributions, Algorithm 1 surely converges provided that a fixed point exists within the domain of $D$. At this stage, $D$ can only be approximated through a Monte Carlo study. The continuity of $D$ can be ensured by using linear interpolation. One commonly encountered problem is that the domain of $D$ depends on both the consistent distribution and the Monte Carlo study, so the iteration may halt at the boundary if the fixed point is not within the domain. However, by setting a proper maximum number of iterations, the algorithm can return the optimal boundary value. For quantile moments, the logic is similar, provided the percentiles do not exceed the breakdown point. If they do, consistent estimation is impossible, and the algorithm will stop due to the maximum number of iterations. The fixed-point iteration is, in principle, similar to the iterative reweighting in the Huber M-estimator, but an advantage of this algorithm is that it is solely related to the inputs in Algorithm 1 and is independent of the sample size. Since they are consistent for a shape-scale distribution, $|rskew|$ can specify $d_{rm}$ and $d_{rtm}$, and $rkurt$ can specify $d_{rvar}$ and $d_{rfm}$. Algorithm 1 enables the robust estimation of all four moments to reach a near-consistent level for common unimodal distributions (Table 1, SI Dataset S1), using just the Weibull distribution as the consistent distribution.
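The lookup of $d$ values through $D$ might be sketched as follows; the grid and $d$ values are placeholders only, since the real $D$ must come from a Monte Carlo calibration (SI Text):

```python
import numpy as np

# Hypothetical calibration grid: kurtosis of the consistent (Weibull)
# distribution -> d value, approximated beforehand by a Monte Carlo study.
kurt_grid = np.array([3.0, 4.0, 6.0, 9.0, 15.0])
d_grid = np.array([0.05, 0.07, 0.09, 0.10, 0.11])  # illustrative numbers only

def D(kurt):
    # Linear interpolation keeps D continuous; np.interp clamps inputs to
    # the boundary values, so the iteration can halt at the domain boundary.
    return float(np.interp(kurt, kurt_grid, d_grid))
```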

Critical points, lines, and multiple consistent distributions

In 1895, Pearson considered how to construct probability distributions in which the skewness and kurtosis could be adjusted equally freely. The Pearson distribution is a family of unimodal continuous probability distributions whose pdfs $P(x)$ satisfy the differential equation $\frac{dP(x)}{dx} = \frac{\left(a_0+a_1x+a_2x^{2}\right)P(x)}{b_0+b_1x+b_2x^{2}+b_3x^{3}}$, where $a_0$, $a_1$, $a_2$, $b_0$, $b_1$, $b_2$, and $b_3$ are constants that determine the specific type of Pearson distribution. This differential equation ensures that the distribution is unimodal and that the density function is continuous. The Pearson family subsumes many common unimodal distributions, e.g., the beta, Cauchy, chi-squared, continuous uniform, gamma, inverse-chi-squared, inverse-gamma, normal, and Student's t-distributions. Plotting the kurtosis-skewness lines of the five shape-scale unimodal distributions used in this series in the Pearson diagram, one can immediately identify two critical points: one is the kurtosis-skewness point (9, 2), which corresponds to the exponential distribution, i.e., the limiting form of the Pareto distribution; the other is the normal distribution. This further explains why using the exponential distribution as the consistent distribution performs excellently, even better than using the Weibull distribution as the consistent distribution: it lies at this critical point. Then, consider the critical lines in the Pearson diagram (Figure 1). If the underlying distribution is not the Weibull distribution but, say, the gamma distribution, one can use the Weibull distribution as an initial guess, calculate the Euclidean distance of the obtained kurtosis-skewness point to the kurtosis-skewness lines of the other consistent distributions, choose the one with the smallest distance, and then recalculate the kurtosis-skewness according to the new consistent distribution; iterating this procedure allows multiple distributions to be used as consistent distributions, as sketched below.
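A hedged sketch of that nearest-line selection step, under the assumption that each candidate consistent distribution's kurtosis-skewness line has been pre-sampled into a table of points (all names here are hypothetical):

```python
import numpy as np

def nearest_consistent_distribution(skew, kurt, lines):
    # lines: dict mapping a distribution name to an (m, 2) array of
    # (skewness, kurtosis) points sampled along its line in the Pearson diagram.
    best_name, best_dist = None, np.inf
    for name, pts in lines.items():
        dist = np.min(np.hypot(pts[:, 0] - skew, pts[:, 1] - kurt))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```

The chosen distribution then supplies the next calibration table, and the kurtosis-skewness point is re-estimated until the choice stabilizes.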

Variance

As one of the fundamental theorems in statistics, the central limit theorem states that the standard deviation of the limiting form of the sampling distribution of the sample mean is $\frac{\sigma}{\sqrt{n}}$. This principle, asymptotic normality, was later applied to the sampling distributions of robust location estimators [2,13,14,15,16,17,18,19,20,21]. Daniell (1920) stated [13] that comparing the efficiencies of various kinds of estimators is useless unless they all tend to coincide asymptotically. Bickel and Lehmann, also in the landmark series [19,20], argued that meaningful comparisons of the efficiencies of various kinds of location estimators can be accomplished by studying their standardized variances, asymptotic variances, and efficiency bounds. The standardized variance, $\frac{\mathrm{Var}(\hat{\theta})}{\theta^{2}}$, allows the use of simulation studies or empirical data to compare the variances of estimators of distinct parameters. However, a limitation of this approach is the inverse square dependence of the standardized variance on $\theta$. If $\mathrm{Var}(\hat{\theta}_1)=\mathrm{Var}(\hat{\theta}_2)$ but $\theta_1$ is close to zero and $\theta_2$ is relatively large, their standardized variances will still differ dramatically. Here, the scaled standard error (SSE) is proposed as a method for estimating the variances of estimators measuring the same attribute, offering a standard error more comparable to that of the sample mean and much less influenced by the magnitude of $\theta$.
Definition 1 
(Scaled standard error). Let $M_{s_is_j}\in\mathbb{R}^{i\times j}$ denote the sample-by-statistics matrix, i.e., the first column corresponds to $\hat{\theta}$, which is the mean or a U-central moment measuring the same attribute of the distribution as the other columns, and the second to the $j$th columns correspond to the $j-1$ statistics required to be scaled, $\hat{\theta}_{r_1},\hat{\theta}_{r_2},\ldots,\hat{\theta}_{r_{j-1}}$. Then, the scaling factor $S = \left(1,\frac{\bar{\theta}_{r_1}}{\bar{\theta}_m},\frac{\bar{\theta}_{r_2}}{\bar{\theta}_m},\ldots,\frac{\bar{\theta}_{r_{j-1}}}{\bar{\theta}_m}\right)^{T}$ is a $j\times1$ matrix, where each $\bar{\theta}$ is the mean of the corresponding column of $M_{s_is_j}$ and $\bar{\theta}_m$ is that of the first column. The normalized matrix $M_{s_is_j}^{N}$ is obtained by dividing each column of $M_{s_is_j}$ by the corresponding entry of $S$. The SSEs are the unbiased standard deviations of the corresponding columns of $M_{s_is_j}^{N}$.
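A NumPy sketch of Definition 1, assuming the column-wise normalization reading described above (each column divided by its scaling factor); the simulation at the end is only a usage illustration:

```python
import numpy as np

def scaled_standard_errors(M):
    # M: (samples x statistics) matrix; column 0 is the reference statistic
    # (the mean or a U-central moment); the rest measure the same attribute.
    col_means = M.mean(axis=0)
    S = col_means / col_means[0]       # scaling factors; S[0] == 1
    M_norm = M / S                     # each column divided by its factor
    return M_norm.std(axis=0, ddof=1)  # unbiased SDs of the columns = SSEs

# e.g., compare the sample mean and sample median on Exp(1) samples:
rng = np.random.default_rng(1)
samples = rng.exponential(size=(1000, 100))
M = np.column_stack([samples.mean(axis=1), np.median(samples, axis=1)])
print(scaled_standard_errors(M))
```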
The U-central moment (the central moment estimated by using U-statistics) is essentially the mean of the central moment kernel distribution, so its standard error should generally be close to $\frac{\sigma_{km}}{\sqrt{n}}$, although not exactly, since the kernel distribution is not i.i.d., where $\sigma_{km}$ is the asymptotic standard deviation of the central moment kernel distribution. If the statistics of interest coincide asymptotically, then the plain standard errors should still be used, e.g., for symmetric location estimators and odd ordinal central moments for symmetric distributions, since the scaled standard error would be too sensitive to small changes when the statistics are zero.
The SSEs of all robust estimators proposed here are often, although many exceptions exist, between those of the sample median and those of the sample mean, or between those of the median central moments and the U-central moments (SI Dataset S1). This is because similar monotonic relations between breakdown point and variance are also very common; e.g., Bickel and Lehmann [20] proved that a lower bound for the efficiency of $TM_{\epsilon}$ relative to the sample mean is $(1-2\epsilon)^{2}$, and this monotonic bound holds true for any distribution. However, the direction of monotonicity differs for distributions with different kurtosis. Lehmann and Scheffé (1950, 1955) [22,23], in their two early papers, provided a way to construct a uniformly minimum-variance unbiased estimator (UMVUE). From that, the sample mean and the unbiased sample second moment can be proven to be the UMVUEs of the population mean and population second moment for the Gaussian distribution. While their performance for sub-Gaussian distributions is generally satisfactory, they perform poorly when the distribution has a heavy tail and completely fail for distributions with infinite second moments. For sub-Gaussian distributions, the variance of a robust location estimator generally increases monotonically as its robustness increases, but for heavy-tailed distributions, the relation is reversed. So, unlike bias, the variance-optimal choice can be very different for distributions with different kurtosis.
Due to combinatorial explosion, the bootstrap [24], introduced by Efron in 1979, is indispensable for computing central moments in practice. In 1981, Bickel and Freedman [25] showed that the bootstrap is asymptotically valid for approximating the original distribution in a wide range of situations, including U-statistics. The limit laws of bootstrapped trimmed U-statistics were proven by Helmers, Janssen, and Veraverbeke (1990) [26]. In REDS I, the advantages of the quasi-bootstrap were discussed [27,28,29]. With quasi-sampling, the impact of the number of repetitions of the bootstrap, or bootstrap size, on variance is very small (SI Dataset S1). An estimator based on the quasi-bootstrap approach can be seen as a complex deterministic estimator that is not only computationally efficient but also statistically efficient. The only drawback of the quasi-bootstrap compared to the non-bootstrap approach is that a small bootstrap size can produce additional finite sample bias (SI Text).

Root mean square error

Lai, Robbins, and Yu (1983) proposed an estimator that adaptively chooses the mean or median in a symmetric distribution and showed that the choice is typically as good as the better of the sample mean and median regarding variance [30]. Another approach, which can be dated back to Laplace (1812) [31], is using $w\bar{x}+(1-w)m_n$ as a location estimator, with $w$ deduced to achieve optimal variance. Inspired by Lai et al.'s approach [30], in this study, for $rkurt$, there are 364 combinations based on 14 $SWHLfm$s and 26 $SWHLvar$s (SI Text). Each combination has a root mean square error (RMSE) for a single-parameter distribution, which can be inferred using a Monte Carlo study. For $qkurt$, there are another 364 combinations, but if the percentiles of quantile moments exceed the breakdown point, that combination is excluded. Then, the combination with the smallest RMSE, calibrated by a two-parameter distribution, is chosen. Similar to Subsection , let $I(kurtosis,dtype,n)=ikurt_{WHLfm,WHLvar}$ represent these relationships. In this article, the breakdown points of the SWHLMs in $SWHLkm$ were adjusted to ensure that the overall breakdown points were $\frac{1}{24}$, as detailed in Theorem 4. There are two approaches to determine $ikurt$. The first is computing all $364+364$ $rkurt$ and $qkurt$; then, since $\lim_{ikurt\to\infty}\frac{I(ikurt)}{ikurt}<1$, the same fixed-point iteration algorithm as Algorithm 1 can be used to choose the RMSE-optimal combination. The only difference is that, unlike $D$, $I$ is defined to be discontinuous, but linear interpolation can also ensure continuity. The second approach is shown in SI Algorithm 2. The RMSEs of these $ikurt$ from the two approaches can be further determined by a Monte Carlo study, and Algorithm 1 can also be used to determine the optimal choice between the two approaches. The $364+364$ $rkurt$ and $qkurt$ can form a vector, $V_{kurt}$, whose quantiles $Q_{V_{kurt}}\left(\frac{1}{5}\right)$ to $Q_{V_{kurt}}\left(\frac{4}{5}\right)$ can be used to determine the $d$ values of $rkm$s and $qkm$s. The RMSEs of those $rkm$s and $qkm$s can also be estimated by a Monte Carlo study, and the estimator with the smallest RMSE of each ordinal is named $ikm$. When $k$ is even, the $ikurt$ determined by $I_{sm}$ (detailed in the SI Text) is used to determine $ikm$. This approach yields results that are often nearly optimal (SI Dataset S1). The estimations of skewness, and of $ikm$ when $k$ is odd, follow the same logic.
In general, the variances of invariant central moments are much smaller than those of the corresponding unbiased sample central moments (deduced by Cramér [32,33]), except for that of the corresponding second central moment (Table 1).

Robustness

The measure of robustness to gross errors used in this series is the breakdown point proposed by Hampel [34] in 1968. In REDS I, it was shown that the median of means (MoM) is asymptotically equivalent to the median Hodges-Lehmann mean. Therefore, it is also biased for any asymmetric distribution. However, since the concentration bound of MoM depends on $\frac{1}{\sqrt{n}}$ [35], it is quite natural to deduce that it is a consistent robust estimator. The concept of a sample-dependent breakdown point is defined to avoid ambiguity.
Definition 2 
(Sample-dependent breakdown point). The breakdown point of an estimator $\hat{\theta}$ is called sample-dependent if and only if the upper and lower asymptotic breakdown points, i.e., the upper and lower breakdown points as $n\to\infty$, are zero and the empirical influence function of $\hat{\theta}$ is bounded. For a full formal definition of the empirical influence function, the reader is referred to the paper of Devlin, Gnanadesikan, and Kettenring (1975) [36].
Bear in mind that this differs from the "infinitesimal robustness" defined by Hampel, which concerns whether the asymptotic influence function is bounded [37,38,39]. The proof of the consistency of MoM assumes that it is an estimator with a sample-dependent breakdown point, since its breakdown point is $\frac{b}{2n}$, where $b$ is the number of blocks; then $\lim_{n\to\infty}\frac{b}{2n}=0$ if $b$ is a constant, and any change in any one of the points of the sample cannot break down this estimator.
For the $LU$-statistics, the asymptotic upper breakdown points are given by the following theorem, which extends the method in Donoho and Huber (1983)'s proof of the breakdown point of the Hodges-Lehmann estimator [40]. The proof is given in the SI Text.
Theorem 4. 
Given a U-statistic associated with a symmetric kernel of degree $k$, and assuming that $k$ is a constant as $n\to\infty$, the upper breakdown point of the $LU$-statistic is $1-(1-\epsilon_0)^{\frac{1}{k}}$, where $\epsilon_0$ is the upper breakdown point of the corresponding $LL$-statistic.
Remark 1. 
If $k=1$, $1-(1-\epsilon_0)^{\frac{1}{k}}=\epsilon_0$, so this formula also holds for the $LL$-statistic itself. Here, to ensure that the breakdown points of all four moments are the same, $\frac{1}{24}$, since $\epsilon_0=1-(1-\epsilon)^{k}$, the breakdown points of all $LU$-statistics for the second, third, and fourth central moment estimations are adjusted as $\epsilon_0=\frac{47}{576}$, $\frac{1657}{13824}$, and $\frac{51935}{331776}$, respectively.
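These adjusted values can be checked mechanically; a small sketch using exact rational arithmetic:

```python
from fractions import Fraction

def adjusted_ll_breakdown(eps, k):
    # epsilon_0 = 1 - (1 - eps)**k keeps the LU-statistic's upper
    # breakdown point at eps for a kernel of degree k.
    return 1 - (1 - eps) ** k

eps = Fraction(1, 24)
for k in (2, 3, 4):
    print(k, adjusted_ll_breakdown(eps, k))
# prints 47/576, 1657/13824, 51935/331776
```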
Every statistic is based on certain assumptions. For instance, the sample mean assumes that the second moment of the underlying distribution is finite. If this assumption is violated, the variance of the sample mean becomes infinitely large, even if the population mean is finite. As a result, the sample mean not only has zero robustness to gross errors but also zero robustness to departures. To meaningfully compare the performance of estimators under departures from assumptions, it is necessary to impose constraints on these departures. Bound analysis [1] is the first approach to studying robustness to departures: although all estimators can be biased under departures from the corresponding assumptions, their standardized maximum deviations can differ substantially [35,41,42,43,44,45]. In REDS I, it was shown that another way to qualitatively compare estimators' robustness to departures from the symmetry assumption is constructing and comparing corresponding semiparametric models. While such a comparison is limited to a semiparametric model and is not universal, it is still valid for a wide range of parametric distributions. Bound analysis is a more universal approach, since bounds can be deduced by assuming only regularity conditions [35,41,42,43,45]. However, bounds are often hard to deduce for complex estimators. Also, sometimes there are discrepancies between maximum bias and average bias. Since the estimators proposed here are all consistent under certain assumptions, measuring their biases is also a convenient way of measuring robustness to departures. The average standardized asymptotic bias is thus defined as follows.
Definition 3 
(Average standardized asymptotic bias). For a single-parameter distribution, the average standardized asymptotic bias (ASAB) is given by $\frac{|\hat{\theta}-\theta|}{\sigma}$, where $\hat{\theta}$ represents the estimation of $\theta$, and $\sigma$ denotes the standard deviation of the kernel distribution associated with the $LU$-statistic. If the estimator $\hat{\theta}$ is not classified as an RI-statistic, QI-statistic, or $LU$-statistic, the corresponding U-statistic, which measures the same attribute of the distribution, is utilized to determine the value of $\sigma$. For a two-parameter distribution, the first step is setting the lower bound of the kurtosis range of interest, $\tilde{\mu}_{4l}$, the spacing $\delta$, and the bin count $C$. Then, the average standardized asymptotic bias is defined as
$$\mathrm{ASAB}(\hat{\theta}) := \frac{1}{C}\sum_{\substack{\delta+\tilde{\mu}_{4l}\,\le\,\tilde{\mu}_{4}\,\le\,C\delta+\tilde{\mu}_{4l}\\ \tilde{\mu}_{4}-\tilde{\mu}_{4l}\ \mathrm{is\ a\ multiple\ of}\ \delta}} \mathrm{E}_{\hat{\theta}|\tilde{\mu}_{4}}\!\left[\frac{|\hat{\theta}-\theta|}{\sigma}\right],$$
where $\tilde{\mu}_{4}$ is the kurtosis specifying the two-parameter distribution and $\mathrm{E}_{\hat{\theta}|\tilde{\mu}_{4}}$ denotes the expected value given fixed $\tilde{\mu}_{4}$.
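A sketch of this binned average, assuming the per-bin conditional expectations have already been estimated (e.g., by a Monte Carlo study); all arrays here are placeholders:

```python
import numpy as np

def asab(e_abs_bias):
    # e_abs_bias[i] = E[ |theta_hat - theta| / sigma | kurtosis = mu4l + (i + 1) * delta ];
    # Definition 3 is simply the average of these C per-bin expectations.
    return float(np.mean(e_abs_bias))

# Hypothetical grid matching Definition 3 with mu4l = 3, delta = 0.1, C = 70:
mu4l, delta, C = 3.0, 0.1, 70
kurt_bins = mu4l + delta * np.arange(1, C + 1)
print(kurt_bins[0], kurt_bins[-1])  # bins run from 3.1 to 10.0
```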
Standardization plays a crucial role in comparing the performance of estimators across different distributions. Currently, several options are available, such as using the root mean square deviation from the mode (as in Gauss [1]), the mean absolute deviation, or the standard deviation. The standard deviation is preferred due to its central role in standard error estimation. In Table 1, $\delta=0.1$ and $C=70$. For the Weibull, gamma, lognormal, and generalized Gaussian distributions, $\tilde{\mu}_{4l}=3$ (there are two shape parameter solutions for the Weibull distribution; the lower one is used here). For the Pareto distribution, $\tilde{\mu}_{4l}=9$. To provide a more practical and straightforward illustration, all results from the five distributions are further weighted by the number of Google Scholar search results. Within the kurtosis range of this setting, nearly all $WL$s and $WHLkm$s proposed here reach, or at least come close to, their maximum biases (SI Dataset S1). The pseudo-maximum bias is thus defined as the maximum value of the biases within the kurtosis range of this setting for all five unimodal distributions. In most cases, the pseudo-maximum biases of invariant moments occur in the lognormal or generalized Gaussian distributions (SI Dataset S1), since, besides unimodality, the Weibull distribution differs entirely from them. Interestingly, the asymptotic biases of $TM_{\epsilon=\frac{1}{8}}$ and $WM_{\epsilon=\frac{1}{8}}$, after averaging and weighting, are $0.107\sigma$ and $0.066\sigma$, respectively, in line with the sharp bias bounds of $TM_{2,14:15}$ and $WM_{2,14:15}$ (a different subscript is used to indicate a sample size of 15, with the removal of the first and last order statistics), $0.173\sigma$ and $0.126\sigma$, for any distributions with finite moments [41,42].

Discussion

Statistics, encompassing the collection, analysis, interpretation, and presentation of data, has evolved over time, with various approaches emerging to meet challenges in practice. Among these approaches, the use of probability models and measures of random variables for data analysis is often considered the core of statistics. While the early development of statistics was focused on parametric methods, there were two main approaches to point estimation. The Gauss-Markov theorem [1,46] states the principle of minimum variance unbiased estimation, which was further enriched by Neyman (1934) [47], Rao (1945) [48], Blackwell (1947) [49], and Lehmann and Scheffé (1950, 1955) [22,23]. Maximum likelihood was first introduced by Fisher in 1922 [50] in a multinomial model and later generalized by Cramér (1946), Hájek (1970), and Le Cam (1972) [18,32,51]. In 1939, Wald [52] combined these two principles and suggested the use of minimax estimates, which involve choosing an estimator that minimizes the maximum possible loss. Following Huber's seminal work [3], M-statistics have dominated the field of parametric robust statistics for over half a century. Nonparametric methods, e.g., the Kolmogorov-Smirnov test, the Mann-Whitney-Wilcoxon test, and Hoeffding's independence test, emerged as popular alternatives to parametric methods in the 1950s, as they do not make specific assumptions about the underlying distribution of the data. In 1963, Hodges and Lehmann proposed a class of robust location estimators based on the confidence bounds of rank tests [53]. In REDS I, when compared to other semiparametric mean estimators with the same breakdown point, the H-L estimator was shown to be the bias-optimal choice, which aligns with Devroye, Lerasle, Lugosi, and Oliveira's conclusion that the median of means is near-optimal in terms of concentration bounds [35], as discussed. The formal study of semiparametric models was initiated by Stein [54] in 1956. Bickel, in 1982, simplified the general heuristic necessary condition proposed by Stein [54] and derived sufficient conditions for this type of problem, adaptive estimation; these conditions were subsequently applied to the construction of adaptive estimates [55]. It has become increasingly apparent that, in robust statistics, many estimators previously called "nonparametric" are essentially semiparametric, as they are partly, though not fully, characterized by some interpretable Euclidean parameters. This approach is particularly useful in situations where the data do not conform to a simple parametric distribution but still have some structure that can be exploited. In 1984, Bickel addressed the challenge of robustly estimating the parameters of a linear model while acknowledging the possibility that the model may be invalid but still within the confines of a larger model [56]. He showed that, by carefully designing the estimators, the biases can be very small. The paradigm shift here opens up the possibility that, by defining a large semiparametric model and constructing estimators simultaneously for two or more very different semiparametric/parametric models within the large semiparametric model, even for a parametric model that belongs to the large semiparametric model but not to the semiparametric/parametric models used for calibration, the performance of these estimators might still be near-optimal due to the common nature shared by the models used by the estimators. Perhaps this approach can be named comparametrics.
Closely related topics are "mixture models" and "constraint-defined models," which were generalized in Bickel, Klaassen, Ritov, and Wellner's classic semiparametric textbook (1993) [57], and the method of sieves, introduced by Grenander in 1981 [58]. As the building blocks of statistics, invariant moments can reduce the overall errors of statistical results across studies and thus can enhance the replicability of the whole community [59,60].

Methods

Methods for generating Table 1 are summarized below, with details in the SI Text. The $d$ values for the invariant moments of the Weibull distribution were approximated using a Monte Carlo study, with the formulae presented in Theorems 1 and 2. The computation of the $I$ functions is summarized in Subsection and further explained in the SI Text. The computation of ASABs and ASBs is described in Subsection . The SEs and SSEs were computed by approximating the sampling distribution using 1000 pseudorandom samples for $n=5184$ and 50 pseudorandom samples for $n=2654208$. The impact of the bootstrap size, ranging from $n=2.7\times10^{2}$ to $n=2.765\times10^{4}$, on the variance of invariant moments and U-central moments was studied using the SE and SSE methods described above. A brute-force approach was used to estimate the maximum biases of the robust estimators discussed for the five unimodal distributions. The validity of this approach is discussed in the SI Text.

Significance Statement

Bias, variance, and contamination are the three main errors in statistics. Consistent robust estimation is unattainable without distributional assumptions. In this article, invariant moments are proposed as a means of achieving near-consistent and robust estimations of moments, even in scenarios where moderate violations of distributional assumptions occur, while the variances are sometimes smaller than those of the sample moments.

Data and Software Availability

Data for Table 1 are given in SI Datasets S1-S4. All code has been deposited in GitHub.

Author Contributions

T.L. designed research, performed research, analyzed data, and wrote the paper.

Conflicts of Interest

The author declares no competing interest.

References

1. Gauss, C.F. Theoria combinationis observationum erroribus minimis obnoxiae; Henricus Dieterich, 1823.
2. Newcomb, S. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics 1886, 8, 343–366.
3. Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Statist. 1964, 35, 73–101.
4. He, X.; Fung, W.K. Method of medians for lifetime data with Weibull models. Statistics in Medicine 1999, 18, 1993–2009.
5. Menon, M. Estimation of the shape and scale parameters of the Weibull distribution. Technometrics 1963, 5, 175–182.
6. Dubey, S.D. Some percentile estimators for Weibull parameters. Technometrics 1967, 9, 119–129.
7. Marks, N.B. Estimation of Weibull parameters from common percentiles. Journal of Applied Statistics 2005, 32, 17–24.
8. Boudt, K.; Caliskan, D.; Croux, C. Robust explicit estimators of Weibull parameters. Metrika 2011, 73, 187–209.
9. Hyndman, R.J.; Fan, Y. Sample quantiles in statistical packages. The American Statistician 1996, 50, 361–365.
10. van Zwet, W.R. Convex Transformations of Random Variables: Nebst Stellingen; 1964.
11. Bowley, A.L. Elements of Statistics; Number 8, King, 1926.
12. Groeneveld, R.A.; Meeden, G. Measuring skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician) 1984, 33, 391–399.
13. Daniell, P. Observations weighted according to order. American Journal of Mathematics 1920, 42, 222–236.
14. Mosteller, F. On Some Useful "Inefficient" Statistics. The Annals of Mathematical Statistics 1946, 17, 377–408.
15. Rao, C.R. Advanced Statistical Methods in Biometric Research; Wiley, 1952.
16. Bickel, P.J. Some contributions to the theory of order statistics. Proc. Fifth Berkeley Sympos. Math. Statist. and Probability, 1967, Vol. 1, pp. 575–591.
17. Chernoff, H.; Gastwirth, J.L.; Johns, M.V. Asymptotic distribution of linear combinations of functions of order statistics with applications to estimation. The Annals of Mathematical Statistics 1967, 38, 52–72.
18. LeCam, L. On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics 1970, 41, 802–828.
19. Bickel, P.; Lehmann, E. Descriptive statistics for nonparametric models I. Introduction. In Selected Works of EL Lehmann; Springer, 2012; pp. 465–471.
20. Bickel, P.J.; Lehmann, E.L. Descriptive statistics for nonparametric models II. Location. In Selected Works of EL Lehmann; Springer, 2012; pp. 473–497.
21. Janssen, P.; Serfling, R.; Veraverbeke, N. Asymptotic normality for a general class of statistical functions and applications to measures of spread. The Annals of Statistics 1984, 12, 1369–1379.
22. Lehmann, E.L.; Scheffé, H. Completeness, similar regions, and unbiased estimation - Part I. In Selected Works of EL Lehmann; Springer, 2011; pp. 233–268.
23. Lehmann, E.L.; Scheffé, H. Completeness, similar regions, and unbiased estimation - Part II; Springer, 2012.
24. Efron, B. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 1979, 7, 1–26.
25. Bickel, P.J.; Freedman, D.A. Some asymptotic theory for the bootstrap. The Annals of Statistics 1981, 9, 1196–1217.
26. Helmers, R.; Janssen, P.; Veraverbeke, N. Bootstrapping U-quantiles; CWI, Department of Operations Research, Statistics, and System Theory [BS], 1990.
27. Richtmyer, R.D. A non-random sampling method, based on congruences, for "Monte Carlo" problems. Technical report, New York Univ., New York. Atomic Energy Commission Computing and Applied …, 1958.
28. Sobol', I.M. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki 1967, 7, 784–802.
29. Do, K.A.; Hall, P. Quasi-random resampling for the bootstrap. Statistics and Computing 1991, 1, 13–22.
30. Lai, T.; Robbins, H.; Yu, K. Adaptive choice of mean or median in estimating the center of a symmetric distribution. Proceedings of the National Academy of Sciences 1983, 80, 5803–5806.
31. Laplace, P.S. Théorie analytique des probabilités; 1812.
32. Cramér, H. Mathematical Methods of Statistics; Vol. 43, Princeton University Press, 1999.
33. Gerlovina, I.; Hubbard, A.E. Computer algebra and algorithms for unbiased moment estimation of arbitrary order. Cogent Mathematics & Statistics 2019, 6, 1701917.
34. Hampel, F.R. Contributions to the Theory of Robust Estimation; University of California, Berkeley, 1968.
35. Devroye, L.; Lerasle, M.; Lugosi, G.; Oliveira, R.I. Sub-Gaussian mean estimators. The Annals of Statistics 2016, 44, 2695–2725.
36. Devlin, S.J.; Gnanadesikan, R.; Kettenring, J.R. Robust estimation and outlier detection with correlation coefficients. Biometrika 1975, 62, 531–545.
37. Hampel, F.R. A general qualitative definition of robustness. The Annals of Mathematical Statistics 1971, 42, 1887–1896.
38. Hampel, F.R. The influence curve and its role in robust estimation. Journal of the American Statistical Association 1974, 69, 383–393.
39. Rousseeuw, P.J.; Hampel, F.R.; Ronchetti, E.M.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; John Wiley & Sons, 2011.
40. Donoho, D.L.; Huber, P.J. The notion of breakdown point. A Festschrift for Erich L. Lehmann 1983, 157–184.
41. Bieniek, M. Comparison of the bias of trimmed and Winsorized means. Communications in Statistics - Theory and Methods 2016, 45, 6641–6650.
42. Danielak, K.; Rychlik, T. Theory & Methods: Exact bounds for the bias of trimmed means. Australian & New Zealand Journal of Statistics 2003, 45, 83–96.
43. Li, L.; Shao, H.; Wang, R.; Yang, J. Worst-case range value-at-risk with partial information. SIAM Journal on Financial Mathematics 2018, 9, 190–218.
44. Bernard, C.; Kazzi, R.; Vanduffel, S. Range Value-at-Risk bounds for unimodal distributions under partial information. Insurance: Mathematics and Economics 2020, 94, 9–24.
45. Mathieu, T. Concentration study of M-estimators using the influence function. Electronic Journal of Statistics 2022, 16, 3695–3750.
46. Markov, A.A. Wahrscheinlichkeitsrechnung; Teubner, 1912.
47. Neyman, J. On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society 1934, 97, 558–606.
48. Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society 1945, 37, 81–91.
49. Blackwell, D. Conditional expectation and unbiased sequential estimation. The Annals of Mathematical Statistics 1947, 105–110.
50. Fisher, R.A. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 1922, 222, 309–368.
51. Hájek, J. Local asymptotic minimax and admissibility in estimation. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1972, Vol. 1, pp. 175–194.
52. Wald, A. Contributions to the theory of statistical estimation and testing hypotheses. The Annals of Mathematical Statistics 1939, 10, 299–326.
53. Hodges Jr., J.; Lehmann, E. Estimates of location based on rank tests. The Annals of Mathematical Statistics 1963, 34, 598–611.
54. Stein, C.M. Efficient nonparametric testing and estimation. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1956, Vol. 1, pp. 187–195.
55. Bickel, P.J. On adaptive estimation. The Annals of Statistics 1982, 10, 647–671.
  56. Bickel, P. Parametric robustness: small biases can be worthwhile. The Annals of Statistics 1984, 12, 864–879. [Google Scholar] [CrossRef]
  57. Bickel, P.; Klaassen, C.A.; Ritov, Y.; Wellner, J.A. Efficient and adaptive estimation for semiparametric models; Vol. 4, Springer, 1993.
  58. Grenander, U. Abstract Inference; 1981. [Google Scholar]
  59. Leek, J.T.; Peng, R.D. Reproducible research can still be wrong: adopting a prevention approach. Proceedings of the National Academy of Sciences 2015, 112, 1645–1646. [Google Scholar] [CrossRef]
  60. National Academies of Sciences, E.; Affairs, P.; Committee on Science, E.; Information, B.; Sciences, D.; Statistics, C.; Analytics, B.; Studies, D.; Board, N.; Education, D.; others. Reproducibility and Replicability in Science; National Academies Press, 2019.
Table 1. Evaluation of invariant moments for five common unimodal distributions in comparison with current popular methods.
| Errors | $\bar{x}$ | TM | H-L | SM | HM | WM | SQM | BM | MoM | MoRM | mHLM | $rm_{exp,BM}$ | $qm_{exp,BM}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WASAB | 0.000 | 0.107 | 0.088 | 0.078 | 0.078 | 0.066 | 0.048 | 0.048 | 0.034 | 0.035 | 0.034 | 0.002 | 0.003 |
| WRMSE | 0.014 | 0.111 | 0.092 | 0.083 | 0.083 | 0.070 | 0.053 | 0.053 | 0.041 | 0.041 | 0.038 | 0.017 | 0.018 |
| WASB ($n=5184$) | 0.000 | 0.108 | 0.089 | 0.078 | 0.079 | 0.066 | 0.048 | 0.048 | 0.034 | 0.036 | 0.033 | 0.002 | 0.003 |
| WSE ∨ WSSE | 0.014 | 0.014 | 0.014 | 0.015 | 0.014 | 0.014 | 0.014 | 0.015 | 0.017 | 0.014 | 0.014 | 0.017 | 0.017 |
| Errors | $HFM_{\mu}$ | $MP_{\mu}$ | $rm$ | $qm$ | $im$ | $var$ | $var_{bs}$ | $Tsd^2$ | $HFM_{\mu_2}$ | $MP_{\mu_2}$ | $rvar$ | $qvar$ | $ivar$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WASAB | 0.037 | 0.043 | 0.001 | 0.002 | 0.001 | 0.000 | 0.000 | 0.200 | 0.027 | 0.042 | 0.005 | 0.018 | 0.003 |
| WRMSE | 0.049 | 0.055 | 0.015 | 0.015 | 0.014 | 0.017 | 0.017 | 0.198 | 0.042 | 0.062 | 0.019 | 0.026 | 0.019 |
| WASB ($n=5184$) | 0.038 | 0.043 | 0.001 | 0.002 | 0.001 | 0.000 | 0.001 | 0.198 | 0.027 | 0.043 | 0.005 | 0.018 | 0.003 |
| WSE ∨ WSSE | 0.018 | 0.021 | 0.015 | 0.015 | 0.014 | 0.017 | 0.017 | 0.015 | 0.024 | 0.032 | 0.018 | 0.017 | 0.018 |
| Errors | $tm$ | $tm_{bs}$ | $HFM_{\mu_3}$ | $MP_{\mu_3}$ | $rtm$ | $qtm$ | $itm$ | $fm$ | $fm_{bs}$ | $HFM_{\mu_4}$ | $MP_{\mu_4}$ | $rfm$ | $qfm$ | $ifm$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WASAB | 0.000 | 0.000 | 0.052 | 0.059 | 0.006 | 0.083 | 0.034 | 0.000 | 0.000 | 0.037 | 0.046 | 0.024 | 0.038 | 0.011 |
| WRMSE | 0.019 | 0.018 | 0.063 | 0.074 | 0.018 | 0.083 | 0.044 | 0.026 | 0.023 | 0.049 | 0.062 | 0.037 | 0.043 | 0.029 |
| WASB ($n=5184$) | 0.001 | 0.003 | 0.052 | 0.059 | 0.007 | 0.082 | 0.038 | 0.001 | 0.009 | 0.037 | 0.047 | 0.024 | 0.036 | 0.013 |
| WSE ∨ WSSE | 0.019 | 0.018 | 0.021 | 0.091 | 0.015 | 0.012 | 0.017 | 0.024 | 0.021 | 0.020 | 0.027 | 0.021 | 0.020 | 0.022 |
1 The first table presents the use of the exponential distribution as the consistent distribution for five common unimodal distributions: the Weibull, gamma, Pareto, lognormal, and generalized Gaussian distributions. Popular robust mean estimators discussed in REDS I were used as comparisons. The breakdown points of the mean estimators in the first table, besides the H-L estimator and the Huber M-estimator, are all 1/8. The second and third tables present the use of the Weibull distribution as the consistent distribution, without/with using the lognormal distribution for the odd ordinal moment optimizations and the generalized Gaussian distribution for the even ordinal moment optimizations. SQM is the robust mean estimator used in the recombined/quantile moments. Unbiased sample central moments ($var$, $tm$, $fm$), U-central moments with quasi-bootstrap ($var_{bs}$, $tm_{bs}$, $fm_{bs}$), and other estimators were used as comparisons. The generalized Gaussian distribution was excluded for the He and Fung M-estimator and the Marks percentile estimator, since the logarithmic function is undefined for negative inputs. The breakdown points of the estimators in the second and third tables, besides the M-estimators and the percentile estimator, are all 1/24. The tables report the average standardized asymptotic bias (ASAB, as $n \to \infty$), root mean square error (RMSE, at $n = 5184$), average standardized bias (ASB, at $n = 5184$), and variance (SE ∨ SSE, at $n = 5184$) of these estimators, all in units of the standard deviation of the distribution or the corresponding kernel distribution. The prefix W indicates that the results were weighted by the number of Google Scholar search results on May 30, 2022 (including synonyms). The calibrations of $d$ values and the computations of ASAB, ASB, and SSE are described in Subsection , and SI Methods. Detailed results and related code are available in SI Dataset S1 and on GitHub.
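For concreteness, the sketch below shows one way weighted, standardized error summaries of the kind reported in Table 1 can be computed. This is a minimal sketch, not the authors' released code: the helper names (standardized_bias, rmse, weighted_summary), the replication scheme, and the example weights are illustrative assumptions only.

```python
import numpy as np

# A minimal sketch, assuming replicated estimates of a known population
# quantity; names, weights, and numbers are illustrative, not the paper's code.

def standardized_bias(estimates, true_value, sd):
    """Absolute bias of replicated estimates, in units of the population sd."""
    return abs(float(np.mean(estimates)) - true_value) / sd

def rmse(estimates, true_value, sd):
    """Root mean square error of replicated estimates, in sd units."""
    e = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((e - true_value) ** 2))) / sd

def weighted_summary(errors_by_dist, weights_by_dist):
    """Average per-distribution errors, weighted (e.g., by search-hit counts)."""
    keys = list(errors_by_dist)
    w = np.asarray([weights_by_dist[k] for k in keys], dtype=float)
    e = np.asarray([errors_by_dist[k] for k in keys], dtype=float)
    return float(np.sum(w * e) / np.sum(w))

# Example: bias and RMSE of the sample mean for an exponential(1) population
# (true mean 1, sd 1), from 1000 replications of samples of size n = 5184.
rng = np.random.default_rng(0)
reps = [np.mean(rng.exponential(1.0, size=5184)) for _ in range(1000)]
print(standardized_bias(reps, true_value=1.0, sd=1.0))
print(rmse(reps, true_value=1.0, sd=1.0))

# Combining hypothetical per-distribution errors into a single W-prefixed
# summary; the weights here are placeholders, not actual search-hit counts.
errors = {"Weibull": 0.010, "gamma": 0.012, "Pareto": 0.050,
          "lognormal": 0.030, "generalized Gaussian": 0.020}
weights = {"Weibull": 120000, "gamma": 150000, "Pareto": 60000,
           "lognormal": 90000, "generalized Gaussian": 30000}
print(f"weighted error ~ {weighted_summary(errors, weights):.3f}")
```

In the actual evaluation, each estimator in Table 1 would take the place of the sample mean above, with the true central moments and the standard deviations of each distribution or its kernel distribution substituted accordingly.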
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
