Robust Estimations of the Central Moments
In 1928, Fisher constructed k-statistics as unbiased estimators of cumulants [18]. Halmos (1946) proved that a functional admits an unbiased estimator if and only if it is a regular statistical functional of degree k, and showed a relation among symmetry, unbiasedness, and minimum variance [19]. Hoeffding, in 1948, generalized U-statistics [20], which enable the derivation of a minimum-variance unbiased estimator from each unbiased estimator of an estimable parameter. In 1984, Serfling pointed out the special status of the Hodges-Lehmann estimator, which is neither a simple L-statistic nor a U-statistic, and considered generalized L-statistics and trimmed U-statistics [21]. Given a kernel function which is a symmetric function of k variables, the -statistic is defined as:
where (proven in Subsection ?? in REDS III [22]), are the n choose k elements from the sample, denotes the -statistic with the sorted sequence serving as an input. In the context of Serfling’s work, the term ‘trimmed U-statistic’ is used when is [21].
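The construction described above (evaluate a symmetric kernel on every k-subset of the sample, sort the results, then feed the sorted sequence to a location estimator) can be sketched as follows. This is a minimal illustration, not the paper's definition: the function names and the `location` parameter are hypothetical, and the plain mean and median stand in for the L-type estimators discussed in the text.

```python
from itertools import combinations
from statistics import mean, median, variance


def kernel_values(sample, kernel, k):
    """Evaluate a symmetric kernel on every k-subset and return the sorted values."""
    return sorted(kernel(*subset) for subset in combinations(sample, k))


def u_statistic(sample, kernel, k, location=mean):
    """Apply a location estimator to the sorted kernel values.

    location=mean gives the classical U-statistic. Other choices, such as
    the median, are illustrative stand-ins for a robust L-type estimator.
    """
    return location(kernel_values(sample, kernel, k))


def pair_mean(x1, x2):
    """Kernel whose U-statistic is the sample mean."""
    return (x1 + x2) / 2


def half_squared_difference(x1, x2):
    """Kernel whose U-statistic is the unbiased sample variance."""
    return (x1 - x2) ** 2 / 2


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
sample_mean = u_statistic(data, pair_mean, 2)
unbiased_variance = u_statistic(data, half_squared_difference, 2)
reference_variance = variance(data)  # textbook n-1 estimator, for comparison
# Median of the pairwise means over distinct pairs (a Hodges-Lehmann-type estimate).
pairwise_median = u_statistic(data, pair_mean, 2, location=median)
```

Replacing the mean by another location estimator changes only the last step, which is why the kernel values are computed and sorted separately.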
In 1997, Heffernan [11] obtained an unbiased estimator of the kth central moment by using
U-statistics and demonstrated that it is the minimum variance unbiased estimator for distributions with the finite first
moments. The weighted H-L
kth central moment (
) is thus defined as,
where
is used as the
in
,
, the second summation is over
to
with
and
[
11]. Despite the complexity, the following theorem offers an approach to infer the general structure of such kernel distributions.
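As an illustration of the kernel just described, the sketch below implements a kernel of the form outlined in the proof of Theorem 3: a double summation over j = 0, ..., k-2 plus a signed multiple of the product of all variables. The exact coefficients used here are an assumption reconstructed from that description and checked against the known unbiased estimators for k = 2 and k = 3; the function names are hypothetical. The plain mean is used for aggregation, so this is the ordinary U-statistic, not the weighted H-L version.

```python
from itertools import combinations
from math import comb, prod
from statistics import variance


def central_moment_kernel(xs):
    """Kernel for the k-th central moment, with k = len(xs) >= 2.

    Assumed form:
      sum_{j=0}^{k-2} (-1)^j / (k-j) * sum x_{i1}^(k-j) * x_{i2} ... x_{i(j+1)}
      + (-1)^(k-1) * (k-1) * x_1 ... x_k
    where i1 differs from the other indices and i2 < ... < i(j+1).
    """
    k = len(xs)
    total = 0
    for j in range(k - 1):  # j = 0, ..., k-2
        inner = 0
        for i1 in range(k):
            others = [x for i, x in enumerate(xs) if i != i1]
            # Choosing j of the remaining variables enforces i2 < ... < i(j+1).
            for subset in combinations(others, j):
                inner += xs[i1] ** (k - j) * prod(subset)
        total += (-1) ** j * inner / (k - j)
    total += (-1) ** (k - 1) * (k - 1) * prod(xs)
    return total


def central_moment_u_statistic(sample, k):
    """Average the kernel over all k-subsets of the sample."""
    values = [central_moment_kernel(c) for c in combinations(sample, k)]
    return sum(values) / comb(len(sample), k)


data = [1.0, 2.0, 4.0, 7.0, 11.0]
second = central_moment_u_statistic(data, 2)
third = central_moment_u_statistic(data, 3)
reference_variance = variance(data)  # unbiased sample variance, for comparison
```

For k = 2 the kernel reduces to half the squared difference, so `second` should agree with the unbiased sample variance.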
Theorem 2. Define a set T comprising all pairs such that with and is the probability density of the -tuple, (a formula drawn after a modification of the Jacobian density theorem). is a subset of T, consisting of all those pairs for which the corresponding -tuples satisfy that . The component quasi-distribution, denoted by , has a quasi-pdf , i.e., sum over all such that the pair is in the set and the first element of the pair, , is equal to . The th, where , central moment kernel distribution, labeled , can be seen as a quasi-mixture distribution comprising an infinite number of component quasi-distributions, s, each corresponding to a different value of Δ, which ranges from to 0. Each component quasi-distribution has a support of .
Proof. The support of is the extrema of the function subjected to the constraints, and . Using the Lagrange multiplier, the only critical point can be determined at , where . Other candidates are within the boundaries, i.e., , , , , . can be divided into groups. The gth group has the common factor , if and the final th group is the term . If and , the gth group has terms having the form . If and , the gth group has terms having the form . If and , the gth group has terms having the form . If and , the gth group has terms having the form . If and , the gth group has terms having the form . So, if , , , the summed coefficient of is . The summation identities are and . If and , . If and , the summed coefficient of is , the same as above. If , since , the related terms can be ignored, so, using the binomial theorem and beta function, the summed coefficient of is .
According to the binomial theorem, the coefficient of in is , same as the above summed coefficient of , if . If , the coefficient of is , same as the corresponding summed coefficient of . Therefore, , the maximum and minimum of follow directly from the properties of the binomial coefficient. □
The component quasi-distribution, , is closely related to , which is the pairwise difference distribution, since . Recall that Theorem 1 established that is monotonic increasing with a mode at zero if the original distribution is unimodal, is thus monotonic decreasing with a mode at zero. In general, if assuming the shape of is uniform, is monotonic left and right around zero. The median of also exhibits a strong tendency to be close to zero, as it can be cast as a weighted mean of the medians of . When is small, all values of are close to zero, resulting in the median of being close to zero as well. When is large, the median of depends on its skewness, but the corresponding weight is much smaller, so even if is highly skewed, the median of will only be slightly shifted from zero. Denote the median of as , for the five parametric distributions here, s are all for and , where is the standard deviation of (SI Dataset S1). Assuming , for the even ordinal central moment kernel distribution, the average probability density on the left side of zero is greater than that on the right side, since . This means that, on average, the inequality holds. For the odd ordinal distribution, the discussion is more challenging since it is generally symmetric. Just consider , let and , changing the value of from to will monotonically change the value of , since , . If the original distribution is right-skewed, will be left-skewed, so, for , the average probability density of the right side of zero will be greater than that of the left side, which means, on average, the inequality holds. In all, the monotonic decreasing of the negative pairwise difference distribution guides the general shape of the th central moment kernel distribution, , forcing it to be unimodal-like with the mode and median close to zero, then, the inequality or holds in general. 
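The monotonicity of the pairwise difference distribution that drives this argument can be checked numerically. The sketch below is only an illustration under stated assumptions: the pairwise difference is taken as the smaller value minus the larger one, so that it is never positive, and a deterministic grid of exponential quantiles stands in for a random sample from a unimodal distribution. All function names are hypothetical.

```python
from itertools import combinations
from math import log


def exponential_quantile_grid(n):
    """Quantiles of the standard exponential at (i - 0.5) / n, in increasing order."""
    return [-log(1 - (i - 0.5) / n) for i in range(1, n + 1)]


def pairwise_differences(sorted_sample):
    """Smaller value minus larger value for every pair, so each difference is <= 0."""
    return [a - b for a, b in combinations(sorted_sample, 2)]


def bin_counts(values, lower, upper, bins):
    """Count the values falling into equal-width bins on [lower, upper]."""
    width = (upper - lower) / bins
    counts = [0] * bins
    for v in values:
        if lower <= v <= upper:
            index = min(int((v - lower) / width), bins - 1)
            counts[index] += 1
    return counts


grid = exponential_quantile_grid(300)
diffs = pairwise_differences(grid)
# Four bins on [-2, 0]; the counts should grow as the bins approach zero.
counts = bin_counts(diffs, -2.0, 0.0, 4)
```

Under this sign convention the counts increase toward zero; flipping the sign of every difference gives the mirrored, decreasing shape.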
If a distribution is th -ordered and all of its central moment kernel distributions are also th -ordered, it is called completely th -ordered.
Another crucial property of the central moment kernel distribution, location invariance, is introduced in the next theorem.
Theorem 3. .
Proof. Recall that for the kth central moment, the kernel is
, where the second summation is over
to
with
and
[
11].
consists of two parts. The first part, , involves a double summation over certain terms. The second part, , carries an alternating sign and involves multiplication of the constant with the product of all the x variables, . Consider each multiplication cluster for j ranging from 0 to in the first part. Let each cluster form a single group. The first part can be divided into groups. Combine this with the second part . Together, the terms of can be divided into a total of groups. From the 1st to th group, the gth group has terms having the form . The final th group is the term .
There are two ways to divide into groups according to the form of each term. The first choice is, if , the gth group of has terms having the form , where are fixed, are selected such that and . Define another function , the first group of is , the hth group of , , has terms having the form . Transforming by , then combining all terms with , , the summed coefficient is since the summation starts from l and ends at , the first term includes the factor , the final term includes the factor , the terms in the middle are also zero due to the factorial property.
Another possible choice is the gth group of has terms having the form
, provided that , , where are fixed, and are selected such that and . Transforming these terms by , then there are terms having the form . Transforming the final th group of by , then, there is one term having the form . Another possible combination is that the gth group of contains terms having the form . Transforming these terms by , then there is only one term having the form . The above summation should also be included, i.e., , . So, combining all terms with , according to the binomial theorem, the summed coefficient is . The summation identities required are and . These two summation identities are proven in Lemmas 4 and 5 in the SI Text.
Thus, in either case, all terms including can be canceled out. The proof is complete by noticing that the remaining part is . □
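The invariance argued in this proof can be verified exactly on small inputs. The sketch below assumes that Theorem 3 states that replacing every argument x by scale * x + shift multiplies the kernel by scale to the power k, which is how the surrounding text uses it. The kernel coefficients are the same reconstruction assumed from the description in the proof, and exact rational arithmetic avoids rounding noise. The names `affine`, `scale`, and `shift` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def central_moment_kernel(xs):
    """Kernel for the k-th central moment, k = len(xs) >= 2 (assumed form)."""
    k = len(xs)
    total = 0
    for j in range(k - 1):  # j = 0, ..., k-2
        inner = 0
        for i1 in range(k):
            others = [x for i, x in enumerate(xs) if i != i1]
            for subset in combinations(others, j):
                inner += xs[i1] ** (k - j) * prod(subset)
        total += (-1) ** j * inner / (k - j)
    total += (-1) ** (k - 1) * (k - 1) * prod(xs)
    return total


def affine(xs, scale, shift):
    """Apply x -> scale * x + shift to every argument."""
    return tuple(scale * x + shift for x in xs)


xs = (Fraction(1, 2), Fraction(-6, 5), Fraction(3), Fraction(11, 5))
scale, shift = Fraction(-3, 2), Fraction(10)

checks = {}
for k in (2, 3, 4):
    point = xs[:k]
    transformed = central_moment_kernel(affine(point, scale, shift))
    expected = scale ** k * central_moment_kernel(point)
    checks[k] = (transformed == expected)  # exact comparison of rationals
```

Each entry of `checks` compares the two sides as exact fractions, so a mismatch would point to a wrong coefficient rather than to floating-point error.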
A direct result of Theorem 3 is that, after standardization is invariant to location and scale. So, the weighted H-L standardized kth moment is defined to be
To avoid confusion, it should be noted that the robust location estimations of the kernel distributions discussed in this paper differ from the approach taken by Joly and Lugosi (2016) [23], which computes the median of all U-statistics from different disjoint blocks. Compared to bootstrap median U-statistics, this approach can produce two additional kinds of finite sample bias: one arises from the limited number of blocks, and the other is due to the size of the U-statistics (consider the mean of all U-statistics from different disjoint blocks: it is definitely not identical to the original U-statistic, except when the kernel is the Hodges-Lehmann kernel). The median of randomized U-statistics of Laforgue, Clemencon, and Bertail (2019) [24] is more sophisticated and can overcome the limitation on the number of blocks, but the second kind of bias remains unsolved.