A Non-Parametric Maximum for Reasonable Number of Rejected Hypotheses : Objective Optima for False Discovery Rate and Significance Threshold in Exploratory Research with Application to Ordinal Survey Analysis

This paper identifies a criterion for choosing the largest reasonable set of rejected hypotheses in high-dimensional data analysis, where multiple hypothesis testing is used in exploratory research to identify significant associations among many variables. The method neither requires a predetermined threshold for the level of significance nor uses a presumed threshold for the false discovery rate. The upper limit on the number of rejected hypotheses is determined by finding the maximum difference between expected true discoveries and expected false discoveries among all possible sets of rejected hypotheses. Methods for choosing a reasonable number of rejected hypotheses and an application to non-parametric analysis of ordinal survey data are presented.


Introduction
In high-dimensional data analysis, multiple simultaneous hypothesis testing arises because we need to identify which null hypotheses, among many, should reasonably be rejected (Neuvial, 2013). A significant finding (a discovery) is a hypothesis that is rejected based on statistical evidence.
When performing multiple hypothesis testing, as the number of hypotheses being tested (m) grows, using a p-value threshold (alpha) such as 0.05 for rejecting hypotheses becomes problematic. A p-value is the probability, under the null hypothesis, of observing a result at least as extreme as the one obtained, so rejecting every hypothesis with a p-value below alpha admits false positives. When the number of hypotheses being tested is large, for example 1000, the expected number of false positives is at most m × alpha. If alpha is 0.05, this means that the expected number of false positives among the significant findings can be as large as 50.
When there are no expected true discoveries, the frequency distribution of p-values is anticipated to be uniform, meaning that the proportion of tests yielding a p-value in any class should be the same. In many situations, "it is reasonable to assume that larger p-values are more likely to correspond to true null hypotheses than smaller ones" (Neuvial 2013, 1428); equivalently, smaller p-values are less likely to correspond to true null hypotheses. In many research situations, p-values have a frequency distribution like Figure 1, where hypotheses with very low p-values outnumber the rest. The first column in Figure 1 presents the hypotheses with p-value < 0.05. These are the hypotheses we might be inclined to declare rejected, that is, significant discoveries; however, the expected number of false positives in this subset can be very high. In other words, many of the rejected hypotheses may be true nulls. Therefore, we may want to lower our rejection threshold (alpha) so that fewer but more significant hypotheses are rejected. Reducing alpha decreases the chance of false positives in our discovery set and thus leads to a smaller chance of false discoveries. Unfortunately, it may also increase the false negatives: by decreasing alpha, we are accepting more false negatives.
The false discovery rate (FDR) is defined as the expected number of false discoveries (false positives among rejected hypotheses) divided by the total number of rejected hypotheses (Neuvial 2013, 2). In many situations, a p-value threshold of 0.05 may lead to a large FDR. Several algorithms have been proposed to take the FDR into account when selecting significant findings.
Bonferroni suggested considering a hypothesis significant only when its p-value is less than or equal to alpha/m (Storey and Tibshirani, 2003, 2). By choosing a rejection threshold much lower than alpha, the probability of making one or more false discoveries is kept less than or equal to alpha.
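As a minimal sketch (with made-up numbers, not taken from any study), the Bonferroni cut-off is simply alpha divided by the number of tests:

```python
# Bonferroni correction: reject only hypotheses whose p-value is at most
# alpha / m, which bounds the family-wise error rate (FWER) by alpha.
def bonferroni_threshold(alpha, m):
    return alpha / m

def bonferroni_reject(p_values, alpha=0.05):
    # Keep only the p-values that survive the per-test threshold.
    t = bonferroni_threshold(alpha, len(p_values))
    return [p for p in p_values if p <= t]

# With m = 1000 tests, the per-test threshold shrinks from 0.05 to 0.00005.
threshold = bonferroni_threshold(0.05, 1000)
survivors = bonferroni_reject([0.00001, 0.04], 0.05)
```

The shrinking threshold is exactly what makes the procedure conservative: with thousands of tests, only extremely small p-values survive.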
Bonferroni's suggestion guarantees a family-wise error rate (FWER) less than or equal to alpha, but this conservative measure can result in many false negatives. When the significant hypotheses are few, the measure is appropriate, because even the expectation of one false positive in the result set is damaging. In many studies where the significant findings are numerous, the researcher may be able to afford a higher FDR if that prevents many false negatives. Failing to detect many important associations may be more harmful than the possibility of one false positive among many significant findings.
Many adaptive hypothesis testing procedures rely on estimates of the proportion of true null hypotheses in the initial pool, obtained with plug-ins, in a single step, in multiple steps, or asymptotically (Blanchard and Roquain, 2009). Plug-in procedures use an estimate of the proportion of true null hypotheses (Neuvial, 2013). Thresholding-based multiple testing procedures reject hypotheses with p-values below a threshold (Neuvial, 2013). Storey and Tibshirani (2003) proposed a strategy that assigns each hypothesis an individual measure of significance, in terms of expected false discovery rate, called the q-value. Most q-value based strategies rely on some estimate of the proportion of true null hypotheses. However, the q-value threshold at which the researcher draws the line of significance remains subjective. Storey (2007) argued that two steps are involved in any multiple-testing procedure. The first step is "determining the order in which the tests should be called significant" by "ranking the tests from most to least significant". The second step is "choosing an appropriate significance cut-off somewhere along this ordering". Storey focuses on performing the first step optimally, given a certain significance framework for the second step. He cites Shaffer (1995) in defining the goal: "to estimate the appropriate cut-off to obtain a particular error rate, usually based on the familywise error rate or false discovery rate". Storey (2007) proposes an optimal discovery procedure based on maximizing Expected True Positives (ETP) for each Expected False Positive (EFP) among all Single Thresholding Procedures (STP). Norris and Kahn (2006) proposed balanced probability analysis (BPA) based on three variables: (i) the total number of true positives (TTP); (ii) the false discovery rate (FDR), defined as the aggregate chance that a true null hypothesis is rejected by statistical accident; and (iii) the false negative rate (FNR), defined as the number of hypotheses that should truly be rejected but are missing from the significance list divided by the total number of hypotheses that should truly be rejected. They believe other definitions of type 2 error rates, such as the false nondiscovery rate (the ratio of hypotheses that should truly be rejected but are undiscovered to the number of unrejected hypotheses), "are difficult to intuit for the nonstatistician". They "calculate the FNR directly from the data, by using resampling to estimate the null and alternate distributions". Their "procedure weakly depends on the estimated FDR, and requires one model-dependent step to optimize a single parameter".
As Norris and Kahn (2006) have argued, the true FDR can be accurately determined only when the TTP is known. Using an adaptation of the algorithm by Storey and Tibshirani (2003), they estimated the TTP, then the FDR, and then the FNR based on their estimates of FDR and TTP.

A Non-Parametric Maximum for Reasonable Number of Rejected Hypotheses
This article is concerned with the second step mentioned above, "choosing an appropriate significance cut-off somewhere along this ordering", but without needing to know or estimate the total number of true positives or the total number of true negatives.
Although setting a subjective threshold for the FDR (such as 0.05) can relax Bonferroni's extremely conservative suggestion, it is itself a limitation that may unnecessarily restrict the number of reasonable findings a researcher can report. I will show that, in some situations, one can identify, grounded in the observed data, an objective upper bound for the level of significance and the FDR that it is reasonable for the researcher to report.
When we tabulate the p-values resulting from a study into sorted classes (from smallest to largest p-value), we obtain the frequency of each observed p-value. We have a special interest in the set of smallest p-values; thus, the first class is the most valuable one. All the hypotheses with the p-value closest to zero (or equal to zero, if such hypotheses exist) form set S1, which will have f1 members (f1 >= 1).
The next smallest p-value will be p2. Set S2 will contain all the hypotheses with a p-value of p2; S2 will have f2 members (f2 >= 1). For each of the k observed p-values there will be a corresponding frequency fi and a set of hypotheses Si.
The total number of hypotheses tested is N = f1 + f2 + … + fk, where fi is the frequency of hypotheses in set Si. If we set alpha (the rejection threshold) at p1, we will have f1 rejected hypotheses, of which p1 × N are expected to be false discoveries (EFD1):

EFD1 = p1 × N

Therefore, from the first set we expect to have

ETD1 = f1 − p1 × N

expected true discoveries if we reject the hypotheses with p-value less than or equal to p1. We may be interested in including the set of f2 hypotheses S2 in our discoveries, but the p-value of these hypotheses is p2, and the expected false discoveries in the rejected set R2 = S1 ∪ S2 will be p2 × N. Since p2 > p1, p2 × N is always bigger than p1 × N. p2 × N is the cumulative expected false discoveries (CEFD) in R2:

CEFD2 = p2 × N

Therefore, from the first two sets we expect the cumulative expected true discoveries (CETD) in R2 to be:

CETD2 = (f1 + f2) − p2 × N

CETD2 will be bigger than ETD1 whenever f2 > (p2 − p1) × N, which is typical for the first sets. The series of expected false discoveries in each set, EFD1, EFD2, EFD3, …, is increasing, because the p-values are getting bigger. The series of cumulative expected true discoveries, ETD1, CETD2, CETD3, …, is usually increasing over the first sets. But the p-values are increasing, and by adding each set to the rejected set we are in fact increasing our alpha: the proportion of false discoveries contributed by a set Sj (j > i) to Rj is greater than the proportion contributed by Si to Ri, and the proportion of true discoveries contributed by Sj to Rj is smaller than that contributed by Si to Ri. As i goes toward k, pi goes toward 1. If we define

δi = CETDi − CEFDi

then at some point δi must start to decrease, so it must have a maximum. The maximum happens at some set Smax, after which adding the hypotheses in the next set Smax+1 (setting alpha at pmax+1) contributes more to the false discoveries than to the true discoveries.
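The quantities above can be sketched in a few lines of Python. The p-values and frequencies here are synthetic illustrations, not the paper's survey data:

```python
import numpy as np

# Synthetic example: frequencies f_i of hypotheses at k sorted distinct
# p-values p_1 <= ... <= p_k. N is the total number of hypotheses tested.
p = np.array([0.0001, 0.001, 0.01, 0.05, 0.2, 0.6])  # distinct p-values
f = np.array([8, 5, 4, 6, 10, 967])                   # set sizes f_i
N = f.sum()                                           # N = 1000 here

n = np.cumsum(f)       # hypotheses rejected after adding each set S_i
CEFD = p * N           # cumulative expected false discoveries, p_i * N
CETD = n - CEFD        # cumulative expected true discoveries
delta = CETD - CEFD    # quality of each rejected set R_i

i_max = int(np.argmax(delta))  # index of S_max, where delta peaks
```

For these numbers delta rises over the first two sets and then falls, so the peak lands at the second set: rejecting further sets would add more expected false discoveries than true ones.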
Rmax is the largest set of rejected hypotheses that it is reasonable to report. The largest alpha that is reasonable as the threshold for rejecting hypotheses is pmax. FDRmax is the biggest FDR that is reasonable to report.
That is the point at which we have no incentive to add the next set to our discoveries. If we add set Smax+1 to our set of rejected hypotheses, the difference between CETD and CEFD (δ) will start to decline. Rmax is thus an objective upper bound for the number of hypotheses we reject. If the maximum δmax occurs when we add Smax to the set of rejected hypotheses, we have decided that the threshold alpha for rejecting null hypotheses is pmax, and we reject hypotheses with p-value <= pmax.
If we have k observed p-values p1 <= p2 <= p3 <= … <= pk, related to the sets of tested hypotheses S1, S2, S3, …, Sk, then δmax occurs when we add set Smax to our rejected hypotheses.
The number of rejected hypotheses at level alpha = pmax will be nmax = f1 + f2 + … + fmax, and the biggest reasonable set of rejected hypotheses Rmax will be:

Rmax = S1 ∪ S2 ∪ … ∪ Smax
The maximum CETD can be calculated from the following formula:

CETDmax = (f1 + f2 + … + fmax) − pmax × N

Table 1 summarizes the discussion above. Notice that the upper limit for the number of rejected hypotheses is determined by maximizing the difference between expected true discoveries and expected false discoveries. Alpha is reported (not assumed) and is not subjectively selected. The FDR is dictated by the data. If the researcher decides to add more sets to the discoveries, he or she is accepting the cost of adding more false discoveries than true discoveries to the set of rejected hypotheses.

Objective Optima for False Discovery Rate and Significance Threshold
Extending the set of rejected hypotheses beyond Rmax may increase the CETD, but it will increase the CEFD even more; it will decrease the quality of the discovery as measured by δ. At Rmax, however, there is no sharp or sudden decrease of δ. Delta usually changes relatively slowly around Rmax: we have a peak and a slow reversal of the trend. The researcher can therefore use various forms of piecewise regression to identify an optimum number of rejected hypotheses well below Rmax but well above what the most conservative criteria would allow.
For example, a piecewise regression of the p-values of the hypotheses in sets S1 to Smax against the number of observations in R1 to Rmax, with one breakpoint, can model the observations with two line segments. The breakpoint, where the slope changes from one line to the other, is where the efficiency of adding more hypotheses to R changes. It is an objective threshold at which the rejected hypotheses are fewer than in Rmax, while the CETD stays close to the CETD at Rmax, resulting in a better FDR with little loss of CETD. Therefore, the number of rejected hypotheses at the breakpoint, Rbp, is an optimal number of hypotheses: it does not decrease the quality of our discovery, measured by δ, very much.
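A rough sketch of the one-breakpoint idea: grid-search every candidate breakpoint, fit a least-squares line on each side, and keep the split with the smallest total squared error. The curve below is hypothetical (a shallow early slope followed by a steeper one, mimicking p-value growth against rejection count), not the survey's data:

```python
import numpy as np

def fit_line(x, y):
    # Least-squares line y ~ a*x + b; return the sum of squared residuals.
    A = np.vstack([x, np.ones_like(x)]).T
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(((y - A @ coef) ** 2).sum())

def one_breakpoint(x, y):
    # Try every interior split; keep the one minimizing total squared error.
    best_b, best_err = None, np.inf
    for b in range(2, len(x) - 1):
        err = fit_line(x[:b], y[:b]) + fit_line(x[b:], y[b:])
        if err < best_err:
            best_b, best_err = b, err
    return best_b

# Hypothetical p-value curve: slope changes at index 10.
x = np.arange(20, dtype=float)
y = np.where(x <= 10, 0.001 * x, 0.01 + 0.01 * (x - 10))
bp = one_breakpoint(x, y)
```

More elaborate segmented-regression routines exist, but for a single breakpoint on a monotone curve this exhaustive search is simple and exact.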
A more computationally intensive piecewise regression of the p-values of the hypotheses in sets S1 to Smax+ε can be conducted such that the second segment is a horizontal line close to the peak, or one that passes through the peak itself. The resulting set of discoveries is not very sensitive to the choice of piecewise regression method.

Example: application of method to non-parametric analysis of survey data
Table 1 shows the sorted results of applying the procedure to a large survey regarding "variables influencing citizen engagement in mediated democracies". Fisher's exact test was used to check the significance of the associations observed in the cross-tabulated data, and Somers' D was used to measure the extent of association. The null hypothesis was that the observed association in a crosstabulation is accidental.
For one of the outcome variables, 1031 hypotheses regarding crosstabulations were tested. If we rely on the 0.05 rule of thumb for alpha, too many hypotheses will be falsely rejected. If we rely on the 0.05 rule of thumb for the FDR, many potentially significant findings may falsely remain unrejected. Notice that Bonferroni's correction for alpha = 0.05 would suggest a rejection threshold of 0.05/1031 = 0.0000485, which means we can conservatively reject 16 hypotheses at an FDR of 0.002414.
We observe 7 hypotheses with a p-value of 0 in set S1, which will obviously be rejected. If we decide to reject the hypothesis in the second set, at p-value = 0.000001, we add 1 hypothesis to the set of rejected hypotheses. The single added hypothesis contributes 0.998956 to the total expected true discoveries. Rejecting the hypotheses in sets S1 and S2, we are in fact declaring the threshold alpha to be 0.000001; the cumulative expected false discoveries will be p2 × 1031 ≈ 0.001044, and the FDR will be 0.001044/8 = 0.000131.
We may be interested in rejecting more hypotheses in the next sets. If we reject all the hypotheses in sets S1 to S36, we will have 42 hypotheses in our set of rejected hypotheses R36, and the FDR will grow to 0.048322. Like many researchers who would not reject set S37, we can define our alpha to be 0.001944; in other words, we reject hypotheses with p-value less than or equal to 0.001944. This is more powerful than Bonferroni's correction, but we expect 2.029536 of the 42 discoveries to be false.
Set S177 contains the 184th p-value, at 0.0393. The resulting set of rejected hypotheses, S1 to S177, is expected to contain 41.0292 false discoveries and 142.9708 true discoveries. The expected false discovery rate resulting from increasing alpha to 0.0393 will be 0.222985.
The p-value of each set can be observed in Figure 2. Since we have sorted the hypotheses by their p-values, as we include more sets of hypotheses in our rejected set, the alpha (threshold p-value) increases. Depicted in red, we see that an FDR of 0.05 allows about 42 hypotheses to be rejected. The green line depicts the CETD. The p-values of the first sets are very low, and these hypotheses are most likely to be true rejections; when we reject the first sets of hypotheses, the CETD grows very fast. Even when we pass the threshold of FDR = 0.05, the p-values of the next sets are still very low, which keeps the FDRs close to 0.05. For example, in the study presented above, the hypothesis in set S37 has a p-value just above 0.001944, and FDR37 is 0.050525.
If we add S37 to our rejected set R, the CETD will grow and the CEFD will also grow, but the growth of the CETD is much faster. This trend, however, does not last forever. As p-values get bigger, the CEFD grows faster and the CETD grows more slowly. If we continue rejecting hypotheses with big p-values, the CEFD accelerates and surpasses the CETD; the CETD starts to decline when the p-values included in the rejection set get close to 1. If we look at the difference CETD − CEFD, shown in the last column of Table 1, we can be sure that it has a maximum, above which rejecting a further set of hypotheses contributes more to the CEFD than to the CETD, so the difference starts to decline. In Figure 4, δ, the difference between the expected true discoveries and the expected false discoveries among the rejected hypotheses, is depicted as a black line. As expected, it has several local minima and maxima, but it has a global maximum. Let us name the number of hypotheses rejected at this point Rmax. The FDR is always growing: with every new hypothesis we reject, we increase the proportion of false discoveries in the rejected set. Rejecting more hypotheses after we have reached Rmax weakens the quality of the discoveries in an absolute sense. Table 1 shows that the p-value of set S177 is 0.0393. Rejecting hypotheses beyond Rmax, for example rejecting set S178, which contains hypothesis 185, may increase the CETD, but it will increase the CEFD even more; it will decrease the quality of the discovery, because delta will go from 101.9416 to 97.42928. Rmax is the maximum number of rejected hypotheses our data can justify, and it dictates a maximum acceptable significance level alpha given the data we have. In this data, Rmax does not appear as a sharp peak at which the trend turns; it is a peak around which the trend reverses slowly. Therefore, we can use many methods to suggest a reasonable number of rejected hypotheses much lower than Rmax.
We can use piecewise regression to identify two line segments that mimic the data up to Rmax. The breakpoint is found at S105: if we reject sets S1 to S105, that is, the 110 hypotheses with the lowest p-values, we will have δ105 = 88.10299, close to δmax = 101.9416 at Rmax, with FDR105 = 0.10314, about half of FDRmax = 0.222985. As shown in Table 1, the p-value of set S105 is p105 = 0.010966, about a third of pmax = 0.0393. Segmented regression is just one of many ways the researcher can use the information about Rmax. The researcher can devise a more objective strategy to select the set of rejected hypotheses without relying on 0.05 or any other presumed threshold for alpha or FDR, and should then report the resulting alpha and FDR instead of assuming them. In the example shown above, the optimum (the breakpoint of the piecewise regression) is not very sensitive to the method of conducting the regression. Either way, it suggests rejecting about 10% of the hypotheses, which is much more than the number that could be rejected based on the FDR = 0.05 criterion and much less than the absolute maximum reasonable number of rejected hypotheses.
In much exploratory research, the goal is to identify a set of significant associations. Often, the extents of association (like slopes in linear regression) are more important for understanding the phenomenon, or for modeling the system, than the differences in FDR associated with each p-value among the significant findings. To test the quality of the resulting set of rejected hypotheses, the non-parametric Somers' D statistic for the extent of association was calculated for each comparison. It was observed that nearly all the rejected hypotheses had a level of association whose confidence interval lay entirely on one side of zero.

Discussion
In exploratory research, or for researchers for whom a few more possible false positives among many truly rejected hypotheses is not a sensitive issue, relying on a predetermined threshold of 0.05 for the FDR may be too limiting. But accepting larger and larger FDRs is not a reasonable approach either. The process explained in this paper neither requires a predetermined threshold for the level of significance nor uses a presumed threshold for the false discovery rate. We observed a naturally occurring metric for the quality of the set of rejected hypotheses, and this metric has an upper bound. The researcher can rely on this maximum and devise methods to find an optimum that remains acceptable in terms of quality of discovery. Once the set of rejected hypotheses is determined, the corresponding significance level and FDR should be reported.
The paper presented methods that can identify an optimum reasonable number of rejected hypotheses. The optimum found lies in the range between the most conservative selection criteria, such as the one used in Bonferroni's procedure, and the identified upper bound.
The criterion and methods can be used in many fields of inquiry dealing with high-dimensional data, including genomics and survey analysis. The results of applying the criterion to the pairwise crosstabulation analysis of an ordinal outcome variable against 1031 potential ordinal predictors in a large survey, regarding "variables influencing citizen engagement in mediated democracies", were used as an example of the method's application in the social sciences.
One can follow these steps to identify the δmax that the data can afford.
1. Sort the p-values from smallest to largest.
2. Tabulate the hypotheses into classes of observed p-values.
3. Reject the set of hypotheses with the smallest p-value (the first set is called S1).
4. Calculate the cumulative expected false discoveries for all the rejected hypotheses: CEFDi = pi × N.
5. Calculate the cumulative expected true discoveries for all the rejected hypotheses: CETDi = ni − CEFDi, where ni is the number of rejected hypotheses.
6. Calculate δi = CETDi − CEFDi.
7. Record the results.
8. Repeat steps 3 to 7, adding the next set each time, until all sets are processed.
9. Find the set with the maximum recorded δ, called δmax, resulting from rejecting set Smax.
10. The biggest reasonable set of rejected hypotheses will be Rmax = S1 ∪ S2 ∪ S3 ∪ … ∪ Smax.
11. The p-value of set Smax is pmax, which is the alpha that should be reported.
12. The FDR that should be reported is FDRmax = (pmax × N) / (f1 + f2 + … + fmax).
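The steps above can be sketched as a single function. This is a minimal implementation under the paper's definitions; the toy input at the end is hypothetical, not the survey data:

```python
import numpy as np

def max_reasonable_rejections(p_values):
    """Tabulate sorted p-values, compute cumulative expected true and
    false discoveries, and report the rejection count, alpha, and FDR
    at the maximum of delta = CETD - CEFD."""
    p = np.asarray(p_values, dtype=float)
    N = p.size
    # Steps 1-2: sort and tabulate into classes of distinct p-values.
    distinct, counts = np.unique(p, return_counts=True)  # ascending order
    # Steps 3-8: cumulative rejection counts and expected discoveries.
    n = np.cumsum(counts)      # hypotheses rejected through set S_i
    cefd = distinct * N        # CEFD_i = p_i * N
    cetd = n - cefd            # CETD_i = n_i - CEFD_i
    delta = cetd - cefd        # quality of the rejected set R_i
    # Step 9: the set S_max with the largest recorded delta.
    i = int(np.argmax(delta))
    # Steps 10-12: report the rejection count, alpha = p_max, and FDR_max.
    return {"n_rejected": int(n[i]),
            "alpha": float(distinct[i]),
            "fdr": float(cefd[i] / n[i]),
            "delta_max": float(delta[i])}

# Toy usage: 5 strong signals among 100 tests (hypothetical values).
result = max_reasonable_rejections([0.001] * 5 + [0.5] * 95)
```

With these toy values the peak falls at the first class, so the procedure rejects the 5 strong signals, reports alpha = 0.001, and reports the FDR dictated by the data rather than a presumed 0.05.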

Figure 1, frequency distribution of p-values with many expected true discoveries

Figure 2, all p-values for the 1031 hypotheses tested

Figure 4, maximum δ and the breakpoint of the piecewise regression

Figure 5, maximum δ and the breakpoint of the piecewise regression with a horizontal piece

Table 1, frequency distribution of p-values from the 1031 tested hypotheses