Version 1
: Received: 18 January 2022 / Approved: 20 January 2022 / Online: 20 January 2022 (15:32:13 CET)

How to cite:
Heckelei, T.; Hüttel, S.; Odening, M.; Rommel, J. The Replicability Crisis and the p-Value Debate – what Are the Consequences for the Agricultural and Food Economics Community? Preprints 2022, 2022010311. https://doi.org/10.20944/preprints202201.0311.v1

APA Style

Heckelei, T., Hüttel, S., Odening, M., & Rommel, J. (2022). The Replicability Crisis and the p-Value Debate – what Are the Consequences for the Agricultural and Food Economics Community? Preprints. https://doi.org/10.20944/preprints202201.0311.v1

Chicago/Turabian Style

Heckelei, Thomas, Silke Hüttel, Martin Odening, and Jens Rommel. 2022. "The Replicability Crisis and the p-Value Debate – what Are the Consequences for the Agricultural and Food Economics Community?" Preprints. https://doi.org/10.20944/preprints202201.0311.v1

Abstract

A lively debate is ongoing in the scientific community about statistical malpractice and the related publication bias. No general consensus exists on the consequences, and this is reflected in the heterogeneous rules that scientific journals define for the use and reporting of statistical inference. This paper discusses how the debate is perceived by the agricultural economics community and what it implies for our roles as researchers, contributors to the scientific publication process, and teachers. We start by summarizing the current state of the p-value debate and the replication crisis, and commonly applied statistical practices in our community. This is followed by the motivation, design, results and discussion of a survey on statistical knowledge and practice among researchers in the agricultural economics community in Austria, Germany and Switzerland. We conclude that beyond short-term measures like changing rules of reporting in publications, a cultural change regarding empirical scientific practices is needed that stretches across all our roles in the scientific process. Acceptance of scientific work should rest largely on theoretical and methodological rigor, with perceived relevance arising from the questions asked, the methodology employed, and the data used, but not from the results generated. Revised and clear journal guidelines, the creation of resources for teaching and research, and public recognition of good practice are suggested measures to move forward.

Business, Economics and Management, Econometrics and Statistics

Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received:
26 January 2022
Commenter:
Norbert Hirschauer
The commenter has declared there is no conflict of interests.

Comment:
This comment is a slightly shortened version of a comment I sent the paper’s authors on December 29, 2021. All arguments still apply because the paper has not been revised.

Dear colleagues, congratulations on your Discussion Paper. I have three major sets of comments that I would like to share. The first concerns your presentation of Imbens’ (2021) paper, the second what I find to be a lack of clarity regarding the question of methodological choice, and the third your treatment of violations of assumptions.

1 Your presentation of Imbens’ (2021) paper
I am confused about your presentation/interpretation of the paper by Imbens (2021). On page 2, you seem to create the impression that Imbens is a representative of those who counter criticisms of statistical significance testing. In my reading of Imbens’ paper, this is not correct. I believe confusion arises because you jump from criticisms of statistical significance testing to a justification of p-values, without clearly distinguishing the two approaches. You write verbatim: “However, this [i.e., criticisms of significance testing] is countered by others who acknowledge existing problems but nevertheless defend p-values, basically saying that nothing is wrong with p-values if they are used correctly (Imbens, 2021).” While Imbens identifies a research setting where he thinks that p-values are a meaningful way of reporting the evidence, he does not support binary significance statements. Therefore, I think that the confusing presentation of Imbens’ reasoning, which recurs on page 20, does not do justice to the crucial message of his paper:

Imbens does not support statistical significance testing. On the contrary, he sees “little purpose” for a “binary significance indicator.” For many, if not most research contexts in our field, Imbens does not support p-values. He explicitly argues that in many economic research settings one should report the point estimate and the uncertainty associated with that point estimation – instead of a p-value. This is because a p-value is an immanent test in that it assesses the incompatibility of the data with the null hypothesis of no effect. In many research contexts, however, null hypotheses are of little interest. Imbens (2021: 162) gives the following example: “Although hypothesis testing is routinely used in economics, I would submit that many of the substantive questions are primarily about point estimation and their uncertainty, rather than about testing. However, many studies where estimation questions should be the primary focus present the results in the form of hypothesis tests. […Take] a specific example—the return to schooling—where testing a null hypothesis of no effect is common, yet arguably of little or no substantive interest. One would be hard-pressed to find an economist who believes that the return to education is zero.” In other words, since previous studies have already produced strong evidence for a positive return to schooling, using a p-value to assess the compatibility of the data with the highly unlikely and, therefore, uninteresting hypothesis of exactly zero return is not a meaningful way of summarizing the evidence in that data. Imbens strictly limits his support of p-values to one specific research setting. This is when substantial prior probability can be put, and is put, on the null, i.e., when the null represents the most interesting hypothesis to compare the data with. In other words, the null must be specified so as to represent the most established prior scientific belief (this is rarely done in our field).
Imbens argues that in this case, and in this case only, a low p-value – i.e., a high incompatibility of the data with the established prior scientific belief – is an informative way of summarizing the evidence in the data. But he also cautions that a high incompatibility of a single dataset with a strong prior scientific belief can only serve as an auxiliary means to answer the question of whether it might be worthwhile investigating the issue further with new data. And yet with regard to this limited purpose, he warns that even very small and, therefore, economically irrelevant effects are necessarily associated with small standard errors and, therefore, small p-values in large samples.
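The large-sample caveat in the last sentence can be made concrete with a small numerical sketch. The numbers below are hypothetical and not taken from Imbens (2021) or the discussion paper: a fixed, tiny mean effect of 0.01 on an outcome with unit standard deviation, a standard error of sigma/sqrt(n), and a two-sided p-value from a normal approximation.

```python
from math import erfc, sqrt

def two_sided_p(estimate, std_error):
    """Two-sided p-value for H0: effect = 0, using a normal approximation."""
    z = abs(estimate) / std_error
    # erfc keeps precision in the far tail, where 1 - cdf(z) would round to 0
    return erfc(z / sqrt(2))

# Hypothetical numbers: a tiny mean effect on an outcome with unit standard deviation
effect = 0.01
sigma = 1.0

p_by_n = {}
for n in (100, 10_000, 1_000_000):
    std_error = sigma / sqrt(n)      # standard error of a sample mean
    p_by_n[n] = two_sided_p(effect, std_error)
    print(f"n={n:>9,}  SE={std_error:.4f}  p={p_by_n[n]:.3g}")
```

The effect size is identical in all three rows; only the sample size changes. The p-value nevertheless falls from roughly 0.92 to roughly 0.32 to practically zero, which is why a small p-value in a large sample says little about economic relevance.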

While the discussion paper again refers to Imbens on page 20 by noting that he specifies applications where p-values are useful and where they are not, it misses stating unmistakably (i) that Imbens sees little purpose for a binary significance indicator. It also misses communicating the crucial implication of Imbens’ argument for our community (ii) that many, if not most, research settings underlying agricultural economics studies do not coincide with the research setting in which Imbens considers p-values to be useful. In other words, Imbens would not only have to be referred to as an opponent of using a “binary significance indicator” but – as regards common research settings in our field – also as a critic of the convention to present results in the form of p-values. Neither of these two crucial statements/conclusions is clearly conveyed in the discussion paper so far.

2 Missing clearness regarding the question of methodological choice
My general impression is that the paper should emphasize more clearly that statistical inferential procedures are a means to an end but not an end in themselves. In other words, it is crucial to realize that we are dealing with a methodological choice for which a justification has to be provided under consideration of the research context and the kind of the intended inference. For example, stating in a study’s objective section that the study is aimed at finding out whether an effect is statistically significant or not does not make sense. Instead, one would need to know for which kind of inference the selected inferential statistic – be it a standard error, t-ratio, p-value, significance statement or confidence interval – is used as an auxiliary means. Analogously, stating in the results section that an effect is statistically significant or not can never be the bottom line of inference. I believe that a clear understanding of this “means-end-relationship” is of utmost importance, and this has several implications:

The end must be clearly communicated. Therefore, in my view, the discussion paper should emphasize more strongly that each researcher is required to describe the data generation process (sampling design) and the broader population of interest from which the sample was drawn and to which generalizations are to be made. Without knowing how data were generated and without a clear definition of the inferential target population – be it a numerically larger population or a superpopulation (if deemed useful) – all statistical inferential statements are opaque, at best, or misleading, at worst, because the end to which the means (i.e., the inferential statistics) are used remains unclear.

After having emphasized the need to describe the data generation process and the inferential target population, the paper should, in my view, more clearly address the question of why or when researchers should transform the two original pieces of information that we can derive from a random sample – the effect size estimate (point estimate) and its estimated sampling variation (standard error) – into a p-value or a dichotomous significance statement (or some other transformation of those two original pieces of information). I think that from your statements in this regard (e.g., in Section 4.1), readers will not understand what you consider the most adequate means (i.e., the “best” way of reporting the evidence) in which circumstances. I believe that this is partly because the paper does not unambiguously distinguish between the use of p-values, on the one hand, and the use of threshold-based statistical significance statements, on the other.
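To illustrate what “transforming the two original pieces of information” means in practice, here is a minimal sketch. It assumes a normal approximation and uses hypothetical numbers; the function name `summarize` and the two example studies are illustrative only and do not come from the discussion paper.

```python
from math import erfc, sqrt
from statistics import NormalDist

def summarize(estimate, std_error, alpha=0.05):
    """Derive common inferential statistics from a point estimate and its standard error."""
    ratio = estimate / std_error                      # z- or t-ratio
    p_value = erfc(abs(ratio) / sqrt(2))              # two-sided, normal approximation
    critical = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    interval = (estimate - critical * std_error, estimate + critical * std_error)
    return {"ratio": ratio, "p": p_value, "significant": p_value <= alpha, "ci": interval}

# Hypothetical studies: identical point estimate, different precision
study_a = summarize(0.10, 0.04)
study_b = summarize(0.10, 0.06)
print(study_a)
print(study_b)
```

Every entry in the returned dictionary is a deterministic function of the same two inputs. The dichotomous flag is the coarsest of these transformations: the two studies report the same effect size, yet one is labelled “significant” and the other is not.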

As already indicated above, I do not fully understand which research contexts you distinguish (i.e., your classification of research contexts) and in which contexts you suggest which inferential statistical procedure (inferential statistic) as adequate means to report the evidence in the data and support inferences towards a broader context. In my opinion, there are several open questions that need to be answered to provide more clarity:

(1) Referring to both data-dependent modelling choices and research questions such as the efficient market hypothesis, you state that “[t]esting a null hypothesis versus an alternative hypothesis is meaningful.” Do you suggest that an identical statistical procedure should be used for both cases even though they are quite different? If so, which procedure do you propose – (i) the hypothesis testing approach in the Neyman-Pearson (NP) tradition where there is a clearly specified alternative hypothesis or (ii) the null-hypothesis-significance-testing (NHST) approach where the alternative hypothesis is only a vague non-null proposition? While there is some ambiguity because you do not use the technical terms, your wording suggests to me that you recommend the NP-approach (“statistical decision theory”). In the NP-framework, a dichotomous choice is made between a decision associated with the null hypothesis H_0 and a decision associated with a concrete alternative hypothesis H_A. Regarding the decision rule you state: “In these situations, a decision shall be made based on a statistical decision rule. This then necessarily includes a threshold determining what the decision will be.” I fear that this general statement will not suffice to make things clear to readers who are not familiar with the NP-approach. In the NP-framework, the choice is based on a decision rule α, which is the p-value threshold below which the null is rejected. An appropriate level of α (also called type I error rate or “false positive rate”) must be set depending on the parameters of the decision context. In particular, the type II error rate β (also called “false negative rate”) that is associated with a given level of α as well as the costs that are associated with type I and type II errors, respectively, must be considered when setting α to a level that represents an adequate decision rule in the given context.

If you suggest using the NP-framework, some important implications should be emphasized. For example, based on your brief statement, readers will probably not realize that the decision-rule α is not about inferring whether H_A or H_0 is true or more likely but about making the right decision under consideration of the costs associated with either choice. With ceteris paribus increasing type II error costs, the decision rule α must be set to increasingly high levels. This is because there is a tradeoff: increasing the type I error rate α (false positive rate) reduces the type II error rate β (false negative rate), and vice versa. Similarly, your brief statement does not convey the crucial fact that, contrary to widespread perceptions, not p but α is the type I error rate and that the precise value of p in a particular test is completely irrelevant in statistical decision theory. The only relevant information is whether p falls into the rejection region or not.
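The tradeoff between α and β, and the role of error costs, can be sketched numerically. The example below is a simplified illustration under stated assumptions, not a procedure taken from the discussion paper: a one-sided z-test of H_0: μ = 0 against a concrete alternative H_A: μ = δ with known σ, hypothetical design values (δ = 0.2, σ = 1, n = 50), and a decision rule chosen by minimizing the unweighted sum of cost-weighted error rates over a grid of α values.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type_two_error(alpha, delta, sigma, n):
    """beta of a one-sided z-test of H0: mu = 0 against HA: mu = delta > 0."""
    critical_z = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(critical_z - delta * sqrt(n) / sigma)

def cost_minimizing_alpha(cost_type_one, cost_type_two, delta, sigma, n):
    """Grid search for the alpha that minimizes cost_I * alpha + cost_II * beta (an assumed loss)."""
    grid = [i / 1000 for i in range(1, 500)]          # alpha from 0.001 to 0.499
    return min(grid, key=lambda a: cost_type_one * a
               + cost_type_two * type_two_error(a, delta, sigma, n))

# Hypothetical design values
DELTA, SIGMA, N = 0.2, 1.0, 50

beta_at_05 = type_two_error(0.05, DELTA, SIGMA, N)
beta_at_10 = type_two_error(0.10, DELTA, SIGMA, N)

alpha_costly_false_positive = cost_minimizing_alpha(5, 1, DELTA, SIGMA, N)
alpha_equal_costs = cost_minimizing_alpha(1, 1, DELTA, SIGMA, N)
alpha_costly_false_negative = cost_minimizing_alpha(1, 2, DELTA, SIGMA, N)

print(beta_at_05, beta_at_10)
print(alpha_costly_false_positive, alpha_equal_costs, alpha_costly_false_negative)
```

Raising α lowers β for the same design, and the cost-minimizing α rises as type II errors become relatively more expensive. In this sketch the p-value of any particular dataset never enters the choice of the rule; it only determines on which side of the threshold the data fall.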

If you suggest using the NP-framework, more questions need to be answered to enable the reader to understand which concrete procedure you actually suggest: (i) Do you suggest to routinely use a conventional threshold such as α=0.05 as general default for all contexts? This would correspond to the argument that 0.05 is a rule-of-thumb that works sufficiently well for all contexts, irrespective, for example, of the levels of type I and type II error costs. While I would not share that argument, it might be defended as being a simplified, pragmatic approach. Is that your position? If so, it should be clearly stated. (ii) Or do you suggest that researchers provide at least a qualitative discussion of the parameters of the decision context and then informally set α to some “plausible” level? If so, it should be clearly stated. (iii) Or do you suggest “between the lines” of your brief statement that researchers formally derive a decision rule α? If so, it should be clearly communicated by all means.

(2) My next question refers, again, to the research contexts that you try to illustrate through examples such as the efficient market hypothesis. How would you define the research contexts that you have in mind here? Because only examples but no clear specification are provided, I have to guess: Are the contexts that you have in mind the same as those that Imbens (2021) identifies as contexts where substantial probability can be put on the null? If yes, for what reason do you propose making dichotomous significance statements in those contexts? Imbens argues that, when substantial prior probability can be put on the null, using the p-value to assess the incompatibility of the data with that prior belief is meaningful and more informative than a binary significance indicator. According to Imbens, low p-values can then be used as means to decide whether to investigate the issue further with new data. Now, my question is: How does your proposition regarding the proper use of inferential statistics in the specified context relate to Imbens’ proposition?

(3) You make an important statement regarding a further context, which is arguably the most relevant one in our field: “In many economic applications, however, testing against a null hypothesis of ‘no effect’ is not of particular interest. For example, it is not exciting to test whether farmers’ education increases farm income or not, whether a gender pay gap exists or not or whether investment aid stimulates investment demand or not. Here the magnitude of the (treatment) effect is what matters and the causal mechanism, e.g. how investment aid stimulates investments. We believe that in situations, where no specific decision on a hypothesis has to be made, it suffices to display standard errors or […].” Here, you seem to refer to contexts where the null is uninteresting/unlikely because previous studies have already produced strong evidence for the existence of an effect, such as in Imbens’ return-to-schooling example. Your requirement to report the magnitude of the effect and its standard error also fully agrees with Imbens’ view that one should report the point estimate and the uncertainty of that point estimation in such contexts. But in Imbens’ view, this is not a case where it is meaningful to report p-values because p-values would indicate the strength of the evidence in the data against a null hypothesis that is, as you mention yourself, “not of particular interest.” Therefore, I find it confusing that you continue your statement above by saying: “[…] or to interpret p-values as indicators of the general compatibility of the data with the corresponding hypothesis.” What do you intend to say with this “or” sub-clause? We can, of course, go through the mathematical manipulations to transform the point estimate and the standard error into a p-value that assesses the compatibility of the data with the null. But why should we summarize the evidence in the form of a p-value in research contexts where the null hypothesis is uninteresting from the very start?
I think that the discussion paper should provide a clear answer to that question, which is one of the most crucial ones in the present methodological debate.

(4) Finally, I fear that readers will not grasp the assumptions and “means-end-relationship” in your rather passing mention of the concept of power (and related terms such as false positives, false negatives, etc.). In particular, it should be clarified that the power concept only makes sense in the dichotomous NP-framework where a null and a concrete alternative hypothesis are defined between which a rule-based decision is to be made. Power is defined as 1 − β. It quantifies the repeatability of p≤α (and, therefore, the rate of rejection of H_0) when H_A is true. In other words, it is the rate of acting as if H_A were true when it is true (“true positive rate”). I believe that the discussion paper should clarify that the concept of power only makes sense in contexts where the NP-approach is used to make a decision between two alternatives but not in contexts where the substantive question is about effect size estimation and the uncertainty of that estimation.
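The reading of power as the repeatability of p≤α under a concrete alternative can be checked by simulation. The following sketch rests on assumptions that are not in the discussion paper: a one-sided z-test with known σ, hypothetical design values, and repeated random samples drawn once under H_A and once under H_0.

```python
import random
from math import sqrt
from statistics import NormalDist

ALPHA, DELTA, SIGMA, N = 0.05, 0.2, 1.0, 50     # hypothetical design values
critical_z = NormalDist().inv_cdf(1 - ALPHA)

def rejection_rate(true_mean, reps=4000, seed=7):
    """Share of repeated random samples in which H0: mu = 0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, SIGMA) for _ in range(N)]
        z = (sum(sample) / N) * sqrt(N) / SIGMA
        rejections += z > critical_z            # same event as one-sided p < alpha
    return rejections / reps

# Analytic power, 1 - beta, for the concrete alternative mu = DELTA
analytic_power = 1 - NormalDist().cdf(critical_z - DELTA * sqrt(N) / SIGMA)
simulated_power = rejection_rate(DELTA)         # rejection rate when HA is true
simulated_size = rejection_rate(0.0)            # rejection rate when H0 is true

print(analytic_power, simulated_power, simulated_size)
```

The simulated rejection rate under H_A should land close to the analytic value of 1 − β, and the rate under H_0 close to α. Neither quantity can be computed without fixing a concrete alternative δ, which is the point made above: power has no counterpart when only a vague non-null proposition is specified.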

3 Your dealing with assumptions violations
Assumptions violations are another major issue that, in my view, is not covered clearly enough. On page 20, you rightly state: “[I]f data come from convenience sample, any source of potential bias regarding estimates of regression coefficients and standard errors should be carefully considered and discussed.” I don’t think this statement is clear enough to make a substantial contribution to mitigating inferential errors associated with the widespread assumptions violations in the practice of empirical research. As it is, I suspect that readers of the discussion paper such as PhD students will not understand how they have to deal with generalizing statistical inference in the case of convenience samples. Of course, this issue is again related to the question of the adequateness of “means” to an “end” – but now in a more fundamental way: can sample statistics that would carry inferential meaning if data were probabilistically generated be adequate means to the end of generalizing towards a broader context when the data were not probabilistically generated? This is an extremely relevant question in our field where p-values and asterisks are routinely displayed (and often called for by reviewers) whenever there are quantitative data – without questioning whether there is a chance model upon which to base statistical inference. This routine includes “grossly-non-random” samples of haphazardly recruited respondents that researchers could get hold of, in one way or the other.

From a logical point of view, what should be done is quite unambiguous: using inferential statistical procedures to generalize from samples to populations in the case of convenience samples would have to be justified by either running a trustworthy sample selection model that would rehabilitate the statistical foundations of statistical inference (see Hirschauer et al. 2020), or by assuming that those convenience samples are approximately random samples. Since many researchers who use convenience samples simply resort to the standard error formula for simple random samples without a second thought, one would even have to assume that all those convenience samples are approximately simple random samples. I think that the discussion paper should communicate beyond any doubt that such an “approximately-a-random-sample argument” is often a heroic assumption but absolutely necessary from a logical point of view for statistical inferential procedures to make any sense when, in fact, the data generating process was not probabilistic. Whether the approximately-a-random-sample argument is then deemed trustworthy or helpful in the specific context – say, in the case of convenience samples of individuals who are haphazardly recruited on some venue, location or the Internet, or in the case of the Testbetriebsdaten – would then be at least a transparent issue open for judgment by the reader of a study.
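How much rides on the approximately-a-random-sample argument can be illustrated with a toy simulation. Everything in it is hypothetical: the artificial population, the selection mechanism (units with larger outcome values are more likely to take part), and the sample size. It is not the sample selection model discussed in Hirschauer et al. (2020). The sketch applies the simple-random-sample standard error formula to both a true simple random sample and a convenience sample and records how often the nominal 95% interval covers the population mean.

```python
import random
from math import exp, sqrt
from statistics import mean, stdev

rng = random.Random(1)
population = [rng.gauss(50, 10) for _ in range(20_000)]   # artificial population
true_mean = mean(population)

def ci_covers(sample, z=1.96):
    """Nominal 95% interval based on the standard error formula for simple random samples."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    return m - z * se <= true_mean <= m + z * se

def simple_random_sample(n=200):
    return rng.sample(population, n)

def convenience_sample(n=200):
    # Hypothetical selection mechanism: larger values are more likely to participate
    picked = []
    while len(picked) < n:
        y = rng.choice(population)
        if rng.random() < 1 / (1 + exp(-(y - 50) / 10)):
            picked.append(y)
    return picked

def coverage(draw_sample, reps=500):
    return sum(ci_covers(draw_sample()) for _ in range(reps)) / reps

srs_coverage = coverage(simple_random_sample)
convenience_coverage = coverage(convenience_sample)
print(srs_coverage, convenience_coverage)
```

Under simple random sampling the interval covers the population mean at roughly the nominal rate, whereas under the assumed selection mechanism it almost never does, although the formula and the sample size are the same. Real convenience samples may deviate less or more from random sampling than this toy mechanism, which is why the assumption has to be stated and left open to the reader's judgment.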

References
Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2021): A Primer on p-Value Thresholds and α-Levels – Two Different Kettles of Fish. German Journal of Agricultural Economics 70: 123-133 (DOI: 10.30430/70.2021.2.123-133).
Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2020): Can p-values be meaningfully interpreted without random sampling? Statistics Surveys 14(2020): 71-91 (DOI: 10.1214/20-SS129).

The paper “Can p-values be meaningfully interpreted without random sampling?” was published in Statistics Surveys in 2020, not in 2019, as you erroneously indicate in your reference list. But it seems that in many places where you refer to Hirschauer et al. (2019) in the text, you intend to refer to the following publication, which, in turn is missing in the list of references: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2019): Twenty steps towards an adequate inferential interpretation of p-values in econometrics. Journal of Economics and Statistics 239(4): 703-721 (DOI: 10.1515/jbnst-2018-0069).

You write on page 6: “Hirschauer, Mußhoff and Grüner (2017: p. 5) argue that ‘multiple testing is inherent to multiple regression since we test as many null hypotheses as we have variables of interest.’” The reference given for this quote is the following publication: Hirschauer, N., Mußhoff, O. and Grüner, S. (2017). False Discoveries und Fehlinterpretationen wissenschaftlicher Ergebnisse. Wirtschaftsdienst 97(3): 201–206. This is not correct. The Wirtschaftsdienst paper is in German. We have discussed multiple testing and made the above statement, not verbatim but as regards content, in the following publication: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2018): Pitfalls of significance testing and p-value variability: An econometrics perspective. Statistics Surveys 12(2018): 136-172 (DOI: 10.1214/18-SS122). I believe that you quote an earlier working paper version of that publication. You may want to correct that.

On page 7, you write: “Hirschauer et al. (2019) argue that convenience sampling precludes the use of p-values because researchers run the risk of misestimating coefficients and standard errors, at least if selection bias is not adequately considered.” I believe the intended reference here is to the 2020 paper “Can p-values be meaningfully interpreted without random sampling?” or, alternatively to the following paper: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2021): Inference using non-random samples? Stop right there! Significance (October 2021): 20-24 (DOI: 10.1111/1740-9713.01568).

On page 26, you refer to the guidelines I suggested for journals on the pre-conference p-value workshop of the 2021 GEWISOLA conference. Your reference is to: Hirschauer, N. (2021). The debate on p-values and statistical inference: What are the consequences for our community? Problems and solutions in statistical practice. But no further information regarding the material/slides of my presentation is provided. Since you use it as a reference, the material should be made accessible to the reader, e.g., by uploading it to the GEWISOLA homepage and providing a link. If need be, I could also upload the slides to my personal MLU-website.

The commenter has declared there is no conflict of interests.

Comment: Response by the authors to the comments by Norbert Hirschauer

We would like to thank Norbert Hirschauer for the detailed comments. We respond to them point-by-point below.

General response: Our manuscript aims to give a comprehensive overview of the debate, but we also discuss the attitudes of the community and many of the possible remedies. Consequently, some points may not have received the attention they deserve, but readers are pointed towards plenty of resources to learn more.

We are not statistical authorities who can make objective and final judgments. As part of the community and as affected researchers, editors, reviewers, and supervisors, we see it as our task to raise issues from the community and to summarize/synthesize these issues for the community, which may also involve some subjectivity. We also believe that such subjectivity and critical decisions are part of any empirical work.

Ultimately, we see it as our main mission to bring the debate to the community, one aspect of which is the discussion in this forum. So, thanks again for making a start. If the manuscript is accepted in a journal, and if the editors agree, we will include a link to this forum, so that all readers and interested members of the community can contribute to the debate with their standpoints as well.

Response to point 1 – Presentation of Imbens (2021)

We thank Norbert Hirschauer for prompting us to clarify (our perception of) the main message of Imbens’ recent paper on the p-value debate. This is particularly valuable, because we have much sympathy for the arguments presented in this paper. Actually, they are rather close to our own position.
We admit that Imbens (2021) is not the best reference for the statement “that nothing is wrong with p-values, if they are used correctly” (page #) and we replaced this reference by Verhulst (2016). Having said this, we think it is futile to argue over whether Imbens belongs to the camp of opponents or defenders of p-values. Both groups can find arguments for their respective views in his article. In a nutshell, after briefly reviewing the controversy about p-values and significance reporting, he distinguishes empirical economic applications with regard to their main objective, namely estimation in the sense of quantifying the magnitude of an effect versus hypothesis testing. In the next two subsections, Imbens argues why p-values contribute little in the former research setting and why p-values and significance testing are useful in the latter.
This does not imply that p-values are without problems when hypothesis testing is the focus of an economic study and neither Imbens nor we make this claim. However, we do not share Norbert Hirschauer’s perception that “for many, if not most, research contexts in our field, Imbens does not support p-values”. On page 165 Imbens provides a couple of examples where it may be reasonable to focus on testing null hypotheses. (In our discussion paper we extend this list.) If there were little need for hypothesis testing in empirical economic research, how could Imbens arrive at the conclusion: “In my view banning of p-values is inappropriate.” (page 170)?

Response to point 2 – Missing clearness regarding the question of methodological choice

We agree that methods are a means to an end, not an end in themselves. Of course, researchers should clearly describe how samples are generated and under which assumptions they operate (also see response to point 3).

Fisherian Null Hypothesis Significance Testing (NHST) vs. Neyman-Pearson Framework (NPF)

We want to clarify that we do not suggest using either framework at all times. It depends on the case at hand. We believe that the Fisherian NHST is particularly useful in explorative analysis (compatibility of data under the null may point towards issues to study in greater depth) and in deductive work if the goal is to establish the initial presence of a directional effect based on theory. The NPF will likely be more useful in cases with well-established priors and if loss functions can be reasonably specified. That said, the NPF should perhaps be used more often in deductive work that builds on a well-established literature.

Readers who want to learn more about the history and debate around the two approaches can find a detailed treatment in: Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

A critique of the practice of NHST is provided by: Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Research contexts in which NHST is meaningful

Your conjecture that we borrowed from Imbens (2021) when describing research contexts for meaningful applications of the NHST is correct. We refer to situations in which the Null Hypothesis is not a “strawman”, i.e. the Null has a “substantial” prior probability based on theoretical reasoning or previous empirical work. We think, however, that it is not possible to specify research questions for which this precondition unambiguously holds or does not hold. The efficient market hypothesis or the law of one price may serve as an example. A rejection of the Null, i.e. an incompatibility of data with the hypothesis, would be much more surprising for stock markets than for land markets.

At this point, it is legitimate to ask whether one should stop at displaying p-values as a measure of the (in)compatibility of the data with the Null hypothesis or whether one should go one step further and draw a “conclusion” in the sense of rejecting or accepting the Null hypothesis (without implying that it is actually true / false). Our position here is that one should neither apply NHST mechanically nor rule it out. Prominent examples that require an application of NHST can be found in the context of econometric model building. Proper specification of econometric models typically involves a series of decisions: Are economic time series stationary or not? Are random variables normally distributed or not? Do economic variables exhibit spatial correlation or not? Is endogeneity an issue or not? The list of examples can be extended arbitrarily. Discrete choices have to be made and NHST is definitely a useful tool to support these decisions despite its pitfalls.

The informative value of p-values is limited if the null is not very likely

Indeed, the focus in the described cases should be on effect sizes. In cases where the null hypothesis is unlikely to be true, because theory or prior work strongly suggests that there is an effect, the NPF could also often be the more adequate framework. In other words, p-values are probably more useful in exploratory settings. We do not want to suggest that p-values should always be used.

Power and the NPF

We agree that power is related to the NPF and its calculation requires the specification of a specific alternative hypothesis. (This should be evident from Figure 1.) However, effect sizes can also be estimated in the NPF (priors and an alternative hypothesis do not stop one from doing so).
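
To illustrate why a power calculation needs a specific alternative, the following minimal Python sketch computes the power of a two-sided z-test under a normal approximation. The function name, the effect sizes and the standard error are hypothetical values chosen for illustration; none of them is taken from the paper.

```python
from statistics import NormalDist

def power_two_sided_z(effect, se, alpha=0.05):
    """Power of a two-sided z-test of H0: effect = 0 against one specific alternative.

    Assumes the estimator is approximately normal with known standard error `se`.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection threshold for |z|
    shift = effect / se  # mean of the z-statistic under the alternative
    # Probability that the z-statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

# Hypothetical alternatives, measured in the same units as the estimate.
for effect in (0.0, 1.0, 2.8):
    print(f"alternative effect = {effect}: power = {power_two_sided_z(effect, se=1.0):.3f}")
```

Without a stated alternative there is no `effect` to plug in, so power cannot be computed; with an effect of zero the function simply returns α.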

Response to point 3 – Violations of assumptions

We agree with the statements. This point is critical, since it is very often difficult to work with random samples of farmers and consumers in the agricultural and food economics community. It is important to clearly distinguish the problem of inference and the problem of bias from non-random samples. In our perception, researchers in the community struggle more with the former (inference from non-random samples), but often make an honest attempt to discuss the latter (bias from non-random samples).

Regarding the problem of inference, we agree that non-random samples should not be used lightheartedly for inference to the population. Researchers have the option to present their data, models and associated estimates without inference. For instance, authors could run a regression and present their point estimates. The bigger question is: what is the value of such studies if no inference can be made? What is there to learn for the reader? In other words, in many instances, a careful discussion of biases and a discussion of assumptions might be more fruitful.

Alternatively, authors could use inferential statistics (including the use of p-values and confidence intervals), but point towards the violation of the critical assumptions of a random data generation process. Ultimately, it is difficult to judge in general how far the assumptions behind an “approximately-a-random-sample argument” are violated. We believe that transparency and critical reflection are key. As a minimum standard, if such assumptions are made, they should, of course, be made explicit in the communication of research results for the community to judge.

We believe that researchers should continue to make an honest attempt to discuss biases on a case-by-case basis. For instance, a survey of a self-selected sample on farmers’ willingness to participate in research surveys is very likely to produce an overestimate of that willingness, whereas a question on the farm size may produce a somewhat smaller bias (if any): Larger or smaller farms may still be more or less likely to respond for various reasons, but it is plausible that the bias is smaller in the second case. As a consequence, it would be a task for the research community as a whole to produce and publicly share high quality random samples or population data that allow researchers to assess biases on as many variables as possible. Some issues will still be unresolved though. Imagine the case where the willingness to participate in surveys is uncorrelated with other variables and unobserved in the population or a random sample. It will then be the task of the researcher to discuss sources and direction of bias.

We may revisit our discussion of implications with respect to this issue in the course of a revision. A first implication for teaching and training of PhD students would be that method courses cover data generation, sampling and representativeness together with how to best address the aim of the research (hypothesis testing, effect quantification, explorative analysis for theory development, etc.). This, however, would in our view support the move towards more holistic courses that include scientific practice and methods. Discussing commonly used observational data sets and how these may suffer from bias would enrich such courses/modules. This would need to go along with a more stringent research data management and community-specific rules/standards for documentation, including discussion of their sources of bias and limits for statistical inference along with the FAIR Guiding Principles for scientific data management and stewardship (findability, accessibility, interoperability, and reusability).

Finally, we want to point out that bias and misspecified sampling error are interlinked. Hence, we want to emphasize that both issues should be discussed separately. If the real sampling error is larger than the assumed sampling error, correctly computed confidence intervals are wider than the reported ones, whereas biases may shift the interval upward or downward. We agree that this is an additional problem of using p-values in non-random samples.

Comments regarding the references

We thank Norbert Hirschauer for the points raised regarding our citing and references. As these issues were pointed out to us at an earlier stage already, we have tried to address them in this preprint. We will carefully check the references again and, if needed, revise them accordingly in a new version of the manuscript.

Thomas Heckelei, Silke Hüttel, Martin Odening, Jens Rommel

Comment 2

Received:
20 August 2022
Commenter:
Norbert Hirschauer
The commenter has declared there is no conflict of interests.

Comment: Reforming statistical practices in the agricultural economics community in Germany – Which steps should be taken next?
The discussion paper by the GEWISOLA p-value working group, which dealt with the pitfalls of p-values and null-hypothesis-significance-testing (NHST), represents a timely contribution to the topical debate regarding the reform of inferential reporting practices. But even though the paper was explicitly published here to facilitate discussion postings from members of our community (see GEWISOLA newsletter 1/2022), strikingly few comments have been made. At first, I found this very surprising. For one thing, most empirical work in our field is still heavily based on NHST-routines, which many still seem to follow despite severe criticism. The “silence” regarding reform requirements is all the more remarkable when one takes into account that some leading economics journals, which are usually considered as beacons for best practice, changed their inferential reporting standards already some years ago. The author guidelines of the American Economic Review, for example, read as follows: “Do not use asterisks to denote significance of estimation results. Report the standard errors in parentheses.”

Individual and institutional-level efforts for better inferences
I believe that a vivid discussion on this preprint platform (and elsewhere) would be a great chance to effectively raise the problem awareness among German agricultural economists, including PhD students. Because of many personal communications that acknowledge widespread inferential errors in the practice of research as well as open questions, I do not believe that the low number of comments reflects a low interest in statistical reforms in our community. But of course, it would be interesting to learn why virtually no individual researcher felt inclined or dared to make a comment on this public platform so far. After all, it is the individual researcher who is responsible for following the rules of good scientific practice and avoiding inferential errors as best as possible.

However, an exclusive focus on the individual might miss the point. Research – and the use of statistics in research – is a complex social enterprise. In this enterprise, the individual researcher, and especially a young researcher, is not the most potent agent of change for doing away with damaging conventions. Quite the contrary. The individual researcher must find his/her way through the thicket of a still predominant NHST-routine that has been entrenched in the community for decades through inappropriate teaching, unwarranted reviewer requests, and even best-selling statistics textbooks. Changes for the better depend to a large extent on institutions and their codes of conduct that govern the behavior of researchers. This includes, for example, formal codes of conduct endorsed by professional associations and funding organizations. But above all, changes for the better depend on scientific journals with their guidelines and review processes.

The methodological debate in a nutshell
Despite the delusive term “hypothesis testing,” statistical inference is no magic that could tell us whether some hypothesis about a real-world state of interest is true or not. However, its principal idea to learn (infer) something about a population based only on a random sample of that population is quite simple. Imagine you have a sample with 500 observations for a variable X (education) and a variable Y (income). Irrespective of how those observations were obtained, we can compute summary sample statistics that inform us about certain features of these data. Examples are the means and standard deviations of X and Y, or a relationship (correlation or regression coefficient) between those 500 X- and Y-observations.

If the sample was randomly drawn from a population, summary sample statistics such as the sample X-Y relationship can be used as point estimate for the (unknown) population X-Y relationship. And another sample statistic, the standard error, can be used as estimate for the uncertainty caused by random sampling error. “Standard error” is but another label for the standard deviation of the (sampling) distribution of all point estimates that we would find if we independently drew very many equal-sized random samples from the same parent population.
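
The definition of the standard error as the standard deviation of the sampling distribution can be made tangible with a small simulation. The sketch below is only an illustration under stated assumptions: the population is hypothetical (normal with mean 50 and standard deviation 10), the point estimate is a sample mean rather than an X-Y relationship, and all names are invented for this example.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)          # fixed seed so the sketch is reproducible
POP_MEAN, POP_SD = 50.0, 10.0    # hypothetical population parameters
N, REPLICATIONS = 500, 1000      # sample size as in the example; number of repeated samples

def draw_sample(n):
    """Draw a simple random sample of size n from the hypothetical population."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# Sampling distribution: point estimates (sample means) from many independent samples.
estimates = [mean(draw_sample(N)) for _ in range(REPLICATIONS)]
sd_of_estimates = stdev(estimates)

# Standard error estimated from one single sample, as a researcher would do.
one_sample = draw_sample(N)
estimated_se = stdev(one_sample) / N ** 0.5

print(f"SD of {REPLICATIONS} point estimates: {sd_of_estimates:.3f}")
print(f"SE estimated from one sample:        {estimated_se:.3f}")
print(f"Theoretical SE (sigma / sqrt(n)):    {POP_SD / N ** 0.5:.3f}")
```

The three numbers should be close to each other, which is the sense in which a single random sample lets us estimate the spread of estimates across hypothetical repeated samples.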

In brief, what we can extract – at best – from a random sample is an unbiased point estimate of an unknown population effect size (e.g., the relationship between education and income) and an unbiased estimation of the uncertainty, caused by random error, of that point estimation (i.e., the standard error). We can, of course, go through various mathematical manipulations. But why should we transform two intelligible and meaningful pieces of information – point estimate and standard error – into a p-value or even a dichotomous significance statement? This is a particularly urgent question given the considerable costs in the form of information losses, misdirected incentives, and inferential errors that are associated with the NHST-routine.
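
A minimal sketch of the transformation in question may help to show where information is lost. It assumes a normal approximation for the test statistic, and the two “studies” are hypothetical numbers invented for this illustration.

```python
from statistics import NormalDist

def two_sided_p_value(estimate, se, null_value=0.0):
    """Two-sided p-value under a normal approximation (an assumption of this sketch)."""
    z = (estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Two hypothetical studies: very different effect sizes, identical z-ratio.
study_a = {"estimate": 0.02, "se": 0.01}
study_b = {"estimate": 2.00, "se": 1.00}

for name, study in (("A", study_a), ("B", study_b)):
    p = two_sided_p_value(**study)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"Study {name}: estimate={study['estimate']}, se={study['se']}, p={p:.4f}, {verdict}")
```

Both studies yield the same p-value and the same significance verdict although the estimated effects differ by a factor of one hundred. The p-value retains only the ratio of the two original pieces of information, and the dichotomous statement retains even less.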

It cannot be emphasized enough that statistical inference is based on probability theory and a formal chance model that links a randomly generated dataset to a broader target population. It is a means to the end of evaluating a study’s knowledge contribution given the uncertainty caused by random sampling error (note that I do not talk here about causal inference such as in randomized controlled trials). Therefore, statistical inference aimed at generalizing to populations requires that the sample under study is a random sample. Alternatively, one would need a sample selection model to correct for selection bias or one would have to assume that the sample is approximately a random sample. The latter is often a “heroic” but deceptive assumption. This becomes evident from the fact that probabilistic sampling designs such as cluster sampling can lead to standard errors that are several times larger than the default which presumes simple random sampling. We must know how members of the population were selected into the sample to be able to estimate the uncertainty caused by random sampling error (i.e., the standard deviation of the sampling distribution). Therefore, standard errors and p-values that are just based on a bold assumption of random sampling – contrary to how data were actually collected – are virtually worthless. In other words, contrary to the intention of adequately communicating uncertainty, reporting standard errors or p-values for non-random samples might delusively convey excessive certainty imposed by wrong assumptions about the data generation process and, thus, the data distributions used in statistical analysis. To put it more bluntly, proceeding with the conventional routine of displaying p-values and statistical significance even when the random sampling assumption is grossly violated is tantamount to pretending to have better evidence than one has. This is a breach of good scientific practice that provokes unwarranted moves from the description of patterns in some conveniently available data to overconfident generalizations beyond the confines of the particular sample.
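
The remark on cluster sampling can be illustrated with a small simulation. The following sketch is not a general survey estimator: it assumes equal-sized clusters drawn at random from a large number of clusters, and the design (25 clusters of 20 units) and the variance components are hypothetical values chosen so that units within a cluster resemble each other.

```python
import random
from statistics import mean, stdev

rng = random.Random(7)                   # fixed seed for reproducibility
N_CLUSTERS, CLUSTER_SIZE = 25, 20        # hypothetical design: 25 villages x 20 farms = 500
BETWEEN_SD, WITHIN_SD = 5.0, 5.0         # hypothetical variation between and within clusters

# Each sampled cluster shares a common component, so its members resemble each other.
clusters = []
for _ in range(N_CLUSTERS):
    cluster_effect = rng.gauss(0.0, BETWEEN_SD)
    clusters.append([50.0 + cluster_effect + rng.gauss(0.0, WITHIN_SD)
                     for _ in range(CLUSTER_SIZE)])

all_obs = [y for cluster in clusters for y in cluster]

# Default: pretend the 500 observations are a simple random sample.
naive_se = stdev(all_obs) / len(all_obs) ** 0.5

# Design-aware: with equal-sized clusters the overall mean is the mean of cluster means,
# so its uncertainty is driven by the variation between the 25 cluster means.
cluster_means = [mean(cluster) for cluster in clusters]
cluster_se = stdev(cluster_means) / N_CLUSTERS ** 0.5

print(f"naive SE (assumes simple random sampling): {naive_se:.3f}")
print(f"cluster-aware SE:                          {cluster_se:.3f}")
print(f"ratio:                                     {cluster_se / naive_se:.2f}")
```

Under these assumptions the design-aware standard error is roughly three times the default one, even though the sample is fully probabilistic. For a convenience sample, not even this correction is available because the selection mechanism is unknown.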

The discussion paper as starting point for reforms
While it is not very explicit in all respects, the paper by the GEWISOLA p-value working group addresses the two crucial issues discussed above and tries to raise critical awareness regarding the shortcomings of conventional statistical practices. Regarding the issue of information transformation, it states, for example: “In many economic applications, however, testing against a null hypothesis of ‘no effect’ is not of particular interest. For example, it is not exciting to test whether farmers’ education increases farm income or not, whether a gender pay gap exists or not or whether investment aid stimulates investment demand or not. Here the magnitude of the (treatment) effect is what matters […].”

Regarding assumptions violations the discussion paper notes: “Perhaps the most basic question is whether observed data can be considered as a random sample, i.e. as an outcome of a random data generating process, because this is a prerequisite for inferential statistics. […] if data come from convenience sample, any source of potential bias regarding estimates of regression coefficients and standard errors should be carefully considered and discussed.”

Despite the discussion paper and some earlier GEWISOLA activities such as last year’s pre-conference workshop on p-values and statistical inference, it seems to me that, in general, the public debate in our community has been too weak to move many researchers away from the inferential errors associated with automated NHST-routines. In my experience, this holds not only for PhD students in their defenses but also senior researchers authoring agricultural economics publications. In brief, there is much business as usual, as if the methodological debate about “Statistical inference in the 21st century: A World Beyond p < 0.05” were not existent. That is, many study results are still presented as if “obtaining statistical significance” were the ultimate end of science, instead of adequately using inferential statistics as what they are: auxiliary means for assessing the informational value of a sample-based point estimate in the light of the uncertainty caused by random sampling error.

The near-automatic routine of making dichotomous significance statements whenever there are quantitative data goes hand in hand with a lacking consideration of the implications of assumptions violations. While many studies in our field are based on non-random (convenience) samples, very few of them acknowledge the fact that non-random sampling error cannot be assessed by statistical methods designed for dealing with random sampling error. That is, many studies implicitly pretend to have better evidence (i.e., random samples) than they have (i.e., non-random samples). Thus, they provoke or at least tacitly condone overconfident generalizations beyond the confines of the convenience sample. I believe that this is one important instance where the warnings in the discussion paper were not explicit enough to get through to everybody. Stating that biases that result from assumptions violations “should be carefully considered and discussed” (see quote above) is too vague to do away with the entrenched routine of reporting inferential statistics for non-random samples even when doing so is a blunder based on “heroic” assumptions regarding the data generation process.

Senior scientists who supervise research projects and PhD students should do their best to ensure that misuses and misinterpretations of inferential statistics are avoided in their area of responsibility. For example, every PhD student who resorts to inferential statistical procedures should be qualified and knowledgeable enough to relate the inferential approach used in the dissertation to the topical methodological debate on p-values and statistical significance. But these requirements are apparently often not met. I believe that this is a serious problem for our profession. If we delay institutional reforms that can quickly change statistical practices for the better, we will not be able to reduce the inferential errors made in our community in due time. And in the long run, closing our eyes to the problem will make us fall behind other researchers and research communities. Delayed reforms will also result in a loss of resources as conclusions from research are wrong and resources for future research are misdirected. However, if we act immediately, we might still have the chance to be at the forefront of methodological progress instead of lagging behind.

One might speculate that the paper by the GEWISOLA p-value working group was a necessary but, by itself, not sufficient step to bring about the indispensable changes in statistical practice. Institutional-level efforts such as the revision of journal guidelines (e.g., of the GJAE) or a formal code of conduct (“inferential quality standard”) endorsed by the GEWISOLA are likely to provide more effective guidance for our community. The outcome could be similar to the one of the GEWISOLA journal ranking that proved effective in our day-to-day work of choosing research outlets and reviewing. But, of course, such a quality standard would have to be drafted and actively discussed and agreed on by the members of the GEWISOLA.

Clear and formal inferential reporting guidelines would have several benefits: They would effectively communicate necessary standards to authors and would help reviewers assess the credibility of inferential claims. They would also provide authors with an effective defense against unqualified reviewer requests. The latter is arguably even the most important benefit because it would also mitigate publication bias that results from the fact that many reviewers still prefer statistically significant results and pressure researchers to report p-values and “significant novel discoveries” often without even taking account of whether data were randomly generated or not.

A short textbook for statistical practitioners in an era of reform
The issues surrounding the scientific debate concerned with the pitfalls of p-values and NHST are also covered in the book “Fundamentals of Statistical Inference: What is the Meaning of Random Error?” by Hirschauer, Grüner, and Mußhoff. The book is part of the SpringerBriefs in Applied Statistics and Econometrics, a series published under the auspices of the German Statistical Society.

Starting from the premise that a lacking understanding of the probabilistic foundations of statistical inference is responsible for the inferential errors associated with the conventional NHST-routine, the book provides readers with an effective intuition and conceptual understanding of random error, sampling variation, and statistical inference. It also suggests clear guidelines (dos and don’ts) based on the understanding that the probabilistic assumptions regarding data generation must be met and that, if they are met, reporting point estimates and standard errors is a better summary of the evidence in a dataset than p-values and statistical significance declarations. It is, thus, intended as a resource for statistical practitioners who are confronted with the methodological debate about the drawbacks of “significance testing” but do not know what to do instead. We hope, of course, that the book is informative for many readers and a valuable contribution to the reform debate. But again, it is “only” another publication by individual authors. As such, its potential to promote the necessary change of inferential practices in our community is very limited compared to institutional-level reforms such as the revision of journal guidelines or a formal GEWISOLA-statement that specifies quality standards (dos and don’ts) in inferential reporting that we should meet.

Commenter: Norbert Hirschauer

The commenter has declared there is no conflict of interests.

Dear colleagues, congratulations on your Discussion Paper. I have three major sets of comments that I would like to share. The first one is associated with your presentation of Imbens’ (2021) paper, the second one with the, as I find, missing clearness regarding the question of methodological choice, and the third one with your dealing with assumptions violations.

1 Your presentation of Imbens’ (2021) paper
I am confused about your presentation/interpretation of the paper by Imbens (2021). On page 2, you seem to create the impression that Imbens is a representative of those who counter criticisms of statistical significance testing. In my reading of Imbens’ paper, this is not correct. I believe confusion arises because you jump from criticisms of statistical significance testing to a justification of p-values, without clearly distinguishing the two approaches. You verbatim write: “However, this [i.e., criticisms of significance testing] is countered by others who acknowledge existing problems but nevertheless defend p-values, basically saying that nothing is wrong with p-values if they are used correctly (Imbens, 2021).” While Imbens identifies a research setting where he thinks that p-values are a meaningful way of reporting the evidence, he does not support binary significance statements. Therefore, I think that the confusing presentation of Imbens’ reasoning, which is again to be found on page 20, does not do justice to the crucial message of his paper:

Imbens does not support statistical significance testing. On the contrary. He sees “little purpose” for a “binary significance indicator.”

For many, if not most research contexts in our field, Imbens does not support p-values. He explicitly argues that in many economic research settings one should report the point estimate and the uncertainty associated with that point estimation – instead of a p-value. This is because a p-value is an immanent test in that it assesses the incompatibility of the data with the null hypothesis of no effect. In many research contexts, however, null hypotheses are of little interest. Imbens (2021: 162) gives the following example: “Although hypothesis testing is routinely used in economics, I would submit that many of the substantive questions are primarily about point estimation and their uncertainty, rather than about testing. However, many studies where estimation questions should be the primary focus present the results in the form of hypothesis tests. […Take] a specific example—the return to schooling—where testing a null hypothesis of no effect is common, yet arguably of little or no substantive interest. One would be hard-pressed to find an economist who believes that the return to education is zero.” In other words, since previous studies have already produced strong evidence for a positive return to schooling, using a p-value to assess the compatibility of the data with the highly unlikely and, therefore, uninteresting hypothesis of exactly zero return is not a meaningful way of summarizing the evidence in that data.

Imbens strictly limits his support of p-values to one specific research setting. This is when substantial prior probability can be put, and is put, on the null, i.e., when the null represents the most interesting hypothesis to compare the data with. In other words, the null must be specified as to represent the most established prior scientific belief (this is rarely done in our field). Imbens argues that in this case, and in this case only, a low p-value – i.e., a high incompatibility of the data with the established prior scientific belief – is an informative way of summarizing the evidence in the data. But he also cautions that a high incompatibility of a single dataset with a strong prior scientific belief can only serve as an auxiliary means to answer the question of whether it might be worthwhile investigating the issue further with new data. And yet with regard to this limited purpose, he warns that even very small and, therefore, economically irrelevant effects are necessarily associated with small standard errors and, therefore, small p-values in large samples.

While the discussion paper again refers to Imbens on page 20 by noting that he specifies applications where p-values are useful and where they are not, it misses stating unmistakably (i) that Imbens sees little purpose for a binary significance indicator. It also misses communicating the crucial implication of Imbens’ argument for our community (ii) that many, if not most, research settings underlying agricultural economics studies do not coincide with the research setting in which Imbens considers p-values to be useful. In other words, Imbens would not only have to be referred to as an opponent of using a “binary significance indicator” but – as regards common research settings in our field – also as a critic of the convention to present results in the form of p-values. Neither of these two crucial statements/conclusions is clearly conveyed in the discussion paper so far.

2 Missing clearness regarding the question of methodological choice
My general impression is that the paper should emphasize more clearly that statistical inferential procedures are a means to an end but not an end in itself. In other words, it is crucial to realize that we are dealing with a methodological choice for which a justification has to be provided under consideration of the research context and the kind of the intended inference. For example, stating in a study’s objective section that the study is aimed at finding out whether an effect is statistically significant or not does not make sense. Instead, one would need to know for which kind of inference the selected inferential statistic – be it a standard error, t-ratio, p-value, significance statement or confidence interval – is used as an auxiliary means. Analogously, stating in the results section that an effect is statistically significant or not can never be the bottom line of inference. I believe that a clear understanding of this “means-end-relationship” is of utmost importance, and this has several implications:

The end must be clearly communicated. Therefore, in my view, the discussion paper should emphasize more strongly that each researcher is required to describe the data generation process (sampling design) and the broader population of interest from which the sample was drawn and to which generalizations are to be made. Without knowing how data were generated and without a clear definition of the inferential target population – be it a numerically larger population or a superpopulation (if deemed useful) – all statistical inferential statements are opaque, at best, or misleading, at worst, because the end to which the means (i.e., the inferential statistics) are used remains unclear.

After having emphasized the need to describe the data generation process and the inferential target population, the paper should, in my view, more clearly address the question of why or when researchers should transform the two original pieces of information that we can derive from a random sample – the effect size estimate (point estimate) and its estimated sampling variation (standard error) – into a p-value or a dichotomous significance statement (or some other transformation of those two original pieces of information). I think that from your statements in this regard (e.g., in Section 4.1), readers will not understand what you consider the most adequate means (i.e., the “best” way of reporting the evidence) in which circumstances. I believe that this is partly because the paper does not unambiguously distinguish between the use of p-values, on the one hand, and the use of threshold-based statistical significance statements, on the other.

As already indicated above, I do not fully understand which research contexts you distinguish (i.e., your classification of research contexts) and in which contexts you suggest which inferential statistical procedure (inferential statistic) as adequate means to report the evidence in the data and support inferences towards a broader context. In my opinion, there are several open questions that need to be answered to provide more clearness:

(1) Referring to both data-dependent modelling choices and research questions such as the efficient market hypothesis, you state that “[t]esting a null hypothesis versus an alternative hypothesis is meaningful.”

Do you suggest that an identical statistical procedure should be used for both cases even though they are quite different? If so, which procedure do you propose – (i) the hypothesis testing approach in the Neyman-Pearson (NP) tradition where there is a clearly specified alternative hypothesis or (ii) the null-hypothesis-significance-testing (NHST) approach where the alternative hypothesis is only a vague non-null proposition?

While there is some ambiguity because you do not use the technical terms, your wording suggests to me that you recommend the NP-approach (“statistical decision theory”). In the NP-framework, a dichotomous choice is made between a decision associated with the null hypothesis H_0 and a decision associated with a concrete alternative hypothesis H_A. Regarding the decision rule you state: “In these situations, a decision shall be made based on a statistical decision rule. This then necessarily includes a threshold determining what the decision will be.” I fear that this general statement will not suffice to make things clear to readers who are not familiar with the NP-approach. In the NP-framework, the choice is based on a decision rule α, which is the p-value threshold below which the null is rejected. An appropriate level of α (also called type I error rate or “false positive rate”) must be set depending on the parameters of the decision context. In particular, the type II error rate β (also called “false negative rate”) that is associated with a given level of α as well as the costs that are associated with type I and type II errors, respectively, must be considered when setting α to a level that represents an adequate decision rule in the given context.

If you suggest using the NP-framework, some important implications should be emphasized. For example, based on your brief statement, readers will probably not realize that the decision-rule α is not about inferring whether H_A or H_0 is true or more likely but about making the right decision under consideration of the costs associated with either choice. With ceteris paribus increasing type II error costs, the decision rule α must be set to increasingly high levels. This is because there is a tradeoff: increasing the type I error rate α (false positive rate) reduces the type II error rate β (false negative rate), and vice versa. Similarly, your brief statement does not convey the crucial fact that, contrary to widespread perceptions, not p but α is the type I error rate and that the precise value of p in a particular test is completely irrelevant in statistical decision theory. The only relevant information is whether p falls into the rejection region or not.
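
The tradeoff between α and β, and the role of error costs, can be sketched numerically. The example below is only an illustration under explicit assumptions: a one-sided z-test with a known standard error, one concrete alternative (effect of 2 with standard error 1), and a weighted sum of the two error rates as the loss to be minimized. The loss function, the cost values and all function names are hypothetical and not taken from the discussion paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type_two_error_rate(alpha, effect, se):
    """beta for a one-sided z-test of H0: effect = 0 against one specific H_A."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(z_crit - effect / se)

def cost_minimizing_alpha(cost_type_one, cost_type_two, effect, se):
    """Grid search for the alpha that minimizes a weighted sum of both error rates.

    The weighted-sum loss is an illustrative assumption, not a rule from the text.
    """
    grid = [i / 1000 for i in range(1, 1000)]
    return min(grid, key=lambda a: cost_type_one * a
               + cost_type_two * type_two_error_rate(a, effect, se))

# Hypothetical decision context: alternative effect of 2 with standard error 1.
for cost_type_two in (1, 5, 25):
    alpha = cost_minimizing_alpha(1, cost_type_two, effect=2.0, se=1.0)
    beta = type_two_error_rate(alpha, effect=2.0, se=1.0)
    print(f"type II cost = {cost_type_two:>2}: alpha = {alpha:.3f}, beta = {beta:.3f}")
```

As the assumed type II error cost rises, the cost-minimizing α rises and β falls. In this setup the observed p-value enters only through the question of whether it lies below the chosen α.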

If you suggest using the NP-framework, more questions need to be answered to enable the reader to understand which concrete procedure you actually suggest: (i) Do you suggest to routinely use a conventional threshold such as α=0.05 as general default for all contexts? This would correspond to the argument that 0.05 is a rule-of-thumb that works sufficiently well for all contexts, irrespective, for example, of the levels of type I and type II error costs. While I would not share that argument, it might be defended as being a simplified, pragmatic approach. Is that your position? If so, it should be clearly stated. (ii) Or do you suggest that researchers provide at least a qualitative discussion of the parameters of the decision context and then informally set α to some “plausible” level? If so, it should be clearly stated. (iii) Or do you suggest “between the lines” of your brief statement that researchers formally derive a decision rule α? If so, it should be clearly communicated by all means.

(2) My next question refers, again, to the research contexts that you try to illustrate through examples such as the efficient market hypothesis. How would you define the research contexts that you have in mind here? Because only examples but no clear specification are provided, I have to guess: Are the contexts that you have in mind the same as those that Imbens (2021) identifies as contexts where substantial probability can be put on the null? If yes, for what reason do you propose making dichotomous significance statements in those contexts? Imbens argues that, when substantial prior probability can be put on the null, using the p-value to assess the incompatibility of the data with that prior belief is meaningful and more informative than a binary significance indicator. According to Imbens, low p-values can then be used as means to decide whether to investigate the issue further with new data. Now, my question is: How does your proposition regarding the proper use of inferential statistics in the specified context relate to Imbens’ proposition?

(3) You make an important statement regarding a further context, which is arguably the most relevant one in our field: “In many economic applications, however, testing against a null hypothesis of ‘no effect’ is not of particular interest. For example, it is not exciting to test whether farmers’ education increases farm income or not, whether a gender pay gap exists or not or whether investment aid stimulates investment demand or not. Here the magnitude of the (treatment) effect is what matters and the causal mechanism, e.g. how investment aid stimulates investments. We believe that in situations, where no specific decision on a hypothesis has to be made, it suffices to display standard errors or […].” Here, you seem to refer to contexts where the null is uninteresting/unlikely because previous studies have already produced strong evidence for the existence of an effect, such as in Imbens’ return-to-schooling example. Your requirement to report the magnitude of the effect and its standard error also fully agrees with Imbens’ view that one should report the point estimate and the uncertainty of that point estimation in such contexts. But in Imbens’ view, this is not a case where it is meaningful to report p-values because p-values would indicate the strength of the evidence in the data against a null hypothesis that is, as you mention yourself, “not of particular interest.” Therefore, I find it confusing that you continue your statement above by saying: “[…] or to interpret p-values as indicators of the general compatibility of the data with the corresponding hypothesis.” What do you intend to say with this “or” sub-clause? We can, of course, go through the mathematical manipulations to transform the point estimate and the standard error into a p-value that assesses the compatibility of the data with the null. But why should we summarize the evidence in the form of a p-value in research contexts where the null hypothesis is uninteresting from the very start? I think that the discussion paper should provide a clear answer to that question, which is one of the most crucial ones in the present methodological debate.

(4) Finally, I fear that readers will not grasp the assumptions and “means-end-relationship” in your rather passing mention of the concept of power (and related terms such as false positives, false negatives, etc.). In particular, it should be clarified that the power concept makes only sense in the dichotomous NP-framework where a null and a concrete alternative hypothesis are defined between which a rule-based decision is to be made. Power is defined as 1 − β. It quantifies the repeatability of p ≤ α (and, therefore, the rate of rejection of H_0) when H_A is true. In other words, it is the rate of acting as if H_A were true when it is true (“true positive rate”).
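
The reading of power as the repeatability of p ≤ α under H_A can be checked with a small simulation. The sketch below is illustrative only and rests on assumptions that are not in the discussion paper: a one-sided z-test of H_0: μ = 0 against H_A: μ = δ with known σ, normally distributed observations, and hypothetical parameter values and function names.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def analytic_power(alpha, effect, sigma, n):
    """Power (1 - beta) of a one-sided z-test of H_0: mu = 0 vs. H_A: mu = effect."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - effect * n ** 0.5 / sigma)


def simulated_rejection_rate(alpha, effect, sigma, n, replications, seed=1):
    """Share of replications with p <= alpha when H_A is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        # Draw a fresh random sample from the population described by H_A.
        sample = [rng.gauss(effect, sigma) for _ in range(n)]
        z = fmean(sample) * n ** 0.5 / sigma
        p_value = 1 - STD_NORMAL.cdf(z)  # one-sided p-value under H_0
        rejections += p_value <= alpha   # decision: act as if H_A were true
    return rejections / replications


# Hypothetical setting: alpha 0.05, effect 0.5, sigma 2, n = 50.
print("analytic power: ", round(analytic_power(0.05, 0.5, 2.0, 50), 3))
print("simulated rate: ", simulated_rejection_rate(0.05, 0.5, 2.0, 50, 2000))
```

Both numbers describe the same quantity: how often the rule “reject if p ≤ α” leads to rejection when the concrete alternative holds. Neither can be computed without specifying that alternative.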

I believe that the discussion paper should clarify that the concept of power makes only sense in contexts where the NP-approach is used to make a decision between two alternatives but not in contexts where the substantive question is about effect size estimation and the uncertainty of that estimation.

3 Your dealing with assumptions violations

Assumptions violations are another major issue that, in my view, is not covered clearly enough. On page 20, you rightly state: “[I]f data come from convenience sample, any source of potential bias regarding estimates of regression coefficients and standard errors should be carefully considered and discussed.” I don’t think this statement is clear enough to make a substantial contribution to mitigating inferential errors associated with the widespread assumptions violations in the practice of empirical research. As it is, I suspect that readers of the discussion paper such as PhD students will not understand how they have to deal with generalizing statistical inference in the case of convenience samples. Of course, this issue is again related to the question of the adequateness of “means” to an “end” – but now in a more fundamental way: can sample statistics that would carry inferential meaning if data were probabilistically generated be adequate means to the end of generalizing towards a broader context when the data were not probabilistically generated? This is an extremely relevant question in our field where p-values and asterisks are routinely displayed (and often called for by reviewers) whenever there are quantitative data – without questioning whether there is a chance model upon which to base statistical inference. This routine includes “grossly-non-random” samples of haphazardly recruited respondents that researchers could get hold of, in one way or the other.

From a logical point of view, what should be done is quite unambiguous: using inferential statistical procedures to generalize from samples to populations in the case of convenience samples would have to be justified by either running a trustworthy sample selection model that would rehabilitate the statistical foundations of statistical inference (see Hirschauer et al. 2020), or by assuming that those convenience samples are approximately random samples. Since many researchers who use convenience samples simply resort to the standard error formula for simple random samples without a second thought, one would even have to assume that all those convenience samples are approximately simple random samples.

I think that the discussion paper should communicate beyond any doubt that such an “approximately-a-random-sample argument” is often a heroic assumption but absolutely necessary from a logical point of view for statistical inferential procedures to make any sense when, in fact, the data generating process was not probabilistic. Whether the approximately-a-random-sample argument is then deemed trustworthy or helpful in the specific context – say, in the case of convenience samples of individuals who are haphazardly recruited on some venue, location or the Internet, or in the case of the Testbetriebsdaten – would then be at least a transparent issue open for judgment by the reader of a study.

References

Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2021): A Primer on p-Value Thresholds and α-Levels – Two Different Kettles of Fish. German Journal of Agricultural Economics 70: 123-133 (DOI: 10.30430/70.2021.2.123-133).

Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2020): Can p-values be meaningfully interpreted without random sampling? Statistics Surveys 14(2020): 71-91 (DOI: 10.1214/20-SS129).

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Comments regarding your references

The paper “Can p-values be meaningfully interpreted without random sampling?” was published in Statistics Surveys in 2020, not in 2019, as you erroneously indicate in your reference list. But it seems that in many places where you refer to Hirschauer et al. (2019) in the text, you intend to refer to the following publication, which, in turn is missing in the list of references: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2019): Twenty steps towards an adequate inferential interpretation of p-values in econometrics. Journal of Economics and Statistics 239(4): 703-721 (DOI: 10.1515/jbnst-2018-0069).

You write on page 6 that “Hirschauer, Mußhoff and Grüner (2017: p. 5) argue that ‘multiple testing is inherent to multiple regression since we test as many null hypotheses as we have variables of interest.’” The reference given for this quote is the following publication: Hirschauer, N., Mußhoff, O. and Grüner, S. (2017). False Discoveries und Fehlinterpretationen wissenschaftlicher Ergebnisse. Wirtschaftsdienst 97(3): 201–206. This is not correct. The Wirtschaftsdienst paper is in German. We have discussed multiple testing and made the above statement, not verbatim but as regards content, in the following publication: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2018): Pitfalls of significance testing and p-value variability: An econometrics perspective. Statistics Surveys 12(2018): 136-172 (DOI: 10.1214/18-SS122). I believe that you quote an earlier working paper version of that publication. You may want to correct that.

On page 7, you write: “Hirschauer et al. (2019) argue that convenience sampling precludes the use of p-values because researchers run the risk of misestimating coefficients and standard errors, at least if selection bias is not adequately considered.” I believe the intended reference here is to the 2020 paper “Can p-values be meaningfully interpreted without random sampling?” or, alternatively to the following paper: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2021): Inference using non-random samples? Stop right there! Significance (October 2021): 20-24 (DOI: 10.1111/1740-9713.01568).

On page 26, you refer to the guidelines I suggested for journals on the pre-conference p-value workshop of the 2021 GEWISOLA conference. Your reference is to: Hirschauer, N. (2021). The debate on p-values and statistical inference: What are the consequences for our community? Problems and solutions in statistical practice. But no further information regarding the material/slides of my presentation is provided. Since you use it as a reference, the material should be made accessible to the reader, e.g., by uploading it to the GEWISOLA homepage and providing a link. If need be, I could also upload the slides to my personal MLU-website.

Commenter:

The commenter has declared there is no conflict of interests.

Response by the authors to the comments by Norbert Hirschauer

We would like to thank Norbert Hirschauer for the detailed comments. We respond to them point-by-point below.

General response: Our manuscript wants to give a comprehensive overview of the debate, but we also discuss the attitudes of the community, and many of the possible remedies. Consequently, some points may not have gotten the attention they deserve, but there are also plenty of resources readers are pointed towards to learn more.

We are no statistical authorities to make objective and final judgments. As part of the community and as affected researchers, editors, reviewers, supervisors, we see it as our task to raise issues from the community and to summarize/synthesize these issues for the community which may also involve some subjectivity. We also believe that such subjectivity and critical decisions are part of any empirical work.

Ultimately, we see it as our main mission to bring the debate to the community, one aspect of which is the discussion in this forum. So, thanks again for making a start. If the manuscript is accepted in a journal, and if the editors agree, we will include a link to this forum, so that all readers and interested members of the community can contribute to the debate with their standpoints as well.

Response to point 1 – Presentation of Imbens (2021)

We thank Norbert Hirschauer for prompting us to clarify (our perception of) the main message of Imbens’ recent paper on the p-value debate. This is particularly valuable, because we have much sympathy for the arguments presented in this paper. Actually, they are rather close to our own position.

We admit that Imbens (2021) is not the best reference for the statement “that nothing is wrong with p-values, if they are used correctly” (page #) and we replaced this reference by Verhulst (2016). Having said this, we think it is futile to base a claim on whether Imbens belongs to the camp of opponents or defenders of p-values. Both groups can find arguments for their respective views in his article. In a nutshell, after briefly reviewing the controversy about p-values and significance reporting, he distinguishes empirical economic applications with regard to their main objective, namely estimation in the sense of quantifying the magnitude of an effect versus hypothesis testing. In the next two subsections, Imbens argues why p-values contribute only little in the former research setting and why p-values and significance testing are useful in the latter.

This does not imply that p-values are without problems when hypothesis testing is the focus of an economic study and neither Imbens nor we make this claim. However, we do not share Norbert Hirschauer’s perception that “for many, if not most, research contexts in our field, Imbens does not support p-values”. On page 165 Imbens provides a couple of examples where it may be reasonable to focus on testing null hypotheses. (In our discussion paper we extend this list.) If there was only little need for hypothesis testing in empirical economic research, how could Imbens arrive at the conclusion: “In my view banning of p-values is inappropriate.” (page 170)?

Response to point 2 – Missing clearness regarding the question of methodological choice

We agree that methods are a means to an end, not an end in themselves. Of course, researchers should clearly describe how samples are generated and under which assumptions they operate (also see response to point 3).

Fisherian Null Hypothesis Significance Testing (NHST) vs. Neyman-Pearson Framework (NPF)

We want to clarify that we do not suggest using any of the frameworks at all times. It depends on the case at hand. We believe that the Fisherian NHST is particularly useful in explorative analysis (compatibility of data under the null may point towards issues to study in greater depth) and in deductive work if the goal is to establish the initial presence of a directional effect based on theory. The NPF will likely be more useful in cases with well established priors and if loss functions can be reasonably specified. That being said, the NPF should then maybe be used more in deductive work related to well-established literature.

Readers who want to learn more about the history and debate around the two approaches can find a detailed treatment in: Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

A critique of the practice of NHST is provided by: Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Research contexts, in which NHST is meaningful

Your conjecture, that we borrowed from Imbens (2021) when describing research contexts for meaningful applications of the NHST, is correct. We refer to situations in which the Null Hypothesis is not a “strawman”, i.e. the Null has “substantial” prior based on theoretical reasoning or previous empirical work. We think, however, it is not possible to specify research questions for which this precondition unambiguously holds or not. The efficient market hypothesis or the law of one price may serve as an example. A rejection of the Null, i.e. an incompatibility of data with the hypothesis, would be much more surprising for stock markets than for land markets.

At this point, it is legitimate to ask whether one should stop at displaying p-values as a measure of the (in)compatibility of the data with the Null hypothesis or whether one should go one step further and draw a “conclusion” in the sense of rejecting or accepting the Null hypothesis (without implying that it is actually true / false). Our position here is that one should neither apply NHST mechanically nor rule it out. Prominent examples that require an application of NHST can be found in the context of econometric model building. Proper specification of econometric models typically involves a series of decisions: Are economic time series stationary or not? Are random variables normally distributed or not? Do economic variables exhibit spatial correlation or not? Is endogeneity an issue or not? The list of examples can be extended arbitrarily. Discrete choices have to be made and NHST is definitely a useful tool to support these decisions despite its pitfalls.

The informative value of p-values is limited if the null is not very likely

Indeed, the focus in the described cases should be on effect sizes. In cases where the null hypothesis is likely not to be true, because theory or prior work strongly suggest that there is an effect, the NPF could also often be the more adequate framework. In other words, p-values are probably more useful to explore. We do not want to suggest that p-values should always be used.

Power and the NPF

We agree that power is related to the NPF and its calculation requires the specification of a specific alternative hypothesis. (This should be evident from Figure 1.) However, effect sizes can also be estimated in the NPF (priors and an alternative hypothesis do not stop one from doing so).

Response to point 3 – Violations of assumptions

We agree with the statements. This point is critical, since it is very often difficult to work with random samples of farmers and consumers in the agricultural and food economics community. It is important to clearly distinguish the problem of inference and the problem of bias from non-random samples. In our perception, researchers in the community struggle more with the former (inference from non-random samples), but often make an honest attempt to discuss the latter (bias from non-random samples).

Regarding the problem of inference, we agree that non-random samples should not be used lightheartedly for inference to the population. Researchers have the option to present their data, models and associated estimates without inference. For instance, authors could run a regression and present their point estimates. The bigger question is: what is the value of such studies if no inference can be made? What is there to learn for the reader? In other words, in many instances, a careful discussion of biases and a discussion of assumptions might be more fruitful.

Alternatively, authors could use inferential statistics (including the use of p-values and confidence intervals), but point towards the violation of the critical assumptions of a random data generation process. Ultimately, it is difficult to generally judge as to how far assumptions of an “approximately-a-random-sample argument” are violated. We believe that transparency and critical reflection are key. As a minimum standard, if such assumptions are made, they should, of course, be made explicit in the communication of research results for the community to judge.

We believe that researchers should continue to make an honest attempt to discuss biases on a case-by-case basis. For instance, a survey of a self-selected sample on farmers’ willingness to participate in research surveys is probably very likely to produce an overestimate of that willingness, whereas a question on the farm size may produce a somewhat smaller bias (if any): Larger or smaller farms may still be more or less likely to respond for various reasons, but it is plausible that the bias is smaller in the second case. As a consequence, it would be a task for the research community as a whole to produce and publicly share high quality random samples or population data that allows researchers to assess biases on as many variables as possible. Some issues will still be unresolved though. Imagine the case where the willingness to participate in surveys is uncorrelated with other variables and unobserved in the population or a random sample. It will then be the task of the researcher to discuss sources and direction of bias.
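
The self-selection example can be illustrated with a toy simulation. This is only a sketch under strong, hypothetical assumptions that go beyond the text above: each farmer’s response probability equals his or her willingness to participate, willingness is uniformly distributed, and farm size is drawn independently of willingness (the paragraph above allows for some dependence). All names and parameter values are made up for illustration.

```python
import random
from statistics import fmean


def simulate_self_selection(population_size=20000, seed=1):
    """Compare respondent means with population means in a toy population."""
    rng = random.Random(seed)
    # Assumption: willingness to participate is uniform on [0, 1].
    willingness = [rng.random() for _ in range(population_size)]
    # Assumption: farm size (ha) is lognormal and independent of willingness.
    farm_size = [rng.lognormvariate(3.5, 0.5) for _ in range(population_size)]
    # Self-selection: a farmer responds with probability equal to willingness.
    responds = [rng.random() < w for w in willingness]
    resp_willingness = [w for w, r in zip(willingness, responds) if r]
    resp_farm_size = [s for s, r in zip(farm_size, responds) if r]
    return {
        "willingness_population": fmean(willingness),
        "willingness_respondents": fmean(resp_willingness),
        "farm_size_population": fmean(farm_size),
        "farm_size_respondents": fmean(resp_farm_size),
    }


for name, value in simulate_self_selection().items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, the respondent mean clearly overstates willingness, while the farm-size means nearly coincide. Without population data on willingness itself, a researcher could not detect the first bias from the sample alone.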

We may revisit our discussion of implications with respect to this issue in the course of a revision. A first implication for teaching and training of PhD students would involve that method courses cover data generation, sampling and representativeness together with how to best address the aim of the research (hypothesis testing, effect quantification, explorative analysis for theory development, etc.). This, however, would in our view support the move towards more holistic courses including scientific practice and methods. Discussing commonly used observational data sets and how these may suffer from bias would enrich such courses/modules. This would need to go along with a more stringent research data management and community-specific rules/standards for documentation, including discussion of their sources of bias and limits for statistical inference along with the FAIR Guiding Principles for scientific data management and stewardship (findability, accessibility, interoperability, and reusability).

Finally, we want to point out that bias and misspecified sampling error are interlinked. Hence, we want to emphasize that both issues should be discussed separately. If the real sampling error is larger than the assumed sampling error, confidence intervals become wider, whereas biases may shift the interval upward or downward. We agree that this is an additional problem of using p-values in non-random samples.
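
A small simulation can show how far the real sampling error may exceed the assumed one. The sketch below is an illustration, not an analysis of any actual data set. It assumes a hypothetical two-stage design (10 clusters with 20 units each), a common random effect shared within each cluster, and a naive standard error computed with the formula for simple random sampling.

```python
import random
from statistics import fmean, stdev


def one_cluster_sample(rng, n_clusters, per_cluster):
    """Draw a sample in which units from the same cluster share a common effect."""
    sample = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0.0, 1.0)  # assumed between-cluster variation
        sample.extend(
            cluster_effect + rng.gauss(0.0, 1.0) for _ in range(per_cluster)
        )
    return sample


def compare_standard_errors(n_clusters=10, per_cluster=20, replications=500, seed=1):
    """Return (actual SD of the sample mean, average naive SRS standard error)."""
    rng = random.Random(seed)
    means, naive_ses = [], []
    for _ in range(replications):
        sample = one_cluster_sample(rng, n_clusters, per_cluster)
        means.append(fmean(sample))
        # Naive formula that presumes simple random sampling: s / sqrt(n).
        naive_ses.append(stdev(sample) / len(sample) ** 0.5)
    return stdev(means), fmean(naive_ses)


actual, naive = compare_standard_errors()
print(f"actual SD of sample means: {actual:.3f}")
print(f"average naive SRS standard error: {naive:.3f}")
```

In this setup the spread of the sample means across replications is several times the naive standard error, so intervals built from the naive formula are far too narrow. Bias is a separate matter and would shift such intervals rather than widen them.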

Comments regarding the references

We thank Norbert Hirschauer for the points raised regarding our citing and references. As these issues were pointed out to us at an earlier stage already, we have tried to address them in this preprint. We will carefully check the references again and, if needed, revise them accordingly in a new version of the manuscript.

Thomas Heckelei, Silke Hüttel, Martin Odening, Jens Rommel

Commenter: Norbert Hirschauer

The commenter has declared there is no conflict of interests.

Comment: Reforming statistical practices in the agricultural economics community in Germany – Which steps should be taken next?

The discussion paper by the GEWISOLA p-value working group, which dealt with the pitfalls of p-values and null-hypothesis-significance-testing (NHST), represents a timely contribution to the topical debate regarding the reform of inferential reporting practices. But even though the paper was explicitly published here to facilitate discussion postings from members of our community (see GEWISOLA newsletter 1/2022), strikingly few comments have been made. At first, I found this very surprising. For one thing, most empirical work in our field is still heavily based on NHST-routines, which many still seem to follow despite severe criticism. The “silence” regarding reform requirements is all the more remarkable when one takes into account that some leading economics journals, which are usually considered as beacons for best practice, changed their inferential reporting standards already some years ago. The author guidelines of the American Economic Review, for example, read as follows: “Do not use asterisks to denote significance of estimation results. Report the standard errors in parentheses.”

Individual and institutional-level efforts for better inferences

I believe that a vivid discussion on this preprint platform (and elsewhere) would be a great chance to effectively raise the problem awareness among German agricultural economists, including PhD students. Because of many personal communications that acknowledge widespread inferential errors in the practice of research as well as open questions, I do not believe that the low number of comments reflects a low interest in statistical reforms in our community. But of course, it would be interesting to learn why virtually no individual researcher felt inclined or dared to make a comment on this public platform so far. After all, it is the individual researcher who is responsible for following the rules of good scientific practice and avoiding inferential errors as best as possible.

However, an exclusive focus on the individual might miss the point. Research – and the use of statistics in research – is a complex social enterprise. In this enterprise, the individual researcher, and especially a young researcher, is not the most potent agent of change for doing away with damaging conventions. Quite the contrary. The individual researcher must find his/her way through the thicket of a still predominant NHST-routine that has been entrenched in the community for decades through inappropriate teaching, unwarranted reviewer requests, and even best-selling statistics textbooks.

Changes for the better depend to a large extent on institutions and their codes of conduct that govern the behavior of re-searchers.This includes, for example, formal codes of conduct endorsed by professional associations and funding organizations. But above all, changes for the better depend on scientific journals with their guidelines and review processes.The methodological debate in a nutshellDespite the delusive term “hypothesis testing,” statistical inference is no magic that could tell us whether some hypothesis about a real-world state of interest is true or not. However, its principal idea to learn (infer) something about a population based only on a random sample of that population is quite simple. Imagine you have a sample with 500 observations for a variable

X(education) and a variableY(income). Irrespective of how those observations were obtained, we can compute summary sample statistics that inform us about certain features of these data. Examples are the means and standard deviations ofXandY, or a relationship (correlation or regression coefficient) between those 500X- andY-observations.If the sample was randomly drawn from a population, summary sample statistics such as the

sample'X-Y-relationship can be used as point estimate for the (unknown)population'X-Y-relationship. And another sample statistic, the standard error, can be used as estimate for the uncertainty caused by random sampling error. “Standard error” is but another label for the standard deviation of the (sampling) distribution of all point estimates that we would find if we independently drew very many equal-sized random samples from thesameparent population.In brief, what we can extract – at best – from a random sample is an unbiased

point estimateof an unknown population effect size (e.g., the relationship between education and income) and an unbiased estimation of the uncertainty, caused by random error, of that point estimation (i.e., thestandard error). We can, of course, go through various mathematical manipulations. Butwhy should we transform two intelligible and meaningful pieces of infor-mation – point estimate and standard error – into aThis is a particularly urgent question given the considerable costs in the form of information losses, misdirected incentives, and inferential errors that are associated with the NHST-routine.p-value or even a dichotomous significance statement?It cannot be emphasized enough that statistical inference is based on probability theory and a formal chance model that links a randomly generated dataset to a broader target population. It is a means to the end of evaluating a study’s knowledge contribution given the uncertainty caused by random sampling error (note that I do not talk here about causal inference such as in randomized controlled trials). Therefore, statistical inference aimed at generalizing to populations requires that the sample under study is a random sample. Alternatively, one would need a sample selection model to correct for selection bias or one would have to

assumethat the sample isapproximatelya random sample. The latter is often a “heroic” but deceptive assumption. This becomes evident from the fact that probabilistic sampling designs such as cluster sampling can lead to standard errors that are several times larger than the default which presumes simple random sampling. We must knowhowmembers of the population were selected into the sample to be able to estimate the uncertainty caused by random sampling error (i.e., the standard deviation of the sampling distribution). Therefore, standard errors andp-values that are just based on a bold assumption of random sampling – contrary to how data were actually collected – are virtually worthless. In other words, contrary to the intention of adequately communicating uncertainty, reporting standard errors orp-values for non-random samples might delusively convey excessive certainty imposed by wrong assumptions about the data generation process and, thus, the data distributions used in statistical analysis. To put it more bluntly,proceeding with the conventional routine of displaying p-values and statistical significance even when the random sampling as-sumption is grossly violated is tantamount to pretending to have better evidence than one has. This is a breach of good scientific practicethat provokes unwarranted moves from the description of patterns in some conveniently available data to overconfident generalizations beyond the confines of the particular sample.The discussion paper as starting point for reformsWhile it is not very explicit in all respects, the paper by the GEWISOLA

p-value working group addresses the two crucial issues discussed above and tries to raise critical awareness regarding the shortcomings of conventional statistical practices. Regarding the issue of information transformation, it states, for example: “In many economic applications, however, testing against a null hypothesis of ‘no effect’ is not of particular interest. For example, it is not exciting to test whether farmers’ education increases farm income or not, whether a gender pay gap exists or not or whether investment aid stimulates investment demand or not. Here the magnitude of the (treatment) effect is what matters[…].”Regarding assumptions violations the discussion paper notes: “

Perhaps the most basic question is whether observed data can be considered as a random sample, i.e. as an outcome of a random data generating process, because this is a prerequisite for inferential statistics.[…]if data come from convenience sample, any source of potential bias regarding estimates of regression coefficients and standard errors should be carefully considered and discussed.”Despite the discussion paper and some earlier GEWISOLA activities such as last year’s pre-conference workshop on

p-values and statistical inference, it seems to me that, in general, the public debate in our community has been too weak to move many researchers away from the inferential errors associated with automated NHST routines. In my experience, this holds not only for PhD students in their defenses but also for senior researchers authoring agricultural economics publications. In brief, there is much business as usual, as if the methodological debate about “Statistical inference in the 21st century: A World Beyond p < 0.05” did not exist. That is, many study results are still presented as if “obtaining statistical significance” were the ultimate end of science, instead of adequately using inferential statistics as what they are: auxiliary means for assessing the informational value of a sample-based point estimate in the light of the uncertainty caused by random sampling error.

The near-automatic routine of making dichotomous significance statements whenever there are quantitative data goes hand in hand with a lack of consideration of the implications of assumptions violations. While many studies in our field are based on non-random (convenience) samples, very few of them acknowledge the fact that non-random sampling error cannot be assessed by statistical methods designed for dealing with random sampling error. That is, many studies implicitly pretend to have better evidence (i.e., random samples) than they have (i.e., non-random samples). Thus, they provoke or at least tacitly condone overconfident generalizations beyond the confines of the convenience sample. I believe that this is one important instance where the warnings in the discussion paper were not explicit enough to get through to everybody. Stating that biases that result from assumptions violations “should be carefully considered and discussed” (see quote above) is too vague to do away with the entrenched routine of reporting inferential statistics for non-random samples even when doing so is a blunder based on “heroic” assumptions regarding the data generation process.

Senior scientists who supervise research projects and PhD students should do their best to ensure that misuses and misinterpretations of inferential statistics are avoided in their area of responsibility. For example, every PhD student who resorts to inferential statistical procedures should be qualified and knowledgeable enough to relate the inferential approach used in the dissertation to the topical methodological debate on
p-values and statistical significance. But these requirements are apparently often not met. I believe that this is a serious problem for our profession.

If we delay institutional reforms that can quickly change statistical practices for the better, we will not be able to reduce the inferential errors made in our community in due time. And in the long run, closing our eyes to the problem will make us fall behind other researchers and research communities.

Delayed reforms will also result in a loss of resources as conclusions from research are wrong and resources for future research are misdirected. However, if we act immediately, we might still have the chance to be at the forefront of methodological progress instead of lagging behind.

One might speculate that the paper by the GEWISOLA p-value working group was a necessary but, by itself, not sufficient step to bring about the indispensable changes in statistical practice. Institutional-level efforts such as the revision of journal guidelines (e.g., of the GJAE) or a formal code of conduct (“inferential quality standard”) endorsed by the GEWISOLA are likely to provide more effective guidance for our community. The outcome could be similar to that of the GEWISOLA journal ranking, which proved effective in our day-to-day work of choosing research outlets and reviewing. But, of course, such a quality standard would have to be drafted, actively discussed, and agreed on by the members of the GEWISOLA.

Clear
and formal inferential reporting guidelines would have several benefits: They would effectively communicate necessary standards to authors and would help reviewers assess the credibility of inferential claims. They would also provide authors with an effective defense against unqualified reviewer requests. The latter is arguably the most important benefit because it would also mitigate the publication bias that results from the fact that many reviewers still prefer statistically significant results and pressure researchers to report p-values and “significant novel discoveries”, often without even taking account of whether data were randomly generated or not.

A short textbook for statistical practitioners in an era of reform

The issues surrounding the scientific debate concerned with the pitfalls of p-values and NHST are also covered in the book “Fundamentals of Statistical Inference: What is the Meaning of Random Error?” by Hirschauer, Grüner, and Mußhoff. The book is part of the SpringerBriefs in Applied Statistics and Econometrics, a series published under the auspices of the German Statistical Society.

Starting from the premise that a lack of understanding of the probabilistic foundations of statistical inference is responsible for the inferential errors associated with the conventional NHST routine, the book provides readers with an effective intuition and conceptual understanding of random error, sampling variation, and statistical inference. It also suggests clear guidelines (dos and don’ts) based on the understanding that the probabilistic assumptions regarding data generation must be met and that, if they are met, reporting point estimates and standard errors is a better summary of the evidence in a dataset than
p-values and statistical significance declarations. It is, thus, intended as a resource for statistical practitioners who are confronted with the methodological debate about the drawbacks of “significance testing” but do not know what to do instead. We hope, of course, that the book is informative for many readers and a valuable contribution to the reform debate. But again, it is “only” another publication by individual authors. As such, its potential to promote the necessary change of inferential practices in our community is very limited compared to institutional-level reforms such as the revision of journal guidelines or a formal GEWISOLA-statement that specifies quality standards (dos and don’ts) in inferential reporting that we should meet.
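
Two points made above can be illustrated with a small simulation: a standard error only quantifies random sampling error, and a point estimate reported with its standard error is informative only if the sample was randomly generated. The following Python sketch is not taken from the book or the discussion paper. The population of farm incomes, the sample size of 400, and the selection rule for the convenience sample (only farms with an income above 45 can be reached) are hypothetical assumptions chosen purely for illustration.

```python
import math
import random
import statistics


def mean_and_se(sample):
    """Return the sample mean and its standard error, s / sqrt(n)."""
    n = len(sample)
    return statistics.fmean(sample), statistics.stdev(sample) / math.sqrt(n)


rng = random.Random(42)  # fixed seed so that the sketch is reproducible

# Hypothetical population of farm incomes (arbitrary units); all numbers are made up.
population = [rng.gauss(50.0, 15.0) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# (a) Simple random sample: every farm has the same chance of being drawn.
random_sample = rng.sample(population, 400)

# (b) Convenience sample: ASSUMED selection mechanism in which only farms
#     with an income above 45 can be reached (e.g., via a mailing list).
reachable = [income for income in population if income > 45.0]
convenience_sample = rng.sample(reachable, 400)

rand_mean, rand_se = mean_and_se(random_sample)
conv_mean, conv_se = mean_and_se(convenience_sample)

# Report point estimates with standard errors instead of significance labels.
print(f"population mean:    {true_mean:.2f}")
print(f"random sample:      {rand_mean:.2f} (SE {rand_se:.2f})")
print(f"convenience sample: {conv_mean:.2f} (SE {conv_se:.2f})")
```

Under these assumptions, the standard error of the convenience sample only describes the variation among reachable farms. It says nothing about the gap between reachable farms and the population as a whole, which is why presenting it as if it captured the total estimation uncertainty would suggest more precision than the data can support.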