Version 1
: Received: 18 January 2022 / Approved: 20 January 2022 / Online: 20 January 2022 (15:32:13 CET)

How to cite:
Heckelei, T.; Hüttel, S.; Odening, M.; Rommel, J. The Replicability Crisis and the p-Value Debate – what Are the Consequences for the Agricultural and Food Economics Community?. Preprints 2022, 2022010311 (doi: 10.20944/preprints202201.0311.v1).

Abstract

A vivid debate is ongoing in the scientific community about statistical malpractice and the related publication bias. No general consensus exists on the consequences and this is reflected in heterogeneous rules defined by scientific journals on the use and reporting of statistical inference. This paper aims at discussing how the debate is perceived by the agricultural economics community and implications for our roles as researchers, contributors to the scientific publication process, and teachers. We start by summarizing the current state of the p-value debate and the replication crisis, and commonly applied statistical practices in our community. This is followed by motivation, design, results and discussion of a survey on statistical knowledge and practice among the researchers in the agricultural economics community in Austria, Germany and Switzerland. We conclude that beyond short-term measures like changing rules of reporting in publications, a cultural change regarding empirical scientific practices is needed that stretches across all our roles in the scientific process. Acceptance of scientific work should largely be based on the theoretical and methodological rigor and where the perceived relevance arises from the questions asked, the methodology employed, and the data used but not from the results generated. Revised and clear journal guidelines, the creation of resources for teaching and research, and public recognition of good practice are suggested measures to move forward.

Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received:
26 January 2022
Commenter:
Norbert Hirschauer
The commenter has declared there is no conflict of interests.

Comment:
This comment is a slightly shortened version of a comment I sent the paper’s authors on December 29, 2021. All arguments still apply because the paper has not been revised.

Dear colleagues, congratulations on your Discussion Paper. I have three major sets of comments that I would like to share. The first concerns your presentation of Imbens’ (2021) paper, the second what I find to be a lack of clarity regarding the question of methodological choice, and the third your treatment of violations of assumptions.

1 Your presentation of Imbens’ (2021) paper
I am confused about your presentation/interpretation of the paper by Imbens (2021). On page 2, you seem to create the impression that Imbens is a representative of those who counter criticisms of statistical significance testing. In my reading of Imbens’ paper, this is not correct. I believe confusion arises because you jump from criticisms of statistical significance testing to a justification of p-values, without clearly distinguishing the two approaches. You verbatim write: “However, this [i.e., criticisms of significance testing] is countered by others who acknowledge existing problems but nevertheless defend p-values, basically saying that nothing is wrong with p-values if they are used correctly (Imbens, 2021).” While Imbens identifies a research setting where he thinks that p-values are a meaningful way of reporting the evidence, he does not support binary significance statements. Therefore, I think that the confusing presentation of Imbens’ reasoning, which is again to be found on page 20, does not do justice to the crucial message of his paper:

Imbens does not support statistical significance testing. On the contrary, he sees “little purpose” for a “binary significance indicator.” For many, if not most, research contexts in our field, Imbens does not support p-values. He explicitly argues that in many economic research settings one should report the point estimate and the uncertainty associated with that point estimation – instead of a p-value. This is because a p-value is an immanent test in that it assesses the incompatibility of the data with the null hypothesis of no effect. In many research contexts, however, null hypotheses are of little interest. Imbens (2021: 162) gives the following example: “Although hypothesis testing is routinely used in economics, I would submit that many of the substantive questions are primarily about point estimation and their uncertainty, rather than about testing. However, many studies where estimation questions should be the primary focus present the results in the form of hypothesis tests. […Take] a specific example—the return to schooling—where testing a null hypothesis of no effect is common, yet arguably of little or no substantive interest. One would be hard-pressed to find an economist who believes that the return to education is zero.” In other words, since previous studies have already produced strong evidence for a positive return to schooling, using a p-value to assess the compatibility of the data with the highly unlikely and, therefore, uninteresting hypothesis of exactly zero return is not a meaningful way of summarizing the evidence in that data. Imbens strictly limits his support of p-values to one specific research setting. This is when substantial prior probability can be put, and is put, on the null, i.e., when the null represents the most interesting hypothesis to compare the data with. In other words, the null must be specified so as to represent the most established prior scientific belief (this is rarely done in our field).
Imbens argues that in this case, and in this case only, a low p-value – i.e., a high incompatibility of the data with the established prior scientific belief – is an informative way of summarizing the evidence in the data. But he also cautions that a high incompatibility of a single dataset with a strong prior scientific belief can only serve as an auxiliary means to answer the question of whether it might be worthwhile investigating the issue further with new data. And yet with regard to this limited purpose, he warns that even very small and, therefore, economically irrelevant effects are necessarily associated with small standard errors and, therefore, small p-values in large samples.
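The large-sample caveat can be illustrated with a minimal sketch. It assumes a simple two-sided z-test of a sample mean against a null of zero; the function name and all numbers are hypothetical and chosen only for illustration, not taken from Imbens (2021) or the discussion paper.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect, sd, n):
    """Two-sided p-value of a z-test of H_0: effect = 0 (normal approximation)."""
    se = sd / sqrt(n)              # the standard error shrinks with sqrt(n)
    z = effect / se
    return 2.0 * NormalDist().cdf(-abs(z))


# Hypothetical effect of 0.01 standard deviations, i.e. economically negligible.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(effect=0.01, sd=1.0, n=n))
```

The effect size is identical in all three runs; only the sample size changes. The p-value nevertheless falls from roughly 0.92 to a value that is numerically indistinguishable from zero, which is why a small p-value alone says nothing about economic relevance.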

While the discussion paper again refers to Imbens on page 20 by noting that he specifies applications where p-values are useful and where they are not, it misses stating unmistakably (i) that Imbens sees little purpose for a binary significance indicator. It also misses communicating the crucial implication of Imbens’ argument for our community (ii) that many, if not most, research settings underlying agricultural economics studies do not coincide with the research setting in which Imbens considers p-values to be useful. In other words, Imbens would not only have to be referred to as an opponent of using a “binary significance indicator” but – as regards common research settings in our field – also as a critic of the convention to present results in the form of p-values. Neither of these two crucial statements/conclusions is clearly conveyed in the discussion paper so far.

2 Missing clearness regarding the question of methodological choice
My general impression is that the paper should emphasize more clearly that statistical inferential procedures are a means to an end but not an end in themselves. In other words, it is crucial to realize that we are dealing with a methodological choice for which a justification has to be provided under consideration of the research context and the kind of the intended inference. For example, stating in a study’s objective section that the study is aimed at finding out whether an effect is statistically significant or not does not make sense. Instead, one would need to know for which kind of inference the selected inferential statistic – be it a standard error, t-ratio, p-value, significance statement or confidence interval – is used as an auxiliary means. Analogously, stating in the results section that an effect is statistically significant or not can never be the bottom line of inference. I believe that a clear understanding of this “means-end-relationship” is of utmost importance, and this has several implications:

The end must be clearly communicated. Therefore, in my view, the discussion paper should emphasize more strongly that each researcher is required to describe the data generation process (sampling design) and the broader population of interest from which the sample was drawn and to which generalizations are to be made. Without knowing how data were generated and without a clear definition of the inferential target population – be it a numerically larger population or a superpopulation (if deemed useful) – all statistical inferential statements are opaque, at best, or misleading, at worst, because the end to which the means (i.e., the inferential statistics) are used remains unclear.

After having emphasized the need to describe the data generation process and the inferential target population, the paper should, in my view, more clearly address the question of why or when researchers should transform the two original pieces of information that we can derive from a random sample – the effect size estimate (point estimate) and its estimated sampling variation (standard error) – into a p-value or a dichotomous significance statement (or some other transformation of those two original pieces of information). I think that from your statements in this regard (e.g., in Section 4.1), readers will not understand what you consider the most adequate means (i.e., the “best” way of reporting the evidence) in which circumstances. I believe that this is partly because the paper does not unambiguously distinguish between the use of p-values, on the one hand, and the use of threshold-based statistical significance statements, on the other.
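The point that every inferential statistic is only a transformation of the same two pieces of information can be sketched as follows. This is an illustrative sketch under a normal approximation; the function `summarize`, its parameters and the numbers are hypothetical and not part of the discussion paper.

```python
from statistics import NormalDist


def summarize(estimate, std_error, null_value=0.0, level=0.95):
    """Derive common inferential statistics from a point estimate and its standard error."""
    z = (estimate - null_value) / std_error          # t-ratio / z-ratio
    p = 2.0 * NormalDist().cdf(-abs(z))              # two-sided p-value
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for level = 0.95
    ci = (estimate - crit * std_error, estimate + crit * std_error)
    return {"z": z, "p": p, "ci": ci, "significant_at_5pct": p <= 0.05}


# Hypothetical estimate of 0.08 with a standard error of 0.02.
print(summarize(0.08, 0.02))                    # compared with a null of zero
print(summarize(0.08, 0.02, null_value=0.10))   # compared with a prior belief of 0.10
```

Nothing is added to the evidence by any of these transformations. The second call shows that the p-value depends entirely on which null the data are compared with, which is why the choice of the null and of the reported statistic needs a justification.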

As already indicated above, I do not fully understand which research contexts you distinguish (i.e., your classification of research contexts) and in which contexts you suggest which inferential statistical procedure (inferential statistic) as adequate means to report the evidence in the data and support inferences towards a broader context. In my opinion, there are several open questions that need to be answered to provide more clearness:

(1) Referring to both data-dependent modelling choices and research questions such as the efficient market hypothesis, you state that “[t]esting a null hypothesis versus an alternative hypothesis is meaningful.” Do you suggest that an identical statistical procedure should be used for both cases even though they are quite different? If so, which procedure do you propose – (i) the hypothesis testing approach in the Neyman-Pearson (NP) tradition where there is a clearly specified alternative hypothesis or (ii) the null-hypothesis-significance-testing (NHST) approach where the alternative hypothesis is only a vague non-null proposition? While there is some ambiguity because you do not use the technical terms, your wording suggests to me that you recommend the NP-approach (“statistical decision theory”). In the NP-framework, a dichotomous choice is made between a decision associated with the null hypothesis H_0 and a decision associated with a concrete alternative hypothesis H_A. Regarding the decision rule you state: “In these situations, a decision shall be made based on a statistical decision rule. This then necessarily includes a threshold determining what the decision will be.” I fear that this general statement will not suffice to make things clear to readers who are not familiar with the NP-approach. In the NP-framework, the choice is based on a decision rule α, which is the p-value threshold below which the null is rejected. An appropriate level of α (also called type I error rate or “false positive rate”) must be set depending on the parameters of the decision context. In particular, the type II error rate β (also called “false negative rate”) that is associated with a given level of α as well as the costs that are associated with type I and type II errors, respectively, must be considered when setting α to a level that represents an adequate decision rule in the given context.

If you suggest using the NP-framework, some important implications should be emphasized. For example, based on your brief statement, readers will probably not realize that the decision-rule α is not about inferring whether H_A or H_0 is true or more likely but about making the right decision under consideration of the costs associated with either choice. With ceteris paribus increasing type II error costs, the decision rule α must be set to increasingly high levels. This is because there is a tradeoff: increasing the type I error rate α (false positive rate) reduces the type II error rate β (false negative rate), and vice versa. Similarly, your brief statement does not convey the crucial fact that, contrary to widespread perceptions, not p but α is the type I error rate and that the precise value of p in a particular test is completely irrelevant in statistical decision theory. The only relevant information is whether p falls into the rejection region or not.
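The tradeoff between α and β, and the dependence of an adequate α on error costs, can be made concrete with a small sketch. It assumes a one-sided z-test of H_0: μ = 0 against a concrete H_A: μ = δ > 0 and, as a loud simplification, an objective that weights the two error rates only by their costs and ignores prior probabilities. Function names, costs and numbers are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def beta(alpha, delta, se):
    """Type II error rate of a one-sided z-test of H_0: mu = 0 vs. H_A: mu = delta > 0."""
    crit = STD.inv_cdf(1.0 - alpha)     # reject H_0 when the z-ratio exceeds crit
    return STD.cdf(crit - delta / se)


def best_alpha(cost_type1, cost_type2, delta, se):
    """Grid search for the alpha minimizing cost_type1 * alpha + cost_type2 * beta (assumed objective)."""
    grid = [i / 1000 for i in range(1, 500)]
    return min(grid, key=lambda a: cost_type1 * a + cost_type2 * beta(a, delta, se))


# Hypothetical alternative delta = 0.2 with standard error 0.1.
print(beta(0.05, 0.2, 0.1), beta(0.10, 0.2, 0.1))   # raising alpha lowers beta
for cost2 in (0.2, 1.0, 5.0):
    print(cost2, best_alpha(1.0, cost2, 0.2, 0.1))  # costlier type II errors call for a higher alpha
```

In this sketch the conventional 0.05 is the cost-minimizing rule only for one particular combination of costs, effect size and precision.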

If you suggest using the NP-framework, more questions need to be answered to enable the reader to understand which concrete procedure you actually suggest: (i) Do you suggest routinely using a conventional threshold such as α=0.05 as a general default for all contexts? This would correspond to the argument that 0.05 is a rule-of-thumb that works sufficiently well for all contexts, irrespective, for example, of the levels of type I and type II error costs. While I would not share that argument, it might be defended as being a simplified, pragmatic approach. Is that your position? If so, it should be clearly stated. (ii) Or do you suggest that researchers provide at least a qualitative discussion of the parameters of the decision context and then informally set α to some “plausible” level? If so, it should be clearly stated. (iii) Or do you suggest “between the lines” of your brief statement that researchers formally derive a decision rule α? If so, it should be clearly communicated by all means.

(2) My next question refers, again, to the research contexts that you try to illustrate through examples such as the efficient market hypothesis. How would you define the research contexts that you have in mind here? Because only examples but no clear specification are provided, I have to guess: Are the contexts that you have in mind the same as those that Imbens (2021) identifies as contexts where substantial probability can be put on the null? If yes, for what reason do you propose making dichotomous significance statements in those contexts? Imbens argues that, when substantial prior probability can be put on the null, using the p-value to assess the incompatibility of the data with that prior belief is meaningful and more informative than a binary significance indicator. According to Imbens, low p-values can then be used as means to decide whether to investigate the issue further with new data. Now, my question is: How does your proposition regarding the proper use of inferential statistics in the specified context relate to Imbens’ proposition?

(3) You make an important statement regarding a further context, which is arguably the most relevant one in our field: “In many economic applications, however, testing against a null hypothesis of ‘no effect’ is not of particular interest. For example, it is not exciting to test whether farmers’ education increases farm income or not, whether a gender pay gap exists or not or whether investment aid stimulates investment demand or not. Here the magnitude of the (treatment) effect is what matters and the causal mechanism, e.g. how investment aid stimulates investments. We believe that in situations, where no specific decision on a hypothesis has to be made, it suffices to display standard errors or […].” Here, you seem to refer to contexts where the null is uninteresting/unlikely because previous studies have already produced strong evidence for the existence of an effect, such as in Imbens’ return-to-schooling example. Your requirement to report the magnitude of the effect and its standard error also fully agrees with Imbens’ view that one should report the point estimate and the uncertainty of that point estimation in such contexts. But in Imbens’ view, this is not a case where it is meaningful to report p-values because p-values would indicate the strength of the evidence in the data against a null hypothesis that is, as you mention yourself, “not of particular interest.” Therefore, I find it confusing that you continue your statement above by saying: “[…] or to interpret p-values as indicators of the general compatibility of the data with the corresponding hypothesis.” What do you intend to say with this “or” sub-clause? We can, of course, go through the mathematical manipulations to transform the point estimate and the standard error into a p-value that assesses the compatibility of the data with the null. But why should we summarize the evidence in the form of a p-value in research contexts where the null hypothesis is uninteresting from the very start? I think that the discussion paper should provide a clear answer to that question, which is one of the most crucial ones in the present methodological debate.

(4) Finally, I fear that readers will not grasp the assumptions and “means-end-relationship” in your rather passing mention of the concept of power (and related terms such as false positives, false negatives, etc.). In particular, it should be clarified that the power concept only makes sense in the dichotomous NP-framework where a null and a concrete alternative hypothesis are defined between which a rule-based decision is to be made. Power is defined as 1 − β. It quantifies the repeatability of p≤α (and, therefore, the rate of rejection of H_0) when H_A is true. In other words, it is the rate of acting as if H_A were true when it is true (“true positive rate”). I believe that the discussion paper should clarify that the concept of power only makes sense in contexts where the NP-approach is used to make a decision between two alternatives but not in contexts where the substantive question is about effect size estimation and the uncertainty of that estimation.
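The reading of power as the repeatability of p≤α under a concrete H_A can be checked with a short simulation. The sketch assumes a one-sided z-test of H_0: μ = 0 against H_A: μ = δ with a known standard error; all names and numbers are hypothetical.

```python
import random
from statistics import NormalDist

STD = NormalDist()


def power(alpha, delta, se):
    """Analytical power (1 - beta) of a one-sided z-test of H_0: mu = 0 vs. H_A: mu = delta."""
    crit = STD.inv_cdf(1.0 - alpha)
    return 1.0 - STD.cdf(crit - delta / se)


def simulated_rejection_rate(alpha, delta, se, runs=20_000, seed=1):
    """Share of hypothetical replications with p <= alpha when H_A is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        estimate = rng.gauss(delta, se)      # one replication drawn under H_A
        p = 1.0 - STD.cdf(estimate / se)     # one-sided p-value against H_0
        rejections += p <= alpha
    return rejections / runs


print(power(0.05, 0.25, 0.1))
print(simulated_rejection_rate(0.05, 0.25, 0.1))
```

Both functions require a concrete δ. Without a specified alternative hypothesis there is nothing to compute, which is the sense in which power belongs to the NP-framework.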

3 Your dealing with assumptions violations
Assumptions violations are another major issue that, in my view, is not covered clearly enough. On page 20, you rightly state: “[I]f data come from convenience sample, any source of potential bias regarding estimates of regression coefficients and standard errors should be carefully considered and discussed.” I don’t think this statement is clear enough to make a substantial contribution to mitigating inferential errors associated with the widespread assumptions violations in the practice of empirical research. As it is, I suspect that readers of the discussion paper such as PhD students will not understand how they have to deal with generalizing statistical inference in the case of convenience samples. Of course, this issue is again related to the question of the adequateness of “means” to an “end” – but now in a more fundamental way: can sample statistics that would carry inferential meaning if data were probabilistically generated be adequate means to the end of generalizing towards a broader context when the data were not probabilistically generated? This is an extremely relevant question in our field where p-values and asterisks are routinely displayed (and often called for by reviewers) whenever there are quantitative data – without questioning whether there is a chance model upon which to base statistical inference. This routine includes “grossly-non-random” samples of haphazardly recruited respondents that researchers could get hold of, in one way or the other.

From a logical point of view, what should be done is quite unambiguous: using inferential statistical procedures to generalize from samples to populations in the case of convenience samples would have to be justified by either running a trustworthy sample selection model that would rehabilitate the statistical foundations of statistical inference (see Hirschauer et al. 2020), or by assuming that those convenience samples are approximately random samples. Since many researchers who use convenience samples simply resort to the standard error formula for simple random samples without a second thought, one would even have to assume that all those convenience samples are approximately simple random samples. I think that the discussion paper should communicate beyond any doubt that such an “approximately-a-random-sample argument” is often a heroic assumption but absolutely necessary from a logical point of view for statistical inferential procedures to make any sense when, in fact, the data generating process was not probabilistic. Whether the approximately-a-random-sample argument is then deemed trustworthy or helpful in the specific context – say, in the case of convenience samples of individuals who are haphazardly recruited at some venue or location or on the Internet, or in the case of the Testbetriebsdaten – would then be at least a transparent issue open for judgment by the reader of a study.
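What is at stake when the approximately-a-random-sample argument fails can be illustrated with a stylized simulation. The population, the self-selection mechanism and all numbers below are invented assumptions for illustration only; this is not a reconstruction of any dataset or of the selection models discussed in Hirschauer et al. (2020).

```python
import random
from math import exp, sqrt
from statistics import mean, stdev

rng = random.Random(42)

# Hypothetical population of 100,000 units with outcome y (say, income in 1,000 EUR).
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
true_mean = mean(population)


def joins_sample(y):
    """Assumed self-selection: units with higher y are more likely to take part."""
    prob = 0.02 / (1.0 + exp(-(y - 50.0) / 10.0))
    return rng.random() < prob


sample = [y for y in population if joins_sample(y)]

# Naive analysis that treats the convenience sample as a simple random sample.
estimate = mean(sample)
se = stdev(sample) / sqrt(len(sample))
ci = (estimate - 1.96 * se, estimate + 1.96 * se)

print(len(sample), true_mean, estimate, ci)
```

Under these assumptions the interval computed with the simple-random-sample formula is narrow and lies well above the population mean. The standard error quantifies sampling variation under a chance model that did not generate the data, so it conveys a precision that the sample does not have with respect to the target population.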

References
Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2021): A Primer on p-Value Thresholds and α-Levels – Two Different Kettles of Fish. German Journal of Agricultural Economics 70: 123-133 (DOI: 10.30430/70.2021.2.123-133).
Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2020): Can p-values be meaningfully interpreted without random sampling? Statistics Surveys 14(2020): 71-91 (DOI: 10.1214/20-SS129).

The paper “Can p-values be meaningfully interpreted without random sampling?” was published in Statistics Surveys in 2020, not in 2019, as you erroneously indicate in your reference list. But it seems that in many places where you refer to Hirschauer et al. (2019) in the text, you intend to refer to the following publication, which, in turn is missing in the list of references: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2019): Twenty steps towards an adequate inferential interpretation of p-values in econometrics. Journal of Economics and Statistics 239(4): 703-721 (DOI: 10.1515/jbnst-2018-0069).

You write on page 6 that “Hirschauer, Mußhoff and Grüner (2017: p. 5) argue that ‘multiple testing is inherent to multiple regression since we test as many null hypotheses as we have variables of interest.’” The reference given for this quote is the following publication: Hirschauer, N., Mußhoff, O. and Grüner, S. (2017). False Discoveries und Fehlinterpretationen wissenschaftlicher Ergebnisse. Wirtschaftsdienst 97(3): 201–206. This is not correct. The Wirtschaftsdienst paper is in German. We have discussed multiple testing and made the above statement, not verbatim but as regards content, in the following publication: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2018): Pitfalls of significance testing and p-value variability: An econometrics perspective. Statistics Surveys 12(2018): 136-172 (DOI: 10.1214/18-SS122). I believe that you quote an earlier working paper version of that publication. You may want to correct that.

On page 7, you write: “Hirschauer et al. (2019) argue that convenience sampling precludes the use of p-values because researchers run the risk of misestimating coefficients and standard errors, at least if selection bias is not adequately considered.” I believe the intended reference here is to the 2020 paper “Can p-values be meaningfully interpreted without random sampling?” or, alternatively to the following paper: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2021): Inference using non-random samples? Stop right there! Significance (October 2021): 20-24 (DOI: 10.1111/1740-9713.01568).

On page 26, you refer to the guidelines I suggested for journals at the pre-conference p-value workshop of the 2021 GEWISOLA conference. Your reference is to: Hirschauer, N. (2021). The debate on p-values and statistical inference: What are the consequences for our community? Problems and solutions in statistical practice. But no further information regarding the material/slides of my presentation is provided. Since you use it as a reference, the material should be made accessible to the reader, e.g., by uploading it to the GEWISOLA homepage and providing a link. If need be, I could also upload the slides to my personal MLU-website.

The commenter has declared there is no conflict of interests.

Comment: Response by the authors to the comments by Norbert Hirschauer

We would like to thank Norbert Hirschauer for the detailed comments. We respond to them point-by-point below.

General response: Our manuscript wants to give a comprehensive overview of the debate, but we also discuss the attitudes of the community, and many of the possible remedies. Consequently, some points may not have gotten the attention they deserve, but there are also plenty of resources readers are pointed towards to learn more.

We are not statistical authorities who can make objective and final judgments. As part of the community and as affected researchers, editors, reviewers and supervisors, we see it as our task to raise issues from the community and to summarize/synthesize these issues for the community, which may also involve some subjectivity. We also believe that such subjectivity and critical decisions are part of any empirical work.

Ultimately, we see it as our main mission to bring the debate to the community, one aspect of which is the discussion in this forum. So, thanks again for making a start. If the manuscript is accepted in a journal, and if the editors agree, we will include a link to this forum, so that all readers and interested members of the community can contribute to the debate with their standpoints as well.

Response to point 1 – Presentation of Imbens (2021)

We thank Norbert Hirschauer for prompting us to clarify (our perception of) the main message of Imbens’ recent paper on the p-value debate. This is particularly valuable because we have much sympathy for the arguments presented in this paper. Actually, they are rather close to our own position.

We admit that Imbens (2021) is not the best reference for the statement “that nothing is wrong with p-values, if they are used correctly” (page #) and we have replaced this reference with Verhulst (2016). Having said this, we think it is futile to argue about whether Imbens belongs to the camp of opponents or defenders of p-values. Both groups can find arguments for their respective views in his article. In a nutshell, after briefly reviewing the controversy about p-values and significance reporting, he distinguishes empirical economic applications with regard to their main objective, namely estimation in the sense of quantifying the magnitude of an effect versus hypothesis testing. In the next two subsections, Imbens explains why p-values contribute little in the former research setting and why p-values and significance testing are useful in the latter.

This does not imply that p-values are without problems when hypothesis testing is the focus of an economic study, and neither Imbens nor we make this claim. However, we do not share Norbert Hirschauer’s perception that “for many, if not most, research contexts in our field, Imbens does not support p-values”. On page 165 Imbens provides a couple of examples where it may be reasonable to focus on testing null hypotheses. (In our discussion paper we extend this list.) If there were only little need for hypothesis testing in empirical economic research, how could Imbens arrive at the conclusion: “In my view banning of p-values is inappropriate.” (page 170)?

Response to point 2 – Missing clearness regarding the question of methodological choice

We agree that the methods are a means to an end, not an end in themselves. Of course, researchers should clearly describe how samples are generated and under which assumptions they operate (also see response to point 3).

Fisherian Null Hypothesis Significance Testing (NHST) vs. Neyman-Pearson Framework (NPF)

We want to clarify that we do not suggest using either framework at all times. It depends on the case at hand. We believe that the Fisherian NHST is particularly useful in explorative analysis (incompatibility of the data with the null may point towards issues to study in greater depth) and in deductive work if the goal is to establish the initial presence of a directional effect based on theory. The NPF will likely be more useful in cases with well-established priors and if loss functions can be reasonably specified. That being said, the NPF should then maybe be used more in deductive work related to well-established literature.

Readers who want to learn more about the history and debate around the two approaches can find a detailed treatment in: Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

A critique of the practice of NHST is provided by: Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Research contexts in which NHST is meaningful

Your conjecture that we borrowed from Imbens (2021) when describing research contexts for meaningful applications of the NHST is correct. We refer to situations in which the null hypothesis is not a “strawman”, i.e. the null has a “substantial” prior probability based on theoretical reasoning or previous empirical work. We think, however, that it is not possible to specify research questions for which this precondition unambiguously holds or not. The efficient market hypothesis or the law of one price may serve as an example. A rejection of the null, i.e. an incompatibility of the data with the hypothesis, would be much more surprising for stock markets than for land markets.

At this point, it is legitimate to ask whether one should stop at displaying p-values as a measure of the (in)compatibility of the data with the null hypothesis or whether one should go one step further and draw a “conclusion” in the sense of rejecting or accepting the null hypothesis (without implying that it is actually true / false). Our position here is that one should neither apply NHST mechanically nor rule it out. Prominent examples that require an application of NHST can be found in the context of econometric model building. Proper specification of econometric models typically involves a series of decisions: Are economic time series stationary or not? Are random variables normally distributed or not? Do economic variables exhibit spatial correlation or not? Is endogeneity an issue or not? The list of examples can be extended arbitrarily. Discrete choices have to be made and NHST is definitely a useful tool to support these decisions despite its pitfalls.

The informative value of p-values is limited if the null is not very likely

Indeed, the focus in the described cases should be on effect sizes. In cases where the null hypothesis is unlikely to be true, because theory or prior work strongly suggests that there is an effect, the NPF could often also be the more adequate framework. In other words, p-values are probably more useful in exploratory work. We do not want to suggest that p-values should always be used.

Power and the NPF

We agree that power is related to the NPF and its calculation requires the specification of a specific alternative hypothesis. (This should be evident from Figure 1.) However, effect sizes can also be estimated in the NPF (priors and an alternative hypothesis do not stop one from doing so).

Response to point 3 – Violations of assumptions

We agree with the statements. This point is critical, since it is very often difficult to work with random samples of farmers and consumers in the agricultural and food economics community. It is important to clearly distinguish the problem of inference and the problem of bias from non-random samples. In our perception, researchers in the community struggle more with the former (inference from non-random samples), but often make an honest attempt to discuss the latter (bias from non-random samples).

Regarding the problem of inference, we agree that non-random samples should not be used lightheartedly for inference to the population. Researchers have the option to present their data, models and associated estimates without inference. For instance, authors could run a regression and present their point estimates. The bigger question is: what is the value of such studies if no inference can be made? What is there to learn for the reader? In other words, in many instances, a careful discussion of biases and a discussion of assumptions might be more fruitful.

Alternatively, authors could use inferential statistics (including the use of p-values and confidence intervals), but point towards the violation of the critical assumptions of a random data generation process. Ultimately, it is difficult to generally judge as to how far assumptions of an “approximately-a-random-sample argument” are violated. We believe that transparency and critical reflection are key. As a minimum standard, if such assumptions are made, they should, of course, be made explicit in the communication of research results for the community to judge.

We believe that researchers should continue to make an honest attempt to discuss biases on a case-by-case basis. For instance, a survey of a self-selected sample on farmers’ willingness to participate in research surveys is very likely to produce an overestimate of that willingness, whereas a question on the farm size may produce a somewhat smaller bias (if any): Larger or smaller farms may still be more or less likely to respond for various reasons, but it is plausible that the bias is smaller in the second case. As a consequence, it would be a task for the research community as a whole to produce and publicly share high quality random samples or population data that allow researchers to assess biases on as many variables as possible. Some issues will still be unresolved though. Imagine the case where the willingness to participate in surveys is uncorrelated with other variables and unobserved in the population or a random sample. It will then be the task of the researcher to discuss sources and direction of bias.

We may revisit our discussion of implications with respect to this issue in the course of a revision. A first implication for teaching and training of PhD students would be that methods courses cover data generation, sampling and representativeness together with how to best address the aim of the research (hypothesis testing, effect quantification, explorative analysis for theory development, etc.). This, however, would in our view support the move towards more holistic courses that include scientific practice and methods. Discussing commonly used observational data sets and how these may suffer from bias would enrich such courses/modules. This would need to go along with more stringent research data management and community-specific rules/standards for documentation, including discussion of their sources of bias and limits for statistical inference along with the FAIR Guiding Principles for scientific data management and stewardship (findability, accessibility, interoperability, and reusability).

Finally, we want to point out that bias and misspecified sampling error are interlinked in practice but have different consequences. Hence, we want to emphasize that both issues should be discussed separately. If the real sampling error is larger than the assumed sampling error, the correct confidence intervals are wider than the reported ones, whereas biases may shift the interval upward or downward. We agree that this is an additional problem of using p-values in non-random samples.
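The difference between the two problems can be illustrated with a minimal sketch. It assumes a normal approximation for the confidence interval; the point estimate, the two standard errors and the size of the bias are hypothetical numbers chosen only for illustration, and the function name is ours.

```python
from statistics import NormalDist


def confidence_interval(estimate, se, level=0.95):
    """Normal-approximation confidence interval around a point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (estimate - z * se, estimate + z * se)


# Hypothetical values, not taken from any study.
estimate = 1.0     # point estimate from a convenience sample
assumed_se = 0.20  # standard error from the simple-random-sample formula
true_se = 0.30     # assumed larger "real" sampling error
bias = 0.25        # assumed upward selection bias in the estimate

reported = confidence_interval(estimate, assumed_se)
# Understated sampling error: same center, but the interval should be wider.
wider = confidence_interval(estimate, true_se)
# Bias: same width, but the interval should be centered elsewhere.
shifted = confidence_interval(estimate - bias, assumed_se)

for label, (low, high) in [("reported", reported), ("wider", wider), ("shifted", shifted)]:
    print(f"{label:>8}: [{low:.2f}, {high:.2f}]")
```

In this sketch, a misstated standard error changes only the width of the interval and a bias changes only its location, which is why the two issues call for separate discussion.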

Comments regarding the references

We thank Norbert Hirschauer for the points raised regarding our citing and references. As these issues were pointed out to us at an earlier stage already, we have tried to address them in this preprint. We will carefully check the references again and, if needed, revise them accordingly in a new version of the manuscript.

Thomas Heckelei, Silke Hüttel, Martin Odening, Jens Rommel


Commenter: Norbert Hirschauer

The commenter has declared there is no conflict of interests.

Dear colleagues, congratulations on your Discussion Paper. I have three major sets of comments that I would like to share. The first one is associated with your presentation of Imbens’ (2021) paper, the second one with the, as I find, missing clearness regarding the question of methodological choice, and the third one with your dealing with assumptions violations.

1 Your presentation of Imbens’ (2021) paper

I am confused about your presentation/interpretation of the paper by Imbens (2021). On page 2, you seem to create the impression that Imbens is a representative of those who counter criticisms of statistical significance testing. In my reading of Imbens’ paper, this is not correct. I believe confusion arises because you jump from criticisms of statistical significance testing to a justification of p-values, without clearly distinguishing the two approaches. You verbatim write: “However, this [i.e., criticisms of significance testing] is countered by others who acknowledge existing problems but nevertheless defend p-values, basically saying that nothing is wrong with p-values if they are used correctly (Imbens, 2021).” While Imbens identifies a research setting where he thinks that p-values are a meaningful way of reporting the evidence, he does not support binary significance statements. Therefore, I think that the confusing presentation of Imbens’ reasoning, which is again to be found on page 20, does not do justice to the crucial message of his paper:

Imbens does not support statistical significance testing. On the contrary, he sees “little purpose” for a “binary significance indicator.”

For many, if not most research contexts in our field, Imbens does not support p-values. He explicitly argues that in many economic research settings one should report the point estimate and the uncertainty associated with that point estimation – instead of a p-value. This is because a p-value is an immanent test in that it assesses the incompatibility of the data with the null hypothesis of no effect. In many research contexts, however, null hypotheses are of little interest. Imbens (2021: 162) gives the following example: “Although hypothesis testing is routinely used in economics, I would submit that many of the substantive questions are primarily about point estimation and their uncertainty, rather than about testing. However, many studies where estimation questions should be the primary focus present the results in the form of hypothesis tests. […Take] a specific example – the return to schooling – where testing a null hypothesis of no effect is common, yet arguably of little or no substantive interest. One would be hard-pressed to find an economist who believes that the return to education is zero.” In other words, since previous studies have already produced strong evidence for a positive return to schooling, using a p-value to assess the compatibility of the data with the highly unlikely and, therefore, uninteresting hypothesis of exactly zero return is not a meaningful way of summarizing the evidence in that data.

Imbens strictly limits his support of p-values to one specific research setting. This is when substantial prior probability can be put, and is put, on the null, i.e., when the null represents the most interesting hypothesis to compare the data with. In other words, the null must be specified as to represent the most established prior scientific belief (this is rarely done in our field).
Imbens argues that in this case, and in this case only, a low p-value – i.e., a high incompatibility of the data with the established prior scientific belief – is an informative way of summarizing the evidence in the data. But he also cautions that a high incompatibility of a single dataset with a strong prior scientific belief can only serve as an auxiliary means to answer the question of whether it might be worthwhile investigating the issue further with new data. And yet with regard to this limited purpose, he warns that even very small and, therefore, economically irrelevant effects are necessarily associated with small standard errors and, therefore, small p-values in large samples.

While the discussion paper again refers to Imbens on page 20 by noting that he specifies applications where p-values are useful and where they are not, it misses stating unmistakably (i) that Imbens sees little purpose for a binary significance indicator. It also misses communicating the crucial implication of Imbens’ argument for our community (ii) that many, if not most, research settings underlying agricultural economics studies do not coincide with the research setting in which Imbens considers p-values to be useful. In other words, Imbens would not only have to be referred to as an opponent of using a “binary significance indicator” but – as regards common research settings in our field – also as a critic of the convention to present results in the form of p-values. Neither of these two crucial statements/conclusions is clearly conveyed in the discussion paper so far.

2 Missing clearness regarding the question of methodological choice

My general impression is that

the paper should emphasize more clearly that statistical inferential procedures are a means to an end but not an end in themselves. In other words, it is crucial to realize that we are dealing with a methodological choice for which a justification has to be provided under consideration of the research context and the kind of the intended inference. For example, stating in a study’s objective section that the study is aimed at finding out whether an effect is statistically significant or not does not make sense. Instead, one would need to know for which kind of inference the selected inferential statistic – be it a standard error, t-ratio, p-value, significance statement or confidence interval – is used as an auxiliary means. Analogously, stating in the results section that an effect is statistically significant or not can never be the bottom line of inference. I believe that a clear understanding of this “means-end-relationship” is of utmost importance, and this has several implications:

The end must be clearly communicated. Therefore, in my view, the discussion paper should emphasize more strongly that each researcher is required to describe the data generation process (sampling design) and the broader population of interest from which the sample was drawn and to which generalizations are to be made. Without knowing how data were generated and without a clear definition of the inferential target population – be it a numerically larger population or a superpopulation (if deemed useful) – all statistical inferential statements are opaque, at best, or misleading, at worst, because the end to which the means (i.e., the inferential statistics) are used remains unclear.

After having emphasized the need to describe the data generation process and the inferential target population, the paper should, in my view, more clearly address the question of why or when researchers should transform the two original pieces of information that we can derive from a random sample – the effect size estimate (point estimate) and its estimated sampling variation (standard error) – into a p-value or a dichotomous significance statement (or some other transformation of those two original pieces of information). I think that from your statements in this regard (e.g., in Section 4.1), readers will not understand what you consider the most adequate means (i.e., the “best” way of reporting the evidence) in which circumstances. I believe that this is partly because the paper does not unambiguously distinguish between the use of p-values, on the one hand, and the use of threshold-based statistical significance statements, on the other.

As already indicated above, I do not fully understand which research contexts you distinguish (i.e., your classification of research contexts) and in which contexts you suggest which inferential statistical procedure (inferential statistic) as adequate means to report the evidence in the data and support inferences towards a broader context. In my opinion, there are several open questions that need to be answered to provide more clearness:

(1) Referring to both data-dependent modelling choices and research questions such as the efficient market hypothesis, you state that “[t]esting a null hypothesis versus an alternative hypothesis is meaningful.”

Do you suggest that an identical statistical procedure should be used for both cases even though they are quite different? If so, which procedure do you propose – (i) the hypothesis testing approach in the Neyman-Pearson (NP) tradition where there is a clearly specified alternative hypothesis or (ii) the null-hypothesis-significance-testing (NHST) approach where the alternative hypothesis is only a vague non-null proposition?

While there is some ambiguity because you do not use the technical terms, your wording suggests to me that you recommend the NP-approach (“statistical decision theory”). In the NP-framework, a dichotomous choice is made between a decision associated with the null hypothesis H_0 and a decision associated with a concrete alternative hypothesis H_A. Regarding the decision rule you state: “In these situations, a decision shall be made based on a statistical decision rule. This then necessarily includes a threshold determining what the decision will be.” I fear that this general statement will not suffice to make things clear to readers who are not familiar with the NP-approach. In the NP-framework, the choice is based on a decision rule α, which is the p-value threshold below which the null is rejected. An appropriate level of α (also called type I error rate or “false positive rate”) must be set depending on the parameters of the decision context. In particular, the type II error rate β (also called “false negative rate”) that is associated with a given level of α as well as the costs that are associated with type I and type II errors, respectively, must be considered when setting α to a level that represents an adequate decision rule in the given context.

If you suggest using the NP-framework, some important implications should be emphasized. For example, based on your brief statement, readers will probably not realize that the decision-rule α is not about inferring whether H_A or H_0 is true or more likely but about making the right decision under consideration of the costs associated with either choice. With ceteris paribus increasing type II error costs, the decision rule α must be set to increasingly high levels. This is because there is a tradeoff: increasing the type I error rate α (false positive rate) reduces the type II error rate β (false negative rate), and vice versa. Similarly, your brief statement does not convey the crucial fact that, contrary to widespread perceptions, not p but α is the type I error rate and that the precise value of p in a particular test is completely irrelevant in statistical decision theory. The only relevant information is whether p falls into the rejection region or not.
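The tradeoff between α and β can be made concrete with a minimal sketch. It assumes a one-sided z-test of H_0: θ = 0 against a concrete alternative H_A: θ = effect > 0 with a known standard error; the effect size, the standard error and the function name are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist


def type_ii_error(alpha, effect, se):
    """Type II error rate (beta) of a one-sided z-test.

    H_0: theta = 0 is tested against H_A: theta = effect (> 0),
    assuming a normally distributed estimator with known standard error.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)  # rejection threshold on the z-scale
    # Under H_A the z-statistic is distributed N(effect / se, 1);
    # beta is the probability that it stays below the threshold.
    return z.cdf(critical - effect / se)


# Hypothetical decision context: effect of 0.5 with a standard error of 0.25.
for alpha in (0.01, 0.05, 0.10):
    beta = type_ii_error(alpha, effect=0.5, se=0.25)
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.3f}")
```

Raising α in this sketch lowers β for the same alternative, so the choice of α is a statement about the relative costs of the two error types, not about which hypothesis is more likely.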

If you suggest using the NP-framework, more questions need to be answered to enable the reader to understand which concrete procedure you actually suggest: (i) Do you suggest to routinely use a conventional threshold such as α=0.05 as general default for all contexts? This would correspond to the argument that 0.05 is a rule-of-thumb that works sufficiently well for all contexts, irrespective, for example, of the levels of type I and type II error costs. While I would not share that argument, it might be defended as being a simplified, pragmatic approach. Is that your position? If so, it should be clearly stated. (ii) Or do you suggest that researchers provide at least a qualitative discussion of the parameters of the decision context and then informally set α to some “plausible” level? If so, it should be clearly stated. (iii) Or do you suggest “between the lines” of your brief statement that researchers formally derive a decision rule α? If so, it should be clearly communicated by all means.

(2) My next question refers, again, to the research contexts that you try to illustrate through examples such as the efficient market hypothesis. How would you define the research contexts that you have in mind here? Because only examples but no clear specification are provided, I have to guess: Are the contexts that you have in mind the same as those that Imbens (2021) identifies as contexts where substantial probability can be put on the null? If yes, for what reason do you propose making dichotomous significance statements in those contexts? Imbens argues that, when substantial prior probability can be put on the null, using the p-value to assess the incompatibility of the data with that prior belief is meaningful and more informative than a binary significance indicator. According to Imbens, low p-values can then be used as means to decide whether to investigate the issue further with new data. Now, my question is:

How does your proposition regarding the proper use of inferential statistics in the specified context relate to Imbens’ proposition?

(3) You make an important statement regarding a further context, which is arguably the most relevant one in our field: “In many economic applications, however, testing against a null hypothesis of ‘no effect’ is not of particular interest. For example, it is not exciting to test whether farmers’ education increases farm income or not, whether a gender pay gap exists or not or whether investment aid stimulates investment demand or not. Here the magnitude of the (treatment) effect is what matters and the causal mechanism, e.g. how investment aid stimulates investments. We believe that in situations, where no specific decision on a hypothesis has to be made, it suffices to display standard errors or […].” Here, you seem to refer to contexts where the null is uninteresting/unlikely because previous studies have already produced strong evidence for the existence of an effect, such as in Imbens’ return-to-schooling example. Your requirement to report the magnitude of the effect and its standard error also fully agrees with Imbens’ view that one should report the point estimate and the uncertainty of that point estimation in such contexts. But in Imbens’ view, this is not a case where it is meaningful to report p-values because p-values would indicate the strength of the evidence in the data against a null hypothesis that is, as you mention yourself, “not of particular interest.” Therefore, I find it confusing that you continue your statement above by saying: “[…] or to interpret p-values as indicators of the general compatibility of the data with the corresponding hypothesis.” What do you intend to say with this “or” sub-clause? We can, of course, go through the mathematical manipulations to transform the point estimate and the standard error into a p-value that assesses the compatibility of the data with the null. But why should we summarize the evidence in the form of a p-value in research contexts where the null hypothesis is uninteresting from the very start? I think that the discussion paper should provide a clear answer to that question, which is one of the most crucial ones in the present methodological debate.

(4) Finally, I fear that readers will not grasp the assumptions and “means-end-relationship” in your rather passing mention of the concept of power (and related terms such as false positives, false negatives, etc.). In particular, it should be clarified that the power concept makes only sense in the dichotomous NP-framework where a null and a concrete alternative hypothesis are defined between which a rule-based decision is to be made. Power is defined as 1 − β. It quantifies the repeatability of p≤α (and, therefore, the rate of rejection of H_0) when H_A is true. In other words, it is the rate of acting as if H_A were true when it is true (“true positive rate”).
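The reading of power as the repeatability of p≤α can be checked with a small simulation. This is a minimal sketch under strong assumptions: a one-sided z-test of H_0: θ = 0, a concrete alternative H_A: θ = effect, a normally distributed estimator with known standard error, and hypothetical parameter values; the function names are ours.

```python
import random
from statistics import NormalDist


def analytic_power(alpha, effect, se):
    """Power (1 - beta) of a one-sided z-test when H_A: theta = effect is true."""
    z = NormalDist()
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - effect / se)


def simulated_power(alpha, effect, se, n_rep=20000, seed=1):
    """Share of repeated hypothetical studies with p <= alpha when H_A is true."""
    rng = random.Random(seed)
    z = NormalDist()
    rejections = 0
    for _ in range(n_rep):
        estimate = rng.gauss(effect, se)  # estimate from one study drawn under H_A
        p_value = 1 - z.cdf(estimate / se)  # one-sided p-value against H_0: theta = 0
        if p_value <= alpha:
            rejections += 1
    return rejections / n_rep


# Hypothetical decision context: effect of 0.5 with a standard error of 0.25.
print("analytic power: ", round(analytic_power(0.05, 0.5, 0.25), 3))
print("simulated power:", round(simulated_power(0.05, 0.5, 0.25), 3))
```

Both functions need a concrete value for the effect under H_A. Without a specified alternative there is nothing to simulate, which is the sense in which power belongs to the NP-framework.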

I believe that the discussion paper should clarify that the concept of power makes only sense in contexts where the NP-approach is used to make a decision between two alternatives but not in contexts where the substantive question is about effect size estimation and the uncertainty of that estimation.

3 Your dealing with assumptions violations

Assumptions violations are another major issue that, in my view, is not covered clearly enough. On page 20, you rightly state: “[I]f data come from convenience sample, any source of potential bias regarding estimates of regression coefficients and standard errors should be carefully considered and discussed.” I don’t think this statement is clear enough to make a substantial contribution to mitigating inferential errors associated with the widespread assumptions violations in the practice of empirical research. As it is, I suspect that readers of the discussion paper such as PhD students will not understand how they have to deal with generalizing statistical inference in the case of convenience samples. Of course, this issue is again related to the question of the adequateness of “means” to an “end” – but now in a more fundamental way: can sample statistics that would carry inferential meaning if data were probabilistically generated be adequate means to the end of generalizing towards a broader context when the data were not probabilistically generated? This is an extremely relevant question in our field where p-values and asterisks are routinely displayed (and often called for by reviewers) whenever there are quantitative data – without questioning whether there is a chance model upon which to base statistical inference. This routine includes “grossly-non-random” samples of haphazardly recruited respondents that researchers could get hold of, in one way or the other.

From a logical point of view, what should be done is quite unambiguous: using inferential statistical procedures to generalize from samples to populations in the case of convenience samples would have to be justified by either running a trustworthy sample selection model that would rehabilitate the statistical foundations of statistical inference (see Hirschauer et al. 2020), or by assuming that those convenience samples are approximately random samples. Since many researchers who use convenience samples simply resort to the standard error formula for simple random samples without a second thought, one would even have to assume that all those convenience samples are approximately simple random samples.

I think that the discussion paper should communicate beyond any doubt that such an “approximately-a-random-sample argument” is often a heroic assumption but absolutely necessary from a logical point of view for statistical inferential procedures to make any sense when, in fact, the data generating process was not probabilistic. Whether the approximately-a-random-sample argument is then deemed trustworthy or helpful in the specific context – say, in the case of convenience samples of individuals who are haphazardly recruited on some venue, location or the Internet, or in the case of the Testbetriebsdaten – would then be at least a transparent issue open for judgment by the reader of a study.

References

Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2021): A Primer on p-Value Thresholds and α-Levels – Two Different Kettles of Fish. German Journal of Agricultural Economics 70: 123-133 (DOI: 10.30430/70.2021.2.123-133).

Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2020): Can p-values be meaningfully interpreted without random sampling? Statistics Surveys 14(2020): 71-91 (DOI: 10.1214/20-SS129).

Comments regarding your references

The paper “Can p-values be meaningfully interpreted without random sampling?” was published in Statistics Surveys in 2020, not in 2019, as you erroneously indicate in your reference list. But it seems that in many places where you refer to Hirschauer et al. (2019) in the text, you intend to refer to the following publication, which, in turn is missing in the list of references: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2019): Twenty steps towards an adequate inferential interpretation of p-values in econometrics. Journal of Economics and Statistics 239(4): 703-721 (DOI: 10.1515/jbnst-2018-0069).

You write on page 6 that “Hirschauer, Mußhoff and Grüner (2017: p. 5) argue that “multiple testing is inherent to multiple regression since we test as many null hypotheses as we have variables of interest.” And the reference of this quote is towards the following publication: Hirschauer, N., Mußhoff, O. and Grüner, S. (2017). False Discoveries und Fehlinterpretationen wissenschaftlicher Ergebnisse. Wirtschaftsdienst 97(3): 201–206. This is not correct. The Wirtschaftsdienst paper is in German. We have discussed multiple testing and made the above statement, not verbatim but as regards content, in the following publication: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C. (2018): Pitfalls of significance testing and p-value variability: An econometrics perspective. Statistics Surveys 12(2018): 136-172 (DOI: 10.1214/18-SS122). I believe that you quote an earlier working paper version of that publication. You may want to correct that.

On page 7, you write: “Hirschauer et al. (2019) argue that convenience sampling precludes the use of p-values because researchers run the risk of misestimating coefficients and standard errors, at least if selection bias is not adequately considered.” I believe the intended reference here is to the 2020 paper “Can p-values be meaningfully interpreted without random sampling?” or, alternatively to the following paper: Hirschauer, N., Grüner, S., Mußhoff, O., Becker, C., Jantsch, A. (2021): Inference using non-random samples? Stop right there! Significance (October 2021): 20-24 (DOI: 10.1111/1740-9713.01568).

On page 26, you refer to the guidelines I suggested for journals on the pre-conference p-value workshop of the 2021 GEWISOLA conference. Your reference is to: Hirschauer, N. (2021). The debate on p-values and statistical inference: What are the consequences for our community? Problems and solutions in statistical practice. But no further information regarding the material/slides of my presentation is provided. Since you use it as a reference, the material should be made accessible to the reader, e.g., by uploading it to the GEWISOLA homepage and providing a link. If need be, I could also upload the slides to my personal MLU-website.

Commenter:

The commenter has declared there is no conflict of interests.

Response by the authors to the comments by Norbert Hirschauer

We would like to thank Norbert Hirschauer for the detailed comments. We respond to them point-by-point below.

General response: Our manuscript wants to give a comprehensive overview of the debate, but we also discuss the attitudes of the community, and many of the possible remedies. Consequently, some points may not have gotten the attention they deserve, but there are also plenty of resources readers are pointed towards to learn more.

We are no statistical authorities to make objective and final judgments. As part of the community and as affected researchers, editors, reviewers, supervisors, we see it as our task to raise issues from the community and to summarize/synthesize these issues for the community which may also involve some subjectivity. We also believe that such subjectivity and critical decisions are part of any empirical work.

Ultimately, we see it as our main mission to bring the debate to the community, one aspect of which is the discussion in this forum. So, thanks again for making a start. If the manuscript is accepted in a journal, and if the editors agree, we will include a link to this forum, so that all readers and interested members of the community can contribute to the debate with their standpoints as well.

Response to point 1 – Presentation of Imbens (2021)

We thank Norbert Hirschauer for prompting us to clarify (our perception of) the main message of Imbens’ recent paper on the p-value debate. This is particularly valuable, because we have much sympathy for the arguments presented in this paper. Actually, they are rather close to our own position.

We admit that Imbens (2021) is not the best reference for the statement “that nothing is wrong with p-values, if they are used correctly” (page #) and we replaced this reference with Verhulst (2016). Having said this, we think it is futile to base a claim on whether Imbens belongs to the camp of opponents or defenders of p-values. Both groups can find arguments for their respective views in his article. In a nutshell, after briefly reviewing the controversy about p-values and significance reporting, he distinguishes empirical economic applications with regard to their main objective, namely estimation in the sense of quantifying the magnitude of an effect versus hypothesis testing. In the next two subsections, Imbens provides arguments why p-values contribute only little in the former research setting and why p-values and significance testing are useful in the latter.

This does not imply that p-values are without problems when hypothesis testing is the focus of an economic study and neither Imbens nor we make this claim. However, we do not share Norbert Hirschauer’s perception that “for many, if not most, research contexts in our field, Imbens does not support p-values”. On page 165 Imbens provides a couple of examples where it may be reasonable to focus on testing null hypotheses. (In our discussion paper we extend this list.) If there was only little need for hypothesis testing in empirical economic research, how could Imbens arrive at the conclusion: “In my view banning of p-values is inappropriate.” (page 170)?

Response to point 2 – Missing clearness regarding the question of methodological choice

We agree that the methods are a means to an end, not an end in themselves. Of course, researchers should clearly describe how samples are generated and under which assumptions they operate (also see response to point 3).

Fisherian Null Hypothesis Significance Testing (NHST) vs. Neyman-Pearson Framework (NPF)

We want to clarify that we do not suggest using either of the frameworks at all times. It depends on the case at hand. We believe that the Fisherian NHST is particularly useful in explorative analysis (compatibility of data under the null may point towards issues to study in greater depth) and in deductive work if the goal is to establish the initial presence of a directional effect based on theory. The NPF will likely be more useful in cases with well-established priors and if loss functions can be reasonably specified. That being said, the NPF should then perhaps be used more in deductive work related to well-established literature.

Readers who want to learn more about the history and debate around the two approaches can find a detailed treatment in: Ziliak, S., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.

A critique of the practice of NHST is provided by: Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.

Research contexts, in which NHST is meaningful

Your conjecture that we borrowed from Imbens (2021) when describing research contexts for meaningful applications of NHST is correct. We refer to situations in which the null hypothesis is not a “strawman”, i.e. the null has a “substantial” prior based on theoretical reasoning or previous empirical work. We think, however, that it is not possible to specify research questions for which this precondition unambiguously holds or not. The efficient market hypothesis or the law of one price may serve as examples. A rejection of the null, i.e. an incompatibility of the data with the hypothesis, would be much more surprising for stock markets than for land markets.

At this point, it is legitimate to ask whether one should stop at displaying p-values as a measure of the (in)compatibility of the data with the null hypothesis or whether one should go one step further and draw a “conclusion” in the sense of rejecting or accepting the null hypothesis (without implying that it is actually true / false). Our position here is that one should neither apply NHST mechanically nor rule it out. Prominent examples that require an application of NHST can be found in the context of econometric model building. Proper specification of econometric models typically involves a series of decisions: Are economic time series stationary or not? Are random variables normally distributed or not? Do economic variables exhibit spatial correlation or not? Is endogeneity an issue or not? The list of examples can be extended arbitrarily. Discrete choices have to be made, and NHST is definitely a useful tool to support these decisions despite its pitfalls.

The informative value of p-values is limited if the null is not very likely

Indeed, the focus in the described cases should be on effect sizes. In cases where the null hypothesis is unlikely to be true, because theory or prior work strongly suggests that there is an effect, the NPF will often be the more adequate framework. In other words, p-values are probably more useful for exploration. We do not want to suggest that p-values should always be used.

Power and the NPF

We agree that power is related to the NPF and that its calculation requires specifying a particular alternative hypothesis. (This should be evident from Figure 1.) However, effect sizes can also be estimated in the NPF (priors and an alternative hypothesis do not stop one from doing so).
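To illustrate why power cannot be computed without a specific alternative, the following is a minimal sketch for a one-sided z-test of a mean with known standard deviation. The function name, the normality assumption and all numbers are our own hypothetical choices for illustration and are not taken from the discussion paper.

```python
from statistics import NormalDist


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = 0 against H1: mu = effect > 0.

    Assumes a normally distributed sample mean with known sigma
    (an illustrative simplification).
    """
    std_norm = NormalDist()
    # Critical value fixed by the chosen type I error rate alpha.
    z_crit = std_norm.inv_cdf(1 - alpha)
    # Location of the test statistic's distribution under the alternative.
    shift = effect / (sigma / n ** 0.5)
    # Probability of exceeding the critical value if H1 is true.
    return 1 - std_norm.cdf(z_crit - shift)


# Hypothetical alternative: effect of 0.5 standard deviations, n = 25.
print(round(power_one_sided_z(effect=0.5, sigma=1.0, n=25), 3))
```

The `effect` argument is the specific alternative hypothesis: without it, the shift of the test statistic is undefined and no power can be stated. Setting `effect=0` returns the type I error rate `alpha`.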

Response to point 3 – Violations of assumptions

We agree with the statements. This point is critical, since it is very often difficult to work with random samples of farmers and consumers in the agricultural and food economics community. It is important to clearly distinguish the problem of inference and the problem of bias from non-random samples. In our perception, researchers in the community struggle more with the former (inference from non-random samples), but often make an honest attempt to discuss the latter (bias from non-random samples).

Regarding the problem of inference, we agree that non-random samples should not be used lightly for inference to the population. Researchers have the option to present their data, models and associated estimates without inference. For instance, authors could run a regression and present their point estimates. The bigger question is: what is the value of such studies if no inference can be made? What is there to learn for the reader? In many instances, therefore, a careful discussion of biases and assumptions might be more fruitful.

Alternatively, authors could use inferential statistics (including p-values and confidence intervals), but point towards the violation of the critical assumption of a random data generation process. Ultimately, it is difficult to judge in general how far the assumptions behind an “approximately-a-random-sample argument” are violated. We believe that transparency and critical reflection are key. As a minimum standard, if such assumptions are made, they should, of course, be made explicit in the communication of research results for the community to judge.

We believe that researchers should continue to make an honest attempt to discuss biases on a case-by-case basis. For instance, a survey of a self-selected sample on farmers’ willingness to participate in research surveys is very likely to produce an overestimate of that willingness, whereas a question on farm size may produce a somewhat smaller bias (if any): larger or smaller farms may still be more or less likely to respond for various reasons, but it is plausible that the bias is smaller in the second case. As a consequence, it would be a task for the research community as a whole to produce and publicly share high-quality random samples or population data that allow researchers to assess biases on as many variables as possible. Some issues will remain unresolved, though. Imagine the case where the willingness to participate in surveys is uncorrelated with other variables and unobserved in the population or a random sample. It will then be the task of the researcher to discuss sources and direction of bias.
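The self-selection example can be made concrete with a small simulation. This is only a sketch under strong assumptions of our own: each hypothetical farmer responds with a probability equal to his or her willingness to participate, and farm size is drawn independently of that willingness (so that, by construction, farm size is unbiased in expectation). The function name and all distributions are illustrative and not based on real data.

```python
import random
from statistics import mean


def simulate_self_selection(n_farmers=20000, seed=0):
    """Compare population means with means in a self-selected sample."""
    rng = random.Random(seed)
    # Each hypothetical farmer: (willingness in [0, 1], farm size in ha).
    # Assumption: farm size is independent of willingness.
    population = [
        (rng.random(), rng.gauss(100.0, 30.0)) for _ in range(n_farmers)
    ]
    # Self-selection: a farmer responds with probability = willingness.
    respondents = [(w, size) for w, size in population if rng.random() < w]
    return {
        "population_willingness": mean(w for w, _ in population),
        "sample_willingness": mean(w for w, _ in respondents),
        "population_farm_size": mean(s for _, s in population),
        "sample_farm_size": mean(s for _, s in respondents),
    }


result = simulate_self_selection()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the sample clearly overstates willingness to participate, while the mean farm size in the sample stays close to the population mean. If farm size were correlated with willingness, the second comparison would show a bias as well, which is why a case-by-case discussion is needed.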

We may revisit our discussion of implications with respect to this issue in the course of a revision. A first implication for the teaching and training of PhD students would be that methods courses cover data generation, sampling and representativeness together with how to best address the aim of the research (hypothesis testing, effect quantification, explorative analysis for theory development, etc.). This, in our view, would support the move towards more holistic courses that include scientific practice and methods. Discussing commonly used observational data sets and how these may suffer from bias would enrich such courses/modules. This would need to go along with more stringent research data management and community-specific rules/standards for documentation, including discussion of their sources of bias and limits for statistical inference, along with the FAIR Guiding Principles for scientific data management and stewardship (findability, accessibility, interoperability, and reusability).

Finally, we want to point out that bias and misspecified sampling error are interlinked. Nevertheless, we want to emphasize that the two issues should be discussed separately. If the real sampling error is larger than the assumed sampling error, correctly computed confidence intervals are wider than the reported ones, whereas biases may shift the interval upward or downward. We agree that this is an additional problem of using p-values in non-random samples.
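The distinction between the two problems can be shown with a short numerical sketch. All values (true effect, assumed and real standard errors, size of the bias) are hypothetical, and the normal approximation for the interval is our own simplifying assumption.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical values chosen only for illustration.
true_effect = 1.0
assumed_se = 0.10  # sampling error assumed under random sampling
real_se = 0.20     # larger real sampling error
bias = 0.30        # systematic shift from the non-random sample

reported = confidence_interval(true_effect, assumed_se)
# Misspecified sampling error: same center, interval should be wider.
corrected_width = confidence_interval(true_effect, real_se)
# Bias: same width, interval is shifted.
shifted = confidence_interval(true_effect + bias, assumed_se)

print("reported:       ", reported)
print("real sampling error:", corrected_width)
print("biased estimate:", shifted)
```

An understated sampling error leaves the interval centered correctly but too narrow, whereas a bias moves the whole interval without changing its width. In practice both can occur at once, which is why we suggest discussing them separately.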

Comments regarding the references

We thank Norbert Hirschauer for the points raised regarding our citing and references. As these issues were pointed out to us at an earlier stage already, we have tried to address them in this preprint. We will carefully check the references again and, if needed, revise them accordingly in a new version of the manuscript.

Thomas Heckelei, Silke Hüttel, Martin Odening, Jens Rommel