Submitted: 07 March 2024
Posted: 08 March 2024
Abstract
Keywords:
1. Introduction
2. Count data
3. Count models
4. Poisson and Negative Binomial Models
5. Zero-inflated count data models
5.1. Zero Inflated Poisson Models
5.2. Zero Inflated Negative Binomial Models
6. Likelihood Ratio Tests for Model Selection: Vuong Test
7. Vuong Test Implementation in Python
7.1. Corruption Database
- Country name
- Code of the country
- Number of parking violations
- Number of UN Mission Diplomats in 1998
- Indicator (yes/no) of whether the data were recorded after law enforcement began
- Corruption Index - CI [42]
7.2. Code Availability
7.3. Computational Environment and Analysis Setup
7.4. Efficiency and Performance of the Algorithm
8. Results
9. Discussion
10. Conclusion
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| ZIP | Zero Inflated Poisson |
| ZINB | Zero Inflated Negative Binomial |
| LRT | Likelihood Ratio Tests |
| OLS | Ordinary Least Squares |
| NB | Negative Binomial |
| LL | Log-Likelihood |
| CI | Confidence Interval |
Appendix A. Vuong Test code in Python


References
- Fahrmeir, L.; Kneib, T.; Lang, S.; Marx, B. Regression: Models, Methods and Applications; Springer-Verlag: New York, 2013.
- Lesaffre, E.; Komárek, A.; Jara, A. The Bayesian approach. In Statistical and Methodological Aspects of Oral Health Research; Lesaffre, E., Feine, J., Leroux, B., Declerck, D., Eds.; John Wiley and Sons: Chichester, 2009; pp. 315–338.
- Gómez, G.; Luz Calle, M.; Oller, R.; Langohr, K. Tutorial on methods for interval-censored data and their implementation in R. Statistical Modelling 2009, 9, 259–297.
- Kneib, T. Beyond mean regression. Statistical Modelling 2013, 13, 275–303.
- Komárek, A.; Lesaffre, E. Bayesian semi-parametric accelerated failure time model for paired doubly-interval-censored data. Statistical Modelling 2006, 6, 3–22.
- Li, L.; Simonoff, J. S.; Tsai, C.-L. Tobit model estimation and sliced inverse regression. Statistical Modelling 2007, 7, 107–123.
- Waldmann, E.; Kneib, T.; Yue, Y. R.; Lang, S.; Flexeder, C. Bayesian semiparametric additive quantile regression. Statistical Modelling 2013, 13, 223–252.
- Vuong, Q. H. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: journal of the Econometric Society 1989, 307–333.
- Cohen, J.; Cohen, P.; West, S. G.; Aiken, L. S. Applied multiple regression/correlation analysis for the behavioral sciences; Routledge, 2013.
- Long, J. Scott; Freese, Jeremy. Regression models for categorical dependent variables using Stata; Stata press, 2006; Volume 7.
- Hilbe, Joseph M. Modeling count data; Cambridge University Press, 2014.
- Coxe, Stefany; West, Stephen G.; Aiken, Leona S. The analysis of count data: A gentle introduction to Poisson regression and its alternatives. Journal of personality assessment 2009, 91, 121–136.
- Corlu, Canan G.; Akcay, Alp; Xie, Wei. Stochastic simulation under input uncertainty: A review. Operations Research Perspectives 2020, 7, 100162, Elsevier.
- Kejzlar, Vojtech; Son, Mookyong; Bhattacharya, Shrijita; Maiti, Tapabrata. A fast and calibrated computer model emulator: an empirical Bayes approach. Statistics and Computing 2021, 31, 1–26, Springer.
- Barratt, Shane; Angeris, Guillermo; Boyd, Stephen. Optimal representative sample weighting. Statistics and Computing 2021, 31, 1–14, Springer.
- Bodenham, Dean A.; Kawahara, Yoshinobu. euMMD: efficiently computing the MMD two-sample test statistic for univariate data. Statistics and Computing 2023, 33, 110, Springer.
- Fischer, Samuel M.; Lewis, Mark A. A robust and efficient algorithm to find profile likelihood confidence intervals. Statistics and Computing 2021, 31, 38, Springer.
- Winkelmann, Rainer. Counting on count data models: Quantitative policy evaluation can benefit from a rich set of econometric methods for analyzing count data; IZA world of labor, 2015; 148, online. Forschungsinstitut zur Zukunft der Arbeit GmbH (IZA).
- Cameron, A. Colin; Trivedi, Pravin K. Essentials of count data regression. A companion to theoretical econometrics; Wiley Online Library, 2001; 331.
- Cameron, A. Colin; Trivedi, Pravin K. Regression analysis of count data; 53; Cambridge university press, 2013.
- Perumean-Chaney, Suzanne E.; Morgan, Charity; McDowall, David; Aban, Inmaculada. Zero-inflated and overdispersed: what’s one to do? Journal of Statistical Computation and Simulation 2013, 83, 1671–1683, Taylor & Francis.
- Nelder, John Ashworth; Wedderburn, Robert WM. Generalized linear models. Journal of the Royal Statistical Society Series A: Statistics in Society 1972, 135, 370–384, Oxford University Press.
- Faraway, JJ. Generalized linear models. In International Encyclopedia of Education; Elsevier, 2010; pp. 178–183.
- Ramalho, Joaquim Jose Dos Santos. Modelos de regressao para dados de contagem; Universidade de Evora: Portugal, 1996.
- Favero, Luiz Paulo; Belfiore, Patricia. Manual de analise de dados: estatistica e modelagem multivariada com Excel®, SPSS® e Stata®; Elsevier: Brasil, 2017.
- Tadano, Yara de Souza; Ugaya, Cassia Maria Lie; Franco, Admilson Teixeira. Metodo de regressao de Poisson: metodologia para avaliacao do impacto da poluicao atmosferica na saude populacional. Ambiente & Sociedade 2009, 12, 241–255, SciELO Brasil.
- Winkelmann, Rainer. Econometric analysis of count data; Springer Science & Business Media, 2008.
- Payne, Elizabeth H.; Hardin, James W.; Egede, Leonard E.; Ramakrishnan, Viswanathan; Selassie, Anbesaw; Gebregziabher, Mulugeta. Approaches for dealing with various sources of overdispersion in modeling count data: Scale adjustment versus modeling. Statistical methods in medical research 2017, 26, 1802–1823, SAGE Publications Sage UK: London, England.
- Hilbe, Joseph M. Negative binomial regression; Cambridge University Press, 2011.
- Walters, Glenn D. Using Poisson class regression to analyze count data in correctional and forensic psychology: A relatively old solution to a relatively new problem. Criminal Justice and Behavior 2007, 34, 1659–1674, Sage Publications Sage CA: Los Angeles, CA.
- Atkins, David C.; Gallop, Robert J. Rethinking how family researchers model infrequent outcomes: a tutorial on count regression and zero-inflated models. Journal of Family Psychology 2007, 21, 726, American Psychological Association.
- Lambert, Diane. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14, Taylor & Francis.
- Desmarais, Bruce A.; Harden, Jeffrey J. Testing for zero inflation in count models: Bias correction for the Vuong test. The Stata Journal 2013, 13, 810–835, SAGE Publications Sage CA: Los Angeles, CA.
- Kullback, Solomon; Leibler, Richard A. On information and sufficiency. The Annals of Mathematical Statistics 1951, 22, 79–86, JSTOR.
- Akaike, Hirotugu. A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974, 19, 716–723, IEEE.
- Konishi, Sadanori; Kitagawa, Genshiro. Generalised information criteria in model selection. Biometrika 1996, 83, 875–890, Oxford University Press.
- Smyth, Padhraic. Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 2000, 10, 63–72, Springer.
- Vallat, Raphael. Pingouin: statistics in Python. J. Open Source Softw. 2018, 3, 1026.
- Seabold, Skipper; Perktold, Josef. Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, 2010; Volume 57, Number 61. pp. 10–25080.
- McKinney, Wes; others. pandas: a foundational Python library for data analysis and statistics. Python for high performance and scientific computing 2011, 14, 1–9, Seattle.
- Fisman, Raymond; Miguel, Edward. Corruption, norms, and legal enforcement: Evidence from diplomatic parking tickets. Journal of Political economy 2007, 115, 1020–1048, The University of Chicago Press.
- Kaufmann, Daniel; Kraay, Aart; Mastruzzi, Massimo. Governance matters IV: governance indicators for 1996-2004. World bank policy research working paper series 2005, (3630).
- Cameron, A Colin; Trivedi, Pravin K. Regression-based tests for overdispersion in the Poisson model. Journal of econometrics 1990, 46, 347–364, Elsevier.
- Nagpal, Abhinav; Gabrani, Goldie. Python for data analytics, scientific and technical applications. In 2019 Amity international conference on artificial intelligence (AICAI); IEEE, 2019; pp. 140–145.
- Sarker, Kamal Uddin, Mohammed Saqib, Raza Hasan, Salman Mahmood, Saqib Hussain, Ali Abbas, and Aziz Deraman. A Ranking Learning Model by K-Means Clustering Technique for Web Scraped Movie Data. Computers 2022, 11, 158.
- Malamatinos, Marios-Christos, Eleni Vrochidou, and George A Papakostas. On Predicting Soccer Outcomes in the Greek League Using Machine Learning. Computers 2022, 11, 133.
- Baker del Aguila, Ryan, Carlos Daniel Contreras Pérez, Alejandra Guadalupe Silva-Trujillo, Juan C Cuevas-Tello, and Jose Nunez-Varela. Static Malware Analysis Using Low-Parameter Machine Learning Models. Computers 2024, 13, 59.

| Country | Code | Violations | Staff | Law | CI |
| Angola | AGO | 50 | 9 | no | 1.048 |
| Angola | AGO | 1 | 9 | yes | 1.048 |
| Albania | ALB | 17 | 3 | no | 0.921 |
| Albania | ALB | 0 | 3 | yes | 0.921 |
| UAE | ARE | 0 | 3 | no | -0.780 |
| UAE | ARE | 0 | 3 | yes | -0.780 |
| Argentina | ARG | 5 | 19 | no | 0.224 |
| Argentina | ARG | 0 | 19 | yes | 0.224 |
| Armenia | ARM | 3 | 4 | no | 0.710 |
| Armenia | ARM | 0 | 4 | yes | 0.710 |
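
The ten sample rows above already hint at the excess of zeros that motivates zero-inflated models. As a minimal sketch, assuming only these excerpted rows (not the full dataset) and a hypothetical helper name `zero_share_by_law`, the share of zero counts can be compared across the two values of the law indicator:

```python
from collections import defaultdict

# (country code, violations, law indicator) copied from the ten sample rows above.
# This is only the excerpt shown in the table, not the full dataset.
SAMPLE_ROWS = [
    ("AGO", 50, "no"), ("AGO", 1, "yes"),
    ("ALB", 17, "no"), ("ALB", 0, "yes"),
    ("ARE", 0, "no"), ("ARE", 0, "yes"),
    ("ARG", 5, "no"), ("ARG", 0, "yes"),
    ("ARM", 3, "no"), ("ARM", 0, "yes"),
]


def zero_share_by_law(rows):
    """Return the fraction of zero violation counts for each law-indicator value."""
    counts = defaultdict(lambda: [0, 0])  # law -> [number of zeros, number of rows]
    for _code, violations, law in rows:
        counts[law][0] += int(violations == 0)
        counts[law][1] += 1
    return {law: zeros / total for law, (zeros, total) in counts.items()}


shares = zero_share_by_law(SAMPLE_ROWS)
for law, share in sorted(shares.items()):
    print(f"law={law}: share of zero counts = {share:.1f}")
```

With so few rows this is illustrative only; any conclusion about zero inflation should rest on the full data and the fitted models.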

| Parameter | Value |
| Mean | 6.497 |
| Variance | 331.618 |
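
The reported variance is far larger than the reported mean, whereas a Poisson model implies that the two are equal. A minimal sketch of that comparison, using only the two values in the table above; the helper name `dispersion_ratio` is hypothetical and is not taken from the paper's code:

```python
def dispersion_ratio(mean: float, variance: float) -> float:
    """Variance-to-mean ratio; a Poisson model implies a ratio of 1."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return variance / mean


# Values reported in the summary table for the count variable.
ratio = dispersion_ratio(mean=6.497, variance=331.618)
print(f"variance / mean = {ratio:.2f}")
```

A ratio well above 1 is a descriptive sign of overdispersion; it does not replace a formal test such as the regression-based one cited in the references.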

| Vuong Test [8] | Poisson vs. ZIP | NB vs. ZINB |
| Vuong z-statistic | ≈ -2.993 | ≈ -1.947 |
| p-value | ≈ 0.0014 | ≈ 0.0258 |
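
The p-values in the table above are consistent with a one-sided, lower-tail standard normal probability of each z-statistic. The sketch below illustrates this with the standard library only. It is not the authors' Appendix A code. The function names are hypothetical, the sign convention (pointwise log-likelihood of the first model minus that of the second, so that a negative z favours the second model) is an assumption, and no bias correction of the kind discussed by Desmarais and Harden is applied.

```python
import math
from statistics import fmean, pstdev


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def vuong_statistic(loglik_a, loglik_b) -> float:
    """Uncorrected Vuong z-statistic from per-observation log-likelihoods.

    Assumed convention: differences are model A minus model B, so a
    negative value favours model B.
    """
    if len(loglik_a) != len(loglik_b):
        raise ValueError("both models need one log-likelihood per observation")
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    spread = pstdev(diffs)  # population (1/n) standard deviation
    if spread == 0:
        raise ValueError("log-likelihood differences have zero variance")
    return math.sqrt(len(diffs)) * fmean(diffs) / spread


# Recompute the lower-tail p-values from the z-statistics reported in the table.
reported = {"Poisson vs. ZIP": -2.993, "NB vs. ZINB": -1.947}
p_values = {label: normal_cdf(z) for label, z in reported.items()}
for label, p in p_values.items():
    print(f"{label}: p = {p:.4f}")
```

In practice, the per-observation log-likelihoods would come from the fitted count models; how they are extracted depends on the modelling library and is left out of this sketch.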
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).