Preprint
Article

This version is not peer-reviewed.

Zipf Law and n-Legomena over Texts with Limited Vocabularies, with Application to Fake News

Submitted: 02 July 2026

Posted: 03 July 2026

Abstract
Recent papers have established a close theoretical relationship between Zipf’s law and the expected number of hapax legomena, dis-legomena, and n-legomena in general. Empirical analysis of very large texts has confirmed this relationship. However, the known theoretical relationship was derived under the assumption of very large vocabularies and very large texts, where the size of the vocabulary does not limit the maximal rank value and thus does not limit the number of hapax legomena, dis-legomena, and so on (n-legomena for small values of n). In addition, the theoretical results were established under the hypothesis that word probabilities are context-independent, which is not satisfied for the small vocabularies and small texts typical of fake news and of other classes of small texts on the Internet. We provide a theoretical analysis under rectified hypotheses, prove several results under relaxed hypotheses, and show examples of practical results.
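To make the quantities named in the abstract concrete, the following is a minimal sketch, not the authors' method, of how n-legomena counts and a rank-frequency list could be computed for a small text. The function names, the regular-expression tokenizer, and the sample sentence are illustrative assumptions that do not come from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens; a real study would likely
    # use a more careful tokenizer.
    return re.findall(r"[a-z']+", text.lower())


def n_legomena_counts(text):
    """Map n to the number of distinct words that occur exactly n times."""
    word_freq = Counter(tokenize(text))
    return Counter(word_freq.values())


def rank_frequency(text):
    """Word frequencies sorted in decreasing order (rank 1 first)."""
    return sorted(Counter(tokenize(text)).values(), reverse=True)


# Hypothetical sample text, chosen only for illustration.
sample = "the cat sat on the mat and the dog sat on the log"

legomena = n_legomena_counts(sample)
print("hapax legomena:", legomena[1])
print("dis-legomena:", legomena[2])
print("rank-frequency list:", rank_frequency(sample))
```

With a vocabulary this small, the maximal rank is bounded by the number of distinct words, which is the regime the abstract contrasts with the large-vocabulary assumption.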
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.