Zipf Law and n-Legomena over Texts with Limited Vocabularies, with Application to Fake News

Daniela Gifu; Mironela Pirnau; Speranta Cecilia Bolea; Silviu-Ioan Bejinariu; Vasile Apopei

doi:10.20944/preprints202607.0282.v1

Submitted: 02 July 2026

Posted: 03 July 2026


Abstract

There is a close theoretical relationship between Zipf’s law and the expected number of hapax legomena, dis-legomena, and n-legomena in general, as established in recent papers. Empirical analysis of very large texts has confirmed this relationship. However, the known relationship was derived under the assumption of very large vocabularies and very large texts, where the size of the vocabulary does not limit the maximal rank value and therefore does not limit the number of hapax legomena, dis-legomena, and other n-legomena for small n. In addition, the theoretical results were established under the hypothesis that word probabilities are context-independent. This hypothesis is not satisfied for the small vocabularies and short texts typical of fake news and other classes of short texts on the Internet. We provide a theoretical analysis under corrected hypotheses, prove several results under relaxed hypotheses, and show examples of practical results.
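To make the quantities in the abstract concrete, the following minimal Python sketch counts n-legomena (the number of distinct words that occur exactly n times) and lists the ranked frequencies to which Zipf’s law refers. It is an illustration only, not the authors' method: the function names `n_legomena_counts` and `ranked_frequencies`, the lowercase whitespace tokenization, and the sample sentence are all assumptions made for this example.

```python
from collections import Counter


def n_legomena_counts(text):
    """Map n -> number of distinct words occurring exactly n times.

    Assumption: words are lowercase, whitespace-separated tokens.
    """
    word_freq = Counter(text.lower().split())
    # Counting the frequency values gives the n-legomena profile:
    # key 1 is the hapax legomena count, key 2 the dis-legomena count, etc.
    return Counter(word_freq.values())


def ranked_frequencies(text):
    """Return word frequencies sorted in decreasing order (rank 1 first)."""
    word_freq = Counter(text.lower().split())
    return sorted(word_freq.values(), reverse=True)


# Hypothetical short text with a very small vocabulary.
sample = "the cat saw the dog and the dog saw a bird"
counts = n_legomena_counts(sample)
ranks = ranked_frequencies(sample)

print("hapax legomena:", counts[1])
print("dis-legomena:", counts[2])
print("ranked frequencies:", ranks)
```

In a text this short, the vocabulary size (seven distinct words here) caps the maximal rank, which is the regime the abstract contrasts with the very large text setting.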

Keywords: Zipf’s law; Hapax–Dis–n-Legomena; social-media; fake news detection; small-text statistical modeling

Subject: Computer Science and Mathematics - Applied Mathematics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
