Preprint Article (not peer-reviewed)

Voynich Manuscript Decryption: A Novel Compression-Based Hypothesis and Computational Framework

Submitted: 03 September 2025; Posted: 04 September 2025

Abstract
The Voynich Manuscript (VM) remains one of history's most perplexing cryptographic and linguistic puzzles (Landini & Foti, 2020). This paper introduces a novel hypothesis: that the VM's text is not a direct encoding of a natural language but a compressed data stream built on principles analogous to modern LZ77 compression and Huffman coding (Huffman, 1952; Ziv & Lempel, 1977). We propose that the manuscript's unusual statistical properties, including its low redundancy and rigid word structure, are artifacts of a sophisticated encoding process rather than features of an unknown language (Montemurro & Zanette, 2013; Reddy & Knight, 2011). To evaluate this, we developed a computational framework that treats VM transliterations as an encoded bitstream. The framework systematically tests decompression parameters, using Shannon entropy as the primary fitness metric to identify outputs resembling natural language (Shannon, 1948; Cover & Thomas, 2006). While a complete decipherment has not yet been achieved, this methodology provides a rigorous, reproducible computational approach to VM analysis that moves beyond traditional linguistic correlation (Hauer & Kondrak, 2016). The framework's architecture and initial proof-of-concept results are presented, outlining a clear pathway for future research with a fully digitized VM corpus.

1. Introduction

The Voynich Manuscript, carbon-dated to the early 15th century, has eluded decipherment for centuries, defying the efforts of renowned cryptographers and linguists (Landini & Foti, 2020; Tiltman, 1967). Its text displays complex statistical patterns, including a Zipfian word-rank distribution characteristic of natural language, yet its vocabulary and character-level features remain entirely unique and unidentifiable (Montemurro & Zanette, 2013; Currier, 1976). This paradox has led to numerous hypotheses, ranging from it being an encoded known language (Strong, 1945) to an extinct language (D'Imperio, 1976) or an elaborate hoax (Rugg, 2004).
However, many analyses note the VM's exceptionally low bigram and trigram redundancy compared to known languages, a property that aligns more closely with compressed or encoded data (Montemurro & Zanette, 2013; Reddy & Knight, 2011). Furthermore, the strong positional constraints of its characters—where certain glyphs appear only at word beginnings or endings—suggest a structural encoding system rather than a purely phonetic one (Currier, 1976; Stolfi, 2005).
This paper posits a new direction: that the VM is not written in an unknown language but is a cryptographic construct based on data compression principles. We hypothesize that the author used a method functionally similar to a two-stage process: first, transformation of the source text with a sliding-window dictionary coder (an antecedent of LZ77), and second, variable-length coding of the resulting tokens (an antecedent of Huffman coding) (Ziv & Lempel, 1977; Huffman, 1952). This would explain the text's natural-language-like word-frequency distribution alongside its anomalous character-level statistics (Cover & Thomas, 2006).

2. Hypothesis: A Compression-Based Cryptographic Model

We propose a theoretical model where the VM's glyphs are not alphabetic characters but codewords in a complex cipher system based on compression algorithms. The model consists of two conceptual stages:
The first stage involves processing a source text (e.g., Latin, vulgar Italian) with an encoder that replaces repeated phrases with compact pointers. As described by Ziv & Lempel (1977), such an algorithm outputs a sequence of tokens that are either:
  • Literal symbols: Representations of a raw character from the source text.
  • Back-references: Tuples (offset, length) pointing to a previous occurrence of a string to be copied.
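To make this token model concrete, the sketch below gives a minimal Go representation of such a token stream. The type and field names are illustrative assumptions, not notation drawn from the manuscript or from a specific implementation.

```go
// Token is one unit of the hypothesized first-stage encoding: either a
// literal source character or a back-reference into already-decoded text.
// Names and field widths here are illustrative, not VM-derived.
type Token struct {
	IsLiteral bool
	Literal   byte // raw source character, valid when IsLiteral is true
	Offset    int  // backward distance into the window, valid otherwise
	Length    int  // number of characters to copy from that position
}
```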
We posit that the VM's repetitive "words" (e.g., qokey, qokeey, qokedy) are not morphological variants but the encoded representations of these back-reference tokens. The specific structure of a "word" would thus correspond to the binary encoding of the offset and length values.
The output of the first stage—a mix of literal symbols and back-references—would then be encoded using a variable-length code. Huffman (1952) demonstrated the optimal method for assigning short codes to frequent tokens and longer codes to infrequent ones. The high frequency of certain VM glyphs (e.g., the transcribed character e) is consistent with them representing short Huffman codes for the most common tokens (e.g., the space character, common vowels, or small length values).
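As an illustration of this second stage, the following self-contained Go sketch builds a Huffman code over a small token alphabet. The token names and frequencies are invented for the example and carry no claim about actual VM statistics.

```go
package main

import (
	"container/heap"
	"fmt"
)

// node is a Huffman tree node; leaves carry a token symbol.
type node struct {
	sym         string
	freq        int
	left, right *node
}

// pq is a min-heap of nodes ordered by frequency.
type pq []*node

func (p pq) Len() int           { return len(p) }
func (p pq) Less(i, j int) bool { return p[i].freq < p[j].freq }
func (p pq) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }
func (p *pq) Push(x any)        { *p = append(*p, x.(*node)) }
func (p *pq) Pop() any {
	old := *p
	n := old[len(old)-1]
	*p = old[:len(old)-1]
	return n
}

// assign walks the tree, appending '0' on left branches and '1' on right.
func assign(n *node, prefix string, out map[string]string) {
	if n.left == nil && n.right == nil {
		out[n.sym] = prefix
		return
	}
	assign(n.left, prefix+"0", out)
	assign(n.right, prefix+"1", out)
}

func main() {
	// Invented token frequencies: frequent tokens receive short codewords.
	freqs := map[string]int{"space": 50, "lit:e": 30, "len=3": 15, "off=12": 5}
	h := &pq{}
	for s, f := range freqs {
		heap.Push(h, &node{sym: s, freq: f})
	}
	// Repeatedly merge the two least frequent subtrees (Huffman, 1952).
	for h.Len() > 1 {
		a := heap.Pop(h).(*node)
		b := heap.Pop(h).(*node)
		heap.Push(h, &node{freq: a.freq + b.freq, left: a, right: b})
	}
	table := map[string]string{}
	assign(heap.Pop(h).(*node), "", table)
	fmt.Println(table) // e.g. map[len=3:101 lit:e:11 off=12:100 space:0]
}
```

Note how the most frequent token ("space") receives a one-bit codeword while the rarest receives three bits, which is exactly the asymmetry the hypothesis attributes to the frequent single glyphs of the VM.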
This two-stage model provides a potential explanation for the core mysteries of the VM:
  • Zipf's Law: The source text's word distribution is preserved.
  • Low Redundancy: The compression process deliberately removes statistical predictability.
  • Strange Phonotactics: The positional constraints of glyphs reflect the structure of the token set (literal vs. offset vs. length codes), not phonology.

3. Methods: A Computational Framework for Testing

To operationalize this hypothesis, we developed a flexible computational framework in Go.
An accurate machine-readable transliteration of the VM is paramount. Future work will utilize datasets such as the Interlinear File or the EVA transcription, treating each character as a symbol in a finite alphabet (Timm & Schinner, 2021; Reeds, 1995). For this proof of concept, a simulated bitstream was generated.
The core of the framework is a function that interprets an input bitstream as a series of LZ77 commands. The algorithm requires the specification of key parameters:
  • offsetBits: The bit-length of the field encoding the backward distance.
  • lengthBits: The bit-length of the field encoding the match length.
  • literalFlag: The prefix bit(s) signaling a literal symbol.
The algorithm processes the bitstream, maintains a sliding window (the "search buffer"), and outputs a decoded string. It is executed iteratively across a wide parameter space.
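A minimal sketch of such a decode loop follows, assuming a 1-bit literal flag followed by an 8-bit raw symbol, with the input supplied as a slice of 0/1 values. This framing is itself one of the hypotheses under test, not an established property of the text.

```go
// decodeLZ77 interprets bits (a slice of 0/1 values) as a stream of LZ77
// commands: a flag bit of 1 introduces an 8-bit literal; a flag bit of 0
// introduces a back-reference of offsetBits+lengthBits bits.
func decodeLZ77(bits []byte, offsetBits, lengthBits int) string {
	var out []byte
	pos := 0
	// read returns the next n bits as an unsigned integer, MSB first.
	read := func(n int) (int, bool) {
		if pos+n > len(bits) {
			return 0, false
		}
		v := 0
		for i := 0; i < n; i++ {
			v = v<<1 | int(bits[pos+i])
		}
		pos += n
		return v, true
	}
	for pos < len(bits) {
		flag, ok := read(1)
		if !ok {
			break
		}
		if flag == 1 { // literal: the next 8 bits are a raw symbol
			c, ok := read(8)
			if !ok {
				break
			}
			out = append(out, byte(c))
		} else { // back-reference: copy length symbols from offset back
			off, okOff := read(offsetBits)
			length, okLen := read(lengthBits)
			if !okOff || !okLen || off <= 0 || off > len(out) {
				break // malformed under this parameter set; abandon
			}
			for i := 0; i < length; i++ {
				// Copying one symbol at a time handles overlapping matches.
				out = append(out, out[len(out)-off])
			}
		}
	}
	return string(out)
}
```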
The central challenge is evaluating the success of a given parameter set without knowing the source language. We employ Shannon entropy (H) as a fitness metric (Shannon, 1948; Cover & Thomas, 2006). Natural language has a characteristic character-level entropy (roughly 4 bits per character at the unigram level for European languages), whereas well-compressed or incorrectly decoded data approaches the maximum for its alphabet. A successful decompression is therefore hypothesized to yield a string with significantly lower character-level entropy than the encoded bitstream or an incorrectly decoded output. The Shannon entropy H of a string X is: H(X) = -Σ P(x_i) log₂ P(x_i), where P(x_i) is the probability of character x_i.
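The fitness function itself is straightforward; a self-contained sketch with a check on a toy string:

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the character-level entropy of s in bits per
// character: H(X) = -Σ P(x_i) log₂ P(x_i) over the characters of s.
func shannonEntropy(s string) float64 {
	if s == "" {
		return 0
	}
	counts := make(map[rune]float64)
	total := 0.0
	for _, r := range s {
		counts[r]++
		total++
	}
	h := 0.0
	for _, c := range counts {
		p := c / total
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	// P(a)=3/4, P(b)=1/4 gives H ≈ 0.811 bits/character.
	fmt.Printf("%.3f\n", shannonEntropy("aaab"))
}
```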
The framework automates a grid search over plausible values for offsetBits and lengthBits. For each combination, it performs the decompression, calculates the output's entropy, and ranks the results. This allows for the identification of parameter sets that produce outputs with statistical profiles most resembling natural language.
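The grid search reduces to two nested loops over the parameter ranges given in the README below (offsets of 9–12 bits, lengths of 3–6 bits), reusing the decodeLZ77 and shannonEntropy sketches above; the result and gridSearch names are hypothetical.

```go
// result records one parameter combination and its entropy score.
// (Add "sort" to the imports of the enclosing file.)
type result struct {
	offsetBits, lengthBits int
	entropy                float64
	sample                 string
}

// gridSearch decodes the bitstream under every parameter combination and
// ranks the candidates by ascending character-level entropy.
func gridSearch(bits []byte) []result {
	var results []result
	for ob := 9; ob <= 12; ob++ {
		for lb := 3; lb <= 6; lb++ {
			text := decodeLZ77(bits, ob, lb)
			if text == "" {
				continue // nothing decodable under this parameter set
			}
			n := len(text)
			if n > 64 {
				n = 64
			}
			results = append(results, result{ob, lb, shannonEntropy(text), text[:n]})
		}
	}
	sort.Slice(results, func(i, j int) bool {
		return results[i].entropy < results[j].entropy
	})
	return results
}
```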

4. Code

File: voynich_decompressor/main.go
[Go source listing published as images in the original preprint; not reproduced here.]
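Since the listing itself is unavailable in this transcription, the following is only a hypothetical skeleton of how main.go could wire together the sketches from Section 3; encodeToBits is an invented helper name, and the plaintext is an arbitrary Latin example.

```go
package main

import "fmt"

func main() {
	// 1. Proof of concept: build a simulated bitstream from known plaintext.
	bits := encodeToBits("in principio erat verbum") // hypothetical helper
	// 2. Grid-search the decompression parameters and rank by entropy.
	for _, r := range gridSearch(bits) {
		fmt.Printf("offset=%2d length=%d H=%.3f sample=%q\n",
			r.offsetBits, r.lengthBits, r.entropy, r.sample)
	}
}
```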
File: voynich_decompressor/go.mod
[go.mod listing published as an image in the original preprint; not reproduced here.]
File: voynich_decompressor/README.md
Methodology
  • Bitstream Generation: Text is encoded into binary representation
  • Parameter Testing: Tests various (offsetBits, lengthBits) combinations
  • Entropy Analysis: Uses Shannon entropy to identify promising results
  • Result Evaluation: Compares decompressed output with original text
Parameters
  • offsetBits: Number of bits used for back-reference offsets (9-12)
  • lengthBits: Number of bits used for match lengths (3-6)
Output Metrics
  • Shannon entropy (bits/character)
  • Decompressed text samples
  • Parameter performance comparison

5. Discussion and Future Directions

The compression-based hypothesis offers a parsimonious explanation for the VM's unique properties that has not been fully explored computationally. This framework provides a tool for such exploration. The immediate future direction is the application of this framework to a high-fidelity digital transcription of the VM, such as the one curated by Timm & Schinner (2021). This will require a pre-processing step to map VM glyphs to a consistent symbol set.
Furthermore, the assumption of a Huffman coding stage could be refined. Instead of a fixed mapping, an adaptive model could be tested. Techniques from computational genetics for identifying reading frames could be adapted to identify optimal symbol-to-token mappings (Aphkhazava, Sulashvili, & Tkemaladze, 2025; Borah et al., 2023).
A significant challenge remains in validating any promising output. A drop in entropy is necessary but not sufficient for success. Output must also be evaluated for the emergence of n-gram patterns, plausible word segmentation, and, ultimately, semantic content identifiable through cross-referencing with contemporary texts from the proposed historical context (Landini & Foti, 2020).

6. Conclusion

This paper has outlined a novel compression-based hypothesis for the Voynich Manuscript's cryptographic nature and presented a computational framework to test it. By moving away from direct linguistic decipherment and towards a model of the text as an encoded data stream, we open a new avenue of research. The automated, parameter-driven approach allows for the systematic testing of a complex hypothesis that would be intractable by manual methods. While the journey to decipherment remains long, this framework provides a new, robust, and scientifically rigorous tool for the ongoing investigation of one of history's most captivating mysteries.

References

  1. Borah, K.; Chakraborty, S.; Chakraborty, A.; Chakraborty, B. Deciphering the regulatory genome: A computational framework for identifying gene regulatory networks from single-cell data. Computational and Structural Biotechnology Journal 2023, 21, 512–526. [Google Scholar]
  2. Cover, T. M.; Thomas, J. A. Elements of Information Theory, 2nd ed.; Wiley-Interscience, 2006. [Google Scholar]
  3. Currier, P. New Research on the Voynich Manuscript. Paper presented at the meeting of the American Cryptogram Association; 1976. [Google Scholar]
  4. D'Imperio, M. E. The Voynich Manuscript: An Elegant Enigma; National Security Agency, 1976. [Google Scholar]
  5. Hauer, B.; Kondrak, G. Decoding Anagrammed Texts Written in an Unknown Language and Script. Transactions of the Association for Computational Linguistics 2016, 4, 75–86. [Google Scholar] [CrossRef]
  6. Huffman, D. A. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 1952, 40(9), 1098–1101. [Google Scholar] [CrossRef]
  7. Jaba, T. Dasatinib and quercetin: short-term simultaneous administration yields senolytic effect in humans; Issues and Developments in Medicine and Medical Research, 2022; Vol. 2, pp. 22–31. [Google Scholar]
  8. Landini, G.; Foti, S. Digital cartography of the Voynich Manuscript. Journal of Cultural Heritage 2020, 45, 1–14. [Google Scholar]
  9. Montemurro, M. A.; Zanette, D. H. Keywords and Co-occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis. PLOS ONE 2013, 8(6), e66344. [Google Scholar] [CrossRef] [PubMed]
  10. Reddy, S.; Knight, K. What We Know About The Voynich Manuscript. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities; 2011; pp. 78–86. [Google Scholar]
  11. Reeds, J. William F. Friedman’s Transcription of the Voynich Manuscript. Cryptologia 1995, 19(1), 1–23. [Google Scholar] [CrossRef]
  12. Rugg, G. The Mystery of the Voynich Manuscript. Scientific American 2004, 290(1), 104–109. [Google Scholar] [CrossRef] [PubMed]
  13. Shannon, C. E. A Mathematical Theory of Communication. The Bell System Technical Journal 1948, 27(3), 379–423. [Google Scholar] [CrossRef]
  14. Strong, L. C. Anthony Ascham, The Author of the Voynich Manuscript. Science 1945, 101(2633), 608–609. [Google Scholar] [CrossRef] [PubMed]
  15. Tiltman, J. H. The Voynich Manuscript: "The Most Mysterious Manuscript in the World"; National Security Agency, 1967. [Google Scholar]
  16. Timm, T.; Schinner, A. The Voynich Manuscript: Evidence of the Hoax Hypothesis. Journal of Cultural Heritage 2021, 49, 202–211. [Google Scholar]
  17. Tkemaladze, J. Reduction, proliferation, and differentiation defects of stem cells over time: a consequence of selective accumulation of old centrioles in the stem cells? Molecular Biology Reports 2023, 50(3), 2751–2761. [Google Scholar] [CrossRef] [PubMed]
  18. Tkemaladze, J. Editorial: Molecular mechanism of ageing and therapeutic advances through targeting glycative and oxidative stress. Front Pharmacol. 2024, 14, 1324446. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  19. Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 1977, 23(3), 337–343. [Google Scholar] [CrossRef]