Preprint
Article

This version is not peer-reviewed.

Karamel’s Adventures: Building an AI-Powered Multilingual Storybook Generation Pipeline

Submitted:

04 May 2026

Posted:

06 May 2026

You are already at the latest version

Abstract
This paper presents a fully automated pipeline for converting monolingual, illustrated PDF storybooks into multilingual, AI-narrated interactive digital publications. The system was developed to disseminate 53 children's storybooks—originally produced in English by the Houston Education Attaché Office of the Republic of Türkiye and hosted at storiesofturkiye.com—across 34 target languages, covering the cultural, historical, and geographical heritage of Türkiye for young readers worldwide. The pipeline comprises four sequential stages: (1) structured PDF decomposition into text and image assets using PyMuPDF, (2) context-aware translation and editorial refinement via a locally hosted large language model (LLM) running under LM Studio, (3) multilingual text-to-speech (TTS) synthesis with optional zero-shot voice cloning using the Chatterbox model, and (4) automated generation of flip-book–style HTML5 web publications. The resulting system produces 15 languages with full audio-text output and an additional 19 languages with text-only output, reaching over 34 distinct linguistic communities through the diplomatic education network of Türkiye's overseas representations. We describe the architectural decisions, prompt engineering strategies, AI hallucination mitigation, and cross-lingual voice transfer challenges encountered, and we reflect on the broader implications of LLM-driven educational content localisation at scale.
Keywords: 
;  ;  ;  ;  ;  ;  ;  

1. Introduction

Digital technologies have fundamentally transformed the dissemination of educational content, particularly for young learners. Interactive e-books that combine text, imagery, and narrated audio have been shown to improve reading comprehension, vocabulary acquisition, and motivation in children aged 5–10 [1]. However, the production of high-quality, multilingual narrated storybooks remains resource-intensive, typically requiring professional translators, voice actors, and web developers for each target language.
The Houston Education Attaché Office of the Republic of Türkiye operates storiesofturkiye.com, a platform hosting 53 illustrated storybooks featuring a character named Karamel—an orange cat whose adventures introduce Türkiye’s culture, history, and geography to children. These books were originally published as flipbooks on heyzine.com, with English text and professional audio narration. The diplomatic mission of the attaché network—spanning dozens of countries across Asia, Europe, Africa, and the Americas—created a clear need to localise these materials into the native languages of host countries.
Manual localisation of 53 books into even 10 languages would require an estimated 530 translation units plus equivalent voice recordings, representing a prohibitive cost in both time and budget. This paper describes how we addressed this challenge by designing a four-stage, fully automated pipeline that leverages open-source and locally hosted AI tools to transform the source PDFs into 34-language digital publications with minimal human intervention.
The contributions of this work are as follows:
  • A novel PDF decomposition scheme exploiting the fixed page-layout structure of the source storybooks;
  • A prompt engineering methodology for child-oriented, voiceover-ready translation via a local LLM;
  • A cross-lingual TTS approach combining smart chunking and zero-shot voice cloning; and
  • An automated HTML5 flipbook generation module that synchronises page-turn events with audio playback.

2. Source Material and Motivation

2.1. The Stories of Türkiye Corpus

The source corpus consists of 53 PDF storybooks. Each book follows a strictly fixed page layout: page 1 is the front cover image; page 2 carries the Houston Education Attaché logo; page 3 is a colophon (identical across all books); even-numbered pages 4–20 contain the story text; odd-numbered pages 5–21 contain full-page illustrations; and page 22 is the back cover (see Table 1). This deterministic structure is fundamental to the extraction strategy described in Section 3.1. [2]

2.2. Target Language Selection

Target languages were selected based on the geographical distribution of Türkiye’s diplomatic representations and the linguistic communities they serve. Fifteen languages were designated as full audio-text targets (Arabic, Turkish, Chinese, Spanish, Malay, German, Swedish, Italian, Russian, French, Korean, Dutch, Polish, Danish, English), while an additional nineteen languages (including Belarusian, Uzbek, Azerbaijani, Georgian, Pashto, Kyrgyz, and others) received text-only output due to TTS model limitations at the time of production.

3. Methods

The pipeline is composed of four sequential modules, each implemented as an independent Python script. Figure 1 illustrates the overall data flow.

3.1. Stage 1: Structured PDF Decomposition (pdf2text.py)

The first module scans a source directory for PDF files and processes each one using PyMuPDF (fitz) [ ], a high-performance PDF rendering library. For every source file, a corresponding output directory is created. Pages are classified according to three rules derived from the corpus structure:
  • Rule 1: Pages 1–3 are always saved as PNG images (cover, logo, colophon).
  • Rule 2: The final page (page 22) is always saved as a PNG image (back cover).
  • Rule 3: Intermediate pages alternate between text (even pages →.txt via get_text()) and image (odd pages → .png via get_pixmap()).
Text files are written in UTF-8 encoding to ensure correct handling of Turkish characters (ş, ğ, ı, ç, ö, ü). Images are rendered at high resolution using the Matrix scaling functionality of PyMuPDF. A try-except block surrounds each file-level operation, allowing the script to log errors and continue processing remaining books without interruption.

3.2. Stage 2: LLM-Based Translation (translate.py)

The second module reads the extracted .txt files and submits them to a locally hosted LLM server. The system uses LM Studio [3] running on localhost (127.0.0.1:1234) with an OpenAI-compatible API endpoint, which eliminates data privacy concerns associated with cloud-based translation APIs and removes per-token costs for large batch jobs. We used google/gemma-3n-e4b [4] inside LM Studio.
The target languages are read from a lan.txt configuration file, enabling easy addition or removal of languages without modifying the source code. A resume mechanism checks for the existence of output files before processing; already-translated files are marked SKIPPED, allowing interrupted jobs to be restarted without redundant computation. Progress is visualised in the terminal using the tqdm library [5].
The system prompt (or “system instruction”) sent to the model constitutes the most critical engineering decision of this stage. The prompt encodes the following strict constraints:
(a)
Age-appropriate vocabulary: The output must be written for children aged 5–10, using simple and engaging language faithful to the original story.
(b)
PDF artifact repair: Broken lines and incomplete sentences resulting from the PDF extraction step must be rejoined into fluent, complete sentences.
(c)
Voiceover formatting: Detected headings are to be followed by a period and a newline, creating natural pause points for TTS software.
(d)
Proper noun preservation: Turkish proper nouns (e.g., Karamel, Türkiye, Anadolu) must not be translated and must appear verbatim in all target languages.
Pathlib (Path) is used throughout for cross-platform path management, replacing the older os.path module. Regular expressions (re module) extract story numbers from directory names of the form “Hikaye N”.

3.3. Stage 3: Multilingual TTS Synthesis (text2speech.py)

The third module performs text-to-speech synthesis using the Chatterbox multilingual TTS model [6], which supports 23 languages and offers zero-shot voice cloning from a reference audio file.
A key engineering challenge in neural TTS is memory overflow and quality degradation when processing long input sequences. The split_text_smart() function addresses this by chunking text into segments of at most 300 characters, respecting sentence boundaries identified by multi-language punctuation markers. This avoids mid-word or mid-sentence splits that would produce audible artefacts.
Hardware acceleration is handled dynamically: the script detects CUDA-capable NVIDIA GPUs at runtime and switches to GPU inference accordingly; otherwise it falls back to CPU inference. Audio segments are accumulated as NumPy arrays [7], concatenated after all chunks are processed, saved as high-quality WAV, and then transcoded to 192 kbps MP3 using pydub [8]. An existence check on the WAV output file allows interrupted sessions to resume.
A particularly noteworthy feature is the cfg_weight=0.0 parameter setting for cross-lingual voice transfer. Setting this classifier-free guidance weight to zero enables the model to adapt the cloned voice’s prosody to the phonetic patterns of the target language rather than forcing an unnaturally accented rendering—a meaningful quality improvement for non-English outputs.

3.4. Stage 4: HTML5 Flipbook Generation (webbooks.py)

The final module assembles translated text, MP3 audio, and original images into interactive web publications. It uses the StPageFlip JavaScript library [9] to render a realistic, physics-based page-turning effect in any modern browser without plugins.
The module iterates over pages 1–22 for each book-language combination, assigning left (even) and right (odd) CSS classes to recreate the open-book layout. For pages with an associated MP3 file, an invisible <audio> HTML element is embedded; the accompanying main.js file triggers playback automatically on page turn. A clean_text() function scrubs common LLM preamble artefacts—phrases such as “Here is the translation:” or “Translated text:”—that may appear at the head of model outputs. All user-facing text is sanitised with html.escape() to prevent injection of unintended HTML markup. Output files are written to a hierarchy of web/[Language]/[BookNumber]/index.html, making the entire collection deployable to any static web host.

4. Results

4.1. Code and Data Availability

The fully translated interactive storybooks generated by this pipeline can be accessed at https://storiesofturkiye.github.io/. The complete Python source code for the four-stage automated generation pipeline is open-source and available on GitHub at https://github.com/storiesofturkiye/Karamel.

4.2. Coverage

The pipeline successfully processed all 53 source storybooks across 34 target languages, yielding 1,802 individual book-language combinations. Of these, 795 ( 15 languages × 53 books ) include synchronised MP3 audio narration; the remaining 1,007 are text-and-image publications.

4.3. Translation Quality Observations

Qualitative review of translated texts across Arabic, German, Russian, and Spanish samples indicated that the LLM consistently preserved proper nouns and produced age-appropriate vocabulary. The most common failure modes were (i) occasional inclusion of meta-commentary preambles (mitigated by clean_text()), and (ii) inconsistent heading detection across languages with different typographic conventions. The voiceover-formatting constraint (appending periods to headings) proved effective for pause induction in TTS outputs.

4.4. TTS Audio Quality

Informal listening tests conducted across Arabic, German, Russian, and Spanish outputs indicated intelligible and naturally paced narration in all tested languages. Cross-lingual voice cloning with cfg_weight=0.0 was qualitatively rated as more natural than higher cfg_weight values, consistent with findings reported in the Chatterbox technical documentation. Languages not natively supported by Chatterbox at the time of production (e.g., Kyrgyz, Pashto, Turkmen) were excluded from audio output pending model updates.

5. Discussion

5.1. Scalability

The modular design of the pipeline means that adding a new target language requires only two changes: appending the language name to lan.txt and adding the corresponding language code to the LANG_MAP dictionary. Similarly, new storybooks can be introduced by placing additional PDFs in the source directory. The resume mechanism in both translation and TTS stages ensures that incremental additions do not trigger full reprocessing of existing outputs.

5.2. Privacy and Cost

The choice of a locally hosted LLM—as opposed to a commercial cloud API such as GPT-4 or Google Translate—was deliberate. It eliminates recurring per-token costs for batch translation of 53 × 9 text pages × 34 languages 16 , 000 translation units , and ensures that no copyrighted storybook content is transmitted to third-party servers. This consideration is particularly relevant for educational materials produced by a governmental organisation.

5.3. Limitations

The current pipeline has several limitations. First, TTS audio is currently unavailable for 19 of the 34 target languages due to model coverage constraints; this gap is expected to narrow as multilingual TTS research progresses. Second, translation quality has not been evaluated through formal human assessment or automated metrics such as BLEU [10] or COMET [11]; such evaluation is planned as future work. Third, the HTML flipbook output does not currently include accessibility features such as screen-reader compatibility or keyboard navigation, limiting usability for children with visual impairments.

6. Conclusion

We have presented a fully automated, four-stage pipeline that converts 53 illustrated PDF storybooks into 34-language, AI-narrated interactive digital publications. The system exploits the deterministic page structure of the source corpus for reliable decomposition, uses a locally hosted LLM with carefully engineered prompts for child-appropriate translation, applies smart chunking and zero-shot voice cloning for multilingual narration, and generates HTML5 flipbook publications with synchronised audio. The resulting platform serves the educational diplomacy objectives of Türkiye’s overseas representation network, making culturally rich content accessible to young learners in their native languages.
Future work will focus on:
(i)
expanding TTS coverage to the remaining 19 text-only languages;
(ii)
conducting formal human evaluation of translation and audio quality;
(iii)
integrating accessibility standards (WCAG 2.1); and
(iv)
exploring fine-tuning of translation models on the specific domain of Turkish cultural children’s literature to further improve output fidelity.

Acknowledgments

The author thank the Houston Education Attaché Office for providing the source storybook corpus and for their commitment to multilingual educational outreach.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Korat, O.; Shamir, A. Do Hebrew electronic books differ from Dutch electronic books? A replication of a Dutch content analysis. J. Comput. Assist. Learn. 2004, 20, 257–268. [Google Scholar] [CrossRef]
  2. Kılıçlıoğlu, A.; Acar, E.; Doğan, C.; Bişgen, E.; Karasakal, C.; Konukseven, H.; Yirmibeş, S.K.; Ballı, N.; Begenjov, S. Stories of Türkiye. 2026. Available online: https://www.storiesofturkiye.com/.
  3. Team, L.S. LM Studio. 2024. Available online: https://github.com/lmstudio-ai.
  4. Team, G. Gemma 3 Technical Report. arXiv 2025, arXiv:2503.19786. [Google Scholar] [CrossRef]
  5. Da Costa-Luis, C. tqdm: A Fast, Extensible Progress Meter for Python and CLI. J. Open Source Softw. 2019, 4, 1277. [Google Scholar] [CrossRef]
  6. AI, R. Chatterbox: Open Source Text-to-Speech Model. 2025. Available online: https://huggingface.co/ResembleAI/chatterbox (accessed on 2026-05-02).
  7. Harris, C.R.; Millman, K.J. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef] [PubMed]
  8. Robert, J. Pydub. 2018. [Google Scholar]
  9. Nodlik. StPageFlip - Simple library for creating realistic page turning. 2021. Available online: https://nodlik.github.io/StPageFlip/ (accessed on 02.05.2026).
  10. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002; pp. 311–318. [Google Scholar]
  11. Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A neural framework for MT evaluation. In Proceedings of the Proceedings of the 2020 conference on empirical methods in natural language processing (emnlp), 2020; pp. 2685–2702. [Google Scholar]
Figure 1. Overall data flow of the automated localisation pipeline.
Figure 1. Overall data flow of the automated localisation pipeline.
Preprints 211821 g001
Table 1. Source Corpus Statistics
Table 1. Source Corpus Statistics
Parameter Value
Total Books 53
Pages per Book 22
Total Source Pages 1,166
Text Pages per Book 9 (p. 4, 6, 8, 10, 12, 14, 16, 18, 20)
Image Pages per Book 13
Source Language English
Target Languages 34
Total Book-Language Combinations 1,802
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated