Submitted: 30 March 2026
Posted: 30 March 2026
Abstract

Keywords:
1. Introduction
2. Related Work
2.1. Document-Level Translation and Discourse Coherence
2.2. Terminology Control in Machine Translation
2.3. Human-in-the-Loop Translation and Post-Editing
2.4. Translation Memory, CAT, and LLM-Augmented Translation
2.5. Evaluation and Reproducibility Considerations
3. Proposed Method: Language Twin Architecture
3.1. Overview
3.2. Formal Definition of the Language Twin
- L0 (Source Layer): Original source text with document structure, paragraph boundaries, section headers, and document metadata.
- L1 (Alignment Layer): Segment and sub-segment source–target alignments, TM links, fuzzy-match scores, and anchors into external assets.
- L2 (Terminology Layer): Approved term pairs, named-entity rules, numerical and unit conventions, domain constraints, and provenance of each terminology decision.
- L3 (Discourse Layer): Topic segmentation, document summaries, coreference cues, audience and purpose metadata, and section-level coherence notes.
- L4 (Variant Layer): Multiple target variants tagged by purpose or audience, such as literal, publication-ready, simplified, or domain-expert versions.
- L5 (Human Edit Layer): Translator and reviewer corrections, approvals, rejections, rationale, timestamps, and disagreement traces.
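
The six layers listed above can be pictured as fields of a single shared record. The following is a minimal sketch, not the authors' implementation: the class names (`LanguageTwin`, `TermEntry`, `HumanEdit`), the field names, and the placeholder strings are all hypothetical and chosen only to mirror the layer descriptions.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """One approved term pair in the terminology layer (L2)."""
    source: str
    target: str
    provenance: str  # hypothetical field: who or what approved the pair


@dataclass
class HumanEdit:
    """One correction recorded in the human edit layer (L5)."""
    segment_id: int
    before: str
    after: str
    rationale: str = ""


@dataclass
class LanguageTwin:
    """Hypothetical container with one field per layer L0-L5."""
    source_segments: list = field(default_factory=list)  # L0: source text
    alignments: dict = field(default_factory=dict)       # L1: segment id -> target text
    terminology: dict = field(default_factory=dict)      # L2: source term -> TermEntry
    discourse_notes: dict = field(default_factory=dict)  # L3: section -> coherence note
    variants: dict = field(default_factory=dict)         # L4: variant tag -> {segment id: text}
    edits: list = field(default_factory=list)            # L5: HumanEdit records

    def approve_term(self, source: str, target: str, provenance: str) -> None:
        """Store an approved term pair together with its provenance (L2)."""
        self.terminology[source] = TermEntry(source, target, provenance)

    def record_edit(self, edit: HumanEdit) -> None:
        """Append a reviewer correction to the edit history (L5)."""
        self.edits.append(edit)


# Placeholder strings only; no real project data is implied.
twin = LanguageTwin(source_segments=["<source segment 0>"])
twin.approve_term("<source term>", "<approved target term>", provenance="reviewer-1")
twin.record_edit(HumanEdit(0, "<draft translation>", "<corrected translation>",
                           rationale="preferred term"))
```

Keeping provenance next to each term pair and rationale next to each edit reflects the layer descriptions, which ask for the origin of every terminology decision and the reasoning behind every correction.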
3.3. Central Language Hub
3.4. Lazy Contextual Loading
3.5. Document Translation Environment
3.6. Human-in-the-Loop Update Mechanism
3.7. Scope Control, Conflict Resolution, and Rollback
4. Experimental Design
4.1. Research Questions
- H1: P1 will outperform B1–B3, achieving higher preferred-term accuracy and a lower repeated-error rate.
- H2: P1 will outperform A3 (propagation disabled) after the warm-up stage by reducing downstream recurrence of corrected error types.
- H3: P1 will achieve comparable quality to A4 (full-context flooding) while using substantially fewer prompt tokens.
4.2. Data, Language Pair, and Translation Backbone
4.3. Baselines and Proposed System
- B1 (Sentence-level MT): Each segment translated independently with no persistent external state. Context used: none.
- B2 (Fixed-context MT): Translation with a fixed local context window comprising the previous two and next two segments. Because B2 accesses future segments, it operates as an offline (non-sequential) document translation baseline rather than a causal online system. This is intentional: B2 represents the scenario in which a translator or system has access to the full surrounding context but lacks any project-level state.
- B3 (Retrieval-augmented MT): Static glossary and top-3 TM examples (retrieved by BM25 from the same TM pool as P1) injected at every step, with ±2 segments of local context (same window as B2), but without state updates or edit propagation. Context used: glossary + TM + local context.
- P1 (Language Twin): Layered shared state with hub versioning, edit propagation, and lazy retrieval. Context used: L0–L5, selectively loaded.

| ID | System | Description | Context Used |
| --- | --- | --- | --- |
| B1 | Sentence-level MT | Each segment translated independently; no persistent shared state | None |
| B2 | Fixed-context MT | Translation with a fixed local context window (offline) | Adjacent segments (±2) |
| B3 | Retrieval-augmented MT | Static glossary and top-3 TM examples injected at every step | Glossary + TM + local context |
| P1 | Language Twin | Layered shared state with hub versioning, edit propagation, and lazy retrieval | L0–L5, selectively loaded |
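
The difference between B1 and B2 comes down to which neighbouring segments are visible when a segment is translated. The sketch below illustrates the ±2 window under stated assumptions: `local_window` and `build_context` are hypothetical helper names, and the glossary and TM retrieval that B3 adds on top of the same window is deliberately left out.

```python
def local_window(segments, i, radius=2):
    """Return (previous, following) neighbours of segment i, clipped at document edges."""
    previous = segments[max(0, i - radius):i]
    following = segments[i + 1:i + 1 + radius]
    return previous, following


def build_context(system, segments, i):
    """Assemble the context visible for segment i under B1 or B2 (hypothetical helper)."""
    if system == "B1":
        # Sentence-level baseline: no surrounding context at all.
        return {"previous": [], "following": []}
    if system == "B2":
        # Fixed window of two segments on each side; reading the following
        # segments is what makes B2 an offline rather than a causal baseline.
        previous, following = local_window(segments, i, radius=2)
        return {"previous": previous, "following": following}
    raise ValueError(f"unsupported system: {system}")


doc = ["s0", "s1", "s2", "s3", "s4", "s5"]
print(build_context("B2", doc, 0))  # first segment: nothing precedes it
print(build_context("B2", doc, 3))  # interior segment: two neighbours on each side
```

Clipping at the document edges means the first and last segments simply receive a shorter window instead of padding.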
4.4. Edit-Propagation Protocol and Ablation Study
- A1: Removes the terminology layer L2.
- A2: Removes the discourse layer L3.
- A3: Disables propagation from the human-edit layer L5 to subsequent segments.
- A4: Replaces lazy retrieval with full-context flooding (all layers loaded in full).
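
Each ablation changes exactly one aspect of P1, which can be made explicit by expressing the conditions as configuration variants. This is an illustrative sketch only: `TwinConfig` and its flag names are assumptions introduced here, not identifiers from the paper.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TwinConfig:
    """Hypothetical switches for the components that A1-A4 vary."""
    use_terminology: bool = True   # terminology layer L2 is available
    use_discourse: bool = True     # discourse layer L3 is available
    propagate_edits: bool = True   # L5 corrections reach subsequent segments
    lazy_loading: bool = True      # False means all layers are loaded in full


P1 = TwinConfig()
ABLATIONS = {
    "A1": replace(P1, use_terminology=False),
    "A2": replace(P1, use_discourse=False),
    "A3": replace(P1, propagate_edits=False),
    "A4": replace(P1, lazy_loading=False),
}


def changed_fields(config, baseline=P1):
    """List the settings in which `config` departs from `baseline`."""
    return [f.name for f in fields(config)
            if getattr(config, f.name) != getattr(baseline, f.name)]


for name, config in ABLATIONS.items():
    print(name, changed_fields(config))
```

Deriving every condition from P1 with `replace` keeps the variants from drifting apart: a new default added to `TwinConfig` is inherited by all four ablations automatically.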
4.5. Runtime Surrogate Quality Score (RQS)
4.6. Recurrence Metric Definition
4.7. Human Evaluation Design
5. Pilot Results
5.1. Main Results
5.2. Ablation Results
5.3. Lazy Loading Analysis
5.4. Human Edit Propagation Analysis
5.5. Human Evaluation Results
5.6. Qualitative Analysis
5.7. Harmful Propagation Audit
6. Discussion
6.1. Scope of Supported Claims
6.2. Architectural Implications
6.3. Comparison with Existing Pipelines
6.4. Limitations
6.5. Future Work
7. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
| --- | --- |
| NMT | Neural Machine Translation |
| LLM | Large Language Model |
| CAT | Computer-Assisted Translation |
| TM | Translation Memory |
| RER | Repeated-Error Recurrence rate |
| BM25 | Best Matching 25 (sparse retrieval scoring function) |
| COMET | Crosslingual Optimized Metric for Evaluation of Translation |
| xCOMET | eXplainable COMET (fine-grained MT evaluation metric) |
| RQS | Runtime Surrogate Quality Score |
| PE | Post-Editing |

| Method | RQS [95% CI] | Term % (k/N) | Entity % (k/N) | Downstream errors (k/N) | Tokens/segment (mean ± SD) | n (segments/documents) |
| --- | --- | --- | --- | --- | --- | --- |
| B1 | 0.767 [0.697, 0.838] | 19.0 (4/21) | 90.0 (9/10) | 6/6† | 85.4 ± 3.5 | 17/3 |
| B2 | 0.907 [0.856, 0.953] | 57.1 (12/21) | 100.0 (10/10) | 2/6 | 122.4 ± 13.8 | 17/3 |
| B3 | 0.935 [0.896, 0.972] | 66.7 (14/21) | 100.0 (10/10) | 4/4 | 111.2 ± 5.4 | 17/3 |
| P1 | 0.965 [0.931, 0.991] | 81.0 (17/21) | 100.0 (10/10) | 0/5 | 174.7 ± 14.4 | 17/3 |
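
The values in the Term column are percentages derived from the k/N counts given in parentheses. As a quick consistency check, the short sketch below recomputes them from the counts in the table; the variable and function names are illustrative only.

```python
# (k, N) preferred-term counts copied from the Term column of the table above.
term_counts = {"B1": (4, 21), "B2": (12, 21), "B3": (14, 21), "P1": (17, 21)}


def percent(k, n):
    """Convert a k-out-of-N count into a percentage with one decimal place."""
    return round(100 * k / n, 1)


term_pct = {method: percent(k, n) for method, (k, n) in term_counts.items()}
print(term_pct)

# Gap between P1 and the strongest baseline, in percentage points.
gap_vs_b3 = round(term_pct["P1"] - term_pct["B3"], 1)
print(gap_vs_b3)
```

With N = 21 terms, each additional correct term moves the score by roughly 4.8 percentage points, so the gap between P1 and B3 corresponds to three terms.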
| Condition | RQS [95% CI] | Term % (k/N) | Downstream errors (k/N) | Tokens/segment (mean ± SD) | n (segments/documents) |
| --- | --- | --- | --- | --- | --- |
| P1 | 0.965 [0.931, 0.991] | 81.0 (17/21) | 0/5 | 174.7 ± 14.4 | 17/3 |
| A1 (−terminology) | 0.908 [0.859, 0.953] | 57.1 (12/21) | 1/6 | 153.3 ± 15.5 | 17/3 |
| A2 (−discourse) | 0.954 [0.921, 0.982] | 81.0 (17/21) | 0/5 | 140.0 ± 5.3 | 17/3 |
| A3 (−propagation) | 0.914 [0.874, 0.953] | 57.1 (12/21) | 5/5 | 163.9 ± 13.6 | 17/3 |
| A4 (full loading) | 0.965 [0.931, 0.991] | 81.0 (17/21) | 0/5 | 287.4 ± 19.4 | 17/3 |
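
H3 concerns the prompt-token cost of lazy loading (P1) relative to full-context flooding (A4). The sketch below works out the relative saving from the mean tokens per segment reported in the table; it uses the point estimates only and ignores the reported standard deviations, and the variable names are illustrative.

```python
# Mean prompt tokens per segment, copied from the ablation table above.
p1_tokens = 174.7   # lazy retrieval
a4_tokens = 287.4   # all layers loaded in full

# Both conditions report the same RQS (0.965), so the comparison is on cost alone.
saving = 1 - p1_tokens / a4_tokens
print(f"Lazy loading uses {saving:.1%} fewer prompt tokens than full loading")
```

Because the pilot covers 17 segments from 3 documents, this figure is best read as an indication of the size of the effect rather than a stable estimate.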
| Method | Adequacy (mean ± SD) | Fluency (mean ± SD) | Terminology (mean ± SD) | Overall (mean ± SD) | PE time, s/segment (mean ± SD) | PE time, s/word (mean ± SD) |
| --- | --- | --- | --- | --- | --- | --- |
| B1 | 3.41 ± 0.50 | 3.53 ± 0.46 | 3.26 ± 0.52 | 3.40 ± 0.28 | 24.8 ± 5.9 | 1.43 ± 0.34 |
| B2 | 3.79 ± 0.42 | 3.88 ± 0.36 | 3.68 ± 0.47 | 3.78 ± 0.22 | 21.4 ± 4.8 | 1.23 ± 0.28 |
| B3 | 4.09 ± 0.33 | 4.12 ± 0.30 | 3.97 ± 0.40 | 4.06 ± 0.18 | 19.1 ± 4.4 | 1.10 ± 0.25 |
| P1 | 4.15 ± 0.31 | 4.10 ± 0.29 | 4.38 ± 0.34 | 4.21 ± 0.15 | 16.9 ± 3.9 | 0.97 ± 0.22 |
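
The post-editing times in the table above can be turned into relative reductions for P1 against each baseline. The following sketch does this from the reported per-segment means; it is a simple ratio of point estimates and does not account for the standard deviations, and the names are illustrative.

```python
# Mean post-editing time in seconds per segment, copied from the table above.
pe_seconds = {"B1": 24.8, "B2": 21.4, "B3": 19.1, "P1": 16.9}


def relative_reduction(baseline, system):
    """Fraction of the baseline's post-editing time that the system saves."""
    return 1 - system / baseline


reductions = {
    name: relative_reduction(seconds, pe_seconds["P1"])
    for name, seconds in pe_seconds.items()
    if name != "P1"
}
for name, value in reductions.items():
    print(f"P1 vs {name}: {value:.1%} less post-editing time per segment")
```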
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).