Submitted: 07 August 2025
Posted: 07 August 2025
Abstract
Keywords:
1. Introduction
2. Methodology
2.1. Tokenization Strategies
2.1.1. Character-Level Tokenization

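As a minimal illustration (a sketch, not the paper's implementation), character-level tokenization treats each Unicode code point as one token, so the vocabulary is simply the set of distinct characters seen during training:

```python
def char_tokenize(text: str) -> list[str]:
    # Split into individual Unicode code points; the vocabulary is the
    # set of distinct characters observed in the training data.
    return list(text)

tokens = char_tokenize("sold")
```

Note that code points are not always whole visual characters: a base letter followed by a combining mark (common in Sinhala script) is split into two tokens under this scheme, which motivates the grapheme-cluster variant below.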
2.1.2. Grapheme Cluster Tokenization

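A simplified sketch of grapheme-cluster tokenization, assuming combining marks (Unicode categories Mn/Mc/Me) are attached to the preceding base character; full UAX #29 segmentation (ZWJ sequences, Hangul jamo, emoji) is not handled here:

```python
import unicodedata

def grapheme_tokenize(text: str) -> list[str]:
    # Attach combining marks to the preceding base character, so a base
    # letter plus its vowel sign forms a single token. This is a
    # simplification of full Unicode grapheme-cluster segmentation.
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

For scripts such as Sinhala, where vowel signs are combining marks, this keeps each visual character as one token rather than splitting it across two.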
2.1.3. Byte-Level Tokenization

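Byte-level tokenization can be sketched as follows (an illustration, not the paper's code): the text is encoded as UTF-8 and each byte becomes a token, giving a fixed vocabulary of 256 symbols that can represent any input, at the cost of longer sequences for non-ASCII scripts:

```python
def byte_tokenize(text: str) -> list[int]:
    # UTF-8 bytes give a fixed 256-symbol vocabulary; every input is
    # representable, but non-ASCII characters expand to multiple tokens.
    return list(text.encode("utf-8"))
```

A single Sinhala letter occupies three bytes in UTF-8, so byte-level sequences for Sinhala text are roughly three times longer than character-level ones.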
2.1.4. Subword-Based Tokenization (WordPiece)

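The WordPiece segmentation used in BERT-style models can be sketched as greedy longest-match-first matching against a learned vocabulary, with non-initial pieces carrying a `##` continuation prefix (a minimal illustration with a toy vocabulary; the function and vocabulary are not from the paper):

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    # Greedy longest-match-first segmentation: repeatedly take the
    # longest vocabulary entry matching at the current position;
    # non-initial pieces are prefixed with "##".
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid segmentation for this word
        pieces.append(piece)
        start = end
    return pieces
```

For example, with the toy vocabulary `{"un", "##aff", "##able"}`, the word "unaffable" segments into `["un", "##aff", "##able"]`.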
2.1.5. Word-Based Tokenization
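Word-based tokenization can be sketched as a whitespace split with an out-of-vocabulary fallback token (an illustration; the token name `<unk>` is an assumption, not from the paper):

```python
def word_tokenize(text: str, vocab: set[str], unk: str = "<unk>") -> list[str]:
    # Whitespace split with an out-of-vocabulary fallback: any word not
    # seen during training collapses to a single <unk> token, which is
    # the main weakness of word-level models under typo noise.
    return [w if w in vocab else unk for w in text.split()]
```

A single typo turns a known word into `<unk>`, discarding all of its content, which is why word-level models are expected to degrade fastest under character-level perturbation.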
3. Model Architecture
4. Experiment Setup
4.1. Datasets and Data Augmentation
- Minor Typos: Each character in the original text has a 5% probability of being randomly deleted, replaced, or swapped with an adjacent character.
- Aggressive Typos: The character-level error probability is increased to 10%.
- Code-Mixing and Typos: This variant combines minor typos (5% character error rate) with code-mixing, where each word has a 30% probability of being transcribed into its romanized form ("Singlish").
- Original:
- Aggressive Typos:
- Code-Mixing:
4.2. Model Training and Hyperparameters
5. Results and Discussion
Discussion
Performance vs. Robustness
Performance Stability Across Tasks
The Efficiency-Robustness Trade-Off
6. Conclusions
References
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; 2017; pp. 5998–6008. [Google Scholar]
- Ranasinghe, T.; Anuradha, I.; Premasiri, D.; et al. SOLD: Sinhala offensive language dataset. Lang Resources & Evaluation 2025, 59, 297–337. [Google Scholar] [CrossRef]


| Hyperparameter | Value |
|---|---|
| Hidden Size | 512 |
| FFN Intermediate Size | 512 |
| Number of Attention Heads | 8 |
| Dropout Probability | 0.2 |
| Learning Rate | |
| Max Sequence Length | 512 |
| Optimizer | AdamW |
| Weight Decay | 0 |
| Dataset | Byte | Char | GCT | Word | WPE |
|---|---|---|---|---|---|
| Original | 0.6580 | 0.6671 | 0.7073 | 0.7274 | 0.7100 |
| Minor Typos | 0.6566 | 0.6651 | 0.7007 | 0.7044 | 0.7000 |
| Aggressive Typos | 0.6604 | 0.6785 | 0.6987 | 0.6884 | 0.7000 |
| Code-Mixing and Typos | 0.6468 | 0.6526 | 0.6528 | 0.6779 | 0.6776 |
| Std. Dev. (across datasets) | 0.0060 | 0.0106 | 0.0250 | 0.0215 | 0.0142 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).