ARTICLE | doi:10.20944/preprints202208.0451.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics
Keywords: text splitting; text tokenization; transfer learning; mask-fill prediction; NLP linguistic rules; missing punctuation; cross-lingual BERT model; Masked Language Modeling
Online: 26 August 2022 (05:19:39 CEST)
Long unpunctuated texts containing complex linguistic sentences are a stumbling block for processing any low-resource language. Approaches that segment lengthy, improperly punctuated texts into simple candidate sentences are therefore a vitally important preprocessing step for many hard-to-solve NLP applications. In this paper, we propose PDTS, a punctuation detection approach for segmenting Arabic text, built on top of a multilingual BERT-based model and a set of generic linguistic rules. Furthermore, we showcase how PDTS can be effectively employed as a text tokenizer for unpunctuated documents (e.g., documents mimicking transcribed audio-to-text output). Experimental findings across two evaluation protocols (an ablation study and a human-based judgment) demonstrate that PDTS is effective in both prediction quality and computational cost.
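To illustrate the idea behind mask-fill punctuation detection, the sketch below walks each candidate word boundary, imagines a masked token there, and declares a sentence break when sentence-final punctuation is the likely fill. This is not the authors' implementation: the scoring function is a hypothetical stand-in for a multilingual BERT fill-mask model (e.g., one queried via the Hugging Face `fill-mask` pipeline on `bert-base-multilingual-cased`), and its heuristic exists only so the sketch runs without model downloads.

```python
# Illustrative sketch of mask-fill punctuation detection (PDTS-style).
# The scorer below is a toy stand-in for a multilingual BERT fill-mask
# model; the segmentation loop around it is the part being illustrated.

SENT_PUNCT = {".", "?", "!", "؟"}  # Latin and Arabic sentence-final marks

def toy_mask_scores(left: str, right: str) -> dict[str, float]:
    """Stand-in for a fill-mask head: return {token: probability} for the
    masked slot between `left` and `right`. A real system would instead
    score f"{left} [MASK] {right}" with a multilingual BERT model."""
    # Toy heuristic only: favor a sentence break after longer clauses.
    if len(left.split()) >= 4:
        return {".": 0.7, ",": 0.2, "and": 0.1}
    return {"and": 0.6, ",": 0.3, ".": 0.1}

def segment(words: list[str], threshold: float = 0.5) -> list[str]:
    """Split `words` into candidate sentences wherever the masked token
    at a boundary is predicted to be sentence-final punctuation."""
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i + 1 < len(words):  # a boundary exists before the next word
            scores = toy_mask_scores(" ".join(current), words[i + 1])
            punct_prob = sum(p for tok, p in scores.items() if tok in SENT_PUNCT)
            if punct_prob >= threshold:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "the meeting ended early everyone went home afterwards"
print(segment(text.split()))
# → ['the meeting ended early', 'everyone went home afterwards']
```

Swapping `toy_mask_scores` for a genuine BERT fill-mask call would turn this skeleton into a mask-fill segmenter in the spirit of the abstract; the paper's generic linguistic rules would then filter or adjust the model's proposed boundaries.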