Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text

Abdullah M. Alshanqiti; Sami Albouq; Ahmad B. Alkhodre; Abdallah Namoun; Emad Nabil

doi:10.20944/preprints202208.0451.v1

Preprint

Article

This version is not peer-reviewed.

Submitted: 25 August 2022

Posted: 26 August 2022

A peer-reviewed article of this preprint also exists.

Abstract

Long unpunctuated texts containing complex sentences are a stumbling block for processing low-resource languages. Segmenting lengthy texts that lack proper punctuation into simple candidate sentences is therefore an important preprocessing task in many hard-to-solve NLP applications. In this paper, we propose PDTS, a punctuation detection approach for segmenting Arabic text, built on top of a multilingual BERT-based model and a set of generic linguistic rules. Furthermore, we showcase how PDTS can be employed as a text tokenizer for unpunctuated documents (i.e., documents that mimic transcribed audio-to-text output). Experimental findings across two evaluation protocols (an ablation study and a human-based judgment) indicate that PDTS is practically effective in terms of both performance quality and computational cost.

Keywords: text splitting; text tokenization; transfer learning; mask-fill prediction; NLP linguistic rules; missing punctuations; cross-lingual BERT model; Masked Language Modeling

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
