Version 1
Received: 21 April 2022 / Approved: 26 April 2022 / Online: 26 April 2022 (10:30:54 CEST)
How to cite:
Madatov, K.; Bekchanov, S.; Vičič, J. Automatic Detection of Stop Words for Texts in the Uzbek Language. Preprints 2022, 2022040234. https://doi.org/10.20944/preprints202204.0234.v1
APA Style
Madatov, K., Bekchanov, S., & Vičič, J. (2022). Automatic Detection of Stop Words for Texts in the Uzbek Language. Preprints. https://doi.org/10.20944/preprints202204.0234.v1
Chicago/Turabian Style
Madatov, K., Shukurla Bekchanov, and Jernej Vičič. 2022. "Automatic Detection of Stop Words for Texts in the Uzbek Language." Preprints. https://doi.org/10.20944/preprints202204.0234.v1
Abstract
Stop words play an important role in information retrieval and text analysis. This study aims to automatically analyze and detect stop words in Uzbek-language texts. Because few methods exist for automatically finding stop words in Uzbek texts, we analyzed a newly prepared corpus. Uzbek belongs to the family of agglutinative languages. As in all agglutinative languages, detecting stop words in Uzbek texts is more complex than in inflected languages: in inflected languages, auxiliary words, articles, and prepositions can be included in the stop word group, whereas in agglutinative languages the meanings of such words are hidden in the text. It is therefore not appropriate to apply every known stop word detection method for inflected languages directly to agglutinative languages. In this work, we investigated the “School corpus”, which contains 731,156 Uzbek words. We applied the bigram method of analysis to the corpus. We propose a collocation method for detecting stop words in the corpus, and a method for automatically detecting stop words in Uzbek texts. The results show that the collocation method is 6 times better than the bigram method.
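The abstract does not say how the bigram analysis scores words, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes a pre-tokenized text and a simple heuristic: a word that appears next to many different words is a stop word candidate. The function names `bigram_counts` and `stop_word_candidates`, the `top_n` parameter, and the toy token list are all hypothetical.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def stop_word_candidates(tokens, top_n=2):
    """Rank words by the number of distinct neighbours they have.

    Assumed heuristic (not taken from the paper): words that combine
    with many different neighbours carry little content of their own.
    """
    neighbours = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        neighbours[left].add(right)
        neighbours[right].add(left)
    # Sort by descending neighbour count, then alphabetically for stable output.
    ranked = sorted(neighbours, key=lambda w: (-len(neighbours[w]), w))
    return ranked[:top_n]


# Toy token list for illustration only; it is not drawn from the School corpus.
tokens = ["bu", "kitob", "va", "bu", "daftar", "va", "bu", "qalam"]

print(bigram_counts(tokens).most_common(1))
print(stop_word_candidates(tokens, top_n=2))
```

On a real corpus, the raw neighbour count would likely need normalizing by word frequency, and for an agglutinative language the tokens would probably need morphological segmentation first; both steps are outside this sketch.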
Keywords
stop word detection; Uzbek language; agglutinative language; algorithm
Subject
Computer Science and Mathematics, Computer Science
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.