Submitted: 14 August 2025
Posted: 15 August 2025
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Data Description and Preprocessing
2.1.1. Dataset
2.1.2. Preprocessing
- Encoding normalization: Converted every file to UTF-8.
- Lowercasing: Transformed all characters to lowercase to reduce sparsity.
- Cleaning: Stripped non-textual elements (headers, footers, page numbers).
- Punctuation removal: Removed non-alphanumeric symbols while preserving sentence delimiters.
- Tokenization: Split text into tokens using whitespace and punctuation rules.
- Vectorization: Constructed TF–IDF representations of each grade corpus. For each token $t$ in document $d$, the TF–IDF weight was computed as:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

where the term frequency (TF) is defined as:

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

and the inverse document frequency (IDF) is given by:

$$\mathrm{idf}(t) = \log \frac{N}{n_t}$$

Here, $f_{t,d}$ is the frequency of term $t$ in document $d$; $\sum_{t' \in d} f_{t',d}$ is the total number of terms in $d$; $N$ is the total number of documents; and $n_t$ is the number of documents containing $t$.
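The preprocessing and vectorization steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation; the function names and the toy sentences are our own, and the TF and IDF definitions follow the formulas given above.

```python
import math
import re
from collections import Counter

def preprocess(text: str) -> list[str]:
    # Lowercase, remove non-alphanumeric symbols, tokenize on whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """TF-IDF per document: tf(t,d) = f_td / |d|, idf(t) = log(N / n_t)."""
    n_docs = len(docs)
    df = Counter()                      # n_t: documents containing term t
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)        # f_td: raw term frequencies
        total = len(tokens)             # |d|: total terms in the document
        vectors.append({
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return vectors
```

A term that occurs in every grade corpus gets IDF 0 and thus carries no weight, which is the intended behavior of the scheme.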
2.2. TF–IDF + Linear Regression Algorithm
- Regression model formulation: For each existing vector $x_i$, where $i = 1, \dots, n$, we assume a linear relationship of the form:

$$y \approx w_i x_i$$

where $w_i$ is a scalar coefficient representing how closely $x_i$ aligns with the new vector $y$.
- Compute optimal scalar $w_i$: The optimal scalar is computed as:

$$w_i = \frac{x_i \cdot y}{\|x_i\|^2}$$

where:
- $x_i \cdot y$ is the dot product between the existing and new vectors,
- $\|x_i\|^2$ is the squared norm of the existing vector.
- Residual error $e_i$: The squared residual error for each $i$ is defined as:

$$e_i = \|y - w_i x_i\|^2$$

Expanding and substituting the optimal $w_i$, we obtain:

$$e_i = \|y\|^2 - \frac{(x_i \cdot y)^2}{\|x_i\|^2}$$
- Assign class: Each existing vector $x_i$ has an associated class label $c_i$, where $c_i \in \{5, 6, \dots, 11\}$, representing the school grade level of the document.
- Select best match: We identify the index $i^*$ corresponding to the minimum residual error:

$$i^* = \arg\min_i e_i$$
- Final classification step: We determine the class $c_{i^*}$ of the vector corresponding to $i^*$, and this class will be the most similar to the given text:

$$\hat{c} = c_{i^*}$$
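The steps above can be condensed into a short function. This is a sketch under our own naming (`lr_classify`, `corpus` as a list of (vector, grade) pairs), not the study's code; it uses the closed-form scalar $w_i$ and the expanded residual derived above.

```python
import math

def lr_classify(y: list[float], corpus: list[tuple[list[float], int]]) -> int:
    """For each (x_i, c_i), fit y ~ w_i * x_i by least squares and return
    the class whose residual ||y - w_i x_i||^2 is smallest."""
    best_class, best_err = None, math.inf
    y_sq = sum(v * v for v in y)                      # ||y||^2
    for x, label in corpus:
        dot = sum(a * b for a, b in zip(x, y))        # x_i . y
        x_sq = sum(v * v for v in x)                  # ||x_i||^2
        # With the optimal w_i = (x_i . y) / ||x_i||^2, the residual
        # simplifies to e_i = ||y||^2 - (x_i . y)^2 / ||x_i||^2.
        err = y_sq - (dot * dot) / x_sq
        if err < best_err:
            best_err, best_class = err, label
    return best_class
```

Note that $\|y\|^2$ is constant across candidates, so minimizing $e_i$ is equivalent to maximizing $(x_i \cdot y)^2 / \|x_i\|^2$.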
2.3. TF–IDF + K-Nearest Neighbors Algorithm
- Determine K: Since we have a corpus of 96 school textbooks, we compute:

$$K = \lfloor \sqrt{96} \rfloor = 9$$
- Compute similarity (Euclidean distance): For each $i$, compute the distance between $y$ and $x_i$ using the following formula:

$$d(y, x_i) = \sqrt{\sum_{j=1}^{m} (y_j - x_{i,j})^2}$$

where $y_j$ and $x_{i,j}$ are the j-th components of the respective vectors.
- Assign class: Each vector $x_i$ is associated with a class label $c_i$, where $c_i \in \{5, 6, \dots, 11\}$, representing the school grades.
- Sorting step: Sort the computed distances in ascending order:

$$d_{(1)} \le d_{(2)} \le \dots \le d_{(96)}$$

Then select the top $K = 9$ vectors.
- Select top K neighbors: The 9 nearest vectors are selected, and their corresponding class labels are denoted $c_{(1)}, \dots, c_{(9)}$, where each $c_{(k)} \in \{5, 6, \dots, 11\}$.
- Final classification step: Identify the most frequently occurring class among the 9 selected vectors. If several classes are tied for the highest frequency, the class whose corresponding vector is closest to the vector of the given text is selected. We consider that the given text belongs to that class.
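The KNN procedure above can be sketched as follows. This is our own minimal illustration (names such as `knn_classify` are assumptions, and the tie-break uses the closest tied neighbor, matching the rule described above):

```python
import math
from collections import Counter

def knn_classify(y: list[float],
                 corpus: list[tuple[list[float], int]],
                 k: int = 9) -> int:
    """corpus: list of (vector, grade). Euclidean distance, majority vote
    among the k nearest; ties broken by the tied class whose member is
    closest to y."""
    dists = sorted(
        (math.dist(y, x), label) for x, label in corpus
    )[:k]                                    # k nearest neighbors
    votes = Counter(label for _, label in dists)
    top = max(votes.values())
    tied = {label for label, n in votes.items() if n == top}
    # dists is sorted ascending, so the first tied label seen belongs
    # to the vector closest to y.
    for _, label in dists:
        if label in tied:
            return label
```

`math.dist` (Python 3.8+) computes the Euclidean distance defined earlier.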
2.4. TF–IDF + Cosine Similarity Algorithm
- Cosine similarity definition: For each existing vector $x_i$, where $i = 1, \dots, n$, we compute the cosine similarity between the new vector $y$ and $x_i$ as:

$$\cos(y, x_i) = \frac{x_i \cdot y}{\|x_i\| \, \|y\|}$$

where:
- $x_i \cdot y$ is the dot product between the two vectors,
- $\|x_i\|$ and $\|y\|$ are their Euclidean norms.
- Similarity interpretation: Cosine similarity provides a normalized measure of directional alignment between two TF–IDF vectors. A value closer to 1 indicates stronger textual similarity. In our case, this measure is used to compute the similarity between the new document vector $y$ and each of the existing vectors $x_i$. The vector with the highest similarity value, ideally approaching 1, is considered the best match and determines the classification outcome.
- Assign class: Each vector $x_i$ is labeled with a class $c_i$, where $c_i \in \{5, 6, \dots, 11\}$, corresponding to school grade levels.
- Final classification step: Let

$$i^* = \arg\max_i \cos(y, x_i)$$

be the index of the most similar vector. The class of the new document is assigned based on that of the most similar existing document:

$$\hat{c} = c_{i^*}$$

If multiple vectors attain the same maximum similarity, all of their classes are taken as the final result.
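As a minimal sketch (our own names and data layout, not the authors' code), the cosine-similarity classifier can be written so that all classes tied at the maximum similarity are reported, as the final step above specifies:

```python
import math

def cosine_classify(y: list[float],
                    corpus: list[tuple[list[float], int]]) -> list[int]:
    """corpus: list of (vector, grade). Return the grade(s) of the most
    similar existing vector(s) under cosine similarity."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(u * v for u, v in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))   # (a.b)/(||a|| ||b||)

    sims = [(cos(x, y), label) for x, label in corpus]
    best = max(s for s, _ in sims)
    # Keep every class that attains the maximum (ties are all reported).
    return sorted({label for s, label in sims if s == best})
```

`math.hypot(*a)` computes the Euclidean norm of the vector; exact float equality in the tie check mirrors the "same maximum similarity result" condition.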
3. Results
| № | File Name | Grade | Source Type | Number of Textbooks | # Tokens | # Unique Words |
|---|---|---|---|---|---|---|
| 1 | 5_merged.txt | 5 | Internal | 13 | 268 189 | 46 791 |
| 2 | 6_merged.txt | 6 | Internal | 12 | 253 608 | 45 740 |
| 3 | 7_merged.txt | 7 | Internal | 15 | 386 479 | 57 387 |
| 4 | 8_merged.txt | 8 | Internal | 15 | 403 241 | 58 630 |
| 5 | 9_merged.txt | 9 | Internal | 11 | 275 343 | 47 407 |
| 6 | 10_merged.txt | 10 | Internal | 15 | 365 454 | 56 396 |
| 7 | 11_merged.txt | 11 | Internal | 15 | 355 897 | 44 864 |
| Total | | | | 96 | 2 303 150 | 221 036 |
| # | File Name | Source Description |
|---|---|---|
| 1 | garri.txt | Uzbek translation of “Harry Potter” novel by J.K. Rowling |
| 2 | shaytanat.txt | Crime novel “Shaytanat” by Tohir Malik |
| 3 | kuhnadunyo.txt | “Kuhna dunyo”, a novel by Odil Yoqubov |
| 4 | xorazm.txt | “Xorazm tarixi” by Bayoniy, a historical chronicle |
| 5 | attor.txt | “Mantiq-ut-Tayr” by Fariduddin Attar |
| 6 | mehrob.txt | “Mehrobdan chayon”, a novel by Abdulla Qodiriy |
| 7 | avazxon.txt | “Avazxon”, an Uzbek folk epic |
| № | File name | Source | TF-IDF + LR | TF-IDF + KNN | TF-IDF + CS |
|---|---|---|---|---|---|
| 1 | garri.txt | external | 5 | 7 | 5 |
| 2 | shaytanat.txt | external | 7 | 10 | 7 |
| 3 | kuhnadunyo.txt | external | 9 | 7 | 9 |
| 4 | xorazm.txt | external | 10 | 8 | 10 |
| 5 | attor.txt | external | 11 | 10 | 11 |
| 6 | avazxon.txt | external | 11 | 10 | 11 |
| 7 | temuriy.txt | external | 7 | 7 | 7 |
| # | File Name | Grade | Source Description |
|---|---|---|---|
| 1 | dunyo.txt | 5 | “Dunyoning ishlari”, novel, by O’tkir Hoshimov |
| 2 | ezop.txt | 5 | Ezop fables |
| 3 | guliston.txt | 5 | “Guliston”, by Sa’diy Sheroziy |
| 4 | hadislar.txt | 5 | Hadiths of Imom Buxoriy |
| 5 | hellados.txt | 5 | “Hellados”, story by Nodar Dumbadze |
| 6 | mahbub.txt | 6 | “Mahbub ul-qulub”, by Alisher Navoiy |
| 7 | muzqaymoq.txt | 6 | “Muzqaymoq” story, by Odil Yoqubov |
| 8 | nasihatlar.txt | 6 | Nasihatlar, by Abay |
| 9 | shumbola.txt | 6 | “Shum bola” by G’afur Gulom |
| 10 | yulduz.txt | 6 | “Yulduzli tunlar”, by P. Qodirov |
| 11 | mehrob.txt | 7 | “Mehrobdan chayon”, by Abdulla Qodiriy |
| 12 | memor.txt | 7 | “Me’mor” novel by Mirmuhsin |
| 13 | oq_kema.txt | 7 | “Oq kema” asari, by Chingiz Aytmatov |
| 14 | qiyomat.txt | 7 | “Qiyomat qarz” novel, by O’lmas Umarbekov |
| 15 | ravshan.txt | 7 | “Ravshan” epic, folklore |
| 16 | chinor.txt | 8 | “Chinor” novel, Asqad Muxtor |
| 17 | kuntugmish.txt | 8 | “Kuntug’mish” epic, folklore |
| 18 | lutfiy.txt | 8 | “G’azallar”, by Lutfiy |
| 19 | nilvarim.txt | 8 | “Nil va Rim”, prose, by Usmon Nosir |
| 20 | qochoq.txt | 8 | “Qochoq”, novel, by Said Ahmad |
| 21 | asr.txt | 9 | “Asrga tatigulik kun”, Chingiz Aytmatov |
| 22 | farhodshirin.txt | 9 | “Farhod va Shirin” epic, Alisher Navoiy |
| 23 | navoiy.txt | 9 | “Navoiy” excerpt, Oybek |
| 24 | ulugbek.txt | 9 | “Ulug’bek xazinasi” novel by Odil Yoqubov |
| 25 | xoja.txt | 9 | Xoja stories |
| 26 | atoyi.txt | 10 | Atoiy ghazals |
| 27 | bobur.txt | 10 | “Boburnoma”,by Zahiriddin Muhammad Bobur |
| 28 | hayot.txt | 10 | “Hayotga muxabbat”, by Jack London |
| 29 | ikkieshik.txt | 10 | “Ikki eshik orasi” novel,by O’tkir Hoshimov |
| 30 | rustamxon.txt | 10 | “Rustamxon” epic, folklore |
| 31 | kechakunduz.txt | 11 | “Kecha va kunduz” by Abdulhamid Cho’lpon |
| 32 | mashrab.txt | 11 | Ghazals, Boborahim Mashrab |
| 33 | qutadgu.txt | 11 | “Qutadg’u bilig” epic, Yusuf Xos Hojib |
| 34 | rabguziy.txt | 11 | “Qisasi Rabg’uziy”, Nosiriddin Rabg’uziy |
| 35 | choliqushi.txt | 11 | “Choliqushi” novel by Rashod Nuri Guntekin |
4. Discussion
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| NLP | Natural Language Processing |
| TF-IDF | Term Frequency–Inverse Document Frequency |
| KNN | K-Nearest Neighbors |
| LR | Linear Regression |
| CS | Cosine Similarity |
| UPSC | Uzbek Primary School Corpus |
| GPU | Graphics Processing Unit |
References
- Deng, X.; Li, Y.; Weng, J.; Zhang, J. Feature Selection for Text Classification: A Review. Multimedia Tools and Applications 2019, 78, 3797–3816. [Google Scholar] [CrossRef]
- Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text Classification Algorithms: A Survey. Information 2019, 10, 150. [Google Scholar] [CrossRef]
- Page, E.B. Project Essay Grade: PEG. In Automated Essay Scoring: A Cross-Disciplinary Perspective; Shermis, M.D., Burstein, J., Eds.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2003; pp. 43–54. [Google Scholar]
- Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent Trends in Deep Learning Based Natural Language Processing. IEEE Computational Intelligence Magazine 2018, 13, 55–75. [Google Scholar] [CrossRef]
- Pires, T.; Schlinger, E.; Garrette, D. How Multilingual is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4996–5001. [Google Scholar]
- Schwenk, H.; Li, X. A Corpus for Multilingual Document Classification in Eight Languages. In Proceedings of the Eleventh Language Resources and Evaluation Conference (LREC), Miyazaki, Japan, 7–12 May 2018; pp. 3548–3551. [Google Scholar]
- Madatov, K.A.; Bekchanov, S.K. The Algorithm of Uzbek Text Summarizer. In Proceedings of the International Conference of Young Specialists on Micro/Nanotechnologies and Electron Devices (EDM), Erlagol, Russia, 30 June–4 July 2024; pp. 2430–2433. [Google Scholar] [CrossRef]
- Madatov, K.; Sattarova, S.; Vičič, J. Dataset of Vocabulary in Uzbek Primary Education: Extraction and Analysis in Case of the School Corpus. Data in Brief 2025, 59, 111349. [Google Scholar] [CrossRef] [PubMed]
- Madatov, K.A.; Sattarova, S. Creation of a Corpus for Determining the Intellectual Potential of Primary School Students. In Proceedings of the International Conference of Young Specialists on Micro/Nanotechnologies and Electron Devices (EDM), Erlagol, Russia, 1–5 July 2024; pp. 2420–2423. [Google Scholar] [CrossRef]
- Matlatipov, S.; Tukeyev, U.; Aripov, M. Towards the Uzbek Language Endings as a Language Resource. In Proceedings of the International Conference on Computational Collective Intelligence; Springer: Cham, Switzerland, 2020; pp. 729–740. [Google Scholar]
- Madatov, K.A.; Khujamov, D.J.; Boltayev, B.R. Creating of the Uzbek WordNet Based on Turkish WordNet. AIP Conference Proceedings 2022. [Google Scholar] [CrossRef]
- Rabbimov, I.M.; Kobilov, S.S. Multi-class Text Classification of Uzbek News Articles Using Machine Learning. Journal of Physics: Conference Series 2020, 1546. [Google Scholar] [CrossRef]
- Kuriyozov, E.; Salaev, U.; Matlatipov, S.; Gómez-Rodríguez, C. Construction and Evaluation of Sentiment Datasets for Low-Resource Languages: The Case of Uzbek. In Proceedings of the Language and Technology Conference; Springer International Publishing: Cham, Switzerland, 2019; pp. 232–243. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- ZiyoUz Library – Digital Collection of Literary Works. Available online: https://n.ziyouz.com/kutubxona/category/1-ziyouz-com-kutubxonasi.
- Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Pearson Education India, 2016. [Google Scholar]
| № | File name | Grade | TF-IDF + LR | TF-IDF + KNN | TF-IDF + CS |
|---|---|---|---|---|---|
| 1 | dunyo.txt | 5 | 5 | 7 | 5 |
| 2 | ezop.txt | 5 | 5 | 8 | 5 |
| 3 | guliston.txt | 5 | 10 | 10 | 10 |
| 4 | hadislar.txt | 5 | 5 | 10 | 5 |
| 5 | hellados.txt | 5 | 5 | 10 | 5 |
| 6 | Sariqdev.txt | 6 | 6 | 7 | 6 |
| 7 | muzqaymoq.txt | 6 | 6 | 10 | 6 |
| 8 | nasihatlar.txt | 6 | 5 | 8 | 5 |
| 9 | shumbola.txt | 6 | 5 | 7 | 6 |
| 10 | yulduz.txt | 6 | 6 | 10 | 6 |
| 11 | mehrob.txt | 7 | 7 | 7 | 7 |
| 12 | memor.txt | 7 | 7 | 8 | 7 |
| 13 | oq_kema.txt | 7 | 7 | 7 | 7 |
| 14 | qiyomat.txt | 7 | 7 | 7 | 7 |
| 15 | ravshan.txt | 7 | 7 | 8 | 7 |
| 16 | chinor.txt | 8 | 8 | 7 | 8 |
| 17 | kuntugmish.txt | 8 | 8 | 7 | 8 |
| 18 | lutfiy.txt | 8 | 8 | 7 | 8 |
| 19 | nilvarim.txt | 8 | 8 | 6 | 8 |
| 20 | qochoq.txt | 8 | 8 | 7 | 8 |
| 21 | asr.txt | 9 | 9 | 7 | 9 |
| 22 | farhodshirin.txt | 9 | 9 | 11 | 9 |
| 23 | navoiy.txt | 9 | 9 | 10 | 9 |
| 24 | ulugbek.txt | 9 | 9 | 7 | 9 |
| 25 | xoja.txt | 9 | 11 | 10 | 11 |
| 26 | turdifarogiy | 10 | 10 | 7 | 10 |
| 27 | bobur.txt | 10 | 10 | 7 | 10 |
| 28 | hayot.txt | 10 | 10 | 7 | 10 |
| 29 | ikkieshik.txt | 10 | 10 | 10 | 10 |
| 30 | rustamxon.txt | 10 | 10 | 10 | 10 |
| 31 | kechakunduz.txt | 11 | 11 | 10 | 11 |
| 32 | mashrab.txt | 11 | 10 | 10 | 10 |
| 33 | qutadgu.txt | 11 | 11 | 8 | 11 |
| 34 | rabguziy.txt | 11 | 11 | 10 | 11 |
| 35 | choliqushi.txt | 11 | 5 | 7 | 5 |
| | Accuracy | | 82% | 22% | 85.7% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).