Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Vocational Domain Identification with Machine Learning and Natural Language Processing on Wikipedia Text: Error Analysis and Class Balancing

Version 1: Received: 24 April 2023 / Approved: 25 April 2023 / Online: 25 April 2023 (09:29:07 CEST)

A peer-reviewed article of this Preprint also exists.

Nikiforos, M.N.; Deliveri, K.; Kermanidis, K.L.; Pateli, A. Vocational Domain Identification with Machine Learning and Natural Language Processing on Wikipedia Text: Error Analysis and Class Balancing. Computers 2023, 12, 111.

Abstract

Highly skilled migrants and refugees increasingly find employment in low-skill vocations despite their professional qualifications and educational background, a global tendency driven mainly by the language barrier. Employment prospects for displaced communities depend largely on their knowledge of the sublanguage of the vocational domain in which they are interested in working. Common vocational domains include agriculture, cooking, crafting, construction, and hospitality. The increasing amount of user-generated content in wikis and social networks provides a valuable source of data for data mining, Natural Language Processing, and machine learning applications. This paper extends the authors’ previous research on automatic vocational domain identification by further analyzing the results of the machine learning experiments on the domain-specific textual data set along two research directions: (a) prediction analysis and (b) data balancing. Analysis of the wrong predictions and the features that contributed to misclassification, together with analysis of the correct predictions and their most dominant features, led to the identification of a primary set of terms for the vocational domains. Data balancing techniques were applied to the data set to observe their impact on the performance of the classification model. This paper proposes a novel 4-step methodology consisting of successive applications of SMOTE oversampling on imbalanced data. On imbalanced data sets, data oversampling obtains better results than data undersampling, while hybrid approaches perform reasonably well.

Keywords

Natural Language Processing; Social Text Mining; Machine Learning; Vocational Domain Identification; Vocational Language; Error Analysis; Class Balancing

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
