ARTICLE | doi:10.20944/preprints201908.0073.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: contextual keyword extraction; BERT; word embedding; LSTM; transformers; Deep Learning
Online: 6 August 2019 (09:17:36 CEST)
In this paper, we propose a novel self-supervised, end-to-end deep learning approach to keyword and keyphrase retrieval and extraction, trained on a contextually self-labelled corpus. Our approach is novel in its use of contextual and semantic features to extract keywords, and it outperforms the state of the art. Experiments show that the proposed approach surpasses existing popular keyword-extraction algorithms in both semantic meaning and quality. In addition, we propose using contextual features from bidirectional transformers to automatically label a short-sentence corpus with keywords and keyphrases to build the ground truth. This process avoids manual labelling effort and requires no prior knowledge. To the best of our knowledge, the dataset published with this paper is the first domain-independent corpus of short sentences with labelled keywords and keyphrases in the NLP community.
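The abstract gives no implementation details, but the core ranking idea — scoring candidate phrases by how similar their contextual embeddings are to the document's embedding — can be sketched in a few lines. This is a toy illustration: the hand-made 3-dimensional vectors, function names, and candidate phrases below are our own stand-ins for real transformer embeddings, not the paper's method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_keyphrases(doc_vec, candidates):
    """Rank candidate phrases by cosine similarity of their embeddings
    to the document embedding (toy stand-in for contextual BERT vectors)."""
    return sorted(candidates, key=lambda c: cosine(candidates[c], doc_vec),
                  reverse=True)

# Toy 3-d "embeddings"; a real system would use transformer outputs.
doc = [0.9, 0.1, 0.2]
cands = {
    "deep learning": [0.8, 0.2, 0.1],
    "the": [0.0, 0.9, 0.4],
    "keyword extraction": [0.85, 0.05, 0.3],
}
print(rank_keyphrases(doc, cands))
```

The top-ranked candidates would become the self-supervised labels; content phrases score high, while function words like "the" fall to the bottom.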
ARTICLE | doi:10.20944/preprints202006.0223.v1
Keywords: BERT; Classification; Mix-Code; Language Model; YouTube; Parametric and Non-Parametric
Online: 17 June 2020 (13:40:22 CEST)
The scope of a lucrative career promoted by Google through its video distribution platform YouTube has attracted a large number of users to become content creators. An important aspect of this line of work is the feedback received in the form of comments, which shows how well the content is being received by the audience. However, the volume of comments, coupled with spam and limited tools for comment classification, makes it virtually impossible for a creator to go through each and every comment and gather constructive feedback. Automatic classification of comments is a challenge even for established classification models, since comments are often of variable length and riddled with slang, symbols, and abbreviations. The challenge is greater where comments are multilingual, as the messages are often rife with the respective vernacular. In this work, we have evaluated top-performing classification models and four different vectorizers for classifying comments that are different combinations of English and Malayalam (only English, only Malayalam, and a mix of English and Malayalam). The statistical analysis of results indicates that Multinomial Naïve Bayes, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest, and Decision Trees offer similar levels of accuracy in comment classification. Further, we have also evaluated three multilingual sub-types of the novel NLP language model BERT and compared their performance to the conventional machine learning classification techniques. XLM was the top-performing BERT model, with an accuracy of 67.31%. Random Forest with the Term Frequency Vectorizer was the top-performing model among the traditional classification models, with an accuracy of 63.59%.
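One of the classical baselines named above, Multinomial Naïve Bayes over term-frequency features, can be sketched in pure Python. The comment data and class names below are invented for illustration; the study's actual dataset, preprocessing, and vectorizer configurations are not described in the abstract.

```python
from collections import Counter
import math

class MultinomialNB:
    """Minimal multinomial Naive Bayes over term-frequency features,
    the kind of classical baseline compared against BERT (toy sketch)."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels))
                      for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            toks = doc.lower().split()
            self.counts[c].update(toks)
            self.vocab.update(toks)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        v = len(self.vocab)
        def log_posterior(c):
            s = self.prior[c]
            for t in doc.lower().split():
                s += math.log((self.counts[c][t] + self.alpha) /
                              (self.totals[c] + self.alpha * v))
            return s
        return max(self.classes, key=log_posterior)

# Invented toy comments standing in for real YouTube feedback.
docs = ["great video loved it", "nice content keep going",
        "spam link click here", "click this link now"]
labels = ["feedback", "feedback", "spam", "spam"]
clf = MultinomialNB().fit(docs, labels)
print(clf.predict("loved the video"))  # → feedback
```

Whitespace tokenization is a deliberate simplification; mixed-script English–Malayalam text would need a more careful tokenizer, which is part of why subword-based models like BERT are attractive here.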
ARTICLE | doi:10.20944/preprints202101.0081.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Transformers; wav2vec; bert; mockingjay; interpretability
Online: 5 January 2021 (11:20:22 CET)
In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models achieving state-of-the-art results by using audio transformer models to encode speech. This begs the question of what these audio transformer models are learning. Moreover, although the standard methodology is to choose the last-layer embedding for any downstream task, is it the optimal choice? We try to answer these questions for two recent audio transformer models, Mockingjay and wav2vec 2.0. We compare them on a comprehensive set of language delivery and structure features, including audio, fluency, and pronunciation features. Additionally, we probe the audio models' understanding of textual surface, syntax, and semantic features and compare them to BERT. We do this over exhaustive settings for native, non-native, synthetic, read, and spontaneous speech datasets.
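The layer-selection question raised above can be illustrated with a minimal probing loop: fit a simple probe on each layer's embeddings and keep the layer the probe separates best, rather than defaulting to the last layer. The nearest-centroid probe and the two synthetic "layers" below are our own stand-ins, not the paper's setup; the authors probe real Mockingjay and wav2vec 2.0 layers with much richer feature sets.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid_accuracy(embeddings, labels):
    """Fit a nearest-centroid probe and report its training accuracy --
    a crude proxy for how separable a layer's embeddings are."""
    by_class = {}
    for vec, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(vec)
    cents = {y: centroid(vs) for y, vs in by_class.items()}
    correct = sum(
        1 for vec, y in zip(embeddings, labels)
        if min(cents, key=lambda c: math.dist(vec, cents[c])) == y
    )
    return correct / len(labels)

# Synthetic per-layer embeddings: layer 1 separates the classes cleanly,
# layer 2 mixes them (stand-ins for real audio-transformer layers).
layers = {
    1: [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]],
    2: [[0.5, 0.5], [0.9, 1.0], [0.1, 0.0], [0.6, 0.5]],
}
labels = ["a", "a", "b", "b"]
best = max(layers, key=lambda l: nearest_centroid_accuracy(layers[l], labels))
print("best layer:", best)
```

A real probing study would use held-out accuracy and a trained linear probe, but the comparison-across-layers structure is the same.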
ARTICLE | doi:10.20944/preprints202208.0233.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Smart Families; Smart Homes; Sustainable Societies; Smart Cities; Deep Learning; Natural Language Processing (NLP); Social Sustainability; Environmental Sustainability; Economic Sustainability; Bidirectional Encoder Representations from Transformers (BERT); Triple Bottom Line (TBL); Internet of Things (IoT)
Online: 12 August 2022 (10:22:17 CEST)
Technological advancements and innovations have profoundly changed the lives of people, giving rise to smart environments, cities, and societies. As homes are the building blocks of cities and societies, smart homes are critical to establishing smart living and are expected to play a key role in enabling smart cities and societies. The current academic literature and commercial advancements on smart homes have mainly focused on developing and providing smart functions for homes, such as security management, and on facilitating residents in their various activities, such as ambiance management. Homes are much more than physical structures, buildings, appliances, operational machines, and systems. Homes are composed of families and are inherently complex phenomena underlined by humans and their relationships with each other, subject to individual, intragroup, intergroup, and intercommunity goals. There is a clear need to understand, define, and consolidate existing research, and to actualize the overarching roles of smart homes that would serve the needs of future smart cities and societies. This paper introduces our data-driven parameter discovery methodology and uses it to provide, for the first time, an extensive and fairly comprehensive analysis of the families and homes landscape as seen through the eyes of academics and the public, using over a hundred thousand research papers and nearly a million tweets. We develop a methodology using deep learning, natural language processing (NLP), and big data analytics methods and apply it to automatically discover parameters that capture a comprehensive knowledge and design space of smart families and homes comprising social, political, economic, environmental, and other dimensions. The 66 discovered parameters and the knowledge space, comprising hundreds of dimensions, are explained by reviewing and referencing over 300 articles from the academic literature and tweets.
The knowledge and parameters discovered in this paper can be used to develop a holistic understanding of matters related to families and homes, facilitating the development of better, community-specific policies, technologies, solutions, and industries for families and homes. This will strengthen families and homes and, in turn, empower sustainable societies across the globe.
ARTICLE | doi:10.20944/preprints202210.0238.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: NLP; NLU; Twitter; Sentiment Analysis; Opinion Mining; Nigeria; Election; Machine Learning; BERT; LSTM; SVM
Online: 17 October 2022 (12:01:42 CEST)
Introduction: Social media platforms such as Facebook, LinkedIn, and Twitter, among others, have been used as tools for staging protests, conducting opinion polls, devising campaign strategy, voicing agitation, and expressing interests, especially during elections. Past studies have established people's opinions on elections using social media posts. The advent of state-of-the-art algorithms for unstructured text processing implies tremendous progress in natural language processing and understanding. Aim: In this work, a natural language framework is designed to understand the Nigeria 2023 presidential election based on public opinion using a Twitter dataset. Methods: A raw dataset of 2,059,113 tweets with 18 dimensions concerning discourse around the Nigeria 2023 elections was collected from Twitter. Sentiment analysis was performed on the preprocessed dataset using three different machine learning models, namely: the Long Short-Term Memory (LSTM) Recurrent Neural Network, Bidirectional Encoder Representations from Transformers (BERT), and Linear Support Vector Classifier (LSVC) models. Personal-tweet analysis of the three candidates provided insight into their campaign strategies and personalities, while public-tweet analysis established the public's opinion about them. The performance of the models was also compared using accuracy, recall, false positive rate, precision, and F-measure. Results: The LSTM model gave an accuracy, precision, recall, AUC, and F-measure of 88%, 82.7%, 87.2%, 87.6%, and 82.9% respectively; the BERT model gave 94%, 88.5%, 92.5%, 94.7%, and 91.7% respectively; while the LSVC model gave 73%, 81.4%, 76.4%, 81.2%, and 79.2% respectively. Conclusion: The experimental results show that sentiment analysis and other natural language processing tasks can aid in the understanding of the social media space. Results also revealed each aspirant's leverage towards winning the election.
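The threshold-based metrics reported above all derive from a binary confusion matrix; the sketch below shows those computations explicitly. The counts are invented for illustration, and AUC is omitted since it requires ranked scores rather than a single confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, false positive rate, and
    F-measure from a binary confusion matrix (counts are hypothetical)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "false_positive_rate": fpr, "f_measure": f1}

# Hypothetical counts for a positive-vs-negative sentiment split.
m = binary_metrics(tp=80, fp=10, fn=10, tn=100)
print(m)
```

For multi-class sentiment (positive/neutral/negative), these quantities would typically be computed per class and macro- or micro-averaged; the abstract does not say which averaging the authors used.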
We conclude that sentiment analysis can form a general basis for generating insights about elections and for modeling election outcomes.
ARTICLE | doi:10.20944/preprints202201.0061.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: BERT; Document Image Classification; EfficientNet; fine-tuned BERT; Hierarchical Attention Networks; Multimodal; RVL-CDIP; Two-stream; Tobacco-3482
Online: 6 January 2022 (10:08:38 CET)
Document classification is one of the most critical steps in the document analysis pipeline. There are two types of approaches for document classification: image-based and multimodal. Image-based document classification approaches rely solely on the inherent visual cues of the document images. In contrast, the multimodal approach co-learns the visual and textual features, and it has proved to be more effective. Nonetheless, these approaches require a huge amount of data. This paper presents a novel approach for document classification that works with a small amount of data and outperforms other approaches. The proposed approach incorporates a hierarchical attention network (HAN) for the textual stream and EfficientNet-B0 for the image stream. The hierarchical attention network in the textual stream uses dynamic word embeddings from fine-tuned BERT and incorporates both word-level and sentence-level features. While earlier approaches rely on training on a large corpus (RVL-CDIP), we show that our approach works with a small amount of data (Tobacco-3482). To this end, we trained the neural network on Tobacco-3482 from scratch, thereby outperforming the state of the art with an accuracy of 90.3%. This corresponds to a relative error reduction rate of 7.9%.
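The two-stream idea — a textual stream and an image stream each producing class scores that are then combined — can be illustrated with a simple late-fusion sketch. This is a deliberate simplification: the paper co-learns features across the streams rather than averaging scores, and the logits and weight below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(text_logits, image_logits, w_text=0.5):
    """Late fusion of two streams' class scores: a weighted average of
    the per-stream softmax distributions (a simplification of the
    paper's co-learned multimodal features)."""
    pt, pi = softmax(text_logits), softmax(image_logits)
    return [w_text * a + (1 - w_text) * b for a, b in zip(pt, pi)]

# Hypothetical 3-class logits from a HAN text stream and an
# EfficientNet-B0 image stream.
fused = fuse([2.0, 0.5, 0.1], [1.5, 1.4, 0.2], w_text=0.6)
pred = max(range(len(fused)), key=fused.__getitem__)
print("predicted class:", pred)
```

Even this crude fusion shows the motivation: when the image stream is ambiguous between two classes, a confident text stream can break the tie.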
ARTICLE | doi:10.20944/preprints202111.0378.v1
Subject: Engineering, Other Keywords: NCM classification; natural language processing; transformers; multilingual BERT; portuguese BERT; NLP; BERT
Online: 22 November 2021 (10:59:43 CET)
The classification of goods involved in international trade in Brazil is based on the Mercosur Common Nomenclature (NCM). Classifying these goods represents a real challenge due to the complexity involved in assigning the correct category codes, especially considering the legal and fiscal implications of misclassification. This work focuses on training a classifier based on Bidirectional Encoder Representations from Transformers (BERT) for the tax classification of goods with NCM codes. In particular, this article presents results from using a BERT model tuned specifically for the Portuguese language as well as results from using a multilingual BERT. Experimental results justify the use of these models in the classification process and show that the language-specific model performs slightly better.
ARTICLE | doi:10.20944/preprints202208.0451.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: text splitting; text tokenization; transfer learning; mask-fill prediction; NLP linguistic rules; missing punctuations; cross-lingual BERT model; Masked Language Modeling
Online: 26 August 2022 (05:19:39 CEST)
Long unpunctuated texts containing complex linguistic sentences are a stumbling block in processing any low-resource language. Thus, approaches that segment lengthy texts with no proper punctuation into simple candidate sentences are a vitally important preprocessing task in many hard-to-solve NLP applications. In this paper, we propose PDTS, a punctuation detection approach for segmenting Arabic text, built on top of a multilingual BERT-based model and some generic linguistic rules. Furthermore, we showcase how PDTS can be effectively employed as a text tokenizer for unpunctuated documents (i.e., mimicking transcribed audio-to-text documents). Experimental findings across two evaluation protocols (involving an ablation study and a human-based judgment) demonstrate that PDTS is practically effective in both performance quality and computational cost.
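The mask-fill segmentation idea can be sketched as follows: slide over the token stream, ask a masked language model how likely a punctuation mark is at each inter-token slot, and split where that probability is high. The stub predictor, boundary words, and English example below are our own stand-ins; the real PDTS queries a multilingual BERT model on Arabic text and combines the predictions with linguistic rules.

```python
def stub_mask_fill(left_context, right_context):
    """Stand-in for a masked-LM query: returns the probability that the
    [MASK] slot between the two contexts is a punctuation mark. A real
    system would ask a multilingual BERT model here, as PDTS does."""
    boundary_words = {"today", "now", "yesterday"}  # hypothetical cues
    return 0.9 if left_context and left_context[-1] in boundary_words else 0.1

def segment(tokens, threshold=0.5):
    """Split an unpunctuated token stream into candidate sentences
    wherever the (stubbed) model says punctuation is likely."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if stub_mask_fill(tokens[: i + 1], tokens[i + 1:]) >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing tokens as a final sentence
        sentences.append(" ".join(current))
    return sentences

text = "we met the client today the report is due now please review it"
print(segment(text.split()))
```

Swapping the stub for a real fill-mask call (inserting a mask token between the contexts and reading the punctuation-token probabilities) turns this skeleton into the mask-fill step the abstract describes.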
ARTICLE | doi:10.20944/preprints202203.0245.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Natural language processing (NLP); topic modelling; BERT; transportation; newspaper; magazine; academic research; journalism; deep learning; smart cities
Online: 17 March 2022 (07:58:15 CET)
We live in a complex world characterised by complex people, complex times, and complex social, technological, and ecological environments. There is clear evidence that governments are failing at most public matters. The recent COVID-19 pandemic is a prime example of global governance failure, both at preventing such pandemics and at managing this one. It is time that all of us take responsibility and look into ways of collaboratively improving the governance of public matters, our matters. While there are many reasons for government failures, we believe the lack of information availability is a fundamental reason that limits governments' ability to act smartly and allows a lack of transparency to creep into policy and action, leading to corruption and failure. To this end, this paper introduces the concept of deep journalism, a data-driven, deep-learning-based approach for discovering multi-perspective parameters related to a topic of interest. We build three datasets (a newspaper, a technology magazine, and a Web of Science dataset) and discover the academic, industrial, public, governance, and political parameters of the transportation sector as a case study, introducing deep journalism and our tool DeepJournal (Version 1.0), which implements our proposed approach. We elaborate on 89 transportation parameters and hundreds of dimensions by reviewing 400 technical, academic, and news articles. The findings related to the multi-perspective view of transportation reported in this paper show that there are many important problems, as seen by the public, that industry and academia do not seem to focus on. On the other hand, academia produces much broader and deeper knowledge on the subject, such as on the wide range of pollution affecting people and the planet, that does not reach the public eye. Our deep journalism approach could find these gaps and highlight them to the public and other stakeholders.