REVIEW | doi:10.20944/preprints202010.0649.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: text mining; natural language processing; electronic health records; clinical text; machine learning
Online: 3 February 2021 (10:31:14 CET)
Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g. physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, it describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation at health systems and in industry.
ARTICLE | doi:10.20944/preprints202105.0601.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Mobile RPG; Big Data; Text Mining; Topic Modeling
Online: 25 May 2021 (10:21:36 CEST)
Because RPGs generate high sales and profits, many developers have released RPGs, but the market has shifted toward mass-produced titles with sensational advertising, low quality, excessive charging, and near-identical content, which harms both the game market and users’ play experience. This paper studies ways to improve mobile RPGs by crawling user reviews from the Google Play Store and analyzing them. Topic modeling with LDA (Latent Dirichlet Allocation), a text mining technique, was used to extract meaningful information from the collected big data, and the results were visualized. Analyzing users’ reviews, identifying opinions objectively, and seeking ways to improve games help in building mobile RPGs that users keep playing.
ARTICLE | doi:10.20944/preprints202008.0265.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Recommender system; learning to rank; Mining software repositories; Text Mining; Deep learning; Stack Overflow
Online: 4 September 2020 (11:20:33 CEST)
In software development, developers receive bug reports that describe software bugs. They find the cause of a bug by reviewing the code and reproducing the abnormal behavior, which can be tedious and time-consuming. Developers need an automated system that incorporates broad domain knowledge and recommends solutions for these bugs, rather than spending more manual effort on fixing them or waiting for other users to reply on Q&A websites. Stack Overflow is a popular question-and-answer site focused on programming issues, so the knowledge available on this rich platform can be put to use. This paper presents a survey of methods in the field of mining software repositories. We propose an architecture for building a recommender system using the learning-to-rank approach. Deep learning is used to construct a model that solves the learning-to-rank problem using Stack Overflow data. Text mining techniques are used to extract, evaluate, and recommend the answers most relevant to solving a given bug report.
ARTICLE | doi:10.20944/preprints201809.0466.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: topological data analysis; text mining; computational topology; style; persistent homology
Online: 24 September 2018 (15:33:02 CEST)
Topological Data Analysis (TDA) refers to a collection of methods that find the structure of shapes in data. Although TDA methods have recently been used in many areas of data mining, they have not been widely applied to text mining tasks. Most text processing algorithms lose the order in which different entities appear or co-appear. Assuming this lost order is an informative feature of the data, TDA may play a significant role in closing the resulting gap in the text processing state of the art. The topology of different entities across a textual document may reveal additional information about the document that is not reflected in the features used by traditional text processing methods. In this paper, we introduce a novel approach that employs TDA in text processing to capture and use the topology of same-type entities in textual documents. First, we show how to extract topological signatures from text using persistent homology, a TDA tool that captures the topological signature of a data cloud. Then we show how to use these signatures for text classification.
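As a rough illustration of what a persistent-homology signature can look like, the sketch below computes only the 0-dimensional part (connected components) for a small point cloud. The abstract does not say how entities are turned into points or which homology dimensions are used, so the function name, the Euclidean metric, and the sample points are all assumptions made for illustration.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return the scales at which connected components merge (hypothetical helper).

    Every point starts as its own component at scale 0. Edges are added in
    order of increasing length; when an edge joins two different components,
    one of them "dies" at that length. Conventions differ on whether the
    scale is the distance or half of it; the full distance is used here.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths  # n - 1 finite values; one component never dies


# Made-up 2-D positions standing in for entity occurrences.
print(zero_dim_persistence([(0, 0), (1, 0), (5, 0)]))  # [1.0, 4.0]
```

The sorted list of merge scales is one simple fixed-format summary that could be fed to a classifier; higher-dimensional features (loops, voids) would normally require a dedicated TDA library.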
ARTICLE | doi:10.20944/preprints201708.0055.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: EMR; data preprocessing; text mining; information extraction; medical decision support system
Online: 15 August 2017 (05:46:43 CEST)
At present, medical institutes generally use EMRs to record patients' conditions, including diagnostic information, procedures performed, and treatment results. EMRs have been recognized as a valuable resource for large-scale analysis. However, EMR data are characterized by diversity, incompleteness, redundancy, and privacy constraints, which make it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and the data mining results. Different types of data require different processing technologies. Most structured data needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. Semi-structured or unstructured data, such as medical text, contains more health information and requires more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (Named Entity Recognition) and RE (Relation Extraction). In this paper, we introduce the process of EMR processing, including data collection, data preprocessing, data mining, evaluation, and knowledge application; analyze the current status of key technologies such as data preprocessing and data mining; and provide an overview of the application domains and prospects of EMR mining technologies. Finally, we summarize the existing problems in EMR mining research and review development trends.
ARTICLE | doi:10.20944/preprints202206.0050.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Emotions Mining; Context Mining; Sensory Mining; Artificial Intelligence; Information extraction; Text classification; Fairy tales; Olfactory Cultural Heritage
Online: 2 August 2022 (07:57:35 CEST)
This paper presents an Artificial Intelligence approach to mining context and emotions related to olfactory cultural heritage narratives, in particular fairy tales. We provide an overview of the role of smell and emotions in literature and highlight the importance of olfactory experience and emotions from psychological and linguistic perspectives. We introduce a methodology for extracting smells and emotions from text and demonstrate the context-based visualizations of smells and emotions implemented in a novel Smell Tracker tool. The evaluation is performed using a collection of fairy tales from Grimm and Andersen. We find that fairy tales often connect smell with the emotional charge of situations. The experimental results show that we can detect smells and emotions with F1 scores of 92.7 and 79.2, respectively.
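For readers new to the metric, the F1 scores quoted above are the harmonic mean of precision and recall. The sketch below shows the standard computation from raw counts; the function name and the counts in the usage line are illustrative and are not taken from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        # No correct detections: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 correct detections, 2 spurious, 2 missed.
print(round(f1_score(8, 2, 2), 3))  # 0.8
```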
ARTICLE | doi:10.20944/preprints202102.0120.v1
Subject: Social Sciences, Business And Administrative Sciences Keywords: Homepage words; Financial ratio; Text-mining; Balanced scorecard
Online: 3 February 2021 (15:07:40 CET)
(1) Background: The CEO message on a hospital homepage contains various content, such as the hospital's future vision, promises to customers, upgraded services, and public activities. The CEO’s message includes non-financial as well as financial information about the organization. It also provides useful information not only about the organization's goals and vision but also about its performance and future strategies. This study investigates associations between the CEO’s messages on hospital homepages and the hospitals’ financial status. We used the balanced scorecard framework to analyze which content on a hospital's homepage is related to the hospital's various financial ratios. (2) Methods: We adopt a text mining method to extract significantly repeated keywords from the CEO’s message on each hospital website, and we classify these keywords using the balanced scorecard framework. To examine the relationship between keywords in the CEO’s message and the hospital’s financial ratios, a t-test is conducted on the difference in the mean TF-IDF (Term Frequency-Inverse Document Frequency) of the homepage content and on its relationship with the perspectives of the balanced scorecard framework. (3) Results: According to empirical results on 65 samples collected from local hospitals, there are some significant relationships between the qualitative content of a hospital's homepage and the quantitative financial ratios that indicate profitability, activity, leverage, liquidity, and transfer to essential business fund (EBF) income. (4) Conclusions: The introduction section of a homepage is the most accessible to customers; it contains the aims and ideals of the hospital and reflects its values and vision. In addition, depending on their financial status, hospitals can either emphasize financial strength or focus on other areas to mask weak financial information. This study is a reminder of the importance of disclosure on hospital websites, which can be inferred from the financial status of the hospital. It also highlights the need for harmonization between quantitative data (financial statements) and qualitative data (CEO’s messages). (5) Implications: To the best of our knowledge, this paper is the first to investigate the relation between the text of hospital homepages and hospitals’ financial ratios through text mining and the balanced scorecard framework. Hospitals play a crucial part in a country’s welfare and form the backbone of its healthcare industry. Nevertheless, in many countries the hospital sector remains a source of critical fiscal deficits due to ineffective and sloppy management. We expect that the results of this paper can provide hospital managers with useful information.
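TF-IDF weights a term by how often it appears in a document, discounted by how many documents contain it. The abstract does not state which TF-IDF variant was used, so the sketch below assumes the plain textbook form, relative term frequency times log(N / document frequency); the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy stand-ins for two tokenized CEO messages.
messages = [["patient", "care", "care"], ["patient", "research"]]
for weights in tf_idf(messages):
    print(weights)
```

A term that occurs in every document (here "patient") receives weight 0 under this variant, which is why smoothed forms of the IDF are common in practice.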
REVIEW | doi:10.20944/preprints202102.0447.v1
Subject: Earth Sciences, Atmospheric Science Keywords: circular economy; Covid-19; Voyant tools; environmental sustainability; social sustainability; economic sustainability; text mining
Online: 20 February 2021 (01:42:10 CET)
The emergence of the Covid-19 pandemic has created both negative and positive changes, including the implementation of the circular economy across the globe. This systematic review follows the PRISMA statement and employs a text mining technique (Voyant Tools) to visualize and analyze the impacts of Covid-19 on three aspects of the circular economy: economic, social, and environmental. The research employs Latent Dirichlet Allocation (LDA) to identify five major topics: (1) shortage of medical equipment but high medical waste during Covid-19 due to the high demand in healthcare; (2) the long-term negative impacts of lockdown on economic and social activities because of the Covid-19 pandemic; (3) reports on the impacts of the Covid-19 pandemic on manufacturing globally, together with coping strategies and new opportunities; (4) the impacts of international restrictions on the tourism, trade, shipping, and aviation industries, causing billion-dollar losses; (5) the reduction of pollution with health environment improvements, with example cases from China and the EU. The research identifies current literature gaps in the circular economy and Covid-19 topics and encourages the application of text mining tools in research to stimulate the research process and assist in communicating with the public.
ARTICLE | doi:10.20944/preprints202203.0329.v1
Subject: Mathematics & Computer Science, Analysis Keywords: Plagiarism Detection; Plagiarism checker for Bengali text; Bengali Literature Corpus; OCR in Bengali text
Online: 24 March 2022 (09:36:56 CET)
Plagiarism means taking another person’s work without giving them credit for it. It is one of the most serious problems in academia and among researchers. Although multiple tools are available to detect plagiarism in a document, most of them are domain-specific and designed to work on English texts, yet plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali literature has a history of 1300 years, and most Bengali literature books are not yet properly digitized. As no such corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India, extracted text from them with a comprehensive methodology, and constructed our corpus. Our experimental results show an average accuracy between 72.10% and 79.89% in text extraction using OCR. The Levenshtein distance algorithm is used to detect plagiarism. We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts. In the future, we aim to construct a corpus with more books for more accurate detection.
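The Levenshtein distance named above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal sketch of the standard dynamic-programming algorithm. The `similarity` helper and its normalization by the longer length are assumptions added for illustration; the abstract does not say how the distance is turned into a plagiarism decision.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

Python strings are compared code point by code point here, so a Bengali character written with combining signs counts as several units; grouping code points into grapheme clusters first would be a reasonable refinement.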
ARTICLE | doi:10.20944/preprints202110.0033.v1
Online: 4 October 2021 (08:58:52 CEST)
Antimicrobial resistance (AMR) is one of the top 10 threats affecting global health. AMR defeats the effective prevention and treatment of infections caused by microbial pathogens including bacteria, parasites, viruses and fungi (WHO). Microbial pathogens have a natural tendency to evolve and mutate over time, resulting in AMR strains. The set of genes involved in antibiotic resistance, also termed “antibiotic resistance genes” (ARGs), spreads through species by lateral gene transfer, thereby causing global dissemination. While this biological mechanism is prevalent in the spread of AMR, human practices also contribute through mechanisms such as over-prescription, incomplete treatment, and environmental waste. A considerable portion of the scientific community is engaged in AMR-related work, trying to discover novel therapeutic solutions for tackling resistant pathogens. Comprehensive inspection of the literature shows that diverse therapeutic strategies have evolved over recent years. Collectively, these therapeutic strategies include novel small molecules, newly identified antimicrobial peptides, bacteriophages, phytochemicals, nanocomposites, and novel phototherapy against bacteria, fungi and viruses. In this work we have developed a comprehensive knowledgebase by collecting alternative antimicrobial therapeutic strategies from literature data. We used a subjective approach for mining new strategies, resulting in broad coverage of entities, and subsequently added objective data such as entity name, potency, and safety information. The extracted data were organized into KOMBAT (Knowledgebase Of Microbes’ Battling Agents for Therapeutics). Many of these data have been tested against AMR pathogens. We envision that this database will be valuable for developing future therapeutics against resistant pathogens. The database can be accessed through http://kombat.igib.res.in/.
ARTICLE | doi:10.20944/preprints202011.0646.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: social media; hate speech; text classification
Online: 25 November 2020 (14:12:07 CET)
The exponential increase in the use of the Internet and social media over the last two decades has changed human interaction. This has led to many positive outcomes, but at the same time it has brought risks and harms. Because the volume of harmful content online, such as hate speech, is not manageable by humans, interest in the academic community in investigating automated means of hate speech detection has increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset and classifying its items into three classes: abusive, hateful, or neither. We create a baseline model and improve its performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool that identifies and scores a page with an effective metric in near-real time and uses the result as feedback to re-train our model. We demonstrate the competitive performance of our multilingual model on two languages, English and Hindi, with performance comparable or superior to most monolingual models.
ARTICLE | doi:10.20944/preprints201610.0012.v1
Online: 5 October 2016 (15:08:32 CEST)
Bio-molecular reagents like antibodies required in experimental biology are expensive and their effectiveness, among other things, is critical to the success of the experiment. Although such resources are sometimes donated by one investigator to another through personal communication between the two, there is no previous study to our knowledge on the extent of such donations, nor a central platform that directs resource seekers to donors. In this paper, we describe, to our knowledge, a first attempt at building a web portal, titled Bio-Resource Exchange, that attempts to bridge this gap between resource seekers and donors in the domain of experimental biology. Users of this portal can request or donate antibodies, cell lines and DNA constructs. This resource could also serve as a crowd-sourced database of resources for experimental biology. Further, in order to index donations outside of our portal, we mined scientific articles to find instances of donations of antibodies and attempted to extract information about these donations at the finest granularity. Specifically, we extracted the name of the donor, his/her affiliation and the name of the antibody for every donation by parsing the acknowledgements sections of articles. To extract annotations at this level, we propose two approaches: a rule-based algorithm and a bootstrapped relation learning algorithm. The algorithms extracted donor names, affiliations and antibody names with average accuracies of 57% and 62%, respectively. We also created a dataset of 50 expert-annotated acknowledgements sections that will serve as a gold standard dataset to evaluate extraction algorithms in the future. Contact: email@example.com, firstname.lastname@example.org Database URL: http://tonks.dbmi.pitt.edu/brx Supplementary information: Supplementary data are available at Database online.
ARTICLE | doi:10.20944/preprints202204.0303.v1
Subject: Arts & Humanities, Other Keywords: Blogging; intercultural competence; international learning outcomes; reflective writing; reflection; text analysis; text mining; psycholinguistics; linguistic markers
Online: 29 April 2022 (13:07:15 CEST)
This study combines insights from psycholinguistics and text analysis to identify linguistic markers of intercultural competence (ICC) in student blogs about intercultural experiences. By combining holistic ICC frameworks with a more analytical approach at text and word level, we were able to demonstrate that blogs with a high perceived level of ICC contain significantly more I-words, more insight words, and fewer quantifiers. These markers of ICC constitute concrete cues for teachers when assessing reflective writing assignments and allow them to pinpoint concrete areas for improvement in their feedback and interaction with students.
ARTICLE | doi:10.20944/preprints202008.0033.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Natural Language Processing; Suspicious Text Detection; Bengali Language Processing; Machine Learning; Text Classification; Feature Extraction; Suspicious Corpora
Online: 2 August 2020 (14:38:13 CEST)
Due to the substantial growth in internet users and their ready access via electronic devices, the amount of electronic content has grown enormously in recent years through instant messaging, social networking posts, blogs, online portals, and other digital platforms. Unfortunately, the misuse of technology has increased with this rapid growth of online content, leading to a rise in suspicious activities. People misuse web media to spread malicious activity, conduct illegal activity, abuse other people, and publicize suspicious content on the web. Suspicious content is usually available as text, audio, or video, and text is used in most cases. Thus, one of the most challenging issues for NLP researchers is to develop a system that can efficiently identify suspicious text in specific content. In this paper, a Machine Learning (ML)-based classification model (hereafter called STD) is proposed to classify Bengali text into non-suspicious and suspicious categories based on its original content. A set of ML classifiers with various features was applied to our developed corpus of 7000 Bengali text documents, of which 5600 were used for training and 1400 for testing. The performance of the proposed system is compared with a human baseline and existing ML techniques. The SGD classifier with ‘tf-idf’ features combining unigrams and bigrams achieved the highest accuracy, 84.57%.
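The combination of unigram and bigram features mentioned above can be pictured as counting every single token and every adjacent token pair in a document. The sketch below shows only that counting step under the assumption of whitespace-style tokens; the function name and sample tokens are hypothetical, and the tf-idf weighting and SGD classifier are not reproduced here.

```python
from collections import Counter


def ngram_features(tokens, n_values=(1, 2)):
    """Count n-grams of each requested length; (1, 2) gives unigrams plus bigrams."""
    features = Counter()
    for n in n_values:
        for start in range(len(tokens) - n + 1):
            # Join the n adjacent tokens into one feature name.
            features[" ".join(tokens[start:start + n])] += 1
    return features


# Placeholder tokens; a real pipeline would tokenize Bengali text first.
print(ngram_features(["a", "b", "a"]))
```

Each resulting count vector could then be reweighted and passed to a linear classifier; bigrams add some local word-order information that unigrams alone discard.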
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Semantic Complexity; Semantics; Text Complexity; Readability Formulae
Online: 6 September 2021 (13:33:34 CEST)
Simple measures often cannot capture deep complexity. In the case of semantic complexity, conventional readability formulas share a common style, a common sort of achievement, and common limitations: these formulas lack a semantics-aware approach and, as a result, cannot measure semantic complexity precisely. In this paper, we introduce DASTEX, a novel semantics-aware measure of the semantic complexity of text. DASTEX opens a new layer of complexity analysis for NLP, cognitive, and computational tasks. The measure benefits from an underlying intuitionistic formal model that considers semantics as a lattice of intuitions. This yields a well-defined notion of the semantics of a text and its complexity. DASTEX is a practical analysis method built on this formal model. Thus a complete suite of idea, model, and method is provided, resulting in a simple yet deep measure of the semantic complexity of text. The proposed approach is evaluated in 4 experiments. The results show that DASTEX is capable of measuring the semantic complexity of text in 6 application tasks.
ARTICLE | doi:10.20944/preprints202010.0057.v1
Subject: Social Sciences, Accounting Keywords: multiclass classification; text mining; accounting control system
Online: 5 October 2020 (09:05:53 CEST)
Electronic invoicing has been mandatory for Italian companies since January 2019. Invoices are structured in a predefined XML template from which the reported information can be easily extracted and analyzed. The main aim of this paper is to exploit the information structured in electronic invoices to build an intelligent system that can facilitate accountants’ work. More precisely, this contribution shows how part of the accounting process can be automated: all invoices sent or received by a company are classified into specific codes that represent the economic nature of the financial transactions. To classify the data contained in the invoices, a machine learning multiclass classification problem is formulated, using the invoice information as input variables to predict two different target variables, the account codes and the VAT codes, which compose a general ledger entry. Different approaches are compared in terms of prediction accuracy. The best performance is achieved by taking into account the hierarchical structure of the account codes.
ARTICLE | doi:10.20944/preprints201802.0001.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: domain ontology; semantic analysis; linguistics, text resources
Online: 1 February 2018 (03:08:47 CET)
An ontology is a formalized representation of a problem area (PrA). A representation of the PrA in the form of a domain ontology is often used as a knowledge base in the development of intelligent software systems. The process of building an ontology is complex and requires an expert in the PrA. A large number of researchers are working to solve this problem. The basis of our approach is a pipeline of different linguistic methods of text analysis. A set of rules that we developed is used to build an ontology from the content analysis of a text resource. This article describes the method of building a domain ontology based on linguistic analysis of the content of text resources, presents an example of the proposed approach, and presents the architecture of our pipeline.
BRIEF REPORT | doi:10.20944/preprints201811.0527.v1
Subject: Medicine & Pharmacology, Nutrition Keywords: citation network analysis; text mining; nutrition intervention; cognition
Online: 21 November 2018 (13:50:28 CET)
Manual review of the extensive literature covering nutrition-based lifestyle interventions to promote healthy cognitive ageing has proved educative; however, data-driven techniques can better account for the large size of the literature (tens of thousands of potentially relevant publications to date) and interdisciplinary nature of where relevant publications may be found. In this study we present a new way to map the literature landscape focusing on nutrition-based lifestyle interventions to promote healthy cognitive ageing. We applied a combination of citation network analysis and text mining to map out the existing literature on nutritional interventions and cognitive health. Results indicated five overarching clusters of publications, which could be further deconstructed into a total of 35 clusters. These could be broadly distinguished by focus on lifespan stages (e.g. infancy versus older age), and specificity regarding nutrition (e.g. narrow focus on iodine deficiency versus broad focus on weight gain). Rather than concentrating into a single cluster, interventions were present throughout the majority of the research. We conclude that a data-driven map of the nutritional intervention literature can benefit the design of future interventions, by highlighting topics and themes that could be synthesized across currently disconnected clusters of publications.
ARTICLE | doi:10.20944/preprints201811.0206.v1
Subject: Mathematics & Computer Science, Other Keywords: Biomedical libraries; author’s confidence; writing styles; text analysis
Online: 8 November 2018 (11:01:24 CET)
In an era when medical literature is increasing daily, researchers in biomedical and clinical areas have joined efforts with language engineers to analyze large amounts of biomedical and molecular biology literature (such as PubMed), patient data, and health records. With such a huge number of reports, evaluating their impact has long ceased to be a trivial task. In this context, this paper introduces a non-scientific factor that represents an important element in the effort to gain acceptance of claims. We postulate that the confidence the author expresses in his or her work plays an important role in shaping the first impression that influences the reader’s perception of the paper. The results discussed in this paper are based on a series of experiments run over data from the Open Archives Initiative (OAI) corpus, which provides interoperability standards to facilitate the effective dissemination of content. This method can be useful to its direct beneficiaries (authors engaged in medical or academic research) and also to researchers in fields such as BioNLP and NLP.
ARTICLE | doi:10.20944/preprints201810.0338.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: text classification; topic modelling; latent semantic analysis; latent dirichlet allocation; hierarchical sentiment dictionary; contextually-oriented hierarchical corpus; text tonality; evaluation
Online: 16 October 2018 (07:55:35 CEST)
The research presents the Methodology of Improving the Accuracy in Text Classification in Light of Modelling the Latent Semantic Relations (LSR). The aim of this Methodology is to find ways of eliminating the Limitations of Discriminant and Probabilistic methods for revealing LSR and of customizing the Text Classification Process for more accurate recognition of text tonality. This aim is to be achieved by using knowledge about the text’s Hierarchical Semantic Context in the form of a Corpora-based Hierarchical Sentiment Dictionary. The main scientific contribution of this research is the following set of approaches to improving the qualitative characteristics of the Text Classification process: combining the Discriminant and Probabilistic methods, which decreases the influence of the Limitations of these methods on the LSR revealing process; considering each document as a complex structure, which allows documents to be estimated integrally through separate classification of topically complete textual components (paragraphs); and taking into account the features of Argumentative documents (Reviews), which allows the author’s subjective evaluation of text tonality to be used in developing the Text Classification methodology. Tonality, expressed by the Review’s author, has a significant, but not critical, effect on the qualitative indicators of Sentiment Recognition.
ARTICLE | doi:10.20944/preprints202208.0451.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: text splitting; text tokenization; transfer learning; mask-fill prediction; NLP linguistic rules; missing punctuations; cross-lingual BERT model; Masked Language Modeling
Online: 26 August 2022 (05:19:39 CEST)
Long unpunctuated texts containing complex linguistic sentences are a stumbling block to processing any low-resource language. Thus, approaches that attempt to segment lengthy texts with no proper punctuation into simple candidate sentences are a vitally important preprocessing task in many hard-to-solve NLP applications. In this paper, we propose PDTS, a punctuation detection approach for segmenting Arabic text, built on top of a multilingual BERT-based model and some generic linguistic rules. Furthermore, we showcase how PDTS can be effectively employed as a text tokenizer for unpunctuated documents (i.e., mimicking transcribed audio-to-text documents). Experimental findings across two evaluation protocols (involving an ablation study and a human-based judgment) demonstrate that PDTS is practically effective in both performance quality and computational cost.
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Text summarization; Fine-tuning; Transformers; SMS; Gateway; French Wikipedia.
Online: 14 September 2021 (10:48:55 CEST)
Text summarization remains a challenging task in the Natural Language Processing field despite the plethora of applications in enterprises and daily life. One of the common use cases is the summarization of web pages, which has the potential to provide an overview of web pages to devices with limited features. In fact, despite the increasing penetration rate of mobile devices in rural areas, the bulk of those devices offer limited features, and these areas are covered only by limited connectivity such as the GSM network. Summarizing web pages into SMS therefore becomes an important task for providing information to limited devices. This work introduces WATS-SMS, a T5-based French Wikipedia Abstractive Text Summarizer for SMS. It is built through a transfer learning approach. The T5 English pre-trained model is used to generate a French text summarization model by retraining it on 25,000 Wikipedia pages; the result is then compared with different approaches in the literature. The objective is twofold: (1) to check the assumption made in the literature that abstractive models provide better results compared to extractive ones; and (2) to evaluate the performance of our model compared to other existing abstractive models. A score based on ROUGE metrics gave us a value of 52% for articles with length up to 500 characters against 34.2% for transformer-ED and 12.7% for seq-2seq-attention; and a value of 77% for articles with larger size against 37% for transformers-DMCA. Moreover, an architecture including a software SMS-gateway has been developed to allow owners of mobile devices with limited features to send requests and to receive summaries through the GSM network.
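ROUGE scores compare a generated summary with a reference summary by counting overlapping n-grams. The abstract only says "a score based on ROUGE metrics", so the sketch below assumes the simplest member of the family, ROUGE-1 (unigram overlap), with lowercase whitespace tokenization; the function name and the example strings are illustrative.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


scores = rouge_1("the cat sat", "the cat sat on the mat")
print(scores["precision"], scores["recall"])  # 1.0 0.5
```

A short summary can score perfect precision with low recall, as in the example, which is one reason reported figures usually state which ROUGE variant and which of the three values they use.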
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Amharic script; Attention mechanism; OCR; Encoder-decoder; Text-image
Online: 15 October 2020 (13:42:28 CEST)
At present, the growth of digitization and worldwide communications makes OCR for exotic languages a very important task. In this paper, we attempt to develop an OCR system for one of these exotic languages with a unique script, Amharic. Motivated by the recent success of the attention mechanism in Neural Machine Translation (NMT), we extend the attention mechanism to Amharic text-image recognition. The proposed model consists of CNNs and attention-embedded recurrent encoder-decoder networks that are integrated following the configuration of the seq2seq framework. The attention network parameters are trained in an end-to-end fashion, and the context vector is injected, with the previously predicted output, at each time step of decoding. Unlike the existing OCR model that minimizes the CTC objective function, the new model minimizes the categorical cross-entropy loss. The performance of the proposed attention-based model is evaluated against the test dataset from the ADOCR database, which consists of both printed and synthetically generated Amharic text-line images, and achieved promising results with a CER of 1.54% and 1.17%, respectively.
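The character error rate (CER) quoted in this abstract is conventionally computed as the character-level edit distance between the recognised text and the reference text, divided by the number of reference characters. The following is a minimal sketch under that standard definition; the abstract does not spell out its exact formula, and the function names `levenshtein` and `cer` are illustrative rather than taken from the paper.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars
```

Pooling edits and characters over all text lines, as done here, weights long lines more heavily than averaging per-line rates; which variant the paper uses is not stated.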
ARTICLE | doi:10.20944/preprints201812.0306.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: cymatics; text detection and recognition; optical character recognition (OCR)
Online: 25 December 2018 (13:52:31 CET)
This paper proposes an original approach to achieving a Cymatics-based visual perception of image-extracted text. In this context, an effective approach for automated text detection and recognition in natural scene images is proposed. The incoming image is first enhanced by employing CLAHE and DWT. Afterwards, the text regions of the enhanced image are detected by employing the MSER feature detector. The non-text MSERs are removed by employing geometrical and contour-based filters. The remaining MSERs are grouped into words or phrases by finding similarities between them. The text recognition is performed by employing an OCR function. The extracted text is sequentially analysed on a character-by-character basis. Each character is converted into a methodical acoustic excitation. Finally, these excitations are converted into systematic visual perceptions by using the phenomenon of Cymatics. The system functionality is tested with an experimental setup. For the studied natural scenes, the suggested approach achieves 80% precision in text localization and 53% precision in end-to-end text recognition. The devised system principle is novel and can be employed in various applications like visual art, encryption, education, integration of impaired people, etc.
ARTICLE | doi:10.20944/preprints202106.0482.v3
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: COVID-19 Infodemic; Text Classification; TFIDF Features; Network Training modes; Supervised Learning; Misinformation; News Classification; False Publications; PubMed; Anomaly Detection
Online: 26 July 2021 (12:06:04 CEST)
The spread of the Coronavirus pandemic has been accompanied by an infodemic. The false information that is embedded in the infodemic affects people’s ability to access safety information and follow proper procedures to mitigate the risks. This research aims to target the falsehood part of the infodemic, which prominently proliferates in news articles and false medical publications. Here, we present NeoNet, a novel supervised machine learning text mining algorithm that analyzes the content of a document (a news article or a medical publication) and assigns a label to it. The algorithm is trained on TFIDF bigram features, which contribute to a network training model. The algorithm is tested on two different real-world datasets from the CBC news network and Covid-19 publications. In five different fold comparisons, the algorithm predicted the label of an article with a precision of 97-99%. NeoNet surpassed prominent algorithms such as Neural Networks, SVM, and Random Forests. The analysis highlighted the promise of NeoNet in detecting disputed online content which may contribute negatively to the COVID-19 pandemic.
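To make the phrase "TFIDF bigram features" concrete, here is a minimal sketch of one common way to compute them: word bigrams weighted by term frequency times inverse document frequency. This is not NeoNet's implementation; the whitespace tokenisation, the unsmoothed `log(N / df)` weighting, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Lower-case, whitespace-tokenise, and return adjacent word pairs."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def tfidf_bigrams(docs):
    """Return one {bigram: weight} dict per document.

    weight = (count in doc / bigrams in doc) * log(N / document frequency)
    """
    counts = [Counter(bigrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each bigram.
    df = Counter(g for c in counts for g in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({g: (k / total) * math.log(n_docs / df[g])
                        for g, k in c.items()})
    return vectors
```

With this weighting, a bigram that appears in every document gets weight zero, so only bigrams that distinguish documents carry signal into a downstream classifier.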
ARTICLE | doi:10.20944/preprints202103.0738.v1
Subject: Mathematics & Computer Science, Analysis Keywords: bibliometry; coronavirus; text and data mining; SARS; MERS; COVID-19
Online: 31 March 2021 (17:30:56 CEST)
A global event such as the COVID-19 crisis presents new, often unexpected responses that are fascinating to investigate from both scientific and social standpoints. Despite several documented similarities, the Coronavirus pandemic is clearly distinct from the 1918 flu pandemic in terms of our exponentially increased, almost instantaneous ability to access/share information, offering an unprecedented opportunity to visualise rippling effects of global events across space and time. Personal devices provide “big data” on people’s movement, the environment and economic trends, while access to the unprecedented flurry of scientific publications and media posts provides a measure of the response of the educated world to the crisis. Most bibliometric (co-authorship, co-citation, or bibliographic coupling) analyses ignore the time dimension, but COVID-19 has made it possible to perform a detailed temporal investigation into the pandemic. Here, we report a comprehensive network analysis based on more than 20,000 published documents on viral epidemics, authored by over 75,000 individuals from 140 nations in the past year of the crisis. In contrast to the 1918 flu pandemic, access to published data over the past two decades enabled a comparison of publishing trends between the ongoing COVID-19 pandemic and those of the 2003 SARS epidemic, to study changes in thematic foci and societal pressures dictating research over the course of a crisis.
ARTICLE | doi:10.20944/preprints202103.0380.v1
Subject: Social Sciences, Accounting Keywords: COVID-19; pandemic crisis; crisis management; text mining; network analysis
Online: 15 March 2021 (12:34:01 CET)
This study aims to understand the global environment of COVID-19 management and guide future policy directions after the pandemic crisis. To this end, we analyzed a series of the World Economic Forum’s COVID-19 response reports through text mining and network analysis. These reports, written by experts in diverse fields, discuss multidimensional changes in socioeconomic situations, various problems created by those changes, and strategies to respond to national crises. Based on 3,897 refined words drawn from a morphological analysis of 26 reports (as of the end of 2020), this study analyzes the frequency of words, the relationships among words, the importance of specific documents, and the connection centrality through text mining. In addition, network analysis helps develop strategies for sustainable response to and management of national crises through identifying clusters of words with a similar structural equivalence.
ARTICLE | doi:10.20944/preprints202111.0344.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: pharmacological text corpus; automatic relation extraction; natural language processing; deep learning
Online: 19 November 2021 (10:40:10 CET)
Nowadays, analysing virtual media to predict society’s reaction to events or processes is a task of great relevance. This especially concerns meaningful information on healthcare problems. Internet sources contain a large amount of pharmacologically meaningful information useful for pharmacovigilance purposes and for drug repurposing. Analysing information on such a scale demands methods whose development requires a corpus with labeled relations among entities. Until now, there have been no such Russian language datasets. This paper considers the first Russian language dataset where labeled entity pairs are divided into multiple contexts within a single text (by used drugs, by different users, by the cases of use, etc.), and a method based on the XLM-RoBERTa language model, previously trained on medical texts, to evaluate the state-of-the-art accuracy for the task of identifying four types of relationships among entities: ADR–Drugname, Drugname–Diseasename, Drugname–SourceInfoDrug, Diseasename–Indication. As shown on the presented dataset from the Russian Drug Review Corpus, the developed method achieves an F1-score of 81.2% (obtained using cross-validation and averaged over the four types of relationships), which is 7.8% higher than the basic classifiers.
ARTICLE | doi:10.20944/preprints201904.0170.v1
Subject: Medicine & Pharmacology, Other Keywords: topic modelling; latent dirichlet allocation; text mining; assisted reproduction; ART; IVF
Online: 15 April 2019 (12:25:12 CEST)
Study question: What are the current trends of research in Human Assisted Reproduction around the world? Summary answer: The USA is the leading country, followed by the UK, China, France and Italy. The largest research area is “laboratory techniques”, although other areas such as “public health”, “quality, ethics and law” and “female factor” are gaining ground worldwide. What is known already: Scientific research, especially in health and medical sciences, aims at addressing specific needs that society (and, especially, patients) perceives as pressing. One of the main challenges for policymakers and research funders alike is therefore to align research priorities to societal needs. We can thus think of research agendas in terms of a demand side (societal needs) and a supply side (research outputs). Research output in Human Assisted Reproduction has expanded in the past years, as indicated by the increasing number of scientific publications in indexed journals in this area. Nevertheless, no map of research related to assisted reproduction has been produced so far, hindering the identification of potential areas of improvement and need. Study design, size, duration: 26,000+ scientific publications (articles, letters, and reviews) on Human Assisted Reproduction produced worldwide between 2005 and 2016 were analyzed. These publications were indexed in PubMed or obtained from the reference lists of indexed publications included in the analysis. Participants/materials, setting, methods: The corpus of publications was obtained by combining the MeSH terms: “Reproductive techniques”, “Reproductive medicine”, “Reproductive health”, “Fertility”, “Infertility”, and “Germ cells”. Then it was analyzed by means of text mining algorithms (Topic Modeling (TM) based on Latent Dirichlet Allocation (LDA)), in order to obtain the main topics of interest. Finally, these categories were analyzed across world regions and time.
Main results and the role of chance: We identified 44 main topics, which were further grouped into 11 macro categories, from largest to smallest: “laboratory techniques”, “male factor”, “quality, ethics and law”, “female factor”, “public health and infectious diseases”, “basic research and genetics”, “pregnancy complications and risks”, “general infertility and ART”, “psychosocial aspects”, “cancer”, and “research methodology”. The USA was the leading country in number of publications, followed by the UK, China, France and Italy. Interestingly, research content in high income countries is fairly homogeneous across macro-categories, and it is dominated by “laboratory techniques” in Western and Southern Europe, and by “quality, ethics and law” in North America, Australia and New Zealand. In middle income countries we observe that research is mainly performed on “male factor”, and noticeably less on “female factor”. Finally, research on “public health and infectious diseases” predominates in low-income countries. Regarding temporal evolution of research, “laboratory techniques” is the most abundant topic on a yearly basis, and relatively constant over time. However, since production in most of the other categories is increasing, the relative contribution of this research category is actually decreasing. Publication is especially increasing in “public health and infectious diseases” (in all world regions, but especially in low income countries), “quality, ethics and law” (high income countries), and “female factor” (middle income countries). Limitations, reasons for caution: Three main factors might limit the robustness of our work: the textual corpus analyzed is based on abstracts and titles; the reproducibility of the stochastic algorithms applied, which may produce slightly differing results at each run; and the interpretation of the topics obtained.
Wider implications of the findings: This study should prove beneficial in the design of research strategies and policies that foster the alignment between supply (assisted reproduction research) and demand (society). Study funding/competing interest(s): PTQ-14-06718 of the Spanish MINECO Torres Quevedo programme (FAM).
ARTICLE | doi:10.20944/preprints201810.0678.v1
Subject: Medicine & Pharmacology, Pediatrics Keywords: post-operative death; unstructured data; logistic regression; text mining; surgery outcome
Online: 29 October 2018 (11:46:18 CET)
Text fields in electronic medical records (EMR) contain information on important factors that influence health outcomes; however, they are underutilized in clinical decision making due to their unstructured nature. We analyzed 6,497 inpatient surgical cases with 719,308 free text notes from the Le Bonheur Children’s Hospital EMR. We used a text mining approach on preoperative notes to derive a text-based risk score predictive of death within 30 days of surgery. We studied the additional performance obtained by including the text-based risk score as a predictor of death along with other clinical risk factors based on structured data. The C-statistic of a logistic regression model with 5-fold cross-validation significantly improved from 0.76 to 0.92 when text-based risk scores were included in addition to structured data. We conclude that preoperative free text notes in EMR include significant information that can predict adverse surgery outcomes.
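For readers unfamiliar with the C-statistic reported here: for a binary outcome it equals the area under the ROC curve, that is, the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen patient who survived, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def c_statistic(labels, scores):
    """Concordance statistic for binary labels (1 = event, 0 = no event).

    Counts, over all (event, non-event) pairs, how often the event case
    has the higher score; ties contribute one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Toy example with made-up risk scores: 3 of 4 pairs are concordant.
print(c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported change from 0.76 to 0.92 should be read.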
ARTICLE | doi:10.20944/preprints202209.0324.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Insurance; natural language processing; topic modelling; text analysis; complex networks; risk ranking
Online: 21 September 2022 (10:25:26 CEST)
The ability to identify and rank risk is essential for efficient and effective supervision of financial service firms, such as banks and insurers. Risk ranking ensures limited resources are allocated where they are most needed. Today, automatic risk identification within insurance supervision primarily relies on quantitative metrics based on numerical data (e.g. returns). The purpose of this work is to assess whether Natural Language Processing (NLP) and cognitive networks can achieve similar automated risk ranking and identification by analysing textual data, i.e. NIDT=829 investor transcripts from Bloomberg. To this aim, this work explores and tunes 3 NLP techniques: (1) keyword extraction enhanced by cognitive network analysis; (2) valence/sentiment analysis; and (3) topic modelling. Results highlight that keyword analysis, enriched by term frequency-inverse document frequency scores and semantic framing through cognitive networks, could detect events of relevance for the insurance system like cyber-attacks or the COVID-19 pandemic. Cognitive networks were found to highlight events that related to specific financial transitions: The semantic frame of "climate" grew in size by +538% between 2018 and 2020 and outlined an increased awareness that agents and insurers expressed towards climate change. A lexicon-based sentiment analysis achieved a Pearson’s correlation of ρ=0.16 (p<0.001,N=829) between sentiment levels and daily share prices. Although relatively weak, this finding indicates that insurance jargon is insightful to support risk supervision. Topic modelling is considered less amenable to support supervision, because of a lack of results’ stability and an intrinsic difficulty to interpret risk patterns. We discuss how these automatic methods could complement existing supervisory tools in automated risk ranking.
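The sentiment result in this abstract pairs a lexicon-based score per transcript with a share price and reports their Pearson correlation. The sketch below illustrates both steps under simple assumptions: the lexicon is a plain word-to-valence dictionary, the transcript score is the mean valence of matched words, and the correlation is computed from its textbook formula. The tiny lexicon and the function names are hypothetical and do not come from the paper.

```python
import math


def lexicon_sentiment(text, lexicon):
    """Mean valence of the words in `text` that appear in `lexicon`
    (0.0 when no word matches)."""
    tokens = text.lower().split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical valence lexicon, for illustration only.
TOY_LEXICON = {"strong": 1.0, "growth": 1.0, "loss": -1.0}
```

In use, one would score every transcript with `lexicon_sentiment`, align each score with the share price on the same day, and pass the two lists to `pearson`.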
REVIEW | doi:10.20944/preprints202205.0114.v1
Subject: Social Sciences, Business And Administrative Sciences Keywords: Blockchain Technology; Industry 4.0; Supply Chain Management; Text mining; Metaverse; Hashgraph, Baas.
Online: 9 May 2022 (10:01:43 CEST)
In the current business environment, firms are eager to adopt new technology as they witness and perceive more and more successful business applications of new technologies, e.g., Big Data, Artificial Intelligence (A.I.), Cloud Computing, etc. As one of the disruptive technologies, Blockchain technology (BCT) is now drawing public attention owing to the cryptocurrency phenomenon (e.g., Bitcoin), for which Blockchain serves as the backbone technology. Given certain innovative features of BCT, especially its transparency, traceability, security, efficiency, confidentiality, and immutability, BCT holds out the promise of impacting supply chain operational and financial efficiencies. Recently, the burgeoning of the Metaverse and Non-Fungible Tokens (NFTs) has raised the profile of BCT as a state-of-the-art technology another notch. Motivated by the proliferating adoption of BCT, we conduct a holistic literature review with a focus on the status of research on BCT in the context of Supply Chain Management (SCM). In particular, this Blockchain-centered work reviews the research to date on Blockchain applications in SCM. It provides a holistic review in terms of (1) the functionality of BCT and its salient features; (2) the prevailing and potential applications of BCT; and (3) the business benefits and impact of BCT in SCM. Finally, we substantiate and highlight a variety of research directions.
ARTICLE | doi:10.20944/preprints202111.0208.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Technology analysis; Trend analysis; Patent keyword analysis; Text mining; Natural language processing
Online: 10 November 2021 (15:25:21 CET)
Thanks to its rapid development in recent years, artificial intelligence technology is contributing to many parts of society. In education, the environment, medical care, the military, tourism, the economy, politics, and other areas, it is having a very large impact on society as a whole. For example, in the field of education, there are artificial intelligence tutoring systems that automatically assign tutors based on each student's level. In the field of economics, there are quantitative investment methods that automatically analyze large amounts of data to find investment rules, build investment models, or predict changes in financial markets. As such, artificial intelligence technology is being used in various fields. It is therefore very important to know exactly which factors have an important influence on each field of artificial intelligence technology and how the fields are connected to one another, which makes it necessary to analyze artificial intelligence technology in each field. In this paper, we analyze patent documents related to artificial intelligence technology. We propose a method for keyword analysis within factors using artificial intelligence patent data sets for artificial intelligence technology analysis. The model relies on feature engineering based on a deep learning model named KeyBERT and on a vector space model. A case study of collecting and analyzing artificial intelligence patent data was conducted to show how the proposed model can be applied to real-world problems.
ARTICLE | doi:10.20944/preprints201812.0086.v4
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: multi-model information fusion; video skimming; audio and text classification; keyframe extraction
Online: 5 August 2019 (03:48:49 CEST)
In this paper, we propose a novel approach to video skimming that exploits the fusion of video temporal information and a keyword information representation extracted from multi-model video information, including audio, text and visual indices. In addition, we introduce brand-safe filtering and sentiment analysis in order to retain only the user-friendly content in the video skim. In experiments using videos from the YouTube-8M dataset, we show that the video skim produced by the proposed approach conserves the semantic content of the video considerably better than approaches that use only partial information from the video.
ARTICLE | doi:10.20944/preprints201802.0108.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Mandarin; prosody generation; linguistic feature; break prediction; text-to-speech; punctuation confidence
Online: 16 February 2018 (15:39:58 CET)
This paper proposes two fully-automatic machine-extracted linguistic features from an unlimited text input for Mandarin prosody generation. One is the punctuation confidence (PC) which measures the likelihood of inserting a major punctuation mark (PM) at a word boundary. Another is the quotation confidence (QC) which measures the likelihood of a word string to be quoted as a meaningful or emphasized unit in text. Because a major PM in a text is highly correlated with a prosodic break, and a quoted word string plays an important role in human language understanding, the two features potentially could provide useful information for prosody generation. The idea is first realized by employing conditional random field (CRF)-based models to predict major PMs, quoted word string locations, and their associated confidences, i.e., the PC and the QC, for each word boundary. Then, the predicted punctuations and their confidences are combined with traditional contextual linguistic features to predict prosodic-acoustic features. Both objective and subjective tests showed that the prosody generation with the proposed linguistic features performed better than the one without the proposed features. So, the proposed PC and QC are promising features for Mandarin prosody generation.
ARTICLE | doi:10.20944/preprints202207.0090.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Text mining; natural language processing; sustainability; semantic similarity; corporate social responsibility; Machine Learning
Online: 6 July 2022 (08:53:02 CEST)
This paper investigates whether Corporate Social Responsibility (CSR) reports published by a selected group of Nordic companies are aligned with the Global Reporting Initiative (GRI) standards. To achieve this goal, several natural language processing and text mining techniques were implemented and tested. We extracted string, corpus, and hybrid semantic similarities from the reports and evaluated the models through the intrinsic assessment methodology. A quantitative ranking score based on index matching was developed to complement the semantic valuation. The final results show that Latent Semantic Analysis (LSA) and Global Vectors for Word Representation (GloVe) are the best methods for our study. Our findings will open the door to the automatic evaluation of sustainability reports, which could have a strong impact on the environment.
REVIEW | doi:10.20944/preprints202110.0184.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: text-mining; self-attention models; biological literature mining; relationship extraction; natural language processing
Online: 12 October 2021 (14:17:46 CEST)
For any molecule, network, or process of interest, keeping up with new publications is becoming increasingly difficult. For many cellular processes, the number of molecules and interactions that need to be considered can be very large. Automated mining of publications can support large-scale molecular interaction maps and database curation. Text mining and Natural Language Processing (NLP)-based techniques are finding their applications in mining the biological literature, handling problems such as Named Entity Recognition (NER) and Relationship Extraction (RE). Both rule-based and machine learning (ML)-based NLP approaches have been popular in this context, with multiple research and review articles examining the scope of such models in Biological Literature Mining (BLM). In this review article, we explore self-attention based models, a special type of neural network (NN)-based architecture that has recently revitalized the field of NLP, applied to biological texts. We cover self-attention models operating either at a sentence level or an abstract level, in the context of molecular interaction extraction, published from 2019 onwards. We conduct a comparative study of the models in terms of their architecture. Moreover, we also discuss some limitations in the field of BLM that identify opportunities for the extraction of molecular interactions from biological text.
COMMUNICATION | doi:10.20944/preprints202104.0575.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Ethnopharmacology; Artificial Intelligence; Web Crawling; Active Learning; Reinforcement Learning; Text Mining; Big Data
Online: 23 June 2021 (11:47:32 CEST)
Ethnopharmacology experts face several challenges when identifying and retrieving documents and resources related to their scientific focus. The volume of sources that need to be monitored, the variety of formats utilized, the different quality of language use across sources, present some of what we call “big data” challenges in the analysis of this data. This study aims to understand if and how experts can be supported effectively through intelligent tools in the task of ethnopharmacological literature research. To this end, we utilize a real case study of ethnopharmacology research, aimed at the Southern Balkans and Coastal zone of Asia Minor. Thus, we propose a methodology for more efficient research in ethnopharmacology. Our work follows an “Expert-Apprentice” paradigm in an automatic URL extraction process, through crawling, where the apprentice is a Machine Learning (ML) algorithm, utilizing a combination of Active Learning (AL) and Reinforcement Learning (RL), and the Expert is the human researcher. ML-powered research improved 3.1 times the effectiveness and 5.14 times the efficiency of the domain expert, fetching a total number of 420 relevant ethnopharmacological documents in only 7 hours versus an estimated 36-hour human-expert effort. Therefore, utilizing Artificial Intelligence (AI) tools to support the researcher can boost the efficiency and effectiveness of the identification and retrieval of appropriate documents.
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Urdu Twitter Dataset; Urdu Natural language processing (NLP); Urdu text Sentiments and Emoticons
Online: 24 March 2021 (12:03:46 CET)
This article presents a dataset of tweets in the Urdu language. There are 1,140,824 tweets in the dataset, collected from Twitter for September and October 2020. This large-scale corpus of tweets was pre-processed by removing columns containing user information, retweet counts, and follower information; removing duplicate tweets and unnecessary punctuation, links, symbols, and spaces; and finally extracting emojis if present in the tweet text. In the final dataset, each tweet record contains columns for tweet id, text, and the emoji extracted from the text with a sentiment score. Emojis are extracted to validate Machine Learning models used for multilingual sentiment and behavior analysis. They are extracted using a Python script that searches for an emoji from the list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
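The emoji extraction step described in this abstract can be sketched as a lookup against a table that maps each emoji to a description and a sentiment score. The authors' actual script and their 751-emoji list are not reproduced here; the two-entry table below is a stand-in, and its sentiment values are placeholders chosen for illustration, not scores from the dataset.

```python
# Hypothetical stand-in for the 751-emoji list: emoji -> (description, score).
# The scores are placeholder values for illustration only.
EMOJI_TABLE = {
    "\U0001F602": ("face with tears of joy", 0.5),
    "\U0001F622": ("crying face", -0.5),
}


def extract_emojis(text, table):
    """Return a record for every emoji from `table` that occurs in `text`."""
    found = []
    for emoji, (description, score) in table.items():
        if emoji in text:  # plain substring search
            found.append({"emoji": emoji,
                          "description": description,
                          "sentiment": score})
    return found


# A tweet-like string containing one known emoji.
records = extract_emojis("kya baat hai \U0001F602", EMOJI_TABLE)
print(records[0]["description"])  # face with tears of joy
```

Substring search keeps the sketch short and also handles emojis made of several code points, at the cost of scanning the text once per table entry.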
ARTICLE | doi:10.20944/preprints202003.0249.v1
Subject: Mathematics & Computer Science, Other Keywords: machine learning; preprocessing; semantic analysis; text mining; TF/IDF; scraping; Google Play Store
Online: 11 August 2020 (08:14:10 CEST)
It is evident that Android apps are used almost everywhere in the world. Half of the population of this planet is associated with messaging, social media, gaming, and browsers. The Google Play Store provides free and paid access to users, who are encouraged to download countless applications belonging to predefined categories. In this research paper, we have scraped thousands of user reviews and app ratings. We have scraped the reviews of 148 apps from 14 categories. We have collected 506259 reviews from the Google Play Store and subsequently checked the semantics of the reviews of some applications from users to determine whether the reviews are positive, negative, or neutral. We have evaluated the results by using different machine learning algorithms, namely Naïve Bayes, Random Forest, and Logistic Regression. We have calculated Term Frequency (TF) and Inverse Document Frequency (IDF) features, compared the algorithms on different measures (accuracy, precision, recall, and F1), and visualized these statistical results in the form of a bar chart. In this paper, the analysis of each algorithm is performed one by one, and the results are compared. Eventually, we found that Logistic Regression is the best algorithm for review analysis across the Google Play Store. Logistic Regression achieves the best precision, accuracy, recall, and F1 on this dataset both after preprocessing and after data collection.
ARTICLE | doi:10.20944/preprints201812.0114.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: directional encoding mask; selective attention network; supervised learning; horizontal and vertical text recognition
Online: 11 December 2018 (07:24:04 CET)
Recent state-of-the-art scene text recognition methods have primarily focused on horizontal text in images. However, in several Asian countries, including China, large amounts of text in signs, books, and TV commercials are vertically directed. Because the horizontal and vertical texts exhibit different characteristics, developing an algorithm that can simultaneously recognize both types of text in real environments is necessary. To address this problem, we adopted the direction encoding mask (DEM) and selective attention network (SAN) methods based on supervised learning. DEM contains directional information to compensate in cases that lack text direction; therefore, our network is trained using this information to handle the vertical text. The SAN method is designed to work individually for both types of text. To train the network to recognize both types of text and to evaluate the effectiveness of the designed model, we prepared a new synthetic vertical text dataset and collected an actual vertical text dataset (VTD142) from the Web. Using these datasets, we proved that our proposed model can accurately recognize both vertical and horizontal text and can achieve state-of-the-art results in experiments using benchmark datasets, including the street view test (SVT), IIIT-5k, and ICDAR. Although our model is relatively simple as compared to its predecessors, it maintains the accuracy and is trained in an end-to-end manner.
ARTICLE | doi:10.20944/preprints202007.0646.v1
Subject: Keywords: Machine Learning; Natural Language Processing; Text Mining; Semantic Analysis; Scraping; Google Play Store; Rating
Online: 26 July 2020 (17:11:09 CEST)
The Google Play Store allows users to download mobile applications (apps), and users are influenced by the ratings and reviews of an app. A recent study finds that user preferences, user opinions for improvement, user sentiment about particular features, and detailed descriptions of experiences are very useful for an application developer. However, the reviews of many applications are too numerous to process manually. The star rating is given to the whole application, so the developer cannot analyze a single feature. In this research, we have scraped 282,231 user reviews through different data scraping techniques. We have applied text classification to these user reviews. We have applied different algorithms and computed the precision, accuracy, F1 score and recall. From the evaluated results, we also identify the best algorithm.
ARTICLE | doi:10.20944/preprints202105.0449.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Explainable Artificial Intelligence; Hopfield Neural Networks; Automatic Video Generation; Data-to-text systems; Software Visualization
Online: 19 May 2021 (14:07:48 CEST)
Hopfield Neural Networks (HNNs) are recurrent neural networks used to implement associative memory. Their main feature is their ability to perform pattern recognition, optimization, or image segmentation. However, it is sometimes not easy to give users good explanations of the results obtained with them, mainly because of the large number of changes in the state of neurons (and their weights) produced while solving a machine learning problem. There are currently few techniques to visualize, verbalize, or abstract HNNs. This paper outlines how automatic video generation systems can be constructed to explain their execution. This work constitutes a novel approach to obtaining explainable artificial intelligence systems in general, and HNNs in particular, building on the theory of data-to-text systems and software visualization approaches. We present a complete methodology for building these kinds of systems. A software architecture is also designed, implemented, and tested, and technical details of the implementation are explained. Finally, we apply our approach to create a complete explainer video about the execution of HNNs on a small recognition problem.
ARTICLE | doi:10.20944/preprints202009.0657.v1
Subject: Medicine & Pharmacology, Allergology Keywords: HIV; workplace intervention; SMS; HIV testing; construction; mobile phone; Covid-19; health promotion; text messaging
Online: 27 September 2020 (03:02:41 CEST)
Background: HIV poses a threat to global health. With effective treatment options available, education and testing strategies are essential in preventing transmission. Text messaging is an effective tool for health promotion and can be used to target higher-risk populations. This study reports on the design, delivery and testing of a mobile text messaging (SMS) intervention for HIV prevention and awareness, aimed at adults in the construction industry and delivered during the COVID-19 pandemic. Method: Participants were recruited at Test@Work workplace health promotion events (21 sites, n=464 employees), including health checks with HIV testing. Message development was based on a participatory design and included a focus group (n=9) and message fidelity testing (n=291) with assessment of intervention uptake, reach, acceptability, and engagement. Barriers to HIV testing were identified and mapped to the COM-B behavioural model. Twenty-three one-way push SMS messages (19 included short web links) were generated and fidelity tested, then sent via automated SMS to two employee cohorts over a 10-week period during the COVID-19 pandemic. Engagement metrics measured were: opt-outs, SMS delivered/read, number of clicks per web link, and four two-way pull messages exploring repeat HIV testing, learning new information, perceived usefulness and behaviour change. Results: 291 people participated (68.3% of eligible attendees). A total of 7,726 messages were sent between March and June 2020, with 91.6% successfully delivered (100% read). Over the 10 weeks, 12.4% of participants opted out. Links in delivered messages were clicked 14.4% of the time on average, with a maximum of 24.1% for HIV-related links. The number of clicks on web links declined over time (r= -6.24, p=0.01). The response rate for two-way pull messages was 13.7% of participants. Since the workplace HIV test offer at recruitment, 21.6% reported having taken a further HIV test.
Qualitative replies indicated behavioural influence of messaging on exercise, lifestyle behaviours and intention to HIV test. Conclusion: SMS messaging for HIV prevention and awareness is acceptable to adults in the construction industry, and has high uptake, low attrition and good engagement with message content when delivered during a global pandemic. Data collection methods may need refinement for the audience, and the effect of COVID-19 on results is yet to be understood.
ARTICLE | doi:10.20944/preprints202106.0196.v3
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Twitter; Social Media; Social Networking; Social Network Analytic; DistilBERT; Text Similarity; Natural Language Processing; Character Computing
Online: 17 February 2022 (13:15:23 CET)
Social media platforms have become an undeniable part of everyday life over the past decade. Analyzing the information being shared is a crucial step toward understanding human behavior. Social media analysis aims to provide a better user experience and higher user satisfaction. Before any further conclusions can be drawn, it is first necessary to know how to compare users. In this paper, a hybrid model is proposed that measures the similarity of Twitter profiles and quantifies their degree of likeness by computing features that reflect users’ behavioral habits. First, the timeline of each profile is extracted using the official Twitter API. Then, three aspects of a profile are considered in parallel. Behavioral ratios are time-series-related information showing the consistency and habits of the user; dynamic time warping is used to compare the behavioral ratios of two profiles. Next, the audience network is extracted for each user, and Jaccard similarity is used to estimate the similarity of the two sets. Finally, for content similarity measurement, the tweets are preprocessed according to the feature extraction method; TF-IDF and DistilBERT are employed for feature extraction, and the resulting features are compared using cosine similarity. Results show that TF-IDF performs slightly better; therefore, the more straightforward solution is selected for the model. In a case study on the similarity level of different profiles, a Random Forest classification model trained on almost 20,000 users achieved 97.24% accuracy. This comparison enables us to find duplicate profiles with nearly the same behavior and content.
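The three comparison measures named in the abstract above (dynamic time warping for behavioral time series, Jaccard similarity for audience sets, cosine similarity for content vectors) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the absolute-difference local cost in the DTW, and the toy inputs are all assumptions.

```python
import math


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (e.g. TF-IDF rows)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def dtw_distance(s, t):
    """Classic O(len(s) * len(t)) dynamic time warping with |x - y| as local cost."""
    n, m = len(s), len(t)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of: step in s only, step in t only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical inputs: follower sets, tiny content vectors, daily activity ratios.
audience_sim = jaccard({"ann", "bob", "cat"}, {"bob", "cat", "dan"})
content_sim = cosine([1.0, 2.0], [2.0, 4.0])
behavior_dist = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Note that DTW returns a distance (lower means more alike) while the other two return similarities, so a combined model has to convert one scale into the other before mixing them.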
REVIEW | doi:10.20944/preprints201708.0003.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: stylometry; author identification; author verification; author profiling; stylistic inconsistency; text analysis; supervised learning; unsupervised learning; classification; forensics
Online: 2 August 2017 (12:38:17 CEST)
Electronic text stylometry is a collection of forensics methods that analyze the writing styles of input electronic texts in order to extract information about their authors. Such extracted information could be the identity of the authors, or aspects of the authors, such as their gender, age group, ethnicity, etc. This survey paper presents the following contributions: 1) A description of all stylometry problems in probability terms, under a unified notation. To the best of our knowledge, this is the most comprehensive definition to date. 2) A survey of key methods, with particular attention to data representation (or feature extraction) methods. 3) An evaluation of 23,760 feature extraction methods, which is the most comprehensive evaluation of feature extraction methods in the literature of stylometry to date. The importance of this evaluation is twofold. First, it identifies the relative effectiveness of the features (since, currently, many are not evaluated jointly; e.g. syntactic n-grams are not evaluated against k-skip n-grams, and so forth). Second, thanks to our generalizations, we could evaluate novel grams, such as what we name compound grams. 4) The release of our associated Python feature extraction library, namely Fextractor. Essentially, the library generalizes all existing n-gram based feature extraction methods under the "at least l-frequent, dir-directed, k-skipped n-grams" formulation, and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as POS tags, as well as lower-level ones, such as distribution of function words, word shapes, etc. This makes the library, by far, the most extensive in this domain to date. 5) The construction, evaluation, and release of the first dataset for Emirati social media text. This evaluation represents the first evaluation of author identification against Emirati social media texts.
Interestingly, we find that, when using our models and feature extraction library (Fextractor), authors could be identified significantly more accurately than what is reported with similarly sized datasets. The dataset also contains sub-datasets that represent other languages (Dutch, English, Greek and Spanish), and our findings are consistent across them.
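To make the "k-skipped n-grams" and "at least l-frequent" ideas in the abstract above concrete, here is a small sketch. It is not the Fextractor API. The function names are hypothetical, and it assumes the common definition in which at most k tokens in total are skipped between the first and last element of a gram; the "dir-directed" aspect is not modelled.

```python
from collections import Counter
from itertools import combinations


def skip_grams(tokens, n, k):
    """All n-grams that skip at most k tokens in total (k=0 gives ordinary n-grams)."""
    grams = []
    for i in range(len(tokens)):
        # The last element may sit at most n - 1 + k positions after the first.
        window_end = min(len(tokens), i + n + k)
        for rest in combinations(range(i + 1, window_end), n - 1):
            grams.append(tuple(tokens[j] for j in (i,) + rest))
    return grams


def frequent_grams(grams, min_count):
    """Keep only grams occurring at least min_count times (an 'l-frequent' filter)."""
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}


tokens = "insurgents killed in ongoing fighting".split()
bigrams = skip_grams(tokens, n=2, k=0)
skip_bigrams = skip_grams(tokens, n=2, k=2)
```

The same functions work unchanged on sequences of POS tags or word shapes instead of words, which is one way grams can be "diversely defined".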
ARTICLE | doi:10.20944/preprints202103.0442.v1
Subject: Behavioral Sciences, Applied Psychology Keywords: dyslexia; reading; children; background colour; overlay colour; text colour; sensors; physiological parameters; EEG; ECG; EDA; eye tracking
Online: 17 March 2021 (14:31:47 CET)
Reading is one of the essential processes during the maturation of an individual. It is estimated that 5-10% of school-age children are affected by dyslexia, a reading disorder characterised by difficulties in the accuracy or fluency of word recognition. Many studies have reported that colour overlays and backgrounds could improve the reading process, especially in children with reading disorders. As dyslexia has neurobiological origins, the aim of the present research was to understand the relationship between physiological parameters and colour modifications in the text and background during reading in children with and without dyslexia. We measured differences in electroencephalography (EEG), heart rate variability (HRV), electrodermal activity (EDA), and eye movement in 36 school-age children (18 with dyslexia and 18 in the control group) during a reading task in 13 combinations of background and overlay colours. Our findings showed that the dyslexic children have higher values of reading duration, fixation count, average and total fixation duration, saccade count, and total and average saccade duration while reading on white and coloured background/overlay. It was found that the turquoise, turquoise O, and yellow colours are beneficial for dyslexic readers, as they achieved the shortest time duration during the reading tasks when these colours were used. Also, dyslexic children have higher values of beta and the whole range of EEG while reading in one particular colour (purple), as well as an increasing theta range while reading on the purple overlay colour. We observed no significant differences between HRV parameters on white colour, except for single colours (purple, turquoise overlay and yellow overlay) where the control group showed higher values for Mean HR, while dyslexic children scored higher with Mean RR.
Regarding the EDA measure, we found systematically lower values in children with dyslexia in comparison to the control group. Based on the present results, we can conclude that both warm and cold background/overlay colours are beneficial for both groups of readers, and that all sensor modalities could be used to better understand the neurophysiological origins of dyslexia in children.
ARTICLE | doi:10.20944/preprints202112.0417.v1
Subject: Earth Sciences, Geoinformatics Keywords: Cultural ecosystem services; urban green space management; Singapore; public participation geographic information system; social media text mining analysis
Online: 27 December 2021 (09:48:44 CET)
Cultural ecosystem services have become increasingly influential in both environmental research and policy decision-making, such as for urban green spaces. However, the popular definition conflates the concepts of ‘services’ and ‘benefits’, which has made it challenging for planners to employ it directly for urban green space management. One of the most widely used definitions of these non-tangible ecosystem services is “functions of environmental spaces and cultural activities which may then result in the enjoyment of cultural ecosystem benefits”; yet the latter have never found their way into official laws and regulations. In this study, via a case study in Singapore, we present new evidence to re-evaluate and re-position two of the most important emerging concepts in managing green spaces in urban areas. Using the transdisciplinary mixed methods of public participation GIS and social media text mining analysis, a wealth of cultural ecosystem services and their associated benefits were reported. This was especially so with regard to recreational and aesthetic services and experiential benefits. Recommendations to improve the park were also suggested, alongside methodological considerations for future research. Overall, this paper recommends employing the redefined cultural ecosystem services conceptual framework to generate relational, data-driven and actionable insights to better support urban green space management, which is useful not only to the Singapore government but also relevant worldwide.
Subject: Keywords: Textual data distributions; supervised learning; unsupervised learning; Kullback-Leibler divergence; sentiment; textual analytics; text generation; vaccine; stock market
Online: 17 June 2021 (10:03:41 CEST)
Efficient textual data distributions (TDD) alignment and generation are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study focuses on addressing a segment of the broader problem described above by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment and (ii) sentiment alignment. Furthermore, we use multiple text generation methods, including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally, we develop a unique process-driven variation of Kullback-Leibler divergence (KLD) application to TDD, named KL Textual Distributions Contrasts (KL-TDC), to identify the alignment of machine-generated textual corpora with naturally occurring textual corpora. This study thus identifies a unique approach for generating and validating TDD by topic and sentiment, which can be used to help address sparse data problems and other research, practice and classroom situations in need of artificially generated topic- or sentiment-aligned textual data.
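As background for the Kullback-Leibler divergence mentioned in the abstract above, the sketch below computes the plain KL divergence between two smoothed unigram distributions. It is not the KL-TDC procedure, whose details are not given here. The function names, the add-alpha smoothing, and the toy corpora are assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed word probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_w P(w) * log(P(w) / Q(w)), in nats.

    Assumes q[w] > 0 wherever p[w] > 0, which the smoothing above guarantees.
    """
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


# Toy corpora (made up): "natural" text versus "generated" text.
natural = "vaccine rollout brings hope and relief".split()
generated = "vaccine rollout brings stock market rally".split()
vocab = sorted(set(natural) | set(generated))

p = unigram_distribution(natural, vocab)
q = unigram_distribution(generated, vocab)
divergence = kl_divergence(p, q)
```

KL divergence is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ. Which corpus is treated as the reference is a design choice that should be stated.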
ARTICLE | doi:10.20944/preprints202011.0056.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: COVID-19; Deep Learning; Natural Language Processing; Topic Modelling; Text Classification; Latent Dirichlet allocation (LDA); Non-negative matrix factorization (NMF)
Online: 2 November 2020 (15:24:20 CET)
The ongoing COVID-19 pandemic has caused massive damage to various parts of the global economy, disrupting human livelihoods. Natural language processing has been used extensively in different organizations to categorize sentiment, make recommendations, summarize information, and model topics. This research aims to understand the non-medical impact of COVID-19 on the global economy by leveraging natural language processing. The methodology comprises text classification, including topic modelling, on an unstructured dataset of COVID-19 media articles provided by Anacode. Two natural language processing algorithms, Latent Dirichlet allocation (LDA) and Non-negative matrix factorization (NMF), are proposed to classify the media articles in order to analyze the impact of the COVID-19 pandemic on different sectors of the global economy. Model accuracy was assessed using the coherence and perplexity scores, which were 0.51 and -10.90 for the LDA algorithm. Both the LDA and NMF algorithms identified similar prevalent topics affected by the COVID-19 pandemic across multiple sectors of the economy. The intertopic distance map produced by the LDA algorithm suggests that general industries, which include children's schooling, parental care, and family gatherings, were affected most, followed by the business sector and the financial industry.
ARTICLE | doi:10.20944/preprints202105.0255.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: keystroke dynamics; typing pattern; keystroke data set; user authentication; user identification; free text typing; keystroke dynamics researches; keystroke analysis; biometrics; keystroke characteristics
Online: 11 May 2021 (15:50:34 CEST)
Identifying or authenticating a computer user is a necessary step to keep systems secure on the network and to prevent fraudulent users from accessing accounts. Keystroke dynamics authentication can be used as an additional authentication method. Keystroke dynamics involves in-depth analysis of how a user types on the keyboard, such as how long a key is pressed or the time between two consecutive keys. This field has seen continuous growth in scientific research. In the last five years alone, about 10,000 research papers in this field have been published. One of the main problems facing researchers is the small number of public data sets that record how users type on the keyboard. This paper aims to provide researchers with a data set of free-text typing by 80 users. The data were collected in a single session via a web platform. The data set contains 410,633 key events collected over a total time interval of almost 24 hours. In similar research, most data sets consist of texts written by users in English. The users in this research wrote in Romanian. This paper also provides an extensive analysis of the collected data set and presents information relevant to its analysis in future research.
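The two timing features described in the abstract above, how long a key is held and the time between consecutive keys, are often called dwell time and flight time. The sketch below derives both from a stream of key events. The event format `(key, action, timestamp_ms)` and the function name are assumptions for illustration and do not describe the format of the published data set.

```python
def dwell_and_flight(events):
    """Compute dwell and flight times from time-ordered (key, action, t_ms) events.

    dwell  = release time - press time of the same key
    flight = press time of the next key - release time of the previous key
             (negative when the next key is pressed before the previous is released)
    """
    pending = {}   # key -> press time of a key that is currently held down
    strokes = []   # completed keystrokes as (key, press_time, release_time)
    for key, action, t in events:
        if action == "down":
            pending.setdefault(key, t)  # keep the first press, ignore auto-repeat
        elif action == "up" and key in pending:
            strokes.append((key, pending.pop(key), t))

    strokes.sort(key=lambda s: s[1])  # order keystrokes by press time
    dwell = [(key, up - down) for key, down, up in strokes]
    flight = [(a[0], b[0], b[1] - a[2]) for a, b in zip(strokes, strokes[1:])]
    return dwell, flight


# Hypothetical events for typing "hi", timestamps in milliseconds.
events = [("h", "down", 0), ("h", "up", 90), ("i", "down", 150), ("i", "up", 230)]
dwell, flight = dwell_and_flight(events)
```

Per-user statistics of these values (for example the mean dwell time per key) are typical inputs to an authentication model.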
ARTICLE | doi:10.20944/preprints202111.0023.v1
Subject: Engineering, Other Keywords: Twitter; Social Media Analysis; User Behavior Mining; Crime Detection; Feature Extraction; Graph Analysis; Natural Language Processing; Text Classification; Aspect-based Sentiment Analysis; DistilBERT
Online: 1 November 2021 (15:25:19 CET)
Maintaining a healthy cyber society is a big challenge because of users’ freedom of expression and behavior. It can be addressed by monitoring and analyzing users’ behavior and taking proper action. This research presents a platform that monitors public content on Twitter by extracting tweet data. Once the data are stored, users’ interactions are analyzed using Graph Analysis methods. Users’ behavioral patterns are then analyzed by applying Metadata Analysis, in which the timeline of each profile is obtained and the time-series behavioral features of users are investigated. Next, in the Abnormal Behavior Detection Filtering component, profiles of interest are selected for further examination. Finally, in the Contextual Analysis component, the content is analyzed using natural language processing techniques: a binary text classification model (SVM + TF-IDF with 88.89% accuracy) detects whether a tweet is related to crime. A sentiment analysis method is then applied to the crime-related tweets to perform aspect-based sentiment analysis (DistilBERT + FFNN with 80% accuracy), because sharing positive opinions about a crime-related topic can threaten society. This platform aims to provide the end user (the police) with suggestions for controlling hate speech or terrorist propaganda.
ARTICLE | doi:10.20944/preprints201811.0149.v1
Subject: Arts & Humanities, Religious Studies Keywords: Confidence tests, dictations, Jesus Christ, Maria Valtorta, mystics, punctuation marks, readability index, sentences, semantic index, syntactic index, text characters, Virgin Mary, visions, words, word interval.
Online: 7 November 2018 (09:06:01 CET)
We have studied the very large body of literary works written by the Italian mystic Maria Valtorta to assess similarities and differences in her writings, because she claims that most of them are due to mystical visions. We have used mathematical and statistical tools developed specifically for studying deep linguistic aspects of texts. The general trend indicates that the literary works explicitly attributable to Maria Valtorta differ significantly from her other literary works, which she attributes to the alleged characters Jesus and Mary. Mathematically, they seem to have been written by different authors. The comparison with Italian literature is very striking: a single author, namely Maria Valtorta, seems to be able to write texts so diverse as to cover the entire mathematical range (suitably defined) of the Italian literature of seven centuries.
ARTICLE | doi:10.20944/preprints202107.0200.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: image quality assessment; image quality metrics; NR-IQAs; D-IQA; OCR accuracy; OCR prediction; OCR improvements; visual aids; visually impaired; reading aids; document images; text-based images
Online: 8 July 2021 (13:21:49 CEST)
For visually impaired people (VIPs), the ability to convert text to sound can mean a new level of independence or the simple joy of a good book. With significant advances in Optical Character Recognition (OCR) in recent years, a number of reading aids are appearing on the market. These reading aids convert images captured by a camera to text, which can then be read aloud. However, all of these reading aids suffer from a key issue – the user must be able to visually target the text and capture an image of sufficient quality for the OCR algorithm to function – no small task for VIPs. In this work, a Sound-Emitting Document Image Quality Assessment metric (SEDIQA) is proposed, which allows the user to hear the quality of the text image and automatically captures the best image for OCR accuracy. This work also includes testing of OCR performance against image degradations, to identify the most significant contributors to accuracy reduction. The proposed No-Reference Image Quality Assessor (NR-IQA) is validated alongside established NR-IQAs, and this work includes insights into the performance of these NR-IQAs on document images.