ARTICLE | doi:10.20944/preprints201812.0114.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: directional encoding mask; selective attention network; supervised learning; horizontal and vertical text recognition
Online: 11 December 2018 (07:24:04 CET)
Recent state-of-the-art scene text recognition methods have primarily focused on horizontal text in images. However, in several Asian countries, including China, large amounts of text in signs, books, and TV commercials are vertically directed. Because horizontal and vertical texts exhibit different characteristics, an algorithm that can simultaneously recognize both types of text in real environments is necessary. To address this problem, we adopted the direction encoding mask (DEM) and selective attention network (SAN) methods based on supervised learning. The DEM contains directional information to compensate for cases that lack text-direction cues; our network is trained with this information to handle vertical text. The SAN method is designed to work individually for both types of text. To train the network to recognize both types of text and to evaluate the effectiveness of the designed model, we prepared a new synthetic vertical text dataset and collected an actual vertical text dataset (VTD142) from the Web. Using these datasets, we show that the proposed model accurately recognizes both vertical and horizontal text and achieves state-of-the-art results on benchmark datasets, including Street View Text (SVT), IIIT-5K, and ICDAR. Although our model is relatively simple compared to its predecessors, it maintains accuracy and is trained in an end-to-end manner.
ARTICLE | doi:10.20944/preprints202206.0426.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: event-based vision; object detection and tracking; high-temporal resolution tracking; frame-based vision; hybrid approach
Online: 30 June 2022 (09:54:14 CEST)
Event-based vision is an emerging field of computer vision that offers unique properties such as asynchronous visual output, high temporal resolution, and dependence on brightness changes to generate data. These properties can enable robust high-temporal-resolution object detection and tracking when combined with frame-based vision. In this paper, we present a hybrid, high-temporal-resolution object detection and tracking approach that combines learned and classical methods using synchronized images and event data. Off-the-shelf frame-based object detectors are used for initial object detection and classification. Then, event masks, generated for each detection, are used to enable inter-frame tracking at varying temporal resolutions using the event data. Detections are associated across time using a simple low-cost association metric. Moreover, we collect and label a traffic dataset using the hybrid DAVIS 240c sensor. This dataset is used for quantitative evaluation with state-of-the-art detection and tracking metrics. We provide ground-truth bounding boxes and object IDs for each vehicle annotation. Further, we generate high-temporal-resolution ground-truth data to analyze tracking performance at different temporal rates. Our approach shows promising results, with minimal performance deterioration at higher temporal resolutions (48-384 Hz) compared with the baseline frame-based performance at 24 Hz.
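The abstract does not spell out the low-cost association metric, so the sketch below uses intersection-over-union (IoU) with greedy matching as one plausible illustrative choice; the function names and threshold are assumptions, not the paper's code:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(prev_tracks, detections, threshold=0.3):
    """Greedily match each existing track to its best-overlapping detection."""
    matches, used = {}, set()
    for track_id, box in prev_tracks.items():
        best_j, best_score = None, threshold
        for j, det in enumerate(detections):
            if j in used:
                continue
            score = iou(box, det)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matches[track_id] = best_j
            used.add(best_j)
    return matches
```

Running such an association at event-mask rate rather than frame rate is what allows tracking between the 24 Hz frames.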
ARTICLE | doi:10.20944/preprints201805.0143.v1
Subject: Mathematics & Computer Science, General & Theoretical Computer Science Keywords: depth-image-based rendering (DIBR); 3D content; curvelet transform; 1D-discrete cosine transform (1D-DCT); template watermark; DIBR watermarking
Online: 9 May 2018 (09:00:10 CEST)
Several depth-image-based rendering (DIBR) watermarking methods have been proposed, but they have various drawbacks, such as non-blindness, low imperceptibility, and vulnerability to signal or geometric distortion. This paper proposes a template-based DIBR watermarking method that overcomes the drawbacks of previous methods. The proposed method exploits two properties to resist DIBR attacks: pixels are only moved horizontally by DIBR, and small blocks are not distorted by DIBR. The one-dimensional (1D) discrete cosine transform (DCT) and curvelet domains are adopted to exploit these two properties. A template is inserted in the curvelet domain to restore the synchronization error caused by geometric distortion. A watermark is inserted in the 1D DCT domain to embed and detect a message in the DIBR image. Experimental results show that the proposed method achieves high imperceptibility and robustness to various attacks, such as signal and geometric distortions. The proposed method is also robust to DIBR distortion and DIBR configuration adjustments, such as depth-image preprocessing and baseline distance adjustment.
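As an illustration of operating in the 1D DCT domain, the sketch below implements a naive DCT-II/DCT-III pair and a toy bit embedding that modulates the sign of one mid-frequency coefficient; the coefficient index and strength are illustrative assumptions, not the paper's actual embedding rule:

```python
import math

def dct_1d(x):
    """Naive 1D DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct_1d(X):
    """Scaled DCT-III inverse, so that idct_1d(dct_1d(x)) recovers x."""
    N = len(X)
    return [(X[0] / 2 + sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                            for k in range(1, N))) * 2 / N for n in range(N)]

def embed_bit(row, bit, k=3, strength=5.0):
    """Toy embedding: force one mid-frequency coefficient positive/negative."""
    X = dct_1d(row)
    X[k] = strength if bit else -strength
    return idct_1d(X)

def extract_bit(row, k=3):
    """Blind extraction: read back the sign of the modulated coefficient."""
    return 1 if dct_1d(row)[k] > 0 else 0
```

Embedding row-wise in the 1D transform matches the property named above that DIBR moves pixels only horizontally.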
ARTICLE | doi:10.20944/preprints202004.0032.v1
Subject: Earth Sciences, Geoinformatics Keywords: indoor positioning system; image-based positioning system; computer vision; SIFT; feature detection; feature description; cell phone camera; PnP problem; projection matrix; epipolar geometry; OpenCV
Online: 3 April 2020 (11:59:48 CEST)
As people grow accustomed to effortless outdoor navigation, there is a rising demand for similar capabilities indoors. Unfortunately, indoor localization, one of the necessary requirements for navigation, remains a problem without a clear solution. In this article, we propose a method for an indoor positioning system that uses a single image. This is made possible with a small preprocessed database of images with known control points as the only preprocessing needed. Using feature detection with the SIFT algorithm, we search the database for the image most similar to the one taken by the user. This image pair is then used to find the coordinates of the database image by solving the PnP problem. Furthermore, the projection and essential matrices are determined, allowing the user image to be localized, i.e., the position of the user in the indoor environment to be determined. The benefit of this approach lies in the single image being the only input from the user and in requiring no new on-site infrastructure, which enables simpler deployment for building management.
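The geometry behind this pipeline can be illustrated with the forward projection model that the PnP solver inverts, u ~ K[R|t]X; the intrinsics and pose below are made-up illustrative values, and in practice `cv2.solvePnP` would recover the pose from such 2D-3D correspondences:

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],   # intrinsics: focal fx, principal cx
              [0.0, 800.0, 240.0],   # fy, cy
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # camera aligned with world axes
t = np.array([[0.0], [0.0], [4.0]])  # camera 4 units from the scene

P = K @ np.hstack([R, t])            # 3x4 projection matrix

def project(P, X):
    """Project a 3D world point to pixel coordinates."""
    Xh = np.append(X, 1.0)           # homogeneous coordinates
    uvw = P @ Xh
    return uvw[:2] / uvw[2]          # perspective division

# With four or more such 2D-3D correspondences from the matched database
# image, cv2.solvePnP (or a DLT estimate of P) recovers R and t, i.e. the
# user's position in the building.
```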
ARTICLE | doi:10.20944/preprints202211.0249.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: urothelial carcinoma; urine; liquid-based cytology; deep learning; cancer screening; whole slide image
Online: 14 November 2022 (09:31:16 CET)
Urinary cytology is a useful, essential diagnostic method in routine urological clinical practice. Liquid-based cytology (LBC) for urothelial carcinoma screening is commonly used in routine clinical cytodiagnosis because of its high cell-collection rate. Since conventional screening by cytoscreeners and cytopathologists using microscopes is limited in terms of human resources, it is important to integrate new deep learning methods that can automatically and rapidly diagnose a large number of specimens without delay. The goal of this study was to investigate the use of deep learning models for the classification of urine LBC whole-slide images (WSIs) into neoplastic and non-neoplastic (negative) categories. We trained deep learning models on 786 WSIs using transfer learning, fully supervised, and weakly supervised learning approaches. We evaluated the trained models on two test sets (equal and clinical balance) with a combined total of 750 WSIs; the best model achieved ROC-AUCs for WSI diagnosis in the range of 0.984-0.990, demonstrating the promising potential of our model for aiding urine cytodiagnostic processes.
ARTICLE | doi:10.20944/preprints202106.0157.v1
Subject: Earth Sciences, Atmospheric Science Keywords: Land use and land cover; Classification; Object-based change detection; Multi-temporal image analysis; Landsat; Tiaoxi
Online: 7 June 2021 (09:27:22 CEST)
Changes in land use and land cover (LULC) are affected by both climate and human activity, and in turn affect climate, biological diversity, and human well-being. Accurate and timely information about LULC patterns and change is crucial for land-management decision-making, ecosystem monitoring, and urban planning, especially in developing economies undergoing industrialization, urbanization, and globalization. Biodiversity degradation and urban expansion in eastern China are research hotspots; however, the influence of LULC changes on the region remains largely unexplored. Here, an object-based, multi-temporal image analysis approach was developed to detect LULC changes during 1985-2015 in the Tiaoxi watershed (Zhejiang province, eastern China) using Landsat TM and OLI data. The main objective of this study is to improve the accuracy of unsupervised change detection from object-based, multi-temporal images. To this end, a total of seven LULC maps were generated from multi-temporal images. A random stratified sample design was used to assess change-detection accuracy. The proposed method achieved overall accuracies of 91.86%, 92.14%, 92.00%, and 93.86% for 2000, 2005, 2010, and 2015, respectively. The proposed method, in conjunction with object-oriented analysis and multi-temporal satellite images, thus offers a robust and flexible approach to mapping LULC changes that can support emergency response and government management. Urbanization and agricultural intensification are the main drivers of LULC change in the region. We anticipate that these freely available data will improve surface-forcing modeling, provide evidence of changes in LULC, and inform water-management decision-making.
REVIEW | doi:10.20944/preprints202010.0649.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: text mining; natural language processing; electronic health records; clinical text; machine learning
Online: 3 February 2021 (10:31:14 CET)
Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g. physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, it describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation at health systems and in industry.
ARTICLE | doi:10.20944/preprints202203.0329.v1
Subject: Mathematics & Computer Science, Analysis Keywords: Plagiarism Detection; Plagiarism checker for Bengali text; Bengali Literature Corpus; OCR in Bengali text
Online: 24 March 2022 (09:36:56 CET)
Plagiarism means taking another person's work without giving them credit for it. Plagiarism is one of the most serious problems in academia and among researchers. Although multiple tools are available to detect plagiarism in a document, most of them are domain-specific and designed to work on English texts, yet plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali literature has a history of 1300 years, yet most Bengali literature books have not been properly digitized. As no such corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India and, following a comprehensive methodology, extracted texts from them and constructed our corpus. Our experimental results show an average accuracy between 72.10% and 79.89% for text extraction using OCR. The Levenshtein distance algorithm is used to determine plagiarism. We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts. In the future, we aim to construct a corpus with more books for more accurate detection.
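For reference, the Levenshtein distance named above can be computed with the standard dynamic-programming recurrence; the normalized similarity score (and any plagiarism threshold applied to it) is an illustrative assumption, not the paper's exact scoring:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; high values suggest copied text."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Because the comparison is character-based, it works unchanged on Unicode Bengali script extracted by OCR.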
ARTICLE | doi:10.20944/preprints202003.0313.v3
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: object detection; faster region-based convolutional neural network (FRCNN); single-shot multibox detector (SSD); super-resolution; remote sensing imagery; edge enhancement; satellites
Online: 29 April 2020 (13:33:56 CEST)
The detection performance for small objects in remote sensing images has not been satisfactory compared with that for large objects, especially in low-resolution and noisy images. A generative adversarial network (GAN)-based model called enhanced super-resolution GAN (ESRGAN) has shown remarkable image enhancement performance, but its reconstructed images usually lack high-frequency edge information. Therefore, object detection performance degrades for small objects in recovered noisy and low-resolution remote sensing images. Inspired by the success of the edge-enhanced GAN (EEGAN) and ESRGAN, we applied a new edge-enhanced super-resolution GAN (EESRGAN) to improve the quality of remote sensing images and used different detector networks in an end-to-end manner, where the detector loss was backpropagated into the EESRGAN to improve detection performance. We proposed an architecture with three components: ESRGAN, an edge-enhancement network (EEN), and a detection network. We used residual-in-residual dense blocks (RRDB) for both the ESRGAN and the EEN; for the detector network, we used a faster region-based convolutional network (FRCNN; a two-stage detector) and a single-shot multibox detector (SSD; a one-stage detector). Extensive experiments on a public dataset (cars overhead with context) and a self-assembled satellite dataset (oil and gas storage tanks) showed the superior performance of our method compared with standalone state-of-the-art object detectors.
ARTICLE | doi:10.20944/preprints202110.0033.v1
Online: 4 October 2021 (08:58:52 CEST)
Antimicrobial resistance (AMR) is one of the top 10 threats affecting global health. According to the WHO, AMR defeats the effective prevention and treatment of infections caused by microbial pathogens, including bacteria, parasites, viruses, and fungi. Microbial pathogens have a natural tendency to evolve and mutate over time, resulting in AMR strains. The set of genes involved in antibiotic resistance, termed "antibiotic resistance genes" (ARGs), spreads through species by lateral gene transfer, thereby causing global dissemination. While this biological mechanism is prevalent in the spread of AMR, human practices also accelerate it through over-prescription, incomplete treatment, environmental waste, and other mechanisms. A considerable portion of the scientific community is engaged in AMR-related work, trying to discover novel therapeutic solutions for tackling resistant pathogens. A comprehensive inspection of the literature shows that diverse therapeutic strategies have evolved over recent years. Collectively, these strategies include novel small molecules, newly identified antimicrobial peptides, bacteriophages, phytochemicals, nanocomposites, and novel phototherapies against bacteria, fungi, and viruses. In this work, we have developed a comprehensive knowledgebase by collecting alternative antimicrobial therapeutic strategies from the literature. We used a subjective approach to mine new strategies, resulting in broad coverage of entities, and subsequently added objective data such as entity name, potency, and safety information. The extracted data were organized into KOMBAT (Knowledgebase Of Microbes' Battling Agents for Therapeutics). Many of these entries have been tested against AMR pathogens. We envision that this database will be valuable for developing future therapeutics against resistant pathogens. The database can be accessed at http://kombat.igib.res.in/.
ARTICLE | doi:10.20944/preprints202011.0646.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: social media; hate speech; text classification
Online: 25 November 2020 (14:12:07 CET)
The exponential increase in the use of the Internet and social media over the last two decades has changed human interaction. This has led to many positive outcomes, but at the same time it has brought risks and harms. Because the volume of harmful content online, such as hate speech, is not manageable by humans, interest in the academic community in investigating automated means of hate speech detection has increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset and classifying their content into three classes: abusive, hateful, or neither. We create a baseline model and improve its performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool that identifies and scores a page with an effectiveness metric in near-real time and uses the result as feedback to retrain our model. We demonstrate the competitive performance of our multilingual model on two languages, English and Hindi, leading to comparable or superior performance to most monolingual models.
ARTICLE | doi:10.20944/preprints201610.0012.v1
Online: 5 October 2016 (15:08:32 CEST)
Bio-molecular reagents such as antibodies required in experimental biology are expensive, and their effectiveness, among other things, is critical to the success of an experiment. Although such resources are sometimes donated by one investigator to another through personal communication, there is, to our knowledge, no previous study on the extent of such donations, nor a central platform that directs resource seekers to donors. In this paper, we describe what is, to our knowledge, a first attempt at building a web portal, titled Bio-Resource Exchange, that bridges this gap between resource seekers and donors in experimental biology. Users on this portal can request or donate antibodies, cell lines, and DNA constructs. This resource could also serve as a crowd-sourced database of resources for experimental biology. Further, in order to index donations outside our portal, we mined scientific articles to find instances of antibody donations and attempted to extract information about these donations at the finest granularity. Specifically, we extracted the name of the donor, his/her affiliation, and the name of the antibody for every donation by parsing the acknowledgements sections of articles. To extract annotations at this level, we propose two approaches: a rule-based algorithm and a bootstrapped relation-learning algorithm. The algorithms extracted donor names, affiliations, and antibody names with average accuracies of 57% and 62%, respectively. We also created a dataset of 50 expert-annotated acknowledgements sections that will serve as a gold-standard dataset for evaluating extraction algorithms in the future. Contact: email@example.com, firstname.lastname@example.org Database URL: http://tonks.dbmi.pitt.edu/brx Supplementary information: Supplementary data are available at Database online.
ARTICLE | doi:10.20944/preprints202204.0303.v1
Subject: Arts & Humanities, Other Keywords: Blogging; intercultural competence; international learning outcomes; reflective writing; reflection; text analysis; text mining; psycholinguistics; linguistic markers
Online: 29 April 2022 (13:07:15 CEST)
This study combines insights from psycholinguistics and text analysis to identify linguistic markers of intercultural competence (ICC) in student blogs about intercultural experiences. By combining holistic ICC frameworks with a more analytical approach at the text and word level, we were able to demonstrate that blogs with a high perceived level of ICC contain significantly more I-words, more insight words, and fewer quantifiers. These markers of ICC constitute concrete cues for teachers when assessing reflective writing assignments and allow them to pinpoint concrete areas for improvement in their feedback and interaction with students.
ARTICLE | doi:10.20944/preprints202008.0033.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Natural Language Processing; Suspicious Text Detection; Bengali Language Processing; Machine Learning; Text Classification; Feature Extraction; Suspicious Corpora
Online: 2 August 2020 (14:38:13 CEST)
Due to the substantial growth in internet users and spontaneous access via electronic devices, the amount of electronic content has grown enormously in recent years through instant messaging, social networking posts, blogs, online portals, and other digital platforms. Unfortunately, the misuse of technology has increased with this rapid growth of online content, leading to a rise in suspicious activities. People misuse web media to disseminate malicious content, conduct illegal activities, abuse other people, and publicize suspicious content on the web. Suspicious content is usually available in the form of text, audio, or video, with text being the medium used in most cases. Thus, one of the most challenging issues for NLP researchers is to develop a system that can efficiently identify suspicious text. In this paper, a machine learning (ML)-based classification model (hereafter called STD) is proposed to classify Bengali text into non-suspicious and suspicious categories based on its content. A set of ML classifiers with various features was applied to our developed corpus of 7000 Bengali text documents, of which 5600 were used for training and 1400 for testing. The performance of the proposed system is compared with a human baseline and existing ML techniques. An SGD classifier with tf-idf weighting over combined unigram and bigram features achieves the highest accuracy of 84.57%.
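A minimal sketch of the winning feature configuration, tf-idf over combined unigram and bigram features, is shown below in plain Python; the smoothing convention mirrors common implementations such as scikit-learn's and is an assumption about details the abstract does not give:

```python
import math
from collections import Counter

def ngrams(tokens):
    """Unigram plus bigram features, as in the best-performing setup."""
    feats = list(tokens)
    feats += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return feats

def tfidf(corpus):
    """Map each tokenized document to a tf-idf weighted feature dict."""
    docs = [Counter(ngrams(doc)) for doc in corpus]
    n = len(docs)
    df = Counter()
    for counts in docs:
        df.update(counts.keys())
    # Smoothed idf, matching the scikit-learn convention.
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    out = []
    for counts in docs:
        total = sum(counts.values())
        out.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return out
```

The resulting sparse vectors would then feed an SGD-trained linear classifier (e.g. scikit-learn's `SGDClassifier`) to separate suspicious from non-suspicious documents.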
ARTICLE | doi:10.20944/preprints202212.0478.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Datasets, Neural Networks, Hand Detection, Text Tagging
Online: 26 December 2022 (07:30:24 CET)
American Sign Language is a popular language among deaf individuals, for whom sign language eases communication. However, in today's digital era, these individuals need to be able to communicate online, and even to get help from technology when communicating in person with people who do not use sign language. This research presents a program able to translate American Sign Language into plain English. The study uses the OpenCV library to recognize hand signals and a trained model to identify images, so that the program can translate them into words and letters. The program uses a dataset of over 2000 images, which is, in this case, the largest dataset available. With over 90% accuracy, the result is a basic computer program, built on the largest dataset available, that enables users to communicate with a wide variety of words and expressions.
ARTICLE | doi:10.20944/preprints202211.0017.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: text-to-speech; naturalness; intelligibility; Brazilian Portuguese
Online: 1 November 2022 (04:37:04 CET)
This paper compares the performance of three text-to-speech (TTS) models released between June 2021 and January 2022 in order to establish a baseline for Brazilian Portuguese. These models were trained on a Brazilian Portuguese dataset. The experimental setup uses the TTS-Portuguese dataset to fine-tune the following TTS models: the VITS end-to-end model, and the Glow-TTS and Grad-TTS acoustic models, both using the HiFi-GAN vocoder. Performance metrics are divided into objective and subjective metrics. As subjective metrics, naturalness and intelligibility are measured based on the mean opinion score (MOS). Results show that the Grad-TTS + HiFi-GAN model achieved a naturalness MOS of 4.07, close to the performance of current commercial models.
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Semantic Complexity; Semantics; Text Complexity; Readability Formulae
Online: 6 September 2021 (13:33:34 CEST)
Simple measures often cannot capture deep complexity. In the case of semantic complexity, conventional readability formulas share a common style, a common sort of achievement, and common limitations: they lack a semantics-aware approach, and as a result a precise measurement of semantic complexity cannot be made. In this paper, we introduce DASTEX, a novel semantics-aware measure of the semantic complexity of text. With DASTEX, a new layer of complexity analysis is opened for NLP, cognitive, and computational tasks. The measure benefits from an intuitionistic underlying formal model that considers semantics as a lattice of intuitions. This yields a well-defined definition of the semantics of a text and its complexity. DASTEX is a practical analysis method built on this formal model, so a complete suite of idea, model, and method is provided, resulting in a simple yet deep measure of the semantic complexity of text. The proposed approach is evaluated in four experiments. The results show that DASTEX is capable of measuring the semantic complexity of text across six application tasks.
ARTICLE | doi:10.20944/preprints202010.0057.v1
Subject: Social Sciences, Accounting Keywords: multiclass classification; text mining; accounting control system
Online: 5 October 2020 (09:05:53 CEST)
Electronic invoicing has been mandatory for Italian companies since January 2019. Invoices are structured in a predefined XML template from which the reported information can be easily extracted and analyzed. The main aim of this paper is to exploit the information structured in electronic invoices to build an intelligent system that can facilitate accountants' work. More precisely, this contribution shows how part of the accounting process can be automated: all invoices sent or received by a company are classified into specific codes that represent the economic nature of the financial transactions. To classify the data contained in the invoices, a multiclass machine learning classification problem is proposed, using the invoice information as input variables to predict two different target variables, the account codes and the VAT codes, which together compose a general ledger entry. Different approaches are compared in terms of prediction accuracy. The best performance is achieved by taking the hierarchical structure of the account codes into account.
ARTICLE | doi:10.20944/preprints201802.0001.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: domain ontology; semantic analysis; linguistics, text resources
Online: 1 February 2018 (03:08:47 CET)
An ontology is a formalized representation of a problem area (PrA). Representing the PrA as a domain ontology is common in the development of intelligent software systems, where it serves as a knowledge base. The process of building an ontology is complex and requires an expert in the PrA, and a large number of researchers are working to solve this problem. The basis of our approach is a pipeline of different linguistic methods of text analysis. A set of rules we developed is used to build an ontology based on content analysis of a text resource. This article describes the method of building a domain ontology based on linguistic analysis of the content of text resources, presents an example of the proposed approach, and describes the architecture of our pipeline.
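The abstract does not list the developed rules, but a single hypothetical rule in the same spirit, mapping a copular "X is a/an Y" pattern to an is-a relation, can illustrate content-based ontology extraction; the pattern and triple format below are assumptions, not the paper's rule set:

```python
import re

# Hypothetical rule: a capitalized term followed by "is a/an" and a common
# noun yields an is-a (subclass) candidate for the domain ontology.
ISA_RULE = re.compile(r"\b([A-Z][\w-]*)\s+is\s+an?\s+([a-z][\w-]*)")

def extract_isa(text):
    """Return (subclass, 'is-a', superclass) triples found by the rule."""
    return [(m.group(1), "is-a", m.group(2)) for m in ISA_RULE.finditer(text)]
```

A real pipeline would chain many such rules after tokenization, POS tagging, and parsing, then merge the candidate triples into the ontology.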
ARTICLE | doi:10.20944/preprints202107.0277.v1
Subject: Medicine & Pharmacology, Oncology & Oncogenics Keywords: Cervical cancer; Pap smear test; whole slide image (WSI); feature pyramid network (FPN); global context aware (GCA); region based convolutional neural networks (R-CNN); Region Proposal Network (RPN).
Online: 12 July 2021 (23:05:34 CEST)
Cervical cancer is a worldwide public health problem with a high rate of illness and mortality among women. In this study, we propose a novel framework based on the Faster R-CNN-FPN architecture for the detection of abnormal cervical cells in cytology images from cancer screening tests. We extend the Faster R-CNN-FPN model by infusing deformable convolution layers into the feature pyramid network (FPN) to improve scalability. Furthermore, we introduce a global-context-aware module alongside the region proposal network (RPN) to enhance the spatial correlation between the background and the foreground. Extensive experiments with the proposed deformable and global-context-aware (DGCA) R-CNN were carried out using the cervical image dataset of the "Digital Human Body" Vision Challenge from the Alibaba Cloud TianChi platform. Performance evaluation based on the mean average precision (mAP) and the receiver operating characteristic (ROC) curve has demonstrated considerable advantages of the proposed framework. In particular, when combined with tagging of negative image samples using traditional computer-vision techniques, a 6-9% increase in mAP is achieved. The proposed DGCA-RCNN model has the potential to become a clinically useful AI tool for the automated detection of cervical cancer cells in whole-slide images of Pap smears.
ARTICLE | doi:10.20944/preprints202111.0023.v1
Subject: Engineering, Other Keywords: Twitter; Social Media Analysis; User Behavior Mining; Crime Detection; Feature Extraction; Graph Analysis; Natural Language Processing; Text Classification; Aspect-based Sentiment Analysis; DistilBERT
Online: 1 November 2021 (15:25:19 CET)
Maintaining a healthy cyber society is a big challenge due to users' freedom of expression and behavior. It can be addressed by monitoring and analyzing users' behavior and taking proper actions. This research presents a platform that monitors public content on Twitter by extracting tweet data. After the data are collected, users' interactions are analyzed using graph analysis methods. Users' behavioral patterns are then analyzed by applying metadata analysis, in which the timeline of each profile is obtained and the time-series behavioral features of users are investigated. Next, in the abnormal-behavior detection filtering component, profiles of interest are selected for further examination. Finally, in the contextual analysis component, the content is analyzed using natural language processing techniques: a binary text classification model (SVM + TF-IDF, with 88.89% accuracy) detects whether a tweet is crime-related. A sentiment analysis method is then applied to the crime-related tweets to perform aspect-based sentiment analysis (DistilBERT + FFNN, with 80% accuracy), since sharing positive opinions about a crime-related topic can threaten society. This platform aims to provide the end user (police) with suggestions to counter hate speech or terrorist propaganda.
ARTICLE | doi:10.20944/preprints202105.0601.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Mobile RPG; Big Data; Text Mining; Topic Modeling
Online: 25 May 2021 (10:21:36 CEST)
Because RPGs achieve high sales and profits, many developers have brought various RPGs to market, but the genre has shifted toward mass production characterized by sensational advertising, low quality, excessive charging, and similar content, which affects the game market and users' play experience. This paper studies ways to improve mobile RPGs by collecting and analyzing users' reviews crawled from the Google Play Store. Topic modeling based on text mining techniques and latent Dirichlet allocation (LDA) was used to extract meaningful information from the collected big data and visualize it. Objectively inferring users' opinions from their reviews and seeking ways to improve games can help make mobile RPGs that players continue to play.
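As a rough illustration of how LDA assigns review words to topics, here is a minimal collapsed Gibbs sampler in plain Python; in practice a library such as gensim's `LdaModel` would be used on the crawled reviews, and the corpus, hyperparameters, and iteration count below are toy assumptions:

```python
import random

def lda_gibbs(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenized documents."""
    random.seed(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Count tables: topics per document, words per topic, topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * V for _ in range(n_topics)]
    nk = [0] * n_topics
    z = []  # topic assignment of every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = random.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the current assignment
                ndk[d][k] -= 1; nkw[k][w2i[w]] -= 1; nk[k] -= 1
                # Full conditional p(z = t | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w2i[w]] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = random.uniform(0, sum(weights))
                k, acc = 0, weights[0]
                while r > acc:
                    k += 1
                    acc += weights[k]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1
    # Topic-word distributions (phi), one probability row per topic.
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(n_topics)]
    return vocab, phi
```

Inspecting the highest-probability words per row of `phi` gives the human-readable topics (e.g. graphics praise vs. monetization complaints) that such a study would visualize.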
ARTICLE | doi:10.20944/preprints202102.0120.v1
Subject: Social Sciences, Business And Administrative Sciences Keywords: Homepage words; Financial ratio; Text-mining; Balanced scorecard
Online: 3 February 2021 (15:07:40 CET)
(1) Background: The CEO message on a hospital homepage contains varied content such as the hospital's future vision, promises to customers, upgraded services, and public activities. It includes non-financial as well as financial information and provides useful insight not only into the organization's goals and vision but also into firm performance and future strategies. This study investigates associations between the CEO messages on hospital homepages and hospitals' financial status. We used the balanced scorecard framework to analyze which homepage content relates to the hospitals' various financial ratios. (2) Methods: We adopt a text mining method to extract significantly repeated keywords from the CEO messages on hospital websites and classify these keywords using the balanced scorecard framework. To examine the relationship between these keywords and hospital financial ratios, a t-test is conducted on differences in mean TF-IDF (Term Frequency-Inverse Document Frequency) values of the homepage content across the perspectives of the balanced scorecard framework. (3) Results: According to empirical results on 65 samples collected from local hospitals, there are significant relationships between the qualitative content of a hospital's homepage and quantitative financial ratios indicating profitability, activity, leverage, liquidity, and transfer to essential business fund (EBF) income. (4) Conclusions: The introduction section of a homepage is the part most accessible to customers; it contains the aims and ideals of hospitals and reflects their values and visions. In view of financial status, hospitals can either emphasize financial strength or focus on other areas to mask financial weakness.
This study underscores the importance of hospital website disclosure, from which the financial status of the hospital can be inferred. It also highlights the need for harmonization between quantitative data (financial statements) and qualitative data (CEO messages). (5) Implications: To the best of our knowledge, this paper is the first attempt to investigate the relation between the text of hospital homepages and hospital financial ratios through text mining and the balanced scorecard framework. Hospitals are a crucial part of a country's welfare and healthcare backbone. Nevertheless, in many countries the hospital sector remains a source of critical fiscal deficits due to ineffective and sloppy management. We expect the results of this paper to provide hospital managers with useful information.
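The t-test on mean TF-IDF values described in the Methods above can be illustrated with Welch's unequal-variance form; the keyword and the hospital groupings below are hypothetical:

```python
import math

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples,
    e.g. mean TF-IDF weights of a keyword in two hospital groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se2 = va / len(a) + vb / len(b)
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / (va ** 2 / (len(a) ** 2 * (len(a) - 1)) +
                     vb ** 2 / (len(b) ** 2 * (len(b) - 1)))
    return t, df

# Hypothetical mean TF-IDF weights of the keyword "innovation" on
# homepages of high-profitability vs low-profitability hospitals.
high = [0.42, 0.51, 0.38, 0.47, 0.44]
low = [0.21, 0.33, 0.27, 0.25, 0.30]
t, df = welch_t(high, low)  # t > 0: higher mean weight in `high`
```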
BRIEF REPORT | doi:10.20944/preprints201811.0527.v1
Subject: Medicine & Pharmacology, Nutrition Keywords: citation network analysis; text mining; nutrition intervention; cognition
Online: 21 November 2018 (13:50:28 CET)
Manual review of the extensive literature covering nutrition-based lifestyle interventions to promote healthy cognitive ageing has proved educative; however, data-driven techniques can better account for the large size of the literature (tens of thousands of potentially relevant publications to date) and the interdisciplinary spread of where relevant publications may be found. In this study, we present a new way to map the literature landscape on nutrition-based lifestyle interventions to promote healthy cognitive ageing. We applied a combination of citation network analysis and text mining to map the existing literature on nutritional interventions and cognitive health. Results indicated five overarching clusters of publications, which could be further deconstructed into a total of 35 clusters. These could be broadly distinguished by their focus on lifespan stages (e.g. infancy versus older age) and their specificity regarding nutrition (e.g. a narrow focus on iodine deficiency versus a broad focus on weight gain). Rather than concentrating in a single cluster, interventions were present throughout the majority of the research. We conclude that a data-driven map of the nutritional intervention literature can benefit the design of future interventions by highlighting topics and themes that could be synthesized across currently disconnected clusters of publications.
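Citation network analysis of the kind described above starts from a publication graph. As a minimal sketch, the coarsest clustering, connected components of a citation edge list, can be computed with union-find (the paper's actual clusters would come from modularity-style methods; papers and links here are invented):

```python
def components(edges):
    """Connected components of an undirected citation graph via
    union-find; each component is one group of linked publications."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Hypothetical citation links: two disconnected literatures, e.g.
# infant-nutrition papers vs iodine-deficiency papers.
edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
groups = components(edges)  # two clusters: {p1,p2,p3} and {p4,p5}
```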
ARTICLE | doi:10.20944/preprints201811.0206.v1
Subject: Mathematics & Computer Science, Other Keywords: Biomedical libraries; author’s confidence; writing styles; text analysis
Online: 8 November 2018 (11:01:24 CET)
In an era when the medical literature is growing daily, researchers in biomedical and clinical areas have joined efforts with language engineers to analyze large amounts of biomedical and molecular biology literature (such as PubMed), patient data, and health records. With such a huge number of reports, evaluating their impact has long ceased to be a trivial task. In this context, this paper introduces a non-scientific factor that represents an important element in the effort to gain acceptance for claims. We postulate that the confidence authors express in their work plays an important role in shaping the first impression that influences readers' perception of a paper. The results discussed in this paper are based on a series of experiments run on data from the Open Archives Initiative (OAI) corpus, which provides interoperability standards to facilitate the effective dissemination of content. This method can be useful to the direct beneficiaries (authors engaged in medical or academic research) but also to researchers in fields such as BioNLP and NLP.
ARTICLE | doi:10.20944/preprints201810.0338.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: text classification; topic modelling; latent semantic analysis; latent dirichlet allocation; hierarchical sentiment dictionary; contextually-oriented hierarchical corpus; text tonality; evaluation
Online: 16 October 2018 (07:55:35 CEST)
This research presents a methodology for improving the accuracy of text classification by modelling latent semantic relations (LSR). Its aim is to eliminate the limitations of discriminant and probabilistic methods for revealing LSR and to customize the text classification process for more accurate recognition of text tonality, using knowledge of a text's hierarchical semantic context in the form of a corpora-based hierarchical sentiment dictionary. The main scientific contribution is the following set of approaches for improving the qualitative characteristics of the text classification process: combining discriminant and probabilistic methods, which reduces the influence of each method's limitations on the LSR-revealing process; treating each document as a complex structure, so that documents can be estimated integrally through separate classification of topically complete textual components (paragraphs); and taking into account the features of argumentative documents (reviews), which allows the author's subjective evaluation of text tonality to be used in developing the text classification methodology. Tonality expressed by a review's author has a significant, but not critical, effect on the qualitative indicators of sentiment recognition.
ARTICLE | doi:10.20944/preprints202208.0451.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: text splitting; text tokenization; transfer learning; mask-fill prediction; NLP linguistic rules; missing punctuations; cross-lingual BERT model; Masked Language Modeling
Online: 26 August 2022 (05:19:39 CEST)
Long unpunctuated texts containing complex linguistic sentences are a stumbling block to processing any low-resource language. Thus, approaches that segment lengthy texts with no proper punctuation into simple candidate sentences are a vitally important preprocessing step in many hard-to-solve NLP applications. In this paper, we propose PDTS, a punctuation detection approach for segmenting Arabic text, built on top of a multilingual BERT-based model and some generic linguistic rules. Furthermore, we showcase how PDTS can be effectively employed as a text tokenizer for unpunctuated documents (i.e., mimicking transcribed audio-to-text documents). Experimental findings across two evaluation protocols (an ablation study and a human-based judgment) demonstrate that PDTS is practically effective in both performance quality and computational cost.
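The "generic linguistic rules" component of an approach like PDTS can be caricatured as splitting before discourse markers. The marker list and example below are hypothetical, and the actual system combines such rules with BERT mask-fill predictions:

```python
import re

# Illustrative (hypothetical) discourse markers at which an
# unpunctuated stream may be split into candidate sentences.
MARKERS = r"\b(and then|however|therefore|because|moreover)\b"

def candidate_segments(text):
    """Return candidate sentences by splitting before each marker."""
    parts = re.split(MARKERS, text)  # capture group keeps the markers
    segments, current = [], parts[0].strip()
    for marker, chunk in zip(parts[1::2], parts[2::2]):
        if current:
            segments.append(current)
        current = (marker + " " + chunk.strip()).strip()
    if current:
        segments.append(current)
    return segments

text = ("the system failed to start because the config file was missing "
        "however a default file restored normal operation")
segs = candidate_segments(text)  # three candidate sentences
```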
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Text summarization; Fine-tuning; Transformers; SMS; Gateway; French Wikipedia.
Online: 14 September 2021 (10:48:55 CEST)
Text summarization remains a challenging task in the Natural Language Processing field despite the plethora of applications in enterprises and daily life. One common use case is the summarization of web pages, which has the potential to provide an overview of a page to devices with limited features. In fact, despite the increasing penetration of mobile devices in rural areas, most of these devices offer limited features, and such areas often have only limited connectivity, such as the GSM network. Summarizing web pages into SMS therefore becomes an important way to deliver information to limited devices. This work introduces WATS-SMS, a T5-based French Wikipedia Abstractive Text Summarizer for SMS, built through a transfer learning approach. The English pre-trained T5 model is used to produce a French text summarization model by retraining it on 25,000 Wikipedia pages; the result is then compared with different approaches from the literature. The objective is twofold: (1) to check the assumption made in the literature that abstractive models provide better results than extractive ones; and (2) to evaluate the performance of our model against other existing abstractive models. A score based on ROUGE metrics gave us a value of 52% for articles up to 500 characters, against 34.2% for transformer-ED and 12.7% for seq2seq-attention, and a value of 77% for longer articles, against 37% for transformers-DMCA. Moreover, an architecture including a software SMS gateway has been developed to allow owners of mobile devices with limited features to send requests and receive summaries through the GSM network.
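The ROUGE scores reported above are based on n-gram overlap between a generated and a reference summary; a minimal sketch of ROUGE-1 (the summary pair is invented):

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1 recall, precision and F1: unigram overlap between a
    generated summary and a reference summary (clipped counts)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum(min(c[w], r[w]) for w in c)
    recall = overlap / sum(r.values())
    precision = overlap / sum(c.values())
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

# Invented example pair.
reference = "the eiffel tower is a landmark in paris"
candidate = "the eiffel tower is in paris"
recall, precision, f1 = rouge1(candidate, reference)
# recall = 6/8, precision = 6/6
```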
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Amharic script; Attention mechanism; OCR; Encoder-decoder; Text-image
Online: 15 October 2020 (13:42:28 CEST)
At present, the growth of digitization and worldwide communication makes OCR systems for exotic languages a very important task. In this paper, we develop an OCR system for one such language with a unique script, Amharic. Motivated by the recent success of the attention mechanism in Neural Machine Translation (NMT), we extend the attention mechanism to Amharic text-image recognition. The proposed model consists of CNNs and attention-embedded recurrent encoder-decoder networks, integrated following the configuration of the seq2seq framework. The attention network parameters are trained in an end-to-end fashion, and the context vector, together with the previously predicted output, is injected at each decoding time step. Unlike existing OCR models that minimize the CTC objective function, the new model minimizes the categorical cross-entropy loss. The proposed attention-based model is evaluated on the test dataset from the ADOCR database, which consists of both printed and synthetically generated Amharic text-line images, and achieves promising results with CERs of 1.54% and 1.17%, respectively.
ARTICLE | doi:10.20944/preprints201812.0306.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: cymatics; text detection and recognition; optical character recognition (OCR)
Online: 25 December 2018 (13:52:31 CET)
This paper proposes an original approach to achieving a Cymatics-based visual perception of image-extracted text. In this context, an effective approach for automated text detection and recognition in natural scene images is proposed. The incoming image is first enhanced using CLAHE and DWT. The text regions of the enhanced image are then detected with the MSER feature detector, and non-text MSERs are removed using geometrical and contour-based filters. The remaining MSERs are grouped into words or phrases by finding similarities between them, and text recognition is performed with an OCR function. The extracted text is analysed sequentially, character by character; each character is converted into a methodical acoustic excitation, and these excitations are finally converted into systematic visual perceptions using the phenomenon of Cymatics. The system's functionality is tested with an experimental setup. For the natural scenes studied, the suggested approach achieves 80% precision in text localization and 53% precision in end-to-end text recognition. The devised principle is novel and can be employed in various applications such as visual art, encryption, education, and the inclusion of impaired people.
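One step in this pipeline, grouping the filtered MSERs into words by similarity, can be sketched as proximity-based merging of character bounding boxes; the box coordinates and the gap threshold below are hypothetical, not the similarity criteria used in the paper:

```python
def group_boxes(boxes, gap=15):
    """Group character bounding boxes (x, y, w, h) into words: boxes
    on roughly the same line whose horizontal gap is below a threshold
    are merged. A stand-in for similarity grouping of filtered MSERs."""
    boxes = sorted(boxes)  # left-to-right by x coordinate
    words, current = [], [boxes[0]]
    for box in boxes[1:]:
        px, py, pw, ph = current[-1]
        x, y, w, h = box
        same_line = abs(y - py) < max(ph, h) / 2
        if same_line and x - (px + pw) < gap:
            current.append(box)   # close enough: same word
        else:
            words.append(current)  # large gap: start a new word
            current = [box]
    words.append(current)
    return words

# Hypothetical character boxes for two short words on one line.
chars = [(0, 0, 10, 20), (12, 0, 10, 20),    # first word
         (60, 1, 10, 20), (72, 0, 10, 20)]   # second word (big gap)
words = group_boxes(chars)  # two groups of two boxes each
```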
ARTICLE | doi:10.20944/preprints202107.0200.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: image quality assessment; image quality metrics; NR-IQAs; D-IQA; OCR accuracy; OCR prediction; OCR improvements; visual aids; visually impaired; reading aids; document images; text-based images
Online: 8 July 2021 (13:21:49 CEST)
For Visually Impaired People (VIPs), the ability to convert text to sound can mean a new level of independence or the simple joy of a good book. With significant advances in Optical Character Recognition (OCR) in recent years, a number of reading aids are appearing on the market. These reading aids convert images captured by a camera to text which can then be read aloud. However, all of these reading aids suffer from a key issue: the user must be able to visually target the text and capture an image of sufficient quality for the OCR algorithm to function, which is no small task for VIPs. In this work, a Sound-Emitting Document Image Quality Assessment metric (SEDIQA) is proposed which allows the user to hear the quality of the text image and automatically captures the best image for OCR accuracy. This work also includes testing of OCR performance against image degradations to identify the most significant contributors to accuracy reduction. The proposed No-Reference Image Quality Assessor (NR-IQA) is validated alongside established NR-IQAs, and this work includes insights into the performance of these NR-IQAs on document images.
ARTICLE | doi:10.20944/preprints202210.0247.v1
Subject: Life Sciences, Other Keywords: Text-mining; ANDDigest; ANDSystem; Named entity recognition; Machine learning; PubMedBERT
Online: 18 October 2022 (04:29:17 CEST)
The body of scientific literature continues to grow annually. Over 1.5 million abstracts of biomedical publications were added to the PubMed database in 2021. Therefore, developing cognitive systems that provide a specialized search for information in scientific publications based on subject area ontology and modern artificial intelligence methods is urgently needed. We previously developed a web-based information retrieval system, ANDDigest, designed to search and analyze information in the PubMed database using a customized domain ontology. This paper presents an improved ANDDigest version that uses fine-tuned PubMedBERT classifiers to enhance the quality of short name recognition for molecular-genetics entities in PubMed abstracts on eight biological object types: cell components, diseases, side effects, genes, proteins, pathways, drugs, and metabolites. This approach increased average short name recognition accuracy by 13%. The new ANDDigest version (01.2022) has a web interface and is freely available to users at https://anddigest.sysbio.ru/.
ARTICLE | doi:10.20944/preprints202106.0482.v3
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: COVID-19 Infodemic; Text Classification; TFIDF Features; Network Training modes; Supervised Learning; Misinformation; News Classification; False Publications; PubMed; Anomaly Detection
Online: 26 July 2021 (12:06:04 CEST)
The spread of the Coronavirus pandemic has been accompanied by an infodemic. The false information embedded in the infodemic affects people's ability to access safety information and follow proper procedures to mitigate risk. This research targets the falsehood part of the infodemic, which proliferates prominently in news articles and false medical publications. Here, we present NeoNet, a novel supervised machine learning text mining algorithm that analyzes the content of a document (a news article or a medical publication) and assigns a label to it. The algorithm is trained on TF-IDF bigram features, which contribute to a network training model. It is tested on two real-world datasets: CBC news articles and COVID-19 publications. In five different fold comparisons, the algorithm predicted article labels with a precision of 97-99%. When compared with prominent algorithms such as neural networks, SVMs, and random forests, NeoNet surpassed them. The analysis highlights the promise of NeoNet in detecting disputed online content that may contribute negatively to the COVID-19 pandemic.
ARTICLE | doi:10.20944/preprints202103.0738.v1
Subject: Mathematics & Computer Science, Analysis Keywords: bibliometry; coronavirus; text and data mining; SARS; MERS; COVID-19
Online: 31 March 2021 (17:30:56 CEST)
A global event such as the COVID-19 crisis elicits new, often unexpected responses that are fascinating to investigate from both scientific and social standpoints. Despite several documented similarities, the Coronavirus pandemic is clearly distinct from the 1918 flu pandemic in terms of our exponentially increased, almost instantaneous ability to access and share information, offering an unprecedented opportunity to visualise the rippling effects of global events across space and time. Personal devices provide “big data” on people's movement, the environment, and economic trends, while the unprecedented flurry of scientific publications and media posts provides a measure of the educated world's response to the crisis. Most bibliometric (co-authorship, co-citation, or bibliographic coupling) analyses ignore the time dimension, but COVID-19 has made it possible to perform a detailed temporal investigation of the pandemic. Here, we report a comprehensive network analysis based on more than 20,000 published documents on viral epidemics, authored by over 75,000 individuals from 140 nations during the first year of the crisis. In contrast to the 1918 flu pandemic, access to published data over the past two decades enabled a comparison of publishing trends between the ongoing COVID-19 pandemic and the 2003 SARS epidemic, revealing changes in thematic foci and the societal pressures dictating research over the course of a crisis.
ARTICLE | doi:10.20944/preprints202103.0380.v1
Subject: Social Sciences, Accounting Keywords: COVID-19; pandemic crisis; crisis management; text mining; network analysis
Online: 15 March 2021 (12:34:01 CET)
This study aims to understand the global environment of COVID-19 management and guide future policy directions after the pandemic crisis. To this end, we analyzed a series of the World Economic Forum’s COVID-19 response reports through text mining and network analysis. These reports, written by experts in diverse fields, discuss multidimensional changes in socioeconomic situations, various problems created by those changes, and strategies to respond to national crises. Based on 3,897 refined words drawn from a morphological analysis of 26 reports (as of the end of 2020), this study analyzes the frequency of words, the relationships among words, the importance of specific documents, and the connection centrality through text mining. In addition, network analysis helps develop strategies for sustainable response to and management of national crises through identifying clusters of words with a similar structural equivalence.
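The connection-centrality analysis described above can be sketched as degree centrality on a word co-occurrence network, where an edge links two words appearing in the same document; the toy "report" sentences are invented:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_centrality(docs):
    """Build a word co-occurrence network (edge = two words appear in
    the same document) and return degree centrality per word."""
    neighbors = defaultdict(set)
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    # Degree centrality: neighbor count normalized by (n - 1).
    return {w: len(nb) / (n - 1) for w, nb in neighbors.items()}

# Invented sentences; 'recovery' co-occurs with every other word.
docs = [
    "recovery economy jobs".split(),
    "recovery health vaccine".split(),
    "recovery economy trade".split(),
]
centrality = cooccurrence_centrality(docs)
hub = max(centrality, key=centrality.get)  # most connected word
```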
ARTICLE | doi:10.20944/preprints201809.0466.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: topological data analysis; text mining; computational topology; style; persistent homology
Online: 24 September 2018 (15:33:02 CEST)
Topological Data Analysis (TDA) refers to a collection of methods that find the structure of shapes in data. Although TDA methods have recently been used in many areas of data mining, they have not been widely applied to text mining tasks. In most text processing algorithms, the order in which different entities appear or co-appear is lost. Assuming these lost orders are informative features of the data, TDA may help close the resulting gap in the state of the art of text processing. The topology of different entities across a textual document may reveal additional information about the document that is not reflected in features from traditional text processing methods. In this paper, we introduce a novel approach that employs TDA in text processing to capture and use the topology of same-type entities in textual documents. First, we show how to extract topological signatures from text using persistent homology, a TDA tool that captures the topological signature of a data cloud. Then we show how to utilize these signatures for text classification.
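Persistent homology, the TDA tool named above, can be illustrated in dimension 0, where the persistence of connected components reduces to union-find over edges sorted by length (higher-dimensional homology, as needed in practice, requires a dedicated library such as Ripser; the point cloud below is invented):

```python
import math

def h0_persistence(points):
    """0-dimensional persistent homology of a point cloud: grow balls
    around the points and record the distance at which two connected
    components merge. Each merge kills one component; the list of
    'death' distances is returned in increasing order."""
    edges = sorted(
        (math.dist(p, q), i, j)
        for i, p in enumerate(points)
        for j, q in enumerate(points[i + 1:], i + 1)
    )
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at scale d
    return deaths

# Hypothetical (sentence index, token index) positions of one entity
# type in a document: two tight clusters far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
deaths = h0_persistence(points)
# Two short-lived components (deaths at 1.0) plus one long-lived
# component that survives until scale 10.0.
```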
REVIEW | doi:10.20944/preprints201607.0012.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: role-based access control; attribute-based access control; attribute-based encryption
Online: 8 July 2016 (10:12:21 CEST)
Cloud Computing is a promising and emerging technology that is rapidly being adopted by many IT companies due to a number of benefits that it provides, such as large storage space, low investment cost, virtualization, resource sharing, etc. Users are able to store a vast amount of data and information in the cloud and access it from anywhere, anytime on a pay-per-use basis. Since many users are able to share the data and the resources stored in the cloud, there arises a need to provide access to the data only to those users who are authorized to access it. This can be done through access control schemes, which allow authenticated and authorized users to access the data and deny access to unauthorized users. In this paper, a comprehensive review of the existing access control schemes is presented along with analysis.
REVIEW | doi:10.20944/preprints202212.0064.v1
Subject: Medicine & Pharmacology, Psychiatry & Mental Health Studies Keywords: Natural language processing; NLP; Text mining; Suicide; Suicide-Ideation; Mental Health
Online: 5 December 2022 (07:34:30 CET)
Introduction: Around a million people are reported to die by suicide every year, and due to the stigma associated with this manner of death, the figure is usually assumed to be an underestimate. Suicide may be prevented if prompt intervention is taken to mitigate risk. Machine learning and artificial intelligence-based modelling, such as natural language processing (NLP) and other text analytics approaches, has the potential to become a major technique for the detection, diagnosis, and treatment of people who are suffering from mental health issues. The primary aims of this research are to determine whether NLP techniques have been utilised in the field of suicide prevention and, if so, whether they were effective and what their limitations were. Methods: PubMed, EMBASE, MEDLINE, PsycInfo, and Global Health databases were searched for studies that reported use of NLP for suicide ideation or self-harm. Thematic analysis was used to synthesise and analyse the included studies. Findings were reported using the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) statement, and the Mixed Methods Appraisal Tool (MMAT) was used to assess paper quality. Result: The preliminary search of five databases generated 387 results. Removal of duplicates resulted in 158 potentially suitable studies. Twenty papers were finally included in this review. Discussion: Studies show that combining structured and unstructured data in NLP data modelling yielded more accurate results than utilizing either alone. Also, to reduce suicides, people with mental health problems must be continuously and passively monitored. Further, NLP and other machine learning/artificial intelligence technologies can be used to address health inequities, and electronic health records provide valuable data for creating suicide risk tools. Finally, online, social media, and smartphone applications can be leveraged to detect people with suicide ideation.
Conclusion: The use of artificial intelligence and machine learning opens new avenues for considerably guiding risk prediction and advancing suicide prevention frameworks. The review's analysis of the included research revealed that the use of NLP may result in low-cost and effective alternatives to existing resource-intensive methods of suicide prevention. To summarise, there is substantial evidence that NLP is useful in identifying people who have suicide ideation.
ARTICLE | doi:10.20944/preprints202111.0344.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: pharmacological text corpus; automatic relation extraction; natural language processing; deep learning
Online: 19 November 2021 (10:40:10 CET)
Nowadays, analyzing virtual media to predict society's reaction to events or processes is a task of great relevance, especially where meaningful healthcare information is concerned. Internet sources contain a large amount of pharmacologically meaningful information useful for pharmacovigilance and drug repurposing. Analysis at such a scale demands methods that require a corpus with labeled relations among entities, and no such Russian-language datasets previously existed. This paper presents the first Russian-language dataset in which labeled entity pairs are divided into multiple contexts within a single text (by drugs used, by different users, by cases of use, etc.), together with a method based on the XLM-RoBERTa language model, previously trained on medical texts, that establishes state-of-the-art accuracy for identifying four types of relationships among entities: ADR–Drugname, Drugname–Diseasename, Drugname–SourceInfoDrug, and Diseasename–Indication. On the presented dataset from the Russian Drug Review Corpus, the developed method achieves an F1-score of 81.2% (obtained using cross-validation and averaged over the four relationship types), which is 7.8% higher than the baseline classifiers.
ARTICLE | doi:10.20944/preprints201904.0170.v1
Subject: Medicine & Pharmacology, Other Keywords: topic modelling; latent dirichlet allocation; text mining; assisted reproduction; ART; IVF
Online: 15 April 2019 (12:25:12 CEST)
Study question: What are the current trends of research in Human Assisted Reproduction around the world? Summary answer: The USA is the leading country, followed by the UK, China, France and Italy. The largest research area is “laboratory techniques”, although other areas such as “public health”, “quality, ethics and law” and “female factor” are gaining ground worldwide. What is known already: Scientific research, especially in health and medical sciences, aims at addressing specific needs that society (and, especially, patients) perceives as pressing. One of the main challenges for policymakers and research funders alike is therefore to align research priorities with societal needs. We can thus think of research agendas in terms of a demand side (societal needs) and a supply side (research outputs). Research output in Human Assisted Reproduction has expanded in the past years, as indicated by the increasing number of scientific publications in indexed journals in this area. Nevertheless, no map of research related to assisted reproduction has been produced so far, hindering the identification of potential areas of improvement and need. Study design, size, duration: 26,000+ scientific publications (articles, letters, and reviews) on Human Assisted Reproduction produced worldwide between 2005 and 2016 were analyzed. These publications were indexed in PubMed or obtained from the reference lists of indexed publications included in the analysis. Participants/materials, setting, methods: The corpus of publications was obtained by combining the MeSH terms “Reproductive techniques”, “Reproductive medicine”, “Reproductive health”, “Fertility”, “Infertility”, and “Germ cells”. It was then analyzed by means of text mining algorithms (Topic Modeling (TM) based on Latent Dirichlet Allocation (LDA)) to obtain the main topics of interest. Finally, these categories were analyzed across world regions and time.
Main results and the role of chance: We identified 44 main topics, which were further grouped into 11 macro categories, from largest to smallest: “laboratory techniques”, “male factor”, “quality, ethics and law”, “female factor”, “public health and infectious diseases”, “basic research and genetics”, “pregnancy complications and risks”, “general infertility and ART”, “psychosocial aspects”, “cancer”, and “research methodology”. The USA was the leading country in number of publications, followed by the UK, China, France and Italy. Interestingly, research content in high-income countries is fairly homogeneous across macro categories, dominated by “laboratory techniques” in Western and Southern Europe and by “quality, ethics and law” in North America, Australia and New Zealand. In middle-income countries we observe that research is mainly performed on “male factor”, and noticeably less on “female factor”. Finally, research on “public health and infectious diseases” predominates in low-income countries. Regarding the temporal evolution of research, “laboratory techniques” is the most abundant topic on a yearly basis and relatively constant over time. However, since production in most of the other categories is increasing, the relative contribution of this research category is actually decreasing. Publications are especially increasing in “public health and infectious diseases” (in all world regions, but especially in low-income countries), “quality, ethics and law” (high-income countries), and “female factor” (middle-income countries). Limitations, reasons for caution: Three main factors might limit the robustness of our work: the textual corpus analyzed is based on abstracts and titles; the stochastic algorithms applied may produce slightly differing results at each run, limiting reproducibility; and the topics obtained require interpretation.
Wider implications of the findings: This study should prove beneficial in the design of research strategies and policies that foster the alignment between supply (assisted reproduction research) and demand (society). Study funding/competing interest(s): PTQ-14-06718 of the Spanish MINECO Torres Quevedo programme (FAM).
ARTICLE | doi:10.20944/preprints201810.0678.v1
Subject: Medicine & Pharmacology, Pediatrics Keywords: post-operative death; unstructured data; logistic regression; text mining; surgery outcome
Online: 29 October 2018 (11:46:18 CET)
Text fields in electronic medical records (EMR) contain information on important factors that influence health outcomes; however, they are underutilized in clinical decision making due to their unstructured nature. We analyzed 6,497 inpatient surgical cases with 719,308 free-text notes from the Le Bonheur Children’s Hospital EMR. We used a text mining approach on preoperative notes to derive a text-based risk score predictive of death within 30 days of surgery. We then studied the additional performance obtained by including the text-based risk score as a predictor of death alongside clinical risk factors based on structured data. The C-statistic of a logistic regression model with 5-fold cross-validation significantly improved from 0.76 to 0.92 when text-based risk scores were included in addition to structured data. We conclude that preoperative free-text notes in EMR include significant information that can predict adverse surgery outcomes.
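The two-stage idea in this abstract (a text-derived risk score fed into a logistic model alongside structured data) can be sketched as follows; the keyword weights and coefficients are invented for illustration, not the values learned in the study:

```python
import math

# Hypothetical keyword weights for a text-based risk score (illustrative only;
# the paper derives its score from 719,308 preoperative notes via text mining).
KEYWORD_WEIGHTS = {"sepsis": 1.2, "ventilator": 0.9, "transfusion": 0.6, "routine": -0.4}

def text_risk_score(note: str) -> float:
    """Sum the weights of risk keywords found in a preoperative note."""
    tokens = note.lower().split()
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in tokens)

def mortality_probability(text_score: float, structured_score: float,
                          b0: float = -4.0, b1: float = 1.0, b2: float = 1.0) -> float:
    """Logistic model combining the text-based score with a structured-data score."""
    z = b0 + b1 * text_score + b2 * structured_score
    return 1.0 / (1.0 + math.exp(-z))

note = "patient on ventilator with suspected sepsis before surgery"
print(round(text_risk_score(note), 2))  # 2.1
```

A higher text score raises the predicted probability, mimicking how the added text predictor improves the C-statistic in the study.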
ARTICLE | doi:10.20944/preprints201708.0055.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: EMR; data preprocessing; text mining; information extraction; medical decision support system
Online: 15 August 2017 (05:46:43 CEST)
At present, medical institutes generally use EMRs to record patients’ conditions, including diagnostic information, procedures performed, and treatment results. EMRs have been recognized as a valuable resource for large-scale analysis. However, EMR data are diverse, incomplete, redundant, and privacy-sensitive, which makes it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and, in turn, the data mining results. Different types of data require different processing technologies. Structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. Semi-structured or unstructured data, such as medical text, contains richer health information but requires more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (Named Entity Recognition) and RE (Relation Extraction). In this paper, we introduce the process of EMR mining, including data collection, data preprocessing, data mining, evaluation, and knowledge application; analyze the current status of the key technologies, such as data preprocessing and data mining; and provide an overview of the application domains and prospects of EMR mining technologies. Finally, we summarize the existing problems in EMR mining research and review development trends.
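The NER step mentioned above can be illustrated with a minimal dictionary-lookup sketch; the entity lists are invented, and real EMR pipelines use trained statistical or neural models rather than exact matching:

```python
# Minimal dictionary-based NER for medical text (illustrative assumption:
# entities are single lowercase tokens found by exact lookup).
DISEASES = {"diabetes", "hypertension", "pneumonia"}
DRUGS = {"metformin", "lisinopril", "amoxicillin"}

def extract_entities(text: str):
    """Return (entity, label) pairs found by exact dictionary lookup."""
    entities = []
    for token in text.lower().replace(",", " ").split():
        if token in DISEASES:
            entities.append((token, "DISEASE"))
        elif token in DRUGS:
            entities.append((token, "DRUG"))
    return entities

record = "Patient with diabetes and hypertension, started on metformin"
print(extract_entities(record))
# [('diabetes', 'DISEASE'), ('hypertension', 'DISEASE'), ('metformin', 'DRUG')]
```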
ARTICLE | doi:10.20944/preprints202212.0495.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: computational creativity; literary sentences; automatic text generation; shallow parsing and deep learning.
Online: 26 December 2022 (15:53:39 CET)
In this paper, we introduce a model for the automatic generation of literary sentences in French. It is based on algorithms that we have previously used to generate sentences in Spanish and Portuguese, and on a new corpus of literary texts in French that we have constructed, called [FR]. Our automatic text generation algorithm combines language models, shallow parsing, and deep learning (artificial neural networks). We have also proposed and implemented a manual evaluation protocol to assess the quality of the artificial sentences generated by our algorithm, by testing whether they fulfill four simple criteria. We have obtained encouraging results from the evaluators for most of the desired features of our artificially generated sentences.
ARTICLE | doi:10.20944/preprints202209.0324.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Insurance; natural language processing; topic modelling; text analysis; complex networks; risk ranking
Online: 21 September 2022 (10:25:26 CEST)
The ability to identify and rank risk is essential for efficient and effective supervision of financial service firms, such as banks and insurers. Risk ranking ensures limited resources are allocated where they are most needed. Today, automatic risk identification within insurance supervision primarily relies on quantitative metrics based on numerical data (e.g. returns). The purpose of this work is to assess whether Natural Language Processing (NLP) and cognitive networks can achieve similar automated risk ranking and identification by analysing textual data, i.e. N=829 investor transcripts from Bloomberg. To this aim, this work explores and tunes three NLP techniques: (1) keyword extraction enhanced by cognitive network analysis; (2) valence/sentiment analysis; and (3) topic modelling. Results highlight that keyword analysis, enriched by term frequency-inverse document frequency scores and semantic framing through cognitive networks, could detect events of relevance to the insurance system, such as cyber-attacks or the COVID-19 pandemic. Cognitive networks were found to highlight events related to specific financial transitions: the semantic frame of "climate" grew in size by +538% between 2018 and 2020, outlining an increased awareness of climate change expressed by agents and insurers. A lexicon-based sentiment analysis achieved a Pearson’s correlation of ρ=0.16 (p<0.001, N=829) between sentiment levels and daily share prices. Although relatively weak, this finding indicates that insurance jargon is insightful for supporting risk supervision. Topic modelling is considered less amenable to supporting supervision, because of a lack of stability in its results and the intrinsic difficulty of interpreting risk patterns. We discuss how these automatic methods could complement existing supervisory tools in automated risk ranking.
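The reported sentiment/price relationship is a Pearson correlation, which can be computed from scratch; the sentiment and price series below are invented toy data, not the Bloomberg transcripts used in the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy daily sentiment levels and share prices (invented numbers).
sentiment = [0.1, 0.4, 0.2, 0.5, 0.3]
prices = [100.0, 104.0, 101.0, 103.0, 102.0]
print(round(pearson(sentiment, prices), 3))  # 0.9
```

The paper's ρ=0.16 is far weaker than this toy value, which is why the authors describe the signal as insightful but modest.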
REVIEW | doi:10.20944/preprints202205.0114.v1
Subject: Social Sciences, Business And Administrative Sciences Keywords: Blockchain Technology; Industry 4.0; Supply Chain Management; Text mining; Metaverse; Hashgraph, Baas.
Online: 9 May 2022 (10:01:43 CEST)
In the current business environment, firms are eager to adopt new technologies as they witness more and more successful business applications of technologies such as Big Data, Artificial Intelligence (A.I.), and Cloud Computing. As one of these disruptive technologies, Blockchain technology (BCT) is now drawing public attention owing to the cryptocurrency phenomenon (e.g., Bitcoin), for which Blockchain serves as the backbone technology. Given certain innovative features of BCT, especially its transparency, traceability, security, efficiency, confidentiality, and immutability, BCT holds out the promise of improving supply chain operational and financial efficiencies. Recently, the burgeoning of the Metaverse and Non-Fungible Tokens (NFTs) has raised the profile of BCT as a state-of-the-art technology still further. Motivated by the proliferating adoption of BCT, we conduct a holistic literature review focused on the status of research on BCT in the context of Supply Chain Management (SCM). In particular, this Blockchain-centered review surveys research to date on Blockchain applications in SCM. It provides a holistic review of (1) the functionality of BCT and its salient features; (2) the prevailing and potential applications of BCT; and (3) the business benefits and impact of BCT in SCM. Finally, we substantiate and highlight a variety of research directions.
ARTICLE | doi:10.20944/preprints202111.0208.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Technology analysis; Trend analysis; Patent keyword analysis; Text mining; Natural language processing
Online: 10 November 2021 (15:25:21 CET)
Thanks to the rapid development of artificial intelligence technology in recent years, artificial intelligence now contributes to many parts of society and has a very large impact on fields as diverse as education, the environment, medical care, the military, tourism, economics, and politics. For example, in the field of education, there are artificial intelligence tutoring systems that automatically assign tutors based on a student's level. In the field of economics, there are quantitative investment methods that automatically analyze large amounts of data to find investment rules, build investment models, or predict changes in financial markets. As artificial intelligence technology is used in such varied fields, it is very important to know exactly which factors have an important influence on each field and how the fields are related to one another. It is therefore necessary to analyze artificial intelligence technology in each field. In this paper, we analyze patent documents related to artificial intelligence technology. We propose a method for keyword analysis within factors using artificial intelligence patent data sets. The model relies on feature engineering based on a deep-learning model named KeyBERT together with a vector space model. A case study collecting and analyzing artificial intelligence patent data shows how the proposed model can be applied to real-world problems.
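KeyBERT ranks candidate keywords by the similarity of their embeddings to the document embedding; the ranking step can be sketched as below, with invented toy vectors standing in for real BERT embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy stand-ins for embeddings of a patent document and candidate keywords
# (real KeyBERT obtains these from a transformer model).
doc_vec = [0.9, 0.1, 0.3]
candidates = {"neural network": [0.8, 0.2, 0.3],
              "tourism": [0.1, 0.9, 0.0],
              "patent": [0.7, 0.1, 0.5]}

ranked = sorted(candidates, key=lambda k: cosine(candidates[k], doc_vec), reverse=True)
print(ranked[0])  # neural network
```

Keywords whose vectors point in the same direction as the document vector rank highest, which is the vector-space-model idea the abstract refers to.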
ARTICLE | doi:10.20944/preprints201812.0086.v4
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: multi-modal information fusion; video skimming; audio and text classification; keyframe extraction
Online: 5 August 2019 (03:48:49 CEST)
In this paper, we propose a novel approach to video skimming that exploits the fusion of video temporal information and keyword information extracted from multi-modal video data, including audio, text, and visual indices. In addition, we introduce brand-safety filtering and sentiment analysis in order to retain only user-friendly content in the video skim. In experiments using videos from the YouTube-8M dataset, we show that the proposed approach conserves the semantic content of the video far better than approaches that use only partial information from the video.
ARTICLE | doi:10.20944/preprints201802.0108.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Mandarin; prosody generation; linguistic feature; break prediction; text-to-speech; punctuation confidence
Online: 16 February 2018 (15:39:58 CET)
This paper proposes two fully automatic, machine-extracted linguistic features from unlimited text input for Mandarin prosody generation. One is the punctuation confidence (PC), which measures the likelihood of inserting a major punctuation mark (PM) at a word boundary. The other is the quotation confidence (QC), which measures the likelihood that a word string is quoted as a meaningful or emphasized unit in text. Because a major PM in a text is highly correlated with a prosodic break, and a quoted word string plays an important role in human language understanding, the two features could potentially provide useful information for prosody generation. The idea is first realized by employing conditional random field (CRF)-based models to predict major PMs, quoted word-string locations, and their associated confidences, i.e., the PC and the QC, for each word boundary. Then, the predicted punctuations and their confidences are combined with traditional contextual linguistic features to predict prosodic-acoustic features. Both objective and subjective tests showed that prosody generation with the proposed linguistic features performed better than generation without them. Thus, the proposed PC and QC are promising features for Mandarin prosody generation.
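The punctuation-confidence concept can be illustrated with a simple relative-frequency estimate; the paper computes PC with CRF models over rich features, whereas this sketch, on an invented toy corpus, only shows the underlying quantity being estimated:

```python
from collections import Counter

# Toy "training" sentences; a trailing comma marks a major punctuation mark (PM).
corpus = ["today, we meet", "today we leave", "today, it rains", "we meet today"]

follows_pm = Counter()
total = Counter()
for sentence in corpus:
    for token in sentence.split():
        word = token.rstrip(",.;")
        total[word] += 1
        if token != word:          # the token carried a punctuation mark
            follows_pm[word] += 1

def punctuation_confidence(word: str) -> float:
    """Empirical probability that a PM follows this word at a boundary."""
    return follows_pm[word] / total[word] if total[word] else 0.0

print(punctuation_confidence("today"))  # 2 of 4 occurrences -> 0.5
```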
ARTICLE | doi:10.20944/preprints202207.0090.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Text mining; natural language processing; sustainability; semantic similarity; corporate social responsibility; Machine Learning
Online: 6 July 2022 (08:53:02 CEST)
This paper investigates whether Corporate Social Responsibility (CSR) reports published by a selected group of Nordic companies are aligned with the Global Reporting Initiative (GRI) standards. To achieve this goal, several natural language processing and text mining techniques were implemented and tested. We extracted string, corpus, and hybrid semantic similarities from the reports and evaluated the models through an intrinsic assessment methodology. A quantitative ranking score based on index matching was developed to complement the semantic evaluation. The final results show that Latent Semantic Analysis (LSA) and Global Vectors for Word Representation (GloVe) are the best methods for our study. Our findings open the door to the automatic evaluation of sustainability reports, which could have a strong impact on the environment.
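An index-matching ranking score of the kind described could look roughly like the sketch below, which counts how many GRI disclosure codes a report mentions; the codes are real GRI standard numbers, but the matching rule itself is an illustrative assumption, not the paper's exact formula:

```python
# Illustrative subset of GRI disclosure codes (302 Energy, 305 Emissions,
# 401 Employment, 403 Occupational Health & Safety).
GRI_INDICES = {"GRI 302", "GRI 305", "GRI 401", "GRI 403"}

def index_matching_score(report_text: str) -> float:
    """Fraction of the tracked GRI indices explicitly mentioned in the report."""
    matched = {idx for idx in GRI_INDICES if idx in report_text}
    return len(matched) / len(GRI_INDICES)

report = "Energy use is reported under GRI 302 and emissions under GRI 305."
print(index_matching_score(report))  # 0.5
```

Such a score complements semantic similarity because it rewards explicit standard references even when wording diverges from the GRI text.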
REVIEW | doi:10.20944/preprints202110.0184.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: text-mining; self-attention models; biological literature mining; relationship extraction; natural language processing
Online: 12 October 2021 (14:17:46 CEST)
Keeping up with new publications on any molecule, network, or process of interest is becoming increasingly difficult. For many cellular processes, the number of molecules and interactions that need to be considered can be very large. Automated mining of publications can support large-scale molecular interaction maps and database curation. Text mining and Natural Language Processing (NLP)-based techniques are finding applications in mining the biological literature, handling problems such as Named Entity Recognition (NER) and Relationship Extraction (RE). Both rule-based and machine learning (ML)-based NLP approaches have been popular in this context, with multiple research and review articles examining the scope of such models in Biological Literature Mining (BLM). In this review article, we explore self-attention-based models, a special type of neural network (NN)-based architecture that has recently revitalized the field of NLP, applied to biological texts. We cover self-attention models operating either at the sentence level or at the abstract level, in the context of molecular interaction extraction, published from 2019 onwards. We conduct a comparative study of the models in terms of their architecture. Moreover, we discuss some limitations in the field of BLM and identify opportunities for the extraction of molecular interactions from biological text.
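The core computation shared by the self-attention models reviewed here is scaled dot-product attention; a sketch with plain lists and toy dimensions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V,
    computed row by row over toy-sized matrices (lists of lists)."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # the query attends more to the first key/value pair
```

Stacking this operation with learned projections of the input tokens is what lets transformer models weigh, say, a protein mention against every other token in a sentence when extracting an interaction.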
COMMUNICATION | doi:10.20944/preprints202104.0575.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Ethnopharmacology; Artificial Intelligence; Web Crawling; Active Learning; Reinforcement Learning; Text Mining; Big Data
Online: 23 June 2021 (11:47:32 CEST)
Ethnopharmacology experts face several challenges when identifying and retrieving documents and resources related to their scientific focus. The volume of sources that need to be monitored, the variety of formats utilized, and the varying quality of language use across sources present some of what we call “big data” challenges in the analysis of this data. This study aims to understand if and how experts can be supported effectively through intelligent tools in the task of ethnopharmacological literature research. To this end, we utilize a real case study of ethnopharmacology research on the Southern Balkans and the coastal zone of Asia Minor, and we propose a methodology for more efficient research in ethnopharmacology. Our work follows an “Expert-Apprentice” paradigm in an automatic URL extraction process through crawling, where the apprentice is a Machine Learning (ML) algorithm utilizing a combination of Active Learning (AL) and Reinforcement Learning (RL), and the expert is the human researcher. ML-powered research improved the domain expert's effectiveness 3.1-fold and efficiency 5.14-fold, fetching a total of 420 relevant ethnopharmacological documents in only 7 hours versus an estimated 36 hours of human-expert effort. Therefore, utilizing Artificial Intelligence (AI) tools to support the researcher can boost the efficiency and effectiveness of the identification and retrieval of appropriate documents.
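The reported 5.14-fold efficiency gain can be sanity-checked arithmetically: it is simply the ratio of the estimated human-expert effort to the ML-assisted effort.

```python
# Figures quoted in the abstract.
human_hours = 36
assisted_hours = 7
docs_fetched = 420

print(round(human_hours / assisted_hours, 2))  # 5.14 -> the reported efficiency gain
print(docs_fetched // assisted_hours)          # 60 relevant documents per assisted hour
```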
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Urdu Twitter Dataset; Urdu Natural language processing (NLP); Urdu text Sentiments and Emoticons
Online: 24 March 2021 (12:03:46 CET)
This article presents a dataset of tweets in the Urdu language. There are 1,140,824 tweets in the dataset, collected from Twitter for September and October 2020. This large-scale corpus of tweets was generated by preprocessing that includes removing columns containing user information, retweet counts, and follower information; removing duplicate tweets; removing unnecessary punctuation, links, symbols, and spaces; and finally extracting emojis present in the tweet text. In the final dataset, each tweet record contains columns for the tweet id, the text, and the emoji extracted from the text with a sentiment score. Emojis are extracted to validate Machine Learning models used for multilingual sentiment and behavior analysis. They are extracted using a Python script that searches the text against a list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
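The emoji-extraction step described here can be sketched as follows; the dataset's table covers the 751 most frequent emojis, while the three entries (and their sentiment scores) below are illustrative stand-ins:

```python
# Illustrative subset of an emoji lookup table: emoji -> (description, sentiment score).
EMOJI_TABLE = {
    "😀": ("grinning face", 0.7),
    "😢": ("crying face", -0.6),
    "❤": ("red heart", 0.8),
}

def extract_emoji(tweet_text: str):
    """Return (emoji, description, score) for the first known emoji, else None."""
    for ch in tweet_text:
        if ch in EMOJI_TABLE:
            desc, score = EMOJI_TABLE[ch]
            return (ch, desc, score)
    return None

print(extract_emoji("great weather today 😀"))  # ('😀', 'grinning face', 0.7)
```

Applied row by row, this yields the extra description and sentiment-score columns the dataset records for each tweet.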
ARTICLE | doi:10.20944/preprints202003.0249.v1
Subject: Mathematics & Computer Science, Other Keywords: machine learning; preprocessing; semantic analysis; text mining; TF/IDF; scraping; Google Play Store
Online: 11 August 2020 (08:14:10 CEST)
Almost everybody around the world uses Android apps: half of the planet's population engages with messaging, social media, gaming, and browsers. This online marketplace provides free and paid access to users, and on the Google Play Store users are encouraged to download countless applications belonging to predefined categories. In this research paper, we scraped thousands of user reviews and app ratings, covering 148 apps from 14 categories. We collected 506,259 reviews from the Google Play Store and subsequently analyzed the semantics of the reviews to determine whether they are positive, negative, or neutral. We evaluated the results using several machine learning algorithms, namely Naïve Bayes, Random Forest, and Logistic Regression. We computed Term Frequency (TF) and Inverse Document Frequency (IDF) features and compared the algorithms on accuracy, precision, recall, and F1, visualizing the statistical results as bar charts. The analysis of each algorithm was performed in turn and the results compared. We found that Logistic Regression is the best-performing algorithm for review analysis of the Google Play Store, achieving the highest precision, accuracy, recall, and F1 on the preprocessed dataset.
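The TF/IDF features mentioned above can be computed from scratch; the tiny review corpus below is invented, and the smoothed IDF variant is one common convention, not necessarily the one the paper used:

```python
import math
from collections import Counter

def tf_idf(term: str, doc: list, corpus: list) -> float:
    """TF-IDF with raw term frequency and a smoothed inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

# Toy tokenized reviews.
corpus = [["great", "app", "love", "it"],
          ["bad", "app", "crashes"],
          ["nice", "design"]]

score_app = tf_idf("app", corpus[0], corpus)
score_love = tf_idf("love", corpus[0], corpus)
print(score_app < score_love)  # "app" appears in more reviews, so it scores lower
```

Feeding such per-term scores into Naïve Bayes, Random Forest, or Logistic Regression gives the feature/classifier pipeline the abstract describes.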
REVIEW | doi:10.20944/preprints202211.0544.v1
Subject: Earth Sciences, Environmental Sciences Keywords: pillar-based lake management; object-based lake management; Lake Rawapening
Online: 29 November 2022 (08:49:57 CET)
Lake Rawapening, Semarang Regency, Indonesia, has incorporated a holistic plan in its management practices. However, despite successful target achievements, some limitations remain, such that a review of its management plan is needed. This paper identifies and analyzes existing lake management strategies, specifically as a standard for Lake Rawapening, by exploring various literature on lake management in many countries, both legal frameworks and scholarly articles indexed in Google Scholar and published in Water by MDPI. There are two major types of lake management, namely pillar-based and object-based. While the former is the foundation of a conceptual paradigm that does not comprehensively consider the roles of finance and technology in lake management, the latter indicates the objects to manage so as to create standards or benchmarks for the implementation of various programs. Overall, Lake Rawapening management should include more programs on erosion-sedimentation control and monitoring of operational performance using information systems.
ARTICLE | doi:10.20944/preprints202110.0336.v1
Subject: Biology, Ecology Keywords: nature-based solutions; climate change adaptation; biodiversity; ecosystem-based adaptation
Online: 23 October 2021 (14:19:30 CEST)
Nature-based solutions (NbS) are increasingly recognised for their potential to address both the climate and biodiversity crises. These outcomes are interdependent, and both rely on the capacity of NbS to support and enhance the health of an ecosystem: its biodiversity, the condition of its abiotic and biotic elements, and its capacity to function normally despite environmental change. However, while understanding of ecosystem health outcomes of nature-based interventions for climate change mitigation is growing, the outcomes of those implemented for adaptation remain poorly understood with evidence scattered across multiple disciplines. To address this, we conducted a systematic review of the outcomes of 109 nature-based interventions for climate change adaptation using 33 indicators of ecosystem health across eight broad categories (e.g. diversity, biomass, ecosystem functioning and population dynamics). We showed that 88% of interventions with positive outcomes for climate change adaptation also reported measurable benefits for ecosystem health. We also showed that interventions were associated with a 67% average increase in local species richness. All eight studies that reported benefits in terms of both climate change mitigation and adaptation also supported ecosystem health, leading to a triple win. However, there were also trade-offs, mainly for forest management and creation of novel ecosystems such as monoculture plantations of non-native species. Our review highlights two major limitations of research to date. First, only a limited selection of metrics are used to assess ecosystem health and these rarely include key aspects such as functional diversity and habitat connectivity. Second, taxonomic coverage is poor: 67% of outcomes assessed only plants and 57% did not distinguish between native and non-native species. 
Future research addressing these issues will allow the design and adaptive management of NbS to support healthy and resilient ecosystems, and thereby enhance their effectiveness for meeting both climate and biodiversity targets.
REVIEW | doi:10.20944/preprints202102.0447.v1
Subject: Earth Sciences, Atmospheric Science Keywords: circular economy; Covid-19; Voyant tools; environmental sustainability; social sustainability; economic sustainability; text mining
Online: 20 February 2021 (01:42:10 CET)
The emergence of the Covid-19 pandemic has created both negative and positive changes, including in the implementation of the circular economy across the globe. This systematic review follows the PRISMA statement and employs text mining (Voyant Tools) to visualize and analyze the impacts of Covid-19 on three aspects of the circular economy: economic, social, and environmental. The research employs Latent Dirichlet Allocation (LDA) to identify five major topics: (1) shortages of medical equipment but high medical waste during Covid-19 due to high demand in healthcare; (2) the long-term negative impacts of lockdown on economic and social activities; (3) reports on the impacts of the Covid-19 pandemic on manufacturing globally, along with coping strategies and new opportunities; (4) the impacts of international restrictions on the tourism, trade, shipping, and aviation industries, causing billion-dollar losses; and (5) the reduction of pollution, with environmental health improvements, illustrated by cases from China and the EU. The research identifies current literature gaps in the circular economy and Covid-19 topics and encourages the application of text mining tools in research to streamline the research process and assist in communicating with the public.
ARTICLE | doi:10.20944/preprints202008.0265.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Recommender system; learning to rank; Mining software repositories; Text Mining; Deep learning; Stack Overflow
Online: 4 September 2020 (11:20:33 CEST)
In software development, developers receive bug reports that describe software bugs. Developers find the cause of a bug by reviewing the code and reproducing the abnormal behavior, which is a tedious and time-consuming process. Developers need an automated system that incorporates large domain knowledge and recommends solutions for those bugs, rather than spending manual effort fixing them or waiting on Q&A websites for other users to reply. Stack Overflow is a popular question-answer site focusing on programming issues, so we can benefit from the knowledge available on this rich platform. This paper presents a survey covering methods in the field of mining software repositories. We propose an architecture for building a recommender system using the learning-to-rank approach. Deep learning is used to construct a model that solves the learning-to-rank problem using Stack Overflow data. Text mining techniques are employed to extract, evaluate, and recommend the answers most relevant to the solution of a given bug report.
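The ranking step can be illustrated with a hand-crafted relevance score; the paper learns the ranking with deep learning, whereas the Jaccard term overlap below is a simple stand-in for that learned scoring function:

```python
# Rank candidate Stack Overflow answers against a bug report by Jaccard
# term overlap (an illustrative stand-in for a learned ranking model).
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

bug_report = "null pointer exception when closing database connection"
answers = ["check the database connection is not null before closing",
           "use a for loop to iterate the list"]

best = max(answers, key=lambda ans: jaccard(bug_report, ans))
print(best.startswith("check"))  # True: the database-related answer ranks first
```

A learning-to-rank model replaces `jaccard` with a trained scorer but keeps the same query-document ranking structure.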
ARTICLE | doi:10.20944/preprints202007.0646.v1
Subject: Keywords: Machine Learning; Natural Language Processing; Text Mining; Semantic Analysis; Scraping; Google Play Store; Rating
Online: 26 July 2020 (17:11:09 CEST)
The Google Play Store allows users to download mobile applications (apps), and users are guided by an app's rating and reviews. Recent studies show that user preferences, opinions for improvement, sentiment about particular features, and detailed descriptions of experiences are very useful to application developers. However, the volume of application reviews is very large and difficult to process manually, and the star rating covers the whole application, so the developer cannot analyze individual features. In this research, we scraped 282,231 user reviews using different data scraping techniques and applied text classification to them. We applied different algorithms, computed the precision, accuracy, F1 score, and recall for each, and identified the best-performing algorithm.
REVIEW | doi:10.20944/preprints202202.0212.v1
Subject: Mathematics & Computer Science, Analysis Keywords: Knowledge Graphs; Link Prediction; Semantic-Based Models; Translation Based Embedded Models
Online: 17 February 2022 (11:49:24 CET)
Link prediction is a popular research area for disciplines like the biological sciences, security, and medicine. Many methods have been proposed for link prediction; those examined in this review paper are the TransE, ComplEx, DistMult, and DensE models. Each model approaches link prediction from a different perspective. We examine the practical performance of these methods under similar parameter values, using fine-tuning to evaluate the reliability and reproducibility of their results. We describe the methods and experiments, provide theoretical proofs and experimental examples, and demonstrate how current link prediction methods work in such settings. We use standard evaluation metrics to test the models' abilities.
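TransE, the first model listed, scores a triple (h, r, t) by how close the translated head h + r lands to the tail t; a toy sketch with invented 2-d embeddings (real models learn these vectors from the knowledge graph):

```python
import math

def transe_score(h, r, t):
    """TransE plausibility score: -||h + r - t|| (higher is more plausible)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Invented 2-d entity embeddings and one relation vector.
emb = {"madrid": [1.0, 1.0], "spain": [2.0, 1.0], "berlin": [5.0, 5.0]}
capital_of = [1.0, 0.0]

# Tail prediction for (madrid, capital_of, ?): rank candidate tails by score.
candidates = ["spain", "berlin"]
best = max(candidates, key=lambda t: transe_score(emb["madrid"], capital_of, emb[t]))
print(best)  # spain
```

DistMult and ComplEx replace the translation with (complex-valued) multiplicative interactions, but the evaluation protocol, namely ranking candidate tails by score, is the same.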
REVIEW | doi:10.20944/preprints202112.0027.v2
Subject: Biology, Animal Sciences & Zoology Keywords: Zoo animal welfare; Five Domains; Validity; Animal-based; Resource-based; Scoring
Online: 22 December 2021 (11:59:32 CET)
Zoos are increasingly putting in place formalized animal welfare assessment programs to allow monitoring of welfare over time, as well as to aid in resource prioritization. These programs tend to rely on assessment tools that incorporate resource-based and observational animal-focused measures, since it is rarely feasible to obtain measures of physiology in zoo-housed animals. A range of assessment tools are available, commonly based on the Five Domains framework. A comprehensive review of the literature was conducted to bring together recent studies examining welfare assessment methods in zoo animals. A summary of these methods is provided, with the advantages and limitations of the approaches presented. We then highlight practical considerations with respect to implementing these tools in practice, for example scoring schemes, weighting of criteria, and innate animal factors for consideration. It is concluded that there would be value in standardizing guidelines for the development of welfare assessment tools, since zoo accreditation bodies rarely prescribe these. There is also a need to develop taxon- or species-specific assessment tools to inform welfare management.
ARTICLE | doi:10.20944/preprints202105.0449.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Explainable Artificial Intelligence; Hopfield Neural Networks; Automatic Video Generation; Data-to-text systems; Software Visualization
Online: 19 May 2021 (14:07:48 CEST)
Hopfield Neural Networks (HNNs) are recurrent neural networks used to implement associative memory. Their main feature is their ability to perform pattern recognition, optimization, and image segmentation. However, it is sometimes not easy to provide users with good explanations of the results obtained with them, mainly due to the large number of changes in the states of neurons (and their weights) produced while solving a machine learning problem. There are currently limited techniques to visualize, verbalize, or abstract HNNs. This paper outlines how automatic video generation systems can be constructed to explain their execution. This work constitutes a novel approach to explainable artificial intelligence systems in general, and HNNs in particular, building on the theory of data-to-text systems and on software visualization approaches. We present a complete methodology for building these kinds of systems. A software architecture is also designed, implemented, and tested, and technical details about the implementation are explained. Finally, we apply our approach to create a complete explainer video about the execution of an HNN on a small recognition problem.
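A minimal Hopfield network, the object these explainer videos describe, can be sketched as Hebbian weight storage plus iterative state updates; the pattern and its corrupted cue below are toy examples:

```python
# Minimal Hopfield associative memory with bipolar (+1/-1) neurons.
def train(patterns):
    """Hebbian rule: W[i][j] = (1/P) * sum_p p_i * p_j, zero diagonal."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def recall(W, state, steps=5):
    """Synchronous state updates: each neuron takes the sign of its input sum."""
    for _ in range(steps):
        state = [1 if sum(W[i][j] * state[j] for j in range(len(state))) >= 0 else -1
                 for i in range(len(state))]
    return state

stored = [1, -1, 1, -1]
W = train([stored])
noisy = [1, 1, 1, -1]              # cue with one flipped bit
print(recall(W, noisy) == stored)  # True: the network recovers the stored pattern
```

Each update step changes the neuron state vector, and it is exactly this sequence of state changes that the proposed data-to-text videos narrate.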
ARTICLE | doi:10.20944/preprints202009.0657.v1
Subject: Medicine & Pharmacology, Allergology Keywords: HIV; workplace intervention; SMS; HIV testing; construction; mobile phone; Covid-19; health promotion; text messaging
Online: 27 September 2020 (03:02:41 CEST)
Background: HIV poses a threat to global health. With effective treatment options available, education and testing strategies are essential in preventing transmission. Text messaging is an effective tool for health promotion and can be used to target higher-risk populations. This study reports on the design, delivery and testing of a mobile text messaging SMS intervention for HIV prevention and awareness, aimed at adults in the construction industry and delivered during the COVID-19 pandemic. Method: Participants were recruited at Test@Work workplace health promotion events (21 sites, n=464 employees), including health checks with HIV testing. Message development was based on a participatory design and included a focus group (n=9) and message fidelity testing (n=291) with assessment of intervention uptake, reach, acceptability, and engagement. Barriers to HIV testing were identified and mapped to the COM-B behavioural model. 23 one-way push SMS messages (19 including short web links) were generated and fidelity tested, then sent via automated SMS to two employee cohorts over a 10-week period during the COVID-19 pandemic. Engagement metrics measured were: opt-outs, SMS delivered/read, number of clicks per web link, and four two-way pull messages exploring repeat HIV testing, learning new information, perceived usefulness and behaviour change. Results: 291 people participated (68.3% of eligible attendees). A total of 7,726 messages were sent between March and June 2020, with 91.6% successfully delivered (100% read). 12.4% of participants opted out over 10 weeks. Of delivered messages, links were clicked an average of 14.4% of the time, with a maximum of 24.1% for HIV-related links. The number of clicks on web links declined over time (r= -6.24, p=0.01). The response rate for two-way pull messages was 13.7% of participants. Since the workplace HIV test offer at recruitment, 21.6% reported having taken a further HIV test.
Qualitative replies indicated behavioural influence of messaging on exercise, lifestyle behaviours and intention to HIV test. Conclusion: SMS messaging for HIV prevention and awareness is acceptable to adults in the construction industry, has high uptake, low attrition and good engagement with message content, when delivered during a global pandemic. Data collection methods may need refinement for audience and effect of COVID-19 on results is yet to be understood.
ARTICLE | doi:10.20944/preprints202010.0148.v2
Subject: Social Sciences, Accounting Keywords: Sustainable Teaching; multidisciplinary; multicultural; teams; Case-based Learning; Problem-based Learning; teamwork
Online: 26 April 2021 (15:38:20 CEST)
This article investigates the prospect of implementing multidisciplinary and multicultural student teamwork (MMT) involving Case-based Learning (CBL) and Problem-based Learning (PBL) as a sustainable teaching practice. Based on a mixed methods approach, which includes direct observation (both physical and virtual), questionnaire distribution and focus-group interviews, the study reveals that MMT through CBL and PBL can both facilitate and hinder sustainable learning. Our findings show that while MMT enhances knowledge sharing, it also poses a wide range of challenges, raising questions about its social significance as a sustainable teaching practice. The study suggests the implementation of certain mechanisms, such as ‘Teamwork Training’ and ‘Pedagogical Mentors’, aiming to strengthen the sustainable orientation of MMT through CBL and PBL.
Subject: Engineering, Control & Systems Engineering Keywords: Model-based systems engineering (MBSE); Model informatics and analytics; Model-based collaboration
Online: 12 March 2021 (16:52:34 CET)
In MBSE, there is as yet no converged terminology, and the term ’system model’ is used in different contexts in the literature. In this study, we elaborated the definitions and usages of the term ’system model’ to find a common definition. 104 publications were analyzed in depth for their usage and definition of the term, as well as for their meta-data (e.g., publication year and publication background), to find common patterns. While the term has been gaining interest in recent years, it is used in a broad range of contexts for both analytical and synthetic use cases. Based on this, three categories of system models have been defined and integrated into a more precise definition.
ARTICLE | doi:10.20944/preprints201807.0523.v1
Subject: Mathematics & Computer Science, Other Keywords: game-based learning; game design; project-based teaching; informatics and society; cybersecurity
Online: 26 July 2018 (16:38:48 CEST)
This article discusses the use of game design as a method for interdisciplinary project-based teaching in secondary school education to convey informatics and society topics. There is a lot of knowledge about learning games, but little background on project-based teaching using game design as a method. We present the results of an analysis of student-created games and an evaluation of a student-authored database of learning contents found in commercial off-the-shelf games. We further contextualise these findings using a group discussion with teachers. The results underline the effectiveness of project-based teaching in raising awareness of informatics and society topics. Drawing on our analyses, we further outline informatics and society topics that are particularly interesting to students, their genre preferences, and potentially engaging game mechanics.
ARTICLE | doi:10.20944/preprints201709.0074.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: recommendation system; context awareness; location based services; mobile computing; cloud-based computing
Online: 18 September 2017 (08:54:04 CEST)
The ubiquity of mobile sensors (such as GPS, accelerometers and gyroscopes), together with increasing computational power, has enabled easier access to contextual information, which has proved its value in the next generation of recommender applications. The importance of contextual information has been recognized by researchers in many disciplines, such as ubiquitous and mobile computing, to filter query results and provide recommendations based on the user’s status. A context-aware recommendation system (CoARS) provides a personalized service to each individual user, driven by his or her particular needs and interests at any location and any time. A contextual recommendation system therefore changes in real time as a user’s circumstances change. CoARS is one of the major applications that has been refined over the years due to evolving geospatial techniques and big data management practices. In this paper, a CoARS is designed and implemented to combine context information from smartphone sensors with user preferences to improve the efficiency and usability of recommendation. The proposed approach combines the user’s context information (such as location, time, and transportation mode), personalized preferences (based on the individual’s past behavior), and item-based recommendations (such as an item’s ranking and type) to personally filter the item list. The context-aware methodology is based on preprocessing and filtering of raw data, context extraction and context reasoning. This study examined the application of such a system in recommending a suitable restaurant using both web-based and Android platforms. The implemented system uses CoARS techniques to provide beneficial and accurate recommendations to users. The capabilities of the system were successfully evaluated with a recommendation experiment and a usability test.
REVIEW | doi:10.20944/preprints202201.0073.v1
Subject: Medicine & Pharmacology, Other Keywords: Messenger RNA; Hospital-based mRNA therapeutics; circular mRNA; self-amplifying mRNA; RNA-based CAR T-cell; RNA-based gene-editing tools
Online: 6 January 2022 (11:20:59 CET)
Hospital-based programs democratize mRNA therapeutics by facilitating the processes needed to translate a novel RNA idea from the bench to the clinic. Because mRNA is essentially biological software, therapeutic RNA constructs can be rapidly developed. The generation of small batches of clinical-grade mRNA to support IND applications and first-in-man clinical trials, as well as personalized mRNA therapeutics delivered at the point of care, is feasible at a modest scale of cGMP manufacturing. Advances in mRNA manufacturing science and innovations in mRNA biology are increasing the scope of mRNA clinical applications.
ARTICLE | doi:10.20944/preprints202208.0523.v1
Subject: Mathematics & Computer Science, Other Keywords: angle-based outlier detection; percentile-based outlier detection; MultiPhiLDA; noise; irrelevant software requirements
Online: 30 August 2022 (11:25:24 CEST)
Noise in requirements has long been recognized as a defect in software requirements specifications (SRS). Detecting defects at an early stage is crucial in the software development process. Noise can take the form of irrelevant requirements included within an SRS. A previous study attempted to detect noise in SRS by treating noise as an outlier; however, the resulting method demonstrated only moderate reliability because unique actor words were overshadowed by unique action words in the topic-word distribution. In this study, we propose a framework to identify irrelevant requirements based on the MultiPhiLDA method. The proposed framework distinguishes the topic-word distribution of actor words from that of action words as two separate topic-word distributions with two multinomial probability functions. Weights are used to maintain a proportional contribution of actor and action words. We also explore the use of two outlier detection methods, namely percentile-based outlier detection (PBOD) and angle-based outlier detection (ABOD), to distinguish irrelevant requirements from relevant ones. The experimental results show that the proposed framework exhibits better performance than previous methods. Furthermore, the combination of ABOD as the outlier detection method and topic coherence as the estimation approach for determining the optimal number of topics and iterations outperformed the other combinations, obtaining sensitivity, specificity, F1-score, and G-mean values of 0.59, 0.65, 0.62, and 0.62, respectively.
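The angle-based outlier detection (ABOD) step used in this framework can be illustrated with a minimal sketch. This is the generic idea of ABOD in a common simplified form, not the authors' implementation, and the function name is hypothetical: a point on the fringe of the data sees all other points within a narrow cone, so the variance of its distance-weighted angles is small.

```python
import numpy as np

def abod_scores(X):
    """Naive O(n^3) angle-based outlier factor.

    For each point i, collect the distance-weighted cosines
    dot(a, b) / (|a|^2 * |b|^2) over all pairs of other points
    (a, b are the vectors from point i to those points), then
    take their variance. A *low* score flags a likely outlier.
    """
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        angles = []
        for j in range(n):
            for k in range(j + 1, n):
                if i in (j, k):
                    continue
                a, b = X[j] - X[i], X[k] - X[i]
                na, nb = np.dot(a, a), np.dot(b, b)
                if na == 0 or nb == 0:
                    continue  # coincident points carry no angle information
                angles.append(np.dot(a, b) / (na * nb))
        scores[i] = np.var(angles)
    return scores  # smallest values => strongest outlier candidates
```

In the requirements setting, each row of `X` would be the feature vector of one requirement (e.g. its topic distribution), and the lowest-scoring rows are flagged as irrelevant.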
ARTICLE | doi:10.20944/preprints202111.0196.v1
Subject: Life Sciences, Other Keywords: crocodilian; animal welfare; animal-based measure; animal-based indicator; welfare assessment; welfare measure
Online: 10 November 2021 (08:46:54 CET)
Animal-based measures are the measures of choice in animal welfare assessment protocols, as they can often be applied independently of the housing or production system employed. Although there has been a small body of work on potential animal-based measures for farmed crocodilians [1-3], they have not been studied in the context of an animal welfare assessment protocol. Potential animal-based measures that could be used to reflect the welfare state of farmed crocodilians were identified and aligned with the Welfare Quality® principles of good housing, good health, good feeding and appropriate behaviour. A consultation process with a panel of experts was used to evaluate and score the potential measures in terms of validity and feasibility. This resulted in a toolbox of measures being identified for further development and integration into on-farm animal welfare assessment. Animal-based measures related to ‘good feeding’ and ‘good health’ received the highest scores for validity and feasibility from the experts. There was less agreement on the animal-based measures that could be used to reflect ‘appropriate behaviour’. Where no animal-based measure was deemed to reliably reflect a welfare criterion or to be useful as an on-farm measure, additional measures of resources or management were suggested as alternatives. Future work in this area should focus on the reliability of the proposed measures and involve further evaluation of their validity and feasibility as they relate to different species of crocodilian and farming systems.
REVIEW | doi:10.20944/preprints201810.0175.v1
Subject: Chemistry, Analytical Chemistry Keywords: biosensors; enzyme-based systems; receptor-based systems; toxins; food analysis; environmental monitoring; nanotechnology
Online: 9 October 2018 (05:59:30 CEST)
The exploitation of lipid membranes in biosensors has made it possible to reconstitute a considerable part of their functionality to detect traces of food toxicants and environmental pollutants. Nanotechnology has enabled sensor miniaturization and extended the range of biological moieties that can be immobilized within a lipid bilayer device. This chapter reviews recent progress in biosensor technologies based on lipid membranes suitable for environmental applications and food quality monitoring. Numerous biosensing applications are presented, with emphasis on novel systems, new sensing techniques and nanotechnology-based transduction schemes. The range of analytes that can currently be detected includes insecticides, pesticides, herbicides, metals, toxins, antibiotics, microorganisms, hormones, dioxins, etc. Technology limitations and future prospects are discussed, focusing on the evaluation/validation and eventual commercialization of the proposed sensors.
REVIEW | doi:10.20944/preprints201808.0069.v1
Subject: Chemistry, Analytical Chemistry Keywords: biosensors; enzyme-based systems; receptor-based systems; toxins; food analysis; environmental monitoring; nanotechnology
Online: 3 August 2018 (14:20:04 CEST)
The exploitation of lipid membranes in biosensors has made it possible to reconstitute a considerable part of their functionality to detect traces of food toxicants and environmental pollutants. Nanotechnology has enabled sensor miniaturization and extended the range of biological moieties that can be immobilized within a lipid bilayer device. This chapter reviews recent progress in biosensor technologies based on lipid membranes suitable for environmental applications and food quality monitoring. Numerous biosensing applications are presented, with emphasis on novel systems, new sensing techniques and nanotechnology-based transduction schemes. The range of analytes that can currently be detected includes insecticides, pesticides, herbicides, metals, toxins, antibiotics, microorganisms, hormones, dioxins, etc. Technology limitations and future prospects are discussed, focusing on the evaluation/validation and eventual commercialization of the proposed sensors.
ARTICLE | doi:10.20944/preprints201807.0307.v1
Subject: Social Sciences, Marketing Keywords: sustainable outcomes; dedication-based mechanism; constraint-based mechanism; perceived switching costs; loyalty program
Online: 17 July 2018 (10:55:47 CEST)
Given the increase in consumers’ preference for coffee, it is becoming important to understand their decision-making processes in the coffee chain context. To deepen the understanding of sustainable outcomes in this context, this study investigates the role of dedication- and constraint-based mechanisms in forming consumers’ repurchase and positive word-of-mouth (WOM) intentions, two critical sustainable outcomes. We examined the effects of coffee quality, the quality of the physical environment, and service quality in accelerating the formation of dedication-based factors. Moreover, this study offers an in-depth understanding of the enablers of perceived switching costs. Data collected from 238 university students who frequently visit coffee chains were empirically tested against the proposed theoretical framework using structural equation modeling. The results confirm that both dedication- and constraint-based factors substantially predict consumers’ sustainable outcomes in the coffee chain context. Brand image and perceived switching costs play a more important role in enhancing consumers’ repurchase and positive WOM intentions than customer satisfaction does. Coffee quality is significantly associated with both customer satisfaction and brand image, whereas the quality of the physical environment and service quality are only significantly associated with brand image. Habit is found to be the key enabler of perceived switching costs, while loyalty programs have no significant impact on them.
ARTICLE | doi:10.20944/preprints201608.0069.v1
Subject: Earth Sciences, Environmental Sciences Keywords: Rubber (Hevea brasiliensis) plantation; phenology; Xishuangbanna; Landsat; object-based approach; pixel-based approach
Online: 6 August 2016 (11:54:28 CEST)
Effectively mapping and monitoring rubber plantations is still challenging. Previous studies have explored the potential of phenology features for rubber plantation mapping through a pixel-based approach (the pixel-based phenology approach). However, in the fragmented, mountainous terrain of Xishuangbanna, such an approach can lead to noise and low accuracy in the resultant maps. In this study, we investigated the capability of an integrated approach that combines phenology information with an object-based approach (the object-based phenology approach) to map rubber plantations in Xishuangbanna. Moderate Resolution Imaging Spectroradiometer (MODIS) data were first used to acquire the temporal profiles and phenological features of rubber plantations and natural forests, which delineate the time windows of the defoliation and foliation phases. Landsat images were then used to apply a phenology algorithm, comparing three different approaches: pixel-based phenology, object-based phenology, and extended object-based phenology, to separate rubber plantations from natural forests. The results showed that the two object-based approaches achieved higher accuracy than the pixel-based approach, with overall accuracies of 96.4%, 97.4%, and 95.5%, respectively. This study demonstrated the reliability of phenology-based rubber mapping in fragmented landscapes with a distinct dry/cool season using Landsat images, and indicated that object-based phenology approaches can effectively improve the accuracy of the resultant maps in such landscapes.
ARTICLE | doi:10.20944/preprints202301.0061.v1
Subject: Medicine & Pharmacology, Pathology & Pathobiology Keywords: Neural Network; Machine Learning; Natural Language Processing (NLP); Text Mining; Sentence Classification; Colorectal Cancer; Clinical Information.
Online: 4 January 2023 (03:48:26 CET)
Colonoscopy is used for colorectal cancer (CRC) screening. Extracting details of colonoscopy findings from free text in electronic health records (EHRs) can be used to determine patient risk for CRC and colorectal screening strategies. In this study, we developed and evaluated the accuracy of a deep learning framework to extract information for a clinical decision support system by analyzing relevant free-text colonoscopy reports, including indication, pathology, and findings notes. The Bio-Bi-LSTM-CRF framework was developed using bidirectional long short-term memory (Bi-LSTM) and conditional random fields (CRF) to extract several clinical features from these free-text reports, including the indications for the colonoscopy, the findings during the colonoscopy, and the pathology of the resected material. We trained the Bio-Bi-LSTM-CRF and existing Bi-LSTM-CRF models on 80% of 4,000 manually annotated notes obtained from the colonoscopy reports of 3,867 patients. The clinical notes were from patients aged over 40 years enrolled in four Veterans Affairs Medical Centers. A further 10% of the annotated notes were used to tune hyperparameters, and the remaining 10% were used to evaluate the accuracy of our model (Bio-Bi-LSTM-CRF) and to compare the results with those obtained using Bi-LSTM-CRF. Our experiments showed that integrating dictionary feature vectors and character-sequence embeddings into the Bio-Bi-LSTM-CRF model is an effective way to identify colonoscopy features in EHR-extracted clinical notes. The Bio-Bi-LSTM-CRF model is therefore capable of creating new opportunities to identify patients at risk for colon cancer and to study their health outcomes.
ARTICLE | doi:10.20944/preprints202106.0196.v3
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Twitter; Social Media; Social Networking; Social Network Analytic; DistilBERT; Text Similarity; Natural Language Processing; Character Computing
Online: 17 February 2022 (13:15:23 CET)
Social media platforms have become an undeniable part of everyday life over the past decade. Analyzing the information being shared is a crucial step toward understanding human behavior. Social media analysis aims to guarantee a better experience for the user and to raise user satisfaction. Before deriving any further conclusions, it is first necessary to know how to compare users. In this paper, a hybrid model is proposed that measures the similarity of Twitter profiles and quantifies their degree of likeness by calculating features that capture users’ behavioral habits. First, the timeline of each profile is extracted using the official Twitter API. Then, three aspects of a profile are considered in parallel. Behavioral ratios are time-series-related information showing the consistency and habits of the user; dynamic time warping is utilized to compare the behavioral ratios of two profiles. Next, the audience network is extracted for each user, and Jaccard similarity is used to estimate the similarity of the two sets. Finally, for content similarity measurement, the tweets are preprocessed according to the feature extraction method; TF-IDF and DistilBERT are employed for feature extraction and then compared using cosine similarity. Results show that TF-IDF has slightly better performance, so the more straightforward solution is selected for the model. As a case study, a Random Forest classification model trained on almost 20,000 users achieved 97.24% accuracy. This comparison enables us to find duplicate profiles with nearly the same behavior and content.
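Two of the three similarity components described here (audience overlap via Jaccard similarity, and content similarity via cosine) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function names and weights are hypothetical, and raw term frequencies stand in for the TF-IDF/DistilBERT vectors used in the paper.

```python
import math
from collections import Counter

def jaccard_similarity(a, b):
    """Audience overlap: |A ∩ B| / |A ∪ B| over two sets of follower ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine_similarity(text_a, text_b):
    """Cosine similarity over raw term-frequency vectors."""
    ta, tb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def profile_similarity(p, q, w_audience=0.5, w_content=0.5):
    """Weighted combination of the component scores (weights hypothetical)."""
    return (w_audience * jaccard_similarity(p["followers"], q["followers"])
            + w_content * cosine_similarity(p["tweets"], q["tweets"]))
```

The behavioral-ratio component (dynamic time warping over posting-time series) would be combined the same way, as a third weighted term.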
REVIEW | doi:10.20944/preprints201708.0003.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: stylometry; author identification; author verification; author profiling; stylistic inconsistency; text analysis; supervised learning; unsupervised learning; classification; forensics
Online: 2 August 2017 (12:38:17 CEST)
Electronic text stylometry is a collection of forensic methods that analyze the writing styles of input electronic texts in order to extract information about their authors. Such information could be the identity of the authors, or attributes of the authors such as their gender, age group, or ethnicity. This survey paper presents the following contributions: 1) A description of all stylometry problems in probability terms, under a unified notation; to the best of our knowledge, this is the most comprehensive definition to date. 2) A survey of key methods, with particular attention to data representation (or feature extraction) methods. 3) An evaluation of 23,760 feature extraction methods, which is the most comprehensive evaluation of feature extraction methods in the stylometry literature to date. The importance of this evaluation is twofold. First, it identifies the relative effectiveness of the features (currently, many are not evaluated jointly; e.g., syntactic n-grams are not evaluated against k-skip n-grams, and so forth). Second, thanks to our generalizations, we could evaluate novel grams, such as what we name compound grams. 4) The release of our associated Python feature extraction library, Fextractor. Essentially, the library generalizes all existing n-gram based feature extraction methods under the "at least l-frequent, dir-directed, k-skipped n-grams" formulation, and allows grams to be diversely defined, including definitions based on high-level grammatical aspects, such as POS tags, as well as lower-level ones, such as the distribution of function words, word shapes, etc. This makes the library by far the most extensive in this domain to date. 5) The construction, evaluation, and release of the first dataset for Emirati social media text; this represents the first evaluation of author identification against Emirati social media texts.
Interestingly, we find that, when using our models and feature extraction library (Fextractor), authors can be identified significantly more accurately than is reported with similarly sized datasets. The dataset also contains sub-datasets representing other languages (Dutch, English, Greek and Spanish), and our findings are consistent across them.
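The k-skip n-gram features evaluated in this survey can be illustrated with a minimal sketch (the function name is hypothetical; the paper's Fextractor library generalizes this further with frequency thresholds and directedness):

```python
from itertools import combinations

def skip_ngrams(tokens, n, k):
    """All n-grams allowing up to k skipped tokens in total between the
    chosen positions; k=0 reduces to ordinary contiguous n-grams."""
    grams = []
    for idxs in combinations(range(len(tokens)), n):
        # total gap = span covered by the positions minus the n positions
        if (idxs[-1] - idxs[0] + 1) - n <= k:
            grams.append(tuple(tokens[i] for i in idxs))
    return grams
```

For example, with n=2 and k=1 the tokens "a b c" yield the bigrams (a, b), (a, c), and (b, c), where (a, c) exists only because one skip is allowed.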
ARTICLE | doi:10.20944/preprints201907.0131.v1
Online: 9 July 2019 (14:15:17 CEST)
Saudi Arabia is an oil-reliant nation, as a large percentage of its GDP comes from oil resources. Oil dependency leaves a country at the mercy of the international crude market, and a decrease in the price of crude can seriously destabilize the economy of such nations. An example is the case of Venezuela, whose dependence on oil caused a national disaster (McCarthy, 2017). Venezuela's exports, GDP, and government revenue are primarily dependent on oil revenue, and the recent decrease in oil prices has reduced the country's national revenue, resulting in economic collapse as well as inflation. A shift from a resource-based economy to a knowledge-based economy will help Saudi Arabia become less reliant on its oil revenues for its economic stability and growth (Nurunnabi, 2017).
ARTICLE | doi:10.20944/preprints202210.0331.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: IoT-based payment protocols; identity-based signature; server-aided verification; pairing-free security protocols
Online: 21 October 2022 (10:20:05 CEST)
Following the great success of mobile wallets, the Internet of Things (IoT) leaves the door wide open for consumers to use their connected devices to access their bank accounts and perform routine banking activities from anywhere, at any time and with any device. However, consumers need to feel safe when interacting with IoT-based payment systems, and their personal information should be protected as much as possible. Unlike what is usually done in the literature, in this paper we introduce two lightweight and secure IoT-based payment protocols built on an identity-based signature scheme. We adopt a server-aided verification technique to construct the first scheme; this technique allows the heavy computational overhead on the sensor node to be outsourced to a cloud server while maintaining the user's privacy. The second scheme is built upon a pairing-free ECC-based security protocol to avoid the heavy computational complexity of bilinear pairing operations. The security reductions of both schemes hold in the Random Oracle Model (ROM) under the discrete logarithm and computational Diffie-Hellman assumptions. Finally, we experimentally compare the proposed schemes against each other and against the original scheme on the most commonly used IoT devices: a smartphone, a smartwatch and the embedded Raspberry Pi device. Compared with existing schemes, our proposed schemes achieve significant efficiency in terms of communication and computational overhead.
REVIEW | doi:10.20944/preprints202212.0086.v1
Subject: Behavioral Sciences, Other Keywords: Information; resources; coronary heart disease; digital health; education; cardiac rehabilitation; secondary prevention; text message; sensors; cardiovascular risk
Online: 6 December 2022 (02:09:28 CET)
A critical aspect of coronary heart disease (CHD) care and secondary prevention is ensuring patients have access to evidence-based information. The purpose of this review is to summarise the guiding principles, content, context and timing of information and education beneficial for supporting people with CHD, along with potential communication strategies including digital interventions. We conducted a scoping review, searching four databases (Web of Science, PubMed, CINAHL, Medline) for articles published from January 2000 to August 2022. Literature was identified through title and abstract screening by expert reviewers, and the evidence was synthesised according to the review aims. The results demonstrated that information-sharing, decision-making, goal-setting, positivity and practicality are important aspects of secondary prevention and should be patient-centred and evidence-based, with consideration of patient needs and preferences. The initiation and duration of education are highly variable between and within people, and hence communication and support should be regular and ongoing. In conclusion, text messaging programs, smartphone applications and wearable devices are examples of digital health strategies that facilitate education and support for patients with heart disease. There is no one-size-fits-all approach that suits all patients at all stages, and hence flexibility and a suite of resources and strategies are optimal.
ARTICLE | doi:10.20944/preprints202206.0050.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Emotions Mining; Context Mining; Sensory Mining; Artificial Intelligence; Information extraction; Text classification; Fairy tales; Olfactory Cultural Heritage
Online: 2 August 2022 (07:57:35 CEST)
This paper presents an Artificial Intelligence approach to mining the context and emotions related to olfactory cultural heritage narratives, in particular fairy tales. We provide an overview of the role of smell and emotions in literature, and highlight the importance of olfactory experience and emotions from psychological and linguistic perspectives. We introduce a methodology for extracting smells and emotions from text, and demonstrate the context-based visualizations related to smells and emotions implemented in a novel Smell Tracker tool. The evaluation is performed using a collection of fairy tales from Grimm and Andersen. We find that fairy tales often connect smell with the emotional charge of situations. The experimental results show that we can detect smells and emotions with F1 scores of 92.7 and 79.2, respectively.
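The F1 scores reported above are the harmonic mean of precision and recall over the detected smell and emotion mentions. As a reminder of how the metric is computed (a generic sketch from true/false positive and false negative counts, not the authors' evaluation code):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision (tp / (tp + fp))
    and recall (tp / (tp + fn)), from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

For instance, 8 correct detections with 2 spurious and 2 missed ones give precision = recall = 0.8, hence F1 = 0.8 (80.0 on the 0-100 scale used in the abstract).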
ARTICLE | doi:10.20944/preprints202103.0442.v1
Subject: Behavioral Sciences, Applied Psychology Keywords: dyslexia; reading; children; background colour; overlay colour; text colour; sensors; physiological parameters; EEG; ECG; EDA; eye tracking
Online: 17 March 2021 (14:31:47 CET)
Reading is one of the essential processes in the maturation of an individual. It is estimated that 5-10% of school-age children are affected by dyslexia, a reading disorder characterised by difficulties in the accuracy or fluency of word recognition. Many studies have reported that colour overlays and backgrounds can improve the reading process, especially in children with reading disorders. As dyslexia has neurobiological origins, the aim of the present research was to understand the relationship between physiological parameters and colour modifications of the text and background during reading in children with and without dyslexia. We measured differences in electroencephalography (EEG), heart rate variability (HRV), electrodermal activity (EDA), and eye movements of 36 school-age children (18 with dyslexia and 18 controls) during reading tasks in 13 combinations of background and overlay colours. Our findings showed that the dyslexic children had longer reading durations, higher fixation counts, longer average and total fixation durations, and higher saccade counts with longer total and average saccade durations, while reading on white and coloured backgrounds/overlays. The turquoise, turquoise overlay, and yellow colours were found to be beneficial for dyslexic readers, who achieved the shortest reading durations when these colours were used. Dyslexic children also showed higher values in the beta and whole EEG range while reading in a particular colour (purple), as well as increased theta-range activity while reading on the purple overlay colour. We observed no significant differences between HRV parameters on the white colour, except for single colours (purple, turquoise overlay and yellow overlay), where the control group showed higher values of mean HR while the dyslexic children scored higher on mean RR.
Regarding the EDA measures, we found systematically lower values in children with dyslexia in comparison to the control group. Based on the present results, we can conclude that both warm and cold background/overlay colours are beneficial for both groups of readers, and that all the sensor modalities could be used to better understand the neurophysiological origins of dyslexia in children.
ARTICLE | doi:10.20944/preprints202112.0046.v1
Subject: Chemistry, Analytical Chemistry Keywords: Paperfluidics; Parafilm; Paper-based Analytical Devices
Online: 3 December 2021 (09:58:36 CET)
Paper-based analytical devices have developed substantially in recent decades, and many fabrication techniques for them have been demonstrated and reported. Herein we report a relatively rapid, simple, and inexpensive method for fabricating paper-based analytical devices using parafilm hot pressing. We studied and optimized the effects of the key fabrication parameters, namely pressure, temperature, and pressing time. The optimal conditions were a pressure of 3.8 MPa (3 tons), a temperature of 80 °C, and 3 minutes of pressing time, with the smallest hydrophobic barrier size (821 µm) governed by the laminate mask and the dispersal of parafilm under pressure and heat. Physical and biochemical properties were evaluated to substantiate the paper's functionality for analytical devices. The wicking speed in the fabricated paper strips was slightly slower than that of non-processed paper, resulting from the reduced paper pore size. A colorimetric immunological assay was performed to demonstrate the protein binding capacity of the paper-based device after exposure to the pressure and heat of fabrication. Moreover, mixing in a two-dimensional paper-based device and flow in a three-dimensional counterpart were thoroughly investigated, demonstrating that paper devices from this fabrication process are potentially applicable as analytical devices for biomolecule detection. Fast, easy, and inexpensive parafilm hot-press fabrication presents an opportunity for researchers to develop paper-based analytical devices in resource-limited environments.
ARTICLE | doi:10.20944/preprints202109.0490.v1
Subject: Chemistry, Physical Chemistry Keywords: Hydroxyapatite; Ca-based catalyst; stability; polyglycerol.
Online: 29 September 2021 (11:26:01 CEST)
Abstract: Calcium-based catalysts are of high interest for glycerol polymerization due to their high catalytic activity and wide availability. However, their poor stability under reaction conditions is an issue. In the present study, we investigated the stability and catalytic activity of Ca-hydroxyapatites (HAps), one of the most abundant Ca sources in nature. Stoichiometric, Ca-deficient, and Ca-rich HAps were synthesized and tested as catalysts in the glycerol polymerization reaction. The deficient and stoichiometric HAps exhibited a remarkable 100% selectivity to triglycerol at 15% glycerol conversion at 245 °C after 8 h of reaction in the presence of 0.5 mol% of catalyst. Moreover, under the same reaction conditions, Ca-rich HAp showed high selectivity (88%) to di- and triglycerol at a glycerol conversion of 27%. Most importantly, these catalysts were unexpectedly stable towards leaching under the reaction conditions, based on the ICP-OES results. However, based on the catalytic tests and characterization by XRD, XPS, IR, TGA-DSC, and ICP-OES, we found that HAps can be deactivated by the reaction products themselves, i.e., water and polymers.
ARTICLE | doi:10.20944/preprints202108.0050.v1
Subject: Arts & Humanities, Anthropology & Ethnography Keywords: SDG; Gender Equality; project-based methodology
Online: 2 August 2021 (14:45:06 CEST)
A project-based module on Sustainable Development Goal number 5, Gender Equality, was implemented in five different groups of Business English students, comprising a total of 62 students in higher education. The main purpose of this project was to raise awareness of this goal by means of a flipped method in which students were required to carry out research on specific areas of the aforementioned goal and work in teams to prepare oral presentations. Once their findings were shared in class, students were expected to answer a written questionnaire of open-ended questions as part of a qualitative analysis. The results of this survey showed that 90% of the students gained in-depth knowledge of this goal, 85% built a positive attitude toward taking initiative, and 80% were optimistic about future gender equality. Finally, 70% of students suggested further social action to curb the problem of gender discrimination. On the whole, the flipped classroom method combined with project-based group work has proven to be an effective way to raise awareness of this goal, create a more positive attitude, increase students' willingness to take action, and widen their English lexical resources.
ARTICLE | doi:10.20944/preprints201709.0139.v1
Online: 27 September 2017 (16:45:25 CEST)
Object-Based Image Analysis (OBIA) has been successfully used to map slums. In general, uncertainties in producing geographic data are inevitable; however, most studies have concentrated solely on assessing classification accuracy while neglecting these inherent uncertainties. Our research analyses the impact of uncertainties on measuring the accuracy of OBIA-based slum detection. We selected Jakarta as our case study area because a national policy of slum eradication is causing rapid changes in slum areas. Our research comprises four parts: slum conceptualization, ruleset development, implementation, and accuracy and uncertainty measurements. Existential and extensional uncertainty arise when producing reference data. Comparing manual expert delineations of slums with the OBIA slum classification yields four combinations: True Positive, False Positive, True Negative, and False Negative. However, the higher the True Positive rate (which leads to better accuracy), the lower the certainty of the results, demonstrating the impact of extensional uncertainties. Our study also demonstrates the role of non-observable indicators (i.e., land tenure) in assisting slum detection, particularly in areas where uncertainties exist. In conclusion, uncertainties increase when aiming for higher classification accuracy by matching manual delineation and OBIA classification.
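The four-way comparison described above can be sketched as a toy confusion-matrix computation over a pair of classification masks; the arrays and values here are illustrative, not data from the Jakarta study:

```python
import numpy as np

def confusion_counts(expert, obia):
    """Compare a manual expert delineation mask with an OBIA slum
    classification mask (boolean arrays: True = pixel labelled slum)."""
    tp = int(np.sum(expert & obia))    # slum in both (True Positive)
    fp = int(np.sum(~expert & obia))   # OBIA says slum, expert does not
    tn = int(np.sum(~expert & ~obia))  # non-slum in both
    fn = int(np.sum(expert & ~obia))   # expert says slum, OBIA misses it
    return tp, fp, tn, fn

# Toy 3x3 masks (illustrative only)
expert = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
obia   = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]], dtype=bool)

tp, fp, tn, fn = confusion_counts(expert, obia)
accuracy = (tp + tn) / (tp + fp + tn + fn)
```

The paper's point is that raising `tp` by matching the expert delineation more closely does not make the underlying reference data any more certain.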
REVIEW | doi:10.20944/preprints201608.0173.v1
Online: 18 August 2016 (06:07:05 CEST)
ARTICLE | doi:10.20944/preprints202112.0417.v1
Subject: Earth Sciences, Geoinformatics Keywords: Cultural ecosystem services; urban green space management; Singapore; public participation geographic information system; social media text mining analysis
Online: 27 December 2021 (09:48:44 CET)
Cultural ecosystem services have become increasingly influential in both environmental research and policy decision-making, such as for urban green spaces. However, the popular definition conflates the concepts of ‘services’ and ‘benefits’, which makes it challenging for planners to employ it directly for urban green space management. One of the most widely used definitions of these non-tangible ecosystem services is “functions of environmental spaces and cultural activities which may then result in the enjoyment of cultural ecosystem benefits”; yet the latter has never found its way into official laws and regulations. In this study, via a case study in Singapore, we provide new evidence to re-evaluate and re-position two of the most important emerging concepts in managing urban green spaces. Using the transdisciplinary mixed methods of public participation GIS and social media text mining, a wealth of cultural ecosystem services and their associated benefits were reported, especially with regard to recreational and aesthetic services and experiential benefits. Recommendations to improve the park were also suggested, alongside methodological considerations for future research. Overall, this paper recommends employing the redefined cultural ecosystem services conceptual framework to generate relational, data-driven, and actionable insights to better support urban green space management, which is not only useful to the Singapore government but also relevant worldwide.
Subject: Keywords: Textual data distributions; supervised learning; unsupervised learning; Kullback-Leibler divergence; sentiment; textual analytics; text generation; vaccine; stock market
Online: 17 June 2021 (10:03:41 CEST)
Efficient textual data distribution (TDD) alignment and generation are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study addresses a segment of this broader problem by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment and (ii) sentiment alignment. Furthermore, we use multiple text generation methods, including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally, we develop a unique, process-driven variation of Kullback-Leibler divergence (KLD) applied to TDD, named KL Textual Distributions Contrasts (KL-TDC), to identify the alignment of machine-generated textual corpora with naturally occurring textual corpora. This study thus identifies a unique approach for generating and validating TDD by topic and sentiment, which can be used to help address sparse data problems and other research, practice, and classroom situations in need of artificially generated topic- or sentiment-aligned textual data.
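The core contrast behind KL-TDC, comparing a machine-generated corpus against a natural one via Kullback-Leibler divergence, can be illustrated with plain smoothed unigram distributions. This is a minimal sketch with made-up corpora, not the authors' KL-TDC procedure itself:

```python
import math
from collections import Counter

def word_distribution(corpus, vocab, alpha=1.0):
    """Laplace-smoothed unigram distribution over a shared vocabulary;
    smoothing avoids zero probabilities, which would make KLD infinite."""
    counts = Counter(w for doc in corpus for w in doc.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(P || Q) = sum_w p(w) * log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Made-up "natural" vs "machine-generated" corpora (illustrative only)
natural   = ["the vaccine trial results were promising", "vaccine rollout begins"]
generated = ["the vaccine results were promising", "stock market rallies on vaccine news"]

vocab = {w for doc in natural + generated for w in doc.lower().split()}
p = word_distribution(natural, vocab)
q = word_distribution(generated, vocab)
divergence = kl_divergence(p, q)  # lower value -> distributions more aligned
```

A generated corpus whose topic and sentiment match the natural corpus would drive this divergence toward zero.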
REVIEW | doi:10.20944/preprints202203.0032.v1
Subject: Chemistry, Medicinal Chemistry Keywords: artificial intelligence; machine learning; drug design; covid-19; structure-based drug design; ligand-based drug design
Online: 2 March 2022 (03:00:37 CET)
The recent covid crisis has taught academia and industry important lessons regarding digital reorganization. Among the fascinating lessons from these times is the huge potential of data analytics and artificial intelligence. The crisis exponentially accelerated the adoption of analytics and artificial intelligence, and this momentum is predicted to continue through the 2020s and beyond. Moreover, drug development is a costly and time-consuming business, and only a minority of approved drugs return revenue that exceeds the research and development costs. As a result, there is a huge drive to make drug discovery cheaper and faster. With modern algorithms and hardware, it is not surprising that artificial intelligence and other computational simulation tools can help drug developers. In only two years of covid research, many novel molecules have been designed or identified using artificial intelligence methods, with astonishing results in terms of time and effectiveness. This paper reviews the most significant research on artificial intelligence in de novo drug design for COVID-19 pharmaceutical research.
DATA DESCRIPTOR | doi:10.20944/preprints202104.0351.v1
Subject: Keywords: lecture based instruction; actual community-based instruction; maternal and child care; social competency skills; community awareness
Online: 13 April 2021 (12:47:52 CEST)
Maternal-child care is one of the foundations of primary health care, and nurses must apply the competency skills they have been taught. Community awareness is an important part of preventive healthcare, and nurses must be aware of the factors that impact the health of the community. This study examines the effectiveness of lecture-based instruction in maternal and child care and its implications for students' social competency skills and community awareness in nursing colleges in Nueva Ecija, Philippines. The researcher used a survey questionnaire and employed a descriptive design in which fifteen (15) nursing students and five (5) teachers were purposively selected. The findings revealed that the weighted mean for the effectiveness of lecture-based instruction in maternal and child care is 3.91, with a verbal description of “Effective”; the effect of lecture-based instruction in maternal and childcare on students' social competency skills and community awareness obtained a weighted mean of 3.87, interpreted as “very satisfactory”; and the effectiveness of actual community-based instruction is very effective, with a weighted mean of 4.25, higher than that of lecture-based instruction. The results also revealed that students and teachers were challenged by lecture-based instruction in maternal and childcare during distance learning. Recommendations for the enhancement of lecture-based instruction in maternal and childcare with respect to social competency skills and community awareness were also made.
REVIEW | doi:10.20944/preprints202104.0203.v1
Subject: Engineering, Automotive Engineering Keywords: Additive manufacturing; Fused Deposition Modelling; Robot-based additive manufacturing; Polylactic acid (PLA) and PLA-based composite.
Online: 7 April 2021 (12:24:16 CEST)
Over the last decade, a significant literature has emerged advocating the potential of different additive manufacturing (AM) technologies and printable polymeric materials. Nevertheless, large-scale printing and complex geometric shapes, with curvatures and non-planar layer deposition, are challenging for traditional gantry-based machines: the 3-degrees-of-freedom Cartesian configuration restricts their capability to planar layered printing and limits part dimensions. To date, many researchers have used industrial robots to overcome this limitation. This review gives the reader an overview of the FDM technique, chosen for its scalability, cost efficiency, and wide range of printable materials. A strong emphasis is laid on PLA and PLA-based composites as promising materials for FDM process applications. The second part of this paper links the successful use of these materials in the traditional printing process to large-scale printing using the robot-based FDM process. This survey presents representative setups for robot-based AM and works that have used these setups for non-planar material deposition. Finally, we conclude by identifying opportunities for realizing new functional capabilities by exploiting robot-based AM, and we present the future trends in this area.
ARTICLE | doi:10.20944/preprints202002.0249.v1
Subject: Biology, Agricultural Sciences & Agronomy Keywords: Fungal diversity; Saccharomyces; genetic diversity; glyphosate-based herbicides; copper-based fungicides; RoundUp Ready™ corn; phylogenetics
Online: 17 February 2020 (15:37:11 CET)
Saccharomyces cerevisiae is a phenotypically diverse species that adapts to a wide variety of environments by exploiting standing genetic diversity and selecting for advantageous mutations. Glyphosate- and copper-based herbicides/fungicides affect non-target organisms, and these incidental exposures can impact microbial populations. In this study, glyphosate resistance was found in a historical collection of yeast gathered over the last century, but only in yeast isolated after the introduction of glyphosate. The most glyphosate-resistant yeasts were isolated from agricultural sites; however, herbicide application at these sites was not recorded. To assess glyphosate resistance and its impact on non-target microorganisms, yeast were harvested from 15 areas with known herbicidal histories, including an organic farm, a conventional farm, a remediated coal mine, suburban locations, a state park, and a national forest. Yeast representing 23 genera were isolated from 237 samples of plant, soil, spontaneous fermentation, nut, flower, fruit, feces, and tree material. Saccharomyces, Candida, Metschnikowia, Kluyveromyces, Hanseniaspora, and Pichia were among the genera commonly found across our sampled environments. Managed areas had lower species diversity, and at the brewery only Saccharomyces and Pichia were isolated. A conventional farm growing RoundUp Ready™ corn had the lowest phylogenetic diversity and the highest glyphosate resistance. The mine was sprayed with multiple herbicides, including a commercial formulation of glyphosate; however, its yeast did not show elevated glyphosate resistance. In contrast to the conventional farm, the mine was exposed to glyphosate only one year prior to sample isolation. Glyphosate resistance is an example of the anthropogenic selection of non-target organisms.
REVIEW | doi:10.20944/preprints201812.0129.v1
Subject: Life Sciences, Biochemistry Keywords: food safety; gel-based proteomics; LC-based proteomics; post-translational modifications; proteomics; seed ageing; seed quality
Online: 11 December 2018 (11:00:26 CET)
For centuries, crop plants have represented the basis of the daily human diet. Among them, cereals and legumes, accumulating oils, proteins and carbohydrates in their seeds, distinctly dominate modern agronomic practice. Indeed, these plants play an essential role in the food industry and fuel production. Therefore, the seeds of crop plants are intensively studied by food chemists, biologists, biochemists, and nutritional physiologists. Accordingly, not only seed development and germination, but also age- and stress-related alterations in seed vigor, longevity, nutritional value and safety can be addressed by a broad panel of analytical, biochemical and physiological methods. Currently, functional genomics is one of the most powerful tools, giving direct access to the characteristic metabolic changes accompanying plant development, senescence and response to biotic or environmental stress. Among individual methodological platforms, proteomics is one of the most effective, providing access to cellular metabolism at the level of proteins. Here we discuss the main methodological approaches employed in seed proteomics in the context of physiological changes related to seed development, ageing and response to environmental stress.
REVIEW | doi:10.20944/preprints202209.0201.v1
Subject: Chemistry, Medicinal Chemistry Keywords: ligand-based pharmacophores; structure-based pharmacophores; virtual screening; drug design; machine learning; molecular dynamics; de novo design
Online: 14 September 2022 (09:10:58 CEST)
G protein-coupled receptors (GPCRs) are amongst the most pharmaceutically relevant and well-studied protein targets, yet unanswered questions in the field leave significant gaps in our understanding of their nuanced structure and function. 3D pharmacophore models are powerful computational tools in in silico drug discovery, presenting myriad opportunities for the integration of GPCR structural biology and cheminformatics. This review highlights success stories in the application of 3D pharmacophore modeling to de novo drug design, discovery of biased and allosteric ligands, scaffold hopping, QSAR analysis, hit-to-lead optimization, GPCR de-orphanization, mechanistic understanding of GPCR pharmacology, and elucidation of ligand-receptor interactions. Furthermore, advances in the incorporation of dynamics and machine learning are discussed. The review analyzes challenges in the field of GPCR drug discovery, detailing how 3D pharmacophore modeling can be used to address them. Finally, we present opportunities afforded by 3D pharmacophore modeling in the advancement of our understanding and targeting of GPCRs.
ARTICLE | doi:10.20944/preprints202301.0534.v1
Subject: Engineering, Mechanical Engineering Keywords: EMA; prognostics; PHM; model-based; metaheuristic; MEA
Online: 30 January 2023 (02:39:27 CET)
The deployment of electro-mechanical actuators (EMAs) plays an important role in the adoption of the More Electric Aircraft (MEA) philosophy. On the other hand, a seamless substitution of EMAs for more traditional hydraulic solutions is still held back by the shortage of real-life reliability data regarding their failure modes. One way to work around this problem is to provide a capillary EMA Prognostics and Health Management (PHM) system, capable of recognizing failures before they undermine the ability of the safety-critical system to perform its functions. The authors have developed a model-based prognostic framework for PMSM-based EMAs leveraging metaheuristic algorithms: evolutionary (Differential Evolution (DE)) and swarm intelligence (particle swarm optimization (PSO), grey wolf optimization (GWO)) methods are considered. Several failures (dry friction, backlash, short circuit, eccentricity, and proportional gain) are simulated by a Reference Model, acting as a numerical test bench, then detected and identified by the envisioned prognostic method, which leverages a low-fidelity Monitoring Model. The employed algorithms showed good results and prove that this strategy could be executed in pre-flight checks or during flight at specific time intervals, with positive impacts on system safety and availability.
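The evolutionary branch of such a framework, Differential Evolution, can be sketched as a minimal DE/rand/1/bin loop that searches for fault parameters minimizing a residual against reference values. The objective function and parameter names below are illustrative assumptions, not the paper's actual Monitoring Model:

```python
import random

random.seed(1)  # reproducible sketch

def differential_evolution(objective, bounds, pop_size=20, F=0.8, CR=0.9, generations=100):
    """Minimal DE/rand/1/bin: evolve candidate parameter vectors toward
    the minimum of `objective`, keeping a trial only if it improves."""
    dim = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # mutate three distinct individuals other than the current one
            a, b, c = random.sample([x for j, x in enumerate(pop) if j != i], 3)
            trial = [
                min(max(a[d] + F * (b[d] - c[d]), bounds[d][0]), bounds[d][1])
                if random.random() < CR else pop[i][d]
                for d in range(dim)
            ]
            s = objective(trial)
            if s < scores[i]:  # greedy selection
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Hypothetical residual: squared error between candidate fault parameters
# and values "measured" on a reference model (illustrative only)
target = [0.3, 0.7]  # e.g. friction and backlash coefficients
objective = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))

params, err = differential_evolution(objective, bounds=[(0, 1), (0, 1)])
```

In a real PHM setting the objective would compare simulated and measured actuator responses rather than the parameters directly.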
ARTICLE | doi:10.20944/preprints202301.0118.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Deep Learning; Optimization; Benchmarking; Gradient based optimizers
Online: 6 January 2023 (06:31:40 CET)
The initial choice of learning rate is a key part of gradient-based methods and has a great effect on the performance of a deep learning model. This paper studies the behavior of multiple gradient-based optimization algorithms commonly used in deep learning and compares their performance across various learning rates. As observed, popular optimization algorithms are highly sensitive to the choice of learning rate. Our goal is to find which optimizer has an edge over the others in a specific setting. We use two datasets, MNIST and CIFAR10, for benchmarking. The results are quite surprising and will help practitioners choose a learning rate more efficiently.
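The learning-rate sensitivity described above can be illustrated on a toy 1-D quadratic, sweeping the learning rate for plain SGD and an Adam-style update; this is a minimal sketch, not the paper's MNIST/CIFAR10 benchmark:

```python
import math

def sgd(grad, x0, lr, steps=100):
    """Plain gradient descent."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def adam(grad, x0, lr, steps=100, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-style update with bias correction."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        x -= lr * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return x

loss = lambda x: (x - 3.0) ** 2   # toy objective, minimum at x = 3
grad = lambda x: 2.0 * (x - 3.0)

results = {}
for lr in (0.001, 0.01, 0.1, 1.0):
    results[lr] = {"sgd": loss(sgd(grad, 0.0, lr)), "adam": loss(adam(grad, 0.0, lr))}
# SGD oscillates without converging at lr = 1.0 but converges at lr = 0.1,
# while at lr = 0.001 it barely moves in 100 steps
```

Even on this trivial objective the usable learning-rate range differs per optimizer, which is exactly the effect the benchmark measures at scale.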
ARTICLE | doi:10.20944/preprints202211.0556.v2
Online: 1 December 2022 (02:09:32 CET)
Agent-based models (ABMs) are computational models for simulating the actions and interactions of autonomous agents in time and space. These models allow users to simulate the complex interactions between individual agents and the landscapes they inhabit and are increasingly used in epidemiology to understand complex phenomena and make predictions. However, as the complexity of the simulated systems increases, notably when disease control interventions are considered, model flexibility and processing speed can become limiting. Here we introduce SamPy, an open-source Python library for stochastic agent-based modeling of epidemics. SamPy is a modular toolkit for model development, providing adaptable modules that capture host movement, disease dynamics, and disease control interventions. Memory optimization and design provide high computational efficiency, allowing the modelling of large, spatially explicit populations of agents over extensive geographical areas. In this article, we demonstrate the high flexibility and processing speed of this new library. The version of SamPy considered in this paper is available at https://github.com/sampy-project/sampy-paper.
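A stochastic agent-based epidemic of the kind such libraries simulate can be sketched, in its simplest form, as an SIR model over a list of agents. This toy loop deliberately does not use SamPy's API (which is not shown in the abstract); `beta` and `gamma` are assumed illustrative transmission and recovery probabilities:

```python
import random

def step(agents, beta=0.3, gamma=0.1):
    """One time step of a toy SIR agent-based model: each infected agent
    contacts one random agent and may transmit; it then may recover."""
    infected = [i for i, s in enumerate(agents) if s == "I"]
    for i in infected:
        j = random.randrange(len(agents))
        if agents[j] == "S" and random.random() < beta:
            agents[j] = "I"   # transmission
    for i in infected:
        if random.random() < gamma:
            agents[i] = "R"   # recovery
    return agents

random.seed(0)  # reproducible run
agents = ["I"] * 5 + ["S"] * 995   # 5 seed infections in a population of 1000
for _ in range(200):
    step(agents)
counts = {s: agents.count(s) for s in "SIR"}
```

Libraries like the one described add what this sketch lacks: spatially explicit movement, intervention modules, and memory-efficient state storage for very large populations.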
ARTICLE | doi:10.20944/preprints202210.0464.v1
Subject: Mathematics & Computer Science, Analysis Keywords: Kabirian-based optinalysis; estimators; properties; computing codes
Online: 31 October 2022 (04:53:43 CET)
Good estimators are robust, unbiased, efficient, and consistent. However, commonly used estimators are weak in, or lack, one or more of these properties. In this article, eight (8) estimators for statistical and geometrical estimation of symmetry/asymmetry, similarity/dissimilarity, identity/unidentity, and feature transformation are proposed, following Kabirian-based optinalysis and other operations. The proposed estimators are invariant (robust) under scaling, location shift, and rotation or reflection. Computing code was written in Python for each of the proposed estimators so that peers have working code for application and performance evaluation.
ARTICLE | doi:10.20944/preprints202210.0192.v1
Subject: Mathematics & Computer Science, Analysis Keywords: Knowledge-based Systems; Ontology; Knowledge Engineering; MCDA.
Online: 13 October 2022 (09:54:49 CEST)
Decision making based on system dynamics analysis requires, in practice, a straightforward and systematic modelling capability as well as a high level of customisation and flexibility to adapt to situations and environments that may vary greatly from each other. While in general terms a completely generic approach may not be as effective as ad-hoc solutions, the proper application of modern technology can facilitate agile strategies through a smart combination of qualitative and quantitative aspects. To address such complexity, we propose a knowledge-based approach that integrates the systematic computation of heterogeneous criteria with open semantics. The holistic understanding of the framework is described by a reference architecture, and the proof-of-concept prototype developed can support high-level system analysis; it is also suitable in a number of application contexts, e.g. as a research/educational tool, a communication framework, and for gamification and participatory modelling. Additionally, the knowledge-based philosophy, developed upon Semantic Web technology, increases the capability for holistic knowledge building and reuse via interoperability. Last but not least, the framework is designed to evolve constantly in the near future, for instance by incorporating more advanced AI-powered features.
ARTICLE | doi:10.20944/preprints202203.0239.v1
Subject: Engineering, Civil Engineering Keywords: ATO; Performance Evaluation; Scenario-based Testing; Simulation
Online: 17 March 2022 (02:42:05 CET)
There is increasing interest in automating train operations of mainline services, e.g. to increase network capacity. Automatic train operation (ATO) has already been achieved in several pilot projects but is not yet implemented on a large scale. Before the general introduction of new or adapted technologies can have a transformative effect on the operation of a system as complex as mainline train operation, they have to pass functional, interoperability, and performance tests. A virtual preliminary analysis is one way to ensure a smooth and safe introduction and implementation. This paper presents an approach to the performance testing of ATO systems. To this end, methods and test standards for technologies enabling automatic operation in other transport sectors are reviewed. The main findings have been adapted, transformed, and combined into a general strategy for virtual performance testing in the railway sector. Specifically, universal performance indicators, namely punctuality, accuracy, energy consumption, safety, and comfort, are presented. A layer model for scenario description is adapted from the automotive sector, as is the definition of different scenario types. Lastly, factors that can influence the performance of an ATO algorithm are identified. To demonstrate the developed approach, a straightforward case study is conducted using a microscopic train simulator in combination with an ATO algorithm.