ARTICLE | doi:10.20944/preprints201808.0350.v2
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: big data; clustering; data mining; educational data mining; e-learning; profile learning
Online: 19 October 2018 (05:58:05 CEST)
Educational data mining is an evolving discipline that focuses on the improvement of self-learning and adaptive methods, and is used for finding hidden patterns and intrinsic structures in educational data. Education involves heterogeneous data that is continuously growing in the big-data paradigm, so extracting meaningful information adaptively from big educational data requires specific data mining techniques. This paper presents a clustering approach that partitions students into groups or clusters based on their learning behavior. Furthermore, a personalized e-learning system architecture is presented which detects students' learning capabilities and delivers teaching content accordingly. The primary objective is the discovery of optimal settings in which learners can improve their learning capabilities; moreover, the administration can find essential hidden patterns to bring effective reforms to the existing system. The clustering methods K-Means, K-Medoids, Density-Based Spatial Clustering of Applications with Noise, Agglomerative Hierarchical Cluster Tree, and Clustering by Fast Search and Finding of Density Peaks via Heat Diffusion (CFSFDP-HD) are analyzed on educational data. It is observed that more robust results can be achieved by replacing the existing methods with CFSFDP-HD. Data mining techniques are equally effective for analyzing big data to make education systems vigorous.
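As an illustration of the simplest of the clustering methods compared above, the following is a minimal k-means sketch in plain Python; the student features (e.g. quiz score and hours online) and the two-group structure are hypothetical, not the paper's actual data.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means: assign each point to the nearest center, then move
    each center to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the center with the smallest squared distance to p
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # recompute each center as the mean of its cluster (keep old if empty)
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters

# hypothetical students as (quiz score, hours online)
students = [(1, 2), (1, 1), (2, 1), (8, 9), (9, 8), (9, 9)]
centers, clusters = kmeans(students, 2)
```

With a density-peak method such as CFSFDP-HD, the number of clusters would instead emerge from the density structure rather than being fixed in advance as k.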
ARTICLE | doi:10.20944/preprints202103.0593.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Business Intelligence; Data Mining; Data Warehouse.
Online: 24 March 2021 (13:47:31 CET)
In the coming years, the number of digital applications and services built on cloud-native systems will be enormous. According to IDC, it will exceed 500 million by 2023, which corresponds to the sum of all applications developed in the last 40 years. If these figures matter to your organization, this article is for you!
ARTICLE | doi:10.20944/preprints202308.1237.v1
Subject: Engineering, Transportation Science And Technology Keywords: data mining; data extraction; data science; cost infrastructure projects
Online: 17 August 2023 (09:25:22 CEST)
Context: Despite the effort put into developing standards for structuring construction costs, and the strong interest in the field, most construction companies still gather and process data manually. This provokes inconsistencies, differing classification criteria, and misclassifications, and the process becomes very time-consuming, particularly on big projects. Additionally, the lack of standardization makes cost estimation and comparison very difficult. Objective: To create a method to extract and organize construction cost and quantity data into a consistent format and structure, enabling rapid and reliable digital comparison of the content. Method: The approach consists of two steps. First, the system applies data mining to review the input document and determine how it is structured, based on the position, format, sequence, and content of descriptive and quantitative data. Second, the extracted data is processed and classified with a combination of data science and expert knowledge to fit a common format. Results: A wide variety of information from real historical projects has been successfully extracted and processed into a common format with 97.5% accuracy, using a subset of 5770 assets spread across 18 different files, building a solid base for analysis and comparison. Conclusion: A robust and accurate method was developed for extracting hierarchical project cost data into a common machine-readable format, enabling rapid and reliable comparison and benchmarking.
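The classification step described above can be caricatured as keyword-rule matching scored against expert labels; the rule set, category names, and cost lines below are invented for illustration and are not taken from the paper.

```python
def classify(description, rules):
    """Map a free-text cost line to a category via ordered keyword rules:
    a crude stand-in for combining data science with expert knowledge."""
    text = description.lower()
    for keyword, category in rules:
        if keyword in text:
            return category
    return "unclassified"

def accuracy(items, rules):
    """Fraction of (description, expected category) pairs classified correctly."""
    hits = sum(classify(desc, rules) == cat for desc, cat in items)
    return hits / len(items)

# hypothetical rules and labeled cost lines
rules = [("concrete", "structure"), ("cable", "electrical")]
items = [("Concrete C30 slab", "structure"),
         ("Power cable 3x2.5", "electrical"),
         ("Paint", "finishes")]
```

Real extraction pipelines of this kind report accuracy against an expert-labeled subset, exactly as `accuracy` does here.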
ARTICLE | doi:10.20944/preprints202105.0102.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Market basket analysis; association rule mining; buying pattern; data mining
Online: 6 May 2021 (15:14:25 CEST)
Buyer behavior has changed as people have learned to live with the new reality of COVID-19. Take-out and delivery orders have increased, and our client has added new items to their menu in response to new customer preferences. With all of these ongoing changes, the client had many unanswered questions, for example: Are the most popular items still the same after COVID? Which product combinations sell the most now? How well are new items being accepted? What do customers buy alongside new items? How have alcohol sales changed? (Smartbridge has broad experience with restaurant technology capabilities.) The client already had reports tracking product sales and operational metrics; however, there was a need for deeper insight through product analytics. The client needed to identify which products and presentations were sold most often, measure the acceptance of new products, and determine what products customers purchase together, in order to improve marketing campaigns, promotions, and sales. The e-commerce industry is growing immensely in the Indian market, and cheap 4G internet packages in India clearly give these ventures a push. Thus, when COVID-19 first hit India, people became afraid to leave their homes for fear of the virus; they even hesitated to go out to buy essential (FMCG) goods. Panic buying was also observed, and to avoid this fear of COVID-19, people gave preference to e-commerce sites for buying essential goods; some customers were new, signing up to buy essentials during the pandemic lockdown period. Many customers are shifting their buying behavior from offline retail stores to online stores. This paper examines customer buying patterns during the lockdown.
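Questions such as "which product combinations sell the most now?" are classic market-basket analysis. The sketch below computes pairwise support and confidence in plain Python; the transactions, thresholds, and function name are illustrative assumptions, not the study's actual data.

```python
from itertools import combinations
from collections import Counter

def association_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Emit rules X -> Y over item pairs, with support = P(X and Y) and
    confidence = P(Y | X), keeping only those above the thresholds."""
    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in set(t))
    pair_counts = Counter(frozenset(p) for t in transactions
                          for p in combinations(sorted(set(t)), 2))
    rules = []
    for pair, c in pair_counts.items():
        support = c / n
        if support < min_support:
            continue
        a, b = sorted(pair)
        for x, y in ((a, b), (b, a)):
            confidence = c / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

# hypothetical take-out orders
orders = [["bread", "milk"], ["bread", "butter"], ["milk", "bread"],
          ["milk", "soda"], ["bread", "milk", "soda"]]
rules = association_rules(orders)
```

A rule such as ("soda", "milk", 0.4, 1.0) reads: soda and milk co-occur in 40% of orders, and every order containing soda also contains milk.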
ARTICLE | doi:10.20944/preprints201908.0019.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: emotion classification; machine learning classifiers; ISEAR dataset; data mining; performance evaluation; data science; opinion-mining
Online: 2 August 2019 (08:49:27 CEST)
Emotion detection from text is an important and challenging problem in text analytics. Opinion-mining experts are focusing on the development of emotion detection applications, as these have received considerable attention from the online community, including users and business organizations, for collecting and interpreting public emotions. However, most existing work on emotion detection has used less efficient machine learning classifiers with limited datasets, resulting in performance degradation. To overcome this issue, this work evaluates the performance of different machine learning classifiers on a benchmark emotion dataset. The experimental results show the performance of the classifiers in terms of evaluation metrics such as precision, recall, and f-measure. Finally, the classifier with the best performance is recommended for emotion classification.
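The evaluation metrics mentioned above — precision, recall, and f-measure — can be computed per emotion class directly from label lists. This is a generic sketch, not tied to the ISEAR dataset or to any particular classifier; the labels are invented.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall and F-measure for one class, from label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# hypothetical gold labels vs. classifier output
gold = ["joy", "anger", "joy", "fear"]
pred = ["joy", "joy", "joy", "fear"]
p, r, f = prf(gold, pred, "joy")
```

Averaging these per-class scores (macro or weighted) gives the dataset-level figures typically reported when comparing classifiers.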
ARTICLE | doi:10.20944/preprints202208.0224.v1
Subject: Engineering, Automotive Engineering Keywords: VR-XGBoost; K-VDTE; ETC data; ESAs; data mining
Online: 12 August 2022 (03:53:23 CEST)
To scientifically and effectively evaluate the service capacity of expressway service areas (ESAs) and improve the management level of ESAs, we propose a method for the recognition of vehicles entering ESAs (VeESAs) and the estimation of vehicle dwell times using ETC data. First, the ETC data and their advantages are described in detail, and cleaning rules are designed according to the characteristics of the ETC data. Second, we established feature engineering according to the characteristics of VeESA and proposed the XGBoost-based VeESA recognition (VR-XGBoost) model. Having studied driving rules in depth, we constructed a kinematics-based vehicle dwell time estimation (K-VDTE) model. Field validation in Parts A and B of the Yangli ESA using real ETC transaction data demonstrates that our proposal outperforms the current state of the art. Specifically, in Part A and Part B, the recognition accuracies of VR-XGBoost are 95.9% and 97.4%, respectively, the mean absolute errors (MAEs) of dwell time are 52 s and 14 s, respectively, and the root mean square errors (RMSEs) are 69 s and 22 s, respectively. In addition, the confidence level of controlling the MAE of dwell time within 2 minutes is more than 97%. This work can effectively identify VeESAs and accurately estimate dwell times, providing a reference idea and theoretical basis for the service capacity evaluation and layout optimization of ESAs.
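The reported MAE and RMSE of dwell times follow the standard definitions, sketched here for second-valued dwell times; the sample values are made up, not taken from the Yangli ESA data.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: sqrt of the mean squared deviation."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

# hypothetical dwell times in seconds
actual = [600, 300, 900]
predicted = [610, 290, 930]
```

RMSE penalizes large errors more heavily than MAE, which is why both are reported together for dwell-time estimation.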
ARTICLE | doi:10.20944/preprints201909.0040.v1
Subject: Business, Economics And Management, Business And Management Keywords: data mining; security; association rule; ECLAT
Online: 4 September 2019 (03:48:58 CEST)
The purpose of this paper is to develop the WebSecuDMiner algorithm to discover unusual web access patterns by analysing the potential rules hidden in the web server log and user navigation history. Design/methodology/approach: WebSecuDMiner uses the equivalence class transformation (ECLAT) algorithm to extract user access patterns from the web log data, which are used to identify user access behaviour patterns and detect unusual ones. Data extracted from the web server log and user browsing behaviour is exploited to retrieve the web access pattern produced by the same user. Findings: WebSecuDMiner is used to detect whether any unauthorized access has occurred and to take appropriate decisions regarding the review of the original rights of a suspicious user. Research limitations/implications: The present work uses a database extracted from the web server log file and user browsing behaviour. A page viewed by the user may not be recorded in the server log file, since it can be accessed from the browser's cache.
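ECLAT works on a vertical data layout: each item maps to the set of transaction ids containing it (its tid-list), and itemset supports are obtained by intersecting tid-lists. A minimal sketch, limited to singletons and pairs for brevity, with invented page-access data (not WebSecuDMiner's actual rules):

```python
def eclat(transactions, min_support):
    """Frequent itemsets via tid-list intersection (singletons and pairs only).
    Returns a dict mapping each frequent itemset to its tid-list."""
    tids = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault(item, set()).add(tid)
    frequent = {frozenset([i]): s for i, s in tids.items() if len(s) >= min_support}
    items = sorted(tids)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = tids[a] & tids[b]  # support of {a, b} = |tids(a) ∩ tids(b)|
            if len(common) >= min_support:
                frequent[frozenset([a, b])] = common
    return frequent

# hypothetical per-session page accesses from a web server log
sessions = [["/login", "/home"], ["/login", "/admin"], ["/login", "/home"]]
freq = eclat(sessions, min_support=2)
```

A full ECLAT recursively extends itemsets depth-first within equivalence classes; the pairwise intersection above shows the core operation.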
ARTICLE | doi:10.20944/preprints201610.0012.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: data exchange; resource donations; text mining
Online: 5 October 2016 (15:08:32 CEST)
Bio-molecular reagents like antibodies required in experimental biology are expensive, and their effectiveness, among other things, is critical to the success of the experiment. Although such resources are sometimes donated by one investigator to another through personal communication between the two, there is, to our knowledge, no previous study on the extent of such donations, nor a central platform that directs resource seekers to donors. In this paper, we describe what is, to our knowledge, a first attempt at building a web portal, titled Bio-Resource Exchange, that attempts to bridge this gap between resource seekers and donors in the domain of experimental biology. Users on this portal can request or donate antibodies, cell lines, and DNA constructs. This resource could also serve as a crowd-sourced database of resources for experimental biology. Further, in order to index donations outside of our portal, we mined scientific articles to find instances of donations of antibodies and attempted to extract information about these donations at the finest granularity. Specifically, we extracted the name of the donor, his/her affiliation, and the name of the antibody for every donation by parsing the acknowledgements sections of articles. To extract annotations at this level, we propose two approaches: a rule-based algorithm and a bootstrapped relation-learning algorithm. The algorithms extracted donor names, affiliations, and antibody names with average accuracies of 57% and 62%, respectively. We also created a dataset of 50 expert-annotated acknowledgements sections that will serve as a gold-standard dataset for evaluating extraction algorithms in the future. Contact: email@example.com, firstname.lastname@example.org Database URL: http://tonks.dbmi.pitt.edu/brx Supplementary information: Supplementary data are available at Database online.
ARTICLE | doi:10.20944/preprints202108.0256.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Learning Analytics; Education; Educational Data Mining; Pattern Recognition; Data Visualization
Online: 11 August 2021 (11:23:48 CEST)
With the exponential growth of today's technology and its expanding areas of application, it has become vital to incorporate it into education. One such application is Knowledge Discovery in Databases (KDD), of which data mining is a core step. KDD deals with extracting useful information and meaningful patterns from a database that were not known before. This study is a detailed application of KDD, focusing on analyzing why a particular set of students performed better than others and what factors influenced the results. The study is conducted on a dataset of 480 students across 16 different features. The authors implemented four major classification techniques, namely Logistic Regression, Decision Tree, Random Forest, and the XGB classifier. Taking the key features that have a major impact on student performance, as identified by the top-performing ML algorithms, the study uses these features as a baseline for further analysis. Further data analysis highlights patterns in the data. The study concludes that many non-academic factors influence the overall performance of a student and should be taken into consideration by universities and other relevant bodies.
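As a crude stand-in for the model-based feature importances the study extracts from its top-performing classifiers, one can rank numeric features by the gap between their class means; the tiny dataset below (two features per student, high/low performers) is hypothetical.

```python
def rank_features(rows, labels, positive):
    """Rank feature indices by the absolute gap between their class means:
    a simple proxy for model-based feature importance."""
    pos = [r for r, l in zip(rows, labels) if l == positive]
    neg = [r for r, l in zip(rows, labels) if l != positive]

    def mean(vals):
        return sum(vals) / len(vals)

    gaps = []
    for j in range(len(rows[0])):
        gap = abs(mean([r[j] for r in pos]) - mean([r[j] for r in neg]))
        gaps.append((gap, j))
    return [j for gap, j in sorted(gaps, reverse=True)]

# hypothetical students: (raised-hands count, absence days), performance class
rows = [(90, 5), (85, 6), (30, 5), (25, 6)]
labels = ["High", "High", "Low", "Low"]
ranking = rank_features(rows, labels, "High")
```

Here feature 0 separates the classes strongly while feature 1 does not, so it ranks first — the same intuition tree-based importances formalize.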
ARTICLE | doi:10.20944/preprints201806.0440.v1
Subject: Computer Science And Mathematics, Computational Mathematics Keywords: clustering; spatial data; grid-based k-prototypes; data mining; sustainability
Online: 27 June 2018 (10:21:22 CEST)
Data mining plays a critical role in sustainable decision making. The k-prototypes algorithm is one of the best-known algorithms for clustering both numeric and categorical data. Despite this, clustering a large number of spatial objects with mixed numeric and categorical attributes is still inefficient due to its high time complexity. In this paper, we propose efficient grid-based k-prototypes algorithms, GK-prototypes, which achieve high performance for clustering spatial objects. The first proposed algorithm utilizes both the maximum and minimum distances between cluster centers and a cell, which removes unnecessary distance calculations. The second proposed algorithm, an extension of the first, exploits spatial dependence: spatial data tend to be more similar the closer the objects are. Each cell has a bitmap index which stores the categorical values of all objects in the same cell for each attribute. This bitmap index improves performance when categorical data are skewed. Our evaluation experiments showed that the proposed algorithms achieve better performance than the existing pruning technique in the k-prototypes algorithm.
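The pruning idea in the first GK-prototypes algorithm rests on the minimum and maximum distances between a grid cell and a cluster center: if a cell's minimum distance to center A exceeds its maximum distance to center B, A can be skipped for every object in the cell. A sketch of those two bounds for axis-aligned cells, with illustrative coordinates (the paper's distance also includes a categorical term, omitted here):

```python
def min_dist(cell_lo, cell_hi, center):
    """Minimum Euclidean distance from any point in the axis-aligned cell
    [cell_lo, cell_hi] to a center (0 if the center lies inside the cell)."""
    s = 0.0
    for lo, hi, c in zip(cell_lo, cell_hi, center):
        d = max(lo - c, 0.0, c - hi)  # per-axis gap to the cell interval
        s += d * d
    return s ** 0.5

def max_dist(cell_lo, cell_hi, center):
    """Maximum Euclidean distance from any point in the cell to a center:
    per axis, the farther of the two cell corners."""
    s = 0.0
    for lo, hi, c in zip(cell_lo, cell_hi, center):
        d = max(abs(c - lo), abs(c - hi))
        s += d * d
    return s ** 0.5
```

With these bounds, a whole cell's objects can skip a candidate center without computing a single per-object distance.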
ARTICLE | doi:10.20944/preprints201906.0144.v1
Subject: Computer Science And Mathematics, Security Systems Keywords: data mining; network security; association rules; DDoS
Online: 16 June 2019 (02:42:59 CEST)
Typical modern information systems are required to process copious data. Conventional manual approaches can no longer effectively analyze such massive amounts of data, so humans resort to smart techniques and tools to complement human effort. Currently, network security events occur frequently and generate abundant log and alert files; processing such vast quantities of data particularly requires smart techniques. This study reviewed several crucial developments in existing data mining algorithms, including those that compile alerts generated by heterogeneous IDSs into scenarios and those that employ various HMMs to detect complex network attacks. Moreover, sequential pattern mining algorithms were examined for developing multi-step intrusion detection. Future studies can focus on applying these algorithms in practical settings to effectively reduce the occurrence of false alerts. This article surveys the application of data mining algorithms in network security, a topic on which the academic community has recently generated numerous studies.
ARTICLE | doi:10.20944/preprints202308.1391.v1
Subject: Engineering, Transportation Science And Technology Keywords: data extraction; data mining; railway infrastructure costs; infrastructure costs data analysis; cost analysis
Online: 18 August 2023 (16:03:08 CEST)
The capability of extracting information and analyzing it in a common format is essential for performing predictions, comparing projects through cost benchmarking, and developing a deeper understanding of project costs. However, the lack of standardization and the manual entry of data make this process very time-consuming, unreliable, and inefficient. To tackle this problem, a novel, high-impact approach is presented that combines the benefits of data mining, statistics, and machine learning to extract and analyze railway infrastructure cost data. To validate the suggested approach, data from 23 real historical projects from the client Network Rail was extracted, allowing their costs to be compared. Finally, machine learning and data analytics methods were implemented to identify the most relevant factors, allowing for cost benchmarking. The presented method demonstrates the benefits of data extraction, being able to gather, analyze, and benchmark each project efficiently, and to deeply understand the relationships and relevant factors that matter in infrastructure costs.
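A simple form of the cost benchmarking described above is to normalize each project's total cost by a size measure so that projects of different scales become comparable; the cost-per-km normalization and the sample projects below are assumptions for illustration, not the paper's actual factors.

```python
def benchmark(projects):
    """Normalize each (name, total_cost, route_km) project to cost per km and
    return the projects sorted from cheapest to most expensive per km."""
    return sorted(((name, cost / km) for name, cost, km in projects),
                  key=lambda pair: pair[1])

# hypothetical projects: (name, total cost in GBP millions, route length in km)
projects = [("A", 120.0, 40.0), ("B", 90.0, 20.0)]
ranked = benchmark(projects)
```

Once costs share a common unit, outliers and cost drivers can be investigated project by project.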
ARTICLE | doi:10.20944/preprints202007.0078.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: personalization; decision making; medical data; artificial intelligence; data-driven; Big Data; Data Mining; Machine Learning
Online: 5 July 2020 (15:04:17 CEST)
This study applies machine learning and data mining methods to personalize treatment, which allows individual patient characteristics to be investigated. Personalization is built on the clustering method and association rules. We suggest determining the average distance between instances to find optimal performance metrics. A formalization of the medical data pre-processing stage for finding personalized solutions based on current standards and pharmaceutical protocols is proposed, and a model of patient data is built. The paper presents a novel approach to clustering built on an ensemble of clustering algorithms, with better Hopkins metrics than the k-means algorithm. Personalized treatment is usually based on decision trees; such an approach requires a lot of computation time and cannot be parallelized. Therefore, it is proposed to classify persons by conditions and to determine deviations of parameters from the normative parameters of the group, as well as from the average parameters. This made it possible to create a personalized approach to treatment for each patient based on long-term monitoring. According to the results of the analysis, it becomes possible to predict the optimal conditions for a particular patient and to find the medication treatment according to personal characteristics.
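The suggested "average distance between instances" can be read as the mean pairwise Euclidean distance over the patient vectors, sketched below; both this reading and the instance vectors are assumptions, not the paper's exact formulation.

```python
from itertools import combinations

def avg_pairwise_distance(instances):
    """Mean Euclidean distance over all unordered pairs of instance vectors."""
    pairs = list(combinations(instances, 2))
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                for p, q in pairs)
    return total / len(pairs)

# hypothetical patient feature vectors (e.g. two normalized parameters)
patients = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
d = avg_pairwise_distance(patients)
```

Such a dataset-level distance summary is a cheap signal for tuning clustering parameters before the more expensive ensemble step.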
ARTICLE | doi:10.20944/preprints202008.0487.v1
Subject: Social Sciences, Geography, Planning And Development Keywords: Twitter; data reliability; risk communication; data mining; Google Cloud Vision API
Online: 22 August 2020 (02:32:40 CEST)
While Twitter has been touted to provide up-to-date information about hazard events, the reliability of tweets is still a concern. Our previous publication extracted relevant tweets containing information about the 2013 Colorado flood event and its impacts. Using the relevant tweets, this research further examined the reliability (accuracy and trueness) of the tweets by examining the text and image content and comparing them to other publicly available data sources. Both manual identification of text information and automated (Google Cloud Vision API) extraction of images were implemented to balance accurate information verification and efficient processing time. The results showed that both the text and images contained useful information about damaged/flooded roads/street networks. This information will help emergency response coordination efforts and the informed allocation of resources when enough tweets contain geocoordinates or location/venue names. This research will help identify reliable crowdsourced risk information to enable near-real-time emergency response through better use of crowdsourced risk communication platforms.
COMMUNICATION | doi:10.20944/preprints202206.0172.v3
Subject: Computer Science And Mathematics, Information Systems Keywords: Monkeypox; monkey pox; Twitter; Dataset; Tweets; Social Media; Big Data; Data Mining; Data Science
Online: 25 July 2022 (09:41:19 CEST)
ARTICLE | doi:10.20944/preprints202105.0601.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Mobile RPG; Big Data; Text Mining; Topic Modeling
Online: 25 May 2021 (10:21:36 CEST)
Because RPGs yield high sales and profits, many developers have supplied various RPGs to the market, but the genre has shifted toward mass production, with sensational advertising, low quality, excessive charging, and similar content, which affects the game market and users' play experience. The author of this paper studied ways to improve mobile RPGs by collecting and analyzing users' reviews crawled from the Google Play Store. Topic modeling, a text mining technique, was applied with LDA (Latent Dirichlet Allocation) to extract meaningful information from the collected big data and visualize it. Inferring from users' reviews, objectively identifying opinions, and seeking ways to improve games are helpful in building mobile RPGs that users will keep playing.
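Before LDA can be applied, reviews are tokenized and counted. The sketch below shows that preprocessing step (top content words across reviews) with invented reviews and stop words; it is the token counting that feeds a topic model, not an LDA implementation itself.

```python
from collections import Counter

def top_terms(reviews, stopwords, k):
    """Top-k content words across all reviews after removing stop words:
    the bag-of-words counting that precedes topic modeling such as LDA."""
    words = (w for review in reviews
             for w in review.lower().split()
             if w not in stopwords)
    return [w for w, _ in Counter(words).most_common(k)]

# hypothetical crawled reviews and a toy stop-word list
reviews = ["the gacha rates are bad", "bad gacha and bad ads"]
stopwords = {"the", "are", "and"}
terms = top_terms(reviews, stopwords, 2)
```

In a real pipeline these counts become the document-term matrix on which LDA estimates per-topic word distributions.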
DATA DESCRIPTOR | doi:10.20944/preprints202308.1701.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: disease X; big data; data science; data analysis; dataset development; database; google trends; data mining; healthcare; epidemiology
Online: 24 August 2023 (05:48:54 CEST)
The World Health Organization (WHO) added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, from February 2018 to August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. These regions were chosen for data mining as they recorded significant search interests related to Disease X during this timeframe. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, a brief analysis of specific features of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis.
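A dataset of monthly search interests per region, as described, lends itself to simple per-region summaries. The mapping shape and the `peak_interest` helper below are assumptions mirroring a Google Trends monthly export, not the dataset's actual schema.

```python
def peak_interest(series):
    """Given {region: {month: search_interest}}, return for each region the
    (month, value) pair with the highest search interest."""
    return {region: max(months.items(), key=lambda kv: kv[1])
            for region, months in series.items()}

# hypothetical monthly search-interest values (Google Trends scales 0-100)
series = {"US": {"2020-01": 10, "2020-03": 55},
          "IN": {"2020-01": 80, "2020-03": 40}}
peaks = peak_interest(series)
```

Comparing peak months across regions is one quick way to see whether interest spikes were globally synchronized or region-specific.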
COMMUNICATION | doi:10.20944/preprints202303.0453.v1
Subject: Social Sciences, Media Studies Keywords: COVID-19; MPox; Twitter; Big Data; Data Mining; Data Analysis; Sentiment Analysis; Data Science; Social Media; Monkeypox
Online: 27 March 2023 (08:39:28 CEST)
Mining and analysis of the Big Data of Twitter conversations have been of significant interest to the scientific community in the fields of healthcare, epidemiology, big data, data science, computer science, and their related areas, as can be seen from several works in the last few years that focused on sentiment analysis and other forms of text analysis of Tweets related to Ebola, E-Coli, Dengue, Human papillomavirus (HPV), Middle East Respiratory Syndrome (MERS), Measles, Zika virus, H1N1, influenza-like illness, swine flu, flu, Cholera, Listeriosis, cancer, Liver Disease, Inflammatory Bowel Disease, kidney disease, lupus, Parkinson's, Diphtheria, and West Nile virus. The recent outbreaks of COVID-19 and MPox have served as "catalysts" for Twitter usage related to seeking and sharing information, views, opinions, and sentiments involving both these viruses. While there have been a few works published in the last few months that focused on performing sentiment analysis of Tweets related to either COVID-19 or MPox, none of the prior works in this field thus far involved analysis of Tweets focusing on both COVID-19 and MPox at the same time. With an aim to address this research gap, a total of 61,862 Tweets that focused on Mpox and COVID-19 simultaneously, posted between May 7, 2022, to March 3, 2023, were studied to perform sentiment analysis and text analysis. The findings of this study are manifold. First, the results of sentiment analysis show that almost half the Tweets (the actual percentage is 46.88%) had a negative sentiment. It was followed by Tweets that had a positive sentiment (31.97%) and Tweets that had a neutral sentiment (21.14%). Second, this paper presents the top 50 hashtags that were used in these Tweets. Third, it presents the top 100 most frequently used words that are featured in these Tweets. The findings of text analysis show that some of the commonly used words involved directly referring to either or both viruses. 
In addition to this, the presence of words such as "Polio", "Biden", "Ukraine", "HIV", "climate", and "Ebola" in the list of the top 100 most frequent words indicates that conversations on Twitter in the context of COVID-19 and MPox also included a high level of interest in other viruses, President Biden, and Ukraine. Finally, a comprehensive comparative study, comparing this work with 49 prior works in the field, is presented to uphold the scientific contributions and relevance of this work.
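The reported sentiment percentages (46.88% negative, 31.97% positive, 21.14% neutral) are simple label-frequency ratios over the classified tweets; a minimal sketch with invented labels:

```python
from collections import Counter

def sentiment_breakdown(labels):
    """Percentage of tweets per sentiment label, rounded to two decimals."""
    counts = Counter(labels)
    n = len(labels)
    return {label: round(100 * c / n, 2) for label, c in counts.items()}

# hypothetical classifier output for four tweets
labels = ["negative", "negative", "positive", "neutral"]
breakdown = sentiment_breakdown(labels)
```

Applied to 61,862 classified tweets, the same computation yields the percentage distribution reported above.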
Subject: Engineering, Automotive Engineering Keywords: Business Intelligence; Data warehouse; Data Marts; Architecture; Data; Information; cloud; Data Mining; evolution; technologic companies; tools; software
Online: 24 March 2021 (13:06:53 CET)
Information has been and will remain a vital element for individuals and departmental groups in an organization. That is why there are technologies that help give data the proper management; Business Intelligence is responsible for providing technological solutions that correctly and effectively manage the entire volume of necessary and important information for companies. Among the solutions offered by Business Intelligence are Data Warehouses and Data Mining, among other business technologies that, working together, achieve the objectives proposed by an organization. It is important to highlight that these business technologies have been present since the 1950s and have evolved over time, improving processes, infrastructure, and methodologies and implementing new technologies, which has helped correct past mistakes in information management for companies. Questions remain about Business Intelligence: could it be that, in the not-too-distant future, it will be adopted as an essential standard in any organization for data management, since it provides many benefits and avoids failures when classifying information? On the other hand, cloud storage has become the best alternative for safeguarding information without depending on physical storage media, which are not 100% secure and are exposed to partial or total loss of information through hardware failures or security failures caused by mishandling.
ARTICLE | doi:10.20944/preprints202008.0074.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: data mining; cardiovascular diseases; cluster analysis; principal component analysis
Online: 4 August 2020 (03:56:19 CEST)
Cardiovascular disease is the number one cause of death in the world: according to the WHO, around 31% of deaths worldwide are caused by cardiovascular diseases, and more than 75% of these deaths occur in developing countries. Patients with cardiovascular disease generate many medical records that can be used for further patient management. This study aims to develop a data mining method that groups patients with cardiovascular disease to determine the level of patient complications in two clusters. The method applies principal component analysis (PCA), which reduces the dimensions of the large available dataset, together with cluster analysis implementing the K-Medoids algorithm. Data reduction with PCA produced five new components with a cumulative proportion of variance of 0.8311. The five new components were used for cluster formation with the K-Medoids algorithm, resulting in two clusters with a silhouette coefficient of 0.35. The combination of data reduction by PCA and the K-Medoids clustering algorithm provides a new way of grouping data from patients with cardiovascular disease based on the level of patient complications in each generated cluster.
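The choice of five components at a cumulative proportion of variance of 0.8311 corresponds to the standard rule of keeping the smallest number of leading components whose variance proportions sum past a threshold; the eigenvalues below are illustrative, not the study's.

```python
def components_for_variance(eigenvalues, threshold):
    """Smallest number of leading principal components whose cumulative
    proportion of variance reaches the threshold."""
    total = sum(eigenvalues)
    cum = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cum += ev / total  # each eigenvalue's share of the total variance
        if cum >= threshold:
            return k
    return len(eigenvalues)

# hypothetical eigenvalues of a covariance matrix
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
k = components_for_variance(eigenvalues, 0.8)
```

The retained components then replace the original variables as input to K-Medoids, shrinking the clustering problem's dimensionality.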
ARTICLE | doi:10.20944/preprints202006.0161.v1
Subject: Medicine And Pharmacology, Epidemiology And Infectious Diseases Keywords: COVID-19; Coronavirus; Artificial intelligence; Machine learning; Data mining
Online: 14 June 2020 (03:34:22 CEST)
The novel coronavirus disease (COVID-19) pandemic has impacted health and wellbeing globally. To strengthen preventive and clinical care amid this pandemic, technological innovations like artificial intelligence (AI) are increasingly used in different contexts. This bibliometric study aimed to assess the current scholarly development and prominent research domains in applications of AI technologies in COVID-19 research. A total of 105 articles that emphasized the use of AI in the context of COVID-19 were retrieved from the MEDLINE database. Most articles had multiple authors, with a collaboration index of 7.18. Moreover, most of the articles were produced in the USA (22.86%) and China (21.9%), whereas developing countries were underrepresented among the contributing nations. Furthermore, several research domains were identified, including prevention and control, diagnostics, epidemiological characteristics, therapeutics, psychological conditions, and different areas of data science related to COVID-19. The current bibliometric evidence shows the early stage of development of this field, which necessitates equitable applications of AI in COVID-19 research, emphasizing health disparities, socio-legal issues, vaccine development, and applied public health research in this pandemic.
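The collaboration index of 7.18 is commonly computed in bibliometrics as the mean number of authors over multi-authored articles; assuming that definition (the study may use a variant), a sketch with invented author counts:

```python
def collaboration_index(author_counts):
    """Collaboration index: mean number of authors over articles that
    have more than one author (single-authored articles are excluded)."""
    multi = [c for c in author_counts if c > 1]
    return sum(multi) / len(multi)

# hypothetical authors-per-article counts for a small corpus
counts = [1, 5, 9, 3, 1]
ci = collaboration_index(counts)
</```

Excluding single-authored articles is what distinguishes the collaboration index from a plain mean of authors per article.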
ARTICLE | doi:10.20944/preprints201907.0338.v1
Subject: Engineering, Automotive Engineering Keywords: prediction; futures studies; complex environment; machine learning; data mining
Online: 30 July 2019 (03:48:37 CEST)
Decision-makers are concerned with the inherent complexity of the modern world's markets. Price fluctuations, environmental concerns, technological development, emerging markets, political challenges, and social expectations have made the 21st century's markets more dynamic and complex. From a policy-making perspective, it is vital to uncover future trends. This paper proposes that artificial intelligence can improve interpretations of complex markets, such as financial and energy markets. In a complex environment, it is critical to investigate the maximum number of available input features to ensure that no valuable informative feature is neglected. Several AI-based models are investigated, demonstrating that they can successfully uncover future trends. From a scenario-development perspective, the purified subset of input features corresponds to the driving forces that shape alternative futures. The results show that using AI can improve our understanding of how input features influence future behaviors while simultaneously improving prediction accuracy and reliability.
COMMUNICATION | doi:10.20944/preprints202206.0383.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: Exoskeleton; Twitter; Tweets; Big Data; social media; Data Mining; dataset; Data Science; Natural Language Processing; Information Retrieval
Online: 21 July 2022 (04:06:53 CEST)
Exoskeleton technology has been advancing rapidly in the recent past due to its multitude of applications and diverse use cases in assisted living, the military, healthcare, firefighting, and Industry 4.0. The exoskeleton market is projected to grow to several times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today's living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset by mining relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, and the topics found in its conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to the field. First, it presents an open-access dataset of about 140,000 tweets about exoskeletons posted over the 5-year period from May 21, 2017, to May 21, 2022. Second, based on a comprehensive review of recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
COMMUNICATION | doi:10.20944/preprints202210.0351.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: oral cancer; machine learning; gene prioritization; genomic datasets; data mining
Online: 24 October 2022 (07:10:08 CEST)
Delayed cancer detection is one of the common causes of poor prognosis for many cancers, including cancers of the oral cavity. Despite the improvement and development of new and efficient gene therapy treatments, very little has been done to algorithmically assess the impedance of these carcinomas. In this work, we attempt to annotate viable attributes in oral cancer gene datasets for the identification of gingivobuccal cancer (GBC). We further apply supervised and unsupervised machine learning methods to the gene datasets, revealing key candidate attributes for GBC prognosis. Our work highlights the importance of automated identification of key genes responsible for GBC, an approach that could perhaps be easily replicated for other forms of oral cancer detection.
ARTICLE | doi:10.20944/preprints202107.0230.v1
Subject: Business, Economics And Management, Accounting And Taxation Keywords: Cancer; Public Attention; News Media; Granger Causality Test; Data Mining
Online: 9 July 2021 (15:44:24 CEST)
Over the past decade, China has witnessed fast-paced technological advancements in the media industry, as well as major shifts in the health agenda portrayed in the media. Therefore, a key starting point when discussing health communication lies in whether media attention and public attention towards health issues are structurally aligned, and to what extent the news media guide public attention. Based on data mined from 73,060 sets of the Baidu Search Index and Media Index on 20 terms covering different types of cancer from 2011 to 2020, the Granger test demonstrates that, in the last decade, public attention and media attention towards cancer in China have gone through two distinct phases. During the first phase, 2011-2015, Chinese news media still held the key to transferring the salience of issues on most cancer types to the public. In the second phase, 2016-2020, public attention towards cancer gradually diverged from media coverage, mirroring the imbalance and mismatch between the demand of an active public and the supply of cancer information from the news media. This study provides an overview of the dynamic transition on cancer issues in China over a ten-year span, along with descriptive results on public and media attention towards specific cancer types.
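The Granger test used above asks whether lagged values of one series improve prediction of another. A minimal bivariate version of that F-test can be sketched as follows; this is an illustrative toy (the study presumably used standard econometric tooling), and the variable names and lag choice are assumptions.

```python
import numpy as np

def granger_f(y, x, lag=1):
    # Restricted model: y_t ~ lags of y. Unrestricted model adds lags of x.
    # A large F statistic suggests x "Granger-causes" y.
    n = len(y)
    Y = y[lag:]
    Ly = np.column_stack([y[lag - i - 1:n - i - 1] for i in range(lag)])
    Lx = np.column_stack([x[lag - i - 1:n - i - 1] for i in range(lag)])
    ones = np.ones((len(Y), 1))
    Xr = np.hstack([ones, Ly])        # restricted design matrix
    Xu = np.hstack([ones, Ly, Lx])    # unrestricted design matrix
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(Xr), rss(Xu)
    df1, df2 = lag, len(Y) - Xu.shape[1]
    return ((rss_r - rss_u) / df1) / (rss_u / df2)
```

Here `y` could play the role of the public-attention index and `x` the media index (or vice versa), with the F statistics in the two directions compared against critical values.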
ARTICLE | doi:10.20944/preprints202103.0738.v1
Subject: Computer Science And Mathematics, Analysis Keywords: bibliometry; coronavirus; text and data mining; SARS; MERS; COVID-19
Online: 31 March 2021 (17:30:56 CEST)
A global event such as the COVID-19 crisis presents new, often unexpected responses that are fascinating to investigate from both scientific and social standpoints. Despite several documented similarities, the Coronavirus pandemic is clearly distinct from the 1918 flu pandemic in terms of our exponentially increased, almost instantaneous ability to access and share information, offering an unprecedented opportunity to visualise the rippling effects of global events across space and time. Personal devices provide “big data” on people’s movement, the environment and economic trends, while access to the unprecedented flurry of scientific publications and media posts provides a measure of the response of the educated world to the crisis. Most bibliometric (co-authorship, co-citation, or bibliographic coupling) analyses ignore the time dimension, but COVID-19 has made it possible to perform a detailed temporal investigation into the pandemic. Here, we report a comprehensive network analysis based on more than 20,000 published documents on viral epidemics, authored by over 75,000 individuals from 140 nations in the first year of the crisis. In contrast to the 1918 flu pandemic, access to published data over the past two decades enabled a comparison of publishing trends between the ongoing COVID-19 pandemic and the 2003 SARS epidemic, to study changes in thematic foci and societal pressures dictating research over the course of a crisis.
ARTICLE | doi:10.20944/preprints202001.0048.v1
Subject: Biology And Life Sciences, Horticulture Keywords: cis-regulatory element; data mining; NBS-LRR resistance genes; Zucchini
Online: 5 January 2020 (17:22:10 CET)
Although Cucurbita pepo is one of the most variable species of the plant kingdom, the Zucchini morphotype has undergone intensive breeding that has led to a narrow genetic base, making the crop vulnerable to pests and diseases. This vulnerability makes knowledge of resistance genes of utmost importance. In this study, a data mining search of the Zucchini summer squash genome database was conducted to identify and annotate members of the NBS-encoding gene family. To characterize the retrieved genes in detail, they were studied on the basis of phylogenetic relationships, structural diversity, conserved protein motifs, gene duplications and promoter region analysis. Our study shows that the NBS-encoding gene family is relatively small in Zucchini (34 members, separated into non-TIR- and TIR-NBS-LRR subfamilies), with a significantly lower number of R-genes than in other species. Duplications have not played a major role in the expansion of this type of gene in C. pepo. Among the cis-regulatory elements present in these sequences, six motifs are over-represented; these elements have been reported to be involved in pathogen- or plant-stress-induced responses. These results will contribute to the identification, isolation and characterization of candidate R-genes, thereby providing insight into NBS gene family evolution in the species.
ARTICLE | doi:10.20944/preprints201812.0056.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Low accuracy CDRs; Group movement pattern; Data mining; Travel behaviors
Online: 4 December 2018 (10:02:30 CET)
Identifying the group movement patterns of crowds and understanding group behaviors is valuable for urban planners, especially when the groups are special, such as tourist groups. In this paper, we present a framework to discover tourist groups and investigate tourist behaviors using mobile phone call detail records (CDRs). Unlike GPS data, CDRs are relatively poor in spatial resolution with low sampling rates, which makes it a big challenge to identify group members from thousands of tourists. Moreover, since touristic trips are not on a regular basis, no historical data of the specific group can be used to reduce the uncertainty of trajectories. To address these challenges, we propose a method called group movement pattern mining based on similarity (GMPMS) to discover tourist groups. To avoid large numbers of trajectory similarity measurements, snapshots of the trajectories are first generated to extract candidate groups containing co-occurring tourists. Then, considering that different groups may follow the same itineraries, additional traveling behavioral features are defined to identify the group members. Finally, with Hainan province as an example, we provide a number of interesting insights into the travel behaviors of group tours as well as individual tours, which will be helpful for tourism planning and management.
ARTICLE | doi:10.20944/preprints201809.0466.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: topological data analysis; text mining; computational topology; style; persistent homology
Online: 24 September 2018 (15:33:02 CEST)
Topological Data Analysis (TDA) refers to a collection of methods that find the structure of shapes in data. Although TDA methods have recently been used in many areas of data mining, they have not been widely applied to text mining tasks. In most text processing algorithms, the order in which different entities appear or co-appear is lost. Assuming these lost orders are informative features of the data, TDA may play a significant role in filling the resulting gap in the state of the art of text processing. The topology of same-type entities throughout a textual document may reveal additional information about the document that is not reflected in features produced by traditional text processing methods. In this paper, we introduce a novel approach that employs TDA in text processing to capture and use the topology of different same-type entities in textual documents. First, we show how to extract topological signatures from text using persistent homology, a TDA tool that captures the topological signature of a data cloud. Then we show how to utilize these signatures for text classification.
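To give a flavour of the persistent-homology machinery mentioned above: the 0-dimensional barcode of a point cloud (recording the scale at which connected components merge as a ball around each point grows) can be computed with a Kruskal-style union-find over edges sorted by length. This is a generic hypothetical sketch, not the paper's pipeline; in a text setting the "points" might be positions of same-type entities in a document.

```python
import itertools

def h0_barcode(points, dist):
    # Every point is born at scale 0; a component dies when it merges with an
    # older one. Processing edges in order of length yields the death times.
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in itertools.combinations(range(len(points)), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            deaths.append(d)  # one component dies at scale d
    return deaths             # n - 1 finite bars; one component lives forever
```

The multiset of bar lengths is the kind of topological signature that can then be fed to a classifier.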
ARTICLE | doi:10.20944/preprints201801.0231.v1
Subject: Engineering, Control And Systems Engineering Keywords: Data mining; Association rules; Previous Cause; Type of Accident; Overexertion
Online: 24 January 2018 (19:40:52 CET)
An analysis of workplace accidents in the mining sector was performed using a database from the Spanish administration for the period 2005-2015 and applying data mining techniques. The data were processed with the Weka software. Two scenarios were chosen regarding the accidents database: surface and underground mining. The most important variables involved in occupational accidents and their association rules have been determined. These rules are formed by several predictor variables that cause an accident, defining its characteristics and context. This study presents the 20 most important association rules of the sector, for either surface or underground mining, based on the statistical confidence level of each rule obtained by Weka. The results display the most common immediate causes together with the percentage of accidents underlying each association rule. The most common immediate cause is body movement with physical effort or overexertion, and the most common type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident differ between the two scenarios. Data mining techniques have proved to be a very powerful tool for finding the root causes of accidents, applying corrective measures and verifying their effectiveness, for both public and private companies.
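The support/confidence logic behind such association rules can be illustrated with a brute-force miner over itemsets of size 1 and 2. This is a toy stand-in for Weka's Apriori implementation, and the accident attributes in the usage example are invented for illustration.

```python
from itertools import combinations

def apriori_rules(transactions, min_support=0.5, min_conf=0.6):
    # Count supports for itemsets of size 1 and 2 by brute force, then emit
    # rules A -> B whose support and confidence clear the thresholds.
    n = len(transactions)
    support = {}
    items = sorted({i for t in transactions for i in t})
    for size in (1, 2):
        for combo in combinations(items, size):
            s = sum(set(combo) <= t for t in transactions) / n
            if s >= min_support:
                support[combo] = s
    rules = []
    for itemset, s in support.items():
        if len(itemset) != 2:
            continue
        a, b = itemset
        for ante, cons in ((a, b), (b, a)):
            conf = s / support[(ante,)]  # the singleton is frequent whenever the pair is
            if conf >= min_conf:
                rules.append((ante, cons, round(s, 2), round(conf, 2)))
    return rules
```

Each transaction here would be the set of attributes recorded for one accident; real Apriori extends the same counting to larger itemsets by pruning infrequent candidates.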
ARTICLE | doi:10.20944/preprints201801.0017.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Wikipedia; Polish; information quality; linguistic features; linguistics; data mining; NLP
Online: 3 January 2018 (02:03:51 CET)
Wikipedia is the most popular and the largest user-generated source of knowledge on the Web. The quality of the information in this encyclopedia is often questioned. Therefore, Wikipedians have developed an award system for high-quality articles that follow specific style guidelines. Nevertheless, more than 1.2 million articles in the Polish Wikipedia are unassessed. This paper considers over 100 linguistic features to determine the quality of Wikipedia articles in the Polish language. We evaluate our models on 500,000 articles of the Polish Wikipedia. Additionally, we discuss the importance of linguistic features for quality prediction.
ARTICLE | doi:10.20944/preprints202201.0229.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: FAIR principles; Multimorbidity; Mortality; Research data management; Pathfinder case study; Privacy-Preserving Distributed Data Mining.
Online: 17 January 2022 (13:04:03 CET)
The current availability of electronic health records represents an excellent opportunity for research on multimorbidity, one of the most relevant public health problems nowadays. However, it also poses a methodological challenge due to the current lack of tools to access, harmonize and reuse research datasets. In FAIR4Health, a European Horizon 2020 project, a workflow to implement the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles on health datasets was developed, as well as two tools aimed at facilitating the transformation of raw datasets into FAIR ones and the preservation of data privacy. As part of this project, we conducted a multicentric retrospective observational study to apply the aforementioned FAIR implementation workflow and tools to five European health datasets for research on multimorbidity. We applied a federated frequent pattern growth association algorithm to identify the most frequent combinations of chronic diseases and their association with mortality risk. We identified several clinically plausible multimorbidity patterns consistent with the literature, some of which were strongly associated with mortality. Our results show the usefulness of the solution developed in FAIR4Health to overcome the difficulties in data management and highlight the importance of implementing a FAIR data policy to accelerate responsible health research.
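The federated flavour of the pattern mining step can be sketched with plain pair counting: each site shares only aggregate counts, never patient-level records, and a coordinator merges them. Note this substitutes brute-force pair counting for the frequent pattern growth (FP-growth) algorithm the study actually used, and the disease codes below are invented.

```python
from collections import Counter
from itertools import combinations

def local_pair_counts(records):
    # Run at each site: count co-occurring condition pairs over local patients.
    # Only these aggregate counts (and the cohort size) leave the site.
    c = Counter()
    for conditions in records:
        for pair in combinations(sorted(conditions), 2):
            c[pair] += 1
    return c, len(records)

def federated_frequent_pairs(site_summaries, min_support=0.3):
    # Run at the coordinator: sum per-site counts and keep globally frequent pairs.
    total, n = Counter(), 0
    for counts, size in site_summaries:
        total.update(counts)
        n += size
    return {pair: cnt / n for pair, cnt in total.items() if cnt / n >= min_support}
```

The same aggregate-only exchange generalises to larger itemsets, which is the idea behind privacy-preserving distributed data mining.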
ARTICLE | doi:10.20944/preprints202308.0219.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Machine learning; educational data mining; supervised methods; classifiers; course failure risk
Online: 3 August 2023 (02:43:48 CEST)
In this paper, we address the following research question: Is it feasible to use an artificial intelligence system to predict the risk of student failure in a course based solely on performance in prerequisite courses? Adopting a machine learning-based quantitative approach, we implement Course Prophet, the prototype of a predictive system that maps the input variables representing student performance to the target variable, i.e., the risk of course failure. We evaluate multiple machine learning methods and find that the Gaussian process with a Matern kernel outperforms the others, achieving the highest accuracy and a favorable trade-off between precision and recall. We conduct this research in the context of students pursuing a Bachelor’s degree in Systems Engineering at the University of Córdoba, Colombia, focusing on predicting the risk of failing the Numerical Methods course. In conclusion, the main contribution of this research is the development of Course Prophet, an efficient and accurate tool for predicting student failure in the Numerical Methods course based on students’ academic history in prerequisite courses.
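The Matern kernel that drove the best-performing Gaussian process has a closed form; here is the nu = 3/2 variant as an illustrative guess (the abstract does not state which smoothness parameter was used, and the length-scale is an assumption).

```python
import numpy as np

def matern32(X1, X2, length_scale=1.0):
    # Matern kernel with nu = 3/2: k(r) = (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l),
    # where r is the Euclidean distance between inputs. Rougher than the RBF
    # kernel, which often suits noisy tabular data such as course grades.
    r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    a = np.sqrt(3.0) * r / length_scale
    return (1.0 + a) * np.exp(-a)
```

In a Gaussian process classifier, this function would supply the covariance matrix over students' prerequisite-grade vectors.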
ARTICLE | doi:10.20944/preprints202305.1894.v1
Subject: Biology And Life Sciences, Neuroscience And Neurology Keywords: Aircraft control human factors; Cognitive workload; Data Mining; Electroencephalography; Fatigue; Safety
Online: 26 May 2023 (08:40:35 CEST)
The purpose of the research was to examine and assess the relation between the pilot’s concentration and reaction time and specific brain activity during short-haul flights. To accomplish this task, participants took part in one-hour flight sessions performed on an FNPT II class flight simulator. During the flight, the autopilot was activated and subjects were instructed to respond to unexpected events occurring during the flight. The brainwaves of each participant were recorded with the Emotiv EPOC+ Scientific Contextual EEG device. Statistical analyses of the results are presented in the article.
ARTICLE | doi:10.20944/preprints202206.0360.v1
Subject: Business, Economics And Management, Business And Management Keywords: tourism and related; SMEs; small particulate matters; association rules; data mining
Online: 27 June 2022 (10:24:27 CEST)
In northern Thailand, the problem of small particulate matter occurs every year, with the primary sources being agricultural weed burning and wildfire. The tourism industry is strongly impacted and has been in the spotlight for the past few years. Thus, this study aims to investigate the effect of small particulate matter on tourism and related SMEs in Chiang Mai, Thailand. Data were collected from 286 entrepreneurs in the tourism and related SME sectors and analyzed using data mining and association rule techniques. The study revealed that small particulate matter has a considerable impact on customer factors, especially a decrease in the number of customers. Operational factors and product/service factors are also affected by the dust, in the form of adjustments to keep businesses running and to protect the health of employees and customers. Financial factors are likewise affected by the small particulate matter situation, through both lower revenues and higher costs.
ARTICLE | doi:10.20944/preprints202111.0499.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Social Networks; Data Mining; Graph Structure; Natural Language Processing; Machine Learning
Online: 26 November 2021 (10:45:06 CET)
The herd effect is a common phenomenon in society. Detecting this phenomenon is of great significance for many tasks based on social network analysis, such as recommendation. However, research on social networks and natural language processing has seldom focused on this issue. In this paper, we propose an unsupervised data mining method to detect herding in social networks. Taking shopping reviews as an example, our algorithm can identify reviews that are affected by previous reviews and detect a herd effect chain. From an overall perspective, the cross effects of all reviews form the herd effect graph. This algorithm can be widely used in various social network analysis methods through its graph structure, providing new, useful features for many tasks.
ARTICLE | doi:10.20944/preprints202108.0564.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: E-learning derived annotations; Pneumothorax; Artificial intelligence; Crowdsourcing; Educational data mining
Online: 31 August 2021 (11:23:12 CEST)
Development of supervised AI algorithms requires a large number of labelled images. Image labelling is both time-consuming and expensive. Therefore, we explored the value of e-learning-derived annotations for AI algorithm development in medical imaging. Methods: We developed an e-learning platform that involves image-based single-click labelling as part of the educational learning process. Ten radiology residents, as part of their residency training, trained the recognition of pneumothorax on 1161 chest X-rays in posterior-anterior projection. Using these data, multiple AI algorithms for detecting pneumothorax were developed. Classification and localisation performance of the models was tested on an independent internal testing dataset and on the public NIH ChestX-ray14 dataset. Results: On the internal and the NIH dataset, respectively, the models achieved F1 scores of 0.87 and 0.44 for classification, with sensitivity 0.85 and 0.80 and specificity 0.96 and 0.48; for localisation, F1 scores were 0.72 and 0.66, sensitivity 0.72 and 0.72, and the false positive rate 0.36 and 0.32. Conclusion: Our results demonstrate that e-learning-derived annotations are a valuable data source for algorithm development. Further work is needed to include additional parameters such as user performance, consensus of diagnosis, and quality control in the development pipeline.
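For reference, the reported sensitivity, specificity and F1 relate to confusion-matrix counts as follows. This is a generic helper, not the study's evaluation code.

```python
def clf_metrics(y_true, y_pred):
    # Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP),
    # F1 = harmonic mean of precision (TP / (TP + FP)) and sensitivity.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    return sens, spec, f1
```

Note how a high F1 on the internal set can coexist with a much lower F1 on an external set such as NIH ChestX-ray14 when the class balance and image distribution shift.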
ARTICLE | doi:10.20944/preprints201811.0216.v1
Subject: Engineering, Architecture, Building And Construction Keywords: Construction; worker safety; safety helmet; three-axis accelerometer sensor; data mining
Online: 8 November 2018 (14:03:21 CET)
In the Korean construction industry, legal and institutional safety management improvements are continually being pursued. However, there was a 4.5% increase in the number of workers’ deaths at construction sites in 2017 compared to the previous year. Failure to wear safety helmets appears to be one of the major causes of the increase in accidents, so it is necessary to develop technology to monitor whether safety helmets are being used. However, existing technical studies on this issue have mainly relied on chinstrap sensors and have been limited to the question of whether or not safety helmets are being worn. Meanwhile, improper wearing, such as when the chinstrap and harness fixing of the safety helmet are not properly tightened, has not been monitored. To remedy this shortcoming, a sensing safety helmet with an attached three-axis accelerometer sensor was developed in this study. Experiments were performed in which the sensing data were classified according to whether the safety helmet was worn properly, not worn, or worn improperly during construction workers’ activities. The results verified that it is possible to differentiate among the wearing statuses of the proposed safety helmet with a high accuracy of 97.0%.
ARTICLE | doi:10.20944/preprints201810.0678.v1
Subject: Medicine And Pharmacology, Pediatrics, Perinatology And Child Health Keywords: post-operative death; unstructured data; logistic regression; text mining; surgery outcome
Online: 29 October 2018 (11:46:18 CET)
Text fields in electronic medical records (EMR) contain information on important factors that influence health outcomes; however, they are underutilized in clinical decision making due to their unstructured nature. We analyzed 6,497 inpatient surgical cases with 719,308 free-text notes from the Le Bonheur Children’s Hospital EMR. We used a text mining approach on preoperative notes to obtain a text-based risk score predictive of death within 30 days of surgery. We then studied the additional performance obtained by including the text-based risk score as a predictor of death alongside clinical risk factors based on structured data. The C-statistic of a logistic regression model with 5-fold cross-validation significantly improved from 0.76 to 0.92 when text-based risk scores were included in addition to structured data. We conclude that preoperative free-text notes in the EMR include significant information that can predict adverse surgery outcomes.
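The C-statistic reported above has a simple rank interpretation that can be computed directly: it is the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. This is an illustrative helper, not the study's evaluation code.

```python
def c_statistic(scores, labels):
    # C-statistic (equivalently, ROC AUC) by pairwise comparison:
    # count positive/negative pairs where the positive scores higher; ties count 0.5.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A jump from 0.76 to 0.92, as in the study, means the model orders positive cases above negative ones far more reliably once the text-based risk score is added.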
ARTICLE | doi:10.20944/preprints201806.0247.v1
Subject: Computer Science And Mathematics, Analysis Keywords: data mining; association rule learning; policyholder lapse; auto insurance; market inefficiency
Online: 15 June 2018 (09:01:03 CEST)
For automobile insurance, it has long been implied that when a policyholder makes at least one claim in the prior year, the subsequent premium is likely to increase. When this happens, the policyholder may seek to switch to another insurance company to avoid paying a higher premium. In such situations, insurers may face the challenge of policyholder retention by keeping premiums low in the face of competition. In this paper, we seek empirical evidence of a possible association between policyholder switching after a claim and the associated change in premium. To accomplish this goal, we employ association rule learning, a data mining technique that has its origins in marketing, where it is used for analyzing and understanding consumer purchase behavior. We apply this technique in two stages. In the first stage, we identify policyholder and vehicle characteristics that affect the size of the claim and the resulting change in premium, regardless of policy switch. In the second stage, together with policyholder and vehicle characteristics, we identify the association among the size of the claim, the level of premium increase and policy switch. This empirical process is often challenging to insurers because they are unable to observe the new premium for policyholders who switched. However, we used nine years of claims data for the entire Singapore automobile insurance market, which allowed us to track information before and after a switch. Our results provide evidence of a strong association among the size of the claim, the level of premium increase and policy switch. We attribute this to possible inefficiency in the insurance market arising from the lack of sharing and exchange of claims history among the companies.
ARTICLE | doi:10.20944/preprints201708.0055.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: EMR; data preprocessing; text mining; information extraction; medical decision support system
Online: 15 August 2017 (05:46:43 CEST)
At present, medical institutions generally use EMRs to record patients’ conditions, including diagnostic information, procedures performed and treatment results. EMRs have been recognized as a valuable resource for large-scale analysis. However, EMR data are characterized by diversity, incompleteness, redundancy and privacy concerns, which make it difficult to carry out data mining and analysis directly. Therefore, the source data must be preprocessed in order to improve data quality and thus the data mining results. Different types of data require different processing technologies. Most structured data require classic preprocessing technologies, including data cleansing, data integration, data transformation and data reduction. Semi-structured or unstructured data, such as medical text, which contains richer health information, require more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (Named Entity Recognition) and RE (Relation Extraction). In this paper, we introduce the process of EMR mining, covering data collection, data preprocessing, data mining, evaluation and knowledge application; analyze the current status of the key technologies, such as data preprocessing and data mining; and provide an overview of the application domains and prospects of EMR mining technologies. Finally, we summarize the open problems in EMR mining research and review development trends.
ARTICLE | doi:10.20944/preprints202108.0301.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Unobtrusive Sensing; Data Fusion; Data Mining; Radar Sensing; Thermal Sensing; Sprained Ankle; Infrared Thermopile Array; Home Environment.
Online: 13 August 2021 (15:12:24 CEST)
The ability to monitor Sprained Ankle Rehabilitation Exercises (SPAREs) in home environments can help therapists ascertain whether exercises have been performed as prescribed. Whilst wearable devices have been shown to provide advantages such as high accuracy and precision during monitoring activities, disadvantages such as limited battery life and users' failure to remember to charge and wear the devices often challenge their usage. Also, video cameras, which are notable for high frame rates and granularity, are not privacy-friendly. This paper therefore proposes the use and fusion of unobtrusive and privacy-friendly sensing solutions for data collection and processing during SPAREs in home environments. Two Infrared Thermopile Array (ITA-32) thermal sensors and two Frequency Modulated Continuous Wave (FMCW) radar sensors were used to simultaneously monitor 15 healthy participants during SPAREs, which involved twisting the ankle in four fundamental movement patterns, namely (i) extension, (ii) flexion, (iii) eversion and (iv) inversion. Experimental results indicated the ability to identify the thermal blobs of participants performing the four fundamental movement patterns of the human ankle. Cluster-based analysis of the data gleaned from the ITA-32 and FMCW radar sensors indicated an average classification accuracy of 96.9% with K-Nearest Neighbours, Neural Network, AdaBoost, Decision Tree, Stochastic Gradient Descent and Support Vector Machine classifiers, amongst others.
Subject: Medicine And Pharmacology, Dietetics And Nutrition Keywords: obesity; eating context; nutrient-poor foods; nutritional surveillance; adolescents; survey data analysis; data-mining; correspondence analysis; biplots
Online: 9 June 2020 (13:52:45 CEST)
Obesity is a global public health problem, with the environment as its major determinant. An evidence base is needed to identify interventions. To this aim, we investigate the relationship between the consumption of foods and eating locations (such as home, school/work and others) in British adolescents, using data from the UK National Diet and Nutrition Survey Rolling Program (2008-2012 and 2013-2016). Cross-sectional analysis of 62,523 food diary entries from this nationally representative sample focused on foods contributing up to 80% of total energy in the adolescents’ daily diet. Correspondence Analysis (CA) was first used to generate food-location relationship hypotheses, and Logistic Regression (LR) was then used to quantify the evidence in terms of odds ratios and to formally test those hypotheses. The less-healthy foods that emerged from CA were chips, soft drinks, chocolate and meat pies. Adjusted odds ratios (99% CI) for consuming specific foods at a location “Other” than home (H) or school/work (S) in the 2008-2012 survey sample were: for soft drinks 2.8 (2.1 to 3.8) vs. H and 2.0 (1.4 to 2.8) vs. S; for chips 2.8 (2.2 to 3.7) vs. H and 3.4 (2.1 to 5.5) vs. S; for chocolate 2.6 (1.9 to 3.5) vs. H and 1.9 (1.2 to 2.9) vs. S; and for meat pies 2.7 (1.5 to 5.1) vs. H and 1.3 (0.5 to 3.1) vs. S. These trends were confirmed in the 2013-2016 survey sample. Interactions between location and BMI were not significant in either sample. In conclusion, our study showed that adolescents are more likely to consume specific less-healthy foods at locations away from home and school/work, irrespective of BMI. Such locations include leisure places, food outlets and “on the go”; hence, public health policies to discourage less-healthy food choices in these locations are warranted for all adolescents.
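The adjusted odds ratios with 99% CIs reported above come from logistic regression; the unadjusted analogue of such an interval can be sketched from a 2x2 table with a Wald approximation on the log scale. The counts in the usage example are invented for illustration and do not come from the survey.

```python
import math

def odds_ratio_ci(a, b, c, d, z=2.576):
    # 2x2 table: a = exposed cases, b = exposed non-cases,
    #            c = unexposed cases, d = unexposed non-cases.
    # Wald interval on log(OR); z = 2.576 gives a 99% CI, matching the study.
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```

An interval whose lower bound stays above 1 (as for soft drinks, chips and chocolate vs. home) indicates a statistically reliable association at the 99% level.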
ARTICLE | doi:10.20944/preprints202111.0440.v1
Subject: Engineering, Control And Systems Engineering Keywords: time series; NMP algorithm; anomalies; data mining; similarities in time series; clustering
Online: 23 November 2021 (17:51:42 CET)
Time series are a significant form of temporal data: sequences of real values collected regularly over time. They underpin many types of data analysis; however, they also contain anomalies. We introduce a hybrid algorithm, the novel matrix profile (NMP), to solve the all-pairs similarity search problem for time series data. The proposed NMP inherits features from two state-of-the-art algorithms: Scalable Time series Anytime Matrix Profile (STAMP) and Scalable Time series Ordered-search Matrix Profile (STOMP). The proposed algorithm caches its output in an easy-to-access fashion for single- and multidimensional data. The proposed NMP algorithm can be used on large data sets and generates approximate solutions of high quality in a reasonable time. It can also handle several data mining tasks, and is implemented on a Python platform. To determine its effectiveness, it is compared with the state-of-the-art matrix profile algorithms, i.e., STAMP and STOMP. The results confirm that the proposed NMP provides higher accuracy than the compared algorithms.
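The matrix profile these algorithms compute can be stated in a few lines: for every subsequence, the z-normalised Euclidean distance to its nearest non-trivial match. A brute-force sketch (STAMP/STOMP produce the same profile far faster; this toy series is invented):

```python
import math

def znorm(xs):
    """Z-normalise a subsequence (zero mean, unit variance)."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

def matrix_profile(ts, w):
    """Brute-force matrix profile for window length w."""
    n = len(ts) - w + 1
    subs = [znorm(ts[i:i + w]) for i in range(n)]
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < w:   # exclusion zone: skip trivial self-matches
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            best = min(best, d)
        profile.append(best)
    return profile

ts = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 5.0, 9.0, 2.0]
mp = matrix_profile(ts, 4)
# The repeated [0,1,2,1] motif yields near-zero profile values; the
# anomalous tail stands out with a high value — the link to anomaly detection.
```

Low profile values flag motifs (repeated behaviour), high values flag discords (anomalies), which is why the matrix profile serves so many data mining tasks at once.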
REVIEW | doi:10.20944/preprints202108.0345.v1
Subject: Social Sciences, Education Keywords: student academic performance; educational data mining; methods; algorithms; tools; higher education; overview
Online: 16 August 2021 (14:04:57 CEST)
This overview study set out to compare and synthesise the findings of review studies on predicting student academic performance (SAP) in higher education using educational data mining (EDM) methods, algorithms, and tools from 2013 to June 2020. It conducted multiple searches for suitable, relevant peer-reviewed articles on two online search engines, nine online databases, and two online academic social networks, and then selected 26 eligible articles from 2,050. Some of the findings are worth highlighting. First, only 2 studies explicitly stated their precise sample sizes, with maths and science as the two most mentioned subject areas. Second, 16 review studies had purposes related to the EDM techniques, methods, models, or algorithms employed to predict SAP and student success in the higher education sector. Third, the 26 review studies reported six commonly used typologies of input variables, of which student demographics was the most commonly utilised for predicting SAP. Fourth and last, seven common EDM algorithms employed for predicting SAP were identified, of which Decision Tree emerged both as the most used algorithm and as the algorithm with the highest prediction accuracy rate for predicting SAP.
ARTICLE | doi:10.20944/preprints201811.0328.v1
Subject: Computer Science And Mathematics, Analysis Keywords: e-learning; automatic test generation; medical ontology; data mining for medical texts
Online: 14 November 2018 (09:45:38 CET)
The Medi-test system we developed was motivated by the large number of resources available for the medical domain, as well as by the number of tests needed in this field (during and after medical school) for evaluation, promotion, certification, etc. Generating questions to support learning and user interactivity has been an interesting and dynamic topic in NLP since the availability of e-book curricula and e-learning platforms. Current e-learning platforms offer increased support for student evaluation, with an emphasis on exploiting automation in both test generation and grading. In this context, our system is able to evaluate a student's academic performance in the medical domain. Using medical reference texts as input, and supported by a specially designed medical ontology, Medi-test generates different types of questionnaires for the Romanian language. The evaluation includes four question types (multiple-choice, fill-in-the-blanks, true/false, and match), supports customizable length and difficulty, and can be graded automatically. A recent extension of our system also allows the generation of tests that include images. We evaluated our system with a local testing team and with a set of medicine students; user satisfaction questionnaires showed that the system can be used to enhance learning.
ARTICLE | doi:10.20944/preprints201804.0008.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: SNP; multiple analysis pipeline; pharmacogenomics; overall survival curves; data mining; statistical analysis
Online: 2 April 2018 (07:53:23 CEST)
Personalized medicine is an aspect of P4 medicine (predictive, preventive, personalized, and participatory) based precisely on customizing all medical care to each subject. In personalized medicine, the development of medical treatments and drugs is tailored to the individual characteristics and needs of each subject, according to the study of diseases at different scales, from genotype to phenotype. To make the goal of personalized medicine concrete, it is necessary to employ high-throughput methodologies such as Next Generation Sequencing (NGS), Genome-Wide Association Studies (GWAS), Mass Spectrometry, or Microarrays, which can investigate a single disease from a broader perspective. For example, by using genotyping microarrays (e.g., collections of Single Nucleotide Polymorphisms, SNPs), it is possible to uncover the reasons (i.e., mutations in genes) why a treatment works properly in some patients (for example, absence of mutated genes) but does not work in others (presence of mutated genes). A side effect of high-throughput methodologies is the massive amount of data produced by each experiment, which poses several challenges (e.g., high execution time and memory requirements) to bioinformatic software. Thus, a main requirement of modern bioinformatic software is the use of good software engineering methods and efficient programming techniques able to face those challenges, including parallel programming and efficient, compact data structures. To exploit the full potential of this massive amount of data in the shortest possible time (before the data become obsolete), the need arises to develop parallel software tools for efficient data collection and analysis.
Moreover, due to the heterogeneity of the data produced by different kinds of experimental platforms, it is necessary to automate, in a comprehensive software pipeline, the various steps that compose a bioinformatic analysis: the preprocessing of raw data to remove noise or corrupted data; the annotation of data with external knowledge (e.g., Gene Ontology); and the integration of molecular data with clinical data. Such steps are necessary to make statistical or data mining analysis more effective. This paper presents the design and experimental evaluation of a comprehensive software pipeline, named microPipe, for the preprocessing, annotation, and analysis of microarray-based SNP genotyping data. A case study in pharmacogenomics is presented. The main advantages of using microPipe are: the reduction of errors that may occur when making data compatible among different tools; the possibility of analyzing huge datasets in parallel; and the easy annotation and integration of data.
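The preprocess → annotate → integrate sequence described above can be sketched as three composable stages. The function names and toy records below are illustrative, not microPipe's actual API:

```python
def preprocess(calls):
    """Drop corrupted SNP calls (here, no-call '--' genotypes)."""
    return [c for c in calls if c["genotype"] != "--"]

def annotate(calls, gene_map):
    """Attach external knowledge (a toy SNP-to-gene map)."""
    return [{**c, "gene": gene_map.get(c["snp"], "unknown")} for c in calls]

def integrate(calls, clinical):
    """Join molecular calls with per-patient clinical records."""
    return [{**c, **clinical[c["patient"]]} for c in calls]

calls = [
    {"patient": "P1", "snp": "rs123", "genotype": "AG"},
    {"patient": "P1", "snp": "rs456", "genotype": "--"},   # corrupted call
    {"patient": "P2", "snp": "rs789", "genotype": "AA"},
]
gene_map = {"rs123": "TP53"}
clinical = {"P1": {"age": 61}, "P2": {"age": 54}}

result = integrate(annotate(preprocess(calls), gene_map), clinical)
```

Keeping each stage a pure function over records is what makes the later statistical analysis reproducible and the stages easy to parallelise over chunks of the dataset.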
ARTICLE | doi:10.20944/preprints201707.0011.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Internet of Things; data mining algorithms; GPU cluster; performance; energy consumption; reliability
Online: 6 July 2017 (12:40:22 CEST)
This paper aims to develop a low-cost, high-performance, and high-reliability computing system to process large-scale data using common data mining algorithms in Internet of Things (IoT) computing. Considering the characteristics of IoT data processing, which resemble mainstream high-performance computing, we use a GPU cluster to achieve better IoT services. First, we present an energy consumption calculation method (ECCM) based on WSNs. Then, using the CUDA programming model, we propose a Two-level Parallel Optimization Model (TLPOM) that exploits reasonable resource planning and common compiler optimization techniques to obtain the best block and thread configuration under the resource constraints of each node. The key to this part is dynamically coupling Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP) to improve the performance of the algorithms without additional energy consumption. Finally, combining the ECCM and the TLPOM, we use the Reliable GPU Cluster Architecture (RGCA) to obtain a high-reliability computing system that accounts for the nodes' diversity, algorithm characteristics, etc. The results show that the performance of the algorithms increased significantly, by 34.1%, 33.96%, and 24.07% on average for Fermi, Kepler, and Maxwell with the TLPOM, and that the RGCA ensures our IoT computing system provides low-cost, high-reliability services.
COMMUNICATION | doi:10.20944/preprints202104.0575.v2
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Ethnopharmacology; Artificial Intelligence; Web Crawling; Active Learning; Reinforcement Learning; Text Mining; Big Data
Online: 23 June 2021 (11:47:32 CEST)
Ethnopharmacology experts face several challenges when identifying and retrieving documents and resources related to their scientific focus. The volume of sources that must be monitored, the variety of formats used, and the varying quality of language across sources present some of what we call "big data" challenges in analysing this data. This study aims to understand if and how experts can be effectively supported by intelligent tools in ethnopharmacological literature research. To this end, we use a real case study of ethnopharmacology research focused on the Southern Balkans and the coastal zone of Asia Minor, and propose a methodology for more efficient research in ethnopharmacology. Our work follows an "Expert-Apprentice" paradigm in an automatic URL extraction process through crawling, where the apprentice is a Machine Learning (ML) algorithm combining Active Learning (AL) and Reinforcement Learning (RL), and the expert is the human researcher. ML-powered research improved the domain expert's effectiveness 3.1-fold and efficiency 5.14-fold, fetching a total of 420 relevant ethnopharmacological documents in only 7 hours versus an estimated 36 hours of human-expert effort. Therefore, utilizing Artificial Intelligence (AI) tools to support the researcher can boost the efficiency and effectiveness of identifying and retrieving appropriate documents.
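The "Expert-Apprentice" loop can be caricatured as a bandit problem: the apprentice picks which source to crawl next, and the expert's relevance label is the reward. The sources, reward rates, and epsilon-greedy policy below are all invented for illustration and are far simpler than the paper's AL+RL combination:

```python
import random

def crawl_session(reward_rates, steps=500, eps=0.1, seed=7):
    """Epsilon-greedy apprentice over simulated sources.

    reward_rates: per-source probability that the expert labels a
    fetched document as relevant (the reward signal).
    """
    rng = random.Random(seed)
    sources = list(reward_rates)
    counts = {s: 0 for s in sources}
    values = {s: 0.0 for s in sources}     # running mean reward per source
    fetched = 0.0
    for _ in range(steps):
        if rng.random() < eps:             # explore a random source
            s = rng.choice(sources)
        else:                              # exploit the best-known source
            s = max(sources, key=lambda x: values[x])
        reward = 1.0 if rng.random() < reward_rates[s] else 0.0  # expert label
        counts[s] += 1
        values[s] += (reward - values[s]) / counts[s]
        fetched += reward
    return values, fetched

values, fetched = crawl_session({"herbal-forum": 0.7, "news": 0.1, "blog": 0.3})
```

Even this toy policy concentrates crawling on the richest source, which is the mechanism behind the efficiency gains the study reports.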
REVIEW | doi:10.20944/preprints201911.0338.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Indian; Sentiment Analysis; Indigenous Languages; Machine Learning; Deep learning; Data; Opinion Mining; Languages.
Online: 27 November 2019 (09:30:07 CET)
The increased use of smartphones has led to greater use of the internet and social media platforms. The most commonly used social media platforms are Twitter, Facebook, WhatsApp, and Instagram. People share their personal experiences, reviews, and feedback on the web. The information available on the web is unstructured and enormous. Hence, there is great scope for research on understanding the sentiment of the data available on the web. Sentiment Analysis (SA) can be carried out on the reviews, feedback, and discussions available on the web. There has been extensive research on SA in the English language, but data on the web also contain many other languages that should be analyzed. This paper aims to analyze, review, and discuss the approaches and algorithms used, and the challenges faced, by researchers carrying out SA on Indigenous languages.
ARTICLE | doi:10.20944/preprints201906.0202.v1
Subject: Engineering, Mechanical Engineering Keywords: Natural gas demands; Prediction; Energy market; Genetic algorithm; Artificial neural network; Data mining.
Online: 20 June 2019 (15:58:25 CEST)
Recently, the global natural gas (NG) market has attracted much attention because NG is cleaner than oil and, in most regions, cheaper than renewable energy sources. However, price fluctuations, environmental concerns, technological development, emerging unconventional resources, energy security challenges, and shipment are some of the forces that have made the NG market more dynamic and complex. From a policy-making perspective, it is vital to uncover demand-side future trends. This paper proposes an intelligent forecasting model for global NG demand based on a multi-dimensional, purified input vector. The model starts with a data mining (DM) step to purify input features, identify the best time lags, and pre-process the selected input vector. Then a hybrid artificial neural network (ANN), equipped with a genetic optimizer to set up the ANN's characteristics, is applied. Among the 13 available input features, six (Alternative and Nuclear Energy, CO2 Emissions, GDP per Capita, Urban Population, Natural Gas Production, and Oil Consumption) were selected as the most critical via the DM step. The hybrid prediction model is then designed to extrapolate future consumption trends. The proposed model outperforms competing models with respect to several error-based evaluation statistics. In addition, since the model identifies the best input feature set, its results are compared with those of a model using the raw input set, with no DM purification process.
ARTICLE | doi:10.20944/preprints202308.1302.v1
Subject: Social Sciences, Education Keywords: Big 5; Child Personality; Elementary School Mathematic Performance; Socio Emotional Effects; Educational Data Mining
Online: 18 August 2023 (09:56:56 CEST)
Assessing personality in children is challenging. An important advance is the 15-question Pictorial Personality Traits Questionnaire for Children. A study of students aged 10 to 13 in Poland validated the questionnaire, but with some caveats; hence the need for replication and stronger evidence in this age group. In Chile, we replicated the study with 3,423 4th-graders (9 to 12 years old). Teachers, in regular sessions, administered the questionnaire to their entire classes. We found similar results, including that asexual pictograms worked well for both genders. We also found positive relationships between conscientiousness, openness, and extraversion and mathematical performance. Furthermore, a combination of these three traits has a relationship with math performance twice as strong as each trait alone. Moreover, students with the lowest scores in this combination of personality traits (6.6% of students) score 0.27 standard deviations lower in mathematical performance than those with the highest scores (74.3% of students). To the best of our knowledge, this is the first study to find a strong relationship between a combination of personality traits gathered with a 15-question questionnaire and fourth-graders' math performance.
ARTICLE | doi:10.20944/preprints202208.0495.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: chronic venous disease; deep leaning; data mining; Resnet50; DeiT; automatic classification; automatic CEAP classification
Online: 29 August 2022 (12:46:56 CEST)
Chronic venous disease (CVD) occurs in a substantial proportion of the world's population. While the onset of CVD may look like a cosmetic defect, over time it can develop into serious problems requiring surgical intervention. The aim of this work is to use deep learning (DL) methods for automatic classification of the stage of CVD, enabling patient self-diagnosis from an image of the patient's legs. The leg images with CVD required by the DL algorithms were obtained via internet data mining. For image preprocessing, the binary classification problem "legs vs. no legs" was solved with a Resnet50-based model with an accuracy of 0.998. Applying this filter made it possible to collect a dataset of 11,118 good-quality leg images with various stages of CVD. For classifying the stages of CVD according to the CEAP classification, a multi-class classification problem was posed and solved using two neural networks with completely different architectures: Resnet50 and DeiT. The DeiT-based model, without any tuning, shows better results than the Resnet50-based model (precision = 0.770 for DeiT vs. 0.615 for Resnet50). To demonstrate the results of the work, a Telegram bot implementing the fully functioning DL algorithms was developed. This bot evaluates the condition of the patient's legs with fairly good accuracy for CVD classification.
ARTICLE | doi:10.20944/preprints202005.0051.v1
Subject: Public Health And Healthcare, Health Policy And Services Keywords: COVID-19; Data mining; Infection in India; R package; State- wise analysis; Statistical analysis
Online: 5 May 2020 (02:28:26 CEST)
Background & Objectives: The global pandemic caused by the novel coronavirus SARS-CoV-2 has claimed many lives worldwide. As the virus spreads rapidly, the world has witnessed an increasing number of confirmed cases and a rising mortality rate; India is not far behind, with approximately 37,000 affected individuals as of May 2, 2020. The ongoing pandemic has raised several questions that need to be answered through analysis of the transmission of the infection. Data were collected daily from WHO and other sites and are represented graphically using the statistical package R and other online software. The present study provides a holistic overview of the spread of COVID-19 infection in India. Methods: Real-time data queries were based on daily observations, using publicly available data from reference websites for COVID-19 and other official government reports for the period 15 February 2020 to 28 April 2020. Statistical analysis was performed to draw inferences regarding the COVID-19 trend in India. Results: We identified a decrease in the growth rate of COVID-19 cases in India post lockdown and an improvement in the recovery rate during April. The case fatality rate was estimated at 3.22% of total reported cases. State-wise analysis revealed a deteriorating situation in states such as Maharashtra and Gujarat, where cases continued to increase rapidly. A positive linear correlation between the number of deaths and total cases, and an exponential relation between population density and the number of cases reported per square km, were established. Interpretation & Conclusions: Despite early preventive measures taken by the Government of India, the increasing number of cases is a concern. This study compiles state-wise and district-wise data to report daily confirmed cases, case fatalities, and the strategies adopted, in the form of case studies.
Understanding the transmission of SARS-CoV-2 in a country as diverse and populous as India will be crucial in assessing the effectiveness of control policies against the spread of COVID-19 infection.
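Two of the statistics this study reports, case fatality rate and the daily growth rate of cumulative cases, reduce to simple arithmetic. A sketch with invented counts (chosen only to echo the 3.22% figure and the post-lockdown slowdown):

```python
def case_fatality_rate(deaths, confirmed):
    """CFR as a percentage of reported confirmed cases."""
    return 100.0 * deaths / confirmed

def daily_growth_rates(cumulative):
    """Day-over-day fractional growth of cumulative confirmed cases."""
    return [(b - a) / a for a, b in zip(cumulative, cumulative[1:])]

# Hypothetical counts, not the study's data.
cfr = case_fatality_rate(322, 10000)               # 3.22%
rates = daily_growth_rates([100, 120, 138, 152, 161])
# rates decline day over day, the pattern the study associates
# with the post-lockdown period.
```

A declining growth-rate series like this, computed state by state, is exactly how the paper contrasts improving and deteriorating regions.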
TECHNICAL NOTE | doi:10.20944/preprints201911.0073.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: deep behavioral covariates; clinical informatics; predictive modeling; electronic medical records; machine-learning; data-mining
Online: 7 November 2019 (09:25:04 CET)
Deep behavioral covariates (DBCs) introduced in this perspective form a new class of covariates that have the potential to enhance the performance of predictive models and improve analytics in clinical decision support applications. DBCs can measure how engaged a patient tends to be and how he or she tends to respond to events, and they may be highly predictive of the patient’s outcomes for a planned treatment. DBCs may potentially serve as a standard to measure patient engagement and activation and may form highly efficient mechanisms for improving patient outcomes.
ARTICLE | doi:10.20944/preprints202304.0048.v1
Subject: Engineering, Other Keywords: Polymer Extrusion; Barrier Screws; Multi-Objective Optimization; Data Mining; Decision Making; Number of Objectives reduction
Online: 4 April 2023 (14:33:09 CEST)
Polymer single screw extrusion is a major industrial processing technique used to obtain plastics products. To ensure high outputs, tight dimensional tolerances, and excellent product performance, extruder screws may show different design characteristics. Barrier screws, which contain a second flight in the compression zone, have become quite popular, as they promote and stabilize polymer melting. It is therefore important to design extruder screws efficiently and to decide whether a conventional screw will do the job or a barrier screw should be considered instead. This work uses multi-objective evolutionary algorithms to design conventional and barrier screws (Maillefer screws are studied) with optimized geometry. The processing of two polymers, Low Density Polyethylene (LDPE) and Polypropylene (PP), is analyzed. A methodology based on Artificial Intelligence (AI) techniques, namely data mining, decision making, and evolutionary algorithms, is presented and used to obtain results of practical significance, based on relevant performance measures (objectives) used in the optimization. For the various case studies selected, Maillefer screws were generally advantageous for processing LDPE, while for PP both types of screws were feasible.
ARTICLE | doi:10.20944/preprints202203.0178.v1
Subject: Computer Science And Mathematics, Computational Mathematics Keywords: artificial intelligence; data mining; diagnostic decision support; rare diseases; questionnaire anamnesis; neuromuscular diseases; high latencies
Online: 14 March 2022 (08:58:29 CET)
During the COVID-19 pandemic, individuals with symptoms other than cough or fever have refrained from seeking medical advice, yet a delay in treatment can have serious consequences. At the same time, digital health initiatives have emerged to address this bottleneck in healthcare. Herein, we report the results of a multi-center initiative using a combination of patient history and artificial intelligence (AI) to identify individuals with rare neuromuscular diseases. First, a questionnaire with 46 items was developed by interviewing patients with muscular dystrophies, amyotrophic lateral sclerosis, Pompe disease, neuropathies, and myasthenia gravis. Second, patients with proven neurological diseases answered the questionnaire. Third, a combination of classifiers (artificial neural network, support vector machine, and random forest) was trained, and finally the system was challenged with new questionnaires. Users with an abnormal questionnaire pattern received a unique code (for data privacy) and contact details of a neurologist for further advice. The neurologists confirmed or refuted the AI-based diagnosis. The questionnaire was accessed 3122 times, leading to 853 unique codes. The computer-based and confirmed final diagnoses were reported back to us for only a few patients; however, for these patients, genetic testing and high CK levels finally ended a long-lasting diagnostic odyssey.
ARTICLE | doi:10.20944/preprints202201.0445.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: data mining; predictive analytics; Internet of Things; peasant farming; smart farming system; crop production prediction
Online: 31 January 2022 (10:58:30 CET)
Internet of Things (IoT) technologies can greatly benefit from machine learning techniques and Artificial Neural Networks for data mining and vice versa. In the agricultural field, this convergence could result in the development of smart farming systems suitable for use as decision support systems by peasant farmers. This work presents the design of a smart farming system for crop production, which is based on low-cost IoT sensors and popular data storage services and data analytics services on the Cloud. Moreover, a new data mining method exploiting climate data along with crop production data is proposed for the prediction of production volume from heterogeneous data sources. This method was initially validated using traditional machine learning techniques and open historical data of the northeast region of the state of Puebla, Mexico, which were collected from data sources from the National Water Commission and the Agri-food Information Service of the Mexican Government.
REVIEW | doi:10.20944/preprints202102.0108.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Sentiment Analysis; Students' feedback; Students' reviews; Natural language processing; Data mining; Deep learning; Machine learning
Online: 3 February 2021 (10:11:54 CET)
With the world still under the COVID-19 pandemic, many schools have moved teaching from physical classrooms to online platforms. It is highly important for schools and online learning platforms to investigate student feedback for valuable insights into the online teaching process, so that both platforms and teachers can learn which aspects to improve. However, handling reviews written by students would be laborious if done manually, and manual handling of large-scale feedback from e-learning platforms is unrealistic. To address this problem, recent research has used both machine learning algorithms and deep learning models to automatically process students' reviews and extract the opinions, sentiments, and attitudes expressed. Such studies may play a crucial role in improving interactive online learning platforms by incorporating automatic feedback analysis. We therefore conduct an overview study of sentiment analysis in the educational field as presented in recent research, to help readers grasp an overall understanding of this line of work. Based on the literature review, we identify three future directions for automatic feedback processing: high-level entity extraction, multi-lingual sentiment analysis, and handling of figurative language.
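As a baseline for the ML/DL approaches this review surveys, sentiment can be scored by counting cue words. The word lists and sample reviews below are purely illustrative; real systems learn these cues from data:

```python
POSITIVE = {"clear", "helpful", "engaging", "great", "good"}
NEGATIVE = {"boring", "confusing", "slow", "bad", "unclear"}

def sentiment(review):
    """Label a review by counting positive vs. negative cue words."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

labels = [sentiment(r) for r in [
    "The lectures were clear and engaging",
    "Platform is slow and the tasks are confusing",
    "The course covers statistics",
]]
# → ["positive", "negative", "neutral"]
```

The brittleness of such a lexicon, with figurative language, negation, and other languages, is precisely what motivates the three future directions the review identifies.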
Subject: Biology And Life Sciences, Animal Science, Veterinary Science And Zoology Keywords: artificial intelligence; bioinformatics; computational biology; data mining & machine learning; evolutionary studies; mathematical biology; animal behavior
Online: 6 November 2019 (05:07:24 CET)
Industrial pig farming is associated with negative technological pressure on the bodies of pigs. Leg weakness and lameness are the sources of significant economic loss in raising pigs. Therefore, it is important to identify predictors of limb condition. This work presents assessments of the state of limbs using indicators of growth and meat characteristics of pigs based on machine learning algorithms. We have evaluated and compared the accuracy of prediction for several ML classification algorithms (Random Forest, K-Nearest Neighbors, Artificial Neural Networks, C50Tree, Support Vector Machines, Naive Bayes, Generalized Linear Models, Boost, and Linear Discriminant Analysis) and have identified the Random Forest and K-Nearest Neighbors as the best performing algorithms for predicting pig leg weakness using a small set of simple measurements that can be taken at an early stage of animal development. Muscle Thickness, Back Fat amount, and Average Daily Gain serve as significant predictors of conformation of pig limbs. Our work demonstrates the utility and relative ease of using machine learning algorithms to assess the state of limbs in pigs based on growth rate and meat characteristics.
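K-Nearest Neighbors, one of the two best-performing algorithms the study identifies, fits in a few lines. The feature values (muscle thickness, back fat, average daily gain) and labels below are invented for illustration, not the study's data:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points.

    train: list of (feature_vector, label); query: feature vector.
    """
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy records: (muscle thickness mm, back fat mm, average daily gain kg)
train = [
    ((62.0, 14.0, 0.85), "weak"),
    ((60.5, 15.2, 0.82), "weak"),
    ((71.0, 11.0, 0.95), "sound"),
    ((73.5, 10.4, 0.97), "sound"),
    ((69.8, 12.1, 0.93), "sound"),
]
pred = knn_predict(train, (70.0, 11.5, 0.94))
```

In practice the features would be standardised first, since KNN's distance is scale-sensitive; the point here is only how few measurements the classifier needs.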
ARTICLE | doi:10.20944/preprints202306.0925.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: Coal mining under reservoirs; High-intensity mining; Green mining; Physical simulation; Water conducting fracture zone
Online: 13 June 2023 (10:22:10 CEST)
China is rich in coal resources under water bodies. However, the safety prediction of high-intensity mining under water bodies has long been a problem for the coal industry, and solving it is of great significance for realizing safe mining under water bodies, improving the recovery rate of coal resources, and protecting reservoir resources. This article therefore takes the No. 5 coal seam and No. 11 mining area of Wangwa Coal Mine as the research object and integrates physical simulation, numerical simulation, theoretical analysis, and other methods to study the development height of water-conducting fracture zones in fully mechanized top-coal caving mining. A solid-liquid coupling physical simulation test reveals the failure characteristics of the overlying strata in the goaf and the seepage behaviour of reservoir water under the influence of mining. Comparing borehole leakage monitoring data and borehole camera observations with the height of the water-conducting fracture zone given by the traditional "three-under" empirical formula shows an error of up to -29.39%. A variance correction coefficient is therefore used to correct the empirical formula, and on this basis, to effectively protect the surface water dam and the water body, the permissible coal seam mining height for height-limited working faces is back-calculated. The research results provide a basis for the safety prediction of high-intensity mining under reservoir dams in the ecologically fragile areas of western China and a scientific basis for formulating safety measures under such conditions.
ARTICLE | doi:10.20944/preprints202306.2033.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Process Mining; Process Model Discovery; Mining action+state evolution
Online: 28 June 2023 (16:19:00 CEST)
Process model discovery covers methodologies for mining a process model from traces of process executions, and is gaining an important role in Artificial Intelligence research. Current approaches in the area, with few exceptions, focus on determining a model of the flow of actions only. However, in several contexts, (i) restricting attention to actions is quite limiting, since the effects of those actions must be analysed too, and (ii) traces provide additional information in the form of states (i.e., values of parameters possibly affected by the actions): for instance, in several medical domains traces include both actions and measurements of patients' parameters. In this paper, we propose AS-SIM (Action-State SIM), the first approach able to mine a process model comprising two distinct classes of nodes, capturing both actions and states.
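In its simplest form, the action-flow side of process discovery is mining a directly-follows graph from traces. AS-SIM goes further by adding state nodes; the sketch below covers only the action part, with invented medical-style traces:

```python
from collections import Counter

def directly_follows(traces):
    """Count how often action a is immediately followed by action b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

traces = [
    ["admit", "measure", "treat", "measure", "discharge"],
    ["admit", "treat", "measure", "discharge"],
]
dfg = directly_follows(traces)
```

Most discovery algorithms start from exactly such pair counts and then generalise them into a model; interleaving measurement events (states) into the traces, as above, is what motivates mining state nodes alongside action nodes.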
ARTICLE | doi:10.20944/preprints202305.2103.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: Cameroon; mining; Small Scale mining; Sustainable development; Betare Oya
Online: 30 May 2023 (10:00:47 CEST)
Considering the differences between the European and African continents concerning the management of the mining production sector, we decided to carry out this study with the main objective of demonstrating that, in Africa, mining can positively change the quality of life of the populations where it develops and, at the same time, it is possible to respect the environment, which is our main wealth. To achieve these objectives, it is necessary to present the mining activity of the continent, emphasizing both the negative aspects and its strong points. The most important thing is to make a good diagnosis of the situation, which will allow us to cure our "patient", that is, African mining production.
ARTICLE | doi:10.20944/preprints202308.1358.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: artificial intelligence; classification models; educational data mining; educational machine learning; feature selection; student performance prediction; taxonomy
Online: 18 August 2023 (09:52:57 CEST)
Identifying students who might have difficulty in their course of studies ahead of time is crucial. There can be many reasons for performance issues, such as personality, family, social, and/or economic factors. We advocate that educational systems should use machine learning to predict students’ performance based on performance factors. This would allow educational professionals and institutions to put in place a preventive plan to help students towards the achievement of their educational goals and success. In this chapter, we propose a student performance prediction method and evaluate its performance. We provide a taxonomy of performance factors that helps to gauge students’ performance from different perspectives and gives insights into the categories and features that have the most significant impact on students’ performance. The results of this work can be used by education institutions to put in place a student-centric approach to tackle performance issues before they create long-term effects on students’ lives. In addition, it will help education policymakers to introduce a tailored approach for the population in specific areas.
ARTICLE | doi:10.20944/preprints202301.0254.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Logic Artificial Intelligence; Knowledge Bases; Query Plan; Temporal Logic; Conformance Checking; Temporal Data Mining; Intraquery Parallelism
Online: 13 January 2023 (11:07:20 CET)
This paper extends our seminal paper on KnoBAB for efficient Conformance Checking computations performed on top of a customised relational model. After defining our proposed temporal algebra for temporal queries (xtLTLf), we show that it can express existing temporal languages over finite and non-empty traces such as LTLf. This paper also proposes a parallelisation strategy for such queries, thus reducing conformance checking to an embarrassingly parallel problem and leading to super-linear speed-up. This paper also presents how a single xtLTLf operator (or even entire sub-expressions) might be efficiently implemented via different algorithms, thus paving the way for future algorithmic improvements. Finally, our benchmarks show that our proposed implementation of xtLTLf (KnoBAB) outperforms state-of-the-art conformance checking software running on LTLf logic, whether data-aware or dataless.
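The per-trace independence that makes conformance checking embarrassingly parallel can be sketched in a few lines. The operators, the declarative constraint, and the log below are illustrative stand-ins for the idea only, not KnoBAB's actual xtLTLf algebra:

```python
# Minimal finite-trace temporal operators in the spirit of LTLf:
# eventually(p) holds if p holds at some position of the trace,
# globally(p) if it holds at all positions. Each trace is checked
# independently, so the check parallelises trivially across traces.
from concurrent.futures import ThreadPoolExecutor

def eventually(p, trace):
    return any(p(e) for e in trace)

def globally(p, trace):
    return all(p(e) for e in trace)

# Illustrative model: every trace must eventually 'ship' and never 'cancel'.
def conforms(trace):
    return (eventually(lambda e: e == "ship", trace)
            and globally(lambda e: e != "cancel", trace))

log = [
    ["order", "pay", "ship"],
    ["order", "cancel"],
    ["order", "pay", "ship", "close"],
]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(conforms, log))
print(results)   # [True, False, True]
```

Each trace evaluation touches no shared state, which is what allows the per-trace work to be distributed without coordination.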
ARTICLE | doi:10.20944/preprints202306.1691.v1
Subject: Engineering, Mining And Mineral Processing Keywords: key strata; mining-induced stress; DOFS; 3DEC; large-scale mining
Online: 23 June 2023 (14:09:45 CEST)
When there are multiple key strata in the overburden of a deep coal seam and the surface subsidence coefficient after mining is small, the overlying key strata have failed to break completely after mining. In this case, stress concentration on the working face occurs easily, which in turn leads to dynamic disasters such as rock burst. This study adopted a comprehensive analysis method of field monitoring and numerical simulation to explore the influence of the key stratum on the evolution law of mining-induced stress in the working face. A distributed optical fiber sensor (DOFS) and a surface subsidence GNSS monitoring system were respectively arranged inside and at the mouth of the ground observation borehole. From the strain monitoring results obtained with the DOFS, the height of the broken stratum inside the overlying strata was obtained; from the surface subsidence monitoring results, the surface subsidence coefficient was shown to be less than 0.1, indicating that the high key stratum does not break completely but instead enters a state of bending subsidence. To reveal the influence of the key stratum on the mining-induced stress of the working face, two 3DEC numerical models, with and without the key stratum, were established for comparative analysis. The numerical simulation results show that when there are multiple key strata in the overburden, the stress influence range and stress concentration coefficient of the coal seam after mining are relatively large. The study revealed the working mechanism of rock burst accidents after large-scale mining and predicted the potential area of rock burst risk after the mining of the working face, which has been verified by field investigation. The research results are of great guiding significance for revealing the working mechanism of rock burst in deep mining conditions and for its prevention and control.
ARTICLE | doi:10.20944/preprints201812.0083.v1
Subject: Biology And Life Sciences, Ecology, Evolution, Behavior And Systematics Keywords: post-mining regeneration; succession; tropical dry forest; post-mining recovery
Online: 6 December 2018 (11:04:06 CET)
Open pit mining is a common activity in the Yucatan peninsula for the extraction of limestone. This mining is known under the generic name of quarries, and regionally as sascaberas (sascab = white soil in the Mayan language). These areas are characterized by the total removal of the natural vegetation cover and soil in order to gain access to the calcareous material. The present study shows the composition and structure of the vegetation in five quarries after approximately ten years of abandonment, and of the conserved vegetation near each of the quarries in southeastern Quintana Roo. Using a canonical correspondence analysis (CCA), the distribution of the species was determined in relation to the edaphic variables: soil depth, percentage of organic matter (OM), cation exchange capacity (CEC), pH and texture. In total, 26 families, 46 genera and 50 species were recorded in the quarries, and 25 families, 45 genera and 47 species were recorded in the conserved areas. The dominant species in the quarries belong to the families Poaceae, Fabaceae, Rubiaceae and Anacardiaceae. The quarries with higher values of OM (1.63%), CEC (24.05 Cmol/kg), depth (11 cm) and sand percentage (31.33%) include species such as Lysiloma latisiliquum, Metopium brownei and Bursera simaruba, which are commonly found in secondary forests. On the other hand, quarries with lower values of OM (0.39%), CEC (16.58 Cmol/kg) and depth (5.02 cm), and a higher percentage of silt (42.44%), were dominated by herbaceous species belonging to the Poaceae family and by Borreria verticillata, which are typical of disturbed areas of southeastern Mexico. In all cases, the pH was slightly alkaline due to the content of calcium carbonate (CaCO3), characteristic of the soils of the region. The edaphic variables are significantly correlated with the development and distribution of the vegetation, and with the structure of the communities.
ARTICLE | doi:10.20944/preprints202206.0050.v2
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Emotions Mining; Context Mining; Sensory Mining; Artificial Intelligence; Information extraction; Text classification; Fairy tales; Olfactory Cultural Heritage
Online: 2 August 2022 (07:57:35 CEST)
This paper presents an Artificial Intelligence approach to mining context and emotions related to olfactory cultural heritage narratives, in particular to fairy tales. We provide an overview of the role of smell and emotions in literature, as well as highlight the importance of olfactory experience and emotions from psychology and linguistic perspectives. We introduce a methodology for extracting smells and emotions from text, as well as demonstrate the context-based visualizations related to smells and emotions implemented in a novel Smell Tracker tool. The evaluation is performed using a collection of fairy tales from Grimm and Andersen. We find that fairy tales often connect smell with the emotional charge of situations. The experimental results show that we can detect smells and emotions with F1 scores of 92.7 and 79.2, respectively.
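The reported F1 scores combine precision and recall into a single extraction-quality metric. As a quick reference, F1 can be computed from true-positive, false-positive and false-negative counts; the counts below are illustrative, not the paper's evaluation data:

```python
# F1 score: harmonic mean of precision and recall, commonly used to
# evaluate extraction of entity mentions such as smells and emotions.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 90 correctly extracted smell mentions, 8 spurious, 6 missed:
print(round(100 * f1_score(tp=90, fp=8, fn=6), 1))   # 92.8
```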
ARTICLE | doi:10.20944/preprints202003.0298.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Data Mining; Breast Cancer; Hybrid Feature Selection; Machine learning; Support Vector Machine; Optimize Genetic Algorithm; boosting algorithms
Online: 19 March 2020 (11:13:15 CET)
Breast cancer is a significant health issue across the world. It is the most widely diagnosed cancer in women; early-stage diagnosis of the disease and therapy increase patient safety. This paper proposes a composite hybrid feature selection model based on an optimized genetic algorithm (CHFS-BOGA) to forecast breast cancer. This hybrid feature selection approach combines the advantages of three filter feature selection approaches with an optimized genetic algorithm (OGA) to select the best features, improving the performance and scalability of the classification process. We build the OGA by improving the initial population generation and the genetic operators, using the results of the filter approaches as prior information and using the C4.5 decision tree classifier as the fitness function instead of probability and random selection. The authors collected the available updated data from the Wisconsin dataset in the UCI machine learning repository, with a total of 569 rows and 32 columns. The dataset was evaluated using the Explorer interface of the Weka open-source data mining software for analysis purposes. The results show that the proposed hybrid feature selection approach significantly outperforms the single filter approaches and principal component analysis (PCA) for optimum feature selection. The highest accuracy achieved before (CHFS-BOGA) using the support vector machine (SVM) classifier was 97.3%. The highest accuracy after (CHFS-BOGA-SVM) was 98.25% on a 70.0% train / 30.0% test split, and 100% on the full training set. Moreover, the area under the receiver operating characteristic (ROC) curve was equal to 1.0. The results showed that the proposed (CHFS-BOGA-SVM) system was able to accurately classify the type of breast tumor, whether malignant or benign.
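As a rough, self-contained illustration of genetic-algorithm feature selection in the spirit of OGA, the sketch below evolves binary feature masks on synthetic data. The nearest-centroid fitness stands in for the paper's C4.5-based fitness, and all data and GA parameters are assumptions for illustration only:

```python
import random

random.seed(0)

# Synthetic data: 2 informative features (0, 1) plus 4 noise features.
def make_data(n=40):
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [label * 5 + random.gauss(0, 0.5),
               label * 5 + random.gauss(0, 0.5)]
        row += [random.uniform(0, 10) for _ in range(4)]
        X.append(row)
        y.append(label)
    return X, y

X, y = make_data()

def fitness(mask):
    """Nearest-centroid accuracy on the selected features (a toy
    stand-in for the paper's C4.5-based fitness function)."""
    feats = [j for j, bit in enumerate(mask) if bit]
    if not feats:
        return 0.0
    cent = {c: [sum(X[i][j] for i in range(len(X)) if y[i] == c) /
                sum(1 for i in range(len(X)) if y[i] == c) for j in feats]
            for c in (0, 1)}
    correct = 0
    for i, row in enumerate(X):
        d = {c: sum((row[j] - cent[c][k]) ** 2 for k, j in enumerate(feats))
             for c in (0, 1)}
        correct += (min(d, key=d.get) == y[i])
    return correct / len(X)

def evolve(pop_size=12, generations=15, n_feats=6):
    pop = [[random.randint(0, 1) for _ in range(n_feats)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_feats)   # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:            # bit-flip mutation
                child[random.randrange(n_feats)] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best, round(fitness(best), 2))
```

With strongly separated informative features, the evolved mask tends to retain features 0 and 1 and reach near-perfect fitness on this toy data.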
ARTICLE | doi:10.20944/preprints202309.1430.v1
Subject: Business, Economics And Management, Economics Keywords: depletion of natural capital; mining; technogenic deposits; mining dumps; circular economy; environmental protection; Erdenet Mining Corporation SOE; Mongolia
Online: 21 September 2023 (08:52:35 CEST)
The article justifies the need to bring technogenic deposits (off-balance ore and wastes) into the economic circulation of mining enterprises as natural resources become depleted. This can be considered one of the tools of the circular economy. The authors analyze global trends in the development of copper deposits and global demand for copper, and design recommendations for possible alternative options for copper production. The authors use the case of the Erdenet Mining Corporation SOE, based in Mongolia, to develop an approach for economic, social and environmental problem-solving. It is proposed to develop the vast mining dumps as technogenic resources for recycled materials, prolonging the profitable activities of the mine. The hierarchy analysis method is used to obtain the optimum order of mining dump development to achieve economic, social and environmental effects.
ARTICLE | doi:10.20944/preprints202104.0404.v1
Subject: Business, Economics And Management, Accounting And Taxation Keywords: automated assessment; computer science; learning analytics; process mining; programming; sequence mining
Online: 15 April 2021 (09:40:33 CEST)
Learning programming is a complex and challenging task for many students. It involves both understanding theoretical concepts and acquiring practical skills. Hence, analyzing learners’ data from online learning environments alone fails to capture the full breadth of students’ actions if part of their learning process takes place elsewhere. Moreover, existing studies on learning analytics applied to programming education have mainly relied on frequency analysis to classify students according to their approach to programming or to predict academic achievement. However, frequency analysis provides limited insights into the individual time-related characteristics of the learning process. The current study examines students’ strategies when learning programming, combining data from the learning management system and from an automated assessment tool. To gain an in-depth understanding of students’ learning process as well as of the types of learners, we used learning analytics methods that account for the temporal order of learning actions. Our results show that students have special preferences for specific learning resources when learning programming, namely slides that support search, and copy and paste. We also found that videos are relatively less consumed by students, especially while working on programming assignments. Lastly, students resort to course forums to seek help only when they struggle.
REVIEW | doi:10.20944/preprints201807.0116.v1
Subject: Chemistry And Materials Science, Medicinal Chemistry Keywords: chemical space; chemoinformatics; data mining; databases; DNMT inhibitors; drug discovery; epi-informatics; molecular modeling; similarity searching; virtual screening
Online: 6 July 2018 (10:04:44 CEST)
Naturally occurring small molecules include a large variety of natural products from different sources that have confirmed activity against epigenetic targets. In this work we review chemoinformatic, molecular modeling and other computational approaches that have been used to uncover natural products as inhibitors of DNA methyltransferases, a major family of epigenetic targets with significant potential for the treatment of cancer and several other diseases. Examples of these computational approaches include docking, similarity-based virtual screening, and pharmacophore modeling. We also comment on the chemoinformatic exploration of the chemical space of naturally occurring compounds as epigenetic modulators, which may have significant implications for epigenetic drug discovery and nutriepigenetics.
REVIEW | doi:10.20944/preprints202110.0184.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: text-mining; self-attention models; biological literature mining; relationship extraction; natural language processing
Online: 12 October 2021 (14:17:46 CEST)
Keeping up with new publications on any molecule, network, or process of interest is becoming increasingly difficult. For many cellular processes, the set of molecules and interactions that need to be considered can be very large. Automated mining of publications can support large-scale molecular interaction maps and database curation. Text mining and Natural Language Processing (NLP)-based techniques are finding applications in mining the biological literature, handling problems such as Named Entity Recognition (NER) and Relationship Extraction (RE). Both rule-based and machine learning (ML)-based NLP approaches have been popular in this context, with multiple research and review articles examining the scope of such models in Biological Literature Mining (BLM). In this review article, we explore self-attention-based models, a special type of neural network (NN)-based architecture that has recently revitalized the field of NLP, applied to biological texts. We cover self-attention models operating either at the sentence level or the abstract level, in the context of molecular interaction extraction, published from 2019 onwards. We conduct a comparative study of the models in terms of their architecture. Moreover, we also discuss some limitations in the field of BLM and identify opportunities for the extraction of molecular interactions from biological text.
REVIEW | doi:10.20944/preprints202207.0010.v1
Subject: Engineering, Architecture, Building And Construction Keywords: mining; tailings; waste; recycling; restoration
Online: 1 July 2022 (09:00:31 CEST)
Mining is an important industry that provides products and services through infrastructure systems worldwide. However, global development drives steady growth in the demand for minerals, resulting in the accumulation of hazardous waste in land, sea and air environments and, consequently, a series of environmental and health problems. Restoration techniques for mining tailings have become increasingly discussed among scholars due to their potential to reduce tailings levels, thereby easing the environmental pressure for correct management and adding value to previously discarded waste. This review paper critically explores the available literature on the main techniques of mining tailings recycling and discusses the leading recycling technologies, including their advantages and drawbacks, as well as future perspectives. The findings of this review serve as a reference for scholars as well as support for decision-makers concerning the related environmental issues.
ARTICLE | doi:10.20944/preprints202011.0424.v1
Online: 16 November 2020 (14:20:15 CET)
Rock salt is characterized by specific geomechanical and rheological properties. Layers of rock salt at depths over 900 m cause problems with shaft lining deformation. The methods of shaft lining protection used so far (e.g. in the Sieroszowice mine) have not been effective enough. The research presents a patented and copyright-protected concept of a shaft lining construction that can be used in rock masses with strong rheological properties that are susceptible to leaching. In the case of salt layers, especially at significant depths, the relative convergence of the heading contour may be about 40 ‰/year. As a result, any other method of securing the shaft lining, e.g. by making it flexible, is not sufficient to ensure the stability of the shaft guidance geometry. In the new shaft lining concept, the excessive rock creeping into the excavation inside the shaft diameter is removed by local and controlled leaching of the shaft cheeks with fresh water through a porous medium in the contact layer behind the watertight tubing lining. The article presents the methodology for performing tests on a special device and the test results.
ARTICLE | doi:10.20944/preprints202008.0265.v2
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Ecommender system; learning to rank; Mining software repositories; Text Mining; Deep learning; Stack Overflow
Online: 4 September 2020 (11:20:33 CEST)
In software development, developers receive bug reports that describe software bugs. Developers find the cause of a bug by reviewing the code and reproducing the abnormal behavior, which is a tedious and time-consuming process. Developers need an automated system that incorporates large-scale domain knowledge and recommends solutions for those bugs, rather than spending more manual effort on fixing the bugs or waiting on Q&A websites for other users to reply. Stack Overflow is a popular question-and-answer site focused on programming issues, so we can benefit from the knowledge available in this rich platform. This paper presents a survey covering the methods in the field of mining software repositories. We propose an architecture for building a recommender system using the learning-to-rank approach. Deep learning is used to construct a model that solves the learning-to-rank problem using Stack Overflow data. Text mining techniques were used to extract, evaluate and recommend the answers that are most relevant to the solution of a given bug report.
ARTICLE | doi:10.20944/preprints202307.1831.v1
Subject: Medicine And Pharmacology, Veterinary Medicine Keywords: machine learning; veterinary medical education; random forest; medical education; artificial intelligence; Python; R; veterinary educators; educational data mining; learning analytics
Online: 26 July 2023 (14:02:31 CEST)
Machine learning (ML) offers potential opportunities to enhance learning, teaching and assessment within veterinary medical education, including but not limited to assisting with admissions processes as well as student progress evaluations. The purpose of this primer is to assist veterinary educators in appraising and potentially adopting these rapidly emerging advances in data science and technology. In the first section, we introduce ML concepts and highlight similarities and differences between ML and classical statistics. In the second section, we provide a step-by-step worked example using simulated veterinary student data to answer a hypothesis-driven question. Python syntax with explanations is provided within the text to create a random forest ML prediction model, and within each step, specific considerations, such as how to manage incomplete student records, are highlighted when applying ML algorithms within the veterinary education field. The results from the simulated data demonstrate how decisions by the veterinary educator during ML model creation may impact the most important features contributing to the model. These results highlight the need for the veterinary educator to be fully transparent during the creation of ML models, and future research is needed to establish guidelines for handling data not missing at random in medical education, as well as preferred methods for model evaluation.
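As a minimal stand-alone illustration of the primer's theme (handling incomplete student records, then fitting a bagged ensemble), here is a toy pure-Python sketch. The simulated records, the complete-case strategy and the stump-based forest are assumptions for illustration, not the chapter's actual worked example:

```python
import random

random.seed(1)

# Simulated student records: (admission_score, attendance_pct, passed),
# with some records incomplete (attendance missing), as in real registries.
records = []
for _ in range(60):
    score = random.uniform(40, 100)
    attend = random.uniform(50, 100)
    passed = 1 if score + 0.5 * attend > 110 else 0
    if random.random() < 0.1:
        attend = None                 # ~10% incomplete records
    records.append((score, attend, passed))

# Complete-case handling: one of several possible strategies; the primer
# stresses that this choice matters when data are not missing at random.
data = [r for r in records if r[1] is not None]

def train_stump(sample):
    """Pick the single-feature threshold rule with the best accuracy."""
    best_feat, best_thr, best_acc = 0, 0.0, 0.0
    for f in (0, 1):
        for thr in sorted({row[f] for row in sample}):
            acc = sum((row[f] >= thr) == bool(row[2])
                      for row in sample) / len(sample)
            if acc > best_acc:
                best_feat, best_thr, best_acc = f, thr, acc
    return best_feat, best_thr

def random_forest(rows, n_trees=25):
    """Bag decision stumps on bootstrap samples (a toy random forest)."""
    return [train_stump([random.choice(rows) for _ in rows])
            for _ in range(n_trees)]

def predict(forest, score, attend):
    votes = sum((score, attend)[f] >= thr for f, thr in forest)
    return 1 if votes > len(forest) / 2 else 0

forest = random_forest(data)
print(predict(forest, score=95, attend=95),
      predict(forest, score=40, attend=50))
```

Dropping incomplete records shrinks the training set, which is exactly the kind of modelling decision the primer asks educators to report transparently.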
ARTICLE | doi:10.20944/preprints202303.0192.v1
Subject: Biology And Life Sciences, Anatomy And Physiology Keywords: Phenome; Matrisome; Matreotype; Phenotype; Extracellular Matrix; Data Mining; SNP; PheWAS; GWAS; Electronic Health Records; Drug Repurposing; Precision Medicine; Collagen; Human
Online: 10 March 2023 (09:34:15 CET)
The extracellular matrix (ECM) is playing an increasingly relevant role in many disease states and in the process of aging. Analyzing these disease states is possible with GWAS and PheWAS methodology, and through our analysis we aimed to explore the relationships between polymorphisms in the compendium of ECM genes (i.e., matrisome genes) and various disease states. A significant contribution on the part of the ECM polymorphisms is evident in many varying types of diseases, particularly those involving the core matrisome genes. Our results confirm previous links to connective tissue disorders, but also unearth new and underexplored relationships with neurological, psychiatric, and age-related disease states. Upon analysis of drug indications for gene-disease relationships, we identified numerous targets that may be repurposed for age-related pathology. The identification of ECM polymorphisms and their contribution to disease plays an integral role in future therapeutic developments, drug repurposing, precision medicine, and personalized care.
ARTICLE | doi:10.20944/preprints202003.0297.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Data Mining; Alzheimer’s Dementia; Composite Hybrid Feature Selection; Machine learning; Stack Hybrid Classification; AI Techniques; Classification; AD Diagnose; Clinical AD Dataset
Online: 19 March 2020 (10:52:31 CET)
Alzheimer's disease (AD) is a common type of dementia that damages brain cells. Early detection of AD plays an essential role in global health care, because the disease is often misdiagnosed, shares many clinical features with other types of dementia, and monitoring its progression over time by magnetic resonance imaging (MRI) is costly and subject to human error in manual reading. In the first stage, our proposed model applies the medical dataset to a composite hybrid feature selection (CHFS) to extract new features and select the best ones, improving the performance of the classification process by eliminating obscure features. In the second stage, we applied the dataset to a stacked hybrid classification system that combines JRip and random forest classifiers with six model evaluations as meta-classifiers individually, to improve the prediction of the clinical diagnosis. All experiments were conducted on a laptop with an Intel Core i7-8750H CPU at 2.2 GHz and 16 GB of RAM running Windows 10 (64-bit). The dataset was evaluated using the Explorer interface of the Weka data mining software for analysis purposes. The experiments show that the proposed CHFS feature extraction performs better than principal component analysis (PCA) and effectively reduces the false-negative rate, with a relatively high overall accuracy of 96.50% using a support vector machine (SVM) as the meta-classifier, compared to 68.83%, which is considerably better than the previous state-of-the-art result. The area under the receiver operating characteristic (ROC) curve was equal to 95.5%. In addition, a CNN classification experiment on the Kaggle MRI image dataset achieved 80.21% accuracy. The results show that the proposed model accurately classifies clinical Alzheimer's samples, compared with MRI neuroimaging, for diagnosing AD at low cost.
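A stacked (stacking) classifier trains a meta-classifier on the outputs of base classifiers. The sketch below is a minimal pure-Python illustration of that idea; the threshold base learners, the lookup-table meta-classifier and the synthetic data are assumptions, not the paper's JRip/random-forest/SVM stack:

```python
import random

random.seed(2)

# Toy dataset: y = 1 when f0 + f1 > 1, so each feature alone is
# informative but imperfect; this is where a meta-classifier helps.
X = [(random.random(), random.random()) for _ in range(200)]
y = [1 if a + b > 1 else 0 for a, b in X]

split = 100
train_X, train_y = X[:split], y[:split]
test_X, test_y = X[split:], y[split:]

def base(feature):
    """Level-0 learner: a fixed threshold rule on one feature."""
    return lambda row: 1 if row[feature] >= 0.5 else 0

base_learners = [base(0), base(1)]

# Level-1 (meta) features are the base learners' predictions.
meta_train = [[b(row) for b in base_learners] for row in train_X]

def train_meta(meta_X, labels):
    """Meta-classifier: majority label per combination of base
    predictions (a tiny stand-in for a JRip- or SVM-style learner)."""
    table = {}
    for preds, lab in zip(meta_X, labels):
        table.setdefault(tuple(preds), []).append(lab)
    return {k: round(sum(v) / len(v)) for k, v in table.items()}

meta = train_meta(meta_train, train_y)

def stacked_predict(row):
    key = tuple(b(row) for b in base_learners)
    return meta.get(key, 0)

acc = sum(stacked_predict(r) == t
          for r, t in zip(test_X, test_y)) / len(test_y)
print(round(acc, 2))
```

The meta-layer learns how to combine base outputs where they agree (both thresholds exceeded, or neither) and where they conflict, which is the essence of stacking.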
ARTICLE | doi:10.20944/preprints202309.1505.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: heavy metals; mining activities; pollution; remediation
Online: 22 September 2023 (06:37:03 CEST)
Mining activities often generate significant amounts of extractive waste and, as a consequence, environmental impacts that affect all environmental factors to a greater or lesser extent. Depending on a variety of variables, the impact can be permanent or temporary, reversible or irreversible, negative or positive. This study investigated the status of the closure and remediation processes of mining areas in Romania, specifically in the counties of Maramureș, Suceava, Harghita, Alba, Hunedoara, and Caraș-Severin. Furthermore, based on the type and level of pollution, the degree of application of remediation techniques for water and soil pollution in the investigated mining areas was studied. From the analysed information, it is evident that although the closure and remediation process started in Romania over 20 years ago, to this day the technical projects, technical assistance, and execution of closure and remediation works have not yet completely solved the complex environmental issues in the mining sector. Most of the tailings ponds and waste piles of former mines continue to pose a permanent risk to the environment and the population. This study concludes that the mining sector in Romania, although it has the necessary techniques and technologies for the ecological rehabilitation of the degraded lands related to the extractive waste facilities and for the elimination of negative impacts on the environment and public health, has not yet been able to fully realize its remediation efforts.
ARTICLE | doi:10.20944/preprints202110.0033.v1
Subject: Biology And Life Sciences, Immunology And Microbiology Keywords: Antibiotic resistance; text mining; therapy; database
Online: 4 October 2021 (08:58:52 CEST)
Antimicrobial resistance (AMR) is one of the top 10 threats affecting global health. AMR defeats the effective prevention and treatment of infections caused by microbial pathogens, including bacteria, parasites, viruses and fungi (WHO). Microbial pathogens have a natural tendency to evolve and mutate over time, resulting in AMR strains. The set of genes involved in antibiotic resistance, also termed “antibiotic resistance genes” (ARGs), spreads across species by lateral gene transfer, thereby causing global dissemination. While this biological mechanism is prevalent in the spread of AMR, human practices also augment it through mechanisms such as over-prescription, incomplete treatment, and environmental waste. A considerable portion of the scientific community is engrossed in AMR-related work, trying to discover novel therapeutic solutions for tackling resistant pathogens. A comprehensive inspection of the literature shows that diverse therapeutic strategies have evolved over recent years. Collectively, these strategies include novel small molecules, newly identified antimicrobial peptides, bacteriophages, phytochemicals, nanocomposites, and novel phototherapy against bacteria, fungi and viruses. In this work we have developed a comprehensive knowledgebase by collecting alternative antimicrobial therapeutic strategies from the literature. We used a subjective approach for mining new strategies, resulting in broad coverage of entities, and subsequently added objective data such as entity name, potency, and safety information. The extracted data were organized into KOMBAT (Knowledgebase Of Microbes’ Battling Agents for Therapeutics). Many of these agents have been tested against AMR pathogens. We envision that this database will be valuable for developing future therapeutics against resistant pathogens. The database can be accessed through http://kombat.igib.res.in/.
Subject: Computer Science And Mathematics, Information Systems Keywords: fraud audit; process mining; visual analytics
Online: 2 March 2021 (09:19:01 CET)
Among the knowledge areas in which process mining has had an impact, the audit domain is particularly striking. Traditionally, audits seek evidence in a data sample that allows inferences to be made about a population. Mistakes are usually committed when generalizing the results, and anomalies therefore appear in unprocessed sets. However, there are some efforts to address these limitations using process mining-based approaches for fraud detection. To the best of our knowledge, no fraud audit method exists that combines process mining techniques and visual analytics to identify relevant patterns. This paper presents a fraud audit approach based on the combination of process mining techniques and visual analytics. The main advantages are: (i) a method is included that guides the use of the visual capabilities of process mining to detect fraudulent data patterns during an audit; (ii) the approach can be generalized to any business domain; (iii) well-known process mining techniques are used (Dotted Chart, Trace Alignment, Fuzzy Miner…). The techniques were selected by a group of experts and were extended to enable filtering for contextual analysis, to handle levels of process abstraction, and to facilitate implementation in the area of fraud audits. Based on the proposed approach, we developed a software solution that is currently being used in the financial sector as well as in the telecommunications and hospitality sectors. Finally, for demonstration purposes, we present a real hotel management use case in which we detected suspected fraudulent behaviors, thus validating the effectiveness of the approach.
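One simple process-mining pattern relevant to fraud auditing is flagging rare trace variants in an event log, since deviations from the dominant control flow are natural audit candidates. The sketch below is an illustrative minimal example, not the authors' tool; the event log and the rarity threshold are assumptions:

```python
from collections import Counter

# Toy event log: (case_id, activity) pairs, already time-ordered per case.
event_log = [
    ("c1", "register"), ("c1", "approve"), ("c1", "pay"),
    ("c2", "register"), ("c2", "approve"), ("c2", "pay"),
    ("c3", "register"), ("c3", "pay"),          # approval skipped
    ("c4", "register"), ("c4", "approve"), ("c4", "pay"),
]

# Group events into traces: one activity sequence per case.
traces = {}
for case, activity in event_log:
    traces.setdefault(case, []).append(activity)

# Count trace variants; rare variants are flagged for audit review.
variants = Counter(tuple(t) for t in traces.values())
threshold = 0.3   # flag variants covering < 30% of cases (illustrative)
suspicious = [v for v, n in variants.items() if n / len(traces) < threshold]
print(suspicious)   # the skipped-approval path: [('register', 'pay')]
```

In an audit setting, a flagged variant like the skipped-approval path would then be inspected with visual techniques such as a dotted chart or trace alignment.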
ARTICLE | doi:10.20944/preprints202003.0299.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Data Mining; Alzheimer’s Dementia; Composite Hybrid Feature Selection; Machine learning; stack Hybrid Classification; AI; MRI; Neuroimaging; MPEG7 edge histogram feature extraction; CNN
Online: 19 March 2020 (11:25:01 CET)
Alzheimer's disease (AD) detection acting as an essential role in global health care due to misdiagnosis and sharing many clinical sets with other types of dementia, and costly monitoring the progression of the disease over time by magnetic reasoning imaging (MRI) with consideration of human error in manual reading. This paper goal a comparative study on the performance of data mining techniques on two datasets of Clinical and Neuroimaging Tests with AD. Our proposed model in the first stage, Apply clinical medical dataset to a composite hybrid feature selection (CHFS), for extract new features to select the best features due to eliminating obscures features, In parallel with Apply a novel hybrid feature extraction of three batch edge detection algorithm and texture from MRI images dataset and optimized with fuzzy 64-bin histogram. In the second stage, we applied a clinical dataset to a stacked hybrid classification(SHC) model to combine Jrip and random forest classifiers with six model evaluations as meta-classifier individually to improve the prediction of clinical diagnosis. At the same stage of improving the classification accuracy of neuroimaging (MRI) dataset images by applying a convolution neural network (CNN) in comparison with traditional classifiers, running on extracted features from images. The authors have collected the clinical dataset of 426 subjects with (1229 potential patient sample) from oasis.org and (MRI) dataset from a benchmark kaggle.com with a total of around ~5000 images each segregated into the severity of Alzheimer's. The datasets evaluated using an explorer set of weka data mining software for the analysis purpose. 
The experiments show that the proposed CHFS feature extraction effectively reduces the false-negative rate and yields a relatively high overall accuracy: the stacked hybrid classification with a support vector machine (SVM) as meta-classifier reaches 96.50%, compared to 68.83% in previous work on the clinical dataset, and to 80.21% for the compared CNN model on the MRI image dataset. The results show the superiority of the CHFS model in predicting Alzheimer's disease at an early stage more accurately from the clinical medical dataset than from the neuroimaging (MRI) dataset. The proposed model accurately classifies clinical Alzheimer's samples at a low cost in comparison with the MRI-CNN image model at the early stage, and the proposed SHC also achieves a good classification rate on the MRI images.
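The stacked ("hybrid") classification idea above — base learners whose predictions feed a meta-classifier — can be sketched minimally. This is not the paper's Weka pipeline: the rule-based and forest learners below are hypothetical stand-ins for JRip and random forest, the simple vote stands in for the SVM meta-layer, and the feature names and thresholds (age, MMSE, CDR) are invented for illustration.

```python
# Minimal sketch of stacked classification with invented base learners.

def rule_learner(x):          # stand-in for JRip-style rule induction
    return 1 if x["age"] > 75 and x["mmse"] < 24 else 0

def forest_learner(x):        # stand-in for a random forest
    return 1 if x["cdr"] >= 0.5 else 0

def meta_classifier(p1, p2):  # stand-in for the SVM meta-layer:
    return 1 if (p1 + p2) >= 1 else 0   # here, a simple vote

def stacked_predict(x):
    """Combine the base learners' predictions via the meta-classifier."""
    return meta_classifier(rule_learner(x), forest_learner(x))

patient = {"age": 80, "mmse": 21, "cdr": 0.5}
print(stacked_predict(patient))  # -> 1 (both base learners agree)
```

In the real model the meta-layer is itself trained on the base learners' outputs rather than hard-coded as a vote.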
ARTICLE | doi:10.20944/preprints201711.0019.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: Mining; Mine reclamation; Land cover change; Vegetation health; NDVI Post-mining; SMA; Random forest classification; Remote Sensing
Online: 2 November 2017 (15:01:03 CET)
Mining for resource extraction may lead to several geological and associated environmental changes due to ground movements, collision with mining cavities and deformation of aquifers. Geological changes may continue in a reclaimed mine area, and the deformed aquifers may entail a breakdown of substrates and an increase in ground water tables, which may cause surface area inundation. Consequently, a reclaimed mine area may experience surface area collapse, i.e. subsidence, and degradation of vegetation health. Thus, monitoring short-term landscape dynamics in a reclaimed mine area may provide important information on the long-term geological and environmental impacts of mining activities. We studied landscape dynamics in Kirchheller Heide, Germany, which experienced extensive soil movement due to longwall mining without stowing, using Landsat imagery between 2013 and 2016. A Random Forest image classification technique was applied to analyse land-use and land-cover dynamics, and the growth of wetland areas was assessed using a Spectral Mixture Analysis (SMA). We also analyzed the changes in vegetation health using a Normalized Difference Vegetation Index (NDVI). We observed a 19.9% growth of wetland area within the four years, with 87.2% of the growth in the coverage of two major waterbodies in the reclaimed mine area. NDVI values indicate that 66.5% of the vegetation of the study area was degraded due to changes in ground water tables and surface flooding. Our results inform environmental management and mining reclamation authorities about the subsidence spots and priority mitigation areas from land surface and vegetation degradation in Kirchheller Heide.
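The vegetation-health measure used in the study is the standard NDVI formula, which can be shown directly. This is a generic sketch with invented reflectance values, not values from the Landsat scenes of Kirchheller Heide:

```python
# NDVI from near-infrared (NIR) and red reflectance; values are illustrative.

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

# Healthy vegetation reflects strongly in NIR and weakly in red:
print(round(ndvi(0.50, 0.08), 3))  # high NDVI -> healthy vegetation
print(round(ndvi(0.20, 0.15), 3))  # low NDVI  -> degraded or sparse vegetation
```

A drop in per-pixel NDVI between two acquisition dates is what the study interprets as vegetation degradation.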
REVIEW | doi:10.20944/preprints202003.0141.v1
Subject: Medicine And Pharmacology, Other Keywords: data sharing; data management; data science; big data; healthcare
Online: 8 March 2020 (16:46:20 CET)
In recent years, more and more health data are being generated. These data come not only from professional health systems, but also from wearable devices. All these data combined form ‘big data’ that can be utilized to optimize treatments for each unique patient (‘precision medicine’). To achieve this precision medicine, it is necessary that hospitals, academia and industry work together to bridge the ‘valley of death’ of translational medicine. However, hospitals and academia often have problems with sharing their data, even though the patient is actually the owner of his/her own health data, and the sharing of data is associated with increased citation rate. Academic hospitals usually invest a lot of time in setting up clinical trials and collecting data, and want to be the first ones to publish papers on this data. The idea that society benefits the most if the patient’s data are shared as soon as possible so that other researchers can work with it, has not taken root yet. There are some publicly available datasets, but these are usually only shared after studies are finished and/or publications have been written based on the data, which means a severe delay of months or even years before others can use the data for analysis. One solution is to incentivize the hospitals to share their data with (other) academic institutes and the industry. Here we discuss several aspects of data sharing in the medical domain: publisher requirements, data ownership, support for data sharing, data sharing initiatives and how the use of federated data might be a solution. We also discuss some potential future developments around data sharing.
ARTICLE | doi:10.20944/preprints202204.0138.v1
Subject: Computer Science And Mathematics, Software Keywords: API; clickstream; cloud applications; process mining; scripting
Online: 15 April 2022 (07:37:06 CEST)
Background: Process mining (PM) exploits event logs to obtain meaningful information about the processes that produced them. As the number of applications developed on cloud infrastructures is increasing, it becomes important to study and discover their underlying processes. However, many current PM technologies face challenges in dealing with complex and large event logs from cloud applications, especially when they have little structure (e.g., clickstreams). Methods: Using Design Science Research, this paper introduces a new method, called Cloud Pattern API – Process Mining (CPA-PM), that enables discovering and analyzing cloud-based application processes using PM in a way that addresses many of these challenges. CPA-PM exploits a new application programming interface (API), with an R implementation, for creating repeatable scripts that preprocess event logs collected from such applications. Results: Applying CPA-PM to a case with real and evolving event logs related to the trial process of a Software-as-a-Service cloud application led to useful analyses and insights, with reusable scripts. Conclusion: CPA-PM helps produce executable scripts for filtering event logs from clickstream and cloud-based applications, where the scripts can be used in pipelines while minimizing the need for error-prone and time-consuming manual filtering.
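The kind of repeatable preprocessing step that such scripts perform can be illustrated with a toy example. The CPA-PM API itself is implemented in R; the sketch below is a hypothetical Python analogue with invented event fields, showing a filter-then-sort pass over a low-structure clickstream log before process discovery:

```python
# Hypothetical clickstream event log; fields ("case", "activity", "ts") are invented.
events = [
    {"case": "u1", "activity": "signup",      "ts": 1},
    {"case": "u1", "activity": "mousemove",   "ts": 2},   # clickstream noise
    {"case": "u1", "activity": "start_trial", "ts": 3},
    {"case": "u2", "activity": "mousemove",   "ts": 1},
]

NOISE = {"mousemove", "scroll"}  # activities to drop before discovery

def preprocess(log):
    """Drop noise events, then order the log by case and timestamp."""
    kept = [e for e in log if e["activity"] not in NOISE]
    return sorted(kept, key=lambda e: (e["case"], e["ts"]))

for e in preprocess(events):
    print(e["case"], e["activity"])
```

Because the filtering lives in a script rather than in manual clicks, the same preprocessing can be re-run every time the evolving log is refreshed, which is the point of the CPA-PM pipeline approach.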
ARTICLE | doi:10.20944/preprints202107.0568.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: pattern mining; digenic traits; genotype pattern; diplotype
Online: 26 July 2021 (11:14:06 CEST)
Some genetic diseases (“digenic traits”) are due to the interaction between two DNA variants, which presumably reflects biochemical interactions. For example, certain forms of Retinitis Pigmentosa, a type of blindness, occur in the presence of two mutant variants, one each in the ROM1 and RDS genes, while occurrence of only one such variant results in a normal phenotype. Detecting variant pairs underlying digenic traits by standard genetic methods is difficult and is downright impossible when individual variants alone have minimal effects. Frequent Pattern Mining (FPM) methods are known to detect patterns of items. We make use of FPM approaches to find pairs of genotypes (from different variants) that can discriminate between cases and controls. Our method is based on genotype patterns of length two, and permutation testing allows p-values to be assigned to genotype patterns, where the null hypothesis refers to equal pattern frequencies in cases and controls. We compare different interaction search approaches and their properties on the basis of published datasets. Our implementation of FPM for case-control studies is freely available.
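The core of the approach — counting how often a two-variant genotype pattern occurs in cases versus controls, then assigning a p-value by permutation under the null hypothesis of equal pattern frequencies — can be sketched as follows. The genotype data, group sizes, and pattern below are invented for illustration:

```python
# Permutation test for a length-two genotype pattern; data are made up.
import random

random.seed(0)

def pattern_count(genotypes, pattern):
    """Count subjects carrying the genotype pattern (variant1, variant2)."""
    return sum(1 for g in genotypes if g == pattern)

cases    = [("Aa", "Bb")] * 8 + [("AA", "BB")] * 2   # pattern enriched in cases
controls = [("Aa", "Bb")] * 2 + [("AA", "BB")] * 8

pattern = ("Aa", "Bb")
observed = pattern_count(cases, pattern) - pattern_count(controls, pattern)

# Permute case/control labels and count how often the difference is as extreme.
pooled, n_extreme, n_perm = cases + controls, 0, 2000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = pattern_count(pooled[:10], pattern) - pattern_count(pooled[10:], pattern)
    if diff >= observed:
        n_extreme += 1

p_value = (n_extreme + 1) / (n_perm + 1)
print(p_value)  # small p-value: pattern frequencies differ between groups
```

In the real method this test is run for every candidate genotype pair found by the FPM step, so multiple-testing correction would also be needed.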
ARTICLE | doi:10.20944/preprints202105.0198.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Availability; underground mining; radio communication system; influence
Online: 10 May 2021 (14:21:02 CEST)
A radio communication system is one of the most essential systems in an underground mine. It must be reliable from the start of construction of the underground mine to the closure of the mine. However, the reliability of an underground mine radio communication system normally has to be tested on an active radio communication system in a real environment. This study suggests a new research methodology: instead of the traditional method of studying the reliability of radio communication systems by calculating large-scale differential equations, reliability is studied using system dynamics modeling in the Vensim software. The availability and readiness data of the Motorola Dimetra (TETRA) radio communication system were used to simulate the probability of reliability of the underground mine radio communication system with Vensim system dynamics models. The factors that affect the reliability of underground mine radio communication systems were also studied, and were determined from the associated risks using the example of the Oyu Tolgoi underground mine. The factors that affect the reliable operation of the underground mine radio communication system were determined using the failure statistics of the TETRA radio communication system at the Oyu Tolgoi mine in 2015-2018.
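The contrast the study draws — simulating reliability as a dynamic system rather than solving differential equations analytically — can be illustrated with a minimal Euler-style integration of the classic reliability equation dR/dt = -λR, which is the numerical scheme system-dynamics tools like Vensim apply under the hood. The failure rate below is invented, not a TETRA figure:

```python
# System-dynamics-style numerical integration of dR/dt = -lam * R,
# compared against the analytical solution R(t) = exp(-lam * t).
import math

def reliability_euler(lam, t_end, dt=0.01):
    """Step the reliability stock forward in time (Euler integration)."""
    r, t = 1.0, 0.0
    while t < t_end:
        r += -lam * r * dt   # outflow proportional to the failure rate
        t += dt
    return r

lam = 0.001  # failures per hour (invented value)
print(round(reliability_euler(lam, 1000.0), 3), round(math.exp(-lam * 1000.0), 3))
```

With a small enough time step the simulated stock tracks the closed-form exponential closely, which is why the simulation approach can replace large-scale differential equation calculations for more complex multi-component models.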
ARTICLE | doi:10.20944/preprints202010.0057.v1
Subject: Business, Economics And Management, Accounting And Taxation Keywords: multiclass classification; text mining; accounting control system
Online: 5 October 2020 (09:05:53 CEST)
Electronic invoicing has been mandatory for Italian companies since January 2019. Invoices are structured in a predefined XML template from which the reported information can be easily extracted and analyzed. The main aim of this paper is to exploit the information structured in electronic invoices to build an intelligent system which can facilitate accountants' work. More precisely, this contribution shows how it is possible to automate part of the accounting process: all sent or received invoices of a company are classified into specific codes which represent the economic nature of the financial transactions. In order to classify the data contained in the invoices, a machine learning multiclass classification problem is proposed, using the information in the invoices as input variables to predict two different target variables, the account codes and the VAT codes, which compose a general ledger entry. Different approaches are compared in terms of prediction accuracy. The best performance is achieved by considering the hierarchical structure of the account codes.
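The multiclass setup can be sketched with a toy classifier. The invoice texts and account codes below are invented, and the simple word-vote rule merely stands in for the machine learning models the paper compares:

```python
# Toy text-to-account-code classifier; training data and codes are invented.
from collections import Counter, defaultdict

train = [
    ("office paper and toner",   "6001"),  # hypothetical code: office supplies
    ("printer toner cartridge",  "6001"),
    ("train ticket rome",        "7002"),  # hypothetical code: travel
    ("flight ticket milan",      "7002"),
]

# Count, for each word, how often it co-occurs with each account code.
word_code = defaultdict(Counter)
for text, code in train:
    for w in text.split():
        word_code[w][code] += 1

def predict(text):
    """Vote per word for the most frequent account code (Naive-Bayes-like rule)."""
    votes = Counter()
    for w in text.split():
        for code, n in word_code[w].items():
            votes[code] += n
    return votes.most_common(1)[0][0]

print(predict("toner for office printer"))  # -> "6001"
```

The real system additionally predicts VAT codes and exploits the hierarchy of the account-code chart, which a flat vote like this ignores.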
ARTICLE | doi:10.20944/preprints201906.0310.v1
Subject: Biology And Life Sciences, Immunology And Microbiology Keywords: cyanobacteria; secondary metabolite; genome mining; molecular networking
Online: 30 June 2019 (10:42:22 CEST)
Cyanobacteria are an ancient lineage of slow-growing photosynthetic bacteria and a prolific source of natural products with diverse chemical structures and potent biological activities and toxicities. The chemical identification of these compounds remains a major bottleneck. Strategies that can prioritize the most prolific strains and novel compounds are of great interest. Here, we combine chemical analysis and genomics to investigate the chemodiversity of secondary metabolites based on their pattern of distribution within some cyanobacteria. As Planktothrix is a cyanobacterial genus known to form blooms worldwide and to produce a broad spectrum of toxins and other bioactive compounds, we applied this combined approach to four closely related strains of Planktothrix. The chemical diversity of the metabolites produced by the four strains was evaluated using an untargeted metabolomics strategy with high-resolution LC-MS. Metabolite profiles were correlated with the potential for metabolite production identified by genomics for the different strains. Although the Planktothrix strains present a global similarity in terms of biosynthetic gene clusters (for microcystin, aeruginosin and prenylagaramide, for example), we found remarkable strain-specific chemodiversity. Only a few of the chemical features were common to the four studied strains. Additionally, the MS/MS data were analyzed using Global Natural Products Social Molecular Networking (GNPS) to identify molecular families of the same biosynthetic origin. In conclusion, we present an efficient integrative strategy for elucidating the chemical diversity of a given genus and linking the data obtained from analytical chemistry to the biosynthetic genes of cyanobacteria.
ARTICLE | doi:10.20944/preprints201805.0281.v1
Subject: Engineering, Control And Systems Engineering Keywords: construction technology adoption process; construction; mining; digital technology; diffusion; implementation; mixed methods; grounded theory; thematic analysis; data and methodological triangulation techniques; AHP; NVivo
Online: 22 May 2018 (04:52:39 CEST)
Due to the complexity, high-risk, and conservative character of construction companies, advanced digital technologies do not become widely adopted in the short term, while vendors make determined efforts to overcome this and disseminate their technologies. This paper presents the methods of an investigation addressing the extremely complex issues related to the current practices of digital technology adoption in construction. It discusses how construction companies follow a specific logical process linked to need, project objectives, characteristics of the adopting organization, and the characteristics of the new technology to be adopted. The study aims to demonstrate a novel method of data collection and analysis including data and methodological triangulation techniques including the use of NVivo and AHP to explore how companies make the decision to uptake a new technology (e.g. advanced crane, tunnel boring machine or drones) by focusing on customer and vendor activities, their interactions, contributing factors, and people involved in the process. The major original contribution of this paper is to develop an innovative methodological Cube for investigating the Construction Technology Adoption Process (CTAP) covering technology adoption, acceptance, diffusion and implementation concepts. CTAP is a framework that delineates the phases of the process that customer organizations use when deciding to adopt a new digital technology and the parallel vendor activities. The significance of these contributions is that they enable vendors to understand how to match their strategies with customer expectations in each phase of the CTAP. It also provides a benchmark for new construction companies to use the current best practice of decision making. Future research is warranted to more clearly delineate any differences with developing nations or related industries such as mining and property management.
ARTICLE | doi:10.20944/preprints202206.0320.v4
Subject: Biology And Life Sciences, Other Keywords: data; reproducibility; FAIR; data reuse; public data; big data; analysis
Online: 2 November 2022 (02:55:49 CET)
With an increasing amount of biological data available publicly, there is a need for a guide on how to successfully download and use this data. The Ten simple rules for using public biological data are: 1) use public data purposefully in your research, 2) evaluate data for your use case, 3) check data reuse requirements and embargoes, 4) be aware of ethics for data reuse, 5) plan for data storage and compute requirements, 6) know what you are downloading, 7) download programmatically and verify integrity, 8) properly cite data, 9) make reprocessed data and models Findable, Accessible, Interoperable, and Reusable (FAIR) and share, and 10) make pipelines and code FAIR and share. These rules are intended as a guide for researchers wanting to make use of available data and to increase data reuse and reproducibility.
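Rule 7 ("download programmatically and verify integrity") is the most directly codable of the ten; a minimal sketch of the verification half is below. File paths and the expected checksum would come from the data provider's published metadata; nothing here is specific to any particular repository:

```python
# Verify a downloaded file against its published SHA-256 checksum.
import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Stream the file in chunks so large downloads need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    """Compare the computed digest with the provider's published checksum."""
    ok = sha256_of(path) == expected_hex
    print("OK" if ok else "CHECKSUM MISMATCH", path)
    return ok
```

Running such a check after every programmatic download catches truncated or corrupted transfers before they silently propagate into downstream analyses, which also supports the reproducibility aims of rules 9 and 10.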
ARTICLE | doi:10.20944/preprints202003.0268.v1
Subject: Social Sciences, Library And Information Sciences Keywords: matching; data marketplace; data platform; data visualization; call for data
Online: 17 March 2020 (04:10:28 CET)
Improvements in web platforms for data exchange and trading are creating more opportunities for users to obtain data from data providers of different domains. However, the current data exchange platforms are limited to unilateral information provision from data providers to users. In contrast, there are insufficient means for data providers to learn what kinds of data users desire and for what purposes. In this paper, we propose and discuss the description items for sharing users’ calls for data as data requests in the data marketplace. We also discuss structural differences in data requests and providable data using variables, as well as possibilities of data matching. In the study, we developed an interactive platform, Treasuring Every Encounter of Data Affairs (TEEDA), to facilitate matching and interactions between data providers and users. The basic features of TEEDA are described in this paper. From experiments, we found the same distributions of the frequency of variables but different distributions of the number of variables in each piece of data, which are important factors to consider in the discussion of data matching in the data marketplace.
ARTICLE | doi:10.20944/preprints202304.0130.v1
Subject: Computer Science And Mathematics, Other Keywords: data; cooperatives; open data; data stewardship; data governance; digital commons; data sovereignty; open digital federation platform
Online: 7 April 2023 (14:14:02 CEST)
Network effects, economies of scale, and lock-in effects increasingly lead to a concentration of digital resources and capabilities, hindering the free and equitable development of digital entrepreneurship (SDG9), new skills, and jobs (SDG8), especially in small communities (SDG11) and their small and medium-sized enterprises (“SMEs”). To ensure the affordability and accessibility of technologies, promote digital entrepreneurship and community well-being (SDG3), and protect digital rights, we propose data cooperatives [1,2] as a vehicle for secure, trusted, and sovereign data exchange [3,4]. In post-pandemic times, community/SME-led cooperatives can play a vital role by ensuring that supply chains to support digital commons are uninterrupted, resilient, and decentralized. Digital commons and data sovereignty provide communities with affordable and easy access to information and the ability to collectively negotiate data-related decisions. Moreover, cooperative commons (a) provide access to the infrastructure that underpins the modern economy, (b) preserve property rights, and (c) ensure that privatization and monopolization do not further erode self-determination, especially in a world increasingly mediated by AI. Thus, governance plays a significant role in accelerating communities’/SMEs’ digital transformation and addressing their challenges. Cooperatives thrive on digital governance and standards such as open trusted Application Programming Interfaces (APIs) that increase the efficiency, technological capabilities, and capacities of participants and, most importantly, integrate, enable, and accelerate the digital transformation of SMEs in the overall process. This policy paper presents and discusses several transformative use cases for cooperative data governance.
The use cases demonstrate how platform/data-cooperatives, and their novel value creation can be leveraged to take digital commons and value chains to a new level of collaboration while addressing the most pressing community issues. The proposed framework for a digital federated and sovereign reference architecture will create a blueprint for sustainable development both in the Global South and North.
ARTICLE | doi:10.20944/preprints202307.1513.v1
Subject: Environmental And Earth Sciences, Pollution Keywords: biochar; coal mining; heavy metals; remediation; seed balls
Online: 24 July 2023 (08:28:43 CEST)
Globally, open-pit coal mining is associated with severe land use impacts and contamination of soil and water resources with heavy metals. Thus, in growing economies like India, where coal is a significant energy source, heavy metals contamination of soil and water becomes ubiquitous. Remediation of such a large stretch of mined-out land is a major challenge and a costly process for the mining industry. In recent years, the application of biochar for the remediation of such heavy metals-contaminated soil has been widely practiced. However, applying biochar and cultivating plants in field conditions becomes challenging. This study uses a unique remediation approach by developing biochar-bentonite-based seed balls encapsulating Sorghum grass seeds at their core for application in the contaminated soil. The seed ball was developed by using the bentonite biochar composite in varying weight fractions of 0.5 – 5 % with respect to the kaolinite, whose fractions in the seed ball also varied at one, three, and five parts. The seed balls were applied to pots containing 3 kg of heavy metals contaminated soil for a pot-culture study in a polyhouse for a period of four months. Initial soil analysis results indicated that the mine soil samples showed poor nutrient and organic matter content and were contaminated with heavy metals such as Ni, Zn, Cr, and Cd. Post-pot-culture soil analysis results indicated that the application of seed balls containing five fractions of biochar composite in combination with three and five weight fractions of kaolinite showed substantial improvement in the pH, available nutrients, organic matter content, soil enzymes, and overall soil fertility index compared to the controlled study and other cases.
The same combination of seed balls also significantly reduced the plant-available fractions of Ni, Zn, Cr, and Cd in the soil and the translocation of these heavy metals from the rhizosphere zone to the grass’s aerial parts, indicating stabilization of heavy metals within the soil matrix. Moreover, the application of seed balls also substantially improved the plant physiology and reduced the release of stress hormones such as proline and glutathione within the plant cells indicating improvement in the plant’s biotic and abiotic stress factors. Thus, the application of seed balls in heavy metals contaminated soils, particularly over a large stretch of land, could be a low-cost and viable remediation technique.
ARTICLE | doi:10.20944/preprints202307.0210.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Climate Change; SSPs scenarios; Water Management; Mining; Kazakhstan
Online: 4 July 2023 (11:40:39 CEST)
Climate change is a threat to mining and other industries, especially those involving water supply and management, by inducing or amplifying changes in climatic parameters such as precipitation regimes and temperature extremes. Using the latest NASA NEX-GDDP-CMIP6 datasets, this study quantifies the level of climate change that may affect the development of two mine sites (Site1 and Site2) in northeast Kazakhstan. The study analyses the daily precipitation and maximum and minimum temperature of a number of global circulation models (GCM) over three future time periods, the 2040s, 2060s and 2080s, under two shared socioeconomic pathway (SSP) scenarios, SSP245 and SSP585, against the baseline period 1981-2014. The analyses revealed that: (1) Both maximum and minimum temperature will increase under both SSP scenarios in those time periods, with the rate of change for minimum temperature being higher than for maximum temperature. (2) The mean annual precipitation will increase by an average rate of 7% and 10.5% in the 2040s for SSP245 and 17.5% and 7.5% for SSP585 in the 2080s at Site1 and Site2, respectively. It is also observed that summer months will experience drier conditions whilst all other months will see increased precipitation. (3) The values of 24-hour precipitation with a 10-year return period will also increase under both SSP scenarios and future time periods for most of the studied GCMs and at both mine sites. These predicted changes should be considered as design criteria adjustments for project water supply and water management structures.
ARTICLE | doi:10.20944/preprints202305.0444.v1
Subject: Environmental And Earth Sciences, Soil Science Keywords: soil indicators; vegetation indicators; iron mining; ecological restoration
Online: 8 May 2023 (04:45:53 CEST)
Many ecosystems are being severely degraded, leading the United Nations to deem 2021-2030 as the Decade on Ecosystem Restoration. To be successful, this effort requires robust monitoring tools to assess land reclamation practices. Our study aimed to evaluate the quality of recovery efforts in mined areas by developing a Recovery Quality Index (RQI) based on soil and vegetation indicators. Using the heavily mined Iron Quadrangle region of Brazil as an example, we selected four local, undisturbed reference areas as restoration goals: Atlantic Forest (AF); ferruginous rupestrian grassland with dense vegetation (FRGD); ferruginous rupestrian grassland with sparse vegetation (FRGS); and quartzite rupestrian grassland (QRG). We also selected four areas that were directly or indirectly affected by mining, including an environmental compensation area set aside 5 years prior to the study (COMP-5), two sterile piles that had undergone recovery for 15 and 20 years (SP-20 and SP-15), and a cave area with 15 years of recovery (CAVE-15). The four recovery areas were grouped together with each individual reference area (making four combinations of sites), and measurements of 2 vegetation parameters and 34 soil attributes were used in a Principal Component Analysis (PCA) for each grouping. We determined the RQI for each group by summing weighted PCA scores for responsive indicators. Vegetative parameters had the lowest RQI weights in all four groups. Soil physical indicators tended to be the most important, except in AF, where chemical indicators were most relevant. RQI values were also lowest when AF was used as the reference, showing that the forest was a unique ecosystem, and the CAVE-15 site had lower RQI scores than the other restored sites, indicating the high degree of disturbance that occurred in that low-lying area. 
The SP-20 site tended to have higher RQI values than the SP-15, and similar values to the less disturbed COMP-5 areas, potentially indicating greater recovery of native soil properties during the longer recovery period. This RQI-based approach has excellent potential for robust assessment of the recovery of areas degraded by mining and can support decision-making during monitoring.
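The index construction described above — summing indicator scores weighted by their PCA loadings into a single Recovery Quality Index — can be sketched abstractly. The indicator names, normalized values, and weights below are invented stand-ins, not the study's PCA results:

```python
# Toy Recovery Quality Index: weighted sum of normalized indicators.
# All names, values, and weights are hypothetical illustrations.
indicators = {"soil_density": 0.8, "organic_matter": 0.6, "veg_cover": 0.4}
weights    = {"soil_density": 0.5, "organic_matter": 0.3, "veg_cover": 0.2}

def rqi(values, w):
    """Sum each indicator score multiplied by its (PCA-derived) weight."""
    return sum(values[k] * w[k] for k in values)

print(round(rqi(indicators, weights), 2))  # higher RQI -> closer to the reference area
```

In the study the weights come from Principal Component Analysis scores of the responsive indicators within each reference-area grouping, so the same recovery site can receive different RQI values depending on which reference ecosystem it is compared against.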
Subject: Public Health And Healthcare, Public, Environmental And Occupational Health Keywords: Artisanal mining; PPE; Occupational factors; Occupational health and safety
Online: 13 September 2021 (08:43:37 CEST)
Artisanal goldminers in Ghana are exposed to various levels and forms of health, safety and environmental threats. Without the required legislation and regulations, artisanal miners are responsible for their own health and safety at work. Consequently, understanding the probabilities of self-protection at work by artisanal goldminers is crucial. A cross-sectional survey of 500 artisanal goldminers was conducted to examine the probabilities of personal protective equipment use among artisanal goldminers in Ghana. The data were subjected to both descriptive and inferential statistics. Initial findings showed that personal protective equipment use among artisanal miners was 77.4%. Overall, higher probabilities of personal protective equipment use were observed among artisanal goldminers who work in good health and safety conditions as compared to artisanal miners who work in poor health and safety conditions. Also, personal protective equipment use was more probable among the highly educated artisanal goldminers, miners who regularly go for medical screening and the most experienced miners. Additionally, personal protective equipment use was more probable among artisanal miners who work in non-production departments and miners who work in the medium-scale subsector. Inversely, personal protective equipment use was less probable among female artisanal miners and miners who earn more monthly income ($174 and above). To increase self-care and safety consciousness in artisanal mining, there is a need for national occupational health and safety legislation in Ghana. Also, interventions and health promotion campaigns for better occupational conditions in artisanal mining should target and revise the health and safety related workplace programs and conditions.
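The "probabilities of use" framing suggests a logistic-type inferential model. The sketch below shows how binary occupational factors could map to a use probability; the coefficients and factor coding are invented for illustration and are not the study's estimates:

```python
# Hypothetical logistic model of PPE use; all coefficients are invented.
import math

def ppe_use_probability(good_conditions, educated, experienced,
                        b0=-0.5, b1=1.2, b2=0.8, b3=0.6):
    """Logistic function: probability = 1 / (1 + exp(-(b0 + sum of effects)))."""
    z = b0 + b1 * good_conditions + b2 * educated + b3 * experienced
    return 1 / (1 + math.exp(-z))

p = ppe_use_probability(1, 1, 1)  # miner with all favorable factors
print(round(p, 3))  # higher probability than a miner with none of them
```

Positive coefficients correspond to the study's findings (good conditions, education, experience raising the probability of use); a fitted model would also include terms for the factors found to lower it, such as gender and income bracket.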
ARTICLE | doi:10.20944/preprints202105.0375.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Metals; Environmental monitoring; Bioassays; Amazon River; Amazon; mining
Online: 17 May 2021 (09:42:51 CEST)
As the number of legal and illegal mining sites increases, integrative methods to evaluate the effects of mining pollution on Andes-Amazonia freshwater ecosystems are paramount. Here, we sampled water and sediments in 11 sites potentially affected by mining activities in the Napo province (Ecuador). The environmental impacts were evaluated using four lines of evidence (LOEs): water physico-chemical parameters; metal exposure concentrations; macroinvertebrate community response (AAMBI); and toxicity assessed by conducting bioassays with Lactuca sativa and Daphnia magna. Overall, dissolved oxygen was below quality standards (<80%) and total suspended solids were above them (>130 mg/L). Ag, Al, As, Cd, Cu, Fe, Mn, Pb and Zn in water, and V, B and Cr in sediments, were detected above quality standards. Nine out of eleven sites were classified as having bad environmental quality based on the AAMBI. The ranges of L. sativa seed germination in both water (37% to 70%) and sediment (0% to 65%) indicate significant toxicity. In 5 sites, neonates of D. magna showed a 25% reduction in survival compared to the control. Our integrated LOEs index ranked sites by their environmental degradation. Given the importance of the Andes-Amazon region, we recommend environmental impact monitoring of the mining expansion using multiple LOEs.
ARTICLE | doi:10.20944/preprints202102.0120.v1
Subject: Business, Economics And Management, Business And Management Keywords: Homepage words; Financial ratio; Text-mining; Balanced scorecard
Online: 3 February 2021 (15:07:40 CET)
(1) Background: The CEO message on a hospital homepage contains various content, such as the hospital's future vision, promises to customers, upgraded services and public activities. The CEO’s message includes non-financial as well as financial information about the organization. It also provides useful information not only on the company's goals and vision but also on firm performance and strategies for the future. This study aims to investigate associations between the CEO’s messages on hospital homepages and the hospitals' financial status. We used the balanced scorecard frame to analyze what content on a hospital's homepage is related to the hospital's various financial ratios. (2) Methods: We adopt a text mining method to extract significantly repeated keywords from the CEO’s message on each hospital website, and we classify these keywords by the balanced scorecard frame. To examine the relationship between keywords in the CEO’s message on the hospital homepage and the hospital’s financial ratios, a t-test is conducted for the difference in the mean TF-IDF (Term Frequency-Inverse Document Frequency) of the homepage contents and its relationship with the views of the balanced scorecard framework. (3) Results: According to empirical results on 65 samples collected from local hospitals, there are some significant relationships between the qualitative content of a hospital's homepage and the quantitative financial ratios that indicate profitability, activity, leverage, liquidity, and transfer to essential business fund (EBF) income. (4) Conclusions: The introduction section of a homepage is the most accessible to customers, containing the aims and ideals of hospitals and reflecting their values and visions. In addition, in view of their financial status, hospitals can either emphasize financial strength or focus on other areas to mask weaknesses in their financial information.
This study highlights the importance of disclosure on hospital websites, from which the financial status of the hospital can be inferred. It also highlights the need for harmonization between quantitative data (financial statements) and qualitative data (CEO messages). (5) Implications: To the best of our knowledge, this paper is the first research attempting to investigate the relation between hospital homepage text and hospital financial ratios through a text mining technique and the balanced scorecard frame. Hospitals play a crucial part in a country’s welfare and healthcare backbone industry. Nevertheless, in many countries, hospital organizations tend to remain a source of critical fiscal deficits due to ineffective and sloppy management. We expect that the results of this paper can provide hospital managers with useful information.
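The TF-IDF weighting applied to the CEO messages follows the standard formula: a term's frequency in a document, scaled down by how many documents contain it. The two-document corpus below is an invented stand-in for the homepage texts:

```python
# Standard TF-IDF; documents are invented stand-ins for CEO messages.
import math

docs = [
    "patient care quality vision",
    "financial growth profit vision",
]

def tf_idf(term, doc, corpus):
    """tf = term frequency in doc; idf = log(N / document frequency)."""
    tf = doc.split().count(term) / len(doc.split())
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df)  # rarer terms get higher weight
    return tf * idf

print(round(tf_idf("care", docs[0], docs), 3))    # term unique to one doc -> positive weight
print(round(tf_idf("vision", docs[0], docs), 3))  # term in every doc -> 0.0
```

Terms that appear in every hospital's message thus carry no weight, which is what lets the method surface the distinctive keywords that the study then maps onto the balanced scorecard perspectives.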