ARTICLE | doi:10.20944/preprints201808.0350.v2
Subject: Mathematics & Computer Science, Other Keywords: big data; clustering; data mining; educational data mining; e-learning; profile learning
Online: 19 October 2018 (05:58:05 CEST)
Educational data mining is an evolving discipline that focuses on improving self-learning and adaptive methods. It is used to find hidden patterns or intrinsic structures in educational data. In education, the data involved are heterogeneous and grow continuously, which places them in the realm of big data. Extracting meaningful information adaptively from big educational data requires specific data mining techniques. This paper presents a clustering approach that partitions students into groups, or clusters, based on their learning behavior. Furthermore, a personalized e-learning system architecture is presented that detects students’ learning capabilities and adapts teaching content accordingly. The primary objective is to discover optimal settings in which learners can improve their learning capabilities. Moreover, the administration can find essential hidden patterns that support effective reforms of the existing system. The clustering methods K-Means, K-Medoids, Density-based Spatial Clustering of Applications with Noise (DBSCAN), Agglomerative Hierarchical Cluster Tree and Clustering by Fast Search and Finding of Density Peaks via Heat Diffusion (CFSFDP-HD) are analyzed on educational data. It is observed that more robust results can be achieved by replacing the existing methods with CFSFDP-HD. Data mining techniques are equally effective for analyzing big data to strengthen education systems.
ARTICLE | doi:10.20944/preprints201801.0231.v1
Subject: Engineering, Other Keywords: Data mining; Association rules; Previous Cause; Type of Accident; Overexertion
Online: 24 January 2018 (19:40:52 CET)
Workplace accidents in the mining sector were analyzed by applying data mining techniques to the Spanish administration's database for the period 2005-2015. The data were processed with the Weka software. Two scenarios were drawn from the accident database: surface mining and underground mining. The most important variables involved in occupational accidents and their association rules were determined. Each rule combines several predictor variables that lead to an accident, defining its characteristics and context. This study presents the 20 most important association rules for the sector, in either surface or underground mining, ranked by the statistical confidence level Weka reports for each rule. The outcomes show the most typical immediate causes together with the percentage of accidents covered by each association rule. The most typical immediate cause is body movement with physical effort or overexertion, and the most typical type of accident is physical effort or overexertion. The second most important immediate cause and type of accident, however, differ between the two scenarios. Data mining techniques have proved to be a very powerful tool for finding the root causes of accidents, applying corrective measures and verifying their effectiveness, in both public and private companies.
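The confidence measure used to rank association rules can be illustrated with a small sketch. This is not the Weka workflow from the study; the records, attribute names and helper functions below are hypothetical and only show how support and confidence are computed for a rule such as "cause=overexertion implies type=overexertion".

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so its confidence is undefined; report 0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


# Hypothetical accident records (not from the Spanish database),
# each encoded as a set of attribute=value items.
records = [
    frozenset({"site=underground", "cause=overexertion", "type=overexertion"}),
    frozenset({"site=underground", "cause=overexertion", "type=overexertion"}),
    frozenset({"site=underground", "cause=fall", "type=impact"}),
    frozenset({"site=surface", "cause=overexertion", "type=overexertion"}),
]

rule_conf = confidence(records, {"cause=overexertion"}, {"type=overexertion"})
print(f"confidence = {rule_conf:.2f}")
```

A rule miner such as Apriori enumerates candidate rules and keeps those whose support and confidence exceed chosen thresholds; the two functions above are the measures it would evaluate for each candidate.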
ARTICLE | doi:10.20944/preprints201906.0144.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: data mining; network security; association rules; DDoS
Online: 16 June 2019 (02:42:59 CEST)
Typical modern information systems are required to process copious data. Conventional manual approaches can no longer analyze such massive amounts of data effectively, so smart techniques and tools are needed to complement human effort. Network security events now occur frequently and generate abundant log and alert files, and processing such vast quantities of data particularly requires smart techniques. This article examines the application of data mining algorithms in network security, a topic on which the academic community has recently produced numerous studies. It reviews several crucial developments in existing data mining algorithms, including those that compile alerts generated by heterogeneous IDSs into scenarios and employ various HMMs to detect complex network attacks. Moreover, sequential pattern mining algorithms are examined for developing multi-step intrusion detection. Future studies can focus on applying these algorithms in practical settings to effectively reduce the occurrence of false alerts.
ARTICLE | doi:10.20944/preprints201909.0040.v1
Subject: Social Sciences, Business And Administrative Sciences Keywords: data mining; security; association rule; ECLAT
Online: 4 September 2019 (03:48:58 CEST)
The purpose of this paper is to develop the WebSecuDMiner algorithm, which discovers unusual web access patterns by analysing the potential rules hidden in the web server log and user navigation history. Design/methodology/approach: WebSecuDMiner uses the equivalence class transformation (ECLAT) algorithm to extract user access patterns from the web log data; these patterns are used to characterise user access behaviour and to detect unusual behaviour. Data extracted from the web server log and user browsing behaviour are exploited to retrieve the web access patterns produced by the same user. Findings: WebSecuDMiner is used to detect whether any unauthorized access has occurred and to take appropriate decisions regarding the review of the original rights of a suspicious user. Research limitations/implications: The present work uses a database extracted from the web server log file and user browsing behaviour. A page may be viewed by the user without the visit being recorded in the server log file, since the page can be served from the browser's cache.
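To make the ECLAT step concrete, here is a minimal sketch of the generic algorithm, not of WebSecuDMiner itself. It assumes each user session is a set of requested paths; the session data, the function name `eclat` and the support threshold are illustrative assumptions. ECLAT stores, for every item, the set of transaction ids that contain it, and grows itemsets by intersecting those id sets.

```python
def eclat(transactions, min_support):
    """Return {itemset (tuple): support count} for all frequent itemsets."""
    # Vertical layout: item -> set of transaction ids that contain it.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that are frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Intersect with the tid-sets of later items to grow the itemset.
            longer = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    longer.append((other, shared))
            if longer:
                extend(itemset, longer)

    start = sorted(
        ((item, tids) for item, tids in tidlists.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent


# Hypothetical sessions reconstructed from a web server log.
sessions = [
    {"/login", "/home", "/cart"},
    {"/login", "/home"},
    {"/login", "/admin"},
    {"/home", "/cart"},
]
patterns = eclat(sessions, min_support=2)
for itemset, count in sorted(patterns.items()):
    print(itemset, count)
```

In this toy data `/admin` appears in only one session, so it never enters a frequent pattern; an access that falls outside the frequent patterns of a user is the kind of event an anomaly detector could then flag for review.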
ARTICLE | doi:10.20944/preprints202105.0102.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Market basket analysis; association rule mining; buying pattern; data mining
Online: 6 May 2021 (15:14:25 CEST)
Buyer behavior has changed as people learn to live with the new reality of COVID-19. Take-out and delivery orders have increased, and our client has added new items to its menu in response to new customer preferences. With all the ongoing changes, the client had many unanswered questions, for example: Are the most popular products still the same after COVID? Which product combinations sell best now? How well are new items being accepted? What are customers buying along with new items? How have alcohol sales changed? The client already had reports that tracked product sales and operational metrics; however, there was a need for deeper insight into product analysis. The client needed to identify which products and presentations were being sold more often, measure the acceptance of new products, and determine which products customers buy together in order to improve marketing campaigns, promotions, and sales. The e-business industry is growing immensely in the Indian market, and the cheap 4G internet packages in India clearly give a push to these businesses. When COVID-19 first hit, people became afraid to leave their homes and even hesitated to go out to buy essential (FMCG) products. Panic buying was also observed. To avoid the risk of COVID-19, people are giving preference to e-commerce sites for buying essential products, and some customers joined these sites for the first time to buy essential goods during the pandemic lockdown period. Many customers are shifting their buying behavior from offline retail stores to online stores. This paper examines the customer buying pattern during lockdown.
ARTICLE | doi:10.20944/preprints202012.0529.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: e-commerce; big data; bibliometric analysis; knowledge mapping
Online: 21 December 2020 (14:24:06 CET)
The e-commerce platform in the digital economy era has evolved into a data platform ecosystem built around data resources and data mining technology systems. The most typical applications of big data are also concentrated in the field of e-commerce. E-commerce companies should first grasp the interactive relationship among three major factors: data, technology and innovation. E-commerce platform operation is a multidisciplinary research field, and it is not easy for researchers to obtain a panoramic view of its knowledge structure. A knowledge graph is a kind of graph that shows the development process and structural relationships of knowledge, taking a field of knowledge as its object. It is not only a visual knowledge map but also a serialized knowledge pedigree, which gives researchers a quantitative method for studying development trends and academic status. The purpose of this research is to help researchers understand the key knowledge, evolutionary trends and research frontiers of current research. This study uses CiteSpace bibliometric analysis to analyze data from the Science Net database and finds that: 1) the development of the research field has gone through three stages, and some representative key scholars and key documents have been identified; 2) knowledge mapping of the literature, through the co-occurrence of citations and keywords, shows research hotspots; 3) the results of burst detection and central node analysis reveal research frontiers and development trends. Today, the visualization of big data brings different challenges. The abstraction between the world and today's data visualization occurs when the data is captured. Every user sees his or her own visualization data generated by standardized calculations. At the same time, there are still many controversies regarding the theoretical model, structure and structural dimensions. This is the direction that future researchers need to study further.
ARTICLE | doi:10.20944/preprints202111.0440.v1
Subject: Engineering, Control & Systems Engineering Keywords: time series; NMP algorithm; anomalies; data mining; similarities in time series; clustering
Online: 23 November 2021 (17:51:42 CET)
Time series data are significant and are derived from temporal data, which involve real numbers representing values collected regularly over time. Time series have a great impact on many types of data. However, time series contain anomalies. We introduce a hybrid algorithm named novel matrix profile (NMP) to solve the all-pairs similarity search problem for time series data. The proposed NMP inherits features from two state-of-the-art algorithms: scalable time series anytime matrix profile (STAMP) and scalable time series ordered-search matrix profile (STOMP). The proposed algorithm caches the output in an easy-to-access fashion for single- and multidimensional data. It can be used on large data sets and generates approximate solutions of high quality in a reasonable time. It can also handle several data mining tasks. It is implemented in Python. To determine its effectiveness, it is compared with the state-of-the-art matrix profile algorithms, i.e., STAMP and STOMP. The results confirm that the proposed NMP provides higher accuracy than the compared algorithms.
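For readers unfamiliar with the all-pairs similarity search problem, the following brute-force sketch shows what a matrix profile contains. It is a naive quadratic baseline, not NMP, STAMP or STOMP; the function names, the exclusion-zone choice of half the window length and the toy series are assumptions made for illustration.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]


def matrix_profile(series, m):
    """For each length-m subsequence, find the distance to and index of its
    nearest non-trivial neighbour (z-normalized Euclidean distance)."""
    n = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n)]
    excl = max(1, m // 2)  # skip trivial matches that overlap the query
    profile = [math.inf] * n
    index = [-1] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) < excl:
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < profile[i]:
                profile[i], index[i] = d, j
    return profile, index


# Toy periodic series with one spike (value 9) at position 10.
series = [0, 1, 2, 1, 0, 1, 2, 1, 0, 1, 9, 1, 0, 1, 2, 1]
profile, index = matrix_profile(series, m=4)
discord = max(range(len(profile)), key=profile.__getitem__)
print("most anomalous window starts at", discord)
```

Low profile values mark repeated patterns (motifs) and high values mark anomalies (discords). Matrix profile algorithms compute the same two arrays while avoiding the quadratic number of explicit distance computations done here.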
ARTICLE | doi:10.20944/preprints202105.0601.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Mobile RPG; Big Data; Text Mining; Topic Modeling
Online: 25 May 2021 (10:21:36 CEST)
Because RPGs generate high sales and profits, many developers have brought various RPGs to market. However, the genre has shifted toward mass-produced titles with sensational advertising, low quality, excessive charging and similar content, which harms the game market and users’ play experience. The author studied ways to improve mobile RPGs by crawling and analyzing user reviews from the Google Play Store. Topic modeling, a text mining technique, was applied using LDA (Latent Dirichlet Allocation) to extract meaningful information from the collected big data and to visualize it. Interpreting user reviews, identifying opinions objectively and seeking ways to improve games are helpful in developing mobile RPGs that users keep playing.
ARTICLE | doi:10.20944/preprints201809.0466.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: topological data analysis; text mining; computational topology; style; persistent homology
Online: 24 September 2018 (15:33:02 CEST)
Topological Data Analysis (TDA) refers to a collection of methods that find the structure of shapes in data. Although TDA methods have recently been used in many areas of data mining, they have not been widely applied to text mining tasks. In most text processing algorithms, the order in which different entities appear or co-appear is lost. Assuming that this lost ordering is an informative feature of the data, TDA may play a significant role in closing the resulting gap in the text processing state of the art. Once extracted, the topology of different entities throughout a textual document may reveal additional information about the document that is not reflected in the features used by traditional text processing methods. In this paper, we introduce a novel approach that employs TDA in text processing in order to capture and use the topology of same-type entities in textual documents. First, we show how to extract topological signatures from text using persistent homology, i.e., a TDA tool that captures the topological signature of a data cloud. Then we show how to use these signatures for text classification.
REVIEW | doi:10.20944/preprints201911.0338.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Indian; Sentiment Analysis; Indigenous Languages; Machine Learning; Deep learning; Data; Opinion Mining; Languages.
Online: 27 November 2019 (09:30:07 CET)
An increase in the use of smartphones has led to greater use of the internet and social media platforms. The most commonly used social media platforms are Twitter, Facebook, WhatsApp and Instagram. People share their personal experiences, reviews and feedback on the web. The information available on the web is unstructured and enormous. Hence, there is huge scope for research on understanding the sentiment of the data available on the web. Sentiment Analysis (SA) can be carried out on the reviews, feedback and discussions available on the web. Extensive research has been carried out on SA in the English language, but data on the web also contain many other languages that should be analyzed. This paper aims to analyze, review and discuss the approaches, algorithms and challenges faced by researchers carrying out SA on Indigenous languages.
ARTICLE | doi:10.20944/preprints201708.0055.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: EMR; data preprocessing; text mining; information extraction; medical decision support system
Online: 15 August 2017 (05:46:43 CEST)
At present, medical institutes generally use EMR to record patients' conditions, including diagnostic information, procedures performed and treatment results. EMR has been recognized as a valuable resource for large-scale analysis. However, EMR data are diverse, incomplete, redundant and privacy-sensitive, which makes it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and the data mining results. Different types of data require different processing technologies. Structured data usually need classic preprocessing technologies, including data cleansing, data integration, data transformation and data reduction. Semi-structured or unstructured data, such as medical text, contain more health information and require more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (Named Entity Recognition) and RE (Relation Extraction). In this paper, we introduce the process of EMR processing, including data collection, data preprocessing, data mining, evaluation and knowledge application; analyze the current status of key technologies such as data preprocessing and data mining; and provide an overview of the application domains and prospects of EMR mining technologies. Finally, we summarize the existing problems in EMR mining research and review the development trends.
ARTICLE | doi:10.20944/preprints202008.0074.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: data mining; cardiovascular diseases; cluster analysis; principal component analysis
Online: 4 August 2020 (03:56:19 CEST)
Cardiovascular disease is the number one cause of death in the world. According to the WHO, around 31% of deaths worldwide are caused by cardiovascular diseases, and more than 75% of these deaths occur in developing countries. Patients with cardiovascular disease generate many medical records that can be used for further patient management. This study aims to develop a data mining method that groups patients with cardiovascular disease into two clusters in order to determine the level of patient complications in each. The method applies principal component analysis (PCA) to reduce the dimensionality of the large data set, followed by cluster analysis using the K-Medoids algorithm. Data reduction with PCA yielded five new components with a cumulative proportion of variance of 0.8311. These five components were used for cluster formation with the K-Medoids algorithm, which produced two clusters with a silhouette coefficient of 0.35. The combination of PCA-based data reduction and K-Medoids clustering is a new way of grouping data on patients with cardiovascular disease according to the level of patient complications in each resulting cluster.
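The silhouette coefficient reported above ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. The sketch below shows how the mean silhouette is computed for a given labelling. It assumes Euclidean distance and at least two clusters; the points and labels are made up for illustration and do not come from the study.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Two well-separated groups in a hypothetical 2-D component space.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])
bad = silhouette(points, [0, 1, 0, 1])
print(f"good labelling: {good:.3f}, bad labelling: {bad:.3f}")
```

Against this scale, a value of 0.35 indicates clusters that are separated but with noticeable overlap, which is worth keeping in mind when interpreting the two patient groups.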
REVIEW | doi:10.20944/preprints202108.0345.v1
Subject: Arts & Humanities, Other Keywords: student academic performance; educational data mining; methods; algorithms; tools; higher education; overview
Online: 16 August 2021 (14:04:57 CEST)
This overview study set out to compare and synthesise the findings of review studies conducted on predicting student academic performance (SAP) in higher education using educational data mining (EDM) methods, EDM algorithms and EDM tools from 2013 to June 2020. It conducted multiple searches for suitable and relevant peer-reviewed articles on two online search engines, on nine online databases, and on two online academic social networks. It, then, selected 26 eligible articles from 2,050 articles. Some of the findings of this overview study are worth mentioning. First, only 2 studies explicitly stated their precise sample sizes with maths and science as the two most mentioned subject areas. Second, 16 review studies had purposes related to either EDM techniques, EDM methods, EDM models, or EDM algorithms employed to predict SAP and student success in the higher education sector. Third, there are six commonly used typologies of input variables reported by 26 review studies, of which student demographics was the most commonly utilised variable for predicting SAP. Fourth and last, seven common EDM algorithms employed for predicting SAP were identified, of which Decision Tree emerged both as the most used algorithm and as the algorithm with the highest prediction accuracy rate for predicting SAP.
ARTICLE | doi:10.20944/preprints201704.0117.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Body sensor network; Smart home, knowledge discovery in BSN data; frequent patterns; periodic patterns and productive pattern.
Online: 18 April 2017 (18:15:50 CEST)
Understanding the various health-oriented vital sign data generated by body sensor networks (BSN), and discovering the associations between the generated parameters, is an important task that may assist and promote important decision making in healthcare. For example, in a smart home scenario where the occupants’ health status is continuously monitored remotely, it is essential to provide the required assistance when an unusual or critical situation is detected in their vital sign data. In this paper, we present an efficient approach to mine the incomplete (partial) periodic patterns obtained from BSN data. In addition, we apply a correlation test to the generated patterns and introduce the productive-associated partial periodic frequent patterns as the set of correlated partial periodic frequent items. The combination of these measures empowers healthcare providers and patients with better quality of diagnosis, better treatment and smart care, especially for elderly people in smart homes. We developed an efficient algorithm named PPFP-Growth (Productive Periodic Frequent Pattern growth) to discover all productive associated partial periodic patterns using these measures. PPFP-Growth is efficient, and the productiveness measure removes uncorrelated periodic items. An experimental evaluation on synthetic and real datasets shows the efficiency of the proposed PPFP-Growth algorithm, which can filter a huge number of partial periodic patterns to reveal only the correlated ones.
ARTICLE | doi:10.20944/preprints202201.0445.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: data mining; predictive analytics; Internet of Things; peasant farming; smart farming system; crop production prediction
Online: 31 January 2022 (10:58:30 CET)
Internet of Things (IoT) technologies can greatly benefit from machine learning techniques and Artificial Neural Networks for data mining and vice versa. In the agricultural field, this convergence could result in the development of smart farming systems suitable for use as decision support systems by peasant farmers. This work presents the design of a smart farming system for crop production, which is based on low-cost IoT sensors and popular data storage services and data analytics services on the Cloud. Moreover, a new data mining method exploiting climate data along with crop production data is proposed for the prediction of production volume from heterogeneous data sources. This method was initially validated using traditional machine learning techniques and open historical data of the northeast region of the state of Puebla, Mexico, which were collected from data sources from the National Water Commission and the Agri-food Information Service of the Mexican Government.
ARTICLE | doi:10.20944/preprints201906.0202.v1
Subject: Engineering, Mechanical Engineering Keywords: Natural gas demands; Prediction; Energy market; Genetic algorithm; Artificial neural network; Data mining.
Online: 20 June 2019 (15:58:25 CEST)
Recently, the global natural gas (NG) market has attracted much attention because NG is cleaner than oil and, in most regions, cheaper than renewable energy sources. However, price fluctuations, environmental concerns, technological development, emerging unconventional resources, energy security challenges, and shipment are some of the forces that have made the NG market more dynamic and complex. From a policy-making perspective, it is vital to uncover future demand-side trends. This paper proposes an intelligent model to forecast global NG demand using a multi-dimensional, purified input vector. The model starts with a data mining (DM) step to purify the input features, identify the best time lags, and pre-process the selected input vector. Then a hybrid artificial neural network (ANN) equipped with a genetic optimizer is applied to set up the ANN’s characteristics. Among 13 available input features, six (Alternative and Nuclear Energy, CO2 Emissions, GDP per Capita, Urban Population, Natural Gas Production, Oil Consumption) were selected as the most critical via the DM step. The hybrid prediction model is then designed to extrapolate future consumption trends. The proposed model outperforms competing models according to several error-based evaluation statistics. In addition, because the model proposes the best input feature set, its results are compared with those of a model that uses the raw input set without the DM purification step.
ARTICLE | doi:10.20944/preprints201803.0021.v2
Subject: Earth Sciences, Geoinformatics Keywords: map processing; retrospective landscape analysis; visual data mining, image retrieval, low-level image descriptors, color moments, t-distributed stochastic neighborhood embedding, USGS topographic maps, Sanborn fire insurance maps
Online: 17 April 2018 (09:23:37 CEST)
Historical maps constitute unique sources of retrospective geographic information. Recently, several map archives containing map series covering large spatial and temporal extents have been systematically scanned and made available to the public. The geographic information contained in such data archives allows extending geospatial analysis retrospectively beyond the era of digital cartography. However, given the large data volumes of such archives and the low graphical quality of older map sheets, the processes to extract geographic information need to be automated to the highest degree possible. In order to understand the salient characteristics, data quality variation, and potential challenges in large-scale information extraction tasks, preparatory analytical steps are required to efficiently assess spatio-temporal coverage, approximate map content, and spatial accuracy of such georeferenced map archives across different cartographic scales. Such preparatory steps are often neglected or ignored in the map processing literature but represent highly critical phases that lay the foundation for any subsequent computational analysis and recognition. In this contribution we demonstrate how such preparatory analyses can be conducted using classical analytical and cartographic techniques as well as visual-analytical data mining tools originating from machine learning and data science, exemplified for the United States Geological Survey topographic map and Sanborn fire insurance map archives.
ARTICLE | doi:10.20944/preprints201608.0202.v2
Subject: Earth Sciences, Environmental Sciences Keywords: HR satellite remote sensing; urban fabric vulnerability; UHI & heat waves; landsat & MODIS sensors; LST & urban heating; segmentation & objects classification; data mining; feature extraction & selection; stepwise regression & model calibration
Online: 26 October 2021 (13:11:23 CEST)
Densely urbanized areas with a low percentage of green vegetation are highly exposed to Heat Waves (HW), which are increasing in frequency and intensity, including in mid-latitude regions, due to ongoing Climate Change (CC). Their negative effects may combine with those of the Urban Heat Island (UHI), a local phenomenon in which air temperatures in the compact built-up cores of towns rise more than those in the surrounding rural areas, with significant impact on the quality of the urban environment, citizens' health, energy consumption and transport, as occurred in the summer of 2003 in France and central-northern Italy. In this context, this work aims to design and develop a methodology based on aerospace remote sensing (Earth Observation, EO) at medium-high resolution and recent GIS techniques for the extensive characterization of the urban fabric's response to these temperature-related climatic impacts, within the general framework of supporting local and national strategies and policies for adaptation to CC. Owing to its extent and variety of built-up typologies, the municipality of Rome was selected as the test area for developing and validating the methodology. We started with photointerpretation of detailed-scale cartography (CTR 1:5000) over a reference area consisting of a transect of about 5x20 km, extending from the downtown to the suburbs and including all the built-up classes of interest. The reference built-up vulnerability classes found inside the transect were then used as training areas to classify the entire territory of the Rome municipality. To this end, EO HR (High Resolution) multispectral satellite data from the Landsat sensors were used within a purpose-built "supervised" classification procedure based on data mining and "object-classification" techniques. The classification results were then used to implement a calibration method, based on a typical UHI temperature distribution derived from MODIS satellite sensor LST (Land Surface Temperature) data for the summer of 2003, in order to obtain an analytical expression of the vulnerability model previously introduced on a semi-empirical basis.
REVIEW | doi:10.20944/preprints202003.0141.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: data sharing; data management; data science; big data; healthcare
Online: 8 March 2020 (16:46:20 CET)
In recent years, more and more health data are being generated. These data come not only from professional health systems, but also from wearable devices. All these data combined form ‘big data’ that can be utilized to optimize treatments for each unique patient (‘precision medicine’). To achieve this precision medicine, it is necessary that hospitals, academia and industry work together to bridge the ‘valley of death’ of translational medicine. However, hospitals and academia often have problems with sharing their data, even though the patient is actually the owner of his/her own health data, and the sharing of data is associated with increased citation rate. Academic hospitals usually invest a lot of time in setting up clinical trials and collecting data, and want to be the first ones to publish papers on this data. The idea that society benefits the most if the patient’s data are shared as soon as possible so that other researchers can work with it, has not taken root yet. There are some publicly available datasets, but these are usually only shared after studies are finished and/or publications have been written based on the data, which means a severe delay of months or even years before others can use the data for analysis. One solution is to incentivize the hospitals to share their data with (other) academic institutes and the industry. Here we discuss several aspects of data sharing in the medical domain: publisher requirements, data ownership, support for data sharing, data sharing initiatives and how the use of federated data might be a solution. We also discuss some potential future developments around data sharing.
ARTICLE | doi:10.20944/preprints202206.0320.v3
Subject: Life Sciences, Other Keywords: data; reproducibility; FAIR; data reuse; public data; big data; analysis
Online: 23 September 2022 (03:16:07 CEST)
With an increasing amount of "omics" data available publicly, there is a need for a guide on how to successfully download and use this data. The 10 simple rules for using public data are: 1) use public data in your research, 2) evaluate data for your use case, 3) check data reuse requirements and embargoes, 4) be aware of ethics for data reuse, 5) plan for data storage and compute requirements, 6) know what you are downloading, 7) download programmatically and verify integrity, 8) properly cite data, 9) make data FAIR and share, and 10) make pipelines and code FAIR and share. These rules are intended as a guide for researchers wanting to make use of available data and to increase data reuse and reproducibility.
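Rule 7 (download programmatically and verify integrity) can be illustrated with a short sketch. It assumes the data provider publishes a SHA-256 checksum next to each file; the function names and the file name in the usage note are hypothetical, and some repositories publish MD5 checksums instead, in which case `hashlib.md5` would be the swap-in.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks so
    that large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's digest matches the published checksum."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A typical use would be `verify("counts_matrix.tsv.gz", published_checksum)` right after the download step of a pipeline, failing loudly on a mismatch rather than passing a truncated or corrupted file on to the analysis.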
ARTICLE | doi:10.20944/preprints202003.0268.v1
Subject: Social Sciences, Library & Information Science Keywords: matching; data marketplace; data platform; data visualization; call for data
Online: 17 March 2020 (04:10:28 CET)
Improvements in web platforms for data exchange and trading are creating more opportunities for users to obtain data from data providers of different domains. However, the current data exchange platforms are limited to unilateral information provision from data providers to users. In contrast, there are insufficient means for data providers to learn what kinds of data users desire and for what purposes. In this paper, we propose and discuss the description items for sharing users’ call for data as data requests in the data marketplace. We also discuss structural differences in data requests and providable data using variables, as well as possibilities of data matching. In the study, we developed an interactive platform, treasuring every encounter of data affairs (TEEDA), to facilitate matching and interactions between data providers and users. The basic features of TEEDA are described in this paper. From experiments, we found the same distributions of the frequency of variables but different distributions of the number of variables in each piece of data, which are important factors to consider in the discussion of data matching in the data marketplace.
REVIEW | doi:10.20944/preprints202007.0153.v1
Online: 8 July 2020 (11:53:33 CEST)
Large datasets that enable researchers to perform investigations with unprecedented rigor are growing increasingly common in neuroimaging. Due to the simultaneous increasing popularity of open science, these state-of-the-art datasets are more accessible than ever to researchers around the world. While analysis of these samples has pushed the field forward, they pose a new set of challenges that might cause difficulties for novice users. Here, we offer practical tips for working with large datasets from the end-user’s perspective. We cover all aspects of the data life cycle: from what to consider when downloading and storing the data, to tips on how to become acquainted with a dataset one did not collect, to what to share when communicating results. This manuscript serves as a practical guide one can use when working with large neuroimaging datasets, thus dissolving barriers to scientific discovery.
ARTICLE | doi:10.20944/preprints201810.0273.v1
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: astroparticle physics, cosmic rays, data life cycle management, data curation, meta data, big data, deep learning, open data
Online: 12 October 2018 (14:48:32 CEST)
Modern experimental astroparticle physics features large-scale setups measuring different messengers, namely high-energy particles generated by cosmic accelerators (e.g. supernova remnants and active galactic nuclei): cosmic and gamma rays, neutrinos and the recently discovered gravitational waves. Ongoing and future experiments are distributed over the Earth, including ground, underground and underwater setups as well as balloon payloads and spacecraft. The data acquired by these experiments have different formats, storage concepts and publication policies. Such differences are a crucial issue in the era of big data and of multi-messenger analysis strategies in astroparticle physics. We propose a service, ASTROPARTICLE.ONLINE, within which we develop an open science system that makes it possible to publish, store, search, select and analyse astroparticle physics data. The cosmic-ray experiments KASCADE-Grande and TAIGA were chosen as pilot experiments to be included in this framework. In the first step of our initiative we will develop and test the following components of the full data life cycle concept: (i) describing, storing and reusing astroparticle data; (ii) software for performing multi-experiment and multi-messenger analyses, such as deep-learning methods; (iii) outreach, including example applications and tutorials for students and scientists outside the specific research field. In the present paper we describe the concepts of our initiative, and in particular the plans toward a common, federated astroparticle data storage.
ARTICLE | doi:10.20944/preprints202105.0589.v1
Subject: Engineering, Automotive Engineering Keywords: Game Ratings; Public Data; Game Data; Data analysis; GRAC(Korea)
Online: 25 May 2021 (08:32:32 CEST)
As of 2020, the public data on game ratings provided by the Game Ratings And Administration Committee (GRAC) are more limited than the public data on movie and video ratings provided by the Korea Media Ratings Board, and they do not allow rating information to be viewed clearly and in detail. To get information on game ratings, users have to search for a specific target on the homepage, which is inconvenient. To remove this inconvenience and extend the scope of the public data provided, this paper studies a public data API that is extended based on the information available for video ratings. To derive the items to be extended, the study analyzes the rating data on the GRAC homepage and designs a collection system to build a database. It then aims to implement a system that provides the collected data, based on the extended public data items, in the form users want. The study is expected to provide GRAC with rating information that strengthens fairness, satisfies the right of game users and the public to know, and contributes to the promotion and development of the game industry.
ARTICLE | doi:10.20944/preprints202007.0078.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: personalization; decision making; medical data; artificial intelligence; Data-driving; Big Data; Data Mining; Machine Learning
Online: 5 July 2020 (15:04:17 CEST)
This study applies machine learning and data mining methods to personalize treatment, which makes it possible to investigate individual patient characteristics. Personalization is built on clustering and association rules. It is suggested to determine the average distance between instances in order to find optimal performance metrics. A formalization of the medical data pre-processing stage for finding personalized solutions based on current standards and pharmaceutical protocols is proposed, and a model of patient data is built. The paper presents a novel approach to clustering built on an ensemble of clustering algorithms that achieves better Hopkins metrics than the k-means algorithm. Personalized treatment is usually based on decision trees. Such an approach requires a lot of computation time and cannot be parallelized. Therefore, it is proposed to classify persons by condition and to determine the deviations of their parameters from the normative parameters of the group, as well as from the average parameters. This made it possible to create a personalized approach to treatment for each patient based on long-term monitoring. Based on the results of the analysis, it becomes possible to predict the optimal conditions for a particular patient and to find a medication treatment that matches the patient's personal characteristics.
ARTICLE | doi:10.20944/preprints202103.0593.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Business Intelligence; Data Mining; Data Warehouse.
Online: 24 March 2021 (13:47:31 CET)
In the coming years, the number of digital applications and services that continue to use the country's native cloud systems will be huge. According to IDC, it will exceed 500 million by 2023, which corresponds to the sum of all applications developed in the last 40 years. If you are the one you answered, yes! This article is for you!
ARTICLE | doi:10.20944/preprints202012.0468.v1
Online: 18 December 2020 (13:29:38 CET)
This manuscript describes the construction and validation of high-resolution daily gridded (0.05° × 0.05°) rainfall and maximum and minimum temperature data for Bangladesh: the Enhancing National Climate Services for Bangladesh Meteorological Department (ENACTS-BMD) dataset. The dataset was generated by merging data from weather stations, satellite products (for rainfall) and reanalysis (for temperature). ENACTS-BMD is the first high-resolution gridded surface meteorological dataset developed specifically for studies of surface climate processes in Bangladesh. Its record begins in January 1981 and is updated in real time every month, and outputs have daily, decadal and monthly time resolution. The Climate Data Tools (CDT), developed by the International Research Institute for Climate and Society (IRI), Columbia University, are used to generate the dataset. The data processing includes the collection of weather and gridded data, quality control of station data, downscaling of the reanalysis for temperature, bias correction of both satellite rainfall and downscaled reanalysis temperature, and the combination of station and bias-corrected gridded data. The ENACTS-BMD dataset is available as an open-access product on BMD's official website. It enhances the provision of services, helps overcome the challenges of data quality, availability, and access, and at the same time promotes engagement and use by stakeholders.
CASE REPORT | doi:10.20944/preprints201801.0066.v1
Online: 8 January 2018 (11:11:47 CET)
The implementation of the European Cohesion Policy, which aims at fostering regional competitiveness, economic growth and the creation of new jobs, is documented over the period 2014–2020 in the publicly available Open Data Portal for the European Structural and Investment Funds. On the basis of this source, this paper describes the process of data mining and visualization used to produce information on the performance of regional programmes in achieving effective expenditure of resources.
ARTICLE | doi:10.20944/preprints202205.0344.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Linked (open) Data; Semantic Interoperability; Data Mapping; Governmental Data; SPARQL; Ontologies
Online: 25 May 2022 (08:18:46 CEST)
In this paper, we present a method to map information regarding service activity provision residing in governmental portals across the European Commission. To do this, we used as a basis the enriched Greek e-GIF ontology, modeling concepts and relations in one of the two data portals (i.e., Points of Single Contact) examined, since relevant information on the second was not provided. The mapping consisted of transforming the information appearing in governmental portals into RDF format (i.e., as Linked Data) so that it can be easily exchanged. The mapping proved to be a tedious task, since a description of how information is modeled in the second Point of Single Contact is not provided and must be extracted manually.
ARTICLE | doi:10.20944/preprints202111.0073.v1
Subject: Medicine & Pharmacology, Other Keywords: data quality; OMOP CDM; EHDEN; healthcare data; real world data; RWD
Online: 3 November 2021 (09:12:54 CET)
Background: Observational health data has the potential to be a rich resource to inform clinical practice and regulatory decision making. However, the lack of standard data quality processes makes it difficult to know if these data are research ready. The EHDEN COVID-19 Rapid Collaboration Call presented the opportunity to assess how the newly developed open-source tool Data Quality Dashboard (DQD) informs the quality of data in a federated network. Methods: 15 Data Partners (DPs) from 10 different countries worked with the EHDEN taskforce to map their data to the OMOP CDM. Throughout the process at least two DQD results were collected and compared for each DP. Results: All DPs showed an improvement in their data quality between the first and last run of the DQD. The DQD excelled at helping DPs identify and fix conformance issues but showed less of an impact on completeness and plausibility checks. Conclusions: This is the first study to apply the DQD on multiple, disparate databases across a network. While study-specific checks should still be run, we recommend that all data holders converting their data to the OMOP CDM use the DQD as it ensures conformance to the model specifications and that a database meets a baseline level of completeness and plausibility for use in research.
ARTICLE | doi:10.20944/preprints202110.0103.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Data Analytics; Analytics; Supply Chain Input; Supply Chain; Data Science; Data
Online: 6 October 2021 (10:38:42 CEST)
One of the most remarkable features of the 20th century was the digitalization of technical progress, which changed the output of companies worldwide and became a defining feature of the century. The growth of information technology systems and the implementation of new technical advances, which enhance the integrity, agility and long-term organizational performance of the supply chain, distinguish a digital supply chain from other supply chains. For example, Internet of Things (IoT)-enabled information exchange and Big Data analysis might be used to regulate the mismatch between supply and demand. This literature study was undertaken to assess contemporary ideas and concepts in the field of data analysis in the context of supply chain management. The research was conducted in the form of a comprehensive literature review. In the SLR investigation, a total of 71 papers from leading journals were used. The SLR found that integrating data analytics into supply chain management can have long-term benefits for supply chain management from the input side, i.e., improved strategic development, management and other areas.
Subject: Engineering, Automotive Engineering Keywords: Business Intelligence; Data warehouse; Data Marts; Architecture; Data; Information; cloud; Data Mining; evolution; technologic companies; tools; software
Online: 24 March 2021 (13:06:53 CET)
Information has been and will be a vital element for any person or department in an organization. That is why there are technologies that help manage data properly; Business Intelligence is responsible for providing technological solutions that correctly and effectively manage the entire volume of necessary and important information for companies. Among the solutions offered by Business Intelligence are Data Warehouses and Data Mining, among other business technologies that, working together, achieve the objectives set by an organization. It is important to highlight that these business technologies have been present since the 1950s and have evolved over time, improving processes, infrastructure and methodologies and implementing new technologies, which has helped to correct past mistakes in information management for companies. There are open questions about Business Intelligence: could it be that in the not-too-distant future it will be used as an essential standard or norm for data management in any organization, since it provides many benefits and avoids failures when classifying information? On the other hand, cloud storage has been the best alternative for safeguarding information without depending on physical storage media, which are not 100% secure and are exposed to partial or total loss of information through hardware failures or security failures caused by mishandling of the information.
ARTICLE | doi:10.20944/preprints202111.0410.v1
Subject: Engineering, Other Keywords: Data compression; data hiding; psnr; mse; virtual data; public cloud; quantization error
Online: 22 November 2021 (15:17:12 CET)
Nowadays, information security is a challenge, especially when information is transmitted or shared in public clouds. Many researchers have proposed techniques that fail to provide data integrity, security, authentication, and other properties related to sensitive data. The most common techniques used to protect data during transmission on a public cloud are cryptography, steganography, and compression. The proposed scheme is an entirely new approach to data security on a public cloud: it makes secret data completely invisible behind a carrier object, so that it is not detected by image performance parameters such as PSNR, MSE, entropy and others. The results are explained in detail in the results section of the paper. The proposed technique gives better outcomes than any other existing technique as a security mechanism on a public cloud. The primary focus of the suggested approach is to minimize the integrity loss of publicly stored data caused by unrestricted access rights of users. Improving the reusability of the carrier even after data has been concealed is a challenging task, and it is achieved through the suggested approach.
REVIEW | doi:10.20944/preprints201807.0059.v1
Subject: Life Sciences, Biophysics Keywords: data normalization; data scaling; zero-sum; metabolic fingerprinting; NMR; statistical data analysis
Online: 3 July 2018 (16:22:31 CEST)
The aim of this article is to summarize recent bioinformatic and statistical developments applicable to NMR-based metabolomics. Extracting relevant information from large multivariate datasets by statistical data analysis strategies may be of considerable complexity. Typical tasks comprise, for example, classification of specimens, identification of differentially produced metabolites, and estimation of fold changes. In this context it is of prime importance to minimize contributions from unwanted biases and experimental variance prior to these analyses. This is the goal of data normalization. Therefore, special emphasis is given to different data normalization strategies. In the first part, we will discuss the requirements and the pros and cons of a variety of commonly applied strategies. In the second part, we will concentrate on possible solutions for cases in which the requirements of the standard strategies are not fulfilled. In the last part, very recent developments will be discussed that allow reliable estimation of metabolic signatures for sample classification without prior data normalization. In this contribution, special emphasis will be given to techniques that have worked well in our hands.
Subject: Social Sciences, Econometrics & Statistics Keywords: poverty; composite indicators; interval data; symbolic data
Online: 24 August 2021 (15:46:09 CEST)
The analysis and measurement of poverty is a crucial issue in the field of social science. Poverty is a multidimensional notion that can be measured using composite indicators relevant to synthesizing statistical indicators. Subjective choices could, however, affect these indicators. We propose interval-based composite indicators to avoid the problem, enabling us in this context to obtain robust and reliable measures. Based on a relevant conceptual model of poverty we have identified, we will consider all the various factors identified. Then, considering a different random configuration of the various factors, we will compute a different composite indicator. We can obtain a different interval for each region based on the distinct factor choices on the different assumptions for constructing the composite indicator. So we will create an interval-based composite indicator based on the results obtained by the Monte-Carlo simulation of all the different assumptions. The different intervals can be compared, and various rankings for poverty can be obtained. For their parameters, such as center, minimum, maximum, and range, the poverty interval composite indicator can be considered and compared. The results demonstrate a relevant and consistent measurement of the indicator and the shadow sector's relevant impact on the final measures.
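The interval construction described above can be sketched as follows. This is only an illustration under stated assumptions: the composite is taken to be a weighted mean of normalized indicators, and the "different assumptions" are modelled as random weight vectors; the abstract does not specify either choice, and the function name and input values are hypothetical.

```python
import random


def interval_composite(indicators, n_draws=1000, seed=0):
    """Monte Carlo interval for a weighted-mean composite indicator.

    indicators: normalized indicator values (0..1) for one region.
    Each draw samples random weights that sum to 1 (an assumption of
    this sketch, not a detail taken from the paper).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        raw = [rng.random() for _ in indicators]
        total = sum(raw)
        weights = [w / total for w in raw]
        scores.append(sum(w * x for w, x in zip(weights, indicators)))
    low, high = min(scores), max(scores)
    return {
        "min": low,
        "max": high,
        "center": (low + high) / 2,
        "range": high - low,
    }


# Hypothetical normalized indicators for two regions.
region_a = interval_composite([0.2, 0.5, 0.9])
region_b = interval_composite([0.4, 0.45, 0.5])
print(region_a["range"] > region_b["range"])
```

Comparing the center ranks regions by typical level, while the range shows how sensitive a region's score is to the subjective weighting choices.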
Online: 30 October 2020 (15:35:00 CET)
In today's information age, data are becoming more and more important. While other industries achieve tangible improvements by applying cutting-edge information technology, the construction industry still lags behind. Cost, schedule, and performance control are three major functions in the project execution phase. Along with their individual importance, cost-schedule integration has been a significant challenge in the construction industry over the past five decades. Although a lot of effort has been put into this development, there is no method used in construction practice. The purpose of this study is to propose a new method to integrate cost and schedule data using big data technology. The proposed algorithm is designed to provide data integrity and flexibility in the integration process, a considerable reduction in the time needed to build and change the database, and practical use on a construction site. It is expected that the proposed method can make information management, which field engineers currently regard as a troublesome task, more data-friendly.
ARTICLE | doi:10.20944/preprints201701.0090.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: transportation data; data interlinking; automatic schema matching
Online: 20 January 2017 (03:38:06 CET)
Multimodality requires the integration of heterogeneous transportation data to construct a broad view of the transportation network. Many new transportation services are emerging in isolation from previously existing networks. This leads them to publish their data sources on the web, according to Linked Data principles, in order to gain visibility. Our interest is to use these data to construct an extended transportation network that links these new services to existing ones. The main problems we tackle in this article fall into the categories of automatic schema matching and data interlinking. We propose an approach that uses web services as mediators to help automatically detect geospatial properties and map them between two different schemas. In addition, we propose a new interlinking approach that enables users to define rich semantic links between datasets in a flexible and customizable way.
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Academic Analytics; data storage; education and big data; analysis of data; learning analytics
Online: 19 July 2020 (20:37:39 CEST)
Business Intelligence, defined by  as "the ability to understand the interrelations of the facts that are presented in such a way that it can guide the action towards achieving a desired goal", has been used since 1958 for the transformation of data into information, and of information into knowledge, to be used when making decisions in a business environment. But, what would happen if we took the same principles of business intelligence and applied them to the academic environment? The answer would be the creation of Academic Analytics, a term defined by  as the process of evaluating and analyzing organizational information from university systems for reporting and making decisions, whose characteristics allow it to be used more and more in institutions, since the information they accumulate about their students and teachers gathers data such as academic performance, student success, persistence, and retention . Academic Analytics enables an analysis of data that is very important for making decisions in the educational institutional environment, aggregating valuable information in the academic research activity and providing easy to use business intelligence tools. This article shows a proposal for creating an information system based on Academic Analytics, using ASP.Net technology and trusting storage in the database engine Microsoft SQL Server, designing a model that is supported by Academic Analytics for the collection and analysis of data from the information systems of educational institutions. The idea that was conceived proposes a system that is capable of displaying statistics on the historical data of students and teachers taken over academic periods, without having direct access to institutional databases, with the purpose of gathering the information that the director, the teacher, and finally the student need for making decisions. 
The model was validated with information taken from students and teachers during the last five years, and the export format of the data was pdf, csv, and xls files. The findings allow us to state that it is extremely important to analyze the data that is in the information systems of the educational institutions for making decisions. After the validation of the model, it was established that it is a must for students to know the reports of their academic performance in order to carry out a process of self-evaluation, as well as for teachers to be able to see the results of the data obtained in order to carry out processes of self-evaluation, and adaptation of content and dynamics in the classrooms, and finally for the head of the program to make decisions.
ARTICLE | doi:10.20944/preprints201812.0071.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: data governance; data sovereignty; urban data spaces; ICT reference architecture; open urban platform
Online: 6 December 2018 (05:09:54 CET)
This paper presents the results of a recent study that was conducted with a number of German municipalities/cities. Based on the recommendations that emerged from the study, which are briefly presented, the authors propose the concept of an Urban Data Space (UDS), which facilitates an ecosystem for data exchange and added-value creation by utilizing the various types of data within a smart city/municipality. Looking at an Urban Data Space from within a German context and considering the current situation and developments in German municipalities, this paper proposes a reasonable classification of urban data that makes it possible to relate the various data types to legal aspects and to make well-founded technical implementation designs and decisions. Furthermore, the Urban Data Space is described and analyzed in detail, relevant stakeholders are identified, and the corresponding technical artifacts are introduced. The authors propose to set up Urban Data Spaces based on emerging standards from the area of ICT reference architectures for Smart Cities, such as DIN SPEC 91357 “Open Urban Platform” and EIP SCC. The paper walks the reader through the construction of a UDS based on the above-mentioned architectures and outlines the goals, recommendations and potentials that an Urban Data Space can offer a municipality/city.
ARTICLE | doi:10.20944/preprints202110.0260.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: big data; data acquisition; data visualization; data exchange; dashboard; frequency stability; Grafana lab; Power Quality; GPS reference; frequency measurement.
Online: 18 October 2021 (18:07:43 CEST)
This article proposes a measurement solution designed to monitor instantaneous frequency in power systems. It uses a data acquisition module and a GPS receiver for time stamping. A program in Python takes care of receiving the data, calculating the frequency, and finally transferring the measurement results to a database. The frequency is calculated with two different methods, which are compared in the article. The stored data is visualized using the Grafana platform, thus demonstrating its potential for comparing scientific data. The system as a whole constitutes an efficient low cost solution as a data acquisition system.
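The abstract does not say which two frequency estimation methods are compared. As an illustration only, the sketch below shows one common approach, counting rising zero crossings with linear interpolation, applied to a synthetic signal; the sample rate, test frequency and function name are hypothetical.

```python
import math


def zero_crossing_frequency(samples, sample_rate):
    """Estimate frequency from rising zero crossings, linearly interpolated."""
    crossings = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0 <= cur:
            # Fractional sample index where the segment crosses zero.
            crossings.append(i - 1 + (-prev) / (cur - prev))
    if len(crossings) < 2:
        raise ValueError("need at least two rising zero crossings")
    periods = len(crossings) - 1
    duration = (crossings[-1] - crossings[0]) / sample_rate
    return periods / duration


# Synthetic one-second test signal (hypothetical values).
sample_rate = 10_000  # Hz
true_frequency = 50.02  # Hz
signal = [
    math.sin(2 * math.pi * true_frequency * n / sample_rate + 0.3)
    for n in range(sample_rate)
]
print(round(zero_crossing_frequency(signal, sample_rate), 2))
```

Measuring between the first and last crossing, rather than averaging individual periods, reduces the effect of per-crossing interpolation error; real grid signals with harmonics or noise would usually need filtering first.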
DATA DESCRIPTOR | doi:10.20944/preprints202109.0370.v1
Subject: Engineering, Energy & Fuel Technology Keywords: smart meter data; household survey; EPC; energy data; energy demand; energy consumption; longitudinal; energy modelling; electricity data; gas data
Online: 22 September 2021 (10:16:05 CEST)
The Smart Energy Research Lab (SERL) Observatory dataset described here comprises half-hourly and daily electricity and gas data, SERL survey data, Energy Performance Certificate (EPC) input data and 24 local hourly climate reanalysis variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) for over 13,000 households in Great Britain (GB). Participants were recruited in September 2019, September 2020 and January 2021 and their smart meter data are collected from up to one year prior to sign up. Data collection will continue until at least August 2022, and longer if funding allows. Survey data relating to the dwelling, appliances, household demographics and attitudes was collected at sign up. Data are linked at the household level and UK-based academic researchers can apply for access within a secure virtual environment for research projects in the public interest. This is a data descriptor paper describing how the data was collected, the variables available and the representativeness of the sample compared to national estimates. It is intended as a guide for researchers working with or considering using the SERL Observatory dataset, or simply looking to learn more about it.
ARTICLE | doi:10.20944/preprints201807.0038.v1
Online: 3 July 2018 (11:25:13 CEST)
The rich emission and absorption line spectra of Fe I may be used to extract crucial information on astrophysical plasmas, such as stellar metallicities. There is currently a lack, in quality and quantity, of accurate level-resolved effective electron-impact collision strengths and oscillator strengths for radiative transitions. Here, we discuss the challenges in obtaining a sufficiently good structure for neutral iron and compare our theoretical fine-structure energy levels with observation for several increasingly large models. Radiative data is presented for several transitions for which the atomic data is accurately known.
ARTICLE | doi:10.20944/preprints202206.0335.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: metadata; contextual data; harmonization; genomic surveillance; data management
Online: 24 June 2022 (08:46:04 CEST)
ARTICLE | doi:10.20944/preprints202108.0471.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Big data; Health prevention; Machine learning; Medical data
Online: 24 August 2021 (14:00:12 CEST)
Cardiovascular diseases (CVDs) are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. Since effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be effectively and reliably used to distinguish patients suffering from a CVD from those who do not suffer from any heart condition. Namely, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risk factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on the top-two and top-four dataset attributes/features with respect to five performance metrics (accuracy, precision, recall, F1-score, and ROC-AUC) using the train-test split technique and k-fold cross-validation. Our study identifies the top two and top four attributes from each CVD diagnosis/prediction dataset. As our main finding, the ten MLAs exhibited appropriate diagnostic and predictive performance; hence, they can be successfully implemented to improve current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff is lacking.
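The evaluation protocol mentioned above (k-fold cross-validation with accuracy, precision, recall and F1-score) can be sketched in plain Python. The fold layout, the toy labels and the single-feature threshold rule below are hypothetical and stand in for the ten algorithms and four datasets of the study; ROC-AUC is omitted for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical single feature and labels; the "model" is a fixed threshold.
feature = [0.1, 0.8, 0.4, 0.9, 0.3, 0.7, 0.2, 0.6, 0.95, 0.05]
labels = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0]
for train, test in k_fold_indices(len(labels), k=5):
    predicted = [1 if feature[i] > 0.5 else 0 for i in test]
    print(binary_metrics([labels[i] for i in test], predicted))
```

In a real study the model would be refit on the `train` indices in every fold; the fixed threshold here only keeps the sketch short.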
ARTICLE | doi:10.20944/preprints202106.0738.v1
Subject: Earth Sciences, Atmospheric Science Keywords: time series; homogenization; ACMANT; observed data; data accuracy
Online: 30 June 2021 (13:08:39 CEST)
The removal of non-climatic biases, so-called inhomogeneities, from long climatic records requires sophisticated statistical methods. One principle is that the differences between a candidate series and its neighbour series are usually analysed rather than the candidate series itself, in order to neutralize the possible impact of regionally common natural climate variation on the detection of inhomogeneities. In most homogenization methods, two main kinds of time series comparison are applied: composite reference series or pairwise comparisons. In composite reference series, the inhomogeneities of neighbour series are attenuated by averaging the individual series, and the accuracy of homogenization can be improved by iteratively improving the composite reference series. By contrast, pairwise comparisons have the advantage that coincidental inhomogeneities affecting several station series in a similar way can be identified with higher certainty than with composite reference series. In addition, homogenization with pairwise comparisons tends to facilitate the most accurate regional trend estimations. A new time series comparison method is presented here, which combines the use of pairwise comparisons and composite reference series in a way that unifies their advantages. This time series comparison method is embedded into the ACMANT homogenization method and tested on large, commonly available monthly temperature test datasets.
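The idea of analysing a candidate-minus-reference difference series can be illustrated with a minimal sketch. It assumes an equally weighted composite reference and synthetic values; it is not the ACMANT procedure, and all names and numbers are hypothetical.

```python
def composite_reference(neighbours):
    """Average the neighbour series time step by time step (equal weights)."""
    return [sum(values) / len(values) for values in zip(*neighbours)]


def difference_series(candidate, reference):
    """Candidate minus reference; shared regional variation cancels out."""
    return [c - r for c, r in zip(candidate, reference)]


# Synthetic regional signal shared by all stations (hypothetical values).
regional = [0.0, 0.5, -0.2, 0.3, 0.1, -0.4]
neighbours = [
    [x + 10.0 for x in regional],
    [x + 12.0 for x in regional],
]
# The candidate has an artificial +1.0 shift from the fourth value onward.
candidate = [x + 11.0 + (1.0 if i >= 3 else 0.0) for i, x in enumerate(regional)]

reference = composite_reference(neighbours)
print([round(d, 2) for d in difference_series(candidate, reference)])
```

In the difference series the regional variation is gone and only the step remains, which is why break detection is usually run on such differences rather than on the raw candidate series.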
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: GAN; ECG; anonymization; healthcare data; sensors; data transformation
Online: 3 September 2020 (05:26:01 CEST)
In personalized healthcare, an ecosystem for the manipulation of reliable and safe private data should be orchestrated. This paper describes a first approach for the generation of fake electrocardiograms (ECGs) based on Generative Adversarial Networks (GANs), with the objective of anonymizing users’ information to protect privacy. This is intended to create valuable data that can be used in both educational and research areas, while avoiding the risk of sensitive data leakage. As GANs are mainly exploited on images and video frames, we propose general raw data processing after transformation into an image, so that the data can be managed through a GAN and then decoded back to the original data domain. The feasibility of our transformation and processing hypothesis is demonstrated first. Next, the main drawbacks of each step in the proposed procedure are addressed for the particular case of ECGs. Hence, a novel research pathway on health data anonymization using GANs is opened, and further straightforward developments are expected.
ARTICLE | doi:10.20944/preprints201806.0419.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: social business intelligence; data streaming models; linked data
Online: 26 June 2018 (12:48:17 CEST)
Social Business Intelligence (SBI) enables companies to capture strategic information from public social networks. Contrary to traditional Business Intelligence (BI), SBI has to face the high dynamicity of both the social network contents and the company's analytical requests, as well as the enormous amount of noisy data. Effective exploitation of these continuous sources of data requires efficient processing of the streamed data so that it can be semantically shaped into insightful facts. In this paper, we propose a multidimensional formalism to represent and evaluate social indicators directly from fact streams derived in turn from social network data. This formalism relies on two main aspects: the semantic representation of facts via Linked Open Data and the support of OLAP-like multidimensional analysis models. Contrary to traditional BI formalisms, we start the process by modeling the required social indicators according to the strategic goals of the company. From these specifications, all the required fact streams are modeled and deployed to trace the indicators. The main advantages of this approach are the easy definition of on-demand social indicators and the treatment of changing dimensions and metrics through streamed facts. We demonstrate its usefulness by introducing a real-world use case in the automotive sector.
COMMUNICATION | doi:10.20944/preprints201803.0054.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: data feature selection; data clustering; travel time prediction
Online: 7 March 2018 (13:30:06 CET)
In recent years, governments have applied intelligent transportation system (ITS) techniques to provide several convenience services (e.g., a garbage truck app) for residents. This study proposes a garbage truck fleet management system (GTFMS) together with data feature selection and data clustering methods for travel time prediction. A GTFMS includes mobile devices (MDs), on-board units, a fleet management server, and a data analysis server (DAS). When a user uses an MD to request the arrival time of a garbage truck, the DAS can run the data feature selection and data clustering methods to analyze the travel time of the garbage truck. The proposed methods can cluster the travel time records and reduce variation to improve travel time prediction. After the travel time and arrival time are predicted, the predicted information can be sent to the user’s MD. In the experimental environment, the results showed that the accuracies of the previous method and the proposed method are 16.73% and 85.97%, respectively. Therefore, the proposed data feature selection and data clustering methods can be used to predict the stop-to-stop travel time of garbage trucks.
COMMUNICATION | doi:10.20944/preprints202206.0172.v3
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Monkeypox; monkey pox; Twitter; Dataset; Tweets; Social Media; Big Data; Data Mining; Data Science
Online: 25 July 2022 (09:41:19 CEST)
ARTICLE | doi:10.20944/preprints202109.0518.v1
Subject: Earth Sciences, Environmental Sciences Keywords: data fusion; multi-sensor; data visualization; data treatment; participant reports; air quality; exposure assessment
Online: 30 September 2021 (14:13:52 CEST)
Use of a multi-sensor approach can provide citizens with holistic insight into the air quality in their immediate surroundings and an assessment of personal exposure to urban stressors. Our work, as part of the ICARUS H2020 project, which included over 600 participants from 7 European cities, discusses data fusion and harmonization on a diverse set of multi-sensor data streams to provide a comprehensive and understandable report for participants, and offers possible solutions and improvements. Harmonizing the data streams identified issues with the devices and protocols used, such as non-uniform timestamps, data gaps, difficult data retrieval from commercial devices, and coarse activity data logging. Our process of data fusion and harmonization allowed us to automate the generation of visualizations and reports and consequently provide each participant with a detailed individualized report. Results showed that a key solution was to streamline the code and speed up the process, which necessitated certain compromises in visualizing the data. A well-thought-out process of data fusion and harmonization on a diverse set of multi-sensor data streams considerably improved the quality and quantity of data that a research participant receives. Though automation accelerated the production of the reports considerably, structured manual double checks are strongly recommended.
ARTICLE | doi:10.20944/preprints201806.0185.v1
Subject: Medicine & Pharmacology, Nursing & Health Studies Keywords: mHealth; mobile data collection; data quality; data quality assessment framework; Tuberculosis control; developing countries
Online: 12 June 2018 (10:34:33 CEST)
Background: Increasingly, healthcare organizations are using technology for the efficient management of data. The aim of this study was to compare the data quality of digital records with that of the corresponding paper-based records using a data quality assessment framework. Methodology: We conducted a desk review of paper-based and digital records over the study period from April 2016 to July 2016 at six enrolled TB clinics. We entered all data fields of the patient treatment (TB01) card into a spreadsheet-based template to undertake a field-to-field comparison of the shared fields between TB01 and digital data. Findings: A total of 117 TB01 cards were prepared at six enrolled sites, whereas just 50% of the records (n=59; 59 out of 117 TB01 cards) were digitized. There were 1,239 comparable data fields, of which 65% (n=803) were correctly matched between paper-based and digital records. However, 35% of the data fields (n=436) had anomalies, either in paper-based records or in digital records. On average, there were 1.9 data quality issues per digital patient record, compared with 2.1 issues per paper-based record. Based on the analysis of valid data quality issues, there were more data quality issues in paper-based records (n=123) than in digital records (n=110). Conclusion: There were fewer data quality issues in digital records than in the corresponding paper-based records. Greater use of mobile data capture and continued use of the data quality assessment framework can deliver more meaningful information for decision making.
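The percentages quoted in the findings can be rechecked from the stated counts. The following is a minimal Python sketch that uses only the counts given in the abstract; the variable names and the `pct` helper are illustrative assumptions, not part of the study.

```python
# Counts taken from the abstract; variable names are illustrative only.
total_cards = 117          # TB01 cards prepared
digitized_cards = 59       # cards that were also digitized
comparable_fields = 1239   # data fields shared by paper and digital records
matched_fields = 803       # fields that matched between the two
anomalous_fields = 436     # fields with anomalies in either record type


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


digitized_pct = pct(digitized_cards, total_cards)
matched_pct = pct(matched_fields, comparable_fields)
anomalous_pct = pct(anomalous_fields, comparable_fields)

print(digitized_pct, matched_pct, anomalous_pct)
```

Matched and anomalous fields together account for all 1,239 comparable fields, so the two shares are complementary.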
REVIEW | doi:10.20944/preprints202105.0663.v1
Subject: Keywords: Big Data, Internet Data Sources (IDS), Internet of Things (IoT), Sustainable Development Goals (SDGs), Big data Technologies, Big data Challenges
Online: 27 May 2021 (10:31:03 CEST)
It is strongly believed that a technology yields its best results only when it can be mastered by all stakeholders. Big data technology is no exception: even a decade after its emergence, it remains a herculean task and is at a nascent stage in terms of applicability for many people. Having identified the gaps in big data technology adoption in the contemporary world, the present exploratory research work highlights the possible prospects of big data technologies. It also discusses how the challenges of various fields can be converted into opportunities through a shift in perspective towards this evolving concept. Examples of apex organizations such as the IMF and ITU and their big data initiatives with respect to the Sustainable Development Goals (SDGs) are also cited for a broader outlook. The intervention of the responsible organizations, along with the respective governments, is also needed to encourage technology adoption across all sections of market players.
ARTICLE | doi:10.20944/preprints202003.0073.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: digital object; data infrastructure; research infrastructure; data management; data science; FAIR data; open science; European Open Science Cloud; EOSC; persistent identifier
Online: 5 March 2020 (02:30:06 CET)
Data science is facing the following major challenges: (1) developing scalable cross-disciplinary capabilities, (2) dealing with the increasing data volumes and their inherent complexity, (3) building tools that help to build trust, (4) creating mechanisms to efficiently operate in the domain of scientific assertions, (5) turning data into actionable knowledge units and (6) promoting data interoperability. As a way to overcome these challenges, we further develop the proposals by early Internet pioneers for Digital Objects as encapsulations of data and metadata made accessible by persistent identifiers. In the past decade, this concept was revisited by various groups within the Research Data Alliance and put in the context of the FAIR Guiding Principles for findable, accessible, interoperable and reusable data. The basic components of a FAIR Digital Object (FDO) as a self-contained, typed, machine-actionable data package are explained. A survey of use cases has indicated the growing interest of research communities in FDO solutions. We conclude that the FDO concept has the potential to act as the interoperable federative core of a hyperinfrastructure initiative such as the European Open Science Cloud (EOSC).
REVIEW | doi:10.20944/preprints202208.0420.v1
Subject: Social Sciences, Law Keywords: conversational commerce; data protection; law of obligations of data
Online: 24 August 2022 (10:55:06 CEST)
The possibilities and reach of social networks are increasing, the designs are becoming more diverse, and the ideas more visionary. Most recently, the former company “Facebook” announced the creation of a metaverse. With these technical possibilities, however, the danger posed by fraudsters is also growing. Using social bots, consumers are increasingly influenced on such platforms, and business transactions are brought about through communication, i.e. conversational commerce. Minors and the elderly are particularly susceptible. This technical development is accompanied by a legal one: the Digital Services Directive and the Sale of Goods Directive permit demanding the provision of data as consideration for the sale of digital products. This raises legal problems at the level of the law of obligations and data protection law, whose regulations are intended to protect the aforementioned groups of individuals. This protection becomes even more important the more manipulatively consumers are influenced by communicative bots. We show that there is a lack of knowledge about what objective value data can have in business transactions. Sufficient transparency about the objective value of data can maintain legal protection, especially of vulnerable groups, and ensure the purpose of the laws.
ARTICLE | doi:10.20944/preprints202208.0224.v1
Subject: Engineering, Automotive Engineering Keywords: VR-XGBoost; K-VDTE; ETC data; ESAs; data mining
Online: 12 August 2022 (03:53:23 CEST)
To scientifically and effectively evaluate the service capacity of expressway service areas (ESAs) and improve the management level of ESAs, we propose a method for the recognition of vehicles entering ESAs (VeESAs) and the estimation of vehicle dwell times using ETC data. First, the ETC data and their advantages are described in detail, and then the cleaning rules are designed according to the characteristics of the ETC data. Second, we established feature engineering according to the characteristics of VeESAs and proposed the XGBoost-based VeESA recognition (VR-XGBoost) model. Having studied the driving rules in depth, we constructed a kinematics-based vehicle dwell time estimation (K-VDTE) model. The field validation in Part A/B of Yangli ESA using real ETC transaction data demonstrates that our proposal outperforms the current state of the art. Specifically, in Part A and Part B, the recognition accuracies of VR-XGBoost are 95.9% and 97.4%, respectively, the mean absolute errors (MAEs) of dwell time are 52 s and 14 s, respectively, and the root mean square errors (RMSEs) are 69 s and 22 s, respectively. In addition, the confidence level of controlling the MAE of dwell time within 2 minutes is more than 97%. This work can effectively identify VeESAs and accurately estimate the dwell time, which can provide a reference and a theoretical basis for the service capacity evaluation and layout optimization of ESAs.
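For readers unfamiliar with the two error measures reported above, the following minimal Python sketch shows the standard definitions of MAE and RMSE. The dwell-time values are hypothetical and chosen only for illustration; they are not data from the study, and the function names are assumptions.

```python
import math


def mae(estimated, observed):
    """Mean absolute error: average of |estimate - observation|."""
    return sum(abs(e - o) for e, o in zip(estimated, observed)) / len(observed)


def rmse(estimated, observed):
    """Root mean square error: square root of the mean squared difference."""
    squared = sum((e - o) ** 2 for e, o in zip(estimated, observed))
    return math.sqrt(squared / len(observed))


# Hypothetical dwell times in seconds (illustrative, not from the paper).
estimated_dwell = [600, 900, 1200, 300]
observed_dwell = [630, 880, 1150, 310]

mae_value = mae(estimated_dwell, observed_dwell)
rmse_value = rmse(estimated_dwell, observed_dwell)
print(mae_value, rmse_value)
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data and grows faster when a few estimates are far off, which is consistent with the RMSEs above exceeding the corresponding MAEs.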
ARTICLE | doi:10.20944/preprints202208.0083.v1
Subject: Social Sciences, Accounting Keywords: Ratios; Financial Crisis; Covid-19; Big Data; Accounting Data
Online: 3 August 2022 (10:42:06 CEST)
The effects of the 2008 financial crisis undoubtedly caused problems not only for the banking sector but also for the real economy of developed and developing countries almost everywhere around the globe. Moreover, as is widely known, every banking crisis entails a corresponding cost to the economy of each affected country, which results from the shakeout and restructuring of its financial system. The purpose of this research is to investigate the consequences of the financial crisis and the COVID-19 health crisis and how these affected the course of the four systemic banks (Eurobank, Alpha Bank, National Bank, Piraeus Bank) through the analysis of ratios for the period 2015-2020.
ARTICLE | doi:10.20944/preprints202103.0331.v1
Online: 12 March 2021 (08:05:09 CET)
The present high-tech landscape has allowed institutions to undergo digital transformation and to store exceptional volumes of information from several sources, such as mobile phones, debit cards, GPS, transactions, online logs, and e-records. With the growth of technology, big data has become a major resource for many corporations, encouraging enhanced strategies and innovative business prospects. This advancement has also enabled the expansion of linkable data resources. One well-known data source is social media platforms. Ideas and different types of content are posted by thousands of people via social networking sites. These sites have provided a modern method for operating companies efficiently. However, some studies have shown that social media platforms can be a source of misinformation and that some users tend to misuse social media data. In this work, ethical concerns and conduct in online communities are reviewed in order to see how social media data from different platforms have been misused, and to highlight some ways to avoid the misuse of social media data.
ARTICLE | doi:10.20944/preprints202006.0258.v2
Subject: Engineering, Civil Engineering Keywords: Conservation laws; Data inference; Data discovery; Dimensionless form; PINN
Online: 30 September 2020 (03:51:25 CEST)
Deep learning has achieved remarkable success in diverse computer science applications; however, its use in other traditional engineering fields has emerged only recently. In this project, we solved several mechanics problems governed by differential equations using physics-informed neural networks (PINNs). The PINN embeds the differential equations into the loss of the neural network using automatic differentiation. We present our developments in the context of solving two main classes of problems, data-driven solutions and data-driven discoveries, and we compare the results with either analytical solutions or numerical solutions obtained using the finite element method. The remarkable achievements of the PINN model shown in this report suggest a bright prospect for physics-informed surrogate models that are fully differentiable with respect to all input coordinates and free parameters. More broadly, this study shows that PINNs provide an attractive alternative for solving traditional engineering problems.
ARTICLE | doi:10.20944/preprints202007.0051.v2
Subject: Social Sciences, Library & Information Science Keywords: COVID-19; WHO; database; systematic review; data quality
Online: 2 August 2020 (17:43:38 CEST)
Introduction: The large number of COVID-19 publications has created a need to collect all research-related material in practical and reliable centralized databases. The aim of this study was to evaluate the functionality and quality of the compiled World Health Organisation COVID-19 database and compare it to PubMed and Scopus. Methods: Article metadata for COVID-19 articles and articles on 8 specific topics related to COVID-19 were exported from the WHO global research database, Scopus and PubMed. The analysis was conducted in R to investigate the number and overlap of the articles between the databases and the missingness of values in the metadata. Results: The WHO database contains the largest number of COVID-19 related articles overall but retrieved the same number of articles on 8 specific topics as Scopus and PubMed. Despite having the smallest number of exclusive articles overall, the Scopus database retrieved the highest number of exclusive articles on specific COVID-19 related topics. Further investigation revealed that PubMed and Scopus have a more comprehensive structure than the WHO database and fewer missing values in the categories searched by the information retrieval systems. Discussion: This study suggests that the WHO COVID-19 database, even though it is compiled from multiple databases, has a very simple and limited structure and significant problems with data quality. As a consequence, relying on this database as a source of articles for systematic reviews or bibliometric analyses is undesirable.
ARTICLE | doi:10.20944/preprints201905.0158.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: blockchain; biomedical data managing; DWT; keyword search; data sharing.
Online: 13 May 2019 (13:30:37 CEST)
Personal biomedical data play a crucial role in giving patients and health professionals efficient access to health records. However, it is difficult to obtain a unified view of health data that are scattered across various health centers and hospital departments. Specifically, health records are distributed across many places and cannot be integrated easily. In recent years, blockchain has been regarded as a promising solution for sharing individual biomedical information securely while preserving privacy, owing to its immutability. This research work puts forward a blockchain-based management scheme that helps to establish interpretation improvements in electronic biomedical systems. The scheme is built on two blockchains, where the second blockchain algorithm generates a secure sequence for the hash key generated by the first blockchain algorithm. An adaptive feature enables the algorithm to handle multiple data types and to combine various biomedical images and text records. All data, including keywords, digital records, and patient identities, are encrypted with a private key and support keyword search, so as to maintain privacy preservation, access control, and protected search. The results show low latency (less than 750 ms) at 400 requests/second, indicating that the scheme can be used in health care units such as hospitals and clinics.
ARTICLE | doi:10.20944/preprints201806.0219.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Big data technology; Business intelligence; Data integration; System virtualization.
Online: 13 June 2018 (16:19:48 CEST)
Big Data warehouses are a new class of databases that largely use unstructured and volatile data for analytical purposes. Examples of such data sources are those coming from the Web, such as social networks and blogs, or from sensor networks, where huge amounts of data may be available only for short intervals of time. In order to manage massive data sources, a strategy must be adopted to define multidimensional schemas in the presence of fast-changing situations or even undefined business requirements. In this paper, we propose a design methodology that adopts agile and automatic approaches in order to reduce the time needed to integrate new data sources and to include new business requirements on the fly. The data are immediately available for analysis, since the underlying architecture is based on a virtual data warehouse that does not require an import phase. Examples of application of the methodology are presented throughout the paper in order to show the validity of this approach compared to a traditional one.
ARTICLE | doi:10.20944/preprints202102.0326.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Data analysis; Artificial Intelligence; Machine Learning; Knowledge Engineering; Computers and information processing, Data analysis; Data Processing.
Online: 16 February 2021 (13:33:53 CET)
The copper mining industry is increasingly using artificial intelligence methods to improve copper production processes. Recent studies reveal the use of algorithms such as Artificial Neural Network, Support Vector Machine, and Random Forest, among others, to develop models for predicting product quality. Other studies compare the predictive models developed with these machine learning algorithms in the mining industry as a whole. However, few published copper mining studies compare the results of machine learning techniques for copper recovery prediction. This study makes a detailed comparison between three models for predicting copper recovery by leaching, using four datasets resulting from mining operations in northern Chile. The algorithms used for developing the models were Random Forest, Support Vector Machine, and Artificial Neural Network. To validate these models, four indicators or values of merit were used: accuracy (acc), precision (p), recall (r), and the Matthews correlation coefficient (mcc). This paper describes dataset preparation and the refinement of the threshold values used for the predictive variable most influential on the class (the copper recovery). Results show both a precision over 98.50% and also the model with the best behavior between the predicted and the real. Finally, the models obtained show the following mean values: acc=94.32, p=88.47, r=99.59, and mcc=2.31. These values are highly competitive as compared with those obtained in similar studies using other approaches in the context.
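The four indicators named above have standard definitions in terms of a binary confusion matrix. The following minimal Python sketch computes them under that assumption; the counts are hypothetical and not taken from the study, and the function name is illustrative.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, MCC) for a binary confusion matrix.

    Assumes tp + fp > 0 and tp + fn > 0 so precision and recall are defined.
    """
    acc = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Matthews correlation coefficient; by definition it lies in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, precision, recall, mcc


# Hypothetical counts for illustration only (not from the paper).
acc, precision, recall, mcc = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(acc, precision, recall, mcc)
```

Accuracy, precision, and recall are fractions in [0, 1] that are often reported as percentages, whereas MCC is a correlation-style score between -1 and 1.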
ARTICLE | doi:10.20944/preprints202008.0254.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: feature selection; k-means; silhouette measure; clustering; big data; fault classification; sensor data; time-series data
Online: 11 August 2020 (06:26:43 CEST)
Feature selection is a crucial step in overcoming the curse of dimensionality in data mining. This work proposes Recursive k-means Silhouette Elimination (RkSE), a new unsupervised feature selection algorithm for reducing dimensionality in univariate and multivariate time-series datasets. k-means clustering is applied recursively to select the cluster-representative features, following a unique application of the silhouette measure for each cluster and a user-defined threshold as the feature selection or elimination criterion. The proposed method is evaluated on multi-sensor readings from a hydraulic test rig in two ways: (1) reducing the dimensionality in a multivariate classification problem using various classifiers of different functionalities; (2) classifying univariate data in a sliding-window scenario, where RkSE is used as a window compression method that reduces the window dimensionality by selecting the best time points in a sliding window. The results are validated using 10-fold cross-validation and compared with the results obtained when classification is performed directly with no feature selection applied. Additionally, a new taxonomy for k-means-based feature selection methods is proposed. The experimental results and observations from the two comprehensive experiments demonstrated in this work reveal the capabilities and accuracy of the proposed method.
ARTICLE | doi:10.20944/preprints202108.0303.v2
Online: 19 November 2021 (08:38:42 CET)
Science continues to become more interdisciplinary and to involve increasingly complex data sets. Many projects in the biomedical and health-related sciences follow or aim to follow the principles of FAIR data sharing, which has been demonstrated to foster collaboration, to lead to better research outcomes, and to help ensure reproducibility of results. Data generated in the course of biomedical and health research present specific challenges for FAIR sharing in the sense that they are heterogeneous and highly sensitive to context and the needs of protection and privacy. Data sharing must respect these features without impeding timely dissemination of results, so that they can contribute to time-critical advances in medical therapy and treatment. Modeling and simulation of biomedical processes have become established tools, and a global community has been developing algorithms, methodologies, and standards for applying biomedical simulation models in clinical research. However, it can be difficult for clinician scientists to follow the specific rules and recommendations for FAIR data sharing within this domain. We seek to clarify the standard workflow for sharing experimental and clinical data with the simulation modeling community. By following these recommendations, data sharing will be improved, collaborations will become more effective, and the FAIR publication and subsequent reuse of data will become possible at the level of quality necessary to support biomedical and health-related sciences.
ARTICLE | doi:10.20944/preprints202112.0068.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: Data security; data handling; access control; unauthorized access; cloud computing
Online: 6 December 2021 (12:15:56 CET)
Nowadays, cloud computing is one of the most important and rapidly growing paradigms, extending its capabilities and applications into various areas of life. Cloud computing systems face many security challenges, such as scalability, integrity, confidentiality, and unauthorized access. An illegitimate intruder may gain access to a sensitive cloud computing system and use the data for inappropriate purposes, which may lead to business losses or system damage. This paper proposes a hybrid unauthorized data handling (HUDH) scheme for big data in cloud computing. HUDH aims to restrict illegitimate users from accessing the cloud and to provide data security. The proposed HUDH consists of three steps: data encryption, data access, and intrusion detection. HUDH involves three algorithms: Advanced Encryption Standard (AES) for encryption, Attribute-Based Access Control (ABAC) for data access control, and Hybrid Intrusion Detection (HID) for unauthorized access detection. The proposed scheme is implemented in Python and Java. Testing results demonstrate that HUDH can delegate computation overhead to powerful cloud servers. User confidentiality, access privilege, and user secret key accountability can be attained with accuracy above 97%.
ARTICLE | doi:10.20944/preprints202108.0256.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Learning Analytics, Education, Educational Data Mining, Pattern Recognition, Data Visualization.
Online: 11 August 2021 (11:23:48 CEST)
With the exponential growth of today’s technology and its expanding areas of application, it has become vital to incorporate it in education. One such application is Knowledge Discovery in Databases (KDD), which is a subset of data mining. KDD deals with extracting useful information and meaningful, previously unknown patterns from a database. This study is a detailed application of KDD and focuses on analyzing why a particular set of students performed better than others and what factors influenced the results. The study is conducted on a dataset of 480 students with 16 different features. The authors implemented four major classification techniques, namely Logistic Regression, Decision Tree, Random Forest and XGB classifier. After obtaining, from the top-performing ML algorithms, the key features that have a major impact on student performance, the study takes these features as a baseline for further analysis. Further data analysis highlights patterns in the data. The study concludes that many non-academic factors influence the overall performance of a student and should be taken into consideration by universities and other relevant bodies.
ARTICLE | doi:10.20944/preprints202102.0593.v2
Subject: Medicine & Pharmacology, Other Keywords: Hospital admissions; care homes; COVID-19; linked data; administrative data
Online: 25 May 2021 (10:33:46 CEST)
Background: Care home residents have complex healthcare needs but may have faced barriers to accessing hospital treatment during the first wave of the COVID-19 pandemic. Objective: To examine trends in the number of hospital admissions for care home residents during the first months of the COVID-19 outbreak. Methods: Retrospective analysis of a national linked dataset on hospital admissions for residential and nursing home residents in England (257,843 residents, 45% in nursing homes) between 20 January 2020 and 28 June 2020, compared to admissions during the corresponding period in 2019 (252,432 residents, 45% in nursing homes). Elective and emergency admission rates, normalised to the time spent in care homes across all residents, were derived across the first three months of the pandemic between 1 March and 31 May, and primary admission reasons for this period were compared across years. Results: Hospital admission rates rapidly declined during early March 2020 and remained substantially lower than in 2019 until the end of June. Between March and May, 2,960 admissions from residential homes (16.2%) and 3,295 admissions from nursing homes (23.7%) were for suspected or confirmed COVID-19. Rates of other emergency admissions decreased by 36% for residential and by 38% for nursing home residents (13,191 fewer admissions in total). Emergency admissions for acute coronary syndromes fell by 43% and 29% (105 fewer admissions) and emergency admissions for stroke fell by 17% and 25% (128 fewer admissions) for residential and nursing home residents, respectively. Elective admission rates declined by 64% for residential and by 61% for nursing home residents (3,762 fewer admissions). Conclusions: This is the first study showing that care home residents’ hospital use declined during the first wave of COVID-19, potentially resulting in substantial unmet health need that will need to be addressed alongside ongoing pressures from COVID-19.
ARTICLE | doi:10.20944/preprints202103.0623.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: SARS-CoV-2; Big Data; Data Analytics; Predictive Models; Schools
Online: 25 March 2021 (14:35:53 CET)
Background: Coronavirus Disease 2019 (COVID-19) was the most discussed topic worldwide in 2020, and at the beginning of the Italian epidemic scientists tried to understand the diffusion of the virus and the epidemic curve of positive cases, with controversial findings and numbers. Objectives: In this paper, a data analytics study on the diffusion of COVID-19 in the Lombardy Region and the Campania Region is developed in order to identify the driver that sparked the second wave in Italy. Methods: Starting from all the available official data collected about the diffusion of COVID-19, we analyzed Google mobility data, school data and infection data for two large regions in Italy, Lombardy and Campania, which adopted two different approaches to opening and closing schools. To reinforce our findings, we also extended the analysis to the Emilia Romagna Region. Results: The paper aims to show the impact that different school opening/closing policies may have on the spread of COVID-19. Conclusions: The paper shows that a clear correlation exists between school contagion and the subsequent overall contagion over time in a geographical area.
ARTICLE | doi:10.20944/preprints202010.0618.v1
Subject: Engineering, Automotive Engineering Keywords: optical data communications; fiber optics; microcombs; ultrahigh bandwidth data transmission
Online: 29 October 2020 (14:34:21 CET)
We report world-record data transmission over standard optical fiber from a single optical source. We achieve a line rate of 44.2 Terabits per second (Tb/s) employing only the C-band at 1550 nm, resulting in a spectral efficiency of 10.4 bits/s/Hz. We use a new and powerful class of micro-comb called soliton crystals that exhibit robust operation and stable generation as well as a high intrinsic efficiency that, together with an extremely low spacing of 48.9 GHz, enables a very high coherent data modulation format of 64 QAM. We achieve error-free transmission across 75 km of standard optical fiber in the lab and over a field trial with a metropolitan optical fiber network. This work demonstrates the ability of optical micro-combs to exceed other approaches in performance for the most demanding practical optical communications applications.
SHORT NOTE | doi:10.20944/preprints202001.0196.v1
Subject: Biology, Entomology Keywords: reproducibility; open access; data curation; data management; pre-print servers
Online: 18 January 2020 (09:05:49 CET)
The ability to replicate scientific experiments is a cornerstone of the scientific method. Sharing ideas, workflows, data, and protocols facilitates testing the generalizability of results, increases the speed at which science progresses, and enhances quality control of published work. Fields of science such as medicine, the social sciences, and the physical sciences have embraced practices designed to increase replicability. Granting agencies, for example, may require data management plans and journals may require data and code availability statements along with the deposition of data and code in publicly available repositories. While many tools commonly used in replicable workflows such as distributed version control systems (e.g. “git”) or scripted programming languages for data cleaning and analysis may have a steep learning curve, their adoption can increase individual efficiency and facilitate collaborations both within entomology and across disciplines. The open science movement is developing within the discipline of entomology, but practitioners of these concepts or those desiring to work more collaboratively across disciplines may be unsure where or how to embrace these initiatives. This article is meant to introduce some of the tools entomologists can incorporate into their workflows to increase the replicability and openness of their work. We describe these tools and others, recommend additional resources for learning more about these tools, and discuss the benefits to both individuals and the scientific community and potential drawbacks associated with implementing a replicable workflow.
ARTICLE | doi:10.20944/preprints201807.0534.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: covariates; crab data; foetal lamb data; orthonormal polynomials; Poisson distribution
Online: 27 July 2018 (05:20:44 CEST)
Dispersion tests based on the second order component of smooth test statistics are related to Fisher’s Index of Dispersion test, used for testing for the Poisson distribution when there are no covariates present. Such tests have been recommended for testing for the Poisson distribution when covariates are present. The modified Borel-Tanner (MBT) distribution seems suited to data with extra zeroes, a monotonic decline in counts and longer tails. Here we recommend a dispersion test for the MBT distribution, both when covariates are absent and when they are present.
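As an illustration of the covariate-free case mentioned above, the following is a minimal sketch of Fisher's Index of Dispersion statistic, D = Σ(x_i − x̄)² / x̄, which under a Poisson null is approximately chi-square distributed with n − 1 degrees of freedom. The function name and the sample counts are hypothetical and are not taken from the paper; the smooth-test and MBT versions discussed in the abstract are not reproduced here.

```python
from statistics import mean


def index_of_dispersion(counts):
    """Return Fisher's index of dispersion D = sum((x - mean)^2) / mean."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = mean(counts)
    if xbar == 0:
        raise ValueError("all counts are zero; the statistic is undefined")
    return sum((x - xbar) ** 2 for x in counts) / xbar


# Hypothetical count data with extra zeroes and a long tail (illustration only).
counts = [0, 0, 0, 1, 1, 2, 3, 5, 9]
d_stat = index_of_dispersion(counts)
degrees_of_freedom = len(counts) - 1  # reference chi-square distribution
```

A value of D well above the degrees of freedom suggests overdispersion relative to the Poisson model; a formal decision would compare D with a chi-square quantile.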
ARTICLE | doi:10.20944/preprints201806.0440.v1
Subject: Engineering, Other Keywords: clustering; spatial data; grid-based k-prototypes; data mining; sustainability
Online: 27 June 2018 (10:21:22 CEST)
Data mining plays a critical role in sustainable decision making. The k-prototypes algorithm is one of the best-known algorithms for clustering both numeric and categorical data. Despite this, clustering a large number of spatial objects with mixed numeric and categorical attributes is still inefficient due to its high time complexity. In this paper, we propose efficient grid-based k-prototypes algorithms, GK-prototypes, which achieve high performance for clustering spatial objects. The first proposed algorithm utilizes both the maximum and minimum distances between cluster centers and a cell, which can remove unnecessary distance calculations. The second proposed algorithm, an extension of the first, utilizes spatial dependence, i.e., the tendency of spatial data to be more similar the closer objects are. Each cell has a bitmap index which stores the categorical values of all objects in the same cell for each attribute. This bitmap index can improve performance when categorical data is skewed. Our evaluation experiments showed that the proposed algorithms can achieve better performance than the existing pruning technique in the k-prototypes algorithm.
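The min/max-distance idea can be sketched as follows. This is not the authors' GK-prototypes implementation; it assumes two-dimensional axis-aligned grid cells, the usual k-prototypes cost (squared Euclidean distance on numeric attributes plus a weight `gamma` per categorical mismatch), and hypothetical function names. A center is skipped for a whole cell when even its best case is worse than another center's worst case.

```python
def min_max_sq_dist(center, cell):
    """Smallest and largest squared distance from a center to a rectangular cell.

    cell is ((xmin, ymin), (xmax, ymax)); center is (x, y).
    """
    (xmin, ymin), (xmax, ymax) = cell
    cx, cy = center
    # Nearest point of the cell: clamp the center into the rectangle.
    dx_min = max(xmin - cx, 0.0, cx - xmax)
    dy_min = max(ymin - cy, 0.0, cy - ymax)
    # The farthest point of the cell is always one of its corners.
    dx_max = max(abs(cx - xmin), abs(cx - xmax))
    dy_max = max(abs(cy - ymin), abs(cy - ymax))
    return dx_min ** 2 + dy_min ** 2, dx_max ** 2 + dy_max ** 2


def candidate_centers(centers, cell, gamma=0.0, n_categorical=0):
    """Indices of centers that could be closest to some object in the cell.

    The categorical part of the cost lies between 0 and gamma * n_categorical,
    so a center is pruned only if its minimum numeric cost already exceeds
    another center's maximum numeric cost plus the largest categorical penalty.
    """
    bounds = [min_max_sq_dist(c, cell) for c in centers]
    threshold = min(hi for _, hi in bounds) + gamma * n_categorical
    return [i for i, (lo, _) in enumerate(bounds) if lo <= threshold]


# Hypothetical cell and cluster centers (illustration only).
cell = ((0.0, 0.0), (1.0, 1.0))
centers = [(0.5, 0.5), (10.0, 10.0), (2.0, 0.5)]
numeric_only = candidate_centers(centers, cell)
with_categorical = candidate_centers(centers, cell, gamma=1.0, n_categorical=2)
```

Objects in the cell then need exact distance calculations only against the surviving centers; the categorical bound keeps the pruning safe when mixed attributes are present.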
ARTICLE | doi:10.20944/preprints201805.0353.v1
Subject: Mathematics & Computer Science, General & Theoretical Computer Science Keywords: big data; big data system; energy; district heating; reinforcement learning
Online: 24 May 2018 (16:05:27 CEST)
This paper presents a study on improving the thermal efficiency of the user equipment room in a district heating system based on reinforcement learning, and suggests a general method of constructing a learning network (DQN) using deep Q-learning, a reinforcement learning algorithm that does not require a model of the environment. In addition, we introduce the big data platform system and the integrated heat management system for the energy field, which handle massive data processing from IoT sensors installed in a large number of thermal energy control facilities.
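For readers unfamiliar with Q-learning, the update that a DQN approximates with a neural network can be sketched in tabular form. This is a generic illustration, not the network described in the paper; the state and action names (for example "cold" and "open") and the parameter values are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning step and return the updated Q(state, action).

    The target is reward + gamma * max_a' Q(next_state, a'); no model of the
    environment is needed, only the sampled transition.
    """
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical valve-control example: unseen state-action pairs start at 0.
q_table = defaultdict(float)
valve_actions = ["open", "close"]
first = q_update(q_table, "cold", "open", 1.0, "warm", valve_actions)
second = q_update(q_table, "warm", "close", 0.0, "cold", valve_actions)
```

In a DQN the table lookup `q[(state, action)]` is replaced by a network evaluated on the state, and the same target drives the training loss.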
ARTICLE | doi:10.20944/preprints201804.0054.v1
Subject: Earth Sciences, Other Keywords: metadata; documentation; data life-cycle; metadata life-cycle; hierarchical data
Online: 4 April 2018 (08:16:15 CEST)
The historic view of metadata as “data about data” is expanding to include data about other items that must be created, used and understood throughout the data and project life cycles. In this context, metadata might better be defined as the structured and standard part of documentation and the metadata life cycle can be described as the metadata content that is required for documentation in each phase of the project and data life cycles. This incremental approach to metadata creation is similar to the spiral model used in software development. Each phase also has distinct users and specific questions they need answers to. In many cases, the metadata life cycle involves hierarchies where later phases have increased numbers of items. The relationships between metadata in different phases can be captured through structure in the metadata standard or through conventions for identifiers. Metadata creation and management can be streamlined and simplified by re-using metadata across many records. Many of these ideas are being used in metadata for documenting the life cycle of research projects in the Arctic.
ARTICLE | doi:10.20944/preprints201710.0076.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: big data; machine learning; regularization; data quality; robust learning framework
Online: 17 October 2017 (03:47:41 CEST)
The concept of ‘big data’ has been widely discussed, and its value has been illuminated throughout a variety of domains. To quickly mine potential values and alleviate the ever-increasing volume of information, machine learning is playing an increasingly important role and faces more challenges than ever. Because few studies exist regarding how to modify machine learning techniques to accommodate big data environments, we provide a comprehensive overview of the history of the evolution of big data, the foundations of machine learning, and the bottlenecks and trends of machine learning in the big data era. More specifically, based on learning principles, we discuss regularization to enhance generalization. The challenges of quality in big data are reduced to the curse of dimensionality, class imbalances, concept drift and label noise, and the underlying reasons and mainstream methodologies to address these challenges are introduced. Learning model development has been driven by domain specifics, dataset complexities, and the presence or absence of human involvement. In this paper, we propose a robust learning paradigm by aggregating the aforementioned factors. Over the next few decades, we believe that these perspectives will lead to novel ideas and encourage more studies aimed at incorporating knowledge and establishing data-driven learning systems that involve both data quality considerations and human interactions.
COMMUNICATION | doi:10.20944/preprints202206.0383.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Exoskeleton; Twitter; Tweets; Big Data; social media; Data Mining; dataset; Data Science; Natural Language Processing; Information Retrieval
Online: 21 July 2022 (04:06:53 CEST)
Exoskeleton technology has advanced rapidly in the recent past owing to its many applications and diverse use-cases in assisted living, military, healthcare, firefighting, and industry 4.0. The exoskeleton market is projected to grow to several times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction, towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today's living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset, by the mining of relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, where the topics found in the conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 tweets about exoskeletons that were posted in a 5-year period from May 21, 2017, to May 21, 2022. Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
ARTICLE | doi:10.20944/preprints202011.0266.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: dyadic data; co-occurrence data; attributed dyadic data (ADD); mixture model; conditional mixture model (CMM); regression model
Online: 9 November 2020 (08:48:40 CET)
Dyadic data contains co-occurrences of objects and is often modeled by a finite mixture model, which in turn is learned by the expectation maximization (EM) algorithm. Objects in traditional dyadic data are identified by names, with the drawback that it is impossible to extract implicit valuable knowledge about the objects. In this research, I propose so-called attributed dyadic data (ADD), in which each object has an informative attribute and each co-occurrence of two objects is associated with a value. ADD is flexible and covers most structures and forms of dyadic data. The conditional mixture model (CMM), a variant of the finite mixture model, is applied to learning ADD. Moreover, a significant feature of CMM is that any co-occurrence of two objects is based on some conditional variable. As a result, CMM can predict or estimate co-occurrent values based on a regression model, which extends the applications of ADD and CMM.
ARTICLE | doi:10.20944/preprints201806.0078.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: health data science; clinical trials; research participant reporting; personal health data diary; personal private webserver; research data integrity
Online: 6 June 2018 (09:40:35 CEST)
We describe how clinical researchers can exploit the Android cell phone as an economical platform for gathering data from clinical trial participants. The aim was to provide a solution with the shortest possible learning curve for researchers who are comfortable with setting up web pages. The additional requirement is that they extend their skills to the installation of a local webserver on the cell phone and then use four simple PHP templates to construct the clinical research data collection and processing forms. Data so collected is automatically written to local csv files on the cell phone. These csv files can be retrieved from the device by the researcher simply by plugging the cell phone into their desktop PC and accessing the cell phone memory in just the same way as they would a USB memory stick. The results are presented as a list of recommended Android Apps along with settings that have proved to provide a stable combination likely to be easily used by clinical research participants. We have made a limited ‘user trial’ of this approach with satisfactory feedback received. We have concluded that this approach will reward researchers with a solution that is user friendly, will provide transcription free data and that is more than cost competitive with the conventional error prone/poor compliance ‘paper based participant form – researcher transcription’ cycle.
ARTICLE | doi:10.20944/preprints202206.0354.v1
Subject: Social Sciences, Other Keywords: health self-tracking; data donation; data sharing; quantified self; mobile tracking
Online: 27 June 2022 (08:46:26 CEST)
Health self-tracking is an ongoing trend as software and hardware evolve, making the collection of personal data not only fun for users but also increasingly interesting for public health research. In a quantitative approach we studied German health self-trackers (N=919) for differences in their data disclosure behavior by comparing data showing and sharing behavior among peers and their willingness to donate data to research. In addition, we examined user characteristics that may positively influence willingness to make the self-tracked data available to research and propose a framework for structuring research related to self-measurement. Results show that users' willingness to disclose data as a "donation" more than doubled compared to their "sharing" behavior (willingness to donate= 4.5/10; sharing frequency= 2.09/10). Younger men (up to 34 years), who record their vital signs daily, are less concerned about privacy, regularly donate money, and share their data with third parties because they want to receive feedback, are most likely to donate data to research and are thus a promising target audience for health data donation appeals. The paper adds to qualitative accounts of self-tracking but also engages with discussions around data sharing and privacy.
ARTICLE | doi:10.20944/preprints202201.0365.v3
Subject: Life Sciences, Biochemistry Keywords: binding affinity prediction; machine learning; data quality; data quantity; deep learning
Online: 23 May 2022 (11:16:49 CEST)
Prediction of protein-ligand binding affinities is crucial for computational drug discovery. A number of deep learning approaches have been developed in recent years to improve the accuracy of such affinity prediction. While the predictive power of these systems has advanced to some degree depending on the dataset used for model training and testing, the effects of the quality and quantity of the underlying data have not been thoroughly examined. In this study, we employed erroneous datasets and data subsets of different sizes, created from one of the largest databases of experimental binding affinities, to train and evaluate a deep learning system based on convolutional neural networks. Our results show that data quality and quantity do have significant impacts on the prediction performance of trained models. Depending on the variations in data quality and quantity, the performance discrepancies could be comparable to or even larger than those observed among different deep learning approaches. In particular, the presence of proteins during model training leads to a dramatic increase in prediction accuracy. This implies that continued accumulation of high-quality affinity data, especially for new protein targets, is indispensable for improving deep learning models to better predict protein-ligand binding affinities.
ARTICLE | doi:10.20944/preprints202204.0261.v1
Subject: Earth Sciences, Atmospheric Science Keywords: PM2.5; Aerosol Optical Depth; Data assimilation; MODIS; satellite data; Objective analysis
Online: 27 April 2022 (11:32:49 CEST)
We used the objective analysis method in conjunction with the successive correction method to assimilate MODerate resolution Imaging Spectroradiometer (MODIS) Aerosol Optical Depth (AOD) data into the Chimère model in order to improve the modeling of fine particulate matter (PM2.5) concentrations and the AOD field over Europe. A data assimilation module was developed to adjust the daily initial total column aerosol concentrations based on a forecast-analysis cycling scheme. The model was then evaluated over a one-month winter period to examine how this data assimilation technique pushes the model results closer to surface observations. This comparison showed that the mean biases of surface PM2.5 concentrations and of the AOD field could be reduced from -34 to -15% and from -45 to -27%, respectively. The assimilation, however, leads to false alarms because of the difficulty of distributing AOD550 over different particle sizes. The impact of the influence radius is found to be small and depends on the density of satellite data. This work, although preliminary, is important for near-real-time air quality forecasting using the Chimère model and can be further developed to improve modeled PM2.5 and ozone concentrations.
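A single pass of a successive correction scheme can be sketched as below. The abstract does not state which weighting function was used, so the Cressman weight, the function names and all numbers here are assumptions for illustration rather than the module developed for Chimère. Each grid value is the background plus a distance-weighted mean of the observation-minus-background increments within the influence radius.

```python
import math


def cressman_weight(distance, radius):
    """Cressman weight: 1 at the observation, falling to 0 at the radius."""
    if distance >= radius:
        return 0.0
    r2, d2 = radius ** 2, distance ** 2
    return (r2 - d2) / (r2 + d2)


def correct_once(grid_points, background, obs_points, obs_values,
                 obs_background, radius):
    """Return the analysis field after one correction pass.

    obs_background holds the background field interpolated to each
    observation location (supplied by the caller in this sketch).
    """
    analysis = []
    for (gx, gy), bg in zip(grid_points, background):
        num = den = 0.0
        for (ox, oy), y, yb in zip(obs_points, obs_values, obs_background):
            w = cressman_weight(math.hypot(gx - ox, gy - oy), radius)
            num += w * (y - yb)
            den += w
        # Grid points with no observation in range keep the background value.
        analysis.append(bg + num / den if den > 0 else bg)
    return analysis


# Hypothetical AOD values: one satellite retrieval located at the first grid point.
analysis = correct_once(
    grid_points=[(0.0, 0.0), (10.0, 0.0)],
    background=[0.2, 0.2],
    obs_points=[(0.0, 0.0)],
    obs_values=[0.5],
    obs_background=[0.2],
    radius=5.0,
)
```

A full successive correction repeats this pass, typically with a shrinking radius, using the previous analysis as the new background.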
REVIEW | doi:10.20944/preprints202203.0407.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: big data analytics; healthcare; data technologies; decision making; information management; EHR
Online: 31 March 2022 (12:24:19 CEST)
Big data analytics is the use of advanced analytic techniques on large and diverse volumes of data that include structured, semi-structured, and unstructured data from different sources and in different sizes, from terabytes to zettabytes. The health sector is faced with the need to generate and manage large data sets from various health systems, such as electronic health records and clinical decision support systems. This data can be used by providers, clinicians, and policymakers to plan and implement interventions, detect disease more quickly, predict outcomes, and personalize care delivery. However, little attention is paid to the connection between big data analytics tools and the health sector. Thus, a systematic review of the bibliometric literature (LRSB) was developed to study how the adoption of big data analytics tools and infrastructures will revolutionize the healthcare industry. The review integrated 77 scientific and/or academic documents indexed in SCOPUS presenting up-to-date knowledge on current insights on how big data analytics technologies influence the healthcare sector and the different big data analytical tools used. The LRSB provides findings related to the impact of Big Data analytics on the health sector by introducing opportunities and technologies that provide practical solutions to various challenges.
ARTICLE | doi:10.20944/preprints202111.0019.v1
Subject: Engineering, Industrial & Manufacturing Engineering Keywords: Industry 4.0; Database; Data models; Big Data & Analytics; Asset Administration Shell
Online: 1 November 2021 (13:01:51 CET)
The data-oriented paradigm has proven to be fundamental for the technological transformation process that characterizes Industry 4.0 (I4.0) so that Big Data & Analytics is considered a technological pillar of this process. The literature reports a series of system architecture proposals that seek to implement the so-called Smart Factory, which is primarily data-driven. Many of these proposals treat data storage solutions as mere entities that support the architecture's functionalities. However, choosing which logical data model to use can significantly affect the performance of the architecture. This work identifies the advantages and disadvantages of relational (SQL) and non-relational (NoSQL) data models for I4.0, taking into account the nature of the data in this process. The characterization of data in the context of I4.0 is based on the five dimensions of Big Data and a standardized format for representing information of assets in the virtual world, the Asset Administration Shell. This work allows identifying appropriate transactional properties and logical data models according to the volume, variety, velocity, veracity, and value of the data. In this way, it is possible to describe the suitability of SQL and NoSQL databases for different scenarios within I4.0.
TECHNICAL NOTE | doi:10.20944/preprints202109.0505.v1
Subject: Medicine & Pharmacology, Other Keywords: Semantics; standards; clinical research infrastructure; terminology; graph data; data-driven medicine
Online: 29 September 2021 (17:32:40 CEST)
Health-related data originating from diverse sources are commonly stored in manifold databases and formats, making it difficult to find, access and gather data for research purposes. In addition, so-called secondary use scenarios for health data are usually hindered by local data codes, missing dictionaries and the lack of metadata and context descriptions. Following the FAIR principles (Findable, Accessible, Interoperable and Reusable), we developed a decentralized infrastructure to overcome these hurdles and enable collaborative research by making the meaning of health-related data understandable to both humans and machines. This infrastructure is currently being implemented in the realm of the Swiss Personalized Health Network (SPHN), a research infrastructure initiative for enabling the use and exchange of health-related data for research in Switzerland. The SPHN ecosystem for FAIR data consists of the SPHN Dataset (semantic definitions), the SPHN RDF Schema (linkage and transport of the semantics in a machine-readable format), a project RDF template, extensive guidelines and conventions on how to generate SPHN RDF schema, a Terminology Service (converter of clinical terminologies in RDF), and a Quality Assurance Framework (automated data validation with SHACLs and SPARQLs). The SPHN ecosystem has been built in a way that it can easily be adapted and extended by any SPHN project to fit individual needs. By providing such a national ecosystem, SPHN supports researchers in generating, processing and sharing FAIR data.
ARTICLE | doi:10.20944/preprints202105.0377.v1
Subject: Keywords: Sensor data, wireless body area network, wearable devices, sensor data interoperability
Online: 17 May 2021 (09:47:26 CEST)
The monitoring of maternal and child health, using wearable devices made with wireless sensor technologies, is expected to reduce maternal and child death rates. Wireless sensor technologies have been used in wireless sensor networks to enable the acquisition of data for monitoring machines, smart cities, transportation, asset tracking, and tracking of human activity. Applications based on wireless body area network (WBAN) have been used in healthcare for measuring and monitoring of patient health and activity through integration with wearable devices. Wireless sensors used in WBAN can be cost-effective, enable remote availability, and can be integrated with electronic health record (EHR) management systems. Interoperability of WBAN sensor data with other linked data has the potential to improve health for all, including maternal and child health through the improvement of data access, data quality and healthcare access. This paper presents a survey of the state-of-the-art techniques for managing WBAN sensor data interoperability. The findings in this study will provide reliable support to enable policymakers and health care providers to take action to enhance the use of e-health to improve maternal-child health and reduce the mortality rates of women and children.
REVIEW | doi:10.20944/preprints202103.0214.v2
Subject: Engineering, Automotive Engineering Keywords: data center; green data center; sustainability; energy efficiency; energy saving; ICT.
Online: 14 April 2021 (12:59:53 CEST)
Information and communication technologies (ICT) are increasingly permeating our daily life, and we commit ever more of our data to the cloud. Events like the COVID-19 pandemic put an exceptional burden upon ICT infrastructures. This involves increasing implementation and use of data centers, which increases energy use and environmental impact. The scope of this work is to take stock of data center impact, opportunities, and assessment. First, we estimate the magnitude of the impact. Then, we review strategies for efficiency and energy conservation in data centers. Energy use pertains to power distribution, IT equipment, and non-IT equipment (e.g. cooling); existing and prospective strategies and initiatives in these sectors are identified. Among key elements are innovative cooling techniques, natural resources, automation, low-power electronics, and equipment with extended thermal limits. Research perspectives are identified and estimates of improvement opportunities are presented. Finally, we present an overview of existing metrics, the regulatory framework, and the bodies concerned.
ARTICLE | doi:10.20944/preprints202102.0365.v1
Subject: Life Sciences, Biochemistry Keywords: Cancer subtype detection; Multi-omics data; Data integration; Autoencoder; Survival analysis
Online: 17 February 2021 (10:09:51 CET)
A heterogeneous disease like cancer is activated through multiple pathways and different perturbations. Depending upon the activated pathway(s), patients’ survival varies significantly and patients respond differently to various drugs. Therefore, cancer subtype detection using genomics level data is a significant research problem. Subtype detection is often a complex problem, and in most cases, needs multi-omics data fusion to achieve accurate subtyping. Different data fusion and subtyping approaches have been proposed, such as kernel-based fusion, matrix factorization, and deep learning autoencoders. In this paper, we compared the performance of different deep learning autoencoders for cancer subtype detection. We performed cancer subtype detection on four different cancer types from The Cancer Genome Atlas (TCGA) datasets using four autoencoder implementations. We also predicted the optimal number of subtypes in a cancer type using the silhouette score. We observed that the detected subtypes exhibit significant differences in survival profiles. Furthermore, we also compared the effect of feature selection and similarity measures for subtype detection. To evaluate the results obtained, we selected the Glioblastoma multiforme (GBM) dataset and identified the differentially expressed genes in each of the subtypes identified by the autoencoders; the obtained results coincide well with other genomic studies and can be corroborated with the involved pathways and biological functions. Thus, it shows that the results from the autoencoders, obtained through the interaction of different datatypes of cancer, can be used for the prediction and characterization of patient subgroups and survival profiles.
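The silhouette score used to choose the number of subtypes can be computed as in the following sketch. It assumes Euclidean distance on low-dimensional feature vectors (for instance autoencoder latent features); the points and labels are hypothetical and the paper's own pipeline is not reproduced. For each sample, a is the mean distance to its own cluster, b the smallest mean distance to another cluster, and the silhouette is (b − a) / max(a, b).

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2-D latent features forming two well-separated groups.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_split = silhouette_score(features, [0, 0, 1, 1])
poor_split = silhouette_score(features, [0, 1, 0, 1])
```

In practice the score would be evaluated for each candidate number of clusters and the value giving the highest mean silhouette retained.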
Subject: Earth Sciences, Environmental Sciences Keywords: forest inventory; data harvesting; forest modeling; forest growth; macroecology; public data
Online: 26 November 2020 (10:38:58 CET)
Net CO2 emissions and sequestration from European forests are the result of removal and growth of flora. To arrive at aggregated measurements of these processes at a country's level, local observations of increments and harvest rates are up-scaled to national forest areas. Each country releases these statistics through its individual National Forest Inventory using its particular definitions and methodologies. In addition, five international processes deal with the harmonization and comparability of such forest datasets in Europe, namely the IPCC, SOEF, FAOSTAT, HPFFRE, FRA (definitions follow in the article). In this study, we retrieved living biomass dynamics from each of these sources for 27 European Union member states. To demonstrate the reproducibility of our method, we release an open source Python package that allows for automated data retrieval and analysis as new data becomes available. The comparison of the published values shows discrepancies in the magnitude of forest biomass changes for several countries. In some cases, the direction of these changes also differs between sources. The scarcity of the data provided, along with the low spatial resolution, precludes the creation or calibration of a pan-European forest dynamics model, which could ultimately be used to simulate future scenarios and support policy decisions. To attain these goals, an improvement in forest data availability and harmonization is needed.
ARTICLE | doi:10.20944/preprints202011.0622.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Driving Offenses; Speed Zone; Airports; Functional Data Analysis; Data-Driven Policy;
Online: 24 November 2020 (16:12:38 CET)
Risk factors for road traffic injuries, such as driving offenses and average speed, are a concern for health organizations seeking to reduce the number of injuries. Without a comprehensive view of each road, one cannot decide on an effective policy. Data-driven policy therefore helps to improve and assess decisions. Count data from roads near two airports are surveyed to investigate time-varying speed zones. Descriptive statistics, ANOVA, and functional data analysis were used. Hourly traffic count data for four different locations at the entrances of the two airports, one international and one domestic, were collected for one year, from 2018 to 2019. The hourly pattern of driving offenses for each road was assessed, and the roads to and from the airports had different peaks (p < 0.05). The hour, weekday, type of airport, direction and their interactions were statistically significant (p < 0.05) for the chance of driving offenses. The average speed during the day differed statistically (p < 0.5) by the number of different types of vehicles. Traffic count data is a great resource for decision making on safe driving topics such as driving offenses. With functional data analysis, we can analyze these data to capture most of their characteristics. Airports are public places with high traffic demand in all countries, which yields a distinct pattern of traffic; we therefore extract the factors that affect driving offenses. Finally, we conclude that introducing a time-varying speed zone near the airports seems vital.
TECHNICAL NOTE | doi:10.20944/preprints202011.0038.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: dyadic data; co-occurrence data; expectation maximization (EM) algorithm; mixture model
Online: 2 November 2020 (12:06:26 CET)
Dyadic data, also called co-occurrence data (COD), contains co-occurrences of objects. Statistical models are needed to represent dyadic data. The finite mixture model is a solid statistical model for learning and making inferences on dyadic data, because a mixture model can be built smoothly and reliably by the expectation maximization (EM) algorithm, which is well suited to the inherent sparseness of dyadic data. This research summarizes mixture models for dyadic data. When each co-occurrence in dyadic data is associated with a value, many values are unaccomplished because many co-occurrences are nonexistent. In this research, these unaccomplished values are estimated as the mean (expectation) of a random variable given the partial probabilistic distributions inside the dyadic mixture model.
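As an illustration of the kind of model this abstract summarizes, the following is a minimal sketch of EM for a symmetric dyadic mixture, assuming the factorization P(x, y) = sum over k of P(k) P(x|k) P(y|k) on a matrix of co-occurrence counts. The function names (`fit_dyadic_mixture`, `joint_prob`), the factorization, and the initialization are assumptions made here for illustration; they are not taken from the technical note, and the sketch does not cover the estimation of unaccomplished values.

```python
import random


def fit_dyadic_mixture(counts, n_components, n_iter=50, seed=0):
    """EM for P(x, y) = sum_k P(k) P(x|k) P(y|k) (hypothetical helper).

    counts[x][y] is the number of times objects x and y co-occurred.
    """
    rng = random.Random(seed)
    n_x, n_y = len(counts), len(counts[0])

    def random_dist(n):
        # Random strictly positive weights, normalized to sum to one.
        weights = [rng.random() + 0.1 for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    p_k = [1.0 / n_components] * n_components
    p_x = [random_dist(n_x) for _ in range(n_components)]
    p_y = [random_dist(n_y) for _ in range(n_components)]

    for _ in range(n_iter):
        new_k = [0.0] * n_components
        new_x = [[0.0] * n_x for _ in range(n_components)]
        new_y = [[0.0] * n_y for _ in range(n_components)]
        for x in range(n_x):
            for y in range(n_y):
                if counts[x][y] == 0:
                    continue  # sparsity: unobserved pairs contribute nothing
                # E-step: responsibility of each component for the pair (x, y).
                joint = [p_k[k] * p_x[k][x] * p_y[k][y]
                         for k in range(n_components)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for k in range(n_components):
                    r = counts[x][y] * joint[k] / total
                    new_k[k] += r
                    new_x[k][x] += r
                    new_y[k][y] += r
        # M-step: renormalize the expected counts.
        grand_total = sum(new_k)
        if grand_total == 0.0:
            break  # no observed co-occurrences to learn from
        for k in range(n_components):
            if new_k[k] > 0.0:
                p_x[k] = [v / new_k[k] for v in new_x[k]]
                p_y[k] = [v / new_k[k] for v in new_y[k]]
            p_k[k] = new_k[k] / grand_total
    return p_k, p_x, p_y


def joint_prob(p_k, p_x, p_y, x, y):
    """Model probability of the pair (x, y) under the fitted mixture."""
    return sum(p_k[k] * p_x[k][x] * p_y[k][y] for k in range(len(p_k)))
```

Only the nonzero cells of the count matrix are visited in the E-step, which is one way the EM updates can take advantage of sparse co-occurrence data.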
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Data Visualization; Visual Analytics; Natural Language Processing; Dark Data; Pattern Recognition
Online: 28 October 2020 (07:47:26 CET)
Over the years, there has been a significant rise in the world's scientific knowledge. However, most of it lacks structure and is often termed Dark Data. Both humans and expert systems have continually faced difficulty in analyzing and comprehending such overwhelming amounts of information, which is crucial in solving several real-world problems. Information and data visualization techniques offer a promising way to explore such data by allowing quick comprehension of information, the discovery of emerging trends, identification of relationships and patterns, etc. In this tutorial, we utilize the rich corpus of PubMed, comprising more than 30 million citations from biomedical literature, to visually explore and understand the underlying key insights using various information visualization techniques. With this study, we aim to reduce the limitations of human cognition and perception in handling and examining such large volumes of data by speeding up the process of decision making and pattern recognition and enabling decision-makers to fully understand data insights and make informed decisions.
TECHNICAL NOTE | doi:10.20944/preprints202009.0357.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: data; data paper; omics; metadata; workflow; standards; FAIR principles, MIxS, MINSEQE
Online: 16 September 2020 (11:04:34 CEST)
Data papers have emerged as a powerful instrument for open data publishing, obtaining credit, and establishing priority for datasets generated in scientific experiments. Academic publishing improves data and metadata quality through peer review and increases the impact of datasets by enhancing their visibility, accessibility, and re-usability. We aimed to establish a new type of article structure and template for omics studies: the omics data paper. To improve data interoperability and further incentivise researchers to publish high-quality data sets, we created a workflow for streamlined import of omics metadata directly into a data paper manuscript. An omics data paper template was designed by defining key article sections which encourage the description of omics datasets and methodologies. The workflow was based on REpresentational State Transfer services and XPath to extract information from the European Nucleotide Archive, ArrayExpress and BioSamples databases, which follow community-agreed standards. The workflow for automatic import of standard-compliant metadata into an omics data paper manuscript facilitates the authoring process. It demonstrates the importance and potential of creating machine-readable and standard-compliant metadata. The omics data paper structure and the workflow to import omics metadata improve the data publishing landscape by providing a novel mechanism for creating high-quality, enhanced metadata records and for peer reviewing and publishing them. It constitutes a powerful addition for distribution, visibility, reproducibility and re-usability of scientific data. We hope that streamlined metadata re-use for scholarly publishing encourages authors to improve the quality of their metadata to achieve a truly FAIR data world.
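The XPath-based extraction step mentioned in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the XML record, its element names (`RECORD`, `TITLE`, `ATTRIBUTE`, `TAG`, `VALUE`), the accession `X0001`, and the function `extract_metadata` are all hypothetical and do not reproduce the actual schemas of the European Nucleotide Archive, ArrayExpress, or BioSamples, nor the authors' workflow. The record is embedded as a string so that no REST call is needed to run the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical record standing in for the XML a REST service might return.
SAMPLE_RECORD = """<RECORD_SET>
  <RECORD accession="X0001">
    <TITLE>Example soil metagenome</TITLE>
    <ATTRIBUTES>
      <ATTRIBUTE><TAG>collection date</TAG><VALUE>2019-05-01</VALUE></ATTRIBUTE>
      <ATTRIBUTE><TAG>geographic location</TAG><VALUE>Example Country</VALUE></ATTRIBUTE>
    </ATTRIBUTES>
  </RECORD>
</RECORD_SET>"""


def extract_metadata(xml_text):
    """Collect accession, title, and tag/value attributes from each record."""
    root = ET.fromstring(xml_text)
    records = []
    # ElementTree supports a limited XPath subset, which is enough here.
    for record in root.findall("./RECORD"):
        attributes = {
            attribute.findtext("TAG"): attribute.findtext("VALUE")
            for attribute in record.findall("./ATTRIBUTES/ATTRIBUTE")
        }
        records.append({
            "accession": record.get("accession"),
            "title": record.findtext("TITLE"),
            "attributes": attributes,
        })
    return records


if __name__ == "__main__":
    for entry in extract_metadata(SAMPLE_RECORD):
        print(entry["accession"], entry["title"], entry["attributes"])
```

The resulting dictionaries are the kind of machine-readable structure that could then be mapped onto manuscript sections; how that mapping is done in the published workflow is not described in the abstract.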
ARTICLE | doi:10.20944/preprints202008.0487.v1
Subject: Social Sciences, Geography Keywords: Twitter; data reliability; risk communication; data mining; Google Cloud Vision API
Online: 22 August 2020 (02:32:40 CEST)
While Twitter has been touted to provide up-to-date information about hazard events, the reliability of tweets is still a concern. Our previous publication extracted relevant tweets containing information about the 2013 Colorado flood event and its impacts. Using the relevant tweets, this research further examined the reliability (accuracy and trueness) of the tweets by examining the text and image content and comparing them to other publicly available data sources. Both manual identification of text information and automated (Google Cloud Vision API) extraction of images were implemented to balance accurate information verification and efficient processing time. The results showed that both the text and images contained useful information about damaged/flooded roads/street networks. This information will help emergency response coordination efforts and informed allocation of resources when enough tweets contain geocoordinates or locations/venue names. This research will help identify reliable crowdsourced risk information to enable near-real time emergency response through better use of crowdsourced risk communication platforms.
ARTICLE | doi:10.20944/preprints202007.0330.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Data Privacy; Mobile devices; Environment Privacy; General Data Protection Regulation (GDPR).
Online: 15 July 2020 (09:30:42 CEST)
Mobile devices have made the pursuit of data privacy a constant struggle. The number of mobile devices in the world continues to increase. With this increase and with technological evolution, thousands of data items associated with each person are generated and stored remotely. Data privacy has therefore become a prominent topic in several areas, and the data in circulation needs to be controlled and managed. This article presents an approach to the interaction between the individual and the public environment, in which this interaction determines access to information. The analysis is based on a data privacy management model for public environments, created after reviewing current technologies. To substantiate this model, a location-based mobile application using the Global Positioning System (GPS) was created; it takes the General Data Protection Regulation (GDPR) into account to control and manage access to the data of each individual.
ARTICLE | doi:10.20944/preprints202005.0274.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: big data; deep learning; intelligent systems; medical imaging; multi-data processing
Online: 16 May 2020 (17:43:42 CEST)
Big Data in medicine involves the fast processing of large data sets, both current and historical, to support the diagnosis and therapy of patients' diseases. Support systems for these activities may include pre-programmed rules based on data obtained from the medical interview and on automatic analysis of diagnostic test results, leading to the classification of observations into a specific disease entity. The current Big Data revolution significantly expands the role of computer science in achieving these goals, which is why we propose a Big Data processing system that uses artificial intelligence to analyze and process medical images.
ARTICLE | doi:10.20944/preprints201811.0337.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Web API; SPARQL; micro-service; Data Integration; Linked Data; REST; Biodiversity
Online: 14 November 2018 (10:59:31 CET)
In recent years, Web APIs have become a de facto standard for exchanging machine-readable data on the Web. Despite this success, they often fail to make resource descriptions interoperable because they rely on proprietary vocabularies that lack formal semantics. The Linked Data principles similarly seek the massive publication of data on the Web, yet with the specific goal of ensuring semantic interoperability. Given their complementary goals, it is commonly admitted that cross-fertilization could stem from the automatic combination of Linked Data and Web APIs. Towards this goal, in this paper we leverage the micro-service architectural principles to define a SPARQL Micro-Service architecture, aimed at querying Web APIs using SPARQL. A SPARQL micro-service is a lightweight SPARQL endpoint that provides access to a small, resource-centric, virtual graph. In this context, we argue that full SPARQL Query expressiveness can be supported efficiently without jeopardizing server availability. Furthermore, we demonstrate how this architecture can be used to dynamically assign dereferenceable URIs to Web API resources that do not have URIs beforehand, thus literally "bringing" Web APIs into the Web of Data. We believe that the emergence of an ecosystem of SPARQL micro-services published by independent providers would enable Linked Data-based applications to easily glean pieces of data from a wealth of distributed, scalable and reliable services. We describe a working prototype implementation and finally illustrate the use of SPARQL micro-services in the context of two real-life use cases related to the biodiversity domain, developed in collaboration with the French National Museum of Natural History.
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: geographic information fusion; data quality; data consistency checking; historic GIS; railway network; patrimonial data; crowdsourcing open data; volunteer geographic information VGI; wikipedia geo-spatial information extraction.
Online: 17 August 2020 (14:51:04 CEST)
Transportation of goods is as old as human civilizations: past networks and their evolution shed light on long-term trends. The impact of transportation on climate change is considered major, as is its impact on the spread of a pandemic. These two reasons motivate the importance of providing relevant and reliable historical geographic datasets of these networks. This paper focuses on reconstructing the railway network in France at its maximal extent, a century ago. The active stations and lines are well documented by the French SNCF, in open public data. However, that information ignores past stations (before 1980), which probably outnumber those recorded in public data. Additional open data, individual or collaborative (e.g., Wikipedia), are particularly valuable, but they are not always geo-coded, and two more sources are necessary to complete that geo-coding: ancient maps and aerial photography. Therefore, remote sensing and volunteer geographic information are the two pillars of past railway reconstruction. The methods developed are adapted to the extraction of information from these sources: automated parsing of Wikipedia Infoboxes, data extraction from simple tables, even from simple text. That series of sparse procedures can be merged into a comprehensive computer-assisted process. Beyond this, a huge effort in quality control is necessary when merging these data: automated wherever possible, or finally visually controlled by observation of remote sensing information. The main output is a reliable dataset, under ODbL, of more than 9100 stations, which can be combined with the information about the 35000 communes of France, for a large variety of studies. This work demonstrates two theses: (a) it is possible to reconstruct transport network data from the past, and generic computer-assisted methods can be developed; (b) the value of remote sensing and volunteered geographic information is considerable (as archeologists already know).
ARTICLE | doi:10.20944/preprints202205.0238.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; SARS-CoV-2; Omicron; Twitter; tweets; sentiment analysis; big data; Natural Language Processing; Data Science; Data Analysis
Online: 7 July 2022 (08:36:40 CEST)
This paper presents the findings of an exploratory study of the Big Data continuously generated on Twitter related to the sharing of information, news, views, opinions, ideas, knowledge, feedback, and experiences about the COVID-19 pandemic, with a specific focus on the Omicron variant, which is the globally dominant variant of SARS-CoV-2 at this time. A total of 12028 tweets about the Omicron variant were studied, and the specific characteristics of tweets that were analyzed include sentiment, language, source, type, and embedded URLs. The findings of this study are manifold. First, from sentiment analysis, it was observed that 50.5% of tweets had the ‘neutral’ emotion. The other emotions, ‘bad’, ‘good’, ‘terrible’, and ‘great’, were found in 15.6%, 14.0%, 12.5%, and 7.5% of the tweets, respectively. Second, the findings of language interpretation showed that 65.9% of the tweets were posted in English, followed by Spanish or Castilian, French, Italian, Japanese, and other languages, which were found in 10.5%, 5.1%, 3.3%, 2.5%, and <2% of the tweets, respectively. Third, the findings from source tracking showed that “Twitter for Android” was associated with 35.2% of tweets, followed by “Twitter Web App”, “Twitter for iPhone”, “Twitter for iPad”, “TweetDeck”, and all other sources, which accounted for 29.2%, 25.8%, 3.8%, 1.6%, and <1% of the tweets, respectively. Fourth, studying the type of tweets revealed that retweets accounted for 60.8% of the tweets, followed by original tweets and replies, which accounted for 19.8% and 19.4% of the tweets, respectively. Fifth, in terms of embedded URL analysis, the most common domain embedded in the tweets was found to be twitter.com, followed by biorxiv.org, nature.com, wapo.st, nzherald.co.nz, recvprofits.com, science.org, and other URLs.
Finally, to support similar research and development in this field centered around the analysis of tweets, we have developed an open-access Twitter dataset that comprises tweets about the SARS-CoV-2 omicron variant since the first detected case of this variant on November 24, 2021.
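The percentage breakdowns reported in this abstract (by sentiment, language, source, and tweet type) are all instances of the same arithmetic: count each category and divide by the total. A minimal sketch is shown here, assuming each tweet has already been assigned one label per characteristic; the function name `percentage_shares` and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter


def percentage_shares(labels):
    """Return each label's share of the total, in percent, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: round(100.0 * n / total, 1)
        for label, n in counts.most_common()
    }


if __name__ == "__main__":
    # Toy input only; these are not the study's data.
    tweet_types = ["retweet", "retweet", "retweet", "reply"]
    print(percentage_shares(tweet_types))
```

Rounding each share to one decimal place mirrors the precision used in the abstract, although rounded shares need not sum to exactly 100%.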
Online: 28 June 2020 (19:46:40 CEST)
Objectives: Data sharing has become a requirement of many funding bodies and is becoming a scientific standard in many disciplines. In medical research, however, data sharing can conflict with clinicians’ obligation to protect patients’ privacy. General recommendations on data sharing also exist for clinical research, but so far they lack practical and Swiss-specific aspects. The objective of this document is to provide practical recommendations for all relevant aspects of data sharing in agreement with legislation in Switzerland. Methods: This document was written by members of the Swiss CTU Network, a network of academic clinical trial units. The process did not follow a formalized Delphi process. After an internal consensus round, this report is now published as a preprint for external review. A second version will incorporate external comments. We plan to publish this document as a text in progress, as we expect relevant changes in related fields such as the development of further dedicated medical repositories or methodological advances in anonymization techniques. Results: We developed principles and practical recommendations with respect to informed consent, data management plan, anonymization, data structure and format, coding of variables, metadata and documentation, version control, selection of repository, and requesting and use of data. We also provide a summary of legal aspects relevant for the Swiss context. Conclusions: The intention to share data has an impact not only after a clinical trial or an observational study is completed, but also during the planning period, the conduct, and the analysis phase. At the beginning of a study, clinical researchers need to be aware of how to inform patients and of the amount of work related to preparing data for sharing, metadata, and any further documentation.
This report provides details of aspects to be considered, suggests decision criteria, and provides examples and checklists, in order to support data sharing in practice.
ARTICLE | doi:10.20944/preprints202204.0068.v1
Subject: Mathematics & Computer Science, Computational Mathematics Keywords: Functional Data Analysis; Image Processing; Brain Imaging; Neuroimaging; Computational Neuroscience; Data Science
Online: 8 April 2022 (03:21:06 CEST)
Functional Data Analysis (FDA) is a relatively new field of statistics dealing with data expressed in the form of functions. FDA methodologies can be easily extended to the study of imaging data, an application proposed in Wang et al. (2020), where the authors establish the mathematical groundwork and properties of the proposed estimators. This methodology allows for the estimation of mean functions and simultaneous confidence corridors (SCC), also known as simultaneous confidence bands, for imaging data and for the difference between two groups of images. This is especially relevant for the field of medical imaging, as one of the most common research setups consists of the comparison between two groups of images, a pathological set against a control set. FDA applied to medical imaging presents at least two advantages compared to previous methodologies: it avoids loss of information in complex data structures and avoids the multiple comparison problem arising from traditional pixel-to-pixel comparisons. Nonetheless, computing times for this technique have only been explored in reduced and simulated setups (Arias-López et al., 2021). In the present article, we apply this procedure to a practical case with data extracted from open neuroimaging databases and then measure computing times for the construction of Delaunay triangulations, and for the computation of the mean function and SCC for one-group and two-group approaches. The results suggest that previous research has been too conservative in its parameter selection and that computing times for this methodology are reasonable, confirming that this method should be further studied and applied to the field of medical imaging.