ARTICLE | doi:10.20944/preprints201808.0029.v1
Subject: Earth Sciences, Environmental Sciences Keywords: Landsat, analysis ready data, collection 1
Online: 1 August 2018 (20:03:52 CEST)
Data that have been processed to allow analysis with a minimum of additional user effort are often referred to as Analysis Ready Data (ARD). The ability to perform large-scale Landsat analysis relies on access to observations that are geometrically and radiometrically consistent and in which non-target features (e.g., clouds) and poor-quality observations have been flagged so that they can be excluded. The United States Geological Survey (USGS) has processed the entire Landsat 4 and 5 Thematic Mapper (TM), Landsat 7 Enhanced Thematic Mapper Plus (ETM+), and Landsat 8 Operational Land Imager (OLI) and Thermal Infrared Sensor (TIRS) archive over the conterminous United States (CONUS), Alaska, and Hawaii into Landsat ARD. The ARD are available to significantly reduce the burden of pre-processing on users of Landsat data. Provision of pre-prepared ARD is intended to make it easier for users to produce Landsat-based maps of land cover and land-cover change and other derived geophysical and biophysical products. The ARD are provided as tiled, georegistered, top-of-atmosphere and atmospherically corrected products defined in a common equal-area projection, accompanied by spatially explicit quality assessment information.
REVIEW | doi:10.20944/preprints202203.0407.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: big data analytics; healthcare; data technologies; decision making; information management; EHR
Online: 31 March 2022 (12:24:19 CEST)
Big data analytics tools apply advanced analytic techniques to large and diverse volumes of data that include structured, semi-structured, and unstructured data from different sources and in different sizes, from terabytes to zettabytes. The health sector is faced with the need to generate and manage large data sets from various health systems, such as electronic health records and clinical decision support systems. These data can be used by providers, clinicians, and policymakers to plan and implement interventions, detect disease more quickly, predict outcomes, and personalize care delivery. However, little attention has been paid to the connection between big data analytics tools and the health sector. Thus, a systematic review of the bibliometric literature (LRSB) was developed to study how the adoption of big data analytics tools and infrastructures will revolutionize the healthcare industry. The review integrated 77 scientific and/or academic documents indexed in SCOPUS, presenting up-to-date knowledge on how big data analytics technologies influence the healthcare sector and the different big data analytical tools used. The LRSB provides findings on the impact of big data analytics on the health sector, introducing opportunities and technologies that provide practical solutions to various challenges.
ARTICLE | doi:10.20944/preprints202005.0274.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: big data; deep learning; intelligent systems; medical imaging; multi-data processing
Online: 16 May 2020 (17:43:42 CEST)
Big Data in medicine involves the fast processing of large data sets, both current and historical, to support the diagnosis and therapy of patients' diseases. Support systems for these activities may include pre-programmed rules based on data obtained from the medical interview, together with automatic analysis of diagnostic test results, leading to the classification of observations to a specific disease entity. The current Big Data revolution significantly expands the role of computer science in achieving these goals, which is why we propose a Big Data processing system that uses artificial intelligence to analyze and process medical images.
ARTICLE | doi:10.20944/preprints201608.0202.v2
Subject: Earth Sciences, Environmental Sciences Keywords: HR satellite remote sensing; urban fabric vulnerability; UHI & heat waves; landsat & MODIS sensors; LST & urban heating; segmentation & objects classification; data mining; feature extraction & selection; stepwise regression & model calibration
Online: 26 October 2021 (13:11:23 CEST)
Densely urbanized areas, with a low percentage of green vegetation, are highly exposed to Heat Waves (HW), which are nowadays increasing in frequency and intensity even in middle-latitude regions due to ongoing Climate Change (CC). Their negative effects may combine with those of the UHI (Urban Heat Island), a local phenomenon whereby air temperatures in the compact built-up cores of towns increase more than those in the surrounding rural areas, with significant impacts on the quality of the urban environment, on citizens' health, and on energy consumption and transport, as occurred in the summer of 2003 in France and in central-northern Italy. In this context, this work aims at designing and developing a methodology based on medium-high resolution aero-spatial remote sensing (EO) and recent GIS techniques for the extensive characterization of the urban fabric response to these temperature-related climatic impacts, within the general framework of supporting local and national strategies and policies of adaptation to CC. Due to its extension and variety of built-up typologies, the municipality of Rome was selected as the test area for the methodology development and validation. First of all, we operated through photointerpretation of cartography at a detailed scale (CTR 1:5000) on a reference area consisting of a transect of about 5x20 km, extending from the downtown to the suburbs and including all the built-up classes of interest. The reference built-up vulnerability classes found inside the transect were then exploited as training areas to classify the entire territory of the Rome municipality. To this end, the satellite EO HR (High Resolution) multispectral data provided by the Landsat sensors were used within a purpose-developed "supervised" classification procedure based on data mining and "object-classification" techniques. The classification results were then exploited to implement a calibration method, based on a typical UHI temperature distribution derived from MODIS satellite sensor LST (Land Surface Temperature) data of the summer of 2003, to obtain an analytical expression of the vulnerability model previously introduced on a semi-empirical basis.
Subject: Social Sciences, Econometrics & Statistics Keywords: poverty; composite indicators; interval data; symbolic data
Online: 24 August 2021 (15:46:09 CEST)
The analysis and measurement of poverty is a crucial issue in the field of social science. Poverty is a multidimensional notion that can be measured using composite indicators that synthesize several statistical indicators. Such indicators can, however, be affected by subjective choices. We propose interval-based composite indicators to avoid this problem, enabling us to obtain robust and reliable measures. Starting from a relevant conceptual model of poverty, we identify all the various factors to consider. Then, for each different random configuration of those factors, we compute a different composite indicator. In this way we obtain, for each region, a different interval based on the distinct factor choices, i.e., on the different assumptions for constructing the composite indicator. So we create an interval-based composite indicator from the results of a Monte Carlo simulation over all the different assumptions. The resulting intervals can be compared, and various rankings for poverty can be obtained. The poverty interval composite indicator can be considered and compared through its parameters, such as center, minimum, maximum, and range. The results demonstrate a relevant and consistent measurement of the indicator and the relevant impact of the shadow sector on the final measures.
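As a rough illustration of the interval construction described above, the following minimal Python sketch (the data, the random weighting configurations, and the weighted-average aggregation are hypothetical assumptions, not the authors' specification) keeps the per-region minimum and maximum over Monte Carlo draws as the interval bounds:

```python
# Minimal sketch: interval-based composite indicator via Monte Carlo.
# All data and parameter choices are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
# toy indicator matrix: rows = regions, cols = normalized poverty factors
X = rng.random((5, 8))

def composite(X, weights):
    """Weighted average as one possible aggregation choice."""
    return X @ weights / weights.sum()

# Monte Carlo over random weighting configurations
draws = np.empty((1000, X.shape[0]))
for k in range(draws.shape[0]):
    w = rng.random(X.shape[1])          # one random assumption set
    draws[k] = composite(X, w)

lower, upper = draws.min(axis=0), draws.max(axis=0)
center, rng_width = (lower + upper) / 2, upper - lower
for i, (lo, hi) in enumerate(zip(lower, upper)):
    print(f"region {i}: [{lo:.3f}, {hi:.3f}]")
```

The interval parameters (center, minimum, maximum, range) then support the comparisons and rankings the abstract mentions.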
ARTICLE | doi:10.20944/preprints202011.0297.v1
Online: 10 November 2020 (10:00:37 CET)
In this paper, we present a regression-based modeling approach to analyze various types of MTC data. A typical application of this modeling approach comprises three steps: first, define a model that approximates the relationship between gene expression and experimental factors, with parameters incorporated to represent the research interest; second, use least-squares and estimating-equation methods to estimate the parameters and their corresponding standard errors; third, compute test statistics, P-values, and NFD as measures of statistical significance. The advantages of this approach are as follows. First, it addresses the research interest in a specific, precise way, and maximally uses all the data and other relevant information. Second, it accounts for both systematic and random variations associated with the data, and the results of such analysis provide not only gene-specific information relevant to the research objective but also its reliability, thereby helping investigators make better decisions for subsequent studies. Third, this approach is very flexible and can easily be extended to other types of MTC studies or other microarray experiments by formulating different models based on the experimental design of the studies.
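A minimal sketch of the three-step workflow described above, assuming a simple per-gene linear model with a single two-level treatment factor (simulated data; statsmodels supplies the least-squares estimates, standard errors, and P-values):

```python
# Minimal sketch of the three-step workflow: model, estimate, test.
# Data are simulated; 'treatment' is a hypothetical experimental factor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_arrays = 12
treatment = np.repeat([0, 1], n_arrays // 2)        # design factor
X = sm.add_constant(treatment.astype(float))

for g in range(3):                                  # a few genes for brevity
    expression = 5 + 0.8 * treatment + rng.normal(0, 1, n_arrays)
    fit = sm.OLS(expression, X).fit()               # least-squares estimates
    beta, se = fit.params[1], fit.bse[1]            # parameter + std. error
    print(f"gene {g}: beta={beta:.2f} se={se:.2f} p={fit.pvalues[1]:.3g}")
```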
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Academic Analytics; data storage; education and big data; analysis of data; learning analytics
Online: 19 July 2020 (20:37:39 CEST)
Business Intelligence, defined as "the ability to understand the interrelations of the facts that are presented in such a way that it can guide the action towards achieving a desired goal", has been used since 1958 for the transformation of data into information, and of information into knowledge, to be used when making decisions in a business environment. But what would happen if we took the same principles of business intelligence and applied them to the academic environment? The answer would be the creation of Academic Analytics, a term defined as the process of evaluating and analyzing organizational information from university systems for reporting and making decisions, whose characteristics allow it to be used more and more in institutions, since the information they accumulate about their students and teachers gathers data such as academic performance, student success, persistence, and retention. Academic Analytics enables an analysis of data that is very important for making decisions in the educational institutional environment, aggregating valuable information in the academic research activity and providing easy-to-use business intelligence tools. This article presents a proposal for creating an information system based on Academic Analytics, using ASP.Net technology and relying on storage in the database engine Microsoft SQL Server, designing a model that is supported by Academic Analytics for the collection and analysis of data from the information systems of educational institutions. The proposed system is capable of displaying statistics on the historical data of students and teachers taken over academic periods, without having direct access to institutional databases, with the purpose of gathering the information that the director, the teacher, and finally the student need for making decisions. The model was validated with information taken from students and teachers during the last five years, and the data could be exported as PDF, CSV, and XLS files. The findings allow us to state that it is extremely important to analyze the data in the information systems of educational institutions for making decisions. After the validation of the model, it was established that students must know the reports of their academic performance in order to carry out a process of self-evaluation, that teachers must be able to see the results of the data obtained in order to carry out processes of self-evaluation and adaptation of content and dynamics in the classrooms, and finally that the head of the program needs this information to make decisions.
ARTICLE | doi:10.20944/preprints202206.0335.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: metadata; contextual data; harmonization; genomic surveillance; data management
Online: 24 June 2022 (08:46:04 CEST)
ARTICLE | doi:10.20944/preprints202208.0083.v1
Subject: Social Sciences, Accounting Keywords: Ratios; Financial Crisis; Covid-19; Big Data; Accounting Data
Online: 3 August 2022 (10:42:06 CEST)
The effects of the 2008 financial crisis undoubtedly caused problems not only for the banking sector but also for the real economy of developed and developing countries almost all around the globe. Besides, as is widely known, every banking crisis entails a corresponding cost to the economy of each country affected by it, which results from the shakeout and restructuring of its financial system. The purpose of this research is to investigate the consequences of the financial crisis and the COVID-19 health crisis and how these affected the course of the four systemic banks (Eurobank, Alpha Bank, National Bank, Piraeus Bank) through ratio analysis for the period 2015-2020.
ARTICLE | doi:10.20944/preprints201610.0067.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: point information gain; Rényi entropy; data processing
Online: 17 October 2016 (11:35:13 CEST)
We generalize the point information gain (PIG) and derived quantities, i.e., the point information gain entropy (PIE) and the point information gain entropy density (PIED), to the case of the Rényi entropy, and simulate the behavior of PIG for typical distributions. We also use these methods for the analysis of multidimensional datasets. We demonstrate the main properties of PIE/PIED spectra for real data using several images as examples, and discuss further possible utilizations in other fields of data processing.
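For illustration, a minimal Python sketch of PIG under one common convention (the Rényi-entropy change after removing a single occurrence of a value; the sign convention and the value of α here are assumptions, not the paper's exact definition):

```python
# Minimal sketch of point information gain (PIG) under the Rényi entropy:
# the entropy change after removing one occurrence of a value from the data.
import numpy as np

def renyi_entropy(counts, alpha):
    p = counts / counts.sum()
    p = p[p > 0]
    if alpha == 1.0:                        # Shannon limit
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def point_information_gain(data, value, alpha=2.0):
    vals, counts = np.unique(data, return_counts=True)
    h_full = renyi_entropy(counts, alpha)
    reduced = counts.copy()
    reduced[vals == value] -= 1             # drop one occurrence of `value`
    return renyi_entropy(reduced[reduced > 0], alpha) - h_full

data = np.random.default_rng(2).integers(0, 16, 10_000)
print(point_information_gain(data, 3, alpha=2.0))
```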
ARTICLE | doi:10.20944/preprints202204.0068.v1
Subject: Mathematics & Computer Science, Computational Mathematics Keywords: Functional Data Analysis; Image Processing; Brain Imaging; Neuroimaging; Computational Neuroscience; Data Science
Online: 8 April 2022 (03:21:06 CEST)
Functional Data Analysis (FDA) is a relatively new field of statistics dealing with data expressed in the form of functions. FDA methodologies can be easily extended to the study of imaging data, an application proposed in Wang et al. (2020), where the authors settle the mathematical groundwork and properties of the proposed estimators. This methodology allows for the estimation of mean functions and simultaneous confidence corridors (SCC), also known as simultaneous confidence bands, for imaging data and for the difference between two groups of images. This is especially relevant for the field of medical imaging, as one of the most common research setups consists of the comparison between two groups of images, a pathological set against a control set. FDA applied to medical imaging presents at least two advantages compared to previous methodologies: it avoids loss of information in complex data structures and avoids the multiple comparison problem arising from traditional pixel-to-pixel comparisons. Nonetheless, computing times for this technique have only been explored in reduced and simulated setups (Arias-López et al., 2021). In the present article, we apply this procedure to a practical case with data extracted from open neuroimaging databases and then measure computing times for the construction of Delaunay triangulations and for the computation of mean functions and SCC for one-group and two-group approaches. The results suggest that previous research has been too conservative in its parameter selection and that computing times for this methodology are reasonable, confirming that this method should be further studied and applied to the field of medical imaging.
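As a toy illustration of one of the timed steps, the sketch below builds a Delaunay triangulation over a small image-domain grid with SciPy and times it; the grid resolution is a placeholder, not the study's actual setup:

```python
# Minimal sketch: build and time a Delaunay triangulation over a grid,
# the kind of preprocessing step whose computing time the article measures.
import time
import numpy as np
from scipy.spatial import Delaunay

ny, nx = 64, 64                             # hypothetical slice resolution
yy, xx = np.mgrid[0:ny, 0:nx]
points = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

t0 = time.perf_counter()
tri = Delaunay(points)
print(f"{len(tri.simplices)} triangles in {time.perf_counter() - t0:.3f}s")
```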
ARTICLE | doi:10.20944/preprints201809.0073.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Software Process Analysis, Software Process Improvement, Data Provenance
Online: 4 September 2018 (16:30:51 CEST)
Companies have been increasing the amount of data that they collect from their systems and processes, given the decrease in the cost of memory and storage technologies in recent years. The emergence of technologies such as Big Data, Cloud Computing, and E-Science, and the growing complexity of information systems, have made evident that traceability and provenance are promising approaches. Provenance has been successfully used in complex domains such as health sciences, chemical industries, and scientific computing, considering that these areas require a comprehensive semantic traceability mechanism. Based on this, we investigate the use of provenance in the context of Software Processes (SP) and introduce a novel approach based on provenance concepts to model and represent SP data. It addresses SP provenance data capture, storage, inference of new information, and visualization. The main contribution of our approach is PROV-SwProcess, a provenance model that deals with the specificities of SP and supports process managers in handling vast amounts of execution data during process analysis and data-driven decision-making. A set of analysis possibilities was derived from this model using SP goals and questions. A case study was conducted in collaboration with a software development company to instantiate the PROV-SwProcess model (using the proposed approach) with real-world process data. This study showed that 87.5% of the analysis possibilities using real data were correct and can assist in decision-making, while 62.5% of them cannot be performed by the process manager using the current dashboard or process management tool.
Subject: Materials Science, Biomaterials Keywords: Microscopy Image Segmentation; Deep Learning; Data Augmentation; Synthetic Training Data; Parametric Models
Online: 1 March 2021 (13:07:00 CET)
The analysis of microscopy images has always been an important yet time-consuming process in materials science. Convolutional Neural Networks (CNNs) have been used very successfully for a number of tasks, such as image segmentation. However, training a CNN requires a large amount of hand-annotated data, which can be a problem for materials science data. We present a procedure to generate synthetic data based on ad-hoc parametric data modelling for enhancing the generalization of trained neural network models. Especially for situations where it is not possible to gather a lot of data, such an approach is beneficial and may make it feasible to train a neural network reasonably well. Furthermore, we show that targeted data generation, by adaptively sampling the parameter space of the generative models, gives superior results compared to generating random data points.
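A minimal sketch of the general idea, assuming a toy parametric model (random ellipses standing in for imaged particles) rather than the authors' actual generative models; the image/mask pair could feed a segmentation network:

```python
# Minimal sketch of parametric synthetic data for segmentation training:
# random ellipses as "particles" plus noise; the mask is the ground truth.
# Parameters (sizes, counts, noise level) are illustrative assumptions.
import numpy as np

def synth_sample(size=128, n_particles=12, noise=0.1, rng=None):
    rng = rng or np.random.default_rng()
    yy, xx = np.mgrid[0:size, 0:size]
    mask = np.zeros((size, size), dtype=bool)
    for _ in range(n_particles):
        cx, cy = rng.uniform(0, size, 2)
        a, b = rng.uniform(4, 14, 2)        # ellipse axes: model parameters
        theta = rng.uniform(0, np.pi)
        xr = (xx - cx) * np.cos(theta) + (yy - cy) * np.sin(theta)
        yr = -(xx - cx) * np.sin(theta) + (yy - cy) * np.cos(theta)
        mask |= (xr / a) ** 2 + (yr / b) ** 2 <= 1.0
    image = mask.astype(float) + rng.normal(0, noise, (size, size))
    return image, mask

image, mask = synth_sample(rng=np.random.default_rng(3))
print(image.shape, mask.mean())             # image and foreground fraction
```

Adaptive sampling, as the abstract describes, would bias the parameter draws toward regions where the trained model currently performs worst.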
ARTICLE | doi:10.20944/preprints202103.0623.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: SARS-CoV-2; Big Data; Data Analytics; Predictive Models; Schools
Online: 25 March 2021 (14:35:53 CET)
Background: CoronaVirus Disease 2019 (COVID-19) was the most discussed topic worldwide in 2020, and at the beginning of the Italian epidemic, scientists tried to understand the virus diffusion and the epidemic curve of positive cases, with controversial findings and numbers. Objectives: In this paper, a data analytics study on the diffusion of COVID-19 in the Lombardy Region and the Campania Region is developed in order to identify the driver that sparked the second wave in Italy. Methods: Starting from all the available official data collected about the diffusion of COVID-19, we analyzed Google mobility data, school data, and infection data for two big regions in Italy: the Lombardy Region and the Campania Region, which adopted two different approaches to opening and closing schools. To reinforce our findings, we also extended the analysis to the Emilia-Romagna Region. Results: The paper shows the impact that different school opening/closing policies may have had on the spread of COVID-19. Conclusions: The paper shows that a clear correlation exists between school contagion and the subsequent temporal overall contagion in a geographical area.
ARTICLE | doi:10.20944/preprints202204.0261.v1
Subject: Earth Sciences, Atmospheric Science Keywords: PM2.5; Aerosol Optical Depth; Data assimilation; MODIS; satellite data; Objective analysis
Online: 27 April 2022 (11:32:49 CEST)
We used the objective analysis method in conjunction with the successive correction method to assimilate MODerate resolution Imaging Spectroradiometer (MODIS) Aerosol Optical Depth (AOD) data into the Chimère model, in order to improve the modeling of fine particulate matter (PM2.5) concentrations and the AOD field over Europe. A data assimilation module was developed to adjust the daily initial total-column aerosol concentrations based on a forecast-analysis cycling scheme. The model is then evaluated over a one-month winter period to examine how such a data assimilation technique pushes the model results closer to surface observations. This comparison showed that the mean biases of surface PM2.5 concentrations and of the AOD field could be reduced from -34 to -15% and from -45 to -27%, respectively. The assimilation however leads to false alarms because of the difficulty of distributing AOD550 over different particle sizes. The impact of the influence radius is found to be small and depends on the density of satellite data. This work, although preliminary, is important for near-real-time air quality forecasting using the Chimère model and can be further developed to improve modeled PM2.5 and ozone concentrations.
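For orientation, a minimal one-dimensional sketch of a single successive-correction pass with Cressman-type weights (the weight function, influence radius, and all values are illustrative assumptions, not the implemented module):

```python
# Minimal sketch of one successive-correction pass (Cressman weights) on a
# 1-D field, the scheme the abstract applies to AOD; values are toy numbers.
import numpy as np

def cressman_pass(grid_x, background, obs_x, obs, radius):
    analysis = background.copy()
    # innovation = observation minus background interpolated to obs location
    innov = obs - np.interp(obs_x, grid_x, background)
    for i, x in enumerate(grid_x):
        d2 = (obs_x - x) ** 2
        w = np.where(d2 < radius**2, (radius**2 - d2) / (radius**2 + d2), 0.0)
        if w.sum() > 0:
            analysis[i] += np.sum(w * innov) / w.sum()
    return analysis

grid_x = np.linspace(0, 100, 101)
background = np.full_like(grid_x, 10.0)            # model first guess
obs_x, obs = np.array([30.0, 60.0]), np.array([14.0, 7.0])
print(cressman_pass(grid_x, background, obs_x, obs, radius=15.0)[25:35])
```

The sensitivity to `radius` in this sketch mirrors the influence-radius experiment the abstract reports.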
ARTICLE | doi:10.20944/preprints201910.0146.v2
Subject: Life Sciences, Molecular Biology Keywords: NGS data analysis; bioinformatics pipelines; NGS pipelines
Online: 8 April 2020 (06:21:10 CEST)
Next-generation sequencing (NGS) has been a widely used technology in biomedical research for understanding the role of molecular genetics of cells in health and disease. A variety of computational tools have been developed to analyse the vastly growing NGS data, which often require bioinformatics skills, tedious work, and a significant amount of time. To facilitate the data processing steps and bridge the gap between biologists and bioinformaticians, we developed CSI NGS Portal, an online platform which gathers established bioinformatics pipelines to provide fully automated NGS data analysis and sharing in a user-friendly website. The portal currently provides 16 standard pipelines for analysing data from DNA, RNA, smallRNA, ChIP, RIP, 4C, SHAPE, circRNA, eCLIP, Bisulfite and scRNA sequencing, and is flexible to expand with new pipelines. Users can upload raw data in fastq format and submit jobs in a few clicks, and the results are self-accessible via the portal to view/download/share in real-time. The output can be readily used as the final report or as input for other tools depending on the pipeline. Overall, CSI NGS Portal helps researchers rapidly analyse their NGS data and share results with colleagues without the aid of a bioinformatician. The portal is freely available at: https://csibioinfo.nus.edu.sg/csingsportal
ARTICLE | doi:10.20944/preprints201609.0027.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: customer complaint process improvement; customer complaint service; big data analysis
Online: 7 September 2016 (11:38:33 CEST)
With the advances in industry and commerce, passengers have become more accepting of environmental sustainability issues; thus, more people now choose to travel by bus. Government administration constitutes an important part of bus transportation services, as the government grants the right-of-way to transportation companies, allowing them to provide services. When these services are of poor quality, passengers may lodge complaints. The increase in consumer awareness and developments in wireless communication technologies have made it possible for passengers to easily and immediately submit complaints about transportation companies to government institutions, which has brought drastic changes to the supply-demand chain comprising the public sector, transportation companies, and passengers. This study proposed the use of big data analysis technology, including systematized case assignment and data visualization, to improve management processes in the public sector and optimize customer complaint services. Taichung City, Taiwan was selected as the research area. There, the customer complaint management process in the public sector was improved, effectively solving issues such as station-skipping, allowing the public sector to fully grasp the service level of transportation companies, improving the sustainability of bus operations, and supporting the sustainable development of the public sector-transportation company-passenger supply chain.
ARTICLE | doi:10.20944/preprints202102.0593.v2
Subject: Medicine & Pharmacology, Other Keywords: Hospital admissions; care homes; COVID-19; linked data; administrative data
Online: 25 May 2021 (10:33:46 CEST)
Background: Care home residents have complex healthcare needs but may have faced barriers to accessing hospital treatment during the first wave of the COVID-19 pandemic. Objective: To examine trends in the number of hospital admissions for care home residents during the first months of the COVID-19 outbreak. Methods: Retrospective analysis of a national linked dataset on hospital admissions for residential and nursing home residents in England (257,843 residents, 45% in nursing homes) between 20 January 2020 and 28 June 2020, compared to admissions during the corresponding period in 2019 (252,432 residents, 45% in nursing homes). Elective and emergency admission rates, normalised to the time spent in care homes across all residents, were derived for the first three months of the pandemic, between 1 March and 31 May, and primary admission reasons for this period were compared across years. Results: Hospital admission rates rapidly declined during early March 2020 and remained substantially lower than in 2019 until the end of June. Between March and May, 2,960 admissions from residential homes (16.2%) and 3,295 admissions from nursing homes (23.7%) were for suspected or confirmed COVID-19. Rates of other emergency admissions decreased by 36% for residential and by 38% for nursing home residents (13,191 fewer admissions in total). Emergency admissions for acute coronary syndromes fell by 43% and 29% (105 fewer admissions), and emergency admissions for stroke fell by 17% and 25% (128 fewer admissions), for residential and nursing home residents, respectively. Elective admission rates declined by 64% for residential and by 61% for nursing home residents (3,762 fewer admissions). Conclusions: This is the first study showing that care home residents' hospital use declined during the first wave of COVID-19, potentially resulting in substantial unmet health need that will need to be addressed alongside ongoing pressures from COVID-19.
ARTICLE | doi:10.20944/preprints201704.0169.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: thermopile sensor; actimetry; thermal camera; data classification; tele-medicine; polysomnography
Online: 26 April 2017 (12:27:38 CEST)
This paper addresses the development of a new technique in the sleep analysis domain. Sleep is defined as a periodic physiological state during which vigilance is suspended and reactivity to external stimulation is diminished. We sleep on average between six and nine hours per night, and our sleep is composed of four to six cycles of about 90 minutes each. Each of these cycles is composed of a succession of several stages of sleep, more or less deep. The analysis of sleep is usually done using polysomnography. This examination consists of recording, among other things, electrical cerebral activity by electroencephalography (EEG), ocular movements by electrooculography (EOG), and chin muscle tone by electromyography (EMG). The recording is done mostly in a hospital, more specifically in a unit for monitoring pathologies related to sleep. The readings are then interpreted manually by an expert to generate a hypnogram, a curve showing the succession of sleep stages during the night in 30-second epochs. The proposed method is based on the follow-up of the thermal signature, which makes it possible to classify the activity into three classes: "awakening", "calm sleep" and "agitated sleep". The contribution of this non-invasive method is part of the screening of sleep disorders, to be validated by a more complete analysis of the sleep. The measure provided by this new system, based on temperature monitoring (patient and ambient), aims to be integrated into the tele-medicine platform developed within the framework of the Smart-EEG project by the SYEL - SYstèmes ELectroniques team. Analysis of the data collected during the first surveys carried out with this method showed a correlation between the thermal signature and activity during sleep. The advantage of this method lies in its simplicity and the possibility of carrying out measurements of activity during sleep without direct contact with the patient, at home or in hospitals.
ARTICLE | doi:10.20944/preprints201802.0065.v3
Online: 26 February 2018 (15:38:23 CET)
This study attempts to assess the impact of corruption on economic growth in the Mediterranean countries during the period from 1998 to 2007. Econometric analysis using panel regression has been adopted to test this effect. Individual effects models, such as the random effects model and the fixed effects model, were applied to the study sample of 160 observations, and several tests were implemented to choose the suitable model. For our analysis, we used a basic model that includes the dependent variable GDP per capita as a factor of economic growth and the corruption perception index as the independent variable of interest. We then completed the model with several standardized macroeconomic control variables and applied the individual effects models. The outcomes illustrate that corruption has a negative impact on the selected Mediterranean countries' economic growth.
ARTICLE | doi:10.20944/preprints202205.0334.v1
Online: 24 May 2022 (11:47:39 CEST)
In the past decades, a significant rise in the adoption of streaming applications has changed the decision-making process for the industry and academia sectors. This movement led to the emergence of a plurality of Big Data technologies such as Apache Storm, Spark, Heron, Samza, Flink, and other systems that provide in-memory processing for real-time Big Data analysis at high throughput. Spark Streaming represents one of the most popular open-source implementations; it handles ever-increasing data ingestion and processing by using the Unified Memory Manager to dynamically manage memory occupancy between storage and processing regions, which is the focus of this study. The problem behind memory management for data-intensive stream processing pipelines is that the incoming data arrive faster than the downstream operators can consume them. Consequently, the backpressure of Spark acts in the opposite direction of the downstream operators. In such a case, the incoming data overwhelm the memory manager and provoke memory leak issues. As a result, application performance suffers, e.g., through high latency, low throughput, or even data loss. The initial intuition motivating our work is therefore that memory management has become the critical factor for keeping processing at scale and preserving the system stability of Spark. This work provides a deep dive into Spark backpressure, evaluates its structure, presents the main characteristics to support data-intensive streaming pipelines, and investigates the current in-memory performance issues.
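For context, the sketch below shows the standard Spark configuration knobs this discussion revolves around, the Unified Memory Manager split and DStream backpressure (the values are illustrative, not tuning recommendations):

```python
# Minimal sketch of the Spark knobs discussed: Unified Memory Manager split
# and streaming backpressure; values are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("streaming-memory-sketch")
    # fraction of heap shared by execution + storage (unified region)
    .config("spark.memory.fraction", "0.6")
    # share of the unified region reserved for storage before eviction
    .config("spark.memory.storageFraction", "0.5")
    # let the receiver rate adapt to downstream processing capacity
    .config("spark.streaming.backpressure.enabled", "true")
    .getOrCreate()
)
print(spark.sparkContext.getConf().get("spark.memory.fraction"))
```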
REVIEW | doi:10.20944/preprints202211.0161.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: High Performance Computing (HPC); big data; High Performance Data Analytics (HPDA); convergence; data locality; Spark; Hadoop; design patterns; process mapping; in-situ data analysis
Online: 9 November 2022 (01:38:34 CET)
Big data has revolutionised science and technology, leading to the transformation of our societies. High Performance Computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence and related methods. Traditionally, HPC and big data focused on different problem domains and grew into two different ecosystems. Efforts have been underway for the last few years to bring the best of both paradigms into converged HPC and big data architectures. Designing HPC and big data converged systems is a hard task, requiring careful placement of data, analytics, and other computational tasks such that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realisation of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of HPDA system design, as moving even a byte costs heavily in both time and energy as the size of the system increases. Performance in terms of time and energy is the most important factor for users; energy particularly so, due to it being the major hurdle in high performance system design and the increasing focus on green energy systems driven by environmental sustainability. Data locality is a broad term that encapsulates different aspects, including bringing computations to data, minimising data movement by efficient exploitation of cache hierarchies, reducing intra- and inter-node communications, locality-aware process and thread mapping, and in-situ and in-transit data analysis. This paper provides an extensive review of the state of the art in data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions. Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.
Subject: Mathematics & Computer Science, Other Keywords: chemometric data; sparse autoencoder; Gaussian process regressor; Pareto optimization
Online: 9 May 2019 (11:31:46 CEST)
We propose a deep learning-based chemometric data analysis technique. We trained an L2-regularized sparse autoencoder end-to-end to reduce the size of the feature vector, addressing the classic curse-of-dimensionality problem in chemometric data analysis. We introduce a novel technique for the automatic selection of the number of nodes in the hidden layer of an autoencoder through Pareto optimization. Moreover, linear regression, ϵ-SVR, and a Gaussian process regressor are applied to the reduced-size feature vector for the regression. We evaluated our technique on orange juice and wine datasets, and the results are compared against state-of-the-art methods. Quantitative results are reported in terms of the Normalized Mean Square Error (NMSE) and show considerable improvement over the state of the art.
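A minimal sketch of the described pipeline under simplifying assumptions (toy data, a fixed hidden-layer width instead of the Pareto-optimized node selection, Keras defaults):

```python
# Minimal sketch: an L2-regularized sparse autoencoder compresses the
# spectra, then a Gaussian process regresses the target on the code.
# Layer sizes and data are hypothetical; Pareto node selection is omitted.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(4)
X = rng.random((200, 700))                 # toy "spectra"
y = X[:, :10].sum(axis=1)                  # toy regression target

inp = keras.Input(shape=(700,))
code = layers.Dense(32, activation="relu",
                    kernel_regularizer=regularizers.l2(1e-4),
                    activity_regularizer=regularizers.l1(1e-5))(inp)  # sparsity
out = layers.Dense(700, activation="linear")(code)
auto = keras.Model(inp, out)
auto.compile(optimizer="adam", loss="mse")
auto.fit(X, X, epochs=5, batch_size=32, verbose=0)   # end-to-end reconstruction

encoder = keras.Model(inp, code)
Z = encoder.predict(X, verbose=0)          # reduced-size feature vector
gp = GaussianProcessRegressor().fit(Z, y)
print(gp.predict(Z[:3]))
```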
ARTICLE | doi:10.20944/preprints201804.0144.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: big data; SIEM; correlation analysis; cyber crime profiling
Online: 11 April 2018 (08:39:02 CEST)
SIEM adoption is increasing, in order to detect threat patterns in a short period of time from a large amount of structured/unstructured data, to precisely diagnose the severity of threats, and to provide an accurate alarm to an administrator by correlating the collected information. However, it is difficult to quickly recognize and handle various attack situations using a solution equipped with complicated functions during security monitoring. To overcome this situation, a new detection analysis process has been required, and there is an effort to increase response speed during security monitoring and to expand accurate linkage analysis technology. In this paper, reflecting these requirements, we design and propose a profiling auto-generation model that can improve the efficiency and speed of attack detection for potential threats.
CONCEPT PAPER | doi:10.20944/preprints201901.0246.v2
Subject: Life Sciences, Endocrinology & Metabolomics Keywords: reproducibility; minimum guidelines; reporting; data analysis
Online: 8 March 2019 (09:06:02 CET)
Despite the proposal of minimum reporting guidelines for metabolomics over a decade ago, reporting on the data analysis step in metabolomics studies has been shown to be unclear and incomplete. Major omissions and a lack of logical flow render the data analysis sections in metabolomics studies impossible to follow, and therefore to replicate or even imitate. Here, we propose possible reasons why the original reporting guidelines have had poor adherence and present an approach to improve their uptake. We present in this paper an R Markdown reporting template file that guides the production of text and generates workflow diagrams based on user input. This R Markdown template contains, as an example in this instance, a set of minimum information requirements specifically for the data pre-treatment and data analysis sections of biomarker discovery metabolomics studies (gleaned directly from the guidelines originally proposed by Goodacre et al.). These minimum requirements are presented in the format of a questionnaire checklist in an R Markdown template file. The R Markdown reporting template proposed here can serve as a starting point to encourage the data analysis section of a metabolomics manuscript to have a more logical presentation and to contain enough information to be understandable and reusable. The idea is that these guidelines would be open to user feedback, modification, and updating by the metabolomics community via GitHub.
ARTICLE | doi:10.20944/preprints201806.0279.v2
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: galaxy morphology, machine learning; data analysis; object classification
Online: 22 October 2018 (13:01:42 CEST)
Automated machine classifications of galaxies are necessary because the size of upcoming surveys will overwhelm human volunteers. We improve upon existing machine classification methods by adding the output of SpArcFiRe to the inputs of a machine learning model. We use the human classifications from Galaxy Zoo 1 (GZ1) to train a random forest of decision trees to reproduce the human vote distributions of the Spiral class. We prefer the random forest model over other black-box models like neural networks because it allows us to trace post hoc the precise reasoning behind the classification of each galaxy. We find that, across a sample of 470,000 Sloan galaxies that are large enough that details could be seen if they were there, the combination of SpArcFiRe outputs with existing SDSS features provides a better machine classification than either one alone, in comparison to Galaxy Zoo 1. We suggest that adding SpArcFiRe outputs as features to any machine learning algorithm will likely improve its performance.
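A minimal sketch of the modeling step, with placeholder features standing in for the SDSS and SpArcFiRe inputs (the real feature sets and vote-fraction targets come from the survey and GZ1 data):

```python
# Minimal sketch: a random forest regressing the GZ1 spiral vote fraction
# on SDSS photometry plus SpArcFiRe outputs. Features here are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n = 1000
sdss = rng.random((n, 4))                  # e.g. colors, concentration (toy)
sparcfire = rng.random((n, 3))             # e.g. arc count, pitch angle (toy)
X = np.hstack([sdss, sparcfire])
vote_fraction = rng.random(n)              # GZ1 spiral vote fraction (toy)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, vote_fraction)
print(forest.feature_importances_)         # traceable, unlike a neural net
```

The feature importances illustrate the post hoc traceability the abstract cites as the reason for preferring a random forest.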
ARTICLE | doi:10.20944/preprints202206.0347.v1
Subject: Social Sciences, Geography Keywords: mobile network data; call detail records; data analysis; human mobility; urban mobility; social sensing; urban geography; urban sociology; commuting; sustainability
Online: 27 June 2022 (04:04:09 CEST)
The analysis of human movement patterns based on mobile network data makes it possible to examine a very large population cost-effectively, and has led to several discoveries about human dynamics. However, the application of this data source is still not common practice. The goal of this study was to analyze the commuting tendencies of the Budapest Metropolitan Area using mobile network data and to propose an automatized alternative to the current, questionnaire-based method. Commuting is predominantly analyzed via the census, but that is performed only once in a decade in Hungary. To analyze commuting, the home and work locations of the subscribers are determined based on their appearances during and outside working hours. The home locations were compared to census data at the settlement level. Then, the settlement- and district-level commuting tendencies were identified and compared to the findings of census-based sociological studies. It was found that commuting analysis based on mobile network data strongly correlates with the census-based findings, even though home and work locations were estimated by statistical methods. All the examined aspects, including commuting from sectors of the agglomeration to the districts of Budapest and the demographic distribution of the commuters, show that mobile network data can be an automatized, fast, cost-effective, and relatively accurate way of analyzing commuting, and could provide a powerful tool for sociologists interested in commuting.
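A minimal pandas sketch of the home/work heuristic described above, assuming toy call-detail records and illustrative working hours:

```python
# Minimal sketch of the home/work heuristic: the most-visited cell outside
# working hours is "home", within them "work". Columns/hours are assumptions.
import pandas as pd

records = pd.DataFrame({                   # toy call-detail records
    "subscriber": [1, 1, 1, 1, 2, 2],
    "cell":       ["A", "A", "B", "B", "C", "D"],
    "hour":       [22, 23, 10, 11, 21, 9],
})

work_hours = records["hour"].between(9, 17)

def top_cell(df):
    """Most frequent cell within one subscriber's records."""
    return df.groupby("cell").size().idxmax()

home = records[~work_hours].groupby("subscriber").apply(top_cell)
work = records[work_hours].groupby("subscriber").apply(top_cell)
print(pd.DataFrame({"home": home, "work": work}))
```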
ARTICLE | doi:10.20944/preprints202104.0529.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: censored data; machine learning; deep learning; DNNSurv; survival analysis
Online: 20 April 2021 (11:15:02 CEST)
With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are being generated. How to analyze such data effectively therefore becomes a challenge. Machine learning (ML) algorithms have been widely applied for modelling nonlinear and complicated interactions in a variety of practical fields, including high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Thus, a Cox-based DNN prediction survival model (DNNSurv model), built with Keras and TensorFlow, was developed. However, its results were only evaluated on survival datasets that are high-dimensional or have large sample sizes. In this paper, we evaluate the prediction performance of the DNNSurv model using ultra-high-dimensional and high-dimensional survival datasets and compare it with three popular ML survival prediction models (i.e., random survival forest and Cox-based LASSO and Ridge models). For this purpose we also present the optimal setting of several hyper-parameters, including the selection of the tuning parameter. The data analysis demonstrates that the DNNSurv model performs overall well compared with the ML models, in terms of the three main evaluation measures for survival prediction performance (i.e., concordance index, time-dependent Brier score, and time-dependent AUC).
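For reference, a naive sketch of the first evaluation measure named above, the concordance index (an O(n²) implementation on toy data; production code would use a survival-analysis library):

```python
# Minimal sketch of the concordance index: the fraction of comparable
# pairs that the model's risk scores order correctly.
import numpy as np

def concordance_index(time, event, risk):
    """Naive C-index; higher risk should mean shorter survival."""
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # pair is comparable if subject i failed before time[j]
            if event[i] == 1 and time[i] < time[j]:
                den += 1
                num += 1.0 if risk[i] > risk[j] else 0.5 * (risk[i] == risk[j])
    return num / den

time = np.array([5.0, 8.0, 3.0, 9.0])
event = np.array([1, 0, 1, 1])             # 1 = event observed, 0 = censored
risk = np.array([2.1, 0.4, 3.3, 0.2])      # model risk scores (toy)
print(concordance_index(time, event, risk))
```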
ARTICLE | doi:10.20944/preprints202205.0238.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; SARS-CoV-2; Omicron; Twitter; tweets; sentiment analysis; big data; Natural Language Processing; Data Science; Data Analysis
Online: 7 July 2022 (08:36:40 CEST)
This paper presents the findings of an exploratory study on the continuously generated Big Data on Twitter related to the sharing of information, news, views, opinions, ideas, knowledge, feedback, and experiences about the COVID-19 pandemic, with a specific focus on the Omicron variant, which is the globally dominant variant of SARS-CoV-2 at this time. A total of 12,028 tweets about the Omicron variant were studied, and the specific characteristics of tweets that were analyzed include sentiment, language, source, type, and embedded URLs. The findings of this study are manifold. First, from sentiment analysis, it was observed that 50.5% of tweets had a ‘neutral’ emotion. The other emotions - ‘bad’, ‘good’, ‘terrible’, and ‘great’ - were found in 15.6%, 14.0%, 12.5%, and 7.5% of the tweets, respectively. Second, the findings of language interpretation showed that 65.9% of the tweets were posted in English. It was followed by Spanish or Castilian, French, Italian, Japanese, and other languages, which were found in 10.5%, 5.1%, 3.3%, 2.5%, and <2% of the tweets, respectively. Third, the findings from source tracking showed that “Twitter for Android” was associated with 35.2% of tweets. It was followed by “Twitter Web App”, “Twitter for iPhone”, “Twitter for iPad”, “TweetDeck”, and all other sources, which accounted for 29.2%, 25.8%, 3.8%, 1.6%, and <1% of the tweets, respectively. Fourth, studying the type of tweets revealed that retweets accounted for 60.8% of the tweets; they were followed by original tweets and replies, which accounted for 19.8% and 19.4% of the tweets, respectively. Fifth, in terms of embedded URL analysis, the most common domain embedded in the tweets was found to be twitter.com, followed by biorxiv.org, nature.com, wapo.st, nzherald.co.nz, recvprofits.com, science.org, and other URLs. Finally, to support similar research and development in this field centered around the analysis of tweets, we have developed an open-access Twitter dataset that comprises tweets about the SARS-CoV-2 Omicron variant since the first detected case of this variant on November 24, 2021.
ARTICLE | doi:10.20944/preprints202204.0016.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: natural language processing; risk management; transmission lines; unstructured data
Online: 4 April 2022 (11:26:15 CEST)
Risk management of electric power transmission lines requires knowledge from different areas, such as the environment, land, investors, regulations, and engineering. Despite the widespread availability of databases for most of those areas, integrating them into a single database or model is a challenging problem. Instead, in this paper, we use a single source, the Brazilian National Electric Energy Agency's (ANEEL) weekly reports, which contain decisions about the electrical grid covering most of these areas. Since the data are unstructured (text), we employed NLP techniques such as stemming and tokenization to identify keywords related to common causes of risks provided by an expert group on energy transmission. Then, we used models to estimate the probability of each risk. Our results show that we were able to estimate the probability of 97 risks out of 233.
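A minimal sketch of the keyword-extraction step, assuming English text and NLTK's Porter stemmer for illustration (ANEEL reports are in Portuguese, so a Portuguese-capable stemmer would be used in practice; the keyword list is hypothetical):

```python
# Minimal sketch: tokenize and stem report text, then count hits against
# an expert keyword list. Words are hypothetical illustrations.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize   # requires nltk 'punkt' data

stemmer = PorterStemmer()
risk_keywords = {stemmer.stem(w) for w in ["license", "delay", "expropriation"]}

report = "Licensing delays and expropriations affected the transmission line."
stems = [stemmer.stem(t.lower()) for t in word_tokenize(report)]
hits = [s for s in stems if s in risk_keywords]
print(hits)                                # stems matching expert keywords
```

Keyword frequencies of this kind could then feed the probability models the abstract mentions.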
ARTICLE | doi:10.20944/preprints202011.0622.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Driving Offenses; Speed Zone; Airports; Functional Data Analysis; Data-Driven Policy
Online: 24 November 2020 (16:12:38 CET)
Road traffic injury risk factors such as driving offenses and average speed are concerns for health organizations seeking to reduce the number of injuries. Without a comprehensive view of each road, one cannot decide on an effective policy. In this manner, data-driven policy will help to improve and assess decisions. Traffic count data near the roads of two airports were surveyed to investigate time-varying speed zones. Descriptive statistics, ANOVA, and functional data analysis were used. The hourly traffic count data for four different locations at the entrances of the two airports, international and domestic, were collected for one year, 2018 to 2019. The hourly pattern of driving offenses for each road was assessed, and the roads to and from the airports had different peaks (p<0.05). The hour, weekday, type of airport, direction, and their interactions were statistically significant (p<0.05) for the chance of driving offenses. The average speed during the day differed statistically (p<0.05) by the number of different types of vehicles. Traffic count data are a great resource for decision-making on safe driving subjects such as driving offenses. With functional data analysis, we can analyze them to capture most of the characteristics of these data. Airports are public places with high traffic demand in all countries, which yields different patterns of traffic transportation; therefore, we extracted the factors that affect driving offenses. Finally, we conclude that establishing a time-varying speed zone near the airports seems vital.
ARTICLE | doi:10.20944/preprints202104.0389.v1
Online: 14 April 2021 (16:06:14 CEST)
In this study, we first examined the trend in meteorological data from the Harmaleh, Vanai and Farsesh stations over a 50-year period in the Dez catchment area. The meteorological data were then forecast using SWAT and Mann-Kendall. Forecasting with the Mann-Kendall and SWAT models was done using code written in MATLAB and the RCP (4.5, 8.5) scenarios, respectively. The trend analysis of the data from the meteorological stations in this catchment area indicated that both parametric and non-parametric methods can be used to determine trends in meteorological data. The results of the parametric method are positive for all meteorological parameters. The non-parametric method over a period of 50 years shows the presence of trends in the data. The comparison of the forecasting results for maximum temperature suggested that during summer we will see an increase in temperature compared to the baseline in all three forecasts. The results of the minimum temperature forecast show a decrease in the minimum increase during the winter, and the precipitation forecast indicates that at the end of autumn (Nov) precipitation decreases by 20 mm under the Mann-Kendall and RCP4.5 scenarios, while RCP8.5 suggests an increase in precipitation compared to the baseline. The runoff forecast results using SWAT show that at the end of winter (Feb) and almost all spring (Mar, Apr) decreases of about 40%, 15% and 14% will be seen, respectively.
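For illustration, a minimal Python sketch of the Mann-Kendall trend statistic applied to synthetic annual data (no tie correction; the study's actual implementation is in MATLAB):

```python
# Minimal sketch of the Mann-Kendall test: S sums the signs of all pairwise
# differences; the normal approximation gives the Z score. Data are toy values.
import numpy as np
from scipy.stats import norm

def mann_kendall(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0          # no tie correction
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    return s, z, 2 * (1 - norm.cdf(abs(z)))           # two-sided p-value

rng = np.random.default_rng(6)
annual_precip = 600 + 1.5 * np.arange(50) + rng.normal(0, 30, 50)
print(mann_kendall(annual_precip))                     # (S, Z, p)
```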
ARTICLE | doi:10.20944/preprints201709.0078.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: project portfolio configuration; synergetic management; data envelopment analysis; efficiency evaluation
Online: 18 September 2017 (11:25:21 CEST)
Project portfolio configuration (PPC) is an important approach for maintaining the sustainable development of enterprises and achieving organizations' strategies. However, the synergetic efficacy of PPC, which determines the degree to which the project's strategic objectives are achieved, is a fuzzy problem that is hard to measure. To solve this problem, this paper uses data envelopment analysis (DEA) as a tool to measure the efficacy of PPC under deterministic conditions. First, a portfolio evaluation index system that takes both financial and non-financial indicators into consideration is developed based on a review of the literature. Second, an evaluation model based on DEA is built to reduce the number of decision-making units from the perspective of synergetic theory. Then, a computational experiment is studied to verify the feasibility of the proposed model. The results of this computational experiment show that the model can effectively narrow the scope of decision-making, improve the decision-making level, and provide a reference for deciding the DEA-effective project portfolio decision-making unit. To our knowledge, this study is the first to apply the notion of synergetic efficacy and DEA to the PPC domain. It is hoped that this paper may shed light on further studies of PPC and the sustainable development of enterprise competitiveness.
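For orientation, a minimal sketch of the classic input-oriented CCR DEA model solved as a linear program (toy inputs and outputs; the paper's evaluation index system and synergetic reduction are not reproduced here):

```python
# Minimal sketch of CCR DEA: for each decision-making unit (DMU), find
# output/input weights maximizing its efficiency, subject to no DMU
# exceeding efficiency 1. Data are toy numbers.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 3.0], [4.0, 2.0], [3.0, 5.0]])   # inputs per DMU
Y = np.array([[1.0], [2.0], [1.5]])                  # outputs per DMU

def ccr_efficiency(o):
    m, s = X.shape[1], Y.shape[1]
    c = np.concatenate([-Y[o], np.zeros(m)])          # maximize u . y_o
    A_ub = np.hstack([Y, -X])                         # u . y_j - v . x_j <= 0
    b_ub = np.zeros(len(X))
    A_eq = np.concatenate([np.zeros(s), X[o]])[None]  # v . x_o = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])
    return -res.fun

for o in range(len(X)):
    print(f"DMU {o}: efficiency {ccr_efficiency(o):.3f}")
```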
ARTICLE | doi:10.20944/preprints202005.0399.v2
Subject: Mathematics & Computer Science, Analysis Keywords: IR; ML; Data Analysis; COVID-19; Coronavirus; Pandemic
Online: 28 May 2020 (03:09:38 CEST)
The world is facing new challenges every day; however, with the spread of the pandemic around the world, this new challenge is different. The pandemic is increasing and concentrating various challenges simultaneously. Although different sectors are facing consequences, the most important sectors, health and the economy, are the most affected. When the pandemic began, it was not known how long it would last, which complicated health and economic planning. Therefore, it is important for decision makers and the public to know the predictions and expectations for the future of these challenges. In this work, the current situation is analyzed. Then, an expectation model is developed based on the statistics of the pandemic, using a growth rate model based on exponential and logarithmic rates of increase. Based on the available open data about the pandemic spread, the model can successfully predict future expectations, including the duration and maximum number of cases of the pandemic. The model uses the equilibrium point as the day the cases decrease. The model can be used for planning and for the development of strategies to overcome these challenges.
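As a rough illustration of this kind of growth-rate model, the sketch below fits a logistic curve to synthetic cumulative case counts and reads off the plateau and inflection point (all numbers are synthetic; this is not the authors' exact formulation):

```python
# Minimal sketch: fit a logistic growth curve to cumulative cases and read
# off the expected maximum (plateau) and the inflection day. Toy data only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    return K / (1.0 + np.exp(-r * (t - t0)))

t = np.arange(60, dtype=float)
rng = np.random.default_rng(7)
cases = logistic(t, K=10_000, r=0.25, t0=30) + rng.normal(0, 100, t.size)

(K, r, t0), _ = curve_fit(logistic, t, cases, p0=[cases.max(), 0.1, 30])
print(f"expected maximum ~{K:.0f} cases, inflection near day {t0:.1f}")
```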
ARTICLE | doi:10.20944/preprints201804.0127.v1
Subject: Engineering, Energy & Fuel Technology Keywords: energy efficiency indices; data visualization; clustering algorithms; university campus; energy management
Online: 10 April 2018 (10:40:47 CEST)
In this paper, we propose a simple tool to help the energy management of a large building stock by defining clusters of buildings with the same function, setting alert thresholds for each cluster, and easily recognizing outliers. The objective is to enable a building management system to be used for the detection of abnormal energy use. First, we framed the issue of energy performance indicators and how they feed into data visualization (Data Viz) tools for a large building stock, especially for university campuses. For both the Data Viz and clustering algorithm processes, we discussed two possible approaches to choosing the right number of clusters and to identifying alert thresholds and outliers, after a brief presentation of the University of Turin's building stock case study. Different Data Viz tools were studied in order to apply a specific clustering algorithm, k-means. An explorative analysis based on Inselberg's general multidimensional detective approach was performed, using two multidimensional analysis tools: the scatter plot matrix and the parallel coordinates method. Secondly, the k-means clustering algorithm was applied to the same dataset in order to test the hypotheses made during the explorative analysis. The Data Viz techniques developed in this study proved very useful for exploring a large building stock quickly and simply, identifying the least efficient buildings and clustering them according to their distinct functions.
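A minimal sketch of the clustering step, assuming toy energy indicators and a simple mean-plus-two-standard-deviations alert threshold per cluster (the indicator set and threshold rule are illustrative, not the paper's):

```python
# Minimal sketch: k-means on per-building energy indicators, with a simple
# per-cluster alert threshold on the distance to the cluster centre.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
# toy indicators: [kWh per m^2, kWh per occupant] for 60 buildings
X = np.vstack([rng.normal([80, 300], 10, (30, 2)),
               rng.normal([150, 900], 20, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

for c in range(km.n_clusters):
    d = dist[km.labels_ == c]
    threshold = d.mean() + 2 * d.std()      # cluster alert threshold
    outliers = np.where((km.labels_ == c) & (dist > threshold))[0]
    print(f"cluster {c}: threshold={threshold:.1f}, outliers={outliers}")
```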
ARTICLE | doi:10.20944/preprints202201.0348.v1
Subject: Medicine & Pharmacology, Other Keywords: Data Science; Genomic Data Science; Machine Learning; Network Analysis; RNA-Seq; Precision Medicine; Subtyping; Parkinson’s Disease
Online: 24 January 2022 (11:36:51 CET)
Precision medicine emphasizes fine-grained diagnostics, taking individual variability into account to enhance treatment effectiveness. Parkinson's Disease (PD) heterogeneity among individuals is evidence that disease subtypes exist, and assigning individuals to subgroups is necessary for a better understanding of disease mechanisms and for designing precise treatment approaches. The purpose of this study was to identify PD subtypes using RNA-Seq data in a combined pipeline including unsupervised machine learning, bioinformatics, and network analysis. 210 post-mortem brain RNA-Seq samples from PD (n = 115) and Normal Controls (NC, n = 95) were obtained through systematic data retrieval following PRISMA statements, and a fully data-driven clustering pipeline was performed to identify PD subtypes. Bioinformatics and network analyses were performed to characterize the disease mechanisms of the identified PD subtypes and to identify target genes for drug repurposing. Two PD clusters were identified and 42 DEGs were found (adjusted p ≤ 0.01). The PD clusters had significantly different gene network structures (p < 0.0001) and phenotype-specific disease mechanisms, highlighting the differential involvement of the Wnt/β-catenin pathway regulating adult neurogenesis. NEUROD1 was identified as a key regulator of gene networks, and ISX9 and PD98059 were identified as NEUROD1-interacting compounds with disease-modifying potential, reducing the effects of dopaminergic neurodegeneration. This hybrid data analysis approach could enable precision medicine applications by providing insights for the identification and characterization of pathological subtypes. This workflow has proven useful on PD brain RNA-Seq, but its application to other neurodegenerative diseases is encouraged.
Subject: Earth Sciences, Atmospheric Science Keywords: soil temperature; data evaluation; climatology; interannual variation; Poyang Lake Basin
Online: 24 February 2020 (01:38:30 CET)
Soil temperature reflects the impact of local factors, such as the vegetation, soil, and atmosphere of a region. It is therefore important to understand the regional variation of soil temperature. However, given the lack of observations with adequate spatial and/or temporal coverage, it is difficult to study this regional variation from observation data alone. Based on observation data from the Nanchang and Ganzhou stations and ERA-Interim/Land reanalysis data, this study analyzed the temporal-spatial distribution characteristics of soil temperature over the Poyang Lake Basin. The results showed close correlations between observation data and reanalysis data at different depths. The reanalysis data could largely reproduce the temporal-spatial distributions of soil temperature over the Poyang Lake Basin, but generally underestimated their magnitudes. Temporally, there is an obvious warming trend in the basin. Seasonally, the temperature rose fastest in spring and slowest in summer, except for ST4, which rose fastest in spring and slowest in winter. In terms of depths, the temperature of ST1 rose fastest; for the other layers, the warming trends were similar. An abrupt change of annual soil temperature at all depths occurred in 1997, and annual soil temperatures at all depths were abnormally low in 1984. Spatially, annual soil temperature decreased with latitude, except for the summer ST1. Because of the high temperature and precipitation in summer, ST1 is higher around the lake and the river. The climatic trend of soil temperature generally increases from south to north, opposite to the distribution of soil temperature itself. The findings provide a basis for understanding and assessing the variation of soil temperature over the Poyang Lake Basin.
ARTICLE | doi:10.20944/preprints202003.0443.v2
Subject: Social Sciences, Library & Information Science Keywords: COVID-19; open science; data; bibliometric; pandemic
Online: 22 April 2020 (06:15:34 CEST)
Introduction: The COVID-19 pandemic, an outbreak of infectious disease caused by SARS-CoV-2, motivated the scientific community to work together to gather, organize, process, and distribute data on the novel biomedical hazard. Here, we analyzed how the scientific community responded to this challenge by quantifying the distribution and availability patterns of academic information related to COVID-19. The aim of our study was to assess the quality of information flow and scientific collaboration, two factors we believe to be critical for finding new solutions for the ongoing pandemic. Materials and methods: The RISmed R package and a custom Python script were used to fetch metadata on articles indexed in PubMed and published on the Rxiv preprint server. Scopus was searched manually and the metadata were exported as a BibTeX file. Publication rate and publication status, affiliation and author count per article, and submission-to-publication time were analysed in R. The Biblioshiny application was used to create a world collaboration map. Results: Our preliminary data suggest that the COVID-19 pandemic resulted in the generation of a large amount of scientific data, and demonstrate potential problems regarding information velocity, availability, and scientific collaboration in the early stages of the pandemic. More specifically, our results indicate a precarious overload of the standard publication systems, significant problems with data availability, and apparently deficient collaboration. Conclusion: We believe the scientific community could have used the data more efficiently in order to create proper foundations for finding new solutions for the COVID-19 pandemic. Moreover, we believe we can learn from this as we go and adopt open science principles and a more mindful approach to COVID-19-related data to accelerate the discovery of more efficient solutions. We take this opportunity to invite our colleagues to contribute to this global scientific collaboration by publishing their findings with maximal transparency.
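As a hedged sketch of the metadata-retrieval step: the authors used the RISmed R package, but an equivalent PubMed count can be obtained in Python through NCBI's public E-utilities API (the query term and date window below are assumptions, not the authors' exact query).

```python
# Hedged Python equivalent of the PubMed metadata-retrieval step (not the
# authors' script); query term and dates are illustrative assumptions.
import requests

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": "COVID-19[Title/Abstract]",
    "mindate": "2020/01/01", "maxdate": "2020/04/30", "datetype": "pdat",
    "retmax": 0, "retmode": "json",   # retmax=0: fetch the count only
}
resp = requests.get(url, params=params, timeout=30)
count = int(resp.json()["esearchresult"]["count"])
print(f"PubMed records matching the query: {count}")
```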
ARTICLE | doi:10.20944/preprints202103.0530.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Multiblock data analysis; redundancy analysis; PLS regression; supervised methods; multicollinearity
Online: 22 March 2021 (12:25:20 CET)
Within the framework of multiblock data analysis, a unified approach to supervised methods is discussed. It encompasses multiblock redundancy analysis (MB-RA) and multiblock partial least squares (MB-PLS) regression. Moreover, we develop new supervised strategies of multiblock data analysis, which can be seen as variants of one or the other of these two methods. They are respectively referred to as multiblock weighted redundancy analysis (MB-WRA) and multiblock weighted covariate analysis (MB-WCov). The four methods are based on the determination of latent variables associated with the various blocks of variables. They are derived from clear optimization criteria whose aim is to maximize either the sum of the covariances or the sum of the squared covariances between the latent variable associated with the response block of variables and the block latent variables associated with the various explanatory blocks. We also propose indices to help better interpret the outcomes of the analyses. The methods are illustrated and compared on simulated and real datasets.
ARTICLE | doi:10.20944/preprints201801.0077.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: UAVs sensor fusion; EKF; real data analysis; system design
Online: 9 January 2018 (07:47:45 CET)
This paper presents a methodology to design sensor fusion parameters using real performance indicators of navigation in UAVs based on the PixHawk flight controller and peripherals. This methodology and the selected performance indicators make it possible to find the best parameters for the fusion system of a given configuration of sensors and a predefined real mission. The selected real platform is described with emphasis on the available sensors and data processing software, and an experimental methodology is proposed to characterize the sensor data fusion output and determine the best choice of parameters using quality measurements of the tracking output, with performance metrics that do not require ground truth.
ARTICLE | doi:10.20944/preprints201908.0225.v1
Subject: Earth Sciences, Geoinformatics Keywords: water bodies; satellite images; vector data; SVM; positive and negative buffering; polygons
Online: 21 August 2019 (10:30:16 CEST)
The technique of obtaining information or data about any feature or object from afar, known in technical parlance as remote sensing, has proven extremely useful in diverse fields. In the ecological sphere especially, remote sensing has enabled the collection of data about large swaths of land or landscapes. Even so, the task of identifying and monitoring different water reservoirs has proved difficult in remote sensing, mainly because correct appraisals of the spread and boundaries of the area under study, and of the contours of any water surfaces within it, are of utmost importance. Identification of water reservoirs is rendered even harder by the presence of cloud in satellite images, which is the largest source of error in the identification of water surfaces. To overcome this limitation, a shape-matching approach is recommended for the analysis of cloudy images against cloud-free reference images of water surfaces, with the help of vector data processing. It includes a database of water bodies in vector format, stored as complex polygon structures. The analysis comprises three steps: first, the creation of the vector database for the analysis; second, the simplification of multi-scale vector polygon features; and third, the matching of the reference and target water body databases within a defined distance tolerance. This feature-matching approach supports one-to-many and many-to-many matching, and yields corrected images that are free of clouds.
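The distance-tolerance matching can be illustrated with the positive and negative buffering mentioned in the keywords. The following is a hedged sketch, not the authors' algorithm; the coordinates are invented and the matching rule is a simplification.

```python
# Hedged illustration of positive/negative buffering for tolerance-based
# polygon matching (coordinates invented; not the authors' algorithm).
from shapely.geometry import Polygon

reference = Polygon([(0, 0), (10, 0), (10, 6), (0, 6)])   # cloud-free water body
target = Polygon([(1, 1), (9, 1), (9, 5), (1, 5)])        # polygon from cloudy scene

tolerance = 1.5
outer = reference.buffer(tolerance)    # positive buffer: allowed outward deviation
inner = reference.buffer(-tolerance)   # negative buffer: required inner core

# Match if the target stays inside the outer ring and still covers the core.
matches = outer.contains(target) and target.contains(inner)
print("match within tolerance:", matches)
```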
ARTICLE | doi:10.20944/preprints202008.0074.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: data mining; cardiovascular diseases; cluster analysis; principal component analysis
Online: 4 August 2020 (03:56:19 CEST)
Cardiovascular disease is the leading cause of death in the world: according to the WHO, around 31% of deaths worldwide are caused by cardiovascular diseases, and more than 75% of these deaths occur in developing countries. Patients with cardiovascular disease generate many medical records that can be used for further patient management. This study aims to develop a data mining method that groups patients with cardiovascular disease in order to determine the level of patient complications in two clusters. The method applies principal component analysis (PCA), which reduces the dimensions of the large available dataset, together with cluster analysis implemented with the K-Medoids algorithm. Data reduction with PCA produced five new components with a cumulative proportion of variance of 0.8311. The five new components were used for cluster formation with the K-Medoids algorithm, which produced two clusters with a silhouette coefficient of 0.35. The combination of data reduction by PCA and the K-Medoids clustering algorithm offers a new way of grouping data on patients with cardiovascular disease according to the level of patient complications in each resulting cluster.
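A minimal sketch of the paper's two-step recipe (PCA reduction, then K-Medoids) could look as follows in Python, using scikit-learn and scikit-learn-extra; the input file is hypothetical and preprocessing details are omitted.

```python
# Hedged sketch of the PCA + K-Medoids recipe (hypothetical input file;
# assumes all columns are numeric).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

X = StandardScaler().fit_transform(pd.read_csv("cvd_records.csv"))
Z = PCA(n_components=5).fit_transform(X)     # five components, as in the paper

km = KMedoids(n_clusters=2, random_state=0).fit(Z)
print("silhouette:", silhouette_score(Z, km.labels_))
```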
ARTICLE | doi:10.20944/preprints201801.0231.v1
Subject: Engineering, Other Keywords: Data mining; Association rules; Previous Cause; Type of Accident; Overexertion
Online: 24 January 2018 (19:40:52 CET)
An analysis of workplace accidents in the mining sector was performed using a database from the Spanish administration covering the period 2005-2015 and applying data mining techniques. The data were processed with the Weka software. Two scenarios were chosen for the accident database: surface and underground mining. The most important variables involved in occupational accidents and their association rules have been determined. These rules are formed by several predictor variables that cause an accident, defining its characteristics and context. This study presents the 20 most important association rules of the sector, for either surface or underground mining, based on the statistical confidence level of each rule obtained by Weka. The outcomes display the most typical immediate causes together with the percentage of accidents underlying each association rule. The most typical immediate cause is body movement with physical effort or overexertion, and the most typical type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident differ between the two scenarios. Data mining techniques have proved to be a very powerful tool for finding the root causes of accidents, applying corrective measures, and verifying their effectiveness, whether for public or private companies.
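The authors mined their rules with Weka; as a hedged Python equivalent of the same support/confidence workflow (the attribute names below are invented, not the Spanish database's fields), mlxtend can be used:

```python
# Hedged Python equivalent of Weka-style association-rule mining; accident
# records are assumed one-hot encoded as boolean attribute columns (toy data).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

records = pd.DataFrame({
    "overexertion":   [1, 1, 0, 1, 1, 0],
    "body_movement":  [1, 1, 0, 1, 0, 0],
    "surface_mining": [1, 0, 1, 1, 1, 0],
}).astype(bool)

frequent = apriori(records, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
# Keep the strongest rules, analogous to the paper's top-20 list.
print(rules.sort_values("confidence", ascending=False)
           [["antecedents", "consequents", "support", "confidence"]])
```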
ARTICLE | doi:10.20944/preprints202105.0582.v1
Subject: Medicine & Pharmacology, Allergology Keywords: under-five; mortality; demographic health survey data; Ethiopia
Online: 24 May 2021 (15:12:13 CEST)
Introduction: Over the decades, much has been said and done regarding under-five mortality in Ethiopia. The country has been following the lead of the Sustainable Development Goals and UNICEF with its transformation plan targets. However, unless these efforts are supported by status-assessing studies, it may be difficult for the country to progress. Thus, the current study aimed to identify the prevalence and associated factors of under-five mortality in 2019. Methods: According to the study criteria, we extracted and cleaned the data in STATA v15.0. The data were then weighted by sampling weight, primary sampling unit, and strata before analysis in STATA v15.0. Data management consisted of descriptive statistics (mean, standard deviation, and proportion or percent) and association statistics. We employed binary logistic regression for this analysis, screening each variable at a p-value of 0.25 for inclusion in the model. The final p-value to declare association was p < 0.05, and AORs with 95% CIs were used to describe the results. The data source was the Ethiopian Mini Demographic Health Survey (EMDHS) 2019, which collected the data from 8,885 respondents face-to-face with a 99% response rate. Results: Of the 5,527 weighted women with under-five children analysed in this study, the weighted proportion of under-five mortality was 277.23 (5.02%). Factors such as 2nd birth order 0.52 (0.35, 0.79), 3rd-4th birth order 0.49 (0.28, 0.84), 1-2 ANC visits 0.24 (0.12, 0.49), three ANC visits 0.14 (0.07, 0.28), four or more ANC visits 0.22 (0.14, 0.36), being a married mother 0.43 (0.19, 0.96), 1-2 under-five children 0.02 (0.011, 0.03), and more than three under-five children 0.007 (0.0007, 0.004) were all negatively associated with the under-five mortality rate. Conclusion: To build on these findings, the government may need to increase antenatal care, women's education, institutional delivery, and the use of modern contraceptive methods through enhanced community mobilization, health education using community health workers, increased access to essential care for mothers and children, and policy commitment on issues related to family size, birth order, and birth interval.
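As a hedged sketch of the analytical core: the paper fits survey-weighted logistic regression in STATA, while the unweighted statsmodels version below only illustrates how AORs and 95% CIs are derived from fitted coefficients; all variables are synthetic placeholders.

```python
# Hedged, unweighted illustration of AOR estimation (the paper uses
# survey-weighted logistic regression in STATA; variables are synthetic).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "died_u5":    rng.integers(0, 2, 500),
    "anc_visits": rng.integers(0, 5, 500),
    "married":    rng.integers(0, 2, 500),
})

X = sm.add_constant(df[["anc_visits", "married"]])
fit = sm.Logit(df["died_u5"], X).fit(disp=0)

aor = np.exp(fit.params)        # adjusted odds ratios
ci = np.exp(fit.conf_int())     # 95% confidence intervals
print(pd.concat([aor.rename("AOR"),
                 ci.rename(columns={0: "2.5%", 1: "97.5%"})], axis=1))
```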
ARTICLE | doi:10.20944/preprints202201.0323.v2
Subject: Behavioral Sciences, General Psychology Keywords: social media; netnography; mental health; natural language processing; visualization; data analysis; COVID-19
Online: 8 February 2022 (11:12:14 CET)
Abstract: Understanding social media networks and group interactions is crucial to the advancement of linguistic and cultural behaviour research. This includes the manner in which people accessed advice on health, especially during the global lockdown periods, when most activities were curtailed by isolation rules and some people, especially of older generations, turned to social media to access information on health. Facebook public pages, groups, and verified profiles matching the keywords "senior citizen health", "older generations", and "healthy living" were analysed over a 12-month period to assess engagement promoting good mental health. CrowdTangle was used to source English-language status updates and photo- and video-sharing information, which resulted in an initial 116,321 posts and 6,462,065 interactions. Data analysis and visualisation were used to explore the large datasets, including natural language processing for "Message" content discovery, word frequency and correlational analysis, and co-word clustering. Preliminary results indicate strong links to healthy-ageing information shared on social media, which showed correlations to global daily confirmed case and daily death totals. The results can be used to identify public concerns early on and address mental health issues in the senior generation on Facebook.
ARTICLE | doi:10.20944/preprints202202.0005.v2
Subject: Physical Sciences, Condensed Matter Physics Keywords: hydride superconductor; room temperature superconductor; pressure; ac magnetic susceptibility; raw data; background signal
Online: 4 February 2022 (10:31:10 CET)
In Ref. , Snider et al. reported room-temperature superconductivity in carbonaceous sulfur hydride (CSH) under high pressure. Recently, the data for the temperature-dependent ac magnetic susceptibility shown in the figures of Ref.  have appeared in the form of tables corresponding to different pressures. Here we provide an analysis of the data for a pressure of 160 GPa. This work was performed in collaboration with D. van der Marel.
ARTICLE | doi:10.20944/preprints202007.0325.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Data Center; Thermal Characteristics Analysis; Machine Learning; Energy Efficiency; Hotspots; Clustering Technique; Unsupervised Learning
Online: 15 July 2020 (09:16:23 CEST)
The energy efficiency of Data Center (DC) operations relies heavily on the performance of the IT and cooling systems. A reliable and efficient cooling system is necessary to produce a persistent flow of cold air to cool servers subjected to constantly increasing computational load due to the advent of IoT-enabled smart systems. Consequently, increased demand for computing power will bring about increased waste heat dissipation in data centers. To improve DC energy efficiency, it is imperative to analyse the thermal characteristics of the IT room (driven by this waste heat). This work employs an unsupervised machine learning technique to uncover weaknesses of the DC cooling system based on real DC thermal monitoring data. The findings identify areas for energy efficiency improvement that will feed into DC recommendations. The methodology includes statistical analysis of IT room thermal characteristics and the identification of individual servers that frequently occur in hotspot zones. A critical analysis has been conducted on an available big dataset of ambient air temperature in the hot aisle of the ENEA Portici CRESCO6 computing cluster. Clustering techniques have been used for hotspot localization as well as for the categorization of nodes based on surrounding air temperature ranges. The principles and approaches covered in this work are replicable for the energy efficiency evaluation of any DC and thus foster transferability. This work showcases the applicability of best practices and guidelines in the context of a real commercial DC, going beyond the set of existing metrics for DC energy efficiency assessment.
ARTICLE | doi:10.20944/preprints202104.0745.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: green effect; pipelines; remote monitoring; data analysis; machine learning; time series
Online: 28 April 2021 (10:39:31 CEST)
The extensive but remote oil and gas fields of the United States, Canada, and Russia require the construction and operation of extremely long pipelines. Global warming and local heating effects lead to rising soil temperatures and thus a reduction in the sub-grade capacity of the soils; this causes changes in the spatial positions and forms of the pipelines, consequently increasing the number of accidents. Oil operators are compelled to monitor the soil temperature along the routes of these remote pipelines in order to be able to perform remedial measures in time, and are therefore seeking methods for the analysis of voluminous diagnostic information. To forecast soil temperatures at different depths, we propose compiling a multidimensional dataset; computing descriptive statistics; selecting uncorrelated time series; generating synthetic features; robustly scaling the temperature series; and tuning an additive regression model.
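The paper does not name its implementation of the additive regression model; Prophet is one widely used choice, so the following forecasting sketch is purely illustrative, with a toy soil-temperature series standing in for the monitored data.

```python
# Hedged sketch: Prophet as one possible additive regression model (the
# paper's implementation is unspecified); the series is a toy stand-in.
import pandas as pd
from prophet import Prophet

ts = pd.DataFrame({
    "ds": pd.date_range("2019-01-01", periods=730, freq="D"),
    "y":  pd.Series(range(730)).mul(0.001).add(5),   # toy soil temperature, degC
})

model = Prophet(yearly_seasonality=True, daily_seasonality=False)
model.fit(ts)
future = model.make_future_dataframe(periods=90)      # forecast 90 days ahead
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```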
ARTICLE | doi:10.20944/preprints202012.0281.v1
Subject: Life Sciences, Biochemistry Keywords: covid-19; drug repurposing; topological data analysis; persistent homology
Online: 11 December 2020 (12:57:28 CET)
Since its emergence in March 2020, the SARS-CoV-2 global pandemic has produced more than 65 million cases and 1.5 million deaths worldwide. Despite the enormous efforts of the scientific community, no effective treatments have been developed to date. We created a novel computational pipeline aimed at speeding up the identification of repurposable drug candidates. Compared with current drug repurposing methodologies, our strategy centres on filtering the best candidates among all selected targets through a mathematical formalism motivated by recent advances in the fields of algebraic topology and topological data analysis (TDA). This formalism allows us to compare three-dimensional protein structures. Its use in conjunction with two in silico validation strategies (molecular docking and transcriptomic analyses) allowed us to identify a set of potential drug repurposing candidates targeting three viral proteins (the 3CL viral protease, NSP15 endoribonuclease, and NSP12 RNA-dependent RNA polymerase), which included rutin, dexamethasone, and vemurafenib, among others. To our knowledge, this is the first time that a TDA-based strategy has been used to compare a massive number of protein structures with the final objective of performing drug repurposing.
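As a hedged illustration of the TDA ingredient (not the authors' pipeline), persistence diagrams of two point clouds, standing in for protein-structure coordinates, can be computed and compared as follows:

```python
# Hedged TDA illustration: persistence diagrams of two synthetic point
# clouds (stand-ins for protein atom coordinates), compared via a
# Wasserstein distance. Not the authors' pipeline.
import numpy as np
from ripser import ripser
from persim import wasserstein

rng = np.random.default_rng(0)
cloud_a = rng.normal(size=(60, 3))          # hypothetical C-alpha coordinates
cloud_b = rng.normal(size=(60, 3)) * 1.2

dgm_a = ripser(cloud_a)["dgms"][1]           # H1 persistence diagram
dgm_b = ripser(cloud_b)["dgms"][1]
# Smaller distance = more topologically similar structures.
print("Wasserstein distance between H1 diagrams:", wasserstein(dgm_a, dgm_b))
```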
ARTICLE | doi:10.20944/preprints201711.0047.v1
Subject: Earth Sciences, Atmospheric Science Keywords: data assimilation; statistical diagnostics of analysis residuals; estimation of analysis error; air quality model diagnostics; Desroziers et al. method; cross-validation
Online: 7 November 2017 (10:09:42 CET)
We present a general theory of estimation of analysis error covariances based on cross-validation as well as a geometric interpretation of the method. In particular we use the variance of passive observation–minus-analysis residuals and show that the true analysis error variance can be estimated, without relying on the optimality assumption. This approach is used to obtain near optimal analyses that are then used to evaluate the air quality analysis error using several different methods at active and passive observation sites. We compare the estimates according to the method of Hollingsworth-Lönnberg, Desroziers et al., a new diagnostic we developed, and the perceived analysis error computed from the analysis scheme, to conclude that, as long as the analysis is near optimal, all estimates agree within a certain error margin.
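In simplified scalar form, and assuming passive observation errors are uncorrelated with analysis errors, the key identity behind this estimate can be sketched as follows (the notation here is ours, not necessarily the paper's):

```latex
% O_p: passive observations; A: analysis at those sites;
% sigma_{o,p}^2: passive-observation error variance
\operatorname{var}(O_p - A) = \sigma_{o,p}^{2} + \sigma_{a}^{2}
\quad\Longrightarrow\quad
\hat{\sigma}_{a}^{2} = \operatorname{var}(O_p - A) - \sigma_{o,p}^{2}
```

Because the passive data never enter the analysis, their errors are independent of the analysis errors, so no optimality assumption is required for this decomposition.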
REVIEW | doi:10.20944/preprints202202.0083.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Machine Learning; COVID-19; Internet of Things (IoT); Deep Learning; Big Data
Online: 19 April 2022 (08:21:00 CEST)
Early diagnosis, prioritization, screening, clustering, and tracking of COVID-19 patients, as well as the production of drugs and vaccines, are some of the applications that have made it necessary to use new technologies to manage and deal with this epidemic. Strategies backed by artificial intelligence (AI) and the Internet of Things (IoT) have been indispensable for understanding how the virus works and trying to prevent its spread. Accordingly, the main aim of this survey article is to highlight ML and IoT methods, and the integration of IoT- and ML-based techniques, in applications related to COVID-19, from diagnosis of the disease to prediction of its outbreak. According to the main findings, IoT provided a prompt and efficient approach for tracking the spread of the disease. Most studies applying ML-based techniques to COVID-19 datasets reported performance criteria, the most popular being accuracy, which can be used to compare ML-based methods across datasets. According to the results, CNN with an SVM classifier, Genetic CNN, and pre-trained CNN followed by ResNet provided the highest accuracy values, while the lowest accuracy was obtained by a single CNN, followed by the XGBoost and KNN methods.
REVIEW | doi:10.20944/preprints201911.0338.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Indian; Sentiment Analysis; Indigenous Languages; Machine Learning; Deep learning; Data; Opinion Mining; Languages.
Online: 27 November 2019 (09:30:07 CET)
The increase in the use of smartphones has led to increased use of the internet and social media platforms. The most commonly used social media platforms are Twitter, Facebook, WhatsApp, and Instagram. People share their personal experiences, reviews, and feedback on the web. The information available on the web is enormous and unstructured. Hence, there is huge scope for research on understanding the sentiment of the data available on the web. Sentiment Analysis (SA) can be carried out on the reviews, feedback, and discussions available on the web. Extensive research has been carried out on SA in the English language, but the web also contains data in many other languages that should be analyzed. This paper aims to analyze, review, and discuss the approaches, algorithms, and challenges faced by researchers while carrying out SA on Indigenous languages.
ARTICLE | doi:10.20944/preprints202201.0214.v1
Subject: Social Sciences, Other Keywords: spatial analysis; innovation flows; urban transition; inclusive; clusters; lagging regions; network analysis; data city
Online: 14 January 2022 (13:55:41 CET)
The economy is a complex system, and the interactions between different agents are not easy to see through quickly. This complexity is reflected in the spatial dimension; tracking these tradeoffs opens a new window onto the nexus of place and flow. Economic systems often go through transitions and end up in another state, and this evolution is embedded in cities as the new motor of paradigm shifts. To adequately represent and study these dynamics, we aim to develop an integrated method based on network analysis and geographic economics to build a multiscale navigator that tracks the transition from the regional to the local level. This paper explores the specialization of regional clusters and their innovative behaviour in a particular lagging region, unfolding the innovation ecosystem to the smallest granularity and then simulating the emergence phase of this complex system. First, our findings reveal that the local scale is the relevant level at which to start a bottom-up planning approach to policy implementation. Second, global challenges could be addressed at the regional scale if we investigate local complexity to unfold the innovation flow across its ecosystem and treat knowledge as a critical element for an inclusive transition, most probably in cities. Finally, the innovation network is an existing fact that can serve as a host for prosperity; in this line of reasoning, we intend to spatialize the track of the innovation flow to identify transition hotspots and respond adequately to upcoming world concerns.
ARTICLE | doi:10.20944/preprints202109.0334.v1
Subject: Life Sciences, Other Keywords: organic producer; organic practices; surveillance data; health and safety
Online: 20 September 2021 (13:40:18 CEST)
Research indicates that farmers' demographic characteristics and production practices have safety and health implications. However, current systems do not identify organic farmers independently from conventional farmers, and the literature on how organic and conventional farmers compare is very limited. We conducted a secondary analysis of 2012 Census of Agriculture data to compare organic and non-organic farms and principal operators (POs) in New Mexico (NM). Organic farms were smaller in size, and POs of farms with organic sales were significantly younger (55.8±9.5 vs. 60.5±5.5 years) and less experienced (19.5±6.8 vs. 25.2±6.8 years). Significant differences were also found in PO ethnicity, race, and primary occupation. More farms with organic sales had a female PO compared to farms with non-organic sales (27% vs. 19%). Other significant differences related to work arrangements, household income, living conditions, and access to the Internet. National surveys and regional studies may not accurately typify and describe the local organic producer, which is essential in order to advance policy, develop health interventions, and properly address occupational safety and risk among organic farmers. This study makes a unique contribution to understanding the importance of surveillance and of collecting place-based data that are specific to the organic producer.
ARTICLE | doi:10.20944/preprints201805.0452.v1
Subject: Social Sciences, Other Keywords: transportation; carbon emission; carbon intensity; panel data analysis; China
Online: 30 May 2018 (16:16:35 CEST)
China's transportation industry has made rapid progress, which has generated massive carbon emissions. However, it remains unclear how carbon emissions from the transport sector are shaped by shifts in their underlying drivers. This paper aims to examine the evolution of China's transport-sector carbon emissions and their major driving forces at the provincial level over the period 2000 to 2015. We first estimate provincial transport-sector carbon emissions from fuel and electricity consumption using a top-down method. We find that carbon emissions per capita are steadily increasing across the nation, especially in the provinces of Chongqing and Inner Mongolia. However, carbon emission intensity is decreasing in most provinces of China, except in Yunnan, Qinghai, Chongqing, Zhejiang, Heilongjiang, Jilin, Inner Mongolia, Henan, and Anhui. We then quantify the effect of socio-economic factors and their regional variations on carbon emissions using a panel data model. The results show that the development of secondary industry is the most significant variable at both the national and regional levels, while the effects of the other variables vary across regions. Among these factors, population density is the main driver of increasing per-capita transport-sector carbon emissions for both the whole nation and the western region, whereas the per-capita consumption level of residents and the development of tertiary industry are the primary drivers of per-capita carbon emissions for the eastern and central regions.
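The paper does not specify its estimation software; as a hedged sketch, a provincial fixed-effects panel regression of the kind described can be set up in Python with linearmodels (all variables below are synthetic placeholders):

```python
# Hedged sketch of a provincial panel regression (synthetic placeholder
# variables; not the paper's actual specification or software).
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [[f"prov_{i}" for i in range(30)], range(2000, 2016)],
    names=["province", "year"],
)
df = pd.DataFrame({
    "co2_pc":        rng.normal(5, 1, len(idx)),      # per-capita emissions
    "pop_density":   rng.normal(300, 50, len(idx)),
    "secondary_gdp": rng.normal(0.4, 0.05, len(idx)), # secondary-industry share
}, index=idx)

model = PanelOLS.from_formula(
    "co2_pc ~ pop_density + secondary_gdp + EntityEffects", data=df
)
print(model.fit(cov_type="clustered", cluster_entity=True).summary)
```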
Subject: Life Sciences, Other Keywords: Analysis of variance; Variance-decomposition; The Bayesian brain; High-dimensional data; Association; Explanation; Prediction; Causation; The neural law of large numbers
Online: 23 September 2021 (11:13:08 CEST)
We discuss what we believe could be an improvement in future discussions of the ever-changing brain. We do so by distinguishing different types of brain variability and outlining methods suitable for analysing them. We argue that, when studying brain and behaviour data, classical methods such as regression analysis and more advanced approaches both aim to decompose the total variance into sensible variance components. In parallel, we argue that a distinction needs to be made between innate and acquired brain variability. For varying high-dimensional brain data, we present methods useful for extracting their low-dimensional representations. Finally, to trace potential causes and predict plausible consequences of brain variability, we discuss how to combine statistical principles and neurobiological insights to make associative, explanatory, predictive, and causal enquiries; but caution is needed before elevating association- or prediction-based neurobiological findings to causal claims.
REVIEW | doi:10.20944/preprints202003.0141.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: data sharing; data management; data science; big data; healthcare
Online: 8 March 2020 (16:46:20 CET)
In recent years, more and more health data are being generated. These data come not only from professional health systems, but also from wearable devices. All these data combined form 'big data' that can be utilized to optimize treatments for each unique patient ('precision medicine'). To achieve this precision medicine, it is necessary that hospitals, academia, and industry work together to bridge the 'valley of death' of translational medicine. However, hospitals and academia often have problems with sharing their data, even though the patient is actually the owner of his/her own health data and the sharing of data is associated with an increased citation rate. Academic hospitals usually invest a lot of time in setting up clinical trials and collecting data, and want to be the first to publish papers on these data. The idea that society benefits most if the patient's data are shared as soon as possible, so that other researchers can work with them, has not yet taken root. There are some publicly available datasets, but these are usually shared only after studies are finished and/or publications have been written based on the data, which means a severe delay of months or even years before others can use the data for analysis. One solution is to incentivize hospitals to share their data with (other) academic institutes and industry. Here we discuss several aspects of data sharing in the medical domain: publisher requirements, data ownership, support for data sharing, data sharing initiatives, and how the use of federated data might be a solution. We also discuss some potential future developments around data sharing.
ARTICLE | doi:10.20944/preprints202012.0014.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Data Envelopment Analysis; Conditional Frontier Analysis; Multicriteria Decision Analysis; PROMETHEE II; Police Efficiency; Police Effectiveness; Crime; Pernambuco; Brazil.
Online: 1 December 2020 (11:22:46 CET)
Nonparametric assessments of police technical and scale efficiency are challenging because of the stochastic nature of criminal behavior and because of the subjective dependence on multiple decision criteria, which can lead to more or less favourable efficiency prospects depending on the regulation, necessity, or organizational objective. There is a trade-off between efficiency and effectiveness in many police performance assessments: efficient departments (producing more clear-ups with a given resource) may be crime-specialized or unable to reproduce those good results effectively on more severe or complex occurrences. This study proposes a combined methodology for carrying out efficiency and effectiveness analyses of police departments. A conditional non-parametric approach, which allows crime to be included as an external factor in the analysis, is combined with a non-compensatory ranking based on the PROMETHEE II methodology; the approach is illustrated on a multidimensional efficiency and effectiveness comparison of 145 police departments in Pernambuco, Brazil. The application results offer compelling perspectives for public administrations concerning the strategic prioritization of units for rewards or interventions.
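As a hedged sketch of the PROMETHEE II ranking step only (the conditional-frontier efficiency part is omitted), using the common linear preference function:

```python
# Hedged PROMETHEE II sketch with a linear preference function; the
# criteria, weights, and thresholds below are illustrative, not the paper's.
import numpy as np

def promethee_ii(X, weights, p):
    """Net outranking flows for an alternatives x criteria matrix X
    (all criteria to be maximised); p holds per-criterion preference
    thresholds for the linear preference function."""
    n = X.shape[0]
    phi_plus = np.zeros(n)
    phi_minus = np.zeros(n)
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            d = X[a] - X[b]                     # pairwise criterion differences
            pref = np.clip(d / p, 0.0, 1.0)     # linear preference in [0, 1]
            pi_ab = np.dot(weights, pref)       # weighted aggregated preference
            phi_plus[a] += pi_ab / (n - 1)
            phi_minus[b] += pi_ab / (n - 1)
    return phi_plus - phi_minus                 # net flow: higher = better rank

X = np.array([[0.8, 0.6], [0.7, 0.9], [0.5, 0.5]])  # toy efficiency/effectiveness
print(promethee_ii(X, weights=np.array([0.5, 0.5]), p=np.array([0.2, 0.2])))
```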
ARTICLE | doi:10.20944/preprints201907.0136.v1
Subject: Engineering, Control & Systems Engineering Keywords: carbon dioxide; energy efficiency; occupancy detection; indoor air quality; measurement; data analysis
Online: 9 July 2019 (14:37:10 CEST)
The problem of real-time estimation of building occupancy (the number of people in various zones at every time instant) is relevant to a number of emerging applications that achieve high energy efficiency through feedback control. The measured CO2 concentration can be considered an important indicator for estimating the occupancy of closed and crowded spaces. Interesting cases include school buildings and other civil and residential buildings (shopping centres, hospitals, etc.). Starting from an experimental analysis in different classrooms of a university campus under real operating conditions, in different periods of the year, this paper proposes a possible correlation between CO2 concentration and the occupancy profile of the spaces. The acquired data are used to present some graphical correlations and to understand the most important variables, or combinations of them. Starting from an accurate analysis of the data, a preliminary estimation method is defined through the development of a mathematical model of occupancy dynamics in a building, which shows interesting results.
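A standard single-zone CO2 mass balance is often used for this kind of occupancy estimation; the paper's model may differ, so the steady-state sketch below is purely illustrative, with assumed ventilation and generation rates.

```python
# Hedged sketch of a standard single-zone CO2 mass balance (the paper's
# model may differ). At steady state, N = Q * (C_in - C_out) / G.
Q = 500.0      # ventilation rate, m3/h (assumed)
C_in = 1200.0  # indoor CO2, ppm
C_out = 420.0  # outdoor CO2, ppm
G = 18.0       # CO2 generation per person, L/h (typical seated adult)

# ppm is a volume fraction (1 ppm = 1 L CO2 per 1000 m3 of air), so
# Q * delta_ppm / 1000 gives the CO2 volume generated indoors in L/h.
occupants = Q * (C_in - C_out) / 1000.0 / G
print(f"estimated occupancy: {occupants:.1f} people")
```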
ARTICLE | doi:10.20944/preprints202206.0320.v4
Subject: Life Sciences, Other Keywords: data; reproducibility; FAIR; data reuse; public data; big data; analysis
Online: 2 November 2022 (02:55:49 CET)
With an increasing amount of biological data available publicly, there is a need for a guide on how to successfully download and use this data. The Ten simple rules for using public biological data are: 1) use public data purposefully in your research, 2) evaluate data for your use case, 3) check data reuse requirements and embargoes, 4) be aware of ethics for data reuse, 5) plan for data storage and compute requirements, 6) know what you are downloading, 7) download programmatically and verify integrity, 8) properly cite data, 9) make reprocessed data and models Findable, Accessible, Interoperable, and Reusable (FAIR) and share, and 10) make pipelines and code FAIR and share. These rules are intended as a guide for researchers wanting to make use of available data and to increase data reuse and reproducibility.
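As a hedged sketch of rules 6 and 7 (know what you are downloading; download programmatically and verify integrity), assuming the repository publishes a SHA-256 checksum; the URL and checksum below are placeholders:

```python
# Hedged sketch of programmatic download with integrity verification;
# the URL and expected checksum are placeholders, not real values.
import hashlib
import requests

url = "https://example.org/dataset.tsv.gz"                 # placeholder URL
expected_sha256 = "0" * 64                                 # placeholder checksum

resp = requests.get(url, timeout=60)
resp.raise_for_status()

digest = hashlib.sha256(resp.content).hexdigest()
if digest != expected_sha256:
    raise ValueError(f"checksum mismatch: got {digest}")
with open("dataset.tsv.gz", "wb") as fh:
    fh.write(resp.content)
```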
ARTICLE | doi:10.20944/preprints202003.0268.v1
Subject: Social Sciences, Library & Information Science Keywords: matching; data marketplace; data platform; data visualization; call for data
Online: 17 March 2020 (04:10:28 CET)
Improvements in web platforms for data exchange and trading are creating more opportunities for users to obtain data from data providers in different domains. However, current data exchange platforms are limited to unilateral information provision from data providers to users; there are insufficient means for data providers to learn what kinds of data users desire and for what purposes. In this paper, we propose and discuss description items for sharing users' calls for data as data requests in the data marketplace. We also discuss structural differences between data requests and providable data in terms of their variables, as well as the possibilities for data matching. In the study, we developed an interactive platform, Treasuring Every Encounter of Data Affairs (TEEDA), to facilitate matching and interactions between data providers and users; its basic features are described in this paper. From experiments, we found the same distributions of variable frequency but different distributions of the number of variables in each piece of data, both of which are important factors to consider in the discussion of data matching in the data marketplace.
ARTICLE | doi:10.20944/preprints202110.0102.v1
Subject: Engineering, Automotive Engineering Keywords: long-haul truck; crash scenarios; GIDAS; CARE; crash causation; European national crash data
Online: 6 October 2021 (10:35:39 CEST)
This paper addresses crashes involving heavy goods vehicles (HGV) in Europe focusing on long-haul trucks weighing 16 tons or more (16t+). The identification of the most critical scenarios and their characteristics is based on a three-level analysis: general crash statistics from CARE addressing all HGVs, results about 16t+ trucks from national crash databases and a detailed study of in-depth crash data from GIDAS, including a crash causation analysis. Most European HGV crashes occur in clear weather, during daylight, on dry roads, outside city limits, and on non-highway roads. Three main scenarios for 16t+ trucks are characterized in-depth: (1) rear-end crashes in which the truck is the striking partner, (2) conflicts during right turn maneuvers of the truck and a cyclist riding alongside and (3) pedestrians crossing the road in front of the truck. Among truck-related crash causes, information admission failures (e.g. distraction) were the main causing factors in 72% of cases in scenario (1) while information access problems (e.g. blind spots) were present for 72% of cases in scenario (2) and 75% of cases in scenario (3). The results provide both a global overview and sufficient depth of analysis in the most relevant cases and thereby aid safety system development.
ARTICLE | doi:10.20944/preprints202105.0102.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Market basket analysis; association rule mining; buying pattern; data mining
Online: 6 May 2021 (15:14:25 CEST)
Buyer practices have changed as individuals learn to live with the new reality of COVID-19. Take-out and delivery orders have increased, and our customer has added new items to their menu in response to new client preferences. With all of the ongoing changes, the customer had many unanswered questions, for example: Are the most popular items still the same after COVID? Which item combinations sell most now? How well are new items being accepted? What are clients buying alongside new items? How have alcohol sales changed? Smartbridge has broad experience with restaurant technology development. The customer already had reports tracking product sales and operational metrics; however, there was a need for deeper insight through product analytics. The customer needed to identify which products and presentations were being sold most often, measure the acceptance of new products, and determine which products customers purchase together, in order to improve marketing campaigns, promotions, and sales. The e-commerce industry is growing immensely in the Indian market, and the cheap 4G internet packages in India clearly give these ventures a push. Thus, when COVID-19 first hit India, people were scared to leave their homes for fear of infection and even hesitated to go out to buy essential (FMCG) goods. Panic buying was also observed, and to avoid this fear of COVID-19, people gave preference to e-commerce sites for buying essential goods; several customers were new, signing up to buy essential goods during this pandemic lockdown period. Many customers shifted their buying behaviour from offline retail stores to online stores. This paper examines consumer buying patterns during lockdown.
ARTICLE | doi:10.20944/preprints202007.0227.v1
Subject: Life Sciences, Endocrinology & Metabolomics Keywords: Data integration; Metabolomics; Multi-tissue; Multiblock; Joint and unique multiblock analysis (JUMBA); OnPLS; Multiblock Orthogonal Component Analysis (MOCA)
Online: 11 July 2020 (04:01:03 CEST)
Data integration has been proven to provide valuable information. The information extracted using data integration in the form of multiblock analysis can pinpoint both common and unique trends across the different blocks. When working with small multiblock datasets, the number of possible integration methods is drastically reduced. To investigate the application of multiblock analysis in cases with a small number of samples, we studied a small metabolomic multiblock dataset containing six blocks (i.e. tissue types), including only common metabolites. We used a single-model multiblock analysis method called Joint and Unique MultiBlock Analysis (JUMBA) and compared it to a commonly used method, concatenated PCA. These methods were used to detect trends in the dataset and identify the underlying factors responsible for metabolic variation. Using JUMBA, we were able to interpret the extracted components and link them to relevant biological properties. JUMBA shows how the observations are related to one another, the stability of these relationships, and to what extent each block contributes to the components. These results indicate that multiblock methods can be useful even with a small number of samples.
Subject: Medicine & Pharmacology, Nutrition Keywords: obesity; eating context; nutrient-poor foods; nutritional surveillance; adolescents; survey data analysis; data-mining; correspondence analysis; biplots
Online: 9 June 2020 (13:52:45 CEST)
Obesity is a global public health problem, with the environment as its major determinant. To identify interventions, an evidence base is warranted. To this aim, we investigate the relationship between the consumption of foods and eating locations (such as home, school/work, and others) in British adolescents, using data from the UK National Diet and Nutrition Survey Rolling Programme (2008–2012 and 2013–2016). Cross-sectional analysis of 62,523 food diary entries from this nationally representative sample focused on foods contributing up to 80% of total energy to the daily adolescent diet. Correspondence Analysis (CA) was first used to generate food-location relationship hypotheses, and Logistic Regression (LR) to quantify the evidence in terms of odds ratios and formally test those hypotheses. The less-healthy foods that emerged from the CA were chips, soft drinks, chocolate, and meat pies. Adjusted odds ratios (99% CI) for consuming specific foods at a location "Other" than home (H) or school/work (S) in the 2008–2012 survey sample were: for soft drinks 2.8 (2.1 to 3.8) vs. H and 2.0 (1.4 to 2.8) vs. S; for chips 2.8 (2.2 to 3.7) vs. H and 3.4 (2.1 to 5.5) vs. S; for chocolate 2.6 (1.9 to 3.5) vs. H and 1.9 (1.2 to 2.9) vs. S; and for meat pies 2.7 (1.5 to 5.1) vs. H and 1.3 (0.5 to 3.1) vs. S. These trends were confirmed in the 2013–2016 survey sample. Interactions between location and BMI were not significant in either sample. In conclusion, our study showed that adolescents are more likely to consume specific less-healthy foods at locations away from home and school/work, irrespective of BMI. Such locations include leisure places, food outlets, and "on the go"; hence, public health policies to discourage less-healthy food choices in these locations are warranted for all adolescents.
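As a hedged sketch of the CA step: correspondence analysis of a food-by-location contingency table can be computed from the SVD of the standardised residuals (the counts below are invented, not survey values):

```python
# Hedged correspondence-analysis sketch via SVD of standardised residuals;
# the contingency counts are invented, not the survey's values.
import numpy as np

N = np.array([[820, 130, 410],     # rows: chips, soft drinks, chocolate
              [660, 150, 520],     # cols: home, school/work, other
              [710, 190, 360]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal row coordinates
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal column coordinates
print(row_coords[:, :2], col_coords[:, :2], sep="\n")  # biplot axes 1-2
```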
COMMUNICATION | doi:10.20944/preprints201909.0089.v1
Subject: Social Sciences, Geography Keywords: nighttime light data; human activities; karst rocky desertification; environmental impact; China
Online: 8 September 2019 (16:49:12 CEST)
Due to remarkable socioeconomic development, an increasing number of karst rocky desertification areas have been severely affected by human activities in southern China. Effectively analyzing human activities in karst rocky desertification areas is a critical prerequisite for managing and restoring the areas suffering the greatest negative impacts of desertification. At present, a timely and accurate way of quantifying the spatiotemporal variations of human activities in karst rocky desertification areas is still lacking. In this communication, we attempted to quantify human activities from corrected NPP-VIIRS nighttime light data from 2012 to 2018 based on statistical analysis. The results show that a significant increase in nighttime lights can be clearly identified during the study period. The total nighttime lights (TL) related to severe karst rocky desertification (S) were particularly concentrated in Guizhou and Yunnan. The nighttime light intensity (LI) related to the S areas in Chongqing was the strongest, owing to its rapid socioeconomic development. The annual growth rate of nighttime lights (GL) has been slow or even negative in Guangdong because of its various karst rocky desertification restoration programs. This communication provides an effective approach for quantifying human activities and useful information about where prompt attention is required in policy-making for the restoration of karst rocky desertification areas.
ARTICLE | doi:10.20944/preprints202103.0244.v1
Subject: Earth Sciences, Atmospheric Science Keywords: land surface temperature (LST); NDVI; NDBaI; MNDWI; Satellite data
Online: 9 March 2021 (09:17:02 CET)
Analysis of the correlation between spectral indices (the Normalized Difference Vegetation Index, Normalized Difference Barren Index, and Modified Normalized Difference Water Index) and land surface temperature is used in natural resource and environmental studies. This research aimed to analyse land surface temperature (LST) in relation to the dynamics of different indices (NDVI, NDBaI, and MNDWI) using remote sensing data in three selected districts (Gida Kiremu, Limu, and Amuru) of western Ethiopia. LST, NDVI, NDBaI, and MNDWI were calculated from the thermal and multispectral bands of Landsat imagery (Landsat TM for 1990, Landsat ETM+ for 2003, and Landsat OLI/TIRS for 2020). Correlation analysis was used to quantify the relationships between LST and NDVI, NDBaI, and MNDWI. The study found that land surface temperature increased by 5 °C from 1990 to 2020. Vegetated areas (NDVI) and water bodies (MNDWI) have a strong negative relationship with land surface temperature (R² = 0.99 and 0.95, respectively), whereas barren land (NDBaI) has a positive relationship with land surface temperature (R² = 0.96). Finally, we recommend that decision makers and environmental analysts emphasise the importance of vegetation cover and water bodies to minimise the potential impacts of rising land surface temperature.
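The indices themselves are standard band-ratio formulas; as a hedged sketch (band arrays would come from the Landsat scenes, here they are tiny stubs):

```python
# Standard band-ratio index formulas; tiny stub arrays stand in for the
# Landsat surface-reflectance bands used in the study.
import numpy as np

nir, red = np.array([[0.5, 0.4]]), np.array([[0.1, 0.2]])
green, swir = np.array([[0.3, 0.2]]), np.array([[0.1, 0.3]])

ndvi = (nir - red) / (nir + red)          # vegetation: high over dense canopy
mndwi = (green - swir) / (green + swir)   # water: positive over open water
print(ndvi, mndwi, sep="\n")
```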
ARTICLE | doi:10.20944/preprints201903.0122.v1
Subject: Earth Sciences, Geoinformatics Keywords: Classification; SVM Classifier; ML Classifier; Supervised and Unsupervised Classification; Object-based Classification; Multispectral Data
Online: 11 March 2019 (09:01:44 CET)
This paper focuses on the crucial role that remote sensing plays in discerning land features. Remotely collected data provide information in the spectral, spatial, temporal, and radiometric domains, each domain having a specific resolution for the information collected. Diverse sectors such as hydrology, geology, agriculture, land cover mapping, forestry, urban development and planning, and oceanography are known to use and rely on information gathered remotely from different sensors. In the present study, IRS LISS-IV multispectral data are used for land cover mapping. It is known, however, that classifying high-resolution land cover imagery through manual digitizing is time-consuming and far too costly. Therefore, this paper proposes accomplishing classification by applying algorithms computationally. These classifications fall into three classes: supervised, unsupervised, and object-based classification. For supervised classification, two approaches are relied upon for land cover classification of the high-resolution LISS-IV multispectral image: Maximum Likelihood (ML) and the Support Vector Machine (SVM). Finally, the paper proposes a step-by-step procedure for optical image classification. This paper concludes that in optical data classification, SVM classification gives better results than the ML classification technique.
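As a hedged sketch contrasting the two supervised classifiers: scikit-learn has no classifier named "Maximum Likelihood", so Quadratic Discriminant Analysis is used below as its Gaussian maximum-likelihood analogue, and the pixel samples are synthetic stand-ins for LISS-IV training data.

```python
# Hedged comparison sketch: SVM vs a Gaussian maximum-likelihood analogue
# (QDA); synthetic 4-band pixel samples stand in for LISS-IV training data.
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (200, 4)) for m in (0.2, 0.5, 0.8)])
y = np.repeat([0, 1, 2], 200)                     # three land cover classes
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("ML (QDA)", QuadraticDiscriminantAnalysis())]:
    acc = accuracy_score(yte, clf.fit(Xtr, ytr).predict(Xte))
    print(f"{name}: overall accuracy = {acc:.3f}")
```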
REVIEW | doi:10.20944/preprints202102.0108.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Sentiment Analysis; Students' feedback; Students' reviews; Natural language processing; Data mining; Deep learning; Machine learning
Online: 3 February 2021 (10:11:54 CET)
With the whole world still under the COVID-19 pandemic, many schools have moved teaching from physical classrooms to online platforms. It is highly important for schools and online learning platforms to investigate student feedback to gain valuable insights into the online teaching process, so that both platforms and teachers can learn which aspects to improve to achieve better teaching performance. However, handling reviews expressed by students manually would be laborious, and it is unrealistic at the scale of feedback generated by e-learning platforms. To address this problem, both machine learning algorithms and deep learning models have been used in recent research to automatically process students' reviews, extracting the opinions, sentiments, and attitudes expressed by the students. Such studies may play a crucial role in improving various interactive online learning platforms by incorporating automatic analysis of feedback. We therefore conduct an overview study of sentiment analysis in the educational field as presented in recent research, to help readers grasp an overall understanding of this research area. In addition, based on the literature review, we identify three future directions that researchers can focus on in automatic feedback processing: high-level entity extraction, multilingual sentiment analysis, and the handling of figurative language.
ARTICLE | doi:10.20944/preprints202009.0195.v1
Subject: Social Sciences, Geography Keywords: urban governance; public participation; public comments; web-crawling data; qualitative content analysis; urban China
Online: 9 September 2020 (03:37:38 CEST)
Public participation is crucial in the process of urban governance in smart-city initiatives, enabling urban planners and policy makers to take account of real public needs. Our study aims to develop an analytical framework using citizen-centred qualitative data to analyse urban problems and identify the areas most in need of urban governance. Taking a Chinese megacity as the study area, we first utilise a web-crawling tool to retrieve public comments from an online comment board and employ the Baidu Application Programming Interfaces and a qualitative content analysis for data reclassification. We then analyse the urban problems reflected by negative comments in terms of their statistical and spatial distribution, and the associative factors that explain their formation. Our findings show that urban problems are predominantly related to construction and housing, and most frequently appear in industry-oriented areas and newly developed economic development zones on the urban fringe, where the reconciliation of government-centred governance with private governance by real estate developers and property management companies is most needed. Areas with higher land prices and a higher proportion of aged population tend to have fewer urban problems, while various types of civil facilities affect the prevalence of urban problems differently.
ARTICLE | doi:10.20944/preprints202107.0406.v1
Subject: Medicine & Pharmacology, Allergology Keywords: universal health coverage; health insurance claims; administrative data; claims database
Online: 19 July 2021 (11:38:35 CEST)
Although universal health coverage (UHC) is pursued by many countries, not all countries with UHC include dental care among their benefits. Japan, with its long-held tradition of UHC, covers dental care as an essential benefit, and the majority of dental care services are provided to all patients with minimal copayment. Under UHC, the scope of services as well as prices are regulated by a uniform fee schedule, and dentists submit claims according to the uniform format and fee schedule. The author analyzes publicly available dental health insurance claims data as well as a sampling survey on dental hygiene, and illustrates how Japan's dental care is responding to the challenges of population ageing.
REVIEW | doi:10.20944/preprints202007.0153.v1
Online: 8 July 2020 (11:53:33 CEST)
Large datasets that enable researchers to perform investigations with unprecedented rigor are growing increasingly common in neuroimaging. Due to the simultaneous increasing popularity of open science, these state-of-the-art datasets are more accessible than ever to researchers around the world. While analysis of these samples has pushed the field forward, they pose a new set of challenges that might cause difficulties for novice users. Here, we offer practical tips for working with large datasets from the end-user’s perspective. We cover all aspects of the data life cycle: from what to consider when downloading and storing the data, to tips on how to become acquainted with a dataset one did not collect, to what to share when communicating results. This manuscript serves as a practical guide one can use when working with large neuroimaging datasets, thus dissolving barriers to scientific discovery.
ARTICLE | doi:10.20944/preprints201912.0292.v1
Subject: Social Sciences, Geography Keywords: cultural differences; spatial interaction patterns; emotion analysis; Zhihu topic data; cultural geography
Online: 22 December 2019 (10:05:48 CET)
As an important research topic in cultural geography, the exploration and analysis of the laws of regional cultural differences has great significance for the discovery of distinctive cultures, the protection of regional cultures, and an in-depth understanding of cultural differences. In recent years, with the "spatial turn" of sociology, scholars are paying more and more attention to the implicit spatial information in social media data and the various social phenomena and laws it reflects. One important aspect is to grasp social cultural phenomena and their spatial distribution characteristics through text. Using machine learning methods such as popular natural language processing (NLP) techniques, this paper not only extracts hotspot cultural elements from text data but also detects the spatial interaction patterns of specific cultures and the characteristics of emotions towards non-native cultures. Taking the 6,128 answers to the question "what are the differences between South and North China that you never knew" on the Zhihu Q&A platform as an example, and with the help of NLP, this paper explores the cultural differences between South and North China in people's minds. It probes into people's feelings about and cognition of the cultural differences between South and North China from three aspects: the spatial interaction patterns of hotspot cultural elements, the components of hotspot culture, and the emotional characteristics arising under the influence of North-South cultural differences. The study reveals that 1) people from North and South China differ greatly in how they recognize each other's culture; 2) food culture is the most salient among the many cultural differences; and 3) people tend to show negative attitudes towards food cultures different from their own. All these findings shed light on the understanding of regional cultural differences and on addressing cultural conflicts. In addition, this paper provides an effective solution for studying, from a macro perspective, questions that have been difficult for new cultural geography.
ARTICLE | doi:10.20944/preprints202103.0357.v1
Subject: Engineering, Automotive Engineering Keywords: Geomagnetic induced current (GIC); Magnetic field component (dB/dt); Geomagnetic data (GMD); Coronal mass ejections (CMEs); Electric field (E)
Online: 12 March 2021 (23:58:40 CET)
Geomagnetically induced current (GIC) is a ground-end manifestation of space-weather perturbations that society should take seriously. Although GICs do not affect the power system regularly, they can cause large-scale system failures. Equatorial power systems have long been considered safe, since the most intense geomagnetic storms occur at high latitudes. However, the internal damage due to GICs that ultimately led to the South African power system failure has changed this perception. Therefore, a preliminary investigation of GIC activity in the equatorial region is performed to understand the impact of space weather on the power system. The time derivative of the horizontal magnetic field component (dB/dt) is computed as a proxy for GIC activity, based on Faraday's law. All reported power failures are compiled to produce a threshold value of dB/dt that could harm the system. The dB/dt analysis is then extended to show the pattern of GIC activity as a function of magnetic latitude and local time. The results reveal that power networks in the equatorial region may well have suffered from GICs. Moreover, a high number of intense GIC events in this region occurred on the dayside.
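A minimal sketch of the dB/dt proxy described above, assuming 1-minute magnetometer samples; the synthetic field components and the threshold are illustrative, whereas the paper derives its own threshold from reported power failures.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-minute horizontal field components (nT); in practice these
# would come from ground magnetometer records.
bx = 20000 + np.cumsum(rng.normal(0, 5, 1440))
by = 2000 + np.cumsum(rng.normal(0, 5, 1440))

h = np.hypot(bx, by)                # horizontal field magnitude, nT
dbdt = np.gradient(h, 60.0) * 60.0  # time derivative, expressed in nT/min

threshold = 100.0                   # illustrative value only
events = np.flatnonzero(np.abs(dbdt) > threshold)
print(f"{events.size} minutes exceed the illustrative dB/dt threshold")
```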
ARTICLE | doi:10.20944/preprints201801.0217.v1
Subject: Earth Sciences, Atmospheric Science Keywords: data assimilation; statistical diagnostics of analysis residuals; estimation of analysis error; air quality model diagnostics; Desroziers method; cross-validation
Online: 23 January 2018 (16:23:25 CET)
We examine how observations can be used to evaluate an air quality analysis by verifying against passive observations (i.e. cross-validation) that are not used to create the analysis, and we compare these verifications to those made against the same set of (active) observations that were used to generate the analysis. The results show that both active and passive observations can be used to evaluate first-moment metrics (e.g. bias), but only passive observations are useful for evaluating second-moment metrics such as the variance of observed-minus-analysis residuals and the correlation between observations and analysis. We derive a set of diagnostics based on passive observation-minus-analysis residuals and show that the true analysis error variance can be estimated without relying on any statistical optimality assumption. This diagnostic is used to obtain near-optimal analyses that are then used to evaluate the analysis error using several different methods. We compare the estimates according to the methods of Hollingsworth-Lönnberg and Desroziers, a diagnostic we introduce, and the perceived analysis error computed from the analysis scheme, and conclude that as long as the analysis is optimal, all estimates agree within a certain error margin. The analysis error variance at passive observation sites is also obtained.
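The core idea behind the passive-residual diagnostic can be stated compactly. Under the standard assumption that the errors of withheld (passive) observations are uncorrelated with the analysis error, a sketch of the variance decomposition reads:

```latex
% Compact statement of the passive-residual idea, under the standard
% assumption that errors of withheld (passive) observations are
% uncorrelated with the analysis error:
\begin{align}
  \operatorname{var}(O - A) &= \sigma_o^2 + \sigma_a^2
  && \text{(passive sites)} \\
  \hat{\sigma}_a^2 &= \operatorname{var}(O - A) - \sigma_o^2
\end{align}
% No optimality assumption on the analysis is needed. For active
% observations the decomposition fails, because the analysis is
% correlated with the errors of the very observations it assimilated.
```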
REVIEW | doi:10.20944/preprints202209.0032.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: cybersecurity; machine learning; deep learning; artificial intelligence; data-driven decision making; automation; cyber analytics; intelligent systems;
Online: 2 September 2022 (03:32:48 CEST)
Due to the digitization and Internet of Things revolutions, the present electronic world has a wealth of cybersecurity data. Efficiently resolving cyber anomalies and attacks is a growing concern in today's cybersecurity industry all over the world. Traditional security solutions are insufficient to address contemporary security issues due to the rapid proliferation of many sorts of cyber-attacks and threats. Utilizing artificial intelligence knowledge, especially machine learning technology, is essential to providing a dynamically enhanced, automated, and up-to-date security system through analyzing security data. In this paper, we provide an extensive view of machine learning algorithms, emphasizing how they can be employed for intelligent data analysis and automation in cybersecurity through their potential to extract valuable insights from cyber data. We also explore a number of potential real-world use cases where data-driven intelligence, automation, and decision-making enable next-generation cyber protection that is more proactive than traditional approaches. Finally, we highlight the future prospects of machine learning in cybersecurity, along with relevant research directions. Overall, our goal is to explore not only the current state of machine learning and relevant methodologies but also their applicability to future cybersecurity breakthroughs.
ARTICLE | doi:10.20944/preprints201810.0273.v1
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: astroparticle physics, cosmic rays, data life cycle management, data curation, meta data, big data, deep learning, open data
Online: 12 October 2018 (14:48:32 CEST)
Modern experimental astroparticle physics features large-scale setups measuring different messengers, namely high-energy particles generated by cosmic accelerators (e.g. supernova remnants, active galactic nuclei): cosmic and gamma rays, neutrinos, and the recently discovered gravitational waves. Ongoing and future experiments are distributed over the Earth, including ground, underground/underwater setups as well as balloon payloads and spacecraft. The data acquired by these experiments have different formats, storage concepts, and publication policies. Such differences are a crucial issue in the era of big data and of multi-messenger analysis strategies in astroparticle physics. We propose a service, ASTROPARTICLE.ONLINE, within which we are developing an open-science system that enables users to publish, store, search, select, and analyse astroparticle physics data. The cosmic-ray experiments KASCADE-Grande and TAIGA were chosen as pilot experiments to be included in this framework. In the first step of our initiative we will develop and test the following components of the full data life cycle concept: (i) describing, storing, and reusing astroparticle data; (ii) software for performing multi-experiment and multi-messenger analyses, e.g. with deep-learning methods; (iii) outreach, including example applications and tutorials for students and scientists outside the specific research field. In the present paper we describe the concepts of our initiative and, in particular, the plans toward a common, federated astroparticle data storage.
ARTICLE | doi:10.20944/preprints201703.0191.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: High-dimensional data analysis, Multiple hypothesis testing, False discovery rate, Optimum significance threshold, Maximum for reasonable number of rejected hypotheses, Big data analysis
Online: 24 March 2017 (18:29:35 CET)
This paper identifies a criterion for choosing the largest set of rejected hypotheses in high-dimensional data analysis, where multiple hypothesis testing is used in exploratory research to identify significant associations among many variables. The method neither requires predetermined significance thresholds nor presumed thresholds for the false discovery rate. The upper limit for the number of rejected hypotheses is determined by finding the maximum difference between the expected numbers of true and false rejections among all possible sets of rejected hypotheses. Methods for choosing a reasonable number of rejected hypotheses and an application to non-parametric analysis of ordinal survey data are presented.
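One plausible reading of the criterion, sketched below: under the global null the expected number of p-values below a threshold t is m·t, so the gap between k rejections and m·p_(k) expected false rejections can be maximized over k. This is an illustration of the idea, not the paper's exact estimator.

```python
import numpy as np

def optimal_rejections(pvalues: np.ndarray) -> int:
    # Choose k maximizing (rejections) - (expected false rejections m * p_(k)).
    p = np.sort(pvalues)
    m = p.size
    k = np.arange(1, m + 1)
    gap = k - m * p
    return int(k[np.argmax(gap)])

rng = np.random.default_rng(0)
# 50 genuine signals with small p-values mixed with 950 null p-values
p = np.concatenate([rng.uniform(0, 0.01, 50), rng.uniform(0, 1, 950)])
print(optimal_rejections(p))  # prints a count near the number of genuine signals
```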
Subject: Medicine & Pharmacology, Allergology Keywords: Acute Lymphoblastic Leukaemia; Flow Cytometry Data; Fisher’s Ratio; CD38; mathematical oncology; response biomarkers; personalized medicine
Online: 27 October 2020 (15:20:10 CET)
Artificial intelligence methods may help in unveiling information hidden in high-dimensional oncological data. Flow cytometry studies of haematological malignancies provide quantitative data with the potential to be used for the construction of response biomarkers. Many computational methods from the bioinformatics toolbox can be applied to these data but have not been exploited to their full potential in leukaemias, specifically in childhood B-cell acute lymphoblastic leukaemia. In this paper we analysed flow cytometry data obtained at diagnosis from 54 paediatric B-cell acute lymphoblastic leukaemia patients from two local institutions. We constructed classifiers based on Fisher's ratio to quantify differences in the expression levels of immunophenotypical markers between patients with relapsing and non-relapsing disease. The distribution of the marker CD38 was found, and validated, to have a strong discriminating power between the two patient cohorts, thus providing a classifier.
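A minimal sketch of a Fisher's-ratio ranking of markers between relapsing and non-relapsing cohorts; the synthetic data, marker columns, and cohort sizes are hypothetical stand-ins for the diagnostic flow cytometry measurements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in: 54 patients, three illustrative marker intensities,
# and a relapse label (14 relapsing, 40 non-relapsing).
df = pd.DataFrame({
    "CD38": np.r_[rng.normal(3, 1, 14), rng.normal(5, 1, 40)],
    "CD19": rng.normal(4, 1, 54),
    "CD10": rng.normal(4, 1, 54),
    "relapse": np.r_[np.ones(14, int), np.zeros(40, int)],
})

def fishers_ratio(a, b):
    # FR = (mu_a - mu_b)^2 / (sigma_a^2 + sigma_b^2)
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

rel, non = df[df.relapse == 1], df[df.relapse == 0]
ranking = {m: fishers_ratio(rel[m], non[m]) for m in ("CD38", "CD19", "CD10")}
print(sorted(ranking.items(), key=lambda kv: -kv[1]))  # CD38 should rank first here
```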
ARTICLE | doi:10.20944/preprints202105.0589.v1
Subject: Engineering, Automotive Engineering Keywords: Game Ratings; Public Data; Game Data; Data analysis; GRAC(Korea)
Online: 25 May 2021 (08:32:32 CEST)
As of 2020, the public data for game ratings provided by the Game Rating and Administration Committee (GRAC) are more limited than the public data for movie and video ratings provided by the Korea Media Rating Board, and they do not let users see rating information clearly and in detail. To obtain information on a game's rating, one must search for the specific title on the GRAC homepage, which is inconvenient. In order to remove this inconvenience and extend the scope of the public data provided, the author studies a public data API extended on the model of the video-ratings data. To identify the items to be added, this study analyzes the rating data on the GRAC homepage and designs a collection system to build a database. The study then implements a system that provides the data collected for the extended public-data items in the form users want. It is expected to supply rating information on behalf of GRAC, strengthening fairness, satisfying game users' and the public's right to know, and contributing to the promotion and development of the game industry.
REVIEW | doi:10.20944/preprints202111.0429.v1
Subject: Behavioral Sciences, Behavioral Neuroscience Keywords: Agri-Food; Food Supply Chain; Blockchain; IoT; Big Data; Sustainability; Food Security; COVID-19; Food Safety; Digitalization
Online: 23 November 2021 (14:52:59 CET)
Technological advances such as blockchain, artificial intelligence, big data, social media, and geographic information systems represent building blocks of the digital transformation that supports the resilience of the food supply chain (FSC) and increases its efficiency. This paper reviews the literature surrounding digitalization in FSCs. A bibliometric and key-route main path analysis was carried out to objectively and analytically uncover the development of knowledge on digitalization within the context of sustainable FSCs. The research began with the selection of 2140 articles published over nearly five decades. The articles were then examined according to several bibliometric metrics such as year of publication, countries, institutions, sources, authors, and keyword frequency. A keyword co-occurrence network was generated to cluster the relevant literature. Findings of the review and bibliometric analysis indicate that research at the intersection of technology and the FSC has gained substantial interest from scholars. On the basis of the keyword co-occurrence network, the literature focuses on the role of information communication technology for agriculture and food security, food waste and the circular economy, and the merging of the Internet of Things and blockchain in the FSC. The analysis of the key-route main path uncovers three critical periods marking the development of technology-enabled FSCs. The study offers scholars a better understanding of digitalization within the agri-food industry and of the current knowledge gaps for future research. Practitioners may find the review useful to remain ahead of the latest discussions of technology-enabled FSCs. To the authors' best knowledge, the current study is one of the few endeavors to explore technology-enabled FSCs using a comprehensive sample of journal articles published during the past five decades.
ARTICLE | doi:10.20944/preprints202007.0078.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: personalization; decision making; medical data; artificial intelligence; Data-driving; Big Data; Data Mining; Machine Learning
Online: 5 July 2020 (15:04:17 CEST)
The study applies machine learning and data mining methods to personalizing treatment, which allows individual patient characteristics to be investigated. Personalization is built on clustering and association rules. We suggest determining the average distance between instances in order to find optimal performance metrics. A formalization of the medical data pre-processing stage for finding personalized solutions based on current standards and pharmaceutical protocols is proposed, and a model of patient data is built. The paper presents a novel clustering approach built on an ensemble of clustering algorithms whose Hopkins metric is better than that of the k-means algorithm. Personalized treatment is usually based on decision trees, an approach that requires a lot of computation time and cannot be parallelized. Therefore, it is proposed to classify patients by condition and to determine the deviations of their parameters from the normative parameters of the group, as well as from the average parameters. This makes it possible to create a personalized approach to treatment for each patient based on long-term monitoring. According to the results of the analysis, it becomes possible to predict the optimal conditions for a particular patient and to find a medication regimen matching their personal characteristics.
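For reference, a sketch of one common formulation of the Hopkins statistic mentioned above (values near 1 suggest strongly clusterable data, values near 0.5 spatial randomness); this is a generic implementation, not the paper's ensemble method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X: np.ndarray, m: int = 50, seed: int = 0) -> float:
    """One common formulation: H near 1 => clusterable, near 0.5 => random."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sample = X[rng.choice(n, size=m, replace=False)]
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()  # uniform point -> nearest datum
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]     # datum -> nearest *other* datum
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(3, 0.1, (100, 2))])
print(hopkins(blobs), hopkins(rng.uniform(size=(200, 2))))  # ~1 vs ~0.5
```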
ARTICLE | doi:10.20944/preprints201806.0282.v1
Subject: Earth Sciences, Geoinformatics Keywords: land-use/land-cover; multi-decadal change analysis; irrigation ponds; textural features; supervised classification; multi-source data
Online: 18 June 2018 (16:40:31 CEST)
A multi-decadal change analysis of the irrigation ponds in Taoyuan, Taiwan was conducted using multi-source data including digitized ancient maps, declassified single-band CORONA satellite images, and multispectral SPOT images. Supervised LULC classifications were conducted using four textural features derived from the single-band CORONA images and spectral features derived from the SPOT images. Post-classification analysis revealed that the number of irrigation ponds in the study area decreased during the post-World War II farmland consolidation period (1945–1965) and the subsequent industrialization period (1970–2000). However, efforts to restore irrigation ponds in recent years have resulted in gradual increases in the number (9%) and total area (12%) of irrigation ponds in the study area.
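As an illustration of the textural-feature step, a sketch of grey-level co-occurrence matrix (GLCM) features from a single-band image; the random array stands in for a CORONA tile, and whether these four properties match the paper's exact choice is an assumption. Function names follow recent scikit-image releases (formerly spelled grey*).

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

band = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # stand-in for a CORONA tile

# Co-occurrence of grey levels at distance 1, in two directions
glcm = graycomatrix(band, distances=[1], angles=[0, np.pi / 2],
                    levels=256, symmetric=True, normed=True)

# Four textural features, averaged over directions, to feed a supervised classifier
features = {p: graycoprops(glcm, p).mean()
            for p in ("contrast", "homogeneity", "energy", "correlation")}
print(features)
```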
ARTICLE | doi:10.20944/preprints201908.0320.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Electrocardiography Analysis; Persistence Landscape; Signal Analysis; Machine Learning; Topological Data Analysis; Topological Signal Signature; Classification; Time Series Analysis; Biomedical Signal Analysis; Persistence Homology
Online: 30 August 2019 (09:51:40 CEST)
Data can be illustrated as shapes, and those shapes can provide insight for data modeling and information extraction. Topological data analysis offers an alternative perspective on biomedical data analysis and knowledge discovery using tools from algebraic topology. In the present work, we study the application of topological data analysis to personalized electrocardiographic signal classification for arrhythmia analysis. Using phase-space reconstruction, the signal samples are converted into point clouds to facilitate topological analysis. Persistence landscapes are then extracted from the point clouds as features for the arrhythmia classification task. We find that the proposed method is robust to the training set size: with a training set of only 20%, normal heartbeats are recognized with 100% accuracy, ventricular beats with 97.13%, supra-ventricular beats with 94.27%, and fusion beats with 94.27% in the corresponding experiments. Maintaining high performance with small training samples makes the proposed method especially applicable to personalized analysis. With the present study, we show that topological data analysis can be a useful tool in biomedical signal analysis and provides powerful capability for personalized analysis.
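A sketch of the two front-end steps named in the abstract: delay embedding of a signal into a point cloud, followed by persistent-homology computation. The ripser dependency and the synthetic beat are assumptions, and the persistence-landscape extraction is omitted.

```python
import numpy as np
from ripser import ripser  # assumed dependency for persistence diagrams

def delay_embed(signal: np.ndarray, dim: int = 3, tau: int = 4) -> np.ndarray:
    # Phase-space reconstruction: x(t), x(t+tau), ..., x(t+(dim-1)*tau)
    n = len(signal) - (dim - 1) * tau
    return np.column_stack([signal[i * tau : i * tau + n] for i in range(dim)])

beat = np.sin(np.linspace(0, 4 * np.pi, 200))  # stand-in for one heartbeat segment
cloud = delay_embed(beat)                       # point cloud in R^3
diagrams = ripser(cloud, maxdim=1)["dgms"]      # H0 and H1 persistence diagrams
print(diagrams[1])  # loops in phase space; raw material for landscape features
```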
ARTICLE | doi:10.20944/preprints202103.0593.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Business Intelligence; Data Mining; Data Warehouse.
Online: 24 March 2021 (13:47:31 CET)
In the coming years, the number of digital applications and services built on cloud-native systems will be enormous; according to IDC, it will exceed 500 million by 2023, which corresponds to the sum of all applications developed over the last 40 years. If that growth matters to your organization, this article is for you.
ARTICLE | doi:10.20944/preprints202012.0468.v1
Online: 18 December 2020 (13:29:38 CET)
This manuscript describes the construction and validation of high-resolution daily gridded (0.05° × 0.05°) rainfall and maximum and minimum temperature data for Bangladesh: the Enhancing National Climate Services for Bangladesh Meteorological Department (ENACTS-BMD) dataset. The dataset was generated by merging data from weather stations, satellite products (for rainfall), and reanalysis (for temperature). ENACTS-BMD is the first high-resolution gridded surface meteorological dataset developed specifically for studies of surface climate processes in Bangladesh. Its record begins in January 1981, it is updated monthly in near real-time, and outputs have daily, decadal, and monthly time resolution. The Climate Data Tools (CDT), developed by the International Research Institute for Climate and Society (IRI), Columbia University, is used to generate the dataset. The data processing includes the collection of weather-station and gridded data, quality control of station data, downscaling of the reanalysis temperature, bias correction of both the satellite rainfall and the downscaled reanalysis temperature, and the combination of the station and bias-corrected gridded data. The ENACTS-BMD dataset is available as an open-access product on BMD's official website, enhancing the provision of services, overcoming challenges of data quality, availability, and access, and promoting engagement and use by stakeholders.
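A deliberately simplified sketch of the merging idea (bias-correcting a gridded estimate against gauges, then blending); the real CDT workflow is far more elaborate, and all numbers and weights here are hypothetical.

```python
import numpy as np

station_rain = np.array([12.0, 0.0, 5.5])     # gauge observations (mm/day)
grid_at_stations = np.array([9.0, 1.0, 4.0])  # satellite estimate at the gauges

# Mean-bias correction of the gridded estimate against the stations
bias = (station_rain - grid_at_stations).mean()
corrected_grid = grid_at_stations + bias

# Simple blend toward the stations (weight is hypothetical; real merging
# uses spatially varying weights and interpolation)
w = 0.7
merged = w * station_rain + (1 - w) * corrected_grid
print(merged)
```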
CASE REPORT | doi:10.20944/preprints201801.0066.v1
Online: 8 January 2018 (11:11:47 CET)
The implementation of the European Cohesion Policy, aiming at fostering regional competitiveness, economic growth, and the creation of new jobs, is documented over the period 2014–2020 in the publicly available Open Data Portal for the European Structural and Investment Funds. On the basis of this source, this paper describes the process of data mining and visualization for producing information on regional programmes' performance in achieving effective expenditure of resources.
ARTICLE | doi:10.20944/preprints202205.0344.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Linked (open) Data; Semantic Interoperability; Data Mapping; Governmental Data; SPARQL; Ontologies
Online: 25 May 2022 (08:18:46 CEST)
In this paper, we present a method to map information on service activity provision residing in governmental portals across the European Commission. To perform this, we used as a basis the enriched Greek e-GIF ontology, modeling concepts and relations in one of the two data portals examined (i.e., Points of Single Contact), since the relevant information on the second was not provided. Mapping consisted of transforming the information appearing in governmental portals into RDF format (i.e., as Linked Data) so that it can be easily exchanged. Mapping proved a tedious task, since no description of how information is modeled in the second Point of Single Contact is provided, and it had to be extracted manually.
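A minimal sketch of expressing a portal record as RDF triples with rdflib so that it becomes exchangeable Linked Data; the namespace and property names are placeholders, not the actual e-GIF ontology.

```python
from rdflib import Graph, Literal, Namespace

EGIF = Namespace("http://example.org/egif#")  # hypothetical namespace
g = Graph()
g.bind("egif", EGIF)

# One service-provision record as triples (subject, predicate, object)
service = EGIF["service/123"]
g.add((service, EGIF.provisionActivity, Literal("Company incorporation")))
g.add((service, EGIF.competentAuthority, Literal("Ministry of Development")))

print(g.serialize(format="turtle"))  # exchangeable Linked Data representation
```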
ARTICLE | doi:10.20944/preprints202111.0073.v1
Subject: Medicine & Pharmacology, Other Keywords: data quality; OMOP CDM; EHDEN; healthcare data; real world data; RWD
Online: 3 November 2021 (09:12:54 CET)
Background: Observational health data have the potential to be a rich resource to inform clinical practice and regulatory decision making. However, the lack of standard data quality processes makes it difficult to know whether these data are research-ready. The EHDEN COVID-19 Rapid Collaboration Call presented the opportunity to assess how the newly developed open-source tool Data Quality Dashboard (DQD) informs the quality of data in a federated network. Methods: 15 Data Partners (DPs) from 10 different countries worked with the EHDEN taskforce to map their data to the OMOP CDM. Throughout the process at least two DQD results were collected and compared for each DP. Results: All DPs showed an improvement in their data quality between the first and last run of the DQD. The DQD excelled at helping DPs identify and fix conformance issues but showed less of an impact on completeness and plausibility checks. Conclusions: This is the first study to apply the DQD on multiple, disparate databases across a network. While study-specific checks should still be run, we recommend that all data holders converting their data to the OMOP CDM use the DQD, as it ensures conformance to the model specifications and that a database meets a baseline level of completeness and plausibility for use in research.
ARTICLE | doi:10.20944/preprints202110.0103.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Data Analytics; Analytics; Supply Chain Input; Supply Chain; Data Science; Data
Online: 6 October 2021 (10:38:42 CEST)
One of the most remarkable developments of the 20th century was the digitalization of technical progress, which changed the output of companies worldwide and became a defining feature of the century. The growth of information technology systems and the implementation of new technical advances, which enhance the integrity, agility, and long-term organizational performance of the supply chain, distinguish a digital supply chain from other supply chains. For example, Internet of Things (IoT)-enabled information exchange and big data analysis might be used to regulate mismatches between supply and demand. This literature investigation was undertaken to assess contemporary ideas and concepts in the field of data analysis in the context of supply chain management. The research was conducted as a comprehensive systematic literature review (SLR) drawing on a total of 71 papers from leading journals. The SLR finds that integrating data analytics into supply chain management can yield long-term benefits on the input side, i.e., improved strategic development, management, and other areas.
REVIEW | doi:10.20944/preprints202009.0500.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: COVID-19; impact on society during COVID-19; behavioral impact of COVID-19; government policies against COVID-19; measures adopted by the government; COVID-19 Statistics; Infection rate and Data analysis
Online: 21 September 2020 (11:09:11 CEST)
Background: The COVID-19 pandemic has pulled us all a few steps back: we no longer shake hands or hug each other when we meet friends and family after a gap, but instead greet them by saying Namaste with joined hands. As we all know, COVID-19 spreads through the air, and the only way to shield ourselves is by maintaining a safe distance from one another. Methodology: In order to conduct a meta-analysis of the number of COVID-19 cases in Kerala and India, data were retrieved from various sites hosted by government bodies. The data for analysis were collected from May 2020 to July 2020. The average number of days required to reach every 5000 fresh cases was also calculated from these data. COVID-19 has affected the economy holistically, across financial, behavioral, and societal aspects. Conclusion: Lifting the lockdown in a step-by-step process, keeping in mind the necessities of the nation, was a thoughtful act, but people who mistook this opportunity and did not remain in quarantine after coming from abroad were recognized as the reason behind the sudden and uncontrolled rise in the number of COVID-19 cases in Kerala, India. The government authorities had no option but to lift the restrictions to reduce the economic burdens that had already affected daily wage workers and farmers, prompting some to take their own lives.
ARTICLE | doi:10.20944/preprints202205.0182.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: Tourism; Measuring sustainability; Tourist satisfaction; E-reputation; Sustainable development; Sentiment analysis; ETIS; Open data; Geospatial Index
Online: 13 May 2022 (07:58:47 CEST)
The importance of measuring sustainability in tourism has advanced significantly in recent years, following the need to manage the impact of tourism on territories and host communities. It was further boosted by the pandemic, during which sustainability was defined as one of the central elements for restarting global tourism. The ETIS model, developed by the European Commission, is a point of reference based on self-assessment, data collection, and analysis by the destinations themselves. The application of the ETIS toolkit has faced many challenges, especially at the sub-national level, mostly related to the lack of available and updated data to feed the model. The hypothesis explored by the authors is to solve these implementation issues by developing an indicator that uses sentiment analysis to frame e-reputation and tourist satisfaction, further combined with other open data sources. The Tourism Sustainability Index (TSI) can provide a scalable and geo-referenced evaluation of tourism sustainability, measuring the four pillars and sub-components referenced to the ETIS criteria, and is applicable to any tourism destination. Results show that the TSI is a consistent and valid tool for destinations to analyze sustainability, monitor its evolution through time periods and sub-areas, and compare it to benchmark or competing areas.
Subject: Engineering, Automotive Engineering Keywords: Business Intelligence; Data warehouse; Data Marts; Architecture; Data; Information; cloud; Data Mining; evolution; technologic companies; tools; software
Online: 24 March 2021 (13:06:53 CET)
Information has been and will remain a vital element for individuals and departmental groups in an organization. That is why there are technologies that help us manage data properly; Business Intelligence is responsible for bringing technological solutions that correctly and effectively manage the entire volume of necessary and important information for companies. Among the solutions offered by Business Intelligence are data warehouses and data mining, among other business technologies that, working together, achieve the objectives proposed by an organization. It is important to highlight that these business technologies have been present since the 1950s and have evolved over time, improving processes, infrastructure, and methodologies and implementing new technologies, which have helped to correct past mistakes in information management for companies. A question remains about Business Intelligence: could it be that, in the not-too-distant future, it will be adopted as an essential standard or norm in any organization for data management, since it provides many benefits and avoids failures when classifying information? On the other hand, cloud storage has been the best alternative to safeguard information without depending on physical storage media, which are not 100% secure and are exposed to partial or total loss of information through hardware failures or security failures caused by mishandling.
ARTICLE | doi:10.20944/preprints202111.0410.v1
Subject: Engineering, Other Keywords: Data compression; data hiding; psnr; mse; virtual data; public cloud; quantization error
Online: 22 November 2021 (15:17:12 CET)
Nowadays, information security is a challenge, especially when data are transmitted or shared in public clouds. Many researchers have proposed techniques that fail to provide data integrity, security, and authentication, or that raise other issues related to sensitive data. The most common techniques used to protect data during transmission on a public cloud are cryptography, steganography, and compression. The proposed scheme suggests an entirely new approach to data security on the public cloud: it makes secret data completely invisible behind a carrier object, such that it is not detected by image performance parameters like PSNR, MSE, and entropy. Detailed results are explained in the results section of the paper. The proposed technique has better outcomes than existing techniques as a security mechanism on a public cloud. The primary focus of the suggested approach is to minimize the integrity loss of public storage data due to unrestricted access rights by users. Improving the reusability of the carrier even after data have been concealed is a challenging task, and it is achieved through the suggested approach.
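For reference, the standard MSE and PSNR computations used to argue that a stego image is visually indistinguishable from its carrier; the one-bit perturbation below is only a toy stand-in for an embedding scheme.

```python
import numpy as np

def mse(carrier: np.ndarray, stego: np.ndarray) -> float:
    return float(np.mean((carrier.astype(float) - stego.astype(float)) ** 2))

def psnr(carrier: np.ndarray, stego: np.ndarray, peak: float = 255.0) -> float:
    # PSNR = 10 * log10(peak^2 / MSE); infinite when the images are identical
    m = mse(carrier, stego)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

carrier = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
stego = carrier.copy()
stego[0, 0] ^= 1  # flip one least-significant bit, a toy embedding
print(mse(carrier, stego), psnr(carrier, stego))  # tiny MSE, very high PSNR
```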
ARTICLE | doi:10.20944/preprints201808.0350.v2
Subject: Mathematics & Computer Science, Other Keywords: big data; clustering; data mining; educational data mining; e-learning; profile learning
Online: 19 October 2018 (05:58:05 CEST)
Educational data mining is an evolving discipline that focuses on the improvement of self-learning and adaptive methods. It is used for finding hidden patterns and intrinsic structures in educational data. In education, heterogeneous data are involved and are continuously growing within the big-data paradigm; extracting meaningful information adaptively from big educational data requires specific data mining techniques. This paper presents a clustering approach to partition students into different groups or clusters based on their learning behavior. Furthermore, a personalized e-learning system architecture is presented which selects and delivers teaching content according to students' learning capabilities. The primary objective is the discovery of optimal settings in which learners can improve their learning capabilities; moreover, the administration can find essential hidden patterns to bring effective reforms to the existing system. The clustering methods K-Means, K-Medoids, Density-Based Spatial Clustering of Applications with Noise, Agglomerative Hierarchical Cluster Tree, and Clustering by Fast Search and Finding of Density Peaks via Heat Diffusion (CFSFDP-HD) are analyzed on educational data. It is observed that more robust results can be achieved by replacing the existing methods with CFSFDP-HD. Data mining techniques are equally effective for analyzing big data to make education systems vigorous.
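A sketch of partitioning learners with some of the classical methods named above, using scikit-learn; the random feature matrix stands in for learning-behavior data, and CFSFDP-HD is omitted because it has no reference scikit-learn implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering

X = np.random.rand(300, 4)  # stand-in for learning-behavior features

models = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=4),
    "dbscan": DBSCAN(eps=0.3, min_samples=5),
}
for name, model in models.items():
    labels = model.fit_predict(X)  # cluster label per student
    print(name, np.unique(labels).size, "clusters")
```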
REVIEW | doi:10.20944/preprints201807.0059.v1
Subject: Life Sciences, Biophysics Keywords: data normalization; data scaling; zero-sum; metabolic fingerprinting; NMR; statistical data analysis
Online: 3 July 2018 (16:22:31 CEST)
The aim of this article is to summarize recent bioinformatic and statistical developments applicable to NMR-based metabolomics. Extracting relevant information from large multivariate datasets by statistical data analysis strategies can be of considerable complexity. Typical tasks comprise, for example, the classification of specimens, the identification of differentially produced metabolites, and the estimation of fold changes. In this context it is of prime importance to minimize contributions from unwanted biases and experimental variance prior to these analyses; this is the goal of data normalization, and special emphasis is therefore given to different normalization strategies. In the first part, we discuss the requirements, pros, and cons of a variety of commonly applied strategies. In the second part, we concentrate on possible solutions for cases where the requirements of the standard strategies are not fulfilled. In the last part, very recent developments are discussed that allow reliable estimation of metabolic signatures for sample classification without prior data normalization. Throughout, special emphasis is given to techniques that have worked well in our hands.
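As one concrete example of a commonly applied strategy, a sketch of probabilistic quotient normalization (PQN) on a spectra-by-features matrix; whether PQN is among the strategies this particular review favors is not stated in the abstract.

```python
import numpy as np

def pqn(spectra: np.ndarray) -> np.ndarray:
    """Probabilistic quotient normalization; rows are spectra, columns features."""
    # Integral (total-sum) normalization first, as commonly recommended
    spectra = spectra / spectra.sum(axis=1, keepdims=True)
    reference = np.median(spectra, axis=0)   # median reference spectrum
    quotients = spectra / reference          # feature-wise fold changes
    dilution = np.median(quotients, axis=1)  # most probable dilution per spectrum
    return spectra / dilution[:, None]

X = np.abs(np.random.randn(10, 200)) + 0.1   # synthetic positive intensities
print(pqn(X).sum(axis=1))
```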
Subject: Mathematics & Computer Science, Other Keywords: Data Science; Advertising Campaign; Effectiveness; Evaluating, Social Media; Digital Marketing; Sentiment Analysis; Instagram; Machine Learning
Online: 11 August 2021 (08:44:07 CEST)
The growth of social media has changed the face of many aspects of marketing, such as online and digital marketing, and it has changed the way modern humans communicate and connect with others. Behavior on these platforms cannot and should not be evaluated with the strategies of other marketing channels and media. Owing to the nature of social media, the data are rich, precise, and lean, but processing them and extracting knowledge and insights from them is problematic, and evaluating the effectiveness of a marketing endeavor is one task that depends on these data. The current research assesses the effectiveness of an advertising campaign on Instagram via advertising cost and sentiment classification of audience opinion regarding the campaign. The methodology used is the standard data mining process, i.e., CRISP-DM. Multiple machine learning models and approaches were studied to train a prediction model on the data; to find the most accurate model, a grid search was performed over different algorithms and combinations of hyper-parameters. The obtained results revealed that although the number of unprofitable advertising media was higher than that of profitable media, the overall status of the campaign was profitable, both in the cost-effectiveness approach and in the sentiment analysis approach. Another valuable outcome of this research is a set of general and specific insights that can be used to shape a better-performing and more effective advertising campaign on Instagram.
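A sketch of the model-selection step described, i.e., a grid search over several algorithms, each with its own hyper-parameter grid, compared by cross-validation; the toy features and grids are assumptions, not the study's actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in for vectorized audience comments and sentiment labels
X = [[0, 1], [1, 0], [1, 1], [0, 0]] * 10
y = [1, 0, 1, 0] * 10

searches = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=1000),
                           {"C": [0.1, 1.0, 10.0]}, cv=5),
    "forest": GridSearchCV(RandomForestClassifier(random_state=0),
                           {"n_estimators": [50, 100], "max_depth": [None, 5]}, cv=5),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_score_, search.best_params_)
```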
Online: 30 October 2020 (15:35:00 CET)
In today's information age, data are becoming more and more important. While other industries achieve tangible improvements by applying cutting-edge information technology, the construction industry still lags far behind. Cost, schedule, and performance control are three major functions in the project execution phase. Alongside their individual importance, cost-schedule integration has been a significant challenge in the construction industry over the past five decades. Although much effort has been put into this development, no resulting method is used in construction practice. The purpose of this study is to propose a new method to integrate cost and schedule data using big data technology. The proposed algorithm is designed to provide data integrity and flexibility in the integration process, a considerable reduction in the time needed to build and change the database, and practical use on a construction site. It is expected that the proposed method can transform, in a data-friendly way, the current situation in which field engineers regard information management as one of their most troublesome tasks.
ARTICLE | doi:10.20944/preprints201701.0090.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: transportation data; data interlinking; automatic schema matching
Online: 20 January 2017 (03:38:06 CET)
Multimodality requires the integration of heterogeneous transportation data to construct a broad view of the transportation network. Many new transportation services are emerging while remaining isolated from previously existing networks. This leads them to publish their data sources on the web, according to Linked Data principles, in order to gain visibility. Our interest is to use these data to construct an extended transportation network that links these new services to existing ones. The main problems we tackle in this article fall into the categories of automatic schema matching and data interlinking. We propose an approach that uses web services as mediators to help automatically detect geospatial properties and map them between two different schemas. We also propose a new interlinking approach that enables users to define rich semantic links between datasets in a flexible and customizable way.
ARTICLE | doi:10.20944/preprints202111.0266.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Pan-Cancer; somatic point mutations; cancer subtyping; biomarker discovery; driver genes; personalized medicine; health data analytics
Online: 15 November 2021 (13:51:33 CET)
The advent of high-throughput sequencing has enabled researchers to systematically evaluate the genetic variations in cancer, resulting in the identification of many cancer-associated genes. Although cancers in the same tissue are widely categorized in the same group, they demonstrate many differences in their mutational profiles; hence there is no "silver bullet" for the treatment of a given cancer type. This reveals the importance of developing a pipeline to identify cancer-associated genes accurately and to re-classify patients with similar mutational profiles. Classifying cancer patients with similar mutational profiles may help discover subtypes of patients who might benefit from specific treatment types. In this study, we propose a new machine learning pipeline that identifies protein-coding genes mutated in a significant portion of samples in order to identify cancer subtypes. We applied our pipeline to 12,270 samples collected from the International Cancer Genome Consortium (ICGC), covering 19 cancer types, and identified 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates distinguishable properties, including unique cancer-related signaling pathways, for most of which targeted treatment options are currently available. This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profiles of patients. We also comprehensively study the causes of mutations among the samples in each subtype by mining their mutational signatures, which provides important insight into the active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and axon guidance pathways, are frequently observed in cancer. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes. In addition, our study of "gene-motifs" suggests the importance of considering both the context of the mutations and the mutational processes when identifying cancer-associated genes. The source code for our proposed clustering pipeline and analysis is publicly available at: https://github.com/bcb-sut/Pan-Cancer.
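A sketch of the general subtyping idea: build a binary gene-by-sample mutation matrix and cluster samples with similar profiles. This illustrates the concept only and is not the pipeline from the repository above; the calls, genes, and cluster count are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical long-format mutation calls: one row per (sample, gene) hit
calls = pd.DataFrame({"sample": ["s1", "s1", "s2", "s3"],
                      "gene": ["TP53", "KRAS", "TP53", "EGFR"]})

# Binary sample-by-gene matrix (1 = gene mutated in that sample)
matrix = pd.crosstab(calls["sample"], calls["gene"]).clip(upper=1)

# Cluster samples with similar mutational profiles into candidate subtypes
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
print(dict(zip(matrix.index, labels)))
```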