ARTICLE | doi:10.20944/preprints202007.0078.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: personalization; decision making; medical data; artificial intelligence; data-driven; Big Data; Data Mining; Machine Learning
Online: 5 July 2020 (15:04:17 CEST)
This study applies machine learning and data mining methods to treatment personalization, which makes it possible to investigate individual patient characteristics. Personalization is built on clustering and association rules. To find optimal performance metrics, it is suggested to determine the average distance between instances. The paper formalizes the medical data pre-processing stage for finding personalized solutions based on current standards and pharmaceutical protocols, and a model of patient data is built. A novel clustering approach is presented, built on an ensemble of clustering algorithms that achieves a better Hopkins statistic than the k-means algorithm. Personalized treatment is usually based on decision trees, an approach that requires substantial computation time and cannot be parallelized. It is therefore proposed to classify patients by condition and to determine the deviations of their parameters from the normative and average parameters of the group. This makes it possible to create a personalized approach to treatment for each patient based on long-term monitoring. Based on the results of the analysis, it becomes possible to predict the optimal conditions for a particular patient and to select medication according to personal characteristics.
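The Hopkins statistic mentioned above measures how far a dataset departs from spatial randomness. A minimal sketch of one common formulation (not the authors' implementation), using numpy and scikit-learn on a hypothetical standardized patient-feature matrix `X`:

```python
# Hopkins statistic sketch: values near 1 suggest clusterable data,
# values near 0.5 suggest a roughly uniform (non-clustered) distribution.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_size=50, random_state=0):
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    m = min(sample_size, n - 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # Distances from m uniformly generated points to their nearest real point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u_dist, _ = nn.kneighbors(uniform, n_neighbors=1)

    # Distances from m sampled real points to their nearest *other* real point.
    sample = X[rng.choice(n, m, replace=False)]
    w_dist, _ = nn.kneighbors(sample, n_neighbors=2)
    w_dist = w_dist[:, 1]  # column 0 is the zero self-distance

    return u_dist.sum() / (u_dist.sum() + w_dist.sum())
```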
ARTICLE | doi:10.20944/preprints202306.1378.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Data Generation; Anomaly Data; User Behavior Generation; Big Data
Online: 19 June 2023 (16:31:37 CEST)
The rising importance of Big Data in modern information analysis is supported by vast quantities of user data, but sufficient data can only be collected within certain data-gathering contexts. In many cases a domain is too novel, too niche, or too sparsely collected to adequately support Big Data tasks. To remedy this, we have created the ADG Engine, which generates additional data that follows the trends and patterns of the data that has already been collected. Using a database structure that tracks users across different activity types, the ADG Engine can use all available information to maximize the authenticity of the generated data. Our efforts are particularly geared towards data analytics: the engine identifies abnormalities in the data and allows the user to generate normal and abnormal data at custom ratios. In situations where it would be impractical or impossible to expand the available dataset by collecting more data, it remains possible to move forward with algorithmically expanded datasets.
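The ADG Engine itself is not reproduced here; the following minimal sketch (all names, distributions, and the Poisson model are illustrative assumptions) shows the general idea of generating activity records with a configurable normal-to-abnormal ratio:

```python
# Illustrative sketch: draw "normal" activity counts from one distribution
# and "anomalous" counts from a shifted one, at a user-chosen ratio.
import numpy as np

def generate_activity(n_records=1000, anomaly_ratio=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_anom = int(n_records * anomaly_ratio)
    n_norm = n_records - n_anom

    normal = rng.poisson(lam=20, size=n_norm)       # typical daily activity counts
    anomalous = rng.poisson(lam=120, size=n_anom)   # unusually heavy activity

    counts = np.concatenate([normal, anomalous])
    labels = np.concatenate([np.zeros(n_norm, int), np.ones(n_anom, int)])

    order = rng.permutation(n_records)              # shuffle records and labels together
    return counts[order], labels[order]

counts, labels = generate_activity(n_records=500, anomaly_ratio=0.1)
```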
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Data Visualization; Visual Analytics; Natural Language Processing; Dark Data; Pattern Recognition
Online: 28 October 2020 (07:47:26 CET)
Over the years, there has been a significant rise in the world's scientific knowledge. However, most of it lacks structure and is often termed Dark Data. Both humans and expert systems have continually faced difficulty in analyzing and comprehending such overwhelming amounts of information, which is crucial for solving several real-world problems. Information and data visualization techniques offer a promising way to explore such data by allowing quick comprehension of information, the discovery of emerging trends, and the identification of relationships and patterns. In this tutorial, we utilize the rich corpus of PubMed, comprising more than 30 million citations from the biomedical literature, to visually explore and understand the underlying key insights using various information visualization techniques. With this study, we aim to diminish the limitations of human cognition and perception in handling and examining such large volumes of data by speeding up decision making and pattern recognition and by enabling decision-makers to fully understand data insights and make informed decisions.
ARTICLE | doi:10.20944/preprints202110.0103.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Data Analytics; Analytics; Supply Chain Input; Supply Chain; Data Science; Data
Online: 6 October 2021 (10:38:42 CEST)
One of the most remarkable developments of the 20th century was the digitalization of technical progress, which changed the output of companies worldwide. The growth of information technology systems and the implementation of new technical advances, which enhance the integrity, agility, and long-term organizational performance of the supply chain, distinguish a digital supply chain from other supply chains. For example, Internet of Things (IoT)-enabled information exchange and Big Data analysis can be used to regulate the mismatch between supply and demand. This literature review was undertaken to assess contemporary ideas and concepts in the field of data analysis in the context of supply chain management. The research was conducted as a systematic literature review (SLR) drawing on a total of 71 papers from leading journals. The SLR found that integrating data analytics into supply chain management can have long-term benefits on the input side, i.e., improved strategic development, management, and other areas.
Subject: Computer Science And Mathematics, Information Systems Keywords: Academic Analytics; data storage; education and big data; analysis of data; learning analytics
Online: 19 July 2020 (20:37:39 CEST)
Business Intelligence, defined as "the ability to understand the interrelations of the facts that are presented in such a way that it can guide the action towards achieving a desired goal", has been used since 1958 to transform data into information, and information into knowledge, to be used when making decisions in a business environment. But what would happen if we took the same principles of business intelligence and applied them to the academic environment? The answer would be the creation of Academic Analytics, a term defined as the process of evaluating and analyzing organizational information from university systems for reporting and decision making. Its characteristics allow it to be used more and more in institutions, since the information they accumulate about their students and teachers includes data such as academic performance, student success, persistence, and retention. Academic Analytics enables an analysis of data that is very important for decision making in the educational institutional environment, aggregating valuable information for academic research activity and providing easy-to-use business intelligence tools. This article presents a proposal for an information system based on Academic Analytics, built with ASP.Net technology and storing data in the Microsoft SQL Server database engine, and designs a model supported by Academic Analytics for the collection and analysis of data from the information systems of educational institutions. The proposed system is capable of displaying statistics on the historical data of students and teachers over academic periods, without direct access to institutional databases, with the purpose of gathering the information that the director, the teacher, and the student need for decision making. The model was validated with information on students and teachers from the last five years, and data could be exported as PDF, CSV, and XLS files. The findings allow us to state that it is extremely important to analyze the data held in the information systems of educational institutions when making decisions. After validation of the model, it was established that students should review the reports of their academic performance in order to carry out a process of self-evaluation; that teachers should see the results of the data obtained in order to carry out processes of self-evaluation and adaptation of content and dynamics in the classroom; and that the head of the program should use them to make decisions.
ARTICLE | doi:10.20944/preprints202307.1199.v1
Subject: Engineering, Industrial And Manufacturing Engineering Keywords: smart manufacturing; big data; manufacturing process; big data analytics; decision-making; uncertainty
Online: 18 July 2023 (09:38:31 CEST)
This paper presents a systematic approach to developing big data analytics for manufacturing-process-relevant decision-making activities from the perspective of smart manufacturing. The proposed analytics consists of five integrated system components: 1) a data preparation system, 2) a data exploration system, 3) a data visualization system, 4) a data analysis system, and 5) a knowledge extraction system. The functional requirements of the integrated systems are elucidated. In addition, Java- and spreadsheet-based systems are developed to realize the proposed integrated system components. Finally, the efficacy of the analytics is demonstrated using a case study where the goal is to determine the optimal material removal conditions of a dry electrical discharge machining operation. The analytics identified the variables (among voltage, current, pulse-off time, gas pressure, and rotational speed) that effectively maximize the material removal rate, as well as the variables that do not contribute to the optimization process, and it quantified the underlying uncertainty. In summary, the proposed approach results in transparent, big-data-inequality-free, and less resource-dependent data analytics, which is desirable for small and medium enterprises, the actual sites where machining is carried out.
ARTICLE | doi:10.20944/preprints202308.0366.v1
Subject: Social Sciences, Education Keywords: learning analytics; large language model; artificial intelligence; process data
Online: 4 August 2023 (07:23:55 CEST)
Although learning analytics (LA) holds great potential to improve teaching and learning, LA research and practice are currently riddled with limitations that affect every stage of the LA life cycle. The present paper offers an overview of these challenges before proposing strategies to overcome them and exploring how the recent innovations brought forth by large language models (LLMs) can improve LA research and practice. In particular, we encourage the empowerment of teachers during LA development, as this would strengthen the theoretical foundation of LA solutions and increase their interpretability and usability. We also provide examples of how process data can be used to understand learning processes and generate more interpretable LA insights. Finally, we explore how LLMs could come into play in LA to generate interpretable insights and timely, actionable feedback, increase personalization, and support teachers' tasks more broadly.
REVIEW | doi:10.20944/preprints202104.0442.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: data science; advanced analytics; machine learning; deep learning; smart computing; decision-making; predictive analytics; data science applications;
Online: 16 April 2021 (11:28:09 CEST)
The digital world has a wealth of data, such as Internet of Things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data enables smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, making the computing process automatic and smart. In this paper, we present a comprehensive view of data science, including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, and urban and rural data science, taking into account data-driven smart computing and decision making. On this basis, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the point of view of data-driven solutions to real-world problems.
ARTICLE | doi:10.20944/preprints202212.0522.v1
Subject: Computer Science And Mathematics, Computer Networks And Communications Keywords: privacy; social media; data retention; hyperloglog
Online: 28 December 2022 (01:25:25 CET)
Social media data is widely used to gain insights about social incidents, whether on a local or global scale. When analyzing and evaluating such data, it is common practice to download and store it locally, and considerations about the privacy of social media users are often neglected in the process. However, protecting privacy when dealing with personal data is demanded by laws and ethics. In this paper we introduce a method to store social media data using the cardinality estimator HyperLogLog. Based on an exemplary disaster management scenario, we show that social media data can be analyzed by counting occurrences of posts without ever possessing the actual raw data. For social media analyses like these, which are based on counting occurrences, cardinality estimation suffices. Thus, the risk of abuse, loss, or public exposure of the data can be mitigated and the privacy of social media users preserved. The ability to perform unions and intersections on multiple data sets further encourages the use of this technology. We provide a proof-of-concept implementation of the introduced method using data provided by the Twitter API.
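As a rough illustration of the counting-only storage idea (not the authors' proof-of-concept), the HyperLogLog implementation from the `datasketch` Python package can count distinct posts per keyword without retaining the posts themselves; the keyword names and post identifiers below are made up:

```python
from datasketch import HyperLogLog

flood_posts = HyperLogLog(p=12)   # p controls the number of registers, i.e. accuracy vs. memory
storm_posts = HyperLogLog(p=12)

for post_id in ["91a", "91b", "92c"]:        # hypothetical post identifiers
    flood_posts.update(post_id.encode("utf8"))
for post_id in ["92c", "93d"]:
    storm_posts.update(post_id.encode("utf8"))

print(round(flood_posts.count()))            # estimated number of distinct "flood" posts

# Union via merging; an intersection can then be estimated by
# inclusion-exclusion on the union and the individual counts.
union = HyperLogLog(p=12)
union.merge(flood_posts)
union.merge(storm_posts)
print(round(union.count()))
```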
ARTICLE | doi:10.20944/preprints201811.0074.v1
Subject: Medicine And Pharmacology, Psychiatry And Mental Health Keywords: system dynamics modeling; big data; mental distress; diet
Online: 5 November 2018 (02:34:30 CET)
Dietary factors are among the risk factors that can affect brain chemistry and lead to mental distress. Based on our data mining approach, we found that mental distress in men is associated with eating unhealthy food. Our aim in this paper is to apply results from our big data analytics approach to inform system dynamics (SD) modeling and to investigate the causal relationships between brain structures, nutrients from food and dietary supplements, and mental health. We perform descriptive analysis on a large data set to estimate the SD modeling parameters. Finally, we calibrate the model against time-series data collected from individuals on their dietary and distress patterns. The results reveal that bridging these different methodologies leads to further insights from the SD model and decreases the error of the calibrated parameter values. Future research is needed to validate our initial results on the relationship between mental distress and dietary intake.
ARTICLE | doi:10.20944/preprints202105.0698.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Accidents; Data Analysis; Machine Learning; Transport
Online: 28 May 2021 (11:59:24 CEST)
Every day, thousands of people and goods move along Brazilian federal highways. Traffic accidents are numerous on these highways and have a significant impact on both the economy and the health system. Identifying predictor variables, the probability of an event occurring, and ways to mitigate it is of paramount importance for the transit authorities that manage these roads. The main contribution of this study is the development of a predictive machine learning model that uses open data to graphically show the critical points on the highways. The model is fully reproducible and can be applied to any region worldwide, helping to minimize the number of accidents and prevent deaths from automotive collisions. For this study, 43 variables were analyzed to support the identification of the causes of accidents with fatal victims on the main highways in the south of Brazil. RoadLytics is proposed as a supervised machine learning model, using the Random Forest algorithm to analyze about 33 thousand occurrences between 2017 and 2020. An exploratory analysis of the data was carried out to support the modeling and to facilitate data visualization; heat maps were developed to support the analysis and identification of potential risk areas. The results show that the BR386 highway registers the highest number of fatal occurrences, regardless of the season. Concerning weather conditions, the analysis shows that 52% of accidents occurred in favorable conditions, such as clear skies, victimizing 501 people, and that the driver's lack of attention is the main cause of accidents. Applying the developed model, an accuracy of 77% was achieved for the classification of fatal accidents.
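The abstract names a Random Forest classifier trained on open accident records; below is a minimal generic sketch with scikit-learn (the file name and column names are assumptions, not the RoadLytics code):

```python
# Generic Random Forest sketch for classifying fatal vs. non-fatal accidents.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("accidents.csv")                                   # hypothetical open-data export
X = pd.get_dummies(df[["weather", "road_type", "hour", "cause"]])   # assumed predictor columns
y = (df["severity"] == "fatal").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```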
ARTICLE | doi:10.20944/preprints202307.1583.v1
Subject: Environmental And Earth Sciences, Sustainable Science And Technology Keywords: innovation; commercialization; decision making; human centered design; information technology; data analytics; resilience; environment
Online: 24 July 2023 (09:36:17 CEST)
Global climate change and associated environmental extremes present a pressing need to understand and predict social-environmental impacts while identifying opportunities for mitigation and adaptation. In support of informing a more resilient future, emerging data analytics technologies can leverage the growing availability of Earth observations from diverse data sources ranging from satellites to sensors to social media. Yet there remains a need to transition from research for knowledge gain to sustained operational deployment. In this paper, we use the Wisdom-Knowledge-Information-Data (WKID) Innovation framework to inform solutions-oriented science in an integrated Research to Commercialization (R2C) model that explores market viability for value-added analytics. We conduct a case study using this R2C model to address the wicked wildfire problem. By integrating WKID and human centered design (HCD), through an industry-university partnership called the Climate Innovation Collaboratory between the University of Colorado Boulder and Deloitte Consulting, we systematically evaluated 39 different user stories across 8 user personas and identified common gaps, how to define information technologies that add value, and how to develop a marketable product. This R2C model could enable the transition of knowledge to operational implementation, informing policy and decision makers tasked with addressing some of our most challenging environmental problems.
CONCEPT PAPER | doi:10.20944/preprints201811.0599.v1
Subject: Social Sciences, Geography, Planning And Development Keywords: geographies of disruption; data analytics; policy intervention; Uber; disruptive technology; disruptive innovation; path dependency; platform development; platform economics
Online: 27 November 2018 (06:52:38 CET)
The topic of technology development and its disruptive effects has been the subject of much debate over the last 20 years, with numerous theories at both macro and micro scale offering potential models of technology progression and disruption. This paper focuses on how the potential theories of technology progression can be integrated, and considers whether suitable indicators of this progression and any subsequent disruptive effects (particularly considered geographically) might be derived using big data analytic techniques. Given the magnitude of the economic, social and political implications of many disruptive technologies, the ability to quantify disruptive change at the earliest possible stage could deliver major returns by reducing uncertainty, assisting public policy intervention and managing the technology transition through disruption into deployment. However, determining when this stage has been reached is problematic because small random effects in the timing or direction of development, the availability of essential supportive or "platform" technologies, market response or government policy can all result in failure of a technology, its form of adoption or the optimality of its implementation. This paper reviews some of the key models of technology evolution and their disruptive effects, including, in particular, the geographical spread of disruption. It then suggests a methodology that utilises the recent explosion of open and web-discoverable data to achieve this earlier determination, and considers the potential exploitation of big data modelling and predictive analytical techniques to achieve this goal.
ARTICLE | doi:10.20944/preprints202111.0047.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Data augmentation; Deep Learning; Convolutional Neural Networks; Ensemble.
Online: 2 November 2021 (11:18:23 CET)
Convolutional Neural Networks (CNNs) have gained prominence in the research literature on image classification over the last decade. One shortcoming of CNNs, however, is their lack of generalizability and tendency to overfit when presented with small training sets. Augmentation directly confronts this problem by generating new data points that provide additional information. In this paper, we investigate the performance of more than ten different sets of data augmentation methods, with two novel approaches proposed here: one based on the Discrete Wavelet Transform and the other on the Constant-Q Gabor transform. Pretrained ResNet50 networks are fine-tuned on each augmentation method. Combinations of these networks are evaluated and compared across three benchmark data sets of images representing diverse problems and collected by instruments that capture information at different scales: a virus data set, a bark data set, and a LIGO glitches data set. Experiments demonstrate the superiority of this approach: the best ensemble proposed in this work achieves state-of-the-art performance across all three data sets. This result shows that varying the data augmentation is a feasible way of building an ensemble of classifiers for image classification (code available at https://github.com/LorisNanni).
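The paper's wavelet-based augmentation is not reproduced here; the sketch below only illustrates the general idea of perturbing an image in the DWT domain with PyWavelets (the damping scheme is an assumption):

```python
# Perturb the detail sub-bands of a 2-D DWT and reconstruct, yielding an
# augmented version of a grayscale image given as a float array `img`.
import numpy as np
import pywt

def dwt_augment(img, scale=0.9, seed=0):
    rng = np.random.default_rng(seed)
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    # Randomly damp the detail coefficients to create a slightly different image.
    cH, cV, cD = (c * rng.uniform(scale, 1.0) for c in (cH, cV, cD))
    return pywt.idwt2((cA, (cH, cV, cD)), "haar")

img = np.random.rand(64, 64)   # placeholder image
aug = dwt_augment(img)
```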
CONCEPT PAPER | doi:10.20944/preprints202102.0203.v1
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: Bigdata; IoT; Big Data Analytics; Covid-19; healthcare
Online: 8 February 2021 (12:19:28 CET)
Big Data analytics has come a long way since its inception, and the field is growing day by day. With the advent of the large computational capacity of modern computing systems and the Internet of Things (IoT), this field has revolutionized the way we think about data. It has influenced major domains such as healthcare, automobiles, computing, climatology, and space communications; of late, the healthcare sector has been especially affected. This communication deals with the areas of healthcare where big data analytics has been most influential. Covering the basics of IoT-driven Big Data Analytics (BDA), it outlines the applications in the healthcare sector, along with future expectations. Additionally, it presents a comprehensive analysis of recent applications in this sector, with special reference to COVID-19.
ARTICLE | doi:10.20944/preprints201611.0010.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: millimeter-wavelength cloud radar; attenuation correction; dual-radar; data fusion
Online: 1 November 2016 (10:05:18 CET)
In order to correct attenuated millimeter-wavelength (Ka-band) radar data and address the problem of instability, an attenuation correction methodology (attenuation correction with variation trend constraint; VTC) was developed. Using synchronous observation conditions and multi-band radars, the VTC method adopts the variation trends of reflectivity in X-band radar data captured with wavelet transform as a constraint to adjust reflectivity factors of millimeter-wavelength radar. The correction was evaluated by comparing reflectivities obtained by millimeter-wavelength cloud radar and X-band weather radar. Experiments showed that attenuation was a major contributory factor in the different reflectivities of the two radars when relatively intense echoes exist, and the attenuation correction developed in this study significantly improved data quality for millimeter-wavelength radar. Reflectivity differences between the two radars were reduced and reflectivity correlations were enhanced. Errors caused by attenuation were eliminated, while variation details in the reflectivity factors were retained. The VTC method is superior to the bin-by-bin method in terms of correction amplitude and can be used for attenuation correction of shorter wavelength radar assisted by longer wavelength radar data.
ARTICLE | doi:10.20944/preprints201608.0232.v2
Subject: Medicine And Pharmacology, Pulmonary And Respiratory Medicine Keywords: mHealth; ODK scan; mobile health application; digitizing data collection; data management processes; paper-to-digital system; technology-assisted data management; treatment adherence
Online: 2 September 2016 (03:17:38 CEST)
The present grievous tuberculosis situation can be improved by efficient case management and timely follow-up evaluations. With the advent of digital technology, this can be achieved by quick summarization of patient-centric data. The aim of our study was to assess the effectiveness of the ODK Scan paper-to-digital system during a three-month testing period. A sequential, explanatory mixed-method research approach was employed to elucidate technology use. Training, smartphones, the application and 3G-enabled SIMs were provided to the four field workers. At the beginning, baseline measures of the data management aspects were recorded and compared with endline measures to assess the impact of ODK Scan. Additionally, at the end, users' feedback was collected regarding app usability, user interface design and workflow changes. A total of 122 patients' records were retrieved from the server and analysed for quality. It was found that ODK Scan recognized 99.2% of multiple-choice bubble responses and 79.4% of numerical digit responses correctly. However, the overall quality of the digital data was lower than that of manually entered data. Using ODK Scan, a significant time reduction was observed in data aggregation and data transfer activities; however, data verification and form filling activities took more time. Interviews revealed that field workers saw value in using ODK Scan but were concerned about its time-consuming aspects. Therefore, it is concluded that minimal disturbance to the existing workflow, continuous feedback and value additions are important considerations for the implementing organization to ensure technology adoption and workflow improvements.
REVIEW | doi:10.20944/preprints201607.0075.v1
Subject: Social Sciences, Behavior Sciences Keywords: travel mode detection; GPS raw data; smartphones
Online: 25 July 2016 (06:34:26 CEST)
Over the past couple of decades, Global Positioning System (GPS) technology has been utilized to collect large-scale data in travel surveys. As GPS devices provide the precise spatiotemporal characteristics of travel, the issues of traditional travel surveys, such as misreporting and non-response, can be addressed. Considering the drawbacks of dedicated GPS devices (e.g., the cost of purchasing devices, participants forgetting to carry them, and the sample size being limited by the number of devices) and the fact that the smartphone has become one of the necessities of life, there is a great opportunity for smartphones to replace dedicated GPS devices. Although several general reviews of smartphone-based GPS travel surveys have appeared in the literature review sections of some articles, there is no systematic review covering the process from smartphone-based GPS data collection to travel mode detection. The included studies were searched from six databases. The purpose of this review is to critically assess the current literature on existing methodologies for travel mode detection based on GPS raw data collected by smartphones. Through a systematic comparison of different methods, from data pre-processing to travel mode detection, this paper also identifies the strengths and weaknesses of existing methods. This is a crucial step towards further developing the methodologies and applications of GPS raw data collected by smartphones.
ARTICLE | doi:10.20944/preprints202303.0023.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Game Design; Variational AutoEncoder (VAE); Image and Video Generation; Bayesian Algorithm; Loss Function; Data Clustering; Data and Image Analytics; MNIST database; Generator and Discriminator
Online: 1 March 2023 (11:17:12 CET)
In recent decades, the Variational AutoEncoder (VAE) model has shown good potential and capability in image generation and dimensionality reduction. The combination of VAE and various machine learning frameworks has also worked effectively in different daily-life applications; however, its applicability and effectiveness in modern game design have seldom been explored or assessed, and the use of its feature extractor for data clustering has received little discussion in the literature. This paper first explores different mathematical properties of the VAE model, in particular the theoretical framework of the encoding and decoding processes, the achievable lower bound, and the loss functions of different applications; it then applies the established VAE model to generating new game levels within two well-known game settings, and validates the effectiveness of its data clustering mechanism with the aid of the Modified National Institute of Standards and Technology (MNIST) database. Statistical metrics and assessments are utilized to evaluate the performance of the proposed VAE model in the aforementioned case studies. Based on the statistical and spatial results, several potential drawbacks and future enhancements of the established model are outlined, with the aim of maximizing the strengths and advantages of VAE for future game design tasks and relevant industrial missions.
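For reference, the lower bound mentioned in the abstract is, in the standard VAE formulation, the evidence lower bound (ELBO); in general notation (a sketch, not the paper's specific variant):

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
  \le \log p_\theta(x)
```

Maximizing this bound balances reconstruction quality against the divergence of the approximate posterior from the prior, which is the trade-off exploited both for level generation and for clustering in the latent space.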
ARTICLE | doi:10.20944/preprints201610.0012.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: data exchange; resource donations; text mining
Online: 5 October 2016 (15:08:32 CEST)
Bio-molecular reagents like antibodies required in experimental biology are expensive, and their effectiveness is, among other things, critical to the success of the experiment. Although such resources are sometimes donated by one investigator to another through personal communication, there is, to our knowledge, no previous study on the extent of such donations, nor a central platform that directs resource seekers to donors. In this paper, we describe what is, to our knowledge, a first attempt at building a web portal, titled Bio-Resource Exchange, that bridges this gap between resource seekers and donors in the domain of experimental biology. Users on this portal can request or donate antibodies, cell lines and DNA constructs. The resource could also serve as a crowd-sourced database of resources for experimental biology. Further, in order to index donations outside of our portal, we mined scientific articles to find instances of antibody donations and attempted to extract information about these donations at the finest granularity. Specifically, we extracted the name of the donor, his/her affiliation and the name of the antibody for every donation by parsing the acknowledgements sections of articles. To extract annotations at this level, we propose two approaches: a rule-based algorithm and a bootstrapped relation learning algorithm. The algorithms extracted donor names, affiliations and antibody names with average accuracies of 57% and 62%, respectively. We also created a dataset of 50 expert-annotated acknowledgements sections that will serve as a gold standard for evaluating extraction algorithms in the future. Contact: email@example.com, firstname.lastname@example.org Database URL: http://tonks.dbmi.pitt.edu/brx Supplementary information: Supplementary data are available at Database online.
ARTICLE | doi:10.20944/preprints202306.1679.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: long-tailed image classification; contrastive learning; data augmentation
Online: 23 June 2023 (12:17:21 CEST)
Common long-tailed classification methods do not use the semantic features of the original label text of an image, and the gap between the classification accuracy of majority and minority classes is large. To address this, the proposed long-tailed image classification method based on enhanced contrastive visual-language learning trains head-class and tail-class samples separately, uses text-image pre-training information, and employs an enhanced momentum contrast loss function and RandAugment augmentation to improve the learning of tail-class samples. On the ImageNet-LT long-tailed dataset, the method improves overall accuracy, tail-class accuracy, middle-class accuracy, and F1 values by 3.4%, 7.6%, 3.5%, and 11.2%, respectively, compared to the BALLAD method, and reduces the accuracy difference between the head and tail classes by 1.6% compared to BALLAD. The results of three comparative experiments indicate that the proposed method improves the performance of tail classes and reduces the accuracy difference between majority and minority classes.
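RandAugment, one ingredient named above, is available in torchvision; a minimal sketch of placing it in a training transform pipeline (the parameter values are assumptions, not the paper's configuration):

```python
# RandAugment-based input pipeline sketch for long-tailed training images.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(num_ops=2, magnitude=9),   # randomly chosen augmentation ops
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```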
ARTICLE | doi:10.20944/preprints201607.0047.v1
Subject: Business, Economics And Management, Econometrics And Statistics Keywords: SAINT model; SiZer; local linear fitting; mortality data
Online: 18 July 2016 (10:35:40 CEST)
The main contribution of this paper is a graphical tool for evaluating the suitability of a candidate parametric model for fitting a data set. The practical motivation is to check the adequacy of the so-called SAINT model proposed in Jarner and Kryger (2011). This model has been implemented in practice by an important European pension fund concerned with setting annuity reserves for all current or former employees of Denmark, so one particularly relevant question is whether this mortality model actually fits old-age mortality. Our graphical test can be described as follows. First, we transform the data by means of the parametric model under evaluation; if the model is correct, the transformed data follow an Exponential distribution with rate 1. Then we construct a family of local linear hazard estimates based on the data on the transformed scale. Finally, we use the statistical tool SiZer to assess the goodness-of-fit of the Exponential distribution to the data on the transformed scale. If the model is correct, the SiZer map should not reveal any significant feature in the family of kernel estimates. Our method allows us to establish a diagnostic regarding the validity of the SAINT model when describing mortality patterns in four different countries.
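The transformation step relies on the standard result that if X follows the candidate model with distribution function F, then -log(1 - F(X)) is Exponential with rate 1. A minimal numeric illustration (a Weibull model stands in for the SAINT model, which is not reproduced here):

```python
# Transform data through a fitted candidate CDF and check that the result
# is consistent with an Exponential(1) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.weibull(1.5, size=2000) * 10.0                   # synthetic "lifetimes"

shape, loc, scale = stats.weibull_min.fit(data, floc=0)     # candidate parametric fit
u = stats.weibull_min.cdf(data, shape, loc=loc, scale=scale)
z = -np.log1p(-u)                                           # ~ Exponential(1) if the model is correct

print("mean (expect about 1):", z.mean())
print("KS test vs. Exp(1):", stats.kstest(z, "expon"))
```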
ARTICLE | doi:10.20944/preprints202306.0556.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: image data extraction; driving behavior analysis; geographic information system; global positioning system
Online: 7 June 2023 (13:09:43 CEST)
This paper proposes an approach for image data extraction and driving behavior analysis using geographic information and driving data. Information derived from geographic and global positioning systems was used for image data extraction. In addition, we used an on-board diagnostics II (OBD-II) logger and a controller area network (CAN) bus logger to record driving data for driving behavior analysis. Driving behavior was analyzed using sparse autoencoders and data exploration to detect abnormal and aggressive behavior, and a regression analysis was used to derive the relationship between aggressive driving behavior and road facilities. Several traffic improvements were proposed for specific intersections and roads. The results indicate that lane ratios without lane markings and with straight lane markings are important features affecting aggressive driving behaviors.
ARTICLE | doi:10.20944/preprints201612.0091.v2
Subject: Environmental And Earth Sciences, Geophysics And Geology Keywords: reanalysis climate data; hydrologic modeling; comparative analysis
Online: 3 February 2017 (03:50:07 CET)
Large-scale hydrological modeling in China is challenging given the sparse meteorological stations and the large uncertainties associated with atmospheric forcing data. Here we introduce the development and use of the China Meteorological Assimilation Driving Datasets for the SWAT model (CMADS) in the Heihe River Basin (HRB) for improving hydrologic modeling, by leveraging datasets from the China Meteorological Administration Land Data Assimilation System (CLDAS), including climate data from nearly 40,000 area encryption stations, 2,700 national automatic weather stations, the FengYun (FY) 2 satellite and radar stations. CMADS uses the Space-Time Multiscale Analysis System (STMAS) to fuse data based on the ECMWF ambient field and ensure data accuracy. In addition, compared with CLDAS, CMADS includes relative humidity and climate data at varied resolutions to drive hydrological models such as the Soil and Water Assessment Tool (SWAT). Here, we compared climate data from CMADS, the Climate Forecast System Reanalysis (CFSR) and traditional weather station (TWS) climate forcing data and evaluated their applicability for driving large-scale hydrologic modeling with SWAT. In general, CMADS has higher accuracy than CFSR when evaluated against observations at TWS; CMADS also provides a spatially continuous climate field to drive distributed hydrologic models, which is an important advantage over TWS climate data, particularly in regions with sparse weather stations. Accordingly, SWAT simulations driven with CMADS and TWS achieved similar performance in terms of monthly and daily streamflow simulations, and both outperformed CFSR. For example, at the three hydrological stations (Ying Luoxia, Qilian Mountain, and ZhaMasheke) in the HRB, they achieved monthly and daily Nash-Sutcliffe efficiency ranges of 0.75-0.95 and 0.58-0.78, respectively, which are much higher than the corresponding statistics achieved with CFSR (monthly: 0.32-0.49; daily: 0.26-0.45). The CMADS dataset is available free of charge and is expected to be a valuable addition to the existing climate reanalysis datasets for driving distributed hydrologic modeling in China and other countries in East Asia.
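Model skill above is reported as the Nash-Sutcliffe efficiency (NSE); for reference, a generic one-function implementation of the standard definition (not tied to the SWAT outputs used in the study):

```python
# NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2).
# NSE = 1 is a perfect fit; NSE <= 0 means the model is no better than
# always predicting the observed mean.
import numpy as np

def nse(observed, simulated):
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)

print(nse([3.0, 4.5, 6.1, 5.2], [2.8, 4.9, 5.7, 5.5]))   # example streamflow series
```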
ARTICLE | doi:10.20944/preprints201609.0088.v1
Subject: Engineering, Civil Engineering Keywords: classification; railway; power line; mobile laser scanning data; conditional random field; layout compatibility
Online: 26 September 2016 (09:33:05 CEST)
Railways are one of the most crucial means of transportation for public mobility and economic development. For efficient railway operation, the electrification system of the railway infrastructure, which supplies electric power to trains, comprises essential facilities for stable train operation. Due to its important role, the electrification system needs to be rigorously and regularly inspected and managed. This paper presents a supervised learning method to classify Mobile Laser Scanning (MLS) data into ten target classes representing overhead wires, movable brackets and poles, which are key objects in the electrification system. In general, the layout of a railway electrification system shows strong regularity in the spatial relations among object classes. The proposed classifier is developed based on a Conditional Random Field (CRF), which characterizes not only labeling homogeneity at short range but also layout compatibility between different object classes at long range in a probabilistic graphical model. This multi-range CRF model consists of a unary term and three pairwise contextual terms. To gain computational efficiency, the MLS point clouds are converted into a set of line segments to which the labeling process is applied. A Support Vector Machine (SVM), considering only node features, is used as the local classifier producing the unary potentials of the CRF model. As the short-range pairwise contextual term, a Potts model is applied to enforce local smoothness in the short-range graph, while the long-range pairwise potentials are designed to enhance the spatial regularity of both horizontal and vertical layouts among railway objects. We formulate the two long-range pairwise potentials as the log posterior probability obtained by a Naïve Bayes classifier. The directional layout compatibilities are characterized in probability look-up tables which represent the co-occurrence rates of spatial relations in the horizontal and vertical directions, and the likelihood function is formulated by multivariate Gaussian distributions. In the proposed multi-range CRF model, the weight parameters balancing the four sub-terms are estimated by applying Stochastic Gradient Descent (SGD). The results show that the proposed multi-range CRF can effectively classify detailed railway elements, achieving an average recall of 97.66% and an average precision of 97.07% across all classes.
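In generic notation (a sketch, not the paper's exact formulation), the multi-range CRF described above scores a labeling y of line segments x as a weighted sum of the SVM-based unary term, the short-range Potts term, and the two long-range layout terms, with the weights lambda estimated by SGD:

```latex
E(\mathbf{y} \mid \mathbf{x}) =
    \sum_{i} \psi_{\mathrm{SVM}}(y_i, \mathbf{x}_i)
  + \lambda_{1} \sum_{(i,j) \in \mathcal{N}_{\mathrm{short}}} \psi_{\mathrm{Potts}}(y_i, y_j)
  + \lambda_{2} \sum_{(i,j) \in \mathcal{N}_{\mathrm{horiz}}} \psi_{\mathrm{H}}(y_i, y_j)
  + \lambda_{3} \sum_{(i,j) \in \mathcal{N}_{\mathrm{vert}}} \psi_{\mathrm{V}}(y_i, y_j)
```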
ARTICLE | doi:10.20944/preprints202212.0405.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: hyperspectral data; few-shot learning; deep features; convolution kernels; edge-preserving filtering
Online: 22 December 2022 (01:44:48 CET)
In recent years, different deep learning frameworks have been introduced for hyperspectral image (HSI) classification. However, the proposed network models have high model complexity and do not provide high classification accuracy under few-shot learning. This paper presents an HSI classification method that combines a random patches network (RPNet) and recursive filtering (RF) to obtain informative deep features. The proposed method first convolves image bands with random patches to extract multi-level deep RPNet features. Thereafter, the RPNet feature set is subjected to dimension reduction through principal component analysis (PCA), and the extracted components are filtered using the RF procedure. Finally, HSI spectral features and the obtained RPNet-RF features are combined to classify the HSI using a support vector machine (SVM) classifier. To test the performance of the proposed RPNet-RF method, experiments were performed on three widely known datasets using a few training samples for each class, and classification results were compared with those obtained by other advanced HSI classification methods adapted for small training samples. The comparison showed that the RPNet-RF classification achieves higher values of evaluation metrics such as overall accuracy and the Kappa coefficient (https://github.com/UchaevD/RPNet-RF).
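A rough sketch of the pipeline structure (random-patch convolution, PCA, spatial filtering, SVM) is given below; all sizes and data are placeholders, and the edge-preserving recursive filter of the paper is replaced by a plain Gaussian filter purely for illustration:

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
hsi = rng.random((64, 64, 30))                     # placeholder hyperspectral cube (H, W, bands)

# 1) Random-patch convolution features: randomly sampled patches act as kernels.
feats = []
for _ in range(10):
    band = rng.integers(hsi.shape[2])
    r, c = rng.integers(0, 64 - 21, size=2)
    patch = hsi[r:r + 21, c:c + 21, band]
    feats.append(convolve(hsi[:, :, band], patch - patch.mean()))
feats = np.stack(feats, axis=-1).reshape(-1, 10)

# 2) PCA reduction followed by (placeholder) spatial smoothing of each component.
comps = PCA(n_components=3).fit_transform(feats).reshape(64, 64, 3)
comps = np.stack([gaussian_filter(comps[:, :, i], sigma=2) for i in range(3)], axis=-1)

# 3) SVM trained on a few labeled pixels (labels here are synthetic).
X = comps.reshape(-1, 3)
y = rng.integers(0, 3, size=X.shape[0])
train = rng.choice(X.shape[0], 60, replace=False)
pred = SVC(kernel="rbf").fit(X[train], y[train]).predict(X)
```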
ARTICLE | doi:10.20944/preprints202307.1435.v2
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: sharing secret; data outsourcing; reversible watermarking; chaotic map
Online: 21 September 2023 (11:24:52 CEST)
A novel verifiable privacy-preservation scheme for outsourcing medical images to the cloud through ROI-based crypto-watermarking is proposed in this paper. In the proposed scheme, the data owner first applies S-box substitution to the region of interest (ROI) of the medical image, and then separates the image into a 4 most significant bits (MSBs) plane image and a 4 least significant bits (LSBs) plane image. Secondly, the hash value of the ROI is embedded into the two separated bit-plane images using a reversible watermarking algorithm. Lastly, selected hash values are transformed into the initial parameters of chaotic maps, and the two secret shares, produced through a chaos-based encryption algorithm, are outsourced to two different cloud servers. Authorized users can obtain the shares from the different cloud servers and then losslessly recover the original medical image through a series of decryption operations and extraction of the watermark. Furthermore, users can verify whether the original image is completely reconstructed, and can even locate the tampered parts inside the ROI if either of the secret shares is damaged. Experimental analyses and comparisons are given to show the security and effectiveness of the proposed scheme.
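One concrete step of the pipeline, splitting an 8-bit image into its 4-MSB and 4-LSB planes and recombining them losslessly, can be illustrated with numpy; the watermark embedding and chaotic encryption stages are not sketched here:

```python
# Split an 8-bit grayscale image into 4-MSB and 4-LSB planes and recombine.
import numpy as np

img = np.random.default_rng(0).integers(0, 256, size=(8, 8), dtype=np.uint8)

msb_plane = img >> 4          # top four bits of each pixel
lsb_plane = img & 0x0F        # bottom four bits of each pixel

reconstructed = (msb_plane << 4) | lsb_plane
assert np.array_equal(reconstructed, img)   # the split is lossless
```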
ARTICLE | doi:10.20944/preprints201708.0102.v1
Subject: Environmental And Earth Sciences, Remote Sensing Keywords: Content-Based Remote Sensing Image Retrieval; Change Information Detection; Information Management; Remote Sensing Data Service
Online: 29 August 2017 (16:18:20 CEST)
With the rapid development of satellite remote sensing technology, the volume of image datasets in many application areas is growing exponentially, and the demand for land-cover and land-use change remote sensing data is growing rapidly. It is thus becoming hard to efficiently and intelligently retrieve the change information that users need from massive image databases. In this paper, content-based image retrieval is successfully applied to change detection, and a content-based remote sensing image change information retrieval model is introduced. First, the construction of a new model framework for change information retrieval in a remote sensing database is described. Then, as the target content cannot be expressed by one kind of feature alone, a multiple-feature integrated retrieval model is proposed. Thirdly, an experimental prototype system that was set up to demonstrate the validity and practicability of the model is described. The proposed model is a new method of acquiring change detection information from remote sensing imagery and can therefore reduce the need for image pre-processing and deal with problems related to seasonal changes, as well as other problems encountered in the field of change detection. Meanwhile, the new model has important implications for improving remote sensing image management and autonomous information retrieval.
CONCEPT PAPER | doi:10.20944/preprints202111.0117.v1
Subject: Business, Economics And Management, Business And Management Keywords: Big data predictive analytics; competitive strategies; strategic alliance performance; Telecom sector
Online: 5 November 2021 (11:29:12 CET)
Based on resource-based theory, the current study examines the relationship between competitive strategies and strategic alliance performance. Furthermore, big data predictive analytics is treated as a boundary condition between competitive strategies and strategic alliance performance. Big data predictive analytics in operations and industrial management has been a focal point in the current era, yet little attention has been paid to its influence on competitive strategies and strategic alliance performance, especially in developing countries like Pakistan. A survey instrument was used to record the responses of 331 employees of telecom-sector companies operating in Pakistan. The findings show that competitive strategies have a positive and significant relationship with strategic alliance performance, and that big data predictive analytics moderates the relationship between competitive strategies and strategic alliance performance. The study adds a new perspective and contribution to the literature on big data predictive analytics, strategic alliance performance, and competitive strategies in Pakistan's telecom-sector companies. Further, the results suggest that big data analytics is like a company's lifeblood in the current era: through its efficient and effective use, companies can raise their standards in a competitive environment.
ARTICLE | doi:10.20944/preprints201610.0067.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: point information gain; Rényi entropy; data processing
Online: 17 October 2016 (11:35:13 CEST)
We generalize the point information gain (PIG) and derived quantities, i.e., the point information gain entropy (PIE) and the point information gain entropy density (PIED), for the case of the Rényi entropy, and simulate the behavior of PIG for typical distributions. We also use these methods for the analysis of multidimensional datasets. We demonstrate the main properties of PIE/PIED spectra on real data using several example images, and discuss further possible uses in other fields of data processing.
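The quantities above build on the Rényi entropy of a discrete distribution; the sketch below shows the entropy itself and a point-wise gain computed by removing one observation from an empirical histogram (a simplified reading of the PIG idea, not the authors' exact definition):

```python
# Rényi entropy H_a(p) = log(sum_i p_i^a) / (1 - a), with the Shannon
# entropy recovered in the limit a -> 1.
import numpy as np

def renyi_entropy(counts, alpha):
    p = counts / counts.sum()
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

counts = np.array([40, 30, 20, 10], dtype=float)   # histogram of some data attribute
alpha = 2.0

h_with = renyi_entropy(counts, alpha)
counts_without = counts.copy()
counts_without[0] -= 1                             # remove one observation from bin 0
pig = renyi_entropy(counts_without, alpha) - h_with
print(pig)
```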
ARTICLE | doi:10.20944/preprints201612.0079.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: fire detection; upwelling radiation; diurnal variation; training data; geostationary sensors
Online: 15 December 2016 (09:22:10 CET)
Fire detection from satellite sensors relies on an accurate estimation of the unperturbed state of a target pixel, from which an anomaly can be isolated. Methods for estimating the radiation budget of a pixel without fire depend upon training data derived from the location's recent history of brightness temperature variation over the diurnal cycle, which can be vulnerable to cloud contamination and the effects of weather. This study proposes a new method that utilises the common solar budget found at a given latitude in conjunction with an area's local solar time to aggregate a broad-area training dataset, which can be used to model the expected diurnal temperature cycle of a location. This training data is then used in a temperature fitting process with the measured brightness temperatures in a pixel, and compared to pixel-derived training data and contextual methods of background temperature determination. Results of this study show similar accuracy between clear-sky medium wave infrared upwelling radiation and the diurnal temperature cycle estimation compared to previous methods, with demonstrable improvements in processing time and training data availability. This method can be used in conjunction with brightness temperature thresholds to provide a baseline for upwelling radiation, from which positive thermal anomalies such as fire can be isolated.
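A minimal illustration of fitting a smooth diurnal cycle to brightness-temperature training data, from which a fire pixel would stand out as a positive residual; the sinusoidal form, parameter values and threshold are assumptions, not the study's model:

```python
# Fit T(t) = a + b*cos(2*pi*(t - t0)/24) to brightness temperatures sampled
# over local solar time, then flag measurements far above the fitted background.
import numpy as np
from scipy.optimize import curve_fit

def diurnal(t, a, b, t0):
    return a + b * np.cos(2 * np.pi * (t - t0) / 24.0)

hours = np.arange(0, 24, 0.5)
temps = (290 + 8 * np.cos(2 * np.pi * (hours - 14) / 24.0)
         + np.random.default_rng(0).normal(0, 0.5, hours.size))   # synthetic training data

params, _ = curve_fit(diurnal, hours, temps, p0=[290.0, 8.0, 14.0])
residuals = temps - diurnal(hours, *params)
anomalies = hours[residuals > 4.0]   # candidate thermal anomalies (threshold assumed)
```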
ARTICLE | doi:10.20944/preprints201608.0123.v1
Subject: Engineering, Civil Engineering Keywords: limited sensor data; structural health monitoring; strain/stress response reconstruction; empirical mode decomposition
Online: 11 August 2016 (11:06:16 CEST)
Structural health monitoring has been studied by a number of researchers as well as various industries to keep up with the increasing demand for preventive maintenance routines. This work presents a novel method for reconstructing prompt, informed strain/stress responses at the hot spots of a structure based on strain measurements at remote locations. The structural responses measured by the usage monitoring system at available locations are decomposed into modal responses using empirical mode decomposition. Transformation equations based on finite element modeling are derived to extrapolate the modal responses from the measured locations to critical locations where direct sensor measurements are not available. Two numerical examples (a two-span beam and a 19,956-degree-of-freedom simplified airfoil) are then used to demonstrate the overall reconstruction method. Finally, the present work investigates the effectiveness and accuracy of the method through a set of experiments conducted on an aluminium alloy cantilever beam of the kind commonly used in air vehicles and spacecraft. The experiments collect the vibration strain signals of the beam via optical fiber sensors. The reconstruction results are compared with theoretical solutions, and a detailed error analysis is also provided.
ARTICLE | doi:10.20944/preprints201702.0074.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: network; systems; cloud computing; data centre; performance; software-defined; virtual machine; scheduling; admission control; application-aware;
Online: 20 February 2017 (04:56:24 CET)
Cloud computing refers to applications delivered as services over the Internet. Cloud systems employ policies that are inherently dynamic in nature and that depend on temporal conditions defined in terms of external events, such as the measurement of bandwidth, use of hosts, intrusion detection or specific time events. In this paper, we investigate an optimized resource management scheme named v-Mapper. The basic premise of v-Mapper is to exploit application-awareness concepts using software-defined networking (SDN) features. This paper makes three key contributions to the field: (1) We propose a virtual machine (VM) placement scheme that can effectively mitigate the VM placement issues for data-intensive applications; (2) We propose a validation scheme that ensures a service is admitted only if there are sufficient resources available for its execution; and (3) We present a scheduling policy that aims to eliminate network load constraints. An evaluation was carried out with various benchmarks and demonstrated that v-Mapper shows improved performance over other state-of-the-art approaches in terms of average task completion time, service delay time and bandwidth utilization. Given the growing importance of supporting large-scale data processing and analysis in datacentres, the v-Mapper system has the potential to make a positive impact in improving datacentre performance in the future.
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Artificial intelligence; machine learning; real-time probabilistic data; cyber risk; super forecasting; red teaming;
Online: 12 April 2021 (12:18:14 CEST)
Multiple governmental agencies and private organisations have made commitments for the colonisation of Mars. Such colonisation requires complex systems and infrastructure that could be very costly to repair or replace in the event of cyber-attacks. This paper surveys deep learning algorithms, IoT cyber security and risk models, and established mathematical formulas to identify the best approach for developing a dynamic and self-adapting system for predictive cyber risk analytics supported by Artificial Intelligence and Machine Learning and real-time intelligence in edge computing. The paper presents a new mathematical approach for integrating concepts of cognition engine design, edge computing, and Artificial Intelligence and Machine Learning to automate anomaly detection. This engine instigates a step change by applying Artificial Intelligence and Machine Learning embedded at the edge of IoT networks to deliver safe and functional real-time intelligence for predictive cyber risk analytics. This will enhance capacities for risk analytics and assist in the creation of a comprehensive and systematic understanding of the opportunities and threats that arise when edge computing nodes are deployed and when Artificial Intelligence and Machine Learning technologies are migrated to the periphery of the internet and into local IoT networks.
ARTICLE | doi:10.20944/preprints202304.1077.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Multimodal Data Integration; Radiotherapy Standard Name mapping; Radiation Oncology; Machine Learning; Deep Learning; TG-263 Names
Online: 27 April 2023 (11:01:18 CEST)
Physicians often label anatomical structure sets in Digital Imaging and Communications in Medicine (DICOM) images with nonstandard names. As these names vary widely, standardizing the nonstandard names of the Organs at Risk (OARs), Planning Target Volumes (PTVs), and 'Other' organs inside the area of interest is a vital problem. Prior works applied traditional machine learning approaches to structure sets with moderate success. This paper presents integrated deep learning methods applied to structure sets by integrating multimodal data compiled from the radiotherapy centers administered by the US Veterans Health Administration (VHA) and the Department of Radiation Oncology at Virginia Commonwealth University (VCU). The de-identified radiation oncology data collected from the VHA and VCU radiotherapy centers contain 16,290 prostate structures. Our method integrates the heterogeneous (textual and imaging) multimodal data with Convolutional Neural Network (CNN)-based deep learning approaches such as CNN, the Visual Geometry Group (VGG) network, and the Residual Network (ResNet), and presents improved results in prostate radiotherapy (RT) structure name standardization. Evaluation of our methods with the macro-averaged F1 score shows that our deep learning model with single-modal textual data usually performs better than previous studies. We also experimented with various combinations of multimodal data (masked images, masked dose) besides textual data. The models perform well on the textual data alone, while the addition of imaging data shows that deep neural networks can achieve improved performance using information present in the other modalities. Additionally, using masked images and masked doses along with text leads to an overall performance improvement across the various CNN-based architectures compared with using all the modalities together, and undersampling the majority class leads to further performance enhancement. The VGG network on the masked image-dose data combined with a CNN on the text data performs best and establishes the state of the art in this domain.
ARTICLE | doi:10.20944/preprints201611.0110.v1
Subject: Business, Economics And Management, Finance Keywords: capital structure; firm’s performance; panel data; unit root analysis; Bangladesh
Online: 22 November 2016 (09:36:36 CET)
Capital structure decisions play an imperative role in a firm's performance. Recognizing this importance, many studies have examined the relationship between capital structure and firm performance, and their findings are inconclusive. In addition, there is a relative deficiency of empirical studies examining the link between capital structure and the performance of banks in Bangladesh. This paper attempts to fill this gap. Using panel data of 22 banks for the period 2005-2014, this study empirically examines the impact of capital structure on the performance of Bangladeshi banks, assessed by return on equity, return on assets and earnings per share. Results from pooled ordinary least squares analysis show that capital structure has an inverse impact on banks' performance. The empirical findings of this study are of particular significance for developing countries like Bangladesh because they call upon bank management and policy makers to pursue policies that reduce reliance on debt and achieve an optimal capital structure. This research also contributes to the empirical literature by reconfirming (or otherwise) the findings of previous studies.
ARTICLE | doi:10.20944/preprints201703.0058.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Smartphone sensing; mobile-social integration; automatic recognition; social data; long-term health monitoring
Online: 10 March 2017 (17:32:31 CET)
Over the past decades, overweight and obesity have become a global epidemic and a leading cause of death. To prevent this serious risk, an overweight or obese individual must apply a long-term weight-management strategy to control food intake and physical activity, which is, however, not easy. Recently, with the advances of information technology, more and more people can use wearable devices and smartphones to obtain physical activity information, while they can also access various health-related information from Internet online social networks (OSNs). Nevertheless, there is a lack of an integrated approach that can combine these two methods in an efficient way. In this paper, we address this issue and propose a novel mobile-social framework for health recognition and recommendation, namely, H-Rec2. The main ideas of H-Rec2 are (1) to recognize the individual's health status using the smartphone as a general platform, and (2) to recommend physical activity and food intake based on personal health information, life science principles, and health-related information obtained from OSNs. To demonstrate the potential of the H-Rec2 framework, we develop a prototype that consists of four important components: (1) an activity recognition module that senses physical activity using an accelerometer, (2) a health status modeling module that applies a novel algorithm to generate a personalized health status index, (3) a restaurant information collection module that collects relevant information from OSNs, and (4) a restaurant recommendation module that provides personalized and context-aware recommendations. To evaluate the prototype, we conduct both objective and subjective experiments, which confirm the performance and effectiveness of the proposed system.
ARTICLE | doi:10.20944/preprints201608.0204.v1
Subject: Business, Economics And Management, Economics Keywords: logistics industry; sustainability; data envelopment analysis (DEA); grey forecasting
Online: 25 August 2016 (10:12:27 CEST)
Logistics plays an important role in globalized companies and contributes to the development of foreign trade. A large number of external conditions, such as recession and inflation, affect logistics. Therefore, managers should find ways to improve operational performance, enabling them to increase efficiency while considering environmental sustainability due to the industry’s large scale of energy consumption. Based on data collected from the financial reports of top global logistics companies, this study uses a DEA model to calculate corporate efficiency by implementing a Grey forecasting approach to forecast future sustainability values. Consequently, the study addresses the problem of how to enhance operational performance while accounting for the impact of external conditions. This research can help logistics companies develop operation strategies in the future that will enhance their competitiveness vis-à-vis rivals in a time of global economic volatility.
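As an illustration of the forecasting step, the sketch below fits a GM(1,1) model, the most common grey forecasting model; the abstract does not state which grey model was used, so GM(1,1) and the efficiency series are assumptions for illustration only.

```python
# Minimal sketch of a GM(1,1) grey forecast, assuming a short annual series
# of an efficiency/sustainability score for one logistics company.
import numpy as np

def gm11_forecast(x0, horizon=3):
    """Fit GM(1,1) to series x0 and forecast `horizon` future values."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                                 # accumulated series
    z1 = 0.5 * (x1[1:] + x1[:-1])                      # background values
    B = np.column_stack((-z1, np.ones(len(z1))))
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]        # developing and grey input coefficients
    n = len(x0)
    k = np.arange(n + horizon)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a  # fitted accumulated series
    x0_hat = np.diff(x1_hat, prepend=x1_hat[0])        # back to the original scale
    x0_hat[0] = x0[0]
    return x0_hat[n:]                                  # only the forecast part

# Hypothetical efficiency scores for five past years; forecast three years ahead.
print(gm11_forecast([0.81, 0.84, 0.83, 0.87, 0.90], horizon=3))
```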
ARTICLE | doi:10.20944/preprints201612.0002.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: change point; estimation; consistency; panel data; short panels; boundary issue; structural change; bootstrap; non-life insurance; change in claim amounts
Online: 1 December 2016 (10:02:03 CET)
Panel data of our interest consist of a moderate number of panels, while the panels contain a small number of observations. An estimator of common breaks in panel means without a boundary issue for this kind of scenario is proposed. In particular, the novel estimator is able to detect a common break point even when the change happens immediately after the first time point or just before the last observation period. Another advantage of the elaborated change point estimator is that it results in the last observation in situations with no structural breaks. The consistency of the change point estimator in panel data is established. The results are illustrated through a simulation study. As a by-product of the developed estimation technique, a theoretical utilization for correlation structure estimation, hypothesis testing, and bootstrapping in panel data is demonstrated. A practical application to non-life insurance is presented as well.
CONFERENCE PAPER | doi:10.20944/preprints201612.0011.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: satellite data; fine particulate matter; air pollution; geographic information system; health risks; spatial analysis; Saudi Arabia
Online: 1 December 2016 (15:25:56 CET)
The study of the concentrations and effects of fine particulate matter in urban areas has been of great interest to researchers in recent times. This is due to the acknowledgment of the far-reaching impacts of fine particulate matter on public health. Remote sensing data have been used to monitor the trend of concentrations of particulate matter by deriving aerosol optical depth (AOD) from satellite images. The Center for International Earth Science Information Network (CIESIN) has released the second version of its global PM2.5 data with improved spatial resolution. This paper revisits the study of spatial and temporal variations in particulate matter in Saudi Arabia by exploring cluster analysis of the new data. Cluster analysis of the PM2.5 values of Saudi cities is performed using Anselin's local Moran's I statistic. The analysis is also carried out at the regional level using a self-organizing map (SOM). The results show an increasing trend in the concentrations of particulate matter in Saudi Arabia, especially in some selected urban areas. The eastern and south-western parts of the Kingdom show significant clustering of high values. Some of the PM2.5 values have passed the thresholds indicated by the World Health Organization (WHO) standards and targets, posing health risks to the Saudi urban population.
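For readers unfamiliar with the statistic, a minimal sketch of Anselin's local Moran's I is shown below; the PM2.5 values and the row-standardized neighbour weights matrix W are hypothetical placeholders, not the data used in the study.

```python
# Minimal sketch of Anselin's local Moran's I for city-level PM2.5 values,
# assuming a row-standardized spatial weights matrix W (hypothetical data).
import numpy as np

def local_morans_i(x, W):
    """Return the local Moran's I statistic for each observation."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    m2 = (z ** 2).sum() / len(x)          # variance term used by Anselin (1995)
    return (z / m2) * (W @ z)             # I_i = (z_i / m2) * sum_j w_ij z_j

# Four hypothetical cities on a line, each weighted equally to its neighbours.
pm25 = [38.0, 41.0, 22.0, 20.0]
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
print(local_morans_i(pm25, W))            # positive values flag local clustering
```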
ARTICLE | doi:10.20944/preprints202306.0123.v1
Subject: Biology And Life Sciences, Agricultural Science And Agronomy Keywords: Sentinel-2 multispectral data; Maize lodging; Random Forest classification; Predictive variables; Model generalizability
Online: 2 June 2023 (04:08:42 CEST)
Lodging is a common problem in maize production that seriously impacts yield, quality, and the capacity for mechanical harvesting. Evaluation of site-specific lodging risks requires establishment of a method for multi-year monitoring. In this study, spectral images collected by the Sentinel-2 satellite were processed to obtain three types of data: gray-level co-occurrence matrix texture (GLCM), vegetation indices (VIs), and spectral reflectance (SR). Lodging classification models were then established with Random Forest (RF) using each of the three data types separately (the GLCM, VI, and SR models) and in combination (the SR+VI, SR+GLCM, VI+GLCM, and SR+VI+GLCM models). By gradually removing features with low importance scores from the SR+VI+GLCM model and analyzing the changes in the overall accuracy (OA), the optimal set of predictive variables was identified and used to construct the optimal model. A model built using data from a single timepoint in 2021 was tested on data collected at a similar timepoint in 2019 and vice versa to assess interannual model generalizability. The results of this study demonstrate that, among models constructed with a single feature type for monitoring maize lodging, the GLCM model had significantly lower accuracy than the VI and SR models. During certain growth stages, models constructed with combined features had significantly higher accuracy in monitoring maize lodging than models constructed with a single feature type. During the process of selecting the optimal predictive variables, it was found that the accuracy of the model did not increase as the number of predictive variables increased. The results show that the positive and negative validation models had an accuracy of 96.55% and 95.18%, with kappa values of 0.93 and 0.83, respectively. This indicates that the model generalizes well across years for the same reproductive stage. This study provides a detailed method for large-scale maize lodging monitoring, allowing for identification of optimal planting practices to reduce the probability of lodging and ultimately improving regional maize yield and quality.
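A minimal sketch of the importance-based feature pruning loop described above is given below, assuming scikit-learn; the feature matrix stands in for the SR, VI and GLCM variables and is synthetic, so the numbers printed are illustrative only.

```python
# Minimal sketch: train a Random Forest on a combined feature table, then
# repeatedly drop the least important feature while tracking overall accuracy.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))            # stand-in for SR + VI + GLCM features
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
features = list(range(X.shape[1]))

while len(features) > 2:
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_tr[:, features], y_tr)
    oa = accuracy_score(y_te, rf.predict(X_te[:, features]))
    print(f"{len(features)} features -> OA = {oa:.3f}")
    # remove the feature with the lowest importance score and re-evaluate
    features.pop(int(np.argmin(rf.feature_importances_)))
```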
ARTICLE | doi:10.20944/preprints201703.0028.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: GPS trajectory; GPS sensor; trajectory similarity measure; spatial-temporal data
Online: 6 March 2017 (06:51:37 CET)
With the rapid spread of handheld smart devices with built-in GPS, the trajectory data from GPS sensors has grown explosively. Trajectory data has spatio-temporal characteristics and rich information. Trajectory data processing techniques can mine the patterns of human activities and the moving patterns of vehicles in intelligent transportation systems. A trajectory similarity measure is one of the most important issues in trajectory data mining (clustering, classification, frequent pattern mining, etc.). Unfortunately, the main similarity measure algorithms for trajectory data have been found to be inaccurate, highly sensitive to sampling methods, and of low robustness to noisy data. To solve these problems, three distances and their corresponding computation methods are proposed in this paper. The point-segment distance decreases the sensitivity to point sampling methods. The prediction distance optimizes the temporal distance using the features of trajectory data. The segment-segment distance introduces the trajectory shape factor into the similarity measurement to improve accuracy. The three kinds of distance are integrated with the traditional dynamic time warping (DTW) algorithm to propose a new segment-based dynamic time warping algorithm (SDTW). The experimental results show that the SDTW algorithm achieves about 57%, 86%, and 31% better accuracy than the longest common subsequence algorithm (LCSS), the edit distance on real sequence algorithm (EDR), and DTW, respectively, and that its sensitivity to noisy data is lower than that of those algorithms.
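For context, the sketch below implements only the classic point-based DTW baseline that SDTW extends; the segment-based distances proposed in the paper are not reproduced here, and the two GPS tracks are hypothetical.

```python
# Minimal sketch of the baseline dynamic time warping (DTW) distance between
# two trajectories given as sequences of (x, y) points.
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two trajectories given as (N, 2) point arrays."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # point-to-point distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two hypothetical GPS tracks sampled at different rates.
t1 = [(0, 0), (1, 0.1), (2, 0.2), (3, 0.2)]
t2 = [(0, 0), (2, 0.15), (3, 0.25)]
print(dtw_distance(t1, t2))
```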
ARTICLE | doi:10.20944/preprints201608.0040.v1
Subject: Biology And Life Sciences, Immunology And Microbiology Keywords: seeds; ELISA; Fusarium; morphological data analysis; mycotoxins; phylogenetic analysis
Online: 4 August 2016 (10:12:54 CEST)
Adlay seed samples were collected from three adlay-growing regions (Yeoncheon, Jeonnam and Eumseong) in Korea during 2012. Among all the samples collected, 400 seeds were tested for fungal occurrence by standard blotter and test tube agar methods, and different taxonomic groups of fungal genera were detected. The most predominant fungal genera encountered were Fusarium, Phoma, Alternaria, Cladosporium, Curvularia, Cochliobolus and Leptosphaerulina. The occurrence of Fusarium species was 45.6%, and based on the combined sequences of two protein-coding genes (EF-1a and beta-tubulin) and phylogenetic analysis, 10 species were characterized as F. incarnatum (11.67%), F. kyushense (10.33%), F. fujikuroi (8.67%), F. concentricum (6.00%), F. asiaticum (5.67%), F. graminearum (1.67%), F. miscanthi (0.67%), F. polyphialidiom (0.33%), F. armeniacum (0.33%) and F. thapsinum (0.33%). The ability of these isolates to produce the mycotoxins fumonisin (FUM) and zearalenone (ZEN) was tested by a quantitative ELISA method. The results revealed that fumonisin (FUM) was produced only by F. fujikuroi and zearalenone (ZEN) only by F. asiaticum and F. graminearum. Mycotoxigenic species were then examined for their morphological characteristics to confirm their identity. Morphological observations of the species correlated well with their molecular identification and confirmed them as F. asiaticum, F. fujikuroi and F. graminearum.
ARTICLE | doi:10.20944/preprints201710.0166.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: satellite images; image analysis; self organizing maps; quantization error; structural change; demographic data
Online: 20 March 2018 (10:38:43 CET)
The quantization error (QE) from Self-Organizing Map (SOM) output after learning is exploited in this study. SOM learning is applied to time series of spatial contrast images with a variable relative amount of white and dark pixel contents, as in monochromatic medical images or satellite images. It is proven that the QE from the SOM output after learning provides a reliable indicator of potentially critical changes in images across time. The QE increases linearly with the variability in the spatial contrast contents of images across time when contrast intensity is kept constant. The hitherto unsuspected capacity of this metric to capture even the smallest changes in large bodies of image time series after ultra-fast SOM learning is illustrated on examples from SOM learning studies on computer-generated images, MRI image time series, and satellite image time series. Linear trend analysis of the changes in QE as a function of the time an image of a given series was taken gives proof of the statistical reliability of this metric as an indicator of local change. It is shown that the QE is correlated with significant clinical, demographic, and environmental data from the same reference time period during which the test image series were recorded. The findings show that the QE from SOM, which is easily implemented and requires computation times no longer than a few minutes for a given image series of 20 to 25 images, is useful for fast analysis of whole series of image data when the goal is to provide an instant statistical decision relative to change/no change between images.
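A minimal sketch of tracking the SOM quantization error across an image time series is shown below. It assumes the `minisom` package and represents images as random pixel arrays purely for illustration; the map size, iteration count and trend analysis are placeholder choices, not the settings used in the study.

```python
# Minimal sketch: compute the SOM quantization error (QE) per image in a
# time series and look at its linear trend over time.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
image_series = [rng.random((64, 64)) for _ in range(20)]   # hypothetical series

qe_per_image = []
for img in image_series:
    pixels = img.reshape(-1, 1)                  # one-dimensional pixel intensity
    som = MiniSom(4, 4, 1, sigma=1.0, learning_rate=0.5, random_seed=0)
    som.train_random(pixels, 500)                # fast learning on one image
    qe_per_image.append(som.quantization_error(pixels))

# A systematic trend in QE over time would indicate change in spatial contrast.
slope = np.polyfit(range(len(qe_per_image)), qe_per_image, 1)[0]
print(f"QE trend per image: {slope:.5f}")
```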
ARTICLE | doi:10.20944/preprints201609.0027.v1
Subject: Business, Economics And Management, Business And Management Keywords: customer complaint process improvement; customer complaint service; big data analysis
Online: 7 September 2016 (11:38:33 CEST)
With the advances in industry and commerce, passengers have become more accepting of environmental sustainability issues; thus, more people now choose to travel by bus. Government administration constitutes an important part of bus transportation services as the government gives the right-of-way to transportation companies allowing them to provide services. When these services are of poor quality, passengers may lodge complaints. The increase in consumer awareness and developments in wireless communication technologies have made it possible for passengers to easily and immediately submit complaints about transportation companies to government institutions, which has brought drastic changes to the supply-demand chain comprised of the public sector, transportation companies, and passengers. This study proposed the use of big data analysis technology including systematized case assignment and data visualization to improve management processes in the public sector and optimize customer complaint services. Taichung City, Taiwan was selected as the research area. There, the customer complaint management process in public sector was improved, effectively solving such issues as station-skipping, allowing the public sector to fully grasp the service level of transportation companies, improving the sustainability of bus operations, and supporting the sustainable development of the public sector-transportation company-passenger supply chain.
ARTICLE | doi:10.20944/preprints202209.0271.v1
Subject: Computer Science And Mathematics, Analysis Keywords: COVID-19; human mobility; spatial autocorrelation; temporal autocorrelation; Facebook mobility data
Online: 19 September 2022 (09:33:10 CEST)
COVID-19 is the most severe health crisis of the 21st century and presents a threat to almost all countries worldwide. The restriction of human mobility is one of the strategies used to control the transmission of COVID-19. However, it has yet to be determined how effective this restriction is in controlling the rise in COVID-19 cases, particularly in major capital cities such as Jakarta, Indonesia. Using Facebook's mobility data, our study explores the impact of restricting human mobility on COVID-19 case control in Jakarta. Our main contribution is showing how human mobility data can provide important information about how COVID-19 spreads in different places. We propose modifying a global regression model into a local regression model by accounting for the spatial and temporal interdependence of COVID-19 transmission across space and time. We applied Bayesian hierarchical Poisson spatiotemporal models with spatially varying regression coefficients and estimated the regression parameters using an Integrated Nested Laplace Approximation. We found that the local regression model with spatially varying regression coefficients outperforms the global regression model based on the DIC, WAIC, MPL, and R2 criteria for model selection. Across Jakarta's 44 districts, the impact of human mobility varies significantly: the impacts of human mobility on the log relative risk of COVID-19 range from -4.445 to 2.353. The prevention strategy involving the restriction of human mobility may be beneficial in some districts but ineffective in others. Therefore, a cost-effective strategy has to be adopted.
ARTICLE | doi:10.20944/preprints201608.0202.v2
Subject: Environmental And Earth Sciences, Environmental Science Keywords: HR satellite remote sensing; urban fabric vulnerability; UHI & heat waves; landsat & MODIS sensors; LST & urban heating; segmentation & objects classification; data mining; feature extraction & selection; stepwise regression & model calibration
Online: 26 October 2021 (13:11:23 CEST)
Densely urbanized areas, with a low percentage of green vegetation, are highly exposed to Heat Waves (HW), which nowadays are increasing in frequency and intensity even in middle-latitude regions due to ongoing Climate Change (CC). Their negative effects may combine with those of the UHI (Urban Heat Island), a local phenomenon in which air temperatures in the compact built-up cores of towns increase more than those in the surrounding rural areas, with significant impact on the quality of the urban environment, on citizens' health, and on energy consumption and transport, as occurred in the summer of 2003 in France and central-northern Italy. In this context, this work aims at designing and developing a methodology based on aero-spatial remote sensing (EO) at medium-high resolution and recent GIS techniques for the extensive characterization of the urban fabric response to these temperature-related climatic impacts, within the general framework of supporting local and national strategies and policies of adaptation to CC. Due to its extension and variety of built-up typologies, the municipality of Rome was selected as the test area for the methodology development and validation. We started by photointerpretation of cartography at detailed scale (CTR 1:5000) on a reference area consisting of a transect of about 5x20 km, extending from the downtown to the suburbs and including all the built-up classes of interest. The reference built-up vulnerability classes found inside the transect were then exploited as training areas to classify the entire territory of the Rome municipality. To this end, the satellite EO HR (High Resolution) multispectral data provided by the Landsat sensors were used within a purpose-developed "supervised" classification procedure based on data mining and "object-classification" techniques. The classification results were then exploited to implement a calibration method, based on a typical UHI temperature distribution derived from MODIS satellite sensor LST (Land Surface Temperature) data of summer 2003, to obtain an analytical expression of the vulnerability model previously introduced on a semi-empirical basis.
Subject: Engineering, Electrical And Electronic Engineering Keywords: optical fibre data; transmission; microcomb
Online: 15 March 2020 (15:20:23 CET)
Micro-combs [1-4], optical frequency combs generated by integrated micro-cavity resonators, offer the full potential of their bulk counterparts [5,6], but in an integrated footprint. The discovery of temporal soliton states (DKS, dissipative Kerr solitons) [4,7-11] as a means of mode-locking micro-combs has enabled breakthroughs in many fields including spectroscopy [12,13], microwave photonics, frequency synthesis, optical ranging [16,17], quantum sources [18,19], metrology [20,21] and more. One of their most promising applications has been optical fibre communications, where they have enabled massively parallel ultrahigh-capacity multiplexed data transmission [22,23]. Here, by using a new and powerful class of micro-comb called "soliton crystals", we achieve unprecedented data transmission over standard optical fibre using a single integrated chip source. We demonstrate a line rate of 44.2 Terabits per second (Tb/s) using the telecommunications C-band at 1550 nm with a spectral efficiency, a critically important performance metric, of 10.4 bits/s/Hz. Soliton crystals exhibit robust and stable generation and operation as well as a high intrinsic efficiency that, together with a low soliton micro-comb spacing of 48.9 GHz, enables the use of a very high-order coherent data modulation format of 64 QAM (quadrature amplitude modulation). We demonstrate error-free transmission over 75 km of standard optical fibre in the laboratory as well as in a field trial over an installed metropolitan optical fibre network. These experiments were greatly aided by the ability of the soliton crystals to operate without stabilization or feedback control. This work demonstrates the capability of optical soliton crystal micro-combs to perform in demanding and practical optical communications networks.
ARTICLE | doi:10.20944/preprints202003.0268.v1
Subject: Social Sciences, Library And Information Sciences Keywords: matching; data marketplace; data platform; data visualization; call for data
Online: 17 March 2020 (04:10:28 CET)
Improvements in web platforms for data exchange and trading are creating more opportunities for users to obtain data from data providers of different domains. However, the current data exchange platforms are limited to unilateral information provision from data providers to users. In contrast, there are insufficient means for data providers to learn what kinds of data users desire and for what purposes. In this paper, we propose and discuss the description items for sharing users’ call for data as data requests in the data marketplace. We also discuss structural differences in data requests and providable data using variables, as well as possibilities of data matching. In the study, we developed an interactive platform, treasuring every encounter of data affairs (TEEDA), to facilitate matching and interactions between data providers and users. The basic features of TEEDA are described in this paper. From experiments, we found the same distributions of the frequency of variables but different distributions of the number of variables in each piece of data, which are important factors to consider in the discussion of data matching in the data marketplace.
ARTICLE | doi:10.20944/preprints202301.0162.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Content-based image classification; Data curation and preparation; Convolutional neural networks (CNN); Deep learning; Artificial intelligence (AI)
Online: 9 January 2023 (10:59:31 CET)
Background: MR image classification in datasets collected from multiple sources is complicated by inconsistent and missing DICOM metadata. Therefore, we aimed to establish a method for the efficient automatic classification of MR brain sequences. Methods: Deep convolutional neural networks (DCNN) were trained as one-vs-all classifiers to differentiate between six classes: T1 weighted (w), contrast-enhanced T1w, T2w, T2w-FLAIR, ADC, and SWI. Each classifier yields a probability, allowing threshold-based and relative probability assignment while excluding images with low probability (label: unknown, open-set recognition problem). Data from three high-grade glioma (HGG) cohorts were assessed; C1 (320 patients, 20101 MRI images) was used for training, while C2 (197, 11333) and C3 (256, 3522) were used for testing. Two raters manually checked images through an interactive labeling tool. Finally, MR-Class's added value was evaluated via radiomics models' performance for progression-free survival (PFS) prediction in C2, utilizing the concordance index (C-I). Results: Approximately 10% of annotation errors were observed in each cohort between the DICOM series descriptions and the derived labels. MR-Class accuracy was 96.7% [95% CI: 95.8, 97.3] for C2 and 94.4% [93.6, 96.1] for C3. 620 images were misclassified; manual assessment of those frequently showed motion artifacts or alterations of anatomy by large tumors. Implementation of MR-Class increased the PFS model C-I on average by 14.6% compared to a model trained without MR-Class. Conclusions: We provide a DCNN-based method for sequence classification of brain MR images and demonstrate its usability in two independent HGG datasets.
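To make the open-set assignment step concrete, a minimal sketch is given below: each one-vs-all classifier yields a probability, the highest probability wins, and images below a confidence threshold are labelled "unknown". The class order, threshold value and probability vectors are hypothetical, not taken from MR-Class.

```python
# Minimal sketch of threshold-based label assignment with an "unknown" class.
import numpy as np

CLASSES = ["T1w", "T1w-CE", "T2w", "T2w-FLAIR", "ADC", "SWI"]

def assign_label(probabilities, threshold=0.5):
    """Pick the most probable sequence class, or 'unknown' below the threshold."""
    probabilities = np.asarray(probabilities, dtype=float)
    best = int(np.argmax(probabilities))
    return CLASSES[best] if probabilities[best] >= threshold else "unknown"

print(assign_label([0.02, 0.91, 0.10, 0.05, 0.01, 0.03]))   # -> T1w-CE
print(assign_label([0.22, 0.25, 0.24, 0.20, 0.18, 0.21]))   # -> unknown
```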
ARTICLE | doi:10.20944/preprints202308.1237.v1
Subject: Engineering, Transportation Science And Technology Keywords: data mining; data extraction; data science; cost infrastructure projects
Online: 17 August 2023 (09:25:22 CEST)
Context: Despite the effort put into developing standards for structuring construction costs and the strong interest in the field, most construction companies still perform the process of data gathering and processing manually. This provokes inconsistencies, differing classification criteria, and misclassifications, and the process becomes very time-consuming, particularly on big projects. Additionally, the lack of standardization makes cost estimation and comparison tasks very difficult. Objective: To create a method to extract and organize construction cost and quantity data into a consistent format and structure, enabling rapid and reliable digital comparison of the content. Method: The approach consists of a two-step method. First, the system implements data mining to review the input document and determine how it is structured based on the position, format, sequence, and content of descriptive and quantitative data. Second, the extracted data is processed and classified with a combination of data science and expert knowledge to fit a common format. Results: A wide variety of information from real historical projects has been successfully extracted and processed into a common format with 97.5% accuracy, using a subset of 5770 assets located in 18 different files, building a solid base for analysis and comparison. Conclusion: A robust and accurate method was developed for extracting hierarchical project cost data into a common machine-readable format to enable rapid and reliable comparison and benchmarking.
ARTICLE | doi:10.20944/preprints202304.0130.v1
Subject: Computer Science And Mathematics, Other Keywords: data; cooperatives; open data; data stewardship; data governance; digital commons; data sovereignty; open digital federation platform
Online: 7 April 2023 (14:14:02 CEST)
Network effects, economies of scale, and lock-in-effects increasingly lead to a concentration of digital resources and capabilities, hindering the free and equitable development of digital entrepreneurship (SDG9), new skills, and jobs (SDG8), especially in small communities (SDG11) and their small and medium-sized enterprises (“SMEs”). To ensure the affordability and accessibility of technologies, promote digital entrepreneurship and community well-being (SDG3), and protect digital rights, we propose data cooperatives [1,2] as a vehicle for secure, trusted, and sovereign data exchange [3,4]. In post-pandemic times, community/SME-led cooperatives can play a vital role by ensuring that supply chains to support digital commons are uninterrupted, resilient, and decentralized . Digital commons and data sovereignty provide communities with affordable and easy access to information and the ability to collectively negotiate data-related decisions. Moreover, cooperative commons (a) provide access to the infrastructure that underpins the modern economy, (b) preserve property rights, and (c) ensure that privatization and monopolization do not further erode self-determination, especially in a world increasingly mediated by AI. Thus, governance plays a significant role in accelerating communities’/SMEs’ digital transformation and addressing their challenges. Cooperatives thrive on digital governance and standards such as open trusted Application Programming Interfaces (APIs) that increase the efficiency, technological capabilities, and capacities of participants and, most importantly, integrate, enable, and accelerate the digital transformation of SMEs in the overall process. This policy paper presents and discusses several transformative use cases for cooperative data governance. The use cases demonstrate how platform/data-cooperatives, and their novel value creation can be leveraged to take digital commons and value chains to a new level of collaboration while addressing the most pressing community issues. The proposed framework for a digital federated and sovereign reference architecture will create a blueprint for sustainable development both in the Global South and North.
ARTICLE | doi:10.20944/preprints202103.0593.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Business Inteligence; Data Mining; Data Warehouse.
Online: 24 March 2021 (13:47:31 CET)
In the coming years, the number of digital applications and services built on cloud-native systems will be huge: by 2023 it is expected to exceed 500 million, according to IDC, which corresponds to the sum of all applications developed in the last 40 years. If this is relevant to you, this article is for you.
ARTICLE | doi:10.20944/preprints202308.0442.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Thermometers; Temperature records; Early instrumental meteorological series; Data rescue; Data recovery; Data correction; Climate data analysis
Online: 7 August 2023 (03:01:24 CEST)
A distinction is made between data rescue (i.e., copying, digitizing and archiving) and data recovery, which implies deciphering, interpreting and transforming early instrumental readings and their metadata to obtain high-quality datasets in modern units. This requires a multidisciplinary approach that includes: palaeography and knowledge of Latin and other languages to read the handwritten logs and additional documents; history of science to interpret the original text, data and metadata within the cultural frame of the 17th, 18th and early 19th centuries; physics and technology to recognize biases of early instruments or calibrations, or to correct for observational bias; and astronomy to calculate and transform the original time, expressed in canonical hours that started from twilight. The liquid-in-glass thermometer was invented in 1641 and the earliest temperature records started in 1654. Since then, different types of thermometers were invented, based on the thermal expansion of air or of selected thermometric liquids with deviations from linearity. Reference points, thermometric scales and calibration methodologies were not comparable, and not always adequately described. Thermometers had various locations and exposures, e.g., indoor, outdoor, on windows, in gardens or on roofs, facing different directions. Readings were made only one or a few times a day, not necessarily respecting a precise time schedule: this bias is analysed for the most popular combinations of reading times. The time was based on sundials and the local Sun, but the hours were counted starting from twilight. In 1789-90, Italy changed its system and all cities counted hours from their lower culmination (i.e., local midnight), so that every city had its own local time; in 1866, all Italian cities followed the local time of Rome; in 1893, the whole of Italy adopted the present-day system, based on Coordinated Universal Time and time zones. In 1873, when the International Meteorological Organization (IMO), later transformed into the World Meteorological Organization (WMO), was founded, a standardization of instruments and observational protocols was established, and all data became fully comparable. In the early instrumental period, from 1654 to 1873, the comparison, correction and homogenization of records is quite difficult, mainly because of the scarcity or even absence of metadata. This paper deals with this confused situation, discussing the main problems as well as the methodologies to recognize missing metadata, distinguish indoor from outdoor readings, correct and transform early datasets in unknown or arbitrary units into modern units, and, finally, establish in which cases it is possible to reach the quality level required by the WMO. The focus is to explain the methodology needed to recover early instrumental records, i.e., the operations that should be performed to interpret, correct and transform the original raw data into a high-quality temperature dataset usable for climate studies.
Subject: Engineering, Automotive Engineering Keywords: Business Intelligence; Data warehouse; Data Marts; Architecture; Data; Information; cloud; Data Mining; evolution; technologic companies; tools; software
Online: 24 March 2021 (13:06:53 CET)
Information has been, and will remain, a vital element for individuals and departments within an organization. That is why there are technologies that help provide proper data management; Business Intelligence is responsible for delivering technological solutions that correctly and effectively manage the entire volume of necessary and important information for companies. Among the solutions offered by Business Intelligence are Data Warehouses and Data Mining, among other business technologies that, working together, achieve the objectives proposed by an organization. It is important to highlight that these business technologies have been present since the 1950s and have evolved over time, improving processes, infrastructure and methodologies and implementing new technologies, which have helped to correct past mistakes in information management for companies. A question remains about Business Intelligence: could it be that, in the not-too-distant future, it will be adopted as an essential standard or norm for data management in any organization, given that it provides many benefits and avoids failures when classifying information? On the other hand, cloud storage has become the best alternative for safeguarding information without depending on physical storage media, which are not 100% secure and are exposed to partial or total loss of information through hardware failures or security failures caused by mishandling of that information.
COMMUNICATION | doi:10.20944/preprints202303.0453.v1
Subject: Social Sciences, Media Studies Keywords: COVID-19; MPox; Twitter; Big Data; Data Mining; Data Analysis; Sentiment Analysis; Data Science; Social Media; Monkeypox
Online: 27 March 2023 (08:39:28 CEST)
Mining and analysis of the Big Data of Twitter conversations have been of significant interest to the scientific community in the fields of healthcare, epidemiology, big data, data science, computer science, and their related areas, as can be seen from several works in the last few years that focused on sentiment analysis and other forms of text analysis of Tweets related to Ebola, E-Coli, Dengue, Human papillomavirus (HPV), Middle East Respiratory Syndrome (MERS), Measles, Zika virus, H1N1, influenza-like illness, swine flu, flu, Cholera, Listeriosis, cancer, Liver Disease, Inflammatory Bowel Disease, kidney disease, lupus, Parkinson's, Diphtheria, and West Nile virus. The recent outbreaks of COVID-19 and MPox have served as "catalysts" for Twitter usage related to seeking and sharing information, views, opinions, and sentiments involving both these viruses. While there have been a few works published in the last few months that focused on performing sentiment analysis of Tweets related to either COVID-19 or MPox, none of the prior works in this field thus far involved analysis of Tweets focusing on both COVID-19 and MPox at the same time. With an aim to address this research gap, a total of 61,862 Tweets that focused on Mpox and COVID-19 simultaneously, posted between May 7, 2022, to March 3, 2023, were studied to perform sentiment analysis and text analysis. The findings of this study are manifold. First, the results of sentiment analysis show that almost half the Tweets (the actual percentage is 46.88%) had a negative sentiment. It was followed by Tweets that had a positive sentiment (31.97%) and Tweets that had a neutral sentiment (21.14%). Second, this paper presents the top 50 hashtags that were used in these Tweets. Third, it presents the top 100 most frequently used words that are featured in these Tweets. The findings of text analysis show that some of the commonly used words involved directly referring to either or both viruses. In addition to this, the presence of words such as "Polio", "Biden", "Ukraine", "HIV", "climate", and "Ebola" in the list of the top 100 most frequent words indicate that topics of conversations on Twitter in the context of COVID-19 and MPox also included a high level of interest related to other viruses, President Biden, and Ukraine. Finally, a comprehensive comparative study that involves a comparison of this work with 49 prior works in this field is presented to uphold the scientific contributions and relevance of the same.
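The abstract above does not name the sentiment analysis tool used, so the sketch below illustrates the general idea with VADER, a common lexicon-based analyzer for social media text; the choice of VADER, the compound-score cut-offs of ±0.05, and the example Tweets are assumptions for illustration only.

```python
# Minimal sketch of labelling Tweets as positive / negative / neutral using
# the VADER compound score (assumed tooling, not the paper's stated method).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label(tweet):
    compound = analyzer.polarity_scores(tweet)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

tweets = ["Worried about catching both covid and mpox this winter.",
          "Great to see vaccination clinics opening for mpox!"]
print([label(t) for t in tweets])
```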
ARTICLE | doi:10.20944/preprints202012.0468.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: climate data; gridded product; data merging
Online: 18 December 2020 (13:29:38 CET)
This manuscript describes the construction and validation of high-resolution daily gridded (0.05° × 0.05°) rainfall and maximum and minimum temperature data for Bangladesh: the Enhancing National Climate Services for Bangladesh Meteorological Department (ENACTS-BMD) dataset. The dataset was generated by merging data from weather stations, satellite products (for rainfall) and reanalysis (for temperature). ENACTS-BMD is the first high-resolution gridded surface meteorological dataset developed specifically for studies of surface climate processes in Bangladesh. Its record begins in January 1981, it is updated monthly in real time, and outputs have daily, decadal and monthly time resolution. The Climate Data Tools (CDT), developed by the International Research Institute for Climate and Society (IRI), Columbia University, were used to generate the dataset. The data processing includes the collection of weather and gridded data, quality control of station data, downscaling of the reanalysis for temperature, bias correction of both satellite rainfall and downscaled reanalysis temperature, and the combination of station and bias-corrected gridded data. The ENACTS-BMD dataset is available as an open-access product on BMD's official website, allowing enhanced provision of services, overcoming challenges of data quality, availability and access, and at the same time promoting engagement and use by stakeholders.
ARTICLE | doi:10.20944/preprints201701.0090.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: transportation data; data interlinking; automatic schema matching
Online: 20 January 2017 (03:38:06 CET)
Multimodality requires the integration of heterogeneous transportation data to construct a broad view of the transportation network. Many new transportation services are emerging while remaining isolated from previously existing networks. This leads them to publish their data sources on the web, according to Linked Data principles, in order to gain visibility. Our interest is to use these data to construct an extended transportation network that links these new services to existing ones. The main problems we tackle in this article fall into the categories of automatic schema matching and data interlinking. We propose an approach that uses web services as mediators to help automatically detect geospatial properties and map them between two different schemas. In addition, we propose a new interlinking approach that enables users to define rich semantic links between datasets in a flexible and customizable way.
ARTICLE | doi:10.20944/preprints202307.1117.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: history; endowments; query model; digital data; physical data
Online: 17 July 2023 (15:11:18 CEST)
Historical and endowment properties differ from heritage and cultural properties, as historical and endowment properties are governed by a unique set of laws that waqf recipients must abide by. Entrusted property usually takes the form of buildings, land or valuables, and its preservation is not limited in time as long as the property can be utilized. Reliable information technology is needed to ensure data security both digitally and physically, while the rapid development of information technology demands openness of information, which is a challenge in itself. The objectives of this study include examining the collection of historical and endowment databases, the relationship between digital data and physical data, and the management organizations involved. A query model for displaying the data is designed, and the data is then analyzed to determine whether it conforms to the rules of waqf management. The results are expected to reconcile digital data with physical data and to turn any differences into findings for further analysis.
ARTICLE | doi:10.20944/preprints202111.0410.v1
Subject: Engineering, Control And Systems Engineering Keywords: Data compression; data hiding; psnr; mse; virtual data; public cloud; quantization error
Online: 22 November 2021 (15:17:12 CET)
Nowadays, information security is a challenge, especially when data is transmitted or shared in public clouds. Many researchers have proposed techniques that fail to provide data integrity, security and authentication, and that leave other issues related to sensitive data unresolved. The most common techniques used to protect data during transmission on a public cloud are cryptography, steganography, and compression. The proposed scheme suggests an entirely new approach to data security on a public cloud: the secret data is made completely invisible behind a carrier object, so that it cannot be detected with image performance parameters such as PSNR, MSE, and entropy. The details of the results are explained in the results section of the paper. The proposed technique achieves better outcomes than existing techniques as a security mechanism on a public cloud. The primary focus of the suggested approach is to minimize the integrity loss of public storage data due to unrestricted access rights by users. Improving the reusability of the carrier even after data has been concealed is a challenging task, and it is achieved through the suggested approach.
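For reference, a minimal sketch of the MSE and PSNR metrics mentioned above is shown below for 8-bit images; the cover image and the single-bit modification are synthetic placeholders, not the paper's data-hiding scheme.

```python
# Minimal sketch of MSE and PSNR between a cover image and a modified image.
import numpy as np

def mse(original, modified):
    diff = original.astype(float) - modified.astype(float)
    return float(np.mean(diff ** 2))

def psnr(original, modified, max_value=255.0):
    m = mse(original, modified)
    return float("inf") if m == 0 else 10.0 * np.log10(max_value ** 2 / m)

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
stego = cover.copy()
stego[::8, ::8] ^= 1                      # hypothetical, barely visible change
print(f"MSE = {mse(cover, stego):.4f}, PSNR = {psnr(cover, stego):.2f} dB")
```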
DATA DESCRIPTOR | doi:10.20944/preprints202109.0370.v1
Subject: Engineering, Energy And Fuel Technology Keywords: smart meter data; household survey; EPC; energy data; energy demand; energy consumption; longitudinal; energy modelling; electricity data; gas data
Online: 22 September 2021 (10:16:05 CEST)
The Smart Energy Research Lab (SERL) Observatory dataset described here comprises half-hourly and daily electricity and gas data, SERL survey data, Energy Performance Certificate (EPC) input data and 24 local hourly climate reanalysis variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) for over 13,000 households in Great Britain (GB). Participants were recruited in September 2019, September 2020 and January 2021 and their smart meter data are collected from up to one year prior to sign up. Data collection will continue until at least August 2022, and longer if funding allows. Survey data relating to the dwelling, appliances, household demographics and attitudes was collected at sign up. Data are linked at the household level and UK-based academic researchers can apply for access within a secure virtual environment for research projects in the public interest. This is a data descriptor paper describing how the data was collected, the variables available and the representativeness of the sample compared to national estimates. It is intended as a guide for researchers working with or considering using the SERL Observatory dataset, or simply looking to learn more about it.
REVIEW | doi:10.20944/preprints202003.0141.v1
Subject: Medicine And Pharmacology, Other Keywords: data sharing; data management; data science; big data; healthcare
Online: 8 March 2020 (16:46:20 CET)
In recent years, more and more health data are being generated. These data come not only from professional health systems, but also from wearable devices. All these data combined form ‘big data’ that can be utilized to optimize treatments for each unique patient (‘precision medicine’). To achieve this precision medicine, it is necessary that hospitals, academia and industry work together to bridge the ‘valley of death’ of translational medicine. However, hospitals and academia often have problems with sharing their data, even though the patient is actually the owner of his/her own health data, and the sharing of data is associated with increased citation rate. Academic hospitals usually invest a lot of time in setting up clinical trials and collecting data, and want to be the first ones to publish papers on this data. The idea that society benefits the most if the patient’s data are shared as soon as possible so that other researchers can work with it, has not taken root yet. There are some publicly available datasets, but these are usually only shared after studies are finished and/or publications have been written based on the data, which means a severe delay of months or even years before others can use the data for analysis. One solution is to incentivize the hospitals to share their data with (other) academic institutes and the industry. Here we discuss several aspects of data sharing in the medical domain: publisher requirements, data ownership, support for data sharing, data sharing initiatives and how the use of federated data might be a solution. We also discuss some potential future developments around data sharing.
Subject: Business, Economics And Management, Econometrics And Statistics Keywords: poverty; composite indicators; interval data; symbolic data
Online: 24 August 2021 (15:46:09 CEST)
The analysis and measurement of poverty is a crucial issue in the field of social science. Poverty is a multidimensional notion that can be measured using composite indicators, which are relevant for synthesizing statistical indicators. Such indicators could, however, be affected by subjective choices. To avoid this problem, we propose interval-based composite indicators, which in this context allow us to obtain robust and reliable measures. Starting from a relevant conceptual model of poverty, we identify the various factors involved; then, for each different random configuration of those factors, we compute a different composite indicator. For each region, a different interval is obtained from the distinct factor choices, i.e., from the different assumptions used to construct the composite indicator. We therefore create an interval-based composite indicator from the results of a Monte Carlo simulation over all the different assumptions. The resulting intervals can be compared, and various poverty rankings can be obtained. The interval composite indicator of poverty can be examined and compared through its parameters, such as center, minimum, maximum, and range. The results demonstrate a relevant and consistent measurement of the indicator and the relevant impact of the shadow sector on the final measures.
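A minimal sketch of the Monte Carlo construction is given below: many random weighting configurations are drawn, one composite score per region is computed for each draw, and the per-region minimum and maximum form the interval. The indicator matrix, the use of Dirichlet-distributed weights, and the number of draws are assumptions for illustration, not the paper's exact specification.

```python
# Minimal sketch of an interval-based composite indicator via Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
# Rows: regions, columns: normalized poverty-related indicators (hypothetical).
indicators = np.array([[0.2, 0.5, 0.4],
                       [0.7, 0.6, 0.8],
                       [0.4, 0.3, 0.5]])

draws = []
for _ in range(5000):
    w = rng.dirichlet(np.ones(indicators.shape[1]))  # random weights summing to 1
    draws.append(indicators @ w)                     # one composite per region
draws = np.array(draws)

lower, upper = draws.min(axis=0), draws.max(axis=0)
for r, (lo, hi) in enumerate(zip(lower, upper)):
    print(f"region {r}: composite in [{lo:.3f}, {hi:.3f}], centre {(lo + hi) / 2:.3f}")
```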
Subject: Computer Science And Mathematics, Computer Science Keywords: big data; data integration; EVMS; construction management
Online: 30 October 2020 (15:35:00 CET)
In today's information age, data are becoming more and more important. While other industries achieve tangible improvements by applying cutting-edge information technology, the construction industry still lags far behind. Cost, schedule, and performance control are three major functions in the project execution phase. Along with their individual importance, cost-schedule integration has been a significant challenge in the construction industry over the past five decades. Although a lot of effort has been put into this development, no such method is in common use in construction practice. The purpose of this study is to propose a new method to integrate cost and schedule data using big data technology. The proposed algorithm is designed to provide data integrity and flexibility in the integration process, considerable time reduction in building and changing the database, and practical use on a construction site. It is expected that the proposed method can transform information management, which field engineers currently regard as one of the most troublesome tasks, into a data-friendly process.
ARTICLE | doi:10.20944/preprints202308.1391.v1
Subject: Engineering, Transportation Science And Technology Keywords: data extraction; data mining; railway infrastructure costs; infrastructure costs data analysis; cost analysis
Online: 18 August 2023 (16:03:08 CEST)
The capability of extracting information and analyzing it in a common format is essential for performing predictions, comparing projects through cost benchmarking, and gaining a deeper understanding of project costs. However, the lack of standardization and the manual entry of data make this process very time-consuming, unreliable, and inefficient. To tackle this problem, a novel approach with a big impact is presented, combining the benefits of data mining, statistics, and machine learning to extract and analyze railway infrastructure cost data. To validate the suggested approach, data from 23 real historical projects from the client Network Rail were extracted, allowing their costs to be compared. Finally, machine learning and data analytics methods were implemented to identify the most relevant factors, allowing for cost benchmarking. The presented method proves the benefits of data extraction, being able to gather, analyze and benchmark each project in an efficient manner, and to deepen understanding of the relationships and the relevant factors that matter in infrastructure costs.
ARTICLE | doi:10.20944/preprints201812.0071.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: data governance; data sovereignty; urban data spaces; ICT reference architecture; open urban platform
Online: 6 December 2018 (05:09:54 CET)
This paper presents the results of a recent study that was conducted with a number of German municipalities/cities. Based on the obtained and briefly presented recommendations emerging from the study, the authors propose the concept of an Urban Data Space (UDS), which facilitates an eco-system for data exchange and added value creation thereby utilizing the various types of data within a smart city/municipality. Looking at an Urban Data Space from within a German context and considering the current situation and developments in German municipalities, this paper proposes a reasonable classification of urban data that allows to relate the various data types to legal aspects and to conduct solid considerations regarding technical implementation designs and decisions. Furthermore, the Urban Data Space is described/analyzed in detail, and relevant stakeholders are identified, as well as corresponding technical artifacts are introduced. The authors propose to setup Urban Data Spaces based on emerging standards from the area of ICT reference architectures for Smart Cities, such as DIN SPEC 91357 “Open Urban Platform” and EIP SCC. Thereby, the paper walks the reader through the construction of an UDS based on the above mentioned architectures and outlines all the goals, recommendations and potentials, which an Urban Data Space can reveal to a municipality/city.
ARTICLE | doi:10.20944/preprints202206.0320.v4
Subject: Biology And Life Sciences, Other Keywords: data; reproducibility; FAIR; data reuse; public data; big data; analysis
Online: 2 November 2022 (02:55:49 CET)
With an increasing amount of biological data available publicly, there is a need for a guide on how to successfully download and use this data. The Ten simple rules for using public biological data are: 1) use public data purposefully in your research, 2) evaluate data for your use case, 3) check data reuse requirements and embargoes, 4) be aware of ethics for data reuse, 5) plan for data storage and compute requirements, 6) know what you are downloading, 7) download programmatically and verify integrity, 8) properly cite data, 9) make reprocessed data and models Findable, Accessible, Interoperable, and Reusable (FAIR) and share, and 10) make pipelines and code FAIR and share. These rules are intended as a guide for researchers wanting to make use of available data and to increase data reuse and reproducibility.
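As a concrete illustration of rule 7 (download programmatically and verify integrity), the sketch below fetches a file over HTTP and checks its MD5 checksum against a published value; the URL and checksum are placeholders, not real resources, and MD5 is only one of several digests a repository might publish.

```python
# Minimal sketch: programmatic download followed by a checksum integrity check.
import hashlib
import urllib.request

def download_and_verify(url, destination, expected_md5):
    urllib.request.urlretrieve(url, destination)
    digest = hashlib.md5()
    with open(destination, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    if digest.hexdigest() != expected_md5:
        raise ValueError(f"checksum mismatch for {destination}")
    return destination

# Placeholder usage; substitute a real URL and its published checksum:
# download_and_verify("https://example.org/dataset.fastq.gz",
#                     "dataset.fastq.gz",
#                     "0123456789abcdef0123456789abcdef")
```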
ARTICLE | doi:10.20944/preprints202109.0518.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: data fusion; multi-sensor; data visualization; data treatment; participant reports; air quality; exposure assessment
Online: 30 September 2021 (14:13:52 CEST)
Use of a multi-sensor approach can provide citizens a holistic insight in the air quality in their immediate surroundings and assessment of personal exposure to urban stressors. Our work, as part of the ICARUS H2020 project, which included over 600 participants from 7 European cities, discusses data fusion and harmonization on a diverse set of multi-sensor data streams to provide a comprehensive and understandable report for participants, and offers possible solutions and improvements. Harmonizing the data streams identified issues with the used devices and protocols, such as non-uniform timestamps, data gaps, difficult data retrieval from commercial devices, and coarse activity data logging. Our process of data fusion and harmonization allowed us to automate the process of generating visualizations and reports and consequently provide each participant with a detailed individualized report. Results showed that a key solution was to streamline the code and speed up the process, which necessitated certain compromises in visualizing the data. A thought-out process of data fusion and harmonization on a diverse set of multi-sensor data streams considerably improved the quality and quantity of data that a research participant receives. Though automatization accelerated the production of the reports considerably, manual structured double checks are strongly recommended.
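To make the harmonization step concrete, the sketch below resamples two sensor streams with non-uniform timestamps onto a common one-minute grid with pandas and merges them, filling only short gaps; the column names, grid size and gap limit are hypothetical choices, not the ICARUS protocol.

```python
# Minimal sketch of harmonizing two multi-sensor streams with irregular timestamps.
import pandas as pd

pm = pd.DataFrame({"time": pd.to_datetime(["2021-05-01 10:00:12",
                                           "2021-05-01 10:01:47",
                                           "2021-05-01 10:04:02"]),
                   "pm25": [12.1, 13.4, 11.8]}).set_index("time")
noise = pd.DataFrame({"time": pd.to_datetime(["2021-05-01 10:00:30",
                                              "2021-05-01 10:02:10"]),
                      "db": [55.2, 58.9]}).set_index("time")

def to_grid(df):
    # Resample to 1-minute means and interpolate across short gaps only.
    return df.resample("1min").mean().interpolate(limit=2)

merged = to_grid(pm).join(to_grid(noise), how="outer")
print(merged)
```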
SHORT NOTE | doi:10.20944/preprints202001.0196.v1
Subject: Biology And Life Sciences, Insect Science Keywords: reproducibility; open access; data curation; data management; pre-print servers
Online: 18 January 2020 (09:05:49 CET)
The ability to replicate scientific experiments is a cornerstone of the scientific method. Sharing ideas, workflows, data, and protocols facilitates testing the generalizability of results, increases the speed at which science progresses, and enhances quality control of published work. Fields of science such as medicine, the social sciences, and the physical sciences have embraced practices designed to increase replicability. Granting agencies, for example, may require data management plans, and journals may require data and code availability statements along with the deposition of data and code in publicly available repositories. While many tools commonly used in replicable workflows, such as distributed version control systems (e.g. “git”) or scripted programming languages for data cleaning and analysis, may have a steep learning curve, their adoption can increase individual efficiency and facilitate collaborations both within entomology and across disciplines. The open science movement is developing within the discipline of entomology, but practitioners of these concepts, or those desiring to work more collaboratively across disciplines, may be unsure where or how to embrace these initiatives. This article introduces some of the tools entomologists can incorporate into their workflows to increase the replicability and openness of their work. We describe these tools and others, recommend additional resources for learning more about them, and discuss the benefits to both individuals and the scientific community, as well as potential drawbacks associated with implementing a replicable workflow.
COMMUNICATION | doi:10.20944/preprints201803.0054.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: data feature selection; data clustering; travel time prediction
Online: 7 March 2018 (13:30:06 CET)
In recent years, governments have applied intelligent transportation system (ITS) techniques to provide convenient services (e.g., a garbage truck app) for residents. This study proposes a garbage truck fleet management system (GTFMS) together with data feature selection and data clustering methods for travel time prediction. The GTFMS includes mobile devices (MDs), on-board units, a fleet management server, and a data analysis server (DAS). When a user requests the arrival time of a garbage truck via an MD, the DAS performs the data feature selection and data clustering procedures to analyze the truck's travel time. The proposed methods cluster the travel time records and reduce variation, thereby improving travel time prediction. After the travel time and arrival time are predicted, the information is sent to the user's MD. In the experimental environment, the results showed that the accuracies of the previous method and the proposed method are 16.73% and 85.97%, respectively. Therefore, the proposed data feature selection and data clustering methods can be used to predict the stop-to-stop travel time of garbage trucks.
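To make the clustering step concrete, here is a minimal Python sketch that clusters synthetic travel-time records and predicts a new trip's travel time from its cluster mean. The features (hour of day, day of week, segment length) are assumptions for illustration, not the ones used in the study.

# Illustrative sketch only: cluster travel-time records so that a prediction is
# made within a cluster of similar conditions, reducing variance.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# synthetic records: [hour_of_day, day_of_week, segment_length_km] and travel time (s)
X = np.column_stack([rng.integers(6, 22, 500), rng.integers(0, 7, 500), rng.uniform(0.2, 2.0, 500)])
travel_time = 60 * X[:, 2] + 20 * (X[:, 0] > 16) + rng.normal(0, 5, 500)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

def predict_travel_time(record: np.ndarray) -> float:
    # Predict as the mean travel time of the cluster the record falls into.
    label = km.predict(record.reshape(1, -1))[0]
    return float(travel_time[km.labels_ == label].mean())

print(predict_travel_time(np.array([18, 2, 1.0])))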
ARTICLE | doi:10.20944/preprints202110.0260.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: big data; data acquisition; data visualization; data exchange; dashboard; frequency stability; Grafana lab; Power Quality; GPS reference; frequency measurement.
Online: 18 October 2021 (18:07:43 CEST)
This article proposes a measurement solution designed to monitor the instantaneous frequency in power systems. It uses a data acquisition module and a GPS receiver for time stamping. A program in Python takes care of receiving the data, calculating the frequency, and finally transferring the measurement results to a database. The frequency is calculated with two different methods, which are compared in the article. The stored data are visualized using the Grafana platform, thus demonstrating its potential for comparing scientific data. The system as a whole constitutes an efficient, low-cost data acquisition solution.
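As a rough illustration of the kind of computation such a Python program might perform, the sketch below estimates frequency from a sampled waveform by interpolated zero-crossing counting. It is one generic method, not either of the two specific methods compared in the article, and the sampling rate and signal are synthetic assumptions.

# Simple sketch of instantaneous frequency estimation from sampled voltage data
# via upward zero crossings refined by linear interpolation.
import numpy as np

def zero_crossing_frequency(signal: np.ndarray, fs: float) -> float:
    # indices where the signal crosses zero going upward
    idx = np.where((signal[:-1] < 0) & (signal[1:] >= 0))[0]
    if len(idx) < 2:
        return float("nan")
    # refine each crossing with linear interpolation between samples
    frac = -signal[idx] / (signal[idx + 1] - signal[idx])
    crossings = (idx + frac) / fs
    return 1.0 / np.diff(crossings).mean()

fs = 10_000.0                              # assumed 10 kHz sampling
t = np.arange(0, 1.0, 1.0 / fs)
v = np.sin(2 * np.pi * 50.02 * t)          # synthetic mains signal at 50.02 Hz
print(f"estimated frequency: {zero_crossing_frequency(v, fs):.3f} Hz")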
ARTICLE | doi:10.20944/preprints201806.0185.v1
Subject: Medicine And Pharmacology, Other Keywords: mHealth; mobile data collection; data quality; data quality assessment framework; Tuberculosis control; developing countries
Online: 12 June 2018 (10:34:33 CEST)
Background: Increasingly, healthcare organizations are using technology for the efficient management of data. The aim of this study was to compare the data quality of digital records with that of the corresponding paper-based records using a data quality assessment framework. Methodology: We conducted a desk review of paper-based and digital records over the study duration, from April 2016 to July 2016, at six enrolled TB clinics. We entered all data fields of the patient treatment (TB01) card into a spreadsheet-based template to undertake a field-to-field comparison of the fields shared between the TB01 card and the digital data. Findings: A total of 117 TB01 cards were prepared at the six enrolled sites, but only 50% of the records (n=59 of 117 TB01 cards) were digitized. There were 1,239 comparable data fields, of which 65% (n=803) matched correctly between paper-based and digital records. However, 35% of the data fields (n=436) had anomalies, either in the paper-based or in the digital records. On average, 1.9 data quality issues were found per digital patient record, compared with 2.1 issues per paper-based record. Based on the analysis of valid data quality issues, there were more data quality issues in paper-based records (n=123) than in digital records (n=110). Conclusion: There were fewer data quality issues in digital records than in the corresponding paper-based records. Greater use of mobile data capture and continued use of the data quality assessment framework can deliver more meaningful information for decision making.
ARTICLE | doi:10.20944/preprints202206.0335.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: metadata; contextual data; harmonization; genomic surveillance; data management
Online: 24 June 2022 (08:46:04 CEST)
REVIEW | doi:10.20944/preprints202208.0420.v1
Subject: Social Sciences, Law Keywords: conversational commerce; data protection; law of obligations of data
Online: 24 August 2022 (10:55:06 CEST)
The possibilities and reach of social networks are increasing, their designs are becoming more diverse, and the ideas behind them more visionary. Most recently, the former company “Facebook” announced the creation of a metaverse. With these technical possibilities, however, the danger posed by fraudsters is also growing. Using social bots, consumers are increasingly influenced on such platforms, and business transactions are brought about through communication, i.e. conversational commerce. Minors and the elderly are particularly susceptible. This technical development is accompanied by a legal one: the Digital Services Directive and the Sale of Goods Directive permit the provision of data to be demanded as consideration for the sale of digital products. This raises legal problems at the level of the law of obligations and of data protection law, whose regulations are intended to protect the aforementioned groups of individuals. This protection becomes all the more important the more manipulatively consumers are influenced by communicative bots. We show that there is a lack of knowledge about what objective value data can have in business transactions. Sufficient transparency about this objective data value can maintain legal protection, especially of vulnerable groups, and ensure that the laws fulfil their purpose.
ARTICLE | doi:10.20944/preprints202106.0738.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: time series; homogenization; ACMANT; observed data; data accuracy
Online: 30 June 2021 (13:08:39 CEST)
The removal of non-climatic biases, so-called inhomogeneities, from long climatic records requires sophisticated statistical methods. One principle is that the differences between a candidate series and its neighbour series are usually analysed rather than the candidate series directly, in order to neutralize the possible impact of regionally common natural climate variation on the detection of inhomogeneities. In most homogenization methods, two main kinds of time series comparison are applied: composite reference series or pairwise comparisons. In composite reference series, the inhomogeneities of the neighbour series are attenuated by averaging the individual series, and the accuracy of homogenization can be improved by iteratively refining the composite reference series. By contrast, pairwise comparisons have the advantage that coincidental inhomogeneities affecting several station series in a similar way can be identified with higher certainty than with composite reference series. In addition, homogenization with pairwise comparisons tends to yield the most accurate regional trend estimations. A new time series comparison method is presented here, which combines the use of pairwise comparisons and composite reference series in a way that unifies their advantages. This comparison method is embedded in the ACMANT homogenization method and tested on large, commonly available monthly temperature test datasets.
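The toy Python sketch below contrasts the two comparison schemes described above on synthetic monthly anomalies: a composite reference series built from the neighbour average, and pairwise difference series against each neighbour. It illustrates the principle only and is not part of ACMANT; all series and the artificial break are assumptions.

# Toy illustration of composite-reference and pairwise comparison of climate series.
import numpy as np

rng = np.random.default_rng(1)
n_months, n_neighbours = 600, 5
regional_signal = np.cumsum(rng.normal(0, 0.05, n_months))        # shared climate variation
candidate = regional_signal + rng.normal(0, 0.3, n_months)
candidate[300:] += 0.8                                             # artificial break (inhomogeneity)
neighbours = regional_signal + rng.normal(0, 0.3, (n_neighbours, n_months))

# composite reference: subtracting the neighbour average cancels the regional signal
composite_diff = candidate - neighbours.mean(axis=0)

# pairwise comparisons: one difference series per neighbour
pairwise_diffs = candidate - neighbours

print("shift in composite difference series:",
      round(composite_diff[300:].mean() - composite_diff[:300].mean(), 2))
print("shift seen in each pairwise series:",
      (pairwise_diffs[:, 300:].mean(axis=1) - pairwise_diffs[:, :300].mean(axis=1)).round(2))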
Subject: Environmental And Earth Sciences, Environmental Science Keywords: forest inventory; data harvesting; forest modeling; forest growth; macroecology; public data
Online: 26 November 2020 (10:38:58 CET)
Net CO2 emissions and sequestration from European forests are the result of the removal and growth of flora. To arrive at aggregated measurements of these processes at the country level, local observations of increments and harvest rates are up-scaled to national forest areas. Each country releases these statistics through its individual National Forest Inventory, using its particular definitions and methodologies. In addition, five international processes deal with the harmonization and comparability of such forest datasets in Europe, namely the IPCC, SOEF, FAOSTAT, HPFFRE and FRA (definitions follow in the article). In this study, we retrieved living biomass dynamics from each of these sources for 27 European Union member states. To make our method reproducible, we release an open-source Python package that allows for automated data retrieval and analysis as new data become available. The comparison of the published values shows discrepancies in the magnitude of forest biomass changes for several countries. In some cases, the direction of these changes also differs between sources. The scarcity of the data provided, along with their low spatial resolution, precludes the creation or calibration of a pan-European forest dynamics model, which could ultimately be used to simulate future scenarios and support policy decisions. To attain these goals, an improvement in forest data availability and harmonization is needed.
ARTICLE | doi:10.20944/preprints201804.0054.v1
Subject: Computer Science And Mathematics, Other Keywords: metadata; documentation; data life-cycle; metadata life-cycle; hierarchical data
Online: 4 April 2018 (08:16:15 CEST)
The historic view of metadata as “data about data” is expanding to include data about other items that must be created, used and understood throughout the data and project life cycles. In this context, metadata might better be defined as the structured and standard part of documentation, and the metadata life cycle can be described as the metadata content that is required for documentation in each phase of the project and data life cycles. This incremental approach to metadata creation is similar to the spiral model used in software development. Each phase also has distinct users and specific questions they need answers to. In many cases, the metadata life cycle involves hierarchies where later phases have increased numbers of items. The relationships between metadata in different phases can be captured through structure in the metadata standard or through conventions for identifiers. Metadata creation and management can be streamlined and simplified by re-using metadata across many records. Many of these ideas are being used in metadata for documenting the life cycle of research projects in the Arctic.
ARTICLE | doi:10.20944/preprints202006.0258.v2
Subject: Engineering, Civil Engineering Keywords: Conservation laws; Data inference; Data discovery; Dimensionless form; PINN
Online: 30 September 2020 (03:51:25 CEST)
Deep learning has achieved remarkable success in diverse computer science applications; however, its use in traditional engineering fields has emerged only recently. In this project, we solved several mechanics problems governed by differential equations using physics-informed neural networks (PINNs). A PINN embeds the differential equations into the loss of the neural network using automatic differentiation. We present our developments in the context of solving two main classes of problems, data-driven solutions and data-driven discoveries, and we compare the results with either analytical solutions or numerical solutions obtained with the finite element method. The remarkable performance of the PINN models shown in this report suggests bright prospects for physics-informed surrogate models that are fully differentiable with respect to all input coordinates and free parameters. More broadly, this study shows that PINNs provide an attractive alternative for solving traditional engineering problems.
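To show the core mechanism, the following self-contained PyTorch sketch trains a tiny PINN on the toy ODE u' = -u with u(0) = 1, adding the equation residual to the loss via automatic differentiation. The problem, network size, and training schedule are illustrative assumptions, not those solved in the project.

# Minimal PINN sketch: the ODE residual u' + u = 0 and the boundary condition
# u(0) = 1 are both penalized in the loss; derivatives come from autograd.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.linspace(0.0, 2.0, 64).reshape(-1, 1)
x.requires_grad_(True)                         # needed to differentiate u with respect to x
x0 = torch.zeros(1, 1)                         # boundary point

for step in range(3000):
    u = net(x)
    du_dx = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    residual = du_dx + u                       # ODE residual u' + u = 0
    loss = (residual ** 2).mean() + (net(x0) - 1.0).pow(2).mean()   # physics + boundary terms
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(net(torch.tensor([[1.0]]))))       # should be close to exp(-1) ~ 0.368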
ARTICLE | doi:10.20944/preprints202208.0224.v1
Subject: Engineering, Automotive Engineering Keywords: VR-XGBoost; K-VDTE; ETC data; ESAs; data mining
Online: 12 August 2022 (03:53:23 CEST)
To scientifically and effectively evaluate the service capacity of expressway service areas (ESAs) and improve their management, we propose a method for recognizing vehicles entering ESAs (VeESAs) and estimating vehicle dwell times using ETC data. First, the ETC data and their advantages are described in detail, and cleaning rules are designed according to the characteristics of the ETC data. Second, we establish feature engineering according to the characteristics of VeESAs and propose an XGBoost-based VeESA recognition (VR-XGBoost) model. Having studied the driving rules in depth, we construct a kinematics-based vehicle dwell time estimation (K-VDTE) model. Field validation in Parts A and B of the Yangli ESA using real ETC transaction data demonstrates that our proposal outperforms the current state of the art. Specifically, in Part A and Part B, the recognition accuracies of VR-XGBoost are 95.9% and 97.4%, respectively; the mean absolute errors (MAEs) of dwell time are 52 s and 14 s, respectively; and the root mean square errors (RMSEs) are 69 s and 22 s, respectively. In addition, the confidence level of controlling the MAE of dwell time within 2 minutes is more than 97%. This work can effectively identify VeESAs and accurately estimate dwell times, providing a reference and theoretical basis for the service capacity evaluation and layout optimization of ESAs.
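As a schematic of the recognition step, the sketch below trains a gradient-boosted classifier on synthetic gantry-to-gantry features to flag vehicles that likely dwelt in a service area. The features and labelling rule are assumptions, and sklearn's GradientBoostingClassifier is used here as a stand-in for XGBoost; it is not the paper's VR-XGBoost model.

# Rough sketch: boosted-tree recognition of service-area-entering vehicles
# from ETC-style segment features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
# assumed features: gantry-to-gantry travel time (s), segment length (km), hour of day
travel_time = rng.normal(600, 120, n)
length_km = rng.uniform(5, 15, n)
hour = rng.integers(0, 24, n)
X = np.column_stack([travel_time, length_km, hour])
# label 1 if the implied speed is implausibly low, i.e. the vehicle likely dwelt in the ESA
speed_kmh = length_km / (travel_time / 3600)
y = (speed_kmh < 45).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("recognition accuracy:", accuracy_score(y_te, clf.predict(X_te)))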
ARTICLE | doi:10.20944/preprints201701.0080.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: wind turbine; failure detection; SCADA data; feature extraction; mutual information; copula
Online: 17 January 2017 (11:21:58 CET)
More and more works are using machine learning techniques with supervisory control and data acquisition (SCADA) systems for wind turbine anomaly or failure detection. While parameter selection is important for modelling a wind turbine's health condition, only a few papers have focused on this issue, and in those papers the interconnections among sub-components in a wind turbine are used to address the problem. However, relying merely on these interconnections for decision making is sometimes too general to provide a parameter list that accounts for the differences between individual SCADA datasets. In this paper, a method is proposed to provide more detailed suggestions on parameter selection based on mutual information. Moreover, after proving that a copula, a multivariate probability distribution for which the marginal probability distribution of each variable is uniform, is capable of simplifying the estimation of mutual information, an empirical-copula-based mutual information estimation method (ECMI) is introduced. A real SCADA dataset is then used to test the method, and the results show the effectiveness of the ECMI in providing parameter selection suggestions when physical knowledge is not accurate enough.
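The following Python sketch illustrates the underlying idea of combining an empirical copula transform (rank-based probability integral transform) with a simple histogram estimator of mutual information, on synthetic SCADA-like variables. It demonstrates the principle only and is not the paper's ECMI implementation; the variables and bin count are assumptions.

# Sketch: rank-transform each variable to the empirical copula scale, then
# estimate mutual information with a 2D histogram.
import numpy as np
from scipy.stats import rankdata

def empirical_copula(u: np.ndarray) -> np.ndarray:
    return rankdata(u) / (len(u) + 1.0)        # values in (0, 1), uniform marginals

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 20) -> float:
    cx, cy = empirical_copula(x), empirical_copula(y)
    pxy, _, _ = np.histogram2d(cx, cy, bins=bins, range=[[0, 1], [0, 1]])
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
wind_speed = rng.normal(8, 2, 5000)
power_output = wind_speed ** 3 + rng.normal(0, 50, 5000)   # strongly dependent SCADA-like pair
unrelated = rng.normal(0, 1, 5000)
print("MI(wind speed, power):    ", round(mutual_information(wind_speed, power_output), 3))
print("MI(wind speed, unrelated):", round(mutual_information(wind_speed, unrelated), 3))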
ARTICLE | doi:10.20944/preprints201701.0079.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: accessibility; offshore; operation and maintenance; weather condition; Markov chain; data visualization
Online: 17 January 2017 (11:17:32 CET)
For offshore wind power generation, accessibility is one of the main factors that have a great impact on operation and maintenance, due to the constraints that weather conditions place on marine transportation. This paper presents a framework to explore the accessibility of an offshore site. First, several maintenance types are defined and taken into account. Next, a data visualization procedure is introduced to provide insight into the distribution of access periods over time. Then, a rigorous mathematical method based on a finite-state Markov chain is proposed to assess the accessibility of an offshore site from the maintenance perspective. Five years of weather data from a marine site are used to demonstrate the applicability and the outcomes of the proposed method. The main findings show that the proposed framework is effective in investigating accessibility on different time scales and is able to capture the patterns of the distribution of the access periods. Moreover, based on the developed Markov chain, the average waiting time for a certain access period can be estimated. With more information on the maintenance of an offshore wind farm, the expected production loss due to time delay can be calculated.
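A toy version of the Markov-chain view can be sketched in a few lines of Python: from an hourly binary accessibility series, estimate the two-state transition matrix and the mean waiting time for access. The weather data and the wave-height access limit are synthetic assumptions, not the site or thresholds analysed in the paper.

# Toy sketch: two-state Markov chain of site accessibility and mean waiting time.
import numpy as np

rng = np.random.default_rng(0)
wave_height = np.clip(rng.normal(1.2, 0.6, 24 * 365), 0, None)      # synthetic hourly Hs (m)
accessible = (wave_height < 1.5).astype(int)                         # assumed limit for one maintenance type

# estimate 2x2 transition matrix P[i, j] = Pr(next state j | current state i)
P = np.zeros((2, 2))
for i in range(2):
    mask = accessible[:-1] == i
    P[i, 0] = np.mean(accessible[1:][mask] == 0)
    P[i, 1] = np.mean(accessible[1:][mask] == 1)

# expected number of hours until the inaccessible state is left (geometric waiting time)
mean_wait_hours = 1.0 / P[0, 1]
print("transition matrix:\n", P.round(3))
print("mean waiting time for access: %.1f hours" % mean_wait_hours)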
REVIEW | doi:10.20944/preprints202007.0153.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Open-science; big data; fMRI; data sharing; data management
Online: 8 July 2020 (11:53:33 CEST)
Large datasets that enable researchers to perform investigations with unprecedented rigor are growing increasingly common in neuroimaging. Due to the simultaneous increasing popularity of open science, these state-of-the-art datasets are more accessible than ever to researchers around the world. While analysis of these samples has pushed the field forward, they pose a new set of challenges that might cause difficulties for novice users. Here, we offer practical tips for working with large datasets from the end-user’s perspective. We cover all aspects of the data life cycle: from what to consider when downloading and storing the data, to tips on how to become acquainted with a dataset one did not collect, to what to share when communicating results. This manuscript serves as a practical guide one can use when working with large neuroimaging datasets, thus dissolving barriers to scientific discovery.
REVIEW | doi:10.20944/preprints202211.0161.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: High Performance Computing (HPC); big data; High Performance Data Analytics (HPDA); convergence; data locality; spark; Hadoop; design patterns; process mapping; in-situ data analysis
Online: 9 November 2022 (01:38:34 CET)
Big data has revolutionised science and technology, leading to the transformation of our societies. High Performance Computing (HPC) provides the computational power necessary for big data analysis using artificial intelligence and other methods. Traditionally, HPC and big data have focused on different problem domains and have grown into two different ecosystems. Efforts have been underway for the last few years to bring the best of both paradigms into HPC and big data converged architectures. Designing HPC and big data converged systems is a hard task, requiring careful placement of data, analytics, and other computational tasks so that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realisation of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of HPDA system design, as moving even a byte costs heavily in both time and energy as the size of the system increases. Performance in terms of time and energy is the most important factor for users, particularly energy, since it is the major hurdle in high performance system design and is the focus of the growing attention to environmental sustainability and green energy systems. Data locality is a broad term that encapsulates different aspects, including bringing computation to the data, minimizing data movement by efficient exploitation of cache hierarchies, reducing intra- and inter-node communication, locality-aware process and thread mapping, and in-situ and in-transit data analysis. This paper provides an extensive review of the state of the art on data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions. Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no comparable review of data locality in converged HPC and big data systems.
ARTICLE | doi:10.20944/preprints202008.0487.v1
Subject: Social Sciences, Geography, Planning And Development Keywords: Twitter; data reliability; risk communication; data mining; Google Cloud Vision API
Online: 22 August 2020 (02:32:40 CEST)
While Twitter has been touted as providing up-to-date information about hazard events, the reliability of tweets is still a concern. Our previous publication extracted relevant tweets containing information about the 2013 Colorado flood event and its impacts. Using the relevant tweets, this research further examined the reliability (accuracy and trueness) of the tweets by examining the text and image content and comparing them to other publicly available data sources. Both manual identification of text information and automated extraction of image content (via the Google Cloud Vision API) were implemented to balance accurate information verification with efficient processing time. The results showed that both the text and the images contained useful information about damaged or flooded roads and street networks. This information can help emergency response coordination efforts and support informed allocation of resources when enough tweets contain geocoordinates or location or venue names. This research will help identify reliable crowdsourced risk information to enable near-real-time emergency response through better use of crowdsourced risk communication platforms.
ARTICLE | doi:10.20944/preprints202108.0303.v2
Online: 19 November 2021 (08:38:42 CET)
Science continues to become more interdisciplinary and to involve increasingly complex data sets. Many projects in the biomedical and health-related sciences follow or aim to follow the principles of FAIR data sharing, which has been demonstrated to foster collaboration, to lead to better research outcomes, and to help ensure reproducibility of results. Data generated in the course of biomedical and health research present specific challenges for FAIR sharing in the sense that they are heterogeneous and highly sensitive to context and the needs of protection and privacy. Data sharing must respect these features without impeding timely dissemination of results, so that they can contribute to time-critical advances in medical therapy and treatment. Modeling and simulation of biomedical processes have become established tools, and a global community has been developing algorithms, methodologies, and standards for applying biomedical simulation models in clinical research. However, it can be difficult for clinician scientists to follow the specific rules and recommendations for FAIR data sharing within this domain. We seek to clarify the standard workflow for sharing experimental and clinical data with the simulation modeling community. By following these recommendations, data sharing will be improved, collaborations will become more effective, and the FAIR publication and subsequent reuse of data will become possible at the level of quality necessary to support biomedical and health-related sciences.
ARTICLE | doi:10.20944/preprints202307.0466.v1
Subject: Biology And Life Sciences, Plant Sciences Keywords: plant metabolomics; metabolite identification; data visualisation; omics data; bioinformatics tools
Online: 10 July 2023 (13:49:20 CEST)
The advancement of mass spectrometry technologies has revolutionised plant metabolomics research by enabling the acquisition of raw metabolomics data. However, the identification, analysis, and visualisation of these data require specialised tools. Existing solutions lack a dedicated plant-specific metabolite database and pose usability challenges. To address these limitations, we developed PlantMetSuite, a web-based tool for comprehensive metabolomics analysis and visualisation. PlantMetSuite encompasses interactive bioinformatics tools and databases specifically tailored for plant metabolomics data, facilitating upstream-to-downstream analysis in metabolomics and supporting integrative multi-omics investigations. PlantMetSuite can be accessed directly through a user's browser without the need for installation or programming skills. The tool is freely available at https://plantmetsuite.verygenome.com/ and will undergo regular updates and expansions to incorporate additional libraries and newly published metabolomics analysis methods. The tool's significance lies in empowering researchers with an accessible and customisable platform for unlocking plant metabolomics insights.
ARTICLE | doi:10.20944/preprints201810.0273.v1
Subject: Physical Sciences, Astronomy And Astrophysics Keywords: astroparticle physics; cosmic rays; data life cycle management; data curation; metadata; big data; deep learning; open data
Online: 12 October 2018 (14:48:32 CEST)
Modern experimental astroparticle physics features large-scale setups measuring different messengers, namely high-energy particles generated by cosmic accelerators (e.g. supernova remnants, active galactic nuclei, etc.): cosmic and gamma rays, neutrinos and the recently discovered gravitational waves. Ongoing and future experiments are distributed over the Earth, including ground, underground and underwater setups as well as balloon payloads and spacecraft. The data acquired by these experiments have different formats, storage concepts and publication policies. Such differences are a crucial issue in the era of big data and of multi-messenger analysis strategies in astroparticle physics. We propose a service, ASTROPARTICLE.ONLINE, within which we are developing an open science system that enables astroparticle physics data to be published, stored, searched, selected and analysed. The cosmic-ray experiments KASCADE-Grande and TAIGA were chosen as pilot experiments to be included in this framework. In the first step of our initiative we will develop and test the following components of the full data life cycle concept: (i) describing, storing and reusing astroparticle data; (ii) software for performing multi-experiment and multi-messenger analyses, such as deep-learning methods; (iii) outreach, including example applications and tutorials for students and scientists outside the specific research field. In the present paper we describe the concepts of our initiative and, in particular, the plans toward a common, federated astroparticle data storage.
ARTICLE | doi:10.20944/preprints202105.0377.v1
Subject: Computer Science And Mathematics, Mathematical And Computational Biology Keywords: Sensor data, wireless body area network, wearable devices, sensor data interoperability
Online: 17 May 2021 (09:47:26 CEST)
The monitoring of maternal and child health, using wearable devices made with wireless sensor technologies, is expected to reduce maternal and child death rates. Wireless sensor technologies have been used in wireless sensor networks to enable the acquisition of data for monitoring machines, smart cities, transportation, asset tracking, and tracking of human activity. Applications based on wireless body area network (WBAN) have been used in healthcare for measuring and monitoring of patient health and activity through integration with wearable devices. Wireless sensors used in WBAN can be cost-effective, enable remote availability, and can be integrated with electronic health record (EHR) management systems. Interoperability of WBAN sensor data with other linked data has the potential to improve health for all, including maternal and child health through the improvement of data access, data quality and healthcare access. This paper presents a survey of the state-of-the-art techniques for managing WBAN sensor data interoperability. The findings in this study will provide reliable support to enable policymakers and health care providers to take action to enhance the use of e-health to improve maternal-child health and reduce the mortality rates of women and children.
REVIEW | doi:10.20944/preprints202304.0051.v1
Subject: Social Sciences, Library And Information Sciences Keywords: Research integrity; Publish or Perish; Misconduct in Science; Data fabrication; Data falsification; Plagiarism
Online: 4 April 2023 (16:00:35 CEST)
The concepts of research integrity and research ethics are linked to the scientific research process and its communication. Presenting results objectively is essential. It turns out that a few scientists resort to the manipulation of results and, consequently, to other types of misconduct such as data fabrication, falsification, and plagiarism (FFP). In this article, we give definitions of these practices and of different aspects of behavior that should be avoided because they undermine the principles of research reliability. We present, through a brief literature review, the concept of research integrity, FFP, and their relations with the Publish or Perish culture. Editorial disputes are linked to the power that scientists hold to remain in their field of research, governed by clear rules to increase their intellectual capital. We discuss how scientists tend to want their papers published in well-evaluated, higher-impact journals, seeking prominence in the publishing sector. We have seen that both scientists and journals can suffer consequences and problems in the face of the Publish or Perish movement, which can call into question the quality of the editorial process, peer review, and the journal itself.
ARTICLE | doi:10.20944/preprints201710.0076.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: big data; machine learning; regularization; data quality; robust learning framework
Online: 17 October 2017 (03:47:41 CEST)
The concept of ‘big data’ has been widely discussed, and its value has been illuminated across a variety of domains. To quickly mine potential value and cope with the ever-increasing volume of information, machine learning is playing an increasingly important role and faces more challenges than ever. Because few studies exist regarding how to modify machine learning techniques to accommodate big data environments, we provide a comprehensive overview of the history and evolution of big data, the foundations of machine learning, and the bottlenecks and trends of machine learning in the big data era. More specifically, based on learning principles, we discuss regularization to enhance generalization. The challenges of quality in big data are reduced to the curse of dimensionality, class imbalance, concept drift and label noise, and the underlying reasons and mainstream methodologies to address these challenges are introduced. Learning model development has been driven by domain specifics, dataset complexities, and the presence or absence of human involvement. In this paper, we propose a robust learning paradigm by aggregating the aforementioned factors. Over the next few decades, we believe that these perspectives will lead to novel ideas and encourage more studies aimed at incorporating knowledge and establishing data-driven learning systems that involve both data quality considerations and human interactions.
ARTICLE | doi:10.20944/preprints202111.0019.v1
Subject: Engineering, Industrial And Manufacturing Engineering Keywords: Industry 4.0; Database; Data models; Big Data & Analytics; Asset Administration Shell
Online: 1 November 2021 (13:01:51 CET)
The data-oriented paradigm has proven to be fundamental for the technological transformation process that characterizes Industry 4.0 (I4.0), to the extent that Big Data & Analytics is considered a technological pillar of this process. The literature reports a series of system architecture proposals that seek to implement the so-called Smart Factory, which is primarily data-driven. Many of these proposals treat data storage solutions as mere entities that support the architecture's functionalities. However, choosing which logical data model to use can significantly affect the performance of the architecture. This work identifies the advantages and disadvantages of relational (SQL) and non-relational (NoSQL) data models for I4.0, taking into account the nature of the data in this process. The characterization of data in the context of I4.0 is based on the five dimensions of Big Data and a standardized format for representing asset information in the virtual world, the Asset Administration Shell. This work makes it possible to identify appropriate transactional properties and logical data models according to the volume, variety, velocity, veracity, and value of the data. In this way, it is possible to describe the suitability of SQL and NoSQL databases for different scenarios within I4.0.
ARTICLE | doi:10.20944/preprints202105.0589.v1
Subject: Engineering, Automotive Engineering Keywords: Game Ratings; Public Data; Game Data; Data analysis; GRAC(Korea)
Online: 25 May 2021 (08:32:32 CEST)
As of 2020, the public data for game ratings provided by the Game Ratings And Administration Committee (GRAC) are more limited than the public data for movie and video ratings provided by the Korea Media Ratings Board, and they do not allow rating information to be viewed clearly and in detail. To obtain information on game ratings, one has to search for a specific title on the GRAC homepage, which is inconvenient. In order to remove this inconvenience and extend the scope of the public data provided, the author of this paper studies a public data API extended on the basis of the information provided for video ratings. To derive the items to be added, this study analyzes the rating data on the GRAC homepage and designs a collection system to build a database. The study then implements a system that provides the data collected for the extended public data items in the form that users want. This work is expected to provide rating information in a way that strengthens the fairness of GRAC, satisfies game users' and the public's right to know, and contributes to the promotion and development of the game industry.
ARTICLE | doi:10.20944/preprints201702.0059.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: fine particulate matter (PM2.5); aerosol optical depth; community multi-scale air quality (CMAQ) model; data fusion; exposure assessment
Online: 16 February 2017 (08:58:09 CET)
Estimating ground surface PM2.5 with fine spatiotemporal resolution is a critical technique for exposure assessment in epidemiological studies of its health risks. Previous studies have utilized monitoring, satellite remote sensing or air quality modeling data to evaluate the spatiotemporal variations of PM2.5 concentrations, but such studies rarely combined these data simultaneously. We develop a three-stage model to fuse PM2.5 monitoring data, satellite-derived aerosol optical depth (AOD) and community multi-scale air quality (CMAQ) simulations, and apply it to estimate daily PM2.5 at a spatial resolution of 0.1° over China. The performance of the three-stage model is evaluated step by step using a cross-validation (CV) method. CV results show that the final fused estimator of PM2.5 is in good agreement with the observational data (RMSE = 23.00 μg/m³ and R² = 0.72) and outperforms both AOD-retrieved PM2.5 (R² = 0.62) and CMAQ simulations (R² = 0.51). According to the step-specific CVs, AOD-retrieved PM2.5 plays the key role in reducing mean bias in the fusion, whereas CMAQ provides predictions with full space-time coverage, which avoids the sampling bias caused by non-random incompleteness in satellite-derived AOD. Our fused products are more capable than either CMAQ simulations or AOD-based estimates of characterizing pollution during haze episodes and can thus support both chronic and acute exposure assessments of ambient PM2.5. Based on the products, the average annual exposure concentration of PM2.5 across China in 2014 was 55.75 μg/m³, while the average number of polluted days (PM2.5 > 75 μg/m³) was 81. The fused estimates will be publicly available for future health-related studies.
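For readers unfamiliar with this evaluation setup, the sketch below shows a drastically reduced analogue: a linear blend of synthetic AOD-retrieved and CMAQ-simulated PM2.5 evaluated by 10-fold cross-validation with R² and RMSE. It stands in for, and is far simpler than, the paper's three-stage model; all data and error levels are invented for illustration.

# Reduced sketch: cross-validated fusion of two PM2.5 predictors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
n = 3000
true_pm25 = rng.gamma(shape=3.0, scale=20.0, size=n)                 # "observed" station PM2.5
aod_pm25 = true_pm25 + rng.normal(0, 18, n)                          # AOD-retrieved estimate
cmaq_pm25 = 0.7 * true_pm25 + rng.normal(0, 25, n)                   # CMAQ simulation (biased)
X = np.column_stack([aod_pm25, cmaq_pm25])

pred = cross_val_predict(LinearRegression(), X, true_pm25,
                         cv=KFold(10, shuffle=True, random_state=0))
print("fused  R2 = %.2f, RMSE = %.1f ug/m3" % (r2_score(true_pm25, pred),
                                               np.sqrt(mean_squared_error(true_pm25, pred))))
print("AOD    R2 = %.2f" % r2_score(true_pm25, aod_pm25))
print("CMAQ   R2 = %.2f" % r2_score(true_pm25, cmaq_pm25))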
ARTICLE | doi:10.20944/preprints202103.0623.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: SARS-CoV-2; Big Data; Data Analytics; Predictive Models; Schools
Online: 25 March 2021 (14:35:53 CET)
Background: CoronaVirus Disease 2019 (COVID-19) was the most discussed topic worldwide in 2020, and at the beginning of the Italian epidemic scientists tried to understand the virus diffusion and the epidemic curve of positive cases, with controversial findings and numbers. Objectives: In this paper, a data analytics study on the diffusion of COVID-19 in the Lombardy and Campania Regions is developed in order to identify the driver that sparked the second wave in Italy. Methods: Starting from all the available official data on the diffusion of COVID-19, we analyzed Google mobility data, school data and infection data for two large Italian regions, Lombardy and Campania, which adopted two different approaches to opening and closing schools. To reinforce our findings, we also extended the analysis to the Emilia-Romagna Region. Results: The paper shows what impact the different policies adopted for school opening and closing may have had on the spread of COVID-19. Conclusions: The paper shows that a clear correlation exists between school contagion and the subsequent overall contagion in a geographical area.
TECHNICAL NOTE | doi:10.20944/preprints202011.0038.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: dyadic data; co-occurrence data; expectation maximization (EM) algorithm; mixture model
Online: 2 November 2020 (12:06:26 CET)
Dyadic data, also called co-occurrence data (COD), contain co-occurrences of objects, and statistical models to represent such data are needed. The finite mixture model is a solid statistical model for learning and inference on dyadic data, because it is built smoothly and reliably by the expectation maximization (EM) algorithm, which is well suited to the inherent sparseness of dyadic data. This research summarizes mixture models for dyadic data. When each co-occurrence in dyadic data is associated with a value, many values are missing because a large number of co-occurrences do not actually occur. In this research, these missing values are estimated as the mean (expectation) of the random variable given the partial probability distributions inside the dyadic mixture model.
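The sketch below fits a simple aspect-model-style mixture to a synthetic co-occurrence count matrix with EM and uses the fitted model to estimate values for cells in which no co-occurrence was observed. It is a generic illustration of the approach summarized above, not the note's exact formulation; the data, number of components, and iteration count are assumptions.

# Minimal EM sketch for a mixture over dyadic (co-occurrence) data: each pair
# (i, j) is explained by a latent component z with parameters P(z), P(i|z), P(j|z).
import numpy as np

rng = np.random.default_rng(0)
N = rng.poisson(1.0, size=(30, 40)).astype(float)        # sparse co-occurrence counts
n_i, n_j, K = N.shape[0], N.shape[1], 3

pz = np.full(K, 1.0 / K)
pi_z = rng.dirichlet(np.ones(n_i), size=K)               # P(i|z), shape (K, n_i)
pj_z = rng.dirichlet(np.ones(n_j), size=K)               # P(j|z), shape (K, n_j)

for _ in range(100):
    # E-step: responsibilities P(z|i,j), shape (K, n_i, n_j)
    joint = pz[:, None, None] * pi_z[:, :, None] * pj_z[:, None, :]
    resp = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
    # M-step: re-estimate parameters from expected counts
    weighted = resp * N[None, :, :]
    pz = weighted.sum(axis=(1, 2))
    pi_z = weighted.sum(axis=2) / pz[:, None]
    pj_z = weighted.sum(axis=1) / pz[:, None]
    pz = pz / pz.sum()

# model-based (smoothed) estimate of the full co-occurrence matrix, which also
# fills in cells where no co-occurrence was observed
expected = (pz[:, None, None] * pi_z[:, :, None] * pj_z[:, None, :]).sum(axis=0) * N.sum()
print("observed zero cells:", int((N == 0).sum()),
      " mean estimated value in those cells:", round(float(expected[N == 0].mean()), 3))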
REVIEW | doi:10.20944/preprints202103.0214.v2
Subject: Engineering, Automotive Engineering Keywords: data center; green data center; sustainability; energy efficiency; energy saving; ICT.
Online: 14 April 2021 (12:59:53 CEST)
Information and communication technologies (ICT) are increasingly permeating our daily life, and we commit ever more of our data to the cloud. Events like the COVID-19 pandemic put an exceptional burden on ICT infrastructures. This drives the growing deployment and use of data centers, with a corresponding increase in energy use and environmental impact. The scope of this work is to take stock of data center impact, opportunities, and assessment. First, we estimate the magnitude of the impact. Then, we review strategies for efficiency and energy conservation in data centers. Energy use pertains to power distribution, IT equipment, and non-IT equipment (e.g. cooling): existing and prospective strategies and initiatives in these sectors are identified. Among the key elements are innovative cooling techniques, natural resources, automation, low-power electronics, and equipment with extended thermal limits. Research perspectives are identified and estimates of improvement opportunities are presented. Finally, we present an overview of existing metrics, the regulatory framework, and the bodies concerned.
COMMUNICATION | doi:10.20944/preprints202206.0383.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: Exoskeleton; Twitter; Tweets; Big Data; social media; Data Mining; dataset; Data Science; Natural Language Processing; Information Retrieval
Online: 21 July 2022 (04:06:53 CEST)
Exoskeleton technology has been advancing rapidly in the recent past due to its multitude of applications and diverse use-cases in assisted living, the military, healthcare, firefighting, and Industry 4.0. The exoskeleton market is projected to grow to several times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction towards exoskeletons, for which the availability of big data of conversations about exoskeletons is necessary. The Internet of Everything style of today's living, characterized by people spending more time on the internet than ever before, and on social media platforms in particular, holds the potential for the development of such a dataset through the mining of relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, and the topics found in its conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to the field. First, it presents an open-access dataset of about 140,000 tweets about exoskeletons that were posted over a 5-year period from May 21, 2017, to May 21, 2022. Second, based on a comprehensive review of recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 research questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
ARTICLE | doi:10.20944/preprints202212.0390.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: hydraulic geometry; rating curves; flood mapping; accuracy; data acquisition; data needs
Online: 21 December 2022 (06:59:11 CET)
Hydraulic relationships are important for water resource management, hazard prediction, and modelling. Since Leopold first identified power-law expressions that relate streamflow to top width, depth, and velocity, hydrologists have been estimating 'at-a-station hydraulic geometries' (AHG) to describe average flow hydraulics. As the amount of data, the number of data sources, and application needs increase, the ability to apply, integrate and compare disparate and often noisy data is critical for applications ranging from reach to continental scales. However, even with quality data, the standard practice of solving each AHG relationship independently can lead to solutions that fail to conserve mass. The challenge addressed here is how to extend the physical properties of the AHG relations while improving the way they are hydrologically framed and fit. We present a framework for minimizing error while ensuring mass conservation in reach-scale, or hydrologic-feature-scale, geometries (FHG) that complies with current state-of-the-practice conceptual and logical models. Through this framework, FHG relations are fit to the United States Geological Survey's (USGS) Rating Curve database, the USGS HYDRoacoustic dataset in support of the Surface Water Oceanographic Topography satellite mission (HYDRoSWOT), and the hydraulic property tables produced as part of the NOAA/Oakridge Continental Flood Inundation Mapping framework. The paper describes and demonstrates the accuracy, interoperability, and application of these relationships to flood modelling and presents the framework in an R package.
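To make the mass-conservation issue concrete, the Python sketch below fits the three AHG power laws w = a*Q^b, d = c*Q^f, v = k*Q^m independently by log-log regression on synthetic data and then checks the continuity constraints b + f + m = 1 and a*c*k = 1 (which follow from Q = w*d*v) that independent fits are not guaranteed to honour. Data and coefficients are illustrative assumptions, not values from the cited databases, and this is not the paper's R framework.

# Toy sketch: independent log-log fits of AHG power laws plus a continuity check.
import numpy as np

rng = np.random.default_rng(0)
Q = np.exp(rng.uniform(np.log(1), np.log(500), 200))           # discharge, m3/s
width = 8.0 * Q ** 0.26 * np.exp(rng.normal(0, 0.05, 200))
depth = 0.30 * Q ** 0.40 * np.exp(rng.normal(0, 0.05, 200))
velocity = Q / (width * depth)                                  # continuity holds in the "truth"

def fit_power_law(Q, y):
    slope, log_a = np.polyfit(np.log(Q), np.log(y), 1)          # y = a * Q^slope
    return np.exp(log_a), slope

(a, b), (c, f), (k, m) = (fit_power_law(Q, width),
                          fit_power_law(Q, depth),
                          fit_power_law(Q, velocity))
print("exponent sum b+f+m     = %.3f (should be 1)" % (b + f + m))
print("coefficient product ack = %.3f (should be 1)" % (a * c * k))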
ARTICLE | doi:10.20944/preprints202201.0365.v3
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: binding affinity prediction; machine learning; data quality; data quantity; deep learning
Online: 23 May 2022 (11:16:49 CEST)
Prediction of protein-ligand binding affinities is crucial for computational drug discovery. A number of deep learning approaches have been developed in recent years to improve the accuracy of such affinity prediction. While the predictive power of these systems has advanced to some degree, depending on the dataset used for model training and testing, the effects of the quality and quantity of the underlying data have not been thoroughly examined. In this study, we employed erroneous datasets and data subsets of different sizes, created from one of the largest databases of experimental binding affinities, to train and evaluate a deep learning system based on convolutional neural networks. Our results show that data quality and quantity do have significant impacts on the prediction performance of trained models. Depending on the variations in data quality and quantity, the performance discrepancies can be comparable to, or even larger than, those observed among different deep learning approaches. In particular, the presence of proteins during model training leads to a dramatic increase in prediction accuracy. This implies that the continued accumulation of high-quality affinity data, especially for new protein targets, is indispensable for improving deep learning models to better predict protein-ligand binding affinities.
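The data-quantity part of such an experiment can be schematized as a learning curve: train the same model on nested subsets of increasing size and track held-out performance. The sketch below does this with a random forest on synthetic features as a stand-in for the paper's convolutional network and affinity data; all names and sizes are assumptions.

# Schematic learning-curve sketch: performance as a function of training-set size.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(8000, 20))
y = X[:, :5].sum(axis=1) + rng.normal(0, 0.5, 8000)            # synthetic "binding affinity"
X_test, y_test = X[6000:], y[6000:]                            # held-out evaluation set

for n in (250, 500, 1000, 2000, 4000, 6000):
    model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X[:n], y[:n])                                    # nested training subsets
    print(f"n_train={n:5d}  test R2={r2_score(y_test, model.predict(X_test)):.3f}")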
CASE REPORT | doi:10.20944/preprints201801.0066.v1
Subject: Engineering, Control And Systems Engineering Keywords: cohesion policy; data visualization; open data
Online: 8 January 2018 (11:11:47 CET)
The implementation of the European Cohesion Policy, which aims at fostering regional competitiveness, economic growth and the creation of new jobs, is documented for the period 2014-2020 in the publicly available Open Data Portal for the European Structural and Investment Funds. On the basis of this source, this paper describes the process of data mining and visualization used to produce information on the performance of regional programmes in achieving effective expenditure of resources.