ARTICLE | doi:10.20944/preprints202003.0073.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: digital object; data infrastructure; research infrastructure; data management; data science; FAIR data; open science; European Open Science Cloud; EOSC; persistent identifier
Online: 5 March 2020 (02:30:06 CET)
Data science is facing the following major challenges: (1) developing scalable cross-disciplinary capabilities, (2) dealing with the increasing data volumes and their inherent complexity, (3) building tools that help to build trust, (4) creating mechanisms to efficiently operate in the domain of scientific assertions, (5) turning data into actionable knowledge units and (6) promoting data interoperability. As a way to overcome these challenges, we further develop the proposals by early Internet pioneers for Digital Objects as encapsulations of data and metadata made accessible by persistent identifiers. In the past decade, this concept was revisited by various groups within the Research Data Alliance and put in the context of the FAIR Guiding Principles for findable, accessible, interoperable and reusable data. The basic components of a FAIR Digital Object (FDO) as a self-contained, typed, machine-actionable data package are explained. A survey of use cases has indicated the growing interest of research communities in FDO solutions. We conclude that the FDO concept has the potential to act as the interoperable federative core of a hyperinfrastructure initiative such as the European Open Science Cloud (EOSC).
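To make the notion of a self-contained, typed, machine-actionable data package more concrete, the following minimal sketch models an FDO as a small Python data structure. The field names, example identifier and URL are illustrative assumptions, not part of any FDO specification or of the EOSC architecture described in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class FairDigitalObject:
    """Minimal illustrative model of a FAIR Digital Object (FDO).

    An FDO bundles a persistent identifier (PID), a machine-readable type,
    structured metadata, and a reference to the bit sequence (the actual
    data) into one self-contained, machine-actionable unit.
    """
    pid: str                    # persistent identifier, e.g. a Handle or DOI
    fdo_type: str               # registered type governing allowed operations
    metadata: dict = field(default_factory=dict)  # descriptive metadata record
    data_ref: str = ""          # URL or repository reference to the bit sequence

    def is_findable(self) -> bool:
        # A very loose check: a PID and a minimal metadata record exist.
        return bool(self.pid) and "title" in self.metadata


# Hypothetical example instance (identifier and URL are made up):
fdo = FairDigitalObject(
    pid="21.T12345/abcdef",
    fdo_type="ClimateTimeSeries",
    metadata={"title": "Monthly temperature anomalies", "license": "CC-BY-4.0"},
    data_ref="https://repository.example.org/datasets/42",
)
print(fdo.is_findable())
```

The point of the encapsulation is that any client resolving the PID can inspect the type and metadata and decide, by machine, which operations apply to the data.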
SHORT NOTE | doi:10.20944/preprints202001.0196.v1
Subject: Biology And Life Sciences, Insect Science Keywords: reproducibility; open access; data curation; data management; pre-print servers
Online: 18 January 2020 (09:05:49 CET)
The ability to replicate scientific experiments is a cornerstone of the scientific method. Sharing ideas, workflows, data, and protocols facilitates testing the generalizability of results, increases the speed at which science progresses, and enhances quality control of published work. Fields of science such as medicine, the social sciences, and the physical sciences have embraced practices designed to increase replicability. Granting agencies, for example, may require data management plans, and journals may require data and code availability statements along with the deposition of data and code in publicly available repositories. While many tools commonly used in replicable workflows, such as distributed version control systems (e.g. “git”) or scripted programming languages for data cleaning and analysis, may have a steep learning curve, their adoption can increase individual efficiency and facilitate collaborations both within entomology and across disciplines. The open science movement is developing within the discipline of entomology, but practitioners of these concepts or those desiring to work more collaboratively across disciplines may be unsure where or how to embrace these initiatives. This article is meant to introduce some of the tools entomologists can incorporate into their workflows to increase the replicability and openness of their work. We describe these tools and others, recommend additional resources for learning more about these tools, and discuss the benefits to both individuals and the scientific community and potential drawbacks associated with implementing a replicable workflow.
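As a hedged illustration of the kind of scripted, version-controllable workflow the abstract advocates, the short Python sketch below cleans a hypothetical specimen-count table and writes both the cleaned data and a summary to disk, so the whole analysis can be re-run and tracked with git. The file names, columns and values are invented for the example and are not drawn from the article.

```python
import pandas as pd

# Hypothetical raw field data; in practice this would be a CSV kept under version control.
raw = pd.DataFrame({
    "site": ["A", "A", "B", "B", None],
    "species": ["Apis mellifera", "apis mellifera", "Bombus impatiens", None, "Apis mellifera"],
    "count": [3, 5, 2, 4, 1],
})

# Cleaning steps are explicit and repeatable rather than done by hand in a spreadsheet.
clean = (
    raw.dropna(subset=["site", "species"])                            # drop incomplete records
       .assign(species=lambda d: d["species"].str.capitalize())       # harmonise species names
)

summary = clean.groupby("species", as_index=False)["count"].sum()

clean.to_csv("specimens_clean.csv", index=False)      # archive cleaned data alongside the script
summary.to_csv("specimens_summary.csv", index=False)
print(summary)
```

Committing the script, the raw CSV and the outputs to a repository is what makes the analysis reproducible by a collaborator or reviewer.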
ARTICLE | doi:10.20944/preprints201710.0016.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: citizen science; volunteered geographical information; metadata; data quality; quality assurance; scientific workflow; provenance; metaquality; open data
Online: 3 October 2017 (13:52:29 CEST)
Environmental policy involving citizen science (CS) is of growing interest. In support of this open data stream, validation or quality assessment of CS data, and their appropriate usage for evidence-based policy making, need a flexible and easily adaptable data curation process that ensures transparency. Addressing these needs, this paper describes an approach to automatic quality assurance as proposed by the Citizen OBservatory WEB (COBWEB) FP7 project. This approach is based upon a workflow composition that combines different quality controls, each belonging to one of seven categories or ‘pillars’. Each pillar focuses on a specific dimension of the types of reasoning algorithms used for CS data qualification. These pillars attribute values to a range of quality elements belonging to three complementary quality models. Additional data from various sources, such as Earth Observation (EO) data, are often included among the inputs of quality controls within the pillars. However, qualified CS data can also contribute to the validation of EO data; the question of validation can therefore be considered as ‘two sides of the same coin’. Based on an invasive species CS study concerning Fallopia japonica (Japanese knotweed), the paper discusses the flexibility and usefulness of qualifying CS data, either when using EO data for validation within the quality assurance process, or when validating an EO data product that describes the risk of occurrence of the plant. Both validation paths are found to be improved by quality assurance of the CS data. Addressing the reliability of CS open data, we also describe the issues and limitations of quality assurance for validation that arise from the quality of secondary data used within the automatic workflow (e.g. error propagation), paving the route to improvements in the approach.
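The pillar-based workflow can be pictured as a composition of independent quality controls, each attributing values to quality elements for an observation. The sketch below is a schematic rendering under assumed names and thresholds; it is not the COBWEB implementation and uses only three toy "pillars".

```python
# Schematic composition of quality controls into a curation workflow.
# Each control inspects a citizen-science observation and contributes
# quality-element values; names and thresholds are illustrative only.

def positional_check(obs, quality):
    quality["positional_accuracy_m"] = obs.get("gps_accuracy_m", 999)
    return quality

def plausibility_check(obs, quality):
    # e.g. compare the reported location against a coarse habitat or range mask
    quality["in_known_range"] = obs.get("latitude", 0) > 49.0
    return quality

def metadata_check(obs, quality):
    quality["has_photo"] = bool(obs.get("photo_url"))
    return quality

PILLARS = [positional_check, plausibility_check, metadata_check]

def assess(obs):
    quality = {}
    for control in PILLARS:        # workflow composition: controls applied in sequence
        quality = control(obs, quality)
    return quality

observation = {"latitude": 52.4, "gps_accuracy_m": 8, "photo_url": "https://example.org/p.jpg"}
print(assess(observation))
```

In the real system each control would draw on secondary data (e.g. EO layers), and the resulting quality elements would feed the three complementary quality models mentioned above.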
ARTICLE | doi:10.20944/preprints201809.0543.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: Open Science, Data Sharing, Neuroimaging, Reproducibility, Transparency, Reform
Online: 27 September 2018 (11:50:12 CEST)
Ongoing debates regarding the virtues and challenges of implementing open science for brain imaging research mirror those of the larger scientific community. The present commentary acknowledges the merits of arguments on both sides, as well as the underlying realities that have forced so many to feel the need to resist the implementation of an ideal. Potential sources of top-down reform are discussed, along with the factors that threaten to slow their progress. The potential roles of generational change and the individual are discussed, and a starter list of actionable steps that any researcher can take, big or small, is provided.
REVIEW | doi:10.20944/preprints201805.0418.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: big data training and learning; company and business requirements; ethics; impact; decision support; data engineering; open data; smart homes; smart cities; IoT
Online: 29 May 2018 (08:45:52 CEST)
In Data Science we are concerned with the integration of relevant sciences in observed and empirical contexts. This results in the unification of analytical methodologies and of observed and empirical data contexts. Given the dynamic nature of this convergence, we describe the origins and the many evolutions of the Data Science theme. This article covers: the rapidly growing post-graduate university course provision for Data Science; a preliminary study of employability requirements; and how past eminent work in the social sciences and other areas, certainly mathematics, can be of immediate and direct relevance and benefit for innovative methodology, and for facing and addressing the ethical aspects of Big Data analytics relating to data aggregation and scale effects. Also associated with Data Science is how its direct and indirect outcomes and consequences include decision support and policy making, with both qualitative and quantitative outcomes. For these reasons, we note the importance of how Data Science builds collaboratively on other domains, potentially with innovative methodologies and practice. Further sections point towards the major current research issues.
ARTICLE | doi:10.20944/preprints201810.0115.v2
Subject: Physical Sciences, Astronomy And Astrophysics Keywords: radio astronomy; interferometry; square kilometre array; big data; faraday tomography
Online: 21 November 2018 (07:19:33 CET)
The Square Kilometre Array (SKA) will be both the largest radio telescope ever constructed and the largest Big Data project in the known Universe. The first phase of the project will generate on the order of 5 zettabytes of data per year. A critical task for the SKA will be its ability to process data for science, which will need to be conducted by science pipelines. Using polarization data from the LOFAR Multifrequency Snapshot Sky Survey (MSSS), we have been developing a realistic SKA-like science pipeline that can handle the large data volumes generated by LOFAR at 150 MHz. The pipeline uses task-based parallelism to image, detect sources, and perform Faraday tomography across the entire LOFAR sky. The project thereby provides a unique opportunity to contribute to the technological development of the SKA telescope, while simultaneously enabling cutting-edge scientific results. In this paper, we provide an update on current efforts to develop a science pipeline that can enable tight constraints on the magnetised large-scale structure of the Universe.
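The pipeline's task-based parallelism can be illustrated by mapping an imaging and source-finding step over independent sky fields with a process pool. The sketch below is a generic Python analogue with placeholder functions and dummy results; it is not the actual LOFAR/SKA pipeline code.

```python
from concurrent.futures import ProcessPoolExecutor

def process_field(field_id):
    """Placeholder for one pipeline task: image a field, find sources,
    and run a (here, fake) Faraday-depth analysis."""
    # In the real pipeline each task would call imaging and RM-synthesis tools.
    n_sources = (field_id * 7) % 5 + 1          # dummy result standing in for a catalogue
    return {"field": field_id, "sources": n_sources}

if __name__ == "__main__":
    fields = range(16)                           # independent sky fields
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_field, fields))
    print(sum(r["sources"] for r in results), "sources found in total")
```

Because the fields are independent, the same pattern scales from a workstation pool to a cluster scheduler, which is the property the pipeline relies on to keep pace with LOFAR and, eventually, SKA data rates.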
ARTICLE | doi:10.20944/preprints202307.1847.v1
Subject: Computer Science And Mathematics, Other Keywords: Data Science; Literate Programming; Teaching; Emacs; Org-mode; IDE; Case Study
Online: 27 July 2023 (10:31:08 CEST)
This paper presents a case study on using Emacs and Org-mode for literate programming in undergraduate data science courses. Over three academic terms, the author mandated these tools across courses in R, Python, C++, SQL, and more. Onboarding relied on simplified Emacs tutorials and starter configurations. Students gained proficiency after initial practice. Live coding sessions demonstrated the flexible instruction enabled by literate notebooks. Assignments and projects required documentation alongside functional code. Student feedback showed enthusiasm for learning a versatile IDE, despite some frustration with the learning curve. Skilled students highlighted efficiency gains in a unified environment. However, uneven adoption of documentation practices pointed to a need for better incorporation into grading. Additionally, some students found Emacs unintuitive, desiring more accessible options. This highlights a need to match tools to skill levels, potentially starting novices with graphical IDEs before introducing Emacs. Key takeaways include: literate programming aids comprehension but requires rigorous onboarding and reinforcement; Emacs excels for advanced workflows but has a steep initial curve. With proper support, these tools show promise for data science education.
REVIEW | doi:10.20944/preprints201905.0302.v1
Subject: Business, Economics And Management, Economics Keywords: open science; open access; open data; economic impacts
Online: 27 May 2019 (11:19:59 CEST)
A common motivation for increasing open access to research findings and data is the potential to create economic benefits – but evidence is patchy and diverse. This study systematically reviewed the evidence on what kinds of economic impacts (positive and negative) open science can have, how these come about, and how benefits could be maximized. Use of open science outputs often leaves no obvious trace, so most evidence of impacts is based on interviews, surveys, inference based on existing costs, and modelling approaches. There is indicative evidence that open access to findings/data can lead to savings in access costs, labour costs and transaction costs. There are examples of open science enabling new products, services, companies, research and collaborations. Modelling studies suggest higher returns to R&D if open access permits greater accessibility and efficiency of use of findings. Barriers include lack of skills capacity in search, interpretation and text mining, and lack of clarity around where benefits accrue. There are also contextual considerations around who benefits most from open science (e.g. sectors, small vs larger companies, types of dataset). Recommendations captured in the review include more research, monitoring and evaluation (including developing metrics), promoting benefits, capacity building and making outputs more audience-friendly.
ARTICLE | doi:10.20944/preprints201612.0053.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: visualization; terrain rendering; geo-spatial data; uncertainty; prioritization
Online: 9 December 2016 (10:11:12 CET)
Visualizing geo-spatial data embedded into a three-dimensional terrain is challenging. The problem becomes even more complex when uncertainty information needs to be presented as well. This paper addresses the question of how to visually communicate all three aspects: the 3D terrain, the geo-spatial data, and the data-associated uncertainty. We argue that visualizing all aspects with a high degree of detail will likely exceed the visual budget. Therefore, we propose a visualization strategy based on prioritizing a selected aspect and presenting the remaining two with less detail. We discuss various design options that allow us to obtain differently prioritized visual representations. Our approach has been implemented as a tool for rapid visualization prototyping in the context of avionics applications. Practical solutions are described for a use case related to the visualization of 3D terrain and uncertain weather data.
ARTICLE | doi:10.20944/preprints202309.0560.v1
Subject: Computer Science And Mathematics, Analysis Keywords: Big Data Analytics; Revenue Generation; Customer Relationship Management (CRM)
Online: 8 September 2023 (13:33:43 CEST)
This study explores the potential of data science software solutions, such as Customer Relationship Management (CRM) software, for increasing the revenue generation of businesses. We focused on businesses in the Accommodation and Food Service sector across the European Union (EU). The investigation is contextualized within the rising trend of data-driven decision making, examining the potential correlation between data science application and business revenue. Based on a comprehensive evaluation of Eurostat datasets from 2014 to 2021, and using both univariate and multivariate analyses, we assessed e-commerce sales data across countries, focusing on the usage of big data and CRM tools. Big data utilization showed a clear, positive relationship with enhanced e-commerce sales. However, CRM tools exhibited a dualistic impact: while their use in marketing showed no significant effect on sales, their application in non-marketing functions had a negative correlation. These findings underscore the potential role of CRM and data science solutions in enhancing business performance in the EU's Accommodation and Food Service industry.
CONCEPT PAPER | doi:10.20944/preprints202204.0044.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Smart cities; data science; machine learning; Internet of Things; data-driven decision making; intelligent services; cybersecurity
Online: 6 April 2022 (11:35:15 CEST)
Cities are currently undergoing huge shifts in technology and operations, and `data science' is driving the change in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting insights or actionable knowledge from city data and building a corresponding data-driven model is the key to making a city system automated and intelligent. Data science is typically the study and analysis of actual happenings with historical data, using a variety of scientific methodologies, machine learning techniques, processes, and systems. In this paper, we concentrate on and explore ``Smart City Data Science", where city data collected from various sources, such as sensors and Internet-connected devices, are mined for insights and hidden correlations to enhance decision-making processes and deliver better and more intelligent services to citizens. To achieve this goal, various machine learning analytical models can be employed to provide deeper knowledge about city data, which makes the computing process more actionable and intelligent in various real-world services of today's cities. Finally, we identify and highlight ten open research issues for future development and research in the context of data-driven smart cities. Overall, we aim to provide an insight into smart city data science conceptualization on a broad scale, which can be used as a reference guide for researchers, professionals, and policy-makers, particularly from the technological point of view.
ARTICLE | doi:10.20944/preprints202001.0274.v1
Subject: Computer Science And Mathematics, Mathematical And Computational Biology Keywords: bioinformatics; computational genomics; computational medicine; data science; data visualization; parallel processing; grid computing; fog computing
Online: 24 January 2020 (10:26:26 CET)
Conventional data visualization software has greatly improved the efficiency of mining and visualizing biomedical data. However, when a grid computing approach is applied, the efficiency and complexity of such visualization allow for a hypothetical increase in research opportunities. This paper first presents data visualization examples in conventional networks, then goes into greater detail about more complex techniques related to leveraging parallel processing architectures. Part of these complex techniques includes an attempt to build a basic generative adversarial network (GAN) in order to increase the statistical pool of biomedical data for analysis, as well as an introduction to the project utilizing the decentralized-internet SDK. This paper is meant to show these conventional examples and then go into detail about the deeper experimentation and its self-contained results.
ARTICLE | doi:10.20944/preprints201612.0079.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: fire detection; upwelling radiation; diurnal variation; training data; geostationary sensors
Online: 15 December 2016 (09:22:10 CET)
Fire detection from satellite sensors relies on an accurate estimation of the unperturbed state of a target pixel, from which an anomaly can be isolated. Methods for estimating the radiation budget of a pixel without fire depend upon training data derived from the location's recent history of brightness temperature variation over the diurnal cycle, which can be vulnerable to cloud contamination and the effects of weather. This study proposes a new method that utilises the common solar budget found at a given latitude in conjunction with an area's local solar time to aggregate a broad-area training dataset, which can be used to model the expected diurnal temperature cycle of a location. This training data is then used in a temperature fitting process with the measured brightness temperatures in a pixel, and compared to pixel-derived training data and contextual methods of background temperature determination. Results of this study show similar accuracy between clear-sky medium wave infrared upwelling radiation and the diurnal temperature cycle estimation compared to previous methods, with demonstrable improvements in processing time and training data availability. This method can be used in conjunction with brightness temperature thresholds to provide a baseline for upwelling radiation, from which positive thermal anomalies such as fire can be isolated.
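One way to picture the described fitting step is to fit a simple diurnal temperature cycle model to broad-area training brightness temperatures and compare a pixel's measured value against the fitted background. The sinusoidal model, synthetic data, and numbers below are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import curve_fit

def diurnal_cycle(t_hours, mean_bt, amplitude, peak_hour):
    """Very simple diurnal brightness-temperature model (illustrative only)."""
    return mean_bt + amplitude * np.cos(2 * np.pi * (t_hours - peak_hour) / 24.0)

# Synthetic broad-area training data: local solar time vs. brightness temperature (K).
t = np.linspace(0, 24, 97)
bt_training = 290 + 8 * np.cos(2 * np.pi * (t - 14) / 24) + np.random.normal(0, 0.5, t.size)

params, _ = curve_fit(diurnal_cycle, t, bt_training, p0=[290, 5, 12])

# A measured pixel value far above the fitted background is a candidate thermal anomaly.
pixel_time, pixel_bt = 13.5, 305.0
background = diurnal_cycle(pixel_time, *params)
print(f"background = {background:.1f} K, anomaly = {pixel_bt - background:.1f} K")
```

Aggregating the training points by latitude band and local solar time, rather than from a single pixel's recent history, is what reduces the sensitivity to cloud contamination described in the abstract.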
REVIEW | doi:10.20944/preprints202001.0378.v1
Subject: Computer Science And Mathematics, Mathematical And Computational Biology Keywords: workflows; containers; cloud computing; Kubernetes; big data; reproducibility
Online: 31 January 2020 (05:15:01 CET)
Containers are gaining popularity in life science research as they provide a solution for encompassing dependencies of provisioned tools, simplify software installations for end users and offer a form of isolation between processes. Scientific workflows are ideal for chaining containers into data analysis pipelines to aid in creating reproducible analyses. In this manuscript we review a number of approaches to using containers as implemented in the workflow tools Nextflow, Galaxy, Pachyderm, Argo, Kubeflow, Luigi and SciPipe, when deployed in cloud environments. A particular focus is placed on the workflow tool’s interaction with the Kubernetes container orchestration framework.
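As a minimal sketch of the kind of Kubernetes interaction the reviewed workflow tools automate, the snippet below submits a single containerized step as a Kubernetes Job using the official Python client. The image, command, job name and namespace are placeholders; the reviewed tools add far more on top of this (dependency handling, data staging, retries, chaining of steps).

```python
from kubernetes import client, config

config.load_kube_config()                      # assumes a configured kubectl context
batch = client.BatchV1Api()

container = client.V1Container(
    name="qc-step",
    image="quay.io/example/fastqc:latest",     # placeholder container image
    command=["echo", "running one pipeline step in a container"],
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="pipeline-step-1"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container])
        ),
    ),
)

batch.create_namespaced_job(namespace="default", body=job)
print("Job submitted; a workflow engine would now monitor completion and chain the next step.")
```

Each tool in the review differs mainly in how it generates, schedules and monitors such jobs, and in how it moves data between the containers of consecutive steps.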
Subject: Computer Science And Mathematics, Information Systems Keywords: Mobile data science; artificial intelligence; machine learning; natural language processing; expert system; data-driven decision making; context-awareness; intelligent mobile apps
Online: 14 September 2020 (00:01:39 CEST)
Artificial intelligence (AI) techniques have grown rapidly in recent years in the context of computing with smart mobile phones, typically allowing these devices to function in an intelligent manner. Popular AI techniques, including machine learning and deep learning methods, natural language processing, and knowledge representation and expert systems, can be used to make target mobile applications intelligent and more effective. In this paper, we present a comprehensive view on mobile data science and intelligent apps in terms of concepts and AI-based modeling that can be used to design and develop intelligent mobile applications for the betterment of human life in diverse day-to-day situations. This study also includes the concepts and insights of various AI-powered intelligent apps in several application domains, ranging from personalized recommendation to healthcare services, including recent COVID-19 pandemic management. Finally, we highlight several research issues and future directions relevant to our analysis in the area of mobile data science and intelligent apps. Overall, this paper aims to serve as a reference point and guideline for mobile application developers as well as researchers in this domain, particularly from a technical point of view.
ARTICLE | doi:10.20944/preprints201611.0010.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: millimeter-wavelength cloud radar; attenuation correction; dual-radar; data fusion
Online: 1 November 2016 (10:05:18 CET)
In order to correct attenuated millimeter-wavelength (Ka-band) radar data and address the problem of instability, an attenuation correction methodology (attenuation correction with variation trend constraint; VTC) was developed. Using synchronous observation conditions and multi-band radars, the VTC method adopts the variation trends of reflectivity in X-band radar data captured with wavelet transform as a constraint to adjust reflectivity factors of millimeter-wavelength radar. The correction was evaluated by comparing reflectivities obtained by millimeter-wavelength cloud radar and X-band weather radar. Experiments showed that attenuation was a major contributory factor in the different reflectivities of the two radars when relatively intense echoes exist, and the attenuation correction developed in this study significantly improved data quality for millimeter-wavelength radar. Reflectivity differences between the two radars were reduced and reflectivity correlations were enhanced. Errors caused by attenuation were eliminated, while variation details in the reflectivity factors were retained. The VTC method is superior to the bin-by-bin method in terms of correction amplitude and can be used for attenuation correction of shorter wavelength radar assisted by longer wavelength radar data.
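A rough sketch of the trend-constraint idea: use a discrete wavelet transform to extract the large-scale reflectivity trend from the less attenuated X-band profile and nudge the attenuated Ka-band profile towards it, while keeping the Ka-band small-scale detail. The wavelet family, decomposition level, synthetic profiles and attenuation factor below are arbitrary illustrative choices, not the VTC parameters.

```python
import numpy as np
import pywt

def large_scale_trend(profile, wavelet="db4", level=3):
    """Keep only the approximation coefficients: the slow variation trend."""
    coeffs = pywt.wavedec(profile, wavelet, level=level)
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(profile)]

rng = np.random.default_rng(0)
gates = np.arange(256)                                                            # range gates
x_band = 20 + 10 * np.exp(-((gates - 120) ** 2) / 800) + rng.normal(0, 1, gates.size)  # dBZ
ka_band = x_band - 0.002 * np.cumsum(np.maximum(x_band, 0))                      # attenuated copy

# Constrain the corrected Ka-band profile to follow the X-band large-scale trend,
# while retaining the Ka-band small-scale detail (its deviation from its own trend).
trend_x = large_scale_trend(x_band)
detail_ka = ka_band - large_scale_trend(ka_band)
ka_corrected = trend_x + detail_ka

print(f"mean bias before: {np.mean(x_band - ka_band):.2f} dB, "
      f"after: {np.mean(x_band - ka_corrected):.2f} dB")
```

The appeal of a trend-based constraint over bin-by-bin correction is exactly what the abstract notes: the large correction amplitude is taken from the longer-wavelength radar, while the fine structure of the shorter-wavelength data is preserved.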
REVIEW | doi:10.20944/preprints202006.0139.v1
Subject: Computer Science And Mathematics, Security Systems Keywords: Cybersecurity; machine learning; data science; decision making; cyber-attack; security modeling; intrusion detection; threat intelligence
Online: 11 June 2020 (12:12:50 CEST)
In a computing context, cybersecurity is undergoing massive shifts in technology and operations, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building a corresponding data-driven model is the key to making a security system automated and intelligent. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used; this is commonly known as data science. In this paper, we focus on and briefly discuss cybersecurity data science, where the data are gathered from relevant cybersecurity sources and the analytics complement the latest data-driven patterns for providing more effective security solutions. The concept of cybersecurity data science makes the computing process more actionable and intelligent compared to traditional approaches in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions. Furthermore, we provide a machine learning-based multi-layered framework for the purpose of cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to focus on their applicability to data-driven intelligent decision making for protecting systems from cyber-attacks.
ARTICLE | doi:10.20944/preprints201702.0074.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: network; systems; cloud computing; data centre; performance; software-defined; virtual machine; scheduling; admission control; application-aware;
Online: 20 February 2017 (04:56:24 CET)
Cloud computing refers to applications delivered as services over the Internet. Cloud systems employ policies that are inherently dynamic in nature and that depend on temporal conditions defined in terms of external events, such as the measurement of bandwidth, use of hosts, intrusion detection or specific time events. In this paper, we investigate an optimized resource management scheme named v-Mapper. The basic premise of v-Mapper is to exploit application-awareness concepts using software-defined networking (SDN) features. This paper makes three key contributions to the field: (1) we propose a virtual machine (VM) placement scheme that can effectively mitigate VM placement issues for data-intensive applications; (2) we propose a validation scheme that ensures a service is admitted only if there are sufficient resources available for its execution; and (3) we present a scheduling policy that aims to eliminate network load constraints. An evaluation was carried out with various benchmarks and demonstrated that v-Mapper outperforms other state-of-the-art approaches in terms of average task completion time, service delay time and bandwidth utilization. Given the growing importance of supporting large-scale data processing and analysis in datacentres, the v-Mapper system has the potential to make a positive impact in improving datacentre performance in the future.
REVIEW | doi:10.20944/preprints202104.0442.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: data science; advanced analytics; machine learning; deep learning; smart computing; decision-making; predictive analytics; data science applications;
Online: 16 April 2021 (11:28:09 CEST)
The digital world has a wealth of data, such as Internet of Things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). The knowledge or useful insights extracted from these data can be used for smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on "Data Science", including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, and urban and rural data science, taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers and application developers, particularly from the point of view of data-driven solutions to real-world problems.
ARTICLE | doi:10.20944/preprints201703.0058.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Smartphone sensing; mobile-social integration; automatic recognition; social data; long-term health monitoring
Online: 10 March 2017 (17:32:31 CET)
Over the past decades, overweight and obesity have become a global epidemic and a leading risk factor for death. To mitigate this serious risk, an overweight or obese individual must apply a long-term weight-management strategy to control food intake and physical activity, which is, however, not easy. Recently, with the advances of information technology, more and more people can use wearable devices and smartphones to obtain physical activity information, while they can also access various health-related information from Internet online social networks (OSNs). Nevertheless, there is a lack of an integrated approach that can combine these two methods in an efficient way. In this paper, we address this issue and propose a novel mobile-social framework for health recognition and recommendation, namely H-Rec2. The main ideas of H-Rec2 are (1) to recognize the individual's health status using the smartphone as a general platform, and (2) to recommend physical activity and food intake based on personal health information, life science principles, and health-related information obtained from OSNs. To demonstrate the potential of the H-Rec2 framework, we develop a prototype that consists of four important components: (1) an activity recognition module that senses physical activity using the accelerometer, (2) a health status modeling module that applies a novel algorithm to generate a personalized health status index, (3) a restaurant information collection module that collects relevant information from OSNs, and (4) a restaurant recommendation module that provides personalized and context-aware recommendations. To evaluate the prototype, we conduct both objective and subjective experiments, which confirm the performance and effectiveness of the proposed system.
ARTICLE | doi:10.20944/preprints202103.0738.v1
Subject: Computer Science And Mathematics, Analysis Keywords: bibliometry; coronavirus; text and data mining; SARS; MERS; COVID-19
Online: 31 March 2021 (17:30:56 CEST)
A global event such as the COVID-19 crisis presents new, often unexpected responses that are fascinating to investigate from both scientific and social standpoints. Despite several documented similarities, the Coronavirus pandemic is clearly distinct from the 1918 flu pandemic in terms of our exponentially increased, almost instantaneous ability to access and share information, offering an unprecedented opportunity to visualise rippling effects of global events across space and time. Personal devices provide “big data” on people’s movement, the environment and economic trends, while access to the unprecedented flurry of scientific publications and media posts provides a measure of the response of the educated world to the crisis. Most bibliometric (co-authorship, co-citation, or bibliographic coupling) analyses ignore the time dimension, but COVID-19 has made it possible to perform a detailed temporal investigation into the pandemic. Here, we report a comprehensive network analysis based on more than 20,000 published documents on viral epidemics, authored by over 75,000 individuals from 140 nations, in the past year of the crisis. In contrast to the 1918 flu pandemic, access to published data over the past two decades enabled a comparison of publishing trends between the ongoing COVID-19 pandemic and the 2003 SARS epidemic, to study changes in thematic foci and societal pressures dictating research over the course of a crisis.
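A temporal co-authorship analysis of the kind described can be prototyped in a few lines: build a graph per time slice and compare simple network statistics. The records below are fabricated stand-ins for real bibliographic metadata; only the construction pattern is the point.

```python
from itertools import combinations
import networkx as nx

# Fabricated bibliographic records: (year, list of author names).
records = [
    (2003, ["A. Lee", "B. Chen"]),
    (2003, ["B. Chen", "C. Novak", "D. Silva"]),
    (2020, ["A. Lee", "E. Okoro"]),
    (2020, ["E. Okoro", "F. Haddad", "B. Chen"]),
    (2020, ["C. Novak", "F. Haddad"]),
]

def coauthorship_graph(records, year):
    """Co-authorship network for one time slice: authors are nodes,
    an edge means at least one joint paper in that year."""
    g = nx.Graph()
    for rec_year, authors in records:
        if rec_year == year:
            g.add_edges_from(combinations(authors, 2))
    return g

for year in (2003, 2020):
    g = coauthorship_graph(records, year)
    print(year, "authors:", g.number_of_nodes(),
          "links:", g.number_of_edges(),
          "density:", round(nx.density(g), 3))
```

Comparing slices (e.g. SARS-era versus COVID-era graphs) is what reveals the shifts in thematic foci and collaboration structure the abstract describes.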
REVIEW | doi:10.20944/preprints202010.0263.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: data science; deep learning; ensemble machine learning models; economics; hybrid models
Online: 13 October 2020 (09:24:17 CEST)
This paper provides the state of the art of data science in economics. Through a novel taxonomy of applications and methods, advances in data science are investigated in three individual classes: deep learning models, ensemble models, and hybrid models. Application domains include the stock market, marketing, e-commerce, corporate banking, and cryptocurrency. The PRISMA method, a systematic literature review methodology, is used to ensure the quality of the survey. The findings reveal a trend towards the advancement of hybrid models, as more than 51% of the reviewed articles applied a hybrid model. It is also found that, based on the RMSE accuracy metric, hybrid models had higher prediction accuracy than other algorithms, while trends are expected to move towards further advancement of deep learning models.
ARTICLE | doi:10.20944/preprints202210.0328.v1
Subject: Environmental And Earth Sciences, Other Keywords: Geographic information science; gerrymandering; formal science; empirical science; spatial data science; DIKW paradigm; Metascience
Online: 21 October 2022 (10:04:08 CEST)
Sometimes there are clear and natural limits to the scope of action of a science, and in other cases the limits are simply convenient ones. Geographic Information Science (GISc) is a transversal science, with contacts with all geosciences but also with various formal sciences such as Mathematics, Logic and Computer Science. A first approach to specifying the limits of a science is through its definition. Definitions of GISc are often so expansive that they have rightly been criticized for practicing gerrymandering, in particular with respect to the rest of the geosciences. To avoid this, an operational definition is proposed that places GISc among the sciences that handle Data and not Information. This solves the gerrymandering problem without implying a significant cut of what is usually considered within GISc. As an unforeseen consequence, this delimitation allows GISc to be characterized as a Formal Science, leaving it as the only geoscience with this characteristic.
ARTICLE | doi:10.20944/preprints201905.0274.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: sUAS; drone; RPAS; UAV; Data; Management; FAIR; Community; standards; practices
Online: 22 May 2019 (11:42:08 CEST)
The use of small Unmanned Aircraft Systems (sUAS) as platforms for data capture has rapidly increased in recent years. However, while there has been significant investment in improving the aircraft, sensors, operations, and legislation infrastructure, little attention has been paid to supporting the management of the complex data capture pipeline that sUAS involve. This paper reports on the outcomes of a four-year, community-engagement-based investigation into what tools, practices, and challenges currently exist, particularly for researchers using sUAS as data capture platforms. The key results of this effort are: (1) sUAS-captured data, as a set that is rapidly growing to include data from a wide range of physical and environmental sciences, engineering disciplines, and many civil and commercial use cases, is characterised as sharing many traits with traditional remote sensing data while also exhibiting, across the spectrum of disciplines and use cases, novel characteristics that require novel data support infrastructure; and (2) given this characterization of sUAS data and its potential value in the identified wide variety of use cases, we outline eight challenges that need to be addressed in order for the full value of sUAS-captured data to be realized. We conclude that significant value would be gained and costs saved across both the commercial and academic sectors if the global sUAS user and data management communities were to address these challenges in the immediate to near future, so as to extract the maximal value of sUAS-captured data for the lowest long-term effort and monetary cost.
ARTICLE | doi:10.20944/preprints201907.0280.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: scientific visualization; interactive data analysis; support for earth system science; cross-platform application
Online: 25 July 2019 (06:32:24 CEST)
Visualization is an essential tool for the analysis of data and the communication of findings in the sciences, and the Earth System Sciences (ESS) are no exception. However, within ESS, specialized visualization requirements and data models, particularly for data arising from numerical models, often make general-purpose visualization packages difficult, if not impossible, to use effectively. This paper presents VAPOR: a domain-specific visualization package that targets the specialized needs of ESS modelers, particularly those working in research settings where highly interactive exploratory visualization is beneficial. We specifically describe VAPOR’s ability to handle ESS simulation data from a wide variety of numerical models, as well as a multi-resolution representation that enables interactive visualization of very large data while using only commodity computing resources. We also describe VAPOR’s visualization capabilities, paying particular attention to features for geo-referenced data and advanced rendering algorithms suitable for time-varying, 3D data. Finally, we illustrate VAPOR's utility in the study of a numerically simulated tornado. Our results demonstrate both the ease-of-use and the rich capabilities of VAPOR in such a use case.
ARTICLE | doi:10.20944/preprints202003.0443.v2
Subject: Social Sciences, Library And Information Sciences Keywords: COVID-19; open science; data; bibliometric; pandemic
Online: 22 April 2020 (06:15:34 CEST)
Introduction: The pandemic of COVID-19, an infectious disease caused by SARS-CoV-2, motivated the scientific community to work together in order to gather, organize, process and distribute data on the novel biomedical hazard. Here, we analyzed how the scientific community responded to this challenge by quantifying distribution and availability patterns of the academic information related to COVID-19. The aim of our study was to assess the quality of the information flow and scientific collaboration, two factors we believe to be critical for finding new solutions for the ongoing pandemic. Materials and methods: The RISmed R package and a custom Python script were used to fetch metadata on articles indexed in PubMed and published on the Rxiv preprint server. Scopus was searched manually and the metadata were exported as a BibTeX file. Publication rate and publication status, affiliation and author count per article, and submission-to-publication time were analysed in R. The Biblioshiny application was used to create a world collaboration map. Results: Our preliminary data suggest that the COVID-19 pandemic resulted in the generation of a large amount of scientific data and indicate potential problems regarding information velocity, availability, and scientific collaboration in the early stages of the pandemic. More specifically, our results indicate a precarious overload of the standard publication systems, significant problems with data availability and apparently deficient collaboration. Conclusion: In conclusion, we believe the scientific community could have used the data more efficiently in order to create proper foundations for finding new solutions for the COVID-19 pandemic. Moreover, we believe we can learn from this as we go and adopt open science principles and a more mindful approach to COVID-19-related data to accelerate the discovery of more efficient solutions. We take this opportunity to invite our colleagues to contribute to this global scientific collaboration by publishing their findings with maximal transparency.
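The PubMed metadata retrieval step can be reproduced outside R with NCBI's public E-utilities. The minimal Python sketch below fetches PubMed identifiers and publication dates for a COVID-19 query; it is an illustrative stand-in for the authors' RISmed and Python scripts, and the query string and date range are assumptions.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Step 1: search PubMed for matching article IDs (illustrative query and dates).
search = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pubmed",
    "term": "COVID-19[Title/Abstract]",
    "mindate": "2020/01/01", "maxdate": "2020/04/01", "datetype": "pdat",
    "retmax": 20, "retmode": "json",
}).json()
ids = search["esearchresult"]["idlist"]

# Step 2: fetch summary metadata (title, journal, publication date) for those IDs.
summary = requests.get(f"{EUTILS}/esummary.fcgi", params={
    "db": "pubmed", "id": ",".join(ids), "retmode": "json",
}).json()

for uid in ids:
    doc = summary["result"][uid]
    print(doc.get("pubdate", "?"), "-", doc.get("title", "")[:70])
```

From such records, publication rate, author counts per article and submission-to-publication intervals can be tabulated and analysed, as described in the methods.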
ARTICLE | doi:10.20944/preprints202308.0509.v1
Subject: Medicine And Pharmacology, Urology And Nephrology Keywords: phase duration assessment; partial nephrectomy; video analysis; surgical data science
Online: 7 August 2023 (10:10:57 CEST)
(1) Background: Surgical phases form the basic building blocks for surgical skill assessment, feedback and teaching. Phase duration itself and its correlation with clinical parameters have not yet been investigated. Novel commercial platforms provide phase indications but have not yet been assessed for accuracy. (2) Methods: We assess 100 robot-assisted partial nephrectomy videos for phase duration based on previously defined proficiency metrics. We develop an annotation framework and subsequently compare our annotations to an existing commercial solution (Touch Surgery, Medtronic™). We then explore clinical correlations between phase durations and peri-operative parameters. (3) Results: Objective and uniform phase assessment requires precise definitions derived from an iterative revision process. Comparison to a commercial solution shows large differences in definitions across phases. BMI correlates positively with the duration of renal tumor identification, and tumor complexity correlates positively with both tumor excision and renorrhaphy duration. (4) Conclusions: Surgical phase duration can be correlated with certain clinical outcomes. Further research should investigate whether the retrieved correlations are also clinically meaningful. This requires increasing dataset sizes, facilitated by intelligent computer vision algorithms. Commercial platforms can support this dataset expansion and help unlock its full potential, provided that phase annotation details are disclosed.
REVIEW | doi:10.20944/preprints202309.1571.v1
Subject: Environmental And Earth Sciences, Geography Keywords: social-environmental systems; agent-based complex systems; sustainability science; agent-based models; artificial intelligence; data science
Online: 22 September 2023 (13:39:57 CEST)
A significant number and range of challenges besetting sustainability can be traced to the actions and interactions of multiple autonomous agents (mostly people) and the entities they create (e.g., institutions, policies, social networks) in the corresponding social-environmental systems (SES). To address these challenges, we need to understand the decisions made and actions taken by agents and the outcomes of their actions, including the feedbacks on the corresponding agents and environment. The science of Agent-based Complex Systems (ACS science) has significant potential to handle such challenges. The advantages of ACS science for sustainability are addressed by identifying the key elements and challenges in sustainability science, the generic features of ACS, and the key advances and challenges in modeling ACS. Artificial intelligence and data science promise to improve our understanding of agents' behaviors, detect SES structures, and help formulate SES mechanisms.
ARTICLE | doi:10.20944/preprints202303.0011.v1
Subject: Social Sciences, Media Studies Keywords: science communication; informal learning; public engagement; science in the media; entertainment media; data visualization; scientific visualization
Online: 1 March 2023 (06:23:52 CET)
This essay presents a real-world demonstration of the evidence-based science communication process, showing how it can be used to create scientific data visualizations for public audiences. Visualizing research data can be an important science communication tool. Maximizing its effectiveness has the potential to benefit millions of viewers. As with many forms of science communication, creators of such data visualizations typically rely on their own judgments and the views of the scientists providing the data to inform their science communication decision-making. But that leaves out a critical stakeholder in the communications pipeline: the intended audience. Here, we show the practical steps that our team, the Advanced Visualization Lab at the University of Illinois at Urbana-Champaign, has taken to shift towards more evidence-based practice to enhance our science communication impact. We do this using concrete examples from our work on two scientific documentary films, one on the theme of ‘solar superstorms’ and the other focusing on the black hole at the center of the Milky Way galaxy. We used audience research with each of these films to inform our strategies and designs. We describe how such research evidence informed our understanding of ‘what works and why’ with cinematic-style data visualizations for the public. We close the essay with our key ‘take home’ messages from this evidence-based science communication process.
ARTICLE | doi:10.20944/preprints201609.0088.v1
Subject: Engineering, Civil Engineering Keywords: classification; railway; power line; mobile laser scanning data; conditional random field; layout compatibility
Online: 26 September 2016 (09:33:05 CEST)
Railways are one of the most crucial means of transportation for public mobility and economic development. For efficient railway operation, the electrification system in railway infrastructure, which supplies electric power to trains, is an essential facility for stable train operation. Due to its important role, the electrification system needs to be rigorously and regularly inspected and managed. This paper presents a supervised learning method to classify Mobile Laser Scanning (MLS) data into ten target classes representing overhead wires, movable brackets and poles, which are recognized as key objects in the electrification system. In general, the layout of a railway electrification system shows a strong regularity of spatial relations among object classes. The proposed classifier is developed based on a Conditional Random Field (CRF), which characterizes not only labeling homogeneity at short range, but also the layout compatibility between different object classes at long range in a probabilistic graphical model. This multi-range CRF model consists of a unary term and three pairwise contextual terms. In order to gain computational efficiency, the MLS point clouds are converted into a set of line segments to which the labeling process is applied. A Support Vector Machine (SVM) is used as a local classifier, considering only node features, to produce the unary potentials of the CRF model. As the short-range pairwise contextual term, a Potts model is applied to enforce local smoothness in the short-range graph, while the long-range pairwise potentials are designed to enhance the spatial regularities of both horizontal and vertical layouts among railway objects. We formulate the two long-range pairwise potentials as the log posterior probability obtained by a Naïve Bayes classifier. The directional layout compatibilities are characterized in probability look-up tables which represent the co-occurrence rates of spatial relations in the horizontal and vertical directions, and the likelihood function is formulated by multivariate Gaussian distributions. In the proposed multi-range CRF model, the weight parameters balancing the four sub-terms are estimated by applying Stochastic Gradient Descent (SGD). The results show that the proposed multi-range CRF can effectively classify detailed railway elements, achieving an average recall of 97.66% and an average precision of 97.07% over all classes.
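The structure of such a multi-range CRF can be sketched with a toy energy function: an SVM-style unary term plus a short-range Potts term and a long-range layout-compatibility term, minimised here by brute force over a tiny segment set. All potentials, weights, and the compatibility table are invented for illustration; the actual model, features and SGD-based weight learning are far richer.

```python
from itertools import product
import numpy as np

CLASSES = ["wire", "bracket", "pole"]

# Unary potentials: -log of (assumed) SVM class probabilities per line segment.
unary = -np.log(np.array([
    [0.7, 0.2, 0.1],     # segment 0
    [0.4, 0.4, 0.2],     # segment 1 (ambiguous)
    [0.1, 0.2, 0.7],     # segment 2
]))

short_range_pairs = [(0, 1)]          # spatially adjacent segments (Potts smoothing)
long_range_pairs = [(0, 2)]           # vertically related segments (layout term)

# Invented layout-compatibility costs, standing in for the -log co-occurrence tables.
layout_cost = {("wire", "pole"): 0.1, ("pole", "wire"): 0.1}

def energy(labels, w_potts=0.5, w_layout=1.0):
    e = sum(unary[i, CLASSES.index(lab)] for i, lab in enumerate(labels))
    e += w_potts * sum(labels[i] != labels[j] for i, j in short_range_pairs)
    e += w_layout * sum(layout_cost.get((labels[i], labels[j]), 1.0)
                        for i, j in long_range_pairs)
    return e

best = min(product(CLASSES, repeat=3), key=energy)
print("lowest-energy labeling:", best)
```

The long-range term is what lets the prior on wire/pole layout pull the ambiguous segment towards a labeling consistent with the overall catenary geometry, rather than relying on local features alone.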
ARTICLE | doi:10.20944/preprints201611.0033.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: genetic algorithms; parallel computation; computational complexity; algorithms; optimization techniques; traveling salesman problem; NP-Hard problems; Berlin-52 data set; machine learning; linear regression
Online: 7 November 2016 (04:57:46 CET)
This paper examines the relationship between the number of computer cores and the solutions obtained by parallel genetic algorithms. The objective is to determine a linear polynomial complementary equation that represents the relation between the degree of parallel processing and the optimum solutions found. This relation is modeled as an optimization function f(x) that is able to reproduce many simulation results. In terms of performance, f(x) outperforms the genetic algorithm alone, and a comparison of results between the genetic algorithm and the optimization function is carried out. The optimization function also provides a model for speeding up the genetic algorithm. The optimization function is a complementary transformation that maps a given TSP instance to a linear form without changing the roots of the polynomials.
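A minimal illustration of the experimental setup: independent genetic-algorithm runs on the same TSP instance distributed over worker processes, with the best tour length recorded as the number of workers grows. Everything here (instance size, operators, parameters) is a toy stand-in; it is not the paper's code or the Berlin-52 experiment.

```python
import random
from multiprocessing import Pool

import numpy as np

CITIES = np.random.RandomState(1).rand(20, 2)      # toy TSP instance (20 random cities)

def tour_length(tour):
    pts = CITIES[list(tour) + [tour[0]]]
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def ga_run(seed, pop_size=60, generations=200):
    """One independent GA run; mutation-only (2-opt style) for brevity, no crossover."""
    rng = random.Random(seed)
    pop = [rng.sample(range(len(CITIES)), len(CITIES)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=tour_length)
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = sorted(rng.sample(range(len(child)), 2))
            child[i:j] = reversed(child[i:j])        # reverse a random sub-tour
            children.append(child)
        pop = survivors + children
    return tour_length(min(pop, key=tour_length))

if __name__ == "__main__":
    for n_workers in (1, 2, 4, 8):
        with Pool(n_workers) as pool:
            best = min(pool.map(ga_run, range(n_workers)))
        print(f"{n_workers} parallel runs -> best tour length {best:.3f}")
```

Recording the best-of-runs tour length against the number of workers is the kind of data from which a relation between the degree of parallelism and solution quality could then be fitted.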
ARTICLE | doi:10.20944/preprints201608.0040.v1
Subject: Biology And Life Sciences, Immunology And Microbiology Keywords: seeds; ELISA; Fusarium; morphological data analysis; mycotoxins; phylogenetic analysis
Online: 4 August 2016 (10:12:54 CEST)
Adlay seed samples were collected from three adlay-growing regions (Yeoncheon, Jeonnam and Eumseong regions) in Korea during 2012. Among all the samples collected, 400 seeds were tested for fungal occurrence by standard blotter and test tube agar methods, and different taxonomic groups of fungal genera were detected. The most predominant fungal genera encountered were Fusarium, Phoma, Alternaria, Cladosporium, Curvularia, Cochliobolus and Leptosphaerulina. The occurrence of Fusarium species was 45.6%, and based on the combined sequences of two protein-coding genes (EF-1a and beta-tubulin) and phylogenetic analysis, 10 species were characterized as F. incarnatum (11.67%), F. kyushense (10.33%), F. fujikuroi (8.67%), F. concentricum (6.00%), F. asiaticum (5.67%), F. graminearum (1.67%), F. miscanthi (0.67%), F. polyphialidiom (0.33%), F. armeniacum (0.33%) and F. thapsinum (0.33%). The ability of these isolates to produce the mycotoxins fumonisin (FUM) and zearalenone (ZEN) was tested by the ELISA quantitative analysis method. The results revealed that fumonisin (FUM) was produced only by F. fujikuroi and zearalenone (ZEN) by F. asiaticum and F. graminearum. The mycotoxigenic species were then examined for their morphological characteristics to confirm their identity. Morphological observations of the species correlated well with their molecular identification, confirming them as F. asiaticum, F. fujikuroi and F. graminearum.
REVIEW | doi:10.20944/preprints202008.0320.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Data Science; Machine Learning; Deep Learning; Genomics; COVID-19; Drug Discovery; Image Analysis; Interactomics; Epidemiology
Online: 14 August 2020 (11:01:56 CEST)
The outbreak of the novel coronavirus (SARS-CoV-2) disease (COVID-19) in Wuhan has attracted worldwide attention. SARS-CoV-2 is known to share a similar clinical manifestation, including various symptoms such as pneumonia, fever and breathing difficulty; in particular, it also causes a severe inflammation state that can lead to death. Consequently, massive and rapid research growth has been observed across the globe to elucidate the mechanisms of infection and disease progression at the genotype and phenotype scale. Data science is playing a pivotal role in in-silico analysis to draw hidden and novel insights about the SARS-CoV-2 origin, pathogenesis, COVID-19 outbreak forecasting, medical diagnosis, and drug discovery. The availability of multi-omics, radiological, biomolecular, and medical data urges the development of novel exploratory and predictive models, or the customisation of existing learning models, to fit the current problem domain. The presence of many approaches generates the need for systematic surveys to guide both data scientists and medical practitioners. We perform an elaborate study of the state-of-the-art data science methodologies in action to tackle the current pandemic scenario. We consider various active COVID-19 data analytics domains such as phylogeny analysis, SARS-CoV-2 genome identification, protein structure prediction, host-viral protein interactomics, clinical imaging, epidemiological analysis, and, most importantly, (existing) drug discovery. We highlight the types of data, their generation pipelines, and the data science models in use. We believe that the current study will give a detailed sketch of the road map towards handling a COVID-19-like situation by leveraging data science in the future. We summarise our review focusing on prime challenges and possible future research directions.
ARTICLE | doi:10.20944/preprints202310.1089.v1
Subject: Engineering, Energy And Fuel Technology Keywords: anomaly detection; MDPI; SPE; Computer Science and Mathematics; bibliometric data; algorithms
Online: 17 October 2023 (11:29:47 CEST)
Anomaly detection in equipment processes plays an important role in the oil and gas sector. Algorithms for detecting anomalies in measured data are best understood in computer science and mathematics. Therefore, a possible transfer of knowledge from the latter knowledge area to the former can play a significant role. This paper addresses such a task by analyzing bibliometric data of Computer Science and Mathematics papers published in MDPI journals and publications found on the SPE search platform. It is shown that the main algorithms that are both extensively studied in MDPI publications and found in SPE publications, and that reflect the anomaly detection problem, are Random Forest, Support Vector Machine, the Long Short-Term Memory method and the Recurrent Neural Network. The main advantages and disadvantages of these methods are briefly described. Examples of classical, highly cited publications describing the workings of these algorithms are given, as are examples of papers describing their application in the oil and gas industry. The sections of SPE disciplines with the largest numbers of publications using the above algorithms, which are frequently used for anomaly detection, are presented.
ARTICLE | doi:10.20944/preprints202309.0349.v1
Subject: Social Sciences, Urban Studies And Planning Keywords: assessment model; sustainable city; crowdsourced data; design science; smart cities; general system theory
Online: 6 September 2023 (09:26:08 CEST)
Historically, the lack of comprehensive and real-time data has been the main challenge in city sustainability assessments. However, the increase in the availability of crowdsourced data over the last decade has provided a rich, real-time data source that offers detailed images of urban systems. This has opened up new opportunities for practitioners, including geographers, social researchers, and data scientists, to gather and utilize social media data as a core source for studying cities and assessing their sustainability. This paper proposes an assessment model for the sustainability of cities based on crowdsourced data and the principles of General System Theory. The model aims to utilize crowdsourced data in the analysis and decision-making processes to enhance the sustainability and resilience of smart future cities.
ARTICLE | doi:10.20944/preprints201608.0232.v2
Subject: Medicine And Pharmacology, Pulmonary And Respiratory Medicine Keywords: mHealth; ODK scan; mobile health application; digitizing data collection; data management processes; paper-to-digital system; technology-assisted data management; treatment adherence
Online: 2 September 2016 (03:17:38 CEST)
The present grievous situation of tuberculosis can be improved by efficient case management and timely follow-up evaluations. With the advent of digital technology, this can be achieved by quick summarization of patient-centric data. The aim of our study was to assess the effectiveness of the ODK Scan paper-to-digital system during a three-month testing period. A sequential, explanatory mixed-method research approach was employed to elucidate technology use. Training, smartphones, the application and 3G-enabled SIMs were provided to the four field workers. At the beginning, baseline measures of the data management aspects were recorded and compared with endline measures to see the impact of ODK Scan. Additionally, at the end, users’ feedback was collected regarding app usability, user interface design and workflow changes. In total, 122 patients’ records were retrieved from the server and analysed for quality. It was found that ODK Scan recognized 99.2% of multiple-choice bubble responses and 79.4% of numerical digit responses correctly. However, the overall quality of the digital data was lower than that of manually entered data. Using ODK Scan, a significant time reduction was observed in data aggregation and data transfer activities; however, data verification and form filling activities took more time. Interviews revealed that field workers saw value in using ODK Scan, but they were more concerned about its time-consuming aspects. Therefore, it is concluded that minimal disturbance to the existing workflow, continuous feedback and value additions are the important considerations for the implementing organization to ensure technology adoption and workflow improvements.
ARTICLE | doi:10.20944/preprints201702.0059.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: fine particulate matter (PM2.5); aerosol optical depth; community multi-scale air quality (CMAQ) model; data fusion; exposure assessment
Online: 16 February 2017 (08:58:09 CET)
Estimating ground-surface PM2.5 at fine spatiotemporal resolution is a critical technique for exposure assessment in epidemiological studies of its health risks. Previous studies have used monitoring, satellite remote sensing or air quality modeling data to evaluate the spatiotemporal variations of PM2.5 concentrations, but they have rarely combined these data simultaneously. We develop a three-stage model that fuses PM2.5 monitoring data, satellite-derived aerosol optical depth (AOD) and Community Multi-scale Air Quality (CMAQ) simulations, and apply it to estimate daily PM2.5 at a spatial resolution of 0.1° over China. The performance of the three-stage model is evaluated step by step using cross-validation (CV). CV results show that the final fused PM2.5 estimator agrees well with the observational data (RMSE = 23.00 μg/m3 and R2 = 0.72) and outperforms both AOD-retrieved PM2.5 (R2 = 0.62) and CMAQ simulations (R2 = 0.51). According to the step-specific CVs, AOD-retrieved PM2.5 plays the key role in reducing mean bias during data fusion, whereas CMAQ provides predictions covering all space and time, which avoids the sampling bias caused by non-random gaps in satellite-derived AOD. Our fused products are better able than either CMAQ simulations or AOD-based estimates to characterize the pollution process during haze episodes and can thus support both chronic and acute exposure assessments of ambient PM2.5. Based on these products, the average annual PM2.5 exposure concentration across China in 2014 was 55.75 μg/m3, and the average number of polluted days (PM2.5 > 75 μg/m3) was 81. The fused estimates will be publicly available for future health-related studies.
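As a rough illustration of the fusion idea only, the sketch below regresses synthetic "monitor" PM2.5 values on two synthetic predictors standing in for AOD-derived estimates and CMAQ simulations; the paper's actual three-stage model is considerably more elaborate, and none of these numbers are real.

# Simplified, hypothetical illustration of fusing two PM2.5 predictors
# (AOD-derived and CMAQ-simulated) against monitor observations.
# The paper's three-stage model is more elaborate; this is a one-stage sketch.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(1)
n = 500
true_pm25 = rng.gamma(shape=3.0, scale=20.0, size=n)          # "monitor" values
pm25_aod = true_pm25 + rng.normal(0, 15, n)                   # AOD-derived estimate
pm25_cmaq = 0.7 * true_pm25 + rng.normal(5, 20, n)            # CMAQ simulation

X = np.column_stack([pm25_aod, pm25_cmaq])
fusion = LinearRegression().fit(X, true_pm25)
fused = fusion.predict(X)

print("R2 =", round(r2_score(true_pm25, fused), 2))
print("RMSE =", round(mean_squared_error(true_pm25, fused) ** 0.5, 2), "ug/m3")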
ARTICLE | doi:10.20944/preprints201612.0002.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: change point; estimation; consistency; panel data; short panels; boundary issue; structural change; bootstrap; non-life insurance; change in claim amounts
Online: 1 December 2016 (10:02:03 CET)
The panel data of interest consist of a moderate number of panels, each containing a small number of observations. An estimator of common breaks in panel means that is free of boundary issues is proposed for this scenario. In particular, the novel estimator is able to detect a common break point even when the change happens immediately after the first time point or just before the last observation period. Another advantage of the proposed change point estimator is that it returns the last observation in situations with no structural breaks. The consistency of the change point estimator in panel data is established, and the results are illustrated through a simulation study. As a by-product of the developed estimation technique, its theoretical use for correlation structure estimation, hypothesis testing, and bootstrapping in panel data is demonstrated. A practical application to non-life insurance is presented as well.
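A textbook-style least-squares estimator of a single common break in panel means, shown below, conveys the basic idea; it is only a sketch under simplified assumptions and is not the boundary-corrected estimator developed in the paper.

# Generic illustration of estimating a common break in panel means:
# for each candidate break time t, fit a separate mean before and after t in every
# panel and pick the t minimizing the pooled residual sum of squares.
# This is a textbook-style sketch, not the boundary-corrected estimator of the paper.
import numpy as np

def common_break(panels: np.ndarray) -> int:
    """panels: array of shape (N, T); returns a candidate break index in 1..T-1."""
    N, T = panels.shape
    best_t, best_rss = None, np.inf
    for t in range(1, T):                       # break after observation t-1
        before, after = panels[:, :t], panels[:, t:]
        rss = (((before - before.mean(axis=1, keepdims=True)) ** 2).sum()
               + ((after - after.mean(axis=1, keepdims=True)) ** 2).sum())
        if rss < best_rss:
            best_t, best_rss = t, rss
    return best_t

rng = np.random.default_rng(2)
N, T, tau = 40, 12, 8                            # many short panels, true break at t=8
data = rng.normal(0.0, 1.0, size=(N, T))
data[:, tau:] += 1.5                             # common shift in panel means
print("estimated break:", common_break(data))    # typically 8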
REVIEW | doi:10.20944/preprints202003.0141.v1
Subject: Medicine And Pharmacology, Other Keywords: data sharing; data management; data science; big data; healthcare
Online: 8 March 2020 (16:46:20 CET)
In recent years, more and more health data are being generated. These data come not only from professional health systems, but also from wearable devices. All these data combined form ‘big data’ that can be utilized to optimize treatments for each unique patient (‘precision medicine’). To achieve this precision medicine, it is necessary that hospitals, academia and industry work together to bridge the ‘valley of death’ of translational medicine. However, hospitals and academia often have problems with sharing their data, even though the patient is actually the owner of his/her own health data, and the sharing of data is associated with increased citation rate. Academic hospitals usually invest a lot of time in setting up clinical trials and collecting data, and want to be the first ones to publish papers on this data. The idea that society benefits the most if the patient’s data are shared as soon as possible so that other researchers can work with it, has not taken root yet. There are some publicly available datasets, but these are usually only shared after studies are finished and/or publications have been written based on the data, which means a severe delay of months or even years before others can use the data for analysis. One solution is to incentivize the hospitals to share their data with (other) academic institutes and the industry. Here we discuss several aspects of data sharing in the medical domain: publisher requirements, data ownership, support for data sharing, data sharing initiatives and how the use of federated data might be a solution. We also discuss some potential future developments around data sharing.
ARTICLE | doi:10.20944/preprints202206.0320.v4
Subject: Biology And Life Sciences, Other Keywords: data; reproducibility; FAIR; data reuse; public data; big data; analysis
Online: 2 November 2022 (02:55:49 CET)
With an increasing amount of biological data available publicly, there is a need for a guide on how to successfully download and use this data. The Ten simple rules for using public biological data are: 1) use public data purposefully in your research, 2) evaluate data for your use case, 3) check data reuse requirements and embargoes, 4) be aware of ethics for data reuse, 5) plan for data storage and compute requirements, 6) know what you are downloading, 7) download programmatically and verify integrity, 8) properly cite data, 9) make reprocessed data and models Findable, Accessible, Interoperable, and Reusable (FAIR) and share, and 10) make pipelines and code FAIR and share. These rules are intended as a guide for researchers wanting to make use of available data and to increase data reuse and reproducibility.
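As a minimal sketch of rule 7 (download programmatically and verify integrity), the following Python snippet streams a file and checks its SHA-256 digest; the URL and expected checksum are placeholders, not real resources.

# Sketch of rule 7 (download programmatically and verify integrity).
# The URL and expected checksum below are placeholders, not real resources.
import hashlib
import requests

URL = "https://example.org/dataset/expression_matrix.tsv.gz"   # placeholder
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

def download_and_verify(url: str, path: str, expected_sha256: str) -> None:
    resp = requests.get(url, stream=True, timeout=60)
    resp.raise_for_status()
    digest = hashlib.sha256()
    with open(path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fh.write(chunk)
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ValueError(f"Checksum mismatch for {path}; do not use this file.")

download_and_verify(URL, "expression_matrix.tsv.gz", EXPECTED_SHA256)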
ARTICLE | doi:10.20944/preprints202003.0268.v1
Subject: Social Sciences, Library And Information Sciences Keywords: matching; data marketplace; data platform; data visualization; call for data
Online: 17 March 2020 (04:10:28 CET)
Improvements in web platforms for data exchange and trading are creating more opportunities for users to obtain data from data providers of different domains. However, the current data exchange platforms are limited to unilateral information provision from data providers to users. In contrast, there are insufficient means for data providers to learn what kinds of data users desire and for what purposes. In this paper, we propose and discuss the description items for sharing users’ call for data as data requests in the data marketplace. We also discuss structural differences in data requests and providable data using variables, as well as possibilities of data matching. In the study, we developed an interactive platform, treasuring every encounter of data affairs (TEEDA), to facilitate matching and interactions between data providers and users. The basic features of TEEDA are described in this paper. From experiments, we found the same distributions of the frequency of variables but different distributions of the number of variables in each piece of data, which are important factors to consider in the discussion of data matching in the data marketplace.
REVIEW | doi:10.20944/preprints202309.2113.v1
Subject: Computer Science And Mathematics, Hardware And Architecture Keywords: Data; DWH; Data Warehouse; Architecture; Data Lake; Storage; Analysis; Data Mesh; Analytical; Architectural; Data Vault
Online: 3 October 2023 (03:28:55 CEST)
In the rapidly evolving field of data management, numerous terminologies, such as data warehouse, data lake, data lakehouse, and data mesh, have emerged, each representing a unique analytical data architecture. However, the distinctions and similarities among these paradigms often remain unclear. The present paper aims to navigate the data architecture landscape by conducting a comparative analysis of these paradigms. The analysis identifies and elucidates the differences and similarities in the features, capabilities, and limitations of these architectural constructs. The study outcome serves as a comprehensive guide, assisting practitioners in selecting the most suitable analytical data architecture for their specific applications.
ARTICLE | doi:10.20944/preprints202304.0130.v1
Subject: Computer Science And Mathematics, Other Keywords: data; cooperatives; open data; data stewardship; data governance; digital commons; data sovereignty; open digital federation platform
Online: 7 April 2023 (14:14:02 CEST)
Network effects, economies of scale, and lock-in-effects increasingly lead to a concentration of digital resources and capabilities, hindering the free and equitable development of digital entrepreneurship (SDG9), new skills, and jobs (SDG8), especially in small communities (SDG11) and their small and medium-sized enterprises (“SMEs”). To ensure the affordability and accessibility of technologies, promote digital entrepreneurship and community well-being (SDG3), and protect digital rights, we propose data cooperatives [1,2] as a vehicle for secure, trusted, and sovereign data exchange [3,4]. In post-pandemic times, community/SME-led cooperatives can play a vital role by ensuring that supply chains to support digital commons are uninterrupted, resilient, and decentralized . Digital commons and data sovereignty provide communities with affordable and easy access to information and the ability to collectively negotiate data-related decisions. Moreover, cooperative commons (a) provide access to the infrastructure that underpins the modern economy, (b) preserve property rights, and (c) ensure that privatization and monopolization do not further erode self-determination, especially in a world increasingly mediated by AI. Thus, governance plays a significant role in accelerating communities’/SMEs’ digital transformation and addressing their challenges. Cooperatives thrive on digital governance and standards such as open trusted Application Programming Interfaces (APIs) that increase the efficiency, technological capabilities, and capacities of participants and, most importantly, integrate, enable, and accelerate the digital transformation of SMEs in the overall process. This policy paper presents and discusses several transformative use cases for cooperative data governance. The use cases demonstrate how platform/data-cooperatives, and their novel value creation can be leveraged to take digital commons and value chains to a new level of collaboration while addressing the most pressing community issues. The proposed framework for a digital federated and sovereign reference architecture will create a blueprint for sustainable development both in the Global South and North.
ARTICLE | doi:10.20944/preprints202311.0104.v1
Subject: Public Health And Healthcare, Other Keywords: OMOP; OHDSI; interoperability; data harmonization; clinical data; claims data
Online: 2 November 2023 (07:45:02 CET)
To gain insight into the real-life care of patients in the healthcare system, data from hospital information systems and insurance systems are required; consequently, clinical data must be linked with claims data. To ensure their syntactic and semantic interoperability, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) was chosen. However, there is no detailed guide that allows researchers to follow a consistent process for data harmonization. The aim of this paper is therefore to conceptualize a generic data harmonization process for the OMOP CDM. For this purpose, we conducted a literature review focusing on publications that address the harmonization of clinical or claims data in the OMOP CDM. The process steps used and their chronological order were then extracted for each included publication, and the results were compared to derive a generic sequence of process steps. From the 23 included publications, a generic data harmonization process for the OMOP CDM was conceptualized, consisting of nine process steps: dataset specification, data profiling, vocabulary identification, coverage analysis of vocabularies, semantic mapping, structural mapping, the extract-transform-load (ETL) process, and qualitative and quantitative data quality analysis. This process can be used as a step-by-step guide to assist other researchers in harmonizing source data in the OMOP CDM.
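To make the semantic-mapping step concrete, the toy sketch below translates hypothetical local diagnosis codes into standard concept identifiers through a mapping table; the concept IDs are placeholders, and in real projects the mapping is taken from the OHDSI standardized vocabularies.

# Toy sketch of the "semantic mapping" step: local source codes are translated
# to standard concept identifiers via a mapping table. The concept IDs below are
# placeholders; real mappings come from the OHDSI standardized vocabularies.
import pandas as pd

source_records = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "local_code": ["DIAB2", "HTN", "DIAB2"],       # hypothetical local diagnosis codes
})

concept_map = pd.DataFrame({
    "local_code": ["DIAB2", "HTN"],
    "standard_concept_id": [111111, 222222],       # placeholder OMOP concept IDs
})

condition_occurrence = (
    source_records
    .merge(concept_map, on="local_code", how="left")
    .rename(columns={"standard_concept_id": "condition_concept_id"})
)
unmapped = condition_occurrence["condition_concept_id"].isna().sum()
print(condition_occurrence)
print("unmapped records:", unmapped)   # feeds the coverage-analysis step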
ARTICLE | doi:10.20944/preprints202308.1237.v1
Subject: Engineering, Transportation Science And Technology Keywords: data mining; data extraction; data science; cost infrastructure projects
Online: 17 August 2023 (09:25:22 CEST)
Context: Despite the effort put into developing standards for structuring construction costs, and the strong interest in the field, most construction companies still gather and process these data manually. This causes inconsistencies, differing classification criteria and misclassifications, and makes the process very time-consuming, particularly on big projects. Additionally, the lack of standardization makes cost estimation and comparison tasks very difficult. Objective: To create a method that extracts and organizes construction cost and quantity data into a consistent format and structure, enabling rapid and reliable digital comparison of the content. Method: The approach consists of two steps. First, the system applies data mining to review the input document and determine how it is structured based on the position, format, sequence, and content of descriptive and quantitative data. Second, the extracted data are processed and classified with a combination of data science and expert knowledge to fit a common format. Results: A wide variety of information from real historical projects was successfully extracted and processed into a common format with 97.5% accuracy, using a subset of 5770 assets located in 18 different files, building a solid base for analysis and comparison. Conclusion: A robust and accurate method was developed for extracting hierarchical project cost data into a common machine-readable format to enable rapid and reliable comparison and benchmarking.
ARTICLE | doi:10.20944/preprints202306.1378.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Data Generation; Anomaly Data; User Behavior Generation; Big Data
Online: 19 June 2023 (16:31:37 CEST)
The rising importance of Big Data in modern information analysis rests on vast quantities of user data, but sufficient data can only be collected for all tasks within certain data-gathering contexts. There are many cases where a domain is too novel, too niche, or too sparsely sampled to adequately support Big Data tasks. To remedy this, we have created the ADG Engine, which allows for the generation of additional data that follows the trends and patterns of the data that have already been collected. Using a database structure that tracks users across different activity types, the ADG Engine can use all available information to maximize the authenticity of the generated data. Our efforts are particularly geared towards data analytics: abnormalities in the data are identified, and the user can generate normal and abnormal data at custom ratios. In situations where it would be impractical or impossible to expand the available dataset by collecting more data, it is still possible to move forward with algorithmically expanded datasets.
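The general idea (though not the ADG Engine itself) can be sketched as follows: synthetic user-activity records are drawn from a "normal" and an "abnormal" distribution at a user-chosen ratio; all field names and distributions here are assumptions made for illustration.

# Generic sketch of generating user-activity records with a configurable
# normal/abnormal ratio; this illustrates the idea rather than the ADG Engine itself.
import numpy as np
import pandas as pd

def generate_activity(n_records: int, anomaly_ratio: float, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    is_anomaly = rng.random(n_records) < anomaly_ratio
    # "Normal" users log in during the day and transfer little data;
    # "abnormal" users act at odd hours and move far more data.
    hour = np.where(is_anomaly,
                    rng.integers(0, 6, n_records),
                    rng.integers(8, 20, n_records))
    mb_transferred = np.where(is_anomaly,
                              rng.exponential(500.0, n_records),
                              rng.exponential(20.0, n_records))
    return pd.DataFrame({
        "user_id": rng.integers(1, 200, n_records),
        "login_hour": hour,
        "mb_transferred": mb_transferred.round(1),
        "label": np.where(is_anomaly, "abnormal", "normal"),
    })

sample = generate_activity(n_records=10_000, anomaly_ratio=0.05)
print(sample["label"].value_counts(normalize=True))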
REVIEW | doi:10.20944/preprints202007.0153.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Open-science; big data; fMRI; data sharing; data management
Online: 8 July 2020 (11:53:33 CEST)
Large datasets that enable researchers to perform investigations with unprecedented rigor are growing increasingly common in neuroimaging. Due to the simultaneous increasing popularity of open science, these state-of-the-art datasets are more accessible than ever to researchers around the world. While analysis of these samples has pushed the field forward, they pose a new set of challenges that might cause difficulties for novice users. Here, we offer practical tips for working with large datasets from the end-user’s perspective. We cover all aspects of the data life cycle: from what to consider when downloading and storing the data, to tips on how to become acquainted with a dataset one did not collect, to what to share when communicating results. This manuscript serves as a practical guide one can use when working with large neuroimaging datasets, thus dissolving barriers to scientific discovery.
ARTICLE | doi:10.20944/preprints201810.0273.v1
Subject: Physical Sciences, Astronomy And Astrophysics Keywords: astroparticle physics, cosmic rays, data life cycle management, data curation, meta data, big data, deep learning, open data
Online: 12 October 2018 (14:48:32 CEST)
Modern experimental astroparticle physics features large-scale setups measuring different messengers, namely high-energy particles generated by cosmic accelerators (e.g. supernova remnants, active galactic nuclei, etc.): cosmic and gamma rays, neutrinos and the recently discovered gravitational waves. Ongoing and future experiments are distributed over the Earth, including ground, underground and underwater setups as well as balloon payloads and spacecraft. The data acquired by these experiments differ in format, storage concept and publication policy. Such differences are a crucial issue in the era of big data and of multi-messenger analysis strategies in astroparticle physics. We propose a service, ASTROPARTICLE.ONLINE, within which we are developing an open science system that enables users to publish, store, search, select and analyse astroparticle physics data. The cosmic-ray experiments KASCADE-Grande and TAIGA were chosen as pilot experiments to be included in this framework. In the first step of our initiative we will develop and test the following components of the full data life cycle concept: (i) describing, storing and reusing astroparticle data; (ii) software for performing multi-experiment and multi-messenger analyses, such as deep-learning methods; (iii) outreach, including example applications and tutorials for students and scientists outside the specific research field. In the present paper we describe the concepts of our initiative and, in particular, the plans toward a common, federated astroparticle data storage.
ARTICLE | doi:10.20944/preprints202105.0589.v1
Subject: Engineering, Automotive Engineering Keywords: Game Ratings; Public Data; Game Data; Data analysis; GRAC(Korea)
Online: 25 May 2021 (08:32:32 CEST)
As of 2020, the public data on game ratings provided by the Game Ratings And Administration Committee (GRAC) are more limited than the public data on movie and video ratings provided by the Korea Media Ratings Board, and they do not allow rating information to be seen clearly and in detail. To obtain information on game ratings, users must search for a specific title on the GRAC homepage, which is inconvenient. To remove this inconvenience and extend the scope of the public data provided, this paper studies a public data API that has been extended based on the information available for video ratings. To derive the items to be extended, this study analyzes the rating data on the GRAC homepage and designs a collection system to build a database, and then implements a system that provides the data collected through the extended public data items in the form that users want. This work is expected to provide rating information that strengthens the fairness of GRAC, satisfies game users' and the public's right to know, and contributes to the promotion and development of the game industry.
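A hedged sketch of what such a pipeline might look like is given below: a public ratings API is queried and the results are stored in a local SQLite database; the endpoint, parameters and field names are placeholders rather than the actual GRAC interface.

# Hypothetical sketch of the proposed pipeline: query a public ratings API and
# store the results in a local database. The endpoint, parameters, and fields
# below are placeholders, not the actual GRAC interface.
import sqlite3
import requests

API_URL = "https://api.example.go.kr/game-ratings"     # placeholder endpoint
params = {"serviceKey": "YOUR_KEY", "pageNo": 1, "numOfRows": 100}

rows = requests.get(API_URL, params=params, timeout=30).json().get("items", [])

con = sqlite3.connect("grac_ratings.db")
con.execute("""CREATE TABLE IF NOT EXISTS ratings (
    title TEXT, platform TEXT, rating TEXT, decided_on TEXT)""")
con.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?)",
    [(r.get("title"), r.get("platform"), r.get("rating"), r.get("decidedOn"))
     for r in rows],
)
con.commit()
con.close()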
ARTICLE | doi:10.20944/preprints202007.0078.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: personalization; decision making; medical data; artificial intelligence; Data-driving; Big Data; Data Mining; Machine Learning
Online: 5 July 2020 (15:04:17 CEST)
The study applies machine learning and data mining methods to personalizing treatment, which allows individual patient characteristics to be investigated. Personalization is built on clustering and association rules. We suggest determining the average distance between instances in order to find optimal performance metrics. A formalization of the medical data pre-processing stage for finding personalized solutions based on current standards and pharmaceutical protocols is proposed, and a model of patient data is built. The paper presents a novel clustering approach built on an ensemble of clustering algorithms that achieves better Hopkins-statistic values than the k-means algorithm. Personalized treatment is usually based on decision trees, but such an approach requires a lot of computation time and cannot be parallelized. We therefore propose to classify patients by condition and to determine the deviations of their parameters from the normative and average parameters of the group. This makes it possible to create a personalized approach to treatment for each patient based on long-term monitoring. Based on the results of the analysis, it becomes possible to predict the optimal conditions for a particular patient and to select medication according to personal characteristics.
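Since the clustering configurations are compared via the Hopkins statistic, the following generic Python sketch shows one standard way to compute it; it is not the authors' ensemble method, and the sample data are synthetic.

# Sketch of the Hopkins statistic (a standard clustering-tendency measure) used
# here to assess data before clustering; this is a generic implementation,
# not the paper's ensemble clustering method.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X: np.ndarray, m: int = 50, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(m, n - 1)
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    sample_idx = rng.choice(n, size=m, replace=False)

    nn = NearestNeighbors(n_neighbors=2).fit(X)
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()          # uniform points -> data
    w = nn.kneighbors(X[sample_idx], n_neighbors=2)[0][:, 1]      # data -> nearest other point
    return u.sum() / (u.sum() + w.sum())      # ~0.5 random data, -> 1 strongly clusterable

rng = np.random.default_rng(0)
clustered = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 0.3, (100, 2))])
print("Hopkins statistic:", round(hopkins(clustered), 2))          # close to 1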
ARTICLE | doi:10.20944/preprints202103.0593.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Business Intelligence; Data Mining; Data Warehouse
Online: 24 March 2021 (13:47:31 CET)
In the coming years, the number of digital applications and services built on cloud-native systems will be huge: according to IDC, it will exceed 500 million by 2023, which corresponds to the sum of all applications developed in the previous 40 years. If you are among those who answered yes, this article is for you.
ARTICLE | doi:10.20944/preprints202012.0468.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: climate data; gridded product; data merging
Online: 18 December 2020 (13:29:38 CET)
This manuscript describes the construction and validation of high-resolution daily gridded (0.05° × 0.05°) rainfall and maximum and minimum temperature data for Bangladesh: the Enhancing National Climate Services for Bangladesh Meteorological Department (ENACTS-BMD) dataset. The dataset was generated by merging data from weather stations with satellite products (for rainfall) and reanalysis (for temperature). ENACTS-BMD is the first high-resolution gridded surface meteorological dataset developed specifically for studies of surface climate processes in Bangladesh. Its record begins in January 1981, it is updated monthly in near real time, and outputs are available at daily, decadal and monthly time resolution. The Climate Data Tools (CDT), developed by the International Research Institute for Climate and Society (IRI), Columbia University, are used to generate the dataset. The data processing includes the collection of station and gridded data, quality control of station data, downscaling of the reanalysis temperature, bias correction of both the satellite rainfall and the downscaled reanalysis temperature, and the combination of station and bias-corrected gridded data. The ENACTS-BMD dataset is available as an open-access product on BMD's official website, enhancing the provision of climate services, overcoming challenges of data quality, availability and access, and promoting engagement and use by stakeholders.
CASE REPORT | doi:10.20944/preprints201801.0066.v1
Subject: Engineering, Control And Systems Engineering Keywords: cohesion policy; data visualization; open data
Online: 8 January 2018 (11:11:47 CET)
The implementation of the European Cohesion Policy, which aims at fostering regional competitiveness, economic growth and the creation of new jobs, is documented over the period 2014–2020 in the publicly available Open Data Portal for the European Structural and Investment Funds. On the basis of this source, this paper describes the process of data mining and visualization for producing information on the performance of regional programmes in achieving effective expenditure of resources.
COMMUNICATION | doi:10.20944/preprints202309.0047.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: MPox; big data; data analysis; data science; Twitter; natural language processing
Online: 1 September 2023 (10:23:41 CEST)
In the last decade and a half, the world has experienced the outbreak of a range of viruses such as COVID-19, H1N1, flu, Ebola, Zika Virus, Middle East Respiratory Syndrome (MERS), Measles, and West Nile Virus, just to name a few. During these virus outbreaks, the usage and effectiveness of social media platforms increased significantly as such platforms served as virtual communities, enabling their users to share and exchange information, news, perspectives, opinions, ideas, and comments related to the outbreaks. Analysis of this Big Data of conversations related to virus outbreaks using concepts of Natural Language Processing such as Topic Modeling has attracted the attention of researchers from different disciplines such as Healthcare, Epidemiology, Data Science, Medicine, and Computer Science. The recent outbreak of the MPox virus has resulted in a tremendous increase in the usage of Twitter. Prior works in this field have primarily focused on the sentiment analysis and content analysis of these Tweets, and the few works that have focused on topic modeling have multiple limitations. This paper aims to address this research gap and makes two scientific contributions to this field. First, it presents the results of performing Topic Modeling on 601,432 Tweets about the 2022 Mpox outbreak, which were posted on Twitter between May 7, 2022, and March 3, 2023. The results indicate that the conversations on Twitter related to Mpox during this time range may be broadly categorized into four distinct themes - Views and Perspectives about MPox, Updates on Cases and Investigations about Mpox, MPox and the LGBTQIA+ Community, and MPox and COVID-19. Second, the paper presents the findings from the analysis of these Tweets. The results show that the theme that was most popular on Twitter (in terms of the number of Tweets posted) during this time range was - Views and Perspectives about MPox. It is followed by the theme of MPox and the LGBTQIA+ Community, which is followed by the themes of MPox and COVID-19 and Updates on Cases and Investigations about Mpox, respectively. Finally, a comparison with prior works in this field is also presented to highlight the novelty and significance of this research work.
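For readers unfamiliar with topic modeling, the sketch below runs latent Dirichlet allocation (LDA) on a four-tweet toy corpus with scikit-learn; the paper's own preprocessing, corpus and model settings are not reproduced here.

# Generic topic-modeling sketch (scikit-learn LDA on a toy corpus); the paper's
# own preprocessing and model settings are not reproduced here.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "new mpox cases reported health officials investigating",
    "mpox vaccine rollout for at-risk communities",
    "is mpox the next covid pandemic asks twitter",
    "covid and mpox symptoms compared by doctors",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}:", ", ".join(top))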
ARTICLE | doi:10.20944/preprints202205.0344.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Linked (open) Data; Semantic Interoperability; Data Mapping; Governmental Data; SPARQL; Ontologies
Online: 25 May 2022 (08:18:46 CEST)
In this paper, we present a method to map information on service activity provision residing in governmental portals across the European Commission. To perform this, we used as a basis the enriched Greek e-GIF ontology, modeling the concepts and relations of one of the two data portals (i.e., Points of Single Contact) examined, since the relevant information for the second was not provided. Mapping consisted of transforming the information appearing in governmental portals into RDF format (i.e., as Linked Data) so that it can be easily exchanged. Mapping proved a tedious task, since a description of how information is modeled in the second Point of Single Contact is not provided and had to be extracted manually.
ARTICLE | doi:10.20944/preprints202111.0073.v1
Subject: Medicine And Pharmacology, Other Keywords: data quality; OMOP CDM; EHDEN; healthcare data; real world data; RWD
Online: 3 November 2021 (09:12:54 CET)
Background: Observational health data has the potential to be a rich resource to inform clinical practice and regulatory decision making. However, the lack of standard data quality processes makes it difficult to know if these data are research ready. The EHDEN COVID-19 Rapid Collaboration Call presented the opportunity to assess how the newly developed open-source tool Data Quality Dashboard (DQD) informs the quality of data in a federated network. Methods: 15 Data Partners (DPs) from 10 different countries worked with the EHDEN taskforce to map their data to the OMOP CDM. Throughout the process at least two DQD results were collected and compared for each DP. Results: All DPs showed an improvement in their data quality between the first and last run of the DQD. The DQD excelled at helping DPs identify and fix conformance issues but showed less of an impact on completeness and plausibility checks. Conclusions: This is the first study to apply the DQD on multiple, disparate databases across a network. While study-specific checks should still be run, we recommend that all data holders converting their data to the OMOP CDM use the DQD as it ensures conformance to the model specifications and that a database meets a baseline level of completeness and plausibility for use in research.
ARTICLE | doi:10.20944/preprints202110.0103.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Data Analytics; Analytics; Supply Chain Input; Supply Chain; Data Science; Data
Online: 6 October 2021 (10:38:42 CEST)
One of the most remarkable features of the 20th century was the digitalization of technical progress, which changed the output of companies worldwide and became a defining feature of the era. A digital supply chain can be distinguished from other supply chains by the growth of information technology systems and the implementation of new technical advances, which enhance the integrity, agility and long-term organizational performance of the supply chain. For example, Internet of Things (IoT)-enabled information exchange and Big Data analysis might be used to regulate mismatches between supply and demand. This literature review was undertaken to assess contemporary ideas and concepts in the field of data analysis in the context of supply chain management. The research was conducted as a systematic literature review (SLR) drawing on a total of 71 papers from leading journals. The SLR found that integrating data analytics into supply chain management can have long-term benefits on the input side, i.e., improved strategic development, management and other areas.
ARTICLE | doi:10.20944/preprints202310.1998.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: Marburg virus; big data; data mining; data analysis; google trends; web behavior; data science; conspiracy theory
Online: 31 October 2023 (07:02:07 CET)
During virus outbreaks in the recent past web behavior mining, modeling, and analysis have served as means to examine, explore, interpret, assess, and forecast the worldwide perception, readiness, reactions, and response linked to these virus outbreaks. The recent outbreak of the Marburg Virus disease (MVD), the high fatality rate of MVD, and the conspiracy theory linking the FEMA alert signal in the United States on October 4, 2023, with MVD and a zombie outbreak, resulted in a diverse range of reactions in the general public which has transpired in a surge in web behavior in this context. This resulted in “Marburg Virus” featuring in the list of the top trending topics on Twitter on October 3, 2023, and “Emergency Alert System” and “Zombie” featuring in the list of top trending topics on Twitter on October 4, 2023. No prior work in this field has mined and analyzed the emerging trends in web behavior in this context. The work presented in this paper aims to address this research gap and makes multiple scientific contributions to this field. First, it presents the results of performing time series forecasting of the search interests related to MVD emerging from 216 different regions on a global scale using ARIMA, LSTM, and Autocorrelation. The results of this analysis present the optimal model for forecasting web behavior related to MVD in each of these regions. Second, the correlation between search interests related to MVD and search interests related to zombies (in the context of this conspiracy theory) was investigated. The findings show that there were several regions where there was a statistically significant correlation between MVD-related searches and zombie-related searches (in the context of this conspiracy theory) on Google on October 4, 2023. Finally, the correlation between zombie-related searches (in the context of this conspiracy theory) in the United States and other regions was investigated. This analysis helped to identify those regions where this correlation was statistically significant.
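As an illustration of the forecasting component only, the snippet below fits an ARIMA model to a synthetic weekly search-interest series with statsmodels; the model order and data are arbitrary and do not reflect the per-region models selected in the paper.

# Illustrative ARIMA forecast of a search-interest series (synthetic data,
# arbitrary order); the paper compares ARIMA, LSTM, and autocorrelation-based
# models per region, which is not reproduced here.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
idx = pd.date_range("2023-01-01", periods=40, freq="W")
interest = pd.Series(
    np.clip(20 + np.cumsum(rng.normal(0, 3, len(idx))), 0, 100), index=idx
)

model = ARIMA(interest, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=4)            # next four weeks of search interest
print(forecast.round(1))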
ARTICLE | doi:10.20944/preprints202308.0442.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Thermometers; Temperature records; Early instrumental meteorological series; Data rescue; Data recovery; Data correction; Climate data analysis
Online: 7 August 2023 (03:01:24 CEST)
A distinction is made between data rescue (i.e., copying, digitizing and archiving) and data recovery, which implies deciphering, interpreting and transforming early instrumental readings and their metadata to obtain high-quality datasets in modern units. This requires a multidisciplinary approach that includes: palaeography and knowledge of Latin and other languages to read the handwritten logs and additional documents; history of science to interpret the original text, data and metadata within the cultural frame of the 17th, 18th and early 19th centuries; physics and technology to recognize biases of early instruments or calibrations, or to correct for observational bias; and astronomy to calculate and transform the original time, expressed in canonical hours counted from twilight. The liquid-in-glass thermometer was invented in 1641 and the earliest temperature records started in 1654. Since then, different types of thermometers were invented, based on the thermal expansion of air or of selected thermometric liquids with deviations from linearity. Reference points, thermometric scales and calibration methodologies were not comparable, and not always adequately described. Thermometers had various locations and exposures, e.g., indoor, outdoor, on windows, in gardens or on roofs, facing different directions. Readings were made only one or a few times a day, not necessarily respecting a precise time schedule: this bias is analysed for the most popular combinations of reading times. The time was based on sundials and the local Sun, but the hours were counted starting from twilight. In 1789-90 Italy changed system and all cities counted hours from their lower culmination (i.e., local midnight), so that every city had its own local time; in 1866, all Italian cities followed the local time of Rome; in 1893, the whole of Italy adopted the present-day system, based on Coordinated Universal Time and time zones. In 1873, when the International Meteorological Organization (IMO) was founded, later transformed into the World Meteorological Organization (WMO), a standardization of instruments and observational protocols was established, and all data became fully comparable. In the early instrumental period, from 1654 to 1873, the comparison, correction and homogenization of records is quite difficult, mainly because of the scarcity or even absence of metadata. This paper deals with this confused situation, discussing the main problems as well as the methodologies to recognize missing metadata; to distinguish indoor from outdoor readings; to correct and transform early datasets in unknown or arbitrary units into modern units; and, finally, to establish in which cases it is possible to reach the quality level required by the WMO. The focus is on explaining the methodology needed to recover early instrumental records, i.e., the operations that should be performed to interpret, correct, and transform the original raw data into a high-quality temperature dataset usable for climate studies.
DATA DESCRIPTOR | doi:10.20944/preprints202308.1701.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: disease X; big data; data science; data analysis; dataset development; database; google trends; data mining; healthcare; epidemiology
Online: 24 August 2023 (05:48:54 CEST)
The World Health Organization (WHO) added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 to August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. These regions were chosen for data mining as these regions recorded significant search interests related to Disease X during this timeframe. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, a brief analysis of specific features of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis.
COMMUNICATION | doi:10.20944/preprints202303.0453.v1
Subject: Social Sciences, Media Studies Keywords: COVID-19; MPox; Twitter; Big Data; Data Mining; Data Analysis; Sentiment Analysis; Data Science; Social Media; Monkeypox
Online: 27 March 2023 (08:39:28 CEST)
Mining and analysis of the Big Data of Twitter conversations have been of significant interest to the scientific community in the fields of healthcare, epidemiology, big data, data science, computer science, and their related areas, as can be seen from several works in the last few years that focused on sentiment analysis and other forms of text analysis of Tweets related to Ebola, E-Coli, Dengue, Human papillomavirus (HPV), Middle East Respiratory Syndrome (MERS), Measles, Zika virus, H1N1, influenza-like illness, swine flu, flu, Cholera, Listeriosis, cancer, Liver Disease, Inflammatory Bowel Disease, kidney disease, lupus, Parkinson's, Diphtheria, and West Nile virus. The recent outbreaks of COVID-19 and MPox have served as "catalysts" for Twitter usage related to seeking and sharing information, views, opinions, and sentiments involving both these viruses. While there have been a few works published in the last few months that focused on performing sentiment analysis of Tweets related to either COVID-19 or MPox, none of the prior works in this field thus far involved analysis of Tweets focusing on both COVID-19 and MPox at the same time. With an aim to address this research gap, a total of 61,862 Tweets that focused on Mpox and COVID-19 simultaneously, posted between May 7, 2022, to March 3, 2023, were studied to perform sentiment analysis and text analysis. The findings of this study are manifold. First, the results of sentiment analysis show that almost half the Tweets (the actual percentage is 46.88%) had a negative sentiment. It was followed by Tweets that had a positive sentiment (31.97%) and Tweets that had a neutral sentiment (21.14%). Second, this paper presents the top 50 hashtags that were used in these Tweets. Third, it presents the top 100 most frequently used words that are featured in these Tweets. The findings of text analysis show that some of the commonly used words involved directly referring to either or both viruses. In addition to this, the presence of words such as "Polio", "Biden", "Ukraine", "HIV", "climate", and "Ebola" in the list of the top 100 most frequent words indicate that topics of conversations on Twitter in the context of COVID-19 and MPox also included a high level of interest related to other viruses, President Biden, and Ukraine. Finally, a comprehensive comparative study that involves a comparison of this work with 49 prior works in this field is presented to uphold the scientific contributions and relevance of the same.
Subject: Engineering, Automotive Engineering Keywords: Business Intelligence; Data warehouse; Data Marts; Architecture; Data; Information; cloud; Data Mining; evolution; technologic companies; tools; software
Online: 24 March 2021 (13:06:53 CET)
Information has been and will be a vital element for a person or department groups in an organization. That is why there are technologies that help us to give them the proper management of data; Business Intelligence is responsible for bringing technological solutions that correctly and effectively manage the entire volume of necessary and important information for companies. Among the solutions offered by Business Intelligence are Data Warehouses, Data Mining, among other business technologies that working together achieve the objectives proposed by an organization. It is important to highlight that these business technologies have been present since the 50's and have been evolving through time, improving processes, infrastructure, methodologies and implementing new technologies, which have helped to correct past mistakes based on information management for companies. There are questions about Business Intelligence. Could it be that in the not-too-distant future it will be used as an essential standard or norm in any organization for data management, since it provides many benefits and avoids failures at the time of classifying information. On the other hand, Cloud storage has been the best alternative to safeguard information and not depend on physical storage media, which are not 100% secure and are exposed to partial or total loss of information, by presenting hardware failures or security failures due to mishandling that can be given to such information.
ARTICLE | doi:10.20944/preprints202311.1570.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: cancer research; cancer; natural language processing; data mining; data warehouse; big data
Online: 26 November 2023 (05:13:14 CET)
Background: Real-world data (RWD) related to the health status and care of cancer patients reflect ongoing medical practice, and their analysis yields essential real-world evidence. Advanced information technologies are vital for their collection, qualification, and reuse in research projects. Methods: UNICANCER, the French federation of comprehensive cancer centres, has developed a unique research network and tool: Consore. This potent federated tool enables the analysis of data from millions of cancer patients across eleven French hospitals. Results: Currently operational within eleven French cancer centres, Consore employs natural language processing to structure the therapeutic management data of approximately 1.3 million cancer patients, originating from their electronic medical records and encompassing about 65 million medical documents. Thanks to the structured data, which are harmonized within a common data model, and to its federated search tool, Consore can create patient cohorts based on patient or tumor characteristics and treatment modalities. This ability to derive larger cohorts is particularly attractive when studying rare cancers. Conclusions: Consore serves as a tremendous data mining instrument that propels French cancer centres into the big data era. With its federated technical architecture and unique shared data model, Consore facilitates compliance with regulations and the acceleration of cancer research projects.
ARTICLE | doi:10.20944/preprints202111.0410.v1
Subject: Engineering, Control And Systems Engineering Keywords: Data compression; data hiding; psnr; mse; virtual data; public cloud; quantization error
Online: 22 November 2021 (15:17:12 CET)
Nowadays, information security is a challenge, especially when data are transmitted or shared in public clouds. Many researchers have proposed techniques that fail to provide data integrity, security and authentication, and that raise other issues related to sensitive data. The most common techniques used to protect data during transmission on a public cloud are cryptography, steganography, and compression. The proposed scheme suggests an entirely new approach to data security on a public cloud: the secret data are made completely invisible behind a carrier object and cannot be detected through image performance parameters such as PSNR, MSE and entropy. Detailed results are explained in the results section of the paper. The proposed technique achieves better outcomes than existing techniques as a security mechanism on a public cloud. The primary focus of the suggested approach is to minimize the integrity loss of public storage data due to unrestricted access rights by users. Preserving the reusability of the carrier even after data are concealed is a challenging task, and it is achieved through the suggested approach.
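The imperceptibility metrics mentioned above are standard; a minimal computation of MSE and PSNR for an 8-bit carrier image is sketched below on synthetic arrays (the hiding scheme itself is not reproduced).

# Standard MSE / PSNR computation for an 8-bit carrier image before and after
# data hiding (synthetic arrays here; identical logic applies to real images).
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    e = mse(a, b)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
stego = cover.copy()
stego[::8, ::8] ^= 1                  # flip a few least-significant bits

print("MSE :", round(mse(cover, stego), 4))
print("PSNR:", round(psnr(cover, stego), 2), "dB")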
ARTICLE | doi:10.20944/preprints201808.0350.v2
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: big data; clustering; data mining; educational data mining; e-learning; profile learning
Online: 19 October 2018 (05:58:05 CEST)
Educational data mining is an evolving discipline that focuses on the improvement of self-learning and adaptive methods and is used for finding hidden patterns or intrinsic structures in educational data. In education, heterogeneous data are involved and are continuously growing within the big-data paradigm. To extract meaningful information adaptively from big educational data, specific data mining techniques are needed. This paper presents a clustering approach to partition students into different groups or clusters based on their learning behavior. Furthermore, a personalized e-learning system architecture is presented, which selects and delivers teaching content according to students' learning capabilities. The primary objective is the discovery of optimal settings in which learners can improve their learning capabilities, while the administration can find essential hidden patterns to bring effective reforms to the existing system. The clustering methods K-Means, K-Medoids, Density-Based Spatial Clustering of Applications with Noise, Agglomerative Hierarchical Cluster Tree and Clustering by Fast Search and Finding of Density Peaks via Heat Diffusion (CFSFDP-HD) are analyzed using educational data mining. It is observed that more robust results can be achieved by replacing the existing methods with CFSFDP-HD. Data mining techniques are equally effective in analyzing big data to make education systems vigorous.
REVIEW | doi:10.20944/preprints201807.0059.v1
Subject: Biology And Life Sciences, Biophysics Keywords: data normalization; data scaling; zero-sum; metabolic fingerprinting; NMR; statistical data analysis
Online: 3 July 2018 (16:22:31 CEST)
The aim of this article is to summarize recent bioinformatic and statistical developments applicable to NMR-based metabolomics. Extracting relevant information from large multivariate datasets by statistical data analysis strategies can be of considerable complexity. Typical tasks comprise, for example, classification of specimens, identification of differentially produced metabolites, and estimation of fold changes. In this context it is of prime importance to minimize contributions from unwanted biases and experimental variance prior to these analyses; this is the goal of data normalization. Special emphasis is therefore given to different data normalization strategies. In the first part, we discuss the requirements and the pros and cons of a variety of commonly applied strategies. In the second part, we concentrate on possible solutions for cases in which the requirements of the standard strategies are not fulfilled. In the last part, very recent developments are discussed that allow reliable estimation of metabolic signatures for sample classification without prior data normalization. Throughout, special emphasis is given to techniques that have worked well in our hands.
Subject: Business, Economics And Management, Econometrics And Statistics Keywords: poverty; composite indicators; interval data; symbolic data
Online: 24 August 2021 (15:46:09 CEST)
The analysis and measurement of poverty is a crucial issue in the social sciences. Poverty is a multidimensional notion that can be measured using composite indicators that synthesize several statistical indicators; subjective choices could, however, affect these indicators. We propose interval-based composite indicators to avoid this problem, enabling us to obtain robust and reliable measures. Starting from a relevant conceptual model of poverty, we identify the various factors to consider. Then, for each random configuration of these factors, a different composite indicator is computed. Based on the distinct factor choices and the different assumptions used to construct the composite indicator, a different interval is obtained for each region, and an interval-based composite indicator is created from the results of a Monte-Carlo simulation over all the different assumptions. The resulting intervals can be compared, various poverty rankings can be obtained, and the interval composite indicator of poverty can be analysed and compared through its parameters, such as the center, minimum, maximum, and range. The results demonstrate a relevant and consistent measurement of the indicator and the relevant impact of the shadow sector on the final measures.
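A minimal sketch of the Monte-Carlo construction is given below: random weightings of normalized poverty factors yield a distribution of composite scores per region, summarized as intervals; the regions, factor names and data are synthetic placeholders.

# Sketch of the Monte-Carlo construction of an interval-based composite indicator:
# random weightings of (normalized) poverty factors yield a distribution of
# composite scores per region, summarized as [min, max] intervals. Data are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
regions = ["Region A", "Region B", "Region C"]
factors = pd.DataFrame(rng.random((3, 4)), index=regions,
                       columns=["income", "education", "health", "housing"])

n_sim = 5_000
scores = np.empty((n_sim, len(regions)))
for s in range(n_sim):
    w = rng.dirichlet(np.ones(factors.shape[1]))   # random weights summing to 1
    scores[s] = factors.to_numpy() @ w

intervals = pd.DataFrame({
    "min": scores.min(axis=0),
    "center": scores.mean(axis=0),
    "max": scores.max(axis=0),
}, index=regions)
print(intervals.round(3))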
Subject: Computer Science And Mathematics, Computer Science Keywords: big data; data integration; EVMS; construction management
Online: 30 October 2020 (15:35:00 CET)
In today's information age, data are becoming more and more important. While other industries achieve tangible improvements by applying cutting-edge information technology, the construction industry still lags far behind. Cost, schedule, and performance control are three major functions in the project execution phase. Alongside their individual importance, cost-schedule integration has been a significant challenge in the construction industry over the past five decades. Although a lot of effort has been put into this development, no resulting method is used in construction practice. The purpose of this study is to propose a new method to integrate cost and schedule data using big data technology. The proposed algorithm is designed to provide data integrity and flexibility in the integration process, a considerable reduction in the time needed to build and change the database, and practical use on a construction site. It is expected that the proposed method can transform, in a data-friendly way, the current situation in which field engineers regard information management as one of their most troublesome tasks.
ARTICLE | doi:10.20944/preprints201701.0090.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: transportation data; data interlinking; automatic schema matching
Online: 20 January 2017 (03:38:06 CET)
Multimodality requires the integration of heterogeneous transportation data to construct a broad view of the transportation network. Many new transportation services are emerging while being isolated from previously existing networks, which leads them to publish their data sources on the web, according to Linked Data principles, in order to gain visibility. Our interest is to use these data to construct an extended transportation network that links these new services to existing ones. The main problems we tackle in this article fall into the categories of automatic schema matching and data interlinking. We propose an approach that uses web services as mediators to help automatically detect geospatial properties and map them between two different schemas. In addition, we propose a new interlinking approach that enables users to define rich semantic links between datasets in a flexible and customizable way.
ARTICLE | doi:10.20944/preprints202308.1391.v1
Subject: Engineering, Transportation Science And Technology Keywords: data extraction; data mining; railway infrastructure costs; infrastructure costs data analysis; cost analysis
Online: 18 August 2023 (16:03:08 CEST)
The capability to extract information and analyze it in a common format is essential for making predictions, comparing projects through cost benchmarking, and gaining a deeper understanding of project costs. However, the lack of standardization and the manual entry of the data make this process very time-consuming, unreliable, and inefficient. To tackle this problem, a novel, high-impact approach is presented that combines the benefits of data mining, statistics, and machine learning to extract and analyze railway infrastructure cost data. To validate the suggested approach, data from 23 real historical projects from the client Network Rail were extracted, allowing their costs to be compared. Finally, machine learning and data analytics methods were applied to identify the most relevant factors, allowing for cost benchmarking. The presented method demonstrates the benefits of data extraction, being able to gather, analyze and benchmark each project efficiently and to provide a deep understanding of the relationships and the relevant factors that matter in infrastructure costs.
Subject: Computer Science And Mathematics, Information Systems Keywords: Academic Analytics; data storage; education and big data; analysis of data; learning analytics
Online: 19 July 2020 (20:37:39 CEST)
Business Intelligence, defined by  as "the ability to understand the interrelations of the facts that are presented in such a way that it can guide the action towards achieving a desired goal", has been used since 1958 for the transformation of data into information, and of information into knowledge, to be used when making decisions in a business environment. But, what would happen if we took the same principles of business intelligence and applied them to the academic environment? The answer would be the creation of Academic Analytics, a term defined by  as the process of evaluating and analyzing organizational information from university systems for reporting and making decisions, whose characteristics allow it to be used more and more in institutions, since the information they accumulate about their students and teachers gathers data such as academic performance, student success, persistence, and retention . Academic Analytics enables an analysis of data that is very important for making decisions in the educational institutional environment, aggregating valuable information in the academic research activity and providing easy to use business intelligence tools. This article shows a proposal for creating an information system based on Academic Analytics, using ASP.Net technology and trusting storage in the database engine Microsoft SQL Server, designing a model that is supported by Academic Analytics for the collection and analysis of data from the information systems of educational institutions. The idea that was conceived proposes a system that is capable of displaying statistics on the historical data of students and teachers taken over academic periods, without having direct access to institutional databases, with the purpose of gathering the information that the director, the teacher, and finally the student need for making decisions. The model was validated with information taken from students and teachers during the last five years, and the export format of the data was pdf, csv, and xls files. The findings allow us to state that it is extremely important to analyze the data that is in the information systems of the educational institutions for making decisions. After the validation of the model, it was established that it is a must for students to know the reports of their academic performance in order to carry out a process of self-evaluation, as well as for teachers to be able to see the results of the data obtained in order to carry out processes of self-evaluation, and adaptation of content and dynamics in the classrooms, and finally for the head of the program to make decisions.
ARTICLE | doi:10.20944/preprints201812.0071.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: data governance; data sovereignty; urban data spaces; ICT reference architecture; open urban platform
Online: 6 December 2018 (05:09:54 CET)
This paper presents the results of a recent study that was conducted with a number of German municipalities/cities. Based on the obtained and briefly presented recommendations emerging from the study, the authors propose the concept of an Urban Data Space (UDS), which facilitates an ecosystem for data exchange and added-value creation, thereby utilizing the various types of data within a smart city/municipality. Looking at an Urban Data Space from within a German context and considering the current situation and developments in German municipalities, this paper proposes a reasonable classification of urban data that makes it possible to relate the various data types to legal aspects and to make solid considerations regarding technical implementation designs and decisions. Furthermore, the Urban Data Space is described and analyzed in detail, relevant stakeholders are identified, and corresponding technical artifacts are introduced. The authors propose to set up Urban Data Spaces based on emerging standards from the area of ICT reference architectures for Smart Cities, such as DIN SPEC 91357 "Open Urban Platform" and EIP SCC. The paper thus walks the reader through the construction of a UDS based on the above-mentioned architectures and outlines the goals, recommendations and potentials that an Urban Data Space can reveal to a municipality/city.
ARTICLE | doi:10.20944/preprints202110.0260.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: big data; data acquisition; data visualization; data exchange; dashboard; frequency stability; Grafana lab; Power Quality; GPS reference; frequency measurement.
Online: 18 October 2021 (18:07:43 CEST)
This article proposes a measurement solution designed to monitor instantaneous frequency in power systems. It uses a data acquisition module and a GPS receiver for time stamping. A Python program receives the data, calculates the frequency, and transfers the measurement results to a database. The frequency is calculated with two different methods, which are compared in the article. The stored data are visualized using the Grafana platform, thus demonstrating its potential for comparing scientific data. The system as a whole constitutes an efficient, low-cost data acquisition solution.
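The following is a minimal sketch, under assumed parameters, of the kind of Python pipeline the abstract describes: estimate the frequency of an acquired block of samples and store the result in a database. The zero-crossing estimator and the SQLite store are illustrative stand-ins rather than the authors' exact methods or database.

```python
# Minimal sketch of the described pipeline: receive a block of voltage samples,
# estimate the instantaneous frequency, and store the result with a timestamp.
# The zero-crossing method and SQLite are stand-ins, not the authors' implementation.
import sqlite3, time
import numpy as np

def frequency_from_zero_crossings(samples, fs):
    """Estimate frequency from rising zero crossings of one acquisition block."""
    negative = np.signbit(samples)
    rising = np.where(negative[:-1] & ~negative[1:])[0]   # indices just before a rising crossing
    if len(rising) < 2:
        return float("nan")
    period = (rising[-1] - rising[0]) / (len(rising) - 1) / fs
    return 1.0 / period

fs = 10_000                                               # assumed sampling rate in Hz
t = np.arange(fs) / fs
samples = np.sin(2 * np.pi * 50.02 * t)                   # simulated 1 s block near 50 Hz

db = sqlite3.connect("frequency.db")
db.execute("CREATE TABLE IF NOT EXISTS freq (ts REAL, hz REAL)")
db.execute("INSERT INTO freq VALUES (?, ?)", (time.time(), frequency_from_zero_crossings(samples, fs)))
db.commit()
```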
DATA DESCRIPTOR | doi:10.20944/preprints202109.0370.v1
Subject: Engineering, Energy And Fuel Technology Keywords: smart meter data; household survey; EPC; energy data; energy demand; energy consumption; longitudinal; energy modelling; electricity data; gas data
Online: 22 September 2021 (10:16:05 CEST)
The Smart Energy Research Lab (SERL) Observatory dataset described here comprises half-hourly and daily electricity and gas data, SERL survey data, Energy Performance Certificate (EPC) input data and 24 local hourly climate reanalysis variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) for over 13,000 households in Great Britain (GB). Participants were recruited in September 2019, September 2020 and January 2021, and their smart meter data are collected from up to one year prior to sign-up. Data collection will continue until at least August 2022, and longer if funding allows. Survey data relating to the dwelling, appliances, household demographics and attitudes were collected at sign-up. Data are linked at the household level, and UK-based academic researchers can apply for access within a secure virtual environment for research projects in the public interest. This data descriptor paper describes how the data were collected, the variables available and the representativeness of the sample compared to national estimates. It is intended as a guide for researchers working with or considering using the SERL Observatory dataset, or simply looking to learn more about it.
ARTICLE | doi:10.20944/preprints201807.0038.v1
Online: 3 July 2018 (11:25:13 CEST)
The rich emission and absorption line spectra of Fe I may be used to extract crucial information on astrophysical plasmas, such as stellar metallicities. There is currently a lack, in quality and quantity, of accurate level-resolved effective electron-impact collision strengths and oscillator strengths for radiative transitions. Here, we discuss the challenges in obtaining a sufficiently good structure for neutral iron and compare our theoretical fine-structure energy levels with observation for several increasingly large models. Radiative data is presented for several transitions for which the atomic data is accurately known.
ARTICLE | doi:10.20944/preprints202309.1016.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Imbalanced data; Data preprocessing; Sampling; Tomek Links; DTW
Online: 14 September 2023 (14:00:42 CEST)
Purpose: To alleviate the data imbalance problem caused by subjective and objective factors, scholars have developed various data preprocessing algorithms, among which undersampling algorithms are widely used because of their fast and efficient performance. However, when the number of samples in some categories of a multi-classification dataset is too small to be processed by sampling, or the number of minority-class samples is only 1 to 2, traditional undersampling algorithms are weakened. Methods: This study selects 9 multi-classification time series datasets with extremely few samples, fully considers the characteristics of time series data, and uses a three-stage algorithm to alleviate the data imbalance problem. Stage one: random oversampling with disturbance terms increases the number of sample points. Stage two: SMOTE (Synthetic Minority Oversampling Technique) oversampling is applied on this basis. Stage three: dynamic time warping distance is used to compute the distance between sample points, identify the Tomek Links sample points at the boundary, and clean up the boundary noise. Results: This study proposes a new sampling algorithm. On the 9 multi-classification time series datasets with extremely few samples, the new sampling algorithm is compared with four classic undersampling algorithms, ENN (Edited Nearest Neighbours), NCR (Neighborhood Cleaning Rule), OSS (One Side Selection) and RENN (Repeated Edited Nearest Neighbours), using macro accuracy, recall and F1-score as evaluation indicators. The results show that on FiftyWords, the dataset with the most categories and the fewest minority-class samples, the accuracy of the new sampling algorithm is 0.7156, far beyond ENN, RENN, OSS and NCR; its recall, at 0.7261, is also better than that of the four undersampling algorithms used for comparison; and its F1-score is increased by 200.71%, 188.74%, 155.29% and 85.61% relative to ENN, RENN, OSS and NCR, respectively. On the other 8 datasets, the new sampling algorithm also shows good indicator scores. Conclusion: The new algorithm proposed in this study can effectively alleviate the data imbalance problem of multi-classification time series datasets with many categories and few minority-class samples, while also cleaning up boundary noise between classes.
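A hedged, simplified sketch of the three-stage idea follows, on a toy minority class; the jitter scale, the SMOTE call from imbalanced-learn and the plain DTW used to find Tomek links are illustrative choices, not the authors' implementation.

```python
# Simplified sketch of the three-stage approach on toy time-series data.
# Jitter scale, SMOTE parameters and the basic DTW are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)

def stage1_jitter_oversample(X, y, minority, target, scale=0.01):
    """Stage 1: random oversampling of the minority class with small disturbances."""
    idx = np.where(y == minority)[0]
    picks = rng.choice(idx, size=target - len(idx), replace=True)
    X_new = X[picks] + rng.normal(0.0, scale, size=X[picks].shape)
    return np.vstack([X, X_new]), np.concatenate([y, np.full(len(picks), minority)])

def dtw(a, b):
    """Stage 3 helper: plain dynamic-time-warping distance between two series."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf); D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i, j] = abs(a[i-1] - b[j-1]) + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    return D[-1, -1]

# Toy 2-class data: 30 majority series and 2 minority series of length 20.
X = np.vstack([rng.normal(0, 1, (30, 20)), rng.normal(2, 1, (2, 20))])
y = np.array([0] * 30 + [1] * 2)

X, y = stage1_jitter_oversample(X, y, minority=1, target=6)       # Stage 1
X, y = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)    # Stage 2

# Stage 3: mark Tomek links under DTW (mutual nearest neighbours from opposite classes)
# and drop the majority member of each link as boundary noise.
n = len(X)
dist = np.array([[dtw(X[i], X[j]) if i != j else np.inf for j in range(n)] for i in range(n)])
nn = dist.argmin(axis=1)
drop = [i for i in range(n) if nn[nn[i]] == i and y[i] != y[nn[i]] and y[i] == 0]
X, y = np.delete(X, drop, axis=0), np.delete(y, drop)
```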
ARTICLE | doi:10.20944/preprints202307.1117.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: history; endowments; query model; digital data; physical data
Online: 17 July 2023 (15:11:18 CEST)
Historical and endowment properties differ from heritage and cultural properties, as historical and endowment properties are governed by a unique set of laws that waqf recipients must abide by. Endowed property usually takes the form of buildings, land or valuables, whose preservation is not limited in time as long as the property can be utilized. Reliable information technology is needed to ensure data security both digitally and physically, while the rapid development of information technology demands openness of information, which poses a challenge in itself. The objectives of this study include examining the collection of historical and endowment databases, the relationship between digital data and physical data, and the management organizations involved. A query model for displaying the data is designed, and the retrieved data are then analyzed to check whether they conform to the rules of waqf management. The results are expected to yield accurate correspondence between digital data and physical data, with any discrepancies becoming findings for subsequent analysis.
COMMUNICATION | doi:10.20944/preprints202305.1694.v1
Subject: Medicine And Pharmacology, Clinical Medicine Keywords: Womens Health; Data Science; Data Methods; Artificial Intelligence
Online: 24 May 2023 (04:48:58 CEST)
Objectives: The aim of this perspective is to report the use of synthetic data as a viable method in women's health, given the current challenges of obtaining life-course data within a short period of time and accessing electronic healthcare data. Methods: We used a 3-point perspective method to provide an overview of data science, common applications, and ethical implications. Results: There are several ethical challenges linked to using real-world data; consequently, generating synthetic data provides an alternative method for conducting comprehensive research when used effectively. The use of clinical characteristics to develop synthetic data is a useful method to consider. Aligning these data as closely as possible to the clinical phenotype would enable researchers to provide data that are very similar to real-world data. Discussion: Population diversity and disease characterisation are important for the optimal use of data science. There are several artificial intelligence techniques that can be used to develop synthetic data. Conclusion: Synthetic data demonstrate promise and versatility when used efficiently and aligned to clinical problems. Therefore, exploring this option as a viable method in women's health, particularly for epidemiology, may be useful.
ARTICLE | doi:10.20944/preprints202206.0335.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: metadata; contextual data; harmonization; genomic surveillance; data management
Online: 24 June 2022 (08:46:04 CEST)
ARTICLE | doi:10.20944/preprints202108.0471.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Big data; Health prevention; Machine learning; Medical data
Online: 24 August 2021 (14:00:12 CEST)
Cardiovascular diseases (CVDs) are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. Since effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be used effectively and reliably to discern patients suffering from a CVD from those who do not suffer from any heart condition. In particular, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risk factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on the top-two and top-four dataset attributes/features with respect to five performance metrics – accuracy, precision, recall, F1-score, and ROC-AUC – using the train-test split technique and k-fold cross-validation. Our study identifies the top two and four attributes from each CVD diagnosis/prediction dataset. As our main finding, the ten MLAs exhibited appropriate diagnostic and predictive performance; hence, they can be successfully implemented to improve current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff is lacking.
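As a sketch of the evaluation protocol described (train-test split plus k-fold cross-validation scored with accuracy, precision, recall, F1 and ROC-AUC), the snippet below assumes a hypothetical heart.csv with a binary target column; the two feature names are placeholders, not the attributes identified in the study.

```python
# Illustrative evaluation sketch (not the study's code): cross-validated scoring of
# classifiers on a small set of top-ranked attributes. File and column names are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("heart.csv")
top_features = ["thalach", "oldpeak"]          # e.g. a hypothetical "top-two" attribute set
X, y = data[top_features], data["target"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("rf", RandomForestClassifier(n_estimators=200, random_state=0))]:
    cv = cross_validate(clf, X_tr, y_tr, cv=5, scoring=scoring)
    print(name, {m: round(cv[f"test_{m}"].mean(), 3) for m in scoring})
```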
ARTICLE | doi:10.20944/preprints202106.0738.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: time series; homogenization; ACMANT; observed data; data accuracy
Online: 30 June 2021 (13:08:39 CEST)
The removal of non-climatic biases, so-called inhomogeneities, from long climatic records requires sophisticated statistical methods. One principle is that the differences between a candidate series and its neighbour series are usually analysed rather than the candidate series directly, in order to neutralize the possible impact of regionally common natural climate variation on the detection of inhomogeneities. In most homogenization methods, two main kinds of time series comparison are applied: composite reference series or pairwise comparisons. In composite reference series, the inhomogeneities of neighbour series are attenuated by averaging the individual series, and the accuracy of homogenization can be improved by iteratively refining the composite reference series. By contrast, pairwise comparisons have the advantage that coincidental inhomogeneities affecting several station series in a similar way can be identified with higher certainty than with composite reference series. In addition, homogenization with pairwise comparisons tends to yield the most accurate regional trend estimations. A new time series comparison method is presented here, which combines the use of pairwise comparisons and composite reference series in a way that unifies their advantages. This comparison method is embedded into the ACMANT homogenization method and tested on large, commonly available monthly temperature test datasets.
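The following minimal sketch, using simulated station series, illustrates the two comparison principles mentioned (a composite reference series and a pairwise difference series); it demonstrates the idea only and is not part of ACMANT.

```python
# Minimal sketch under simulated data: a composite reference series (weighted
# neighbour average) and a pairwise difference series for one neighbour.
# Illustrates the principle only; the weights and break are artificial.
import numpy as np

rng = np.random.default_rng(1)
months = 600
candidate = rng.normal(10, 1, months); candidate[300:] += 0.8   # artificial break at month 300
neighbours = rng.normal(10, 1, (5, months))
weights = np.array([0.3, 0.25, 0.2, 0.15, 0.1])                 # e.g. correlation-based weights

composite_diff = candidate - weights @ neighbours               # candidate minus composite reference
pairwise_diff = candidate - neighbours[0]                       # candidate minus one neighbour

# A shift in the mean of these difference series points to an inhomogeneity in the
# candidate rather than to regionally common climate variation.
print(composite_diff[:300].mean(), composite_diff[300:].mean())
```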
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: GAN; ECG; anonymization; healthcare data; sensors; data transformation
Online: 3 September 2020 (05:26:01 CEST)
In personalized healthcare, an ecosystem for the manipulation of reliable and safe private data should be orchestrated. This paper describes a first approach for the generation of fake electrocardiograms (ECGs) based on Generative Adversarial Networks (GANs), with the objective of anonymizing users' information for privacy reasons. The intention is to create valuable data that can be used in both educational and research areas, while avoiding the risk of sensitive data leakage. As GANs are mainly exploited on images and video frames, we propose transforming general raw data into an image so that it can be managed through a GAN, and then decoding it back to the original data domain. The feasibility of our transformation and processing hypothesis is first demonstrated. Then, the main drawbacks of each step of the proposed procedure are addressed for the particular case of ECGs. Hence, a novel research pathway on health data anonymization using GANs is opened, and further developments are expected.
ARTICLE | doi:10.20944/preprints201806.0419.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: social business intelligence; data streaming models; linked data
Online: 26 June 2018 (12:48:17 CEST)
Social Business Intelligence (SBI) enables companies to capture strategic information from public social networks. Contrary to traditional Business Intelligence (BI), SBI has to face the high dynamicity of both the social network contents and the company analytical requests, as well as the enormous amount of noisy data. Effective exploitation of these continuous data sources requires efficient processing of the streamed data so that it can be semantically shaped into insightful facts. In this paper, we propose a multidimensional formalism to represent and evaluate social indicators directly from fact streams derived in turn from social network data. This formalism relies on two main aspects: the semantic representation of facts via Linked Open Data and the support of OLAP-like multidimensional analysis models. Contrary to traditional BI formalisms, we start the process by modeling the required social indicators according to the strategic goals of the company. From these specifications, all the required fact streams are modeled and deployed to trace the indicators. The main advantages of this approach are the easy definition of on-demand social indicators and the treatment of changing dimensions and metrics through streamed facts. We demonstrate its usefulness by introducing a real use case in the automotive sector.
COMMUNICATION | doi:10.20944/preprints201803.0054.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: data feature selection; data clustering; travel time prediction
Online: 7 March 2018 (13:30:06 CET)
In recent years, governments have applied intelligent transportation system (ITS) techniques to provide convenient services (e.g., a garbage truck app) for residents. This study proposes a garbage truck fleet management system (GTFMS) together with data feature selection and data clustering methods for travel time prediction. The GTFMS includes mobile devices (MDs), on-board units, a fleet management server, and a data analysis server (DAS). When a user requests the arrival time of a garbage truck via an MD, the DAS performs the data feature selection and data clustering procedures to analyze the travel time of the garbage truck. The proposed methods cluster the records of travel time and reduce variation, improving travel time prediction. After the travel time and arrival time are predicted, the predicted information is sent to the user's MD. In the experimental environment, the results showed that the accuracies of the previous method and the proposed method are 16.73% and 85.97%, respectively. Therefore, the proposed data feature selection and data clustering methods can be used to predict the stop-to-stop travel time of garbage trucks.
COMMUNICATION | doi:10.20944/preprints202206.0172.v3
Subject: Computer Science And Mathematics, Information Systems Keywords: Monkeypox; monkey pox; Twitter; Dataset; Tweets; Social Media; Big Data; Data Mining; Data Science
Online: 25 July 2022 (09:41:19 CEST)
ARTICLE | doi:10.20944/preprints202109.0518.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: data fusion; multi-sensor; data visualization; data treatment; participant reports; air quality; exposure assessment
Online: 30 September 2021 (14:13:52 CEST)
The use of a multi-sensor approach can give citizens holistic insight into the air quality of their immediate surroundings and support assessment of personal exposure to urban stressors. Our work, as part of the ICARUS H2020 project, which included over 600 participants from 7 European cities, discusses data fusion and harmonization across a diverse set of multi-sensor data streams to provide a comprehensive and understandable report for participants, and offers possible solutions and improvements. Harmonizing the data streams revealed issues with the devices and protocols used, such as non-uniform timestamps, data gaps, difficult data retrieval from commercial devices, and coarse activity data logging. Our process of data fusion and harmonization allowed us to automate the generation of visualizations and reports and consequently provide each participant with a detailed, individualized report. Results showed that a key solution was to streamline the code and speed up the process, which necessitated certain compromises in visualizing the data. A well-thought-out process of data fusion and harmonization on a diverse set of multi-sensor data streams considerably improved the quality and quantity of data that a research participant receives. Though automation accelerated the production of the reports considerably, manual structured double checks are strongly recommended.
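A small, hedged sketch of one harmonization step of the kind described is shown below: two sensor streams with non-uniform timestamps are aligned onto a common time grid, gaps are flagged, and only short gaps are filled. The stream names and the 10-minute grid are illustrative assumptions.

```python
# Illustrative harmonization step (assumed stream names and grid): align two
# irregular sensor streams onto a common time base and flag missing intervals.
import pandas as pd

pm25 = pd.DataFrame({"ts": pd.to_datetime(["2021-05-01 10:00", "2021-05-01 10:07", "2021-05-01 10:31"]),
                     "pm25": [8.1, 9.0, 12.4]}).set_index("ts")
noise = pd.DataFrame({"ts": pd.to_datetime(["2021-05-01 10:02", "2021-05-01 10:21"]),
                      "laeq": [55.2, 61.0]}).set_index("ts")

grid = "10min"
merged = pd.concat([pm25.resample(grid).mean(), noise.resample(grid).mean()], axis=1)
merged["gap"] = merged.isna().any(axis=1)                                  # flag intervals with a missing stream
merged[["pm25", "laeq"]] = merged[["pm25", "laeq"]].interpolate(limit=1)   # fill only short gaps
print(merged)
```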
ARTICLE | doi:10.20944/preprints201806.0185.v1
Subject: Medicine And Pharmacology, Other Keywords: mHealth; mobile data collection; data quality; data quality assessment framework; Tuberculosis control; developing countries
Online: 12 June 2018 (10:34:33 CEST)
Background: Increasingly, healthcare organizations are using technology for the efficient management of data. The aim of this study was to compare the data quality of digital records with that of the corresponding paper-based records using a data quality assessment framework. Methodology: We conducted a desk review of paper-based and digital records over the study duration from April 2016 to July 2016 at six enrolled TB clinics. We entered all data fields of the patient treatment (TB01) card into a spreadsheet-based template to undertake a field-to-field comparison of the fields shared between the TB01 card and the digital data. Findings: A total of 117 TB01 cards were prepared at the six enrolled sites, whereas just 50% of the records (n=59 of 117 TB01 cards) were digitized. There were 1,239 comparable data fields, of which 65% (n=803) matched correctly between paper-based and digital records; the remaining 35% (n=436) had anomalies, either in the paper-based or in the digital records. On average, 1.9 data quality issues were found per digital patient record, compared with 2.1 issues per paper-based record. Based on the analysis of valid data quality issues, there were more data quality issues in paper-based records (n=123) than in digital records (n=110). Conclusion: There were fewer data quality issues in digital records than in the corresponding paper-based records. Greater use of mobile data capture and continued use of the data quality assessment framework can deliver more meaningful information for decision making.
REVIEW | doi:10.20944/preprints202105.0663.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Big Data, Internet Data Sources (IDS), Internet of Things (IoT), Sustainable Development Goals (SDGs), Big data Technologies, Big data Challenges
Online: 27 May 2021 (10:31:03 CEST)
It is strongly believed that technology yields the greatest benefit only when it can be mastered by all stakeholders. Big data technology is no exception: even after a decade of emergence, its adoption remains a herculean task for many people and is still in a nascent stage with respect to applicability. Having understood the gaps in big data technology adoption in the contemporary world, the present exploratory research highlights the possible prospects of big data technologies. It also discusses how the challenges of various fields can be converted into opportunities with a shift in perspective towards this evolving concept. Examples of apex organizations such as the IMF and ITU and their big data initiatives with respect to the Sustainable Development Goals (SDGs) are cited for a broader outlook. The intervention of responsible organizations, along with the respective governments, is also needed to encourage technology adoption across all sections of the market.
ARTICLE | doi:10.20944/preprints202307.0244.v1
Subject: Environmental And Earth Sciences, Water Science And Technology Keywords: NASA-POWER platform; empirical equations; reanalysis data; meteorological data
Online: 4 July 2023 (13:59:00 CEST)
Reference evapotranspiration (ET0) is the first step in calculating crop irrigation demand, and numerous methods have been proposed to estimate this parameter. FAO-56 Penman-Monteith (PM) is the only standard method for defining and calculating ET0. However, it requires radiation, air temperature, atmospheric humidity, and wind speed data, limiting its application in regions where these data are unavailable; therefore, new alternatives are required. This study compared the accuracy of ET0 calculated with the Blaney-Criddle (BC) and Hargreaves-Samani (HS) methods versus PM, using information from an automated weather station (AWS) and the NASA-POWER platform (NP) for different periods of time. The information collected corresponds to Module XII of the Lagunera Region Irrigation District 017, a semi-arid region in the north of Mexico. The HS method underestimated reference evapotranspiration (ET0) by 5.5% compared to the PM method during the period from March to August, and yielded the best fit over the different evaluation periods (daily, average, and 5-day cumulative), with the latter showing the best values of the inferential parameters. The maximum and minimum temperature information from the NP platform was suitable for estimating ET0 using the HS equation. This data source is a timely alternative, particularly in semi-arid regions where no weather station data are available.
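For reference, a minimal implementation of the Hargreaves-Samani equation used in the comparison is sketched below; Ra is the extraterrestrial radiation expressed in equivalent evaporation (mm/day), and the input values are hypothetical.

```python
# Hargreaves-Samani (1985) reference evapotranspiration, illustrative inputs.
def et0_hargreaves_samani(tmax, tmin, ra):
    """ET0 (mm/day) = 0.0023 * Ra * (Tmean + 17.8) * sqrt(Tmax - Tmin)."""
    tmean = (tmax + tmin) / 2.0
    return 0.0023 * ra * (tmean + 17.8) * (tmax - tmin) ** 0.5

print(et0_hargreaves_samani(tmax=34.0, tmin=18.0, ra=16.5))   # e.g. a semi-arid summer day
```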
ARTICLE | doi:10.20944/preprints202305.0722.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Anomaly detection; Malaria data; Machine learning; big data; epidemic
Online: 10 May 2023 (09:34:36 CEST)
Disease surveillance is critical for monitoring ongoing control activities, detecting outbreaks early, and informing intervention priorities and policies. Unfortunately, most disease surveillance data remain under-utilised for supporting decision-making in real time. Using Brazilian Amazon malaria surveillance data as a case study, we explore unsupervised anomaly detection machine learning techniques to analyse the data and discover potential anomalies. We found that our models are able to detect early outbreaks, outbreak peaks, and change points in the proportion of positive malaria cases. Specifically, the sustained rise in malaria in the Brazilian Amazon in 2016 was flagged by several models. We also found that no single model detects all the anomalies across all health regions. The approaches using the clustering-based local outlier algorithm ranked first, ahead of principal component analysis and stochastic outlier selection, in maximising the number of anomalies detected in local health regions. Because of this, we also provide the minimum number of machine learning models (top-k models) needed to maximise the number of anomalies detected across different health regions. We discovered that the top-3 models that maximise the coverage of the number and types of anomalies detected across the 13 health regions are principal component analysis, stochastic outlier selection and multi-covariance determinant. Anomaly detection approaches provide interesting solutions for discovering patterns of epidemiological importance when confronted with a large volume of data across space and time. Our exploratory approach can be replicated for other diseases and locations to inform timely interventions and actions toward endemic disease control.
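As an illustration of one class of approach named above, the sketch below scores anomalies by PCA reconstruction error on a simulated proportion-positive series; it is not the study's code or data.

```python
# Unsupervised anomaly scoring via PCA reconstruction error on simulated
# surveillance data (weeks x regions). The data and threshold are illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
weeks, regions = 200, 13
X = rng.normal(0.10, 0.02, (weeks, regions))      # proportion of positive cases per region
X[150:160, 3] += 0.15                              # injected outbreak-like anomaly

pca = PCA(n_components=3).fit(X)
reconstruction = pca.inverse_transform(pca.transform(X))
score = np.linalg.norm(X - reconstruction, axis=1)            # reconstruction error per week
threshold = score.mean() + 3 * score.std()
print(np.where(score > threshold)[0])                          # weeks flagged as anomalous
```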
REVIEW | doi:10.20944/preprints202208.0420.v1
Subject: Social Sciences, Law Keywords: conversational commerce; data protection; law of obligations of data
Online: 24 August 2022 (10:55:06 CEST)
The possibilities and reach of social networks are increasing, the designs are becoming more diverse, and the ideas more visionary. Most recently, the company formerly known as Facebook announced the creation of a metaverse. With these technical possibilities, however, the danger posed by fraudsters is also growing. Using social bots, consumers are increasingly influenced on such platforms, and business transactions are brought about through communication, i.e. conversational commerce. Minors and the elderly are particularly susceptible. This technical development is accompanied by a legal one: the Digital Services Directive and the Sale of Goods Directive permit demanding the provision of data as consideration for the sale of digital products. This raises legal problems at the level of the law of obligations and data protection law, whose regulations are intended to protect the aforementioned groups of individuals. This protection becomes all the more important the more manipulatively consumers are influenced by communicative bots. We show that there is a lack of knowledge about the objective value data can have in business transactions. Sufficient transparency about an objective data value can maintain legal protection, especially of vulnerable groups, and ensure that the laws fulfil their purpose.
ARTICLE | doi:10.20944/preprints202208.0224.v1
Subject: Engineering, Automotive Engineering Keywords: VR-XGBoost; K-VDTE; ETC data; ESAs; data mining
Online: 12 August 2022 (03:53:23 CEST)
To scientifically and effectively evaluate the service capacity of expressway service areas (ESAs) and improve the management level of ESAs, we propose a method for recognizing vehicles entering ESAs (VeESAs) and estimating vehicle dwell times using ETC data. First, the ETC data and their advantages are described in detail, and cleaning rules are designed according to the characteristics of the ETC data. Second, we establish feature engineering based on the characteristics of VeESAs and propose an XGBoost-based VeESA recognition (VR-XGBoost) model. Having studied the driving rules in depth, we construct a kinematics-based vehicle dwell time estimation (K-VDTE) model. Field validation in Parts A and B of the Yangli ESA using real ETC transaction data demonstrates that our proposal outperforms the current state of the art. Specifically, in Part A and Part B, the recognition accuracies of VR-XGBoost are 95.9% and 97.4%, respectively; the mean absolute errors (MAEs) of dwell time are 52 s and 14 s, respectively; and the root mean square errors (RMSEs) are 69 s and 22 s, respectively. In addition, the confidence level for keeping the MAE of dwell time within 2 minutes is more than 97%. This work can effectively identify VeESAs and accurately estimate dwell time, providing a reference and theoretical basis for the service capacity evaluation and layout optimization of ESAs.
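A hedged sketch of the recognition step follows: a gradient-boosted classifier trained on features engineered from ETC transactions to flag vehicles that entered a service area. The synthetic data, feature names and toy label rule are assumptions, not the authors' feature engineering.

```python
# Illustrative VeESA-style recognition with a gradient-boosted classifier.
# The synthetic features and the labelling rule are assumptions for demonstration only.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(600, 200, n),        # travel time between adjacent gantries (s)
    rng.normal(90, 15, n),          # average speed on the section (km/h)
    rng.integers(0, 2, n),          # truck / passenger car flag
])
y = (X[:, 0] > 750).astype(int)     # toy label: a long section time suggests a service-area stop

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```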
ARTICLE | doi:10.20944/preprints202208.0083.v1
Subject: Business, Economics And Management, Accounting And Taxation Keywords: Ratios; Financial Crisis; Covid-19; Big Data; Accounting Data
Online: 3 August 2022 (10:42:06 CEST)
The effects of the 2008 financial crisis undoubtedly caused problems not only for the banking sector but also for the real economy of developed and developing countries almost all around the globe. Besides, as is widely known, every banking crisis entails a corresponding cost to the economy of each country affected by it, resulting from the shakeout and restructuring of its financial system. The purpose of this research is to investigate the consequences of the financial crisis and the COVID-19 health crisis and how these affected the course of the four systemic banks (Eurobank, Alpha Bank, National Bank, Piraeus Bank) through ratio analysis for the period 2015-2020.
ARTICLE | doi:10.20944/preprints202103.0331.v1
Subject: Social Sciences, Media Studies Keywords: Social media ethics; Social media; data misuse; data integrity
Online: 12 March 2021 (08:05:09 CET)
The present high-tech landscape has allowed institutions to undergo digital transformation while storing exceptional volumes of information from several sources, such as mobile phones, debit cards, GPS, transactions, online logs, and e-records. With the growth of technology, big data has become a huge resource for many corporations, encouraging enhanced strategies and innovative enterprise prospects. This advancement has also enabled the expansion of linkable data resources. One of the most prominent data sources is social media platforms. Ideas and various types of content are posted by thousands of people via social networking sites, which have provided a modern way of operating companies efficiently. However, some studies have shown that social media platforms can be a source of misinformation, and some users tend to misuse social media data. In this work, ethical concerns and conduct in online communities are reviewed in order to examine how social media data from different platforms have been misused and to highlight some of the ways to avoid the misuse of social media data.
ARTICLE | doi:10.20944/preprints202006.0258.v2
Subject: Engineering, Civil Engineering Keywords: Conservation laws; Data inference; Data discovery; Dimensionless form; PINN
Online: 30 September 2020 (03:51:25 CEST)
Deep learning has achieved remarkable success in diverse computer science applications; however, its use in other traditional engineering fields has emerged only recently. In this project, we solved several mechanics problems governed by differential equations using physics-informed neural networks (PINNs). A PINN embeds the differential equations into the loss of the neural network using automatic differentiation. We present our developments in the context of solving two main classes of problems, data-driven solutions and data-driven discoveries, and we compare the results with either analytical solutions or numerical solutions using the finite element method. The remarkable performance of the PINN models shown in this report suggests the bright prospects of physics-informed surrogate models that are fully differentiable with respect to all input coordinates and free parameters. More broadly, this study shows that PINNs provide an attractive alternative for solving traditional engineering problems.
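A minimal PINN sketch in the spirit described is given below: the differential-equation residual enters the loss through automatic differentiation. The toy problem u'(x) = -u(x) with u(0) = 1 is an illustration only, not the project's mechanics problems or code.

```python
# Minimal PINN: the ODE residual u' + u = 0 is added to the loss via autograd.
# Toy problem u'(x) = -u(x), u(0) = 1, whose exact solution is u = exp(-x).
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(3000):
    x = torch.rand(64, 1, requires_grad=True)                   # collocation points in [0, 1]
    u = net(x)
    du_dx = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    residual = du_dx + u                                         # differential-equation residual
    bc = net(torch.zeros(1, 1)) - 1.0                            # boundary condition u(0) = 1
    loss = (residual ** 2).mean() + (bc ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

print(net(torch.tensor([[1.0]])).item(), "vs exact", torch.exp(torch.tensor(-1.0)).item())
```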
ARTICLE | doi:10.20944/preprints202007.0051.v2
Subject: Social Sciences, Library And Information Sciences Keywords: COVID-19; WHO; database; systematic review; data quality
Online: 2 August 2020 (17:43:38 CEST)
Introduction: The large number of COVID-19 publications has created a need to collect all research-related material in practical and reliable centralized databases. The aim of this study was to evaluate the functionality and quality of the compiled World Health Organisation COVID-19 database and compare it to PubMed and Scopus. Methods: Article metadata for COVID-19 articles and for articles on 8 specific topics related to COVID-19 were exported from the WHO global research database, Scopus and PubMed. The analysis was conducted in R to investigate the number of articles, their overlap between the databases, and the missingness of values in the metadata. Results: The WHO database contains the largest number of COVID-19-related articles overall but retrieved the same number of articles on the 8 specific topics as Scopus and PubMed. Despite having the smallest number of exclusive articles overall, the Scopus database retrieved the highest number of exclusive articles on specific COVID-19-related topics. Further investigation revealed that PubMed and Scopus have a more comprehensive structure than the WHO database and fewer missing values in the categories searched by the information retrieval systems. Discussion: This study suggests that the WHO COVID-19 database, even though it is compiled from multiple databases, has a very simple and limited structure and significant problems with data quality. As a consequence, relying on this database as a source of articles for systematic reviews or bibliometric analyses is undesirable.
ARTICLE | doi:10.20944/preprints201905.0158.v1
Subject: Medicine And Pharmacology, Other Keywords: blockchain; biomedical data managing; DWT; keyword search; data sharing.
Online: 13 May 2019 (13:30:37 CEST)
Personal biomedical data play a crucial role in maintaining proficient access to health records by patients as well as health professionals. However, it is difficult to obtain a unified view of health data that are scattered across various health centre/hospital departments; health records are distributed across many places and cannot easily be integrated. In recent years, blockchain has been regarded as a promising solution for sharing individual biomedical information in a secure, privacy-preserving way, because of its immutability. This work puts forward a blockchain-based management scheme that improves the operation of electronic biomedical systems. In this scheme, two blockchains form the basis of the design, where the second blockchain algorithm generates a secure sequence for the hash key generated in the first blockchain algorithm. An adaptive feature enables the algorithm to use multiple data types and to combine various biomedical images and text records. All the data, including keywords, digital records and the identities of patients, are encrypted with a private key and support keyword search, so as to maintain data privacy, access control and protected search. The obtained results, which show low latency (less than 750 ms) at 400 requests per second, indicate that the scheme can be used in several healthcare units such as hospitals and clinics.
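As a toy illustration of the hash-chaining mechanism underlying such schemes (not the proposed dual-blockchain design, its encryption, or its keyword search), each record block below stores the hash of the previous block so that tampering breaks the chain.

```python
# Toy hash chain over biomedical record blocks: each block references the previous
# block's hash, so altering any record invalidates all later blocks. Illustrative only.
import hashlib, json, time

def make_block(prev_hash, payload):
    block = {"prev": prev_hash, "time": time.time(), "payload": payload}
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

chain = [make_block("0" * 64, {"patient": "anon-001", "record": "encrypted-blob"})]
chain.append(make_block(chain[-1]["hash"], {"patient": "anon-002", "record": "encrypted-blob"}))

def verify(chain):
    """Recompute each block hash and check that it links to the previous block."""
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != block["hash"]:
            return False
        if i and block["prev"] != chain[i - 1]["hash"]:
            return False
    return True

print(verify(chain))
```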
ARTICLE | doi:10.20944/preprints201806.0219.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Big data technology; Business intelligence; Data integration; System virtualization.
Online: 13 June 2018 (16:19:48 CEST)
Big Data warehouses are a new class of databases that largely use unstructured and volatile data for analytical purposes. Examples of this kind of data source are those coming from the Web, such as social networks and blogs, or from sensor networks, where huge amounts of data may be available only for short intervals of time. In order to manage massive data sources, a strategy must be adopted for defining multidimensional schemas in the presence of fast-changing situations or even undefined business requirements. In this paper, we propose a design methodology that adopts agile and automatic approaches in order to reduce the time necessary to integrate new data sources and to include new business requirements on the fly. The data are immediately available for analysis, since the underlying architecture is based on a virtual data warehouse that does not require an import phase. Examples of the application of the methodology are presented throughout the paper to show the validity of this approach compared to a traditional one.
ARTICLE | doi:10.20944/preprints202102.0326.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Data analysis; Artificial Intelligence; Machine Learning; Knowledge Engineering; Computers and information processing, Data analysis; Data Processing.
Online: 16 February 2021 (13:33:53 CET)
The copper mining industry is increasingly using artificial intelligence methods to improve copper production processes. Recent studies report the use of algorithms such as Artificial Neural Networks, Support Vector Machines, and Random Forests, among others, to develop models for predicting product quality. Other studies compare the predictive models developed with these machine learning algorithms in the mining industry as a whole. However, few published copper mining studies compare the results of machine learning techniques for copper recovery prediction. This study makes a detailed comparison between three models for predicting copper recovery by leaching, using four datasets resulting from mining operations in northern Chile. The algorithms used to develop the models were Random Forest, Support Vector Machine, and Artificial Neural Network. To validate these models, four indicators or figures of merit were used: accuracy (acc), precision (p), recall (r), and Matthews correlation coefficient (mcc). This paper describes the dataset preparation and the refinement of the threshold values used for the predictive variable most influential on the class (the copper recovery). Results show a precision above 98.50% and identify the model with the best agreement between predicted and actual values. Finally, the models obtained show the following mean values: acc=94.32, p=88.47, r=99.59, and mcc=2.31. These values are highly competitive when compared with those obtained in similar studies using other approaches in this context.
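An illustrative comparison mirroring the validation protocol described (not the authors' data or code) is sketched below: Random Forest, SVM and a small neural network scored with accuracy, precision, recall and the Matthews correlation coefficient on a synthetic binary task.

```python
# Illustrative model comparison on a synthetic binary "recovery above threshold" task,
# scored with the four figures of merit named in the abstract. Data are simulated.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, matthews_corrcoef

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {"rf": RandomForestClassifier(random_state=0),
          "svm": SVC(),
          "ann": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)}

for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name, {"acc": accuracy_score(y_te, pred), "p": precision_score(y_te, pred),
                 "r": recall_score(y_te, pred), "mcc": matthews_corrcoef(y_te, pred)})
```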
ARTICLE | doi:10.20944/preprints202008.0254.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: feature selection; k-means; silhouette measure; clustering; big data; fault classification; sensor data; time-series data
Online: 11 August 2020 (06:26:43 CEST)
Feature selection is a crucial step in overcoming the curse of dimensionality in data mining. This work proposes Recursive k-means Silhouette Elimination (RkSE) as a new unsupervised feature selection algorithm to reduce dimensionality in univariate and multivariate time-series datasets. k-means clustering is applied recursively to select the cluster-representative features, following a distinct application of the silhouette measure for each cluster with a user-defined threshold as the feature selection or elimination criterion. The proposed method is evaluated on multi-sensor readings from a hydraulic test rig in two different ways: (1) reducing the dimensionality in a multivariate classification problem using various classifiers of different functionalities; and (2) classifying univariate data in a sliding-window scenario, where RkSE is used as a window compression method to reduce the window dimensionality by selecting the best time points in a sliding window. Moreover, the results are validated using the 10-fold cross-validation technique and compared with the results obtained when classification is performed directly with no feature selection applied. Additionally, a new taxonomy for k-means based feature selection methods is proposed. The experimental results and observations in the two comprehensive experiments demonstrated in this work reveal the capabilities and accuracy of the proposed method.
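A simplified, one-pass sketch of the idea behind RkSE is given below: features are clustered with k-means, per-feature silhouette values are computed, and a user-defined threshold decides which cluster representatives are kept. The recursive refinement of the full algorithm is omitted, and the data are synthetic.

```python
# Simplified, one-pass illustration of k-means plus silhouette-based feature selection
# (the recursion of the full RkSE algorithm is omitted; data and threshold are assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))                            # 500 time points, 12 sensor features
X[:, 6:] = X[:, :6] + rng.normal(0, 0.1, (500, 6))        # make half the features near-duplicates

features = X.T                                            # cluster the features, not the samples
k, threshold = 6, 0.2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
sil = silhouette_samples(features, labels)

# Keep the best-silhouette feature of each cluster, provided it clears the threshold.
selected = [np.where(labels == c)[0][sil[labels == c].argmax()]
            for c in range(k) if sil[labels == c].max() >= threshold]
print("selected feature indices:", selected)
```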