Search | Preprints.org

A number of studies have shown that assimilation of satellite derived soil moisture using the ensemble Kalman Filter (EnKF) can improve soil moisture estimates, particularly for the surface zone. However, the EnKF is computationally expensive since an ensemble of model integrations has to be propagated forward in time. Here, assimilating satellite soil moisture data from the Soil Moisture Active Passive (SMAP) mission, we compare the EnKF with the computationally cheaper ensemble Optimal Interpolation (EnOI) method over the contiguous United States (CONUS). The background error-covariance in the EnOI is sampled in two ways: i) by using the stochastic spread from an ensemble open-loop run, and ii) by sampling from the model spinup climatology. Our results indicate that the EnKF is only marginally superior to one version of the EnOI. Furthermore, the assimilation of SMAP data using the EnKF and EnOI is found to improve the surface zone correlation with in-situ observations at a 95% significance level. The EnKF assimilation of SMAP data is also found to improve root-zone correlation with independent in-situ data at the same significance level; however, this improvement depends on which in-situ network we validate against. We evaluate how the quality of the atmospheric forcing affects the analysis results by forcing the land surface data assimilation system with either observation-corrected or model-derived precipitation. Surface zone correlation skill increases for the analysis using both the corrected and model-derived precipitation, but only the latter shows an improvement at the 95% significance level. The study also suggests that the EnOI can be used for bias-correction of the atmospheric fields where post-processed data are not available. 
Finally, we assimilate three different Level-2 satellite derived soil moisture products from ESA Climate Change Initiative (CCI), SMAP and SMOS (Soil Moisture and Ocean Salinity) using the EnOI, and then compare the relative performance of the three resulting analyses against in-situ soil moisture observations. In this comparison, we find that all three analyses offer improvements over an open-loop run when comparing to in-situ observations. The assimilation of SMAP data is found to perform marginally better than the assimilation of SMOS data, while assimilation of the ESA CCI data shows the smallest improvement of the three analysis products.
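
As a rough illustration of why the EnOI is cheaper than the EnKF, the sketch below applies a single analysis update to one scalar soil moisture value. It is not the authors' implementation: the function name `enoi_update`, the scalar state, the identity observation operator, and the numbers in the usage note are all assumptions made for illustration. The background error variance is taken from a fixed (static) ensemble, so only one model run has to be propagated forward in time.

```python
from statistics import variance


def enoi_update(background, observation, static_ensemble, obs_error_var):
    """One scalar EnOI-style analysis step (illustrative sketch only).

    background      -- model forecast from a single deterministic run
    observation     -- observed value of the same quantity (identity
                       observation operator assumed)
    static_ensemble -- fixed sample of model states, e.g. drawn from an
                       open-loop ensemble or a spinup climatology
    obs_error_var   -- assumed observation error variance
    """
    # Background error variance estimated once from the static ensemble.
    background_var = variance(static_ensemble)
    # Kalman gain for a scalar state observed directly.
    gain = background_var / (background_var + obs_error_var)
    # Analysis = background nudged toward the observation.
    return background + gain * (observation - background)
```

With a static ensemble of `[0.1, 0.2, 0.3]` and an observation error variance of `0.01`, the gain is about 0.5, so a background of `0.2` and an observation of `0.3` give an analysis close to `0.25`. An EnKF would instead recompute the variance from an ensemble that is itself propagated by the model at every step.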

Preprint ARTICLE | doi:10.20944/preprints202106.0143.v1

Changes in Power Plant NO_x Emissions over Northwest Greece Using a Data Assimilation Technique

Ioanna Skoulidou, Maria-Elissavet Koukouli, Arjo Segers, Astrid Manders, Dimitris Balis, Trissevgeni Stavrakou, Jos van Geffen, Henk Eskes

Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Data assimilation; TROPOMI; Air Quality modelling; NOx Emissions; Ensemble Kalman Filter; LOTOS-EUROS; power plant; anthropogenic

Online: 4 June 2021 (12:59:09 CEST)

In this work, we investigate the ability of a data assimilation technique and space-borne observations to quantify and monitor changes in nitrogen oxides (NOx) emissions over North-Western Greece for the summers of 2018 and 2019. Four lignite-burning power plants are located in this region. The data assimilation technique, based on the Ensemble Kalman Filter method, is employed to combine space-borne atmospheric observations from the high spatial resolution Sentinel-5 Precursor (S5P) Tropospheric Monitoring Instrument (TROPOMI) and simulations using the LOTOS-EUROS Chemical Transport model. The Copernicus Atmosphere Monitoring Service-Regional European emissions (CAMS-REG, version 4.2) inventory based on year 2015 is used as the a priori in the simulations. Surface measurements of nitrogen dioxide (NO2) from air quality stations operating in the region are compared with the model surface NO2 output using either the a priori (base run) or the a posteriori (assimilated run) NOx emissions. The high biases found between the in situ NO2 measurements and the base run surface NO2 decrease in the assimilated run in most cases. The bias in the station near the largest power plant decreases to 2.0 μg/m3 (2.83 μg/m3) from 10.5 μg/m3 (8.46 μg/m3) in 2019 (2018, respectively). Concerning the estimated annual a posteriori NOx emissions, it was found that, for the pixels hosting the two largest power plants, the assimilated run results in emissions decreased by ~40-50% for 2018 compared to 2015, whereas a larger decrease, of ~70% for both power plants, was found for 2019, after assimilating the space-borne observations. For the same power plants, the European Pollutant Release and Transfer Register (E-PRTR) reports decreased emissions in 2018 and 2019 compared to 2015 (-35% and -38% in 2018, -62% and -72% in 2019), in good agreement with the estimated emissions. 
We further compare the a posteriori emissions to the reported energy production of the power plants during the summers of 2018 and 2019. Mean decreases of about -35% and -63% in NOx emissions are estimated for the two larger power plants in the summers of 2018 and 2019, respectively, which are supported by similar decreases in the reported energy production of the power plants (~-30% and -70%, respectively).

Preprint ARTICLE | doi:10.20944/preprints201707.0089.v1

Data Assimilation in the Air Contaminant Dispersion Using Particle Filter and Expectation-Maximization Algorithm with UAV Observations

Rongxiao Wang, Bin Chen, Sihang Qiu, Zhengqiu Zhu, Xiaogang Qiu

Subject: Engineering, Chemical Engineering Keywords: air contaminant dispersion; data assimilation; particle filter; expectation-maximization algorithm; UAV

Online: 31 July 2017 (11:02:27 CEST)

Preprint ARTICLE | doi:10.20944/preprints201612.0091.v2

The China Meteorological Assimilation Driving Datasets for the SWAT Model (CMADS) Application in China: A Case Study in Heihe River Basin

Xian-yong Meng, Hao Wang, Si-yu Cai, Xue-song Zhang, Guo-yong Leng, Xiao-hui Lei, Chun-xiang Shi, Shi-yin Liu

Subject: Environmental And Earth Sciences, Geophysics And Geology Keywords: reanalysis climate data; hydrologic modeling; comparative analysis

Online: 3 February 2017 (03:50:07 CET)

Large-scale hydrological modeling in China is challenging given the sparse meteorological stations and large uncertainties associated with atmospheric forcing data. Here we introduce the development and use of the China Meteorological Assimilation Driving Datasets for the SWAT model (CMADS) in the Heihe River Basin (HRB) for improving hydrologic modeling, by leveraging the datasets from the China Meteorological Administration Land Data Assimilation System (CLDAS) (including climate data from nearly 40000 area encryption stations, 2700 national automatic weather stations, FengYun (FY) 2 satellite and radar stations). CMADS uses the Space Time Multiscale Analysis System (STMAS) to fuse data based on the ECMWF ambient field and ensure data accuracy. In addition, compared with CLDAS, CMADS includes relative humidity and climate data of varied resolutions to drive hydrological models such as the Soil and Water Assessment Tool (SWAT) model. Here, we compared climate data from CMADS, Climate Forecast System Reanalysis (CFSR) and traditional weather station (TWS) climate forcing data and evaluated their applicability for driving large scale hydrologic modeling with SWAT. In general, CMADS has higher accuracy than CFSR when evaluated against observations at TWS; CMADS also provides a spatially continuous climate field to drive distributed hydrologic models, which is an important advantage over TWS climate data, particularly in regions with sparse weather stations. SWAT model simulations driven with CMADS and TWS achieved similar performances in terms of monthly and daily stream flow simulations, and both of them outperformed CFSR. For example, for the three hydrological stations (Ying Luoxia, Qilian Mountain, and ZhaMasheke) in the HRB, the monthly and daily Nash-Sutcliffe efficiencies are in the ranges of 0.75-0.95 and 0.58-0.78, respectively, which are much higher than the corresponding efficiency statistics achieved with CFSR (monthly: 0.32-0.49 and daily: 0.26-0.45). 
The CMADS dataset is available free of charge and is expected to be a valuable addition to the existing climate reanalysis datasets for driving distributed hydrologic modeling in China and other countries in East Asia.
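
For readers unfamiliar with the Nash-Sutcliffe efficiency (NSE) quoted in this abstract, the sketch below shows the standard definition: one minus the ratio of the squared simulation error to the variance of the observations about their mean. The function name `nash_sutcliffe` and the sample values in the usage note are illustrative assumptions, not data from the study.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency of a simulated series against observations.

    NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
    A value of 1 is a perfect fit; 0 means the simulation is no better
    than always predicting the observed mean.
    """
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Squared error of the simulation.
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Spread of the observations about their own mean.
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - residual / spread
```

For instance, `nash_sutcliffe([1, 2, 3], [1, 2, 3])` is 1.0, while a simulation that always returns the observed mean, `nash_sutcliffe([1, 2, 3], [2, 2, 2])`, scores 0.0. Read this way, the reported monthly values of 0.75-0.95 indicate a much closer fit than the 0.32-0.49 obtained with CFSR.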

Preprint ARTICLE | doi:10.20944/preprints201809.0105.v1

LDAS-Monde Sequential Assimilation of Satellite Derived Observations Applied to the CONtiguous US: An ERA-5 Driven Reanalysis of the Land Surface Variables

Clément Albergel, Simon Munier, Aymeric Bocher, Bertrand Bonan, Yongjun Zheng, Clara Draper, Delphine J. Leroux, Jean-Christophe Calvet

Subject: Environmental And Earth Sciences, Environmental Science Keywords: Land Surface Data Assimilation, remote sensing, ERA5

Online: 6 September 2018 (00:24:47 CEST)

LDAS-Monde, an offline land data assimilation system with global capacity, is applied over the CONtiguous US (CONUS) domain to enhance monitoring accuracy for water and energy states and fluxes. LDAS-Monde ingests satellite-derived Surface Soil Moisture (SSM) and Leaf Area Index (LAI) estimates to constrain the Interactions between Soil, Biosphere, and Atmosphere (ISBA) Land Surface Model (LSM) coupled with the CNRM (Centre National de Recherches Météorologiques) version of the Total Runoff Integrating Pathways (CTRIP) continental hydrological system (ISBA-CTRIP). LDAS-Monde is forced by the ERA-5 atmospheric reanalysis from the European Centre for Medium-Range Weather Forecasts (ECMWF) from 2010 to 2016, leading to a 7-yr, quarter degree spatial resolution offline reanalysis of Land Surface Variables (LSVs) over CONUS. The impact of assimilating LAI and SSM into LDAS-Monde is assessed over North America, by comparison to satellite-driven model estimates of land evapotranspiration from the Global Land Evaporation Amsterdam Model (GLEAM) project, and upscaled ground-based observations of gross primary productivity from the FLUXCOM project. Taking advantage of the relatively dense data networks over CONUS, we also evaluate the impact of the assimilation against in-situ measurements of soil moisture from the USCRN network (US Climate Reference Network), together with river discharges from the United States Geological Survey (USGS) and the Global Runoff Data Centre (GRDC). Those data sets highlight the added value of assimilating satellite derived observations compared to an open-loop simulation (i.e. no assimilation). It is shown that LDAS-Monde has the ability not only to monitor land surface variables but also to forecast them, by providing improved initial conditions whose impacts persist through time. The LDAS-Monde reanalysis also has the potential to be used to monitor extreme events such as agricultural drought. 
Finally, limitations related to LDAS-Monde and current satellite-derived observations are exposed as well as several insights on how to use alternative datasets to analyze soil moisture and vegetation state.

Preprint ARTICLE | doi:10.20944/preprints202402.0384.v1

Modeling Volcanic Ash Dispersion from the Hunga Tonga-Hunga Ha'apai Eruption Using WRF-Chem and Meteorological FASDAS Data Assimilation

Hosni Snoun, Mohammad Al Harbi, Amirhossein Nikfal, Abderrazak Arif, William Hatheway, Meznah A. Alamro, Alaeddine Mihoub, Moez Krichen

Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Hunga Tonga-Hunga Ha’apai volcano; WRF-Chem; FASDAS; data assimilation

Online: 6 February 2024 (17:12:02 CET)

Preprint ARTICLE | doi:10.20944/preprints202309.1637.v1

Constraining Flood Forecasting Uncertainties through Streamflow Data Assimilation in the Tropical Andes of Peru: Case of the Vilcanota River Basin

Harold Llauca, Miguel Arestegui, Waldo Lavado-Casimiro

Subject: Environmental And Earth Sciences, Water Science And Technology Keywords: Streamflow Data Assimilation; Flood forecasting; Tropical Andes; Satellite Precipitation Products; GR4H model

Online: 25 September 2023 (09:00:46 CEST)

Preprint REVIEW | doi:10.20944/preprints202003.0141.v1

Sharing Is Caring – Data Sharing Initiatives in Healthcare

Tim Hulsen

Subject: Medicine And Pharmacology, Other Keywords: data sharing; data management; data science; big data; healthcare

Online: 8 March 2020 (16:46:20 CET)

Preprint ARTICLE | doi:10.20944/preprints202405.0508.v2

HCI and Data: Interacting in a New Era of Virtualization

Iván Durango, José A. Gallud, Victor M. R. Penichet

Subject: Computer Science And Mathematics, Information Systems Keywords: Human-Data Interaction; Human-Computer Interaction; Big Data; Data virtualization; Data Accessibility; Data Management; Data Privacy; Data Ethics; Data-Driven Decision-Making

Online: 1 July 2024 (08:12:01 CEST)

The rapid technological progress has ushered in a new era of human-computer interaction, where the distinction between the physical and virtual realms is becoming increasingly blurred. This research paper explores the profound and multifaceted intersection of Human-Data Interaction (HDI) and Data Virtualization (DV), examining how emerging technologies can significantly enhance the exploration, comprehension, and utilization of complex, multidimensional data sets. Informed by the insights gleaned from prior research in this domain, the present study delves into the potential of DV techniques to improve HDI, with a particular focus on three experimental investigations conducted within the realms of education, healthcare, and retail. The findings reveal the benefits and potential challenges associated with the implementation of DV in these diverse contexts, offering valuable guidance for the design and development of future HDI systems. Drawing upon a diverse array of authoritative sources, this paper presents a holistic, forward-looking perspective on the future of HDI, underscoring the critical role that DV will play in shaping the next generation of human-computer interfaces and facilitating a deeper, more intuitive understanding of the digital world. Furthermore, the paper presents a preliminary framework for integrating HDI principles into standard design practices. This framework outlines key considerations and guidelines to help designers and developers incorporate HDI techniques more effectively into the development of data-driven applications and interfaces. The proposed framework outlines key considerations for enhancing data accessibility and comprehension, empowering users to exercise greater control over their data, and cultivating transparent dialogues between data providers and end-users. 
By establishing this conceptual foundation, the paper aims to facilitate the seamless integration of HDI principles into standard design practices, ultimately leading to more intuitive, user-centric, and ethically-grounded approaches to data interaction and utilization.

Preprint ARTICLE | doi:10.20944/preprints202404.1018.v1

Discovering Data Domains and Products in Data Meshes Using Semantic Blueprints

Michalis Pingos, Andreas S. Andreou

Subject: Computer Science And Mathematics, Computer Science Keywords: Big Data; Data Lakes; Data Meshes; Data Products; Data Blueprints; Metadata Semantic Enrichment

Online: 16 April 2024 (16:26:06 CEST)

Preprint ARTICLE | doi:10.20944/preprints202206.0320.v4

Ten Simple Rules for Using Public Biological Data for Your Research

Vishal Oza, Jordan Whitlock, Elizabeth Wilk, Angelina Uno-Antonison, Brandon Wilk, Manavalan Gajapathy, Timothy Howton, Austyn Trull, Lara Ianov, Elizabeth Worthey, Brittany Lasseigne

Subject: Biology And Life Sciences, Other Keywords: data; reproducibility; FAIR; data reuse; public data; big data; analysis

Online: 2 November 2022 (02:55:49 CET)

Preprint ARTICLE | doi:10.20944/preprints202003.0268.v1

TEEDA: An Interactive Platform for Matching Data Providers and Users in Data Marketplace

Teruaki Hayashi, Yukio Ohsawa

Subject: Social Sciences, Library And Information Sciences Keywords: matching; data marketplace; data platform; data visualization; call for data

Online: 17 March 2020 (04:10:28 CET)

Preprint ARTICLE | doi:10.20944/preprints202402.0602.v1

The Improvement of the Use of Open Data in Public Institutions

Besart Hyseni, Lejla Abazi Bexheti

Subject: Computer Science And Mathematics, Information Systems Keywords: Improving use of open data; data utilization; data optimization; enhancing data access; open data impact; open data government; data transparency; data-driven decision making

Online: 12 February 2024 (09:34:51 CET)

Preprint REVIEW | doi:10.20944/preprints202309.2113.v1

Navigating the Data Architecture Landscape: A Comparative Analysis of Data Warehouse, Data Lake, Data Lakehouse, and Data Mesh

Benjamin Wong

Subject: Computer Science And Mathematics, Hardware And Architecture Keywords: Data, DWH, Data Warehouse, Architecture, Data Lake, Storage, Analysis, Data Mesh, Analytical, Architectural, Data Vault

Online: 3 October 2023 (03:28:55 CEST)

Preprint ARTICLE | doi:10.20944/preprints202403.0265.v1

Security and Ownership in User Defined Data Meshes

Michalis Pingos, Panayiotis Christodoulou, Andreas S. Andreou

Subject: Computer Science And Mathematics, Computer Science Keywords: Big Data; Smart Data Processing; Systems of Deep Insight; Data Meshes; Data Lakes; Data Products; Blockchain; NFT; Data Blueprints

Online: 5 March 2024 (15:04:49 CET)

Preprint ARTICLE | doi:10.20944/preprints202406.1319.v1

Proposing Machine Learning Models Suitable for Predicting Open Data Utilization

Junyoung Jeong, Keuntae Cho

Subject: Business, Economics And Management, Business And Management Keywords: open data; open government data; open data utilization

Online: 19 June 2024 (07:36:26 CEST)

Preprint ARTICLE | doi:10.20944/preprints202405.1988.v1

Privacy Preserving Human Mobility Generation using Grid based Data and Graph Autoencoders

Fabian Netzler, Markus Lienkamp

Subject: Social Sciences, Transportation Keywords: Mobility Data; Synthetic Data Generation; Mobility Data Analytics

Online: 30 May 2024 (12:02:38 CEST)

Preprint ARTICLE | doi:10.20944/preprints202304.0130.v1

Data Cooperatives as Catalysts for Collaboration, Data Sharing, and the (Trans)Formation of the Digital Commons

Michael Max Bühler, Igor Calzada, Isabel Cane, Thorsten Jelinek, Astha Kapoor, Morshed Mannan, Sameer Mehta, Marina Micheli, Vijay Mookerje, Konrad Nübel, Alex Pentland, Trebor Scholz, Divya Siddarth, Julian Tait, Bapu Vaitla, Jianguo Zhu

Subject: Computer Science And Mathematics, Other Keywords: data; cooperatives; open data; data stewardship; data governance; digital commons; data sovereignty; open digital federation platform

Online: 7 April 2023 (14:14:02 CEST)

Network effects, economies of scale, and lock-in-effects increasingly lead to a concentration of digital resources and capabilities, hindering the free and equitable development of digital entrepreneurship (SDG9), new skills, and jobs (SDG8), especially in small communities (SDG11) and their small and medium-sized enterprises (“SMEs”). To ensure the affordability and accessibility of technologies, promote digital entrepreneurship and community well-being (SDG3), and protect digital rights, we propose data cooperatives [1,2] as a vehicle for secure, trusted, and sovereign data exchange [3,4]. In post-pandemic times, community/SME-led cooperatives can play a vital role by ensuring that supply chains to support digital commons are uninterrupted, resilient, and decentralized [5]. Digital commons and data sovereignty provide communities with affordable and easy access to information and the ability to collectively negotiate data-related decisions. Moreover, cooperative commons (a) provide access to the infrastructure that underpins the modern economy, (b) preserve property rights, and (c) ensure that privatization and monopolization do not further erode self-determination, especially in a world increasingly mediated by AI. Thus, governance plays a significant role in accelerating communities’/SMEs’ digital transformation and addressing their challenges. Cooperatives thrive on digital governance and standards such as open trusted Application Programming Interfaces (APIs) that increase the efficiency, technological capabilities, and capacities of participants and, most importantly, integrate, enable, and accelerate the digital transformation of SMEs in the overall process. This policy paper presents and discusses several transformative use cases for cooperative data governance. 
The use cases demonstrate how platform/data-cooperatives, and their novel value creation can be leveraged to take digital commons and value chains to a new level of collaboration while addressing the most pressing community issues. The proposed framework for a digital federated and sovereign reference architecture will create a blueprint for sustainable development both in the Global South and North.

Preprint COMMUNICATION | doi:10.20944/preprints202401.0780.v1

Data Reuse in Agricultural Genomics Research: Present Challenges and Future Solutions

Alenka Hafner, Victoria DeLeo, Cecilia Deng, Christine G. Elsik, Damarius Fleming, Peter W. Harrison, Theodore S. Kalbfleisch, Bruna Petry, Boas Pucker, Elsa H. Quezada-Rodríguez, Christopher K. Tuggle, James Koltes

Subject: Biology And Life Sciences, Agricultural Science And Agronomy Keywords: data reuse; agriculture; open data; metadata; data standards; equity

Online: 10 January 2024 (10:07:03 CET)

Preprint ARTICLE | doi:10.20944/preprints202311.0104.v1

Conceptual Design of a Generic Data Harmonization Process for OMOP CDM

Elisa Henke, Michele Zoch, Yuan Peng, Ines Reinecke, Martin Sedlmayr, Franziska Bathelt

Subject: Public Health And Healthcare, Other Keywords: OMOP; OHDSI; interoperability; data harmonization; clinical data; claims data

Online: 2 November 2023 (07:45:02 CET)

Preprint ARTICLE | doi:10.20944/preprints202308.1237.v1

A Method to Enable Automatic Extraction of Cost and Quantity Data from Hierarchical Construction Information Documents to Enable Rapid Digital Comparison and Analysis

Daniel Adanza Dopazo, Lamine Mahdjoubi, Bill Gething

Subject: Engineering, Transportation Science And Technology Keywords: data mining; data extraction; data science; cost infrastructure projects

Online: 17 August 2023 (09:25:22 CEST)

Preprint ARTICLE | doi:10.20944/preprints202306.1378.v1

Algorithm-based Data Generation (ADG) Engine for Data Analytics

Iman I. M. Abu Sulayman, Peter Voege, Abdelkader Ouda

Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Data Generation; Anomaly Data; User Behavior Generation; Big Data

Online: 19 June 2023 (16:31:37 CEST)

Preprint REVIEW | doi:10.20944/preprints202007.0153.v1

A Hitchhiker’s Guide to Working with Large, Open-Source Neuroimaging Datasets

Corey Horien, Stephanie Noble, Abigail Greene, Kangjoo Lee, Daniel Barron, Siyuan Gao, Dave O'Connor, Mehraveh Salehi, Javid Dadashkarimi, Xilin Shen, Evelyn Lake, R. Todd Constable, Dustin Scheinost

Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Open-science; big data; fMRI; data sharing; data management

Online: 8 July 2020 (11:53:33 CEST)

Preprint ARTICLE | doi:10.20944/preprints202407.2088.v1

The Case of Clean Customer Master Data for Customer Analytics: A Neglected Element for Data Monetization

Jasmin Singh, Heiko Gebauer

Subject: Business, Economics And Management, Business And Management Keywords: Customer analytics; data cleanliness; data harmonization; data integration; data monetization; digitization; digitalization; digital transformation; customer master data

Online: 25 July 2024 (16:53:25 CEST)

Preprint ARTICLE | doi:10.20944/preprints201810.0273.v1

Russian-German Astroparticle Data Life Cycle Initiative

Igor Bychkov, Andrey Demichev, Julia Dubenskaya, Oleg Fedorov, Andreas Haungs, Andreas Heiss, Yulia Kazarina, Elena Korosteleva, Dmitriy Kostunin, Alexander Kryukov, Andrey Mikhailov, Minh-Duc Nguyen, Stanislav Polyakov, Evgeny Postnikov, Alexey Shigarov, Dmitry Shipilov, Achim Streit, Viktoria Tokareva, Doris Wochele, Jürgen Wochele, Dmitry Zhurov

Subject: Physical Sciences, Astronomy And Astrophysics Keywords: astroparticle physics, cosmic rays, data life cycle management, data curation, meta data, big data, deep learning, open data

Online: 12 October 2018 (14:48:32 CEST)

Preprint ARTICLE | doi:10.20944/preprints202105.0589.v1

A Study on Ways to Extend Public Data for Game Ratings from Korea

HoSeong Kang, JungYoon Kim

Subject: Engineering, Automotive Engineering Keywords: Game Ratings; Public Data; Game Data; Data analysis; GRAC(Korea)

Online: 25 May 2021 (08:32:32 CEST)

Preprint ARTICLE | doi:10.20944/preprints202007.0078.v1

Data Driven Analytics for Personalized Medical Decision Making

Nataliia Melnykova, Nataliya Shakhovska, Michal Gregus, Volodymyr Melnykov, Mariana Zakharchuk, Olena Vovk

Subject: Computer Science And Mathematics, Information Systems Keywords: personalization; decision making; medical data; artificial intelligence; Data-driving; Big Data; Data Mining; Machine Learning

Online: 5 July 2020 (15:04:17 CEST)

Preprint ARTICLE | doi:10.20944/preprints202404.0849.v1

Intellecta Cognitiva: A Comprehensive Dataset for Advancing Academic Knowledge and Machine Reasoning

Ditto PS, Ajmal PS, Jithin VG

Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Synthetic data; pretrain data; llm training

Online: 12 April 2024 (12:46:27 CEST)

Preprint ARTICLE | doi:10.20944/preprints202103.0593.v1

Creating a Business and Supporting Digital Transformation

Miguel Ayala, Jorge Portella, Sergio Martinez, Maria Rojas, Luis Jimenez

Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Business Intelligence; Data Mining; Data Warehouse

Online: 24 March 2021 (13:47:31 CET)

Preprint ARTICLE | doi:10.20944/preprints202012.0468.v1

Developing High-Resolution Gridded Rainfall and Temperature Data for Bangladesh: The ENACTS-BMD Dataset

Nachiketa Acharya, Rija Faniriantsoa, Bazlur Rashid, Razia Sultana, Carlo Montes, Tufa Dinku, S.M.Q. Hassan

Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: climate data; gridded product; data merging

Online: 18 December 2020 (13:29:38 CET)

Preprint CASE REPORT | doi:10.20944/preprints201801.0066.v1

Data Visualization of European Regional Operational Programmes: Unleashing the Informative Potential of Open Data for Performance Assessment

Emanuele Frontoni, Roberto Palloni

Subject: Engineering, Control And Systems Engineering Keywords: cohesion policy; data visualization; open data

Online: 8 January 2018 (11:11:47 CET)

Preprint ARTICLE | doi:10.20944/preprints202403.0012.v1

Flexible Techniques to Detect Typical Hidden Errors in Large Longitudinal Datasets

Renato Bruni, Cinzia Daraio, Simone Di Leo

Subject: Computer Science And Mathematics, Computer Science Keywords: big data; information processing; information reconstruction; data quality; longitudinal data sequences

Online: 1 March 2024 (10:33:16 CET)

Preprint COMMUNICATION | doi:10.20944/preprints202309.0047.v1

Analyzing Public Reactions during the MPox Outbreak: Findings from Topic Modeling of Tweets

Nirmalya Thakur, Yuvraj Nihal Duggal, Zihui Liu

Subject: Public Health And Healthcare, Public Health And Health Services Keywords: MPox; big data; data analysis; data science; Twitter; natural language processing

Online: 1 September 2023 (10:23:41 CEST)

In the last decade and a half, the world has experienced the outbreak of a range of viruses such as COVID-19, H1N1, flu, Ebola, Zika Virus, Middle East Respiratory Syndrome (MERS), Measles, and West Nile Virus, just to name a few. During these virus outbreaks, the usage and effectiveness of social media platforms increased significantly as such platforms served as virtual communities, enabling their users to share and exchange information, news, perspectives, opinions, ideas, and comments related to the outbreaks. Analysis of this Big Data of conversations related to virus outbreaks using concepts of Natural Language Processing such as Topic Modeling has attracted the attention of researchers from different disciplines such as Healthcare, Epidemiology, Data Science, Medicine, and Computer Science. The recent outbreak of the MPox virus has resulted in a tremendous increase in the usage of Twitter. Prior works in this field have primarily focused on the sentiment analysis and content analysis of these Tweets, and the few works that have focused on topic modeling have multiple limitations. This paper aims to address this research gap and makes two scientific contributions to this field. First, it presents the results of performing Topic Modeling on 601,432 Tweets about the 2022 Mpox outbreak, which were posted on Twitter between May 7, 2022, and March 3, 2023. The results indicate that the conversations on Twitter related to Mpox during this time range may be broadly categorized into four distinct themes: Views and Perspectives about MPox, Updates on Cases and Investigations about Mpox, MPox and the LGBTQIA+ Community, and MPox and COVID-19. Second, the paper presents the findings from the analysis of these Tweets. The results show that the theme that was most popular on Twitter (in terms of the number of Tweets posted) during this time range was Views and Perspectives about MPox. 
It is followed by the theme of MPox and the LGBTQIA+ Community, which is followed by the themes of MPox and COVID-19 and Updates on Cases and Investigations about Mpox, respectively. Finally, a comparison with prior works in this field is also presented to highlight the novelty and significance of this research work.

Preprint ARTICLE | doi:10.20944/preprints202205.0344.v1

Transforming Points of Single Contact Data into Linked Data

Pavlina Fragkou, Leandros Maglaras

Subject: Computer Science And Mathematics, Information Systems Keywords: Linked (open) Data; Semantic Interoperability; Data Mapping; Governmental Data; SPARQL; Ontologies

Online: 25 May 2022 (08:18:46 CEST)

Preprint ARTICLE | doi:10.20944/preprints202111.0073.v1

Using the Data Quality Dashboard to Improve the EHDEN Network

Clair Blacketer, Erica A Voss, Frank DeFalco, Nigel Hughes, Martijn J Schuemie, Maxim Moinat, Peter Rijnbeek

Subject: Medicine And Pharmacology, Other Keywords: data quality; OMOP CDM; EHDEN; healthcare data; real world data; RWD

Online: 3 November 2021 (09:12:54 CET)

Preprint ARTICLE | doi:10.20944/preprints202110.0103.v1

Usage of Data Analytics in Improving Sourcing of Supply Chain Inputs

S M Nazmuz Sakib

Subject: Computer Science And Mathematics, Information Systems Keywords: Data Analytics; Analytics; Supply Chain Input; Supply Chain; Data Science; Data

Online: 6 October 2021 (10:38:42 CEST)

Preprint ARTICLE | doi:10.20944/preprints202310.1998.v1

Marburg Virus Outbreak and a New Conspiracy Theory: Findings from a Comprehensive Analysis of Web Behavior

Nirmalya Thakur, Shuqi Cui, Kesha A. Patel, Nazif Azizi, Victoria Knieling, Changhee Han, Audrey Poon, Rishika Shah

Subject: Public Health And Healthcare, Public Health And Health Services Keywords: Marburg virus; big data; data mining; data analysis; google trends; web behavior; data science; conspiracy theory

Online: 31 October 2023 (07:02:07 CET)

During virus outbreaks in the recent past, web behavior mining, modeling, and analysis have served as means to examine, explore, interpret, assess, and forecast the worldwide perception, readiness, reactions, and response linked to these outbreaks. The recent outbreak of the Marburg Virus disease (MVD), the high fatality rate of MVD, and the conspiracy theory linking the FEMA alert signal in the United States on October 4, 2023, with MVD and a zombie outbreak resulted in a diverse range of reactions among the general public, which led to a surge in web behavior in this context. This resulted in “Marburg Virus” featuring in the list of the top trending topics on Twitter on October 3, 2023, and “Emergency Alert System” and “Zombie” featuring in the list of top trending topics on Twitter on October 4, 2023. No prior work in this field has mined and analyzed the emerging trends in web behavior in this context. The work presented in this paper aims to address this research gap and makes multiple scientific contributions to this field. First, it presents the results of performing time series forecasting of the search interests related to MVD emerging from 216 different regions on a global scale using ARIMA, LSTM, and Autocorrelation. The results of this analysis present the optimal model for forecasting web behavior related to MVD in each of these regions. Second, the correlation between search interests related to MVD and search interests related to zombies (in the context of this conspiracy theory) was investigated. The findings show that there were several regions where there was a statistically significant correlation between MVD-related searches and zombie-related searches (in the context of this conspiracy theory) on Google on October 4, 2023. Finally, the correlation between zombie-related searches (in the context of this conspiracy theory) in the United States and other regions was investigated. This analysis helped to identify those regions where this correlation was statistically significant.
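
The abstract does not state which correlation coefficient or significance test was used. As a minimal sketch, assuming Pearson's r and using hypothetical search-interest values (the function names and the numbers are illustrative and not taken from the paper), the correlation between two search-interest series and its t-statistic could be computed as follows:

```python
from math import sqrt, inf


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return inf if r > 0 else -inf
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical search-interest values for two query groups in one region.
mvd_interest = [1, 2, 3, 4, 5]
zombie_interest = [2, 1, 4, 3, 5]

r = pearson_r(mvd_interest, zombie_interest)   # 0.8 for these values
t = t_statistic(r, len(mvd_interest))          # about 2.31
```

Whether the correlation counts as statistically significant then depends on comparing the t-statistic with the critical value of Student's t distribution for n - 2 degrees of freedom at the chosen significance level; that lookup is left out of the sketch.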

Preprint ARTICLE | doi:10.20944/preprints202308.0442.v1

Instrumental and Observational Problems of the Earliest Temperature Records in Italy: A Methodology for Data Recovery and Correction

Dario Camuffo, Antonio Della Valle, Francesca Becherini

Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Thermometers; Temperature records; Early instrumental meteorological series; Data rescue; Data recovery; Data correction; Climate data analysis

Online: 7 August 2023 (03:01:24 CEST)

A distinction is made between data rescue (i.e., copying, digitizing and archiving) and data recovery, which implies deciphering, interpreting and transforming early instrumental readings and their metadata to obtain high-quality datasets in modern units. This requires a multidisciplinary approach that includes: palaeography and knowledge of Latin and other languages to read the handwritten logs and additional documents; history of science to interpret the original text, data and metadata within the cultural frame of the 17th, 18th and early 19th century; physics and technology to recognize bias of early instruments or calibrations, or to correct for observational bias; astronomy to calculate and transform the original time in canonical hours that started from twilight. The liquid-in-glass thermometer was invented in 1641 and the earliest temperature records started in 1654. Since then, different types of thermometers were invented, based on the thermal expansion of air or selected thermometric liquids with deviation from linearity. Reference points, thermometric scales and calibration methodologies were not comparable, and not always adequately described. Thermometers had various locations and exposures, e.g., indoor, outdoor, on windows, gardens or roofs, facing different directions. Readings were made only one or a few times a day, not necessarily respecting a precise time schedule: this bias is analysed for the most popular combinations of reading times. The time was based on sundials and local Sun, but the hours were counted starting from twilight. In 1789-90 Italy changed its system and all cities counted hours from their lower culmination (i.e., local midnight), so that every city had its local time; in 1866, all the Italian cities followed the local time of Rome; in 1893, the whole of Italy adopted the present-day system, based on the Coordinated Universal Time and the time zones. In 1873, when the International Meteorological Organization (IMO) was founded, later transformed into the World Meteorological Organization (WMO), a standardization of instruments and observational protocols was established, and all data became fully comparable. In the early instrumental period, from 1654 to 1873, the comparison, correction and homogenization of records is quite difficult, mainly because of the scarcity or even absence of metadata. This paper deals with this confused situation, discussing the main problems, as well as the methodologies to recognize missing metadata; distinguish indoor from outdoor readings; correct and transform early datasets in unknown or arbitrary units into modern units; and, finally, determine in which cases it is possible to reach the quality level required by WMO. The focus is to explain the methodology needed to recover early instrumental records, i.e., the operations that should be performed to interpret, correct, and transform the original raw data into a high-quality dataset of temperature, usable for climate studies.

Preprint DATA DESCRIPTOR | doi:10.20944/preprints202308.1701.v1

A Dataset of Search Interests Related to Disease X Originating from Different Geographic Regions

Nirmalya Thakur, Kesha A. Patel, Isabella Hall, Yuvraj Nihal Duggal, Shuqi Cui

Subject: Public Health And Healthcare, Public Health And Health Services Keywords: disease X; big data; data science; data analysis; dataset development; database; google trends; data mining; healthcare; epidemiology

Online: 24 August 2023 (05:48:54 CEST)

Preprint COMMUNICATION | doi:10.20944/preprints202303.0453.v1

Analysis of Public Discourse on Twitter involving COVID-19 and MPox: Findings from Sentiment Analysis and Text Analysis

Nirmalya Thakur

Subject: Social Sciences, Media Studies Keywords: COVID-19; MPox; Twitter; Big Data; Data Mining; Data Analysis; Sentiment Analysis; Data Science; Social Media; Monkeypox

Online: 27 March 2023 (08:39:28 CEST)

Mining and analysis of the Big Data of Twitter conversations have been of significant interest to the scientific community in the fields of healthcare, epidemiology, big data, data science, computer science, and their related areas, as can be seen from several works in the last few years that focused on sentiment analysis and other forms of text analysis of Tweets related to Ebola, E-Coli, Dengue, Human papillomavirus (HPV), Middle East Respiratory Syndrome (MERS), Measles, Zika virus, H1N1, influenza-like illness, swine flu, flu, Cholera, Listeriosis, cancer, Liver Disease, Inflammatory Bowel Disease, kidney disease, lupus, Parkinson's, Diphtheria, and West Nile virus. The recent outbreaks of COVID-19 and MPox have served as "catalysts" for Twitter usage related to seeking and sharing information, views, opinions, and sentiments involving both these viruses. While a few works published in the last few months focused on performing sentiment analysis of Tweets related to either COVID-19 or MPox, none of the prior works in this field thus far involved analysis of Tweets focusing on both COVID-19 and MPox at the same time. To address this research gap, a total of 61,862 Tweets that focused on MPox and COVID-19 simultaneously, posted between May 7, 2022, and March 3, 2023, were studied to perform sentiment analysis and text analysis. The findings of this study are manifold. First, the results of sentiment analysis show that almost half the Tweets (the actual percentage is 46.88%) had a negative sentiment. These were followed by Tweets that had a positive sentiment (31.97%) and Tweets that had a neutral sentiment (21.14%). Second, this paper presents the top 50 hashtags that were used in these Tweets. Third, it presents the top 100 most frequently used words that are featured in these Tweets. The findings of text analysis show that some of the commonly used words involved directly referring to either or both viruses. In addition to this, the presence of words such as "Polio", "Biden", "Ukraine", "HIV", "climate", and "Ebola" in the list of the top 100 most frequent words indicates that topics of conversations on Twitter in the context of COVID-19 and MPox also included a high level of interest related to other viruses, President Biden, and Ukraine. Finally, a comprehensive comparative study that involves a comparison of this work with 49 prior works in this field is presented to uphold the scientific contributions and relevance of the same.

Working Paper ARTICLE

Business Intelligence and Its Big Evolution

Andres Velosa, Gustavo Pabon

Subject: Engineering, Automotive Engineering Keywords: Business Intelligence; Data warehouse; Data Marts; Architecture; Data; Information; cloud; Data Mining; evolution; technologic companies; tools; software

Online: 24 March 2021 (13:06:53 CET)

Preprint ARTICLE | doi:10.20944/preprints202311.1570.v1

Consore: A Powerful Federated Data Mining Tool Driving a French Research Network to Accelerate Cancer Research

Julien Guérin, Amine Nahid, Louis Tassy, Marc Deloger, François Bocquet, Simon Thézenas, Emmanuel Desandes, Marie-Cécile Le Deley, Xavier Durando, Anne Jaffré, Ikram Es Saad, Hugo Crochet, Marie Le Morvan, François Lion, Judith Raimbourg, Oussama Khay, Franck Craynest, Alexia Giro, Yec'han Laizet, Aurélie Bertaut, Frédérik Joly, Alain Livartowski, Pierre Etienne Heudel

Subject: Public Health And Healthcare, Public Health And Health Services Keywords: cancer research; cancer; natural language processing; data mining; data warehouse; big data

Online: 26 November 2023 (05:13:14 CET)

Preprint ARTICLE | doi:10.20944/preprints202111.0410.v1

Design and Implementation of Efficient Transmission of Cloud Data in Wireless Media

Virendra Pandharipant Nikam, Sheetal S Dhande

Subject: Engineering, Control And Systems Engineering Keywords: Data compression; data hiding; psnr; mse; virtual data; public cloud; quantization error

Online: 22 November 2021 (15:17:12 CET)

Preprint ARTICLE | doi:10.20944/preprints201808.0350.v2

Integration of Data Mining Clustering Approach with the Personalized E-Learning System

Samina Kausar, Huahu Xu, Iftikhar Hussain, Wenhau Zhu, Misha Zahid

Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: big data; clustering; data mining; educational data mining; e-learning; profile learning

Online: 19 October 2018 (05:58:05 CEST)

Preprint REVIEW | doi:10.20944/preprints201807.0059.v1

Data Normalization in NMR-based Metabolomics

Helena Zacharias, Michael Altenbuchinger, Wolfram Gronwald

Subject: Biology And Life Sciences, Biophysics Keywords: data normalization; data scaling; zero-sum; metabolic fingerprinting; NMR; statistical data analysis

Online: 3 July 2018 (16:22:31 CEST)

Preprint ARTICLE | doi:10.20944/preprints202404.0357.v3

The Path to Data Protection Governance in China Mainland

Bing Chen, Yongji Liu

Subject: Social Sciences, Law Keywords: data protection; personal privacy; cybersecurity; data security

Online: 24 April 2024 (14:20:16 CEST)

Preprint ARTICLE | doi:10.20944/preprints202402.1372.v1

Leveraging Visualization and Machine Learning Techniques in Education: A Case Study of K-12 State Assessment Data

Loni Taylor, Vibhuti Gupta, Kwanghee Jung

Subject: Computer Science And Mathematics, Analysis Keywords: Data Visualization; Big Data; AI; Machine Learning

Online: 23 February 2024 (10:39:04 CET)

Working Paper ARTICLE

The Analysis and the Measurement of Poverty: An Interval Based Composite Indicator Approach

Carlo Drago

Subject: Business, Economics And Management, Econometrics And Statistics Keywords: poverty; composite indicators; interval data; symbolic data

Online: 24 August 2021 (15:46:09 CEST)

Working Paper ARTICLE

Development of Cost and Schedule Data Integration Algorithm based on Big Data Technology

Daegu Cho, Myungdo Lee, Jihye Shin

Subject: Computer Science And Mathematics, Computer Science Keywords: big data; data integration; EVMS; construction management

Online: 30 October 2020 (15:35:00 CET)

Preprint ARTICLE | doi:10.20944/preprints201701.0090.v1

An Automatic Matcher and Linker for Transportation Datasets

Ali Masri, Karine Zeitouni, Zoubida Kedad, Bertrand Leroy

Subject: Computer Science And Mathematics, Information Systems Keywords: transportation data; data interlinking; automatic schema matching

Online: 20 January 2017 (03:38:06 CET)

Preprint ARTICLE | doi:10.20944/preprints202407.1459.v1

Optimising Clinical Epidemiology in Disease Outbreaks: Analysis of ISARIC-WHO COVID-19 Case Report Form Utilisation

Laura Merson, Sara Duque, Esteban Garcia-Gallo, Trokon Omarley Yeabah, Jamie Rylance, Janet Diaz, Antoine Flahault, ISARIC Clinical Characterisation Group

Subject: Public Health And Healthcare, Public Health And Health Services Keywords: clinical epidemiology; infectious disease outbreaks; data collection; data management; common data elements; ISARIC

Online: 18 July 2024 (09:53:41 CEST)

Preprint ARTICLE | doi:10.20944/preprints202308.1391.v1

An Automated Method for Extracting and Analyzing Railway Infrastructure Cost Data

Daniel Adanza Dopazo, Lamine Mahdjoubi, Bill Gething

Subject: Engineering, Transportation Science And Technology Keywords: data extraction; data mining; railway infrastructure costs; infrastructure costs data analysis; cost analysis

Online: 18 August 2023 (16:03:08 CEST)

Working Paper ARTICLE

Model for the Collection and Analysis of Data from Teachers and Students, Supported by Academic Analytics

Fredys A. Simanca H., Isabel Hernández Arteaga, María Elsa Unriza Puin, Fabian Blanco Garrido, Jaime Paez Paez, Jairo Cortes Méndez

Subject: Computer Science And Mathematics, Information Systems Keywords: Academic Analytics; data storage; education and big data; analysis of data; learning analytics

Online: 19 July 2020 (20:37:39 CEST)

Business Intelligence, defined by [1] as "the ability to understand the interrelations of the facts that are presented in such a way that it can guide the action towards achieving a desired goal", has been used since 1958 for the transformation of data into information, and of information into knowledge, to be used when making decisions in a business environment. But what would happen if we took the same principles of business intelligence and applied them to the academic environment? The answer would be the creation of Academic Analytics, a term defined by [2] as the process of evaluating and analyzing organizational information from university systems for reporting and making decisions, whose characteristics allow it to be used more and more in institutions, since the information they accumulate about their students and teachers includes data such as academic performance, student success, persistence, and retention [5]. Academic Analytics enables an analysis of data that is very important for making decisions in the educational institutional environment, aggregating valuable information in academic research activity and providing easy-to-use business intelligence tools. This article presents a proposal for creating an information system based on Academic Analytics, using ASP.Net technology and relying on the Microsoft SQL Server database engine for storage, and designs a model supported by Academic Analytics for the collection and analysis of data from the information systems of educational institutions. The proposed system is capable of displaying statistics on the historical data of students and teachers taken over academic periods, without having direct access to institutional databases, with the purpose of gathering the information that the director, the teacher, and finally the student need for making decisions. The model was validated with information taken from students and teachers during the last five years, and the data were exported as pdf, csv, and xls files. The findings allow us to state that it is extremely important to analyze the data held in the information systems of educational institutions for making decisions. After the validation of the model, it was established that students must know the reports of their academic performance in order to carry out a process of self-evaluation, that teachers must be able to see the results of the data obtained in order to carry out processes of self-evaluation and to adapt content and dynamics in the classroom, and that the head of the program needs these results to make decisions.

Preprint ARTICLE | doi:10.20944/preprints201812.0071.v1

Data Governance and Sovereignty in Urban Data Spaces Based on Standardized ICT Reference Architectures

Silke Cuno, Lina Bruns, Nikolay Tcholtchev, Philipp Lämmel, Ina Schieferdecker

Subject: Engineering, Electrical And Electronic Engineering Keywords: data governance; data sovereignty; urban data spaces; ICT reference architecture; open urban platform

Online: 6 December 2018 (05:09:54 CET)

Preprint ARTICLE | doi:10.20944/preprints202110.0260.v1

Online System for Power Quality Operational Data Management in Frequency Monitoring using Python and Grafana

Jose-María Sierra-Fernández, Olivia Florencias-Oliveros, Manuel-Jesús Espinosa-Gavira, Juan-José González-de-la-Rosa, Agustín Agüera-Pérez, José-Carlos Palomares-Salas

Subject: Engineering, Electrical And Electronic Engineering Keywords: big data; data acquisition; data visualization; data exchange; dashboard; frequency stability; Grafana lab; Power Quality; GPS reference; frequency measurement

Online: 18 October 2021 (18:07:43 CEST)

Preprint DATA DESCRIPTOR | doi:10.20944/preprints202109.0370.v1

The SERL Observatory Dataset: Longitudinal Smart Meter Electricity and Gas Data, Survey, EPC and Climate Data for Over 13,000 GB Households

Ellen Webborn, Jessica Few, Eoghan McKenna, Simon Elam, Martin Pullinger, Ben Anderson, David Shipworth, Tadj Oreszczyn

Subject: Engineering, Energy And Fuel Technology Keywords: smart meter data; household survey; EPC; energy data; energy demand; energy consumption; longitudinal; energy modelling; electricity data; gas data

Online: 22 September 2021 (10:16:05 CEST)

Preprint ARTICLE | doi:10.20944/preprints201807.0038.v1

Towards the Provision of Accurate Atomic Data for Neutral Iron

Andrew Conroy, Catherine Ramsbottom, Connor Ballance, Francis Keenan

Subject: Physical Sciences, Atomic And Molecular Physics Keywords: atomic data

Online: 3 July 2018 (11:25:13 CEST)

Preprint ARTICLE | doi:10.20944/preprints202406.1070.v1

The Mental Health Index across the Italian Regions in the ESG Context

Emanuela Resta, Giancarlo Logroscino, Silvio Tafuri, Preethymol Peter, Noviello Chiara, Alberto Costantiello, Angelo Leogrande

Subject: Business, Economics And Management, Econometrics And Statistics Keywords: ESG; Mental Health Index; Panel Data; Data Analysis

Online: 17 June 2024 (08:33:43 CEST)

Preprint ARTICLE | doi:10.20944/preprints202404.0740.v1

Functional Process Control (FPC): A Methodology to Reduce Variability

Joaquín Sancho, Javier Martínez, Jorge Pastor, Carlos Cajal

Subject: Computer Science And Mathematics, Applied Mathematics Keywords: functional data; quality; non-normal data; variability; outlier

Online: 10 April 2024 (15:52:41 CEST)

Preprint ARTICLE | doi:10.20944/preprints202309.1016.v1

Three-Stage Sampling Algorithm for Highly Imbalanced Multi-Classification Time Series Data Sets

Haoming Wang

Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Imbalanced data; Data preprocessing; Sampling; Tomek Links; DTW

Online: 14 September 2023 (14:00:42 CEST)

Purpose: To alleviate the data imbalance problem caused by subjective and objective reasons, scholars have developed different data preprocessing algorithms, among which undersampling algorithms are widely used because they are fast and efficient. However, when the number of samples of some categories in a multi-classification dataset is too small to be processed by sampling, or a minority class has only 1 to 2 samples, traditional undersampling algorithms become less effective. Methods: This study selects 9 multi-classification time series datasets with extremely few samples as its objects, fully considers the characteristics of time series data, and uses a three-stage algorithm to alleviate the data imbalance problem. Stage one: random oversampling with disturbance items increases the number of sample points. Stage two: on this basis, SMOTE (Synthetic Minority Oversampling Technique) oversampling is applied. Stage three: dynamic time warping distance is used to calculate the distance between sample points, identify the sample points that form Tomek Links at the boundary, and clean up the boundary noise. Results: This study proposes a new sampling algorithm. On the 9 multi-classification time series datasets with extremely few samples, the new sampling algorithm is compared with four classic undersampling algorithms, ENN (Edited Nearest Neighbours), NCR (Neighborhood Cleaning Rule), OSS (One Side Selection) and RENN (Repeated Edited Nearest Neighbours), based on macro accuracy, recall rate and F1-score evaluation indicators. The results show that on FiftyWords, the dataset with the most categories and the fewest minority class samples among the 9 selected, the accuracy of the new sampling algorithm is 0.7156, far beyond ENN, RENN, OSS and NCR; its recall rate, at 0.7261, is also better than that of the four undersampling algorithms used for comparison; and its F1-score is increased by 200.71%, 188.74%, 155.29% and 85.61% relative to ENN, RENN, OSS, and NCR, respectively. On the other 8 datasets, the new sampling algorithm also shows good indicator scores. Conclusion: The new algorithm proposed in this study can effectively alleviate the data imbalance problem of multi-classification time series datasets with many categories and few minority class samples, and at the same time clean up the boundary noise data between classes.
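
Stage three of the method described above can be illustrated with a small pure-Python sketch. It is not the author's implementation: the function names are hypothetical, the DTW variant (absolute-difference cost, no warping window) is an assumption, and a Tomek Link is taken in its usual sense of two samples from different classes that are each other's nearest neighbour.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Assumes an absolute-difference local cost and no warping-window constraint.
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # step in a only
                                     cost[i][j - 1],      # step in b only
                                     cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def tomek_links(series, labels):
    """Return index pairs (i, j), i < j, that form Tomek Links under DTW.

    A pair qualifies when the two samples are mutual nearest neighbours
    and carry different class labels.
    """
    def nearest(i):
        others = (j for j in range(len(series)) if j != i)
        return min(others, key=lambda j: dtw_distance(series[i], series[j]))

    nn = [nearest(i) for i in range(len(series))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Hypothetical toy data: samples 0 and 1 sit next to each other but differ in class.
series = [[0, 0, 0], [0, 0, 1], [5, 5, 5], [5, 5, 6]]
labels = ["a", "b", "b", "b"]
links = tomek_links(series, labels)  # [(0, 1)]
```

Which member of each link is removed during cleaning is a design choice the abstract does not specify, so the sketch stops at identifying the pairs.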

Preprint ARTICLE | doi:10.20944/preprints202307.1117.v1

Design and Analysis of Query Models Database Preservation Information Systems Digitization of History and Endowments; Case Study of History and Waqf of Sumedang Larang Kingdom Indonesia

R. Sudrajat, Budi Nurani Ruchjana, Atje Setiawan Abdullah, Rahmat Budiarto

Subject: Computer Science And Mathematics, Information Systems Keywords: history; endowments; query model; digital data; physical data

Online: 17 July 2023 (15:11:18 CEST)

Preprint COMMUNICATION | doi:10.20944/preprints202305.1694.v1

Synthetic Data & the Future of Women's Health: A Synergistic Relationship

Gayathri Delanerolle, Peter Phiri, Heitor Cavalini, David Benfield, Ashish Shetty, Yassine Bouchareb, Jian Shi, Alain Zemkoho

Subject: Medicine And Pharmacology, Clinical Medicine Keywords: Women's Health; Data Science; Data Methods; Artificial Intelligence

Online: 24 May 2023 (04:48:58 CEST)

Preprint ARTICLE | doi:10.20944/preprints202206.0335.v1

The DataHarmonizer: A Tool for Faster Data Harmonization, Validation, Aggregation, and Analysis of Pathogen Genomics Contextual Information

Ivan Gill, Emma Griffiths, Damion Dooley, Rhiannon Cameron, Sarah Savić Kallesøe, Nithu Sara John, Anoosha Sehar, Gurinder Gosal, David Alexander, Madison Chapel, Matthew Croxen, Benjamin Delisle, Rachelle Di Tullio, Daniel Gaston, Ana Duggan, Jennifer Guthrie, Mark Horsman, Esha Joshi, Levon Kearney, Natalie Knox, Lynette Lau, Jason LeBlanc, Vincent Li, Pierre Lyons, Keith MacKenzie, Andrew McArthur, Emilie Panousis, John Palmer, Natalie Prystajecky, Kerri Smith, Jennifer Tanner, Christopher Townend, Andrea Tyler, Gary Van Domselaar, William Hsiao

Subject: Computer Science And Mathematics, Information Systems Keywords: metadata; contextual data; harmonization; genomic surveillance; data management

Online: 24 June 2022 (08:46:04 CEST)

Pathogen genomics is a critical tool for public health surveillance, infection control, outbreak investigations, as well as research. In order to make use of pathogen genomics data, it must be interpreted using contextual data (metadata). Contextual data includes sample metadata, laboratory methods, patient demographics, clinical outcomes, and epidemiological information. However, the variability in how contextual information is captured by different authorities and how it is encoded in different databases poses challenges for data interpretation, integration, and its use/re-use. The DataHarmonizer is a template-driven spreadsheet application for harmonizing, validating, and transforming genomics contextual data into submission-ready formats for public or private repositories. The tool’s web browser-based JavaScript environment enables validation, and its offline functionality and local installation increase data security. The DataHarmonizer was developed to address the data sharing needs that arose during the COVID-19 pandemic, and was used by members of the Canadian COVID Genomics Network (CanCOGeN) to harmonize SARS-CoV-2 contextual data for national surveillance and for public repository submission. In order to support coordination of international surveillance efforts, we have partnered with the Public Health Alliance for Genomic Epidemiology to also provide a template conforming to its SARS-CoV-2 contextual data specification for use worldwide. Templates are also being developed for One Health and foodborne pathogens. Overall, the DataHarmonizer tool improves the effectiveness and fidelity of contextual data capture as well as its subsequent usability. Harmonization of contextual information across authorities, platforms and systems globally improves interoperability and reusability of data for concerted public health and research initiatives to fight the current pandemic and future public health emergencies. While the tool was initially developed for the COVID-19 pandemic, its expansion to other data management applications and pathogens is already underway.

Preprint ARTICLE | doi:10.20944/preprints202108.0471.v1

Identifying the Main Risk Factors for CVD Prediction Using Machine Learning Algorithms

Luis Rolando Guarneros-Nolasco, Nancy Aracely Cruz-Ramos, Giner Alor-Hernández, Lisbeth Rodríguez-Mazahua, José Luis Sánchez-Cervantes

Subject: Computer Science And Mathematics, Information Systems Keywords: Big data; Health prevention; Machine learning; Medical data

Online: 24 August 2021 (14:00:12 CEST)
