ARTICLE | doi:10.20944/preprints201808.0029.v1
Subject: Earth Sciences, Environmental Sciences Keywords: Landsat, analysis ready data, collection 1
Online: 1 August 2018 (20:03:52 CEST)
Data that have been processed to allow analysis with a minimum of additional user effort are often referred to as Analysis Ready Data (ARD). The ability to perform large-scale Landsat analysis relies on access to observations that are geometrically and radiometrically consistent and in which non-target features (e.g., clouds) and poor-quality observations have been flagged so that they can be excluded. The United States Geological Survey (USGS) has processed the entire Landsat 4 and 5 Thematic Mapper (TM), Landsat 7 Enhanced Thematic Mapper Plus (ETM+), and Landsat 8 Operational Land Imager (OLI) and Thermal Infrared Sensor (TIRS) archive over the conterminous United States (CONUS), Alaska, and Hawaii into Landsat ARD. The ARD are available to significantly reduce the burden of pre-processing on users of Landsat data. Provision of pre-prepared ARD is intended to make it easier for users to produce Landsat-based maps of land cover and land-cover change and other derived geophysical and biophysical products. The ARD are provided as tiled, georegistered, top-of-atmosphere and atmospherically corrected products defined in a common equal-area projection, accompanied by spatially explicit quality assessment information.
REVIEW | doi:10.20944/preprints202203.0407.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: big data analytics; healthcare; data technologies; decision making; information management; EHR
Online: 31 March 2022 (12:24:19 CEST)
Big data analytics tools apply advanced analytic techniques to large and diverse volumes of data that include structured, semi-structured, and unstructured data from different sources and in different sizes, from terabytes to zettabytes. The health sector is faced with the need to generate and manage large data sets from various health systems, such as electronic health records and clinical decision support systems. These data can be used by providers, clinicians, and policymakers to plan and implement interventions, detect disease more quickly, predict outcomes, and personalize care delivery. However, little attention has been paid to the connection between big data analytics tools and the health sector. Thus, a systematic review of the bibliometric literature (LRSB) was developed to study how the adoption of big data analytics tools and infrastructures will revolutionize the healthcare industry. The review integrated 77 scientific and/or academic documents indexed in SCOPUS, presenting up-to-date knowledge on how big data analytics technologies influence the healthcare sector and the different big data analytical tools used. The LRSB provides findings on the impact of big data analytics on the health sector, introducing opportunities and technologies that provide practical solutions to various challenges.
ARTICLE | doi:10.20944/preprints202005.0274.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: big data; deep learning; intelligent systems; medical imaging; multi-data processing
Online: 16 May 2020 (17:43:42 CEST)
Big Data in medicine involves the fast processing of large data sets, both current and historical, to support the diagnosis and therapy of patients' diseases. Support systems for these activities may include pre-programmed rules based on data obtained from the medical interview, together with automatic analysis of diagnostic test results, leading to the classification of observations to a specific disease entity. The current Big Data revolution significantly expands the role of computer science in achieving these goals, which is why we propose a Big Data processing system that uses artificial intelligence to analyze and process medical images.
ARTICLE | doi:10.20944/preprints201608.0202.v2
Subject: Earth Sciences, Environmental Sciences Keywords: HR satellite remote sensing; urban fabric vulnerability; UHI & heat waves; landsat & MODIS sensors; LST & urban heating; segmentation & objects classification; data mining; feature extraction & selection; stepwise regression & model calibration
Online: 26 October 2021 (13:11:23 CEST)
Densely urbanized areas, with a low percentage of green vegetation, are highly exposed to Heat Waves (HW), which are nowadays increasing in frequency and intensity even in middle-latitude regions due to ongoing Climate Change (CC). Their negative effects may combine with those of the UHI (Urban Heat Island), a local phenomenon whereby air temperatures in the compact built-up cores of towns increase more than those in the surrounding rural areas, with significant impacts on the quality of the urban environment, on citizens' health, and on energy consumption and transport, as occurred in the summer of 2003 in France and in central-northern Italy. In this context, this work aims at designing and developing a methodology based on medium-high resolution aero-spatial remote sensing (EO) and recent GIS techniques for the extensive characterization of the urban fabric response to these temperature-related climatic impacts, within the general framework of supporting local and national strategies and policies of adaptation to CC. Due to its extension and variety of built-up typologies, the municipality of Rome was selected as the test area for the methodology development and validation. First of all, we operated through photointerpretation of cartography at a detailed scale (CTR 1:5000) on a reference area consisting of a transect of about 5x20 km, extending from the downtown to the suburbs and including all the built-up classes of interest. The reference built-up vulnerability classes found inside the transect were then exploited as training areas to classify the entire territory of the Rome municipality. To this end, the satellite EO HR (High Resolution) multispectral data provided by the Landsat sensors were used within a purpose-developed "supervised" classification procedure based on data mining and "object-classification" techniques. The classification results were then exploited to implement a calibration method, based on a typical UHI temperature distribution derived from MODIS satellite sensor LST (Land Surface Temperature) data of the summer of 2003, to obtain an analytical expression of the vulnerability model previously introduced on a semi-empirical basis.
Subject: Social Sciences, Econometrics & Statistics Keywords: poverty; composite indicators; interval data; symbolic data
Online: 24 August 2021 (15:46:09 CEST)
The analysis and measurement of poverty is a crucial issue in the field of social science. Poverty is a multidimensional notion that can be measured using composite indicators that synthesize several statistical indicators. Such indicators can, however, be affected by subjective choices. We propose interval-based composite indicators to avoid this problem, enabling us to obtain robust and reliable measures. Starting from a relevant conceptual model of poverty, we identify all the various factors to consider. Then, for each different random configuration of those factors, we compute a different composite indicator. In this way we obtain, for each region, a different interval based on the distinct factor choices, i.e., on the different assumptions for constructing the composite indicator. So we create an interval-based composite indicator from the results of a Monte Carlo simulation over all the different assumptions. The resulting intervals can be compared, and various rankings for poverty can be obtained. The poverty interval composite indicator can be considered and compared through its parameters, such as center, minimum, maximum, and range. The results demonstrate a relevant and consistent measurement of the indicator and the relevant impact of the shadow sector on the final measures.
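As a rough illustration of the interval construction described above, the following minimal Python sketch (the data, the random weighting configurations, and the weighted-average aggregation are hypothetical assumptions, not the authors' specification) keeps the per-region minimum and maximum over Monte Carlo draws as the interval bounds:

```python
# Minimal sketch: interval-based composite indicator via Monte Carlo.
# All data and parameter choices are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
# toy indicator matrix: rows = regions, cols = normalized poverty factors
X = rng.random((5, 8))

def composite(X, weights):
    """Weighted average as one possible aggregation choice."""
    return X @ weights / weights.sum()

# Monte Carlo over random weighting configurations
draws = np.empty((1000, X.shape[0]))
for k in range(draws.shape[0]):
    w = rng.random(X.shape[1])          # one random assumption set
    draws[k] = composite(X, w)

lower, upper = draws.min(axis=0), draws.max(axis=0)
center, rng_width = (lower + upper) / 2, upper - lower
for i, (lo, hi) in enumerate(zip(lower, upper)):
    print(f"region {i}: [{lo:.3f}, {hi:.3f}]")
```

The interval parameters (center, minimum, maximum, range) then support the comparisons and rankings the abstract mentions.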
ARTICLE | doi:10.20944/preprints202011.0297.v1
Online: 10 November 2020 (10:00:37 CET)
In this paper, we present a regression-based modeling approach to analyze various types of MTC data. A typical application of this modeling approach comprises three steps: first, define a model that approximates the relationship between gene expression and experimental factors, with parameters incorporated to represent the research interest; second, use least-squares and estimating-equation methods to estimate the parameters and their corresponding standard errors; third, compute test statistics, P-values, and NFD as measures of statistical significance. The advantages of this approach are as follows. First, it addresses the research interest in a specific, precise way, and maximally uses all the data and other relevant information. Second, it accounts for both systematic and random variations associated with the data, and the results of such analysis provide not only gene-specific information relevant to the research objective but also its reliability, thereby helping investigators make better decisions for subsequent studies. Third, this approach is very flexible and can easily be extended to other types of MTC studies or other microarray experiments by formulating different models based on the experimental design of the studies.
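A minimal sketch of the three-step workflow described above, assuming a simple per-gene linear model with a single two-level treatment factor (simulated data; statsmodels supplies the least-squares estimates, standard errors, and P-values):

```python
# Minimal sketch of the three-step workflow: model, estimate, test.
# Data are simulated; 'treatment' is a hypothetical experimental factor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_arrays = 12
treatment = np.repeat([0, 1], n_arrays // 2)        # design factor
X = sm.add_constant(treatment.astype(float))

for g in range(3):                                  # a few genes for brevity
    expression = 5 + 0.8 * treatment + rng.normal(0, 1, n_arrays)
    fit = sm.OLS(expression, X).fit()               # least-squares estimates
    beta, se = fit.params[1], fit.bse[1]            # parameter + std. error
    print(f"gene {g}: beta={beta:.2f} se={se:.2f} p={fit.pvalues[1]:.3g}")
```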
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Academic Analytics; data storage; education and big data; analysis of data; learning analytics
Online: 19 July 2020 (20:37:39 CEST)
Business Intelligence, defined as "the ability to understand the interrelations of the facts that are presented in such a way that it can guide the action towards achieving a desired goal", has been used since 1958 for the transformation of data into information, and of information into knowledge, to be used when making decisions in a business environment. But what would happen if we took the same principles of business intelligence and applied them to the academic environment? The answer would be the creation of Academic Analytics, a term defined as the process of evaluating and analyzing organizational information from university systems for reporting and making decisions, whose characteristics allow it to be used more and more in institutions, since the information they accumulate about their students and teachers gathers data such as academic performance, student success, persistence, and retention. Academic Analytics enables an analysis of data that is very important for making decisions in the educational institutional environment, aggregating valuable information in the academic research activity and providing easy-to-use business intelligence tools. This article presents a proposal for creating an information system based on Academic Analytics, using ASP.Net technology and relying on storage in the database engine Microsoft SQL Server, designing a model that is supported by Academic Analytics for the collection and analysis of data from the information systems of educational institutions. The proposed system is capable of displaying statistics on the historical data of students and teachers taken over academic periods, without having direct access to institutional databases, with the purpose of gathering the information that the director, the teacher, and finally the student need for making decisions. The model was validated with information taken from students and teachers during the last five years, and the data could be exported as PDF, CSV, and XLS files. The findings allow us to state that it is extremely important to analyze the data in the information systems of educational institutions for making decisions. After the validation of the model, it was established that students must know the reports of their academic performance in order to carry out a process of self-evaluation, that teachers must be able to see the results of the data obtained in order to carry out processes of self-evaluation and adaptation of content and dynamics in the classrooms, and finally that the head of the program needs this information to make decisions.
ARTICLE | doi:10.20944/preprints202206.0335.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: metadata; contextual data; harmonization; genomic surveillance; data management
Online: 24 June 2022 (08:46:04 CEST)
ARTICLE | doi:10.20944/preprints202208.0083.v1
Subject: Social Sciences, Accounting Keywords: Ratios; Financial Crisis; Covid-19; Big Data; Accounting Data
Online: 3 August 2022 (10:42:06 CEST)
The effects of the 2008 financial crisis undoubtedly caused problems not only for the banking sector but also for the real economy of developed and developing countries almost all around the globe. Besides, as is widely known, every banking crisis entails a corresponding cost to the economy of each country affected by it, which results from the shakeout and restructuring of its financial system. The purpose of this research is to investigate the consequences of the financial crisis and the COVID-19 health crisis and how these affected the course of the four systemic banks (Eurobank, Alpha Bank, National Bank, Piraeus Bank) through ratio analysis for the period 2015-2020.
ARTICLE | doi:10.20944/preprints201610.0067.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: point information gain; Rényi entropy; data processing
Online: 17 October 2016 (11:35:13 CEST)
We generalize the point information gain (PIG) and derived quantities, i.e., the point information gain entropy (PIE) and the point information gain entropy density (PIED), to the case of the Rényi entropy, and simulate the behavior of PIG for typical distributions. We also use these methods for the analysis of multidimensional datasets. We demonstrate the main properties of PIE/PIED spectra for real data using several images as examples, and discuss further possible utilizations in other fields of data processing.
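For illustration, a minimal Python sketch of PIG under one common convention (the Rényi-entropy change after removing a single occurrence of a value; the sign convention and the value of α here are assumptions, not the paper's exact definition):

```python
# Minimal sketch of point information gain (PIG) under the Rényi entropy:
# the entropy change after removing one occurrence of a value from the data.
import numpy as np

def renyi_entropy(counts, alpha):
    p = counts / counts.sum()
    p = p[p > 0]
    if alpha == 1.0:                        # Shannon limit
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def point_information_gain(data, value, alpha=2.0):
    vals, counts = np.unique(data, return_counts=True)
    h_full = renyi_entropy(counts, alpha)
    reduced = counts.copy()
    reduced[vals == value] -= 1             # drop one occurrence of `value`
    return renyi_entropy(reduced[reduced > 0], alpha) - h_full

data = np.random.default_rng(2).integers(0, 16, 10_000)
print(point_information_gain(data, 3, alpha=2.0))
```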
ARTICLE | doi:10.20944/preprints202204.0068.v1
Subject: Mathematics & Computer Science, Computational Mathematics Keywords: Functional Data Analysis; Image Processing; Brain Imaging; Neuroimaging; Computational Neuroscience; Data Science
Online: 8 April 2022 (03:21:06 CEST)
Functional Data Analysis (FDA) is a relatively new field of statistics dealing with data expressed in the form of functions. FDA methodologies can be easily extended to the study of imaging data, an application proposed in Wang et al. (2020), where the authors settle the mathematical groundwork and properties of the proposed estimators. This methodology allows for the estimation of mean functions and simultaneous confidence corridors (SCC), also known as simultaneous confidence bands, for imaging data and for the difference between two groups of images. This is especially relevant for the field of medical imaging, as one of the most common research setups consists of the comparison between two groups of images, a pathological set against a control set. FDA applied to medical imaging presents at least two advantages compared to previous methodologies: it avoids loss of information in complex data structures and avoids the multiple comparison problem arising from traditional pixel-to-pixel comparisons. Nonetheless, computing times for this technique have only been explored in reduced and simulated setups (Arias-López et al., 2021). In the present article, we apply this procedure to a practical case with data extracted from open neuroimaging databases and then measure computing times for the construction of Delaunay triangulations and for the computation of mean functions and SCC for one-group and two-group approaches. The results suggest that previous research has been too conservative in its parameter selection and that computing times for this methodology are reasonable, confirming that this method should be further studied and applied to the field of medical imaging.
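As a toy illustration of one of the timed steps, the sketch below builds a Delaunay triangulation over a small image-domain grid with SciPy and times it; the grid resolution is a placeholder, not the study's actual setup:

```python
# Minimal sketch: build and time a Delaunay triangulation over a grid,
# the kind of preprocessing step whose computing time the article measures.
import time
import numpy as np
from scipy.spatial import Delaunay

ny, nx = 64, 64                             # hypothetical slice resolution
yy, xx = np.mgrid[0:ny, 0:nx]
points = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

t0 = time.perf_counter()
tri = Delaunay(points)
print(f"{len(tri.simplices)} triangles in {time.perf_counter() - t0:.3f}s")
```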
ARTICLE | doi:10.20944/preprints201809.0073.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Software Process Analysis, Software Process Improvement, Data Provenance
Online: 4 September 2018 (16:30:51 CEST)
Companies have been increasing the amount of data that they collect from their systems and processes, given the decrease in the cost of memory and storage technologies in recent years. The emergence of technologies such as Big Data, Cloud Computing, and E-Science, and the growing complexity of information systems, have made evident that traceability and provenance are promising approaches. Provenance has been successfully used in complex domains such as health sciences, chemical industries, and scientific computing, considering that these areas require a comprehensive semantic traceability mechanism. Based on this, we investigate the use of provenance in the context of Software Processes (SP) and introduce a novel approach based on provenance concepts to model and represent SP data. It addresses SP provenance data capture, storage, inference of new information, and visualization. The main contribution of our approach is PROV-SwProcess, a provenance model that deals with the specificities of SP and supports process managers in handling vast amounts of execution data during process analysis and data-driven decision-making. A set of analysis possibilities was derived from this model using SP goals and questions. A case study was conducted in collaboration with a software development company to instantiate the PROV-SwProcess model (using the proposed approach) with real-world process data. This study showed that 87.5% of the analysis possibilities using real data were correct and can assist in decision-making, while 62.5% of them cannot be performed by the process manager using the current dashboard or process management tool.
Subject: Materials Science, Biomaterials Keywords: Microscopy Image Segmentation; Deep Learning; Data Augmentation; Synthetic Training Data; Parametric Models
Online: 1 March 2021 (13:07:00 CET)
The analysis of microscopy images has always been an important yet time-consuming process in materials science. Convolutional Neural Networks (CNNs) have been used very successfully for a number of tasks, such as image segmentation. However, training a CNN requires a large amount of hand-annotated data, which can be a problem for materials science data. We present a procedure to generate synthetic data based on ad-hoc parametric data modelling for enhancing the generalization of trained neural network models. Especially for situations where it is not possible to gather a lot of data, such an approach is beneficial and may make it feasible to train a neural network reasonably well. Furthermore, we show that targeted data generation, by adaptively sampling the parameter space of the generative models, gives superior results compared to generating random data points.
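A minimal sketch of the general idea, assuming a toy parametric model (random ellipses standing in for imaged particles) rather than the authors' actual generative models; the image/mask pair could feed a segmentation network:

```python
# Minimal sketch of parametric synthetic data for segmentation training:
# random ellipses as "particles" plus noise; the mask is the ground truth.
# Parameters (sizes, counts, noise level) are illustrative assumptions.
import numpy as np

def synth_sample(size=128, n_particles=12, noise=0.1, rng=None):
    rng = rng or np.random.default_rng()
    yy, xx = np.mgrid[0:size, 0:size]
    mask = np.zeros((size, size), dtype=bool)
    for _ in range(n_particles):
        cx, cy = rng.uniform(0, size, 2)
        a, b = rng.uniform(4, 14, 2)        # ellipse axes: model parameters
        theta = rng.uniform(0, np.pi)
        xr = (xx - cx) * np.cos(theta) + (yy - cy) * np.sin(theta)
        yr = -(xx - cx) * np.sin(theta) + (yy - cy) * np.cos(theta)
        mask |= (xr / a) ** 2 + (yr / b) ** 2 <= 1.0
    image = mask.astype(float) + rng.normal(0, noise, (size, size))
    return image, mask

image, mask = synth_sample(rng=np.random.default_rng(3))
print(image.shape, mask.mean())             # image and foreground fraction
```

Adaptive sampling, as the abstract describes, would bias the parameter draws toward regions where the trained model currently performs worst.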
ARTICLE | doi:10.20944/preprints202103.0623.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: SARS-CoV-2; Big Data; Data Analytics; Predictive Models; Schools
Online: 25 March 2021 (14:35:53 CET)
Background: CoronaVirus Disease 2019 (COVID-19) was the most discussed topic worldwide in 2020, and at the beginning of the Italian epidemic, scientists tried to understand the virus diffusion and the epidemic curve of positive cases, with controversial findings and numbers. Objectives: In this paper, a data analytics study on the diffusion of COVID-19 in the Lombardy Region and the Campania Region is developed in order to identify the driver that sparked the second wave in Italy. Methods: Starting from all the available official data collected about the diffusion of COVID-19, we analyzed Google mobility data, school data, and infection data for two big regions in Italy: the Lombardy Region and the Campania Region, which adopted two different approaches to opening and closing schools. To reinforce our findings, we also extended the analysis to the Emilia-Romagna Region. Results: The paper shows the impact that different school opening/closing policies may have had on the spread of COVID-19. Conclusions: The paper shows that a clear correlation exists between school contagion and the subsequent temporal overall contagion in a geographical area.
ARTICLE | doi:10.20944/preprints202204.0261.v1
Subject: Earth Sciences, Atmospheric Science Keywords: PM2.5; Aerosol Optical Depth; Data assimilation; MODIS; satellite data; Objective analysis
Online: 27 April 2022 (11:32:49 CEST)
We used the objective analysis method in conjunction with the successive correction method to assimilate MODerate resolution Imaging Spectroradiometer (MODIS) Aerosol Optical Depth (AOD) data into the Chimère model, in order to improve the modeling of fine particulate matter (PM2.5) concentrations and the AOD field over Europe. A data assimilation module was developed to adjust the daily initial total-column aerosol concentrations based on a forecast-analysis cycling scheme. The model is then evaluated over a one-month winter period to examine how such a data assimilation technique pushes the model results closer to surface observations. This comparison showed that the mean biases of surface PM2.5 concentrations and of the AOD field could be reduced from -34 to -15% and from -45 to -27%, respectively. The assimilation however leads to false alarms because of the difficulty of distributing AOD550 over different particle sizes. The impact of the influence radius is found to be small and depends on the density of satellite data. This work, although preliminary, is important for near-real-time air quality forecasting using the Chimère model and can be further developed to improve modeled PM2.5 and ozone concentrations.
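For orientation, a minimal one-dimensional sketch of a single successive-correction pass with Cressman-type weights (the weight function, influence radius, and all values are illustrative assumptions, not the implemented module):

```python
# Minimal sketch of one successive-correction pass (Cressman weights) on a
# 1-D field, the scheme the abstract applies to AOD; values are toy numbers.
import numpy as np

def cressman_pass(grid_x, background, obs_x, obs, radius):
    analysis = background.copy()
    # innovation = observation minus background interpolated to obs location
    innov = obs - np.interp(obs_x, grid_x, background)
    for i, x in enumerate(grid_x):
        d2 = (obs_x - x) ** 2
        w = np.where(d2 < radius**2, (radius**2 - d2) / (radius**2 + d2), 0.0)
        if w.sum() > 0:
            analysis[i] += np.sum(w * innov) / w.sum()
    return analysis

grid_x = np.linspace(0, 100, 101)
background = np.full_like(grid_x, 10.0)            # model first guess
obs_x, obs = np.array([30.0, 60.0]), np.array([14.0, 7.0])
print(cressman_pass(grid_x, background, obs_x, obs, radius=15.0)[25:35])
```

The sensitivity to `radius` in this sketch mirrors the influence-radius experiment the abstract reports.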
ARTICLE | doi:10.20944/preprints201910.0146.v2
Subject: Life Sciences, Molecular Biology Keywords: NGS data analysis; bioinformatics pipelines; NGS pipelines
Online: 8 April 2020 (06:21:10 CEST)
Next-generation sequencing (NGS) has been a widely used technology in biomedical research for understanding the role of molecular genetics of cells in health and disease. A variety of computational tools have been developed to analyse the vastly growing NGS data, which often require bioinformatics skills, tedious work, and a significant amount of time. To facilitate the data processing steps and bridge the gap between biologists and bioinformaticians, we developed CSI NGS Portal, an online platform which gathers established bioinformatics pipelines to provide fully automated NGS data analysis and sharing in a user-friendly website. The portal currently provides 16 standard pipelines for analysing data from DNA, RNA, smallRNA, ChIP, RIP, 4C, SHAPE, circRNA, eCLIP, Bisulfite and scRNA sequencing, and is flexible to expand with new pipelines. Users can upload raw data in fastq format and submit jobs in a few clicks, and the results are self-accessible via the portal to view/download/share in real-time. The output can be readily used as the final report or as input for other tools depending on the pipeline. Overall, CSI NGS Portal helps researchers rapidly analyse their NGS data and share results with colleagues without the aid of a bioinformatician. The portal is freely available at: https://csibioinfo.nus.edu.sg/csingsportal
ARTICLE | doi:10.20944/preprints201609.0027.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: customer complaint process improvement; customer complaint service; big data analysis
Online: 7 September 2016 (11:38:33 CEST)
With the advances in industry and commerce, passengers have become more accepting of environmental sustainability issues; thus, more people now choose to travel by bus. Government administration constitutes an important part of bus transportation services, as the government grants the right-of-way to transportation companies, allowing them to provide services. When these services are of poor quality, passengers may lodge complaints. The increase in consumer awareness and developments in wireless communication technologies have made it possible for passengers to easily and immediately submit complaints about transportation companies to government institutions, which has brought drastic changes to the supply-demand chain comprising the public sector, transportation companies, and passengers. This study proposed the use of big data analysis technology, including systematized case assignment and data visualization, to improve management processes in the public sector and optimize customer complaint services. Taichung City, Taiwan was selected as the research area. There, the customer complaint management process in the public sector was improved, effectively solving issues such as station-skipping, allowing the public sector to fully grasp the service level of transportation companies, improving the sustainability of bus operations, and supporting the sustainable development of the public sector-transportation company-passenger supply chain.
ARTICLE | doi:10.20944/preprints202102.0593.v2
Subject: Medicine & Pharmacology, Other Keywords: Hospital admissions; care homes; COVID-19; linked data; administrative data
Online: 25 May 2021 (10:33:46 CEST)
Background: Care home residents have complex healthcare needs but may have faced barriers to accessing hospital treatment during the first wave of the COVID-19 pandemic. Objective: To examine trends in the number of hospital admissions for care home residents during the first months of the COVID-19 outbreak. Methods: Retrospective analysis of a national linked dataset on hospital admissions for residential and nursing home residents in England (257,843 residents, 45% in nursing homes) between 20 January 2020 and 28 June 2020, compared to admissions during the corresponding period in 2019 (252,432 residents, 45% in nursing homes). Elective and emergency admission rates, normalised to the time spent in care homes across all residents, were derived for the first three months of the pandemic, between 1 March and 31 May, and primary admission reasons for this period were compared across years. Results: Hospital admission rates rapidly declined during early March 2020 and remained substantially lower than in 2019 until the end of June. Between March and May, 2,960 admissions from residential homes (16.2%) and 3,295 admissions from nursing homes (23.7%) were for suspected or confirmed COVID-19. Rates of other emergency admissions decreased by 36% for residential and by 38% for nursing home residents (13,191 fewer admissions in total). Emergency admissions for acute coronary syndromes fell by 43% and 29% (105 fewer admissions), and emergency admissions for stroke fell by 17% and 25% (128 fewer admissions), for residential and nursing home residents, respectively. Elective admission rates declined by 64% for residential and by 61% for nursing home residents (3,762 fewer admissions). Conclusions: This is the first study showing that care home residents' hospital use declined during the first wave of COVID-19, potentially resulting in substantial unmet health need that will need to be addressed alongside ongoing pressures from COVID-19.
ARTICLE | doi:10.20944/preprints201704.0169.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: thermopile sensor; actimetry; thermal camera; data classification; tele-medicine; polysomnography
Online: 26 April 2017 (12:27:38 CEST)
This paper addresses the development of a new technique in the sleep analysis domain. Sleep is defined as a periodic physiological state during which vigilance is suspended and reactivity to external stimulation is diminished. We sleep on average between six and nine hours per night, and our sleep is composed of four to six cycles of about 90 minutes each. Each of these cycles is composed of a succession of several stages of sleep, more or less deep. The analysis of sleep is usually done using polysomnography. This examination consists of recording, among other things, electrical cerebral activity by electroencephalography (EEG), ocular movements by electrooculography (EOG), and chin muscle tone by electromyography (EMG). The recording is done mostly in a hospital, more specifically in a unit for monitoring pathologies related to sleep. The readings are then interpreted manually by an expert to generate a hypnogram, a curve showing the succession of sleep stages during the night in 30-second epochs. The proposed method is based on the follow-up of the thermal signature, which makes it possible to classify the activity into three classes: "awakening", "calm sleep" and "agitated sleep". The contribution of this non-invasive method is part of the screening of sleep disorders, to be validated by a more complete analysis of the sleep. The measure provided by this new system, based on temperature monitoring (patient and ambient), aims to be integrated into the tele-medicine platform developed within the framework of the Smart-EEG project by the SYEL - SYstèmes ELectroniques team. Analysis of the data collected during the first surveys carried out with this method showed a correlation between the thermal signature and activity during sleep. The advantage of this method lies in its simplicity and the possibility of carrying out measurements of activity during sleep without direct contact with the patient, at home or in hospitals.
ARTICLE | doi:10.20944/preprints201802.0065.v3
Online: 26 February 2018 (15:38:23 CET)
This study attempts to assess the impact of corruption on economic growth in the Mediterranean countries during the period from 1998 to 2007. Econometric analysis using panel regression has been adopted to test this effect. Individual effects models, such as the random effects model and the fixed effects model, were applied to the study sample of 160 observations, and several tests were implemented to choose the suitable model. For our analysis, we used a basic model that includes the dependent variable GDP per capita as a factor of economic growth and the corruption perception index as the independent variable of interest. We then completed the model with several standardized macroeconomic control variables and applied the individual effects models. The outcomes illustrate that corruption has a negative impact on the selected Mediterranean countries' economic growth.
ARTICLE | doi:10.20944/preprints202205.0334.v1
Online: 24 May 2022 (11:47:39 CEST)
In the past decades, a significant rise in the adoption of streaming applications has changed the decision-making process for the industry and academia sectors. This movement led to the emergence of a plurality of Big Data technologies such as Apache Storm, Spark, Heron, Samza, Flink, and other systems that provide in-memory processing for real-time Big Data analysis at high throughput. Spark Streaming represents one of the most popular open-source implementations; it handles ever-increasing data ingestion and processing by using the Unified Memory Manager to dynamically manage memory occupancy between storage and processing regions, which is the focus of this study. The problem behind memory management for data-intensive stream processing pipelines is that the incoming data arrive faster than the downstream operators can consume them. Consequently, the backpressure of Spark acts in the opposite direction of the downstream operators. In such a case, the incoming data overwhelm the memory manager and provoke memory leak issues. As a result, application performance suffers, e.g., through high latency, low throughput, or even data loss. The initial intuition motivating our work is therefore that memory management has become the critical factor for keeping processing at scale and preserving the system stability of Spark. This work provides a deep dive into Spark backpressure, evaluates its structure, presents the main characteristics to support data-intensive streaming pipelines, and investigates the current in-memory performance issues.
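For context, the sketch below shows the standard Spark configuration knobs this discussion revolves around, the Unified Memory Manager split and DStream backpressure (the values are illustrative, not tuning recommendations):

```python
# Minimal sketch of the Spark knobs discussed: Unified Memory Manager split
# and streaming backpressure; values are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("streaming-memory-sketch")
    # fraction of heap shared by execution + storage (unified region)
    .config("spark.memory.fraction", "0.6")
    # share of the unified region reserved for storage before eviction
    .config("spark.memory.storageFraction", "0.5")
    # let the receiver rate adapt to downstream processing capacity
    .config("spark.streaming.backpressure.enabled", "true")
    .getOrCreate()
)
print(spark.sparkContext.getConf().get("spark.memory.fraction"))
```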
REVIEW | doi:10.20944/preprints202211.0161.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: High Performance Computing (HPC); big data; High Performance Data Analytics (HPDA); convergence; data locality; Spark; Hadoop; design patterns; process mapping; in-situ data analysis
Online: 9 November 2022 (01:38:34 CET)
Big data has revolutionised science and technology, leading to the transformation of our societies. High Performance Computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence and related methods. Traditionally, HPC and big data focused on different problem domains and grew into two different ecosystems. Efforts have been underway for the last few years to bring the best of both paradigms into converged HPC and big data architectures. Designing HPC and big data converged systems is a hard task, requiring careful placement of data, analytics, and other computational tasks such that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realisation of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of HPDA system design, as moving even a byte costs heavily in both time and energy as the size of the system increases. Performance in terms of time and energy is the most important factor for users; energy particularly so, due to it being the major hurdle in high performance system design and the increasing focus on green energy systems driven by environmental sustainability. Data locality is a broad term that encapsulates different aspects, including bringing computations to data, minimising data movement by efficient exploitation of cache hierarchies, reducing intra- and inter-node communications, locality-aware process and thread mapping, and in-situ and in-transit data analysis. This paper provides an extensive review of the state of the art in data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions. Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.
Subject: Mathematics & Computer Science, Other Keywords: chemometric data; sparse autoencoder; Gaussian process regressor; Pareto optimization
Online: 9 May 2019 (11:31:46 CEST)
We propose a deep learning-based chemometric data analysis technique. We trained an L2-regularized sparse autoencoder end-to-end to reduce the size of the feature vector, addressing the classic curse-of-dimensionality problem in chemometric data analysis. We introduce a novel technique for the automatic selection of the number of nodes in the hidden layer of an autoencoder through Pareto optimization. Moreover, linear regression, ϵ-SVR, and a Gaussian process regressor are applied to the reduced-size feature vector for the regression. We evaluated our technique on orange juice and wine datasets, and the results are compared against state-of-the-art methods. Quantitative results are reported in terms of the Normalized Mean Square Error (NMSE) and show considerable improvement over the state of the art.
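A minimal sketch of the described pipeline under simplifying assumptions (toy data, a fixed hidden-layer width instead of the Pareto-optimized node selection, Keras defaults):

```python
# Minimal sketch: an L2-regularized sparse autoencoder compresses the
# spectra, then a Gaussian process regresses the target on the code.
# Layer sizes and data are hypothetical; Pareto node selection is omitted.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(4)
X = rng.random((200, 700))                 # toy "spectra"
y = X[:, :10].sum(axis=1)                  # toy regression target

inp = keras.Input(shape=(700,))
code = layers.Dense(32, activation="relu",
                    kernel_regularizer=regularizers.l2(1e-4),
                    activity_regularizer=regularizers.l1(1e-5))(inp)  # sparsity
out = layers.Dense(700, activation="linear")(code)
auto = keras.Model(inp, out)
auto.compile(optimizer="adam", loss="mse")
auto.fit(X, X, epochs=5, batch_size=32, verbose=0)   # end-to-end reconstruction

encoder = keras.Model(inp, code)
Z = encoder.predict(X, verbose=0)          # reduced-size feature vector
gp = GaussianProcessRegressor().fit(Z, y)
print(gp.predict(Z[:3]))
```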
ARTICLE | doi:10.20944/preprints201804.0144.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: big data; SIEM; correlation analysis; cyber crime profiling
Online: 11 April 2018 (08:39:02 CEST)
SIEM adoption is increasing, in order to detect threat patterns in a short period of time from a large amount of structured/unstructured data, to precisely diagnose the severity of threats, and to provide an accurate alarm to an administrator by correlating the collected information. However, it is difficult to quickly recognize and handle various attack situations using a solution equipped with complicated functions during security monitoring. To overcome this situation, a new detection analysis process has been required, and there is an effort to increase response speed during security monitoring and to expand accurate linkage analysis technology. In this paper, reflecting these requirements, we design and propose a profiling auto-generation model that can improve the efficiency and speed of attack detection for potential threats.
CONCEPT PAPER | doi:10.20944/preprints201901.0246.v2
Subject: Life Sciences, Endocrinology & Metabolomics Keywords: reproducibility; minimum guidelines; reporting; data analysis
Online: 8 March 2019 (09:06:02 CET)
Despite the proposal of minimum reporting guidelines for metabolomics over a decade ago, reporting on the data analysis step in metabolomics studies has been shown to be unclear and incomplete. Major omissions and a lack of logical flow render the data analysis sections in metabolomics studies impossible to follow, and therefore to replicate or even imitate. Here, we propose possible reasons why the original reporting guidelines have had poor adherence and present an approach to improve their uptake. We present in this paper an R Markdown reporting template file that guides the production of text and generates workflow diagrams based on user input. This R Markdown template contains, as an example in this instance, a set of minimum information requirements specifically for the data pre-treatment and data analysis sections of biomarker discovery metabolomics studies (gleaned directly from the guidelines originally proposed by Goodacre et al.). These minimum requirements are presented in the format of a questionnaire checklist in an R Markdown template file. The R Markdown reporting template proposed here can serve as a starting point to encourage the data analysis section of a metabolomics manuscript to have a more logical presentation and to contain enough information to be understandable and reusable. The idea is that these guidelines would be open to user feedback, modification, and updating by the metabolomics community via GitHub.
ARTICLE | doi:10.20944/preprints201806.0279.v2
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: galaxy morphology, machine learning; data analysis; object classification
Online: 22 October 2018 (13:01:42 CEST)
Automated machine classifications of galaxies are necessary because the size of upcoming surveys will overwhelm human volunteers. We improve upon existing machine classification methods by adding the output of SpArcFiRe to the inputs of a machine learning model. We use the human classifications from Galaxy Zoo 1 (GZ1) to train a random forest of decision trees to reproduce the human vote distributions of the Spiral class. We prefer the random forest model over other black-box models like neural networks because it allows us to trace post hoc the precise reasoning behind the classification of each galaxy. We find that, across a sample of 470,000 Sloan galaxies that are large enough that details could be seen if they were there, the combination of SpArcFiRe outputs with existing SDSS features provides a better machine classification than either one alone, in comparison to Galaxy Zoo 1. We suggest that adding SpArcFiRe outputs as features to any machine learning algorithm will likely improve its performance.
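A minimal sketch of the modeling step, with placeholder features standing in for the SDSS and SpArcFiRe inputs (the real feature sets and vote-fraction targets come from the survey and GZ1 data):

```python
# Minimal sketch: a random forest regressing the GZ1 spiral vote fraction
# on SDSS photometry plus SpArcFiRe outputs. Features here are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n = 1000
sdss = rng.random((n, 4))                  # e.g. colors, concentration (toy)
sparcfire = rng.random((n, 3))             # e.g. arc count, pitch angle (toy)
X = np.hstack([sdss, sparcfire])
vote_fraction = rng.random(n)              # GZ1 spiral vote fraction (toy)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, vote_fraction)
print(forest.feature_importances_)         # traceable, unlike a neural net
```

The feature importances illustrate the post hoc traceability the abstract cites as the reason for preferring a random forest.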
ARTICLE | doi:10.20944/preprints202206.0347.v1
Subject: Social Sciences, Geography Keywords: mobile network data; call detail records; data analysis; human mobility; urban mobility; social sensing; urban geography; urban sociology; commuting; sustainability
Online: 27 June 2022 (04:04:09 CEST)
The analysis of human movement patterns based on mobile network data makes it possible to examine a very large population cost-effectively, and has led to several discoveries about human dynamics. However, the application of this data source is still not common practice. The goal of this study was to analyze the commuting tendencies of the Budapest Metropolitan Area using mobile network data and to propose an automatized alternative to the current, questionnaire-based method. Commuting is predominantly analyzed via the census, but that is performed only once in a decade in Hungary. To analyze commuting, the home and work locations of the subscribers are determined based on their appearances during and outside working hours. The home locations were compared to census data at the settlement level. Then, the settlement- and district-level commuting tendencies were identified and compared to the findings of census-based sociological studies. It was found that commuting analysis based on mobile network data strongly correlates with the census-based findings, even though home and work locations were estimated by statistical methods. All the examined aspects, including commuting from sectors of the agglomeration to the districts of Budapest and the demographic distribution of the commuters, show that mobile network data can be an automatized, fast, cost-effective, and relatively accurate way of analyzing commuting, and could provide a powerful tool for sociologists interested in commuting.
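A minimal pandas sketch of the home/work heuristic described above, assuming toy call-detail records and illustrative working hours:

```python
# Minimal sketch of the home/work heuristic: the most-visited cell outside
# working hours is "home", within them "work". Columns/hours are assumptions.
import pandas as pd

records = pd.DataFrame({                   # toy call-detail records
    "subscriber": [1, 1, 1, 1, 2, 2],
    "cell":       ["A", "A", "B", "B", "C", "D"],
    "hour":       [22, 23, 10, 11, 21, 9],
})

work_hours = records["hour"].between(9, 17)

def top_cell(df):
    """Most frequent cell within one subscriber's records."""
    return df.groupby("cell").size().idxmax()

home = records[~work_hours].groupby("subscriber").apply(top_cell)
work = records[work_hours].groupby("subscriber").apply(top_cell)
print(pd.DataFrame({"home": home, "work": work}))
```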
ARTICLE | doi:10.20944/preprints202104.0529.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: censored data; machine learning; deep learning; DNNSurv; survival analysis
Online: 20 April 2021 (11:15:02 CEST)
With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are being generated. How to analyze such data effectively therefore becomes a challenge. Machine learning (ML) algorithms have been widely applied for modelling nonlinear and complicated interactions in a variety of practical fields, including high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Thus, a Cox-based DNN prediction survival model (DNNSurv model), built with Keras and TensorFlow, was developed. However, its results were only evaluated on survival datasets that are high-dimensional or have large sample sizes. In this paper, we evaluate the prediction performance of the DNNSurv model using ultra-high-dimensional and high-dimensional survival datasets and compare it with three popular ML survival prediction models (i.e., random survival forest and Cox-based LASSO and Ridge models). For this purpose we also present the optimal setting of several hyper-parameters, including the selection of the tuning parameter. The data analysis demonstrates that the DNNSurv model performs overall well compared with the ML models, in terms of the three main evaluation measures for survival prediction performance (i.e., concordance index, time-dependent Brier score, and time-dependent AUC).
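For reference, a naive sketch of the first evaluation measure named above, the concordance index (an O(n²) implementation on toy data; production code would use a survival-analysis library):

```python
# Minimal sketch of the concordance index: the fraction of comparable
# pairs that the model's risk scores order correctly.
import numpy as np

def concordance_index(time, event, risk):
    """Naive C-index; higher risk should mean shorter survival."""
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # pair is comparable if subject i failed before time[j]
            if event[i] == 1 and time[i] < time[j]:
                den += 1
                num += 1.0 if risk[i] > risk[j] else 0.5 * (risk[i] == risk[j])
    return num / den

time = np.array([5.0, 8.0, 3.0, 9.0])
event = np.array([1, 0, 1, 1])             # 1 = event observed, 0 = censored
risk = np.array([2.1, 0.4, 3.3, 0.2])      # model risk scores (toy)
print(concordance_index(time, event, risk))
```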
ARTICLE | doi:10.20944/preprints202205.0238.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; SARS-CoV-2; Omicron; Twitter; tweets; sentiment analysis; big data; Natural Language Processing; Data Science; Data Analysis
Online: 7 July 2022 (08:36:40 CEST)
This paper presents the findings of an exploratory study on the continuously generated Big Data on Twitter related to the sharing of information, news, views, opinions, ideas, knowledge, feedback, and experiences about the COVID-19 pandemic, with a specific focus on the Omicron variant, which is the globally dominant variant of SARS-CoV-2 at this time. A total of 12,028 tweets about the Omicron variant were studied, and the specific characteristics of tweets that were analyzed include sentiment, language, source, type, and embedded URLs. The findings of this study are manifold. First, from sentiment analysis, it was observed that 50.5% of tweets had a ‘neutral’ emotion. The other emotions - ‘bad’, ‘good’, ‘terrible’, and ‘great’ - were found in 15.6%, 14.0%, 12.5%, and 7.5% of the tweets, respectively. Second, the findings of language interpretation showed that 65.9% of the tweets were posted in English. It was followed by Spanish or Castilian, French, Italian, Japanese, and other languages, which were found in 10.5%, 5.1%, 3.3%, 2.5%, and <2% of the tweets, respectively. Third, the findings from source tracking showed that “Twitter for Android” was associated with 35.2% of tweets. It was followed by “Twitter Web App”, “Twitter for iPhone”, “Twitter for iPad”, “TweetDeck”, and all other sources, which accounted for 29.2%, 25.8%, 3.8%, 1.6%, and <1% of the tweets, respectively. Fourth, studying the type of tweets revealed that retweets accounted for 60.8% of the tweets; they were followed by original tweets and replies, which accounted for 19.8% and 19.4% of the tweets, respectively. Fifth, in terms of embedded URL analysis, the most common domain embedded in the tweets was found to be twitter.com, followed by biorxiv.org, nature.com, wapo.st, nzherald.co.nz, recvprofits.com, science.org, and other URLs. Finally, to support similar research and development in this field centered around the analysis of tweets, we have developed an open-access Twitter dataset that comprises tweets about the SARS-CoV-2 Omicron variant since the first detected case of this variant on November 24, 2021.
ARTICLE | doi:10.20944/preprints202204.0016.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: natural language processing; risk management; transmission lines; unstructured data
Online: 4 April 2022 (11:26:15 CEST)
Risk management of electric power transmission lines requires knowledge from different areas, such as the environment, land, investors, regulations, and engineering. Despite the widespread availability of databases for most of those areas, integrating them into a single database or model is a challenging problem. Instead, in this paper, we use a single source, the Brazilian National Electric Energy Agency's (ANEEL) weekly reports, which contain decisions about the electrical grid covering most of these areas. Since the data are unstructured (text), we employed NLP techniques such as stemming and tokenization to identify keywords related to common causes of risks provided by an expert group on energy transmission. Then, we used models to estimate the probability of each risk. Our results show that we were able to estimate the probability of 97 risks out of 233.
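A minimal sketch of the keyword-extraction step, assuming English text and NLTK's Porter stemmer for illustration (ANEEL reports are in Portuguese, so a Portuguese-capable stemmer would be used in practice; the keyword list is hypothetical):

```python
# Minimal sketch: tokenize and stem report text, then count hits against
# an expert keyword list. Words are hypothetical illustrations.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize   # requires nltk 'punkt' data

stemmer = PorterStemmer()
risk_keywords = {stemmer.stem(w) for w in ["license", "delay", "expropriation"]}

report = "Licensing delays and expropriations affected the transmission line."
stems = [stemmer.stem(t.lower()) for t in word_tokenize(report)]
hits = [s for s in stems if s in risk_keywords]
print(hits)                                # stems matching expert keywords
```

Keyword frequencies of this kind could then feed the probability models the abstract mentions.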
ARTICLE | doi:10.20944/preprints202011.0622.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Driving Offenses; Speed Zone; Airports; Functional Data Analysis; Data-Driven Policy
Online: 24 November 2020 (16:12:38 CET)
Road traffic injury risk factors such as driving offenses and average speed are concerns for health organizations seeking to reduce the number of injuries. Without a comprehensive view of each road, one cannot decide on an effective policy. In this manner, data-driven policy will help to improve and assess decisions. Traffic count data near the roads of two airports were surveyed to investigate time-varying speed zones. Descriptive statistics, ANOVA, and functional data analysis were used. The hourly traffic count data for four different locations at the entrances of the two airports, international and domestic, were collected for one year, 2018 to 2019. The hourly pattern of driving offenses for each road was assessed, and the roads to and from the airports had different peaks (p<0.05). The hour, weekday, type of airport, direction, and their interactions were statistically significant (p<0.05) for the chance of driving offenses. The average speed during the day differed statistically (p<0.05) by the number of different types of vehicles. Traffic count data are a great resource for decision-making on safe driving subjects such as driving offenses. With functional data analysis, we can analyze them to capture most of the characteristics of these data. Airports are public places with high traffic demand in all countries, which yields different patterns of traffic transportation; therefore, we extracted the factors that affect driving offenses. Finally, we conclude that establishing a time-varying speed zone near the airports seems vital.
ARTICLE | doi:10.20944/preprints202104.0389.v1
Online: 14 April 2021 (16:06:14 CEST)
In this study, we first examined the trend in meteorological data from the Harmaleh, Vanai and Farsesh stations over a 50-year period in the Dez catchment area. The meteorological data were then forecast using SWAT and Mann-Kendall. Forecasting with the Mann-Kendall and SWAT models was done using code written in MATLAB and the RCP (4.5, 8.5) scenarios, respectively. The trend analysis of the data from the meteorological stations in this catchment area indicated that both parametric and non-parametric methods can be used to determine trends in meteorological data. The results of the parametric method are positive for all meteorological parameters. The non-parametric method over a period of 50 years shows the presence of trends in the data. The comparison of the forecasting results for maximum temperature suggested that during summer we will see an increase in temperature compared to the baseline in all three forecasts. The results of the minimum temperature forecast show a decrease in the minimum increase during the winter, and the precipitation forecast indicates that at the end of autumn (Nov) precipitation decreases by 20 mm under the Mann-Kendall and RCP4.5 scenarios, while RCP8.5 suggests an increase in precipitation compared to the baseline. The runoff forecast results using SWAT show that at the end of winter (Feb) and almost all spring (Mar, Apr) decreases of about 40%, 15% and 14% will be seen, respectively.
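For illustration, a minimal Python sketch of the Mann-Kendall trend statistic applied to synthetic annual data (no tie correction; the study's actual implementation is in MATLAB):

```python
# Minimal sketch of the Mann-Kendall test: S sums the signs of all pairwise
# differences; the normal approximation gives the Z score. Data are toy values.
import numpy as np
from scipy.stats import norm

def mann_kendall(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0          # no tie correction
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    return s, z, 2 * (1 - norm.cdf(abs(z)))           # two-sided p-value

rng = np.random.default_rng(6)
annual_precip = 600 + 1.5 * np.arange(50) + rng.normal(0, 30, 50)
print(mann_kendall(annual_precip))                     # (S, Z, p)
```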
ARTICLE | doi:10.20944/preprints201709.0078.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: project portfolio configuration; synergetic management; data envelopment analysis; efficiency evaluation
Online: 18 September 2017 (11:25:21 CEST)
Project portfolio configuration (PPC) is an important approach for maintaining the sustainable development of enterprises and achieving organizations' strategies. However, the synergetic efficacy of PPC, which determines the degree to which the project's strategic objectives are achieved, is a fuzzy problem that is hard to measure. To solve this problem, this paper uses data envelopment analysis (DEA) as a tool to measure the efficacy of PPC under deterministic conditions. First, a portfolio evaluation index system that takes both financial and non-financial indicators into consideration is developed based on a review of the literature. Second, an evaluation model based on DEA is built to reduce the number of decision-making units from the perspective of synergetic theory. Then, a computational experiment is studied to verify the feasibility of the proposed model. The results of this computational experiment show that the model can effectively narrow the scope of decision-making, improve the decision-making level, and provide a reference for deciding the DEA-effective project portfolio decision-making unit. To our knowledge, this study is the first to apply the notion of synergetic efficacy and DEA to the PPC domain. It is hoped that this paper may shed light on further studies of PPC and the sustainable development of enterprise competitiveness.
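For orientation, a minimal sketch of the classic input-oriented CCR DEA model solved as a linear program (toy inputs and outputs; the paper's evaluation index system and synergetic reduction are not reproduced here):

```python
# Minimal sketch of CCR DEA: for each decision-making unit (DMU), find
# output/input weights maximizing its efficiency, subject to no DMU
# exceeding efficiency 1. Data are toy numbers.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 3.0], [4.0, 2.0], [3.0, 5.0]])   # inputs per DMU
Y = np.array([[1.0], [2.0], [1.5]])                  # outputs per DMU

def ccr_efficiency(o):
    m, s = X.shape[1], Y.shape[1]
    c = np.concatenate([-Y[o], np.zeros(m)])          # maximize u . y_o
    A_ub = np.hstack([Y, -X])                         # u . y_j - v . x_j <= 0
    b_ub = np.zeros(len(X))
    A_eq = np.concatenate([np.zeros(s), X[o]])[None]  # v . x_o = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])
    return -res.fun

for o in range(len(X)):
    print(f"DMU {o}: efficiency {ccr_efficiency(o):.3f}")
```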
ARTICLE | doi:10.20944/preprints202005.0399.v2
Subject: Mathematics & Computer Science, Analysis Keywords: IR; ML; Data Analysis; COVID-19; Coronavirus; Pandemic
Online: 28 May 2020 (03:09:38 CEST)
The world is facing new challenges every day; however, with the spread of the pandemic around the world, this new challenge is different. The pandemic is increasing and concentrating various challenges simultaneously. Although different sectors are facing consequences, the most important sectors, health and the economy, are the most affected. When the pandemic began, it was not known how long it would last, which complicated health and economic planning. Therefore, it is important for decision makers and the public to know the predictions and expectations for the future of these challenges. In this work, the current situation is analyzed. Then, an expectation model is developed based on the statistics of the pandemic, using a growth rate model based on exponential and logarithmic rates of increase. Based on the available open data about the pandemic spread, the model can successfully predict future expectations, including the duration and maximum number of cases of the pandemic. The model uses the equilibrium point as the day the cases decrease. The model can be used for planning and for the development of strategies to overcome these challenges.
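As a rough illustration of this kind of growth-rate model, the sketch below fits a logistic curve to synthetic cumulative case counts and reads off the plateau and inflection point (all numbers are synthetic; this is not the authors' exact formulation):

```python
# Minimal sketch: fit a logistic growth curve to cumulative cases and read
# off the expected maximum (plateau) and the inflection day. Toy data only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    return K / (1.0 + np.exp(-r * (t - t0)))

t = np.arange(60, dtype=float)
rng = np.random.default_rng(7)
cases = logistic(t, K=10_000, r=0.25, t0=30) + rng.normal(0, 100, t.size)

(K, r, t0), _ = curve_fit(logistic, t, cases, p0=[cases.max(), 0.1, 30])
print(f"expected maximum ~{K:.0f} cases, inflection near day {t0:.1f}")
```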
ARTICLE | doi:10.20944/preprints201804.0127.v1
Subject: Engineering, Energy & Fuel Technology Keywords: energy efficiency indices; data visualization; clustering algorithms; university campus; energy management
Online: 10 April 2018 (10:40:47 CEST)
In this paper, we propose a simple tool to help the energy management of a large building stock by defining clusters of buildings with the same function, setting alert thresholds for each cluster, and easily recognizing outliers. The objective is to enable a building management system to be used for the detection of abnormal energy use. First, we framed the issue of energy performance indicators and how they feed into data visualization (Data Viz) tools for a large building stock, especially for university campuses. For both the Data Viz and clustering algorithm processes, we discussed two possible approaches to choosing the right number of clusters and to identifying alert thresholds and outliers, after a brief presentation of the University of Turin's building stock case study. Different Data Viz tools were studied in order to apply a specific clustering algorithm, k-means. An explorative analysis based on Inselberg's general multidimensional detective approach was performed, using two multidimensional analysis tools: the scatter plot matrix and the parallel coordinates method. Secondly, the k-means clustering algorithm was applied to the same dataset in order to test the hypotheses made during the explorative analysis. The Data Viz techniques developed in this study proved very useful for exploring a large building stock quickly and simply, identifying the least efficient buildings and clustering them according to their distinct functions.
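A minimal sketch of the clustering step, assuming toy energy indicators and a simple mean-plus-two-standard-deviations alert threshold per cluster (the indicator set and threshold rule are illustrative, not the paper's):

```python
# Minimal sketch: k-means on per-building energy indicators, with a simple
# per-cluster alert threshold on the distance to the cluster centre.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
# toy indicators: [kWh per m^2, kWh per occupant] for 60 buildings
X = np.vstack([rng.normal([80, 300], 10, (30, 2)),
               rng.normal([150, 900], 20, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

for c in range(km.n_clusters):
    d = dist[km.labels_ == c]
    threshold = d.mean() + 2 * d.std()      # cluster alert threshold
    outliers = np.where((km.labels_ == c) & (dist > threshold))[0]
    print(f"cluster {c}: threshold={threshold:.1f}, outliers={outliers}")
```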
ARTICLE | doi:10.20944/preprints202201.0348.v1
Subject: Medicine & Pharmacology, Other Keywords: Data Science; Genomic Data Science; Machine Learning; Network Analysis; RNA-Seq; Precision Medicine; Subtyping; Parkinson’s Disease
Online: 24 January 2022 (11:36:51 CET)
Precision medicine emphasizes fine-grained diagnostics, taking individual variability into account to enhance treatment effectiveness. Parkinson's Disease (PD) heterogeneity among individuals is evidence that disease subtypes exist, and assigning individuals to subgroups is necessary for a better understanding of disease mechanisms and for designing precise treatment approaches. The purpose of this study was to identify PD subtypes using RNA-Seq data in a combined pipeline including unsupervised machine learning, bioinformatics, and network analysis. 210 post-mortem brain RNA-Seq samples from PD (n = 115) and Normal Controls (NC, n = 95) were obtained through systematic data retrieval following PRISMA statements, and a fully data-driven clustering pipeline was performed to identify PD subtypes. Bioinformatics and network analyses were performed to characterize the disease mechanisms of the identified PD subtypes and to identify target genes for drug repurposing. Two PD clusters were identified and 42 DEGs were found (adjusted p ≤ 0.01). The PD clusters had significantly different gene network structures (p < 0.0001) and phenotype-specific disease mechanisms, highlighting the differential involvement of the Wnt/β-catenin pathway regulating adult neurogenesis. NEUROD1 was identified as a key regulator of gene networks, and ISX9 and PD98059 were identified as NEUROD1-interacting compounds with disease-modifying potential, reducing the effects of dopaminergic neurodegeneration. This hybrid data analysis approach could enable precision medicine applications by providing insights for the identification and characterization of pathological subtypes. This workflow has proven useful on PD brain RNA-Seq, but its application to other neurodegenerative diseases is encouraged.
Subject: Earth Sciences, Atmospheric Science Keywords: soil temperature; data evaluation; climatology; interannual variation; Poyang Lake Basin
Online: 24 February 2020 (01:38:30 CET)
Soil temperature reflects the impact of local factors, such as the vegetation, soil, and atmosphere of a region. It is therefore important to understand the regional variation of soil temperature. However, given the lack of observations with adequate spatial and/or temporal coverage, it is difficult to study this regional variation from observation data alone. Based on observation data from the Nanchang and Ganzhou stations and ERA-Interim/Land reanalysis data, this study analyzed the temporal-spatial distribution characteristics of soil temperature over the Poyang Lake Basin. The results showed close correlations between observation data and reanalysis data at different depths. The reanalysis data could largely reproduce the temporal-spatial distributions of soil temperature over the Poyang Lake Basin, but generally underestimated their magnitudes. Temporally, there is an obvious warming trend in the basin. Seasonally, the temperature rose fastest in spring and slowest in summer, except for ST4, which rose fastest in spring and slowest in winter. In terms of depths, the temperature of ST1 rose fastest; for the other layers, the warming trends were similar. An abrupt change of annual soil temperature at all depths occurred in 1997, and annual soil temperatures at all depths were abnormally low in 1984. Spatially, annual soil temperature decreased with latitude, except for the summer ST1. Because of the high temperature and precipitation in summer, ST1 is higher around the lake and the river. The climatic trend of soil temperature generally increases from south to north, opposite to the distribution of soil temperature itself. The findings provide a basis for understanding and assessing the variation of soil temperature over the Poyang Lake Basin.
ARTICLE | doi:10.20944/preprints202003.0443.v2
Subject: Social Sciences, Library & Information Science Keywords: COVID-19; open science; data; bibliometric; pandemic
Online: 22 April 2020 (06:15:34 CEST)
Introduction: The COVID-19 pandemic, an outbreak of infectious disease caused by SARS-CoV-2, motivated the scientific community to work together to gather, organize, process, and distribute data on the novel biomedical hazard. Here, we analyzed how the scientific community responded to this challenge by quantifying the distribution and availability patterns of academic information related to COVID-19. The aim of our study was to assess the quality of information flow and scientific collaboration, two factors we believe to be critical for finding new solutions for the ongoing pandemic. Materials and methods: The RISmed R package and a custom Python script were used to fetch metadata on articles indexed in PubMed and published on the Rxiv preprint server. Scopus was searched manually and the metadata were exported as a BibTeX file. Publication rate and publication status, affiliation and author count per article, and submission-to-publication time were analysed in R. The Biblioshiny application was used to create a world collaboration map. Results: Our preliminary data suggest that the COVID-19 pandemic resulted in the generation of a large amount of scientific data, and demonstrate potential problems regarding information velocity, availability, and scientific collaboration in the early stages of the pandemic. More specifically, our results indicate a precarious overload of the standard publication systems, significant problems with data availability, and apparently deficient collaboration. Conclusion: We believe the scientific community could have used the data more efficiently in order to create proper foundations for finding new solutions for the COVID-19 pandemic. Moreover, we believe we can learn from this as we go and adopt open science principles and a more mindful approach to COVID-19-related data to accelerate the discovery of more efficient solutions. We take this opportunity to invite our colleagues to contribute to this global scientific collaboration by publishing their findings with maximal transparency.
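As a hedged sketch of the metadata-retrieval step: the authors used the RISmed R package, but an equivalent PubMed count can be obtained in Python through NCBI's public E-utilities API (the query term and date window below are assumptions, not the authors' exact query).

```python
# Hedged Python equivalent of the PubMed metadata-retrieval step (not the
# authors' script); query term and dates are illustrative assumptions.
import requests

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": "COVID-19[Title/Abstract]",
    "mindate": "2020/01/01", "maxdate": "2020/04/30", "datetype": "pdat",
    "retmax": 0, "retmode": "json",   # retmax=0: fetch the count only
}
resp = requests.get(url, params=params, timeout=30)
count = int(resp.json()["esearchresult"]["count"])
print(f"PubMed records matching the query: {count}")
```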
ARTICLE | doi:10.20944/preprints202103.0530.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Multiblock data analysis; redundancy analysis; PLS regression; supervised methods; multicollinearity
Online: 22 March 2021 (12:25:20 CET)
Within the framework of multiblock data analysis, a unified approach to supervised methods is discussed. It encompasses multiblock redundancy analysis (MB-RA) and multiblock partial least squares (MB-PLS) regression. Moreover, we develop new supervised strategies of multiblock data analysis, which can be seen as variants of one or the other of these two methods. They are respectively referred to as multiblock weighted redundancy analysis (MB-WRA) and multiblock weighted covariate analysis (MB-WCov). The four methods are based on the determination of latent variables associated with the various blocks of variables. They are derived from clear optimization criteria whose aim is to maximize either the sum of the covariances or the sum of the squared covariances between the latent variable associated with the response block of variables and the block latent variables associated with the various explanatory blocks. We also propose indices to help better interpret the outcomes of the analyses. The methods are illustrated and compared on simulated and real datasets.
ARTICLE | doi:10.20944/preprints201801.0077.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: UAVs sensor fusion; EKF; real data analysis; system design
Online: 9 January 2018 (07:47:45 CET)
This paper presents a methodology to design sensor fusion parameters using real performance indicators of navigation in UAVs based on the PixHawk flight controller and peripherals. This methodology and the selected performance indicators make it possible to find the best parameters for the fusion system of a given configuration of sensors and a predefined real mission. The selected real platform is described with emphasis on the available sensors and data processing software, and an experimental methodology is proposed to characterize the sensor data fusion output and determine the best choice of parameters using quality measurements of the tracking output, with performance metrics that do not require ground truth.
ARTICLE | doi:10.20944/preprints201908.0225.v1
Subject: Earth Sciences, Geoinformatics Keywords: water bodies; satellite images; vector data; SVM; positive and negative buffering; polygons
Online: 21 August 2019 (10:30:16 CEST)
The technique of obtaining information or data about any feature or object from afar, known in technical parlance as remote sensing, has proven extremely useful in diverse fields. In the ecological sphere especially, remote sensing has enabled the collection of data about large swaths of land or landscapes. Even so, the task of identifying and monitoring different water reservoirs has proved difficult in remote sensing, mainly because correct appraisals of the spread and boundaries of the area under study, and of the contours of any water surfaces within it, are of utmost importance. Identification of water reservoirs is rendered even harder by the presence of cloud in satellite images, which is the largest source of error in the identification of water surfaces. To overcome this limitation, a shape-matching approach is recommended for the analysis of cloudy images against cloud-free reference images of water surfaces, with the help of vector data processing. It includes a database of water bodies in vector format, stored as complex polygon structures. The analysis comprises three steps: first, the creation of the vector database for the analysis; second, the simplification of multi-scale vector polygon features; and third, the matching of the reference and target water body databases within a defined distance tolerance. This feature-matching approach supports one-to-many and many-to-many matching, and yields corrected images that are free of clouds.
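The distance-tolerance matching can be illustrated with the positive and negative buffering mentioned in the keywords. The following is a hedged sketch, not the authors' algorithm; the coordinates are invented and the matching rule is a simplification.

```python
# Hedged illustration of positive/negative buffering for tolerance-based
# polygon matching (coordinates invented; not the authors' algorithm).
from shapely.geometry import Polygon

reference = Polygon([(0, 0), (10, 0), (10, 6), (0, 6)])   # cloud-free water body
target = Polygon([(1, 1), (9, 1), (9, 5), (1, 5)])        # polygon from cloudy scene

tolerance = 1.5
outer = reference.buffer(tolerance)    # positive buffer: allowed outward deviation
inner = reference.buffer(-tolerance)   # negative buffer: required inner core

# Match if the target stays inside the outer ring and still covers the core.
matches = outer.contains(target) and target.contains(inner)
print("match within tolerance:", matches)
```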
ARTICLE | doi:10.20944/preprints202008.0074.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: data mining; cardiovascular diseases; cluster analysis; principal component analysis
Online: 4 August 2020 (03:56:19 CEST)
Cardiovascular disease is the leading cause of death in the world: according to the WHO, around 31% of deaths worldwide are caused by cardiovascular diseases, and more than 75% of these deaths occur in developing countries. Patients with cardiovascular disease generate many medical records that can be used for further patient management. This study aims to develop a data mining method that groups patients with cardiovascular disease in order to determine the level of patient complications in two clusters. The method applies principal component analysis (PCA), which reduces the dimensions of the large available dataset, together with cluster analysis implemented with the K-Medoids algorithm. Data reduction with PCA produced five new components with a cumulative proportion of variance of 0.8311. The five new components were used for cluster formation with the K-Medoids algorithm, which produced two clusters with a silhouette coefficient of 0.35. The combination of data reduction by PCA and the K-Medoids clustering algorithm offers a new way of grouping data on patients with cardiovascular disease according to the level of patient complications in each resulting cluster.
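A minimal sketch of the paper's two-step recipe (PCA reduction, then K-Medoids) could look as follows in Python, using scikit-learn and scikit-learn-extra; the input file is hypothetical and preprocessing details are omitted.

```python
# Hedged sketch of the PCA + K-Medoids recipe (hypothetical input file;
# assumes all columns are numeric).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

X = StandardScaler().fit_transform(pd.read_csv("cvd_records.csv"))
Z = PCA(n_components=5).fit_transform(X)     # five components, as in the paper

km = KMedoids(n_clusters=2, random_state=0).fit(Z)
print("silhouette:", silhouette_score(Z, km.labels_))
```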
ARTICLE | doi:10.20944/preprints201801.0231.v1
Subject: Engineering, Other Keywords: Data mining; Association rules; Previous Cause; Type of Accident; Overexertion
Online: 24 January 2018 (19:40:52 CET)
An analysis of workplace accidents in the mining sector was performed using a database from the Spanish administration covering the period 2005-2015 and applying data mining techniques. The data were processed with the Weka software. Two scenarios were chosen for the accident database: surface and underground mining. The most important variables involved in occupational accidents and their association rules have been determined. These rules are formed by several predictor variables that cause an accident, defining its characteristics and context. This study presents the 20 most important association rules of the sector, for either surface or underground mining, based on the statistical confidence level of each rule obtained by Weka. The outcomes display the most typical immediate causes together with the percentage of accidents underlying each association rule. The most typical immediate cause is body movement with physical effort or overexertion, and the most typical type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident differ between the two scenarios. Data mining techniques have proved to be a very powerful tool for finding the root causes of accidents, applying corrective measures, and verifying their effectiveness, whether for public or private companies.
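The authors mined their rules with Weka; as a hedged Python equivalent of the same support/confidence workflow (the attribute names below are invented, not the Spanish database's fields), mlxtend can be used:

```python
# Hedged Python equivalent of Weka-style association-rule mining; accident
# records are assumed one-hot encoded as boolean attribute columns (toy data).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

records = pd.DataFrame({
    "overexertion":   [1, 1, 0, 1, 1, 0],
    "body_movement":  [1, 1, 0, 1, 0, 0],
    "surface_mining": [1, 0, 1, 1, 1, 0],
}).astype(bool)

frequent = apriori(records, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
# Keep the strongest rules, analogous to the paper's top-20 list.
print(rules.sort_values("confidence", ascending=False)
           [["antecedents", "consequents", "support", "confidence"]])
```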
ARTICLE | doi:10.20944/preprints202105.0582.v1
Subject: Medicine & Pharmacology, Allergology Keywords: under-five; mortality; demographic health survey data; Ethiopia
Online: 24 May 2021 (15:12:13 CEST)
Introduction: Over the decades, much has been said and done regarding under-five mortality in Ethiopia. The country has been following the lead of the Sustainable Development Goals and UNICEF with its transformation plan targets. However, unless these efforts are supported by status-assessing studies, it may be difficult for the country to progress. Thus, the current study aimed to identify the prevalence and associated factors of under-five mortality in 2019. Methods: According to the study criteria, we extracted and cleaned the data in STATA v15.0. The data were then weighted by sampling weight, primary sampling unit, and strata before analysis in STATA v15.0. Data management consisted of descriptive statistics (mean, standard deviation, and proportion or percent) and association statistics. We employed binary logistic regression for this analysis, screening each variable at a p-value of 0.25 for inclusion in the model. The final p-value to declare association was p < 0.05, and AORs with 95% CIs were used to describe the results. The data source was the Ethiopian Mini Demographic Health Survey (EMDHS) 2019, which collected the data from 8,885 respondents face-to-face with a 99% response rate. Results: Of the 5,527 weighted women with under-five children analysed in this study, the weighted proportion of under-five mortality was 277.23 (5.02%). Factors such as 2nd birth order 0.52 (0.35, 0.79), 3rd-4th birth order 0.49 (0.28, 0.84), 1-2 ANC visits 0.24 (0.12, 0.49), three ANC visits 0.14 (0.07, 0.28), four or more ANC visits 0.22 (0.14, 0.36), being a married mother 0.43 (0.19, 0.96), 1-2 under-five children 0.02 (0.011, 0.03), and more than three under-five children 0.007 (0.0007, 0.004) were all negatively associated with the under-five mortality rate. Conclusion: To build on these findings, the government may need to increase antenatal care, women's education, institutional delivery, and the use of modern contraceptive methods through enhanced community mobilization, health education using community health workers, increased access to essential care for mothers and children, and policy commitment on issues related to family size, birth order, and birth interval.
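As a hedged sketch of the analytical core: the paper fits survey-weighted logistic regression in STATA, while the unweighted statsmodels version below only illustrates how AORs and 95% CIs are derived from fitted coefficients; all variables are synthetic placeholders.

```python
# Hedged, unweighted illustration of AOR estimation (the paper uses
# survey-weighted logistic regression in STATA; variables are synthetic).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "died_u5":    rng.integers(0, 2, 500),
    "anc_visits": rng.integers(0, 5, 500),
    "married":    rng.integers(0, 2, 500),
})

X = sm.add_constant(df[["anc_visits", "married"]])
fit = sm.Logit(df["died_u5"], X).fit(disp=0)

aor = np.exp(fit.params)        # adjusted odds ratios
ci = np.exp(fit.conf_int())     # 95% confidence intervals
print(pd.concat([aor.rename("AOR"),
                 ci.rename(columns={0: "2.5%", 1: "97.5%"})], axis=1))
```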
ARTICLE | doi:10.20944/preprints202201.0323.v2
Subject: Behavioral Sciences, General Psychology Keywords: social media; netnography; mental health; natural language processing; visualization; data analysis; COVID-19
Online: 8 February 2022 (11:12:14 CET)
Abstract: Understanding social media networks and group interactions is crucial to the advancement of linguistic and cultural behaviour research. This includes the manner in which people accessed advice on health, especially during the global lockdown periods, when most activities were curtailed by isolation rules and some people, especially of older generations, turned to social media to access information on health. Facebook public pages, groups, and verified profiles matching the keywords "senior citizen health", "older generations", and "healthy living" were analysed over a 12-month period to assess engagement promoting good mental health. CrowdTangle was used to source English-language status updates and photo- and video-sharing information, which resulted in an initial 116,321 posts and 6,462,065 interactions. Data analysis and visualisation were used to explore the large datasets, including natural language processing for "Message" content discovery, word frequency and correlational analysis, and co-word clustering. Preliminary results indicate strong links to healthy-ageing information shared on social media, which showed correlations to global daily confirmed case and daily death totals. The results can be used to identify public concerns early on and address mental health issues in the senior generation on Facebook.
ARTICLE | doi:10.20944/preprints202202.0005.v2
Subject: Physical Sciences, Condensed Matter Physics Keywords: hydride superconductor; room temperature superconductor; pressure; ac magnetic susceptibility; raw data; background signal
Online: 4 February 2022 (10:31:10 CET)
In Ref. , Snider et al. reported room-temperature superconductivity in carbonaceous sulfur hydride (CSH) under high pressure. Recently, the data for the temperature-dependent ac magnetic susceptibility shown in the figures of Ref.  have appeared in the form of tables corresponding to different pressures. Here we provide an analysis of the data for a pressure of 160 GPa. This work was performed in collaboration with D. van der Marel.
ARTICLE | doi:10.20944/preprints202007.0325.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Data Center; Thermal Characteristics Analysis; Machine Learning; Energy Efficiency; Hotspots; Clustering Technique; Unsupervised Learning
Online: 15 July 2020 (09:16:23 CEST)
The energy efficiency of Data Center (DC) operations relies heavily on the performance of the IT and cooling systems. A reliable and efficient cooling system is necessary to produce a persistent flow of cold air to cool servers subjected to constantly increasing computational load due to the advent of IoT-enabled smart systems. Consequently, increased demand for computing power will bring about increased waste heat dissipation in data centers. To improve DC energy efficiency, it is imperative to analyse the thermal characteristics of the IT room (driven by this waste heat). This work employs an unsupervised machine learning technique to uncover weaknesses of the DC cooling system based on real DC thermal monitoring data. The findings identify areas for energy efficiency improvement that will feed into DC recommendations. The methodology includes statistical analysis of IT room thermal characteristics and the identification of individual servers that frequently occur in hotspot zones. A critical analysis has been conducted on an available big dataset of ambient air temperature in the hot aisle of the ENEA Portici CRESCO6 computing cluster. Clustering techniques have been used for hotspot localization as well as for the categorization of nodes based on surrounding air temperature ranges. The principles and approaches covered in this work are replicable for the energy efficiency evaluation of any DC and thus foster transferability. This work showcases the applicability of best practices and guidelines in the context of a real commercial DC, going beyond the set of existing metrics for DC energy efficiency assessment.
ARTICLE | doi:10.20944/preprints202104.0745.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: green effect; pipelines; remote monitoring; data analysis; machine learning; time series
Online: 28 April 2021 (10:39:31 CEST)
The extensive but remote oil and gas fields of the United States, Canada, and Russia require the construction and operation of extremely long pipelines. Global warming and local heating effects lead to rising soil temperatures and thus a reduction in the sub-grade capacity of the soils; this causes changes in the spatial positions and forms of the pipelines, consequently increasing the number of accidents. Oil operators are compelled to monitor the soil temperature along the routes of these remote pipelines in order to be able to perform remedial measures in time, and are therefore seeking methods for the analysis of voluminous diagnostic information. To forecast soil temperatures at different depths, we propose compiling a multidimensional dataset; computing descriptive statistics; selecting uncorrelated time series; generating synthetic features; robustly scaling the temperature series; and tuning an additive regression model.
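The paper does not name its implementation of the additive regression model; Prophet is one widely used choice, so the following forecasting sketch is purely illustrative, with a toy soil-temperature series standing in for the monitored data.

```python
# Hedged sketch: Prophet as one possible additive regression model (the
# paper's implementation is unspecified); the series is a toy stand-in.
import pandas as pd
from prophet import Prophet

ts = pd.DataFrame({
    "ds": pd.date_range("2019-01-01", periods=730, freq="D"),
    "y":  pd.Series(range(730)).mul(0.001).add(5),   # toy soil temperature, degC
})

model = Prophet(yearly_seasonality=True, daily_seasonality=False)
model.fit(ts)
future = model.make_future_dataframe(periods=90)      # forecast 90 days ahead
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```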
ARTICLE | doi:10.20944/preprints202012.0281.v1
Subject: Life Sciences, Biochemistry Keywords: covid-19; drug repurposing; topological data analysis; persistent homology
Online: 11 December 2020 (12:57:28 CET)
Since its emergence in March 2020, the SARS-CoV-2 global pandemic has produced more than 65 million cases and 1.5 million deaths worldwide. Despite the enormous efforts of the scientific community, no effective treatments have been developed to date. We created a novel computational pipeline aimed at speeding up the identification of repurposable drug candidates. Compared with current drug repurposing methodologies, our strategy centres on filtering the best candidates among all selected targets through a mathematical formalism motivated by recent advances in the fields of algebraic topology and topological data analysis (TDA). This formalism allows us to compare three-dimensional protein structures. Its use in conjunction with two in silico validation strategies (molecular docking and transcriptomic analyses) allowed us to identify a set of potential drug repurposing candidates targeting three viral proteins (the 3CL viral protease, NSP15 endoribonuclease, and NSP12 RNA-dependent RNA polymerase), which included rutin, dexamethasone, and vemurafenib, among others. To our knowledge, this is the first time that a TDA-based strategy has been used to compare a massive number of protein structures with the final objective of performing drug repurposing.
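As a hedged illustration of the TDA ingredient (not the authors' pipeline), persistence diagrams of two point clouds, standing in for protein-structure coordinates, can be computed and compared as follows:

```python
# Hedged TDA illustration: persistence diagrams of two synthetic point
# clouds (stand-ins for protein atom coordinates), compared via a
# Wasserstein distance. Not the authors' pipeline.
import numpy as np
from ripser import ripser
from persim import wasserstein

rng = np.random.default_rng(0)
cloud_a = rng.normal(size=(60, 3))          # hypothetical C-alpha coordinates
cloud_b = rng.normal(size=(60, 3)) * 1.2

dgm_a = ripser(cloud_a)["dgms"][1]           # H1 persistence diagram
dgm_b = ripser(cloud_b)["dgms"][1]
# Smaller distance = more topologically similar structures.
print("Wasserstein distance between H1 diagrams:", wasserstein(dgm_a, dgm_b))
```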
ARTICLE | doi:10.20944/preprints201711.0047.v1
Subject: Earth Sciences, Atmospheric Science Keywords: data assimilation; statistical diagnostics of analysis residuals; estimation of analysis error; air quality model diagnostics; Desroziers et al. method; cross-validation
Online: 7 November 2017 (10:09:42 CET)
We present a general theory of estimation of analysis error covariances based on cross-validation as well as a geometric interpretation of the method. In particular we use the variance of passive observation–minus-analysis residuals and show that the true analysis error variance can be estimated, without relying on the optimality assumption. This approach is used to obtain near optimal analyses that are then used to evaluate the air quality analysis error using several different methods at active and passive observation sites. We compare the estimates according to the method of Hollingsworth-Lönnberg, Desroziers et al., a new diagnostic we developed, and the perceived analysis error computed from the analysis scheme, to conclude that, as long as the analysis is near optimal, all estimates agree within a certain error margin.
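In simplified scalar form, and assuming passive observation errors are uncorrelated with analysis errors, the key identity behind this estimate can be sketched as follows (the notation here is ours, not necessarily the paper's):

```latex
% O_p: passive observations; A: analysis at those sites;
% sigma_{o,p}^2: passive-observation error variance
\operatorname{var}(O_p - A) = \sigma_{o,p}^{2} + \sigma_{a}^{2}
\quad\Longrightarrow\quad
\hat{\sigma}_{a}^{2} = \operatorname{var}(O_p - A) - \sigma_{o,p}^{2}
```

Because the passive data never enter the analysis, their errors are independent of the analysis errors, so no optimality assumption is required for this decomposition.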
REVIEW | doi:10.20944/preprints202202.0083.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Machine Learning; COVID-19; Internet of Things (IoT); Deep Learning; Big Data
Online: 19 April 2022 (08:21:00 CEST)
Early diagnosis, prioritization, screening, clustering, and tracking of COVID-19 patients, as well as the production of drugs and vaccines, are some of the applications that have made it necessary to use new technologies to manage and deal with this epidemic. Strategies backed by artificial intelligence (AI) and the Internet of Things (IoT) have been indispensable for understanding how the virus works and trying to prevent its spread. Accordingly, the main aim of this survey article is to highlight ML and IoT methods, and the integration of IoT- and ML-based techniques, in applications related to COVID-19, from diagnosis of the disease to prediction of its outbreak. According to the main findings, IoT provided a prompt and efficient approach for tracking the spread of the disease. Most studies applying ML-based techniques to COVID-19 datasets reported performance criteria, the most popular being accuracy, which can be used to compare ML-based methods across datasets. According to the results, CNN with an SVM classifier, Genetic CNN, and pre-trained CNN followed by ResNet provided the highest accuracy values, while the lowest accuracy was obtained by a single CNN, followed by the XGBoost and KNN methods.
REVIEW | doi:10.20944/preprints201911.0338.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Indian; Sentiment Analysis; Indigenous Languages; Machine Learning; Deep learning; Data; Opinion Mining; Languages.
Online: 27 November 2019 (09:30:07 CET)
The increase in the use of smartphones has led to increased use of the internet and social media platforms. The most commonly used social media platforms are Twitter, Facebook, WhatsApp, and Instagram. People share their personal experiences, reviews, and feedback on the web. The information available on the web is enormous and unstructured. Hence, there is huge scope for research on understanding the sentiment of the data available on the web. Sentiment Analysis (SA) can be carried out on the reviews, feedback, and discussions available on the web. Extensive research has been carried out on SA in the English language, but the web also contains data in many other languages that should be analyzed. This paper aims to analyze, review, and discuss the approaches, algorithms, and challenges faced by researchers while carrying out SA on Indigenous languages.
ARTICLE | doi:10.20944/preprints202201.0214.v1
Subject: Social Sciences, Other Keywords: spatial analysis; innovation flows; urban transition; inclusive; clusters; lagging regions; network analysis; data city
Online: 14 January 2022 (13:55:41 CET)
The economy is a complex system, and the interactions between different agents are not easy to see through quickly. This complexity is reflected in the spatial dimension; tracking these tradeoffs opens a new window onto the nexus of place and flow. Economic systems often go through transitions and end up in another state, and this evolution is embedded in cities as the new motor of paradigm shifts. To adequately represent and study these dynamics, we aim to develop an integrated method based on network analysis and geographic economics to build a multiscale navigator that tracks the transition from the regional to the local level. This paper explores the specialization of regional clusters and their innovative behaviour in a particular lagging region, unfolding the innovation ecosystem to the smallest granularity and then simulating the emergence phase of this complex system. First, our findings reveal that the local scale is the relevant level at which to start a bottom-up planning approach to policy implementation. Second, global challenges could be addressed at the regional scale if we investigate local complexity to unfold the innovation flow across its ecosystem and treat knowledge as a critical element for an inclusive transition, most probably in cities. Finally, the innovation network is an existing fact that can serve as a host for prosperity; in this line of reasoning, we intend to spatialize the track of the innovation flow to identify transition hotspots and respond adequately to upcoming world concerns.
ARTICLE | doi:10.20944/preprints202109.0334.v1
Subject: Life Sciences, Other Keywords: organic producer; organic practices; surveillance data; health and safety
Online: 20 September 2021 (13:40:18 CEST)
Research indicates that farmers' demographic characteristics and production practices have safety and health implications. However, current systems do not identify organic farmers independently from conventional farmers, and the literature on how organic and conventional farmers compare is very limited. We conducted a secondary analysis of 2012 Census of Agriculture data to compare organic and non-organic farms and principal operators (POs) in New Mexico (NM). Organic farms were smaller in size, and POs of farms with organic sales were significantly younger (55.8±9.5 vs. 60.5±5.5 years) and less experienced (19.5±6.8 vs. 25.2±6.8 years). Significant differences were also found in PO ethnicity, race, and primary occupation. More farms with organic sales had a female PO compared to farms with non-organic sales (27% vs. 19%). Other significant differences related to work arrangements, household income, living conditions, and access to the Internet. National surveys and regional studies may not accurately typify and describe the local organic producer, which is essential in order to advance policy, develop health interventions, and properly address occupational safety and risk among organic farmers. This study makes a unique contribution to understanding the importance of surveillance and of collecting place-based data that are specific to the organic producer.
ARTICLE | doi:10.20944/preprints201805.0452.v1
Subject: Social Sciences, Other Keywords: transportation; carbon emission; carbon intensity; panel data analysis; China
Online: 30 May 2018 (16:16:35 CEST)
China's transportation industry has made rapid progress, which has generated massive carbon emissions. However, it remains unclear how carbon emissions from the transport sector are shaped by shifts in their underlying drivers. This paper aims to examine the evolution of China's transport-sector carbon emissions and their major driving forces at the provincial level over the period 2000 to 2015. We first estimate provincial transport-sector carbon emissions from fuel and electricity consumption using a top-down method. We find that carbon emissions per capita are steadily increasing across the nation, especially in the provinces of Chongqing and Inner Mongolia. However, carbon emission intensity is decreasing in most provinces of China, except in Yunnan, Qinghai, Chongqing, Zhejiang, Heilongjiang, Jilin, Inner Mongolia, Henan, and Anhui. We then quantify the effect of socio-economic factors and their regional variations on carbon emissions using a panel data model. The results show that the development of secondary industry is the most significant variable at both the national and regional levels, while the effects of the other variables vary across regions. Among these factors, population density is the main driver of increasing per-capita transport-sector carbon emissions for both the whole nation and the western region, whereas the per-capita consumption level of residents and the development of tertiary industry are the primary drivers of per-capita carbon emissions for the eastern and central regions.
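The paper does not specify its estimation software; as a hedged sketch, a provincial fixed-effects panel regression of the kind described can be set up in Python with linearmodels (all variables below are synthetic placeholders):

```python
# Hedged sketch of a provincial panel regression (synthetic placeholder
# variables; not the paper's actual specification or software).
import numpy as np
import pandas as pd
from linearmodels.panel import PanelOLS

rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [[f"prov_{i}" for i in range(30)], range(2000, 2016)],
    names=["province", "year"],
)
df = pd.DataFrame({
    "co2_pc":        rng.normal(5, 1, len(idx)),      # per-capita emissions
    "pop_density":   rng.normal(300, 50, len(idx)),
    "secondary_gdp": rng.normal(0.4, 0.05, len(idx)), # secondary-industry share
}, index=idx)

model = PanelOLS.from_formula(
    "co2_pc ~ pop_density + secondary_gdp + EntityEffects", data=df
)
print(model.fit(cov_type="clustered", cluster_entity=True).summary)
```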
Subject: Life Sciences, Other Keywords: Analysis of variance; Variance-decomposition; The Bayesian brain; High-dimensional data; Association; Explanation; Prediction; Causation; The neural law of large numbers
Online: 23 September 2021 (11:13:08 CEST)
We discuss what we believe could be an improvement in future discussions of the ever-changing brain. We do so by distinguishing different types of brain variability and outlining methods suitable for analysing them. We argue that, when studying brain and behaviour data, classical methods such as regression analysis and more advanced approaches both aim to decompose the total variance into sensible variance components. In parallel, we argue that a distinction needs to be made between innate and acquired brain variability. For varying high-dimensional brain data, we present methods useful for extracting their low-dimensional representations. Finally, to trace potential causes and predict plausible consequences of brain variability, we discuss how to combine statistical principles and neurobiological insights to make associative, explanatory, predictive, and causal enquiries; but caution is needed before elevating association- or prediction-based neurobiological findings to causal claims.
REVIEW | doi:10.20944/preprints202003.0141.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: data sharing; data management; data science; big data; healthcare
Online: 8 March 2020 (16:46:20 CET)
In recent years, more and more health data are being generated. These data come not only from professional health systems, but also from wearable devices. All these data combined form 'big data' that can be utilized to optimize treatments for each unique patient ('precision medicine'). To achieve this precision medicine, it is necessary that hospitals, academia, and industry work together to bridge the 'valley of death' of translational medicine. However, hospitals and academia often have problems with sharing their data, even though the patient is actually the owner of his/her own health data and the sharing of data is associated with an increased citation rate. Academic hospitals usually invest a lot of time in setting up clinical trials and collecting data, and want to be the first to publish papers on these data. The idea that society benefits most if the patient's data are shared as soon as possible, so that other researchers can work with them, has not yet taken root. There are some publicly available datasets, but these are usually shared only after studies are finished and/or publications have been written based on the data, which means a severe delay of months or even years before others can use the data for analysis. One solution is to incentivize hospitals to share their data with (other) academic institutes and industry. Here we discuss several aspects of data sharing in the medical domain: publisher requirements, data ownership, support for data sharing, data sharing initiatives, and how the use of federated data might be a solution. We also discuss some potential future developments around data sharing.
ARTICLE | doi:10.20944/preprints202012.0014.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Data Envelopment Analysis; Conditional Frontier Analysis; Multicriteria Decision Analysis; PROMETHEE II; Police Efficiency; Police Effectiveness; Crime; Pernambuco; Brazil.
Online: 1 December 2020 (11:22:46 CET)
Nonparametric assessments of police technical and scale efficiency are challenging because of the stochastic nature of criminal behavior and because of the subjective dependence on multiple decision criteria, which can lead to more or less favourable efficiency prospects depending on the regulation, necessity, or organizational objective. There is a trade-off between efficiency and effectiveness in many police performance assessments: efficient departments (producing more clear-ups with a given resource) may be crime-specialized or unable to reproduce those good results effectively on more severe or complex occurrences. This study proposes a combined methodology for carrying out efficiency and effectiveness analyses of police departments. A conditional non-parametric approach, which allows crime to be included as an external factor in the analysis, is combined with a non-compensatory ranking based on the PROMETHEE II methodology; the approach is illustrated on a multidimensional efficiency and effectiveness comparison of 145 police departments in Pernambuco, Brazil. The application results offer compelling perspectives for public administrations concerning the strategic prioritization of units for rewards or interventions.
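As a hedged sketch of the PROMETHEE II ranking step only (the conditional-frontier efficiency part is omitted), using the common linear preference function:

```python
# Hedged PROMETHEE II sketch with a linear preference function; the
# criteria, weights, and thresholds below are illustrative, not the paper's.
import numpy as np

def promethee_ii(X, weights, p):
    """Net outranking flows for an alternatives x criteria matrix X
    (all criteria to be maximised); p holds per-criterion preference
    thresholds for the linear preference function."""
    n = X.shape[0]
    phi_plus = np.zeros(n)
    phi_minus = np.zeros(n)
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            d = X[a] - X[b]                     # pairwise criterion differences
            pref = np.clip(d / p, 0.0, 1.0)     # linear preference in [0, 1]
            pi_ab = np.dot(weights, pref)       # weighted aggregated preference
            phi_plus[a] += pi_ab / (n - 1)
            phi_minus[b] += pi_ab / (n - 1)
    return phi_plus - phi_minus                 # net flow: higher = better rank

X = np.array([[0.8, 0.6], [0.7, 0.9], [0.5, 0.5]])  # toy efficiency/effectiveness
print(promethee_ii(X, weights=np.array([0.5, 0.5]), p=np.array([0.2, 0.2])))
```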
ARTICLE | doi:10.20944/preprints201907.0136.v1
Subject: Engineering, Control & Systems Engineering Keywords: carbon dioxide; energy efficiency; occupancy detection; indoor air quality; measurement; data analysis
Online: 9 July 2019 (14:37:10 CEST)
The problem of real-time estimation of building occupancy (the number of people in various zones at every time instant) is relevant to a number of emerging applications that achieve high energy efficiency through feedback control. The measured CO2 concentration can be considered an important indicator for estimating the occupancy of closed and crowded spaces. Interesting cases include school buildings and other civil and residential buildings (shopping centres, hospitals, etc.). Starting from an experimental analysis in different classrooms of a university campus under real operating conditions, in different periods of the year, this paper proposes a possible correlation between CO2 concentration and the occupancy profile of the spaces. The acquired data are used to present some graphical correlations and to understand the most important variables, or combinations of them. Starting from an accurate analysis of the data, a preliminary estimation method is defined through the development of a mathematical model of occupancy dynamics in a building, which shows interesting results.
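A standard single-zone CO2 mass balance is often used for this kind of occupancy estimation; the paper's model may differ, so the steady-state sketch below is purely illustrative, with assumed ventilation and generation rates.

```python
# Hedged sketch of a standard single-zone CO2 mass balance (the paper's
# model may differ). At steady state, N = Q * (C_in - C_out) / G.
Q = 500.0      # ventilation rate, m3/h (assumed)
C_in = 1200.0  # indoor CO2, ppm
C_out = 420.0  # outdoor CO2, ppm
G = 18.0       # CO2 generation per person, L/h (typical seated adult)

# ppm is a volume fraction (1 ppm = 1 L CO2 per 1000 m3 of air), so
# Q * delta_ppm / 1000 gives the CO2 volume generated indoors in L/h.
occupants = Q * (C_in - C_out) / 1000.0 / G
print(f"estimated occupancy: {occupants:.1f} people")
```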
ARTICLE | doi:10.20944/preprints202206.0320.v4
Subject: Life Sciences, Other Keywords: data; reproducibility; FAIR; data reuse; public data; big data; analysis
Online: 2 November 2022 (02:55:49 CET)
With an increasing amount of biological data available publicly, there is a need for a guide on how to successfully download and use this data. The Ten simple rules for using public biological data are: 1) use public data purposefully in your research, 2) evaluate data for your use case, 3) check data reuse requirements and embargoes, 4) be aware of ethics for data reuse, 5) plan for data storage and compute requirements, 6) know what you are downloading, 7) download programmatically and verify integrity, 8) properly cite data, 9) make reprocessed data and models Findable, Accessible, Interoperable, and Reusable (FAIR) and share, and 10) make pipelines and code FAIR and share. These rules are intended as a guide for researchers wanting to make use of available data and to increase data reuse and reproducibility.
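As a hedged sketch of rules 6 and 7 (know what you are downloading; download programmatically and verify integrity), assuming the repository publishes a SHA-256 checksum; the URL and checksum below are placeholders:

```python
# Hedged sketch of programmatic download with integrity verification;
# the URL and expected checksum are placeholders, not real values.
import hashlib
import requests

url = "https://example.org/dataset.tsv.gz"                 # placeholder URL
expected_sha256 = "0" * 64                                 # placeholder checksum

resp = requests.get(url, timeout=60)
resp.raise_for_status()

digest = hashlib.sha256(resp.content).hexdigest()
if digest != expected_sha256:
    raise ValueError(f"checksum mismatch: got {digest}")
with open("dataset.tsv.gz", "wb") as fh:
    fh.write(resp.content)
```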
ARTICLE | doi:10.20944/preprints202003.0268.v1
Subject: Social Sciences, Library & Information Science Keywords: matching; data marketplace; data platform; data visualization; call for data
Online: 17 March 2020 (04:10:28 CET)
Improvements in web platforms for data exchange and trading are creating more opportunities for users to obtain data from data providers in different domains. However, current data exchange platforms are limited to unilateral information provision from data providers to users; there are insufficient means for data providers to learn what kinds of data users desire and for what purposes. In this paper, we propose and discuss description items for sharing users' calls for data as data requests in the data marketplace. We also discuss structural differences between data requests and providable data in terms of their variables, as well as the possibilities for data matching. In the study, we developed an interactive platform, Treasuring Every Encounter of Data Affairs (TEEDA), to facilitate matching and interactions between data providers and users; its basic features are described in this paper. From experiments, we found the same distributions of variable frequency but different distributions of the number of variables in each piece of data, both of which are important factors to consider in the discussion of data matching in the data marketplace.
ARTICLE | doi:10.20944/preprints202110.0102.v1
Subject: Engineering, Automotive Engineering Keywords: long-haul truck; crash scenarios; GIDAS; CARE; crash causation; European national crash data
Online: 6 October 2021 (10:35:39 CEST)
This paper addresses crashes involving heavy goods vehicles (HGV) in Europe focusing on long-haul trucks weighing 16 tons or more (16t+). The identification of the most critical scenarios and their characteristics is based on a three-level analysis: general crash statistics from CARE addressing all HGVs, results about 16t+ trucks from national crash databases and a detailed study of in-depth crash data from GIDAS, including a crash causation analysis. Most European HGV crashes occur in clear weather, during daylight, on dry roads, outside city limits, and on non-highway roads. Three main scenarios for 16t+ trucks are characterized in-depth: (1) rear-end crashes in which the truck is the striking partner, (2) conflicts during right turn maneuvers of the truck and a cyclist riding alongside and (3) pedestrians crossing the road in front of the truck. Among truck-related crash causes, information admission failures (e.g. distraction) were the main causing factors in 72% of cases in scenario (1) while information access problems (e.g. blind spots) were present for 72% of cases in scenario (2) and 75% of cases in scenario (3). The results provide both a global overview and sufficient depth of analysis in the most relevant cases and thereby aid safety system development.
ARTICLE | doi:10.20944/preprints202105.0102.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Market basket analysis; association rule mining; buying pattern; data mining
Online: 6 May 2021 (15:14:25 CEST)
Buyer practices have changed as individuals learn to live with the new reality of COVID-19. Take-out and delivery orders have increased, and our customer has added new items to their menu in response to new client preferences. With all of the ongoing changes, the customer had many unanswered questions, for example: Are the most popular items still the same after COVID? Which item combinations sell most now? How well are new items being accepted? What are clients buying alongside new items? How have alcohol sales changed? Smartbridge has broad experience with restaurant technology development. The customer already had reports tracking product sales and operational metrics; however, there was a need for deeper insight through product analytics. The customer needed to identify which products and presentations were being sold most often, measure the acceptance of new products, and determine which products customers purchase together, in order to improve marketing campaigns, promotions, and sales. The e-commerce industry is growing immensely in the Indian market, and the cheap 4G internet packages in India clearly give these ventures a push. Thus, when COVID-19 first hit India, people were scared to leave their homes for fear of infection and even hesitated to go out to buy essential (FMCG) goods. Panic buying was also observed, and to avoid this fear of COVID-19, people gave preference to e-commerce sites for buying essential goods; several customers were new, signing up to buy essential goods during this pandemic lockdown period. Many customers shifted their buying behaviour from offline retail stores to online stores. This paper examines consumer buying patterns during lockdown.
ARTICLE | doi:10.20944/preprints202007.0227.v1
Subject: Life Sciences, Endocrinology & Metabolomics Keywords: Data integration; Metabolomics; Multi-tissue; Multiblock; Joint and unique multiblock analysis (JUMBA); OnPLS; Multiblock Orthogonal Component Analysis (MOCA)
Online: 11 July 2020 (04:01:03 CEST)
Data integration has been proven to provide valuable information. The information extracted using data integration in the form of multiblock analysis can pinpoint both common and unique trends across the different blocks. When working with small multiblock datasets, the number of possible integration methods is drastically reduced. To investigate the application of multiblock analysis in cases with a small number of samples, we studied a small metabolomic multiblock dataset containing six blocks (i.e. tissue types), including only common metabolites. We used a single-model multiblock analysis method called Joint and Unique MultiBlock Analysis (JUMBA) and compared it to a commonly used method, concatenated PCA. These methods were used to detect trends in the dataset and identify the underlying factors responsible for metabolic variation. Using JUMBA, we were able to interpret the extracted components and link them to relevant biological properties. JUMBA shows how the observations are related to one another, the stability of these relationships, and to what extent each block contributes to the components. These results indicate that multiblock methods can be useful even with a small number of samples.
Subject: Medicine & Pharmacology, Nutrition Keywords: obesity; eating context; nutrient-poor foods; nutritional surveillance; adolescents; survey data analysis; data-mining; correspondence analysis; biplots
Online: 9 June 2020 (13:52:45 CEST)
Obesity is a global public health problem, with the environment as its major determinant. To identify interventions, an evidence base is warranted. To this aim, we investigate the relationship between the consumption of foods and eating locations (such as home, school/work, and others) in British adolescents, using data from the UK National Diet and Nutrition Survey Rolling Programme (2008–2012 and 2013–2016). Cross-sectional analysis of 62,523 food diary entries from this nationally representative sample focused on foods contributing up to 80% of total energy to the daily adolescent diet. Correspondence Analysis (CA) was first used to generate food-location relationship hypotheses, and Logistic Regression (LR) to quantify the evidence in terms of odds ratios and formally test those hypotheses. The less-healthy foods that emerged from the CA were chips, soft drinks, chocolate, and meat pies. Adjusted odds ratios (99% CI) for consuming specific foods at a location "Other" than home (H) or school/work (S) in the 2008–2012 survey sample were: for soft drinks 2.8 (2.1 to 3.8) vs. H and 2.0 (1.4 to 2.8) vs. S; for chips 2.8 (2.2 to 3.7) vs. H and 3.4 (2.1 to 5.5) vs. S; for chocolate 2.6 (1.9 to 3.5) vs. H and 1.9 (1.2 to 2.9) vs. S; and for meat pies 2.7 (1.5 to 5.1) vs. H and 1.3 (0.5 to 3.1) vs. S. These trends were confirmed in the 2013–2016 survey sample. Interactions between location and BMI were not significant in either sample. In conclusion, our study showed that adolescents are more likely to consume specific less-healthy foods at locations away from home and school/work, irrespective of BMI. Such locations include leisure places, food outlets, and "on the go"; hence, public health policies to discourage less-healthy food choices in these locations are warranted for all adolescents.
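As a hedged sketch of the CA step: correspondence analysis of a food-by-location contingency table can be computed from the SVD of the standardised residuals (the counts below are invented, not survey values):

```python
# Hedged correspondence-analysis sketch via SVD of standardised residuals;
# the contingency counts are invented, not the survey's values.
import numpy as np

N = np.array([[820, 130, 410],     # rows: chips, soft drinks, chocolate
              [660, 150, 520],     # cols: home, school/work, other
              [710, 190, 360]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal row coordinates
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal column coordinates
print(row_coords[:, :2], col_coords[:, :2], sep="\n")  # biplot axes 1-2
```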
COMMUNICATION | doi:10.20944/preprints201909.0089.v1
Subject: Social Sciences, Geography Keywords: nighttime light data; human activities; karst rocky desertification; environmental impact; China
Online: 8 September 2019 (16:49:12 CEST)
Due to remarkable socioeconomic development, an increasing number of karst rocky desertification areas have been severely affected by human activities in southern China. Effectively analyzing human activities in karst rocky desertification areas is a critical prerequisite for managing and restoring the areas suffering the greatest negative impacts of desertification. At present, a timely and accurate way of quantifying the spatiotemporal variations of human activities in karst rocky desertification areas is still lacking. In this communication, we attempted to quantify human activities from corrected NPP-VIIRS nighttime light data from 2012 to 2018 based on statistical analysis. The results show that a significant increase in nighttime lights can be clearly identified during the study period. The total nighttime lights (TL) related to severe karst rocky desertification (S) were particularly concentrated in Guizhou and Yunnan. The nighttime light intensity (LI) related to the S areas in Chongqing was the strongest, owing to its rapid socioeconomic development. The annual growth rate of nighttime lights (GL) has been slow or even negative in Guangdong because of its various karst rocky desertification restoration programs. This communication provides an effective approach for quantifying human activities and useful information about where prompt attention is required in policy-making for the restoration of karst rocky desertification areas.
ARTICLE | doi:10.20944/preprints202103.0244.v1
Subject: Earth Sciences, Atmospheric Science Keywords: land surface temperature (LST); NDVI; NDBaI; MNDWI; Satellite data
Online: 9 March 2021 (09:17:02 CET)
Analysis of the correlation between spectral indices (the Normalized Difference Vegetation Index, Normalized Difference Barren Index, and Modified Normalized Difference Water Index) and land surface temperature is used in natural resource and environmental studies. This research aimed to analyse land surface temperature (LST) in relation to the dynamics of different indices (NDVI, NDBaI, and MNDWI) using remote sensing data in three selected districts (Gida Kiremu, Limu, and Amuru) of western Ethiopia. LST, NDVI, NDBaI, and MNDWI were calculated from the thermal and multispectral bands of Landsat imagery (Landsat TM for 1990, Landsat ETM+ for 2003, and Landsat OLI/TIRS for 2020). Correlation analysis was used to quantify the relationships between LST and NDVI, NDBaI, and MNDWI. The study found that land surface temperature increased by 5 °C from 1990 to 2020. Vegetated areas (NDVI) and water bodies (MNDWI) have a strong negative relationship with land surface temperature (R² = 0.99 and 0.95, respectively), whereas barren land (NDBaI) has a positive relationship with land surface temperature (R² = 0.96). Finally, we recommend that decision makers and environmental analysts emphasise the importance of vegetation cover and water bodies to minimise the potential impacts of rising land surface temperature.
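The indices themselves are standard band-ratio formulas; as a hedged sketch (band arrays would come from the Landsat scenes, here they are tiny stubs):

```python
# Standard band-ratio index formulas; tiny stub arrays stand in for the
# Landsat surface-reflectance bands used in the study.
import numpy as np

nir, red = np.array([[0.5, 0.4]]), np.array([[0.1, 0.2]])
green, swir = np.array([[0.3, 0.2]]), np.array([[0.1, 0.3]])

ndvi = (nir - red) / (nir + red)          # vegetation: high over dense canopy
mndwi = (green - swir) / (green + swir)   # water: positive over open water
print(ndvi, mndwi, sep="\n")
```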
ARTICLE | doi:10.20944/preprints201903.0122.v1
Subject: Earth Sciences, Geoinformatics Keywords: Classification; SVM Classifier; ML Classifier; Supervised and Unsupervised Classification; Object-based Classification; Multispectral Data
Online: 11 March 2019 (09:01:44 CET)
This paper focuses on the crucial role that remote sensing plays in discerning land features. Remotely collected data provide information in the spectral, spatial, temporal, and radiometric domains, each domain having a specific resolution for the information collected. Diverse sectors such as hydrology, geology, agriculture, land cover mapping, forestry, urban development and planning, and oceanography are known to use and rely on information gathered remotely from different sensors. In the present study, IRS LISS-IV multispectral data are used for land cover mapping. It is known, however, that classifying high-resolution land cover imagery through manual digitizing is time-consuming and far too costly. Therefore, this paper proposes accomplishing classification by applying algorithms computationally. These classifications fall into three classes: supervised, unsupervised, and object-based classification. For supervised classification, two approaches are relied upon for land cover classification of the high-resolution LISS-IV multispectral image: Maximum Likelihood (ML) and the Support Vector Machine (SVM). Finally, the paper proposes a step-by-step procedure for optical image classification. This paper concludes that in optical data classification, SVM classification gives better results than the ML classification technique.
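As a hedged sketch contrasting the two supervised classifiers: scikit-learn has no classifier named "Maximum Likelihood", so Quadratic Discriminant Analysis is used below as its Gaussian maximum-likelihood analogue, and the pixel samples are synthetic stand-ins for LISS-IV training data.

```python
# Hedged comparison sketch: SVM vs a Gaussian maximum-likelihood analogue
# (QDA); synthetic 4-band pixel samples stand in for LISS-IV training data.
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (200, 4)) for m in (0.2, 0.5, 0.8)])
y = np.repeat([0, 1, 2], 200)                     # three land cover classes
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("ML (QDA)", QuadraticDiscriminantAnalysis())]:
    acc = accuracy_score(yte, clf.fit(Xtr, ytr).predict(Xte))
    print(f"{name}: overall accuracy = {acc:.3f}")
```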
REVIEW | doi:10.20944/preprints202102.0108.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Sentiment Analysis; Students' feedback; Students' reviews; Natural language processing; Data mining; Deep learning; Machine learning
Online: 3 February 2021 (10:11:54 CET)
With the whole world still under the COVID-19 pandemic, many schools have moved teaching from physical classrooms to online platforms. It is highly important for schools and online learning platforms to investigate student feedback to gain valuable insights into the online teaching process, so that both platforms and teachers can learn which aspects to improve to achieve better teaching performance. However, handling reviews expressed by students manually would be laborious, and it is unrealistic at the scale of feedback generated by e-learning platforms. To address this problem, both machine learning algorithms and deep learning models have been used in recent research to automatically process students' reviews, extracting the opinions, sentiments, and attitudes expressed by the students. Such studies may play a crucial role in improving various interactive online learning platforms by incorporating automatic analysis of feedback. We therefore conduct an overview study of sentiment analysis in the educational field as presented in recent research, to help readers grasp an overall understanding of this research area. In addition, based on the literature review, we identify three future directions that researchers can focus on in automatic feedback processing: high-level entity extraction, multilingual sentiment analysis, and the handling of figurative language.
ARTICLE | doi:10.20944/preprints202009.0195.v1
Subject: Social Sciences, Geography Keywords: urban governance; public participation; public comments; web-crawling data; qualitative content analysis; urban China
Online: 9 September 2020 (03:37:38 CEST)
Public participation is crucial in the process of urban governance in smart-city initiatives, enabling urban planners and policy makers to take account of real public needs. Our study aims to develop an analytical framework using citizen-centred qualitative data to analyse urban problems and identify the areas most in need of urban governance. Taking a Chinese megacity as the study area, we first utilise a web-crawling tool to retrieve public comments from an online comment board and employ the Baidu Application Programming Interfaces and a qualitative content analysis for data reclassification. We then analyse the urban problems reflected by negative comments in terms of their statistical and spatial distribution, and the associative factors that explain their formation. Our findings show that urban problems are predominantly related to construction and housing, and most frequently appear in industry-oriented areas and newly developed economic development zones on the urban fringe, where the reconciliation of government-centred governance with private governance by real estate developers and property management companies is most needed. Areas with higher land prices and a higher proportion of aged population tend to have fewer urban problems, while various types of civil facilities affect the prevalence of urban problems differently.
ARTICLE | doi:10.20944/preprints202107.0406.v1
Subject: Medicine & Pharmacology, Allergology Keywords: universal health coverage; health insurance claims; administrative data; claims database
Online: 19 July 2021 (11:38:35 CEST)
Although universal health coverage (UHC) is pursued by many countries, not all countries with UHC include dental care among their benefits. Japan, with its long-held tradition of UHC, covers dental care as an essential benefit, and the majority of dental care services are provided to all patients with minimal copayment. Under UHC, the scope of services as well as prices are regulated by a uniform fee schedule, and dentists submit claims according to the uniform format and fee schedule. The author analyzes publicly available dental health insurance claims data as well as a sampling survey on dental hygiene, and illustrates how Japan's dental care is responding to the challenges of population ageing.
REVIEW | doi:10.20944/preprints202007.0153.v1
Online: 8 July 2020 (11:53:33 CEST)
Large datasets that enable researchers to perform investigations with unprecedented rigor are growing increasingly common in neuroimaging. Due to the simultaneous increasing popularity of open science, these state-of-the-art datasets are more accessible than ever to researchers around the world. While analysis of these samples has pushed the field forward, they pose a new set of challenges that might cause difficulties for novice users. Here, we offer practical tips for working with large datasets from the end-user’s perspective. We cover all aspects of the data life cycle: from what to consider when downloading and storing the data, to tips on how to become acquainted with a dataset one did not collect, to what to share when communicating results. This manuscript serves as a practical guide one can use when working with large neuroimaging datasets, thus dissolving barriers to scientific discovery.
ARTICLE | doi:10.20944/preprints201912.0292.v1
Subject: Social Sciences, Geography Keywords: cultural differences; spatial interaction patterns; emotion analysis; Zhihu topic data; cultural geography
Online: 22 December 2019 (10:05:48 CET)
As an important research topic in cultural geography, the exploration and analysis of the laws of regional cultural differences has great significance for the discovery of distinctive cultures, the protection of regional cultures, and an in-depth understanding of cultural differences. In recent years, with the "spatial turn" of sociology, scholars are paying more and more attention to the implicit spatial information in social media data and the various social phenomena and laws it reflects. One important aspect is to grasp social cultural phenomena and their spatial distribution characteristics through text. Using machine learning methods such as popular natural language processing (NLP) techniques, this paper not only extracts hotspot cultural elements from text data but also detects the spatial interaction patterns of specific cultures and the characteristics of emotions towards non-native cultures. Taking the 6,128 answers to the question "what are the differences between South and North China that you never knew" on the Zhihu Q&A platform as an example, and with the help of NLP, this paper explores the cultural differences between South and North China in people's minds. It probes into people's feelings about and cognition of the cultural differences between South and North China from three aspects: the spatial interaction patterns of hotspot cultural elements, the components of hotspot culture, and the emotional characteristics arising under the influence of North-South cultural differences. The study reveals that 1) people from North and South China differ greatly in how they recognize each other's culture; 2) food culture is the most salient among the many cultural differences; and 3) people tend to show negative attitudes towards food cultures different from their own. All these findings shed light on the understanding of regional cultural differences and on addressing cultural conflicts. In addition, this paper provides an effective solution for studying, from a macro perspective, questions that have been difficult for new cultural geography.
ARTICLE | doi:10.20944/preprints202103.0357.v1
Subject: Engineering, Automotive Engineering Keywords: Geomagnetic induced current (GIC); Magnetic field component (dB/dt); Geomagnetic data (GMD); Coronal mass ejections (CMEs); Electric field (E)
Online: 12 March 2021 (23:58:40 CET)
Geomagnetically induced current (GIC) is a ground-end manifestation of space-weather perturbations that society should take seriously. Although GICs do not affect the power system regularly, they can cause large-scale system failures. Equatorial power systems have long been considered safe, since the most intense geomagnetic storms occur at high latitudes. However, the internal damage due to GICs that ultimately led to the South African power system failure has changed this perception. Therefore, a preliminary investigation of GIC activity in the equatorial region is performed to understand the impact of space weather on the power system. The time derivative of the horizontal magnetic field component (dB/dt) is computed as a proxy for GIC activity, based on Faraday's law. All reported power failures are compiled to produce a threshold value of dB/dt that could harm the system. The dB/dt analysis is then extended to show the pattern of GIC activity as a function of magnetic latitude and local time. The results reveal that power networks in the equatorial region may well have suffered from GICs. Moreover, a high number of intense GIC events in this region occurred on the dayside.
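A minimal sketch of the dB/dt proxy described above, assuming 1-minute magnetometer samples; the synthetic field components and the threshold are illustrative, whereas the paper derives its own threshold from reported power failures.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-minute horizontal field components (nT); in practice these
# would come from ground magnetometer records.
bx = 20000 + np.cumsum(rng.normal(0, 5, 1440))
by = 2000 + np.cumsum(rng.normal(0, 5, 1440))

h = np.hypot(bx, by)                # horizontal field magnitude, nT
dbdt = np.gradient(h, 60.0) * 60.0  # time derivative, expressed in nT/min

threshold = 100.0                   # illustrative value only
events = np.flatnonzero(np.abs(dbdt) > threshold)
print(f"{events.size} minutes exceed the illustrative dB/dt threshold")
```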
ARTICLE | doi:10.20944/preprints201801.0217.v1
Subject: Earth Sciences, Atmospheric Science Keywords: data assimilation; statistical diagnostics of analysis residuals; estimation of analysis error; air quality model diagnostics; Desroziers method; cross-validation
Online: 23 January 2018 (16:23:25 CET)
We examine how observations can be used to evaluate an air quality analysis by verifying against passive observations (i.e. cross-validation) that are not used to create the analysis, and we compare these verifications to those made against the same set of (active) observations that were used to generate the analysis. The results show that both active and passive observations can be used to evaluate first-moment metrics (e.g. bias), but only passive observations are useful for evaluating second-moment metrics such as the variance of observed-minus-analysis residuals and the correlation between observations and analysis. We derive a set of diagnostics based on passive observation-minus-analysis residuals and show that the true analysis error variance can be estimated without relying on any statistical optimality assumption. This diagnostic is used to obtain near-optimal analyses that are then used to evaluate the analysis error using several different methods. We compare the estimates according to the methods of Hollingsworth-Lönnberg and Desroziers, a diagnostic we introduce, and the perceived analysis error computed from the analysis scheme, and conclude that as long as the analysis is optimal, all estimates agree within a certain error margin. The analysis error variance at passive observation sites is also obtained.
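The core idea behind the passive-residual diagnostic can be stated compactly. Under the standard assumption that the errors of withheld (passive) observations are uncorrelated with the analysis error, a sketch of the variance decomposition reads:

```latex
% Compact statement of the passive-residual idea, under the standard
% assumption that errors of withheld (passive) observations are
% uncorrelated with the analysis error:
\begin{align}
  \operatorname{var}(O - A) &= \sigma_o^2 + \sigma_a^2
  && \text{(passive sites)} \\
  \hat{\sigma}_a^2 &= \operatorname{var}(O - A) - \sigma_o^2
\end{align}
% No optimality assumption on the analysis is needed. For active
% observations the decomposition fails, because the analysis is
% correlated with the errors of the very observations it assimilated.
```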
REVIEW | doi:10.20944/preprints202209.0032.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: cybersecurity; machine learning; deep learning; artificial intelligence; data-driven decision making; automation; cyber analytics; intelligent systems;
Online: 2 September 2022 (03:32:48 CEST)
Due to the digitization and Internet of Things revolutions, the present electronic world has a wealth of cybersecurity data. Efficiently resolving cyber anomalies and attacks is a growing concern in today's cybersecurity industry all over the world. Traditional security solutions are insufficient to address contemporary security issues due to the rapid proliferation of many sorts of cyber-attacks and threats. Utilizing artificial intelligence knowledge, especially machine learning technology, is essential to providing a dynamically enhanced, automated, and up-to-date security system through analyzing security data. In this paper, we provide an extensive view of machine learning algorithms, emphasizing how they can be employed for intelligent data analysis and automation in cybersecurity through their potential to extract valuable insights from cyber data. We also explore a number of potential real-world use cases where data-driven intelligence, automation, and decision-making enable next-generation cyber protection that is more proactive than traditional approaches. Finally, we highlight the future prospects of machine learning in cybersecurity, along with relevant research directions. Overall, our goal is to explore not only the current state of machine learning and relevant methodologies but also their applicability to future cybersecurity breakthroughs.
ARTICLE | doi:10.20944/preprints201810.0273.v1
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: astroparticle physics, cosmic rays, data life cycle management, data curation, meta data, big data, deep learning, open data
Online: 12 October 2018 (14:48:32 CEST)
Modern experimental astroparticle physics features large-scale setups measuring different messengers, namely high-energy particles generated by cosmic accelerators (e.g. supernova remnants, active galactic nuclei): cosmic and gamma rays, neutrinos, and the recently discovered gravitational waves. Ongoing and future experiments are distributed over the Earth, including ground, underground/underwater setups as well as balloon payloads and spacecraft. The data acquired by these experiments have different formats, storage concepts, and publication policies. Such differences are a crucial issue in the era of big data and of multi-messenger analysis strategies in astroparticle physics. We propose a service, ASTROPARTICLE.ONLINE, within which we are developing an open-science system that enables users to publish, store, search, select, and analyse astroparticle physics data. The cosmic-ray experiments KASCADE-Grande and TAIGA were chosen as pilot experiments to be included in this framework. In the first step of our initiative we will develop and test the following components of the full data life cycle concept: (i) describing, storing, and reusing astroparticle data; (ii) software for performing multi-experiment and multi-messenger analyses, e.g. with deep-learning methods; (iii) outreach, including example applications and tutorials for students and scientists outside the specific research field. In the present paper we describe the concepts of our initiative and, in particular, the plans toward a common, federated astroparticle data storage.
ARTICLE | doi:10.20944/preprints201703.0191.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: High-dimensional data analysis, Multiple hypothesis testing, False discovery rate, Optimum significance threshold, Maximum for reasonable number of rejected hypotheses, Big data analysis
Online: 24 March 2017 (18:29:35 CET)
This paper identifies a criterion for choosing the largest set of rejected hypotheses in high-dimensional data analysis, where multiple hypothesis testing is used in exploratory research to identify significant associations among many variables. The method neither requires predetermined significance thresholds nor presumed thresholds for the false discovery rate. The upper limit for the number of rejected hypotheses is determined by finding the maximum difference between the expected numbers of true and false rejections among all possible sets of rejected hypotheses. Methods for choosing a reasonable number of rejected hypotheses and an application to non-parametric analysis of ordinal survey data are presented.
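One plausible reading of the criterion, sketched below: under the global null the expected number of p-values below a threshold t is m·t, so the gap between k rejections and m·p_(k) expected false rejections can be maximized over k. This is an illustration of the idea, not the paper's exact estimator.

```python
import numpy as np

def optimal_rejections(pvalues: np.ndarray) -> int:
    # Choose k maximizing (rejections) - (expected false rejections m * p_(k)).
    p = np.sort(pvalues)
    m = p.size
    k = np.arange(1, m + 1)
    gap = k - m * p
    return int(k[np.argmax(gap)])

rng = np.random.default_rng(0)
# 50 genuine signals with small p-values mixed with 950 null p-values
p = np.concatenate([rng.uniform(0, 0.01, 50), rng.uniform(0, 1, 950)])
print(optimal_rejections(p))  # prints a count near the number of genuine signals
```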
Subject: Medicine & Pharmacology, Allergology Keywords: Acute Lymphoblastic Leukaemia; Flow Cytometry Data; Fisher’s Ratio; CD38; mathematical oncology; response biomarkers; personalized medicine
Online: 27 October 2020 (15:20:10 CET)
Artificial intelligence methods may help in unveiling information hidden in high-dimensional oncological data. Flow cytometry studies of haematological malignancies provide quantitative data with the potential to be used for the construction of response biomarkers. Many computational methods from the bioinformatics toolbox can be applied to these data but have not been exploited to their full potential in leukaemias, specifically in childhood B-cell acute lymphoblastic leukaemia. In this paper we analysed flow cytometry data obtained at diagnosis from 54 paediatric B-cell acute lymphoblastic leukaemia patients from two local institutions. We constructed classifiers based on Fisher's ratio to quantify differences in the expression levels of immunophenotypical markers between patients with relapsing and non-relapsing disease. The distribution of the marker CD38 was found, and validated, to have a strong discriminating power between the two patient cohorts, thus providing a classifier.
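A minimal sketch of a Fisher's-ratio ranking of markers between relapsing and non-relapsing cohorts; the synthetic data, marker columns, and cohort sizes are hypothetical stand-ins for the diagnostic flow cytometry measurements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in: 54 patients, three illustrative marker intensities,
# and a relapse label (14 relapsing, 40 non-relapsing).
df = pd.DataFrame({
    "CD38": np.r_[rng.normal(3, 1, 14), rng.normal(5, 1, 40)],
    "CD19": rng.normal(4, 1, 54),
    "CD10": rng.normal(4, 1, 54),
    "relapse": np.r_[np.ones(14, int), np.zeros(40, int)],
})

def fishers_ratio(a, b):
    # FR = (mu_a - mu_b)^2 / (sigma_a^2 + sigma_b^2)
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

rel, non = df[df.relapse == 1], df[df.relapse == 0]
ranking = {m: fishers_ratio(rel[m], non[m]) for m in ("CD38", "CD19", "CD10")}
print(sorted(ranking.items(), key=lambda kv: -kv[1]))  # CD38 should rank first here
```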
ARTICLE | doi:10.20944/preprints202105.0589.v1
Subject: Engineering, Automotive Engineering Keywords: Game Ratings; Public Data; Game Data; Data analysis; GRAC(Korea)
Online: 25 May 2021 (08:32:32 CEST)
As of 2020, the public data for game ratings provided by the Game Rating and Administration Committee (GRAC) are more limited than the public data for movie and video ratings provided by the Korea Media Rating Board, and they do not let users see rating information clearly and in detail. To obtain information on a game's rating, one must search for the specific title on the GRAC homepage, which is inconvenient. In order to remove this inconvenience and extend the scope of the public data provided, the author studies a public data API extended on the model of the video-ratings data. To identify the items to be added, this study analyzes the rating data on the GRAC homepage and designs a collection system to build a database. The study then implements a system that provides the data collected for the extended public-data items in the form users want. It is expected to supply rating information on behalf of GRAC, strengthening fairness, satisfying game users' and the public's right to know, and contributing to the promotion and development of the game industry.
REVIEW | doi:10.20944/preprints202111.0429.v1
Subject: Behavioral Sciences, Behavioral Neuroscience Keywords: Agri-Food; Food Supply Chain; Blockchain; IoT; Big Data; Sustainability; Food Security; COVID-19; Food Safety; Digitalization
Online: 23 November 2021 (14:52:59 CET)
Technological advances such as blockchain, artificial intelligence, big data, social media, and geographic information systems represent building blocks of the digital transformation that supports the resilience of the food supply chain (FSC) and increases its efficiency. This paper reviews the literature surrounding digitalization in FSCs. A bibliometric and key-route main path analysis was carried out to objectively and analytically uncover the development of knowledge on digitalization within the context of sustainable FSCs. The research began with the selection of 2140 articles published over nearly five decades. The articles were then examined according to several bibliometric metrics such as year of publication, countries, institutions, sources, authors, and keyword frequency. A keyword co-occurrence network was generated to cluster the relevant literature. Findings of the review and bibliometric analysis indicate that research at the intersection of technology and the FSC has gained substantial interest from scholars. On the basis of the keyword co-occurrence network, the literature focuses on the role of information communication technology for agriculture and food security, food waste and the circular economy, and the merging of the Internet of Things and blockchain in the FSC. The analysis of the key-route main path uncovers three critical periods marking the development of technology-enabled FSCs. The study offers scholars a better understanding of digitalization within the agri-food industry and of the current knowledge gaps for future research. Practitioners may find the review useful to remain ahead of the latest discussions of technology-enabled FSCs. To the authors' best knowledge, the current study is one of the few endeavors to explore technology-enabled FSCs using a comprehensive sample of journal articles published during the past five decades.
ARTICLE | doi:10.20944/preprints202007.0078.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: personalization; decision making; medical data; artificial intelligence; Data-driving; Big Data; Data Mining; Machine Learning
Online: 5 July 2020 (15:04:17 CEST)
The study applies machine learning and data mining methods to personalizing treatment, which allows individual patient characteristics to be investigated. Personalization is built on clustering and association rules. We suggest determining the average distance between instances in order to find optimal performance metrics. A formalization of the medical data pre-processing stage for finding personalized solutions based on current standards and pharmaceutical protocols is proposed, and a model of patient data is built. The paper presents a novel clustering approach built on an ensemble of clustering algorithms whose Hopkins metric is better than that of the k-means algorithm. Personalized treatment is usually based on decision trees, an approach that requires a lot of computation time and cannot be parallelized. Therefore, it is proposed to classify patients by condition and to determine the deviations of their parameters from the normative parameters of the group, as well as from the average parameters. This makes it possible to create a personalized approach to treatment for each patient based on long-term monitoring. According to the results of the analysis, it becomes possible to predict the optimal conditions for a particular patient and to find a medication regimen matching their personal characteristics.
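For reference, a sketch of one common formulation of the Hopkins statistic mentioned above (values near 1 suggest strongly clusterable data, values near 0.5 spatial randomness); this is a generic implementation, not the paper's ensemble method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X: np.ndarray, m: int = 50, seed: int = 0) -> float:
    """One common formulation: H near 1 => clusterable, near 0.5 => random."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sample = X[rng.choice(n, size=m, replace=False)]
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()  # uniform point -> nearest datum
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]     # datum -> nearest *other* datum
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(3, 0.1, (100, 2))])
print(hopkins(blobs), hopkins(rng.uniform(size=(200, 2))))  # ~1 vs ~0.5
```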
ARTICLE | doi:10.20944/preprints201806.0282.v1
Subject: Earth Sciences, Geoinformatics Keywords: land-use/land-cover; multi-decadal change analysis; irrigation ponds; textural features; supervised classification; multi-source data
Online: 18 June 2018 (16:40:31 CEST)
A multi-decadal change analysis of the irrigation ponds in Taoyuan, Taiwan was conducted using multi-source data including digitized ancient maps, declassified single-band CORONA satellite images, and multispectral SPOT images. Supervised LULC classifications were conducted using four textural features derived from the single-band CORONA images and spectral features derived from the SPOT images. Post-classification analysis revealed that the number of irrigation ponds in the study area decreased during the post-World War II farmland consolidation period (1945–1965) and the subsequent industrialization period (1970–2000). However, efforts to restore irrigation ponds in recent years have resulted in gradual increases in the number (9%) and total area (12%) of irrigation ponds in the study area.
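As an illustration of the textural-feature step, a sketch of grey-level co-occurrence matrix (GLCM) features from a single-band image; the random array stands in for a CORONA tile, and whether these four properties match the paper's exact choice is an assumption. Function names follow recent scikit-image releases (formerly spelled grey*).

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

band = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # stand-in for a CORONA tile

# Co-occurrence of grey levels at distance 1, in two directions
glcm = graycomatrix(band, distances=[1], angles=[0, np.pi / 2],
                    levels=256, symmetric=True, normed=True)

# Four textural features, averaged over directions, to feed a supervised classifier
features = {p: graycoprops(glcm, p).mean()
            for p in ("contrast", "homogeneity", "energy", "correlation")}
print(features)
```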
ARTICLE | doi:10.20944/preprints201908.0320.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Electrocardiography Analysis; Persistence Landscape; Signal Analysis; Machine Learning; Topological Data Analysis; Topological Signal Signature; Classification; Time Series Analysis; Biomedical Signal Analysis; Persistence Homology
Online: 30 August 2019 (09:51:40 CEST)
Data can be illustrated as shapes, and those shapes can provide insight for data modeling and information extraction. Topological data analysis offers an alternative perspective on biomedical data analysis and knowledge discovery using tools from algebraic topology. In the present work, we study the application of topological data analysis to personalized electrocardiographic signal classification for arrhythmia analysis. Using phase-space reconstruction, the signal samples are converted into point clouds to facilitate topological analysis. Persistence landscapes are then extracted from the point clouds as features for the arrhythmia classification task. We find that the proposed method is robust to the training set size: with a training set of only 20%, normal heartbeats are recognized with 100% accuracy, ventricular beats with 97.13%, supra-ventricular beats with 94.27%, and fusion beats with 94.27% in the corresponding experiments. Maintaining high performance with small training samples makes the proposed method especially applicable to personalized analysis. With the present study, we show that topological data analysis can be a useful tool in biomedical signal analysis and provides powerful capability for personalized analysis.
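A sketch of the two front-end steps named in the abstract: delay embedding of a signal into a point cloud, followed by persistent-homology computation. The ripser dependency and the synthetic beat are assumptions, and the persistence-landscape extraction is omitted.

```python
import numpy as np
from ripser import ripser  # assumed dependency for persistence diagrams

def delay_embed(signal: np.ndarray, dim: int = 3, tau: int = 4) -> np.ndarray:
    # Phase-space reconstruction: x(t), x(t+tau), ..., x(t+(dim-1)*tau)
    n = len(signal) - (dim - 1) * tau
    return np.column_stack([signal[i * tau : i * tau + n] for i in range(dim)])

beat = np.sin(np.linspace(0, 4 * np.pi, 200))  # stand-in for one heartbeat segment
cloud = delay_embed(beat)                       # point cloud in R^3
diagrams = ripser(cloud, maxdim=1)["dgms"]      # H0 and H1 persistence diagrams
print(diagrams[1])  # loops in phase space; raw material for landscape features
```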
ARTICLE | doi:10.20944/preprints202103.0593.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Business Intelligence; Data Mining; Data Warehouse.
Online: 24 March 2021 (13:47:31 CET)
In the coming years, the number of digital applications and services built on cloud-native systems will be enormous; according to IDC, it will exceed 500 million by 2023, which corresponds to the sum of all applications developed over the last 40 years. If that growth matters to your organization, this article is for you.
ARTICLE | doi:10.20944/preprints202012.0468.v1
Online: 18 December 2020 (13:29:38 CET)
This manuscript describes the construction and validation of high-resolution daily gridded (0.05° × 0.05°) rainfall and maximum and minimum temperature data for Bangladesh: the Enhancing National Climate Services for Bangladesh Meteorological Department (ENACTS-BMD) dataset. The dataset was generated by merging data from weather stations, satellite products (for rainfall), and reanalysis (for temperature). ENACTS-BMD is the first high-resolution gridded surface meteorological dataset developed specifically for studies of surface climate processes in Bangladesh. Its record begins in January 1981, it is updated monthly in near real-time, and outputs have daily, decadal, and monthly time resolution. The Climate Data Tools (CDT), developed by the International Research Institute for Climate and Society (IRI), Columbia University, is used to generate the dataset. The data processing includes the collection of weather-station and gridded data, quality control of station data, downscaling of the reanalysis temperature, bias correction of both the satellite rainfall and the downscaled reanalysis temperature, and the combination of the station and bias-corrected gridded data. The ENACTS-BMD dataset is available as an open-access product on BMD's official website, enhancing the provision of services, overcoming challenges of data quality, availability, and access, and promoting engagement and use by stakeholders.
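A deliberately simplified sketch of the merging idea (bias-correcting a gridded estimate against gauges, then blending); the real CDT workflow is far more elaborate, and all numbers and weights here are hypothetical.

```python
import numpy as np

station_rain = np.array([12.0, 0.0, 5.5])     # gauge observations (mm/day)
grid_at_stations = np.array([9.0, 1.0, 4.0])  # satellite estimate at the gauges

# Mean-bias correction of the gridded estimate against the stations
bias = (station_rain - grid_at_stations).mean()
corrected_grid = grid_at_stations + bias

# Simple blend toward the stations (weight is hypothetical; real merging
# uses spatially varying weights and interpolation)
w = 0.7
merged = w * station_rain + (1 - w) * corrected_grid
print(merged)
```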
CASE REPORT | doi:10.20944/preprints201801.0066.v1
Online: 8 January 2018 (11:11:47 CET)
The implementation of the European Cohesion Policy, aiming at fostering regional competitiveness, economic growth, and the creation of new jobs, is documented over the period 2014–2020 in the publicly available Open Data Portal for the European Structural and Investment Funds. On the basis of this source, this paper describes the process of data mining and visualization for producing information on regional programmes' performance in achieving effective expenditure of resources.
ARTICLE | doi:10.20944/preprints202205.0344.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Linked (open) Data; Semantic Interoperability; Data Mapping; Governmental Data; SPARQL; Ontologies
Online: 25 May 2022 (08:18:46 CEST)
In this paper, we present a method to map information on service activity provision residing in governmental portals across the European Commission. To perform this, we used as a basis the enriched Greek e-GIF ontology, modeling concepts and relations in one of the two data portals examined (i.e., Points of Single Contact), since the relevant information on the second was not provided. Mapping consisted of transforming the information appearing in governmental portals into RDF format (i.e., as Linked Data) so that it can be easily exchanged. Mapping proved a tedious task, since no description of how information is modeled in the second Point of Single Contact is provided, and it had to be extracted manually.
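A minimal sketch of expressing a portal record as RDF triples with rdflib so that it becomes exchangeable Linked Data; the namespace and property names are placeholders, not the actual e-GIF ontology.

```python
from rdflib import Graph, Literal, Namespace

EGIF = Namespace("http://example.org/egif#")  # hypothetical namespace
g = Graph()
g.bind("egif", EGIF)

# One service-provision record as triples (subject, predicate, object)
service = EGIF["service/123"]
g.add((service, EGIF.provisionActivity, Literal("Company incorporation")))
g.add((service, EGIF.competentAuthority, Literal("Ministry of Development")))

print(g.serialize(format="turtle"))  # exchangeable Linked Data representation
```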
ARTICLE | doi:10.20944/preprints202111.0073.v1
Subject: Medicine & Pharmacology, Other Keywords: data quality; OMOP CDM; EHDEN; healthcare data; real world data; RWD
Online: 3 November 2021 (09:12:54 CET)
Background: Observational health data have the potential to be a rich resource to inform clinical practice and regulatory decision making. However, the lack of standard data quality processes makes it difficult to know whether these data are research-ready. The EHDEN COVID-19 Rapid Collaboration Call presented the opportunity to assess how the newly developed open-source tool Data Quality Dashboard (DQD) informs the quality of data in a federated network. Methods: 15 Data Partners (DPs) from 10 different countries worked with the EHDEN taskforce to map their data to the OMOP CDM. Throughout the process at least two DQD results were collected and compared for each DP. Results: All DPs showed an improvement in their data quality between the first and last run of the DQD. The DQD excelled at helping DPs identify and fix conformance issues but showed less of an impact on completeness and plausibility checks. Conclusions: This is the first study to apply the DQD on multiple, disparate databases across a network. While study-specific checks should still be run, we recommend that all data holders converting their data to the OMOP CDM use the DQD, as it ensures conformance to the model specifications and that a database meets a baseline level of completeness and plausibility for use in research.
ARTICLE | doi:10.20944/preprints202110.0103.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Data Analytics; Analytics; Supply Chain Input; Supply Chain; Data Science; Data
Online: 6 October 2021 (10:38:42 CEST)
One of the most remarkable developments of the 20th century was the digitalization of technical progress, which changed the output of companies worldwide and became a defining feature of the century. The growth of information technology systems and the implementation of new technical advances, which enhance the integrity, agility, and long-term organizational performance of the supply chain, distinguish a digital supply chain from other supply chains. For example, Internet of Things (IoT)-enabled information exchange and big data analysis might be used to regulate mismatches between supply and demand. This literature investigation was undertaken to assess contemporary ideas and concepts in the field of data analysis in the context of supply chain management. The research was conducted as a comprehensive systematic literature review (SLR) drawing on a total of 71 papers from leading journals. The SLR finds that integrating data analytics into supply chain management can yield long-term benefits on the input side, i.e., improved strategic development, management, and other areas.
REVIEW | doi:10.20944/preprints202009.0500.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: COVID-19; impact on society during COVID-19; behavioral impact of COVID-19; government policies against COVID-19; measures adopted by the government; COVID-19 Statistics; Infection rate and Data analysis
Online: 21 September 2020 (11:09:11 CEST)
Background: The COVID-19 pandemic has pulled us all a few steps back: we no longer shake hands or hug each other when we meet friends and family after a gap, but instead greet them by saying Namaste with joined hands. As we all know, COVID-19 spreads through the air, and the only way to shield ourselves is by maintaining a safe distance from one another. Methodology: In order to conduct a meta-analysis of the number of COVID-19 cases in Kerala and India, data were retrieved from various sites hosted by government bodies. The data for analysis were collected from May 2020 to July 2020. The average number of days required to reach every 5000 fresh cases was also calculated from these data. COVID-19 has affected the economy holistically, across financial, behavioral, and societal aspects. Conclusion: Lifting the lockdown in a step-by-step process, keeping in mind the necessities of the nation, was a thoughtful act, but people who mistook this opportunity and did not remain in quarantine after coming from abroad were recognized as the reason behind the sudden and uncontrolled rise in the number of COVID-19 cases in Kerala, India. The government authorities had no option but to lift the restrictions to reduce the economic burdens that had already affected daily wage workers and farmers, prompting some to take their own lives.
ARTICLE | doi:10.20944/preprints202205.0182.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: Tourism; Measuring sustainability; Tourist satisfaction; E-reputation; Sustainable development; Sentiment analysis; ETIS; Open data; Geospatial Index
Online: 13 May 2022 (07:58:47 CEST)
The importance of measuring sustainability in tourism has advanced significantly in recent years, following the need to manage the impact of tourism on territories and host communities. It was further boosted by the pandemic, during which sustainability was defined as one of the central elements for restarting global tourism. The ETIS model, developed by the European Commission, is a point of reference based on self-assessment, data collection, and analysis by the destinations themselves. The application of the ETIS toolkit has faced many challenges, especially at the sub-national level, mostly related to the lack of available and updated data to feed the model. The hypothesis explored by the authors is to solve these implementation issues by developing an indicator that uses sentiment analysis to frame e-reputation and tourist satisfaction, further combined with other open data sources. The Tourism Sustainability Index (TSI) can provide a scalable and geo-referenced evaluation of tourism sustainability, measuring the four pillars and sub-components referenced to the ETIS criteria, and is applicable to any tourism destination. Results show that the TSI is a consistent and valid tool for destinations to analyze sustainability, monitor its evolution through time periods and sub-areas, and compare it to benchmark or competing areas.
Subject: Engineering, Automotive Engineering Keywords: Business Intelligence; Data warehouse; Data Marts; Architecture; Data; Information; cloud; Data Mining; evolution; technologic companies; tools; software
Online: 24 March 2021 (13:06:53 CET)
Information has been and will remain a vital element for individuals and departmental groups in an organization. That is why there are technologies that help us manage data properly; Business Intelligence is responsible for bringing technological solutions that correctly and effectively manage the entire volume of necessary and important information for companies. Among the solutions offered by Business Intelligence are data warehouses and data mining, among other business technologies that, working together, achieve the objectives proposed by an organization. It is important to highlight that these business technologies have been present since the 1950s and have evolved over time, improving processes, infrastructure, and methodologies and implementing new technologies, which have helped to correct past mistakes in information management for companies. A question remains about Business Intelligence: could it be that, in the not-too-distant future, it will be adopted as an essential standard or norm in any organization for data management, since it provides many benefits and avoids failures when classifying information? On the other hand, cloud storage has been the best alternative to safeguard information without depending on physical storage media, which are not 100% secure and are exposed to partial or total loss of information through hardware failures or security failures caused by mishandling.
ARTICLE | doi:10.20944/preprints202111.0410.v1
Subject: Engineering, Other Keywords: Data compression; data hiding; psnr; mse; virtual data; public cloud; quantization error
Online: 22 November 2021 (15:17:12 CET)
Nowadays, information security is a challenge, especially when data are transmitted or shared in public clouds. Many researchers have proposed techniques that fail to provide data integrity, security, and authentication, or that raise other issues related to sensitive data. The most common techniques used to protect data during transmission on a public cloud are cryptography, steganography, and compression. The proposed scheme suggests an entirely new approach to data security on the public cloud: it makes secret data completely invisible behind a carrier object, such that it is not detected by image performance parameters like PSNR, MSE, and entropy. Detailed results are explained in the results section of the paper. The proposed technique has better outcomes than existing techniques as a security mechanism on a public cloud. The primary focus of the suggested approach is to minimize the integrity loss of public storage data due to unrestricted access rights by users. Improving the reusability of the carrier even after data have been concealed is a challenging task, and it is achieved through the suggested approach.
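For reference, the standard MSE and PSNR computations used to argue that a stego image is visually indistinguishable from its carrier; the one-bit perturbation below is only a toy stand-in for an embedding scheme.

```python
import numpy as np

def mse(carrier: np.ndarray, stego: np.ndarray) -> float:
    return float(np.mean((carrier.astype(float) - stego.astype(float)) ** 2))

def psnr(carrier: np.ndarray, stego: np.ndarray, peak: float = 255.0) -> float:
    # PSNR = 10 * log10(peak^2 / MSE); infinite when the images are identical
    m = mse(carrier, stego)
    return float("inf") if m == 0 else 10.0 * np.log10(peak ** 2 / m)

carrier = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
stego = carrier.copy()
stego[0, 0] ^= 1  # flip one least-significant bit, a toy embedding
print(mse(carrier, stego), psnr(carrier, stego))  # tiny MSE, very high PSNR
```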
ARTICLE | doi:10.20944/preprints201808.0350.v2
Subject: Mathematics & Computer Science, Other Keywords: big data; clustering; data mining; educational data mining; e-learning; profile learning
Online: 19 October 2018 (05:58:05 CEST)
Educational data mining is an evolving discipline that focuses on the improvement of self-learning and adaptive methods. It is used for finding hidden patterns and intrinsic structures in educational data. In education, heterogeneous data are involved and are continuously growing within the big-data paradigm; extracting meaningful information adaptively from big educational data requires specific data mining techniques. This paper presents a clustering approach to partition students into different groups or clusters based on their learning behavior. Furthermore, a personalized e-learning system architecture is presented which selects and delivers teaching content according to students' learning capabilities. The primary objective is the discovery of optimal settings in which learners can improve their learning capabilities; moreover, the administration can find essential hidden patterns to bring effective reforms to the existing system. The clustering methods K-Means, K-Medoids, Density-Based Spatial Clustering of Applications with Noise, Agglomerative Hierarchical Cluster Tree, and Clustering by Fast Search and Finding of Density Peaks via Heat Diffusion (CFSFDP-HD) are analyzed on educational data. It is observed that more robust results can be achieved by replacing the existing methods with CFSFDP-HD. Data mining techniques are equally effective for analyzing big data to make education systems vigorous.
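A sketch of partitioning learners with some of the classical methods named above, using scikit-learn; the random feature matrix stands in for learning-behavior data, and CFSFDP-HD is omitted because it has no reference scikit-learn implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering

X = np.random.rand(300, 4)  # stand-in for learning-behavior features

models = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=4),
    "dbscan": DBSCAN(eps=0.3, min_samples=5),
}
for name, model in models.items():
    labels = model.fit_predict(X)  # cluster label per student
    print(name, np.unique(labels).size, "clusters")
```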
REVIEW | doi:10.20944/preprints201807.0059.v1
Subject: Life Sciences, Biophysics Keywords: data normalization; data scaling; zero-sum; metabolic fingerprinting; NMR; statistical data analysis
Online: 3 July 2018 (16:22:31 CEST)
The aim of this article is to summarize recent bioinformatic and statistical developments applicable to NMR-based metabolomics. Extracting relevant information from large multivariate datasets by statistical data analysis strategies can be of considerable complexity. Typical tasks comprise, for example, the classification of specimens, the identification of differentially produced metabolites, and the estimation of fold changes. In this context it is of prime importance to minimize contributions from unwanted biases and experimental variance prior to these analyses; this is the goal of data normalization, and special emphasis is therefore given to different normalization strategies. In the first part, we discuss the requirements, pros, and cons of a variety of commonly applied strategies. In the second part, we concentrate on possible solutions for cases where the requirements of the standard strategies are not fulfilled. In the last part, very recent developments are discussed that allow reliable estimation of metabolic signatures for sample classification without prior data normalization. Throughout, special emphasis is given to techniques that have worked well in our hands.
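As one concrete example of a commonly applied strategy, a sketch of probabilistic quotient normalization (PQN) on a spectra-by-features matrix; whether PQN is among the strategies this particular review favors is not stated in the abstract.

```python
import numpy as np

def pqn(spectra: np.ndarray) -> np.ndarray:
    """Probabilistic quotient normalization; rows are spectra, columns features."""
    # Integral (total-sum) normalization first, as commonly recommended
    spectra = spectra / spectra.sum(axis=1, keepdims=True)
    reference = np.median(spectra, axis=0)   # median reference spectrum
    quotients = spectra / reference          # feature-wise fold changes
    dilution = np.median(quotients, axis=1)  # most probable dilution per spectrum
    return spectra / dilution[:, None]

X = np.abs(np.random.randn(10, 200)) + 0.1   # synthetic positive intensities
print(pqn(X).sum(axis=1))
```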
Subject: Mathematics & Computer Science, Other Keywords: Data Science; Advertising Campaign; Effectiveness; Evaluating, Social Media; Digital Marketing; Sentiment Analysis; Instagram; Machine Learning
Online: 11 August 2021 (08:44:07 CEST)
The growth of social media has changed the face of many aspects of marketing, such as online and digital marketing, and it has changed the way modern humans communicate and connect with others. Behavior on these platforms cannot and should not be evaluated with the strategies of other marketing channels and media. Owing to the nature of social media, the data are rich, precise, and lean, but processing them and extracting knowledge and insights from them is problematic, and evaluating the effectiveness of a marketing endeavor is one task that depends on these data. The current research assesses the effectiveness of an advertising campaign on Instagram via advertising cost and sentiment classification of audience opinion regarding the campaign. The methodology used is the standard data mining process, i.e., CRISP-DM. Multiple machine learning models and approaches were studied to train a prediction model on the data; to find the most accurate model, a grid search was performed over different algorithms and combinations of hyper-parameters. The obtained results revealed that although the number of unprofitable advertising media was higher than that of profitable media, the overall status of the campaign was profitable, both in the cost-effectiveness approach and in the sentiment analysis approach. Another valuable outcome of this research is a set of general and specific insights that can be used to shape a better-performing and more effective advertising campaign on Instagram.
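A sketch of the model-selection step described, i.e., a grid search over several algorithms, each with its own hyper-parameter grid, compared by cross-validation; the toy features and grids are assumptions, not the study's actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in for vectorized audience comments and sentiment labels
X = [[0, 1], [1, 0], [1, 1], [0, 0]] * 10
y = [1, 0, 1, 0] * 10

searches = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=1000),
                           {"C": [0.1, 1.0, 10.0]}, cv=5),
    "forest": GridSearchCV(RandomForestClassifier(random_state=0),
                           {"n_estimators": [50, 100], "max_depth": [None, 5]}, cv=5),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_score_, search.best_params_)
```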
Online: 30 October 2020 (15:35:00 CET)
In today's information age, data are becoming more and more important. While other industries achieve tangible improvements by applying cutting-edge information technology, the construction industry still lags far behind. Cost, schedule, and performance control are three major functions in the project execution phase. Alongside their individual importance, cost-schedule integration has been a significant challenge in the construction industry over the past five decades. Although much effort has been put into this development, no resulting method is used in construction practice. The purpose of this study is to propose a new method to integrate cost and schedule data using big data technology. The proposed algorithm is designed to provide data integrity and flexibility in the integration process, a considerable reduction in the time needed to build and change the database, and practical use on a construction site. It is expected that the proposed method can transform, in a data-friendly way, the current situation in which field engineers regard information management as one of their most troublesome tasks.
ARTICLE | doi:10.20944/preprints201701.0090.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: transportation data; data interlinking; automatic schema matching
Online: 20 January 2017 (03:38:06 CET)
Multimodality requires the integration of heterogeneous transportation data to construct a broad view of the transportation network. Many new transportation services are emerging while remaining isolated from previously existing networks. This leads them to publish their data sources on the web, according to Linked Data principles, in order to gain visibility. Our interest is to use these data to construct an extended transportation network that links these new services to existing ones. The main problems we tackle in this article fall into the categories of automatic schema matching and data interlinking. We propose an approach that uses web services as mediators to help automatically detect geospatial properties and map them between two different schemas. We also propose a new interlinking approach that enables users to define rich semantic links between datasets in a flexible and customizable way.
ARTICLE | doi:10.20944/preprints202111.0266.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Pan-Cancer; somatic point mutations; cancer subtyping; biomarker discovery; driver genes; personalized medicine; health data analytics
Online: 15 November 2021 (13:51:33 CET)
The advent of high-throughput sequencing has enabled researchers to systematically evaluate the genetic variations in cancer, resulting in the identification of many cancer-associated genes. Although cancers in the same tissue are widely categorized in the same group, they demonstrate many differences in their mutational profiles; hence there is no "silver bullet" for the treatment of a given cancer type. This reveals the importance of developing a pipeline to identify cancer-associated genes accurately and to re-classify patients with similar mutational profiles. Classifying cancer patients with similar mutational profiles may help discover subtypes of patients who might benefit from specific treatment types. In this study, we propose a new machine learning pipeline that identifies protein-coding genes mutated in a significant portion of samples in order to identify cancer subtypes. We applied our pipeline to 12,270 samples collected from the International Cancer Genome Consortium (ICGC), covering 19 cancer types, and identified 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates distinguishable properties, including unique cancer-related signaling pathways, for most of which targeted treatment options are currently available. This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profiles of patients. We also comprehensively study the causes of mutations among the samples in each subtype by mining their mutational signatures, which provides important insight into the active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and axon guidance pathways, are frequently observed in cancer. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes. In addition, our study of "gene-motifs" suggests the importance of considering both the context of the mutations and the mutational processes when identifying cancer-associated genes. The source code for our proposed clustering pipeline and analysis is publicly available at: https://github.com/bcb-sut/Pan-Cancer.
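A sketch of the general subtyping idea: build a binary gene-by-sample mutation matrix and cluster samples with similar profiles. This illustrates the concept only and is not the pipeline from the repository above; the calls, genes, and cluster count are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical long-format mutation calls: one row per (sample, gene) hit
calls = pd.DataFrame({"sample": ["s1", "s1", "s2", "s3"],
                      "gene": ["TP53", "KRAS", "TP53", "EGFR"]})

# Binary sample-by-gene matrix (1 = gene mutated in that sample)
matrix = pd.crosstab(calls["sample"], calls["gene"]).clip(upper=1)

# Cluster samples with similar mutational profiles into candidate subtypes
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
print(dict(zip(matrix.index, labels)))
```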