DATASET | doi:10.20944/preprints202012.0047.v1
Subject: Engineering, Automotive Engineering Keywords: eye tracking dataset; gaze tracking dataset; iris tracking dataset; CNN for eye-tracking; neural networks for eye-tracking
Online: 2 December 2020 (08:00:46 CET)
In recent years, many different deep neural networks have been developed, but because of the large number of layers in deep networks, their training requires a long time and a large number of datasets. Today it is popular to use pre-trained deep neural networks for various tasks, even simple ones for which such deep networks are not required. Well-known deep networks such as YoloV3, SSD, etc. are intended for tracking and monitoring various objects; therefore, their weights are heavy and the overall accuracy for a specific task is low. Eye-tracking tasks need to detect only one object, an iris, in a given area. Therefore, it is logical to use a neural network dedicated to this task. The problem, however, is the lack of suitable datasets for training the model. In this manuscript, we present a dataset that is suitable for training custom convolutional neural network models for eye-tracking tasks. Using this dataset, each user can independently pre-train convolutional neural network models for eye-tracking tasks. The dataset contains 10,000 annotated eye images at a resolution of 416 by 416 pixels. The table with annotation information gives the coordinates and radius of the eye for each image. This manuscript can be considered a guide to the preparation of datasets for eye-tracking devices.
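As an illustration of how a centre-and-radius annotation might be used for detector training, the following minimal sketch converts one annotation into a normalized bounding box. The function name, the argument order, and the normalized (x_center, y_center, width, height) output layout are assumptions made for this example; the abstract does not specify the annotation file format.

```python
from typing import Tuple


def circle_to_box(cx: float, cy: float, r: float, size: int = 416) -> Tuple[float, float, float, float]:
    """Convert a circle (centre and radius, in pixels) to a normalized box.

    Hypothetical helper: the output layout (x_center, y_center, width, height),
    each divided by the image size, is an assumption, not the dataset's format.
    The box is clipped to the image so circles near the border stay valid.
    """
    left = max(cx - r, 0.0)
    top = max(cy - r, 0.0)
    right = min(cx + r, float(size))
    bottom = min(cy + r, float(size))
    return (
        (left + right) / 2 / size,
        (top + bottom) / 2 / size,
        (right - left) / size,
        (bottom - top) / size,
    )


# A circle centred in a 416 x 416 image with a radius of 52 pixels.
print(circle_to_box(208, 208, 52))  # (0.5, 0.5, 0.25, 0.25)
```

Clipping to the image border is a design choice of the sketch; a different training pipeline may prefer to discard partially visible annotations instead.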
REVIEW | doi:10.20944/preprints202110.0247.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: natural language; NLP; Korean; dataset
Online: 18 October 2021 (14:33:41 CEST)
English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests with English datasets are sufficient to show off the performance of new models and methods, a researcher still needs to train and validate the models on Korean-based datasets to produce a technology or product suitable for Korean processing. This paper introduces 15 popular Korean-based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide detailed instructions with samples or statistics of the datasets. The main characteristics of the datasets are presented in a single table to give researchers a rapid summary.
ARTICLE | doi:10.20944/preprints202002.0170.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Twitter; dataset; redundancy; reduction; archive
Online: 13 February 2020 (12:45:44 CET)
Data from social networks like Twitter is a valuable source for research but is full of redundancy, making it hard to provide large-scale, self-contained, and small datasets. Data recording is a common problem in social media-based studies and could be standardized. Sadly, this is hardly done. This paper reports on lessons learned from a long-term evaluation study recording the complete public sample of the German and English Twitter stream. It presents a recording solution proposal that merely chunks a linear stream of events to reduce redundancy. If events are observed multiple times within the time span of a chunk, only the latest observation is written to the chunk. A 10-gigabyte raw Twitter dataset covering 1.2 million tweets from 120,000 users, recorded between June and September 2017, was used to analyze the compression rates that can be expected. It turned out that the resulting datasets need only between 10% and 20% of the original data size without losing any event, metadata, or the relationships between single events. This kind of redundancy-reducing recording makes it possible to curate large-scale (even nation-wide), self-contained, and small datasets of social networks for research in a standardized and reproducible manner.
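The chunking idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' recording solution: the event layout (timestamp, entity id, payload), the function name `chunk_latest`, and the fixed chunk length in seconds are all hypothetical.

```python
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

# Assumed event layout: (timestamp in seconds, entity id, payload).
Event = Tuple[float, Hashable, dict]


def chunk_latest(events: Iterable[Event], chunk_seconds: float) -> Iterator[List[Event]]:
    """Split a time-ordered stream into chunks of a fixed time span and keep
    only the latest observation of each entity id within a chunk."""
    chunk_start = None
    latest: Dict[Hashable, Event] = {}
    for event in events:
        timestamp, entity_id, _payload = event
        if chunk_start is None:
            chunk_start = timestamp
        elif timestamp - chunk_start >= chunk_seconds:
            # The current chunk is complete: emit it and start a new one.
            yield list(latest.values())
            latest = {}
            chunk_start = timestamp
        # A later observation of the same entity overwrites the earlier one.
        latest[entity_id] = event
    if latest:
        yield list(latest.values())


stream = [
    (0, "user:1", {"followers": 10}),
    (1, "tweet:7", {}),
    (2, "user:1", {"followers": 11}),   # same user seen again in the same chunk
    (12, "user:1", {"followers": 12}),  # falls into the next 10-second chunk
]
chunks = list(chunk_latest(stream, chunk_seconds=10))
print(len(chunks), [len(chunk) for chunk in chunks])  # 2 [2, 1]
```

In this toy stream, four observed events shrink to three stored ones; the longer the chunk, the more repeated observations of the same entity collapse into one.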
ARTICLE | doi:10.20944/preprints202305.0621.v1
Subject: Engineering, Energy And Fuel Technology Keywords: SOC; SOH; Dataset; Ageing; Model; Estimation
Online: 9 May 2023 (09:11:39 CEST)
State estimation for lithium-ion battery cells has been the topic of many publications concerning the different states of a battery cell. They often focus on a battery cell’s state of charge (SOC) or state of health (SOH). This paper therefore introduces, on the one hand, a new lithium-ion battery dataset with dynamic validation data over degradation and, on the other hand, a model-based SOC and SOH estimation that uses this dataset as a reference. An unscented Kalman filter-based approach was used for SOC estimation and extended with a holistic ageing model to handle the SOH estimation. The paper describes the dataset, the models, the parameterisation, the implementation of the state estimations, and their validation using parts of the dataset, resulting in a SOC and SOH estimation over battery life. The results show that the dataset can be used to extract parameters, design models based on it, and validate them with dynamically degraded battery cells.
DATA DESCRIPTOR | doi:10.20944/preprints202310.0514.v2
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: anime character; 3D animation; audio-visual dataset
Online: 2 November 2023 (10:59:40 CET)
Characters are one of the most important elements in composing digital animation. The appearance and voice of a character should be designed to express the personality and values of the character. However, it is not easy for animation producers to harmoniously match the appearance and voice of a character. Advances in deep learning technology have made it possible to overcome this limitation. To achieve this, firstly, an audio-visual dataset of characters is required. In this study, we construct and verify a Korean audio-visual dataset consisting of frontal face images of various characters and short voice clips. We developed an application that can automatically extract the frontal face image and a short voice clip of a character by collecting videos uploaded to YouTube. Through this, a dataset consisting of a total of 1,522 face images and a total of 7,999 seconds of voice clips was built based on 490 characters. Furthermore, we automatically label characters by gender and age to validate the dataset. The dataset built in this study is expected to be used in various deep learning fields, such as classification, generative adversarial networks, and speech synthesis.
ARTICLE | doi:10.20944/preprints202203.0172.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: object detection; larger-scale dataset; stacked carton
Online: 11 March 2022 (15:48:23 CET)
Carton detection is an important technique in automatic logistics systems and can be applied to many applications, such as the stacking and unstacking of cartons and the unloading of cartons from containers. However, up to now there has been no public large-scale carton dataset with which the research community can train and evaluate carton detection models, which hinders the development of carton detection. In this paper, we present a large-scale carton dataset named Stacked Carton Dataset (SCD) with the goal of advancing the state-of-the-art in carton detection. Images are collected from the Internet and several warehouses, and objects are labeled using per-instance segmentation for precise localization. There are a total of 250,000 instance masks from 16,136 images. Naturally, a suite of benchmarks is established with several popular detectors. In addition, we design a carton detector based on RetinaNet by embedding our proposed Offset Prediction between Classification and Localization module (OPCL) and Boundary Guided Supervision module (BGS). OPCL alleviates the imbalance problem between classification and localization quality, which boosts AP by 3.1%∼4.7% on SCD at the model level, while BGS guides the detector to pay more attention to the boundary information of cartons and to decouple repeated carton textures at the task level. To demonstrate the generalization of OPCL to other datasets, we conduct extensive experiments on MS COCO and PASCAL VOC. The improvements of AP on MS COCO and PASCAL VOC are 1.8%∼2.2% and 3.4%∼4.3% respectively. The source dataset is available here.
ARTICLE | doi:10.20944/preprints201907.0039.v1
Subject: Computer Science And Mathematics, Other Keywords: fetal heart rate, baseline, acceleration, deceleration, dataset
Online: 2 July 2019 (11:17:55 CEST)
The fetal heart rate (FHR) is a screening signal for preventing fetal hypoxia during labor. When experts analyze this signal, they have to position a baseline and identify decelerations and accelerations. These steps can potentially be automated and made more objective by data processing analysis, but training and evaluation datasets are required. Here, we describe a dataset of 155 FHR recordings in which a reference baseline, accelerations and decelerations have been annotated by expert consensus. 66 FHR recordings with a shared expert analysis have been included in a training dataset, and 90 other FHR recordings with a non-shared expert analysis have been included in an evaluation dataset. Researchers wishing to evaluate their automatic analysis method should submit their results for comparison with the expert consensus. The dataset also contains the results produced by 11 re-coded automatic analysis methods from the literature. All the data are available at http://utsb.univ-catholille.fr/fhr-review.
ARTICLE | doi:10.20944/preprints202308.1933.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Pancreatic cancer; NIH PLCO dataset; feature selection; classification
Online: 29 August 2023 (10:11:39 CEST)
Background: Pancreatic cancer (PC) is a disease with a poor prognosis and survival rate. There is a pertinent need to identify the risk factors of this disease. The purpose of this study is to identify a subset of factors (a.k.a. features) as predictors of PC from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer dataset, which consists of responses to 65 questions about demographics, cancer and health history, medication usage, and smoking habits from 154,897 participants. Method: There are two challenges to selecting the subset of features that predict PC with the highest probability: the problem is computationally intractable, and the PLCO dataset is highly imbalanced. We use an innovative method to use the dataset in a balanced way, without involving up- or down-sampling. We use nine feature selection methods to select the optimal subset of features from the preprocessed and balanced dataset. Results: Our preprocessed dataset consists of 32 risk factors (8 demographics, 5 cancer history, 13 health history, 2 medication usage, 4 smoking habits). Risk factors belonging to cancer and health history, followed by smoking habits, were consistently chosen by the feature selection methods. We also discuss findings in the medical sciences literature that corroborate our findings. Conclusions: The study found that risk factors belonging to cancer and health history are the most prominent ones for PC. In particular, previously diagnosed with PC is chosen as the most prominent risk factor by the majority of methods. While most of our findings are consistent with the literature, some of them shed light on novel factors that may not have received their due attention from the research community.
ARTICLE | doi:10.20944/preprints202307.1125.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Mexican sign language; Dataset; Hand-gestures; Computer-vision
Online: 17 July 2023 (16:17:50 CEST)
In Mexico, the incorporation of deaf people into education has been lacking, since only 14% of the deaf population between 3 and 29 years of age access education with the support of a hearing aid. Additionally, those who have been incorporated frequently face inappropriate educational strategies that poorly develop the use of Mexican Sign Language (MSL); academic success and opportunities for insertion in the workplace are therefore difficult to achieve. This research explores a novel Mexican Sign Language lexicon video dataset containing the dynamic gestures most frequently used in MSL. Each gesture consists of a set of different versions of videos recorded under uncontrolled conditions. The MX-ITESO-100 dataset is composed of a lexicon of 100 gestures and 5,000 videos from three participants with different grammatical elements. Additionally, the dataset is evaluated with a two-step neural network model, reaching an accuracy greater than 99%, and thus serves as a benchmark for future training of machine learning models in computer vision systems. Finally, this research provides an inclusive environment within society and organizations, in particular for people with hearing impairment.
ARTICLE | doi:10.20944/preprints202306.0616.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Crop Disease Classification; Crop Disease Dataset; Image Augmentation
Online: 8 June 2023 (09:47:54 CEST)
Crop disease classification has always been a critical and persistent problem in the field of agricultural and forestry sciences, where often we do not have access to a sufficient number of samples to know the distribution of real-world samples. How to make full use of the existing data is the starting point of our thinking. To address this problem, this paper proposes a supervised image augmentation method Negative Contrast, which uses the contrast images of existing disease samples after removing disease areas as negative samples for image augmentation when samples are relatively scarce. Numerous experiments have shown that several classical models using this augmentation method have improved in disease classification of four crops, rice, wheat, corn, and soybean, with a maximum accuracy improvement of 30.8%. In addition, the comparative analysis of attentional heat map shows that the model using Negative Contrast is more accurate and intense on the area of interest of diseases, and thus reflects better generalization ability in real-world disease classification. Our dataset and codes can be found in https://www.kaggle.com/datasets/w970704112/corn-wheat-rice-soybean and https://github.com/hiter0/contrastaug .
ARTICLE | doi:10.20944/preprints202301.0551.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: multimodal dataset; sentiment analysis; classroom atmosphere; intelligent education
Online: 30 January 2023 (09:49:12 CET)
In this paper, we present a multimodal dataset for the analysis of classroom atmosphere, based on the behavior and voice of teachers in teaching scenarios. We propose four visual models, three audio models, and one visual-audio dual-modality model to be tested on our dataset. The results indicate that the CH-CC dataset is feasible and reliable and that the visual modality plays a major role in the analysis of this dataset.
ARTICLE | doi:10.20944/preprints202210.0301.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Emotion prediction; music; music emotion dataset; affective computing
Online: 20 October 2022 (08:33:49 CEST)
Music is capable of conveying many emotions. The level and type of emotion of the music perceived by a listener, however, is highly subjective. In this study, we present the Music Emotion Recognition with Profile information dataset (MERP). This database was collected through Amazon Mechanical Turk (MTurk) and features dynamical valence and arousal ratings of 54 selected full-length songs. The dataset contains music features, as well as user profile information of the annotators. The songs were selected from the Free Music Archive using an innovative method (a Triple Neural Network with the OpenSmile toolkit) to identify 50 songs with the most distinctive emotions. Specifically, the songs were chosen to fully cover the four quadrants of the valence arousal space. Four additional songs were selected from DEAM to act as a benchmark in this study and filter out low quality ratings. A total of 277 participants participated in annotating the dataset, and their demographic information, listening preferences, and musical background were recorded. We offer an extensive analysis of the resulting dataset, together with a baseline emotion prediction model based on a fully connected model and an LSTM model, for our newly proposed MERP dataset.
COMMUNICATION | doi:10.20944/preprints202207.0450.v1
Subject: Environmental And Earth Sciences, Oceanography Keywords: SAR image; ship wake; deep learning; synthetic dataset
Online: 29 July 2022 (05:51:03 CEST)
The classification of vessel types in SAR imagery is of crucial importance for maritime applications. However, the ability to use real SAR imagery for deep learning classification is limited, due to the general lack of such data and/or the labor-intensive nature of labeling them. Simulating SAR images can overcome these limitations, allowing the generation of an infinite number of datasets. In this contribution, we present a synthetic SAR imagery dataset with ship wakes, which comprises 46080 images for ten different real vessel models. The variety of simulation parameters includes 16 ship heading directions, 6 ship velocities, 8 wind directions, 2 wind velocities, and 3 incidence angles. In addition, we extensively investigate classification performance for noise-free, noisy, and denoised ship wake scenes. We utilize the standard AlexNet architecture and employ training from scratch. To achieve the best classification performance, we conduct Bayesian optimization to determine hyperparameters. Results demonstrate that the classification of vessel types based on their SAR signatures is highly efficient, with maximum accuracies of 96.16%, 92.7%, and 93.59%, when training using noise-free, noisy, and denoised datasets respectively. Thus, we conclude that the best strategy in practical applications should be to train convolutional neural networks on denoised SAR datasets. The results show that the versatility of the SAR simulator can open up new horizons in the application of machine learning to a variety of SAR platforms.
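The reported dataset size is consistent with a full factorial combination of the listed simulation parameters. The short check below assumes exactly one image per parameter combination and vessel model, which the abstract implies but does not state; the dictionary keys are labels chosen for this example.

```python
from itertools import product
from math import prod

# Number of values per simulation parameter, as listed in the abstract.
parameter_counts = {
    "vessel_model": 10,
    "ship_heading": 16,
    "ship_velocity": 6,
    "wind_direction": 8,
    "wind_velocity": 2,
    "incidence_angle": 3,
}

# Total number of scenes under the one-image-per-combination assumption.
total_images = prod(parameter_counts.values())

# The same count obtained by enumerating every combination explicitly.
grid = list(product(*(range(n) for n in parameter_counts.values())))

print(total_images, len(grid))  # 46080 46080
```

Each tuple in `grid` can serve as an index into per-parameter value lists when iterating over simulation configurations.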
ARTICLE | doi:10.20944/preprints202207.0176.v1
Subject: Engineering, Control And Systems Engineering Keywords: Lateral Movement; Sysmon; Dataset; Attacks; Network Security; Hacking
Online: 12 July 2022 (08:23:10 CEST)
This work attempts to answer in a clear way the following key questions regarding the optimal initialization of the Sysmon tool, towards the identification of Lateral Movement in the MS Windows ecosystem. First, from an expert’s standpoint and with reference to the relevant literature, what are the criteria of determining the possibly optimal initialization features of the Sysmon’s event monitoring tool, which are also applicable as custom rules within the config.xml configuration file? Second, based on the identified features, how can a functional configuration file, able to identify as many LM variants as possible, be generated? To answer these questions, we relied on the MITRE ATT&CK knowledge base of adversary tactics and techniques, and focused on the execution of the nine commonest LM methods. The conducted experiments, performed on a properly configured testbed, suggested a great number of interrelated networking features, that were implemented as custom rules in the Sysmon’s config.xml file. Moreover, by capitalizing on the rich corpus of the 870K Sysmon logs collected, we create and evaluate in terms of TP and FP rates an extensible Python .evtx file analyzer, dubbed PeX, which can be used towards automatizing the parsing and scrutiny of such voluminous files. Both the .evtx logs dataset and the developed PeX tool are provided publicly for further propelling future research in this interesting and rapidly evolving field.
DATASET | doi:10.20944/preprints202206.0346.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: dataset; NLP; Human Resource Management; classification; Job description
Online: 27 June 2022 (03:43:51 CEST)
We describe a dataset that contains job descriptions published on a popular online website in the information and technology sector. As the website focuses mainly on United Kingdom-based jobs, the data have a specific focus on this country. It contains 11,501 job vacancies, each with 13 related metadata fields. The dataset is suitable for HR analysis using machine learning techniques such as natural language processing and neural networks.
ARTICLE | doi:10.20944/preprints201706.0033.v2
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: smartphone accelerometers; dataset; human activity recognition; fall detection
Online: 18 July 2017 (13:16:10 CEST)
Smartphones, smartwatches, fitness trackers, and ad-hoc wearable devices are being increasingly used to monitor human activities. Data acquired by the hosted sensors are usually processed by machine-learning-based algorithms to classify human activities. The success of those algorithms mostly depends on the availability of training (labeled) data that, if made publicly available, would allow researchers to make objective comparisons between techniques. Nowadays, publicly available datasets are few, often contain samples from subjects with too similar characteristics, and very often lack specific information, so that it is not possible to select subsets of samples according to specific criteria. In this article, we present a new smartphone accelerometer dataset designed for activity recognition. The dataset includes 11,771 activities performed by 30 subjects of ages ranging from 18 to 60 years. Activities are divided into 17 fine-grained classes grouped into two coarse-grained classes: 9 types of activities of daily living (ADL) and 8 types of falls. The dataset has been stored so as to include all the information useful for selecting samples according to different criteria, such as the type of ADL performed, the age, the gender, and so on. Finally, the dataset has been benchmarked with two different classifiers and with different configurations. The best results are achieved with k-NN classifying ADLs only, considering personalization, and with both windows of 51 and 151 samples.
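To make the windowing and k-NN terminology concrete, the following sketch cuts a one-dimensional signal into fixed-length windows and classifies a window by majority vote among its nearest neighbours. It is a generic illustration, not the benchmark configuration of the article: the helper names, the step size, the value of k, and the toy training data are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def windows(signal: Sequence[float], size: int, step: int) -> List[List[float]]:
    """Cut a signal into windows of `size` samples, starting every `step` samples."""
    return [list(signal[i:i + size]) for i in range(0, len(signal) - size + 1, step)]


def knn_predict(train: List[Tuple[Sequence[float], str]], query: Sequence[float], k: int = 3) -> str:
    """Return the majority label among the k training windows closest to the query
    (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy training set: constant-valued windows of 51 samples with made-up labels.
train = [
    ([0.0] * 51, "ADL"),
    ([0.1] * 51, "ADL"),
    ([4.9] * 51, "fall"),
    ([5.0] * 51, "fall"),
    ([5.2] * 51, "fall"),
]

segments = windows(list(range(151)), size=51, step=50)
print(len(segments), len(segments[0]))   # 3 51
print(knn_predict(train, [4.8] * 51))    # fall
```

A real pipeline would use three-axis accelerometer data and a tuned k; the sketch only shows how window length determines the feature vector that the classifier compares.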
ARTICLE | doi:10.20944/preprints202311.0247.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Dataset; Recurrent Neural Networks; Internet of Things; Time Series.
Online: 3 November 2023 (12:38:10 CET)
The emergence of the Internet of Things (IoT) has led to the deployment of various types of sensors in many application fields, including environment monitoring, smart cities, health, industries, and others. The increasing number of connected devices has led to the creation of massive quantities of data that need to be analyzed. Typically, this data is ordered by time, as a time series. In this context, this paper presents a time series prediction model based on Recurrent Neural Networks for one-step-ahead prediction. Results obtained on five Internet of Things monitoring datasets showed that the Recurrent Neural Network performed better than the ARIMA and SVM prediction methods.
ARTICLE | doi:10.20944/preprints202309.0431.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: plant diterpene synthases; functional annotation dataset; product specificity analysis
Online: 7 September 2023 (05:12:08 CEST)
Plant-derived diterpene synthases (PdiTPSs) play a critical role in the formation of structurally and functionally diverse diterpenoids. However, the specificity or promiscuity of PdiTPSs remains unclear. To gain more understanding of this, the sequences of 199 functionally characterized PdiTPSs and their corresponding 3D structures were collected and manually corrected. Then, the correlations among sequences, domains, structures and their corresponding products were comprehensively analyzed. However, those features alone were insufficient for effective product-specific classification of PdiTPSs, as these methods could not establish a clear mapping between the enzymes and products. Nevertheless, local structural analysis can identify residues that have been experimentally proven to influence product outcomes through mutagenesis, and these residues exhibit conservation in spatial positioning and physicochemical properties. In addition, aromatic residues surrounding the substrate exhibited selectivity towards its chemical structure. Specifically, tryptophan (W) was preferentially located around the linear substrate geranylgeranyl pyrophosphate (GGPP), while phenylalanine (F) and tyrosine (Y) were preferentially located around the initial cyclized diterpene intermediate. This analysis revealed the functional space of residues surrounding the substrate of PdiTPSs, most of which have not been experimentally explored. These findings provide guidance for screening specific residues for mutation studies to change the catalytic products of PdiTPSs.
DATA DESCRIPTOR | doi:10.20944/preprints202206.0246.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: dataset; twitter; tweets; IMDb ratings; movies; sentiment analysis; NLP
Online: 17 June 2022 (04:39:16 CEST)
In this paper, we present a dataset that contains a collection of tweets generated as reactions to the release of 50 different movies. The dataset can be used for gaining useful insights regarding the conversation that is generated around a particular movie. It is particularly suitable for conducting sentiment analysis and other NLP techniques. The dataset contains approximately 2.5 million tweets with their related metadata and covers 50 movies. For each movie, its IMDb rating is included. The movies are the 25 releases with the highest number of votes during 2020 and 2021. The collected tweets represent the reactions of the Twitter community during the first week after the US release date of each movie. The number of tweets per movie ranges from 1,000 to approximately 200,000, with an average of 50,000 per release. We used the Internet Archive Wayback Machine to retrieve the IMDb movie rating one week after the US release date. The tweets and related metadata have been collected using the Tweet Downloader tool.
ARTICLE | doi:10.20944/preprints202111.0182.v1
Subject: Computer Science And Mathematics, Robotics Keywords: AHRS; Computer Vision; Dataset Acquisition; Deep Learning; Orientation Estimation.
Online: 9 November 2021 (14:35:21 CET)
The use of Attitude and Heading Reference Systems (AHRS) for orientation estimation is now common practice in a wide range of applications, e.g., robotics and human motion tracking, aerial vehicles and aerospace, gaming and virtual reality, indoor pedestrian navigation and maritime navigation. The integration of the high-rate measurements can provide very accurate estimates, but these can suffer from error accumulation due to sensor drift over longer time scales. To overcome this issue, inertial sensors are typically combined with additional sensors and techniques. As an example, camera-based solutions have drawn considerable attention from the community, thanks to their low cost and easy hardware setup; moreover, impressive results have been demonstrated in the context of Deep Learning. This work presents the preliminary results obtained by DOES, a supportive Deep Learning method specifically designed for maritime navigation, which aims at improving the roll and pitch estimations obtained by common AHRS. DOES recovers these estimations through the analysis of the frames acquired by a low-cost camera pointing at the horizon at sea. The training has been performed on the novel ROPIS dataset, presented in the context of this work and acquired using the FrameWO application developed for this purpose. The promising results encourage testing other network backbones and further expanding the dataset, improving the accuracy of the results and the range of applications of the method.
ARTICLE | doi:10.20944/preprints202102.0424.v1
Subject: Business, Economics And Management, Accounting And Taxation Keywords: dataset; stock; sentiment analysis; nlp; Nasdaq; stock prices; ML
Online: 18 February 2021 (17:22:50 CET)
The dataset reports a collection of earnings call transcripts, the related stock prices, and the related sector index. It contains a total of 188 transcripts, 11970 stock prices, and 1196 sector index values. Furthermore, all of these data originated in the period 2016-2020 and are related to the NASDAQ stock market. The data have been collected using Yahoo Finance and Thomson Reuters Eikon. Specifically, Yahoo Finance offered daily stock prices and traded volume. At the same time, Thomson Reuters Eikon has been used as source for the earnings call transcripts. The dataset can be used as a benchmark for the evaluation of several NLP techniques as well as machine learning algorithms for understanding their potential for financial applications. Moreover, it is also possible to expand the dataset by extending the period in which the data originated following a similar procedure.
ARTICLE | doi:10.20944/preprints202008.0040.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: pain assessment; pain recognition; deep learning; neural network; dataset
Online: 2 August 2020 (15:28:12 CEST)
The traditional standards employed for pain assessment have many limitations. One such limitation is reliability because of inter-observer variability. Therefore, there have been many approaches to automate the task of pain recognition. Recently, deep-learning methods have appeared to solve many challenges, such as feature selection and cases with a small number of data sets. This study provides a systematic review of pain-recognition systems that are based on deep-learning models for the last two years only. Furthermore, it presents the major deep-learning methods that were used in review papers. Finally, it provides a discussion of the challenges and open issues.
DATA DESCRIPTOR | doi:10.20944/preprints202212.0118.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Lip reading; Visual speech recognition; Turkish dataset; Face parts detection
Online: 7 December 2022 (06:50:33 CET)
The presented dataset was obtained from everyday Turkish words and phrases pronounced by various people in videos posted on YouTube. The purpose of collecting the dataset is to enable detection of the spoken word by recognizing patterns or classifying lip movements with supervised, unsupervised, and semi-supervised learning and machine learning algorithms. Most datasets related to lip reading consist of people recorded on camera with fixed backgrounds and under the same conditions, but the dataset presented here consists of images compatible with machine learning models developed for real-life challenges. It contains a total of 2335 instances taken from TV series, movies, vlogs, and song clips on YouTube. The images in the dataset vary due to factors such as the way people say words, accent, speaking rate, gender and age. Furthermore, the instances in the dataset consist of videos with different angles, shadows, resolution, and brightness that are not created manually. The most important feature of our lip reading dataset is that it contributes to the pool of non-synthetic Turkish datasets, which does not have a wide variety of datasets. Machine learning studies can be carried out in many areas, such as the defense industry and social life, with this dataset.
DATA DESCRIPTOR | doi:10.20944/preprints202111.0511.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Social network analysis; Natural language processing; Dataset; Multimode; Opinion Dynamics
Online: 26 November 2021 (14:23:36 CET)
At the end of 2018, a high school student asked a question in the Zhihu community, claiming that he had proved Goldbach's conjecture. The question caused an explosive reaction, a large number of users participated in the discussion, and the event had widespread influence. On January 1, 2019, the questioner published his "proof", which was soon shown to be wrong. The heated discussion caused by the incident contains a great deal of information of value for social science analysis. We therefore followed the event from the outset and built a time series dataset for it. Taking the release of the questioner's "proof" as the dividing line, all the answers, comments, sub-comments, and information on the users who wrote these texts in the two days before and after were recorded. This temporal information can reflect the dynamic features of the interaction between user opinions and the impact of an exogenous shock (the proof release) on community opinions. The dataset can be used not only for demonstrating various social network analysis algorithms, but also for a series of natural language processing tasks such as fine-grained sentiment analysis for long texts, as well as multimodal tasks combining natural language processing and social network analysis. This paper introduces the characteristics and structure of the dataset, shows a visualization of the social network, and uses the dataset to train a benchmark model for emotion analysis.
ARTICLE | doi:10.20944/preprints202101.0156.v1
Subject: Business, Economics And Management, Accounting And Taxation Keywords: dataset; comparison; algorithm; Naïve Bayes; C5.0 Decision Tree; student enrolment
Online: 8 January 2021 (13:04:44 CET)
In this preprint, we introduce a dataset containing student enrolment applications combined with the results of their filing procedures. The dataset contains 73 variables. When applying to study, candidates fill in a web form to file the procedure. A committee at Tilburg University reviews every application and decides whether the student is admissible. This dataset is suitable for algorithmic studies and has been used in a comparison between the Naïve Bayes and C5.0 Decision Tree algorithms, which were used to predict the committee's decision on admitting candidates to various bachelor programs. Our analysis shows that, in this particular case, a combination of the two approaches outperforms both of them in terms of precision and recall.
ARTICLE | doi:10.20944/preprints201911.0117.v1
Subject: Medicine And Pharmacology, Oncology And Oncogenics Keywords: malignant mesothelioma; epidemiology; association rule mining; Apriori method; imbalanced dataset
Online: 10 November 2019 (16:15:14 CET)
Malignant mesothelioma is a rare proliferative cancer that develops in the thin layer of tissues surrounding the lungs. It is associated with an extremely poor prognosis, and the majority of patients do not show symptoms. The epidemiology of mesothelioma is important for the identification of the disease. The primary aim of this study is to explore the risk factors associated with mesothelioma. The dataset consists of healthy subjects and mesothelioma patients, but only mesothelioma patients were selected for the identification of symptoms. The raw dataset was pre-processed, and the Apriori method was then used to generate association rules under various configurations. Pre-processing involved removing duplicated and irrelevant attributes, balancing the dataset, and converting numerical attributes to nominal ones before the association rules were created. Strong associations among disease factors (asbestos exposure, duration of asbestos exposure, duration of symptoms, erythrocyte sedimentation rate, and pleural-to-serum LDH ratio) were determined via the Apriori algorithm. Identifying the risk factors associated with mesothelioma may help prevent patients from progressing to the high-risk stage of the disease. It will also help to control the comorbidities associated with mesothelioma, which are cardiovascular diseases, cancer-related emotional distress, diabetes, anemia, and hypothyroidism.
ARTICLE | doi:10.20944/preprints202308.0140.v1
Subject: Engineering, Architecture, Building And Construction Keywords: sheep face recognition; large benchmark; deep learning; convolutional neural networks; dataset
Online: 2 August 2023 (11:26:54 CEST)
The mutton sheep breeding industry has transformed significantly in recent years, from traditional grassland free-range farming to a more intelligent approach. As a result, automated sheep face recognition systems have become vital to modern breeding practices and have gradually replaced ear tagging and other manual tracking techniques. Although sheep face datasets have been introduced in previous studies, they have often involved pose or background restrictions (e.g., fixing of the subject’s head, cleaning of the face), which restrict data collection and have limited the size of available sample sets. As a result, a comprehensive benchmark designed exclusively for the evaluation of individual sheep recognition algorithms is lacking. To address this issue, this study develops a large-scale benchmark dataset, Sheepface-107, comprised of 5,350 images acquired from 107 different subjects. Images were collected from each sheep at multiple angles, including front and back views, in a diverse collection that provides a more comprehensive representation of facial features. In addition to the dataset, an assessment protocol is developed by applying multiple evaluation metrics to the results produced by three different deep learning models: VGG16, GoogLeNet, and ResNet50, which achieved F1-scores of 83.79%, 89.11%, and 93.44%, respectively. A statistical analysis of each algorithm suggested that accuracy and the number of parameters were the most informative metrics for use in evaluating recognition performance.
ARTICLE | doi:10.20944/preprints202308.0837.v1
Subject: Environmental And Earth Sciences, Remote Sensing Keywords: SAR vehicle detection; rotated object detection; Synthetic dataset; Mix MSTAR; deep learning
Online: 10 August 2023 (10:16:27 CEST)
The application of deep learning in the detection of Synthetic Aperture Radar (SAR) targets has been primarily limited to large objects such as ships and airplanes, with much less popularity in detecting SAR vehicles. The complexities of SAR imaging make it difficult to distinguish small vehicles from the background clutter, creating a barrier to data interpretation and the development of Automatic Target Recognition (ATR) in SAR vehicles. The scarcity of datasets has inhibited progress in SAR vehicle detection in the data-driven era. To address this, we introduce a new synthetic dataset called Mix MSTAR, which mixes target chips and clutter backgrounds with original radar data at the pixel level. Mix MSTAR contains 5,392 objects of 20 fine-grained categories in 100 high-resolution images, predominantly 1478x1784 pixels. The dataset includes various landscapes such as woods, grasslands, urban buildings, lakes, and tightly arranged vehicles, each labeled with Oriented Bounding Box (OBB). Notably, Mix MSTAR presents fine-grained object detection challenges by using the Extended Operating Condition (EOC) as a basis for dividing the dataset. Furthermore, we evaluate 9 benchmark rotated detectors on Mix MSTAR and demonstrate the fidelity and effectiveness of the synthetic dataset. To the best of our knowledge, Mix MSTAR represents the first public multi-class SAR vehicle dataset designed for rotated object detection in large-scale scenes with complex background. Mix MSTAR is available at: https://github.com/TheGreatTreatsby/Mix-MSTAR-mmrotate.
ARTICLE | doi:10.20944/preprints202305.2218.v1
Subject: Computer Science And Mathematics, Security Systems Keywords: Generative Adversarial Network; Intrusion Detection System; Imbalanced Dataset; Machine Learning; Unsupervised Learning
Online: 31 May 2023 (10:22:58 CEST)
An intrusion detection system (IDS) is a security system that maintains constant surveillance over network traffic and host systems in order to identify security breaches or potentially concerning activities. Recently, the rise in cyber-attacks has driven the need for automated and intelligent network intrusion detection systems. These systems are designed to learn the typical patterns of network traffic, allowing them to identify deviations from normal behaviour, which can be classified as anomalous or malicious. Machine learning methods are widely used and have shown satisfactory effectiveness in detecting malicious payloads in network traffic. The exponentially increasing volume of data generated by IDSs brings substantial security risks and highlights the imperative to strengthen network security. The performance of traditional machine learning methods depends on the dataset and on how balanced its class distribution is. Because most IDS datasets are imbalanced, the performance of the machine learning method used in the system is limited, which results in missed detections and false alarms in conventional IDSs. To address this issue, this paper presents a new Generative Adversarial Network (GAN)-based model called TDCGAN to enhance the detection rate of the minority class in the dataset while maintaining efficiency. The proposed model consists of one generator and three discriminators with an election layer at the end of the architecture. The UGR’16 dataset is used for evaluation. To demonstrate the efficacy of the proposed model, various machine learning algorithms were used for comparison. The experimental findings show that TDCGAN offers an efficient solution to imbalanced intrusion detection and outperforms other oversampling machine learning methods.
ARTICLE | doi:10.20944/preprints202311.1073.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Academic Performance, Progress Prediction, Score Prediction, Learning Behavior, Learning Dataset, Educational Data Mining
Online: 16 November 2023 (11:24:43 CET)
Intelligent Tutoring Systems (ITS) are increasingly popular for online learning. These systems use adaptive algorithms to recommend relevant content based on students' profiles. However, instructors need to periodically assess students' performance to ensure learning outcomes and adjust strategies accordingly. Our objective is to predict students' progress in advance, enabling teachers to make quicker decisions and facilitating the iterative process of adaptive algorithms. For this study, we collected a dataset from ALIN, an online learning platform, consisting of over 5,000 students' learning records and test results. Using this dataset, we conducted experiments employing various machine learning algorithms. The results indicate that learning behavior contributes to improving forecast performance, while students' progress strongly correlates with their previous test results. Additionally, we discovered that students' progress can be indirectly predicted by forecasting their scores. Furthermore, by breaking down overall scores into several distinct components and predicting individual scores for each component, the accuracy of the forecasts can be improved.
ARTICLE | doi:10.20944/preprints202310.1988.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: academic performance; progress prediction; score prediction; learning behavior; learning dataset; educational data mining
Online: 31 October 2023 (09:40:38 CET)
Data mining techniques have garnered significant attention within the realm of education. However, the procurement of ample student data poses a formidable challenge. In response to this challenge, we present a student dataset characterized by its size and distinctive attributes. This dataset encompasses various task-related topics interconnected through a learning pathway, thereby enabling researchers to delve into the data from novel perspectives. Moreover, it encompasses extensive longitudinal student behavioral data, a rarity that adds substantial value. Spanning the years from 2010 to 2021, our dataset comprises a cohort of 7,933 students, 64,344 test scores, and 183,390 behavior records, solidifying its status as a valuable resource for educational research. In our experiments, we achieved successful predictions of students' test outcomes based on behavioral learning data. The strengths of our dataset render it apt for analyzing the nexus between student conduct and academic performance, crafting personalized learning recommendations, and pursuing various other research pursuits.
ARTICLE | doi:10.20944/preprints202309.1704.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: adaptive neuro-fuzzy inference systems; liver disorders; bupa dataset; grey wolf optimization; intelligence
Online: 26 September 2023 (05:26:07 CEST)
The hybrid method proposed in the study, ANFIS-GWO, combines the Adaptive Neuro-Fuzzy Inference System (ANFIS) with Grey Wolf Optimization (GWO) for the diagnosis of liver disorders. ANFIS is a powerful tool that combines the advantages of neural networks and fuzzy logic to create a hybrid model capable of handling complex and uncertain data. GWO is a metaheuristic optimization algorithm inspired by the social behaviour of grey wolves. In the ANFIS-GWO method, the hyperparameters of ANFIS are optimized using GWO. This optimization process aims to fine-tune the ANFIS model based on the available dataset, which consists of 7 characteristic attributes and 354 samples related to liver diseases. By adapting the hyperparameters, the ANFIS-GWO method enhances the overall performance and accuracy of the diagnostic system. To evaluate the effectiveness of the ANFIS-GWO intelligent medical system, the study employs classification accuracy, sensitivity, and specificity analysis. Classification accuracy measures the overall correctness of the system in predicting liver disease cases. Sensitivity refers to the system’s ability to correctly identify individuals with liver disorders, while specificity measures its ability to correctly identify those without liver disorders. Experimental results demonstrate that the performance of the ANFIS-GWO method surpasses that of traditional Fuzzy Inference Systems (FIS) and ANFIS models that do not undergo an optimization phase. This suggests that the integration of GWO optimization significantly improves the diagnostic accuracy of the ANFIS model for liver disease diagnosis.
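The three evaluation measures named in this abstract can be illustrated with a minimal Python sketch. It assumes binary labels where 1 means "liver disorder" and 0 means "no disorder"; the function name `diagnostic_metrics` and the label encoding are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Assumption of this sketch: 1 = liver disorder, 0 = no disorder.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: share of actual disorder cases that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of healthy cases correctly left unflagged.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity
```

Reporting sensitivity and specificity alongside accuracy matters when the two classes are not equally frequent, because accuracy alone can hide poor detection of the rarer class.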
ARTICLE | doi:10.20944/preprints202305.1376.v2
Subject: Computer Science And Mathematics, Robotics Keywords: multimodal sensors; autonomous driving; dataset collection framework; sensor calibration and synchronization; sensor fusion
Online: 29 June 2023 (08:32:46 CEST)
Autonomous driving vehicles rely on sensors for the robust perception of surroundings. Such vehicles are equipped with multiple perceptive sensors with a high level of redundancy to ensure safety and reliability in any driving condition. However, multi-sensor systems, such as those combining camera, LiDAR and radar, bring requirements related to sensor calibration and synchronization, which are the fundamental building blocks of any autonomous system. On the other hand, sensor fusion and integration have become important aspects of autonomous driving research and directly determine the efficiency and accuracy of advanced functions such as object detection and path planning. Classical model-based estimation and data-driven models are two mainstream approaches to achieving such integration. Most recent research is shifting to the latter, showing high robustness in real-world applications but requiring large quantities of data to be collected, synchronized, and properly categorized. To generalize the implementation of the multi-sensor perceptive system, we introduce an end-to-end generic sensor dataset collection framework that includes both hardware deployment solutions and sensor fusion algorithms. The framework prototype integrates a diverse set of sensors, such as cameras, LiDAR, and radar. Furthermore, we present a universal toolbox to calibrate and synchronize the three types of sensors based on their characteristics. The framework also includes fusion algorithms, which utilize the merits of the three sensors, namely camera, LiDAR and radar, and fuse their sensory information in a manner that is helpful for object detection and tracking research. The generality of this framework makes it applicable to any robotic or autonomous application and suitable for quick and large-scale practical deployment.
ARTICLE | doi:10.20944/preprints202306.1046.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Hyperspectral Images; forensic; Ink mismatch detection; K-means; Elbow; silhouette; iVision HHID dataset
Online: 14 June 2023 (11:12:10 CEST)
Forensic document examiners can determine the authenticity of questioned documents by analyzing the ink used to create them. An ink mismatch can be a sign of a scam, backdating, or forgery. In this research, hyperspectral images from the iVision HHID dataset are used to detect the number of inks that may have been used in a document, since hyperspectral imaging makes it possible to detect ink mismatch in a given document. The unsupervised learning method K-means is used to detect the number of inks. The approximate number of clusters is determined by the Elbow and Silhouette methods before K-means is applied.
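The idea of choosing the number of clusters before running K-means can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the toy two-band "spectra" in `TOY_SPECTRA`, the deterministic farthest-first initialisation, and all function names are assumptions made for the example. Inertia (within-cluster sum of squares) is the quantity plotted for the Elbow method, and the mean silhouette score is used to pick the cluster count.

```python
import math

# Toy stand-in for per-pixel spectra (two bands per pixel); illustrative only.
TOY_SPECTRA = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 0), (10, 1), (11, 0), (11, 1),
               (5, 9), (5, 10), (6, 9), (6, 10)]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with farthest-first initialisation (a simplification)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centers[j])
        if updated == centers:  # converged
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    inertia = sum(dist2(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia

def silhouette(points, labels):
    """Mean silhouette score; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # common convention: singleton clusters score 0
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def choose_k(points, candidates):
    """Return the k with the best silhouette, plus inertia and silhouette per k."""
    inertias, silhouettes = {}, {}
    for k in candidates:
        labels, inertias[k] = kmeans(points, k)
        silhouettes[k] = silhouette(points, labels)
    return max(silhouettes, key=silhouettes.get), inertias, silhouettes
```

Calling `choose_k(TOY_SPECTRA, range(2, 5))` evaluates two to four clusters. For real hyperspectral data each point would be a full spectral vector for one ink pixel, and a library implementation with randomised restarts would normally replace this sketch.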
ARTICLE | doi:10.20944/preprints202305.1871.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: urban climate; Copernicus dataset; urban heat island; weather types; urban overheating; synoptic classification
Online: 26 May 2023 (07:06:17 CEST)
In this study we investigated the association between weather types (WTs) and the Urban Heat Island Intensity (UHII) in the region of Attica (Greece). Applying the methodology yields ten WTs over the Attica region. The UHII was calculated for every hour of the day from 2008 to 2017, using a new air temperature dataset produced by the Copernicus Climate Change Service. To obtain clearer results on the association between WTs and UHII, we also used the upper 5% of UHII values (Urban Overheating, UO). The UO was estimated for two time intervals (daytime and nighttime) and for the warm period (June-September). The UHII frequency distribution and the spatial characteristics of the UO were also investigated. It was found that UO was amplified under WT2 during the night, while WT10 was mainly responsible for exacerbated UO magnitude in the daytime, in all months. Furthermore, the analysis revealed that the UO effect is more pronounced in Athens during the night, especially in the center of Athens. The daytime hot spots were identified mainly in suburban and rural areas. This methodology may therefore help in designing heat mitigation strategies and climate adaptation measures in urban environments.
ARTICLE | doi:10.20944/preprints202111.0446.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Unity3D; Blender; Virtual Reality; Synthetic dataset generation; Machine Learning; Neural Networks
Online: 24 November 2021 (08:53:00 CET)
This paper provides a methodology for the production of synthetic images for training neural networks to recognise shapes and objects. There are many scenarios in which it is difficult, expensive and even dangerous to produce a set of images that is satisfactory for the training of a neural network. 3D modelling software has reached such a level of realism and ease of use that it seemed natural to explore this path and to assess the reliability of training a neural network on synthetic images. The results obtained in the two proposed use cases, the recognition of a pictorial style and the recognition of migrants at sea, lead us to support the validity of the approach, provided that the work is conducted in a scrupulous and rigorous manner, exploiting the full potential of the modelling software. The code produced, which automatically generates the transformations necessary for the data augmentation of each image, and the random environmental conditions in the case of Blender and Unity3D software, is available under the GPL licence on GitHub. The results obtained lead us to affirm that, through the good practices presented in the article, we have defined a simple, reliable, economical and safe method to feed the training phase of a neural network dedicated to the recognition of objects and features, applicable to various contexts.
ARTICLE | doi:10.20944/preprints202106.0144.v1
Subject: Medicine And Pharmacology, Endocrinology And Metabolism Keywords: Diabetic; Diabetic Mellitus; Diabetic Prediction; PIMA diabetic dataset; Female diabetic Patients; Machine Learning
Online: 4 June 2021 (15:25:16 CEST)
Diabetes, or diabetes mellitus, is a metabolic disorder of blood sugar levels in the human body. It is a major non-communicable disease and involves many serious health risks. The disease is increasing rapidly in India. It is a chronic condition that occurs when the body does not produce enough of the hormone insulin to control the blood sugar level. In this study, different variables that contribute to diabetes are analyzed, and different machine learning algorithms are used to predict whether an unknown sample is diabetic or not. For this purpose, the PIMA diabetic dataset of female patients was used. Ten different classification models are used for prediction. Finally, a detailed performance analysis of the different variables of the PIMA dataset and of the classification models is discussed.
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Urdu Twitter Dataset; Urdu Natural language processing (NLP); Urdu text Sentiments and Emoticons
Online: 24 March 2021 (12:03:46 CET)
This article presents a dataset of tweets in the Urdu language. There are 1,140,824 tweets in the dataset, collected from Twitter for September and October 2020. This large-scale corpus was produced by pre-processing that removes columns containing user information, retweet counts, and follower information; removes duplicate tweets and unnecessary punctuation, links, symbols, and spaces; and finally extracts any emojis present in the tweet text. In the final dataset, each tweet record contains columns for the tweet id, the text, and the emoji extracted from the text with a sentiment score. Emojis are extracted to validate machine learning models used for multilingual sentiment and behavior analysis. They are extracted using a Python script that searches for emojis from a list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
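The emoji lookup step described in this abstract could look roughly like the following Python sketch. The authors' actual script and their list of 751 emojis are not reproduced here: `EMOJI_TABLE` holds three entries only, its sentiment scores are placeholder values invented for illustration, and the function and field names are hypothetical.

```python
# Hypothetical lookup table: emoji -> (description, sentiment score).
# The scores are placeholders, NOT values from the dataset.
EMOJI_TABLE = {
    "\U0001F602": ("face with tears of joy", 0.2),
    "\u2764": ("red heart", 0.7),
    "\U0001F62D": ("loudly crying face", -0.1),
}

def extract_emojis(text):
    """Return one entry per listed emoji found in the text, in order."""
    found = []
    for ch in text:  # single code points only; multi-code-point emojis are not handled
        if ch in EMOJI_TABLE:
            description, score = EMOJI_TABLE[ch]
            found.append({"emoji": ch, "description": description, "sentiment": score})
    return found

def annotate(record):
    """Copy a tweet record and add an 'emojis' column with the lookup results."""
    annotated = dict(record)
    annotated["emojis"] = extract_emojis(record["text"])
    return annotated
```

A record without any listed emoji simply receives an empty `emojis` list, so every tweet keeps the same set of columns.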
ARTICLE | doi:10.20944/preprints202007.0191.v1
Subject: Computer Science And Mathematics, Computer Networks And Communications Keywords: Intrusion Detection System; NSL-KDD Dataset; One Hot Encoding; Information Gain; Decision Tree
Online: 9 July 2020 (12:23:29 CEST)
In today’s world, cyber attacks are one of the major issues concerning organizations that deal with technologies such as cloud computing, big data, and IoT. In the area of cyber security, the intrusion detection system (IDS) plays a crucial role in identifying suspicious activities in network traffic. Over the past few years, a lot of research has been done in this area, but network attacks are now diversifying in both volume and variety. In this regard, this research article proposes a novel IDS in which a combination of information gain and the decision tree algorithm is used for dimension reduction and classification. The NSL-KDD dataset is used for the experiments. Initially, out of the 41 features present in the dataset, only the 5 features with the highest information gain are selected for classification. The applicability of the selected features is evaluated with various machine learning algorithms. The experimental results show that the decision tree algorithm records the highest recognition accuracy among all the classifiers. Based on the initial classification result, a novel decision-tree-based methodology has been further developed that is capable of identifying multiple attacks by analyzing the packets of various transactions in real time.
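Feature selection by information gain, as described in this abstract, can be sketched in a few lines of Python. The sketch assumes categorical feature values (continuous features such as byte counts would need to be discretised first) and rows stored as dictionaries; the function names and the toy rows used to exercise it are illustrative and are not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def top_k_features(rows, labels, k):
    """Rank features (dictionary keys) by information gain and keep the best k."""
    gains = {name: information_gain([row[name] for row in rows], labels)
             for name in rows[0]}
    return sorted(gains, key=gains.get, reverse=True)[:k]
```

With `k=5` on the 41 NSL-KDD features this would mirror the selection step in the abstract; the reduced feature set would then be passed to a decision tree or any other classifier.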
ARTICLE | doi:10.20944/preprints202311.0347.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: classification; machine learning; deep learning; convolution neural networks; dataset; landfill waste; waste management; sustainability
Online: 7 November 2023 (02:50:04 CET)
The accurate classification of landfill waste diversion plays a critical role in efficient waste management practices. Traditional approaches, such as visual inspection, weighing and volume measurement, and manual sorting, have been widely used but suffer from subjectivity, limited scalability, and high labour requirements. In contrast, machine learning approaches, particularly Convolutional Neural Networks (CNN), have emerged as powerful deep learning models for waste detection and classification. This paper analyses VGG-16, InceptionResNetV2, DenseNet121, Inception V3, and MobileNetV2 models to classify real-life waste when trained on pristine and unadulterated materials, versus samples collected at a landfill site. When training on DiversionNet, the unadulterated material dataset with labels required for landfill modelling, classification accuracy was limited to 49.69% in the real environment. Using real-world samples in the newly formed RealWaste dataset showed that practical applications of deep learning in waste classification are possible, with Inception V3 reaching 89.19% classification accuracy on the full spectrum of labels required for accurate modelling.
ARTICLE | doi:10.20944/preprints202308.0954.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Arabic NLP; Kuwaiti Dialect; Dataset Labeling; Stance Detection; Weak supervised learning; Zero-shot learning
Online: 14 August 2023 (09:00:24 CEST)
The Kuwaiti dialect is a particular dialect of Arabic spoken in Kuwait; it differs significantly from standard Arabic and the dialects of neighboring countries in the same region. Few research papers with a focus on the Kuwaiti dialect have been published in the field of NLP. In this study, we created Kuwaiti dialect language resources using Q8VaxStance, a vaccine stance labeling system for a large dataset of tweets. This dataset will help fill this gap and provide a valuable resource for researchers studying vaccine hesitancy in Kuwait. Furthermore, it will contribute to the Arabic natural language processing field by providing a dataset for developing and evaluating machine learning models for stance detection in the Kuwaiti dialect. The proposed vaccine stance labeling system combines the benefits of weak supervised learning and zero-shot learning; for this purpose, we implemented 52 experiments on 42,815 unlabeled tweets extracted between December 2020 and July 2022. The results of the experiments show that using keyword detection in conjunction with zero-shot model labeling functions is significantly better than using only keyword detection labeling functions or only zero-shot model labeling functions. Furthermore, using the Arabic language in both the labels and prompt, or a mix of Arabic labels and an English prompt, gives a statistically significant difference compared to using English in both the labels and prompt for the total number of generated labels evaluation metric. Finally, the best Macro-F1 values were found in the experiments KHZSLF-EE4 and KHZSLF-EA1, both at 0.83. For the total automatically labeled data evaluation metric, experiment KHZSLF-EE4 labeled 42,270 tweets, while experiment KHZSLF-EA1 generated 42,764 labels.
ARTICLE | doi:10.20944/preprints202301.0156.v1
Subject: Computer Science And Mathematics, Other Keywords: Online Learning; Emotion Classification; AMIGOS dataset; Wearable-EEG (Muse and Neurosity Crown); Psychopy Experiments
Online: 9 January 2023 (09:09:08 CET)
Emotions are indicators of affective states and play a significant role in human daily life, behavior, and interactions. Giving emotional intelligence to machines could, for instance, facilitate early detection and prediction of (mental) diseases and symptoms. Electroencephalography (EEG)-based emotion recognition is widely applied because it measures electrical correlates directly from the brain rather than indirectly through other physiological responses initiated by the brain. The recent development of non-invasive and portable EEG sensors makes it possible to use them in real-time applications. Therefore, this paper presents a real-time emotion classification pipeline, which trains different binary classifiers for the dimensions of Valence and Arousal from an incoming EEG data stream. After achieving a 23.9% (Arousal) and 25.8% (Valence) higher F1-score on the state-of-the-art AMIGOS dataset, this pipeline was applied to the dataset obtained from an emotion elicitation experimental framework developed within the scope of this thesis. Following two different protocols, 15 participants were recorded using two different consumer-grade EEG devices while watching 16 short emotional videos in a controlled environment. For an immediate label setting, mean F1-scores of 87% and 82% were achieved for Arousal and Valence, respectively. In a live scenario, while continuously being updated on the incoming data stream with delayed labels, the pipeline proved to be fast enough to make predictions in real time. However, the significant discrepancy in classification scores compared with the setting with readily available labels motivates future work to include more data with frequent delayed labels in the live settings.
COMMUNICATION | doi:10.20944/preprints202206.0172.v3
Subject: Computer Science And Mathematics, Information Systems Keywords: Monkeypox; monkey pox; Twitter; Dataset; Tweets; Social Media; Big Data; Data Mining; Data Science
Online: 25 July 2022 (09:41:19 CEST)
ARTICLE | doi:10.20944/preprints202109.0068.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Feature engineering; vibration; high performance computing (HPC); dataset; prognostics and availability management (P&AM)
Online: 3 September 2021 (14:21:24 CEST)
The Industrial Internet of Things (IIoT)-enabled smart system has entered a golden era of rapid technology growth. IIoT is a concept that makes every system interrelated so that they are able to collect and transfer data over a wireless network without human intervention. In this paper, we discuss the development of an IoT-enabled system to monitor the vibration signature of equipment as part of a prognostics and availability management (P&AM) system that serves to prevent unplanned operation downtime and catastrophic failure of a whole system. In order to simplify the complexity of processing video content and performing inference, the Intel OpenVINO platform was selected because of its simplicity, portability across Intel AI processors, performance, and the comprehensiveness of its analytical and diagnostic capabilities, which can be tested in Intel’s DevCloud. The IIoT system consists of a High Performance Computing (HPC) platform based on Intel’s Xeon processors and Movidius AI accelerator, Intel’s OpenVINO toolkit for AI, a Regul high-performance programmable controller capturing vibration data through sensors, and a low-latency network connection. Notifications of anomalies are sent to a smartphone. This paper presents an approach for feature extraction and selection, known as feature engineering, for the equipment component we want to protect. Feature engineering is the first step for the P&AM of these components and extends to the whole system. The broader aim of this paper is to help technical leaders at the exploring or experimenting stages of their AI framework to learn the concepts of implementing algorithms using datasets that have real value to their companies. The datasets referred to in this paper were generated by simulation under various material failure scenarios.
ARTICLE | doi:10.20944/preprints202105.0444.v1
Subject: Engineering, Other Keywords: Spectral Unmixing; Imaging Spectrometer; Hyperspectral; Benchmark Dataset; Dimensionality Estimation; Endmember Extraction; Abundance Estimation; HySpex.
Online: 19 May 2021 (13:25:39 CEST)
Spectral unmixing represents both an application per se and a pre-processing step for several applications involving data acquired by imaging spectrometers. However, there is still a lack of publicly available reference data sets suitable for the validation and comparison of different spectral unmixing methods. In this paper we introduce the DLR HyperSpectral Unmixing (DLR HySU) benchmark dataset, acquired over German Aerospace Center (DLR) premises in Oberpfaffenhofen. The dataset includes airborne hyperspectral and RGB imagery of targets of different materials and sizes, complemented by simultaneous ground-based reflectance measurements. The DLR HySU benchmark allows a separate assessment of all main steps of spectral unmixing: dimensionality estimation, endmember extraction (with and without the pure pixel assumption), and abundance estimation. Results obtained with traditional algorithms for each of these steps are reported. To the best of our knowledge, this is the first time that real imaging spectrometer data with accurately measured targets are made available for hyperspectral unmixing experiments. The DLR HySU benchmark dataset is openly available online and the community is welcome to use it for spectral unmixing and other applications.
ARTICLE | doi:10.20944/preprints202105.0429.v1
Subject: Medicine And Pharmacology, Other Keywords: Acute lymphoblastic leukemia; Deep convolutional neural networks; Ensemble image classifiers; C-NMC-2019 dataset.
Online: 19 May 2021 (07:42:23 CEST)
Although automated Acute Lymphoblastic Leukemia (ALL) detection is essential, it is challenging due to the morphological similarity between malignant and normal cells. The traditional ALL classification strategy is arduous and time-consuming, often suffers from inter-observer variation, and requires experienced pathologists. This article automates the ALL detection task by employing deep Convolutional Neural Networks (CNNs). We explore a weighted ensemble of deep CNNs to recommend a better ALL cell classifier. The weights are estimated from the ensemble candidates' corresponding metrics, such as accuracy, F1-score, AUC, and kappa values. Various data augmentations and pre-processing steps are incorporated to achieve better generalization of the network. We train and evaluate the proposed model using the publicly available C-NMC-2019 ALL dataset. Our proposed weighted ensemble model achieves a weighted F1-score of 88.6%, a balanced accuracy of 86.2%, and an AUC of 0.941 on the preliminary test set. The qualitative results displaying the gradient class activation maps confirm that the introduced model has a concentrated learned region. In contrast, the ensemble candidate models, such as Xception, VGG-16, DenseNet-121, MobileNet, and InceptionResNet-V2, separately produce coarse and scattered learned areas for most example cases. Since the proposed ensemble yields a better result for the intended task, it can be tried in other domains of medical diagnostic applications.
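The idea of weighting ensemble members by their own metrics can be sketched as follows. This is a minimal illustration, assuming the weights are the candidates' validation scores normalized to sum to one; the abstract does not give the exact weighting formula, and the scores and probabilities below are made-up values.

```python
def metric_weights(scores):
    """Normalize per-model validation scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def weighted_ensemble(probabilities, weights):
    """Weighted average of per-model class-probability vectors."""
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]


# Hypothetical validation F1-scores of three candidate CNNs.
weights = metric_weights([0.80, 0.85, 0.90])

# Hypothetical [P(normal), P(malignant)] outputs for a single cell image.
probs = [[0.60, 0.40], [0.30, 0.70], [0.20, 0.80]]

fused = weighted_ensemble(probs, weights)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights sum to one and each input vector is a probability distribution, the fused vector is again a distribution, so the usual argmax decision rule still applies.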
ARTICLE | doi:10.20944/preprints202007.0187.v1
Subject: Computer Science And Mathematics, Computer Networks And Communications Keywords: Intrusion Detection System; NSL-KDD Dataset; One Hot Encoding; Information Gain; Convolution Neural Network
Online: 9 July 2020 (12:14:10 CEST)
Cyber security plays an important role in protecting our computers, networks, programs and data from unauthorized access. Intrusion detection systems (IDS) and intrusion prevention systems (IPS) are the two main categories of cyber security, designed to identify suspicious activity in inbound and outbound network packets and to restrict suspicious incidents. Deep neural networks play a significant role in the construction of IDS and IPS. This paper presents a novel IDS using an optimized convolution neural network (CNN-IDS). The optimized CNN-IDS model is an improvement over CNN that selects the best weighted model by considering the loss in every epoch. All the experiments have been conducted on the well-known NSL-KDD dataset. Information gain has been used for dimensionality reduction. The accuracy of the proposed model is evaluated through the optimized CNN for both binary and multiclass categories. Finally, a critical comparison has been performed with other general classifiers such as J48, Naive Bayes, NB tree, Random Forest, Multilayer Perceptron (MLP), Support Vector Machine (SVM), Recurrent Neural Network (RNN) and Convolution Neural Network (CNN). The experimental results demonstrate that the optimized CNN-IDS model records the best recognition rate with minimum model construction time.
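Information gain, used above for dimensionality reduction, ranks a feature by how much knowing its value reduces the entropy of the class label. The sketch below shows the calculation for categorical features; the four-row toy table and its feature names are hypothetical and only mimic the kind of fields found in network traffic data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data, not taken from NSL-KDD.
labels = ["attack", "attack", "normal", "normal"]
features = {
    "protocol": ["tcp", "tcp", "udp", "udp"],  # separates the classes
    "flag": ["SF", "S0", "SF", "S0"],          # carries no class information
}

gains = {name: information_gain(col, labels) for name, col in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Dimensionality reduction then amounts to keeping the top-ranked features and discarding those whose gain is close to zero.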
ARTICLE | doi:10.20944/preprints201805.0240.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: background reconstruction; image quality assessment; image dataset; subjective evaluation; perceptual quality; objective quality metric
Online: 17 May 2018 (09:36:33 CEST)
With an increased interest in applications that require a clean background image, such as video surveillance, object tracking, street view imaging and location-based services on web-based maps, multiple algorithms have been developed to reconstruct a background image from cluttered scenes. Traditionally, statistical measures and existing image quality techniques have been applied for evaluating the quality of the reconstructed background images. Though these quality assessment methods have been widely used in the past, their performance in evaluating the perceived quality of the reconstructed background image has not been verified. In this work, we discuss the shortcomings in existing metrics and propose a full reference Reconstructed Background image Quality Index (RBQI) that combines color and structural information at multiple scales using a probability summation model to predict the perceived quality in the reconstructed background image given a reference image. To compare the performance of the proposed quality index with existing image quality assessment measures, we construct two different datasets consisting of reconstructed background images and corresponding subjective scores. The quality assessment measures are evaluated by correlating their objective scores with human subjective ratings. The correlation results show that the proposed RBQI outperforms all the existing approaches. Additionally, the constructed datasets and the corresponding subjective scores provide a benchmark to evaluate the performance of future metrics that are developed to evaluate the perceived quality of reconstructed background images.
ARTICLE | doi:10.20944/preprints202309.1675.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: orthopedic walker; dataset; IoT; fall detection; activity logging; inertial measurement unit; machine learning; deep learning
Online: 25 September 2023 (10:07:40 CEST)
An accurate, economical, and reliable device for detecting falls in persons ambulating with the assistance of an orthopedic walker is crucially important for the elderly and patients with limited mobility. Existing wearable devices such as wristbands are not designed for walker users, and patients may not wear them at all times. This research proposes a novel idea of attaching an internet-of-things (IoT) device with an inertial measurement unit (IMU) sensor directly to an orthopedic walker to perform real-time fall detection as well as activity logging. A dataset is collected and labeled for walker users in four activities, including idle, motion, step, and fall. Classic machine learning algorithms are evaluated using the dataset by comparing their classification performance. Deep learning with convolutional neural network (CNN) is also explored. Furthermore, the hardware prototype is designed by integrating a low-power microcontroller for onboard machine learning, an IMU sensor, a rechargeable battery, and Bluetooth wireless connectivity. The research results show the promise of improved safety and well-being of walker users.
ARTICLE | doi:10.20944/preprints202305.1658.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Virtual screening; Bioactivity prediction; Equivariant graph neural network; Multiple instance learning; Molecular conformation; Benchmark dataset
Online: 23 May 2023 (13:05:48 CEST)
Ligand-based virtual screening (LBVS) is a promising approach for rapid and low-cost screening of potentially bioactive molecules in the early stage of drug discovery. Compared with traditional similarity-based machine learning methods, deep learning frameworks for LBVS can more effectively extract high-order molecule structure representations from molecular fingerprints or structures. However, the 3D conformation of a molecule largely influences its bioactivity and physical properties, yet it has rarely been considered in previous deep learning-based LBVS methods. Moreover, a relevant bioactivity benchmark dataset is still lacking. To address these issues, we introduce a novel end-to-end deep learning architecture trained from molecular conformers for LBVS. We first extracted molecule conformers from multiple public molecular bioactivity datasets and consolidated them into a large-scale bioactivity benchmark dataset, which in total includes millions of endpoints and molecules corresponding to 954 targets. Then, we devised a deep learning-based LBVS method called EquiVS to learn molecule representations from conformers for bioactivity prediction. Specifically, a graph convolutional network (GCN) and an equivariant graph neural network (EGNN) are sequentially stacked to learn high-order molecule-level and conformer-level representations, followed by attention-based deep multiple-instance learning (MIL) to aggregate these representations and then predict the potential bioactivity of the query molecule on a given target. We conducted various experiments to validate the data quality of our benchmark dataset, and confirmed that EquiVS achieved better performance than 10 traditional machine learning or deep learning-based LBVS methods. Further ablation studies demonstrate the significant contribution of molecular conformation to bioactivity prediction, as well as the soundness and non-redundancy of the deep learning architecture in EquiVS. Finally, a model interpretation case study on CDK2 shows the potential of EquiVS in optimal conformer discovery. The overall study shows that our proposed benchmark dataset and EquiVS method have promising prospects in virtual screening applications.
ARTICLE | doi:10.20944/preprints202304.0645.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Lip Reading; Multiclass Classification; Turkish Lip Reading Dataset; Deep Learning; Convolutional Neural Networks; Lip Detection
Online: 20 April 2023 (10:07:48 CEST)
Automated lip reading is a research problem that has developed considerably in recent years. In some cases, lip reading is evaluated both visually and audibly. One use of lip reading models is detecting specific words in images from security cameras, but audio-visual databases cannot be used in this situation because the sound of the pronounced word cannot be obtained in all cases. In this study, we collected a new Turkish dataset containing only images. The new dataset was produced from YouTube videos, which are an uncontrolled environment. For this reason, the images are challenging in terms of environmental factors such as light, angle, and color, and in terms of personal facial characteristics. Despite differing facial features such as mustache, beard, and make-up, the visual speech recognition problem was addressed on 10 classes, including single words and two-word phrases, using Convolutional Neural Networks (CNN) without any intervention on the data. Using visual-only data, the proposed study obtained a model that performs automated visual speech recognition with a deep learning approach. In addition, since this study uses visual-only data, the computational cost and resource usage are lower than in multi-modal studies. It is also the first known study to address the lip reading problem with a deep learning algorithm using a new dataset belonging to the Ural-Altaic languages.
ARTICLE | doi:10.20944/preprints202208.0350.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: Mental stress Covid-19; Covid-19 vaccine dataset; Vaccine sociodemographic; Vaccine acceptance rate; Vaccine perception
Online: 18 August 2022 (13:36:16 CEST)
In this study, we surveyed over 600 participants to determine: a) the major causes of mental stress during the pandemic and its future impacts, and b) the diversity in public perception and acceptance (specifically for children) of Covid-19 vaccination. Statistical results and intelligent clustering outcomes indicate significant relationships between sociodemographic diversity, mental stress causes, vaccination perception, and Covid-19 infections. For instance, statistical results indicate significant dependence between mental stress due to Covid-19 and gender (p = 1.7e-05). Over 25% of males reported work-related stress compared with 35% of females, and females also reported more stress due to relationships (17%) than males (12%). Around 30% of Asian/Arabic participants do not feel that vaccination is safe, compared with 8% of white-British and 22% of white-European participants, indicating significant dependence on ethnicity (p = 1.8e-08). More specifically, vaccination acceptance for children is significantly dependent on ethnicity (p = 3.7e-05), with only 47% of participants showing willingness towards children’s vaccination. The primary dataset in this study, together with the experimental outcomes identifying sociodemographic diversity with respect to public perception and acceptance of vaccination for children and potential stress factors, might help the public and policy makers to be better prepared for future epidemics, to work globally to combat mental health issues, and to run more effective vaccination campaigns.
ARTICLE | doi:10.20944/preprints202002.0174.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: IEEE 802.15.4g; Smart Utility Networks; Low-Power; Wireless; Communications; Dependable; Predictable; Reliable; Available; Industrial; Dataset
Online: 13 February 2020 (14:01:05 CET)
In this article we present a deployment of 11 nodes using the three different SUN (Smart Utility Network) modulation schemes, as defined in the IEEE 802.15.4-2015 standard. The nodes were deployed in a 110,044 m² warehouse for 99 days, and the resulting dataset contains a total of 10,710,868 measurements with RSSI (Received Signal Strength Indicator), CCA (Clear Channel Assessment) and PDR (Packet Delivery Ratio) values. The analyzed results show a high variability in average RSSI (i.e., between -82.1 dBm and -101.7 dBm) and CCA (i.e., between -111.2 dBm and -119.9 dBm) values, which is caused by the effects of multi-path propagation and external interference. Despite being above the sensitivity limit for each modulation, these values result in poor average PDR values (i.e., from 65.9% to 87.4%), indicating that additional schemes are required for low-power wireless communications to meet the dependability requirements of industrial applications. For that purpose, we also introduce the concept of modulation diversity, which can be combined with packet repetition to meet such requirements (i.e., PDR>99%) while minimizing the energy expenditure of nodes and meeting regulatory constraints.
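To give a feel for how packet repetition relates to the PDR>99% target, the sketch below estimates the delivery ratio after n transmissions of the same packet. It assumes that transmissions fail independently with the same per-attempt PDR, which is a simplification introduced here and not a claim of the article; real links with bursty interference, and the modulation diversity the authors propose, will behave differently.

```python
def pdr_with_repetition(pdr, repetitions):
    """Delivery ratio after `repetitions` attempts, assuming independent losses."""
    return 1.0 - (1.0 - pdr) ** repetitions


def repetitions_needed(pdr, target):
    """Smallest number of attempts whose delivery ratio exceeds `target`."""
    if not 0.0 < pdr <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < pdr <= 1 and 0 <= target < 1")
    n = 1
    while pdr_with_repetition(pdr, n) <= target:
        n += 1
    return n


# Average PDR bounds quoted in the abstract, against the 99% target.
worst_case = repetitions_needed(0.659, 0.99)
best_case = repetitions_needed(0.874, 0.99)
```

Under the independence assumption, every extra repetition costs one more transmission's worth of energy, which is the trade-off the abstract alludes to when it mentions minimizing energy expenditure.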
ARTICLE | doi:10.20944/preprints201908.0019.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: emotion classification; machine learning classifiers; ISEAR dataset; data mining; performance evaluation; data science; opinion-mining
Online: 2 August 2019 (08:49:27 CEST)
Emotion detection from text is an important and challenging problem in text analytics. Opinion-mining experts are focusing on the development of emotion detection applications, which have received considerable attention from the online community, including users and business organizations, for collecting and interpreting public emotions. However, most existing works on emotion detection used less efficient machine learning classifiers with limited datasets, resulting in performance degradation. To overcome this issue, this work evaluates the performance of different machine learning classifiers on a benchmark emotion dataset. The experimental results show the performance of the classifiers in terms of evaluation metrics such as precision, recall and F-measure. Finally, the classifier with the best performance is recommended for emotion classification.
ARTICLE | doi:10.20944/preprints201804.0088.v1
Subject: Arts And Humanities, History Keywords: historical dataset; geocoding; localisation; geohistorical objects; database; GIS; collaborative; citizen science; crowd-sourced; digital humanities
Online: 8 April 2018 (09:13:10 CEST)
The latest developments in digital humanities have increasingly enabled the construction of large data sets which can easily be accessed and used. These data sets often contain indirect localisation information, such as historical addresses. Historical geocoding is the process of transforming the indirect localisation information into direct localisation that can be placed on a map, which enables spatial analysis and cross-referencing. Many efficient geocoders exist for current addresses, but they do not deal with temporal information and are usually based on a strict hierarchy (country, city, street, house number, etc.) that is hard, if not impossible, to use with historical data. Indeed, historical data are full of uncertainties (temporal, textual, positional accuracy, confidence in historical sources) that cannot be ignored or entirely resolved. We propose an open source, open data, extensible solution for geocoding that is based on gazetteers composed of geohistorical objects extracted from historical topographical maps. Once the gazetteers are available, geocoding a historical address is a matter of finding the geohistorical object in the gazetteers that best matches the historical address searched by the user. The matching criteria are customisable and include several dimensions (fuzzy string, fuzzy temporal, level of detail, positional accuracy). As the goal is to facilitate historical work, we also propose web-based user interfaces that help geocode (one address or batch mode) and display results over current or historical topographical maps, so that geocoding results can be checked and collaboratively edited. The system has been tested on the city of Paris, France, for the 19th and the 20th centuries. It shows high response rates and is fast enough to be used interactively.
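The matching step can be pictured as scoring every gazetteer entry on several fuzzy dimensions and keeping the best one. The sketch below combines only two of the dimensions named above, fuzzy string and fuzzy temporal; the scoring functions, the weights, the 20-year tolerance, and the gazetteer entries are all assumptions invented for illustration and are not the system's actual criteria or data.

```python
from difflib import SequenceMatcher


def string_score(a, b):
    """Fuzzy string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def temporal_score(year, start, end, tolerance=20):
    """1.0 inside the validity interval, decaying linearly to 0 outside it."""
    if start <= year <= end:
        return 1.0
    gap = min(abs(year - start), abs(year - end))
    return max(0.0, 1.0 - gap / tolerance)


def best_match(name, year, gazetteer, w_text=0.7, w_time=0.3):
    """Return the gazetteer entry with the highest combined score."""
    def score(entry):
        return (w_text * string_score(name, entry["name"])
                + w_time * temporal_score(year, entry["start"], entry["end"]))
    return max(gazetteer, key=score)


# Fictitious geohistorical objects with made-up validity periods.
gazetteer = [
    {"name": "rue du Moulin", "start": 1800, "end": 1900},
    {"name": "rue du Moulin Vert", "start": 1880, "end": 1950},
    {"name": "rue des Moines", "start": 1800, "end": 1900},
]

match = best_match("r. du Moulin", 1860, gazetteer)
```

Keeping the weights as parameters mirrors the customisable matching criteria described in the abstract: a user who trusts the date more than the spelling can simply raise `w_time`.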
DATA DESCRIPTOR | doi:10.20944/preprints202301.0473.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: Data generator; dataset; deep learning; health index; machine learning; prognosis and health management; remaining useful life
Online: 26 January 2023 (08:37:30 CET)
This paper presents PrognosEase, a software tool that provides an easier way to produce different types of run-to-failure data mimicking real-world conditions, in order to simplify prognosis studies in terms of data collection and to improve the ML degradation modelling process. Different degradation types are made available to suit different types of applications. In addition, some preliminary ML tests were performed to ensure that the complexity patterns of real systems could be observed in the behavior of the training/testing predictions. This paper also presents the impacts, limitations and potential improvements of the data generator.
ARTICLE | doi:10.20944/preprints202210.0477.v1
Subject: Computer Science And Mathematics, Analysis Keywords: High Throughput Plant Phenotyping; Deep Neural Network; Flower Detection; Temporal Phenotypes; Benchmark Dataset; Flower Status Report
Online: 31 October 2022 (10:00:24 CET)
A phenotype is the composite of an observable expression of a genome for traits in a given environment. The trajectories of phenotypes computed from an image sequence and the timing of important events in a plant’s life cycle can be viewed as temporal phenotypes and are indicative of the plant’s growth pattern and vigor. In this paper, we introduce a novel method called FlowerPhenoNet, which uses deep neural networks to detect flowers in multiview image sequences for high throughput temporal plant phenotyping analysis. Following flower detection, a set of novel flower-based phenotypes is computed, e.g., the day of emergence of the first flower in a plant’s life cycle, the total number of flowers present in the plant at a given time, the highest number of flowers bloomed in the plant, the growth trajectory of a flower and the blooming trajectory of a plant. To develop a new algorithm and facilitate performance evaluation based on experimental analysis, a benchmark dataset is indispensable. Thus, we introduce a benchmark dataset called FlowerPheno, which comprises image sequences of three flowering plant species, namely sunflower, coleus, and canna, captured by a visible light camera in a high throughput plant phenotyping platform from multiple view angles. The experimental analyses on the FlowerPheno dataset demonstrate the efficacy of FlowerPhenoNet.
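Once flowers have been detected in every image, several of the temporal phenotypes listed above reduce to simple operations on a per-day flower count. The sketch below assumes such counts are already available as (day, count) pairs sorted by day; the function name, the output keys, and the example counts are hypothetical and are not part of FlowerPhenoNet.

```python
def flower_phenotypes(daily_counts):
    """Derive temporal phenotypes from (day, flower_count) pairs sorted by day."""
    # Day on which the first flower is detected, or None if none ever appears.
    first_day = next((day for day, count in daily_counts if count > 0), None)
    # max() keeps the earliest day when several days share the highest count.
    peak_day, peak_count = max(daily_counts, key=lambda pair: pair[1])
    return {
        "first_flower_day": first_day,
        "max_flowers": peak_count,
        "peak_day": peak_day,
        "blooming_trajectory": [count for _, count in daily_counts],
    }


# Hypothetical detector output for one plant over six imaging days.
counts = [(1, 0), (2, 0), (3, 1), (4, 3), (5, 3), (6, 2)]
phenotypes = flower_phenotypes(counts)
```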
ARTICLE | doi:10.20944/preprints202309.2022.v1
Subject: Environmental And Earth Sciences, Water Science And Technology Keywords: Gridded dataset; Standardized Precipitation Evapotranspiration Index (SPEI); Drought regionalization; Kernel occurrence rate estimator; Trend analysis; Climate indices
Online: 28 September 2023 (13:30:24 CEST)
Droughts are among the major natural hazards that are spreading to many parts of the world with huge multi-dimensional impacts. For Continental Croatia, an extensive analysis of the drought phenomenon is presented based on the meteorological E-OBS gridded dataset (0.25° x 0.25°) for the period 1950 to 2022. The drought events were characterized by the Standardized Precipitation Evapotranspiration Index (SPEI) applied at different time-scales (6 and 12 months) in order to describe the subannual and annual variability of drought. The spatiotemporal patterns of drought are obtained through principal component analysis (PCA) and K-means clustering (KMC) applied to the SPEI field. An areal drought evolution analysis and the changes in the frequency of occurrence of periods under drought conditions were obtained using a kernel occurrence rate estimator (KORE). The Modified Mann-Kendall (MMK) test coupled with the Sen’s slope estimator is applied to the SPEI series in order to quantify the drought trends throughout the country. According to the history of drought events, and considering the different morphoclimatic characteristics of the study area, the results showed that Croatia could be divided into three different and spatially well-defined regions with specific temporal and spatial characteristics of droughts (Central North, Eastern and Southern regions). There is a clear increase in the percentage of area affected by drought as well as in the yearly drought occurrence rates in both the Central North and Eastern regions, and an evident decrease in the Southern region, for both the 6- and 12-month SPEI time-scales. Regarding the temporal characteristics of drought, downward trends expressing increasing drought severity were strongly significant in the North and Eastern regions, while a few significant upward trends were seen in the Southern region. This study provides a broader view of the historical behaviour of droughts in Croatia, with results offering useful support for drought risk assessment and decision-making processes.
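The trend quantities mentioned above can be illustrated with a short sketch of the Mann-Kendall S statistic and the Sen's slope estimator. It covers only the classical S statistic and the slope; the variance correction for autocorrelation that distinguishes the Modified Mann-Kendall test, and the significance test itself, are omitted. The SPEI values are hypothetical.

```python
from statistics import median


def mann_kendall_s(values):
    """Mann-Kendall S: concordant minus discordant pairs over all i < j."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    return s


def sens_slope(values):
    """Sen's slope: median of pairwise slopes, assuming unit time steps."""
    n = len(values)
    slopes = [
        (values[j] - values[i]) / (j - i)
        for i in range(n - 1)
        for j in range(i + 1, n)
    ]
    return median(slopes)


# Hypothetical yearly SPEI values for one grid cell (drying tendency).
spei = [0.5, 0.2, 0.1, -0.3, -0.4]

s_statistic = mann_kendall_s(spei)
slope = sens_slope(spei)
```

A negative S together with a negative slope corresponds to the downward SPEI trends that the abstract interprets as increasing drought severity.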
DATA DESCRIPTOR | doi:10.20944/preprints202308.1701.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: disease X; big data; data science; data analysis; dataset development; database; google trends; data mining; healthcare; epidemiology
Online: 24 August 2023 (05:48:54 CEST)
The World Health Organization (WHO) added Disease X to its shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During past outbreaks, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. These regions were chosen for data mining because they recorded significant search interests related to Disease X during this timeframe. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, a brief analysis of specific features of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis.
ARTICLE | doi:10.20944/preprints202307.0014.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: deep learning; unbalanced dataset; augmentation; multiclass classification; metrics boosting method; sota algorithm; visual transformer; ResNet; Xception; Inception
Online: 3 July 2023 (08:25:13 CEST)
One of the critical problems in multiclass classification tasks is the imbalance of the dataset. This is especially true when using contemporary pre-trained neural networks, where, in fact, only the last layers of the neural network are retrained. Large datasets with highly unbalanced classes are therefore not well suited to model training, since using such a dataset leads to overfitting and, accordingly, poor metrics on test and validation datasets. In this paper, the sensitivity of Xception, ViT-384, ViT-224, VGG19, ResNet34, ResNet50, ResNet101, Inception_v3, DenseNet201, DenseNet161, and DeIT to dataset imbalance was studied using a highly imbalanced dataset of 20,971 images sorted into 7 classes. It is shown that the best metrics were obtained when using a cropped dataset with augmentation of missing images in classes up to 15% of the initial number. In this way, the metrics can be increased by 2-6% compared to the metrics of the models on the initial unbalanced dataset. Moreover, the metrics for the classification of rare classes also improved significantly: the TruePositive value can be increased by 0.3 or more. As a result, the best approach to training the considered networks on an initially unbalanced dataset was formulated.
DATA DESCRIPTOR | doi:10.20944/preprints202206.0146.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: COVID-19; COVID; Omicron; online learning; remote learning; online education; Twitter; dataset; Tweets; social media; Big Data
Online: 21 July 2022 (08:05:19 CEST)
COMMUNICATION | doi:10.20944/preprints202206.0383.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: Exoskeleton; Twitter; Tweets; Big Data; social media; Data Mining; dataset; Data Science; Natural Language Processing; Information Retrieval
Online: 21 July 2022 (04:06:53 CEST)
The exoskeleton technology has been rapidly advancing in the recent past due to its multitude of applications and diverse use-cases in assisted living, military, healthcare, firefighting, and industry 4.0. The exoskeleton market is projected to increase by multiple times of its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction, towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today's living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset, by the mining of relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, where the topics found in the conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 tweets about exoskeletons that were posted in a 5-year period from May 21, 2017, to May 21, 2022. Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
ARTICLE | doi:10.20944/preprints202110.0089.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Object Detection; Cascade Mask R-CNN; Floor Plan Images; Deep Learning; Transfer Learning; Dataset Augmentation; Computer Vision
Online: 5 October 2021 (15:09:26 CEST)
Object detection is one of the most critical tasks in the field of computer vision. This task comprises identifying and localizing an object in the image. Architectural floor plans represent the layout of buildings and apartments. The floor plans consist of walls, windows, stairs, and other furniture objects. While recognizing floor plan objects is straightforward for humans, automatically processing floor plans and recognizing objects is a challenging problem. In this work, we investigate the performance of the recently introduced Cascade Mask R-CNN network to solve object detection in floor plan images. Furthermore, we experimentally establish that deformable convolution works better than conventional convolutions in the proposed framework. Identifying objects in floor plan images is also challenging due to the variety of floor plans and different objects. We faced a problem in training our network because of the lack of publicly available datasets. Currently available public datasets do not have enough images to train deep neural networks efficiently. To address this issue, we introduce SFPI, a novel synthetic floor plan dataset consisting of 10,000 images. Our proposed method comfortably surpasses the previous state-of-the-art results on the SESYD dataset and sets impressive baseline results on the proposed SFPI dataset. The dataset can be downloaded from SFPI Dataset Link. We believe that the novel dataset enables researchers to further advance the research in this domain.
ARTICLE | doi:10.20944/preprints202305.0917.v2
Subject: Engineering, Electrical And Electronic Engineering Keywords: Machine learning; Geriatric fall detection; Dataset; Dew Computing; End Device; Feature Extraction; Supervised Machine Learning; Sensor Data Analysis
Online: 1 October 2023 (09:38:25 CEST)
ARTICLE | doi:10.20944/preprints202305.1393.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: information extraction; named entity recognition; natural language processing; dataset; sequence labeling; scholarly knowledge graphs; open research knowledge graph
Online: 19 May 2023 (07:33:48 CEST)
We introduce the Open Research Knowledge Graph Agriculture Named Entity Recognition (the ORKG Agri-NER) corpus and service for contribution-centric scientific entity extraction and classification. The ORKG Agri-NER corpus is a seminal benchmark for the evaluation of contribution-centric scientific entity extraction and classification in the agricultural domain. It comprises titles of scholarly papers that are available as Open Access articles on a major publishing platform. We describe the creation of this corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism focused on capturing scientific entities in agriculture that reflect the direct contribution of a work; 2) a performance benchmark for named entity recognition of scientific entities in the agricultural domain by empirically evaluating various state-of-the-art sequence labeling neural architectures and transformer models; and 3) a delineated 3-step automatic entity resolution procedure for the resolution of the scientific entities to an authoritative ontology, specifically AGROVOC that is released in the Linked Open Vocabularies cloud. With this work we aim to provide a strong foundation for future work on the automatic discovery of scientific entities in the scholarly literature of the agricultural domain.
ARTICLE | doi:10.20944/preprints202305.0467.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Sign Language Recognition (SLR); Large Scale Dataset; American Sign Language; Turkey Sign Language; Chinese Sign Language; AUTSL; CSL
Online: 8 May 2023 (08:24:04 CEST)
Sign Language Recognition (SLR) aims to bridge speech-impaired and general communities by recognizing signs from given videos. Researchers still face challenges in developing efficient SLR systems because of complex video backgrounds, illumination, and subject structures. Recently, many researchers have developed skeleton-based sign language recognition systems to overcome the subject and background variation of hand gesture signs. However, skeleton-based SLR is still under exploration due to the lack of information and annotations on hand key points. More recently, researchers have included body and face information with the hand gesture for SLR, but the performance and efficiency of these systems are unsatisfactory. We propose a Multi-Stream Graph-based Deep Neural Network (SL-GDN) for a skeleton-based SLR system to overcome these problems. The main purpose of the proposed SL-GDN approach is to improve the efficiency and performance of the SLR system at a low computational cost, based on the human body pose in the form of 2D landmark locations. First, we constructed a skeleton graph based on 27 whole-body key points selected from among 67 key points to solve the inefficiency problems. Then we used the multi-stream SL-GDN to extract features from the whole-body skeleton graph for four streams. Finally, we concatenated the four different features and applied a classification module to refine the features and recognize the corresponding sign classes. Our data-driven graph construction method increases the system's flexibility and brings high generalizability, allowing it to adapt to various data samples. We used three large-scale benchmark SLR datasets to evaluate the proposed model: WLASL, AUTSL and CSL. The reported accuracy results demonstrate the superiority of the proposed model, and we believe this will be considered a great invention in the SLR domain.
ARTICLE | doi:10.20944/preprints201903.0039.v2
Subject: Engineering, Control And Systems Engineering Keywords: Handwritten digit recognition; Convolutional Neural Network (CNN); Deep learning; MNIST dataset; Epochs; Hidden Layers; Stochastic Gradient Descent; Backpropagation
Online: 20 September 2019 (10:12:26 CEST)
In recent times, with the rise of Artificial Neural Networks (ANNs), deep learning has brought a dramatic twist to the field of machine learning by bringing it closer to Artificial Intelligence (AI). Deep learning is used in a vast range of fields because of its diverse applications, such as surveillance, health, medicine, sports, robotics, and drones. In deep learning, the Convolutional Neural Network (CNN) is at the center of spectacular advances that combine ANNs with up-to-date deep learning strategies. It has been used broadly in pattern recognition, sentence classification, speech recognition, face recognition, text categorization, document analysis, scene recognition, and handwritten digit recognition. The goal of this paper is to observe how the accuracy of a CNN in classifying handwritten digits varies with the number of hidden layers and epochs, and to compare the resulting accuracies. For this performance evaluation of CNN, we performed our experiment using the Modified National Institute of Standards and Technology (MNIST) dataset. Further, the network is trained using stochastic gradient descent and the backpropagation algorithm.
ARTICLE | doi:10.20944/preprints202305.0443.v1
Subject: Computer Science And Mathematics, Security Systems Keywords: Internet of Things (IoT); Dataset; Security; Machine Learning; Deep Learning; DoS; DDoS; Reconnaissance; Web Attacks; Brute Force; Spoofing; Mirai
Online: 8 May 2023 (04:41:28 CEST)
Nowadays, the Internet of Things (IoT) concept plays a pivotal role in society and brings new capabilities to different industries. The number of IoT solutions in areas such as transportation and healthcare is increasing, and new services are under development. In the last decade, society has experienced a drastic increase in IoT connections, and this growth is expected to continue over the next few years across different areas. However, despite these benefits, several challenges still need to be faced to enable efficient and secure operations (e.g., interoperability, security, standards, and server technologies). Furthermore, although efforts have been made to produce datasets composed of attacks against IoT devices, several possible attacks are not considered. Most existing efforts do not consider an extensive network topology with real IoT devices. The main goal of this research is to propose a novel and extensive IoT attack dataset to foster the development of security analytics applications in real IoT operations. To accomplish this, 33 attacks are executed in an IoT topology composed of 105 devices. These attacks are classified into seven categories, namely DDoS, DoS, Recon, Web-based, Brute Force, Spoofing, and Mirai. Finally, all attacks are executed by malicious IoT devices targeting other IoT devices.
DATA DESCRIPTOR | doi:10.20944/preprints202209.0323.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: COVID-19; Open-source dataset; Drug Repurposing; Database system; Web application development; software development; Drug fingerprints; Bulk upload
Online: 21 September 2022 (10:14:11 CEST)
Although various vaccines are now commercially available, they have not been able to stop the spread of COVID-19 infection completely. An excellent strategy to quickly obtain safe, effective, and affordable COVID-19 treatment is to repurpose drugs that are already approved for other diseases as adjuvants alongside the ongoing vaccine regime. Developing an accurate and standardized drug repurposing dataset requires considerable resources and expertise because an extensive array of commercially available drugs could potentially be used to address SARS-CoV-2 infection. To address this bottleneck, we created the CoviRx platform. CoviRx is a user-friendly interface that provides access to manually curated COVID-19 drug repurposing data. Through CoviRx, the curated data has been made open-source to help advance drug repurposing research. CoviRx also encourages users to submit their findings, which are merged after thorough validation by enforcing uniformity and integrity-preserving constraints. This article discusses the various features of CoviRx and its design principles. CoviRx has been designed so that its functionality is independent of the data it displays. Thus, in the future, this platform can be extended to include any other disease X beyond COVID-19. CoviRx can be accessed at www.covirx.org.
DATASET | doi:10.20944/preprints202311.1183.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Single Image based Deep Learning Model, Specular Highlight Removal, Reflection Removal, Synthetic Dataset, Multi-Scale Normalized Cross Correlation (MS-NCC)
Online: 21 November 2023 (10:00:38 CET)
Several studies in computer vision have examined specular removal, which is crucial for object detection and recognition. This research has traditionally been divided into two tasks: specular highlight removal, which focuses on removing specular highlights on object surfaces, and reflection removal, which deals with specular reflections occurring on glass surfaces. In reality, however, both types of specular effects often coexist, making it a fundamental challenge that has not been adequately addressed. Recognizing the necessity of integrating specular components handled in both tasks, we constructed a Specular-Light (S-Light) DB for training single-image-based deep learning models. Moreover, considering the absence of benchmark datasets for quantitative evaluation, the Multi-Scale Normalized Cross Correlation (MS-NCC) metric, which considers the correlation between specular and diffuse components, was introduced to assess the learning outcomes.
ARTICLE | doi:10.20944/preprints202311.0302.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: intrusion detection system; machine learning techniques; Exploratory Data Analysis; Performance Evaluation; feature selection; CSE-CIC-IDS-2018 dataset; Three phase models
Online: 6 November 2023 (07:46:31 CET)
In this paper, intrusion detection systems are thoroughly investigated utilizing the CSE-CIC-IDS-2018 dataset. The research is divided into three key phases. First, Data Cleaning, Exploratory Data Analysis, and Data Normalization techniques (min-max and z-score) are applied to prepare the data for distinct classifiers. Second, the number of features is reduced using a combination of Principal Component Analysis (PCA) and Random Forest (RF), with the goal of improving processing speed and decreasing model complexity. This stage includes a comparison with the entire dataset. Finally, machine learning algorithms (XGBoost, CART, DT, KNN, MLP, RF, LR, and Bayes) are applied to specific features and preprocessing approaches. Surprisingly, the XGBoost, DT, and RF models outperform the others in both ROC values and CPU runtime. Following evaluation, which includes PCA and RF feature selection, an optimal set is produced.
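The two normalization techniques named in this abstract can be sketched as follows. This is a minimal illustration on a single numeric feature column, not the authors' pipeline; the function names `min_max_scale` and `z_score_scale` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Center values on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population (not sample) deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    column = [2, 4, 6]
    print(min_max_scale(column))
    print(z_score_scale(column))
```

Min-max scaling bounds every feature to [0, 1] but is sensitive to outliers, while z-score scaling leaves the range unbounded and instead gives each feature zero mean and unit variance.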
ARTICLE | doi:10.20944/preprints202006.0031.v3
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep learning; Convolutional Neural Network; Coronavirus; COVID-19; radiology; CT scan; Medical image analysis; Automatic medical diagnosis; lung CT scan dataset
Online: 5 September 2020 (03:36:20 CEST)
COVID-19 is a severe global problem, and AI can play a significant role in preventing losses by monitoring and detecting infected persons at an early stage. This paper proposes a high-speed and accurate fully automated method to detect COVID-19 from patients' CT scan images. We introduce a new dataset that contains 48260 CT scan images from 282 normal persons and 15589 images from 95 patients with COVID-19 infections. In the first stage, the system runs our proposed image processing algorithm to discard those CT images in which the inside of the lung is not properly visible. This action helps to reduce the processing time and false detections. In the next stage, we introduce a novel method for increasing the classification accuracy of convolutional networks. We implemented our method using the ResNet50V2 network and a modified feature pyramid network alongside our designed architecture for classifying the selected CT images into COVID-19 or normal with higher accuracy than other models. After running these two phases, the system determines the condition of the patient using a selected threshold. We are the first to evaluate our system in two different ways. In the single image classification stage, our model achieved 98.49% accuracy on more than 7996 test images. In the patient identification phase, the system correctly identified almost 234 of 245 patients with high speed. We also investigate the classified images with the Grad-CAM algorithm to indicate the area of infection in the images and to evaluate the correctness of our model's classifications.
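The abstract states that the patient's condition is decided from per-image classifications "using a selected threshold" but does not give the rule. One plausible reading is sketched below: a patient is flagged when the fraction of images classified as COVID-19 reaches a threshold. The function name, both parameters, and their default values are assumptions made for illustration only.

```python
def patient_is_positive(image_probs, image_cutoff=0.5, patient_threshold=0.1):
    """Aggregate per-image COVID-19 probabilities into one patient-level decision.

    image_probs       -- predicted COVID-19 probability for each retained CT image
    image_cutoff      -- hypothetical cutoff above which an image counts as positive
    patient_threshold -- hypothetical fraction of positive images needed to flag the patient
    """
    if not image_probs:
        raise ValueError("at least one image is required")
    positives = sum(1 for p in image_probs if p >= image_cutoff)
    return positives / len(image_probs) >= patient_threshold


if __name__ == "__main__":
    scan = [0.05, 0.10, 0.92, 0.88, 0.07]
    print(patient_is_positive(scan, patient_threshold=0.3))
```

Raising `patient_threshold` trades sensitivity for fewer false alarms, since a few misclassified images no longer flag an otherwise normal scan.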
ARTICLE | doi:10.20944/preprints202003.0297.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Data Mining; Alzheimer’s Dementia; Composite Hybrid Feature Selection; Machine learning; Stack Hybrid Classification; AI Techniques; Classification; AD Diagnose; Clinical AD Dataset
Online: 19 March 2020 (10:52:31 CET)
Alzheimer's disease (AD) is a common type of dementia that damages brain cells. Early detection of AD plays an essential role in global health care because AD is often misdiagnosed, shares many clinical features with other types of dementia, and is costly to monitor over time by magnetic resonance imaging (MRI), where manual reading is also subject to human error. In the first stage of our proposed model, the medical dataset is passed to a composite hybrid feature selection (CHFS) step that extracts new features and selects the best ones, improving classification performance by eliminating obscure features. In the second stage, the dataset is passed to a stacked hybrid classification system that combines JRip and random forest classifiers, with six models each evaluated individually as the meta-classifier, to improve the prediction of clinical diagnosis. All experiments were conducted on a laptop with an Intel Core i7-8750H CPU at 2.2 GHz and 16 GB of RAM running Windows 10 (64-bit). The dataset was evaluated using the Explorer interface of the Weka data mining software. The experiments show that the proposed CHFS feature extraction performs better than principal component analysis (PCA) and effectively reduces the false-negative rate, with a relatively high overall accuracy of 96.50% using a support vector machine (SVM) as the meta-classifier, compared to 68.83%, which is considerably better than the previous state-of-the-art result. The receiver operating characteristic (ROC) value was 95.5%. In addition, an experiment with a CNN classifier on a Kaggle MRI image dataset achieved 80.21% accuracy. The results show that the proposed model accurately classifies Alzheimer's clinical samples, compared with MRI neuroimaging, for diagnosing AD at a low cost.
Subject: Medicine And Pharmacology, Oncology And Oncogenics Keywords: breast cancer tumor; classification; majority-based voting mechanism; multilayer perceptron learning network; simple logistic regression; stochastic gradient descent learning; wisconsin breast cancer dataset
Online: 27 November 2019 (09:51:31 CET)
Breast cancer is the most common cause of death for women worldwide. Thus, the ability of artificial intelligence systems to predict and classify breast cancer is very important. In this paper, a hybrid ensemble classification method is proposed based on a majority voting mechanism. First, the performance of different state-of-the-art machine learning classification algorithms on the Wisconsin Breast Cancer Dataset (WBCD) was evaluated. The three best classifiers were then selected based on their F3 score. The F3 score is used because it weights recall more heavily than precision, which emphasizes the cost of false negatives in breast cancer classification. Then, these three classifiers (simple logistic regression learning, stochastic gradient descent learning, and a multilayer perceptron network) are used for ensemble classification using a voting mechanism. We also evaluated the performance of hard and soft voting mechanisms. For hard voting, a majority-based voting mechanism was used, and for soft voting we used voting methods based on the average, product, maximum, and minimum of probabilities. The hard voting (majority-based voting) mechanism shows better performance, with 99.42%, compared to the state-of-the-art algorithm for WBCD.
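The voting rules and the F3 score mentioned in this abstract can be sketched as below. This is an illustrative sketch rather than the authors' implementation: the function names and the example probabilities are hypothetical, and the F-beta formula is the standard one, F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP).

```python
from collections import Counter
from math import prod

# Combination rules for soft voting, matching the four named in the abstract.
RULES = {
    "average": lambda ps: sum(ps) / len(ps),
    "product": prod,
    "maximum": max,
    "minimum": min,
}


def hard_vote(labels):
    """Majority vote over one predicted class label per classifier."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(prob_rows, rule="average"):
    """Combine per-classifier class probabilities and return the winning class index.

    prob_rows holds one row per classifier; each row lists a probability per class.
    """
    combine = RULES[rule]
    n_classes = len(prob_rows[0])
    scores = [combine([row[c] for row in prob_rows]) for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


def f_beta(tp, fp, fn, beta=3.0):
    """F-beta score from confusion counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical outputs of three classifiers for one sample (class 0, class 1).
    probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
    labels = [max(range(2), key=row.__getitem__) for row in probs]
    print("hard vote:", hard_vote(labels))
    for rule in RULES:
        print(rule, "->", soft_vote(probs, rule))
```

The example is chosen so that hard and soft voting disagree: two of three classifiers narrowly prefer class 1, while one confident classifier pulls the averaged probability towards class 0. With three classifiers and two classes a hard-vote tie cannot occur.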
ARTICLE | doi:10.20944/preprints202007.0634.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: CVD rehabilitation; Local muscular endurance exercises; Exercise-based rehabilitation; Deep Learning; AlexNet; CNN; SVM; kNN; RF; MLP; PCA; multi-class classification; INSIGHT-LME dataset
Online: 26 July 2020 (15:21:08 CEST)
Exercise-based cardiac rehabilitation requires patients to perform a set of prescribed exercises a specific number of times. Local muscular endurance (LME) exercises are an important part of the rehabilitation program. Automatic exercise recognition and repetition counting from wearable sensor data is an important technology for enabling patients to perform exercises independently in remote settings, e.g. their own home. In this paper we first report on a comparison of traditional approaches to exercise recognition and repetition counting, corresponding to supervised machine learning and peak detection from inertial sensing signals respectively, with more recent machine learning approaches, specifically Convolutional Neural Networks (CNNs). We investigated two different types of CNN: one using the AlexNet architecture, the other using time-series arrays. We found that the performance of the CNN-based approaches was better than that of the traditional approaches. For the exercise recognition task, we found that the AlexNet-based single CNN model outperformed other methods with an overall F1-score of 97.18%. For exercise repetition counting, again the AlexNet-based single CNN model outperformed other methods by correctly counting repetitions in 90% of the performed exercise sets within an error of ±1. To the best of our knowledge, our approach of using a single CNN method for both recognition and repetition counting is novel. In addition to reporting our findings, we also make the dataset we created, the INSIGHT-LME dataset, publicly available to encourage further research.
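The traditional baseline named in this abstract, repetition counting by peak detection on an inertial signal, can be illustrated with a minimal sketch. It is not the authors' method: the function names, the `min_height` and `min_gap` parameters, and the synthetic sine signal are assumptions chosen for illustration.

```python
import math


def count_repetitions(signal, min_height, min_gap):
    """Count local maxima that reach min_height and lie at least min_gap samples apart."""
    count = 0
    last_peak = -min_gap  # lets the very first qualifying peak be accepted
    for i in range(1, len(signal) - 1):
        is_peak = signal[i - 1] < signal[i] >= signal[i + 1]
        if is_peak and signal[i] >= min_height and i - last_peak >= min_gap:
            count += 1
            last_peak = i
    return count


def within_one(predicted, actual):
    """True when the predicted count is within the +/-1 tolerance used in the abstract."""
    return abs(predicted - actual) <= 1


if __name__ == "__main__":
    # Synthetic stand-in for one sensor axis: 5 repetitions over 200 samples.
    samples = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
    reps = count_repetitions(samples, min_height=0.5, min_gap=20)
    print(reps, within_one(reps, 5))
```

The `min_gap` constraint suppresses double counting when sensor noise produces several small maxima within one repetition; real signals would typically also need low-pass filtering before peak detection.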
ARTICLE | doi:10.20944/preprints202307.0519.v1
Subject: Engineering, Bioengineering Keywords: Lung cancer classification; dimensionality reduction; feature selection techniques; STFT; Particle Swarm Optimization; Harmonic Search; Non-Linear Regression; Mixture Model; Convolutional Neural Network (CNN) for Lung Cancer; Microarray gene expression dataset
Online: 7 July 2023 (16:30:59 CEST)
Microarray gene expression-based detection and classification of medical conditions have been prominent in research studies over the past few decades. However, extracting relevant data from high-volume microarray gene expression data, with its inherent nonlinearity and inseparable noise components, raises significant challenges during data classification and disease detection. This paper therefore proposes a two-level strategy involving feature extraction and selection methods before the classification step. The feature extraction step utilizes the Short-Time Fourier Transform (STFT), and the feature selection step employs the Particle Swarm Optimization (PSO) and Harmonic Search (HS) metaheuristic methods. The classifiers employed are Non-Linear Regression, Gaussian Mixture Model, Softmax Discriminant, Naive Bayes, SVM (Linear), SVM (Polynomial), and SVM (RBF). Classification results on the features obtained from the two-level strategy are compared with those on raw data, including a Convolutional Neural Network (CNN) methodology. Among the methods, STFT with PSO feature selection and the SVM (RBF) classifier produced the highest accuracy of 94.47%.