DATASET | doi:10.20944/preprints202012.0047.v1
Subject: Engineering, Automotive Engineering Keywords: eye tracking dataset; gaze tracking dataset; iris tracking dataset; CNN for eye-tracking; neural networks for eye-tracking
Online: 2 December 2020 (08:00:46 CET)
In recent years, many different deep neural networks have been developed, but because of their large number of layers, training them requires a long time and large datasets. Today it is common to use pre-trained deep networks for various tasks, even simple ones for which such depth is not required. Well-known deep networks such as YOLOv3 and SSD are intended for detecting and tracking many kinds of objects, so their weights are heavy and their accuracy on a narrow task is low. Eye-tracking needs to detect only one object, an iris, in a given area, so it is logical to use a neural network dedicated to this task. The problem, however, is the lack of suitable datasets for training the model. In this manuscript, we present a dataset suitable for training custom convolutional neural network models for eye-tracking tasks. Using this dataset, each user can independently pre-train convolutional neural network models for eye-tracking. The dataset contains 10,000 annotated eye images at a resolution of 416 by 416 pixels. The annotation table gives the coordinates and radius of the eye for each image. This manuscript can also serve as a guide for preparing datasets for eye-tracking devices.
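As a minimal sketch of how such an annotation table might be consumed, the snippet below parses a CSV of per-image annotations; the column names (`image`, `x`, `y`, `r`) and the sample values are assumptions, since the abstract does not fix a schema:

```python
import csv
import io

# Hypothetical annotation rows: image file plus centre (x, y) and radius r,
# in 416x416 pixel coordinates (column names are assumed, not the
# dataset's documented schema).
sample = """image,x,y,r
eye_0001.png,211,198,47
eye_0002.png,190,210,44
"""

def load_annotations(fp):
    """Return {image: (x, y, r)} from a CSV annotation table."""
    return {row["image"]: (int(row["x"]), int(row["y"]), int(row["r"]))
            for row in csv.DictReader(fp)}

annotations = load_annotations(io.StringIO(sample))
print(annotations["eye_0001.png"])  # (211, 198, 47)
```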
REVIEW | doi:10.20944/preprints202110.0247.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: natural language; NLP; Korean; dataset
Online: 18 October 2021 (14:33:41 CEST)
English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests on English datasets are sufficient to demonstrate the performance of new models and methods, a researcher still needs to train and validate models on Korean datasets to produce a technology or product suitable for Korean processing. This paper introduces 15 popular Korean NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide detailed instructions with samples or statistics for the datasets. The main characteristics of the datasets are presented in a single table to give researchers a rapid overview.
ARTICLE | doi:10.20944/preprints202002.0170.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Twitter; dataset; redundancy; reduction; archive
Online: 13 February 2020 (12:45:44 CET)
Data from social networks like Twitter is a valuable source for research but full of redundancy, making it hard to provide large-scale, self-contained, and small datasets. Data recording is a common problem in social-media-based studies and could be standardized; sadly, this is hardly ever done. This paper reports lessons learned from a long-term evaluation study recording the complete public sample of the German and English Twitter stream. It proposes a recording solution that merely chunks a linear stream of events to reduce redundancy: if an event is observed multiple times within the time span of a chunk, only the latest observation is written to the chunk. A 10-gigabyte Twitter raw dataset covering 1.2 million Tweets from 120,000 users, recorded between June and September 2017, was used to analyze achievable compression rates. It turned out that the resulting datasets need only between 10% and 20% of the original data size without losing any event, metadata, or the relationships between single events. This kind of redundancy-reduction recording makes it possible to curate large-scale (even nation-wide), self-contained, and small social-network datasets for research in a standardized and reproducible manner.
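The chunking rule described above, keep only the latest observation of each event within a chunk, can be sketched in a few lines; the event fields (`id`, `retweets`) are illustrative assumptions, not the paper's actual schema:

```python
def dedup_chunk(events, key="id"):
    """Keep only the latest observation of each event within one chunk.

    `events` is a time-ordered list of dicts; a later observation of the
    same id overwrites the earlier one, so no event is lost but repeated
    deliveries within the chunk's time span are collapsed.
    """
    latest = {}
    for ev in events:
        latest[ev[key]] = ev  # later observation wins
    return list(latest.values())

chunk = [
    {"id": 1, "retweets": 0},
    {"id": 2, "retweets": 3},
    {"id": 1, "retweets": 5},  # tweet 1 observed again with newer state
]
print(dedup_chunk(chunk))  # tweet 1 kept once, in its latest state
```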
ARTICLE | doi:10.20944/preprints202203.0172.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: object detection; larger-scale dataset; stacked carton
Online: 11 March 2022 (15:48:23 CET)
Carton detection is an important technique in automatic logistics systems and can be applied to many tasks, such as stacking and unstacking cartons and unloading cartons from containers. However, no public large-scale carton dataset has so far been available for the research community to train and evaluate carton detection models, which hinders the development of carton detection. In this paper, we present a large-scale carton dataset named Stacked Carton Dataset (SCD) with the goal of advancing the state of the art in carton detection. Images were collected from the Internet and several warehouses, and objects were labeled using per-instance segmentation for precise localization. In total there are 250,000 instance masks from 16,136 images. Naturally, a suite of benchmarks is established with several popular detectors. In addition, we design a carton detector based on RetinaNet by embedding our proposed Offset Prediction between Classification and Localization module (OPCL) and Boundary Guided Supervision module (BGS). OPCL alleviates the imbalance between classification and localization quality, boosting AP by 3.1%∼4.7% on SCD at the model level, while BGS guides the detector to pay more attention to carton boundary information and to decouple repeated carton textures at the task level. To demonstrate the generalization of OPCL to other datasets, we conduct extensive experiments on MS COCO and PASCAL VOC, where AP improves by 1.8%∼2.2% and 3.4%∼4.3%, respectively. The source dataset is available here.
ARTICLE | doi:10.20944/preprints201907.0039.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: fetal heart rate; baseline; acceleration; deceleration; dataset
Online: 2 July 2019 (11:17:55 CEST)
The fetal heart rate (FHR) is a screening signal for preventing fetal hypoxia during labor. When experts analyze this signal, they have to position a baseline and identify decelerations and accelerations. These steps can potentially be automated and made more objective by data processing analysis, but training and evaluation datasets are required. Here, we describe a dataset of 155 FHR recordings in which a reference baseline, accelerations and decelerations have been annotated by expert consensus. 66 FHR recordings with a shared expert analysis have been included in a training dataset, and 90 other FHR recordings with a non-shared expert analysis have been included in an evaluation dataset. Researchers wishing to evaluate their automatic analysis method should submit their results for comparison with the expert consensus. The dataset also contains the results produced by 11 re-coded automatic analysis methods from the literature. All the data are available at http://utsb.univ-catholille.fr/fhr-review.
ARTICLE | doi:10.20944/preprints202301.0551.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: multimodal dataset; sentiment analysis; classroom atmosphere; intelligent education
Online: 30 January 2023 (09:49:12 CET)
In this paper, we present a multimodal dataset for the analysis of classroom atmosphere, based on the behavior and voice of teachers in teaching scenarios. We propose four visual models, three audio models, and one visual-audio dual-modality model, and test them on our dataset. The results indicate that the CH-CC dataset is feasible and reliable and that the visual modality plays the major role in the analysis of this dataset.
ARTICLE | doi:10.20944/preprints202210.0301.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Emotion prediction; music; music emotion dataset; affective computing
Online: 20 October 2022 (08:33:49 CEST)
Music is capable of conveying many emotions, but the level and type of emotion perceived by a listener is highly subjective. In this study, we present the Music Emotion Recognition with Profile information dataset (MERP). This database was collected through Amazon Mechanical Turk (MTurk) and features dynamic valence and arousal ratings of 54 selected full-length songs. The dataset contains music features as well as profile information for the annotators. The songs were selected from the Free Music Archive using an innovative method (a Triple Neural Network with the OpenSmile toolkit) to identify 50 songs with the most distinctive emotions. Specifically, the songs were chosen to fully cover the four quadrants of the valence-arousal space. Four additional songs were selected from DEAM to act as a benchmark and to filter out low-quality ratings. A total of 277 participants annotated the dataset, and their demographic information, listening preferences, and musical background were recorded. We offer an extensive analysis of the resulting dataset, together with baseline emotion prediction models, one fully connected and one LSTM-based, for the newly proposed MERP dataset.
COMMUNICATION | doi:10.20944/preprints202207.0450.v1
Subject: Earth Sciences, Oceanography Keywords: SAR image; ship wake; deep learning; synthetic dataset
Online: 29 July 2022 (05:51:03 CEST)
The classification of vessel types in SAR imagery is of crucial importance for maritime applications. However, the ability to use real SAR imagery for deep learning classification is limited by the general lack of such data and the labor-intensive nature of labeling it. Simulating SAR images can overcome these limitations, allowing the generation of an effectively unlimited number of datasets. In this contribution, we present a synthetic SAR imagery dataset with ship wakes, comprising 46,080 images for ten different real vessel models. The simulation parameters cover 16 ship heading directions, 6 ship velocities, 8 wind directions, 2 wind velocities, and 3 incidence angles. In addition, we extensively investigate classification performance for noise-free, noisy, and denoised ship wake scenes. We use the standard AlexNet architecture trained from scratch, with hyperparameters determined by Bayesian optimization to achieve the best classification performance. The results demonstrate that classifying vessel types from their SAR signatures is highly effective, with maximum accuracies of 96.16%, 92.7%, and 93.59% when training on noise-free, noisy, and denoised datasets, respectively. We therefore conclude that the best strategy in practical applications is to train convolutional neural networks on denoised SAR datasets. The results also show that the versatility of the SAR simulator can open up new horizons in the application of machine learning to a variety of SAR platforms.
ARTICLE | doi:10.20944/preprints202207.0176.v1
Subject: Engineering, Other Keywords: Lateral Movement; Sysmon; Dataset; Attacks; Network Security; Hacking
Online: 12 July 2022 (08:23:10 CEST)
This work attempts to answer clearly the following key questions regarding the optimal initialization of the Sysmon tool for the identification of Lateral Movement (LM) in the MS Windows ecosystem. First, from an expert’s standpoint and with reference to the relevant literature, what are the criteria for determining the possibly optimal initialization features of the Sysmon event monitoring tool that are also applicable as custom rules within the config.xml configuration file? Second, based on the identified features, how can a functional configuration file, able to identify as many LM variants as possible, be generated? To answer these questions, we relied on the MITRE ATT&CK knowledge base of adversary tactics and techniques and focused on the execution of the nine commonest LM methods. The experiments, performed on a properly configured testbed, suggested a great number of interrelated networking features that were implemented as custom rules in Sysmon’s config.xml file. Moreover, by capitalizing on the rich corpus of 870K collected Sysmon logs, we create and evaluate, in terms of TP and FP rates, an extensible Python .evtx file analyzer, dubbed PeX, which can be used to automate the parsing and scrutiny of such voluminous files. Both the .evtx log dataset and the developed PeX tool are provided publicly to propel future research in this interesting and rapidly evolving field.
DATASET | doi:10.20944/preprints202206.0346.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: dataset; NLP; Human Resource Management; classification; Job description
Online: 27 June 2022 (03:43:51 CEST)
We describe a dataset that contains job descriptions published on a popular online website in the information technology sector. As the website focuses mainly on United Kingdom-based jobs, the data have a specific focus on this country. The dataset contains 11,501 job vacancies and 13 related metadata fields. It is suitable for HR analytics using machine learning techniques such as natural language processing and neural networks.
ARTICLE | doi:10.20944/preprints201706.0033.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: smartphone accelerometers; dataset; human activity recognition; fall detection
Online: 18 July 2017 (13:16:10 CEST)
Smartphones, smartwatches, fitness trackers, and ad-hoc wearable devices are being increasingly used to monitor human activities. Data acquired by the hosted sensors are usually processed by machine-learning-based algorithms to classify human activities. The success of those algorithms mostly depends on the availability of training (labeled) data that, if made publicly available, would allow researchers to make objective comparisons between techniques. Nowadays, publicly available datasets are few, often contain samples from subjects with too-similar characteristics, and very often lack the specific information needed to select subsets of samples according to specific criteria. In this article, we present a new smartphone accelerometer dataset designed for activity recognition. The dataset includes 11,771 activities performed by 30 subjects aged 18 to 60 years. Activities are divided into 17 fine-grained classes grouped into two coarse-grained classes: 9 types of activities of daily living (ADL) and 8 types of falls. The dataset has been stored so as to include all the information useful for selecting samples according to different criteria, such as the type of ADL performed, age, gender, and so on. Finally, the dataset has been benchmarked with two different classifiers and with different configurations. The best results are achieved with k-NN classifying ADLs only, with personalization, for both windows of 51 and 151 samples.
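The windowed evaluation mentioned at the end (51- and 151-sample windows) relies on segmenting each recording before classification; a minimal sketch of such segmentation, with an assumed overlap step, is:

```python
def windows(samples, size=51, step=50):
    """Split an accelerometer signal into fixed-size windows.

    Each window would then be turned into a feature vector and fed to a
    classifier such as k-NN; the step value here is an assumption, since
    the abstract only specifies the window sizes (51 and 151 samples).
    """
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, step)]

signal = list(range(151))          # a toy 151-sample recording
w = windows(signal, size=51, step=50)
print(len(w), w[0][0], w[-1][0])   # 3 windows starting at 0, 50, 100
```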
DATA DESCRIPTOR | doi:10.20944/preprints202206.0246.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: dataset; twitter; tweets; IMDb ratings; movies; sentiment analysis; NLP
Online: 17 June 2022 (04:39:16 CEST)
In this paper we present a dataset containing a collection of tweets generated in reaction to the release of 50 different movies. The dataset can be used to gain useful insights into the conversation generated around a particular movie and is particularly suitable for sentiment analysis and other NLP techniques. It contains approximately 2.5 million tweets with their related metadata, covering 50 movies; for each movie, its IMDb rating is included. The movies are the 25 releases with the highest number of votes in each of 2020 and 2021. The collected tweets represent the reactions of the Twitter community during the first week after the US release date of each movie. The tweets per movie range from 1,000 to approximately 200,000, with an average of 50,000 per release. We used the Internet Archive Wayback Machine to retrieve the IMDb rating of each movie one week after its US release date. The tweets and related metadata were collected using the Tweet Downloader tool.
ARTICLE | doi:10.20944/preprints202111.0182.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: AHRS; Computer Vision; Dataset Acquisition; Deep Learning; Orientation Estimation.
Online: 9 November 2021 (14:35:21 CET)
The use of Attitude and Heading Reference Systems (AHRS) for orientation estimation is now common practice in a wide range of applications, e.g., robotics and human motion tracking, aerial vehicles and aerospace, gaming and virtual reality, and indoor pedestrian and maritime navigation. The integration of high-rate measurements can provide very accurate estimates, but these can suffer from error accumulation due to sensor drift over longer time scales. To overcome this issue, inertial sensors are typically combined with additional sensors and techniques. Camera-based solutions, for example, have drawn much attention from the community thanks to their low cost and easy hardware setup; moreover, impressive results have been demonstrated in the context of deep learning. This work presents the preliminary results obtained by DOES, a supportive deep learning method specifically designed for maritime navigation, which aims at improving the roll and pitch estimates obtained by common AHRS. DOES recovers these estimates by analyzing the frames acquired by a low-cost camera pointed at the sea horizon. The training was performed on the novel ROPIS dataset, presented in this work and acquired using the FrameWO application developed for this purpose. Promising results encourage testing other network backbones and further expanding the dataset, improving the accuracy of the results and the range of applications of the method.
ARTICLE | doi:10.20944/preprints202102.0424.v1
Subject: Social Sciences, Accounting Keywords: dataset; stock; sentiment analysis; nlp; Nasdaq; stock prices; ML
Online: 18 February 2021 (17:22:50 CET)
The dataset comprises a collection of earnings call transcripts, the related stock prices, and the related sector index: in total, 188 transcripts, 11,970 stock prices, and 1,196 sector index values. All of these data originated in the period 2016-2020 and relate to the NASDAQ stock market. The data were collected using Yahoo Finance and Thomson Reuters Eikon: Yahoo Finance provided daily stock prices and traded volumes, while Thomson Reuters Eikon was the source of the earnings call transcripts. The dataset can be used as a benchmark for evaluating several NLP techniques as well as machine learning algorithms and for understanding their potential in financial applications. It is also possible to expand the dataset by extending the period covered, following a similar procedure.
ARTICLE | doi:10.20944/preprints202008.0040.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: pain assessment; pain recognition; deep learning; neural network; dataset
Online: 2 August 2020 (15:28:12 CEST)
The traditional standards employed for pain assessment have many limitations, one of which is reliability, because of inter-observer variability. Therefore, there have been many approaches to automating the task of pain recognition. Recently, deep learning methods have emerged that address challenges such as feature selection and small datasets. This study provides a systematic review of pain recognition systems based on deep learning models, covering the last two years only. Furthermore, it presents the major deep learning methods used in the reviewed papers. Finally, it provides a discussion of the challenges and open issues.
DATA DESCRIPTOR | doi:10.20944/preprints202212.0118.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Lip reading; Visual speech recognition; Turkish dataset; Face parts detection
Online: 7 December 2022 (06:50:33 CET)
The presented dataset was obtained from daily Turkish words and phrases pronounced by various people in videos posted on YouTube. The purpose of collecting the dataset is to enable detection of the spoken word by recognizing patterns or classifying lip movements with supervised, unsupervised, and semi-supervised machine learning algorithms. Most lip reading datasets consist of people recorded on camera with fixed backgrounds under identical conditions, but the dataset presented here consists of images suited to machine learning models developed for real-life challenges. It contains a total of 2,335 instances taken from TV series, movies, vlogs, and song clips on YouTube. The images in the dataset vary owing to factors such as the way people say words, accent, speaking rate, gender, and age. Furthermore, the instances consist of videos with different angles, shadows, resolutions, and brightness that were not created manually. The most important feature of our lip reading dataset is that it contributes to the pool of non-synthetic Turkish datasets, which currently lacks variety. With this dataset, machine learning studies can be carried out in many areas, such as the defense industry and social life.
DATA DESCRIPTOR | doi:10.20944/preprints202111.0511.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Social network analysis; Natural language processing; Dataset; Multimode; Opinion Dynamics
Online: 26 November 2021 (14:23:36 CET)
At the end of 2018, a high-school student asked a question in the Zhihu community, claiming that he had proved Goldbach's conjecture. The question caused an explosive reaction, a large number of users participated in the discussion, and the incident had widespread influence. On January 1, 2019, the questioner posted his "proof", which was soon shown to be wrong. The heated discussion caused by the incident contains a great deal of information of value for social science analysis. We therefore followed the event from the start and built a time-series dataset for it. Taking the questioner's "proof" as the dividing line, all answers, comments, and sub-comments, together with the profile information of the users who wrote them, were recorded for the two days before and after. This temporal information can reflect the dynamic features of the interaction between user opinions and the impact of an exogenous shock (the release of the "proof") on community opinion. The dataset can be used not only for demonstrating various social network analysis algorithms but also for a series of natural language processing tasks, such as fine-grained sentiment analysis of long texts, as well as multimodal tasks combining natural language processing and social network analysis. This paper introduces the characteristics and structure of the dataset, shows a social network visualization, and uses the dataset to train a benchmark sentiment analysis model.
ARTICLE | doi:10.20944/preprints202101.0156.v1
Subject: Social Sciences, Accounting Keywords: dataset; comparison; algorithm; Naïve Bayes; C5.0 Decision Tree; student enrolment
Online: 8 January 2021 (13:04:44 CET)
In this preprint, we introduce a dataset containing student enrolment applications combined with the outcome of each filing procedure. The dataset contains 73 variables. At the time of applying for study, candidates fill in a web form to file their application. A committee at Tilburg University reviews each application and decides whether the student is admissible. This dataset is suitable for algorithmic studies and has been used in a comparison between the Naïve Bayes and C5.0 Decision Tree algorithms, which were used to predict the committee's decision on admitting candidates to various bachelor programs. Our analysis shows that, in this particular case, a combination of the two approaches outperforms both of them in terms of precision and recall.
ARTICLE | doi:10.20944/preprints201911.0117.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: malignant mesothelioma; epidemiology; association rule mining; Apriori method; imbalanced dataset
Online: 10 November 2019 (16:15:14 CET)
Malignant mesothelioma is a rare proliferative cancer that develops in the thin layer of tissue surrounding the lungs. It is associated with an extremely poor prognosis, and the majority of patients do not show symptoms. The epidemiology of mesothelioma is important for the identification of the disease. The primary aim of this study is to explore the risk factors associated with mesothelioma. The dataset consists of healthy subjects and mesothelioma patients, but only mesothelioma patients were selected for the identification of symptoms. The raw dataset was pre-processed, and the Apriori method was then applied to mine association rules under various configurations. Pre-processing involved removing duplicated and irrelevant attributes, balancing the dataset, and converting numerical attributes to nominal ones before creating the association rules. Strong associations among disease factors (asbestos exposure, duration of asbestos exposure, duration of symptoms, erythrocyte sedimentation rate, and pleural-to-serum LDH ratio) were determined via the Apriori algorithm. Identifying the risk factors associated with mesothelioma may help keep patients out of the most dangerous stages of the disease. This will also help control the comorbidities associated with mesothelioma: cardiovascular diseases, cancer-related emotional distress, diabetes, anemia, and hypothyroidism.
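As an illustration of the kind of support counting that underlies the Apriori procedure described above, the toy sketch below counts itemsets of size one and two over hypothetical, invented patient records; real analyses iterate to larger itemsets with candidate pruning and then derive rules with confidence thresholds:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return the size-1 and size-2 itemsets whose support (fraction of
    transactions containing them) reaches min_support."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for size in (1, 2):
            for items in combinations(sorted(t), size):
                counts[items] = counts.get(items, 0) + 1
    return {k: c / n for k, c in counts.items() if c / n >= min_support}

# Invented records: each patient is a set of nominal risk-factor attributes
records = [
    {"asbestos_exposure", "long_symptom_duration"},
    {"asbestos_exposure", "long_symptom_duration"},
    {"asbestos_exposure"},
    {"high_esr"},
]
freq = frequent_itemsets(records, min_support=0.5)
print(freq[("asbestos_exposure", "long_symptom_duration")])  # 0.5
```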
ARTICLE | doi:10.20944/preprints202111.0446.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Unity3D; Blender; Virtual Reality; Synthetic dataset generation; Machine Learning; Neural Networks
Online: 24 November 2021 (08:53:00 CET)
This paper provides a methodology for producing synthetic images for training neural networks to recognise shapes and objects. There are many scenarios in which it is difficult, expensive, or even dangerous to produce a set of images satisfactory for training a neural network. The development of 3D modelling software has nowadays reached such a level of realism and ease of use that it seemed natural to explore this path and to assess the reliability of basing neural network training on synthetic images. The results obtained in the two proposed use cases, the recognition of a pictorial style and the recognition of migrants at sea, lead us to support the validity of the approach, provided the work is conducted in a scrupulous and rigorous manner, exploiting the full potential of the modelling software. The code produced, which automatically generates the transformations necessary for the data augmentation of each image and generates random environmental conditions in the Blender and Unity3D software, is available under the GPL licence on GitHub. The results lead us to affirm that, through the good practices presented in the article, we have defined a simple, reliable, economical, and safe method to feed the training phase of a neural network dedicated to the recognition of objects and features, applicable to various contexts.
ARTICLE | doi:10.20944/preprints202106.0144.v1
Subject: Keywords: Diabetic; Diabetic Mellitus; Diabetic Prediction; PIMA diabetic dataset; Female diabetic Patients; Machine Learning
Online: 4 June 2021 (15:25:16 CEST)
Diabetes, or diabetes mellitus, is a metabolic disorder of blood sugar levels in the human body. It is a major non-communicable disease involving many serious health risks, and it is rapidly increasing in India. It is a chronic condition that occurs when the body does not produce enough of the hormone insulin to control the blood sugar level. In this study, the variables that contribute to diabetes are analyzed, and different machine learning algorithms are used to predict whether an unknown sample is diabetic or not. For this purpose, the PIMA diabetes dataset of female patients was used, and ten different classification models were applied. Finally, a detailed performance analysis of the variables of the PIMA dataset and of the classification models is presented.
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Urdu Twitter Dataset; Urdu Natural language processing (NLP); Urdu text Sentiments and Emoticons
Online: 24 March 2021 (12:03:46 CET)
This article presents a dataset of tweets in the Urdu language. The dataset contains 1,140,824 tweets collected from Twitter during September and October 2020. This large-scale corpus was generated by pre-processing that included removing columns containing user information, retweet counts, follower information, and duplicate tweets; removing unnecessary punctuation, links, symbols, and spaces; and finally extracting any emojis present in the tweet text. In the final dataset, each tweet record contains columns for the tweet id, the text, and the emojis extracted from the text with a sentiment score. Emojis are extracted to validate machine learning models used for multilingual sentiment and behavior analysis. They are extracted using a Python script that searches for emojis from a list of the 751 most frequently used emojis; if an emoji is present in the text, a column with the emoji description and sentiment score is added.
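A minimal sketch of the described extraction step; the emoji list and sentiment scores below are invented stand-ins for the 751-emoji list used by the actual script:

```python
# Assumed per-emoji sentiment scores; the real script uses a list of the
# 751 most frequently used emojis with their own scores.
EMOJI_SENTIMENT = {"😂": 0.22, "❤": 0.75, "😡": -0.55}

def extract_emojis(text):
    """Return (emoji, sentiment score) pairs found in a tweet's text."""
    return [(ch, EMOJI_SENTIMENT[ch]) for ch in text if ch in EMOJI_SENTIMENT]

print(extract_emojis("kya baat hai 😂❤"))  # [('😂', 0.22), ('❤', 0.75)]
print(extract_emojis("no emoji here"))     # []
```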
ARTICLE | doi:10.20944/preprints202007.0191.v1
Subject: Keywords: Intrusion Detection System; NSL-KDD Dataset; One Hot Encoding; Information Gain; Decision Tree
Online: 9 July 2020 (12:23:29 CEST)
In today’s world, cyber attacks are one of the major issues concerning organizations that deal with technologies like cloud computing, big data, and IoT. In the area of cyber security, intrusion detection systems (IDS) play a crucial role in identifying suspicious activity in network traffic. Over the past few years, a lot of research has been done in this area, but in the current scenario network attacks are diversifying in both volume and variety. In this regard, this research article proposes a novel IDS in which a combination of information gain and a decision tree algorithm is used for dimension reduction and classification. For experimental purposes the NSL-KDD dataset was used. Initially, out of the 41 features present in the dataset, only the 5 features with the highest information gain are selected for classification. The applicability of the selected features is evaluated with various machine-learning-based algorithms. The experimental results show that the decision-tree-based algorithm records the highest recognition accuracy among all the classifiers. Based on this initial classification result, a novel decision-tree-based methodology is further developed that is capable of identifying multiple attacks by analyzing the packets of various transactions in real time.
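The information-gain ranking step can be sketched as follows for discretised features; the toy traffic records are invented for illustration, not drawn from NSL-KDD:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting
    on a (discretised) feature: the criterion used to rank features."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy records: a perfectly informative flag vs. an uninformative one
labels = ["attack", "attack", "normal", "normal"]
print(information_gain(["1", "1", "0", "0"], labels))  # 1.0
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0
```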
ARTICLE | doi:10.20944/preprints202301.0156.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Online Learning; Emotion Classification; AMIGOS dataset; Wearable-EEG (Muse and Neurosity Crown); Psychopy Experiments
Online: 9 January 2023 (09:09:08 CET)
Emotions are indicators of affective states and play a significant role in human daily life, behavior, and interactions. Giving emotional intelligence to machines could, for instance, facilitate early detection and prediction of (mental) diseases and symptoms. Electroencephalography (EEG)-based emotion recognition is widely applied because it measures electrical correlates directly from the brain rather than indirect physiological responses initiated by the brain. The recent development of non-invasive and portable EEG sensors makes it possible to use them in real-time applications. Therefore, this paper presents a real-time emotion classification pipeline that trains separate binary classifiers for the dimensions of valence and arousal from an incoming EEG data stream. After achieving f1-scores 23.9% (arousal) and 25.8% (valence) higher than the state of the art on the AMIGOS dataset, this pipeline was applied to a dataset obtained from an emotion elicitation experimental framework developed within the scope of this thesis. Following two different protocols, 15 participants were recorded using two different consumer-grade EEG devices while watching 16 short emotional videos in a controlled environment. In an immediate-label setting, mean f1-scores of 87% and 82% were achieved for arousal and valence, respectively. In a live scenario, while being continuously updated on the incoming data stream with delayed labels, the pipeline proved fast enough to produce predictions in real time. However, the significant drop in classification scores compared with the readily available labels points to future work incorporating more data with frequent delayed labels in live settings.
COMMUNICATION | doi:10.20944/preprints202206.0172.v3
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Monkeypox; monkey pox; Twitter; Dataset; Tweets; Social Media; Big Data; Data Mining; Data Science
Online: 25 July 2022 (09:41:19 CEST)
ARTICLE | doi:10.20944/preprints202109.0068.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Feature engineering; vibration; high performance computing (HPC); dataset; prognostics and availability management (P&AM)
Online: 3 September 2021 (14:21:24 CEST)
The Industrial Internet of Things (IIoT)-enabled smart system has entered a golden era of rapid technological growth. IIoT is the concept of interrelating systems so that they are able to collect and transfer data over a wireless network without human intervention. In this paper, we discuss the development of an IoT-enabled system to monitor the vibration signature of equipment as part of a prognosis and availability management (P&AM) system that serves to prevent unplanned operational downtime and catastrophic failure of the whole system. In order to simplify the complexity of processing video content and performing inference, the Intel OpenVINO platform was selected because of its simplicity, portability across Intel AI processors, performance, and the comprehensiveness of its analytical and diagnostic capabilities, which can be tested in Intel's DevCloud. The IIoT system consists of a High Performance Computing (HPC) platform based on Intel's Xeon processors and Movidius AI accelerator, Intel's OpenVINO toolkit for AI, a Regul high-performance programmable controller capturing vibration data through sensors, and a low-latency network connection. Notifications of anomalies are sent to a smartphone. This paper presents an approach to the extraction and selection of features, known as feature engineering, for the equipment component we want to protect. Feature engineering is the first step in the P&AM of these components and extends to the whole system. The broader aim of this paper is to help technical leaders at the exploring or experimenting stages of their AI framework to learn the concepts of implementing algorithms using datasets that have real value to their companies. The datasets generated and referred to in this paper were produced by simulation under various material failure scenarios.
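As a minimal, hypothetical example of vibration feature engineering, the sketch below extracts a few classic time- and frequency-domain features (RMS, crest factor, kurtosis, dominant frequency); the paper's actual feature set is not specified here.

```python
import numpy as np

def vibration_features(signal, fs):
    """Extract an illustrative subset of vibration features from a
    1-D signal sampled at fs Hz."""
    rms = np.sqrt(np.mean(signal ** 2))
    peak = np.max(np.abs(signal))
    crest = peak / rms                       # impulsiveness indicator
    kurt = np.mean((signal - signal.mean()) ** 4) / np.var(signal) ** 2
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    dominant = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
    return {"rms": rms, "crest": crest, "kurtosis": kurt,
            "dominant_hz": dominant}

fs = 1000                                    # assumed sampling rate
t = np.arange(0, 1, 1 / fs)
sig = np.sin(2 * np.pi * 50 * t)             # clean 50 Hz component
feats = vibration_features(sig, fs)
```

Under a failure scenario, features such as crest factor and kurtosis would typically rise as the signal becomes more impulsive, which is what makes them useful inputs for downstream anomaly detection.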
ARTICLE | doi:10.20944/preprints202105.0444.v1
Subject: Earth Sciences, Geoinformatics Keywords: Spectral Unmixing; Imaging Spectrometer; Hyperspectral; Benchmark Dataset; Dimensionality Estimation; Endmember Extraction; Abundance Estimation; HySpex.
Online: 19 May 2021 (13:25:39 CEST)
Spectral unmixing represents both an application per se and a pre-processing step for several applications involving data acquired by imaging spectrometers. However, there is still a lack of publicly available reference data sets suitable for the validation and comparison of different spectral unmixing methods. In this paper we introduce the DLR HyperSpectral Unmixing (DLR HySU) benchmark dataset, acquired over German Aerospace Center (DLR) premises in Oberpfaffenhofen. The dataset includes airborne hyperspectral and RGB imagery of targets of different materials and sizes, complemented by simultaneous ground-based reflectance measurements. The DLR HySU benchmark allows a separate assessment of all main spectral unmixing steps: dimensionality estimation, endmember extraction (with and without the pure pixel assumption), and abundance estimation. Results obtained with traditional algorithms for each of these steps are reported. To the best of our knowledge, this is the first time that real imaging spectrometer data with accurately measured targets are made available for hyperspectral unmixing experiments. The DLR HySU benchmark dataset is openly available online and the community is welcome to use it for spectral unmixing and other applications.
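A minimal sketch of the linear mixing model behind abundance estimation, using made-up endmember spectra rather than DLR HySU data; a full method would additionally enforce non-negativity and sum-to-one constraints (e.g. with FCLS).

```python
import numpy as np

# Linear mixing model: pixel = E @ a, where the columns of E are the
# endmember (pure-material) spectra and a is the abundance vector.
# These 5-band spectra are invented for illustration.
E = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.6, 0.4],
              [0.3, 0.6],
              [0.2, 0.8]])
a_true = np.array([0.3, 0.7])        # true abundances, summing to one
pixel = E @ a_true                   # synthetic mixed pixel

# Unconstrained least-squares abundance estimate.
a_hat, *_ = np.linalg.lstsq(E, pixel, rcond=None)
```

With noise-free data and full-column-rank endmembers, the least-squares estimate recovers the true abundances exactly; on real imagery the constrained variants are preferred because they keep abundances physically interpretable.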
ARTICLE | doi:10.20944/preprints202105.0429.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Acute lymphoblastic leukemia; Deep convolutional neural networks; Ensemble image classifiers; C-NMC-2019 dataset.
Online: 19 May 2021 (07:42:23 CEST)
Although automated Acute Lymphoblastic Leukemia (ALL) detection is essential, it is challenging due to the morphological correlation between malignant and normal cells. The traditional ALL classification strategy is arduous, time-consuming, often suffers from inter-observer variation, and necessitates experienced pathologists. This article automates the ALL detection task, employing deep Convolutional Neural Networks (CNNs). We explore the weighted ensemble of deep CNNs to recommend a better ALL cell classifier. The weights are estimated from the ensemble candidates' corresponding metrics, such as accuracy, F1-score, AUC, and kappa values. Various data augmentations and pre-processing steps are incorporated to achieve better generalization of the network. We train and evaluate the proposed model utilizing the publicly available C-NMC-2019 ALL dataset. Our proposed weighted ensemble model yielded a weighted F1-score of 88.6%, a balanced accuracy of 86.2%, and an AUC of 0.941 on the preliminary test set. The qualitative results, displaying the gradient class activation maps, confirm that the introduced model has a concentrated learned region. In contrast, the ensemble candidate models, such as Xception, VGG-16, DenseNet-121, MobileNet, and InceptionResNet-V2, separately produce coarse and scattered learned areas for most example cases. Since the proposed ensemble yields a better result for the aimed task, it can be applied in other domains of medical diagnostic applications.
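The metric-weighted soft ensemble can be sketched as follows; the class probabilities and F1 weights below are illustrative numbers, not the paper's reported scores.

```python
import numpy as np

# Candidate models' predicted class probabilities for 4 cells
# (columns: normal, ALL) and their validation F1 scores used as weights.
probs = {
    "Xception":     np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]),
    "VGG-16":       np.array([[0.8, 0.2], [0.6, 0.4], [0.6, 0.4], [0.3, 0.7]]),
    "DenseNet-121": np.array([[0.7, 0.3], [0.3, 0.7], [0.8, 0.2], [0.1, 0.9]]),
}
f1 = {"Xception": 0.86, "VGG-16": 0.82, "DenseNet-121": 0.88}

weights = np.array([f1[m] for m in probs])
weights = weights / weights.sum()            # normalise metric weights
ensemble = sum(w * p for w, p in zip(weights, probs.values()))
pred = ensemble.argmax(axis=1)               # final class per cell
```

Better-scoring candidates thus pull the averaged probabilities harder, which is the intuition behind estimating ensemble weights from validation metrics.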
ARTICLE | doi:10.20944/preprints202007.0187.v1
Subject: Keywords: Intrusion Detection System; NSL-KDD Dataset; One Hot Encoding; Information Gain; Convolution Neural Network
Online: 9 July 2020 (12:14:10 CEST)
Cyber security plays an important role in protecting our computers, networks, programs and data from unauthorized access. Intrusion detection systems (IDS) and intrusion prevention systems (IPS) are two main categories of cyber security, designed to identify any suspicious activity present in inbound and outbound network packets and to restrict the suspicious incident. Deep neural networks play a significant role in the construction of IDS and IPS. This paper highlights a novel IDS using an optimized convolution neural network (CNN-IDS). The optimized CNN-IDS model is an improvement over CNN which selects the best weighted model by considering the loss in every epoch. All the experiments have been conducted on the well-known NSL-KDD dataset. Information gain has been used for dimensionality reduction. The accuracy of the proposed model is evaluated through the optimized CNN for both binary and multiclass categories. Finally, a critical comparison has been performed with other general classifiers like J48, Naive Bayes, NB tree, Random Forest, Multilayer Perceptron (MLP), Support Vector Machine (SVM), Recurrent Neural Network (RNN) and Convolution Neural Network (CNN). All the experimental results demonstrate that the optimized CNN-IDS model records the best recognition rate with minimum model construction time.
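Information gain, used here for dimensionality reduction, can be computed as in this toy sketch; the feature values are invented for illustration, not actual NSL-KDD fields.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(S, F) = H(S) - sum_v (|S_v| / |S|) * H(S_v)."""
    n = len(labels)
    ig = entropy(labels)
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        ig -= len(subset) / n * entropy(subset)
    return ig

# Toy traffic records: one perfectly informative feature, one useless one.
labels   = ["attack", "attack", "normal", "normal"]
protocol = ["tcp", "tcp", "udp", "udp"]   # predicts the label exactly
flag     = ["S0", "SF", "S0", "SF"]       # independent of the label
```

Features are then ranked by their information gain, and low-scoring ones are dropped before training the CNN.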
ARTICLE | doi:10.20944/preprints201805.0240.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: background reconstruction; image quality assessment; image dataset; subjective evaluation; perceptual quality; objective quality metric
Online: 17 May 2018 (09:36:33 CEST)
With an increased interest in applications that require a clean background image, such as video surveillance, object tracking, street view imaging and location-based services on web-based maps, multiple algorithms have been developed to reconstruct a background image from cluttered scenes. Traditionally, statistical measures and existing image quality techniques have been applied for evaluating the quality of the reconstructed background images. Though these quality assessment methods have been widely used in the past, their performance in evaluating the perceived quality of the reconstructed background image has not been verified. In this work, we discuss the shortcomings in existing metrics and propose a full reference Reconstructed Background image Quality Index (RBQI) that combines color and structural information at multiple scales using a probability summation model to predict the perceived quality in the reconstructed background image given a reference image. To compare the performance of the proposed quality index with existing image quality assessment measures, we construct two different datasets consisting of reconstructed background images and corresponding subjective scores. The quality assessment measures are evaluated by correlating their objective scores with human subjective ratings. The correlation results show that the proposed RBQI outperforms all the existing approaches. Additionally, the constructed datasets and the corresponding subjective scores provide a benchmark to evaluate the performance of future metrics that are developed to evaluate the perceived quality of reconstructed background images.
ARTICLE | doi:10.20944/preprints202208.0350.v1
Subject: Mathematics & Computer Science, Analysis Keywords: Mental stress Covid-19; Covid-19 vaccine dataset; Vaccine sociodemographic; Vaccine acceptance rate; Vaccine perception
Online: 18 August 2022 (13:36:16 CEST)
In this study, we surveyed over 600 participants to determine: a) the major causes of mental stress during the pandemic and its future impacts, and b) diversity in public perception and acceptance (specifically for children) of Covid-19 vaccination. Statistical results and intelligent clustering outcomes indicate significant relationships between sociodemographic diversity, mental stress causes, vaccination perception, and Covid-19 infections. For instance, statistical results indicate a significant dependence between mental stress due to Covid-19 and gender (p = 1.7e-05). Over 25% of males indicated work-related stress compared with 35% of females; however, more females (17%) than males (12%) indicated stress due to relationships. Around 30% of Asian/Arabic participants do not feel that vaccination is safe, compared with 8% of white-British and 22% of white-European participants, indicating a significant dependence on ethnicity (p = 1.8e-08). More specifically, vaccination acceptance for children is significantly dependent on ethnicity (p = 3.7e-05), with only 47% of participants showing willingness towards children's vaccination. The primary dataset in this study, along with the experimental outcomes identifying sociodemographic diversity with respect to public perception and acceptance of vaccination for children and potential stress factors, may be useful for the public and policy makers to better prepare for future epidemics, as well as for working globally to combat mental health issues and run more effective vaccination campaigns.
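The reported dependence results are chi-square tests of independence; the sketch below shows the mechanics on a hypothetical gender-by-stress table (not the survey's raw counts), comparing the statistic against the df = 1 critical value rather than computing an exact p-value.

```python
import numpy as np

def chi_square_stat(table):
    """Pearson chi-square statistic: sum over cells of
    (observed - expected)^2 / expected, with expected counts
    computed under the independence hypothesis."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

# Hypothetical 2x2 gender x work-stress counts for illustration.
observed = [[75, 225],    # male: stressed / not stressed
            [105, 195]]   # female: stressed / not stressed
stat = chi_square_stat(observed)

df = (2 - 1) * (2 - 1)
CRITICAL_05 = 3.841       # chi-square critical value, df = 1, alpha = 0.05
dependent = stat > CRITICAL_05
```

A statistic above the critical value rejects independence at the 5% level, which is how a very small p-value such as 1.7e-05 arises from a much larger statistic.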
ARTICLE | doi:10.20944/preprints202002.0174.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: IEEE 802.15.4g; Smart Utility Networks; Low-Power; Wireless; Communications; Dependable; Predictable; Reliable; Available; Industrial; Dataset
Online: 13 February 2020 (14:01:05 CET)
In this article we present a deployment of 11 nodes using the three different SUN (Smart Utility Network) modulation schemes, as defined in the IEEE 802.15.4-2015 standard. The nodes were deployed in a 110,044 m² warehouse for 99 days, and the resulting dataset contains a total of 10,710,868 measurements with RSSI (Received Signal Strength Indicator), CCA (Clear Channel Assessment) and PDR (Packet Delivery Ratio) values. The analyzed results show a high variability in average RSSI (i.e., between -82.1 dBm and -101.7 dBm) and CCA (i.e., between -111.2 dBm and -119.9 dBm) values, caused by the effects of multi-path propagation and external interference. Despite being above the sensitivity limit of each modulation, these values result in poor average PDR figures (i.e., from 65.9% to 87.4%), indicating that additional schemes are required for low-power wireless communications to meet the dependability requirements of industrial applications. For that purpose, we also introduce the concept of modulation diversity, which can be combined with packet repetition to meet such requirements (i.e., PDR > 99%) while minimizing the energy expenditure of nodes and meeting regulatory constraints.
ARTICLE | doi:10.20944/preprints201908.0019.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: emotion classification; machine learning classifiers; ISEAR dataset; data mining; performance evaluation; data science; opinion-mining
Online: 2 August 2019 (08:49:27 CEST)
Emotion detection from text is an important and challenging problem in text analytics. Opinion-mining experts are focusing on the development of emotion detection applications, as these have received considerable attention from the online community, including users and business organizations, for collecting and interpreting public emotions. However, most of the existing work on emotion detection used less efficient machine learning classifiers with limited datasets, resulting in performance degradation. To overcome this issue, this work evaluates the performance of different machine learning classifiers on a benchmark emotion dataset. The experimental results show the performance of the classifiers in terms of evaluation metrics such as precision, recall, and F-measure. Finally, the classifier with the best performance is recommended for emotion classification.
ARTICLE | doi:10.20944/preprints201804.0088.v1
Subject: Arts & Humanities, History Keywords: historical dataset; geocoding; localisation; geohistorical objects; database; GIS; collaborative; citizen science; crowd-sourced; digital humanities
Online: 8 April 2018 (09:13:10 CEST)
The latest developments in digital humanities have increasingly enabled the construction of large data sets which can easily be accessed and used. These data sets often contain indirect localisation information, such as historical addresses. Historical geocoding is the process of transforming the indirect localisation information into direct localisation that can be placed on a map, which enables spatial analysis and cross-referencing. Many efficient geocoders exist for current addresses, but they do not deal with temporal information and are usually based on a strict hierarchy (country, city, street, house number, etc.) that is hard, if not impossible, to use with historical data. Indeed, historical data are full of uncertainties (temporal, textual, positional accuracy, confidence in historical sources) that cannot be ignored or entirely resolved. We propose an open source, open data, extensible solution for geocoding that is based on gazetteers composed of geohistorical objects extracted from historical topographical maps. Once the gazetteers are available, geocoding a historical address is a matter of finding the geohistorical object in the gazetteers that best matches the historical address searched by the user. The matching criteria are customisable and include several dimensions (fuzzy string, fuzzy temporal, level of detail, positional accuracy). As the goal is to facilitate historical work, we also propose web-based user interfaces that help geocode (a single address or in batch mode) and display the results over current or historical topographical maps, so that geocoding results can be checked and collaboratively edited. The system has been tested on the city of Paris, France, for the 19th and 20th centuries. It shows high response rates and is fast enough to be used interactively.
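A highly simplified, two-criterion version of such multi-dimensional matching might look like the following sketch; the gazetteer entries, validity intervals, and weights are invented for illustration.

```python
from difflib import SequenceMatcher

def match_score(query, candidate, query_year):
    """Combined fuzzy-string and temporal score for ranking geohistorical
    objects against a historical address (simplified: two criteria with
    fixed weights instead of the full multi-dimensional matching)."""
    name_sim = SequenceMatcher(
        None, query.lower(), candidate["name"].lower()).ratio()
    start, end = candidate["valid"]          # years the object existed
    temporal = 1.0 if start <= query_year <= end else 0.0
    return 0.7 * name_sim + 0.3 * temporal

gazetteer = [
    {"name": "Rue de la Paix", "valid": (1806, 2000)},
    {"name": "Rue Napoleon",   "valid": (1806, 1814)},
]
best = max(gazetteer, key=lambda c: match_score("rue de la paix", c, 1850))
```

A production system would replace the hard temporal cutoff with a fuzzy interval overlap and add level-of-detail and positional-accuracy terms, but the ranking principle is the same.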
DATA DESCRIPTOR | doi:10.20944/preprints202301.0473.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Data generator; dataset; deep learning; health index; machine learning; prognosis and health management; remaining useful life
Online: 26 January 2023 (08:37:30 CET)
This paper presents PrognosEase, a software tool that provides an easy way to produce different types of run-to-failure data mimicking real-world conditions, simplifying prognosis studies in terms of data collection and improving the ML degradation modelling process. Different degradation types are made available to suit different applications. In addition, preliminary ML tests were performed to ensure that the complexity patterns of real systems can be observed in the training/testing prediction behaviour. This paper also presents the impacts, limitations, and potential improvements of the data generator.
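A minimal sketch of what a run-to-failure health-index generator can look like, assuming hypothetical linear and exponential degradation shapes; these are illustrative, not PrognosEase's actual model catalogue.

```python
import numpy as np

def generate_run_to_failure(n_cycles, model="linear", noise=0.02, seed=0):
    """Generate a synthetic health-index trajectory from 1 (healthy)
    down to 0 (failure), with additive Gaussian measurement noise."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, n_cycles)
    if model == "linear":
        hi = 1 - t
    elif model == "exponential":
        # normalised so the curve runs exactly from 1 to 0
        hi = (np.exp(-3 * t) - np.exp(-3)) / (1 - np.exp(-3))
    else:
        raise ValueError(f"unknown degradation model: {model}")
    return np.clip(hi + rng.normal(0, noise, n_cycles), 0, 1)

hi = generate_run_to_failure(200, model="exponential")
rul = int(np.argmax(hi <= 0.05))   # first cycle where HI nears failure
```

Trajectories like these can feed remaining-useful-life models directly, with the degradation shape and noise level varied to mimic the complexity of real systems.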
ARTICLE | doi:10.20944/preprints202210.0477.v1
Subject: Mathematics & Computer Science, Analysis Keywords: High Throughput Plant Phenotyping; Deep Neural Network; Flower Detection; Temporal Phenotypes; Benchmark Dataset; Flower Status Report
Online: 31 October 2022 (10:00:24 CET)
A phenotype is the composite of an observable expression of a genome for traits in a given environment. The trajectories of phenotypes computed from an image sequence and the timing of important events in a plant's life cycle can be viewed as temporal phenotypes, indicative of the plant's growth pattern and vigor. In this paper, we introduce a novel method called FlowerPhenoNet, which uses deep neural networks for detecting flowers from multiview image sequences for high-throughput temporal plant phenotyping analysis. Following flower detection, a set of novel flower-based phenotypes are computed, e.g., the day of emergence of the first flower in a plant's life cycle, the total number of flowers present in the plant at a given time, the highest number of flowers bloomed in the plant, the growth trajectory of a flower, and the blooming trajectory of a plant. To develop a new algorithm and facilitate performance evaluation based on experimental analysis, a benchmark dataset is indispensable. Thus, we introduce a benchmark dataset called FlowerPheno, which comprises image sequences of three flowering plant species, namely sunflower, coleus, and canna, captured by a visible light camera in a high-throughput plant phenotyping platform from multiple view angles. The experimental analyses on the FlowerPheno dataset demonstrate the efficacy of FlowerPhenoNet.
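Given per-day flower counts from a detector, several of the listed phenotypes reduce to simple computations; the counts below are invented for illustration, not FlowerPheno data.

```python
# Per-day flower counts for one plant, e.g. aggregated from per-image
# detections over a multiview sequence (index = day in the sequence).
counts = [0, 0, 1, 1, 3, 4, 4, 2, 1]

# Day of emergence of the first flower.
first_flower_day = next(d for d, c in enumerate(counts) if c > 0)
# Highest number of flowers bloomed, and the day it occurred.
peak_count = max(counts)
peak_day = counts.index(peak_count)
# Blooming trajectory: day-to-day change in the number of open flowers.
trajectory = [b - a for a, b in zip(counts, counts[1:])]
```

Tracking an individual flower's growth trajectory would additionally require associating detections across frames, which is where the detection network's localisation output comes in.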
DATA DESCRIPTOR | doi:10.20944/preprints202206.0146.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; COVID; Omicron; online learning; remote learning; online education; Twitter; dataset; Tweets; social media; Big Data
Online: 21 July 2022 (08:05:19 CEST)
COMMUNICATION | doi:10.20944/preprints202206.0383.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Exoskeleton; Twitter; Tweets; Big Data; social media; Data Mining; dataset; Data Science; Natural Language Processing; Information Retrieval
Online: 21 July 2022 (04:06:53 CEST)
The exoskeleton technology has been rapidly advancing in the recent past due to its multitude of applications and diverse use-cases in assisted living, military, healthcare, firefighting, and industry 4.0. The exoskeleton market is projected to grow to several times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today's living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset through the mining of relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, and the topics found in its conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 tweets about exoskeletons that were posted in a 5-year period from May 21, 2017, to May 21, 2022. Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
ARTICLE | doi:10.20944/preprints202110.0089.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Object Detection; Cascade Mask R-CNN; Floor Plan Images; Deep Learning; Transfer Learning; Dataset Augmentation; Computer Vision
Online: 5 October 2021 (15:09:26 CEST)
Object detection is one of the most critical tasks in the field of computer vision. This task comprises identifying and localizing an object in an image. Architectural floor plans represent the layout of buildings and apartments. Floor plans consist of walls, windows, stairs, and other furniture objects. While recognizing floor plan objects is straightforward for humans, automatically processing floor plans and recognizing objects is a challenging problem. In this work, we investigate the performance of the recently introduced Cascade Mask R-CNN network to solve object detection in floor plan images. Furthermore, we experimentally establish that deformable convolution works better than conventional convolution in the proposed framework. Identifying objects in floor plan images is also challenging due to the variety of floor plans and different objects. We faced a problem in training our network because of the lack of publicly available datasets: currently available public datasets do not have enough images to train deep neural networks efficiently. To address this issue, we introduce SFPI, a novel synthetic floor plan dataset consisting of 10000 images. Our proposed method conveniently surpasses the previous state-of-the-art results on the SESYD dataset and sets impressive baseline results on the proposed SFPI dataset. The dataset can be downloaded from SFPI Dataset Link. We believe that this novel dataset will enable researchers to further advance research in this domain.
ARTICLE | doi:10.20944/preprints201903.0039.v2
Subject: Engineering, Control & Systems Engineering Keywords: Handwritten digit recognition; Convolutional Neural Network (CNN); Deep learning; MNIST dataset; Epochs; Hidden Layers; Stochastic Gradient Descent; Backpropagation
Online: 20 September 2019 (10:12:26 CEST)
In recent times, with the rise of the Artificial Neural Network (ANN), deep learning has brought a dramatic twist to the field of machine learning by moving it closer to Artificial Intelligence (AI). Deep learning is used remarkably in a vast range of fields because of its diverse applications, such as surveillance, health, medicine, sports, robotics, drones, etc. In deep learning, the Convolutional Neural Network (CNN) is at the center of spectacular advances that mix the Artificial Neural Network (ANN) with up-to-date deep learning strategies. It has been used broadly in pattern recognition, sentence classification, speech recognition, face recognition, text categorization, document analysis, scene recognition, and handwritten digit recognition. The goal of this paper is to observe how the accuracy of a CNN classifying handwritten digits varies with the number of hidden layers and epochs, and to compare the resulting accuracies. For this performance evaluation of the CNN, we performed our experiment using the Modified National Institute of Standards and Technology (MNIST) dataset. The network is trained using stochastic gradient descent and the backpropagation algorithm.
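The training scheme named above, stochastic gradient descent with backpropagation, can be sketched on a tiny one-hidden-layer network; the data here is a synthetic two-class toy problem standing in for MNIST, and the architecture and hyperparameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)   # XOR-like, non-linear labels

# One hidden layer of 8 tanh units, sigmoid output.
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.1

for epoch in range(300):
    for i in rng.permutation(len(X)):       # stochastic: one sample at a time
        x = X[i:i + 1]; t = y[i]
        h = np.tanh(x @ W1 + b1)            # forward pass
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))
        # backward pass: cross-entropy loss gradients
        dz2 = p - t
        dW2 = h.T @ dz2; db2 = dz2.ravel()
        dh = dz2 @ W2.T * (1 - h ** 2)      # tanh derivative
        dW1 = x.T @ dh; db1 = dh.ravel()
        # SGD parameter updates
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

h = np.tanh(X @ W1 + b1)
pred = (1 / (1 + np.exp(-(h @ W2 + b2))) > 0.5).ravel()
accuracy = (pred == y.astype(bool)).mean()
```

A CNN adds convolutional and pooling layers in front of such dense layers, but the per-sample forward/backward/update loop is identical in structure.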
DATA DESCRIPTOR | doi:10.20944/preprints202209.0323.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; Open-source dataset; Drug Repurposing; Database system; Web application development; software development; Drug fingerprints; Bulk upload
Online: 21 September 2022 (10:14:11 CEST)
Although various vaccines are now commercially available, they have not been able to stop the spread of COVID-19 infection completely. An excellent strategy to quickly obtain a safe, effective, and affordable COVID-19 treatment is to repurpose drugs that are already approved for other diseases as adjuvants alongside the ongoing vaccine regime. The process of developing an accurate and standardized drug repurposing dataset requires a considerable level of resources and expertise, due to the commercial availability of an extensive array of drugs that could potentially be used to address SARS-CoV-2 infection. To address this bottleneck, we created the CoviRx platform. CoviRx is a user-friendly interface that provides access to manually curated COVID-19 drug repurposing data. Through CoviRx, the curated data have been made open source to help advance drug repurposing research. CoviRx also encourages users to submit their findings after thoroughly validating the data, which are then merged under uniformity- and integrity-preserving constraints. This article discusses the various features of CoviRx and its design principles. CoviRx has been designed so that its functionality is independent of the data it displays. Thus, in the future, this platform can be extended to include any other disease X beyond COVID-19. CoviRx can be accessed at www.covirx.org.
ARTICLE | doi:10.20944/preprints202006.0031.v3
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Deep learning; Convolutional Neural Network; Coronavirus; COVID-19; radiology; CT scan; Medical image analysis; Automatic medical diagnosis; lung CT scan dataset
Online: 5 September 2020 (03:36:20 CEST)
COVID-19 is a severe global problem, and AI can play a significant role in preventing losses by monitoring and detecting infected persons at an early stage. This paper proposes a high-speed and accurate fully-automated method to detect COVID-19 from a patient's CT scan images. We introduce a new dataset that contains 48260 CT scan images from 282 normal persons and 15589 images from 95 patients with COVID-19 infections. In the first stage, the system runs our proposed image processing algorithm to discard CT images in which the inside of the lung is not properly visible. This step helps reduce processing time and false detections. In the next stage, we introduce a novel method for increasing the classification accuracy of convolutional networks. We implemented our method using the ResNet50V2 network and a modified feature pyramid network alongside our designed architecture, classifying the selected CT images as COVID-19 or normal with higher accuracy than other models. After running these two phases, the system determines the condition of the patient using a selected threshold. We are the first to evaluate our system in two different ways. In the single-image classification stage, our model achieved 98.49% accuracy on more than 7996 test images. In the patient identification phase, the system correctly identified 234 of 245 patients at high speed. We also investigated the classified images with the Grad-CAM algorithm to indicate the areas of infection and to evaluate the correctness of our model's classifications.
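The final patient-level decision from per-image predictions can be sketched as a threshold on the fraction of slices classified positive; the threshold values and probabilities below are illustrative assumptions, not the paper's tuned settings.

```python
def classify_patient(slice_probs, image_thresh=0.5, patient_thresh=0.3):
    """Flag a patient as COVID-19 positive when the fraction of CT
    slices classified positive exceeds a chosen threshold.
    Both thresholds are hypothetical values for illustration."""
    positives = sum(p > image_thresh for p in slice_probs)
    return positives / len(slice_probs) >= patient_thresh

# Per-slice COVID-19 probabilities from the image classifier (invented).
covid_patient  = [0.9, 0.8, 0.2, 0.7, 0.6, 0.4]   # many positive slices
normal_patient = [0.1, 0.2, 0.05, 0.3, 0.1, 0.15] # no positive slices
```

Aggregating over slices in this way makes the patient-level decision robust to a few misclassified images, which is why the two evaluation settings (single image vs. patient) can yield different accuracies.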
ARTICLE | doi:10.20944/preprints202003.0297.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Data Mining; Alzheimer’s Dementia; Composite Hybrid Feature Selection; Machine learning; Stack Hybrid Classification; AI Techniques; Classification; AD Diagnose; Clinical AD Dataset
Online: 19 March 2020 (10:52:31 CET)
Alzheimer's disease (AD) is a common type of dementia that causes damage to brain cells. Early detection of AD plays an essential role in global health care, because AD is often misdiagnosed, shares many clinical features with other types of dementia, and monitoring the progression of the disease over time with magnetic resonance imaging (MRI) is costly and subject to human error in manual reading. In the first stage, our proposed model applies the medical dataset to a composite hybrid feature selection (CHFS) method to extract new features and select the best ones, improving the performance of the classification process by eliminating obscure features. In the second stage, the dataset is applied to a stacked hybrid classification system that combines the JRip and random forest classifiers, with six models evaluated individually as meta-classifiers, to improve the prediction of the clinical diagnosis. All experiments were conducted on a laptop with an Intel Core i7-8750H CPU at 2.2 GHz and 16 GB of RAM running Windows 10 (64-bit). The dataset was evaluated using the Explorer interface of the Weka data mining software for analysis purposes. The experiments show that the proposed CHFS feature extraction performs better than principal component analysis (PCA) and effectively reduces the false-negative rate, achieving a relatively high overall accuracy of 96.50% with a support vector machine (SVM) as meta-classifier, compared to 68.83%, which is considerably better than the previous state-of-the-art result. The area under the receiver operating characteristic (ROC) curve was 95.5%. In addition, an experiment on the Kaggle MRI image dataset with a CNN classification process achieved 80.21% accuracy. The results show that the proposed model accurately classifies Alzheimer's clinical samples against MRI neuroimaging for diagnosing AD at low cost.
Subject: Medicine & Pharmacology, Oncology & Oncogenics Keywords: breast cancer tumor; classification; majority-based voting mechanism; multilayer perceptron learning network; simple logistic regression; stochastic gradient descent learning; wisconsin breast cancer dataset
Online: 27 November 2019 (09:51:31 CET)
Breast cancer is one of the most common causes of death for women worldwide. Thus, the ability of artificial intelligence systems to predict and classify breast cancer is very important. In this paper, a hybrid ensemble classification mechanism is proposed based on majority voting. First, the performance of different state-of-the-art machine learning classification algorithms on the Wisconsin Breast Cancer Dataset (WBCD) was evaluated. The three best classifiers were then selected based on their F3 score; the F3 score is used to emphasize the importance of false negatives (recall) in breast cancer classification. These three classifiers, simple logistic regression learning, stochastic gradient descent learning, and the multilayer perceptron network, were then combined into an ensemble using a voting mechanism. We also evaluated the performance of hard and soft voting mechanisms. For hard voting, majority-based voting was used, and for soft voting we used the average of probabilities, product of probabilities, maximum of probabilities, and minimum of probabilities voting methods. The hard voting (majority-based voting) mechanism shows the best performance, with 99.42% accuracy, compared with the state-of-the-art algorithms for the WBCD.
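The two ingredients named above, recall-weighted F3 scoring for classifier selection and hard majority voting, can be sketched as follows; the predictions and precision/recall figures are invented for illustration.

```python
def f_beta(precision, recall, beta=3.0):
    """F-beta score; beta = 3 weights recall (i.e. avoiding false
    negatives) much more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def hard_vote(predictions):
    """Majority vote across classifiers for each sample; each inner
    list holds one classifier's predictions."""
    return [max(set(col), key=col.count) for col in zip(*predictions)]

# Three classifiers' predictions on four samples (1 = malignant).
votes = hard_vote([[1, 0, 1, 0],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0]])
```

Because beta squared multiplies precision in the denominator, a classifier with high recall but modest precision outscores one with the reverse profile, which matches the stated goal of penalising false negatives.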
ARTICLE | doi:10.20944/preprints202007.0634.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: CVD rehabilitation; Local muscular endurance exercises; Exercise-based rehabilitation; Deep Learning; AlexNet; CNN; SVM; kNN; RF; MLP; PCA; multi-class classification; INSIGHT-LME dataset
Online: 26 July 2020 (15:21:08 CEST)
Exercise-based cardiac rehabilitation requires patients to perform a set of certain prescribed exercises a specific number of times. Local muscular endurance (LME) exercises are an important part of the rehabilitation program. Automatic exercise recognition and repetition counting, from wearable sensor data is an important technology to enable patients to perform exercises independently in remote settings, e.g. their own home. In this paper we first report on a comparison of traditional approaches to exercise recognition and repetition counting, corresponding to supervised machine learning and peak detection from inertial sensing signals respectively, with more recent machine learning approaches, specifically Convolutional Neural Networks (CNNs). We investigated two different types of CNN: one using the AlexNet architecture, the other using time-series array. We found that the performance of CNN based approaches were better than the traditional approaches. For exercise recognition task, we found that the AlexNet based single CNN model outperformed other methods with an overall 97.18% F1-score measure. For exercise repetition counting , again the AlexNet architecture based single CNN model outperformed other methods by correctly counting repetitions in 90% of the performed exercise sets within an error of ±1. To the best of our knowledge, our approach of using a single CNN method for both recognition and repetition counting is novel. In addition to reporting our findings, we also make the dataset we created, the INSIGHT-LME dataset, publicly available to encourage further research.