ARTICLE | doi:10.20944/preprints202201.0454.v1
Subject: Mathematics & Computer Science, Other Keywords: Ransomware; Behavior analysis; Cyber Security; Machine Learning; Ensemble model; Supervised classification
Online: 31 January 2022 (11:49:48 CET)
Ransomware is one of the most dangerous types of malware; it is frequently designed to spread through a network and harm the targeted client by encrypting the client’s vulnerable data. Conventional signature-based ransomware detection techniques fall behind because they can only detect known anomalies. When it comes to new and unfamiliar ransomware, traditional systems reveal major shortcomings. For detecting unknown patterns and new ransomware families, behavior-based anomaly detection approaches are likely to be the most efficient. In response, this paper presents an ensemble classification model consisting of three widely used machine learning techniques: Decision Tree (DT), Random Forest (RF), and K-nearest neighbor (KNN). To achieve the best outcome, ensemble soft voting and hard voting techniques are used to classify ransomware families based on attack attributes. Performance is analyzed by comparing the proposed ensemble models with standalone models on a ransomware dataset based on behavioral attributes.
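The difference between the hard and soft voting mentioned in this abstract can be illustrated with a minimal sketch. The per-model probabilities below are hypothetical values chosen for illustration and are not taken from the paper; the function names `hard_vote` and `soft_vote` are likewise assumptions.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the class label predicted by the majority of models."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-class probabilities across models and return the
    class with the highest mean probability, plus the averages."""
    n_models = len(probabilities)
    classes = probabilities[0].keys()
    averaged = {c: sum(p[c] for p in probabilities) / n_models for c in classes}
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of three base models (e.g. DT, RF, KNN) for one sample.
probs = [
    {"benign": 0.55, "ransomware": 0.45},
    {"benign": 0.60, "ransomware": 0.40},
    {"benign": 0.05, "ransomware": 0.95},
]

# Each model's hard prediction is its most probable class.
labels = [max(p, key=p.get) for p in probs]

hard_label = hard_vote(labels)
soft_label, avg = soft_vote(probs)
print(hard_label, soft_label, avg)
```

In this constructed case the two schemes disagree: two weakly confident models outvote one strongly confident model under hard voting, while soft voting lets the confident model dominate the average.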
ARTICLE | doi:10.20944/preprints202109.0112.v1
Subject: Engineering, Marine Engineering Keywords: 3D point Cloud Classification, 3D point Cloud Shape Completion,Auto-Encoders, Contrastive Learning, Self-Supervised Learning
Online: 6 September 2021 (18:00:28 CEST)
In this paper, we present the idea of self-supervised learning for the shape completion and classification of point clouds. Most 3D shape completion pipelines utilize autoencoders to extract features from point clouds used in downstream tasks such as classification, segmentation, detection, and other related applications. Our idea is to add contrastive learning to autoencoders to learn both global and local feature representations of point clouds. We use a combination of triplet loss and Chamfer distance to learn global and local feature representations. To evaluate the performance of embeddings for classification, we utilize the PointNet classifier. We also extend the number of classes used to evaluate our model from 4 to 10 to show the generalization ability of the learned features. Based on our results, embeddings generated by the contrastive autoencoder improve shape completion and classification performance on point clouds from 84.2% to 84.9%, achieving state-of-the-art results with 10 classes.
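The two loss terms named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the common squared-distance, mean-reduced form of the Chamfer distance and a Euclidean triplet loss with a hypothetical margin of 1.0. The abstract does not specify which variants or weighting the authors use.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points or vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: for each point, the squared distance
    to its nearest neighbour in the other cloud, averaged per cloud."""
    a_to_b = sum(min(sq_dist(a, b) for b in cloud_b) for a in cloud_a) / len(cloud_a)
    b_to_a = sum(min(sq_dist(b, a) for a in cloud_a) for b in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative
    distances; zero once the negative is farther by at least `margin`."""
    d_pos = math.sqrt(sq_dist(anchor, positive))
    d_neg = math.sqrt(sq_dist(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


# Toy point clouds (hypothetical data).
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(cloud_a, cloud_b)

# Toy embeddings (hypothetical data).
easy = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
hard = triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 1.5))
print(cd, easy, hard)
```

The Chamfer term compares a reconstructed cloud with its target point by point, while the triplet term acts on embeddings, which is one plausible reading of the local versus global split described in the abstract.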
ARTICLE | doi:10.20944/preprints202201.0367.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Artificial Intelligence; Deep Learning; Image Classification; Machine Learning; Predictive Models; Small Datasets; Supervised Learning
Online: 25 January 2022 (08:24:17 CET)
One of the most important challenges in the Machine and Deep Learning areas today is to build good models using small datasets, because sometimes it is not possible to have large ones. Several techniques have been proposed in the literature to address this challenge. This paper aims at studying the different available Deep Learning techniques and performing a thorough experimentation to analyze which technique or combination thereof improves the performance and effectiveness of the models. A complete comparison with classical Machine Learning techniques was carried out, to contrast the results obtained using both techniques when working with small datasets. Thirteen algorithms were implemented and trained using three different small datasets (MNIST, Fashion MNIST, and CIFAR-10). Each experiment was evaluated using a well-established set of metrics (Accuracy, Precision, Recall, F1, and the Matthews correlation coefficient). The experimentation allowed concluding that it is possible to find a technique or combination of them to mitigate a lack of data, but this depends on the nature of the dataset, the amount of data, and the metrics used to evaluate them.
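The metrics listed in this abstract can all be derived from a confusion matrix. The sketch below shows the binary case only, for brevity; the datasets in the abstract are multi-class, where per-class or averaged variants would be needed. The labels are hypothetical.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and the Matthews
    correlation coefficient (MCC) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical labels: 3 true positives, 3 true negatives,
# 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy, precision, recall and F1 all equal 0.75 while the MCC is 0.5, which shows how the MCC can give a more conservative picture than the other four.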
ARTICLE | doi:10.20944/preprints201903.0122.v1
Subject: Earth Sciences, Geoinformatics Keywords: Classification, SVM Classifier, ML Classifier, Supervised and Unsupervised Classification, Object-based Classification, Multispectral Data
Online: 11 March 2019 (09:01:44 CET)
This paper focuses on the crucial role that remote sensing plays in identifying land features. Remotely collected data provides information in the spectral, spatial, temporal and radiometric domains, with each domain having its own resolution. Diverse sectors such as hydrology, geology, agriculture, land cover mapping, forestry, urban development and planning, oceanography and others are known to use and rely on information that is gathered remotely from different sensors. In the present study, IRS LISS IV multispectral data is used for land cover mapping. It is known, however, that classifying high-resolution land cover imagery through manual digitizing is time-consuming and costly. Therefore, this paper proposes performing the classification with computer algorithms. These classifications fall into three classes: supervised, unsupervised, and object-based classification. For supervised classification, two approaches are used for land cover classification of a high-resolution LISS-IV multispectral image: Maximum Likelihood (ML) and Support Vector Machine (SVM). Finally, the paper proposes a step-by-step procedure for optical image classification. This paper concludes that in optical data classification, SVM classification gives a better result than the ML classification technique.
Subject: Medicine & Pharmacology, Psychiatry & Mental Health Studies Keywords: supervised learning, major depression, cytokines, inflammation, neuro-immune, opioids
Online: 25 March 2019 (10:14:02 CET)
Rationale: Major depressive disorder (MDD) is characterized by signaling aberrations in interleukin (IL)-6, IL-10, beta-endorphins as well as mu (MOR) and kappa (KOR) opioid receptors. Here we examined whether these biomarkers may aid in the classification of unknown subjects into the target class MDD. Methods: The aforementioned biomarkers were assayed in 60 first-episode, drug-naïve depressed patients and 30 controls. We analyzed the data using joint principal component analysis (PCA) performed on all subjects to check whether subjects cluster by classes; support vector machine (SVM) with 10-fold validation; and linear discriminant analysis (LDA) and SIMCA performed on calibration and validation sets and we computed the figures of merit and learnt from the data. Results: PCA shows that both groups were well separated using the first three PCs, while correlation loadings show that all 5 biomarkers have discriminatory value. SVM and LDA yielded an accuracy of 100% in validation samples. Using SIMCA there was a highly significant discrimination of both groups (model-to-model distance=87.5); all biomarkers showed a significant discrimination and modeling power, while 10% of the patients were identified as outsiders and no aliens could be identified. Discussion: We have delineated that MDD is a distinct class with respect to neuro-immune and opioid biomarkers and that future unknown subjects can be authenticated as having MDD using this SIMCA fingerprint. Precision psychiatry should employ SIMCA a) to authenticate patients as belonging to the claimed target class and identify other subjects as outsiders, members of another class or aliens; and b) to acquire knowledge through learning from the data by constructing a biomarker fingerprint of the target class.
ARTICLE | doi:10.20944/preprints201911.0218.v1
Subject: Earth Sciences, Environmental Sciences Keywords: Landsat; Google Earth; water index; unsupervised image classification; supervised image classification; Kappa coefficient
Online: 19 November 2019 (03:10:17 CET)
To address three important issues related to the extraction of water features from Landsat imagery, i.e., selection of water indexes, selection of classification algorithms for image classification, and collection of ground truth data for accuracy assessment, this study applied four sets (ultra-blue, blue, green, and red light based) of water indexes (NDWI, MNDWI, MNDWI2, AWEIns, and AWEIs) combined with three types of image classification methods (zero-water index threshold, Otsu, and kNN) to 24 selected lakes across the globe to extract water features from Landsat-8 OLI imagery. The 1440 (4x5x3x24) image classification results were compared, by computing Kappa coefficients, with water features extracted from high-resolution Google Earth images with the same (or ±1 day) acquisition dates. Results show that the kNN method is better than the Otsu method, and the Otsu method is better than the zero-water index threshold method. If computational cost is not an issue, the kNN method combined with the ultra-blue light based AWEIns is the best method for extracting water features from Landsat imagery because it produced the highest Kappa coefficients. If computational cost is taken into account, the Otsu method is a good choice. AWEIns and AWEIs are better than NDWI, MNDWI and MNDWI2. AWEIns works better than AWEIs under the Otsu method, and the average rank of the image classification accuracy from high to low is the ultra-blue, blue, green, and red light-based AWEIns.
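The pipeline in this abstract (compute a water index, threshold it, then score agreement with reference data using the Kappa coefficient) can be sketched as follows. This is a minimal illustration assuming the standard normalized-difference form (visible - NIR) / (visible + NIR) for NDWI and Cohen's kappa for agreement. The reflectance values and reference labels are hypothetical, and the other indexes and the Otsu and kNN classifiers are not shown.

```python
def normalized_difference(visible, nir):
    """NDWI-style index: (visible - NIR) / (visible + NIR)."""
    total = visible + nir
    return (visible - nir) / total if total else 0.0


def zero_threshold(index_values):
    """Label a pixel as water (1) when its index is above zero."""
    return [1 if v > 0 else 0 for v in index_values]


def cohen_kappa(reference, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(reference)
    labels = set(reference) | set(predicted)
    p_obs = sum(1 for r, p in zip(reference, predicted) if r == p) / n
    p_exp = sum((reference.count(l) / n) * (predicted.count(l) / n)
                for l in labels)
    return (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 1.0


# Hypothetical (visible, NIR) reflectances for six pixels.
pixels = [(0.10, 0.02), (0.08, 0.03), (0.05, 0.30),
          (0.06, 0.25), (0.09, 0.10), (0.07, 0.02)]
predicted = zero_threshold([normalized_difference(v, n) for v, n in pixels])

# Hypothetical reference labels (1 = water).
reference = [1, 1, 0, 0, 1, 1]
kappa = cohen_kappa(reference, predicted)

# Number of classification results reported in the abstract:
# 4 band sets x 5 indexes x 3 methods x 24 lakes.
n_results = 4 * 5 * 3 * 24
print(predicted, kappa, n_results)
```

Swapping the visible band passed to `normalized_difference` is one way to picture the four band sets (ultra-blue, blue, green, red) that the study compares.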
ARTICLE | doi:10.20944/preprints202005.0356.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Supervised Learning; Time Series Classification; Jamming Detection; Automatic Modulation Classification; Feature Selection; Genetic Algorithm; Principal Component Analysis; QPSK modulation; APSK modulation
Online: 23 May 2020 (05:10:36 CEST)
Satellite communication (Satcom) relies on artificial geostationary satellites to facilitate a wide range of telecommunications. Its quality of service (QoS) and security are crucial in government/military applications. The most challenging situation for efficient Satcom is a radio frequency interference (RFI) environment. Thus, it is necessary to ensure that transmissions are incorruptible, or at least to sense the quality of the spectrum. This paper presents a new method to recognize received signal characteristics using hierarchical classification in a multi-layer perceptron neural network. We consider signal modulation and the type of RFI as the characteristics of a real-time video stream transmitted by a direct broadcast satellite. Four different modulation types are investigated in this study. Moreover, the combination of the communication signal with various kinds of interference, and its effect on the classification method, has been analyzed extensively. In addition, two robust feature selection techniques have been developed to reduce the dataset dimensionality, which optimizes the classification process. The results show that the Genetic Algorithm (GA) slightly outperforms Principal Component Analysis (PCA) for feature selection. Furthermore, the robustness of the proposed techniques in detecting unknown signals is assessed at different signal-to-noise ratios.
ARTICLE | doi:10.20944/preprints202103.0780.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Deep learning; Computer vision; Remote sensing; Supervised learning; Semi-supervised learning; Segmentation; Seagrass mapping
Online: 31 March 2021 (15:53:19 CEST)
Intertidal seagrass plays a vital role in estimating the overall health and dynamics of coastal environments due to its interaction with tidal changes. However, most seagrass habitats around the globe have been in steady decline due to human impacts, disturbing the already delicate balance in environmental conditions that sustain seagrass. Miniaturization of multi-spectral sensors has facilitated very high resolution mapping of seagrass meadows, which significantly improve the potential for ecologists to monitor changes. In this study, two analytical approaches used for classifying intertidal seagrass habitats are compared: Object-based Image Analysis (OBIA) and Fully Convolutional Neural Networks (FCNNs). Both methods produce pixel-wise classifications in order to create segmented maps, however FCNNs are an emerging set of algorithms within Deep Learning with sparse application towards seagrass mapping. Conversely, OBIA has been a prominent solution within this field, with many studies leveraging in-situ data and multiscale segmentation to create habitat maps. This work demonstrates the utility of FCNNs in a semi-supervised setting to map seagrass and other coastal features from an optical drone survey conducted at Budle Bay, Northumberland, England. Semi-supervision is also an emerging field within Deep Learning that has practical benefits of achieving state of the art results using only subsets of labelled data. This is especially beneficial for remote sensing applications where in-situ data is an expensive commodity. For our results, we show that FCNNs have comparable performance with standard OBIA method used by ecologists, while also noting an increase in performance for mapping ecological features that are sparsely labelled across the study site.
ARTICLE | doi:10.20944/preprints202111.0243.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Feature Selection; Malaria Diagnosis; Supervised learning
Online: 15 November 2021 (10:36:16 CET)
Malaria remains an important cause of death, especially in sub-Saharan Africa, with about 228 million malaria cases worldwide and an estimated 405,000 deaths in 2019. Currently, malaria is diagnosed in health facilities using a microscope (BS) or a rapid malaria diagnostic test (MRDT); in areas where these tools are inadequate, presumptive treatment is performed. In addition, self-diagnosis and treatment are practiced in some households. Given the high rate of self-medication with malaria drugs, this study aimed to compute the most significant features, using feature selection methods, for the best prediction of malaria in Tanzania; these features can be used to develop a machine learning model for malaria diagnosis. A malaria symptoms and clinical diagnosis dataset was extracted from patients’ files from four (4) identified health facilities in the regions of Kilimanjaro and Morogoro. These regions were selected to represent the high endemic areas (Morogoro) and low endemic areas (Kilimanjaro) in the country. The dataset contained 2556 instances and 36 variables. A random forest classifier, a tree-based method, was used to select the most important features for malaria prediction. Region-based features were obtained to facilitate accurate prediction. The feature ranking indicated that fever is universally the most influential feature for predicting malaria, followed by general body malaise, vomiting and headache. However, these features are ranked differently across the regional datasets. Subsequently, six predictive models, using the important features selected by the feature selection method, were used to evaluate the features’ performance. The identified features comply with the malaria diagnosis and treatment guidelines provided by WHO and Tanzania Mainland. This compliance is observed so as to produce a prediction model that will fit into the current health care provision system in Tanzania.
ARTICLE | doi:10.20944/preprints202209.0100.v1
Subject: Life Sciences, Biotechnology Keywords: biocatalysts; bioprospecting; esterases/lipases; hydrolases; machine learning; supervised learning
Online: 7 September 2022 (04:53:30 CEST)
When bioprospecting for novel industrial enzymes, substrate promiscuity is a desirable property that increases the reusability of the enzyme. Among industrial enzymes, ester hydrolases have great relevance for which the demand has not ceased to increase. However, the search for new substrate promiscuous ester hydrolases is not trivial since the mechanism behind this property is greatly influenced by the active site's structural and physicochemical characteristics. These characteristics must be computed from the 3D structure, which is rarely available, and expensive to measure, hence the need for a method that can predict promiscuity from a sequence alone. Here we report such a method called EP-pred, an ensemble binary classifier, that combines three machine learning algorithms: SVM, KNN, and a Linear model. EP-pred has been evaluated against the Lipase Engineering Database together with a hidden Markov approach leading to a final set of 10 sequences predicted to encode promiscuous esterases. Experimental results confirmed the validity of our method since all ten proteins were found to exhibit a broad substrate ambiguity.
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Ship detection; self-supervised learning; transfer learning; Sentinel 2
Online: 7 October 2021 (23:04:24 CEST)
Automatic ship detection provides an essential function towards maritime domain awareness for security or economic monitoring purposes. This work presents an approach for training a deep learning ship detector in Sentinel-2 multispectral images with few labeled examples. We design a network architecture for detecting ships with a backbone that can be pre-trained separately. By using Self Supervised Learning, an emerging unsupervised training procedure, we learn good features on Sentinel-2 images, without requiring labeling, to initialize our network’s backbone. The full network is then fine-tuned to learn to detect ships in challenging settings. We evaluated this approach versus pre-training on ImageNet and versus a classical image processing pipeline. We examined the impact of variations in the self-supervised learning step and we show that in the few-shot learning setting self-supervised pre-training achieves better results than ImageNet pre-training. When enough training data is available, our self-supervised approach is as good as ImageNet pre-training. We conclude that a better design of the self-supervised task and bigger non-annotated dataset sizes can lead to surpassing ImageNet pre-training performance without any annotation costs.
REVIEW | doi:10.20944/preprints202108.0238.v1
Subject: Medicine & Pharmacology, Other Keywords: self-supervised learning; medicine; healthcare; representation learning; unlabeled data
Online: 11 August 2021 (08:27:57 CEST)
Machine learning has become an increasingly ubiquitous technology, as big data continues to inform and influence everyday life and decision-making. Currently in healthcare, as well as in most other industries, the two most prevalent machine learning paradigms are supervised learning and transfer learning. Both practices rely on large-scale, manually annotated datasets to train increasingly complex models. However, the requirement of data to be manually labeled leaves an excess of unused, unlabeled data available in both public and private data repositories. Self-supervised learning (SSL) is a growing area of machine learning that has the ability to take advantage of unlabeled data. Contrary to other machine learning paradigms, SSL algorithms create artificial supervisory signals from unlabeled data and pretrain algorithms on these signals. The aim of this review is two-fold: firstly, we provide a formal definition of SSL, divide SSL algorithms into their four unique subsets, and review the state-of-the-art published in each of those subsets between the years of 2014-2020. Second, this work surveys recent SSL algorithms published in healthcare, in order to provide medical experts with a clearer picture of how they can integrate SSL into their research, with the objective of leveraging unlabeled data.
ARTICLE | doi:10.20944/preprints202104.0678.v1
Subject: Earth Sciences, Atmospheric Science Keywords: supervised machine learning; automated landscape mapping; digital elevation model
Online: 26 April 2021 (14:44:24 CEST)
Landscapes evolve due to climatic conditions, tectonic activity, geological features, biological activity, and sedimentary dynamics. These processes link geological processes at depth to surface features. Consequently, the study of landscapes can reveal essential information about the geochemical footprint of ore deposits at depth. Advances in satellite imaging and computing power have enabled the creation of large geospatial datasets, the sheer size of which necessitates automated processing. We describe a methodology to enable the automated mapping of landscape pattern domains using machine learning (ML) algorithms. From a freely available Digital Elevation Model, derived data, and sample landclass boundaries provided by domain experts, our algorithm produces a dense map of the model region in Western Australia. Both random forest and support vector machine classification achieve about 98% classification accuracy with a reasonable runtime of 48 minutes on a single core. We discuss computational resources and study the effect of grid resolution. Larger tiles result in a more contiguous map, while smaller tiles result in a more detailed, and at some point, noisy map. Diversity and distribution of landscapes mapped in this study support previous results. In addition, our results are consistent with the geological trends and main basement features in the region.
ARTICLE | doi:10.20944/preprints202212.0433.v1
Subject: Physical Sciences, General & Theoretical Physics Keywords: Optimal control; supervised learning; system characterization; two-level quantum systems
Online: 23 December 2022 (01:44:57 CET)
We investigate the extent to which a two-level quantum system subjected to an external time-dependent drive can be characterized by supervised learning. We apply this approach to the case of bang-bang control and the estimation of the offset and the final distance to a given target state. The estimate is global in the sense that no a priori knowledge is required on the parameters to be determined. Different neural network algorithms are tested on a series of data sets. We point out the limits of the estimation procedure with respect to the properties of the mapping to be interpolated. We discuss the physical relevance of the different results.
ARTICLE | doi:10.20944/preprints202106.0482.v3
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: COVID-19 Infodemic; Text Classification; TFIDF Features; Network Training modes; Supervised Learning; Misinformation; News Classification; False Publications; PubMed; Anomaly Detection
Online: 26 July 2021 (12:06:04 CEST)
The spread of the Coronavirus pandemic has been accompanied by an infodemic. The false information that is embedded in the infodemic affects people’s ability to access safety information and follow proper procedures to mitigate the risks. This research targets the falsehood part of the infodemic, which prominently proliferates in news articles and false medical publications. Here, we present NeoNet, a novel supervised machine learning text mining algorithm that analyzes the content of a document (a news article or a medical publication) and assigns a label to it. The algorithm is trained on TFIDF bigram features, which contribute to a network training model. The algorithm is tested on two different real-world datasets from the CBC news network and Covid-19 publications. In five different fold comparisons, the algorithm predicted the label of an article with a precision of 97-99%. When compared with prominent algorithms such as Neural Networks, SVM, and Random Forests, NeoNet surpassed them. The analysis highlighted the promise of NeoNet in detecting disputed online content that may contribute negatively to the COVID-19 pandemic.
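The TFIDF bigram features mentioned in this abstract can be sketched with the standard library alone. The sketch assumes a plain term-frequency times log(N / document frequency) weighting and whitespace tokenization. The abstract does not state NeoNet's exact weighting or preprocessing, and the example documents are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Split text on whitespace and return adjacent token pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def tfidf_bigrams(docs):
    """Return one {bigram: weight} dict per document, where
    weight = (count / total bigrams in doc) * log(N / document frequency)."""
    doc_counts = [Counter(bigrams(d)) for d in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each bigram.
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    features = []
    for counts in doc_counts:
        total = sum(counts.values())
        features.append({bg: (c / total) * math.log(n_docs / df[bg])
                         for bg, c in counts.items()})
    return features


# Hypothetical mini-corpus.
docs = [
    "masks reduce virus spread",
    "vaccine microchips control minds",
    "masks reduce infection risk",
]
feats = tfidf_bigrams(docs)
print(feats[0])
```

Bigrams shared across documents, such as ("masks", "reduce"), receive a lower weight than bigrams unique to one document, which is the property a downstream classifier would exploit.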
ARTICLE | doi:10.20944/preprints202210.0431.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Supervised machine learning; intrusion detection; data engineering; cybersecurity; Internet of Things.
Online: 27 October 2022 (10:57:09 CEST)
Internet of Things (IoT) devices and applications have rapidly expanded worldwide due to their benefits in improving the business environment, the industrial environment, and people's daily lives. However, IoT devices are not immune to malicious network traffic, which can have negative consequences and sabotage operating IoT devices. Developing a method for screening network traffic is therefore necessary to detect and classify malicious activity and mitigate its negative impacts. This research proposes a predictive machine learning model to detect and classify network activity in an IoT system. Specifically, our model distinguishes between normal and anomalous network activity. Furthermore, it classifies network traffic into five categories: normal, Mirai attack, denial of service (DoS) attack, Scan attack, and man-in-the-middle (MITM) attack. Five supervised learning models were implemented to characterize their performance in detecting and classifying network activities for IoT systems: shallow neural networks (SNN), decision trees (DT), bagging trees (BT), support vector machine (SVM), and k-nearest neighbor (kNN). The learning models were evaluated on a new and broad dataset for IoT attacks, the IoTID20 dataset. In addition, a deep feature engineering process was applied to the dataset to improve the accuracy of the learning models. Our experimental evaluation showed an accuracy of 100% for detection using all implemented models and an accuracy of 99.4%-99.9% for the classification process.
ARTICLE | doi:10.20944/preprints202203.0219.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: Artificial intelligence; Supervised Machine Learning; Kinematics; Head rotation test; Neck pain
Online: 15 March 2022 (14:30:51 CET)
Understanding neck pain is an important societal issue. Kinematic data from sensors may help to gain insight on the pathophysiological mechanisms associated with neck pain through a quantitative sensorimotor assessment of one patient. The objective of this study was to evaluate the potential usefulness of artificial intelligence with several Machine Learning (ML) algorithms in assessing neck sensorimotor performance. Angular velocity and acceleration measured by an inertial sensor placed on the forehead during the DidRen laser test in thirty-eight acute and subacute non-specific neck pain (ANSP) patients were compared to forty-two healthy control participants (HCP). Seven supervised ML algorithms were chosen for the predictions. The most informative kinematic features were computed using Sequential Feature Selection methods. The best performing algorithm is the Linear Support Vector Machine with an accuracy of 82% and Area Under Curve of 84%. The best discriminative kinematic feature between ANSP patients and HCP is the first quartile of head pitch angular velocity. This study has shown that supervised ML algorithms could be used to classify ANSP patients and identify discriminatory kinematic features potentially useful for the clinicians in the assessment and monitoring of the neck sensorimotor performance in ANSP patients.
ARTICLE | doi:10.20944/preprints201809.0346.v1
Subject: Engineering, Mechanical Engineering Keywords: SLM, Process Control, Semi-supervised Machine Learning, Randomised Singular Value Decomposition
Online: 18 September 2018 (11:21:58 CEST)
Risk-averse areas such as the medical, aerospace and energy sectors have been somewhat slow towards accepting and applying Additive Manufacturing (AM) in many of their value chains. This is partly because there are still significant uncertainties concerning the quality of AM builds. This paper introduces a machine learning algorithm for the automatic detection of faults in AM products. The approach is semi-supervised in that, during training, it is able to use data from both builds where the resulting components were certified and builds where the quality of the resulting components is unknown. This makes the approach cost efficient, particularly in scenarios where part certification is costly and time consuming. The study specifically analyses Selective Laser Melting (SLM) builds. Key features are extracted from large sets of photodiode data, obtained during the building of 49 tensile test bars. Ultimate tensile strength (UTS) tests were then used to categorise each bar as 'faulty' or 'acceptable'. A fully supervised approach identified faulty specimens with a 77% success rate while the semi-supervised approach was able to consistently achieve similar results, despite being trained on a fraction of the available certification data. The results show that semi-supervised learning is a promising approach for the automatic certification of AM builds that can be implemented at a fraction of the cost currently required.
ARTICLE | doi:10.20944/preprints202207.0261.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: Alcohol Detection; Smart Sensing; MQ-3 Alcohol Sensors; Supervised Learning; Neural Networks.
Online: 18 July 2022 (10:16:26 CEST)
According to investigations of accident risk, alcohol-impaired driving is one of the major causes of motor vehicle accidents. Preventing highly intoxicated persons from driving could potentially save many lives. This paper proposes a lightweight in-vehicle alcohol detection system that processes the data generated from six alcohol sensors (MQ-3 alcohol sensors) using an optimizable shallow neural network (O-SNN). The experimental evaluation results show a high-performance detection system with a detection accuracy of 99.8% and a very short inference delay of 2.22 µs. Hence, the proposed model can be efficiently deployed and used to detect in-vehicle alcohol with high accuracy and low inference overhead as part of the driver alcohol detection system for safety (DADSS), which aims at massive deployment of alcohol sensing systems that could potentially save thousands of lives annually.
ARTICLE | doi:10.20944/preprints202209.0025.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: object detection; semi-supervised learning; Mask R-CNN; floor-plan images; computer vision
Online: 1 September 2022 (15:16:43 CEST)
Research on object detection using semi-supervised methods has been growing in the past few years. We examine the intersection of these two areas for floor-plan objects, with the research objective of detecting objects more accurately with less labelled data. Floor-plan objects include different furniture items with multiple types of the same class, and this high inter-class similarity impacts the performance of prior methods. In this paper, we present a Mask R-CNN-based semi-supervised approach that provides pixel-to-pixel alignment to generate individual annotation masks for each class to mine the inter-class similarity. The semi-supervised approach has a student-teacher network that pulls information from the teacher network and feeds it to the student network. The teacher network uses unlabeled data to form pseudo-boxes, and the student network uses both unlabeled data with the pseudo-boxes and labelled data as ground truth for training. It learns representations of furniture items by combining labelled and unlabeled data. On the Mask R-CNN detector with a ResNet-101 backbone network, the proposed approach achieves mAP of 98.8%, 99.7%, and 99.8% with only 1%, 5% and 10% labelled data, respectively. Our experiment affirms the efficiency of the proposed approach, as it outperforms the fully supervised counterpart using only 10% of the labels.
ARTICLE | doi:10.20944/preprints202203.0085.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: object segmentation; LiDAR-camera fusion; autonomous driving; artificial intelligence; semi-supervised learning; iseAuto
Online: 4 March 2022 (21:43:06 CET)
Object segmentation is still considered a challenging problem in autonomous driving, particularly under real-world conditions. Following this line of research, this paper approaches the problem of object segmentation using LiDAR-camera fusion and semi-supervised learning implemented in a fully convolutional neural network. Our method is tested on real-world data acquired using our custom vehicle, the iseAuto shuttle. The data include all-weather scenarios, featuring night and rainy weather. In this work, we show that, with LiDAR-camera fusion, only a few annotated scenarios, and semi-supervised learning, it is possible to achieve robust performance on real-world data in a multi-class object segmentation problem. The performance of our algorithm is measured in terms of intersection over union, precision, recall and area-under-the-curve average precision. Our network achieves 82% IoU in vehicle detection in day fair scenarios and 64% IoU in vehicle segmentation in night rain scenarios.
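The evaluation metrics named in this abstract can be illustrated with a minimal sketch for a single class. The function name `segmentation_scores` and the toy masks are assumptions for illustration only; the authors' evaluation code is not described in the abstract.

```python
def segmentation_scores(pred, truth):
    """Return (iou, precision, recall) for two equally sized binary masks.

    Each mask is a list of rows holding 0/1 values for one class
    (for example, 1 = vehicle pixel, 0 = everything else).
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # predicted and labelled as the class
            elif p:
                fp += 1      # predicted, but not labelled
            elif t:
                fn += 1      # labelled, but missed by the prediction
    union = tp + fp + fn
    iou = tp / union if union else 1.0           # two empty masks agree fully
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return iou, precision, recall


# Hypothetical 2x3 masks, not data from the paper.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
iou, precision, recall = segmentation_scores(pred, truth)
```

In a multi-class setting, the same computation would typically be repeated per class and then averaged.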
ARTICLE | doi:10.20944/preprints202105.0677.v1
Subject: Medicine & Pharmacology, Allergology Keywords: heart failure, phenotype, left ventricular ejection fraction, primary care, artificial intelligence, supervised analysis
Online: 27 May 2021 (14:08:53 CEST)
Artificial intelligence is creating a paradigm shift in health care, and phenotyping patients through clustering techniques is one of the areas of interest. Objective: To develop a predictive model to classify heart failure (HF) patients according to their left ventricular ejection fraction (LVEF), using data available in Electronic Health Records (EHR). Subjects and methods: 2854 subjects more than 25 years old with a diagnosis of HF and LVEF measured by echocardiography were selected to develop an algorithm to predict patients with reduced EF using supervised analysis. The performance of the developed algorithm was tested in heart failure patients from Primary Care. To select the most influential variables, the LASSO algorithm was used, and to tackle the issue of one class exceeding the other by a large proportion we used the Synthetic Minority Oversampling Technique (SMOTE). Finally, Random Forest (RF) and XGBoost models were constructed. Results: The full XGBoost model obtained the highest accuracy, a high negative predictive value and the highest positive predictive value. Gender, age, unstable angina, atrial fibrillation and acute myocardial infarct are the variables that most influence the EF value. When the model was applied to the EHR data set, with a total of 25594 patients with an ICD code of HF and no regular follow-up in Cardiology clinics, 6170 (21.1%) were identified as belonging to the reduced EF group. Conclusion: The algorithm obtained is able to rescue a number of HF patients with reduced ejection fraction who can benefit from a protocol with a strong recommendation to succeed. Furthermore, the methodology can be used for studies with data extracted from Electronic Health Records.
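The class-imbalance step can be illustrated with a minimal sketch of the interpolation idea behind SMOTE: a synthetic minority sample is placed on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like`, the neighbour count `k`, the seed, and the toy points are assumptions for illustration; the abstract does not state the SMOTE settings used in the study.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea of SMOTE, in simplified form)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority samples, ordered by squared distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-feature minority class, not patient data.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it stays within the range spanned by the minority class.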
ARTICLE | doi:10.20944/preprints202012.0058.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Deep Learning; LSTM Autoencoder; Supervised Learning, Hydraulic Test Rig; Sensor Faults; Component Faults
Online: 2 December 2020 (11:08:40 CET)
Anomalies in hydraulic machinery may lead to massive system shutdowns, jeopardizing the safety of the machinery and its surrounding human operator(s) and environment, and carrying severe economic implications from the faults and their associated damage. Hydraulic systems are mostly placed in harsh environments, where they are consistently vulnerable to many faults. Hence, not only are the machines and their components prone to anomalies, but so are the sensors attached to them, which monitor and report their health and behavioral changes. In this work, a comprehensive applied analysis of anomalies in hydraulic systems, extracted from a hydraulic test rig, is carried out. First, we provide a combination of a new architecture of LSTM autoencoders and supervised machine and deep learning methodologies to perform two separate stages of fault detection and diagnosis. The two phases are the detection phase, which uses the LSTM autoencoder, followed by the fault diagnosis phase, represented by the classification schema. This framework is applied to both component and sensor faults in hydraulic systems, in the form of two in-depth applied experiments. Moreover, a thorough literature review of the past decade's related work is conducted for each of the two stages separately.
ARTICLE | doi:10.20944/preprints201812.0114.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: directional encoding mask; selective attention network; supervised learning; horizontal and vertical text recognition
Online: 11 December 2018 (07:24:04 CET)
Recent state-of-the-art scene text recognition methods have primarily focused on horizontal text in images. However, in several Asian countries, including China, large amounts of text in signs, books, and TV commercials are vertically directed. Because the horizontal and vertical texts exhibit different characteristics, developing an algorithm that can simultaneously recognize both types of text in real environments is necessary. To address this problem, we adopted the direction encoding mask (DEM) and selective attention network (SAN) methods based on supervised learning. DEM contains directional information to compensate in cases that lack text direction; therefore, our network is trained using this information to handle the vertical text. The SAN method is designed to work individually for both types of text. To train the network to recognize both types of text and to evaluate the effectiveness of the designed model, we prepared a new synthetic vertical text dataset and collected an actual vertical text dataset (VTD142) from the Web. Using these datasets, we proved that our proposed model can accurately recognize both vertical and horizontal text and can achieve state-of-the-art results in experiments using benchmark datasets, including Street View Text (SVT), IIIT-5k, and ICDAR. Although our model is relatively simple compared to its predecessors, it maintains the accuracy and is trained in an end-to-end manner.
ARTICLE | doi:10.20944/preprints202109.0010.v1
Subject: Medicine & Pharmacology, Anesthesiology Keywords: spinal cord stimulation; screening trial; infection; supervised learning; machine learning; predictive modeling; patient outcome
Online: 1 September 2021 (12:05:18 CEST)
Persistent Pain after Spinal Surgery can be successfully addressed by Spinal Cord Stimulation (SCS). International guidelines strongly recommend that a lead trial be performed before any permanent implantation. Recent clinical data highlight some major limitations of this approach. First, it appears that patient outcomes, with or without a lead trial, are similar. In contrast, during trialing, infection rate drops drastically within time and can compromise the therapy. Using composite pain assessment experience and previous research, we hypothesized that machine learning models could be robust screening tools and reliable predictors of long-term SCS efficacy. We developed several algorithms, including logistic regression, Regularized Logistic Regression (RLR), naive Bayes classifier, artificial neural networks, random forest and gradient boosted trees, to test this hypothesis and to perform internal and external validations, the objective being to compare model predictions with lead trial results using a 1-year composite outcome from 103 patients. While almost all models demonstrated superiority over lead trialing, the RLR model appears to represent the best compromise between complexity and interpretability in the prediction of SCS efficacy. These results underscore the need to use AI-based predictive medicine, as a synergistic mathematical approach, aimed at helping implanters to optimize their clinical choices in daily practice.
Subject: Medicine & Pharmacology, Allergology Keywords: White matter lesions; white matter hyperintensities; supervised segmentation; unsupervised segmentation; deep learning; FLAIR hyperintensities
Online: 20 November 2020 (13:44:46 CET)
Background: White matter hyperintensities (WMH), of presumed vascular origin, are visible and quantifiable neuroradiological markers of brain parenchymal change. These changes may range from damage secondary to inflammation and other neurological conditions, through to healthy ageing. Fully automatic WMH quantification methods are promising, but still, traditional semi-automatic methods seem to be preferred in clinical research. We systematically reviewed the literature for fully automatic methods developed in the last five years, to assess what are considered state-of-the-art techniques, as well as trends in the analysis of WMH of presumed vascular origin. Method: We registered the systematic review protocol with the International Prospective Register of Systematic Reviews (PROSPERO), registration number CRD42019132200. We conducted the search for fully automatic methods developed from 2015 to July 2020 on Medline, ScienceDirect, IEEE Xplore, and Web of Science. We assessed risk of bias and applicability of the studies using QUADAS 2. Results: The search yielded 2327 papers after removing 104 duplicates. After screening titles, abstracts and full text, 37 were selected for detailed analysis. Of these, 16 proposed a supervised segmentation method, 10 proposed an unsupervised segmentation method, and 11 proposed a deep learning segmentation method. Average DSC values ranged from 0.538 to 0.91, with the highest value obtained from an unsupervised segmentation method. Only four studies validated their method in longitudinal samples, and eight performed an additional validation using clinical parameters. Only 8/37 studies made their method available in public repositories. Conclusions: We found no evidence that favours deep learning methods over the more established k-NN, linear regression and unsupervised methods in this task. Data and code availability, bias in study design and ground truth generation influence the wider validation and applicability of these methods in clinical research.
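For readers unfamiliar with the DSC values quoted in the results, the Dice similarity coefficient between a predicted and a reference segmentation is twice the overlap divided by the sum of the two segment sizes. The sketch below assumes flattened binary masks; the function name `dice` and the toy voxels are illustrative and not taken from any reviewed study.

```python
def dice(pred, truth):
    """Dice similarity coefficient for two flat binary masks of equal length:
    2 * |pred AND truth| / (|pred| + |truth|)."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 2 * overlap / size_sum if size_sum else 1.0


# Hypothetical flattened masks (1 = voxel marked as WMH).
pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
score = dice(pred, truth)
```

A value of 1 means the two segmentations coincide exactly, and 0 means they share no voxels.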
ARTICLE | doi:10.20944/preprints202008.0645.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Speech Emotion Recognition; Emotion AI; Self-Supervised Learning; Transfer Learning; Low Resource Training; wav2vec
Online: 28 August 2020 (15:05:37 CEST)
We propose a novel transfer learning method for speech emotion recognition that allows us to obtain promising results when only a small amount of training data is available. With as few as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data. Our method leverages knowledge contained in pre-trained speech representations extracted from models trained on a more general self-supervised task which doesn’t require human annotations, such as the wav2vec model. We provide detailed insights into the benefits of our approach by varying the training data size, which can help labeling teams to work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a well-benchmarked dataset among the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning. We align acoustic pre-trained representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when combining both modalities and scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state-of-the-art of 73.9% unweighted accuracy (UA).
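Unweighted accuracy (UA) is commonly computed in the SER literature as the mean of the per-class recalls, so that rare emotion classes count as much as frequent ones. The sketch below follows that common definition; the function name and the toy labels are assumptions, and the abstract does not spell out the authors' exact evaluation protocol.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall: every class contributes equally,
    regardless of how many examples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical labels: four "angry" utterances, two "sad" ones.
y_true = ["angry", "angry", "angry", "angry", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "sad", "happy"]
ua = unweighted_accuracy(y_true, y_pred)
# Plain accuracy for comparison: the share of correct predictions overall.
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the plain accuracy is 5/6 while UA is 0.75, because the single error falls in the smaller class.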
ARTICLE | doi:10.20944/preprints201808.0154.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: deep learning; multiple instance learning; weakly supervised learning; demography; socioeconomic analysis; google street view
Online: 24 October 2018 (08:53:26 CEST)
(1) Background: Evidence-based policymaking requires data about the local population's socioeconomic status (SES) at a detailed geographical level; however, such information is often not available or is too expensive to acquire. Researchers have proposed solutions to estimate SES indicators by analyzing Google Street View images; however, these methods are also resource-intensive, since they require large volumes of manually labeled training data. (2) Methods: We propose a methodology for automatically computing surrogate variables of SES indicators using street images of parked cars and deep multiple instance learning. Our approach does not require any manually created labels, apart from data already available from statistical authorities, while the entire pipeline for image acquisition, parked car detection, car classification, and surrogate variable computation is fully automated. The proposed surrogate variables are then used in linear regression models to estimate the target SES indicators. (3) Results: We implement and evaluate a model based on the proposed surrogate variable at 30 municipalities of varying SES in Greece. Our model has $R^2=0.76$ and a correlation coefficient of $0.874$ with the true unemployment rate, while it achieves a mean absolute percentage error of $0.089$ and a mean absolute error of $1.87$ on a held-out test set. Similar results are also obtained for other socioeconomic indicators, related to education level and occupational prestige. (4) Conclusions: The proposed methodology can be used to estimate SES indicators at the local level automatically, using images of parked cars detected via Google Street View, without the need for any manual labeling effort.
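The three error measures reported in the results can be computed as in the following sketch. It assumes that the mean absolute percentage error is expressed as a fraction (consistent with the reported value of 0.089) and uses made-up numbers; the function name `regression_metrics` is hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Return (r2, mae, mape) for paired true and predicted values.

    r2   = 1 - SS_res / SS_tot
    mae  = mean absolute error, in the units of y
    mape = mean absolute percentage error, as a fraction (0.1 means 10%)
    Assumes y_true is not constant and contains no zeros.
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mape = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / n
    return 1 - ss_res / ss_tot, mae, mape


# Hypothetical unemployment rates (%), not the study's data.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 30.0]
r2, mae, mape = regression_metrics(y_true, y_pred)
```

Note that MAE is in the same units as the indicator, while MAPE is scale-free, which is why the abstract reports both.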
ARTICLE | doi:10.20944/preprints202201.0202.v1
Subject: Earth Sciences, Geoinformatics Keywords: crop detection; Sentinel 1; Sentinel 2; supervised classification; unsupervised classification; time series; agriculture; food security
Online: 14 January 2022 (11:18:59 CET)
Satellite crop detection technologies focus on detecting different types of crops in the field at an early stage, before harvesting. Crop detection is usually done on a time series of satellite data by classifying the desired fields. Currently, data obtained from Remote Sensing (RS) are used to solve tasks related to the identification of the type of agricultural crops, and modern technologies using AI methods are desired in the postprocessing part. In this challenge, Sentinel-1 and Sentinel-2 time series data were used due to their periodic availability. Our focus was to develop a methodology for classification of time series of Sentinel-2 and Sentinel-1 data and to compare how the accuracy of classification can be increased, but also how to guarantee availability of data. We analyse the phenology of single crops, and on the basis of this analysis we started to provide crop classification. The original crop classifications were made from Enhanced Vegetation Index (EVI) layers made from Sentinel-2 time-series data and then we added also . To increase accuracy, we also integrate parcel borders into the process and provide classification of fields.
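The EVI layers mentioned above are usually derived per pixel from the near-infrared, red and blue surface reflectances with the standard coefficients G = 2.5, C1 = 6, C2 = 7.5 and L = 1. The sketch below uses that standard formula; that the authors used exactly these coefficients, and the mapping to Sentinel-2 bands B8, B4 and B2, are assumptions, and the reflectance values are invented.

```python
def evi(nir, red, blue):
    """Enhanced Vegetation Index from surface reflectances scaled to 0..1.

    EVI = 2.5 * (NIR - RED) / (NIR + 6 * RED - 7.5 * BLUE + 1)
    """
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


# Hypothetical reflectances of one parcel on three dates
# (for Sentinel-2, typically bands B8 = NIR, B4 = red, B2 = blue).
series = [
    {"nir": 0.25, "red": 0.15, "blue": 0.08},  # bare soil, early season
    {"nir": 0.50, "red": 0.10, "blue": 0.05},  # canopy developing
    {"nir": 0.60, "red": 0.05, "blue": 0.03},  # peak greenness
]
evi_series = [evi(**obs) for obs in series]
```

A per-parcel EVI time series of this kind is the sort of phenology signal a classifier could then use to separate crop types.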
ARTICLE | doi:10.20944/preprints202112.0025.v2
Subject: Engineering, Other Keywords: Brain segmentation; Coarse-to-fine; Generative Adversarial Network; Semi-supervised learning; Multi-stage method
Online: 6 December 2021 (14:33:23 CET)
Image segmentation is a new, challenging problem in medical applications. The use of medical imaging has become an integral part of research, as it allows us to see inside the human body without surgical intervention. Many researchers have studied brain segmentation. One-stage methods are used to segment the brain tissues. In this paper, we propose a multi-stage generative adversarial network to solve the problem of information loss in the one-stage approach. We use a coarse-to-fine strategy to improve brain segmentation using multi-stage generative adversarial networks (GAN). In the first stage, our model generates a coarse outline for (i) background and (ii) brain tissues. Then, in the second stage, the model generates outlines for (i) white matter (WM), (ii) gray matter (GM) and (iii) cerebrospinal fluid (CSF). A good result can be achieved by fusing the coarse outline and the refined outline. We conclude that our model is more efficient and accurate in practice for both infant and adult brain segmentation. Moreover, we observe that the multi-stage model is faster than prior models. To be more specific, the main goal of the multi-stage model is to assess performance in a few-shot learning case where only a few labeled data are available. For medical images, the proposed model can work in a wide range of image segmentation tasks where convolutional neural networks and one-stage methods have failed.
ARTICLE | doi:10.20944/preprints202002.0019.v1
Subject: Life Sciences, Endocrinology & Metabolomics Keywords: metabolomics; LC-MS; mass spectrometry; metabolic profiling; computational; statistical; unsupervised learning; supervised learning; pathway analysis
Online: 3 February 2020 (05:54:14 CET)
Metabolomics analysis generates vast arrays of data, necessitating comprehensive workflows involving expertise in analytics, biochemistry and bioinformatics, in order to provide coherent and high-quality data that enables discovery of robust and biologically significant metabolic findings. In this protocol article, we introduce NoTaMe, an analytical workflow for non-targeted metabolic profiling approaches utilizing liquid chromatography–mass spectrometry analysis. We provide an overview of lab protocols and statistical methods that we commonly practice for the analysis of nutritional metabolomics data. The paper is divided into three main sections: the first and second sections introducing the background and the study designs available for metabolomics research, and the third section describing in detail the steps of the main methods and protocols used to produce, preprocess and statistically analyze metabolomics data, and finally to identify and interpret the compounds that have emerged as interesting.
ARTICLE | doi:10.20944/preprints201704.0114.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: indoor localization; crowdsourcing; received signal strength; graph-based semi-supervised learning; linear regression; compressed sensing.
Online: 18 April 2017 (12:33:47 CEST)
Indoor positioning based on the received signal strength (RSS) of the WiFi signal has become the most popular solution for indoor localization. In order to realize the rapid deployment of indoor localization systems, solutions based on crowdsourcing have been proposed. However, compared to conventional methods, crowdsourced RSS values are more erroneous and can result in large localization errors. To mitigate the negative effect of the erroneous measurements, a graph-based semi-supervised learning (G-SSL) method is used to exploit the correlation between the RSS values at nearby locations to estimate an optimal RSS value at each location. Before using the G-SSL method, the Linear Regression (LR) algorithm is proposed to solve the device diversity problem in crowdsourcing systems. Since the spatial distribution of the APs is sparse, the Compressed Sensing (CS) method is applied to precisely estimate the location of the APs. Based on the location of the APs and a simple signal propagation model, the RSS difference between different locations is calculated and used as an additional constraint to improve the performance of G-SSL. Furthermore, to exploit the sparsity of the weights used in the G-SSL, we use the CS method to reconstruct these weights more accurately and further improve the performance of the G-SSL. Experimental results show improvements in the smoothness of the radio map and in the localization accuracy.
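One plausible reading of the linear regression step is that RSS readings from a contributor's device are mapped onto the scale of a reference device with a fitted line. The sketch below illustrates that idea with ordinary least squares; the pairing of readings, the function name `fit_line`, and the numbers are all assumptions, since the abstract does not give the exact formulation.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a * x + b for paired readings.
    Assumes x contains at least two distinct values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical RSS values (dBm) measured at the same spots by two devices.
device_rss = [-70.0, -60.0, -50.0]
reference_rss = [-75.0, -64.0, -53.0]
a, b = fit_line(device_rss, reference_rss)

# Map a new crowdsourced reading onto the reference device's scale.
calibrated = a * -65.0 + b
```

After such a calibration, readings from heterogeneous devices become comparable before any graph-based smoothing is applied.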
REVIEW | doi:10.20944/preprints201708.0003.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: stylometry; author identification; author verification; author profiling; stylistic inconsistency; text analysis; supervised learning; unsupervised learning; classification; forensics
Online: 2 August 2017 (12:38:17 CEST)
Electronic text stylometry is a collection of forensics methods that analyze the writing styles of input electronic texts in order to extract information about their authors. Such extracted information could be the identity of the authors, or aspects of the authors, such as their gender, age group, ethnicity, etc. This survey paper presents the following contributions: 1) A description of all stylometry problems in probability terms, under a unified notation. To the best of our knowledge, this is the most comprehensive definition to date. 2) A survey of key methods, with particular attention to data representation (or feature extraction) methods. 3) An evaluation of 23,760 feature extraction methods, which is the most comprehensive evaluation of feature extraction methods in the literature of stylometry to date. The importance of this evaluation is twofold. First, it identifies the relative effectiveness of the features (since, currently, many are not evaluated jointly; e.g. syntactic n-grams are not evaluated against k-skip n-grams, and so forth). Second, thanks to our generalizations, we could evaluate novel grams, such as what we name compound grams. 4) The release of our associated Python feature extraction library, namely Fextractor. Essentially, the library generalizes all existing n-gram based feature extraction methods under the "at least l-frequent, dir-directed, k-skipped n-grams", and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as POS tags, as well as lower-level ones, such as distribution of function words, word shapes, etc. This makes the library, by far, the most extensive in this domain to date. 5) The construction, evaluation, and release of the first dataset for Emirati social media text. This evaluation represents the first evaluation of author identification against Emirati social media texts. Interestingly, we find that, when using our models and feature extraction library (Fextractor), authors could be identified significantly more accurately than what is reported with similarly sized datasets. The dataset also contains sub-datasets that represent other languages (Dutch, English, Greek and Spanish), and our findings are consistent across them.
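To make the k-skip n-gram features mentioned above concrete, the sketch below enumerates them under the common definition in which an n-gram may skip at most k tokens in total. It does not reproduce the survey's generalized "l-frequent, dir-directed" formulation and is not the Fextractor API; the function name `skip_ngrams` and the example sentence are illustrative assumptions.

```python
from itertools import combinations


def skip_ngrams(tokens, n, k):
    """All n-grams that keep token order and skip at most k tokens in total.

    With k = 0 this reduces to ordinary contiguous n-grams.
    Enumerates every index combination, so it is only meant for short inputs.
    """
    grams = []
    for idx in combinations(range(len(tokens)), n):
        # Tokens skipped = span covered by the gram minus the gram length.
        skipped = idx[-1] - idx[0] - (n - 1)
        if skipped <= k:
            grams.append(tuple(tokens[i] for i in idx))
    return grams


tokens = "insurgents killed in ongoing fighting".split()
plain_bigrams = skip_ngrams(tokens, n=2, k=0)
skip2_bigrams = skip_ngrams(tokens, n=2, k=2)
skip2_trigrams = skip_ngrams(tokens, n=3, k=2)
```

Allowing skips greatly enlarges the feature space compared with contiguous n-grams, which is one reason joint evaluations of such feature families are expensive.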
ARTICLE | doi:10.20944/preprints202009.0729.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Data Envelopment Analysis; Machine learning; Optimization; Parametric and non-parametric methods; Supervised and unsupervised models; CVS model
Online: 30 September 2020 (08:19:51 CEST)
The main purpose of this paper is to propose a novel optimization model with a new machine learning approach in the first section to achieve the best results in financial institutions in the second section. Since the constancy of efficacy derived from parametric and non-parametric methods is not significant, this paper provides a scientific assessment in the optimization section and proposes a novel combined parametric and non-parametric model, which is a new experiment in the literature. A scientific assessment of banks is introduced, based on a combination of the efficiency measurement methods of CCR (Charnes, Cooper and Rhodes model) or CRS (Constant Return to Scale) and BCC (Banker, Charnes, and Cooper model) or VRS (Variable Return to Scale) in Data Envelopment Analysis (DEA), as well as the Stochastic Frontier Approach (SFA), for 65 banks from February to July 2020. For analyzing the performance of the parametric and non-parametric approaches, we have considered linear regression and the Unreplicated Linear Functional Relationship (ULFR). In the machine learning section, a novel four-layer data mining filtering pre-process is applied to selected supervised classification and unsupervised clustering algorithms to increase accuracy and to remove unrelated attributes and data. For the four kinds of preprocessing approaches, namely unsupervised attributes, supervised attributes, supervised instances, and unsupervised instances, we have chosen the discretization, attribute selection, stratified remove folds, and resample filters, respectively. Based on the nature of the suggested financial institution dataset and attributes, the most appropriate preprocessing filter in each layer to achieve the highest performance is suggested. Finally, the superior bank, the best-performing model, and the most accurate algorithm are introduced. The results indicate that bank number 56 is the superior bank. Among the proposed techniques, the novel recommended CVS, compared with the CCR-BCC and SFA models, has a more positive correlation with profit risk and shows higher coefficient of determination values. The Sequential Minimal Optimization (SMO) algorithm achieves the highest accuracy in all four suggested filtering layers.
ARTICLE | doi:10.20944/preprints201902.0233.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: deep neural network architectures; supervised learning; unsupervised learning; testing neural networks; applications of deep learning; evolutionary computation
Online: 26 February 2019 (04:02:00 CET)
Deep learning has taken over - both in problems beyond the realm of traditional, hand-crafted machine learning paradigms as well as in capturing the imagination of the practitioner sitting on top of petabytes of data. While the public perception about the efficacy of deep neural architectures in complex pattern recognition tasks grows, sequentially up-to-date primers on the current state of affairs must follow. In this review, we seek to present a refresher of the many different stacked, connectionist networks that make up the deep learning architectures followed by automatic architecture optimization protocols using multi-agent approaches. Further, since guaranteeing system uptime is fast becoming an indispensable asset across multiple industrial modalities, we include an investigative section on testing neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where deep learning has emerged as a game-changing technology - be it anomalous behavior detection in financial applications or financial time-series forecasting, predictive and prescriptive analytics, medical imaging, natural language processing or power systems research. The thrust of this review is on outlining emerging areas of application-oriented research within the deep learning community as well as to provide a handy reference to researchers seeking to embrace deep learning in their work for what it is: statistical pattern recognizers with unparalleled hierarchical structure learning capacity with the ability to scale with information.
ARTICLE | doi:10.20944/preprints201808.0269.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: social sensing; supervised learning; statistical methods; social networks; twitter; tweets; natural disaster; random forest, kernel density estimation
Online: 15 August 2018 (11:34:43 CEST)
In recent years, online social networks have received considerable attention in spatial modelling fields, given the critical information that can be extracted from them about events in real time; one of the most latent issues is that regarding various natural disasters such as earthquakes. Although data with embedded geographic information provided by GPS can sometimes be retrieved from these social networks, in many cases this is not possible. An alternative solution is to reconstruct specific locations using probabilistic language models, more specifically those based on Named Entity Recognition (NER), which extracts names from a user’s description of an event occurring in a specific place (e.g., a collapsed building on a specific avenue). In this work, we present a methodology for using Twitter as a social sensor system for disasters. The proposed methodology scores NER locations with a kernel density estimation function for different subtopics originating from a natural disaster and maps them into a geographic space. The methodology is evaluated with tweets related to the 2017 earthquake in Mexico.
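The scoring idea can be sketched with a two-dimensional Gaussian kernel density estimate: a candidate location receives a high score when many geocoded mentions lie close to it. The isotropic Gaussian kernel, the bandwidth, the treatment of coordinates as planar, and all names and points below are assumptions for illustration; the abstract does not specify the kernel or bandwidth used.

```python
import math


def kde_score(point, samples, bandwidth):
    """2-D Gaussian kernel density estimate at `point`.

    `samples` are (x, y) locations recovered from text; coordinates are
    treated as planar, which is only reasonable over small areas.
    """
    h2 = bandwidth ** 2
    total = 0.0
    for sx, sy in samples:
        d2 = (point[0] - sx) ** 2 + (point[1] - sy) ** 2
        total += math.exp(-d2 / (2.0 * h2))
    # Normalise so the estimate integrates to one over the plane.
    return total / (len(samples) * 2.0 * math.pi * h2)


# Hypothetical projected coordinates (km) of locations named in tweets.
mentions = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (5.0, 5.0)]
near_cluster = kde_score((0.0, 0.1), mentions, bandwidth=0.5)
far_away = kde_score((10.0, -10.0), mentions, bandwidth=0.5)
```

Computing one such surface per subtopic would give a separate density map for each kind of report.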
ARTICLE | doi:10.20944/preprints201806.0282.v1
Subject: Earth Sciences, Geoinformatics Keywords: land-use/land-cover; multi-decadal change analysis; irrigation ponds; textural features; supervised classification; multi-source data
Online: 18 June 2018 (16:40:31 CEST)
A multi-decadal change analysis of the irrigation ponds in Taoyuan, Taiwan was conducted by using multi-source data including digitized ancient maps, declassified single-band CORONA satellite images, and multispectral SPOT images. Supervised LULC classifications were conducted using four textural features derived from the single-band CORONA images and spectral features derived from SPOT images. Post-classification analysis revealed that the number of irrigation ponds in the study area decreased during the post-World War II farmland consolidation period (1945 – 1965) and the subsequent industrialization period (1970 – 2000). However, efforts on restoration of irrigation ponds in recent years have resulted in gradual increases in the number (9%) and total area (12%) of irrigation ponds in the study area.
Subject: Keywords: Textual data distributions; supervised learning; unsupervised learning; Kullback-Leibler divergence; sentiment; textual analytics; text generation; vaccine; stock market
Online: 17 June 2021 (10:03:41 CEST)
Efficient textual data distributions (TDD) alignment and generation are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study focuses on addressing a segment of the broader problem described above by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment and (ii) sentiment alignment. Furthermore, we use multiple text generation methods, including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally, we develop a unique process-driven variation of the Kullback-Leibler divergence (KLD) application to TDD, named KL Textual Distributions Contrasts (KL-TDC), to identify the alignment of machine-generated textual corpora with naturally occurring textual corpora. This study thus identifies a unique approach for generating and validating TDD by topic and sentiment, which can be used to help address sparse data problems and other research, practice and classroom situations in need of artificially generated topic- or sentiment-aligned textual data.
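As background for the divergence-based comparison, the sketch below computes the plain Kullback-Leibler divergence between two smoothed unigram distributions. It is not the authors' KL-TDC procedure, whose details are not given in the abstract; the add-one smoothing, the function names, and the toy sentences are assumptions.

```python
import math
from collections import Counter


def unigram_dist(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary,
    so that no word has zero probability."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Hypothetical "natural" and "generated" snippets.
natural = "the vaccine is safe and the vaccine works".split()
generated = "the vaccine is safe and the market fell".split()
vocab = sorted(set(natural) | set(generated))
p = unigram_dist(natural, vocab)
q = unigram_dist(generated, vocab)
divergence = kl_divergence(p, q)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric, so the direction of comparison matters.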
ARTICLE | doi:10.20944/preprints202203.0093.v1
Subject: Life Sciences, Biochemistry Keywords: 6-hydroxydopamine; rotenone; in vitro neurotoxicity; mitochondrial dysfunction; exploratory data analysis; applied computational statistics; unsupervised and supervised machine learning
Online: 7 March 2022 (09:16:28 CET)
With the increase in life expectancy and consequent aging of the world’s population, the prevalence of many neurodegenerative diseases is increasing, without concomitant improvement in diagnostics and therapeutics. These diseases share neuropathological hallmarks, including mitochondrial dysfunction. In fact, as mitochondrial alterations appear prior to neuronal cell death at an early phase of the disease onset, the study and modulation of mitochondrial alterations emerge as promising strategies to predict and prevent neurotoxicity and neuronal cell death before the onset of cell viability alterations. In this work, differentiated SH-SY5Y cells were treated with the mitochondrial-targeted neurotoxicants 6-hydroxydopamine and rotenone. These compounds were used at different concentrations and for different time points to understand the similarities and differences in their mechanisms of action. To accomplish this, data on mitochondrial parameters were acquired and analyzed using unsupervised (hierarchical clustering) and supervised (decision tree) machine learning methods. Both biochemical and computational analyses resulted in an evident distinction between the neurotoxic effects of 6-hydroxydopamine and rotenone, specifically for the highest concentrations of both compounds.
ARTICLE | doi:10.20944/preprints201905.0382.v1
Subject: Engineering, Other Keywords: supervised machine learning; flood inundation mapping; high-resolution; synthetic aperture radar; height above nearest drainage; sentinel-1; inundated vegetation
Online: 31 May 2019 (08:48:14 CEST)
Floods are one of the most widespread, frequent, and devastating natural disasters, and they continue to increase in frequency and intensity. Remote sensing, specifically synthetic aperture radar (SAR), has been widely used to detect surface water inundation and to provide retrospective and near-real-time (NRT) information due to its high spatial resolution, self-illumination, and low atmospheric attenuation. However, the efficacy of flood inundation mapping with SAR is susceptible to reflections and scattering from a variety of factors, including dense vegetation and urban areas. In this study, the topographic dataset height above nearest drainage (HAND) was investigated as a potential supplement to Sentinel-1A C-Band SAR, along with supervised machine learning, to improve the detection of inundation in heterogeneous areas. Three machine learning classifiers were trained on two sets of features, SAR only (VV and VH) and SAR plus HAND (VV, VH, and HAND), to map inundated areas. Three study sites along the Neuse River in North Carolina, USA, during the record flood of Hurricane Matthew in October 2016 were selected. The binary classification analysis (inundated as positive vs. non-inundated as negative) revealed significant improvements in several metrics when incorporating HAND, including classification accuracy (ACC) (+37.1%), true positive rate (TPR) (+51.2%), and negative predictive value (NPV) (+23.7%). A marginal improvement of +1.4% was seen for positive predictive value (PPV), but true negative rate (TNR) fell by 15.1%. By incorporating HAND, a significant number of areas with high SAR backscatter but low HAND values were detected as inundated, which increased true positives. This in turn also increased the false positives detected, but to a lesser extent, as evident in the metrics. This study demonstrates that HAND could be considered a valuable feature to enhance SAR flood inundation mapping, especially in areas with heterogeneous land cover and dense vegetation that interferes with SAR.
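The five metrics reported above all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions, with inundated pixels as the positive class. The pixel counts are invented for illustration and are not the study's data; they were chosen only to mirror the qualitative pattern described, where an added feature raises TPR, NPV, and ACC while lowering TNR.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # classification accuracy
        "TPR": tp / (tp + fn),                   # true positive rate
        "TNR": tn / (tn + fp),                   # true negative rate
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
    }


# Hypothetical counts for 100 inundated and 100 non-inundated pixels.
sar_only = binary_metrics(tp=40, fp=10, tn=90, fn=60)
sar_hand = binary_metrics(tp=90, fp=20, tn=80, fn=10)

# Change in each metric, in percentage points, after adding the extra feature.
change = {name: 100 * (sar_hand[name] - sar_only[name]) for name in sar_only}
```

In this toy case more true positives are recovered at the cost of some additional false positives, so TPR rises sharply while TNR falls, which is the same trade-off the abstract attributes to HAND.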
ARTICLE | doi:10.20944/preprints202010.0436.v1
Subject: Keywords: Naïve Bayes Classification; Eulers Strength Formula; Cricket Prediction; Supervised Learning; KNIME Tool; Cricket prediction; sports analytics; multivariate regression; neural network
Online: 21 October 2020 (12:34:00 CEST)
In cricket, the Twenty20 (T20) format is the most watched and loved, and the winner of a match often cannot be guessed until the last ball of the last over. The Indian Premier League (IPL) started in India in 2008 and is now the most popular T20 league in the world, so we decided to develop a machine learning model for predicting the outcome of its matches. Winning a cricket match depends on many key factors, such as home ground advantage, past performances on that ground, records at the same venue, the overall experience of the players, the record against a particular opposition, and the current form of both the team and the individual players. This paper outlines the key factors that affect the result of a cricket match and the regression model that best fits this data and gives the best predictions. Cricket is a mainstream, widely played sport across India with the most noteworthy fan base, and the IPL follows the Twenty20 format, which is very unpredictable. The IPL match predictor is an ML-based prediction approach in which the model is trained on data sets and previous statistics covering all important factors, such as toss, home ground, captains, favorite players, opposition battle, and previous stats, with each factor assigned a different strength with the help of the KNIME tool, a Naive Bayes network, and Euler's strength calculation formula.
ARTICLE | doi:10.20944/preprints202007.0735.v1
Subject: Life Sciences, Genetics Keywords: Variant of Unknown Significance (VUS); Single-Nucleotide Variant (SNV); Variant Effect Prediction (VEP); Stacked Ensemble of Supervised Deep Learners (SESDL); Next Generation Sequencing (NGS); Alternative Allele Frequency (AAF).
Online: 31 July 2020 (06:13:53 CEST)
Pathogenicity is unknown for the majority of human gene variants. In silico approaches can be used to prioritize sequenced somatic and germline mutation variants. In this study, 84 million non-synonymous Single Nucleotide Variants (SNVs) in the human coding genome were annotated using a consensus Variant Effect Prediction (cVEP) method. An algorithm, implemented as a stacked ensemble of supervised learners, combined the 39 functional and conservation mutation impact scores from dbNSFP4.0. Adding a gene indispensability score, which accounts for differences in the pathogenicity of variants in essential and mutation-tolerant genes, improved the predictions. For each SNV, the consensus combination gives either a continuous-value pathogenicity score or a categorical score in five classes: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign. The provided class database is intended for direct use in clinical practice. The trained prediction models were 5-fold cross-validated on the evidence-based categorical annotations from the ClinVar database. Rankings of the scores based on their ability to predict pathogenicity were obtained. A two-step strategy using the rankings, scores, and class annotations is suggested for filtering and prioritization of human exome mutations in clinical and biological applications of NGS technology.