ARTICLE | doi:10.20944/preprints202306.2209.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep learning; Semi-supervised learning; Pseudo labels; Classification; Reliable Match
Online: 30 June 2023 (10:23:20 CEST)
Deep learning has been widely used over the past decade in tasks such as computer vision, natural language processing, predictive analytics, and recommendation systems. However, practical scenarios often lack labeled data, posing challenges for traditional supervised methods. Semi-supervised classification methods address this by leveraging both labeled and unlabeled data to enhance model performance, but they face challenges in effectively utilizing unlabeled data and in distinguishing reliable information from unreliable sources. This paper introduces ReliaMatch, a semi-supervised classification method that addresses these challenges using a confidence threshold. It incorporates a curriculum learning stage, feature filtering, and pseudo-label filtering to improve classification accuracy and reliability. The feature filtering module eliminates ambiguous semantic features by comparing labeled and unlabeled data in the feature space. The pseudo-label filtering module removes unreliable pseudo-labels with low confidence, enhancing algorithm reliability. ReliaMatch employs a curriculum learning training mode, gradually increasing the difficulty of the training dataset by combining selected samples and their pseudo-labels with the labeled data, which are then used for supervised training to enhance classification performance. Experimental results show that ReliaMatch effectively overcomes the underutilization of unlabeled data and the introduction of erroneous information, outperforming the pseudo-label strategy in semi-supervised classification.
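As a rough illustration of the confidence-threshold idea behind pseudo-label filtering, here is a minimal scikit-learn sketch on synthetic data; the classifier, threshold value, and function name are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_confident_pseudo_labels(model, X_unlabeled, threshold=0.95):
    """Keep only unlabeled samples whose top class probability
    clears the confidence threshold (pseudo-label filtering)."""
    probs = model.predict_proba(X_unlabeled)
    mask = probs.max(axis=1) >= threshold
    return X_unlabeled[mask], probs[mask].argmax(axis=1)

# Toy round: train on labeled data, harvest reliable pseudo-labels,
# then retrain in a supervised fashion on the combined set.
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
X_unl = rng.normal(size=(500, 5))

clf = LogisticRegression().fit(X_lab, y_lab)
X_sel, y_pseudo = select_confident_pseudo_labels(clf, X_unl, threshold=0.9)
clf = LogisticRegression().fit(np.vstack([X_lab, X_sel]),
                               np.concatenate([y_lab, y_pseudo]))
```

In a curriculum-style loop, the threshold can start high and be relaxed over rounds so the training set grows from easy, reliable samples toward harder ones.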
ARTICLE | doi:10.20944/preprints202201.0454.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Ransomware; Behavior analysis; Cyber Security; Machine Learning; Ensemble model; Supervised classification
Online: 31 January 2022 (11:49:48 CET)
Ransomware is one of the most dangerous types of malware, frequently designed to spread through a network and damage the targeted client by encrypting the client's vulnerable data. Conventional signature-based ransomware detection techniques fall behind because they can only detect known anomalies; when facing new and unfamiliar ransomware, traditional systems reveal major shortcomings. For detecting unknown patterns and new ransomware families, behavior-based anomaly detection approaches are likely to be the most effective. In the wake of this alarming situation, this paper presents an ensemble classification model consisting of three widely used machine learning techniques: Decision Tree (DT), Random Forest (RF), and K-nearest neighbor (KNN). To achieve the best outcome, ensemble soft voting and hard voting techniques are used while classifying ransomware families based on attack attributes. Performance analysis is done by comparing our proposed ensemble models with the standalone models on a behavioral-attribute-based ransomware dataset.
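A minimal sketch of the soft- and hard-voting ensemble over DT, RF, and KNN, using scikit-learn's VotingClassifier; the synthetic data is a stand-in for the behavioral ransomware attributes described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a behavioral ransomware feature matrix.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

base = [("dt", DecisionTreeClassifier(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("knn", KNeighborsClassifier())]

# Hard voting takes the majority class; soft voting averages probabilities.
for voting in ("hard", "soft"):
    ens = VotingClassifier(estimators=base, voting=voting).fit(X_tr, y_tr)
    print(voting, "voting accuracy:", ens.score(X_te, y_te))
```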
ARTICLE | doi:10.20944/preprints202303.0208.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Convolutional Neural Networks; EfficientNet; Lung Ultrasound; SARS-CoV-2; COVID-19; Pneumonia; Ensemble; Computer Vision; Supervised Learning; Deep Learning
Online: 13 March 2023 (02:41:13 CET)
A machine learning method for classifying Lung UltraSound (LUS) images is proposed here to provide a point-of-care tool supporting a safe, fast and accurate diagnosis, which can also be useful during a pandemic such as SARS-CoV-2. Given the advantages (e.g. safety, rapidity, portability, cost-effectiveness) of ultrasound technology over other methods (e.g. X-ray, computed tomography, magnetic resonance imaging), our method was validated on the largest public LUS dataset. Focusing on both accuracy and efficiency, our solution is based on an efficient adaptive ensembling of two EfficientNet-b0 models reaching 100% accuracy, which, to our knowledge, outperforms the previous state of the art. The solution keeps the number of parameters in the same order as a single EfficientNet-b0 through specific design choices: adaptive ensembling with a combination layer, ensembling performed on the deep features, and a minimal ensemble of only two weak models. Moreover, a visual analysis of the saliency maps on sample images from all classes of the dataset reveals where an inaccurate weak model focuses versus an accurate one.
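The adaptive-ensembling idea, two EfficientNet-b0 backbones whose deep features are merged by a learnable combination layer, might be sketched in PyTorch roughly as follows; this assumes a recent torchvision, and the layer sizes and class count are illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class AdaptiveEnsemble(nn.Module):
    """Two EfficientNet-b0 backbones ensembled on their deep features
    through a learnable combination layer (a sketch of the idea)."""
    def __init__(self, num_classes=4):
        super().__init__()
        # weights=None gives random init; pretrained weights would be used in practice.
        self.backbone_a = efficientnet_b0(weights=None).features
        self.backbone_b = efficientnet_b0(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.combine = nn.Linear(2 * 1280, num_classes)  # combination layer

    def forward(self, x):
        fa = self.pool(self.backbone_a(x)).flatten(1)  # deep features, model A
        fb = self.pool(self.backbone_b(x)).flatten(1)  # deep features, model B
        return self.combine(torch.cat([fa, fb], dim=1))

logits = AdaptiveEnsemble()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 4])
```

Because the combination layer operates on pooled deep features rather than duplicated heads, the parameter count stays close to that of the two backbones alone.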
REVIEW | doi:10.20944/preprints202306.1901.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Drones; Machine Learning; Artificial Intelligence; Supervised learning; Unsupervised Learning; Reinforcement Learning
Online: 27 June 2023 (12:27:38 CEST)
The use of drones for various applications has become increasingly popular in recent years, and machine learning has played a significant role in this trend. In this paper, we provide a comprehensive survey of the classification and application of machine learning in drones. The paper begins with an overview of the different types of machine learning algorithms and their applications in drones, including supervised learning, unsupervised learning, and reinforcement learning. Next, we present a detailed analysis of various real-world applications of machine learning in drones, such as object recognition, route planning, obstacle avoidance, search area optimization, and autonomous search. The paper also discusses the challenges and limitations of using machine learning in drones, such as data privacy, data quality, and computational requirements. Finally, the paper concludes with a discussion of the future directions of machine learning in drones and its potential impact on various industries and fields. This paper provides a valuable resource for researchers, practitioners, and students interested in the intersection of machine learning and drones.
ARTICLE | doi:10.20944/preprints202309.0575.v1
Subject: Environmental And Earth Sciences, Geochemistry And Petrology Keywords: Self-supervised; Pretrained Model; Transfer learning; Metric Learning; Transformer; Mask AutoEncoder; Hyperspectral Image Classification
Online: 8 September 2023 (07:42:04 CEST)
"Finding fresh water in the ocean of data." is a challenge that all deep learning domains struggle with, especially in the area of hyperspectral image analysis. As hyperspectral remote sensing technology advances by leaps and bounds, there are increasing amounts of hyperspectral images(HSIs) can be available. Whereas in fact, these unlabeled HSIs are powerless to be used as material to driven a supervised learning task due to the extremely expensive labeling costs and some unknown regions. Although learning-based methods have achieved remarkable performance due to their superior ability to represent features, at the cost, these methods are complex, inflexible and tough to carry out transfer learning. In this paper, we propose the "Instructional Mask AutoEncoder"(IMAE), which is a simple and powerful self-supervised learner for HSI classification that uses a transformer-based mask autoencoder to extract the general features of HSIs through a self-reconstructing agent task. Moreover, we utilize the metric learning to perform an instructor which can direct the model focus on the human interested region of the input so that we can alleviate the defects of transformer-based model such as local attention distraction, lack of inductive bias and tremendous training data requirement. In downstream forward propagation, instead of global average pooling, we employ a learnable aggregation to put the tokens into fullplay. The obtained results illustrate that our method effectively accelerates the convergence rate and promotes the performance in downstream task.
ARTICLE | doi:10.20944/preprints202109.0112.v1
Subject: Engineering, Marine Engineering Keywords: 3D Point Cloud Classification, 3D Point Cloud Shape Completion, Auto-Encoders, Contrastive Learning, Self-Supervised Learning
Online: 6 September 2021 (18:00:28 CEST)
In this paper, we present the idea of self-supervised learning for shape completion and classification of point clouds. Most 3D shape completion pipelines utilize autoencoders to extract point cloud features used in downstream tasks such as classification, segmentation, detection, and other related applications. Our idea is to add contrastive learning to autoencoders to learn both global and local feature representations of point clouds. We use a combination of triplet loss and Chamfer distance to learn global and local feature representations. To evaluate the embeddings for classification, we use the PointNet classifier. We also extend the number of evaluation classes from 4 to 10 to show the generalization ability of the learned features. Based on our results, embeddings generated by the contrastive autoencoder improve shape completion and classification performance of point clouds from 84.2% to 84.9%, achieving state-of-the-art results with 10 classes.
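A minimal PyTorch sketch of the combined objective, triplet loss on embeddings for the global term plus Chamfer distance on reconstructed point sets for the local term; the tensor shapes and the Chamfer implementation are illustrative assumptions.

```python
import torch
import torch.nn as nn

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets of shape (B, N, 3)."""
    d = torch.cdist(p, q)  # (B, N, N) pairwise Euclidean distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

triplet = nn.TripletMarginLoss(margin=1.0)

# Toy tensors standing in for encoder embeddings and decoder outputs.
anchor, positive, negative = (torch.randn(8, 128) for _ in range(3))
reconstruction, target = torch.randn(8, 1024, 3), torch.randn(8, 1024, 3)

# Global term (triplet on embeddings) + local term (Chamfer on points).
loss = triplet(anchor, positive, negative) + chamfer_distance(reconstruction, target)
print(float(loss))
```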
ARTICLE | doi:10.20944/preprints202201.0367.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Artificial Intelligence; Deep Learning; Image Classification; Machine Learning; Predictive Models; Small Datasets; Supervised Learning
Online: 25 January 2022 (08:24:17 CET)
One of the most important challenges in the Machine and Deep Learning areas today is building good models from small datasets, because large ones are sometimes unavailable. Several techniques have been proposed in the literature to address this challenge. This paper studies the available Deep Learning techniques and performs thorough experimentation to analyze which technique, or combination thereof, improves the performance and effectiveness of the models. A complete comparison with classical Machine Learning techniques was carried out, to contrast the results obtained by both families of techniques when working with small datasets. Thirteen algorithms were implemented and trained using three different small datasets (MNIST, Fashion MNIST, and CIFAR-10). Each experiment was evaluated using a well-established set of metrics (Accuracy, Precision, Recall, F1, and the Matthews correlation coefficient). The experimentation allowed us to conclude that it is possible to find a technique, or combination of techniques, to mitigate a lack of data, but this depends on the nature of the dataset, the amount of data, and the metrics used for evaluation.
ARTICLE | doi:10.20944/preprints201903.0122.v1
Subject: Environmental And Earth Sciences, Remote Sensing Keywords: Classification, SVM Classifier, ML Classifier, Supervised and Unsupervised Classification, Object-based Classification, Multispectral Data
Online: 11 March 2019 (09:01:44 CET)
This paper focuses on the crucial role that remote sensing plays in delineating land features. Remotely collected data provide information in the spectral, spatial, temporal and radiometric domains, each with its specific resolution. Diverse sectors such as hydrology, geology, agriculture, land cover mapping, forestry, urban development and planning, and oceanography rely on information gathered remotely from different sensors. In the present study, IRS LISS-IV multispectral data are used for land cover mapping. Classifying high-resolution land cover imagery through manual digitizing, however, is time-consuming and far too costly. Therefore, this paper proposes accomplishing the classification computationally, using algorithms from three classes: supervised, unsupervised, and object-based classification. For supervised classification, two approaches are applied to the high-resolution LISS-IV multispectral image: Maximum Likelihood (ML) and Support Vector Machine (SVM). Finally, the paper proposes a step-by-step procedure for optical image classification. The paper concludes that, for optical data, SVM classification gives better results than the ML technique.
Subject: Medicine And Pharmacology, Psychiatry And Mental Health Keywords: supervised learning, major depression, cytokines, inflammation, neuro-immune, opioids
Online: 25 March 2019 (10:14:02 CET)
Rationale: Major depressive disorder (MDD) is characterized by signaling aberrations in interleukin (IL)-6, IL-10, and beta-endorphins, as well as the mu (MOR) and kappa (KOR) opioid receptors. Here we examined whether these biomarkers may aid in the classification of unknown subjects into the target class MDD. Methods: The aforementioned biomarkers were assayed in 60 first-episode, drug-naïve depressed patients and 30 controls. We analyzed the data using joint principal component analysis (PCA) performed on all subjects to check whether subjects cluster by class; support vector machine (SVM) with 10-fold validation; and linear discriminant analysis (LDA) and SIMCA performed on calibration and validation sets, computing the figures of merit and learning from the data. Results: PCA shows that both groups were well separated using the first three PCs, while correlation loadings show that all 5 biomarkers have discriminatory value. SVM and LDA yielded an accuracy of 100% on the validation samples. Using SIMCA, there was a highly significant discrimination of both groups (model-to-model distance = 87.5); all biomarkers showed significant discrimination and modeling power, while 10% of the patients were flagged as outsiders and no aliens were identified. Discussion: We have delineated that MDD is a distinct class with respect to neuro-immune and opioid biomarkers and that future unknown subjects can be authenticated as having MDD using this SIMCA fingerprint. Precision psychiatry should employ SIMCA a) to authenticate patients as belonging to the claimed target class and to identify other subjects as outsiders, members of another class, or aliens; and b) to acquire knowledge by learning from the data and constructing a biomarker fingerprint of the target class.
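The PCA-plus-SVM portion of this pipeline, with 10-fold validation, can be sketched in scikit-learn on synthetic stand-in data; SIMCA has no mainstream scikit-learn implementation, so it is omitted here, and the component count and class balance mirror the abstract only loosely.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the 5 biomarkers measured in 90 subjects (60 MDD, 30 controls).
X, y = make_classification(n_samples=90, n_features=5, n_informative=5,
                           n_redundant=0, weights=[2/3], random_state=0)

pipe = make_pipeline(StandardScaler(), PCA(n_components=3), SVC())
scores = cross_val_score(pipe, X, y, cv=10)   # 10-fold cross-validation
print("mean accuracy:", scores.mean())
```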
ARTICLE | doi:10.20944/preprints201911.0218.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: Landsat; Google Earth; water index; unsupervised image classification; supervised image classification; Kappa coefficient
Online: 19 November 2019 (03:10:17 CET)
To address three important issues in extracting water features from Landsat imagery, i.e., the selection of water indexes, the selection of classification algorithms, and the collection of ground truth data for accuracy assessment, this study applied four sets (ultra-blue, blue, green, and red light based) of water indexes (NDWI, MNDWI, MNDWI2, AWEIns, and AWEIs) combined with three image classification methods (zero water-index threshold, Otsu, and kNN) to 24 selected lakes across the globe to extract water features from Landsat-8 OLI imagery. The 1,440 (4 × 5 × 3 × 24) classification results were compared, via Kappa coefficients, with water features extracted from high-resolution Google Earth images with the same (or ±1 day) acquisition dates. Results show that the kNN method is better than the Otsu method, which in turn is better than the zero water-index threshold method. If computational cost is not an issue, the kNN method combined with the ultra-blue light based AWEIns is the best method for extracting water features from Landsat imagery, as it produced the highest Kappa coefficients; if computational cost is taken into account, the Otsu method is a good choice. AWEIns and AWEIs are better than NDWI, MNDWI and MNDWI2. AWEIns works better than AWEIs under the Otsu method, and the average rank of classification accuracy, from high to low, is the ultra-blue, blue, green, and red light based AWEIns.
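One index/threshold combination can be sketched as follows: an NDWI-style index computed from two bands, then segmented by both the zero-threshold and Otsu methods. This assumes NumPy and scikit-image, with random arrays standing in for Landsat-8 OLI reflectance bands.

```python
import numpy as np
from skimage.filters import threshold_otsu

def ndwi(green, nir):
    """Normalized Difference Water Index from green and NIR bands."""
    return (green - nir) / (green + nir + 1e-9)

# Toy reflectance bands standing in for Landsat-8 OLI rasters.
rng = np.random.default_rng(1)
green, nir = rng.random((2, 200, 200))

index = ndwi(green, nir)
water_zero = index > 0                       # zero-index-threshold method
water_otsu = index > threshold_otsu(index)   # Otsu-derived threshold
print(water_zero.mean(), water_otsu.mean())  # fraction classified as water
```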
ARTICLE | doi:10.20944/preprints202005.0356.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Supervised Learning; Time Series Classification; Jamming Detection; Automatic Modulation Classification; Feature Selection; Genetic Algorithm; Principal Component Analysis; QPSK modulation; APSK modulation
Online: 23 May 2020 (05:10:36 CEST)
Satellite communication (Satcom) via geostationary satellites facilitates a wide range of telecommunications, and ensuring its quality of service (QoS) and security is crucial in government/military applications. The most challenging situation for efficient Satcom is a radio frequency interference (RFI) environment; it is therefore necessary to ensure that transmissions are incorruptible, or at least to sense the quality of the spectrum. This paper presents a new method to recognize received-signal characteristics using hierarchical classification in a multi-layer perceptron neural network. We consider the signal modulation and the type of RFI as the characteristics of a real-time video stream transmitted via direct broadcast satellite. Four different modulation types are investigated in this study. Moreover, the combination of the communication signal with various kinds of interference, and their effects on the classification method, are analyzed extensively. In addition, two robust feature selection techniques are developed to reduce the dataset dimensionality, which optimizes the classification process. The results show that the Genetic Algorithm (GA) slightly outperforms Principal Component Analysis (PCA) for feature selection. Furthermore, the robustness of the proposed techniques is assessed by detecting unknown signals at different signal-to-noise ratios.
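A rough scikit-learn sketch of the PCA-based dimensionality reduction feeding an MLP classifier; the paper's hierarchical structure and GA-based selection are not reproduced here, and the feature counts, layer sizes, and four-class setup are illustrative stand-ins for the modulation-type classification.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for received-signal features; 4 classes play the modulation types.
X, y = make_classification(n_samples=2000, n_features=64, n_informative=20,
                           n_classes=4, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=16),          # dimensionality reduction
                     MLPClassifier(hidden_layer_sizes=(64, 32),
                                   max_iter=500, random_state=7))
pipe.fit(X_tr, y_tr)
print("test accuracy:", pipe.score(X_te, y_te))
```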
ARTICLE | doi:10.20944/preprints202103.0780.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep learning; Computer vision; Remote sensing; Supervised learning; Semi-supervised learning; Segmentation; Seagrass mapping
Online: 31 March 2021 (15:53:19 CEST)
Intertidal seagrass plays a vital role in estimating the overall health and dynamics of coastal environments due to its interaction with tidal changes. However, most seagrass habitats around the globe have been in steady decline due to human impacts, disturbing the already delicate balance of environmental conditions that sustain seagrass. Miniaturization of multi-spectral sensors has facilitated very high resolution mapping of seagrass meadows, significantly improving the potential for ecologists to monitor changes. In this study, two analytical approaches for classifying intertidal seagrass habitats are compared: Object-based Image Analysis (OBIA) and Fully Convolutional Neural Networks (FCNNs). Both methods produce pixel-wise classifications to create segmented maps; however, FCNNs are an emerging set of Deep Learning algorithms with sparse application to seagrass mapping, whereas OBIA has been a prominent solution in this field, with many studies leveraging in-situ data and multiscale segmentation to create habitat maps. This work demonstrates the utility of FCNNs in a semi-supervised setting for mapping seagrass and other coastal features from an optical drone survey conducted at Budle Bay, Northumberland, England. Semi-supervision is also an emerging area of Deep Learning, with the practical benefit of achieving state-of-the-art results using only subsets of labelled data, which is especially valuable for remote sensing applications where in-situ data is an expensive commodity. Our results show that FCNNs perform comparably with the standard OBIA method used by ecologists, with an increase in performance for mapping ecological features that are sparsely labelled across the study site.
ARTICLE | doi:10.20944/preprints202111.0243.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Feature Selection; Malaria Diagnosis; Supervised learning
Online: 15 November 2021 (10:36:16 CET)
Malaria remains an important cause of death, especially in sub-Saharan Africa, with about 228 million malaria cases worldwide and an estimated 405,000 deaths in 2019. Currently, malaria is diagnosed in health facilities using microscopy (BS) or rapid malaria diagnostic tests (MRDT); in areas where these tools are inadequate, presumptive treatment is performed, and self-diagnosis and treatment are also practiced in some households. Given the high rate of self-medication with malaria drugs, this study aimed at computing the most significant features, using feature selection methods, for the best prediction of malaria in Tanzania, which can be used in developing a machine learning model for malaria diagnosis. A malaria symptom and clinical diagnosis dataset was extracted from patients' files at four identified health facilities in the Kilimanjaro and Morogoro regions, selected to represent high endemic areas (Morogoro) and low endemic areas (Kilimanjaro) in the country. The dataset contained 2,556 instances and 36 variables. The tree-based random forest classifier was used to select the most important features for malaria prediction, and region-specific features were obtained to facilitate accurate prediction. The feature ranking indicated that fever is universally the most influential feature for predicting malaria, followed by general body malaise, vomiting and headache; however, these features are ranked differently across the regional datasets. Subsequently, six predictive models, using the important features selected by the feature selection method, were used to evaluate feature performance. The identified features comply with the malaria diagnosis and treatment guidelines provided by the WHO and Tanzania Mainland; this compliance was observed so as to produce a prediction model that fits the current healthcare provision system in Tanzania.
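Ranking features with a random forest's impurity-based importances, as described above, can be sketched with scikit-learn; the synthetic matrix mirrors only the dataset's dimensions, and the feature names are placeholders for the clinical variables such as fever or headache.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the 36-variable symptom/clinical dataset (2,556 patients).
X, y = make_classification(n_samples=2556, n_features=36, n_informative=8,
                           random_state=3)
names = [f"feature_{i}" for i in range(36)]  # placeholders: fever, vomiting, ...

rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
for i in ranking[:5]:
    print(names[i], round(rf.feature_importances_[i], 4))
```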
ARTICLE | doi:10.20944/preprints202307.1950.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: broiler; welfare; mobility; YOLOv5; semi-supervised learning; neo-deepsort
Online: 28 July 2023 (10:31:34 CEST)
Mobility is a vital welfare indicator that may influence broilers' daily activities. Classical broiler mobility assessment methods are laborious and cannot provide timely insights into bird condition. Here, we propose a semi-supervised Deep Learning (DL) model, YOLOv5, combined with the Deep Sort tracking algorithm and our newly proposed extension, Neo-Deep Sort, for individual broiler mobility tracking. Initially, 1,650 labeled images from five days were employed to train the YOLOv5 model. Through semi-supervised learning (SSL), this narrowly trained model was then used to pseudo-label 2,160 images, of which 2,153 were successfully labeled, and the YOLOv5 model was fine-tuned on the newly labeled images. Lastly, the trained YOLOv5 and the Neo-Deep Sort algorithm were applied to detect and track 28 broilers in two pens and to categorize them in terms of hourly and daily traveled distances and speeds. SSL increased the YOLOv5 model's mean Average Precision (mAP) in detecting birds from 81% to 98%. Compared with manually measured covered distances, the combined model provided individual broilers' hourly moved distances with a validation accuracy of about 80%. Eventually, individual- and flock-level mobility was quantified while overcoming occlusion, false-detection and missed-detection issues.
ARTICLE | doi:10.20944/preprints202209.0100.v1
Subject: Biology And Life Sciences, Biology And Biotechnology Keywords: biocatalysts; bioprospecting; esterases/lipases; hydrolases; machine learning; supervised learning
Online: 7 September 2022 (04:53:30 CEST)
When bioprospecting for novel industrial enzymes, substrate promiscuity is a desirable property that increases the reusability of the enzyme. Among industrial enzymes, ester hydrolases have great relevance, and demand for them has not ceased to increase. However, the search for new substrate-promiscuous ester hydrolases is not trivial, since the mechanism behind this property is greatly influenced by the structural and physicochemical characteristics of the active site. These characteristics must be computed from the 3D structure, which is rarely available and expensive to measure, hence the need for a method that can predict promiscuity from the sequence alone. Here we report such a method, called EP-pred: an ensemble binary classifier that combines three machine learning algorithms, SVM, KNN, and a linear model. EP-pred has been evaluated against the Lipase Engineering Database, together with a hidden Markov approach, leading to a final set of ten sequences predicted to encode promiscuous esterases. Experimental results confirmed the validity of our method, since all ten proteins were found to exhibit broad substrate ambiguity.
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Ship detection; self-supervised learning; transfer learning; Sentinel 2
Online: 7 October 2021 (23:04:24 CEST)
Automatic ship detection provides an essential function towards maritime domain awareness for security or economic monitoring purposes. This work presents an approach for training a deep learning ship detector in Sentinel-2 multispectral images with few labeled examples. We design a network architecture for detecting ships with a backbone that can be pre-trained separately. By using Self Supervised Learning, an emerging unsupervised training procedure, we learn good features on Sentinel-2 images, without requiring labeling, to initialize our network’s backbone. The full network is then fine-tuned to learn to detect ships in challenging settings. We evaluated this approach versus pre-training on ImageNet and versus a classical image processing pipeline. We examined the impact of variations in the self-supervised learning step and we show that in the few-shot learning setting self-supervised pre-training achieves better results than ImageNet pre-training. When enough training data is available, our self-supervised approach is as good as ImageNet pre-training. We conclude that a better design of the self-supervised task and bigger non-annotated dataset sizes can lead to surpassing ImageNet pre-training performance without any annotation costs.
REVIEW | doi:10.20944/preprints202108.0238.v1
Subject: Public Health And Healthcare, Other Keywords: self-supervised learning; medicine; healthcare; representation learning; unlabeled data
Online: 11 August 2021 (08:27:57 CEST)
Machine learning has become an increasingly ubiquitous technology, as big data continues to inform and influence everyday life and decision-making. Currently in healthcare, as in most other industries, the two most prevalent machine learning paradigms are supervised learning and transfer learning. Both practices rely on large-scale, manually annotated datasets to train increasingly complex models. However, the requirement that data be manually labeled leaves an excess of unused, unlabeled data in both public and private data repositories. Self-supervised learning (SSL) is a growing area of machine learning that can take advantage of unlabeled data. Contrary to other machine learning paradigms, SSL algorithms create artificial supervisory signals from unlabeled data and pretrain on these signals. The aim of this review is twofold: first, we provide a formal definition of SSL, divide SSL algorithms into four unique subsets, and review the state of the art published in each of those subsets between 2014 and 2020; second, we survey recent SSL algorithms published in healthcare, to give medical experts a clearer picture of how they can integrate SSL into their research, with the objective of leveraging unlabeled data.
ARTICLE | doi:10.20944/preprints202104.0678.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: supervised machine learning; automated landscape mapping; digital elevation model
Online: 26 April 2021 (14:44:24 CEST)
Landscapes evolve due to climatic conditions, tectonic activity, geological features, biological activity, and sedimentary dynamics. These processes link geological processes at depth to surface features. Consequently, the study of landscapes can reveal essential information about the geochemical footprint of ore deposits at depth. Advances in satellite imaging and computing power have enabled the creation of large geospatial datasets, the sheer size of which necessitates automated processing. We describe a methodology for the automated mapping of landscape pattern domains using machine learning (ML) algorithms. From a freely available Digital Elevation Model, derived data, and sample land-class boundaries provided by domain experts, our algorithm produces a dense map of the model region in Western Australia. Both random forest and support vector machine classification achieve about 98% classification accuracy with a reasonable runtime of 48 minutes on a single core. We discuss computational resources and study the effect of grid resolution: larger tiles result in a more contiguous map, while smaller tiles result in a more detailed and, at some point, noisy map. The diversity and distribution of the landscapes mapped in this study support previous results. In addition, our results are consistent with the geological trends and main basement features of the region.
ARTICLE | doi:10.20944/preprints202309.1699.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Image retrieval, Deep learning, Multi-scale feature, Deep supervised hashing
Online: 26 September 2023 (05:13:58 CEST)
Deep network-based hashing has gained significant popularity in recent years, particularly in the field of image retrieval. However, most existing methods only extract semantic information from the final layer, disregarding valuable structural information that contains semantic details crucial for effective hash learning. To address this limitation and improve image retrieval accuracy, we propose a novel deep hashing method called Deep Supervised Hashing by Fusing Multiscale Deep Features (DSHFMDF). Our approach extracts multiscale features from multiple convolutional layers and fuses them to generate more robust representations for efficient image retrieval. Experimental results on the CIFAR-10 and NUS-WIDE datasets demonstrate that our method surpasses state-of-the-art hashing techniques.
ARTICLE | doi:10.20944/preprints202306.2115.v1
Subject: Medicine And Pharmacology, Oncology And Oncogenics Keywords: exercise oncology; telehealth; synchronous delivery; supervised exercise; group-based exercise
Online: 29 June 2023 (11:37:23 CEST)
Alberta Cancer Exercise (ACE) is an exercise oncology program that transitioned from in-person to online delivery during COVID-19. The purpose of this work was to understand participants' experiences of both delivery modes; specifically, survivors' exercise facilitators and barriers, delivery mode preference, and experience with program elements targeting behaviour change were gathered. A retrospective cohort design using explanatory sequential mixed methods was used: 57 participants completed a survey, and 19 subsequent, optional interviews were conducted. Most participants preferred in-person programs (58%), followed by online (32%) and no preference (10%). There were significantly fewer barriers (i.e., commute time) (p<0.01), but also fewer facilitators (i.e., social support) (p<0.01), to exercising online. Four themes were generated from the qualitative data on participant experiences in both delivery modes. Key differences in barriers and facilitators highlighted a more convenient experience online relative to a more socially supportive environment in person. For future programs delivered solely online, focusing on building social support and a sense of community will be critical to optimizing program benefits. Beyond the COVID-19 pandemic, the results of this research will remain relevant as we aim to extend the reach of online exercise oncology programming to more underserved populations of individuals living with cancer.
ARTICLE | doi:10.20944/preprints202212.0433.v1
Subject: Physical Sciences, Quantum Science And Technology Keywords: Optimal control; supervised learning; system characterization; two-level quantum systems
Online: 23 December 2022 (01:44:57 CET)
We investigate the extent to which a two-level quantum system subjected to an external time-dependent drive can be characterized by supervised learning. We apply this approach to the case of bang-bang control and the estimation of the offset and the final distance to a given target state. The estimate is global in the sense that no a priori knowledge is required on the parameters to be determined. Different neural network algorithms are tested on a series of data sets. We point out the limits of the estimation procedure with respect to the properties of the mapping to be interpolated. We discuss the physical relevance of the different results.
ARTICLE | doi:10.20944/preprints202106.0482.v3
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: COVID-19 Infodemic; Text Classification; TFIDF Features; Network Training modes; Supervised Learning; Misinformation; News Classification; False Publications; PubMed; Anomaly Detection
Online: 26 July 2021 (12:06:04 CEST)
The spread of the Coronavirus pandemic has been accompanied by an infodemic. The false information embedded in the infodemic affects people's ability to access safety information and follow proper procedures to mitigate risk. This research targets the falsehood part of the infodemic, which prominently proliferates in news articles and false medical publications. Here, we present NeoNet, a novel supervised machine learning text mining algorithm that analyzes the content of a document (a news article or a medical publication) and assigns a label to it. The algorithm is trained on TFIDF bigram features, which contribute to a network training model. The algorithm is tested on two different real-world datasets, from the CBC news network and COVID-19 publications. In five different fold comparisons, the algorithm predicted the label of an article with a precision of 97-99%. When compared with prominent algorithms such as Neural Networks, SVM, and Random Forests, NeoNet surpassed them. The analysis highlights the promise of NeoNet in detecting disputed online content that may contribute negatively to the COVID-19 pandemic.
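NeoNet itself is the authors' algorithm, but the TFIDF-bigram feature step can be sketched with scikit-learn; here it is paired with a plain logistic regression as a stand-in classifier, on a toy corpus that only imitates the reliable/misinformation split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; the paper uses CBC news and COVID-19 publications.
docs = ["vaccine trials show strong efficacy data",
        "miracle cure revealed by anonymous source",
        "peer reviewed study reports measured outcomes",
        "secret remedy doctors refuse to mention"]
labels = [0, 1, 0, 1]  # 0 = reliable, 1 = misinformation

pipe = make_pipeline(TfidfVectorizer(ngram_range=(2, 2)),  # bigram TFIDF
                     LogisticRegression())
pipe.fit(docs, labels)
print(pipe.predict(["study reports measured efficacy data"]))
```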
ARTICLE | doi:10.20944/preprints202308.1636.v1
Subject: Engineering, Civil Engineering Keywords: machine learning; supervised classification; drinking water quality; data-driven; artificial intelligence
Online: 24 August 2023 (03:20:26 CEST)
Water quality assessments are crucial for human health and environmental safeguards. The utilization of a subset of artificial intelligence, Machine Learning (ML), can significantly enhance the prediction and classification of water quality. In this research, a set of diverse ML algorithms was evaluated on a comprehensive dataset of water quality measurements collected over an extended period, with the aim of developing a robust approach for accurately forecasting water quality. The approach employed the classifiers Logistic Regression (LR), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), K-Nearest Neighbors (KNN), Gaussian Process Classification (GPC), Gaussian Naive Bayes (GNB), Random Forest (RF), Decision Tree (DT), XGBoost, and Multilayer Perceptron (MLP). The water quality parameters assessed were pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes and turbidity. The XGBoost model exhibited the highest accuracy among the individual classifiers, 89.47%, and Stacked Ensemble Classifiers (SEC) improved the prediction further to 92.98%. The findings suggest that XGBoost and the SEC hold promise as reliable approaches for water quality assessment.
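A minimal sketch of a stacked ensemble in the spirit of the SEC, combining a few of the listed classifiers with scikit-learn's StackingClassifier; this assumes the xgboost package, and the synthetic data stands in for the nine water-quality parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# Stand-in for the nine water-quality parameters (pH, hardness, ...).
X, y = make_classification(n_samples=3000, n_features=9, n_informative=6,
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

# Base learners feed their predictions to a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=5)),
                ("knn", KNeighborsClassifier()),
                ("xgb", XGBClassifier(random_state=5))],
    final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
print("stacked accuracy:", stack.score(X_te, y_te))
```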
ARTICLE | doi:10.20944/preprints202308.0403.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: vehicular ad-hoc networks; neural networks; supervised learning; SUMO; NS-3
Online: 4 August 2023 (10:49:07 CEST)
In urban Vehicular Ad Hoc Network (VANET) environments, buildings play a crucial role as they can act as obstacles that attenuate the transmission signal between vehicles. Quantifying the impact of buildings on the transmission quality is essential, especially in critical scenarios involving emergency vehicles, where reliable communication is of utmost importance. In this research, we propose a supervised learning approach based on artificial neural networks (ANNs) to develop a predictive model capable of estimating the level of signal degradation, represented by the bit error rate (BER), based on the obstacles perceived by moving emergency vehicles. By establishing a relationship between the level of signal degradation and the encountered obstacles, our proposed mechanism enables efficient routing decisions to be made prior to the transmission process. Consequently, data packets are routed through paths that exhibit the lowest BER. To gather the necessary training data, we employed SUMO and NS-3 simulations. The simulation results demonstrate that our developed model successfully learns and accurately estimates the BER for new data instances. Overall, our research contributes to enhancing the performance and reliability of communication in urban VANET environments, especially in critical scenarios involving emergency vehicles, by leveraging supervised learning and artificial neural networks to predict signal degradation levels and optimize routing decisions accordingly.
ARTICLE | doi:10.20944/preprints202308.0219.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Machine learning; educational data mining; supervised methods; classifiers; course failure risk
Online: 3 August 2023 (02:43:48 CEST)
In this paper, we address the following research question: is it feasible to use an artificial intelligence system to predict the risk of student failure in a course based solely on performance in prerequisite courses? Adopting a machine learning-based quantitative approach, we implement Course Prophet, a prototype predictive system that maps input variables representing student performance to the target variable, i.e., the risk of course failure. We evaluate multiple machine learning methods and find that the Gaussian process with a Matern kernel outperforms the others, achieving the highest accuracy and a favorable trade-off between precision and recall. We conduct this research in the context of students pursuing a Bachelor's degree in Systems Engineering at the University of Córdoba, Colombia, focusing on predicting the risk of failing the Numerical Methods course. In conclusion, the main contribution of this research is the development of Course Prophet, an efficient and accurate tool for predicting student failure in the Numerical Methods course based on academic history in prerequisite courses.
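The winning model family is directly available in scikit-learn; a minimal sketch with a Matern-kernel Gaussian process classifier on synthetic stand-in grades, where the feature count and cross-validation setup are illustrative, not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score

# Stand-in for grades in prerequisite courses; y = failed Numerical Methods.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=11)

gp = GaussianProcessClassifier(kernel=Matern(nu=1.5), random_state=11)
print("CV accuracy:", cross_val_score(gp, X, y, cv=5).mean())
```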
ARTICLE | doi:10.20944/preprints202307.0194.v1
Subject: Medicine And Pharmacology, Neuroscience And Neurology Keywords: general taste status; taste loss; supervised learning regression; random forest regressor
Online: 4 July 2023 (10:26:21 CEST)
In healthy humans, taste sensitivity varies widely, influencing food selection and nutritional status. Chemosensory reductions have been associated with numerous pathological disorders and pharmacological interventions. Reliable psychophysical methods are crucial resources for analyzing taste function during routine clinical assessment; however, in the daily clinical routine they are often considered too time-consuming. We used a Supervised Learning (SL) regression method to analyze with high precision the overall taste status of healthy controls (HC) and patients with chemosensory loss, and to characterize the combination of responses that best predicts the overall taste status of the two groups. A Random Forest regressor allowed us to achieve this objective. Analysis of the order of importance and the impact of each parameter on the prediction of overall taste status in the two groups showed that salty (low concentration) and sour (high concentration) stimuli specifically characterized healthy subjects, while bitter (high concentration) and astringent (high concentration) stimuli identified patients with chemosensory loss. These distinctions appear to be of interest to the health system, since they may allow the use of specific stimuli during routine clinical assessments of taste function, reducing the commitment in terms of time and costs.
ARTICLE | doi:10.20944/preprints202210.0431.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Supervised machine learning; intrusion detection; data engineering; cybersecurity; Internet of Things.
Online: 27 October 2022 (10:57:09 CEST)
Nowadays, Internet of Things (IoT) devices and applications have rapidly expanded worldwide due to their benefits in improving business environments, industrial environments, and people's daily lives. However, IoT devices are not immune to malicious network traffic, which can have negative consequences and sabotage operating IoT devices; developing a method for screening network traffic is therefore necessary to detect and classify malicious activity and mitigate its impact. This research proposes a predictive machine learning model to detect and classify network activity in an IoT system. Specifically, our model distinguishes between normal and anomalous network activity, and further classifies network traffic into five categories: normal, Mirai attack, denial of service (DoS) attack, Scan attack, and man-in-the-middle (MITM) attack. Five supervised learning models were implemented to characterize their performance in detecting and classifying network activities for IoT systems: shallow neural networks (SNN), decision trees (DT), bagging trees (BT), support vector machines (SVM), and k-nearest neighbors (kNN). The learning models were evaluated on a new and broad dataset of IoT attacks, the IoTID20 dataset, and a deep feature engineering process was applied to the dataset to improve the accuracy of the learning models. Our experimental evaluation exhibited an accuracy of 100% for detection with all implemented models and an accuracy of 99.4%-99.9% for the classification process.
ARTICLE | doi:10.20944/preprints202203.0219.v1
Subject: Medicine And Pharmacology, Clinical Medicine Keywords: Artificial intelligence; Supervised Machine Learning; Kinematics; Head rotation test; Neck pain
Online: 15 March 2022 (14:30:51 CET)
Understanding neck pain is an important societal issue. Kinematic data from sensors may help to gain insight into the pathophysiological mechanisms associated with neck pain through a quantitative sensorimotor assessment of individual patients. The objective of this study was to evaluate the potential usefulness of artificial intelligence, with several Machine Learning (ML) algorithms, in assessing neck sensorimotor performance. Angular velocity and acceleration measured by an inertial sensor placed on the forehead during the DidRen laser test in thirty-eight acute and subacute non-specific neck pain (ANSP) patients were compared to forty-two healthy control participants (HCP). Seven supervised ML algorithms were chosen for the predictions, and the most informative kinematic features were computed using Sequential Feature Selection methods. The best-performing algorithm was the linear Support Vector Machine, with an accuracy of 82% and an Area Under the Curve of 84%. The most discriminative kinematic feature between ANSP patients and HCP was the first quartile of head pitch angular velocity. This study has shown that supervised ML algorithms can be used to classify ANSP patients and to identify discriminatory kinematic features potentially useful to clinicians in assessing and monitoring neck sensorimotor performance in ANSP patients.
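A minimal sketch of sequential feature selection feeding a linear SVM, using scikit-learn on synthetic stand-ins for the kinematic features; the number of selected features and the cross-validation setup are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for kinematic features (angular velocity/acceleration quartiles).
X, y = make_classification(n_samples=80, n_features=20, n_informative=5,
                           random_state=2)

svm = SVC(kernel="linear")
selector = SequentialFeatureSelector(svm, n_features_to_select=5)
pipe = make_pipeline(StandardScaler(), selector, svm)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```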
ARTICLE | doi:10.20944/preprints201809.0346.v1
Subject: Engineering, Mechanical Engineering Keywords: SLM, Process Control, Semi-supervised Machine Learning, Randomised Singular Value Decomposition
Online: 18 September 2018 (11:21:58 CEST)
Risk-averse areas such as the medical, aerospace and energy sectors have been somewhat slow towards accepting and applying Additive Manufacturing (AM) in many of their value chains. This is partly because there are still significant uncertainties concerning the quality of AM builds. This paper introduces a machine learning algorithm for the automatic detection of faults in AM products. The approach is semi-supervised in that, during training, it is able to use data both from builds where the resulting components were certified and from builds where the quality of the resulting components is unknown. This makes the approach cost efficient, particularly in scenarios where part certification is costly and time consuming. The study specifically analyses Selective Laser Melting (SLM) builds. Key features are extracted from large sets of photodiode data, obtained during the building of 49 tensile test bars. Ultimate tensile strength (UTS) tests were then used to categorise each bar as 'faulty' or 'acceptable'. A fully supervised approach identified faulty specimens with a 77% success rate, while the semi-supervised approach was able to consistently achieve similar results despite being trained on a fraction of the available certification data. The results show that semi-supervised learning is a promising approach for the automatic certification of AM builds that can be implemented at a fraction of the cost currently required.
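Compressing raw photodiode signals with randomized SVD, the decomposition named in the keywords, can be sketched with scikit-learn; the matrix shape mirrors the 49-bar dataset, but the signal length and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Stand-in for the photodiode time-series matrix: one row per test bar.
rng = np.random.default_rng(4)
photodiode = rng.random((49, 5000))  # 49 bars, 5,000 samples each

# Randomized SVD compresses each signal into a handful of key features,
# which a downstream (semi-)supervised classifier can then consume.
U, S, Vt = randomized_svd(photodiode, n_components=10, random_state=4)
features = U * S          # low-dimensional representation per specimen
print(features.shape)     # (49, 10)
```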
ARTICLE | doi:10.20944/preprints202207.0261.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: Alcohol Detection; Smart Sensing; MQ-3 Alcohol Sensors; Supervised Learning; Neural Networks.
Online: 18 July 2022 (10:16:26 CEST)
According to risk investigations of accident involvement, alcohol-impaired driving is one of the major causes of motor-vehicle accidents; preventing highly intoxicated persons from driving could potentially save many lives. This paper proposes a lightweight in-vehicle alcohol detection system that processes the data generated from six MQ-3 alcohol sensors using an optimizable shallow neural network (O-SNN). The experimental evaluation exhibits a high-performance detection system, scoring a detection accuracy of 99.8% with a very short inference delay of 2.22 µs. The proposed model can thus be efficiently deployed to detect in-vehicle alcohol with high accuracy and low inference overhead, as part of the driver alcohol detection system for safety (DADSS), aiming at the massive deployment of alcohol-sensing systems that could potentially save thousands of lives annually.
ARTICLE | doi:10.20944/preprints202311.1705.v1
Subject: Engineering, Safety, Risk, Reliability And Quality Keywords: transformer; self-supervised learning; autoencoder; remaining useful life prediction; bidirectional LSTM; turbofan engine
Online: 27 November 2023 (13:22:46 CET)
Estimating the Remaining Useful Life (RUL) of aircraft engines holds a pivotal role in enhancing safety, optimizing operations, and promoting sustainability, and is thus a crucial component of modern aviation management. Precise RUL predictions offer valuable insights into an engine's condition, enabling informed decisions regarding maintenance and crew scheduling. In this context, we propose a novel RUL prediction approach harnessing Bidirectional LSTM and Transformer architectures, known for their success in sequence modeling domains such as natural language. We adopt the encoder part of the full Transformer as the backbone of our framework, integrating it with a self-supervised denoising autoencoder that utilizes a Bidirectional LSTM for improved feature extraction. Within our framework, a sequence of multivariate time-series sensor measurements serves as the input, initially processed by the Bidirectional LSTM autoencoder to extract essential features; these feature values are then fed into our Transformer encoder backbone for RUL prediction. Notably, our approach trains the autoencoder and the Transformer encoder simultaneously, unlike the naive sequential training method. Through a series of numerical experiments on the C-MAPSS datasets, we demonstrate that the efficacy of our proposed models either surpasses or stands on par with that of other existing methods.
ARTICLE | doi:10.20944/preprints202310.0524.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep learning; image representation learning; self-supervised learning; masked image modeling; contrastive learning
Online: 9 October 2023 (12:52:30 CEST)
Self-supervised learning is a method that learns general representations from unlabeled data. Masked image modeling (MIM), one of the generative self-supervised learning methods, has drawn attention for state-of-the-art performance on various downstream tasks, though it shows poor linear separability as a result of its token-level approach. In this paper, we propose a contrastive learning-based multi-view masked autoencoder for MIM, exploiting an image-level approach by learning common features from two differently augmented views. We strengthen MIM by learning long-range global patterns through the contrastive loss. Our framework adopts a simple encoder-decoder architecture and learns rich, general representations through a simple process: 1) two different views are generated from an input image with random masking, and the contrastive loss teaches the semantic distance between the representations produced by the encoder; applying a high mask ratio of 80% acts as strong augmentation and alleviates the representation collapse problem; 2) with the reconstruction loss, the decoder learns to reconstruct the original image from the masked image. We assess our framework through several experiments on benchmark datasets for image classification, object detection, and semantic segmentation. We achieve 84.3% fine-tuning accuracy on ImageNet-1K classification and 76.7% in linear probing, exceeding previous studies, and show promising results on other downstream tasks. These experimental results demonstrate that our work learns rich and general image representations by applying a contrastive loss to masked image modeling.
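A toy PyTorch sketch of the two-view objective: a contrastive term aligning the two masked views of each image plus a token reconstruction term. The tiny MLP encoder/decoder, mask function, and temperature are illustrative stand-ins for the paper's transformer architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_mask(x, ratio=0.8):
    """Zero out a random subset of patch tokens (B, N, D)."""
    keep = torch.rand(x.shape[:2], device=x.device) > ratio
    return x * keep.unsqueeze(-1)

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
decoder = nn.Linear(128, 64)

tokens = torch.randn(16, 49, 64)                   # B=16 images, 49 patch tokens
v1, v2 = random_mask(tokens), random_mask(tokens)  # two masked views
z1 = encoder(v1).mean(dim=1)                       # image-level representations
z2 = encoder(v2).mean(dim=1)

# Contrastive term: matching views of the same image should align.
logits = F.normalize(z1, dim=1) @ F.normalize(z2, dim=1).T / 0.1
contrastive = F.cross_entropy(logits, torch.arange(16))

# Reconstruction term: decode the masked view back to the original tokens.
reconstruction = F.mse_loss(decoder(encoder(v1)), tokens)
(contrastive + reconstruction).backward()
```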
ARTICLE | doi:10.20944/preprints202308.2098.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Self-explanation; Automated scoring; Semi-supervised learning; Language Learning Model (LLM); Data augmentation
Online: 31 August 2023 (10:19:53 CEST)
In the realm of mathematics education, self-explanation stands as a crucial learning mechanism, allowing learners to articulate their comprehension of intricate mathematical concepts and strategies. As digital learning platforms grow in prominence, there are mounting opportunities to collect and utilize mathematical self-explanations. However, these opportunities are met with challenges in automated evaluation. Automatic scoring of mathematical self-explanations is crucial for preprocessing tasks, including the categorization of learner responses, identification of common misconceptions, and creation of tailored feedback and model solutions. Nevertheless, this task is hindered by the dearth of ample sample sets. Our research introduces a semi-supervised technique using a Language Learning Model (LLM), specifically its Japanese variant, to enrich datasets for the automated scoring of mathematical self-explanations. We rigorously evaluated the quality of self-explanations across five datasets, ranging from human-evaluated originals to ones devoid of original content. Our results show that combining LLM-based explanations with mathematical material significantly improves the model's accuracy. Interestingly, there is an optimal limit to how much synthetic self-explanation data can benefit the system; exceeding it does not further improve outcomes. This study thus highlights the need for careful consideration when integrating synthetic data into solutions, especially within the mathematics discipline.
ARTICLE | doi:10.20944/preprints202306.1360.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Internal Mammary Artery Segmentation; Alternating Blocks; Medical Imaging Segmentation; Self-supervised Pretraining; LegoNet
Online: 19 June 2023 (13:00:58 CEST)
Since the emergence of convolutional neural networks (CNNs), and later vision transformers (ViTs), the standard paradigm for model development has been to use a set of identical block types with varying parameters/hyper-parameters. To leverage the benefits of different architectural designs (e.g., CNNs and ViTs), we propose alternating structurally different types of blocks to generate a new architecture, mimicking how Lego blocks can be assembled. Using two CNN-based and one SwinViT-based block, we investigate three variations of the so-called LegoNet that apply this block-alternation concept to the segmentation task in medical imaging. We also study a new clinical problem that has not been investigated before: segmentation of the right internal mammary artery (RIMA) and perivascular space from computed tomography angiography (CTA), which has been shown to have prognostic value for primary cardiovascular outcomes. We compare the model's performance against popular CNN and ViT architectures using two large datasets (achieving a 0.749 dice similarity coefficient (DSC) on the larger one). We also evaluate the model's performance on three external testing cohorts, where an expert clinician corrected the model-segmented results (DSC > 0.90 for the three cohorts). To assess our proposed model for suitability in clinical use, we perform intra- and inter-observer variability analysis. Finally, we investigate a joint self-supervised learning approach to determine its impact on model performance.
ARTICLE | doi:10.20944/preprints202304.0757.v1
Subject: Engineering, Mechanical Engineering Keywords: Additive Manufacturing; Explainable Artificial Intelligence; Machine Learning; Supervised Learning; Surface Roughness; Structural Integrity
Online: 23 April 2023 (04:04:49 CEST)
Structural integrity is a crucial aspect of engineering components, particularly in the field of additive manufacturing (AM). Surface roughness is a vital parameter that significantly influences the structural integrity of additively manufactured parts. In this study, we present a comprehensive investigation into the relationship between surface roughness and structural integrity of Polylactic Acid (PLA) specimens produced through additive manufacturing. This work focuses on predicting the surface roughness of additively manufactured PLA specimens using eight different supervised machine learning regression algorithms: Support Vector Regression, Random Forest, XGBoost, AdaBoost, CatBoost, Decision Tree, Extra Trees regressor, and Gradient Boosting regressor. For the first time, Explainable AI techniques are employed to enhance the interpretability of the machine learning models. The study analyzes the performance of these algorithms in predicting the surface roughness of PLA specimens, while also investigating the impact of individual input parameters through Explainable AI methods. The experimental results indicate that the XGBoost algorithm outperforms the others with the highest coefficient of determination, 0.9634, demonstrating that it provides the most accurate surface roughness predictions. The study also provides a comparative analysis of all the algorithms used, along with insights derived from the Explainable AI techniques.
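A minimal sketch of the regression setup with XGBoost and a simple importance-based explanation, on synthetic stand-ins for the print-process parameters; for fuller Explainable AI one would typically add a dedicated package such as shap, which is omitted here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Stand-in for print-process parameters (layer height, speed, temperature, ...).
X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)

model = XGBRegressor(n_estimators=300, random_state=6).fit(X_tr, y_tr)
print("R^2:", r2_score(y_te, model.predict(X_te)))  # coefficient of determination

# A simple global explanation: gain-based importance per input parameter.
print(dict(enumerate(np.round(model.feature_importances_, 3))))
```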
ARTICLE | doi:10.20944/preprints202209.0025.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: object detection; semi-supervised learning; Mask R-CNN; floor-plan images; computer vision
Online: 1 September 2022 (15:16:43 CEST)
Research on object detection using semi-supervised methods has been growing in the past few years. We examine the intersection of these two areas for floor-plan objects to promote the research objective of detecting more accurate objects with less labeled data. The floor-plan objects include different furniture items with multiple types of the same class, and this high inter-class similarity impacts the performance of prior methods. In this paper, we present a Mask R-CNN-based semi-supervised approach that provides pixel-to-pixel alignment to generate individual annotation masks for each class to mine the inter-class similarity. The semi-supervised approach has a student-teacher network that pulls information from the teacher network and feeds it to the student network. The teacher network uses unlabeled data to form pseudo-boxes, and the student network uses both unlabeled data with the pseudo-boxes and labeled data as ground truth for training. It learns representations of furniture items by combining labeled and unlabeled data. On the Mask R-CNN detector with a ResNet-101 backbone network, the proposed approach achieves mAP of 98.8%, 99.7%, and 99.8% with only 1%, 5%, and 10% labeled data, respectively. Our experiments affirm the efficiency of the proposed approach, as it outperforms the fully supervised counterpart using only 10% of the labels.
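A simplified, hypothetical sketch of the teacher-student pseudo-labelling loop described above, reduced to plain classification with scikit-learn (the paper applies the idea to Mask R-CNN detection); X_lab, y_lab, and X_unlab are assumed NumPy arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label_round(X_lab, y_lab, X_unlab, threshold=0.9):
    # Teacher: trained on the labeled data only.
    teacher = RandomForestClassifier(random_state=0).fit(X_lab, y_lab)
    proba = teacher.predict_proba(X_unlab)
    keep = proba.max(axis=1) >= threshold       # confident pseudo-labels only
    y_pseudo = teacher.classes_[proba[keep].argmax(axis=1)]
    # Student: trained on labeled data plus confident pseudo-labeled data.
    X_aug = np.vstack([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, y_pseudo])
    student = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
    return student
```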
ARTICLE | doi:10.20944/preprints202203.0085.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: object segmentation; LiDAR-camera fusion; autonomous driving; artificial intelligence; semi-supervised learning; iseAuto
Online: 4 March 2022 (21:43:06 CET)
Object segmentation is still considered a challenging problem in autonomous driving, particularly under real-world conditions. Following this line of research, this paper approaches the problem of object segmentation using LiDAR-camera fusion and semi-supervised learning implemented in a fully-convolutional neural network. Our method is tested on real-world data acquired with our custom vehicle, the iseAuto shuttle. The data include all-weather scenarios, featuring night and rainy weather. In this work, it is shown that with LiDAR-camera fusion, only a few annotated scenarios, and semi-supervised learning, it is possible to achieve robust performance on real-world data in a multi-class object segmentation problem. The performance of our algorithm is measured in terms of intersection over union (IoU), precision, recall, and area-under-the-curve average precision. Our network achieves 82% IoU for vehicle segmentation in day-fair scenarios and 64% IoU in night-rain scenarios.
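For reference, the reported IoU, precision, and recall for a binary segmentation mask (e.g., the vehicle class) reduce to simple pixel counts; a minimal NumPy sketch:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    # pred, truth: 2-D masks of the same shape (non-zero = vehicle pixel)
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return iou, precision, recall
```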
ARTICLE | doi:10.20944/preprints202105.0677.v1
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: heart failure; phenotype; left ventricular ejection fraction; primary care; artificial intelligence; supervised analysis
Online: 27 May 2021 (14:08:53 CEST)
Artificial Intelligence is creating a paradigm shift in health care, with patient phenotyping through clustering techniques being one of the areas of interest. Objective: To develop a predictive model to classify heart failure (HF) patients according to their left ventricular ejection fraction (LVEF), using data available in Electronic Health Records (EHR). Subjects and methods: 2854 subjects older than 25 years with a diagnosis of HF and LVEF measured by echocardiography were selected to develop an algorithm to predict patients with reduced EF using supervised analysis. The performance of the developed algorithm was tested on heart failure patients from Primary Care. The LASSO algorithm was used to select the most influential variables, and the Synthetic Minority Oversampling Technique (SMOTE) was used to tackle the issue of one class exceeding the other by a large proportion. Finally, Random Forest (RF) and XGBoost models were constructed. Results: The full XGBoost model obtained the maximum accuracy, a high negative predictive value, and the highest positive predictive value. Gender, age, unstable angina, atrial fibrillation, and acute myocardial infarction are the variables that most influence the EF value. Applied to the EHR data set of 25594 patients with an ICD code of HF and no regular follow-up in Cardiology clinics, 6170 (21.1%) were identified as belonging to the reduced-EF group. Conclusion: The obtained algorithm is able to identify a number of HF patients with reduced ejection fraction who can benefit from a protocol with strong recommendations for success. Furthermore, the methodology can be used for studies with data extracted from Electronic Health Records.
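A hedged sketch of the described pipeline (LASSO-style variable selection, SMOTE oversampling of the minority reduced-EF class, then a gradient-boosted model); it assumes scikit-learn, imbalanced-learn, and xgboost are installed, and X, y stand for the EHR feature matrix and reduced-EF labels:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def fit_reduced_ef_model(X, y):
    # L1-penalized logistic regression as a LASSO-style variable selector.
    selector = SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
    X_sel = selector.fit_transform(X, y)
    # Oversample the minority (reduced-EF) class before fitting.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_sel, y)
    model = XGBClassifier(eval_metric="logloss").fit(X_bal, y_bal)
    return selector, model
```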
ARTICLE | doi:10.20944/preprints202012.0058.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Deep Learning; LSTM Autoencoder; Supervised Learning; Hydraulic Test Rig; Sensor Faults; Component Faults
Online: 2 December 2020 (11:08:40 CET)
Anomaly occurrences in hydraulic machinery may lead to massive system shutdowns, jeopardizing the safety of the machinery, its surrounding human operator(s), and the environment, with severe economic implications following the faults and their associated damage. Hydraulic systems are mostly placed in harsh environments, where they are constantly vulnerable to many faults. Hence, not only are the machines and their components prone to anomalies, but so are the sensors attached to them, which monitor and report their health and behavioral changes. In this work, a comprehensive applicational analysis of anomalies in hydraulic systems, extracted from a hydraulic test rig, is thoroughly conducted. We provide a combination of a new architecture of LSTM autoencoders and supervised machine and deep learning methodologies to perform two separate stages of fault detection and diagnosis: a detection phase using the LSTM autoencoder, followed by a fault diagnosis phase represented by the classification schema. This framework is applied to both component and sensor faults in hydraulic systems, deployed in the form of two in-depth applicational experiments. Moreover, a thorough review of the related work of the past decade for the two stages separately is conducted in this paper.
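A minimal Keras sketch of the detection stage under stated assumptions: an LSTM autoencoder is trained to reconstruct healthy sensor windows, and windows whose reconstruction error exceeds a threshold are flagged for the diagnosis stage. Window length, layer sizes, and the threshold are illustrative, not the authors' settings.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_sensors = 60, 8
model = keras.Sequential([
    layers.Input(shape=(timesteps, n_sensors)),
    layers.LSTM(32),                            # encoder -> latent vector
    layers.RepeatVector(timesteps),             # repeat latent per timestep
    layers.LSTM(32, return_sequences=True),     # decoder
    layers.TimeDistributed(layers.Dense(n_sensors)),
])
model.compile(optimizer="adam", loss="mse")

# Train on healthy sequences only (random data stands in here).
healthy = np.random.randn(256, timesteps, n_sensors).astype("float32")
model.fit(healthy, healthy, epochs=2, verbose=0)

def is_anomalous(window, threshold):
    # High reconstruction error marks the window for the diagnosis stage.
    err = np.mean((model.predict(window[None], verbose=0) - window) ** 2)
    return err > threshold
```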
ARTICLE | doi:10.20944/preprints201812.0114.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: directional encoding mask; selective attention network; supervised learning; horizontal and vertical text recognition
Online: 11 December 2018 (07:24:04 CET)
Recent state-of-the-art scene text recognition methods have primarily focused on horizontal text in images. However, in several Asian countries, including China, large amounts of text in signs, books, and TV commercials are vertically directed. Because horizontal and vertical texts exhibit different characteristics, an algorithm that can simultaneously recognize both types of text in real environments is necessary. To address this problem, we adopted the direction encoding mask (DEM) and selective attention network (SAN) methods based on supervised learning. DEM contains directional information to compensate in cases that lack text direction; therefore, our network is trained using this information to handle vertical text. The SAN method is designed to work individually for both types of text. To train the network to recognize both types of text and to evaluate the effectiveness of the designed model, we prepared a new synthetic vertical text dataset and collected an actual vertical text dataset (VTD142) from the Web. Using these datasets, we show that our proposed model can accurately recognize both vertical and horizontal text and achieves state-of-the-art results in experiments using benchmark datasets, including Street View Text (SVT), IIIT-5k, and ICDAR. Although our model is relatively simple compared to its predecessors, it maintains accuracy and is trained in an end-to-end manner.
ARTICLE | doi:10.20944/preprints202311.0963.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Hate Speech Detection; Machine Learning; Sentiment Analysis; Semi-Supervised Learning; Self-Learning; Text Mining
Online: 15 November 2023 (09:58:07 CET)
Text annotation is an essential element of natural language processing approaches. Manual annotation performed by humans has several drawbacks, such as subjectivity, slowness, fatigue, and possible carelessness; in addition, annotators may annotate ambiguous data. We therefore developed an automated annotation concept to obtain the best annotations using several machine-learning approaches. The proposed approach is based on an ensemble of meta-learners and meta-vectorizer techniques. It employs a semi-supervised learning technique for automated annotation aimed at detecting hate speech, leveraging various machine learning algorithms, including Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbors (KNN), and Naive Bayes (NB), in conjunction with Word2Vec and TF-IDF text extraction methods. The annotation process is performed on 13,169 Indonesian YouTube comments. The proposed model uses a stemming approach based on Sastrawi together with newly added data of 2,245 words. Semi-supervised learning uses 5%, 10%, and 20% of labeled data, compared to performing labeling based on 80% of the dataset. In semi-supervised learning, the model learns from the labeled data, which provides explicit information, and from the unlabeled data, which offers implicit insights. This hybrid approach enables the model to generalize and make informed predictions even when limited labeled data is available, ultimately enhancing its ability to handle real-world scenarios with scarce annotated information. In addition, the proposed method uses a variety of thresholds for matching words labeled as hate speech: 0.6, 0.7, 0.8, and 0.9. The experiments showed that the KNN-Word2Vec model has the best accuracy, 96.9%, with a scenario of 5%:80%:0.9. Several other methods, such as SVM and DT with both text extraction methods, also achieved accuracy above 90% in several test scenarios.
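A hedged scikit-learn sketch of the self-learning scheme: train on a small labeled slice, pseudo-label the unlabeled comments whose predicted class probability clears the confidence threshold, and grow the training set. The 0.9 threshold mirrors one of the scenarios above; all names are illustrative.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

def self_train(texts_lab, y_lab, texts_unlab, threshold=0.9, rounds=3):
    vec = TfidfVectorizer()
    X_lab = vec.fit_transform(texts_lab)
    X_un = vec.transform(texts_unlab)
    y_lab = np.asarray(y_lab)
    clf = KNeighborsClassifier(n_neighbors=5)
    for _ in range(rounds):
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_un)
        keep = proba.max(axis=1) >= threshold   # confidence threshold
        if not keep.any():
            break
        y_new = clf.classes_[proba[keep].argmax(axis=1)]
        X_lab = vstack([X_lab, X_un[keep]])     # absorb pseudo-labels
        y_lab = np.concatenate([y_lab, y_new])
        X_un = X_un[~keep]
    return clf
```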
ARTICLE | doi:10.20944/preprints202308.0954.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Arabic NLP; Kuwaiti Dialect; Dataset Labeling; Stance Detection; Weak supervised learning; Zero-shot learning
Online: 14 August 2023 (09:00:24 CEST)
The Kuwaiti dialect is a particular dialect of Arabic spoken in Kuwait; it differs significantly from standard Arabic and from the dialects of neighboring countries in the same region. Few research papers focusing on the Kuwaiti dialect have been published in the field of NLP. In this study, we created Kuwaiti dialect language resources using Q8VaxStance, a vaccine stance labeling system for a large dataset of tweets. This dataset fills this gap and provides a valuable resource for researchers studying vaccine hesitancy in Kuwait. Furthermore, it contributes to the Arabic natural language processing field by providing a dataset for developing and evaluating machine learning models for stance detection in the Kuwaiti dialect. The proposed vaccine stance labeling system combines the benefits of weak supervised learning and zero-shot learning; for this purpose, we implemented 52 experiments on 42,815 unlabeled tweets extracted between December 2020 and July 2022. The results show that using keyword detection in conjunction with zero-shot model labeling functions is significantly better than using only keyword detection labeling functions or only zero-shot model labeling functions. Furthermore, using Arabic in both the labels and the prompt, or a mix of Arabic labels and an English prompt, is statistically significant compared to using English in both the labels and the prompt for the total number of generated labels evaluation metric. The best Macro-F1 values, 0.83 in both cases, were found in experiments KHZSLF-EE4 and KHZSLF-EA1; for the total automatically labeled data evaluation metric, experiment KHZSLF-EE4 labeled 42,270 tweets, while experiment KHZSLF-EA1 generated 42,764 labels.
ARTICLE | doi:10.20944/preprints202306.1482.v1
Subject: Public Health And Healthcare, Other Keywords: Lipidol ultrafluid; Methylene blue; Iohexol; Lymphography; Supervised machine learning; Gray level Co-occurrence matrix
Online: 21 June 2023 (07:21:15 CEST)
The objective of the current investigation is to identify the first draining node, or sentinel lymph node (SLN), from the primary tumor mass in a regional lymphocenter. Four different indirect lymphography (IL) methods were employed in 96 canine patients with different types of cancer between 2018 and 2021. The IL technique involved intradermal, submucosal, and peritumoral injections of 2 ml of contrast agent, divided into equal aliquots following the four-quadrant principle. Lymphatic mapping with lipiodol (iodized oil) achieved 100% detection in squamous cell carcinoma of the head and neck, anal sac apocrine gland adenocarcinoma, mast cell tumor, squamous cell carcinoma of the skin, and mammary carcinoma, while with methylene blue dye, 100% detection was achieved in testicular tumor and mammary carcinoma. Despite a very short washout time of 2 minutes, iohexol showed excellent detection in indirect CT lymphography for histiocytic sarcoma and in indirect radiographic lymphography for lymphosarcoma. The significance of contrast and blue dyes in detecting the lymphatic spread of canine cancers is clearly emphasized in the current investigation. The nature of cancerous tissue was further analyzed through an image-based machine learning approach. A supervised machine learning technique is applied for the automatic classification of cancerous and non-cancerous regions: various statistical and texture-based features are extracted from X-ray images, and a support vector machine with linear, polynomial, multilayer perceptron (MLP), and RBF kernel functions is applied for classification. The highest sensitivity, specificity, and accuracy of 95.53%, 94.64%, and 93.05%, respectively, are achieved using the RBF kernel function.
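A hedged sketch of the texture-classification step, assuming scikit-image and scikit-learn: gray-level co-occurrence matrix (GLCM) features are extracted from labeled X-ray patches and fed to an RBF-kernel SVM. The exact feature set in the paper may differ.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def glcm_features(patch):
    # patch: 2-D uint8 grayscale region cropped from an X-ray image
    glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

def train_classifier(patches, labels):
    # labels: cancerous vs. non-cancerous region, assumed given
    X = np.array([glcm_features(p) for p in patches])
    return SVC(kernel="rbf", gamma="scale").fit(X, labels)
```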
ARTICLE | doi:10.20944/preprints202109.0010.v1
Subject: Medicine And Pharmacology, Anesthesiology And Pain Medicine Keywords: spinal cord stimulation; screening trial; infection; supervised learning; machine learning; predictive modeling; patient outcome
Online: 1 September 2021 (12:05:18 CEST)
Persistent pain after spinal surgery can be successfully addressed by Spinal Cord Stimulation (SCS). International guidelines strongly recommend that a lead trial be performed before any permanent implantation. Recent clinical data highlight some major limitations of this approach. First, it appears that patient outcomes, with or without a lead trial, are similar. In contrast, during trialing, the infection rate drops drastically over time and can compromise the therapy. Using composite pain assessment experience and previous research, we hypothesized that machine learning models could be robust screening tools and reliable predictors of long-term SCS efficacy. We developed several algorithms, including logistic regression, Regularized Logistic Regression (RLR), naive Bayes classifier, artificial neural networks, random forest, and gradient boosted trees, to test this hypothesis and to perform internal and external validations, the objective being to confront model predictions with lead trial results using a 1-year composite outcome from 103 patients. While almost all models demonstrated superiority over lead trialing, the RLR model appears to represent the best compromise between complexity and interpretability in predicting SCS efficacy. These results underscore the need for AI-based predictive medicine, as a synergistic mathematical approach, aimed at helping implanters optimize their clinical choices in daily practice.
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: White matter lesions; white matter hyperintensities; supervised segmentation; unsupervised segmentation; deep learning; FLAIR hyperintensities
Online: 20 November 2020 (13:44:46 CET)
Background: White matter hyperintensities (WMH), of presumed vascular origin, are visible and quantifiable neuroradiological markers of brain parenchymal change. These changes may range from damage secondary to inflammation and other neurological conditions, through to healthy ageing. Fully automatic WMH quantification methods are promising, but traditional semi-automatic methods still seem to be preferred in clinical research. We systematically reviewed the literature for fully automatic methods developed in the last five years, to assess what are considered state-of-the-art techniques, as well as trends in the analysis of WMH of presumed vascular origin. Method: We registered the systematic review protocol with the International Prospective Register of Systematic Reviews (PROSPERO), registration number CRD42019132200. We conducted the search for fully automatic methods developed from 2015 to July 2020 on Medline, ScienceDirect, IEEE Xplore, and Web of Science. We assessed risk of bias and applicability of the studies using QUADAS-2. Results: The search yielded 2327 papers after removing 104 duplicates. After screening titles, abstracts, and full text, 37 were selected for detailed analysis. Of these, 16 proposed a supervised segmentation method, 10 proposed an unsupervised segmentation method, and 11 proposed a deep learning segmentation method. Average DSC values ranged from 0.538 to 0.91, with the highest value obtained by an unsupervised segmentation method. Only four studies validated their method in longitudinal samples, and eight performed an additional validation using clinical parameters. Only 8/37 studies made their method available in public repositories. Conclusions: We found no evidence that favours deep learning methods over the more established k-NN, linear regression, and unsupervised methods in this task. Data and code availability, bias in study design, and ground truth generation influence the wider validation and applicability of these methods in clinical research.
ARTICLE | doi:10.20944/preprints202008.0645.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Speech Emotion Recognition; Emotion AI; Self-Supervised Learning; Transfer Learning; Low Resource Training; wav2vec
Online: 28 August 2020 (15:05:37 CEST)
We propose a novel transfer learning method for speech emotion recognition that yields promising results when only little training data is available. With as few as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data. Our method leverages knowledge contained in pre-trained speech representations extracted from models trained on a more general self-supervised task which does not require human annotations, such as the wav2vec model. We provide detailed insights into the benefits of our approach by varying the training data size, which can help labeling teams work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a well-benchmarked dataset in the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning: we align acoustic pre-trained representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when combining both modalities and scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state-of-the-art of 73.9% unweighted accuracy (UA).
ARTICLE | doi:10.20944/preprints201808.0154.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: deep learning; multiple instance learning; weakly supervised learning; demography; socioeconomic analysis; google street view
Online: 24 October 2018 (08:53:26 CEST)
(1) Background: Evidence-based policymaking requires data about the local population's socioeconomic status (SES) at a detailed geographical level; however, such information is often not available, or is too expensive to acquire. Researchers have proposed solutions to estimate SES indicators by analyzing Google Street View images; however, these methods are also resource-intensive, since they require large volumes of manually labeled training data. (2) Methods: We propose a methodology for automatically computing surrogate variables of SES indicators using street images of parked cars and deep multiple instance learning. Our approach does not require any manually created labels, apart from data already available from statistical authorities, while the entire pipeline for image acquisition, parked car detection, car classification, and surrogate variable computation is fully automated. The proposed surrogate variables are then used in linear regression models to estimate the target SES indicators. (3) Results: We implement and evaluate a model based on the proposed surrogate variable in 30 municipalities of varying SES in Greece. Our model has $R^2=0.76$ and a correlation coefficient of $0.874$ with the true unemployment rate, while it achieves a mean absolute percentage error of $0.089$ and mean absolute error of $1.87$ on a held-out test set. Similar results are obtained for other socioeconomic indicators, related to education level and occupational prestige. (4) Conclusions: The proposed methodology can be used to estimate SES indicators at the local level automatically, using images of parked cars detected via Google Street View, without the need for any manual labeling effort.
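The final modelling step reduces to an ordinary linear regression from the car-derived surrogate variable to a target SES indicator, scored with the reported metrics; a minimal scikit-learn sketch with illustrative variable names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

def evaluate_surrogate(surrogate, target, surrogate_test, target_test):
    # surrogate: 1-D array of per-municipality surrogate values
    model = LinearRegression().fit(surrogate.reshape(-1, 1), target)
    pred = model.predict(surrogate_test.reshape(-1, 1))
    return {
        "R2": r2_score(target_test, pred),
        "MAE": mean_absolute_error(target_test, pred),
        "MAPE": mean_absolute_percentage_error(target_test, pred),
    }
```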
ARTICLE | doi:10.20944/preprints202201.0202.v1
Subject: Environmental And Earth Sciences, Remote Sensing Keywords: crop detection; Sentinel 1; Sentinel 2; supervised classification; unsupervised classification; time series; agriculture; food security
Online: 14 January 2022 (11:18:59 CET)
Satellite crop detection technologies focus on detecting different types of crops in the field at an early stage, before harvesting. Crop detection is usually done on a time series of satellite data by classification of the desired fields. Currently, data obtained from Remote Sensing (RS) are used to solve tasks related to identifying the type of agricultural crops, and modern technologies using AI methods are desirable in the postprocessing part. In this challenge, Sentinel-1 and Sentinel-2 time series data were used due to their periodic availability. Our focus was to develop a methodology for classifying time series of Sentinel-2 and Sentinel-1 data and to compare how the accuracy of classification can be increased, but also how to guarantee the availability of data. We analyzed the phenology of single crops, and on the basis of this analysis we provided crop classification. Original crop classifications were made from Enhanced Vegetation Index (EVI) layers derived from Sentinel-2 time-series data, and we then also added Sentinel-1 data. To increase accuracy, we also integrated parcel borders into the process and provided classification of fields.
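For reference, the EVI layers mentioned above follow the standard enhanced vegetation index formula (for Sentinel-2, B8 = NIR, B4 = red, B2 = blue); a small NumPy sketch on reflectance values scaled to [0, 1]:

```python
import numpy as np

def evi(nir, red, blue):
    # Standard EVI: G=2.5, C1=6, C2=7.5, L=1
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

nir, red, blue = np.array([0.45]), np.array([0.08]), np.array([0.04])
print(evi(nir, red, blue))  # ~0.57: dense vegetation gives high EVI
```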
ARTICLE | doi:10.20944/preprints202112.0025.v2
Subject: Engineering, Control And Systems Engineering Keywords: Brain segmentation; Coarse-to-fine; Generative Adversarial Network; Semi-supervised learning; Multi-stage method
Online: 6 December 2021 (14:33:23 CET)
Image segmentation is a challenging problem in medical applications. The use of medical imaging has become an integral part of research, as it allows us to see inside the human body without surgical intervention. Many researchers have studied brain segmentation, where one-stage methods are used to segment the brain tissues. In this paper, we propose a multi-stage generative adversarial network (GAN) to solve the problem of information loss in the one-stage approach. We utilize a coarse-to-fine strategy to improve brain segmentation using multi-stage GANs. In the first stage, our model generates a coarse outline for (i) the background and (ii) brain tissues. Then, in the second stage, the model generates outlines for (i) white matter (WM), (ii) gray matter (GM), and (iii) cerebrospinal fluid (CSF). A good result can be achieved by fusing the coarse and refined outlines. We conclude that our model is more efficient and accurate in practice for both infant and adult brain segmentation. Moreover, we observe that the multi-stage model is faster than prior models. More specifically, a main goal of the multi-stage model is to assess its performance in a few-shot learning setting, where only a few labeled samples are available. For medical imaging, the proposed model can work in a wide range of image segmentation tasks where convolutional neural networks and one-stage methods have failed.
ARTICLE | doi:10.20944/preprints202002.0019.v1
Subject: Biology And Life Sciences, Endocrinology And Metabolism Keywords: metabolomics; LC-MS; mass spectrometry; metabolic profiling; computational; statistical; unsupervised learning; supervised learning; pathway analysis
Online: 3 February 2020 (05:54:14 CET)
Metabolomics analysis generates vast arrays of data, necessitating comprehensive workflows involving expertise in analytics, biochemistry, and bioinformatics in order to provide coherent, high-quality data that enables the discovery of robust and biologically significant metabolic findings. In this protocol article, we introduce NoTaMe, an analytical workflow for non-targeted metabolic profiling approaches utilizing liquid chromatography-mass spectrometry analysis. We provide an overview of lab protocols and statistical methods that we commonly practice for the analysis of nutritional metabolomics data. The paper is divided into three main sections: the first and second sections introduce the background and the study designs available for metabolomics research, and the third section describes in detail the steps of the main methods and protocols used to produce, preprocess, and statistically analyze metabolomics data, and finally to identify and interpret the compounds that emerge as interesting.
ARTICLE | doi:10.20944/preprints201704.0114.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: indoor localization; crowdsourcing; received signal strength; graph-based semi-supervised learning; linear regression; compressed sensing.
Online: 18 April 2017 (12:33:47 CEST)
Indoor positioning based on the received signal strength (RSS) of the WiFi signal has become the most popular solution for indoor localization. To enable rapid deployment of indoor localization systems, solutions based on crowdsourcing have been proposed. However, compared to conventional methods, crowdsourced RSS values are more erroneous and can result in large localization errors. To mitigate the negative effect of the erroneous measurements, a graph-based semi-supervised learning (G-SSL) method is used to exploit the correlation between the RSS values at nearby locations and estimate an optimal RSS value at each location. Before applying the G-SSL method, a Linear Regression (LR) algorithm is proposed to solve the device diversity problem in crowdsourcing systems. Since the spatial distribution of the APs is sparse, the Compressed Sensing (CS) method is applied to precisely estimate the locations of the APs. Based on the AP locations and a simple signal propagation model, the RSS difference between different locations is calculated and used as an additional constraint to improve the performance of G-SSL. Furthermore, to exploit the sparsity of the weights used in the G-SSL, we use the CS method to reconstruct these weights more accurately, further improving the performance of the G-SSL. Experimental results show improvements in terms of the smoothness of the radio map and the localization accuracy.
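A hedged NumPy sketch of the graph-based smoothing idea (not the authors' exact formulation): build a k-NN graph over survey locations and solve a Laplacian-regularized least-squares problem, so each fingerprint's RSS is pulled toward the values of its neighbours.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def smooth_rss(locations, rss, k=5, lam=1.0):
    # locations: (n, 2) coordinates; rss: (n,) crowdsourced RSS of one AP
    W = kneighbors_graph(locations, k, mode="connectivity").toarray()
    W = np.maximum(W, W.T)              # symmetrise the adjacency matrix
    L = np.diag(W.sum(axis=1)) - W      # graph Laplacian
    # minimise ||f - rss||^2 + lam * f^T L f  =>  (I + lam L) f = rss
    return np.linalg.solve(np.eye(len(rss)) + lam * L, rss)
```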
ARTICLE | doi:10.20944/preprints202309.2181.v1
Subject: Engineering, Other Keywords: Blockchain Technology-enabled Pharmaceutical Supply Chain; Uncertain Demand; Supervised Learning algorithms; Evolutionary Computation algorithms; Blockchain Technology
Online: 1 October 2023 (10:14:31 CEST)
This paper provides a new multi-function Blockchain Technology-enabled Pharmaceutical Supply Chain (BT-enabled PSC) mathematical cost model, including PSC costs, BT costs, and uncertain demand fluctuations. The purpose of this study is to find the most appropriate algorithm(s), with minimum prediction errors, to predict the costs of the BT-enabled PSC model. This paper also aims to determine the importance and cost of each component of the multi-function model. To reach these goals, after data generation we combined four Supervised Learning algorithms (KNN, DT, SVM, and NB) with two Evolutionary Computation algorithms (HS and PSO). Each component of the multi-function model has its own importance, and we applied the Feature Weighting approach to analyse their importance. Next, four performance metrics evaluated the multi-function model, and the Total Ranking Score determined the predictive algorithms with high reliability. The results indicate that the HS-NB and PSO-NB algorithms perform better than the other six algorithms in predicting the costs of the multi-function model with small errors. The findings also show that the Raw Materials cost has a stronger influence on the model than the other components. This study also introduces the components of the multi-function BT-enabled PSC model.
ARTICLE | doi:10.20944/preprints202304.0350.v3
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: ChatGPT; Generative AI; Fake Publications; Human-Generated Publications; Supervised Learning; ML Algorithm; Fake Science; NeoNet Algorithm
Online: 18 August 2023 (11:19:23 CEST)
Background: ChatGPT is becoming a new reality. Where do we go from here? Objective: To show how we can distinguish ChatGPT-generated publications from counterparts produced by scientists. Methods: By means of a new algorithm, called xFakeBibs, we show the significant difference between ChatGPT-generated fake publications and real publications. Specifically, we triggered ChatGPT to generate 100 publications related to Alzheimer's disease and comorbidity. Using TF-IDF on the real publications, we constructed a training network of bigrams from 100 publications. Using 10 folds of 100 publications each, we also constructed 10 calibrating networks to derive lower/upper bounds for classifying articles as real or fake. The final step was to test xFakeBibs against each of the ChatGPT-generated articles and predict its class, with the algorithm assigning the POSITIVE label to real publications and NEGATIVE to fake ones. Results: When comparing the bigrams of the training set against all of the 10 calibrating folds, we found that the similarities fluctuated between 19% and 21%, whereas the bigram similarity of the ChatGPT articles was only 8%. Additionally, when testing how the bigrams generated from the 10 calibrating folds compared against ChatGPT, we found that each of the 10 calibrating folds contributed 51%-70% new bigrams, while ChatGPT contributed only 23%, less than half of any of the 10 calibrating folds. The final classification using xFakeBibs set a lower/upper bound of 21.96-24.93 new edges added to the training model without contributing new nodes. Using this calibration range, the algorithm predicted 98 of the 100 publications as fake, while 2 articles failed the test and were classified as real publications. Conclusions: This work provides clear evidence of how to distinguish, in bulk, ChatGPT-generated fake publications from real publications, and introduces an algorithmic approach that detects fake articles with a high degree of accuracy. However, it remains challenging to detect all fake records. ChatGPT may seem to be a useful tool, but it certainly presents a threat to our authentic knowledge and real science. This work is indeed a step in the right direction to counter fake science and misinformation.
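A hypothetical sketch of the bigram-network intuition behind xFakeBibs: count how many of a candidate article's bigrams are new with respect to the training network and compare the count against the calibrated band; the decision rule here is a simplified reading of the procedure above, not the authors' code.

```python
from sklearn.feature_extraction.text import CountVectorizer

def new_bigram_count(training_abstracts, candidate):
    # Bigrams already in the training network vs. the candidate's bigrams.
    vec = CountVectorizer(ngram_range=(2, 2)).fit(training_abstracts)
    known = set(vec.get_feature_names_out())
    cand = set(CountVectorizer(ngram_range=(2, 2))
               .fit([candidate]).get_feature_names_out())
    return len(cand - known)

def looks_real(count, lower=21.96, upper=24.93):
    # Simplified reading of the calibration band reported above:
    # fake articles contribute markedly fewer new bigrams/edges.
    return lower <= count <= upper
```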
REVIEW | doi:10.20944/preprints201708.0003.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: stylometry; author identification; author verification; author profiling; stylistic inconsistency; text analysis; supervised learning; unsupervised learning; classification; forensics
Online: 2 August 2017 (12:38:17 CEST)
Electronic text stylometry is a collection of forensic methods that analyze the writing styles of input electronic texts in order to extract information about their authors. Such extracted information could be the identity of the authors, or aspects of the authors, such as their gender, age group, ethnicity, etc. This survey paper presents the following contributions: 1) A description of all stylometry problems in probability terms, under a unified notation; to the best of our knowledge, this is the most comprehensive definition to date. 2) A survey of key methods, with particular attention to data representation (or feature extraction) methods. 3) An evaluation of 23,760 feature extraction methods, which is the most comprehensive evaluation of feature extraction methods in the stylometry literature to date. The importance of this evaluation is twofold: first, identifying the relative effectiveness of the features (since, currently, many are not evaluated jointly; e.g., syntactic n-grams are not evaluated against k-skip n-grams, and so forth); second, thanks to our generalizations, we could evaluate novel grams, such as what we name compound grams. 4) The release of our associated Python feature extraction library, namely Fextractor. Essentially, the library generalizes all existing n-gram based feature extraction methods under the "at least l-frequent, dir-directed, k-skipped n-grams", and allows grams to be diversely defined, including definitions based on high-level grammatical aspects, such as POS tags, as well as lower-level ones, such as the distribution of function words, word shapes, etc. This makes the library, by far, the most extensive in this domain to date. 5) The construction, evaluation, and release of the first dataset for Emirati social media text. This represents the first evaluation of author identification against Emirati social media texts. Interestingly, we find that, when using our models and feature extraction library (Fextractor), authors can be identified significantly more accurately than what is reported with similarly sized datasets. The dataset also contains sub-datasets representing other languages (Dutch, English, Greek, and Spanish), and our findings are consistent across them.
ARTICLE | doi:10.20944/preprints202009.0729.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Data Envelopment Analysis; Machine learning; Optimization; Parametric and non-parametric methods; Supervised and unsupervised models; CVS model
Online: 30 September 2020 (08:19:51 CEST)
The main purpose of this paper is to propose a novel optimization model with a new machine learning approach in the first section, to achieve the best results for financial institutions in the second section. Since the consistency of efficiency estimates derived from parametric and non-parametric methods is not significant, this paper provides a scientific assessment in the optimization section and proposes a novel combined parametric and non-parametric model, which is a new experiment in the literature. A scientific assessment of banks is introduced, based on a combination of the efficiency measurement methods CCR (Charnes, Cooper and Rhodes) or CRS (Constant Returns to Scale) and BCC (Banker, Charnes, and Cooper) or VRS (Variable Returns to Scale) in Data Envelopment Analysis (DEA), as well as the Stochastic Frontier Approach (SFA), for 65 banks from February to July 2020. For analyzing the performance of the parametric and non-parametric approaches, we considered linear regression and the Unreplicated Linear Functional Relationship (ULFR). In the machine learning section, a novel four-layer data-mining filtering preprocess is applied to selected supervised classification and unsupervised clustering algorithms to increase accuracy and to remove unrelated attributes and data. For the four kinds of preprocessing approaches (unsupervised attributes, supervised attributes, supervised instances, and unsupervised instances), we chose discretization, attribute selection, stratified remove-folds, and resample filters, respectively. Based on the nature of the suggested financial institutions' dataset and attributes, the most appropriate preprocessing filter in each layer to achieve the highest performance is suggested. Finally, the superior bank, the best-performing model, and the most accurate algorithm are identified. The results indicate that bank number 56 is the superior bank. Among the proposed techniques, the novel recommended CVS model, compared with the CCR-BCC and SFA models, has a more positive correlation with profit risk and shows higher coefficient of determination values. The Sequential Minimal Optimization (SMO) algorithm achieves the highest accuracy in all four suggested filtering layers.
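For the DEA part, the CCR (constant-returns-to-scale) efficiency of one bank can be computed in multiplier form as a linear program; a hedged SciPy sketch, with X and Y as illustrative input and output matrices:

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, o):
    # X: (n, m) inputs, Y: (n, s) outputs for n banks; o: index of bank o.
    # Maximise u^T y_o  s.t.  v^T x_o = 1  and  u^T y_j - v^T x_j <= 0.
    n, m = X.shape
    s = Y.shape[1]
    c = np.concatenate([-Y[o], np.zeros(m)])          # variables z = [u, v]
    A_ub = np.hstack([Y, -X])                         # ratio <= 1 per bank
    A_eq = np.concatenate([np.zeros(s), X[o]])[None]  # normalise inputs
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return -res.fun                                   # efficiency in (0, 1]
```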
ARTICLE | doi:10.20944/preprints201902.0233.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: deep neural network architectures; supervised learning; unsupervised learning; testing neural networks; applications of deep learning; evolutionary computation
Online: 26 February 2019 (04:02:00 CET)
Deep learning has taken over - both in problems beyond the realm of traditional, hand-crafted machine learning paradigms and in capturing the imagination of the practitioner sitting on top of petabytes of data. As the public perception of the efficacy of deep neural architectures in complex pattern recognition tasks grows, up-to-date primers on the current state of affairs must follow. In this review, we present a refresher of the many different stacked, connectionist networks that make up the deep learning architectures, followed by automatic architecture optimization protocols using multi-agent approaches. Further, since guaranteeing system uptime is fast becoming an indispensable asset across multiple industrial modalities, we include an investigative section on testing neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where deep learning has emerged as a game-changing technology: be it anomalous behavior detection in financial applications or financial time-series forecasting, predictive and prescriptive analytics, medical imaging, natural language processing, or power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the deep learning community and to provide a handy reference to researchers seeking to embrace deep learning in their work for what it is: statistical pattern recognizers with unparalleled hierarchical structure learning capacity and the ability to scale with information.
ARTICLE | doi:10.20944/preprints201808.0269.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: social sensing; supervised learning; statistical methods; social networks; twitter; tweets; natural disaster; random forest; kernel density estimation
Online: 15 August 2018 (11:34:43 CEST)
In recent years, online social networks have received considerable attention in spatial modelling fields, given the critical information that can be extracted from them about events in real time; one of the most salient applications concerns natural disasters such as earthquakes. Although it is possible to retrieve data from these social networks with embedded geographic information provided by GPS, in many cases this is not possible. An alternative solution is to reconstruct specific locations using probabilistic language models, more specifically those based on Named Entity Recognition (NER), which extract place names from a user's description of an event occurring in a specific location (e.g., a collapsed building on a specific avenue). In this work, we present a methodology for using Twitter as a social sensor system for disasters: NER-extracted locations are scored with a kernel density estimation function for different subtopics originating from a natural disaster and mapped into geographic space. The proposed methodology is evaluated with tweets related to the 2017 earthquake in Mexico.
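A minimal scikit-learn sketch of the scoring step: fit a kernel density estimate to the coordinates of NER-extracted place mentions for one subtopic (e.g., collapsed buildings) and evaluate it over a geographic grid; the bandwidth is illustrative.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_heatmap(coords, grid, bandwidth=0.01):
    # coords, grid: (n, 2) arrays of (lat, lon); bandwidth in degrees
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(coords)
    return np.exp(kde.score_samples(grid))  # density score per grid cell
```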
ARTICLE | doi:10.20944/preprints201806.0282.v1
Subject: Environmental And Earth Sciences, Remote Sensing Keywords: land-use/land-cover; multi-decadal change analysis; irrigation ponds; textural features; supervised classification; multi-source data
Online: 18 June 2018 (16:40:31 CEST)
A multi-decadal change analysis of the irrigation ponds in Taoyuan, Taiwan was conducted by using multi-source data including digitized ancient maps, declassified single-band CORONA satellite images, and multispectral SPOT images. Supervised LULC classifications were conducted using four textural features derived from the single-band CORONA images and spectral features derived from SPOT images. Post-classification analysis revealed that the number of irrigation ponds in the study area decreased during the post-World War II farmland consolidation period (1945 – 1965) and the subsequent industrialization period (1970 – 2000). However, efforts on restoration of irrigation ponds in recent years have resulted in gradual increases in the number (9%) and total area (12%) of irrigation ponds in the study area.
ARTICLE | doi:10.20944/preprints202305.0917.v2
Subject: Engineering, Electrical And Electronic Engineering Keywords: Machine learning; Geriatric fall detection; Dataset; Dew Computing; End Device; Feature Extraction; Supervised Machine Learning; Sensor Data Analysis
Online: 1 October 2023 (09:38:25 CEST)
ARTICLE | doi:10.20944/preprints202309.0545.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: sigmoid function approximation; private machine learning; fully homomorphic encryption; log anomaly detection; supervised machine learning; probabilistic polynomial approximation
Online: 8 September 2023 (04:34:42 CEST)
Log collection and storage is a crucial process for enterprises around the globe. Log analysis helps identify potential security breaches and, in some cases, is required by law for compliance. However, enterprises often delegate these responsibilities to third-party Cloud Service Providers (CSPs), where the logs are collected, processed for anomaly detection, and stored in a data warehouse for archiving. Prevalent schemes rely on plain (unencrypted) data for anomaly detection. Moreover, these logs can reveal sensitive information about an organization or its customers, so it is best to keep them encrypted at all times. This paper presents "SigML++," an extension of the work done in "SigML." We utilize Fully Homomorphic Encryption (FHE) with the Cheon-Kim-Kim-Song (CKKS) scheme for supervised log anomaly detection on encrypted data. We use Artificial Neural Network (ANN) based probabilistic polynomial approximations with a perceptron with linear activation, probabilistically approximating the Sigmoid activation function (σ(x)) in the encrypted domain for the intervals [−10,10] and [−50,50]. Experiments show better approximations for Logistic Regression (LR) and Support Vector Machine (SVM) for low-order polynomials.
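As a hedged illustration of the core idea (outside encryption), a low-order polynomial can be fitted to the sigmoid over [−10, 10] by least squares; SigML++ uses a probabilistic ANN-based approximation instead, so this NumPy sketch is only a stand-in:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Least-squares fit of a degree-3 polynomial on the target interval.
x = np.linspace(-10, 10, 2001)
coeffs = np.polynomial.polynomial.polyfit(x, sigmoid(x), deg=3)
approx = np.polynomial.polynomial.polyval(x, coeffs)
print("max abs error:", np.abs(approx - sigmoid(x)).max())
```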
Subject: Computer Science And Mathematics, Computer Science Keywords: Textual data distributions; supervised learning; unsupervised learning; Kullback-Leibler divergence; sentiment; textual analytics; text generation; vaccine; stock market
Online: 17 June 2021 (10:03:41 CEST)
Efficient textual data distributions (TDD) alignment and generation are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study addresses a segment of the broader problem by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment and (ii) sentiment alignment. Furthermore, we use multiple text generation methods, including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally, we develop a unique process-driven variation of Kullback-Leibler divergence (KLD) applied to TDD, named KL Textual Distributions Contrasts (KL-TDC), to identify the alignment of machine-generated textual corpora with naturally occurring textual corpora. This study thus identifies a unique approach for generating and validating TDD by topic and sentiment, which can be used to help address sparse data problems and other research, practice, and classroom situations in need of artificially generated topic- or sentiment-aligned textual data.
ARTICLE | doi:10.20944/preprints202203.0093.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: 6-hydroxydopamine; rotenone; in vitro neurotoxicity; mitochondrial dysfunction; exploratory data analysis; applied computational statistics; unsupervised and supervised machine learning
Online: 7 March 2022 (09:16:28 CET)
With the increase in life expectancy and the consequent aging of the world's population, the prevalence of many neurodegenerative diseases is increasing, without concomitant improvement in diagnostics and therapeutics. These diseases share neuropathological hallmarks, including mitochondrial dysfunction. In fact, as mitochondrial alterations appear prior to neuronal cell death at an early phase of disease onset, the study and modulation of mitochondrial alterations emerge as promising strategies to predict and prevent neurotoxicity and neuronal cell death before the onset of cell viability alterations. In this work, differentiated SH-SY5Y cells were treated with the mitochondria-targeted neurotoxicants 6-hydroxydopamine and rotenone. These compounds were used at different concentrations and for different time points to understand the similarities and differences in their mechanisms of action. To accomplish this, data on mitochondrial parameters were acquired and analyzed using unsupervised (hierarchical clustering) and supervised (decision tree) machine learning methods. Both the biochemical and computational analyses resulted in an evident distinction between the neurotoxic effects of 6-hydroxydopamine and rotenone, specifically at the highest concentrations of both compounds.
ARTICLE | doi:10.20944/preprints201905.0382.v1
Subject: Engineering, Control And Systems Engineering Keywords: supervised machine learning; flood inundation mapping; high-resolution; synthetic aperture radar; height above nearest drainage; sentinel-1; inundated vegetation
Online: 31 May 2019 (08:48:14 CEST)
Floods are one of the most widespread, frequent, and devastating natural disasters and continue to increase in frequency and intensity. Remote sensing, specifically synthetic aperture radar (SAR), has been widely used to detect surface water inundation to provide retrospective and near-real-time (NRT) information, owing to its high spatial resolution, self-illumination, and low atmospheric attenuation. However, the efficacy of flood inundation mapping with SAR is susceptible to reflections and scattering from a variety of factors, including dense vegetation and urban areas. In this study, the topographic dataset height above nearest drainage (HAND) was investigated as a potential supplement to Sentinel-1A C-band SAR, along with supervised machine learning, to improve the detection of inundation in heterogeneous areas. Three machine learning classifiers were trained on two sets of features, SAR only (VV & VH) and VV, VH & HAND, to map inundated areas. Three study sites along the Neuse River in North Carolina, USA, during the record flood of Hurricane Matthew in October 2016 were selected. The binary classification analysis (inundated as positive vs. non-inundated as negative) revealed significant improvements when incorporating HAND in several metrics, including classification accuracy (ACC) (+37.1%), true positive rate (TPR) (+51.2%), and negative predictive value (NPV) (+23.7%). A marginal improvement of +1.4% was seen for positive predictive value (PPV), but the true negative rate (TNR) fell by 15.1%. By incorporating HAND, a significant number of areas with high SAR backscatter but low HAND values were detected as inundated, which increased true positives. This in turn also increased the false positives detected, but to a lesser extent, as evident in the metrics. This study demonstrates that HAND can be considered a valuable feature to enhance SAR flood inundation mapping, especially in areas with heterogeneous land cover and dense vegetation that interfere with SAR.
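A hedged scikit-learn sketch of the comparison above: train the same classifier on (VV, VH) and on (VV, VH, HAND) per-pixel feature stacks and report the binary metrics (inundated = positive); array names are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

def evaluate(features, labels, features_test, labels_test):
    clf = RandomForestClassifier(random_state=0).fit(features, labels)
    tn, fp, fn, tp = confusion_matrix(
        labels_test, clf.predict(features_test)).ravel()
    return {"ACC": (tp + tn) / (tp + tn + fp + fn),
            "TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)}

# Compare the two feature sets (hypothetical column layout):
# sar_only  = evaluate(X_train[:, :2], y_train, X_test[:, :2], y_test)
# with_hand = evaluate(X_train[:, :3], y_train, X_test[:, :3], y_test)
```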
ARTICLE | doi:10.20944/preprints202010.0436.v1
Subject: Computer Science And Mathematics, Mathematical And Computational Biology Keywords: Naïve Bayes Classification; Euler's Strength Formula; Cricket Prediction; Supervised Learning; KNIME Tool; sports analytics; multivariate regression; neural network
Online: 21 October 2020 (12:34:00 CEST)
In cricket, the Twenty20 format in particular is the most watched and loved by fans, as no one can guess who will win the match until the last ball of the last over. In India, the Indian Premier League (IPL) started in 2008 and is now the most popular T20 league in the world, with a noteworthy fan base and highly unpredictable matches. We therefore decided to develop a machine learning model for predicting the outcome of its matches. Winning a cricket match depends on many key factors, such as home-ground advantage, past performances and records at the same venue, the overall experience of the players, the record against a particular opposition, and the overall current form of the team and the individual players. This paper describes the key factors that affect the result of a cricket match and the regression model that best fits the data and gives the best predictions. The IPL match predictor is an ML-based prediction approach in which the datasets and previous statistics are used for training across all dimensions, covering important factors such as the toss, home ground, captains, favorite players, head-to-head record against the opposition, and previous statistics, with each factor given a different strength, implemented with the KNIME tool and the added intelligence of a Naive Bayes network and Euler's strength calculation formula.
ARTICLE | doi:10.20944/preprints202306.1005.v1
Subject: Computer Science And Mathematics, Security Systems Keywords: Continuous Authentication; Static Authentication; Behavioral Biometrics; Reinforcement Learning (RL); Q-learning; Keystroke Dynamics; Anomaly Detection; Machine Learning; Supervised Learning; User Authentication; Identification
Online: 14 June 2023 (07:42:17 CEST)
This article focuses on developing a continuous authentication system using behavioral biometrics to recognize users accessing computing devices. The user's distinct behavioral biometric is captured through keystroke dynamics, and reward-based reinforcement learning (RL) ideas are applied to recognize users throughout the session. The suggested system adds an extra layer of security to traditional authentication methods, forming a robust continuous authentication system that can be added to static authentication systems. The methodology involves training an RL model to detect unusual user typing patterns and flag suspicious activity. Each user has an agent trained on their historical data, which is preprocessed and used to create episodes for the agent to learn from. The environment involves fetching observations and randomly corrupting them to learn out-of-order behavior. The observation vector includes both running features and summary features. The reward function is binary and minimalistic. A Principal Component Analysis (PCA) model is used to encode the running features, and the Double Deep Q-Network (DDQN) algorithm with a fully connected neural network is used as the policy net. The evaluation achieved an average training accuracy and EER (equal error rate) of 94.7% and 0.0126, and a test accuracy and EER of 81.06% and 0.0323, for all users when the number of encoder features was increased. We therefore conclude that, by continuously learning and adapting to changing behavior patterns, this approach can provide more secure and personalized authentication, lowering the possibility of unauthorized access and cyberattacks. Overall, the use of reinforcement learning and behavioral biometrics for continuous authentication has the potential to significantly enhance security in the digital age and is effective in identifying each user.
ARTICLE | doi:10.20944/preprints202308.0240.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: semi-supervised consensus clustering; ensemble clustering; constrained clustering; analysis of clustering constraints; online anti-counterfeiting; clustering fraudulent websites; detection of counterfeit affiliate programs
Online: 3 August 2023 (02:48:26 CEST)
Semi-supervised consensus clustering is a promising strategy to compensate for the subjectivity of clustering and its sensitivity to design factors, with various techniques having recently been proposed to integrate domain knowledge and multiple clustering partitions. In this article we present a new approach that makes double use of domain knowledge, namely to build the initial partitions as well as to combine them. In particular, we show how to model and integrate must-link and cannot-link constraints into the objective function of a generic consensus clustering (CC) framework that maximizes the similarity between the consensus partition and the input partitions, which have, in turn, been enriched with the same constraints. In addition, borrowing from the theory of functional dependencies, the integrated framework exploits the notions of deductive closure and minimal cover to take full advantage of the logical implications between constraints. Using standard UCI benchmarks, we found that the resulting algorithm, termed double-Constrained Consensus Clustering (CCC), was more effective than plain CC at combining base constrained partitions. We then argue that CCC is especially well-suited to profiling counterfeit e-commerce websites, because constraints can be acquired by leveraging specific domain features, and demonstrate its potential for detecting affiliate marketing programs. Taken together, our experiments suggest that CCC makes the process of clustering more robust and able to withstand changes in clustering algorithms, datasets, and features, with a remarkable improvement in average performance.
ARTICLE | doi:10.20944/preprints202307.1098.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Artificial Intelligence; COVID-19; Digital Media; Emotions Detection; Machine Learning; Medical Informatics; Mental Health; Natural language Processing; SARS-COV-2; Social Media; Supervised Learning; Vaccination
Online: 17 July 2023 (10:16:20 CEST)
Global rapidly evolving events, e.g., COVID-19, are usually followed by countermeasures and policies, and as a reaction the public tends to express their emotions on social media platforms. Therefore, predicting emotional responses to events is critical for planning to avoid risky behaviors. This paper proposes a machine learning-based framework to detect public emotions based on social media posts in response to specific events. It presents a precise measurement of population-level emotions which can aid governance in monitoring public response and guide it in putting strategies in place, such as targeted monitoring of mental health to react to a rise in negative emotions in response to lockdowns, or information campaigns, for instance in response to elevated rates of fear around vaccination programs. We evaluate our framework by extracting 15,455 tweets. We annotate and categorize the emotions into 11 categories based on Plutchik's study of emotion and extract the features using a combination of Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF). We filter 813 COVID-19 vaccine-related tweets and use them to demonstrate our framework's effectiveness. Numerical evaluation of emotion prediction using Random Forest and Logistic Regression shows that our framework predicts emotions with an accuracy of up to 95%.
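An illustrative scikit-learn sketch of the described feature combination: Bag of Words and TF-IDF vectors are concatenated and fed to the two reported classifiers; tweet texts and 11-category emotion labels are assumed given.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Concatenate raw counts (Bag of Words) with TF-IDF weights.
features = FeatureUnion([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfVectorizer()),
])
rf = Pipeline([("features", features),
               ("clf", RandomForestClassifier(random_state=0))])
lr = Pipeline([("features", features),
               ("clf", LogisticRegression(max_iter=1000))])
# Hypothetical usage:
# rf.fit(train_texts, train_emotions); rf.score(test_texts, test_emotions)
```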
ARTICLE | doi:10.20944/preprints202007.0735.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: Variant of Unknown Significance (VUS); Single-Nucleotide Variant (SNV); Variant Effect Prediction (VEP); Stacked Ensemble of Supervised Deep Learners (SESDL); Next Generation Sequencing (NGS); Alternative Allele Frequency (AAF).
Online: 31 July 2020 (06:13:53 CEST)
Pathogenicity is unknown for the majority of human gene variants. In silico approaches can be utilized to prioritize sequenced somatic and germline variants. In this study, 84 million non-synonymous Single Nucleotide Variants (SNVs) in the human coding genome were annotated using a consensus Variant Effect Prediction (cVEP) method. An algorithm, implemented as a stacked ensemble of supervised learners, combined 39 functional and conservation mutation-impact scores from dbNSFP4.0. Adding a gene indispensability score, accounting for differences in the pathogenicities of variants in essential and mutation-tolerant genes, improved the predictions. For each SNV, the consensus combination gives either a continuous-valued pathogenicity score or a categorical score in five classes: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign. The provided class database is intended for direct use in clinical practice. The trained prediction models were 5-fold cross-validated on the evidence-based categorical annotations from the ClinVar database, and rankings of the scores based on their ability to predict pathogenicity were obtained. A two-step strategy using the rankings, scores, and class annotations is suggested for filtering and prioritizing human exome mutations in clinical and biological applications of NGS technology.