ARTICLE | doi:10.20944/preprints202009.0729.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Data Envelopment Analysis; Machine learning; Optimization; Parametric and non-parametric methods; Supervised and unsupervised models; CVS model
Online: 30 September 2020 (08:19:51 CEST)
The main purpose of this paper is to propose a novel optimization model combined with a new machine learning approach: the first section develops the optimization model, and the second applies it to achieve the best results in financial institutions. Since the consistency of efficiency scores derived from parametric and non-parametric methods is not significant, this paper provides a scientific assessment in the optimization section and proposes a novel combined parametric and non-parametric model, a new experiment in the literature. A scientific assessment of banks is introduced, based on a combination of the efficiency measurement methods CCR (Charnes, Cooper and Rhodes), also known as CRS (Constant Returns to Scale), and BCC (Banker, Charnes and Cooper), also known as VRS (Variable Returns to Scale), in Data Envelopment Analysis (DEA), together with the Stochastic Frontier Approach (SFA), for 65 banks from February to July 2020. To analyze the performance of the parametric and non-parametric approaches, we consider linear regression and the Unreplicated Linear Functional Relationship (ULFR). In the machine learning section, a novel four-layer data mining filtering pre-process is applied to the selected supervised classification and unsupervised clustering algorithms to increase accuracy and to remove unrelated attributes and data. For the four kinds of preprocessing approaches, unsupervised attribute, supervised attribute, supervised instance, and unsupervised instance, we choose discretization, attribute selection, stratified remove-folds, and resample filters, respectively. Based on the nature of the suggested financial institutions' dataset and attributes, the most appropriate preprocessing filter in each layer to achieve the highest performance is suggested. Finally, the superior bank, the best-performing model, and the most accurate algorithm are identified. The results indicate that bank number 56 is the superior bank.
Among the proposed techniques, the novel recommended CVS model, compared with the CCR-BCC and SFA models, has a stronger positive correlation with profit risk and shows higher coefficient-of-determination values. The Sequential Minimal Optimization (SMO) algorithm achieves the highest accuracy in all four suggested filtering layers.
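As a rough illustration of the four filtering layers, the sketch below maps each layer to a scikit-learn equivalent (KBinsDiscretizer, SelectKBest, resample, StratifiedKFold) on invented bank data, with SVC standing in for SMO (libsvm trains SVC with an SMO-type solver). The data, labels, and filter parameters are all assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # 200 synthetic bank records
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # hypothetical efficiency label

# layer 1: unsupervised attribute filter (discretization)
X1 = KBinsDiscretizer(n_bins=5, encode="ordinal",
                      strategy="uniform").fit_transform(X)
# layer 2: supervised attribute filter (attribute selection)
X2 = SelectKBest(f_classif, k=4).fit_transform(X1, y)
# layer 3: unsupervised instance filter (resample)
X3, y3 = resample(X2, y, random_state=0)
# layer 4: supervised instance filter (stratified folds) + SMO-style classifier
accs = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=0).split(X3, y3):
    clf = SVC().fit(X3[tr], y3[tr])               # libsvm trains SVC via SMO
    accs.append(clf.score(X3[te], y3[te]))
mean_acc = float(np.mean(accs))
```

The filter ordering here follows the abstract's layer list; in Weka the corresponding filters would be applied through its own filter hierarchy instead.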
ARTICLE | doi:10.20944/preprints202007.0325.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Data Center; Thermal Characteristics Analysis; Machine Learning, Energy Efficiency, Hotspots, Clustering Technique, Unsupervised Learning
Online: 15 July 2020 (09:16:23 CEST)
Energy efficiency of Data Center (DC) operations heavily relies on the performance of the IT and cooling systems. A reliable and efficient cooling system is necessary to produce a persistent flow of cold air to cool servers that are subjected to constantly increasing computational load due to the advent of IoT-enabled smart systems. Consequently, increased demand for computing power will bring about increased waste heat dissipation in data centers. To improve DC energy efficiency, it is imperative to analyze the thermal characteristics of the IT room, where waste heat accumulates. This work employs an unsupervised machine learning modelling technique to uncover weaknesses of the DC cooling system based on real DC thermal monitoring data. The findings of the analysis identify areas for energy efficiency improvement that feed into DC recommendations. The methodology employed for this research includes statistical analysis of IT room thermal characteristics and the identification of individual servers that frequently occur in hotspot zones. A critical analysis has been conducted on an available big dataset of ambient air temperature in the hot aisle of the ENEA Portici CRESCO6 computing cluster. Clustering techniques have been used for hotspot localization as well as for categorizing nodes based on surrounding air temperature ranges. The principles and approaches covered in this work are replicable for the energy efficiency evaluation of any DC and thus foster transferability. This work showcases the applicability of best practices and guidelines in the context of a real commercial DC and transcends the set of existing metrics for DC energy efficiency assessment.
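A minimal sketch of the hotspot-localization idea: cluster invented per-node hot-aisle temperatures with k-means and flag the hottest cluster. The temperature values and cluster count are assumptions, not CRESCO6 data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# hypothetical per-node mean hot-aisle temperatures (degrees C)
temps = np.concatenate([rng.normal(24, 1, 40),    # well-cooled nodes
                        rng.normal(30, 1, 40),    # warm nodes
                        rng.normal(38, 1, 10)])   # hotspot nodes
temps = temps.reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(temps)
hot_cluster = int(np.argmax(km.cluster_centers_)) # hottest centroid
hotspot_nodes = np.flatnonzero(km.labels_ == hot_cluster)
```

The nodes in `hotspot_nodes` would be the candidates for cooling-system intervention.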
CONCEPT PAPER | doi:10.20944/preprints202101.0273.v1
Subject: Engineering, Automotive Engineering Keywords: context management; device classification; IoT device management; k-Means clustering; ubiquitous computing; unsupervised machine learning
Online: 14 January 2021 (13:36:31 CET)
Ubiquitous computing comprises scenarios where networks, devices within the network, and software components change frequently. Market demand and cost-effectiveness are forcing device manufacturers to introduce new-age devices. Moreover, the Internet of Things (IoT) is rapidly transitioning to the Internet of Everything (IoE). At this enormous scale, effective management of these devices becomes vital to support trustworthy and high-quality applications. One of the key challenges of IoT device management is automatically classifying a device with a logically semantic type and using that as a parameter for device context management, which would enable smart security solutions. In this paper, a device classification approach based on unsupervised machine learning is proposed for the context management of ubiquitous devices. To classify unknown devices and label them logically, a proactive device classification model is framed using the k-Means clustering algorithm. To group devices, it uses network parameters such as the Received Signal Strength Indicator (rssi), packet_size, number_of_nodes in the network, throughput, etc. Experimental analysis suggests that the well-formedness of the clusters can be used to derive cluster labels as a logically semantic device type, which serves as context for resource management and authorization. This paper fulfills an identified need for proactive device classification for device management.
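The grouping step can be sketched as follows, assuming invented feature ranges for two hypothetical device types; only the k-Means-on-network-parameters idea is the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# hypothetical low-power sensors vs. IP cameras (feature ranges invented):
# columns = rssi (dBm), packet_size (bytes), number_of_nodes, throughput (Mbps)
sensors = np.column_stack([rng.normal(-80, 3, 50), rng.normal(64, 8, 50),
                           rng.normal(40, 5, 50), rng.normal(0.1, 0.02, 50)])
cameras = np.column_stack([rng.normal(-50, 3, 50), rng.normal(1400, 60, 50),
                           rng.normal(5, 1, 50), rng.normal(8, 1, 50)])

X = StandardScaler().fit_transform(np.vstack([sensors, cameras]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# each well-formed cluster would then be given a semantic device-type label
```

Standardizing the features first matters here, since packet size and throughput live on very different scales.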
ARTICLE | doi:10.20944/preprints201912.0015.v1
Subject: Social Sciences, Other Keywords: school sports facility; assessment; t-sne; fuzzy c mean; unsupervised learning
Online: 3 December 2019 (05:24:26 CET)
The aim of this study is (a) to develop, test, and employ a combined method of unsupervised machine learning to objectively assess the condition of sports facilities in primary schools (PSSFC) and (b) to examine the geographical and typological associations with PSSFC. Based on the Sixth National Sports Facility Census (NSFC), six PSSFC indicators (covering indoor and outdoor facilities) were selected as the measurements and decomposed using t-distributed stochastic neighbor embedding (t-SNE). Thereafter, the fuzzy c-means (FCM) algorithm was used to cluster the same type of PSSFC, selecting the optimum number of evaluation levels. Overall, 845 primary schools in Shanghai, China were recruited and tested with this combined unsupervised machine learning approach. In addition, two-way analysis of covariance was used to examine how school location and type are associated with the PSSFC variables at each level. The combined method was found to have acceptable reliability and good interpretability, differentiating PSSFC into five gradient levels. The characteristics of PSSFC differ by the location and type of the individual school. Our findings are conducive to regionalized and personalized intervention and promotion of children’s physical activity (PA) according to the practical situation of particular schools.
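The clustering stage can be sketched with a minimal numpy fuzzy c-means, applied here to synthetic 2-D points standing in for the t-SNE embedding of the PSSFC indicators (the t-SNE step itself is omitted; the fuzzifier m = 2 and the data are assumptions).

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Plain FCM: alternate weighted-centroid and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))           # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (60, 2)),        # stand-in t-SNE embedding
               rng.normal(4, 0.5, (60, 2))])
centers, U = fuzzy_c_means(X, c=2)
hard_labels = U.argmax(axis=1)                     # crisp levels if needed
```

In the study's setting, the number of clusters c would be chosen as the optimum number of evaluation levels (five, per the abstract).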
ARTICLE | doi:10.20944/preprints201801.0125.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: multi-criteria decision analysis (MCDA); online broker; misspecification of criteria; structural uncertainty; unsupervised machine learning; factor analysis, quality of service (QoS)
Online: 15 January 2018 (11:29:56 CET)
Multi-criteria decision analysis (MCDA), one of the prevalent branches of operations research, aims to design mathematical and computational tools for selecting the best alternative among several choices with respect to specific criteria. In the cloud, MCDA-based online brokers use customer-specified criteria to rank different service providers. However, owing to limited domain knowledge, the customer may exclude relevant or include irrelevant criteria, which could result in a suboptimal ranking of service providers. To deal with such misspecification, this research proposes a model that uses the notion of factor analysis from the domain of unsupervised machine learning. The model is evaluated using two quality-of-service (QoS) datasets. The first dataset, i.e., feedback from customers, was compiled using leading review websites such as Cloud Hosting Reviews, Best Cloud Computing Providers, and Cloud Storage Reviews and Ratings. The second dataset, i.e., feedback from servers, was generated from a cloud brokerage architecture emulated on the high performance computing (HPC) cluster at the University of Luxembourg (HPC @ Uni.lu). Simulation runs in a stable cloud environment, i.e., when uncertainty is low, show that an online broker equipped with the proposed model produces an optimized ranking of service providers compared to other brokers. This is due to the fact that the proposed model assigns priorities to criteria objectively (using machine learning) rather than using priorities based on the subjective judgments of the customer. This research will benefit potential cloud customers that view insufficient domain knowledge as a limiting factor for the acquisition of web services in the cloud.
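A hedged sketch of the factor-analysis idea on synthetic QoS scores: latent factors are recovered and their loadings used as objective criterion priorities. The data shapes and the loading-based weighting rule are illustrative assumptions, not the paper's exact model.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)
latent = rng.normal(size=(100, 2))                 # two true quality factors
loadings = rng.normal(size=(2, 6))                 # six observed QoS criteria
qos = latent @ loadings + 0.1 * rng.normal(size=(100, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(qos)
weights = np.abs(fa.components_).sum(axis=0)       # objective criterion weights
ranking = np.argsort(weights)[::-1]                # criteria by importance
```

Criteria with near-zero loadings on every factor are candidates for the "irrelevant" criteria the broker should down-weight.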
ARTICLE | doi:10.20944/preprints201802.0023.v1
Subject: Mathematics & Computer Science, Other Keywords: deep learning; graph kernels; unsupervised learning
Online: 4 February 2018 (10:52:50 CET)
This paper presents a new method, HIVEC, to learn latent vector representations of graphs in a manner that captures the semantic dependencies of sub-structures. The representations can then be used in machine learning algorithms for tasks such as graph classification, clustering, etc. The proposed method is unsupervised and uses co-occurrence information of sub-structures. It introduces a notion of hierarchical embeddings that allows us to avoid repetitively re-learning sub-structures for every new graph. As an alternative to deep learning methods, the edit-distance similarity between sub-structures is also used to learn vector representations. We compare the performance of these methods against previous work.
ARTICLE | doi:10.20944/preprints201809.0197.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: unsupervised, metric learning, embedding learning, laplacian, information theoretic, diffusion maps
Online: 19 September 2018 (13:53:42 CEST)
Unsupervised metric learning has generally been studied as a byproduct of dimensionality reduction or manifold learning techniques. Manifold learning techniques like diffusion maps and Laplacian eigenmaps have the special property that the embedded space is Euclidean. Although Laplacian eigenmaps can provide some (dis)similarity information, they do not provide a metric that can further be used on out-of-sample data. On the other hand, supervised metric learning techniques like ITML, which can learn a metric, need labeled data for learning. In this work we propose methods for incremental unsupervised metric learning. In the first approach, Laplacian eigenmaps are used along with Information Theoretic Metric Learning (ITML) to form an unsupervised metric learning method. We first project data onto a low-dimensional manifold using Laplacian eigenmaps; in the embedded space we use Euclidean distance to gauge the similarity between points. If the Euclidean distance between points in the embedded space is below a threshold t1 we consider them similar points, and if it is greater than a threshold t2 we consider them dissimilar points. Using this, we collect a batch of similar and dissimilar points, which are then used as constraints for the ITML algorithm to learn a metric. To prove this concept we have tested our approach on various UCI machine learning datasets. In the second approach we propose Incremental Diffusion Maps by updating the SVD in a batch-wise manner.
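The first approach's pair-mining step might look like the following sketch, using scikit-learn's SpectralEmbedding as the Laplacian-eigenmaps stage and quantile-based thresholds t1 and t2 (both invented); the ITML step that would consume the pairs is not shown.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 5)), rng.normal(2, 0.3, (30, 5))])
Z = SpectralEmbedding(n_components=2, affinity="rbf",
                      random_state=0).fit_transform(X)

D = squareform(pdist(Z))                           # Euclidean, embedded space
t1, t2 = np.quantile(D[D > 0], [0.1, 0.9])         # invented thresholds
similar = np.argwhere((D > 0) & (D < t1))          # ITML similarity constraints
dissimilar = np.argwhere(D > t2)                   # ITML dissimilarity constraints
```

The mined index pairs would be passed to an ITML implementation (e.g. the metric-learn package) as must-link / cannot-link constraints.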
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Unsupervised anomalous sound detection; classification-based model; Outlier classifier; ID classifier
Online: 17 August 2021 (08:36:44 CEST)
The task of unsupervised anomalous sound detection (ASD) is challenging: anomalous sounds must be detected from a large audio database without any annotated anomalous training data. Many unsupervised methods have been proposed, but previous works have confirmed that classification-based models far exceed unsupervised models in ASD. In this paper, we adopt two classification-based anomaly detection models: (1) an outlier classifier distinguishes anomalous sounds, or outliers, from normal sounds; (2) an ID classifier identifies anomalies using both the confidence of classification and the similarity of hidden embeddings. We conduct experiments on task 2 of the DCASE 2020 challenge, and our ensemble method achieves an averaged area under the curve (AUC) of 95.82% and an averaged partial AUC (pAUC) of 92.32%, outperforming state-of-the-art models.
ARTICLE | doi:10.20944/preprints202102.0083.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: SAR image classification; Spiking Neural Network(SNN); unsupervised learning
Online: 2 February 2021 (10:35:38 CET)
Recent neuroscience research shows that nerve information in the brain is not encoded by spatial information alone. Spiking neural networks based on pulse frequency coding play a very important role in processing brain signals, especially complicated spatio-temporal information. In this paper, an unsupervised learning algorithm for two-layer feedforward spiking neural networks based on spike-timing-dependent plasticity (STDP) competitiveness is proposed and applied to SAR image classification on MSTAR for the first time. The SNN learns autonomously from the input values without any labeled signal, and the overall classification accuracy on SAR targets reached 80.8%. The experimental results show that the algorithm adopts synaptic neurons and a network structure with stronger biological rationality and is able to classify targets in SAR images. Meanwhile, the feature-map extraction ability of the neurons is visualized through the generative property of the SNN, a beneficial attempt to apply brain-like neural networks to SAR image interpretation.
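The underlying pair-based STDP rule can be sketched in a few lines; the time constants and learning rates below are illustrative, not the paper's.

```python
import numpy as np

def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for spike-time difference dt = t_post - t_pre (ms):
    potentiate causal pairs (dt >= 0), depress anti-causal ones."""
    return np.where(dt >= 0, a_plus * np.exp(-dt / tau),
                    -a_minus * np.exp(dt / tau))

dts = np.array([-40.0, -5.0, 5.0, 40.0])
dw = stdp_dw(dts)   # close-in-time pairs change the synapse most
```

Competition between neurons (e.g. winner-take-all inhibition) is what turns this local rule into an unsupervised feature learner; that mechanism is not shown here.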
ARTICLE | doi:10.20944/preprints201810.0494.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: unsupervised training; features learning; deep learning; time series forecasting
Online: 22 October 2018 (12:24:43 CEST)
A continuous Deep Belief Network (cDBN) with two hidden layers is proposed in this paper, addressing the problem of weak feature learning when dealing with continuous data. In the cDBN, the input data are trained in an unsupervised way using continuous versions of the transfer functions; contrastive divergence is used in the hidden-layer training process to raise convergence speed; an improved dropout strategy is then implemented in unsupervised training to realize feature learning by de-cooperating the units; and the network is finally fine-tuned using the back-propagation algorithm. Besides, hyper-parameters are analysed through stability analysis to ensure the network can find the optimum. Finally, experiments on the Lorenz chaotic series, the CATS benchmark and other real-world series such as CO2 and wastewater parameter forecasting show that the cDBN has the advantages of higher accuracy, simpler structure and faster convergence than other methods.
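A minimal numpy sketch of one contrastive-divergence (CD-1) update for a single layer with a continuous (sigmoid) transfer function, in the spirit of the cDBN's unsupervised pre-training; the shapes, learning rate and the omitted bias terms are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid, lr = 8, 4, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))         # biases omitted for brevity
v0 = rng.random((32, n_vis))                       # batch of continuous inputs

h0 = sigmoid(v0 @ W)                               # up pass
v1 = sigmoid(h0 @ W.T)                             # reconstruction
h1 = sigmoid(v1 @ W)                               # up pass on reconstruction
W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)        # CD-1 gradient step
recon_err = float(np.mean((v0 - v1) ** 2))
```

Stacking two such layers and then fine-tuning the whole stack with back-propagation gives the overall training recipe the abstract describes.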
ARTICLE | doi:10.20944/preprints202011.0696.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Archaeological Data Science; Artificial Intelligence; Unsupervised Learning; Generative Adversarial Networks; Robust Statistics.
Online: 27 November 2020 (14:43:36 CET)
The fossil record is notorious for being incomplete and distorted, frequently conditioning the type of knowledge that can be extracted from it. In many cases, this leads to issues when performing complex statistical analyses, such as classification tasks, predictive modelling, and variance analyses, such as those used in Geometric Morphometrics. Here, different Generative Adversarial Network architectures are experimented with, testing the effects of sample size and domain dimensionality on model performance. Robust statistical methods were used for model evaluation. Each of the algorithms was observed to produce realistic data. Generative Adversarial Networks using different loss functions produced multidimensional synthetic data statistically equivalent to the original training data. Conditional Generative Adversarial Networks were not as successful. The methods proposed are likely to reduce the impact of sample size and bias on a number of statistical learning applications. While Generative Adversarial Networks are not the solution to all sample-size-related issues, combined with other pre-processing steps these limitations may be overcome. This presents a valuable means of augmenting geometric morphometric datasets for greater predictive visualization.
Subject: Earth Sciences, Geoinformatics Keywords: indoor scene recognition; unsupervised representation learning; Siamese network; graph constraints
Online: 19 March 2019 (13:11:09 CET)
Indoor scene recognition has great significance for intelligent applications such as mobile robots, location-based services (LBS) and so on. Wherever we are or whatever we do, we are in a specific scene. The human brain can easily discern a scene with a quick glance. However, for a machine to achieve this purpose, on the one hand, plenty of well-annotated data are often required, which is time-consuming and labor-intensive to obtain. On the other hand, it is hard to learn effective visual representations due to the large intra-category variation and inter-category similarity of indoor scenes. To solve these problems, in this paper we adopt an unsupervised visual representation learning method which learns from unlabeled data with a Siamese Convolutional Neural Network (Siamese ConvNet) and graph-based constraints. Specifically, we first mine relationships between unlabeled samples with a graph structure; these relationships are then used as supervision for representation learning with a Siamese network. In this method, a k-NN graph is first constructed by taking each image as a node and linking it to its k nearest neighbors to form the edges. With this graph, cycle consistency and geodesic distance are used as criteria for mining positive and negative pairs, respectively. In other words, by detecting cycles in the graph, images with large differences but in the same cycle can be considered the same category (positive pairs); by computing geodesic distance instead of Euclidean distance from one node to another, two nodes with a large geodesic distance can be regarded as belonging to different categories (negative pairs). After that, visual representations of indoor scenes can be learned by a Siamese network in an unsupervised manner with the mined pairs as inputs. To evaluate the proposed method, we tested it on two scene-centric datasets, MIT67 and Places365.
Experiments with different numbers of categories have been conducted to explore the potential of the proposed method. The results demonstrate that semantic visual representations of indoor scenes can be learned in this unsupervised manner. In addition, indoor scene recognition models trained with the learned representations and a few labeled samples achieve competitive performance compared with state-of-the-art approaches.
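The negative-pair mining via geodesic distance can be sketched as below on synthetic features: a k-NN graph is built and graph shortest-path distances replace Euclidean ones (cycle detection for positive pairs is omitted; k and the quantile threshold are assumptions).

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.4, (25, 16)),   # one visual category
                   rng.normal(3, 0.4, (25, 16))])  # another category

G = kneighbors_graph(feats, n_neighbors=4, mode="distance")
geo = shortest_path(G, directed=False)             # geodesic distances
finite = geo[np.isfinite(geo)]
# large geodesic distance (including disconnected pairs) -> negative pairs
negatives = np.argwhere(geo > np.quantile(finite, 0.9))
```

Pairs in different graph components end up at infinite geodesic distance, which is exactly the "clearly different category" signal the method exploits.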
ARTICLE | doi:10.20944/preprints202109.0389.v1
Subject: Engineering, Other Keywords: Deep learning; Variational Autoencoders (VAEs); data representation learning; generative models; unsupervised learning; few shot learning; latent space; transfer learning
Online: 22 September 2021 (16:04:22 CEST)
Despite the importance of few-shot learning, the lack of labeled training data in the real world makes it extremely challenging for existing machine learning methods, as such a limited data set does not represent the data variance well. In this research, we suggest employing a generative approach using variational autoencoders (VAEs), which can be used specifically to optimize few-shot learning tasks by generating new samples with more intra-class variation. The purpose of our research is to increase the size of the training data set using various methods, in order to improve the accuracy and robustness of few-shot face recognition. Specifically, we employ the VAE generator to enlarge the training data set, including both the base and the novel sets, while utilizing transfer learning as the backend. Based on extensive experimental research, we analyze various data augmentation methods to observe how each affects the accuracy of face recognition. We conclude that the proposed face generation method can effectively improve the recognition accuracy to 96.47% using both the base and the novel sets.
ARTICLE | doi:10.20944/preprints201902.0233.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: deep neural network architectures; supervised learning; unsupervised learning; testing neural networks; applications of deep learning; evolutionary computation
Online: 26 February 2019 (04:02:00 CET)
Deep learning has taken over - both in problems beyond the realm of traditional, hand-crafted machine learning paradigms as well as in capturing the imagination of the practitioner sitting on top of petabytes of data. While the public perception about the efficacy of deep neural architectures in complex pattern recognition tasks grows, sequentially up-to-date primers on the current state of affairs must follow. In this review, we seek to present a refresher of the many different stacked, connectionist networks that make up the deep learning architectures followed by automatic architecture optimization protocols using multi-agent approaches. Further, since guaranteeing system uptime is fast becoming an indispensable asset across multiple industrial modalities, we include an investigative section on testing neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where deep learning has emerged as a game-changing technology - be it anomalous behavior detection in financial applications or financial time-series forecasting, predictive and prescriptive analytics, medical imaging, natural language processing or power systems research. The thrust of this review is on outlining emerging areas of application-oriented research within the deep learning community as well as to provide a handy reference to researchers seeking to embrace deep learning in their work for what it is: statistical pattern recognizers with unparalleled hierarchical structure learning capacity with the ability to scale with information.
Subject: Medicine & Pharmacology, Allergology Keywords: White matter lesions; white matter hyperintensities; supervised segmentation; unsupervised segmentation; deep learning; FLAIR hyperintensities
Online: 20 November 2020 (13:44:46 CET)
Background: White matter hyperintensities (WMH), of presumed vascular origin, are visible and quantifiable neuroradiological markers of brain parenchymal change. These changes may range from damage secondary to inflammation and other neurological conditions through to healthy ageing. Fully automatic WMH quantification methods are promising, but traditional semi-automatic methods still seem to be preferred in clinical research. We systematically reviewed the literature for fully automatic methods developed in the last five years, to assess what are considered state-of-the-art techniques, as well as trends in the analysis of WMH of presumed vascular origin. Method: We registered the systematic review protocol with the International Prospective Register of Systematic Reviews (PROSPERO), registration number CRD42019132200. We conducted the search for fully automatic methods developed from 2015 to July 2020 on Medline, ScienceDirect, IEEE Xplore, and Web of Science. We assessed the risk of bias and applicability of the studies using QUADAS-2. Results: The search yielded 2327 papers after removing 104 duplicates. After screening titles, abstracts and full text, 37 were selected for detailed analysis. Of these, 16 proposed a supervised segmentation method, 10 an unsupervised segmentation method, and 11 a deep learning segmentation method. Average DSC values ranged from 0.538 to 0.91, with the highest value obtained by an unsupervised segmentation method. Only four studies validated their method in longitudinal samples, and eight performed an additional validation using clinical parameters. Only 8/37 studies made their method available in public repositories. Conclusions: We found no evidence that favours deep learning methods over the more established k-NN, linear regression and unsupervised methods in this task.
Data and code availability, bias in study design and ground truth generation influence the wider validation and applicability of these methods in clinical research.
Subject: Keywords: Textual data distributions; supervised learning; unsupervised learning; Kullback-Leibler divergence; sentiment; textual analytics; text generation; vaccine; stock market
Online: 17 June 2021 (10:03:41 CEST)
Efficient textual data distributions (TDD) alignment and generation are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study addresses a segment of the broader problem described above by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment and (ii) sentiment alignment. Furthermore, we use multiple text generation methods, including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally, we develop a unique process-driven variation of Kullback-Leibler divergence (KLD) application to TDD, named KL Textual Distributions Contrasts (KL-TDC), to identify the alignment of machine-generated textual corpora with naturally occurring textual corpora. This study thus identifies a unique approach for generating and validating TDD by topic and sentiment, which can be used to help address sparse data problems and other research, practice and classroom situations in need of artificially generated topic- or sentiment-aligned textual data.
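A hedged sketch of contrasting two textual data distributions with a smoothed KL divergence over unigram frequencies; the Laplace smoothing and toy corpora are illustrative, not the KL-TDC procedure itself.

```python
import numpy as np
from collections import Counter

def unigram_dist(tokens, vocab, alpha=1.0):
    """Laplace-smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    freqs = np.array([counts[w] for w in vocab], dtype=float) + alpha
    return freqs / freqs.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

corpus_a = "the vaccine rollout improved public sentiment".split()
corpus_b = "the stock market sentiment turned negative quickly".split()
vocab = sorted(set(corpus_a) | set(corpus_b))

p, q = unigram_dist(corpus_a, vocab), unigram_dist(corpus_b, vocab)
kl_ab = kl_divergence(p, q)   # asymmetric: KL(p||q) != KL(q||p) in general
```

Smoothing over a shared vocabulary keeps both distributions fully supported, so the divergence stays finite even when a word occurs in only one corpus.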
ARTICLE | doi:10.20944/preprints201808.0421.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Graph clustering, Unsupervised structure learning, Network module inference
Online: 23 August 2018 (16:16:05 CEST)
Here we present a fast and highly scalable, community-structure-preserving network module detection method that recursively integrates graph sparsification and clustering. Our algorithm, called SparseClust, participated in the most recent DREAM community challenge on disease module identification, an open competition to comprehensively assess module identification methods across a wide range of biological networks.
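The two ingredients of the recursion can be sketched on a toy weighted graph: drop weak edges (sparsification), then read off modules as connected components; SparseClust itself is more involved, and the threshold here is invented.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
n = 20
A = rng.random((n, n)) * 0.2                       # weak background edges
A[:10, :10] += 0.7                                 # module 1
A[10:, 10:] += 0.7                                 # module 2
A = np.triu(A, 1)
A = A + A.T                                        # symmetric, no self-loops

A_sparse = np.where(A > 0.5, A, 0.0)               # sparsification step
n_comp, labels = connected_components(csr_matrix(A_sparse), directed=False)
```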
ARTICLE | doi:10.20944/preprints202103.0352.v1
Subject: Physical Sciences, Acoustics Keywords: remote sensing; spectroscopy; blind source separation; unsupervised clustering; insects
Online: 12 March 2021 (20:16:55 CET)
In-situ characterization of flying insects using remote sensing spectroscopy is an emerging research field. Most analysis techniques in remote sensing spectroscopy are based on an intensity threshold, which introduces indeterminacies in the number of detected specimens. In this manuscript, we investigate the possibility of analysing passive remote sensing spectroscopy measurement data using the maximum noise fraction method. The results show that this analysis technique can help overcome background noise in spectroscopic measurements.
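A sketch of a maximum noise fraction transform on synthetic multi-band data: noise covariance is estimated from first differences and a generalized eigenproblem orders components by signal-to-noise. The difference-based noise model and the band construction are assumptions, not the manuscript's data.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 500)
bands = np.column_stack([np.sin(2 * np.pi * 5 * t)] * 4) @ rng.normal(size=(4, 4))
data = bands + 0.3 * rng.normal(size=bands.shape)  # shared signal + white noise

cov = np.cov(data, rowvar=False)
noise = np.diff(data, axis=0)                      # difference-based noise proxy
cov_n = np.cov(noise, rowvar=False) / 2.0          # Var(x_i - x_{i-1}) = 2 Var(noise)

eigvals, eigvecs = eigh(cov, cov_n)                # generalized eigenproblem
mnf = data @ eigvecs[:, ::-1]                      # components by decreasing SNR
```

The leading MNF components concentrate the coherent signal, which is what lets noisy spectroscopic measurements be analysed without an intensity threshold.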
ARTICLE | doi:10.20944/preprints201712.0001.v3
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: signal processing; bayesian methods; subaquatic audio; hydrophone; unsupervised learning
Online: 8 January 2018 (18:29:11 CET)
The problem of event detection in general noisy signals arises in many applications; usually, either a functional form for the event is available, or a previously annotated sample with instances of the event can be used to train a classification algorithm. There are situations, however, where neither functional forms nor annotated samples are available; it is then necessary to apply other strategies to separate and characterize events. In this work, we analyze 15-minute-long samples of an acoustic signal and are interested in separating sections, or segments, of the signal which are likely to contain significant events. For that, we apply a sequential algorithm with the only assumption that an event alters the energy of the signal. The algorithm is entirely based on Bayesian methods.
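The algorithm's single assumption, that an event alters the signal's energy, can be illustrated without the Bayesian machinery: windowed energy is compared against a baseline, and deviating windows are flagged (the window length and threshold below are invented).

```python
import numpy as np

rng = np.random.default_rng(0)
x = 0.1 * rng.normal(size=10_000)                  # background acoustic noise
x[4000:4500] += 0.8 * rng.normal(size=500)         # synthetic event

win = 250
energy = np.array([np.mean(x[i:i + win] ** 2)
                   for i in range(0, len(x) - win + 1, win)])
baseline = np.median(energy)
events = np.flatnonzero(energy > 5 * baseline)     # candidate event segments
```

The paper's sequential Bayesian treatment replaces this fixed threshold with a probabilistic decision, but the energy-deviation signal it acts on is the same.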
ARTICLE | doi:10.20944/preprints202210.0355.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Auto encoder; surface defects; abnormal defects; visual inspection; unsupervised defect
Online: 24 October 2022 (07:56:04 CEST)
Currently, most deep learning methods cannot cope with the scarcity of industrial product defect samples and the significant differences in their characteristics. This paper proposes an unsupervised defect detection algorithm based on a reconstruction network, realized using only a large number of easily obtained defect-free samples. The network includes two parts: image reconstruction and surface defect area detection. The reconstruction network is designed as a fully convolutional autoencoder with a lightweight structure. Only a small number of normal samples are used for training, so that the reconstruction network can generate a defect-free reconstructed image. A function combining structural loss and L1 loss is proposed as the loss function of the reconstruction network to address the poor detection of irregular texture surface defects. Further, the residual between the reconstructed image and the image under test is used as the possible defect region, and conventional image operations can then locate the fault. The unsupervised defect detection algorithm of the proposed reconstruction network is evaluated on multiple defect image sample sets. Compared with other similar algorithms, the results show that it is robust and accurate.
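The detection stage can be illustrated with the residual-and-threshold logic alone: here the "reconstruction" is simply the clean image, standing in for a trained autoencoder's output, so only the localization step is real.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = 0.5 + 0.01 * rng.normal(size=(64, 64))     # defect-free surface image
test = clean.copy()
test[20:28, 30:38] += 0.4                          # synthetic surface defect

recon = clean                                      # stand-in for the AE output
residual = np.abs(test - recon)                    # defect lives in the residual
mask = residual > 3 * residual.std()               # conventional image operation
ys, xs = np.nonzero(mask)
bbox = (ys.min(), ys.max(), xs.min(), xs.max())    # localized fault region
```

In the actual pipeline the reconstruction comes from the trained autoencoder, so residuals are noisy and morphological clean-up is typically applied before extracting the region.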
ARTICLE | doi:10.20944/preprints202107.0257.v1
Subject: Keywords: Hyperspectral images, unsupervised algorithm, clustering, K-means algorithm, spectral signature.
Online: 12 July 2021 (12:14:58 CEST)
Hyperspectral images contain a wide range of bands or wavelengths, due to which they are rich in information. These images are taken by specialized sensors and then investigated through various supervised or unsupervised learning algorithms. Data acquired by a hyperspectral image contain plenty of information, so they can be used in applications where materials must be analyzed keenly and even the smallest difference can be detected on the basis of the spectral signature, e.g. remote sensing applications. In order to retrieve information about the area of concern, the image has to be grouped into different segments so that it can be analyzed conveniently. In this way, only the portions of the image that hold relevant information need be studied, and the rest can be discarded. Image segmentation can be done to assort all pixels into groups. Many methods can be used for this purpose, but in this paper we discuss k-means clustering to assort the data in AVIRIS Cuprite, AVIRIS Muffet and ROSIS Pavia, in order to calculate the number of regions in each image and to retrieve information from the 1st, 10th and 100th bands. Clustering was done easily and efficiently, as the k-means algorithm is the most straightforward approach to retrieving this information.
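The segmentation step amounts to reshaping the cube to pixels-by-bands and clustering the spectral signatures; the sketch below uses a random stand-in cube, not AVIRIS/ROSIS imagery, and an invented cluster count.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
cube = rng.random((30, 30, 50))                    # rows x cols x bands
cube[:15] += 1.0                                   # two crude spectral regions

pixels = cube.reshape(-1, cube.shape[2])           # one spectrum per pixel
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
segmented = labels.reshape(cube.shape[:2])         # region map of the scene
band_10 = cube[:, :, 9]                            # e.g. retrieve the 10th band
```

Each cluster groups pixels with similar spectral signatures, so the region map can be inspected band by band as described above.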
ARTICLE | doi:10.20944/preprints202108.0409.v1
Subject: Medicine & Pharmacology, Oncology & Oncogenics Keywords: Renal cancers; oncocytoma; chromophobe; transcriptomics; machine learning; clustering; gene signature; unsupervised learning
Online: 20 August 2021 (11:23:43 CEST)
Chromophobe renal cell carcinoma (chRCC) and oncocytoma (RO) are renal tumor types originating from alpha intercalated cells of the collecting ducts of the kidney. Both tumor types have similar gross histological morphology and increased mitochondria, which leads to difficulties differentiating between these tumors, especially with core biopsy samples. This study aims to apply a machine learning approach to develop a molecular classifier based on transcriptomics data. Here we generated a meta-dataset containing 62 chRCC and 45 RO gene expression arrays. Arrays were subjected to quality control steps, and genes were selected based on differential expression and ROC analysis. The final gene list was evaluated with UMAP-based dimension reduction followed by density-based clustering, with 95.5% accuracy. Molecular profiling by KEGG pathway analysis identified enrichment of the fatty acid oxidation pathway in RO. We finally identified and validated a 30-gene signature, with an accuracy of 94.4%, to distinguish chRCC from RO on UMAP analysis. Our results show that chRCC and RO have a distinct gene signature that can differentiate these tumors and complement histology for routine diagnosis of these two tumors.
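The per-gene ROC screening step mentioned above can be sketched as follows; the expression matrix, sample counts, and AUC cut-off below are invented for illustration and are not the study's data or thresholds:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical expression matrix: 40 "chRCC" + 30 "RO" samples x 200 genes.
rng = np.random.default_rng(1)
X = rng.normal(size=(70, 200))
y = np.array([0] * 40 + [1] * 30)   # 0 = chRCC, 1 = RO
X[y == 1, :5] += 2.0                # 5 genes made truly discriminative

# Score each gene by how well it alone separates the two tumor types.
auc = np.array([roc_auc_score(y, X[:, g]) for g in range(X.shape[1])])

# Keep genes whose AUC deviates strongly from 0.5 in either direction.
selected = np.where(np.abs(auc - 0.5) > 0.3)[0]
```

In the study, such a filter is combined with differential-expression criteria before the UMAP/clustering evaluation.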
ARTICLE | doi:10.20944/preprints202004.0524.v2
Subject: Biology, Other Keywords: unsupervised learning; tensor decomposition; feature selection; COVID-19; drug discovery; gene expression
Online: 3 June 2020 (05:29:09 CEST)
Background: COVID-19 is a critical pandemic that has affected human communities worldwide, and there is an urgent need to develop effective drugs. Although there are a large number of candidate drug compounds that may be useful for treating COVID-19, the evaluation of these drugs is time-consuming and costly. Thus, screening to identify potentially effective drugs prior to experimental validation is necessary. Method: In this study, we applied the recently proposed method tensor decomposition (TD)-based unsupervised feature extraction (FE) to gene expression profiles of multiple lung cancer cell lines infected with severe acute respiratory syndrome coronavirus 2. We identified drug candidate compounds that significantly altered the expression of the 163 genes selected by TD-based unsupervised FE. Results: Numerous drugs were successfully screened, including many known antiviral drug compounds such as C646, chelerythrine chloride, canertinib, BX-795, sorafenib, QL-X-138, radicicol, A-443654, CGP-60474, alvocidib, mitoxantrone, QL-XII-47, geldanamycin, fluticasone, atorvastatin, quercetin, motexafin gadolinium, trovafloxacin, doxycycline, meloxicam, gentamicin, and dibromochloromethane. The screen also identified ivermectin, which was first identified as an anti-parasite drug and has recently been included in clinical trials for SARS-CoV-2. Conclusions: The drugs screened using our strategy may be effective candidates for treating patients with COVID-19.
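A much-simplified stand-in for TD-based unsupervised FE is an HOSVD-style step: unfold a genes × cell-lines × conditions tensor, take the leading singular vector as the shared expression pattern, and flag genes with outlying loadings. All data, sizes, and the plain z-score cut-off below are illustrative assumptions (the paper uses an adjusted-P criterion):

```python
import numpy as np

# Invented tensor: 500 genes x 5 cell lines x 4 conditions.
rng = np.random.default_rng(2)
n_genes, n_lines, n_cond = 500, 5, 4
tensor = rng.normal(size=(n_genes, n_lines, n_cond))
tensor[:20] += 1.5          # 20 genes respond coherently to infection

# Unfold along the gene mode; the leading left singular vector captures
# the dominant expression pattern shared across lines and conditions.
unfolded = tensor.reshape(n_genes, -1)
u, s, vt = np.linalg.svd(unfolded, full_matrices=False)
loading = u[:, 0]

# Flag genes with outlying loadings via a simple z-score cut-off.
z = (loading - loading.mean()) / loading.std()
selected = np.where(np.abs(z) > 2.5)[0]
```

The selected gene set would then be matched against drug-perturbation expression databases, as the abstract describes.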
ARTICLE | doi:10.20944/preprints202209.0347.v1
Subject: Earth Sciences, Geoinformatics Keywords: digital soil mapping; soil process units; soil parameter space; machine learning; unsupervised classification
Online: 22 September 2022 (15:08:05 CEST)
The national-scale evaluation and modelling of the impact of agricultural management and climate change on soils, crop growth, and the environment require soil information at a spatial resolution addressing individual agricultural fields. This manuscript presents a data science approach which agglomerates the soil parameter space into a limited number of functional soil process units (SPUs) which may be used to run agricultural process models. In fact, two unsupervised classification methods were developed to generate a multivariate 3D data product consisting of SPUs, each being defined by a multivariate parameter distribution along the depth profile from 0 to 100 cm. The two methods account for differences in variable types and distributions and involve genetic algorithm optimization to identify those SPUs with the lowest internal variability and the maximum inter-unit difference with regard to both their soil characteristics and their landscape setting. The high potential of the methods was demonstrated by applying them to the agricultural German soil landscape. The resulting data product consists of twenty SPUs. It has a 100 m raster resolution in the 2D mapping space, and its resolution along the depth profile is 1 cm. It includes the soil properties texture, stone content, bulk density, hydromorphic properties, total organic carbon content, and pH.
ARTICLE | doi:10.20944/preprints202201.0452.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: intelligent reflecting surface; low Earth orbit satellite; graph attention networks; unsupervised learning; beamforming
Online: 31 January 2022 (11:47:07 CET)
Satellite communication is expected to play a vital role in realizing Internet of Remote Things (IoRT) applications. This article considers an intelligent reflecting surface (IRS)-assisted downlink low Earth orbit (LEO) satellite communication network, where the IRS provides additional reflective links to enhance the intended signal power. We aim to maximize the sum-rate of all the terrestrial users by jointly optimizing the satellite’s precoding matrix and the IRS’s phase shifts. However, it is difficult to directly acquire the instantaneous channel state information (CSI) and optimal phase shifts of the IRS due to the high mobility of LEO satellites and the passive nature of the reflective elements. Moreover, most conventional solution algorithms suffer from high computational complexity and are not applicable to these dynamic scenarios. A robust beamforming design based on graph attention networks (RBF-GAT) is proposed to establish a direct mapping from the received pilots and dynamic network topology to the satellite and IRS beamforming, trained offline using an unsupervised learning approach. The simulation results corroborate that the proposed RBF-GAT achieves performance close to the upper bound with low complexity.
ARTICLE | doi:10.20944/preprints202106.0634.v1
Subject: Keywords: Hyperspectral image; HSI; PCA; K-means clustering; unsupervised; classification; bands; satellite; ROSIS; AVIRIS
Online: 28 June 2021 (10:01:41 CEST)
The visualization of hyperspectral images on display devices with RGB colour channels is difficult due to the high dimensionality of these images. Thus, principal component analysis has been used as a dimensionality reduction algorithm, creating uncorrelated features while minimizing information loss. To classify regions in the hyperspectral images, K-means clustering has been used to form clusters/regions. These two algorithms have been implemented on three datasets imaged by the AVIRIS and ROSIS sensors.
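The PCA-plus-K-means pipeline above can be sketched on a synthetic cube; the scene sizes and cluster counts are illustrative assumptions, not the AVIRIS/ROSIS datasets:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy stand-in for a hyperspectral scene: (rows, cols, bands).
rng = np.random.default_rng(3)
cube = rng.normal(size=(16, 16, 60))
cube[:, :8, :] += 2.0                  # two spectrally distinct regions

pixels = cube.reshape(-1, cube.shape[-1])

# PCA compresses the correlated bands into 3 uncorrelated channels that
# can be displayed directly as an RGB composite.
reduced = PCA(n_components=3).fit_transform(pixels)
rgb = reduced.reshape(16, 16, 3)

# K-means then partitions the reduced pixels into regions.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
```

Here `rgb` is the displayable 3-channel composite and `labels` the per-pixel region assignment.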
ARTICLE | doi:10.20944/preprints202105.0674.v1
Subject: Engineering, Automotive Engineering Keywords: Offshore wind; life extension; modern portfolio theory; unsupervised machine learning; monopile; risk management
Online: 27 May 2021 (14:01:13 CEST)
The present study aims to develop a risk-based approach to find optimal solutions for life extension management for offshore wind farms, based on Markowitz’s modern portfolio theory adapted from finance. The developed risk-based approach assumes that offshore wind turbines (OWT) can be considered as cash-producing tangible assets providing a positive return on the initial investment (capital) at a given risk while attaining the targeted (expected) return. In this regard, the present study performs a techno-economic life extension analysis within the scope of a multi-objective optimisation problem. The first objective is to maximise the return from the overall wind assets, while the second aims to minimise the risk associated with obtaining that return. In formulating the multi-dimensional optimisation problem, the life-extension assessment considers the results of a detailed structural integrity analysis, a free-cash-flow analysis, the probability of project failure, and local and global economic constraints. Further, the risk is identified as the variance from the expected mean return on investment. The risk-return diagram is utilised to classify the OWTs into different classes using an unsupervised machine learning algorithm. Optimal portfolios for various required rates of return are recommended for different stages of life extension.
ARTICLE | doi:10.20944/preprints202011.0605.v1
Subject: Earth Sciences, Atmospheric Science Keywords: parameter-free spectral clustering; Lagrangian Coherent Structures; clusters; geophysical flows; unsupervised machine learning
Online: 24 November 2020 (09:25:02 CET)
In Lagrangian dynamics, the detection of coherent clusters can help understand the organization of transport by identifying regions with coherent trajectory patterns. Many clustering algorithms, however, rely on user-input parameters, requiring a priori knowledge about the flow and making the outcome subjective. Building on the conventional spectral clustering method of Hadjighasem et al. (2016), a new parameter-free spectral clustering approach is developed that automatically identifies parameters and does not require any user-input choices. A noise-based metric for quantifying the coherence of the resulting coherent clusters is also introduced. The parameter-free spectral clustering is applied to two benchmark analytical flows, the Bickley Jet and the asymmetric Duffing oscillator, and to a realistic, numerically-generated oceanic coastal flow. In the latter case, the identified model-based clusters are tested using observed trajectories of real drifters. In all examples, our approach succeeded in performing the partition of the domain into coherent clusters with minimal inter-cluster similarity and maximum intra-cluster similarity. For the coastal flow, the resulting coherent clusters are qualitatively similar over the same phase of the tide on different days and even different years, whereas coherent clusters for the opposite tidal phase are qualitatively different.
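One ingredient of such a parameter-free approach, reading the number of coherent clusters off the eigengap of the normalized graph Laplacian, can be sketched as follows. The toy point clouds and the similarity scale 0.5 are assumptions for illustration; the paper's full method also automates that scale:

```python
import numpy as np

# Three well-separated trajectory "groups" in a 2-D feature space.
rng = np.random.default_rng(4)
centers = [(0, 0), (3, 0), (0, 3)]
pts = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in centers])

# Gaussian similarity graph between points.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))
deg = W.sum(axis=1)

# Symmetric normalized graph Laplacian.
L = np.eye(len(pts)) - W / np.sqrt(np.outer(deg, deg))

# The largest gap in the low end of the spectrum gives the cluster count.
vals = np.linalg.eigvalsh(L)           # ascending eigenvalues
gaps = np.diff(vals[:10])
k = int(np.argmax(gaps) + 1)
```

With well-separated groups, the Laplacian has (approximately) one near-zero eigenvalue per connected cluster, so the eigengap recovers `k` without user input.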
ARTICLE | doi:10.20944/preprints201911.0218.v1
Subject: Earth Sciences, Environmental Sciences Keywords: Landsat; Google Earth; water index; unsupervised image classification; supervised image classification; Kappa coefficient
Online: 19 November 2019 (03:10:17 CET)
To address three important issues in the extraction of water features from Landsat imagery, i.e., the selection of water indexes, the selection of classification algorithms, and the collection of ground truth data for accuracy assessment, this study applied four sets (ultra-blue, blue, green, and red light based) of water indexes (NDWI, MNDWI, MNDWI2, AWEIns, and AWEIs) combined with three types of image classification methods (zero-water-index threshold, Otsu, and kNN) to 24 selected lakes across the globe to extract water features from Landsat-8 OLI imagery. The 1440 (4×5×3×24) image classification results were compared, by computing Kappa coefficients, with the water features extracted from high-resolution Google Earth images with the same (or ±1 day) acquisition dates. Results show that the kNN method is better than the Otsu method, and the Otsu method is better than the zero-water-index threshold method. If computational cost is not an issue, the kNN method combined with the ultra-blue light based AWEIns is the best method for extracting water features from Landsat imagery, because it produced the highest Kappa coefficients; if computational cost is taken into account, the Otsu method is a good choice. AWEIns and AWEIs are better than NDWI, MNDWI, and MNDWI2. AWEIns works better than AWEIs under the Otsu method, and the average rank of the image classification accuracy, from high to low, is the ultra-blue, blue, green, and red light-based AWEIns.
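The Otsu thresholding of a water index and the Kappa-based accuracy check can be sketched on synthetic data; the index distributions below are invented, where real NDWI/AWEI values would come from Landsat band combinations:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def otsu_threshold(values, nbins=256):
    """Threshold maximizing the between-class variance (Otsu's method)."""
    hist, edges = np.histogram(values, bins=nbins)
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                  # class-0 probability mass
    mu = np.cumsum(p * centers)        # cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mu_t * w0 - mu) ** 2 / (w0 * (1 - w0))
    return centers[np.nanargmax(var_between)]

# Synthetic water-index image: water pixels high, land pixels low.
rng = np.random.default_rng(5)
index = np.concatenate([rng.normal(-0.4, 0.1, 7000),   # land
                        rng.normal(0.4, 0.1, 3000)])   # water
truth = np.concatenate([np.zeros(7000), np.ones(3000)])

# Classify by the automatically chosen threshold, then score agreement
# against ground truth with the Kappa coefficient.
pred = (index > otsu_threshold(index)).astype(int)
kappa = cohen_kappa_score(truth, pred)
```

On a clean bimodal index like this, the threshold lands between the two modes and Kappa approaches 1; real imagery is noisier, which is what the study's 1440-way comparison quantifies.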
ARTICLE | doi:10.20944/preprints202103.0753.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: unsupervised feature selection; histogram-valued data; compactness; hierarchical conceptual clustering; multi-role measure; visualization
Online: 31 March 2021 (07:53:39 CEST)
This paper presents an unsupervised feature selection method for multi-dimensional histogram-valued data. We define a multi-role measure, called the compactness, based on the concept size of given objects and/or clusters described by a fixed number of equal-probability bin-rectangles. In each step of clustering, we agglomerate objects and/or clusters so as to minimize the compactness of the generated cluster. This means that the compactness plays the role of a similarity measure between the objects and/or clusters to be merged. Minimizing the compactness is equivalent to maximizing the dissimilarity of the generated cluster, i.e., concept, against the whole concept in each step. In this sense, the compactness also plays the role of a cluster quality measure. We further show that the average compactness of each feature with respect to the objects and/or clusters over several clustering steps is useful as a feature-effectiveness criterion. Features having small average compactness are mutually covariate and are able to detect geometrically thin structures embedded in the given multi-dimensional histogram-valued data. A thorough understanding of the given data is obtained through visualization using dendrograms and scatter diagrams with respect to the selected informative features. We illustrate the effectiveness of the proposed method using an artificial data set and real histogram-valued data sets.
ARTICLE | doi:10.20944/preprints202103.0276.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Artificial Intelligent algorithms; Analytical Hierarchical Process (AHP); Prediction methods; unsupervised learning; Biological neural networks
Online: 10 March 2021 (11:07:15 CET)
Artificial intelligence (AI) is a versatile term for methods that solve problems from past rational data after deep analysis, drawing on basic statistics, data curation, and familiarity with common AI algorithms. Seafood export, especially of tiger prawn, can provide enormous foreign exchange to any country if farmers overcome the vulnerabilities correlated with prawn farming. This research elucidates deficiencies in Tiger prawn (TP) farming, such as limited oxygen, pH, water temperature, and nutrients. Moreover, hatchery statistics in terms of juveniles give this study a clear picture of the constrained aquaculture. For normative decisions, the Analytical Hierarchical Process (AHP) is used. The problem faced by local prawn farmers is stagnant TP growth in ponds, the reason being the predominant sensitivity of TP; they therefore need to manage thirteen factors together with natural resources to obtain plausible results. This study focuses on a TP growth model whose monitoring is established by an AI algorithm, employing the AHP, a 0-1 scaling method, data curation techniques, and ecological statistics. The life of Tiger Prawn (TP) depends mainly on (a) physical and (b) chemical parameters. Physical parameters comprise the environment (E) provided to TP, such as season (S) and temperature (T), whereas the quality of ammonia NH3 (N) from fish waste, oxygen level (O), and water quality, hard and soft, (W) lie in the chemical domain. This research elucidates the factors that cause conceptual muddles in the aquamarine life of TP; statistical tools assess the current results and forecast the gaps, while the AHP analyzes the domain inputs and their ramifications to reveal the decisive factors and indicate which pond suits TP.
In conclusion, these factors will be controlled to improve the growth of TP in a controlled environment.
ARTICLE | doi:10.20944/preprints202002.0113.v1
Subject: Chemistry, Other Keywords: Blind Source Separation; Component Analysis; Chemometrics; Unsupervised Machine Learning; Endmember Extraction; Spectral Unmixing; NMR
Online: 9 February 2020 (17:18:38 CET)
NMR spectral datasets, especially in systems with limited samples, can be difficult to interpret if they contain multiple chemical components (phases, polymorphs, molecules, crystals, glasses, etc.) and the possibility of overlapping resonances. In this paper, we benchmark several blind source separation techniques for the analysis of NMR spectral datasets containing negative intensity. For benchmarking purposes, we generated a large synthetic database of quadrupolar solid-state NMR-like spectra that model spin-lattice T1 relaxation or nutation tip/flip angle experiments. Our benchmarking approach focused exclusively on the ability of blind source separation techniques to reproduce the spectra of the underlying pure components. In general, we find that FastICA (Fast Independent Component Analysis), SIMPLISMA (SIMPLe-to-use-Interactive Self-modeling Mixture Analysis), and NNMF (Non-Negative Matrix Factorization) are top-performing techniques. We demonstrate that dataset normalization approaches prior to blind source separation do not considerably improve outcomes. Within the range of noise levels studied, we did not find drastic changes to the ranking of techniques. The accuracy of FastICA and SIMPLISMA degrades quickly if excess (unreal) pure components are predicted. Our results indicate poor performance of SVD (Singular Value Decomposition) methods, and we propose alternative techniques for matrix initialization. The benchmarked techniques are also applied to real solid-state NMR datasets. In general, the recommendations from the synthetic datasets agree with the recommendations and results from the real data analysis. The discussion provides some additional recommendations for spectroscopists applying blind source separation to NMR datasets, and for future benchmark studies.
Applications of blind source separation to NMR datasets containing negative intensity may be especially useful for understanding complex and disordered systems with limited samples and mixtures of chemical components.
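A minimal blind-source-separation sketch in the spirit of this benchmark uses FastICA from scikit-learn on two synthetic spectra, one with negative intensity; the peak shapes, noise level, and mixing matrix are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic "pure component" spectra, one with negative intensity
# (as in nutation or inversion-recovery style experiments).
x = np.linspace(0, 1, 400)
s1 = np.exp(-((x - 0.3) ** 2) / 0.002)
s2 = -np.exp(-((x - 0.7) ** 2) / 0.002)
S = np.vstack([s1, s2])

# Mix them with different weights, mimicking a relaxation series.
A = np.array([[1.0, 0.2], [0.7, 0.9], [0.3, 1.0], [0.1, 0.6]])
X = A @ S + 0.01 * np.random.default_rng(6).normal(size=(4, 400))

# Blind source separation: recover the pure spectra without knowing A.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X.T).T        # shape (2, 400)
```

Note that, unlike NNMF, FastICA places no sign constraint on the sources, which is why it is applicable to datasets with negative intensity; recovered components come back in arbitrary order and sign.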
ARTICLE | doi:10.20944/preprints201903.0122.v1
Subject: Earth Sciences, Geoinformatics Keywords: Classification, SVM Classifier, ML Classifier, Supervised and Unsupervised Classification, Object-based Classification, Multispectral Data
Online: 11 March 2019 (09:01:44 CET)
This paper focuses on the crucial role that remote sensing plays in discerning land features. Data collected remotely provide information in the spectral, spatial, temporal, and radiometric domains, each domain having a specific resolution for the information collected. Diverse sectors such as hydrology, geology, agriculture, land cover mapping, forestry, urban development and planning, and oceanography are known to use and rely on information gathered remotely from different sensors. In the present study, IRS LISS-IV multispectral data are used for land cover mapping. It is known, however, that classifying high-resolution land cover imagery through manual digitizing is time-consuming and costly. Therefore, this paper proposes accomplishing the classification by applying algorithms computationally. These classifications fall into three classes: supervised, unsupervised, and object-based classification. For supervised classification, two approaches are relied upon for land cover classification of the high-resolution LISS-IV multispectral image: Maximum Likelihood (ML) and Support Vector Machine (SVM). Finally, the paper proposes a step-by-step procedure for an optical image classification methodology and concludes that, in optical data classification, SVM classification gives a better result than the ML classification technique.
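The supervised comparison described above can be sketched with scikit-learn, using quadratic discriminant analysis as the textbook per-class-Gaussian form of the Maximum Likelihood classifier; the data here are synthetic, not LISS-IV pixels, and the feature/class counts are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Toy multispectral pixels: 4 "bands" per pixel, 3 land-cover classes.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian maximum-likelihood classification (QDA fits one Gaussian per
# class and assigns each pixel to the most likely class).
ml_acc = QuadraticDiscriminantAnalysis().fit(Xtr, ytr).score(Xte, yte)

# Support Vector Machine with an RBF kernel.
svm_acc = SVC(kernel="rbf", C=10).fit(Xtr, ytr).score(Xte, yte)
```

Comparing `ml_acc` and `svm_acc` on held-out pixels mirrors the paper's ML-versus-SVM accuracy comparison.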
ARTICLE | doi:10.20944/preprints202201.0202.v1
Subject: Earth Sciences, Geoinformatics Keywords: crop detection; Sentinel 1; Sentinel 2; supervised classification; unsupervised classification; time series; agriculture; food security
Online: 14 January 2022 (11:18:59 CET)
Satellite crop detection technologies focus on detecting different types of crops in the field at an early stage, before harvesting. Crop detection is usually done on a time series of satellite data by classification of the fields of interest. Currently, data obtained from Remote Sensing (RS) are used to solve tasks related to identifying the type of agricultural crops, and modern AI methods are increasingly applied in the post-processing part. In this work, Sentinel-1 and Sentinel-2 time-series data were used due to their periodic availability. Our focus was to develop a methodology for the classification of Sentinel-2 and Sentinel-1 time series and to compare how the accuracy of classification can be increased, but also how to guarantee the availability of data. We analysed the phenology of individual crops and, on the basis of this analysis, performed crop classification. The original crop classifications were made from Enhanced Vegetation Index (EVI) layers derived from Sentinel-2 time-series data. To increase accuracy, we also integrated parcel borders into the process and provided a classification of fields.
ARTICLE | doi:10.20944/preprints202112.0366.v1
Subject: Medicine & Pharmacology, Pathology & Pathobiology Keywords: fibroepithelial breast lesions; phyllodes tumors; methylation analysis; copy number alterations; dimension reduction; unsupervised machine learning
Online: 22 December 2021 (12:46:50 CET)
Fibroepithelial lesions (FL) of the breast, in particular phyllodes tumors (PT) and fibroadenomas (FA), pose a significant diagnostic challenge. There are no generally accepted criteria that distinguish benign, borderline, and malignant PT, and FA. Combined genome-wide DNA methylation and copy number variant (CNV) profiling is an emerging strategy to classify tumors. We compiled a series of patient-derived archival biopsy specimens reflecting the FL spectrum and histological mimickers, including clinical follow-up data. DNA methylation and CNVs were determined by well-established microarrays. Comparison of the patterns with a pan-cancer dataset assembled from public resources including "The Cancer Genome Atlas" (TCGA) and "Gene Expression Omnibus" (GEO) suggests that FLs form a methylation class distinct from both control breast tissue and common breast cancers. Complex CNVs were enriched in clinically aggressive FLs. Subsequent fluorescence in situ hybridization (FISH) analysis detected the respective aberrations in the neoplastic mesenchymal component of FLs only, confirming that the epithelial component is non-neoplastic. Of note, our approach could lead to the elimination of the diagnostically problematic category of borderline PT and allow for optimized prognostic patient stratification. Furthermore, the identified recurrent genomic aberrations such as 1q gains (including MDM4), CDKN2a/b deletions, and EGFR amplifications may inform therapeutic decision-making.
ARTICLE | doi:10.20944/preprints202002.0019.v1
Subject: Life Sciences, Endocrinology & Metabolomics Keywords: metabolomics; LC-MS; mass spectrometry; metabolic profiling; computational; statistical; unsupervised learning; supervised learning; pathway analysis
Online: 3 February 2020 (05:54:14 CET)
Metabolomics analysis generates vast arrays of data, necessitating comprehensive workflows involving expertise in analytics, biochemistry, and bioinformatics, in order to provide coherent, high-quality data that enables the discovery of robust and biologically significant metabolic findings. In this protocol article, we introduce NoTaMe, an analytical workflow for non-targeted metabolic profiling approaches utilizing liquid chromatography–mass spectrometry analysis. We provide an overview of the lab protocols and statistical methods that we commonly use for the analysis of nutritional metabolomics data. The paper is divided into three main sections: the first and second sections introduce the background and the study designs available for metabolomics research, and the third section describes in detail the steps of the main methods and protocols used to produce, preprocess, and statistically analyze metabolomics data, and finally to identify and interpret the compounds that emerge as interesting.
ARTICLE | doi:10.20944/preprints201807.0207.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: smart anti-theft system; intruder detection; unsupervised activity monitoring; smart home; partially/fully covered faces
Online: 11 July 2018 (16:47:59 CEST)
The proposed research methodology aims to design a generally implementable framework for providing a house owner/member with the immediate notification of an on-going theft (unauthorized access to their premises). For this purpose, a rigorous analysis of existing systems was undertaken to identify research gaps. The problems found with existing systems were that they could only identify the intruder after the theft, or could not distinguish between human and non-human objects. Wireless Sensor Networks (WSNs) combined with the use of the Internet of Things (IoT), the Cognitive Internet of Things, the Internet of Medical Things, and Cloud Computing are expanding smart home concepts, solutions, and their applications. The primary objective of the present research work was to design and develop IoT and cloud computing based smart home solutions. In addition, we also propose a novel smart home anti-theft system that can detect an intruder, even if they have partially/fully hidden their face using clothing, leather, fiber, or plastic materials. The proposed system can also detect an intruder in the dark using a CCTV camera without a night vision facility. The fundamental idea was to design a cost-effective and efficient system for an individual to be able to detect any kind of theft in real-time and provide instant notification of the theft to the house owner. The system also promises to implement home security with large video data handling in real-time.
ARTICLE | doi:10.20944/preprints202001.0375.v1
Subject: Mathematics & Computer Science, Other Keywords: unsupervised machine learning; hierarchical learning; computational representation; computational cognitive modeling; contextual modeling; classification; IoT data modeling
Online: 31 January 2020 (04:38:51 CET)
The term Concept has been a prominent part of investigations in psychology and neurobiology, where it is mostly represented mathematically or theoretically. Concepts are also studied computationally through symbolic, distributed, and hybrid representations. The majority of these approaches focus on the notion of concrete concepts, while the view of abstract concepts is rarely explored. Moreover, most computational approaches have a predefined structure or configuration. The proposed method, the Regulated Activation Network (RAN), has an evolving topology and learns representations of abstract concepts by exploiting the geometrical view of concepts, without supervision. In this article, the IRIS data set is used to demonstrate the RAN's modeling, its flexibility in the choice of concept identifier, and deep hierarchy generation. Data from the IoT Human Activity Recognition problem is used to show the automatic identification of alike classes as abstract concepts. The evaluation of the RAN on 8 UCI benchmarks and comparisons with 5 machine learning models establish the RAN's credibility as a classifier. The classification operation also supported the RAN's hypothesis of abstract concept representation. The experiments demonstrate the RAN's ability to simulate psychological processes (like concept creation and learning) and to carry out effective classification irrespective of training data size.
REVIEW | doi:10.20944/preprints201708.0003.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: stylometry; author identification; author verification; author profiling; stylistic inconsistency; text analysis; supervised learning; unsupervised learning; classification; forensics
Online: 2 August 2017 (12:38:17 CEST)
Electronic text stylometry is a collection of forensic methods that analyze the writing styles of input electronic texts in order to extract information about their authors. Such extracted information could be the identity of the authors, or aspects of the authors, such as their gender, age group, ethnicity, etc. This survey paper presents the following contributions: 1) A description of all stylometry problems in probability terms, under a unified notation. To the best of our knowledge, this is the most comprehensive definition to date. 2) A survey of key methods, with particular attention to data representation (or feature extraction) methods. 3) An evaluation of 23,760 feature extraction methods, which is the most comprehensive evaluation of feature extraction methods in the stylometry literature to date. The importance of this evaluation is twofold. First, it identifies the relative effectiveness of the features (since, currently, many are not evaluated jointly; e.g. syntactic n-grams are not evaluated against k-skip n-grams, and so forth). Second, thanks to our generalizations, we could evaluate novel grams, such as what we name compound grams. 4) The release of our associated Python feature extraction library, namely Fextractor. Essentially, the library generalizes all existing n-gram based feature extraction methods under the "at least l-frequent, dir-directed, k-skipped n-grams", and allows grams to be diversely defined, including definitions based on high-level grammatical aspects, such as POS tags, as well as lower-level ones, such as the distribution of function words, word shapes, etc. This makes the library by far the most extensive in this domain to date. 5) The construction, evaluation, and release of the first dataset of Emirati social media text. This represents the first evaluation of author identification against Emirati social media texts.
Interestingly, we find that, when using our models and feature extraction library (Fextractor), authors could be identified significantly more accurately than what is reported with similarly sized datasets. The dataset also contains sub-datasets that represent other languages (Dutch, English, Greek and Spanish), and our findings are consistent across them.
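One reading of the "at least l-frequent, k-skipped n-grams" generalization (ignoring the direction parameter) can be sketched as follows; this is an illustrative implementation of the idea, not code from Fextractor, and the exact skip semantics below (at most k skipped tokens between consecutive picks) is an assumption:

```python
from collections import Counter
from itertools import combinations

def skip_ngrams(tokens, n=2, k=1, min_freq=1):
    """At-least-min_freq-frequent k-skipped n-grams over a token list."""
    grams = Counter()
    for i in range(len(tokens)):
        # Candidate follow-up positions within the widest possible window.
        span = range(i + 1, min(i + (n - 1) * (k + 1) + 1, len(tokens)))
        for combo in combinations(span, n - 1):
            pos = (i,) + combo
            # Allow at most k skipped tokens between consecutive picks.
            if all(b - a - 1 <= k for a, b in zip(pos, pos[1:])):
                grams[tuple(tokens[p] for p in pos)] += 1
    return {g: c for g, c in grams.items() if c >= min_freq}
```

With `k=0` this reduces to plain n-grams, which is the sense in which a single parameterized definition subsumes the n-gram family.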
ARTICLE | doi:10.20944/preprints202203.0093.v1
Subject: Life Sciences, Biochemistry Keywords: 6-hydroxydopamine; rotenone; in vitro neurotoxicity; mitochondrial dysfunction; exploratory data analysis; applied computational statistics; unsupervised and supervised machine learning
Online: 7 March 2022 (09:16:28 CET)
With the increase in life expectancy and the consequent aging of the world’s population, the prevalence of many neurodegenerative diseases is increasing, without concomitant improvement in diagnostics and therapeutics. These diseases share neuropathological hallmarks, including mitochondrial dysfunction. In fact, as mitochondrial alterations appear prior to neuronal cell death at an early phase of disease onset, the study and modulation of mitochondrial alterations arise as promising strategies to predict and prevent neurotoxicity and neuronal cell death before cell viability alterations set in. In this work, differentiated SH-SY5Y cells were treated with the mitochondria-targeted neurotoxicants 6-hydroxydopamine and rotenone. These compounds were used at different concentrations and for different time points to understand the similarities and differences in their mechanisms of action. To accomplish this, data on mitochondrial parameters were acquired and analyzed using unsupervised (hierarchical clustering) and supervised (decision tree) machine learning methods. Both the biochemical and the computational analyses resulted in an evident distinction between the neurotoxic effects of 6-hydroxydopamine and rotenone, specifically at the highest concentrations of both compounds.
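The two analysis stages named above, unsupervised hierarchical clustering and a supervised decision tree, can be sketched on invented stand-in data; the parameter values, group sizes, and tree depth below are assumptions, not the study's measurements:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the mitochondrial-parameter dataset: rows are treated
# cell samples, columns are measured mitochondrial parameters.
rng = np.random.default_rng(7)
rotenone = rng.normal(0.0, 0.3, size=(25, 6))
ohda = rng.normal(1.0, 0.3, size=(25, 6))    # 6-OHDA shifts the parameters
X = np.vstack([rotenone, ohda])
y = np.array([0] * 25 + [1] * 25)            # 0 = rotenone, 1 = 6-OHDA

# Unsupervised: hierarchical (agglomerative) clustering into two groups.
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Supervised: a shallow decision tree separating the two neurotoxicants.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
acc = tree.score(X, y)
```

When the two compounds produce distinct parameter profiles, the clustering recovers the treatment groups without labels and the tree separates them with few splits, mirroring the "evident distinction" the abstract reports.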
ARTICLE | doi:10.20944/preprints202302.0070.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: deep learning; aerial imagery; precision agriculture; plant detection; domain adaptation; unsupervised learning; self-supervision; adversarial learning; domain shift; tropical crops
Online: 3 February 2023 (10:14:09 CET)
This paper presents a novel approach for accurate counting and localization of tropical plants in aerial images that is able to work in new visual domains in which the available data are not labeled. Our approach uses deep learning and domain adaptation, designed to handle domain shift between the training and test data, which is a common challenge in agricultural applications. The method uses a source dataset with annotated plants and a target dataset without annotations, and adapts a model trained on the source dataset to the target dataset using unsupervised domain alignment and pseudolabeling. The experimental results show the effectiveness of this approach for plant counting in aerial images of pineapples under significant domain shift, achieving a reduction of up to 97% in the counting error compared to the supervised baseline.
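One common ingredient of the pseudolabeling step mentioned above is confidence-thresholded selection of target-domain predictions. The sketch below shows that selection rule in plain NumPy; the threshold and class layout are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def select_pseudolabels(probs, threshold=0.9):
    """Keep only target-domain predictions whose top-class
    probability exceeds a confidence threshold, and use those
    predictions as pseudo-labels (threshold is illustrative)."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf >= threshold
    return np.where(keep)[0], labels[keep]

# hypothetical model outputs on three unlabeled target images,
# columns = (plant, background)
probs = np.array([[0.97, 0.03],   # confident "plant"  -> kept
                  [0.55, 0.45],   # ambiguous          -> discarded
                  [0.08, 0.92]])  # confident "background" -> kept
idx, labels = select_pseudolabels(probs)
print(idx, labels)  # [0 2] [0 1]
```

The retained pairs can then be fed back as training targets while the domain-alignment objective handles the remaining, low-confidence samples.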
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Approach Path Management; Atypical Flight Event; Non-Compliant Approach; Real Time; Anomaly Detection; Functional Principal Component Analysis; Unsupervised Learning; Dubins Path
Online: 12 March 2021 (21:17:22 CET)
In this paper, a complete tool for real-time detection of atypical energy behaviours of airplanes is presented. The methodology extends an existing offline process to real time, using Dubins trajectories as a predictor of the remaining distance to the runway threshold. Two major contributions are presented. First, a real-time measure of the aircraft energy behaviour is defined, indicating whether the aircraft is in good condition to intercept the extended runway centreline from its current position. Second, a 2D trajectory suggestion is given, allowing safe management of the approach path according to atypicality criteria derived from historical data. Finally, this document proposes a comprehensive tool for air traffic controllers, which is a major step forward in understanding, becoming aware of, and resolving critical situations that could lead to accidents.
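A Dubins trajectory is the shortest path between two poses (position plus heading) under a minimum turning radius, which is why its length serves as a distance-to-threshold predictor. As an illustration, the sketch below computes the length of just one of the six Dubins path types (Left-Straight-Left) using the standard closed-form expressions; a real predictor would take the minimum over all six types:

```python
import math

def mod2pi(theta):
    return theta % (2 * math.pi)

def dubins_lsl_length(start, goal, r):
    """Length of the Left-Straight-Left Dubins path between two
    poses (x, y, heading) with turning radius r. Shown alone for
    brevity; the shortest Dubins path minimizes over six types."""
    x0, y0, th0 = start
    x1, y1, th1 = goal
    dx, dy = x1 - x0, y1 - y0
    d = math.hypot(dx, dy) / r          # normalized distance
    phi = math.atan2(dy, dx)
    a, b = mod2pi(th0 - phi), mod2pi(th1 - phi)
    p_sq = 2 + d * d - 2 * math.cos(a - b) \
        + 2 * d * (math.sin(a) - math.sin(b))
    if p_sq < 0:
        return None                     # no LSL path exists here
    tmp = math.atan2(math.cos(b) - math.cos(a),
                     d + math.sin(a) - math.sin(b))
    t = mod2pi(-a + tmp)                # first left arc
    q = mod2pi(b - tmp)                 # final left arc
    return (t + math.sqrt(p_sq) + q) * r

# aligned poses: the LSL path degenerates to a straight segment
print(dubins_lsl_length((0, 0, 0), (10, 0, 0), r=1.0))  # 10.0
```

For aligned poses the arcs vanish and the length reduces to the straight-line distance, a convenient sanity check on the formulas.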
ARTICLE | doi:10.20944/preprints201712.0110.v1
Subject: Earth Sciences, Geoinformatics Keywords: best practice; crop mapping; crowdsourcing; drought risk assessment; exposure; flood risk assessment; geospatial data; spaceborne remote sensing; unsupervised classification; rule-based classification
Online: 17 December 2017 (08:26:29 CET)
Cash crops are agricultural crops intended to be sold for profit, as opposed to subsistence crops, which are meant to support the producer or livestock. Since cash crops are intended for future sale, they translate into large financial value when considered on a wide geographical scale, so their production directly involves financial risk. At a national level, extreme weather events, including destructive rain or hail as well as drought, can have a significant impact on the overall economic balance. It is thus important to map such crops in order to set up insurance and mitigation strategies. Using locally generated data (such as municipality-level records of crop seeding) for mapping purposes implies facing a series of issues such as data availability, quality, and homogeneity. We thus opted for a different approach relying on global datasets. Global datasets ensure homogeneity and availability of data, although sometimes at the expense of precision and accuracy. A typical global approach makes use of spaceborne remote sensing, for which different land cover classification strategies are available in the literature at different levels of cost and accuracy. We selected the optimal strategy in the perspective of a global processing chain. Thanks to a specifically developed strategy for fusing unsupervised classification results with environmental constraints and other geospatial inputs, including ground-based data, we managed to obtain good classification results despite these constraints. The overall production process was composed using "good-enough" algorithms at each step, ensuring that the precision, accuracy, and data-hunger of each algorithm were commensurate with the precision, accuracy, and amount of data available. This paper describes the tailored strategy developed through a cooperation among groups with diverse backgrounds, a strategy which is believed to be profitably reusable in other, similar contexts.
The paper presents the problem, the constraints and the adopted solutions; it then summarizes the main findings including that efforts and costs can be saved on the side of Earth Observation data processing when additional ground-based data are available to support the mapping task.
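The fusion of unsupervised classification results with environmental constraints described above amounts to a per-pixel rule: a pixel is mapped as the crop only if its unsupervised cluster is a crop candidate and the geospatial constraints hold. The toy sketch below illustrates that rule on a 3x3 raster; cluster ids, the constraint mask, and the candidate set are all invented for illustration:

```python
import numpy as np

def fuse_with_constraints(cluster_map, constraint_mask, crop_clusters):
    """Rule-based fusion sketch: a pixel is labeled 'cash crop' only
    if its unsupervised cluster is a candidate crop cluster AND the
    environmental constraint (e.g. suitable elevation, cropland
    mask) holds there. All inputs are illustrative."""
    candidate = np.isin(cluster_map, crop_clusters)
    return candidate & constraint_mask

# 3x3 toy scene: cluster ids from an unsupervised classifier
cluster_map = np.array([[0, 1, 1],
                        [2, 1, 0],
                        [1, 2, 2]])
# environmental constraint: True where terrain/climate permit the crop
constraint = np.array([[True, True, False],
                       [True, True, True],
                       [False, True, True]])
crop_map = fuse_with_constraints(cluster_map, constraint, crop_clusters=[1])
print(crop_map.astype(int))
```

Only two of the four cluster-1 pixels survive the constraint, showing how ground-based and environmental layers prune spurious spectral matches.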
ARTICLE | doi:10.20944/preprints202302.0066.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Smart Tourism; Sustainable Tourism; Natural language Processing (NLP); Big Data Analytics; Deep Learning; Machine Learning; Unsupervised Learning; Bidirectional Encoder Representations from Transformers (BERT); Literature Review; Smart Societies
Online: 3 February 2023 (09:47:55 CET)
Global natural and man-made events are exposing the fragility of the tourism industry and its impact on the global economy. Prior to the COVID-19 pandemic, tourism contributed 10.3% to the global GDP and employed 333 million people, but saw a significant decline due to the pandemic. Sustainable and smart tourism requires collaboration from all stakeholders and a comprehensive understanding of global and local issues to drive responsible and innovative growth in the sector. This paper presents an approach for leveraging big data and deep learning to discover holistic, multi-perspective (e.g., local, cultural, national, and international) and objective information on a subject. Specifically, we develop a machine learning pipeline to extract parameters from academic literature and public opinions on Twitter, providing a unique and comprehensive view of the industry from both academic and public perspectives. The academic-view dataset was created from the Scopus database and contains 156,759 research articles from 2000 to 2022, which were modelled to identify 33 distinct parameters in 4 categories: Tourism Types, Planning, Challenges, and Media & Technologies. A Twitter dataset of 485,813 tweets was collected over the 18 months from March 2021 to August 2022 to showcase public perception of tourism in Saudi Arabia, and was modelled to reveal 13 parameters categorized into two broader sets: Tourist Attractions and Tourism Services. Discovering system parameters is required to embed autonomous capabilities in systems and for decision-making and problem-solving during system design and operations. The proposed approach improves AI-based information discovery by extending the use of scientific literature, Twitter, and other sources for autonomous, dynamic optimization of systems, promoting novel research in the tourism sector and contributing to the development of smart and sustainable societies.
The paper also presents a comprehensive knowledge structure and literature review of the tourism sector based on over 250 research articles.
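The parameter-discovery pipeline described above, vectorizing a text corpus and clustering it into thematic parameters, can be sketched compactly. The paper's representation is BERT-based; the sketch substitutes TF-IDF as a lightweight, dependency-free stand-in, and the four toy documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# toy corpus standing in for tweets/abstracts; the paper uses
# BERT embeddings, TF-IDF is a lightweight stand-in here
docs = [
    "beach resort coastal tourism sea",
    "beach coastal hotels sea tourists",
    "museum heritage cultural tourism history",
    "museum heritage historic attractions history",
]
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Each resulting cluster corresponds to a candidate parameter (here, roughly "coastal attractions" vs. "heritage attractions"); in the paper, such clusters are then inspected and named to build the parameter taxonomy.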