ARTICLE | doi:10.20944/preprints201910.0360.v1
Subject: Biology, Other Keywords: Random Forest; Iterative Random Forest; gene expression networks; high performance computing; X-AI-based eQTL
Online: 31 October 2019 (02:33:17 CET)
As time progresses and technology improves, biological data sets are continuously increasing in size. New methods and new implementations of existing methods are needed to keep pace with this increase. In this paper, we present a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF). This new implementation enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs. Using this implementation, we also present a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more. We compare the new implementation of iRF with the previous R version and analyze its time to completion on two of the world's fastest supercomputers, Summit and Titan. We also show iRF-LOOP's ability to capture biologically significant results when creating Predictive Expression Networks. This new implementation of iRF will enable the analysis of biological data sets at scales that were previously not possible.
ARTICLE | doi:10.20944/preprints201609.0053.v1
Subject: Engineering, Other Keywords: electricity markets; price forecasting; multi-output models; random forests; conditional inference trees
Online: 18 September 2016 (06:16:19 CEST)
Predicting electricity prices is a very important issue in modern society, because the associated decision process under uncertainty requires accurate forecasts for the economic agents involved. In this paper, we apply the decision tree extension of Random Forests to the prediction of electricity prices in Spain, with the novelty of modeling prices jointly with demand. The purpose is to achieve greater accuracy than with univariate-response Random Forests, particularly in price prediction, and to understand the effect of the input variables (lagged values of price and demand, current production levels of available energy sources) on the two outputs jointly. The results are very encouraging, providing a significant increase in price prediction accuracy. Interesting methodological challenges also appear concerning the appropriate choice of the relative weights of price and demand in the joint modeling, and a new procedure to provide the variable importance ranking is proposed. The partykit package for R, which allows for multivariate Random Forests, has been used.
ARTICLE | doi:10.20944/preprints202012.0752.v1
Subject: Earth Sciences, Atmospheric Science Keywords: Random Forest; machine learning; multispectral imagery; deforestation; PFBC landscapes
Online: 30 December 2020 (11:57:18 CET)
The evaluation of deforestation by optical remote sensing remains a challenge in the humid tropical region due to high cloud cover. This paper develops a simple and reproducible method for mapping deforestation of the old-growth forest using open access software. A map of old-growth forest depletion was created using composites from three different dates (2003, 2010, 2016). Four models were tested: the first model used spectral bands (nir, swir1, swir2 and red), the second model was based on the association of spectral bands and spectral indices (NDVI, B54R, NDWI and NBR), the third model was constructed using spectral bands and geomorphological indices (DEM, Slope and Roughness) and the last model combined spectral bands, spectral indices and geomorphological indices. The optimal random forest ntrees and Mtry parameters were determined for each model to optimize the mapping. The out-of-bag error for these four models was 2.15 %, 2.05 %, 1.86 % and 1.85 %, respectively. The fourth model had the lowest error and was hence used to predict deforestation of the old-growth forest. The annual rates of deforestation amounted to 0.26 % (69861 ha) and 0.66 % (145768 ha) between 2003 and 2010 and between 2010 and 2016, respectively. The area of the old-growth forest in 2016 was 3601607 ha, and 215629 ha of forest were lost between 2003 and 2016. These results showed that the Random Forest Classification (RFC) model was able to effectively map the reduction of old-growth forests.
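The spectral indices named in the second model are normalized band differences. The sketch below shows the standard definitions of NDVI and NBR only; NDWI has more than one published definition and B54R is not defined in the abstract, so both are left out. The function names and the zero-denominator convention are illustrative assumptions, not the authors' code.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    if total == 0:
        return 0.0  # assumed convention for empty/no-data pixels
    return (a - b) / total


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def nbr(nir, swir2):
    """Normalized Burn Ratio: (NIR - SWIR2) / (NIR + SWIR2)."""
    return normalized_difference(nir, swir2)


if __name__ == "__main__":
    # Hypothetical surface reflectances for a single vegetated pixel.
    print(ndvi(nir=0.5, red=0.1))
    print(nbr(nir=0.5, swir2=0.2))
```

In a model like the second one, such per-pixel index values would be stacked with the raw bands as additional predictor columns.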
ARTICLE | doi:10.20944/preprints201801.0088.v1
Subject: Earth Sciences, Geoinformatics Keywords: wheat classification; random forest; spectral gradient difference; vegetation indices
Online: 10 January 2018 (09:13:02 CET)
The early-season area estimation of the winter wheat crop, as a strategic product, is important for decision makers. Classification of multi-temporal images is an approach that is affected by many factors, such as appropriate training sample size, proper frequency and acquisition times, the type of vegetation indices (VIs), the temporal gradient of spectral bands and VIs, the choice of classifier, and missing values caused by cloudy conditions. This paper addresses the impact of appropriate frequency and acquisition times and VI type, along with the spectral and VI gradients, on the random forest (RF) classifier when missing values exist in multi-temporal images. To investigate the appropriate temporal resolution for image acquisition, the study area was selected in an overlapping area between two LDCM paths. In our developed method, the missing values of cloudy bands for each pixel are retrieved as the mean of the k-nearest ordinary pixels. The multi-temporal image analysis is then performed by considering different scenarios provided by decision makers in terms of the desired crop types that should be extracted early in the season in the study areas. The classification results obtained by the RF decrease by 1.6% when temporally missing values are retrieved by the proposed method, which is an acceptable result. Moreover, the experimental results demonstrated that if the temporal resolution of Landsat 8 were increased to one week, the classification task could be conducted earlier with slightly better results in terms of OA and kappa. Incorporating VIs along with the temporal gradients of spectral bands and VIs as new features in RF improved the OA and kappa by 3.1% and 6.6%, respectively. Furthermore, the results showed that if only one image from the seasonal changes of crops is available, the temporal gradients of VIs and spectral bands play the main role in discriminating wheat from barley.
The experiments also demonstrated that if wheat and barley are merged into a single class, the crop area can be estimated two months earlier, with 97.1 and 93.5 in terms of OA and kappa, respectively.
Subject: Engineering, Mechanical Engineering Keywords: diesel engine; fault diagnosis; variational mode decomposition; random forest; feature extraction
Online: 25 December 2019 (11:13:13 CET)
Diesel engines, as power equipment, are widely used in the automobile, ship and power equipment industries. Due to wear or faulty adjustment, abnormal valve train clearance is a typical failure of diesel engines, which may result in performance degradation and even valve fracture and cylinder hit faults. However, the failure mechanism features, mainly in the time domain and angular domain, on which current diagnosis methods are based, are easily affected by working conditions or are hard to extract accurately enough, as the diesel engine keeps running in a transient and non-stationary process. This work aims at diagnosing this fault mainly based on frequency band features, which change when the valve clearance fault occurs. To extract a series of frequency band features adaptively, a decomposition technique based on improved variational mode decomposition is investigated in this work. As the connection between the features and the fault is fuzzy, the random forest algorithm is used to analyze the correspondence between features and faults. In addition, the feature dimension is reduced according to importance score to improve operational efficiency. The experimental results under variable speed conditions show that the method based on variational mode decomposition and random forest is capable of detecting the valve clearance fault effectively.
ARTICLE | doi:10.20944/preprints202201.0138.v1
Subject: Social Sciences, Business And Administrative Sciences Keywords: Smart Grid; Random Forest; Internet of Things; Power management; Machine Learning; Smart Meter; Priority Power Scheduling.
Online: 11 January 2022 (13:01:08 CET)
At present, power control and management play a vital role in information technology and power management. Organizations prefer renewable over non-renewable power generation in order to control resource consumption, reduce prices and manage power efficiently. The smart grid efficiently satisfies these requirements through the integration of machine learning algorithms. Machine learning algorithms are used in a smart grid for power requirement prediction, power distribution, failure identification and so on. The proposed Random Forest-based smart grid system classifies the power grid into different zones, such as high and low power utilization. The power zones are divided into a number of sub-zones, which are mapped to random forest branches. The sub-zone and branch mapping process is used to identify the quantities of utilized and non-utilized power in a zone. The quantity of non-utilized power and the locations where power is available are identified, and the required quantity of power is distributed to the requester with minimal response time and price. The priority power scheduling algorithm collects requests from consumers and sends them to the producer based on priority. The producer analyses the requester's existing power utilization and the availability of power in order to schedule power distribution to the requester based on priority. The experimental results of the proposed Random Forest-based sustainability and price optimization technique for the smart grid are compared with those of existing machine learning techniques such as SVM, KNN and NB. The proposed random forest-based identification technique identifies the exact location of available power, with minimal processing time and quick responses to the requester. Additionally, the experimental results show that the smart meter-based smart grid technique identifies faults in a shorter time than the conventional energy management technique.
ARTICLE | doi:10.20944/preprints202102.0318.v3
Subject: Medicine & Pharmacology, Allergology Keywords: Machine Learning; Artificial Intelligence; Androgen Receptor; Random Forest; Deep Neural Network; Convolutional
Online: 24 February 2021 (13:14:01 CET)
Substances that can modify the androgen receptor pathway in humans and animals are entering the environment and food chain with the proven ability to disrupt hormonal systems and leading to toxicity and adverse effects on reproduction, brain development, and prostate cancer, among others. State-of-the-art databases with experimental data of human, chimp, and rat effects by chemicals have been used to build machine learning classifiers and regressors and evaluate these on independent sets. Different featurizations, algorithms, and protein structures lead to different results, with deep neural networks (DNNs) on user-defined physicochemically-relevant features developed for this work outperforming graph convolutional, random forest, and large featurizations. The results show that these user-provided structure-, ligand-, and statistically-based features and specific DNNs provided the best results as determined by AUC (0.87), MCC (0.47), and other metrics and by their interpretability and chemical meaning of the descriptors/features. In addition, the same features in the DNN method performed better than in a multivariate logistic model: validation MCC = 0.468 and training MCC = 0.868 for the present work compared to evaluation set MCC = 0.2036 and training set MCC = 0.5364 for the multivariate logistic regression on the full, unbalanced set. Techniques of this type may improve AR and toxicity description and prediction, improving assessment and design of compounds. Source code and data are available at https://github.com/AlfonsoTGarcia-Sosa/ML
ARTICLE | doi:10.20944/preprints202108.0024.v1
Subject: Earth Sciences, Atmospheric Science Keywords: Nitrogen Dioxide (NO2); Random Forest; Contribution Rate; Air pollution; COVID-19 lockdown
Online: 2 August 2021 (11:54:10 CEST)
During the COVID-19 lockdown in Wuhan, transportation, industrial production and other human activities declined significantly, as did the NO2 concentration. To assess the relative contributions of different factors to the reduction of air pollutants, sensitivity experiments were carried out with a random forest (RF) model, comparing the contributions of meteorology, road traffic, and emission sources between different periods. In addition, an emulator was run to suggest an appropriate limit for the control of transportation. The RF models showed different mechanisms for air pollutants. The Within-city Migration Index (WMI) was more important in the normal, pre-lockdown and post-pandemic models, while the Out-Migration Index (OMI) was emphasized in the lockdown model. In the COVID-19 lockdown period, 73.3% of the reduction can be attributed to decreased road traffic, showing the massive impact of road traffic on air quality. In the post-pandemic period, meteorology accounted for about 42.2% of the decrease and emissions from industry and households accounted for 40.0%, while road traffic contributed only 17.8%. It is suggested that priority of restriction should be given to road traffic within the city. A limit of less than 40% on the control of road traffic can achieve a better effect, especially for cities with severe traffic pollution.
ARTICLE | doi:10.3390/sci2040061
Subject: Keywords: industry4.0; fault detection; fault diagnosis; random forest; diagnostic graph; distributed diagnosis; model-based; data-driven; hybrid approach; hydraulic test rig
Online: 24 September 2020 (00:00:00 CEST)
In this work, a hybrid component Fault Detection and Diagnosis (FDD) approach for industrial sensor systems is established and analyzed, to provide a hybrid schema that combines the advantages and eliminates the drawbacks of both model-based and data-driven methods of diagnosis. Moreover, it sheds light on a new utilization of Random Forest (RF) together with model-based diagnosis, beyond its ordinary data-driven application. RF is trained and hyperparameter tuned using three-fold cross validation over a random grid of parameters using random search, to finally generate diagnostic graphs as the dynamic, data-driven part of this system. This is followed by translating those graphs into model-based rules in the form of if-else statements, SQL queries or semantic queries such as SPARQL, in order to feed the dynamic rules into a structured model essential for further diagnosis. The RF hyperparameters are consistently updated online using the newly generated sensor data to maintain the dynamicity and accuracy of the generated graphs and rules thereafter. The architecture of the proposed method is demonstrated in a comprehensive manner, and the dynamic rules extraction phase is applied using a case study on condition monitoring of a hydraulic test rig using time-series multivariate sensor readings.
ARTICLE | doi:10.20944/preprints202007.0548.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: industry4.0; fault detection; fault diagnosis; random forest; diagnostic graph; distributed diagnosis; model-based; data-driven; hybrid approach; hydraulic test rig
Online: 23 July 2020 (11:26:41 CEST)
In this work, a hybrid component Fault Detection and Diagnosis (FDD) approach for industrial sensor systems is established and analyzed, to provide a hybrid schema that combines the advantages and eliminates the drawbacks of both model-based and data-driven methods of diagnosis. Moreover, it sheds light on a new utilization of Random Forest (RF) together with model-based diagnosis, beyond its ordinary data-driven application. RF is trained and hyperparameter tuned using 3-fold cross-validation over a random grid of parameters using random search, to finally generate diagnostic graphs as the dynamic, data-driven part of this system. This is followed by translating those graphs into model-based rules in the form of if-else statements, SQL queries or semantic queries such as SPARQL, in order to feed the dynamic rules into a structured model essential for further diagnosis. The RF hyperparameters are consistently updated online using the newly generated sensor data, in order to maintain the dynamicity and accuracy of the generated graphs and rules thereafter. The architecture of the proposed method is demonstrated in a comprehensive manner, and the dynamic rules extraction phase is applied using a case study on condition monitoring of a hydraulic test rig using time series multivariate sensor readings.
ARTICLE | doi:10.20944/preprints201806.0188.v1
Subject: Earth Sciences, Geoinformatics Keywords: minimum noise fraction (MNF) transformation; object-based image analysis (OBIA); APEX hyperspectral imagery; Random forest (RF) classifier; multiresolution segmentation (MRS); tree species classification
Online: 12 June 2018 (10:55:07 CEST)
Tree species composition is a key element for biodiversity and sustainable forest management, and hyperspectral data provide detailed spectral information that can be used for tree species classification. There are two main challenges in using hyperspectral imagery: a) the Hughes phenomenon, meaning that as the number of bands in hyperspectral imagery increases, the number of required classification samples increases exponentially, and b) in a more complex environment, such as riparian mixed forest, focusing on spectral variability per pixel may not be adequate for distinguishing tree species. Therefore, the focus of this study is to assess spectral-spatial dimensionality reduction of airborne hyperspectral imagery by using minimum noise fraction (MNF) transformation and object-based image analysis (OBIA). Airborne prism experiment (APEX) hyperspectral imagery was used. The study area was a riparian mixed forest located along the Salzach river, and six tree species, namely Picea abies, Populus (canadensis and balsamifera), Fraxinus excelsior, Alnus incana, and Salix alba, were selected. The random forest (RF) machine learning algorithm was used to train and apply a prediction model for classification. A pixel-level classification was also carried out using the spectrally dimensionality-reduced APEX data. According to the confusion matrix, the object-level classification of MNF-derived components achieved an overall accuracy of 85% and a kappa coefficient of 0.805. Producer's accuracy per class varied from 80% for Fraxinus excelsior, Alnus incana, and Populus canadensis to 90% for Salix alba and Picea abies. Comparing the results with the pixel-level classification showed better performance for the object-level classification (an overall accuracy of 63% and a kappa coefficient of 0.559 were achieved for pixel-level classification). Per-class performance using pixel-based classification varied from 45% for Alnus incana to 80% for Picea abies.
In general, spectral-spatial complexity reduction using MNF transformation and object-level classification yielded statistically satisfactory results.
ARTICLE | doi:10.20944/preprints201809.0195.v1
Subject: Mathematics & Computer Science, Analysis Keywords: random fixed point, random $\alpha-$admissible with respect to $\eta$, generalized random $\alpha-\psi-$contractive mapping.
Online: 11 September 2018 (11:52:25 CEST)
In this paper, we prove some random fixed point theorems for generalized random $\alpha-\psi-$contractive mappings in a Polish space and, as an application, we show the existence of random solutions of a second-order random differential equation.
ARTICLE | doi:10.20944/preprints201905.0036.v3
Online: 4 June 2019 (11:12:53 CEST)
We speak of randomness when it is not possible to determine a pattern in the observed outcomes. A computer follows a sequence of fixed instructions to produce any of its output, hence the difficulty of choosing numbers randomly with algorithmic approaches. However, some algorithms based on mathematical formulas, like the Linear Congruential algorithm and the Lagged Fibonacci generator, appear to produce "true" random sequences to anyone who does not know the secret initial input. Up to now, we cannot rigorously answer the question on the randomness of prime numbers [2, page 1] and this highlights a connection between random number generators and the distribution of primes. From  and  one sees that it is quite naive to expect good random reproduction with prime numbers. We are, however, interested in the properties underlying the distribution of prime numbers, which emerge as sufficient or insufficient arguments to conclude a proof by contradiction which tends to show that prime numbers are not randomly distributed. To this end, we use prime gap sequence variation. The algorithm that we produce makes it possible to deduce, in the case of a binary choice, a uniform behavior in the individual consecutive occurrence of primes, and no uniformity trait when the occurrences are taken collectively.
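To make the two generators named above concrete, here is a minimal sketch of a linear congruential generator and an additive lagged Fibonacci generator. The constants (multiplier, increment, modulus) and the lag pair (7, 10) are illustrative assumptions, not values taken from the paper. Both sequences are fully determined by the seed, which is the "secret initial input" the abstract refers to.

```python
from collections import deque
from itertools import islice


def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator: x_{n+1} = (a * x_n + c) mod m."""
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x


def lagged_fibonacci(seed_values, j=7, k=10, m=2**32):
    """Additive lagged Fibonacci generator: x_n = (x_{n-j} + x_{n-k}) mod m."""
    if not 0 < j < k:
        raise ValueError("lags must satisfy 0 < j < k")
    if len(seed_values) != k:
        raise ValueError("exactly k seed values are required")
    # The deque always holds the k most recent values, oldest first.
    state = deque(seed_values, maxlen=k)
    while True:
        x = (state[-j] + state[-k]) % m
        state.append(x)  # the oldest value drops out automatically
        yield x


if __name__ == "__main__":
    # Seed the lagged Fibonacci generator from the LCG (a common convenience).
    seeds = list(islice(lcg(2019), 10))
    print(list(islice(lagged_fibonacci(seeds), 5)))
```

Re-running either generator with the same seed reproduces the same sequence, so the apparent randomness depends entirely on the seed remaining unknown.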
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: discrete degenerate random variables; degenerate binomial random variable; degenerate Poisson random variable; new type degenerate Bell polynomials
Online: 15 November 2019 (16:43:03 CET)
In this paper, we introduce two discrete degenerate random variables, namely the degenerate binomial and degenerate Poisson random variables. We deduce the expectations of the degenerate binomial random variables. We compute the generating function of the moments of the degenerate Poisson random variables, which leads us to define the new type degenerate Bell polynomials, and hence obtain explicit expressions for the moments of those random variables in terms of such polynomials. We also get the variances of the degenerate Poisson random variables. Finally, we illustrate two examples of the degenerate Poisson random variables.
ARTICLE | doi:10.20944/preprints202002.0350.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Swift-Hohenberg equation; Random-pullback attractor; Non-autonomous random dynamical system
Online: 24 February 2020 (12:30:08 CET)
In this paper, we study the existence of the random -pullback attractor of a non-autonomous local modified stochastic Swift-Hohenberg equation with multiplicative noise in the Stratonovich sense. It is shown that a random -pullback attractor exists in when its external force has exponential growth. Due to the stochastic term, the estimates are delicate; we overcome this difficulty by using the Ornstein-Uhlenbeck (O-U) transformation and its properties.
ARTICLE | doi:10.20944/preprints201905.0165.v2
Online: 18 June 2019 (11:15:56 CEST)
Background: As the opioid epidemic continues, understanding the geospatial, temporal and demand patterns is important for policymakers to assign resources and interdict individual, organization, and country-level bad actors. Methods: GIS geospatial-temporal analysis and extreme-gradient boosted random forests evaluate ICD-10 F11 opioid-related admissions and admission rates using geospatial analysis, demand analysis, and explanatory models, respectively. The period of analysis was January 2016 through September 2018. Results: The analysis shows existing high opioid admissions in Chicago and New Jersey with emerging areas in Atlanta, Salt Lake City, Phoenix, and Las Vegas. High rates of admission (claims per 10,000 population) exist in the Appalachian area and on the Northeastern seaboard. Explanatory models suggest that hospital overall workload and financial variables might be used for allocating opioid-related treatment funds effectively. Gradient-boosted random forest models accounted for 87.8% of the variability of claims on blinded 20% test data. Conclusions: Based on the GIS analysis, opioid admissions appear to have spread geographically, while higher frequency rates are still found in some regions, and the opioid epidemic is likely to continue to spread or diffuse through the country. Interdiction efforts require demand analysis such as that provided in this study to allocate scarce resources for supply-side and demand-side interdiction: prevention, treatment, and enforcement.
ARTICLE | doi:10.20944/preprints202102.0492.v3
Online: 1 April 2022 (06:22:53 CEST)
Disaggregated population counts are needed to calculate health, economic, and development indicators in Low- and Middle-Income Countries (LMICs), especially in settings of rapid urbanisation. Censuses are often outdated and inaccurate in LMIC settings, and rarely disaggregated at fine geographic scale. Modelled gridded population datasets derived from census data have become widely used by development researchers and practitioners. These datasets are evaluated for accuracy at the spatial scale of the input data, which is often much coarser (e.g. administrative units) than the neighbourhood or cell-level scale of many applications. We simulate a realistic "true" 2016 population in Khomas, Namibia, a majority urban region, and introduce realistic levels of outdatedness (over 15 years) and inaccuracy in slum, non-slum, and rural areas. We aggregate these simulated realistic populations by census and administrative boundaries (to mimic census data), and generate 32 gridded population datasets that are typical of an LMIC setting using the WorldPop-Global-Unconstrained gridded population approach. We evaluate the cell-level accuracy of these simulated datasets using the original "true" population as a reference. In our simulation, we found large cell-level errors, particularly in slum cells, driven by the use of average population densities in large areal units to determine cell-level population densities. Age, accuracy, and aggregation of the input data also played a role in these errors. We suggest incorporating finer-scale training data into gridded population models generally, and WorldPop-Global-Unconstrained in particular (e.g., from routine household surveys or slum community population counts), and use of new building footprint datasets as a covariate to improve cell-level accuracy.
It is important to measure accuracy of gridded population datasets at spatial scales more consistent with how the data are being applied, especially if they are to be used for monitoring key development indicators at neighbourhood scales with relevance to small dense deprived areas within larger administrative units.
ARTICLE | doi:10.20944/preprints201805.0302.v1
Subject: Mathematics & Computer Science, Computational Mathematics Keywords: graph entropy; chromatic classes; random graphs
Online: 22 May 2018 (11:59:26 CEST)
Combinatoric measures of entropy capture the complexity of a graph, but rely upon the calculation of its independent sets, or collections of non-adjacent vertices. This decomposition of the vertex set is a known NP-Complete problem and for most real world graphs is an inaccessible calculation. Recent work by Dehmer et al. and Tee et al. identified a number of alternative vertex level measures of entropy that do not suffer from this pathological computational complexity. It can be demonstrated that they are still effective at quantifying graph complexity. It is intriguing to consider whether there is a fundamental link between local and global entropy measures. In this paper, we investigate the existence of correlation between vertex level and global measures of entropy, for a narrow subset of random graphs. We use the greedy algorithm approximation for calculating the chromatic information and therefore Körner entropy. We are able to demonstrate close correlation for this subset of graphs and outline how this may arise theoretically.
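The greedy approximation mentioned above can be sketched as follows: colour the vertices greedily, then take the Shannon entropy of the resulting colour-class sizes. This is a simplified illustration under stated assumptions. The largest-degree-first ordering and the function names are choices made here, and because a greedy colouring is not guaranteed to be optimal, the value only approximates an entropy defined over an optimal decomposition. The paper's exact definitions may differ.

```python
import math
from collections import Counter


def greedy_coloring(adjacency):
    """Give each vertex the smallest colour unused by its coloured neighbours."""
    colors = {}
    # Largest-degree-first ordering is a common heuristic (an assumption here).
    order = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    for v in order:
        used = {colors[n] for n in adjacency[v] if n in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def color_class_entropy(colors):
    """Shannon entropy (in bits) of the distribution of colour-class sizes."""
    n = len(colors)
    sizes = Counter(colors.values()).values()
    return -sum((s / n) * math.log2(s / n) for s in sizes)


if __name__ == "__main__":
    # A 4-cycle: two colour classes of size two, so the entropy is 1 bit.
    cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    coloring = greedy_coloring(cycle)
    print(coloring, color_class_entropy(coloring))
```

The greedy pass runs in time proportional to the number of vertices plus edges, which is what makes it usable on graphs where enumerating independent sets is out of reach.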
ARTICLE | doi:10.20944/preprints202208.0050.v1
Subject: Physical Sciences, Applied Physics Keywords: quorum sensing; resistance random network; complex networks
Online: 2 August 2022 (08:21:25 CEST)
We propose a model for bacterial Quorum Sensing based on an auxiliary electrostatic-like interaction originating from a fictitious electrical charge that represents bacteria activity. A cooperative mechanism for charge/activity exchange is introduced to implement chemotaxis and replication. The bacteria system is thus represented by means of a complex resistor network where link resistances take into account the allowed activity-flow among individuals. By explicit spatial stochastic simulations, we show that the model exhibits different quasi-realistic behaviors from colony formation to biofilm aggregation. The electrical signal associated with Quorum Sensing is analyzed in space and time and provides useful information about the colony dynamics. In particular, we analyze the transition between the planktonic and the colony phases as the intensity of Quorum Sensing is varied.
ARTICLE | doi:10.20944/preprints202205.0023.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Random Triangle; Quasiorthogonal Dimension; Combinatorics; Computational Problems
Online: 5 May 2022 (07:58:23 CEST)
In this work we study the following problem, from a computational point of view: If three points are selected in the unit square at random, what is the probability that the triangle obtained is obtuse, acute or right? We provide two convergent strategies: the first derived from the ideas introduced in  and the second built on the combinatorics theory. The combined use of these two methods allows us to address the random triangle theory from a new perspective and, we hope, to work out a general method of dealing with some classes of computational problems.
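The question posed above can be explored numerically with a plain Monte Carlo simulation. The sketch below is not either of the paper's two strategies. It only samples three uniform points in the unit square and classifies the triangle from its squared side lengths; the tolerance used to detect right angles and the function names are assumptions.

```python
import random


def classify_triangle(p, q, r, tol=1e-12):
    """Classify a triangle as 'acute', 'right' or 'obtuse' from its vertices."""
    def squared_distance(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Sort squared side lengths so that `longest` faces the largest angle.
    short, middle, longest = sorted(
        (squared_distance(p, q), squared_distance(q, r), squared_distance(r, p))
    )
    # Law of cosines: the sign of this gap is the sign of cos(largest angle).
    gap = short + middle - longest
    if abs(gap) <= tol:
        return "right"
    return "acute" if gap > 0 else "obtuse"


def estimate_probabilities(n_trials, seed=0):
    """Estimate the class frequencies for random triangles in the unit square."""
    rng = random.Random(seed)
    counts = {"acute": 0, "right": 0, "obtuse": 0}
    for _ in range(n_trials):
        points = [(rng.random(), rng.random()) for _ in range(3)]
        counts[classify_triangle(*points)] += 1
    return {kind: count / n_trials for kind, count in counts.items()}


if __name__ == "__main__":
    print(estimate_probabilities(100_000))
```

Under continuous uniform sampling an exactly right triangle has probability zero, so the simulation effectively estimates the split between obtuse and acute triangles.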
ARTICLE | doi:10.20944/preprints202202.0175.v1
Subject: Life Sciences, Biotechnology Keywords: antimicrobial peptide prediction; sequence analysis; random forest
Online: 14 February 2022 (11:57:01 CET)
Antimicrobial peptides (AMPs) are considered promising alternatives to conventional antibiotics for overcoming the growing problem of antibiotic resistance. Computational prediction approaches are receiving increasing interest for identifying and designing the best candidate AMPs prior to in-vitro tests. In this study, we focused on linear cationic peptides with non-hemolytic activity, which were downloaded from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP). Referring to the MIC (minimum inhibitory concentration) values, we assigned a positive label to a peptide if it shows antimicrobial activity; otherwise the peptide is labeled as negative. We considered the peptides showing antimicrobial activity against Gram-negative and against Gram-positive bacteria separately, and created two datasets accordingly. Ten different physico-chemical properties of the peptides are calculated and used as features in our study. Following data exploration and data preprocessing steps, a variety of classification algorithms are used with 100-fold Monte Carlo Cross Validation to build models and to predict the antimicrobial activity of the peptides. Among the generated models, Random Forest resulted in the best performance metrics for both the Gram-negative dataset (Accuracy: 0.98, Recall: 0.99, Specificity: 0.97, Precision: 0.97, AUC: 0.99, F1: 0.98) and the Gram-positive dataset (Accuracy: 0.95, Recall: 0.95, Specificity: 0.95, Precision: 0.90, AUC: 0.97, F1: 0.92) after outlier elimination is applied. This prediction approach might be useful for evaluating the antibacterial potential of a candidate peptide sequence before moving to experimental studies.
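Monte Carlo cross-validation, as used above, repeatedly draws a fresh random train/test partition instead of rotating through fixed folds. The sketch below generates such partitions as index lists. The 20% test fraction and the function name are assumptions, since the abstract states only the number of repetitions (100).

```python
import random


def monte_carlo_splits(n_samples, n_iterations=100, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random sub-sampling."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_iterations):
        # Each iteration is an independent shuffle of all sample indices.
        shuffled = rng.sample(indices, n_samples)
        yield shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    for train, test in monte_carlo_splits(n_samples=10, n_iterations=3):
        print(sorted(test), len(train))
```

Unlike k-fold cross-validation, a given sample may appear in the test set of several iterations or of none, and the per-iteration metrics are typically averaged to obtain the reported scores.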
ARTICLE | doi:10.20944/preprints202102.0498.v1
Subject: Earth Sciences, Atmospheric Science Keywords: proximal hyperspectral sensing; precision agriculture; random forest
Online: 22 February 2021 (17:20:41 CET)
A strategy to reduce qualitative and quantitative losses in crop yields is the early and accurate detection of insect damage in plants. Remote sensing systems like hyperspectral proximal sensors are a promising strategy for managing crops. In this respect, machine learning predictions associated with clustering techniques may be an interesting approach, mainly because of their robustness in evaluating high-dimensional data. In this paper, we model the spectral response of insect-herbivory damage in maize plants and propose an approach based on machine learning and a clustering method to predict whether the plant is herbivore-attacked or not using leaf reflectance measurements. We differentiate insect-type damage based on the spectral response and indicate the wavelengths that contribute most to this differentiation. For this, we used a maize experiment in semi-field conditions. The maize plants were submitted to three different treatments: control (healthy plants); plants submitted to Spodoptera frugiperda herbivory damage; and plants submitted to Dichelops melacanthus herbivory damage. The leaf spectral response of all plants (control and submitted to herbivory) was measured with a FieldSpec 3.0 Spectroradiometer from 350 to 2500 nm for eight consecutive days. We evaluated the performance of different learners like random forest (RF), support vector machine (SVM), extreme gradient boost (XGB), and neural networks (MLP), and measured the impact of a day-by-day analysis on the prediction. We proposed a novel framework with a ranking strategy, based on the accuracy returned by predictions, and a clustering method based on a self-organizing map (SOM) to identify important regions in the reflectance measurement. Our results indicated that the RF-based framework is the overall best learner to deal with this type of data. After the 5th day of analysis, the accuracy of the algorithm improved substantially. It separated the three treatments into different groups with F-measures equal to 0.967, 0.917, and 0.881, respectively. We also verified that the most contributive spectral regions are situated in the near-infrared domain. We conclude that the proposed approach with machine learning methods is adequate to monitor herbivory damage of S. frugiperda and stink bugs like Dichelops melacanthus in maize, differentiating the types of insect attack early on. We also demonstrate that the framework proposed for the analysis of the most contributive wavelengths is suitable to highlight spectral regions of interest.
Subject: Medicine & Pharmacology, Pharmacology & Toxicology Keywords: Cannabis; Metabolite; Principal Component Analysis; Random Forest
Online: 5 September 2020 (07:51:50 CEST)
The many strains of Cannabis spp. are associated with many effects on users and contain many different potentially psychoactive metabolites, but the links between metabolite profiles and user effects are unclear. Here we take a statistical approach to linking cause (i.e. metabolites) to effects in Cannabis spp. through the prism of strains, using quantitative data for metabolite composition and user effects. We find that species (indica vs. sativa) explains <2% of the variability in metabolite profiles, while strain explains 1/3 of variability, indicating species is nonindicative of metabolite composition, while strain is approximately indicative. Using random forests we generate a table of potential metabolite-effect links. We also find that effect-weighted metabolite composition can effectively be described in terms of four values representing the concentrations of pairs or triplets of particular compounds.
ARTICLE | doi:10.20944/preprints202008.0132.v1
Online: 5 August 2020 (10:51:29 CEST)
Henry Vidal first introduced the concept of using strips, grids, and sheets for reinforcing soil masses. Since then, a large variety of materials such as steel bars, tire shreds, polypropylene, polyester, glass fibers, coir, jute fibers etc. have been widely added to the soil mass randomly or in a regular, oriented manner. In this investigation, a new concept of multi-oriented plastic reinforcement (hexa-pods) is discussed. Systematic and comprehensive laboratory tests were conducted on unreinforced and reinforced soil samples. Laboratory tests such as the direct shear test and the California bearing ratio (CBR) test were performed on unreinforced soil samples, soil samples with random inclusion of hexapods, and soil samples with layered inclusion of hexapods. The direct shear test results show that the cohesion of both soil samples increased and the angle of internal friction decreased after reinforcement with inclusions in both random and layered conditions. The CBR test indicates that, for the same amount of compactive effort, both random and layered inclusions of hexapods show improvement in strength and stiffness. Random inclusions of hexapods give better resistance to penetration as compared to layered inclusions. The hexa-pods also changed the brittle behavior of unreinforced sand samples to ductile behavior.
ARTICLE | doi:10.20944/preprints201812.0250.v3
Subject: Earth Sciences, Geoinformatics Keywords: Built-settlements; urban features; spatial growth; random forest; dasymetric modelling; population
Online: 9 October 2019 (10:48:20 CEST)
Mapping settlement extents at the annual time step has a wide variety of applications in demography, public health, sustainable development, and many other fields. Although more multitemporal urban feature and human settlement datasets have recently become available, issues still exist in remotely-sensed imagery due to coverage, adverse atmospheric conditions, and the expense involved in producing such feature sets. These challenges make it difficult to increase temporal coverage while maintaining high fidelity in the spatial resolution. Here we demonstrate an interpolative and flexible modeling framework for producing annual built-settlement extents. We use a combined technique of random forest and spatio-temporal dasymetric modeling with open source subnational data to produce annual 100m x 100m resolution binary settlement maps in four test countries of varying environmental and developmental contexts for test periods of five-year gaps. We find that in the majority of years, across all study areas, the model correctly identified between 85-99% of pixels that transition to built-settlement. Additionally, with few exceptions, the model substantially outperformed a model that gave every pixel equal chance of transitioning to the category “built” in each year. This modelling framework shows strong promise for filling gaps in cross-sectional urban feature datasets derived from remotely-sensed imagery, providing a base upon which to create future built/settlement extent projections, and further exploring the relationships between built area and population dynamics.
ARTICLE | doi:10.20944/preprints201804.0022.v1
Subject: Social Sciences, Economics Keywords: cooperatives; membership heterogeneity; random forest; collective action
Online: 2 April 2018 (11:01:16 CEST)
The effects of heterogeneity of cooperative membership on cooperative and collective action sustainability have been previously discussed. However, despite the importance of membership heterogeneity in recent theoretical frameworks, empirical examinations have been limited. We determine the effect of changes to cooperative member heterogeneity on cooperative sustainability and discuss changes to heterogeneity over time that can advance our understanding of long-term cooperative sustainability. This study uses USDA Agricultural Management Resource Survey data, coupled with USDA-Rural Development cooperative financial data at the state level, to quantify effects of cooperative member heterogeneity on the sustainability of U.S. farmer cooperatives. We use random forest regression to interpret the significance of heterogeneity for cooperative sustainability at an aggregate level. The findings of this empirical study narrowly reconcile the theoretical understanding of the emergence of intra-cooperative issues while providing consistent empirical evidence and expectations for the sustainability of cooperatives in the near term.
ARTICLE | doi:10.20944/preprints201611.0028.v1
Subject: Medicine & Pharmacology, Oncology & Oncogenics Keywords: random survival forests; ependymoma; predictors; valproic acid
Online: 3 November 2016 (11:02:12 CET)
Ependymoma is responsible for 8–10% of all pediatric brain tumors and constitutes the third most common brain tumor in children. No robust molecular markers are yet in routine clinical use. Surgical resection and adjuvant radiotherapy cure approximately 40-70% of pediatric patients with ependymoma. In our centre, we have been using prophylactic valproic acid treatment for brain tumor patients. Initial observations indicated that valproate could have a beneficial effect on the survival of patients. Recent observations by other authors have shown that patients with glioblastoma benefited from treatment with valproic acid, a histone deacetylase inhibitor. We have used random survival forest, a novel ensemble survival modelling method, to study a small single-center cohort of pediatric patients with ependymoma. This analysis has confirmed the extent of surgical resection and treatment with radiotherapy as independent predictors of overall survival. Treatment with valproic acid was also a predictor of higher survival in this cohort. These results highlight the potential usefulness of the random survival forest model in gathering information from retrospective data. More data are needed about the possible influence of histone deacetylase inhibition by valproic acid on the survival of patients with ependymoma.
ARTICLE | doi:10.20944/preprints202111.0078.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: GRaVN; machine learning; convolutional neural networks; CNN; raman spectroscopy; analogue missions; planetary science; random undersampling; random oversampling; CanMoon
Online: 3 November 2021 (09:24:38 CET)
During planetary exploration mission operations, one of the key responsibilities of the instrument teams is to determine data viability for subsequent analysis. During the 2019 CanMoon Lunar Sample Return Analogue Mission, the Lead Raman Specialist manually examined each spectrum to provide quality assurance/validation. This non-trivial process requires years of experience to complete accurately. With the proven efficacy of Convolutional Neural Networks (CNNs) in classification tasks, and the increased use of automation and control loops on planetary space platforms for navigation and science targeting, an opportunity presents itself to approach this validation problem utilising CNNs. We present the Generalised Raman Validation Network (GRaVN), a neural network focused specifically on extracting the generalised structure of Raman spectra for quality assurance/validation. This work demonstrates the viability of utilising a CNN in validation activities for Raman spectroscopy. Utilising only two hidden layers, a configuration was developed that provided good levels of accuracy on a manually curated dataset. This indicates that such a system could be useful as part of an autonomous control loop during planetary exploration activities.
ARTICLE | doi:10.20944/preprints202108.0248.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Random fields; warped Gaussian Process; Spatial field reconstruction
Online: 11 August 2021 (10:39:35 CEST)
A class of models for non-Gaussian spatial random fields is explored for spatial field reconstruction in environmental and sensor network monitoring. The family of models explored utilises a class of transformation functions known as the Tukey g-and-h transformations to create a family of warped spatial Gaussian process models which can support various desirable features such as flexible marginal distributions, which can be skewed, leptokurtic and/or heavy-tailed. The resulting model is widely applicable in a range of spatial field reconstruction applications. To utilise the model in practice, it is important to carefully characterise the statistical properties of the Tukey g-and-h random fields. In this work, we study the properties of the resulting warped Gaussian processes and use their characterising statistical properties to obtain flexible spatial field reconstructions. In this regard, we derive five different estimators for various important quantities often considered in spatial field reconstruction problems. These include the multi-point Minimum Mean Squared Error (MMSE) estimators; the multiple-point Maximum A-Posteriori (MAP) estimators; an efficient class of multiple-point linear estimators based on the Spatial-Best Linear Unbiased (S-BLUE) estimators; and two multi-point threshold exceedance based estimators, namely the Spatial Regional and Level Exceedance estimators. Simulation results and real data examples show the benefits of using the Tukey g-and-h transformation as opposed to standard Gaussian spatial random fields in a real data application for environmental monitoring.
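To make the warping concrete, here is a minimal sketch of the Tukey g-and-h transformation in its standard textbook form, T(z) = ((exp(g z) - 1)/g) * exp(h z^2 / 2), applied pointwise to Gaussian values. The function name and the parameter values are illustrative assumptions, and the paper's model may include additional location and scale terms not shown here. The parameter g controls skewness and h >= 0 controls tail heaviness.

```python
import math
import random

def tukey_gh(z, g=0.0, h=0.0):
    """Tukey g-and-h transform of a standard Gaussian value z.

    For g == 0 the skewing factor reduces to z itself (the g -> 0 limit).
    """
    core = z if g == 0.0 else math.expm1(g * z) / g
    return core * math.exp(h * z * z / 2.0)

if __name__ == "__main__":
    rng = random.Random(0)
    # Warp i.i.d. Gaussian draws; a spatial model would warp a correlated field instead.
    warped = [tukey_gh(rng.gauss(0.0, 1.0), g=0.5, h=0.1) for _ in range(5)]
    print(warped)
```

With g = h = 0 the transform is the identity, so the warped process reduces to the underlying Gaussian process.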
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Random walk with resetting; Escape probabilities; Exit times
Online: 7 June 2021 (08:04:12 CEST)
We consider a discrete-time random walk $(x_t)$ which at random times is reset to the starting position and performs a deterministic motion between them. We show that the quantity $\Pr(x_{t+1}=n+1 \mid x_t=n)$ as $n\to\infty$ determines if the system is averse, neutral or inclined towards resetting. It also classifies the stationary distribution. Double barrier probabilities, first passage times and the distribution of the escape time from intervals are determined.
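A small simulation can illustrate this kind of process. The sketch below assumes one simple concrete instance that is not spelled out in the abstract: the deterministic motion is a unit step to the right, the starting position is 0, and `step_prob(n)` (a hypothetical name) plays the role of the probability of moving from n to n+1, with a reset to 0 otherwise.

```python
import random
from collections import Counter

def simulate_resetting_walk(step_prob, n_steps, seed=0):
    """Return the empirical occupation frequencies of a walk with resetting.

    step_prob(n): assumed probability of moving from n to n + 1;
    with the complementary probability the walker is reset to 0.
    """
    rng = random.Random(seed)
    x = 0
    visits = Counter()
    for _ in range(n_steps):
        visits[x] += 1
        x = x + 1 if rng.random() < step_prob(x) else 0
    return {n: count / n_steps for n, count in visits.items()}

if __name__ == "__main__":
    dist = simulate_resetting_walk(lambda n: 0.5, 100_000)
    print([round(dist.get(n, 0.0), 3) for n in range(5)])
```

For a constant step probability 1 - r, the stationary distribution of this chain is geometric, with mass r(1 - r)^n at position n, which gives a simple check on the simulation.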
ARTICLE | doi:10.20944/preprints202101.0349.v1
Subject: Social Sciences, Accounting Keywords: Capital structure; Determinants; Microfinance Institutions; Random effect Model
Online: 18 January 2021 (14:50:08 CET)
The aim of this study was to identify MFI-specific determinants of the capital structure of selected microfinance institutions in Ethiopia. The researcher employed a quantitative research approach with an explanatory research design. The result of the regression analysis showed that variables like growth, profitability, firm size, age, and asset tangibility have a positive and statistically significant effect on leverage ratio. Whereas, profitability has a statistically significant and negative effect on capital structure. Based on the findings of the study, the researcher concluded that the firm-specific determinants of capital structure of microfinance institutions in Ethiopia were growth, profitability, firm size, age, and asset tangibility.
ARTICLE | doi:10.20944/preprints202006.0028.v1
Online: 4 June 2020 (07:44:03 CEST)
Nowadays the genetic algorithm (GA) is widely used in engineering pedagogy as an adaptive technique to learn and solve complex problems and issues. It is a meta-heuristic approach that is used to solve hybrid computation challenges. GA utilizes selection, crossover, and mutation operators to effectively manage the search strategy. This algorithm is derived from natural selection and genetics concepts. GA is an intelligent use of random search supported with historical data to direct the search towards an area of improved outcome within a coverage framework. Such algorithms are widely used for maintaining high-quality solutions to optimization and search problems. These techniques are recognized to be somewhat of a statistical investigation process to search for a suitable solution or prevent an accurate strategy for challenges in optimization or searches. For random testing, historical information is provided with intelligent enslavement to continue moving the search out from the area of improved features for processing of the outcomes. It is a category of heuristics of evolutionary history using behavioral science-influenced methods like an annuity, gene, preference, or combination (sometimes referred to as hybridization). This method has proven to be a valuable tool for finding solutions to optimization problems. In this paper, the author has explored GAs, their role in engineering pedagogy, the emerging areas where they are used, and their implementation.
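The three operators named above (selection, crossover, mutation) can be illustrated with a minimal GA over bit strings. This is a generic sketch, not the author's implementation: binary tournament selection, one-point crossover, bit-flip mutation, elitism, and every parameter value are assumptions chosen for illustration.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                      crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximise `fitness` over bit lists with a minimal generational GA."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def select():
        # Binary tournament selection: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_pop = [best[:]]  # elitism: carry the best individual over unchanged
        while len(next_pop) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop, key=fitness)
    return best

if __name__ == "__main__":
    solution = genetic_algorithm(sum)  # "OneMax": maximise the number of 1 bits
    print(solution, sum(solution))
```

Because of elitism, the best fitness in the population never decreases from one generation to the next.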
ARTICLE | doi:10.20944/preprints201808.0018.v1
Subject: Life Sciences, Biochemistry Keywords: Nuclear Magnetic Resonance Spectroscopy, Metabolomics, Biomarker, Random Forest.
Online: 1 August 2018 (11:30:39 CEST)
Background: Diabetes is among the most prevalent diseases worldwide. Of all the affected individuals, a significant proportion remains undiagnosed because of a lack of specific symptoms early in this disorder and inadequate diagnostics. Diabetes and its associated sequelae, i.e., comorbidities, are associated with microvascular and macrovascular complications. As diabetes is characterized by an altered metabolism of key metabolites and regulatory pathways, metabolic phenotyping can provide us with a better understanding of the unique set of regulatory perturbations that predispose to diabetes and its associated comorbidities. Methodology: The present study utilizes NMR spectroscopy coupled with Random Forest statistical analysis to identify the discriminatory metabolites of diabetes (DB) and diabetes-related comorbidity (DC) relative to healthy control (HC) subjects. A combined and pairwise analysis was performed between the serum samples of HC (n=50), DB (n=38), and DC (n=35) individuals to identify the discriminatory metabolites responsible for class separation. The perturbed metabolites were further validated using t-tests and AUROC analysis to examine their statistical significance. Results: The DB and DC patients were well discriminated from HC. In addition, 15 metabolites were found to be significantly perturbed in DC patients compared to DB. The identified panel of metabolites comprises TCA cycle intermediates (succinate, citrate); methylamine metabolism (trimethylamine, methylamine, betaine); energy metabolites (glucose, lactate, pyruvate); and amino acids (valine, arginine, glutamate, methionine, proline and threonine). The metabolites were further used to identify the perturbed metabolic pathways and the correlation of metabolites in DC patients. Conclusion: 1H NMR metabolomics may prove a promising technique to differentiate and predict diabetes and its comorbidities at their onset or progression by determining the altered levels of the metabolites in serum.
ARTICLE | doi:10.20944/preprints201802.0008.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Optimal Bayesian detection, information geometry, minimal error probability, Chernoff/Bhattacharyya upper bound, large random tensor, Fisher information, large random sensing matrix
Online: 1 February 2018 (16:32:04 CET)
The performance in terms of minimal Bayes’ error probability for detection of a high-dimensional random tensor is a fundamental, under-studied and difficult problem. In this work, we consider two Signal to Noise Ratio (SNR)-based detection problems of interest. Under the alternative hypothesis, i.e., for a non-zero SNR, the observed signals are either a noisy rank-$R$ tensor admitting a $Q$-order Canonical Polyadic Decomposition (CPD) with large factors of size $N_q \times R$, i.e., for $1 \le q \le Q$, where $R, N_q \to \infty$ with $R^{1/q}/N_q$ converging towards a finite constant, or a noisy tensor admitting a TucKer Decomposition (TKD) of multilinear $(M_1, \ldots, M_Q)$-rank with large factors of size $N_q \times M_q$, i.e., for $1 \le q \le Q$, where $N_q, M_q \to \infty$ with $M_q/N_q$ converging towards a finite constant. The detection of the random entries (coefficients) of the core tensor in the CPD/TKD is hard to study since the exact derivation of the error probability is mathematically intractable. To circumvent this technical difficulty, the Chernoff Upper Bound (CUB) for larger SNR and the Fisher information at low SNR are derived and studied, based on information geometry theory. The tightest CUB is reached for the value minimizing the error exponent, denoted by $s^\star$. In general, due to the asymmetry of the $s$-divergence, the Bhattacharyya Upper Bound (BUB) (that is, the Chernoff Information calculated at $s^\star = 1/2$) cannot solve this problem effectively. As a consequence, we rely on a costly numerical optimization strategy to find $s^\star$. However, thanks to powerful random matrix theory tools, a simple analytical expression of $s^\star$ is provided with respect to the SNR in the two schemes considered. A main conclusion of this work is that the BUB is the tightest bound at low SNRs. This property is, however, no longer true for higher SNRs.
ARTICLE | doi:10.20944/preprints202209.0088.v1
Subject: Engineering, Mechanical Engineering Keywords: Short fiber-reinforced composite; Random fields; Plasticity; Numerical simulation
Online: 6 September 2022 (10:11:54 CEST)
For the numerical simulation of components made of short fiber-reinforced composites, the correct prediction of the deformation, including the elastic and plastic behavior and its spatial distribution, is essential. When using purely deterministic modeling approaches, the information of the probabilistic microstructure is not included in the simulation process. One possible approach for the integration of stochastic information is the use of random fields. In this study, numerical simulations of tensile test specimens are conducted utilizing a finite deformation elastic-ideal plastic material model. A selection of the material parameters covering the elastic and plastic domain is represented by cross-correlated second-order Gaussian random fields to incorporate the probabilistic nature of the material parameters. To validate the modeling approach, tensile tests until failure are carried out experimentally; these confirm the assumption of spatially distributed material behavior in both the elastic and plastic domain. Since the correlation lengths of the random fields cannot be determined by purely analytic treatments, numerical simulations are additionally performed for different values of the correlation length. The numerical simulations confirm the influence of the correlation length on the overall behavior. For a correlation length of 5 mm, good conformity with the experimental results is obtained. Therefore, it is concluded that the presented modeling approach is suitable to predict the elastic and plastic deformation of a set of tensile test specimens made of short fiber-reinforced composite with sufficient accuracy.
ARTICLE | doi:10.20944/preprints202208.0058.v1
Subject: Physical Sciences, Other Keywords: complexity; phase transitions; criticality; Ising model; random Boolean networks
Online: 2 August 2022 (09:30:37 CEST)
The dynamics of many complex systems can be classified as ordered, chaotic, or critical. Order offers stability and robustness, while chaos allows for change and adaptability. Criticality, then, is often seen as a balance required by living systems at different scales. In classical models, however, criticality is only found near phase transitions, restricting the parameter space (and thus the likelihood) of critical dynamics, as most parameters yield "undesirable" solutions. Here we show that this limitation is due to the homogeneity built into these models, i.e., all elements sharing parameter values. By exploring heterogeneous versions of archetypal models in physics and computer science, we observe critical dynamics in a broader range of parameters, so critical dynamics could be more common than previously thought.
ARTICLE | doi:10.20944/preprints202207.0462.v1
Subject: Social Sciences, Finance Keywords: Machine Learning; Random Forest; Google Trends; Predictability; Banks; Greece
Online: 29 July 2022 (13:07:42 CEST)
Background/Objectives: Accurate prediction of stock prices is an extremely challenging task because of factors such as political conditions, global economy, unexpected events, market anomalies, and relevant companies’ features. In this work, the random forest has been used to forecast the prices of the four major Greek systemic banks. Methods/Analysis: We make use of a set of financial variables based on intraday data: (i) Open stock price, (ii) High stock price, (iii) Low stock price, and (iv) Close stock price of a particular Greek systemic bank. Results/Findings: The variables used here are crucial in predicting systemic banks' stock closing prices. These provide a better prediction of the next day's closing price of the bank series. Novelty/Improvement: To our knowledge, this is the first study that employs machine learning techniques in Greek systemic banks.
ARTICLE | doi:10.20944/preprints202112.0138.v2
Subject: Biology, Agricultural Sciences & Agronomy Keywords: Yield mapping; vegetation index; Stepwise; SR; Random Forest; KNN
Online: 9 December 2021 (15:39:34 CET)
The use of machine learning techniques to predict yield based on remote sensing is a no-return path, and studies conducted on farm aim to help rural producers in decision-making. Thus, commercial fields equipped with technologies in Mato Grosso, Brazil, were monitored by satellite images to predict cotton yield using supervised learning techniques. The objective of this research was to identify how early in the growing season, which vegetation indices and which machine learning algorithms are best to predict cotton yield at the farm level. For that, we went through the following steps: 1) We observed the yield in 398 ha (3 fields) and eight vegetation indices (VI) were calculated on five dates during the growing season. 2) Scenarios were created to facilitate the analysis and interpretation of results: Scenario 1: All Data (8 indices on 5 dates = 40 inputs) and Scenario 2: best variable selected by Stepwise regression (1 input). 3) In the search for the best algorithm, hyperparameter adjustments, calibrations and tests using machine learning were performed to predict yield and performances were evaluated. Scenario 1 had the best metrics in all fields of study, and the Multilayer Perceptron (MLP) and Random Forest (RF) algorithms showed the best performances, with adjusted R² of 47% and RMSE of only 0.24 t ha⁻¹. However, this scenario requires all predictive inputs generated throughout the growing season (approx. 180 days). We therefore optimized the prediction by testing only the best VI in each field and found that, among the eight VIs, the Simple Ratio (SR), driven by the K-Nearest Neighbor (KNN) algorithm, predicts with 0.26 and 0.28 t ha⁻¹ of RMSE and 5.20% MAPE, anticipating the cotton yield with low error by ±143 days. This approach also requires less computation to generate the prediction than MLP and RF, enabling its use as a technique that helps predict cotton yield and saves time for planning, whether in marketing or in crop management strategies.
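The single-input scenario above can be sketched in a few lines. This is an illustration under assumptions, not the study's pipeline: the Simple Ratio is taken in its usual form NIR/Red, the KNN regressor is a plain one-dimensional average of the k nearest training points, and the function names, k and the sample values are hypothetical.

```python
import math

def simple_ratio(nir, red):
    """Simple Ratio vegetation index: near-infrared over red reflectance."""
    return nir / red

def knn_predict(train_x, train_y, x, k=3):
    """Predict y at x as the mean of the k nearest training targets (1-D feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

if __name__ == "__main__":
    # Hypothetical SR values and yields (t/ha), for illustration only.
    sr_train = [simple_ratio(n, r) for n, r in [(0.40, 0.10), (0.45, 0.09), (0.50, 0.08)]]
    yield_train = [3.8, 4.2, 4.6]
    print(knn_predict(sr_train, yield_train, simple_ratio(0.46, 0.09), k=2))
```

A single index and a nearest-neighbour lookup involve far less computation than training an MLP or a forest on 40 inputs, which is the trade-off the abstract highlights.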
Subject: Biology, Agricultural Sciences & Agronomy Keywords: Microbiome; Diazotroph; Nitrogen fixation bacteria; Random Forest; Network; Trichomona
Online: 23 August 2021 (12:15:31 CEST)
Biofertilizer, an environment-friendly and renewable plant nutrient source, has been widely applied and studied to reduce dependency on chemical fertilizers. However, most studies focus on the effects of biofertilizer on the bacterial and fungal communities, and we still lack an understanding of the effects of biofertilizer on the protistan community. Here, the effects of biofertilizer application on the composition and interaction of the protistan community in the wheat rhizosphere were investigated based on a 4-year field experiment. Biofertilizer application altered soil physicochemical properties and the protistan community composition (ANOSIM, p < 0.001), and significantly induced an alpha diversity decline. Random forest and redundancy analysis demonstrated that nitrogenase activity and available phosphorus were the main drivers. Trichomonas, classified to the phylum Metamonada, was enriched by biofertilizer and was significantly positively connected with soil nitrogenase activity and some function genes involved in nitrogen fixation and nitrogen dissimilation. Biofertilization loosened biotic interactions but did not affect the stability of the protistan community. In addition, biofertilizer promoted the connections of protists with fungi, bacteria, and archaea. Combined with the conjunct biotic network (protist, fungi, bacteria, and archaea) and interactions between protists and soil physicochemical properties/function genes, protists may act as keystone taxa potentially driving soil microbiome composition and function.
ARTICLE | doi:10.20944/preprints202012.0152.v1
Subject: Engineering, Automotive Engineering Keywords: Wind farm noise; Amplitude modulation; Random Forest; AM detection
Online: 7 December 2020 (12:51:54 CET)
Amplitude modulation (AM) is a characteristic feature of wind farm noise and has the potential to contribute to annoyance and sleep disturbance. This study aimed to develop an AM detection method using a random forest approach. The method was developed and validated on 6,000 10-second samples of wind farm noise manually classified by a scorer via a listening experiment. Comparison between the random forest method and other widely-used methods showed that the proposed method consistently demonstrated superior performance. This study also found that a combination of low-frequency content features and other unique characteristics of wind farm noise play an important role in enhancing AM detection performance. Taken together, these findings support that using machine learning-based detection of AM is well suited and effective for in-depth exploration of large wind farm noise data sets for potential legislative and research purposes.
ARTICLE | doi:10.20944/preprints202011.0436.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Space-filling curves; Ergodic Theory; uniform random number generation.
Online: 16 November 2020 (16:51:07 CET)
In this paper the problem of sampling from uniform probability distributions is approached by means of space-filling curves (SFCs), a topological concept that has found a number of important applications in recent years. Departing from the theoretical fact that they are surjective but not necessarily injective, the investigation focuses on the structure of the distributions obtained when their domain is swept in a uniform and discrete manner and the corresponding values are used to build histograms, which are approximations of their true PDFs. This work concentrates on the real interval [0,1], and the Sierpinski space-filling curve was chosen because of its favorable computational properties. In order to validate the results, the Kullback-Leibler distance is used when comparing the obtained distributions at several levels of granularity with other already established sampling methods. In truth, the generation of uniform random numbers is a deterministic simulation of randomness using numerical operations. In this fashion, sequences resulting from this sort of process are not truly random.
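The validation step described above, building a histogram and measuring its Kullback-Leibler distance from the uniform distribution, can be sketched as follows. The sample source here is Python's built-in generator, used purely as a stand-in; the Sierpinski curve sampler of the paper is not reproduced, and the function names and bin counts are assumptions.

```python
import math
import random

def histogram(samples, n_bins):
    """Normalised histogram of samples in [0, 1] with equal-width bins."""
    counts = [0] * n_bins
    for s in samples:
        counts[min(int(s * n_bins), n_bins - 1)] += 1  # s == 1.0 falls in the last bin
    total = len(samples)
    return [c / total for c in counts]

def kl_to_uniform(probs):
    """Kullback-Leibler divergence KL(P || U) against the uniform distribution."""
    n = len(probs)
    return sum(p * math.log(p * n) for p in probs if p > 0)

if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.random() for _ in range(100_000)]  # stand-in sample source
    for n_bins in (10, 100, 1000):  # several levels of granularity
        print(n_bins, kl_to_uniform(histogram(samples, n_bins)))
```

A divergence of zero means the histogram is exactly flat; larger values indicate a stronger departure from uniformity at the chosen granularity.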
ARTICLE | doi:10.20944/preprints202002.0108.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: machine learning; decision tree; random forest; crime data analytics
Online: 9 February 2020 (16:02:03 CET)
Machine learning plays a key role in present-day crime detection, analysis and prediction. The goal of this work is to propose methods for predicting crimes classified into different categories of severity. We implemented visualization and analysis of crime data statistics from recent years in the city of Boston. We then carried out a comparative study between two supervised learning algorithms, decision tree and random forest, based on the accuracy and processing time of the models, making predictions from geographical and temporal information after splitting the data into training and test sets. The result shows that random forest, as expected, gives a better result, with 1.54% more accuracy than decision tree, although this comes at a cost of at least 4.37 times the processing time. The study opens doors to the application of similar supervised methods in crime data analytics and other fields of data science.
Online: 22 October 2019 (15:40:00 CEST)
We use a random gap model to describe a metal-insulator transition in three-dimensional semiconductors due to doping and find a conventional phase transition, where the effective scattering rate is the order parameter. Spontaneous symmetry breaking results in metallic behavior, whereas the insulating regime is characterized by the absence of spontaneous symmetry breaking. The transition is continuous for the average conductivity with critical exponent equal to 1. Away from the critical point the exponent is roughly 0.6, which may explain experimental observations of a crossover of the exponent from 1 to 0.5 by going away from the critical point.
ARTICLE | doi:10.20944/preprints201802.0007.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Chaotic itineracy; random dynamics; computer aided proof; neural networks
Online: 1 February 2018 (10:04:20 CET)
We consider a random dynamical system arising in a model of associative memory. This system can be seen as a small (stochastic and deterministic) perturbation of a deterministic system having two weak attractors which are destroyed after the perturbation. We show, with a computer aided proof, that the system has a kind of chaotic itineracy. Typical orbits are globally chaotic, while they spend a relatively long time visiting the attractors' ruins.
REVIEW | doi:10.20944/preprints201705.0111.v1
Subject: Physical Sciences, Optics Keywords: random fiber laser; Lévy statistics; photonic spin-glass behavior
Online: 15 May 2017 (11:59:43 CEST)
The interest in random fiber lasers (RFLs), first demonstrated one decade ago, is still growing and their basic characteristics have been studied by several authors. RFLs are open systems that present instabilities in the intensity fluctuations due to the energy exchange among their non-orthogonal quasi-modes. In this work, we present a review of the recent investigations on the output characteristics of a continuous-wave erbium-doped RFL, with emphasis on the statistical behavior of the emitted intensity fluctuations. A progression from the Gaussian to Lévy and back to the Gaussian statistical regime was observed by increasing the excitation laser power from below to above the RFL threshold. By analyzing the RFL output intensity fluctuations, the probability density function of emission intensities was determined, and its correspondence with the experimental results was identified, enabling a clear demonstration of the analogy between the RFL phenomenon and the spin-glass phase transition. A replica-symmetry-breaking phase above the RFL threshold was characterized and the glassy behavior of the emitted light was established. We also discuss perspectives for future investigations on RFL systems.
ARTICLE | doi:10.20944/preprints202206.0356.v1
Subject: Mathematics & Computer Science, Numerical Analysis & Optimization Keywords: optimization; video segmentation; decision tree; random forest; gradient boost tree
Online: 27 June 2022 (08:56:21 CEST)
Video segmentation is crucial in a variety of practical applications, especially in computer vision. Most recent work in video segmentation focuses on deep-learning-based methods, and there is room for improvement with respect to evolutionary algorithms. This paper proposes a novel method for video segmentation that optimizes segmentation parameters using ensemble-based random forest and gradient boosting decision trees. The experimental results show the Pareto front of the segmentation parameters (hue, brightness, luminance, and saturation). Our optimization model yields accuracy: 85% +/- 8.85% (micro average: 85.00%), average class precision: 84.88%, and average class recall: 85%. We also show the video segmentation results based on our optimization method and compare our results with Kinect-based video segmentation.
ARTICLE | doi:10.20944/preprints202109.0460.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Space-filling curves; Ergodic Theory; random number generation; Gaussian distribution
Online: 28 September 2021 (09:56:55 CEST)
This work addresses the problem of sampling from Gaussian probability distributions by means of uniform samples obtained deterministically and directly from space-filling curves (SFCs), a purely topological concept. To that end, the well-known inverse cumulative distribution function method is used, with the help of the probit function, which is the inverse of the cumulative distribution function of the standard normal distribution. Mainly due to the central limit theorem, the Gaussian distribution plays a fundamental role in probability theory and related areas, and that is why it has been chosen to be studied in the present paper. Numerical distributions (histograms) obtained with the proposed method, and in several levels of granularity, are compared to the theoretical normal PDF, along with other already established sampling methods, all using the cited probit function. Final results are validated with the Kullback-Leibler and two other divergence measures, and it will be possible to draw conclusions about the adequacy of the presented paradigm. As is amply known, the generation of uniform random numbers is a deterministic simulation of randomness using numerical operations. That said, sequences resulting from this kind of procedure are not truly random. Even so, and to be coherent with the literature, the expression “random number” will be used throughout the text to mean “pseudo-random number”.
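The inverse cumulative distribution function method named in this abstract can be sketched in a few lines. This is a hedged illustration only: the uniform inputs are a plain midpoint grid rather than samples derived from a space-filling curve, and `gaussian_from_uniform` is a hypothetical helper name. The probit function is taken from Python's standard library as `statistics.NormalDist().inv_cdf`.

```python
from statistics import NormalDist, fmean, pstdev

def gaussian_from_uniform(uniform_samples, mu=0.0, sigma=1.0):
    """Map uniform samples in the open interval (0, 1) to N(mu, sigma^2)
    with the inverse-CDF method: x = mu + sigma * probit(u)."""
    probit = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return [mu + sigma * probit(u) for u in uniform_samples]

# Assumption: deterministic midpoints of n equal cells stand in for the
# uniform samples; they avoid the endpoints 0 and 1, where probit diverges.
n = 10_000
uniform = [(i + 0.5) / n for i in range(n)]
normal = gaussian_from_uniform(uniform)

print("mean:", fmean(normal))
print("standard deviation:", pstdev(normal))
```

Any source of uniform samples on (0, 1) can be substituted for `uniform`; the quality of the resulting Gaussian histogram then depends only on how evenly those inputs cover the interval.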
ARTICLE | doi:10.20944/preprints202107.0389.v1
Subject: Engineering, Automotive Engineering Keywords: Centralized fusion estimation, Random delay systems, Tessarine processing, Tk properness.
Online: 16 July 2021 (16:30:57 CEST)
The centralized fusion estimation problem for discrete-time vectorial tessarine signals in multiple sensor stochastic systems with random one-step delays and correlated noises is analyzed under different T-properness conditions. Based on Tk, k=1,2, linear processing, new centralized fusion filtering, prediction, and fixed-point smoothing algorithms are devised. These algorithms have the advantage of providing optimal estimators with a significant reduction in computational cost compared to that obtained through a real or widely linear processing approach. Simulation examples illustrate the effectiveness and applicability of the algorithms proposed, in which the superiority of the Tk linear estimators over their counterparts in the quaternion domain is apparent.
ARTICLE | doi:10.20944/preprints202010.0564.v1
Subject: Engineering, Automotive Engineering Keywords: robotics; autonomy obstacle avoidance; path optimization; genetic algorithm; random search
Online: 27 October 2020 (20:44:30 CET)
In rescue operations the total time of action plays an important role. It is the sum of the times of the planning, travel, and manipulation (at the place of action) phases. The paper considers minimizing the time of the first two phases for an autonomous vehicle used for remote action. For a map known a priori, path planning consists of locally optimal decisions that are then collected in the general algorithm of the optimal path. Such an approach significantly reduces the time of path planning. The robot's features and known sparse obstacles reduce the allowable robot speeds. The travel time is calculated from the allowable velocity profile, so it can be used to estimate the travel performance. Genetic algorithm and random search-based methods for path finding with travel time optimization are exploited and compared in the paper. All the proposed time optimization solutions for rescue operations are checked in computer simulations, and the simulation results are presented.
ARTICLE | doi:10.20944/preprints202008.0089.v1
Subject: Earth Sciences, Geology Keywords: Deep Neural Network; Extreme Gradient Boosting; Random Forest; Landslide Susceptibility
Online: 4 August 2020 (11:13:02 CEST)
Landslides impact human activities and socio-economic development, especially in mountainous areas. This study focuses on the comparison of the prediction capability of advanced machine learning techniques for rainfall-induced shallow landslide susceptibility of Deokjeokri catchment and Karisanri catchment in South Korea. The influencing factors for landslides, i.e., topographic, hydrologic, soil, forest, and geologic factors, are prepared from various sources based on availability, and a multicollinearity test is also performed to select relevant causative factors. The landslide inventory maps of both catchments are obtained from historical information, aerial photographs and field surveys. In this study, Deokjeokri catchment is considered as a training area and Karisanri catchment as a testing area. The landslide inventories contain 748 landslide points in the training area and 219 points in the testing area. Three landslide susceptibility maps using machine learning models, i.e., Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Deep Neural Network (DNN), are prepared and compared. The outcomes of the analyses are validated using the landslide inventory data. A receiver operating characteristic curve (ROC) method is used to verify the results of the models. The results of this study show that the training accuracy of RF is 0.757 and the testing accuracy is 0.74. Similarly, the training accuracy of XGBoost is 0.756 and the testing accuracy is 0.703. The prediction of DNN revealed acceptable agreement between the susceptibility map and the existing landslides, with training and testing accuracy of 0.855 and 0.802, respectively. The results showed that the DNN model achieved lower prediction error and higher accuracy than the other models for shallow landslide modeling in the study area.
ARTICLE | doi:10.20944/preprints202003.0036.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: ECG feature selection; heartbeat classification; arrhythmia detection; random forest classifier
Online: 3 March 2020 (11:12:20 CET)
Finding an optimal combination of features and classifier is still an open problem in the development of automatic heartbeat classification systems, especially when applications that involve resource-constrained devices are considered. In this paper, a novel study of the selection of informative features and the use of a random forest classifier while following the recommendations of the Association for the Advancement of Medical Instrumentation (AAMI) and an inter-patient division of datasets is presented. Features were selected using a filter method based on the mutual information ranking criterion on the training set. Results showed that normalized R-R intervals and features relative to the width of the QRS complex are the most discriminative among those considered. The best results achieved on the MIT-BIH Arrhythmia Database were an overall accuracy of 96.14% and F1-scores of 97.97%, 73.06%, and 90.85% in the classification of normal beats, supraventricular ectopic beats, and ventricular ectopic beats respectively. In comparison with other state of the art approaches tested under similar constraints, this work represents one of the highest performances reported to date while relying on a very small feature vector.
ARTICLE | doi:10.20944/preprints201908.0056.v1
Subject: Materials Science, General Materials Science Keywords: CO2 separation; random copolymer; PIM-polyimide; permeability-selectivity; pressure effect
Online: 5 August 2019 (08:07:23 CEST)
Random copolymers made of both (PIM-polyimide) and (6FDA-durene-PI) were prepared for the first time by a facile one-step polycondensation reaction. By combining the highly porous and contorted structure of PIM (polymers with intrinsic microporosity) and high thermomechanical properties of PI (polyimide), the membranes obtained from these random copolymers [(PIM-PI)x-(6FDA-durene-PI)y] showed high CO2 permeability (> 1047 Barrer) with moderate CO2/N2 (> 16.5) and CO2/CH4 (> 18) selectivity, together with excellent thermal and mechanical properties. The membranes prepared from three different compositions of two comonomers (1:4, 1:6 and 1:10 of x:y), all showed similar morphological and physical properties, and gas separation performance, indicating ease of synthesis and practicability for large-scale production. The gas separation performance of these membranes at various pressure ranges (100–1500 torr) was also investigated.
ARTICLE | doi:10.20944/preprints201907.0158.v1
Subject: Biology, Agricultural Sciences & Agronomy Keywords: Cunninghamia lanceolata; UAVs; hyperspectral camera; machine learning; random forests; XGBoost
Online: 11 July 2019 (11:41:33 CEST)
Accurate measurement of tree height and diameter at breast height (DBH) in forests to evaluate the growth rate of cultivars is still a significant challenge, even when using LiDAR and 3-D modeling. We propose an integrated pipeline methodology to measure the biomass of different tree cultivars in plantation forests with high crown density that combines unmanned aerial vehicles (UAVs), hyperspectral image sensors, and data processing algorithms using machine learning. Using a plantation of Cunninghamia lanceolata, commonly known as Chinese fir, in Fujian, China, images were collected using a hyperspectral camera and orthorectified in HiSpectral Stitcher. Vegetation indices and modeling were processed in Python using decision trees, random forests, support vector machine, and eXtreme Gradient Boosting (XGBoost) third-party libraries. Tree height and DBH of 2880 samples were measured manually and clustered into three groups (“fast” growth, “median” growth, and “normal” growth), and 19 vegetation indices from 12,000 pixels were extracted as the input features for the modeling. After modeling and cross-validation, the classifier generated by random forests had the best prediction accuracy compared to the other algorithms (75%). This framework can be applied to other tree species to make management and business decisions.
ARTICLE | doi:10.20944/preprints201904.0244.v1
Subject: Keywords: salient object; local binary pattern; histogram features; conditional random field
Online: 22 April 2019 (11:40:11 CEST)
We propose a novel method for salient object detection in different images. Our method integrates spatial features for efficient and robust representation to capture meaningful information about the salient objects. We then train a conditional random field (CRF) using the integrated features. The trained CRF model is then used to detect salient objects during the online testing stage. We perform experiments on two standard datasets and compare the performance of our method with different reference methods. Our experiments show that our method outperforms the compared methods in terms of precision, recall, and F-Measure.
ARTICLE | doi:10.20944/preprints201710.0086.v2
Subject: Engineering, Electrical & Electronic Engineering Keywords: cluster head; dead node; random; vicinity; modulation; index; survival; overhead
Online: 23 October 2017 (08:06:47 CEST)
Heterogeneous Wireless Sensor Networks (HWSNs) meet the requirements of researchers designing real-life applications for unattended problems. However, the main constraint faced by researchers is the energy source available to sensor nodes. To prolong the life of sensor nodes, and hence of the HWSN, it is necessary to design energy-efficient operational schemes. One of the most suitable routing schemes is the clustering approach, which improves stability and hence enhances the performance parameters of the HWSN. This article proposes a novel energy-efficient clustering protocol for HWSNs, EECPEP-HWSN, to enhance these performance parameters. The proposed protocol is designed with three levels of nodes, namely normal, advanced, and super nodes. In the clustering process, cluster head selection considers three parameters available at the sensor node at run time: initial energy, hop count, and residual energy. The protocol enhances the energy efficiency of the HWSN and improves its performance parameters: more energy remains in the network, the stability period is extended, the lifetime is prolonged, and throughput is therefore higher. The proposed protocol is found to outperform LEACH, DEEC, and SEP by about 188, 150, and 141 percent, respectively.
ARTICLE | doi:10.20944/preprints202208.0481.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Random normalization; thinning operators; Bernstein Theorem; problem of moments; Sibuya distribution
Online: 29 August 2022 (09:43:57 CEST)
Different variants of thinning for discrete random variables are studied. The thinning procedure allows one to introduce an analog of a scale parameter for positive integer-valued random variables. Sufficient and necessary conditions for the existence of such a scale are given.
ARTICLE | doi:10.20944/preprints202004.0316.v2
Subject: Earth Sciences, Environmental Sciences Keywords: Precision farming; Early crop-type mapping; Sentinel-2; Random Forest; SVM
Online: 17 January 2022 (10:54:10 CET)
Crop-type mapping is an important intermediate step for cost-effective crop management at the field level, as an overview of all fields with a particular crop type can be used for monitoring or yield forecasting, for instance. Our study used a data set with 2400 fields and corresponding satellite observations from the federal state of Bavaria, Germany. The study classified corn, winter wheat, winter barley, sugar beet, potato, and winter rapeseed as the main crops grown in Upper Bavaria. We additionally experimented with a rejection class "Other", which summarised further crop types. Corresponding Sentinel-2 data included the normalised difference vegetation index (NDVI) and raw bands from 2016 to 2018 for each selected field. The influence of raw bands compared to NDVI was analysed and the classification algorithms, i.e. support vector machine (SVM) and random forest (RF), were compared. The study showed that the use of an index should be critically questioned and that raw bands provided a wider spectral bandwidth, which significantly improved the mapping of crop types. The results underline the use of RF with raw bands and achieved overall accuracies (OA) of up to 92%. We also predicted crop types in an unknown year with significantly different weather conditions and several months before the end of the growing season. Thus, the influence of climate anomalies and the accuracy depending on the time of prediction were assessed. The crop types of a test site and year without labels could be determined with an OA of up to 86%. The results demonstrate the usefulness of the proof-of-concept and its readiness for use in real applications.
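For readers unfamiliar with the index compared against raw bands in this abstract, NDVI is defined as (NIR - Red) / (NIR + Red). The sketch below is illustrative only: the abstract does not state which bands were used, so the mapping to Sentinel-2 bands B8 (near infrared) and B4 (red) is the conventional choice and an assumption here, and the function name `ndvi` and the reflectance values are hypothetical.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index: (NIR - Red) / (NIR + Red).

    Assumption: `nir` and `red` are surface reflectances, conventionally
    Sentinel-2 bands B8 and B4. A pixel with zero total reflectance has no
    defined NDVI; returning 0.0 for it is a choice made for this sketch.
    """
    denom = nir + red
    if denom == 0:
        return 0.0
    return (nir - red) / denom

# Hypothetical reflectance pairs (NIR, Red) for two pixels.
for label, (nir, red) in {"dense vegetation": (0.45, 0.05),
                          "bare soil": (0.25, 0.20)}.items():
    print(label, round(ndvi(nir, red), 3))
```

Because the index collapses two bands into one number, a classifier fed only NDVI sees less spectral information than one fed the raw bands, which is consistent with the abstract's finding that raw bands improved crop-type mapping.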
ARTICLE | doi:10.20944/preprints202201.0021.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: Protocol; Mobile ad hoc network; Mobility Model; Random Waypoint Mobility; throughput.
Online: 4 January 2022 (20:35:06 CET)
Mobility models are used to evaluate the network protocols of ad hoc networks through simulation. The random waypoint model is a mobility model commonly used for the performance evaluation of mobile ad hoc networks. Mobile nodes move dynamically in an ad hoc network, so the mobility model plays an important role in evaluating protocol performance. In this article, we develop a modified random waypoint mobility (MRWM) model, based on random waypoint, for mobile ad hoc networks. We carry out a comparative analysis of the modified random waypoint mobility model and the random waypoint model with the Ad hoc On-Demand Distance Vector (AODV) routing protocol for a large wireless ad hoc network (100 nodes) in a random mobile environment over 1800 s of simulation time. To increase confidence in the protocol, extensive simulations were performed under heavy traffic (i.e., 80 nodes) conditions. The proposed model has been investigated with the following performance metrics: throughput, packet delivery ratio, packet dropping ratio, end-to-end delay, and normalized routing overhead. The results reveal that the proposed modified random waypoint mobility model reduces mobility compared to the random waypoint mobility model and that its trace is more realistic.
ARTICLE | doi:10.20944/preprints202112.0184.v2
Subject: Earth Sciences, Other Keywords: Spectral; Geochemistry; Random Forest; Regression; Whole Rock; MIR; SWIR; VNIR; NMF
Online: 21 December 2021 (12:35:45 CET)
The efficacy of predicting geochemical parameters with a 2-chain workflow using spectral data as the initial input is evaluated. Spectral measurements spanning the approximate 400-25000 nm spectral range are used to train a workflow consisting of a non-negative matrix factorization (NMF) step, for data reduction, and a random forest regression (RFR) to predict 8 geochemical parameters. Approximately 175000 spectra with their corresponding chemical analysis were available for training, testing and validation purposes. The samples and their spectral and chemical parameters represent 9399 drill cores. Of those, approximately 20000 spectra and their accompanying analysis were used for training and 5000 for model validation. The remaining pairwise data (150000 samples) were used for testing of the method. The data are distributed over 2 large spatial extents (980 km2 and 3025 km2 respectively) and allowed the proposed method to be tested against samples that are spatially distant from the initial training points. Global R2 scores and wt.% RMSE on the 150000 validation samples are Fe(0.95/3.01), SiO2(0.96/3.77), Al2O3(0.92/1.27), TiO(0.68/0.13), CaO(0.89/0.41), MgO(0.87/0.35), K2O(0.65/0.21) and LOI(0.90/1.14), given as Parameter(R2/RMSE), and demonstrate that the proposed method is capable of predicting the 8 parameters and is stable enough, in the environment tested, to extend beyond the training set's initial spatial location.
ARTICLE | doi:10.20944/preprints202110.0172.v1
Subject: Social Sciences, Economics Keywords: PSNP; household food consumption; household dietary diversity; random effect; instrumental variable
Online: 11 October 2021 (17:15:51 CEST)
This study empirically investigates the effect of productive safety net programme (PSNP) on household food consumption and dietary diversity in Ethiopia. The study applied random effects with instrumental variable to estimate the effect of PSNP membership. The result of the study indicates that though PSNP membership improves household food consumption, it reduces household dietary diversity score. Household food consumption and dietary diversity are also significantly influenced by sex, age, education status of household head, household size, livestock ownership, distance to the nearest market and participation in non-farm activities. The findings of this study suggest that PSNP membership should be reinforced by building household awareness of the benefits of consuming a variety of foods. In addition, PSNP membership should be designed to endow the households to accumulate essential assets, especially livestock.
ARTICLE | doi:10.20944/preprints202108.0268.v1
Subject: Keywords: Fuzzy collaborative intelligence; Dynamic random access memory; Fuzzy weighted intersection; Forecasting
Online: 11 August 2021 (18:08:46 CEST)
In a collaborative forecasting task, experts may have unequal authority levels. However, this has rarely been considered reasonably in the existing fuzzy collaborative forecasting methods. In addition, experts may not be willing to discriminate their authority levels. To address these issues, an auto-weighting fuzzy weighted intersection (FWI) fuzzy collaborative intelligence approach is proposed in this study. In the proposed auto-weighting FWI fuzzy collaborative intelligence approach, experts’ authority levels are automatically and reasonably assigned based on their past forecasting performances. Subsequently, the auto-weighting FWI mechanism is established to aggregate experts’ fuzzy forecasts. The theoretical properties of the auto-weighting FWI mechanism have been discussed and compared with those of the existing fuzzy aggregation operators. After applying the auto-weighting FWI fuzzy collaborative intelligence approach to a case of forecasting the yield of a DRAM product from the literature, its advantages over several existing methods were clearly illustrated.
ARTICLE | doi:10.20944/preprints202004.0392.v1
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Multivariate Public Key Cryptosystem; Random polynomial; Oil Vinegar signature; Provable Security
Online: 22 April 2020 (06:09:50 CEST)
An oil and vinegar scheme is a signature scheme based on multivariate quadratic polynomials over finite fields. The system of polynomials contains $n$ variables, divided into two groups: $v$ vinegar variables and $o$ oil variables. The scheme is called balanced (OV) or unbalanced (UOV), depending on whether $v = o$ or not, respectively. These schemes are very fast and require modest computational resources, which make them ideal for low-cost devices such as smart cards. However, the OV scheme has already been proven to be insecure, and the UOV scheme has been proven to be very vulnerable for many parameter choices. In this paper, we propose a new multivariate public key signature whose central map consists of a set of polynomials obtained from the multiplication of block matrices. Our construction is motivated by the design of the Simple Matrix Scheme for Encryption and the UOV scheme. We show that it is secure against the Separation Method, which can be used to attack the UOV scheme, and against the Rank Attack, which is one of the deadliest attacks against multivariate public-key cryptosystems. Some theoretical results on matrices with polynomial entries are also given, to support the construction of the scheme.
ARTICLE | doi:10.20944/preprints202002.0425.v1
Subject: Earth Sciences, Environmental Sciences Keywords: hydraulic conductivity; pedotransfer function; prediction uncertainty; random forest; soil water retention
Online: 28 February 2020 (12:06:14 CET)
Soil hydraulic properties are often derived indirectly, i.e., computed from easily available soil properties with pedotransfer functions (PTFs), when they are needed for catchment, regional or continental scale applications. When predicted soil hydraulic parameters are used for modelling the state and flux of water in soils, the uncertainty of the computed values can provide more detailed information when drawing conclusions. The aim of this study was to update the previously published European PTFs (Tóth et al., 2015, euptf v1.4.0) by providing prediction uncertainty calculation built into the transfer functions. The new set of algorithms was derived for point predictions of soil water content at saturation (0 cm matric potential head), field capacity (both -100 and -330 cm matric potential head), wilting point (-15,000 cm matric potential head), plant available water, and saturated hydraulic conductivity, as well as the Mualem-van Genuchten model parameters of the moisture retention and hydraulic conductivity curve. The minimum set of input properties for the prediction is soil depth and sand, silt and clay content. The effect of including additional information such as soil organic carbon content, bulk density, calcium carbonate content, pH and cation exchange capacity was extensively analysed. The PTFs were derived adopting the random forest method. The advantage of the new PTFs is that they i) provide information about prediction uncertainty, ii) are significantly more accurate than the euptfv1, iii) can be applied for more predictor variable combinations than the euptfv1, 32 instead of 5, and iv) are now also derived for the prediction of water content at -100 cm matric potential head and plant available water content.
ARTICLE | doi:10.20944/preprints202001.0385.v1
Subject: Earth Sciences, Geoinformatics Keywords: wildfires; susceptibility mapping; machine learning; random forest; model validation; Liguria region
Online: 31 January 2020 (11:40:30 CET)
Wildfire susceptibility maps display the probability of wildfire occurrence, ranked from low to high, under a given environmental context. Current studies in this field often rely on expert knowledge, with or without statistical models that assess the cause-effect correlation. Machine learning (ML) algorithms can perform very well and be more generalizable thanks to their capability of learning from and making predictions on data. Italy is highly affected by wildfires due to the high heterogeneity of the territory and to the predisposing meteorological conditions. The main objective of the present study is to elaborate a wildfire susceptibility map for Liguria region (Italy) by applying Random Forest, an ensemble ML algorithm based on decision trees. Susceptibility was assessed by evaluating the probability for an area to burn in the future considering where wildfires occurred in the past and which geo-environmental factors favor their spread. Different models were compared, with or without the neighboring vegetation and using an increasing number of folds for the spatial cross-validation. Susceptibility maps for the two fire seasons were finally elaborated and validated, and the results were critically discussed, highlighting the capacity of the proposed approach to identify the efficiency of fire fighting activities.
ARTICLE | doi:10.20944/preprints202001.0058.v2
Subject: Engineering, Civil Engineering Keywords: slate; crown; random fill; compaction quality control; wheel-tracking test; topographic settlement
Online: 28 January 2020 (05:22:56 CET)
Particle size can pose a challenge to random embankment compaction control methods, where practical techniques have hardly been developed and procedural control is used instead. In order to develop new quality control procedures for slate random fill, the necessary fieldwork and laboratory tests were carried out. This involved the revision of certain methods such as the wheel-tracking or topographic settlement tests. More than six hundred in-situ density and moisture content measurements were carried out for this research. In addition, more than three hundred topographic settlements and three hundred wheel-tracking carriage tests were performed. The quality control processes were completed with more than thirty plate bearing tests. Possible evidence of statistical correlations between compaction control tests was identified. An analysis of variance (ANOVA) was performed. When testing proved relationships between them, the replacement of one of them by the other was assessed by deduction. Finally, the study suggests new procedures for compaction quality control of random slate fill used in the crown area.
ARTICLE | doi:10.20944/preprints201908.0097.v1
Subject: Engineering, Civil Engineering Keywords: Evapotranspiration, Genetic programming, Support vector machine, Multiple linear regression, Random forest
Online: 7 August 2019 (11:28:34 CEST)
The ASCE-EWRI reference evapotranspiration (ETo) equation is recommended as a standardized method for reference crop ETo estimation. However, various climate data as input variables to the standardized ETo method are considered limiting factors in most cases and restrict the ETo estimation. This paper assessed the potential of different machine learning (ML) models for ETo estimation using limited meteorological data. The ML models used to estimate daily ETo included Gene Expression Programming (GEP), Support Vector Machine (SVM), Multiple Linear Regression (LR), and Random Forest (RF). Three input combinations of daily maximum and minimum temperature (Tmax and Tmin), wind speed (W) with Tmax and Tmin, and solar radiation (Rs) with Tmax and Tmin were considered using meteorological data during 2003–2016 from six weather stations in the Red River Valley. To understand the performance of the applied models with the various combinations, station, and yearly based tests were assessed with local and spatial approaches. Considering the local and spatial approaches analysis, the LR and RF models illustrated the lowest rate of improvement compared to GEP and SVM. The spatial RF and SVM approaches showed the lowest and highest values of the scatter index as 0.333 and 0.457, respectively. As a result, the radiation-based combination and the RF model showed the best performance with higher accuracy for all stations either locally or spatially, and the spatial SVM and GEP illustrated the lowest performance among models and approaches.
ARTICLE | doi:10.20944/preprints201901.0202.v2
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Concentration Inequality, Empirical Bernstein Bound, Stratified Random Sampling, Shapley Value Approximation
Online: 31 May 2019 (10:37:48 CEST)
We derive a concentration inequality for the uncertainty in the mean computed by stratified random sampling, and provide an online sampling method based on this inequality. Our concentration inequality is versatile and considers a range of factors including: the data ranges, weights, sizes of the strata, the number of samples taken, the estimated sample variances, and whether strata are sampled with or without replacement. Sequentially choosing samples to minimize this inequality leads to an online method for choosing samples from a stratified population. We evaluate and compare the effectiveness of our method against others for synthetic data sets, and also in approximating the Shapley value of cooperative games. Results show that our method is competitive with the performance of Neyman sampling with perfect variance information, even without having prior information on strata variances. We also provide a multidimensional extension of our inequality and discuss future applications.
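For orientation, the Neyman sampling baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration of the textbook Neyman allocation (samples per stratum proportional to stratum size times stratum standard deviation), not the authors' online method; the function names and the fallback for zero variance are assumptions made for this example, and rounding to whole samples is left out.

```python
def neyman_allocation(sizes, stds, total_samples):
    """Split total_samples across strata in proportion to N_h * sigma_h.

    sizes: stratum population sizes N_h
    stds:  stratum standard deviations sigma_h (assumed known here)
    Returns real-valued allocations; rounding is left to the caller.
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # Assumption for this sketch: with no spread anywhere,
        # fall back to allocation proportional to stratum size.
        weights = list(sizes)
        total = sum(weights)
    return [total_samples * w / total for w in weights]


def stratified_mean(sizes, stratum_means):
    """Population mean estimate: stratum means weighted by stratum size."""
    total = sum(sizes)
    return sum(n * m for n, m in zip(sizes, stratum_means)) / total
```

With two equally sized strata whose standard deviations are 1 and 3, a budget of 40 samples is split 10 to 30, which is why this baseline needs variance information that the abstract's method does without.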
Subject: Physical Sciences, Astronomy & Astrophysics Keywords: arithmetic figures, black hole, deterministic model, geometrization of physics, random walk
Online: 26 March 2019 (10:23:11 CET)
The Universe, rather than being homogeneous, displays an almost infinite topological genus, because it is punctured with a countless number of gravitational vortexes, i.e., black holes. Starting from this view, we aim to show that the occurrence of black holes is constrained by geometric random walks taking place during cosmic inflationary expansion. At first, we introduce a visual model, based on Pascal’s triangle and linear and nonlinear arithmetic octahedrons, which describes three-dimensional cosmic random walks. In the case of nonlinear 3D paths, trajectories in an expanding Universe can be depicted as the operation of filling the numbers of the octahedrons in the form of “islands of numbers”: this leads to separate cosmic structures (standing for matter/energy), spaced out by empty areas (constituted by black holes and dark matter). These procedures allow us to describe the topology of a universe of infinite genus, to assess black hole formation in terms of infinite Betti numbers, to highlight how nonlinear random walks might provoke gravitational effects even in the absence of mass/energy, and to propose a novel interpretation of Bekenstein-Hawking entropy: it is proportional to the surface, rather than the volume, of a black hole, because the latter does not contain information.
ARTICLE | doi:10.20944/preprints201810.0571.v1
Subject: Keywords: quantum random number; vacuum state; maximization of quantum conditional min-entropy
Online: 24 October 2018 (11:22:11 CEST)
Information-theoretically provable unique true random numbers, which cannot be correlated or controlled by an attacker, can be generated based on quantum measurement of the vacuum state and universal-hashing randomness extraction. Quantum entropy in the measurements decides the quality and security of the random number generator (RNG). At the same time, it directly determines the extraction ratio of true randomness from the raw data; in other words, it directly affects the generation rate of quantum random bits. In this work, we commit to enhancing quantum entropy content in the vacuum-noise-based quantum RNG. We have taken into account the main factors in this proposal to establish the theoretical model of quantum entropy content, including the effects of classical noise, the optimum dynamical analog-digital converter (ADC) range, the local gain and the electronic gain of the homodyne system. We demonstrate that by amplifying the vacuum quantum noise, abundant quantum entropy is extractable in the post-processing step even when the classical noise excursion, which may be deliberately induced by an eavesdropper, is large. Based on the discussion and the fact that the bandwidth of quantum vacuum noise is infinite, we propose a large dynamical range and moderate TIA gain to pursue higher local oscillator (LO) amplification of the vacuum quadrature and broader detection bandwidth in the homodyne system. A high true randomness extraction ratio together with a high sampling rate is attainable. Experimentally, an extraction ratio of true randomness of 85.3% is achieved by finite enhancement of the laser power of the LO when classical noise excursions of the raw data are obvious.
ARTICLE | doi:10.20944/preprints201808.0038.v1
Subject: Mathematics & Computer Science, Numerical Analysis & Optimization Keywords: true random number generation; von Neumann’s extractor; Peres’s extractor; Elias’s extractor
Online: 2 August 2018 (07:58:37 CEST)
Many cryptographic systems require random numbers, and weak random numbers lead to insecure systems. In the modern world, there are several techniques for generating random numbers, of which the most fundamental and important methods are deterministic extractors proposed by von Neumann, Elias, and Peres. Elias’s extractor achieves the optimal rate (i.e., the information-theoretic upper bound) h(p) if the block size tends to infinity, where h(·) is the binary entropy function and p is the probability that each bit of the input sequence takes a given value. Peres’s extractor achieves the optimal rate h(p) if the length of the input and the number of iterations tend to infinity. Previous research on both extractors did not address practical aspects, including running time and memory size with finite input sequences. In this paper, based on some heuristics, we derive a lower bound on the maximum redundancy of Peres’s extractor, and we show that Elias’s extractor is better than Peres’s one in terms of the maximum redundancy (or the rates) if we do not pay attention to time complexity or space complexity. In addition, we perform numerical and non-asymptotic analysis of both extractors with a finite input sequence with any biased probability under the same environments. To do so, we implemented both extractors on a general PC and simple environments. Our empirical results show that Peres’s extractor is much better than Elias’s one for given finite input sequences under almost the same running time. As a consequence, Peres’s extractor would be more suitable for generating uniformly random sequences in practice in applications such as cryptographic systems.
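As background for the extractors named in this abstract, the following is a minimal sketch of the simplest of them, von Neumann's extractor, together with the binary entropy function h(p) that gives the optimal rate. It is not the authors' implementation of the Elias or Peres extractors; the function names and the choice to output the first bit of each unequal pair are conventions assumed for this example.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits; h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def von_neumann_extract(bits):
    """Von Neumann extractor for independent, identically biased bits.

    Reads the input in non-overlapping pairs: an unequal pair emits its
    first bit (01 -> 0, 10 -> 1), an equal pair (00 or 11) is discarded.
    A trailing unpaired bit is ignored.
    """
    out = []
    for i in range(0, len(bits) - 1, 2):
        first, second = bits[i], bits[i + 1]
        if first != second:
            out.append(first)
    return out
```

Each pair produces an output bit with probability 2p(1-p), so the expected rate is p(1-p) output bits per input bit, which is at most 0.25 and well below h(p). That gap is the motivation for the Elias and Peres constructions compared in the abstract.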
ARTICLE | doi:10.20944/preprints201804.0333.v2
Subject: Mathematics & Computer Science, Other Keywords: capsule video endoscopy; stochastic sampling; random walks; color gradient; image decomposition
Online: 17 May 2018 (12:46:30 CEST)
Capsule endoscopy, which uses a wireless camera to take images of the digestive tract, is emerging as an alternative to traditional colonoscopy. The diagnostic value of these images depends on the quality of the revealed underlying tissue surfaces. In this paper, we consider the problem of enhancing the visibility of detail and shadowed tissue surfaces for capsule endoscopy images. Using concentric circles at each pixel for random walks combined with stochastic sampling, the proposed method enhances the details of vessel and tissue surfaces. The framework decomposes the image into two detail layers that contain shadowed tissue surfaces and detail features. The target pixel value is recalculated for the smooth layer using the similarity of the target pixel to neighboring pixels, weighted against the total gradient variation and intensity differences. In order to evaluate the diagnostic image quality of the proposed method, we used clinical subjective evaluation with a rank order on selected images from the KID image database and compared the method with state-of-the-art enhancement methods. The results showed that the proposed method performs better in terms of diagnostic image quality, objective contrast quality metrics, and structural similarity index.
ARTICLE | doi:10.20944/preprints201803.0266.v1
Subject: Engineering, Mechanical Engineering Keywords: variational mode decomposition; random decrement technique; crankshaft bearing; engine; feature extraction
Online: 30 March 2018 (10:01:18 CEST)
The vibration signal of the engine contains strong background noise and many kinds of modulating components, which makes faults difficult to diagnose. Variational mode decomposition (VMD) is a recently introduced adaptive signal decomposition algorithm with a solid theoretical foundation and good noise robustness compared with empirical mode decomposition (EMD). VMD can effectively avoid the endpoint effect and modal aliasing. However, VMD cannot effectively eliminate the random noise in the signal, so the random decrement technique is introduced to solve the problem. Based on the crankshaft bearing fault simulation experiment, the vibration signals of four wear states are decomposed by VMD, and the modal components with smaller permutation entropy are selected as fault components. Then the fault component is processed by the random decrement technique, and the Hilbert envelope spectrum of the fault component is obtained. The feature extraction results of the proposed method are better than those of the fault feature extraction methods based on EMD and EEMD. The simulation analysis and the simulation test of the crankshaft bearing fault verify the effectiveness of the proposed method.
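The selection criterion in this abstract relies on permutation entropy. A minimal sketch of the standard ordinal-pattern definition is given below, assuming a one-dimensional signal already extracted from a mode; the VMD step itself is not shown, and the function name and the default order and delay are illustrative choices, not values taken from the paper.

```python
import math
from collections import Counter


def permutation_entropy(signal, order=3, delay=1, normalize=True):
    """Permutation entropy of a 1-D sequence from ordinal-pattern counts.

    Each window of `order` samples spaced `delay` apart is mapped to the
    permutation that sorts it; the Shannon entropy (base 2) of the pattern
    frequencies is returned, divided by log2(order!) when normalize=True.
    """
    n_windows = len(signal) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("signal too short for the chosen order and delay")
    counts = Counter()
    for i in range(n_windows):
        window = [signal[i + j * delay] for j in range(order)]
        # Ordinal pattern: indices of the window sorted by value.
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        counts[pattern] += 1
    entropy = 0.0
    for c in counts.values():
        p = c / n_windows
        entropy -= p * math.log2(p)
    if normalize:
        entropy /= math.log2(math.factorial(order))
    return entropy
```

A regular component, such as a monotone ramp, yields a value near 0, while a noise-like component approaches 1 after normalization, which is consistent with picking the lower-entropy modes as the more structured ones.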
ARTICLE | doi:10.20944/preprints201709.0032.v1
Subject: Earth Sciences, Environmental Sciences Keywords: random forest; regression tree; carbon fertilization; land cover change; climate change
Online: 10 September 2017 (07:26:30 CEST)
Global change is affecting vegetation cover and processes through multiple pathways. Long time series of land surface properties derived from satellite remote sensing offer a unique ability to observe these changes, particularly in areas with complex topography and limited research infrastructure. Here, we focus on Nepal, a biodiversity hotspot where vegetation productivity is limited by moisture availability (dominated by a summer monsoon) at lower elevations and by temperature at high elevations. We analyze normalized difference vegetation index (NDVI) from 1981 to 2015 semimonthly, at 8 km spatial resolution. We use a random forest (RF) of regression trees to generate a statistical model of NDVI as a function of elevation, land use, CO2 level, temperature, and precipitation. We find that NDVI has increased over the studied period, particularly at low and middle elevations and during fall (post-monsoon). We infer from the fitted RF model that the NDVI linear trend is primarily due to CO2 level (or another environmental parameter that is changing quasi-linearly), and not primarily to temperature or precipitation trends. On the other hand, interannual fluctuation in NDVI is more correlated with temperature and precipitation. RF accurately fits the available data and shows promise for estimating trends and testing hypotheses about their causes.
ARTICLE | doi:10.20944/preprints202206.0225.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: heterogeneous network embedding; random walks; non-meta-path; type and node constraints
Online: 15 June 2022 (10:41:23 CEST)
In heterogeneous networks, random walks based on meta-paths require prior knowledge and lack flexibility, while random walks not based on meta-paths consider only the number of node types and ignore the influence of the schema and topology between node types in real networks. To solve these problems, this paper proposes a novel model, HNE-RWTIC (Heterogeneous Network Embedding Based on Random Walks of Type & Inner Constraint). Firstly, to realize flexible walks, we design a Type strategy, which is a node type selection strategy based on the co-occurrence probability of node types. Secondly, to achieve uniformity of node sampling, we design an Inner strategy, which is a node selection strategy based on the adjacency relationship between nodes. The Type & Inner strategy can realize the random walks based on meta-path, the flexibility of the walks, and can sample the node types and nodes uniformly in proportion. Thirdly, based on the above strategy, a transition probability model is constructed; then, we obtain the node embeddings based on the random walks and Skip-Gram. Finally, in classification and clustering tasks, we conducted a thorough empirical evaluation of our method on three real heterogeneous networks. Experimental results show that the F1-Score and NMI of HNE-RWTIC outperform state-of-the-art approaches.
ARTICLE | doi:10.20944/preprints202202.0201.v1
Subject: Earth Sciences, Environmental Sciences Keywords: wind damage; wind disturbance; Pinus sylvestris; Picea abies; machine learning; random forest
Online: 17 February 2022 (05:06:55 CET)
Management approaches inspired by the variability of natural disturbances are expected to produce forests in the future that will be significantly more resilient and better adapted to local environmental conditions. Due to climate change, windstorms are becoming increasingly common resulting in the destruction not only of extensive forest areas but, quite often, of small-sized and scattered forest lands that can ultimately become home to insects and disease dissemination sites. In the present study, an attempt is made to identify and record areas in the northeastern forests of Greece covered by mixed stands of conifers and broadleaves that experienced massive windthrow following local storms. Based on tree-level data, local topographic features, forest characteristics and the mechanical properties of green wood, a reliable model, to be used for the prediction of similar disturbances in the future, has been created after a thorough comparative study of the most well-known intelligent machine learning algorithms.
ARTICLE | doi:10.20944/preprints202103.0333.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: wrist; accelerometer; physical activity; energy expenditure; machine learning; random forest, age groups
Online: 12 March 2021 (08:41:44 CET)
Wrist-worn fitness trackers and smartwatches are proliferating, with incessant attention towards health tracking. Given the growing popularity of wrist-worn devices across all age groups, a rigorous evaluation for recognizing hallmark measures of physical activities and estimating energy expenditure is needed to compare their accuracy across the lifespan. The goal of the study was to build machine learning models to recognize physical activity type (sedentary, locomotion, and lifestyle) and intensity (low, light, and moderate), identify individual physical activities, and estimate energy expenditure. The primary aim of this study was to build and compare models for different age groups: young [20-50 years], middle (50-70 years], and old (70-89 years]. Participants (n = 253, 62% women, aged 20-89 years old) performed a battery of 33 daily activities in a standardized laboratory setting while wearing a portable metabolic unit to measure energy expenditure that was used to gauge metabolic intensity. A tri-axial accelerometer collected data at 80-100 Hz from the right wrist, and the data were processed into 49 features. Results from the random forest algorithm were quite accurate in recognizing physical activity type; the F1-Score range across age groups was: sedentary [0.955 – 0.973], locomotion [0.942 – 0.964], and lifestyle [0.913 – 0.949]. Recognizing physical activity intensity resulted in lower performance; the F1-Score range across age groups was: sedentary [0.919 – 0.947], light [0.813 – 0.828], and moderate [0.846 – 0.875]. The root mean square error range was [0.835 – 1.009] for the estimation of energy expenditure. The F1-Score range for recognizing individual physical activities was [0.263 – 0.784]. Performances were relatively similar and the accelerometer data features were ranked similarly between age groups.
In conclusion, data features derived from wrist-worn accelerometers lead to moderate-to-high accuracy in estimating physical activity type, intensity and energy expenditure and are robust to potential age differences.
ARTICLE | doi:10.20944/preprints202002.0074.v1
Subject: Earth Sciences, Geoinformatics Keywords: archaeological topography; tumulus; burial mound; geomorphometry; high-resolution; DEM; LiDAR; Random Forest
Online: 6 February 2020 (02:43:29 CET)
Archaeological topography identification from high-resolution DEMs is a current method that is used with high success in archaeological prospecting of wide areas. I present a methodology through which burial mounds (tumuli) can be identified from LiDAR DEMs. This methodology uses geomorphometric and statistical methods to identify burial mound candidates with high accuracy. As a first step, peaks, defined as local elevation maxima, are found. In the second step, local convexity watershed segments and their seeds are compared with the positions of local peaks, and the peaks that correspond to, or have in their vicinity, local convexity segment seeds are selected. The local convexity segments that correspond to these selected peaks are then fed to a Random Forest algorithm together with shape descriptors and descriptive statistics of geomorphometric variables in order to build a model for the classification. Multiple approaches to tune and select the proper training dataset, settings and variables were tested. The validation of the model was performed on the full dataset where the training was performed and on an external dataset in order to test the usability of the method for other areas in a similar geomorphological and archaeological setting. The validation was performed against manually mapped and field-checked burial mounds from two neighboring study areas of 100 km2 each. The results show that by training the Random Forest on a dataset composed of between 75% and 100% of the segments corresponding to burial mounds and ten times more non-burial mound segments selected using Latin hypercube sampling, 93% of the burial mound segments from the external dataset are identified. There are 42 false positive cases that need to be checked, and two burial mound segments are missed. The method shows great promise for burial mound detection over wider areas by delineating a certain number of tumuli for model training.
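The Latin hypercube sampling used in this abstract to pick non-mound segments can be illustrated generically. The sketch below draws points in the unit hypercube so that every dimension has exactly one point in each of n equal-width strata. How the abstract's workflow maps such points onto actual segments is not stated, so that step is omitted, and the function name and signature are assumptions for this example.

```python
import random


def latin_hypercube(n_samples, n_dims, rng=None):
    """Draw n_samples points in [0, 1)^n_dims by Latin hypercube sampling.

    For every dimension the unit interval is cut into n_samples equal
    strata, one value is drawn uniformly inside each stratum, and the
    values are shuffled independently per dimension before being combined.
    """
    rng = rng or random.Random()
    columns = []
    for _ in range(n_dims):
        # One uniformly placed value per stratum [k/n, (k+1)/n).
        values = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(values)
        columns.append(values)
    # Row i takes the i-th shuffled value of every dimension.
    return [tuple(col[i] for col in columns) for i in range(n_samples)]
```

Compared with plain random sampling, this spreads the selected points evenly along each variable, which is a plausible reason for using it to choose a representative negative class.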
ARTICLE | doi:10.20944/preprints201902.0046.v1
Subject: Earth Sciences, Geoinformatics Keywords: Soil Moisture; Remote Sensing; Landsat; SMAP; Random Forest; Machine Learning; Downscaling; Microwave
Online: 5 February 2019 (08:01:58 CET)
If given the correct remotely sensed information, machine learning can accurately describe soil moisture conditions in a heterogeneous region at the large scale based on soil moisture readings at the small scale through rule transference across scale. This paper reviews an approach to increase soil moisture resolution over a sample region in Australia using only the Soil Moisture Active Passive (SMAP) sensor and Landsat 8, and a validation experiment using Sentinel-2 and the Advanced Microwave Scanning Radiometer (AMSR-E) over Nevada. This approach is inductive and localized, replacing the need to obtain a deterministic model with a learning model. This model is adaptable to heterogeneous conditions within a single scene, unlike traditional polynomial fitting models, and has fixed variables, unlike most learning models. For the purposes of this analysis, the SMAP 36 km soil moisture product is considered fully valid and accurate. Landsat bands coinciding in collection date with a SMAP capture are downsampled to match the resolution of the SMAP product. A series of indices describing the Soil-Vegetation-Atmosphere Triangle (SVAT) relationship, including two novel variables, is then produced using the downsampled Landsat bands. These indices are then related to the local coincident SMAP values to identify a series of rules or trees defining the local relationship between soil moisture and the indices. The defined rules are then applied to the Landsat image at the native Landsat resolution to determine local soil moisture. Ground truth comparison is done via a series of grids using point soil moisture samples and airborne L-band Multibeam Radiometer (PLMR) observations collected under the SMAPEx-5 campaign. This paper uses a random forest because it learns accurately against local ground truth data while producing easily understandable rules.
The inferred soil moisture learning algorithm predicted well, with a mean absolute error of 0.054 against an airborne L-band retrieved surface over the same region.
ARTICLE | doi:10.20944/preprints201811.0265.v1
Subject: Earth Sciences, Other Keywords: CAMELS; flood frequency; hydrological signatures; extreme value theory; random forests; spatial modelling
Online: 12 November 2018 (04:59:22 CET)
The finding of important explanatory variables for the location parameter and the scale parameter of the generalized extreme value (GEV) distribution, when the latter is used for the modelling of annual streamflow maxima, is known to have reduced the uncertainties in inferences, as estimated through regional flood frequency analysis frameworks. However, important explanatory variables have not been found for the GEV shape parameter, despite its critical significance, which stems from the fact that it determines the behaviour of the upper tail of the distribution. Here we examine the nature of the shape parameter by revealing its relationships with basin attributes. We use a dataset that comprises information about daily streamflow and forcing, climatic indices, topographic, land cover, soil and geological characteristics of 591 basins with minimal human influence in the contiguous United States. We propose a framework that uses random forests and linear models to find (a) important predictor variables of the shape parameter and (b) an interpretable model with high predictive performance. The study consists of assessing the predictive performance of the models, selecting a parsimonious prediction model and interpreting the results in an ad hoc manner. The findings suggest that the shape parameter mostly depends on climatic indices, while the selected prediction model results in more than 20% higher accuracy in terms of RMSE compared to a naïve approach. The implications are important, since incorporating the regression model into regional flood frequency analysis frameworks can considerably reduce the predictive uncertainties.
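To make concrete why the shape parameter governs the upper tail, the GEV quantile function can be sketched as follows. The sketch assumes the sign convention in which a positive shape means a heavy upper tail (some hydrological texts use the opposite sign); the function name and the numerical threshold for the Gumbel limit are choices made for this example and are not taken from the paper.

```python
import math


def gev_quantile(p, loc, scale, shape):
    """Quantile of the GEV distribution at non-exceedance probability p.

    Inverts F(x) = exp(-(1 + shape * (x - loc) / scale) ** (-1 / shape)).
    Assumed convention: shape > 0 gives a heavy (Frechet-type) upper
    tail, and shape -> 0 recovers the Gumbel distribution.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    y = -math.log(p)
    if abs(shape) < 1e-12:
        # Gumbel limit of the general formula.
        return loc - scale * math.log(y)
    return loc + scale * (y ** (-shape) - 1.0) / shape
```

With location 0 and scale 1, the quantile with non-exceedance probability 0.99 (the 100-year level for annual maxima) is about 3.0 for shape -0.2, 4.6 for shape 0, and 7.5 for shape 0.2, so a modest change in shape more than doubles the estimated extreme.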
REVIEW | doi:10.20944/preprints201808.0092.v1
Subject: Physical Sciences, General & Theoretical Physics Keywords: quantum ergodicity; vibrational state space; local random matrix theory; many body localization
Online: 5 August 2018 (11:55:27 CEST)
We review a theory that predicts the onset of thermalization in a quantum mechanical coupled non-linear oscillator system, which models the vibrational degrees of freedom of a molecule. A system of N non-linear oscillators perturbed by cubic anharmonic interactions exhibits a many-body localization (MBL) transition in the vibrational state space (VSS) of the molecule. This transition can occur at rather high energy in a sizable molecule because the density of states coupled by cubic anharmonic terms scales as ~ N^3, in marked contrast to the total density of states, which scales as exp(aN), where a is a constant. The emergence of an MBL transition in the VSS is seen by analysis of a random matrix ensemble that captures the locality of coupling in the VSS, referred to as local random matrix theory (LRMT). Upon introducing higher order anharmonicity, the location of the MBL transition of even a sizable molecule, such as an organic molecule with tens of atoms, still lies at an energy that may exceed the energy to surmount a barrier to reaction, such as a barrier to conformational change. Illustrative calculations are provided, and some recent work on the influence of thermalization on thermal conduction in molecular junctions is also discussed.
ARTICLE | doi:10.20944/preprints201804.0035.v2
Subject: Engineering, Civil Engineering Keywords: pedestrian safety; crash severity; crash factors; ordered probit model; random parameter model
Online: 27 April 2018 (08:10:22 CEST)
Background: According to the National Highway Traffic Safety Administration, 116 pedestrians were killed in motor vehicle crashes in Ohio in 2015. However, no study to date has analyzed crashes in Ohio to explore the factors contributing to pedestrian injury severity resulting from motor vehicle crashes. This study fills this gap by investigating the crashes involving pedestrians exclusively in Ohio. Materials and Methods: This study uses the crash data from the Highway Safety Information System, from 2009 to 2013. The explanatory factors include the pedestrian, driver, vehicle, crash, and roadway characteristics. Both fixed- and random-parameters ordered probit models of injury severity (where possible outcomes are major, minor, and possible/no injury) were estimated. Results: The model results indicate that older pedestrians (65 and over), younger drivers (less than 24), driving under the influence (DUI), being struck by a truck, dark unlighted roadways, six-lane roadways, and speed limits of 40 mph and 50 mph were associated with more severe injuries to the pedestrians. Conversely, older drivers (65 and over), passenger cars, crashes occurring in urban locations, daytime off-peak traffic (10 AM to 3:59 PM), weekdays, and daylight conditions were associated with less severe injuries. Conclusion: This study provides specific safety recommendations so that effective countermeasures can be developed and implemented by policy makers, which in turn will improve overall highway safety.
ARTICLE | doi:10.20944/preprints202209.0190.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: green coffee bean; lightweight framework; deep convolutional neural network; explainable model; random optimization
Online: 14 September 2022 (04:04:05 CEST)
In recent years, the demand for coffee has increased tremendously. During the production process, green coffee beans are traditionally screened manually for defective beans before they are packed into coffee bean packages; however, this method is not only time-consuming but also increases the rate of human error due to fatigue. Therefore, this paper proposed a lightweight deep convolutional neural network (LDCNN) for the quality detection system of green coffee beans, which combined depthwise separable convolution (DSC), squeeze-and-excite block (SE block), skip block, and other frameworks. To avoid the influence of low parameters of the lightweight model caused by the model training process, rectified Adam (RA), lookahead (LA), and gradient centralization (GC) were included to improve efficiency; the model was also put into the embedded system. Finally, the local interpretable model-agnostic explanations (LIME) model was employed to explain the predictions of the model. The experimental results indicated that the accuracy rate of the model could reach up to 98.38% and the F1 score could be as high as 98.24% when detecting the quality of green coffee beans. Hence, it can obtain higher accuracy, lower computing time, and lower parameters. Moreover, the interpretable model verified that the lightweight model in this work is reliable, providing the basis for screening personnel to understand the judgment through its interpretability, thereby improving the classification and prediction of the model.
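The parameter saving behind the depthwise separable convolution mentioned in this abstract follows from simple arithmetic, sketched below. The layer sizes in the usage note are made-up examples rather than the paper's architecture, bias terms are ignored, and the function names are chosen for this illustration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (bias ignored).

    Depthwise step: one k x k filter per input channel.
    Pointwise step: a 1 x 1 convolution mixing c_in into c_out channels.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

For a hypothetical 3 x 3 layer with 64 input and 128 output channels, the standard convolution has 73,728 weights and the separable version 8,768, roughly an 8.4-fold reduction, which is the kind of saving a lightweight model of this type relies on.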
ARTICLE | doi:10.20944/preprints202209.0169.v1
Subject: Earth Sciences, Geoinformatics Keywords: Synthetic Aperture Radar (SAR); Optical image (Sentinel 2); Random Forest (RF); CART; GEE
Online: 13 September 2022 (10:06:14 CEST)
Observing cultivated crops and other forms of land use is an important environmental and economic concern for agricultural land management and crop classification. Crop categorization provides significant crop management data, helps ensure food security, and supports the development of agricultural policies. Remote sensing data, especially publicly available Sentinel-1 and Sentinel-2 data, have been used effectively in crop mapping and classification in cloudy places because of their high spatial and temporal resolution. This study aimed to improve crop type classification by combining Sentinel-1 Synthetic Aperture Radar (SAR) data and Sentinel-2 Multispectral Instrument (MSI) data. In the study, Random Forest (RF) and Classification and Regression Trees (CART) classifiers were used to classify grain crops (barley and wheat). The classification results based on the combination of Sentinel-2 and Sentinel-1 data indicated an overall accuracy (OA) of 93% and a kappa coefficient (K) of 0.896 for the RF classifier, and an OA of 89.15% and a K of 0.84 for the CART classifier. It is suggested to employ a mix of radar and optical data to attain the highest level of classification accuracy, since doing so improves the likelihood that details will be observed compared with the single-sensor classification technique and yields more accurate results.
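The two scores reported in this abstract, overall accuracy and the kappa coefficient, are both derived from a confusion matrix. The sketch below assumes the kappa in question is Cohen's kappa, which is the usual choice in remote sensing but is not spelled out in the abstract; the function name is an assumption and the example counts in the usage note are invented, not results from the study.

```python
def overall_accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of samples of reference class i that
    were assigned to class j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        # Degenerate case: only one class present on both sides.
        return observed, 1.0
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa
```

For an invented two-class matrix [[45, 5], [5, 45]], the overall accuracy is 0.90 and kappa is 0.80. Kappa is lower because it discounts the agreement that balanced classes would produce by chance, which is why the abstract's kappa values sit below the corresponding accuracies.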
HYPOTHESIS | doi:10.20944/preprints202206.0409.v1
Subject: Life Sciences, Genetics Keywords: Non random mutation; interaction-based evolution; hemoglobin S; directed mutation; parallelism; genome organization
Online: 29 June 2022 (14:56:32 CEST)
Recent results have shown that the human malaria-resistant hemoglobin S mutation originates de novo more frequently in the gene and in the population where it is of adaptive significance, namely, in the hemoglobin subunit beta gene compared to the non-resistant but otherwise identical 20A>T mutation in the hemoglobin subunit delta gene, and in sub-Saharan Africans, who have been subject to intense malarial pressure for many generations, compared to Northern Europeans, who have not. This finding raises a fundamental challenge to the traditional notion of accidental mutation. Here we address this finding with the replacement hypothesis, according to which pre-existing genetic interactions can lead directly and mechanistically to mutations that simplify and replace them. Thus, a gradual evolutionary process under selection can hone in on interactions of importance for the currently evolving adaptations, from which large-effect mutations follow that are relevant to these adaptations. We exemplify this hypothesis using multiple types of mutation, including gene fusion mutations, gene duplication mutations, A>G mutations in RNA-edited sites and transcription-associated mutations, and place it in the broader context of a system-level view of mutation origination called Interaction-based Evolution. Potential consequences include that similarity of mutation pressures may contribute to parallel evolution in genetically related species, that the evolution of genome organization may be driven by mutational mechanisms, that transposable element movements may also be explained by replacement, and that long-term directional mutational responses to specific environmental pressures are possible. Such responses need to be further tested by future studies in natural and artificial settings.
ARTICLE | doi:10.20944/preprints202205.0386.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: Lower upper bound estimation; random forest; feature selection; probabilistic forecasting; photovoltaic generation forecasting
Online: 30 May 2022 (05:10:06 CEST)
Photovoltaic power generation has high variability and uncertainty because it is affected by uncertain factors such as weather conditions. Therefore, probabilistic forecasting is useful for optimal operation and risk hedging in power systems with large amounts of photovoltaic power generation. However, deterministic forecasting is the mainstay of photovoltaic generation forecasting; there are few studies on probabilistic forecasting and feature selection from weather or time-oriented features in such forecasting. In this study, prediction intervals were generated by the lower upper bound estimation using neural networks with two outputs to make probabilistic predictions. The objective was to improve prediction interval coverage probability (PICP), mean prediction interval width (MPIW), and loss, which is the integration of these two metrics, by removing unnecessary features through feature selection. When features with high gain were selected by random forests (RF), in the forecast of 14.7-kW PV systems, loss improved by 1.57 kW, PICP by 0.057, and MPIW by 0.12 kW on average over two weeks compared to the case where all features were used without feature selection. Therefore, the low gain features from RF act as noise in LUBE and reduce the prediction accuracy.
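The two interval metrics named in this abstract have standard definitions that can be sketched directly. The combined loss that the abstract mentions is not reproduced here because its exact form is not given; the function names and the sample numbers in the usage note are illustrative assumptions.

```python
def picp(targets, lower, upper):
    """Prediction interval coverage probability.

    Fraction of observed targets that fall inside their interval
    [lower_i, upper_i], bounds included.
    """
    inside = sum(1 for t, lo, hi in zip(targets, lower, upper) if lo <= t <= hi)
    return inside / len(targets)


def mpiw(lower, upper):
    """Mean prediction interval width, in the units of the target."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
```

With invented targets [1, 2, 3, 4], lower bounds [0, 1, 4, 3] and upper bounds [2, 3, 5, 5], three of four targets are covered (PICP 0.75) and the mean width is 1.75. The two metrics pull in opposite directions, since wider intervals raise coverage, which is why a single loss combining them is useful for comparing feature sets.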
ARTICLE | doi:10.20944/preprints202201.0402.v1
Subject: Engineering, Other Keywords: project scheduling; underground mine; random breakdown simulation; wolf colony algorithm; multi-objective optimization
Online: 26 January 2022 (14:02:22 CET)
Due to production space and operating environment requirements, mine production equipment often breaks down, which seriously affects the mine’s production schedule. To ensure the smooth completion of the haulage operation plan under abnormal conditions, a model of the haulage equipment rescheduling plan based on the random simulation of equipment breakdowns is established in this paper. The model aims to achieve both the maximum completion rate of the original mining plan and the minimum fluctuation of the ore grade during the rescheduling period. This model is optimized by improving the wolf colony algorithm and changing the location update formula of the individuals in the wolf colony. Then, the optimal model solution can be used to optimize the rescheduling of the haulage plan by considering equipment breakdowns. The application of the proposed method in an underground mine revealed that the completion rate of the mine’s daily mining plan reached 83.40% without increasing the amount of equipment, while the ore quality remained stable. Moreover, the improved optimization algorithm converged quickly and was characterized by high robustness.
ARTICLE | doi:10.20944/preprints202201.0111.v1
Subject: Engineering, Mechanical Engineering Keywords: damage detection; linear regression; random forest; artificial neural network; training parameters; natural frequency
Online: 10 January 2022 (12:26:27 CET)
Damage detection based on modal parameter changes has become popular in recent decades. Robust and reliable mathematical relations are now available to predict the natural frequency changes if the damage parameters are known. Using these relations, it is possible to create databases containing a large variety of damage scenarios. Damage can thus be assessed by applying an inverse method. The problem is the complexity of the database, especially for structures with several cracks. In this paper, we propose two machine learning methods, namely the random forest (RF) and the artificial neural network (ANN), as search tools. The databases we developed contain damage scenarios for a prismatic cantilever beam with one crack and ideal and non-ideal boundary conditions. The crack assessment is made in two steps. First, a coarse damage location is found from the networks trained for scenarios comprising the whole beam. Afterward, the assessment is refined using a particular network trained for the segment of the beam on which the crack was previously found. Using the two machine learning methods, we succeeded in estimating the crack location and severity with high accuracy for both simulation and laboratory experiments. Regarding the location of the crack, which is the main goal of practitioners, the errors are less than 0.6%. Based on these achievements, we conclude that the damage assessment we propose, in conjunction with the machine learning methods, is robust and reliable.
ARTICLE | doi:10.20944/preprints202103.0747.v1
Subject: Medicine & Pharmacology, Allergology Keywords: Leptospirosis; Leptospira; water; random; metagenomic; epidemiology; soil; environment; survival; climate; zones; serial sampling
Online: 30 March 2021 (14:17:24 CEST)
Human leptospirosis cannot be investigated without studying the zoonotic and environmental aspects of the disease. The objectives of this study are to explore the abundance of Leptospira in different climate zones of Sri Lanka and to describe the presence of Leptospira in the same water source at different time points. First, water and soil samples were collected from the whole island; second, water sampling continued only in the dry zone; finally, serial sampling of water from ten open wells was performed at five different time points. Quantitative PCR for water and metagenomic sequencing for soil were used to detect Leptospira. In the first component, 2 out of 12 water sites were positive, and both are situated in the wet zone. Very small quantities of the genus Leptospira were detected by metagenomic analysis of soil. Only 5 out of 26 samples were positive in the second component. Six, five, four, five, and six wells were positive, respectively, in the serial measurements of the third component. All wells were positive in at least one measurement, while only one well was positive in all measurements. Proximity to a tank and greater distance from the main road were significant risk factors associated with well positivity. The presence of Leptospira does not seem to be consistent, indicating random abundance of Leptospira in the natural environment.
ARTICLE | doi:10.20944/preprints202103.0622.v1
Subject: Life Sciences, Other Keywords: human papillomavirus virus; cervical cancer; random network model; vaccination programs; oncogenic HPV eradication
Online: 25 March 2021 (14:35:21 CET)
Cervical cancer is the fourth most common malignancy in women worldwide, although it is preventable with prophylactic HPV vaccination. HPV transmission-dynamic models can predict the potential for global elimination of cervical cancer. The random network model is a new approach that allows individuals to be followed, and to implement a given vaccination policy according to their clinical records. We developed an HPV transmission dynamics model on a lifetime sexual partners network based on individual contacts, also accounting for the sexual behavior of men who have sex with men (MSM). We analyzed the decline in the prevalence of HPV infection in a scenario of 75% and 90% coverage for both sexes. An important herd immunity effect for men and women was observed in the heterosexual network, even with 75% coverage. However, HPV infections are persistent in the MSM population, with sustained circulation of the virus among unvaccinated individuals. Coverage around 75% of both sexes would be necessary to eradicate HPV-related conditions in women within five decades. Nevertheless, the variation in the decline in infection in the long term between vaccination coverage of 75% and 90% is relatively small, suggesting that reaching coverage of around 70-75% in the heterosexual network may be enough to confer high protection. Nevertheless, HPV eradication may be achieved if men’s coverage is strictly controlled. This accurate representation of HPV transmission demonstrates the need to maintain high HPV vaccination coverage, especially in men, for whom the cost-effectiveness of vaccination is questioned.
ARTICLE | doi:10.20944/preprints202103.0573.v1
Subject: Earth Sciences, Atmospheric Science Keywords: prediction; solar irradiation; machine learning; artificial neural network; random forest; vector support machine
Online: 23 March 2021 (15:51:55 CET)
Different machine learning models (multiple linear regression, support vector machines, artificial neural networks and random forests) are applied to predict the monthly global irradiation (MGI) from different input variables (latitude, longitude and altitude of the meteorological station, month, average temperatures, among others) for different areas of Galicia (Spain). The models were trained, validated and queried using data from three stations, and the best model of each type was checked against two independent stations. The results confirmed that the best ML approach is the ANN model, which presents the lowest RMSE values in the validation and querying phases (122.6·10kJ/(m2∙day) and 113.6·10kJ/(m2∙day), respectively) and predicts well for the independent stations (201.3·10kJ/(m2∙day) and 209.4·10kJ/(m2∙day), respectively). Given the good results obtained, it is worthwhile to continue with the design of artificial neural networks applied to the analysis of monthly global irradiation.
ARTICLE | doi:10.20944/preprints202101.0569.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Saddlepoint approximations; Probability mass function; Probability distribution function; Poisson random variable; Linear combination
Online: 27 January 2021 (16:18:06 CET)
In this study, we investigate the performance of the saddlepoint approximation of the probability mass function and the cumulative distribution function for the weighted sum of independent Poisson random variables. The goal is to approximate the hazard rate function for this complicated model. The superior performance of this method is shown by numerical simulations and by comparison with the performance of other approximation methods.
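A minimal sketch of a first-order saddlepoint approximation for such a weighted sum may help make the idea concrete. It assumes the textbook form: for S = Σ wᵢXᵢ with Xᵢ ~ Poisson(λᵢ), the cumulant generating function is K(s) = Σ λᵢ(e^{wᵢs} − 1), the saddlepoint ŝ solves K′(ŝ) = x, and the mass at x is approximated by exp(K(ŝ) − ŝx) / sqrt(2πK″(ŝ)). The function name, the solver, and the restriction to positive weights and x > 0 are choices made for this illustration, not details from the paper, whose exact variant may differ.

```python
import math
from typing import Sequence


def saddlepoint_pmf(
    x: float,
    rates: Sequence[float],
    weights: Sequence[float],
    tol: float = 1e-12,
    max_iter: int = 100,
) -> float:
    """First-order saddlepoint approximation to P(S = x), S = sum(w_i * X_i)."""
    if x <= 0 or any(w <= 0 for w in weights) or any(r <= 0 for r in rates):
        raise ValueError("this sketch assumes x > 0, positive weights and rates")

    def k0(s: float) -> float:  # K(s)
        return sum(r * (math.exp(w * s) - 1.0) for r, w in zip(rates, weights))

    def k1(s: float) -> float:  # K'(s)
        return sum(r * w * math.exp(w * s) for r, w in zip(rates, weights))

    def k2(s: float) -> float:  # K''(s)
        return sum(r * w * w * math.exp(w * s) for r, w in zip(rates, weights))

    # Newton iteration on log K'(s) = log x, which is better conditioned
    # than solving K'(s) = x directly.
    s = 0.0
    for _ in range(max_iter):
        step = (math.log(k1(s)) - math.log(x)) * k1(s) / k2(s)
        s -= step
        if abs(step) < tol:
            break

    return math.exp(k0(s) - s * x) / math.sqrt(2.0 * math.pi * k2(s))


# Single Poisson(5) at x = 5: the approximation can be compared with the exact pmf.
approx = saddlepoint_pmf(5, [5.0], [1.0])
exact = math.exp(-5.0) * 5.0**5 / math.factorial(5)
```

For a single Poisson variable the formula reduces to replacing x! by Stirling's approximation, so the relative error shrinks as x grows.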
ARTICLE | doi:10.20944/preprints202011.0519.v1
Subject: Biology, Anatomy & Morphology Keywords: Meta-prediction; Encoding data; ClinVar; Classification; Random Forest; Naive Bayes; Support Vector Machine
Online: 19 November 2020 (16:43:51 CET)
ClinVar is a web platform that stores around 774k curated entries and allows the exploration of genetic variants and their associations with complex phenotypes. A subset of ClinVar’s genetic associations were reported with conflicting interpretations or uncertain clinical significance, which currently challenges clinicians and geneticists. Here, we evaluate the performance of data pre-processing methods combined with classical prediction methods, such as Naive Bayes, Random Forest, and Support Vector Machine, to build a meta-prediction model aiming to improve genetic pathogenicity interpretation. Models were trained with ClinVar data (September 2020), and genetic variants were annotated with eight functional impact predictors catalogued with SnpEff/SnpSift (v4.3). A 10-fold cross-validation strategy was used for evaluation by accuracy, F1-Score, Receiver Operating Characteristic, and Area Under Curve. The best meta-prediction model arises from combining one-hot encoding with tree-based classifiers such as Random Forest, which shows Area Under Curve ≥ 0.93. We predict pathogenicity for 109k genetic variants that were labeled as uncertain significance or conflict of interpretation. Additionally, we implemented AmazonForest (https://www.lghm.ufpa.br/amazonforest), a web tool to query data for a set of 5k variants that were predicted with high pathogenic probability (RFprob >= 0.9).
ARTICLE | doi:10.20944/preprints201803.0180.v1
Subject: Materials Science, Nanotechnology Keywords: random magnetic field; safe basin; single-walled carbon nanotubes; stochastic resonance; strong nonlinearity
Online: 20 March 2018 (11:28:17 CET)
In this paper, a nonlinear model of a single-walled carbon nanotube is developed, and the strongly nonlinear dynamic characteristics of such carbon nanotubes subjected to a random magnetic field are studied. The nonlocal effect of the microstructure is considered based on the theory of nonlocal elasticity. The natural frequency of the strongly nonlinear dynamic system is obtained by the energy function method, and the drift and diffusion coefficients are verified. The stationary probability density function of the system dynamic response is given and the fractal boundary of the safe basin is provided. Theoretical analysis and numerical simulation show that stochastic resonance occurs when varying the random magnetic field intensity. The boundary of the safe basin has fractal characteristics, and the area of the safe basin decreases when the intensity of the magnetic field permeability increases.
ARTICLE | doi:10.20944/preprints201703.0132.v1
Subject: Medicine & Pharmacology, Pharmacology & Toxicology Keywords: machine learning; random forest; estrogen receptor; Tox21 data challenge 2014; QSAR prediction model
Online: 17 March 2017 (04:49:28 CET)
Many agonists for the estrogen receptor are known to disrupt endocrine functioning. We have developed a computational model that predicts agonists for the estrogen receptor ligand-binding domain in an assay system. Our model was entered into the Tox21 Data Challenge 2014, a computational toxicology competition organized by the National Center for Advancing Translational Sciences. This competition aims to find high-performance predictive models for various adverse-outcome pathways, including the estrogen receptor. Our predictive model, which is based on the random forest method, delivered the best performance in its competition category. In the current study, the predictive performance of the random forest models was improved by strictly adjusting the hyperparameters to avoid overfitting. The random forest models were optimized from 4,000 descriptors simultaneously applied to 10,000 activity assay results for the estrogen receptor ligand-binding domain, which have been measured and compiled by Tox21. At this time, our model delivers the highest predictive power on estrogen receptor agonists in the world. Furthermore, analysis of the optimized model revealed some important features of the agonists, such as the number of hydroxyl groups in the molecules.
ARTICLE | doi:10.20944/preprints201610.0037.v1
Subject: Engineering, Industrial & Manufacturing Engineering Keywords: collection-distribution center; closed loop supply chain; fuzzy random variable; particle swarm optimization
Online: 11 October 2016 (11:02:47 CEST)
Recycling waste products is an environmentally friendly activity that can bring benefits to a company, saving manufacturing costs and improving economic efficiency. For the beer industry, recycling bottles can reduce manufacturing costs and reduce the industry's carbon footprint. This paper presents a model for a multi-objective collection-distribution center location and allocation problem in a closed loop supply chain for the beer industry, in which the objective is to minimize total costs and transportation pollution. Uncertainties in the form of randomness and fuzziness are jointly handled in this paper to ensure a more practical problem solution, for which returned bottles and unusable bottles are considered fuzzy random variables. A heuristic algorithm based on priority-based global-local-neighbor particle swarm optimization (pb-glnPSO) is applied to ensure reliable solutions for this NP-hard problem. A case study on a beer operation company is conducted to illustrate the application of the proposed model and demonstrate the priority-based global-local-neighbor particle swarm optimization.
ARTICLE | doi:10.20944/preprints201609.0088.v1
Subject: Engineering, Civil Engineering Keywords: classification; railway; power line; mobile laser scanning data; conditional random field; layout compatibility
Online: 26 September 2016 (09:33:05 CEST)
Railways are one of the most crucial means of transportation for public mobility and economic development. The electrification system in the railway infrastructure, which supplies electric power to trains, is an essential facility for efficient and stable train operation. Due to its important role, the electrification system needs to be rigorously and regularly inspected and managed. This paper presents a supervised learning method to classify Mobile Laser Scanning (MLS) data into ten target classes representing overhead wires, movable brackets and poles, which are recognized as key objects in the electrification system. In general, the layout of a railway electrification system shows a strong regularity of spatial relations among object classes. The proposed classifier is based on a Conditional Random Field (CRF), which characterizes not only labeling homogeneity at short range, but also the layout compatibility between different object classes at long range in the probabilistic graphical model. This multi-range CRF model consists of a unary term and three pairwise contextual terms. In order to gain computational efficiency, the MLS point clouds are converted into a set of line segments to which the labeling process is applied. A Support Vector Machine (SVM) is used as a local classifier that considers only node features to produce the unary potentials of the CRF model. As the short-range pairwise contextual term, the Potts model is applied to enforce local smoothness in the short-range graph. The long-range pairwise potentials, by contrast, are designed to enhance the spatial regularities of both horizontal and vertical layouts among railway objects. We formulate two long-range pairwise potentials as the log posterior probability obtained by a Naïve Bayes classifier. The directional layout compatibilities are characterized in probability look-up tables, which represent the co-occurrence rate of spatial relations in the horizontal and vertical directions. The likelihood function is formulated by multivariate Gaussian distributions. In the proposed multi-range CRF model, the weight parameters that balance the four sub-terms are estimated by applying Stochastic Gradient Descent (SGD). The results show that the proposed multi-range CRF can effectively classify detailed railway elements, achieving an average recall of 97.66% and an average precision of 97.07% over all classes.
Subject: Mathematics & Computer Science, Applied Mathematics Keywords: Koopman Operator; Dynamic Mode Decomposition(DMD); Johnson-Lindenstrauss Lemma; Random Projection; Data-driven method
Online: 24 September 2021 (09:14:01 CEST)
A data-driven analysis method known as dynamic mode decomposition (DMD) approximates the linear Koopman operator on a projected space. In the spirit of the Johnson-Lindenstrauss Lemma, we use random projection to estimate the DMD modes in a reduced-dimensional space. In practical applications, snapshots lie in a high-dimensional observable space and the DMD operator matrix is massive. Computing DMD with the full spectrum is therefore infeasible, so our main computational goal is to estimate the eigenvalues and eigenvectors of the DMD operator in a projected domain. We generalize the current algorithm to estimate a projected DMD operator. We focus on a powerful and simple random projection algorithm that reduces the computational and storage cost. Although a random projection is algorithmically much simpler than a detailed optimal projection, we show that the results can nonetheless be excellent in general, and that their quality can be understood through the well-developed theory of random projections. We demonstrate that the modes can be calculated at low cost from the projected data, provided the projected dimension is sufficient.
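The projection step alone can be sketched as follows. This is only an illustration of a Gaussian random projection in the Johnson-Lindenstrauss sense, not the projected DMD algorithm of the paper: it shows that distances between high-dimensional vectors (standing in for snapshots) are approximately preserved after projection to k dimensions. The function name `random_projection`, the 1/sqrt(k) scaling, and the toy vectors are assumptions made for this example.

```python
import math
import random
from typing import List, Sequence


def random_projection(
    vectors: Sequence[Sequence[float]], k: int, seed: int = 0
) -> List[List[float]]:
    """Project each n-dimensional vector to k dimensions with a Gaussian matrix.

    Entries of the k x n matrix are N(0, 1) / sqrt(k), so squared norms are
    preserved in expectation.
    """
    rng = random.Random(seed)
    n = len(vectors[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]
    return [
        [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        for vec in vectors
    ]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Two toy "snapshots" in 1000 dimensions, projected down to 400.
a = [math.sin(0.01 * i) for i in range(1000)]
b = [math.cos(0.01 * i) for i in range(1000)]
pa, pb = random_projection([a, b], k=400)
ratio = distance(pa, pb) / distance(a, b)  # expected to be close to 1
```

In a projected DMD setting, the same matrix would be applied to both snapshot matrices before the operator and its eigenvalues are estimated in the smaller space.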
ARTICLE | doi:10.20944/preprints202108.0373.v1
Subject: Engineering, Civil Engineering Keywords: Random filling; slate rock; core; wheel impression test; topographic settlement test; plate bearing test
Online: 18 August 2021 (08:23:11 CEST)
The construction of random fillings from the excavation of medium-hardness rocks, with large particle sizes, presents limitations in compaction control. This research applies new control techniques with revised test procedures to the construction of the random filling core, which constitutes the main part of the embankment, has the largest volume, and provides the geotechnical stability of the infrastructure. The maximum layer thickness investigated was 800 mm. As there are many types of rock, this research is applied to metamorphic slates. Quality control has been carried out by applying new research associated with the revision of the wheel impression test, topographic settlements and the plate bearing test (PBT). A statistical analysis of the core of 16 slate random fillings has been carried out, with a total of 2250 in situ determinations of density and moisture content, 75 wheel impression tests, 75 topographic settlement controls and 75 PBTs. The strong associations found between the different tests have made it possible to simplify the quality control.
ARTICLE | doi:10.20944/preprints202004.0302.v1
Subject: Earth Sciences, Environmental Sciences Keywords: active learning; poplar plantations; spatial transfer; sentinel-2; large scale; image classification; random forest
Online: 17 April 2020 (15:05:54 CEST)
Reliable estimates of poplar plantations area are not available at the French national scale due to the unsuitability and low update rate of existing forest databases for this short-rotation species. While supervised classification methods have been shown to be highly accurate in mapping forest cover from remotely sensed images, their performance depends to a great extent on the labelled samples used to build the models. In addition to their high acquisition cost, such samples are often scarce and not fully representative of the variability in class distributions. Consequently, when classification models are applied to large areas with high intra-class variance, they generally yield poor accuracies. In this paper, we propose the use of active learning (AL) to efficiently adapt a classifier trained on a source image to spatially distinct target images with minimal labelling effort and without sacrificing classification performance. The adaptation consists in actively adding to the initial local model, new relevant training samples from other areas, in a cascade that iteratively improves the generalisation capabilities of the classifier, leading to a global model tailored to different areas. This active selection relies on uncertainty sampling to directly focus on the most informative pixels for which the algorithm is the least certain of their class labels. Experiments conducted on Sentinel-2 time series showed that when the same number of training samples was used, active learning outperformed passive learning (random sampling) by up to 5% of overall accuracy and up to 12% of class F-score. In addition, and depending on the class considered, the random sampling required up to 50% more samples to achieve the same performance of an active learning-based model. 
Moreover, the results demonstrate the suitability of the derived global model to accurately map poplar plantations among other tree species with overall accuracy values up to 14% higher than those obtained with local models. The proposed approach paves the way for national-scale mapping in an operational context.
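The uncertainty-sampling step described in this abstract can be sketched as follows. The abstract does not say which uncertainty measure was used, so this example assumes Shannon entropy of the predicted class probabilities; the function names and the probability values are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy of one predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def select_most_uncertain(
    class_probabilities: Sequence[Sequence[float]], n_queries: int
) -> List[int]:
    """Return the indices of the n_queries samples with the highest entropy.

    These are the pixels a human annotator would be asked to label next.
    """
    ranked = sorted(
        range(len(class_probabilities)),
        key=lambda i: entropy(class_probabilities[i]),
        reverse=True,
    )
    return ranked[:n_queries]


# Hypothetical classifier outputs for four unlabeled pixels (two classes).
probas = [[0.90, 0.10], [0.50, 0.50], [0.60, 0.40], [0.99, 0.01]]
to_label = select_most_uncertain(probas, n_queries=2)
```

In an active-learning loop, the selected samples would be labeled, added to the training set, and the classifier retrained before the next round of selection.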
ARTICLE | doi:10.20944/preprints202001.0365.v1
Subject: Mathematics & Computer Science, Computational Mathematics Keywords: Hopfield Neural Networks; Election Algorithm; Imperialistic Competitive Algorithm; Exhaustive Search; Random Satisfiability; Logic Programming
Online: 30 January 2020 (11:46:31 CET)
Election Algorithm (EA) is a powerful metaheuristic model motivated by the socio-political mechanism of the presidential elections conducted in many countries. EA is selected as a topic of discussion due to its capability and robustness in solving complex problems in the random-2SAT logic program. This paper utilizes a hybridized EA integrated with the Hopfield neural network (HNN) to carry out the random logic program (HNN-R2SATEA). The efficiency of the proposed method was compared with the existing traditional exhaustive search (HNN-R2SATES) model and the recently introduced HNN-R2SATICA model. The results obtained clearly show that our proposed hybrid model outperformed the other existing models in terms of the Global Minima Ratio (ZM), Mean Absolute Error (MAE), Bayesian Information Criterion (BIC) and Execution Time (ET). The outcome indicates that the EA algorithm outperformed the other two algorithms in carrying out the random-kSAT logic program. The results demonstrated the robustness, effectiveness, and compatibility of the HNN-R2SATEA model.