ARTICLE | doi:10.20944/preprints202003.0233.v1
Subject: Engineering, Civil Engineering Keywords: roller compacted concrete pavement; classification-regression models; feature selection; mechanical properties; Monte-Carlo uncertainty
Online: 15 March 2020 (01:32:44 CET)
In the field of pavement engineering, determining mechanical characteristics is an essential process for reliable material design and highway sustainability. Early determination of the mechanical characteristics of pavement is highly important for road and highway construction and maintenance. The tensile strength (TS), compressive strength (CS) and flexural strength (FS) of roller compacted concrete pavement (RCCP) are crucial characteristics, and estimating them requires many mixture-proportion variables as inputs. In this research, classification-based regression models named Random Forest (RF), M5rule model tree (M5rule), M5prime model tree (M5p) and Chi-square Automatic Interaction Detection (CHAID) are developed to simulate the mechanical characteristics of RCCP. A comprehensive and reliable dataset comprising 621, 326 and 290 experimental records for CS, TS and FS, respectively, was extracted from several open sources in the literature. The models are built on an influential input combination selected using Principal Component Analysis (PCA). The applied PCA feature selection indicated that the volumetric/weighted content forms of the experimental variables (e.g., coarse aggregate, fine aggregate, supplementary cementitious materials, water and binder) and specimen age are the most effective inputs for generating better performance. Several statistical metrics were computed to evaluate the proposed classification-based regression models. The RF model revealed a stronger capacity for predicting the CS, TS and FS of RCCP than the CHAID, M5rule and M5p models. The results were further verified using a Monte Carlo model for uncertainty and variable-importance sensitivity analysis. Overall, the proposed methodology provides a reliable soft computing model that can be implemented in material engineering construction and design.
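As an illustration of the pipeline described above, here is a minimal sketch of PCA-based input reduction followed by a Random Forest regressor; the CSV file and its column names are hypothetical stand-ins for the abstract's CS dataset.

```python
# Hedged sketch: PCA feature selection, then Random Forest regression for
# compressive strength (CS). File and column names are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("rccp_cs.csv")          # 621 CS records, per the abstract
X = df.drop(columns=["CS"]).values       # mixture proportions + specimen age
y = df["CS"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep enough principal components to explain 95% of the input variance.
pca = PCA(n_components=0.95).fit(X_train)
X_train_p, X_test_p = pca.transform(X_train), pca.transform(X_test)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train_p, y_train)
print("R2 on held-out data:", r2_score(y_test, rf.predict(X_test_p)))
```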
REVIEW | doi:10.20944/preprints202110.0207.v1
Online: 13 October 2021 (16:28:59 CEST)
Accurate transfer learning of clinical outcomes, e.g., of the effects and side effects of drugs or other interventions, from one cellular context to another (in-vitro versus ex-vivo versus in-vivo, or across tissues), between cell types, developmental stages, omics modalities or species, is considered tremendously useful. Ultimately, it may prevent much drug development from failing in translation, despite large investments in the preclinical stages, which include animal experiments requiring careful justification. Thus, when transferring a prediction task from a source (model) domain to a target domain, what counts is the high quality of the predictions in the target domain, requiring molecular states or processes common to both source and target that can be learned by the predictor, reflected by latent variables. These latent variables may form a compendium of knowledge that is learned in the source to enable predictions in the target; usually, there are few, if any, labeled target training samples to learn from. Transductive learning then refers to learning the predictor in the source domain and transferring its outcome label calculations to the target domain, considering the same task. Inductive learning considers cases where the target predictor performs a different yet related task compared to the source predictor, making some labeled target data necessary. Often, there is also a need to first map the variables in the input/feature spaces (e.g., of gene names to orthologs) and/or the variables in the output/outcome spaces (e.g., by matching of labels). Transfer across omics modalities also requires that the molecular information flow connecting these modalities is sufficiently conserved. Only one of the transfer learning methods we reviewed offers an assessment of input data, suggesting that transfer learning is unreliable in certain cases. Moreover, source domains feature their very own particularities, and transfer learning should consider these, e.g., differences in pharmacokinetics, drug clearance or the microenvironment. In light of these general considerations, we here discuss and juxtapose various recent transfer learning approaches, specifically designed (or at least adaptable) to predict clinical (human in-vivo) outcomes based on molecular data, towards finding the right tool for a given task, and paving the way for a comprehensive and systematic comparison of the suitability and accuracy of transfer learning of clinical outcomes.
ARTICLE | doi:10.20944/preprints201811.0096.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: machine learning; stacking; forecasting; regression; sales; time series
Online: 5 November 2018 (09:54:54 CET)
In this paper, we study the use of machine learning models for sales time series forecasting. The effect of machine learning generalization is considered, and a stacking approach for building a regression ensemble of single models is studied. The results show that using stacking techniques, we can improve the performance of predictive models for sales time series forecasting.
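A minimal sketch of such a stacked regression ensemble follows; the base models and the synthetic stand-in for lagged sales features are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a stacking ensemble: single regressors are combined by a
# meta-learner trained on out-of-fold predictions.
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)  # stand-in for lagged sales features

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0)),
                ("knn", KNeighborsRegressor())],
    final_estimator=Ridge(),   # meta-model learns to weight the base models
    cv=5,                      # out-of-fold predictions avoid leakage
)
stack.fit(X, y)
print(stack.predict(X[:3]))
```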
ARTICLE | doi:10.20944/preprints202005.0328.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: classification; machine learning; breast cancer
Online: 20 May 2020 (10:53:42 CEST)
Classification algorithms are widely used for studying various categories of data located in multiple databases with real-world implementations. The main purpose of this research work is to identify the efficiency of classification algorithms in breast cancer analysis. The mortality rate of women increases due to frequent cases of breast cancer. The conventional method of diagnosing breast cancer is time-consuming, and hence research is being carried out in multiple dimensions to address this issue. In this research work, Google Colab, an excellent environment for Python coders, is used as a tool to implement machine learning algorithms for predicting the type of cancer. The performance of the machine learning algorithms is analyzed based on the accuracy obtained from various classification models: logistic regression, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Naïve Bayes, Decision Tree and Random Forest. Experiments show that these classifiers work well for the classification of breast cancers, with accuracy > 90%, and logistic regression stood top with an accuracy of 98.5%. Implementation in Google Colab also made the task much easier, without the hours previously spent installing the environment and supporting libraries.
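A minimal sketch of the kind of comparison described above, using scikit-learn's built-in Wisconsin breast cancer data as a stand-in for the study's dataset:

```python
# Hedged sketch: cross-validated accuracy of the six classifier families
# named in the abstract, on a stand-in dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```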
ARTICLE | doi:10.20944/preprints201807.0353.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: corporate default swap spreads; correlation networks; vector autoregressive regression
Online: 19 July 2018 (10:16:11 CEST)
We propose a novel credit risk measurement model for Corporate Default Swap spreads that combines vector autoregressive regression with correlation networks. We focus on the sovereign CDS spreads of a collection of countries that can be regarded as idiosyncratic measures of credit risk. We model them by means of a vector autoregressive regression model, composed of a time-dependent country-specific component and a contemporaneous component that describes contagion effects among countries. To disentangle the two components, we employ correlation networks derived from the correlation matrix of the reduced-form residuals. The proposed model is applied to ten countries that are representative of the recent financial crisis: top borrowing/lending countries and peripheral European countries. The empirical findings show that the proposed model is a good predictor of CDS spread movements, and that the contemporaneous component decreases prediction errors with respect to a simpler autoregressive model. From an applied viewpoint, core countries appear to import risk, as contagion increases their CDS spread, whereas peripheral countries appear as exporters of risk. Greece is an unfortunate exception, as its spreads seem to increase through both idiosyncratic factors and contagion effects.
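A minimal sketch of the two-component idea: a VAR captures the country-specific autoregressive part, and the correlation matrix of the reduced-form residuals yields the contagion network. The CSV file and its country columns are hypothetical placeholders.

```python
# Hedged sketch: VAR on CDS spread changes, then residual correlations.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

spreads = pd.read_csv("cds_spreads.csv", index_col=0, parse_dates=True)
changes = spreads.diff().dropna()              # work with spread changes

res = VAR(changes).fit(maxlags=1)              # time-dependent component
resid = pd.DataFrame(np.asarray(res.resid), columns=changes.columns)

# Contemporaneous component: correlation network among the residuals,
# whose edges can be read as contagion links.
print(resid.corr().round(2))
```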
ARTICLE | doi:10.20944/preprints202010.0057.v1
Subject: Social Sciences, Accounting Keywords: multiclass classification; text mining; accounting control system
Online: 5 October 2020 (09:05:53 CEST)
Electronic invoicing has been mandatory for Italian companies since January 2019. Invoices are structured in a predefined XML template from which the reported information can be easily extracted and analyzed. The main aim of this paper is to exploit the information structured in electronic invoices to build an intelligent system that can facilitate accountants' work. More precisely, this contribution shows how part of the accounting process can be automated: all sent or received invoices of a company are classified into specific codes that represent the economic nature of the financial transactions. To classify the data contained in the invoices, a multiclass machine learning classification problem is formulated, using the invoice information as input variables to predict two target variables, the account codes and the VAT codes, which together compose a general ledger entry. Different approaches are compared in terms of prediction accuracy. The best performance is achieved by considering the hierarchical structure of the account codes.
REVIEW | doi:10.20944/preprints202111.0310.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Functional Data Analysis (FDA); Hybrid Data; Semi-Functional Partial Linear Regression Model (SFPLR); Partial Functional Linear Regression; Literature Review
Online: 17 November 2021 (15:21:19 CET)
Background: In functional data analysis (FDA), hybrid or mixed data combine scalar and functional datasets. The semi-functional partial linear regression model (SFPLR) is one of the first semiparametric models for a scalar response with hybrid covariates. Various extensions of this model are explored and summarized. Methods: The first two research articles, "Semi-functional partial linear regression model" and "Partial functional linear regression", have more than 300 citations in Google Scholar. We use the PRISMA standard for the systematic review. Only 106 articles remained after applying inclusion and exclusion criteria such as 1) including articles published in ISI journals and excluding 2) non-English articles and 3) preprints, slides, and conference papers. Results: The articles are categorized into the following main topics: estimation procedures, confidence regions, time series and panel data, Bayesian methods, spatial models, robustness, testing, quantile regression, varying coefficient models, variable selection, single-index models, measurement error, multiple functions, missing values, rank methods, and others. There are different applications and datasets, such as the Tecator dataset, air quality, electricity consumption, and neuroimaging, among others. Conclusions: SFPLR is one of the most popular regression modeling methods for hybrid data and has many extensions.
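For orientation, the SFPLR model takes the form below, as commonly written in the literature this review surveys (a sketch of the standard formulation, not a quotation from the review): the scalar covariates enter parametrically and the functional covariate enters through an unknown smooth operator, typically estimated by functional kernel smoothing.

```latex
% SFPLR: scalar response Y_i, scalar covariates X_i in R^p (parametric part),
% functional covariate \chi_i (nonparametric part), errors \varepsilon_i.
Y_i = \mathbf{X}_i^{\top}\boldsymbol{\beta} + m(\chi_i) + \varepsilon_i,
\qquad i = 1,\dots,n .
```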
TECHNICAL NOTE | doi:10.20944/preprints201809.0539.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Non-normality; Classical Linear Regression Model; Modified Maximum Likelihood Estimation
Online: 27 September 2018 (10:04:26 CEST)
Regression models form the core of the discipline of econometrics. One of the basic assumptions of the classical linear regression model is that the values of the explanatory variables are fixed in repeated sampling. However, in most real-life cases, particularly in economics, the assumption of fixed regressors is not always tenable. In a non-experimental or uncontrolled environment, the dependent variable is often influenced by explanatory variables that are stochastic in nature. There is a huge literature on stochastic regressors covering various aspects. In this paper, we attempt to set down a historical perspective on some of the work related to stochastic regressors, based on a literature search.
ARTICLE | doi:10.20944/preprints201806.0467.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: variable annuity; portfolio valuation; linear regression; group-lasso; interaction effect
Online: 28 June 2018 (12:05:27 CEST)
A variable annuity is a popular life insurance product that comes with financial guarantees. Using Monte Carlo simulation to value a large variable annuity portfolio is extremely time-consuming. Metamodeling approaches have been proposed in the literature to speed up the valuation process. In metamodeling, a metamodel is first fitted to a small number of variable annuity contracts and then used to predict the values of all other contracts. However, metamodels that have been investigated in the literature are sophisticated predictive models. In this paper, we investigate the use of linear regression models with interaction effects for the valuation of large variable annuity portfolios. Our numerical results show that linear regression models with interactions are able to produce accurate predictions and can be useful additions to the toolbox of metamodels that insurance companies can use to speed up the valuation of large VA portfolios.
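A minimal sketch of a linear metamodel with pairwise interaction terms follows; the simulated features are hypothetical stand-ins for real contract attributes (e.g., age, account value, guarantee type).

```python
# Hedged sketch: linear regression with interaction effects as a metamodel.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # representative contracts
y = X[:, 0] + 2 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

# interaction_only=True adds x_i * x_j terms without squared terms.
design = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
model = LinearRegression().fit(design.fit_transform(X), y)

# The fitted metamodel now values the remaining contracts cheaply.
print(model.predict(design.transform(X[:3])))
```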
ARTICLE | doi:10.20944/preprints201607.0001.v1
Subject: Social Sciences, Finance Keywords: PUN, artificial intelligence models, regression tree, bootstrap aggregation, forecasting error
Online: 2 July 2016 (03:48:36 CEST)
Electricity price forecasting has become a crucial element of both private and public decision-making. Its importance has been growing since the wave of deregulation and liberalization of the energy sector worldwide in the late 1990s. Given these facts, this paper aims to develop a precise and flexible forecasting model for the hourly wholesale electricity price in the Italian power market. We utilize artificial intelligence models, such as neural networks and bagged regression trees, that are rarely used to forecast electricity prices. After model calibration, our final model is bagged regression trees with exogenous variables. The selected model outperformed the neural network and the bagged regression with a single price used in this paper; it also outperformed other statistical and non-statistical models used in other studies. We also confirm some theoretical specifications of the model. As a policy implication, this model might be used by energy traders, transmission system operators and energy regulators for an enhanced decision-making process.
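A minimal sketch of bagged regression trees with exogenous variables; the feature set simulated here is a hypothetical stand-in for the paper's hourly inputs.

```python
# Hedged sketch: bootstrap-aggregated decision trees for price forecasting.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical exogenous inputs: lagged price, hour of day, forecast load.
X = rng.normal(size=(2000, 3))
y = 0.6 * X[:, 0] + np.sin(X[:, 1]) + 0.3 * X[:, 2] + rng.normal(scale=0.1, size=2000)

model = BaggingRegressor(
    DecisionTreeRegressor(),   # base learner
    n_estimators=100,          # number of bootstrap-aggregated trees
    random_state=0,
).fit(X, y)
print(model.predict(X[:3]))
```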
ARTICLE | doi:10.20944/preprints201911.0141.v1
Subject: Mathematics & Computer Science, Computational Mathematics Keywords: sensitivity points; paranodic skewization; synodic skewization; Kabirian coefficient; regression lines; statistical models
Online: 13 November 2019 (03:46:03 CET)
The effect of sensitivity points (the sequence order and position of every element) of sequences undergoing comparative optinalysis under two proposed mechanisms, paranodic and synodic skewization, was studied to develop a modeling approach to the comparative optinalysis of sequences. The results show that the outcomes of comparative optinalysis (similarity measurement) in a set of paranodically skewed sequences can be modeled deterministically by suitable regression line functions. The sensitivity points (nodes) of a sequence display two important and distinct zones, the K-zones and the B-zones. Continuous paranodic skewization within these zones (KB zones) operates within probability space, but at the hyperskewization level, and at the K-zones only, the outcomes of comparative optinalysis operate outside the probability space. Moreover, the outcomes of comparative optinalysis by synodic skewization can be modeled deterministically by some regression line functions, but a general regression function cannot be identified. At a certain limit of the skewization value space, following paranodic and synodic skewization, the outcomes of comparative optinalysis for the left-sided sequence form a pattern similar to that of the right-sided sequence.
SHORT NOTE | doi:10.20944/preprints202207.0302.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: machine learning; artificial intelligence; pattern; models; classification; regression; GIS; remote sensing
Online: 20 July 2022 (10:58:15 CEST)
Machine learning (ML) is a subdivision of artificial intelligence in which the machine learns from machine-readable data and information. It uses data, learns patterns and predicts new outcomes. Its popularity is growing because it helps to understand trends and provides solutions that can be either a model or a product. Applications of ML algorithms have increased drastically in G.I.S. and remote sensing in recent years, with a broad range of uses, from developing energy-based models to assessing soil liquefaction to relating air quality and mortality. In this paper, we discuss the most popular supervised ML models (classification and regression) in G.I.S. and remote sensing. The motivation for writing this paper is that ML models produce higher accuracy than traditional parametric classifiers, especially for complex data with many predictor variables. This paper provides a general overview of some popular supervised non-parametric ML models that can be used in most G.I.S. and remote sensing-based projects. We discuss classification models (Naïve Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), Decision Trees (DT)) and regression models (Random Forest (RF), Support Vector Machine (SVM), Linear and Non-Linear). The article can therefore be a guide for those interested in using ML models in their G.I.S. and remote sensing-based projects.
ARTICLE | doi:10.20944/preprints201910.0162.v1
Subject: Social Sciences, Economics Keywords: Xinchang Thai; quantile regression; functional classification of government expenditure; Xiao Kang
Online: 15 October 2019 (05:48:07 CEST)
On October 18, 2017, Chinese President Xi Jinping presented, at the 19th Congress of China, the blueprint for building a modernized socialist nation through the realization of Xiao Kang (a moderately prosperous society in which everyone enjoys a peaceful and affluent life, with no one left in poverty). Subsequent to the 2008 financial crisis, the world has moved on to the new economic status of the New Normal. China has also entered the era of "Xinchang Thai," moving from the high-growth to the moderate-growth phase. Therefore, the government of China emphasizes privatization, liberalization, and deregulation. China is also influenced by government policies due to the nature of socialism. This study confirms China's current stage of economic development based on Barro's theory. Thus, we use a quantile regression model and examine the correlation between economic growth and the functional classification of government expenditure during Xi Jinping's term of office. Furthermore, we selected Korea as a comparative country, as the two countries have common features.
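A minimal sketch of quantile regression as used here: the effect of an expenditure category on growth is estimated at several quantiles of the growth distribution. The variable names and simulated data are hypothetical placeholders.

```python
# Hedged sketch: quantile regression of growth on an expenditure share.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"growth": rng.normal(3, 1, 200),
                   "edu_exp": rng.normal(4, 0.5, 200)})

for q in (0.25, 0.5, 0.75):
    fit = smf.quantreg("growth ~ edu_exp", df).fit(q=q)
    print(f"q={q}: edu_exp coef = {fit.params['edu_exp']:.3f}")
```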
ARTICLE | doi:10.20944/preprints202202.0058.v2
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Document Image Classification; Corruption Robustness; Robustness to Distortions; Model Robustness
Online: 14 June 2022 (08:43:57 CEST)
Deep neural networks have been extensively researched in the field of document image classification to improve classification performance and have shown excellent results. However, there is little research in this area that addresses how well these models would perform in a real-world environment, where the data the models are confronted with often exhibit various types of noise or distortion. In this work, we present two separate benchmark datasets, namely RVL-CDIP-D and Tobacco3482-D, to evaluate the robustness of existing state-of-the-art document image classifiers to the types of data distortion commonly encountered in the real world. The proposed benchmarks are generated by inserting 21 different types of data distortion, with varying severity levels, into the well-known document datasets RVL-CDIP and Tobacco3482, respectively, which are then used to quantitatively evaluate the impact of each distortion type on the performance of the latest document image classifiers. In doing so, we show that while the higher-accuracy models also exhibit relatively higher robustness, they still severely underperform under some specific distortions, with their classification accuracies dropping from ~90% to as low as ~40% in some cases. We also show that some of these high-accuracy models perform even worse than the baseline AlexNet model in the presence of distortions, with the relative decline in their accuracy sometimes reaching as high as 300-450% that of AlexNet. The proposed robustness benchmarks are made available to the community and may aid future research in this area.
ARTICLE | doi:10.20944/preprints202104.0622.v1
Subject: Engineering, Automotive Engineering Keywords: Complex Regression, Least-Squares Techniques, Advanced Metering Infrastructure (AMI)
Online: 23 April 2021 (09:46:32 CEST)
This paper uses complex regression analysis to establish customer load regression models that consider economic indicators, temperature and rainfall. The proposed models are then used to study the feasibility of forecasting future energy sales and summer peak load demand. First, least-squares techniques were used to derive regression models, considering economic indicators and temperature, for the energy sales of 34 customers and for total energy sales. In addition, AMI high-voltage customer demand data and 24-hour system generating capacity were adopted to forecast the summer peak load. The data analysis was carried out with the EViews software in order to verify the feasibility of the research framework. The study found that forecasting model accuracy is low only when temperature is mixed with high-voltage demands; when high-voltage demand data and 24-hour system generating capacity are combined to forecast the peak load, the average error is ±0.87%, and the majority of the energy sales forecasting models have an average error of ±3%. These results can serve as a future reference for the power company.
ARTICLE | doi:10.20944/preprints201803.0084.v1
Subject: Engineering, Civil Engineering Keywords: anfis; missing data; multiple regression; normal ratio method; Yeşilırmak
Online: 12 March 2018 (07:00:46 CET)
Good data analysis is required for the optimal design of water resources projects. However, data are not regularly collected due to material or technical reasons, which results in incomplete-data problems. Available data and data length are of great importance to solve those problems. Various studies have been conducted on missing data treatment. This study used data from the flow observation stations on Yeşilırmak River in Turkey. In the first part of the study, models were generated and compared in order to complete missing data using ANFIS, multiple regression and Normal Ratio Method. In the second part of the study, the minimum number of data required for ANFIS models was determined using the optimum ANFIS model. Of all methods compared in this study, ANFIS models yielded the most accurate results. A 10-year training set was also found to be sufficient as a data set.
ARTICLE | doi:10.20944/preprints201803.0245.v1
Subject: Earth Sciences, Environmental Sciences Keywords: severity mapping; regression models; maximum likelihood; GeoCBI; dNBR; RdNBR; RBR
Online: 29 March 2018 (06:06:32 CEST)
The use of forest fire severity derived from remote sensing data for research and management has become increasingly widespread in the last decade; these data typically quantify the pre- to post-fire spectral change between satellite images from multi-spectral sensors. However, there is an active discussion about which of the main indices (dNBR, RdNBR or RBR) is the most adequate for estimating fire severity, as well as about the fitting model used to classify severity levels. This study proposes and evaluates a new technique for mapping severity as an alternative to regression models, based on the maximum likelihood estimation (MLE) machine learning algorithm, using GeoCBI field data and the spectral indices dNBR, RdNBR and RBR applied to Landsat TM and ETM+ images for two fires in central Spain. We compare the severity discrimination capability of dNBR, RdNBR and RBR through a spectral separability index (M) and then evaluate the concordance of these metrics with field data based on GeoCBI measurements. Specifically, we evaluated the correspondence (R2) between each metric and the continuous measurement of fire severity (GeoCBI) and the overall accuracy of the regression and MLE models for the four categorized severity levels (Unburned, Low, Moderate, and High). The results show that the RBR has greater spectral separability (average between the two fires M = 2.00) than the dNBR (M = 1.82) and the RdNBR (M = 1.80); additionally, the GeoCBI has a better fit with the RBR (R2 = 0.73) than with the RdNBR (R2 = 0.72) and the dNBR (R2 = 0.71). Finally, the overall classification accuracy achieved with the MLE (Kappa = 0.65) is better than that of the regression models (Kappa = 0.58), with higher accuracies for the individual classes.
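For reference, the three indices compared above are commonly defined as follows (scaling conventions vary across studies; NIR and SWIR denote the near- and shortwave-infrared reflectances):

```latex
\mathrm{NBR} = \frac{\mathrm{NIR} - \mathrm{SWIR}}{\mathrm{NIR} + \mathrm{SWIR}},
\qquad
\mathrm{dNBR} = \mathrm{NBR}_{\mathrm{prefire}} - \mathrm{NBR}_{\mathrm{postfire}},
\\[4pt]
\mathrm{RdNBR} = \frac{\mathrm{dNBR}}{\sqrt{\lvert \mathrm{NBR}_{\mathrm{prefire}} \rvert}},
\qquad
\mathrm{RBR} = \frac{\mathrm{dNBR}}{\mathrm{NBR}_{\mathrm{prefire}} + 1.001}.
```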
ARTICLE | doi:10.20944/preprints202010.0146.v1
Subject: Earth Sciences, Atmospheric Science Keywords: particulate matter; prediction; model comparison; artificial neural network; multi-variate linear regression; small city
Online: 7 October 2020 (08:31:21 CEST)
Indian cities are increasingly becoming susceptible to PM10-induced health effects, which have become a matter of concern for the policymakers of the country. Air pollution is engulfing comparatively smaller cities as the rapid pace of urbanization and economic development seems never to lose steam. A review of the air pollution of 28 Indian cities, including tier-I, II, and III cities, found that both WHO and NAAQS standards for acceptable daily average PM10 concentrations are grossly violated by a wide margin. Predicting city-level PM10 concentrations in advance, and initiating prior actions accordingly, is an acceptable solution to save city dwellers from PM10-induced health hazards. The predictive ability of three models, linear MLR, nonlinear MLP (ANN), and nonlinear CART, for one-day-ahead PM10 concentration forecasting in the tier-II city of Guwahati was tested with 2016-2018 daily average observed climate data, PM10, and gaseous pollutants. The results show that the nonlinear MLP algorithm, with a feedforward backpropagation network topology of the ANN class, gives the best predictions compared with the linear MLR and nonlinear CART models. The ANN (MLP) approach may therefore be useful for deriving a predictive understanding of one-day-ahead PM10 concentration levels and thus provide a tool for policymakers to improve decision-making associated with air pollution and public health.
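A minimal sketch of a feedforward MLP for one-day-ahead PM10 prediction; the CSV file and feature columns are hypothetical stand-ins for the study's climate and pollutant inputs.

```python
# Hedged sketch: scaled inputs into an MLP regressor for next-day PM10.
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

df = pd.read_csv("guwahati_daily.csv")
X = df[["pm10_today", "temp", "rh", "wind_speed", "no2", "so2"]].values
y = df["pm10_tomorrow"].values                 # one-day-ahead target

model = make_pipeline(
    StandardScaler(),                          # MLPs train poorly on unscaled inputs
    MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0),
)
model.fit(X, y)
```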
ARTICLE | doi:10.20944/preprints202111.0378.v1
Subject: Engineering, Other Keywords: NCM classification; natural language processing; transformers; multilingual BERT; portuguese BERT; NLP; BERT
Online: 22 November 2021 (10:59:43 CET)
The classification of goods involved in international trade in Brazil is based on the Mercosur Common Nomenclature (NCM). The classification of these goods represents a real challenge due to the complexity involved in assigning the correct category codes, especially considering the legal and fiscal implications of misclassification. This work focuses on training a classifier based on Bidirectional Encoder Representations from Transformers (BERT) for the tax classification of goods with NCM codes. In particular, this article presents results from using a BERT model tuned specifically for Portuguese, as well as results from using a multilingual BERT. Experimental results justify the use of these models in the classification process and show that the language-specific model has a slightly better performance.
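A minimal sketch of the setup via Hugging Face transformers; the model id shown (BERTimbau) is one publicly available Portuguese BERT, the label count is hypothetical, and the training loop is omitted.

```python
# Hedged sketch: sequence classification of invoice text into NCM classes.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "neuralmind/bert-base-portuguese-cased"  # or "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=300)

batch = tokenizer(["parafusos de aço inoxidável"], return_tensors="pt",
                  truncation=True, padding=True)
logits = model(**batch).logits            # one score per candidate NCM class
print(logits.argmax(dim=-1))
```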
CONCEPT PAPER | doi:10.20944/preprints202106.0509.v2
Subject: Keywords: EEG; Emotional States; Working Memory; Depression; Anxiety; Graph Theory; Classification; Machine Learning; Neural Networks.
Online: 6 July 2021 (12:42:59 CEST)
Functional connectivity analysis using electroencephalography (EEG) signals is common. The EEG signals are converted to networks by transforming them into a correlation matrix and analyzing the resulting networks. Here, four learning models, namely Logistic Regression, Random Forest, Support Vector Machine, and Recurrent Neural Networks (RNN), are implemented on the correlation matrix data to classify them either by psychometric assessment or by the effect of therapy; the EEG data is trial-based/event-related. The RNN-based classifications provided higher accuracy (74-88%) than the other three models (50-78%). Instead of using individual graph features, a correlation matrix provides an initial test of the data. Compared with the time-resolved correlation matrix, it offered 4-5% higher accuracy. The time-resolved correlation matrix is better suited for dynamic studies; here, it provides lower accuracy than the correlation matrix, a static feature.
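A minimal sketch of the static-feature pipeline: each multichannel trial becomes the upper triangle of its channel correlation matrix, which is then classified. The simulated data and the choice of classifier are illustrative assumptions.

```python
# Hedged sketch: correlation-matrix features from EEG trials, then classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_channels, n_samples = 120, 32, 256
trials = rng.normal(size=(n_trials, n_channels, n_samples))  # stand-in EEG
labels = rng.integers(0, 2, n_trials)                        # e.g., pre/post therapy

iu = np.triu_indices(n_channels, k=1)
features = np.array([np.corrcoef(t)[iu] for t in trials])    # one row per trial

print(cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=5).mean())
```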
ARTICLE | doi:10.20944/preprints202006.0223.v1
Subject: Keywords: BERT; Classification; Mix-Code; Language Model; Youtube; Parametric and Non-Parametric
Online: 17 June 2020 (13:40:22 CEST)
The scope of a lucrative career promoted by Google through its video distribution platform YouTube has attracted a large number of users to become content creators. An important aspect of this line of work is the feedback received in the form of comments, which show how well the content is being received by the audience. However, the volume of comments, coupled with spam and limited tools for comment classification, makes it virtually impossible for a creator to go through every comment and gather constructive feedback. Automatic classification of comments is a challenge even for established classification models, since comments are often of variable lengths and riddled with slang, symbols and abbreviations. This is an even greater challenge for multilingual comments, as the messages are often rife with the respective vernacular. In this work, we evaluated top-performing classification models and four different vectorizers for classifying comments that are a mix of different combinations of English and Malayalam (only English, only Malayalam, and a mix of English and Malayalam). The statistical analysis of the results indicates that Multinomial Naïve Bayes, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest and Decision Trees offer similar levels of accuracy in comment classification. Further, we also evaluated three multilingual sub-types of the novel NLP language model BERT and compared their performance to the conventional machine learning classification techniques. XLM was the top-performing BERT model, with an accuracy of 67.31%. Random Forest with a term-frequency vectorizer was the top-performing traditional classification model, with an accuracy of 63.59%.
ARTICLE | doi:10.20944/preprints202107.0636.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Generative Adversarial Networks; Transfer Learning; Medical Imaging; Deep Learning Classification; Chest X-ray’s
Online: 28 July 2021 (17:12:31 CEST)
Data sets for medical images are generally imbalanced and limited in sample size because of high data collection costs, time-consuming annotations, and patient privacy concerns. Training deep neural network classification models on these data sets to improve generalization does not produce the desired results for classifying the medical condition accurately and often overfits the majority-class samples. To address this issue, we propose a framework for improving the classification performance of deep neural network models using transfer learning with pre-trained models, such as Xception, InceptionResNet and DenseNet, along with Generative Adversarial Network (GAN)-based data augmentation. We trained the network by combining traditional data augmentation techniques, such as randomly flipping the image left to right, with GAN-based data augmentation, and then fine-tuned the hyper-parameters of the transfer learning models, such as the learning rate, batch size, and number of epochs. With these configurations, the Xception model outperformed all other pre-trained models, achieving a test accuracy of 98.7%, precision of 99%, recall of 99.3%, F1-score of 99.1%, and a receiver operating characteristic (ROC) area under the curve (AUC) of 98.2%.
ARTICLE | doi:10.20944/preprints201812.0237.v1
Subject: Engineering, Mechanical Engineering Keywords: signal processing; sparse regression; system identification; impulse response; optimization; feature generation; structural dynamics; time series classification
Online: 19 December 2018 (16:21:41 CET)
Time recordings of impulse-type oscillation responses are short and highly transient. These characteristics may complicate the usage of classical spectral signal processing techniques for a) describing the dynamics and b) deriving discriminative features from the data. However, common model identification and validation techniques mostly rely on steady-state recordings, characteristic spectral properties and non-transient behavior. In this work, a recent method, which allows reconstructing differential equations from time series data, is extended for higher degrees of automation. With special focus on short and strongly damped oscillations, an optimization procedure is proposed that fine-tunes the reconstructed dynamical models with respect to model simplicity and error reduction. This framework is analyzed with particular focus on the amount of information available to the reconstruction, noise contamination and non-linearities contained in the time series input. Using the example of a mechanical oscillator, we illustrate how the optimized reconstruction method can be used to identify a suitable model and to extract features from uni-variate and multivariate time series recordings in an engineering-compliant environment. Moreover, the determined minimal models allow for identifying the qualitative nature of the underlying dynamical systems as well as testing for the degree and strength of non-linearity. The reconstructed differential equations would then be potentially available for classical numerical studies, such as bifurcation analysis. These results represent a physically interpretable enhancement of data-driven modeling approaches in structural dynamics.
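A minimal sketch of the underlying idea, reconstructing a differential equation from time series by sparse regression (sequentially thresholded least squares over a candidate function library); the damped oscillator, library terms, and threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: recover dv/dt = -0.4 v - 4 x from simulated oscillator data.
import numpy as np

# Simulate x'' + 0.4 x' + 4 x = 0 with explicit Euler; state is [x, v].
dt, n = 0.001, 5000
x, v = np.empty(n), np.empty(n)
x[0], v[0] = 1.0, 0.0
for k in range(n - 1):
    a = -0.4 * v[k] - 4.0 * x[k]
    v[k + 1] = v[k] + dt * a
    x[k + 1] = x[k] + dt * v[k]

dv = np.gradient(v, dt)                            # numerical derivative of v
theta = np.column_stack([x, v, x**2, x * v, v**2]) # candidate library

xi = np.linalg.lstsq(theta, dv, rcond=None)[0]
for _ in range(10):                                # threshold small terms away
    small = np.abs(xi) < 0.05
    xi[small] = 0.0
    active = ~small
    xi[active] = np.linalg.lstsq(theta[:, active], dv, rcond=None)[0]
print(dict(zip(["x", "v", "x^2", "x*v", "v^2"], xi.round(3))))
```

The surviving coefficients identify both the minimal model and the degree of nonlinearity: for this linear oscillator, only the x and v terms remain.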
ARTICLE | doi:10.20944/preprints202002.0200.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: uniqueness; regression depth; maximum depth estimator; regression median; robustness
Online: 15 February 2020 (14:51:15 CET)
The notion of the median in one dimension is a foundational element of nonparametric statistics. It has been extended to multi-dimensional settings, both in location and in regression, via notions of data depth. Regression depth (RD) and projection regression depth (PRD) represent the two most promising notions in regression; Carrizosa depth DC is another depth notion in regression. Depth-induced regression medians (maximum depth estimators) serve as robust alternatives to the classical least squares estimator. The uniqueness of regression medians is indispensable in the discussion of their properties and the asymptotics (consistency and limiting distribution) of sample regression medians. Are the regression medians induced from RD, PRD, and DC unique? Answering this question is the main goal of this article. It is found that only the regression median induced from PRD possesses the desired uniqueness property. The conventional remedy for non-uniqueness, taking the average of all medians, might yield an estimator that no longer possesses maximum depth in both the RD and DC cases. These and other findings indicate that PRD and its induced median are highly favorable among their leading competitors.
ARTICLE | doi:10.20944/preprints202010.0436.v1
Subject: Keywords: Naïve Bayes Classification; Eulers Strength Formula; Cricket Prediction; Supervised Learning; KNIME Tool; sports analytics; multivariate regression; neural network
Online: 21 October 2020 (12:34:00 CEST)
In cricket, the Twenty20 format is the most watched and loved by the people, where no one can guess who will win the match until the last ball of the last over. In India, the Indian Premier League (IPL) started in 2008 and is now the most popular T20 league in the world, with a noteworthy fan base, and its format is very unpredictable. We therefore decided to develop a machine learning model for predicting the outcome of its matches. Winning a cricket match depends on many key factors, such as home ground advantage, past performances and records at the same venue, the overall experience of the players, the record against a particular opposition, and the overall current form of the team and of the individual players. This paper describes the key factors that affect the result of a cricket match and the model that best fits this data and gives the best predictions. The IPL match predictor is an ML-based prediction approach in which the datasets and previous statistics are trained in all dimensions, covering all the important factors, such as toss, home ground, captains, favorite players, opposition battles and previous statistics, with each factor given a different strength, using the KNIME tool with the added intelligence of a Naive Bayes network and Euler's strength calculation formula.
ARTICLE | doi:10.20944/preprints202111.0202.v1
Subject: Mathematics & Computer Science, Other Keywords: solar energy; solar radiation prediction; hybrid machine learning; feature selection; feature extraction; classification algorithms; regression analysis; weather research and forecasting (WRF)
Online: 10 November 2021 (10:48:15 CET)
Solar radiation prediction is an important process for ensuring the optimal exploitation of solar energy. Numerous models have been applied to this problem, such as numerical weather prediction models and artificial intelligence models. However, well-designed hybridization approaches that combine numerical models with artificial intelligence models to yield a more powerful model can provide a significant improvement in prediction accuracy. In this paper, we propose novel hybrid machine learning approaches that exploit auxiliary numerical data. The proposed hybrid methods invoke different machine learning paradigms, including feature selection, classification, and regression, and numerical weather prediction (NWP) models are used within them. Feature selection reduces the dimension of the feature space, cutting down the large number of recorded parameters that affect the estimation and prediction processes. Rough set theory is applied for attribute reduction, with the dependency degree used as a fitness function. We investigate the effect of the attribute reduction process with thirty different classification and prediction models in addition to the proposed hybrid model. Then, different machine learning models are constructed based on classification and regression techniques to predict solar radiation. Moreover, other hybrid prediction models are formulated to use the output of the Weather Research and Forecasting (WRF) numerical model as learning elements in order to improve prediction accuracy. The proposed methodologies are evaluated using a data set collected from different regions in Saudi Arabia.
ARTICLE | doi:10.20944/preprints201910.0346.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: Prevention of Mother-to-Child transmission (PMTCT); Human Immunodeficiency Virus (HIV); Acquired Immune Deficiency Syndrome (AIDs); Poisson; negative binomial; logistic; regression
Online: 30 October 2019 (04:06:36 CET)
In Sub-Saharan African countries such as Nigeria, with a high prevalence rate, child HIV/AIDS acquired through mother-to-child transmission (MTCT) can be largely prevented by a well-established prevention programme and scheme. This study examined factors that can enhance the prevention of mother-to-child transmission (PMTCT) of HIV in Nasarawa State. To achieve this, structured questionnaires were used to collect data from one hundred and sixteen (116) women attending two (2) primary facilities and two (2) secondary facilities in the state. The study utilized Poisson regression, negative binomial regression and logistic regression analyses. Results revealed that women with at least a secondary school education, women with husbands in the military, and women with perceived confidentiality of their HIV status significantly enhanced PMTCT of HIV in Nasarawa State, while a significant proportion of the women attested that drugs are available in the facilities (p-value = 0.0000 < 0.05). Other factors, including the mother's income level, willingness to continue with the PMTCT programme, and membership in a support group, can also enhance PMTCT, although they are not significant. This study recommends that the identified factors be explored by NGOs, the Ministry of Health, support groups and other relevant agencies, since they have the capacity to enhance PMTCT of HIV in Nasarawa State, Nigeria.
ARTICLE | doi:10.20944/preprints202007.0634.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: CVD rehabilitation; Local muscular endurance exercises; Exercise-based rehabilitation; Deep Learning; AlexNet; CNN; SVM; kNN; RF; MLP; PCA; multi-class classification; INSIGHT-LME dataset
Online: 26 July 2020 (15:21:08 CEST)
Exercise-based cardiac rehabilitation requires patients to perform a set of certain prescribed exercises a specific number of times. Local muscular endurance (LME) exercises are an important part of the rehabilitation program. Automatic exercise recognition and repetition counting from wearable sensor data is an important technology for enabling patients to perform exercises independently in remote settings, e.g., their own home. In this paper, we first report on a comparison of traditional approaches to exercise recognition and repetition counting, corresponding to supervised machine learning and peak detection from inertial sensing signals respectively, with more recent machine learning approaches, specifically Convolutional Neural Networks (CNNs). We investigated two different types of CNN: one using the AlexNet architecture, the other using a time-series array. We found that the CNN-based approaches performed better than the traditional approaches. For the exercise recognition task, the AlexNet-based single CNN model outperformed the other methods with an overall F1-score of 97.18%. For exercise repetition counting, again the AlexNet-based single CNN model outperformed the other methods by correctly counting repetitions in 90% of the performed exercise sets within an error of ±1. To the best of our knowledge, our approach of using a single CNN method for both recognition and repetition counting is novel. In addition to reporting our findings, we also make the dataset we created, the INSIGHT-LME dataset, publicly available to encourage further research.
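A minimal sketch of the traditional repetition-counting baseline mentioned above: count peaks in a smoothed inertial signal. The simulated signal, sampling rate, and peak parameters are illustrative assumptions.

```python
# Hedged sketch: peak detection over a smoothed accelerometer-like signal.
import numpy as np
from scipy.signal import find_peaks, savgol_filter

fs = 50                                            # Hz, a typical wearable rate
t = np.arange(0, 20, 1 / fs)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)

smooth = savgol_filter(signal, window_length=51, polyorder=3)
peaks, _ = find_peaks(smooth, height=0.5, distance=fs)  # at most ~1 rep/second
print("repetitions counted:", len(peaks))          # ~10 for a 0.5 Hz exercise
```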
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: waste classification; transfer learning; deep learning; recognition classification
Online: 23 February 2020 (14:01:01 CET)
Using machine learning or deep learning to solve the problem of garbage recognition and classification is an important application in computer vision, but due to the incomplete establishment of garbage datasets and the poor performance of complex network models on smart terminal devices, existing garbage classification models perform poorly. This paper presents a waste classification and identification method based on transfer learning and a lightweight neural network. The lightweight neural network MobileNetV2 is transferred and rebuilt; the reconstructed network is used for feature extraction, and the extracted features are fed into an SVM to identify 6 types of garbage. The model was trained and verified using the 2527 labeled garbage images in the TrashNet dataset, ultimately reaching a classification accuracy of 98.4%, which shows that the method can effectively improve classification accuracy and training time, overcome the problems of weak data and scarce labels, and avoid the over-fitting encountered by small datasets in deep learning, making the model robust.
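A minimal sketch of the described pipeline, assuming a pretrained MobileNetV2 as a fixed feature extractor feeding an SVM; the image batch here is a random stand-in for TrashNet images.

```python
# Hedged sketch: MobileNetV2 features into an SVM for 6 garbage classes.
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                         input_shape=(224, 224, 3))

def features(images):
    # images: float array (n, 224, 224, 3) in the original 0-255 range
    x = tf.keras.applications.mobilenet_v2.preprocess_input(images)
    return base.predict(x, verbose=0)

X_img = np.random.rand(32, 224, 224, 3) * 255    # stand-in image batch
y = np.random.randint(0, 6, 32)                  # 6 garbage categories
clf = SVC(kernel="rbf").fit(features(X_img), y)
```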
ARTICLE | doi:10.20944/preprints201903.0122.v1
Subject: Earth Sciences, Geoinformatics Keywords: Classification; SVM Classifier; ML Classifier; Supervised and Unsupervised Classification; Object-based Classification; Multispectral Data
Online: 11 March 2019 (09:01:44 CET)
This paper focuses on the crucial role that remote sensing plays in identifying land features. Remotely collected data provide information in the spectral, spatial, temporal and radiometric domains, each domain having a specific resolution. Diverse sectors such as hydrology, geology, agriculture, land cover mapping, forestry, urban development and planning, and oceanography are known to use and rely on information gathered remotely by different sensors. In the present study, IRS LISS-IV multispectral data are used for land cover mapping. It is known, however, that classifying high-resolution land cover imagery through manual digitizing is time-consuming and far too costly. Therefore, this paper proposes accomplishing classification by implementing algorithms on computers. These classifications fall into three classes: supervised, unsupervised, and object-based classification. For supervised classification, two approaches are relied upon for land cover classification of the high-resolution LISS-IV multispectral image: Maximum Likelihood (ML) and Support Vector Machine (SVM). Finally, the paper proposes a step-by-step procedure for an optical image classification methodology. This paper concludes that, in optical data classification, SVM classification gives a better result than the ML classification technique.
REVIEW | doi:10.20944/preprints201910.0362.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: tuberculosis (TB); human immunodeficiency virus (HIV); Acquired Immune Deficiency Syndrome (AIDS); World Health Organization (WHO); panel data; poisson; negative binomial; regression
Online: 31 October 2019 (04:33:45 CET)
Tuberculosis (TB) is a leading cause of death worldwide and the leading cause from a single infectious agent, ranking above human immunodeficiency virus (HIV) and acquired immune deficiency syndrome (AIDS). The aim of this study is to ascertain the trend of tuberculosis prevalence and the effect of HIV prevalence on tuberculosis cases in some West African countries from 2000 to 2016 using count panel data regression models. The data used were annual HIV and tuberculosis cases spanning 2000 to 2016, extracted from online publications of the World Health Organization (WHO). Panel Poisson regression and negative binomial regression models with fixed and random effects were used to analyze the count data. The results revealed a positive trend in TB cases, and that an increase in HIV cases leads to an increase in TB cases in West African countries. Among the competing models used in this study, the panel negative binomial regression model with fixed effects emerged as the best model, with a log-likelihood value of -1336.554. This study recommends that governments and NGOs deploy more strategies to fight the HIV menace in West Africa, as this will in turn reduce TB cases in the region.
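A minimal sketch of the count-model comparison, fitting Poisson and negative binomial regressions of TB on HIV with country fixed effects via dummies and comparing log-likelihoods; the data frame layout is a simulated stand-in for the WHO panel.

```python
# Hedged sketch: Poisson vs. negative binomial with country fixed effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country": np.repeat(list("ABCDE"), 17),
    "year": np.tile(range(2000, 2017), 5),
    "hiv": rng.poisson(50, 85),
})
df["tb"] = rng.poisson(30 + 0.5 * df["hiv"])   # simulated TB counts

pois = smf.poisson("tb ~ hiv + C(country)", df).fit(disp=False)
nb = smf.negativebinomial("tb ~ hiv + C(country)", df).fit(disp=False)
print(pois.llf, nb.llf)   # higher log-likelihood wins, as in the study
```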
Online: 18 September 2020 (09:40:45 CEST)
The main objective of this article is to explore the causes of household electricity poverty in Spain from an innovative perspective. Based on evidence of energy inequality across households with different income levels, a quantile regression approach was used to better capture the heterogeneity of determinants of energy poverty across different levels of electricity expenditure. The results illustrate some interesting and counter-intuitive findings about the relationship between household income and electricity poverty, and the technical efficiency of quantile regression compared to the imprecise results of a standard single coefficient/OLS approach.
ARTICLE | doi:10.20944/preprints202201.0441.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Active learning (AL); batch mode; expected model change; linear regression; nonlinear regression
Online: 28 January 2022 (15:03:10 CET)
Training supervised machine learning models requires labeled examples. A judicious choice of examples is helpful when there is a significant cost associated with assigning labels. This article improves upon a promising extant method for selecting examples to be labeled for regression problems: Batch-mode Expected Model Change Maximization (B-EMCM). Specifically, it develops and evaluates alternate strategies for adaptively selecting the batch size in B-EMCM. By determining the cumulative error that occurs in the estimation of the stochastic gradient descent, a stopping criterion for each batch iteration can be specified to ensure that the selected candidates are the most beneficial to model learning. This new methodology is compared to B-EMCM via mean absolute error and root mean square error over ten iterations, benchmarked against machine learning data sets. Using multiple data sets and metrics across all methods, one variation of AB-EMCM, the max bound of the accumulated error (AB-EMCM Max), showed the best results for an adaptive batch approach. It achieved better root mean square error (RMSE) and mean absolute error (MAE) than the other adaptive and non-adaptive batch methods, while reaching the result in nearly the same number of iterations as the non-adaptive batch methods.
Subject: Medicine & Pharmacology, Oncology & Oncogenics Keywords: breast cancer tumor; classification; majority-based voting mechanism; multilayer perceptron learning network; simple logistic regression; stochastic gradient descent learning; wisconsin breast cancer dataset
Online: 27 November 2019 (09:51:31 CET)
Breast cancer is the most common cause of death for women worldwide. Thus, the ability of artificial intelligence systems to predict and classify breast cancer is very important. In this paper, a hybrid ensemble classification mechanism based on majority voting is proposed. First, the performance of different state-of-the-art machine learning classification algorithms on the Wisconsin Breast Cancer Dataset (WBCD) was evaluated. The three best classifiers were then selected based on their F3 score; the F3 score is used to emphasize the importance of false negatives (recall) in breast cancer classification. These three classifiers, simple logistic regression learning, stochastic gradient descent learning, and a multilayer perceptron network, are then used for ensemble classification with a voting mechanism. We also evaluated the performance of hard and soft voting. For hard voting, a majority-based voting mechanism was used, and for soft voting we used the average of probabilities, product of probabilities, maximum of probabilities and minimum of probabilities voting methods. The hard (majority-based) voting mechanism shows the best performance, with 99.42% accuracy, compared to state-of-the-art algorithms for the WBCD.
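For reference, the F3 score mentioned above is the beta = 3 case of the standard F-beta score, which weights recall nine times as heavily as precision (P = precision, R = recall):

```latex
F_{\beta} = \frac{(1+\beta^{2})\, P\, R}{\beta^{2} P + R},
\qquad
F_{3} = \frac{10\, P R}{9 P + R}.
```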
ARTICLE | doi:10.20944/preprints202212.0475.v2
Subject: Electrical & Electronic Engineering, Engineering Keywords: Instance segmentation; Classification; Vehicle make classification; Mosaic-tiled augmentation
Online: 28 January 2023 (02:42:34 CET)
Vehicle identification is an important task in traffic monitoring because it allows for efficient inference and provides a cause for action. Vehicle classification via deep learning and other approaches, such as segmentation, is a critical tool for re-identification. In this paper, instance segmentation is used for vehicle make identification with license plate detection, allowing for better unique vehicle recognition for re-identification. An existing dataset is re-annotated and modified for polygonal segmentation of the vehicle's unique frontal features, so that the vehicle is represented by its learned frontal form. In addition, a license plate identification class is added for efficient re-identification further down the re-identification and tracking pipeline. The results showed improved classification as well as a high mAP on the dataset compared to previous approaches based on CNNs and deformed CNNs. Furthermore, a deep residual network with fully connected layer-based classification was utilized as the backbone for feature representation. Instance segmentation detects objects by segmenting and classifying regions of interest. The imbalance in the dataset is resolved using a mosaic-tiled approach, which produces greater precision than the other approaches evaluated in the paper.
CONCEPT PAPER | doi:10.20944/preprints201608.0109.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: Synthetic Aperture Radar; UAVSAR; levee; classification; radar polarimetry
Online: 10 August 2016 (11:37:16 CEST)
The dynamics of surface and sub-surface water events can lead to slope instability, resulting in anomalies such as slough slides on earthen levees. Early detection of these anomalies by a remote sensing approach could save time versus direct assessment. We have implemented a supervised Mahalanobis distance classification algorithm for the detection of slough slides on levees using complex polarimetric Synthetic Aperture Radar (polSAR) data. The classifier output was followed by a spatial majority filter post-processing step, which improved the accuracy. The effectiveness of the algorithm is demonstrated using fully quad-polarimetric L-band Synthetic Aperture Radar (SAR) imagery from the NASA Jet Propulsion Laboratory's (JPL's) Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR). The study area is a section of the lower Mississippi River valley in the southern USA. Slide detection accuracy of up to 98 percent was achieved, although the number of available slide examples was small.
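A minimal sketch of a supervised Mahalanobis distance classifier: each sample is assigned to the class whose training statistics it is closest to. The simulated features are stand-ins for polarimetric channels.

```python
# Hedged sketch: per-class mean/covariance, then nearest Mahalanobis class.
import numpy as np

def fit(X, y):
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (Xc.mean(axis=0), np.linalg.inv(np.cov(Xc, rowvar=False)))
    return stats

def predict(stats, X):
    def d2(x, mu, inv_cov):                      # squared Mahalanobis distance
        diff = x - mu
        return diff @ inv_cov @ diff
    return np.array([min(stats, key=lambda c: d2(x, *stats[c])) for x in X])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)              # slide / non-slide stand-in
print((predict(fit(X, y), X) == y).mean())       # training accuracy
```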
ARTICLE | doi:10.20944/preprints202012.0546.v1
Subject: Social Sciences, Accounting Keywords: classification and regression trees; CART algorithm; design thinking; web-based prototype; engagement; ICT technologies; households; water demand management (WDM); machine learning; water consumption range
Online: 22 December 2020 (09:42:57 CET)
This paper uses the numerical results of surveys sent to households in Huelva (Andalusia, Spain) to determine their degree of knowledge about the urban water cycle, and their needs, values, and attitudes regarding water in an intermediary city with low water stress. In previous research, we obtained three different household clusters. The first grouped households with high knowledge of the integral water cycle and a positive attitude towards smart devices at home. The second cluster described households with low knowledge of the integral water cycle and high sensitivity to price. The third showed average knowledge and a predisposition to a closer relationship with the water company. This paper continues this research line, applying Classification and Regression Trees (CART) to determine which hierarchy of variables/factors/independent components obtained from the surveys is decisive for predicting the range of household water consumption in Huelva. Positive attitudes towards improved cleaning habits for personal or household purposes are the highest-hierarchy component for predicting the water consumption range. Second in the hierarchy, the variable Knowledge Global Score about the integral urban water cycle, associated with water literacy, also contributes to predicting the water consumption range. Together with the three clusters obtained previously, these results will allow us to design fit-for-purpose water demand management (WDM) strategies that enable Huelva's households to use water more efficiently.
ARTICLE | doi:10.20944/preprints202208.0222.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: Tuberculosis; Mortality; Indigenous; Logistic Regression
Online: 11 August 2022 (12:00:20 CEST)
Aim. To identify factors associated with mortality among persons diagnosed with tuberculosis in the indigenous population of Peru, 2015-2019. Methods. Case-control study nested in a retrospective cohort, using the registry of persons belonging to indigenous peoples in the National Tuberculosis Prevention and Control Strategy of the Ministry of Health of Peru. A descriptive analysis was applied, and then bivariate and multiple logistic regression were used to evaluate associations between the variables and the outcome (alive vs. deceased); the results are presented as ORs with their respective 95% confidence intervals. Results. The mortality rate in the total indigenous population of Peru was 1.75 deaths per 100,000 indigenous people diagnosed with TB. The Kukama Kukamiria - Yagua community reported 505 (28.48%) individuals. The final logistic model showed that male sex (OR=1.93; 95% CI: 1.001-3.7), a history of HIV prior to TB (OR=16.7; 95% CI: 4.7-58.7) and old age (OR=2.95; 95% CI: 1.5-5.7) were factors associated with a greater chance of dying from TB. Conclusions. It is important to reorient health services for indigenous populations, especially those related to improving the timely diagnosis and early treatment of TB-HIV co-infection, to ensure comprehensive care for this vulnerable population.
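As a generic illustration of how such odds ratios are obtained, here is a minimal Python sketch using statsmodels; the file and column names (deceased, male, hiv_prior, old_age) are hypothetical stand-ins for the study's variables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("tb_indigenous.csv")   # hypothetical case-control table
m = smf.logit("deceased ~ male + hiv_prior + old_age", data=df).fit()

# Odds ratios and 95% confidence intervals from the fitted coefficients.
or_table = np.exp(pd.concat([m.params, m.conf_int()], axis=1))
or_table.columns = ["OR", "2.5%", "97.5%"]
print(or_table)
```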
ARTICLE | doi:10.20944/preprints202011.0297.v1
Online: 10 November 2020 (10:00:37 CET)
In this paper, we present a regression-based modeling approach for the analysis of various types of MTC data. A typical application of this modeling approach comprises three steps: first, define a model that approximates the relationship between gene expression and experimental factors, with parameters incorporated to represent the research interest; second, use least-squares and estimating-equation methods to estimate the parameters and their corresponding standard errors; third, compute test statistics, P-values and NFD as measures of statistical significance. The advantages of this approach are as follows. First, it addresses the research interest in a specific, precise manner, and makes maximal use of all the data and other relevant information. Second, it accounts for both systematic and random variation associated with the data, and the results of such an analysis provide not only gene-specific information relevant to the research goal, but also its reliability, thereby helping investigators make better decisions about subsequent studies. Third, this approach is very flexible, and can easily be extended to other types of MTC studies or other microarray experiments by formulating different models based on the experimental design of the studies.
ARTICLE | doi:10.20944/preprints202206.0079.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: few-shot classification; attention
Online: 6 June 2022 (09:42:13 CEST)
Few-shot classification is challenging since the goal is to classify unlabeled samples with very few labeled samples provided. It has been shown that cross attention helps generate more discriminative features for few-shot learning. This paper extends the idea and proposes two cross attention modules, namely the cross scaled attention (CSA) and the cross aligned attention (CAA). Specifically, CSA scales different feature maps to make them better matched, and CAA adopts the principal component analysis to further align features from different images. Experiments showed that both CSA and CAA achieve consistent improvements over state-of-the-art methods on four widely used few-shot classification benchmark datasets, miniImageNet, tieredImageNet, CIFAR-FS, and CUB-200-2011, while CSA is slightly faster and CAA achieves higher accuracies.
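As a loose illustration of the general cross-attention idea (not the paper's exact CSA or CAA modules), here is a minimal numpy sketch that re-weights one image's feature map by its correlation with another's:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(fa, fb):
    """Rough sketch: align feature map fa (n_pos, d) with feature map fb
    from another image by attending over fb's positions."""
    attn = softmax(fa @ fb.T / np.sqrt(fa.shape[1]))  # (n_pos_a, n_pos_b)
    return attn @ fb                                  # fb-aligned features for fa
```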
ARTICLE | doi:10.20944/preprints201904.0095.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: EMR; SVM; Classification; Clustering
Online: 12 April 2019 (20:53:16 CEST)
Lately, the Critical Pathway (CP) of the Electronic Medical Record (EMR) has been used as the guideline for treatment in public hospitals. We propose a healthcare promotion service using disease patterns combined with lifestyle risk factors. We classify historical patient data by disease code together with lifestyle risk factors (hypertension, diabetes, smoking, overweight, excessive alcohol intake, and low physical activity), deriving the lifestyle risk factors through the classification. We then build clusters of disease codes with lifestyle risk factors using historical data from the EMR's electronic discharge summaries. On this basis, we provide a healthcare recommendation service based on disease patterns with lifestyle risk. Such a service can support the medical help desk of a public hospital when patients check in: how to proceed through treatment, the preferred clinical method for the healthcare promotion service for each disease code, and how to improve one's health. We evaluate the performance of the proposed system by experimenting with datasets collected at a medical center and report the experimental results.
ARTICLE | doi:10.20944/preprints202201.0352.v1
Subject: Earth Sciences, Geoinformatics Keywords: Per-pixel classification confidence; spatial pattern; image classification; accuracy assessment; interpolation method
Online: 24 January 2022 (11:53:46 CET)
Obtaining classification confidence at the pixel level is a challenging task for accuracy assessment in remote sensing image classification. Among the various methods for estimating classification confidence at the pixel level, interpolation-based methods have drawn special attention in the literature. Even though they have been widely recognized, their usefulness has not been rigorously evaluated. This paper conducts a comprehensive evaluation of three interpolation-based methods: the local error matrix method, the bootstrap method, and the geostatistical method. We applied each of the three methods to three representative datasets with different spatial resolutions, spectral bands, and numbers of classes. We then derived the estimated classification confidence and the true classification confidence and compared the results using both exploratory data analysis (bi-histogram) and statistical analysis (Willmott's d and binned classification quality). The results indicate that the three interpolation methods provide some interesting insights into various aspects of estimating per-pixel classification confidence. Unfortunately, interpolation assumes that classification confidence is smooth across space, which is usually not true in practice. In other words, interpolation-based methods have limited practical use.
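Willmott's d, one of the agreement statistics used above, is straightforward to compute; a minimal sketch in Python (array names are illustrative):

```python
import numpy as np

def willmott_d(estimated, true):
    """Willmott's index of agreement between estimated and true per-pixel
    classification confidence (1 = perfect agreement)."""
    o_bar = true.mean()
    num = np.sum((estimated - true) ** 2)
    den = np.sum((np.abs(estimated - o_bar) + np.abs(true - o_bar)) ** 2)
    return 1.0 - num / den
```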
Subject: Mathematics & Computer Science, Other Keywords: aerial scene classification; remote-sensing image classification; few-shot learning; meta-learning
Online: 15 December 2020 (13:21:49 CET)
CNN-based methods have dominated the field of aerial scene classification for the past few years. While achieving remarkable success, CNN-based methods suffer from excessive parameters and notoriously rely on large amounts of training data. In this work, we introduce few-shot learning to the aerial scene classification problem. Few-shot learning aims to learn a model on a base set that can quickly adapt to unseen categories in a novel set, using only a few labeled samples. To this end, we propose a meta-learning method for few-shot classification of aerial scene images. First, we train a feature extractor on all base categories to learn a representation of the inputs. Then, in the meta-training stage, the classifier is optimized in the metric space by cosine distance with a learnable scale parameter. Finally, in the meta-testing stage, the query sample in the unseen category is predicted by the adapted classifier given a few support samples. We conduct extensive experiments on two challenging datasets: NWPU-RESISC45 and RSD46-WHU. The experimental results show that our method yields state-of-the-art performance. Furthermore, several ablation experiments are conducted to investigate the effects of dataset scale, the impact of different metrics, and the number of support shots; the experimental results confirm that our model is especially effective in few-shot settings.
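A minimal sketch of the metric-space classification step described above, with class prototypes and scaled cosine similarity (here the scale is a fixed constant rather than the learnable parameter of the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_classify(query_feats, support_feats, support_labels, scale=10.0):
    """Few-shot classification in metric space: class prototypes are the
    mean support embeddings; logits are scaled cosine similarities."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])
    sims = l2_normalize(query_feats) @ l2_normalize(protos).T
    return classes[(scale * sims).argmax(axis=1)]
```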
ARTICLE | doi:10.20944/preprints202008.0139.v1
Subject: Engineering, Industrial & Manufacturing Engineering Keywords: copper price; prediction; support vector regression
Online: 6 August 2020 (08:26:35 CEST)
Predicting copper price is essential for making decisions that can affect companies and governments dependent on the copper mining industry. Copper prices follow a time series that is non-linear and non-stationary, with periods that change as a result of potential growth, cyclical fluctuation and errors. Sometimes the trend and cyclical components together are referred to as a trend-cycle, and to make predictions it is necessary to consider the different characteristics of the trend-cycle. In this paper, we study a copper price prediction method using Support Vector Regression. This work explores the potential of Support Vector Regression with external recurrences to make predictions 5, 10, 15, 20 and 30 days into the future for the copper closing price at the London Metal Exchange. The best model for each forecast interval is selected using a grid search and balanced cross-validation. In experiments on real datasets, our results indicate that the parameters (C, ε, γ) of the Support Vector Regression model do not differ between the different prediction intervals. Additionally, the number of preceding values used to make the estimates does not vary with the predicted interval. Results show that the support vector regression model has a lower prediction error and is more robust. Our results show that the presented model is able to predict copper price volatility close to reality, with an RMSE of 2.2% or less for prediction periods of 5 and 10 days.
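As a hedged sketch of this kind of pipeline, the following Python fragment builds lagged inputs and grid-searches (C, epsilon, gamma) for an RBF SVR; the file name and lag/horizon values are hypothetical, and time-series cross-validation stands in here for the paper's balanced cross-validation:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def make_lagged(prices, n_lags=10, horizon=5):
    """Lagged closing prices as inputs; horizon is the forecast interval."""
    X = np.stack([prices[i:i + n_lags]
                  for i in range(len(prices) - n_lags - horizon + 1)])
    y = prices[n_lags + horizon - 1:]
    return X, y

prices = np.loadtxt("lme_copper_close.csv")        # hypothetical data file
X, y = make_lagged(prices, n_lags=10, horizon=5)

grid = GridSearchCV(
    SVR(kernel="rbf"),
    {"C": [1, 10, 100], "epsilon": [0.01, 0.1], "gamma": [0.01, 0.1, 1.0]},
    cv=TimeSeriesSplit(n_splits=5),
)
grid.fit(X, y)
print(grid.best_params_)
```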
ARTICLE | doi:10.20944/preprints202008.0058.v1
Online: 3 August 2020 (00:37:42 CEST)
The house is the haven that keeps people from natural and human threats; it gives them trust, safety, and stability. Housing is one of the most basic human needs, and providing it has become a key function of cities and one of the most important aspects of interest to urban researchers, who consider a wide range of architectural, social, and economic indicators. This study aims to provide an overall picture of the residential functions of Rwandz, using a collection of parameters together with GIS and statistical techniques, to help establish plans and future projects that improve the growth of this city and of other towns and cities in the area. The study found that the old parts of Rwandz, located in the core, differ in many respects from the relatively newer outer parts: in general, the core is more densely populated, with larger family sizes, more illiteracy and unemployment, lower incomes, and older and smaller houses than the outer parts. The study also tested the correlation coefficients between the criteria and found some strong statistical relationships that reflect real-life properties of the residential function. Lastly, the study designed a regression model to predict the main residential function criteria.
ARTICLE | doi:10.20944/preprints201902.0135.v1
Online: 14 February 2019 (11:30:03 CET)
Based on a rich data set of recoveries donated by a debt collection business, recovery rates for non-performing loans taken from a single European country are modelled using linear regression, linear regression with Lasso, beta regression and inflated beta regression. We also propose a two-stage model: beta mixture model combined with a logistic regression model. The proposed model allows us to model the multimodal distribution we find for these recovery rates. All models are built using loan characteristics, default data and collections data prior to purchase by the debt collection business. The intended use of the models is to estimate future recovery rates for improved risk assessment, capital requirement calculations and bad debt management. They are compared using a range of quantitative performance measures under K-fold cross validation. Among all the models, we find that the proposed two-stage beta mixture model performs best.
ARTICLE | doi:10.20944/preprints201809.0499.v1
Online: 26 September 2018 (05:23:02 CEST)
Understanding the influences of multiple stressors across the landscape on aquatic biota is important for conservation, as it allows for an understanding of spatial patterns and informs stakeholders of significant conservation value. Data exist for land use/land cover (LULC) and other physicochemical components of the landscape throughout the Appalachian region, yet biological data are sparse. This dearth of biological data relative to LULC and physicochemical data creates difficulties in making informed management and conservation decisions across large landscapes. At the HUC12 watershed scale, we sought to create a single score for both abiotic and biotic values throughout the central and southern Appalachian region. We used boosted regression trees (BRT) to model biological responses (fish and aquatic macroinvertebrate variables) to abiotic variables. Variance explained by BRT models ranged from 62-94%. We categorized predictor and response variables into themes and targets, respectively, to better understand large-scale patterns on the landscape that influence the biological condition of streams. We combined predicted values for a suite of response variables from BRT models to create a single watershed score for aquatic macroinvertebrates and fish. Regional models were developed for fish, but we were unable to develop regional models for aquatic macroinvertebrates due to the low number of sample sites. There was strong correlation between regional and global watershed scores for fish models, but not between fish and aquatic macroinvertebrate models. Use of such multimetric scores can inform managers, NGOs, and private landowners regarding land use practices, thereby contributing to large-scale landscape conservation efforts.
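A minimal sketch of a boosted-regression-tree fit of a single biological response, using scikit-learn's gradient boosting as a stand-in for the BRT implementation; file and column names are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("huc12_watersheds.csv")   # hypothetical table of HUC12 units
X = df.drop(columns=["fish_ibi"])          # LULC and physicochemical predictors
y = df["fish_ibi"]                         # one biological response variable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
brt = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01,
                                max_depth=3, subsample=0.75, random_state=0)
brt.fit(X_tr, y_tr)
print("variance explained (R^2):", brt.score(X_te, y_te))
```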
ARTICLE | doi:10.20944/preprints201712.0032.v1
Subject: Engineering, Energy & Fuel Technology Keywords: statistics; uncertainty; regression; sampling; outlier; probabilistic
Online: 6 December 2017 (06:36:02 CET)
Energy Measurement and Verification (M&V) aims to make inferences about the savings achieved in energy projects, given the data and other information at hand. Traditionally, a frequentist approach has been used to quantify these savings and their associated uncertainties. We demonstrate that the Bayesian paradigm is an intuitive, coherent, and powerful alternative framework within which M&V can be done. Its advantages and limitations are discussed, and two examples from the industry-standard International Performance Measurement and Verification Protocol (IPMVP) are solved using the framework. Bayesian analysis is shown to describe the problem more thoroughly and yield richer information and uncertainty quantification than the standard methods while not sacrificing model simplicity. We also show that Bayesian methods can be more robust to outliers. Bayesian alternatives to standard M&V methods are listed, and examples from literature are cited.
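For intuition, here is a minimal conjugate Bayesian linear regression in Python on simulated baseline energy data; the prior, noise level and data are illustrative assumptions, not the IPMVP examples themselves:

```python
import numpy as np

def bayesian_linear_regression(X, y, sigma=1.0, tau=10.0):
    """Conjugate update for y = X w + eps, eps ~ N(0, sigma^2),
    prior w ~ N(0, tau^2 I). Returns posterior mean and covariance of w."""
    d = X.shape[1]
    S = np.linalg.inv(X.T @ X / sigma**2 + np.eye(d) / tau**2)
    m = S @ X.T @ y / sigma**2
    return m, S

# Simulated baseline-period energy vs. outdoor temperature.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 30, 200)
energy = 50 + 2.5 * temp + rng.normal(0, 3, 200)
X = np.column_stack([np.ones_like(temp), temp])
m, S = bayesian_linear_regression(X, energy, sigma=3.0)
print("posterior mean:", m)
print("posterior sd:", np.sqrt(np.diag(S)))   # full uncertainty, not just a point estimate
```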
COMMUNICATION | doi:10.20944/preprints202111.0549.v1
Subject: Keywords: Principal Component Regression, Partial Least Squares, Orthogonal Partial Least Squares, multivariate regression, hypothesis generation, Parkinson’s disease
Online: 29 November 2021 (15:42:03 CET)
In the current era of ‘big data’, scientists are able to quickly amass enormous amounts of data in a limited number of experiments. The investigators then try to hypothesize about the root cause based on the observed trends for the predictors and the response variable. This involves identifying the discriminatory predictors that are most responsible for explaining variation in the response variable. In the current work, we investigated three related multivariate techniques: Principal Component Regression (PCR), Partial Least Squares or Projections to Latent Structures (PLS), and Orthogonal Partial Least Squares (OPLS). To perform a comparative analysis, we used a publicly available dataset of Parkinson's disease patients. We first performed the analysis using a cross-validated number of principal components for the aforementioned techniques. Our results demonstrated that PLS and OPLS were better suited than PCR for identifying the discriminatory predictors. Since the X data did not exhibit a strong correlation, we also performed Multiple Linear Regression (MLR) on the dataset. A comparison of the top five discriminatory predictors identified by the four techniques showed a substantial overlap between the results obtained by PLS, OPLS, and MLR, and these three techniques exhibited a significant divergence from the variables identified by PCR. A further investigation of the data revealed that PCR could identify the discriminatory variables successfully if the number of principal components in the regression model was increased. In summary, we recommend using PLS or OPLS for hypothesis generation, and systemizing the selection process for principal components when using PCR.
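A minimal sketch contrasting PCR (components chosen to explain variance in X only) with PLS (components chosen to covary with the response), on simulated data; the component counts and data are illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

# Simulated stand-in data: y depends on only two of twenty predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# PCR: PCA components are chosen without looking at y.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
# PLS: latent components are chosen to covary with y.
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=5))

for name, model in [("PCR", pcr), ("PLS", pls)]:
    print(name, cross_val_score(model, X, y, cv=10).mean())
```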
ARTICLE | doi:10.20944/preprints202106.0613.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: LRTI; URTI; Asthma; Cough Classification; Respiratory Pathology Classification; MFCCs; BiLSTM; Deep Neural Networks
Online: 25 June 2021 (09:45:00 CEST)
Intelligent systems are transforming the world, as well as our healthcare system. We propose a deep learning-based cough sound classification model that can distinguish between children with healthy coughs and pathological coughs caused by asthma, upper respiratory tract infection (URTI), or lower respiratory tract infection (LRTI). In order to train a deep neural network model, we collected a new dataset of cough sounds, labelled with the clinician's diagnosis. The chosen model is a bidirectional long short-term memory network (BiLSTM) based on Mel Frequency Cepstral Coefficient (MFCC) features. When trained to classify two classes of coughs, healthy or pathological (in general or belonging to a specific respiratory pathology), the resulting model reaches an accuracy exceeding 84% against the labels provided by the physicians' diagnoses. In order to classify a subject's respiratory pathology condition, the results of multiple cough epochs per subject were combined; the resulting prediction accuracy exceeds 91% for all three respiratory pathologies. However, when the model is trained to discriminate among the four classes of coughs, overall accuracy drops: one class of pathological cough is often misclassified as another. Nevertheless, if one considers a healthy cough classified as healthy and a pathological cough classified as having some kind of pathology, then the overall accuracy of the four-class model is above 84%. A longitudinal study of the MFCC feature space, comparing pathological and recovered coughs collected from the same subjects, revealed that pathological coughs, irrespective of the underlying condition, occupy the same feature space, making them harder to differentiate using MFCC features alone.
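A hedged sketch of this kind of pipeline in Python, pairing librosa MFCC extraction with a small Keras BiLSTM; the sampling rate, frame cap, and layer sizes are assumptions, not the paper's exact configuration:

```python
import librosa
import numpy as np
import tensorflow as tf

def cough_to_mfcc(path, sr=16000, n_mfcc=13, max_frames=200):
    """Load a cough recording and return a fixed-size (frames, n_mfcc) matrix."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    out = np.zeros((max_frames, n_mfcc), dtype=np.float32)
    out[:min(max_frames, len(mfcc))] = mfcc[:max_frames]
    return out

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200, 13)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # healthy vs pathological
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```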
ARTICLE | doi:10.20944/preprints202009.0566.v1
Subject: Engineering, Automotive Engineering Keywords: transportation mode classification; vulnerable road users; recurrence plots; computer vision; image classification system
Online: 24 September 2020 (04:41:32 CEST)
As the Autonomous Vehicle (AV) industry rapidly advances, classification of non-motorized (vulnerable) road users (VRUs) becomes essential to ensure their safety and the smooth operation of road applications. Typical approaches to classifying non-motorized road users usually require substantial training time and ignore the temporal evolution and behavior of the signal. In this research effort, we attempt to detect VRUs with high accuracy by proposing a novel framework that uses Deep Transfer Learning, which saves training time and cost, to classify images constructed via Recurrence Quantification Analysis (RQA) that reflect the temporal dynamics and behavior of the signal. Recurrence Plots (RPs) were constructed from low-power smartphone sensors without using GPS data. The resulting RPs were used as inputs for different pre-trained Convolutional Neural Network (CNN) classifiers: 227×227 images for AlexNet and SqueezeNet, and 224×224 images for VGG16 and VGG19. Results show that the classification accuracy of Convolutional Neural Network Transfer Learning (CNN-TL) reaches 98.70%, 98.62%, 98.71%, and 98.71% for AlexNet, SqueezeNet, VGG16, and VGG19, respectively. The results of the proposed framework outperform others in the literature (to the best of our knowledge) and show that using CNN-TL is promising for VRU classification. Because of its relative straightforwardness, ability to be generalized and transferred, and potentially high accuracy, we anticipate that this framework might be able to solve various problems related to signal classification.
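For readers unfamiliar with recurrence plots, a minimal Python sketch of their construction from a 1-D sensor signal (the embedding parameters and threshold rule are illustrative); the binary matrix would then be resized to 227×227 or 224×224 for the pre-trained CNNs:

```python
import numpy as np

def recurrence_plot(signal, eps=None, embed=3, delay=1):
    """Binary recurrence plot of a 1-D signal after time-delay embedding;
    eps defaults to 10% of the maximum pairwise distance."""
    n = len(signal) - (embed - 1) * delay
    states = np.stack([signal[i:i + n]
                       for i in range(0, embed * delay, delay)], axis=1)
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    if eps is None:
        eps = 0.1 * dists.max()
    return (dists <= eps).astype(np.uint8)
```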
ARTICLE | doi:10.20944/preprints201911.0218.v1
Subject: Earth Sciences, Environmental Sciences Keywords: Landsat; Google Earth; water index; unsupervised image classification; supervised image classification; Kappa coefficient
Online: 19 November 2019 (03:10:17 CET)
To address three important issues related to the extraction of water features from Landsat imagery, i.e., the selection of water indexes, the choice of classification algorithms, and the collection of ground truth data for accuracy assessment, this study applied four sets (ultra-blue, blue, green, and red light based) of water indexes (NDWI, MNDWI, MNDWI2, AWEIns, and AWEIs) combined with three types of image classification methods (zero-water-index threshold, Otsu, and kNN) to 24 selected lakes across the globe to extract water features from Landsat-8 OLI imagery. The 1440 (4×5×3×24) image classification results were compared with water features extracted from high-resolution Google Earth images with the same (or ±1 day) acquisition dates by computing Kappa coefficients. Results show that the kNN method is better than the Otsu method, and the Otsu method is better than the zero-water-index threshold method. If computational cost is not an issue, the kNN method combined with the ultra-blue-light-based AWEIns is the best method for extracting water features from Landsat imagery because it produced the highest Kappa coefficients. If computational cost is taken into account, the Otsu method is a good choice. AWEIns and AWEIs are better than NDWI, MNDWI and MNDWI2. AWEIns works better than AWEIs under the Otsu method, and the average rank of image classification accuracy from high to low is the ultra-blue, blue, green, and red light-based AWEIns.
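As an illustration of the simplest pieces of this comparison, here is a minimal Python sketch of the green-light NDWI and the two thresholding rules; the Landsat-8 band assignment in the comment is the standard one, and the variable names are illustrative:

```python
import numpy as np
from skimage.filters import threshold_otsu

def ndwi(green, nir, eps=1e-9):
    """McFeeters NDWI = (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir + eps)

def water_mask(index, method="otsu"):
    if method == "zero":                     # zero-water-index threshold
        return index > 0
    return index > threshold_otsu(index)     # Otsu-selected threshold

# green, nir: Landsat-8 OLI reflectance arrays (bands 3 and 5), not shown here.
```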
ARTICLE | doi:10.20944/preprints201804.0371.v1
Subject: Engineering, Civil Engineering Keywords: economic regions; regional classification; classification methodology; construction industry; cluster analysis; accidents in construction
Online: 28 April 2018 (12:14:29 CEST)
The article presents a methodology for classifying economic regions with regard to selected factors that characterize a region, such as: the economic structure of the region (i.e., the share of individual sectors in the economy); employment; the dynamics of development of individual sectors, expressed as an increase or decrease in production value; population density; and the level of occupational safety. Cluster analysis, a method of multidimensional statistical analysis available in Statistica software, was used to solve the task. The proposed methodology was used to group Polish voivodships with regard to the speed of economic development and occupational safety in the construction industry. Data published by the Central Statistical Office were used for this purpose, including the value of construction and assembly production, the number of people employed in the construction industry, the population of each region and the number of people injured in occupational accidents.
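A minimal Python sketch of the same kind of cluster analysis, using standardized regional indicators and Ward's method (scipy here stands in for the Statistica implementation; the file name and cluster count are hypothetical):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

df = pd.read_csv("voivodships.csv", index_col=0)   # hypothetical indicator table

# Standardize indicators so that no single unit dominates the distances,
# then cluster hierarchically with Ward's linkage.
Z = linkage(df.apply(zscore), method="ward")
df["cluster"] = fcluster(Z, t=4, criterion="maxclust")
print(df["cluster"].sort_values())
```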
ARTICLE | doi:10.20944/preprints201907.0351.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: evaporation; meteorological parameters; Gaussian process regression; support vector regression; machine learning modeling; hydrology; prediction; data science; hydroinformatics
Online: 31 July 2019 (10:58:29 CEST)
Evaporation is one of the main processes in the hydrological cycle, and it is one of the most critical factors in agricultural, hydrological, and meteorological studies. Due to the interactions of multiple climatic factors, evaporation is a complex and nonlinear phenomenon; therefore, data-based methods can be used to estimate it precisely. In this regard, in the present study, Gaussian Process Regression (GPR), Nearest-Neighbor (IBK), Random Forest (RF) and Support Vector Regression (SVR) were used to estimate pan evaporation (PE) at meteorological stations in Golestan Province, Iran. For this purpose, meteorological data including PE, temperature (T), relative humidity (RH), wind speed (W) and sunny hours (S) were collected from the Gonbad-e Kavus, Gorgan and Bandar Torkman stations from 2011 through 2017. The accuracy of the studied methods was determined using the statistical indices of Root Mean Squared Error (RMSE), correlation coefficient (R) and Mean Absolute Error (MAE). Furthermore, Taylor charts were utilized to evaluate the accuracy of the models. The outcomes indicate that, in their optimum states for the Gonbad-e Kavus, Gorgan and Bandar Torkman stations respectively, GPR (with error values of 1.521, 1.244, and 1.254), IBK (1.991, 1.775, and 1.577), RF (1.614, 1.337, and 1.316), and SVR (1.55, 1.262, and 1.275) all performed appropriately in estimating PE. GPR with input parameters T, W and S for the Gonbad-e Kavus station, and GPR with input parameters T, RH, W and S for the Gorgan and Bandar Torkman stations, had the most accurate performance and are proposed for precise estimation of PE. Given the high rate of evaporation in Iran and the lack of measurement instruments, the findings of the current study indicate that PE values can be estimated accurately from a few easily measured meteorological parameters.
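A minimal sketch of a Gaussian Process Regression fit of pan evaporation from meteorological inputs, on simulated data; the kernel choice and train/test split are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_absolute_error

# Simulated daily records of [T, RH, W, S]; y mimics pan evaporation.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 + 0.8 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=300)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X[:250], y[:250])
pred, std = gpr.predict(X[250:], return_std=True)  # mean and uncertainty
print("MAE:", mean_absolute_error(y[250:], pred))
```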
ARTICLE | doi:10.20944/preprints201608.0202.v2
Subject: Earth Sciences, Environmental Sciences Keywords: HR satellite remote sensing; urban fabric vulnerability; UHI & heat waves; landsat & MODIS sensors; LST & urban heating; segmentation & objects classification; data mining; feature extraction & selection; stepwise regression & model calibration
Online: 26 October 2021 (13:11:23 CEST)
Densely urbanized areas with a low percentage of green vegetation are highly exposed to heat waves (HW), which are nowadays increasing in frequency and intensity even in mid-latitude regions due to ongoing climate change (CC). Their negative effects may combine with those of the Urban Heat Island (UHI), a local phenomenon whereby air temperatures in the compact built-up cores of towns rise more than those in the surrounding rural areas, with significant impacts on the quality of the urban environment, on citizens' health, and on energy consumption and transport, as occurred in the summer of 2003 in France and central-northern Italy. In this context, this work aims at designing and developing a methodology, based on medium-high-resolution satellite Earth observation (EO) and recent GIS techniques, for the extensive characterization of the urban fabric's response to these temperature-related climatic impacts, within the general framework of supporting local and national strategies and policies for adaptation to CC. Owing to its extent and variety of built-up typologies, the municipality of Rome was selected as the test area for developing and validating the methodology. We began with photointerpretation of detailed-scale cartography (CTR 1:5000) on a reference area consisting of a transect of about 5x20 km, extending from the downtown to the suburbs and including all the built-up classes of interest. The reference built-up vulnerability classes found inside the transect were then exploited as training areas to classify the entire territory of the Rome municipality. To this end, high-resolution (HR) multispectral satellite EO data from the Landsat sensors were used within a purpose-developed supervised classification procedure based on data mining and object-classification techniques. The classification results were then exploited to implement a calibration method, based on a typical UHI temperature distribution derived from MODIS LST (Land Surface Temperature) data for the summer of 2003, to obtain an analytical expression of the vulnerability model previously introduced on a semi-empirical basis.
ARTICLE | doi:10.20944/preprints202208.0426.v1
Online: 25 August 2022 (07:22:48 CEST)
With the ever-increasing popularity of unmanned aerial vehicles and other platforms providing dense point clouds, filters for the identification of ground points in such dense clouds are needed. Many filters have been proposed and are widely used, usually based on the determination of an original surface approximation and subsequent identification of points within a predefined distance from that surface. We present a new filter, the Multi-view and Shift Rasterization (MVSR) algorithm, based on a different principle, i.e., on the identification of just the lowest points in individual grid cells, shifting the grid along both planar axes, and subsequently tilting the entire grid. The principle is presented in detail and compared both visually and numerically to other commonly used ground filters (PMF, SMRF, CSF, ATIN) on three sites with different ruggedness and vegetation density. Visually, the MVSR filter showed the smoothest and thinnest ground profiles, with ATIN the only filter performing comparably. The same was confirmed when comparing ground filtered by the other filters with the MVSR-based surface. The goodness of fit with the original cloud is demonstrated by the root mean square deviations (RMSD) of the points from the original cloud found below the MVSR-generated surface (ranging, depending on the site, between 0.6 and 2.5 cm). The MVSR filter performed outstandingly at all sites, identifying the ground points with great accuracy while filtering out the maximum of vegetation/above-ground points. The filter dilutes the cloud somewhat; in such dense point clouds, however, this can be perceived as a benefit rather than a disadvantage.
ARTICLE | doi:10.20944/preprints202206.0300.v1
Online: 22 June 2022 (03:37:45 CEST)
With the ever-increasing popularity of unmanned aerial vehicles and other platforms providing dense point clouds, universal filters for accurate identification of ground points in such dense clouds are needed. Many filters have been proposed and are widely used, usually based on the determination of an original surface approximation and subsequent identification of points within a predefined distance from that surface. In this paper, we present a new filter. This Multi-view and Shift Rasterization (MVSR) algorithm is based on an entirely different principle, i.e., on the identification of just the lowest points in individual grid cells, shifting the grid along both planar axes, and subsequently tilting the entire grid; after each of these steps, one lowest point per cell is detected. The principle is presented in detail and compared both visually and numerically to other commonly used ground filters (PMF, SMRF, CSF, ATIN) on three sites with different ruggedness and vegetation density. Visually, the MVSR filter showed the smoothest and thinnest ground profiles, with ATIN the only filter performing comparably (although its profiles were somewhat thicker and not as complete as the MVSR-acquired ground). The same was confirmed when comparing ground filtered by the other filters with the MVSR-based surface. The goodness of fit with the original cloud is demonstrated by the root mean square deviations (RMSD) of the points from the original cloud found below the MVSR-generated surface (ranging, depending on the site, between 0.6 and 2.5 cm). ATIN again performed closest to MVSR, with RMSDs of ground-filtered points found above the MVSR-based surface at individual sites ranging between 4.5 and 7.4 cm. The remaining filters performed comparably in the simplest, flat area but poorly at rugged and heavily vegetated sites, with RMSDs above the MVSR surface at such sites ranging from 21 to 95 cm. In conclusion, the novel filter presented in this paper performed outstandingly at all sites, identifying the ground points with great accuracy while filtering out the maximum of vegetation/above-ground points. The filter dilutes the cloud somewhat; in such dense point clouds, however, this can be perceived as a benefit rather than a disadvantage.
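A minimal Python sketch of the core MVSR idea, per-cell lowest-point selection under grid shifts (the grid-tilting step is omitted and the cell size and shift pattern are simplifying assumptions):

```python
import numpy as np

def lowest_points(cloud, cell=0.5, shift=(0.0, 0.0)):
    """Indices of the lowest point in each (shifted) grid cell.
    cloud: (n, 3) array of x, y, z coordinates."""
    ij = np.floor((cloud[:, :2] + shift) / cell).astype(np.int64)
    order = np.lexsort((cloud[:, 2], ij[:, 1], ij[:, 0]))  # by cell, then z
    ij_sorted = ij[order]
    first = np.ones(len(order), dtype=bool)
    first[1:] = np.any(ij_sorted[1:] != ij_sorted[:-1], axis=1)
    return order[first]

def candidate_ground(cloud, cell=0.5):
    """Union of per-cell minima over several half-cell grid shifts."""
    idx = set()
    for shift in [(0, 0), (cell / 2, 0), (0, cell / 2), (cell / 2, cell / 2)]:
        idx.update(lowest_points(cloud, cell, shift).tolist())
    return np.fromiter(idx, dtype=np.int64)
```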
CONCEPT PAPER | doi:10.20944/preprints202203.0069.v1
Online: 3 March 2022 (17:18:57 CET)
Genomics has put prokaryotic rank-based taxonomy on a solid phylogenetic foundation. However, most taxonomic ranks were set long before the advent of DNA sequencing and genomics. In this concept paper, we thus ask the simple yet profound question: Should prokaryotic classification schemes besides the current phylum-to-species ranks be explored, developed, and incorporated into scientific discourse? Could such alternative schemes provide better solutions to the basic need of science and society for which taxonomy was developed, namely, precise and meaningful identification? A neutral genome-similarity based framework is then described that could allow alternative classification schemes to be explored, compared, and translated into each other without having to choose only one as the gold standard. Classification schemes could thus continue to evolve and be selected according to their benefits and based on how well they fulfill the need for prokaryotic identification.
ARTICLE | doi:10.20944/preprints202009.0508.v2
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Bluetooth; RSSI; Classification; Machine Learning
Online: 12 November 2020 (08:31:41 CET)
This project focuses on using machine learning classification algorithms to determine whether two people are 6 feet apart or not. Two Raspberry Pis were used to simulate smartphones. RSSI values of the Bluetooth beacons transmitted between the Raspberry Pis were collected and recorded to train the classifiers. The Gaussian Support Vector Machine classifier yielded the highest testing accuracy of 79.67%, and the Decision Tree classifier yielded the highest AUC of 0.80.
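A minimal sketch of the classification step, with an RBF ("Gaussian") SVM on pre-computed RSSI features; the feature and label files are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = np.loadtxt("rssi_features.csv", delimiter=",")  # hypothetical feature file
y = np.loadtxt("labels.csv")                        # 1 = within 6 ft, 0 = beyond

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf")        # Gaussian (RBF) SVM as in the study
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```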
ARTICLE | doi:10.20944/preprints202003.0243.v1
Subject: Social Sciences, Geography Keywords: classification; morphology; geoprocessing; variography; geostatistics
Online: 15 March 2020 (12:50:52 CET)
This paper presents a script that classifies spatial patterns of residential urban growth using a morpho-structural approach. The script performs a combination of variography analysis and morphological closings over buildings possessing a residential function in 2002 and 2017 within a region of southern France named Centre-Var. The resulting bounding regions then allow new residential buildings to be classified into different categories according to their degree of clustering/scattering and their location relative to existing urban areas. Preliminary results show that this protocol provides useful insights into the contribution of each new residential building to different patterns of urban growth (clustered infill, scattered infill, clustered edge-expansion, scattered edge-expansion, clustered leapfrog and scattered leapfrog). Open access to the script and to the test region data is provided.
ARTICLE | doi:10.20944/preprints202002.0042.v1
Subject: Earth Sciences, Geoinformatics Keywords: building footprint; LiDAR; classification; segmentation
Online: 4 February 2020 (10:27:59 CET)
Topographic mapping using stereo plotting is not effective because it is time-consuming and labor-intensive. This research was therefore conducted to find an effective way to extract building footprints in order to accelerate mapping. The building extraction method comprises four steps: ground/non-ground filtering, building classification, segmentation, and building extraction. Non-ground points from the filtering process were classified as buildings using an algorithm based on multi-scale local dimensionality that separates points at the maximum-separability plane. Segment-growing segmentation was used to separate individual buildings, so that edge detection could be conducted for each segment to create each building's boundary. Lastly, building extraction was conducted in three steps: edge point detection, building delineation, and building regularization. With 10 samples and a step of 0.5, classification resulted in quality and miss factors of 0.597 and 0.524, respectively. The quality was improved by the segmentation process to 0.604, while the miss factor worsened to 0.561. Meanwhile, the average shape index of the extracted buildings differed by 0.02, and the error rate was 30% for line segment comparison. Regarding positional accuracy using centroid accuracy assessment, this method produced an RMSE of 1.169 meters.
ARTICLE | doi:10.20944/preprints202001.0387.v1
Subject: Engineering, Energy & Fuel Technology Keywords: forecasting; clustering; energy systems; classification
Online: 31 January 2020 (13:28:01 CET)
This paper proposes an ARIMA approach to battery health forecasting whose accuracy is improved by k-Shape-based clustered predictors. Health prediction of the battery pack is an important function of a battery management system in data centers. Accurate forecasting of battery life turns out to be very difficult in real life without failure data to train a good forecasting model. The conventional ARIMA model is compared with total and clustered predictors for battery health forecasting. Results show that the forecasting accuracy of the ARIMA model is significantly improved by utilizing the results of the clustered predictors for 40 batteries in a real data center. One year of actual historical data for 40 batteries in a large-scale data center is presented to validate the effectiveness of the proposed methodology.
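A minimal sketch of the per-battery ARIMA step with statsmodels; the model order, horizon, and data file are illustrative, and the k-Shape clustering of similar batteries that seeds the clustered predictors is omitted:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# One year of a battery's health index, assumed to be a pandas Series
# indexed by date (hypothetical data file).
health = pd.read_csv("battery_health.csv", index_col=0,
                     parse_dates=True).squeeze()

model = ARIMA(health, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=30)   # 30-day-ahead health forecast
print(forecast.tail())
```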
TECHNICAL NOTE | doi:10.20944/preprints201806.0343.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Lesion classification; statistical features; mammograms
Online: 21 June 2018 (15:46:03 CEST)
Breast cancer is the second most fatal cancer among women. Automatic classification of breast cancer lesions in mammograms is a challenging task due to the irregularity and complexity of the location, size, shape, and texture of these lesions. Intensity dissimilarity has been found between breast cancer tissues and normal tissues when multi-spectral anatomical mammographic screening scans have been performed. In this work, two approaches have been evaluated for classifying breast tumor lesions: the first uses Gabor wavelet features and the second statistical features. Subsequently, support vector machine, multilayer perceptron and KNN classifiers have been used within a computer-based method for breast tumor classification.
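A minimal sketch of Gabor-based feature extraction from a lesion region of interest, using scikit-image; the frequencies and orientation count are illustrative choices, not the paper's exact filter bank:

```python
import numpy as np
from skimage.filters import gabor

def gabor_features(image, frequencies=(0.1, 0.2, 0.3), n_orient=4):
    """Mean and variance of Gabor filter response magnitudes over an ROI."""
    feats = []
    for f in frequencies:
        for theta in np.linspace(0, np.pi, n_orient, endpoint=False):
            real, imag = gabor(image, frequency=f, theta=theta)
            mag = np.hypot(real, imag)
            feats += [mag.mean(), mag.var()]
    return np.array(feats)   # feeds an SVM / MLP / KNN classifier
```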
ARTICLE | doi:10.20944/preprints202205.0417.v1
Subject: Earth Sciences, Geoinformatics Keywords: COVID-19; Eswatini; risk mapping; Poisson regression
Online: 31 May 2022 (11:04:12 CEST)
COVID-19 national spikes have been reported at varying temporal scales as a result of differences in the driving factors. The factors affecting case load and mortality rates have varied between countries and regions. We investigated the association between various socio-economic, demographic and health variables and the spread of COVID-19 cases in Eswatini using the maximum likelihood estimation method for count data. A generalized Poisson regression (GPR) model was fitted to data comprising fifteen covariates to predict COVID-19 risk in Eswatini. The results showed that the key determinants of the spread of the disease included the proportion of elderly above 55 years at 98% (95% CI: 97%-99%) and the proportion of youth below 35 years at 0.08% (95% CI: 0.017%-38%), with a pseudo R-square of 0.72. However, in the early phase of the virus, when cases were fewer, results from the Poisson regression showed that household size, household density and poverty index were associated with COVID-19. We produced a risk map of predicted COVID-19 in Eswatini using the variables selected at the 5% significance level. The map could be used by the country to plan and prioritize health interventions against COVID-19. The identified high-risk areas may be investigated further to find the risk amplifiers and assess what could be done to prevent them.
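As a rough illustration, here is a standard Poisson GLM in statsmodels (an approximation: the paper's generalized Poisson model adds a dispersion parameter); the file and column names are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("eswatini_covariates.csv")     # hypothetical per-area table
X = sm.add_constant(df[["prop_over_55", "prop_under_35", "household_density"]])
y = df["covid_cases"]                            # counts per enumeration area

model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(model.summary())                           # coefficients, CIs, p-values

# Predicted counts over all areas can then be mapped as a risk surface.
risk = model.predict(X)
```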
ARTICLE | doi:10.20944/preprints202107.0139.v1
Subject: Social Sciences, Accounting Keywords: circularity; waste streams; circular approaches; regression equation
Online: 6 July 2021 (11:40:19 CEST)
In this paper, the authors identified key elements important for circularity. (1) Background: the primary goal of circularity is to eliminate waste and to promote the continual use of resources. The authors classify studies according to circular approaches, identify the main elements, and group them into categories important for circularity, from managing and reducing waste and recovering resources to the circularity of materials and general circularity-related topics, presenting scientific works dedicated to each of these categories. Several core elements from the first category are analyzed with the aim of investigating and connecting different waste streams, and a regression model is provided. (2) Methods: the authors used a dynamic regression model to identify relationships among variables and selected those that have an impact on the increase of biowaste. The research covered the 27 European Union countries over the period between 2019 and 2020. (3) Conclusions: the authors found that the recycling rate of waste electrical equipment in one year has an impact on the increase in recycled biowaste the following year. This is explained by the fact that non-metallic parts of electronic equipment are used as biowaste for fuel production, and separating the components of electrical equipment takes some time; on average, the effect appears within a one-year period.
ARTICLE | doi:10.20944/preprints202012.0321.v1
Subject: Earth Sciences, Atmospheric Science Keywords: quantile regression; groundwater; environmental; multivariate; metals; health
Online: 14 December 2020 (10:13:09 CET)
One of the most important defining characteristics of groundwater quality is pH, as it fundamentally controls the amount and chemical form of many organic and inorganic solutes in groundwater. Groundwater data are frequently characterized by a wide degree of variability in the factors that may influence pH distribution. For this reason, it is challenging to link the spatio-temporal dynamics of pH to a single environmental factor using the ordinary least squares regression of the conditional mean. In this study, quantile regression was used to estimate the response of pH to nine environmental factors (As, Cd, Fe, Mn, Pb, turbidity, electrical conductivity, total dissolved solids and nitrates). Results of the 25%, 50% and 75% quantile regressions and ordinary least squares (OLS) regression were compared. The standard regression of the conditional mean (OLS) underestimated the rates of change of pH due to the selected factors in comparison with the regression quantiles. The effect of arsenic increased for sampling locations with higher pH values (higher quantiles), as did the influence of Pb and Mn; however, the effects of Cd and Fe decreased for sampling locations in higher quantiles. These detected heterogeneities would have been missed had this study focused exclusively on the conditional mean of the pH values. Consequently, quantile regression provides a more comprehensive account of possible spatio-temporal relationships between environmental covariates in groundwater. This study is one of the first to apply this technique to groundwater systems in sub-Saharan Africa. The approach is useful and interesting and has broad application to other mining environments, especially in tropical low-income countries where climatic conditions can drive rapid cycling or transformation of pollutants, and in geopolitical contexts where regulatory, monitoring and management capacities are weak and where mining pollution of groundwater largely occurs.
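A minimal sketch of the OLS-versus-quantile comparison with statsmodels; the file and column names are hypothetical stand-ins for the nine covariates:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("groundwater.csv")   # hypothetical monitoring data
formula = "pH ~ As + Cd + Fe + Mn + Pb + turbidity + EC + TDS + nitrates"

# Conditional mean (OLS) versus conditional quantiles of pH.
ols = smf.ols(formula, data=df).fit()
print("OLS As slope:", ols.params["As"])
for q in (0.25, 0.50, 0.75):
    qr = smf.quantreg(formula, data=df).fit(q=q)
    print(q, qr.params["As"])   # the As slope can vary across quantiles
```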
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Crime prediction; Ensemble Learning; Machine Learning; Regression
Online: 14 September 2020 (00:53:30 CEST)
While the use of crime data has been widely advocated in the literature, its availability is often limited to large urban cities, and isolated databases tend not to allow for spatial comparisons. This paper presents an efficient machine learning framework capable of predicting spatial crime occurrences, without using past crime as a predictor, and at a relatively high resolution: the U.S. Census Block Group level. The proposed framework is based on an in-depth multidisciplinary literature review allowing the selection of 188 best-fit crime predictors from socio-economic, demographic, spatial, and environmental data. Such data are published periodically for the entire United States. The selection of the appropriate predictive model was made through a comparative study of different machine learning families of algorithms, including generalized linear models, deep learning, and ensemble learning. The gradient boosting model was found to yield the most accurate predictions for violent crimes, property crimes, motor vehicle thefts, vandalism, and the total count of crimes. Extensive experiments on real-world datasets of crimes reported in 11 U.S. cities demonstrated that the proposed framework achieves accuracies of 73% and 77% when predicting property crimes and violent crimes, respectively.
ARTICLE | doi:10.20944/preprints202208.0495.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: chronic venous disease; deep leaning; data mining; Resnet50; DeiT; automatic classification; automatic CEAP classification
Online: 29 August 2022 (12:46:56 CEST)
Chronic venous disease (CVD) occurs in a substantial proportion of the world's population. While the onset of CVD may look like a cosmetic defect, over time it can develop into serious problems requiring surgical intervention. The aim of this work is to use deep learning (DL) methods for automatic classification of the stage of CVD, for self-diagnosis by a patient, using an image of the patient's legs. The leg images required for the DL algorithms were obtained using Internet data mining. For image preprocessing, the binary classification problem "legs - no legs" was solved based on Resnet50 with an accuracy of 0.998. Applying this filter made it possible to collect a dataset of 11,118 good-quality leg images with various stages of CVD. For the classification of the stages of CVD according to the CEAP classification, a multi-class problem was posed and solved using two neural networks with completely different architectures: Resnet50 and DeiT. The model based on DeiT, without any tuning, shows better results than the model based on Resnet50 (precision = 0.770 for DeiT vs. 0.615 for Resnet50). To demonstrate the results of the work, a Telegram bot was developed in which the fully functioning DL algorithms are implemented. This bot allows evaluating the condition of the patient's legs with fairly good accuracy for the CVD classification.
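A hedged sketch of a transfer-learning setup of the kind described, putting a new classification head on a pretrained Resnet50 with torchvision; the assumption of 7 output classes follows the CEAP clinical classes C0-C6, and the freezing strategy is illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained backbone with a fresh head for the CVD stage classes.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Linear(backbone.fc.in_features, 7)  # 7 CEAP classes assumed

# Optionally freeze everything except the new head for a first training pass.
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc")

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, backbone.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```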
ARTICLE | doi:10.20944/preprints201808.0056.v1
Subject: Medicine & Pharmacology, General Medical Research Keywords: heat-related illness; international classification; heat cramp; syncope; heat exhaustion; heat stroke; novel classification
Online: 3 August 2018 (03:51:27 CEST)
The Japanese Association for Acute Medicine Committee recently proposed a novel classification system for the severity of heat-related illnesses. The illnesses are simply classified into three stages based on symptoms and management or treatment. Stages I, II, and III broadly correspond to heat cramp and syncope, heat exhaustion, and heat stroke, respectively. Our objective was to examine whether this novel severity classification is useful in the diagnosis by healthcare professionals of patients with severe heat-related illness and organ failure. A nationwide surveillance study of heat-related illnesses was conducted between June 1 and September 30, 2012, at emergency departments in Japan. Among the 2130 patients who attended 102 emergency departments, the severity of their heat-related illness was recorded for 1799 patients, who were included in this study. In the patients with heat cramp and syncope or heat exhaustion (but not heat stroke), the blood test data (alanine aminotransferase, creatinine, blood urea nitrogen, and platelet counts) for those classified as stage III were significantly higher than those of patients classified as stage I or II. There were no deaths among the patients classified as stage I. This novel classification may avoid underestimating the severity of heat-related illness.
ARTICLE | doi:10.20944/preprints202201.0202.v1
Subject: Earth Sciences, Geoinformatics Keywords: crop detection; Sentinel 1; Sentinel 2; supervised classification; unsupervised classification; time series; agriculture; food security
Online: 14 January 2022 (11:18:59 CET)
Satellite crop detection technologies focus on detecting different types of crops in the field at an early stage, before harvesting. Crop detection is usually performed on a time series of satellite data by classification of the fields of interest. Data obtained from Remote Sensing (RS) are currently used to identify the types of agricultural crops, and modern AI methods are desirable in the post-processing part. In this work, Sentinel-1 and Sentinel-2 time series data were used due to their periodic availability. Our focus was to develop a methodology for classifying time series of Sentinel-2 and Sentinel-1 data, to compare how the accuracy of classification can be increased, and also how to guarantee the availability of data. We analysed the phenology of individual crops and, on the basis of this analysis, performed crop classification. The original crop classifications were made from Enhanced Vegetation Index (EVI) layers derived from Sentinel-2 time series data, to which Sentinel-1 data were then added. To increase accuracy, we also integrated parcel borders into the process and classified whole fields.
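The EVI layer mentioned above follows a standard formula; a minimal sketch (the Sentinel-2 band assignment in the comment assumes surface reflectance scaled to [0, 1]):

```python
import numpy as np

def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients
    (G=2.5, C1=6, C2=7.5, L=1); bands as reflectance in [0, 1]."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

# For Sentinel-2: NIR = B8, Red = B4, Blue = B2.
```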
ARTICLE | doi:10.20944/preprints202108.0516.v1
Subject: Social Sciences, Other Keywords: machine learning; time; naive bayes classification; recurrent neural networks, Twitter; social media data; automatic classification
Online: 27 August 2021 (11:23:50 CEST)
Machine learning (ML) is increasingly useful as data grow in volume and accessibility, as it can perform tasks (e.g. categorisation, decision making, anomaly detection) through experience and without explicit instruction, even when the data are too vast, complex, highly variable, or full of errors to be analysed in other ways. Thus, ML is well suited to natural language, images, and other complex and messy data available in large and growing volumes. Selecting an ML algorithm depends on many factors, as algorithms vary in the supervision needed, tolerable error levels, and ability to account for order or temporal context, among many other things. Importantly, ML methods for explicitly ordered or time-dependent data struggle with errors or data asymmetry. Most data are at least implicitly ordered, potentially allowing a hidden 'arrow of time' to affect non-temporal ML performance. This research explores the interaction of ML and implicit order by training two ML algorithms on Twitter data and then performing automatic classification tasks under conditions that balance the volume and complexity of the data. Results show that performance was affected, suggesting that researchers should carefully consider time when selecting appropriate ML algorithms, even when time is only implicitly included.
ARTICLE | doi:10.20944/preprints202010.0168.v2
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Genetic Programming; Evolutionary Computation; Machine Learning; Classification; Multiclass Classification; Feature Construction; Hyper-features; Spectral Indices
Online: 24 December 2020 (08:59:19 CET)
Genetic Programming (GP) is a powerful Machine Learning (ML) algorithm that can produce readable white-box models. Although successfully used for solving an array of problems in different scientific areas, GP is still not well known in Remote Sensing. The M3GP algorithm, a variant of the standard GP algorithm, performs Feature Construction by evolving hyper-features from the original ones. In this work, we use the M3GP algorithm on several sets of satellite images over different countries to create hyper-features from satellite bands to improve the classification of land cover types. We add the evolved hyper-features to the reference datasets and observe a significant improvement in the performance of three state-of-the-art ML algorithms (Decision Trees, Random Forests and XGBoost) on multiclass classifications, and no significant effect on the binary classifications. We show that adding the M3GP hyper-features to the reference datasets brings better results than adding the well-known spectral indices NDVI, NDWI and NBR. We also compare the performance of the M3GP hyper-features on the binary classification problems with those created by other Feature Construction methods, such as FFX and EFS.
ARTICLE | doi:10.20944/preprints201807.0516.v1
Subject: Earth Sciences, Other Keywords: land-cover classification; very high spatial resolution remote sensing image; adaptive majority vote; post-classification.
Online: 26 July 2018 (15:05:16 CEST)
Land-cover classification using very-high-resolution (VHR) remote sensing images is a topic of considerable interest. Although many classification methods have been developed, there is still room for improvement in the accuracy and usability of classification systems. In this paper, a novel post-processing approach based on a dual-adaptive majority voting strategy (D-AMVS) is proposed for improving the performance of initial classification maps. D-AMVS defines a strategy for refining each label of a classified map obtained by different classification methods from the same original image, and for fusing the refined classification maps to generate a final classification result. The proposed D-AMVS contains three main blocks. 1) An adaptive region is generated by gradually extending the region around a central pixel based on two predefined parameters (T1 and T2), in order to utilize the spatial features of ground targets in a VHR image. 2) For each classified map, the label of the central pixel is refined according to the majority voting rule within the adaptive region. This is defined as adaptive majority voting (AMV). Each initial classified map is refined in this manner pixel by pixel. 3) Finally, the refined classified maps are used to generate a final classification map, and the label of the central pixel in the final classification map is determined by applying AMV again. Each entire classified map is scanned and refined pixel by pixel based on the proposed D-AMVS. The accuracy of the proposed D-AMVS approach is investigated using two remote sensing images with high spatial resolutions of 1.0 m and 1.3 m, respectively. Compared with the classical majority voting method and a relatively new post-processing method called the general post-classification framework, the proposed D-AMVS can achieve a land-cover classification map with less noise and higher classification accuracy.
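A simplified Python sketch of the majority-voting refinement at the heart of this approach; here the adaptive region grown with T1/T2 is replaced by a fixed square window for brevity:

```python
import numpy as np
from scipy import ndimage

def majority_vote_refine(labels, size=5):
    """Refine a classified map by assigning each pixel the majority label
    within a window (a simplification of D-AMVS's adaptive regions)."""
    def mode_filter(values):
        vals, counts = np.unique(values, return_counts=True)
        return vals[counts.argmax()]
    return ndimage.generic_filter(labels, mode_filter, size=size)
```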
ARTICLE | doi:10.20944/preprints202202.0010.v1
Subject: Medicine & Pharmacology, Clinical Neurology Keywords: elastic map; clustering; classification; degeneration; diagnostics
Online: 1 February 2022 (12:26:27 CET)
Positron-emission tomography (PET) is a powerful but costly tool for various medical investigations. In particular, it is used in the diagnostics of Parkinson’s disease and essential tremor. However, there are as yet no standardized reference values for it. We examined the efficiency of PET for analyzing the development and degradation of dopaminergic neurons in Parkinson’s disease. The informative indices are determined from the observed PET data. The high efficiency of PET for Parkinson’s disease diagnostics is also confirmed.
ARTICLE | doi:10.20944/preprints202110.0042.v1
Subject: Mathematics & Computer Science, Probability And Statistics Keywords: classification; ensemble; subspace; sparsity; feature ranking
Online: 4 October 2021 (10:36:37 CEST)
We propose a new ensemble classification algorithm, named Super Random Subspace Ensemble (Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by the Random Subspace Ensemble (RaSE) algorithm. The RaSE method was shown to be a flexible framework that can be coupled with any existing base classifier. However, the success of RaSE largely depends on the proper choice of the base classifier, which is typically unknown in advance. In this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE, which adaptively changes the base classifier distribution as well as the subspace distribution. We show that the Super RaSE algorithm and its iterative version perform competitively on a wide range of simulated datasets and two real-data examples. The new Super RaSE algorithm and its iterative version are implemented in a new version of the R package RaSEn.
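A minimal sketch of the core Super RaSE idea, sampling a base classifier together with a feature subspace for each ensemble member. The classifier pool, uniform sampling, and majority-vote aggregation are assumptions for illustration; the paper's iterative version additionally adapts both sampling distributions:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def super_rase_fit(X, y, n_estimators=50, subspace_size=5, rng=None):
    """Each member draws a random base classifier AND a random subspace
    (simplified: uniform sampling, no iterative adaptation)."""
    rng = rng or np.random.default_rng(0)
    pool = [LogisticRegression(max_iter=1000),
            DecisionTreeClassifier(max_depth=5),
            KNeighborsClassifier(5)]
    ensemble = []
    for _ in range(n_estimators):
        feats = rng.choice(X.shape[1], size=subspace_size, replace=False)
        clf = clone(pool[rng.integers(len(pool))])
        ensemble.append((feats, clf.fit(X[:, feats], y)))
    return ensemble

def super_rase_predict(ensemble, X):
    # majority vote over members (integer class labels assumed)
    votes = np.stack([clf.predict(X[:, feats]) for feats, clf in ensemble])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```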
ARTICLE | doi:10.20944/preprints202109.0285.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: remote sensing; deep learning; image classification
Online: 16 September 2021 (13:38:55 CEST)
Autonomous image recognition has numerous potential applications in the field of planetary science and geology. For instance, the ability to classify images of rocks would allow geologists to have immediate feedback without having to bring samples back to the laboratory. Also, planetary rovers could classify rocks in remote places, and even on other planets, without needing human intervention. Shu et al. classified 9 different types of rock images using a Support Vector Machine (SVM) with image features extracted autonomously, achieving a test accuracy of 96.71%. In this research, Convolutional Neural Networks (CNN) have been used to classify the same set of rock images. Results show that a 3-layer network obtains an average accuracy of 99.60% across 10 trials on the test set. A version of Self-taught Learning was also implemented to prove the generalizability of the features extracted by the CNN. Finally, one model was chosen to be deployed on a mobile device to demonstrate practicality and portability. The deployed model achieves perfect classification accuracy on the test set, while taking only 0.068 seconds to make a prediction, equivalent to about 14 frames per second.
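A sketch of a 3-convolutional-layer classifier of the kind described, written in PyTorch. The channel counts, input resolution, and single fully connected head are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RockCNN(nn.Module):
    """Small 3-conv-layer classifier for 9 rock classes (sizes assumed)."""
    def __init__(self, n_classes=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 64x64 input halved three times -> 8x8 spatial grid
        self.classifier = nn.Linear(64 * 8 * 8, n_classes)

    def forward(self, x):            # x: (batch, 3, 64, 64)
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = RockCNN()
logits = model(torch.randn(4, 3, 64, 64))   # -> (4, 9)
```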
Subject: Life Sciences, Biochemistry Keywords: scale spaces; differential invariants; segmentation; classification
Online: 5 July 2021 (09:22:00 CEST)
Image segmentation and classification still represent an active area of research, since no universal solution can be identified. Established segmentation algorithms, such as thresholding, are problem-specific, handle the easy cases well, and mostly rely on a single parameter, i.e., intensity. Machine learning approaches offer alternatives in which predefined features are combined into different classifiers. On the other hand, the outcome of machine learning is only as good as the underlying feature space. Differential geometry can substantially improve the outcome of machine learning, since it can enrich the underlying feature space with new geometrical objects called invariants. In this way, the geometrical features form a high-dimensional feature space for each pixel, in which the original objects can be resolved. Alternatives based on the geometry of scale-invariant image interest points have been exploited successfully in the field of computer vision. Here, we integrate geometrical feature extraction based on signal processing, machine learning, and input relying on domain knowledge. The approach is exemplified on the ISBI 2012 image segmentation challenge dataset. As a second application, we demonstrate powerful image classification functionality based on the same principles, applied to the HeLa and HEp-2 datasets. The obtained results demonstrate that feature-space enrichment, properly balanced with feature selection, can achieve performance comparable to deep learning approaches.
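One way to realize such per-pixel feature-space enrichment is to attach differential-geometric quantities computed at a given scale. A sketch with an illustrative subset of invariants (the paper's full invariant set is not spelled out in the abstract, so these three are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def geometric_features(img, sigma=2.0):
    """Per-pixel differential-geometric features at one scale:
    gradient magnitude, Laplacian, and Hessian determinant."""
    ix = gaussian_filter(img, sigma, order=(0, 1))   # d/dx
    iy = gaussian_filter(img, sigma, order=(1, 0))   # d/dy
    ixx = gaussian_filter(img, sigma, order=(0, 2))
    iyy = gaussian_filter(img, sigma, order=(2, 0))
    ixy = gaussian_filter(img, sigma, order=(1, 1))
    grad_mag = np.hypot(ix, iy)
    laplacian = ixx + iyy
    hessian_det = ixx * iyy - ixy ** 2
    return np.stack([img, grad_mag, laplacian, hessian_det], axis=-1)

# Each pixel now carries a feature vector that a classifier can use
# instead of raw intensity alone.
```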
ARTICLE | doi:10.20944/preprints202011.0646.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: social media; hate speech; text classification
Online: 25 November 2020 (14:12:07 CET)
The exponential increase in the use of the Internet and social media over the last two decades has changed human interaction. This has led to many positive outcomes, but at the same time it has brought risks and harms. While the volume of harmful content online, such as hate speech, is not manageable by humans, interest in the academic community in investigating automated means of hate speech detection has increased. In this study, we analyse six publicly available datasets by combining them into a single homogeneous dataset and classifying their content into three classes: abusive, hateful, or neither. We create a baseline model and improve its performance scores using various optimisation techniques. After attaining a competitive performance score, we create a tool that identifies and scores a page with an effective metric in near-real time, and uses the same as feedback to re-train our model. We demonstrate the competitive performance of our multilingual model on two languages, English and Hindi, achieving comparable or superior performance to most monolingual models.
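A hedged sketch of a generic three-class text-classification baseline of the kind such a study starts from. TF-IDF with logistic regression is an assumption, not the paper's model, and the toy texts are placeholders for the combined dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins; real texts and labels come from the six merged datasets.
texts = ["example abusive post", "example hateful post", "a neutral post"]
labels = ["abusive", "hateful", "neither"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # word and bigram features
    LogisticRegression(max_iter=1000),              # linear 3-class classifier
)
baseline.fit(texts, labels)
print(baseline.predict(["another post to score"]))
```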
ARTICLE | doi:10.20944/preprints201812.0210.v1
Subject: Medicine & Pharmacology, Clinical Neurology Keywords: Parkinson’s; RBD; connectivity; phenotype; classification; network
Online: 18 December 2018 (03:58:08 CET)
Rapid eye movement sleep behavior disorder (RBD) is often prodromal to Parkinson’s disease (PD). Thus, there should be detectable in-vivo functional signatures shared between RBD and PD that aid in disease classification. To assess common in-vivo phenotypes, resting-state data were collected on a 3T clinical MRI platform, and a novel functional connectivity magnetic resonance imaging (fcMRI) approach, which combined independent component analysis (ICA) and graph theory, was used to evaluate deficits in interconnectivity among 15 PD, 14 RBD, and 13 control participants. Whole-brain and network-level analyses revealed the largest deficits in network connectivity in PD compared with controls, with less severe differences between RBD and controls. Importantly, the network-level analysis demonstrated decreased network interconnectivity, with the greatest number of aberrant networks in PD and a subset of these in RBD. Additionally, a disease classification algorithm trained on RBD cases predicted PD cases with 0.87 sensitivity and 0.68 specificity. The functional alterations in cortical networks in RBD extended beyond the brainstem. These findings demonstrate progressive reductions in connectivity between brain networks, with less severe deficits in RBD than in PD. Moreover, RBD phenotypes can be used to predict PD status in a cross-sectional sample, which suggests that RBD is an intermediate phenotype.
ARTICLE | doi:10.20944/preprints201811.0533.v1
Subject: Life Sciences, Biotechnology Keywords: exercise classification; motion capture; virtual rehabilitation
Online: 22 November 2018 (04:33:58 CET)
The rapid development of algorithms for skeleton detection with relatively inexpensive contactless systems and cameras opens the possibility of virtual exercise therapy for patients with different complications. However, evaluation and confirmation of posture classifications is still needed. The purpose of this study was therefore to find the most accurate algorithm for automatic classification of human exercise movement. A Kinect V2 identifying 25 joints was used to record movements for data analysis. A total of 10 subjects volunteered for this study. Four algorithms were tested for the classification of different postures in Matlab. These were based on: the total error of vector lengths, the total error of angles, the multiplication of these two parameters, and the simultaneous analysis of the first and second parameters. A base of 13 exercises was then created to test the recognition of postures by the algorithms and to analyse subject performance. The best results for posture classification were shown by the second algorithm, with an accuracy of 94.9%. The average correctness of exercises among the 10 participants was 94.2% (SD 1.8%). The algorithms tested in this study therefore proved to be effective and could potentially form the basis for developing a system for remote monitoring of rehabilitation involving exercise.
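A sketch of the angle-based variant (the best-performing of the four algorithms): joint angles are computed from 3-D joint coordinates and the reference posture with the smallest total angle error is chosen. The joint triplets and reference templates are hypothetical inputs; the study itself used Matlab rather than Python:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b formed by 3-D points a-b-c, in radians."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def classify_posture(skeleton, references, triplets):
    """Assign the reference posture with the smallest total angle error.

    skeleton: (25, 3) joint coordinates, e.g. from a Kinect V2;
    references: dict posture-name -> (25, 3) template skeleton;
    triplets: list of (parent, joint, child) index triples to evaluate.
    """
    def total_error(s, r):
        return sum(abs(joint_angle(*s[list(t)]) - joint_angle(*r[list(t)]))
                   for t in triplets)
    return min(references,
               key=lambda name: total_error(skeleton, references[name]))
```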
ARTICLE | doi:10.20944/preprints201809.0449.v1
Subject: Engineering, Electrical & Electronic Engineering Keywords: motion; superpixel; temporal features; video classification
Online: 24 September 2018 (09:54:01 CEST)
Superpixels represent still images as grids of pixel groups that carry more meaningful information than atomic pixels. However, their usefulness for video classification has received little attention. In this paper, rather than using spatial RGB values as low-level features, we use optical flows mapped into hue-saturation-value (HSV) space to capture rich motion features over time. We introduce motion superpixels, which are superpixels generated from flow fields. After mapping the flow fields into HSV space, independent superpixels are formed by iteration of seeded regions. Every grid of a motion superpixel is tracked over time using nearest neighbors in the histogram of flow (HOF) for consecutive flow fields. To define the temporal representation, the evolution of three features within the superpixel region, namely the HOF, the histogram of oriented gradients (HOG), and the superpixel’s center of mass, is used as descriptors. The bag-of-features algorithm is used to quantify the final features, and generalized histogram-kernel support vector machines are used as learning algorithms. We evaluate the proposed superpixel tracking on first-person videos and action sports videos.
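The flow-to-HSV mapping can be sketched directly: flow direction becomes hue and flow magnitude becomes value, which is the conventional encoding. The exact normalization used in the paper is an assumption:

```python
import numpy as np

def flow_to_hsv(flow):
    """Map a dense optical-flow field (h, w, 2) into HSV:
    hue encodes flow direction, value encodes normalized magnitude."""
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                     # [-pi, pi]
    hsv = np.zeros(flow.shape[:2] + (3,))
    hsv[..., 0] = (ang + np.pi) / (2 * np.pi)    # hue in [0, 1]
    hsv[..., 1] = 1.0                            # full saturation
    hsv[..., 2] = mag / (mag.max() + 1e-9)       # value in [0, 1]
    return hsv

# Superpixels are then grown on this HSV image rather than on RGB;
# matplotlib.colors.hsv_to_rgb can convert the result for display.
```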
ARTICLE | doi:10.20944/preprints201807.0023.v1
Subject: Mathematics & Computer Science, Other Keywords: Kronecker product; Kronecker power; fractals; classification
Online: 3 July 2018 (05:46:08 CEST)
For a particular selected type of fractals, i.e., Kronecker product based fractals, an introductory common classification is proposed. It relies essentially on the size and content of the initial fractal matrix. In turn, it gives one possible explanation of the phenomenal ability to produce a few of the most popular fractals using many different algorithms, languages, and tools. It also paves the way to discovering, on a regular basis, thousands of new fractals, giving researchers a wider range of fractal models to choose from.
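A minimal sketch of a Kronecker product based fractal: iterating `np.kron` on a 3×3 initial matrix yields the Sierpinski carpet, illustrating how the seed's size and content determine the resulting fractal (the seed choice here is one classical example, not the paper's classification scheme):

```python
import numpy as np

# Initial fractal matrix: the 3x3 seed of the Sierpinski carpet.
seed = np.array([[1, 1, 1],
                 [1, 0, 1],
                 [1, 1, 1]])

def kronecker_fractal(seed, order):
    """Kronecker power of the seed: seed (x) seed (x) ... (order times)."""
    m = seed
    for _ in range(order - 1):
        m = np.kron(m, seed)
    return m

carpet = kronecker_fractal(seed, 3)   # 27x27 binary fractal image
```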
ARTICLE | doi:10.20944/preprints201609.0121.v1
Subject: Mathematics & Computer Science, General & Theoretical Computer Science Keywords: activity recognition; physical attributes; classification capability
Online: 29 September 2016 (12:57:00 CEST)
Motion-related human activity recognition using wearable sensors can potentially enable various useful daily applications. So far, most studies have viewed it as a stand-alone mathematical classification problem, without considering the physical nature of human motions. Consequently, they suffer from data dependencies, encounter the "dimension disaster" problem and the over-fitting issue, and their models are never human-readable. In this study, we start from a deep analysis of the natural physical properties of human motions, and then propose a useful feature selection method to quantify each feature's contribution to classification. On one hand, the "dimension disaster" problem can be avoided to some extent, owing to the confined dimension of the key features; on the other hand, the over-fitting issue can be suppressed, since the knowledge implied in human motions is nearly invariant, which compensates for possible data inadequacy. The experimental results indicate that the proposed method outperforms those adopted in related works, such as decision trees, k-NN, SVM, and neural networks.
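The abstract does not specify how each feature's classification contribution is quantified, so the following sketch uses mutual information between feature and activity label as one common proxy; the physically motivated feature names in the comment are hypothetical:

```python
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, names):
    """Score and rank each feature's contribution to classification.

    Mutual information is a stand-in here; the paper's own measure
    is not described in the abstract.
    """
    scores = mutual_info_classif(X, y, random_state=0)
    return sorted(zip(names, scores), key=lambda p: -p[1])

# Hypothetical physically motivated features from wearable-sensor windows:
# mean acceleration magnitude, dominant frequency, posture angle, etc.
# Keeping only the top-ranked features confines the dimensionality.
```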
ARTICLE | doi:10.20944/preprints202211.0227.v1
Subject: Medicine & Pharmacology, Sport Sciences & Therapy Keywords: Bayesian; cardiovascular disease; CVD; cross-sectional; logistic regression
Online: 14 November 2022 (01:55:06 CET)
Background: Cardiovascular disease (CVD) has been one of the leading causes of death and disability-adjusted life years lost worldwide. Blood pressure, lipids, and cholesterol are good predictors of CVD risk and vary with age and physical fitness. However, few studies have explored how CVD risk factors vary across different populations with age and muscle strength. Objective: to analyze the variation trends of CVD risk factors in blood according to age and relative grip strength among different populations. Method: 25,363 participants were recruited in this cross-sectional study and 24,709 were included in the analysis. Logistic regression and a Bayesian probabilistic analysis based on Markov Chain Monte Carlo (MCMC) modeling were conducted to build probability prediction models of hypertension, hyperlipidemia, and hypercholesterolemia according to age, relative grip strength, body weight conditions, and physical activity levels. Results: 1) Age might be the main influencing factor for hypertension, which is regarded as one of the primary CVD risk factors. Although maintaining a high level of physical activity might help prevent hypertension, as individuals with normal body weight and higher physical activity show a lower probability of being diagnosed with hypertension, it might not prevent hypertension with advancing age. 2) After age 60, individuals of normal body weight seem more likely to have hyperlipidemia than those who are overweight or obese. 3) Greater relative grip strength might not offset the negative effects of obesity, overweight, and physical inactivity on hyperlipidemia. 4) The probability of developing hypercholesterolemia varies little with age and relative grip strength. Conclusion: Body weight management and maintaining high levels of physical activity are recommended at any age. Gaining some body weight after 60 years of age might be beneficial.
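A minimal sketch of Bayesian logistic regression via a hand-rolled Metropolis sampler, in the spirit of the MCMC modeling described. The Gaussian prior, step size, and the particular predictors are assumptions; the study's actual model is richer:

```python
import numpy as np

def bayes_logistic_mcmc(X, y, n_iter=5000, step=0.05, rng=None):
    """Metropolis sampler for logistic-regression coefficients with a
    N(0, 10^2) prior. X columns might be age and relative grip strength
    (assumed predictors); y is a 0/1 diagnosis such as hypertension."""
    rng = rng or np.random.default_rng(0)
    X = np.column_stack([np.ones(len(X)), X])        # add intercept
    beta = np.zeros(X.shape[1])

    def log_post(b):
        z = X @ b
        loglik = np.sum(y * z - np.logaddexp(0, z))  # Bernoulli log-lik.
        return loglik - np.sum(b ** 2) / (2 * 10 ** 2)

    samples, lp = [], log_post(beta)
    for _ in range(n_iter):
        prop = beta + rng.normal(0, step, beta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # accept/reject
            beta, lp = prop, lp_prop
        samples.append(beta)
    return np.array(samples)   # posterior draws -> P(diagnosis | x)
```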
REVIEW | doi:10.20944/preprints202210.0391.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Tillage; Traction; Compaction; Neural networks; Support vector regression
Online: 26 October 2022 (02:07:19 CEST)
Soil working tools, implements, and machines are inevitable in mechanized agriculture. The soil-tool/machine interaction is a multivariate, dynamic, and intricate process. The accurate interpretation, description, and modeling of soil-machine interaction are key to providing solutions for sustainable crop production by reducing energy input, excessive soil pulverization, and compaction. Traditional methods provide insight into soil-machine interaction but often yield inadequate solutions and lack broad applicability. Computational intelligence (CI) is a comprehensive class of approaches that rely on approximate information to solve complex problems. CI methods have been extensively studied and applied in the soil tillage and traction domain in recent decades. This study critically reviews the CI techniques implemented in soil-machine interactions, especially in the context of tillage, traction, and compaction. The traditional methods and their limitations are discussed. The fundamentals of CI methods and a detailed overview of the most popular methods are provided. The study reviews and summarizes 50 selected articles on soil-machine interaction in which CI methods were employed, and discusses the strengths and limitations of the employed CI methods. Emerging CI methods and future applications are also suggested. The outlined study should serve as a concise reference and a quick, systematic way to understand the applicable CI methods that support crucial farm management decision-making.
ARTICLE | doi:10.20944/preprints202106.0533.v1
Online: 22 June 2021 (08:30:30 CEST)
The novel coronavirus disease (COVID-19) has created immense threats to public health on various levels around the globe. The unpredictable outbreak of this disease and the pandemic situation are causing severe depression, anxiety, and other mental as well as physical health problems among human beings. To combat this disease, vaccination is essential, as it boosts the immune system of people who come into contact with infected individuals. The vaccination process is thus necessary to confront the outbreak of COVID-19, a deadly disease that has put the social and economic condition of the entire world under enormous challenge. Worldwide vaccination progress should be tracked to identify how fast economic and social life will be stabilized. To monitor the vaccination progress, machine learning based regressor models are applied in this study. This tracking process has been applied to data from 14th December 2020 to 24th April 2021. Several ensemble-based machine learning regressor models, namely Random Forest, Extra Trees, Gradient Boosting, AdaBoost, and Extreme Gradient Boosting, are implemented and their predictive performances are compared. The comparative study reveals that the AdaBoost regressor performs best, with a minimum mean absolute error (MAE) of 9.968 and root mean squared error (RMSE) of 11.133.
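A sketch of this kind of regressor comparison with the two reported metrics, using scikit-learn's ensemble implementations. Synthetic data stands in for the vaccination series, and Extreme Gradient Boosting is omitted since it lives in the external xgboost package:

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the daily vaccination series (132 days,
# matching 14 Dec 2020 - 24 Apr 2021 in length only).
days = np.arange(132).reshape(-1, 1)
doses = 0.5 * days.ravel() ** 1.3 + np.random.default_rng(0).normal(0, 5, 132)

X_tr, X_te, y_tr, y_te = train_test_split(days, doses, random_state=0)
models = {"RandomForest": RandomForestRegressor(random_state=0),
          "ExtraTrees": ExtraTreesRegressor(random_state=0),
          "GradientBoosting": GradientBoostingRegressor(random_state=0),
          "AdaBoost": AdaBoostRegressor(random_state=0)}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: MAE={mae:.3f}, RMSE={rmse:.3f}")
```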
Subject: Medicine & Pharmacology, Allergology Keywords: Diagnosing designs; rare diseases; statistics; regression; block designs
Online: 2 June 2021 (12:14:34 CEST)
Far too often, one meets patients who went for years or even decades from doctor to doctor without getting a valid diagnosis. This brings pain to millions of patients and their families, not to speak of the enormous costs. Often, patients cannot tell precisely enough which factors (or combinations thereof) trigger their problems. If conventional methods fail, we propose the use of statistics and algebra to give doctors much more useful inputs from patients. We use statistical regression for independent triggering factors of medical problems, and “balanced incomplete block designs” for non-independent factors. These methods can supply doctors with much more valuable inputs and can detect combinations of multiple factors with remarkably few tests. To show that these methods do work, we briefly describe a case in which they helped to solve a 60-year-old problem in a patient, and we give further examples where they might be very useful. In conclusion, while regression is used in clinical medicine, it seems to be widely unknown in diagnostics. Statistics and algebra can save health systems much money, and spare patients a lot of pain.
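As a concrete instance of such a design (chosen here for illustration, not taken from the paper), the classical (v=7, b=7, r=3, k=3, λ=1) balanced incomplete block design tests seven candidate triggering factors in only seven trials of three factors each, with every pair of factors co-occurring exactly once. A short verification sketch:

```python
from itertools import combinations

# Seven trials ("blocks") of three factors each; every pair of the
# seven factors appears together in exactly one trial.
blocks = [{1, 2, 3}, {1, 4, 5}, {1, 6, 7}, {2, 4, 6},
          {2, 5, 7}, {3, 4, 7}, {3, 5, 6}]

for pair in combinations(range(1, 8), 2):
    assert sum(set(pair) <= b for b in blocks) == 1   # each pair once
print("valid (7, 7, 3, 3, 1) design")
```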
ARTICLE | doi:10.20944/preprints202103.0586.v1
Subject: Earth Sciences, Atmospheric Science Keywords: NVOC; phytoncide; bamboo grove; monoterpene; microclimate; regression analysis
Online: 24 March 2021 (13:10:25 CET)
After the COVID-19 outbreak, more and more people have been seeking physiological and psychological healing by visiting forests as stay-at-home periods became longer. NVOC, a major healing factor of forests, has several positive effects on human health, and this study investigated the NVOC characteristics of bamboo groves. The study revealed that α-pinene, 3-carene, and camphene were emitted the most, and that the largest amounts of NVOC were emitted in the early morning and late afternoon in bamboo groves. Furthermore, NVOC emission was found to correlate positively with temperature and humidity, and inversely with solar radiation, PAR, and wind speed. A regression analysis conducted to predict the effect of microclimate factors on NVOC emissions resulted in a regression equation with 82.9% explanatory power and found that PAR, temperature, and humidity had a significant effect on NVOC emission prediction. In conclusion, this study investigated the NVOC emission characteristics of bamboo groves, examined the relationship between NVOC emissions and microclimate factors, and derived a prediction equation for NVOC emissions to characterize the forest healing effects of bamboo groves. These results are expected to provide a basis for establishing more effective forest healing programs in bamboo groves.
ARTICLE | doi:10.20944/preprints202008.0329.v2
Subject: Medicine & Pharmacology, General Medical Research Keywords: COVID-19; Geospatial Regression; Health Disparities; Public Health
Online: 11 September 2020 (09:48:57 CEST)
COVID-19 is a potentially fatal viral infection. This study investigates geography, demography, socioeconomics, health conditions, hospital characteristics, and politics as potential explanatory variables for death rates at the state and county levels. Data from the Centers for Disease Control and Prevention, the Census Bureau, the Centers for Medicare and Medicaid, Definitive Healthcare, and USAfacts.org were used to evaluate regression models. Yearly pneumonia and flu death rates (state level, 2014-2018) were evaluated as a function of the governors’ political party using repeated-measures analysis. At the state and county levels, spatial regression models were evaluated. At the county level, we discovered a statistically significant model that included geography, population density, racial and ethnic status, and three health status variables, along with a political factor. State-level analysis identified health status, minority status, and the interaction between governors’ parties and health status as important variables. The political factor, however, did not appear in a subsequent analysis of 2014-2018 pneumonia and flu death rates. The pathogenesis of COVID-19 has a greater and disproportionate effect within racial and ethnic minority groups, and the political influence on the reporting of COVID-19 mortality was statistically relevant at the county level, and as an interaction term only at the state level.
ARTICLE | doi:10.20944/preprints201906.0291.v1
Subject: Medicine & Pharmacology, Other Keywords: endothelial disorders; glycocalyx injury; syndecan-1; nonlinear regression
Online: 28 June 2019 (07:42:18 CEST)
Endothelial disorders are related to various diseases. An initial endothelial injury is characterized by endothelial glycocalyx injury. We aimed to evaluate endothelial glycocalyx injury by measuring serum syndecan-1 concentrations in patients during comprehensive medical examinations. A single-center, prospective, observational study was conducted at Asahi University Hospital. The participants enrolled in this study were 1313 patients who underwent comprehensive medical examinations at Asahi University Hospital from January 2018 to June 2018. One patient undergoing hemodialysis was excluded from the study. At enrollment, blood samples were obtained, and study personnel collected demographic and clinical data. No treatments or exposures were conducted except for standard medical examinations and blood sample collection. Laboratory data were obtained from blood samples collected at the time of study enrollment. According to nonlinear regression, the concentrations of serum syndecan-1 were significantly related to age (p = 0.016), aspartate aminotransferase concentration (AST, p = 0.020), blood urea nitrogen concentration (BUN, p = 0.013), triglyceride concentration (p < 0.001), and hematocrit (p = 0.006). These relationships were independent associations. Endothelial glycocalyx injury, which is reflected by serum syndecan-1 concentrations, is related to age, hematocrit, AST concentration, BUN concentration, and triglyceride concentration.
ARTICLE | doi:10.20944/preprints201608.0025.v2
Subject: Earth Sciences, Atmospheric Science Keywords: solar variability; NAO; ENSO; volcanic eruptions; multiple regression
Online: 17 May 2017 (06:27:16 CEST)
The roles of natural factors, mainly solar eleven-year cycle variability and volcanic eruptions, in two major modes of climate variability, the North Atlantic Oscillation (NAO) and the El Niño Southern Oscillation (ENSO), are studied for roughly the last 150 years. The NAO is the primary factor regulating the Central England Temperature (CET) during winter throughout the period, though the NAO is affected differently by other factors in different time periods. Solar variability indicates a strong positive influence on the NAO during 1978-1997, while suggesting the opposite in the earlier period. The solar-NAO lag relationship is also shown to be sensitive to the chosen reference times, which points towards the previously proposed mechanism relating the sun and the NAO. The ENSO is influenced strongly by solar variability and volcanic eruptions in certain periods. This study observes a strong negative association between the sun and ENSO before the 1950s, which even reverses during the second half of the 20th century. During the period 1978-1997, when two strong eruptions coincided with active years of strong solar cycles, the sun, ENSO, and volcanoes suggested a stronger association, and we discuss the important role played by ENSO. That period showed warming in the central tropical Pacific and cooling in the North Atlantic relative to both the later period (1999-2017) and the chosen earlier period. Here we show that the mean atmospheric state is important for understanding the connection between solar variability, the NAO, and ENSO, and the associated mechanism. This presents a critical analysis to improve knowledge about major modes of variability and their role in climate. We also discuss the importance of detecting robust signals of natural variability, mainly from the sun.
COMMENT | doi:10.20944/preprints201608.0166.v1
Subject: Social Sciences, Geography Keywords: Regional inequality; Multilevel regression; Markov chain; Guizhou Province
Online: 17 August 2016 (12:58:58 CEST)
This study analyses regional development in one of the poorest provinces in China, Guizhou Province, between 2000 and 2012 using a multiscale and multi-mechanism framework. In general, regional inequality has been declining since 2000. In addition, economic development in Guizhou Province presented spatial agglomeration and club convergence, revealing a development pattern of one core area, two wing areas, and a contiguous area at the edge of the province between 2006 and 2012. Multilevel regression analysis revealed that industrialization and investment level were the primary driving forces of regional economic disparity in Guizhou Province. The influences of marketization and decentralization on regional economic disparity were relatively weak. Investment level reinforced regional economic disparity and the development of the core-periphery structure in the province. However, investment level actually weakened regional economic disparity in Guizhou Province when the variable of time was considered. In addition, topography and urban–rural differentiation were the two main reasons for the formation of a core-periphery structure in Guizhou Province.
ARTICLE | doi:10.20944/preprints202108.0389.v1
Subject: Mathematics & Computer Science, Other Keywords: remote-sensing classification; scene classification; few-shot learning; meta-learning; vision transformers; multi-scale feature fusion
Online: 18 August 2021 (14:29:29 CEST)
The central goal of few-shot scene classification is to learn a model that can generalize well to a novel (UNSEEN) scene category from only one or a few labeled examples. Recent works in the remote sensing (RS) community tackle this challenge by developing algorithms in a meta-learning manner. However, most prior approaches have either focused on rapidly optimizing a meta-learner or aimed at finding good similarity metrics, while overlooking embedding power. Here we propose a novel Task-Adaptive Embedding Learning (TAEL) framework that complements the existing methods by giving full play to feature embedding’s dual roles in few-shot scene classification: representing images and constructing classifiers in the embedding space. First, we design a lightweight network that enriches the diversity and expressive capacity of embeddings by dynamically fusing information from multiple kernels. Second, we present a task-adaptive strategy that helps to generate more discriminative representations by transforming the universal embeddings into task-specific embeddings via a self-attention mechanism. We evaluate our model in the standard few-shot learning setting on two challenging datasets: NWPU-RESISC45 and RSD46-WHU. Experimental results demonstrate that, on all tasks, our method achieves state-of-the-art performance by a significant margin.
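A sketch of the task-adaptive step only: the universal embeddings of a few-shot task's images attend to one another to become task-specific. Single-head attention, the embedding dimension, and the task size are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TaskAdaptiveEmbedding(nn.Module):
    """Transform universal embeddings into task-specific ones via
    self-attention over all images of the current few-shot task."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, task_embeddings):            # (1, n_images, dim)
        adapted, _ = self.attn(task_embeddings, task_embeddings,
                               task_embeddings)    # query = key = value
        return adapted

emb = torch.randn(1, 25, 64)              # e.g. a 5-way 5-shot support set
adapted = TaskAdaptiveEmbedding()(emb)    # same shape, task-conditioned
```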
ARTICLE | doi:10.20944/preprints202107.0482.v1
Subject: Earth Sciences, Atmospheric Science Keywords: Köppen-Geiger climate classification; Worldwide Bioclimatic Classification System (WBCS); Bioclimates; Thermotypes; Ombrotypes; Vineyards; Olive groves; Portugal
Online: 21 July 2021 (10:27:04 CEST)
Land and climate are strongly connected through multiple interface processes and climate change may lead to significant changes in land use. In this study, high-resolution observational gridded datasets are used to assess modifications in the Köppen-Geiger and Worldwide Bioclimatic (WBCS) Classification Systems, from 1950‒1979 to 1990‒2019 in Portugal. A compound Bioclimatic-Shift Exposure Index (BSEI) is also defined to identify the most exposed regions to recent climatic changes. The temporal evolution of land cover with vineyards and olive groves between 1990 and 2018, as well as correlations with areas with bioclimatic shifts, are analyzed. Results show an increase (decrease) of CSa Warm Mediterranean climate with hot summer (CSb, warm summer) of 18.1% (‒17.8%). The WBCS Temperate areas also reveal a decrease of ‒5.11%. Arid and semi-arid ombrotypes areas increased, conversely to humid to sub-humid ombrotypes. Thermotypic horizons depict a shift towards warmer classes. BSEI highlights the most significant shifts in northwestern Portugal. Vineyards have been displaced towards regions that are either the coolest/humid, in the northwest, or the warmest/driest, in the south. For oliviculture, the general trend for a relative shift towards cool/humid areas suggests an attempt of the sector to adapt, despite the cover area growth in the south. As vineyards and olive groves in southern Portugal are commonly irrigated, options for the intensification of these crops in this region may threaten the already scarce water resources and challenge the future sustainability of these sectors.
ARTICLE | doi:10.20944/preprints202011.0363.v1
Subject: Chemistry, Analytical Chemistry Keywords: cannabinoid receptor 1; synthetic cannabinoids; quantitative structure-activity relationship; multiple linear regression; partial least squares regression; dependence and abuse potential
Online: 13 November 2020 (07:19:36 CET)
In recent years, there have been frequent reports on the adverse effects of synthetic cannabinoid (SC) abuse. SCs cause psychoactive effects, similar to those caused by marijuana, by binding and activating cannabinoid receptor 1 (CB1R) in the central nervous system. The aim of this study was to establish a reliable quantitative structure-activity relationship (QSAR) model to correlate the structures and physicochemical properties of various SCs with their CB1R-binding affinities. We prepared 15 SCs and their derivatives (tetrahydrocannabinol [THC], naphthoylindoles, and cyclohexylphenols) and determined their binding affinity to CB1R, which is known as a dependence-related target. We calculated the molecular descriptors for dataset compounds using an R/CDK (R package integrated with CDK, version 3.5.0) toolkit to build QSAR regression models. These models were established and statistical evaluations were performed using the mlr and plsr packages in R software. The most reliable QSAR model was obtained from the partial least squares regression method via external validation. This model can be applied in vivo to predict the addictive properties of illicit new SCs. Using a limited number of dataset compounds and our own experimental activity data, we built a QSAR model for SCs with good predictability. This QSAR modeling approach provides a novel strategy for establishing an efficient tool to predict the abuse potential of various SCs and to control their illicit use.
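A sketch of a PLS-based QSAR workflow with a cross-validated Q² score standing in for the paper's external validation. Synthetic descriptor values replace the CDK-computed descriptors, and scikit-learn replaces the R packages actually used:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins: rows are compounds, columns are CDK-style
# molecular descriptors; y mimics a CB1R binding-affinity endpoint.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 50))
y = X[:, :3] @ np.array([0.8, -0.5, 0.3]) + rng.normal(0, 0.1, 15)

pls = PLSRegression(n_components=3)
y_cv = np.asarray(cross_val_predict(pls, X, y, cv=5)).ravel()
q2 = 1 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"cross-validated Q^2 = {q2:.2f}")
```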
ARTICLE | doi:10.20944/preprints202210.0078.v1
Subject: Medicine & Pharmacology, Obstetrics & Gynaecology Keywords: Africa; Maternal mortality rate; Joinpoint regression analysis; mortality; trends.
Online: 7 October 2022 (10:30:10 CEST)
Background: United Nations Sustainable Development Goals state that by 2030 the global maternal mortality rate (MMR) should be lower than 70 per 100,000 live births. MMR is still one of Africa's leading causes of death among women. This research aims to study regional trends in maternal mortality in Africa. Methods: We extracted maternal mortality rates per 100,000 births from the UNICEF data bank from 2000 to 2017, 2017 being the last date available. Joinpoint regression was used to study the trends and estimate the annual percent change (APC). Results: Maternal mortality has decreased in Africa over the study period by an average APC of -3.0% (95% CI: -2.9 to -3.2%). All regions showed significant downward trends, with the sharpest decreases in the South. Only the North African region is close to the United Nations' sustainable development goals for maternal mortality. The remaining sub-Saharan African regions are still far from achieving the goals. Conclusions: Maternal mortality has decreased in Africa, especially in the Southern Africa region. The only region close to the United Nations target is North Africa. The remaining sub-Saharan African regions are still far from achieving the goals. These results could be used for the development of regional policies.
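Within a single joinpoint segment, the APC follows from a log-linear fit of the rate on calendar year: fit ln(rate) = a + b·year, then APC = 100·(e^b − 1). A worked sketch on an illustrative series (not the study's data):

```python
import numpy as np

def annual_percent_change(years, rates):
    """APC within one joinpoint segment: fit ln(rate) = a + b*year,
    then APC = 100 * (exp(b) - 1)."""
    b = np.polyfit(years, np.log(rates), 1)[0]
    return 100 * (np.exp(b) - 1)

# Illustrative MMR series declining ~3% per year:
years = np.arange(2000, 2018)
rates = 700 * 0.97 ** (years - 2000)
print(annual_percent_change(years, rates))   # ~ -3.0
```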
ARTICLE | doi:10.20944/preprints202209.0353.v1
Subject: Medicine & Pharmacology, Obstetrics & Gynaecology Keywords: Africa; Maternal mortality rate; Joinpoint regression analysis; mortality; trends
Online: 23 September 2022 (03:06:07 CEST)
Background: United Nations Sustainable Development Goals state that by 2030 the global maternal mortality rate (MMR) should be lower than 70 per 100,000 live births. MMR is still one of Africa's leading causes of death among women. This research aims to study regional trends in maternal mortality in Africa. Methods: We extracted maternal mortality rates per 100,000 births from the World Bank database for 1990-2015. Joinpoint regression was used to study the trends and estimate the annual percent change (APC). Results: Maternal mortality has decreased in Africa over the study period by an average APC of -2.6%. All regions showed significant downward trends, with the sharpest decreases in East Africa. Only the North African region is close to the United Nations' sustainable development goals for maternal mortality. The remaining sub-Saharan African regions are still far from achieving the goals. Conclusions: Maternal mortality has decreased in Africa, especially in East Africa. The only region close to the United Nations target is North Africa. The remaining sub-Saharan African regions are still far from achieving the goals. These results could be used for the development of regional policies.
ARTICLE | doi:10.20944/preprints202208.0445.v1
Subject: Social Sciences, Economics Keywords: Adult children's education; parental longevity; truncated regression; emotional support.
Online: 26 August 2022 (04:18:44 CEST)
Background: In some developing countries, such as China, the population is aging rapidly while the average years of schooling of residents are constantly increasing. However, the question of whether adult children’s education has an effect on the longevity of their older parents remains inadequately studied. Methods: This paper uses China Health and Retirement Longitudinal Survey (CHARLS) data to estimate the causal impact of adult children’s education on their parents’ longevity. Identification is achieved by using a truncated regression model and using historical education data as instrumental variables for adult children’s education. Results: For every unit increase in adult children’s education, the father’s and mother’s longevity increased by 0.89 years and 0.75 years, respectively. Mechanism analysis shows that adult children’s education has a significant positive impact on parents’ emotional support, financial support, and self-reported health. Further evidence shows that for every unit increase in adult children’s education, the father-in-law’s and mother-in-law’s longevity increased by 0.40 years and 0.46 years, respectively. Conclusions: We conclude that improving the level of adult children’s education can increase parents’ and parents-in-law’s longevity. Adult children’s education might contribute to the longevity of older parents through three channels: providing emotional support, providing economic support, and affecting parents’ health.
ARTICLE | doi:10.20944/preprints202205.0255.v1
Subject: Life Sciences, Biophysics Keywords: SILCS; hERG channel; Physicochemical properties; Multiple linear regression; FragMaps
Online: 19 May 2022 (08:46:24 CEST)
The human ether-a-go-go-related gene (hERG) potassium channel is a well-known contributor to drug-induced cardiotoxicity and therefore an extremely important target when performing safety assessments of drug candidates. Ligand-based approaches in connection with quantitative structure-activity relationship (QSAR) analyses have been developed to predict hERG toxicity. The availability of the recently published cryogenic electron microscopy (cryo-EM) structure of the hERG channel has opened the prospect of using structure-based simulation and docking approaches for hERG drug liability predictions. Recently, the idea of combining structure- and ligand-based approaches for modeling hERG drug liability has gained momentum, offering improvements in predictability compared to ligand-based QSAR practices alone. The present article demonstrates combining the structure-based SILCS (site-identification by ligand competitive saturation) approach with physicochemical properties to develop predictive models for hERG blockade. This combination leads to improved model predictability, based on Pearson’s R and the percent correct metric (representing the rank-ordering of ligands), for different validation sets of hERG blockers involving diverse chemical scaffolds and a wide range of pIC50 values. The inclusion of the structure-based SILCS approach allows determination of the hERG region to which compounds bind and the contribution of different chemical moieties in the compounds to blockade, thereby facilitating rational ligand design to minimize hERG liability.
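A sketch of the two reported evaluation metrics, reading "percent correct" as the fraction of ligand pairs whose predicted pIC50 ordering matches experiment. That reading, and the toy affinity values, are assumptions rather than the paper's exact protocol:

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def percent_correct(y_true, y_pred):
    """Fraction of ligand pairs whose predicted ordering matches
    the experimental ordering (one reading of 'percent correct')."""
    pairs = list(combinations(range(len(y_true)), 2))
    ok = sum((y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j]) > 0
             for i, j in pairs)
    return ok / len(pairs)

y_true = np.array([4.2, 5.1, 6.3, 7.0, 5.8])   # experimental pIC50 (toy)
y_pred = np.array([4.5, 5.0, 6.0, 6.8, 6.1])   # model predictions (toy)
print(pearsonr(y_true, y_pred)[0], percent_correct(y_true, y_pred))
```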
ARTICLE | doi:10.20944/preprints202205.0240.v1
Subject: Social Sciences, Economics Keywords: Credit constraints; Export; SMEs; Instrumental variable; Probit regression; Vietnam
Online: 18 May 2022 (10:35:32 CEST)
Export participation and restricted access to external formal credit are two factors attracting meticulous attention from researchers and policymakers, especially in developing countries. The purpose of this study is to explore the interactive relationship between these factors in both static and dynamic models. The study uses datasets on small and medium-sized manufacturing enterprises (SMEs) in Vietnam for the period 2009-2015. The instrumental variable approach is implemented to deal with the endogenous variable problem in the model. The results show an effect of credit constraints on firms’ exporting status, and that continuous exporting is likely to ease credit constraints.