Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Crime prediction; Ensemble Learning; Machine Learning; Regression
Online: 14 September 2020 (00:53:30 CEST)
While the use of crime data has been widely advocated in the literature, its availability is often limited to large urban cities and isolated databases tend not to allow for spatial comparisons. This paper presents an efficient machine learning framework capable of predicting spatial crime occurrences, without using past crime as a predictor, and at a relatively high resolution: the U.S. Census Block Group level. The proposed framework is based on an in-depth multidisciplinary literature review allowing the selection of 188 best-fit crime predictors from socio-economic, demographic, spatial, and environmental data. Such data are published periodically for the entire United States. The selection of the appropriate predictive model was made through a comparative study of different machine learning families of algorithms, including generalized linear models, deep learning, and ensemble learning. The gradient boosting model was found to yield the most accurate predictions for violent crimes, property crimes, motor vehicle thefts, vandalism, and the total count of crimes. Extensive experiments on real-world datasets of crimes reported in 11 U.S. cities demonstrated that the proposed framework achieves an accuracy of 73% and 77% when predicting property crimes and violent crimes, respectively.
ARTICLE | doi:10.20944/preprints202310.0432.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: European Union; public revenues; public expenditures; regression analysis
Online: 8 October 2023 (10:08:59 CEST)
Modern countries generally deal with significant budget deficits and public debt. These countries need to rationalize their expenditures and increase revenue without major interference to economic flows. The aim of this paper is to create a model for forecasting public revenue and expenditure based on data from previous years. In the paper we formulated two hypotheses related to the validity of the set models. After detailed analysis, both hypotheses were accepted. The analysis includes all EU Member States and public revenue and expenditure data for the last decade. The significance of the analysis is reflected on the practical foundation of the pre-set theoretical views, which will have their basis in statistically significant results. By analyzing the model, we formulated the regression formulas of revenues and expenditures, which can be efficiently used in predicting these variables.
ARTICLE | doi:10.20944/preprints202010.0550.v2
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: expectation maximization (EM) algorithm; finite mixture model; conditional mixture model; regression model; adaptive regressive model (ARM)
Online: 28 October 2020 (11:18:04 CET)
The expectation maximization (EM) algorithm is a powerful mathematical tool for estimating statistical parameters when a data sample contains a hidden part and an observed part. EM is applied to learn a finite mixture model, in which the whole distribution of the observed variable is a weighted sum of partial distributions. The coverage ratio of every partial distribution is specified by the probability of the hidden variable. One application of the mixture model is soft clustering, in which a cluster is modeled by the hidden variable, each data point can be assigned to more than one cluster, and the degree of such assignment is represented by the probability of the hidden variable. However, in the traditional mixture model this probability is simplified to a parameter, which can cause loss of valuable information. Therefore, in this research I propose a so-called conditional mixture model (CMM), in which the probability of the hidden variable is modeled as a full probability density function (PDF) that has its own parameter. CMM aims to extend the mixture model. I also propose an application of CMM called the adaptive regressive model (ARM). A traditional regression model is effective when the data sample is scattered evenly. If data points are grouped into clusters, a regression model tries to learn a unified regression function that goes through all data points. Such a unified function is not effective for evaluating the response variable on grouped data points. The term “adaptive” means that ARM solves this ineffectiveness problem by first selecting the best cluster of data points and then evaluating the response variable within that cluster. In other words, ARM reduces the estimation space of the regression model so as to gain high accuracy in calculation.
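To make the baseline concrete, the following is a minimal sketch of plain EM for a two-component, one-dimensional Gaussian mixture. It illustrates only the traditional mixture model that the abstract starts from, where the hidden-variable probability is a single weight per component; it is not the CMM or ARM method itself. The function name `em_two_gaussians`, the initialization at the data minimum and maximum, and the variance floor are assumptions made for this illustration.

```python
import math


def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Illustrative sketch only: assumes the data are not all identical
    and uses a small variance floor to avoid degenerate components.
    """
    n = len(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n

    # Hypothetical initialization: equal weights, means at the extremes.
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [var_all, var_all]

    def pdf(x, mu, var):
        return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        # (the "degree of assignment" in soft clustering).
        resp = []
        for x in data:
            p = [weights[k] * pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])

        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)

    return weights, means, variances
```

Calling `em_two_gaussians([-0.5, 0.0, 0.5, 9.5, 10.0, 10.5])` returns one component centered near 0 and one near 10. The responsibilities computed in the E-step are the soft cluster memberships that the abstract refers to.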
ARTICLE | doi:10.20944/preprints202011.0363.v1
Subject: Chemistry And Materials Science, Analytical Chemistry Keywords: cannabinoid receptor 1; synthetic cannabinoids; quantitative structure-activity relationship; multiple linear regression; partial least squares regression; dependence and abuse potential
Online: 13 November 2020 (07:19:36 CET)
In recent years, there have been frequent reports on the adverse effects of synthetic cannabinoid (SC) abuse. SCs cause psychoactive effects, similar to those caused by marijuana, by binding and activating cannabinoid receptor 1 (CB1R) in the central nervous system. The aim of this study was to establish a reliable quantitative structure-activity relationship (QSAR) model to correlate the structures and physicochemical properties of various SCs with their CB1R-binding affinities. We prepared 15 SCs and their derivatives (tetrahydrocannabinol [THC], naphthoylindoles, and cyclohexylphenols) and determined their binding affinity to CB1R, which is known as a dependence-related target. We calculated the molecular descriptors for dataset compounds using an R/CDK (R package integrated with CDK, version 3.5.0) toolkit to build QSAR regression models. These models were established and statistical evaluations were performed using the mlr and plsr packages in R software. The most reliable QSAR model was obtained from the partial least squares regression method via external validation. This model can be applied in vivo to predict the addictive properties of illicit new SCs. Using a limited number of dataset compounds and our own experimental activity data, we built a QSAR model for SCs with good predictability. This QSAR modeling approach provides a novel strategy for establishing an efficient tool to predict the abuse potential of various SCs and to control their illicit use.
ARTICLE | doi:10.20944/preprints202302.0083.v2
Subject: Environmental And Earth Sciences, Environmental Science Keywords: Multilinear Regression; Dissolve Oxygen; Modeling; Machine Learning; Levenberg–Marquardt algorithm; ANN; Urban Lake
Online: 27 February 2023 (07:25:06 CET)
The paper presents predictive models for dissolved oxygen (DO) levels in an urban lake using common water quality parameters such as temperature, pH, conductivity, and ORP. Data were sampled using three real-time, industry-standard sensors (OPTOD, CTZN, and PHEHT) and then interpolated using the ArcGIS kriging technique. Correlations were analyzed with an ML algorithm; the correlation study indicated a highly positive correlation between DO and the other water parameters, and the model was corroborated by the R-score in order to create the linear regression model. In addition, an artificial neural network (ANN), a machine learning method, was developed using the Levenberg-Marquardt algorithm to predict DO as well. The performance of the models was then validated, and the R2 accuracy of the predicted data was checked against the actual data. The discrepancy between the forecasted and actual values is significantly smaller for the ANN model than for the regression model, which indicates that the ANN model is appropriate for forecasting the investigated attributes. The model can also be used to estimate DO in urban lake water for which no data are available.
ARTICLE | doi:10.20944/preprints201803.0093.v1
Subject: Engineering, Control And Systems Engineering Keywords: linear regression; covariance matrix; data association; sensor fusing; SLAM
Online: 13 March 2018 (04:06:56 CET)
Linear regression is a basic tool in mobile robotics, since it enables accurate estimation of straight lines from range-bearing scans or in digital images, which is a prerequisite for reliable data association and sensor fusing in the context of feature-based SLAM. This paper discusses, extends and compares existing algorithms for line fitting applicable also in case of strong covariances between the coordinates at each single data point, which must not be neglected if range-bearing sensors are used. Besides, particularly the determination of the covariance matrix is considered, which is required for stochastic modeling. The main contribution is a new error model of straight lines in closed form for calculating fast and reliably the covariance matrix dependent on just a few comprehensible and easily obtainable parameters. The model can be applied widely in any case when a line is fitted from a number of distinct points also without a-priori knowledge of the specific measurement noise. By means of extensive simulations the performance and robustness of the new model in comparison to existing approaches is shown.
ARTICLE | doi:10.20944/preprints202002.0069.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: coal; supercritical CO2; Gaussian process regression; machine learning; adsorption model
Online: 5 February 2020 (14:09:33 CET)
Deep coal beds have been suggested as possible underground geological locations for carbon dioxide storage. Furthermore, injecting carbon dioxide into coal beds can improve methane recovery. Due to the importance of this issue, a novel investigation has been carried out on the adsorption of carbon dioxide on various types of coal seam. This study proposes four Gaussian Process Regression (GPR) approaches with different kernel functions to estimate the excess adsorption of carbon dioxide in terms of temperature, pressure, and composition of coal seams. Comparison of the GPR outputs with the actual excess adsorption shows that the proposed models are accurate and that the Exponential GPR approach performs better than the others. For this structure, R2=1, MRE=0.01542, MSE=0, RMSE=0.00019 and STD=0.00014 have been determined. Additionally, the impacts of effective parameters on excess adsorption capacity have been studied for the first time in the literature. According to these results, the present work provides valuable and useful tools for petroleum and chemical engineers who deal with enhanced recovery and environmental protection.
ARTICLE | doi:10.20944/preprints202307.0405.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: Jackknife; Kibria-Lukman; estimator; Maximum Likelihood; Negative Binomial regression
Online: 6 July 2023 (08:58:10 CEST)
The negative binomial regression model (NBRM) is a generalized linear model that relaxes the restrictive assumption of the Poisson regression model that the variance is equal to the mean. The parameters of the NBRM are estimated using the maximum likelihood (ML) method. The maximum likelihood estimator becomes unstable when the explanatory variables are linearly dependent, a situation known as multicollinearity. Based on this, we developed a new estimator, called the modified jackknifed Negative Binomial Kibria-Lukman (MJNBKL) estimator, for the mitigation of multicollinearity in the NBRM using four different biasing (shrinkage) parameters. We establish the superiority condition for the MJNBKL estimator over the existing ones. The performance of the MJNBKL estimator was ascertained by comparing it with the existing ones through a Monte Carlo simulation study and two real-life application datasets. The results of the simulation and real-life applications show that the MJNBKL estimator outperformed the other estimators by having the smallest MSE across all sample sizes and for different levels of correlation for the four biasing parameters used, and that the third biasing parameter is the optimal shrinkage parameter, with the lowest MSE.
ARTICLE | doi:10.20944/preprints202310.0871.v1
Subject: Engineering, Aerospace Engineering Keywords: fiber optic gyroscope; thermal errors; prediction model; overfitting; biased regression
Online: 13 October 2023 (08:18:22 CEST)
For a fiber optic gyroscope, thermal deformation of the fiber coil can introduce additional thermal-induced phase errors, commonly referred to as thermal errors. Thermal error compensation techniques are effective means of addressing this issue. The principle behind these techniques involves real-time sensing of thermal errors and correcting them within the output signal. Since it is challenging to directly separate thermal errors from the output signal of the fiber optic gyroscope, it is necessary to predict thermal errors based on temperature. To establish a mathematical model between temperature and thermal errors, this paper measured synchronized data of phase errors and angular velocity for the fiber coil under different temperature conditions and aimed to model it using data-driven methods. Due to the difficulty of conducting tests and the limited number of data samples, an algorithm called TD-model modeling is proposed to address the issue of overfitting, which can reduce the model's generalization ability. First, a theoretical analysis of the phase errors caused by thermal deformation of the fiber coil is performed. Subsequently, the critical parameters, such as the thermal expansion coefficient, are determined, and a theoretical model is established. Finally, the theoretical analysis model is incorporated as a regularization term and combined with the test data to jointly participate in the regression of model coefficients. Through experimental comparative analysis, it is shown that, relative to ordinary regression models, the TD-model effectively mitigates overfitting caused by the limited number of samples, leading to a 58% improvement in predictive accuracy.
ARTICLE | doi:10.20944/preprints202309.1143.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: aging; sigmoidal growth function; nonlinear regression; threshold estimation; fractional anisotropy
Online: 18 September 2023 (14:27:39 CEST)
Background: Linear association has widely been assumed for prediction of aging-related fractional anisotropy (FA) decline in white matter of the brain. While useful for testing significance of the aging effect, it fails to identify a threshold age before and after which the age-FA association changes. Identification of such a threshold is often of clinical interest for timely intervention. Methods: We employed a sigmoidal growth function to test a threshold effect in age triggering onset of cerebral decline in 21 white matter tracts, and compared its fitting performance to those of linear and power regression. The study sample was a normal healthy cohort of 106 participants with ages in mid-life ranging from 18 to 60 years. Results: Of the 21 white matter tracts analyzed, the posterior thalamic radiation showed a better fit with the sigmoidal curve model than with linear or power regression. The estimated threshold age in years (95% confidence interval) was 47.2 (44.1-48.4). Conclusion: While available evidence regarding the presence of a specific age threshold for cerebral decline in mid-life based on FA was limited, the posterior thalamic radiation exhibited a threshold age of 47.2. Beyond this age point, we observed a significant change in the FA risk pattern.
ARTICLE | doi:10.20944/preprints202306.1048.v1
Subject: Chemistry And Materials Science, Materials Science And Technology Keywords: Liquid microcapsule; Multiple linear regression; Permeability experiments; Permeability model; Predictive capacity
Online: 14 June 2023 (12:27:15 CEST)
A self-protective metal/liquid microcapsule composite plating coating was prepared in our previous research. The release speed of the liquid core material from the microcapsules is crucial for the coating surface properties. The aim of this paper is to study the permeability behavior of liquid microcapsules within composite coatings. Given the linear characteristics of the permeability experimental data, multiple linear regression is first used to set up a permeability model of liquid microcapsules. The results show that a reliable mathematical model in terms of membrane porosity, viscosity of the core material, and membrane thickness is established under ambient temperature and moisture. The predictive capacity and reliability of this model are also analyzed carefully.
ARTICLE | doi:10.20944/preprints202011.0266.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: dyadic data; co-occurrence data; attributed dyadic data (ADD); mixture model; conditional mixture model (CMM); regression model
Online: 9 November 2020 (08:48:40 CET)
Dyadic data contains co-occurrences of objects and is often modeled by a finite mixture model, which in turn is learned by the expectation maximization (EM) algorithm. Objects in traditional dyadic data are identified by names, with the drawback that it is impossible to extract implicit valuable knowledge about the objects. In this research, I propose the so-called attributed dyadic data (ADD), in which each object has an informative attribute and each co-occurrence of two objects is associated with a value. ADD is flexible and covers most structures and forms of dyadic data. The conditional mixture model (CMM), a variant of the finite mixture model, is applied to learning ADD. Moreover, a significant feature of CMM is that any co-occurrence of two objects is based on some conditional variable. As a result, CMM can predict or estimate co-occurrence values based on a regression model, which extends the applications of ADD and CMM.
ARTICLE | doi:10.20944/preprints202104.0592.v1
Subject: Computer Science And Mathematics, Discrete Mathematics And Combinatorics Keywords: Flexible count regression; balanced discrete gamma distribution; deviance statistic; latent equidispersion; likelihood ratio
Online: 22 April 2021 (08:55:29 CEST)
Most existing flexible count regression models allow only approximate inference. Balanced discretization is a simple method to produce a mean-parametrizable flexible count distribution starting from a continuous probability distribution. This makes easy the definition of flexible count regression models allowing exact inference under various types of dispersion (equi-, under- and overdispersion). This study describes maximum likelihood (ML) estimation and inference in count regression based on balanced discrete gamma (BDG) distribution and introduces a likelihood ratio based latent equidispersion (LE) test to identify the parsimonious dispersion model for a particular dataset. A series of Monte Carlo experiments were carried out to assess the performance of ML estimates and the LE test in the BDG regression model, as compared to the popular Conway-Maxwell-Poisson model (CMP). The results show that the two evaluated models recover population effects even under misspecification of dispersion related covariates, with coverage rates of asymptotic 95% confidence interval approaching the nominal level as the sample size increases. The BDG regression approach, nevertheless, outperforms CMP regression in very small samples (n = 15 − 30), mostly in overdispersed data. The LE test proves appropriate to detect latent equidispersion, with rejection rates converging to the nominal level as the sample size increases. Two applications on real data are given to illustrate the use of the proposed approach to count regression analysis.
ARTICLE | doi:10.20944/preprints202108.0178.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: Spatial regression model; Influential observation; Outlier; Leverage; prediction residual; Masking and swamping; Diagnostic
Online: 9 August 2021 (07:57:56 CEST)
Influential observations, which are outliers in the x direction, the y direction, or both, remain a problem in classical regression model fitting. The spatial regression model, in which outliers have a peculiar character due to their local nature, is not free from the effect of such influential observations. Researchers have adapted some classical regression techniques to spatial models with satisfactory results. However, masking and/or swamping remain a stumbling block for such methods. We obtained the spatial representation of the classical regression diagnostic measures in the general spatial model. The commonly used spatial diagnostic measure, Cook's distance, is compared to some robust methods, Hi2 (using robust and non-robust measures), and classification based on generalized residuals and diagnostic generalized potentials, ISRs-Posi and ESRs-Posi, with the help of the obtained spatial prediction residuals and the spatial leverage term. Results of simulation and applications to real data have shown the advantage of ISRs-Posi and ESRs-Posi, due to their classification of outliers, over Cook's distance and non-robust Hsi12, which suffer from masking, and robust Hsi22, which suffers from swamping, in the general spatial model.
COMMUNICATION | doi:10.20944/preprints202308.0893.v2
Subject: Engineering, Control And Systems Engineering Keywords: marine lysozyme; seagull optimization algorithm; Gaussian process regression; soft sensor; gray correlation analysis
Online: 17 October 2023 (14:05:08 CEST)
Because the marine lysozyme fermentation process is highly nonlinear, multi-stage, and strongly time-varying, it is difficult to ensure the stability and prediction accuracy of a traditional single global soft sensor model. This study proposes a soft sensor model based on an improved seagull optimization algorithm (ISOA) combined with Gaussian process regression (GPR) weighted ensemble learning. First, the sample data set is divided into multiple local sample subsets by the improved density peak clustering algorithm (ADPC). Second, the Gaussian process regression model is optimized with the improved seagull optimization algorithm to establish the corresponding sub-prediction models. Finally, the fusion strategy of the prediction model is determined according to the degree of connection between the test samples and each local sample subset. Simulation results show that the proposed soft sensor model, which tunes GPR with ISOA and combines the sub-models, predicts the key biochemical parameters of the marine lysozyme fermentation process with less prediction error when fewer training data are available, and that it can be extended to soft sensor models of general nonlinear systems.
ARTICLE | doi:10.20944/preprints201702.0032.v1
Subject: Business, Economics And Management, Economics Keywords: individual travel cost method; zero truncated poisson regression model; endogenous stratification; consumer surplus
Online: 10 February 2017 (11:10:04 CET)
The main objective of this study is to estimate the annual recreational value provided by Foy’s Lake using the most applicable model for on-site data. In line with this objective, the Individual Travel Cost Method (ITCM) was applied, and the Zero Truncated Poisson Regression Model was found to be the most plausible among the candidate models for estimating consumer surplus. Based on the findings of the study, the consumer surplus, or recreational benefit, per trip per visitor is estimated at BDT 5,875 or US $ 73.44, and, based on this per-trip figure, the annual recreational value (total consumer surplus) provided by the lake is found to be BDT 321 million or US $ 40.2 million.
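For readers unfamiliar with the approach, the textbook form of a zero-truncated Poisson trip-demand model and the consumer-surplus formula usually paired with it can be written as follows. This is a generic sketch, not the specification estimated in the paper: the symbols $\lambda_i$, $TC_i$, $\mathbf{x}_i$, $\beta_0$, $\beta_{tc}$, and $\boldsymbol{\gamma}$ are notation assumed here, and any correction for endogenous stratification is omitted.

```latex
% Zero-truncated Poisson: on-site samples contain only visitors with at least one trip.
P(Y_i = y \mid Y_i > 0) = \frac{e^{-\lambda_i}\,\lambda_i^{\,y}}{y!\,\bigl(1 - e^{-\lambda_i}\bigr)},
\qquad y = 1, 2, \dots

% Semi-log trip demand with travel cost TC_i and other covariates x_i.
\lambda_i = \exp\!\bigl(\beta_0 + \beta_{tc}\, TC_i + \mathbf{x}_i^{\top}\boldsymbol{\gamma}\bigr)

% Consumer surplus per trip implied by the semi-log demand (requires \beta_{tc} < 0).
CS_{\text{per trip}} = -\frac{1}{\beta_{tc}}
```

Under this form, a more negative travel-cost coefficient implies a smaller consumer surplus per trip.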
ARTICLE | doi:10.20944/preprints202012.0650.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: flood proneness; zoning, CN hydrologic model; curve number (CN); logistic regression
Online: 25 December 2020 (10:36:39 CET)
Spatial evaluation of flood-prone areas in drainage basins is one of the basic strategies in flood risk management. The present study aims to investigate the efficiency of the logistic regression model and the SCS-CN hydrological model for predicting and zoning floods. In the first stage, 13 parameters were employed: runoff, hydrologic soil groups (HSGs), slope, lithology, drainage density (DD), land curvature, elevation, distance to waterways/rivers, topographic wetness index (TWI), stream power index (SPI), rainfall, land use, and NDVI. In the SCS-CN model of the drainage basin, the infiltration rate (S) and runoff amount (Q) were determined. The layers were weighted using the AHP. A flood zoning map of the drainage basin for return periods of 5, 15, 25, and 50 years was then drawn by applying the layer weights. To verify the accuracy of the zoning map from the logistic regression model, the ROC curve and the area under the curve (AUC) were used. The results showed that the AUC for the prediction rate is 0.81, indicating that the model has acceptable accuracy. The most important factors affecting floods are the geological index, distance to waterways/rivers, and NDVI in the logistic regression model, and slope, DD, rainfall, and land use in the SCS-CN model. Over the 5- to 50-year return periods, 30 to 46% of the drainage basin area has moderate flood potential, and 28 to 34% has high potential.
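The S and Q quantities mentioned above can be illustrated with the standard SCS-CN runoff equation in metric units. This is a minimal sketch of the textbook method, not the study's own implementation: the function name `scs_cn_runoff` and the default initial-abstraction ratio of 0.2 are assumptions made here. In the standard formulation, S is usually called the potential maximum retention.

```python
def scs_cn_runoff(rainfall_mm, curve_number, ia_ratio=0.2):
    """Direct runoff depth Q (mm) from the standard SCS-CN equation.

    S  = 25400 / CN - 254        potential maximum retention (mm)
    Ia = ia_ratio * S            initial abstraction (mm)
    Q  = (P - Ia)^2 / (P - Ia + S)   if P > Ia, else 0
    """
    if not 0 < curve_number <= 100:
        raise ValueError("curve_number must be in (0, 100]")
    if rainfall_mm < 0:
        raise ValueError("rainfall_mm must be non-negative")

    s = 25400.0 / curve_number - 254.0
    ia = ia_ratio * s

    # No runoff until rainfall exceeds the initial abstraction.
    if rainfall_mm <= ia:
        return 0.0
    return (rainfall_mm - ia) ** 2 / (rainfall_mm - ia + s)
```

For example, `scs_cn_runoff(50, 80)` gives roughly 13.8 mm of runoff from 50 mm of rain. A fully impervious surface (CN = 100) returns the full rainfall depth.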
ARTICLE | doi:10.20944/preprints201801.0275.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: forest biomass; aboveground biomass; airborne lidar; monitoring; regional forest inventory; variable selection; Bayesian model averaging; multiple linear regression
Online: 30 January 2018 (04:05:36 CET)
Historical forest management practices in the southwestern US have left forests prone to high intensity, stand-replacement fires. Effective management to reduce the cost and impact of forest-fire management and allow fires to burn freely without negative impact depends on detailed knowledge of stand composition, in particular, above-ground biomass (AGB). Lidar-based modeling techniques provide opportunities to reduce costs and increase ability of managers to monitor AGB and other forest metrics. Using Bayesian Model Averaging (BMA), we develop a regionally applicable lidar-based statistical model for Ponderosa pine and mixed conifer forest systems of the southwestern USA, using previously collected field data. The selected regional model includes a mid and low canopy height metric, a canopy cover, and height distribution term. It explains 72% of the variability in field estimates of AGB, and the RMSE of the two independent validation data sets are 23.25 and 32.82 Mg/ha. The regional model developed is structured in accordance with previously described models fit to local data, and performs equivalently to models designed for smaller scale application. Developing regional models for broad scale application provides a cost-effective, robust approach for managers to monitor and plan adaptively at the landscape scale.
ARTICLE | doi:10.20944/preprints201607.0056.v1
Subject: Environmental And Earth Sciences, Environmental Science Keywords: Land use change; urban sprawl; Logistic regression; Markov chain; Cellular automata; Gilan Province
Online: 18 July 2016 (11:53:16 CEST)
Although the promotion of urbanization culture in recent decades has made the development of cities around the world inevitable, this development can be guided in a direction that leaves, to the extent possible, minimal socioeconomic and environmental impacts. To do so, it is first necessary to forecast the direction in which cities and suburbs spread into rural areas over time, and then to prevent the shapeless growth of cities. This paper is an attempt to develop a dynamic hybrid model based on logistic regression (LR), Markov chain (MC), and cellular automata (CA) for predicting future urban sprawl in fast-growing cities. The model was developed using 12 widely used urban development criteria, whose significance coefficients were determined by logistic regression, and was validated by relative operating characteristic (ROC) analysis. The validated model was run for Guilan, a tourist province in northern Iran with a very high rate of urban development. Changes in the area of urban land use were detected over the period 1989 to 2013, and the future sprawl of the province was then forecast for the years 2025 and 2037. The analysis revealed that the area of urban land use increased by more than 1.7 %, from 36012.5 ha in 1989 to 59754.8 ha in 2013, and that the area of Caspian Hyrcanian forestland was reduced by 31628 ha. The results also predicted an alarming increase in the rate of urban development in the province by 2025 and 2037, during which urban land use is predicted to grow by 0.9 % and 1.38 %, respectively. The development pattern is expected to be uneven and scattered, without following any particular direction. Development will occur close to the existing or newly formed urban basements as well as around major roads and commercial areas. If not controlled, this development will lead to the loss of 13863 ha of Hyrcanian forests, and if the trend continues, 21013 ha of Hyrcanian forests and 20208 ha of barren/open lands are expected to be destroyed by 2037. In general, the proposed model is an efficient tool for supporting urban planning decisions and facilitates the sustainable development of cities by giving decision-makers an overview of the future development of fast-growing cities.
ARTICLE | doi:10.20944/preprints202306.1849.v1
Subject: Engineering, Mechanical Engineering Keywords: Machine Learning; Regression Model; XGBoost Regression; Yield Strength
Online: 27 June 2023 (05:25:11 CEST)
Magnesium matrix composites have attracted significant attention due to their lightweight nature and impressive mechanical properties. However, the fabrication process for these alloy composites is often time-consuming, expensive, and labor-intensive. To overcome these challenges, this study employed machine learning (ML) techniques to predict the mechanical properties of magnesium matrix composites. Regression models were utilized to forecast the yield strength of magnesium alloy composites reinforced with various materials. The study incorporated previous research on matrix type, reinforcement type, heat treatment, and mechanical working. The regression models employed in this study included decision tree regression, random forest regression, extra tree regression, and XGBoost regression. Model performance was assessed using metrics such as RMSE and R2. The XGBoost Regression model outperformed others, exhibiting an R2 value of 0.94 and the lowest error rate. Feature importance analysis indicated that the reinforcement particle form had the greatest influence on the mechanical properties. The study identified the optimized parameters for achieving the highest yield strength, which was 186.99 MPa. Overall, this study successfully demonstrates the effectiveness of ML as a valuable tool for optimizing the production parameters of magnesium matrix composites.
ARTICLE | doi:10.20944/preprints202311.0145.v1
Subject: Medicine And Pharmacology, Hematology Keywords: febrile neutropenia; chemotherapy; diffuse large B-cell lymphoma; outcomes; multivariate logistic regression model
Online: 2 November 2023 (10:24:29 CET)
Febrile neutropenia (FN) is a major concern in patients undergoing chemotherapy for diffuse large B-cell lymphoma (DLBCL); however, the overall risk of FN is difficult to assess. This study aimed to develop a model for predicting the occurrence of FN in patients with DLBCL. In this multicenter, retrospective, observational analysis, a multivariate logistic regression model was used to analyze the association between FN incidence and pretreatment clinical factors. We included adult inpatients and outpatients (aged ≥ 18 years) diagnosed with DLBCL who were treated with chemotherapy. The study examined 246 patients. Considering FN occurring during the first cycle of chemotherapy as the primary outcome, a predictive model with a total score of 5 points was constructed as follows: 1 point each for viral hepatitis, extranodal involvement, and a high level of soluble interleukin-2 receptor and 2 points for lymphopenia. The area under the receiver operating characteristic curve of this model was 0.844 (95% confidence interval: 0.777–0.911). Our predictive model can assess the risk of FN before patients with DLBCL start chemotherapy, leading to better outcomes.
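The point assignment described in the abstract can be written as a small scoring function. This is a sketch of the arithmetic only: the function name `fn_risk_score` and the boolean inputs are assumptions made here, and the abstract gives neither the laboratory cut-offs that define "high" soluble interleukin-2 receptor or lymphopenia nor a score threshold for classifying risk, so none is encoded.

```python
def fn_risk_score(viral_hepatitis, extranodal_involvement,
                  high_soluble_il2_receptor, lymphopenia):
    """Total pretreatment score (0 to 5) from the points stated in the abstract.

    1 point each: viral hepatitis, extranodal involvement,
                  high level of soluble interleukin-2 receptor.
    2 points:     lymphopenia.
    Inputs are booleans; how each factor is determined is left to the caller.
    """
    return (
        int(bool(viral_hepatitis))
        + int(bool(extranodal_involvement))
        + int(bool(high_soluble_il2_receptor))
        + 2 * int(bool(lymphopenia))
    )
```

A patient with lymphopenia and extranodal involvement only would score 3 of the 5 possible points.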
ARTICLE | doi:10.20944/preprints201812.0237.v1
Subject: Engineering, Mechanical Engineering Keywords: signal processing; sparse regression; system identification; impulse response; optimization; feature generation; structural dynamics; time series classification
Online: 19 December 2018 (16:21:41 CET)
Time recordings of impulse-type oscillation responses are short and highly transient. These characteristics may complicate the usage of classical spectral signal processing techniques for a) describing the dynamics and b) deriving discriminative features from the data. However, common model identification and validation techniques mostly rely on steady-state recordings, characteristic spectral properties and non-transient behavior. In this work, a recent method, which allows reconstructing differential equations from time series data, is extended for higher degrees of automation. With special focus on short and strongly damped oscillations, an optimization procedure is proposed that fine-tunes the reconstructed dynamical models with respect to model simplicity and error reduction. This framework is analyzed with particular focus on the amount of information available to the reconstruction, noise contamination and non-linearities contained in the time series input. Using the example of a mechanical oscillator, we illustrate how the optimized reconstruction method can be used to identify a suitable model and to extract features from uni-variate and multivariate time series recordings in an engineering-compliant environment. Moreover, the determined minimal models allow for identifying the qualitative nature of the underlying dynamical systems as well as testing for the degree and strength of non-linearity. The reconstructed differential equations would then be potentially available for classical numerical studies, such as bifurcation analysis. These results represent a physically interpretable enhancement of data-driven modeling approaches in structural dynamics.
ARTICLE | doi:10.20944/preprints201907.0118.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: noise disturbances; residents complaints; logistic regression; spatio-temporal effects; socio-demographic and environmental effects; GIS
Online: 8 July 2019 (12:42:05 CEST)
The purpose of this paper is to explore the presence of spatial and temporal effects on the calls for noise disturbance service reported to the Local Police of València (Spain) in the period from 2014 to 2015, and to investigate how some socio-demographic and environmental variables affect the noise phenomenon. The analysis is performed at the level of València's boroughs. It has been carried out using a logistic model after dichotomization of the noise incidents variable. The spatial effects consider first- and second-order neighbours. The temporal effects are included in the model by means of one- and two-week temporal lags. Our model confirms the presence of strong spatio-temporal effects. We also find significant associations between noise incidence and specific age groups, socio-economic status, land uses and recreational activities, among other variables. The results suggest that there is a problem of "social" noise in València that is not exclusively a consequence of coexistence between local residents. External factors such as the increasing number of people on the streets during weekend nights or during summer months severely increase the chances of a noise incident.
ARTICLE | doi:10.20944/preprints201807.0215.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: multivariate gaussian mixture model (MVGMM); multivariate linear regression; expectation-maximization imputation; WiFi localization; hidden markov model (HMM)
Online: 12 July 2018 (08:24:06 CEST)
The extensive deployment of wireless infrastructure provides a low-cost way to track mobile users in indoor environments. This paper demonstrates a prototype model of an accurate and reliable room location awareness system in a real public environment, where three typical problems arise. First, a massive number of access points (APs) can be sensed, leading to a high-dimensional classification problem. Second, heterogeneous devices record different received signal strength (RSS) levels due to variations in chip-set and antenna attenuation. Third, APs are not necessarily visible in every scanning cycle, leading to missing data. This paper presents a probabilistic Wi-Fi fingerprinting method in a hidden Markov model (HMM) framework for mobile user tracking. Considering the spatial correlation of the signal strengths from multiple APs, a Multivariate Gaussian Mixture Model (MVGMM) is fitted to model the probability distribution of RSS measurements in each cell. Furthermore, the unseen property of invisible APs has been investigated in this research and shown to be effective in differentiating between cells. The proposed system is able to achieve comparable localization performance. The field test results show a reliable 97% room-level localization accuracy for multiple mobile users in a real university campus Wi-Fi network without any prior knowledge of the environment.
ARTICLE | doi:10.20944/preprints202307.0679.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: SQL injection attacks; Recurrent neural network (RNN) autoencoder; ANN; CNN; Decision Tree; Naïve Bayes; SVM; Random Forest; Logistic Regression
Online: 11 July 2023 (10:53:24 CEST)
SQL injection attacks are one of the most common types of attacks on web applications. These attacks exploit vulnerabilities in the application's database access mechanisms, allowing attackers to execute unauthorized SQL queries. In this study, we propose an architecture for detecting SQL injection attacks using a recurrent neural network (RNN) autoencoder. The proposed architecture was trained on a publicly available dataset of SQL injection attacks and then compared with several other machine learning models, including ANN, CNN, Decision Tree, Naïve Bayes, SVM, Random Forest, and Logistic Regression. The experimental results showed that the proposed approach achieved an accuracy of 94% and an F1 score of 92%, which demonstrates its effectiveness in detecting SQL injection attacks with high accuracy in comparison with the other models covered in the study.
ARTICLE | doi:10.20944/preprints202306.0891.v1
Subject: Engineering, Mining And Mineral Processing Keywords: Fragmentation; Artificial neural network; Random Forest regression; Support vector regression; XG Boost Regression; Sensitivity analysis
Online: 13 June 2023 (08:04:17 CEST)
In a limestone quarry mine, fragmentation is a crucial outcome of blasting operations. The optimization of blasting operations greatly benefits from the prediction of rock fragmentation. The main factors that affect fragmentation are rock mass characteristics, blast geometry, and explosive properties. This paper is a step towards the implementation of machine learning and deep learning algorithms for predicting the extent of fragmentation (in percentage) in opencast mining. Various parameters can affect fragmentation; in this paper, ten parameters (spacing, drill hole diameter, burden, average bench height, powder factor, number of holes, charge per delay, uniaxial compressive strength, specific drilling, and stemming) were initially collected to train the model. However, due to their weak correlation with rock fragmentation, drill diameter, average bench height, compressive strength, stemming, and charge per delay were eliminated to reduce model complexity. A total of 219 data sets with five input features (number of holes, spacing, burden, specific drilling, and powder factor) are used to develop the models. To predict rock fragmentation due to blasting in limestone quarry mines, both machine learning models (Random Forest Regression (bagging), Support Vector Regression, and XGBoost Regression (boosting)) and a deep learning model (Neural Network Regression) are applied to develop a model that can optimize the prediction of fragmentation. Optimization of the artificial neural network showed that the model with architecture 64-32-16-1 performs well, giving MSE (mean squared error) values of 41.32 and 28.59 on training and test data, respectively. Its R² value for both training and test data is 0.83. Random Forest regression also performs well compared to SVR and XGBoost, with MSE values of 12.37 and 9.89 on training and test data, respectively. Here, the R² value for both sets is 94%. Based on the permutation importance and Shapley plot values, the powder factor has the highest impact and the burden has the lowest impact on fragmentation.
ARTICLE | doi:10.20944/preprints202002.0200.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: uniqueness; regression depth; maximum depth estimator; regression median; robustness
Online: 15 February 2020 (14:51:15 CET)
The notion of the median in one dimension is a foundational element in nonparametric statistics. It has been extended to multi-dimensional cases both in location and in regression via notions of data depth. Regression depth (RD) and projection regression depth (PRD) represent the two most promising notions in regression. Carrizosa depth (DC) is another depth notion in regression. Depth-induced regression medians (maximum depth estimators) serve as robust alternatives to the classical least squares estimator. The uniqueness of regression medians is indispensable in the discussion of their properties and the asymptotics (consistency and limiting distribution) of sample regression medians. Are the regression medians induced from RD, PRD, and DC unique? Answering this question is the main goal of this article. It is found that only the regression median induced from PRD possesses the desired uniqueness property. The conventional remedy for non-uniqueness, taking the average of all medians, might yield an estimator that no longer possesses the maximum depth in both the RD and DC cases. These and other findings indicate that the PRD and its induced median are highly favorable among their leading competitors.
ARTICLE | doi:10.20944/preprints201612.0139.v1
Subject: Engineering, Mechanical Engineering Keywords: kinematic model; fiber Bragg grating; deformations; machine tools calibration; predicted model; multiple regression analysis; finite element analysis
Online: 29 December 2016 (07:39:26 CET)
Structural deformations are one of the most significant factors affecting machine tool (MT) positioning accuracy. The errors they induce are difficult to represent with a model; nevertheless, they need to be evaluated and predicted in order to increase machining performance. This paper presents a novel approach to calibrating a machine tool in real time by analyzing the thermo-mechanical errors through Fibre Bragg Grating (FBG) sensors embedded in the MT frame. The proposed configuration consists of an adaptronic structure of passive materials, Carbon Fibre Reinforced Polymers (CFRP), equipped with FBG sensors that are able to measure the deformed condition of the frame in real time. By using a proper thermo-mechanical kinematic model, the displacement of the end effector may be predicted and corrected when it is subjected to undesired external factors. Starting from a set of FE simulations used to develop a model able to describe the MT structure stresses, a prototype has been fabricated and tested. The aim was to compare the numerical model with the experimental tests using FBG sensors. The experimental campaign has been performed by varying the structure temperature over time and measuring the tool tip point (TTP) positions. The results showed substantial agreement between the real and the predicted TTP position, confirming the effectiveness of the proposed calibration system.
ARTICLE | doi:10.20944/preprints202310.1467.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: body-worn sensors; multi layer classifier; random forest; kernel fisher discriminant analysis; SVM; stepwise regression
Online: 23 October 2023 (16:18:56 CEST)
This study presents a research plan that utilizes data obtained from wearable devices to identify human activities and gain insights into human behavior. We developed a model capable of classifying activities similar to human behavior and evaluated the effectiveness and generalization capabilities of this model. The data underwent initial preprocessing, including standardization and normalization. Additionally, recognizing the inherent similarities between human activity behaviors, we introduced a multi-layer classifier model. The first layer is a random forest model based on stepwise regression, which may encounter reduced accuracy for similar activities. The second layer employs a Support Vector Machine (SVM) model based on Kernel Fisher Discriminant Analysis (KFDA). KFDA is used to reduce the dimensionality of data points with potential confusion, followed by SVM for classification. The model was experimentally evaluated and applied to four benchmark datasets: UCI DSA, UCI HAR, WISDM, and IM-WSHA. The experimental results demonstrate that our approach achieved recognition accuracies of 99.71%, 98.71%, 99.12%, and 97.6% on these datasets, indicating excellent recognition performance. Furthermore, to assess the model's generalization ability, we performed K-fold cross-validation on the random forest model and utilized ROC curves for the SVM classifier. The results indicate that our multi-layer classifier model exhibits robust generalization capabilities.
ARTICLE | doi:10.20944/preprints201907.0341.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: malaria; indwelling malaria control; insecticide treated net (ITN); pregnancy; socio-economic; logistic regression; odds ratio
Online: 30 July 2019 (14:40:53 CEST)
Malaria is endemic in Nigeria and remains a major public health problem, taking its greatest toll on children under age 5 and pregnant women, although it is preventable, treatable, and curable. This study investigates the impact of socio-economic factors and indoor mosquito control on malaria prevalence among pregnant women in Nigeria using logistic regression. To achieve this, secondary data were obtained from the 2015 Nigeria Malaria Indicator Survey, executed by the National Malaria Elimination Programme (NMEP) and the National Population Commission (NPopC), with a nationally representative sample of more than 8,000 consisting of 7,745 households. The results from the logistic regression with odds ratios revealed that pregnant women are more likely to be affected by malaria fever (though not significantly) than women who are not pregnant. Household income level does not significantly reduce the incidence of malaria fever among pregnant women in Nigeria. Concerning malaria prevention measures, only dwelling spraying by a private company significantly reduces the incidence of malaria fever among pregnant women (P-value = 0.020 < 0.05), compared to dwelling spraying by government and NGOs and to insecticide-treated nets. Pregnant women in urban centers are also less likely to have malaria fever than pregnant women in rural communities in Nigeria. In addition, pregnant women with at least a secondary school level of education are less likely to be affected by malaria fever than pregnant women with no formal education. The fitted logistic model passed the goodness-of-fit test, and it correctly classified about 67.02% of cases. Therefore, this study recommends that government and NGOs intensify their efforts in dwelling spraying, run awareness campaigns on the danger of malaria fever among pregnant women and infants, and distribute insecticide-treated nets effectively in order to reduce the incidence of malaria fever among pregnant women living in rural communities in Nigeria.
REVIEW | doi:10.20944/preprints202311.0156.v1
Subject: Biology And Life Sciences, Aquatic Science Keywords: tilapia; probiotics; linear regression analysis; hierarchical regression analysis; Pearson correlation
Online: 2 November 2023 (10:29:36 CET)
Data regarding the pandemic's impact on tilapia culture remain limited, but it is known that there was a significant decline in production and marketing since 2020. The post-pandemic challenges confronting tilapia farming necessitate prompt solutions, encompassing the management of bacterial infections and the adoption of more advanced technologies by small-scale producers in developing nations. Probiotics, acknowledged as a viable alternative, are presently extensively employed in tilapia aquaculture. Multiple studies have suggested that the application of diverse probiotics in tilapia culture has yielded favorable outcomes. Nonetheless, only a limited number of studies have employed statistical methods to evaluate such findings. To address this gap, a regression analysis was carried out to investigate the existence of a linear relationship between the probiotic dosage added to the feed and two key dependent variables: the specific growth rate (SGR) and the feed conversion ratio (FCR). Additionally, a hierarchical regression analysis was undertaken to ascertain the extent to which the variance observed in these responses could be explained by the variable "probiotic dosage in feed," after accounting for covariates such as initial weight, test duration, water temperature, and number of replicate tanks. Finally, two Pearson correlation matrices were constructed since different studies were included for the SGR and FCR analyses.
ARTICLE | doi:10.20944/preprints202309.2134.v1
Online: 30 September 2023 (05:42:45 CEST)
Dipteryx spp. is an important species in reforestation in the Amazon. The objective of this study is to characterize and compare the relationships between dendrometric variables in Dipteryx spp. stands in the Western Amazon by fitting linear regression equations for total height and crown diameter. Six forest stands were evaluated in three municipalities. Dendrometric variables collected included diameter at 1.3 m height (dbh), total height (ht) and crown diameter (dc). Simple and multiple linear regression equations were fitted to characterize the relationships between ht and dc. The total aboveground biomass of Dipteryx spp. trees and the carbon stock of the stands were estimated. The general equations showed higher R² values, exceeding 0.7. The general equations for estimating ht and dc were significant for all coefficients. The trees averaged 22 t/ha of aboveground biomass in the stands. There was a variation in carbon sequestration potential among stands, ranging from 5.12 to 88.91 t CO₂ ha⁻¹. Single-input equations using dbh as an independent variable are recommended for estimating dc and ht for individual Dipteryx spp. stands. Stands in the Western Amazon play a significant role in carbon sequestration and accumulation. Trees can sequester an average of 4.8 tons of CO₂ per year.
REVIEW | doi:10.20944/preprints202110.0207.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: transfer learning; classification; regression
Online: 13 October 2021 (16:28:59 CEST)
Accurate transfer learning of clinical outcomes, e.g., of the effects and side effects of drugs or other interventions, from one cellular context to another (in-vitro versus ex-vivo versus in-vivo, or across tissues), between cell-types, developmental stages, omics modalities or species, is considered tremendously useful. Ultimately, it may avoid most drug development failing in translation, despite large investments in the preclinical stages, which includes animal experiments requiring careful justification. Thus, when transferring a prediction task from a source (model) domain to a target domain, what counts is the high quality of the predictions in the target domain, requiring molecular states or processes common to both source and target that can be learned by the predictor, reflected by latent variables. These latent variables may form a compendium of knowledge that is learned in the source, to enable predictions in the target; usually, there are few, if any, labeled target training samples to learn from. Transductive learning then refers to the learning of the predictor in the source domain, transferring its outcome label calculations to the target domain, considering the same task. Inductive learning considers cases where the target predictor is performing a different yet related task as compared to the source predictor, making some labeled target data necessary. Often, there is also a need to first map the variables in the input/feature spaces (e.g. of gene names to orthologs) and/or the variables in the output/outcome spaces (e.g. by matching of labels). Transfer across omics modalities also requires that the molecular information flow connecting these modalities is sufficiently conserved. Only one of the methods for transfer learning we reviewed offers an assessment of input data, suggesting that transfer learning is unreliable in certain cases. 
Moreover, source domains feature their very own particularities, and transfer learning should consider these, e.g., as differences in pharmacokinetics, drug clearance or the microenvironment. In light of these general considerations, we here discuss and juxtapose various recent transfer learning approaches, specifically designed (or at least adaptable) to predict clinical (human in-vivo) outcomes based on molecular data, towards finding the right tool for a given task, and paving the way for a comprehensive and systematic comparison of the suitability and accuracy of transfer learning of clinical outcomes.
Subject: Business, Economics And Management, Economics Keywords: electricity poverty; quantile regression
Online: 18 September 2020 (09:40:45 CEST)
The main objective of this article is to explore the causes of household electricity poverty in Spain from an innovative perspective. Based on evidence of energy inequality across households with different income levels, a quantile regression approach was used to better capture the heterogeneity of determinants of energy poverty across different levels of electricity expenditure. The results illustrate some interesting and counter-intuitive findings about the relationship between household income and electricity poverty, and the technical efficiency of quantile regression compared to the imprecise results of a standard single coefficient/OLS approach.
ARTICLE | doi:10.20944/preprints202201.0441.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Active learning (AL); batch mode; expected model change; linear regression; nonlinear regression
Online: 28 January 2022 (15:03:10 CET)
Training supervised machine learning models requires labeled examples. A judicious choice of examples is helpful when there is a significant cost associated with assigning labels. This article improves upon a promising existing method, Batch-mode Expected Model Change Maximization (B-EMCM), for selecting examples to be labeled for regression problems. Specifically, it develops and evaluates alternate strategies for adaptively selecting the batch size in B-EMCM. By determining the cumulative error that occurs from the estimation of the stochastic gradient descent, a stopping criterion for each iteration of the batch can be specified to ensure that the selected candidates are the most beneficial to model learning. This new methodology is compared to B-EMCM via mean absolute error and root mean square error over ten iterations benchmarked against machine learning data sets. Using multiple data sets and metrics across all methods, one variation of AB-EMCM, the max bound of the accumulated error (AB-EMCM Max), showed the best results for an adaptive batch approach. It achieved better root mean squared error (RMSE) and mean absolute error (MAE) than the other adaptive and non-adaptive batch methods while reaching the result in nearly the same number of iterations as the non-adaptive batch methods.
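For readers unfamiliar with the two comparison metrics named in this abstract, a minimal sketch of their standard definitions follows. The function names are illustrative and are not taken from the paper; the sketch only shows how MAE and RMSE are computed from paired true and predicted values.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared difference."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


if __name__ == "__main__":
    truth = [1.0, 2.0, 3.0]
    preds = [1.0, 2.0, 5.0]
    print(mae(truth, preds), rmse(truth, preds))
```

Because RMSE squares each error before averaging, a single large miss raises it more than it raises MAE, which is one reason both are commonly reported together.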
ARTICLE | doi:10.20944/preprints202008.0448.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: airport operation and management; air passenger index (API) prediction; machine learning (ML); mutual information (MI); support vector regression (SVR); K-Means
Online: 20 August 2020 (08:31:36 CEST)
Air passenger traffic prediction is crucial for the effective operation of civil aviation airports. Despite some progress in this field, the prediction accuracy and methods need further improvement. This paper proposes an integrated approach to the prediction of air passenger index as follows. Firstly, the air passenger index is defined and classified by the K-means clustering method. Based on the mutual information (MI) principle, the information entropy is used to analyze and select the key influencing factors of air passenger travel. By incorporating the MI principle into the support vector regression (SVR) framework, this paper presents an innovative MI-SVR machine learning model used to predict the air passenger index. Finally, the proposed model is validated by passenger throughput data of the Shanghai Pudong International Airport, China. The experimental results prove the model feasibility and effectiveness by comparing them with conventional methods, such as ARIMA, LSTM, and other machine learning models, outperformed by the MI-SVR model. Besides, it is shown that the prediction effect of each model could be improved by introducing influencing factors based on mutual information. The main findings are considered instrumental to the airport operation and air traffic optimization.
ARTICLE | doi:10.20944/preprints202008.0392.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: airport operation and management; air passenger index (API) prediction; machine learning (ML); mutual information (MI); support vector regression (SVR); K-Means
Online: 18 August 2020 (16:25:20 CEST)
Air passenger traffic prediction is crucial for the effective operation of civil aviation airports. Despite some progress in this field, the prediction accuracy and methods need further improvement. This paper proposes an integrated approach to the prediction of air passenger index as follows. Firstly, the air passenger index is defined and classified by the K-means clustering method. Based on the mutual information (MI) principle, the information entropy is used to analyze and select the key influencing factors of air passenger travel. By incorporating the MI principle into the support vector regression (SVR) framework, this paper presents an innovative MI-SVR machine learning model used to predict the air passenger index. Finally, the proposed model is validated by passenger throughput data of the Shanghai Pudong International Airport, China. The experimental results prove the model feasibility and effectiveness by comparing them with conventional methods, such as ARIMA, LSTM, and other machine learning models, outperformed by the MI-SVR model. Besides, it is shown that the prediction effect of each model could be improved by introducing influencing factors based on mutual information. The main findings are considered instrumental to the airport operation and air traffic optimization.
ARTICLE | doi:10.20944/preprints202208.0222.v1
Subject: Medicine And Pharmacology, Epidemiology And Infectious Diseases Keywords: Tuberculosis; Mortality; Indigenous; Logistic Regression
Online: 11 August 2022 (12:00:20 CEST)
Aim. To identify factors associated with mortality with tuberculosis diagnosis in the indigenous population in Peru 2015-2019. Methods. Case-control study nested in a retrospective cohort, using the registry of persons belonging to indigenous peoples of the National Tuberculosis Prevention and Control Strategy of the Ministry of Health of Peru. A descriptive analysis was applied, and then bivariate and multiple logistic regression was used to evaluate associations between the variables and the outcome (live-deceased), the results were presented as OR with their respective 95% confidence intervals. Results. The mortality rate of the total indigenous population of Peru was 1.75 deaths per 100,000 indigenous people diagnosed with TB. The community of Kukama kukamiria - Yagua reported 505 (28.48%) individuals. The final logistic model showed that indigenous men (OR=1.93; 95% CI: 1.001-3.7), with a history of HIV prior to TB (OR=16.7; 95% CI: 4.7-58.7) and indigenous people in old age (OR=2.95; 95% CI: 1.5-5.7), are factors associated with a greater chance of dying from TB. Conclusions. It is important to reorient health services among indigenous populations, especially those related to improving the timely diagnosis and early treatment of TB-HIV co-infection, to ensure comprehensive care for this population, considering that they are vulnerable groups.
ARTICLE | doi:10.20944/preprints202011.0297.v1
Subject: Computer Science And Mathematics, Mathematics Keywords: regression; time point data; modelling
Online: 10 November 2020 (10:00:37 CET)
In this paper, we present a regression-based modeling approach to analyze multiple-series MTC data. A typical application of this modeling approach includes three steps: first, define a model that approximates the relationship between gene expression and experimental factors, with parameters incorporated to address the research interest; second, use least-squares and estimating equation techniques to estimate the parameters and their corresponding standard errors; third, compute test statistics, P-values and NFD as measures of statistical significance. The advantages of this approach are as follows. First, it addresses the research interest in a specific, precise manner, and maximally utilizes all the data and other relevant information. Second, it accounts for both systematic and random variations associated with the data, and the results of such an analysis provide not only gene-specific information relevant to the research objective, but also its reliability, thereby helping investigators to make better decisions for follow-up studies. Third, this approach is highly flexible, and can easily be extended to other types of MTC studies or other microarray experiments by formulating different models based on the experimental design of the studies.
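The least-squares step and the test-statistic step listed in this abstract can be sketched for the simplest case of one response and one predictor. This is an illustrative assumption, not the paper's model: the function name is hypothetical, the model is an ordinary straight line, and P-values are omitted because they would require a t-distribution routine outside the standard library.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, se_slope, t_slope), where se_slope is the
    standard error of the slope and t_slope = slope / se_slope.
    Requires at least three points and non-constant x.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and the unbiased residual variance (n - 2 d.o.f.).
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    # A perfect fit has zero standard error; report an infinite statistic then.
    t_slope = slope / se_slope if se_slope > 0 else math.copysign(math.inf, slope)
    return slope, intercept, se_slope, t_slope


if __name__ == "__main__":
    print(fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 8.0]))
```

In a setting with many response series, the same fit would be repeated per series and the resulting statistics compared, which is where a multiple-testing measure becomes relevant.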
ARTICLE | doi:10.20944/preprints202307.0288.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Idiosyncratic Volatility Estimation/Prediction; Machine Learning; Deep learning Based Regression; Tree-Based Regression; Artificial Intelligence
Online: 6 July 2023 (02:14:16 CEST)
Financial markets require a great deal of decision making from investors and market makers. One metric that can help ease the process of decision making is investment risk, which can be measured in two parts: systematic risk and idiosyncratic risk. A clear understanding of the volatilities in each risk component can be a powerful signal in recognizing the right assets to maximize investment returns. In this paper, we focus on idiosyncratic volatility and pre-calculate the idiosyncratic volatility values for 31,198 members of the NYSE, Amex and Nasdaq markets for the trades occurring between January 1963 and December 2019. Utilizing a subset of the dataset, limited to the Nasdaq100 index, we consider the application of machine learning techniques in predicting the idiosyncratic volatility values using the raw trade data, to explore a data extension option for future market trade records that have not yet occurred. We offer a deep learning based regression model and compare it with traditional tree-based methods on a small subset of our pre-calculated idiosyncratic volatility dataset. Our analytical results show that the performance of the deep learning techniques is much more robust than that of the traditional tree-based baselines.
ARTICLE | doi:10.20944/preprints202310.0202.v1
Subject: Biology And Life Sciences, Agricultural Science And Agronomy Keywords: amaranth; environmental index; linear regression; stability
Online: 4 October 2023 (05:04:02 CEST)
Amaranth has the potential to support Malawi's food and nutrition security, income generation and livelihoods, and climate change resilience efforts. Due to the high genetic variability of Amaranth, there is a need to develop stable and high-yielding genotypes for sustainable production. To determine the degree of genetic stability in different environments, five Amaranth accessions were subjected to stability analysis. The experiment was carried out at three sites (Bunda, Bembeke, and Chipoka) for two seasons in 2020-2021 in the central region of Malawi. It was laid out in a Randomized Complete Block Design (RCBD) with four replicates. The Eberhart and Russell linear regression model was used for stability analysis and Pearson correlation was used to test the relationship between variables. Environmental variance + (genotype x environment) was significant for four of the parameters studied, namely grain yield, plant height, leaf length, and leaf width, indicating the presence of a remarkable interaction between genotypes and environment. The results of a pooled analysis of variance showed significant differences at a 5% significance level among the Amaranth accessions, indicating inherent genetic variability. Using the linear regression model of Eberhart and Russell, accessions PE-LO-BH-01 and LL-BH-04 were identified as the highest yielding stable genotypes for leaf and grain yield, respectively. In addition, the Bembeke site was the most favourable environment for all the accessions. Thus, to enhance the production of amaranth in Malawi, LL-BH-04 and PE-LO-BH-01 were put forward for release as varieties for grain and leaf, respectively. These results will also guide and support future breeding programs.
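As a rough illustration of the regression step in an Eberhart and Russell style stability analysis, the sketch below regresses each genotype's yield on an environmental index (the environment mean minus the grand mean). The function name `stability_slopes` and the yield table are hypothetical; the deviation-from-regression term and the pooled error used in the full method are not shown.

```python
def stability_slopes(yields):
    """Regression coefficient of each genotype on the environmental index.

    yields[g][e] is the mean yield of genotype g in environment e.
    A slope near 1 indicates average responsiveness to the environment.
    """
    n_gen = len(yields)
    n_env = len(yields[0])
    grand_mean = sum(sum(row) for row in yields) / (n_gen * n_env)
    # Environmental index: environment mean minus grand mean (sums to zero)
    index = [
        sum(yields[g][e] for g in range(n_gen)) / n_gen - grand_mean
        for e in range(n_env)
    ]
    denom = sum(i * i for i in index)
    return [sum(y * i for y, i in zip(row, index)) / denom for row in yields]

# Hypothetical yields: 3 genotypes (rows) in 3 environments (columns)
example = [
    [2.0, 3.0, 4.0],
    [1.5, 3.5, 5.5],
    [2.5, 2.5, 2.5],
]
print(stability_slopes(example))
```

By construction the slopes average to 1 across genotypes, which is a convenient sanity check on the index.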
ARTICLE | doi:10.20944/preprints202008.0139.v1
Subject: Engineering, Industrial And Manufacturing Engineering Keywords: copper price; prediction; support vector regression
Online: 6 August 2020 (08:26:35 CEST)
Predicting the copper price is essential for making decisions that can affect companies and governments dependent on the copper mining industry. Copper prices follow a time series that is non-linear and non-stationary, with periods that change as a result of potential growth, cyclical fluctuation and errors. The trend and cyclical components together are sometimes referred to as the trend-cycle. In order to make predictions, it is necessary to consider the different characteristics of the trend-cycle. In this paper, we study a copper price prediction method using Support Vector Regression. This work explores the potential of Support Vector Regression with external recurrences to make predictions 5, 10, 15, 20 and 30 days into the future for the copper closing price at the London Metal Exchange. The best model for each forecast interval is selected using a grid search and balanced cross-validation. In experiments on real datasets, our results indicate that the parameters (C, ε, γ) of the Support Vector Regression model do not differ between the different prediction intervals. Additionally, the number of preceding values used to make the estimates does not vary with the predicted interval. Results show that the support vector regression model has a lower prediction error and is more robust. Our results show that the presented model is able to predict copper price volatilities close to reality, with an RMSE of 2.2% or less for prediction periods of 5 and 10 days.
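Two ingredients mentioned above can be sketched without any fitting library: building input rows from a number of preceding values for a forecast several days ahead, and the ε-insensitive loss that Support Vector Regression penalizes. The function names and the toy series are hypothetical, and this is only an assumed reading of how the lagged inputs are formed, not the authors' pipeline.

```python
def make_lagged(series, n_lags, horizon):
    """Build (X, y) for forecasting `horizon` steps ahead.

    Each row of X holds `n_lags` consecutive values; the matching y is the
    value `horizon` steps after the last value in that row.
    """
    X, y = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        X.append(series[t - n_lags + 1 : t + 1])
        y.append(series[t + horizon])
    return X, y

def eps_insensitive_loss(y_true, y_pred, eps):
    """Mean epsilon-insensitive loss: errors smaller than eps cost nothing."""
    total = sum(max(0.0, abs(a - b) - eps) for a, b in zip(y_true, y_pred))
    return total / len(y_true)

# Hypothetical closing prices; 3 preceding values, forecast 2 days ahead
prices = [1, 2, 3, 4, 5, 6, 7, 8]
X, y = make_lagged(prices, n_lags=3, horizon=2)
print(X[0], y[0])
```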
ARTICLE | doi:10.20944/preprints202008.0058.v1
Subject: Environmental And Earth Sciences, Geography Keywords: Rwandz; residential function; GIS; correlation; regression
Online: 3 August 2020 (00:37:42 CEST)
A house is the haven that shelters people from natural and human conditions; it gives them trust, safety, and stability. Housing is one of the most basic human needs, so it has become a major function that cities offer and one of the aspects that most interests urban researchers, who take into consideration a wide range of architectural, social, and economic indicators. The study aims to provide an overall picture of the residential functions of Rwandz, using a collection of parameters and some GIS and statistical techniques, to help establish plans and future projects to improve the growth of this city and of other towns and cities in the area. The study found that the old parts of Rwandz city, which are located in the core, differ in many properties from the relatively newer outer parts: in general, the core is more densely populated and has larger family sizes, more illiteracy and unemployment, lower incomes, and older and smaller houses, while the outer parts show the opposite. In addition, the study tested the correlation coefficients between the criteria and found some strong statistical relationships between them, which reflect real-life properties of the residential function. Lastly, the study designed a regression model to predict the main residential function criteria.
ARTICLE | doi:10.20944/preprints201902.0135.v1
Subject: Business, Economics And Management, Finance Keywords: recovery rates; beta regression; credit risk
Online: 14 February 2019 (11:30:03 CET)
Based on a rich data set of recoveries donated by a debt collection business, recovery rates for non-performing loans taken from a single European country are modelled using linear regression, linear regression with Lasso, beta regression and inflated beta regression. We also propose a two-stage model: beta mixture model combined with a logistic regression model. The proposed model allows us to model the multimodal distribution we find for these recovery rates. All models are built using loan characteristics, default data and collections data prior to purchase by the debt collection business. The intended use of the models is to estimate future recovery rates for improved risk assessment, capital requirement calculations and bad debt management. They are compared using a range of quantitative performance measures under K-fold cross validation. Among all the models, we find that the proposed two-stage beta mixture model performs best.
ARTICLE | doi:10.20944/preprints201809.0499.v1
Subject: Biology And Life Sciences, Ecology, Evolution, Behavior And Systematics Keywords: aquatics; modeling; boosted regression trees; appalachians
Online: 26 September 2018 (05:23:02 CEST)
Understanding the influences of multiple stressors across the landscape on aquatic biota is important for conservation, as it allows for an understanding of spatial patterns and informs stakeholders of significant conservation value. Data exist for land use/landcover (LULC) and other physicochemical components of the landscape throughout the Appalachian region, yet biological data are sparse. This dearth of biological data relative to LULC and physicochemical data creates difficulties in making informed management and conservation decisions across large landscapes. At the HUC12 watershed scale we sought to create a single score for both abiotic and biotic values throughout the central and southern Appalachian region. We used boosted regression trees (BRT) to model biological responses (fish and aquatic macroinvertebrate variables) to abiotic variables. Variance explained by BRT models ranged from 62-94%. We categorized predictor and response variables into themes and targets, respectively, to better understand large-scale patterns on the landscape that influence the biological condition of streams. We combined predicted values for a suite of response variables from BRT models to create a single watershed score for aquatic macroinvertebrates and fish. Regional models were developed for fish, but we were unable to develop regional models for aquatic macroinvertebrates due to the low number of sample sites. There was strong correlation between regional and global watershed scores for fish models but not between fish and aquatic macroinvertebrate models. Use of such multimetric scores can inform managers, NGOs, and private land owners regarding land use practices, thereby contributing to large-scale landscape conservation efforts.
ARTICLE | doi:10.20944/preprints201712.0032.v1
Subject: Engineering, Energy And Fuel Technology Keywords: statistics; uncertainty; regression; sampling; outlier; probabilistic
Online: 6 December 2017 (06:36:02 CET)
Energy Measurement and Verification (M&V) aims to make inferences about the savings achieved in energy projects, given the data and other information at hand. Traditionally, a frequentist approach has been used to quantify these savings and their associated uncertainties. We demonstrate that the Bayesian paradigm is an intuitive, coherent, and powerful alternative framework within which M&V can be done. Its advantages and limitations are discussed, and two examples from the industry-standard International Performance Measurement and Verification Protocol (IPMVP) are solved using the framework. Bayesian analysis is shown to describe the problem more thoroughly and yield richer information and uncertainty quantification than the standard methods while not sacrificing model simplicity. We also show that Bayesian methods can be more robust to outliers. Bayesian alternatives to standard M&V methods are listed, and examples from literature are cited.
COMMUNICATION | doi:10.20944/preprints202111.0549.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: Principal Component Regression, Partial Least Squares, Orthogonal Partial Least Squares, multivariate regression, hypothesis generation, Parkinson’s disease
Online: 29 November 2021 (15:42:03 CET)
In the current era of ‘big data’, scientists are able to quickly amass enormous amounts of data in a limited number of experiments. The investigators then try to hypothesize about the root cause based on the observed trends for the predictors and the response variable. This involves identifying the discriminatory predictors that are most responsible for explaining variation in the response variable. In the current work, we investigated three related multivariate techniques: Principal Component Regression (PCR), Partial Least Squares or Projections to Latent Structures (PLS), and Orthogonal Partial Least Squares (OPLS). To perform a comparative analysis, we used a publicly available dataset for Parkinson’s disease patients. We first performed the analysis using a cross-validated number of principal components for the aforementioned techniques. Our results demonstrated that PLS and OPLS were better suited than PCR for identifying the discriminatory predictors. Since the X data did not exhibit a strong correlation, we also performed Multiple Linear Regression (MLR) on the dataset. A comparison of the top five discriminatory predictors identified by the four techniques showed a substantial overlap between the results obtained by PLS, OPLS, and MLR, and the three techniques exhibited a significant divergence from the variables identified by PCR. A further investigation of the data revealed that PCR could be used to identify the discriminatory variables successfully if the number of principal components in the regression model were increased. In summary, we recommend using PLS or OPLS for hypothesis generation and systemizing the selection process for principal components when using PCR.
ARTICLE | doi:10.20944/preprints201907.0351.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: evaporation; meteorological parameters; Gaussian process regression; support vector regression; machine learning modeling; hydrology; prediction; data science; hydroinformatics
Online: 31 July 2019 (10:58:29 CEST)
Evaporation is one of the main processes in the hydrological cycle, and it is one of the most critical factors in agricultural, hydrological, and meteorological studies. Due to the interactions of multiple climatic factors, evaporation is a complex and nonlinear phenomenon; therefore, data-based methods can be used to estimate it precisely. In the present study, Gaussian Process Regression (GPR), Nearest-Neighbor (IBK), Random Forest (RF) and Support Vector Regression (SVR) were used to estimate the pan evaporation (PE) at the meteorological stations of Golestan Province, Iran. For this purpose, meteorological data including PE, temperature (T), relative humidity (RH), wind speed (W) and sunny hours (S) were collected from the Gonbad-e Kavus, Gorgan and Bandar Torkman stations from 2011 through 2017. The accuracy of the studied methods was determined using the statistical indices of Root Mean Squared Error (RMSE), correlation coefficient (R) and Mean Absolute Error (MAE). Furthermore, Taylor charts were used to evaluate the accuracy of the models. The results indicate that, in the optimum state for the Gonbad-e Kavus, Gorgan and Bandar Torkman stations respectively, GPR had error values of 1.521, 1.244, and 1.254, IBK had error values of 1.991, 1.775, and 1.577, RF had error values of 1.614, 1.337, and 1.316, and SVR had error values of 1.55, 1.262, and 1.275. GPR with input parameters T, W and S for the Gonbad-e Kavus station, and GPR with input parameters T, RH, W, and S for the Gorgan and Bandar Torkman stations, gave the most accurate performance and are proposed for precise estimation of PE.
Given the high rate of evaporation in Iran and the lack of measurement instruments, the findings of the current study indicate that PE values can be estimated accurately from a few easily measured meteorological parameters.
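The three accuracy indices named in this abstract (RMSE, MAE and the correlation coefficient R) have standard definitions, sketched here in plain Python. The observed and predicted values are hypothetical placeholders, not data from the study.

```python
import math

def rmse(obs, pred):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error between observations and predictions."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)

# Hypothetical daily pan evaporation values (observed vs. predicted)
observed = [4.2, 5.1, 6.3, 7.0, 5.8]
predicted = [4.0, 5.5, 6.0, 7.4, 5.5]
print(rmse(observed, predicted), mae(observed, predicted), pearson_r(observed, predicted))
```

Note that a constant offset between predictions and observations leaves R at 1 while RMSE and MAE grow, which is one reason to report the indices together.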
ARTICLE | doi:10.20944/preprints202307.1405.v1
Subject: Engineering, Chemical Engineering Keywords: neural network regression; wastewater quality; spectral reflectance
Online: 20 July 2023 (10:44:00 CEST)
Wastewater (WW) analysis is a critical step in various operations such as control of a WW treatment facility, and speeding-up the analysis of WW quality can significantly improve such operations. This work demonstrates the capability of neural network (NN) regression models to estimate WW characteristic properties such as biochemical oxygen demand (BOD), chemical oxygen demand (COD), ammonia (NH3-N), total dissolved substances (TDS), total alkalinity (TA), and total hardness (TH) by training on WW spectral reflectance in the visible to near-infrared spectrum (400nm-2000nm). The dataset contains samples of spectral reflectance intensity, which were the inputs, and the WW parameter levels (BOD, COD, NH3-N, TDS, TA, and TH), which were the outputs. Various NN model configurations were evaluated in terms of regression model fitness. The mean-absolute-error (MAE) was used as the metric for training and testing the NN models, and the coefficient of determination (R2) between the model predictions and true values was also computed to measure how well the NN models predict the true values. With online spectral measurements, the trained neural network model can provide non-contact and real-time estimation of WW quality at minimum estimation error.
ARTICLE | doi:10.20944/preprints202305.1678.v1
Subject: Business, Economics And Management, Economics Keywords: Europe; Income Distribution; Relative Distribution; RIF-regression
Online: 24 May 2023 (03:34:42 CEST)
The issue of polarization, as opposed to inequality, has been little explored for European countries. This paper, using harmonized data produced by the Luxembourg Income Study Database, observes income trends for 12 European countries, showing an increase in polarization in many of the countries considered. The drivers that led to this concentration of income are also analyzed, noting heterogeneous factors within countries.
ARTICLE | doi:10.20944/preprints202305.0792.v1
Subject: Business, Economics And Management, Business And Management Keywords: Baltic Dry Index; Covid-19; Stepwise Regression
Online: 11 May 2023 (05:11:46 CEST)
The outbreak of COVID-19 in 2020 caused significant disruptions to global shipping and the world economy. This paper aims to investigate the impact of the pandemic on global shipping by analyzing the Baltic Dry Index (BDI). The BDI is a metric that reflects worldwide shipping costs and is directly related to supply and demand conditions, making it an indicator of economic production. The study uses data from 2019 to 2021, before and after the outbreak of COVID-19, and considers 13 independent variables, including raw materials, energy, stock market indexes, global port calls, and confirmed COVID-19 cases, to investigate how they influence the BDI. The study employs stepwise regression to select variables and build models before and after the pandemic. The findings reveal that the key factors affecting the BDI before the outbreak were international scrap steel prices, iron ore prices, and the Commodity Research Bureau Index. After the COVID-19 outbreak, however, the factors affecting the BDI changed to the Shanghai Index, global port calls, and the number of confirmed COVID-19 cases.
ARTICLE | doi:10.20944/preprints202305.0096.v1
Subject: Computer Science And Mathematics, Mathematics Keywords: Topological indices; Fibrates; Curvilinear regression; QSPR analysis
Online: 3 May 2023 (04:48:22 CEST)
The paper describes the use of topological indices in conjunction with high-cholesterol drugs, specifically Fibrates, to predict their physicochemical properties and biological activities. Fibrates are known to lower high triglycerides, increase HDL cholesterol, and reduce the small dense fraction of LDL cholesterol. The study uses a quantitative structure-property relationship (QSPR) approach, which involves analyzing the relationships between physicochemical properties and topological indices using curvilinear regression. The QSPR model predicts the physicochemical properties of the drugs based on degrees and distances determined from topological indices. The study also conducted density functional theory (DFT) calculations at the B3LYP/6-31G(d,p) level on the four investigated derivatives to gain insights into their optimized geometries, DOS plots, HOMO and LUMO orbital energies, and distribution. The theoretical results presented in the study suggest that the use of topological indices in QSPR models could provide a powerful tool for predicting the physicochemical properties and biological activities of molecules, including drugs. These findings could lead to the development of new cholesterol-lowering drugs with desirable properties.
ARTICLE | doi:10.20944/preprints202205.0417.v1
Subject: Medicine And Pharmacology, Epidemiology And Infectious Diseases Keywords: COVID-19; Eswatini; risk mapping; Poisson regression
Online: 31 May 2022 (11:04:12 CEST)
COVID-19 national spikes had been reported at varying temporal scales as a result of differences in the driving factors. Factors affecting case load and mortality rates have varied between countries and regions. We investigated the association of various socio-economic, demographic and health variables with the spread of COVID-19 cases in Eswatini using the maximum likelihood estimation method for count data. A generalized Poisson regression (GPR) model was fitted to data comprising fifteen covariates to predict COVID-19 risk in Eswatini. The results showed that variables that were key determinants in the spread of the disease were those that included the proportion of elderly above 55 years at 98% (95% CI: 97%-99%) and the proportion of youth below 35 years at 0.08% (95% CI: 0.017%-38%) with a pseudo R-square of 0.72. However, in the early phase of the virus when cases were fewer, results from the Poisson regression showed that household size, household density and poverty index were associated with COVID-19. We produced a risk map of predicted COVID-19 in Eswatini using the variables that were selected at the 5% significance level. The map could be used by the country to plan and prioritize health interventions against COVID-19. The identified areas of high risk may be further investigated in order to find out the risk amplifiers and assess what could be done to prevent them.
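For readers unfamiliar with maximum likelihood estimation for count data, the sketch below fits an ordinary Poisson regression with a log link and one covariate by Newton's method. It is not the generalized Poisson model used in the study, which adds a dispersion parameter; the function name `fit_poisson` and the data are hypothetical.

```python
import math

def fit_poisson(x, y, n_iter=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton's method on the Poisson likelihood.

    Assumes x takes at least two distinct values, otherwise the 2x2 system
    solved at each step is singular.
    """
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the (negative) Hessian X^T diag(mu) X
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical case counts against a single covariate
covariate = [0, 1, 2, 3, 4]
cases = [2, 2, 3, 4, 6]
print(fit_poisson(covariate, cases))
```

The exponential of the fitted slope is the multiplicative change in the expected count per unit increase in the covariate.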
ARTICLE | doi:10.20944/preprints202107.0139.v1
Subject: Business, Economics And Management, Accounting And Taxation Keywords: circularity; waste streams; circular approaches; regression equation
Online: 6 July 2021 (11:40:19 CEST)
In this paper, the authors identified key elements important for circularity: (1) Background: The primary goal of circularity is to eliminate waste and to prove the constant use of resources. In the paper, we classify studies according to circular approaches. The authors identified the main elements and classified them into categories important for circularity, starting with the management and reduction of waste and the recovery of resources, and ending with the circularity of materials and general circularity-related topics, and presented scientific works dedicated to each of these categories. The authors analyzed several core elements from the first category, aiming to investigate and connect different waste streams, and provided a regression model; (2) Methods: The authors used a dynamic regression model to identify relationships among variables and selected those that have an impact on the increase of biowaste. The research was carried out for the 27 European Union countries during the period between 2020 and 2019; (3) Conclusions: The authors indicated that the recycling rate of waste electrical equipment in the previous year has an impact on the increase in recycled biowaste in the following year. This is explained by the fact that non-metallic spare parts of electronic equipment are used as biowaste for fuel production. Because separating the composites of electrical equipment takes some time, the effect becomes evident, on average, within one year.
ARTICLE | doi:10.20944/preprints202012.0321.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: quantile regression; groundwater; environmental; multivariate; metals; health
Online: 14 December 2020 (10:13:09 CET)
One of the most important defining characteristics of groundwater quality is pH as it fundamentally controls the amount and chemical form of many organic and inorganic solutes in groundwater. Groundwater data are frequently characterized by a wide degree of variability of the factors which possibly influence pH distribution. For this reason, it is challenging to link the spatio-temporal dynamics of pH to a single environmental factor by the ordinary least squares regression technique of the conditional mean. In this study, quantile regression was used to estimate the response of pH to nine environmental factors (As, Cd, Fe, Mn, Pb, turbidity, electrical conductivity, total dissolved solids and nitrates). Results of 25%, 50%, 75% quantile regression and ordinary least squares (OLS) regression were compared. The standard regression of the conditional means (OLS) underestimated the rates of change of pH due to the selected factors in comparison with the regression quantiles. The effect of arsenic increased for sampling locations with higher pH values (higher quantiles) likewise the influence of Pb and Mn. However, the effects of Cd and Fe decreased for sampling locations in higher quantiles. It can be concluded that these detected heterogeneities would be missed if this study had focused exclusively on the conditional means of the pH values. Consequently, quantile regression provides a more comprehensive account of possible spatio-temporal relationships between environmental covariates in groundwater. This study is one of the first to apply this technique on groundwater systems in sub-Saharan Africa. The approach is useful and interesting and has broad application for other mining environments especially tropical low-income countries where climatic conditions can drive rapid cycling or transformations of pollutants. 
It is also pertinent to geopolitical contexts where regulatory, monitoring and management capacities are weak and where mining pollution of groundwater largely occurs.
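The contrast between the conditional mean and conditional quantiles drawn in this abstract rests on the loss being minimized. Quantile regression replaces squared error with the pinball (check) loss; the sketch below shows, for the simplest case of a constant prediction, that minimizing this loss recovers a sample quantile rather than the mean. The function names and the sample values are hypothetical.

```python
def pinball_loss(y, q, tau):
    """Total pinball (check) loss of the constant prediction q at level tau."""
    total = 0.0
    for yi in y:
        u = yi - q
        # Under-predictions cost tau per unit, over-predictions cost (1 - tau)
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total

def best_constant(y, tau):
    """Sample value that minimizes the pinball loss at level tau."""
    return min(y, key=lambda q: pinball_loss(y, q, tau))

# Hypothetical measurements with one extreme value
sample = [1, 2, 3, 4, 100]
print(best_constant(sample, 0.5))   # a median, unaffected by the extreme value
print(sum(sample) / len(sample))    # the mean, pulled towards the extreme value
```

In a full quantile regression the constant q is replaced by a linear function of the covariates, and the same loss is minimized over its coefficients.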
REVIEW | doi:10.20944/preprints202111.0310.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: Functional Data Analysis (FDA); Hybrid Data; Semi-Functional Partial Linear Regression Model (SFPLR); Partial Functional Linear Regression; Literature Review
Online: 17 November 2021 (15:21:19 CET)
Background: In functional data analysis (FDA), hybrid or mixed data consist of both scalar and functional datasets. The semi-functional partial linear regression model (SFPLR) is one of the first semiparametric models for a scalar response with hybrid covariates. Various extensions of this model are explored and summarized. Methods: The first two research articles, “semi-functional partial linear regression model” and “Partial functional linear regression”, have more than 300 citations in Google Scholar. After applying the inclusion and exclusion criteria, namely 1) including articles published in ISI journals and excluding 2) non-English articles and 3) preprints, slides, and conference papers, only 106 articles remained. We use the PRISMA standard for systematic review. Results: The articles are categorized into the following main topics: estimation procedures, confidence regions, time series and panel data, Bayesian, spatial, robust, testing, quantile regression, varying coefficient models, variable selection, single-index model, measurement error, multiple functions, missing values, rank method, and others. There are different applications and datasets, such as the Tecator dataset, air quality, electricity consumption, and neuroimaging, among others. Conclusions: SFPLR is one of the best-known regression modeling methods for hybrid data and has many extensions among other models.
ARTICLE | doi:10.20944/preprints201910.0238.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: hybrid machine learning model; transportation infrastructure; flexible pavement; remaining service life prediction; pavement condition index; support vector regression; fruit fly optimization algorithm (foa); gene expression programming (gep); svr-foa
Online: 20 October 2019 (17:11:10 CEST)
Remaining service life (RSL) of pavement, as a sign of future pavement performance, has always received growing attention from pavement engineers. The RSL describes the time from the moment of pavement inspection until such a time when a major repair or reconstruction is required. The conventional approach to determining RSL involves using non-destructive tests. These tests, in addition to being costly, interfere with traffic flow and compromise users' safety. In this paper, surface distresses of pavement have been used to estimate the pavement’s RSL in order to eliminate the aforementioned problems and challenges. To implement the proposed theory, 105 flexible pavement segments were taken from Shahrood-Damghan Highway (Highway 44) in Iran. For each pavement segment, the type, severity, and extent of surface damage and pavement condition index (PCI) were determined. The pavement RSL was then estimated using non-destructive tests including Falling Weight Deflectometer (FWD) and Ground Penetrating Radar (GPR). After completing the dataset, the modeling was conducted to predict RSL using three techniques: Support Vector Regression (SVR), Support Vector Regression Optimized by Fruit Fly Optimization Algorithm (SVR-FOA), and Gene Expression Programming (GEP). All three techniques estimated the RSL of the pavement by selecting the PCI as input. The Correlation Coefficient (CC), Nash-Sutcliffe efficiency (NSE), Scattered Index (SI), and Willmott’s Index of agreement (WI) criteria were used to examine the performance of the three techniques adopted in this study. In the end, it was found that GEP with values of 0.874, 0.598, 0.601, and 0.807 for CC, SI, NSE, and WI criteria, respectively, had the highest accuracy in predicting the RSL of pavement.
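Three of the criteria named above have common textbook definitions, sketched here: Nash-Sutcliffe efficiency, Willmott's index of agreement, and the scatter index taken as RMSE divided by the mean observation. The paper may use slightly different variants, so the formulas and the sample values are assumptions for illustration only.

```python
import math

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches predicting the mean."""
    mean_o = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - num / den

def willmott_index(obs, pred):
    """Willmott's index of agreement (original squared-error form)."""
    mean_o = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den

def scatter_index(obs, pred):
    """Scatter index, assumed here to be RMSE divided by the mean observation."""
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))
    return rmse / (sum(obs) / len(obs))

# Hypothetical remaining service life values in years
observed = [2.0, 4.0, 6.0]
predicted = [2.5, 3.5, 6.5]
print(nse(observed, predicted), willmott_index(observed, predicted),
      scatter_index(observed, predicted))
```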
ARTICLE | doi:10.20944/preprints202312.0092.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Active Learning; Design of experiments; Regression; s-PGD
Online: 1 December 2023 (15:04:37 CET)
Machine learning approaches are currently used to understand or model complex physical systems. In general, a substantial number of samples must be collected to create a model with reliable results. However, collecting large amounts of data is often time-consuming or expensive. Moreover, problems of industrial interest tend to be increasingly complex and to depend on a large number of parameters. High-dimensional problems intrinsically require large amounts of data because of the curse of dimensionality. For this reason, new approaches based on smart sampling techniques, such as Active Learning methods, are being investigated to minimize the number of samples needed to train the model. Here, we propose a technique based on a combination of the Fisher information matrix and Sparse Proper Generalized Decomposition that enables the definition of a new Active Learning informativeness criterion in high dimensions. We provide examples demonstrating the performance of this technique on a theoretical 5D polynomial function and on an industrial crash simulation application. The results show that the proposed strategy outperforms the usual ones.
ARTICLE | doi:10.20944/preprints202311.1782.v1
Subject: Business, Economics And Management, Economics Keywords: DEA; wood processing enterprises; small enterprises; fractional regression
Online: 28 November 2023 (07:49:48 CET)
Micro and small wood-processing enterprises represent the heart of the European forest-based industries, being among the key drivers of economic growth in rural, mountainous, and poor regions. Their economic efficiency is of fundamental importance for their existence and the provision of income for the local population in rural areas. Data Envelopment Analysis (DEA) is a nonparametric, linear-programming-based approach, commonly used to analyse the efficiency of organizational units. This method allows estimating the economic efficiency of a certain economic system without assumptions about the functional form between resources and products. Furthermore, DEA determines the efficiency frontier and indicates whether an enterprise, i.e., a Decision Making Unit (DMU), is efficient or not. The main objective of this study was to investigate and evaluate the economic efficiency of micro and small wood-processing enterprises in the EU countries and reveal the hidden inputs that facilitate efficiency generation. The economic efficiency evaluation was carried out on the basis of the official statistical data for the micro and small wood-processing companies in the EU member states for the period 2015-2020 by performing a two-stage DEA analysis. The data used were standardized by value per employee. In addition to the first stage of DEA, fractional regression probit and logit models with four contextual variables were used to reveal the influence of the hidden inputs in the model. The results showed that the micro and small wood-processing enterprises can be regarded as more scale-efficient than technically-efficient entities. The only contextual variable affecting the economic efficiency was Investments per Person Employed, improving the efficiency by 2% per 1% increase of the investments.
ARTICLE | doi:10.20944/preprints202311.1435.v1
Subject: Business, Economics And Management, Finance Keywords: Exchange Rate Volatility; Exports; NARDL; Smooth Threshold Regression
Online: 22 November 2023 (13:48:53 CET)
This research paper aimed to examine the impact of exchange rate volatility on South Africa's exports from 1994 Q1 to 2023 Q2. The study used the Augmented Dickey-Fuller (ADF) and Phillips-Perron (PP) tests to test for stationarity. The nonlinear autoregressive distributed lag (NARDL) model and smooth threshold regression (STR) were employed to analyse the relationship between exchange rate volatility and exports. The GARCH(1,1) technique was used to construct the exchange rate volatility data. The results of the stationarity tests reveal that the variables are integrated of order I(0) or I(1). This implies that the variables used in this study are stationary, which is crucial for conducting accurate analyses. Moreover, the NARDL approach provided insights into the long-run effects of exchange rate volatility on South Africa's exports. Based on the NARDL test, positive shocks have a greater but statistically insignificant effect on exports than negative shocks. Therefore, a greater level of exchange rate volatility may lead to increased exports from South Africa. Furthermore, the STR also reveals that the impact of exchange rate volatility is insignificant. These findings provide valuable insights for policymakers and firms to make informed decisions regarding exchange rate management and export strategies in South Africa.
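As background to the volatility construction mentioned above, a textbook GARCH(1,1) specification can be written as follows. This is a generic sketch: the mean equation and the symbols (r_t, mu, omega, alpha, beta) are assumptions for illustration, and the abstract does not say which mean equation or error distribution was used.

```latex
% r_t: exchange-rate return in period t (assumed notation)
\begin{aligned}
r_t &= \mu + \varepsilon_t, \qquad \varepsilon_t = \sigma_t z_t, \qquad z_t \sim \text{i.i.d.}(0,1), \\
\sigma_t^{2} &= \omega + \alpha\,\varepsilon_{t-1}^{2} + \beta\,\sigma_{t-1}^{2},
\qquad \omega > 0,\ \alpha \ge 0,\ \beta \ge 0 .
\end{aligned}
% Covariance stationarity of \varepsilon_t requires \alpha + \beta < 1.
```

In this setup the fitted conditional variance (or its square root) would serve as the volatility series that enters the export equation.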
REVIEW | doi:10.20944/preprints202310.1913.v3
Subject: Engineering, Civil Engineering Keywords: Solar PV system; Regression Model; DOE; Solar energy; Fossil fuels
Online: 9 November 2023 (10:58:47 CET)
The negative environmental impacts and other problems associated with fossil fuels have forced many countries to investigate and switch to environmentally friendly, renewable alternatives to sustain the increasing energy demand. Solar energy is one of the best renewable energy sources with the least negative impacts on the environment. Different countries have formulated solar energy policies to reduce dependence on fossil fuels and increase domestic energy production from solar energy. According to the 2010 BP Statistical Energy Survey, the world cumulative installed solar energy capacity was 22928.9 MW in 2009, a change of 46.9% compared to 2008. In this study, a PV generation system has been modeled and installed considering uncertain weather based on the hourly wind speed data of New York City (NYC) for the year 2014. Regression models have been used to forecast the hourly, weekly, and monthly wind speed of NYC for the year 2014. Design of experiment (DOE) has been used to determine the optimal panel size (area), the battery capacity size, and other levels of factors.
ARTICLE | doi:10.20944/preprints202308.0823.v1
Subject: Engineering, Bioengineering Keywords: chicken egg fertility; classification; PLS regression; hyperspectral imaging
Online: 10 August 2023 (08:59:12 CEST)
Partial least squares (PLS) regression is a well-known chemometric method used for predictive modelling, especially in the presence of many variables. Although PLS was not initially developed as a technique for classification tasks, scientists have reportedly used this approach successfully for discrimination purposes. Whereas some non-supervised learning approaches, including but not limited to PCA and k-means clustering, do well in identifying/understanding grouping and clustering patterns in multidimensional data, they are limited when the end target is discrimination, making PLS a preferable alternative. Hyperspectral imaging data from a total of 672 fertilized chicken eggs, consisting of 336 white eggs and 336 brown eggs, were used in this study. Hyperspectral images in the NIR region of 900-1700 nm wavelength range were captured prior to incubation on day 0 and on days 1-4 after incubation. Eggs were candled on incubation day 5 and broken out on day 10 to confirm fertility. While a total number of 312 and 314 eggs were found to be fertile in the brown and white egg batches respectively, total numbers of non-fertile eggs in the same set of batches were 23 and 21 respectively. Spectral information was extracted from a segmented region of interest (ROI) of each hyperspectral image and spectral transmission characteristics were obtained by averaging the spectral information. A moving-thresholding technique was implemented for discrimination based on PLS regression results on the calibration set. With true positive rates (TPR) of up to 100% obtained at selected threshold values between 0.50 and 0.85 and on different days of incubation, the results indicated that the proposed PLS technique can accurately discriminate between fertile and non-fertile eggs. The adaptive PLS approach was thereby presented as suitable for handling hyperspectral imaging-based chicken egg fertility data.
ARTICLE | doi:10.20944/preprints202211.0227.v1
Subject: Medicine And Pharmacology, Orthopedics And Sports Medicine Keywords: Bayesian; cardiovascular disease; CVD; cross-sectional; logistic regression
Online: 14 November 2022 (01:55:06 CET)
Background: Cardiovascular disease (CVD) has been one of the leading causes of death and disability-adjusted life years lost worldwide. Blood pressure, lipid, and cholesterol are good predictors of CVD risk and vary with age and physical fitness. However, few studies have explored the trends in CVD risk factors across different populations according to age and muscle strength. Objective: To analyze the trends in CVD risk factors in blood according to age and relative grip strength among different populations. Method: 25363 participants were recruited in this cross-sectional study and 24709 were included in the analysis. Logistic regression and a Bayesian probabilistic analysis based on Markov Chain Monte Carlo (MCMC) modeling were conducted to build probability prediction models of hypertension, hyperlipidemia, and hypercholesterolemia according to age, relative grip strength, body weight conditions, and physical activity levels. Results: 1) Age might be the main influencing factor of hypertension, which is regarded as one of the primary CVD risk factors. Although keeping a high level of physical activity might help prevent hypertension, because individuals with normal body weight and higher physical activity show a lower probability of being diagnosed with hypertension, it might not prevent individuals from developing hypertension with age. 2) After 60, individuals of normal body weight seem more likely to have hyperlipidemia than those who are overweight or obese. 3) Larger relative grip strength might not be able to offset the negative effects of obesity, overweight and physical inactivity on hyperlipidemia. 4) The probability of getting hypercholesterolemia varies less with age and relative grip strength. Conclusion: Body weight management and keeping high levels of physical activity are recommended at any age. A modest increase in body weight after 60 years of age might be beneficial.
REVIEW | doi:10.20944/preprints202210.0391.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Tillage; Traction; Compaction; Neural networks; Support vector regression
Online: 26 October 2022 (02:07:19 CEST)
Soil working tools, implements, and machines are inevitable in mechanized agriculture. The soil-tool/machine interaction is a multivariate, dynamic, and intricate process. The accurate interpretation, description, and modeling of a soil-machine interaction is key to providing a solution to sustainable crop production by reducing energy input, excessive soil pulverization, and compaction. The traditional method provides insight into soil-machine interaction but often provides inadequate solutions and lacks broad applicability. Computational intelligence (CI) is a comprehensive class of approaches that rely on approximate information to solve complex problems. The CI method has been extensively studied and applied in the soil tillage and traction domain in recent decades. The study critically reviews the CI techniques implemented in soil-machine interactions, especially in the context of tillage, traction, and compaction. The traditional methods and their limitations are discussed. The fundamentals of CI methods and a detailed overview of the most popular methods are provided. The study reviews and summarizes the 50 selected articles on soil-machine interaction studies where CI methods were employed. It discusses the strengths and limitations of the employed CI methods. Emergent CI methods and future applications are also discussed. The outlined study would serve as a concise reference and a quick and systematic way to understand the applicable CI methods that allow crucial farm management decision-making.
ARTICLE | doi:10.20944/preprints202106.0533.v1
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: COVID-19; Vaccine; Prediction; Regression; Ensemble learning; AdaBoost
Online: 22 June 2021 (08:30:30 CEST)
The novel coronavirus disease (COVID-19) has created immense threats to public health on various levels around the globe. The unpredictable outbreak of this disease and the pandemic situation are causing severe depression, anxiety and other mental as well as physical health related problems among human beings. To combat this disease, vaccination is essential, as it boosts the immune system of human beings who come into contact with infected people. The vaccination process is thus necessary to confront the outbreak of COVID-19. This deadly disease has put the social and economic condition of the entire world under an enormous challenge. The worldwide vaccination progress should be tracked to identify how fast economic as well as social life will be stabilized. To monitor the vaccination progress, a machine learning based regressor model is applied in this study. This tracking process has been applied to data from 14 December 2020 to 24 April 2021. Several ensemble-based machine learning regressor models, such as Random Forest, Extra Trees, Gradient Boosting, AdaBoost and Extreme Gradient Boosting, are implemented and their predictive performance is compared. The comparative study reveals that the AdaBoostRegressor performs best, with the lowest mean absolute error (MAE) of 9.968 and root mean squared error (RMSE) of 11.133.
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: Diagnosing designs; rare diseases; statistics; regression; block designs
Online: 2 June 2021 (12:14:34 CEST)
Far too often, one meets patients who went for years or even decades from doctor to doctor, without getting a valid diagnosis. This brings pain to millions of patients and their families, not to speak of the enormous costs. Often patients cannot tell precisely enough which factors (or combinations thereof) trigger their problems. If conventional methods fail, we propose the use of statistics and algebra to give doctors much more useful inputs from patients. We use statistical regression for independent triggering factors for medical problems, and “balanced incomplete block designs” for non-independent factors. These methods can supply doctors with much more valuable inputs, and can also detect combinations of multiple factors by incredibly few tests. In order to show that these methods do work, we briefly describe a case in which these methods helped to solve a 60 year old problem in a patient, and give some more examples where these methods might be very useful. As a conclusion, while regression is used in clinical medicine, it seems to be widely unknown in diagnosing. Statistics and algebra can save the health systems much money, and the patients also a lot of pain.
ARTICLE | doi:10.20944/preprints202103.0586.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: NVOC; phytoncide; bamboo grove; monoterpene; microclimate; regression analysis
Online: 24 March 2021 (13:10:25 CET)
After the COVID-19 outbreak, more and more people have sought physiological and psychological healing by visiting forests as time spent at home grew longer. NVOC, a major healing factor of forests, has several positive effects on human health, and this study investigated the NVOC characteristics of bamboo groves. The study revealed that α-pinene, 3-carene, and camphene were the most emitted compounds, and that the largest amount of NVOC was emitted in the early morning and late afternoon in bamboo groves. Furthermore, NVOC emission was found to have positive correlations with temperature and humidity, and inverse correlations with solar radiation, PAR and wind speed. A regression analysis conducted to predict the effect of microclimate factors on NVOC emissions resulted in a regression equation with 82.9% explanatory power and found that PAR, temperature, and humidity had a significant effect on NVOC emission prediction. In conclusion, this study investigated NVOC emission characteristics of bamboo groves, examined the relationship between NVOC emissions and microclimate factors, and derived a prediction equation of NVOC emissions to assess the forest healing effects of bamboo groves. These results are expected to provide a basis for establishing more effective forest healing programs in bamboo groves.
ARTICLE | doi:10.20944/preprints202008.0329.v2
Subject: Medicine And Pharmacology, Epidemiology And Infectious Diseases Keywords: COVID-19; Geospatial Regression; Health Disparities; Public Health
Online: 11 September 2020 (09:48:57 CEST)
COVID-19 is a potentially fatal viral infection. This study investigates geography, demography, socioeconomics, health conditions, hospital characteristics, and politics as potential explanatory variables for death rates at the state and county levels. Data from the Centers for Disease Control and Prevention, the Census Bureau, Centers for Medicare and Medicaid, Definitive Healthcare, and USAfacts.org were used to evaluate regression models. Yearly pneumonia and flu death rates (state level, 2014-2018) were evaluated as a function of the governors’ political party using repeated measures analysis. At the state and county level, spatial regression models were evaluated. At the county level, we discovered a statistically significant model that included geography, population density, racial and ethnic status, three health status variables along with a political factor. State level analysis identified health status, minority status, and the interaction between governors’ parties and health status as important variables. The political factor, however, did not appear in a subsequent analysis of 2014-2018 pneumonia and flu death rates. The pathogenesis of COVID-19 has greater and disproportionate effect within racial and ethnic minority groups, and the political influence on the reporting of COVID-19 mortality was statistically relevant at the county level and as an interaction term only at the state level.
ARTICLE | doi:10.20944/preprints201906.0291.v1
Subject: Medicine And Pharmacology, Internal Medicine Keywords: endothelial disorders; glycocalyx injury; syndecan-1; nonlinear regression
Online: 28 June 2019 (07:42:18 CEST)
Endothelial disorders are related to various diseases. An initial endothelial injury is characterized by endothelial glycocalyx injury. We aimed to evaluate endothelial glycocalyx injury by measuring serum syndecan-1 concentrations in patients during comprehensive medical examinations. A single-center, prospective, observational study was conducted at Asahi University Hospital. The participants enrolled in this study were 1313 patients who underwent comprehensive medical examinations at Asahi University Hospital from January 2018, to June 2018. One patient undergoing hemodialysis was excluded from the study. At enrollment, blood samples were obtained, and study personnel collected demographic and clinical data. No treatments or exposures were conducted except for standard medical examinations and blood sample collection. Laboratory data were obtained by collection of blood samples at the time of study enrolment. According to nonlinear regression, the concentrations of serum syndecan-1 were significantly related to age (p = 0.016), aspartic aminotransferase concentration (AST, p = 0.020), blood urea nitrogen concentration (BUN, p = 0.013), triglyceride concentration (p < 0.001), and hematocrit (p = 0.006). These relationships were independent associations. Endothelial glycocalyx injury, which is reflected by serum syndecan-1 concentrations, is related to age, hematocrit, AST concentration, BUN concentration, and triglyceride concentration.
ARTICLE | doi:10.20944/preprints201811.0096.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: machine learning; stacking; forecasting; regression; sales; time series
Online: 5 November 2018 (09:54:54 CET)
In this paper, we study the usage of machine learning models for sales time series forecasting. The effect of machine learning generalization has been considered. A stacking approach for building a regression ensemble of single models has been studied. The results show that using stacking techniques, we can improve the performance of predictive models for sales time series forecasting.
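The stacking idea can be illustrated with a minimal, standard-library-only sketch. It is not the paper's pipeline: the two base models (last-value and moving-average forecasts), the linear meta-learner without intercept, and the `sales` numbers are all assumptions chosen to keep the example small. The base forecasts at time t use only data before t, so the meta-learner is fitted on out-of-sample base predictions, which is the essential point of stacking.

```python
def naive_forecast(history):
    """Base model 1: predict the most recent observation."""
    return history[-1]


def moving_average_forecast(history, window=3):
    """Base model 2: predict the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def base_predictions(series, start, end):
    """One-step-ahead base forecasts for each t in [start, end), using series[:t] only."""
    rows = []
    for t in range(start, end):
        history = series[:t]
        rows.append((naive_forecast(history), moving_average_forecast(history)))
    return rows


def fit_meta_weights(rows, targets):
    """Least-squares weights (w1, w2) for y ~ w1*p1 + w2*p2 via the 2x2 normal equations."""
    s11 = sum(p1 * p1 for p1, _ in rows)
    s12 = sum(p1 * p2 for p1, p2 in rows)
    s22 = sum(p2 * p2 for _, p2 in rows)
    b1 = sum(p1 * y for (p1, _), y in zip(rows, targets))
    b2 = sum(p2 * y for (_, p2), y in zip(rows, targets))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        return 0.5, 0.5  # base forecasts are collinear: fall back to a plain average
    return (b1 * s22 - b2 * s12) / det, (b2 * s11 - b1 * s12) / det


def stacked_forecast(series, weights):
    """Combine the two base forecasts for the next period with the meta weights."""
    w1, w2 = weights
    return w1 * naive_forecast(series) + w2 * moving_average_forecast(series)


# Hypothetical sales series, invented for illustration only.
sales = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24]
train_start = 3  # first index with enough history for the moving average

rows = base_predictions(sales, train_start, len(sales))
targets = sales[train_start:]
weights = fit_meta_weights(rows, targets)
next_value = stacked_forecast(sales, weights)
print("meta weights:", weights, "next-period forecast:", next_value)
```

Because the weights (1, 0) and (0, 1) reproduce the individual base models, the fitted combination can never have a larger squared error on the fitting window than either base model alone. Whether that advantage carries over to unseen periods has to be checked on a separate hold-out window.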
ARTICLE | doi:10.20944/preprints201608.0025.v2
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: solar variability; NAO; ENSO; volcanic eruptions; multiple regression
Online: 17 May 2017 (06:27:16 CEST)
The role of natural factors, mainly the eleven-year solar cycle variability and volcanic eruptions, in two major modes of climate variability, the North Atlantic Oscillation (NAO) and the El Niño Southern Oscillation (ENSO), is studied over roughly the last 150 years. The NAO is the primary factor regulating Central England Temperature (CET) during winter throughout the period, though the NAO is affected differently by other factors in different time periods. Solar variability indicates a strong positive influence on the NAO during 1978-1997, though it suggests the opposite in the earlier period. The solar-NAO lag relationship is also shown to be sensitive to the chosen times of reference and thus points towards the previously proposed mechanism/relationship related to the sun and the NAO. ENSO is influenced strongly by solar variability and volcanic eruptions in certain periods. This study observes a strong negative association between the sun and ENSO before the 1950s, which is even reversed during the second half of the 20th century. In the period 1978-1997, when two strong eruptions coincided with active years of strong solar cycles, ENSO and volcanic activity suggested a stronger association, and we discuss the important role played by ENSO. That period showed warming in the central tropical Pacific and cooling in the North Atlantic relative to the later period (1999-2017) and also relative to the chosen earlier period. Here we show that the mean atmospheric state is important for understanding the connection between solar variability, the NAO and ENSO and the associated mechanism. The study presents a critical analysis to improve knowledge about major modes of variability and their role in climate. We also discuss the importance of detecting the robust signal of natural variability, mainly the sun.
COMMENT | doi:10.20944/preprints201608.0166.v1
Subject: Social Sciences, Geography, Planning And Development Keywords: Regional inequality; Multilevel regression; Markov chain; Guizhou Province
Online: 17 August 2016 (12:58:58 CEST)
This study analyses regional development in one of the poorest provinces in China, Guizhou Province, between 2000 and 2012 using a multiscale and multi-mechanism framework. In general, regional inequality has been declining since 2000. In addition, economic development in Guizhou Province presented spatial agglomeration and club convergence, which shows how the development pattern of one core area, two-wing areas and a contiguous area at the edge of the province have been developed between 2006 and 2012. Multilevel regression analysis revealed that industrialization and investment level were the primary driving forces of regional economic disparity in Guizhou Province. The influences of marketization and decentralization on regional economic disparity were relatively weak. Investment level reinforced regional economic disparity and the development of core-periphery structure in the province. However, investment level actually weakened the regional economic disparity in Guizhou Province when the variable of time was considered. In addition, both the topography and urban–rural differentiation were the two main reasons for forming a core-periphery structure in Guizhou Province.
ARTICLE | doi:10.20944/preprints202304.1023.v1
Subject: Social Sciences, Safety Research Keywords: vehicle crash data; collision risk; ordinal logistic regression; multinomial logistic regression; proportional odds model (POM); partial proportional odds model (PPOM)
Online: 27 April 2023 (04:02:49 CEST)
The use of logistic regression models in data analysis and machine learning has expanded in recent years and has become the primary preference of researchers in risk assessment studies across a wide range of scientific fields, from the assessment of credit risk in financial institutions to the estimation of risk factors for traffic accidents or the identification of etiological factors for chronic diseases. All logistic models are natural extensions of the simple binary model, and their interpretation is based on it. Using the data of a cross-sectional study on the risk factors of traffic collisions, the two main extended models of logistic techniques, multinomial and ordinal logistic regression, are presented in the article in detail. Emphasis is placed on the use of ordinal regression since the outcome variable of the collision data is defined as an ordinal measurement reflecting a latent continuous scale.
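For orientation, the two ordinal models named in the keywords can be written compactly. The block below uses one common sign convention and assumed notation (outcome Y with J ordered categories, predictors x, cut-points alpha_j); the article itself may parameterize the models differently.

```latex
% Proportional odds model (POM): one slope vector shared by all cut-points
\operatorname{logit} P(Y \le j \mid \mathbf{x})
  = \alpha_j - \mathbf{x}^{\top}\boldsymbol{\beta},
  \qquad j = 1, \dots, J-1, \quad \alpha_1 < \alpha_2 < \dots < \alpha_{J-1}.

% Partial proportional odds model (PPOM): a subset \mathbf{t} of the predictors
% is allowed cut-point-specific effects \boldsymbol{\gamma}_j
\operatorname{logit} P(Y \le j \mid \mathbf{x})
  = \alpha_j - \mathbf{x}^{\top}\boldsymbol{\beta} - \mathbf{t}^{\top}\boldsymbol{\gamma}_j .
```

Setting every gamma_j to zero recovers the POM, and J = 2 reduces both models to the binary logistic model, which is why their interpretation builds on the binary case.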
ARTICLE | doi:10.20944/preprints202312.0131.v1
Subject: Engineering, Civil Engineering Keywords: epoxy resin; grout; creep; strength; permeability; porosity; regression analysis
Online: 5 December 2023 (06:08:16 CET)
The aim of this research was to undertake laboratory testing to investigate the beneficial effects of epoxy resin grouts on the physical and mechanical properties of sands with a wide range of granulometric characteristics. Six sands, of different particle size and uniformity coefficients, were grouted using epoxy resin solutions with three ratios of epoxy resin to water (3.0, 2.0 and 1.5). A set of unconfined compressive strength tests were conducted on grouted samples at different curing periods and a set of long-term unconfined compressive creep tests in dry and wet conditions after 180 days of curing were also carried out, in order to evaluate the development of the mechanical properties of the sands, as well as, the impact of water on them. The findings of the investigation showed that epoxy resin resulted in appreciable strength values in the specimens, especially those of fine sands, grouted with the different epoxy resin grouts. In general, the compressive strength varied between 0.68 - 5.60 MPa and the modulus of elasticity between 75 - 480 MPa, after a curing period of 180 days. In terms of physical properties, the permeability and porosity (before and after the grouting process) were estimated. Grouts with an epoxy resin to water ratio of 3 decreased permeability by up to four orders of magnitude. Using laboratory results and regression analysis, three mathematical equations were developed that relate each of the dependent variables; compressive strength, elastic modulus, and coefficient of permeability, with particular explanatory variables.
ARTICLE | doi:10.20944/preprints202311.0350.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: 3D segmentation; feature extraction; regression machine learning; weight estimation
Online: 6 November 2023 (11:20:30 CET)
Accurate weight measurement is pivotal for monitoring the growth and well-being of cattle. However, the conventional weighing process, which involves physically placing cattle on scales, is labor-intensive and distressing for the animals. Hence, the development of automated cattle weight prediction techniques assumes critical significance. This study proposes a weight prediction approach for Korean cattle using 3D segmentation-based feature extraction and regression machine learning techniques from incomplete 3D shapes acquired from real farm environments. In the initial phase, we generated mesh data of 3D Korean cattle shapes using a multiple-camera system. Subsequently, deep learning-based 3D segmentation with the PointNet network model was employed to segment two dominant parts of the cattle. From these segmented parts, three crucial dimensions of Korean cattle were extracted. Finally, we implemented five regression machine learning models (CatBoost regression, LightGBM, Polynomial regression, Random Forest regression, and XGBoost regression) for weight prediction. To validate our approach, we captured 270 Korean cattle in various poses, totaling 1190 poses. The best result was achieved with a mean absolute error (MAE) of 25.2 kg and a mean absolute percent error (MAPE) of 5.81% using the random forest regression model.
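The two error measures quoted above can be computed as in the short sketch below. The weights are hypothetical values made up for illustration and have no connection to the study's data; the function names are likewise assumptions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the units of the target (here kg)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to the true value, expressed in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical cattle weights in kg, invented for illustration only.
true_weights = [400.0, 500.0, 625.0]
predicted_weights = [380.0, 525.0, 600.0]

mae = mean_absolute_error(true_weights, predicted_weights)
mape = mean_absolute_percentage_error(true_weights, predicted_weights)
print(f"MAE = {mae:.2f} kg, MAPE = {mape:.2f}%")
```

MAE is reported in kilograms and weights every animal equally, whereas MAPE scales each error by the animal's true weight, so the same 25 kg miss counts for more on a light animal than on a heavy one.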
ARTICLE | doi:10.20944/preprints202310.0938.v1
Subject: Engineering, Mechanical Engineering Keywords: onion; peeling; compressed air; skin; waste; non-linear regression
Online: 16 October 2023 (09:11:18 CEST)
The paper presents the relationship between the efficiency of the onion skin peeling process and its effect in the form of waste. The research was carried out on a pilot test stand for onion peeling. The process variables were the compressed air pressure (p) and the opening time of the flow control valve (t). The experiment took into account the influence of the onion diameter (d0) and its hardness (H). The obtained results were subjected to statistical analysis. Standard deviations of the percentage loss of onion mass, in the form of the skin removed during peeling, were determined in relation to the obtained average values. Tukey's multiple comparison test was performed in order to identify the importance of individual process variables for the final effect of onion peeling. This was the basis for the development of a predictive model in the form of a nonlinear regression Mp=f(p,t,d0,H), which is a mathematical description of the onion skin peeling process. Finally, the response surface of the relationship between the analyzed variables was determined. The results showed that the peeling efficiency of the onion and the waste of skin mass depend on the compressed air pressure. Extending the onion blowing time does not improve the process efficiency, while the hardness and size of the onion are irrelevant to the process.
ARTICLE | doi:10.20944/preprints202309.0755.v1
Subject: Medicine And Pharmacology, Endocrinology And Metabolism Keywords: diabetes; CGM; hypoglycemia; hyperglycemia; prediction; ARIMA; logistic regression; LSTM
Online: 12 September 2023 (16:53:51 CEST)
Background: Novel technologies like continuous glucose monitoring (CGM) systems are improving diabetes management by means of real-time sensor glucose levels, the retrospective course of glucose, and trend arrows. CGM offers real-time alerts for (prognostic) hypo- and hyperglycemia and for fast dropping or increasing glucose, hence improving glycemia under unstable conditions such as meals, physical activity, and exercise management. Complex CGM systems challenge people with diabetes and health care professionals in interpreting rapid changes, sensor delay (~10-minute difference between interstitial and plasma glucose), and malfunctions. Enhanced prediction models are necessary for optimal insulin dosing, daily activities, and especially for future fully closed-loop systems. Methods: The aim of this study was to investigate the efficacy of three different predictive models for glucose responses, 1) an autoregressive integrated moving average model (ARIMA), 2) logistic regression, and 3) long short-term memory networks (LSTM), in predicting glucose levels after 15 minutes and one hour. We compared and evaluated the performance of these models in predicting hypoglycemia (<70 mg/dL), euglycemia (70-180 mg/dL), and hyperglycemia (>180 mg/dL). In more detail, by assessing metrics such as precision, recall, F1-score, and accuracy, we specifically assessed which model provided the most accurate and reliable predictions for glucose levels. Results: As expected, ARIMA showed the worst accuracy, especially when predicting hypoglycemia within 1 hour (7.3%). The accuracy of the logistic regression model in predicting hypoglycemia over the first 15 minutes was higher (98%) compared to LSTM (88%). However, the LSTM model (87%) exceeded the accuracy of the logistic regression (83%) for hypoglycemia prediction over a one-hour horizon. The same pattern was observed for hyperglycemia: ARIMA model (60%, 1 hour), logistic regression (96%, 15 minutes) and LSTM (85%, 1 hour). Conclusions: These findings suggest that different models may have varying strengths and weaknesses in predicting glucose levels, and the choice of model should be carefully considered based on the specific requirements and context of the clinical application. The logistic regression model was more accurate for the next 15 minutes, especially in predicting hypoglycemia. However, the LSTM model exceeded logistic regression for the next one-hour prediction. Future research could explore hybrid models or ensemble approaches that combine the strengths of multiple models to further improve the accuracy and reliability of glucose predictions.
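The evaluation described in the Methods can be sketched as follows: glucose values are mapped to the three ranges given in the abstract, and per-class precision, recall, and F1 are computed from measured versus forecast classes. The readings and the function names are hypothetical, and the study's actual evaluation code is not shown in the abstract.

```python
def glucose_class(mg_dl):
    """Map a glucose value in mg/dL to the three ranges stated in the abstract."""
    if mg_dl < 70:
        return "hypo"
    if mg_dl > 180:
        return "hyper"
    return "eu"  # 70-180 mg/dL inclusive


def per_class_metrics(actual, predicted, label):
    """Precision, recall, and F1 with `label` treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical CGM readings (mg/dL) and model forecasts for the same time points.
measured = [65, 72, 150, 190, 60, 200]
forecast = [68, 66, 140, 185, 75, 170]

actual = [glucose_class(g) for g in measured]
predicted = [glucose_class(g) for g in forecast]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
hypo_precision, hypo_recall, hypo_f1 = per_class_metrics(actual, predicted, "hypo")
print(f"accuracy={accuracy:.2f}, hypo precision={hypo_precision:.2f}, "
      f"recall={hypo_recall:.2f}, F1={hypo_f1:.2f}")
```

Reporting recall separately for the hypoglycemia class matters clinically, because a missed low-glucose event is typically more harmful than a false alarm, and overall accuracy can hide poor performance on this relatively rare class.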
ARTICLE | doi:10.20944/preprints202309.0302.v1
Subject: Computer Science And Mathematics, Robotics Keywords: stabilization; symbolic regression; synthesized control; evolutionary computations; quadcopter model
Online: 5 September 2023 (10:11:12 CEST)
The development of artificial intelligence systems assumes that a machine can independently generate an algorithm of actions or a control system to solve the tasks. To do this, the machine must have a formal description of the problem and possess computational methods for solving it. The article deals with the problem of optimal control, which is the main task in the development of control systems, since all systems being developed must be optimal from the point of view of a certain criterion. However, there are certain difficulties in implementing the resulting optimal control modes. The paper considers an extended formulation of the optimal control problem, which implies the creation of systems that have the properties necessary for practical implementation. To solve it, an adaptive synthesized optimal control approach based on the use of numerical methods of machine learning is proposed. The method moves the control object by optimally changing the position of the stable equilibrium point in the presence of some initial position uncertainty. As a result, from all possible synthesized controls, it chooses one that is less sensitive to changes in the initial states. As an example, the optimal control problem of a quadcopter with complex phase constraints is considered. To solve this problem according to the proposed approach, the control synthesis problem is first solved by a machine learning method of symbolic regression to obtain a stable equilibrium point in the state space. After that, optimal positions of the stable equilibrium point are searched for by a particle swarm optimization algorithm according to the original functional of the optimal control problem. It is shown that such an approach allows the control system to be generated automatically by computer based on the formal statement of the problem and then implemented directly onboard, since a stabilization system is already built in.
ARTICLE | doi:10.20944/preprints202308.1978.v1
Subject: Biology And Life Sciences, Life Sciences Keywords: biomarker, LLM, interpretability, scRNA-seq, machine learning, symbolic regression
Online: 30 August 2023 (03:53:31 CEST)
Single-cell RNA sequencing (scRNA-seq) technology has significantly advanced our understanding of the diversity of cells and how this diversity is implicated in diseases. Yet, translating these findings across various scRNA-seq datasets poses challenges due to technical variability and dataset-specific biases. To overcome this, we present a novel approach that employs both an LLM-based framework and explainable machine learning to facilitate generalization across single-cell datasets and identify gene signatures to capture disease-driven transcriptional changes. Our approach uses scBERT, which harnesses shared transcriptomic features among cell types to establish consistent cell-type annotations across multiple scRNA-seq datasets. Additionally, we employ a symbolic regression algorithm to pinpoint highly relevant yet minimally redundant models and features for inferring a cell type’s disease state based on its transcriptomic profile. We ascertain the versatility of these cell-specific gene signatures across datasets, showcasing their resilience as molecular markers to pinpoint and characterize disease-associated cell types. Validation is carried out using four publicly available scRNA-seq datasets from both healthy individuals and those suffering from ulcerative colitis (UC). This demonstrates our approach’s efficacy in bridging disparities specific to different datasets, fostering comparative analyses. Notably, the simplicity and symbolic nature of the retrieved gene signatures facilitate their interpretability, allowing us to elucidate underlying molecular disease mechanisms using these models.
ARTICLE | doi:10.20944/preprints202308.0314.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Hail; Lightning; Climate change; Regression analysis; Trends; Reanalysis data
Online: 3 August 2023 (10:07:40 CEST)
We have developed additive logistic models for the occurrence of lightning, large (≥ 2 cm), and very large (≥ 5 cm) hail to investigate the evolution of these hazards in the past, in the future, and for forecasting applications. The models, trained with lightning observations, hail reports, and predictors from atmospheric reanalysis, assign an hourly probability to any location and time on a 0.25° × 0.25° × 1-hourly grid as a function of reanalysis-derived predictor parameters, selected following an ingredients-based approach. The resulting hail models outperform the Significant Hail Parameter and the simulated climatological spatial distributions and annual cycles of lightning and hail are consistent with observations from storm report databases, radar, and lightning detection data. As a corollary result, CAPE released above the -10°C isotherm was found to be a more universally skilful predictor for large hail than CAPE. In the period 1950–2021, the models applied to the ERA5 reanalysis indicate significant increases of lightning and hail across most of Europe, primarily due to rising low-level moisture. The strongest modelled hail increases occur in northern Italy with increasing rapidity after 2010. Here, very large hail has become 3 times more likely than it was in the 1950s. Across North America trends are comparatively small, apart from isolated significant increases in the direct lee of the Rocky Mountains and across the Canadian Plains. In the southern Plains, a period of enhanced storm activity occurred in the 1980s and 1990s.
REVIEW | doi:10.20944/preprints202303.0401.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: Strawman fallacy; UK General Medical Council; autism; regression; MMR
Online: 22 March 2023 (14:39:30 CET)
Background: Articles published in scholarly journals form part of the scientific evidence base. It is the responsibility of the scientific community to maintain its integrity. In 2011 the BMJ commissioned a feature article to draw attention to an article that had appeared in another journal, The Lancet, 13 years previously. The Lancet had already retracted the article. These actions exemplify the best traditions of scientific record-keeping. Objective: This submission examines whether the summary of main claims made in the BMJ was factual. Method: We examine what was published in the Lancet against what was published in the BMJ and verify against the findings in the GMC hearings transcripts and verdict of the UK High Court. Results: The 6 points highlighted in the BMJ had errors and need to be corrected. Conclusions: There are significant differences between what was reported in the Lancet paper and what was alleged to be there by the BMJ. This article aims only to point to errors in the BMJ article, to set the record straight. It does not show there was a causal association between MMR vaccination and autism.
ARTICLE | doi:10.20944/preprints202210.0078.v1
Subject: Medicine And Pharmacology, Obstetrics And Gynaecology Keywords: Africa; Maternal mortality rate; Joinpoint regression analysis; mortality; trends.
Online: 7 October 2022 (10:30:10 CEST)
Background: United Nations Sustainable Development Goals state that by 2030, the Global maternal mortality rate (MMR) should be lower than 70 per 100,000 live births. MMR is still one of Africa's leading causes of death among women. This research aims to study regional trends in maternal mortality in Africa. Methods: We extracted data for Maternal mortality rates per 100,000 births from the UNICEF data bank from 2000 to 2017, 2017 being the last date available. Joinpoint regression was used to study the trends and estimate the annual percent change (APC). Results: Maternal mortality has decreased in Africa over the study period by an average APC of -3.0% (95% CI -2.9; -3.2%). All regions showed significant downward trends, with the sharpest decreases in the South. Only the North African region is close to the United Nations' sustainable development goals for Maternal mortality. The remaining sub-Saharan African regions are still far from achieving the goals. Conclusions: Maternal mortality has decreased in Africa, especially in the South Africa region. The only region close to the United Nations target is North Africa. The remaining sub-Saharan African regions are still far from achieving the goals. These results could be used for the development of Regional Policies.
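The annual percent change reported above is conventionally derived from the slope of a log-linear trend. The following is a minimal sketch of that calculation for a single segment, assuming a model of the form ln(rate) = a + b·year; it does not search for joinpoints, and the series is hypothetical rather than taken from the study.

```python
import math


def annual_percent_change(years, rates):
    """Fit ln(rate) = a + b*year by ordinary least squares
    and return APC = 100 * (exp(b) - 1)."""
    n = len(years)
    logs = [math.log(r) for r in rates]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    return 100.0 * (math.exp(slope) - 1.0)


# Hypothetical series (not study data): a rate falling by exactly 3% per year.
years = list(range(2000, 2018))
rates = [500.0 * 0.97 ** (y - 2000) for y in years]
apc = annual_percent_change(years, rates)
```

A full joinpoint analysis would additionally choose the number and location of breakpoints and report one APC per segment.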
ARTICLE | doi:10.20944/preprints202209.0353.v1
Subject: Medicine And Pharmacology, Obstetrics And Gynaecology Keywords: Africa; Maternal mortality rate; Joinpoint regression analysis; mortality; trends
Online: 23 September 2022 (03:06:07 CEST)
Background: United Nations Sustainable Development Goals state that by 2030, the Global maternal mortality rate (MMR) should be lower than 70 per 100,000 live births. MMR is still one of Africa's leading causes of death among women. This research aims to study regional trends in maternal mortality in Africa. Methods: We extracted data for Maternal mortality rates per 100,000 births from the World Bank database from 1990 to 2015. Joinpoint regression was used to study the trends and estimate the annual percent change (APC). Results: Maternal mortality has decreased in Africa over the study period by an average APC of -2.6%. All regions showed significant downward trends, with the sharpest decreases in East Africa. Only the North African region is close to the United Nations' sustainable development goals for Maternal mortality. The remaining sub-Saharan African regions are still far from achieving the goals. Conclusions: Maternal mortality has decreased in Africa, especially in East Africa. The only region close to the United Nations target is North Africa. The remaining sub-Saharan African regions are still far from achieving the goals. These results could be used for the development of Regional Policies.
ARTICLE | doi:10.20944/preprints202208.0445.v1
Subject: Business, Economics And Management, Economics Keywords: Adult children's education; parental longevity; truncated regression; emotional support.
Online: 26 August 2022 (04:18:44 CEST)
Background: In some developing countries, such as China, the population is aging rapidly while residents' average years of schooling are constantly increasing. However, the question of whether adult children’s education has an effect on the longevity of older parents remains inadequately studied. Methods: This paper uses China Health and Retirement Longitudinal Survey (CHARLS) data to estimate the causal impact of adult children's education on their parents' longevity. Identification is achieved by using the truncated regression model and using historical education data as instrumental variables for adult children’s education. Results: For every unit increase in adult children’s education, the father’s and mother’s longevity increased by 0.89 years and 0.75 years, respectively. Mechanism analysis shows that adult children's education has a significant positive impact on parents' emotional support, financial support and self-reported health. Further evidence shows that for every unit increase in adult children’s education, the father-in-law’s and mother-in-law’s longevity increased by 0.40 years and 0.46 years, respectively. Conclusions: We conclude that improving the level of adult children’s education can increase parents’ and parents-in-law’s longevity. Adult children’s education might contribute to the longevity of older parents through three channels: providing emotional support, providing economic support, and affecting parents’ health.
ARTICLE | doi:10.20944/preprints202205.0255.v1
Subject: Biology And Life Sciences, Biophysics Keywords: SILCS; hERG channel; Physicochemical properties; Multiple linear regression; FragMaps
Online: 19 May 2022 (08:46:24 CEST)
The human ether-a-go-go-related gene (hERG) potassium channel is a well-known contributor to drug-induced cardiotoxicity and therefore an extremely important target when performing safety assessments of drug candidates. Ligand-based approaches in connection with quantitative structure-activity relationship (QSAR) analyses have been developed to predict hERG toxicity. The availability of the recently published cryogenic electron microscopy (cryo-EM) structure of the hERG channel opened the prospect of using structure-based simulation and docking approaches for hERG drug liability predictions. Recently, the idea of combining structure- and ligand-based approaches for modeling hERG drug liability has gained momentum, offering improvements in predictability when compared to ligand-based QSAR practices alone. The present article demonstrates uniting the structure-based SILCS (site-identification by ligand competitive saturation) approach with physicochemical properties to develop predictive models for hERG blockade. This combination leads to improved model predictability based on Pearson’s R and the percent correct metric (which represents rank-ordering of ligands) for different validation sets of hERG blockers involving diverse chemical scaffolds and a wide range of pIC50 values. The inclusion of the SILCS structure-based approach allows determination of the hERG region to which compounds bind and the contribution of different chemical moieties in the compounds to blockade, thereby facilitating rational ligand design to minimize hERG liability.
ARTICLE | doi:10.20944/preprints202205.0240.v1
Subject: Business, Economics And Management, Economics Keywords: Credit constraints; Export; SMEs; Instrumental variable; Probit regression; Vietnam
Online: 18 May 2022 (10:35:32 CEST)
Export participation and restricted access to external formal credit are two factors attracting meticulous attention from researchers and policymakers, especially in developing countries. Exploring the interactive relationship of these factors in both the static and dynamic models is the purpose of this study. The study uses data sets from small and medium-sized manufacturing enterprises (SMEs) in Vietnam for the period 2009 - 2015. The instrumental variable approach is implemented to deal with the endogenous variable problem in the model. The results show an effect of credit constraint on the firms’ exporting status, and continuous exports are likely to reduce the limit of credit constraint.
ARTICLE | doi:10.20944/preprints202205.0032.v1
Subject: Business, Economics And Management, Business And Management Keywords: digitalisation; sustainability; sustainable development goals; European Union; regression equations
Online: 5 May 2022 (10:24:13 CEST)
Digitalisation provides access to an integrated network of information that can benefit society and business. Building a digital network and society using digital means can create unique opportunities to strategically address sustainable development challenges under the United Nations Sustainable Development Goals (SDG) and to ensure higher productivity, education and an equality-oriented society. This point of view describes the potential of digitalisation for the society and business of the future. The authors revisit the links between digitalisation and sustainability in the European Union countries. A methodology for the research is suggested in the paper and the linear regression method is applied. The results showed ties with five SDG, focusing on society and business, and all these ties are fixed in the constructed equations for each SDG. The suggested solution is statistically valid and proves the novelty of the research. Among digitalisation indicators, only mobile-cellular subscriptions and fixed-broadband sub-basket prices in part have no effect on the researched sustainable development indicators.
ARTICLE | doi:10.20944/preprints202201.0408.v1
Subject: Medicine And Pharmacology, Dietetics And Nutrition Keywords: Indonesia; islands cluster; multiple logistic regression; obesity; risk factor
Online: 27 January 2022 (06:53:58 CET)
Obesity has become a rising global health problem affecting adults’ quality of life. The objective of this study was to describe the prevalence of obesity in Indonesian adults based on island clusters. The study also aimed to identify the risk factors of obesity in each island cluster. This study analysed secondary data of Indonesian Basic Health Research 2018. Our data for analysis comprised 688,638 adults (>=15 years) randomly selected using proportionate to population size throughout Indonesia. We included 20 variables for sociodemographic and obesity-related risk factors for analysis. Obese status was defined using Body Mass Index (BMI) >= 27.5 kg/m². Our current study defined seven major island clusters, comprising the 34 provinces of Indonesia, as the unit of analysis. Descriptive analysis was conducted to determine the characteristics of the population and to calculate the prevalence of obesity within provinces in each of the island clusters. Multivariate logistic regression analyses to calculate odds ratios (ORs) were performed using R version 3.6.3. The study results showed that all island clusters had at least one province with an obesity prevalence of more than 20%. Six out of twenty variables, comprising four diet factors (consumption of sweet food, high-salt food, meat food, and carbonated drinks) and two other factors (mental health disorders and smoking behaviour), varied across the island clusters. In conclusion, obesity prevalence varied among provinces within and between island clusters. The variation in risk factors across island clusters suggests that the government should rethink and reframe its interventions to address obesity.
ARTICLE | doi:10.20944/preprints202112.0455.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: COVID- 19; Durbin-Watson statistic; Multiple Linear Regression; Multicollinearity
Online: 28 December 2021 (16:11:44 CET)
This paper discusses the application of statistical modeling to interpret a health system crisis in Sri Lanka due to COVID-19. A strong focus on the preventive approach and contact tracing, with rational use of available resources, describes Sri Lanka’s response towards COVID-19 prevention and mitigation. Early contact tracing, preemptive quarantining, isolation, and treatment were implemented as a concerted effort. This approach, proven efficient during the early phase of the pandemic, was sustainable when there was a rapid increase in the COVID-19 patients since July 2021, exceeding the health system capacity. The country’s COVID-19 situation during the period from 1 August 2021 to 31 October 2021 was taken into consideration. The variables used for analysis were the total number of cases, recovered cases, comorbid and O2-dependent patients, ICU patients, and deaths. The regression model was applied to analyze the data using the EViews 12 (x64) software application. The correlation coefficients of all the independent variables under consideration imply that they have a strong positive relationship with the number of deaths that occurred during the said period. According to the computed multiple linear regression model, the number of positive cases and O2-dependent patients have a positive relationship with the dependent variable. Further, the Durbin-Watson statistic of the model and the multicollinearity test indicate that the model is free from serial correlation and is therefore fit. From the perspective of epidemiological control, these findings highlight the importance of keeping the number of cases within the limits of health system capacity.
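For readers unfamiliar with the serial-correlation check mentioned above, the Durbin-Watson statistic is the sum of squared successive residual differences divided by the sum of squared residuals. The sketch below computes it in plain Python on hypothetical residuals; it is an illustration of the standard formula, not a reproduction of the EViews output used in the paper.

```python
def durbin_watson(residuals):
    """Return DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    Values near 2 suggest no first-order serial correlation; values
    towards 0 or 4 suggest positive or negative autocorrelation.
    """
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals (not study data).
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strongly negative autocorrelation
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # strongly positive autocorrelation
```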
ARTICLE | doi:10.20944/preprints202111.0227.v1
Subject: Business, Economics And Management, Marketing Keywords: Lolita fashion; multiple regression; decision tree; social media; XGBoost
Online: 12 November 2021 (14:54:04 CET)
Despite extensive investigation of the impact of social media on the marketing of fashion products, little evidence is available on how the platforms influence sales prediction. Focusing on Lolita fashion, this study investigates the impact of social media marketing on the sales volume prediction of fashion products. Essentially, we analyzed marketing data, including comments, likes, and shares from the Weibo social platform, to forecast future sales, examine how to enhance profit performance, and make production decisions. Using a quantitative approach, we tested three different prediction models: multiple regression, decision tree, and XGBoost. The results revealed that increasing comments and decreasing the number of likes could significantly improve the sales volumes of Lolita products. In contrast, shares exerted a less significant impact on sales. Regarding prediction models, XGBoost was found to be the best method. In the fashion industry, social media is a useful tool for forecasting market trends. A limitation of this study is that only one social media platform was used to extract data, which might limit the generalization of the findings.
ARTICLE | doi:10.20944/preprints202106.0497.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: Ecosystem services; Benefit transfer; Meta-analysis; Meta-regression function.
Online: 21 June 2021 (10:04:14 CEST)
Meta-analysis has increasingly been used to synthesize the ecosystem services literature, with some testing of the use of such analyses to transfer benefits. These are typically based on local primary studies. However, meta-analyses associated with ecosystem services are a potentially powerful tool for transferring benefits, especially for environmental assets for which no primary studies are available. In this study we use the Ecosystem Service Valuation Database (ESVD), which brings together 1350 value estimates from more than 320 studies around the world, to estimate meta-regression functions for provisioning, regulating & maintenance and cultural ecosystem services across 12 biomes. We tested the reliability of these meta-regression functions and found that even using variables with high explanatory power, transfer errors could still be large. We show that meta-analytic transfer performs better than simple value transfer and, in addition, that local meta-analytical transfer (i.e. based on local explanatory variable values) provides more reliable estimates than global meta-analytical transfer (i.e. based on mean global explanatory variable values). Thus, we conclude that when taking into account the characteristics of the study area under analysis, including explanatory variables such as income, population density and protection status, we can determine the value of ecosystem services with greater accuracy.
ARTICLE | doi:10.20944/preprints202105.0536.v1
Subject: Biology And Life Sciences, Anatomy And Physiology Keywords: Argan biosphere reserve; Climate change; Rainfall; Temperature; Woodland regression
Online: 24 May 2021 (07:44:25 CEST)
This paper explores the effect of climate change on the regression of the Argan tree (Argania spinosa L. Skeels) woodland, focusing on the Argan Biosphere Reserve and especially in the Souss plain (Western Morocco). Rainfall and temperature data of four sites within the Argan Biosphere Reserve were analyzed over the last 60 years to assess any climatic change. Regression curves applied to the dataset showed an important decrease in rainfall (18 to 26 %) in the four locations as well as an increase in temperature (1 to 2 °C). These changes may have a detrimental effect on the Argan woodland although human factors have been reported to be the main factor of its regression. It can therefore be concluded that the reduction in rainfall and the increase in temperature should now be considered as factors of Argan woodland regression.
ARTICLE | doi:10.20944/preprints202104.0622.v1
Subject: Engineering, Automotive Engineering Keywords: Complex Regression, Least-Squares Techniques, Advanced Metering Infrastructure (AMI)
Online: 23 April 2021 (09:46:32 CEST)
This paper uses the complex regression analysis method to establish customer load regression models that consider economic indicators, temperature and rainfall. Furthermore, the proposed models are used to study the feasibility of forecasting future energy sales and summer peak load demand. First, least-squares techniques are used to derive regression models, considering economic indicators and temperature, for the energy sales of 34 customers and for total energy sales. In addition, AMI high-voltage customer demand data and 24-hour system generating capacity were adopted to forecast the summer peak load. The data analysis is carried out with the EViews software in order to verify the feasibility of the research framework. The study found that forecasting accuracy is low when the model combines only temperature and high-voltage demand. When high-voltage demand data are combined with 24-hour system generating capacity to forecast peak load, the average error is ±0.87%, and for the majority of the energy sales forecasting models the average error is ±3%. These results can serve as a future reference for power companies.
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Face detection; CSEM; Deep learning; GPU; CPU; Benchmark; Regression
Online: 27 July 2020 (14:54:15 CEST)
Face recognition is a valuable forensic tool for criminal investigators since it helps in identifying individuals in scenarios of criminal activity, such as fugitives or child sexual abuse. It is, however, a very challenging task as it must be able to handle low-quality images of real-world settings and fulfill real-time requirements. Deep learning approaches for face detection have proven to be very successful but they require large computation power and processing time. In this work, we evaluate the speed-accuracy tradeoff of three popular deep-learning-based face detectors on the WIDER Face and UFDD data sets on several CPUs and GPUs. We also develop a regression model capable of estimating the performance, both in terms of processing time and accuracy. We expect this to become a very useful tool for the end user in forensic laboratories in order to estimate the performance for different face detection options. Experimental results showed that the best speed-accuracy tradeoff is achieved with images resized to 50% of the original size in GPUs and images resized to 25% of the original size in CPUs. Moreover, performance can be estimated using multiple linear regression models with a Mean Absolute Error (MAE) of 0.113, which is very promising for the forensic field.
ARTICLE | doi:10.20944/preprints202001.0377.v1
Subject: Environmental And Earth Sciences, Geophysics And Geology Keywords: ERT method; regression model; tailings pond; heavy metal; reclamation
Online: 31 January 2020 (05:04:37 CET)
The legacy mining industry has left a large number of tailings ponds exposed to water and wind erosion, which causes serious environmental and health problems. Prior to rehabilitation actions, deep sampling of the materials infilling the pond has usually been necessary. Thus, the primary objective of this study is to demonstrate the usefulness of the Electrical Resistivity Tomography (ERT) method as a non-invasive tool to determine the physicochemical composition of mine tailings ponds, enabling more efficient and low-cost surveys. To achieve this objective, three ERT profiles and three boreholes in each profile were carried out, three waste samples from different depths were collected from each borehole, and a geochemical characterization of the samples was carried out. In order to estimate the composition of the infilling wastes in tailings ponds from electrical resistivity measurements, several regression models were calculated for different physicochemical properties and metal concentrations. As a result, a high resistivity area was depicted in profiles G2 and G3 while a non-resistive area (profile G1) was also found. Relationships among low resistivity values and high salinity, clay content and high metal concentrations and mobility were established. Specifically, calibrated models were obtained for electrical conductivity, particle sizes of 0.02-50 µm and 50-2000 µm, total Zn and Cd concentration, and bioavailable Ni, Cd and Fe. Therefore, the ERT technique could be considered a useful tool for mine tailings pond characterization, and it can be used to estimate some physicochemical properties and metal concentrations of this mine waste.
ARTICLE | doi:10.20944/preprints201903.0090.v1
Subject: Engineering, Energy And Fuel Technology Keywords: Sustainable development; House prices; ARIMA; Regression analysis; New Zealand
Online: 7 March 2019 (12:02:50 CET)
The New Zealand housing sector is experiencing rapid growth that boosts the national economy but also results in the loss of valuable resources. In line with the growth, the housing market for both residential and business purposes has been booming, as have house prices. To sustain housing development, it is critical to accurately monitor and predict housing prices so as to support the decision-making process in the housing sector. This study is devoted to applying a mathematical method to predict housing prices. The forecasting performance of two types of models, ARIMA and multiple linear regression analysis, is compared. The ARIMA and regression models are developed based on a training-validation sample method. The results show that the ARIMA model generally performs better than the regression model. However, the regression model explores, to some extent, the significant correlations between house prices in New Zealand and the macro-economic conditions.
ARTICLE | doi:10.20944/preprints201811.0394.v3
Subject: Engineering, Electrical And Electronic Engineering Keywords: marine current turbine; blade attachment; sparse autoencoder; softmax regression
Online: 12 February 2019 (09:59:09 CET)
The development and application of marine current energy are attracting more and more attention around the world. Due to the harshness of its working environment, it is both important and difficult to study the fault diagnosis of a marine current generation system. In this paper, an underwater image is chosen as the fault-diagnosing signal after different sensors are compared. This paper proposes a diagnosis method based on the sparse autoencoder (SA) and softmax regression (SR). The SA is used to extract the features and SR is used to classify them. Images are used to monitor whether benthos has attached to the blade and to determine the corresponding degree of attachment. The experimental results show that the proposed method can diagnose blade attachment with higher accuracy than other methods.
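The classification step described above (softmax regression applied to extracted features) can be illustrated with a short sketch. The weights, biases, feature vector, and the three attachment classes below are all hypothetical placeholders; in the paper the features would come from the sparse autoencoder and the weights from training.

```python
import math


def softmax(scores):
    """Convert raw class scores to probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights, biases):
    """Return (predicted class index, class probabilities) for one feature vector."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    return probs.index(max(probs)), probs


# Hypothetical model: 2 features, 3 classes (e.g. none / light / heavy attachment).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
label, probs = predict([2.0, 0.5], weights, biases)
```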
ARTICLE | doi:10.20944/preprints201809.0076.v1
Subject: Medicine And Pharmacology, Pharmacology And Toxicology Keywords: pharmacovigilance; drug safety; segmented regression; interrupted time series; variation
Online: 5 September 2018 (01:27:54 CEST)
Introduction: Pharmacovigilance may detect safety issues after marketing of medications, and this can result in regulatory action such as direct healthcare professional communications (DHPC). DHPC can be effective in changing prescribing behaviour, however the extent to which prescribers vary in their response to DHPC is unknown. This study aims to explore changes in prescribing and prescribing variation among GP practices following a DHPC on the safety of mirabegron, a medication to treat overactive bladder (OAB). Methods: This is an interrupted time series study of English GP practices from 2014-2017. NHS Digital provided monthly statistics on aggregate practice-level prescribing and practice characteristics (practice staff and registered patient profiles, Quality & Outcomes Framework indicators, and deprivation of the practice area). The primary outcome was monthly mirabegron items as a percentage of all OAB drug items. The exposure was a DHPC issued by the European Medicines Agency in September 2015. Variation between practices in mirabegron prescribing before and after the DHPC was assessed using the systematic component of variation (SCV). Multilevel segmented regression with random effects quantified the change in level and trend of prescribing after the DHPC. Practice characteristics were assessed for their association with a reduction in prescribing following the DHPC. Results: This study included 7,408 practices. During September 2015, 88.9% of practices prescribed mirabegron and mirabegron composed a mean of 8.2% (SD 6.8) of OAB items. Variation between practices was classified as very high and the median SCV did not change significantly (p=0.11) in the 6 months after the September 2015 DHPC (12.4) compared to before (11.6). Before the DHPC, there was a monthly trend of 0.294 (95% CI 0.287 to 0.301) percentage points increase in mirabegron percentage. There was no significant change in the month immediately after the DHPC (-0.023, 95% CI -0.105 to 0.058) however there was a significant reduction in trend (-0.036, 95% CI -0.049 to -0.023). Higher numbers of registered patients and patients aged ≥65 years, and practice area deprivation were associated with having a significant decrease in level and slope of mirabegron prescribing post-DHPC. Conclusion: Variation in mirabegron prescribing was high over the study period and did not change substantively following the DHPC. There was no immediate prescribing change post-DHPC, although the monthly growth did slow. Knowledge of the degree of variation in and determinants of response to safety communications may allow those that do not change prescribing to be provided with additional supports.
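The level and trend changes reported above come from a segmented regression. As a rough illustration, the sketch below fits the common single-series form y_t = b0 + b1·t + b2·post_t + b3·(t − t0)·post_t by ordinary least squares, where b2 is the level change and b3 the trend change at the intervention month t0. This is a simplification under stated assumptions: the study used a multilevel model with random effects across practices, and the data here are hypothetical and noise-free.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_segmented(y, t0):
    """OLS fit of y_t = b0 + b1*t + b2*post + b3*(t - t0)*post, post = 1 if t >= t0."""
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= t0 else 0.0
        rows.append([1.0, float(t), post, (t - t0) * post])
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Hypothetical series (not study data): 24 months, intervention at month 12,
# no level change and a slowdown of 0.04 points per month afterwards.
t0 = 12
y = [2.0 + 0.30 * t + (-0.04 * (t - t0) if t >= t0 else 0.0) for t in range(24)]
b0, b1, level_change, trend_change = fit_segmented(y, t0)
```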
ARTICLE | doi:10.20944/preprints201807.0353.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: corporate default swap spreads, correlation networks, vector autoregressive regression.
Online: 19 July 2018 (10:16:11 CEST)
We propose a novel credit risk measurement model for Corporate Default Swap spreads that combines vector autoregressive regression with correlation networks. We focus on the sovereign CDS spreads of a collection of countries, which can be regarded as idiosyncratic measures of credit risk. We model them by means of a vector autoregressive regression model, composed of a time-dependent, country-specific component and a contemporaneous component that describes contagion effects among countries. To disentangle the two components, we employ correlation networks, derived from the correlation matrix between the reduced-form residuals. The proposed model is applied to ten countries that are representative of the recent financial crisis: top borrowing/lending countries and peripheral European countries. The empirical findings show that the proposed model is a good predictor of CDS spread movements, and that the contemporaneous component decreases prediction errors with respect to a simpler autoregressive model. From an applied viewpoint, core countries appear to import risk, as contagion increases their CDS spreads, whereas peripheral countries appear to be exporters of risk. Greece is an unfortunate exception, as its spreads seem to increase because of both idiosyncratic factors and contagion effects.
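The idea of deriving a network from the correlation matrix of reduced-form residuals can be illustrated with a small sketch. It assumes a VAR with a single lag fitted by ordinary least squares and uses simulated series, so it is not the authors' specification or data. The function name and the simulated shock structure are hypothetical.

```python
import numpy as np

def var1_residual_correlation(Y):
    """Fit a VAR(1) by OLS and return the correlation matrix of its residuals."""
    Y = np.asarray(Y, dtype=float)
    X = np.column_stack([np.ones(len(Y) - 1), Y[:-1]])  # intercept plus lagged values
    B, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)       # one coefficient column per series
    resid = Y[1:] - X @ B                                # reduced-form residuals
    return np.corrcoef(resid, rowvar=False)

# Simulated data (assumption): series 1 shares part of the shocks of series 0,
# while series 2 is driven by independent shocks.
rng = np.random.default_rng(0)
shocks = rng.normal(size=(200, 3))
shocks[:, 1] += 0.8 * shocks[:, 0]
Y = np.zeros((200, 3))
for t in range(1, 200):
    Y[t] = 0.5 * Y[t - 1] + shocks[t]

corr = var1_residual_correlation(Y)
print(np.round(corr, 2))
```

The off-diagonal entries of `corr` can be read as edge weights of a contemporaneous network: the pair that shares shocks is strongly linked, while the independent series is close to unconnected.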
ARTICLE | doi:10.20944/preprints201807.0087.v1
Subject: Business, Economics And Management, Economics Keywords: Nigeria; financial development; economic growth; threshold regression; time series
Online: 5 July 2018 (08:39:38 CEST)
The relationship between economic growth, growth volatility and financial sector development continues to attract attention in the theoretical and empirical literature. Some studies have hypothesized that finance has a causal linear relationship with growth. More recently, several authors have contradicted this claim, arguing that the relationship between finance and growth is nonlinear. We investigate these claims for Nigeria for the period between 1970 and 2015, using semi-parametric econometric methods, Hansen's sample-splitting technique, and a threshold estimator. We find no evidence of ‘Too much finance’, as claimed by many researchers in recent times. We show that the relationship between financial development and economic growth is U-shaped. This is equally true for the relationship between financial development and growth volatility. We also discuss the policy implications of our findings and recommend financial innovation and the decentralization of stock exchanges to boost access to financial services, together with improved regulation to enhance financial market efficiency.
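A threshold regression of the sample-splitting kind can be sketched as a grid search: for each candidate threshold, fit a separate linear regression on each side and keep the threshold with the smallest total sum of squared residuals. The example below is a minimal illustration under stated assumptions, not the paper's estimator or data. The function name, the 15% trimming fraction, and the synthetic series with a break at 60 are hypothetical.

```python
import numpy as np

def fit_threshold(y, x, q, trim=0.15):
    """Return the threshold in q that minimises the pooled SSR of two regimes."""
    y, x, q = (np.asarray(a, dtype=float) for a in (y, x, q))
    X = np.column_stack([np.ones_like(x), x])
    n = len(q)
    # Trim the extremes so that each regime keeps enough observations.
    candidates = np.sort(q)[int(trim * n): int((1 - trim) * n)]
    best_ssr, best_gamma = np.inf, None
    for gamma in candidates:
        ssr = 0.0
        for mask in (q <= gamma, q > gamma):
            beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
            ssr += float(np.sum((y[mask] - X[mask] @ beta) ** 2))
        if ssr < best_ssr:
            best_ssr, best_gamma = ssr, gamma
    return best_gamma

# Synthetic example (assumption): the regressor is also the threshold variable,
# and the regression line changes intercept and slope once it exceeds 60.
x = np.arange(100.0)
y = np.where(x <= 60, 1.0 + 0.5 * x, 40.0 - 0.2 * x)
gamma_hat = fit_threshold(y, x, q=x)
print(gamma_hat)
```

In applied work the estimated threshold would be accompanied by a test for whether the split is statistically significant, which this sketch does not attempt.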