ARTICLE | doi:10.20944/preprints201703.0028.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: GPS trajectory; GPS sensor; trajectory similarity measure; spatial-temporal data
Online: 6 March 2017 (06:51:37 CET)
With the rapid spread of handheld smart devices with built-in GPS, trajectory data from GPS sensors have grown explosively. Trajectory data have spatio-temporal characteristics and carry rich information. Trajectory data processing techniques can mine patterns of human activity and vehicle movement in intelligent transportation systems. The trajectory similarity measure is one of the most important issues in trajectory data mining (clustering, classification, frequent pattern mining, etc.). Unfortunately, the main similarity measure algorithms for trajectory data have been found to be inaccurate, highly sensitive to the sampling method, and poorly robust to noisy data. To solve these problems, three distances and their corresponding computation methods are proposed in this paper. The point-segment distance decreases sensitivity to the point sampling method. The prediction distance optimizes the temporal distance using the features of trajectory data. The segment-segment distance introduces the trajectory shape factor into the similarity measurement to improve accuracy. The three distances are integrated with the traditional dynamic time warping (DTW) algorithm to propose a new segment-based dynamic time warping algorithm (SDTW). The experimental results show that SDTW achieves about 57%, 86%, and 31% better accuracy than the longest common subsequence algorithm (LCSS), the edit distance on real sequence algorithm (EDR), and DTW, respectively, and that its sensitivity to noisy data is lower than that of those algorithms.
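The point-segment, prediction, and segment-segment distances are the paper's contributions; the DTW backbone they plug into is standard. A minimal sketch of classic DTW over 1-D sequences (SDTW replaces the pointwise cost below with the segment-based distances described above):

```python
def dtw(a, b):
    """Classic dynamic time warping distance between two
    1-D sequences, with absolute-difference point cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = DTW distance between a[:i] and b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Because warping lets one point align with several, identical sequences sampled at different rates still get distance 0, which is exactly the sampling sensitivity DTW already mitigates and SDTW improves on.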
ARTICLE | doi:10.20944/preprints201806.0440.v1
Subject: Computer Science And Mathematics, Computational Mathematics Keywords: clustering; spatial data; grid-based k-prototypes; data mining; sustainability
Online: 27 June 2018 (10:21:22 CEST)
Data mining plays a critical role in sustainable decision-making. The k-prototypes algorithm is one of the best-known algorithms for clustering both numeric and categorical data. Despite this, clustering a large number of spatial objects with mixed numeric and categorical attributes is still inefficient due to its high time complexity. In this paper, we propose efficient grid-based k-prototypes algorithms, GK-prototypes, which achieve high performance for clustering spatial objects. The first proposed algorithm utilizes both the maximum and minimum distances between cluster centers and a cell, which can remove unnecessary distance calculations. The second proposed algorithm extends the first by exploiting spatial dependence, i.e., the tendency of spatial data to be more similar the closer the objects are. Each cell has a bitmap index which stores the categorical values of all objects in the cell for each attribute. This bitmap index can improve performance when the categorical data are skewed. Our evaluation experiments showed that the proposed algorithms achieve better performance than the existing pruning technique for the k-prototypes algorithm.
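The cell-based pruning operates on top of the standard k-prototypes mixed-attribute distance. A minimal sketch of that distance (gamma is the usual weight balancing numeric against categorical dissimilarity; the grid pruning itself is the paper's contribution and not shown):

```python
def kproto_distance(x_num, x_cat, c_num, c_cat, gamma=1.0):
    """Mixed-attribute distance used by k-prototypes:
    squared Euclidean over the numeric attributes plus gamma
    times the number of categorical mismatches."""
    num = sum((a - b) ** 2 for a, b in zip(x_num, c_num))
    cat = sum(1 for a, b in zip(x_cat, c_cat) if a != b)
    return num + gamma * cat
```

The pruning idea is then to skip this computation for every object in a cell whenever the cell's maximum possible distance to one center is smaller than its minimum possible distance to all others.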
ARTICLE | doi:10.20944/preprints202309.1016.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Imbalanced data; Data preprocessing; Sampling; Tomek Links; DTW
Online: 14 September 2023 (14:00:42 CEST)
Purpose: To alleviate the data imbalance problem caused by subjective and objective factors, scholars have developed various data preprocessing algorithms, among which undersampling algorithms are widely used because they are fast and efficient. However, when the number of samples in some categories of a multi-class dataset is too small to be processed by sampling, or the number of minority class samples is only 1 to 2, traditional undersampling algorithms are weakened. Methods: This study selects 9 multi-class time series datasets with extremely few samples, fully considers the characteristics of time series data, and uses a three-stage algorithm to alleviate the data imbalance problem. Stage one: random oversampling with disturbance terms increases the number of sample points. Stage two: SMOTE (Synthetic Minority Oversampling Technique) oversampling on this basis. Stage three: using the dynamic time warping distance between sample points, identify the Tomek Link sample points at the class boundaries and clean up the boundary noise. Results: This study proposes a new sampling algorithm. On the 9 multi-class time series datasets with extremely few samples, the new sampling algorithm is compared with four classic undersampling algorithms, ENN (Edited Nearest Neighbours), NCR (Neighborhood Cleaning Rule), OSS (One Side Selection) and RENN (Repeated Edited Nearest Neighbours), using macro accuracy, recall and F1-score as evaluation indicators.
The results show that on FiftyWords, the dataset with the most categories and the fewest minority class samples among the 9 selected, the accuracy of the new sampling algorithm is 0.7156, far surpassing ENN, RENN, OSS and NCR; its recall, at 0.7261, is also better than that of the four undersampling algorithms used for comparison; and its F1-score is higher by 200.71%, 188.74%, 155.29% and 85.61% relative to ENN, RENN, OSS and NCR, respectively. On the other 8 datasets, the new sampling algorithm also scores well. Conclusion: The new algorithm proposed in this study can effectively alleviate the data imbalance problem of multi-class time series datasets with many categories and few minority class samples, while cleaning up the boundary noise between classes.
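Stage three's Tomek Link cleaning can be sketched independently of the oversampling stages. A minimal, hedged implementation: a Tomek link is a pair of mutual nearest neighbours with different class labels; the metric is a pluggable parameter here, so the DTW distance the study uses could be passed in:

```python
def tomek_links(X, y, dist):
    """Return index pairs (i, j) that form Tomek links:
    mutual nearest neighbours carrying different labels."""
    n = len(X)
    nn = []
    for i in range(n):
        # nearest neighbour of sample i under the given metric
        j = min((k for k in range(n) if k != i),
                key=lambda k: dist(X[i], X[k]))
        nn.append(j)
    links = set()
    for i in range(n):
        j = nn[i]
        if nn[j] == i and y[i] != y[j]:
            links.add((min(i, j), max(i, j)))
    return sorted(links)
```

The boundary cleaning step then removes (or relabels) the members of each returned pair, since mutual nearest neighbours with different labels sit on a noisy class boundary.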
ARTICLE | doi:10.20944/preprints202112.0452.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Electrical Resistance Tomography (ERT); Raw Data Processing; Inline Swirl Separator; Geometrical Parameter Extraction
Online: 28 December 2021 (14:42:44 CET)
Electrical Resistance Tomography (ERT) has been used in the literature to monitor gas-liquid separation. However, the image reconstruction algorithms used in these studies take a considerable amount of time to generate the tomograms, far above the time scales of the flow inside the inline separator; as a consequence, the technique is not fast enough to capture all the relevant dynamics of the process, which is vital for control applications. This article proposes a new strategy, based on the physics behind the measurement and simple logic, to monitor the separation with high temporal resolution by minimizing both the amount of data and the calculations required to reconstruct one frame of the flow. To demonstrate its potential, the electronics of an ERT system are used together with a high-speed camera to measure the flow inside an inline swirl separator. For the 16-electrode system used in this study, only 12 measurements are required to reconstruct the whole flow distribution with the proposed algorithm, 10 times fewer than the minimum number of ERT measurements (120). In terms of computational effort, the technique was shown to be 1000 times faster than solving the inverse problem non-iteratively via the Gauss-Newton approach, one of the computationally cheapest techniques available. Therefore, this novel algorithm has the potential to achieve measurement speeds on the order of 10⁴ times the ERT speed in the context of inline swirl separation, pointing to flow measurements at around 10 kHz while keeping the average estimation error below 6 mm in the worst-case scenario.
ARTICLE | doi:10.20944/preprints202111.0440.v1
Subject: Engineering, Control And Systems Engineering Keywords: time series; NMP algorithm; anomalies; data mining; similarities in time series; clustering
Online: 23 November 2021 (17:51:42 CET)
Time series data are significant and are derived from temporal data: real numbers representing values collected regularly over time. Time series underlie many types of data but are prone to anomalies. We introduce a hybrid algorithm named the novel matrix profile (NMP) to solve the all-pairs similarity search problem for time series data. The proposed NMP inherits features from two state-of-the-art algorithms: the Scalable Time series Anytime Matrix Profile (STAMP) and the Scalable Time series Ordered-search Matrix Profile (STOMP). The proposed algorithm caches the output in an easy-to-access fashion for single- and multidimensional data. The NMP algorithm can be used on large data sets, generates approximate solutions of high quality in a reasonable time, and can handle several data mining tasks. It is implemented on a Python platform. To determine its effectiveness, it is compared with the state-of-the-art matrix profile algorithms, i.e., STAMP and STOMP. The results confirm that the proposed NMP provides higher accuracy than the compared algorithms.
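For context, the matrix profile that STAMP, STOMP, and the proposed NMP all compute can be stated as a naive quadratic-time sketch (the real algorithms use z-normalized distances and much faster update schemes; plain Euclidean distance is used here for brevity):

```python
import math

def matrix_profile(ts, m):
    """Naive O(n^2 * m) matrix profile: for each length-m
    subsequence of ts, the Euclidean distance to its nearest
    non-overlapping match elsewhere in the series."""
    n = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n)]
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < m:      # exclusion zone: skip trivial self-matches
                continue
            best = min(best, math.dist(subs[i], subs[j]))
        profile.append(best)
    return profile
```

Low profile values flag repeated motifs, while high values flag discords, which is why the matrix profile supports the anomaly detection and other data mining tasks mentioned above.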
ARTICLE | doi:10.20944/preprints201810.0660.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: dynamic time warping; pattern matching trading system; time series data; sliding window
Online: 29 October 2018 (07:03:51 CET)
The futures market plays a significant role in investors' hedging and speculation. Although various models and instruments have been developed for real-time trading, it is difficult to realize a profit by processing and trading a vast amount of real-time data. This study proposes a real-time index futures trading strategy based on the pattern of KOSPI 200 index futures time series data. We construct a pattern matching trading system (PMTS) based on a dynamic time warping algorithm that recognizes patterns in morning market data movements and determines the afternoon's clearing strategy. We adopt 13 and 27 representative patterns and run simulations over various parameter ranges to find the optimal ones. Our experimental results show that the PMTS provides stable and effective trading strategies at relatively low trading frequencies. Investor communities that sustain financial markets can invest more efficiently by using the PMTS; in this sense, the system developed in this paper is a sustainable investment technique that helps financial markets achieve efficient sustainability.
ARTICLE | doi:10.20944/preprints201808.0540.v1
Subject: Engineering, Industrial And Manufacturing Engineering Keywords: circular economy; remanufacturing; fuel cells; data-driven; systems dynamics
Online: 31 August 2018 (05:31:03 CEST)
Remanufacturing is a viable option to extend the useful life of an end-of-use product or its parts, ensuring sustainable competitive advantages in the current global economic climate. Challenges typical of remanufacturing persist despite its many benefits. According to the European Remanufacturing Network, a key challenge, highlighted in a 2015 survey of 188 European remanufacturers, is the lack of accurate, timely and consistent product knowledge. With more data being produced by electric and hybrid vehicles, this adds to the information complexity already experienced in remanufacturing. Real-time, accurate remanufacturing is therefore difficult to implement on the shop floor, and no papers focus on this within an electric and hybrid vehicle environment. To address this problem, this paper attempts to (1) identify the parameters/variables needed for fuel cell remanufacturing by means of interviews; (2) rank the variables by Pareto analysis; (3) develop a causal loop diagram for the identified parameters/variables to visualise their impact on remanufacturing; and (4) model a simple stock-and-flow diagram to simulate and understand data- and information-driven schemes in remanufacturing.
ARTICLE | doi:10.20944/preprints202306.1378.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Data Generation; Anomaly Data; User Behavior Generation; Big Data
Online: 19 June 2023 (16:31:37 CEST)
The rising importance of Big Data in modern information analysis rests on vast quantities of user data, but sufficient data can only be collected in certain data-gathering contexts. In many cases a domain is too novel, too niche, or too sparsely collected to adequately support Big Data tasks. To remedy this, we have created the ADG Engine, which generates additional data that follows the trends and patterns of the data already collected. Using a database structure that tracks users across different activity types, the ADG Engine can use all available information to maximize the authenticity of the generated data. Our efforts are particularly geared towards data analytics: abnormalities in the data are identified, and the user can generate normal and abnormal data at custom ratios. In situations where it would be impractical or impossible to expand the available dataset by collecting more data, it is still possible to move forward with algorithmically expanded datasets.
ARTICLE | doi:10.20944/preprints201810.0253.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: adaptive filtering; set-membership filtering; affine projection; data censoring; big data; outliers
Online: 12 October 2018 (04:57:08 CEST)
In this paper, the set-membership affine projection (SM-AP) algorithm is utilized to censor non-informative data in big data applications. To this end, the probability distribution of the additive noise signal and the steady-state excess mean-squared error (EMSE) are employed to estimate the threshold parameter of the single-threshold SM-AP (ST-SM-AP) algorithm, aiming to attain the desired update rate. Furthermore, by defining an acceptable range for the error signal, the double-threshold SM-AP (DT-SM-AP) algorithm is proposed to detect very large errors due to irrelevant data such as outliers. The DT-SM-AP algorithm can censor both non-informative and irrelevant data in big data applications, and it can improve the misalignment and convergence rate of the learning process with high computational efficiency. The simulation and numerical results corroborate the superiority of the proposed algorithms over traditional algorithms.
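The censoring idea behind set-membership algorithms can be illustrated with the simpler set-membership NLMS update rather than the paper's SM-AP (which reuses several past input vectors); this single-vector sketch keeps only the thresholded, data-selective update that does the censoring:

```python
def sm_nlms_step(w, x, d, gamma, eps=1e-12):
    """One set-membership NLMS update: the weight vector w is
    updated only when the a priori error exceeds the threshold
    gamma; otherwise the datum is censored at almost no cost."""
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    if abs(e) <= gamma:
        return w, False                    # non-informative datum: skip
    mu = 1.0 - gamma / abs(e)              # just enough to re-enter the set
    norm = sum(xi * xi for xi in x) + eps  # eps guards a zero input vector
    step = mu * e / norm
    return [wi + step * xi for wi, xi in zip(w, x)], True
```

With a well-chosen gamma, most samples fall inside the acceptable error set and trigger no update, which is how these algorithms trade a negligible accuracy loss for a large reduction in computation.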
ARTICLE | doi:10.20944/preprints202008.0254.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: feature selection; k-means; silhouette measure; clustering; big data; fault classification; sensor data; time-series data
Online: 11 August 2020 (06:26:43 CEST)
Feature selection is a crucial step for overcoming the curse of dimensionality in data mining. This work proposes Recursive k-means Silhouette Elimination (RkSE), a new unsupervised feature selection algorithm that reduces dimensionality in univariate and multivariate time-series datasets: k-means clustering is applied recursively to select the cluster-representative features, with the silhouette measure applied to each cluster and a user-defined threshold serving as the feature selection or elimination criterion. The proposed method is evaluated on multi-sensor readings from a hydraulic test rig in two fashions: (1) reducing the dimensionality of a multivariate classification problem using various classifiers of different functionalities; (2) classification of univariate data in a sliding-window scenario, where RkSE is used as a window compression method, reducing the window dimensionality by selecting the best time points in each window. The results are validated using 10-fold cross-validation and compared to classification performed directly with no feature selection. Additionally, a new taxonomy for k-means-based feature selection methods is proposed. The experimental results and observations in the two comprehensive experiments demonstrate the capabilities and accuracy of the proposed method.
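The silhouette measure at the heart of RkSE scores how well a point sits in its cluster. A minimal sketch of the per-point coefficient (the recursive k-means wrapper and the threshold logic are the paper's contribution and not shown):

```python
def silhouette(point, same, other_clusters, dist):
    """Silhouette coefficient of one point: (b - a) / max(a, b),
    where a is the mean distance to the other members of its own
    cluster (`same`) and b the mean distance to the nearest other
    cluster. Values near 1 mean the point is well placed."""
    a = sum(dist(point, p) for p in same) / len(same)
    b = min(sum(dist(point, p) for p in c) / len(c)
            for c in other_clusters)
    return (b - a) / max(a, b)
```

In an RkSE-style scheme, features whose silhouette falls below the user-defined threshold would be eliminated, and k-means would then be re-run on the survivors.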
Subject: Computer Science And Mathematics, Computer Science Keywords: big data; data integration; EVMS; construction management
Online: 30 October 2020 (15:35:00 CET)
In today's information age, data are becoming more and more important. While other industries have achieved tangible improvements by applying cutting-edge information technology, the construction industry still lags far behind. Cost, schedule, and performance control are three major functions in the project execution phase. Alongside their individual importance, cost-schedule integration has been a significant challenge in the construction industry over the past five decades. Although much effort has been put into its development, no method is established in construction practice. The purpose of this study is to propose a new method for integrating cost and schedule data using big data technology. The proposed algorithm is designed to provide data integrity and flexibility in the integration process, a considerable reduction in the time needed to build and change the database, and practical use on a construction site. The proposed method may transform, in a data-friendly way, the current situation in which field engineers regard information management as a troublesome task.
ARTICLE | doi:10.20944/preprints201806.0365.v1
Subject: Engineering, Control And Systems Engineering Keywords: ARIMA model; data forecasting; multi-objective genetic algorithm; regression model
Online: 24 June 2018 (07:48:49 CEST)
The aim of this study has been to develop a novel two-level multi-objective genetic algorithm (GA) to optimize time series forecasting of data for fans used in road tunnels by the Swedish Transport Administration (Trafikverket). Level 1 performs the forecasting of time series cost data, while level 2 evaluates the forecasts. Level 1 implements either a multi-objective GA based on the ARIMA model or a multi-objective GA based on the dynamic regression model. Level 2 utilises a multi-objective GA based on different forecasting error rates to identify a proper forecast. Our method is compared with using the ARIMA model alone. The results show the drawbacks of time series forecasting using only the ARIMA model, and the two-level model further shows the drawbacks of forecasting with a multi-objective GA based on the dynamic regression model; a multi-objective GA based on the ARIMA model produces better forecasts. In level 2, five forecasting accuracy functions help select the best forecast. Selecting a proper forecasting methodology is based on the averages of the forecasted data, the historical data, the actual data and the polynomial trends. The forecasted data can be used for life cycle cost (LCC) analysis.
ARTICLE | doi:10.20944/preprints202308.1170.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: research data management; FAIR; file structure; file crawler; semantic data model
Online: 16 August 2023 (11:05:47 CEST)
Although other methods exist to store and manage data in modern information technology, the standard solution is the file system. Therefore, keeping well-organized file structures and file system layouts can be key to a sustainable research data management infrastructure. However, file structures alone lack several capabilities that are important for FAIR data management; the two most striking are insufficient visualization of data and inadequate possibilities for searching and getting an overview. Research data management systems (RDMS) can fill this gap, but many do not support the simultaneous use of the file system and the RDMS. This simultaneous use can have many benefits, but keeping data in the RDMS synchronized with the file structure is challenging. Here, we present concepts that allow file structures and semantic data models (in an RDMS) to be kept synchronous. Furthermore, we propose a specification in YAML format that allows for a structured and extensible declaration and implementation of a mapping between the file system and the data models used in semantic research data management. Implementing these concepts will facilitate the re-use of specifications for multiple use cases. Furthermore, the specification can serve as machine-readable and, at the same time, human-readable documentation of specific file system structures. We demonstrate our work using the open-source RDMS CaosDB.
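To make the idea concrete, a mapping specification of this kind might look as follows; this is purely an illustration, and the key names below are hypothetical rather than the actual CaosDB syntax:

```yaml
# Hypothetical file-tree-to-data-model mapping (illustrative keys only)
- match: "ExperimentalData/*/scan_*.dat"   # glob over the file structure
  record_type: Scan                        # semantic type in the RDMS
  properties:
    date: "{parent.name}"                  # derived from the directory name
    raw_file: "{path}"                     # link back to the file system
```

A crawler reading such a declaration could walk the file tree, create or update the corresponding records in the RDMS, and thereby keep the two views synchronous.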
ARTICLE | doi:10.20944/preprints201708.0040.v2
Subject: Engineering, Transportation Science And Technology Keywords: spatial clustering; sweep-circle; Gestalt theory; data stream
Online: 24 August 2017 (10:53:05 CEST)
An adaptive spatial clustering (ASC) algorithm is proposed in this study, which employs sweep-circle techniques and a dynamic threshold setting based on Gestalt theory to detect spatial clusters. The proposed algorithm can automatically discover clusters in one pass, rather than through modification of an initial model (for example, a minimal spanning tree, Delaunay triangulation, or Voronoi diagram). It can quickly identify arbitrarily shaped clusters while adapting efficiently to the non-homogeneous density characteristics of spatial data, without requiring prior knowledge or parameters. The proposed algorithm is also well suited to spatial clustering of dynamic data streams in large data sets.
Subject: Environmental And Earth Sciences, Oceanography Keywords: 3DVAR; data assimilation; cost function; Sylvester equation
Online: 5 December 2019 (10:36:30 CET)
Three-dimensional variational data assimilation, or analysis (3DVAR), is one of the most classical methods for providing initial values for numerical models. In this method, the background error covariance and observational error covariance matrices have large dimensions, so it is difficult to invert them or to reduce their order without information loss. Using the Sylvester equation, and on the basis of a new linear regression, a new cost function for 3DVAR is given. For an m×n first-guess field, the cost function yields an approximate 1 − (m² + n²)/(mn × mn) reduction for m > 1 and n > 1. The results of the numerical experiments show that this algorithm performs no worse than the old 3DVAR cost function.
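For reference, the standard 3DVAR cost function that the proposed variant modifies has the well-known form (here x_b is the background or first-guess field, y the observations, H the observation operator, and B and R the background and observational error covariance matrices whose inversion the abstract discusses):

```latex
J(\mathbf{x}) = \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathsf T}\mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b)
              + \tfrac{1}{2}\,(H\mathbf{x}-\mathbf{y})^{\mathsf T}\mathbf{R}^{-1}(H\mathbf{x}-\mathbf{y})
```

The analysis is the x minimizing J, which is why the size and invertibility of B and R dominate the cost of the method.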
ARTICLE | doi:10.20944/preprints202105.0390.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Multilayer perceptron neural network; regression model; backpropagation; missing data; imputation method
Online: 17 May 2021 (14:35:18 CEST)
Missing observations constitute one of the most important issues in data analysis in applied research studies. Their magnitude and structure impact parameter estimation in modeling, with important consequences for decision-making. This study aims to evaluate the efficiency of imputation methods combined with the backpropagation algorithm in a nonlinear regression context. The evaluation is conducted through a simulation study including sample sizes (50, 100, 200, 300 and 400) with different missing data rates (10, 20, 30, 40 and 50%) and three missingness mechanisms (MCAR, MAR and MNAR). Four imputation methods (Last Observation Carried Forward, Random Forest, Amelia and MICE) were used to impute the datasets before making predictions with backpropagation. A 3-MLP model was used, varying the activation functions (Logistic-Linear, Logistic-Exponential, TanH-Linear and TanH-Exponential), the number of nodes in the hidden layer (3 to 15) and the learning rate (20 to 70%). Analysis of the network's performance criteria (R², r and RMSE) revealed good performance when it is trained with TanH-Linear functions, 11 nodes in the hidden layer and a learning rate of 50%. MICE and Random Forest were the most appropriate methods for data imputation. These methods can support a missing rate of up to 50% with an optimal sample size of 200.
ARTICLE | doi:10.20944/preprints202006.0063.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: COVID-19; Real-Time Tracker; Common Symptoms; Data Visualization; Hypothesis Testing; ARIMA Time-Series Forecast; Penalized Logistic Regression
Online: 7 June 2020 (07:44:48 CEST)
While the COVID-19 outbreak was reported to have first originated in Wuhan, China, it was declared a Public Health Emergency of International Concern (PHEIC) on 30 January 2020 by the WHO and had spread to over 180 countries by the time this paper was composed. As the disease spread around the globe, it evolved into a worldwide pandemic, endangering the state of global public health and becoming a serious threat to the global community. To combat and prevent the spread of the disease, all individuals should be well informed of the rapidly changing state of COVID-19. Toward this objective, a COVID-19 real-time analytical tracker has been built to provide the latest status of the disease and relevant analytical insights. The real-time tracker is designed to cater to a general audience without advanced statistical aptitude. It aims to communicate insights through straightforward and concise data visualizations that are supported by sound statistical foundations and reliable data sources. This paper discusses the major methodologies used to generate the insights displayed on the real-time tracker, including real-time data retrieval, normalization techniques, ARIMA time-series forecasting, and logistic regression models. In addition to introducing the details and motivations of these methodologies, the paper presents some key COVID-19 findings derived with them.
ARTICLE | doi:10.20944/preprints202303.0031.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: Data instances, Real time systems, k-means algorithm, Agglomerative hierarchical algorithm, Similarity measure, merge function
Online: 2 March 2023 (04:15:10 CET)
Anomaly detection in real-time data is accepted as a vital research area, and clustering has been applied effectively for this purpose. As the datasets are real-time, the time at which the data are generated is also important. In this article, we introduce a mixture of partitioning and agglomerative hierarchical approaches to detect anomalies in such datasets. It is a two-phase method that follows a partitioning approach first and an agglomerative hierarchical approach second. The dataset can have mixed attributes; a unified metric defined on mixed attributes is used in phase 1 and again for merging similar clusters in phase 2. We also keep track of the time attribute of each data instance, so phase 1 produces clusters together with their lifetimes. In phase 2, when similar clusters with overlapping cores are merged, their lifetimes are superimposed, producing fuzzy time intervals; each merged cluster thus has an associated fuzzy lifetime. Data instances that belong to sparse clusters, or to no cluster at all, can be treated as anomalies. The efficacy of the algorithms can be established both by complexity analysis and by experimental studies.
ARTICLE | doi:10.20944/preprints202307.0925.v1
Subject: Environmental And Earth Sciences, Other Keywords: vector watermarking; vector copyright protection; vector geographic data; copyright protection; digital watermarking; zero-watermarking
Online: 13 July 2023 (10:53:23 CEST)
Vector geographic data play an important role in the natural resources and environment sector and in other location information services. They are also among the costlier data types to create, owing to the difficulty of surveying, collection, and authorization. The rapid development of the Internet has created many advantages in the distribution, exploitation, and use of vector geographic data, but it also gives rise to many problems such as duplication, redistribution, forgery, and illegal data use. Data theft on the Internet is becoming more and more sophisticated, and the number of violations is increasing, showing the urgent need to research and develop an effective solution to protect the copyright of vector geographic data and prevent them from being illegally collected and used. Among the major studies and solutions, digital watermarking emerges as an effective method and an active research area for copyright protection. Toward a good solution for copyright protection of vector geographic data, our study proposes a new algorithm with three main contributions: (1) generating short, pseudo-random, meaningful watermarks to increase robustness and to enable both automated and visual manual verification; (2) building a uniformly distributed mapping between the vertex coordinates and the watermark bit indexes to increase the robustness of the watermarks; and (3) integrating two types of watermarks, namely spatial-domain watermarking and zero-watermarking, to resist the most common attacks on geographic vector data. The algorithm also works on all types of vector geographic data, including points, polylines, and polygons.
ARTICLE | doi:10.20944/preprints202306.1589.v1
Subject: Engineering, Bioengineering Keywords: data management; cloud computing; RESTful API; eye-tracking; web portal
Online: 22 June 2023 (10:28:01 CEST)
The rapid development of technology has led to the implementation of data-driven systems whose performance heavily relies on the amount and type of the data itself. In recent decades, in the field of bioengineering data management, among others, eye-tracking data have become one of the most interesting and essential components of many medical, psychological, and engineering research applications. However, despite the wide use of eye-tracking data in many studies and applications, a significant gap remains in the literature regarding real-time data collection and management, which imposes strong constraints on the reliability and accuracy of on-time results. To address this gap, this study introduces a system that enables the collection, processing, real-time streaming, and storage of eye-tracking data. The system is developed using the Java programming language, the WebSocket protocol, and Representational State Transfer (REST), improving the efficiency of transferring and managing eye-tracking data. Results were computed in two test conditions, i.e., local and online scenarios, within a time window of 100 seconds, by comparing the time delay between the two scenarios. Although preliminary, the results showed significantly improved performance in managing real-time data transfer. Overall, this system can benefit the research community by providing real-time data transfer and storage, enabling more extensive studies using eye-tracking data.
ARTICLE | doi:10.20944/preprints201806.0367.v1
Subject: Computer Science And Mathematics, Hardware And Architecture Keywords: kinect; depth calibration; RGB-D; media art; skeletal joint data
Online: 24 June 2018 (11:19:41 CEST)
Kinect is a device that has been widely used in many areas since its release in 2010. The Kinect SDK was announced in 2011 and has been used well beyond its original purpose as a game controller. In particular, it has been adopted by a number of artists in digital media art because it is inexpensive and has a fast recognition rate. However, there is a problem: Kinect creates 3D coordinates from a single 2D RGB image for the x and y values and a single depth image for the z value. Because a Cartesian XY coordinate system and a spherical Z coordinate system are used in combination, a distance-dependent depth error is generated, which makes real-time rotation recognition and coordinate correction difficult and significantly limits installations for interactive media art. This paper proposes a real-time calibration method that expands the Kinect recognition range for practical use in digital media art. The proposed method can recognize the viewer accurately by calibrating coordinates in any direction in front of the viewer. 3,400 datasets acquired from the experiment were measured in five stances, taken and recorded every 0.5 s: the 1 m attention stance, 1 m hands-up stance, 2 m attention stance, 2 m hands-up stance, and 2 m hands-half-up stance. The experimental results showed that the accuracy rate improved by about 11.5% compared with front measurement data obtained with the reference Kinect installation method.
ARTICLE | doi:10.20944/preprints202005.0101.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: chronic dialysis; administrative data; hospital discharge records; ambulatory specialty visits; case definition; algorithm
Online: 6 May 2020 (15:26:06 CEST)
Background: Administrative healthcare databases are widespread and are often standardized with regard to their content and data coding, thus they can also be used as data sources for surveillance and epidemiological research. Chronic dialysis requires patients to frequently access hospital and clinic services, causing a heavy burden to healthcare providers. This also means that these patients are routinely tracked in administrative databases, yet very few case definitions for their identification are currently available. The aim of this study was to develop two algorithms derived from administrative data for identifying incident chronic dialysis patients and to test their validity against the reference standard of the regional dialysis registry. Methods: The algorithms are based on data retrieved from hospital discharge records (HDR) and ambulatory specialty visits (ASV) to identify incident chronic dialysis patients in an Italian region. Subjects are included if they have at least one event in the HDR or ASV databases with an ICD-9-CM dialysis-related diagnosis or procedure code in the study period. Exclusion criteria comprise non-residents, prevalent cases, and patients undergoing temporary dialysis, and are evaluated only on ASV data by the first algorithm and on both ASV and HDR data by the second. We validated the algorithms against the Emilia-Romagna regional dialysis registry by searching for incident patients in 2014. Results: Algorithm 1 identified 680 patients and Algorithm 2 identified 676 patients initiating dialysis in 2014, compared to 625 patients included in the regional dialysis registry. Sensitivity for the two algorithms was 90.8% and 88.4% respectively, positive predictive value was 84.0% and 82.0%, and percentage agreement was 77.4% and 74.1%. Conclusions: These results suggest that administrative data have high sensitivity and positive predictive value for the identification of incident chronic dialysis patients.
Algorithm 1, which showed higher accuracy and has a simpler case definition, can be used in place of regional dialysis registries where they are absent or insufficiently developed, or to improve the accuracy and timeliness of existing registries.
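The inclusion/exclusion logic of such a case definition can be sketched as follows. The ICD-9-CM code set and the record fields are illustrative assumptions for the sketch, not the exact criteria of the two algorithms in the paper.

```python
# Hedged sketch of an administrative-data case definition: include subjects
# with at least one dialysis-related ICD-9-CM code in HDR or ASV records,
# then drop non-residents, prevalent cases, and temporary dialysis.
# Codes and record fields are illustrative assumptions.
DIALYSIS_CODES = {"39.95", "54.98", "585.6", "V45.11"}  # assumed code set

def incident_chronic_dialysis(subjects):
    """subjects: list of dicts with keys 'id', 'resident', 'prior_dialysis',
    'temporary', and 'codes' (set of ICD-9-CM codes from HDR/ASV)."""
    cases = []
    for s in subjects:
        if not s["codes"] & DIALYSIS_CODES:
            continue                      # no qualifying HDR/ASV event
        if not s["resident"] or s["prior_dialysis"] or s["temporary"]:
            continue                      # exclusion criteria
        cases.append(s["id"])
    return cases
```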
ARTICLE | doi:10.20944/preprints202303.0062.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: Underground space, information detection, fractional differentiation, high accuracy remote data
Online: 3 March 2023 (08:37:27 CET)
The quality of underground space information has become a major problem endangering the safety of underground spaces. Currently, the main methods for high-precision, long-distance transmission of detection information are radar and optical methods. In practical applications, however, we found that the radar method suffers from large energy loss and poor anti-jamming ability, which limit the accuracy and distance of information transmission. The optical method is strongly affected by the weather and can only be applied to static objects above ground, so it is limited in its application objects and operating environment. More importantly, current high-precision remote information detection methods are limited to overground objects and are not applicable to the detection of the various information data in underground space. In this study, we analyze the spectral properties of the fractional differential operator and find that it is suitable for studying non-linear, non-causal, and non-stationary signals. Applying the theory of fractional calculus to data processing, we establish a mathematical model for the remote transmission and high-precision detection of information based on fractional differentiation, realizing high-precision, long-distance information detection. By fusing this long-distance, high-accuracy detection model with stratum data processing, a mathematical model providing long-distance, high-accuracy data was established. Its effectiveness for underground space information detection was verified through application in engineering practice.
ARTICLE | doi:10.20944/preprints202307.2154.v1
Subject: Engineering, Telecommunications Keywords: Mine Internet of Things (MIoT); post-disaster reconstruction; opportunistic routing (OR); data transmission; energy efficient; routing void
Online: 2 August 2023 (04:44:01 CEST)
The Mine Internet of Things (MIoT), as a key technology for reconstructing post-disaster communication networks, enables safety monitoring and control of the affected roadways. However, due to the challenging underground mine environment, the MIoT suffers from severe signal attenuation, vulnerable nodes, and limited energy, which result in low network reliability of the post-disaster MIoT. To improve transmission reliability and reduce energy consumption, a directional-area-forwarding-based energy-efficient opportunistic routing (DEOR) scheme for the post-disaster MIoT is proposed. DEOR defines a forwarding zone (FZ) for each node to route packets toward the sink. The candidate forwarding set (CFS) is constructed from the nodes within the FZ that satisfy the energy constraint and the neighboring node degree constraint. The nodes in the CFS are prioritized based on a routing quality evaluation that takes the local attributes of nodes into consideration, such as the directional angle, transmission distance, and residual energy. DEOR adopts a recovery mechanism to address the issue of void nodes. The simulation results validate that the proposed DEOR outperforms ORR, OBRN and ECSOR in terms of energy consumption, average hop count, packet delivery rate, and network lifetime.
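A prioritization rule of the kind described can be sketched as a weighted score over the local attributes. The linear form, the weights, and the radio range below are assumptions for illustration, not DEOR's actual routing quality metric.

```python
import math

# Illustrative scoring of candidate forwarders in a DEOR-style scheme:
# candidates are ranked by a weighted mix of directional angle toward the
# sink, transmission distance (progress), and residual energy. Weights and
# the linear combination are assumed, not taken from the paper.
def candidate_priority(node, sender, sink, w_angle=0.4, w_dist=0.3, w_energy=0.3):
    """node/sender/sink: dicts with 'x', 'y'; node also has 'energy' in [0,1].
    Higher score = higher forwarding priority."""
    to_sink = (sink["x"] - sender["x"], sink["y"] - sender["y"])
    to_node = (node["x"] - sender["x"], node["y"] - sender["y"])
    dot = to_sink[0] * to_node[0] + to_sink[1] * to_node[1]
    norm = math.hypot(*to_sink) * math.hypot(*to_node)
    angle_score = dot / norm if norm else 0.0        # cos of directional angle
    max_range = 100.0                                # assumed radio range
    dist_score = min(math.hypot(*to_node) / max_range, 1.0)  # favor progress
    return w_angle * angle_score + w_dist * dist_score + w_energy * node["energy"]
```

Sorting the candidate forwarding set by this score yields the forwarding priority order; a node pointing straight at the sink with full energy outranks a perpendicular, depleted one.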
ARTICLE | doi:10.20944/preprints202306.0570.v1
Subject: Engineering, Other Keywords: optical fiber data communication system; EML; PAM4; Volterra; DFE
Online: 8 June 2023 (03:00:45 CEST)
A novel simplified Volterra-structure algorithm combined with decision feedback equalization (DFE) is proposed for intensity-modulation direct-detection (IM-DD) short-distance optical fiber communication systems. Based on this algorithm, the damage to four-level pulse amplitude modulation (PAM-4) signals caused by device bandwidth limitations and dispersion during transmission is compensated. Experiments carried out with a 25 GHz electro-absorption modulated laser (EML) show that PAM-4 signals can be transmitted over 10 km of standard single-mode fiber (SSMF). The 112 Gbps and 128 Gbps signals reach the error-rate thresholds of KP4-FEC (BER = 2×10⁻⁴) and HD-FEC (BER = 3.8×10⁻³), respectively. The simplified principle and process of the proposed Volterra-based equalization algorithm are presented. Experimental results show that the algorithm complexity is reduced by 75%, providing effective theoretical support for the commercial application of this algorithm.
ARTICLE | doi:10.20944/preprints202201.0229.v1
Subject: Public Health And Healthcare, Public Health And Health Services Keywords: FAIR principles; Multimorbidity; Mortality; Research data management; Pathfinder case study; Privacy-Preserving Distributed Data Mining.
Online: 17 January 2022 (13:04:03 CET)
The current availability of electronic health records represents an excellent research opportunity on multimorbidity, one of the most relevant public health problems nowadays. However, it also poses a methodological challenge due to the current lack of tools to access, harmonize, and reuse research datasets. In FAIR4Health, a European Horizon 2020 project, a workflow to implement the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles on health datasets was developed, along with two tools aimed at facilitating the transformation of raw datasets into FAIR ones and the preservation of data privacy. As part of this project, we conducted a multicentric retrospective observational study to apply the aforementioned FAIR implementation workflow and tools to five European health datasets for research on multimorbidity. We applied a federated frequent pattern growth association algorithm to identify the most frequent combinations of chronic diseases and their association with mortality risk. We identified several clinically plausible multimorbidity patterns consistent with the literature, some of which were strongly associated with mortality. Our results show the usefulness of the solution developed in FAIR4Health to overcome difficulties in data management and highlight the importance of implementing a FAIR data policy to accelerate responsible health research.
ARTICLE | doi:10.20944/preprints201611.0033.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: genetic algorithms; parallel computation; computational complexity; algorithms; optimization techniques; traveling salesman problem; NP-Hard problems; Berlin-52 data set; machine learning; linear regression
Online: 7 November 2016 (04:57:46 CET)
This paper examines the correlation between the number of computer cores and the performance of parallel genetic algorithms. The objective is to determine a complementary linear polynomial equation that represents the relation between the number of parallel processes and the optimum solutions. This relation is modeled as an optimization function f(x) able to produce many simulation results, and f(x) outperforms the genetic algorithm itself. A comparison between the genetic algorithm and the optimization function is carried out, and the optimization function also provides a model to speed up the genetic algorithm. The optimization function is a complementary transformation that maps a given TSP instance to a linear form without changing the roots of the polynomials.
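For context on the baseline being sped up, a minimal genetic algorithm for a small TSP instance can be sketched as below. The operators (order crossover, swap mutation, truncation selection) and parameters are generic textbook choices, not the configuration used in the paper.

```python
import random

# Minimal genetic algorithm for TSP: order crossover, swap mutation,
# truncation selection. Generic textbook operators, assumed for the sketch.
def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def crossover(p1, p2):
    """Order crossover (OX): copy a slice of p1, fill the rest from p2."""
    a, b = sorted(random.sample(range(len(p1)), 2))
    hole = set(p1[a:b])
    filler = [c for c in p2 if c not in hole]
    return filler[:a] + p1[a:b] + filler[a:]

def ga_tsp(dist, pop_size=40, generations=200, mut_rate=0.2):
    n = len(dist)
    pop = [random.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: tour_length(t, dist))
        survivors = pop[:pop_size // 2]           # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            child = crossover(*random.sample(survivors, 2))
            if random.random() < mut_rate:        # swap mutation
                i, j = random.sample(range(n), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda t: tour_length(t, dist))
```

On a toy instance of four cities at the corners of a unit square, this reliably finds the perimeter tour of length 4.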
ARTICLE | doi:10.20944/preprints201707.0089.v1
Subject: Engineering, Chemical Engineering Keywords: air contaminant dispersion; data assimilation; particle filter; expectation-maximization algorithm; UAV
Online: 31 July 2017 (11:02:27 CEST)
The precise prediction of air contaminant dispersion is essential to air quality monitoring and to the emergency management of contaminant gas leakage incidents in chemical industry parks. Conventional atmospheric dispersion models can seldom give precise predictions due to inaccurate input parameters. To improve the prediction accuracy of the dispersion model, two data assimilation methods (one based on the standard particle filter alone, the other combining the particle filter with the expectation-maximization algorithm) are proposed to assimilate UAV observations into the atmospheric dispersion model. Two emission cases, differing in the dimension of the state variables, are considered. To test the performance of the proposed methods, experiments corresponding to the two emission cases were designed and implemented. The results show that the particle filter can effectively estimate the model parameters and improve the accuracy of model prediction when the dimension of the state variables is low. In contrast, when the dimension of the state variables becomes higher, the combination of the particle filter with the expectation-maximization algorithm performs better in parameter estimation accuracy and warm-up time. The data assimilation methods are therefore able to effectively support air quality monitoring and emergency management in chemical industry parks.
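The particle-filter half of the approach can be illustrated with a minimal bootstrap filter that estimates a single emission-rate-like parameter from noisy observations. The linear "dispersion model", the prior range, and the noise levels are placeholders for the sketch; neither the paper's dispersion model nor its EM extension is reproduced here.

```python
import random, math

# Minimal bootstrap particle filter estimating one model parameter from
# noisy observations: weight by likelihood, resample, jitter. The forward
# model and all noise parameters are illustrative assumptions.
def particle_filter(observations, model, n_particles=500, obs_std=0.5):
    particles = [random.uniform(0.0, 10.0) for _ in range(n_particles)]
    for y in observations:
        # weight particles by the Gaussian observation likelihood
        weights = [math.exp(-((y - model(q)) ** 2) / (2 * obs_std ** 2))
                   for q in particles]
        total = sum(weights) or 1e-300
        weights = [w / total for w in weights]
        # resample in proportion to the weights
        particles = random.choices(particles, weights=weights, k=n_particles)
        # small jitter to avoid sample degeneracy
        particles = [q + random.gauss(0, 0.1) for q in particles]
    return sum(particles) / n_particles   # posterior mean of the parameter
```

With a forward model y = 2q and observations generated at q = 3, the posterior mean converges near 3 after a few assimilation steps.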
ARTICLE | doi:10.20944/preprints202307.0452.v2
Subject: Engineering, Aerospace Engineering Keywords: Reduced order models; Higher order singular value decomposition; Health monitoring; Aeroengines; Predictive maintenance; Degradation parameters; Sensors scaling; Turbine inlet temperature; Gradient-like methods; Noisy data
Online: 11 July 2023 (08:22:53 CEST)
A reduced order model is developed to monitor aeroengine condition in real time (defining degradation from a baseline state) using data collected by specific sensors. The reduced model is constructed by applying higher order singular value decomposition plus interpolation to appropriate data organized in tensor form. These data are obtained using a detailed engine model that takes the engine physics into account. Thus, the method synergically combines the advantages of data-driven (fast online operation) and model-based (the engine physics is accounted for) condition monitoring methods. Using this reduced order model as a surrogate of the engine model, two gradient-like condition monitoring tools are constructed. The first tool is extremely fast and able to precisely compute on the fly the turbine inlet temperature, which is a paramount parameter for engine performance, operation, and maintenance, and can only be roughly estimated by the engine instrumentation in civil aviation. The second tool is not as fast (but still reasonably inexpensive) and precisely computes both the engine degradation and the turbine inlet temperature at which the sensor data were acquired. These tools are robust against random noise added to the sensor data and can be straightforwardly applied to other mechanical systems.
ARTICLE | doi:10.20944/preprints202306.0490.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: Data flow testing; higher-order mutation testing; “ProbSubsumes”; “ProbBetter”
Online: 7 June 2023 (08:22:34 CEST)
Data-flow and higher-order mutation are white-box testing techniques. To our knowledge, no work has been proposed to compare data flow and higher-order mutation. This paper compares the all def-uses data-flow and second-order mutation criteria. This comparison investigates the subsumption relation between the two criteria and evaluates the effectiveness of test data developed for each. To compare the two criteria, a set of test data satisfying each criterion is generated and used to explore whether one criterion subsumes the other and to assess the effectiveness of the test set developed for one methodology in terms of the other. The results showed that the mean mutation coverage ratio of the all du-pairs adequate test cover is 80.9%, and the mean data flow coverage ratio of the 2nd-order mutant adequate test cover is 98.7%. Consequently, 2nd-order mutation “ProbSubsumes” the all du-pairs data flow criterion. The failure detection efficiency of mutation (98%) is significantly better than that of data flow (86%). Consequently, 2nd-order mutation testing is “ProbBetter” than all du-pairs data flow testing. In contrast, the test suite for 2nd-order mutation is larger than that for all du-pairs.
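The cross-coverage measurement behind such a "ProbSubsumes" verdict can be sketched abstractly: run each criterion-adequate suite against the other criterion's requirements and compare the mean coverage ratios. Modeling requirements as sets and the 0.95 threshold are assumptions of this sketch, not definitions from the paper.

```python
# Abstract sketch of cross-coverage measurement: requirements (du-pairs or
# mutants) are opaque items, and covers(test, req) says whether a test
# exercises a requirement. Threshold is an illustrative assumption.
def coverage_ratio(test_suite, requirements, covers):
    """Fraction of requirements hit by at least one test in the suite."""
    hit = {r for r in requirements for t in test_suite if covers(t, r)}
    return len(hit) / len(requirements) if requirements else 1.0

def prob_subsumes(ratio_a_on_b, ratio_b_on_a, threshold=0.95):
    """Criterion A 'ProbSubsumes' B if A-adequate tests nearly cover B
    while B-adequate tests do not nearly cover A."""
    return ratio_a_on_b >= threshold and ratio_b_on_a < threshold
```

Plugging in the abstract's ratios (98.7% vs. 80.9%) yields the reported one-way subsumption.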
ARTICLE | doi:10.20944/preprints202003.0298.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Data Mining; Breast Cancer; Hybrid Feature Selection; Machine learning; Support Vector Machine; Optimize Genetic Algorithm; boosting algorithms
Online: 19 March 2020 (11:13:15 CET)
Breast cancer is a significant health issue across the world. It is the most widely diagnosed cancer in women, and early-stage diagnosis and therapy increase patient safety. This paper proposes a combined hybrid feature selection model based on a boosted optimized genetic algorithm (CHFS-BOGA) to forecast breast cancer. This hybrid feature selection approach combines the advantages of three filter feature selection approaches with an optimized genetic algorithm (OGA) to select the best features, improving the performance and scalability of the classification process. We propose the OGA by improving the initial population generation and the genetic operators, using the results of the filter approaches as prior information and using the C4.5 decision tree classifier as the fitness function instead of probability and random selection. The authors collected the updated Wisconsin dataset from the UCI machine learning repository, with a total of 569 rows and 32 columns. The dataset was evaluated using the Explorer interface of the Weka open-source data mining software. The results show that the proposed hybrid feature selection approach significantly outperforms the single filter approaches and principal component analysis (PCA) for optimum feature selection, and the selected characteristics are good indicators for prediction. The highest accuracy achieved before (CHFS-BOGA) using support vector machine (SVM) classifiers was 97.3%. The highest accuracy after (CHFS-BOGA-SVM) was 98.25% on a 70.0% train / remainder test split, and 100% on the full training set. Moreover, the area under the receiver operating characteristic (ROC) curve was equal to 1.0. These results show that the proposed CHFS-BOGA-SVM system was able to accurately classify breast tumors as malignant or benign.
ARTICLE | doi:10.20944/preprints202306.0192.v1
Subject: Computer Science And Mathematics, Computer Vision And Graphics Keywords: trademarks; data protection; artificial intelligence; image processing; trademark retrieval
Online: 2 June 2023 (11:37:01 CEST)
CNN-based off-the-shelf features have shown themselves to be a good baseline for trademark retrieval. However, in recent years the computer vision field has been transitioning from CNNs to a new architecture, the Vision Transformer. In this paper, we investigate the performance of off-the-shelf features extracted with vision transformers and explore the effects of pre-processing, post-processing, and pre-training on big datasets. We propose a method for the joint use of global and local features, which leverages the best aspects of both approaches. Experimental results on the METU Trademark Dataset show that off-the-shelf features extracted with ViT-based models outperform off-the-shelf features from CNN-based models. The proposed method achieves an mAP of 31.23, surpassing previous state-of-the-art results. We assume that the proposed approach to trademark similarity evaluation will improve the protection of such data with the help of artificial intelligence methods. Moreover, this approach will allow one to identify cases of unfair use of such data and form an evidence base for litigation.
ARTICLE | doi:10.20944/preprints201610.0067.v1
Subject: Computer Science And Mathematics, Applied Mathematics Keywords: point information gain; Rényi entropy; data processing
Online: 17 October 2016 (11:35:13 CEST)
We generalize the point information gain (PIG) and derived quantities, i.e., the point information gain entropy (PIE) and point information gain entropy density (PIED), to the case of the Rényi entropy, and simulate the behavior of PIG for typical distributions. We also use these methods for the analysis of multidimensional datasets. We demonstrate the main properties of PIE/PIED spectra for real data using several example images, and discuss further possible applications in other fields of data processing.
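A minimal sketch of the underlying quantities: the Rényi entropy of a discrete histogram, and a point information gain computed as the entropy change when one occurrence of a value is removed. The discretization and the sign convention are assumptions of this sketch, not the paper's exact definitions.

```python
import math

# Rényi entropy of a histogram of counts, and a PIG-style quantity: the
# change in entropy when one occurrence of `value` is removed. Sign
# convention and discretization are illustrative assumptions.
def renyi_entropy(counts, alpha):
    n = sum(counts.values())
    if alpha == 1.0:     # Shannon limit of the Rényi family
        return -sum((c / n) * math.log(c / n) for c in counts.values())
    s = sum((c / n) ** alpha for c in counts.values())
    return math.log(s) / (1.0 - alpha)

def point_information_gain(counts, value, alpha=2.0):
    """Entropy of the full histogram minus entropy without one `value`."""
    without = dict(counts)
    without[value] -= 1
    if without[value] == 0:
        del without[value]
    return renyi_entropy(counts, alpha) - renyi_entropy(without, alpha)
```

For a balanced two-bin histogram the collision entropy (alpha = 2) equals ln 2, and removing a point from either bin lowers the entropy, giving a positive gain.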
Subject: Computer Science And Mathematics, Information Systems Keywords: microaggregation; k-anonymity; privacy; data utility
Online: 23 July 2019 (11:42:34 CEST)
With a data revolution underway for some time, there is increasing demand for formal privacy protection mechanisms that are not overly destructive. In this context, microaggregation is a popular high-utility approach designed to satisfy the well-known k-anonymity criterion while applying low distortion to the data. However, standard performance metrics are commonly based on mean squared error, which hardly captures the utility degradation related to a specific application domain of the data. In this work, we evaluate the performance of k-anonymous microaggregation in terms of the loss in classification accuracy of machine-learned models built from perturbed data. Systematic experimentation is carried out on four microaggregation algorithms tested over four datasets. The empirical utility of the resulting microaggregated data is assessed using the learning algorithm that obtains the highest accuracy on the original data. Validation tests are performed on a test set of non-perturbed data. The results confirm k-anonymous microaggregation as a high-utility privacy mechanism in this context, and distortion based on mean squared error as a poor predictor of practical utility. Finally, we corroborate the beneficial effects for empirical utility of exploiting the statistical properties of data when constructing privacy-preserving algorithms.
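Microaggregation itself can be illustrated with a toy fixed-size heuristic: repeatedly group the record furthest from the centroid with its k-1 nearest neighbors and replace each group by its mean, so every published record is shared by at least k subjects. This simplified MDAV-style sketch is for illustration and is not one of the four algorithms evaluated in the paper.

```python
# Toy fixed-size microaggregation on numeric records (MDAV-style heuristic).
# Output rows are group centroids, each repeated for every group member, so
# each distinct row occurs at least k times (k-anonymity on these columns).
# Note: output order follows group formation, not the input order.
def microaggregate(records, k):
    def centroid(rows):
        return [sum(c) / len(rows) for c in zip(*rows)]
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    remaining = list(records)
    out = []
    while len(remaining) >= 2 * k:
        c = centroid(remaining)
        far = max(remaining, key=lambda r: dist2(r, c))   # extreme record
        group = sorted(remaining, key=lambda r: dist2(r, far))[:k]
        g = centroid(group)
        out.extend([g] * len(group))
        for r in group:
            remaining.remove(r)
    if remaining:                                          # final group
        g = centroid(remaining)
        out.extend([g] * len(remaining))
    return out
```

The distortion this introduces is exactly what mean-squared-error metrics measure, and what the classification-accuracy evaluation above measures instead.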
ARTICLE | doi:10.20944/preprints202308.1237.v1
Subject: Engineering, Transportation Science And Technology Keywords: data mining; data extraction; data science; cost infrastructure projects
Online: 17 August 2023 (09:25:22 CEST)
Context: Despite the effort put into developing standards for structuring construction costs and the strong interest in the field, most construction companies still perform data gathering and processing manually. This provokes inconsistencies, differing classification criteria, and misclassifications, and makes the process very time-consuming, particularly on big projects. Additionally, the lack of standardization makes cost estimation and comparison tasks very difficult. Objective: To create a method to extract and organize construction cost and quantity data into a consistent format and structure, enabling rapid and reliable digital comparison of the content. Method: The approach consists of two steps: first, the system applies data mining to review the input document and determine how it is structured, based on the position, format, sequence, and content of descriptive and quantitative data; second, the extracted data is processed and classified with a combination of data science and expert knowledge to fit a common format. Results: A large variety of information from real historical projects has been successfully extracted and processed into a common format with 97.5% accuracy, using a subset of 5,770 assets located in 18 different files, building a solid base for analysis and comparison. Conclusion: A robust and accurate method was developed for extracting hierarchical project cost data into a common machine-readable format to enable rapid and reliable comparison and benchmarking.
ARTICLE | doi:10.20944/preprints201610.0012.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: data exchange; resource donations; text mining
Online: 5 October 2016 (15:08:32 CEST)
Bio-molecular reagents like antibodies required in experimental biology are expensive, and their effectiveness, among other things, is critical to the success of an experiment. Although such resources are sometimes donated by one investigator to another through personal communication, there is, to our knowledge, no previous study on the extent of such donations, nor a central platform that directs resource seekers to donors. In this paper, we describe, to our knowledge, a first attempt at building a web portal, titled Bio-Resource Exchange, that attempts to bridge this gap between resource seekers and donors in the domain of experimental biology. Users on this portal can request or donate antibodies, cell lines, and DNA constructs. This resource could also serve as a crowd-sourced database of resources for experimental biology. Further, in order to index donations outside of our portal, we mined scientific articles to find instances of antibody donations and attempted to extract information about these donations at the finest granularity. Specifically, we extracted the name of the donor, his or her affiliation, and the name of the antibody for every donation by parsing the acknowledgements sections of articles. To extract annotations at this level, we propose two approaches: a rule-based algorithm and a bootstrapped relation learning algorithm. The algorithms extracted donor names, affiliations, and antibody names with average accuracies of 57% and 62%, respectively. We also created a dataset of 50 expert-annotated acknowledgements sections that will serve as a gold-standard dataset to evaluate extraction algorithms in the future. Contact: email@example.com, firstname.lastname@example.org Database URL: http://tonks.dbmi.pitt.edu/brx Supplementary information: Supplementary data are available at Database online.
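The rule-based route can be sketched with a single hand-written pattern that pulls (antibody, donor) pairs out of an acknowledgements sentence. This one pattern is an illustrative assumption and is far simpler than the paper's rule set, which also extracts affiliations.

```python
import re

# One illustrative rule: "<X> antibody/antibodies (was/were) (kindly)
# provided/donated/gifted by <Capitalized Name>". A real rule set would
# need many such patterns plus affiliation extraction.
DONATION_RE = re.compile(
    r"(?P<antibody>[\w-]+\s+antib(?:ody|odies))\s+(?:was|were)?\s*"
    r"(?:kindly\s+)?(?:provided|donated|gifted)\s+by\s+"
    r"(?P<donor>(?:Dr\.|Prof\.)?\s*[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)",
)

def extract_donations(text):
    """Return (antibody, donor) pairs found in an acknowledgements text."""
    return [(m.group("antibody"), m.group("donor").strip())
            for m in DONATION_RE.finditer(text)]
```

The capitalized-token run ends the donor name at the first lowercase word, which roughly bounds names before phrases like "of the University of …".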
ARTICLE | doi:10.20944/preprints201612.0077.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: rule based models; gene expression data; bayesian networks; parsimony
Online: 15 December 2016 (08:21:24 CET)
The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their abilities to summarize the observed data. We extend a Bayesian Rule Learning (BRL-GSS) algorithm, previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters, and therefore the number of rules, is combinatorial in the number of predictor variables in the model. We relax these global constraints to a more generalizable local structure (BRL-LSS). BRL-LSS entails a more parsimonious set of rules because it does not have to generate all combinatorial rules; the search space of local structures is much richer than the space of global structures. We design BRL-LSS with the same worst-case time complexity as BRL-GSS while exploring a richer and more complex model space. We measure predictive performance using the area under the ROC curve (AUC) and accuracy, and model parsimony by the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS, and the state-of-the-art C4.5 decision tree algorithm using 10-fold cross-validation on ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in terms of predictive performance while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance.
We also conduct a feasibility study to demonstrate the general applicability of our BRL methods on the newer RNA sequencing gene-expression data.
ARTICLE | doi:10.20944/preprints202306.0974.v1
Subject: Engineering, Industrial And Manufacturing Engineering Keywords: Reliability estimation; EM algorithm; Censored data; Weibull distribution; Industrial equipment; Maintenance optimization; Failure analysis; Proactive maintenance
Online: 14 June 2023 (07:50:43 CEST)
Centrifugal pumps are widely employed in the oil refinery industry due to their efficiency and effectiveness in fluid transfer applications. The reliability of pumps plays a pivotal role in ensuring uninterrupted plant productivity and safe operations. Analysis of failure history data shows that bearings are critical components in oil refinery pump groups. However, traditional reliability estimation theories may not apply when data is limited or subject to right censoring. This paper addresses the complexity of estimating the Weibull distribution parameters using the maximum-likelihood method under these conditions. The likelihood equation lacks an explicit analytical solution, necessitating numerical methods for its resolution. The approach presented in this article leverages the Expectation-Maximization (EM) algorithm to estimate the Weibull distribution parameters. This method provides more accurate estimates of failure rates and probabilities by accounting for limited and censored data. The findings are demonstrated through a case study showcasing the practical application of the proposed approach.
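To make the estimation target concrete, the sketch below maximizes the right-censored Weibull likelihood directly rather than via EM: for a fixed shape k the scale has the closed-form profile λ^k = Σ tᵢ^k / d (d = number of observed failures), leaving a one-dimensional search over k. This is a standard textbook route shown for illustration, not the EM procedure proposed in the paper.

```python
import math

# Direct maximum-likelihood estimation of Weibull (shape k, scale lambda)
# from right-censored data: closed-form profile of the scale plus a grid
# search over the shape. Grid bounds are illustrative assumptions.
def weibull_censored_mle(times, censored, shapes=None):
    """times: observed times; censored[i] is True if right-censored."""
    d = sum(1 for c in censored if not c)          # number of failures
    shapes = shapes or [0.1 * i for i in range(5, 51)]  # k in [0.5, 5.0]
    best = None
    for k in shapes:
        lam = (sum(t ** k for t in times) / d) ** (1.0 / k)
        ll = sum(-((t / lam) ** k) for t in times)  # survival terms, all units
        ll += sum(math.log(k) - k * math.log(lam) + (k - 1) * math.log(t)
                  for t, c in zip(times, censored) if not c)  # density terms
        if best is None or ll > best[0]:
            best = (ll, k, lam)
    return best[1], best[2]   # (shape, scale) estimates
```

On simulated data with shape 1.5 and scale 2.0, censored at t = 3, the estimates land close to the true values; EM becomes attractive when the censoring structure is more complex than this.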
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Artificial intelligence; machine learning; real-time probabilistic data; for cyber risk; super forecasting; red teaming;
Online: 12 April 2021 (12:18:14 CEST)
Multiple governmental agencies and private organisations have made commitments for the colonisation of Mars. Such colonisation requires complex systems and infrastructure that could be very costly to repair or replace in cases of cyber-attacks. This paper surveys deep learning algorithms, IoT cyber security and risk models, and established mathematical formulas to identify the best approach for developing a dynamic and self-adapting system for predictive cyber risk analytics supported by Artificial Intelligence and Machine Learning and real-time intelligence in edge computing. The paper presents a new mathematical approach for integrating concepts of cognition engine design, edge computing, and Artificial Intelligence and Machine Learning to automate anomaly detection. This engine instigates a step change by applying Artificial Intelligence and Machine Learning embedded at the edge of IoT networks, to deliver safe and functional real-time intelligence for predictive cyber risk analytics. This will enhance capacities for risk analytics and assist in the creation of a comprehensive and systematic understanding of the opportunities and threats that arise when edge computing nodes are deployed, and when Artificial Intelligence and Machine Learning technologies are migrated to the periphery of the internet and into local IoT networks.
ARTICLE | doi:10.20944/preprints201805.0120.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: cost prediction of substation projects; improved least square support vector machine; wolf pack algorithm; data inconsistency rate
Online: 8 May 2018 (05:01:45 CEST)
Accurate and stable cost forecasting of substation projects is of great significance in ensuring the economic construction and sustainable operation of power engineering projects. In this paper, a forecasting model based on the improved least squares support vector machine (ILSSVM) optimized by the wolf pack algorithm (WPA) is proposed to improve the accuracy and stability of cost forecasting for substation projects. Firstly, the optimal features are selected through the data inconsistency rate (DIR), which helps reduce redundant input vectors. Secondly, the wolf pack algorithm is used to optimize the parameters of the improved least squares support vector machine. Lastly, the WPA-DIR-ILSSVM cost forecasting method is established. In this paper, 88 substation projects in different regions from 2015 to 2017 are chosen for training tests to verify the validity of the model. The results indicate that the new hybrid WPA-DIR-ILSSVM model presents better accuracy, robustness, and generality in cost forecasting of substation projects.
REVIEW | doi:10.20944/preprints202103.0216.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: machine learning; deep learning; artificial intelligence; data science; data-driven decision making; predictive analytics; intelligent applications;
Online: 8 March 2021 (12:55:59 CET)
In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding real-world applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. In addition, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, this study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application areas, such as cybersecurity, smart cities, healthcare, business, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point not only for application developers but also for decision-makers and researchers in various real-world application areas, particularly from a technical point of view.
ARTICLE | doi:10.20944/preprints202103.0753.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: unsupervised feature selection; histogram-valued data; compactness; hierarchical conceptual clustering; multi-role measure; visualization
Online: 31 March 2021 (07:53:39 CEST)
This paper presents an unsupervised feature selection method for multi-dimensional histogram-valued data. We define a multi-role measure, called the compactness, based on the concept size of given objects and/or clusters described by a fixed number of equal-probability bin-rectangles. In each step of clustering, we agglomerate objects and/or clusters so as to minimize the compactness of the generated cluster. This means that the compactness plays the role of a similarity measure between the objects and/or clusters to be merged. Minimizing the compactness is equivalent to maximizing the dissimilarity of the generated cluster, i.e., concept, against the whole concept in each step. In this sense, the compactness also plays the role of a cluster quality measure. We further show that the average compactness of each feature with respect to objects and/or clusters over several clustering steps is useful as a feature effectiveness criterion. Features having small average compactness are mutually covariate and are able to detect geometrically thin structures embedded in the given multi-dimensional histogram-valued data. We obtain a thorough understanding of the given data by visualization using dendrograms and scatter diagrams with respect to the selected informative features. We illustrate the effectiveness of the proposed method using an artificial data set and real histogram-valued data sets.
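The agglomeration principle above (at each step, merge whichever pair of objects/clusters minimizes a size measure of the resulting cluster) can be sketched in miniature. This is an illustrative analogue only: the 1-D bounding-interval length `span` stands in for the paper's compactness over equal-probability bin-rectangles, and the toy points are hypothetical.

```python
def span(cluster):
    """Bounding-interval length of a set of 1-D points: a crude 'concept size'."""
    return max(cluster) - min(cluster)

def agglomerate(points, n_clusters):
    """Greedily merge the pair of clusters whose union has the smallest span."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (merged span, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = span(clusters[i] + clusters[j])
                if best is None or s < best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

With points `[0, 0.1, 0.2, 5, 5.1]` and two target clusters, the greedy merges recover the two tight groups, mirroring how minimizing the size measure acts as a similarity criterion.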
ARTICLE | doi:10.20944/preprints202304.0959.v1
Subject: Engineering, Mechanical Engineering Keywords: machine learning; mechanical damage detection; pipelines; physics-informed datasets; simulations; welding detection; CNN structure optimization; sensing system; data classification performance and noise robustness
Online: 26 April 2023 (04:59:54 CEST)
This study proposes a machine-learning-based framework for detecting mechanical damage in pipelines, utilizing physics-informed datasets collected from simulations of mechanical damage. The framework provides an effective workflow from dataset generation to damage detection and identification for three types of pipeline events: welds, clamps, and corrosion defects. While the study initially focused on optimizing the CNN structure using various advanced optimizers, it also investigated the impact of sensing systems on data classification and the effect of noise on classification performance. The analysis highlights the importance of selecting the appropriate sensing system for the specific application. The authors also found that the proposed framework is robust to experimentally relevant levels of noise, suggesting its applicability in real-world scenarios where noise is present. Overall, this study contributes to the development of a more reliable and effective method for detecting mechanical damage in pipelines: the proposed framework provides an effective workflow for damage detection and identification, and the findings on the impact of sensing systems and noise add to its robustness and reliability.
ARTICLE | doi:10.20944/preprints202305.1636.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Multiple extended targets; Data association; Tracklets; Min-cost network flow; Intermittent measurements
Online: 23 May 2023 (10:17:47 CEST)
The main problem in multiple extended target tracking is distinguishing the origins of the measurements. Associating measurements with their possible origins within a target's extent is difficult, especially under occlusions or in detection blind zones, which cause intermittent measurements. To solve this problem, a hierarchical network-based tracklets data association algorithm is proposed. At the low level, a min-cost network flow model is used to extract possible tracklets from the divided measurement set. At the high level, the trajectories are estimated from the tracks produced by the low-level network. The experimental results show that the hierarchical network-based tracklets data association algorithm outperforms the JPDA and RFS-based methods when measurements are intermittently unavailable.
ARTICLE | doi:10.20944/preprints202104.0142.v1
Subject: Physical Sciences, Acoustics Keywords: atomic data; inner-shell photoionization; atomic nitrogen ion
Online: 5 April 2021 (14:22:55 CEST)
High-resolution K-shell photoionization cross-sections for the C-like atomic nitrogen ion (N+) are reported in the 398 eV (31.15 Å) to 450 eV (27.55 Å) energy (wavelength) range. The results were obtained from absolute ion-yield measurements using the SOLEIL synchrotron radiation facility for spectral bandpasses of 65 meV or 250 meV. In the photon energy region 398 eV - 403 eV, 1s⟶2p autoionizing resonance states dominated the cross-section spectrum. Analyses of the experimental profiles yielded resonance strengths and Auger widths. In the 415 eV - 440 eV photon region, 1s⟶1s2s²2p²(⁴P)np and 1s⟶1s2s²2p²(²P)np resonances forming well-developed Rydberg series up to n=7 and n=8, respectively, were identified in both the single and double ionization spectra. Theoretical photoionization cross-section calculations, performed using the R-matrix plus pseudo-states (RMPS) method and the multiconfiguration Dirac-Fock (MCDF) approach, were benchmarked against these high-resolution experimental results. Comparison of the state-of-the-art theoretical work with the experimental studies allowed the identification of new resonance features. Resonance strengths, energies and Auger widths (where available) are compared quantitatively with the theoretical values. Contributions from excited metastable states of the N+ ions were carefully considered throughout.
ARTICLE | doi:10.20944/preprints202001.0274.v1
Subject: Computer Science And Mathematics, Mathematical And Computational Biology Keywords: bioinformatics; computational genomics; computational medicine; data science; data visualization; parallel processing; grid computing; fog computing
Online: 24 January 2020 (10:26:26 CET)
Conventional data visualization software has greatly improved the efficiency of mining and visualizing biomedical data. However, applying a grid computing approach to such visualization can hypothetically increase research opportunities, at the cost of added complexity. This paper first presents data visualization examples in conventional networks, then goes into greater detail about more complex techniques for leveraging parallel processing architectures. These techniques include an attempt to build a basic generative adversarial network (GAN) in order to increase the statistical pool of biomedical data for analysis, as well as an introduction to the project utilizing the decentralized-internet SDK. The paper thus moves from the conventional examples to the deeper experimentation and its self-contained results.
ARTICLE | doi:10.20944/preprints202301.0522.v1
Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: autonomous vehicle; data set; multidriver; biometric
Online: 28 January 2023 (07:55:36 CET)
The development of autonomous vehicles is becoming increasingly popular, and gathering real-world data is considered a valuable task. Many datasets have been published recently in the autonomous vehicle sector, with synthetic datasets gaining particular interest due to availability and cost. For a real implementation and correct evaluation of vehicles at higher levels of autonomy, it is also necessary to consider human interaction, which is precisely what existing datasets lack. This article presents the UPCT dataset, a public dataset containing high-quality multimodal data obtained with state-of-the-art sensors and equipment installed onboard the UPCT's CICar autonomous vehicle. The dataset includes data from a variety of perception sensors, including 3D LiDAR, cameras, IMU, GPS and encoders, as well as driver biometric data and driver behaviour questionnaires. In addition to the dataset, the software developed for data synchronisation and processing has been made available. The quality of the dataset was validated using an end-to-end neural network model with multiple inputs to obtain speed and steering wheel angle, obtaining very promising results.
ARTICLE | doi:10.20944/preprints202110.0362.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: 3D reconstruction; 3D data smoothing; mesh simplification; high resolution micro-CT images
Online: 25 October 2021 (15:34:27 CEST)
Three-dimensional reconstruction plays an important role in assisting doctors and surgeons in diagnosing the healing progress of bone defects. Common three-dimensional reconstruction methods include surface and volume rendering. As the focus here is on the shape of the bone, volume rendering is omitted. Many improvements have been made to surface rendering methods such as Marching Cubes and Marching Tetrahedra, but few towards real-time or near real-time surface rendering for large medical images, or towards studying the effects of different parameter settings for these improvements. Hence, in this study, an attempt at near real-time surface rendering for large medical images is made. Different parameter values are tested to study their effect on reconstruction accuracy, reconstruction and rendering time, and the number of vertices and faces. The proposed improvement, involving three-dimensional data smoothing with a Gaussian convolution kernel of size 0.5 and a mesh simplification reduction factor of 0.1, is the best parameter value combination for achieving a good balance between high reconstruction accuracy, low total execution time, and a low number of vertices and faces. It successfully increased the reconstruction accuracy by 0.0235%, decreased the total execution time by 69.81%, and decreased the number of vertices and faces by 86.57% and 86.61%, respectively.
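As a one-dimensional analogue of the three-dimensional data smoothing step described above, the sketch below builds a normalised Gaussian convolution kernel (using the abstract's value of 0.5 for sigma; the `radius` truncation parameter is a hypothetical choice, not taken from the study) and applies it to a signal with clamped borders.

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights on [-radius, radius], normalised to sum to 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma=0.5, radius=2):
    """Convolve a 1-D signal with the Gaussian kernel, clamping at the borders."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + k, 0), n - 1)  # clamp index at the edges
            acc += w * signal[j]
        out.append(acc)
    return out
```

Because the kernel sums to one, flat regions are preserved while isolated spikes (e.g. voxel noise) are attenuated, which is what makes the subsequent mesh simplification cheaper.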
ARTICLE | doi:10.20944/preprints202304.0222.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Global models; Deep learning; Data partitioning; Time series features; Model complexity; Intermittent demand; Retail
Online: 11 April 2023 (10:41:55 CEST)
Global models have been developed to tackle the challenge of forecasting sets of series that are related or share similarities, but not for heterogeneous datasets. Various methods of partitioning by relatedness have been introduced to enhance the similarities within the set, resulting in improved forecasting accuracy but often at the cost of a reduced sample size, which could be harmful. To shed light on how the relatedness between series impacts the effectiveness of global models in real-world demand forecasting problems, we perform an extensive empirical study using the M5 competition dataset. We examine cross-learning scenarios driven by the product hierarchy commonly employed in retail planning, which allow global models to capture interdependencies across products and regions more effectively. Our findings show that global models outperform state-of-the-art local benchmarks by a considerable margin, indicating that they are not inherently more limited than local models and can handle unrelated time series data effectively. The accuracy of data partitioning approaches increases as the size of the data pools and the models' complexity decrease. However, there is a trade-off between data availability and data relatedness. Smaller data pools lead to increased similarity among time series, making it easier to capture cross-product and cross-region dependencies, but this comes at the cost of a reduced sample, which may not be beneficial. Finally, it is worth noting that the successful implementation of global models for heterogeneous datasets can significantly impact forecasting practice.
ARTICLE | doi:10.20944/preprints202112.0070.v1
Subject: Engineering, Control And Systems Engineering Keywords: Exoplanets Detection; Deep learning; Real and Simulated Data
Online: 6 December 2021 (12:36:42 CET)
Scientists and astronomers have attached great importance to the task of discovering new exoplanets, even more so if they are in the habitable zone. To date, more than 4300 exoplanets have been confirmed by NASA using various discovery techniques, including planetary transits, in addition to the use of various databases provided by space and ground-based telescopes. This article proposes the development of a deep learning system for detecting planetary transits in Kepler Telescope lightcurves. The approach builds on related work from the literature and is enhanced for validation with real lightcurves. A CNN classification model is trained on a mixture of real and synthetic data, and validated only on real data distinct from those used in the training stage. The best ratio of synthetic data is determined by applying an optimisation technique and a sensitivity analysis. The precision, accuracy and true positive rate of the best model obtained are determined and compared with other similar works. The results demonstrate that the use of synthetic data in the training stage can improve transit detection performance on real lightcurves.
ARTICLE | doi:10.20944/preprints202008.0074.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: data mining; cardiovascular diseases; cluster analysis; principal component analysis
Online: 4 August 2020 (03:56:19 CEST)
Cardiovascular disease is the number one cause of death in the world: according to the WHO, around 31% of deaths worldwide are caused by cardiovascular diseases, and more than 75% of these deaths occur in developing countries. Patients with cardiovascular disease generate many medical records that can be used for further patient management. This study aims to develop a data mining method that groups patients with cardiovascular disease in order to determine the level of patient complications in two clusters. The method applied is principal component analysis (PCA), which reduces the dimensions of the large available dataset, combined with cluster analysis implementing the K-Medoids algorithm. Data reduction with PCA produced five new components with a cumulative proportion of variance of 0.8311. The five new components were used for cluster formation with the K-Medoids algorithm, resulting in two clusters with a silhouette coefficient of 0.35. The combination of data reduction by PCA with the K-Medoids clustering algorithm is a new way of grouping data of patients with cardiovascular disease based on the level of patient complications in each generated cluster.
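A minimal sketch of the K-Medoids alternating loop that follows the PCA step in studies like this one; it is not the paper's implementation, and the deterministic seeding, plain Euclidean distance, and toy 2-D points are illustrative assumptions (it also assumes no cluster ever becomes empty).

```python
def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmedoids(points, k, dist, iters=20):
    """Naive K-Medoids: alternate nearest-medoid assignment and medoid update."""
    medoids = points[:k]  # deterministic seed: the first k points
    clusters = []
    for _ in range(iters):
        # assign each point to its nearest medoid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[idx].append(p)
        # update: each medoid becomes the member minimising total in-cluster distance
        new_medoids = [min(c, key=lambda m: sum(dist(m, q) for q in c))
                       for c in clusters]
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids, clusters
```

Unlike K-Means, the cluster centers are always actual data points, which is why K-Medoids is less sensitive to outliers in skewed medical data.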
ARTICLE | doi:10.20944/preprints202204.0068.v1
Subject: Computer Science And Mathematics, Computational Mathematics Keywords: Functional Data Analysis; Image Processing; Brain Imaging; Neuroimaging; Computational Neuroscience; Data Science
Online: 8 April 2022 (03:21:06 CEST)
Functional Data Analysis (FDA) is a relatively new field of statistics dealing with data expressed in the form of functions. FDA methodologies can be easily extended to the study of imaging data, an application proposed in Wang et al. (2020), where the authors settle the mathematical groundwork and properties of the proposed estimators. This methodology allows for the estimation of mean functions and simultaneous confidence corridors (SCC), also known as simultaneous confidence bands, for imaging data and for the difference between two groups of images. This is especially relevant for the field of medical imaging, as one of the most common research setups consists of the comparison between two groups of images: a pathological set against a control set. FDA applied to medical imaging presents at least two advantages over previous methodologies: it avoids loss of information in complex data structures and avoids the multiple comparison problem arising from traditional pixel-to-pixel comparisons. Nonetheless, computing times for this technique have only been explored in reduced and simulated setups (Arias-López et al., 2021). In the present article, we apply this procedure to a practical case with data extracted from open neuroimaging databases, and then measure computing times for the construction of Delaunay triangulations and for the computation of mean functions and SCCs for one-group and two-group approaches. The results suggest that previous research has been too conservative in its parameter selection and that computing times for this methodology are reasonable, confirming that this method should be further studied and applied to the field of medical imaging.
ARTICLE | doi:10.20944/preprints202011.0297.v1
Subject: Computer Science And Mathematics, Mathematics Keywords: regression; time point data; modelling
Online: 10 November 2020 (10:00:37 CET)
In this paper, we present a regression-based modelling approach to analyse multiple-series MTC data. A typical application of this modelling approach includes three steps: first, define a model that approximates the relationship between gene expression and experimental factors, with parameters incorporated to address the research interest; second, use least-squares and estimating-equation methods to estimate the parameters and their corresponding standard errors; third, compute test statistics, P-values and NFD as measures of statistical significance. The advantages of this approach are as follows. First, it addresses the research interest in a specific, precise way, and maximally uses all the data and other relevant information. Second, it accounts for both systematic and random variation associated with the data, and the results of such an analysis provide not only gene-specific information relevant to the research goal but also its reliability, thereby helping investigators make better decisions for subsequent studies. Third, this approach is very flexible, and can easily be extended to other kinds of MTC studies or other microarray experiments by formulating different models based on the experimental design of the studies.
COMMUNICATION | doi:10.20944/preprints201803.0054.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: data feature selection; data clustering; travel time prediction
Online: 7 March 2018 (13:30:06 CET)
In recent years, governments have applied intelligent transportation system (ITS) techniques to provide several convenience services (e.g., a garbage truck app) for residents. This study proposes a garbage truck fleet management system (GTFMS) together with data feature selection and data clustering methods for travel time prediction. A GTFMS includes mobile devices (MDs), on-board units, a fleet management server, and a data analysis server (DAS). When a user requests the arrival time of a garbage truck via an MD, the DAS performs the data feature selection and data clustering procedures to analyse the travel time of the garbage truck. The proposed methods cluster the records of travel time and reduce variation in order to improve travel time prediction. After predicting the travel time and arrival time, the predicted information is sent to the user's MD. In the experimental environment, the results showed that the accuracies of the previous method and the proposed method are 16.73% and 85.97%, respectively. Therefore, the proposed data feature selection and data clustering methods can be used to predict the stop-to-stop travel time of garbage trucks.
ARTICLE | doi:10.20944/preprints202302.0362.v2
Subject: Engineering, Bioengineering Keywords: Wearable devices; Wearable sensors; Data glove; Biomechatronic design; Hand kinematics; Joint measurement; Flex sensors; Biomedical engineering
Online: 27 February 2023 (10:40:17 CET)
For technical or medical applications, knowledge of the exact kinematics of the human hand is key to utilizing its capability to handle and manipulate objects and to communicate with other humans or machines. The optimal relationship between the number of measurement parameters, measurement accuracy, and the complexity, usability and cost of the measuring system is hard to find. Biomechanical assumptions, the concepts of a biomechatronic system, the mechatronic design process, and commercially available components are used to develop a sensorized glove. The proposed wearable can measure 14 of the 15 angular values of a simplified hand model introduced in this paper. Additionally, five contact pressure values at the fingertips and inertial data of the whole hand with six degrees of freedom are gathered. Due to the modular design and a hand size examination based on anthropometric parameters, the concept of the wearable is applicable to a large variety of hand sizes and adaptable to different use cases. Validations show a combined root-mean-square error of 0.99° to 2.38° for the measurement of all joint angles at one finger, surpassing the human perception threshold and the current state of the art in science and technology for comparable systems.
ARTICLE | doi:10.20944/preprints202305.0390.v1
Subject: Public Health And Healthcare, Other Keywords: exploratory data analysis; non-parametric statistics; skewed data; survival analysis; repeated measures.
Online: 6 May 2023 (08:32:28 CEST)
Outliers can influence regression model parameters and change the direction of the estimated effect, over-estimating or under-estimating the strength of the association between a response variable and an exposure of interest. Identifying visit-level outliers in longitudinal data with continuous time-dependent covariates is important, especially when the distribution of such a variable is highly skewed at follow-up visits. The primary objective was to identify potential outliers at follow-up visits using the interquartile range (IQR) statistic, motivated by a large TEDDY dietary longitudinal and time-to-event dataset with continuous time-varying vitamin B12 intake as the exposure of interest and time to developing Islet Autoimmunity (IA) as the response variable. The IQR method was also applied to simulated data. To assess the impact of the outliers detected by the IQR method, the data were analyzed using a Cox proportional hazards model with a robust sandwich estimator. Partial residual diagnostic plots were used to detect highly influential outliers. Results showed how some of the detected outliers had a large influence on the Cox regression model and changed both the direction of hazard ratios and the strength of association with the risk of developing IA. In conclusion, the IQR method is useful for identifying potential visit-level outliers, which can then be further investigated.
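The IQR rule described above amounts to the classic Tukey fences; in this sketch, the linear-interpolation quartile scheme and the k=1.5 multiplier are conventional choices, not necessarily the exact ones used in the study.

```python
def quartiles(values):
    """Lower and upper quartiles via linear interpolation on the sorted sample."""
    s = sorted(values)
    def q(p):
        pos = p * (len(s) - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])
    return q(0.25), q(0.75)

def iqr_outliers(values, k=1.5):
    """Flag values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

Applied per follow-up visit, this flags intake values far outside the bulk of the visit's distribution, which can then be checked for influence on the Cox model.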
ARTICLE | doi:10.20944/preprints201905.0158.v1
Subject: Medicine And Pharmacology, Other Keywords: blockchain; biomedical data managing; DWT; keyword search; data sharing.
Online: 13 May 2019 (13:30:37 CEST)
A crucial role is played by personal biomedical data when it comes to maintaining proficient access to health records by patients as well as health professionals. However, it is difficult to get a unified view of health data that are scattered across various health center/hospital sections. To be specific, health records are distributed across many places and cannot easily be found in integrated form. In recent years, blockchain has been regarded as a promising solution for achieving individual biomedical information sharing in a secure, privacy-preserving way, owing to its immutability. This research work puts forward a blockchain-based management scheme that helps to improve the interpretation of electronic biomedical systems. In this scheme, two blockchains form the base: the second blockchain algorithm generates a secure sequence for the hash key produced by the first blockchain algorithm. An adaptivity feature enables the algorithm to use multiple data types and to combine various biomedical images and text records. All the data, including keywords, digital records and patient identities, are encrypted with a private key, with keyword searching capability, so as to maintain data privacy preservation, access control and protected search. The obtained results, which show low latency (less than 750 ms) at 400 requests/second, indicate its suitability for several health care units such as hospitals and clinics.
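The immutability the scheme relies on comes from hash chaining, which a minimal single-chain sketch can illustrate; the paper's dual-blockchain design with encrypted keyword search is considerably more elaborate, and the field names below are hypothetical.

```python
import hashlib
import json

def make_block(record, prev_hash):
    """Minimal hash-chained block: the record plus the previous block's hash."""
    payload = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    return {"record": record, "prev": prev_hash,
            "hash": hashlib.sha256(payload.encode()).hexdigest()}

def verify_chain(chain):
    """Recompute every hash and check each block points at its predecessor."""
    for i, block in enumerate(chain):
        payload = json.dumps({"record": block["record"], "prev": block["prev"]},
                             sort_keys=True)
        if block["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False  # the record was altered after hashing
        if i > 0 and block["prev"] != chain[i - 1]["hash"]:
            return False  # the chain link is broken
    return True
```

Tampering with any stored record invalidates its hash and every link after it, which is the property that makes an append-only ledger of biomedical records auditable.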
ARTICLE | doi:10.20944/preprints202202.0134.v1
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: DOLAVI; Dolutegravir; Lamivudine; Real World Data; HIV
Online: 9 February 2022 (10:45:33 CET)
Background: The objectives were to determine the real-life effectiveness and safety of dual therapy (DT) with dolutegravir (50 mg/QD) plus lamivudine (300 mg/QD) in a multiple-tablet regimen (MTR) in naïve PLHIV followed up for 48 weeks, and to evaluate the compliance and satisfaction of patients. Material and methods: Open, single-arm, multicenter, non-randomized clinical trial from May 2019 through September 2020 with 48-week follow-up. Results: The study included 88 PLHIV (91% male) with a mean age of 35.9 years; 76.1% were MSM. Mean baseline CD4 was 516.4 cells/uL, with a viral load (VL) of 104,828 cop/mL, and 11.4% were in AIDS stage. DT started within 7 days of the first specialist consultation in all patients, and on the same day in 84.1%; 3.4% had baseline resistance mutations (K103N, V106I+E138A, and V108I); 12.5% were lost to follow-up. At week 48, 86.3% had VL <50 cop/mL by intention-to-treat analysis and 98.7% by per-protocol (PP) analysis. Virological failure (VF) was recorded in 1.1%, with no resistance mutation. One blip was detected in 5.2%, without VF. Three patients reported anxiety, dizziness, and cephalgia, respectively, at week 4 and one insomnia at week 24; none reported adverse events at week 48. Mean weight was 4 kg higher at 48 weeks (p=0.0001) and abdominal circumference 3 cm larger at 24 weeks (p=0.022). No forgetfulness occurred in 98.7% of patients. Patient satisfaction was 90/100 at 4, 24, and 48 weeks. Conclusion: Real-world data demonstrate that dolutegravir plus lamivudine in MTR is effective, safe, and satisfactory, moderately increasing weight and abdominal circumference, and administrable in a test-and-treat strategy.
ARTICLE | doi:10.20944/preprints202209.0341.v1
Subject: Social Sciences, Decision Sciences Keywords: Real State; Regressors; Artificial Intelligence; Machine Learning; Data-informed; Boston
Online: 22 September 2022 (10:33:09 CEST)
Real estate market analysis and place-based decision-making can both benefit from an understanding of house price development. Although considerable interest has been devoted to housing price modelling, the assessment of house price fluctuation still requires further comparative study. Housing price prediction is challenging, as the contributing factors are quite dynamic and subject to a variety of regulating elements. Understanding future housing market trends not only builds customers' investment trust but also enables financial support to be planned more realistically in advance. In this study, a comprehensive data-informed framework is developed to investigate and anticipate real estate house prices using historical data by combining explanatory features. We examined about 500 houses in the Boston area as a case study and discussed how the increase in housing prices could vary with each of the contributing components. Fourteen Machine Learning (ML) regressors are applied to the dataset, leading to a comparative study of the accuracy of all the models. The ML-based regressors forecast real estate home prices as a function of thirteen influencing factors. The most informative features were also selected by applying the Permutation Feature Importance technique to all the features. The study provides a comprehensive tool for evaluating the robustness and efficiency of ML models for housing price predictions. The results highlighted Random Forest as the best model, with an R2 of 0.88, and Voting Regressor as the second highest rated model, with an R2 of 0.87. The results of the multivariate exploratory data analysis also implied that the average number of rooms and the percentage of the lower-status population have the most significant impact on the price range predictions.
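Ranking regressors by the coefficient of determination, as in the comparative study above, reduces to computing R² = 1 − SS_res/SS_tot per model; a small sketch with hypothetical model names and predictions:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

def rank_models(y_true, predictions):
    """Return (model name, R^2) pairs sorted best-first."""
    scores = {name: r_squared(y_true, pred)
              for name, pred in predictions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

R² of 1.0 means perfect prediction, while always predicting the mean scores 0.0, which makes it a natural yardstick for comparing fourteen heterogeneous regressors on one dataset.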
ARTICLE | doi:10.20944/preprints202208.0224.v1
Subject: Engineering, Automotive Engineering Keywords: VR-XGBoost; K-VDTE; ETC data; ESAs; data mining
Online: 12 August 2022 (03:53:23 CEST)
To scientifically and effectively evaluate the service capacity of expressway service areas (ESAs) and improve the management level of ESAs, we propose a method for the recognition of vehicles entering ESAs (VeESAs) and the estimation of vehicle dwell times using ETC data. First, the ETC data and their advantages are described in detail, and cleaning rules are designed according to the characteristics of the ETC data. Second, we establish feature engineering according to the characteristics of VeESAs and propose the XGBoost-based VeESA recognition (VR-XGBoost) model. Having studied the driving rules in depth, we construct a kinematics-based vehicle dwell time estimation (K-VDTE) model. Field validation in Part A/B of the Yangli ESA using real ETC transaction data demonstrates that our proposal outperforms the current state of the art. Specifically, in Part A and Part B, the recognition accuracies of VR-XGBoost are 95.9% and 97.4%, respectively; the mean absolute errors (MAEs) of dwell time are 52 s and 14 s, respectively; and the root mean square errors (RMSEs) are 69 s and 22 s, respectively. In addition, the confidence level for controlling the MAE of dwell time within 2 minutes is more than 97%. This work can effectively identify VeESAs and accurately estimate dwell times, providing a reference and theoretical basis for the service capacity evaluation and layout optimization of ESAs.
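The reported error metrics (MAE and RMSE of dwell time, plus the share of errors within a 2-minute tolerance) are straightforward to compute; a small sketch with hypothetical dwell times in seconds:

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error (penalises large errors more than MAE)."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted))
            / len(actual)) ** 0.5

def within_tolerance(actual, predicted, tol_s=120):
    """Share of predictions whose absolute error is at most tol_s seconds."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= tol_s)
    return hits / len(actual)
```

RMSE is always at least as large as MAE, so the gap between the two (e.g. 69 s vs 52 s in Part A) indicates how much a few large dwell-time errors dominate.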
REVIEW | doi:10.20944/preprints202207.0141.v1
Subject: Medicine And Pharmacology, Oncology And Oncogenics Keywords: review; real-world evidence; real-world data; randomized controlled trials; registry; digital health technology; early drug approval
Online: 8 July 2022 (11:09:58 CEST)
Real-world evidence (RWE) is increasingly involved in the early benefit assessment of medicinal drugs. It is expected that RWE will help to speed up approval processes, comparable to RWE developments in vaccine research during the COVID-19 pandemic. Definitions of RWE are diverse, marking the highly fluid status of this field. So far, RWE comprises information produced from data routinely collected on patients' health status and/or the delivery of health care from various sources other than traditional clinical trials. These sources can include electronic health records, claims, patient-generated data including in home-use settings, data from mobile devices, as well as patient, product and disease registries. The aim of the present update was to review the RWE developments and guidelines, mainly in the U.S., the UK, Europe and Germany, during the last decade. RWE has already been included in various approval procedures of regulatory authorities, reflecting its actual acceptance and growing importance in evaluating and accelerating new therapies. However, since RWE research is still in a transition process and a number of gaps in this field have been identified, more guidance and a consented definition are necessary to increase the implementation of real-world data.
ARTICLE | doi:10.20944/preprints202306.1477.v1
Subject: Medicine And Pharmacology, Medicine And Pharmacology Keywords: Ceftobiprole; sepsis; older; Real-World Data; OPAT
Online: 21 June 2023 (04:10:08 CEST)
Background: Ceftobiprole is a fifth-generation cephalosporin approved in Europe solely for the treatment of community-acquired and nosocomial pneumonia. The objective was to analyze the use of ceftobiprole medocaril (Cefto-M) in Spanish clinical practice in patients with infection, in hospital or on outpatient parenteral antimicrobial therapy (OPAT). Methods: This retrospective, observational, multicenter study included patients treated from September 1, 2021 to December 31, 2022. Results: 249 individuals were enrolled, aged 66.6±15.4 years, 59.4% male, with a Charlson index of 4 (IQR 2-6); 13.7% had COVID-19, and 4.8% were in an intensive care unit (ICU). The most frequent type of infection was respiratory (55.8%), followed by skin and soft tissue infection (21.7%). Cefto-M was administered as empirical treatment in 67.9% of patients, as monotherapy for 7 days (5-10) in 53.8% of cases. Infection-related mortality was 11.2%. The highest mortality rates were for ventilator-associated pneumonia (40%) and infections due to methicillin-resistant S. aureus (20.8%) and Pseudomonas aeruginosa (16.1%). Mortality-related factors were age (OR: 1.1, 95% CI [1.04-1.16]), ICU admission (OR: 42.02, 95% CI [4.49-393.4]), and sepsis/septic shock (OR: 2.94, 95% CI [1.01-8.54]). Conclusions: In real life, Cefto-M is a safe antibiotic, with only half of prescriptions in respiratory infections, mainly administered as rescue therapy in pluripathological patients with severe infectious diseases.
ARTICLE | doi:10.20944/preprints202106.0738.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: time series; homogenization; ACMANT; observed data; data accuracy
Online: 30 June 2021 (13:08:39 CEST)
The removal of non-climatic biases, so-called inhomogeneities, from long climatic records needs sophisticated statistical methods. One principle is that the differences between a candidate series and its neighbour series are usually analysed instead of the candidate series directly, in order to neutralize the possible impact of regionally common natural climate variation on the detection of inhomogeneities. In most homogenization methods, two main kinds of time series comparison are applied: composite reference series or pairwise comparisons. In composite reference series, the inhomogeneities of neighbour series are attenuated by averaging the individual series, and the accuracy of homogenization can be improved by iteratively refining the composite reference series. By contrast, pairwise comparisons have the advantage that coincidental inhomogeneities affecting several station series in a similar way can be identified with higher certainty than with composite reference series. In addition, homogenization with pairwise comparisons tends to yield the most accurate regional trend estimations. A new time series comparison method is presented here, which combines pairwise comparisons and composite reference series in a way that unifies their advantages. This comparison method is embedded in the ACMANT homogenization method and tested on large, commonly available monthly temperature test datasets.
ARTICLE | doi:10.20944/preprints202306.1667.v1
Subject: Environmental And Earth Sciences, Space And Planetary Science Keywords: GNSS time series; time length; missing data; noise analysis; velocity estimation
Online: 23 June 2023 (11:42:43 CEST)
The noise model selection criterion has a significant impact on identifying the stochastic noise properties of any GNSS daily coordinate time series. The low-frequency random walk (RW) noise existing in these time series can lead to overestimation of the tectonic rate, so accurately detecting the random walk component is of great significance. This study focuses on a noise model estimation criterion (BIC_tp) derived from the AIC and the BIC by introducing a 2π factor; it is more sensitive to abnormal steps (random jumps). Using observation data from 72 GNSS stations from 1992 to 2022, together with simulated data, four combined noise models are used to explore the impacts of time series length (ranging from 2 to 24 years) and data loss (between 2% and 30%) on noise model selection and velocity estimation. The results show that as the time length increases, the selected optimal noise model and the estimated uncertainty of the tectonic trend under different data gaps gradually converge. When the time length is short (less than 8 years), the FNRWWN, FNWN, and PLWN models can be mistakenly identified as GGMWN models, thereby affecting the accuracy of the estimated station velocity parameters. When the time length reaches 12 years, the RW noise component is more likely to be detected, and as the time length increases further, the impact of RW on velocity uncertainty weakens. Finally, we conclude that for a time series with a minimum length of 12 years, both the selection of the optimal stochastic noise model and the estimation of the velocity parameters are reliable.
ARTICLE | doi:10.20944/preprints201801.0077.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: UAVs sensor fusion; EKF; real data analysis; system design
Online: 9 January 2018 (07:47:45 CET)
This paper presents a methodology for designing sensor fusion parameters using real navigation performance indicators in UAVs based on the PixHawk flight controller and its peripherals. The methodology and the selected performance indicators allow finding the best parameters for the fusion system of a given sensor configuration and a predefined real mission. The selected real platform is described with emphasis on the available sensors and data processing software, and an experimental methodology is proposed to characterize the sensor data fusion output and determine the best choice of parameters, using quality measurements of the tracking output with performance metrics that do not require ground truth.
COMMUNICATION | doi:10.20944/preprints202301.0335.v2
Subject: Computer Science And Mathematics, Information Systems Keywords: Cloud Computing; Data Protection; Secure Communication; Middleware; Protocols
Online: 30 January 2023 (09:24:01 CET)
In recent years, Cloud Computing and Big Data have been considered the most attractive areas revolutionizing the IT world. The Cloud Computing paradigm allows running proprietary or hard-to-port applications outside their original software environment on one or more virtual hardware platforms. We therefore develop techniques that secure communication between the communicating Cloud entities. These techniques must take several factors into account, because the data transmitted in this type of environment are proprietary and of significant size, and conventional data security techniques are not suitable for today's cloud usage. Hence, the main goal of this work is to define an adaptable architecture and propose a scalable system that supports cloud services. We define feasible security solutions dedicated to the Cloud computing context in order to robustly protect data stored in the Cloud, focusing more precisely on NoSQL databases. We also propose a secure solution based on the blockchain, which has powerful features such as decentralization, autonomy, security, reliability, and transparency.
ARTICLE | doi:10.20944/preprints202302.0211.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: FPGA; DAQ; Data concentration; Beneš network
Online: 13 February 2023 (09:10:55 CET)
The concentration of data from multiple links into a single output is an essential task performed by High-Energy Physics (HEP) Data Acquisition Systems (DAQs). At high and varying data rates, combined with the large width of the concentrator's output interface, this task is non-trivial. This paper presents a concentrator based on the Beneš network, which provides efficient concentration without using a high-frequency clock internally. It guarantees that empty data are eliminated and does not disturb the time-ordering of the data even if data rates differ significantly between inputs. Additionally, it is well suited to FPGA implementation: it is based on simple data-routing primitives and may be fully pipelined.
ARTICLE | doi:10.20944/preprints201904.0281.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Cluster computing; Big Data; Spark; Hadoop
Online: 25 April 2019 (11:22:27 CEST)
The article provides detailed information about the cluster computing technologies Hadoop and Apache Spark. An experimental task, processing logistic regression with these technologies, is considered. Findings from the performance comparison of Hadoop and Apache Spark cluster computing are presented and substantiated.
ARTICLE | doi:10.20944/preprints202308.0307.v1
Subject: Computer Science And Mathematics, Mathematics Keywords: conditional distribution function; asymptotic normality; conditional hazard function; quasi-associated; functional data
Online: 3 August 2023 (10:53:17 CEST)
The objective of this study is to examine a nonparametric estimator, using the kernel approach, of the conditional distribution function of a scalar response variable given a random variable taking values in a separable real Hilbert space. The observations are dependent on one another in a quasi-associated fashion. We establish the pointwise almost-complete consistency, with rates, of this estimator under some broad conditions. The study's major objective is to investigate the convergence rate of the proposed estimator and its application to the convergence rate and asymptotic normality of the hazard function. The asymptotic normality of the developed estimator is established precisely. Simulation studies were conducted to investigate the behavior of the asymptotic property in finite samples.
ARTICLE | doi:10.20944/preprints202305.0049.v1
Subject: Computer Science And Mathematics, Computer Science Keywords: concurrent systems; Mazurkiewicz traces; interval order; Petri net with data
Online: 2 May 2023 (03:13:37 CEST)
Studying concurrent systems means sorting the read/write and control events of the system into an order, whether partial or strict. Because of the concurrency of events and the duration of write events, events may overlap, which requires interval order analysis. The goal is a single sequence of beginnings and endings that represents the entire stratified order structure as well as all equivalent interval order observations. Mazurkiewicz traces and comtraces are stratified traces, but they cannot describe interval traces. This summary states the problem (Petri net interval traces with data), the previous work (Mazurkiewicz traces, comtraces), the reasons they cannot solve it, and our own solution (DPN: Petri nets with data, interval traces). This paper focuses on a BE (Beginnings and Endings) sequence representing an equivalence class of runs: our goal is to have one BE sequence represent the entire stratified order structure, that is, all equivalent observable interval orders.
ARTICLE | doi:10.20944/preprints202008.0626.v1
Subject: Engineering, Civil Engineering Keywords: multispectral lidar; single-photon lidar; building data; 3D reconstruction
Online: 28 August 2020 (08:49:07 CEST)
This paper investigates building data from multispectral and single-photon Lidar systems. Multispectral datasets from the individual channels and from fused channels were explored. The multispectral and single-photon Lidar data were compared across multiple aspects: data acquisition geometry, number of echoes, intensity, density, resolution, data defects, noise level, and absolute and relative accuracy. In addition, we explored the performance of the multispectral and single-photon data for roof plane detection on eight complex/stylish buildings to investigate the suitability of these data for 3D building reconstruction. The building data from the single-photon and multispectral Lidar systems were evaluated against reference building vector data with an accuracy of better than 5 cm. The advantages and disadvantages of both technologies and their applications in the urban building environment are discussed.
ARTICLE | doi:10.20944/preprints202307.1942.v1
Subject: Engineering, Bioengineering Keywords: exponential modes; visible modes; hidden modes; data limitations; input-output data; mechanistic model; model distinguishability; invariant 2-dimensional manifolds
Online: 28 July 2023 (13:12:25 CEST)
The particulars of stimulus-response experiments done on dynamic biosystems clearly limit what one can learn and validate about their structural interconnectivity (topology), even when the collected kinetic output data are perfect (noise-free). As always, available access ports and other data limitations rule. For linear systems, exponential modes, visible and hidden, play an important role in understanding data limitations, embodied in what we call dynamical signatures in the data. We show here how to circumscribe and analyze modal response data in compartmental model structures, so that modal analysis can be used constructively in systems biology model building, for nonlinear (NL) as well as linear biosystems. We do this by developing and exploiting the modal basis for dynamical signatures in hypothetical (perfect) input-output data associated with a structural model, one that includes inputs and outputs explicitly. The methodology establishes model dimensionality (size, complexity) from particular data sets; helps select among multiple candidate models (model distinguishability); helps in designing new experiments to extract "hidden" structure; and helps to simplify (reduce) models to their essentials. For NL biosystems, the results are not as comprehensive, but they are similarly informative about dominant dynamical properties and are unified with linear models on invariant 2-dimensional manifolds in phase space. Some automation of these highly technical aspects of biomodeling is also introduced.
ARTICLE | doi:10.20944/preprints202307.1117.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: history; endowments; query model; digital data; physical data
Online: 17 July 2023 (15:11:18 CEST)
Historical and endowment properties differ from heritage and cultural properties in that they are governed by a unique set of laws that waqf recipients must abide by. Entrusted property usually takes the form of buildings, land, or valuables, whose preservation is not limited in time as long as the property can be utilized. Reliable information technology is needed to ensure the security of both digital and physical data, while the rapid development of information technology demands openness of information, which is a challenge in itself. The objectives of this study include examining the collection of historical and endowment databases, the relationship between digital and physical data, and the management organizations involved. We design a query model to display the data and then analyze whether the data conform to the rules of waqf management. The results are expected to reconcile digital and physical data, with any discrepancies becoming findings for further analysis.
CONCEPT PAPER | doi:10.20944/preprints201810.0724.v2
Subject: Social Sciences, Political Science Keywords: Social-Ecological System; Water security; Governance; Institution; Learning; Data-Cube
Online: 22 November 2018 (14:47:31 CET)
The Social-Ecological Systems (SES) framework is valuable for exploring and understanding social and ecological interactions and pathways in water governance. Yet it lacks a robust understanding of change. We argue that an analytical and methodological approach to engaging with global changes in SES is critical to strengthening the scope and relevance of the SES framework. Relying on SES and resilience thinking, we propose an institutional and cognitive model of change, in which institutions and natural resource systems co-evolve, to provide a dynamic understanding of SES that stands on three causal mechanisms: the institutional complexity trap, the rigidity trap, and learning processes. We illustrate how Data Cube technology could overcome current limitations and offer reliable avenues to test hypotheses about the dynamics of social-ecological systems and water security by combining spatial and temporal data with no major technical requirements for users.
ARTICLE | doi:10.20944/preprints202010.0093.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Functional data; Local linear estimation; Asymptotic normality; Conditional hazard function
Online: 6 October 2020 (09:18:49 CEST)
In this work, we treat a prediction problem via the conditional hazard function of a scalar response variable Y given a functional random variable X, using the local linear technique. The main purpose of this paper is to investigate the asymptotic normality of the nonparametric estimator of the conditional hazard function under some general conditions. A simulation study, conducted to assess finite sample behavior, demonstrates the superiority of our method over the standard kernel method.
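For context, the conditional hazard function targeted above is, in the standard nonparametric setting, the ratio of the conditional density to the conditional survival function; this is the textbook definition, not a formula taken from the paper itself:

h(y \mid x) \;=\; \frac{f(y \mid x)}{1 - F(y \mid x)}, \qquad F(y \mid x) < 1,

where f(y|x) and F(y|x) denote the conditional density and conditional distribution function of Y given X = x. Local linear and kernel estimators of h differ in how they plug in estimates of f and F.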
REVIEW | doi:10.20944/preprints202211.0161.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: High Performance Computing (HPC); big data; High Performance Data Analytics (HPDA); convergence; data locality; spark; Hadoop; design patterns; process mapping; in-situ data analysis
Online: 9 November 2022 (01:38:34 CET)
Big data has revolutionised science and technology, leading to the transformation of our societies. High Performance Computing (HPC) provides the necessary computational power for big data analysis using artificial intelligence methods. Traditionally, HPC and big data focused on different problem domains and grew into two different ecosystems. Efforts have been underway for the last few years to bring the best of both paradigms into converged HPC and big data architectures. Designing HPC and big data converged systems is a hard task requiring careful placement of data, analytics, and other computational tasks such that the desired performance is achieved with the least amount of resources. Energy efficiency has become the biggest hurdle in the realisation of HPC, big data, and converged systems capable of delivering exascale and beyond performance. Data locality is a key parameter of HPDA system design, as moving even a byte costs dearly in both time and energy as system size increases. Performance in terms of time and energy is the most important factor for users, particularly energy, due to it being the major hurdle in high performance system design and the increasing focus on environmentally sustainable green energy systems. Data locality is a broad term that encapsulates different aspects, including bringing computations to data, minimizing data movement by efficient exploitation of cache hierarchies, reducing intra- and inter-node communications, locality-aware process and thread mapping, and in-situ and in-transit data analysis. This paper provides an extensive review of the cutting edge on data locality in HPC, big data, and converged systems. We review the literature on data locality in HPC, big data, and converged environments and discuss challenges, opportunities, and future directions.
Subsequently, using the knowledge gained from this extensive review, we propose a system architecture for future HPC and big data converged systems. To the best of our knowledge, there is no such review on data locality in converged HPC and big data systems.
HYPOTHESIS | doi:10.20944/preprints201808.0127.v1
Subject: Medicine And Pharmacology, Oncology And Oncogenics Keywords: Big Data; Systems Models; Cancer metabolism; Cancer personalized treatment; Drug Discovery
Online: 6 August 2018 (15:09:15 CEST)
Coordinated sets of extremely numerous digital data on a given social or economic event are treated with Artificial Intelligence tools to obtain reasonably accurate, valuable predictions. The same approach, applied to biomedical issues such as how to choose the right drug to completely cure a given cancer patient, does not reach satisfactory results. It is the "organized biological complexity" that requires a different systems approach: integrating, in an Augmented Intelligence strategy, statistical computations on digital data, network construction from "omics" findings, well-designed mathematical models, and new experiments in an iterative pathway to reconstruct the "logic" beneath the "organized complexity," as shown here for Systems Metabolomics of cancer. On this basis, new diagnostic approaches able to identify precision drug treatments, as well as a new discovery strategy for more effective anti-cancer drugs, are described.
ARTICLE | doi:10.20944/preprints202111.0029.v1
Subject: Social Sciences, Decision Sciences Keywords: Real-world fuel consumption rate; machine learning; big data; light-duty vehicle; China
Online: 2 November 2021 (09:40:05 CET)
Private vehicle travel is the most basic mode of transportation, and effective control of the real-world fuel consumption rate of light-duty vehicles plays a vital role in promoting sustainable economic development and achieving a green, low-carbon society. Therefore, the impact factors of individual carbon emissions must be elucidated. This study builds five different models to estimate the real-world fuel consumption rate of light-duty vehicles in China. The results reveal that the Light Gradient Boosting Machine (LightGBM) model performs better than the linear regression, Naïve Bayes regression, neural network regression, and decision tree regression models, with a mean absolute error of 0.911 L/100 km, a mean absolute percentage error of 10.4%, a mean square error of 1.536, and an R squared (R2) of 0.642. This study also assesses a large number of factors, from which the three most important are extracted: the reference fuel consumption rate, engine power, and light-duty vehicle brand. Furthermore, a comparative analysis reveals that the vehicle factors with the greatest impact on real-world fuel consumption rate are vehicle brand, engine power, and engine displacement. Average air pressure, average temperature, and sunshine time are the three most important climate factors.
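The error metrics reported above (MAE, MAPE, MSE, R2) can be sketched in plain Python; the fuel-consumption numbers in the example are hypothetical placeholders, not the study's data:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical fuel-consumption observations and predictions (L/100 km).
y_true = [7.2, 8.5, 6.9, 9.1]
y_pred = [7.0, 8.9, 7.1, 8.8]
```

Lower MAE/MAPE/MSE and higher R2 indicate a better fit, which is how the LightGBM model is compared against the four baselines above.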
ARTICLE | doi:10.20944/preprints202308.1087.v1
Subject: Medicine And Pharmacology, Epidemiology And Infectious Diseases Keywords: Cytomegalovirus; prophylaxis; allogeneic hematopoietic cell transplantation; real-world data
Online: 15 August 2023 (09:28:45 CEST)
Prevention and management of cytomegalovirus (CMV) reactivation are important to improve the outcome of allogeneic hematopoietic cell transplantation (allo-HCT) recipients. The aim of this study was to analyze real-world data on the incidence and characteristics of CMV infections up to 1 year after allo-HCT under 100-day letermovir prophylaxis. A single-center retrospective study was conducted between November 2020 and October 2021. During the study period, 358 patients underwent allo-HCT, 306 of whom received letermovir prophylaxis. The cumulative incidence of clinically significant CMV infection (CS-CMVi) was 11.4%, 31.7%, and 36.9% at 14 weeks, 24 weeks, and 1 year post-HCT, respectively. In multivariate analysis, the risk of CS-CMVi increased with graft-versus-host disease (GVHD) ≥ grade 2 (adjusted odds ratio 3.640 [2.036–6.510]; P < 0.001). One-year non-relapse mortality was significantly higher in patients with letermovir-breakthrough CS-CMVi than in those with subclinical CMV reactivation who continued letermovir (P = 0.002). There were 18 (15.9%) refractory CMV infections in this study population. In summary, letermovir prophylaxis is effective in preventing CS-CMVi until day 100, with incidence increasing after cessation of letermovir. GVHD remains a significant risk factor in the letermovir prophylaxis era. Further research is needed to establish individualized management strategies, especially in patients with significant GVHD or letermovir-breakthrough CS-CMVi.
ARTICLE | doi:10.20944/preprints202109.0191.v1
Subject: Computer Science And Mathematics, Information Systems Keywords: Process science; Data science; Concept drift detection; Branching frequency changes
Online: 10 September 2021 (15:44:14 CEST)
Business processes continuously evolve in order to adapt to changes driven by various factors. One important process drift perspective yet to be investigated is the detection of branching condition changes in the process model; none of the existing process drift detection methods focus on such changes. Existing branching condition detection methods do not take changes within the process into account, so their results are inadequate to represent changes in the decision criteria of the process. In this paper, we present a method that can detect branching condition changes in process models. The method takes both process models and event logs as input and translates event logs into decision sequences for change-point detection. The proposed method is evaluated on simulated event logs.
ARTICLE | doi:10.20944/preprints201811.0632.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: fault line detection; data fusion; non-artificial setting; sound distance; fault distance; resonant grounding system
Online: 30 November 2018 (10:38:18 CET)
Detecting the fault line in a timely and accurate manner when a single-phase-to-earth fault occurs in a resonant grounding system is still a focus of research. This paper presents a new approach to fault line detection based on data fusion that requires no manually set threshold. First, a fault criterion based on the interphase difference energy ratio and the time-frequency correlation coefficient of each line is proposed. Subsequently, a coordinate system is established with the interphase difference energy ratio as the X axis and the time-frequency correlation coefficient as the Y axis, and the Euclidean distance algorithm is used to obtain the characteristic distance of each line by fusing the two-dimensional information. Finally, the sound (healthy) distance and the fault distance of each line are compared to discriminate the fault line. Electromagnetic Transients Program (EMTP) simulation results and an adaptability analysis confirm the effectiveness and reliability of the proposed scheme.
ARTICLE | doi:10.20944/preprints201710.0111.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: energy storage systems; charging profile; capacity loss; data-driven modeling
Online: 17 October 2017 (04:29:19 CEST)
Energy storage systems (ESSs) are penetrating various sections of the power system through different applications. An ESS can be used either as a buffer for intermittent renewable energy sources or as stand-alone distributed storage for load shifting. ESSs use different types of storage devices, such as lead-acid batteries, lithium-ion batteries, flow batteries, and super-capacitors; hybrid ESSs consisting of a few types of storage devices are also common in practice. Accurately determining the load demand of such ESSs at various instants (the charging profile) is indispensable in most cases. Capacity loss is a common phenomenon that occurs in all types of storage devices because of ageing, and it has to be accounted for when determining the charging profile of storage devices for better accuracy. Data-driven modeling is an attractive approach for determining the load demand of an ESS because of the valuable data made available by smart grid technologies. In this paper, the application of different types of data-driven models to predict the current charging profile of an ESS from previous charging profiles is examined. The proposed method can leverage existing data from the smart grid and is a black-box modeling approach.
ARTICLE | doi:10.20944/preprints202106.0330.v1
Subject: Medicine And Pharmacology, Immunology And Allergy Keywords: Chemotherapy; Radiotherapy; Cognitive dysfunction; Big data; Cohort studies; Survival analysis
Online: 14 June 2021 (07:51:57 CEST)
Background: We aimed to assess the risk of chemotherapy- and radiotherapy-related cognitive impairment in colorectal cancer patients. Methods: We randomly selected 40% of colorectal cancer patients from the Korean National Health Insurance Database (NHID), 2004-2018 (N=148,848). Patients with one or more ICD-10 diagnostic codes for dementia or mild cognitive impairment were defined as cognitive impairment cases. Patients who were aged 18 or younger, were diagnosed with cognitive impairment before colorectal cancer (N=8,225), or did not receive primary resection (N=45,320) were excluded. The effects of each chemotherapy agent on cognitive impairment were estimated; we additionally estimated the effect of radiotherapy in rectal cancer patients. Time-dependent competing risk Cox regression was conducted to estimate overall and age-specific hazard ratios (HRs) separately for colon and rectal cancer. Results: In colon cancer, capecitabine and irinotecan were associated with higher risk of cognitive impairment, while 5-fluorouracil was not. In rectal cancer, no chemotherapy agent increased the risk of cognitive impairment, nor did radiotherapy. The hazardous association of irinotecan was larger in elderly patients than in their younger counterparts. Conclusion: Heterogeneous associations between various chemotherapy agents and cognitive impairment were observed. Elderly patients were more vulnerable to possible adverse cognitive effects. Radiotherapy did not increase the risk of cognitive impairment.
ARTICLE | doi:10.20944/preprints201810.0601.v1
Subject: Engineering, Civil Engineering Keywords: support vector machine; travelling time; intelligent transportation system; artificial fish swarm algorithm; big data
Online: 25 October 2018 (10:48:45 CEST)
Freeway travelling time is affected by many factors, including traffic volume, adverse weather, accidents, and traffic control. We employ a multi-source data-mining method to analyze freeway travelling time, collecting toll data, weather data, traffic accident disposal logs, and other historical data for freeway G5513 in Hunan province, China. Using a Support Vector Machine (SVM), we propose a travelling time model based on these databases; the SVM model can capture the nonlinear relationship between travelling time and those factors. To improve the precision of the SVM model, we apply the Artificial Fish Swarm algorithm to optimize the SVM model parameters: the kernel parameter σ, the ε-insensitive loss function parameter ε, and the penalty parameter C. We compared the optimized SVM model with a Back Propagation (BP) neural network and a common SVM model, using the historical data collected from freeway G5513. The results show that the accuracy of the optimized SVM model is 17.27% and 16.44% higher than those of the BP neural network model and the common SVM model, respectively.
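The three tuned parameters enter the standard SVR formulation as the RBF kernel width σ, the ε-insensitive tube width, and the penalty C on training errors. A minimal sketch of that ε-insensitive loss and the primal objective it feeds, with made-up residuals; this illustrates the standard SVR objective the paper relies on, not its fitted model:

```python
def eps_insensitive_loss(residuals, eps):
    """Sum of max(0, |r| - eps): errors inside the eps-tube cost nothing."""
    return sum(max(0.0, abs(r) - eps) for r in residuals)

def svr_objective(w_norm_sq, residuals, C, eps):
    """Primal SVR objective: flatness term plus C-weighted tube violations."""
    return 0.5 * w_norm_sq + C * eps_insensitive_loss(residuals, eps)

# Hypothetical travel-time prediction residuals (minutes).
residuals = [0.1, -0.3, 0.05, 0.6]
```

The Artificial Fish Swarm algorithm then searches over (σ, ε, C) for the combination minimizing validation error, with σ entering through the RBF kernel exp(-||x - x'||² / (2σ²)) used to fit the model.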
REVIEW | doi:10.20944/preprints202304.0075.v1
Subject: Social Sciences, Demography Keywords: human migration; prediction; methods; artificial intelligence; data; uncertainty
Online: 6 April 2023 (07:12:19 CEST)
As a fundamental and strategic issue facing human society, human migration is a key factor affecting the development of countries and cities under constantly changing population numbers. The fuzziness of the spatiotemporal attributes of human migration limits the pool of open-source data for human migration prediction, leading to a relative lag in research on prediction algorithms. This study expands the definition of human migration research, reviews the progress of research into human migration prediction, and classifies and compares human migration algorithms based on open-source data. It also explores the critical uncertainty factors restricting the development of human migration prediction. In combination with artificial intelligence and big data technology, the paper concludes with specific suggestions and countermeasures aimed at making human migration prediction research better serve economic and social development and national strategy.
ARTICLE | doi:10.20944/preprints201709.0085.v1
Subject: Engineering, Control And Systems Engineering Keywords: Data Distribution; Multi-Path; RPL; Wireless Sensor Network
Online: 18 September 2017 (17:36:09 CEST)
The RPL protocol is a routing protocol for low-power and lossy networks. In such networks, energy is a very scarce resource, so many studies focus on minimizing global energy consumption. End-to-end latency is another important performance indicator, but existing research tends to focus on energy consumption and ignore the end-to-end delay of data transmission. In this paper, we propose an energy-balancing routing protocol that maximizes the surviving time of constrained nodes by keeping the energy consumed by each node close to that of the others. We also propose a multi-path forwarding route based on cache utilization: data are sent to the sink node through different parent nodes with certain probabilities, rather than only through the preferred parent node, thus avoiding buffer overflow and reducing end-to-end delay. Finally, the two algorithms are combined to accommodate different application scenarios. The experimental results show that the three proposed schemes improve the reliability of routing, extend the lifetime of the network, reduce the end-to-end delay, and reduce the number of DAG reconfigurations.
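The probabilistic multi-path forwarding described above can be sketched as follows; the buffer values and the weighting rule are hypothetical, chosen only to show how traffic spreads away from a congested preferred parent.

```python
import random

def pick_parent(parents, buffer_free, rng=random.Random(42)):
    # Weight each candidate parent by its free-buffer fraction, so
    # congested parents are chosen less often (a hypothetical policy
    # sketching cache-utilization-based multi-path forwarding).
    weights = [buffer_free[p] for p in parents]
    r = rng.random() * sum(weights)
    acc = 0.0
    for p, w in zip(parents, weights):
        acc += w
        if r <= acc:
            return p
    return parents[-1]

# Three candidate parents; "A" is the RPL-preferred parent but its
# buffer is nearly full, so traffic spreads to "B" and "C".
buffer_free = {"A": 0.1, "B": 0.6, "C": 0.3}
counts = {"A": 0, "B": 0, "C": 0}
for _ in range(10000):
    counts[pick_parent(["A", "B", "C"], buffer_free)] += 1
print(counts)
```

Over 10,000 packets the split follows the buffer weights, so the nearly full parent "A" carries the least traffic.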
ARTICLE | doi:10.20944/preprints202101.0235.v1
Subject: Environmental And Earth Sciences, Atmospheric Science And Meteorology Keywords: forest resources; forest and tree species distribution; machine learning; multi-sensor data fusion; National Forest Inventory data
Online: 12 January 2021 (17:35:56 CET)
Mapping forest extent and forest cover classification are important for the assessment of forest resources in socio-economic as well as ecological terms. Novel developments in the availability of remotely sensed data, computational resources, and advances in areas of statistical learning have enabled fusion of multi-sensor data, often yielding superior classification results. Most former studies of nemoral forests fusing multi-sensor and multi-temporal data have been limited in spatial extent and typically to a simple classification of landscapes into major land cover classes. We hypothesize that multi-temporal, multi-sensor data will have a specific strength in the further classification of nemoral forest landscapes owing to the distinct seasonal patterns of the phenology of broadleaves. This study aimed to classify the Danish landscape into forest/non-forest and further into forest types (broadleaved/coniferous) and species groups, using a cloud-based approach based on multi-temporal Sentinel-1 and Sentinel-2 data and machine learning (random forest) trained with National Forest Inventory (NFI) data. Mapping of non-forest and forest resulted in producer accuracies of 99% and 90%, respectively. The mapping of forest types (broadleaf and conifer) within the forested area resulted in producer accuracies of 95% for conifer and 96% for broadleaf forest. Tree species groups were classified with producer accuracies ranging from 34% to 74%. Species groups with coniferous species were the least confused, whereas the broadleaf groups, especially oak, had higher error rates. The results are applied in the Danish national accounting of greenhouse gas emissions from forests, resource assessment, and assessment of forest biodiversity potentials.
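The producer accuracies reported above are per-class omission-error rates computed from a confusion matrix (correct pixels divided by the reference total for that class). A minimal sketch with made-up counts:

```python
import numpy as np

# Rows: reference class, columns: mapped class (hypothetical counts)
labels = ["non-forest", "conifer", "broadleaf"]
cm = np.array([
    [990,   5,   5],   # non-forest reference pixels
    [ 20, 950,  30],   # conifer reference pixels
    [ 15,  25, 960],   # broadleaf reference pixels
])

# Producer accuracy: correct / reference total per class (omission view)
producer = np.diag(cm) / cm.sum(axis=1)
# User accuracy: correct / mapped total per class (commission view)
user = np.diag(cm) / cm.sum(axis=0)

for name, p, u in zip(labels, producer, user):
    print(f"{name}: producer={p:.2%} user={u:.2%}")
```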
ARTICLE | doi:10.20944/preprints202012.0728.v1
Subject: Biology And Life Sciences, Biochemistry And Molecular Biology Keywords: omics data; hierarchical clustering; noise quantification
Online: 29 December 2020 (14:02:28 CET)
Identifying groups that share common features among datasets through clustering analysis is a typical problem in many fields of science, particularly in post-omics and systems biology research. In this respect, quantifying how well a measure can cluster or organize intrinsic groups is important, since there is currently no statistical evaluation of how ordered a clustered vector is, or how much noise is embedded in it. Much of the literature focuses on how well the clustering algorithm orders the data, offering several external and internal statistical measures; but no measure has been developed to statistically quantify the noise in a vector arranged by a clustering algorithm, i.e., how much of the clustering is due to randomness. Here, we present a quantitative methodology, based on autocorrelation, to assess this problem.
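A minimal sketch of the autocorrelation idea: a vector well ordered by clustering has high lag-1 autocorrelation, while a randomly arranged one does not. The toy data and the lag-1 statistic are illustrative, not the authors' exact methodology.

```python
import numpy as np

def lag1_autocorr(v):
    # Lag-1 autocorrelation: high when neighbouring entries are similar,
    # i.e. when the vector is well ordered by the clustering.
    v = np.asarray(v, float)
    v = v - v.mean()
    return (v[:-1] * v[1:]).sum() / (v * v).sum()

rng = np.random.default_rng(0)
# Three intrinsic groups with some measurement noise
groups = np.repeat([0.0, 5.0, 10.0], 50) + rng.normal(0, 0.5, 150)

ordered = np.sort(groups)            # as if clustering recovered the groups
shuffled = rng.permutation(groups)   # as if clustering were random

print(round(lag1_autocorr(ordered), 3), round(lag1_autocorr(shuffled), 3))
```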
ARTICLE | doi:10.20944/preprints202108.0516.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: machine learning; time; naive bayes classification; recurrent neural networks; Twitter; social media data; automatic classification
Online: 27 August 2021 (11:23:50 CEST)
Machine learning (ML) is increasingly useful as data grow in volume and accessibility, as it can perform tasks (e.g. categorisation, decision making, anomaly detection) through experience and without explicit instruction, even when the data are too vast, complex, highly variable, or full of errors to be analysed in other ways. Thus, ML is well suited to natural language, images, and other complex and messy data available in large and growing volumes. Selecting an ML algorithm depends on many factors, as algorithms vary in the supervision needed, tolerable error levels, and ability to account for order or temporal context, among many other things. Importantly, ML methods for explicitly ordered or time-dependent data struggle with errors or data asymmetry. Most data are at least implicitly ordered, potentially allowing a hidden 'arrow of time' to affect non-temporal ML performance. This research explores the interaction of ML and implicit order by training two ML algorithms on Twitter data before performing automatic classification tasks under conditions that balance the volume and complexity of the data. Results show that performance was affected, suggesting that researchers should carefully consider time when selecting appropriate ML algorithms, even when time is only implicitly included.
ARTICLE | doi:10.20944/preprints202011.0451.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: Explainable AI; Cluster Analysis; Swarm Intelligence; Machine Learning System; High-Dimensional Data Visualization; Decision Trees
Online: 17 November 2020 (14:01:33 CET)
The understanding of water quality and its underlying processes is important for the protection of aquatic environments, and this study had the rare opportunity of access to a domain expert. Hence, an explainable AI (XAI) framework is proposed that is applicable to multivariate time series and produces explanations interpretable by a domain expert. In three steps, the XAI combines a data-driven choice of a distance measure with explainable cluster analysis through supervised decision trees. The multivariate time series consists of water quality measurements, including nitrate, electrical conductivity, and twelve other environmental parameters. The relationships between water quality and the environmental parameters are investigated by identifying similar days within a cluster and dissimilar days between clusters. The XAI does not depend on prior knowledge about the data structure, and its explanations tend to be contrastive. The relationships in the data can be visualized by a topographic map representing the high-dimensional structures. Two comparable decision-based XAIs were unable to provide meaningful and relevant explanations from the multivariate time series data. Open-source code in R for the three steps of the XAI framework is provided.
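The pipeline of clustering followed by a supervised explanation can be sketched as below; k-means and a single-split rule stand in for the paper's actual distance learning and decision trees, and the water-quality values are invented.

```python
import numpy as np

def kmeans(X, k=2, iters=50):
    # Deterministic init: first and last rows as starting centers
    centers = X[[0, len(X) - 1]]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels

def best_rule(X, labels, names):
    # One-split "decision tree": the single feature/threshold that best
    # separates the clusters, yielding a human-readable explanation.
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = labels[X[:, j] <= t], labels[X[:, j] > t]
            # Accuracy when each side predicts its majority cluster
            acc = (np.bincount(left).max() + np.bincount(right).max()) / len(labels)
            if acc > best[2]:
                best = (names[j], t, acc)
    return best

rng = np.random.default_rng(1)
# Hypothetical daily water-quality features: [nitrate, conductivity]
X = np.vstack([rng.normal([1.0, 300], [0.2, 20], (40, 2)),
               rng.normal([4.0, 600], [0.2, 20], (40, 2))])
labels = kmeans(X)
feat, thr, acc = best_rule(X, labels, ["nitrate", "conductivity"])
print(feat, round(acc, 2))
```

The printed rule ("days with nitrate below the threshold fall in one cluster") is the kind of contrastive, expert-readable explanation the framework aims for.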
ARTICLE | doi:10.20944/preprints202308.2174.v2
Subject: Environmental And Earth Sciences, Remote Sensing Keywords: Time-series; data availability; aggregation; long-term analyses
Online: 1 September 2023 (10:10:24 CEST)
Landsat and Sentinel-2 data archives provide ever-increasing amounts of satellite data for studying land cover and land use change (LCLUC) over the past four decades. However, the availability of cloud-, shadow-, and snow-free observations varies spatially and temporally due to climate and satellite data acquisition schemes. This spatio-temporal heterogeneity poses a major issue for some time-series analysis approaches, but can be addressed with pixel-based compositing, which generates temporally equidistant cloud-free or near-cloud-free synthetic images. Although much consideration is given to methods identifying the ‘best’ pixel value for each composite, determining the aggregation period receives less attention and is often done arbitrarily, or based on expert intuition. Here, we evaluated data compositing windows ranging from five days to one year for 1984-2021 Landsat and 2015-2021 Sentinel-2 time series across Europe. We considered separate and joint use of both data archives and analyzed the spatio-temporal availability of composites during each calendar year and pixel-specific growing season. We reported mean annual composite availability, investigating differences among biogeographical regions; checked the feasibility of pan-European analyses for three LCLUC applications based on annual, monthly, and 10-day composites; and analyzed the shortest feasible compositing window ensuring ≥50% temporal data availability and interpolation of the remaining composites for individual years and across a variety of medium- and long-term time windows. Our results highlighted low data coverage in the 1980s, 1990s, and in 2012, as well as spatial variability in data availability driven by climate and orbit overlaps, which altogether impact the spatio-temporal consistency of medium- and long-term time series, limiting the feasibility of some LCLUC analyses.
We demonstrated that prior to 2011, monthly composites ensured overall 50-62% data coverage for each calendar year, and ~75% afterwards, with a further increase to ~82% when Landsat and Sentinel-2 were combined. The temporal consistency of monthly composites was overall low, and temporal interpolation, filling up to 50% of missing data each year and across a time window of interest, ensured the feasibility of analyses. Applications based on shorter-than-monthly composites were challenging without joining the Landsat and Sentinel-2 archives after 2015, and beyond the Mediterranean biogeographical region. Using pixel-specific growing-season data typically boosted data availability in most geographies and diminished most of the latitudinal differences, but complete time series with sub-monthly compositing windows were still restricted to the most recent years and required data interpolation. Overall, our analyses provided a detailed assessment of Landsat and Sentinel-2 data availability over Europe and, based on selected application examples, highlighted the often lacking spatio-temporal consistency of time series with sub-monthly compositing windows and long time periods, which might hinder the feasibility of some LCLUC applications.
ARTICLE | doi:10.20944/preprints202305.1245.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep neural networks; Activation functions; Multiclass classification; Time-Series prediction; Reuters data; Energy trade value
Online: 17 May 2023 (11:11:21 CEST)
Deep learning has been applied in many areas and has had a significant impact on applications that address real-life challenges. The success of deep learning in a wide range of areas is due in part to the use of activation functions, which are particularly effective at solving non-linear problems. Activation functions are a key focus for researchers in artificial intelligence who aim to improve the performance of neural networks. This article provides a comprehensive explanation and comparison of different activation functions, with a specific focus on the arc tangent and its variations. The paper presents experimental results showing that variations of the arc tangent using irrational numbers such as pi, the golden ratio, and Euler’s number, as well as a self-arctan function, produce promising results. Having experimented with the promising activation functions on two different problems and datasets, we found that different irrationals work well for different problems: arctan ϕ mostly gives the best results for multiclass classification, and arctan e gives the best results for time-series prediction problems. The paper focuses on a multiclass classification problem applied to the Reuters Newswire dataset and a time-series prediction problem on Türkiye's energy trade value to show the impacts of activation functions.
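The activation family discussed above can be sketched as below; the exact formulas (a constant-scaled arctan, and x·arctan(x) for "self-arctan") are our reading of the abstract, not the paper's code.

```python
import numpy as np

def arctan_act(x, c=1.0):
    # Arc-tangent activation with the input scaled by an irrational
    # constant c (pi, the golden ratio phi, or Euler's number e).
    return np.arctan(c * x)

def self_arctan(x):
    # "Self-arctan": the input multiplied by its own arc tangent.
    return x * np.arctan(x)

x = np.linspace(-3, 3, 7)
phi = (1 + 5 ** 0.5) / 2
for name, c in [("arctan_pi", np.pi), ("arctan_phi", phi), ("arctan_e", np.e)]:
    y = arctan_act(x, c)
    # Like tanh, each variant is bounded, here by (-pi/2, pi/2)
    print(name, np.round(y, 3))
print("self_arctan", np.round(self_arctan(x), 3))
```

Scaling by a larger constant steepens the curve near zero while keeping the same saturation bounds, which is the knob the compared variants turn.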
ARTICLE | doi:10.20944/preprints201608.0232.v2
Subject: Medicine And Pharmacology, Pulmonary And Respiratory Medicine Keywords: mHealth; ODK scan; mobile health application; digitizing data collection; data management processes; paper-to-digital system; technology-assisted data management; treatment adherence
Online: 2 September 2016 (03:17:38 CEST)
The present grievous tuberculosis situation can be improved by efficient case management and timely follow-up evaluations. With the advent of digital technology, this can be achieved through quick summarization of patient-centric data. The aim of our study was to assess the effectiveness of the ODK Scan paper-to-digital system during a three-month testing period. A sequential, explanatory mixed-method research approach was employed to elucidate technology use. Training, smartphones, the application, and 3G-enabled SIMs were provided to the four field workers. At the beginning, baseline measures of the data management aspects were recorded and later compared with endline measures to see the impact of ODK Scan. Additionally, at the end, users’ feedback was collected regarding app usability, user interface design, and workflow changes. In total, 122 patients’ records were retrieved from the server and analysed for quality. It was found that ODK Scan correctly recognized 99.2% of multiple-choice bubble responses and 79.4% of numerical digit responses. However, the overall quality of the digital data was lower than that of manually entered data. Using ODK Scan, a significant time reduction was observed in data aggregation and data transfer activities; however, data verification and form filling took more time. Interviews revealed that field workers saw value in using ODK Scan but were concerned about its time-consuming aspects. Therefore, it is concluded that minimal disturbance to the existing workflow, continuous feedback, and value additions are important considerations for the implementing organization to ensure technology adoption and workflow improvements.
ARTICLE | doi:10.20944/preprints201701.0080.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: wind turbine; failure detection; SCADA data; feature extraction; mutual information; copula
Online: 17 January 2017 (11:21:58 CET)
More and more works adopt machine learning techniques with supervisory control and data acquisition (SCADA) systems for wind turbine anomaly or failure detection. While parameter selection is important for modelling a wind turbine's health condition, only a few papers have been published focusing on this issue, and in those papers the interconnections among sub-components in a wind turbine are used to address the problem. However, relying merely on the interconnections for decision making is sometimes too general to provide a parameter list that accounts for the differences between SCADA datasets. In this paper, a method is proposed to provide more detailed suggestions on parameter selection based on mutual information. Moreover, after proving that a copula, a multivariate probability distribution for which the marginal probability distribution of each variable is uniform, is capable of simplifying the estimation of mutual information, an empirical-copula-based mutual information estimation method (ECMI) is introduced for application. A real SCADA dataset is then adopted to test the method, and the results show the effectiveness of the ECMI in providing parameter selection suggestions when physical knowledge is not accurate enough.
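The empirical-copula step can be sketched as a rank transform followed by a plug-in mutual-information estimate. The binning and the synthetic SCADA-like variables are illustrative assumptions, not the paper's ECMI implementation.

```python
import numpy as np

def empirical_copula(u):
    # Rank-transform each column to (0,1): the empirical copula
    # observations, which make the MI estimate invariant to any
    # monotone rescaling of the raw SCADA channels.
    n = len(u)
    return (np.argsort(np.argsort(u, axis=0), axis=0) + 1) / (n + 1)

def mutual_information(x, y, bins=8):
    # Histogram plug-in MI (in nats) on the copula scale
    c = empirical_copula(np.column_stack([x, y]))
    pxy, _, _ = np.histogram2d(c[:, 0], c[:, 1], bins=bins, range=[[0, 1], [0, 1]])
    pxy /= pxy.sum()
    px, py = pxy.sum(1), pxy.sum(0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

rng = np.random.default_rng(0)
wind_speed = rng.uniform(3, 15, 2000)
power = wind_speed ** 3 + rng.normal(0, 50, 2000)   # strongly dependent
ambient = rng.uniform(0, 30, 2000)                  # independent of power

mi_dep = mutual_information(wind_speed, power)
mi_ind = mutual_information(ambient, power)
print(round(mi_dep, 2), round(mi_ind, 2))
```

A parameter-selection rule would then keep channels whose MI with the target exceeds some threshold, as the wind-speed channel does here.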
ARTICLE | doi:10.20944/preprints201701.0079.v1
Subject: Engineering, Electrical And Electronic Engineering Keywords: accessibility; offshore; operation and maintenance; weather condition; Markov chain; data visualization
Online: 17 January 2017 (11:17:32 CET)
For offshore wind power generation, accessibility is one of the main factors with a great impact on operation and maintenance, due to weather constraints on marine transportation. This paper presents a framework to explore the accessibility of an offshore site. First, several maintenance types are defined and taken into account. Next, a data visualization procedure is introduced to provide insight into the distribution of access periods over time. Then, a rigorous mathematical method based on a finite-state Markov chain is proposed to assess the accessibility of an offshore site from the maintenance perspective. Five years of weather data from a marine site are used to demonstrate the applicability and the outcomes of the proposed method. The main findings show that the proposed framework is effective in investigating accessibility at different time scales and is able to capture the patterns of the distribution of the access periods. Moreover, based on the developed Markov chain, the average waiting time for a certain access period can be estimated. With more information on the maintenance of an offshore wind farm, the expected production loss due to time delay can be calculated.
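A two-state weather chain illustrates the kind of finite-state Markov model described above; the transition probabilities are placeholders, not values from the paper's five-year dataset.

```python
import numpy as np

# Two-state weather chain: state 0 = site accessible, 1 = inaccessible.
# Transition probabilities are made-up placeholders.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: left eigenvector of P for eigenvalue 1,
# i.e. the long-run fraction of time spent in each state
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()

# Mean sojourn in the inaccessible state (geometric holding time):
# 1 / P(leave the state), the "average waiting time" for access
mean_wait = 1 / P[1, 0]

print(np.round(pi, 3), round(mean_wait, 3))
```

With these placeholder numbers the site is accessible 75% of the time, and a maintenance crew waits on average about 3.3 time steps for weather to clear.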
ARTICLE | doi:10.20944/preprints201806.0082.v1
Subject: Engineering, Energy And Fuel Technology Keywords: wind energy; wind turbines; supervisory control and data acquisition; retrofitting; performance evaluation
Online: 6 June 2018 (10:17:12 CEST)
Wind turbine upgrades have been spreading in recent years in the wind energy industry, with the aim of optimizing the efficiency of wind kinetic energy conversion. These interventions have material and labor costs, and it is therefore fundamental to estimate the production improvement realistically. Further, retrofitting wind turbines sited in harsh environments might exacerbate the stresses to which the turbines are subjected and consequently affect their residual lifetime. This work deals with a case of retrofitting: the testing ground is a multi-megawatt wind turbine from a wind farm sited in very complex terrain. The blades have been optimized by installing vortex generators and passive flow control devices. The complexity of this test case, dictated by the environment and by the features of the available data set, inspires the formulation of a general method for estimating production upgrades, based on multivariate linear modeling of the power output of the upgraded wind turbine. The method is a distinctive part of the outcome of this work because it is generalizable to the study of any wind turbine upgrade and is adaptable to the features of the available data sets. In particular, applying this model to the test case of interest, it emerges that the upgrade increases the annual energy production of the wind turbine by an amount on the order of 2%. This quantity is of the same order of magnitude as, albeit non-negligibly lower than, the estimate based on the assumption of ideal wind conditions. Therefore, complex wind conditions might affect the efficiency of wind turbine upgrades, and it is important to estimate their impact using data from wind turbines operating in the real environment of interest.
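The estimation logic (fit a multivariate linear baseline on pre-upgrade data, then compare post-upgrade production against the baseline's prediction) can be sketched on synthetic data; the features, coefficients, and the 2% injected gain are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(n):
    # Hypothetical operation variables: wind speed and ambient temperature,
    # expanded with a quadratic wind term for the multivariate linear model
    ws = rng.weibull(2, n) * 8
    temp = rng.uniform(0, 25, n)
    return np.column_stack([np.ones(n), ws, ws ** 2, temp])

true_beta = np.array([5, 20, 6, -1.5])

# "Pre-upgrade" period: fit a multivariate linear model of power output
Xpre = features(2000)
power_pre = Xpre @ true_beta + rng.normal(0, 10, 2000)
beta, *_ = np.linalg.lstsq(Xpre, power_pre, rcond=None)

# "Post-upgrade" period: actual output is 2% above the pre-upgrade baseline
Xpost = features(2000)
power_post = 1.02 * (Xpost @ true_beta) + rng.normal(0, 10, 2000)

# Upgrade gain: actual post-upgrade energy vs. the baseline model's prediction
gain = power_post.sum() / (Xpost @ beta).sum() - 1
print(f"estimated upgrade gain: {gain:.1%}")
```

Because the baseline model is conditioned on the actually observed wind, the estimate reflects the real site conditions rather than ideal ones.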
ARTICLE | doi:10.20944/preprints202304.0079.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: Deep neural networks; Activation functions; Multi-class classification; Time-Series prediction; Reuters data; Energy trade value
Online: 6 April 2023 (08:55:22 CEST)
Deep learning has been applied in many areas and has had a significant impact on applications that address real-life challenges. The success of deep learning in a wide range of areas is due in part to the use of activation functions, which are particularly effective at solving non-linear problems. Activation functions are a key focus for researchers in artificial intelligence who aim to improve the performance of neural networks. This article provides a comprehensive explanation and comparison of different activation functions, with a specific focus on the arc tangent and its variations. The paper presents experimental results showing that variations of the arc tangent using irrational numbers such as pi, the golden ratio, and Euler’s number, as well as a self-arctan function, produce promising results. Having experimented with the promising activation functions on two different problems and datasets, we found that different irrationals work well for different problems: arctan ϕ mostly gives the best results for multiclass classification, and arctan e gives the best results for time-series prediction problems. The paper focuses on a multi-class classification problem applied to the Reuters Newswire dataset and a time-series prediction problem on Türkiye's energy trade value to show the impacts of activation functions.
ARTICLE | doi:10.20944/preprints202104.0745.v1
Subject: Computer Science And Mathematics, Algebra And Number Theory Keywords: green effect; pipelines; remote monitoring; data analysis; machine learning; time series
Online: 28 April 2021 (10:39:31 CEST)
The extensive but remote oil and gas fields of the United States, Canada, and Russia require the construction and operation of extremely long pipelines. Global warming and local heating effects lead to rising soil temperatures and thus a reduction in the sub-grade capacity of the soils; this causes changes in the spatial positions and forms of the pipelines, consequently increasing the number of accidents. Oil operators are compelled to monitor the soil temperature along the routes of remote pipelines in order to perform remedial measures in time. They are therefore seeking methods for the analysis of voluminous diagnostic information. To forecast soil temperatures at different depths, we propose compiling a multidimensional dataset; computing descriptive statistics; selecting uncorrelated time series; generating synthetic features; robustly scaling the temperature series; and tuning an additive regression model to forecast soil temperatures.
ARTICLE | doi:10.20944/preprints201703.0213.v1
Subject: Engineering, Transportation Science And Technology Keywords: travel time predictability; multiple entropy; travel time series; vehicle trajectory data
Online: 28 March 2017 (17:22:03 CEST)
With the great development of intelligent transportation systems (ITS), travel time prediction has attracted the attention of many researchers, and a large number of prediction methods have been developed. However, the predictability of travel time series, the basic premise for travel time prediction, has received less attention than the methodology. Based on an analysis of the complexity of travel time series, this paper defines travel time predictability to express the probability of correct travel time prediction and proposes an entropy-based method to measure the upper bound of travel time predictability. Multiscale entropy is employed to quantify the complexity of travel time series, and the relationships between entropy and the upper bound of travel time predictability are presented. Empirical studies are conducted with vehicle trajectory data from an express road section. The effects of time scale, tolerance, and series length on entropy and travel time predictability are analysed, and some valuable suggestions about the accuracy of travel time predictability are discussed. Finally, comparisons between travel time predictability and actual prediction results from two prediction models, ARIMA and BPNN, are conducted. Experimental results demonstrate the validity and reliability of the proposed travel time predictability measure.
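Sample entropy, the building block of the multiscale entropy used above, can be sketched as follows; the parameters m and r and the toy series are illustrative, not the paper's settings. A regular series (low entropy) is more predictable than a noisy one (high entropy).

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    # SampEn: -log of the chance that template sequences matching for m
    # points (within tolerance r*std) also match for m+1 points. Lower
    # entropy means a more regular, hence more predictable, series.
    x = np.asarray(x, float)
    r = r * x.std()
    def count(mm):
        templ = np.lib.stride_tricks.sliding_window_view(x, mm)
        d = np.abs(templ[:, None, :] - templ[None, :, :]).max(-1)
        return (d <= r).sum() - len(templ)  # exclude self-matches
    return -np.log(count(m + 1) / count(m))

rng = np.random.default_rng(0)
t = np.arange(500)
periodic = np.sin(2 * np.pi * t / 25)   # regular "travel time" pattern
noisy = rng.normal(size=500)            # unpredictable series

print(round(sample_entropy(periodic), 2), round(sample_entropy(noisy), 2))
```

Computing this over coarse-grained versions of the series (averaging blocks of 2, 3, ... samples) gives the multiscale entropy profile the paper relates to the predictability upper bound.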
ARTICLE | doi:10.20944/preprints201801.0139.v1
Subject: Environmental And Earth Sciences, Remote Sensing Keywords: data logger; environmental monitoring network; open source; submersible; under-water; critical zone observatory; cave; Yucatan Peninsula; vadose hydrology; subterranean karst estuary
Online: 16 January 2018 (10:40:15 CET)
A low-cost data logging platform is presented for environmental monitoring projects that provides long-term operation in remote or submerged environments. Three premade “breakout boards” from the open-source Arduino ecosystem are assembled into the core of the platform. The components are selected based on low-cost and ready availability, making the loggers easy to build and modify without specialized tools, or a significant background in electronics. Power optimization techniques are explained. The platform has proven to be highly reliable, and capable of operating for more than a year on standard AA batteries. The flexibility of the system is illustrated with two ongoing field studies recording drip rates in a cave, and water flow in a flooded cave system.