ARTICLE | doi:10.20944/preprints202204.0106.v1
Online: 12 April 2022 (09:49:42 CEST)
Gas chromatography-coupled mass spectrometry (GC-MS) has been used in biomedical research to analyze volatile, non-polar, and polar metabolites in a wide array of sample types. Despite advances in technology, missing values are still common in metabolomics datasets and must be properly handled. We evaluated the performance of ten commonly used missing value imputation methods with metabolites analyzed on a high-resolution (HR) GC-MS instrument. By introducing missing values into the complete NIST plasma dataset (i.e., data without any missing values), we demonstrate that Random Forest (RF), Glmnet Ridge Regression (GRR), and Bayesian Principal Component Analysis (BPCA) shared the lowest Root Mean Squared Error (RMSE) in technical replicate data. Further examination of these three methods in data from baboon plasma and liver samples demonstrated that they all maintained high accuracy. Overall, our analysis suggests that any of the three imputation methods can be applied effectively to untargeted metabolomics datasets with high accuracy. However, it is important to note that imputation will alter the correlation structure of the dataset and bias downstream regression coefficients and p-values.
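A minimal sketch of the evaluation idea described above: mask entries of a complete data matrix, impute them, and score RMSE on the held-out cells. The random-forest imputation here uses scikit-learn's IterativeImputer as a stand-in, not the authors' pipeline, and all data and parameter values are illustrative assumptions.

```python
# Hypothetical illustration: introduce missing values into a complete matrix,
# impute them, and compute RMSE on the masked (held-out) cells.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
complete = rng.lognormal(mean=2.0, sigma=0.5, size=(60, 20))  # stand-in for metabolite intensities

# Introduce 10% missing values completely at random.
mask = rng.random(complete.shape) < 0.10
with_missing = complete.copy()
with_missing[mask] = np.nan

# Random-forest-based iterative imputation (one of several possible RF imputation schemes).
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                           max_iter=5, random_state=0)
imputed = imputer.fit_transform(with_missing)

rmse = np.sqrt(np.mean((imputed[mask] - complete[mask]) ** 2))
print(f"RMSE on masked entries: {rmse:.3f}")
```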
ARTICLE | doi:10.20944/preprints201806.0055.v1
Keywords: quality control; validation; reconstruction of missing data; temperature; precipitation
Online: 5 June 2018 (08:42:40 CEST)
This study provides a unique procedure for validating and reconstructing temperature and precipitation data. Although developed from data in central Italy, the validation method is intended to be universal, subject to appropriate calibration according to the climate zones analysed. This research is an attempt to create shared applicative procedures that are, most of the time, only theorized or included in software without a clear definition of the methods. The purpose is to detect most types of errors according to the procedures for data validation prescribed by the World Meteorological Organization, defining practical operations for each of the five types of data controls: gross error checking, internal consistency check, tolerance test, temporal consistency, and spatial consistency. Temperature and precipitation data over the period 1931-2014 were investigated. The outcomes of this process have led to the removal of 375 records (0.02%) of temperature data from 40 weather stations and 1286 records (1.67%) of precipitation data from 118 weather stations, and to the reconstruction of 171 data points. In conclusion, this work contributes to the development of standardized methodologies to validate climate data and provides an innovative procedure to reconstruct missing data in the absence of reliable reference time series.
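As an illustration of the kind of check this procedure formalizes, a gross-error and internal-consistency test might flag temperature records outside climatologically plausible bounds, or a daily minimum exceeding the maximum. The limits and column names below are assumptions for the sketch, not the authors' calibrated thresholds.

```python
# Hypothetical sketch of two WMO-style data controls: gross error and internal consistency.
import pandas as pd

def flag_temperature_errors(df, tmin_col="tmin", tmax_col="tmax",
                            lower=-30.0, upper=50.0):
    """Return boolean flags for records failing simple validation checks.

    lower/upper are illustrative climatological limits (deg C) and would need
    calibration for the climate zone being analysed.
    """
    gross_error = (df[tmin_col] < lower) | (df[tmax_col] > upper)
    inconsistent = df[tmin_col] > df[tmax_col]   # internal consistency: Tmin must not exceed Tmax
    return gross_error | inconsistent

records = pd.DataFrame({"tmin": [4.2, -45.0, 12.0, 18.5],
                        "tmax": [12.1, 3.0, 9.5, 61.0]})
print(flag_temperature_errors(records))
```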
ARTICLE | doi:10.20944/preprints201803.0084.v1
Subject: Engineering, Civil Engineering Keywords: anfis; missing data; multiple regression; normal ratio method; Yeşilırmak
Online: 12 March 2018 (07:00:46 CET)
Good data analysis is required for the optimal design of water resources projects. However, data are not always collected regularly, for material or technical reasons, which results in incomplete-data problems. The available data and the record length are of great importance for solving these problems. Various studies have been conducted on missing data treatment. This study used data from the flow observation stations on the Yeşilırmak River in Turkey. In the first part of the study, models for completing missing data were generated and compared using ANFIS, multiple regression, and the Normal Ratio Method. In the second part of the study, the minimum amount of data required for ANFIS models was determined using the optimum ANFIS model. Of all the methods compared in this study, the ANFIS models yielded the most accurate results. A 10-year record was also found to be a sufficient training data set.
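For reference, the Normal Ratio Method mentioned above estimates a missing value at a target station from concurrent observations at neighbouring stations, weighted by the ratio of long-term means. The sketch below is the generic textbook form, not the exact configuration used in the study, and the station values are invented.

```python
# Hypothetical sketch of the Normal Ratio Method:
#   x_target = (1/n) * sum( (N_target / N_i) * x_i )
# where N_* are long-term station means ("normals") and x_i are the
# concurrent observations at the n neighbouring stations.

def normal_ratio_estimate(target_normal, neighbour_values, neighbour_normals):
    ratios = [(target_normal / n_i) * x_i
              for x_i, n_i in zip(neighbour_values, neighbour_normals)]
    return sum(ratios) / len(ratios)

# Invented example: three neighbouring gauging stations.
estimate = normal_ratio_estimate(target_normal=120.0,
                                 neighbour_values=[95.0, 110.0, 130.0],
                                 neighbour_normals=[100.0, 115.0, 140.0])
print(f"Estimated missing flow: {estimate:.1f}")
```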
ARTICLE | doi:10.20944/preprints202105.0390.v1
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Multilayer perceptron neural network; regression model; backpropagation; missing data; imputation method
Online: 17 May 2021 (14:35:18 CEST)
Missing observations constitute one of the most important issues in data analysis in applied research studies. Their magnitude and structure affect parameter estimation in modeling, with important consequences for decision-making. This study aims to evaluate the efficiency of imputation methods combined with the backpropagation algorithm in a nonlinear regression context. The evaluation is conducted through a simulation study covering several sample sizes (50, 100, 200, 300 and 400), different missing data rates (10, 20, 30, 40 and 50%), and three missingness mechanisms (MCAR, MAR and MNAR). Four imputation methods (Last Observation Carried Forward, Random Forest, Amelia and MICE) were used to impute datasets before making predictions with backpropagation. A 3-MLP model was used, varying the activation functions (Logistic-Linear, Logistic-Exponential, TanH-Linear and TanH-Exponential), the number of nodes in the hidden layer (3-15), and the learning rate (20-70%). Analysis of the network's performance criteria (R2, r and RMSE) revealed good performance when it was trained with TanH-Linear functions, 11 nodes in the hidden layer, and a learning rate of 50%. MICE and Random Forest were the most appropriate methods for data imputation. These methods can handle missing rates of up to 50%, with an optimal sample size of 200.
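A minimal sketch of the impute-then-predict pipeline evaluated above, using scikit-learn's IterativeImputer as a MICE-style stand-in and an MLP with a tanh hidden layer. The data, network size, and learning rate are illustrative assumptions, not the study's settings.

```python
# Hypothetical impute-then-predict sketch (MICE-style imputation + backpropagation MLP).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)  # nonlinear target

# Introduce 30% missing values completely at random (MCAR) into the predictors.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.30] = np.nan

X_imputed = IterativeImputer(random_state=1).fit_transform(X_missing)  # MICE-like chained imputation
X_tr, X_te, y_tr, y_te = train_test_split(X_imputed, y, test_size=0.25, random_state=1)

mlp = MLPRegressor(hidden_layer_sizes=(11,), activation="tanh",
                   learning_rate_init=0.05, max_iter=2000, random_state=1)
mlp.fit(X_tr, y_tr)
print(f"R^2 on held-out data: {mlp.score(X_te, y_te):.3f}")
```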
ARTICLE | doi:10.20944/preprints202005.0045.v2
Subject: Behavioral Sciences, Social Psychology Keywords: coronavirus transmission; path of transmission of the virus; missing information about coronavirus; social distancing
Online: 8 May 2020 (04:41:22 CEST)
We present detailed calculations of social distancing requirements. A comparative study of the growth pattern and death tolls in different communities indicates that the growth pattern of infected patients and the death rate follow a similar distribution with different parametrizations. Every distribution follows an exponential growth curve, as with other microbes, then reaches a saturation point and eventually decays. However, the argument of the exponential function depends on several parameters that are, as yet, unknown. Moreover, the slope varies between epicenters and seems to be related to parameters such as accessibility to healthcare facilities, pre-existing medical conditions, socio-economic conditions, and lifestyle. The mismatch in growth patterns is also linked to the impact of various other factors and to premature interpretation of limited data. The novel behavior of the virus brought many surprises, opened new avenues for medical research, and highlighted the need for more detailed study of pathogens in light of the interaction of RNA and DNA. Its adaptability to diverse ecological conditions and the corresponding structural modifications are also worth investigating. Such genetic modification can be studied using a quantum-mechanical probabilistic approach.
ARTICLE | doi:10.20944/preprints202105.0105.v1
Subject: Earth Sciences, Atmospheric Science Keywords: data scarcity; water quality; missing data; univariate imputation; multivariate imputation; machine learning; hydroinformatics.
Online: 6 May 2021 (15:18:23 CEST)
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water-resource management. However, worldwide, and particularly in developing countries, water-quality studies are limited by the lack of complete and reliable datasets of surface-water-quality variables. In this context, several statistical and machine-learning models were assessed for imputing water-quality data at six monitoring stations located in the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. The challenge of this study lies in the high percentage of missing data (between 50% and 70%) and the high temporal and spatial variability that characterize the water-quality variables. The competing algorithms implemented belong to both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR). According to the results, more than 76% of the imputation outcomes are considered satisfactory (NSE > 0.45). The imputation performance is better at the monitoring stations located inside the reservoir than at those positioned along the mainstream. IDW was the most frequently selected model for data imputation.
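Since imputation quality above is judged with the Nash-Sutcliffe efficiency (NSE), a small sketch of that criterion may help; the arrays below are invented, and NSE > 0.45 is the satisfactory threshold adopted in the study.

```python
# Nash-Sutcliffe efficiency, the criterion used above to judge imputations
# (NSE = 1 means perfect agreement; NSE <= 0 means no better than the observed mean).
import numpy as np

def nash_sutcliffe(observed, imputed):
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return 1.0 - np.sum((observed - imputed) ** 2) / np.sum((observed - observed.mean()) ** 2)

# Invented example: held-out water-quality observations vs. imputed values.
obs = np.array([3.1, 2.8, 4.0, 3.6, 2.9])
imp = np.array([3.0, 2.9, 3.7, 3.5, 3.1])
print(f"NSE = {nash_sutcliffe(obs, imp):.2f}")   # satisfactory if > 0.45 in the study above
```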
ARTICLE | doi:10.20944/preprints202204.0306.v1
Subject: Behavioral Sciences, Clinical Psychology Keywords: Fear of missing out; FoMO; social media; Social networking sites; addiction; depression; anxiety; sleep; exercise
Online: 29 April 2022 (13:50:46 CEST)
The fear of missing out (FoMO) is characterized in the literature as a fear that others are having rewarding experiences while one is missing out, and a constant need to stay connected with one’s social network. Drawing on Self-Determination Theory (SDT), FoMO has been linked with Problematic Social Networking Sites Use (PSNSU), negative affectivity (NA), self-esteem (SE), and sleep disturbances. The present study reports findings from 512 individuals (79.1% women, mean age 30.5 years, SD = 8.61). Structural equation modelling (SEM) suggests that the duration of SNS use and the number of SNS platforms actively used partially mediated the relationship between FoMO and PSNSU. In turn, PSNSU partially mediated the relationship between FoMO and NA. Furthermore, the present study extends the literature by incorporating the Vulnerability Model into the FoMO concept, identifying that SE partially mediated the relationship between FoMO and NA, while NA fully mediated the relationship between FoMO and sleep disturbances. Accordingly, the present study extends previous research findings by showing exercise to be a potential protective factor against FoMO. Practical and theoretical implications are discussed.
ARTICLE | doi:10.20944/preprints201612.0078.v1
Subject: Mathematics & Computer Science, General & Theoretical Computer Science Keywords: missing value imputation; machine learning; decision tree imputation; k-nearest neighbors imputation; self-organizing map imputation
Online: 15 December 2016 (08:27:13 CET)
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
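A minimal sketch of two of the four imputation schemes mentioned above (mean and k-nearest neighbours) applied before model training. These are scikit-learn stand-ins on invented data, not the Bayesian Rule Learning pipeline used in the study.

```python
# Hypothetical sketch: mean vs. k-nearest-neighbours imputation before training a classifier.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
X[rng.random(X.shape) < 0.25] = np.nan   # sparse clinical-style data with ~25% missing

X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_knn = KNNImputer(n_neighbors=3).fit_transform(X)

# Either imputed matrix can then be passed to the chosen learner
# (the study uses Bayesian Rule Learning; any classifier illustrates the point).
print(X_mean.shape, X_knn.shape)
```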
ARTICLE | doi:10.20944/preprints202107.0173.v1
Subject: Social Sciences, Other Keywords: Fear of missing out (FOMO), Parental control, Problematic Social Media Use (PSMU), Social Media Addiction, Social Media Intrusion
Online: 7 July 2021 (10:23:46 CEST)
This study examines the relationship of fear of missing out (FOMO) with heavy social networking among Turkish university students (aged 17-55). The perceived role of parental supervision of online activities is also investigated. Factor analysis of the FOMO scale led us to evaluate the construct under two dimensions: (1) fear of missing experience and (2) fear of missing activity. The results revealed that fear of missing activity increases social media intrusion, while fear of missing experience was found to have no significant effect. The reverse relationship also holds: an urge to use social media predicts fear of missing out (both activity and experience). Fear of missing experience is associated with problematic social media use (PSMU) and a high desire to use social media. The results additionally demonstrate that students aged 30 and older believe more strongly in the need for parental control than those aged 17-22.
ARTICLE | doi:10.20944/preprints201812.0251.v1
Subject: Physical Sciences, Particle & Field Physics Keywords: charged dark matters, missing neutrinos, cosmic rays, gravitation constant, Coulomb’s constant, extended standard model, anti-Helium cosmic ray
Online: 20 December 2018 (12:54:24 CET)
In the present work, the charged B1, B2 and B3 bastons, with the condition k(mm) = k >> k(dd) > k(dm) = k(lq) = 0, are explained as good candidates for the dark matters. The proposed rest mass (26.12 eV/c^2) of the B1 dark matter is indirectly confirmed from the supernova 1987A data. The missing neutrinos are newly explained by using the dark matters and the lepton charge force. The neutrino excess anomaly of the MiniBooNE data is explained by B1 dark matter scattering within the Cherenkov detectors. The rest masses of 1.4 TeV/c^2 and 42.7 GeV/c^2 are assigned to the Le particle and the B2 dark matter, respectively, from the cosmic ray observations. In the present work, the Q1 baryon decays are used to explain the anti-Helium cosmic ray events. Because of graviton evaporation and photon confinement, a very small Coulomb's constant (k(dd)) of 10^(x-54) k and a gravitation constant (GN(dd)) of 10^x GN are proposed for the charged dark matters at the present time. The value of x can be positive, zero, or negative, close to zero. Therefore, Fc(mm) > Fg(dd) (?) Fg(mm) > Fg(dm) > Fc(dd) > Fc(dm) = Fc(lq) = 0 for the proton-like particle.
ARTICLE | doi:10.20944/preprints202110.0107.v1
Keywords: missing item responses; multiple imputation; item response model; PISA; country comparisons; Mislevy-Wu model; latent ignorability; nonignorable item responses
Online: 6 October 2021 (12:52:48 CEST)
Missing item responses are prevalent in educational large-scale assessment studies such as the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated a model-based treatment based on the latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item response itself. The scoring of missing item responses as wrong and the latent ignorable model are both submodels of the Mislevy-Wu model. This article uses the PISA 2018 mathematics dataset to investigate the consequences of different missing data treatments on country means. The obtained country means can differ substantially across the scaling models. In contrast to previous statements in the literature, scoring missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, we argue that model fit should play only a minor role in choosing psychometric models in large-scale assessment studies, because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
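For orientation, a hedged sketch of the missingness models discussed above: with ability θ_p and response propensity ξ_p for person p, latent ignorability assumes the response indicator R_pi is independent of the item response X_pi given (θ_p, ξ_p), while the Mislevy-Wu model lets R_pi depend on X_pi itself. The particular logistic parametrization below is an illustrative assumption, not necessarily the paper's exact specification.

```latex
% Illustrative parametrization (requires amsmath); \Psi denotes the logistic function.
\begin{align*}
P(R_{pi}=1 \mid \theta_p, \xi_p) &= \Psi(\xi_p - \beta_i)
  && \text{(latent ignorable model)} \\
P(R_{pi}=1 \mid X_{pi}=x,\, \theta_p, \xi_p) &= \Psi(\xi_p - \beta_i + \delta_i x)
  && \text{(Mislevy--Wu model)}
\end{align*}
```

Setting the item-specific parameter δ_i to zero removes the dependence on x and recovers the latent ignorable case, which is one way to read the submodel relation stated in the abstract.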
ARTICLE | doi:10.20944/preprints202208.0451.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: text splitting; text tokenization; transfer learning; mask-fill prediction; NLP linguistic rules; missing punctuations; cross-lingual BERT model; Masked Language Modeling
Online: 26 August 2022 (05:19:39 CEST)
Long unpunctuated texts containing complex linguistic sentences are a stumbling block to processing any low-resource language. Thus, approaches that segment lengthy texts with no proper punctuation into simple candidate sentences perform a vitally important preprocessing task in many hard-to-solve NLP applications. In this paper, we propose PDTS, a punctuation detection approach for segmenting Arabic text, built on top of a multilingual BERT-based model and some generic linguistic rules. Furthermore, we showcase how PDTS can be effectively employed as a text tokenizer for unpunctuated documents (i.e., mimicking transcribed audio-to-text documents). Experimental findings across two evaluation protocols (involving an ablation study and a human-based judgment) demonstrate that PDTS is practically effective in both performance quality and computational cost.
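The general mask-fill idea behind an approach like PDTS can be illustrated with a multilingual BERT model: insert a mask token at a candidate boundary and check whether the model ranks punctuation marks highly there. This is a generic sketch of masked-language-model punctuation probing, not the authors' PDTS system or its linguistic rules; the model name, example sentence, and threshold are assumptions.

```python
# Hypothetical sketch: probe a candidate split point with a masked-language model
# and treat a high-probability punctuation prediction as a segmentation signal.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

text = "the meeting ended late [MASK] everyone went home"   # candidate boundary marked with [MASK]
predictions = fill_mask(text, top_k=10)

PUNCTUATION = {".", ",", ";", ":", "?", "!"}
boundary_score = sum(p["score"] for p in predictions if p["token_str"].strip() in PUNCTUATION)
print(f"Punctuation mass at candidate boundary: {boundary_score:.3f}")
# A simple rule (illustrative threshold): split here if boundary_score exceeds, say, 0.5.
```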
HYPOTHESIS | doi:10.20944/preprints202003.0419.v9
Subject: Life Sciences, Molecular Biology Keywords: ATP hypothesis; origin of genetic code; life’s building block; prebiotic “soup”; coevolution; biochemical system; missing “matchmaker”; energy transformation; informatization; structuralization; precellular selection; photochemical origin of life; virus; anti-life form; 2019-nCoV
Online: 11 September 2020 (08:39:39 CEST)
Plenty of theories on the origin of the genetic code have been proposed so far, yet all ignore the energetic driving force, its relation to the biochemical system, and, most importantly, the missing “matchmaker” between proteins and nucleic acids. Here, a new hypothesis is proposed, according to which ATP is at the origin of the primordial genetic code, driving the coevolution of the genetic code with the pristine biochemical system. This hypothesis aims to show how the genetic code could have been produced, e.g., by photochemical reactions in a protocell derived from a lipid vesicle enclosing various of life’s building blocks (e.g., nucleotides and peptides). In extant cells, ATP is the only energetic product of photosynthesis and is at the energetic heart of biochemical systems. ATP could energetically form and elongate chains of both polynucleotides and polypeptides, thus acting as a “matchmaker” between these two biopolymers and eventually mediating precellular biochemical innovation from energy transformation to informatization. ATP was not the only molecule that could drive the formation of polynucleotides and polypeptides, but it was favored by precellular selection. The protocell innovated a photosynthesis system to produce ATP efficiently and regularly with the aid of proteins and RNA/DNA. The permanent recording of genetic information in DNA marked the dawn of cellular life governed by Darwinian evolution. The ATP hypothesis assumes or supports the photochemical origin of life, shedding light on the origins of both photosynthetic and biochemical systems, which remain largely unknown thus far. Based on the ATP hypothesis, a virus (like the new coronavirus) could not be the earliest life on Earth, as it has neither a biochemical system nor a lipid bilayer membrane providing the relatively isolated environment needed for the development of protobiochemical reactions, although it carries the genetic code of a cellular life form. A virus could not be a bridge between life and non-life; rather, it is an anti-life substance, as it depletes cellular material for its own replication and then spreads by destroying host cells. It can be imagined that if cellular life were completely wiped out by viruses, the complete destruction of life on Earth would be inevitable.