ARTICLE | doi:10.20944/preprints202301.0524.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: Matrix variate distribution; Mixture models; EM-algorithm; Penalized likelihood
Online: 28 January 2023 (08:53:19 CET)
In the era of big data, with increasingly complex data structures and ever-larger data scales, matrix-type data are becoming highly valued, and their applications in medicine, industry, education, geography, and astronomy are expanding. In recent years, significant progress has been made in the practical use of finite mixture models of matrix-variate t-distributions for handling data with multi-subgroup structures and long tails. In this paper, an expectation-maximization (EM) algorithm with penalized maximum likelihood is proposed to resolve the unboundedness of the likelihood function of this model by accounting for the degeneracy of its variance-covariance matrix. The method was evaluated through simulations and real data analysis, and the results demonstrate that it is effective both in preventing the likelihood function from being unbounded and in ensuring the accuracy of the parameter estimates produced by the EM algorithm.
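The unboundedness mentioned in this abstract can be illustrated with a much simpler model than the matrix-variate t mixture. The following is a minimal sketch, assuming a two-component univariate normal mixture with equal weights; the data, the fixed second component, the penalty form, and the constant `lam` are all hypothetical choices for illustration and are not taken from the paper. Centering one component on a single observation and shrinking its standard deviation drives the plain log-likelihood upward without limit, while a penalty on small variances restores a bounded objective.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_loglik(data, sigma1):
    """Log-likelihood of 0.5*N(data[0], sigma1^2) + 0.5*N(1.5, 1).

    The first component is centered on the first observation, which is
    the degenerate configuration that makes the likelihood unbounded.
    """
    mu1 = data[0]
    return sum(
        math.log(0.5 * normal_pdf(x, mu1, sigma1) + 0.5 * normal_pdf(x, 1.5, 1.0))
        for x in data
    )

def penalized_loglik(data, sigma1, lam=0.01):
    """Log-likelihood minus a hypothetical penalty lam*(1/var + log var).

    The penalty tends to +infinity as var -> 0, so the penalized
    objective tends to -infinity instead of +infinity.
    """
    var = sigma1 ** 2
    return mixture_loglik(data, sigma1) - lam * (1.0 / var + math.log(var))

data = [0.0, 1.0, 2.0, 3.0]  # toy data, assumed for illustration
for s in (1e-1, 1e-3, 1e-6):
    print(s, mixture_loglik(data, s), penalized_loglik(data, s))
```

In this sketch the unpenalized value keeps growing as `sigma1` shrinks, whereas the penalized value collapses, which is the qualitative behavior a penalized EM algorithm relies on.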
ARTICLE | doi:10.20944/preprints201910.0321.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: maximum likelihood; logistic regression; Firth's correction; separation; penalized likelihood; bias
Online: 28 October 2019 (12:01:17 CET)
The parameters of logistic regression models are usually obtained by the method of maximum likelihood (ML). However, in analyses of small data sets or data sets with unbalanced outcomes or exposures, ML parameter estimates may not exist. This situation has been termed “separation” because the two outcome groups are separated by the values of a covariate or a linear combination of covariates. To overcome the problem of non-existing ML parameter estimates, Firth’s correction (FC) has been proposed. In practice, however, a principal investigator might be advised to “bring more data” in order to solve a separation issue. We illustrate the problem by means of examples from colorectal cancer screening and ornithology. It is unclear whether such an increasing sample size (ISS) strategy, which keeps sampling new observations until separation is removed, improves estimation compared to applying FC to the original data set. We performed an extensive simulation study whose main focus was to estimate the cost-adjusted relative efficiency of ML combined with ISS compared to FC. FC yielded reasonably small root mean squared errors and proved to be the more efficient estimator. Given our findings, we propose not to adapt the sample size when separation is encountered but to use FC as the default method of analysis whenever the number of observations or outcome events is critically low.
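To make the idea of separation and Firth's correction concrete, the following is a minimal sketch, assuming the standard formulation in which the score is modified to X^T (y - p + h (0.5 - p)), with h the diagonal of the weighted hat matrix. The function name `firth_logistic`, the step cap, and the toy data are assumptions for illustration and do not come from the paper. The toy data are completely separated (all negative x have y = 0, all positive x have y = 1), so ordinary ML estimates do not exist, while the corrected iteration settles on finite values.

```python
import numpy as np

def firth_logistic(X, y, max_iter=200, tol=1e-8):
    """Firth-type penalized logistic regression via modified Fisher scoring."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])          # Fisher information X^T W X
        info_inv = np.linalg.inv(info)
        # Leverages h_i = w_i * x_i^T (X^T W X)^{-1} x_i
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Modified score: the h*(0.5 - p) term is Firth's correction
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        # Cap very large steps (a simple safeguard, assumed here)
        biggest = np.max(np.abs(step))
        if biggest > 5.0:
            step *= 5.0 / biggest
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Completely separated toy data: x < 0 always gives y = 0, x > 0 gives y = 1
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])      # intercept + covariate

beta_hat = firth_logistic(X, y)
print("intercept, slope:", beta_hat)
```

For real analyses an established implementation (for example the `logistf` R package) would normally be preferred over a hand-written loop such as this one.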
ARTICLE | doi:10.20944/preprints202203.0182.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: cumulative logit; penalized models; LASSO; variable inclusion indicators; spike-and-slab
Online: 14 March 2022 (10:04:52 CET)
Stage of cancer is a discrete ordinal response that indicates aggressiveness of disease and is often used by physicians to determine the type and intensity of treatment to be administered. For example, the FIGO stage in cervical cancer is based on the size and depth of the tumor as well as the level of spread. It may be of clinical relevance to identify molecular features from high-throughput genomic assays that are associated with stage of cervical cancer, in order to elucidate pathways related to tumor aggressiveness, identify improved molecular features that may be useful for staging, and identify therapeutic targets. High-throughput RNA-Seq data and corresponding clinical data (including stage) for cervical cancer patients have been made available through The Cancer Genome Atlas Project (TCGA). We recently described penalized Bayesian ordinal response models that can be used for variable selection in over-parameterized datasets such as the TCGA-CESC dataset. Herein, we describe our ordinalbayes R package, available from the Comprehensive R Archive Network (CRAN), which is capable of fitting cumulative logit models when the outcome is ordinal and the number of predictors exceeds the sample size (P>N), as is the case for TCGA data. We demonstrate use of this package through application to the TCGA cervical cancer dataset. Our ordinalbayes package can be used to fit models to high-dimensional datasets and effectively performs variable selection.
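As a reminder of what a cumulative logit model computes, the following is a minimal sketch, assuming the common parameterization P(Y <= k) = logistic(alpha_k - eta) with increasing cutpoints alpha_k and linear predictor eta = x'beta. The sign convention, the function name `cumulative_logit_probs`, and the numeric values are assumptions for illustration; this is not the ordinalbayes API, and the package may parameterize the model differently.

```python
import math

def cumulative_logit_probs(eta, cutpoints):
    """Category probabilities under P(Y <= k) = logistic(alpha_k - eta).

    `cutpoints` holds the K-1 strictly increasing thresholds alpha_k;
    the result has K entries, one per ordinal category.
    """
    if any(b <= a for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    # Cumulative probabilities, with P(Y <= K) = 1 appended
    cum = [1.0 / (1.0 + math.exp(-(a - eta))) for a in cutpoints] + [1.0]
    probs = []
    prev = 0.0
    for c in cum:
        probs.append(c - prev)   # P(Y = k) = P(Y <= k) - P(Y <= k-1)
        prev = c
    return probs

# Hypothetical four-stage outcome with an assumed linear predictor of 0.5
probs = cumulative_logit_probs(eta=0.5, cutpoints=[-1.0, 0.0, 1.5])
print(probs, sum(probs))
```

Under this convention a larger linear predictor shifts probability toward the higher categories, which is why positive coefficients are read as being associated with more advanced stage.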
ARTICLE | doi:10.20944/preprints202006.0063.v1
Subject: Computer Science And Mathematics, Probability And Statistics Keywords: COVID-19; Real-Time Tracker; Common Symptoms; Data Visualization; Hypothesis Testing; ARIMA Time-Series Forecast; Penalized Logistic Regression
Online: 7 June 2020 (07:44:48 CEST)
While the COVID-19 outbreak was first reported to have originated in Wuhan, China, it was declared a Public Health Emergency of International Concern (PHEIC) by the WHO on 30 January 2020, and it had spread to over 180 countries by the time this paper was being composed. As the disease has spread around the globe, it has evolved into a worldwide pandemic, endangering global public health and becoming a serious threat to the global community. To combat and prevent the spread of the disease, all individuals should be well informed of the rapidly changing state of COVID-19. To this end, a COVID-19 real-time analytical tracker has been built to provide the latest status of the disease and relevant analytical insights. The real-time tracker is designed for a general audience without advanced statistical training. It aims to communicate insights through straightforward and concise data visualizations that are supported by sound statistical foundations and reliable data sources. This paper discusses the major methodologies used to generate the insights displayed on the real-time tracker, which include real-time data retrieval, normalization techniques, ARIMA time-series forecasting, and logistic regression models. In addition to introducing the details and motivations of these methodologies, the paper features some key discoveries about COVID-19 that have been derived using them.