Variable Selection for Sparse Logistic Regression with Grouped Variables


Submitted: 16 November 2023. Posted: 16 November 2023.


Abstract
We present a new penalized method for estimation in sparse logistic regression models with a group structure. The group structure of the predictors naturally suggests the Group Lasso penalty. Unlike penalized log-likelihood estimation, our method can be viewed as a penalized weighted score function method. Under mild conditions, we provide non-asymptotic oracle inequalities that promote group sparsity of the predictors. We also employ a modified block coordinate descent algorithm based on the weighted score function. The main advantage of our algorithm over existing Group Lasso-type procedures is that the tuning parameter can be pre-specified. Simulations show that this algorithm is considerably faster and more stable than competing methods. Finally, we illustrate our methodology with two real data sets.

1. Introduction

Logistic regression models are a powerful and popular technique for modeling the relationship between predictors and a categorical response variable. Let $(x_1,y_1),\ldots,(x_n,y_n)$ be independent pairs of observed data, realizations of the random vector $(X,Y)$ with $p$-dimensional predictors $X\in\mathbb{R}^p$ and a univariate binary response $Y\in\{0,1\}$. The pair $(X,Y)$ is assumed to satisfy
$$P(Y=1\mid X=x)=G(x^T\beta^0)=\frac{\exp(x^T\beta^0)}{1+\exp(x^T\beta^0)}, \qquad (1)$$
where $\beta^0\in\mathbb{R}^p$ is the regression vector to be estimated. We are especially concerned with the sparse logistic regression problem when the dimension $p$ is high and the sample size $n$ may be small, the so-called "small $n$, large $p$" framework, which is a variable selection problem for high-dimensional data.
When dealing with high-dimensional data, there are usually two important considerations: model sparsity and prediction accuracy. The Lasso [1] was proposed to address both objectives, since it can determine submodels with a moderate number of parameters that still fit the data adequately. Other related methods include SCAD [2], the elastic net [3], the Dantzig selector [4], MCP [5], and so on. In high-dimensional logistic regression models, Lasso studies range from asymptotic results, including the consistency and asymptotic distribution of the estimator, e.g., Huang et al. [6] and Bianco et al. [7], to non-asymptotic results, including non-asymptotic oracle inequalities on the estimation and prediction errors, e.g., Abramovich et al. [8], Huang et al. [9] and Yin [10].
In many applications, predictors can naturally be thought of as grouped. For example, in genome-wide association studies (GWAS), genes usually do not act individually; their effect is reflected in the covariation of several genes with each other. Similarly, in studies of histologically normal epithelium (NlEpi), we need to model non-linear effects of genes in microarray data. As with the Lasso, incorporating this grouping information into the modeling process should improve the interpretability and the accuracy of the model. Yuan and Lin [11] proposed an extension of the Lasso, called the Group Lasso, which imposes an $L_2$ penalty on individual groups of variables and then an $L_1$ penalty on the resulting block norms, rather than only an $L_1$ penalty on individual variables. Suppose $x_i$ and $\beta^0$ in model (1) are divided into $g$ known groups; that is, we consider a partition $\{G_1,\ldots,G_g\}$ of $\{1,\ldots,p\}$ into groups, denote the cardinality of a group $G_l$ by $|G_l|$, and write $x_i=(x_{i(1)}^T,x_{i(2)}^T,\ldots,x_{i(g)}^T)^T$, $\beta^0=((\beta^0_{(1)})^T,(\beta^0_{(2)})^T,\ldots,(\beta^0_{(g)})^T)^T$ with $x_{i(l)}\in\mathbb{R}^{|G_l|}$ and $\beta^0_{(l)}\in\mathbb{R}^{|G_l|}$. We wish to achieve sparsity at the level of groups, i.e., to estimate $\beta^0$ such that $\beta^0_{(l)}=0$ for some of the groups $l\in\{1,\ldots,g\}$. For high-dimensional logistic regression models, the Group Lasso provides an estimator of $\beta^0$:
$$\hat\beta^{GL}:=\arg\min_{\beta\in\mathbb{R}^p}\left\{\frac{1}{n}\sum_{i=1}^n\left[\log\left(1+\exp(x_i^T\beta)\right)-(x_i^T\beta)y_i\right]+\lambda\sum_{l=1}^g\omega_l\|\beta_{(l)}\|_2\right\}, \qquad (2)$$
where $\lambda\ge 0$ is a tuning parameter which controls the amount of penalization, $\omega_l=\sqrt{|G_l|}$ is used to normalize across groups of different sizes, and $\|\cdot\|_2$ denotes the $L_2$ norm of a vector. Meier et al. [12] established the asymptotic consistency of the Group Lasso for logistic regression, Wang et al. [13] analyzed its rates of convergence, Blazere et al. [14] stated oracle inequalities, and Kwemou [15] studied non-asymptotic oracle inequalities. Other important references are the works of Nowakowski et al. [16] and Zhang et al. [17]. In terms of computational algorithms, Meier et al. [12] applied the block coordinate descent algorithm of Tseng [18] to the Group Lasso for logistic regression, and Breheny and Huang [19] proposed the group descent algorithm. These approaches are sufficiently fast for computing the exact coefficients at the selected values of $\lambda$.
However, it is well known that for the Lasso (and the Group Lasso) in linear regression models, the optimal value of the tuning parameter $\lambda$ depends on the unknown homogeneous noise variance $\sigma^2$, whose accurate estimation is generally difficult when $p\gg n$. To solve this problem, Belloni et al. [20] proposed the square-root Lasso, which removes this unknown parameter by using a weighted score function (i.e., the gradient of the square root of the empirical loss function). Bunea et al. [21] extended the ideas behind the square-root Lasso to group selection and developed the Group square-root Lasso. Inspired by the Group square-root Lasso, we propose a new penalized weighted score function method, which replaces the original score function (i.e., the gradient of the negative log-likelihood) with a weighted score function (Huang and Wang [22]), to study sparse logistic regression with the Group Lasso penalty. We obtain convergence rates for the estimation error and provide a direct choice for the tuning parameter. Moreover, we propose a modified block coordinate descent algorithm based on the weighted score function, which greatly reduces the computational cost.
The framework of this paper is as follows. In Section 2, we apply the idea behind the Group square-root Lasso to sparse logistic models and develop our method, the penalized weighted score function method. In Section 3, we establish non-asymptotic bounds for the new estimator and give a direct choice of the tuning parameter. In Section 4, we provide the weighted block coordinate descent algorithm. In Section 5, numerical simulations show the advantages of our algorithm in terms of selection performance and computational time. In Section 6, we present real data on musk molecules and on gene expression to support the simulation and theoretical results. Section 7 concludes our work. All proofs are given in the Appendix.
Notation: Throughout the paper, denote by $I=\{l:\|\beta^0_{(l)}\|_2\neq 0\}$ the set of non-zero groups of $\beta^0$ and let $s=\mathrm{card}(I)$ be the number of non-zero groups of $\beta^0$. For any $\delta\in\mathbb{R}^p$ and subset $I$, we denote by $\delta_I$ the vector with the same coordinates as $\delta$ on $I$ and zero coordinates on the complement $I^C$ of $I$. For a function $f(\beta)\in\mathbb{R}$, we denote by $\nabla f(\beta)\in\mathbb{R}^p$ its gradient and by $H(\beta)\in\mathbb{R}^{p\times p}$ its Hessian matrix at $\beta\in\mathbb{R}^p$. Define the $L_q$ norm of any vector $v$ as $\|v\|_q=(\sum_i|v_i|^q)^{1/q}$ and, for any vector $\beta\in\mathbb{R}^p$ with group structure, the block norm of $\beta$ for any $0\le q\le\infty$ as $\|\beta\|_{2,q}=(\sum_{l=1}^g\|\beta_{(l)}\|_2^q)^{1/q}$. In particular, $\|\beta\|_{2,0}=\sum_{l=1}^g\mathbf{1}\{\beta_{(l)}\neq 0\}$ indicates the number of non-zero groups, $\|\beta\|_{2,1}=\sum_{l=1}^g\|\beta_{(l)}\|_2$ is the norm appearing in the Group Lasso penalty, $\|\beta\|_{2,2}=\|\beta\|_2$ is the $L_2$ norm, and $\|\beta\|_{2,\infty}=\max_l\|\beta_{(l)}\|_2$ is the largest $L_2$ norm over all groups. Moreover, $\Phi(x)$ denotes the cumulative distribution function of the standard normal distribution.
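For concreteness, the block norms defined above can be computed with the following small R helper; it is an illustrative sketch only (the argument `group` is a length-$p$ vector of group labels, a convention assumed here rather than taken from the paper).

```r
# Block norm ||beta||_{2,q} for a grouped coefficient vector (q = 1 gives the
# Group Lasso penalty term; q = Inf gives the largest group-wise L2 norm).
block_norm <- function(beta, group, q = 1) {
  g2 <- tapply(beta, group, function(b) sqrt(sum(b^2)))   # ||beta_(l)||_2
  if (is.infinite(q)) max(g2) else if (q == 0) sum(g2 != 0) else sum(g2^q)^(1 / q)
}
# Example: block_norm(c(1, 1, 0, 0), c(1, 1, 2, 2), q = 1)  # equals sqrt(2)
```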

2. Penalized weighted score function method

Recall model (1); the loss function (i.e., the negative log-likelihood) is given by
$$\ell(\beta)=\frac{1}{n}\sum_{i=1}^n\left[\log\left(1+\exp(x_i^T\beta)\right)-(x_i^T\beta)y_i\right],$$
leading to the score function
$$\nabla\ell(\beta)=\frac{1}{n}\sum_{i=1}^n\left(G(x_i^T\beta)-y_i\right)x_i.$$
Note that the solution $\hat\beta^{GL}$ of problem (2) satisfies the KKT conditions
$$\begin{cases}\dfrac{1}{n}\displaystyle\sum_{i=1}^n\left(G(x_i^T\hat\beta^{GL})-y_i\right)x_{i(l)}=-\lambda\omega_l\,\hat\beta^{GL}_{(l)}/\|\hat\beta^{GL}_{(l)}\|_2, & \text{if }\hat\beta^{GL}_{(l)}\neq 0,\\[6pt]\left\|\dfrac{1}{n}\displaystyle\sum_{i=1}^n\left(G(x_i^T\hat\beta^{GL})-y_i\right)x_{i(l)}\right\|_2\le\lambda\omega_l, & \text{if }\hat\beta^{GL}_{(l)}=0,\end{cases} \qquad (3)$$
for all $l=1,\ldots,g$. The left-hand side of equation (3) is the score function for logistic regression with group structure, which shows that $\hat\beta^{GL}$ is in fact a penalized score function estimator. To obtain a good estimator, one usually requires that the inequality $\lambda\omega_l\ge c\,\|\nabla\ell(\beta^0)_{(l)}\|_2$ holds with high probability for all $l=1,\ldots,g$ and some constant $c\ge 1$ (Meier et al. [12] and Kwemou [15]). However, the random part $G(x_i^T\beta^0)-y_i$ of $\nabla\ell(\beta^0)$, the score function evaluated at $\beta=\beta^0$, has variance $G(x_i^T\beta^0)(1-G(x_i^T\beta^0))$, which is also the variance of the binary random variable $Y_i\mid X_i=x_i$. Binary noise is therefore not homogeneous, unlike the noise in linear regression models, so a single tuning parameter for all the different coefficients is not a good choice.
We apply the idea of the Group square-root Lasso to solve the above problem of choosing the tuning parameter, and develop our method as follows. Huang and Wang [22] formed a class of root-consistent estimating functions for logistic regression by weighting the score function:
$$\nabla\ell_\psi(\beta)=\frac{1}{n}\sum_{i=1}^n\psi(x_i^T\beta)\left(G(x_i^T\beta)-y_i\right)x_i, \qquad (4)$$
where $\psi(\cdot)$ is a weight function of $x_i^T\beta$. This requires choosing a suitable weight function so that $\nabla\ell_\psi(\beta)$ is integrable in $\beta$, i.e., it is the gradient of some loss function. Then, replacing the score function in equation (3) with the weighted score function, we define a penalized weighted score function estimator $\hat\beta$ as a solution of the following equations:
$$\begin{cases}\dfrac{1}{n}\displaystyle\sum_{i=1}^n\psi(x_i^T\hat\beta)\left(G(x_i^T\hat\beta)-y_i\right)x_{i(l)}=-\lambda\omega_l\,\hat\beta_{(l)}/\|\hat\beta_{(l)}\|_2, & \text{if }\hat\beta_{(l)}\neq 0,\\[6pt]\left\|\dfrac{1}{n}\displaystyle\sum_{i=1}^n\psi(x_i^T\hat\beta)\left(G(x_i^T\hat\beta)-y_i\right)x_{i(l)}\right\|_2\le\lambda\omega_l, & \text{if }\hat\beta_{(l)}=0.\end{cases} \qquad (5)$$
Let $\ell_\psi(\beta)$ denote the loss function whose gradient is the weighted score function (4). Solving Equation (5) is then equivalent to solving the following optimization problem:
$$\hat\beta:=\arg\min_{\beta\in\mathbb{R}^p}\left\{\ell_\psi(\beta)+\lambda\sum_{l=1}^g\omega_l\|\beta_{(l)}\|_2\right\}. \qquad (6)$$
Our method is motivated by the Group square-root Lasso of Bunea et al. [21] for the linear model:
$$\hat\beta^{GSL}:=\arg\min_{\beta\in\mathbb{R}^p}\left\{\frac{\|Y-X\beta\|_2}{\sqrt{n}}+\frac{\lambda}{n}\sum_{l=1}^g\omega_l\|\beta_{(l)}\|_2\right\}, \qquad (7)$$
where $Y\in\mathbb{R}^{n\times 1}$ and $X\in\mathbb{R}^{n\times p}$. When $\|Y-X\hat\beta^{GSL}\|_2$ is non-zero, the Group square-root Lasso estimator $\hat\beta^{GSL}$ satisfies the KKT conditions
$$\begin{cases}\sqrt{n}\displaystyle\sum_{i=1}^n\|Y-X\hat\beta^{GSL}\|_2^{-1}\left(y_i-x_i^T\hat\beta^{GSL}\right)x_{i(l)}=\lambda\omega_l\,\hat\beta^{GSL}_{(l)}/\|\hat\beta^{GSL}_{(l)}\|_2, & \text{if }\hat\beta^{GSL}_{(l)}\neq 0,\\[6pt]\left\|\sqrt{n}\displaystyle\sum_{i=1}^n\|Y-X\hat\beta^{GSL}\|_2^{-1}\left(y_i-x_i^T\hat\beta^{GSL}\right)x_{i(l)}\right\|_2\le\lambda\omega_l, & \text{if }\hat\beta^{GSL}_{(l)}=0.\end{cases}$$
Comparing the KKT conditions of the Group square-root Lasso and the Group Lasso, the Group square-root Lasso introduces the additional weight $\left(\|Y-X\hat\beta^{GSL}\|_2/\sqrt{n}\right)^{-1}$, the inverse of the natural estimate of the homogeneous noise standard deviation. This makes the tuning parameter $\lambda$ independent of the homogeneous noise variance, so the Group square-root Lasso estimates the grouped coefficients and allows a direct choice of the tuning parameter simultaneously.
A drawback of the Group square-root Lasso is that this direct choice of the tuning parameter is available only for linear regression models; for logistic regression models there is no such direct way to select the tuning parameter. The penalized weighted score function method implements this scheme for logistic regression models. We discuss this in more detail in the next section.

3. Statistical properties

In this section, we establish non-asymptotic oracle inequalities for the penalized weighted score function estimator and give a direct choice of the tuning parameter.
Throughout this paper, we consider a fixed design setting (i.e., $x_1,\ldots,x_n$ are considered deterministic), and we make the following assumptions:
(A1)
There exists a positive constant $M<\infty$ such that $\max_{1\le i\le n}\max_{1\le l\le g}\sqrt{\sum_{j\in G_l}x_{ij}^2}\le M$.
(A2)
As $n,p\to\infty$, $p=o(e^{n^{1/3}})$, and $p/\epsilon>2$ for $\epsilon\in(0,1)$.
(A3)
There exists $N(\beta^0)>0$ such that
$$N^2(\beta^0)=\max_{1\le j\le p}\frac{1}{n}\sum_{1\le i\le n}\psi^2(x_i^T\beta^0)G(x_i^T\beta^0)\left(1-G(x_i^T\beta^0)\right)x_{ij}^2.$$
(A4)
The loss $\ell_\psi(\cdot):\mathbb{R}^p\to\mathbb{R}$ is a convex, three times differentiable function such that for all $u,v\in\mathbb{R}^p$ the function $g(t)=\ell_\psi(u+tv)$ satisfies, for all $t\in\mathbb{R}$, $|g'''(t)|\le\tau_0\max_{1\le i\le n}|x_i^Tv|\,g''(t)$, where $\tau_0>0$ is a constant.
Assumption (A1) bounds the predictors, which is natural because real data are typically bounded. Assumption (A2) controls the growth of the dimension and the lower bound on the probability with which the non-asymptotic results hold. Assumption (A3) ensures that the variance of each component of $\nabla\ell_\psi(\beta^0)$ is bounded when a suitable weight function $\psi(\cdot)$ is chosen. Assumption (A4) is similar to Proposition 1 of Bach [23]; under it we can obtain lower and upper Taylor-type expansions of the loss function $\ell_\psi(\cdot)$, from which we derive the non-asymptotic results.
Moreover, a restricted eigenvalue condition plays a key role in deriving oracle inequalities. For the Group Lasso in high-dimensional linear regression models, the oracle property under a group restricted eigenvalue condition was discussed by Hu et al. [24] and extended to logistic regression models by Zhang et al. [17]. To establish the desired group restricted eigenvalue condition, we introduce the following group restricted set
$$\Theta_\alpha:=\left\{\vartheta\in\mathbb{R}^p:\|W_{I^C}\vartheta_{(I^C)}\|_{2,1}\le\alpha\|W_I\vartheta_{(I)}\|_{2,1}\right\},\qquad\alpha>0, \qquad (8)$$
which is a grouped version of the restricted set $\theta_\alpha:=\{\vartheta\in\mathbb{R}^p:\|\vartheta_{I^C}\|_1\le\alpha\|\vartheta_I\|_1\}$ of Bickel et al. [25]. Here $W_I$ is the diagonal matrix whose $j$th diagonal element equals $\omega_l$ if $j\in G_l$ for some $l\in I$, and 0 otherwise. Based on the group restricted set (8), we propose the following group restricted eigenvalue condition:
(A5)
For some integer $s$ with $1<s<g$ and a positive number $\alpha$, the following condition holds:
$$\mu(s,\alpha)=\min_{\substack{I\subseteq\{1,\ldots,g\}\\|I|\le s}}\ \min_{\substack{\delta\neq 0\\\delta\in\Theta_\alpha}}\frac{\left(\delta^TH_\psi(\beta^0)\delta\right)^{1/2}}{\|W_I\delta_{(I)}\|_{2,2}}>0, \qquad (9)$$
where $H_\psi(\beta^0)$ is the Hessian matrix of $\ell_\psi$ at $\beta^0$. Compared with the restricted eigenvalue condition of Bickel et al. [25] for linear regression models, the group restricted eigenvalue condition for logistic regression replaces the $L_2$ norm by the block norm in the denominator and the Gram matrix by the Hessian matrix $H_\psi(\beta^0)$ in the numerator of (9).
Remark 1.
The Hessian matrix of $\ell_\psi(\beta)$ is given by
$$H_\psi(\beta)=\frac{1}{n}\sum_{i=1}^n\left[\psi'(x_i^T\beta)\left(\frac{\exp(x_i^T\beta)}{1+\exp(x_i^T\beta)}-y_i\right)+\psi(x_i^T\beta)\frac{\exp(x_i^T\beta)}{\left(1+\exp(x_i^T\beta)\right)^2}\right]x_ix_i^T=\frac{1}{n}\sum_{i=1}^n\left[\psi'(x_i^T\beta)\left(G(x_i^T\beta)-y_i\right)+\psi(x_i^T\beta)G(x_i^T\beta)\left(1-G(x_i^T\beta)\right)\right]x_ix_i^T.$$
Bach [23] has already shown that the Hessian matrix of $\ell(\beta)$ is positive definite on certain restricted sets. If the chosen weight function $\psi(x_i^T\beta)$ makes the loss function $\ell_\psi(\beta)$ satisfy assumption (A3), then $H_\psi(\beta)$ is also positive definite on the group restricted set (8). Such weight functions do exist and are described below. In addition, the group restricted eigenvalue condition effectively controls the estimation error, yielding estimates with good statistical properties and reliable results.
Theorem 1.
Assume that (A1), (A2), (A3) and (A4) are satisfied. Let $\lambda<\frac{k(1-z)\mu(s,\alpha)}{4\tau_0Ms}$ with $z\in(0,1)$ and $k<\min_{1\le l\le g}\omega_l$, and let the tuning parameter be chosen such that
$$\lambda\omega_l=\frac{N(\beta^0)}{z}\sqrt{\frac{|G_l|}{n}}\,\Phi^{-1}\left(1-\frac{\epsilon}{2p}\right). \qquad (10)$$
Then, with probability at least $1-\epsilon(1+o(1))$, we have the following:
1. The estimation error lies in the group restricted set: $\hat\beta-\beta^0\in\Theta_\alpha$ with $\alpha=\frac{1+z}{1-z}$.
2. Under the group restricted eigenvalue condition (A5), the block norm estimation errors satisfy
$$\|\hat\beta-\beta^0\|_{2,1}\le\frac{2k\lambda s}{\left(\min_{1\le l\le g}\omega_l-k\right)(1-z)\mu(s,\alpha)}, \qquad (11)$$
$$\|\hat\beta-\beta^0\|_{2,q}^q\le\left(\frac{2k\lambda s}{\left(\min_{1\le l\le g}\omega_l-k\right)(1-z)\mu(s,\alpha)}\right)^q \quad\text{for all }1<q<2, \qquad (12)$$
respectively, and the error of the loss function $\ell_\psi$ satisfies
$$\left|\ell_\psi(\hat\beta)-\ell_\psi(\beta^0)\right|\le\frac{2\min_{1\le l\le g}\omega_l\,\lambda^2s}{\left(\min_{1\le l\le g}\omega_l-k\right)(1-z)\mu(s,\alpha)}. \qquad (13)$$
The non-asymptotic oracle inequalities for the true coefficient $\beta^0$ are provided in (11) and (12). Unfortunately, the quantity $N(\beta^0)$ depends on the true coefficient $\beta^0$, so the choice of $\lambda$ in (10) also depends on $\beta^0$. In the next theorem we choose a suitable weight function $\psi(x_i^T\beta^0)$ that removes this dependence.
Theorem 2.
Choose the weight function of the form
$$\psi(x_i^T\beta^0)=\frac{1}{2}\left[\exp\left(\frac{x_i^T\beta^0}{2}\right)+\exp\left(-\frac{x_i^T\beta^0}{2}\right)\right]. \qquad (14)$$
Under assumptions (A2) and (A3), choose the tuning parameter as
$$\lambda\omega_l=\frac{\sqrt{|G_l|}\,\max_{1\le j\le p}\left(\sum_{i=1}^nx_{ij}^2\right)^{1/2}}{2nz}\,\Phi^{-1}\left(1-\frac{\epsilon}{2p}\right). \qquad (15)$$
Then, under the assumptions of Theorem 1, inequalities (11), (12) and (13) hold with probability at least $1-\epsilon(1+o(1))$.
Yin [10] discusses the order of $\Phi^{-1}(1-\frac{\epsilon}{2p})$ appearing in (15), proving that $\Phi^{-1}(1-\frac{\epsilon}{2p})\le O\left(\sqrt{\log(2p/\epsilon)}\right)$. When $|G_l|=1$ for $l=1,2,\ldots,g$, our estimator $\hat\beta$ reduces to a Lasso-type estimator whose theoretical properties have been well studied in Yin [10].
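To make the direct choice (15) concrete, the following minimal R sketch computes the group-level penalties $\lambda\omega_l$ directly from the design matrix. The constants `z` and `eps` are the $z\in(0,1)$ and $\epsilon$ of Theorem 1; the theorems do not prescribe particular values, so the defaults below are illustrative assumptions only.

```r
# Minimal sketch: direct tuning parameters (15) for the weighted Group Lasso.
# X: n x p design matrix; group: length-p vector of group labels;
# z and eps are user-chosen constants in (0,1) (illustrative defaults).
direct_lambda_omega <- function(X, group, z = 0.5, eps = 0.05) {
  n <- nrow(X); p <- ncol(X)
  max_col_norm <- max(sqrt(colSums(X^2)))            # max_j (sum_i x_ij^2)^{1/2}
  gl <- sort(unique(group))
  group_sizes <- sapply(gl, function(l) sum(group == l))   # |G_l|
  sqrt(group_sizes) * max_col_norm / (2 * n * z) *
    qnorm(1 - eps / (2 * p))                         # Phi^{-1}(1 - eps/(2p))
}

# Example with a toy design of 10 groups of size 3:
# X <- matrix(rnorm(200 * 30), 200, 30); group <- rep(1:10, each = 3)
# direct_lambda_omega(X, group)
```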
Remark 2.
If $\psi(x_i^T\beta^0)$ is given as in Theorem 2, the loss function, the weighted score function, and the Hessian matrix are, respectively,
$$\ell_\psi(\beta^0)=\frac{1}{n}\sum_{i=1}^n\left[(1-y_i)\exp\left(\frac{x_i^T\beta^0}{2}\right)+y_i\exp\left(-\frac{x_i^T\beta^0}{2}\right)\right],$$
$$\nabla\ell_\psi(\beta^0)=\frac{1}{2n}\sum_{i=1}^n\left[(1-y_i)\exp\left(\frac{x_i^T\beta^0}{2}\right)-y_i\exp\left(-\frac{x_i^T\beta^0}{2}\right)\right]x_i,$$
$$H_\psi(\beta^0)=\frac{1}{4n}\sum_{i=1}^n\left[(1-y_i)\exp\left(\frac{x_i^T\beta^0}{2}\right)+y_i\exp\left(-\frac{x_i^T\beta^0}{2}\right)\right]x_ix_i^T.$$
Clearly, the Hessian matrix obtained with the weight function of Theorem 2 is positive semi-definite.
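For reference, the three quantities of Remark 2 can be coded directly in R as below. This is a minimal sketch, written for a general $\beta$ (Remark 2 states the formulas at the true $\beta^0$; evaluating them at the current iterate is an assumption used for the algorithm sketch in Section 4).

```r
# Minimal sketch of the weighted loss, score and Hessian from Remark 2,
# written for a general beta (the remark states them at the true beta^0).
weighted_loss <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  mean((1 - y) * exp(eta / 2) + y * exp(-eta / 2))
}
weighted_score <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  w <- (1 - y) * exp(eta / 2) - y * exp(-eta / 2)
  drop(crossprod(X, w)) / (2 * length(y))            # (1/2n) sum w_i x_i
}
weighted_hessian <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  w <- (1 - y) * exp(eta / 2) + y * exp(-eta / 2)
  crossprod(X * w, X) / (4 * length(y))              # (1/4n) sum w_i x_i x_i^T
}
```

One can check numerically that `weighted_score` matches the gradient of `weighted_loss`, which is the defining property of the weighted score function (4).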

4. Weighted block coordinate descent algorithm

We apply the techniques of the block coordinate descent algorithm to the penalized weighted score function. Choose the weight function of the form (14) and set $\beta=\hat\beta+\zeta$; a second-order Taylor expansion of the loss function $\ell_\psi(\beta)$ in equation (6) then gives the objective
$$D(\hat\beta+\zeta)=\ell_\psi(\hat\beta)+\zeta^T\nabla\ell_\psi(\hat\beta)+\frac{1}{2}\zeta^TH_\psi(\hat\beta)\zeta+\lambda\|W(\hat\beta+\zeta)\|_{2,1}. \qquad (16)$$
Now consider minimizing $D(\hat\beta+\zeta)$ with respect to the $l$th group of penalized parameters; the stationarity condition is
$$\nabla\ell_\psi(\hat\beta)_{(l)}+H_\psi(\hat\beta)_{(l)}\zeta_{(l)}+\lambda\omega_l\frac{\hat\beta_{(l)}+\zeta_{(l)}}{\|\hat\beta_{(l)}+\zeta_{(l)}\|_2}=0. \qquad (17)$$
Following Meier et al. [12], we approximate the sub-matrix $H_\psi(\hat\beta)_{(l)}$ by a multiple of the identity, $H_\psi(\hat\beta)_{(l)}=h_\psi(\hat\beta)_{(l)}I_{(l)}$, with $h_\psi(\hat\beta)_{(l)}=\max\{\mathrm{diag}(H_\psi(\hat\beta)_{(l)}),r_0\}$, where $r_0>0$ is a lower bound that ensures convergence. Simplifying equation (17) then gives
$$\left(\frac{\lambda\omega_l}{\|\hat\beta_{(l)}+\zeta_{(l)}\|_2}+h_\psi(\hat\beta)_{(l)}\right)\left(\hat\beta_{(l)}+\zeta_{(l)}\right)=H_\psi(\hat\beta)_{(l)}\hat\beta_{(l)}-\nabla\ell_\psi(\hat\beta)_{(l)}. \qquad (18)$$
This leads to the equivalent equation
$$\frac{\hat\beta_{(l)}+\zeta_{(l)}}{\|\hat\beta_{(l)}+\zeta_{(l)}\|_2}=\frac{H_\psi(\hat\beta)_{(l)}\hat\beta_{(l)}-\nabla\ell_\psi(\hat\beta)_{(l)}}{\|H_\psi(\hat\beta)_{(l)}\hat\beta_{(l)}-\nabla\ell_\psi(\hat\beta)_{(l)}\|_2}.$$
According to equation (15) and Remark 2, we obtain the following update. If $\|H_\psi(\hat\beta)_{(l)}\hat\beta_{(l)}-\nabla\ell_\psi(\hat\beta)_{(l)}\|_2\le\lambda\omega_l$, the value of $\zeta$ at the $k$-th iteration is given by
$$\zeta_{(l)}^{(k)}=-\hat\beta_{(l)}^{(k)},$$
otherwise
$$\zeta_{(l)}^{(k)}=-\frac{1}{h_\psi(\hat\beta^{(k)})_{(l)}}\left[\nabla\ell_\psi(\hat\beta^{(k)})_{(l)}+\frac{x_{(l)}}{\|x_{(l)}\|_2}\cdot\frac{\sqrt{|G_l|}\,\max_{1\le j\le p}\left(\sum_{i=1}^nx_{ij}^2\right)^{1/2}}{2nz}\,\Phi^{-1}\left(1-\frac{\epsilon}{2p}\right)\right].$$
If $\zeta_{(l)}^{(k)}\neq 0$, we use the Armijo rule of Tseng and Yun [26] to select the step size $\sigma^{(k)}$ as follows:
Armijo rule
Finally, the update direction $\zeta^{(k)}$ is combined with the step size to update the parameters:
$$\hat\beta_{(l)}^{(k+1)}=\hat\beta_{(l)}^{(k)}+\sigma^{(k)}\zeta_{(l)}^{(k)}.$$
The weighted block coordinate gradient descent algorithm is summarized in Table 1. In general, selecting the tuning parameter $\lambda$ by cross-validation is computationally demanding. As Table 1 shows, our algorithm eliminates the selection process for the tuning parameter $\lambda\omega_l$: given an initial value $\hat\beta^{(0)}$, we iterate directly until the estimate converges to the desired tolerance.
It is worth noting that we have given a direct choice (15) of $\lambda$ under the specific weight function $\psi(x_i^T\beta^0)$ given by (14), so the weighted block coordinate gradient descent algorithm is computationally faster than working iteratively over a fixed grid of tuning parameters $\lambda$ (see Meier et al. [12]). If other weight functions are chosen, the weighted block coordinate gradient descent algorithm can still be used to solve (6), but the tuning parameter $\lambda$ then depends on the unknown $\beta^0$, and cross-validation can be used to choose it.
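The following is a minimal, self-contained R sketch of one possible implementation of the block-wise update derived above; it is not the authors' exact Table 1. Two simplifications are assumptions of this sketch: the step size is fixed at $\sigma^{(k)}=1$ instead of the Armijo rule, and the non-thresholded group update uses the closed form implied by (18), i.e., a group-wise soft-thresholding of $d_{(l)}=h_\psi(\hat\beta)_{(l)}\hat\beta_{(l)}-\nabla\ell_\psi(\hat\beta)_{(l)}$. It reuses `weighted_score` and `weighted_hessian` from the sketch after Remark 2; `r0`, the tolerance, and the iteration cap are illustrative.

```r
# Minimal sketch of the weighted block coordinate descent of Section 4
# (fixed step size, no Armijo line search; a simplified stand-in for Table 1).
wgrp_bcd <- function(X, y, group, lambda_omega, r0 = 1e-3,
                     tol = 1e-6, max_iter = 200) {
  p <- ncol(X)
  beta <- rep(0, p)
  groups <- split(seq_len(p), group)          # columns of each group, in
                                              # the order sort(unique(group)),
                                              # matching direct_lambda_omega()
  for (it in seq_len(max_iter)) {
    beta_old <- beta
    grad <- weighted_score(beta, X, y)
    H <- weighted_hessian(beta, X, y)
    for (l in seq_along(groups)) {
      idx <- groups[[l]]
      h <- max(diag(H)[idx], r0)              # scalar approximation of H_(l)
      d <- h * beta[idx] - grad[idx]          # d_(l) = h*beta_(l) - grad_(l)
      nd <- sqrt(sum(d^2))
      beta[idx] <- if (nd <= lambda_omega[l]) 0 else
        (1 - lambda_omega[l] / nd) * d / h    # group soft-threshold update
      grad <- weighted_score(beta, X, y)      # refresh gradient; H kept fixed
    }                                         # within the sweep for simplicity
    if (sqrt(sum((beta - beta_old)^2)) < tol) break
  }
  beta
}

# Usage sketch: beta_hat <- wgrp_bcd(X, y, group, direct_lambda_omega(X, group))
```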

5. Simulations

In this section, we use simulated datasets to evaluate the performance of the penalized weighted score function estimator. Meier et al. [12] implement the block coordinate gradient descent algorithm in the R package grplasso. We modify grplasso into a variant we call wgrplasso, which implements the weighted block coordinate gradient descent algorithm. We compare the performance of wgrplasso with the R package grpreg developed by Breheny and Huang [19] and the R package gglasso developed by Yang and Zou [27]. Three aspects of model performance are considered: correctness of variable selection, accuracy of coefficient estimation, and running time of the algorithm. The evaluation indicators are the following (an R helper computing them is sketched after the list):
  • TP: the number of predicted non-zero values in the non-zero coefficient set when determining the model
  • TN: the number of predicted zero values in the zero coefficient set when determining the model
  • FP: the number of predicted non-zero values in the zero coefficient set when determining the model
  • FN: the number of predicted zero values in the non-zero coefficient set when determining the model
  • TPR: the ratio of predicted non-zero values in the non-zero coefficient set when determining the model, which is calculated by the following formulation:
    $TPR=\dfrac{TP}{TP+FN}.$
  • Accur: the ratio of accurate predictions when determining the model, which is calculated by the following formulation:
    $Accur=\dfrac{TP+TN}{TP+TN+FP+FN}.$
  • Time: the running time of the algorithm.
  • BNE: the block norm of the estimation error, which is calculated by the following formulation:
    $BNE=\|\hat\beta-\beta\|_{2,1}.$
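The following minimal R helper computes these indicators from an estimated and a true coefficient vector; the group labels are needed only for the block norm BNE. It is a plain restatement of the definitions above, not code from the paper.

```r
# Selection and estimation metrics used in the simulations.
selection_metrics <- function(beta_hat, beta_true, group) {
  nz_true <- beta_true != 0; nz_hat <- beta_hat != 0
  TP <- sum(nz_hat & nz_true);  FN <- sum(!nz_hat & nz_true)
  FP <- sum(nz_hat & !nz_true); TN <- sum(!nz_hat & !nz_true)
  bne <- sum(tapply(beta_hat - beta_true, group,
                    function(d) sqrt(sum(d^2))))    # block norm ||.||_{2,1}
  c(TP = TP, TN = TN, FP = FP, FN = FN,
    TPR = TP / (TP + FN),
    Accur = (TP + TN) / (TP + TN + FP + FN),
    BNE = bne)
}
```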
The sample size is 200. We set $p=300$, 600, and 900, and generate 100 random datasets to repeat each simulation. We set $\epsilon$ to 0.01 and 0.05 and specify the true coefficient vector of the logistic regression models as
$$\beta=(\underbrace{\underbrace{1,1,1}_{3},\ldots,\underbrace{1,1,1}_{3}}_{30},\underbrace{0,\ldots,0}_{p-30}),$$
i.e., the first 10 groups of size 3 are non-zero and the remaining coefficients are zero. For the log odds $\eta$, we consider the following four models.
(a) In Model I, the observed data $X$ are sampled from a multivariate normal distribution and the log odds $\eta$ is linear; the data are independent between groups but correlated within groups. We set the size of each group to 3 and assume that within each group the data follow $X_i\sim N(0,\Sigma_i)$ with $(\Sigma_i)_{jk}=0.5^{|j-k|}$. The observed data can thus be written as $X\sim N(0,\Sigma)$ with $\Sigma=\mathrm{diag}(\Sigma_1,\ldots,\Sigma_{p/3})$.
(b) In Model II, the observed data $X$ are sums of two uniform variables and the log odds $\eta$ is linear. Assume that $Z_1,\ldots,Z_p$ and $W$ are generated independently from a uniform distribution on $[-1,1]$; the observed data are then defined as $X_i=Z_i+W$.
The log odds $\eta$ for Model I and Model II is defined as
$$\eta=\beta_0+X_1\beta_1+\cdots+X_p\beta_p.$$
(c) In Model III, the observed data $X$ follow a standard multivariate normal distribution and the log odds $\eta$ is additive. Assuming that $X$ follows the $p/3$-dimensional standard normal distribution, the observed data can be written as $X\sim N(0,I_{p/3})$.
(d) In Model IV, the observed data $X$ are sums of two uniform variables and the log odds $\eta$ is additive. Assume that $Z_1,\ldots,Z_{p/3}$ and $W$ are generated independently from a uniform distribution on $[-1,1]$; the observed data are then defined as $X_i=Z_i+W$.
The log odds $\eta$ for Model III and Model IV is defined as
$$\eta=\beta_0+X_1\beta_1+X_1^2\beta_2+X_1^3\beta_3+\cdots+X_{p/3}\beta_{p-2}+X_{p/3}^2\beta_{p-1}+X_{p/3}^3\beta_p.$$
The response variable $Y$ is then generated from the logistic regression model
$$P(Y=1\mid\eta)=\frac{1}{1+\exp(-\eta)}.$$
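As an illustration, the following R sketch generates one dataset from Model I with the coefficient vector above. The intercept $\beta_0$ is set to 0 here because its value is not specified in the text; that choice is an assumption of this sketch.

```r
# Minimal sketch of the Model I data-generating process (linear log odds,
# AR(0.5) correlation within groups of size 3, independent groups).
simulate_model1 <- function(n = 200, p = 300) {
  Sigma_g <- 0.5^abs(outer(1:3, 1:3, "-"))        # within-group covariance
  R <- chol(Sigma_g)
  X <- do.call(cbind, replicate(p / 3,
         matrix(rnorm(n * 3), n, 3) %*% R, simplify = FALSE))
  beta <- c(rep(1, 30), rep(0, p - 30))           # 10 non-zero groups of size 3
  eta <- drop(X %*% beta)                         # intercept assumed to be 0
  y <- rbinom(n, 1, 1 / (1 + exp(-eta)))
  list(X = X, y = y, beta = beta, group = rep(seq_len(p / 3), each = 3))
}
# dat <- simulate_model1()
```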
Table 2 shows the average results of the three algorithms for the linear case, and Figure 1 plots TPR, Accur, Time and BNE for Model I and Model II.
First, in terms of TPR, all three algorithms select the non-zero set very well under the normal distribution assumption. Under the uniform distribution assumption, however, wgrplasso selects the non-zero set more accurately than the other algorithms and is also more stable in terms of variance.
Second, in terms of Accur, the wgrplasso and gglasso algorithms maintain high selection accuracy under the normal distribution assumption compared with grpreg. Accur is also affected by FP, and judging from the variances, grpreg and gglasso do not control FP stably. Under the uniform distribution assumption, wgrplasso keeps FP low, both in the selection results and in the stability of the variance, which makes it perform better than the other algorithms in terms of Accur.
Third, in terms of Time, wgrplasso saves a substantial amount of computation time under both the normal and the uniform distribution assumptions.
Finally, in terms of BNE, under the normal distribution assumption the BNE of wgrplasso and gglasso are similar and smaller than that of grpreg. Under the uniform distribution assumption, the BNE of wgrplasso is smaller than those of gglasso and grpreg, which means that wgrplasso performs better.
Figure 1. Average TPR, Accur, Time and BNE plots for 100 repetitions of the three algorithms in Model I and Model II.
Table 3 gives the results of the three algorithms for the additive case, and Figure 2 plots TPR, Accur, Time and BNE for Models III and IV.
The results show that grpreg and gglasso deteriorate in the additive case in terms of both TPR and Accur, their variances indicate unstable selection, and their computation time and BNE increase. In contrast, wgrplasso obtains results in the additive case similar to those in the linear case and still maintains good selection. In terms of TPR, Accur and BNE, wgrplasso performs better than the other algorithms, and its advantage in Time is even more pronounced.
Figure 2. Average TPR, Accur, Time and BNE plots for 100 repetitions of the three algorithms in Model III and Model IV.

6. Real data

In this section, we apply the proposed estimator to two real data sets. The first comes from the molecular shape and conformation of musk molecules. The second comes from histologically normal epithelial cells of breast cancer patients and of cancer-free prophylactic mastectomy patients. As in the previous section, we set $\epsilon$ to 0.01 and 0.05. In Section 6.1 we compare the prediction accuracy, the number of variables selected, and the computation time of the algorithms used in the simulations above; in Section 6.2 we compare the prediction error and the computation time.

6.1. Studies on the molecular structure of musk

The R package kernlab contains the dataset musk, describing the molecular shape and conformation of musk molecules. The data form a data frame of 476 observations on 167 variables. The first 162 variables are distance features of rays measured relative to an origin placed along each ray; any experiment with the data should treat these features as being on an arbitrary continuous scale. Variable 163 is the distance of the oxygen atom to a designated point in 3-space, and variables 164, 165 and 166 are the X-, Y- and Z-displacements from that point. Variable 167 is the class label: 0 means non-musk and 1 means musk.
We use 3/4 of the data for training, perform a third-order B-spline basis expansion on the training data, and then apply the wgrplasso, grpreg, gglasso, and glmnet algorithms to the expanded training data. The remaining 1/4 of the data is used as a test set: the estimated coefficients are used to predict the test data, and we compare the prediction accuracy, model size, and computation time of the four algorithms. Table 4 gives the results of 100 repetitions.
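The grouping induced by the spline expansion can be set up as in the following R sketch, which expands each predictor with a cubic B-spline basis and assigns all basis columns of the same original variable to one group. Using three basis functions per variable (degree 3, no interior knots) is an assumption of this sketch; the text only specifies a third-order B-spline expansion.

```r
# Minimal sketch: cubic B-spline expansion with one group per original variable.
library(splines)
expand_bspline <- function(X, degree = 3) {
  blocks <- lapply(seq_len(ncol(X)), function(j) bs(X[, j], degree = degree))
  list(X = do.call(cbind, blocks),
       group = rep(seq_len(ncol(X)), sapply(blocks, ncol)))
}
# Usage sketch, with Xraw the 476 x 166 predictor matrix from the musk data
# described above and y the 0/1 label taken from variable 167:
# expd <- expand_bspline(Xraw)   # expd$X and expd$group feed the grouped fits
```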
The results show that wgrplasso has the highest prediction accuracy of the four algorithms, indicating that it identifies the target class more accurately in the musk classification task, and it also has the shortest computation time without sacrificing accuracy. This makes wgrplasso the preferred algorithm for the musk classification problem.

6.2. Gene expression studies in epithelial cells of breast cancer patients

We obtained microarray data on histologically normal epithelial cells from the NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) under accession GSE20437. The dataset consists of 42 samples with 22,283 variables: microarray gene expression measurements of histologically normal epithelium (NlEpi) from 18 breast cancer patients (HN), 18 women undergoing breast reduction (RM), and 6 cancer-free prophylactic mastectomy patients (PM) at high risk. Graham et al. [28] have shown that genes are differentially expressed between HN and RM samples; this is discussed more fully in Yang and Zou [27]. Here, we consider the effect of genes on the HN versus RM classification. Following the approach of Yang and Zou [27], we fit a sparse additive logistic regression model with the Group Lasso penalty while selecting the significant additive components.
As in Section 6.1, we train on 3/4 of the data, expand the training data using a third-order B-spline basis, and treat the basis functions of each gene as a group to reflect its role in the additive model, leading to a grouped regression problem with $n=36$ and $p=66{,}849$. All data are standardized so that each original variable has mean zero and unit sample variance. The experiment is repeated 100 times to obtain the prediction error. For one of the experiments we fit the model on all observations and report the genes selected by the wgrplasso, grpreg and gglasso algorithms; these results are listed in Table 5. We observe that the wgrplasso and gglasso algorithms select more variables than the grpreg algorithm, and wgrplasso has smaller prediction error. Summarizing the above results, the proposed penalized weighted score function method can pick more meaningful variables for explanation and prediction.

7. Conclusion

In this work, we propose the penalized weighted score function method for the Group Lasso under logistic regression models. We establish a high-probability upper bound on the parameter estimation error and a direct choice of the tuning parameter under a specific weight function. With this direct choice of the tuning parameter, we modify the block coordinate descent algorithm to reduce computation time and complexity. Simulation results show that our method not only exhibits better statistical accuracy but also computes faster than competing methods. Our approach can be extended to other generalized linear models with sparse group structure, which will be a topic of future research.

Appendix

Lemma 1.
(Bach [23]) Consider a three times differentiable convex function $g:\mathbb{R}\to\mathbb{R}$ such that for all $t\in\mathbb{R}$, $|g'''(t)|\le Sg''(t)$ for some $S\ge 0$. Then, for all $t\ge 0$:
$$\frac{g''(0)}{S^2}\left(e^{-St}+St-1\right)\le g(t)-g(0)-g'(0)t\le\frac{g''(0)}{S^2}\left(e^{St}-St-1\right).$$
Lemma 2.
(Hu et al. [24]) If the inequality $\sum_{i=1}^na_i\le b_0$ holds with $a_i>0$ for all $i$, then $\sum_{i=1}^na_i^q\le b_0^q$ for $1<q<2$.
Proof of Lemma 2:
We first recall the Hölder inequality: let $m,n>1$ with $\frac{1}{m}+\frac{1}{n}=1$, and let $a_i$ and $b_i$ be non-negative real numbers; then
$$\sum_{i=1}^na_ib_i\le\left(\sum_{i=1}^na_i^m\right)^{1/m}\left(\sum_{i=1}^nb_i^n\right)^{1/n}.$$
Applying the Hölder inequality with $m=\frac{1}{2-q}$ and $n=\frac{1}{q-1}$, we have
$$\sum_{i=1}^na_i^q=\sum_{i=1}^na_i^{2-q}a_i^{2q-2}\le\left(\sum_{i=1}^na_i\right)^{2-q}\left(\sum_{i=1}^na_i^2\right)^{q-1}.$$
Because $\sum_{i=1}^na_i^2\le\left(\sum_{i=1}^na_i\right)^2\le b_0^2$, it follows that
$$\sum_{i=1}^na_i^q\le b_0^{2-q}\,b_0^{2(q-1)}=b_0^q;$$
here $m,n>1$, which requires $q\in(1,2)$. □
Lemma 3.
(Sakhanenko [29]) Let $F_1,\ldots,F_n$ be independent random variables with $E(F_i)=0$ and $|F_i|<1$ for all $1\le i\le n$. Denote $B_n^2=\sum_{i=1}^nE(F_i^2)$ and $L_n=\sum_{i=1}^nE(|F_i|^3)/B_n^3$. Then there exists a positive constant $A$ such that for all $x\in\left[1,\frac{1}{A}\min\{B_n,L_n^{-1/3}\}\right]$,
$$P\left(\sum_{i=1}^nF_i>B_nx\right)=\left(1+O(1)x^3L_n\right)\left(1-\Phi(x)\right).$$
Proof of Theorem 1:
Define the event
$$\mathcal{A}=\left\{\max_{1\le l\le g}\frac{1}{\omega_l}\sqrt{\sum_{j\in G_l}\left(\nabla\ell_\psi(\beta^0)_j\right)^2}\le z\lambda\right\}.$$
We first establish the stated results on the event $\mathcal{A}$ and then derive a lower bound for $P(\mathcal{A})$.
Recall $I=\{k:\|\beta^0_{(k)}\|_2\neq 0\}$. Since $\hat\beta$ is the minimizer of $\ell_\psi(\beta)+\lambda\|W\beta\|_{2,1}$, we get
$$\ell_\psi(\hat\beta)+\lambda\|W\hat\beta\|_{2,1}\le\ell_\psi(\beta^0)+\lambda\|W\beta^0\|_{2,1}. \qquad (19)$$
Adding $\lambda\|W(\hat\beta-\beta^0)\|_{2,1}$ to both sides of (19) and rearranging the inequality, we obtain
$$\ell_\psi(\hat\beta)-\ell_\psi(\beta^0)+\lambda\|W(\hat\beta-\beta^0)\|_{2,1}\le\lambda\|W\beta^0\|_{2,1}-\lambda\|W\hat\beta\|_{2,1}+\lambda\|W(\hat\beta-\beta^0)\|_{2,1}\le2\lambda\|W_I(\hat\beta-\beta^0)_{(I)}\|_{2,1}. \qquad (20)$$
Since $\ell_\psi(\beta)$ is convex, applying the Cauchy-Schwarz inequality group by group gives, on the event $\mathcal{A}$,
$$\ell_\psi(\hat\beta)-\ell_\psi(\beta^0)\ge(\hat\beta-\beta^0)^T\nabla\ell_\psi(\beta^0)\ge-\sum_{l=1}^g\frac{1}{\omega_l}\sqrt{\sum_{j\in G_l}\left(\nabla\ell_\psi(\beta^0)_j\right)^2}\cdot\omega_l\|(\hat\beta-\beta^0)_{(l)}\|_2\ge-z\lambda\|W(\hat\beta-\beta^0)\|_{2,1}. \qquad (21)$$
Combining (20) and (21) and defining $\delta=\hat\beta-\beta^0$, we obtain the weighted group restricted inequality
$$\|W_{I^C}\delta_{(I^C)}\|_{2,1}\le\alpha\|W_I\delta_{(I)}\|_{2,1}. \qquad (22)$$
Therefore, on the event $\mathcal{A}$ we have $\hat\beta-\beta^0\in\Theta_\alpha$ and $\mu(s,\alpha)>0$ for $\alpha=\frac{1+z}{1-z}$.
Next, since $\ell_\psi$ is three times differentiable, define the function $g(t)=\ell_\psi(\beta^0+t\delta)$. Applying the Cauchy-Schwarz inequality and assumption (A4), we have
$$|g'''(t)|\le\tau_0\max_{1\le i\le n}|x_i^T\delta|\,g''(t)\le\tau_0\max_{1\le i\le n}\sum_{l=1}^g\frac{\sqrt{\sum_{j\in G_l}x_{ij}^2}}{\omega_l}\,\omega_l\|\delta_{(l)}\|_2\,g''(t)\le\tau_0\max_{1\le i\le n}\max_{1\le l\le g}\frac{\sqrt{\sum_{j\in G_l}x_{ij}^2}}{\omega_l}\,\|W\delta\|_{2,1}\,g''(t)\le\frac{\tau_0M}{\min_{1\le l\le g}\omega_l}(\alpha+1)\sqrt{s}\,\|W_I\delta_{(I)}\|_{2,2}\,g''(t).$$
Set $\bar M=\tau_0(\alpha+1)\sqrt{s}\,M/\min_{1\le l\le g}\omega_l$; since the $\omega_l$ are fixed positive constants, $\bar M$ is bounded, and $|g'''(t)|\le\bar M\|W_I\delta_{(I)}\|_{2,2}\,g''(t)$. By Lemma 1, we have
$$\ell_\psi(\hat\beta)-\ell_\psi(\beta^0)\ge\delta^T\nabla\ell_\psi(\beta^0)+\frac{\delta^TH_\psi(\beta^0)\delta}{\bar M^2\|W_I\delta_{(I)}\|_{2,2}^2}\left(e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}}+\bar M\|W_I\delta_{(I)}\|_{2,2}-1\right).$$
Combining (21) and (22), we have the following result:
$$-z\lambda\|W\delta\|_{2,1}+\frac{\delta^TH_\psi(\beta^0)\delta}{\bar M^2\|W_I\delta_{(I)}\|_{2,2}^2}\left(e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}}+\bar M\|W_I\delta_{(I)}\|_{2,2}-1\right)\le\lambda\|W_I\delta_{(I)}\|_{2,1}-\lambda\|W_{I^C}\delta_{(I^C)}\|_{2,1}. \qquad (23)$$
Furthermore, combined with the group restricted eigenvalue condition, we obtain
$$\frac{\mu(s,\alpha)}{\bar M^2}\left(e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}}+\bar M\|W_I\delta_{(I)}\|_{2,2}-1\right)+(1-z)\lambda\|W\delta\|_{2,1}\le2\lambda\sqrt{s}\,\|W_I\delta_{(I)}\|_{2,2}.$$
This implies that
$$e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}}+\bar M\|W_I\delta_{(I)}\|_{2,2}-1\le\frac{2\lambda\sqrt{s}\,\bar M^2}{\mu(s,\alpha)}\|W_I\delta_{(I)}\|_{2,2}. \qquad (24)$$
In fact, for all $t\in[0,1)$ we have
$$\exp\left(-\frac{2t}{1-t}\right)+2t-1\ge0.$$
Taking $t=\bar M\|W_I\delta_{(I)}\|_{2,2}/\left(2+\bar M\|W_I\delta_{(I)}\|_{2,2}\right)$, which satisfies the above condition, we obtain
$$e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}}+\bar M\|W_I\delta_{(I)}\|_{2,2}-1\ge\frac{\bar M^2\|W_I\delta_{(I)}\|_{2,2}^2}{2+\bar M\|W_I\delta_{(I)}\|_{2,2}}. \qquad (25)$$
Combining (24) and (25), we have
$$\frac{\|W_I\delta_{(I)}\|_{2,2}}{2+\bar M\|W_I\delta_{(I)}\|_{2,2}}\le\frac{2\lambda\sqrt{s}}{\mu(s,\alpha)}.$$
Based on the group restricted eigenvalue condition, choosing $\lambda\le\frac{k(1-z)\mu(s,\alpha)}{8\tau_0sM}$ for a positive constant $k<\min_{1\le l\le g}\omega_l$ and substituting into the above inequality gives
$$\bar M\|W_I\delta_{(I)}\|_{2,2}\le\frac{2k}{\min_{1\le l\le g}\omega_l-k}.$$
Substituting this into (25), we have
$$e^{-\bar M\|W_I\delta_{(I)}\|_{2,2}}+\bar M\|W_I\delta_{(I)}\|_{2,2}-1\ge\frac{\min_{1\le l\le g}\omega_l-k}{2\min_{1\le l\le g}\omega_l}\,\bar M^2\|W_I\delta_{(I)}\|_{2,2}^2. \qquad (26)$$
Combining (23) and (26), and using the Cauchy-Schwarz inequality, we have
$$\frac{\min_{1\le l\le g}\omega_l-k}{2\min_{1\le l\le g}\omega_l}\,\mu(s,\alpha)\|W_I\delta_{(I)}\|_{2,2}^2+(1-z)\lambda\|W\delta\|_{2,1}\le2\lambda\|W_I\delta_{(I)}\|_{2,1}\le2\lambda\sqrt{s}\,\|W_I\delta_{(I)}\|_{2,2}\le a\lambda^2s+\frac{1}{a}\|W_I\delta_{(I)}\|_{2,2}^2.$$
Taking $a=\frac{2\min_{1\le l\le g}\omega_l}{\left(\min_{1\le l\le g}\omega_l-k\right)\mu(s,\alpha)}$, we obtain, on the event $\mathcal{A}$,
$$\|W\delta\|_{2,1}\le\frac{2\min_{1\le l\le g}\omega_l\,\lambda s}{\left(\min_{1\le l\le g}\omega_l-k\right)(1-z)\mu(s,\alpha)},$$
which means that
$$\|\delta\|_{2,1}\le\frac{2\lambda s}{\left(\min_{1\le l\le g}\omega_l-k\right)(1-z)\mu(s,\alpha)}.$$
And equation (12) follows from (11) by applying Lemma 2.
Furthermore, by (20) and (21), we obtain
$$\left|\ell_\psi(\hat\beta)-\ell_\psi(\beta^0)\right|\le\lambda\|W\delta\|_{2,1}\le\frac{2\min_{1\le l\le g}\omega_l\,\lambda^2s}{\left(\min_{1\le l\le g}\omega_l-k\right)(1-z)\mu(s,\alpha)}.$$
Now we bound the probability of the event $\mathcal{A}$:
$$P(\mathcal{A}^c)=P\left(\max_{1\le l\le g}\frac{1}{\omega_l}\sqrt{\sum_{j\in G_l}\left(\nabla\ell_\psi(\beta^0)_j\right)^2}>z\lambda\right)\le P\left(\max_{1\le l\le g}\max_{j\in G_l}\frac{|G_l|\left(\nabla\ell_\psi(\beta^0)_j\right)^2}{\omega_l^2}>(z\lambda)^2\right)\le P\left(\max_{1\le j\le p}\left|\nabla\ell_\psi(\beta^0)_j\right|>\frac{z\lambda\omega_l}{\sqrt{|G_l|}}\right).$$
Take $\eta=\Phi^{-1}\left(1-\frac{\epsilon}{2p}\right)$ and $\lambda\omega_l=\frac{N(\beta^0)}{z}\sqrt{\frac{|G_l|}{n}}\,\eta$; it follows that
$$P(\mathcal{A}^c)\le p\max_{1\le j\le p}P\left(\left|\nabla\ell_\psi(\beta^0)_j\right|>\frac{z\lambda\omega_l}{\sqrt{|G_l|}}\right)\le p\max_{1\le j\le p}P\left(\left|\frac{1}{n}\sum_{i=1}^n\psi(x_i^T\beta^0)\left[G(x_i^T\beta^0)-Y_i\right]x_{ij}\right|>\frac{z\lambda\omega_l}{\sqrt{|G_l|}}\right)=p\max_{1\le j\le p}P\left(\left|\sum_{i=1}^n\kappa_{ij}\right|>\sqrt{n}\,N(\beta^0)\eta\right),$$
where $\kappa_{ij}=\psi(x_i^T\beta^0)\left[G(x_i^T\beta^0)-Y_i\right]x_{ij}$. Furthermore, under the assumptions we obtain
$$E(\kappa_{ij})=\psi(x_i^T\beta^0)\left[G(x_i^T\beta^0)-E(Y_i)\right]x_{ij}=0,\qquad E(\kappa_{ij}^2)=\mathrm{Var}(\kappa_{ij})=\psi^2(x_i^T\beta^0)G(x_i^T\beta^0)\left(1-G(x_i^T\beta^0)\right)x_{ij}^2,\qquad\frac{1}{n}\sum_{i=1}^nE(\kappa_{ij}^2)\le N^2(\beta^0),$$
and
$$|\kappa_{ij}|\le\psi(x_i^T\beta^0)\left|G(x_i^T\beta^0)-Y_i\right|\left(\max_{i,j}|x_{ij}|\right)\le MR,$$
with the positive constant $R=\max_{1\le i\le n}\psi(x_i^T\beta^0)$ and $0\le G(x_i^T\beta^0)\le1$. Set $F_{ij}=\kappa_{ij}/(MR)$, so that $|F_{ij}|\le1$ and $E(F_{ij})=0$. Then
$$B_{nj}^2=\sum_{i=1}^nE(F_{ij}^2)=\sum_{i=1}^nE(\kappa_{ij}^2)/(MR)^2\le nN^2(\beta^0)/(MR)^2,\qquad L_{nj}=\sum_{i=1}^nE(|F_{ij}|^3)/B_{nj}^3\le\sum_{i=1}^nE(|F_{ij}|^2)/B_{nj}^3=\frac{1}{B_{nj}}.$$
Then $B_{nj}=O(\sqrt{n})$ and $L_{nj}=O(1/\sqrt{n})$. By Lemma 3 we have
$$P\left(\left|\sum_{i=1}^n\kappa_{ij}\right|>\sqrt{n}\,N(\beta^0)\eta\right)=P\left(\left|\sum_{i=1}^nF_{ij}\right|>\frac{\sqrt{n}\,N(\beta^0)}{MR}\eta\right)\le P\left(\left|\sum_{i=1}^nF_{ij}\right|>B_{nj}\eta\right)=2\left(1+O(1)\eta^3L_{nj}\right)\left(1-\Phi(\eta)\right)=\frac{\epsilon}{p}\left(1+O(\eta^3/\sqrt{n})\right).$$
Notice that for any $\eta>0$ we have $1-\Phi(\eta)\le\phi(\eta)/\eta$, where $\phi$ is the standard normal density; then
$$\frac{\epsilon}{2p}=1-\Phi(\eta)\le\frac{\phi(\eta)}{\eta}=\frac{\exp(-\eta^2/2)}{\sqrt{2\pi}\,\eta}.$$
By assumption (A2), $p/\epsilon>2$, which implies $\eta>\Phi^{-1}(3/4)>1/\sqrt{2\pi}$, and so
$$\frac{\epsilon}{2p}\le\frac{\exp(-\eta^2/2)}{\sqrt{2\pi}\,\eta}<\exp\left(-\frac{\eta^2}{2}\right).$$
Hence we get
$$\eta<\sqrt{2\log\frac{2p}{\epsilon}}.$$
As $n,p\to\infty$ with $p=o(e^{n^{1/3}})$, the factor $\eta^3/\sqrt{n}\to0$, and therefore
$$P(\mathcal{A}^c)\le\epsilon(1+o(1)),$$
which completes the proof of Theorem 1. □
Proof of Theorem 2:
We only need to show that, under the logistic model, the weight function of the form (14) yields a loss $\ell_\psi$ satisfying assumption (A4).
Denote $g(t)=\ell_\psi(u+tv;X,Y)$ for $u,v\in\mathbb{R}^p$. We have
$$g'(t)=\frac{1}{2n}\sum_{i=1}^n\left[(1-Y_i)\exp\left(\frac{x_i^Tu+tx_i^Tv}{2}\right)-Y_i\exp\left(-\frac{x_i^Tu+tx_i^Tv}{2}\right)\right]v^Tx_i,$$
$$g''(t)=\frac{1}{4n}\sum_{i=1}^n\left[(1-Y_i)\exp\left(\frac{x_i^Tu+tx_i^Tv}{2}\right)+Y_i\exp\left(-\frac{x_i^Tu+tx_i^Tv}{2}\right)\right](v^Tx_i)^2,$$
$$g'''(t)=\frac{1}{8n}\sum_{i=1}^n\left[(1-Y_i)\exp\left(\frac{x_i^Tu+tx_i^Tv}{2}\right)-Y_i\exp\left(-\frac{x_i^Tu+tx_i^Tv}{2}\right)\right](v^Tx_i)^3.$$
It is not difficult to see that $g''(t)=|g''(t)|\ge0$; then
$$|g'''(t)|=\frac{1}{8n}\left|\sum_{i=1}^n\left[(1-Y_i)\exp\left(\frac{x_i^Tu+tx_i^Tv}{2}\right)-Y_i\exp\left(-\frac{x_i^Tu+tx_i^Tv}{2}\right)\right](v^Tx_i)^3\right|\le\frac{1}{2}\max_{1\le i\le n}|x_i^Tv|\cdot\frac{1}{4n}\sum_{i=1}^n\left[(1-Y_i)\exp\left(\frac{x_i^Tu+tx_i^Tv}{2}\right)+Y_i\exp\left(-\frac{x_i^Tu+tx_i^Tv}{2}\right)\right](v^Tx_i)^2=\frac{1}{2}\left(\max_{1\le i\le n}|x_i^Tv|\right)g''(t),$$
so (A4) holds with $\tau_0=\frac{1}{2}$,
which completes the proof of Theorem 2. □

Funding

The authors' work was supported by the Educational Commission of Jiangxi Province of China (No. GJJ160927) and the National Natural Science Foundation of China (No. 62266002).

References

  1. Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 1996, 58, 267–288.
  2. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 2001, 96, 1348–1360.
  3. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology 2005, 67, 301–320.
  4. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 2007, 35, 2313–2351.
  5. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 2010, 38, 894–942.
  6. Huang, J.; Ma, S.; Zhang, C.H. The iterated lasso for high-dimensional logistic regression. The University of Iowa, Department of Statistics and Actuarial Sciences 2008, 7.
  7. Bianco, A.M.; Boente, G.; Chebi, G. Penalized robust estimators in sparse logistic regression. TEST 2022, 31, 563–594.
  8. Abramovich, F.; Grinshtein, V. High-dimensional classification by sparse logistic regression. IEEE Transactions on Information Theory 2018, 65, 3068–3079.
  9. Huang, H.; Gao, Y.; Zhang, H.; Li, B. Weighted Lasso estimates for sparse logistic regression: Non-asymptotic properties with measurement errors. Acta Mathematica Scientia 2021, 41, 207–230.
  10. Yin, Z. Variable selection for sparse logistic regression. Metrika 2020, 83, 821–836.
  11. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology 2006, 68, 49–67.
  12. Meier, L.; Van De Geer, S.; Bühlmann, P. The group lasso for logistic regression. Journal of the Royal Statistical Society Series B: Statistical Methodology 2008, 70, 53–71.
  13. Wang, L.; You, Y.; Lian, H. Convergence and sparsity of Lasso and group Lasso in high-dimensional generalized linear models. Statistical Papers 2015, 56, 819–828.
  14. Blazere, M.; Loubes, J.M.; Gamboa, F. Oracle inequalities for a Group Lasso procedure applied to generalized linear models in high dimension. IEEE Transactions on Information Theory 2014, 60, 2303–2318.
  15. Kwemou, M. Non-asymptotic oracle inequalities for the Lasso and group Lasso in high dimensional logistic model. ESAIM: Probability and Statistics 2016, 20, 309–331.
  16. Nowakowski, S.; Pokarowski, P.; Rejchel, W.; Sołtys, A. Improving Group Lasso for high-dimensional categorical data. International Conference on Computational Science; Springer, 2023; pp. 455–470.
  17. Zhang, Y.; Wei, C.; Liu, X. Group logistic regression models with Lp,q regularization. Mathematics 2022, 10, 2227.
  18. Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 2001, 109, 475–494.
  19. Breheny, P.; Huang, J. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing 2015, 25, 173–187.
  20. Belloni, A.; Chernozhukov, V.; Wang, L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 2011, 98, 791–806.
  21. Bunea, F.; Lederer, J.; She, Y. The group square-root lasso: Theoretical properties and fast algorithms. IEEE Transactions on Information Theory 2013, 60, 1313–1325.
  22. Huang, Y.; Wang, C. Consistent functional methods for logistic regression with errors in covariates. Journal of the American Statistical Association 2001, 96, 1469–1482.
  23. Bach, F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics 2010, 4, 384–414.
  24. Hu, Y.; Li, C.; Meng, K.; Qin, J.; Yang, X. Group sparse optimization via lp,q regularization. The Journal of Machine Learning Research 2017, 18, 960–1011.
  25. Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 2009, 37, 1705–1732.
  26. Tseng, P.; Yun, S. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming 2009, 117, 387–423.
  27. Yang, Y.; Zou, H. A fast unified algorithm for solving group-lasso penalized learning problems. Statistics and Computing 2015, 25, 1129–1141.
  28. Graham, K.; de Las Morenas, A.; Tripathi, A.; King, C.; Kavanah, M.; Mendez, J.; Stone, M.; Slama, J.; Miller, M.; Antoine, G.; et al. Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile. British Journal of Cancer 2010, 102, 1284–1293.
  29. Sakhanenko, A. Berry-Esseen type estimates for large deviation probabilities. Siberian Mathematical Journal 1991, 32, 647–656.
Table 1. Weighted block coordinate gradient descent of logistic regression.
Table 2. Average results for 100 repetitions of the three algorithms in Model I and Model II.

Model I

| p | Method | TP | TPR | FP | Accur | Time | BNE |
|---|---|---|---|---|---|---|---|
| 300 | grpreg (λ=min) | 29.97 (0.30) | 0.999 | 91.71 (20.45) | 0.694 | 75.93 | 16.963 (1.17) |
| 300 | gglasso (λ=min) | 29.94 (0.42) | 0.998 | 36.60 (25.76) | 0.878 | 61.59 | 14.275 (0.55) |
| 300 | gglasso (λ=lse) | 29.67 (1.12) | 0.989 | 13.74 (14.25) | 0.953 | 76.11 | 14.933 (0.46) |
| 300 | wgrplasso (ϵ=0.01) | 29.55 (1.08) | 0.985 | 25.80 (8.19) | 0.912 | 4.55 | 15.021 (0.50) |
| 300 | wgrplasso (ϵ=0.05) | 29.67 (0.94) | 0.989 | 25.80 (9.39) | 0.879 | 5.47 | 15.133 (0.56) |
| 600 | grpreg (λ=min) | 29.91 (0.51) | 0.997 | 115.89 (27.43) | 0.807 | 99.54 | 18.136 (1.49) |
| 600 | gglasso (λ=min) | 29.97 (0.30) | 0.999 | 49.56 (36.29) | 0.917 | 93.64 | 14.904 (0.62) |
| 600 | gglasso (λ=lse) | 29.46 (1.43) | 0.982 | 17.25 (18.61) | 0.970 | 99.90 | 15.271 (0.37) |
| 600 | wgrplasso (ϵ=0.01) | 29.40 (1.48) | 0.980 | 40.53 (12.07) | 0.931 | 7.75 | 15.553 (0.52) |
| 600 | wgrplasso (ϵ=0.05) | 29.61 (1.25) | 0.987 | 53.97 (12.85) | 0.909 | 9.42 | 15.829 (0.61) |
| 900 | grpreg (λ=min) | 29.82 (0.72) | 0.994 | 134.37 (32.87) | 0.851 | 120.47 | 18.736 (1.53) |
| 900 | gglasso (λ=min) | 29.85 (0.66) | 0.995 | 59.19 (42.77) | 0.934 | 125.79 | 15.292 (0.57) |
| 900 | gglasso (λ=lse) | 29.37 (1.37) | 0.979 | 25.23 (24.93) | 0.971 | 120.78 | 15.486 (0.41) |
| 900 | wgrplasso (ϵ=0.01) | 29.13 (1.43) | 0.971 | 51.48 (12.94) | 0.942 | 10.06 | 15.907 (0.55) |
| 900 | wgrplasso (ϵ=0.05) | 29.34 (1.32) | 0.978 | 68.55 (14.97) | 0.923 | 13.77 | 16.251 (0.62) |

Model II

| p | Method | TP | TPR | FP | Accur | Time | BNE |
|---|---|---|---|---|---|---|---|
| 300 | grpreg (λ=min) | 16.74 (4.27) | 0.558 | 65.19 (9.30) | 0.739 | 76.81 | 19.851 (0.82) |
| 300 | gglasso (λ=min) | 13.20 (5.22) | 0.440 | 35.70 (11.74) | 0.825 | 130.13 | 17.889 (0.62) |
| 300 | gglasso (λ=lse) | 10.20 (4.78) | 0.340 | 27.69 (11.51) | 0.842 | 77.11 | 17.676 (0.41) |
| 300 | wgrplasso (ϵ=0.01) | 24.57 (2.58) | 0.819 | 6.24 (4.61) | 0.961 | 7.01 | 12.256 (0.54) |
| 300 | wgrplasso (ϵ=0.05) | 24.66 (2.47) | 0.822 | 6.51 (4.75) | 0.961 | 6.95 | 12.241 (0.55) |
| 600 | grpreg (λ=min) | 12.69 (4.28) | 0.423 | 85.35 (12.24) | 0.829 | 114.24 | 20.737 (0.76) |
| 600 | gglasso (λ=min) | 10.62 (4.09) | 0.354 | 49.77 (14.23) | 0.885 | 183.45 | 18.459 (0.68) |
| 600 | gglasso (λ=lse) | 7.80 (4.02) | 0.260 | 37.35 (13.77) | 0.901 | 114.80 | 17.952 (0.44) |
| 600 | wgrplasso (ϵ=0.01) | 24.66 (2.91) | 0.822 | 7.50 (5.40) | 0.979 | 14.17 | 12.323 (0.43) |
| 600 | wgrplasso (ϵ=0.05) | 24.75 (2.81) | 0.825 | 7.71 (5.33) | 0.978 | 15.17 | 12.309 (0.44) |
| 900 | grpreg (λ=min) | 10.17 (4.53) | 0.339 | 95.97 (14.07) | 0.871 | 141.31 | 21.192 (0.78) |
| 900 | gglasso (λ=min) | 8.55 (4.42) | 0.285 | 52.08 (16.18) | 0.918 | 224.54 | 18.582 (0.73) |
| 900 | gglasso (λ=lse) | 6.87 (4.25) | 0.229 | 39.96 (14.36) | 0.930 | 142.08 | 18.038 (0.53) |
| 900 | wgrplasso (ϵ=0.01) | 25.20 (2.70) | 0.840 | 10.77 (6.74) | 0.983 | 22.06 | 12.393 (0.56) |
| 900 | wgrplasso (ϵ=0.05) | 25.29 (2.67) | 0.843 | 11.07 (6.59) | 0.982 | 21.83 | 12.373 (0.58) |

Reported numbers are averages over 100 repetitions, with standard errors shown in parentheses.
Table 3. Average results for 100 repetitions of the three algorithms in Model III and Model IV.

Model III

| p | Method | TP | TPR | FP | Accur | Time | BNE |
|---|---|---|---|---|---|---|---|
| 300 | grpreg (λ=min) | 28.92 (2.15) | 0.964 | 69.87 (20.50) | 0.763 | 161.17 | 18.771 (1.41) |
| 300 | gglasso (λ=min) | 29.43 (3.10) | 0.981 | 74.16 (29.85) | 0.751 | 92.28 | 16.155 (0.84) |
| 300 | gglasso (λ=lse) | 28.71 (3.82) | 0.957 | 36.66 (21.05) | 0.874 | 162.14 | 15.849 (0.52) |
| 300 | wgrplasso (ϵ=0.01) | 27.42 (2.86) | 0.914 | 25.95 (7.98) | 0.905 | 6.86 | 15.093 (0.49) |
| 300 | wgrplasso (ϵ=0.05) | 28.38 (2.06) | 0.946 | 33.66 (8.31) | 0.882 | 9.22 | 16.294 (0.55) |
| 600 | grpreg (λ=min) | 27.57 (4.02) | 0.919 | 80.61 (30.38) | 0.862 | 189.25 | 19.277 (1.56) |
| 600 | gglasso (λ=min) | 29.58 (1.05) | 0.986 | 102.12 (41.15) | 0.829 | 124.10 | 17.521 (1.26) |
| 600 | gglasso (λ=lse) | 26.82 (7.21) | 0.894 | 43.86 (34.66) | 0.922 | 190.67 | 16.644 (0.58) |
| 600 | wgrplasso (ϵ=0.01) | 27.57 (2.62) | 0.919 | 41.37 (11.08) | 0.927 | 9.60 | 16.709 (0.60) |
| 600 | wgrplasso (ϵ=0.05) | 28.53 (1.93) | 0.951 | 52.68 (12.12) | 0.910 | 11.95 | 17.024 (0.68) |
| 900 | grpreg (λ=min) | 26.34 (5.59) | 0.878 | 84.69 (38.07) | 0.902 | 214.50 | 19.459 (1.92) |
| 900 | gglasso (λ=min) | 28.95 (3.20) | 0.965 | 113.34 (49.11) | 0.873 | 155.14 | 17.835 (1.33) |
| 900 | gglasso (λ=lse) | 24.33 (9.68) | 0.811 | 39.99 (31.89) | 0.949 | 216.52 | 16.691 (0.46) |
| 900 | wgrplasso (ϵ=0.01) | 27.51 (2.77) | 0.917 | 50.49 (12.26) | 0.941 | 15.23 | 16.939 (0.68) |
| 900 | wgrplasso (ϵ=0.05) | 28.20 (2.33) | 0.940 | 61.77 (12.53) | 0.929 | 16.97 | 17.307 (0.74) |

Model IV

| p | Method | TP | TPR | FP | Accur | Time | BNE |
|---|---|---|---|---|---|---|---|
| 300 | grpreg (λ=min) | 21.75 (3.94) | 0.725 | 63.24 (9.34) | 0.762 | 80.51 | 23.983 (1.19) |
| 300 | gglasso (λ=min) | 19.86 (4.41) | 0.662 | 52.47 (9.74) | 0.791 | 98.53 | 18.512 (1.21) |
| 300 | gglasso (λ=lse) | 18.03 (4.58) | 0.601 | 47.82 (11.15) | 0.801 | 80.87 | 18.051 (1.11) |
| 300 | wgrplasso (ϵ=0.01) | 28.26 (2.05) | 0.942 | 26.31 (8.76) | 0.906 | 45.56 | 14.895 (1.04) |
| 300 | wgrplasso (ϵ=0.05) | 28.26 (2.01) | 0.942 | 26.4 (8.67) | 0.906 | 46.08 | 14.934 (1.03) |
| 600 | grpreg (λ=min) | 18.12 (4.45) | 0.604 | 82.29 (12.32) | 0.843 | 112.95 | 24.765 (1.38) |
| 600 | gglasso (λ=min) | 15.63 (4.94) | 0.521 | 68.28 (12.20) | 0.862 | 145.22 | 19.121 (1.30) |
| 600 | gglasso (λ=lse) | 14.10 (5.07) | 0.470 | 64.11 (13.27) | 0.867 | 113.62 | 18.523 (1.14) |
| 600 | wgrplasso (ϵ=0.01) | 28.77 (1.66) | 0.959 | 34.11 (10.02) | 0.941 | 84.97 | 15.353 (1.11) |
| 600 | wgrplasso (ϵ=0.05) | 28.77 (1.66) | 0.959 | 34.89 (10.25) | 0.940 | 87.64 | 15.38 (1.11) |
| 900 | grpreg (λ=min) | 16.38 (3.99) | 0.546 | 93.60 (14.46) | 0.881 | 139.78 | 25.239 (1.44) |
| 900 | gglasso (λ=min) | 14.19 (4.69) | 0.473 | 78.21 (13.51) | 0.896 | 185.60 | 19.309 (1.25) |
| 900 | gglasso (λ=lse) | 11.67 (4.73) | 0.389 | 67.86 (13.89) | 0.904 | 140.75 | 18.453 (1.09) |
| 900 | wgrplasso (ϵ=0.01) | 28.77 (1.71) | 0.959 | 38.79 (12.26) | 0.956 | 123.20 | 15.780 (1.14) |
| 900 | wgrplasso (ϵ=0.05) | 28.80 (1.71) | 0.960 | 38.79 (11.90) | 0.956 | 121.92 | 15.827 (1.15) |

Reported numbers are averages over 100 repetitions, with standard errors shown in parentheses.
Table 4. Average prediction accuracy, model size, and time for 100 repetitions of the four algorithms on the musk dataset.

| | wgrplasso (ϵ=0.05) | grpreg (λ=min) | gglasso (λ=min) | glmnet (λ=min) |
|---|---|---|---|---|
| Prediction accuracy | 0.820 | 0.813 | 0.771 | 0.758 |
| Model size | 66.53 | 31.29 | 30.14 | 53.53 |
| Time | 0.69 | 3.04 | 2.70 | 2.12 |
Table 5. Average prediction error, model size and selected genes for 100 repetitions of the three algorithms on the microarray gene expression data from histologically normal epithelial cells.

| | wgrplasso (ϵ=0.05) | grpreg (λ=min) | gglasso (λ=min) |
|---|---|---|---|
| Prediction error | 0.73 | 0.63 | 0.71 |
| Model size | 14 | 9 | 14 |
| Selected genes | 117_at, 1255_g_at, 200000_s_at, 200002_at, 200030_s_at, 200040_at, 200041_s_at, 200655_s_at, 200661_at, 200729_s_at, 201040_at, 201465_s_at, 202707_at, 211997_x_at | 201464_x_at, 201465_s_at, 201778_s_at, 202707_at, 204620_s_at, 205544_s_at, 211997_x_at, 213280_at, 217921_at | 200047_s_at, 200729_s_at, 200801_x_at, 201465_s_at, 202046_s_at, 202707_at, 205544_s_at, 208443_x_at, 211374_x_at, 211997_x_at, 212234_at, 213280_at, 217921_at, 220811_at |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.