
Quick and Complete Convergence in the Law of Large Numbers with Applications to Statistics


Submitted: 11 May 2023; Posted: 12 May 2023

Abstract
In the first part of this article, we discuss and generalize the complete convergence introduced by Hsu and Robbins (1947) to the r-complete convergence introduced by Tartakovsky (1998). We also establish its relation to the r-quick convergence first introduced by Strassen (1967) and extensively studied by Lai (1976). Our work is motivated by various statistical problems, mostly in sequential analysis. As we show in the second part, generalizing and studying these convergence modes is important not only in probability theory but also for solving challenging statistical problems in hypothesis testing and changepoint detection for general non-i.i.d. stochastic models.
Subject: Computer Science and Mathematics - Probability and Statistics

1. Introduction

In [1], Hsu and Robbins introduced the notion of complete convergence, which is stronger than almost sure (a.s.) convergence, and used it to discuss certain aspects of the Law of Large Numbers (LLN). In particular, let $X_1, X_2, \ldots$ be independent and identically distributed (i.i.d.) random variables with common mean $\mu = \mathbb{E}[X_1]$. Hsu and Robbins proved that, while in the Kolmogorov Strong Law of Large Numbers (SLLN) only the first-moment condition is needed for the sample mean $n^{-1}\sum_{t=1}^n X_t$ to converge to $\mu$ as $n \to \infty$, the complete version of the SLLN requires the second-moment condition $\mathbb{E}|X_1|^2 < \infty$ (finiteness of the variance). Later, Baum and Katz [2], working on the rate of convergence in the LLN, established that the second-moment condition is not only sufficient but also necessary for complete convergence. Strassen [3] introduced another mode of convergence, the $r$-quick convergence. When $r = 1$, these two modes of convergence are closely related; in the case of i.i.d. random variables and the sample mean $n^{-1}\sum_{t=1}^n X_t$, they are identical. This fact and certain statistical applications motivated Tartakovsky [4] (see also Tartakovsky [5] and Tartakovsky et al. [6]) to introduce a natural generalization of complete convergence, the $r$-complete convergence, which turns out to be identical to the $r$-quick convergence in the i.i.d. case.
Section 2 discusses purely probabilistic issues related to r-complete convergence and r-quick convergence. Section 3 explores statistical applications in sequential hypothesis testing and changepoint detection. Section 4 outlines sufficient conditions for r-complete convergence for Markov and hidden Markov models, which are needed to establish optimality properties of sequential hypothesis tests and changepoint detection procedures. Section 5 concludes.

2. Modes of Convergence and the Law of Large Numbers

We begin by listing some standard definitions from probability theory. Let $(\Omega, \mathcal{F})$ be a measurable space, i.e., $\Omega$ is a set of elementary events $\omega$ and $\mathcal{F}$ is a sigma-algebra (a system of subsets of $\Omega$ satisfying standard conditions). A probability space is a triple $(\Omega, \mathcal{F}, \mathsf{P})$, where $\mathsf{P}$ is a probability measure (a completely additive measure normalized to 1) defined on the sets from the sigma-algebra $\mathcal{F}$. More specifically, by Kolmogorov's axioms, the probability $\mathsf{P}$ satisfies: $\mathsf{P}(A) \ge 0$ for any $A \in \mathcal{F}$; $\mathsf{P}(\Omega) = 1$; and $\mathsf{P}\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty \mathsf{P}(A_i)$ for $A_i \in \mathcal{F}$ with $A_i \cap A_j = \varnothing$, $i \neq j$, where $\varnothing$ is the empty set.
A function $X = X(\omega)$ defined on $(\Omega, \mathcal{F})$ with values in $\mathcal{X}$ is called a random variable if it is $\mathcal{F}$-measurable, i.e., $\{\omega : X(\omega) \in B\}$ belongs to the sigma-algebra $\mathcal{F}$ for every Borel set $B$. The function $F(x) = \mathsf{P}(\omega : X(\omega) \le x)$ is the distribution function of $X$, also referred to as the cumulative distribution function (cdf). The real-valued random variables $X_1, X_2, \ldots$ are independent if the events $\{X_1 \le x_1\}, \{X_2 \le x_2\}, \ldots$ are independent for every sequence $x_1, x_2, \ldots$ of real numbers. In what follows, we deal with real-valued random variables unless specified otherwise.

2.1. Standard Modes of Convergence

Let $X$ be a random variable and let $\{X_n\}_{n\in\mathbb{Z}_+}$ ($\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$) be a sequence of random variables, all defined on the probability space $(\Omega, \mathcal{F}, \mathsf{P})$. We now give several standard definitions and results related to the Law of Large Numbers.
Convergence in Distribution (Weak Convergence). Let $F_n(x) = \mathsf{P}(\omega : X_n \le x)$ be the cdf of $X_n$ and let $F(x) = \mathsf{P}(\omega : X \le x)$ be the cdf of $X$. We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to $X$ in distribution (or in law, or weakly) as $n\to\infty$, and write $X_n \xrightarrow[n\to\infty]{\text{law}} X$, if
\[ \lim_{n\to\infty} F_n(x) = F(x) \]
at all continuity points of $F(x)$.
Convergence in Probability. We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to $X$ in probability as $n\to\infty$, and write $X_n \xrightarrow[n\to\infty]{\mathsf{P}} X$, if
\[ \lim_{n\to\infty} \mathsf{P}(|X_n - X| > \varepsilon) = 0 \quad \text{for every } \varepsilon > 0. \]
Almost Sure Convergence. We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to $X$ almost surely (a.s.), or with probability 1 (w.p. 1), as $n\to\infty$ under the probability measure $\mathsf{P}$, and write $X_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-a.s.}} X$, if
\[ \mathsf{P}\left(\omega : \lim_{n\to\infty} X_n = X\right) = 1. \tag{1} \]
It is easily seen that (1) is equivalent to the condition
\[ \lim_{n\to\infty} \mathsf{P}\left(\bigcup_{t=n}^\infty \left\{|X_t - X| > \varepsilon\right\}\right) = 0 \quad \text{for every } \varepsilon > 0, \]
and that a.s. convergence implies convergence in probability, and convergence in probability implies convergence in distribution, while the converse statements are not generally true.
The following equivalence, which gives a necessary and sufficient condition for a.s. convergence, is useful:
\[ X_n \xrightarrow[n\to\infty]{\text{a.s.}} X \iff \mathsf{P}\left(\sup_{t\ge n}|X_t - X| > \varepsilon\right) \xrightarrow[n\to\infty]{} 0 \quad \text{for all } \varepsilon > 0. \tag{2} \]
The following result is often useful.
Lemma 1.
Let $f(t)$ be a nonnegative increasing function with $\lim_{t\to\infty} f(t) = \infty$. If
\[ \frac{X_n}{f(n)} \xrightarrow[n\to\infty]{\mathsf{P}\text{-a.s.}} 0, \]
then
\[ \lim_{n\to\infty}\mathsf{P}\left(\frac{1}{f(n)}\max_{0\le t\le n} X_t > \varepsilon\right) = 0 \quad \text{for every } \varepsilon > 0. \tag{3} \]
Proof. 
For any $\varepsilon > 0$, $n_0 > 0$ and $n > n_0$, we have
\[
\mathsf{P}\left(\frac{1}{f(n)}\max_{0\le t\le n} X_t > \varepsilon\right) \le \mathsf{P}\left(\frac{1}{f(n)}\max_{0\le t\le n_0} X_t > \varepsilon\right) + \mathsf{P}\left(\frac{1}{f(n)}\max_{n_0 < t\le n} X_t > \varepsilon\right) \le \mathsf{P}\left(\frac{1}{f(n)}\max_{0\le t\le n_0} X_t > \varepsilon\right) + \mathsf{P}\left(\sup_{t>n_0}\frac{X_t}{f(t)} > \varepsilon\right).
\]
Letting $n\to\infty$ and taking into account that
\[ \lim_{n\to\infty}\mathsf{P}\left(\frac{1}{f(n)}\max_{0\le t\le n_0} X_t > \varepsilon\right) = 0, \]
we obtain
\[ \limsup_{n\to\infty}\mathsf{P}\left(\frac{1}{f(n)}\max_{0\le t\le n} X_t > \varepsilon\right) \le \mathsf{P}\left(\sup_{t>n_0}\frac{X_t}{f(t)} > \varepsilon\right). \]
Since $n_0$ can be arbitrarily large, we can let $n_0\to\infty$, and since, by assumption, $X_n/f(n) \xrightarrow[n\to\infty]{\text{a.s.}} 0$, it follows from (2) that the upper bound approaches 0 as $n_0\to\infty$. This completes the proof. □
Remark 1.
The proof of Lemma 1 shows that the assertion (3) also holds under the one-sided condition
\[ \mathsf{P}\left(\sup_{t>n}\frac{X_t}{f(t)} > \varepsilon\right) \xrightarrow[n\to\infty]{} 0 \quad \text{for all } \varepsilon > 0. \]
Random Walk. Let $X_0, X_1, X_2, \ldots$ be random variables such that $X_1, X_2, \ldots$ are i.i.d. with mean $\mathbb{E}[X_n] = \mu$ for $n \ge 1$, with the initial condition $X_0 = x$. Then $S_n = \sum_{t=0}^n X_t$ is called a random walk with mean $\mathbb{E}[S_n] = x + \mu n$.
In what follows, in the case where $X_1, X_2, \ldots$ are i.i.d. random variables and $S_n = \sum_{t=0}^n X_t$, we prefer to formulate the results in terms of the random walk $\{S_n\}_{n\in\mathbb{Z}_+}$ (typically with $S_0 = 0$, though not necessarily).
We now recall the two Strong Laws of Large Numbers (SLLN). Write $S_n = X_0 + X_1 + \cdots + X_n$ for the partial sum ($X_0 = S_0 = 0$), so that $\{S_n\}_{n\in\mathbb{Z}_+}$ is a random walk with zero initial condition as long as $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$.
Kolmogorov's SLLN. Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a random walk under the probability measure $\mathsf{P}$. If $\mathbb{E}[S_1]$ exists, then the sample mean $S_n/n$ converges to the mean value $\mathbb{E}[S_1]$ w.p. 1, i.e.,
\[ n^{-1}S_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-a.s.}} \mathbb{E}[S_1]. \tag{5} \]
Conversely, if $n^{-1}S_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-a.s.}} \mu$, where $|\mu| < \infty$, then $\mathbb{E}[S_1] = \mu$.
Marcinkiewicz-Zygmund's SLLN. Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a zero-mean random walk under the probability measure $\mathsf{P}$. The following two statements are equivalent:
(i) $\mathbb{E}|S_1|^p < \infty$ for $0 < p < 2$;
(ii) $n^{-1/p}S_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-a.s.}} 0$.

2.2. Complete and r-Complete Convergence

We begin by discussing the issue of rates of convergence in the LLN.
Rates of Convergence. Let $\{X_n\}_{n\in\mathbb{Z}_+}$ be a sequence of random variables and assume that $X_n$ converges to 0 w.p. 1 as $n\to\infty$. The question is: what is the rate of convergence? In other words, how fast does the tail probability $\mathsf{P}(|X_n| > \varepsilon)$ decay to zero? This question can be answered by analyzing the behavior of the sums
\[ \Sigma(r, \varepsilon) := \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}(|X_n| > \varepsilon) \quad \text{for some } r > 0 \text{ and all } \varepsilon > 0. \]
More specifically, if $\Sigma(r,\varepsilon)$ is finite for every $\varepsilon > 0$, then the tail probability $\mathsf{P}(|X_n| > \varepsilon)$ decays at a rate faster than $1/n^r$, so that $n^r\,\mathsf{P}(|X_n| > \varepsilon) \to 0$ for all $\varepsilon > 0$ as $n\to\infty$.
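As a concrete illustration of this diagnostic, the following minimal Monte Carlo sketch (our own illustration; the Gaussian model, sample sizes, and tolerance $\varepsilon$ are arbitrary assumptions) estimates the tail probability $\mathsf{P}(|S_n/n| > \varepsilon)$ for an i.i.d. zero-mean Gaussian random walk; since the variance is finite, $n\,\mathsf{P}(|S_n/n| > \varepsilon)$ should decrease toward zero, in line with the case $r = 1$.

```python
import numpy as np

# Monte Carlo sketch: decay of the tail probability P(|S_n/n| > eps) for
# i.i.d. N(0,1) summands. Finite variance implies complete convergence of
# the sample mean, so n * P(|S_n/n| > eps) should tend to 0 (r = 1).
rng = np.random.default_rng(0)
eps, reps = 0.1, 10_000

for n in [50, 100, 200, 400, 800]:
    means = rng.standard_normal((reps, n)).mean(axis=1)  # replicated sample means
    p = np.mean(np.abs(means) > eps)                     # estimate of P(|S_n/n| > eps)
    print(f"n={n:4d}  P~{p:.3e}  n*P~{n * p:.3e}")
```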
To answer this question, we now consider modes of convergence that strengthen almost sure convergence and therefore help determine the rate of convergence in the SLLN. Historically, this issue was first addressed in 1947 by Hsu and Robbins [1], who introduced a new mode of convergence that they called complete convergence.
Complete Convergence. The sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to 0 completely if
\[ \lim_{n\to\infty}\sum_{t=n}^\infty \mathsf{P}(|X_t| > \varepsilon) = 0 \quad \text{for every } \varepsilon > 0. \tag{6} \]
Clearly, (6) is equivalent to
\[ \Sigma(1, \varepsilon) = \sum_{n=1}^\infty \mathsf{P}(|X_n| > \varepsilon) < \infty \quad \text{for every } \varepsilon > 0. \]
Also, (6) implies the a.s. convergence $X_n \xrightarrow[n\to\infty]{\text{a.s.}} 0$; the converse is not generally true unless the variables $X_1, X_2, \ldots$ are independent.
Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a random walk with mean $\mathbb{E}[S_n] = \mu n$. Kolmogorov's SLLN (5) implies that the sample mean $S_n/n$ converges to $\mu$ w.p. 1. Hsu and Robbins [1] proved that under the same assumptions (i.e., under the first-moment condition $\mathbb{E}|S_1| < \infty$ alone) the sequence $\{n^{-1}S_n\}_{n\ge 1}$ need not converge to $\mu$ completely, but it does so under the additional second-moment condition $\mathbb{E}|S_1|^2 < \infty$. So finiteness of the variance is a sufficient condition for complete convergence in the SLLN. They conjectured that the second-moment condition is not only sufficient but also necessary for complete convergence. Thus, it follows from these results that if the variance is finite, then the rate of convergence in Kolmogorov's SLLN is $\lim_{n\to\infty} n\,\mathsf{P}(|S_n/n - \mu| > \varepsilon) = 0$ for all $\varepsilon > 0$.
A further step toward resolving this issue was taken in 1965 by Baum and Katz [2]. In particular, the following result follows from Theorem 3 in [2] for the random walk $\{S_n\}_{n\in\mathbb{Z}_+}$ with mean $\mathbb{E}[S_1] = \mu$.
Theorem 1.
Let $r > 0$ and $\alpha > 1/2$. If $\{S_n\}_{n\in\mathbb{Z}_+}$ is a random walk with mean $\mathbb{E}[S_1] = \mu$, then the following statements are equivalent:
\[
\mathbb{E}\left[|S_1|^{(r+1)/\alpha}\right] < \infty \iff \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\frac{1}{n^\alpha}|S_n - \mu n| > \varepsilon\right) < \infty \;\;\text{for all } \varepsilon > 0 \iff \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{k\ge n}\frac{1}{k^\alpha}|S_k - \mu k| > \varepsilon\right) < \infty \;\;\text{for all } \varepsilon > 0. \tag{7}
\]
Setting $r = 1$ and $\alpha = 1$ in (7), we obtain the equivalence
\[ \mathbb{E}[|S_1|^2] < \infty \iff \sum_{n=1}^\infty \mathsf{P}\left(|S_n/n - \mu| > \varepsilon\right) < \infty \quad \text{for all } \varepsilon > 0, \]
which shows that the conjecture of Hsu and Robbins is correct: the second-moment condition $\mathbb{E}|S_1|^2 < \infty$ is both necessary and sufficient for the complete convergence
\[ n^{-1}S_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-completely}} \mu. \]
Furthermore, if for some $r > 0$ the $(r+1)$-th moment is finite, $\mathbb{E}|S_1|^{r+1} < \infty$, then the rate of convergence in the SLLN is $\lim_{n\to\infty} n^r\,\mathsf{P}(|S_n/n - \mu| > \varepsilon) = 0$ for all $\varepsilon > 0$.
The previous results suggest that it is reasonable to generalize the notion of complete convergence to the following mode of convergence, which we refer to as r-complete convergence and which is also related to the so-called r-quick convergence discussed below (see Subsection 2.3).
Definition 1
(r-Complete Convergence). Let $r > 0$. We say that the sequence of random variables $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to $X$ $r$-completely as $n\to\infty$ under the probability measure $\mathsf{P}$, and write $X_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-}r\text{-completely}} X$, if
\[ \Sigma(r,\varepsilon) := \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}(|X_n - X| > \varepsilon) < \infty \quad \text{for every } \varepsilon > 0. \]
Note that the a.s. convergence of $\{X_n\}$ to $X$ can be equivalently written as
\[ \lim_{n\to\infty}\mathsf{P}\left(\bigcup_{t=n}^\infty \left\{|X_t - X| > \varepsilon\right\}\right) = 0 \quad \text{for every } \varepsilon > 0, \]
so that the $r$-complete convergence with $r \ge 1$ implies the a.s. convergence, but the converse is not true in general.
Suppose that $X_n$ converges a.s. to $X$. If $\Sigma(r,\varepsilon)$ is finite for every $\varepsilon > 0$, then
\[ \lim_{n\to\infty}\sum_{t=n}^\infty t^{r-1}\,\mathsf{P}(|X_t - X| > \varepsilon) = 0 \quad \text{for every } \varepsilon > 0, \]
and the probability $\mathsf{P}(|X_n - X| > \varepsilon)$ goes to 0 as $n\to\infty$ at a rate faster than $1/n^r$. Hence, as already mentioned, the $r$-complete convergence allows one to determine the rate of convergence of $X_n$ to $X$, i.e., to answer the question of how fast the tail probability $\mathsf{P}(|X_n - X| > \varepsilon)$ decays to zero.
The following result provides a very useful implication of complete convergence.
Theorem 2.
Let $\{X_n\}_{n\in\mathbb{Z}_+}$ and $\{Y_n\}_{n\in\mathbb{Z}_+}$ be two arbitrary, possibly dependent sequences of random variables. Assume that there are positive and finite numbers $\mu_1$ and $\mu_2$ such that
\[ \sum_{n=1}^\infty \mathsf{P}\left(\left|\frac{1}{n}X_n - \mu_1\right| > \varepsilon\right) < \infty \quad \text{for every } \varepsilon > 0 \tag{9} \]
and
\[ \sum_{n=1}^\infty \mathsf{P}\left(\left|\frac{1}{n}Y_n - \mu_2\right| > \varepsilon\right) < \infty \quad \text{for every } \varepsilon > 0, \tag{10} \]
i.e., $n^{-1}X_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-completely}} \mu_1$ and $n^{-1}Y_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-completely}} \mu_2$. If $\mu_1 \ge \mu_2$, then for any random time $T$,
\[ \mathsf{P}\left(X_T < b,\; Y_{T+1} \ge b(1+\delta)\right) \to 0 \quad \text{as } b\to\infty, \text{ for any } \delta > 0. \tag{11} \]
Proof. 
Fix $\delta > 0$ and $c \in (0, \delta)$, and let $N_b = \lceil (1+c)b/\mu_2 \rceil$ be the smallest integer that is larger than or equal to $(1+c)b/\mu_2$. Observe that
\[
\mathsf{P}\left(X_T < b,\; Y_{T+1} \ge b(1+\delta)\right) \le \mathsf{P}\left(X_T \le b,\; T \ge N_b\right) + \mathsf{P}\left(Y_{T+1} \ge (1+\delta)b,\; T < N_b\right) \le \mathsf{P}\left(X_T \le b,\; T \ge N_b\right) + \mathsf{P}\left(\max_{1\le n\le N_b} Y_n \ge (1+\delta)b\right).
\]
Thus, to prove (11) it suffices to show that the two terms on the right-hand side go to 0 as $b\to\infty$.
For the first term, we notice that for any $n \ge N_b$,
\[ \frac{b}{n} \le \frac{b}{N_b} \le \frac{\mu_2}{1+c} \le \frac{\mu_1}{1+c} < \mu_1, \]
so that
\[
\mathsf{P}\left(X_T \le b,\; T \ge N_b\right) = \sum_{n=N_b}^\infty \mathsf{P}\left(X_n \le b,\; T = n\right) \le \sum_{n=N_b}^\infty \mathsf{P}\left(\frac{X_n}{n} \le \frac{b}{n}\right) \le \sum_{n=N_b}^\infty \mathsf{P}\left(\frac{X_n}{n} \le \frac{\mu_1}{1+c}\right) = \sum_{n=N_b}^\infty \mathsf{P}\left(\frac{X_n}{n} - \mu_1 \le -\frac{c}{1+c}\,\mu_1\right).
\]
Since $N_b \to\infty$ as $b\to\infty$, the upper bound goes to 0 as $b\to\infty$ due to condition (9).
Next, since $c \in (0, \delta)$, there exists $\varepsilon > 0$ such that
\[ \frac{(1+\delta)b}{N_b} = \frac{(1+\delta)b}{\lceil b(1+c)/\mu_2 \rceil} \ge (1+\varepsilon)\mu_2. \]
As a result,
\[ \mathsf{P}\left(\max_{1\le n\le N_b} Y_n \ge (1+\delta)b\right) \le \mathsf{P}\left(\frac{1}{N_b}\max_{1\le n\le N_b} Y_n \ge (1+\varepsilon)\mu_2\right), \]
where the upper bound goes to 0 as $b\to\infty$ by condition (10) (see Lemma 1). □
Remark 2.
The proof suggests that the assertion (11) of Theorem 2 holds under the following one-sided conditions:
\[ \mathsf{P}\left(n^{-1}\max_{1\le s\le n} Y_s - \mu_2 > \varepsilon\right) \xrightarrow[n\to\infty]{} 0, \qquad \sum_{n=1}^\infty \mathsf{P}\left(n^{-1}X_n - \mu_1 < -\varepsilon\right) < \infty \quad \text{for all } \varepsilon > 0. \]
The complete convergence conditions (9) and (10) guarantee both of these conditions.
Remark 3.
Theorem 2 can be applied to the overshoot problem. Indeed, if $X_n = Y_n = Z_n$ and the random time $T$ is the first time $n$ at which $Z_n$ exceeds the level $b$, $T = \inf\{n \ge 1 : Z_n > b\}$, then Theorem 2 shows that the relative excess over the boundary (the overshoot) $(Z_T - b)/b$ converges to 0 in probability as $b\to\infty$ whenever $Z_n/n$ converges completely, as $n\to\infty$, to a positive number $\mu$.

2.3. r-Quick Convergence

In 1967, Strassen [3] introduced the notion of r-quick limit points of a sequence of random variables. The r-quick convergence has been further addressed by Lai [7,8], Chow and Lai [9], Fuh and Zhang [10], and Tartakovsky [4,5] (see certain details in Subsection 2.4).
We define r-quick convergence in a way suitable for this paper. Let $\{X_n\}_{n\in\mathbb{Z}_+}$ be a sequence of real-valued random variables and let $X$ be a random variable defined on the same probability space $(\Omega, \mathcal{F}, \mathsf{P})$.
Definition 2
(r-Quick Convergence). Let $r > 0$ and, for $\varepsilon > 0$, let
\[ L_\varepsilon = \sup\left\{n \ge 1 : |X_n - X| > \varepsilon\right\} \quad (\sup\{\varnothing\} = 0) \]
be the last entry time of $X_n$ into the region $(X+\varepsilon, \infty)\cup(-\infty, X-\varepsilon)$. We say that the sequence $\{X_n\}_{n\in\mathbb{Z}_+}$ converges to $X$ $r$-quickly as $n\to\infty$ under the probability measure $\mathsf{P}$, and write $X_n \xrightarrow[n\to\infty]{\mathsf{P}\text{-}r\text{-quickly}} X$, if, and only if,
\[ \mathbb{E}[L_\varepsilon^r] < \infty \quad \text{for every } \varepsilon > 0, \]
where $\mathbb{E}$ is the expectation operator under the probability measure $\mathsf{P}$.
This definition can of course be generalized to random variables $X, \{X_n\}_{n\in\mathbb{Z}_+}$ taking values in a metric space $(\mathcal{X}, d)$ with distance $d$: $X_n \xrightarrow[n\to\infty]{r\text{-quickly}} X$ if
\[ \mathbb{E}\left[\left(\sup\{n \ge 1 : d(X, X_n) > \varepsilon\}\right)^r\right] < \infty \quad \text{for every } \varepsilon > 0. \]
Note that the a.s. convergence $X_n \to \mu$ ($|\mu| < \infty$) as $n\to\infty$ to a constant $\mu$ can be expressed as $\mathsf{P}(L_\varepsilon(\mu) < \infty) = 1$, where $L_\varepsilon(\mu) = \sup\{n \ge 1 : |X_n - \mu| > \varepsilon\}$. Therefore, the $r$-quick convergence implies convergence w.p. 1, but not conversely.
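The last entry time in Definition 2 is easy to study by simulation. The following sketch (our own illustration; the Gaussian model, the horizon truncation $N$, and all parameter values are assumptions) estimates $\mathbb{E}[L_\varepsilon^r]$ for the sample mean of i.i.d. normal variables.

```python
import numpy as np

# Sketch: estimate E[L_eps^r] for the sample mean of i.i.d. N(mu, 1) variables,
# where L_eps = sup{n >= 1 : |mean_n - mu| > eps} is the last entry time.
# Paths are truncated at horizon N (an assumption of this illustration).
rng = np.random.default_rng(1)
mu, eps, r, N, reps = 0.5, 0.2, 1.0, 5000, 2000

last_entries = np.empty(reps)
for i in range(reps):
    x = rng.normal(mu, 1.0, size=N)
    means = np.cumsum(x) / np.arange(1, N + 1)
    bad = np.nonzero(np.abs(means - mu) > eps)[0]          # times with |mean - mu| > eps
    last_entries[i] = 0 if bad.size == 0 else bad[-1] + 1  # last such n (0 if never)

print("estimated E[L_eps^r] ~", np.mean(last_entries ** r))
```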
Note also that, in general, $r$-quick convergence is stronger than $r$-complete convergence. Specifically, the following lemma shows that
\[ \frac{1}{n}\max_{1\le t\le n} X_t \xrightarrow[n\to\infty]{r\text{-completely}} \mu \implies \frac{1}{n}X_n \xrightarrow[n\to\infty]{r\text{-quickly}} \mu \implies \frac{1}{n}X_n \xrightarrow[n\to\infty]{r\text{-completely}} \mu. \tag{13} \]
Lemma 2.
Let $\{X_n\}_{n\in\mathbb{Z}_+}$ be a sequence of random variables. Let $f(t)$ be a nonnegative increasing function with $f(0) = 0$ and $\lim_{t\to\infty} f(t) = +\infty$, and for $\varepsilon > 0$ let
\[ L_\varepsilon(f) = \sup\left\{n \ge 1 : |X_n| > \varepsilon f(n)\right\} \quad (\sup\{\varnothing\} = 0) \]
be the last time $X_n$ leaves the interval $[-\varepsilon f(n), +\varepsilon f(n)]$.
(i) For any $r > 0$ and any $\varepsilon > 0$ the following inequalities hold:
\[ r\sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(|X_n| \ge \varepsilon f(n)\right) \le \mathbb{E}\left[L_\varepsilon(f)^r\right] \le r\sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{t\ge n}\frac{|X_t|}{f(t)} \ge \varepsilon\right). \tag{14} \]
Therefore,
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{t\ge n}\frac{|X_t|}{f(t)} \ge \varepsilon\right) < \infty \;\;\text{for all } \varepsilon > 0 \implies \frac{X_n}{f(n)} \xrightarrow[n\to\infty]{r\text{-quickly}} 0. \]
(ii) If $f(t)$ is a power function, $f(t) = t^\gamma$ with $\gamma > 0$, then finiteness of
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\max_{1\le t\le n}|X_t| \ge \varepsilon n^\gamma\right) \]
for some $r > 0$ and every $\varepsilon > 0$ implies the $r$-quick convergence of $X_n/n^\gamma$ to 0:
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\max_{1\le t\le n}|X_t| \ge \varepsilon n^\gamma\right) < \infty \;\;\forall\,\varepsilon > 0 \implies \mathbb{E}\left[L_\varepsilon(\gamma)^r\right] < \infty \;\;\forall\,\varepsilon > 0, \tag{15} \]
where $L_\varepsilon(\gamma) = \sup\left\{n \ge 1 : |X_n| > \varepsilon n^\gamma\right\}$.
Proof. 
(i) Obviously,
\[ \mathsf{P}\left(|X_n| \ge \varepsilon f(n)\right) \le \mathsf{P}\left(L_\varepsilon(f) \ge n\right) \le \mathsf{P}\left(\sup_{t\ge n}\frac{|X_t|}{f(t)} \ge \varepsilon\right), \]
from which the inequalities (14) follow immediately.
(ii) Write $M_u = \max_{1\le n\le \lfloor u\rfloor}|X_n|$, where $\lfloor u\rfloor$ is the integer part of $u$. We have the following chain of inequalities and equalities:
\[
\begin{aligned}
\mathbb{E}\left[L_{2\varepsilon}(\gamma)^r\right] &\le r\int_0^\infty t^{r-1}\,\mathsf{P}\left(\sup_{u\ge t} u^{-\gamma}|X_u| \ge 2\varepsilon\right)\mathrm{d}t \le r\int_0^\infty t^{r-1}\,\mathsf{P}\left(\sup_{u\ge t}\left\{|X_u| - \varepsilon u^\gamma\right\} \ge \varepsilon t^\gamma\right)\mathrm{d}t \\
&\le r\int_0^\infty t^{r-1}\,\mathsf{P}\left(\sup_{u>0}\left\{|X_u| - \varepsilon u^\gamma\right\} \ge \varepsilon t^\gamma\right)\mathrm{d}t \le r\sum_{n=1}^\infty\int_0^\infty t^{r-1}\,\mathsf{P}\left(\sup_{(2^{n-1}-1)t^\gamma < u^\gamma \le (2^n-1)t^\gamma}\left\{|X_u| - \varepsilon u^\gamma\right\} \ge \varepsilon t^\gamma\right)\mathrm{d}t \\
&\le r\sum_{n=1}^\infty\int_0^\infty t^{r-1}\,\mathsf{P}\left(\sup_{u^\gamma \le 2^n t^\gamma}|X_u| \ge 2^{n-1}\varepsilon t^\gamma\right)\mathrm{d}t = r\sum_{n=1}^\infty\int_0^\infty t^{r-1}\,\mathsf{P}\left(M_{2^{n/\gamma}t} \ge 2^{n-1}\varepsilon t^\gamma\right)\mathrm{d}t \\
&= r\sum_{n=1}^\infty 2^{-nr/\gamma}\int_0^\infty u^{r-1}\,\mathsf{P}\left(M_u \ge (\varepsilon/2)u^\gamma\right)\mathrm{d}u.
\end{aligned}
\]
It follows that
\[
\mathbb{E}\left[L_{2\varepsilon}(\gamma)^r\right] \le r\left(2^{r/\gamma}-1\right)^{-1}\int_0^\infty u^{r-1}\,\mathsf{P}\left(M_u \ge (\varepsilon/2)u^\gamma\right)\mathrm{d}u \le r\left(2^{r/\gamma}-1\right)^{-1}\sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\max_{1\le t\le n}|X_t| \ge (\varepsilon/2)n^\gamma\right),
\]
which, since $\varepsilon > 0$ is arbitrary, yields the implication (15) and completes the proof. □
The following theorem shows that, in the i.i.d. case, the implications in (13) become equivalences.
Theorem 3.
Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a random walk with mean $\mathbb{E}[S_n] = \mu n$. The following statements are equivalent:
\[ \mathbb{E}|S_1|^{r+1} < \infty \iff n^{-1}S_n \xrightarrow[n\to\infty]{r\text{-completely}} \mu, \tag{18} \]
\[ \mathbb{E}|S_1|^{r+1} < \infty \iff n^{-1}S_n \xrightarrow[n\to\infty]{r\text{-quickly}} \mu, \tag{19} \]
\[ \mathbb{E}|S_1|^{r+1} < \infty \iff \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{k\ge n}\frac{1}{k}|S_k - \mu k| > \varepsilon\right) < \infty \;\;\text{for all } \varepsilon > 0. \tag{20} \]
Proof. 
By Theorem 1, in the i.i.d. case,
\[ \mathbb{E}|S_1|^{r+1} < \infty \iff \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\frac{1}{n}|S_n - \mu n| > \varepsilon\right) < \infty \;\;\forall\,\varepsilon > 0 \tag{21} \]
and
\[ \mathbb{E}|S_1|^{r+1} < \infty \iff \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{k\ge n}\frac{1}{k}|S_k - \mu k| > \varepsilon\right) < \infty \;\;\forall\,\varepsilon > 0, \tag{22} \]
so that assertion (18) follows from (21) and (20) from (22).
Next, let
\[ L_\varepsilon = \sup\left\{n \ge 1 : |S_n - n\mu| \ge n\varepsilon\right\} \quad (\sup\{\varnothing\} = 0). \]
By Lemma 2(i),
\[ \mathbb{E}[L_\varepsilon^r] \le r\sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{t\ge n}\frac{|S_t - \mu t|}{t} \ge \varepsilon\right) \quad \forall\,\varepsilon > 0, \]
which along with (22) implies (19). □

2.4. Further Remarks on r-Complete Convergence, r-Quick Convergence and Rates of Convergence in SLLN

Let $\{S_n\}_{n\in\mathbb{Z}_+}$ be a random walk. Without loss of generality, let $S_0 = 0$ and $\mathbb{E}[S_1] = 0$.
1. Strassen [3] proved, in particular, that if $f(n) = (2n\log n)^{1/2}$ in Lemma 2, then for $r > 0$
\[ \limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log n}} = \sqrt{r\,\mathbb{E}[S_1^2]} \quad r\text{-quickly} \tag{24} \]
whenever $\mathbb{E}|S_1|^p < \infty$ for some $p > 2(r+1)$. He also proved a functional form of the law of the iterated logarithm.
2. Lai [7] improved this result by showing that Strassen's moment condition $\mathbb{E}|S_1|^p < \infty$ for $p > 2(r+1)$ can be relaxed. Specifically, he showed that the weaker condition
\[ \mathbb{E}\left[\frac{|S_1|^{2(r+1)}}{(\log^+|S_1| + 1)^{r+1}}\right] < \infty \quad \text{for } r > 0 \tag{25} \]
is the best one can do (i.e., it is both necessary and sufficient):
\[ \mathbb{E}\left[\frac{|S_1|^{2(r+1)}}{(\log^+|S_1| + 1)^{r+1}}\right] < \infty \iff \limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log n}} < \infty \;\; r\text{-quickly}, \]
in which case the equality (24) holds.
Note, however, that for $r = 0$, in terms of the a.s. convergence,
\[ \mathbb{E}|S_1|^2 < \infty \iff \limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log\log n}} = \sqrt{\mathbb{E}|S_1|^2} \quad \text{a.s.}, \]
but under condition (25), for all $r > 0$,
\[ \limsup_{n\to\infty}\frac{S_n}{\sqrt{2n\log\log n}} = \infty \quad r\text{-quickly}. \]
3. Let $\alpha > 1/2$ and $r > 0$. Chow and Lai [9] established the following one-sided inequality for tail probabilities:
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\max_{1\le t\le n} S_t \ge n^\alpha\right) \le C_{r,\alpha}\left\{\mathbb{E}\left[(S_1^+)^{(r+1)/\alpha}\right] + \left(\mathbb{E}[S_1^2]\right)^{r/(2\alpha-1)}\right\} \tag{26} \]
whenever $\mathbb{E}|S_1|^2 < \infty$. Under the same hypotheses, this one-sided inequality implies the two-sided one:
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\max_{1\le t\le n}|S_t| \ge n^\alpha\right) \le C_{r,\alpha}\left\{\mathbb{E}\left[|S_1|^{(r+1)/\alpha}\right] + \left(\mathbb{E}[S_1^2]\right)^{r/(2\alpha-1)}\right\}. \tag{27} \]
The upper bound in (27) turns out to be sharp, since the corresponding lower bound also holds:
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\max_{1\le t\le n}|S_t| \ge n^\alpha\right) \ge -1 + B_{r,\alpha}\left\{\mathbb{E}\left[|S_1|^{(r+1)/\alpha}\right] + \left(\mathbb{E}[S_1^2]\right)^{r/(2\alpha-1)}\right\}. \tag{28} \]
Here the constants $C_{r,\alpha}$ and $B_{r,\alpha}$ are universal, depending only on $r$ and $\alpha$.
The results of Chow and Lai [9] provide one-sided analogues of the results of Baum and Katz [2] and also extend them. Indeed, the one-sided inequality (26) implies that the following statements are equivalent for the zero-mean random walk:
(i) $\mathbb{E}\left[(S_1^+)^{(r+1)/\alpha}\right] < \infty$;
(ii) $\sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(n^{-\alpha}S_n \ge \varepsilon\right) < \infty$ for all $\varepsilon > 0$;
(iii) $\sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{k\ge n} k^{-\alpha}S_k \ge \varepsilon\right) < \infty$ for all $\varepsilon > 0$, where $\alpha > 1/2$.
Clearly, the two-sided inequality (27) yields the assertions of Theorem 1 when $\mu = 0$.
4. The Marcinkiewicz-Zygmund SLLN states that for $\alpha > 1/2$ the following implication holds:
\[ \mathbb{E}|S_1|^{1/\alpha} < \infty \implies n^{-\alpha}S_n \xrightarrow[n\to\infty]{\text{a.s.}} 0. \]
The strengthened $r$-quick equivalent of this SLLN is: for any $r > 0$ and $\alpha > 1/2$, the following statements are equivalent:
\[
\mathbb{E}\left[|S_1|^{(r+1)/\alpha}\right] < \infty \iff \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\frac{1}{n^\alpha}|S_n| > \varepsilon\right) < \infty \;\;\text{for all } \varepsilon > 0 \iff \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}\left(\sup_{k\ge n}\frac{1}{k^\alpha}|S_k| > \varepsilon\right) < \infty \;\;\text{for all } \varepsilon > 0 \iff n^{-\alpha}S_n \xrightarrow[n\to\infty]{r\text{-quickly}} 0. \tag{29}
\]
The implications (29) follow from Theorem 1, Theorem 3, and inequality (27); the proof is almost obvious and is omitted.

3. Applications of r-Complete and r-Quick Convergences in Statistics

In this section, we outline certain statistical applications which show the usefulness of r-complete and r-quick versions of the SLLN.

3.1. Sequential Hypothesis Testing

We begin by formulating the following multihypothesis testing problem for a general non-i.i.d. stochastic model. Let $(\Omega, \mathcal{F}, \mathcal{F}_n, \mathsf{P})$, $n \in \mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, be a filtered probability space with the standard assumptions about the monotonicity of the sub-$\sigma$-algebras $\mathcal{F}_n$. The sub-$\sigma$-algebra $\mathcal{F}_n = \sigma(\mathbf{X}^n)$ of $\mathcal{F}$ is assumed to be generated by the sequence $\mathbf{X}^n = \{X_t, 1\le t\le n\}$ observed up to time $n$, which is defined on the space $(\Omega, \mathcal{F})$. The hypotheses are $\mathsf{H}_i : \mathsf{P} = \mathsf{P}_i$, $i = 0, 1, \ldots, N$, where $\mathsf{P}_0, \mathsf{P}_1, \ldots, \mathsf{P}_N$ are given probability measures assumed to be locally mutually absolutely continuous, i.e., their restrictions $\mathsf{P}_i^{\{n\}}$ and $\mathsf{P}_j^{\{n\}}$ to $\mathcal{F}_n$ are equivalent for all $1 \le n < \infty$ and all $i, j = 0, 1, \ldots, N$, $i \neq j$. Let $\mathsf{Q}^{\{n\}}$ be the restriction to $\mathcal{F}_n$ of a $\sigma$-finite measure $\mathsf{Q}$ on $(\Omega, \mathcal{F})$. Under $\mathsf{P}_i$, the sample $\mathbf{X}^n = (X_1, \ldots, X_n)$ has a joint density $p_{i,n}(\mathbf{X}^n)$ with respect to the dominating measure $\mathsf{Q}^{\{n\}}$ for all $n \in \mathbb{Z}_+$, which can be written as
\[ p_{i,n}(\mathbf{X}^n) = \prod_{t=1}^n f_{i,t}(X_t\,|\,\mathbf{X}^{t-1}), \tag{30} \]
where $f_{i,n}(X_n\,|\,\mathbf{X}^{n-1})$, $n \ge 1$, are the corresponding conditional densities.
Define the likelihood ratio (LR) process between the hypotheses $\mathsf{H}_i$ and $\mathsf{H}_j$,
\[ \Lambda_{ij}(n) = \frac{\mathrm{d}\mathsf{P}_i^{\{n\}}}{\mathrm{d}\mathsf{P}_j^{\{n\}}}(\mathbf{X}^n) = \frac{p_{i,n}(\mathbf{X}^n)}{p_{j,n}(\mathbf{X}^n)} = \prod_{t=1}^n \frac{f_{i,t}(X_t\,|\,\mathbf{X}^{t-1})}{f_{j,t}(X_t\,|\,\mathbf{X}^{t-1})}, \]
and the log-likelihood ratio (LLR) process
\[ \lambda_{ij}(n) = \log\Lambda_{ij}(n) = \sum_{t=1}^n \log\frac{f_{i,t}(X_t\,|\,\mathbf{X}^{t-1})}{f_{j,t}(X_t\,|\,\mathbf{X}^{t-1})}, \]
where we set $\Lambda_{ij}(0) = 1$ and $\lambda_{ij}(0) = 0$.
A multihypothesis sequential test is a pair $\delta = (d, T)$, where $T$ is a stopping time with respect to the filtration $\{\mathcal{F}_n\}_{n\in\mathbb{Z}_+}$ and $d = d(\mathbf{X}^T)$ is an $\mathcal{F}_T$-measurable terminal decision function with values in the set $\{0, 1, \ldots, N\}$. Specifically, $d = i$ means that the hypothesis $\mathsf{H}_i$ is accepted upon stopping: $\{d = i\} = \{T < \infty,\ \delta \text{ accepts } \mathsf{H}_i\}$. Let $\alpha_{ij}(\delta) = \mathsf{P}_i(d = j)$, $i \neq j$, $i, j = 0, 1, \ldots, N$, denote the error probabilities of the test $\delta$, i.e., the probabilities of accepting the hypothesis $\mathsf{H}_j$ when $\mathsf{H}_i$ is true.
Introduce the class of tests with error probabilities $\alpha_{ij}(\delta)$ that do not exceed the prespecified numbers $0 < \alpha_{ij} < 1$:
\[ \mathcal{C}(\boldsymbol{\alpha}) = \left\{\delta : \alpha_{ij}(\delta) \le \alpha_{ij} \;\text{ for } i, j = 0, 1, \ldots, N, \; i \neq j\right\}, \tag{31} \]
where $\boldsymbol{\alpha} = (\alpha_{ij})$ is a matrix of given error probabilities, which are positive numbers less than 1.
Let $\mathbb{E}_i$ denote the expectation under the hypothesis $\mathsf{H}_i$ (i.e., under the measure $\mathsf{P}_i$). The goal of the statistician is to find a sequential test that minimizes the expected sample sizes $\mathbb{E}_i[T]$ for all hypotheses $\mathsf{H}_i$, $i = 0, 1, \ldots, N$, at least approximately, say asymptotically for small error probabilities, i.e., as $\alpha_{ij} \to 0$.

3.1.1. Asymptotic Optimality of Wald's SPRT

Assume first that $N = 1$, i.e., that we are dealing with the two hypotheses $\mathsf{H}_0$ and $\mathsf{H}_1$. In the mid-1940s, Wald [11,12] introduced the Sequential Probability Ratio Test (SPRT) for a sequence of i.i.d. observations $X_1, X_2, \ldots$, in which case $f_{i,t}(X_t\,|\,\mathbf{X}^{t-1}) = f_i(X_t)$ in (30) and the LR $\Lambda_{1,0}(n) = \Lambda_n$ is
\[ \Lambda_n = \prod_{t=1}^n \frac{f_1(X_t)}{f_0(X_t)}. \]
After $n$ observations have been made, Wald's SPRT prescribes, for each $n \ge 1$:
\[ \text{stop and accept } \mathsf{H}_1 \text{ if } \Lambda_n \ge A_1; \quad \text{stop and accept } \mathsf{H}_0 \text{ if } \Lambda_n \le A_0; \quad \text{continue sampling if } A_0 < \Lambda_n < A_1, \]
where $A_0 < 1 < A_1$ are two thresholds.
Let $Z_t = \log[f_1(X_t)/f_0(X_t)]$ be the LLR for the observation $X_t$, so that the LLR for the sample $\mathbf{X}^n$ is the sum
\[ \lambda_{10}(n) = \lambda_n = \sum_{t=1}^n Z_t, \quad n = 1, 2, \ldots \]
Let $a_0 = \log A_0 < 0$ and $a_1 = \log A_1 > 0$. The SPRT $\delta_*(a_0, a_1) = (d_*, T_*)$ can be represented in the form
\[ T_*(a_0, a_1) = \inf\left\{n \ge 1 : \lambda_n \notin (a_0, a_1)\right\}, \qquad d_*(a_0, a_1) = \begin{cases} 1 & \text{if } \lambda_{T_*} \ge a_1, \\ 0 & \text{if } \lambda_{T_*} \le a_0. \end{cases} \tag{32} \]
In the case of two hypotheses, the class of tests (31) takes the form
\[ \mathcal{C}(\alpha_0, \alpha_1) = \left\{\delta : \alpha_0(\delta) \le \alpha_0 \text{ and } \alpha_1(\delta) \le \alpha_1\right\}, \]
i.e., it upper-bounds the probabilities of errors of Type 1 (false positive) $\alpha_0(\delta) = \alpha_{0,1}(\delta)$ and Type 2 (false negative) $\alpha_1(\delta) = \alpha_{1,0}(\delta)$, respectively.
Wald's SPRT has an extraordinary optimality property: it minimizes both expected sample sizes $\mathbb{E}_0[T]$ and $\mathbb{E}_1[T]$ in the class of sequential (and non-sequential) tests $\mathcal{C}(\alpha_0, \alpha_1)$ with given error probabilities as long as the observations are i.i.d. under both hypotheses. More specifically, Wald and Wolfowitz [13] proved, using a Bayesian approach, that if $\alpha_0 + \alpha_1 < 1$ and the thresholds $a_0$ and $a_1$ can be selected in such a way that $\alpha_0(\delta_*) = \alpha_0$ and $\alpha_1(\delta_*) = \alpha_1$, then the SPRT $\delta_*$ is strictly optimal in the class $\mathcal{C}(\alpha_0, \alpha_1)$. A rigorous proof of this fundamental result is tedious and involves several delicate technical details. Alternative proofs can be found in [14,15,16,17,18,19].
The strict optimality of the SPRT holds if, and only if, the thresholds are selected so that the error probabilities of the SPRT are exactly equal to the prescribed values $\alpha_0, \alpha_1$, which is usually impossible. Suppose instead that the thresholds $a_0$ and $a_1$ are selected so that
\[ -a_0 \sim \log(1/\alpha_1) \quad\text{and}\quad a_1 \sim \log(1/\alpha_0) \quad \text{as } \alpha_{\max}\to 0. \tag{33} \]
Then
\[ \mathbb{E}_1[T_*] \sim \frac{|\log\alpha_0|}{I_1}, \qquad \mathbb{E}_0[T_*] \sim \frac{|\log\alpha_1|}{I_0} \quad \text{as } \alpha_{\max}\to 0, \tag{34} \]
where $I_1 = \mathbb{E}_1[Z_1]$ and $I_0 = -\mathbb{E}_0[Z_1]$ are the Kullback-Leibler (K-L) information numbers, so that the following asymptotic lower bounds for the ESS are attained by the SPRT:
\[ \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathbb{E}_1[T] \ge \frac{|\log\alpha_0|}{I_1}\,(1 + o(1)), \qquad \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathbb{E}_0[T] \ge \frac{|\log\alpha_1|}{I_0}\,(1 + o(1)) \quad \text{as } \alpha_{\max}\to 0 \]
(cf. [6]). Hereafter $\alpha_{\max} = \max(\alpha_0, \alpha_1)$. The following inequalities for the error probabilities of the SPRT hold in the most general non-i.i.d. case:
\[ \alpha_1(\delta_*) \le e^{a_0}\left[1 - \alpha_0(\delta_*)\right], \qquad \alpha_0(\delta_*) \le e^{-a_1}\left[1 - \alpha_1(\delta_*)\right]. \]
These bounds can be used to guarantee the asymptotic relations (33).
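For concreteness, here is a minimal sketch of the SPRT (32) for two simple Gaussian hypotheses, with the conservative threshold choice $a_1 = \log(1/\alpha_0)$, $a_0 = -\log(1/\alpha_1)$ implied by the error-probability bounds above; the Gaussian model and all parameter values are our illustrative assumptions.

```python
import numpy as np

# Minimal SPRT sketch for H0: N(0,1) vs H1: N(theta,1), i.i.d. data.
# Thresholds follow the conservative choice a1 = |log alpha0|, a0 = -|log alpha1|
# suggested by the error-probability bounds above (our illustrative choice).
def sprt(x_stream, theta=0.5, alpha0=0.01, alpha1=0.01):
    a1, a0 = np.log(1 / alpha0), -np.log(1 / alpha1)
    lam = 0.0                     # LLR lambda_n
    for n, x in enumerate(x_stream, start=1):
        lam += theta * x - theta**2 / 2   # Z_t = log f1(x)/f0(x) for unit variance
        if lam >= a1:
            return n, 1           # stop, accept H1
        if lam <= a0:
            return n, 0           # stop, accept H0
    return None, None             # no decision within the stream

rng = np.random.default_rng(2)
n_stop, decision = sprt(rng.normal(0.5, 1.0, size=10_000))
print(f"stopped at n={n_stop}, accepted H{decision}")
```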
In the i.i.d. case, by the SLLN, the LLR $\lambda_n$ has the following stability property:
\[ n^{-1}\lambda_n \xrightarrow[n\to\infty]{\mathsf{P}_1\text{-a.s.}} I_1, \qquad n^{-1}(-\lambda_n) \xrightarrow[n\to\infty]{\mathsf{P}_0\text{-a.s.}} I_0. \tag{36} \]
This allows one to conjecture that if, in the general non-i.i.d. case, the LLR is also stable in the sense that the almost sure convergence conditions (36) are satisfied with some positive and finite numbers $I_1$ and $I_0$, then the asymptotic formulas (34) still hold. In the general case, these numbers represent the local K-L information in the sense that often (while not always) $I_1 = \lim_{n\to\infty} n^{-1}\mathbb{E}_1[\lambda_n]$ and $I_0 = -\lim_{n\to\infty} n^{-1}\mathbb{E}_0[\lambda_n]$. Note, however, that in the general non-i.i.d. case the SLLN does not even guarantee the finiteness of the expected sample sizes $\mathbb{E}_i[T_*]$ of the SPRT, so some additional conditions are needed, such as a certain rate of convergence in the strong law, e.g., complete or quick convergence.
In 1981, Lai [8] was the first to prove asymptotic optimality of Wald's SPRT in a general non-i.i.d. case as $\alpha_{\max} = \max(\alpha_0, \alpha_1) \to 0$. While the motivation was near optimality of invariant SPRTs with respect to nuisance parameters, Lai proved a more general result using the r-quick convergence concept. Specifically, for $i = 0, 1$ and $0 < I_i < \infty$, define
\[ L_1(\varepsilon) = \sup\left\{n \ge 1 : \left|\frac{\lambda_n}{n} - I_1\right| \ge \varepsilon\right\} \quad\text{and}\quad L_0(\varepsilon) = \sup\left\{n \ge 1 : \left|\frac{\lambda_n}{n} + I_0\right| \ge \varepsilon\right\} \]
($\sup\{\varnothing\} = 0$) and suppose that $\mathbb{E}_i[L_i(\varepsilon)^r] < \infty$ for some $r > 0$ and every $\varepsilon > 0$, i.e., that the normalized LLR converges $r$-quickly to $I_1$ under $\mathsf{P}_1$ and to $-I_0$ under $\mathsf{P}_0$:
\[ n^{-1}\lambda_n \xrightarrow[n\to\infty]{\mathsf{P}_1\text{-}r\text{-quickly}} I_1 \quad\text{and}\quad n^{-1}\lambda_n \xrightarrow[n\to\infty]{\mathsf{P}_0\text{-}r\text{-quickly}} -I_0. \tag{37} \]
Strengthening the a.s. convergence (36) into the $r$-quick version (37), Lai [8] established first-order asymptotic optimality of Wald's SPRT for the moments of the stopping time distribution up to order $r$: if the thresholds $a_i(\alpha_0, \alpha_1)$, $i = 0, 1$, in the SPRT are selected so that $\delta_*(a_0, a_1) \in \mathcal{C}(\alpha_0, \alpha_1)$ and the asymptotics (33) hold, then, as $\alpha_{\max}\to 0$,
\[ \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathbb{E}_1[T^r] \sim \left(\frac{|\log\alpha_0|}{I_1}\right)^r \sim \mathbb{E}_1[T_*^r], \qquad \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathbb{E}_0[T^r] \sim \left(\frac{|\log\alpha_1|}{I_0}\right)^r \sim \mathbb{E}_0[T_*^r]. \tag{38} \]
Wald's ideas have been generalized in many publications to construct sequential tests of composite hypotheses with nuisance parameters when these hypotheses can be reduced to simple ones by the principle of invariance. If $M_n$ is the maximal invariant statistic and $p_i(M_n)$ is the density of this statistic under hypothesis $\mathsf{H}_i$, then the invariant SPRT is defined as in (32) with the LLR $\lambda_n = \log[p_1(M_n)/p_0(M_n)]$. But even if the observations $X_1, X_2, \ldots$ are i.i.d., the invariant LLR statistic $\lambda_n$ is no longer a random walk, and Wald's methods cannot be applied directly. Lai [8] applied the asymptotic optimality property (38) of Wald's SPRT in the non-i.i.d. case to investigate optimality properties of several classical invariant SPRTs, such as the sequential $t$-test, the sequential $T^2$-test, and Savage's rank-order test.
In the sequel, we will call the case where the a.s. convergence in the non-i.i.d. model (36) holds with the linear normalization $\psi(n) = n$ asymptotically stationary. Assume now that (36) is generalized to
\[ \frac{\lambda_n}{\psi(n)} \xrightarrow[n\to\infty]{\mathsf{P}_1\text{-a.s.}} I_1, \qquad \frac{-\lambda_n}{\psi(n)} \xrightarrow[n\to\infty]{\mathsf{P}_0\text{-a.s.}} I_0, \tag{39} \]
where $\psi(t)$ is a positive increasing function. If $\psi(t)$ is not linear, this case will be referred to as asymptotically non-stationary. A simple example where this generalization is needed is testing $\mathsf{H}_0$ versus $\mathsf{H}_1$ regarding the mean of a normal distribution:
\[ X_n = i\,S_n + \xi_n, \quad n \in \mathbb{Z}_+, \quad i = 0, 1, \]
where $\{\xi_n\}_{n\ge 1}$ is a zero-mean i.i.d. standard Gaussian sequence $\mathcal{N}(0,1)$ and $S_n = \sum_{j=0}^k c_j n^j$ is a polynomial of order $k > 1$. Then
\[ \lambda_n = \sum_{t=1}^n S_t X_t - \frac{1}{2}\sum_{t=1}^n S_t^2, \]
and $\mathbb{E}_1[\lambda_n] = -\mathbb{E}_0[\lambda_n] = \frac{1}{2}\sum_{t=1}^n S_t^2 \sim \frac{c_k^2\,n^{2k+1}}{2(2k+1)}$ for large $n$, so that one can take $\psi(n) = n^{2k+1}$ and $I_1 = I_0 = c_k^2/(2(2k+1))$ in (39). This example is of interest in certain practical applications, in particular for the recognition of ballistic objects and satellites [19].
Tartakovsky et al. [6] generalized Lai's results to the asymptotically non-stationary case. Write $\Psi(t)$ for the inverse function of $\psi(t)$.
Theorem 4.
Assume that there exist finite positive numbers $I_0$ and $I_1$ and an increasing nonnegative function $\psi(t)$ such that the r-quick convergence conditions
\[ \frac{\lambda_n}{\psi(n)} \xrightarrow[n\to\infty]{\mathsf{P}_1\text{-}r\text{-quickly}} I_1, \qquad \frac{-\lambda_n}{\psi(n)} \xrightarrow[n\to\infty]{\mathsf{P}_0\text{-}r\text{-quickly}} I_0 \]
hold. If the thresholds $a_0(\alpha_0, \alpha_1)$ and $a_1(\alpha_0, \alpha_1)$ are selected so that $\delta_*(a_0, a_1) \in \mathcal{C}(\alpha_0, \alpha_1)$, $-a_0 \sim |\log\alpha_1|$ and $a_1 \sim |\log\alpha_0|$, then, as $\alpha_{\max}\to 0$,
\[ \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathbb{E}_1[T^r] \sim \left[\Psi\left(\frac{|\log\alpha_0|}{I_1}\right)\right]^r \sim \mathbb{E}_1[T_*^r], \qquad \inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\mathbb{E}_0[T^r] \sim \left[\Psi\left(\frac{|\log\alpha_1|}{I_0}\right)\right]^r \sim \mathbb{E}_0[T_*^r]. \tag{40} \]
This theorem implies that the SPRT asymptotically minimizes the moments of the stopping time distribution up to the order r.
The proof of this theorem proceeds in two steps, which are related to our previous discussion of the rates of convergence in Section 2. The first step is to obtain the asymptotic lower bounds in the class $\mathcal{C}(\alpha_0, \alpha_1)$:
\[ \liminf_{\alpha_{\max}\to 0}\;\inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\frac{\mathbb{E}_1[T^r]}{\left[\Psi\left(|\log\alpha_0|/I_1\right)\right]^r} \ge 1, \qquad \liminf_{\alpha_{\max}\to 0}\;\inf_{\delta\in\mathcal{C}(\alpha_0,\alpha_1)}\frac{\mathbb{E}_0[T^r]}{\left[\Psi\left(|\log\alpha_1|/I_0\right)\right]^r} \ge 1. \]
These bounds hold whenever the following right-tail conditions for the LLR are satisfied:
\[ \lim_{M\to\infty}\mathsf{P}_1\left(\frac{1}{\psi(M)}\max_{1\le n\le M}\lambda_n < (1+\varepsilon)I_1\right) = 1, \qquad \lim_{M\to\infty}\mathsf{P}_0\left(\frac{1}{\psi(M)}\max_{1\le n\le M}(-\lambda_n) < (1+\varepsilon)I_0\right) = 1. \]
Note that by Lemma 1 these conditions are satisfied when the SLLN (39) holds, so the almost sure convergence (39) is sufficient for the lower bounds. However, as we already mentioned, the SLLN for the LLR is not sufficient to guarantee even the finiteness of the expected sample size of the SPRT.
The second step is to show that the lower bounds are attained by the SPRT. To this end, it suffices to impose the following additional left-tail conditions:
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}_1\left(\lambda_n \le (I_1 - \varepsilon)\psi(n)\right) < \infty, \qquad \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}_0\left(-\lambda_n \le (I_0 - \varepsilon)\psi(n)\right) < \infty \]
for all $0 < \varepsilon < \min(I_0, I_1)$. Both the right-tail and the left-tail conditions hold if the LLR converges $r$-completely, i.e., if
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}_1\left(\left|\frac{\lambda_n}{\psi(n)} - I_1\right| > \varepsilon\right) < \infty, \qquad \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}_0\left(\left|\frac{\lambda_n}{\psi(n)} + I_0\right| > \varepsilon\right) < \infty \quad \text{for all } \varepsilon > 0, \]
and since $r$-quick convergence implies $r$-complete convergence (see (13)), we conclude that the assertions (40) hold.
Remark 4.
In the i.i.d. case, Wald's approach allows one to establish the asymptotic equalities (40) with $I_1 = \mathbb{E}_1[\lambda_1]$ and $I_0 = -\mathbb{E}_0[\lambda_1]$ being the K-L information numbers, under the sole condition that the $I_i$ are finite. However, Wald's approach breaks down in the non-i.i.d. case. Certain generalizations in the case of independent but non-identically distributed and substantially non-stationary observations, extending Wald's ideas, have been considered in [19,20,21,22]. Theorem 4 covers all these non-stationary models.
Fellouris and Tartakovsky [23] extended the previous results on asymptotic optimality of the SPRT to the multistream hypothesis testing problem, where the observations are acquired sequentially in multiple data streams (or channels, or sources). The problem is to test the null hypothesis $\mathsf{H}_0$ that none of the $N$ streams is affected against the composite hypothesis $\mathsf{H}_B$ that a subset $B \subseteq \{1, \ldots, N\}$ of streams is affected. Two sequential tests were studied in [23]: the Generalized Sequential Likelihood Ratio Test and the Mixture Sequential Likelihood Ratio Test. It was shown that both tests are first-order asymptotically optimal, minimizing the moments of the sample size $\mathbb{E}_0[T^r]$ and $\mathbb{E}_B[T^r]$ for all $B \in \mathcal{P}$ up to order $r$ as $\max(\alpha_0, \alpha_1) \to 0$ in the class of tests
\[ \mathcal{C}_{\mathcal{P}}(\alpha_0, \alpha_1) = \left\{\delta : \mathsf{P}_0(d = 1) \le \alpha_0 \;\text{ and }\; \max_{B\in\mathcal{P}}\mathsf{P}_B(d = 0) \le \alpha_1\right\}, \quad 0 < \alpha_i < 1, \]
where $\mathsf{P}_B$ is the distribution of the observations under hypothesis $\mathsf{H}_B$ and $\mathcal{P}$ is a class of subsets of $\{1, \ldots, N\}$ that incorporates available prior information regarding the subset of affected streams, e.g., that no more than $K < N$ streams can be affected. The proof is essentially based on the concept of $r$-complete convergence of the LLR with the linear normalization $n$. See also Chapter 1 in [5].

3.1.2. Asymptotic Optimality of the Multihypothesis SPRT

We now return to the multihypothesis model with $N > 1$ that we started to discuss at the beginning of this section (see (30) and (31)). The problem of sequential testing of many hypotheses is substantially more difficult than that of testing two hypotheses. For multiple-decision testing problems, it is usually very difficult, if at all possible, to obtain optimal solutions. Finding an optimal non-Bayesian test in the class of tests (31) that minimizes the ESS $\mathbb{E}_i[T]$ for all hypotheses $\mathsf{H}_i$, $i = 0, 1, \ldots, N$, is not manageable even in the i.i.d. case. For this reason, a substantial part of the development of sequential multihypothesis testing in the 20th century was directed towards the study of certain combinations of one-sided sequential probability ratio tests when the observations are i.i.d. (see, e.g., [24,25,26,27,28,29]).
We will focus on the following first-order asymptotic criterion: find a multihypothesis test $\delta_*(\boldsymbol{\alpha}) = (d_*(\boldsymbol{\alpha}), T_*(\boldsymbol{\alpha}))$ such that for some $r > 0$
\[ \lim_{\alpha_{\max}\to 0}\frac{\inf_{\delta\in\mathcal{C}(\boldsymbol{\alpha})}\mathbb{E}_i[T^r]}{\mathbb{E}_i[T_*(\boldsymbol{\alpha})^r]} = 1 \quad \text{for all } i = 0, 1, \ldots, N, \tag{41} \]
where $\alpha_{\max} = \max_{0\le i,j\le N,\, i\neq j}\alpha_{ij}$.
In 1998, Tartakovsky [4] was the first to consider sequential multihypothesis testing problems for general non-i.i.d. stochastic models, following Lai's idea of exploiting the r-quick convergence in the SLLN for two hypotheses. The results were obtained for both discrete- and continuous-time scenarios and for the asymptotically non-stationary case where the LLR processes between hypotheses converge to finite numbers with normalization $1/\psi(t)$. Two multihypothesis tests were investigated: (1) the Rejecting test, which rejects the hypotheses one by one, and the last hypothesis that is not rejected is accepted; and (2) the Matrix Accepting test, which accepts a hypothesis for which all component SPRTs involving this hypothesis vote for accepting it. We now proceed with introducing this accepting test, which we will refer to as the Matrix SPRT (MSPRT). In the present article, we do not consider the continuous-time scenarios; readers interested in continuous time are referred to [4,6,20,22,30].
Write $\mathcal{N} = \{0, 1, \ldots, N\}$. For a threshold matrix $(A_{ij})_{i,j\in\mathcal{N}}$ with $A_{ij} > 0$ (the $A_{ii}$ are immaterial, say 0), define the Matrix SPRT $\delta_*^N = (T_*^N, d_*^N)$, built on $(N+1)N/2$ one-sided SPRTs between the hypotheses $\mathsf{H}_i$ and $\mathsf{H}_j$, as follows:
\[ \text{stop at the first } n \ge 1 \text{ such that, for some } i, \;\; \Lambda_{ij}(n) \ge A_{ji} \;\text{ for all } j \neq i, \tag{42} \]
and accept the unique $\mathsf{H}_i$ that satisfies these inequalities. Note that for $N = 1$ the MSPRT coincides with Wald's SPRT.
In the following, we omit the superscript $N$ in $\delta_*^N = (T_*^N, d_*^N)$ for brevity. Obviously, with $a_{ji} = \log A_{ji}$, the MSPRT in (42) can be written as
\[ T_* = \inf\left\{n \ge 1 : \lambda_{ij}(n) \ge a_{ji} \;\text{ for all } j \neq i \text{ and some } i\right\}, \tag{43} \]
\[ d_* = i \;\text{ for which (43) holds}. \tag{44} \]
Introducing the Markov accepting times for the hypotheses $\mathsf{H}_i$ as
\[ T_i = \inf\left\{n \ge 1 : \lambda_{i0}(n) \ge \max_{\substack{0\le j\le N \\ j\neq i}}\left[\lambda_{j0}(n) + a_{ji}\right]\right\}, \quad i = 0, 1, \ldots, N \]
(with $\lambda_{00}(n) \equiv 0$), the test in (43)-(44) can also be written in the following form:
\[ T_* = \min_{0\le j\le N} T_j, \qquad d_* = i \;\text{ if } T_* = T_i. \]
Thus, in the MSPRT, sampling continues until, for some $i \in \mathcal{N}$, all $N$ SPRTs involving $\mathsf{H}_i$ accept $\mathsf{H}_i$.
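The rule (43)-(44) is straightforward to implement; the following sketch runs the MSPRT for $N+1$ simple Gaussian mean hypotheses (the Gaussian model, the common threshold $a_{ji} \equiv a$, and all parameter values are our illustrative assumptions).

```python
import numpy as np

# Sketch of the MSPRT stopping rule (43)-(44) for N+1 simple hypotheses
# about the mean of i.i.d. N(theta_i, 1) data (Gaussian model and a common
# symmetric threshold a are our illustrative assumptions).
def msprt(x_stream, thetas, a):
    """Return (n, i): the stopping time and the accepted hypothesis index."""
    thetas = np.asarray(thetas, dtype=float)
    lam = np.zeros(len(thetas))                 # lambda_{i0}(n) for each i
    for n, x in enumerate(x_stream, start=1):
        lam += thetas * x - thetas**2 / 2       # LLR increments of H_i versus H_0
        for i in range(len(thetas)):
            # accept H_i when lambda_{ij}(n) = lam[i] - lam[j] >= a for all j != i
            if all(lam[i] - lam[j] >= a for j in range(len(thetas)) if j != i):
                return n, i
    return None, None

rng = np.random.default_rng(4)
data = rng.normal(1.0, 1.0, size=50_000)        # true mean matches hypothesis 1
n_stop, accepted = msprt(data, thetas=[0.0, 1.0, 2.0], a=np.log(100))
print(f"stopped at n={n_stop}, accepted hypothesis {accepted}")
```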
Using Wald's likelihood ratio identity, it is easily shown that $\alpha_{ij}(\delta_*) \le \exp(-a_{ij})$ for $i, j \in \mathcal{N}$, $i \neq j$, so selecting $a_{ji} = |\log\alpha_{ji}|$ implies $\delta_* \in \mathcal{C}(\boldsymbol{\alpha})$. These inequalities are similar to Wald's inequalities in the binary hypothesis case and are rather imprecise. In his ingenious paper, Lorden [28] showed that, with a very sophisticated design that includes accurate estimation of thresholds accounting for overshoots, the MSPRT is nearly optimal in the third-order sense, i.e., it minimizes the ESS for all hypotheses up to an additive vanishing term: $\inf_{\delta\in\mathcal{C}(\boldsymbol{\alpha})}\mathbb{E}_i[T] = \mathbb{E}_i[T_*] + o(1)$ as $\alpha_{\max}\to 0$. This result holds only for i.i.d. models with finite second moments $\mathbb{E}_i[\lambda_{ij}(1)^2] < \infty$. In non-i.i.d. cases (and even in the i.i.d. case for higher moments $r > 1$), there is no way to obtain such a result, so we focus on the first-order optimality (41).
The following theorem establishes the asymptotic operating characteristics and the optimality of the MSPRT under the $r$-quick convergence of $\lambda_{ij}(n)/\psi(n)$ to finite K-L-type numbers $I_{ij}$, where $\psi(n)$ is a positive increasing function with $\psi(\infty) = \infty$.
Theorem 5
(MSPRT asymptotic optimality). Assume that there exist finite positive numbers $I_{ij}$, $i, j = 0, 1, \ldots, N$, $i \neq j$, and an increasing nonnegative function $\psi(t)$ such that for some $r > 0$
\[ \frac{\lambda_{ij}(n)}{\psi(n)} \xrightarrow[n\to\infty]{\mathsf{P}_i\text{-}r\text{-quickly}} I_{ij} \quad \text{for all } i, j = 0, 1, \ldots, N, \; i \neq j. \]
Then the following assertions are true.
(i) 
For $i = 0, 1, \ldots, N$,
\[ \mathbb{E}_i[T_*^r] \sim \left[\Psi\left(\max_{\substack{0\le j\le N \\ j\neq i}}\frac{a_{ji}}{I_{ij}}\right)\right]^r \quad \text{as } \min_{j,i} a_{ji} \to\infty. \]
(ii) 
If the thresholds are selected so that $\alpha_{ij}(\delta_*) \le \alpha_{ij}$ and $a_{ji} \sim |\log\alpha_{ji}|$, in particular as $a_{ji} = |\log\alpha_{ji}|$, then for all $i = 0, 1, \ldots, N$,
\[ \inf_{\delta\in\mathcal{C}(\boldsymbol{\alpha})}\mathbb{E}_i[T^r] \sim \left[\Psi\left(\max_{\substack{0\le j\le N \\ j\neq i}}\frac{|\log\alpha_{ji}|}{I_{ij}}\right)\right]^r \sim \mathbb{E}_i[T_*^r] \quad \text{as } \alpha_{\max}\to 0. \]
Assertion (ii) implies that the MSPRT asymptotically minimizes the moments of the stopping time distribution up to order $r$ for all hypotheses $\mathsf{H}_0, \mathsf{H}_1, \ldots, \mathsf{H}_N$ in the class of tests $\mathcal{C}(\boldsymbol{\alpha})$.
Remark 5.
Both assertions of Theorem 5 remain correct under the r-complete convergence
\[ \frac{\lambda_{ij}(n)}{\psi(n)} \xrightarrow[n\to\infty]{\mathsf{P}_i\text{-}r\text{-completely}} I_{ij} \quad \text{for all } i, j = 0, 1, \ldots, N, \; i \neq j, \]
i.e., whenever
\[ \sum_{n=1}^\infty n^{r-1}\,\mathsf{P}_i\left(\left|\frac{\lambda_{ij}(n)}{\psi(n)} - I_{ij}\right| > \varepsilon\right) < \infty \quad \text{for all } \varepsilon > 0 \text{ and all } i \neq j. \]
While this statement has not been formally proved anywhere so far, it can easily be proved using the methods developed for multistream hypothesis testing and changepoint detection [5].
Remark 6.
As the example given in Subsection 3.4.3 of [6] shows, the r-quick convergence conditions in Theorem 5 (or the corresponding r-complete convergence conditions for the LLR processes) cannot in general be relaxed to the almost sure convergence
\[ \frac{\lambda_{ij}(n)}{\psi(n)} \xrightarrow[n\to\infty]{\mathsf{P}_i\text{-a.s.}} I_{ij} \quad \text{for all } i, j = 0, 1, \ldots, N, \; i \neq j. \tag{50} \]
However, the following weak asymptotic optimality result holds for the MSPRT under the a.s. convergence: if the a.s. convergence (50) holds with a power function $\psi(t) = t^k$, $k > 0$, then for every $0 < \varepsilon < 1$,
\[ \inf_{\delta\in\mathcal{C}(\boldsymbol{\alpha})}\mathsf{P}_i\left(T > \varepsilon T_*\right) \to 1 \quad \text{as } \alpha_{\max}\to 0, \;\text{ for all } i = 0, 1, \ldots, N, \]
whenever the thresholds $a_{ji}$ are selected as in Theorem 5(ii).
Note that several interesting statistical and practical applications of these results to invariant sequential testing and multisample slippage scenarios are discussed in Sections 4.5 and 4.6 of Tartakovsky et al. [6] (see Mosteller [31] and Ferguson [16] for terminology regarding multisample slippage problems).

3.2. Sequential Changepoint Detection

Sequential changepoint detection (or quickest disorder detection) is an important branch of Sequential Analysis. In the sequential setting, one assumes that the observations are made successively, one at a time, and as long as their behavior suggests that the process of interest is in a normal state, the process is allowed to continue; if the state is believed to have become anomalous, the goal is to detect the change in distribution as rapidly as possible. Quickest change detection problems have an enormous number of important applications, e.g., object detection in noise and clutter, industrial quality control, environment surveillance, failure detection, navigation, seismology, computer network security, genomics, epidemiology (see, e.g., [32,33,34,35,36,37,38,39,40,41]). Several challenging application areas are discussed in the books by Tartakovsky, Nikiforov, and Basseville [6] and Tartakovsky [5].

3.2.1. Changepoint Models

The probability distribution of the observations $\mathbf{X} = \{X_n\}_{n\ge 1}$, which are acquired sequentially in time, is subject to a change at an unknown point in time $\nu \in \{0, 1, 2, \ldots\}$, so that $X_1, \ldots, X_\nu$ are generated by one stochastic model and $X_{\nu+1}, X_{\nu+2}, \ldots$ by another model. A sequential detection rule is a stopping time $T$ for the observed sequence $\{X_n\}_{n\ge 1}$, i.e., $T$ is an integer-valued random variable such that the event $\{T = n\}$ belongs to the sigma-algebra $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$ generated by the observations $X_1, \ldots, X_n$.
Let $\mathsf{P}_\infty$ denote the probability measure corresponding to the sequence of observations $\{X_n\}_{n\ge 1}$ when there is never a change ($\nu = \infty$) and, for $k = 0, 1, \ldots$, let $\mathsf{P}_k$ denote the measure corresponding to the sequence $\{X_n\}_{n\ge 1}$ when $\nu = k < \infty$. By $\mathsf{H}_\infty : \nu = \infty$ we denote the hypothesis that the change never occurs, and by $\mathsf{H}_k : \nu = k$ the hypothesis that the change occurs at time $0 \le k < \infty$.
Consider first a general non-i.i.d. model in which the observations may have a very general stochastic structure. Specifically, if, as before, $\mathbf{X}^n = (X_1, \ldots, X_n)$ denotes the sample of size $n$, then when $\nu = \infty$ (there is no change) the conditional density of $X_n$ given $\mathbf{X}^{n-1}$ is $g_n(X_n|\mathbf{X}^{n-1})$ for all $n \ge 1$, and when $\nu = k < \infty$, the conditional density of $X_n$ given $\mathbf{X}^{n-1}$ is $g_n(X_n|\mathbf{X}^{n-1})$ for $n \le k$ and $f_n(X_n|\mathbf{X}^{n-1})$ for $n > k$. Thus, for the general non-i.i.d. changepoint model, the joint density $p(\mathbf{X}^n|\mathsf{H}_k)$ under hypothesis $\mathsf{H}_k$ can be written as follows:
\[ p(\mathbf{X}^n|\mathsf{H}_k) = \begin{cases} \prod_{t=1}^n g_t(X_t|\mathbf{X}^{t-1}) & \text{for } \nu = k \ge n, \\[2pt] \prod_{t=1}^k g_t(X_t|\mathbf{X}^{t-1}) \times \prod_{t=k+1}^n f_t(X_t|\mathbf{X}^{t-1}) & \text{for } \nu = k < n, \end{cases} \tag{52} \]
where $g_n(X_n|\mathbf{X}^{n-1})$ is the pre-change conditional density and $f_n(X_n|\mathbf{X}^{n-1})$ is the post-change conditional density, which may depend on $\nu$, $f_n(X_n|\mathbf{X}^{n-1}) = f_n^{(\nu)}(X_n|\mathbf{X}^{n-1})$, but we omit the superscript $\nu$ for brevity.
The classical changepoint detection problem deals with the i.i.d. case, where the observations $X_1, X_2, \ldots$ are identically distributed with a probability density function (pdf) $g(x)$ for $n \le \nu$ and with a pdf $f(x)$ for $n > \nu$. That is, in the i.i.d. case, the joint density of the vector $\mathbf{X}^n = (X_1, \ldots, X_n)$ under hypothesis $\mathsf{H}_k$ in (52) simplifies to
\[ p(\mathbf{X}^n|\mathsf{H}_k) = \begin{cases} \prod_{t=1}^n g(X_t) & \text{for } \nu = k \ge n, \\[2pt] \prod_{t=1}^k g(X_t) \times \prod_{t=k+1}^n f(X_t) & \text{for } \nu = k < n. \end{cases} \tag{53} \]
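For simulation purposes, the i.i.d. model (53) is straightforward to sample from; the helper below (a sketch with an assumed Gaussian mean shift as the change) generates one path with change point $\nu$.

```python
import numpy as np

# Sample a path from the i.i.d. changepoint model (53): pre-change pdf g = N(0,1),
# post-change pdf f = N(theta,1). The Gaussian mean-shift choice is our assumption.
def sample_change_path(nu, n, theta=1.0, rng=None):
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(n)      # X_1,...,X_n ~ g
    if nu < n:
        x[nu:] += theta             # X_{nu+1},...,X_n ~ f (0-based slice)
    return x
```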
Note that, as discussed in [5,6], in applications, there are two different kinds of changes – additive and non-additive. Additive changes lead to a change in the mean value of the sequence of observations. Non-additive changes are typically produced by a change in variance or covariance, i.e., these are spectral changes.
We now proceed with discussing models for the change point $\nu$. The change point may be considered either as an unknown deterministic number or as a random variable. If the change point is treated as a random variable, then the model has to be supplied with a prior distribution for it. There may be several changepoint mechanisms and, as a result, the random variable $\nu$ may be partially or completely dependent on the observations, or independent of them. To account for these possibilities at once, let $\pi_{-1} = \Pr(\nu < 0)$ and $\pi_k = \Pr(\nu = k\,|\,\mathbf{X}^k)$, $k \ge 0$, and observe that $\pi_k$, $k = 1, 2, \ldots$, are $\mathcal{F}_k$-adapted. That is, the probability of a change occurring at the time instant $\nu = k$ depends on $\mathbf{X}^k$, the history of observations accumulated up to time $k$. The probability $\pi_{-1} + \pi_0 = \Pr(\nu \le 0)$ represents the probability of the "atom" associated with the event that the change took place before the observations became available. With the so-defined prior distribution, one can describe very general changepoint models, including those that assume $\nu$ to be an $\{\mathcal{F}_n\}$-adapted stopping time (see Moustakides [42]). In this article, we will not discuss Moustakides's concept of allowing the prior distribution to depend on some additional information available to "Nature" (see [5] for a detailed discussion); rather, when considering a Bayesian approach, we will assume that the prior distribution of the unknown change point is independent of the observations.

3.2.2. Popular Changepoint Detection Procedures

Before formulating the optimality criteria in the next subsection, we define the three most popular and common change detection procedures, which are either optimal or nearly optimal in different settings. To define these procedures we need to introduce the partial likelihood ratio and the corresponding log-likelihood ratio:
\[ \mathrm{LR}_t = \frac{f_t(X_t|\mathbf{X}^{t-1})}{g_t(X_t|\mathbf{X}^{t-1})}, \qquad Z_t = \log\frac{f_t(X_t|\mathbf{X}^{t-1})}{g_t(X_t|\mathbf{X}^{t-1})}, \quad t = 1, 2, \ldots \]
It is worth reiterating that for general non-i.i.d. models the post-change density often depends on the change point, $f_t(X_t|\mathbf{X}^{t-1}) = f_t^{(\nu)}(X_t|\mathbf{X}^{t-1})$, so in general $\mathrm{LR}_t = \mathrm{LR}_t(\nu)$ and $Z_t = Z_t(\nu)$ also depend on the change point $\nu$. However, this is not the case for the i.i.d. model (53).

The CUSUM Procedure 

We now introduce the Cumulative Sum (CUSUM) algorithm, which was first proposed by Page [43] for the i.i.d. model (53). Recall that we consider the changepoint detection problem as a problem of testing two hypotheses: $\mathsf{H}_\nu$ that the change occurs at a fixed point $0 \le \nu < \infty$ against the alternative $\mathsf{H}_\infty$ that the change never occurs. The LR between these hypotheses is $\Lambda_n^\nu = \prod_{t=\nu+1}^n \mathrm{LR}_t$ for $\nu < n$ and 1 for $\nu \ge n$. Since the hypothesis $\mathsf{H}_\nu$ is composite, we may apply the generalized likelihood ratio (GLR) approach, maximizing the LR $\Lambda_n^\nu$ over $\nu$ to obtain the GLR statistic
\[ V_n = \max_{0\le\nu<n}\prod_{t=\nu+1}^n \mathrm{LR}_t, \quad n \ge 1. \]
It is easy to verify that this statistic obeys the recursion
\[ V_n = \max\{1, V_{n-1}\}\,\mathrm{LR}_n, \quad n \ge 1, \qquad V_0 = 1, \tag{54} \]
as long as the partial LR $\mathrm{LR}_n$ does not depend on the change point, i.e., the post-change conditional density $f_n(X_n|\mathbf{X}^{n-1})$ does not depend on $\nu$. This is always the case for the i.i.d. model (53), where $f_n(X_n|\mathbf{X}^{n-1}) = f(X_n)$. However, as we already mentioned, for non-i.i.d. models $f_n(X_n|\mathbf{X}^{n-1}) = f_n^{(\nu)}(X_n|\mathbf{X}^{n-1})$ often depends on the change point $\nu$, so $\mathrm{LR}_n = \mathrm{LR}_n(\nu)$, in which case the recursion (54) does not hold.
The logarithmic version of $V_n$, $W_n = \log V_n$, is related to Page's CUSUM statistic $G_n$, introduced by Page [43] in the i.i.d. case, via $G_n = \max(0, W_n)$. In fact, the statistic $G_n$ can also be obtained via the GLR approach by maximizing the LLR $\lambda_n^\nu = \log\Lambda_n^\nu$ over $0 \le \nu < \infty$. However, since the hypotheses $\mathsf{H}_\infty$ and $\mathsf{H}_\nu$ are indistinguishable for $\nu \ge n$, the maximization over $\nu \ge n$ does not make much sense. Note also that, in contrast to Page's CUSUM statistic $G_n$, the statistic $W_n$ may take values smaller than 0, so the CUSUM procedure
\[ T_{\mathrm{CS}} = \inf\{n \ge 1 : W_n \ge a\} \tag{55} \]
makes sense even for negative values of the threshold $a$; thus, it is more general than Page's CUSUM. Note the recursions
\[ W_n = W_{n-1}^+ + Z_n, \quad n \ge 1, \qquad W_0 = 0, \]
and
\[ G_n = \left(G_{n-1} + Z_n\right)^+, \quad n \ge 1, \qquad G_0 = 0, \]
in the case where $Z_n = \log[f_n(X_n|\mathbf{X}^{n-1})/g_n(X_n|\mathbf{X}^{n-1})]$ does not depend on $\nu$.
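In the i.i.d. Gaussian mean-shift case (an assumption we make for illustration), the recursion for $W_n$ and the stopping rule (55) take only a few lines.

```python
import numpy as np

# Sketch of the CUSUM recursion W_n = max(W_{n-1}, 0) + Z_n and the stopping
# rule (55) for the i.i.d. Gaussian mean-shift model (our assumption):
# Z_t = log f(X_t)/g(X_t) = theta*X_t - theta^2/2 for g = N(0,1), f = N(theta,1).
def cusum(x, theta=1.0, a=np.log(500)):
    w = 0.0
    for n, xt in enumerate(x, start=1):
        w = max(w, 0.0) + theta * xt - theta**2 / 2   # W_n recursion
        if w >= a:
            return n                                   # alarm time T_CS
    return None                                        # no alarm

rng = np.random.default_rng(5)
nu = 300   # change point of the simulated path
x = np.concatenate([rng.standard_normal(nu), 1.0 + rng.standard_normal(1700)])
print("CUSUM alarm at n =", cusum(x))
```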

Shiryaev’s Procedure 

In the i.i.d. case and for the zero-modified geometric prior distribution of the change point, Shiryaev [44] introduced the change detection procedure that thresholds the posterior probability $\mathsf{P}(\nu < n\,|\,\mathbf{X}^n)$. Introducing the statistic
\[ S_n^\pi = \frac{\mathsf{P}(\nu < n\,|\,\mathbf{X}^n)}{1 - \mathsf{P}(\nu < n\,|\,\mathbf{X}^n)}, \]
one can write the stopping time of the Shiryaev procedure, in the general non-i.i.d. case and for an arbitrary prior $\pi$, as
\[ T_{\mathrm{SH}} = \inf\left\{n \ge 1 : S_n^\pi \ge A\right\}, \tag{57} \]
where $A$ is a threshold controlling the false alarm risk. Write $\pi_{-1} = \mathsf{P}(\nu < 0) = p$, $p \in [0, 1)$. The statistic $S_n^\pi$ can be written as
\[ S_n^\pi = \frac{1}{\mathsf{P}(\nu \ge n)}\left(p\,\Lambda_n^0 + \sum_{k=0}^{n-1}\pi_k\,\Lambda_n^k\right) = \frac{1}{\mathsf{P}(\nu \ge n)}\left(p\prod_{t=1}^n \mathrm{LR}_t + \sum_{k=0}^{n-1}\pi_k\prod_{t=k+1}^n \mathrm{LR}_t\right), \quad n \ge 1, \qquad S_0^\pi = \frac{p}{1-p}, \]
where the product $\prod_{t=i}^j \mathrm{LR}_t = 1$ for $j < i$. The threshold $A$ has to be set larger than $p/(1-p)$ to avoid triviality, since otherwise $T_{\mathrm{SH}} = 0$ w.p. 1.
Often (following Shiryaev's assumptions) it is supposed that the change point $\nu$ is distributed according to the zero-modified geometric distribution $\mathrm{Geometric}(p, \varrho)$:
\[ \mathsf{P}(\nu < 0) = \pi_{-1} = p \quad\text{and}\quad \mathsf{P}(\nu = k) = (1-p)\,\varrho(1-\varrho)^k \;\text{ for } k = 0, 1, 2, \ldots, \tag{59} \]
where $p \in [0, 1)$ and $\varrho \in (0, 1)$.
If $\mathrm{LR}_n$ does not depend on the change point $\nu$ and the prior distribution is the zero-modified geometric distribution (59), then the statistic $\widetilde{S}_n^\varrho = S_n^\pi/\varrho$ can be rewritten in the recursive form
\[ \widetilde{S}_n^\varrho = \left(1 + \widetilde{S}_{n-1}^\varrho\right)\frac{\mathrm{LR}_n}{1-\varrho}, \quad n \ge 1, \qquad \widetilde{S}_0^\varrho = \frac{p}{(1-p)\varrho}. \]
However, as mentioned above, this may not be the case for non-i.i.d. models, since often $\mathrm{LR}_n$ depends on $\nu$.
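A minimal sketch of the Shiryaev procedure (57) using the recursion above under the Geometric($p, \varrho$) prior (59); the Gaussian mean-shift likelihood ratio and all parameter values are our assumptions.

```python
import numpy as np

# Sketch of the Shiryaev procedure (57) via the recursion for S~ = S^pi / rho
# under the Geometric(p, rho) prior (59); Gaussian mean-shift LR is our assumption.
def shiryaev(x, theta=1.0, p=0.0, rho=0.01, A=100.0):
    s = p / ((1 - p) * rho)                     # S~_0
    for n, xt in enumerate(x, start=1):
        lr = np.exp(theta * xt - theta**2 / 2)  # LR_n for g = N(0,1), f = N(theta,1)
        s = (1 + s) * lr / (1 - rho)            # S~_n recursion
        if s * rho >= A:                        # threshold test on S_n^pi = rho * S~_n
            return n
    return None

rng = np.random.default_rng(6)
nu = 300
x = np.concatenate([rng.standard_normal(nu), 1.0 + rng.standard_normal(1700)])
print("Shiryaev alarm at n =", shiryaev(x))
```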

Shiryaev–Roberts Procedure 

The generalized Shiryaev–Roberts (SR) change detection procedure is based on thresholding the generalized SR statistic
\[ R_n^{r_0} = r_0\,\Lambda_n^0 + \sum_{k=0}^{n-1}\Lambda_n^k = r_0\prod_{t=1}^n \mathrm{LR}_t + \sum_{k=0}^{n-1}\prod_{t=k+1}^n \mathrm{LR}_t, \quad n \ge 1, \]
with a nonnegative head-start $R_0 = r_0$, $r_0 \ge 0$; i.e., the stopping time of the SR procedure is given by
\[ T_{\mathrm{SR}}^{r_0} = \inf\left\{n \ge 1 : R_n^{r_0} \ge A\right\}, \quad A > 0. \tag{62} \]
This procedure is usually referred to as the SR-$r$ detection procedure, in contrast to the standard SR procedure $T_{\mathrm{SR}} \equiv T_{\mathrm{SR}}^{r_0}$ with $r_0 = 0$, which starts from the zero initial condition. In the i.i.d. case (53), this modification of the SR procedure was introduced and studied in detail in [45,46].
If $\mathrm{LR}_n$ does not depend on the change point $\nu$, then the SR-$r$ detection statistic satisfies the recursion
\[ R_n^{r_0} = \left(1 + R_{n-1}^{r_0}\right)\mathrm{LR}_n, \quad n \ge 1, \qquad R_0^{r_0} = r_0. \]
Note that, as the parameter $\varrho$ of the geometric prior distribution goes to 0, the Shiryaev statistic $\widetilde{S}_n^\varrho$ converges to the SR-$r$ statistic $R_n^{r_0}$.
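Similarly, the SR-$r$ statistic and the stopping time (62) can be sketched as follows, again under an assumed Gaussian mean-shift model.

```python
import numpy as np

# Sketch of the SR-r recursion R_n = (1 + R_{n-1}) * LR_n and stopping rule (62);
# the Gaussian mean-shift LR and the head-start value are our assumptions.
def shiryaev_roberts(x, theta=1.0, r0=0.0, A=500.0):
    R = r0
    for n, xt in enumerate(x, start=1):
        R = (1 + R) * np.exp(theta * xt - theta**2 / 2)
        if R >= A:
            return n
    return None

rng = np.random.default_rng(7)
nu = 300
x = np.concatenate([rng.standard_normal(nu), 1.0 + rng.standard_normal(1700)])
print("SR alarm at n =", shiryaev_roberts(x))
```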

3.2.3. Optimality Criteria

The goal of online change detection is to detect the change as soon as possible after it occurs controlling a false alarm rate at a given level. Tartakovsky et al. [6] suggested five changepoint problem settings – the Bayesian approach, the generalized Bayesian approach, the minimax approach, the uniform (pointwise) approach, and the approach related to multicyclic detection of a change in a stationary regime. In this article, we discuss only a single-run case and two main settings – Bayesian and uniform pointwise optimality, which are tightly related.
Let $\mathbb{E}_k$ denote the expectation with respect to the measure $\mathsf{P}_k$ when the change occurs at $\nu = k < \infty$, and $\mathbb{E}_\infty$ the expectation with respect to $\mathsf{P}_\infty$ when there is no change.
In 1954, Page [43] suggested measuring the risk associated with a false alarm by the mean time to false alarm $\mathbb{E}_\infty[T]$ and the risk associated with a true change detection by the mean time to detection $\mathbb{E}_0[T]$ when the change occurs at the very beginning. He called these performance characteristics the Average Run Length (ARL). Page also introduced the now most famous change detection procedure, the CUSUM procedure, and analyzed it using these operating characteristics.
While the false alarm rate is reasonably measured by the ARL to false alarm
\[ \mathrm{ARL2FA}(T) = \mathbb{E}_\infty[T], \]
as Figure 1 suggests, the risk associated with a true change detection is more adequately measured by the conditional average delay to detection
\[ \mathrm{CEDD}_\nu(T) = \mathbb{E}_\nu[T - \nu\,|\,T > \nu], \quad \nu = 0, 1, \ldots, \]
and not necessarily by the ARL to detection $\mathbb{E}_0[T] \equiv \mathrm{CEDD}_0(T)$. A good detection procedure should guarantee small values of the expected detection delay $\mathrm{CEDD}_\nu(T)$ for all change points $\nu \ge 0$ when $\mathrm{ARL2FA}(T)$ is fixed at a certain level. However, if the false alarm risk is measured in terms of the ARL to false alarm, i.e., if it is required that $\mathrm{ARL2FA}(T) \ge \gamma$ for some $\gamma \ge 1$, then a procedure that minimizes the conditional average delay to detection $\mathrm{CEDD}_\nu(T)$ uniformly over all $\nu$ does not exist. For this reason, we have to resort to different optimality criteria, e.g., Bayesian and minimax criteria.

Minimax Changepoint Optimization Criteria 

There are two popular minimax criteria. The first one was introduced by Lorden [47]:
\[ \inf_T\;\sup_{\nu\ge 0}\;\operatorname{ess\,sup}\,\mathbb{E}_\nu\left[T - \nu\,|\,T > \nu, \mathcal{F}_\nu\right] \quad \text{subject to } \mathrm{ARL2FA}(T) \ge \gamma. \]
It requires minimizing the conditional expected delay to detection $\mathbb{E}_\nu[T - \nu\,|\,T > \nu, \mathcal{F}_\nu]$ in the worst-case scenario with respect to both the change point $\nu$ and the trajectory $(X_1, \ldots, X_\nu)$ of the observed process in the class of detection procedures
\[ \mathcal{C}_{\mathrm{ARL}}(\gamma) = \left\{T : \mathrm{ARL2FA}(T) \ge \gamma\right\}, \quad \gamma \ge 1, \]
for which the ARL to false alarm exceeds the prespecified value $\gamma \in [1, \infty)$. Let $\mathrm{ESADD}(T) = \sup_{\nu\ge 0}\operatorname{ess\,sup}\,\mathbb{E}_\nu[T - \nu\,|\,T > \nu, \mathcal{F}_\nu]$ denote Lorden's speed-of-detection measure. Under Lorden's minimax approach the goal is to find a stopping time $T_{\mathrm{opt}} \in \mathcal{C}_{\mathrm{ARL}}(\gamma)$ such that
\[ \mathrm{ESADD}(T_{\mathrm{opt}}) = \inf_{T\in\mathcal{C}_{\mathrm{ARL}}(\gamma)}\mathrm{ESADD}(T) \quad \text{for any } \gamma \ge 1. \]
In the classical i.i.d. scenario (53), Lorden [47] proved that the CUSUM detection procedure (55) is asymptotically first-order minimax optimal as $\gamma\to\infty$, i.e.,
\[ \inf_{T\in\mathcal{C}_{\mathrm{ARL}}(\gamma)}\mathrm{ESADD}(T) = \mathrm{ESADD}(T_{\mathrm{CS}})\,(1 + o(1)), \quad \gamma\to\infty. \]
Later on, Moustakides [48], using optimal stopping theory, established in his ingenious paper the exact optimality of CUSUM for any ARL to false alarm $\gamma \ge 1$.
Another popular, less pessimistic minimax criterion is due to Pollak [49]:
\[ \inf_T\;\sup_{\nu\ge 0}\mathrm{CEDD}_\nu(T) \quad \text{subject to } \mathrm{ARL2FA}(T) \ge \gamma, \]
which requires minimizing the conditional expected delay to detection $\mathrm{CEDD}_\nu(T) = \mathbb{E}_\nu[T - \nu\,|\,T > \nu]$ in the worst-case scenario with respect to the change point $\nu$ in the class $\mathcal{C}_{\mathrm{ARL}}(\gamma)$. Under Pollak's minimax approach the goal is to find a stopping time $T_{\mathrm{opt}} \in \mathcal{C}_{\mathrm{ARL}}(\gamma)$ such that
\[ \sup_{\nu\ge 0}\mathrm{CEDD}_\nu(T_{\mathrm{opt}}) = \inf_{T\in\mathcal{C}_{\mathrm{ARL}}(\gamma)}\sup_{\nu\ge 0}\mathrm{CEDD}_\nu(T) \quad \text{for any } \gamma \ge 1. \]
For the i.i.d. model (53), Pollak [49] showed that the modified SR detection procedure that starts from the quasi-stationary distribution of the SR statistic (i.e., whose head-start $r_0$ in the SR-$r$ procedure is a specific random variable) is third-order asymptotically optimal as $\gamma\to\infty$, i.e., the best one can attain up to an additive term $o(1)$:
\[ \inf_{T\in\mathcal{C}_{\mathrm{ARL}}(\gamma)}\sup_{\nu\ge 0}\mathrm{CEDD}_\nu(T) = \sup_{\nu\ge 0}\mathrm{CEDD}_\nu\left(T_{\mathrm{SR}}^{r_0}\right) + o(1), \quad \gamma\to\infty, \]
where $o(1) \to 0$ as $\gamma\to\infty$. Later, Tartakovsky et al. [50] proved that this is also true for the SR-$r$ procedure (62) that starts from a fixed but specially designed point $r_0 = r_0(\gamma)$ depending on $\gamma$, which was first introduced and thoroughly studied by Moustakides et al. [45]. See also Polunchenko and Tartakovsky [51] on the exact optimality of the SR-$r$ procedure.

Bayesian Changepoint Optimization Criterion 

In Bayesian problems, the point of change ν is treated as random with a prior distribution π k = Pr ( ν = k ) , − ∞ < k < + ∞ . Define the probability measure on the Borel σ -algebra B in R ∞ × N as
P π ( A × K ) = ∑ k ∈ K π k P k ( A ) , A ∈ B ( R ∞ ) , K ⊂ N .
Under measure P π the change point ν has distribution π = { π k } and the model for the observations is given in (52). From the Bayesian point of view, it is reasonable to measure the false alarm risk with the Weighted Probability of False Alarm (PFA), defined as
PFA π ( T ) := P π ( T ≤ ν ) = ∑ k = − ∞ ∞ π k P k ( T ≤ k ) = ∑ k = 0 ∞ π k P ∞ ( T ≤ k ) .
The summation in (63) is over k ∈ Z + = { 0 , 1 , 2 , … } since P ( T < 0 ) = 0 . Also, the last equality follows from the fact that P k ( T ≤ k ) = P ∞ ( T ≤ k ) because the event { T ≤ k } depends only on the first k observations, which under measure P k correspond to the no-change hypothesis H ∞ . Thus, for α ∈ ( 0 , 1 ) , introduce the class of changepoint detection procedures
C π ( α ) = { T : PFA π ( T ) ≤ α }
for which the weighted PFA does not exceed a prescribed level α . Let E π denote expectation with respect to measure P π .
Shiryaev [18,44] introduced the Bayesian optimality criterion
inf T ∈ C π ( α ) E π [ ( T − ν ) + ] ,
which is equivalent to minimizing the conditional average detection delay EDD π ( T ) = E π [ T − ν | T > ν ] :
inf T EDD π ( T ) subject to PFA π ( T ) ≤ α .
Under the Bayesian approach, the goal is to find a stopping time T opt ∈ C π ( α ) such that
EDD π ( T opt ) = inf T ∈ C π ( α ) EDD π ( T ) for any α ∈ ( 0 , 1 ) .
For the i.i.d. model (53) and under the assumption that the change point ν has the zero-modified geometric prior distribution Geometric ( p , ϱ ) (59), this problem was solved by Shiryaev [18,44]. Shiryaev [18,44,52] proved that the optimal detection procedure is based on comparing the posterior probability of a change currently being in effect with a certain detection threshold, which is equivalent to the stopping time T SH ( A ) (57). To guarantee strict optimality, the detection threshold A = A α should be set so that the PFA is exactly equal to the selected level α . Thus, if A = A α can be selected in such a way that PFA π ( T SH ( A α ) ) = α , then the procedure is strictly optimal in class C π ( α ) :
inf T ∈ C π ( α ) EDD π ( T ) = EDD π ( T SH ( A α ) ) for any 0 < α < 1 − p .
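Definitions (57)-(58) are not reproduced in this excerpt; a commonly used form of the Shiryaev statistic for a Geometric ( p ) prior is the recursion R n = ( 1 + R n − 1 ) e Z n / ( 1 − p ) , with an alarm at the first n such that R n ≥ A . The sketch below uses this form with a plain (not zero-modified) geometric prior and a Gaussian mean shift, both illustrative assumptions.

```python
import numpy as np

def shiryaev(x, llr_inc, p, A, R0=0.0):
    """Shiryaev statistic for a Geometric(p) prior (common form, assumed here):
    R_n = (1 + R_{n-1}) * exp(Z_n) / (1 - p); alarm at the first n with R_n >= A."""
    R = R0
    for n, xn in enumerate(x, start=1):
        R = (1.0 + R) * np.exp(llr_inc(xn)) / (1.0 - p)
        if R >= A:
            return n
    return None

theta, p, alpha = 1.0, 0.05, 0.01
A = (1 - alpha) / alpha   # threshold aimed at PFA <= alpha (see Lemma 7.2.1 cited below)
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(theta, 1.0, 200)])
print("Shiryaev alarm at n =", shiryaev(x, lambda v: theta * v - theta**2 / 2, p, A))
```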

Uniform Optimality Under Local Probabilities of False Alarm 

While the Bayesian and minimax formulations are reasonable and can be justified in many applications, it would be most desirable to guarantee small values of the conditional expected detection delay CEDD ν ( T ) = E ν [ T − ν | T > ν ] uniformly for all ν ∈ Z + when the false alarm risk is fixed at a certain level. However, as we already mentioned, if the false alarm risk is measured in terms of the ARL to false alarm, i.e., if it is required that ARL 2 FA ( T ) ≥ γ for some γ ≥ 1 , then a procedure that minimizes CEDD ν ( T ) for all ν does not exist. More importantly, as discussed in [5], the requirement of having large values of ARL 2 FA ( T ) generally does not guarantee small values of the maximal local probability of false alarm MLPFA ( T ) = sup ℓ ≥ 0 P ∞ ( T ≤ ℓ + m | T > ℓ ) in a time window of length m ≥ 1 , while the opposite is always true (see Lemmas 2.1-2.2 in [5]). Hence, the constraint MLPFA ( T ) ≤ β is more stringent than ARL 2 FA ( T ) ≥ γ .
Yet another reason for considering the MLPFA constraint instead of the ARL to false alarm constraint is that the latter makes sense if, and only if, the P ∞ -distribution of stopping times is geometric or at least close to geometric, which is often the case for many popular detection procedures such as CUSUM and SR in the i.i.d. case. However, for general non-i.i.d. models this is not necessarily true (see [5] and [53] for a detailed discussion).
For these reasons, introduce the most stringent class of change detection procedures for which MLPFA ( T ) is upper-bounded by the prespecified level β ∈ ( 0 , 1 ) :
C PFA ( m , β ) = { T : sup ℓ ≥ 0 P ∞ ( T ≤ ℓ + m | T > ℓ ) ≤ β } .
The goal is to find a stopping time T opt ∈ C PFA ( m , β ) such that
CEDD ν ( T opt ) = inf T ∈ C PFA ( m , β ) CEDD ν ( T ) for all ν ∈ Z + and any 0 < β < 1 .
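The MLPFA is straightforward to estimate by simulation: run the procedure many times under P ∞ and evaluate the conditional probabilities over a grid of ℓ . The sketch below does this for the i.i.d. Gaussian CUSUM rule from the earlier sketch; the design parameters and window length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, a, m = 1.0, 3.0, 10          # CUSUM design and window length (illustrative)

def cusum_stop(horizon=10**5):
    """CUSUM stopping time under P_inf (no change), with N(0,1) observations."""
    w = 0.0
    for n in range(1, horizon + 1):
        w = max(0.0, w + theta * rng.normal() - theta**2 / 2)
        if w >= a:
            return n
    return horizon

T = np.array([cusum_stop() for _ in range(20000)])

# MLPFA(T) = sup_{l >= 0} P_inf(T <= l + m | T > l), estimated over a grid of l;
# grid points with too few surviving runs are skipped to keep estimates stable
mlpfa = max(np.mean(T[T > l] <= l + m)
            for l in range(0, 200, 5) if np.sum(T > l) > 1000)
print(f"estimated MLPFA for window m={m}: {mlpfa:.3f}")
```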

3.2.4. Asymptotic Optimality for General Non-i.i.d. Models via r-Quick and r-Complete Convergence

Complete Convergence and General Bayesian Changepoint Detection Theory 

Consider first the Bayesian problem assuming that the change point ν is a random variable independent of the observations with a prior distribution π = { π k } . Unfortunately, in the general non-i.i.d. case and for an arbitrary prior π , the Bayesian optimization problem (65) is intractable for arbitrary values of the PFA α ∈ ( 0 , 1 ) . For this reason, we will consider the following first-order asymptotic problem, assuming that the given PFA α approaches zero: find a change detection procedure T * that minimizes the expected detection delay EDD π ( T ) asymptotically to first order as α → 0 . That is, the goal is to design a detection procedure T * such that
inf T ∈ C π ( α ) EDD π ( T ) = EDD π ( T * ) ( 1 + o ( 1 ) ) as α → 0 ,
where o ( 1 ) → 0 as α → 0 . It turns out that in the asymptotic setting it is also possible to find a procedure that minimizes the conditional expected detection delay EDD k ( T ) = E k [ T − k | T > k ] uniformly over all possible values of the change point ν = k ∈ Z + , i.e.,
lim α → 0 inf T ∈ C π ( α ) EDD k ( T ) / EDD k ( T * ) = 1 for all k ∈ Z + .
Note that if the change occurs before the observations become available, i.e., ν = k ∈ { − 1 , − 2 , … } , then EDD k ( T ) = E 0 [ T ] since T ≥ 0 w.p. 1.
Furthermore, asymptotic optimality results can also be established for higher moments of the detection delay of order r > 1 :
E k [ ( T − k ) r | T > k ] and E π [ ( T − ν ) r | T > ν ] .
Since the Shiryaev procedure T SH ( A ) defined in (57)-(58) is optimal for the i.i.d. model and the Geometric ( p , ϱ ) prior, it is reasonable to expect that it is asymptotically optimal for more general priors and non-i.i.d. models. However, to study asymptotic optimality we need certain constraints imposed on the prior distribution and on the asymptotic behavior of the decision statistics as the sample size increases, i.e., on the general stochastic model (52).
Assume that the prior distribution { π k } is fully supported, i.e., π k > 0 for all k ∈ Z + and π ∞ = 0 , and that the following conditions hold:
lim n → ∞ ( 1 / n ) | log ∑ k = n + 1 ∞ π k | = μ for some 0 ≤ μ < ∞ ;
∑ k = 0 ∞ π k | log π k | r < ∞ for some r ≥ 1 if μ = 0 .
Note that if μ > 0 , then by condition (70) the prior distribution has an exponential right tail. Distributions such as geometric and discrete versions of gamma and logistic distributions, i.e., models with bounded hazard rates, belong to this class. In this case, condition (71) holds automatically. If μ = 0 , the distribution has a heavy tail, i.e., belongs to the model with a vanishing hazard rate. However, we cannot allow this distribution to have a too-heavy tail, which is guaranteed by condition (71).
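As a quick numerical illustration of condition (70), take the plain geometric prior π k = ( 1 − ϱ ) ϱ k , k ∈ Z + , as an illustrative stand-in for the zero-modified Geometric ( p , ϱ ) prior (59): its tail sum is available in closed form, and the limit in (70) is μ = | log ϱ | > 0 , so condition (71) is not needed.

```python
import numpy as np

rho = 0.9                            # illustrative geometric prior: pi_k = (1 - rho) * rho**k
for n in (10, 100, 1000, 10000):
    tail = rho ** (n + 1)            # closed form of sum_{k > n} pi_k
    print(n, abs(np.log(tail)) / n)  # (1/n)|log tail| approaches mu as n grows
print("limit mu = |log rho| =", abs(np.log(rho)))
```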
Define the LLR of the hypotheses H k and H ∞ :
λ n k = log ( d P k ( n ) / d P ∞ ( n ) ) = ∑ t = k + 1 n log [ f t ( X t | X t − 1 ) / g t ( X t | X t − 1 ) ] , n > k
( λ n k = 0 for n ≤ k ). To obtain asymptotic optimality results, the general non-i.i.d. model for the observations is restricted to the case where the normalized LLR n − 1 λ k + n k obeys the SLLN as n → ∞ with a finite and positive number I under the probability measure P k , together with its r-complete strengthened version
∑ n = 1 ∞ n r − 1 sup k ∈ Z + P k ( | n − 1 λ k + n k − I | > ε ) < ∞ for every ε > 0 .
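In the i.i.d. Gaussian case, condition (72) can be checked in closed form, which makes a useful sanity test: for a mean shift from N ( 0 , 1 ) to N ( θ , 1 ) , the LLR increments are i.i.d. N ( I , 2 I ) with I = θ 2 / 2 under P k for every k , so P k ( | n − 1 λ k + n k − I | > ε ) = 2 Φ ( − ε ( n / 2 I ) 1 / 2 ) exactly and the series in (72) converges for every r ≥ 1 . A sketch (all parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

theta, eps, r = 1.0, 0.1, 2
I = theta**2 / 2                      # K-L information number of the mean shift
n = np.arange(1, 200001)
# Exact Gaussian tail: P_k(|lambda/n - I| > eps) = 2*Phi(-eps*sqrt(n/(2I))),
# the same for every k, so the sup over k in (72) is trivial here.
p = 2 * norm.cdf(-eps * np.sqrt(n / (2 * I)))
print("partial sum of n^(r-1) * p_n:", (n ** (r - 1) * p).sum())  # stabilizes: series converges
```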
By Lemma 7.2.1 in [6],
PFA π ( T SH ( A ) ) ≤ 1 / ( 1 + A ) for every A > p / ( 1 − p ) ,
and therefore setting A = A α = ( 1 − α ) / α guarantees that T SH ( A α ) ∈ C π ( α ) .
The following theorem, which can be deduced from Theorem 3.7 in [5], shows that the Shiryaev detection procedure is asymptotically optimal if the normalized LLR n − 1 λ k + n k converges r-completely to a positive and finite number I and the prior distribution satisfies conditions (70)-(71).
Theorem 6.
Let r ≥ 1 . Let the prior distribution of the change point satisfy conditions (70)-(71). Assume that there exists some number 0 < I < ∞ such that the normalized LLR process n − 1 λ k + n k converges to I uniformly r-completely as n → ∞ under P k , i.e., condition (72) holds. If the threshold A = A α in the Shiryaev procedure is selected so that PFA π ( T SH ( A α ) ) ≤ α and log A α ∼ | log α | as α → 0 , e.g., A α = ( 1 − α ) / α , then as α → 0
inf T ∈ C π ( α ) E k [ ( T − k ) r | T > k ] ∼ ( | log α | / ( I + μ ) ) r ∼ E k [ ( T SH − k ) r | T SH > k ] for all k ∈ Z +
and
inf T ∈ C π ( α ) E π [ ( T − ν ) r | T > ν ] ∼ ( | log α | / ( I + μ ) ) r ∼ E π [ ( T SH − ν ) r | T SH > ν ] .
Therefore, the Shiryaev procedure T SH ( A α ) is first-order asymptotically optimal as α → 0 in class C π ( α ) , minimizing moments of the detection delay up to order r whenever the r-complete version of the SLLN (72) holds for the LLR process.
For r = 1 , the assertions of this theorem imply asymptotic optimality of the Shiryaev procedure for the expected detection delays (68) and (69), as well as asymptotic approximations for these delays.
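The asymptotic approximation of Theorem 6 is easy to probe numerically for r = 1 : with the plain Geometric ( p ) prior one has μ = | log ( 1 − p ) | , and the simulated delay of the Shiryaev procedure with A α = ( 1 − α ) / α should be comparable to | log α | / ( I + μ ) , with the agreement improving as α decreases. The recursion form, the plain geometric prior, and the Gaussian model are illustrative assumptions, as in the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, p, alpha, k = 1.0, 0.05, 1e-3, 0
I, mu = theta**2 / 2, abs(np.log(1 - p))   # K-L number and prior-tail exponent
A = (1 - alpha) / alpha                    # guarantees PFA <= alpha by Lemma 7.2.1

def shiryaev_delay(horizon=10**5):
    """Delay T - k of the Shiryaev rule with change at nu = k, or None if T <= k."""
    R = 0.0
    for n in range(1, horizon + 1):
        x = rng.normal(theta if n > k else 0.0, 1.0)
        R = (1.0 + R) * np.exp(theta * x - theta**2 / 2) / (1.0 - p)
        if R >= A:
            return n - k if n > k else None
    return None

delays = [d for d in (shiryaev_delay() for _ in range(2000)) if d is not None]
print("simulated E_k[T - k | T > k] :", np.mean(delays))
print("|log alpha| / (I + mu)       :", abs(np.log(alpha)) / (I + mu))
```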
Remark 7.
The results of Theorem 6 can be generalized to the asymptotically non-stationary case where λ k + n k / ψ ( n ) converges to I uniformly completely as n → ∞ under P k with a non-linear function ψ ( n ) , similarly to the hypothesis testing problem discussed in Section 3.1. See also the recent paper [54] for the minimax change detection problem with independent but substantially non-stationary post-change observations.
It is also interesting to see how the two other most popular changepoint detection procedures – the SR and CUSUM – perform in the Bayesian context.
Consider the SR-r procedure defined by (61)-(62). It follows from Lemma 3.4 (page 100) in [5] that
PFA π ( T SR r 0 ( A ) ) ≤ ( r 0 ∑ k = 1 ∞ π k + ∑ k = 1 ∞ k π k ) / A for every A > 0 ,
and therefore setting A = A α = α − 1 ( r 0 + ∑ k = 1 ∞ k π k ) implies T SR r 0 ( A α ) ∈ C π ( α ) . If the threshold A = A α in the SR-r procedure is selected so that PFA π ( T SR r 0 ( A α ) ) ≤ α and log A α ∼ | log α | as α → 0 , e.g., A α = α − 1 ( r 0 + ∑ k = 1 ∞ k π k ) , then as α → 0
E k [ ( T SR r 0 − k ) r | T SR r 0 > k ] ∼ ( | log α | / I ) r for all k ∈ Z +
and
E π [ ( T SR r 0 − ν ) r | T SR r 0 > ν ] ∼ ( | log α | / I ) r
whenever the uniform r-complete convergence condition (72) holds. Therefore, the SR-r procedure T SR r 0 ( A α ) is first-order asymptotically optimal as α → 0 in class C π ( α ) , minimizing moments of the detection delay up to order r , when the prior distribution π is heavy-tailed (i.e., when μ = 0 ) and the r-complete version of the SLLN holds. In the case where μ > 0 (i.e., the prior distribution has an exponential tail), the SR-r procedure is not optimal. This can be expected since it uses the improper uniform prior in the detection statistic.
The same asymptotic results (73)-(74) hold for the CUSUM procedure T CS ( a ) if the threshold a = a α is selected so that PFA π ( T CS ( a α ) ) ≤ α and a α ∼ | log α | as α → 0 , and the uniform r-complete convergence condition (72) holds.
Hence, r-complete convergence of the LLR process is a sufficient condition for uniform asymptotic optimality of several popular change detection procedures in class C π ( α ) .

Complete Convergence and General Non-Bayesian Changepoint Detection Theory 

Consider now the non-Bayesian problem, assuming that the change point ν is an unknown deterministic number. We focus on the uniform optimality criterion (67), which is the most interesting for applications; it requires minimizing the conditional expected delay to detection CEDD ν ( T ) = E ν [ T − ν | T > ν ] for all values of the change point ν ∈ Z + in the class of change detection procedures C PFA ( m , β ) defined in (66). Recall that this class consists of change detection procedures whose maximal local probability of false alarm in the time window m ,
MLPFA ( T ) = sup ℓ ≥ 0 P ∞ ( T ≤ ℓ + m | T > ℓ ) ,
does not exceed the prescribed value β ∈ ( 0 , 1 ) . However, the exact solution to this challenging problem is unknown even in the i.i.d. case.
So instead consider the following asymptotic problem, assuming that the given MLPFA β goes to zero: find a change detection procedure T * which minimizes the expected detection delay E ν [ T − ν | T > ν ] asymptotically to first order as β → 0 . That is, the goal is to design a detection procedure T * such that
inf T ∈ C PFA ( m , β ) E ν [ T − ν | T > ν ] = E ν [ T * − ν | T * > ν ] ( 1 + o ( 1 ) ) for all ν ∈ Z + as β → 0 .
More generally, we may focus on the asymptotic problem of minimizing moments of the detection delay of order r ≥ 1 :
inf T ∈ C PFA ( m , β ) E ν [ ( T − ν ) r | T > ν ] = E ν [ ( T * − ν ) r | T * > ν ] ( 1 + o ( 1 ) ) for all ν ∈ Z + as β → 0 .
To solve this problem, we need to assume that the window length m = m β is a function of the MLPFA constraint β and that m β goes to infinity as β → 0 at a certain appropriate rate. Using [55], the following results can be established.
Let r ≥ 1 and assume that the r-complete version of the SLLN holds with some number 0 < I < ∞ , i.e., n − 1 λ ν + n ν converges to I uniformly r-completely as n → ∞ under P ν . If m β = O ( | log β | 2 ) as β → 0 and the threshold A = A β in the SR procedure is selected so that MLPFA ( T SR ( A β ) ) ≤ β and log A β ∼ | log β | as β → 0 , e.g., as defined in [55], then as β → 0
inf T ∈ C PFA ( m β , β ) E ν [ ( T − ν ) r | T > ν ] ∼ ( | log β | / I ) r ∼ E ν [ ( T SR − ν ) r | T SR > ν ] for all ν ∈ Z + .
A similar result also holds for the CUSUM procedure T CS ( a ) if the threshold a = a β is selected so that MLPFA ( T CS ( a β ) ) ≤ β and a β ∼ | log β | as β → 0 , and the r-complete version of the SLLN holds for the normalized LLR n − 1 λ ν + n ν as n → ∞ .
Hence, r-complete convergence of the LLR process is a sufficient condition for uniform asymptotic optimality of the SR and CUSUM change detection procedures with respect to moments of the detection delay of order r in class C PFA ( m β , β ) .

4. Quick and Complete Convergence for Markov and Hidden Markov Models

Usually, in particular problems, verification of the SLLN for the LLR process is relatively easy. However, verifying the strengthened r-complete or r-quick versions of the SLLN, i.e., checking condition (72), can cause some difficulty in practice. Many interesting examples where this verification was performed can be found in [5,6]. It is therefore of interest to find sufficient conditions for r-complete convergence that cover a relatively large class of stochastic models.
In this section, we outline this issue for Markov and hidden Markov models, based on the results obtained by Pergamenchtchikov and Tartakovsky [55] for ergodic Markov processes and by Fuh and Tartakovsky [56] for hidden Markov models (HMMs). See also Tartakovsky [5].
Let { X n } n ∈ Z + be a time-homogeneous Markov process with values in a measurable space ( X , B ) with the transition probability P ( x , A ) . Let E x denote the expectation with respect to this probability. Assume that this process is geometrically ergodic, i.e., there exist positive constants 0 < R < ∞ and κ > 0 , a probability measure ϰ on ( X , B ) , and a Lyapunov function V : X → [ 1 , ∞ ) with ϰ ( V ) < ∞ , such that
sup n ∈ Z + e κ n sup 0 < ψ ≤ V sup x ∈ X ( 1 / V ( x ) ) | E x [ ψ ( X n ) ] − ϰ ( ψ ) | ≤ R .
In the change detection problem, the sequence { X n } n ∈ Z + is a Markov process, such that { X n } 1 ≤ n ≤ ν is a homogeneous process with the transition density g ( y | x ) and { X n } n > ν is homogeneous positive ergodic with the transition density f ( y | x ) and the ergodic (stationary) distribution ϰ . In this case, the LLR process λ n k can be represented as
λ n k = ∑ t = k + 1 n G ( X t , X t − 1 ) , n > k ,
where G ( y , x ) = log [ f ( y | x ) / g ( y | x ) ] .
Define
I = ∫ X ∫ X G ( y , x ) f ( y | x ) d y ϰ ( d x ) .
Under a set of quite sophisticated sufficient conditions, the LLR λ k + n k / n converges r-completely to I (cf. [55]). We omit the details and only mention that the main condition is the finiteness of the ( r + 1 ) -th moment of the LLR increment, E 0 [ | G ( X 1 , X 0 ) | r + 1 ] < ∞ .
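For a concrete Markov example, consider an illustrative change in the coefficient of a Gaussian AR(1) kernel, X t = ρ X t − 1 + w t with w t ∼ N ( 0 , 1 ) , from ρ = ρ 0 to ρ = ρ 1 with | ρ 1 | < 1 (so the post-change chain is geometrically ergodic). Then I has the closed form ( ρ 1 − ρ 0 ) 2 / ( 2 ( 1 − ρ 1 2 ) ) , which a Monte Carlo evaluation of the double integral above reproduces:

```python
import numpy as np

rng = np.random.default_rng(6)
rho0, rho1 = 0.2, 0.7     # illustrative pre/post-change AR(1) coefficients

# G(y, x) = log f(y|x) - log g(y|x) for the two Gaussian AR(1) kernels
def G(y, x):
    return 0.5 * ((y - rho0 * x) ** 2 - (y - rho1 * x) ** 2)

# I = int int G(y, x) f(y|x) dy kappa(dx): sample x from the post-change
# stationary law N(0, 1/(1 - rho1^2)), then y from f(.|x) = N(rho1*x, 1)
x = rng.normal(0.0, 1.0 / np.sqrt(1 - rho1**2), 10**6)
y = rho1 * x + rng.normal(0.0, 1.0, x.size)
print("Monte Carlo estimate of I:", np.mean(G(y, x)))
print("closed form              :", (rho1 - rho0) ** 2 / (2 * (1 - rho1**2)))
```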
Consider now an HMM with a finite state space. Then again, as in the pure Markov case, the main condition for r-complete convergence of λ k + n k / n to I , where I is specified in Fuh and Tartakovsky [56], is E 0 [ | λ 1 0 | r + 1 ] < ∞ . Further details can be found in [56].
Similar results for Markov and hidden Markov models hold for the hypothesis testing problem considered in Section 3.1. Specifically, if in the Markov case we assume that the observed Markov process { X n } n ∈ Z + is time-homogeneous and geometrically ergodic with transition density f i ( y | x ) under hypothesis H i ( i = 0 , 1 , … , N ) and invariant distribution ϰ i , then the LLR processes are
λ i j ( n ) = ∑ t = 1 n G i j ( X t , X t − 1 ) , i , j = 0 , 1 , … , N , i ≠ j ,
where G i j ( y , x ) = log [ f i ( y | x ) / f j ( y | x ) ] . If E i [ | G i j ( X 1 , X 0 ) | r + 1 ] < ∞ , then the normalized LLR n − 1 λ i j ( n ) converges r-completely to the finite number
I i j = ∫ X ∫ X G i j ( y , x ) f i ( y | x ) d y ϰ i ( d x ) .

5. Conclusion

We have shown that the strengthened versions of the SLLN, specifically the r-quick and r-complete versions, are useful tools in many statistical problems for general non-i.i.d. stochastic models. In particular, r-quick and r-complete convergence of log-likelihood ratio processes is sufficient for near optimality of sequential hypothesis tests and changepoint detection procedures for models with dependent and non-identically distributed observations. Such non-i.i.d. models are typical of modern large-scale information and physical systems that produce Big Data in numerous practical applications. Readers interested in specific applications may find detailed discussions in [4,5,6,8,19,22,23,34,36,38,54,55,56,57,58,59].

Short Biography of the Author

Alexander G. Tartakovsky received the Ph.D. degree in statistics and information theory and the advanced D.Sc. degree from the Moscow Institute of Physics and Technology (PhysTech), Russia, in 1981 and 1990, respectively.
From 1981 to 1992, he was first a Senior Research Scientist and then the Department Head at the Moscow Institute of Radio Technology and a Professor at PhysTech, where he worked on the application of statistical methods to the optimization and modeling of information systems. From 1993 to 1996, he was a Professor at the University of California, Los Angeles (UCLA), first with the Department of Electrical Engineering and then with the Department of Mathematics. From 1997 to 2013, he was a Professor at the Department of Mathematics and an Associate Director of the Center for Applied Mathematical Sciences, University of Southern California (USC). In the late 1990s, he organized one of America’s first master’s programs in Mathematical Finance (a joint program of the Mathematics and Economics departments at USC). From 2013 to 2015, he was a Professor of statistics with the Department of Statistics at the University of Connecticut, Storrs. From 2016 to 2021, he was the Head of the Space Informatics Laboratory at PhysTech. He is currently the President of AGT StatConsult, Los Angeles, CA, USA. Dr. Tartakovsky also served as visiting faculty at various universities such as Universite de Rouen, France; University of Technology, Sydney, Australia; The Hebrew University of Jerusalem, Israel; University of North Carolina, Chapel Hill; Columbia University; and Stanford University.
Dr. Tartakovsky is an internationally recognized researcher in theoretical and applied statistics, applied probability, sequential analysis, and changepoint detection. He is the author of three books, several book chapters, and over 100 papers across a range of subjects, including theoretical and applied statistics, applied probability, and sequential analysis. His research focuses on a variety of applications including statistical image and signal processing; video surveillance and object detection and tracking; information integration/fusion; intrusion detection and network security; detection and tracking of malicious activity; mathematical/engineering finance applications; pharmacokinetics/pharmacodynamics; and early detection of epidemics using changepoint methods. Dr. Tartakovsky has provided statistical consulting and developed algorithms and software for many companies and U.S. federal agencies.
Dr. Tartakovsky is a Fellow of the Institute of Mathematical Statistics (IMS) and a Senior Member of the IEEE. He has received numerous awards for his work, including the Abraham Wald Prize in Sequential Analysis, and has presented several keynote and plenary talks at leading conferences.

References

  1. Hsu, P.L.; Robbins, H. Complete convergence and the law of large numbers. Proceedings of the National Academy of Sciences of the United States of America 1947, 33, 25–31. [CrossRef]
  2. Baum, L.E.; Katz, M. Convergence rates in the law of large numbers. Transactions of the American Mathematical Society 1965, 120, 108–123. [CrossRef]
  3. Strassen, V. Almost sure behavior of sums of independent random variables and martingales. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, June 21–July 18, 1965 and December 27, 1965–January 7, 1966; Le Cam, L.M.; Neyman, J., Eds.; University of California Press: Berkeley, CA, USA, 1967; Vol. 2: Contributions to Probability Theory. Part 1, pp. 315–343.
  4. Tartakovsky, A.G. Asymptotic optimality of certain multihypothesis sequential tests: Non-i.i.d. case. Statistical Inference for Stochastic Processes 1998, 1, 265–295. [CrossRef]
  5. Tartakovsky, A.G. Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules; Monographs on Statistics and Applied Probability 165, Chapman & Hall/CRC Press, Taylor & Francis Group: Boca Raton, London, New York, 2020.
  6. Tartakovsky, A.G.; Nikiforov, I.V.; Basseville, M. Sequential Analysis: Hypothesis Testing and Changepoint Detection; Monographs on Statistics and Applied Probability 136, Chapman & Hall/CRC Press, Taylor & Francis Group: Boca Raton, London, New York, 2015.
  7. Lai, T.L. On r-quick convergence and a conjecture of Strassen. Annals of Probability 1976, 4, 612–627.
  8. Lai, T.L. Asymptotic optimality of invariant sequential probability ratio tests. Annals of Statistics 1981, 9, 318–333. [CrossRef]
  9. Chow, Y.S.; Lai, T.L. Some one-sided theorems on the tail distribution of sample sums with applications to the last time and largest excess of boundary crossings. Transactions of the American Mathematical Society 1975, 208, 51–72. [CrossRef]
  10. Fuh, C.D.; Zhang, C.H. Poisson equation, moment inequalities and quick convergence for Markov random walks. Stochastic Processes and Their Applications 2000, 87, 53–67. [CrossRef]
  11. Wald, A. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics 1945, 16, 117–186.
  12. Wald, A. Sequential Analysis; John Wiley & Sons, Inc: New York, USA, 1947.
  13. Wald, A.; Wolfowitz, J. Optimum character of the sequential probability ratio test. Annals of Mathematical Statistics 1948, 19, 326–339.
  14. Burkholder, D.L.; Wijsman, R.A. Optimum properties and admissibility of sequential tests. Annals of Mathematical Statistics 1963, 34, 1–17.
  15. Matthes, T.K. On the optimality of sequential probability ratio tests. Annals of Mathematical Statistics 1963, 34, 18–21.
  16. Ferguson, T.S. Mathematical Statistics: A Decision Theoretic Approach; Probability and Mathematical Statistics, Academic Press, 1967.
  17. Lehmann, E.L. Testing Statistical Hypotheses; John Wiley & Sons, Inc: New York, USA, 1968.
  18. Shiryaev, A.N. Optimal Stopping Rules; Vol. 8, Series on Stochastic Modelling and Applied Probability, Springer-Verlag: New York, USA, 1978.
  19. Tartakovsky, A.G. Sequential Methods in the Theory of Information Systems; Radio i Svyaz’: Moscow, RU, 1991. In Russian.
  20. Golubev, G.K.; Khas’minskii, R.Z. Sequential testing for several signals in Gaussian white noise. Theory of Probability and its Applications 1984, 28, 573–584. [CrossRef]
  21. Tartakovsky, A.G. Asymptotically optimal sequential tests for nonhomogeneous processes. Sequential Analysis 1998, 17, 33–62. [CrossRef]
  22. Verdenskaya, N.V.; Tartakovskii, A.G. Asymptotically optimal sequential testing of multiple hypotheses for nonhomogeneous Gaussian processes in an asymmetric situation. Theory of Probability and its Applications 1991, 36, 536–547. [CrossRef]
  23. Fellouris, G.; Tartakovsky, A.G. Multichannel sequential detection – Part I: Non-i.i.d. data. IEEE Transactions on Information Theory 2017, 63, 4551–4571. [CrossRef]
  24. Armitage, P. Sequential analysis with more than two alternative hypotheses, and its relation to discriminant function analysis. Journal of the Royal Statistical Society - Series B Methodology 1950, 12, 137–144.
  25. Chernoff, H. Sequential design of experiments. Annals of Mathematical Statistics 1959, 30, 755–770.
  26. Kiefer, J.; Sacks, J. Asymptotically optimal sequential inference and design. Annals of Mathematical Statistics 1963, 34, 705–750.
  27. Lorden, G. Integrated risk of asymptotically Bayes sequential tests. Annals of Mathematical Statistics 1967, 38, 1399–1422.
  28. Lorden, G. Nearly-optimal sequential tests for finitely many parameter values. Annals of Statistics 1977, 5, 1–21. [CrossRef]
  29. Pavlov, I.V. Sequential procedure of testing composite hypotheses with applications to the Kiefer-Weiss problem. Theory of Probability and its Applications 1990, 35, 280–292. [CrossRef]
  30. Baron, M.; Tartakovsky, A.G. Asymptotic optimality of change-point detection schemes in general continuous-time models. Sequential Analysis 2006, 25, 257–296. Invited Paper in Memory of Milton Sobel. [CrossRef]
  31. Mosteller, F. A k-sample slippage test for an extreme population. Annals of Mathematical Statistics 1948, 19, 58–65.
  32. Bakut, P.A.; Bolshakov, I.A.; Gerasimov, B.M.; Kuriksha, A.A.; Repin, V.G.; Tartakovsky, G.P.; Shirokov, V.V. Statistical Radar Theory; Vol. 1 (G. P. Tartakovsky, Editor), Sovetskoe Radio: Moscow, USSR, 1963. In Russian.
  33. Basseville, M.; Nikiforov, I.V. Detection of Abrupt Changes – Theory and Application; Information and System Sciences Series, Prentice-Hall, Inc: Englewood Cliffs, NJ, USA, 1993. Online.
  34. Jeske, D.R.; Steven, N.T.; Tartakovsky, A.G.; Wilson, J.D. Statistical methods for network surveillance. Applied Stochastic Models in Business and Industry 2018, 34, 425–445. Discussion Paper. [CrossRef]
  35. Jeske, D.R.; Steven, N.T.; Wilson, J.D.; Tartakovsky, A.G. Statistical network surveillance. Wiley StatsRef: Statistics Reference Online 2018, pp. 1–12. [CrossRef]
  36. Tartakovsky, A.G.; Brown, J. Adaptive spatial-temporal filtering methods for clutter removal and target tracking. IEEE Transactions on Aerospace and Electronic Systems 2008, 44, 1522–1537. [CrossRef]
  37. Szor, P. The Art of Computer Virus Research and Defense; Addison-Wesley Professional: Upper Saddle River, NJ, USA, 2005.
  38. Tartakovsky, A.G. Rapid detection of attacks in computer networks by quickest changepoint detection methods. In Data Analysis for Network Cyber-Security; Adams, N.; Heard, N., Eds.; Imperial College Press: London, UK, 2014; pp. 33–70.
  39. Tartakovsky, A.G.; Rozovskii, B.L.; Blaźek, R.B.; Kim, H. Detection of intrusions in information systems by sequential change-point methods. Statistical Methodology 2006, 3, 252–293.
  40. Tartakovsky, A.G.; Rozovskii, B.L.; Blaźek, R.B.; Kim, H. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Transactions on Signal Processing 2006, 54, 3372–3382. [CrossRef]
  41. Siegmund, D. Change-points: from sequential detection to biology and back. Sequential Analysis 2013, 32, 2–14. [CrossRef]
  42. Moustakides, G.V. Sequential change detection revisited. Annals of Statistics 2008, 36, 787–807. [CrossRef]
  43. Page, E.S. Continuous inspection schemes. Biometrika 1954, 41, 100–114.
  44. Shiryaev, A.N. On optimum methods in quickest detection problems. Theory of Probability and its Applications 1963, 8, 22–46. [CrossRef]
  45. Moustakides, G.V.; Polunchenko, A.S.; Tartakovsky, A.G. A numerical approach to performance analysis of quickest change-point detection procedures. Statistica Sinica 2011, 21, 571–596.
  46. Moustakides, G.V.; Polunchenko, A.S.; Tartakovsky, A.G. Numerical comparison of CUSUM and Shiryaev–Roberts procedures for detecting changes in distributions. Communications in Statistics – Theory and Methods 2009, 38, 3225–3239. [CrossRef]
  47. Lorden, G. Procedures for reacting to a change in distribution. Annals of Mathematical Statistics 1971, 42, 1897–1908.
  48. Moustakides, G.V. Optimal stopping times for detecting changes in distributions. Annals of Statistics 1986, 14, 1379–1387. [CrossRef]
  49. Pollak, M. Optimal detection of a change in distribution. Annals of Statistics 1985, 13, 206–227. [CrossRef]
  50. Tartakovsky, A.G.; Pollak, M.; Polunchenko, A.S. Third-order asymptotic optimality of the generalized Shiryaev–Roberts changepoint detection procedures. Theory of Probability and its Applications 2012, 56, 457–484. [CrossRef]
  51. Polunchenko, A.S.; Tartakovsky, A.G. On optimality of the Shiryaev–Roberts procedure for detecting a change in distribution. Annals of Statistics 2010, 38, 3445–3457.
  52. Shiryaev, A.N. The problem of the most rapid detection of a disturbance in a stationary process. Soviet Mathematics – Doklady 1961, 2, 795–799. Translation from Doklady Akademii Nauk SSSR, 138:1039–1042, 1961.
  53. Tartakovsky, A.G. Discussion on “Is Average Run Length to False Alarm Always an Informative Criterion?” by Yajun Mei. Sequential Analysis 2008, 27, 396–405. [CrossRef]
  54. Liang, Y.; Tartakovsky, A.G.; Veeravalli, V.V. Quickest change detection with non-stationary post-change observations. IEEE Transactions on Information Theory 2023, 69, 3400–3414. [CrossRef]
  55. Pergamenchtchikov, S.; Tartakovsky, A.G. Asymptotically optimal pointwise and minimax quickest change-point detection for dependent data. Statistical Inference for Stochastic Processes 2018, 21, 217–259. [CrossRef]
  56. Fuh, C.D.; Tartakovsky, A.G. Asymptotic Bayesian theory of quickest change detection for hidden Markov models. IEEE Transactions on Information Theory 2019, 65, 511–529. [CrossRef]
  57. Kolessa, A.; Tartakovsky, A.; Ivanov, A.; Radchenko, V. Nonlinear estimation and decision-making methods in short track identification and orbit determination problem. IEEE Transactions on Aerospace and Electronic Systems 2020, 56, 301–312. [CrossRef]
  58. Tartakovsky, A.; Berenkov, N.; Kolessa, A.; Nikiforov, I. Optimal sequential detection of signals with unknown appearance and disappearance points in time. IEEE Transactions on Signal Processing 2021, 69, 2653–2662. [CrossRef]
  59. Pergamenchtchikov, S.M.; Tartakovsky, A.G.; Spivak, V.S. Minimax and pointwise sequential changepoint detection and identification for general stochastic models. Journal of Multivariate Analysis 2022, 190, 1–22. [CrossRef]
1. In many practical problems, K is substantially smaller than the total number of streams N, which can be very large.
Figure 1. Illustration of single-run sequential changepoint detection. Two possibilities in the detection process: false alarm (left) and correct detection (right).