Conditional mixture model and its application to regression modeling

The expectation maximization (EM) algorithm is a powerful mathematical tool for estimating statistical parameters when the data sample contains a hidden part and an observed part. EM is applied to learn the finite mixture model, in which the whole distribution of the observed variable is a weighted sum of partial distributions. The mixing weight of every partial distribution is specified by the probability of the hidden variable. An application of the mixture model is soft clustering, in which each cluster is modeled by the hidden variable: a data point can be assigned to more than one cluster, and the degree of such assignment is represented by the probability of the hidden variable. However, in the traditional mixture model this probability is simplified to a bare parameter, which can cause a loss of valuable information. Therefore, in this research I propose a so-called conditional mixture model (CMM) in which the probability of the hidden variable is modeled as a full probability density function (PDF) with its own parameters. CMM aims to extend the mixture model. I also propose an application of CMM called the adaptive regression model (ARM). The traditional regression model is effective when the data sample is scattered evenly. If data points are grouped into clusters, the regression model tries to learn a single unified regression function that passes through all data points. Obviously, such a unified function is not effective for evaluating the response variable on grouped data points. The term "adaptive" in ARM means that ARM solves this ineffectiveness problem by first selecting the best cluster of data points and then evaluating the response variable within that best cluster. In other words, ARM reduces the estimation space of the regression model so as to gain higher accuracy in calculation.


Introduction
Suppose data has two parts: a hidden part X and an observed part Y, and we only know Y. The relationship between the random variable X and the random variable Y is specified by the joint probability density function (PDF) denoted f(X, Y | Θ), where Θ is its parameter. Given a sample {Y1, Y2,…, YN} whose Yi (s) are mutually independent and identically distributed (iid), it is required to estimate Θ based on this sample whereas X is unknown. The expectation maximization (EM) algorithm is applied to solve this problem when only the Yi (s) are observed. EM has many iterations, and each iteration has two steps: the expectation step (E-step) and the maximization step (M-step). At some t-th iteration, given the current parameter Θ(t), the two steps are described as follows:

E-step: The expectation Q(Θ | Θ(t)) is determined based on the current parameter Θ(t), according to equation 1.1 (Nguyen, Tutorial on EM tutorial, 2020, p. 50):

Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{N} \int f(X \mid Y_i, \Theta^{(t)}) \log f(X, Y_i \mid \Theta) \, dX   (1.1)

M-step: The next parameter Θ(t+1) is found as a maximizer of Q(Θ | Θ(t)) with regard to Θ:

\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})
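As a rough illustration of this iterative scheme, the following minimal Python sketch shows the EM loop; the callbacks e_step and m_step are hypothetical placeholders for the model-specific computations and are not defined in this paper:

import numpy as np

def em(Y, theta0, e_step, m_step, tol=1e-6, max_iter=100):
    # Generic EM loop: alternate E-step and M-step until the estimate
    # stabilizes, i.e. Theta(t) is approximately Theta(t+1) = Theta*.
    # theta0 is assumed to be a NumPy array of model parameters.
    theta = theta0
    for _ in range(max_iter):
        stats = e_step(Y, theta)        # E-step: quantities defining Q(Theta | Theta(t))
        theta_next = m_step(Y, stats)   # M-step: Theta(t+1) maximizing Q(Theta | Theta(t))
        if np.max(np.abs(theta_next - theta)) < tol:
            break                       # convergence: Theta(t) = Theta(t+1) = Theta*
        theta = theta_next
    return theta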
The EM algorithm will converge after some iterations; at that time we have the estimate Θ(t) = Θ(t+1) = Θ*. Note, the estimate Θ* is the result of EM. Especially, the random variable X represents a latent class or latent component of the random variable Y. Suppose X is discrete and ranges in {1, 2,…, K}. As a convention, let k = X. Note, because all Yi (s) are iid, let the random variable Y represent every Yi. The so-called probabilistic finite mixture model is represented by the PDF of Y, as follows:

f(Y \mid \Theta) = \sum_{k=1}^{K} \alpha_k f_k(Y \mid \theta_k)   (1.2)

Where,

\Theta = (\alpha_1, \alpha_2, \ldots, \alpha_K, \theta_1, \theta_2, \ldots, \theta_K)^T

Note, the superscript "T" denotes the transpose operator for vectors and matrices. The Q(Θ | Θ(t)) is re-defined for the finite mixture model as follows (Nguyen, Tutorial on EM tutorial, 2020, p. 79):

Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} P(k \mid Y_i, \Theta^{(t)}) \log\big(\alpha_k f_k(Y_i \mid \theta_k)\big)   (1.3)

Where the conditional probability P(k | Yi, Θ(t)) is

P(k \mid Y_i, \Theta^{(t)}) = \frac{\alpha_k^{(t)} f_k(Y_i \mid \theta_k^{(t)})}{\sum_{l=1}^{K} \alpha_l^{(t)} f_l(Y_i \mid \theta_l^{(t)})}   (1.4)

If every fk(Y | θk) distributes normally with mean μk and covariance matrix Σk such that θk = (μk, Σk)^T, the next parameter Θ(t+1) is calculated at the M-step of such t-th iteration given the current parameter Θ(t) as follows (Nguyen, Tutorial on EM tutorial, 2020, p. 85):

\alpha_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} P(k \mid Y_i, \Theta^{(t)})

\mu_k^{(t+1)} = \frac{\sum_{i=1}^{N} P(k \mid Y_i, \Theta^{(t)})\, Y_i}{\sum_{i=1}^{N} P(k \mid Y_i, \Theta^{(t)})}

\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N} P(k \mid Y_i, \Theta^{(t)}) (Y_i - \mu_k^{(t+1)})(Y_i - \mu_k^{(t+1)})^T}{\sum_{i=1}^{N} P(k \mid Y_i, \Theta^{(t)})}   (1.5)

Note, the conditional probability P(k | Yi, Θ(t)) is calculated at the E-step.
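The updates in equations 1.4 and 1.5 translate directly into code. The following minimal Python sketch (my own illustration, not code from the cited tutorial) performs one EM iteration for the Gaussian mixture:

import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(Y, alpha, mu, Sigma):
    # One EM iteration for the Gaussian mixture model.
    # Y: (N, d) sample; alpha: (K,) mixing weights;
    # mu: (K, d) means; Sigma: (K, d, d) covariance matrices.
    N, K = Y.shape[0], alpha.shape[0]
    # E-step: responsibilities P(k | Yi, Theta(t)) as in equation 1.4.
    R = np.zeros((N, K))
    for k in range(K):
        R[:, k] = alpha[k] * multivariate_normal.pdf(Y, mu[k], Sigma[k])
    R /= R.sum(axis=1, keepdims=True)
    # M-step: closed-form updates as in equation 1.5.
    Nk = R.sum(axis=0)                       # effective sizes of components
    alpha_next = Nk / N
    mu_next = (R.T @ Y) / Nk[:, None]
    Sigma_next = np.empty_like(Sigma)
    for k in range(K):
        D = Y - mu_next[k]
        Sigma_next[k] = (R[:, k, None] * D).T @ D / Nk[k]
    return alpha_next, mu_next, Sigma_next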
In the traditional finite mixture model, the parameter αk is essentially the parameter of the hidden random variable X when X is discrete: αk = P(X=k). In other words, P(X) is "reduced" as much as possible, to a single number per component. There is a problem of how to define and learn the finite mixture model when P(X) is still a full PDF with its own parameters. This problem is solved by the definition of the conditional mixture model (CMM) in the next section.

Conditional mixture model
Now let W and Y be two random variables, both of which are observed. I define the conditional PDF of Y given W as follows:

f(Y \mid W, \Theta) = \sum_{k=1}^{K} \frac{g_k(W \mid \varphi_k)}{\sum_{l=1}^{K} g_l(W \mid \varphi_l)} f_k(Y \mid \theta_k)   (2.1)

Where gk(W | φk) is the k-th PDF of W, which can be considered the PDF of X for the k-th component. Equation 2.1 specifies the so-called conditional mixture model (CMM) when the random variable Y is dependent on another random variable W. It is possible to consider that the parameter αk in the traditional mixture model specified by equation 1.2 is:

\alpha_k = \frac{g_k(W \mid \varphi_k)}{\sum_{l=1}^{K} g_l(W \mid \varphi_l)}   (2.2)

It is deduced that the hidden variable X = k in CMM is represented by gk(W | φk) with a full set of necessary parameters φk. When the sum \sum_{l=1}^{K} g_l(W \mid \varphi_l) is considered constant, we have:

f(Y \mid W, \Theta) \propto \sum_{k=1}^{K} g_k(W \mid \varphi_k) f_k(Y \mid \theta_k)

Where the sign "∝" indicates proportionality. The quasi-conditional PDF of Y given W is defined to be proportional to the conditional PDF of Y given W as follows:

\tilde{f}(Y \mid W, \Theta) = \sum_{k=1}^{K} g_k(W \mid \varphi_k) f_k(Y \mid \theta_k)   (2.3)

Where the parameter of CMM is Θ = (φ1, φ2,…, φK, θ1, θ2,…, θK)^T. Of course, we have:

f(Y \mid W, \Theta) \propto \tilde{f}(Y \mid W, \Theta)

Given a sample of pairs {(W1, Y1), (W2, Y2),…, (WN, YN)} whose Wi (s) are iid and whose Yi (s) are iid, we need to learn CMM. Let W and Y represent every Wi and every Yi, respectively. When applying EM along with the quasi-conditional PDF f̃(Y | W, Θ) to estimate Θ, the Q(Θ | Θ(t)) is re-defined as follows:

Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} P(k \mid W_i, Y_i, \Theta^{(t)}) \log\big(g_k(W_i \mid \varphi_k) f_k(Y_i \mid \theta_k)\big)   (2.4)

We need to maximize Q(Θ | Θ(t)) at the M-step of the t-th iteration given the current parameter Θ(t). Expectedly, the next parameter Θ(t+1) is the solution of the equation created by setting the first-order derivative of Q(Θ | Θ(t)) with regard to Θ to zero. The first-order partial derivatives of Q(Θ | Θ(t)) with regard to φk and θk are:

\frac{\partial Q(\Theta \mid \Theta^{(t)})}{\partial \varphi_k} = \sum_{i=1}^{N} P(k \mid W_i, Y_i, \Theta^{(t)}) \frac{\partial \log g_k(W_i \mid \varphi_k)}{\partial \varphi_k}

\frac{\partial Q(\Theta \mid \Theta^{(t)})}{\partial \theta_k} = \sum_{i=1}^{N} P(k \mid W_i, Y_i, \Theta^{(t)}) \frac{\partial \log f_k(Y_i \mid \theta_k)}{\partial \theta_k}

Where, mirroring equation 1.4, the conditional probability P(k | Wi, Yi, Θ(t)) is calculated at the E-step as

P(k \mid W_i, Y_i, \Theta^{(t)}) = \frac{g_k(W_i \mid \varphi_k^{(t)}) f_k(Y_i \mid \theta_k^{(t)})}{\sum_{l=1}^{K} g_l(W_i \mid \varphi_l^{(t)}) f_l(Y_i \mid \theta_l^{(t)})}
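To make the E-step of CMM concrete, the sketch below computes the responsibilities P(k | Wi, Yi, Θ(t)); the Gaussian choices for gk and fk are illustrative assumptions of mine, since CMM itself leaves both PDFs generic:

import numpy as np
from scipy.stats import multivariate_normal, norm

def cmm_responsibilities(W, Y, phi, theta):
    # E-step of CMM: P(k | Wi, Yi, Theta(t)) is proportional to
    # g_k(Wi | phi_k) * f_k(Yi | theta_k), normalized over k.
    # Illustrative assumption: g_k is Gaussian over W with
    # phi[k] = (mean, cov); f_k is univariate Gaussian over Y with
    # theta[k] = (mean, std).
    N, K = W.shape[0], len(phi)
    R = np.zeros((N, K))
    for k in range(K):
        g = multivariate_normal.pdf(W, phi[k][0], phi[k][1])  # g_k(Wi | phi_k)
        f = norm.pdf(Y, theta[k][0], theta[k][1])             # f_k(Yi | theta_k)
        R[:, k] = g * f
    R /= R.sum(axis=1, keepdims=True)
    return R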

Adaptive regression model
The traditional regression model is effective when the data sample is scattered evenly. If data points are grouped into clusters by their nature, the regression model tries to learn a single unified regression function that passes through all data points. Obviously, such a unified function is not effective for evaluating the response variable on grouped data points. Alternatively, if it is possible to select the right cluster for evaluating the response variable, the value of the response variable will be more precise. Therefore, selective evaluation is the main idea of the adaptive regression model (ARM).
The main idea of ARM is to group the sample into clusters and to build respective regression functions for the clusters in parallel. CMM is applied to solve this problem; in other words, ARM is an application of CMM. There may be other applications of CMM, but here I focus on ARM. Given an n-dimensional random variable W = (w1, w2,…, wn)^T whose components are called regressors, a linear regression function is defined as

y = \beta_0 + \beta_1 w_1 + \beta_2 w_2 + \cdots + \beta_n w_n = \beta^T X

Where y is the random variable called the response variable, each βj is called a regression coefficient, and X = (1, w1, w2,…, wn)^T denotes the regressor vector W augmented with the constant 1. According to the linear regression model, y conforms to a normal distribution, as follows:

f(y \mid W, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - \beta^T X)^2}{2\sigma^2}\right)

Where β = (β0, β1,…, βn)^T is called the regression parameter of f(y | W, β, σ²). Therefore, the mean and variance of f(y | W, β, σ²) are β^T X and σ², respectively. Note, f(y | W, β, σ²) is called the regressive PDF of y. As a convention, we denote:

\theta_k = (\beta_k, \sigma_k^2)^T, \quad \Theta = (\varphi_1, \ldots, \varphi_K, \theta_1, \ldots, \theta_K)^T

When applying EM to estimate Θ, by following equation 2.3, the Q(Θ | Θ(t)) for ARM is re-defined as follows:

Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} P(k \mid W_i, y_i, \Theta^{(t)}) \log\big(g_k(W_i \mid \varphi_k) f_k(y_i \mid W_i, \theta_k)\big)

The function fk(y | W, θk) is the k-th regressive PDF of y.
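As a minimal sketch of the "adaptive" evaluation described above (assuming a Gaussian gk over W, which the model itself does not prescribe), prediction first picks the best cluster for a new regressor vector and then applies only that cluster's regression function:

import numpy as np
from scipy.stats import multivariate_normal

def arm_predict(W_new, phi, beta):
    # "Adaptive" evaluation: first select the best cluster for the new
    # regressor vector W_new, then evaluate y within that cluster only.
    # Illustrative assumptions: phi[k] = (mean, cov) of a Gaussian g_k
    # over W; beta[k] is the (n+1,) coefficient vector of cluster k.
    scores = [multivariate_normal.pdf(W_new, m, c) for (m, c) in phi]
    k_best = int(np.argmax(scores))          # cluster maximizing g_k(W_new | phi_k)
    X = np.concatenate(([1.0], W_new))       # augmented regressor (1, w1, ..., wn)
    return beta[k_best] @ X                  # y evaluated by the best cluster's function

At the M-step, one would presumably obtain each βk by responsibility-weighted least squares over the sample, so that the clusters and their regression functions are indeed learned in parallel as stated above.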