3. Approximate Density DGM
According to the approximate density approach, a DGM estimates the generator PDF P(x | Φ, z) approximately, or derives another PDF that is close to P(x | Φ, z), where PDF abbreviates probability density function.
Recall that there are two problems in constructing a DGM: 1) how to define a likelihood or error to train the generator DNN g(z | Φ) and 2) how to define a tractable PDF P(z), which implies the way to randomize z. The second problem amounts to asserting the qualification of random data z' and hence is stated as the qualification problem of how to qualify random data. According to the implicit density approach, a discrimination DNN qualifies randomized data z instead of a tractable PDF P(z) being defined; the Generative Adversarial Network (GAN) is a typical method belonging to the implicit density approach. Taking a different route within the approximate density approach, the Variational Autoencoders (VAE) method developed by Kingma and Welling (Kingma & Welling, 2022) proposes another DNN called the encoder f(x | Θ), which is expected to convert intractable data x into tractable data z. In other words, the encoder f(x | Θ) approximates tractable data z by encoded data z'.
It is easy to recognize that the encoder f(x | Θ) approximates the inverse of the generator g(z | Φ) when g(z | Φ) is invertible, where the x-dimension m is larger than the z-dimension n (m > n); this is the reason that the generator g(z | Φ) is called the decoder in VAE. Like the decoder g(z | Φ), the encoder f(x | Θ) is modeled by a so-called encoder DNN whose weights form the parameter Θ, called the encoder parameter, and so the parameter Φ is called the decoder parameter in VAE. Following the fact that the encoder f(x | Θ) approximates tractable data z by encoded data z', the tractable PDF P(z) is approximated by a so-called encoder PDF Pf(z').
Because the encoder f(x | Θ) depends on its parameter Θ, we can denote:
Essentially, the encoder PDF P(z' | Θ, x) is the likelihood function of z' given x, which is the conditional PDF of z' given x; hence P(z' | Θ, x) is called the encoder likelihood, which of course depends on the encoder f(x | Θ). On the other hand, P(z' | Θ, x) is the posterior PDF of tractable data given intractable data x, where P(z) is the prior PDF of tractable data. In practice, z' is assumed to conform to a multivariate normal distribution; therefore, let μ(x) and Σ(x) be the mean vector and covariance matrix of z'. The encoder likelihood P(z' | Θ, x) becomes P(z' | Θ, μ(x | Θ), Σ(x | Θ)), so that the output of the encoder DNN f(x | Θ) is the mean μ(x | Θ) and covariance matrix Σ(x | Θ), while its input is x and its weights are Θ, of course.
Note, N(.) denotes the normal distribution; thus N(z | μ(x | Θ), Σ(x | Θ)) represents the encoder likelihood. That N(z | μ(x | Θ), Σ(x | Θ)) is the encoder likelihood is an important improvement in developing VAE, because the encoder DNN f(x | Θ) is learned by minimizing a so-called encoder error, represented by the difference between the encoder likelihood and the predefined tractable PDF P(z). Let KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) be the Kullback-Leibler divergence between the encoder likelihood N(z | μ(x | Θ), Σ(x | Θ)) and the predefined tractable PDF P(z). As a result, KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) becomes an ideal encoder error, called the encoder KL divergence. The smaller the encoder KL divergence is, the closer the encoder likelihood N(z | μ(x | Θ), Σ(x | Θ)) is to the tractable PDF P(z), and the better the encoder DNN f(x | Θ) is.
Therefore, the encoder KL divergence KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) is minimized by the stochastic gradient descent (SGD) method in order to estimate the encoder parameter Θ for training the encoder DNN f(x | Θ) as follows:

Which yields the estimation equation according to SGD:
Where ∇KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) is the gradient of the encoder KL divergence KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) with regard to μ(x | Θ) and Σ(x | Θ), while γ is the learning rate. Recall that SGD, an iterative process, pushes the candidate solution at each iteration in the direction opposite to the gradient of the target function for minimization, or in the same direction as the gradient for maximization, with the step length represented by the learning rate. We have:
There is no change in estimating the decoder parameter Φ within VAE, so the decoder error ε(x | Φ, z) = ½||g(z | Φ) – x||² is minimized to produce the optimal Φ.
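As a minimal sketch of this decoder update, assuming a hypothetical linear decoder g(z | Φ) = Φz (the names `decoder` and `sgd_step` and the dimensions are illustrative only), one SGD step on the decoder error ε(x | Φ, z) = ½||g(z | Φ) – x||² can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear decoder g(z | Phi) = Phi @ z mapping n = 2 to m = 4.
n, m = 2, 4
Phi = rng.normal(size=(m, n))

def decoder(z, Phi):
    return Phi @ z

def decoder_error(x, z, Phi):
    # epsilon(x | Phi, z) = 1/2 * ||g(z | Phi) - x||^2
    r = decoder(z, Phi) - x
    return 0.5 * r @ r

def sgd_step(x, z, Phi, gamma=0.05):
    # Gradient of the decoder error with regard to Phi is (g(z | Phi) - x) z^T.
    grad = np.outer(decoder(z, Phi) - x, z)
    return Phi - gamma * grad

z = rng.normal(size=n)
x = rng.normal(size=m)
before = decoder_error(x, z, Phi)
Phi = sgd_step(x, z, Phi)
after = decoder_error(x, z, Phi)
```

For a small enough learning rate γ, the step reduces the decoder error, mirroring the SGD estimation equation.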
Which yields the estimation equation according to SGD:

Recall that the generator g(z | Φ) is called the decoder in VAE. As a result, the encoder parameter Θ and decoder parameter Φ are estimated as follows:
Where dg(z | Φ) / dΦ is the differential of g(z | Φ) with regard to Φ, while 0 < γ ≤ 1 is the learning rate and the tractable PDF P(z) is predefined, with the note that VAE replaces the tractable PDF P(z) by the likelihood P(z' | Θ, μ(x | Θ), Σ(x | Θ)) with fixed P(z). As usual, P(z) is assumed to conform to the standard normal distribution with mean 0 and covariance matrix I.
This implies:
Where I is the identity matrix:
It is easier to determine the gradient of the encoder KL divergence ∇KL(N(z | μ(x | Θ), Σ(x | Θ)) | N(z | 0, I)) with regard to Θ between the multivariate normal distribution N(z | μ(x | Θ), Σ(x | Θ)) and the standard multivariate normal distribution N(z | 0, I). We have the following equation to calculate such gradient (Kingma & Welling, 2022, p. 5), (Doersch, 2016, p. 9), (Nguyen, 2015, p. 43):
Where (Σ(x | Θ))⁻¹ is the inverse of the covariance matrix Σ(x | Θ), the superscript “T” denotes the transposition operator of matrices and vectors, and dμ(x | Θ) / dΘ and dΣ(x | Θ) / dΘ are the differentials of μ(x | Θ) and Σ(x | Θ) with regard to Θ, respectively. As a result, the encoder parameter Θ and decoder parameter Φ are fully estimated according to SGD as follows:
The estimation equations above are a simple explanation of VAE, but its formal construction is more complicated. We begin with the aforementioned intractable PDF P(x) specified by the law of total probability:
However, P(x) is interpreted in another way within VAE, based on Bayes' rule:
Because the conditional probability P(z | x) is arbitrary, without formal specification, it should be approximated by another PDF denoted Q(z | x), with the assumption that the PDF Q(z | x) has a formal specification such as the normal distribution.
The logarithm of the intractable PDF P(x) is specified as follows (Ruthotto & Haber, 2021, p. 13):
The second term log(Q(z | x) / P(z | x)) is not varied, because Q(z | x) approximates P(z | x). Therefore, the first term log(P(x, z) / Q(z | x)) is called the variational lower bound or evidence lower bound, because it is the term that varies. Let l(x, z) be the loss function or error function of VAE, defined as the opposite (negative) of the expectation of the evidence lower bound log(P(x, z) / Q(z | x)) given the PDF Q(z | x), with the note that Q(z | x) has a formal probabilistic distribution.
The loss function l(x, z) is expanded as follows:

Because Q(z | x) and P(x | z) depend on the encoder f(x | Θ) and decoder g(z | Φ), respectively, their parameters are Θ and Φ, respectively.
Exactly, Q(z | Θ, x) is the encoder likelihood, which is the same as the aforementioned P(z' | Θ, x) except that it is emphasized that Q(z | Θ, x) has a formal probabilistic specification such as the normal distribution. The loss function l(Θ, Φ | x, z), which is now a function of the encoder parameter Θ and decoder parameter Φ, is written as follows (Ruthotto & Haber, 2021, p. 14):
Firstly, pay attention to the first term of the loss function l(Θ, Φ | x, z), where P(x | Φ, z) depends only on Φ, although it can be considered a conditional PDF of x given z, because P(x | Φ, z) is defined for the output layer, containing only x, of the decoder DNN g(z | Φ) whose input is z. Therefore, we have the following assertion:
Secondly, the second term in the loss function l(Θ, Φ | x, z) is actually the Kullback-Leibler divergence between the encoder likelihood Q(z | Θ, x) and the predefined tractable PDF P(z), which measures the difference between Q(z | Θ, x) and P(z). As a convention, this Kullback-Leibler divergence is called the encoder KL divergence, which is an ideal encoder error.
The smaller the encoder KL divergence is, the closer the encoder likelihood Q(z | Θ, x) is to the tractable PDF P(z), and the better the encoder DNN f(x | Θ) is. The loss function is rewritten again:
According to the two problems of constructing a DGM, the first term –log(P(x | Φ, z)) in the loss function addresses the first problem of how to train the decoder DNN g(z | Φ) and is called the reconstruction error in the literature, while the second term KL(Q(z | Θ, x) | P(z)) addresses the second problem of how to qualify the training task for training the encoder DNN f(x | Θ) and is called the regularity (regularization) term in the literature. The loss function l(Θ, Φ | x, z) is minimized to estimate Θ and Φ as follows:
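A minimal sketch of the two-term loss, assuming a diagonal-Gaussian encoder, a unit-variance Gaussian decoder likelihood (so –log(P(x | Φ, z)) is ½||g(z | Φ) – x||² up to an additive constant), and the standard normal prior P(z) = N(0, I); all numeric values are hypothetical:

```python
import numpy as np

def kl_regularity(mu, var):
    # KL(N(mu, diag(var)) | N(0, I)) = 1/2 * sum_i (var_i + mu_i^2 - 1 - log var_i)
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def reconstruction_error(x, x_hat):
    # -log P(x | Phi, z) up to an additive constant for a unit-variance
    # Gaussian decoder likelihood: 1/2 * ||g(z | Phi) - x||^2
    r = x_hat - x
    return 0.5 * r @ r

x = np.array([1.0, -1.0, 0.5])           # intractable data
x_hat = np.array([0.9, -1.1, 0.4])       # decoder output g(z | Phi)
mu = np.array([0.2, -0.1])               # encoder mean mu(x | Theta)
var = np.array([0.8, 1.3])               # encoder variances sigma_i^2(x | Theta)

loss = reconstruction_error(x, x_hat) + kl_regularity(mu, var)
```

The regularity term vanishes exactly when the encoder output matches the prior (zero mean, unit variances), so minimizing the loss balances reconstruction against staying close to P(z).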
Because P(x | Φ, z) depends only on Φ and the encoder KL divergence KL(Q(z | Θ, x) | P(z)) depends only on Θ, the optimization problem is specified as follows:
Which yields the estimation equations according to SGD:
Where ∇KL(Q(z | Θ, x) | P(z)) is the gradient of the encoder KL divergence KL(Q(z | Θ, x) | P(z)) with regard to the encoder parameter Θ. Note that the tractable PDF P(z) is predefined (fixed). While Q(z | Θ, x) is called the encoder likelihood, P(x | Φ, z) is called the decoder likelihood. On the other hand, while P(z) is the prior PDF of tractable data z, Q(z | Θ, x) is the approximated posterior PDF of z given x, where both P(z) and Q(z | Θ, x) have formal probabilistic specifications and, moreover, P(z) is fixed (predefined).
Both P(z | Θ, x) and Q(z | Θ, x) are encoder likelihoods as well as posterior PDFs of tractable data z, but Q(z | Θ, x) is the approximated one, whose probabilistic distribution is specified formally. Therefore (Ruthotto & Haber, 2021, p. 16), randomized data z' in the latent space Z is sampled from the approximated distribution Q(z | Θ, x) instead of from the true distribution P(z | Θ, x).
Given an epoch of size N denoted as D = (d(1) = (x(1), z(1)), d(2) = (x(2), z(2)),…, d(N) = (x(N), z(N))), the estimation equations of Θ and Φ are extended exactly as epoch estimation at every iteration of SGD:
Please distinguish that the tractable data z(i) in the first equation above follows the distribution P(z), whereas the tractable data z(i) in the second equation follows the distribution Q(z | Θ, x). As a result, VAE trained with SGD is specified as follows:
Initialize Θ and Φ and set k = 0.
Repeat
Sample epoch X = (x(1), x(2),…, x(N)) or receive epoch X from big data / a data stream.
Randomize random epoch Z = (z(1), z(2),…, z(N)) in which each z(i) is randomized from the distribution Q(z | Θ(k), x(i)).
Increase k = k + 1.
Until some terminating conditions are met.
Note, a terminating condition can be customized, for example: the parameters Θ and Φ are not changed significantly, or there is no more incoming epoch X. Moreover, the index k indicates the time point as well as the iteration of SGD. Because the PDF P(z) is predefined, it is easy to calculate the encoder KL divergence KL(Q(z(i) | Θ(k), x(i)) | P(z)), but it is necessary to define P(x) by a well-known distribution. However, randomizing the random epoch Z = (z(1), z(2),…, z(N)) from the distribution Q(z | Θ(k), x(i)) is not easy, and so VAE trained with SGD will be fine-tuned. It is interesting that, because Q(z | Θ(k), x(i)) is the posterior PDF of z and P(z) is the prior PDF of z, the fact that z is randomized from the posterior PDF Q(z | Θ(k), x(i)) while Q(z | Θ(k), x(i)) itself is updated continuously based on its previous evidence x(i) over SGD iterations implies that VAE conforms to Bayesian statistics in estimation. Moreover, P(z) is an alignment to which Q(z | Θ(k), x(i)) adjusts itself with the support of the encoder KL divergence KL(Q(z(i) | Θ(k), x(i)) | P(z)).
Because the encoder likelihood Q(z | Θ, x) must always have a formal probabilistic distribution, it is assumed to follow a multivariate normal distribution in practice. Therefore, let μ(x | Θ) and Σ(x | Θ) be the mean vector and covariance matrix of z; then the encoder likelihood Q(z | Θ, x) becomes Q(z | μ(x | Θ), Σ(x | Θ)), so that the output of the encoder DNN f(x | Θ) is the mean μ(x | Θ) and covariance matrix Σ(x | Θ), while its input is x and its weights are Θ, of course. Please pay attention to the fact that the output of the encoder DNN f(x | Θ) is now μ(x | Θ) and Σ(x | Θ), which correspond to z. Moreover, μ(x | Θ) and Σ(x | Θ) are functions of x whose parameter is Θ.
Note, N(z | μ(x | Θ), Σ(x | Θ)) denotes the multivariate normal distribution with mean μ(x | Θ) and covariance matrix Σ(x | Θ).
Note, the dimension of tractable data z is n. Moreover, the notation |.| or det(.) denotes the determinant of a matrix, (Σ(x | Θ))⁻¹ is the inverse of the covariance matrix Σ(x | Θ), and the superscript “T” denotes the transposition operator of matrices and vectors. It is easy to recognize that z' is an approximation of z. When the tractable PDF P(z) is fixed, it is often assumed to follow a multivariate normal distribution with predefined mean μ0 and predefined covariance matrix Σ0 as follows:
The encoder KL divergence KL(Q(z | Θ, x) | P(z)) between Q(z | Θ, x) and P(z) becomes the encoder KL divergence KL(Q(z | μ(x | Θ), Σ(x | Θ)) | P(z)) between Q(z | μ(x | Θ), Σ(x | Θ)) and P(z) as follows:
Which is, essentially, the KL divergence between two normal distributions, KL(N(z | μ(x | Θ), Σ(x | Θ)) | N(μ0, Σ0)). As a convention, this divergence is called the encoder KL divergence, which is determined in the literature as follows (Doersch, 2016, p. 9):
Where tr(.) denotes the trace operator of a square matrix, which is the sum of the elements on its main diagonal; for instance, given an n×n matrix A, tr(A) = a11 + a22 +… + ann, where aij is the element at row i and column j. Moreover, the notation |.| or det(.) denotes the determinant of a matrix. The gradient of the encoder KL divergence consists of two elemental gradients, with regard to the mean μ(x | Θ) and the covariance matrix Σ(x | Θ).
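The closed form cited above can be sketched directly; the function below follows the formula from (Doersch, 2016), with hypothetical inputs:

```python
import numpy as np

def kl_gauss(mu1, S1, mu0, S0):
    # KL(N(mu1, S1) | N(mu0, S0)) in closed form:
    # 1/2 * [tr(S0^-1 S1) + (mu0 - mu1)^T S0^-1 (mu0 - mu1) - n
    #        + ln(det S0 / det S1)]
    n = mu1.shape[0]
    S0inv = np.linalg.inv(S0)
    d = mu0 - mu1
    return 0.5 * (np.trace(S0inv @ S1) + d @ S0inv @ d - n
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

mu = np.array([0.5, -0.2])                        # mu(x | Theta)
Sigma = np.array([[1.5, 0.2],
                  [0.2, 0.7]])                    # Sigma(x | Theta)
kl = kl_gauss(mu, Sigma, np.zeros(2), np.eye(2))  # against the prior N(0, I)
```

The divergence is zero only when the two distributions coincide, which is the sense in which it serves as an encoder error.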
Where,
Where dμ(x | Θ) / dΘ and dΣ(x | Θ) / dΘ are the differentials of μ(x | Θ) and Σ(x | Θ) with regard to Θ, respectively. It is not difficult to calculate the KL gradient ∇μ:
Due to (Nguyen, Matrix Analysis and Calculus, 2015, p. 35):
It is not difficult to calculate the KL gradient ∇Σ too:
Due to (Nguyen, Matrix Analysis and Calculus, 2015, pp. 45-46):
As a result, the encoder parameter Θ consists of two elemental parameters, with regard to the mean μ(x | Θ) and the covariance matrix Σ(x | Θ), as follows:
Where,
Note, given a random vector z = (z1, z2,…, zn)T whose elements zi are random variables too, σij where i ≠ j is the covariance between the two random variables zi and zj, and σi² is the variance of the random variable zi. It is easy to calculate the encoder parameters Θμ and ΘΣ by SGD estimation:
Where dμ(x | Θμ) / dΘμ and dΣ(x | ΘΣ) / dΘΣ are the differentials of μ(x | Θμ) and Σ(x | ΘΣ) with regard to Θμ and ΘΣ, respectively. In practice, P(z) is assumed to conform to the standard normal distribution with zero mean μ0 = 0 and identity covariance matrix Σ0 = I, where I is the identity matrix, so that the encoder parameters Θμ and ΘΣ are computed effectively.
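With the standard prior (μ0 = 0, Σ0 = I), the mean gradient reduces to ∇μ = μ(x | Θ); a finite-difference check confirms this (a sketch, with hypothetical μ and Σ):

```python
import numpy as np

def kl_vs_std(mu, Sigma):
    # KL(N(mu, Sigma) | N(0, I)) = 1/2 * (tr(Sigma) + mu^T mu - n - ln det Sigma)
    n = mu.shape[0]
    return 0.5 * (np.trace(Sigma) + mu @ mu - n - np.log(np.linalg.det(Sigma)))

mu = np.array([0.3, -0.4])
Sigma = np.array([[1.2, 0.1],
                  [0.1, 0.9]])

grad_mu = mu                     # analytic gradient of the KL with regard to mu

# Central finite differences on each component of mu.
eps = 1e-6
fd = np.zeros(2)
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    fd[i] = (kl_vs_std(mu + e, Sigma) - kl_vs_std(mu - e, Sigma)) / (2 * eps)
```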
In order to improve computational effectiveness further, it is possible to suppose that the elemental variables zi in z = (z1, z2,…, zn)T within the context of P(z) are mutually independent, so that the covariance σij between two variables zi and zj where i ≠ j is 0; as a result, only the variances σi² of zi remain. The covariance matrix Σ(x | Θ) becomes a diagonal matrix:
Note,
Where σi²(x | Θ) is the variance of the elemental variable zi in z = (z1, z2,…, zn)T given x, according to the encoder f(x | Θ). As a result, the encoder parameter ΘΣ, which is now a diagonal matrix represented by its diagonal vector σ²(x | ΘΣ), is computed more easily.
Where,
In general, the estimation equations for the encoder parameter Θ = (Θμ, ΘΣ)T are specified as follows:
Where dσ²(x | ΘΣ) / dΘΣ is the differential of σ²(x | ΘΣ) with regard to ΘΣ.
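Under this diagonal simplification with the standard prior N(0, I), the regularity term and its derivative with regard to each variance have simple elementwise forms; a sketch with hypothetical values:

```python
import numpy as np

def kl_diag(mu, var):
    # KL(N(mu, diag(var)) | N(0, I)) with mutually independent z_i:
    # 1/2 * sum_i (var_i + mu_i^2 - 1 - log var_i)
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def dkl_dvar(var):
    # Elementwise derivative with regard to each variance sigma_i^2:
    # 1/2 * (1 - 1 / var_i), which vanishes exactly at var_i = 1.
    return 0.5 * (1.0 - 1.0 / var)

mu = np.array([0.1, -0.3, 0.0])
var = np.array([0.5, 1.0, 2.0])
kl = kl_diag(mu, var)
grad = dkl_dvar(var)
```

The gradient pulls every variance toward 1 and (through the mean term) every mean toward 0, i.e., toward the prior.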
There is no change in estimating the decoder parameter Φ within VAE, so the decoder log-likelihood log(P(x | Φ, z)) is maximized.
As usual, the decoder likelihood P(x | Φ, z) is assumed to be distributed normally with mean δ and variance σ².
Which implies the decoder log-likelihood log(P(x | Φ, z)) as follows:
Where ||.|| denotes the Euclidean norm of a vector. The gradient of the decoder log-likelihood is:
Where dg(z | Φ) / dΦ is the differential of g(z | Φ) with regard to Φ. Setting δ = 0 and σ² = 1 to simplify the optimization, we have:

Which implies the estimation equation for the decoder parameter Φ by SGD as follows:
Because the data z in the decoder estimation equation above follows the encoder likelihood Q(z | μ(x | Θμ), Σ(x | ΘΣ)) = N(z | μ(x | Θμ), Σ(x | ΘΣ)) rather than the tractable PDF P(z) = N(z | μ0, Σ0), it is denoted as z' such that:
Given an epoch of size N denoted as D = (d(1) = (x(1), z'(1)), d(2) = (x(2), z'(2)),…, d(N) = (x(N), z'(N))), the estimation equations of Θ and Φ are extended exactly as epoch estimation at every iteration of SGD:
As a result, VAE trained with SGD is specified as follows:
Initialize Θ = (Θμ, ΘΣ)T and Φ and set k = 0.
Repeat
Sample epoch X = (x(1), x(2),…, x(N)) or receive epoch X from big data / a data stream.
Randomize random epoch Z = (z(1), z(2),…, z(N)) from the standard normal distribution P(z) = N(0, I) with mean 0 and identity covariance matrix I. For each randomized data z(i), let z'(i) be calculated based on z(i) so that z'(i) follows the multivariate normal distribution Q(z' | μ(x | Θμ), Σ(x | ΘΣ)) = N(z' | μ(x | Θμ), Σ(x | ΘΣ)) with mean μ(x | Θμ) and covariance matrix Σ(x | ΘΣ), with the note that ΘΣ = (σ²(x | ΘΣ))n×n is a diagonal matrix.
Increase k = k + 1.
Until some terminating conditions are met.
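The loop above can be sketched end to end under strong simplifications: a hypothetical linear encoder mean μ(x) = Ax, a variance vector v treated directly as the parameter (instead of a DNN output), a linear decoder g(z) = Φz, and the split updates described in the text (encoder via the KL term, decoder via the reconstruction term):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, N = 4, 2, 8                         # x-dimension, z-dimension, epoch size
X = rng.normal(size=(N, m))               # epoch of intractable data x

A = rng.normal(scale=0.5, size=(n, m))    # encoder mean parameter (Theta_mu)
v = np.array([0.5, 2.0])                  # encoder variances (diagonal of Sigma)
Phi = rng.normal(scale=0.5, size=(m, n))  # decoder parameter
gamma = 0.05

def kl_epoch(A, v, X):
    # Sum over the epoch of KL(N(Ax, diag(v)) | N(0, I)).
    mus = X @ A.T
    return 0.5 * np.sum(mus**2) + 0.5 * X.shape[0] * np.sum(v - 1.0 - np.log(v))

kl_start = kl_epoch(A, v, X)
for k in range(50):
    for x in X:
        mu = A @ x
        z_prime = mu + np.sqrt(v) * rng.normal(size=n)  # reparameterized sample
        A = A - gamma * np.outer(mu, x)         # dKL/dmu = mu, with mu = A x
        v = v - gamma * 0.5 * (1.0 - 1.0 / v)   # dKL/dv per variance
        Phi = Phi - gamma * np.outer(Phi @ z_prime - x, z_prime)  # reconstruction
kl_end = kl_epoch(A, v, X)
```

The KL term decreases toward zero as μ(x) shrinks to 0 and the variances approach 1, illustrating the regularity pull toward the prior N(0, I).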
Note, a terminating condition can be customized, for example: the parameters Θ and Φ are not changed significantly, or there is no more incoming epoch X. Moreover, the index k indicates the time point as well as the iteration of SGD. Because it is not easy to randomize z according to the normal distribution Q(z | μ(x | Θμ), Σ(x | ΘΣ)) = N(z | μ(x | Θμ), Σ(x | ΘΣ)) with mean μ(x | Θμ) and covariance matrix Σ(x | ΘΣ), there is a trick: simple data z is first randomized from the simple normal distribution P(z) = N(0, I) with mean 0 and identity covariance matrix I, and then random data z' is calculated based on z, μ(x | Θμ), and Σ(x | ΘΣ) as follows:
Such that z' follows the normal distribution N(z' | μ(x | Θμ), Σ(x | ΘΣ)) with mean μ(x | Θμ) and covariance matrix Σ(x | ΘΣ), according to a standard rule of the normal distribution in applied statistics (Hardle & Simar, 2013, p. 157). The notation A = Σ(x | ΘΣ)^(1/2) implies AA = Σ(x | ΘΣ), so we can consider it the square root of Σ(x | ΘΣ). Calculating this square root is not so easy in general because of the complexity of the singular decomposition needed to compute it. Fortunately, it is easier to calculate the square root when ΘΣ is simplified into the diagonal matrix of elements (σ²(x | ΘΣ))n×n. Indeed, we have:
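This reparameterization trick can be sketched for the diagonal case, where the square root of Σ(x | ΘΣ) is just the elementwise square root of the variances (μ and the variances below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, -2.0])              # mu(x | Theta_mu) for some fixed x
var = np.array([0.25, 4.0])             # diagonal of Sigma(x | Theta_Sigma)

# z' = mu + Sigma^(1/2) z with z ~ N(0, I); for a diagonal Sigma the matrix
# square root reduces to the elementwise square root of the variances.
Z = rng.normal(size=(100_000, 2))       # simple data z from P(z) = N(0, I)
Z_prime = mu + np.sqrt(var) * Z

sample_mean = Z_prime.mean(axis=0)      # approaches mu(x | Theta_mu)
sample_var = Z_prime.var(axis=0)        # approaches the variances
```

The empirical mean and variance of z' match the target distribution, confirming the rule cited from (Hardle & Simar, 2013).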
The following figure depicts VAE.
Figure 3.1.
Variational Autoencoders (VAE).
There is a question of how to calculate the differentials dμ(x | Θμ) / dΘμ, dσ²(x | ΘΣ) / dΘΣ, and dg(z' | Φ) / dΦ. Indeed, it is not difficult to calculate them in the context of a neural network associated with the backpropagation algorithm, so that the last output layer, i.e., the last neuron o of any DNN f(x | Θ) or g(z | Φ), is activated by an activation function a(.) as follows:

Where i is the input of the last layer o and the weight parameter w is a part of the entire parameter Θ or Φ; hence we need to focus on calculating the differential da(o) / dw, which is equivalent to any of the differentials dμ(x | Θμ) / dΘμ, dσ²(x | ΘΣ) / dΘΣ, or dg(z' | Φ) / dΦ, so that the backpropagation algorithm will solve the remaining parts of the entire parameter Θ or Φ.
Note, the superscript “T” denotes the transposition operator of vectors and matrices, by which a row vector becomes a column vector and vice versa. It is easy to calculate the derivative a'(o) once the activation function is specified; for instance, if a(o) is the sigmoid function, we have:

In practice, y is replaced by a(y) in order to prevent o from going out of range:
For fast computation, it is possible to set the derivative a'(o) to a constant such as 1, so that any differential becomes iT.
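For the sigmoid case, the derivative a'(o) = a(o)(1 – a(o)) can be checked against finite differences:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def sigmoid_prime(o):
    # a'(o) = a(o) * (1 - a(o)) for the sigmoid activation
    a = sigmoid(o)
    return a * (1.0 - a)

o = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
fd = (sigmoid(o + eps) - sigmoid(o - eps)) / (2 * eps)  # numerical derivative
```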
The epoch D = (d(1) = (x(1), z(1)), d(2) = (x(2), z(2)),…, d(N) = (x(N), z(N))) implies that the epoch is created or sent with the equilateral (uniform) distribution 1/N, but in the general case D can follow an arbitrary distribution denoted by a PDF P(d), which changes the optimization problem and the SGD estimation a little, via the theoretical expectation given the distribution P(d).
Where,
However, there is no significant change in the aforementioned practical technique for estimating parameters.
Recall that the default artificial neural network is the feedforward neural network, where data is fed to the input layer and then evaluated and passed across hidden layers to the output layer in a one-way direction. However, there is an extension of the neural network, called the recurrent neural network (RNN), where an output can be turned back to feed the network as input. In other words, an RNN contains cycles, which allow an output to become an input. There are many kinds of RNN; for instance, long short-term memory is one case of RNN. The Boltzmann machine (Wikipedia, Boltzmann machine, 2004) is another variant of RNN, in which there is no separation of inputs from outputs. Like the Hopfield network (Wikipedia, Hopfield network, 2004), every neuron (unit) in a Boltzmann machine connects to all remaining neurons. In other words, the Boltzmann machine applies the interesting aspect that all input neurons are output neurons too.
Figure 3.2.
Topology of Hopfield network and Boltzmann machine.
The Boltzmann machine, named after the Austrian physicist Ludwig Eduard Boltzmann and also called the Sherrington-Kirkpatrick model with external field or the stochastic Ising-Lenz-Little model, is a stochastic spin-glass model with an external field and is classified as a Markov random field too. For easy explanation, the Boltzmann machine simulates the spinning-glass process or metal annealing process, in which molten glass or molten metal freezes or gets stable at some energy and some temperature, where such energy and temperature are called the stable energy and stable temperature at the stable state of the glass. The annealing process aims to reach the stable state of the metal (glass), at which time the metal is frozen. Given a concrete temperature, the smaller the energy is, the more stable the metal state is. Similarly, given a concrete energy, the smaller the temperature is, the more stable the metal state is. Therefore, the annealing process is a cooling process where the probability of a metal state, which depends on energy and temperature, follows the so-called Boltzmann distribution specified as follows:
Where P(s) is the probability of the current state s and E(s) is the energy applied to the metal at state s given temperature T, while κ is the Boltzmann constant and M is the number of states. Note, T can be considered a parameter. If the denominator is constant, the Boltzmann probability is approximated as follows:
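The Boltzmann distribution over a small set of states can be sketched directly (the energies are hypothetical, and κ is set to 1 for illustration):

```python
import math

def boltzmann_probabilities(energies, T, kappa=1.0):
    # P(s) = exp(-E(s) / (kappa * T)) / sum over s' of exp(-E(s') / (kappa * T))
    weights = [math.exp(-E / (kappa * T)) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

energies = [1.0, 2.0, 3.0]
hot = boltzmann_probabilities(energies, T=100.0)  # high T: nearly uniform
cold = boltzmann_probabilities(energies, T=0.1)   # low T: lowest energy dominates
```

At high temperature all states are nearly equally likely, while at low temperature the probability mass concentrates on the lowest-energy state, which is why cooling drives the system toward stability.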
In the annealing process, the next energy is considered relative to the current energy, because annealing is a successive process; the energy deviation or energy difference ΔE(s, snew) between the current energy E(s) and the next energy E(snew) is considered, so that the Boltzmann probability derives a so-called acceptance probability P(s, snew, T) as follows:
Where,
Given a certain temperature T, the larger the acceptance probability is, the more likely the annealing process stops, and the higher the likelihood of stability is. In other words, the acceptance probability P(s, snew, T) decides whether or not the new state snew is moved to next in the annealing process. When applied to solving optimization problems as well as learning problems, the simulated annealing (SA) algorithm codes candidate solutions as states. Indeed, SA is an iterative process with sufficiently many iterations, where SA decreases the temperature T at each iteration, randomizes a new state snew, and calculates the energy E(snew) of the new state. Whether or not the new state (new candidate solution) snew is accepted is based on the acceptance probability P(s, snew, T), which depends on the current state s, the new state snew, and the current temperature T. If the new candidate solution snew is selected as the current solution, SA will decrease the temperature in the next iteration. Following is the pseudocode of SA:
Initialize current temperature T by highest temperature T0 as T = T0.
Repeat
Decrease current temperature, for example, T = decrease(T).
Select a random neighbor of current state as snew = neighbor(s).
If P(s, snew, T) is larger than a predefined threshold then
s = snew
End if
Until terminating conditions are met.
The terminating conditions can be that the best state (solution) is reached, the current state s is good enough, the current temperature T is low enough, or the number of iterations is large enough. As usual, given a maximum iteration number K and the current iteration number k, the temperature decreasing function can be defined as follows:
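The pseudocode above can be sketched as a runnable routine on a toy one-dimensional energy (the quadratic energy, the linear cooling schedule, and all constants are illustrative choices, not the only ones possible):

```python
import math
import random

def energy(s):
    # Toy energy to minimize; its global minimum is at s = 0.
    return s * s

def acceptance(e_old, e_new, T):
    # Downhill moves are always accepted; uphill moves with Boltzmann probability.
    if e_new <= e_old:
        return 1.0
    return math.exp(-(e_new - e_old) / T)

def simulated_annealing(s0, K=2000, T0=2.0, seed=3):
    rng = random.Random(seed)
    s = s0
    for k in range(1, K + 1):
        T = T0 * (1.0 - k / (K + 1))        # one possible cooling schedule
        s_new = s + rng.uniform(-0.5, 0.5)  # random neighbor of current state
        if acceptance(energy(s), energy(s_new), T) > rng.random():
            s = s_new
    return s

s_best = simulated_annealing(s0=8.0)
```

Early iterations (high T) accept some uphill moves and explore; late iterations (low T) behave almost greedily and settle near a low-energy state.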
It is easy to infer that it is possible to set the initial temperature to the maximum number of iterations, as T0 = K, in practice. There is no significant change when applying SA to training a Boltzmann machine, where the most important problem is how to specify the energy of the Boltzmann machine. Fortunately, the global energy of the Boltzmann machine is inherited from the global energy of the Hopfield network, because the Boltzmann machine is a type of Hopfield network, which in turn is a variant of RNN. Suppose an entire Boltzmann machine is represented by a vector x = (x1, x2,…, xn) in which each xi is a neuron or unit. To be exact, a certain state of the Boltzmann machine is represented by x evaluated at a certain time point, and it is possible to denote the current state of the Boltzmann machine as x instead. For convenience, the next state of the Boltzmann machine is denoted x'. The energy E(x) of the Boltzmann machine at state x is defined based on the global energy of the Hopfield network as follows (Hinton, 2007, p. 2):
Note, wij is the weight between neuron xi and neuron xj, whereas bi is the bias of neuron xi. As usual, the biases bi are considered parameters like the weights wij. Because there are n(n–1)/2 connections as well as n(n–1)/2 weights, the equation of energy is rewritten for convenience as follows (Wikipedia, Boltzmann machine, 2004):
All weights wij compose the weight matrix W = (wij)n×n whose diagonal elements are zero. Note, W is an n×n symmetric matrix.
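The energy can be sketched in matrix form (the weights, biases, and the state below are hypothetical):

```python
import numpy as np

def boltzmann_energy(x, W, b):
    # E(x) = -sum over i < j of w_ij x_i x_j - sum over i of b_i x_i
    #      = -1/2 * x^T W x - b^T x, for symmetric W with a zero diagonal
    return -0.5 * (x @ W @ x) - b @ x

W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.2],
              [-0.5, 0.2, 0.0]])        # symmetric weight matrix, zero diagonal
b = np.array([0.1, -0.2, 0.0])          # biases
x = np.array([1.0, 1.0, 0.0])           # a state of the machine

E = boltzmann_energy(x, W, b)
```

The ½ factor compensates for the symmetric matrix counting each of the n(n–1)/2 connections twice.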
Every neuron xi is evaluated by the propagation rule:
Neurons in a traditional Boltzmann machine are binary variables such that xi belongs to {0, 1}, but this is extended to allow neurons xi to belong to an arbitrary real interval; so, suppose every xi ranges in the interval [0, 1] without loss of generality. The Rectified Linear Unit (ReLU) function is used to ramp xi into the interval [0, 1], which modifies the propagation rule a little, but the learning algorithm mentioned later is not changed, because the first-order derivative of the ReLU function within the valid domain [0, 1] is 1.
Where
So that the propagation rule is not changed in theory:
Based on the definition of global energy, the Boltzmann probability density function (PDF) of the Boltzmann machine is determined as follows:

Within the context of DGM, this PDF is the generator likelihood, whose parameter is Φ = (W, b).
Because the denominator is constant with regard to W and b, the Boltzmann PDF is approximated as follows:
For learning the Boltzmann machine, the maximum likelihood estimation (MLE) method (Goodfellow, Bengio, & Courville, Deep Learning, 2016, p. 655) is applied to estimate the weight parameter W and bias parameter b by maximizing the Boltzmann PDF with regard to wij and bi.
By taking the natural logarithm of the Boltzmann PDF, the optimization becomes easier to solve.

Where log P(x | W, b) is called the Boltzmann log-likelihood or Boltzmann log-PDF.
The first-order partial derivatives of Boltzmann log-likelihood are:
As a convention, these first-order partial derivatives are called (partial) gradients. By applying the stochastic gradient descent (SGD) method to estimating wij and bi given the Boltzmann log-likelihood, we have:
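These updates can be sketched as a Hebbian step, i.e., gradient ascent on the approximate log-likelihood (the values below are hypothetical):

```python
import numpy as np

def hebbian_update(W, b, x, gamma=0.1):
    # d log P / d w_ij = x_i x_j and d log P / d b_i = x_i for the approximate
    # Boltzmann log-likelihood log P ~ -E(x), so gradient ascent strengthens
    # a connection exactly when both of its units are active together.
    H = np.outer(x, x)
    np.fill_diagonal(H, 0.0)            # no self-connections in the machine
    return W + gamma * H, b + gamma * x

W = np.zeros((3, 3))
b = np.zeros(3)
x = np.array([1.0, 1.0, 0.0])           # units 1 and 2 agree; unit 3 is off
W, b = hebbian_update(W, b, x)
```

Only the connection between the two active units is strengthened, which is exactly the Hebbian behavior.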
Where 0 < γ ≤ 1 is the learning rate. It is easy to recognize that the estimation equations above confirm the Hebbian learning rule, in which the strength of a connection, represented by its weight, is consolidated by the agreement of the two nodes to which the connection is attached. As a result, the Boltzmann machine trained with SGD is specified as follows:
Initialize W and b and set k = 0.
Repeat
Data (state) x is received from some real sample, or it can be kept intact.
Increase k = k + 1.
Until some terminating conditions are met.
Note, a terminating condition can be customized, for example: the parameters W and b are not changed significantly, the maximum number of iterations is reached, or the Boltzmann machine gets stable. The terminating condition that the Boltzmann machine gets stable receives more concern, because stability is the important property of the spinning-glass or annealing process that the Boltzmann machine simulates. However, checking the stability, in which the global energy E(x) is not changed, may consume a lot of iterations. Fortunately, SA can be incorporated into SGD so as to derive a more effective estimation. The Boltzmann machine trained with SGD and SA is specified as follows:
Initialize current temperature T by highest temperature T0 as T = T0.
Repeat
Data (state) x is received from some real sample, or it can be kept intact.
Evaluate the Boltzmann machine given the current parameters W(k+1) and b(k+1) so as to produce a new state x':
If P(x, x’, T | W(k+1), b(k+1)) is larger than a predefined threshold then
x = x’
Decrease current temperature, for example, T = decrease(T).
End if
Increase k = k + 1.
Until terminating conditions are met.
The terminating conditions can be that the best state (x') is reached, the current state x is good enough, or the current temperature T is low enough. These terminating conditions reflect the stable state of the Boltzmann machine. As usual, given a maximum iteration number K and the current iteration number k, the temperature decreasing function can be defined as follows:
Of course, the acceptance probability is:
Where,
There is a so-called restricted Boltzmann machine (RBM) in which neurons are separated into two groups: an input group denoted xz and a hidden group denoted xx.
The training algorithm by incorporation of SA and SGD is not changed, except that there is neither connection between input neurons and input neurons nor connection between hidden neurons and hidden neurons. In other words, all connections are made between the input group and the hidden group; for instance, if the cardinality of the input group is k, then the number of connections is k(n – k). Therefore, the two groups are considered layers: an input layer and a hidden layer. Of course, both layers are output layers, because connections in a Boltzmann machine are two-direction connections, whereas a feedforward neural network accepts only one-direction connections. An RBM is trained faster than a traditional Boltzmann machine because its number of connections is smaller. Moreover, it is natural to apply RBM to DGM, because the generator function in DGM x = g(z | Φ) is modeled by the RBM whose input is the input group xz and whose output is the output group xx, such that xx = g(xz | W, b), where xx is calculated by evaluating the RBM given input xz.
The reason that the RBM approach for DGM is classified as an approximate density DGM is that the generator likelihood P(x | W, b) is defined indirectly, based on the energy E(x | W, b). Of course, xz is randomized so that xx is generated data.
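As a closing sketch, the DGM view of an RBM generator can be illustrated with a single evaluation pass (the group sizes, weights, and the [0, 1] ramp follow the text's modified propagation rule; all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def ramp01(u):
    # ReLU-style ramp of each unit into the interval [0, 1].
    return np.clip(u, 0.0, 1.0)

k, h = 3, 5                             # input group size, hidden group size
W = rng.normal(scale=0.3, size=(h, k))  # connections only between the two groups
b = rng.normal(scale=0.1, size=h)       # biases of the hidden group

def generate(x_z, W, b):
    # DGM view of an RBM: x_x = g(x_z | W, b) by one evaluation of the machine.
    return ramp01(W @ x_z + b)

x_z = rng.uniform(size=k)               # randomized input group (latent data)
x_x = generate(x_z, W, b)               # generated data
```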