3. Approximate Density DGM
According to the approximate density approach, a DGM estimates the generator PDF P(x | Φ, z) approximately, or derives another PDF that is close to P(x | Φ, z), where PDF abbreviates probability density function.
Recall that there are two problems in constructing a DGM: 1) how to define a likelihood or error to train the generator DNN g(z | Φ) and 2) how to define a tractable PDF P(z), which implies the way to randomize z. The second problem amounts to asserting the qualification of random data z' and hence is stated as the qualification problem of how to qualify random data. According to the implicit density approach, a discrimination DNN qualifies randomized data z instead of a tractable PDF P(z) being defined; the Generative Adversarial Network (GAN) is a typical method belonging to the implicit density approach. Taking a different route within the approximate density approach, the Variational Autoencoders (VAE) method developed by Kingma and Welling (Kingma & Welling, 2022) proposes another DNN called the encoder f(x | Θ), which is expected to convert intractable data x into tractable data z. In other words, the encoder f(x | Θ) approximates tractable data z by encoded data z'.
It is easy to recognize that the encoder f(x | Θ) approximates the inverse of the generator g(z | Φ) when g(z | Φ) is invertible, where the x-dimension m is larger than the z-dimension n (m > n); this is the reason that the generator g(z | Φ) is called the decoder in VAE. Like the decoder g(z | Φ), the encoder f(x | Θ) is modeled by a so-called encoder DNN whose weights form the parameter Θ, called the encoder parameter, and so the parameter Φ is called the decoder parameter in VAE. Following the fact that the encoder f(x | Θ) approximates tractable data z by encoded data z', the tractable PDF P(z) is approximated by a so-called encoder PDF Pf(z').
Because the encoder f(x | Θ) depends on its parameter Θ, we can denote:
Essentially, the encoder PDF P(z' | Θ, x) is the likelihood function of z' given x, which is the conditional PDF of z' given x; hence P(z' | Θ, x) is called the encoder likelihood, which of course depends on the encoder f(x | Θ). On the other hand, P(z' | Θ, x) is the posterior PDF of tractable data given intractable data x, where P(z) is the prior PDF of tractable data. In practice, z' is assumed to conform to a multivariate normal distribution; therefore, let μ(x) and Σ(x) be the mean vector and covariance matrix of z'. The encoder likelihood P(z' | Θ, x) becomes P(z' | Θ, μ(x | Θ), Σ(x | Θ)), so that the output of the encoder DNN f(x | Θ) is the mean μ(x | Θ) and covariance matrix Σ(x | Θ), while its input is x and its weights are Θ, of course.
Note, N(.) denotes the normal distribution; thus N(z | μ(x | Θ), Σ(x | Θ)) represents the encoder likelihood. That N(z | μ(x | Θ), Σ(x | Θ)) is the encoder likelihood is an important improvement in developing VAE, because the encoder DNN f(x | Θ) is learned by minimizing a so-called encoder error, represented by the difference between the encoder likelihood and the predefined tractable PDF P(z). Let KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) be the Kullback-Leibler divergence between the encoder likelihood N(z | μ(x | Θ), Σ(x | Θ)) and the predefined tractable PDF P(z). As a result, KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) becomes an ideal encoder error, called the encoder KL divergence. The smaller the encoder KL divergence is, the closer the encoder likelihood N(z | μ(x | Θ), Σ(x | Θ)) is to the tractable PDF P(z), and the better the encoder DNN f(x | Θ) is.
Therefore, the encoder KL divergence KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) is minimized by the stochastic gradient descent (SGD) method in order to estimate the encoder parameter Θ for training the encoder DNN f(x | Θ) as follows:

Which yields the estimation equation according to SGD:
Where ∇KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) is the gradient of the encoder KL divergence KL(N(z | μ(x | Θ), Σ(x | Θ)) | P(z)) with regard to μ(x | Θ) and Σ(x | Θ), while γ is the learning rate. Recall that SGD, an iterative process, pushes the candidate solution at each iteration in the direction opposite to the gradient of the target function for minimization, or in the same direction as the gradient for maximization, with the step length represented by the learning rate. We have:
There is no change in estimating the decoder parameter Φ within VAE, so the decoder error ε(x | Φ, z) = ½||g(z | Φ) – x||² is minimized to produce the optimal Φ.
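As a minimal sketch of this decoder update, assuming a hypothetical linear decoder g(z | Φ) = Φz (the names `decoder` and `sgd_step` and the dimensions are illustrative only), one SGD step on the decoder error ε(x | Φ, z) = ½||g(z | Φ) – x||² can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear decoder g(z | Phi) = Phi @ z mapping n = 2 to m = 4.
n, m = 2, 4
Phi = rng.normal(size=(m, n))

def decoder(z, Phi):
    return Phi @ z

def decoder_error(x, z, Phi):
    # epsilon(x | Phi, z) = 1/2 * ||g(z | Phi) - x||^2
    r = decoder(z, Phi) - x
    return 0.5 * r @ r

def sgd_step(x, z, Phi, gamma=0.05):
    # Gradient of the decoder error with regard to Phi is (g(z | Phi) - x) z^T.
    grad = np.outer(decoder(z, Phi) - x, z)
    return Phi - gamma * grad

z = rng.normal(size=n)
x = rng.normal(size=m)
before = decoder_error(x, z, Phi)
Phi = sgd_step(x, z, Phi)
after = decoder_error(x, z, Phi)
```

For a small enough learning rate γ, the step reduces the decoder error, mirroring the SGD estimation equation.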
Which yields the estimation equation according to SGD:

Recall that the generator g(z | Φ) is called the decoder in VAE. As a result, the encoder parameter Θ and decoder parameter Φ are estimated as follows:
Where dg(z | Φ) / dΦ is the differential of g(z | Φ) with regard to Φ, while 0 < γ ≤ 1 is the learning rate and the tractable PDF P(z) is predefined, with the note that VAE replaces the tractable PDF P(z) by the likelihood P(z' | Θ, μ(x | Θ), Σ(x | Θ)) with fixed P(z). As usual, P(z) is assumed to conform to the standard normal distribution with mean 0 and covariance matrix I.
This implies:
Where I is the identity matrix:
It is easier to determine the gradient of the encoder KL divergence ∇KL(N(z | μ(x | Θ), Σ(x | Θ)) | N(z | 0, I)) with regard to Θ between the multivariate normal distribution N(z | μ(x | Θ), Σ(x | Θ)) and the standard multivariate normal distribution N(z | 0, I). We have the following equation to calculate such gradient (Kingma & Welling, 2022, p. 5), (Doersch, 2016, p. 9), (Nguyen, 2015, p. 43):
Where (Σ(x | Θ))⁻¹ is the inverse of the covariance matrix Σ(x | Θ), the superscript “T” denotes the transposition operator of matrices and vectors, and dμ(x | Θ) / dΘ and dΣ(x | Θ) / dΘ are the differentials of μ(x | Θ) and Σ(x | Θ) with regard to Θ, respectively. As a result, the encoder parameter Θ and decoder parameter Φ are fully estimated according to SGD as follows:
The estimation equations above are a simple explanation of VAE, but its formal construction is more complicated. We begin with the aforementioned intractable PDF P(x) specified by the law of total probability:
However, P(x) is interpreted in another way within VAE, based on Bayes' rule:
Because the conditional probability P(z | x) is arbitrary, without formal specification, it should be approximated by another PDF denoted Q(z | x), with the assumption that the PDF Q(z | x) has a formal specification such as the normal distribution.
The logarithm of the intractable PDF P(x) is specified as follows (Ruthotto & Haber, 2021, p. 13):
The second term log(Q(z | x) / P(z | x)) is not varied, because Q(z | x) approximates P(z | x). Therefore, the first term log(P(x, z) / Q(z | x)) is called the variational lower bound or evidence lower bound, because it is the term that varies. Let l(x, z) be the loss function or error function of VAE, defined as the opposite (negative) of the expectation of the evidence lower bound log(P(x, z) / Q(z | x)) given the PDF Q(z | x), with the note that Q(z | x) has a formal probabilistic distribution.
The loss function l(x, z) is expanded as follows:

Because Q(z | x) and P(x | z) depend on the encoder f(x | Θ) and decoder g(z | Φ), respectively, their parameters are Θ and Φ, respectively.
Exactly, Q(z | Θ, x) is the encoder likelihood, which is the same as the aforementioned P(z' | Θ, x) except that it is emphasized that Q(z | Θ, x) has a formal probabilistic specification such as the normal distribution. The loss function l(Θ, Φ | x, z), which is now a function of the encoder parameter Θ and decoder parameter Φ, is written as follows (Ruthotto & Haber, 2021, p. 14):
Firstly, pay attention to the first term of the loss function l(Θ, Φ | x, z), where P(x | Φ, z) depends only on Φ, although it can be considered a conditional PDF of x given z, because P(x | Φ, z) is defined for the output layer, containing only x, of the decoder DNN g(z | Φ) whose input is z. Therefore, we have the following assertion:
Secondly, the second term in the loss function l(Θ, Φ | x, z) is actually the Kullback-Leibler divergence between the encoder likelihood Q(z | Θ, x) and the predefined tractable PDF P(z), which measures the difference between Q(z | Θ, x) and P(z). As a convention, this Kullback-Leibler divergence is called the encoder KL divergence, which is an ideal encoder error.
The smaller the encoder KL divergence is, the closer the encoder likelihood Q(z | Θ, x) is to the tractable PDF P(z), and the better the encoder DNN f(x | Θ) is. The loss function is rewritten again:
According to the two problems of constructing a DGM, the first term –log(P(x | Φ, z)) in the loss function addresses the first problem of how to train the decoder DNN g(z | Φ) and is called the reconstruction error in the literature, while the second term KL(Q(z | Θ, x) | P(z)) addresses the second problem of how to qualify the training task for training the encoder DNN f(x | Θ) and is called the regularity (regularization) term in the literature. The loss function l(Θ, Φ | x, z) is minimized to estimate Θ and Φ as follows:
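A minimal sketch of the two-term loss, assuming a diagonal-Gaussian encoder, a unit-variance Gaussian decoder likelihood (so –log(P(x | Φ, z)) is ½||g(z | Φ) – x||² up to an additive constant), and the standard normal prior P(z) = N(0, I); all numeric values are hypothetical:

```python
import numpy as np

def kl_regularity(mu, var):
    # KL(N(mu, diag(var)) | N(0, I)) = 1/2 * sum_i (var_i + mu_i^2 - 1 - log var_i)
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def reconstruction_error(x, x_hat):
    # -log P(x | Phi, z) up to an additive constant for a unit-variance
    # Gaussian decoder likelihood: 1/2 * ||g(z | Phi) - x||^2
    r = x_hat - x
    return 0.5 * r @ r

x = np.array([1.0, -1.0, 0.5])           # intractable data
x_hat = np.array([0.9, -1.1, 0.4])       # decoder output g(z | Phi)
mu = np.array([0.2, -0.1])               # encoder mean mu(x | Theta)
var = np.array([0.8, 1.3])               # encoder variances sigma_i^2(x | Theta)

loss = reconstruction_error(x, x_hat) + kl_regularity(mu, var)
```

The regularity term vanishes exactly when the encoder output matches the prior (zero mean, unit variances), so minimizing the loss balances reconstruction against staying close to P(z).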
Because P(x | Φ, z) depends only on Φ and the encoder KL divergence KL(Q(z | Θ, x) | P(z)) depends only on Θ, the optimization problem is specified as follows:
Which yields the estimation equations according to SGD:
Where ∇KL(Q(z | Θ, x) | P(z)) is the gradient of the encoder KL divergence KL(Q(z | Θ, x) | P(z)) with regard to the encoder parameter Θ. Note that the tractable PDF P(z) is predefined (fixed). While Q(z | Θ, x) is called the encoder likelihood, P(x | Φ, z) is called the decoder likelihood. On the other hand, while P(z) is the prior PDF of tractable data z, Q(z | Θ, x) is the approximated posterior PDF of z given x, where both P(z) and Q(z | Θ, x) have formal probabilistic specifications and, moreover, P(z) is fixed (predefined).
Both P(z | Θ, x) and Q(z | Θ, x) are encoder likelihoods as well as posterior PDFs of tractable data z, but Q(z | Θ, x) is the approximated one, whose probabilistic distribution is specified formally. Therefore (Ruthotto & Haber, 2021, p. 16), randomized data z' in the latent space Z is sampled from the approximated distribution Q(z | Θ, x) instead of from the true distribution P(z | Θ, x).
Given an epoch of size N denoted as D = (d(1) = (x(1), z(1)), d(2) = (x(2), z(2)),…, d(N) = (x(N), z(N))), the estimation equations of Θ and Φ are extended exactly as epoch estimation at every iteration of SGD:
Please distinguish that the tractable data z(i) in the first equation above follows the distribution P(z), whereas the tractable data z(i) in the second equation follows the distribution Q(z | Θ, x). As a result, VAE trained with SGD is specified as follows:
Initialize Θ and Φ and set k = 0.
Repeat
Sample epoch X = (x(1), x(2),…, x(N)) or receive epoch X from big data / a data stream.
Randomize random epoch Z = (z(1), z(2),…, z(N)) in which each z(i) is randomized from the distribution Q(z | Θ(k), x(i)).
Increase k = k + 1.
Until some terminating conditions are met.
Note, a terminating condition can be customized, for example: the parameters Θ and Φ are not changed significantly, or there is no more incoming epoch X. Moreover, the index k indicates the time point as well as the iteration of SGD. Because the PDF P(z) is predefined, it is easy to calculate the encoder KL divergence KL(Q(z(i) | Θ(k), x(i)) | P(z)), but it is necessary to define P(x) by a well-known distribution. However, randomizing the random epoch Z = (z(1), z(2),…, z(N)) from the distribution Q(z | Θ(k), x(i)) is not easy, and so VAE trained with SGD will be fine-tuned. It is interesting that, because Q(z | Θ(k), x(i)) is the posterior PDF of z and P(z) is the prior PDF of z, the fact that z is randomized from the posterior PDF Q(z | Θ(k), x(i)) while Q(z | Θ(k), x(i)) itself is updated continuously based on its previous evidence x(i) over SGD iterations implies that VAE conforms to Bayesian statistics in estimation. Moreover, P(z) is an alignment to which Q(z | Θ(k), x(i)) adjusts itself with the support of the encoder KL divergence KL(Q(z(i) | Θ(k), x(i)) | P(z)).
Because the encoder likelihood Q(z | Θ, x) must always have a formal probabilistic distribution, it is assumed to follow a multivariate normal distribution in practice. Therefore, let μ(x | Θ) and Σ(x | Θ) be the mean vector and covariance matrix of z; then the encoder likelihood Q(z | Θ, x) becomes Q(z | μ(x | Θ), Σ(x | Θ)), so that the output of the encoder DNN f(x | Θ) is the mean μ(x | Θ) and covariance matrix Σ(x | Θ), while its input is x and its weights are Θ, of course. Please pay attention to the fact that the output of the encoder DNN f(x | Θ) is now μ(x | Θ) and Σ(x | Θ), which correspond to z. Moreover, μ(x | Θ) and Σ(x | Θ) are functions of x whose parameter is Θ.
Note, N(z | μ(x | Θ), Σ(x | Θ)) denotes the multivariate normal distribution with mean μ(x | Θ) and covariance matrix Σ(x | Θ).
Note, the dimension of tractable data z is n. Moreover, the notation |.| or det(.) denotes the determinant of a matrix, (Σ(x | Θ))⁻¹ is the inverse of the covariance matrix Σ(x | Θ), and the superscript “T” denotes the transposition operator of matrices and vectors. It is easy to recognize that z' is an approximation of z. When the tractable PDF P(z) is fixed, it is often assumed to follow a multivariate normal distribution with predefined mean μ0 and predefined covariance matrix Σ0 as follows:
The encoder KL divergence KL(Q(z | Θ, x) | P(z)) between Q(z | Θ, x) and P(z) becomes the encoder KL divergence KL(Q(z | μ(x | Θ), Σ(x | Θ)) | P(z)) between Q(z | μ(x | Θ), Σ(x | Θ)) and P(z) as follows:
Which is, essentially, the KL divergence between two normal distributions, KL(N(z | μ(x | Θ), Σ(x | Θ)) | N(μ0, Σ0)). As a convention, this divergence is called the encoder KL divergence, which is determined in the literature as follows (Doersch, 2016, p. 9):
Where tr(.) denotes the trace operator of a square matrix, which is the sum of the elements on its main diagonal; for instance, given an n×n matrix A, tr(A) = a11 + a22 +… + ann, where aij is the element at row i and column j. Moreover, the notation |.| or det(.) denotes the determinant of a matrix. The gradient of the encoder KL divergence consists of two elemental gradients, with regard to the mean μ(x | Θ) and the covariance matrix Σ(x | Θ).
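The closed form cited above can be sketched directly; the function below follows the formula from (Doersch, 2016), with hypothetical inputs:

```python
import numpy as np

def kl_gauss(mu1, S1, mu0, S0):
    # KL(N(mu1, S1) | N(mu0, S0)) in closed form:
    # 1/2 * [tr(S0^-1 S1) + (mu0 - mu1)^T S0^-1 (mu0 - mu1) - n
    #        + ln(det S0 / det S1)]
    n = mu1.shape[0]
    S0inv = np.linalg.inv(S0)
    d = mu0 - mu1
    return 0.5 * (np.trace(S0inv @ S1) + d @ S0inv @ d - n
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

mu = np.array([0.5, -0.2])                        # mu(x | Theta)
Sigma = np.array([[1.5, 0.2],
                  [0.2, 0.7]])                    # Sigma(x | Theta)
kl = kl_gauss(mu, Sigma, np.zeros(2), np.eye(2))  # against the prior N(0, I)
```

The divergence is zero only when the two distributions coincide, which is the sense in which it serves as an encoder error.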
Where,
Where dμ(x | Θ) / dΘ and dΣ(x | Θ) / dΘ are the differentials of μ(x | Θ) and Σ(x | Θ) with regard to Θ, respectively. It is not difficult to calculate the KL gradient ∇μ:
Due to (Nguyen, Matrix Analysis and Calculus, 2015, p. 35):
It is not difficult to calculate the KL gradient ∇Σ too:
Due to (Nguyen, Matrix Analysis and Calculus, 2015, pp. 45-46):
As a result, the encoder parameter Θ consists of two elemental parameters, with regard to the mean μ(x | Θ) and the covariance matrix Σ(x | Θ), as follows:
Where,
Note, given a random vector z = (z1, z2,…, zn)T whose elements zi are random variables too, σij where i ≠ j is the covariance between the two random variables zi and zj, and σi² is the variance of the random variable zi. It is easy to calculate the encoder parameters Θμ and ΘΣ by SGD estimation:
Where dμ(x | Θμ) / dΘμ and dΣ(x | ΘΣ) / dΘΣ are the differentials of μ(x | Θμ) and Σ(x | ΘΣ) with regard to Θμ and ΘΣ, respectively. In practice, P(z) is assumed to conform to the standard normal distribution with zero mean μ0 = 0 and identity covariance matrix Σ0 = I, where I is the identity matrix, so that the encoder parameters Θμ and ΘΣ are computed effectively.
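With the standard prior (μ0 = 0, Σ0 = I), the mean gradient reduces to ∇μ = μ(x | Θ); a finite-difference check confirms this (a sketch, with hypothetical μ and Σ):

```python
import numpy as np

def kl_vs_std(mu, Sigma):
    # KL(N(mu, Sigma) | N(0, I)) = 1/2 * (tr(Sigma) + mu^T mu - n - ln det Sigma)
    n = mu.shape[0]
    return 0.5 * (np.trace(Sigma) + mu @ mu - n - np.log(np.linalg.det(Sigma)))

mu = np.array([0.3, -0.4])
Sigma = np.array([[1.2, 0.1],
                  [0.1, 0.9]])

grad_mu = mu                     # analytic gradient of the KL with regard to mu

# Central finite differences on each component of mu.
eps = 1e-6
fd = np.zeros(2)
for i in range(2):
    e = np.zeros(2)
    e[i] = eps
    fd[i] = (kl_vs_std(mu + e, Sigma) - kl_vs_std(mu - e, Sigma)) / (2 * eps)
```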
In order to improve computational effectiveness further, it is possible to suppose that the elemental variables zi in z = (z1, z2,…, zn)T within the context of P(z) are mutually independent, so that the covariance σij between two variables zi and zj where i ≠ j is 0; as a result, only the variances σi² of zi remain. The covariance matrix Σ(x | Θ) becomes a diagonal matrix:
Note,
Where σi²(x | Θ) is the variance of the elemental variable zi in z = (z1, z2,…, zn)T given x, according to the encoder f(x | Θ). As a result, the encoder parameter ΘΣ, which is now a diagonal matrix represented by its diagonal vector σ²(x | ΘΣ), is computed more easily.
Where,
In general, the estimation equations for the encoder parameter Θ = (Θμ, ΘΣ)T are specified as follows:
Where dσ²(x | ΘΣ) / dΘΣ is the differential of σ²(x | ΘΣ) with regard to ΘΣ.
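Under this diagonal simplification with the standard prior N(0, I), the regularity term and its derivative with regard to each variance have simple elementwise forms; a sketch with hypothetical values:

```python
import numpy as np

def kl_diag(mu, var):
    # KL(N(mu, diag(var)) | N(0, I)) with mutually independent z_i:
    # 1/2 * sum_i (var_i + mu_i^2 - 1 - log var_i)
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

def dkl_dvar(var):
    # Elementwise derivative with regard to each variance sigma_i^2:
    # 1/2 * (1 - 1 / var_i), which vanishes exactly at var_i = 1.
    return 0.5 * (1.0 - 1.0 / var)

mu = np.array([0.1, -0.3, 0.0])
var = np.array([0.5, 1.0, 2.0])
kl = kl_diag(mu, var)
grad = dkl_dvar(var)
```

The gradient pulls every variance toward 1 and (through the mean term) every mean toward 0, i.e., toward the prior.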
There is no change in estimating the decoder parameter Φ within VAE, so the decoder log-likelihood log(P(x | Φ, z)) is maximized.
As usual, the decoder likelihood P(x | Φ, z) is assumed to be distributed normally with mean δ and variance σ².
Which implies the decoder log-likelihood log(P(x | Φ, z)) as follows:
Where ||.|| denotes the Euclidean norm of a vector. The gradient of the decoder log-likelihood is:
Where dg(z | Φ) / dΦ is the differential of g(z | Φ) with regard to Φ. Setting δ = 0 and σ² = 1 to simplify the optimization, we have:

Which implies the estimation equation for the decoder parameter Φ by SGD as follows:
Because the data z in the decoder estimation equation above follows the encoder likelihood Q(z | μ(x | Θμ), Σ(x | ΘΣ)) = N(z | μ(x | Θμ), Σ(x | ΘΣ)) rather than the tractable PDF P(z) = N(z | μ0, Σ0), it is denoted as z' such that:
Given an epoch of size N denoted as D = (d(1) = (x(1), z'(1)), d(2) = (x(2), z'(2)),…, d(N) = (x(N), z'(N))), the estimation equations of Θ and Φ are extended exactly as epoch estimation at every iteration of SGD:
As a result, VAE trained with SGD is specified as follows:
Initialize Θ = (Θμ, ΘΣ)T and Φ and set k = 0.
Repeat
Sample epoch X = (x(1), x(2),…, x(N)) or receive epoch X from big data / a data stream.
Randomize random epoch Z = (z(1), z(2),…, z(N)) from the standard normal distribution P(z) = N(0, I) with mean 0 and identity covariance matrix I. For each randomized data z(i), let z'(i) be calculated based on z(i) so that z'(i) follows the multivariate normal distribution Q(z' | μ(x | Θμ), Σ(x | ΘΣ)) = N(z' | μ(x | Θμ), Σ(x | ΘΣ)) with mean μ(x | Θμ) and covariance matrix Σ(x | ΘΣ), with the note that ΘΣ = (σ²(x | ΘΣ))n×n is a diagonal matrix.
Increase k = k + 1.
Until some terminating conditions are met.
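The loop above can be sketched end to end under strong simplifications: a hypothetical linear encoder mean μ(x) = Ax, a variance vector v treated directly as the parameter (instead of a DNN output), a linear decoder g(z) = Φz, and the split updates described in the text (encoder via the KL term, decoder via the reconstruction term):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, N = 4, 2, 8                         # x-dimension, z-dimension, epoch size
X = rng.normal(size=(N, m))               # epoch of intractable data x

A = rng.normal(scale=0.5, size=(n, m))    # encoder mean parameter (Theta_mu)
v = np.array([0.5, 2.0])                  # encoder variances (diagonal of Sigma)
Phi = rng.normal(scale=0.5, size=(m, n))  # decoder parameter
gamma = 0.05

def kl_epoch(A, v, X):
    # Sum over the epoch of KL(N(Ax, diag(v)) | N(0, I)).
    mus = X @ A.T
    return 0.5 * np.sum(mus**2) + 0.5 * X.shape[0] * np.sum(v - 1.0 - np.log(v))

kl_start = kl_epoch(A, v, X)
for k in range(50):
    for x in X:
        mu = A @ x
        z_prime = mu + np.sqrt(v) * rng.normal(size=n)  # reparameterized sample
        A = A - gamma * np.outer(mu, x)         # dKL/dmu = mu, with mu = A x
        v = v - gamma * 0.5 * (1.0 - 1.0 / v)   # dKL/dv per variance
        Phi = Phi - gamma * np.outer(Phi @ z_prime - x, z_prime)  # reconstruction
kl_end = kl_epoch(A, v, X)
```

The KL term decreases toward zero as μ(x) shrinks to 0 and the variances approach 1, illustrating the regularity pull toward the prior N(0, I).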
Note, a terminating condition can be customized, for example: the parameters Θ and Φ are not changed significantly, or there is no more incoming epoch X. Moreover, the index k indicates the time point as well as the iteration of SGD. Because it is not easy to randomize z according to the normal distribution Q(z | μ(x | Θμ), Σ(x | ΘΣ)) = N(z | μ(x | Θμ), Σ(x | ΘΣ)) with mean μ(x | Θμ) and covariance matrix Σ(x | ΘΣ), there is a trick: simple data z is first randomized from the simple normal distribution P(z) = N(0, I) with mean 0 and identity covariance matrix I, and then random data z' is calculated based on z, μ(x | Θμ), and Σ(x | ΘΣ) as follows:
Such that z' follows the normal distribution N(z' | μ(x | Θμ), Σ(x | ΘΣ)) with mean μ(x | Θμ) and covariance matrix Σ(x | ΘΣ), according to a standard rule of the normal distribution in applied statistics (Hardle & Simar, 2013, p. 157). The notation A = Σ(x | ΘΣ)^(1/2) implies AA = Σ(x | ΘΣ), so we can consider it the square root of Σ(x | ΘΣ). Calculating this square root is not so easy in general because of the complexity of the singular decomposition needed to compute it. Fortunately, it is easier to calculate the square root when ΘΣ is simplified into the diagonal matrix of elements (σ²(x | ΘΣ))n×n. Indeed, we have:
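This reparameterization trick can be sketched for the diagonal case, where the square root of Σ(x | ΘΣ) is just the elementwise square root of the variances (μ and the variances below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, -2.0])              # mu(x | Theta_mu) for some fixed x
var = np.array([0.25, 4.0])             # diagonal of Sigma(x | Theta_Sigma)

# z' = mu + Sigma^(1/2) z with z ~ N(0, I); for a diagonal Sigma the matrix
# square root reduces to the elementwise square root of the variances.
Z = rng.normal(size=(100_000, 2))       # simple data z from P(z) = N(0, I)
Z_prime = mu + np.sqrt(var) * Z

sample_mean = Z_prime.mean(axis=0)      # approaches mu(x | Theta_mu)
sample_var = Z_prime.var(axis=0)        # approaches the variances
```

The empirical mean and variance of z' match the target distribution, confirming the rule cited from (Hardle & Simar, 2013).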
The following figure depicts VAE.
Figure 3.1.
Variational Autoencoders (VAE).
There is a question of how to calculate the differentials dμ(x | Θμ) / dΘμ, dσ²(x | ΘΣ) / dΘΣ, and dg(z' | Φ) / dΦ. Indeed, it is not difficult to calculate them in the context of a neural network associated with the backpropagation algorithm, so that the last output layer, i.e., the last neuron o of any DNN f(x | Θ) or g(z | Φ), is activated by an activation function a(.) as follows:

Where i is the input of the last layer o and the weight parameter w is a part of the entire parameter Θ or Φ; hence we need to focus on calculating the differential da(o) / dw, which is equivalent to any of the differentials dμ(x | Θμ) / dΘμ, dσ²(x | ΘΣ) / dΘΣ, or dg(z' | Φ) / dΦ, so that the backpropagation algorithm will solve the remaining parts of the entire parameter Θ or Φ.
Note, the superscript “T” denotes the transposition operator of vectors and matrices, by which a row vector becomes a column vector and vice versa. It is easy to calculate the derivative a'(o) once the activation function is specified; for instance, if a(o) is the sigmoid function, we have:

In practice, y is replaced by a(y) in order to prevent o from going out of range:
For fast computation, it is possible to set the derivative a'(o) to a constant such as 1, so that any differential becomes iT.
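For the sigmoid case, the derivative a'(o) = a(o)(1 – a(o)) can be checked against finite differences:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def sigmoid_prime(o):
    # a'(o) = a(o) * (1 - a(o)) for the sigmoid activation
    a = sigmoid(o)
    return a * (1.0 - a)

o = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
fd = (sigmoid(o + eps) - sigmoid(o - eps)) / (2 * eps)  # numerical derivative
```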
The epoch D = (d(1) = (x(1), z(1)), d(2) = (x(2), z(2)),…, d(N) = (x(N), z(N))) implies that the epoch is created or sent with the equilateral (uniform) distribution 1/N, but in the general case D can follow an arbitrary distribution denoted by a PDF P(d), which changes the optimization problem and the SGD estimation a little, via the theoretical expectation given the distribution P(d).
Where,
However, there is no significant change in the aforementioned practical technique for estimating parameters.
Recall that the default artificial neural network is the feedforward neural network, where data is fed to the input layer and then evaluated and passed across hidden layers to the output layer in a one-way direction. However, there is an extension of the neural network, called the recurrent neural network (RNN), where an output can be turned back to feed the network as input. In other words, an RNN contains cycles, which allow an output to become an input. There are many kinds of RNN; for instance, long short-term memory is one case of RNN. The Boltzmann machine (Wikipedia, Boltzmann machine, 2004) is another variant of RNN, in which there is no separation of inputs from outputs. Like the Hopfield network (Wikipedia, Hopfield network, 2004), every neuron (unit) in a Boltzmann machine connects to all remaining neurons. In other words, the Boltzmann machine applies the interesting aspect that all input neurons are output neurons too.
Figure 3.2.
Topology of Hopfield network and Boltzmann machine.
The Boltzmann machine, named after the Austrian physicist Ludwig Eduard Boltzmann and also called the Sherrington-Kirkpatrick model with external field or the stochastic Ising-Lenz-Little model, is a stochastic spin-glass model with an external field and is classified as a Markov random field too. For easy explanation, the Boltzmann machine simulates the spinning-glass process or metal annealing process, in which molten glass or molten metal freezes or gets stable at some energy and some temperature, where such energy and temperature are called the stable energy and stable temperature at the stable state of the glass. The annealing process aims to reach the stable state of the metal (glass), at which time the metal is frozen. Given a concrete temperature, the smaller the energy is, the more stable the metal state is. Similarly, given a concrete energy, the smaller the temperature is, the more stable the metal state is. Therefore, the annealing process is a cooling process where the probability of a metal state, which depends on energy and temperature, follows the so-called Boltzmann distribution specified as follows:
Where P(s) is the probability of the current state s and E(s) is the energy applied to the metal at state s given temperature T, while κ is the Boltzmann constant and M is the number of states. Note, T can be considered a parameter. If the denominator is constant, the Boltzmann probability is approximated as follows:
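The Boltzmann distribution over a small set of states can be sketched directly (the energies are hypothetical, and κ is set to 1 for illustration):

```python
import math

def boltzmann_probabilities(energies, T, kappa=1.0):
    # P(s) = exp(-E(s) / (kappa * T)) / sum over s' of exp(-E(s') / (kappa * T))
    weights = [math.exp(-E / (kappa * T)) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

energies = [1.0, 2.0, 3.0]
hot = boltzmann_probabilities(energies, T=100.0)  # high T: nearly uniform
cold = boltzmann_probabilities(energies, T=0.1)   # low T: lowest energy dominates
```

At high temperature all states are nearly equally likely, while at low temperature the probability mass concentrates on the lowest-energy state, which is why cooling drives the system toward stability.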
In the annealing process, the next energy is considered relative to the current energy, because annealing is a successive process; the energy deviation or energy difference ΔE(s, snew) between the current energy E(s) and the next energy E(snew) is considered, so that the Boltzmann probability derives a so-called acceptance probability P(s, snew, T) as follows:
Where,
Given a certain temperature T, the larger the acceptance probability is, the more likely the annealing process stops, and the higher the likelihood of stability is. In other words, the acceptance probability P(s, snew, T) decides whether or not the new state snew is moved to next in the annealing process. When applied to solving optimization problems as well as learning problems, the simulated annealing (SA) algorithm codes candidate solutions as states. Indeed, SA is an iterative process with sufficiently many iterations, where SA decreases the temperature T at each iteration, randomizes a new state snew, and calculates the energy E(snew) of the new state. Whether or not the new state (new candidate solution) snew is accepted is based on the acceptance probability P(s, snew, T), which depends on the current state s, the new state snew, and the current temperature T. If the new candidate solution snew is selected as the current solution, SA will decrease the temperature in the next iteration. Following is the pseudocode of SA:
Initialize current temperature T by highest temperature T0 as T = T0.
Repeat
Decrease current temperature, for example, T = decrease(T).
Select a random neighbor of current state as snew = neighbor(s).
If P(s, snew, T) is larger than a predefined threshold then
s = snew
End if
Until terminating conditions are met.
The terminating conditions can be that the best state (solution) is reached, the current state s is good enough, the current temperature T is low enough, or the number of iterations is large enough. As usual, given a maximum iteration number K and the current iteration number k, the temperature decreasing function can be defined as follows:
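The pseudocode above can be sketched as a runnable routine on a toy one-dimensional energy (the quadratic energy, the linear cooling schedule, and all constants are illustrative choices, not the only ones possible):

```python
import math
import random

def energy(s):
    # Toy energy to minimize; its global minimum is at s = 0.
    return s * s

def acceptance(e_old, e_new, T):
    # Downhill moves are always accepted; uphill moves with Boltzmann probability.
    if e_new <= e_old:
        return 1.0
    return math.exp(-(e_new - e_old) / T)

def simulated_annealing(s0, K=2000, T0=2.0, seed=3):
    rng = random.Random(seed)
    s = s0
    for k in range(1, K + 1):
        T = T0 * (1.0 - k / (K + 1))        # one possible cooling schedule
        s_new = s + rng.uniform(-0.5, 0.5)  # random neighbor of current state
        if acceptance(energy(s), energy(s_new), T) > rng.random():
            s = s_new
    return s

s_best = simulated_annealing(s0=8.0)
```

Early iterations (high T) accept some uphill moves and explore; late iterations (low T) behave almost greedily and settle near a low-energy state.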
It is easy to infer that it is possible to set the initial temperature to the maximum number of iterations, as T0 = K, in practice. There is no significant change when applying SA to training a Boltzmann machine, where the most important problem is how to specify the energy of the Boltzmann machine. Fortunately, the global energy of the Boltzmann machine is inherited from the global energy of the Hopfield network, because the Boltzmann machine is a type of Hopfield network, which in turn is a variant of RNN. Suppose an entire Boltzmann machine is represented by a vector x = (x1, x2,…, xn) in which each xi is a neuron or unit. To be exact, a certain state of the Boltzmann machine is represented by x evaluated at a certain time point, and it is possible to denote the current state of the Boltzmann machine as x instead. For convenience, the next state of the Boltzmann machine is denoted x'. The energy E(x) of the Boltzmann machine at state x is defined based on the global energy of the Hopfield network as follows (Hinton, 2007, p. 2):
Note, wij is the weight between neuron xi and neuron xj, whereas bi is the bias of neuron xi. As usual, the biases bi are considered parameters like the weights wij. Because there are n(n–1)/2 connections as well as n(n–1)/2 weights, the equation of energy is rewritten for convenience as follows (Wikipedia, Boltzmann machine, 2004):
All weights wij compose the weight matrix W = (wij)n×n whose diagonal elements are zero. Note, W is an n×n symmetric matrix.
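The energy can be sketched in matrix form (the weights, biases, and the state below are hypothetical):

```python
import numpy as np

def boltzmann_energy(x, W, b):
    # E(x) = -sum over i < j of w_ij x_i x_j - sum over i of b_i x_i
    #      = -1/2 * x^T W x - b^T x, for symmetric W with a zero diagonal
    return -0.5 * (x @ W @ x) - b @ x

W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.2],
              [-0.5, 0.2, 0.0]])        # symmetric weight matrix, zero diagonal
b = np.array([0.1, -0.2, 0.0])          # biases
x = np.array([1.0, 1.0, 0.0])           # a state of the machine

E = boltzmann_energy(x, W, b)
```

The ½ factor compensates for the symmetric matrix counting each of the n(n–1)/2 connections twice.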
Every neuron xi is evaluated by the propagation rule:
Neurons in a traditional Boltzmann machine are binary variables such that xi belongs to {0, 1}, but this is extended to allow neurons xi to belong to an arbitrary real interval; so, suppose every xi ranges in the interval [0, 1] without loss of generality. The Rectified Linear Unit (ReLU) function is used to ramp xi into the interval [0, 1], which modifies the propagation rule a little, but the learning algorithm mentioned later is not changed, because the first-order derivative of the ReLU function within the valid domain [0, 1] is 1.
Where
So that the propagation rule is not changed in theory:
Based on the definition of global energy, the Boltzmann probability density function (PDF) of the Boltzmann machine is determined as follows:

Within the context of DGM, this PDF is the generator likelihood, whose parameter is Φ = (W, b).
Because the denominator is constant with regard to W and b, the Boltzmann PDF is approximated as follows:
For learning the Boltzmann machine, the maximum likelihood estimation (MLE) method (Goodfellow, Bengio, & Courville, Deep Learning, 2016, p. 655) is applied to estimate the weight parameter W and bias parameter b by maximizing the Boltzmann PDF with regard to wij and bi.
By taking the natural logarithm of the Boltzmann PDF, the optimization becomes easier to solve.

Where log P(x | W, b) is called the Boltzmann log-likelihood or Boltzmann log-PDF.
The first-order partial derivatives of Boltzmann log-likelihood are:
As a convention, these first-order partial derivatives are called (partial) gradients. By applying the stochastic gradient descent (SGD) method to estimating wij and bi given the Boltzmann log-likelihood, we have:
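These updates can be sketched as a Hebbian step, i.e., gradient ascent on the approximate log-likelihood (the values below are hypothetical):

```python
import numpy as np

def hebbian_update(W, b, x, gamma=0.1):
    # d log P / d w_ij = x_i x_j and d log P / d b_i = x_i for the approximate
    # Boltzmann log-likelihood log P ~ -E(x), so gradient ascent strengthens
    # a connection exactly when both of its units are active together.
    H = np.outer(x, x)
    np.fill_diagonal(H, 0.0)            # no self-connections in the machine
    return W + gamma * H, b + gamma * x

W = np.zeros((3, 3))
b = np.zeros(3)
x = np.array([1.0, 1.0, 0.0])           # units 1 and 2 agree; unit 3 is off
W, b = hebbian_update(W, b, x)
```

Only the connection between the two active units is strengthened, which is exactly the Hebbian behavior.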
Where 0 < γ ≤ 1 is the learning rate. It is easy to recognize that the estimation equations above confirm the Hebbian learning rule, in which the strength of a connection, represented by its weight, is consolidated by the agreement of the two nodes to which the connection is attached. As a result, the Boltzmann machine trained with SGD is specified as follows:
Initialize W and b and set k = 0.
Repeat
Data (state) x is received from some real sample, or it can be kept intact.
Increase k = k + 1.
Until some terminating conditions are met.
Note, a terminating condition can be customized, for example: the parameters W and b are not changed significantly, the maximum number of iterations is reached, or the Boltzmann machine gets stable. The terminating condition that the Boltzmann machine gets stable receives more concern, because stability is the important property of the spinning-glass or annealing process that the Boltzmann machine simulates. However, checking the stability, in which the global energy E(x) is not changed, may consume a lot of iterations. Fortunately, SA can be incorporated into SGD so as to derive a more effective estimation. The Boltzmann machine trained with SGD and SA is specified as follows:
Initialize current temperature T by highest temperature T0 as T = T0.
Repeat
Data (state) x is received from some real sample, or it can be kept intact.
Evaluate the Boltzmann machine given the current parameters W(k+1) and b(k+1) so as to produce a new state x':
If P(x, x’, T | W(k+1), b(k+1)) is larger than a predefined threshold then
x = x’
Decrease current temperature, for example, T = decrease(T).
End if
Increase k = k + 1.
Until terminating conditions are met.
The terminating conditions can be that the best state (x') is reached, the current state x is good enough, or the current temperature T is low enough. These terminating conditions reflect the stable state of the Boltzmann machine. As usual, given a maximum iteration number K and the current iteration number k, the temperature decreasing function can be defined as follows:
Of course, the acceptance probability is:
Where,
There is a so-called restricted Boltzmann machine (RBM) in which neurons are separated into two groups: an input group denoted xz and a hidden group denoted xx.
The training algorithm by incorporation of SA and SGD is not changed, except that there is neither connection between input neurons and input neurons nor connection between hidden neurons and hidden neurons. In other words, all connections are made between the input group and the hidden group; for instance, if the cardinality of the input group is k, then the number of connections is k(n – k). Therefore, the two groups are considered layers: an input layer and a hidden layer. Of course, both layers are output layers, because connections in a Boltzmann machine are two-direction connections, whereas a feedforward neural network accepts only one-direction connections. An RBM is trained faster than a traditional Boltzmann machine because its number of connections is smaller. Moreover, it is natural to apply RBM to DGM, because the generator function in DGM x = g(z | Φ) is modeled by the RBM whose input is the input group xz and whose output is the output group xx, such that xx = g(xz | W, b), where xx is calculated by evaluating the RBM given input xz.
The reason that the RBM approach for DGM is classified as an approximate density DGM is that the generator likelihood P(x | W, b) is defined indirectly, based on the energy E(x | W, b). Of course, xz is randomized so that xx is generated data.
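As a closing sketch, the DGM view of an RBM generator can be illustrated with a single evaluation pass (the group sizes, weights, and the [0, 1] ramp follow the text's modified propagation rule; all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def ramp01(u):
    # ReLU-style ramp of each unit into the interval [0, 1].
    return np.clip(u, 0.0, 1.0)

k, h = 3, 5                             # input group size, hidden group size
W = rng.normal(scale=0.3, size=(h, k))  # connections only between the two groups
b = rng.normal(scale=0.1, size=h)       # biases of the hidden group

def generate(x_z, W, b):
    # DGM view of an RBM: x_x = g(x_z | W, b) by one evaluation of the machine.
    return ramp01(W @ x_z + b)

x_z = rng.uniform(size=k)               # randomized input group (latent data)
x_x = generate(x_z, W, b)               # generated data
```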