A Beginner's Tutorial of Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are the building blocks of some deep learning networks. However, despite their importance, it is our perception that some very important derivations about the RBM are missing from the literature, and a beginner may find the RBM very hard to understand. We provide these missing derivations here. We cover the classic Bernoulli-Bernoulli RBM and the Gaussian-Bernoulli RBM, but leave out the ``continuous'' RBM, as it is believed to be not as mature as the former two. This tutorial can be used as a companion or complement to the well-known RBM tutorial ``Training restricted Boltzmann machines: An introduction'' by Fischer and Igel.


Introduction
A restricted Boltzmann machine (RBM) is a parameterized model of probability distributions. Given some observations, i.e. the training data, the parameters of the model can be learned by maximizing the likelihood function, usually with stochastic gradient ascent. An RBM consists of two types of units: the visible units, which can be observed, and the hidden units, which cannot. Connections exist only between the visible units and the hidden units, not between different visible units, nor between different hidden units. So the visible units can be viewed as constituting the first layer, and the hidden units as constituting the second layer. Each visible unit corresponds to a component of an observation (e.g., one visible unit for each pixel of an image). For detailed descriptions of the RBM see [1,2]. The architecture of an RBM is depicted in the figure below.
There are two well-received RBM types: the Bernoulli-Bernoulli RBM and the Gaussian-Bernoulli RBM, both of which we will discuss in this paper. The biggest difference between the two is that the visible units are binary in the former but continuous in the latter. The hidden units are binary in both types of RBM. In the literature there are also formulations of the RBM in which the hidden units are continuous, called continuous RBMs. We will not cover this type of RBM here, as it is believed by the present author that the existing continuous RBM formulations are not as mature as the previous two.

We write this tutorial because the existing papers and tutorials on the RBM do not provide detailed derivations of some very important facts about it. Their authors may consider these facts trivial or obvious, but they are not, at least as felt by the present author. Our purpose here is therefore to provide the missing derivations neatly in one paper so that beginners in this field can benefit from it. This paper should be used as a companion to other RBM tutorials, e.g. [1,2]. Concepts that are explained very well in the existing tutorials will not be explained here, as we only provide the missing derivations.
Notation: Throughout this paper, the Bernoulli-Bernoulli RBM will be abbreviated as BB-RBM, and the Gaussian-Bernoulli RBM will be abbreviated as GB-RBM. We denote the number of visible units by I and the number of hidden units by J. The vector of visible units is denoted by v and the vector of hidden units by h. We define the sigmoid function σ(·) by σ(x) = 1/(1 + e^{−x}), which we will use quite often later.

Derivations for BB-RBM
In this section, we consider BB-RBMs, which are RBMs with binary visible and hidden units, i.e., v ∈ {0, 1}^I and h ∈ {0, 1}^J. It is important to note that the RBM is an energy-based model of probability distributions. In this spirit, the BB-RBM with parameters θ represents the probability distribution

p(v, h; θ) = e^{E(v, h; θ)} / Z(θ),

where Z(θ) = Σ_{u,g} e^{E(u, g; θ)} is the normalizing constant, which is called the partition function, and E(v, h; θ) is the (actually negative) energy function, which is defined as

E(v, h; θ) = Σ_{i=1}^{I} Σ_{j=1}^{J} v_i w_{ij} h_j + Σ_{i=1}^{I} b_i v_i + Σ_{j=1}^{J} c_j h_j = v^T w h + b^T v + c^T h,

where θ = {w ∈ R^{I×J}, b ∈ R^I, c ∈ R^J} is the set of model parameters. Please be warned that the E in this paper is actually −E (the negative of the energy) in other papers. I believe defining E this way helps to reduce some clutter in later derivations.
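As a concrete illustration, the (negative) energy above can be computed directly. A minimal sketch follows; the tiny dimensions and random parameter values are hypothetical, chosen only for the demonstration.

```python
import numpy as np

def energy(v, h, w, b, c):
    """E(v, h; theta) = v^T w h + b^T v + c^T h.

    Note: per this paper's sign convention, E is the NEGATIVE of the
    energy used in most other RBM papers, so p(v, h) is proportional
    to exp(+E), not exp(-E).
    """
    return v @ w @ h + b @ v + c @ h

# Hypothetical tiny BB-RBM: I = 3 visible units, J = 2 hidden units.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=2)
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
print(energy(v, h, w, b, c))
```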

Definition of p(v; θ)
We already have

Z(θ) = Σ_{u,g} e^{E(u, g; θ)}.

And the probability mass function of the visible variables is thus

p(v; θ) = Σ_h p(v, h; θ) = (1/Z(θ)) Σ_h e^{E(v, h; θ)}.
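Since both the partition function and the marginal are sums over finitely many binary configurations, they can be checked by brute force on a toy model. The sketch below (with hypothetical random parameters) enumerates all states and verifies that p(v; θ) sums to one.

```python
import itertools
import numpy as np

def all_binary(n):
    """All binary vectors of length n."""
    return [np.array(t, dtype=float) for t in itertools.product([0, 1], repeat=n)]

def p_v(v, w, b, c):
    """p(v; theta) = (1/Z) * sum_h exp(E(v, h; theta)), by enumeration."""
    E = lambda u, g: u @ w @ g + b @ u + c @ g
    I, J = w.shape
    Z = sum(np.exp(E(u, g)) for u in all_binary(I) for g in all_binary(J))
    return sum(np.exp(E(v, g)) for g in all_binary(J)) / Z

# Hypothetical 3-visible, 2-hidden model with random parameters.
rng = np.random.default_rng(1)
w, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
total = sum(p_v(v, w, b, c) for v in all_binary(3))
print(total)  # sums to 1 up to floating-point error
```

Enumeration is of course only feasible for toy sizes; for realistic I and J the partition function is intractable, which is exactly why the sampling methods discussed later are needed.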

General formula for the gradient of the log-likelihood
Learning the RBM parameters θ is through the maximization of the log-likelihood over the training samples. The log-likelihood, for a single training sample v, is given by

ln L(θ; v) = ln p(v; θ) = ln Σ_h e^{E(v, h; θ)} − ln Σ_{u,g} e^{E(u, g; θ)}.

In view of (2.3), differentiating with respect to a generic parameter θ gives the general formula

∂ ln L(θ; v)/∂θ = Σ_h p(h|v; θ) ∂E(v, h; θ)/∂θ − Σ_{u,g} p(u, g; θ) ∂E(u, g; θ)/∂θ. (2.10)

The first term is an average under the data-conditioned distribution p(h|v; θ), and the second is an average under the model distribution.

Derivation of p(h|v; θ)

By definition, p(h|v; θ) = p(v, h; θ)/p(v; θ) = e^{E(v, h; θ)} / Σ_g e^{E(v, g; θ)}. Since E(v, h; θ) is a sum of terms each involving at most one h_j, the conditional factorizes over the hidden units, and for each j

p(h_j = 1|v; θ) = e^{c_j + Σ_i v_i w_{ij}} / Σ_{h_j=0}^{1} e^{h_j (c_j + Σ_i v_i w_{ij})} = σ(c_j + Σ_{i=1}^{I} v_i w_{ij}). (2.13)

In a similar manner, we have p(h_j = 0|v; θ) = 1 − σ(c_j + Σ_{i=1}^{I} v_i w_{ij}).
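The factorized conditional can be verified numerically against brute-force enumeration of p(h|v; θ). The sketch below does this on a hypothetical 2-visible, 2-hidden model with random parameters.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, w, c):
    """Vector of p(h_j = 1 | v; theta) = sigmoid(c_j + sum_i v_i w_ij)."""
    return sigmoid(c + v @ w)

# Brute-force check on a hypothetical tiny model.
rng = np.random.default_rng(2)
w, b, c = rng.normal(size=(2, 2)), rng.normal(size=2), rng.normal(size=2)
v = np.array([1.0, 0.0])
E = lambda u, g: u @ w @ g + b @ u + c @ g
hs = [np.array(t, float) for t in itertools.product([0, 1], repeat=2)]
weights = np.array([np.exp(E(v, g)) for g in hs])
weights /= weights.sum()                         # p(h | v) by enumeration
brute = sum(p * g for p, g in zip(weights, hs))  # marginals p(h_j = 1 | v)
print(brute, p_h_given_v(v, w, c))  # the two vectors should agree
```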

Detailed formula for the gradient of the log-likelihood

Substituting the energy function into (2.10) and using (2.13), the gradients with respect to the individual parameters are

∂ ln L(θ; v)/∂w_{ij} = v_i p(h_j = 1|v; θ) − Σ_u p(u; θ) u_i p(h_j = 1|u; θ), (2.14)

∂ ln L(θ; v)/∂b_i = v_i − Σ_u p(u; θ) u_i, (2.15)

∂ ln L(θ; v)/∂c_j = p(h_j = 1|v; θ) − Σ_u p(u; θ) p(h_j = 1|u; θ). (2.16)
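For a model small enough to enumerate, the gradient with respect to w can be computed exactly and checked against a finite-difference approximation of the log-likelihood. The helper `exact_grad_w` below is a hypothetical illustration, not part of the paper's derivation.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def exact_grad_w(v, w, b, c):
    """Exact gradient of ln L w.r.t. w for a tiny BB-RBM, by enumeration:
    data term v_i * p(h_j=1|v) minus the model average over all u."""
    I, J = w.shape
    us = [np.array(t, float) for t in itertools.product([0, 1], repeat=I)]
    gs = [np.array(t, float) for t in itertools.product([0, 1], repeat=J)]
    E = lambda u, g: u @ w @ g + b @ u + c @ g
    Z = sum(np.exp(E(u, g)) for u in us for g in gs)
    data_term = np.outer(v, sigmoid(c + v @ w))
    model_term = sum(
        (sum(np.exp(E(u, g)) for g in gs) / Z) * np.outer(u, sigmoid(c + u @ w))
        for u in us
    )
    return data_term - model_term

# Finite-difference sanity check on one weight (hypothetical 2x2 model).
rng = np.random.default_rng(3)
w, b, c = rng.normal(size=(2, 2)), rng.normal(size=2), rng.normal(size=2)
v = np.array([1.0, 1.0])

def log_lik(wm):
    states = [np.array(t, float) for t in itertools.product([0, 1], repeat=2)]
    E = lambda u, g: u @ wm @ g + b @ u + c @ g
    Z = sum(np.exp(E(u, g)) for u in states for g in states)
    return float(np.log(sum(np.exp(E(v, g)) for g in states) / Z))

eps = 1e-6
wp, wn = w.copy(), w.copy()
wp[0, 0] += eps
wn[0, 0] -= eps
fd = (log_lik(wp) - log_lik(wn)) / (2 * eps)
print(fd, exact_grad_w(v, w, b, c)[0, 0])  # the two numbers should agree
```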

Derivation of p(v|h; θ) = p(v, h; θ)/p(h; θ)

In Gibbs sampling for the calculation of the second term (the model average) on the right-hand side of (2.14)-(2.16), both p(h|v; θ) and p(v|h; θ) are needed. p(h|v; θ) is derived in Section 2.3, and we now derive p(v|h; θ) here. By the symmetry of the energy function, the same factorization argument applies over the visible units, and

p(v_i = 1|h; θ) = σ(b_i + Σ_{j=1}^{J} w_{ij} h_j).
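With both conditionals in hand, one full Gibbs sweep alternates between them. A minimal sketch, with hypothetical parameters, might look as follows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, w, b, c, rng):
    """One full Gibbs sweep for a BB-RBM:
    h ~ p(h|v; theta), then v' ~ p(v|h; theta)."""
    h = (rng.random(w.shape[1]) < sigmoid(c + v @ w)).astype(float)
    v_new = (rng.random(w.shape[0]) < sigmoid(b + w @ h)).astype(float)
    return v_new, h

# Run a short chain on a hypothetical 3-visible, 2-hidden model.
rng = np.random.default_rng(4)
w, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
v = np.array([1.0, 0.0, 1.0])
for _ in range(5):
    v, h = gibbs_step(v, w, b, c, rng)
print(v, h)
```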

Derivations for GB-RBM
In this section, we consider GB-RBMs, which are RBMs with continuous visible units and binary hidden units, i.e., v ∈ R^I and h ∈ {0, 1}^J. The GB-RBM with parameters θ represents the probability distribution

p(v, h; θ) = e^{E(v, h; θ)} / Z(θ),

where Z(θ) = ∫ Σ_g e^{E(u, g; θ)} du is the normalizing constant, which is called the partition function, and E(v, h; θ) is the negative energy function, which is defined as

E(v, h; θ) = −(a²/2) Σ_{i=1}^{I} (v_i − b_i)² + Σ_{i=1}^{I} Σ_{j=1}^{J} a v_i w_{ij} h_j + Σ_{j=1}^{J} c_j h_j, (3.3)

where θ = {w ∈ R^{I×J}, b ∈ R^I, c ∈ R^J} is the set of model parameters. In (3.3), a is supposed to be the inverse of the standard deviation of the visible variables (we assume here that they share a common standard deviation, but we could also assume otherwise). In practice, with proper preprocessing of the data, a is usually treated as a hyper-parameter, i.e. not included in the trained parameters. See [3,4] for discussions.
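The GB-RBM negative energy can again be computed directly. The sketch below assumes the common form with a quadratic penalty on the visibles and a as the shared inverse standard deviation; the exact placement of b varies between formulations in the literature, so treat this as one illustrative choice.

```python
import numpy as np

def gb_energy(v, h, w, b, c, a):
    """Negative energy of a GB-RBM, assuming the common form
    E = -(a^2/2) * sum_i (v_i - b_i)^2 + a * v^T w h + c^T h,
    where a is the shared inverse standard deviation of the visibles.
    (The placement of b differs across formulations.)"""
    return -0.5 * a**2 * np.sum((v - b) ** 2) + a * (v @ w @ h) + c @ h

# Hypothetical 1-visible, 1-hidden example: a=1, w=1, b=0, c=0, v=2, h=1.
print(gb_energy(np.array([2.0]), np.array([1.0]), np.ones((1, 1)),
                np.zeros(1), np.zeros(1), 1.0))  # -2 + 2 + 0 = 0
```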

Definition of p(v; θ)
We already have

Z(θ) = ∫ Σ_g e^{E(u, g; θ)} du.

And the probability density function of the visible variables is thus

p(v; θ) = (1/Z(θ)) Σ_g e^{E(v, g; θ)}. (3.5)

General formula for the gradient of the log-likelihood
Learning the RBM parameters θ is through the maximization of the log-likelihood over the training samples. The log-likelihood, for a single training sample v, is given by

ln L(θ; v) = ln p(v; θ) = ln Σ_g e^{E(v, g; θ)} − ln Z(θ). (3.7)

As in the BB-RBM case, the gradient splits into a data-conditioned average and a model average:

∂ ln L(θ; v)/∂θ = Σ_h p(h|v; θ) ∂E(v, h; θ)/∂θ − ∫ Σ_g p(u, g; θ) ∂E(u, g; θ)/∂θ du.

Derivation of p(h|v; θ)

As in the BB-RBM case, the conditional factorizes over the hidden units, and for each j

p(h_j = 1|v; θ) = e^{c_j + Σ_i a v_i w_{ij}} / Σ_{h_j=0}^{1} e^{h_j (c_j + Σ_i a v_i w_{ij})} = σ(c_j + Σ_{i=1}^{I} a v_i w_{ij}). (3.13)

In a similar manner, we have p(h_j = 0|v; θ) = 1 − σ(c_j + Σ_{i=1}^{I} a v_i w_{ij}).

Derivation of p(v|h; θ) = p(v, h; θ)/p(h; θ)

In Gibbs sampling for the calculation of the second term (the model average) on the right-hand side of (3.14)-(3.16), both p(h|v; θ) and p(v|h; θ) are needed. p(h|v; θ) is derived in Section 3.3, and we now derive p(v|h; θ) here. Completing the square in the energy function shows that the conditional factorizes over the visible units into Gaussians:

p(v_i|h; θ) = N(v_i; b_i + (1/a) Σ_{j=1}^{J} w_{ij} h_j, 1/a²).
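A Gibbs sweep for the GB-RBM therefore samples the hiddens from the sigmoid conditional and the visibles from the Gaussian conditional. The sketch below assumes the common energy form in which a is the inverse standard deviation of the visibles; the parameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gb_gibbs_step(v, w, b, c, a, rng):
    """One Gibbs sweep for a GB-RBM, assuming the common energy form
    with a = inverse standard deviation of the visibles:
      h_j ~ Bernoulli(sigmoid(c_j + a * sum_i v_i w_ij))
      v_i ~ Normal(b_i + (1/a) * sum_j w_ij h_j, 1/a^2)
    """
    h = (rng.random(w.shape[1]) < sigmoid(c + a * (v @ w))).astype(float)
    mean = b + (w @ h) / a
    v_new = mean + rng.normal(size=w.shape[0]) / a
    return v_new, h

# Run a short chain on a hypothetical 3-visible, 2-hidden model.
rng = np.random.default_rng(4)
w, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
a = 1.0
v = np.array([0.5, -0.2, 1.3])
for _ in range(5):
    v, h = gb_gibbs_step(v, w, b, c, a, rng)
print(v, h)  # continuous visibles, binary hiddens
```

Note that unlike the BB-RBM sweep, the visible update here is a draw from a Gaussian rather than a Bernoulli, which is the only place the two samplers differ.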