Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Introduction to Bayesian Statistics

We start with Bayesian Statistics. Watanabe’s theory is fundamentally based on generalizing classical results in Bayesian Statistics, so it is important to get a strong grip and understand this classical theory well before moving on. It also gives us the complete understanding of the framework we are working in, and is the first essential thing to master.

Connection with Machine Learning and Setup

Machine Learning Models are primarliy consisting of two frameworks (or a combination of them): Frequentist and Bayesian.

The setup is that we have a true data generating distribution pdata(x)p_{data}(x), and consider a set of arbitrary samples X={x1,...,xn}X = \{x_1, ..., x_n\} . We take a statistical model pmodel(x,θ)p_{model}(x, \theta) (which is a parametric family of probability distributions) which aims to estimate the true distribution.

The likelihood function of our statistical model is defined as

Pmodel(X;θ)=i=1npmodel(xi;θ)P_{model}(X; \theta) = \prod_{i = 1}^n p_{model}(x_i; \theta)

The frequentist approach is to find the optimal θ\theta which maximizes this likelihood function.

The KL divergence from probability distribution pp to qq is defined as

D(pq)=p(x)logp(x)q(x)dxD(p||q) = \int_{-\infty}^{\infty} p(x) \log \dfrac{p(x)}{q(x)}dx

This is the main measure that we will use to associate similarity between probability distributions (even though it is not really a metric, it is clear that it is not even symmetric).

It can be easily seen that finding the optimal θ\theta (called the maximum likelihood estimator) is equivalent to minimizing the KL divergence from the empirical true distribution to our statistical model, which is a function of θ\theta. An approximation to the local optimal parameter is often approached via (stochastic) gradient descent. This is also the case in neural networks, which are essentially function approximators. We use SGD to approximate to the local optimal parameter vector.

We will not delve into the frequentist approach more here (you may refer to Goodfellow et al). We will move on to the Bayesian approach here. Thus, when we refer to neural networks here, an important distinction is that now this is not the standard neural networks where SGD is used. Still, we gain many insights from this approach that also carry to the standard networks.

In the Bayesian approach, instead of considering just the optimal parameter, we consider a probability distributin over the space of parameters itself. Initially, this is called the prior function, and as we observe the data from the true distribution, we update this prior function to successively obtain a posterior functin, which is an estimate over the entire parameter space to what generates the true distribution function.

Specifically, we consider an appropriate prior function φ(w)\varphi(w) and a statistical model p(xw)p(x|w). These are chosen by us, and this choice often determines what estimate our bayesian method will given us. We assume there is a true date generating distribution q(x)q(x), from which we draw NN samples independently, {x1,...,xn}\{x_1, ..., x_n\}. This sample induces a function p(wx1,...,xn)p(w|x_1, ..., x_n), which is the update of our prior function. This further induces p(xx1,...,xn)p(x|x_1, ..., x_n), which is our estimate of the true distribution. This process goes on as we make more samples. As can be seen, this is more computatinally intensive. However, this approach is superior in many cases, we will specifically see an example later on. We will now define everything mathematically. To summarie, here is the procedure:

  1. Construct the universe and the mathematical laws between bayesian observables which hold for any arbitrary: true distribution, statistical model, and a prior.

  2. Evaluate how appropriate the statistical model and the prior is using these laws.

  3. Employ the most suitable pair.

Introduction to Bayesian Statistics

The posterior function is obtained through Bayes’ rule.

p(wx1,...,xn)=p(x1,...,xnw)φ(w)p(x1,...,xnw)φ(w)dwp(w|x_1, ..., x_n) = \dfrac{p(x_1, ..., x_n|w) \varphi(w)}{\int p(x_1, ..., x_n|w)\varphi(w)dw}

But neither do we know the statistical model, nor do we know the prior. Thus a meaningful approach is to just start with something, evaluate how good it is, and then update it. The evaluation is done through the mathematical laws described above.

This gives rise to the estimated pdf of xx, called the predictive distribution:

p^(x)=p(xw)P(wx1,...,xn)dw\hat{p}(x) =\int p(x|w) P(w|x_1, ..., x_n) dw

Expected: p^(x)q(x)\hat{p}(x) \approx q(x) if (p(xw),φ(x))(p(x|w), \varphi(x)) is appropriate for q(x)q(x). We want to evaluate the tuple appropriateness without information about q(x)q(x). We develop the machinery for that.

True Distribution

A realized value of XnX^n in a trial is denoted xn=(x1,,xn)x^n = (x_1, \ldots, x_n). In practical applications, while we may not know q(x)q(x), we assume its existence.

Let us just revise the basics first as they will be important in the calculations that we make.

Let f:Xnf(Xn)Rf : X^n \rightarrow f(X^n) \in \mathbb{R}

E[f(Xn)]=f(xn)i=1nq(xi)dxiE[f(X^n)] = \int \cdots \int f(x^n) \prod_{i=1}^n q(x_i) dx_i

Do observe that we are able to take the product here because of the independent sampling.

V[f(Xn)]=E[f(Xn)2]E[f(Xn)]2\mathbb{V}[f(X^n)] = E[f(X^n)^2] - E[f(X^n)]^2

The average entropy of the true distribution is defined as:

S=q(x)logq(x)dxS = - \int q(x) \log q(x) dx

The empirical entropy is defined as:

Sn=1ni=1nlogq(Xi)S_n = -\frac{1}{n} \sum_{i=1}^n \log q(X_i)

By definition, one can see that E[Sn]=SE[S_n] = S. I outline it nonetheless, so that you can get comfortable with the calculations.

E[Sn]=1ni=1nE[log(q(Xi))]E[S_n] = -\frac{1}{n} \sum_{i=1}^n E[\log(q(X_i))]
=1ni=1nlogq(x)q(x)dx= -\frac{1}{n} \sum_{i=1}^n \int \log q(x) \cdot q(x) dx
=logq(x)q(x)dx=S= -\int \log q(x) \cdot q(x) dx = S

Similarly, one can see that the variance of the empirical entropy is:

V[Sn]=1n[q(x)(logq(x))2dxS2]\mathbb{V}[S_n] = \frac{1}{n} \left[ \int q(x)(\log q(x))^2 dx - S^2 \right]

The average and empirical entropies of the true distribution which is a conditional distribution is defined similarly:

S=q(x,y)logq(yx)dxdyS = - \int q(x,y) \log q(y|x) dx dy
Sn=1ni=1nlogq(yixi)S_n = - \frac{1}{n} \sum_{i=1}^n \log q(y_i|x_i)

Model, Prior and Posterior

Let WRdW \subset \mathbb{R}^d. Let XnX^n be independent real random values subjected to q(x)q(x). For an arbitrary pair (p(xw),φ(w))(p(x|w), \varphi(w)), the posterior probability density is defined by

p(wXn)=1Z(Xn)φ(w)i=1np(Xiw)p(w|X^n) = \frac{1}{Z(X^n)} \varphi(w) \prod_{i=1}^n p(X_i|w)

where Z(Xn)Z(X^n) is defined by

Z(Xn)=φ(w)i=1np(Xiw)dwZ(X^n) = \int \varphi(w) \prod_{i=1}^n p(X_i|w) dw

which is called the partition function/marginal likelihood/evidence.

Expected value over the posterior distribution is denoted Ew[]E_w[\cdot]. Do note that Ew[f(w)]=1Z(Xn)f(w)φ(w)i=1np(Xiw)dwE_w[f(w)] = \frac{1}{Z(X^n)} \int f(w) \varphi(w) \prod_{i=1}^n p(X_i|w) dw

This expected value is a random variable as it depends on XnX^n. (Better to say, it is the expected value over a conditional probability density and hence is a random variable)

The posterior gives rise to the predictive density function: p(xXn)=defp(x|X^n) \stackrel{def}{=}

=Ew[p(xw)]=p(xw)p(wXn)dw= E_w[p(x|w)] = \int p(x|w) p(w|X^n) dw

(estimate ww from XnX^n, estimate xx from ww, vary over all ww)

If φ(w)dw<\int \varphi(w) dw < \infty, the prior is called proper, because it is normalized so that φ(w)dw=1\int \varphi(w) dw = 1. Even for an improper prior, posterior and predictive probability densities can be defined if Z(Xn)Z(X^n) is finite and well defined.

An Important Example - The Exponential Family

In many simple statistical models, the posterior converges to the normal distribution as nn \to \infty. We see such a case in the example referred to below. However, even in some simple cases, and many others, this fails. This is the key problem resulting in the new theory.

At this point, I highly recommend referring to this example:

We are now going to prove the formulae given in the example.

If the statistical model is of the form

p(xθ)=u(x)exp(v(θ)Tw(x))p(x|\theta) = u(x)\exp (v(\theta)^T w(x))

where uu is a real valued function (and the other two are vector valued), then this distribution is said to belong to the exponential family. Furthermore, if the distribution of the parameter θΘ\theta \in \Theta depends on some hyperparameter ϕ\phi, and can be written as

φ(θϕ)=exp(v(θ)Tϕ)z(ϕ)\varphi(\theta|\phi) = \frac{\exp (v(\theta)^T \phi)}{z(\phi)}

where z(ϕ)z(\phi) is the normalizing factor, then φ(θϕ)\varphi(\theta|\phi) is said to be a conjugate prior distribution.

In the case when the distribution is of the form

p(xμ,σ2)=12πσ2exp{(xμ)22σ2}p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp \{-{\frac{(x-\mu)^2}{2\sigma^2}}\}

,

we can take u(x)=12πu(x) = \dfrac{1}{\sqrt{2\pi}}, v(θ)T=[1σ2,μσ2,μ2σ2,logσ2]v(\theta)^T = [\dfrac{1}{\sigma^2}, \dfrac{\mu}{\sigma^2}, \dfrac{\mu^2}{\sigma^2}, \log\sigma^2]

and w(x)T=[x22,x,12,12]w(x)^T = [\dfrac{-x^2}{2}, x, \dfrac{-1}{2}, \dfrac{-1}{2}]. Thus it is of the exponential family.

Now, as we know p(θx1,...,xn)=i=1np(xiθ)φ(θϕ)Z(x1,...,xn)p(\theta|x_1, ..., x_n) = \dfrac{\prod_{i=1}^np(x_i|\theta)\varphi(\theta|\phi)}{Z(x_1, ..., x_n)}. Let us calculate the numerator.

i=1np(xiθ)φ(θϕ)=exp(v(θ)Tϕ)z(ϕ)i=1nu(xi)exp(v(θ)Tw(xi))\prod_{i=1}^np(x_i|\theta)\varphi(\theta|\phi) = \frac{\exp (v(\theta)^T\phi)}{z(\phi)}\prod^n_{i=1}u(x_i)\exp (v(\theta)^Tw(x_i))
=exp(v(θ)Tϕ)exp(v(θ)Ti=1nw(xi))1z(ϕ)i=1nu(xi)= \exp(v(\theta)^T\phi)\cdot \exp(v(\theta)^T\sum_{i = 1}^{n}w(x_i))\cdot \frac{1}{z(\phi)}\prod_{i = 1}^nu(x_i)
=exp(v(θ)T(ϕ+i=1nw(xi)))1z(ϕ)i=1nu(xi)=\exp(v(\theta)^T(\phi + \sum_{i=1}^{n}w(x_i)))\frac{1}{z(\phi)}\prod_{i=1}^{n}u(x_i)

Let us denote ϕn=ϕ+i=1nw(xi)\phi_n = \phi + \sum_{i=1}^{n}w(x_i). Then we have that the numerator is:

=exp(v(θ)Tϕn)z(ϕn)z(ϕn)z(ϕ)i=1nu(xi)= \frac{\exp(v(\theta)^T\phi_n)}{z(\phi_n)}\cdot\frac{z(\phi_n)}{z(\phi)}\prod_{i=1}^{n}u(x_i)

Let us get the Z(x1,...,xn)Z(x_1, ..., x_n) for which we need to integrate the numerator with respect to θ\theta. and here we use a nice hack. We know that the integral of the prior with respect to θ\theta is 1, regardless of what ϕ\phi is. So set ϕ=ϕn\phi = \phi_n and we see that the first term integrates out to 1, while the second term is a scalar number independent of θ\theta.

Hence,

Z(x1,...,xn)=z(ϕn)z(ϕ)i=1nu(xi)Z(x_1, ..., x_n) = \frac{z(\phi_n)}{z(\phi)}\prod_{i=1}^{n}u(x_i)

and thus, the posterior is

p(θx1,...,xn)=exp(v(θ)Tϕn)z(ϕn)p(\theta|x_1, ..., x_n) = \frac{\exp(v(\theta)^T\phi_n)}{z(\phi_n)}

which is also from the exponential family!

Finally, the predictive probability is given by

p(xx1,..,xn)=Z(x1,...,xn,x)Z(x1,...,xn)=u(x)z(ϕn+w(x))z(ϕn)p(x|x_1, .., x_n) = \frac{Z(x_1, ..., x_n, x)}{Z(x_1, ..., x_n)} = \frac{u(x)z(\phi_n + w(x))}{z(\phi_n)}

One may notice that we are using a different formula for the predictive density, bypassing the integral definition. This comes directly from using the bayes rule in the given definition (check it yourself), and it is computationally more useful in some cases to use this instead.

For the example given at the start of the section, it is just a matter of inputting numbers into the formulae.

Estimation and Generalization

We need an objective measure which indicates the difference between true and estimated probability density to evaluate how accurate the predictive density is.

Let XnX^n be a sample taken independently from q(x)q(x) and p(xXn)p(x|X^n) be a predictive density using a statistical model p(xw)p(x|w) and a prior φ(w)\varphi(w). We are going to make two definitions:

Tn=1ni=1nlogp(XiXn)T_n = -\frac{1}{n} \sum_{i=1}^n \log p(X_i | X^n)
Gn=q(x)logp(xXn)dxG_n = - \int q(x) \log p(x | X^n) dx

Notice how both of these quantities are random variables.

Thus

GnS=q(x)logp(xXn)dx+q(x)logq(x)dxG_n - S = - \int q(x) \log p(x|X^n) dx + \int q(x) \log q(x) dx

=q(x)logq(x)p(xXn)dx= \int q(x) \log \frac{q(x)}{p(x|X^n)} dx

=K(q(x)p(xXn))= K(q(x) || p(x | X^n))

Thus GnSG_n \geq S, with equality iff q(x)=p(xXn)q(x) = p(x | X^n). That is, the smaller GnG_n is, the more precise our estimate is according to the KL divergence.

GnSG_n - S and TnSnT_n - S_n are called generalization errors and training errors resp.

An observation: As entropy does not depend on either a model / prior, smaller generalization error is equivalent to lower KL divergence.

Definition: Assume n2n \geq 2. Let XnXiX^n \setminus X_i be a set of random variables (leave one out).

Cross validation loss is defined by Cn=1ni=1nlogp(XiXnXi)C_n = -\frac{1}{n} \sum_{i=1}^n \log p(X_i | X^n \setminus X_i) and CnSnC_n - S_n is called the cross validation error.

We now prove an important theorem, which has three statements regarding the definitions that we made.

Theorem: Assume that XnX^n is independent. Then the following holds.

  1. Assume that E[Gn],E[Cn]E[G_n], E[C_n] are finite values. Then

>E[Cn]=E[Gn1]>>E[C_n] = E[G_{n-1}] >
  1. The cross validation loss satisfies the following.

Cn=1ni=1nlogEw[1p(Xiw)]C_n = \frac{1}{n} \sum_{i=1}^n \log E_w \left[ \frac{1}{p(X_i|w)} \right]
  1. For an arbitrary set of XnX^n, CnTnC_n \geq T_n. Cn=TnC_n = T_n iff p(Xiw)p(X_i|w) is a const function of ww on {wW,p(wXn)>0}\{w \in W, p(w|X^n) > 0\}.


Note: Ew[f(w)]=f(w)p(wXn)dwE_w[f(w)] = \int f(w) p(w|X^n) dw this is just the integral with respect to the posterior distribution

(1) Here is the proof of the first statement.

E[Cn]=1ni=1nE[logp(XiXnXi)]E[C_n] = -\frac{1}{n} \sum_{i=1}^n E \left[ \log p(X_i | X^n \setminus X_i) \right]
=1ni=1nq(x)logp(xXnXi)dx= -\frac{1}{n} \sum_{i=1}^n \int q(x) \log p(x | X^n \setminus X_i) dx
=1nq(x)i=1nlogp(xXnXi)dx= -\frac{1}{n} \int q(x) \sum_{i=1}^n \log p(x | X^n \setminus X_i) dx
=1ni=1nEXnXi[EFubiniXi[logp(XiXnXi)]]= -\frac{1}{n} \sum_{i=1}^n \underset{X^n \setminus X_i}{E} \left[ \underset{X_i}{\underset{\uparrow \text{Fubini}}{E}} [\log p(X_i | X^n \setminus X_i)] \right]
=1ni=1nEXnXi[q(x)logp(xXnXi)dx]= -\frac{1}{n} \sum_{i=1}^n E_{X^n \setminus X_i} \left[ \int q(x) \log p(x | X^n \setminus X_i) dx \right]
=1ni=1nEXnXi[Gn1(XnXi)]=EXn1[Gn1]= -\frac{1}{n} \sum_{i=1}^n E_{X^n \setminus X_i} [ - G_{n-1}(X^n \setminus X_i) ] = E_{X^{n-1}} [G_{n-1}]

While it was not mentioned what the expectation is being taken over in the statement, the proof clarifies it. In any case, the answer to the clarification is the canonical and the most standard answer.

(2) We now prove the second statement:

Cn=1ni=1nlogp(XiXnXi)C_n = -\frac{1}{n} \sum_{i=1}^n \log p(X_i | X^n \setminus X_i)

Thus,

Cn=1ni=1nlogp(Xiw)p(wXnXi)dwC_n = -\frac{1}{n} \sum_{i=1}^n \log \int p(X_i | w) p(w | X^n \setminus X_i) dw
=1ni=1nlogp(Xiw)jip(Xjw)φ(w)dwZ(XnXi)= -\frac{1}{n} \sum_{i=1}^n \log \frac{\int p(X_i | w) \prod_{j \neq i} p(X_j | w) \cdot \varphi(w) dw}{Z(X^n \setminus X_i)}
=1ni=1nlog(i=1np(Xiw)φ(w)dwjip(Xjw)φ(w)dw)= -\frac{1}{n} \sum_{i=1}^n \log \left( \frac{\int \prod_{i=1}^n p(X_i | w) \cdot \varphi(w) dw}{\int \prod_{j \neq i} p(X_j | w) \cdot \varphi(w) dw} \right)
=1ni=1nlog(jip(Xjw)φ(w)dwi=1np(Xiw)φ(w)dw)= \frac{1}{n} \sum_{i=1}^n \log \left( \frac{\int \prod_{j \neq i} p(X_j | w) \cdot \varphi(w) dw}{\int \prod_{i=1}^n p(X_i | w) \cdot \varphi(w) dw} \right)

Call the integrand in the denominator AA.

=1ni=1nlog(Ap(Xiw)dwAdw)= \frac{1}{n} \sum_{i=1}^n \log \left( \frac{\int \frac{A}{p(X_i|w)} dw}{\int A dw} \right)
=1ni=1nlogEw[1p(Xiw)]= \frac{1}{n} \sum_{i=1}^n \log E_w \left[ \frac{1}{p(X_i|w)} \right]


We introduced cross validation as a measure to evaluate the accuracy of our estimation. However, rhere are two issues with cross validation:

(1) Although the averages of Cn,GnC_n, G_n are equal the variances need not be equal. However, we do have this relation:

σ(GnS)=σ(CnSn)+O(1n)\sigma(G_n - S) = \sigma(C_n - S_n) + O(\frac{1}{n})

.

(2) In the second statement, if the average by the posterior is numerically approximated, then

ISCV=1ni=1nlogE^w[1p(Xiw)]ISCV = \frac{1}{n} \sum_{i=1}^n \log \hat{E}_w \left[ \frac{1}{p(X_i|w)} \right]

is called the importance sampling cross validation loss.

CV=1ni=1nlogEw(i)[p(Xiw)]CV = -\frac{1}{n} \sum_{i=1}^n \log E_w^{(-i)} \left[ p(X_i|w) \right]

is fundamentally different from the former (Ew(i)E_w^{(-i)} is expectation with respect to XnXiX^n \setminus X_i).

(3) Let us prove the third statement now.

CnTn=1ni=1nlog[Ew[p(Xiw)]Ew[1/p(Xiw)]]0C_n - T_n = \frac{1}{n} \sum_{i=1}^n \log \left[ E_w[p(X_i|w)] E_w \left[ 1/p(X_i|w) \right] \right] \geq 0

By Cauchy Schwarz. Equality holds iff p(Xiw)1/2p(Xiw)1/2p(X_i|w)^{1/2} \propto p(X_i|w)^{-1/2} as a funciton of ww p(Xiw)\Rightarrow p(X_i|w) is a const fn of ww.


We introduce another measure now, and it is often better than the cross validaion loss. There are also many cases where WAIC can be employed whereas cross validation cannot.

Def: Assume n1n \geq 1. Let XnX^n be a set of random variables. The widely applicable information criterion (WAIC) is defined by

Wn=Tn+1ni=1nVw[logp(Xiw)]W_n = T_n + \frac{1}{n} \sum_{i=1}^n \mathbb{V}_w \left[ \log p(X_i|w) \right]

WnSnW_n - S_n is called the WAIC error.

Result: If XnX^n is independent, WAIC is asymptotically equal to cross validation loss.

Wn=Cn+O(1/n2)W_n = C_n + \mathcal{O}(1/n^2)

E[Wn]=E[Cn]+O(1/n2)E[W_n] = E[C_n] + \mathcal{O}(1/n^2)

Remark: CnC_n and WAIC can be employed to evaluate a stats model and a prior even if the prior is improper.

Just to summarize, we have introduced three instruments of measure:

  1. Generalization Error: GnSG_n - S

  2. Cross Validation Error: CnSnC_n - S_n

  3. WAIC Error: WnSnW_n - S_n

In numerical experiments, we often care about minimizing errors instead of the loss itself due to the lower variance.

Marginal likelihood or Partition Fn

If a prior satisfies φ(w)dw=1\int \varphi(w) dw = 1, then the marginal likelihood (partition function) satisfies

Z(x1,,xn)dx1dxn=φ(w)dwp(xiw)dx1dxn=1\int Z(x_1, \dots, x_n) dx_1 \dots dx_n = \int \varphi(w) dw \int \prod p(x_i | w)dx_1 \dots dx_n = 1

We have slyly used Fubini Theorem above. Thus EXn[Z(Xn)]=1E_{X^n}[Z(X^n)] = 1 if the prior is proper.

Z(Xn)Z(X^n) can be thus be understood as an estimated pdf of XnX^n using a statistical model p(xw)p(x|w) and a prior φ(w)\varphi(w). Thus it is sometimes written as p(Xn)p(X^n). Thus this is an estimate of q(Xn)q(X^n) by looking at the KL divergence.

Definition: The Free energy, or the minus log marginal likelihood is defined by

Fn=logZ(Xn)F_n = - \log Z(X^n)

We look at this quantity also as an estimate of q(Xn)q(X^n).

Using the notation q(Xn)=q(xi)q(X^n) = \prod q(x_i) and p(Xn)=Z(Xn)p(X^n) = Z(X^n), firstly note that

H(Xn)=q(Xn)logq(Xn)dxn=i=1nq(Xn)logq(xi)dxnH(X^n) = - \int q(X^n) \log q(X^n) dx^n = - \sum_{i=1}^n \int q(X^n) \log q(x_i) dx^n

=i=1nq(xi)logq(xi)dxi=nS= - \sum_{i=1}^n \int q(x_i) \log q(x_i) dx_i = nS

Thus,

E[Fn]nS=E[logp(Xn)]+q(Xn)logq(Xn)dxnE[F_n] - nS = E[-\log p(X^n)] + \int q(X^n) \log q(X^n) dx^n

=q(Xn)logp(Xn)dxn+q(Xn)logq(Xn)dxn= - \int q(X^n) \log p(X^n) dx^n + \int q(X^n) \log q(X^n) dx^n

Then E[Fn]nS=q(Xn)logq(Xn)p(Xn)dxn=K(q(Xn)p(Xn))E[F_n] - nS = \int q(X^n) \log \frac{q(X^n)}{p(X^n)} dx^n = K(q(X^n) || p(X^n))

Smaller E[Fn]E[F_n] means KL divergence decreases (KL0\Rightarrow KL \geq 0) hence p(Xn)p(X^n) is a better estimate for q(Xn)q(X^n).

Thus E[Gn]SE[G_n] - S is the average KL divergence from q(x)q(x) to p(xXn)p(x|X^n) whereas E[Fn]nSE[F_n] - nS is their sum.

We now prove yet another important theorem.

Theorem: Let n1n \geq 1. The average generalization loss is equal to the increase in free energy.

E[Gn]=E[Fn+1]E[Fn]E[G_n] = E[F_{n+1}] - E[F_n]

Thus E[Fn]=i=1n1E[Gi]+E[F1]E[F_n] = \sum_{i=1}^{n-1} E[G_i] + E[F_1]

Proof: For an arbitrary fn f(x)f(x), q(x)f(x)=EXn+1[f(Xn+1)]\int q(x) f(x) = E_{X_{n+1}} [f(X_{n+1})]

Now, Gn=q(x)logp(xXn)dxG_n = - \int q(x) \log p(x|X^n) dx

=EXn+1[logp(Xn+1Xn)]= - E_{X_{n+1}} [\log p(X_{n+1} | X^n)]

=EXn+1[logp(Xn+1w)φ(w)p(Xiw)dwφ(w)p(Xiw)dw]= - E_{X_{n+1}} \left[ \log \frac{\int p(X_{n+1}|w) \varphi(w) \prod p(X_i|w) dw}{\int \varphi(w) \prod p(X_i|w) dw} \right]
=EXn+1[logZ(Xn+1)]+logZ(Xn)= - E_{X_{n+1}} [\log Z(X^{n+1})] + \log Z(X^n)

Thus, E[Gn]=E[Fn+1]E[Fn]E[G_n] = E[F_{n+1}] - E[F_n].

Remark: As Fn=logZ(Xn)F_n = - \log Z(X^n), the correspondence between free energy and marginal likelihood is one to one.

However, in general, asymptotic order of the marginal likelihood as a random variable is not equal to its average, whereas for free energy that is the case. We have not proved this yet. Thus, for asymptotic statistics, free energy is a more convenient random variable.

We can illustrate the failure for the former: Let the marginal likelihood ratio for r(Xn)=Z(Xn)q(Xn)r(X^n) = \dfrac{Z(X^n)}{q(X^n)} Then E[r(Xn)]=1E[r(X^n)] = 1. However, this is a result: r(Xn)0r(X^n) \rightarrow 0 in probability.

Meaning of marginal likelihood

Assume that p0(p,φ)p_0(p,\varphi) is the prior distributions of a model p(xw)p(x|w) and a prior φ(w)\varphi(w). Then p(Xnp,φ)=p(Xiw)φ(w)dw=Z(Xn)p(X^n | p,\varphi) = \int \prod p(X_i|w) \varphi(w) dw = Z(X^n)

By Bayes thm, p(p,φXn)=p(Xnp,φ)p0(p,φ)p(Xn)p(p,\varphi | X^n) = \dfrac{p(X^n | p,\varphi) p_0(p,\varphi)}{p(X^n)}

Thus if nn is sufficiently large, maximizing Z(Xn)Z(X^n) is maximizing p(p,φXn)p(p,\varphi | X^n)

Conditional independent Cases

We will make the definitions for the conditionally independent case. They are quite similar.

Let us assume XnX^n is dependent but YnY^n is conditionally independent.

For an arbitrary function f:(Xn,Yn)f(Xn,Yn)Rf : (X^n, Y^n) \rightarrow f(X^n, Y^n) \in \mathbb{R},

E[f(Xn,Yn)]= ⁣f(Xn,Yn)q(yixi)dy1dynE[f(X^n, Y^n)] = \int \dots \int f(X^n,Y^n) \prod q(y_i|x_i) dy_1 \dots dy_n

which is a function of XnX^n.

S=1ni=1nq(yixi)logq(yxi)dyS = -\frac{1}{n} \sum_{i=1}^n \int q(y_i|x_i) \log q(y|x_i) dy
Sn=1ni=1nlogq(YiXi)S_n = -\frac{1}{n} \sum_{i=1}^n \log q(Y_i|X_i)
p(wXn,Yn)=1Z(Xn,Yn)φ(w)p(YiXi,w)p(w | X^n, Y^n) = \frac{1}{Z(X^n, Y^n)} \varphi(w) \prod p(Y_i | X_i, w)
p(YX,Xn,Yn)=p(YX,w)p(wXn,Yn)dwp(Y | X, X^n, Y^n) = \int p(Y | X, w) p(w | X^n, Y^n) dw

Everything else is defined similarly. But in this case, E[Cn]E[Gn1]E[C_n] \neq E[G_{n-1}] And E[Wn]=E[Gn]+O(1/n)E[W_n] = E[G_n] + O(1/n)

For example:

  1. Regression problem f(yixi)f(y_i|x_i) for a fixed set {xi}\{x_i\} are studied. Cross validation cannot be employed.

  2. Consider the time series expressed by the relation

    Zt=a1Zt1+a2Zt2+a3Zt3+Gaussian NoiseZ_t = a_1 Z_{t-1} + a_2 Z_{t-2} + a_3 Z_{t-3} + \text{Gaussian Noise}

    . This can be understood as a regression problem

    (Zt1,Zt2,Zt3)=XtYt=Zt.(Z_{t-1}, Z_{t-2}, Z_{t-3}) = X_t \rightarrow Y_t = Z_t.

Thus XtX_t is dependent, so cross validation loss cannot be employed. WAIC, however, can be. It is superior.

Exercises

I now refer you to the first set of exercises.