We start with Bayesian statistics. Watanabe’s theory generalizes classical results in Bayesian statistics, so it is important to understand the classical theory well before moving on. It also fixes the framework we will be working in, which makes it the first essential thing to master.
Machine learning models are primarily built within one of two frameworks (or a combination of them): frequentist and Bayesian.
The setup is that we have a true data-generating distribution $p_{\text{data}}(x)$ and a set of samples $X = \{x_1, \dots, x_n\}$ drawn from it. We take a statistical model $p_{\text{model}}(x, \theta)$ (a parametric family of probability distributions), which aims to estimate the true distribution.
The likelihood function of our statistical model is defined as $L(\theta) = \prod_{i=1}^{n} p_{\text{model}}(x_i, \theta)$.
The KL divergence, $D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$, is the main measure that we will use to compare probability distributions (it is not really a metric: it is not even symmetric).
It is easy to see that finding the optimal $\theta$ (the maximum likelihood estimator) is equivalent to minimizing the KL divergence from the empirical distribution of the samples to our statistical model, viewed as a function of $\theta$. In practice, a locally optimal parameter is usually approximated via (stochastic) gradient descent. This is also the case for neural networks, which are essentially function approximators: we use SGD to approximate a locally optimal parameter vector.
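The equivalence between maximum likelihood and KL minimization can be checked numerically. The following is a minimal sketch, assuming a Bernoulli model, a small hypothetical 0/1 sample, and a plain grid search in place of gradient descent; none of these choices come from the text above.

```python
import math

def log_likelihood(theta, data):
    """Log-likelihood of i.i.d. Bernoulli(theta) for 0/1 data."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

def kl_empirical_to_model(theta, data):
    """KL divergence from the empirical distribution of the data to Bernoulli(theta)."""
    p1 = sum(data) / len(data)
    kl = 0.0
    for p_hat, p_model in ((p1, theta), (1.0 - p1, 1.0 - theta)):
        if p_hat > 0.0:  # convention: 0 * log(0 / q) = 0
            kl += p_hat * math.log(p_hat / p_model)
    return kl

# Hypothetical sample (an assumption for illustration): 3 ones out of 10
data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Grid search over theta stands in for (stochastic) gradient descent
grid = [i / 100 for i in range(1, 100)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))
theta_kl = min(grid, key=lambda t: kl_empirical_to_model(t, data))

print(theta_mle, theta_kl)
```

Both searches pick the same grid point because the average log-likelihood differs from the negative KL divergence only by the entropy of the empirical distribution, which does not depend on $\theta$.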
We will not delve further into the frequentist approach here (you may refer to Goodfellow et al.) and move on to the Bayesian approach. An important distinction is that when we refer to neural networks from now on, we do not mean standard neural networks trained with SGD. Still, many insights from this approach carry over to standard networks.
In the Bayesian approach, instead of considering just the optimal parameter, we consider a probability distribution over the parameter space itself. Initially this is called the prior, and as we observe data from the true distribution, we successively update the prior to obtain a posterior, which estimates, over the entire parameter space, which parameters generate the true distribution.
Specifically, we consider an appropriate prior $\varphi(w)$ and a statistical model $p(x \mid w)$. These are chosen by us, and this choice often determines what estimate our Bayesian method will give us. We assume there is a true data-generating distribution $q(x)$, from which we draw $n$ samples independently, $\{x_1, \dots, x_n\}$. This sample induces a function $p(w \mid x_1, \dots, x_n)$, which is the update of our prior. This in turn induces $p(x \mid x_1, \dots, x_n)$, which is our estimate of the true distribution. The process continues as we draw more samples. As can be seen, this is more computationally intensive. However, the approach is superior in many cases; we will see a specific example later on. We will now define everything mathematically. To summarize, here is the procedure:
Construct the universe and the mathematical laws between Bayesian observables, which hold for an arbitrary true distribution, statistical model, and prior.
Evaluate how appropriate the statistical model and the prior are using these laws.
But we know neither the statistical model nor the prior. A meaningful approach is therefore to start with something, evaluate how good it is, and then update it. The evaluation is done through the mathematical laws described above.
This gives rise to the estimated pdf of x, called the predictive distribution:
Expected: $\hat{p}(x) \approx q(x)$ if the pair $(p(x \mid w), \varphi(w))$ is appropriate for $q(x)$. We want to evaluate the appropriateness of this pair without information about $q(x)$, and we now develop the machinery for that.
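Let $W \subset \mathbb{R}^d$. Let $X^n = (X_1, \dots, X_n)$ be independent random variables distributed according to $q(x)$. For an arbitrary pair $(p(x \mid w), \varphi(w))$, the posterior probability density is defined by
$$p(w \mid X^n) = \frac{1}{Z(X^n)}\, \varphi(w) \prod_{i=1}^{n} p(X_i \mid w),$$
where the normalizing constant is
$$Z(X^n) = \int_W \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)\, dw,$$
which is called the partition function/marginal likelihood/evidence.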
The expected value over the posterior distribution is denoted $E_w[\cdot]$.
Note that $E_w[f(w)] = \frac{1}{Z(X^n)} \int f(w)\, \varphi(w) \prod_{i=1}^{n} p(X_i \mid w)\, dw$.
This expected value is itself a random variable, since it depends on $X^n$.
(More precisely, it is an expectation with respect to a conditional probability density, and hence a random variable.)
The posterior gives rise to the predictive density function:
$$p(x \mid X^n) \overset{\text{def}}{=} E_w[p(x \mid w)] = \int p(x \mid w)\, p(w \mid X^n)\, dw$$
(estimate w from Xn, estimate x from w, vary over all w)
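The posterior, the partition function, posterior expectations, and the predictive density can all be approximated on a grid when $w$ is one-dimensional. The sketch below assumes a model $p(x \mid w) = N(w, 1)$, a prior $\varphi(w) = N(0, 1)$, and a small hypothetical sample. These choices are illustrative only and were picked because closed forms exist to compare against.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def model(x, w):
    """p(x | w): assumed to be N(w, 1) for this sketch."""
    return normal_pdf(x, w, 1.0)

def prior(w):
    """phi(w): assumed to be N(0, 1) for this sketch."""
    return normal_pdf(w, 0.0, 1.0)

# Hypothetical sample X^n
data = [0.8, 1.3, -0.2, 1.9, 0.6]
n = len(data)

# Grid over the parameter space W, here truncated to [-10, 10]
step = 0.001
grid = [-10.0 + step * j for j in range(20001)]

# Unnormalized posterior phi(w) * prod_i p(X_i | w) evaluated on the grid
unnorm = [prior(w) * math.prod(model(x, w) for x in data) for w in grid]

# Partition function Z(X^n) as a Riemann sum
Z = sum(unnorm) * step

def posterior_expectation(f):
    """E_w[f(w)] = (1/Z) * integral of f(w) phi(w) prod_i p(X_i | w) dw."""
    return sum(f(w) * u for w, u in zip(grid, unnorm)) * step / Z

def predictive(x):
    """p(x | X^n) = E_w[p(x | w)]."""
    return posterior_expectation(lambda w: model(x, w))

post_mean = posterior_expectation(lambda w: w)
print(Z, post_mean, predictive(0.5))
```

For this conjugate pair the posterior mean is $\sum_i X_i / (n + 1)$ and the predictive density is normal with that mean and variance $1 + 1/(n+1)$, so the grid values can be checked against those expressions.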
If $\int \varphi(w)\, dw < \infty$, the prior is called proper, since it can then be normalized so that $\int \varphi(w)\, dw = 1$. Even for an improper prior, the posterior and predictive probability densities can be defined as long as $Z(X^n)$ is finite and well defined.
In many simple statistical models, the posterior converges to a normal distribution as $n \to \infty$. We see such a case in the example referred to below. However, this fails in many other models, including some simple ones. This failure is the key problem that motivates the new theory.
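The convergence to a normal distribution can be observed numerically in a regular model. The sketch below assumes a Bernoulli model with a uniform prior and a sample in which 30% of the observations are ones (all illustrative assumptions). It measures the $L^1$ distance on $(0, 1)$ between the exact Beta posterior and a normal density centered at the maximum likelihood estimate.

```python
import math

def beta_log_pdf(theta, a, b):
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def l1_distance_to_normal(n, frac=0.3, m=20000):
    """L1 distance on (0, 1) between the posterior and its normal approximation."""
    k = round(frac * n)              # number of ones in the assumed sample
    a, b = k + 1, n - k + 1          # posterior is Beta(k + 1, n - k + 1) under a uniform prior
    mean = k / n                     # maximum likelihood estimate
    var = mean * (1 - mean) / n      # inverse Fisher information divided by n
    total = 0.0
    for j in range(m):
        theta = (j + 0.5) / m        # midpoint rule avoids log(0) at the endpoints
        post = math.exp(beta_log_pdf(theta, a, b))
        total += abs(post - normal_pdf(theta, mean, var))
    return total / m

distances = {n: l1_distance_to_normal(n) for n in (10, 100, 1000)}
print(distances)
```

In this setting the distance shrinks as $n$ grows. The sketch says nothing about the singular models where the approximation breaks down.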
At this point, I highly recommend referring to this example:
We are now going to prove the formulae given in the example.
where u is a real-valued function (and the other two are vector-valued), then this distribution is said to belong to the exponential family. Furthermore, if the distribution of the parameter θ∈Θ depends on some hyperparameter ϕ, and can be written as
Let us compute $Z(x_1, \dots, x_n)$, for which we need to integrate the numerator with respect to $\theta$. Here we use a nice trick: the integral of the prior with respect to $\theta$ is 1, regardless of the value of $\phi$. So set $\phi = \phi_n$; the first term then integrates to 1, while the second term is a scalar independent of $\theta$.
One may notice that we are using a different formula for the predictive density, bypassing the integral definition. It follows directly from applying Bayes' rule to the given definition (check it yourself), and in some cases it is computationally more convenient.
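A concrete conjugate pair makes both points easy to verify. The sketch below uses a Bernoulli model with a Beta$(a, b)$ prior, chosen here purely for illustration rather than taken from the referenced example. It also assumes that the alternative predictive formula is the ratio $Z(X^n, x) / Z(X^n)$, which is what Bayes' rule gives. Because the unnormalized posterior is again a Beta density, $Z$ is just a ratio of two normalizing constants.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_Z(data, a, b):
    """Log marginal likelihood of 0/1 data for a Bernoulli model with a Beta(a, b) prior."""
    k, n = sum(data), len(data)
    # The integrand is an unnormalized Beta(a + k, b + n - k) density,
    # so Z = B(a + k, b + n - k) / B(a, b).
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def predictive_ratio(x, data, a, b):
    """p(x | X^n) computed as Z(X^n, x) / Z(X^n), with no integral over theta."""
    return math.exp(log_Z(data + [x], a, b) - log_Z(data, a, b))

def predictive_integral(x, data, a, b, m=20000):
    """The same quantity from the integral definition, using a midpoint rule."""
    k, n = sum(data), len(data)
    num = den = 0.0
    for j in range(m):
        t = (j + 0.5) / m
        post = t ** (a + k - 1) * (1 - t) ** (b + n - k - 1)  # unnormalized posterior
        num += (t if x == 1 else 1 - t) * post
        den += post
    return num / den

# Hypothetical sample and hyperparameters
data = [1, 0, 1, 1, 0, 0, 1, 1]
a, b = 2.0, 3.0

print(predictive_ratio(1, data, a, b), predictive_integral(1, data, a, b))
```

For this pair the ratio reduces to $(a + k) / (a + b + n)$, where $k$ is the number of ones, so both routes can be checked against a closed form.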
For the example given at the start of the section, it is just a matter of inputting numbers into the formulae.
To evaluate how accurate the predictive density is, we need an objective measure of the difference between the true and the estimated probability density.
Let Xn be a sample taken independently from q(x) and p(x∣Xn) be a predictive density using a statistical model p(x∣w) and a prior φ(w). We are going to make two definitions:
The statement does not say what the expectation is taken over, but the proof makes this clear. In any case, it is the canonical, most standard choice.
By Cauchy-Schwarz. Equality holds iff $p(X_i \mid w)^{1/2} \propto p(X_i \mid w)^{-1/2}$ as a function of $w$, which implies that $p(X_i \mid w)$ is a constant function of $w$.
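For completeness, the Cauchy-Schwarz step can be spelled out as follows. This is a sketch that assumes the inequality being proved compares the posterior averages of $p(X_i \mid w)$ and $1 / p(X_i \mid w)$, which is what the equality condition suggests.

```latex
1 = \left( E_w\!\left[ p(X_i \mid w)^{1/2} \, p(X_i \mid w)^{-1/2} \right] \right)^{2}
  \le E_w\!\left[ p(X_i \mid w) \right] \, E_w\!\left[ \frac{1}{p(X_i \mid w)} \right]
\quad \Longrightarrow \quad
\log E_w\!\left[ \frac{1}{p(X_i \mid w)} \right] \ge -\log E_w\!\left[ p(X_i \mid w) \right]
```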
We now introduce another measure, which is often better than the cross-validation loss. There are also many cases where WAIC can be employed but cross-validation cannot.
Def: Assume n≥1. Let Xn be a set of random variables. The widely applicable information criterion (WAIC) is defined by
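As a hedged illustration, the sketch below computes WAIC in the form usually given in Watanabe's work, $W_n = T_n + \frac{1}{n} \sum_{i=1}^{n} V_w[\log p(X_i \mid w)]$, where $T_n = -\frac{1}{n} \sum_i \log E_w[p(X_i \mid w)]$ is the training loss and $V_w$ is the posterior variance. The same loop also computes the importance-sampling form of the cross-validation loss, $C_n = \frac{1}{n} \sum_i \log E_w[1 / p(X_i \mid w)]$. Both formulas, the Bernoulli model, the uniform prior, and the sample are assumptions made for this example.

```python
import math

# Hypothetical 0/1 sample
data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
n, k = len(data), sum(data)

# Normalized posterior weights on a midpoint grid, for a uniform prior on theta
m = 5000
grid = [(j + 0.5) / m for j in range(m)]
weights = [t ** k * (1 - t) ** (n - k) for t in grid]
total = sum(weights)
weights = [u / total for u in weights]

def post_mean(f):
    """Posterior expectation E_w[f(w)] approximated on the grid."""
    return sum(f(t) * u for t, u in zip(grid, weights))

def lik(x, t):
    """p(x | theta) for the Bernoulli model."""
    return t if x == 1 else 1 - t

T = C = V = 0.0
for x in data:
    mean_p = post_mean(lambda t: lik(x, t))
    mean_inv_p = post_mean(lambda t: 1 / lik(x, t))
    mean_log = post_mean(lambda t: math.log(lik(x, t)))
    mean_log_sq = post_mean(lambda t: math.log(lik(x, t)) ** 2)
    T += -math.log(mean_p) / n                # training loss T_n
    C += math.log(mean_inv_p) / n             # cross-validation loss C_n
    V += (mean_log_sq - mean_log ** 2) / n    # averaged posterior variance of log p

waic = T + V
print(T, C, waic)
```

In this small regular example WAIC and the cross-validation loss nearly coincide, and both exceed the training loss.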
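Note that we quietly use Fubini's theorem to exchange the order of integration: $\int Z(x^n)\, dx^n = \int \varphi(w) \prod_{i=1}^{n} \left( \int p(x_i \mid w)\, dx_i \right) dw = \int \varphi(w)\, dw$.
Thus $\int Z(x^n)\, dx^n = 1$ if the prior is proper.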
$Z(X^n)$ can thus be understood as an estimated pdf of $X^n$ using a statistical model $p(x \mid w)$ and a prior $\varphi(w)$, and it is therefore sometimes written as $p(X^n)$. How well it estimates $q(X^n)$ can be measured by the KL divergence.
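In a discrete model the claim that $Z$ is a probability distribution over samples can be checked by brute force, with a sum over all possible samples replacing the integral. The sketch below assumes a Bernoulli model with a Beta$(a, b)$ prior and, for the KL divergence, a true distribution of fair coin flips; all of these are illustrative choices.

```python
import math
from itertools import product

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def Z(xs, a=1.0, b=1.0):
    """Marginal likelihood of a 0/1 sequence for a Bernoulli model with a Beta(a, b) prior."""
    k, n = sum(xs), len(xs)
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

n = 8
samples = list(product((0, 1), repeat=n))   # all 2^n possible samples x^n

# Summing Z over every possible sample gives 1 for a proper prior
total_uniform = sum(Z(xs) for xs in samples)
total_beta = sum(Z(xs, a=2.0, b=3.0) for xs in samples)

# KL divergence from q(x^n) = 2^-n (assumed true distribution) to Z(x^n)
q = 0.5 ** n
kl = sum(q * math.log(q / Z(xs)) for xs in samples)

print(total_uniform, total_beta, kl)
```

Both totals equal 1 up to rounding, and the KL divergence is strictly positive because $Z$ does not coincide with $q$ at finite $n$.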
Definition: The free energy, or the minus log marginal likelihood, is defined by $F_n = -\log Z(X^n)$.
Remark: As Fn=−logZ(Xn), the correspondence between free energy and marginal likelihood is one to one.
However, in general the asymptotic order of the marginal likelihood as a random variable is not equal to that of its average, whereas for the free energy it is. We have not proved this yet. Thus, for asymptotic statistics, the free energy is the more convenient random variable.
We can illustrate the failure for the former. Define the marginal likelihood ratio $r(X^n) = \frac{Z(X^n)}{q(X^n)}$. Then $E[r(X^n)] = 1$. However, it is a known result that $r(X^n) \to 0$ in probability.
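Both facts can be seen exactly in a toy setting. The sketch below assumes $X_i \sim$ Bernoulli$(1/2)$, a Bernoulli$(\theta)$ model, and a uniform prior on $\theta$ (illustrative choices). In that case $Z(x^n) = 1 / ((n+1)\binom{n}{k})$, where $k$ is the number of ones, so the ratio depends on the sample only through $k$ and everything can be computed with exact rational arithmetic.

```python
from fractions import Fraction
from math import comb

def ratio_summary(n, eps):
    """Exact E[r] and P(r > eps) for r = Z(X^n) / q(X^n) in the assumed toy setting."""
    mean = Fraction(0)
    tail = Fraction(0)
    for k in range(n + 1):
        prob_k = Fraction(comb(n, k), 2 ** n)          # P(k ones) under q
        r_k = Fraction(2 ** n, (n + 1) * comb(n, k))   # Z / q for any sample with k ones
        mean += prob_k * r_k
        if r_k > eps:
            tail += prob_k
    return mean, float(tail)

results = {n: ratio_summary(n, Fraction(1, 2)) for n in (10, 100, 1000)}
for n, (mean, tail) in results.items():
    print(n, mean, tail)
```

The mean is exactly 1 for every $n$, while the probability that $r$ exceeds the threshold shrinks as $n$ grows. The average is held up by rare samples on which $r$ is very large.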