These are solutions to selected exercises from Watanabe’s green book. The majority of them are from Chapter 1. Please refer to the PDF for the exercise statements; it is freely available online.
Problems

Problem 1

(a) Let $w_0 = (1, 1, \dots, 1) \in \mathbb{R}^{10}$, and let $W$ be a random variable on $\mathbb{R}^{10}$ which is subject to

$$
p(w) = c \{ \exp(- ||w||^2) + 100 \exp(-10 ||w - w_0||^2) \}
$$

To find the maximizer of $p(w)$, let $w = x w_0 + v$ with $x \in \mathbb{R}$ and $v$ orthogonal to $w_0$. Then,

$$
||w||^2 = ||w_0||^2 x^2 + ||v||^2 = 10 x^2 + ||v||^2
$$

and

$$
||w - w_0||^2 = ||(x-1) w_0 + v||^2 = 10(x-1)^2 + ||v||^2
$$

Thus,
$$
p(x, v) = c \{ \exp( - (10 x^2 + ||v||^2) ) + 100 \exp( -10 (10(x-1)^2 + ||v||^2) ) \}
$$

To maximize $p(x, v)$ we can treat $x$ and $v$ separately. Both terms decrease as $||v||^2$ grows, so we minimize $||v||^2 \Rightarrow v = 0$.
$$
\text{argmax } p(x, v) = (\text{argmax } p(x, 0), 0)
$$

$$
p(x, 0) = c \{ \exp(-10 x^2) + 100 \exp(-100(x-1)^2) \}
$$

By checking the derivative, $p(x, 0)$ is increasing for $x \leq 0$ and decreasing for $x \geq 1$, so the maximizer satisfies $x \in (0, 1)$.

The function is dominated by the second term: $p(1, 0)/c \approx 100$, whereas $p(0, 0)/c \approx 1$. Hence $x \approx 1$.

Then $w \approx w_0$.
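As a quick sanity check of part (a), the one-dimensional slice $p(x, 0)/c$ can be maximized numerically. This is only a sketch: the helper name `p_line`, the search interval $[-1, 2]$, and the grid step are illustrative choices, not part of the exercise.

```python
import math

def p_line(x):
    """Unnormalized slice p(x, 0) / c = exp(-10 x^2) + 100 exp(-100 (x - 1)^2)."""
    return math.exp(-10 * x * x) + 100 * math.exp(-100 * (x - 1) ** 2)

# Brute-force search on a uniform grid over [-1, 2] with step 1e-4
# (interval and step are arbitrary choices for illustration).
grid = [-1 + i * 1e-4 for i in range(30001)]
x_star = max(grid, key=p_line)

print("argmax on grid:", x_star)
print("p at argmax   :", p_line(x_star))
print("p at x = 0    :", p_line(0.0))
```

The grid maximizer lands next to $x = 1$, where the slice is roughly 100 times larger than at $x = 0$, in line with $w \approx w_0$.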
(b) $E[W] \approx 0$

$$
E[W] = \int_{\mathbb{R}^{10}} w \, p(w) \, dw
$$

We know that $\int_{\mathbb{R}^{10}} p(w) \, dw = 1$, so

$$
c \left[ \int_{\mathbb{R}^{10}} e^{-||w||^2} dw + 100 \int_{\mathbb{R}^{10}} e^{-10 ||w - w_0||^2} dw \right] = 1
$$

Remember:

$$
\int_{\mathbb{R}^d} \exp(-\alpha ||x - \mu||^2) \, dx = \left( \frac{\pi}{\alpha} \right)^{d/2}
$$

Then $c \left[ \pi^5 + 100 \cdot \left( \frac{\pi}{10} \right)^5 \right] = 1$.
$$
E[W] = c \left[ \underbrace{\int_{\mathbb{R}^{10}} w \exp(-||w||^2) \, dw}_{= 0 \text{ (odd integrand)}} + 100 \int_{\mathbb{R}^{10}} w \exp(-10 ||w - w_0||^2) \, dw \right]
$$

Substituting $u = w - w_0$ in the remaining integral leaves $w_0 \left( \frac{\pi}{10} \right)^5$ plus another odd integral that vanishes, so

$$
E[W] = c \left[ 100 \int_{\mathbb{R}^{10}} w \exp(-10 ||w - w_0||^2) \, dw \right] = c \left( 100 \cdot 10^{-5} \pi^5 w_0 \right) \approx 0.000999 \, w_0 \approx 0.
$$

Even though the pdf peaks at $w_0$, the volume under that peak is very small: the narrow component has variance $1/20$ per coordinate (against $1/2$ for the broad one) in all 10 dimensions, so it carries only about 0.1% of the total mass. Ah, the curse of dimensionality.
The takeaway from this is that the MAP estimate can be a pretty bad estimator for the right parameter. Notice that as we consider higher dimensions, the distance between the expected value and the MAP estimate (roughly $||w_0|| = \sqrt{d}$ in $\mathbb{R}^d$) increases to arbitrarily large values.
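The closed-form Gaussian integral makes it easy to check the number in part (b) and to see how the picture changes with dimension. The sketch below assumes the same two-component density with $w_0 = (1, \dots, 1) \in \mathbb{R}^d$ for general $d$; that generalization and the helper names `gaussian_integral` and `mean_coefficient` are illustrative assumptions, not part of the exercise.

```python
import math

def gaussian_integral(alpha, d):
    """Integral of exp(-alpha * ||x - mu||^2) over R^d, i.e. (pi / alpha)^(d / 2)."""
    return (math.pi / alpha) ** (d / 2)

def mean_coefficient(d):
    """Return t such that E[W] = t * w0 for the two-component density in R^d."""
    broad = gaussian_integral(1.0, d)           # mass of exp(-||w||^2)
    narrow = 100 * gaussian_integral(10.0, d)   # mass of 100 exp(-10 ||w - w0||^2)
    # The broad component has mean 0 and the narrow one has mean w0,
    # so E[W] is w0 weighted by the narrow component's share of the mass.
    return narrow / (broad + narrow)

for d in (2, 10, 50):
    t = mean_coefficient(d)
    gap = (1 - t) * math.sqrt(d)   # ||E[W] - w0|| with ||w0|| = sqrt(d)
    print(f"d={d:3d}  E[W] = {t:.6g} * w0   distance to w0 = {gap:.4f}")
```

For $d = 10$ the coefficient is $0.001 / 1.001 \approx 0.000999$, matching the hand computation. In low dimension (for example $d = 2$) the mean stays close to $w_0$, while for larger $d$ it collapses to the origin and the gap to the peak approaches $\sqrt{d}$.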
Problem 2 - Fluctuation Dissipation Theorem

Let $\beta > 0$, and $H(x) : \mathbb{R}^n \rightarrow \mathbb{R}$. Say $X \in \mathbb{R}^n$ is subject to the pdf

$$
p(x | \beta) = \frac{1}{Z(\beta)} \exp(-\beta H(x)) \quad \text{where } Z(\beta) = \int \exp(-\beta H(x)) \, dx
$$

We need to prove that

$$
\frac{\partial E[H(X)]}{\partial \beta} = - \mathbb{V}[H(X)]
$$
$$
E[H(X)] = \int \frac{1}{Z(\beta)} \exp(-\beta H(x)) H(x) \, dx = \frac{f(\beta)}{Z(\beta)}, \quad \text{where } f(\beta) = \int H(x) \exp(-\beta H(x)) \, dx
$$

Differentiating under the integral sign (assuming this is permitted),

$$
\frac{\partial Z(\beta)}{\partial \beta} = \int \frac{\partial}{\partial \beta} \exp(-\beta H(x)) \, dx = \int -H(x) \exp(-\beta H(x)) \, dx = - Z(\beta) E[H(X)]
$$

$$
\frac{\partial f(\beta)}{\partial \beta} = \int -H^2(x) \exp(-\beta H(x)) \, dx = - Z(\beta) E[H^2(X)]
$$

Thus, by the quotient rule applied to $f(\beta) / Z(\beta)$,

$$
\frac{\partial E[H(X)]}{\partial \beta} = \frac{- Z^2(\beta) E[H^2(X)] + Z^2(\beta) E^2[H(X)]}{Z^2(\beta)} = E^2[H(X)] - E[H^2(X)] = - \mathbb{V}[H(X)]
$$
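The identity can also be checked numerically in one dimension. The sketch below is an illustration under stated assumptions: the energy $H(x) = x^4 + x^2$, the value $\beta = 1.5$, the integration window $[-10, 10]$, and the helper name `moments` are all hypothetical choices; the integrals are approximated with a midpoint rule and the derivative with a central difference.

```python
import math

def moments(beta, H, lo=-10.0, hi=10.0, n=20000):
    """Return (E[H], V[H]) under p(x | beta) proportional to exp(-beta * H(x)).

    Integrals over [lo, hi] are approximated with a midpoint rule.
    """
    dx = (hi - lo) / n
    z = m1 = m2 = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        h = H(x)
        weight = math.exp(-beta * h)
        z += weight            # accumulates Z(beta) / dx
        m1 += h * weight       # accumulates the integral of H exp(-beta H)
        m2 += h * h * weight   # accumulates the integral of H^2 exp(-beta H)
    mean = m1 / z
    return mean, m2 / z - mean ** 2

def H(x):
    """Hypothetical example energy, chosen only for illustration."""
    return x ** 4 + x ** 2

beta, eps = 1.5, 1e-4
mean_plus, _ = moments(beta + eps, H)
mean_minus, _ = moments(beta - eps, H)
derivative = (mean_plus - mean_minus) / (2 * eps)   # central difference of E[H]
_, variance = moments(beta, H)

print("dE[H]/dbeta:", derivative)
print("-V[H]      :", -variance)
```

The two printed values should agree up to discretization and finite-difference error. For the Gaussian case $H(x) = x^2$ the same helper reproduces $E[H] = 1/(2\beta)$ and $\mathbb{V}[H] = 1/(2\beta^2)$.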
Problem 3

Let $p(x|a)$ be a statistical model of $x \in \{0, 1\}$ defined by

$$
p(x|a) = a^x (1-a)^{1-x} \quad \text{where } 0 \leq a \leq 1, \quad \varphi(a) = 1 \text{ is the prior}.
$$

Let $X^n$ be independently subject to $p(x|a_0)$.

Let $n_1 = \sum_{i=1}^n x_i, \quad n_2 = n - n_1$
Find the MLE.
$$
\log p(x_i | a) = x_i \log a + (1-x_i) \log(1-a)
$$

$$
f(a) = \sum_{i=1}^n \log p(x_i | a) = n_1 \log a + n_2 \log(1-a)
$$

Setting $f'(a) = 0$ works, but there is a better way: find the argmax of the likelihood directly,

$$
\prod_{i=1}^n a^{x_i} (1-a)^{1-x_i} = a^{n_1} (1-a)^{n-n_1}
$$

As $0 \leq a \leq 1$, we can use AM-GM to maximize (for $0 < n_1 < n$): take $n_1$ copies of $a / n_1$ and $n - n_1$ copies of $(1-a) / (n - n_1)$. Their sum is always 1, so their product is largest when all of them are equal. The argmax therefore satisfies:

$$
\frac{a}{n_1} = \frac{1-a}{n-n_1}
$$

$$
n a - n_1 a = n_1 - n_1 a \Rightarrow a = \frac{n_1}{n}
$$

Estimated probability distribution $p(x | \hat{a})$ (frequentist estimation)

$$
p(1 | \hat{a}) = \hat{a} = \frac{n_1}{n}, \quad p(0 | \hat{a}) = 1 - \hat{a} = \frac{n - n_1}{n} = \frac{n_2}{n}
$$

Bayesian predictive distribution $p(x | X^n)$
Let us calculate the posterior first.
$$
p(a | X^n) = \frac{p(X^n | a) \varphi(a)}{\int_0^1 p(X^n | a) \varphi(a) \, da} = \frac{p(X^n | a)}{\int_0^1 p(X^n | a) \, da}
$$

The second equality uses $\varphi(a) = 1$, and $p(X^n | a)$ can be calculated by multiplying the individual likelihoods. So,

$$
p(x | X^n) = \int_0^1 p(x | a) p(a | X^n) \, da = \frac{\int_0^1 p(x | a) p(X^n | a) \, da}{\int_0^1 p(X^n | a) \, da}
$$

Now,

$$
p(X^n | a) = \prod_{i=1}^n p(x_i | a) = a^{\sum x_i} (1-a)^{n - \sum x_i} = a^{n_1} (1-a)^{n_2}
$$

and

$$
\int_0^1 p(X^n | a) \, da = \int_0^1 a^{n_1} (1-a)^{n_2} \, da = \beta(n_1 + 1, n_2 + 1)
$$

where $\beta(\cdot, \cdot)$ denotes the Beta function.
$$
p(1 | X^n) = \frac{\int_0^1 a^{n_1+1} (1-a)^{n_2} \, da}{\beta(n_1+1, n_2+1)} = \frac{\beta(n_1+2, n_2+1)}{\beta(n_1+1, n_2+1)}
$$

$$
p(0 | X^n) = \frac{\beta(n_1+1, n_2+2)}{\beta(n_1+1, n_2+1)}
$$

Now, $\beta(s, t) = \frac{\Gamma(s) \Gamma(t)}{\Gamma(s+t)}$ and $\Gamma(z+1) = z \Gamma(z)$, so with $n_1 + n_2 = n$,

$$
\frac{\beta(n_1+2, n_2+1)}{\beta(n_1+1, n_2+1)} = \frac{\Gamma(n_1+2)}{\Gamma(n_1+1)} \cdot \frac{\Gamma(n+2)}{\Gamma(n+3)} = \frac{n_1+1}{n+2}
$$

Thus,

$$
p(1 | X^n) = \frac{n_1+1}{n+2}, \quad p(0 | X^n) = 1 - \frac{n_1+1}{n+2} = \frac{n_2+1}{n+2}
$$
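To see the two estimators side by side, the sketch below evaluates the plug-in estimate $n_1 / n$ and the Beta-function ratio for $p(1 | X^n)$ in exact rational arithmetic. The function names and the sample counts are hypothetical, and the Beta function is computed through factorials, which is valid only for positive integer arguments.

```python
from fractions import Fraction
from math import factorial

def beta_int(s, t):
    """Beta function for positive integers: (s-1)! (t-1)! / (s+t-1)!."""
    return Fraction(factorial(s - 1) * factorial(t - 1), factorial(s + t - 1))

def mle_predict(n1, n):
    """Plug-in estimate p(1 | a_hat) with a_hat = n1 / n."""
    return Fraction(n1, n)

def bayes_predict(n1, n):
    """Bayesian predictive p(1 | X^n) under the uniform prior, as a Beta ratio."""
    n2 = n - n1
    return beta_int(n1 + 2, n2 + 1) / beta_int(n1 + 1, n2 + 1)

# Hypothetical counts (n1 ones out of n samples), chosen for illustration.
for n1, n in [(0, 5), (3, 10), (70, 100)]:
    print(n1, n, mle_predict(n1, n), bayes_predict(n1, n), Fraction(n1 + 1, n + 2))
```

The last two columns coincide, confirming $(n_1 + 1) / (n + 2)$. The case $n_1 = 0$ shows the practical difference: the plug-in estimate assigns probability 0 to observing a 1, while the Bayesian predictive still assigns $1 / (n + 2)$.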