PRML Chapter 5 Neural Networks

5.1 Feed-forward Network Functions

A network with one hidden layer takes the form:

$$y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$
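The forward pass of such a network can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `forward`, the choice of `tanh` for the hidden activation, the identity output activation, and the toy weights are all assumptions made for the example.

```python
import math

def forward(x, W1, b1, W2, b2, h=math.tanh, sigma=lambda a: a):
    """Two-layer network: y_k = sigma(sum_j W2[k][j] * h(a_j) + b2[k])."""
    # Hidden units: a_j = sum_i W1[j][i] * x[i] + b1[j], then z_j = h(a_j)
    hidden = [h(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    # Output units: a_k = sum_j W2[k][j] * z_j + b2[k], then y_k = sigma(a_k)
    return [sigma(sum(w * z for w, z in zip(row, hidden)) + b)
            for row, b in zip(W2, b2)]

# Toy weights (arbitrary, for illustration): D=2 inputs, M=2 hidden units, K=1 output
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0]]
b2 = [0.5]

y = forward([0.0, 0.0], W1, b1, W2, b2)  # tanh(0) = 0, so only the output bias remains
```

Passing a logistic sigmoid or softmax as `sigma` would give the classification variants discussed later.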

5.1.1 Weight-space symmetries

5.2 Network Training

The error function that we need to minimize is:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2$$

For the regression problem, assume that the target $t$ has a Gaussian distribution whose mean is the output of the network:

$$p(t | x, w) = \mathcal{N}(t | y(x, w), \beta^{-1})$$

The corresponding likelihood function will be

$$p(\mathbf{t} | \mathbf{X}, w, \beta) = \prod_{n=1}^{N} p(t_n | x_n, w, \beta)$$

Taking the negative logarithm, we obtain the error function:

$$\frac{\beta}{2}\sum_{n=1}^{N}\{ y(x_n, w) - t_n \}^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi)$$

from which we can learn $w$ and $\beta$. Maximizing the likelihood function with respect to $w$ is equivalent to minimizing the sum-of-squares error function given by:

$$E(w) = \frac{1}{2}\sum_{n=1}^{N}\{ y(x_n, w) - t_n \}^2$$

Having found $w_{ML}$, the value of $\beta$ can be found by minimizing the negative log likelihood with respect to $\beta$, which gives:

$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{ y(x_n, w_{ML}) - t_n \}^2$$
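As a small numerical illustration, the sum-of-squares error and the maximum-likelihood noise precision can be computed directly from a list of predictions and targets. The helper names and the toy numbers below are assumptions for the example; the predictions stand in for the network outputs at the maximum-likelihood weights.

```python
def sum_of_squares_error(preds, targets):
    # E(w) = 1/2 * sum_n (y_n - t_n)^2
    return 0.5 * sum((y - t) ** 2 for y, t in zip(preds, targets))

def beta_ml(preds, targets):
    # 1 / beta_ML is the mean squared residual at the fitted weights
    mean_sq_residual = sum((y - t) ** 2 for y, t in zip(preds, targets)) / len(targets)
    return 1.0 / mean_sq_residual

# Toy predictions standing in for y(x_n, w_ML), and their targets
preds = [1.0, 2.0, 3.0]
targets = [1.5, 1.5, 3.0]

E = sum_of_squares_error(preds, targets)  # 0.5 * (0.25 + 0.25 + 0.0) = 0.25
beta = beta_ml(preds, targets)            # 1 / (0.5 / 3) = 6.0
```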

In the regression case, we can view the network as having an output activation function that is the identity, so that $y_k = a_k$. The corresponding sum-of-squares error function has the property:

$$\frac{\partial E}{\partial a_k} = y_k - t_k$$

In the case of binary classification, the conditional distribution of targets is a Bernoulli distribution:

$$p(t | x, w) = y(x, w)^t \{ 1 - y(x, w) \}^{1-t}$$

Taking the negative log likelihood gives a cross-entropy error function, where $y_n$ denotes $y(x_n, w)$:

$$E(w) = -\sum_{n=1}^{N}\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}$$

For the multiclass classification problem with 1-of-$K$ coded targets $t_{nk}$, the error function is:

$$E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk} \ln y_k(x_n, w)$$
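Both cross-entropy errors are straightforward to evaluate once the network outputs are available. The sketch below uses hypothetical helper names and also checks numerically, for a single logistic output, that the derivative of the error with respect to the output activation is again the output minus the target, as in the regression case.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def binary_cross_entropy(ys, ts):
    # E = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    return -sum(t * math.log(y) + (1 - t) * math.log(1 - y) for y, t in zip(ys, ts))

def multiclass_cross_entropy(Y, T):
    # E = -sum_n sum_k t_nk ln y_nk; terms with t_nk = 0 contribute nothing
    return -sum(t * math.log(y)
                for y_row, t_row in zip(Y, T)
                for y, t in zip(y_row, t_row) if t > 0)

# Central-difference check of dE/da = y - t for one logistic output
a, t, eps = 0.3, 1.0, 1e-6
numeric = (binary_cross_entropy([sigmoid(a + eps)], [t])
           - binary_cross_entropy([sigmoid(a - eps)], [t])) / (2 * eps)
analytic = sigmoid(a) - t

# One pattern with three classes, true class is the second one
E_multi = multiclass_cross_entropy([[0.2, 0.5, 0.3]], [[0, 1, 0]])  # equals -ln 0.5
```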

5.2.1 Parameter optimization

We choose some initial value $w^{(0)}$ for the weight vector and then move through weight space in a succession of steps:

$$w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}$$

5.2.2 Local quadratic approximation

Consider the Taylor expansion of $E(w)$ around some point $\hat{w}$ in weight space:

$$E(w) \simeq E(\hat{w}) + (w - \hat{w})^T b + \frac{1}{2}(w - \hat{w})^T H (w - \hat{w})$$

where $b$ is defined to be the gradient of $E$ evaluated at $\hat{w}$, $b \equiv \nabla E|_{w=\hat{w}}$, and $H = \nabla\nabla E$ is the Hessian matrix evaluated at the same point.

When $w^*$ is a minimum of the error function, the linear term vanishes because $\nabla E = 0$ at $w^*$, and we have:

$$E(w) = E(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$$

5.2.3 Use of gradient information

5.2.4 Gradient descent optimization

The simplest way to use gradient information is to take a small step in the direction of the negative gradient, where $\eta > 0$ is the learning rate:

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})$$

On-line gradient descent (also known as sequential or stochastic gradient descent) updates the weights using one data point at a time:

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n(w^{(\tau)})$$
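The two update rules can be compared on a deliberately tiny problem. The sketch below assumes a one-parameter linear model $y = w x$ with a sum-of-squares error; the model, the data, the learning rate, and the function names are all illustrative choices, not part of the original notes.

```python
def batch_gradient(w, xs, ts):
    # Gradient of E(w) = 1/2 * sum_n (w * x_n - t_n)^2 with respect to the scalar w
    return sum((w * x - t) * x for x, t in zip(xs, ts))

def gradient_descent(w, xs, ts, eta, steps):
    # Batch rule: every step uses the gradient of the full error E
    for _ in range(steps):
        w = w - eta * batch_gradient(w, xs, ts)
    return w

def sgd(w, xs, ts, eta, epochs):
    # On-line rule: one update per data point, using the gradient of E_n only
    for _ in range(epochs):
        for x, t in zip(xs, ts):
            w = w - eta * (w * x - t) * x
    return w

# Toy data generated by t = 2 x, so the minimum of E is at w = 2
xs, ts = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w_batch = gradient_descent(0.0, xs, ts, eta=0.05, steps=100)
w_sgd = sgd(0.0, xs, ts, eta=0.05, epochs=100)
```

Here the data points are visited in a fixed order to keep the run reproducible; in practice they are often drawn at random.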

5.3 Error Backpropagation

5.3.1 Evaluation of error-function derivatives

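We now derive the backpropagation algorithm for a general network. As motivation, consider first a simple linear model $y_k = \sum_i w_{ki} x_i$ with a sum-of-squares error for pattern $n$. Writing $y_{nk} = y_k(x_n, w)$, its gradient is the product of an output error and an input:

$$E_n = \frac{1}{2}\sum_k (y_{nk} - t_{nk})^2 \ \rightarrow\ \frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni}$$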

In a general feed-forward network, each unit computes a weighted sum of its inputs, and the sum is transformed by a nonlinear activation function:

$$a_j = \sum_i w_{ji} z_i, \quad z_j = h(a_j)$$

Applying the chain rule, and defining the error $\delta_j \equiv \partial E_n / \partial a_j$, we obtain:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i$$

For hidden units we obtain the backpropagation formula, in which the sum runs over all units $k$ to which unit $j$ sends connections:

$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j} = h'(a_j)\sum_k w_{kj}\delta_k$$

For all data points, we have:

$$E(w) = \sum_{n=1}^{N} E_n(w), \quad \frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}$$
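The backpropagation equations can be checked on a small network by comparing them with central finite differences. The sketch below assumes a two-layer network with `tanh` hidden units, linear outputs, a sum-of-squares error, and no biases; these choices, the function names, and the toy weights are assumptions made to keep the example short.

```python
import math

def forward(x, W1, W2):
    # Hidden units: z_j = tanh(a_j) with a_j = sum_i W1[j][i] * x_i
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # Linear output units: y_k = sum_j W2[k][j] * z_j
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return z, y

def error(x, t, W1, W2):
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop(x, t, W1, W2):
    z, y = forward(x, W1, W2)
    # Output errors: delta_k = y_k - t_k
    delta_out = [yk - tk for yk, tk in zip(y, t)]
    # Hidden errors: delta_j = h'(a_j) * sum_k w_kj * delta_k, with h'(a_j) = 1 - z_j^2
    delta_hidden = [(1 - z[j] ** 2) * sum(W2[k][j] * delta_out[k] for k in range(len(W2)))
                    for j in range(len(W1))]
    # Each derivative is delta at the output end of the weight times z at the input end
    grad_W1 = [[dj * xi for xi in x] for dj in delta_hidden]
    grad_W2 = [[dk * zj for zj in z] for dk in delta_out]
    return grad_W1, grad_W2

x, t = [0.5, -1.0], [0.3]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.7, -0.5]]
grad_W1, grad_W2 = backprop(x, t, W1, W2)

# Central-difference estimate of dE_n/dw for the first-layer weight W1[0][1]
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (error(x, t, W1_plus, W2) - error(x, t, W1_minus, W2)) / (2 * eps)
```

Comparing `numeric` with `grad_W1[0][1]` is a common sanity check when implementing backpropagation by hand.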

5.3.2 A simple example

5.3.3 Efficiency of backpropagation

5.3.4 The Jacobian matrix

Consider the evaluation of the Jacobian matrix, whose elements are $J_{ki} = \partial y_k / \partial x_i$. First we write down the backpropagation formula that determines the derivatives $\frac{\partial y_k}{\partial a_j}$:

$$\frac{\partial y_k}{\partial a_j} = \sum_l \frac{\partial y_k}{\partial a_l}\frac{\partial a_l}{\partial a_j} = h'(a_j)\sum_l w_{lj}\frac{\partial y_k}{\partial a_l}$$

If we have individual sigmoidal activation functions at each output unit, then

$$\frac{\partial y_k}{\partial a_l} = \delta_{kl}\,\sigma'(a_l)$$

whereas for softmax outputs we have:

$$\frac{\partial y_k}{\partial a_l} = \delta_{kl} y_k - y_k y_l$$

Finally, we can calculate the elements of the Jacobian matrix, where the sum runs over all units $j$ to which input $i$ sends connections:

$$J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j w_{ji}\frac{\partial y_k}{\partial a_j}$$
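For a concrete case, the Jacobian of a two-layer network can be computed with these formulas and compared against finite differences. The sketch assumes `tanh` hidden units, linear outputs, and no biases; the function names and toy weights are hypothetical.

```python
import math

def network(x, W1, W2):
    # z_j = tanh(sum_i W1[j][i] * x_i), y_k = sum_j W2[k][j] * z_j
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return y, z

def jacobian(x, W1, W2):
    _, z = network(x, W1, W2)
    # With linear outputs, dy_k/da_j = W2[k][j] * h'(a_j), where h'(a_j) = 1 - z_j^2
    dy_da = [[W2[k][j] * (1 - z[j] ** 2) for j in range(len(W1))]
             for k in range(len(W2))]
    # J_ki = sum_j w_ji * dy_k/da_j
    return [[sum(W1[j][i] * dy_da[k][j] for j in range(len(W1)))
             for i in range(len(x))]
            for k in range(len(W2))]

x = [0.5, -1.0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.7, -0.5], [0.2, 0.9]]
J = jacobian(x, W1, W2)

# Central-difference estimate of J[1][0] = dy_1/dx_0
eps = 1e-6
y_plus, _ = network([x[0] + eps, x[1]], W1, W2)
y_minus, _ = network([x[0] - eps, x[1]], W1, W2)
numeric_J10 = (y_plus[1] - y_minus[1]) / (2 * eps)
```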

5.4 The Hessian Matrix

5.4.1 Diagonal approximation

The diagonal elements of the Hessian can be written:

$$\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2} z_i^2$$

Applying the chain rule recursively gives a backpropagation equation of the form:

$$\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j} \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}$$

5.4.2 Outer product approximation

For a sum-of-squares error with a single output, write the Hessian matrix in the form:

$$H = \nabla\nabla E = \sum_{n=1}^{N} \nabla y_n (\nabla y_n)^T + \sum_{n=1}^{N} (y_n - t_n)\nabla\nabla y_n$$

By neglecting the second term, we get the Levenberg-Marquardt approximation (outer product approximation)

$$H \simeq \sum_{n=1}^{N} b_n b_n^T$$

where $b_n = \nabla y_n = \nabla a_n$.

In the case of the cross-entropy error function for a network with logistic sigmoid output-unit activation functions, the corresponding approximation is given by:

$$H \simeq \sum_{n=1}^{N} y_n (1 - y_n) b_n b_n^T$$
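Both forms of the outer product approximation reduce to a weighted sum of outer products of per-pattern gradient vectors. The sketch below assumes the vectors $b_n$ have already been obtained (for example by backpropagation); the function name, the optional `weights` argument, and the toy numbers are illustrative.

```python
def outer_product_hessian(grads, weights=None):
    """Approximate H by sum_n c_n * b_n b_n^T, with c_n = 1 when no weights are given."""
    if weights is None:
        weights = [1.0] * len(grads)
    dim = len(grads[0])
    H = [[0.0] * dim for _ in range(dim)]
    for b, c in zip(grads, weights):
        for i in range(dim):
            for j in range(dim):
                H[i][j] += c * b[i] * b[j]
    return H

# Two toy gradient vectors b_1 and b_2
b = [[1.0, 2.0], [3.0, 4.0]]

# Sum-of-squares case: H ~ sum_n b_n b_n^T = [[10, 14], [14, 20]]
H_squares = outer_product_hessian(b)

# Cross-entropy case with logistic outputs: each term is weighted by y_n (1 - y_n)
ys = [0.5, 0.5]
H_xent = outer_product_hessian(b, [y * (1 - y) for y in ys])
```

Only first derivatives are needed, which is what makes this approximation cheap compared with evaluating the exact Hessian.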

5.5 Regularization in Neural Networks

To control the complexity of a neural network, the simplest regularizer is the quadratic, giving a regularized error:

$$\tilde{E}(w) = E(w) + \frac{\lambda}{2} w^T w$$
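In a gradient-based optimizer, the quadratic regularizer simply adds a term proportional to the weights to the gradient, which is why it is often called weight decay. A minimal sketch of one such step follows, with the function name and the toy values being assumptions for the example.

```python
def regularized_step(w, grad_E, eta, lam):
    # The gradient of the regularized error is grad E + lam * w, so each weight
    # shrinks towards zero in addition to following the data gradient
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad_E)]

# With a zero data gradient, the step only decays the weights by the factor (1 - eta * lam)
w_new = regularized_step([1.0, -2.0], [0.0, 0.0], eta=0.1, lam=0.5)
```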

5.5.1 Consistent Gaussian priors

A regularizer that is invariant under linear transformations of the inputs and outputs is given by the following, where $W_1$ and $W_2$ denote the sets of weights in the first and second layers, with biases excluded:

$$\frac{\lambda_1}{2}\sum_{w \in W_1} w^2 + \frac{\lambda_2}{2}\sum_{w \in W_2} w^2$$

5.5.2 Early stopping

Training can be stopped at the point of smallest error with respect to the validation set in order to obtain a network having good generalization performance.
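One common way to implement this is to track the best validation error seen so far and stop once it has not improved for a fixed number of epochs. The sketch below works on a precomputed list of validation errors; the `patience` parameter and the toy error curve are assumptions for illustration, not something prescribed by the text.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (epoch, error) of the best validation error seen before stopping."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_error

# Toy validation curve: decreases, then rises as the network starts to over-fit
val = [1.0, 0.6, 0.4, 0.35, 0.37, 0.41, 0.5, 0.2]
stop = early_stopping_epoch(val, patience=3)  # → (3, 0.35)
```

In a real training loop the weights at the best epoch would be saved and restored. Note that the late value 0.2 is never reached here, which shows the trade-off involved in choosing `patience`.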

5.5.3 Invariances

5.5.4 Tangent propagation

We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation. Consider a continuous transformation governed by a single parameter $\epsilon$. Let the vector that results from acting on $x_n$ by this transformation be denoted by $s(x_n, \epsilon)$, defined so that $s(x_n, 0) = x_n$. Then the tangent to the curve $M$ swept out by the transformation is given by the directional derivative $\tau = \frac{\partial s}{\partial \epsilon}$, and the tangent vector at the point $x_n$ is given by:

$$\tau_n = \left. \frac{\partial s(x_n, \epsilon)}{\partial \epsilon} \right|_{\epsilon=0}$$

The derivative of output $k$ with respect to $\epsilon$ is given by:

$$\left. \frac{\partial y_k}{\partial \epsilon} \right|_{\epsilon=0} = \sum_{i=1}^{D} \frac{\partial y_k}{\partial x_i} \left. \frac{\partial x_i}{\partial \epsilon} \right|_{\epsilon=0} = \sum_{i=1}^{D} J_{ki}\tau_i$$

The result can be used to modify the standard error function:

$$\tilde{E} = E + \lambda\Omega$$

where $\lambda$ is a regularization coefficient and:

$$\Omega = \frac{1}{2}\sum_n \sum_k \left( \left. \frac{\partial y_{nk}}{\partial \epsilon} \right|_{\epsilon=0} \right)^2 = \frac{1}{2}\sum_n \sum_k \left( \sum_{i=1}^{D} J_{nki}\tau_{ni} \right)^2$$
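To see what this regularizer measures, it can be estimated with finite differences for a simple transformation. The sketch below assumes two-dimensional inputs and takes the transformation to be a rotation by angle `eps`; the two toy "network" functions and all names are hypothetical, and a practical implementation would use the Jacobian from backpropagation rather than finite differences.

```python
import math

def rotate(x, eps):
    # s(x, eps): rotate a 2-D point by angle eps, so that s(x, 0) = x
    c, s = math.cos(eps), math.sin(eps)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]

def tangent_penalty(y, xs, transform, h=1e-5):
    # Omega = 1/2 * sum_n sum_k (dy_nk/d eps at eps = 0)^2, via central differences
    omega = 0.0
    for x in xs:
        y_plus, y_minus = y(transform(x, h)), y(transform(x, -h))
        for yp, ym in zip(y_plus, y_minus):
            omega += 0.5 * ((yp - ym) / (2 * h)) ** 2
    return omega

xs = [[1.0, 0.0], [0.0, 2.0]]

def radial(x):
    return [x[0] ** 2 + x[1] ** 2]   # unchanged by rotation

def first_coord(x):
    return [x[0]]                    # changes under rotation

omega_invariant = tangent_penalty(radial, xs, rotate)       # close to 0
omega_sensitive = tangent_penalty(first_coord, xs, rotate)  # close to 2
```

A function that is already invariant to the transformation incurs essentially no penalty, while one that depends on the rotated coordinate is penalized.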

5.5.5 Training with transformed data

Consider a transformation governed by a single parameter $\epsilon$ and described by the function $s(x, \epsilon)$. For untransformed inputs, the sum-of-squares error function in the limit of infinite data can be written in the form:

$$E = \frac{1}{2}\iint \{ y(x) - t \}^2 p(t|x) p(x)\, dx\, dt$$

If the parameter $\epsilon$ is drawn from a distribution $p(\epsilon)$, then the error function over the transformed data is:

$$\tilde{E} = \frac{1}{2}\iiint \{ y(s(x, \epsilon)) - t \}^2 p(t|x) p(x) p(\epsilon)\, dx\, dt\, d\epsilon$$

Further assume that $p(\epsilon)$ has zero mean and small variance. Expanding the transformation function as a Taylor series in $\epsilon$ and substituting into the mean error function gives the average error:

$$\tilde{E} = E + \lambda\Omega$$

where $E$ is the original sum-of-squares error, $\lambda$ denotes $\mathbb{E}[\epsilon^2]$, $\tau$ and $\tau'$ are the first and second derivatives of $s(x, \epsilon)$ with respect to $\epsilon$ at $\epsilon = 0$, and the regularization term $\Omega$ takes the form:

$$\Omega = \frac{1}{2}\int \left[ \{ y(x) - \mathbb{E}[t|x] \} \{ (\tau')^T \nabla y(x) + \tau^T \nabla\nabla y(x) \tau \} + (\tau^T \nabla y(x))^2 \right] p(x)\, dx$$

5.5.6 Convolutional networks

5.5.7 Soft weight sharing

In this part, the hard constraint of equal weights is replaced by a form of regularization in which groups of weights are encouraged to have similar values. Furthermore, the division of weights into groups, the mean weight value for each group, and the spread of values within the groups are all determined as part of the learning process.

5.6 Mixture Density Networks

We develop the model explicitly for Gaussian components, so that:

$$p(t|x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}(t | \mu_k(x), \sigma_k^2(x) I)$$

For independent data, the error function takes the form:

$$E(w) = -\sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k(x_n, w)\, \mathcal{N}(t_n | \mu_k(x_n, w), \sigma_k^2(x_n, w) I) \right\}$$
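Given the mixing coefficients, means, and standard deviations produced by the network for each input, this error can be evaluated as follows. The sketch assumes scalar targets (so the isotropic covariance reduces to a single variance) and uses the log-sum-exp trick for numerical stability; the function names and the argument layout are illustrative assumptions.

```python
import math

def log_gaussian(t, mu, sigma):
    # ln N(t | mu, sigma^2) for a scalar target
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (t - mu) ** 2 / (2 * sigma ** 2)

def mdn_error(targets, pis, mus, sigmas):
    """E = -sum_n ln sum_k pi_nk N(t_n | mu_nk, sigma_nk^2).

    pis[n][k], mus[n][k], sigmas[n][k] are the mixture parameters the network
    outputs for input x_n (assumed to be given here).
    """
    total = 0.0
    for t, pi_n, mu_n, sigma_n in zip(targets, pis, mus, sigmas):
        logs = [math.log(p) + log_gaussian(t, m, s)
                for p, m, s in zip(pi_n, mu_n, sigma_n)]
        top = max(logs)  # log-sum-exp: factor out the largest term
        total -= top + math.log(sum(math.exp(v - top) for v in logs))
    return total

# One data point, one standard Gaussian component: E = 0.5 * ln(2 * pi)
E_single = mdn_error([0.0], [[1.0]], [[0.0]], [[1.0]])

# Splitting the same density into two identical components leaves E unchanged
E_split = mdn_error([0.0], [[0.5, 0.5]], [[0.0, 0.0]], [[1.0, 1.0]])
```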

5.7 Bayesian Neural Networks

In this part, we will approximate the posterior distribution by a Gaussian centred at a mode of the true posterior. We will also assume that the covariance of this Gaussian is small, so that the network function is approximately linear in the weights over the region where the posterior is significant.

5.7.1 Posterior parameter distribution

We suppose that the conditional distribution $p(t|x)$ is Gaussian:

$$p(t|x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})$$

We also choose a prior distribution over the weights $w$ that is Gaussian, of the form:

$$p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1}\mathbf{I})$$

For an i.i.d. data set of $N$ observations $x_1, \ldots, x_N$, with a corresponding set of target values $D = \{t_1, \ldots, t_N\}$, the likelihood function is given by:

$$p(D | w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | y(x_n, w), \beta^{-1})$$

so the posterior distribution is:

$$p(w | D, \alpha, \beta) \propto p(w | \alpha)\, p(D | w, \beta)$$

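The Gaussian approximation to the posterior is given by the following, where $w_{MAP}$ is a mode of the posterior and $\mathbf{A}$ is the matrix of second derivatives of the negative log posterior evaluated at $w_{MAP}$:

$$q(w | D) = \mathcal{N}(w | w_{MAP}, \mathbf{A}^{-1})$$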

Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution:

$$p(t | x, D) = \int p(t | x, w)\, q(w | D)\, dw$$

Making a Taylor series expansion of the network function around $w_{MAP}$ and retaining only the linear terms, we obtain a linear-Gaussian model:

$$p(t | x, w, \beta) \simeq \mathcal{N}(t | y(x, w_{MAP}) + g^T (w - w_{MAP}), \beta^{-1})$$

We can therefore make use of the general result for the marginal distribution in a linear-Gaussian model to give:

$$p(t | x, D, \alpha, \beta) = \mathcal{N}(t | y(x, w_{MAP}), \sigma^2(x))$$

where

$$\sigma^2(x) = \beta^{-1} + g^T \mathbf{A}^{-1} g$$

$$g = \nabla_w y(x, w)|_{w = w_{MAP}}$$
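The predictive variance therefore has two parts: the intrinsic noise on the target and a term reflecting uncertainty in the weights. The sketch below evaluates it under the simplifying assumption that the matrix in the Gaussian approximation is diagonal, so that its inverse is trivial; this assumption, the function name, and the numbers are for illustration only, and a full implementation would solve a linear system with the complete matrix.

```python
def predictive_variance(beta, g, A_diag):
    # sigma^2(x) = 1/beta + g^T A^{-1} g, with A assumed diagonal (an assumption
    # made here for brevity), so that g^T A^{-1} g = sum_i g_i^2 / A_ii
    noise = 1.0 / beta
    weight_uncertainty = sum(gi ** 2 / ai for gi, ai in zip(g, A_diag))
    return noise + weight_uncertainty

# Toy values: noise precision 4, gradient g = (1, 2), diagonal of A = (2, 4)
var = predictive_variance(4.0, [1.0, 2.0], [2.0, 4.0])  # 0.25 + 0.5 + 1.0 = 1.75
```

Because the gradient depends on the input, the second term makes the predictive variance vary across input space.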

5.7.2 Hyperparameter optimization

5.7.3 Bayesian neural networks for classification

For a two-class classification problem, the network has a single logistic sigmoid output. The log likelihood function for this model is given by:

$$\ln p(D|w) = \sum_{n=1}^{N}\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}$$

The weights $w_{MAP}$ are found by minimizing the regularized error function, which is the negative log likelihood plus a quadratic regularizer:

$$E(w) = -\ln p(D|w) + \frac{\alpha}{2} w^T w$$

The resulting approximation to the predictive distribution is:

$$p(t=1 | x, D) = \sigma\left( k(\sigma_a^2)\, b^T w_{MAP} \right)$$
