Pattern Recognition | PRML Chapter 5 Neural Networks
5.1 Feed-forward Network Functions
A network with one hidden layer takes the form:
$$y_{k}(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_{i} + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$
where the superscripts $(1)$ and $(2)$ index the layer, $h$ is the hidden-unit activation function, and $\sigma$ is the output activation function.
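The forward pass defined by this formula can be sketched in a few lines of NumPy. This is only an illustration: the choice $h = \tanh$, the layer sizes, and the names `forward`, `W1`, `b1`, `W2`, `b2` are assumptions made for the example, with the biases $w_{j0}^{(1)}$ and $w_{k0}^{(2)}$ stored as separate vectors.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, used here as the output activation."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Two-layer network: y = sigmoid(W2 @ h(W1 @ x + b1) + b2)."""
    a_hidden = W1 @ x + b1      # first-layer activations a_j
    z = np.tanh(a_hidden)       # hidden outputs z_j = h(a_j), h = tanh (assumed)
    a_out = W2 @ z + b2         # output activations a_k
    return sigmoid(a_out)

# Arbitrary sizes for illustration: D inputs, M hidden units, K outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1 = rng.normal(size=(M, D))
b1 = rng.normal(size=M)
W2 = rng.normal(size=(K, M))
b2 = rng.normal(size=K)
x = rng.normal(size=D)

y = forward(x, W1, b1, W2, b2)
print(y)
```

Each output lies in $(0, 1)$ because of the sigmoid; for regression the output activation would instead be the identity.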
5.1.1 Weight-space symmetries
5.2 Network Training
The error function that we need to minimize:
$$E(w) = \frac{1}{2}\sum_{n=1}^{N} \| y(x_{n}, w) - t_{n} \|^{2}$$
For the regression problem, assume that the target $t$ has a Gaussian distribution whose mean is the network output:
$$p(t | x, w) = N(t | y(x,w), \beta^{-1})$$
The corresponding likelihood function will be
$$p(\mathbf{t} | \mathbf{X}, w, \beta) = \prod_{n=1}^{N}p(t_{n} | x_{n}, w, \beta)$$
Taking the negative logarithm, we obtain the error function:
$$\frac{\beta}{2}\sum_{n=1}^{N}\{ y(x_{n}, w) - t_{n} \}^{2} - \frac{N}{2}\ln\beta + \frac{N}{2} \ln(2\pi)$$
from which we can learn $w$ and $\beta$. Maximizing the likelihood function with respect to $w$ is equivalent to minimizing the sum-of-squares error function given by:
$$E(w) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_{n}, w) - t_{n} \}^{2}$$
Having found $w_{ML}$, the value of $\beta$ can be found by minimizing the negative log likelihood to give
$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\{ y(x_{n}, w_{ML}) - t_{n} \}^{2}$$
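As a quick numerical check of these formulas, the sketch below computes the sum-of-squares error and $\beta_{ML}$ from a set of predictions. The arrays `y_pred` and `t` are made-up values standing in for $y(x_n, w_{ML})$ and $t_n$, and the function names are hypothetical.

```python
import numpy as np

def sum_of_squares_error(y, t):
    """E(w) = 1/2 * sum_n (y_n - t_n)^2."""
    return 0.5 * np.sum((y - t) ** 2)

def beta_ml(y, t):
    """1 / beta_ML is the mean squared residual."""
    return 1.0 / np.mean((y - t) ** 2)

def neg_log_likelihood(y, t, beta):
    """beta * E(w) - N/2 * ln(beta) + N/2 * ln(2*pi)."""
    n = len(t)
    return (beta * sum_of_squares_error(y, t)
            - 0.5 * n * np.log(beta)
            + 0.5 * n * np.log(2 * np.pi))

# Made-up predictions and targets, for illustration only.
y_pred = np.array([1.0, 2.0, 3.0, 4.0])
t = np.array([1.5, 1.5, 3.5, 3.5])

b = beta_ml(y_pred, t)
print(b, neg_log_likelihood(y_pred, t, b))
```

Evaluating `neg_log_likelihood` at values of `beta` slightly above or below `b` gives a larger result, which is consistent with `b` being the minimizer.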
In the regression case, we can view the network as having an output activation function that is the identity, so that $y_k = a_k$. The corresponding sum-of-squares error function has the property
$$\frac{\partial E}{\partial a_{k}} = y_{k} - t_{k}$$
In the case of binary classification, the conditional distribution of targets is a Bernoulli distribution:
$$p(t|x,w)=y(x,w)^{t}\{1-y(x,w)\}^{1-t}$$
Taking the negative log likelihood gives the cross-entropy error function, where $y_n$ denotes $y(x_n, w)$:
$$E(w) = -\sum_{n=1}^{N}\{ t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n}) \}$$
For the multiclass classification problem with 1-of-$K$ coded targets $t_{nk}$, the error function is:
$$E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{k}(x_{n}, w)$$
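The two cross-entropy errors can be compared directly in code. The sketch below uses made-up outputs `y` and targets `t`, and checks that the multiclass form with $K = 2$ and 1-of-$K$ targets reduces to the binary form. Function names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(y, t):
    """E = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multiclass_cross_entropy(Y, T):
    """E = -sum_n sum_k t_nk ln y_nk, with rows of T in 1-of-K coding."""
    return -np.sum(T * np.log(Y))

# Made-up network outputs and binary targets.
y = np.array([0.9, 0.2, 0.7])
t = np.array([1.0, 0.0, 1.0])

# The same problem written as a two-class problem in 1-of-K coding.
Y = np.stack([y, 1 - y], axis=1)
T = np.stack([t, 1 - t], axis=1)

e_bin = binary_cross_entropy(y, t)
e_multi = multiclass_cross_entropy(Y, T)
print(e_bin, e_multi)
```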
5.2.1 Parameter optimization
We choose some initial value $w^{(0)}$ and then move through weight space in a succession of steps:
$$w^{(\tau + 1)} = w^{(\tau)} + \Delta w^{(\tau)}$$
5.2.2 Local quadratic approximation
Consider the Taylor expansion of $E(w)$ around some point $\hat{w}$ in weight space:
$$E(w) \simeq E(\hat{w}) + (w - \hat{w})^{T}b + \frac{1}{2}(w - \hat{w})^{T}H(w - \hat{w})$$
where $b$ is defined to be the gradient of $E$ evaluated at $\hat{w}$, $b \equiv \nabla E|_{w=\hat{w}}$, and $H = \nabla\nabla E$ is the Hessian matrix evaluated at the same point.
When $w^{*}$ is a minimum of the error function, the linear term vanishes because $\nabla E = 0$ at $w^{*}$, and with $H$ evaluated at $w^{*}$ we have:
$$E(w) \simeq E(w^{*}) + \frac{1}{2}(w-w^{*})^{T}H(w-w^{*})$$
5.2.3 Use of gradient information
5.2.4 Gradient descent optimization
Batch gradient descent takes a small step in the direction of the negative gradient of the total error, with learning rate $\eta > 0$:
$$w^{(\tau + 1)} = w^{(\tau)} - \eta\nabla E(w^{(\tau)})$$
On-line gradient descent (also called sequential or stochastic gradient descent) updates the weights using one data point at a time:
$$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_{n}(w^{(\tau)})$$
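To make the two update rules concrete, the sketch below applies both to a small least-squares problem. The linear model $y = w^{T}x$, the noise-free synthetic data, the learning rates, and the iteration counts are all assumptions chosen so that the example converges quickly; they are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0])
t = X @ w_true                      # noise-free targets (assumed for the demo)

def grad_E(w):
    """Gradient of E(w) = 1/2 * sum_n (w.x_n - t_n)^2 over all data."""
    return X.T @ (X @ w - t)

def grad_En(w, n):
    """Gradient of the single-point error E_n(w)."""
    return (X[n] @ w - t[n]) * X[n]

# Batch gradient descent: one step per pass over the full data set.
w_batch = np.zeros(D)
for _ in range(500):
    w_batch = w_batch - 0.005 * grad_E(w_batch)

# On-line (stochastic) gradient descent: one step per data point.
w_sgd = np.zeros(D)
for _ in range(200):
    for n in rng.permutation(N):
        w_sgd = w_sgd - 0.05 * grad_En(w_sgd, n)

print(w_batch, w_sgd)
```

Both runs should end close to `w_true`. With noisy targets the on-line version would typically keep fluctuating around the minimum unless the learning rate is decreased over time.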
5.3 Error Backpropagation
5.3.1 Evaluation of error-function derivatives
We now derive the backpropagation algorithm for a general network. Consider first a simple linear model with outputs $y_k = \sum_i w_{ki}x_i$. For a single input pattern $n$, with $y_{nk} = y_k(x_n, w)$, the sum-of-squares error and its gradient are:
$$E_n=\frac{1}{2}\sum_k(y_{nk}-t_{nk})^2\ \rightarrow\ \frac{\partial E_n}{\partial w_{ji}}=(y_{nj}-t_{nj})x_{ni}$$
In a general feed-forward network, each unit computes a weighted sum of its inputs, and the sum is transformed by a nonlinear activation function:
$$a_j=\sum_i w_{ji}z_i,\ \ z_j=h(a_j)$$
Applying the chain rule, and defining $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$ and using $\frac{\partial a_j}{\partial w_{ji}} = z_i$:
$$\frac{\partial E_n}{\partial w_{ji}}=\frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}}=\delta_j z_i$$
We can obtain the backpropagation formula, where the sum runs over all units $k$ to which unit $j$ sends connections:
$$\delta_j=\frac{\partial E_n}{\partial a_j}=\sum_k\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial a_j}=h'(a_j)\sum_k w_{kj}\delta_k$$
For all data points, we have:
$$E(w)=\sum_{n=1}^N E_n(w),\ \ \frac{\partial E}{\partial w_{ji}}=\sum_n\frac{\partial E_n}{\partial w_{ji}}$$
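The backpropagation formulas can be checked against central finite differences on a small network. This sketch assumes a two-layer network with $h = \tanh$, linear outputs, a sum-of-squares error, and no bias terms; the sizes and names are illustrative.

```python
import numpy as np

def forward(x, W1, W2):
    """Return hidden outputs z and network outputs y."""
    z = np.tanh(W1 @ x)             # z_j = h(a_j)
    y = W2 @ z                      # linear output units, y_k = a_k
    return z, y

def error(x, t, W1, W2):
    """E_n = 1/2 * sum_k (y_k - t_k)^2."""
    return 0.5 * np.sum((forward(x, W1, W2)[1] - t) ** 2)

def backprop(x, t, W1, W2):
    """Gradients of E_n with respect to W1 and W2."""
    z, y = forward(x, W1, W2)
    delta_out = y - t                               # delta_k = y_k - t_k
    delta_hid = (1 - z ** 2) * (W2.T @ delta_out)   # h'(a_j) * sum_k w_kj delta_k
    # dE_n/dw_ji = delta_j * z_i, with z_i = x_i for the first layer
    return np.outer(delta_hid, x), np.outer(delta_out, z)

rng = np.random.default_rng(1)
x = rng.normal(size=3)
t = rng.normal(size=2)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
g1, g2 = backprop(x, t, W1, W2)

# Central finite differences for the first-layer weights.
eps = 1e-6
num_g1 = np.zeros_like(W1)
for j in range(W1.shape[0]):
    for i in range(W1.shape[1]):
        Wp = W1.copy(); Wp[j, i] += eps
        Wm = W1.copy(); Wm[j, i] -= eps
        num_g1[j, i] = (error(x, t, Wp, W2) - error(x, t, Wm, W2)) / (2 * eps)

print(np.max(np.abs(g1 - num_g1)))
```

The finite-difference loop needs two forward passes per weight, whereas backpropagation obtains all derivatives from one forward and one backward pass.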
5.3.2 A simple example
5.3.3 Efficiency of backpropagation
5.3.4 The Jacobian matrix
The Jacobian matrix has elements $J_{ki} \equiv \frac{\partial y_{k}}{\partial x_{i}}$. To evaluate it, we first write down a backpropagation formula for the derivatives $\frac{\partial y_{k}}{\partial a_{j}}$:
$$\frac{\partial y_{k}}{\partial a_{j}} = \sum_{l}\frac{\partial y_{k}}{\partial a_{l}}\frac{\partial a_{l}}{\partial a_{j}} = h'(a_{j})\sum_{l}w_{lj}\frac{\partial y_{k}}{\partial a_{l}}$$
where the sum runs over all units $l$ to which unit $j$ sends connections. If we have individual sigmoidal activation functions at each output unit, then
$$\frac{\partial y_{k}}{\partial a_{l}} = \delta_{kl}\,\sigma'(a_{l})$$
whereas for softmax outputs we have:
$$\frac{\partial y_{k}}{\partial a_{l}} = \delta_{kl}y_{k} - y_{k}y_{l}$$
Finally we can calculate the elements of the Jacobian matrix, where the sum runs over the units $j$ to which input $i$ sends connections:
$$J_{ki} = \frac{\partial y_{k}}{\partial x_{i}} = \sum_{j}w_{ji}\frac {\partial y_{k}}{\partial a_{j}}$$
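The recursion for the Jacobian can be sketched for a two-layer network and compared with finite differences. The example assumes $\tanh$ hidden units, independent logistic sigmoid outputs, and no biases; all names and sizes are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def net(x, W1, W2):
    """Network output y(x) with tanh hidden units and sigmoid outputs."""
    return sigmoid(W2 @ np.tanh(W1 @ x))

def jacobian(x, W1, W2):
    """J[k, i] = dy_k / dx_i, computed by backpropagating dy_k / da."""
    z = np.tanh(W1 @ x)
    y = sigmoid(W2 @ z)
    dy_da_out = np.diag(y * (1 - y))              # [k, l] = delta_kl * sigma'(a_l)
    dy_da_hid = (dy_da_out @ W2) * (1 - z ** 2)   # [k, j] = h'(a_j) sum_l w_lj dy_k/da_l
    return dy_da_hid @ W1                         # [k, i] = sum_j w_ji dy_k/da_j

rng = np.random.default_rng(3)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
J = jacobian(x, W1, W2)

# Central finite differences, one input dimension at a time.
eps = 1e-6
J_num = np.zeros_like(J)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    J_num[:, i] = (net(x + e, W1, W2) - net(x - e, W1, W2)) / (2 * eps)

print(np.max(np.abs(J - J_num)))
```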
5.4 The Hessian Matrix
5.4.1 Diagonal approximation
The diagonal elements of the Hessian can be written:
$$\frac{\partial^2 E_n}{\partial w^2_{ji}}=\frac{\partial^2 E_n}{\partial a_j^2} z^2_i$$
Applying the chain rule recursively gives a backpropagation equation of the form:
$$\frac{\partial^2 E_n}{\partial a_j^2}=h'(a_j)^2\sum_k\sum_{k'}w_{kj}w_{k'j} \frac{\partial^2 E_n}{\partial a_k \partial a_{k'}}+h''(a_j)\sum_k w_{kj}\frac{\partial E_n}{\partial a_k}$$
5.4.2 Outer product approximation
For a sum-of-squares error function with a single output, write the Hessian matrix in the form:
$$H=\nabla\nabla E=\sum_{n=1}^N\nabla y_n(\nabla y_n)^T+\sum_{n=1}^N(y_n-t_n)\nabla\nabla y_n$$
By neglecting the second term, we get the Levenberg-Marquardt approximation (outer product approximation):
$$H\simeq\sum_{n=1}^N b_n b_n^T$$
where $b_n=\nabla y_n=\nabla a_n$, because the output unit is linear.
In the case of the cross-entropy error function for a network with logistic sigmoid output-unit activation functions, the corresponding approximation is given by:
$$H\simeq\sum_{n=1}^N y_n(1-y_n)b_nb_n^T$$
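A small numerical experiment illustrates when the outer product approximation is accurate. The sketch assumes a toy single-output model $y_n = \tanh(w^{T}x_n)$ with a sum-of-squares error, and targets chosen so that the residuals $y_n - t_n$ are exactly zero at the evaluation point, which makes the neglected second term vanish. The model and names are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
w = rng.normal(size=3)
t = np.tanh(X @ w)                    # targets chosen so residuals vanish at w

def grad_y(w_):
    """Rows are b_n = grad_w y_n for the toy model y_n = tanh(w . x_n)."""
    y = np.tanh(X @ w_)
    return (1 - y ** 2)[:, None] * X

def grad_E(w_):
    """Gradient of E = 1/2 * sum_n (y_n - t_n)^2."""
    return grad_y(w_).T @ (np.tanh(X @ w_) - t)

# Outer product approximation: H ~ sum_n b_n b_n^T.
B = grad_y(w)
H_outer = B.T @ B

# Reference Hessian from central differences of the gradient.
eps = 1e-5
H_num = np.zeros((3, 3))
for i in range(3):
    e = np.zeros(3)
    e[i] = eps
    H_num[:, i] = (grad_E(w + e) - grad_E(w - e)) / (2 * eps)

print(np.max(np.abs(H_outer - H_num)))
```

With nonzero residuals the two matrices would differ by the term $\sum_n (y_n - t_n)\nabla\nabla y_n$.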
5.5 Regularization in Neural Networks
To control the complexity of a neural network, the simplest regularizer is the quadratic, giving a regularized error:
$$\tilde{E}(w)=E(w)+\frac{\lambda}{2} w^Tw$$
5.5.1 Consistent Gaussian priors
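A regularizer that is invariant under linear transformations of the inputs and outputs is given by:
$$\frac{\lambda_{1}}{2}\sum_{w\in W_{1}}w^{2} + \frac{\lambda_{2}}{2}\sum_{w\in W_{2}}w^{2}$$
where $W_1$ denotes the set of weights in the first layer and $W_2$ the set of weights in the second layer, with biases excluded from the summations.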
5.5.2 Early stopping
Training can be stopped at the point of smallest error with respect to the validation set in order to obtain a network having good generalization performance.
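In practice the validation error is noisy, so the stopping point is usually chosen with some tolerance. The sketch below picks the step with the smallest validation error and stops once no improvement has been seen for `patience` consecutive steps. The `patience` rule and the error values are assumptions for illustration, not part of the text.

```python
def early_stopping_step(val_errors, patience=3):
    """Return (best_step, best_error) seen before training is stopped.

    Training stops once `patience` steps pass without a new best
    validation error (a common heuristic, assumed here).
    """
    best_step, best_err = 0, float("inf")
    for step, err in enumerate(val_errors):
        if err < best_err:
            best_step, best_err = step, err
        elif step - best_step >= patience:
            break
    return best_step, best_err

# Made-up validation errors recorded after each training step.
val_errors = [1.0, 0.6, 0.4, 0.35, 0.37, 0.41, 0.50, 0.30]
result = early_stopping_step(val_errors, patience=3)
print(result)
```

In this made-up sequence training halts before the final, lower value is reached, which shows the trade-off that a patience setting introduces.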
5.5.3 Invariances
5.5.4 Tangent propagation
We can use regularization to encourage models to be invariant to transformations of the input through the technique of tangent propagation. Under a continuous transformation governed by a single parameter $\epsilon$, an input $x_n$ sweeps out a curve $M$ in input space. Let the vector that results from acting on $x_n$ by this transformation be denoted by $s(x_{n}, \epsilon)$, with $s(x_n,0)=x_n$. Then the tangent to the curve $M$ is given by the directional derivative $\tau = \frac{\partial s}{\partial \epsilon}$, and the tangent vector at the point $x_n$ is given by:
$$\tau_n=\left.\frac{\partial s(x_n,\epsilon)}{\partial\epsilon}\right|_{\epsilon =0}$$
The derivative of output $k$ with respect to $\epsilon$ is given by:
$$\left.\frac{\partial y_{k}}{\partial \epsilon}\right|_{\epsilon=0} =\sum_{i=1}^D\frac{\partial y_k}{\partial x_i}\left.\frac{\partial x_i}{\partial\epsilon}\right|_{\epsilon=0} =\sum_{i=1}^{D}J_{ki}\tau_{i}$$
The result can be used to modify the standard error function:
$$\tilde{E}=E+\lambda\Omega$$
where $\lambda$ is a regularization coefficient and:
$$\Omega=\frac{1}{2}\sum_n\sum_k\left(\left.\frac{\partial y_{nk}}{\partial \epsilon}\right|_{\epsilon=0}\right)^2=\frac{1}{2}\sum_n\sum_k\left(\sum_{i=1}^D J_{nki}\tau_{ni}\right)^2$$
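The regularizer $\Omega$ can be sketched for a concrete transformation. The example below assumes two-dimensional inputs, rotation about the origin as the transformation $s(x, \epsilon)$ (so the tangent vector at $x$ is $(-x_2, x_1)$), and a small $\tanh$ network with linear outputs. It compares the Jacobian-based value with a finite-difference estimate of $\partial y / \partial \epsilon$. All names are illustrative.

```python
import numpy as np

def net(x, W1, W2):
    """Toy network: tanh hidden units, linear outputs, no biases (assumed)."""
    return W2 @ np.tanh(W1 @ x)

def jacobian(x, W1, W2):
    """J[k, i] = dy_k / dx_i for the toy network."""
    z = np.tanh(W1 @ x)
    return (W2 * (1 - z ** 2)) @ W1

def rotate(x, eps):
    """Transformation s(x, eps): rotation by angle eps, with s(x, 0) = x."""
    c, s = np.cos(eps), np.sin(eps)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def tangent(x):
    """tau = ds/d(eps) at eps = 0 for the rotation above."""
    return np.array([-x[1], x[0]])

def omega(X, W1, W2):
    """Omega = 1/2 * sum_n sum_k (sum_i J_nki tau_ni)^2."""
    return 0.5 * sum(np.sum((jacobian(x, W1, W2) @ tangent(x)) ** 2) for x in X)

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 2))
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(2, 3))
omega_val = omega(X, W1, W2)

# Finite-difference estimate of dy/d(eps) at eps = 0 for comparison.
e = 1e-5
omega_fd = 0.5 * sum(
    np.sum(((net(rotate(x, e), W1, W2) - net(rotate(x, -e), W1, W2)) / (2 * e)) ** 2)
    for x in X
)
print(omega_val, omega_fd)
```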
5.5.5 Training with transformed data
Consider a transformation governed by a single parameter $\epsilon$ and described by the function $s(x,\epsilon)$. For untransformed inputs, the sum-of-squares error function can be written in the form:
$$E = \frac{1}{2}\iint \{ y(x) - t\}^{2}p(t|x)p(x)\, dx\, dt$$
If the parameter $\epsilon$ is drawn from a distribution $p(\epsilon)$, then:
$$\tilde{E} = \frac{1}{2}\iiint \{ y(s(x, \epsilon)) - t\}^{2}p(t|x)p(x)p(\epsilon)\, dx\, dt\, d\epsilon$$
Further assume that $p(\epsilon)$ has zero mean with small variance. Expanding the transformation function as a Taylor series in $\epsilon$ and substituting into the mean error function, the average error becomes
$$\tilde{E} = E + \lambda\Omega$$
where $E$ is the original sum-of-squares error, $\lambda = \mathbb{E}[\epsilon^{2}]$, and the regularization term $\Omega$ takes the form:
$$\Omega = \frac{1}{2}\int \left[ \{ y(x) - \mathbb{E}[t|x] \} \{ (\tau')^T\nabla y(x) + \tau^{T}\nabla\nabla y(x)\tau \} + (\tau^{T}\nabla y(x))^{2} \right]p(x)\, dx$$
Here $\tau$ and $\tau'$ denote the first and second derivatives of $s(x,\epsilon)$ with respect to $\epsilon$, evaluated at $\epsilon = 0$.
5.5.6 Convolutional networks
5.5.7 Soft weight sharing
In this part, the hard constraint of equal weights is replaced by a form of regularization in which groups of weights are encouraged to have similar values. Furthermore, the division of weights into groups, the mean weight value for each group, and the spread of values within the groups are all determined as part of the learning process.
5.6 Mixture Density Networks
We develop the model explicitly for Gaussian components, so that:
$$p(t|x) = \sum_{k=1}^{K}\pi_{k}(x)N(t | \mu_{k}(x), \sigma_{k}^{2}(x)\mathbf{I})$$
For independent data, the error function takes the form:
$$E(w) = -\sum_{n=1}^{N}\ln \left\{ \sum_{k=1}^{K}\pi_{k}(x_{n}, w)N(t_{n} | \mu_{k}(x_{n},w), \sigma_{k}^{2}(x_{n}, w)\mathbf{I}) \right\}$$
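The error function above can be evaluated as follows once the mixing coefficients, means, and standard deviations are available. In a mixture density network these would be outputs of the network; here they are passed in as plain arrays, and the function name, array shapes, and the log-sum-exp step for numerical stability are assumptions of this sketch.

```python
import numpy as np

def mdn_nll(pi, mu, sigma, t):
    """Negative log likelihood of an isotropic Gaussian mixture.

    pi:    (N, K)    mixing coefficients, rows sum to 1
    mu:    (N, K, L) component means
    sigma: (N, K)    component standard deviations
    t:     (N, L)    targets
    """
    L = t.shape[1]
    sq_dist = np.sum((t[:, None, :] - mu) ** 2, axis=2)               # (N, K)
    log_gauss = (-0.5 * L * np.log(2 * np.pi * sigma ** 2)
                 - sq_dist / (2 * sigma ** 2))                        # ln N(t | mu, sigma^2 I)
    log_terms = np.log(pi) + log_gauss
    # log-sum-exp over the K components, shifted by the row maximum
    m = log_terms.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.sum(np.exp(log_terms - m), axis=1))
    return -np.sum(log_mix)

# One data point, two identical standard-normal components (made-up values).
pi = np.array([[0.5, 0.5]])
mu = np.zeros((1, 2, 1))
sigma = np.ones((1, 2))
t = np.zeros((1, 1))

nll = mdn_nll(pi, mu, sigma, t)
print(nll)
```

Because both components coincide in this example, the mixture reduces to a single standard normal density evaluated at its mean.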
5.7 Bayesian Neural Networks
In this part, we will approximate the posterior distribution by a Gaussian, centred at a mode of the true posterior. We will also assume that the covariance of this Gaussian is small so that the network function is approximately linear.
5.7.1 Posterior parameter distribution
We suppose that the conditional distribution $p(t|x)$ is Gaussian:
$$p(t|x,w,\beta) = N(t | y(x, w), \beta^{-1})$$
We also choose a prior distribution over the weights $w$ that is Gaussian, of the form:
$$p(w | \alpha) = N(w | 0, \alpha^{-1}\mathbf{I})$$
For an i.i.d. data set of $N$ observations $x_1,...,x_N$, with a corresponding set of target values $D=\{t_1,...,t_N\}$, the likelihood function is given by:
$$p(D | w, \beta) = \prod_{n=1}^{N}N(t_{n} | y(x_{n}, w), \beta^{-1})$$
so we can get the posterior distribution:
$$p(w | D, \alpha, \beta) \propto p(w | \alpha)p(D | w, \beta)$$
The Gaussian approximation to the posterior is given by:
$$q(w | D) = N(w | w_{MAP}, \mathbf{A}^{-1})$$
where $\mathbf{A} = \alpha\mathbf{I} + \beta\mathbf{H}$ is the matrix of second derivatives of the negative log posterior evaluated at $w_{MAP}$, and $\mathbf{H}$ is the Hessian of the sum-of-squares error.
Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution:
$$p(t | x, D) = \int p(t | x, w)q(w | D)\, dw$$
Making a Taylor series expansion of the network function around $w_{MAP}$ and retaining only the linear terms, we obtain a linear-Gaussian model:
$$p(t| x, w, \beta) \simeq N(t | y(x, w_{MAP}) + g^{T}(w - w_{MAP}), \beta^{-1})$$
We can therefore make use of the general result for the marginal distribution of a linear-Gaussian model to give:
$$p(t | x, D, \alpha, \beta) = N(t | y(x, w_{MAP}), \sigma^{2}(x))$$
where
$$\sigma^{2}(x) = \beta^{-1} + g^{T}\mathbf{A}^{-1}g$$
$$g = \nabla_{w}y(x,w)|_{w = w_{MAP}}$$
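The predictive variance is straightforward to evaluate once $g$ and $\mathbf{A}$ are known. The sketch below takes both as given arrays; the particular values are arbitrary and chosen only so the result is easy to verify by hand. Solving a linear system is used in place of forming $\mathbf{A}^{-1}$ explicitly.

```python
import numpy as np

def predictive_variance(g, A, beta):
    """sigma^2(x) = 1/beta + g^T A^{-1} g."""
    return 1.0 / beta + g @ np.linalg.solve(A, g)

# Arbitrary illustrative values: A must be symmetric positive definite.
beta = 4.0
A = 2.0 * np.eye(2)
g = np.array([1.0, 1.0])

var = predictive_variance(g, A, beta)
print(var)
```

The first term is the intrinsic noise on the target, and the second reflects uncertainty in the weights, so it grows for inputs where the output is sensitive to poorly determined weight directions.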
5.7.2 Hyperparameter optimization
5.7.3 Bayesian neural networks for classification
Consider a network with a single logistic sigmoid output, corresponding to a two-class classification problem. The log likelihood function for this model is given by:
$$\ln p(D|w) = \sum_{n=1}^{N}\{t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n}) \}$$
The weights $w_{MAP}$ are found by minimizing the regularized error function:
$$E(w) = -\ln p(D|w) + \frac{\alpha}{2}w^{T}w$$
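Linearizing the output-unit activation around $w_{MAP}$, $a(x,w) \simeq a_{MAP}(x) + b^{T}(w - w_{MAP})$ with $a_{MAP}(x) = a(x, w_{MAP})$ and $b = \nabla a(x, w_{MAP})$, the approximate predictive distribution is
$$p(t=1 | x, D) = \sigma\left(\kappa(\sigma_{a}^{2})\, a_{MAP}\right)$$
where $\sigma_{a}^{2}(x) = b^{T}\mathbf{A}^{-1}b$ and $\kappa(\sigma^{2}) = (1 + \pi\sigma^{2}/8)^{-1/2}$.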
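The role of the factor $\kappa(\sigma_a^2)$ can be seen numerically. This sketch assumes PRML's definition $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$ and takes the mean and variance of the output-unit activation as plain numbers; the function names and values are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kappa(var):
    """kappa(sigma^2) = (1 + pi * sigma^2 / 8)^(-1/2), as defined in PRML."""
    return 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)

def predictive_prob(a_mean, a_var):
    """Approximate p(t = 1 | x, D) given the mean and variance of the activation."""
    return sigmoid(kappa(a_var) * a_mean)

p_certain = predictive_prob(2.0, 0.0)      # no weight uncertainty
p_uncertain = predictive_prob(2.0, 10.0)   # large weight uncertainty
print(p_certain, p_uncertain)
```

With zero variance the prediction equals the plain sigmoid output, while a larger variance pulls the probability towards 0.5 without changing which class is preferred.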