Training Hopfield nets
Geometric approach
- $\mathbf{W}=\mathbf{Y}\mathbf{Y}^{T}-N_{p}\mathbf{I}$
- $E(\mathbf{y})=\mathbf{y}^{T}\mathbf{W}\mathbf{y}$
- Since $\mathbf{y}^{T}\mathbf{y}=N$ for $\pm 1$ patterns: $\mathbf{y}^{T}\left(\mathbf{Y}\mathbf{Y}^{T}-N_{p}\mathbf{I}\right)\mathbf{y}=\mathbf{y}^{T}\mathbf{Y}\mathbf{Y}^{T}\mathbf{y}-N N_{p}$
- So the behavior with this $\mathbf{W}$ is identical to the behavior with $\mathbf{W}=\mathbf{Y}\mathbf{Y}^{T}$
  - Energy landscape only differs by an additive constant
  - Have the same eigenvectors
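The additive-constant claim can be checked numerically. The sketch below is only an illustration: the sizes `N` and `Np`, the random seed, and the helper name `quad` are assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Np = 6, 3                                      # pattern width and pattern count (illustrative)
Y = rng.choice([-1, 1], size=(N, Np))             # columns are the stored +/-1 patterns

W_full = Y @ Y.T                                  # W = Y Y^T
W_zero_diag = W_full - Np * np.eye(N, dtype=int)  # W = Y Y^T - Np I

def quad(W, y):
    """Quadratic form y^T W y."""
    return int(y @ W @ y)

y = rng.choice([-1, 1], size=N)                   # an arbitrary +/-1 probe state
gap = quad(W_full, y) - quad(W_zero_diag, y)
print(gap)                                        # N * Np = 18 for every +/-1 state
```

Because the gap does not depend on `y`, both weight matrices rank all states identically. Subtracting $N_p\mathbf{I}$ only zeroes the diagonal.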

- A pattern $\mathbf{y}_{p}$ is stored if:
  - $\operatorname{sign}\left(\mathbf{W}\mathbf{y}_{p}\right)=\mathbf{y}_{p}$ for all target patterns
- Training: design $\mathbf{W}$ such that this holds
- Simple solution: $\mathbf{y}_{p}$ is an eigenvector of $\mathbf{W}$
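The storage condition is easy to test directly. In this minimal sketch the helper name `is_stored` and the example pattern are assumptions chosen for illustration. `W` is built so that the pattern is an eigenvector with a positive eigenvalue.

```python
import numpy as np

def is_stored(W, y):
    """True if sign(W y) == y, i.e. y is a fixed point of the update rule."""
    return bool(np.array_equal(np.sign(W @ y), y))

y_p = np.array([1, -1, 1, 1, -1])
W = np.outer(y_p, y_p)             # W y_p = (y_p . y_p) y_p = 5 y_p

stored = is_stored(W, y_p)         # the eigenvector itself is stable

flipped = y_p.copy()
flipped[0] = -flipped[0]           # flip one bit
flipped_stored = is_stored(W, flipped)   # W maps this state back toward y_p
```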
Storing K orthogonal patterns
- Let $\mathbf{Y}=\left[\mathbf{y}_{1}\ \mathbf{y}_{2}\ \ldots\ \mathbf{y}_{K}\right]$
- $\mathbf{W}=\mathbf{Y}\Lambda\mathbf{Y}^{T}$
- $\lambda_1,\ldots,\lambda_K$ are positive
- For $\lambda_1=\lambda_2=\cdots=\lambda_K=1$ this is exactly the Hebbian rule
- With $K=N$ orthogonal patterns, any pattern $\mathbf{y}$ can be written as
- $\mathbf{y}=a_{1}\mathbf{y}_{1}+a_{2}\mathbf{y}_{2}+\cdots+a_{N}\mathbf{y}_{N}$
- $\mathbf{W}\mathbf{y}=a_{1}\mathbf{W}\mathbf{y}_{1}+a_{2}\mathbf{W}\mathbf{y}_{2}+\cdots+a_{N}\mathbf{W}\mathbf{y}_{N}=\mathbf{y}$ (up to a positive scale factor, which the sign function ignores)
- All patterns are stable
- Remembers everything
- Completely useless network
- Even if we store fewer than $N$ patterns
- Let $\mathbf{Y}=\left[\mathbf{y}_{1}\ \mathbf{y}_{2}\ \ldots\ \mathbf{y}_{K}\ \mathbf{r}_{K+1}\ \mathbf{r}_{K+2}\ \ldots\ \mathbf{r}_{N}\right]$
- $\mathbf{W}=\mathbf{Y}\Lambda\mathbf{Y}^{T}$
- $\mathbf{r}_{K+1},\mathbf{r}_{K+2},\ldots,\mathbf{r}_{N}$ are orthogonal to $\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{K}$
- $\lambda_1=\lambda_2=\cdots=\lambda_K=1$
- Problems arise because the eigenvalues are all 1.0
- Ensures stationarity of vectors in the subspace
- All stored patterns are equally important
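To see why storing $N$ orthogonal patterns with unit eigenvalues is useless, the sketch below enumerates every state of a tiny network. The choice of $N=4$ and of a Hadamard matrix as the orthogonal pattern set is an assumption made for illustration.

```python
import itertools
import numpy as np

# Four mutually orthogonal +/-1 patterns of width N = 4 (columns of a Hadamard matrix).
Y = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]])
N = Y.shape[0]
W = Y @ Y.T                        # Lambda = I, i.e. the Hebbian rule; here W = 4 I

# Collect every +/-1 state that satisfies sign(W y) == y.
stable = [y for y in itertools.product([-1, 1], repeat=N)
          if np.array_equal(np.sign(W @ np.array(y)), np.array(y))]
print(len(stable), 2 ** N)         # every one of the 16 states is stable
```

Since `W` is a multiple of the identity, the network never changes any state, so it recalls nothing.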
General (nonorthogonal) vectors
- $w_{ji}=\sum_{p\in\{p\}} y_{i}^{p}y_{j}^{p}$
- The maximum number of stationary patterns is actually exponential in $N$ (McEliece and Posner, '84)
- For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable, provided $K\le N$
- But this may come with many “parasitic” memories
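The Hebbian rule and recall can be sketched as follows. The function names `hebbian_weights` and `recall`, the asynchronous update order, the rule of leaving a bit unchanged when its field is zero, and the two example patterns are all assumptions for illustration.

```python
import numpy as np

def hebbian_weights(patterns):
    """w_ji = sum_p y_i^p y_j^p, with the diagonal zeroed (the -Np I term)."""
    Y = np.array(patterns).T                  # shape (N, Np), one pattern per column
    W = Y @ Y.T
    np.fill_diagonal(W, 0)
    return W

def recall(W, y, max_sweeps=10):
    """Asynchronous updates y_i <- sign(sum_j w_ij y_j) until nothing changes."""
    y = np.array(y)                           # work on a copy
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):
            field = W[i] @ y
            new = y[i] if field == 0 else (1 if field > 0 else -1)
            if new != y[i]:
                y[i] = new
                changed = True
        if not changed:
            break
    return y

p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = hebbian_weights([p1, p2])

noisy = np.array(p1)
noisy[0] = -noisy[0]                          # corrupt one bit of p1
restored = recall(W, noisy)                   # evolves back to p1
```

The two patterns here happen to be orthogonal, so recall is clean. With correlated patterns the same code can settle in a parasitic memory instead.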
Optimization
- Energy function
- $E=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}-\mathbf{b}^{T}\mathbf{y}$
- This must be maximally low for target patterns
- Must be maximally high for all other patterns
- So that they are unstable and evolve into one of the target patterns
- Estimate $\mathbf{W}$ such that
- $E$ is minimized for $\mathbf{y}_1,\ldots,\mathbf{y}_P$
- $E$ is maximized for all other $\mathbf{y}$
- Minimize total energy of target patterns
- $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}, \quad \widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\sum_{\mathbf{y}\in\mathbf{Y}_{P}}E(\mathbf{y})$
- However, might also pull all the neighborhood states down
- Maximize the total energy of all non-target patterns
- $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}$
- $\widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\sum_{\mathbf{y}\in\mathbf{Y}_{P}}E(\mathbf{y})-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}}E(\mathbf{y})$
- Simple gradient descent
- $\mathbf{W}=\mathbf{W}+\eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_{P}}\mathbf{y}\mathbf{y}^{T}-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}}\mathbf{y}\mathbf{y}^{T}\right)$
  - Minimize the energy at target patterns
  - Raise all non-target patterns
- Do we need to raise everything?
- Raise negative class
- Focus on raising the valleys
  - If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish
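The gradient-descent update follows from differentiating the energy with respect to $\mathbf{W}$. The short derivation below ignores the bias term. The objective name $L$ and the raw step size $\eta'$ are labels introduced here for illustration only.

```latex
\begin{aligned}
E(\mathbf{y}) &= -\tfrac{1}{2}\,\mathbf{y}^{T}\mathbf{W}\mathbf{y}
\quad\Rightarrow\quad
\nabla_{\mathbf{W}} E(\mathbf{y}) = -\tfrac{1}{2}\,\mathbf{y}\mathbf{y}^{T} \\
L(\mathbf{W}) &= \sum_{\mathbf{y}\in\mathbf{Y}_{P}} E(\mathbf{y}) - \sum_{\mathbf{y}\notin\mathbf{Y}_{P}} E(\mathbf{y}) \\
\nabla_{\mathbf{W}} L &= -\tfrac{1}{2}\Big(\sum_{\mathbf{y}\in\mathbf{Y}_{P}} \mathbf{y}\mathbf{y}^{T} - \sum_{\mathbf{y}\notin\mathbf{Y}_{P}} \mathbf{y}\mathbf{y}^{T}\Big) \\
\mathbf{W} &\leftarrow \mathbf{W} - \eta'\,\nabla_{\mathbf{W}} L
= \mathbf{W} + \eta\Big(\sum_{\mathbf{y}\in\mathbf{Y}_{P}} \mathbf{y}\mathbf{y}^{T} - \sum_{\mathbf{y}\notin\mathbf{Y}_{P}} \mathbf{y}\mathbf{y}^{T}\Big),
\qquad \eta = \tfrac{1}{2}\eta'
\end{aligned}
```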

- How do you identify the valleys for the current $\mathbf{W}$?
- Initialize the network randomly and let it evolve
  - It will settle in a valley
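Finding valleys by random initialization might look like the sketch below. The names `evolve` and `find_valleys`, the number of restarts, and the single-pattern network are assumptions for illustration.

```python
import numpy as np

def evolve(W, y, sweeps=20):
    """Run asynchronous sign updates for a fixed number of sweeps."""
    y = np.array(y)                            # work on a copy
    for _ in range(sweeps):
        for i in range(len(y)):
            field = W[i] @ y
            if field != 0:                     # leave the bit alone on a zero field
                y[i] = 1 if field > 0 else -1
    return y

def find_valleys(W, n_starts, rng):
    """Collect the distinct states reached from random +/-1 initial states."""
    valleys = set()
    for _ in range(n_starts):
        y0 = rng.choice([-1, 1], size=W.shape[0])
        valleys.add(tuple(int(v) for v in evolve(W, y0)))
    return valleys

p = np.array([1, 1, -1, -1, 1, -1])
W = np.outer(p, p)
np.fill_diagonal(W, 0)
valleys = find_valleys(W, n_starts=50, rng=np.random.default_rng(0))
```

For this one-pattern network the only valleys are `p` and its negation `-p`, which has the same energy.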

- Should we randomly sample valleys?
  - Are all valleys equally important?
  - Major requirement: memories must be stable
  - They must be broad valleys

- Solution: initialize the network at valid memories and let it evolve
  - It will settle in a valley
  - If this is not the target pattern, raise it
- What if there’s another target pattern downvalley?
- No need to raise the entire surface, or even every valley
  - Raise the neighborhood of each target memory
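The idea of starting at each target, evolving for a few steps, and raising wherever the state lands can be sketched as a training loop. Everything specific here is an assumption for illustration: the names `evolve` and `train`, the step size, the two-sweep evolution, and the starting weights `W0`, which deliberately store a wrong pattern `q` so that the target `p` begins unstable.

```python
import numpy as np

def evolve(W, y, sweeps=2):
    """A few sweeps of asynchronous sign updates, starting from y."""
    y = np.array(y)                            # work on a copy
    for _ in range(sweeps):
        for i in range(len(y)):
            field = W[i] @ y
            if field != 0:
                y[i] = 1 if field > 0 else -1
    return y

def train(W0, targets, eta=0.25, epochs=5):
    """Lower the energy at each target and raise the valley it falls into."""
    W = np.array(W0, dtype=float)
    for _ in range(epochs):
        for y_p in np.array(targets):
            y_v = evolve(W, y_p)               # where the target ends up
            # The update is zero once y_v equals y_p (the target is stable).
            W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))
            np.fill_diagonal(W, 0.0)
    return W

q = np.array([1, 1, 1, 1, 1, 1])               # pattern stored by the initial weights
p = np.array([1, 1, 1, 1, -1, -1])             # target pattern, initially unstable
W0 = np.outer(q, q).astype(float)
np.fill_diagonal(W0, 0.0)

before = evolve(W0, p)                         # drifts away from p to q
W = train(W0, [p])
after = evolve(W, p)                           # p is now a fixed point
```

Only the valley reached from the target is raised, so `q` remains stable in this example. The update makes the target a valley without flattening the rest of the surface.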

Storing more than N patterns

- Visible neurons
- The neurons that store the actual patterns of interest
- Hidden neurons
- The neurons that only serve to increase the capacity but whose actual values are not important

- The maximum number of patterns the net can store is bounded by the width $N$ of the patterns…
- So let’s pad the patterns with $K$ “don’t care” bits
- The new width of the patterns is $N+K$
- Now we can store $N+K$ patterns!
- Taking advantage of don’t care bits
- Simple random setting of don’t care bits, and using the usual training and recall strategies for Hopfield nets should work
- However, to exploit it properly, it helps to view the Hopfield net differently: as a probabilistic machine
A probabilistic interpretation
- For binary y the energy of a pattern is the analog of the negative log likelihood of a Boltzmann distribution
- Minimizing energy maximizes log likelihood
- $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}, \quad P(\mathbf{y})=C\exp(-E(\mathbf{y}))$
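For a small network the distribution can be computed exactly by enumerating all $2^N$ states. In this sketch the size $N=4$, the single stored pattern, and the helper name `energy` are assumptions for illustration.

```python
import itertools
import numpy as np

def energy(W, y):
    """E(y) = -1/2 y^T W y."""
    return -0.5 * y @ W @ y

p = np.array([1, -1, 1, -1])
W = np.outer(p, p).astype(float)
np.fill_diagonal(W, 0.0)

states = [np.array(s) for s in itertools.product([-1, 1], repeat=len(p))]
energies = np.array([energy(W, s) for s in states])
probs = np.exp(-energies)
probs /= probs.sum()                   # dividing by the sum plays the role of C

best = states[int(np.argmax(probs))]   # a lowest-energy state: p or its negation
```

The most probable states are exactly the lowest-energy ones, which is what makes energy minimization equivalent to likelihood maximization here.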
Boltzmann Distribution
- $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}-\mathbf{b}^{T}\mathbf{y}$
- $P(\mathbf{y})=C\exp\left(\frac{-E(\mathbf{y})}{kT}\right)$
- $C=\frac{1}{\sum_{\mathbf{y}}\exp\left(\frac{-E(\mathbf{y})}{kT}\right)}$
- $k$ is the Boltzmann constant, $T$ is the temperature of the system
- Optimizing $\mathbf{W}$
- $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}, \quad \widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\sum_{\mathbf{y}\in\mathbf{Y}_{P}}E(\mathbf{y})-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}}E(\mathbf{y})$
- Simple gradient descent
- $\mathbf{W}=\mathbf{W}+\eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_{P}}\alpha_{\mathbf{y}}\mathbf{y}\mathbf{y}^{T}-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}}\beta(E(\mathbf{y}))\mathbf{y}\mathbf{y}^{T}\right)$
- $\alpha_{\mathbf{y}}$: more importance to more frequently presented memories
- $\beta(E(\mathbf{y}))$: more importance to more attractive spurious memories
- Looks like an expectation
- $\mathbf{W}=\mathbf{W}+\eta\left(E_{\mathbf{y}\sim\mathbf{Y}_{P}}\mathbf{y}\mathbf{y}^{T}-E_{\mathbf{y}\sim Y}\mathbf{y}\mathbf{y}^{T}\right)$
- The behavior of the Hopfield net is analogous to annealed dynamics of a spin glass characterized by a Boltzmann distribution
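The expectation form of the update can be tried on a network small enough to enumerate. This sketch makes several assumptions that are not in the notes: the second expectation is taken under the network's own Boltzmann distribution, $kT$ is set to 1, the bias is dropped, and the names `boltzmann` and `expectation_step` are hypothetical.

```python
import itertools
import numpy as np

def boltzmann(W, states, kT=1.0):
    """P(y) = C exp(-E(y)/kT) with E(y) = -1/2 y^T W y, over an explicit state list."""
    E = np.array([-0.5 * s @ W @ s for s in states])
    w = np.exp(-E / kT)
    return w / w.sum()

def expectation_step(W, targets, states, eta=0.1, kT=1.0):
    """W <- W + eta * (E_data[y y^T] - E_model[y y^T])."""
    data_term = np.mean([np.outer(y, y) for y in targets], axis=0)
    probs = boltzmann(W, states, kT)
    model_term = sum(pr * np.outer(s, s) for pr, s in zip(probs, states))
    return W + eta * (data_term - model_term)

N = 4
states = [np.array(s) for s in itertools.product([-1, 1], repeat=N)]
target = np.array([1, -1, 1, -1])
idx = next(i for i, s in enumerate(states) if np.array_equal(s, target))

W = np.zeros((N, N))
before = boltzmann(W, states)[idx]     # uniform distribution: 1/16
W = expectation_step(W, [target], states)
after = boltzmann(W, states)[idx]      # the target is now more probable
```

Exact enumeration is only feasible for tiny $N$. Larger networks would have to estimate the second expectation by sampling.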
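A Hopfield network is an artificial neural network model that stores and retrieves multiple stable states, which correspond to the training patterns. By designing the weight matrix $\mathbf{W}$ appropriately, the target patterns can be given the lowest energy, so that they remain stable as the network evolves. Training adjusts the weight matrix to minimize the energy of the target patterns and maximize the energy of non-target patterns; this optimization can be carried out with methods such as gradient descent. The storage capacity of the network can be increased by introducing hidden neurons and exploiting “don’t care” bits. The network's behavior resembles the annealed dynamics of a system governed by a Boltzmann distribution, in which the probability of a pattern is determined by the energy function.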

