KL Divergence
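The derivation below relies on the KL divergence, so here is a brief introduction.
The KL divergence is commonly used to measure the "distance" between two probability distributions (it is not symmetric, so it is not a true metric). It is defined as: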
$$KL\big[P(X)\,\|\,Q(X)\big]=\sum_{x\in X}\Big[P(x)\log\frac{P(x)}{Q(x)}\Big]=E_{x\sim P(x)}\Big[\log\frac{P(x)}{Q(x)}\Big]$$
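Here $P(X)$ and $Q(X)$ are two probability distributions. For discrete random variables the KL divergence sums over $x$; for continuous random variables it integrates over $x$. In both cases it is an expectation under $P$.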
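As a small illustration of the discrete definition above, here is a minimal NumPy sketch. The function name `kl_divergence` and the two example probability vectors are made up for this example; terms with $P(x)=0$ are treated as contributing 0, which is the usual convention.

```python
import numpy as np

def kl_divergence(p, q):
    """KL[P || Q] for two discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two illustrative distributions over three outcomes (hypothetical values).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)  # KL[P || Q]
kl_qp = kl_divergence(q, p)  # KL[Q || P], generally a different number
print(kl_pq, kl_qp)
```

Both values are non-negative, and swapping the arguments changes the result, which is why "distance" is only meant loosely.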
KL Divergence Between Gaussian Distributions
For two univariate Gaussian distributions $p\sim\mathcal{N}(\mu_1,\sigma_1^2)$ and $q\sim\mathcal{N}(\mu_2,\sigma_2^2)$, the KL divergence is
$$KL(p,q)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}$$
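The closed form can be sanity-checked against a direct numerical integration of $p(x)\log\frac{p(x)}{q(x)}$. This is only a sketch: the helper names, the example parameters, and the integration grid are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(p, q) for p = N(mu1, sigma1^2), q = N(mu2, sigma2^2)."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
            - 0.5)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative parameters (hypothetical values).
mu1, sigma1, mu2, sigma2 = 0.0, 1.0, 1.0, 2.0

# Riemann-sum approximation of the integral on a wide, fine grid.
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
p = gaussian_pdf(x, mu1, sigma1)
q = gaussian_pdf(x, mu2, sigma2)
kl_numeric = float(np.sum(p * np.log(p / q)) * dx)

kl_closed = float(gaussian_kl(mu1, sigma1, mu2, sigma2))
print(kl_closed, kl_numeric)
```

The two numbers should agree to several decimal places; the small residual comes from truncating and discretizing the integral.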
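Likelihood Function
Below is the distribution of the reverse process for $\mathbf{x}_{t-1}$ as given in the paper; its variance is treated as a constant.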
$$p_{\theta}(\mathbf{x}_{0:T})=p(\mathbf{x}_T)\prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t),\qquad p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)=\mathcal{N}(\mathbf{x}_{t-1};\mu_{\theta}(\mathbf{x}_t,t),\Sigma_{\theta}(\mathbf{x}_t,t))$$
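We now derive the likelihood of the target data distribution under the diffusion model, since the model can only be optimized once this likelihood is available. $p_{\theta}(\mathbf{x}_0)$ is the target data distribution: raising a lower bound on its log-likelihood pushes the log-likelihood up. To simplify the derivation we work with the negative log-likelihood $-\log p_{\theta}(\mathbf{x}_0)$ instead: the smaller its upper bound, the smaller the negative log-likelihood, and correspondingly the larger the log-likelihood.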
$$
\begin{aligned}
-\log p_{\theta}(\mathbf{x}_0)
& \leq -\log p_{\theta}(\mathbf{x}_0)+D_{KL}(q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)\parallel p_{\theta}(\mathbf{x}_{1:T}\mid\mathbf{x}_0)) \qquad(1)\\
& = -\log p_{\theta}(\mathbf{x}_0)+\Bbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})/p_{\theta}(\mathbf{x}_0)}\Big] \quad(2)\\
& = -\log p_{\theta}(\mathbf{x}_0)+\Bbb{E}_q\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}+\log p_{\theta}(\mathbf{x}_0)\Big]\qquad(3)\\
& = \Bbb{E}_{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}\Big]\qquad(4)
\end{aligned}
$$
Derivation
- $(1)$ : A KL divergence is added to the right-hand side. Because the KL divergence is always greater than or equal to 0, the inequality holds. The right-hand side is therefore an upper bound on the left-hand side, so minimizing the right-hand side pushes down the negative log-likelihood on the left-hand side of the inequality.
- $(1)\rightarrow(2)$ : This step expands the KL divergence using the definition given earlier, where $P(x)$ corresponds to $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ and $Q(x)$ corresponds to $p_{\theta}(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$. Expanding $Q(x)$ with the conditional probability formula gives $p_{\theta}(\mathbf{x}_{1:T}\mid\mathbf{x}_0)=p_{\theta}(\mathbf{x}_{1:T},\mathbf{x}_0)/p_{\theta}(\mathbf{x}_0)=p_{\theta}(\mathbf{x}_{0:T})/p_{\theta}(\mathbf{x}_0)$, which yields the expression in step $(2)$.
- $(2)\rightarrow(3)$ : Expand the $\log$.
- $(3)\rightarrow(4)$ : The expectation is taken over the distribution $q$, so $\log p_{\theta}(\mathbf{x}_0)$ is a constant with respect to $q$. Hence $\Bbb{E}_q\big[\log p_{\theta}(\mathbf{x}_0)\big]=\log p_{\theta}(\mathbf{x}_0)$, which cancels with the preceding $-\log p_{\theta}(\mathbf{x}_0)$ and gives expression $(4)$.
End of derivation
Next we take the expectation of the left-hand side $-\log p_{\theta}(\mathbf{x}_0)$ with respect to the distribution $q(\mathbf{x}_0)$, which gives $-\Bbb{E}_{q(\mathbf{x}_0)}\log p_{\theta}(\mathbf{x}_0)$ (the cross-entropy, i.e. the loss). Correspondingly, the expectation on the right-hand side must also cover $\mathbf{x}_0$, so $\Bbb{E}_{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}$ becomes $\Bbb{E}_{q(\mathbf{x}_{0:T})}$. Minimizing the loss therefore amounts to minimizing this expectation over $q(\mathbf{x}_{0:T})$.
$$\text{Let } L_{\rm VLB} = \Bbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}\Big]\geq -\Bbb{E}_{q(\mathbf{x}_0)}\log p_{\theta}(\mathbf{x}_0)$$
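The bound can be checked on a tiny discrete toy model with a single latent step ($T=1$). Everything in this sketch is hypothetical: the 3-state variables, the randomly generated `joint_p` standing in for $p_{\theta}(\mathbf{x}_0,\mathbf{x}_1)$, and the random `q_cond` standing in for $q(\mathbf{x}_1\mid\mathbf{x}_0)$. It only illustrates that the gap between the bound and the negative log-likelihood is exactly the added KL term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: x0 and x1 each take 3 values.
joint_p = rng.random((3, 3))
joint_p /= joint_p.sum()          # p_theta(x0, x1); rows index x0, columns index x1
marginal_p = joint_p.sum(axis=1)  # p_theta(x0)

q_cond = rng.random((3, 3))
q_cond /= q_cond.sum(axis=1, keepdims=True)  # q(x1 | x0); each row sums to 1

# Bound for each x0: E_{q(x1|x0)}[ log q(x1|x0) / p_theta(x0, x1) ]
bound = np.sum(q_cond * np.log(q_cond / joint_p), axis=1)
nll = -np.log(marginal_p)  # -log p_theta(x0)
gap = bound - nll

# The gap should equal KL( q(x1|x0) || p_theta(x1|x0) ).
posterior_p = joint_p / marginal_p[:, None]
kl = np.sum(q_cond * np.log(q_cond / posterior_p), axis=1)

# Choosing q equal to the true posterior makes the bound tight.
tight = np.sum(posterior_p * np.log(posterior_p / joint_p), axis=1)
print(gap, kl)
```

Averaging `bound` and `nll` over any distribution of `x0` preserves the inequality, which is the step that turns the per-sample bound into $L_{\rm VLB}$.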
Simplifying the Upper Bound on the Loss
$$
\begin{aligned}
L_{\rm VLB}
& = \Bbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}\Big] \qquad (1)\\
& = \Bbb{E}_q\Big[\log\frac{\prod_{t=1}^Tq(\mathbf{x}_t\mid\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_T)\prod_{t=1}^Tp_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}\Big] \qquad(2)\\
& = \Bbb{E}_q \Big[-\log p_{\theta}(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)} \Big] \qquad(3)\\
& = \Bbb{E}_q \Big[-\log p_{\theta}(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)} \Big] \qquad(4)\\
& = \Bbb{E}_q \Big[-\log p_{\theta}(\mathbf{x}_T) + \sum_{t=2}^T \log \Big(\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t) } \cdot \frac{q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)} \Big) + \log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)} \Big] \qquad(5)\\
& = \Bbb{E}_q \Big[-\log p_{\theta}(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t) } + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}+\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)} \Big] \qquad(6)\\
& = \Bbb{E}_q \Big[-\log p_{\theta}(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t) } + \log \frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_{1}\mid\mathbf{x}_0)}+\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)} \Big] \qquad(7)\\
& = \Bbb{E}_q \Big[\log \frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t) } - \log p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1) \Big] \qquad(8)\\
& = \Bbb{E}_q\Big[\underbrace{D_{\rm KL}(q(\mathbf{x}_T\mid\mathbf{x}_0)\parallel p_{\theta}(\mathbf{x}_T))}_{L_T}+\sum_{t=2}^T\underbrace{D_{\rm KL}(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\parallel p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t))}_{L_{t-1}}-\underbrace{\log p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}_{L_0}\Big]\qquad(9)
\end{aligned}
$$
Derivation
- $(1)\rightarrow(2)$ : Expand the conditional probabilities. $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ is the diffusion process, which goes step by step from $\mathbf{x}_0$ to $\mathbf{x}_T$ and satisfies the Markov assumption, so $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)=q(\mathbf{x}_1\mid\mathbf{x}_0)\cdot q(\mathbf{x}_2\mid\mathbf{x}_1)\cdot ... \cdot q(\mathbf{x}_T\mid\mathbf{x}_{T-1})=\prod_{t=1}^Tq(\mathbf{x}_t\mid\mathbf{x}_{t-1})$. For $p_{\theta}(\mathbf{x}_{0:T})$, first rewrite it with the conditional probability formula as $p_{\theta}(\mathbf{x}_T)p_{\theta}(\mathbf{x}_{0:T-1}\mid\mathbf{x}_T)$, then expand the second factor in the same way as $q$.
- $(2)\rightarrow(3)$ : Expand the $\log$; the products turn into sums.
- $(3)\rightarrow(4)$ : Pull the $t=1$ term $\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}$ out of the sum and handle it separately.
- $(4)\rightarrow(5)$ : Recall that in the discussion of the reverse diffusion process we obtained $q(\mathbf{x}_{t-1}\mid\mathbf{x}_{t},\mathbf{x}_0)=q(\mathbf{x}_{t}\mid\mathbf{x}_{t-1})\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_{t}\mid\mathbf{x}_0)}$. Solving this for $q(\mathbf{x}_t\mid\mathbf{x}_{t-1})$ and substituting gives step $(5)$.
- $(5)\rightarrow(6)$ : Expand the $\log$.
- $(6)\rightarrow(7)$ : The sum telescopes: $\sum_{t=2}^T\log\frac{q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}=\log\Big(\frac{q(\mathbf{x}_2\mid\mathbf{x}_0)}{q(\mathbf{x}_1\mid\mathbf{x}_0)}\cdot\frac{q(\mathbf{x}_3\mid\mathbf{x}_0)}{q(\mathbf{x}_2\mid\mathbf{x}_0)}\cdot...\cdot\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_{T-1}\mid\mathbf{x}_0)}\Big)=\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_1\mid\mathbf{x}_0)}$
- $(7)\rightarrow(8)$ : $\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_1\mid\mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}=\log q(\mathbf{x}_T\mid\mathbf{x}_0)-\log p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)$, and then $\log q(\mathbf{x}_T\mid\mathbf{x}_0)$ and $-\log p_{\theta}(\mathbf{x}_T)$ are merged into $\log \frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_T)}$.
- $(8)\rightarrow(9)$ : For $L_T$, neither $q(\mathbf{x}_T\mid\mathbf{x}_0)$ nor $p_{\theta}(\mathbf{x}_T)$ has trainable parameters: the former is determined by the $\beta_t$ schedule and contains no parameters, and the latter is an isotropic Gaussian. So $L_T$ is parameter-free and can be dropped during optimization. For $L_{t-1}$, by the definition of the KL divergence, the log-ratio under the expectation can be written as a KL divergence. If we set $t=1$ here, the term becomes $\log\frac{q(\mathbf{x}_0\mid\mathbf{x}_1,\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}=\log\frac{1}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}$. So at $t=1$ the result is exactly the $L_0$ term that follows $L_{t-1}$, and the two can be merged. Hence we only need to optimize $L_{t-1}$.
End of derivation
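The Bayes-rule identity used in step $(4)\rightarrow(5)$ can be checked numerically for scalar Gaussians. The sketch below rests on several assumptions that are not stated in this section: a linear $\beta_t$ schedule from 1e-4 to 0.02 with $T=1000$, the forward marginals $q(\mathbf{x}_t\mid\mathbf{x}_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I})$, and the closed-form posterior mean and variance of $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ from the DDPM paper. The helper `log_normal` and the test values are illustrative.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 500  # 1-based step, any value with 2 <= t <= T
b, a = betas[t - 1], alphas[t - 1]
ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]

x0, x_prev, x_t = 0.3, -0.1, 0.7  # arbitrary scalar values

# Left side: log q(x_t | x_{t-1})
lhs = log_normal(x_t, np.sqrt(a) * x_prev, b)

# Right side: log q(x_{t-1} | x_t, x0) + log q(x_t | x0) - log q(x_{t-1} | x0)
post_mean = (np.sqrt(ab_prev) * b * x0 + np.sqrt(a) * (1 - ab_prev) * x_t) / (1 - ab_t)
post_var = (1 - ab_prev) / (1 - ab_t) * b
rhs = (log_normal(x_prev, post_mean, post_var)
       + log_normal(x_t, np.sqrt(ab_t) * x0, 1 - ab_t)
       - log_normal(x_prev, np.sqrt(ab_prev) * x0, 1 - ab_prev))

print(lhs, rhs)
```

The two log-densities should match up to floating-point error for any choice of `x0`, `x_prev`, and `x_t`.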
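In the paper, the authors treat the variance of $p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ as a constant that depends on $\beta$, so the trainable parameters live only in its mean. In $L_{t-1}$, $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ is a Gaussian whose mean and variance were already obtained in the earlier reverse-process derivation: the mean is $\tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)$ and the variance is a constant that depends on $\beta_t$. $p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ is also assumed to be Gaussian, with constant variance and mean $\mu_{\theta}(\mathbf{x}_t,t)$, so the parameters appear only in $\mu_{\theta}$. We can apply the Gaussian KL divergence formula to these two distributions; the variance terms do not depend on $\theta$ and can be absorbed into a constant $C$. This gives: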
$$L_{t-1}=\Bbb{E}_q \Big[\frac{1}{2\sigma_t^2} \lVert \tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)-\mu_{\theta}(\mathbf{x}_t,t)\rVert^2 \Big]+C$$
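As a worked step for a single dimension, plug the two Gaussians into the univariate Gaussian KL formula. Here $\sigma_q^2$ is a symbol introduced only for this sketch; it stands for the fixed variance of $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$, while $\sigma_t^2$ is the fixed variance of $p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$:

$$
\begin{aligned}
KL(q,p_{\theta}) &= \log\frac{\sigma_t}{\sigma_q}+\frac{\sigma_q^2+(\tilde{\mu}_t-\mu_{\theta})^2}{2\sigma_t^2}-\frac{1}{2} \\
&= \frac{(\tilde{\mu}_t-\mu_{\theta})^2}{2\sigma_t^2}+\underbrace{\log\frac{\sigma_t}{\sigma_q}+\frac{\sigma_q^2}{2\sigma_t^2}-\frac{1}{2}}_{\text{independent of }\theta}
\end{aligned}
$$

Only the first term depends on $\theta$. Summing over dimensions (assuming isotropic covariances) turns the squared difference into the squared norm and collects the remaining terms into $C$.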
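This makes the optimization target clear: we optimize $\mu_{\theta}$ so that it gets arbitrarily close to $\tilde{\mu}_t$, which minimizes $L_{t-1}$. First we substitute $\tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)$ into the expression above, writing the $\tilde{z}_t$ of the original formula as $\epsilon$, replacing $\mathbf{x}_t$ with $\mathbf{x}_t(\mathbf{x}_0,\epsilon)$, and expressing $\mathbf{x}_0$ through $\mathbf{x}_t$ and $\epsilon$. This gives the expression after the second equals sign below.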
$$
\begin{aligned}
L_{t-1}-C
& = \Bbb{E}_{\mathbf{x}_0,\epsilon} \Bigg[\frac{1}{2\sigma_t^2}\Big\lVert\tilde{\mu}_t\Big(\mathbf{x}_t(\mathbf{x}_0,\epsilon),\frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t(\mathbf{x}_0,\epsilon)-\sqrt{1-\bar{\alpha}_t}\epsilon)\Big)-\mu_{\theta}(\mathbf{x}_t(\mathbf{x}_0,\epsilon),t)\Big\rVert^2 \Bigg] \\
& = \Bbb{E}_{\mathbf{x}_0,\epsilon} \Bigg[\frac{1}{2\sigma_t^2}\Big\lVert\frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t(\mathbf{x}_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon \Big)-\mu_{\theta}(\mathbf{x}_t(\mathbf{x}_0,\epsilon),t)\Big\rVert^2 \Bigg]
\end{aligned}
$$
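The step from the first to the second line can be verified numerically. This sketch assumes the closed-form posterior mean $\tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t$ from the DDPM paper and a linear $\beta_t$ schedule; the function name `posterior_mean` and the random test vectors are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 300  # 1-based step, any value with 2 <= t <= T
b, a = betas[t - 1], alphas[t - 1]
ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps  # x_t(x0, eps)

def posterior_mean(x_t, x0):
    """Mean of q(x_{t-1} | x_t, x0), closed form assumed from the DDPM paper."""
    return (np.sqrt(ab_prev) * b * x0 + np.sqrt(a) * (1 - ab_prev) * x_t) / (1 - ab_t)

# First line: mu_tilde evaluated with x0 written in terms of x_t and eps.
x0_from_eps = (x_t - np.sqrt(1 - ab_t) * eps) / np.sqrt(ab_t)
mean_first = posterior_mean(x_t, x0_from_eps)

# Second line: the simplified expression in x_t and eps only.
mean_second = (x_t - b / np.sqrt(1 - ab_t) * eps) / np.sqrt(a)

print(np.max(np.abs(mean_first - mean_second)))
```

The printed difference should be at the level of floating-point round-off.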
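Here $\mathbf{x}_t$ is known, since it is the input to the model. To minimize $L_{t-1}$ we can therefore give $\mu_{\theta}(\mathbf{x}_t,t)$ the same functional form as $\tilde{\mu}_t$. The only unknown in that form is $\epsilon$, so we can train a network $\epsilon_{\theta}$ to predict $\epsilon$.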
$$\mu_{\theta}(\mathbf{x}_t,t)=\tilde{\mu}_t\Big(\mathbf{x}_t,\frac{1}{\sqrt{\bar{\alpha}_t}}\big(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_{\theta}(\mathbf{x}_t,t)\big) \Big)=\frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_{\theta}(\mathbf{x}_t,t) \Big)$$
$L_{t-1}$ (up to the constant $C$) then simplifies to the following form
$$L_{t-1}-C=\Bbb{E}_{\mathbf{x}_0,\epsilon}\Big[ \frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}\lVert \epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\epsilon,t)\rVert^2\Big]$$
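The coefficient in front of the noise-prediction error can be checked with a few random vectors. In this sketch `eps_hat` is a random stand-in for the network output $\epsilon_{\theta}(\mathbf{x}_t,t)$, the choice $\sigma_t^2=\beta_t$ is one assumed option, and the linear $\beta_t$ schedule is likewise an assumption.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 300  # 1-based step
b, a, ab_t = betas[t - 1], alphas[t - 1], alpha_bar[t - 1]
sigma2 = b  # assumed choice sigma_t^2 = beta_t

rng = np.random.default_rng(1)
x_t = rng.standard_normal(4)
eps = rng.standard_normal(4)      # true noise
eps_hat = rng.standard_normal(4)  # hypothetical network prediction

def mean_from_eps(x_t, e):
    """(1 / sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * e)."""
    return (x_t - b / np.sqrt(1 - ab_t) * e) / np.sqrt(a)

# Mean-matching form: ||mu_tilde - mu_theta||^2 / (2 sigma_t^2)
mean_form = np.sum((mean_from_eps(x_t, eps) - mean_from_eps(x_t, eps_hat)) ** 2) / (2 * sigma2)

# Noise-matching form: coefficient * ||eps - eps_hat||^2
coef = b**2 / (2 * sigma2 * a * (1 - ab_t))
eps_form = coef * np.sum((eps - eps_hat) ** 2)

print(mean_form, eps_form)
```

The $\mathbf{x}_t$ terms cancel in the difference of the two means, which is why only the noise error and the coefficient remain.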
The authors further found that dropping the coefficient makes training more stable and improves sample quality, which gives the $L_{\rm simple}$ below
$$L_{\rm simple}(\theta):=\Bbb{E}_{t,\mathbf{x}_0,\epsilon}\Big[ \lVert \epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\epsilon,t)\rVert^2\Big]$$
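A Monte Carlo estimate of $L_{\rm simple}$ might look like the following NumPy sketch. It is not the paper's training code: `eps_theta` is a hypothetical placeholder that always predicts zero (a real model would be a neural network), the linear $\beta_t$ schedule and uniform sampling of $t$ from $\{1,\dots,T\}$ are assumptions, and the batch is random data.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    """Hypothetical placeholder for the noise-prediction network."""
    return np.zeros_like(x_t)

def l_simple(x0_batch, rng):
    """One-batch Monte Carlo estimate of E_{t, x0, eps} ||eps - eps_theta(x_t, t)||^2."""
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)         # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)  # eps ~ N(0, I)
    ab = alpha_bar[t - 1][:, None]             # alpha_bar_t per sample
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps
    pred = eps_theta(x_t, t)
    return float(np.mean(np.sum((eps - pred) ** 2, axis=1)))

rng = np.random.default_rng(0)
x0_batch = rng.standard_normal((256, 8))  # hypothetical data: 256 samples of dimension 8
loss = l_simple(x0_batch, rng)
print(loss)
```

With the zero predictor the loss is simply the average squared norm of the noise, so it sits near the data dimension; a trained network would drive it lower.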




