[优化算法]梯度下降法-凸函数的收敛性

1. 三个条件

  • f f f 有下界,可微,凸函数
  • ∇ f \nabla f fLipschitz连续
  • 步长 α ∈ ( 0 , 1 L ] \alpha \in (0,\frac{1}{L}] α(0,L1]
    { f ( x k ) } \{ f(x_k) \} {f(xk)} O ( 1 k ) \mathcal{O}(\frac{1}{k}) O(k1) 收敛于 f ∗ f^{*} f

2. 二次上界引理

f f f 可微, ∇ f \nabla f fLipschitz连续,则 f f f有二次上界,即
∀ x , y ∈ R n → f ( y ) ≤ f ( x ) + ∇ f ( x ) T ( y − x ) + L 2 ∣ ∣ y − x ∣ ∣ 2 \begin{equation} \forall x,y \in \mathbb{R}^n\rightarrow f(y)\le f(x)+\nabla f(x)^T(y-x)+\frac{L}{2}||y-x||^2 \end{equation} x,yRnf(y)f(x)+f(x)T(yx)+2L∣∣yx2

3. 证明

  • 我们前提是梯度下降法: x i = x i − 1 − ∇ f T ( x i − 1 ) α x_i=x_{i-1}-\nabla f^T(x_{i-1} ) \alpha xi=xi1fT(xi1)α

  • 不妨设 y = x i , x = x i − 1 y=x_i,x=x_{i-1} y=xi,x=xi1,则可得: y − x = x i − x i − 1 = − ∇ f T ( x i − 1 ) α y-x=x_i-x_{i-1}=-\nabla f^T(x_{i-1} ) \alpha yx=xixi1=fT(xi1)α

  • 代入到二次上界引理可得:
    f ( x i ) ≤ f ( x i − 1 ) + ∇ f ( x i − 1 ) T [ − ∇ f T ( x i − 1 ) α ] + L 2 [ − ∇ f T ( x i − 1 ) α ] 2 \begin{equation} f(x_i)\le f(x_{i-1})+\nabla f(x_{i-1})^T[-\nabla f^T(x_{i-1} ) \alpha]+\frac{L}{2}[-\nabla f^T(x_{i-1} ) \alpha]^2 \end{equation} f(xi)f(xi1)+f(xi1)T[fT(xi1)α]+2L[fT(xi1)α]2

  • 右边整理可得:
    = f ( x i − 1 ) − [ ∇ f T ( x i − 1 ) ] 2 α + L α 2 2 [ ∇ f T ( x i − 1 ) ] 2 \begin{equation} = f(x_{i-1})-[\nabla f^T(x_{i-1} )]^2 \alpha+\frac{L\alpha^2}{2}[\nabla f^T(x_{i-1})]^2 \end{equation} =f(xi1)[fT(xi1)]2α+2Lα2[fT(xi1)]2

  • 因为 α ≤ 1 L \alpha \le \frac{1}{L} αL1,则可得: L ≤ 1 α → L α 2 2 ≤ α 2 2 1 α → L α 2 2 ≤ α 2 L\le \frac{1}{\alpha}\rightarrow \frac{L\alpha^2}{2}\le \frac{\alpha^2}{2}\frac{1}{\alpha}\rightarrow \frac{L\alpha^2}{2}\le \frac{\alpha}{2} Lα12Lα22α2α12Lα22α

  • 右边放缩可得:
    ≤ f ( x i − 1 ) − [ ∇ f T ( x i − 1 ) ] 2 α + α 2 [ ∇ f T ( x i − 1 ) ] 2 = f ( x i − 1 ) − α 2 [ ∇ f T ( x i − 1 ) ] 2 \begin{equation} \le f(x_{i-1})-[\nabla f^T(x_{i-1} )]^2 \alpha+\frac{\alpha}{2}[\nabla f^T(x_{i-1})]^2= f(x_{i-1})-\frac{\alpha}{2}[\nabla f^T(x_{i-1} )]^2 \end{equation} f(xi1)[fT(xi1)]2α+2α[fT(xi1)]2=f(xi1)2α[fT(xi1)]2

  • 综上所述可得:
    f ( x i ) ≤ f ( x i − 1 ) − α 2 [ ∇ f T ( x i − 1 ) ] 2 \begin{equation} f(x_i)\le f(x_{i-1})-\frac{\alpha}{2}[\nabla f^T(x_{i-1} )]^2 \end{equation} f(xi)f(xi1)2α[fT(xi1)]2
    在这里插入图片描述
    在这里插入图片描述

  • 如图所示,极值点 f ∗ ≥ y A f^*\ge y_A fyA,
    f ( x i − 1 ) − y A x i − 1 − x ∗ = ∇ f T ( x i − 1 ) \begin{equation} \frac{f(x_{i-1})-y_A}{x_{i-1}-x^*}=\nabla f^T(x_{i-1}) \end{equation} xi1xf(xi1)yA=fT(xi1)

  • 整理上式可得:
    y A = f ( x i − 1 ) − ∇ f T ( x i − 1 ) [ x i − 1 − x ∗ ] \begin{equation} y_A=f(x_{i-1})-\nabla f^T(x_{i-1})[x_{i-1}-x^*] \end{equation} yA=f(xi1)fT(xi1)[xi1x]
    f ∗ ≥ f ( x i − 1 ) − ∇ f T ( x i − 1 ) [ x i − 1 − x ∗ ] \begin{equation} f^*\ge f(x_{i-1})-\nabla f^T(x_{i-1})[x_{i-1}-x^*] \end{equation} ff(xi1)fT(xi1)[xi1x]

  • 整理上式可得:
    f ( x i − 1 ) ≤ f ∗ + ∇ f T ( x i − 1 ) [ x i − 1 − x ∗ ] \begin{equation} f(x_{i-1})\le f^*+\nabla f^T(x_{i-1})[x_{i-1}-x^*] \end{equation} f(xi1)f+fT(xi1)[xi1x]

  • 将上面得到的公式如下:
    f ( x i ) ≤ f ( x i − 1 ) − α 2 [ ∇ f T ( x i − 1 ) ] 2 \begin{equation} f(x_i)\le f(x_{i-1})-\frac{\alpha}{2}[\nabla f^T(x_{i-1} )]^2 \end{equation} f(xi)f(xi1)2α[fT(xi1)]2

  • 放大右边可得:
    f ( x i ) ≤ f ∗ + ∇ f T ( x i − 1 ) [ x i − 1 − x ∗ ] − α 2 [ ∇ f T ( x i − 1 ) ] 2 \begin{equation} f(x_i)\le f^*+\nabla f^T(x_{i-1})[x_{i-1}-x^*]-\frac{\alpha}{2}[\nabla f^T(x_{i-1} )]^2 \end{equation} f(xi)f+fT(xi1)[xi1x]2α[fT(xi1)]2

  • 配方右边等式可得:
    = f ∗ − 1 2 α [ ( α 2 ( ∇ f T ( x i − 1 ) ) 2 − 2 α ∇ f T ( x i − 1 ) ( x i − 1 − x ∗ ) ] \begin{equation} =f^*-\frac{1}{2\alpha}[(\alpha^2 (\nabla f^T(x_{i-1} ))^2-2\alpha \nabla f^T(x_{i-1} )(x_{i-1}-x^*)] \end{equation} =f2α1[(α2(fT(xi1))22αfT(xi1)(xi1x)]
    = f ∗ − 1 2 α [ ( α 2 ( ∇ f T ( x i − 1 ) ) 2 − 2 α ∇ f T ( x i − 1 ) ( x i − 1 − x ∗ ) + ( x i − 1 − x ∗ ) 2 − ( x i − 1 − x ∗ ) 2 ] \begin{equation} =f^*-\frac{1}{2\alpha}[(\alpha^2 (\nabla f^T(x_{i-1} ))^2-2\alpha \nabla f^T(x_{i-1} )(x_{i-1}-x^*)+(x_{i-1}-x^*)^2-(x_{i-1}-x^*)^2] \end{equation} =f2α1[(α2(fT(xi1))22αfT(xi1)(xi1x)+(xi1x)2(xi1x)2]
    = f ∗ − 1 2 α [ ( α ∇ f T ( x i − 1 ) − x i − 1 + x ∗ ) 2 − ( x i − 1 − x ∗ ) 2 ] \begin{equation} =f^*-\frac{1}{2\alpha}[(\alpha \nabla f^T(x_{i-1} )-x_{i-1}+x^*)^2-(x_{i-1}-x^*)^2] \end{equation} =f2α1[(αfT(xi1)xi1+x)2(xi1x)2]

  • ( α ∇ f T ( x i − 1 ) − x i − 1 + x ∗ ) 2 = ( x i − 1 − α ∇ f T ( x i − 1 ) − x ∗ ) 2 (\alpha \nabla f^T(x_{i-1} )-x_{i-1}+x^*)^2=(x_{i-1} -\alpha \nabla f^T(x_{i-1} )-x^*)^2 (αfT(xi1)xi1+x)2=(xi1αfT(xi1)x)2,根据梯度公式可得:
    x i = x i − 1 − ∇ f T ( x i − 1 ) α → ( α ∇ f T ( x i − 1 ) − x i − 1 + x ∗ ) 2 = ( x i − x ∗ ) 2 \begin{equation} x_i=x_{i-1}-\nabla f^T(x_{i-1} ) \alpha \rightarrow (\alpha \nabla f^T(x_{i-1} )-x_{i-1}+x^*)^2=(x_i-x^*)^2 \end{equation} xi=xi1fT(xi1)α(αfT(xi1)xi1+x)2=(xix)2

  • 综合整理上式可得:
    f ( x i ) ≤ f ∗ − 1 2 α [ ( x i − x ∗ ) 2 − ( x i − 1 − x ∗ ) 2 ] \begin{equation} f(x_i)\le f^*-\frac{1}{2\alpha}[(x_i-x^*)^2-(x_{i-1}-x^*)^2] \end{equation} f(xi)f2α1[(xix)2(xi1x)2]

  • 改成范数形式可得:
    f ( x i ) ≤ f ∗ + 1 2 α [ ∣ ∣ x i − 1 − x ∗ ∣ ∣ 2 − ∣ ∣ x i − x ∗ ∣ ∣ 2 ] \begin{equation} f(x_i)\le f^*+\frac{1}{2\alpha}[||x_{i-1}-x^*||^2-||x_{i}-x^*||^2] \end{equation} f(xi)f+2α1[∣∣xi1x2∣∣xix2]

  • 不断代入可得:
    f ( x i ) − f ∗ ≤ 1 2 α [ ∣ ∣ x 0 − x ∗ ∣ ∣ 2 − ∣ ∣ x 1 − x ∗ ∣ ∣ 2 ] \begin{equation} f(x_i) -f^*\le\frac{1}{2\alpha}[||x_{0}-x^*||^2-||x_{1}-x^*||^2] \end{equation} f(xi)f2α1[∣∣x0x2∣∣x1x2]
    f ( x i ) − f ∗ ≤ 1 2 α [ ∣ ∣ x 1 − x ∗ ∣ ∣ 2 − ∣ ∣ x 2 − x ∗ ∣ ∣ 2 ] \begin{equation} f(x_i) -f^*\le\frac{1}{2\alpha}[||x_{1}-x^*||^2-||x_{2}-x^*||^2] \end{equation} f(xi)f2α1[∣∣x1x2∣∣x2x2]
    f ( x k ) − f ∗ ≤ 1 2 α [ ∣ ∣ x k − 1 − x ∗ ∣ ∣ 2 − ∣ ∣ x k − x ∗ ∣ ∣ 2 ] \begin{equation} f(x_k) -f^*\le\frac{1}{2\alpha}[||x_{k-1}-x^*||^2-||x_{k}-x^*||^2] \end{equation} f(xk)f2α1[∣∣xk1x2∣∣xkx2]

  • 左右相加可得:
    ∑ i = 1 k [ f ( x i ) − f ∗ ] ≤ 1 2 α [ ∣ ∣ x 0 − x ∗ ∣ ∣ 2 − ∣ ∣ x k − x ∗ ∣ ∣ 2 ] ≤ 1 2 α ∣ ∣ x 0 − x ∗ ∣ ∣ 2 \begin{equation} \sum_{i=1}^k[f(x_i) -f^*]\le\frac{1}{2\alpha}[||x_{0}-x^*||^2-||x_{k}-x^*||^2]\le\frac{1}{2\alpha}||x_{0}-x^*||^2 \end{equation} i=1k[f(xi)f]2α1[∣∣x0x2∣∣xkx2]2α1∣∣x0x2

  • 因为:
    ∑ i = 1 k ( f ( x k ) − f ∗ ) = k [ f ( x k ) − f ∗ ] → f ( x k ) − f ∗ = 1 k ∑ i = 1 k ( f ( x k ) − f ∗ ) \begin{equation} \sum_{i=1}^k(f(x_k)-f^*)=k[f(x_k)-f^*]\rightarrow f(x_k)-f^*=\frac{1}{k}\sum_{i=1}^k(f(x_k)-f^*) \end{equation} i=1k(f(xk)f)=k[f(xk)f]f(xk)f=k1i=1k(f(xk)f)

  • 由于 f ( x i ) f(x_i) f(xi)我们定义单调递减,所以可得: f ( x i ) ≥ f ( x k ) f(x_i)\ge f(x_k) f(xi)f(xk)
    f ( x k ) − f ∗ = 1 k ∑ i = 1 k ( f ( x k ) − f ∗ ) ≤ 1 k ∑ i = 1 k ( f ( x i ) − f ∗ ) \begin{equation} f(x_k)-f^*=\frac{1}{k}\sum_{i=1}^k(f(x_k)-f^*)\le \frac{1}{k}\sum_{i=1}^k(f(x_i)-f^*) \end{equation} f(xk)f=k1i=1k(f(xk)f)k1i=1k(f(xi)f)

  • 整理可得:
    f ( x k ) − f ∗ = 1 k ∑ i = 1 k ( f ( x i ) − f ∗ ) ≤ 1 k ⋅ 1 2 α ∣ ∣ x 0 − x ∗ ∣ ∣ 2 = O ( 1 k ) \begin{equation} f(x_k)-f^*=\frac{1}{k}\sum_{i=1}^k(f(x_i)-f^*)\le \frac{1}{k}\cdot \frac{1}{2\alpha}||x_0-x^*||^2=\mathcal{O}(\frac{1}{k}) \end{equation} f(xk)f=k1i=1k(f(xi)f)k12α1∣∣x0x2=O(k1)

  • 综上所述:
    f ( x k ) − f ∗ = O ( 1 k ) \begin{equation} f(x_k)-f^*=\mathcal{O}(\frac{1}{k}) \end{equation} f(xk)f=O(k1)
    { f ( x k ) } \{ f(x_k) \} {f(xk)} O ( 1 k ) \mathcal{O}(\frac{1}{k}) O(k1) 收敛于 f ∗ f^{*} f

视频来源于B站大佬,以上为视频学习笔记

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值