tag:blogger.com,1999:blog-8730014808678923743.post6014014667248428254..comments2019-02-09T13:28:00.420+03:00Comments on almost stochastic: Convergence of gradient descent algorithmsDenizhttp://www.blogger.com/profile/06043980704154563908noreply@blogger.comBlogger9125tag:blogger.com,1999:blog-8730014808678923743.post-31565449773174399112016-02-25T22:56:25.035+02:002016-02-25T22:56:25.035+02:00Hi there, sorry i missed your comment, I am readin...Hi there, sorry i missed your comment, I am reading it. Thanks!Jie Tianhttps://www.blogger.com/profile/14603639500678494276noreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-22832325136610490352016-02-25T10:31:20.675+02:002016-02-25T10:31:20.675+02:00Hi,
Did you look at the paper Leon Bottou, Online...Hi,<br /><br />Did you look at the paper Leon Bottou, Online learning and stochastic approximations?Denizhttps://www.blogger.com/profile/06043980704154563908noreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-63551039747696595712016-02-25T07:14:04.831+02:002016-02-25T07:14:04.831+02:00hi there again, there are some subscripts that are...hi there again, there are some subscripts that are not very clear to me, for example: <br />the i and t in auxiliary sequences, μn<br />the second condition for $f$, where did you find this condition? why it is not $\theta^*$ in $\nabla_{\theta} f(\theta)$?<br />thank you!Jie Tianhttps://www.blogger.com/profile/14603639500678494276noreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-13104958713314013162015-10-22T19:05:34.864+03:002015-10-22T19:05:34.864+03:00Jie, for the second comment, yes you are right. In...Jie, for the second comment, yes you are right. In fact, the typo is in the definition of the gradient descent, it should be read as,<br />\begin{align*} \theta_{n+1} = \theta_{n} - \gamma_n \nabla_{\theta} f(\theta,\x) |_{\theta = \theta_{n}} \end{align*}<br />Thanks for catching it! Sorry for the late reply. In any case, you can check Bottou's paper: http://leon.bottou.org/publications/pdf/online-1998.pdf<br /><br />I think your proof will be valid for maximisation process with gradient ascent as soon as negative of the cost function you use satisfies the concepts introduced in the paper (general convexity).<br /><br />(correction about the algorithm may lead to another compilation error in the notation that I thought was a typo in Bottou's paper, but apparently it wasn't!).Denizhttps://www.blogger.com/profile/06043980704154563908noreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-79549218050083303342015-10-19T08:39:45.271+03:002015-10-19T08:39:45.271+03:00With rearrangements, we have, hn+1−(1−γnB)hn, I th...With rearrangements, we have, hn+1−(1−γnB)hn, I think here should be hn+1−(1+γn^2 B)hn, right?Jie Tianhttps://www.blogger.com/profile/14603639500678494276noreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-17266259207134904242015-10-19T08:18:27.096+03:002015-10-19T08:18:27.096+03:00So did you use the fixed step-size when proved the...So did you use the fixed step-size when proved the convergence of Discrete-time gradient descent? Now I know how to derive (10), but that stepsize should be $\gamma_{n+1}$, not $n$. So how can you get $\gamma_{n+1}$ in (10)? <br />And another question, if now I wanna use this proving process to a maximization problem, and my updating equation is $\theta_{n}=\theta_{n-1} + \gamma_n ∇θf(θ_{n-1})$, can I use this method? and how to adaptively rewrite it? thank you!Jie Tianhttps://www.blogger.com/profile/14603639500678494276noreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-44916546728588402842015-10-19T06:34:34.723+03:002015-10-19T06:34:34.723+03:00how do you have equation (10)? I tried some ways t...how do you have equation (10)? I tried some ways to rebuilt $h_{n+1} - h_{n}$ but can not get something like (10). if we expand h_{n+1} as (θn+1−θ∗)^2 ,and h_{n} as (θn−θ∗)^2, then replace θn with θn−1−γn∇θf(θn-1), and θn+1 with θn−γn+1∇θf(θn), but how to calculate γn+1?Jovitiannoreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-24969119660156623502015-08-12T16:16:37.873+03:002015-08-12T16:16:37.873+03:00Yes you are right. "$\gamma_n$ diverges"...Yes you are right. "$\gamma_n$ diverges" should be written as $\sum_{i=1}^\infty \gamma_n$ is divergent. Thanks for correction!<br /><br />As for your question, for simplicity, let's write $k_n = (\theta_n - \theta^*) \nabla_\theta f(\theta_n)$. And $\sum_n$ will denote the infinite sum over $n$. So the statement becomes, if $\sum_n \gamma_n = \infty$ (14) and $\sum_n \gamma_n k_n < \infty$ (13), it means $k_n \to 0$ (15).<br /><br />I'll try to make this rigorous proof by contradiction. Let us assume (13) and (14) holds, and $k_n$ doesn't converge to zero. This means there exists an $\epsilon > 0$ such that $k_n > \epsilon$ for all $n$. Then we can show (since we assumed (14) holds),<br />\begin{align*}<br />\sum_n k_n \gamma_n > \epsilon \sum_n \gamma_n = \infty.<br />\end{align*}<br />Assuming $k_n$ doesn't converge to zero, we arrived a contradiction with (13), one of our assumptions. Then $k_n \to 0$ should be the case.Denizhttps://www.blogger.com/profile/06043980704154563908noreply@blogger.comtag:blogger.com,1999:blog-8730014808678923743.post-62392182607105034382015-08-12T05:49:54.605+03:002015-08-12T05:49:54.605+03:00Hi, this is a nice post. Could you elaborate more ...Hi, this is a nice post. Could you elaborate more about the convergence (15) implied by (13) and (14) in a mathematically rigorous way? By "\gamma_{n} diverges" you actually mean that the series \sum_{n}\gamma_{n} diverges, right? This is because the sequence actually \gamma_{n} converges to zero. I still don't find a rigorous proof for the convergence (15) implied by (13) and (14). But my strategy is to first prove the left-hand side of (15) is bounded and then prove the upper limit of the left-hand side of (15) is less than or equal to zero (it's easy to see that the lower limit is zero).Unknownhttps://www.blogger.com/profile/02578770883598744054noreply@blogger.com