Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size B, and the step size, or learning rate η. For small B and large η, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" T ≡ η/B. Yet this description is observed to break down for sufficiently large batches B ≥ B*, or to simplify to gradient descent (GD) when the temperature is sufficiently small. Understanding where these crossovers take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the B-η plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD, and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size B* separating regimes (i) and (ii) scales with the size P of the training set, with an exponent that characterizes the hardness of the classification problem.
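As a rough, purely illustrative sketch of the setup the abstract describes (not the authors' code), the snippet below trains a student perceptron on labels produced by a random teacher using minibatch SGD, where the batch size B and the learning rate η set the temperature T = η/B. The dimensions, the hinge-loss choice, and the step count are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical teacher-student perceptron trained with minibatch SGD.
# The batch size B and learning rate eta define the SGD "temperature" T = eta / B.

rng = np.random.default_rng(0)

d, P = 100, 10_000           # input dimension, training-set size (illustrative values)
B, eta = 32, 0.1             # batch size and learning rate
T = eta / B                  # noise "temperature" of SGD
print(f"temperature T = eta/B = {T:.4f}")

teacher = rng.standard_normal(d) / np.sqrt(d)   # fixed random teacher weights
X = rng.standard_normal((P, d))                 # training inputs
y = np.sign(X @ teacher)                        # teacher labels in {-1, +1}

w = np.zeros(d)                                 # student weights
for step in range(5_000):
    idx = rng.integers(0, P, size=B)            # sample a minibatch of size B
    xb, yb = X[idx], y[idx]
    margins = yb * (xb @ w)
    mask = margins < 1.0                        # hinge loss: only small-margin points contribute
    grad = -(yb[mask, None] * xb[mask]).sum(axis=0) / B
    w -= eta * grad                             # SGD step

# crude generalization proxy: alignment between student and teacher directions
overlap = w @ teacher / (np.linalg.norm(w) * np.linalg.norm(teacher) + 1e-12)
print(f"student-teacher overlap: {overlap:.3f}")
```

Sweeping B and η in such a sketch (while tracking the overlap) is one way to probe how the noise level T = η/B, rather than B or η alone, controls the dynamics in the noise-dominated regime.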
On the different regimes of stochastic gradient descent.
Authors: Sclocchi Antonio, Wyart Matthieu
| Journal: | Proceedings of the National Academy of Sciences of the United States of America | Impact factor: | 9.100 |
| Year: | 2024 | Issue/pages: | 2024 Feb 27; 121(9):e2316301121 |
| DOI: | 10.1073/pnas.2316301121 | | |
