On the different regimes of stochastic gradient descent.

Authors: Antonio Sclocchi, Matthieu Wyart
Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data points considered at each step, or batch size B, and the step size, or learning rate η. For small B and large η, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" T ≡ η/B. Yet this description is observed to break down for sufficiently large batches B ≥ B*, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the B-η plane that separates three dynamical phases: i) a noise-dominated SGD governed by temperature, ii) a large-first-step-dominated SGD, and iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size B* separating regimes (i) and (ii) scales with the size P of the training set, with an exponent that characterizes the hardness of the classification problem.
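To make the temperature T ≡ η/B concrete, the following minimal sketch (not from the paper; the toy per-sample gradient is a hypothetical stand-in) checks numerically that the variance of a single SGD update scales as η²/B, which is why the noise amplitude in the diffusion picture is governed by η/B:

```python
import random

def minibatch_update_variance(eta, B, n_steps=20000, seed=0):
    """Empirical variance of the SGD update -eta * g_B, where g_B is the
    average of B toy per-sample gradients drawn as +1 or -1 with equal
    probability (mean 0, variance 1)."""
    rng = random.Random(seed)
    updates = []
    for _ in range(n_steps):
        g = sum(rng.choice((-1.0, 1.0)) for _ in range(B)) / B  # minibatch gradient
        updates.append(-eta * g)  # one SGD step
    mean = sum(updates) / n_steps
    return sum((u - mean) ** 2 for u in updates) / n_steps

eta = 0.1
v_small = minibatch_update_variance(eta, B=4)
v_large = minibatch_update_variance(eta, B=64)
# Update variance scales as eta**2 / B, so the ratio is roughly 64/4 = 16,
# consistent with a noise "temperature" T = eta / B at fixed eta.
print(v_small / v_large)
```

Shrinking B at fixed η (or raising η at fixed B) increases the update noise in the same way, which is why the stochastic regime is characterized by the single parameter T = η/B rather than by B and η separately.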
