Stochastic Gradient Descent and Anomaly of Variance-Flatness Relation in Artificial Neural Networks
Xia Xiong1, Yong-Cong Chen1*, Chunxiao Shi1, and Ping Ao2
1 Shanghai Center for Quantitative Life Sciences and Physics Department, Shanghai University, Shanghai 200444, China
2 College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
Abstract: Stochastic gradient descent (SGD), a widely used algorithm in deep-learning neural networks, has attracted continuing research interest in the theoretical principles behind its success. A recent work reported an anomalous (inverse) relation between the variance of neural weights and the landscape flatness of the loss function driven under SGD [Feng Y and Tu Y Proc. Natl. Acad. Sci. USA 118 e2015617118 (2021)]. To investigate this seeming violation of a statistical physics principle, the properties of SGD near fixed points are analyzed with a dynamic decomposition method. Our approach recovers the true “energy” function under which the universal Boltzmann distribution holds. This function generally differs from the cost function and resolves the paradox raised by the anomaly. The study bridges the gap between classical statistical mechanics and the emerging discipline of artificial intelligence, with potential for better algorithms for the latter.
Received: 2023-05-19
Editors' Suggestion
Published: 2023-08-04
PACS:
02.50.-r (Probability theory, stochastic processes, and statistics)
02.50.Ey (Stochastic processes)
05.10.-a (Computational methods in statistical physics and nonlinear dynamics)
07.05.Mh (Neural networks, fuzzy logic, artificial intelligence)
[1] LeCun Y, Bengio Y, and Hinton G 2015 Nature 521 436
[2] Goodfellow I, Bengio Y, and Courville A 2016 Deep Learning (Cambridge: MIT Press)
[3] Aggarwal C C 2018 Neural Networks and Deep Learning (Berlin: Springer) p 105
[4] Le Q V, Ngiam J, Coates A, Lahiri A, Prochnow B, and Ng A Y 2011 Proceedings of 28th International Conference on Machine Learning, 28 June–2 July 2011, Madison, WI, USA, p 265
[5] Martens J 2010 Proceedings of the 27th International Conference on Machine Learning, 21–24 June 2010, Haifa, Israel, p 735
[6] Young S R, Rose D C, Karnowski T P, Lim S H, and Patton R M 2015 Proceedings of the Workshop on Machine Learning in High-performance Computing Environments, November 2015, New York, NY, USA, Article No. 4
[7] Advani M, Lahiri S, and Ganguli S 2013 J. Stat. Mech.: Theory Exp. 2013 P03014
[8] Baldassi C, Borgs C, Chayes J T, Ingrosso A, Lucibello C, Saglietti L, and Zecchina R 2016 Proc. Natl. Acad. Sci. USA 113 E7655
[9] Zhang C, Bengio S, Hardt M, Recht B, and Vinyals O 2021 Commun. ACM 64 107
[10] Chaudhari P and Soatto S 2018 Information Theory and Applications Workshop, 11–16 February 2018, San Diego, CA, USA, pp 1–10
[11] Zhang Y, Saxe A M, Advani M S, and Lee A A 2018 Mol. Phys. 116 3214
[12] Feng Y and Tu Y 2021 Mach. Learn.: Sci. Technol. 2 043001
[13] Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L, and Zdeborová L 2019 Rev. Mod. Phys. 91 045002
[14] Mehta P, Bukov M, Wang C H, Day A G, Richardson C, Fisher C K, and Schwab D J 2019 Phys. Rep. 810 1
[15] Feng Y and Tu Y 2021 Proc. Natl. Acad. Sci. USA 118 e2015617118
[16] Ghorbani B, Krishnan S, and Xiao Y 2019 Proceedings of Machine Learning Research 97 2232
[17] Li H, Xu Z, Taylor G, Studer C, and Goldstein T 2018 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, vol 31
[18] Ao P 2004 J. Phys. A 37 L25
[19] Kwon C, Ao P, and Thouless D J 2005 Proc. Natl. Acad. Sci. USA 102 13029
[20] Chen Y C, Shi C, Kosterlitz J M, Zhu X, and Ao P 2020 Proc. Natl. Acad. Sci. USA 117 23227
[21] Shi C, Chen Y C, Xiong X, and Ao P 2023 J. Nonlinear Math. Phys. 30 (accepted)
[22] Chen Y C, Shi C, Kosterlitz J M, Zhu X, and Ao P 2022 Proc. Natl. Acad. Sci. USA 119 e2211359119
[23] Yuan R S, Zhu X M, Wang G W, Li S T, and Ao P 2017 Rep. Prog. Phys. 80 042701
[24] Robins A 1995 Connect. Sci. 7 123
[25] Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu A A, Milan K, Quan J, Ramalho T, and Grabska-Barwinska A 2017 Proc. Natl. Acad. Sci. USA 114 3521
[26] Bray A J and Dean D S 2007 Phys. Rev. Lett. 98 150201
[27] Beer R D 2006 Neural Comput. 18 3009
[28] Amari S I 1996 Advances in Neural Information Processing Systems vol 9
[29] Rattray M, Saad D, and Amari S I 1998 Phys. Rev. Lett. 81 5461
[30] Sohl-Dickstein J, Poole B, and Ganguli S 2014 Proceedings of the 31st International Conference on Machine Learning, 21–26 June 2014, Beijing, China, vol 32 pp 604–612
[31] Sompolinsky H, Crisanti A, and Sommers H J 1988 Phys. Rev. Lett. 61 259
[32] Hochreiter S and Schmidhuber J 1997 Neural Comput. 9 1
[33] Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, and Zecchina R 2019 J. Stat. Mech.: Theory Exp. 2019 124018
[34] Baldassi C, Pittorino F, and Zecchina R 2020 Proc. Natl. Acad. Sci. USA 117 161
[35] Abdi H and Williams L J 2010 Wiley Interdisciplinary Reviews: Computational Statistics 2 433
[36] Van Kampen N G 1992 Stochastic Processes in Physics and Chemistry (Amsterdam: Elsevier Press)
[37] Han M, Park J, Lee T, and Han J H 2021 Phys. Rev. E 104 034126