Stochastic Gradient Descent and Anomaly of Variance-Flatness Relation in Artificial Neural Networks |
Xia Xiong1, Yong-Cong Chen1*, Chunxiao Shi1, and Ping Ao2
1Shanghai Center for Quantitative Life Sciences and Physics Department, Shanghai University, Shanghai 200444, China
2College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
Cite this article:
Xia Xiong, Yong-Cong Chen, Chunxiao Shi et al. 2023 Chin. Phys. Lett. 40 080202
Abstract Stochastic gradient descent (SGD), a widely used algorithm in deep-learning neural networks, has attracted continuing research interest for the theoretical principles behind its success. A recent work reported an anomalous (inverse) relation between the variance of neural weights and the flatness of the loss-function landscape driven under SGD [Feng Y and Tu Y 2021 Proc. Natl. Acad. Sci. USA 118 e2015617118]. To investigate this seeming violation of statistical-physics principles, the properties of SGD near fixed points are analyzed with a dynamic decomposition method. Our approach recovers the true "energy" function under which the universal Boltzmann distribution holds. It differs from the cost function in general and resolves the paradox raised by the anomaly. The study bridges the gap between classical statistical mechanics and the emerging discipline of artificial intelligence, with potential for better algorithms for the latter.
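The variance–flatness relation at issue can be illustrated with a minimal toy model (not taken from the paper; all parameters here are hypothetical). For SGD on a quadratic loss with *isotropic* gradient noise, the stationary weight variance along each principal direction scales as η σ²/(2kᵢ), i.e., inversely with the curvature kᵢ — the conventional, Boltzmann-like behavior. The anomaly reported by Feng and Tu arises when this isotropic noise is replaced by the anisotropic noise of realistic minibatch SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(w) = 0.5 * sum(k_i * w_i^2); k_i is the curvature
# (inverse "flatness") along principal direction i.
k = np.array([1.0, 10.0])
eta = 0.01       # learning rate (hypothetical)
sigma = 0.5      # isotropic gradient-noise amplitude (hypothetical)
steps = 200_000

w = np.zeros(2)
samples = []
for t in range(steps):
    grad = k * w + sigma * rng.standard_normal(2)  # noisy gradient
    w -= eta * grad                                # SGD update
    if t > steps // 2:                             # discard burn-in
        samples.append(w.copy())

var = np.var(np.array(samples), axis=0)
# Small-eta theory for this Ornstein-Uhlenbeck-like linear dynamics:
# Var(w_i) ~ eta * sigma**2 / (2 * k_i), so the flatter direction
# (smaller k_i) carries the LARGER variance -- the conventional relation.
print(var, eta * sigma**2 / (2 * k))
```

With SGD's state-dependent, anisotropic minibatch noise, the measured variance can instead grow along the *stiffer* directions, which is the anomaly the article resolves by identifying the true "energy" function governing the stationary distribution.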
Received: 19 May 2023
Editors' Suggestion
Published: 04 August 2023
PACS:
02.50.-r (Probability theory, stochastic processes, and statistics)
02.50.Ey (Stochastic processes)
05.10.-a (Computational methods in statistical physics and nonlinear dynamics)
07.05.Mh (Neural networks, fuzzy logic, artificial intelligence)

[1] | LeCun Y, Bengio Y, and Hinton G 2015 Nature 521 436 |
[2] | Goodfellow I, Bengio Y, and Courville A 2016 Deep Learning (Cambridge: MIT Press) |
[3] | Aggarwal C C 2018 Neural Networks and Deep Learning (Berlin: Springer) p 105 |
[4] | Le Q V, Ngiam J, Coates A, Lahiri A, Prochnow B, and Ng A Y 2011 Proceedings of 28th International Conference on Machine Learning, 28 June–2 July 2011, Madison, WI, USA, p 265 |
[5] | Martens J 2010 Proceedings of the 27th International Conference on Machine Learning, 21–24 June 2010, Haifa, Israel, p 735 |
[6] | Young S R, Rose D C, Karnowski T P, Lim S H, and Patton R M 2015 Proceedings of the Workshop on Machine Learning in High-performance Computing Environments, November 2015, New York, NY, USA, Article No. 4 |
[7] | Advani M, Lahiri S, and Ganguli S 2013 J. Stat. Mech.: Theory Exp. 2013 P03014 |
[8] | Baldassi C, Borgs C, Chayes J T, Ingrosso A, Lucibello C, Saglietti L, and Zecchina R 2016 Proc. Natl. Acad. Sci. USA 113 E7655 |
[9] | Zhang C, Bengio S, Hardt M, Recht B, and Vinyals O 2021 Commun. ACM 64 107 |
[10] | Chaudhari P and Soatto S 2018 Information Theory and Applications Workshop, 11–16 February 2018, San Diego, CA, USA, pp 1–10 |
[11] | Zhang Y, Saxe A M, Advani M S, and Lee A A 2018 Mol. Phys. 116 3214 |
[12] | Feng Y and Tu Y 2021 Mach. Learn.: Sci. Technol. 2 043001 |
[13] | Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L, and Zdeborová L 2019 Rev. Mod. Phys. 91 045002 |
[14] | Mehta P, Bukov M, Wang C H, Day A G, Richardson C, Fisher C K, and Schwab D J 2019 Phys. Rep. 810 1 |
[15] | Feng Y and Tu Y 2021 Proc. Natl. Acad. Sci. USA 118 e2015617118 |
[16] | Ghorbani B, Krishnan S, and Xiao Y 2019 Proceedings of Machine Learning Research 97 2232 |
[17] | Li H, Xu Z, Taylor G, Studer C, and Goldstein T 2018 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, vol 31 |
[18] | Ao P 2004 J. Phys. A 37 L25 |
[19] | Kwon C, Ao P, and Thouless D J 2005 Proc. Natl. Acad. Sci. USA 102 13029 |
[20] | Chen Y C, Shi C, Kosterlitz J M, Zhu X, and Ao P 2020 Proc. Natl. Acad. Sci. USA 117 23227 |
[21] | Shi C, Chen Y C, Xiong X, and Ao P 2023 J. Nonlinear Math. Phys. 30 (accepted) |
[22] | Chen Y C, Shi C, Kosterlitz J M, Zhu X, and Ao P 2022 Proc. Natl. Acad. Sci. USA 119 e2211359119 |
[23] | Yuan R S, Zhu X M, Wang G W, Li S T, and Ao P 2017 Rep. Prog. Phys. 80 042701 |
[24] | Robins A 1995 Connect. Sci. 7 123 |
[25] | Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu A A, Milan K, Quan J, Ramalho T, and Grabska-Barwinska A 2017 Proc. Natl. Acad. Sci. USA 114 3521 |
[26] | Bray A J and Dean D S 2007 Phys. Rev. Lett. 98 150201 |
[27] | Beer R D 2006 Neural Comput. 18 3009 |
[28] | Amari S I 1996 Advances in Neural Information Processing Systems vol 9 |
[29] | Rattray M, Saad D, and Amari S I 1998 Phys. Rev. Lett. 81 5461 |
[30] | Sohl-Dickstein J, Poole B, and Ganguli S 2014 Proceedings of the 31st International Conference on Machine Learning, 21–26 June 2014, Beijing, China, vol 32 pp 604–612 |
[31] | Sompolinsky H, Crisanti A, and Sommers H J 1988 Phys. Rev. Lett. 61 259 |
[32] | Hochreiter S and Schmidhuber J 1997 Neural Comput. 9 1 |
[33] | Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, and Zecchina R 2019 J. Stat. Mech.: Theory Exp. 2019 124018 |
[34] | Baldassi C, Pittorino F, and Zecchina R 2020 Proc. Natl. Acad. Sci. USA 117 161 |
[35] | Abdi H and Williams L J 2010 Wiley Interdisciplinary Reviews: Computational Statistics 2 433 |
[36] | Van Kampen N G 1992 Stochastic Processes in Physics and Chemistry (Amsterdam: Elsevier Press) |
[37] | Han M, Park J, Lee T, and Han J H 2021 Phys. Rev. E 104 034126 |