Stochastic Gradient Descent and Anomaly of Variance-Flatness Relation in Artificial Neural Networks

doi:10.1088/0256-307X/40/8/080202

Chin. Phys. Lett. 2023, Vol. 40, Issue 8: 080202. DOI: 10.1088/0256-307X/40/8/080202

GENERAL

Xia Xiong¹, Yong-Cong Chen¹*, Chunxiao Shi¹, and Ping Ao²

¹Shanghai Center for Quantitative Life Sciences and Physics Department, Shanghai University, Shanghai 200444, China
²College of Biomedical Engineering, Sichuan University, Chengdu 610065, China

Cite this article:

Xia Xiong, Yong-Cong Chen, Chunxiao Shi et al 2023 Chin. Phys. Lett. 40 080202

Abstract Stochastic gradient descent (SGD), a widely used algorithm in deep-learning neural networks, has attracted continuing research interest in the theoretical principles behind its success. A recent work reported an anomalous (inverse) relation between the variance of neural weights and the landscape flatness of the loss function under SGD [Feng Y and Tu Y Proc. Natl. Acad. Sci. USA 118 e2015617118 (2021)]. To investigate this seeming violation of a principle of statistical physics, the properties of SGD near fixed points are analyzed with a dynamic decomposition method. Our approach recovers the true “energy” function under which the universal Boltzmann distribution holds. This function differs from the cost function in general, and it resolves the paradox raised by the anomaly. The study bridges the gap between classical statistical mechanics and the emerging discipline of artificial intelligence, with potential for better algorithms for the latter.
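
The distinction the abstract draws between the cost function and the "energy" function can be illustrated with a toy linear model near a fixed point. The sketch below is not the calculation in the paper. It assumes a quadratic loss with a diagonal Hessian `H`, Langevin-type dynamics dq = -H q dt + sqrt(2D) dW, and two hypothetical noise models: isotropic noise, and noise whose strength grows as the square of the curvature. All names and numbers (`eps`, `H`, `stationary_covariance`) are illustrative assumptions. For this linear process the stationary covariance S solves H S + S Hᵀ = 2D, and the Gaussian steady state exp(-φ/ε) has φ(q) = ½ qᵀ U q with U = ε S⁻¹.

```python
import numpy as np

def stationary_covariance(F, D):
    """Covariance S of dq = -F q dt + sqrt(2 D) dW, obtained from F S + S F^T = 2 D."""
    n = F.shape[0]
    eye = np.eye(n)
    # Row-major vectorisation: vec(F S) = kron(F, I) vec(S), vec(S F^T) = kron(I, F) vec(S).
    lhs = np.kron(F, eye) + np.kron(eye, F)
    return np.linalg.solve(lhs, (2.0 * D).reshape(-1)).reshape(n, n)

eps = 0.01                        # hypothetical overall noise strength
H = np.diag([0.5, 2.0, 8.0])      # hypothetical loss Hessian: flat -> sharp directions

# Case 1 (assumption): isotropic noise, D = eps * I.
cov_iso = stationary_covariance(H, eps * np.eye(3))

# Case 2 (assumption): anisotropic noise that grows with curvature, D = eps * H^2.
cov_aniso = stationary_covariance(H, eps * H @ H)

# Quadratic "energy" matrices U such that the steady state is exp(-q^T U q / (2 eps)).
U_iso = eps * np.linalg.inv(cov_iso)
U_aniso = eps * np.linalg.inv(cov_aniso)

print("curvatures           :", np.diag(H))
print("variance, isotropic  :", np.diag(cov_iso))
print("variance, anisotropic:", np.diag(cov_aniso))
print("energy curvature, isotropic  :", np.diag(U_iso))
print("energy curvature, anisotropic:", np.diag(U_aniso))
```

Under these assumptions the isotropic case gives the familiar result (variance largest along the flattest direction, energy equal to the loss), while the curvature-dependent noise gives larger variance along sharper directions and an energy matrix that is no longer the loss Hessian. Which noise model is appropriate for SGD is a question for the paper itself; the sketch only shows that the two functions need not coincide.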

Received: 19 May 2023; Published: 04 August 2023 (Editors' Suggestion)

PACS:	02.50.-r	(Probability theory, stochastic processes, and statistics)
	02.50.Ey	(Stochastic processes)
	05.10.-a	(Computational methods in statistical physics and nonlinear dynamics)
	07.05.Mh	(Neural networks, fuzzy logic, artificial intelligence)

URL:

https://cpl.iphy.ac.cn/10.1088/0256-307X/40/8/080202 OR https://cpl.iphy.ac.cn/Y2023/V40/I8/080202

[1]	LeCun Y, Bengio Y, and Hinton G 2015 Nature 521 436
[2]	Goodfellow I, Bengio Y, and Courville A 2016 Deep Learning (Cambridge: MIT Press)
[3]	Aggarwal C C 2018 Neural Networks and Deep Learning (Berlin: Springer) p 105
[4]	Le Q V, Ngiam J, Coates A, Lahiri A, Prochnow B, and Ng A Y 2011 Proceedings of 28th International Conference on Machine Learning, 28 June–2 July 2011, Madison, WI, USA, p 265
[5]	Martens J 2010 Proceedings of the 27th International Conference on Machine Learning, 21–24 June 2010, Haifa, Israel, p 735
[6]	Young S R, Rose D C, Karnowski T P, Lim S H, and Patton R M 2015 Proceedings of the Workshop on Machine Learning in High-performance Computing Environments, November 2015, New York, NY, USA, Article No. 4
[7]	Advani M, Lahiri S, and Ganguli S 2013 J. Stat. Mech.: Theory Exp. 2013 P03014
[8]	Baldassi C, Borgs C, Chayes J T, Ingrosso A, Lucibello C, Saglietti L, and Zecchina R 2016 Proc. Natl. Acad. Sci. USA 113 E7655
[9]	Zhang C, Bengio S, Hardt M, Recht B, and Vinyals O 2021 Commun. ACM 64 107
[10]	Chaudhari P and Soatto S 2018 Information Theory and Applications Workshop, 11–16 February 2018, San Diego, CA, USA, pp 1–10
[11]	Zhang Y, Saxe A M, Advani M S, and Lee A A 2018 Mol. Phys. 116 3214
[12]	Feng Y and Tu Y 2021 Mach. Learn.: Sci. Technol. 2 043001
[13]	Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L, and Zdeborová L 2019 Rev. Mod. Phys. 91 045002
[14]	Mehta P, Bukov M, Wang C H, Day A G, Richardson C, Fisher C K, and Schwab D J 2019 Phys. Rep. 810 1
[15]	Feng Y and Tu Y 2021 Proc. Natl. Acad. Sci. USA 118 e2015617118
[16]	Ghorbani B, Krishnan S, and Xiao Y 2019 Proceedings of Machine Learning Research 97 2232
[17]	Li H, Xu Z, Taylor G, Studer C, and Goldstein T 2018 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, vol 31
[18]	Ao P 2004 J. Phys. A 37 L25
[19]	Kwon C, Ao P, and Thouless D J 2005 Proc. Natl. Acad. Sci. USA 102 13029
[20]	Chen Y C, Shi C, Kosterlitz J M, Zhu X, and Ao P 2020 Proc. Natl. Acad. Sci. USA 117 23227
[21]	Shi C, Chen Y C, Xiong X, and Ao P 2023 J. Nonlinear Math. Phys. 30 (accepted)
[22]	Chen Y C, Shi C, Kosterlitz J M, Zhu X, and Ao P 2022 Proc. Natl. Acad. Sci. USA 119 e2211359119
[23]	Yuan R S, Zhu X M, Wang G W, Li S T, and Ao P 2017 Rep. Prog. Phys. 80 042701
[24]	Robins A 1995 Connect. Sci. 7 123
[25]	Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu A A, Milan K, Quan J, Ramalho T, and Grabska-Barwinska A 2017 Proc. Natl. Acad. Sci. USA 114 3521
[26]	Bray A J and Dean D S 2007 Phys. Rev. Lett. 98 150201
[27]	Beer R D 2006 Neural Comput. 18 3009
[28]	Amari S I 1996 Advances in Neural Information Processing Systems vol 9
[29]	Rattray M, Saad D, and Amari S I 1998 Phys. Rev. Lett. 81 5461
[30]	Sohl-Dickstein J, Poole B, and Ganguli S 2014 Proceedings of the 31st International Conference on Machine Learning, 21–26 June 2014, Beijing, China, vol 32 pp 604–612
[31]	Sompolinsky H, Crisanti A, and Sommers H J 1988 Phys. Rev. Lett. 61 259
[32]	Hochreiter S and Schmidhuber J 1997 Neural Comput. 9 1
[33]	Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, and Zecchina R 2019 J. Stat. Mech.: Theory Exp. 2019 124018
[34]	Baldassi C, Pittorino F, and Zecchina R 2020 Proc. Natl. Acad. Sci. USA 117 161
[35]	Abdi H and Williams L J 2010 Wiley Interdisciplinary Reviews: Computational Statistics 2 433
[36]	Van Kampen N G 1992 Stochastic Processes in Physics and Chemistry (Amsterdam: Elsevier Press)
[37]	Han M, Park J, Lee T, and Han J H 2021 Phys. Rev. E 104 034126
