[Journal article][Full-length article]


Learning ability of interpolating deep convolutional neural networks

Authors:
Tian-Yi Zhou; Xiaoming Huo

Publication year: Not available

Article number: 101582
Publisher: Elsevier BV


Abstract:

It is frequently observed that overparameterized neural networks generalize well. Regarding such phenomena, existing theoretical work is mainly devoted to linear settings or fully-connected neural networks. This paper studies the learning ability of an important family of deep neural networks, deep convolutional neural networks (DCNNs), under both underparameterized and overparameterized settings. We establish the first learning rates of underparameterized DCNNs without the parameter or function-variable structure restrictions presented in the literature. We also show that, by adding well-defined layers to a non-interpolating DCNN, we can obtain some interpolating DCNNs that maintain the good learning rates of the non-interpolating DCNN. This result is achieved by a novel network deepening scheme designed for DCNNs. Our work provides theoretical verification of how overfitted DCNNs generalize well.

Introduction

Neural networks are computing systems with powerful applications in many disciplines, such as data analysis and pattern and sequence recognition. In particular, deep neural networks with well-designed structures, numerous trainable free parameters, and massive-scale input data have outstanding performance in function approximation [32], [18], classification [19], [11], regression [30], and feature extraction [23]. The success of deep neural networks in practice has motivated research activities intended to rigorously explain their capability and power, in addition to the literature on shallow neural networks [26]. In this paper, we study an important family of neural networks known as convolutional neural networks.

Given that neural networks in general are powerful and versatile, researchers have been working to further improve their computational efficiency. When the data dimension is large, as in AlexNet [19] with an input dimension of about 150,000, fully-connected neural networks are not feasible. Structures are often imposed on neural networks to reduce the number of trainable free parameters and to obtain feasible deep learning algorithms for various practical tasks [20]. The structure we are interested in is induced by one-dimensional convolution (1-D convolution), and the resulting networks are deep convolutional neural networks (DCNNs) [31]. The convolutional structure of DCNNs reduces the computational complexity and is believed to capture local shift-invariance properties of image and speech data. These features contribute to the massive popularity of DCNNs in image processing and speech recognition.
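To make the 1-D convolutional structure concrete, the following minimal sketch (an illustration, not the paper's exact definition) applies a single convolutional layer written as a Toeplitz-type matrix whose output width grows by the filter length; the zero boundary handling, bias convention, and ReLU activation used here are assumptions.

```python
import numpy as np

def conv1d_layer(x, w, b):
    """One 1-D convolutional layer written in Toeplitz form: the output
    dimension grows by the filter length s.

    x : input vector of length d
    w : filter (w_0, ..., w_s) of length s + 1
    b : bias vector of length d + s

    Returns ReLU(T x - b), where T is the (d + s) x d Toeplitz-type matrix
    with T[i, k] = w[i - k] inside the filter support and 0 elsewhere.
    These conventions are assumptions made for illustration.
    """
    d, s = len(x), len(w) - 1
    T = np.zeros((d + s, d))
    for i in range(d + s):
        for k in range(d):
            if 0 <= i - k <= s:
                T[i, k] = w[i - k]
    return np.maximum(T @ x - b, 0.0)  # ReLU activation

# Tiny usage example: a random filter acting on an 8-dimensional input.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)              # input dimension d = 8
w = rng.standard_normal(4)              # filter length s + 1 = 4, so s = 3
h = conv1d_layer(x, w, np.zeros(8 + 3)) # bias of length d + s = 11
print(h.shape)                          # (11,): the width expands from 8 to 11
```

Stacking such layers, with the width growing at every step, yields deep convolutional architectures of the kind discussed above.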
In recent years, there has been a line of work studying overparameterization in deep learning. It is frequently observed that overparameterized deep neural networks, such as DCNNs, generalize well while achieving zero training error [6]. This phenomenon, known as benign overfitting, appears to contradict the classical bias-variance trade-off in statistical theory. Such a mismatch between observations and classical theory sparked avid research attempting to understand how benign overfitting occurs. Theoretical work on benign overfitting was initiated in [4], which considered a linear regression setting with Gaussian data and noise and presented conditions for minimum-norm interpolators to generalize well. In a non-linear setting induced by the ReLU activation function, benign overfitting is verified for deep fully-connected neural networks in [22]. On top of that, a recent work [10] shows that training shallow neural networks with shared weights by gradient descent can achieve an arbitrarily small training error.

In this paper, we study the learning ability of DCNNs under both underparameterized and overparameterized settings. We aim to show that an overparameterized DCNN can be constructed to have the same convergence rate as a given underparameterized one while perfectly fitting the input data. In other words, we intend to prove that interpolating DCNNs generalize well. The main contributions of the paper are as follows.

Our first result rigorously proves that, for an arbitrary DCNN with good learning rates, we can add more layers to build overparameterized DCNNs that satisfy the interpolation condition while retaining good learning rates. Here, "learning rates" refers to rates of convergence of the output function to the regression function in a regression setting.

Our second result establishes the learning rates of DCNNs in general. Previously, in [33], convergence rates for approximating functions in some Sobolev spaces by DCNNs were given without generalization analysis. Learning rates of DCNNs for learning radial functions were given in [24], where the bias vectors and filters are assumed to be bounded, with bounds depending on the sample size and depth. More recently, learning rates for learning additive ridge functions were presented in [14]. Unlike these existing works, the learning rates we derive do not require any restrictions on the norms of the filters or bias vectors, or on variable structures of the target functions. Without boundedness of the free parameters, the standard covering number arguments do not apply. To overcome this challenge, we derive a special estimate of the pseudo-dimension of the hypothesis space generated by a DCNN; previously, a pseudo-dimension estimate was given in [3] for fully-connected neural networks, using the piecewise polynomial property of the activation function. We then apply our pseudo-dimension estimate to bound the empirical covering number of the hypothesis space. In this way, we achieve our results without restrictions on the free parameters.

Combining our first and second results, we prove that for any input data there exist some overparameterized DCNNs which interpolate the data and achieve a good learning rate. This third result provides theoretical support for the possible existence of benign overfitting in the DCNN setting.

The rest of this paper is organized as follows. In Section 2, we introduce the notations and definitions used throughout the paper, including the definition of the DCNNs to be studied. In Section 3, we present our main results, which describe how a DCNN achieves benign overfitting. The proof of our first result is given in Section 4, and the proofs of our second and third results are provided in Section 5. In Section 6, we present the results of numerical experiments which corroborate our theoretical findings. Lastly, in Section 7, we present some discussions and compare our work with the existing literature.

Section snippets

Problem formulation
In this section, we define the DCNNs to be studied in this paper and the corresponding hypothesis space (Subsection 2.1). Then, we introduce the regression setting with data and the regression function (Subsection 2.2).

Main results
In this section, we state our main results. The corresponding theorems will be proved in Sections 4 and 5. Here, we introduce our first result.

Our first result shows that the good generalization ability of any given non-interpolating DCNN (a "teacher DCNN") can be maintained by some interpolating output function f ∈ H_{int,J,s} of a "student DCNN" obtained by adding well-defined layers to the given DCNN. The learning and approximation ability is measured in regression by the L^2 norm ‖f‖_2 := {∫_Ω |f(x)|^2 dρ_Ω}^{1/2}.

Achieving interpolation condition by deepening DCNNs
In this section, we present a network deepening scheme for proving Theorem 1. Here, we first give an outline of the proof. Suppose we are given a non-interpolating DCNN that outputs a function f⁎. Theorem 1 tells us that we can construct a larger DCNN that interpolates any data while maintaining the generalization ability of f⁎. The proof is carried out by means of an interpolator of the form f(x) = f⁎(x) + ∑_{ℓ=1}^{n} (y_ℓ − f⁎(x_ℓ)) φ_ℓ(x), with φ_ℓ(x) = φ(η⁎ ξ · (x − x_ℓ)), where η⁎ ∈ {1, −1} is a sign number and ξ ∈ R^d is a …
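Although the snippet is cut off at this point, the displayed interpolating form can be illustrated concretely. The sketch below builds f(x) = f⁎(x) + ∑_ℓ (y_ℓ − f⁎(x_ℓ)) φ_ℓ(x) around an arbitrary "teacher" predictor f⁎; choosing φ as a ReLU-expressible hat bump, ξ as a random direction, and the bump width from the smallest gap between projected samples are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def hat(t, delta):
    """Piecewise-linear bump: equals 1 at t = 0 and 0 for |t| >= delta.
    ReLU-expressible: (ReLU(t + delta) - 2*ReLU(t) + ReLU(t - delta)) / delta."""
    r = lambda u: np.maximum(u, 0.0)
    return (r(t + delta) - 2.0 * r(t) + r(t - delta)) / delta

def make_interpolator(f_star, X, y, xi=None, seed=0):
    """Build f(x) = f_star(x) + sum_l (y_l - f_star(x_l)) * phi(xi . (x - x_l)).

    X: (n, d) training inputs; y: (n,) targets; f_star: callable on (m, d) arrays.
    The random direction xi and the hat bump phi are illustrative assumptions;
    the article snippet is truncated before these choices are specified.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if xi is None:
        xi = rng.standard_normal(d)        # generic direction: projections distinct a.s.
    proj = X @ xi                          # xi . x_l for each training point
    gaps = np.diff(np.sort(proj))
    delta = 0.5 * gaps[gaps > 0].min()     # bump narrower than the smallest gap
    coef = y - f_star(X)                   # residuals of the "teacher" predictor

    def f(Z):
        Z = np.atleast_2d(Z)
        # phi_l(z) = hat(xi . (z - x_l), delta): 1 at z = x_l, 0 at the other samples
        Phi = hat((Z @ xi)[:, None] - proj[None, :], delta)   # shape (m, n)
        return f_star(Z) + Phi @ coef

    return f

# Usage: the "student" f interpolates the data while perturbing f_star only
# near the projections of the training points.
rng = np.random.default_rng(1)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
f_star = lambda Z: np.atleast_2d(Z).sum(axis=1)   # stand-in for a trained DCNN output
f = make_interpolator(f_star, X, y)
print(np.allclose(f(X), y))                       # True: zero training error
```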
Learning rates of DCNNs
In this section, we derive learning rates of DCNNs for proving Theorem 2. In other words, we establish an upper bound on the excess error ε(π_M f_{D,J,S}) − ε(f_ρ) = ‖π_M f_{D,J,S} − f_ρ‖_2^2. Our analysis is based on an error decomposition (Subsection 5.1), followed by bounding the approximation error and the sampling error, respectively (Subsection 5.2). Afterward, we bound the number of layers and achieve the desired learning rates (Subsection 5.3). Our method is different from the approaches in the existing …

Numerical experiment
In this section, we present the results of numerical experiments on simulated data that corroborate our theoretical findings in Theorem 2. Our purpose is to demonstrate our theoretical findings; we do not intend to perform numerical experiments with real data. We conduct numerical experiments for four different input data dimensions d ∈ {10, 30, 50, 100}. For each d, we simulate training data sets {(x_i, y_i)}_{i=1}^{n} with n varying in {100, 300, 500, 700, 1000, 1500, 2000, 3000, 4000, 5000, 6000}, with each x_i ∈ R^d being a …
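Because the snippet is truncated, the paper's exact data-generating distribution and trained DCNN are not reproduced here. The sketch below only illustrates the stated protocol: loop over the grid of dimensions d and sample sizes n, simulate noisy regression data from an assumed target f_ρ, and estimate the excess error ‖f̂ − f_ρ‖_2^2 by Monte Carlo. The standard normal inputs, the particular f_ρ, and the ordinary least squares placeholder (standing in for a trained DCNN) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_rho(X):
    """Stand-in regression function; the target used in the paper's
    experiments is not given in the snippet, so this is an assumption."""
    return np.sin(X[:, 0]) + 0.5 * np.cos(X[:, :3].sum(axis=1))

def simulate(d, n, sigma=0.1):
    """Draw n inputs x_i in R^d (standard normal, an assumption) with
    noisy responses y_i = f_rho(x_i) + noise."""
    X = rng.standard_normal((n, d))
    y = f_rho(X) + sigma * rng.standard_normal(n)
    return X, y

def excess_error(f_hat, d, m=20_000):
    """Monte Carlo estimate of ||f_hat - f_rho||_2^2, the least-squares excess risk."""
    X = rng.standard_normal((m, d))
    return np.mean((f_hat(X) - f_rho(X)) ** 2)

# Harness over the grid of dimensions and sample sizes stated in the paper.
for d in (10, 30, 50, 100):
    for n in (100, 300, 500, 700, 1000, 1500, 2000, 3000, 4000, 5000, 6000):
        X, y = simulate(d, n)
        # Placeholder estimator: ordinary least squares with intercept.
        # A trained DCNN, as in the paper, would be plugged in here instead.
        A = np.hstack([X, np.ones((n, 1))])
        theta, *_ = np.linalg.lstsq(A, y, rcond=None)
        f_hat = lambda Z, th=theta: np.hstack([Z, np.ones((len(Z), 1))]) @ th
        print(d, n, round(excess_error(f_hat, d), 4))
```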
Related work, discussions, and conclusions
In this section, we discuss prior work on overparameterization in regression and on the generalization ability of DCNNs. As mentioned before, it is frequently observed that overparameterized neural networks generalize well with zero training error for the regression loss. Such a phenomenon is known as benign overfitting. Benign overfitting is characterized in [4] for linear least squares regression with Gaussian data and noise. Sufficient and necessary conditions are presented for the input covariance …

Acknowledgements
The authors would like to thank the referees for their detailed comments and constructive suggestions that led to an improved presentation of this paper. The authors are partially sponsored by NSF grants CCF-1740776, DMS-2015363, and IIS-2229876. They are also partially supported by the A. Russell Chandler III Professorship at Georgia Tech.

References (34)
Tong Mao et al. Theory of deep convolutional neural networks III: approximating radial functions. Neural Netw. (2021)
Ding-Xuan Zhou. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. (2020)
Martin Anthony et al. Neural Network Learning: Theoretical Foundations, vol. 9 (1999)
Peter Bartlett et al. Almost linear VC dimension bounds for piecewise polynomial networks. Adv. Neural Inf. Process. Syst. (1998)
Peter L. Bartlett et al. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. (2019)
Peter L. Bartlett et al. Benign overfitting in linear regression. Proc. Natl. Acad. Sci. (2020)
Mikhail Belkin et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. (2019)
Mikhail Belkin et al. To understand deep learning we need to understand kernel learning
Serge Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Math. Ann. (1927)
Helmut Bolcskei et al. Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. (2019)
Denis Bosq. Linear Processes in Function Spaces: Theory and Applications, vol. 149 (2000)
Yuan Cao et al. Benign overfitting in two-layer convolutional neural networks
Andrei Caragea et al. Neural network approximation and estimation of classifiers with classification boundary in a Barron class
Niladri S. Chatterji et al. The interplay between implicit bias and benign overfitting in two-layer linear networks. J. Mach. Learn. Res. (2022)
Ingrid Daubechies. Ten Lectures on Wavelets (1992)
Zhiying Fang et al. Optimal learning rates of deep convolutional neural networks: additive ridge functions. Trans. Mach. Learn. Res. (2023)
Ian Goodfellow et al. Deep Learning (2016)



Keywords:

None listed


Journal
Applied and Computational Harmonic Analysis
ISSN: 1063-5203
Source: Elsevier BV