
Approximating Functions with Multi-Features on Sphere by Deep Convolutional Neural Networks

Submitted: 02 December 2025
Posted: 03 December 2025


Abstract
In recent years, deep convolutional neural networks (DCNNs) have demonstrated remarkable success in approximating functions with multiple features. However, several challenges remain unresolved, including the approximation of target functions in Sobolev spaces defined on the unit sphere and the extension of the classes of intrinsic features that can be extracted. To address these issues, we propose a DCNN architecture with multiple downsampling layers to approximate multi-feature functions in Sobolev spaces on the unit sphere. Our method enables automatic feature extraction without requiring prior knowledge of the underlying composite structure and alleviates the curse of dimensionality in function approximation by extracting general smooth and spherical polynomial features. Compared with previous approaches, the proposed architecture captures a wider variety of features.

1. Introduction

Deep neural networks have demonstrated exceptional computational power and benefited greatly from the availability of large-scale data [1], leading to remarkable performance across a wide range of scientific and engineering applications. Among these architectures, deep convolutional neural networks (DCNNs), which consist of multiple convolutional layers, have proven particularly effective. DCNNs have been successfully applied to speech recognition, image classification, strategic decision-making in board games, and many other practical tasks [2,3,25,26,27].
In the context of approximation theory, DCNNs have shown strong potential [4,5,6,7,12,28], and their approximation capabilities are often regarded as a form of nonlinear approximation [31]. Early foundational work [13,29,30] examined shallow neural networks with sigmoidal-type activation functions for function approximation. More recently, DCNNs with rectified linear unit (ReLU) activation functions have been shown to possess universal approximation properties for continuous functions [8]. Furthermore, [9] demonstrated that DCNNs require only one-eighth of the free parameters needed by fully connected neural networks (FNNs) to approximate smooth functions.
Nevertheless, several challenges persist. For instance, the error rates reported in [10,11,19] suffer from the curse of dimensionality, with the approximation error deteriorating as the input dimension $d$ increases. In an effort to address this, [21] considered the approximation of composite functions of the form $f(Q(x))$ in the Sobolev space $W^\beta(B^d)$ with $0 < \beta \le 1$, where $Q(x)$ is a polynomial. That approach reaches accuracy $\epsilon \in (0,1)$ with a number of free parameters of order $O(\epsilon^{-1/\beta})$. However, it assumes a single feature, implying strong dependence among the input variables.
It has been shown that convolutional layers can effectively extract relevant features [32,33]. Neural networks, owing to their mesh-free architectures, have also demonstrated the ability to overcome the curse of dimensionality when the target functions exhibit intrinsic structures [17,18,34], possess fast-decaying Fourier coefficients [13,14], or belong to general mixed-smoothness function spaces [15,16]. To further reduce variable dependence and alleviate the curse of dimensionality, [20] proposed the approximation of functions with multiple features in the Sobolev space $W^\beta([0,1]^d)$ with $0 < \beta \le 1$. This setting is more realistic for complex tasks, where numerous features are typically required to describe objects. In this case, the required number of free parameters becomes $O(\epsilon^{-d'/\beta})$, where $d'$ denotes the number of features and $d' \le d$. This indicates that the curse of dimensionality can be mitigated by extracting features; however, the rate is still worse than in the single-feature case $d' = 1$, as the input variables exhibit reduced dependence.
Despite these advances, several challenges remain unresolved. This paper focuses on approximating functions with multiple features defined on the unit sphere, i.e., $f \in W^\beta(\mathbb{S}^{d-1})$ with $\beta \in (0,1]$. The target functions are of the form
$$f(x) = f\big(F_1(x), F_2(x), \ldots, F_{d'}(x)\big), \qquad x \in \mathbb{S}^{d-1},$$
where each $F_\tau(x)$, $\tau = 1, \ldots, d'$, is a smooth function in $W^r(\mathbb{S}^{d-1})$ with $r > 0$. In addition, to further alleviate the curse of dimensionality in function approximation, we approximate functions with spherical polynomial features and with symmetric polynomial features in Sections 3.2 and 3.3, respectively. The DCNN architecture constructed in Section 3.1 is capable of extracting three types of features: general smooth features, spherical polynomial features, and symmetric polynomial features. In contrast, the DCNNs constructed in [20] are limited to extracting only polynomial and symmetric polynomial features.

2. Preliminaries

We begin by establishing notation for function classes. Let $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ denote the unit sphere in $\mathbb{R}^d$.

2.1. Deep Convolutional Neural Networks with Down Sampling

DCNNs, constructed from convolutional layers, have demonstrated significant effectiveness in tasks such as image classification [3]. These networks consist of a sequence of convolutional filters $w = \{w^{(j)} : \mathbb{Z} \to \mathbb{R}\}_{j=1}^{J}$, where each filter $w^{(j)}$ is supported on the finite index set $\{0, \ldots, s^{(j)}\}$ for some $s^{(j)} \in \mathbb{N}$, referred to as the filter length. In this work, we assume a uniform filter length across layers, that is, $s^{(j)} \equiv s \in \mathbb{N}$, which yields the dimensionality sequence $\{d_j = d + js\}_{j=1}^{J}$. The activation function employed is the ReLU,
$$\sigma(u) = \begin{cases} u, & u \ge 0,\\ 0, & u < 0. \end{cases}$$
Given a filter $w$ supported on the set $\{0, \ldots, s\}$, its convolution with a sequence $v = (v_1, \ldots, v_{d_{j-1}})$ produces a new sequence $w * v$ defined by
$$(w * v)_i = \sum_{k \in \mathbb{Z}} w_{i-k} v_k = \sum_{k=1}^{d_{j-1}} w_{i-k} v_k, \qquad i \in \mathbb{Z},$$
which is supported on the set $\{1, \ldots, d_{j-1} + s\}$. A DCNN comprising $J$ hidden layers, with neuron mappings $\{h^{(j)} : \mathbb{R}^d \to \mathbb{R}^{d_j}\}_{j=1}^{J}$, is defined iteratively, starting from the input vector $h^{(0)}(x) = x$, $x = (x_1, \ldots, x_d)$, by
$$h^{(j)}(x) = \sigma\big( T^{(j)} h^{(j-1)}(x) - b^{(j)} \big),$$
where $T^{(j)}$ is the Toeplitz-type convolutional matrix
$$T^{(j)} := \big( w^{(j)}_{i-k} \big)_{i=1,\ldots,d_j,\; k=1,\ldots,d_{j-1}},$$
associated with a filter $w^{(j)}$ of filter length $s$ and $d_j \in \mathbb{N}$, given explicitly (blank entries being zero) by
$$T^{(j)} = \begin{pmatrix}
w^{(j)}_0 & & & \\
\vdots & \ddots & & \\
w^{(j)}_s & \cdots & w^{(j)}_0 & \\
 & \ddots & \ddots & \\
 & & w^{(j)}_s \;\cdots & w^{(j)}_0\\
 & & & \vdots\\
 & & & w^{(j)}_s
\end{pmatrix} \in \mathbb{R}^{d_j \times d_{j-1}}.$$
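To make the convolutional structure concrete, the following sketch (our own illustration, not part of the paper's construction; the helper name `toeplitz_conv_matrix` is ours) builds the Toeplitz-type matrix $T^{(j)}$ from a filter supported on $\{0, \ldots, s\}$ and checks that $T^{(j)} v$ coincides with the one-dimensional convolution $w * v$.

```python
import numpy as np

def toeplitz_conv_matrix(w, d_in):
    """Toeplitz-type matrix T with T[i, k] = w[i - k] for a filter w
    supported on {0, ..., s}; shape (d_in + s) x d_in."""
    s = len(w) - 1
    T = np.zeros((d_in + s, d_in))
    for i in range(d_in + s):          # rows 1, ..., d_in + s (0-based here)
        for k in range(d_in):          # columns 1, ..., d_in
            if 0 <= i - k <= s:        # entry w_{i-k}, zero outside the support
                T[i, k] = w[i - k]
    return T

rng = np.random.default_rng(0)
s, d_in = 2, 5
w = rng.standard_normal(s + 1)         # filter supported on {0, ..., s}
v = rng.standard_normal(d_in)

T = toeplitz_conv_matrix(w, d_in)
# np.convolve computes (w * v)_i = sum_k w_{i-k} v_k on the full support,
# which matches the rows of T (indices 1, ..., d_in + s in the paper's notation).
assert np.allclose(T @ v, np.convolve(w, v))
print(T.shape)                         # (7, 5) = (d_in + s, d_in)
```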
A constraint is imposed on the bias vectors $\{b^{(j)}\}_{j=1}^{J}$ associated with the convolutional layers:
$$b^{(j)}_{s+1} = \cdots = b^{(j)}_{d_j - s}, \qquad j = 1, \ldots, J.$$
Definition 1 
(Deep Convolutional Neural Networks with Downsampling). Let $x = (x_1, x_2, \ldots, x_d) \in \mathbb{S}^{d-1}$ represent the input data vector, and let $s \in \mathbb{N}$ denote the filter length. A downsampled DCNN with mappings $\{h^{(j)} : \mathbb{R}^d \to \mathbb{R}^{d_j}\}_{j=1}^{J}$, incorporating downsampling operations at selected layers $J_1 < J_2 < \cdots < J_k$, is defined recursively. The layer widths $\{d_j\}_{j=1}^{J}$ are determined as follows: $d_0 = d$, $d_{J_1} = \big\lfloor \frac{d + J_1 s}{D_1} \big\rfloor$, $d_{J_2} = \big\lfloor \frac{d_{J_1} + (J_2 - J_1)s}{D_2} \big\rfloor$, and in general $d_{J_k} = \big\lfloor \frac{d_{J_{k-1}} + (J_k - J_{k-1})s}{D_k} \big\rfloor$, while
$$d_j = d_{j-1} + s, \qquad j \in \{1, 2, \ldots, J\} \setminus \{J_1, J_2, \ldots, J_k\}.$$
The network is the sequence of vector-valued functions defined iteratively from $h^{(0)}(x) = x$. The downsampling operator $\mathcal{D}_D : \mathbb{R}^{\hat D} \to \mathbb{R}^{\lfloor \hat D / D \rfloor}$, parameterized by a scaling factor $D \le \hat D$, is defined by
$$\mathcal{D}_D(v) = \big( v_{iD} \big)_{i=1}^{\lfloor \hat D / D \rfloor}, \qquad v = (v_i)_{i=1}^{\hat D} \in \mathbb{R}^{\hat D},$$
where $\lfloor u \rfloor$ denotes the integer part of $u > 0$. Denote by $\mathcal{C}_{T,b}$ the activated affine mapping induced by the connection matrix $T$ and bias vector $b$,
$$\mathcal{C}_{T,b}(x) = \sigma(Tx - b),$$
and set
$$h^{(j)}(x) = \begin{cases} \mathcal{C}_{T^{(j)}, b^{(j)}}\big(h^{(j-1)}(x)\big), & \text{if } J_{k-1} < j < J_k,\\ \mathcal{D}_{D_k}\Big(\mathcal{C}_{T^{(j)}, b^{(j)}}\big(h^{(j-1)}(x)\big)\Big), & \text{if } j = J_k, \end{cases}$$
where $J_0 = 0$ and $\{D_1, D_2, \ldots, D_k\}$ denote the downsampling scaling factors applied at the layers $\{J_1, J_2, \ldots, J_k\}$. The bias vectors satisfy
$$\big(b^{(j)}\big)_{s+1} = \big(b^{(j)}\big)_{s+2} = \cdots = \big(b^{(j)}\big)_{d_{j-1}}, \qquad j \notin \{J_1, J_2, \ldots, J_k\}.$$
In this study, we refer to the first $J_1$ convolutional layers as the first group, to the next $J_2 - J_1$ layers as the second group, and in general to the $J_k - J_{k-1}$ layers following layer $J_{k-1}$ as the $k$-th group of convolutional layers.
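As an illustration of Definition 1, the following minimal sketch (with arbitrary random filters and biases, not the specific construction of Section 3.1; all helper names are ours, and the downsampling layer and factor are chosen arbitrarily) runs the recursion $h^{(j)}(x) = \sigma\big(T^{(j)} h^{(j-1)}(x) - b^{(j)}\big)$ and applies the operator $\mathcal{D}_{D_k}$ at the designated layers.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def toeplitz_conv_matrix(w, d_in):
    """T[i, k] = w[i - k] for a filter w supported on {0, ..., len(w)-1}."""
    s = len(w) - 1
    T = np.zeros((d_in + s, d_in))
    for i in range(d_in + s):
        for k in range(d_in):
            if 0 <= i - k <= s:
                T[i, k] = w[i - k]
    return T

def downsample(v, D):
    """D_D(v) = (v_{iD})_{i=1}^{floor(len(v)/D)} (1-based indexing in the paper)."""
    return v[D - 1::D]

def forward(x, filters, biases, down_layers):
    """h^(j) = sigma(T^(j) h^(j-1) - b^(j)), downsampled by D_k whenever j = J_k."""
    h = x
    for j, (w, b) in enumerate(zip(filters, biases), start=1):
        h = relu(toeplitz_conv_matrix(w, len(h)) @ h - b)
        if j in down_layers:
            h = downsample(h, down_layers[j])
    return h

rng = np.random.default_rng(1)
d, s, J = 6, 2, 4
down_layers = {2: 2}                   # downsample by a factor of 2 at layer J_1 = 2
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                 # input on the unit sphere S^{d-1}

widths = [d]                           # d_0 = d; d_j = d_{j-1} + s, with floor division at J_k
for j in range(1, J + 1):
    wj = widths[-1] + s
    if j in down_layers:
        wj = wj // down_layers[j]
    widths.append(wj)

filters = [rng.standard_normal(s + 1) for _ in range(J)]
biases = [rng.standard_normal(widths[j - 1] + s) for j in range(1, J + 1)]
print(forward(x, filters, biases, down_layers).shape)   # (widths[J],) = (d_J,)
```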

2.2. Spherical Harmonics and Sobolev Space on the Sphere

Let $\mathcal{H}_k^d$ represent the space of all spherical harmonics of degree $k$ on $\mathbb{S}^{d-1}$. This space can be characterized as the eigenspace of the Laplace-Beltrami operator $\Delta_0$ on $\mathbb{S}^{d-1}$ (see [22], Chapter 1.4):
$$\mathcal{H}_k^d = \big\{ F \in C^2(\mathbb{S}^{d-1}) : \Delta_0 F = -\lambda_k F \big\},$$
where $\lambda_k = k(k + d - 2)$. The dimension of the linear space $\mathcal{H}_k^d$ is
$$N(k,d) = \binom{k+d-1}{k} - \binom{k+d-3}{k-2} = O\big(k^{d-2}\big).$$
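As a quick numerical check of this formula (our own illustration), for $d = 3$ it reproduces the familiar count $N(k,3) = 2k+1$ of spherical harmonics of degree $k$ on $\mathbb{S}^2$, with the convention that the second binomial coefficient vanishes for $k < 2$.

```python
from math import comb

def harmonic_dim(k, d):
    """N(k, d) = C(k+d-1, k) - C(k+d-3, k-2), the dimension of H_k^d."""
    second = comb(k + d - 3, k - 2) if k >= 2 else 0
    return comb(k + d - 1, k) - second

assert [harmonic_dim(k, 3) for k in range(5)] == [1, 3, 5, 7, 9]   # 2k+1 on S^2
print([harmonic_dim(k, 5) for k in range(5)])                      # grows like k^{d-2} = k^3
```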
Let $\mathrm{Proj}_k : L^2(\mathbb{S}^{d-1}) \to \mathcal{H}_k^d$ denote the orthogonal projection onto $\mathcal{H}_k^d$. The spaces $\mathcal{H}_k^d$, $k \in \mathbb{Z}_+$, are mutually orthogonal with respect to the inner product of $L^2(\mathbb{S}^{d-1})$, and every $F \in L^2(\mathbb{S}^{d-1})$ possesses a spherical harmonic expansion
$$F = \sum_{k=0}^{\infty} \mathrm{Proj}_k F = \sum_{k=0}^{\infty} \sum_{l=1}^{N(k,d)} \hat F_{k,l}\, Y_{k,l},$$
where $\{Y_{k,l}\}_{l=1}^{N(k,d)}$ forms an orthonormal basis of $\mathcal{H}_k^d$ and $\hat F_{k,l}$ denotes the Fourier coefficients of $F$, given by
$$\hat F_{k,l} = \langle F, Y_{k,l} \rangle_{L^2(\mathbb{S}^{d-1})} = \int_{\mathbb{S}^{d-1}} F(x)\, Y_{k,l}(x)\, d\mu(x).$$
Here $\mu$ denotes the normalized spherical measure on $\mathbb{S}^{d-1}$, obtained by dividing the surface measure by the surface area $\omega_d = \frac{2\pi^{d/2}}{\Gamma(d/2)}$, so that $\mu(\mathbb{S}^{d-1}) = 1$.
In addition, $\mathrm{Proj}_k F(x)$ has the integral representation
$$\mathrm{Proj}_k F(x) = \int_{\mathbb{S}^{d-1}} F(y)\, Z_k(x,y)\, d\mu(y), \qquad x \in \mathbb{S}^{d-1},$$
where
$$Z_k(x,y) = \sum_{l=1}^{N(k,d)} Y_{k,l}(x)\, Y_{k,l}(y), \qquad x, y \in \mathbb{S}^{d-1}.$$
It can easily be shown that $Z_k(x,y)$ is the reproducing kernel of $\mathcal{H}_k^d$ and is independent of the choice of $\{Y_{k,l}\}_{l=1}^{N(k,d)}$. Furthermore, with $\lambda = \frac{d-2}{2}$,
$$Z_k(x,y) = \frac{k+\lambda}{\lambda}\, C_k^{\lambda}\big(\langle x, y\rangle\big), \qquad x, y \in \mathbb{S}^{d-1},$$
where $C_k^{\lambda}(t)$ is the Gegenbauer polynomial of degree $k$ with parameter $\lambda > -\frac{1}{2}$, as discussed, for instance, in [22]. Consequently, for any $x, y \in \mathbb{S}^{d-1}$, $\frac{k+\lambda}{\lambda} C_k^{\lambda}(\langle x, y\rangle)$ serves as a reproducing kernel of $\mathcal{H}_k^d$ in the sense that
$$\int_{\mathbb{S}^{d-1}} p(y)\, \frac{k+\lambda}{\lambda}\, C_k^{\lambda}\big(\langle x, y\rangle\big)\, d\mu(y) = p(x), \qquad p \in \mathcal{H}_k^d.$$
Definition 2 
(Sobolev Space on the Sphere). Let $d \in \mathbb{N}$ with $d \ge 3$, and let $r > 0$. The Sobolev space $W^r(\mathbb{S}^{d-1})$ is defined as
$$W^r(\mathbb{S}^{d-1}) := \Big\{ F \in L^{\infty}(\mathbb{S}^{d-1}) : \big\| (-\Delta_0 + I)^{r/2} F \big\|_{L^{\infty}(\mathbb{S}^{d-1})} < \infty \Big\},$$
with the norm given by
$$\|F\|_{W^r(\mathbb{S}^{d-1})} := \big\| (-\Delta_0 + I)^{r/2} F \big\|_{L^{\infty}(\mathbb{S}^{d-1})} = \Big\| \sum_{k=0}^{\infty} (1 + \lambda_k)^{r/2} \sum_{l=1}^{N(k,d)} \hat F_{k,l}\, Y_{k,l} \Big\|_{L^{\infty}(\mathbb{S}^{d-1})}.$$

2.3. Near-Optimal Approximation and Cubature Formula

The optimal approximation of a function by polynomial spaces of varying degrees is generally nonlinear. In spherical harmonic analysis, a valuable tool for approximation is a family of linear operators $K_n$ defined as follows.
Let $\eta \in C^{\infty}([0,\infty))$ be a smooth cutoff function such that $\eta(t) = 1$ for $t \in [0,1]$, $0 \le \eta(t) \le 1$ for $t \in [1,2]$, and $\eta(t) = 0$ for $t \ge 2$. Define
$$k_n(t) = \sum_{k=0}^{2n} \eta\Big(\frac{k}{n}\Big)^2\, \frac{\lambda + k}{\lambda}\, C_k^{\lambda}(t), \qquad t \in [-1,1],$$
and introduce a sequence of linear operators $K_n$, $n \in \mathbb{Z}_+$, on the space $L^{\infty}(\mathbb{S}^{d-1})$ as follows:
$$K_n(F)(x) = \int_{\mathbb{S}^{d-1}} F(y)\, k_n\big(\langle x, y\rangle\big)\, d\mu(y), \qquad x \in \mathbb{S}^{d-1}.$$
This integral operator is bounded and yields effective approximations for functions belonging to Sobolev spaces.
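The kernel $k_n$ is straightforward to realize numerically. The sketch below (our own illustration; the particular $C^\infty$ cutoff $\eta$ is one admissible choice among many, and the kernel follows the definition given above) assembles $k_n(t)$ for $d = 3$, i.e. $\lambda = \frac{d-2}{2} = \frac{1}{2}$, using SciPy's Gegenbauer polynomials.

```python
import numpy as np
from scipy.special import eval_gegenbauer

def eta(t):
    """A standard C^infinity cutoff: eta = 1 on [0, 1], 0 on [2, inf), smooth in between."""
    t = np.asarray(t, dtype=float)
    def g(x):
        return np.where(x > 0, np.exp(-1.0 / np.maximum(x, 1e-300)), 0.0)
    return g(2.0 - t) / (g(2.0 - t) + g(t - 1.0))

def kernel_k_n(t, n, d=3):
    """k_n(t) = sum_{k=0}^{2n} eta(k/n)^2 * (k + lambda)/lambda * C_k^lambda(t),
    with lambda = (d - 2) / 2."""
    lam = (d - 2) / 2.0
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    for k in range(2 * n + 1):
        out += eta(k / n) ** 2 * (k + lam) / lam * eval_gegenbauer(k, lam, t)
    return out

t = np.linspace(-1.0, 1.0, 5)
print(kernel_k_n(t, n=8))   # a polynomial of degree at most 2n in t
```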
Proposition 1 
([19]). For $n \in \mathbb{N}$, $r > 0$ and $F \in W^r(\mathbb{S}^{d-1})$, there holds
$$\big\| F - K_n(F) \big\|_{L^{\infty}(\mathbb{S}^{d-1})} \le C_{\eta}\, 2^{d-1}\, n^{-r}\, \|F\|_{W^r(\mathbb{S}^{d-1})},$$
where $C_{\eta}$ is a constant depending only on the function $\eta$. Furthermore, with a constant $C_{\eta,d} > 0$ depending on $d$ and $\eta$,
$$\big\| K_n(F) \big\|_{L^{\infty}(\mathbb{S}^{d-1})} \le C_{\eta,d}\, \|F\|_{W^r(\mathbb{S}^{d-1})}.$$
To obtain a discrete representation of $K_n(F)$, a cubature formula is employed for integrating polynomials of degree up to $4n$ on the unit sphere $\mathbb{S}^{d-1}$, $d \ge 3$ ([23], Theorem 3.1).
(Cubature Formula) There exists a constant $C_d > 0$ depending only on $d$ such that for any $m \ge C_d n^{d-1}$, there exist positive weights $\gamma_l$ and points $y_l \in \mathbb{S}^{d-1}$, $l = 1, \ldots, m$, satisfying
$$\int_{\mathbb{S}^{d-1}} F(x)\, d\mu(x) = \sum_{l=1}^{m} \gamma_l\, F(y_l), \qquad F \in \Pi_{4n}(\mathbb{S}^{d-1}),$$
where $\Pi_{4n}(\mathbb{S}^{d-1})$ denotes the space of spherical polynomials of degree up to $4n$ on $\mathbb{S}^{d-1}$. Furthermore, for $F \in \Pi_{4n}(\mathbb{S}^{d-1})$, the following norm equivalence holds:
$$\|F\|_{L^{\infty}(\mathbb{S}^{d-1})} \approx \max_{l=1,\ldots,m}\, n^{d-1}\, \gamma_l\, |F(y_l)|,$$
where $A \approx B$ means that there exist constants $C_1, C_2 > 0$, independent of $n$ and $m$, such that $C_1 A \le B \le C_2 A$. In particular, we say that the family $\{(\gamma_l, y_l)\}_{l=1}^{m}$ forms a cubature rule of degree $4n$.
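The cubature formula above is an existence statement. For $d = 3$, one concrete classical choice, shown here purely as an illustration and not the construction of [23], is a product rule combining Gauss-Legendre nodes in $\cos\theta$ with equally spaced azimuthal angles; with enough nodes it integrates all spherical polynomials up to a prescribed degree exactly against the normalized measure $\mu$.

```python
import numpy as np

def product_cubature_on_S2(degree):
    """Positive-weight cubature on S^2, exact (up to rounding) for spherical
    polynomials of degree <= `degree`, w.r.t. the normalized measure mu."""
    n_theta = degree // 2 + 1            # Gauss-Legendre exact up to 2*n_theta - 1 >= degree
    n_phi = degree + 1                   # trapezoid in phi removes azimuthal modes 0 < |m| <= degree
    x_gl, w_gl = np.polynomial.legendre.leggauss(n_theta)   # nodes/weights in cos(theta) on [-1, 1]
    phis = 2 * np.pi * np.arange(n_phi) / n_phi
    points, weights = [], []
    for c, w in zip(x_gl, w_gl):
        s = np.sqrt(1.0 - c * c)
        for phi in phis:
            points.append([s * np.cos(phi), s * np.sin(phi), c])
            weights.append(w / (2.0 * n_phi))                # weights sum to mu(S^2) = 1
    return np.array(points), np.array(weights)

# Sanity check: F(x) = x_3^4 has degree 4 and integral 1/5 with respect to mu.
pts, wts = product_cubature_on_S2(degree=4)
approx = np.sum(wts * pts[:, 2] ** 4)
assert abs(approx - 0.2) < 1e-12
print(len(pts), approx)
```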

3. Approximating Functions with Multiple Features on Sphere

In practical deep learning applications, the input dimension d of the target function is often quite large. As d increases, the approximation rates generally worsen exponentially, leading to what is known as the curse of dimensionality. This poses significant challenges when working with high-dimensional functions.
However, when the input variables are interdependent, meaning there are inherent structures within the target functions, DCNNs can help alleviate the curse of dimensionality by making use of these underlying relationships.
Our first target is to approximate composite functions with general smooth features,
$$f \in W^{\beta}\big([-B_F, B_F]^{d'}\big), \qquad \beta \in (0,1],$$
with $B_F = \max\{\|F_\tau\|_{\infty}\}_{\tau=1}^{d'}$:
$$f(x) = f\big(F_1(x), F_2(x), \ldots, F_{d'}(x)\big), \qquad x \in \mathbb{S}^{d-1},$$
where $d' \le d$ and $\{F_\tau(x)\}_{\tau=1}^{d'} \subset W^r(\mathbb{S}^{d-1})$ with $r > 0$. Since the outer function $f$ is Lipschitz-$\beta$ continuous, its norm is given by
$$\|f\|_{W^{\beta}([-B_F,B_F]^{d'})} := \sup_{x \in [-B_F,B_F]^{d'}} |f(x)| + \sup_{x \ne y \in [-B_F,B_F]^{d'}} \frac{|f(x) - f(y)|}{|x - y|^{\beta}}.$$
Theorem 1. 
Let $2 \le s \le d$, $d \ge 3$, $d' \le d$, $N \in \mathbb{N}$, and let $\{F_\tau(x)\}_{\tau=1}^{d'}$ be smooth functions (i.e., $F_\tau \in W^r(\mathbb{S}^{d-1})$ for some $r > 0$). Define $B_F = \max\{\|F_\tau\|_{\infty}\}_{\tau=1}^{d'}$, and let $f \in W^{\beta}([-B_F, B_F]^{d'})$ with $\beta \in (0,1]$. For approximating functions of the form (20), there exists a downsampled DCNN as defined in Definition 1 (explicitly constructed as in Section 3.1), with downsampling scaling parameters $\{d, 1\}$ at layers $\{J_1, J_2\}$, respectively, filters $\{w^{(j)}\}_{j=1}^{J_2}$ of uniform length $s$, bias vectors $\{b^{(j)}\}_{j=1}^{J_2+1}$, a connection matrix $F^{(J_2+1)}$, and a coefficient vector $c^{(J_2+1)} \in \mathbb{R}^{N}$, such that
$$\big\| c^{(J_2+1)} \cdot h^{(J_2+1)} - f \big\|_{L^{\infty}(\mathbb{S}^{d-1})} \le c_1\, \|f\|_{W^{\beta}}\, N^{-\frac{\beta}{d'}},$$
where $c_1 = c_1(d, d', \beta, F_\tau, \eta)$ and $\eta \in C^{\infty}([0,\infty))$ is the cutoff function defined in Section 2.3.

3.1. Proof of Theorem 1

Before the proof, we first introduce two lemmas for approximating the inner and outer functions.
Define the kernel
$$l_n(t) = \sum_{k=0}^{2n} \eta\Big(\frac{k}{n}\Big)\, \frac{\lambda + k}{\lambda}\, C_k^{\lambda}(t), \qquad t \in [-1,1],$$
which is a polynomial of degree $2n$. According to [19], combined with Proposition 1 and the cubature formula, we have
$$\big\| F_\tau - K_n(F_\tau) \big\|_{L^{\infty}(\mathbb{S}^{d-1})} = \Big\| F_\tau - \sum_{l=1}^{m} \gamma_l\, L_n(F_\tau)(y_l)\, l_n\big(\langle y_l, \cdot\rangle\big) \Big\|_{L^{\infty}(\mathbb{S}^{d-1})} \le C_{\eta}\, 2^{d-1}\, n^{-r}\, \|F_\tau\|_{W^r(\mathbb{S}^{d-1})},$$
where
$$L_n(F_\tau)(y_l) = \int_{\mathbb{S}^{d-1}} F_\tau(z)\, l_n\big(\langle y_l, z\rangle\big)\, d\mu(z),$$
and $\{(\gamma_l, y_l)\}_{l=1}^{m}$ forms a cubature rule of degree $4n$.
For $N_1 \in \mathbb{N}$, let $t = \{t_i\}_{i=1}^{2N_1+3}$ be the uniform mesh on $[-1 - \frac{1}{N_1},\, 1 + \frac{1}{N_1}]$ with $t_i = -1 + \frac{i-2}{N_1}$. Construct a linear operator $I_t$ on $C[-1,1]$ by
$$I_t(l_n)(u) = \sum_{i=2}^{2N_1+2} l_n(t_i)\, \delta_i(u), \qquad u \in [-1,1],\ l_n \in C[-1,1],$$
where $\delta_i \in C(\mathbb{R})$, $i = 2, \ldots, 2N_1+2$, is given by
$$\delta_i(u) = N_1 \big( \sigma(u - t_{i-1}) - 2\sigma(u - t_i) + \sigma(u - t_{i+1}) \big).$$
To facilitate the analysis of the number of free parameters, we define a linear operator $L_{N_1} : \mathbb{R}^{2N_1+1} \to \mathbb{R}^{2N_1+3}$, which acts on $\zeta = (\zeta_i)_{i=2}^{2N_1+2} \in \mathbb{R}^{2N_1+1}$ as follows:
$$\big( L_{N_1}(\zeta) \big)_i = \begin{cases} \zeta_2, & i = 1,\\ \zeta_3 - 2\zeta_2, & i = 2,\\ \zeta_{i-1} - 2\zeta_i + \zeta_{i+1}, & 3 \le i \le 2N_1+1,\\ \zeta_{2N_1+1} - 2\zeta_{2N_1+2}, & i = 2N_1+2,\\ \zeta_{2N_1+2}, & i = 2N_1+3. \end{cases}$$
This operator $L_{N_1}$ allows the approximation operator $I_t$ on $C([-1,1])$ to be represented in terms of $\{\sigma(\cdot - t_i)\}_{i=1}^{2N_1+3}$ as
$$I_t(l_n) = N_1 \sum_{i=1}^{2N_1+3} \Big( L_{N_1}\big( \{l_n(t_k)\}_{k=2}^{2N_1+2} \big) \Big)_i\, \sigma(\cdot - t_i).$$
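The operator $I_t$ is a piecewise-linear quasi-interpolant built from ReLU hat functions. The sketch below (our own sanity check, with a generic continuous function $g$ standing in for $l_n$) verifies that $\sum_i g(t_i)\,\delta_i(u)$ and the $L_{N_1}$-based ReLU expansion above produce the same function.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

N1 = 5
t = np.array([-1.0 + (i - 2) / N1 for i in range(1, 2 * N1 + 4)])   # t_1, ..., t_{2N1+3}

def delta(i, u):
    """delta_i(u) = N1 * (relu(u - t_{i-1}) - 2 relu(u - t_i) + relu(u - t_{i+1}))."""
    return N1 * (relu(u - t[i - 2]) - 2 * relu(u - t[i - 1]) + relu(u - t[i]))

def I_t(g, u):
    """I_t(g)(u) = sum_{i=2}^{2N1+2} g(t_i) delta_i(u)."""
    return sum(g(t[i - 1]) * delta(i, u) for i in range(2, 2 * N1 + 3))

def L_N1(zeta):
    """zeta = (zeta_2, ..., zeta_{2N1+2}); returns the 2N1+3 ReLU coefficients."""
    z = {i: zeta[i - 2] for i in range(2, 2 * N1 + 3)}               # 1-based lookup
    out = [z[2], z[3] - 2 * z[2]]
    out += [z[i - 1] - 2 * z[i] + z[i + 1] for i in range(3, 2 * N1 + 2)]
    out += [z[2 * N1 + 1] - 2 * z[2 * N1 + 2], z[2 * N1 + 2]]
    return np.array(out)

g = np.cos                                                            # any continuous function on [-1, 1]
coeffs = L_N1(np.array([g(t[i - 1]) for i in range(2, 2 * N1 + 3)]))
u = np.linspace(-1.0, 1.0, 201)
relu_form = N1 * sum(coeffs[i - 1] * relu(u - t[i - 1]) for i in range(1, 2 * N1 + 4))
assert np.allclose(I_t(g, u), relu_form)
print(np.max(np.abs(I_t(g, u) - g(u))))   # small: piecewise-linear interpolation error
```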
By [19,24], we have
$$\Big\| \sum_{l=1}^{m} \gamma_l\, L_n(F_\tau)(y_l)\, l_n\big(\langle y_l, \cdot\rangle\big) - \sum_{l=1}^{m} \gamma_l\, L_n(F_\tau)(y_l)\, I_t(l_n)\big(\langle y_l, \cdot\rangle\big) \Big\|_{\infty} \le C\, 3^{d+2}\, n^{d+3}\, C_1\, C_{\eta,d}\, N_1^{-2}\, \|F_\tau\|_{W^r},$$
where $C > 0$ is a constant. Combining this with (23) and (27), we obtain the following lemma:
Lemma 1. 
Let $d \ge 3$, $r > 0$, $n \in \mathbb{N}$, $N_1 \in \mathbb{N}$. There exist $\gamma_l \in \mathbb{R}$ and $y_l \in \mathbb{S}^{d-1}$, $l = 1, 2, \ldots, m$, such that for any $F_\tau \in W^r(\mathbb{S}^{d-1})$ ($\tau = 1, 2, \ldots, d'$),
$$\Big\| F_\tau - \sum_{l=1}^{m} \gamma_l\, L_n(F_\tau)(y_l)\, I_t(l_n)\big(\langle y_l, \cdot\rangle\big) \Big\|_{\infty} \le \tilde c\, \|F_\tau\|_{W^r(\mathbb{S}^{d-1})}\, \max\big\{ n^{-r},\ n^{d+3} N_1^{-2} \big\},$$
where $\tilde c = C_{\eta}\, 2^{d-1} + C \cdot 3^{d+2}\, C_1\, C_{\eta,d}$.
For approximating the outer function $f$ of the form (20), we use a result on approximation by shallow sigmoidal neural networks with scaling [35].
Lemma 2 
([35]). Let $d', N \in \mathbb{N}$, $\beta > 0$. For any $f \in W^{\beta}([-B_F, B_F]^{d'})$, there exist $\{\hat\alpha_k\}_{k=1}^{N} \subset \mathbb{R}^{d'}$, $\{\hat b_k\}_{k=1}^{N} \subset \mathbb{R}$, $\{\hat c_k\}_{k=1}^{N} \subset \mathbb{R}$, and a constant $C_{d',\beta}$ depending on $d'$ and $\beta$, such that
$$\sup_{z \in [-B_F, B_F]^{d'}} \Big| f(z) - \sum_{k=1}^{N} \hat c_k\, \hat\sigma\big( \hat\alpha_k \cdot z - \hat b_k \big) \Big| \le B_F^{\beta}\, C_{d',\beta}\, \|f\|_{W^{\beta}}\, N^{-\frac{\beta}{d'}},$$
where $\hat\sigma$ is an activation function of sigmoidal type.
Proof 
(Proof of Theorem 1).
For approximating target functions of the form (20), we construct a DCNN consisting of a series of convolutional layers and a final fully connected layer. The convolutional transformations are represented by matrices $\{ T^{(j)} := T^{w^{(j)}} \in \mathbb{R}^{(d_{j-1}+s) \times d_{j-1}} \}_{j=1}^{J_2}$, each induced by a corresponding filter $w^{(j)}$, $j = 1, \ldots, J_2$, supported in $\{0, 1, \ldots, s\}$. Associated with these layers are bias vectors $\{ b^{(j)} \in \mathbb{R}^{d_{j-1}+s} \}_{j=1}^{J_2}$. The network concludes with a fully connected layer $h^{(J_2+1)} : \mathbb{R}^{d_{J_2}} \to \mathbb{R}^{N}$ defined by a connection matrix $F^{(J_2+1)} \in \mathbb{R}^{N \times d_{J_2}}$ and a bias vector $b^{(J_2+1)} \in \mathbb{R}^{N}$, expressed as
$$h^{(J_2+1)}(x) = \hat\sigma\big( F^{(J_2+1)} h^{(J_2)}(x) - b^{(J_2+1)} \big),$$
where $\hat\sigma$ is an activation function of sigmoidal type satisfying the assumptions
$$\hat\sigma^{(i)}(u) \ge 0, \qquad u \in \mathbb{R},\ i \in \mathbb{Z}_+,$$
and, for some integer $q \ge 1$, $\lim_{u \to -\infty} \frac{\hat\sigma(u)}{|u|^{q}} = 0$ and $\lim_{u \to +\infty} \frac{\hat\sigma(u)}{u^{q}} = 1$.
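One standard example satisfying these assumptions (our choice for illustration; the paper does not prescribe a specific $\hat\sigma$) is $\hat\sigma(u) = \max(u,0)^q$: its derivatives are nonnegative wherever they exist, $\hat\sigma(u)/|u|^q \to 0$ as $u \to -\infty$, and $\hat\sigma(u)/u^q \to 1$ as $u \to +\infty$.

```python
import numpy as np

def sigma_hat(u, q=2):
    """A sigmoidal-type activation of order q: max(u, 0)^q."""
    return np.maximum(u, 0.0) ** q

u = np.array([-1e6, -1.0, 0.0, 1.0, 1e6])
print(sigma_hat(u) / np.maximum(np.abs(u), 1.0) ** 2)   # -> 0 on the left tail, -> 1 on the right
```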
The associated hypothesis space $\mathcal{H}$, used for learning and function approximation, consists of all output functions generated by the filter sequence $\mathbf{w} = \{w^{(j)}\}_{j=1}^{J_2}$, the connection matrix $F^{(J_2+1)}$, the bias sequence $\mathbf{b} = \{b^{(j)}\}_{j=1}^{J_2+1}$, and a coefficient vector $c^{(J_2+1)} \in \mathbb{R}^{N}$:
$$\mathcal{H} := \Big\{ c^{(J_2+1)} \cdot h^{(J_2+1)}(x) :\ \mathbf{w},\ \mathbf{b},\ F^{(J_2+1)},\ c^{(J_2+1)} \Big\}.$$
We design the first group of convolutional layers to extract linear features by employing a convolutional factorization approach [8,10], which facilitates the construction of ridge functions for approximating the smooth components $\{F_\tau(x)\}_{\tau=1}^{d'}$. Let $s \ge 2$ and consider a sequence $U = (U_k)_{k=-\infty}^{\infty}$ supported on the index set $\{0, \ldots, M\}$ with $M \ge 0$. Then there exists a finite sequence of filters $\{w^{(j)}\}_{j=1}^{p}$, each supported in $\{0, \ldots, s\}$, with $p \le \lceil \frac{M}{s-1} \rceil$, such that the following convolutional factorization is satisfied:
$$U = w^{(p)} * w^{(p-1)} * \cdots * w^{(2)} * w^{(1)}.$$
For $m \in \mathbb{N}$ and $y = \{y_1, \ldots, y_m\} \subset \mathbb{S}^{d-1}$, we define $U$ as the sequence supported on $\{0, \ldots, md-1\}$ given by $U_{(l-1)d + (d-k)} = (y_l)_k$ for $l \in \{1, \ldots, m\}$ and $k \in \{1, \ldots, d\}$. Let $M = md - 1$ and $p \le \lceil \frac{M}{s-1} \rceil$. Then, for any $J_1 \ge \lceil \frac{M}{s-1} \rceil$, there exists a sequence of filters $w = \{w^{(j)}\}_{j=1}^{J_1}$ supported on $\{0, \ldots, s\}$ that satisfies the convolutional factorization
$$U = w^{(J_1)} * w^{(J_1-1)} * \cdots * w^{(2)} * w^{(1)}.$$
Here, for $j = p+1, \ldots, J_1$, we take $w^{(j)}$ to be the delta sequence $\{1, 0, \ldots, 0\}$. Consequently,
$$T^{(J_1)} \cdots T^{(2)} T^{(1)} = T^{(J_1,1)} = \big( U_{i-k} \big)_{i=1,\ldots,d+J_1 s,\; k=1,\ldots,d} \in \mathbb{R}^{(d + J_1 s) \times d},$$
where $T^{(j)}$ denotes the Toeplitz matrix in (6).
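The identity above is a finite-dimensional restatement of the convolutional factorization: the product of the per-layer Toeplitz matrices equals the Toeplitz matrix of the convolved filter. A quick numerical check (ours, with random filters) follows.

```python
import numpy as np

def toeplitz_conv_matrix(w, d_in):
    """T[i, k] = w[i - k] for a filter w supported on {0, ..., len(w)-1}."""
    s = len(w) - 1
    T = np.zeros((d_in + s, d_in))
    for i in range(d_in + s):
        for k in range(d_in):
            if 0 <= i - k <= s:
                T[i, k] = w[i - k]
    return T

rng = np.random.default_rng(2)
d, s, J1 = 4, 2, 3
filters = [rng.standard_normal(s + 1) for _ in range(J1)]

# Left side: product of the per-layer Toeplitz matrices T^(J1) ... T^(1).
prod = np.eye(d)
width = d
for w in filters:
    prod = toeplitz_conv_matrix(w, width) @ prod
    width += s

# Right side: Toeplitz matrix of U = w^(J1) * ... * w^(1), supported on {0, ..., J1*s}.
U = filters[0]
for w in filters[1:]:
    U = np.convolve(w, U)
assert np.allclose(prod, toeplitz_conv_matrix(U, d))
print(prod.shape)   # (d + J1*s, d)
```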
The next step is to construct the bias vectors of the network. Define
$$\|w\|_1 = \sum_{k=-\infty}^{\infty} |w_k|.$$
For the first layer, the bias is set as
$$b^{(1)} = -\big\| w^{(1)} \big\|_1\, \mathbf{1}_{d_1},$$
and for layers $j = 2, \ldots, J_1$,
$$b^{(j)} = \Big( \prod_{p=1}^{j-1} \big\| w^{(p)} \big\|_1 \Big)\, T^{(j)}\, \mathbf{1}_{d_{j-1}} - \Big( \prod_{p=1}^{j} \big\| w^{(p)} \big\|_1 \Big)\, \mathbf{1}_{d_j},$$
where $\mathbf{1}_{d_j}$ denotes the all-ones vector in $\mathbb{R}^{d_j}$. These bias vectors satisfy the restriction
$$b^{(j)}_{s+1} = \cdots = b^{(j)}_{d_j - s}.$$
Note that $\|x\|_{\infty} \le 1$ for $x \in \mathbb{S}^{d-1}$. Let $\|h\|_{\infty} = \max\{ \|h_j\|_{\infty} : j = 1, \ldots, q \}$ for a vector of functions $h : \mathbb{S}^{d-1} \to \mathbb{R}^{q}$. It is known that for $h : \mathbb{S}^{d-1} \to \mathbb{R}^{d_{j-1}}$,
$$\big\| T^{(j)} h \big\|_{\infty} \le \big\| w^{(j)} \big\|_1\, \|h\|_{\infty}.$$
Therefore, the components of $h^{(J_1)}(x)$ satisfy
$$\big( h^{(J_1)}(x) \big)_{ld} = \langle y_l, x \rangle + B^{(J_1)}, \qquad l = 1, \ldots, m,$$
where $B^{(J_1)} = \prod_{p=1}^{J_1} \|w^{(p)}\|_1$. We then have
$$\mathcal{D}_d\big( h^{(J_1)}(x) \big) = \big( \langle y_1, x\rangle,\ \ldots,\ \langle y_m, x\rangle,\ 0,\ \ldots,\ 0 \big)^{T} + B^{(J_1)}\, \mathbf{1}_{\lfloor (d + J_1 s)/d \rfloor}.$$
Since $J_1 \ge \frac{md-1}{s-1}$, we obtain
$$\frac{d + J_1 s}{d} \ge 1 + \frac{md-1}{d} \cdot \frac{s}{s-1} > 1 + \frac{md-1}{d} \ge m.$$
Therefore, $d_{J_1} = \lfloor (d + J_1 s)/d \rfloor \ge m$. To satisfy the constraint $J_1 \ge \frac{md}{s-1}$ for the DCNN and $m = (C_d + 1)\, n^{d-1}$ for the cubature formula, we select $n = \Big\lfloor \Big( \frac{(s-1) J_1}{(C_d + 1)\, d} \Big)^{\frac{1}{d-1}} \Big\rfloor$.
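The role of the first group of layers can be checked directly: with $U_{(l-1)d + (d-k)} = (y_l)_k$, row $ld$ of the Toeplitz matrix of $U$ applied to $x$ equals $y_l \cdot x$, and downsampling by $d$ collects exactly these entries. The sketch below (our own verification) checks only this linear part, omitting the bias shift $B^{(J_1)}$ and the ReLU.

```python
import numpy as np

def toeplitz_conv_matrix(w, d_in):
    s = len(w) - 1
    T = np.zeros((d_in + s, d_in))
    for i in range(d_in + s):
        for k in range(d_in):
            if 0 <= i - k <= s:
                T[i, k] = w[i - k]
    return T

rng = np.random.default_rng(3)
d, m = 5, 4
Y = rng.standard_normal((m, d))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # y_1, ..., y_m on S^{d-1}
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# U is supported on {0, ..., md-1} with U_{(l-1)d + (d-k)} = (y_l)_k.
U = np.zeros(m * d)
for l in range(1, m + 1):
    for k in range(1, d + 1):
        U[(l - 1) * d + (d - k)] = Y[l - 1, k - 1]

out = toeplitz_conv_matrix(U, d) @ x            # linear part of the first group
picked = out[d - 1::d][:m]                      # downsampling by d picks rows d, 2d, ..., md
assert np.allclose(picked, Y @ x)               # these rows equal <y_l, x>
print(picked)
```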
Considering that $|y_l \cdot x| \le 1$, there exist two groups of convolutional layers with filters $\{w^{(j)}\}_{j=1}^{J_2}$ of uniform length $s$ and biases $\{b^{(j)}\}_{j=1}^{J_2}$ satisfying the restriction (7) for $j \ne J_2$. These layers are downsampled at the $J_1$-th layer by a scaling factor $d$, ensuring that the output of the $J_2$-th layer, denoted $h^{(J_2)}(x) \in \mathbb{R}^{d_{J_2}}$, is given by
$$\big( h^{(J_2)}(x) \big)_{(i-1) d_{J_1} + l} = \begin{cases} \sigma\big( y_l \cdot x - t_i \big), & \text{if } 1 \le l \le m,\ 1 \le i \le 2N_1+3,\\ 0, & \text{otherwise}, \end{cases}$$
where $J_1 \ge \frac{md-1}{s-1}$ and $J_2 - J_1 \ge \frac{(2N_1+3)\, d_{J_1}}{s-1}$. Then $d_{J_1} = \lfloor (d + J_1 s)/d \rfloor \ge m$ and $d_{J_2} = d_{J_1} + (J_2 - J_1) s > (2N_1+3)\, d_{J_1} \ge (2N_1+2)\, d_{J_1} + m$. Therefore, we choose the bias vector $b^{(J_2)}$ as
$$\big( b^{(J_2)} \big)_k = \begin{cases} \Big( B^{(J_1)} \prod_{p=J_1+1}^{J_2-1} \big\| w^{(p)} \big\|_1\, T^{(J_2)} \mathbf{1}_{d_{J_2-1}} \Big)_k + t_i, & \text{if } (i-1) d_{J_1} + 1 \le k \le (i-1) d_{J_1} + m,\ 1 \le i \le 2N_1+3,\\ B^{(J_1)}, & \text{otherwise}. \end{cases}$$
Finally, we can express $h^{(J_2)}(x)$ as
$$h^{(J_2)}(x) = \big( H_1^{T},\ H_2^{T},\ \ldots,\ H_{2N_1+3}^{T},\ 0,\ \ldots,\ 0 \big)^{T} \in \mathbb{R}^{d_{J_2}},$$
where
$$H_i^{T} = \big( \sigma(y_1 \cdot x - t_i),\ \sigma(y_2 \cdot x - t_i),\ \ldots,\ \sigma(y_m \cdot x - t_i),\ 0,\ \ldots,\ 0 \big) \in \mathbb{R}^{d_{J_1}}, \qquad i = 1, 2, \ldots, 2N_1+3.$$
We define $v_i = \Big( L_{N_1}\big( \{ l_n(t_k) \}_{k=2}^{2N_1+2} \big) \Big)_i$ for $i = 1, 2, \ldots, 2N_1+3$. Additionally, denote by $O$ the $m \times (d_{J_1} - m)$ zero matrix and by $\hat O$ the $m \times \big( d_{J_2} - (2N_1+2) d_{J_1} - m \big)$ zero matrix. Then we set
$$F^{(N_1)} = N_1 \big( v_1 I_m\ \ O\ \ v_2 I_m\ \ O\ \ \cdots\ \ v_{2N_1+3} I_m\ \ \hat O \big) \in \mathbb{R}^{m \times d_{J_2}}.$$
Next, to obtain $\sum_{l=1}^{m} \gamma_l\, L_n(F_\tau)(y_l)\, I_t(l_n)\big(\langle y_l, \cdot\rangle\big)$ for $\tau = 1, 2, \ldots, d'$, in view of (27) we define
$$F^{(\gamma)} = \begin{pmatrix} \gamma_1 L_n(F_1)(y_1) & \gamma_2 L_n(F_1)(y_2) & \cdots & \gamma_m L_n(F_1)(y_m)\\ \gamma_1 L_n(F_2)(y_1) & \gamma_2 L_n(F_2)(y_2) & \cdots & \gamma_m L_n(F_2)(y_m)\\ \vdots & \vdots & \ddots & \vdots\\ \gamma_1 L_n(F_{d'})(y_1) & \gamma_2 L_n(F_{d'})(y_2) & \cdots & \gamma_m L_n(F_{d'})(y_m) \end{pmatrix} \in \mathbb{R}^{d' \times m}.$$
Define the matrix $F^{(\hat\alpha)} \in \mathbb{R}^{N_2 \times d'}$, whose rows are $\hat\alpha_1^{T}, \ldots, \hat\alpha_{N_2}^{T}$, and set the connection matrix of the fully connected layer to
$$F^{(J_2+1)} = F^{(\hat\alpha)}\, F^{(\gamma)}\, F^{(N_1)} = \begin{pmatrix} \hat\alpha_1^{T}\\ \vdots\\ \hat\alpha_{N_2}^{T} \end{pmatrix} F^{(\gamma)}\, F^{(N_1)},$$
where $\hat\alpha_k \in \mathbb{R}^{d'}$ ($k = 1, 2, \ldots, N_2$). Next, select the bias vector $b^{(J_2+1)} \in \mathbb{R}^{N_2}$ as $b^{(J_2+1)} = (\hat b_1, \hat b_2, \ldots, \hat b_{N_2})^{T}$, and define the coefficient vector $c^{(J_2+1)} \in \mathbb{R}^{N_2}$ as $c^{(J_2+1)} = (\hat c_1, \hat c_2, \ldots, \hat c_{N_2})^{T}$.
We refer to N and s as structural parameters because the architecture of the DCNNs is entirely defined by these two parameters. The remaining parameters are termed training parameters, as they can be trained or selected once s and N are specified.
The error rate for smooth features was established in Lemma 1. Setting $n = N_1^{\frac{2}{d+3+r}}$, which balances the two terms in the maximum, we obtain
$$\Big\| F_\tau - \sum_{l=1}^{m} \gamma_l\, L_n(F_\tau)(y_l)\, I_t(l_n)\big(\langle y_l, \cdot\rangle\big) \Big\|_{\infty} = \big\| F_\tau - \hat F_\tau \big\|_{\infty} \le \tilde c\, \|F_\tau\|_{W^r(\mathbb{S}^{d-1})}\, N_1^{-\frac{2r}{d+3+r}}.$$
Denote $S(x) = \big( F_1(x), F_2(x), \ldots, F_{d'}(x) \big)$ and $\hat S(x) = \big( \hat F_1(x), \hat F_2(x), \ldots, \hat F_{d'}(x) \big)$. We have
$$\big| f(S(x)) - f(\hat S(x)) \big| \le \|f\|_{W^{\beta}} \Big( \tilde c\, d'\, \max\big\{ \|F_\tau\|_{W^r(\mathbb{S}^{d-1})} \big\}_{\tau=1}^{d'} \Big)^{\beta} N_1^{-\frac{2r\beta}{d+3+r}},$$
where $\tilde c = C_{\eta}\, 2^{d-1} + C \cdot 3^{d+2}\, C_1\, C_{\eta,d}$.
Using (46) and Lemma 2, the resulting error rates can be derived as follows. Taking $N_2 = N$ and $N_1 = N^{\frac{d+3+r}{2rd'}}$ (so that $N_1^{-\frac{2r\beta}{d+3+r}} = N^{-\frac{\beta}{d'}}$), we have
$$\begin{aligned} \big| c^{(J_2+1)} \cdot h^{(J_2+1)}(x) - f(S(x)) \big| &\le \big| c^{(J_2+1)} \cdot h^{(J_2+1)}(x) - f(\hat S(x)) \big| + \big| f(\hat S(x)) - f(S(x)) \big|\\ &\le B_F^{\beta}\, C_{d',\beta}\, \|f\|_{W^{\beta}}\, N^{-\frac{\beta}{d'}} + \|f\|_{W^{\beta}} \Big( \tilde c\, d'\, \max\big\{ \|F_\tau\|_{W^r(\mathbb{S}^{d-1})} \big\}_{\tau=1}^{d'} \Big)^{\beta} N_1^{-\frac{2r\beta}{d+3+r}}\\ &\le c_1\, \|f\|_{W^{\beta}}\, N^{-\frac{\beta}{d'}}, \end{aligned}$$
where $c_1 = C_{d',\beta}\, B_F^{\beta} + \Big( \tilde c\, d'\, \max\big\{ \|F_\tau\|_{W^r(\mathbb{S}^{d-1})} \big\}_{\tau=1}^{d'} \Big)^{\beta}$. □
Remark 1. 
Since $N_1 = N^{\frac{d+3+r}{2rd'}}$, the exponent depends on the input dimension $d$. For approximating functions in $W^{\beta}(\mathbb{S}^{d-1})$, the smoothness index $r$ of the inner functions and the number of features $d'$ should satisfy
$$\frac{d+3+r}{2r}\, d' < d,$$
so that the curse of dimensionality is reduced.
We give a comparison with [10] in Table 1. Assuming the smoothness index $\beta$ of the outer function is the same in both settings, our error rate is much faster when the index $r$ of the inner functions and the number of features $d'$ satisfy $\frac{d+3+r}{2r} d' < d$ ($d \ge 3$).
Theorem 1 concerns the approximation of composite functions in Sobolev spaces on the unit sphere, $f \in W^{\beta}(\mathbb{S}^{d-1})$ with $\beta \in (0,1]$. In contrast, related previous studies, such as the approximation of composite functions in $W^{\beta}([0,1]^d)$ with $\beta \in (0,1]$, were discussed in [20]. When $d' = d$ and $\beta = 1$, meaning that there are $d$ general smooth features on the sphere, corresponding to approximating functions on the $d$-dimensional cube [11], the following corollary is obtained. It illustrates that the DCNNs we construct can also approximate or represent functions on a cube.
Corollary 1. 
Let $2 \le s \le d$, $d \ge 3$, $N \in \mathbb{N}$. For approximating functions $f \in W^{1}([0,1]^d)$, there exists a downsampled DCNN as defined in Definition 1 (explicitly constructed as in Section 3.1), with downsampling scaling parameters $\{d, 1\}$ at layers $\{J_1, J_2\}$, respectively, filters $\{w^{(j)}\}_{j=1}^{J_2}$ of uniform length $s$, bias vectors $\{b^{(j)}\}_{j=1}^{J_2+1}$, a connection matrix $F^{(J_2+1)}$, and a coefficient vector $c^{(J_2+1)} \in \mathbb{R}^{N}$, such that
$$\big\| c^{(J_2+1)} \cdot h^{(J_2+1)} - f \big\|_{L^{\infty}([0,1]^d)} \le c_1\, \|f\|_{W^{1}}\, N^{-\frac{1}{d}},$$
where $c_1 = c_1(d, \eta)$ and $\eta \in C^{\infty}([0,\infty))$ is defined in Section 2.3.

3.2. Approximating Functions with Polynomial Features

The approximation of functions with polynomial features is also discussed in [20], where the domain is taken to be $[0,1]^d$. In Theorem 1, we derived error bounds for functions with smooth features defined on the unit sphere, and in Section 2.2 we introduced the reproducing kernel $\frac{k+\lambda}{\lambda} C_k^{\lambda}(\langle x, y\rangle)$ associated with the space $\mathcal{H}_k^d$.
To reduce the curse of dimensionality, we leverage the reproducing kernel to approximate functions with polynomial features on the unit sphere. Specifically, we consider functions of the form
$$f(x) = f\big( P_{1,q_1}(x), P_{2,q_2}(x), \ldots, P_{d',q_{d'}}(x) \big) = f\big( P(x) \big), \qquad x \in \mathbb{S}^{d-1},$$
where $\{P_{\tau,q_\tau}(x)\}_{\tau=1}^{d'}$ are spherical polynomials of degree $q_\tau$.
Define $B_P = \max\{\|P_{\tau,q_\tau}\|_{\infty}\}_{\tau=1}^{d'}$ and assume further that $f \in W^{\beta}([-B_P, B_P]^{d'})$ with $\beta \in (0,1]$. Recall the polynomial kernel $k_n(t)$ introduced in Section 2.3 and suppose $\max\{q_\tau\}_{\tau=1}^{d'} \le 2n$. We then have the following theorem:
Theorem 2. 
Let $2 \le s \le d$, $d \ge 3$, $d' \le d$, $N \in \mathbb{N}$, $n \in \mathbb{N}$. Let $\{P_{\tau,q_\tau}(x)\}_{\tau=1}^{d'}$ be spherical polynomials of degree $q_\tau$ satisfying $\max\{q_\tau\}_{\tau=1}^{d'} \le 2n$, let $B_P = \max\{\|P_{\tau,q_\tau}\|_{\infty}\}_{\tau=1}^{d'}$, and suppose $f \in W^{\beta}([-B_P, B_P]^{d'})$ with $\beta \in (0,1]$. For approximating functions of the form (47), there exists a downsampled DCNN, as defined in Definition 1 (explicitly constructed in Section 3.1), with downsampling scaling parameters $\{d, 1\}$ applied at layers $\{J_1, J_2\}$, respectively. The network uses filters $\{w^{(j)}\}_{j=1}^{J_2}$ of uniform length $s$, bias vectors $\{b^{(j)}\}_{j=1}^{J_2+1}$, a connection matrix $F^{(J_2+1)}$, and a coefficient vector $c^{(J_2+1)} \in \mathbb{R}^{N}$, such that
$$\big\| c^{(J_2+1)} \cdot h^{(J_2+1)} - f \big\|_{L^{\infty}(\mathbb{S}^{d-1})} \le c_2\, \|f\|_{W^{\beta}}\, n^{(d+3)\beta}\, N^{-\frac{\beta}{d'}},$$
where $c_2 = c_2(d, d', \beta, B_P, n, \eta)$ and $\eta \in C^{\infty}([0,\infty))$ is defined in Section 2.3.
Proof. 
Let $P_{\tau,q_\tau}(x)$ be a spherical polynomial of degree $q_\tau$, where $q_\tau \le 2n$. According to (15), we have
$$P_{\tau,q_\tau}(x) = \int_{\mathbb{S}^{d-1}} P_{\tau,q_\tau}(y)\, k_n\big(\langle x, y\rangle\big)\, d\mu(y).$$
Applying (18) and (19) with $m = (C_d + 1)\, n^{d-1}$, and noting that $P_{\tau,q_\tau}(y)\, k_n(\langle x, y\rangle)$ is a spherical polynomial in $y$ of degree at most $4n$, there exist nodes $y_1, y_2, \ldots, y_m \in \mathbb{S}^{d-1}$ and positive weights $\gamma_1, \gamma_2, \ldots, \gamma_m > 0$ such that
$$P_{\tau,q_\tau}(x) = \sum_{l=1}^{m} \gamma_l\, P_{\tau,q_\tau}(y_l)\, k_n\big(\langle x, y_l\rangle\big), \qquad x \in \mathbb{S}^{d-1}.$$
Similar to the proof of Theorem 1, we derive the following error rates by selecting $N_1 = N_2 = N$:
$$\begin{aligned} \big| c^{(J_2+1)} \cdot h^{(J_2+1)}(x) - f(P(x)) \big| &\le \big| c^{(J_2+1)} \cdot h^{(J_2+1)}(x) - f(\hat P(x)) \big| + \big| f(\hat P(x)) - f(P(x)) \big|\\ &\le B_P^{\beta}\, C_{d',\beta}\, \|f\|_{W^{\beta}}\, N_2^{-\frac{\beta}{d'}} + \|f\|_{W^{\beta}}\, \big( c\, B_P\, d' \big)^{\beta}\, n^{(d+3)\beta}\, N_1^{-2\beta}\\ &\le c_2\, \|f\|_{W^{\beta}}\, n^{(d+3)\beta}\, N^{-\frac{\beta}{d'}}, \end{aligned}$$
where $c_2 = B_P^{\beta}\, C_{d',\beta} + (c\, B_P\, d')^{\beta}$ and $c = C \cdot 3^{d+2}\, C_1\, C_{\eta,d}$. □
Remark 2. 
Extracting spherical polynomial features also yields rates for approximating functions in $W^{1}([0,1]^d)$ when taking $d' = d$ and $\beta = 1$.
Remark 3. 
Compared with [20], our spherical polynomial features are not restricted to have the same degree; we only require the degrees to be at most $2n$.
We provide a comparison with [10] in Table 2. Clearly, for the same index $\beta$, our results provide faster error rates since $d'$ can be much smaller than $d$ ($d \ge 3$). Unlike the case of extracting general smooth features (Theorem 1), the smoothness index $r$ of the inner functions and the number of features $d'$ need not be restricted to a specific range.

3.3. Approximating Functions with Symmetric Polynomial Features

Even though $d'$ is much smaller than $d$, in practical machine learning applications $d'$ may still be large. To further improve the approximation rates, we consider polynomial features $\{P_{\tau,q_\tau}(x)\}_{\tau=1}^{d'}$ in (47) that exhibit symmetric structures, with degree $n$. We denote these features by $\{Q_{\tau,n}(x)\}_{\tau=1}^{d'}$, where $Q_{\tau,n}(x) = Q_{\tau,n}(x_1, x_2, \ldots, x_d)$ and $x \in \mathbb{S}^{d-1}$. For any permutation in the symmetric group, $Q_{\tau,n}$ satisfies
$$Q_{\tau,n}(x_1, \ldots, x_i, \ldots, x_j, \ldots, x_d) = Q_{\tau,n}(x_1, \ldots, x_j, \ldots, x_i, \ldots, x_d).$$
Therefore, the target function is of the form
$$f(x) = f\big( Q_{1,n}(x), Q_{2,n}(x), \ldots, Q_{d',n}(x) \big), \qquad x \in \mathbb{S}^{d-1},$$
and $f \in W^{\beta}([-B_Q, B_Q]^{d'})$ ($\beta \in (0,1]$) with $B_Q = \max\{\|Q_{\tau,n}\|_{\infty}\}_{\tau=1}^{d'}$.
Theorem 3. 
Let $2 \le s \le d$, $d \ge 3$, $n < d' \le d$, $N \in \mathbb{N}$, $n \in \mathbb{N}$, let $\{Q_{\tau,n}(x)\}_{\tau=1}^{d'}$ be symmetric spherical polynomials of degree $n$, and let $f \in W^{\beta}([-B_Q, B_Q]^{d'})$ with $B_Q = \max\{\|Q_{\tau,n}\|_{\infty}\}_{\tau=1}^{d'}$ and $\beta \in (0,1]$. To approximate functions of the form (49), one can construct a downsampled DCNN as described in Definition 1 (explicitly defined in Section 3.1), employing downsampling scaling parameters $\{d, 1\}$ at layers $\{J_1, J_2\}$, respectively. The network uses filters $\{w^{(j)}\}_{j=1}^{J_2}$ of uniform length $s$, bias vectors $\{b^{(j)}\}_{j=1}^{J_2+1}$, a connection matrix $F^{(J_2+1)}$, and a coefficient vector $c^{(J_2+1)} \in \mathbb{R}^{N}$, such that
$$\big\| c^{(J_2+1)} \cdot h^{(J_2+1)} - f \big\|_{L^{\infty}(\mathbb{S}^{d-1})} \le c_3\, n^{(d+3)\beta}\, N^{-\frac{\beta}{n}},$$
where $c_3 = c_3(d, d', \beta, f, Q_{\tau,n}, n, \eta)$ and $\eta \in C^{\infty}([0,\infty))$ is defined in Section 2.3.
Proof. 
Let the elementary symmetric polynomials in $x \in \mathbb{S}^{d-1}$ be denoted by
$$q_k(x) = \sum_{1 \le i_1 < i_2 < \cdots < i_k \le d} x_{i_1} x_{i_2} \cdots x_{i_k}, \qquad k = 1, 2, \ldots, d.$$
According to the fundamental theorem of symmetric polynomials, any symmetric polynomial can be uniquely expressed as a polynomial in the variables $\big( q_1(x), q_2(x), \ldots, q_d(x) \big)$. Given that the symmetric polynomials $\{Q_{\tau,n}(x)\}_{\tau=1}^{d'}$ are of degree $n$ and $n \le d$, they have a unique expression as a polynomial in $\big( q_1(x), q_2(x), \ldots, q_n(x) \big)$. Consequently, for $Q_{\tau,n}$ with $n \le d' \le d$, there exists a unique polynomial $\tilde Q_{\tau,n}$ such that
$$Q_{\tau,n}(x) = \tilde Q_{\tau,n}\big( q_1(x), q_2(x), \ldots, q_n(x) \big), \qquad \tau = 1, 2, \ldots, d'.$$
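As a concrete instance of this classical fact (our own illustration, unrelated to the specific $Q_{\tau,n}$ of the theorem), the power sum $p_2(x) = \sum_i x_i^2$ is a symmetric polynomial of degree 2 and satisfies $p_2 = q_1^2 - 2 q_2$ by Newton's identity; on the unit sphere it is identically 1.

```python
import numpy as np
from itertools import combinations

def elementary_symmetric(x, k):
    """q_k(x) = sum over 1 <= i_1 < ... < i_k <= d of x_{i_1} ... x_{i_k}."""
    return sum(np.prod(c) for c in combinations(x, k))

rng = np.random.default_rng(4)
d = 6
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # a point on S^{d-1}

q1 = elementary_symmetric(x, 1)
q2 = elementary_symmetric(x, 2)
p2 = np.sum(x ** 2)                       # symmetric polynomial of degree 2
assert np.isclose(p2, q1 ** 2 - 2 * q2)   # p_2 = q_1^2 - 2 q_2 (Newton's identity)
assert np.isclose(p2, 1.0)                # on the unit sphere, p_2(x) = 1
print(q1, q2)
```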
Therefore, the target function $f(x)$ can be expressed as
$$f(x) = f\big( Q_{1,n}(x), \ldots, Q_{d',n}(x) \big) = f\big( \tilde Q_{1,n}(q_1(x), \ldots, q_n(x)),\ \ldots,\ \tilde Q_{d',n}(q_1(x), \ldots, q_n(x)) \big) = \tilde f\big( q_1(x), q_2(x), \ldots, q_n(x) \big),$$
for the function $\tilde f$ defined on $\mathbb{R}^{n}$ by
$$\tilde f(z) = f\big( \tilde Q_{1,n}(z), \ldots, \tilde Q_{d',n}(z) \big), \qquad z \in \mathbb{R}^{n}.$$
Since each polynomial $\tilde Q_{\tau,n} \in W^{1}([-B_q, B_q]^{n})$ with semi-norm $|\tilde Q_{\tau,n}|_{W^{1}}$, where $B_q = \max\{\|q_k\|_{\infty}\}_{k=1}^{n}$, it follows that for any $z_1 \ne z_2 \in [-B_q, B_q]^{n}$,
$$\big| \tilde f(z_1) - \tilde f(z_2) \big| = \big| f\big( \tilde Q_{1,n}(z_1), \ldots, \tilde Q_{d',n}(z_1) \big) - f\big( \tilde Q_{1,n}(z_2), \ldots, \tilde Q_{d',n}(z_2) \big) \big| \le \|f\|_{W^{\beta}} \Big( d' \max_{\tau}\big\{ |\tilde Q_{\tau,n}(z_1) - \tilde Q_{\tau,n}(z_2)| \big\}_{\tau=1}^{d'} \Big)^{\beta} \le \|f\|_{W^{\beta}} \Big( d' \max_{\tau}\big\{ |\tilde Q_{\tau}|_{W^{1}} \big\}_{\tau=1}^{d'} \Big)^{\beta} |z_1 - z_2|^{\beta}.$$
Since $\{q_k(x)\}_{k=1}^{n}$ are spherical polynomials of degree up to $n$, let us denote
$$q(x) = \big( q_1(x), q_2(x), \ldots, q_n(x) \big) \quad\text{and}\quad \hat q(x) = \big( \hat q_1(x), \hat q_2(x), \ldots, \hat q_n(x) \big).$$
Similar to the proof of Theorem 2, by selecting $N_1 = N_2 = N$, we obtain:
$$\begin{aligned} \big| c^{(J_2+1)} \cdot h^{(J_2+1)}(x) - \tilde f(q(x)) \big| &\le \big| c^{(J_2+1)} \cdot h^{(J_2+1)}(x) - \tilde f(\hat q(x)) \big| + \big| \tilde f(\hat q(x)) - \tilde f(q(x)) \big|\\ &\le B_q^{\beta}\, C_{n,\beta}\, \|\tilde f\|_{W^{\beta}}\, N_2^{-\frac{\beta}{n}} + \Big( B_q \max_{\tau}\big\{ |\tilde Q_{\tau}|_{W^{1}} \big\}_{\tau=1}^{d'}\, d' \Big)^{\beta} \big( c\, n^{d+3} \big)^{\beta} N_1^{-2\beta}\\ &\le B_q^{\beta}\, C_{n,\beta} \Big( \Big( B_q \max_{\tau}\big\{ |\tilde Q_{\tau}|_{W^{1}} \big\}_{\tau=1}^{d'}\, d' \Big)^{\beta} + \|f\|_{W^{\beta}} \Big) N_2^{-\frac{\beta}{n}} + \Big( B_q \max_{\tau}\big\{ |\tilde Q_{\tau}|_{W^{1}} \big\}_{\tau=1}^{d'}\, d' \Big)^{\beta} \big( c\, n^{d+3} \big)^{\beta} N_1^{-2\beta}\\ &\le c_3\, n^{(d+3)\beta}\, N^{-\frac{\beta}{n}}, \end{aligned}$$
where $c_3 = \Big( B_q \max_{\tau}\big\{ |\tilde Q_{\tau}|_{W^{1}} \big\}_{\tau=1}^{d'}\, d' \Big)^{\beta} \big( c^{\beta} + B_q^{\beta} C_{n,\beta} \big) + B_q^{\beta} C_{n,\beta}\, \|f\|_{W^{\beta}}$, and $c = C \cdot 3^{d+2}\, C_1\, C_{\eta,d}$. □

4. Conclusions

This paper proposes the construction of DCNNs with multiple downsampling layers for approximating functions with multiple features, namely general smooth features, polynomial features, and symmetric polynomial features, on the unit sphere. The result of Theorem 1 implies that if the smoothness index $r$ of the inner functions and the number of features $d'$ satisfy $\frac{d+3+r}{2r} d' < d$, the curse of dimensionality can be reduced. To further reduce the curse of dimensionality, we also derive error rates for cases where the features are either polynomials or symmetric polynomials on the unit sphere, as shown in Theorems 2 and 3. This implies that the DCNNs developed in Section 3.1 are capable of extracting three types of features: general smooth features, polynomial features, and symmetric polynomial features. In comparison, the DCNNs in [20] are limited to extracting only polynomial and symmetric polynomial features. In addition, compared with [20], when extracting polynomial features on the sphere, our spherical polynomial features are not restricted to have the same degree. Although our input vector $x$ is spherical, we show that the DCNNs designed in Section 3.1 can also approximate or represent functions in $W^{1}([0,1]^d)$ by extracting general smooth or spherical polynomial features, as shown in Corollary 1. These results underscore the superior learning capacity of the proposed DCNN framework in capturing complex and high-dimensional feature representations on the unit sphere.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Goodfellow, I.; Bengio, Y. and Courville, A. Deep Learning. MIT Press, 2016.
  2. Hinton, G. E.; Osindero, S. and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation. 2006, 18, 1527–1554. [Google Scholar] [CrossRef] [PubMed]
  3. Krizhevsky, A.; Sutskever, I. and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 2017, 60, 84–90. [Google Scholar] [CrossRef]
  4. DeVore, R.; Hanin, B. and Petrova, G. Neural network approximation. ACTA Numerica. 2021, 30, 327–444. [Google Scholar] [CrossRef]
  5. Elbrächter, D.; Perekrestenko, D.; Grohs, P. and Böelcskei, H. Deep Neural Network Approximation Theory. IEEE Transactions on Information Theory. 2021, 67, 2581–2623. [Google Scholar] [CrossRef]
  6. Bartolucci, F.; De Vito, E.; Rosasco, L. and Vigogna, S. Understanding neural networks with reproducing kernel Banach spaces. Applied and Computational Harmonic Analysis. 2023, 62, 194–236. [Google Scholar] [CrossRef]
  7. Song, L. H.; Liu, Y.; Fan, J. and Zhou, D. X. Approximation of smooth functionals using deep ReLU networks. Neural Networks. 2023, 166, 424–436. [Google Scholar] [CrossRef]
  8. Zhou, D. X. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis. 2020, 48, 787–794. [Google Scholar] [CrossRef]
  9. Zhou, D. X. Theory of deep convolutional neural networks: Downsampling. Neural Networks. 2020, 124, 319–327. [Google Scholar] [CrossRef]
  10. Fang, Z. Y.; Feng, H.; Huang, S. and Zhou, D. X. Theory of deep convolutional neural networks II: Spherical analysis. Neural Networks. 2020, 131, 154–162. [Google Scholar] [CrossRef] [PubMed]
  11. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks. 2017, 94, 103–114. [Google Scholar] [CrossRef] [PubMed]
  12. Zhou, D. X. Deep distributed convolutional neural networks: Universality. Analysis and Applications. 2018, 16, 895–919. [Google Scholar] [CrossRef]
  13. Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory. 1993, 39, 930–945. [Google Scholar] [CrossRef]
  14. Klusowski, J. and Barron, A. R. Approximation by Combinations of ReLU and Squared ReLU Ridge Functions with ℓ1 and ℓ0 Controls. IEEE Transactions on Information Theory. 2018, 64, 7649–7656. [Google Scholar] [CrossRef]
  15. Suzuki, T. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. International Conference on Learning Representations. New Orleans, United States, 2019.
  16. Montanelli, H. and Du, Q. New Error Bounds for Deep ReLU Networks Using Sparse Grids. SIAM Journal on Mathematics of Data Science. 2019, 1, 78–92. [Google Scholar] [CrossRef]
  17. Bach, F. Breaking the Curse of Dimensionality with Convex Neural Networks. Journal of Machine Learning Research. 2017, 18, 19. [Google Scholar]
  18. Bauer, B. and Kohler, M. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Annals of Statistics. 2019, 47, 2261–2285. [Google Scholar] [CrossRef]
  19. Feng, H.; Huang, S. and Zhou, D. X. Generalization Analysis of CNNs for Classification on Spheres. IEEE Transactions on Neural Networks and Learning Systems. 2023, 34, 6200–6213. [Google Scholar] [CrossRef]
  20. Mao, T.; Shi, Z. J. and Zhou, D. X. Approximating functions with multi-features by deep convolutional neural networks. Analysis and Applications. 2023, 21, 93–125. [Google Scholar] [CrossRef]
  21. Mao, T.; Shi, Z. J. and Zhou, D. X. Theory of deep convolutional neural networks III: Approximating radial functions. Neural Networks. 2021, 144, 778–790. [Google Scholar] [CrossRef]
  22. Dai, F. and Xu, Y. Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer, 2013.
  23. Brown, G. and Dai, F. Approximation of smooth functions on compact two-point homogeneous spaces. Journal of Functional Analysis. 2005, 220, 401–423. [Google Scholar] [CrossRef]
  24. De Boor, C. and Fix, G. J. Spline approximation by quasiinterpolants. Journal of Approximation Theory. 1973, 8, 19–45. [Google Scholar] [CrossRef]
  25. Feng, H.; Hou, S. Z.; Wei, L. Y. and Zhou, D. X. CNN models for readability of chinese texts. Mathematical Foundations of Computing. 2022, 5, 351–362. [Google Scholar] [CrossRef]
  26. Ahmed, R.; Fahim, A. I.; Islam, M.; Islam, S. and Shatabda, S. Dolg-next: Convolutional neural network with deep orthogonal fusion of local and global features for biomedical image segmentation. Neurocomputing. 2023, 546, 126362. [Google Scholar] [CrossRef]
  27. Silver, D.; Schrittwieser, J.; Simonyan, K.; et al. Mastering the game of Go without human knowledge. Nature. 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  28. Herrmann, L.; Opschoor, J. and Schwab, C. Constructive Deep ReLU Neural Network Approximation. Journal of Scientific Computing. 2022, 90, 75. [Google Scholar] [CrossRef]
  29. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems. 1989, 2, 303–314. [Google Scholar] [CrossRef]
  30. Hornik, K.; Stinchcombe, M. and White, H. Multilayer feedforward networks are universal approximators. Neural Networks. 1989, 2, 359–366. [Google Scholar] [CrossRef]
  31. DeVore, R.; Howard, R. and Micchelli, C. Optimal nonlinear approximation. Manuscripta Mathematica. 1989, 63, 469–478. [Google Scholar] [CrossRef]
  32. Mallat, S. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016, 374, 20150203. [Google Scholar] [CrossRef]
  33. Wiatowski, T. and Bölcskei, H. A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction. IEEE Transactions on Information Theory. 2018, 64, 1845–1866. [Google Scholar] [CrossRef]
  34. Mhaskar, H. and Poggio, T. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications. 2016, 14, 829–848. [Google Scholar] [CrossRef]
  35. Mhaskar, H. N. and Micchelli, C. A. Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics. 1992, 13, 350–373. [Google Scholar] [CrossRef]
Table 1. Approximation rates with and without feature extraction.

Regularity | Range | Error rate | Features
$f \in W^{\beta}(\mathbb{S}^{d-1})$ [10] | $0 < \beta \le 1$ | $O\big(N^{-\beta/(d+1)}\big)$ | None
$f \in W^{\beta}(\mathbb{S}^{d-1})$ (Theorem 1) | $0 < \beta \le 1$ | $O\big(N^{-\beta/d'}\big)$ | General smooth
Table 2. Approximation rates with and without feature extraction.

Regularity | Range | Error rate | Features
$f \in W^{\beta}(\mathbb{S}^{d-1})$ [10] | $0 < \beta \le 1$ | $O\big(N^{-\beta/(d+1)}\big)$ | None
$f \in W^{\beta}(\mathbb{S}^{d-1})$ (Theorem 2) | $0 < \beta \le 1$ | $O\big(N^{-\beta/d'}\big)$ | Spherical polynomial
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.