Near-optimal Sparse Neural Trees for Supervised Learning

Decision tree algorithms have been among the most popular algorithms for interpretable (transparent) machine learning since the early 1980s. On the other hand, deep learning methods have boosted the capacity of machine learning algorithms and are now being used for non-trivial applications in various applied domains. But training a fully-connected deep feed-forward network by gradient-descent backpropagation is slow and requires arbitrary choices regarding the number of hidden units and layers. In this paper, we propose near-optimal neural regression trees, aiming to make training much faster than for deep feed-forward networks and to remove the need to specify the number of hidden units in the hidden layers of the neural network in advance. The key idea is to construct a decision tree and then simulate the decision tree with a neural network. This work aims to build a mathematical formulation of neural trees and gain the complementary benefits of both sparse optimal decision trees and neural trees. We propose near-optimal sparse neural trees (NSNT), which are shown to be asymptotically consistent and robust in nature. Additionally, the proposed NSNT model obtains a fast rate of convergence, which is near-optimal up to some logarithmic factor. We comprehensively benchmark the proposed method on a sample of 80 datasets (40 classification datasets and 40 regression datasets) from the UCI machine learning repository. We establish that the proposed method is likely to outperform the current state-of-the-art methods (random forest, XGBoost, optimal classification tree, and near-optimal nonlinear trees) for the majority of the datasets.


Introduction
Decision trees [8] and deep feed-forward neural networks [23] are very popular nonparametric prediction models due to their elegant mathematical basis and ability to model both linear and non-linear decision boundaries. Since decision trees are rule-based, when small-sized, they are deemed to be leaders in terms of interpretability, whereas neural networks perform better in terms of out-of-sample predictability but sometimes lack mathematical interpretability due to having multiple hidden layers (usually unknown in number). Generally, these networks have no built-in hierarchy and consequently are fully connected, viz. all neurons in a layer are connected to all the neurons in the adjacent layers [34]. Since the number of neurons in the hidden layers is not known a priori, one designs a network with a varying number of neurons in different layers to determine the best suitable architecture for a given problem. As a result of these limitations, a considerable amount of design and training time is needed in many situations, and even then one is not sure that the 'optimal' design has been achieved. It is also well-known that the problem of building optimal neural networks is NP-complete [7].
To overcome these drawbacks, a tree-structured hybrid representation of the neural network was proposed in the previous literature [48][41][44]. The main idea behind neural trees is to construct a decision tree and then simulate the decision tree with a neural net [10]. Neural trees are composed of three basic steps: (a) converting a tree into rules, (b) constructing a neural network from the rules, and (c) training the neural net. The main motivation behind converting the tree into a rule set is that it allows distinguishing among different contexts in which a decision node is used. The significance level of each node is determined in terms of weights trained by the neural network model as follows. The antecedents of a rule are used as input features that are linked to the hidden unit(s) which represent the rule. Thus, the number of hidden units in the network is the same as the total number of leaf nodes obtained from the tree-based algorithm. All the hidden units and output units include bias weights and network weights. These weights are further trained using the gradient backpropagation algorithm [39]. Training multilayer neural networks by backpropagation is slow and requires arbitrary choices about the number of hidden units and layers. Neural trees are much faster to train, and it is not necessary to specify the number of hidden units in advance [42]. To design neural trees for a given problem, one first develops a decision tree, which is then transformed into a three-layered structure following a set of simple rules. Decision trees can be automatically designed using a set of input-output mapping pairs. Thus, the model can self-configure the architecture for a given problem. This is very important, as using the proper number of neurons in the hidden layers affects both the training time and the prediction performance [42][13].
The model has been applied to solve classification and regression problems [44][47][38], and various modifications of the algorithm can be found in the previous literature [40][43][50][16][19][25][46]. However, most of these methods are non-scalable, and their mathematical formulations are incomplete. From a theoretical point of view, the literature on neural trees is less conclusive. Regardless of the use of the model in applied problems of classification and regression, asymptotic results are yet to be proved. This creates a gap between theory and practice. To the best of our knowledge, there is no optimal (near or sub) set-up available for the neural tree. However, in the current literature of decision trees, there are recent works that introduced optimal classification trees [3][49][4] and sparse optimal decision trees [24][30][6], which are a strong motivation behind the current work.
The aim of this work is to introduce a mathematical formulation of near-optimal neural trees and gain the complementary benefits of both sparse optimal decision trees and neural trees. To this end, we propose near-optimal sparse neural trees (NSNT), which generalize and address the limitations of the previous works [44][41][19][6] that attempted the same unification. We aim to make the proposed model scalable (the size of the data does not pose a problem), robust (working well in a wide variety of problems in the presence of noisy samples), accurate (achieving higher predictive accuracy), statistically sound (having desired asymptotic properties), and interpretable, for its effective implementation in real-world classification and regression problems. We strongly desire that, whilst achieving competitive performance on real-world datasets, NSNT would benefit from (i) lightweight inference via conditional computation, (ii) hierarchical separation of features useful to the neural network building stage, and (iii) a mechanism to adapt the architecture to the size and complexity of the training dataset. We further investigate asymptotic consistency and the rate of convergence for the theoretical robustness of the proposed NSNT model.
The most celebrated theoretical results in the field of decision trees and neural networks have given general sufficient conditions for almost-sure L_2-consistency of data-driven density estimates [31] and consistency of feed-forward neural network estimates [32], respectively. Universal approximation properties for two-hidden-layered neural networks with a bounded number of neurons in the hidden layers have been proved [26][20]. In recent work, least-square estimates based on deep feed-forward neural networks were shown to circumvent the curse of dimensionality in nonparametric regression [2]. Motivated by these works, our current study proves the strong consistency of the NSNT model, which gives a basic theoretical guarantee for its effectiveness in practical studies. The approach depends on the choice of the total number of leaves and certain restrictions imposed on neural network hyper-parameters to ensure the consistency of the model. We discuss an analysis of the algorithmic complexity of the model along with its fast rate of convergence. More interestingly, a general bound on the expected L_2 error of adaptive least square estimates is established and applied to the regression function estimates of the proposed model to obtain the rate of convergence in the case of additive regression functions. The rate of convergence for the model is shown to be near-optimal up to some logarithmic factor according to [45].
To show the practical utility of the proposal to machine learning practitioners, we comprehensively benchmark NSNT against the state-of-the-art models on a sample of 80 datasets from the UCI machine learning repository.
We show that across this sample, the proposed NSNT performs consistently well for datasets with sizes in the thousands and yields higher out-of-sample accuracy than random forest [9], XGBoost [14], optimal classification trees [3], and near-optimal nonlinear trees [4] on average. The application of the proposal is demonstrated with several standard classification and regression datasets of various sizes. An implementation of the proposed NSNT model is made available for public use at https://github.com/tanujit123/NSNT.

Formulation of Proposed NSNT Model
In this section, we describe how a pre-trained decision tree can be reformulated as a two-hidden-layered (2HL) deep feed-forward neural network with a similar type of predictive behavior [48][41][10][11]. Suppose we are given a training sample D_n = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} with n observations on p independent variables. Consider a nonparametric regression framework in which p input features X ∈ C^p = [0, 1]^p are observed and we predict a square-integrable output Y ∈ R. A decision tree can be viewed as a regression function estimate using hierarchical axes-parallel splits of the feature space. Each tree node corresponds to one of the segmentation subsets of C^p. For simplicity and easy interpretability, let us consider an ordinary binary decision tree, where a node has exactly two child nodes or zero child nodes (leaf nodes). The tree consists of split nodes (for example, x^(i) ≥ α for some i ∈ {1, 2, ..., p} and some α ∈ C) and leaf nodes, so that the feature space C^p is partitioned into axes-parallel hyper-rectangles. The standard splitting criterion, MSE (mean squared error), is used to create the decision tree. While making a prediction, the input vector is passed into the root node of the decision tree and iteratively transmitted further towards the subspace where the input is located; this is repeated until a leaf node is finally reached. If a leaf represents the region S (S ⊆ C^p), then the natural regression function estimate can be written in the following mathematical form:

$$ t_n(x) = \frac{1}{N_n(S)} \sum_{i=1}^{n} Y_i \,\mathbf{1}_{X_i \in S}, \qquad x \in S, $$

where N_n(S) is defined as the number of observations in cell S (by convention, we assume 0/0 = 0). We obtain the predicted result for a query point x in leaf node S as the average of the Y_i's of all training samples which fall into the region S. Below we present a formal description of the decision tree to be used in the proposed model.

Background: Regression Trees
To be specific, we consider the decision tree of [8], also known as the classification and regression tree (CART), in our case. The main idea is to form a tree having k leaf regions (k depends on n) defined by a partition of the p-dimensional feature space with n observations. In the construction of the tree, the so-called CART-split criterion (MSE for the regression set-up) is applied recursively. This criterion helps in determining the input direction for the split and also in finding the appropriate cut. A formal mathematical expression for the decision tree algorithm based on [8] is as follows.
We assume S to be a generic cell and N_n(S) to be the number of observations falling in S. A cut in S is a pair (i, α), where i ∈ {1, 2, ..., p} and α ∈ [0, 1] is the position of the cut along the i-th coordinate, within the limits of S. Let P_S be the set of all such possible cuts in S. Then, with the notation X_j = (X_j^(1), ..., X_j^(p)), the CART-split criterion for the decision tree takes the following form, for any (i, α) ∈ P_S:

$$ L_n(i, \alpha) = \frac{1}{N_n(S)} \sum_{j=1}^{n} (Y_j - \bar{Y}_S)^2 \,\mathbf{1}_{X_j \in S} - \frac{1}{N_n(S)} \sum_{j=1}^{n} \big( Y_j - \bar{Y}_{S_L} \mathbf{1}_{X_j^{(i)} < \alpha} - \bar{Y}_{S_R} \mathbf{1}_{X_j^{(i)} \ge \alpha} \big)^2 \,\mathbf{1}_{X_j \in S}, \tag{1} $$

where S_L = {x ∈ S : x^(i) < α}, S_R = {x ∈ S : x^(i) ≥ α}, and Ȳ_S represents the average of the Y_j belonging to S, with the convention 0/0 = 0. Intuitively, L_n(i, α) measures the (re-normalized) difference between the empirical variance in the node before and after a cut is performed. Specifically, the best cut (i*_n, α*_n) for each cell S is selected by maximizing L_n(i, α) over P_S, viz.,

$$ (i^*_n, \alpha^*_n) \in \operatorname*{arg\,max}_{(i, \alpha) \in P_S} L_n(i, \alpha). $$

At each cell, the decision tree model evaluates the criterion (1) over all possible cuts in the p directions and returns the best possible cut. In case of ties, the best cut is selected at the midpoint of the two consecutive data points. This process is continued recursively until the tree contains exactly k terminal nodes, where k ≥ 2 is an integer that eventually depends on n (k and k_n have the same meaning).
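As a concrete illustration, the variance-decrease criterion and the exhaustive search over midpoint cuts described above can be sketched in a few lines of Python. This is a toy sketch under our own naming conventions, not the authors' implementation:

```python
# Toy sketch of the CART-split criterion L_n(i, alpha): the decrease in
# empirical variance produced by cutting a cell at position alpha along
# coordinate i. A "cell" is a list of (x, y) pairs with x a tuple of floats.

def variance(ys):
    if not ys:
        return 0.0  # convention 0/0 = 0 for empty children
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def cart_split_gain(cell, i, alpha):
    """L_n(i, alpha): within-cell variance minus the size-weighted
    variance of the two children, normalised by the cell size."""
    n = len(cell)
    left = [y for x, y in cell if x[i] < alpha]
    right = [y for x, y in cell if x[i] >= alpha]
    ys = [y for _, y in cell]
    post = (len(left) * variance(left) + len(right) * variance(right)) / n
    return variance(ys) - post

def best_cut(cell, p):
    """Maximise L_n(i, alpha) over midpoints between consecutive values,
    mirroring the tie-breaking convention of the text."""
    best = (None, None, -1.0)
    for i in range(p):
        xs = sorted({x[i] for x, _ in cell})
        for a, b in zip(xs, xs[1:]):
            alpha = (a + b) / 2  # cut at the midpoint of consecutive points
            gain = cart_split_gain(cell, i, alpha)
            if gain > best[2]:
                best = (i, alpha, gain)
    return best
```

On a one-dimensional toy sample whose response jumps at x = 0.5, the search returns a cut at that jump with gain equal to the full prior variance, since both children become pure.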

Near-optimal Sparse Neural Trees
Assume that we have at hand a decision tree t_n (whose construction eventually depends upon the data D_n), which takes constant values on each of its k ≥ 2 terminal nodes. It turns out that this estimate can be reinterpreted as a two-hidden-layer neural network, as summarized below. Let HL_1 = {H_1, ..., H_{k-1}} denote the collection of all hyperplanes participating in the construction of t_n. Each H_k ∈ HL_1 is of the form H_k = {x ∈ C^p : x^(i_k) = α_{i_k}} for some i_k ∈ {1, ..., p} and α_{i_k} ∈ C. To reach the leaf of the query point x, one finds the side on which x falls (+1 for right and −1 for left) for each hyperplane H_k. Using the above notations, the tree estimate t_n is mapped to the neural network as discussed below.
Designing the first hidden layer (HL_1): The input layer supplies the features to the first hidden layer of neurons, which corresponds to k − 1 perceptrons with the threshold activation function defined as τ(h_k(x)) = τ(x^(i_k) − α_{i_k}), where τ(u) = 2·1_{u≥0} − 1. Therefore, for each split in the tree, there is a neuron in HL_1 whose activity encodes the relative position of an input x with respect to the concerned split. The output of the first layer is the ±1-vector (τ(h_1(x)), ..., τ(h_{k−1}(x))), which describes all decisions of the inner tree nodes (it also includes the nodes off the tree path of x). Note that τ(h_k(x)) takes the value +1 if x is on one side of the hyperplane H_k and −1 if x is on the other side of H_k (where, by convention, +1 if x ∈ H_k). Also, each neuron k of the first hidden layer is connected to one and only one input x^(i_k), and the connection has weight +1 and offset −α_{i_k}.
Designing the second hidden layer (HL_2): HL_1 outputs a (k − 1)-dimensional vector of ±1-bits that encodes the exact position of x in the leaves of the tree. The leaf node identity of x can be extracted from this vector using a weighted combination of the bits along with an appropriate threshold function. The second hidden layer consists of k neurons, one for each leaf, and assigns a terminal cell to x. Let HL_2 = {L_1, ..., L_k} denote the collection of all the leaf nodes of the tree, and let L(x) be the leaf that contains x. We connect a unit k from HL_1 to a unit k' from HL_2 if and only if the hyperplane H_k is involved in the sequence of splits forming the path from the root to the leaf L_{k'}. The connection has weight b_{k,k'} = +1 if the split by H_k is from a node to a right child in that path and b_{k,k'} = −1 otherwise. Suppose we have (u_1(x), ..., u_{k−1}(x)) as the vector of ±1-bits seen at the output of HL_1. Then unit k' computes

$$ v_{k'}(x) = \tau\Big( \sum_{k \to k'} b_{k,k'}\, u_k(x) - l(k') + \tfrac{1}{2} \Big), \tag{2} $$

where the sum runs over the units of HL_1 connected to k' and l(k') is the length of the path from the root to L_{k'}. The rationale behind the choice (2) is that there are exactly l(k') connections starting from the first layer and pointing to k', and that the sum equals l(k') if and only if x ∈ L_{k'} (and is at most l(k') − 2 otherwise). Using (2), the argument of the threshold function is 1/2 if x ∈ L_{k'}, and is smaller than −1/2 otherwise. Thus, v_{k'}(x) = 1 if and only if the terminal cell of x is L_{k'}. To summarize, HL_2 outputs a vector of ±1-bits (v_1(x), ..., v_k(x)) whose components equal −1 except the one corresponding to the leaf L(x), which is +1.
If v_{k'}(x) = 1, then the output layer returns the average Ȳ_{k'} of the Y_i corresponding to the X_i falling in L_{k'}; since exactly one bit v_{k'}(x) equals +1, this can be written as

$$ t_n(x) = \sum_{k'=1}^{k} \bar{Y}_{k'} \, \frac{1 + v_{k'}(x)}{2}. \tag{3} $$

In order to increase the generalization capabilities of the proposed NSNT model, we replace the original relay-type activation function τ(u) with a hyperbolic tangent activation function σ(u) := tanh(u), which ranges between −1 and 1. Specifically, we use σ_1(u) = σ(β_1 u) at every neuron of the first hidden layer and σ_2(u) = σ(β_2 u) at every neuron of the second hidden layer. Here, β_1 and β_2 are positive hyper-parameters that determine the contrast of the hyperbolic tangent activation; the larger β_1 and β_2, the sharper the transition from −1 to 1. As β_1 and β_2 approach infinity, the continuous functions σ_1 and σ_2 converge to the threshold function. The hyperbolic tangent activation functions allow operations with a smooth approximation of the discontinuous step activation function. Having a loss function that is differentiable with respect to the parameters almost everywhere in the network allows the gradients to be backpropagated while training the model. A probabilistic interpretation of the network output can be obtained by interpreting the activation functions in the hidden layers.
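The whole construction can be checked on a toy tree with two splits and three leaves. The following sketch hard-wires the weights described above (the splits and leaf averages are our illustrative values, not from the paper); replacing `tau` with `tanh(beta * u)` would give the smooth NSNT version:

```python
# Simulate a fixed 2-split decision tree with the two-hidden-layer network
# described above, using threshold activations. Toy example: the tree splits
# on x0 >= 0.5 at the root, then on x1 >= 0.5 in the left subtree.

def tau(u):
    return 1.0 if u >= 0 else -1.0   # tau(u) = 2*1_{u>=0} - 1

# HL1: one neuron per split; weight +1 on the split coordinate, offset -alpha
splits = [(0, 0.5), (1, 0.5)]        # (coordinate i_k, position alpha_{i_k})

# HL2: one neuron per leaf. Entry w[k] is +1 (right child on the path),
# -1 (left child on the path) or 0 (split not on the path); l = path length.
leaves = [
    ([-1.0, -1.0], 2),               # L1: x0 < 0.5,  x1 < 0.5
    ([-1.0, +1.0], 2),               # L2: x0 < 0.5,  x1 >= 0.5
    ([+1.0,  0.0], 1),               # L3: x0 >= 0.5
]
leaf_means = [0.2, 0.7, 1.5]         # toy leaf averages Ybar_{k'}

def nsnt_forward(x):
    u = [tau(x[i] - a) for i, a in splits]              # first hidden layer
    v = [tau(sum(w_k * u_k for w_k, u_k in zip(w, u)) - l + 0.5)
         for w, l in leaves]                            # second hidden layer
    # output layer: sum of Ybar_{k'} * (1 + v_{k'})/2 -- only the active
    # leaf (v_{k'} = +1) contributes
    return sum(m * (1.0 + v_k) / 2.0 for m, v_k in zip(leaf_means, v))
```

For any input, exactly one second-layer bit fires, and the network reproduces the tree's leaf average.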

Remark 1
The tree estimate t_n, depending on D_n, can be interpreted as a neural network estimate. The architecture of this network is kept fixed, and so are the weights and offsets of the network layers. The idea of building a near-optimal neural tree with sparse connections is to keep the structure of the network intact (as discussed above) and let the parameters vary in a subsequent network training procedure with the backpropagation algorithm. Thus, once we design the connections between the neurons of the NSNT model, we can learn the network parameters in a better way by minimizing the empirical MSE for this network over the sample D_n. Fitting a fully-connected deep feed-forward neural network with two hidden layers (p input features, k_n − 1 neurons in HL_1, and k_n neurons in HL_2) requires O(pk_n + k_n^2) parameters, whereas the proposed NSNT model requires only O(k_n log k_n) (assuming that the decision tree is roughly balanced). We show the near-optimal rate of convergence of the proposed framework for the nonparametric regression set-up in Section 4. Since the value of k depends on n, we use k_n and k interchangeably in the paper.
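The parameter-count gap in Remark 1 is easy to verify for a concrete size. The counting conventions below are our own back-of-the-envelope sketch (the exact constants depend on architecture details); only the orders O(p·k + k²) versus O(k log k) matter:

```python
import math

def fc_param_count(p, k):
    """Fully connected 2HL net p -> (k-1) -> k -> 1: all weights + biases."""
    return (p * (k - 1) + (k - 1)) + ((k - 1) * k + k) + (k + 1)

def nsnt_param_count(p, k):
    """Sparse tree-structured net with k leaves (balanced tree assumed):
    each HL1 neuron sees exactly one input coordinate, each HL2 neuron sees
    about log2(k) HL1 units. Note p does not enter: HL1 wiring is sparse."""
    depth = math.ceil(math.log2(k))
    return 2 * (k - 1) + k * (depth + 1) + (k + 1)
```

For p = 100 and k_n = 64, the sparse wiring needs well under a tenth of the fully connected parameter budget.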
In the next section, we show the asymptotic consistency (local and strong) of the proposed NSNT model in the context of nonparametric regression using empirical risk minimization.

Strong Consistency of the Proposed NSNT Model
Consider the tree structure and denote by G_1 ≡ G_1(D_n) the bipartite graph that creates the connections between the input vector x = (x^(1), ..., x^(p)) and the k_n − 1 hidden neurons of HL_1. Similarly, let G_2 ≡ G_2(D_n) be the bipartite graph that represents the connections between the first hidden layer and the k_n hidden neurons of HL_2. The parameters that specify the first hidden units are contained in a matrix A ∈ M(G_1), identified by the weights over the edges of G_1, and a column vector b_1 of biases, of size k_n − 1. Similarly, the parameters of the second hidden units are represented by a matrix B ∈ M(G_2) of weights over G_2 and by a column vector b_2 of offsets, of size k_n. Let the output weights and offset be W_out = (w_1, ..., w_{k_n}) ∈ R^{k_n} and b_out ∈ R, respectively. The parameters that specify the model are collected in a vector

$$ \lambda = (A, b_1, B, b_2, W_{\mathrm{out}}, b_{\mathrm{out}}). \tag{4} $$

We further assume that there exists a positive constant c_1 such that

$$ \max\big( \|B\|_\infty, \|b_2\|_\infty, \|W_{\mathrm{out}}\|_1, |b_{\mathrm{out}}| \big) \le c_1 k_n, \tag{5} $$

where ‖·‖_∞ is the supremum norm of a matrix, and ‖·‖_1 is the L_1-norm of a vector. The rationale behind assumption (5) is that it constrains the weights and offsets taken by the computation units of the second hidden layer and the output layer. It can be easily verified that the condition is satisfied by the original tree estimates when Y is assumed to be bounded. Thus, we can assume that ‖Y‖_∞ ≤ L < ∞ almost surely, for some L. Therefore, letting Λ_n = {λ = (A, b_1, B, b_2, W_out, b_out) : (5) is satisfied}, we see that the neural network implements functions f_λ of the form described in Section 2, with λ ∈ Λ_n. We aim to tune the parameters λ using the data D_n such that the function realized by the obtained network becomes a 'good' estimate that minimizes the empirical L_2 error.
Let F_{n,k_n} = {f_λ : λ ∈ Λ_n} be the class of neural networks, and let m_n be the network that minimizes the empirical L_2 error, defined as

$$ m_n = \operatorname*{arg\,min}_{f \in F_{n,k_n}} \frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2, $$

where F_n is a rich class of functions, including additive functions, polynomial functions having coefficients of the same sign, products of continuous functions, etc. In order to establish the consistency of the regression function estimate m_n, we specify a particular class F_n of functions over C^p.

Definition 1 Let F_n be the class of continuous real-valued functions on C^p satisfying the above restrictions. For example, the additive functions of the form m(x) = f_1(x^(1)) + ... + f_p(x^(p)), where each f_i is continuous, belong to F_n, as do products of continuous functions of the form f(x) = f_1(x^(1)) · ... · f_p(x^(p)), and polynomial functions whose coefficients have the same sign.

Thus, we aim to find a consistent estimate m_n. We say m_n is consistent if its expected L_2 error tends to 0 as n → ∞. Using Lemma 10.1 of [21], we can write

$$ \int_{C^p} |m_n(x) - m(x)|^2 \mu(dx) \le 2 \sup_{f \in F_n} \Big| \frac{1}{n} \sum_{i=1}^{n} |f(X_i) - Y_i|^2 - \mathbb{E}|f(X) - Y|^2 \Big| + \inf_{f \in F_n} \int_{C^p} |f(x) - m(x)|^2 \mu(dx), \tag{6} $$

where µ denotes the distribution of X. For the strong consistency of the near-optimal sparse neural regression trees model, we show that the estimation error (first term on the R.H.S. of Eqn. 6) and the approximation error (second term on the R.H.S. of Eqn. 6) tend to 0. The former is proved using non-asymptotic uniform deviation inequalities and covering numbers corresponding to F_n, as shown in Theorem 1. The approximation error is handled using a pseudo-estimate similar to the decision-tree-generated function t_n and an application of the Lipschitz property of the activation function of the model, as shown in Theorem 2. We further assume that X is uniformly distributed in C^p and ‖Y‖_∞ ≤ L < ∞ almost surely, for some L, in the proofs of Theorems 1 and 2.
The next two theorems state that, with certain restrictions imposed on the number k_n of terminal nodes and with the parameters β_1 and β_2 properly regulated as functions of n, empirical L_2 risk-minimization provides the strong consistency of the proposed NSNT model.
Proof Let F_n be the set containing all neural networks constrained by Eqn. (5), with inputs in R^p, two hidden layers of respective sizes k_n − 1 and k_n, and one output unit. We have assumed that each f ∈ F_{n,k_n} satisfies ‖f‖_∞ ≤ c_1 k_n, and that Y is bounded (‖Y‖_∞ ≤ L < ∞). Let z_1^n = (z_1, ..., z_n) be a vector of n fixed points in R^p and let H be a set of functions from R^p to R. For every ε > 0, let N_1(ε, H, z_1^n) be the L_1 ε-covering number of H with respect to z_1, ..., z_n; it is defined as the smallest integer N such that there exist functions h_1, ..., h_N : R^p → R with the property that for every h ∈ H, there is a j ∈ {1, ..., N} such that

$$ \frac{1}{n} \sum_{i=1}^{n} |h(z_i) - h_j(z_i)| < \varepsilon. $$

Set H_n = {h(x, y) = |y − f(x)|^2 : f ∈ F_n}. The functions in H_n satisfy 0 ≤ h(x, y) ≤ 2c_1^2 k_n^2 + 2L^2 ≤ 4c_1^2 k_n^2, where the last inequality holds for n large enough that c_1 k_n ≥ L. Using Pollard's inequality [21], we have, for arbitrary ε > 0,

$$ \mathbb{P}\Big\{ \sup_{h \in H_n} \Big| \frac{1}{n} \sum_{i=1}^{n} h(X_i, Y_i) - \mathbb{E}\, h(X, Y) \Big| > \varepsilon \Big\} \le 8\, \mathbb{E}\big[ N_1(\tfrac{\varepsilon}{8}, H_n, Z_1^n) \big] \exp\Big( \frac{-n \varepsilon^2}{128\,(4 c_1^2 k_n^2)^2} \Big). \tag{7} $$

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 9 November 2021 doi:10.20944/preprints202105.0117.v2

Next, we bound the covering number N_1(ε/8, H_n, Z_1^n). Consider two functions h_i(x, y) = |y − f_i(x)|^2 of H_n for some f_i ∈ F_n, i = 1, 2. We get

$$ |h_1(x, y) - h_2(x, y)| = |f_1(x) - f_2(x)|\, |2y - f_1(x) - f_2(x)| \le 4 c_1 k_n |f_1(x) - f_2(x)|, $$

since |y| ≤ L ≤ c_1 k_n and ‖f_i‖_∞ ≤ c_1 k_n. Thus, if {h_1, h_2, ..., h_l} is an ε/8 packing of H_n on Z_1^n, then {f_1, f_2, ..., f_l} is an ε/(64 c_1 k_n) packing of F_n.
The covering number N_1(ε/(64 c_1 k_n), F_n, X_1^n) can be upper-bounded independently of X_1^n by extending the arguments of Theorem 16.1 of [21] from a network with one hidden layer to a network with two hidden layers. We now apply Theorem 9.4, Lemma 16.4, and Lemma 16.5 of [21] repeatedly in the rest of the proof. Let the outputs of the neurons of the first hidden layer belong to a class G_1 of functions, whose covering number can be bounded for 0 < ε < 1/4. The second hidden units compute functions from a collection G_2. Note that σ_2 satisfies the Lipschitz property |σ_2(u) − σ_2(v)| ≤ β_2 |u − v| for all (u, v) ∈ R^2, which allows the covering number of G_2 to be bounded in terms of that of G_1. Assuming without loss of generality that c_1, β_2 ≥ 1, we obtain a bound on the covering number of the class of functions computed at the output layer.
Finally, combining the covering number bounds for the individual layers, we conclude

$$ N_1\Big(\frac{1}{n}, F_n, X_1^n\Big) \le \big( 48\, e\, c_1^2 (n+1)^6 \big)^{(2p+5) k_n (k_n+1)}. \tag{9} $$

Combining (7)-(9), we obtain that, if the conditions of Theorem 1 hold, then the estimation error tends to 0 as n → ∞, which completes the proof.
Proof Let us consider a piece-wise constant function (pseudo-estimate) similar to the tree estimate t_n, with only one difference: the function computes the true conditional expectation E[Y | X ∈ L_{k'}] in each leaf L_{k'}, rather than the empirical average Ȳ_{k'} of Eqn. (3) in Section 3. Denote this tree-type pseudo-estimate by t_λ(x). We can write the expression for the approximation error as

$$ \inf_{f \in F_n} \int_{C^p} |f(x) - m(x)|^2 \mu(dx) \le 2 \int_{C^p} |f_\lambda(x) - t_\lambda(x)|^2 \mu(dx) + 2 \int_{C^p} |t_\lambda(x) - m(x)|^2 \mu(dx). \tag{11} $$

Now, we show that the two terms on the R.H.S. of (11) tend to zero under certain restrictions on k_n, β_1, and β_2. The second term requires a careful analysis of the asymptotic behavior of the cells of the tree pseudo-estimate t_λ(x).
To deal with the first term, we consider two intermediate expressions, I_1 and I_2, comparing the outputs of the network after the first and second hidden layers when the threshold activation τ is replaced by σ_1 and σ_2, respectively. Using the Cauchy-Schwarz inequality and the triangle inequality, the first term of (11) can be bounded in terms of I_1 and I_2. To find the upper bounds of I_1 and I_2, we use the following properties:
- Recall that σ_i is the hyperbolic tangent activation function and τ is the threshold activation function; then for all u ∈ R, |σ_i(u) − τ(u)| ≤ 2e^{−2β_i |u|}, for i = 1, 2.
- Also, σ_i satisfies the Lipschitz property |σ_i(u) − σ_i(v)| ≤ β_i |u − v| for all (u, v) ∈ R^2 and i = 1, 2.
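The first of these two properties is easy to verify numerically. A quick toy check of the exponential bound |tanh(βu) − τ(u)| ≤ 2e^{−2β|u|} over a grid of points (our own sanity check, not part of the proof):

```python
import math

def sigma(u, beta):
    # smooth NSNT activation: sigma_i(u) = tanh(beta_i * u)
    return math.tanh(beta * u)

def tau(u):
    # threshold activation: tau(u) = 2*1_{u>=0} - 1
    return 1.0 if u >= 0 else -1.0

def bound_holds(beta, grid):
    # verify |sigma(u) - tau(u)| <= 2*exp(-2*beta*|u|) at every grid point
    # (small additive slack absorbs floating-point rounding)
    return all(
        abs(sigma(u, beta) - tau(u)) <= 2.0 * math.exp(-2.0 * beta * abs(u)) + 1e-12
        for u in grid
    )
```

At u = 0 the gap is 1 while the bound is 2, and away from 0 the gap decays exponentially, faster for larger β; this is exactly the mechanism used to control I_1 and I_2.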
Using the above properties, we can upper-bound I_1, and using the Lipschitz property together with the Cauchy-Schwarz inequality we can upper-bound I_2. Putting these upper bounds of I_1 and I_2 together and using Eqn. (12), the first term on the R.H.S. of Eqn. (11) tends to 0 provided that the stated restrictions on k_n, β_1, and β_2 hold. To deal with the second term of Eqn. (11), we consider m ∈ F_n, X uniformly distributed in C^p, and bounded Y.
Let S_n(x) be the cell of the tree that contains x. Since m is continuous on C^p, the problem reduces to that of finding empirically optimal regression trees (as in [8]) that yield consistent estimates of m(·). We can directly use the consistency results of the CART model to show that the R.H.S. of (14) tends to zero when k_n → ∞ and k_n = o(n/log n) as n → ∞ [36][12]. The condition says that the tree estimates are consistent when the size of the tree grows with the sample size n at a controlled rate.

Remark 3
The consistency results for near-optimal sparse neural regression trees depend on the choice of the total number of leaves k_n and on certain restrictions imposed on the neural network hyper-parameters β_1 and β_2.

Analysis of Rate of Convergence
We can find the rate of convergence by using the complexity regularization principle [28][22][27]. Using Equation (9), we can penalize the complexity of F_{n,k_n}. For a detailed discussion on penalized risk minimization, one may refer to Chapter 12 of [21]. We balance the approximation error with the bounds on the covering number to get the following theorem for the rate of convergence of the near-optimal sparse neural regression trees model. A similar idea for finding the rate of convergence of single- and multi-layered perceptrons has been used in [21][28][33]. In the sequel, we assume that m is Lipschitz (δ, c)-smooth according to the following definition.

Definition 2 Let δ ∈ (0, 1] and c ∈ R^+. A function m : C^p → R is called Lipschitz (δ, c)-smooth if it satisfies

$$ |m(x) - m(z)| \le c\, \|x - z\|^{\delta} \quad \text{for all } x, z \in C^p. $$

Theorem 3 (Rate of convergence) Assume that X is uniformly distributed in C^p, ‖Y‖_∞ ≤ L < ∞ a.s., and m is Lipschitz (δ, c)-smooth. Let m_n be the estimate that minimizes the empirical L_2-risk, and let the network activation functions σ_i satisfy the Lipschitz property. Then, for any n ≥ max{β_2, 2^{p+1} L}, the expected L_2 error of m_n is of order n^{−2δ/(2p+2δ)} up to a logarithmic factor, where δ characterizes the smoothness of the true function.
Next, we use the complexity regularization principle to choose the parameter k_n of the estimate in a data-dependent way. To do this, let N_1(1/n, F_{k_n}) be the upper bound on the covering number of F_{k_n}, and define, for w_{k_n} ≥ 0,

$$ \mathrm{pen}_n(k_n) = 45 L^2 \, \frac{\log N_1\big(\tfrac{1}{n}, F_{k_n}\big) + w_{k_n}}{n} $$

as a penalty term penalizing the complexity of F_{k_n} [28]. Thus, (15) implies that pen_n(k_n) takes the following form with w_{k_n} = 1:

$$ \mathrm{pen}_n(k_n) = 45 L^2 \, \frac{(2p+5)\, k_n (k_n+1) \log\big( 48\, e\, c_1^2 (n+1)^6 \big) + 1}{n} = O\Big( \frac{k_n^2 \log n}{n} \Big). $$
Our proof of the rate of convergence relies on an extension of the proof techniques introduced by [28] and Chapter 12 of [21]. Assuming Y is bounded as in Theorems 1 and 2, we start from the error decomposition (6) and insert the penalty term, so that the expected L_2 error is bounded by the penalized estimation error plus the approximation error. The approximation error inf_{f ∈ F_{k_n}} ∫_{C^p} |f(x) − m(x)|^2 µ(dx) depends on the smoothness of the regression function. According to Theorem 3.4 of [33] and Corollary 1 of [28], for any deep feed-forward network with two hidden layers satisfying the assumptions of Theorem 3, the approximation error is of order k_n^{−2δ/p} for all x ∈ [0, 1]^p and sufficiently large n. Now we have to balance the approximation error with the bound on the covering number. Thus, taking k_n of the order n^{p/(2p+2δ)}, and upon using (18), we get the rate n^{−2δ/(2p+2δ)} up to a logarithmic factor. Thus, we obtain the desired convergence rate for the proposed NSNT model, which is 'near-optimal' up to some logarithmic factor according to [45].
Remark 4 Near-optimal sparse neural trees are a hybrid concept representation with a built-in architecture-selection strategy for neural networks. The architecture is less complex and has fewer tuning parameters, resulting in less training time. Since the algorithm uses the prior knowledge of decision trees, it places fewer restrictions on the geometry of the decision boundaries. Theorems 1 and 2 establish the consistency of the algorithm, while Theorem 3 points out a remarkable property of the proposed NSNT model and justifies the name of the proposal.

Performance Comparison on Real-world Datasets
In this section, we report the out-of-sample accuracy of our proposed NSNT model in comparison with state-of-the-art models on 40 regression datasets and 40 classification datasets obtained from the UCI Machine Learning Repository [17]. A summary of the datasets is available in Table 1. Our objective is to assess the relative strength of NSNT. We partition each dataset into training (50%), validation (25%), and testing (25%) sets. We employ 10-fold cross-validation with different randomly assigned training, validation, and test sets.
We use the area under the receiver operating characteristic curve (AUC) as the performance metric for classification datasets and the coefficient of determination (R^2) for regression datasets. By convention, the higher the AUC or R^2, the better the model. Further, we compare our proposed NSNT model mostly with popularly used greedy, optimal, and near-optimal methods: multivariate adaptive regression splines (MARS) [18], regression trees [8], random forest [9], XGBoost [14], optimal trees for classification and regression (OCT and ORT) [3], and near-optimal nonlinear trees for classification and regression (NNCT and NNRT) [4]. The state-of-the-art models were implemented as follows. NNCT and NNRT were implemented in the Julia programming language as in [4]. For testing the performance of MARS, we used the earth package in the R language [35]. For random forests, we used the randomForest package in R [29]. For gradient-boosted trees, we used the XGBoost library [15] with the parameter value ρ = 0.1. For the OCT and ORT models, we used the OptimalTrees package in Julia with standard auto-tuning of the complexity parameter, as in [3]. We first compare all methods across all datasets with a maximum depth of 12; the out-of-sample performance is reported in Tables 2 and 3.
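The two performance metrics can be computed with scikit-learn; a small hedged illustration on toy predictions (the values below are illustrative, not from the benchmark):

```python
# Compute the two benchmark metrics (AUC for classification, R^2 for
# regression) with scikit-learn on toy data.
from sklearn.metrics import roc_auc_score, r2_score

# classification: true labels and predicted scores
y_true_cls = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true_cls, y_score)   # area under the ROC curve

# regression: true targets and predictions
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5, 0.0, 2.0, 8.0]
r2 = r2_score(y_true_reg, y_pred_reg)      # coefficient of determination
```

Both metrics are bounded above by 1, and higher is better, consistent with the convention stated above.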
The training procedure for the proposed NSNT model is as follows. A decision tree is first built using the 'scikit-learn' package [37]. From the decision tree, we extract the set of all split directions and split positions and use them to build the neural network initialization parameters. The hybrid models are then trained using the TensorFlow library in Python [1]. The network is optimized by minimizing the empirical error on the training set, employing an iterative stochastic gradient-descent technique. The architecture of the network is kept fixed, and thus the weights and offsets of the three-layered DFFNN model are also fixed at initialization. A natural idea is then to keep the structure of the network intact and let the parameters vary in a subsequent training procedure with backpropagation. For this, we used the default functions available in TensorFlow. In our setup, the neural network was trained for 100 epochs, with the default hyper-parameter values of the gradient-based optimization algorithm. We found experimentally that using a lower value for β2 than for β1 is appropriate for achieving high accuracy. Accordingly, the initial parameters of the hyperbolic tangent activation function in the two layers were chosen as β1 = 100 and β2 = 1. This choice is also supported by the theoretical results presented in Sections 3 and 4, and it is practically significant: for a relatively small β2, the transition of the activation function from −1 to +1 is smoother, so a much stronger stochastic gradient signal reaches the first hidden layer during backpropagation; a converse argument applies to β1. The training time and memory requirements of the proposed NSNT model are also quite low compared to advanced deep neural network models.
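The tree-to-network initialization step can be sketched as follows: each internal split (feature j, threshold t) of a fitted decision tree becomes one first-layer unit computing tanh(β1 · (x_j − t)), so the first hidden layer produces near-binary indicators of the split decisions. This is an illustrative reconstruction under the paper's stated β1 = 100, not the authors' exact code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = tree.tree_
# Internal (non-leaf) nodes carry the split directions and positions.
internal = [i for i in range(t.node_count) if t.children_left[i] != -1]

beta1 = 100.0  # steep first-layer activation, as in the text
W1 = np.zeros((len(internal), X.shape[1]))
b1 = np.zeros(len(internal))
for row, node in enumerate(internal):
    W1[row, t.feature[node]] = beta1        # split direction
    b1[row] = -beta1 * t.threshold[node]    # split position

# First hidden layer: tanh(beta1 * (x_j - threshold)) for each split.
H = np.tanh(X @ W1.T + b1)
print(W1.shape, bool(np.abs(H).max() <= 1.0))
```

These weights and offsets would then serve as the fixed initialization that a subsequent TensorFlow backpropagation pass refines.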
Our proposed model is faster, especially when trained on a GPU. Tables 2 and 3 present the results of the different regression models and classifiers. Moreover, NNCT (NNRT) and XGBoost are the 'second' and 'third' best performing classifiers (regression models) in terms of the performance metrics for the majority of the datasets compared to the other traditional methods considered in this study. From the experimental evaluation, it can be concluded that, on average, the near-optimal sparse neural trees model outperforms other greedy, optimal, and near-optimal statistical and machine learning models by a significant margin. Thus, the proposed NSNT method can be a 'good' choice for statistical learning on regression and classification datasets arising in various applied domains.

Conclusions and Discussions
In recent years, several studies have attempted to build classification trees in which the greedy sub-optimal construction is replaced by solving an optimization problem, usually in integer variables. These procedures, while successful against CART, are extremely time-consuming and can only address problems of moderate size. In this paper, we have proposed a new soft pruning approach to building classification trees and trained them using neural networks. By replacing the binary decisions with randomized decisions in the neural trees, the resulting sparse framework is smooth and contains only continuous variables, allowing one to use gradient information. We developed an easily interpretable near-optimal sparse neural tree that is asymptotically consistent, attains a near-optimal convergence rate, and is scalable and accurate compared to state-of-the-art models. The model is empirically shown to perform consistently across a wide range of datasets of various sizes. Strong empirical evidence shows that, with limited running time, our method outperforms recent benchmarks, achieving significant improvements over XGBoost, OCT, and NNCT, among many others. We evaluated the performance of the proposed NSNT model both theoretically and experimentally. Extending the framework to imbalanced classification problems and survival regression problems remains a direction for future research.