4.1. Results of ESUL
In this section, we investigate the effectiveness of ESUL from a network science perspective. Within the ESUL framework, we compare CH3-L3 (the proposed topological link prediction-based regrowth method) with random assignment of new links (as is typical of SET [9]). We run all experiments with the default hyperparameter settings shown in Suppl. Table 2, on 5 datasets: MNIST, Fashion_MNIST, and EMNIST using MLPs, and CIFAR10 and CIFAR100 using VGG16. For the first 3 datasets, we adopt an MLP with architecture 784-1000-1000-1000-10 (47 output neurons for EMNIST), while for VGG16 we replace the fully connected layers after the convolutional layers with the sparse network, assigning them a structure of 512-1000-1000-1000-10 (100 output neurons for CIFAR100).
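For concreteness, below is a minimal sketch (assuming PyTorch; the masking approach, layer names, and the `density` value are illustrative and not necessarily the paper's implementation) of how such a sparse MLP can be represented with Erdos-Renyi-initialized binary masks, as is common in SET-like dynamic sparse training:

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Linear layer whose weights are gated by a sparse binary mask.

    The mask is initialized Erdos-Renyi style: each link is kept
    independently with probability `density`.
    """
    def __init__(self, in_features, out_features, density=0.01):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        mask = (torch.rand(out_features, in_features) < density).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight * self.mask,
                                    self.linear.bias)

# Illustrative 784-1000-1000-1000-10 sparse MLP (MNIST-style input).
model = nn.Sequential(
    SparseLinear(784, 1000), nn.ReLU(),
    SparseLinear(1000, 1000), nn.ReLU(),
    SparseLinear(1000, 1000), nn.ReLU(),
    SparseLinear(1000, 10),
)
out = model(torch.randn(32, 784))  # forward pass on a dummy batch
```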
Figure 3 compares the performance of CH3-L3 (ESUL) and random (SET) on the five datasets. The first row shows the accuracy curves of each algorithm; in the upper right corner of each panel, we mark the accuracy improvement at the 250th epoch. The second row of Figure 3 shows the Area Across the Epochs (AAE), which reflects the learning speed of each algorithm; its computation and explanation can be found in the supplementary material. Based on the accuracy curves and AAE values, CH3-L3 (ESUL) outperforms random (SET) on all the datasets, demonstrating that the proposed epitopological learning method is effective in identifying lottery tickets and that CH theory boosts deep learning.
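As a rough illustration only (the exact definition is given in the supplementary material; here we assume AAE is a normalized area under the accuracy-versus-epoch curve, which is our reading of the metric rather than the paper's formula), AAE can be sketched as:

```python
import numpy as np

def area_across_epochs(accuracy_per_epoch):
    """Normalized area under the accuracy curve across epochs.

    Higher values mean the model reaches high accuracy earlier,
    so the quantity acts as a proxy for learning speed.
    """
    acc = np.asarray(accuracy_per_epoch, dtype=float)
    # Trapezoidal area, normalized by the epoch span so that a curve
    # pinned at accuracy 1.0 from the first epoch would score 1.0.
    return np.trapz(acc, dx=1.0) / (len(acc) - 1)

# Example: a faster learner gets a larger area even with the same final accuracy.
slow = [0.2, 0.4, 0.6, 0.8, 0.9]
fast = [0.6, 0.8, 0.85, 0.9, 0.9]
print(area_across_epochs(slow), area_across_epochs(fast))
```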
Figure 4A provides evidence that ESUL can automatically percolate the sparse network to a significantly smaller size and form a hyperbolic network with community organization. We present an example network structure for the MNIST dataset on MLP, where each block shows a plain representation and a hyperbolic representation of the network at the initial status (common to both methods) and at the final epoch (random (SET) or CH3-L3 (ESUL)). The ESUL block in Figure 4A shows that each layer of the network has been percolated to a smaller size, especially the two middle hidden layers, where the active neuron post-percolation rate (ANP) is reduced to less than 20%. This indicates that the network trained by ESUL achieves better performance with significantly fewer active neurons and that the size of the hidden layers can be learned automatically through the topological evolution of ESUL. ANP is particularly relevant in deep learning because neural networks often contain a large number of redundant neurons that do not contribute significantly to the model's performance. By pruning them away, ESUL reduces the computational cost of training and inference, making the network more efficient and scalable. In addition, reducing the number of active neurons can help prevent overfitting and improve the generalization ability of the model, since it remarkably reduces the dimensionality of the embedding in the hidden layers.
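A minimal sketch of how ANP could be computed from a layer's binary connectivity masks (this assumes our reading of the metric: the fraction of neurons that retain at least one incoming and one outgoing connection after training):

```python
import numpy as np

def active_neuron_rate(incoming_mask, outgoing_mask):
    """Fraction of a hidden layer's neurons that keep at least one link.

    `incoming_mask` has shape (n_neurons, n_inputs) and `outgoing_mask`
    has shape (n_outputs, n_neurons); a neuron is considered active if
    it has at least one incoming and one outgoing connection.
    """
    has_in = incoming_mask.sum(axis=1) > 0
    has_out = outgoing_mask.sum(axis=0) > 0
    return np.mean(has_in & has_out)

# Toy example: 4 hidden neurons, only 2 remain wired on both sides.
incoming = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
outgoing = np.array([[1, 0, 1, 0]])
print(active_neuron_rate(incoming, outgoing))  # -> 0.5
```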
Figure 3. Comparison of CH3-L3 (ESUL) and random (SET). In the first row, we report the accuracy curves of CH3-L3 and random on 5 datasets and mark the increment at the 250th epoch in the upper right corner of each panel. The second row reports the area across the epochs (AAE), which indicates the learning speed of the different algorithms corresponding to the accuracy plot above.
Figure 4. Network science analysis of the sparse ANN topology. (A) Presentation of the different statuses of the network: the network initialized by ER, the network trained by random (final epoch), and the network trained by ESUL (final epoch). The angular separability index (ASI) and the power-law exponent are reported in each panel; additionally, the active neuron post-percolation rate (ANP) curve across the epochs is reported inside the ESUL panel. (B) Four network topological measures averaged over 5 different ANNs trained on 5 datasets by CH3-L3 (ESUL) and random (SET); the shaded area indicates the standard error over the 5 datasets.
Furthermore, we perform coalescent embedding [33] of each network in Figure 4A into the hyperbolic space, where the radial coordinates are associated with the hierarchical, power-law-like node degree distribution and the angular coordinates with the geometrical proximity of the nodes in the latent space. This means that the typical tree-like structure of the hyperbolic network emerges if the node degree distribution is power law [23]. Indeed, the more power-law the network, the more the nodes migrate towards the center of the hyperbolic 2D representation, and the more the network displays a hierarchical hyperbolic structure.
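As a reminder of the radial mapping commonly used in coalescent embedding (stated here as an assumption about the specific variant adopted, with i the degree rank of a node, N the number of nodes, and gamma the power-law exponent), nodes sorted by decreasing degree receive radial coordinates

r_i = 2\left(\beta \ln i + (1-\beta)\ln N\right), \qquad \beta = \frac{1}{\gamma - 1},

so that higher-degree hubs obtain smaller r_i and are placed closer to the center of the hyperbolic disk.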
Surprisingly, we found that the network formed by CH3-L3 finally (at the 250th epoch) becomes a hyperbolic power-law network (exponent reported in Figure 4A) with a hyperbolic community organization (angular separability index ASI = 1, indicating a perfect angular separability of the community formed by each layer [34]), while the others (the initial network and random (SET) at the 250th epoch) display neither a power-law topology with latent hyperbolic geometry nor a crystal-clear community organization (an ASI around 0.6 denotes a substantial presence of aberration in the community organization [34]). Community organization is a fundamental mesoscale structure of real complex networks [35], such as biological and socioeconomic ones [36]. For instance, brain networks [37] and maritime networks [38] display distinctive community structures that are fundamental to facilitating diversified functional processing in each community separately and global sharing to integrate these functionalities between communities. The ESUL approach can not only efficiently identify the important neurons but also learn a more complex and ultra-deep network structure. This mesoscale structure leverages topologically separated layer-communities to implement in each of them diversified and specialized functional processing. The result of this 'regional' layer-community processing is then globally integrated via the hubs (nodes with higher degrees) that each layer-community owns. Thanks to the power-law degree distributions, regional hubs that belong to each layer-community emerge as ultra-deep nodes in the global network structure and promote a hierarchical organization that makes the network ultra-small-world (as we explain a few lines below). Indeed, in Figure 4A (left) we show that the layer-community organization of the ESUL-learned network is ultra-deep, meaning that each layer also has an internal hierarchical depth due to the power-law node degree hierarchy. Nodes with higher degrees in each layer-community are also more central and cross-connected in the entire network topology, playing a central role (their radial coordinate is smaller) in the latent hyperbolic space that underlies the network topology. It is as if the ANN has an ultra-depth that is orthogonal to the regular planar layer depth: the regular planar depth is given by the organization in layers, while the ultra-depth is given by the topological centrality of different nodes from different layers in the hyperbolic geometry underlying the ANN trained by ESUL. Moreover, we present the analysis of the ANP reduction throughout the epochs within the ESUL block. Our results show that the ANP diminishes to a significantly low level after approximately 50 evolution iterations. This finding highlights the efficient network percolation capability of ESUL and, at the same time, suggests that a prolonged network evolution may not be necessary, as the network stabilizes within a few epochs, indicating rapid convergence.
To evaluate the structural properties of the sparse network, we utilize measures commonly used in network science (Figure 4B). The details of each measure are explained in the supplementary material; here we focus on the observed phenomena. We consider the entire ANN network topology and compare 4 network topological measures, averaged over the 5 datasets, for CH3-L3 (ESUL) and random (SET). The plot of the power-law exponent gamma indicates that CH3-L3 (ESUL) produces networks with a power-law degree distribution, whereas random (SET) does not. We clarify that Mocanu et al. [9] also reported that SET could achieve a power-law distribution with MLPs, but in their case they used a learning rate of 0.01 and extended training to 1000 epochs. We acknowledge that, with a higher learning rate and a much longer waiting time, SET may achieve a power-law distribution. However, we emphasize that in Figure 4B (first panel) ESUL achieves a power-law degree distribution regardless of the conditions, across 5 different datasets and 2 types of network architecture (MLPs and VGG16). This is because ESUL regrows connectivity based on the existing network topology according to a brain-network automaton rule, whereas SET regrows at random. Furthermore, structural consistency [39] implies that CH3-L3 (ESUL) makes the network more predictable, while modularity shows that it forms distinct, well-separated communities. The characteristic path length is computed as the average shortest path length over node pairs in the network; it is a measure associated with network small-worldness and also expresses the message-passing and navigability efficiency of a network [23]. The ESUL-trained network topology is ultra-small-world, which occurs when a network is small-world with a power-law degree exponent lower than 3 (more precisely, closer to 2 than to 3) [23]; this means that the transfer of information or energy in the network is even more efficient than in a regular small-world network [23]. All of these measures give strong evidence that ESUL is capable of transforming the original random network into a complex structured topology that is more suitable for training and is potentially a valid lottery-ticket network [8]. This transformation can lead to significant improvements in network efficiency, scalability, and performance.
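Three of these measures (the power-law exponent, the characteristic path length, and the modularity) can be estimated from the ANN's sparse connectivity graph with standard tools; a minimal sketch follows (assuming NetworkX and the `powerlaw` package, using generic network-science estimators rather than the paper's exact implementation; structural consistency [39] has its own estimator and is not sketched here):

```python
import networkx as nx
import powerlaw

def topology_report(G):
    """Estimate power-law exponent, characteristic path length and modularity
    on the largest connected component of the sparse ANN graph G."""
    giant = G.subgraph(max(nx.connected_components(G), key=len))

    # Maximum-likelihood fit of the degree distribution tail.
    degrees = [d for _, d in giant.degree() if d > 0]
    gamma = powerlaw.Fit(degrees, verbose=False).power_law.alpha

    # Average shortest path length over node pairs (characteristic path length).
    path_length = nx.average_shortest_path_length(giant)

    # Modularity of a greedy community partition.
    communities = nx.algorithms.community.greedy_modularity_communities(giant)
    modularity = nx.algorithms.community.modularity(giant, communities)

    return {"power_law_gamma": gamma,
            "characteristic_path_length": path_length,
            "modularity": modularity}

# Toy usage on a scale-free-like graph.
G = nx.barabasi_albert_graph(1000, 3)
print(topology_report(G))
```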
4.2. Results of CHT
Table 1. Results of CHT compared to RigL and the fully connected network in the 1x and 2x cases.
| Method | MNIST (MLP) ACC | MNIST (MLP) AAE | Fashion_MNIST (MLP) ACC | Fashion_MNIST (MLP) AAE | EMNIST (MLP) ACC | EMNIST (MLP) AAE |
|---|---|---|---|---|---|---|
| FC (1x) | 98.69±0.02 | 96.33±0.16 | 90.43±0.09 | 87.40±0.02 | 85.58±0.06 | 81.75±0.10 |
| RigL (1x) | 97.40±0.07 | 93.69±0.10 | 88.02±0.11 | 84.49±0.12 | 82.96±0.04 | 78.01±0.06 |
| CHT (1x) | 98.05±0.04 | 95.45±0.05 | 88.07±0.11 | 85.20±0.06 | 83.82±0.04 | 80.25±0.20 |
| FC (2x) | 98.73±0.03 | 96.27±0.13 | 90.74±0.13 | 87.58±0.04 | 85.85±0.05 | 82.19±0.12 |
| RigL (2x) | 97.91±0.09 | 94.25±0.03 | 88.66±0.07 | 85.23±0.08 | 83.44±0.09 | 79.10±0.25 |
| CHT (2x) | 98.34±0.08 | 95.60±0.05 | 88.34±0.07 | 85.53±0.25 | 85.43±0.10 | 81.18±0.15 |

| Method | CIFAR10 (VGG16) ACC | CIFAR10 (VGG16) AAE | CIFAR100 (VGG16) ACC | CIFAR100 (VGG16) AAE | Multi30K (Transformer) BLEU | Multi30K (Transformer) AAE |
|---|---|---|---|---|---|---|
| FC (1x) | 91.52±0.04 | 87.74±0.03 | 66.73±0.06 | 57.21±0.11 | 24.0±0.20 | 20.9±0.13 |
| RigL (1x) | 91.60±0.10* | 86.54±0.14 | 67.87±0.17* | 53.80±0.49 | 21.1±0.08 | 18.1±0.08 |
| CHT (1x) | 91.68±0.15* | 86.57±0.08 | 67.58±0.30* | 57.30±0.20* | 21.3±0.29 | 18.3±0.13 |
| FC (2x) | 91.75±0.07 | 87.86±0.02 | 66.34±0.06 | 57.02±0.04 | - | - |
| RigL (2x) | 91.75±0.03 | 87.07±0.09 | 67.88±0.35* | 54.08±0.43 | - | - |
| CHT (2x) | 91.98±0.03* | 88.29±0.10* | 67.70±0.16* | 57.79±0.08* | - | - |
To enhance the performance of ESUL, we propose a 4-step procedure named CHT. We first report the initialization structure in Figure 2C, a CSTI example on the 2x version of the MNIST dataset, for which we construct the heatmap of the input-layer degrees and the node-density curve. It can be clearly observed that with CSTI the links are assigned mostly to nodes associated with input pixels at the center of the figures in the MNIST dataset, and indeed the central area of the figures is more informative. We also report that the CSTI-formed network has an initialized power-law exponent indicating a topology with more hub nodes than in a random network. To assess the effectiveness of each process proposed by CHT (excluding ESUL, which was investigated in the previous section), we conducted an ablation test and report the results in the first three plots of Suppl. Figure S2. These experiments demonstrate the utility of each individual component within the CHT framework.
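As an illustration of the degree heatmap just mentioned (a sketch of the visualization only, assuming the 784 input nodes map back to the 28x28 MNIST pixel grid; this is not the paper's plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

def input_degree_heatmap(input_mask, side=28):
    """Plot the degree of each input node arranged on the image grid.

    `input_mask` is the binary connectivity matrix of the first sparse
    layer with shape (n_hidden, n_inputs); the degree of an input node is
    the number of hidden neurons it is wired to.
    """
    degrees = input_mask.sum(axis=0).reshape(side, side)
    plt.imshow(degrees, cmap="viridis")
    plt.colorbar(label="input-node degree")
    plt.title("Input-layer degree heatmap")
    plt.show()

# Toy usage with a random sparse mask (1% density).
mask = (np.random.rand(1000, 784) < 0.01).astype(int)
input_degree_heatmap(mask)
```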
In Table 1, we report the final performance of CHT compared to the SOTA DST method RigL and to the fully connected network on 6 datasets and 3 different models. Based on the obtained results, it is evident that CHT outperforms RigL in the majority of cases and consistently trains faster. For the analysis of running time and FLOPs, please refer to Suppl. Table S3. Furthermore, on both the MNIST and EMNIST datasets we achieve results comparable to the fully connected (FC) model with only 1% of the links remaining. Moreover, note that while our performance on the CIFAR100 dataset is comparable to RigL, we observe a higher AAE for our approach, indicating that CHT learns faster.
Table 2.
Active neuron post-percolation rate corresponding to the results in Table 1
| Method | MNIST | Fashion_MNIST | EMNIST | CIFAR10 | CIFAR100 | Multi-30k |
|---|---|---|---|---|---|---|
| CHT (1x) | 31.15% | 30.45% | 42.22% | 32.52% | 32.32% | 33.86% |
| RigL (1x) | 97.45% | 98.69% | 89.19% | 94.14% | 93.50% | 88.7% |
| CHT (2x) | 33.27% | 29.25% | 39.78% | 34.24% | 35.71% | - |
| RigL (2x) | 100% | 100% | 99.82% | 99.83% | 99.80% | - |
Additionally, as illustrated in Table 2, CHT successfully percolates the network to 30-40% of the original neurons in every dataset. This implies that we utilize fewer neurons to achieve results similar to RigL; CHT can therefore generalize and abstract in its hidden layers the data information associated with the classification task more effectively, providing a new solution for finding the lottery ticket of ANNs [8]. CHT's ability to adaptively percolate the network size, yielding a minimalistic network model that preserves performance with a smaller architecture, is a solution in line with Occam's razor, also known in science as the principle of parsimony: the problem-solving principle advocating the search for explanations constructed with the smallest possible set of elements [40]. For instance, in physics, parsimony was an important heuristic in Albert Einstein's formulation of special relativity [41]; in Pierre Louis Maupertuis' and Leonhard Euler's work on the principle of least action [42]; and in Max Planck's, Werner Heisenberg's, and Louis de Broglie's development of quantum mechanics [43,44]. On this basis, we can claim that CHT represents the first example of an algorithm for "parsimony sparse training".