4. Experiments
In this section, we conduct extensive experiments to validate the superiority of ViFi and answer the following research questions.
Q1: What is the performance of ViFi in node classification and graph classification tasks?
Q2: How can we verify the advantage of the ViFi framework?
Q3: What roles are fulfilled by the individual components within the proposed framework?
Q4: In what ways is the robustness of ViFi demonstrated?
Q5: How effective is the approach when it is utilized for node-clustering applications?
Q6: To what extent do changes in hyperparameter settings influence the behavior of ViFi?
Datasets. We conduct our evaluation on six benchmark datasets, whose main statistics are listed in Table 2. Cora, Citeseer, Pubmed, and DBLP are citation networks, whereas ACM and Chameleon come from academic and Wikipedia sources, respectively.
ViFi. In the experimental analysis, OURS denotes the semi-supervised variant of ViFi, whereas OURS-UN refers to its unsupervised counterpart. For the unsupervised setting, the learned representations are subsequently fed into linear classifiers, allowing the model to produce node-classification results.
Baselines. We compare ViFi with state-of-the-art methods. 1) Base encoder: GCN [33]. 2) Attention-based encoders: GAT [35], MAGCN [36], DGCN [37], and PA-GCN [38]. 3) Multi-view information fusion-based encoders for node classification: MixHop [39], N-GCN [40], MOGCN [41], MAGCN [36], DGCN [37], PA-GCN [38], LoGo-GNN [32], StrucGCN [42], and ND-GCN [43]. 4) Multi-view information fusion-based encoders for graph classification: Co-GCN [44], LGCN-FF [45], SLFNet [46], HGCN-MVSC [47], and MGCN-DNS [48]. 5) Contrastive learning-based encoders: NCLA [49], PA-GCN [38], GraphCL [50], IGCL [51], and GCA [52]. 6) Unsupervised learning models: K-means, DeepWalk [53], GAE [34], and VGAE [34].
Parameter Setting. A complete list of hyperparameter settings for each dataset is provided in Table 3.
Implementation Details. Our training procedure adopts a full-batch strategy for each epoch. The method is implemented in PyTorch, and parameters are updated with the Adam [54] optimizer. For the standard graph benchmarks, we randomly sample different numbers of labeled nodes per class for training, while keeping 1,000 nodes fixed for testing. In superpixel datasets, evaluation is conducted on a set of 10,000 images. All classification accuracies (ACC) are averaged over 10 independent runs using the data splits described above. The hyperparameter
is explored across {0.05, 0.1, 0.15, …, 0.95}, while
and
are adjusted within {0, 0.1, 0.2, …, 1}. Additionally, the cosine-threshold is tuned in {0.1, 0.15, 0.2, …, 0.5}, and
k is varied over {5, 10, 15, …, 30}. The final results for each dataset are reported using the hyperparameter combination and iteration count that yield optimal performance.
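The search ranges above can be enumerated as in the following minimal sketch. The names `hp_a`, `hp_b`, and `hp_c` are hypothetical placeholders for the three weighting hyperparameters, and the helper `grid` is an illustration rather than part of the ViFi implementation. Grid points are generated from integer counts so that values such as 0.15 are not affected by accumulated floating-point error.

```python
from itertools import product
from math import prod


def grid(start, stop, step, digits=2):
    """Inclusive, evenly spaced grid built from integer counts (avoids float drift)."""
    n = round((stop - start) / step)
    return [round(start + i * step, digits) for i in range(n + 1)]


# Placeholder names: the text gives the ranges, the dictionary keys are assumptions.
search_space = {
    "hp_a": grid(0.05, 0.95, 0.05),            # {0.05, 0.1, ..., 0.95}
    "hp_b": grid(0.0, 1.0, 0.1),               # {0, 0.1, ..., 1}
    "hp_c": grid(0.0, 1.0, 0.1),               # {0, 0.1, ..., 1}
    "cosine_threshold": grid(0.1, 0.5, 0.05),  # {0.1, 0.15, ..., 0.5}
    "k": list(range(5, 31, 5)),                # {5, 10, ..., 30}
}

# Number of joint configurations in the full Cartesian grid.
n_configs = prod(len(values) for values in search_space.values())

# Lazily iterate over every configuration as a dict.
configs = (dict(zip(search_space, combo)) for combo in product(*search_space.values()))
```

The full Cartesian product is large, so in practice one would more plausibly tune one or two hyperparameters at a time while holding the others fixed, as in the sensitivity study of Section 4.6.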
Evaluation Metrics. Following established practice in node and graph classification, we evaluate both the baseline methods and ViFi using classification accuracy (ACC). For each dataset, ACC is computed across all test samples. In addition, to assess clustering effectiveness, we employ normalized mutual information (NMI) [55] and the adjusted Rand index (ARI) [56], which provide complementary measures of how well ViFi and competing approaches capture underlying cluster structures.
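For concreteness, the three metrics can be computed as in the sketch below. It is a plain-Python illustration, not the evaluation code used in the experiments. NMI admits several normalizations; the arithmetic mean of the two label entropies is assumed here, since the text does not state which one is used.

```python
from collections import Counter
from math import comb, log


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(y_true, y_pred):
    """Mutual information normalized by the mean label entropy (assumed variant)."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    ct, cp = Counter(y_true), Counter(y_pred)
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = (_entropy(y_true) + _entropy(y_pred)) / 2
    return mi / denom if denom > 0 else 1.0


def ari(y_true, y_pred):
    """Adjusted Rand index computed from pair counts of the contingency table."""
    total_pairs = comb(len(y_true), 2)
    if total_pairs == 0:
        return 1.0
    sum_joint = sum(comb(c, 2) for c in Counter(zip(y_true, y_pred)).values())
    sum_t = sum(comb(c, 2) for c in Counter(y_true).values())
    sum_p = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = sum_t * sum_p / total_pairs
    max_index = (sum_t + sum_p) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)
```

Both NMI and ARI are invariant to a relabeling of the clusters, which is why they are suitable for clustering, whereas ACC requires predicted labels that are aligned with the ground-truth classes.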
Table 4. Notation description.
| Notation | Description |
|---|---|
| Raw | The raw input view. |
| RawC | The convolution augmentation view via Equation (22). |
| PR | The augmented view based on raw view relations. |
| PC | The augmented view based on cosine similarity view relations. |
| PK | The augmented view based on kNN view relations. |
| Global | Learned representation of global views ({Raw, PC, PK}). |
| Local1 | Learned representation of local views ({Raw, PC}). |
| Local2 | Learned representation of local views ({Raw, PK}). |
| Local3 | Learned representation of local views ({PC, PK}). |
| Local-Global | Fused representation from the local-to-global views. |
| No-Opt (manual) | Fusion without optimization; a fixed manual strategy is used to integrate filtered views. |
| Random-Select (avg) | Randomly selects view subsets of the same size for fusion and averages the results. |
| Only-Entropy | Employs only the entropy-balance term in the gain function. |
| Only-Structure | Employs only the normalized structural-difference term in the gain function. |
| Greedy-by-Entropy | Sequentially adds views based on their individual information scores until the gain ceases to improve. |
| Full (Ours) | Complete optimized fusion mechanism using both entropy balance and structural complementarity to adaptively select the optimal view integration. |
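To make the PC and PK relations in Table 4 concrete, the sketch below builds a cosine-threshold edge set and a kNN edge set from a node-feature matrix. It is an illustration under stated assumptions: cosine similarity is also used to rank neighbors for the kNN view, and both edge sets are treated as undirected. The function names are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def cosine_threshold_edges(X, threshold):
    """Connect node pairs whose feature similarity reaches the cosine threshold."""
    n = len(X)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if cosine(X[i], X[j]) >= threshold}


def knn_edges(X, k):
    """Connect each node to its k most similar nodes (symmetrized, assumed)."""
    n = len(X)
    edges = set()
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(X[i], X[j]), reverse=True)
        for j in ranked[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Toy features: nodes 0 and 1 are similar, node 2 is different.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
pc_view = cosine_threshold_edges(X, threshold=0.5)
pk_view = knn_edges(X, k=1)
```

The quadratic loops are adequate for a toy example only; on graphs of the size of Pubmed a vectorized or approximate nearest-neighbor search would be needed.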
4.1. Performance on Node and Graph Classification (Q1)
4.1.1. Performance on Node Classification
This section reports the average classification accuracy (ACC) along with its standard deviation across 10 independent trials. For reference, the results of DeepWalk [53], NCLA [49], LGCN-FF [45], and SLFNet [46] are adopted from their respective original studies. The outcomes of semi-supervised node classification are compiled in Table 5, with the key insights summarized as follows:
Compared with the baseline models, ViFi achieves superior performance on most datasets. Notably, the semi-supervised version of ViFi (OURS) achieves better results than its unsupervised counterpart (OURS-UN). This improvement can be explained by the fact that the semi-supervised framework of ViFi adopts an end-to-end fusion mechanism, in which the available label information guides and refines the embedding fusion process during model training.
The semi-supervised version of ViFi (OURS) consistently outperforms other models that incorporate multi-topology or multi-view information fusion, such as MAGCN, MOGCN, and PA-GCN. Moreover, the unsupervised version of ViFi (OURS-UN) also achieves better performance than most graph contrastive learning methods, including GCA and IGCL. This advantage primarily stems from ViFi’s ability to identify and retain only the views that contribute positively to representation, while filtering out irrelevant views that contribute negatively, thereby enhancing representation quality and improving the model’s robustness against noisy or irrelevant information.
Building on the failure cases listed in Table 5, we select the Pubmed dataset as a representative example for a more detailed analysis. On this dataset, ViFi performs worse than the NCLA model. This can be attributed to a key characteristic of NCLA: it employs an augmentation-based learning strategy, which extracts self-supervised signals between the original graph and its augmented versions effectively, provided that the underlying relationships in the raw graph are trustworthy. A comparable pattern emerges on the Citeseer dataset, where ViFi trails the LA-GCN model. This gap can be attributed to the design of LA-GCN, which incorporates a trainable local augmentation mechanism grounded in the structural relations of the original graph and places greater emphasis on enhancing the informative value of the data through feature engineering. Overall, however, ViFi still performs better than these competing approaches.
The superior performance of ViFi can be mainly attributed to its entropy-driven adaptive view filter and optimized fusion mechanism. The adaptive view filter evaluates the similarity and connectivity among nodes to characterize the concentration of the information distribution. Based on this, it retains the views that contain richer and more informative content for subsequent fusion. In the optimized fusion mechanism, a novel information gain function evaluates candidate view groupings based on entropy balance and structural complementarity, and is then used to determine the most effective integration strategy, enabling a more efficient and complementary fusion of multi-view representations. The superiority of ViFi becomes particularly pronounced in the presence of noise; the robustness of the model is examined and quantified in Section 4.4.
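The idea of scoring views by the concentration of their information distribution can be illustrated as follows. This is a simplified sketch, not the filter defined in the method section: it assumes that structural entropy is the normalized Shannon entropy of the degree distribution, that feature entropy is the normalized Shannon entropy of the edge-similarity weights, and that views with a higher combined score are retained. The names `view_score`, `filter_views`, and `alpha` are hypothetical.

```python
from math import log


def shannon_entropy(weights):
    """Shannon entropy of non-negative weights after normalizing them to sum to 1."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum(w / total * log(w / total) for w in weights if w > 0)


def normalized_entropy(weights):
    """Entropy divided by its maximum log(n), so the result lies in [0, 1]."""
    n = len(weights)
    return shannon_entropy(weights) / log(n) if n > 1 else 0.0


def view_score(degrees, edge_similarities, alpha):
    """Weighted mix of feature entropy and structural entropy (assumed form)."""
    feature_term = normalized_entropy(edge_similarities)
    structure_term = normalized_entropy(degrees)
    return alpha * feature_term + (1 - alpha) * structure_term


def filter_views(views, alpha, keep):
    """Rank views by score and keep the top `keep` (ranking direction is assumed)."""
    ranked = sorted(views, key=lambda name: view_score(*views[name], alpha),
                    reverse=True)
    return ranked[:keep]


# Each toy view is (node degrees, similarity weights on its edges).
views = {
    "uniform": ([2, 2, 2, 2], [0.5, 0.5, 0.5, 0.5]),
    "hub": ([3, 1, 1, 1], [0.9, 0.1, 0.1]),
}
selected = filter_views(views, alpha=0.5, keep=1)
```

In this toy case the evenly connected view scores 1.0, while the hub-dominated view, whose degrees and similarities are concentrated on a few entries, scores lower.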
4.1.2. Performance on Graph Classification
We use the multi-view datasets for graph classification and take GCN-fusion [33], Co-GCN [44], LGCN-FF [45], SLFNet [46], HGCN-MVSC [47], and MGCN-DNS [48] as baselines. The reported classification accuracy (ACC) is the average over 10 independent runs. Graph classification outcomes are summarized in Table 6, demonstrating that ViFi remains strongly competitive in graph-level classification tasks.
4.2. ViFi Architecture Study (Q2)
To simplify the analysis, we employ only the best-performing variant of ViFi (OURS) built upon a semi-supervised learning framework. The model is applied to node classification and semantic similarity tasks on the Cora dataset, providing additional validation for the effectiveness of the proposed entropy-based adaptive view filtering and the optimized fusion strategy.
Multi views-GCN+MLP: A GCN framework that combines local and global encoder views via an MLP to perform node classification;
Multi views-GAT+GAT: A graph attention network (GAT) architecture that merges local and global encoder views using a single-layer GAT for node-level classification;
Global-GCN+GCN: A GCN architecture adopting a global encoder perspective, implemented with a single-layer GCN for node classification;
OURS-UN-w/o: ViFi implemented under an unsupervised learning framework, operating without the complementary loss term;
OURS-w/o: ViFi implemented under a semi-supervised learning framework, operating without the complementary loss term;
OURS-UN: ViFi configured within an unsupervised learning framework;
OURS: ViFi configured within a semi-supervised learning framework.
These results lead to several noteworthy observations, as discussed below:
To further assess the efficacy of the ViFi framework, we use a standard MLP or a graph attention network (GAT) as the terminal classifier. The experimental results indicate that the GAT-based configuration yields superior performance compared with both the MLP-based configuration and our proposed configuration. This suggests that GAT provides a more effective mechanism for capturing the topological dependencies within the fused node representations. Moreover, our model surpasses Multi views-GCN+MLP, implying that the fusion process preserves the essential relational patterns among nodes.
Compared with the Global-GCN+GCN variant, ViFi delivers superior results, underscoring the necessity of incorporating localized structural cues when global information is sparse or partially missing, as is often the case under limited label supervision. Furthermore, architectures that jointly exploit global and local views tend to exhibit more stable and effective behavior than approaches relying solely on a global-view encoder.
Acquiring complementary information from different views plays a vital role. As shown in Table 7, ViFi trained with the global loss function achieves better performance than ViFi trained solely with the objective loss. This observation further validates our rationale for introducing the global loss function.
4.3. Ablation Study (Q3)
The effectiveness of the proposed ViFi is confirmed through the comparative experiments discussed above. In addition, to further validate the contribution of each individual component within ViFi, we conduct a series of ablation studies in this subsection.
4.3.1. Entropy-Based Adaptive View Filter
First, we apply various aggregation strategies to generate augmented graphs, each serving as a distinct view. For comparison, we incorporate the GCN aggregation method [33] into the graph augmentation process (Equation (22)). Table 4 summarizes the descriptions of the various views. To visualize the representational capacity of each view (PR, PC, RawC, Raw), we plot a line chart that tracks its change after each iteration. The alignment between each augmented view and the original input view (Raw) is then quantified through their semantic similarity: both the augmented and original views are encoded using a shared GCN encoder under identical parameter settings, and the learned graph representations are compared.
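One simple way to quantify the semantic alignment between two views is sketched below. The exact similarity measure is not specified in this section, so the mean node-wise cosine similarity between row-aligned embedding matrices is assumed; `semantic_similarity` is a hypothetical helper name.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two embedding vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def semantic_similarity(Z_a, Z_b):
    """Mean cosine similarity between embeddings of the same node in two views."""
    if len(Z_a) != len(Z_b):
        raise ValueError("both views must embed the same set of nodes")
    return sum(cosine(a, b) for a, b in zip(Z_a, Z_b)) / len(Z_a)


# Toy embeddings of three nodes under the Raw view and an augmented view.
Z_raw = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Z_aug = [[2.0, 0.0], [0.0, 3.0], [1.0, -1.0]]
score = semantic_similarity(Z_raw, Z_aug)
```

Evaluating this score at successive iterations yields one curve per augmented view, which corresponds to the kind of trajectory that the line chart tracks.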
The results in Table 8 indicate that the performance of augmented views varies notably across datasets, reflecting the distinct structural and semantic characteristics captured by each view. The PC view attains the highest accuracy on the Cora and ACM datasets, suggesting that cosine similarity preserves meaningful feature correlations beneficial for classification, while the PK view performs slightly better on Pubmed, implying that neighborhood-based relations are more informative in this dataset. Such inconsistency among single views highlights the necessity of filtering and integrating multiple complementary views. When multiple views are combined, classification accuracy improves substantially compared with individual views. Among the combinations, Raw+PC+PK consistently achieves the best results across all datasets, indicating that combining feature-similarity and structural-proximity views enhances the representational capacity of the model and leads to more effective multi-view learning.
The analysis of semantic similarity and representation ability in Figure 4 provides a key explanation for this phenomenon. The figure shows that the semantic similarity and representation ability of different views follow different trajectories as the number of iterations increases. The PC view attains higher representation ability with fewer iterations and maintains a moderate semantic distance from the original view, so it can provide effective complementary information, although its representation ability has limited room for further improvement. This inconsistency in performance and properties implies that directly fusing all views without any filtering mechanism may introduce interfering or task-irrelevant views, thereby degrading model performance. Moreover, the model would fail to adaptively identify the most informative views for the given task, ultimately resulting in suboptimal fusion outcomes. Therefore, a filtering mechanism that evaluates and adaptively filters views is essential to ensure that the fusion process focuses on the most informative and complementary views.
4.3.2. Optimized Fusion Mechanism
Furthermore, we design a series of comparative experiments to investigate the role of the optimized fusion mechanism. Specifically, we implement several fusion variants, including manual fusion, random subset selection, and simplified versions that retain only the entropy-balance or structural-difference term. Each variant employs the same set of filtered views obtained from the first module, ensuring that the observed differences arise solely from the fusion mechanism. The descriptions of these variants are summarized in Table 4. Subsequently, we compare their classification accuracy across multiple datasets to evaluate the impact of different fusion mechanisms.
The experimental results presented in Table 9 reveal several important observations. The manual fusion setting yields the lowest accuracy across all datasets, indicating that fixed integration schemes fail to capture the complementary information among views effectively. The random selection strategy shows only marginal improvement, suggesting that arbitrary integrations of views cannot guarantee consistent information gain. The variants using only the entropy-balance or only the structural-difference term both lead to moderate performance gains, implying that each component contributes partially to the overall fusion objective. The greedy-by-entropy approach achieves slightly higher accuracy, demonstrating that prioritizing informative views provides a limited but noticeable benefit. The full optimized fusion mechanism achieves the highest accuracy across datasets, confirming that jointly considering entropy balance and structural complementarity enables more adaptive and synergistic view integration. Overall, the results validate the necessity of automatic strategy selection in promoting balanced and effective multi-view fusion.
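The difference between the full search and the greedy-by-entropy variant can be sketched as follows. The gain function here is a stand-in chosen for illustration and is not the one defined in the method section: entropy balance is taken as one minus the spread of per-view entropy scores (assumed to lie in [0, 1]), structural complementarity as the mean pairwise Jaccard distance between edge sets, and a penalty `lam` is applied per selected view. All names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(e1, e2):
    """1 - |intersection| / |union| of two edge sets."""
    union = e1 | e2
    return 1 - len(e1 & e2) / len(union) if union else 0.0


def gain(subset, entropy, edges, lam):
    """Illustrative gain: entropy balance + structural difference - size penalty."""
    scores = [entropy[v] for v in subset]
    balance = 1 - (max(scores) - min(scores))
    pairs = list(combinations(subset, 2))
    diff = (sum(jaccard_distance(edges[a], edges[b]) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return balance + diff - lam * len(subset)


def best_subset(views, entropy, edges, lam, min_size=2):
    """Full search: evaluate every subset of at least `min_size` views."""
    candidates = [s for r in range(min_size, len(views) + 1)
                  for s in combinations(views, r)]
    return max(candidates, key=lambda s: gain(s, entropy, edges, lam))


def greedy_by_entropy(views, entropy, edges, lam):
    """Add views in decreasing entropy order until the gain stops improving."""
    order = sorted(views, key=lambda v: entropy[v], reverse=True)
    chosen = order[:2]
    best = gain(chosen, entropy, edges, lam)
    for v in order[2:]:
        candidate = gain(chosen + [v], entropy, edges, lam)
        if candidate <= best:
            break
        chosen, best = chosen + [v], candidate
    return tuple(chosen)


# Toy data for three generic views.
views = ["v1", "v2", "v3"]
entropy = {"v1": 0.80, "v2": 0.70, "v3": 0.78}
edges = {"v1": {(0, 1), (1, 2)}, "v2": {(0, 1), (2, 3)}, "v3": {(0, 2), (1, 3)}}
full = best_subset(views, entropy, edges, lam=0.1)
greedy = greedy_by_entropy(views, entropy, edges, lam=0.1)
```

On this toy input both procedures select the same pair, but the greedy variant can only ever reach subsets that extend the top-entropy prefix, whereas the full search considers every grouping; with only a handful of filtered views, the exhaustive search remains cheap.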
4.4. Robustness Analysis (Q4)
As both the unsupervised (OURS-UN) and semi-supervised (OURS) variants of ViFi employ an identical fusion framework, only the semi-supervised model (OURS), which demonstrates superior performance, is used in the subsequent experiments to streamline the analysis.
First, we evaluate the expressive capability of each view through visualization and graph reconstruction techniques. To reconstruct the latent representations derived from each view-specific encoder on the Cora dataset, we use a Variational Graph Auto-Encoder (VGAE) [34]. The reconstruction quality is quantified using the AUC (area under the ROC curve) metric, and the corresponding results are illustrated in Figure 5. Furthermore, to show the distribution of the learned representations intuitively, we visualize the embeddings of each view encoder using t-SNE [58], as shown in Figure 6.
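The reconstruction AUC can be computed as in the sketch below, which scores node pairs with the inner-product decoder used by VGAE and evaluates how well held-out edges are ranked above sampled non-edges. How positive and negative pairs are sampled in the experiments is not stated, so the pair lists here are toy inputs, and the function names are hypothetical.

```python
def edge_score(Z, i, j):
    """Inner-product decoder logit for the node pair (i, j)."""
    return sum(a * b for a, b in zip(Z[i], Z[j]))


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def reconstruction_auc(Z, pos_edges, neg_edges):
    """AUC of decoder scores for true edges against non-edges."""
    pos = [edge_score(Z, i, j) for i, j in pos_edges]
    neg = [edge_score(Z, i, j) for i, j in neg_edges]
    return auc(pos, neg)


# Toy embeddings: nodes 0-1 and nodes 2-3 form two tight groups.
Z = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
value = reconstruction_auc(Z, pos_edges=[(0, 1), (2, 3)], neg_edges=[(0, 2), (1, 3)])
```

Because the sigmoid applied by the decoder is monotonic, ranking the raw inner products gives the same AUC as ranking the edge probabilities.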
On the Cora benchmark, we monitor the evolution of semantic similarity across embeddings from the global and local views throughout the learning process. This is presented as a line plot computed at selected training epochs {1; 5; 10; 20; 40; 100}. The embedding vectors for all views are obtained through the same GCN encoder. Their semantic similarity is then quantified from the resulting feature representations, as depicted in Figure 7.
Table 4 describes the various views. When the input graphs are perturbed, the expressive capability of each view encoder declines, as both its visual distribution and its reconstruction quality deteriorate. In contrast, the fused representation that integrates local and global views preserves stronger expressiveness under noisy conditions, indicating that ViFi is more robust. The global-only representation performs less effectively, while local views exhibit higher discriminability and better adaptation to structural noise. Moreover, the semantic discrepancy between local and global representations grows under the proposed contrastive loss, confirming the significance of introducing local views. Our objective thus enhances diversity among encoders from different views while maintaining the strong learning ability of the key encoders. However, this divergence does not increase indefinitely with training epochs, as the optimization is jointly constrained by the contrastive and objective losses.
To assess the robustness of ViFi, we further subject ViFi, its variant Global-GNN, and GCN [33] to a set of four uncertainty scenarios. These conditions are used to examine how reliably each model maintains classification performance when exposed to varying views. These uncertainties may disturb the graph structure or node semantics, thereby affecting classification accuracy. All experiments are conducted on the Cora dataset. The number of labeled nodes is set to {14; 21; 28; 54}. The attack and mask ratios are {0.2; 0.4; 0.6; 0.8}, and the noise level is set to {0.001; 0.01; 0.1; 1}. For topology and feature masking, we randomly remove a portion of edges or node features according to the specified ratio, and the modified graphs are then used to test classification performance on the corrupted data. For feature corruption caused by node noise, we perturb the input attributes by injecting Gaussian disturbances, where the noise-level parameter controls the strength of the added variation. This process intentionally distorts the original feature distribution to evaluate model stability under corrupted inputs. The corresponding outcomes are presented in Figure 8. Drawing from these observations, several key findings can be outlined as follows.
- (1) As illustrated in Figure 8(a), the baseline model experiences a sharp decline in performance as the label rate decreases. In contrast, OURS maintains superior accuracy even under extremely low label availability. This observation indicates that the introduction of a view filter and an optimized fusion mechanism enhances the model's ability to capture essential features under limited supervision. Notably, OURS substantially outperforms Global-GNN in such low-label scenarios, further validating the effectiveness of the proposed design.
- (2) As illustrated in Figure 8(b), OURS consistently achieves superior performance compared to the baseline models under higher attack ratios. This advantage arises from its ability to capture information effectively through the entropy-based view filter. In general, all methods exhibit a sharp decline in performance as the random attack ratio increases.
- (3) In alignment with earlier findings, the results presented in Figure 8(c)–(d) indicate that OURS maintains a clear advantage over both GCN and Global-GNN across varying noise and masking conditions. Overall, as the proportion of masked features or the strength of the injected noise increases, every method exhibits a substantial reduction in predictive accuracy.
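The masking and noise perturbations used in this robustness study can be sketched as follows. The sketch makes several assumptions that the text does not settle: masked feature entries are set to zero, each entry is masked independently with probability equal to the ratio, and the noise level is interpreted as the standard deviation of zero-mean Gaussian noise. The random attack scenario is not covered, since its construction is not described in this section.

```python
import random


def mask_edges(edges, ratio, rng):
    """Randomly remove the given fraction of edges."""
    ordered = sorted(edges)
    n_keep = len(ordered) - int(round(ratio * len(ordered)))
    return set(rng.sample(ordered, n_keep))


def mask_features(X, ratio, rng):
    """Zero out each feature entry independently with probability `ratio` (assumed)."""
    return [[0.0 if rng.random() < ratio else x for x in row] for row in X]


def add_gaussian_noise(X, noise_level, rng):
    """Add zero-mean Gaussian noise; `noise_level` is taken as the std (assumed)."""
    return [[x + rng.gauss(0.0, noise_level) for x in row] for row in X]


rng = random.Random(0)  # fixed seed so that perturbed graphs are reproducible
edges = {(i, i + 1) for i in range(10)}
X = [[1.0, 2.0], [3.0, 4.0]]

kept = mask_edges(edges, ratio=0.4, rng=rng)
X_masked = mask_features(X, ratio=1.0, rng=rng)
X_noisy = add_gaussian_noise(X, noise_level=0.1, rng=rng)
```

Passing an explicit `random.Random` instance keeps each perturbation reproducible across the ratios {0.2; 0.4; 0.6; 0.8} and noise levels {0.001; 0.01; 0.1; 1} listed above.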
4.5. Performance on Node Clustering (Q5)
We further examine the capability of the framework in unsupervised graph representation learning. Once the model is trained, the fused node representations are extracted and subsequently grouped using the K-means clustering procedure on three benchmark datasets: Cora, Citeseer, and Pubmed. The evaluation is based on two metrics, normalized mutual information (NMI) and adjusted Rand index (ARI), and their mean values along with standard errors are reported in Table 10. To establish competitive baselines, we incorporate a range of widely used unsupervised and contrastive graph learning methods, such as K-means, DeepWalk [53], GraphCL [50], IGCL [51], GCA [52], GAE [34], and VGAE [34]. To further illustrate the flexibility of the ViFi framework, the objective is equipped with alternative contrastive components by substituting its contrastive term with the loss formulations of GCA [52], IGCL [51], and GraphCL [50]. These variants are referred to as
,
, and
, respectively. Experimental results indicate that ViFi achieves performance comparable to other state-of-the-art methods across most datasets, confirming that it is strongly competitive in unsupervised clustering tasks. Furthermore, the model's performance is highly sensitive to the choice of contrastive loss, underscoring the framework's inherent flexibility.
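The clustering step applied to the fused representations can be illustrated with a plain Lloyd-style K-means. This is a sketch rather than the implementation used in the experiments: initial centers are drawn uniformly at random from the data, and the iteration stops when assignments no longer change. The function names and the toy embeddings are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(Z, k, iters=100, seed=0):
    """Cluster the rows of Z into k groups and return one label per row."""
    rng = random.Random(seed)
    centers = [list(z) for z in rng.sample(Z, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each embedding goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(z, centers[c])) for z in Z]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [z for z, lab in zip(Z, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy fused embeddings with two well-separated groups.
Z = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(Z, k=2)
```

The resulting labels are arbitrary cluster indices, so they are compared with the ground-truth classes through permutation-invariant scores such as NMI and ARI rather than through accuracy.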
4.6. Parameter Sensitivity (Q6)
Sensitivity studies are conducted on the key hyperparameters to examine their influence on model performance. For the weight that balances feature entropy against structural entropy, ViFi is trained with values ranging from 0.05 to 0.95 at an interval of 0.05. Values between 0.3 and 0.7 produce the most favorable outcomes, suggesting that a balanced emphasis on feature entropy and structural entropy enables the model to capture both semantic richness and topological diversity, thereby yielding more comprehensive and discriminative representations. For the regularization strength, ViFi is trained with values ranging from 0.01 to 0.5 at an interval of 0.01. Moderate values, typically between 0.1 and 0.3, yield the best performance, suggesting that an appropriate regularization strength balances the trade-off between maximizing information gain and controlling the size of the selected view subsets, thereby preventing overfitting to irrelevant views while maintaining informative diversity. Finally, for the weight on cross-encoder representation relationships, we train ViFi with values from 0.05 to 0.95 and observe that 0.4–0.8 yields the best performance, indicating that appropriately weighting these relationships improves the model.
Furthermore, we investigate the effects of the hyperparameters
and
on the model's performance. In this experiment, the other hyperparameters are fixed while
and
vary from 0 to 0.9. The classification results of all nodes are visualized in a 3D bar chart, as shown in
Figure 9. The findings reveal that higher values of
and
generally lead to improved self-supervised learning performance.