Preprint
Article

This version is not peer-reviewed.

Attention Might Offer Little Benefit for Graph Node Classification

Submitted:

16 April 2026

Posted:

21 April 2026


Abstract
Attention mechanisms have achieved remarkable success in language models and have since been widely adopted in vision, speech, and multimodal learning. This trend has extended to graph learning, where attention-based models such as Graph Attention Networks (GAT) and Graph Transformers are now prevalent. This position paper argues that attention mechanisms may not be as beneficial for graph node classification as commonly believed. Through systematic ablation studies, we find that attention often provides negligible or even detrimental gains compared to simpler alternatives, with the only notable exception being graphs whose node features are language word embeddings. This suggests that the benefit of attention is largely confined to language-related applications. We examine attention at three scales: 1-hop (GAT-style), Inception-style, and global mechanisms. We further analyze potential explanations for these results, including the limitations of gradient-based optimization and the fundamental differences between language and graph data. Overall, these findings suggest that the prevailing enthusiasm for attention in graph node classification may be overstated, motivating a more critical and evidence-driven re-evaluation of its adoption. The code for all experiments is available at https://github.com/Qin87/ScaleNet/tree/July25.

1. Introduction

In academic research, when a breakthrough algorithm emerges in one field, its underlying scientific paradigm often diffuses rapidly to others [1,2]. Such cross-domain adoption typically attracts widespread attention and follow-up studies, regardless of its ultimate effectiveness, until the initial momentum fades or a new paradigm emerges. A decade ago, convolutional neural networks (CNNs) [3] played this role in computer vision. More recently, the attention mechanism, originally developed for language modeling, has triggered a comparable wave across machine learning. Attention has achieved strong results in language modeling [4] and was subsequently shown to be effective in computer vision [5], speech processing [6], and multimodal learning [7], particularly in domains that require modeling sequential data with global connectivity to capture long-range dependencies.
In graph node classification, attention mechanisms were rapidly adopted through Graph Attention Networks (GAT) [8], which apply attention over first-order neighbors. More recently, Graph Transformers [9] have emerged as a rapidly evolving research area, offering an alternative to traditional Message Passing Neural Networks (MPNNs) through the use of full attention and spawning a growing number of model variants. However, early work by Shchur et al. [10] showed that GAT can suffer from extremely low performance under certain weight initializations, as well as lower accuracy and higher variance than non-attention-based models on some datasets. Unlike GAT, which represents a single model, Graph Transformers encompass a broad and expanding family of models. Despite attracting significant research attention in recent years, concerns have been raised regarding their empirical effectiveness [11,12,13], scalability [14], and the complexity of their preprocessing pipelines [15]. Although some researchers have shifted away from developing ever more Graph Transformer variants, the broader belief in the effectiveness of attention for graph learning remains strong, often taken for granted [15], and attention-based architectures continue to proliferate.
In this position paper, we challenge the prevailing hype around attention mechanisms in graph node classification. Our experiments across multiple datasets reveal that attention often offers negligible benefit for node classification. We discuss potential reasons for this and advocate for a more cautious, evidence-driven adoption of attention mechanisms in this domain.

Structure of This Position Paper:

  • Section 2 provides background on edge weight adjustment methods, positioning attention as one specific approach among many alternatives.
  • Section 3 shows that attention for 1-hop neighbors is dispensable by experimenting on GAT [8] and its variants, and extends this finding to larger neighborhoods using Inception-style models, where attention remains unnecessary.
  • Section 4 examines Graph Transformers and explains the reasons behind their performance shortcomings.

Limitations:

Although the paper discusses and evaluates both homophilic and heterophilic graph settings, the scope of the conclusions is limited to node classification and does not necessarily generalize to other graph learning tasks. In addition, our experiments focus on existing node classification settings, such as content-based and traffic-based node classification, and the findings may not directly apply to novel node classification scenarios.

2. Edge Weight Adjustment

Many Graph Neural Networks (GNNs) modify edge weights (or, equivalently, the adjacency entries used for propagation). This design choice appears under various names in the literature, including aggregators [16], graph filters [17], normalization [18], and scalers [17]. Despite the differing terminology, they all fundamentally adjust edge weights, which determine the relative contribution of each neighbor during aggregation. We classify these approaches as follows:

(1) Fixed/rule-based adjustment.

A common approach is to set edge scalings deterministically, most often as a function of node degrees. Representative examples include the symmetric normalization used by GCN [18], random-walk-based normalization [16], and direction-aware normalizations in Dir-GNN [19]. We provide a detailed summary of these normalizations in Appendix A.
Not all fixed aggregators are degree-based. There are several rules that do not rely on degree, such as min/max aggregators, softmax/softmin aggregators, and standard-deviation aggregators [17], as well as pooling and LSTM-based aggregators [16]. Other choices incorporate structural priors beyond degree, e.g., centrality-based graph shift operators [20]. Notably, PNA [17] argues that a single fixed graph filter can fail in certain regimes; to mitigate such failures, it combines multiple aggregators (mean, min, max, standard deviation) together with degree-based scalers to increase robustness.

(2) Learned adjustment.

Another line of work learns edge coefficients end-to-end during training. Attention-based models, such as GAT [8] and graph transformer variants [9], compute data-dependent edge weights (often normalized across neighbors) to adaptively modulate message contributions.

(3) Using given edge weights.

When edge weights are provided by the dataset, models may directly use these values; in unweighted graphs, a standard baseline is to set all edge weights to 1.

Unified view.

Across these cases, edge-weight adjustment can be viewed as applying an element-wise reweighting to the adjacency matrix, i.e., performing a Hadamard product between the adjacency matrix and an adjustment matrix that encodes either rule-based scalings or learned coefficients [21].
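This unified view can be made concrete with a minimal numpy sketch (the graph and values below are illustrative, not from any dataset): both rule-based and learned adjustments reduce to a Hadamard product with the adjacency matrix.

```python
import numpy as np

# Adjacency of a toy 3-node undirected graph (illustrative).
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])

# (1) Rule-based adjustment: GCN-style symmetric scaling S_ij = 1/sqrt(d_i d_j).
deg = A.sum(axis=1)
S_sym = 1.0 / np.sqrt(np.outer(deg, deg))
A_sym = A * S_sym                      # Hadamard product: A ⊙ S

# (2) Learned adjustment: attention-style coefficients (random stand-ins here,
# where a real model would produce data-dependent scores).
rng = np.random.default_rng(0)
S_attn = rng.random(A.shape)
A_attn = A * S_attn                    # zeros in A mask non-edges
```

In both cases the zeros of A act as a mask, so the adjustment only rescales existing edges.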

3. Attention in the GAT Model

3.1. GAT as Learned Edge Weights

Graph Attention Networks (GATs) [8] integrate attention into message passing, allowing each node to assign different importance to its directly connected neighbors, as shown in Figure 1(a). The layer-wise propagation rule of a Graph Attention Network (GAT) [8,22] can be written in matrix form as
$$H^{(l+1)} = \sigma\!\left(A^{(l)} H^{(l)} W^{(l)}\right) = \sigma\!\left(\left(\alpha^{(l)} \odot A\right) H^{(l)} W^{(l)}\right),$$
where $H^{(l)} = [h_1, h_2, \ldots, h_N] \in \mathbb{R}^{N \times D}$ denotes the node feature matrix at layer $l$, containing one $D$-dimensional representation per node, with $H^{(0)} = X$, the original node features. $W^{(l)}$ is a layer-specific trainable weight matrix. The matrix $A^{(l)} \in \mathbb{R}^{N \times N}$ collects the normalized attention coefficients produced by the $l$-th attention mechanism and can be written as the element-wise product $\alpha^{(l)} \odot A$, where $A$ is the adjacency matrix and $\alpha^{(l)}$ contains the learned attention weights on edges. $\sigma(\cdot)$ denotes an activation function.
This update rule implements a message-passing operation on the underlying graph: each node aggregates linearly transformed features from its neighbors, but the aggregation weights are dataset-dependent and learned through the attention mechanism, rather than being fixed by a predefined normalization of A , as in GCN [18] or GraphSage [16].
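This propagation rule can be sketched in a few lines of numpy. The masked softmax, the raw score matrix, and the choice of ReLU as $\sigma$ are illustrative assumptions here, not the reference implementation:

```python
import numpy as np

def gat_layer(H, A, scores, W):
    """One GAT-style layer in matrix form (illustrative sketch).

    `scores` holds raw attention scores e_ij; the softmax is taken over
    each node's 1-hop neighborhood, i.e. the mask encoded by A (assumed
    to include self-loops). sigma is taken to be ReLU for illustration.
    """
    masked = np.where(A > 0, scores, -np.inf)        # masked attention
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)         # rows sum to 1
    return np.maximum(0.0, (alpha * A) @ H @ W)      # sigma((alpha ⊙ A) H W)
```

With all scores equal, each node simply averages its neighbors' transformed features, which is exactly the uniform-weight baseline used in the ablations below.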

3.2. Attention in GAT is Dispensable

We now examine the attention matrix α ( l ) in Equation 1. Each entry α i j represents a normalized attention coefficient computed by the attention mechanism [8]:
$$\alpha_{ij} = \mathrm{Softmax}\!\left(\mathrm{LeakyReLU}(e_{ij})\right),$$
where $e_{ij}$ denotes the unnormalized attention score that measures the importance of node $j$'s feature to node $i$. Since GAT restricts attention computations to the first-order neighbors of $i$ (including $i$ itself), the mechanism effectively implements masked attention to encode local graph structure.
From Equation 2, the attention mechanism in GAT can be decomposed into three key steps:
  • Step 1: Computation of learnable weights e i j ;
  • Step 2: Application of the LeakyReLU activation;
  • Step 3: Softmax normalization to obtain α i j .
To test whether the learnable weights in Step 1 are truly necessary, we conducted an ablation study by replacing the learnable weights $e_{ij}$ with either uniform weights (all set to 1) or random weights sampled from a uniform distribution in the range [0.0001, 10000]. We further replaced the Softmax normalization in Step 3 with several alternative normalization schemes: symmetric normalization (Sym) from GCN [18], row normalization (Row) from GraphSAGE [16], directed normalization (Dir) from Dir-GNN [19], and a no-normalization configuration (None). Results are reported in Table 1. Because GAT attention weights can be negative and thus incompatible with these normalization schemes, we take their absolute values before applying normalization.
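The substitutions in this ablation can be sketched as follows. The `normalize` helper is hypothetical and covers only a subset of the schemes (Dir-norm omitted for brevity); the graph is a toy example:

```python
import numpy as np

def normalize(Aw, scheme):
    """Apply one of the ablation's normalization schemes to a weighted
    adjacency Aw (hypothetical helper, sketch only)."""
    deg = np.where(Aw.sum(axis=1) > 0, Aw.sum(axis=1), 1.0)
    if scheme == "none":                           # No-norm
        return Aw
    if scheme == "row":                            # D^-1 A (GraphSAGE-style)
        return Aw / deg[:, None]
    if scheme == "sym":                            # D^-1/2 A D^-1/2 (GCN-style)
        return Aw / np.sqrt(np.outer(deg, deg))
    raise ValueError(scheme)

rng = np.random.default_rng(0)
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])                       # toy graph with self-loops

uniform = A * 1.0                                  # Step-1 ablation: all ones
random_w = A * rng.uniform(1e-4, 1e4, A.shape)     # random in [0.0001, 10000]
P = normalize(np.abs(random_w), "sym")             # abs() before normalizing
```

Swapping `uniform` for `random_w` or for learned scores, and `"sym"` for the other schemes, generates the grid of variants evaluated in Table 1.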
To ensure a fair comparison, all models use the same implementation. Detailed hyperparameter settings are provided in Appendix B.2.2.9. Table 1 presents the performance of GAT variants under different weight schemes and normalization strategies, showing the results for the best hyperparameter configurations in each case. The results demonstrate that model performance remains largely unaffected by the removal of learned attention weights. In most cases, uniform and random weights achieve comparable or even superior performance to attention-based weighting. Only on the Amazon-Photo and Amazon-Computer datasets do we observe marginal performance differences (less than 0.5%).
Furthermore, Table 1 reveals that the choice of normalization scheme exerts a stronger influence on model performance than the choice of weighting strategy. Specifically, the Dir-norm and Sym-norm consistently outperform Softmax-norm and No-norm on the PubMed, CS, and Physics datasets. In contrast, for the Telegram dataset, the No-norm configuration achieves a noticeably higher performance than all other normalization methods, regardless of whether the edge weights are uniform, random, or learned. These findings suggest that, for these datasets, the attention mechanism is entirely dispensable, while normalization plays a much more critical role in determining performance. The competitive results of uniform edge weighting on datasets such as PubMed, CS, and Physics further indicate that simply assigning uniform weights to edges is both sufficient and effective—making the learning of attention-based weights an unnecessary complexity.
Collectively, since the evaluated datasets cover widely used node classification benchmarks, including citation networks, co-author networks, co-purchase graphs, and social networks (Appendix B.1), they consistently indicate that GAT’s attention weights contribute minimally to overall performance. This suggests that attention mechanisms over local (1-hop) neighborhoods are largely unnecessary for effective graph node classification. Sensitivity experiments on the number of attention heads, presented in Table A4 (in the Appendix), further show that increasing the number of heads generally yields little performance improvement and can even lead to slightly worse results for some datasets.
As the GAT model [8] is primarily evaluated on homophilic datasets, the above ablation results, largely based on homophilic graphs, already call into question the necessity of attention in its original formulation. To rule out the possibility that the observed behavior is driven by homophily or heterophily, we extend our study to four heterophilic datasets, with results reported in Table A7.
Across the Chameleon, Squirrel, and Arxiv-Year datasets, competitive performance can be achieved without attention. Multi-head attention yields modest gains on these datasets, with improvements of approximately 6% on Arxiv-Year, 4% on Squirrel, and 1% on Chameleon, compared to only around 0.5% on homophilic graphs. In contrast, the Roman-Empire dataset exhibits a substantial improvement of over 30%, indicating a clear benefit from attention mechanisms. We attribute this to the nature of its node features: as dense word embeddings, they encode rich semantic information, making fine-grained neighbor discrimination more beneficial than in datasets with sparse features. Nevertheless, prior work [19,23] reports accuracies of up to 93.58% on Roman-Empire without using attention, suggesting that such gains may not be uniquely attributable to attention mechanisms. While further research is needed to fully understand this phenomenon, this evidence suggests that the benefits of attention mechanisms may be more limited than commonly assumed.
For Roman-Empire, LargeScaleNet [23] achieves state-of-the-art performance through directed multi-scale learning with Jumping Knowledge connections, reporting an accuracy of 93.58±0.24. When replacing the uniform edge weights with attention weights while keeping all other settings identical, performance changes only marginally to 93.60±0.32, as shown in Table A10 (Appendix). This negligible difference further supports the conclusion that attention mechanisms are not the determining factor, even on a dataset where simpler attention-based models show large gains.

3.3. Attention is Dispensable for Extended Neighborhoods

In Section 3.2, we showed that attention over immediate neighbors is unnecessary. Before examining Graph Transformers with full attention, we evaluate DiGib [24], an Inception-style framework that aggregates information from both immediate neighbors and nodes that are not directly connected, thereby expanding the receptive field, as illustrated in Figure 1(b). Specifically, DiGib connects nodes that share a common predecessor (i.e., both point to the same node) or a common successor (i.e., both are pointed to by the same node).
The DiGib variants retain the same aggregation structure but replace edge weights with learned attention weights, uniform weights, or random weights sampled from [ 10 4 , 10 4 ] , and apply the same normalization variants as in Section 3.
Although DiGib was originally designed for directed graphs, it is also compatible with undirected graphs. We therefore evaluate it on four directed and four undirected datasets. The Amazon-Computer dataset is excluded due to GPU memory limitations on the NVIDIA A40 platform. To ensure a fair comparison, all models share the same implementation, and detailed hyperparameter settings are provided in Appendix B.2.2.10. Table 2 reports the results for the best hyperparameter configurations in each case.
As shown in Table 2, across all datasets except Amazon-Photo (where differences are below 0.2%), the non-attention variants match or outperform the attention-based version. Notably, on three datasets (CoraML, CiteSeer, and Telegram), learned attention consistently underperforms uniform weighting and, in some cases, even random weighting. This suggests that the weight-learning mechanism may adversely affect node classification, potentially due to sparse connectivity and limited labeled data, as reflected in the dataset statistics reported in Table A1. Sensitivity experiments on the number of attention heads, presented in Table A5 (in the Appendix), show that increasing the number of heads generally yields little performance improvement. Experiments on heterophilic datasets are reported in Table A9 (in Appendix D).
Overall, these results confirm that attention provides no tangible benefit even when the receptive field is expanded.

3.4. Why Attention Can Fail in GAT and Inception Models

Node classification benchmarks vary in which signals are most predictive of the labels [26]. For example, citation graphs are often more influenced by node attributes (e.g., high-dimensional sparse bag-of-words or one-hot encodings), whereas social networks can rely more on structural cues (e.g., degree and local connectivity patterns) [21]. This difference helps explain why attention mechanisms in GAT and Inception-style GNNs behave inconsistently across datasets.
When prediction is dominated by high-dimensional sparse features, the precise feature magnitudes are often secondary to preserving which dimensions are active. In this regime, attention-based neighbor reweighting may yield limited gains. Conversely, when labels correlate strongly with scale-sensitive structural statistics such as node degree, normalization and reweighting within message passing may inadvertently attenuate or distort these signals. For instance, in degree-driven networks such as Telegram, where strong performance can be obtained using an MLP trained solely on degree features [23], unnormalized aggregation can better preserve degree information and can outperform attention-based or normalized variants.

4. Graph Transformers

4.1. Graph Transformers As Learned Edge Weights

Li et al. [9] treat all node pairs as connected (Figure 1(c)), shifting from sparse adjacency to dense attention and inspiring numerous Graph Transformers (GTs) [27,28,29]. The propagation rule of GTs [30] is:
$$H^{(l+1)} = \sigma\!\left(\left(\alpha^{(l)} \odot \mathbf{1}\right) H^{(l)} W^{(l)}\right),$$
where $\mathbf{1} \in \mathbb{R}^{N \times N}$ is the all-ones matrix. $H^{(l)} \in \mathbb{R}^{N \times D}$ denotes the node feature matrix at layer $l$, with $H^{(0)} = X$ [9] or with $H^{(0)}$ formed by combining $X$ with additional features such as positional/structural encodings [31]. Unlike local methods, GTs learn global edge weights, which introduces two issues: (1) loss of structural information, and (2) limited scalability due to quadratic complexity. Given these challenges, community perspectives on GTs are divided: some remain optimistic, developing architectural variants to overcome current limitations, while others express concern over weak empirical gains, high computational costs, and excessive resource demands.
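A minimal dense-attention sketch of this rule follows. The query/key projections `Wq`/`Wk` and the choice of ReLU as $\sigma$ are assumptions for illustration; real GTs add multi-head attention, residual connections, and encodings:

```python
import numpy as np

def gt_layer(H, W, Wq, Wk):
    """One global-attention layer (sketch): every node attends to every
    node, so alpha ⊙ 1 is simply the dense attention matrix alpha."""
    scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(H.shape[1])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)   # dense: no adjacency mask
    return np.maximum(0.0, alpha @ H @ W)      # sigma = ReLU (illustrative)
```

Note that, unlike the GAT layer, no adjacency matrix appears anywhere: the graph's topology enters only through whatever encodings are folded into `H`.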

4.1.1. Optimism in Graph Transformers

GTs are often regarded as the next frontier in graph learning—aiming to overcome the limited expressiveness, over-smoothing, and oversquashing problems of MPNNs [12,32]. Inspired by the dominance of Transformers in language and vision, researchers have attempted to replicate this success for graph data, positioning GTs as successors to “traditional” GNNs [33,34]. Recent work mainly develops along two lines: improving structural encoding to recover information lost by removing the adjacency matrix, and enhancing scalability to overcome the O ( N 2 ) cost of full attention (see Appendix E for details). This persistent optimism contrasts with the limited fundamental scrutiny of Graph Transformers and the actual effectiveness of attention in graph learning—precisely the gap this position paper seeks to address.

4.1.2. Pessimism in Graph Transformers

In contrast, a growing body of research highlights the limitations of Graph Transformers. They are often criticized for their high computational complexity [14], which restricts scalability to large graphs. Empirical studies have shown that well-tuned MPNNs can outperform GTs in both accuracy and efficiency [11,12,32], though such comparisons rarely question the design assumptions underlying attention on graphs. Furthermore, Xing et al. [13] identify an over-globalization problem: attention mechanisms often overemphasize distant nodes while underweighting nearby, more informative ones, leading to degraded representational quality. Combined with growing model complexity and preprocessing overhead, these issues cast serious doubt on whether architectural sophistication truly translates into better graph learning performance.
Together, these perspectives illustrate the ongoing tension in the GT research community between excitement for innovation and practical concerns.

4.2. Why GT Fails

Recent empirical studies [11,12,32] have conducted extensive evaluations showing that Graph Transformers often underperform strong GNN baselines. Building on these findings, this position paper does not reproduce such experiments but instead infers GT behavior from prior evidence on GATs and Inception-style models.

4.2.0.1. From Local to Global: Amplifying a Flawed Premise

Section 3 and Section 3.3 show that learned attention weights provide little benefit, and can even be detrimental, both in local Graph Attention Networks (GATs) and in multi-scale Inception models. Graph Transformers (GTs) extend this flawed premise to its extreme by replacing locality with full global attention.
In doing so, GTs remove valuable local inductive biases while retaining the same ineffective attention-weighting mechanisms. This combination aggravates the problem: their global receptive fields promote aggregation over irrelevant nodes, often washing out the informative local structures that are crucial for effective graph reasoning. The following section examines several factors contributing to these shortcomings.

4.2.1. Limitation of Gradient Descent

Graph Transformers (GTs) are designed to transcend fixed graph structures via global attention and dynamic connectivity [9]. However, gradient descent exhibits a fundamental limitation in realizing this objective.
Once global connections are initialized, gradient descent struggles to unlearn them by pruning irrelevant edges. Both empirical and theoretical analyses show that gradient-based dynamics tend to preserve connectivity and overfit to spurious edges, even in the large-data regime [26,35]. Our experiments corroborate this behavior: with proper normalization, random edge weights as small as 0.0001 perform equivalently to those as large as 10000, highlighting gradient descent’s inability to achieve exact sparsity. Although weights can take arbitrarily small positive or negative values during training, reaching an exact zero is exceedingly unlikely under standard gradient-based optimization.
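A toy illustration of this point (a single weight under a quadratic penalty; this is not the paper's experiment, and the learning rate and penalty strength are arbitrary choices): gradient descent shrinks the weight geometrically, so it approaches zero without ever reaching it exactly.

```python
# One spurious edge weight w, minimized under a pure L2 penalty.
w, lr, lam = 1.0, 0.1, 0.5
for _ in range(1000):
    grad = lam * w          # gradient of (lam / 2) * w**2
    w -= lr * grad          # w shrinks by a factor (1 - lr * lam) per step
# After 1000 steps w is vanishingly small, but still strictly positive.
```

Exact zeros would require a non-smooth penalty (e.g., L1 with proximal updates) or explicit pruning, neither of which plain gradient descent on attention scores provides.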
This reveals a fundamental paradox: global attention, intended to enable dynamic structure learning, effectively degenerates into a fixed fully connected structure. When the ground-truth graph is sparse, Graph Transformers trained with gradient descent cannot recover it. Lacking strong local inductive biases, this over-aggregation results in suboptimal generalization. Consequently, GTs become trapped in a static, fully connected regime rather than learning meaningful task-specific structures.

4.2.2. Language vs. Graph: Fundamental Mismatch

Structure Enhancement vs. Destruction:
Transformers enhance linear language sequences by inducing structure through attention, where heads capture syntax and coreference [36]. Graphs, however, already possess rich topology. Global attention disrupts this inherent structure [yingTransformersReallyPerform2021], effectively destroying it.
Semantic entanglement vs. local relevance:
Language tokens encode long-range dependencies (negation, coreference) where distant context critically shapes meaning [attention17]. Graph nodes, by contrast, especially for node classification, typically rely on local neighborhoods; distant nodes often introduce noise rather than signal.
Content-based classification: The presence or absence of a feature is often more informative than its exact magnitude. Local neighbors generally carry the signals defining a node’s class, whereas distant neighbors may belong to different classes. Aggregating information globally can introduce noise from these inter-class nodes, diluting discriminative features.
Structure-based classification: Global attention (GT) can disrupt scale-sensitive structural information, such as node degrees. Although incorporating positional/structural encodings (PE/SE) into the input features initially preserves this information, a single global attention step mixes all nodes’ degrees, making it impossible to recover per-node degree information. In contrast, local message passing progressively captures richer structural statistics [21,26].

5. Alternative Views

This paper argues that attention mechanisms are largely unnecessary for graph node classification. At the same time, several commonly held positions in the research community support the opposite conclusion. In this section, we briefly outline these opposing views without endorsing them.
View 1: Attention improves performance and expressivity. A common position is that attention increases the capacity of graph models by allowing them to assign data-dependent weights [8] to neighbors instead of using fixed, normalization-based weights. From this perspective, attention is expected to provide stronger function approximation power and better empirical performance than simpler message-passing schemes.
View 2: Attention is needed for long-range interactions. Another widely held view is that global attention, as used in Graph Transformers, is essential for modeling long-range dependencies on graphs. In this line of reasoning, attention is considered necessary to propagate information between distant but semantically related nodes, overcoming the over-smoothing and oversquashing issues of traditional GNNs.
View 3: Attention’s cross-domain success justifies its use on graphs. Because attention mechanisms underpin state-of-the-art models in natural language processing, computer vision, and multimodal learning, many researchers expect similar benefits to carry over to graph-structured data. This expectation motivates continued development of attention-based graph architectures [15] and reinforces the assumption that attention should also be effective for node classification.
These views stand in contrast to the stance taken in this work. Our empirical analysis focuses specifically on graph node classification and leads to a different conclusion: in this setting, attention often does not deliver the improvements that these prevailing assumptions would predict.

6. Conclusions

This position paper systematically challenges the prevailing faith in attention mechanisms for graph node classification. Through three levels of analysis, we demonstrate that learned attention weights are largely dispensable. Graphs are fundamentally different from language in how and why attention mechanisms apply. For the specific task of node classification, we observe that citation networks built on sparse features rely primarily on local structural cues, where precise attention adds little value. In contrast, social networks emphasize degree and connectivity patterns, which global attention can easily distort. For global attention to function effectively in graphs, it would need the capacity to unlearn, that is, to reduce attention to exactly zero when unnecessary, a behavior that standard gradient-based optimization cannot reliably achieve.
While criticism of Graph Transformers is not new, the attention mechanism itself has escaped fundamental scrutiny. Our results indicate that attention is generally unnecessary for node classification and is only demonstrably effective in graphs whose node features are dense language word embeddings. This provides strong evidence against the prevailing assumption that attention is inherently beneficial across graph domains.


Appendix A. Normalizations

Graph normalization, which typically involves element-wise multiplication with the adjacency matrix to adjust edge weights, plays a crucial role in graph neural networks (GNNs). While various normalization schemes exist, their theoretical implications remain under-explored. We denote a general normalization function as $f(A)$.

No Normalization

The simplest approach is to use the raw adjacency matrix without any normalization [37]: $f_1(A) = A$. In this case, the node feature update rule becomes:
$h_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} h_j^{(l)} W^{(l)} \Big)$
The aggregation directly sums neighboring features, so higher-degree nodes end up with larger feature magnitudes. With homogeneous features $h_i^{(0)} = 1$, node representations become proportional to node degrees.

Row Normalization

Row normalization [16] scales each row of the adjacency matrix by the inverse of the node degree: $f_2(A) = D^{-1} A$. The node feature update rule becomes:
$h_i^{(l+1)} = \sigma\Big( \frac{1}{d_i} \sum_{j \in \mathcal{N}(i)} h_j^{(l)} W^{(l)} \Big)$
Under this formulation, the aggregated information is the mean of the neighboring features rather than their sum, so node degrees no longer directly influence feature magnitudes. With homogeneous features $h_i^{(0)} = 1$, all nodes obtain identical representations.

Symmetric Normalization

Symmetric normalization [18] applies $f_3(A) = D^{-1/2} A D^{-1/2}$. The node feature update rule becomes:
$h_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \frac{h_j^{(l)} W^{(l)}}{\sqrt{d_i d_j}} \Big)$
A neighbor's influence is determined by both degrees: if a neighbor's degree is much larger than that of the center node, its feature weight becomes smaller than under row normalization.

Directed Normalization

For directed graphs, Rossi et al. [19] propose $f_4(A) = D_{\mathrm{in}}^{-1/2} A D_{\mathrm{out}}^{-1/2}$. The node feature update rule becomes:
$h_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \frac{h_j^{(l)} W^{(l)}}{\sqrt{d_i^{\mathrm{in}} d_j^{\mathrm{out}}}} \Big)$
This distinguishes between in-degree and out-degree, yielding a more appropriate normalization for directed graphs.

Softmax Normalization

Softmax normalization was introduced by GAT [8]. To make coefficients easily comparable across different nodes, they are normalized across all choices of $j$ using the softmax function:
$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})},$
where $\mathcal{N}_i$ denotes node $i$'s 1-hop neighborhood (including $i$ itself) and $e_{ij}$ are the unnormalized attention coefficients.
In summary, row normalization of adjacency matrices, when applied with uniform node features, discards structural information, since all nodes become indistinguishable. Using the unnormalized adjacency matrix preserves both degree information and feature distinctions, but it can lead to numerical instability: because the eigenvalues of $f(A)$ are not bounded within $[-1, 1]$, repeated aggregation can cause feature magnitudes to grow or shrink exponentially, which can destabilize the training of graph neural networks.
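To make the schemes above concrete, the following NumPy sketch computes $f_1$ through $f_3$ and the softmax normalization on a small toy graph (the graph and the raw scores are illustrative only), and verifies the homogeneous-feature observation:

```python
import numpy as np

# Toy 4-node undirected graph (no self-loops); node 0 has degree 3, node 3 degree 1.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

deg = A.sum(axis=1)
D_inv = np.diag(1.0 / deg)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

f1 = A                              # no normalization
f2 = D_inv @ A                      # row normalization D^{-1} A
f3 = D_inv_sqrt @ A @ D_inv_sqrt    # symmetric normalization D^{-1/2} A D^{-1/2}
# For a directed graph, f4 = D_in^{-1/2} A D_out^{-1/2} would use separate
# in-/out-degree matrices in place of the single degree matrix D above.

# Softmax normalization over each node's 1-hop neighborhood, applied to raw
# scores e_ij; the scores here are constant on edges, so the softmax reduces
# to a uniform average over neighbors (matching row normalization).
e = np.where(A > 0, 1.0, -np.inf)   # mask out non-edges
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# With homogeneous features h_i = 1, unnormalized aggregation is proportional
# to node degree, while row normalization makes all nodes identical.
h = np.ones((4, 1))
print((f1 @ h).ravel())   # [3. 2. 2. 1.] -- equals the degrees
print((f2 @ h).ravel())   # [1. 1. 1. 1.] -- structural information lost
```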
Table A1. Dataset statistics. #Train is number of train nodes.
Dataset #Nodes(#Train) #Edges #Feature #Class
CoraML 2995(140) 8416 2879 7
CiteSeer 3312(120) 4715 3703 6
Telegram 245(145) 8912 1 4
WikiCS 11701(580) 297110 300 10
Coauthor-CS 18333(8793) 182121 6805 15
Coauthor-Phy. 34493(16555) 495924 8415 5
PubMed 19717(60) 88648 500 3
Amazon-Photo 7650(3669) 238162 745 8
Amazon-Comp. 13752(6595) 491722 767 10

Appendix B. Implementation Details

Appendix B.1. Datasets

Our experiments are based on a diverse collection of directed and undirected graph benchmarks commonly used for node classification.
  • Citation networks. CiteSeer, CoraML, PubMed and WikiCS are citation graphs.
    For CiteSeer and CoraML, we use the splits specified in the DiGCN(ib) paper [24]. For the WikiCS dataset, we use the splits provided with the original source. PubMed is obtained from the Deep Graph Library (DGL) and treated as an undirected citation network. In all citation datasets, nodes represent papers and edges denote citation relationships; node features are bag-of-words representations, and labels correspond to paper topics. For PubMed, we generate 10 random splits following the same protocol as CiteSeer and CoraML: 20 labeled nodes per class for training, 30 per class for validation, and the remaining nodes for testing.
  • Social network. Telegram is a directed social network dataset from MagNet [38].
  • Coauthor networks. Coauthor-CS and Coauthor-Physics (denoted as CS and Physics in the main paper) are co-authorship graphs derived from the Microsoft Academic Graph released for the KDD Cup 2016 challenge. Nodes represent authors, edges indicate co-authorship relations, node features aggregate keywords from each author’s papers, and labels correspond to the author’s primary field of study.
  • Co-purchasing networks. Amazon-Computer and Amazon-Photo (denoted as Computer and Photo in the main paper) are co-purchase graphs introduced in Shchur et al. [10]. Nodes represent products, edges connect frequently co-purchased items, node features are bag-of-words representations of product reviews, and labels indicate product categories.
Among these datasets, CoraML, CiteSeer, WikiCS, and Telegram are originally directed graphs, whereas PubMed, Photo, Computers, Coauthor-CS, and Coauthor-Physics are originally undirected. All datasets are evaluated using cross-validation with 10 train/validation/test splits, except for WikiCS, which provides 20 splits as in the original source. We use the official splits when available; otherwise, we randomly generate 10 splits. For experiments involving GAT variants, we convert directed graphs into undirected ones by augmenting each edge with its reverse counterpart.
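The directed-to-undirected conversion mentioned above can be sketched as follows (a minimal example on a toy edge list, using the COO edge-index layout common in PyTorch Geometric):

```python
import numpy as np

# Toy directed edge list in COO layout: row 0 = source nodes, row 1 = targets.
edge_index = np.array([[0, 1, 2],
                       [1, 2, 0]])   # directed edges 0->1, 1->2, 2->0

# Augment each edge with its reverse counterpart, then drop any duplicates
# (relevant when the input already contains reciprocal edge pairs).
undirected = np.concatenate([edge_index, edge_index[::-1]], axis=1)
undirected = np.unique(undirected, axis=1)

print(undirected.shape[1])   # 6: each original edge plus its reverse
```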

Appendix B.2. Code and Hyperparameters

GATConv Variants Implementation.

We implement three GATConv variants: learned attention (standard GAT), uniform weights, and random weights. The implementation is based on the PyTorch Geometric GATConv [39]; the specific code for the GATConv variants and the GAT model is provided in the Supplementary Material.
All variants use the same architecture and hyperparameters (see Table A2 and Table A3) to ensure a fair comparison. Directed graphs are converted to undirected ones by adding reverse edges. All models use a hidden dimension of 128. Training is performed for up to 1500 epochs with early stopping; if validation performance does not improve for 80 epochs, the learning rate is reduced by a factor of 0.5. Model variants differ in whether a ReLU activation is applied after each GNN layer.
Our experimental design prioritizes (i) computational efficiency, (ii) fairness across model variants by using comparable hyperparameter search spaces and selecting the best-performing configuration for each model, and (iii) practical GPU constraints. In particular, for large graphs, using more than one attention head is often prohibitive in terms of memory, and thus we restrict the number of heads to one in such cases.
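The three weight settings can be illustrated with a minimal single-head sketch of the coefficient computation. This is pure NumPy, not our actual PyTorch Geometric implementation; the single shared score vector `a` is a simplification of GAT's $\mathbf{a}^\top[Wh_i \,\|\, Wh_j]$, and the random scores are drawn from a small symmetric interval around zero as a stand-in for untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def coefficients(h, adj, mode, a=None):
    """Edge coefficients for one head, softmax-normalized per 1-hop neighborhood.

    mode="learned": GAT-style scores e_ij = LeakyReLU(a.h_i + a.h_j)
                    (one shared vector, a simplification of GAT's two halves);
    mode="uniform": every edge gets the same raw score;
    mode="random":  fixed scores drawn from [-1e-4, 1e-4], never trained.
    """
    n = adj.shape[0]
    if mode == "uniform":
        e = np.ones((n, n))
    elif mode == "random":
        e = rng.uniform(-1e-4, 1e-4, size=(n, n))
    else:
        s = h @ a                                # per-node score contribution
        e = s[:, None] + s[None, :]              # pairwise raw scores
        e = np.where(e > 0, e, 0.2 * e)          # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)            # restrict to 1-hop neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    return alpha / alpha.sum(axis=1, keepdims=True)

# Toy graph with self-loops (as in GAT) and random features.
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
h = rng.normal(size=(3, 4))
a = rng.normal(size=4)

h_learned = coefficients(h, adj, "learned", a) @ h   # aggregation; W omitted
h_uniform = coefficients(h, adj, "uniform") @ h      # plain mean over neighbors
```

With uniform scores, the softmax collapses to an unweighted neighborhood average, which is exactly the "uniform weights" ablation compared against learned attention in Table 1.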

GAT Variants

Table A2 summarizes the hyperparameter configurations that were exhaustively explored for the experiments on GAT variants (Table 1).
Table A2. Hyperparameters used for the GAT variants evaluated in Table 1.
Dataset Layer Learning Rate Heads
Coauthor-CS 2 0.005 1, 2, 8
Coauthor-Physics 2 0.005 1, 2, 8
PubMed 2 0.005 1, 2, 8
Amazon-Photo 2, 3 0.005 1, 2, 4
Amazon-Computer 2, 3 0.005 1, 2, 8
WikiCS 2 0.005 1, 2, 4
Telegram 1, 2, 3, 4, 5 0.01, 0.005 1, 16
ArXiv-Year [40] 3 0.005 1, 8
Roman-Empire [40] 2 0.005 1, 8
Chameleon [41] 2 0.005 1, 8
Squirrel [41] 2 0.005 1, 8

DiGib Variants

Table A3 reports the hyperparameter configurations used for DiGib variants.
Table A3. Hyperparameters used for the DiGib variants evaluated in Table 2.
Dataset Layer Learning Rate Heads
CoraML 3 0.01 1, 8, 16
CiteSeer 3 0.01 1, 8, 16
Telegram 1-5 0.005, 0.01 1, 8, 16
WikiCS 2, 3 0.005 1, 2
Amazon-Photo 2, 3 0.005 1
PubMed 2 0.005 1, 8
Coauthor-CS 2 0.005 1
Coauthor-Physics 2 0.005 1
Roman-Empire [40] 2 0.005 1, 8
Chameleon [41] 2 0.005 1, 8
Squirrel [41] 2 0.005 1, 8

Appendix C. Experiments on Sensitivity

In this section, we present experimental results examining the sensitivity of attention-based models to changes in key hyperparameters, focusing especially on the number of attention heads, the number of layers, and the normalization strategy. Experiments with GAT variants and DiGib attention variants are presented in Table A4 and Table A5.
Table A4. Sensitivity analysis of the number of attention heads in GAT variants under different normalization schemes, reporting accuracy (mean ± standard deviation). Blank entries indicate experiments not conducted. “OOM” denotes out-of-memory errors on the Nvidia A40 platform. Across most datasets (except Telegram), increasing the number of heads does not improve performance and can even cause slight degradation. For Telegram, the best performance occurs with 8–16 heads, while smaller or larger values yield marginally worse results. The best performance for each dataset is highlighted in yellow.
Datasets Layer Heads Softmax Dir Sym Row None
Coauthor-CS 2 1 93.1±0.1 92.8±0.2 92.5±0.2 92.2±0.3 92.0±0.4
2 2 93.3±0.2 92.9±0.2 92.6±0.2 92.4±0.2 92.5±0.5
2 8 93.1±0.2 93.3±0.3 93.3±0.2 93.0±0.1 92.5±0.5
Coauthor-Physics 2 1 96.0±0.1 96.0±0.1 96.0±0.1 95.9±0.1 95.4±0.4
2 2 96.0±0.1 96.1±0.1 96.1±0.1 95.9±0.1 95.6±0.2
2 4 96.0±0.1 96.1±0.1 96.1±0.1 96.0±0.1 95.6±0.3
2 8 96.1±0.1 96.2±0.1 96.2±0.2 96.0±0.1 95.8±0.2
PubMed 2 1 74.8±1.0 74.4±1.1 73.8±1.0 74.1±0.8 67.3±2.1
2 8 74.9±1.4 74.7±0.7 74.8±0.7 74.5±1.3 68.7±2.5
Amazon-Photo 2 1 93.9±0.4 93.7±0.5 93.9±0.3 93.8±0.2 93.1±0.3
2 2 94.0±0.3 93.8±0.2 94.0±0.4 94.0±0.3 93.1±0.4
2 4 93.9±0.4 93.7±0.3 93.7±0.4 94.0±0.3 93.2±0.4
Amazon-Computer 2 1 90.7±0.2 90.5±0.2 90.7±0.3 90.6±0.3 89.0±1.1
2 2 90.7±0.2 90.4±0.3 90.8±0.1 90.5±0.3 89.4±0.5
3 1 90.6±0.3 90.7±0.2 90.8±0.3 90.9±0.3 82.2±16.9
3 2 90.7±0.3 90.5±0.3 91.0±0.3 90.8±0.3 88.7±0.9
WikiCS
Undirected
2 1 78.5±0.9 78.3±0.9 78.4±1.1 78.5±0.9 75.3±1.2
2 2 78.7±1.0 78.2±0.9 78.5±0.8 78.4±0.8 75.0±0.9
2 4 78.9±0.9 78.5±0.9 78.4±0.8 78.3±1.0 75.3±0.7
Telegram 3 1 78.2±9.4 83.6±5.3 86.4±5.2 67.6±6.2 90.8±3.6
3 16 77.6±6.3 86.0±6.7 87.8±5.8 77.6±10.9 90.6±4.5
In conclusion, model performance is largely insensitive to the number of attention heads. Using 1 or 8 heads yields representative results, and for some datasets, increasing the number of heads can even degrade performance.

Appendix D. Experiments on Heterophilic Graphs

Experimental results on heterophilic graphs are reported in Table A6 and Table A7. We evaluate four heterophilic datasets: Arxiv-Year, Roman-Empire, Chameleon, and Squirrel. For DiGib variants, experiments on Arxiv-Year could not be conducted due to memory limitations. All experiments on heterophilic datasets use 3-layer GAT variants and 2-layer DiGib variants.
Across Arxiv-Year, Chameleon, and Squirrel, competitive performance is achieved without attention mechanisms, with multi-head attention providing only marginal improvements. In contrast, Roman-Empire exhibits a substantial performance gain when attention is introduced. For GAT variants, accuracy increases from 55.4% to 82.6% as the number of attention heads increases from 1 to 8. Similarly, for DiGib variants, performance improves from 80.3% to 83.4%.
We hypothesize that this behavior is related to the nature of node features in Roman-Empire, which consist of dense word embeddings. In such settings, learned attention coefficients may help differentiate informative feature interactions more effectively than uniform aggregation.
Since Roman-Empire achieves strong performance with LargeScaleNet [23], which employs directed multi-scale learning and Jumping Knowledge connections with uniform edge weights, we further evaluate LargeScaleNet variants with different weight settings and normalization schemes. Hyperparameters follow the best-performing configuration reported in the original paper, but only the first scale is retained (no higher-order scales). As shown in Table A8, uniform-weight variants achieve competitive performance even on Roman-Empire. This suggests that, while dense node features may benefit from attention in certain architectures, comparable performance can also be achieved without attention when appropriate structural mechanisms are used.
Table A5. This table reports accuracy (mean ± standard deviation) and presents a sensitivity analysis of the number of attention heads in the DiGib variant, where edge weights are replaced by attention weights, evaluated under different normalization schemes. Blank entries indicate experiments not conducted. “OOM” denotes out-of-memory errors on the Nvidia A40 platform. In general, increasing the number of attention heads does not improve performance. For PubMed, performance declines as the number of heads increases from 1 to 8. The best performance for each dataset is highlighted in yellow.
Datasets Layer Heads Dir Sym Row Softmax None
CoraML 3 1 81.7±1.5 81.8±1.5 81.5±1.5 81.0±1.7 27.7±3.3
3 8 82.0±1.4 82.2±1.5 81.8±1.4 81.0±1.6 43.8±3.5
3 16 81.8±1.5 81.6±1.8 81.7±1.4 80.9±1.7 46.0±4.6
CiteSeer 3 1 66.3±1.8 66.2±1.8 66.4±1.7 66.4±1.9 34.1±3.3
3 8 66.5±1.9 66.3±1.9 66.0±1.6 66.2±1.7 37.6±2.6
3 16 66.8±2.4 66.5±2.4 66.8±1.6 66.6±1.6 39.6±3.6
WikiCS 2,3 1 79.7±0.5 79.7±0.5 80.6±0.6 80.7±0.5 37.8±4.4
2,3 8 79.7±0.5 79.1±0.6 80.6±0.6 80.6±0.5 26.7±2.5
Telegram 1-5 1 87.2±3.8 88.2±5.6 79.8±6.1 76.8±2.9 87.2±3.0
1-5 8 85.4±5.7 87.6±5.2 79.0±8.1 81.4±3.9 89.8±4.6
1-5 16 84.6±4.6 87.8±5.8 74.2±9.4 79.8±5.5 90.4±3.5
Coauthor-CS 2 1 94.3±0.2 93.9±0.2 94.0±0.2 94.3±0.2 93.9±0.2
2 8 94.9±0.1 94.6±0.2 94.7±0.1 94.4±0.2 94.2±0.2
Coauthor-Physics 2 1 96.3±0.1 96.2±0.1 96.1±0.1 96.5±0.1 96.4±0.1
2 2 96.4±0.1 96.3±0.1 96.3±0.1 96.5±0.1 96.3±0.2
2 8 OOM OOM OOM OOM OOM
PubMed 2 1 77.0±0.8 76.7±1.1 77.0±1.1 76.3±2.9 74.6±1.1
2 8 75.4±0.7 75.7±1.0 77.8±0.4 76.3±1.3 73.9±1.0
Amazon-Photo 2,3 1 94.9±0.3 93.9±0.7 94.6±0.3 95.1±0.2 71.0±26.3
2,3 2 94.8±0.2 93.5±0.6 94.5±0.5 94.8±0.3 81.7±12.6
2 8 OOM OOM OOM OOM OOM
Overall, these results indicate that the effectiveness of attention mechanisms is not inherently tied to graph heterophily. Instead, it appears to depend more on the characteristics of node features and model design, with attention not being a necessary component for strong performance.

Appendix E. Additional Details on Graph Transformers

Graph Transformers (GTs) have primarily developed along two main directions.

(1) Encoding strategies for recovering structural information.

By discarding the adjacency matrix $A$, GTs lose the original graph connectivity. To compensate, models commonly employ positional and structural encodings (PE/SE) [27,31]. These encodings infuse node features with graph-aware signals such as Laplacian eigenvectors [42], degree centrality [31], or edge regularization [43], helping attention mechanisms infer relative positions and relationships. However, these strategies often introduce costly pre-processing [44], sometimes with $O(N^3)$ complexity, and their ability to fully restore the lost structural information remains questionable.
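As one example of such pre-processing, a Laplacian-eigenvector positional encoding in the spirit of [42] can be sketched as follows (a toy graph; the PE dimension `k` and the placeholder features are illustrative). The dense eigendecomposition is precisely the potentially $O(N^3)$ step noted above:

```python
import numpy as np

# Toy undirected path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian

# Dense eigendecomposition: O(N^3) time, the costly pre-processing step.
eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order

k = 2
pe = eigvecs[:, 1:k + 1]      # skip the trivial eigenvector (eigenvalue 0)
X = np.ones((4, 3))           # placeholder node features
X_pe = np.concatenate([X, pe], axis=1)   # attention now sees position signals
print(X_pe.shape)             # (4, 5)
```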

(2) Improving scalability.

Full attention incurs $O(N^2)$ time and memory complexity, limiting applicability to large graphs. To mitigate this, scalable architectures have been proposed, such as GraphMamba [45], NodeFormer [43], and SGFormer [44]. For more details on Graph Transformers, see [15,46] or the survey [30].
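The quadratic cost is easy to see in a minimal dense-attention sketch: the score matrix is $N \times N$ regardless of how sparse the underlying graph is (illustrative code, not any specific GT architecture):

```python
import numpy as np

def full_attention(X):
    """Unmasked self-attention over all N nodes: the score matrix is N x N."""
    scores = X @ X.T / np.sqrt(X.shape[1])        # O(N^2) time and memory
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ X

N, d = 512, 32
X = np.random.default_rng(0).normal(size=(N, d))
out = full_attention(X)
print(out.shape)   # (512, 32)

# The score matrix alone holds N*N floats: for a million-node graph that is
# roughly 4 TB in float32, which is what scalable variants avoid materializing.
```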
Table A6. Accuracy (mean ± standard deviation) of GAT variants under different weight settings and normalization schemes on heterophilic datasets. Yellow cells correspond to the original GAT configuration. Results in bold exceed the baseline mean, while underlined results exhibit strong distributional overlap with the baseline.
Datasets Weight Heads Softmax Dir Sym Row None
Arxiv-Year
Directed
(Heterophilic)
Attention 1 44.7±0.2 52.7±0.7 30.1±4.1 40.1±1.1 30.1±1.7
Attention 8 43.3±0.9 59.2±0.5 36.1±0.7 45.8±0.6 36.8±2.5
Uniform 1 30.0±1.7 50.1±0.2 47.4±0.2 29.7±1.1 42.9±0.1
Random 1 35.3±0.2 48.2±0.3 42.5±0.3 29.1±0.8 41.5±0.2
Roman-Empire
Directed
(Heterophilic)
Attention 1 46.4±3.7 55.4±2.4 50.5±3.1 43.8±3.0 55.0±2.4
Attention 8 71.5±2.0 82.6±0.9 70.3±4.7 70.3±3.0 79.9±0.6
Uniform 1 32.9±0.3 36.3±0.4 33.4±0.5 32.8±0.4 36.1±0.6
Random 1 20.9±0.6 32.3±0.6 29.8±1.0 29.9±0.5 30.0±0.4
Chameleon
Undirected
(Heterophilic)
Attention 1 67.1±1.5 65.4±1.7 66.4±2.2 66.0±3.5 59.2±5.0
Attention 8 68.8±2.1 66.8±3.2 68.8±3.1 67.5±1.6 59.4±2.8
Uniform 1 67.9±2.6 67.5±2.5 67.6±2.4 68.0±2.2 66.4±2.2
Random 1 40.3±2.7 65.2±2.5 66.8±2.0 66.9±1.9 64.1±2.3
Random 8 56.5±2.6 67.2±2.6 67.7±2.3 68.0±2.2 65.3±2.2
Squirrel
Undirected
(Heterophilic)
Attention 1 56.1±2.1 55.8±0.9 57.2±2.1 55.1±2.3 39.9±4.5
Attention 8 60.5±1.8 55.1±1.9 58.0±1.2 57.0±1.8 42.5±1.8
Uniform 1 53.1±1.2 56.4±1.3 56.6±1.7 53.4±0.9 45.9±1.6
Random 1 29.0±1.7 54.5±1.2 54.4±1.2 50.3±1.5 45.6±1.8
Random 8 38.7±1.2 56.3±1.6 55.9±1.3 53.6±1.2 43.3±2.3
In addition, GTs have been extended to new domains, such as directed graphs [47] and few-shot learning [48]. The field continues to expand rapidly, with new models proposed each year [49,50]. Although many studies acknowledge limitations of existing GTs [15,32], attention is still widely treated as an inherently powerful mechanism. Consequently, researchers either design new GT variants [46] or propose alternative attention-based architectures [15].

Summary

Overall, despite rapid progress, existing Graph Transformer designs remain heavily focused on engineering solutions for structure and efficiency, while the fundamental role and necessity of attention for graph learning remain insufficiently understood.
Table A7. Accuracy (mean ± standard deviation) of DiGib [25] variants with different weight settings and normalization schemes evaluated on heterophilic datasets. Yellow cells correspond to the best attention-weighted configuration. Results in bold exceed the baseline mean, while underlined results exhibit strong distributional overlap with the baseline.
Datasets Weight Heads Dir Sym Row Softmax None
Roman-Empire
Direct
(Heterophilic)
Attention 1 80.7±0.4 80.3±0.6 78.7±0.3 80.3±0.4 79.4±0.5
Attention 8 82.1±0.3 83.0±0.7 80.4±0.6 82.2±0.4 83.4±0.4
Uniform 1 80.1±0.5 80.1±0.5 78.9±0.5 78.9±0.5 77.6±0.4
Random 1 76.9±0.5 76.5±0.4 76.1±0.3 68.3±0.5 14.2±0.3
Random 8 76.8±0.4 75.7±0.8 76.0±0.4 69.7±0.5 14.3±0.6
Chameleon
Direct
(Heterophilic)
Attention 1 58.3±1.2 59.3±1.7 62.0±2.1 63.7±1.3 27.6±2.6
Attention 8 55.5±2.2 58.3±1.8 59.7±1.3 58.7±1.6 37.8±3.4
Uniform 1 60.0±1.6 60.0±1.6 63.9±1.9 63.9±1.9 26.8±2.6
Random 1 59.8±2.0 59.3±1.3 61.3±2.1 43.3±2.4 26.6±1.8
Random 8 58.6±2.2 59.6±3.0 62.1±1.6 47.1±2.6 34.0±2.4
Squirrel
Direct
(Heterophilic)
Attention 1 37.5±1.8 38.5±0.9 39.5±1.5 40.6±1.7 28.0±3.2
Attention 8 38.0±1.9 39.3±1.6 41.0±1.8 39.2±1.7 25.8±1.7
Uniform 1 40.3±1.3 40.3±1.3 43.3±1.5 43.6±1.4 27.9±2.6
Random 1 40.4±1.4 39.9±1.5 42.7±1.1 34.5±1.2 27.4±1.7
Random 8 40.4±1.4 39.9±1.5 42.7±1.1 34.3±1.6 27.4±1.7
Table A8. Accuracy (mean ± standard deviation) of LargeScaleNet [23] variants under different weight settings and normalization schemes on heterophilic datasets. Hyperparameters follow the best-performing configuration reported in the LargeScaleNet paper, but only the first scale is used (no higher-order scales). Yellow cells correspond to the best attention-weighted configuration. Results in bold exceed the baseline mean, while underlined results exhibit strong distributional overlap with the baseline.
Datasets Weight Dir Sym Row None
Roman-Empire Attention 93.47±0.24 93.01±0.54 92.60±0.60 93.60±0.32
Uniform 93.35±0.55 93.25±0.34 92.27±0.29 93.59±0.29
Arxiv-year Attention 65.56±0.12 64.43±0.31 59.74±0.70 65.18±0.26
Uniform 66.02±0.28 63.48±0.51 41.52±3.00 56.46±2.44
Chameleon Attention 74.74±0.94 77.11±1.71 78.07±1.00 66.01±2.67
Uniform 79.21±1.01 79.63±0.92 79.50±1.16 58.68±5.05
Squirrel Attention 69.33±2.09 69.76±1.62 72.88±2.01 63.53±2.65
Uniform 74.63±1.60 75.05±1.98 73.38±1.74 74.53±2.10

References

  1. Rogers, E.M. Diffusion of Innovations, 5th ed.; Simon and Schuster, 2003.
  2. Dearing, J.W. Applying Diffusion of Innovation Theory to Intervention Development. Research on Social Work Practice 2009, 19, 503–518.
  3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc., 2012; Vol. 25.
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc., 2017; Vol. 30.
  5. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018; pp. 7794–7803.
  6. Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer, 2021.
  7. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning; PMLR, 2021; pp. 8821–8831.
  8. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In International Conference on Learning Representations, 2018.
  9. Li, Y.; Liang, X.; Hu, Z.; Chen, Y.; Xing, E.P. Graph Transformer, 2019.
  10. Shchur, O.; Mumme, M.; Bojchevski, A.; Günnemann, S. Pitfalls of Graph Neural Network Evaluation. arXiv 2019, arXiv:1811.05868.
  11. Luo, Y.; Shi, L.; Wu, X.M. Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification. arXiv 2024, arXiv:2406.08993.
  12. Tönshoff, J.; Ritzert, M.; Rosenbluth, E.; Grohe, M. Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark. Transactions on Machine Learning Research 2024.
  13. Xing, Y.; Wang, X.; Li, Y.; Huang, H.; Shi, C. Less Is More: On the Over-Globalizing Problem in Graph Transformers. In Proceedings of the 41st International Conference on Machine Learning; JMLR.org, 2024.
  14. Sancak, K.; Hua, Z.; Fang, J.; Xie, Y.; Malevich, A.; Long, B.; Balin, M.F.; Çatalyürek, U.V. A Scalable and Effective Alternative to Graph Transformers. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence; AAAI Press, 2025.
  15. Buterez, D.; Janet, J.P.; Oglic, D.; Liò, P. An End-to-End Attention-Based Approach for Learning on Graphs. Nature Communications 2025, 16, 5244.
  16. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive Representation Learning on Large Graphs. Advances in Neural Information Processing Systems 2017, 30.
  17. Corso, G.; Cavalleri, L.; Beaini, D.; Liò, P.; Veličković, P. Principal Neighbourhood Aggregation for Graph Nets. Advances in Neural Information Processing Systems 2020, 33, 13260–13271.
  18. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
  19. Rossi, E.; Charpentier, B.; Di Giovanni, F.; Frasca, F.; Günnemann, S.; Bronstein, M.M. Edge Directionality Improves Learning on Heterophilic Graphs. In Proceedings of the Learning on Graphs Conference; PMLR, 2024.
  20. Abbahaddou, Y.; Malliaros, F.D.; Lutzeyer, J.F.; Vazirgiannis, M. Centrality Graph Shift Operators for Graph Neural Networks. arXiv 2024, arXiv:2411.04655.
  21. Jiang, Q.; Wang, C.; Lones, M.; Pang, W. Demystifying MPNNs: Message Passing as Merely Efficient Matrix Multiplication. arXiv 2025, arXiv:2502.00140.
  22. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 2021, 32, 4–24.
  23. Jiang, Q.; Wang, C.; Lones, M.; Chen, D.; Pang, W. Scale-Aware Message Passing for Graph Node Classification, 2026.
  24. Tong, Z.; Liang, Y.; Sun, C.; Li, X.; Rosenblum, D.; Lim, A. Digraph Inception Convolutional Networks. Advances in Neural Information Processing Systems 2020, 33, 17907–17918.
  25. Tong, Z.; Liang, Y.; Sun, C.; Rosenblum, D.S.; Lim, A. Directed Graph Convolutional Network. arXiv 2020, arXiv:2004.13970.
  26. Bechler-Speicher, M.; Amos, I.; Gilad-Bachrach, R.; Globerson, A. Graph Neural Networks Use Graphs When They Shouldn't. In Proceedings of the Forty-First International Conference on Machine Learning, 2024.
  27. Rampášek, L.; Galkin, M.; Dwivedi, V.P.; Luu, A.T.; Wolf, G.; Beaini, D. Recipe for a General, Powerful, Scalable Graph Transformer. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2022; pp. 14501–14515.
  28. Shirzad, H.; Velingker, A.; Venkatachalam, B.; Sutherland, D.J.; Sinop, A.K. Exphormer: Sparse Transformers for Graphs. In Proceedings of the 40th International Conference on Machine Learning; PMLR, 2023; pp. 31613–31632.
  29. Deng, C.; Yue, Z.; Zhang, Z. Polynormer: Polynomial-Expressive Graph Transformer in Linear Time. In The Twelfth International Conference on Learning Representations, 2024.
  30. Shehzad, A.; Xia, F.; Abid, S.; Peng, C.; Yu, S.; Zhang, D.; Verspoor, K. Graph Transformers: A Survey. IEEE Transactions on Neural Networks and Learning Systems 2026, 1–20.
  31. Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; Liu, T.Y. Do Transformers Really Perform Badly for Graph Representation? In Advances in Neural Information Processing Systems; Curran Associates, Inc., 2021; Vol. 34, pp. 28877–28888.
  32. Luo, Y.; Shi, L.; Wu, X.M. Can Classic GNNs Be Strong Baselines for Graph-Level Tasks? Simple Architectures Meet Excellence. In Proceedings of the Forty-Second International Conference on Machine Learning, 2025.
  33. Leskovec, J. What Every Data Scientist Should Know About Graph Transformers and Their Impact on Structured Data, 2025.
  34. Fey, M.; Kocijan, V.; Lopez, F.; Lenssen, J.E.; Leskovec, J. KumoRFM: A Foundation Model for In-Context Learning on Relational Data; Technical whitepaper; Kumo AI, 2025.
  35. Gunasekar, S.; Lee, J.D.; Soudry, D.; Srebro, N. Implicit Bias of Gradient Descent on Linear Convolutional Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc., 2018; Vol. 31.
  36. Clark, K.; Khandelwal, U.; Levy, O.; Manning, C.D. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, 2019; pp. 276–286.
  37. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated Graph Sequence Neural Networks, 2017.
  38. Zhang, X.; He, Y.; Brugnone, N.; Perlmutter, M.; Hirn, M. MagNet: A Neural Network for Directed Graphs. Advances in Neural Information Processing Systems 2021, 34, 27003–27015.
  39. PyTorch Geometric Team. GATConv — PyTorch Geometric Documentation, 2024. Available online: https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GATConv.html (accessed on 28 January 2026).
  40. Lim, D.; Hohne, F.M.; Li, X.; Huang, S.L.; Gupta, V.; Bhalerao, O.P.; Lim, S.N. Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods. In Advances in Neural Information Processing Systems, 2021.
  41. Pei, H.; Wei, B.; Chang, K.C.C.; Lei, Y.; Yang, B. Geom-GCN: Geometric Graph Convolutional Networks. arXiv 2020, arXiv:2002.05287.
  42. Dwivedi, V.P.; Bresson, X. A Generalization of Transformer Networks to Graphs. arXiv 2021, arXiv:2012.09699.
  43. Wu, Q.; Zhao, W.; Li, Z.; Wipf, D.; Yan, J. NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification. arXiv 2023, arXiv:2306.08385.
  44. Wu, Q.; Zhao, W.; Yang, C.; Zhang, H.; Nie, F.; Jiang, H.; Bian, Y.; Yan, J. SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023; pp. 64753–64773.
  45. Behrouz, A.; Hashemi, F. Graph Mamba: Towards Learning on Graphs with State Space Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024; pp. 119–130.
  46. Zhang, Y.; Li, X.; Xu, Y.; Xu, X.; Wang, Z. A Graph Transformer with Optimized Attention Scores for Node Classification. Scientific Reports 2025, 15, 30015.
  47. Luo, Y.; Thost, V.; Shi, L. Transformers over Directed Acyclic Graphs. Advances in Neural Information Processing Systems 2024, 36.
  48. Cao, B.; Ding, C.; Chen, K.; Zhu, Y. DGT: Differential Graph Transformer for Graph Learning with Few-Shot Learning. Expert Systems with Applications 2026, 303, 130638.
  49. Yuan, C.; Song, Z.; Kuruoglu, E.; Zhao, K.; Liu, Y.; Zhao, D.; Cheng, H.; Rong, Y. ParaFormer: A Generalized PageRank Graph Transformer for Graph Representation Learning. In Proceedings of the 19th ACM International Conference on Web Search and Data Mining, New York, NY, USA, 2026.
  50. Aminian-Dehkordi, J.; Parsa, M.; Dickson, A.; Mofrad, M.R.K. SIMBA-GNN: Mechanistic Graph Learning for Microbiome Prediction. npj Systems Biology and Applications 2025.
Figure 1. Attention mechanisms across neighborhood scopes. (a) GAT attends to immediate neighbors, revealing local structure; (b) inception attention covers expanded neighborhoods; (c) global attention lacks structural information. h_i denotes node i's input features; h_i′ denotes node i's layer-wise output. Arrow colors indicate independent attention computations.
Table 1. Accuracy (mean ± standard deviation) of GAT variants under different weight settings and normalization schemes. Since attention weights can take negative values, the square-root operations required by Dir-norm and Sym-norm may become invalid; to address this, we use the absolute values of the weights for Dir-norm and Sym-norm. For Row-norm and no normalization (None), we report the better performance between using the original weights and their absolute values. Learned attention weights are compared against uniform weights (all ones) and random weights sampled uniformly from [−10⁴, 10⁴]. Yellow cells correspond to the original GAT configuration. Results in bold exceed the baseline mean, while underlined results exhibit strong distributional overlap with the baseline.
| Datasets | Weight | Softmax | Dir | Sym | Row | None |
|---|---|---|---|---|---|---|
| Coauthor-CS | Attention | 93.1±0.1 | 92.8±0.2 | 92.5±0.2 | 92.6±0.2 | 92.0±0.4 |
| | Uniform | 93.0±0.2 | 93.6±0.1 | 93.5±0.1 | 93.0±0.1 | 92.0±0.1 |
| | Random | 76.9±0.6 | 92.9±0.2 | 92.8±0.1 | 92.5±0.2 | 90.2±0.3 |
| Coauthor-Physics | Attention | 96.0±0.1 | 96.0±0.1 | 96.0±0.1 | 95.9±0.1 | 95.4±0.4 |
| | Uniform | 95.9±0.1 | 96.1±0.1 | 96.1±0.1 | 95.9±0.1 | 95.0±0.1 |
| | Random | 88.3±0.3 | 95.8±0.2 | 95.8±0.2 | 95.6±0.1 | 94.2±0.2 |
| PubMed | Attention | 74.8±1.0 | 74.4±1.1 | 73.8±1.0 | 74.1±0.8 | 67.3±2.1 |
| | Uniform | 74.9±1.1 | 75.2±1.1 | 75.1±0.6 | 75.1±0.9 | 69.4±2.1 |
| | Random | 66.0±0.3 | 73.6±0.8 | 73.2±0.8 | 72.8±0.8 | 72.2±1.4 |
| Amazon-Photo | Attention | 93.9±0.4 | 93.7±0.5 | 93.9±0.3 | 93.8±0.2 | 93.1±0.3 |
| | Uniform | 93.4±0.5 | 93.4±0.5 | 93.4±0.5 | 93.1±0.3 | 91.3±0.3 |
| | Random | 80.9±0.6 | 92.8±0.8 | 92.9±0.5 | 92.9±0.4 | 91.0±0.3 |
| Amazon-Computer | Attention | 90.7±0.3 | 90.5±0.2 | 90.7±0.3 | 90.6±0.3 | 89.0±1.1 |
| | Uniform | 90.4±0.2 | 90.1±0.4 | 90.3±0.3 | 90.2±0.2 | 85.2±1.1 |
| | Random | 73.6±0.6 | 89.9±0.2 | 89.6±0.2 | 88.7±0.3 | 85.5±0.7 |
| WikiCS (Undirected) | Attention | 78.5±0.8 | 78.3±0.9 | 78.4±1.1 | 78.5±0.9 | 75.3±1.2 |
| | Uniform | 77.9±0.8 | 78.3±1.1 | 78.3±1.1 | 77.9±1.1 | 71.5±1.0 |
| | Random | 57.7±0.7 | 77.8±1.1 | 77.7±1.0 | 77.3±0.9 | 70.5±1.0 |
| Telegram (Undirected) | Attention | 78.2±9.4 | 86.0±6.7 | 87.8±5.8 | 77.6±10.9 | 90.8±3.6 |
| | Uniform | 71.0±4.7 | 78.0±7.5 | 79.6±6.6 | 71.4±3.1 | 92.4±3.2 |
| | Random | 48.8±6.8 | 89.0±3.9 | 87.8±4.2 | 81.8±4.8 | 93.8±3.3 |
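The normalization schemes compared above can be made concrete with a small sketch. The function below is illustrative only: `normalize_weights` and its scheme names are hypothetical, the paper's experiments operate on sparse message-passing graphs rather than dense matrices, and the exact Dir-norm form used here (out-degree on one side, in-degree on the other, over absolute weights) is an assumption; the absolute values for Dir/Sym mirror the caption's fix for negative attention weights.

```python
import numpy as np

def normalize_weights(W, scheme):
    """Apply one edge-weight normalization scheme to a dense matrix W,
    where W[i, j] is the weight of the edge aggregated into node i."""
    if scheme == "softmax":            # row-wise softmax, as in GAT
        e = np.exp(W - W.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    if scheme == "row":                # divide each row by its sum
        d = W.sum(axis=1, keepdims=True)
        return W / np.where(d == 0, 1.0, d)
    A = np.abs(W)                      # Dir/Sym need non-negative degrees
    if scheme == "sym":                # symmetric: D^{-1/2} |W| D^{-1/2}
        d = A.sum(axis=1)
        s = 1.0 / np.sqrt(np.where(d == 0, 1.0, d))
        return s[:, None] * A * s[None, :]
    if scheme == "dir":                # directed: out-/in-degree scaling
        so = 1.0 / np.sqrt(np.where(A.sum(axis=1) == 0, 1.0, A.sum(axis=1)))
        si = 1.0 / np.sqrt(np.where(A.sum(axis=0) == 0, 1.0, A.sum(axis=0)))
        return so[:, None] * A * si[None, :]
    return W                           # "None": leave weights unchanged
```

Under "softmax" and "row" each row sums to one, so only relative weights within a neighborhood matter; under "None" the raw weight scale reaches the aggregation directly, which is one plausible reading of why the None column degrades most for random weights.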
Table 2. Accuracy (mean ± standard deviation) of DiGib [25] variants with different weight settings and normalization schemes, evaluated on four directed and four undirected graphs. Attention weights are compared with uniform weights and random weights sampled uniformly from [−10⁴, 10⁴]. Yellow cells correspond to the best attention-weighted configuration. Results in bold exceed the baseline mean, while underlined results exhibit strong distributional overlap with the baseline. For attention weights, the Dir and Sym columns use absolute values.
| Datasets | Weight | Dir | Sym | Row | Softmax | None |
|---|---|---|---|---|---|---|
| CoraML (Directed) | Attention | 81.7±1.5 | 81.8±1.5 | 81.5±1.5 | 81.0±1.7 | 27.7±3.3 |
| | Uniform | 82.0±1.3 | 82.0±1.3 | 81.8±1.5 | 81.8±1.5 | 77.7±2.4 |
| | Random | 82.0±1.6 | 81.6±1.3 | 81.7±1.3 | 74.1±1.5 | 76.1±1.7 |
| CiteSeer (Directed) | Attention | 66.3±1.8 | 66.2±1.8 | 66.4±1.7 | 66.4±1.9 | 34.1±3.3 |
| | Uniform | 66.5±1.6 | 66.5±1.6 | 66.2±1.2 | 66.2±1.2 | 62.1±2.5 |
| | Random | 66.1±2.0 | 65.8±1.8 | 65.5±1.8 | 63.7±1.6 | 60.9±2.7 |
| Telegram (Directed) | Attention | 87.2±3.8 | 88.2±5.6 | 79.8±6.1 | 76.8±2.9 | 87.2±3.0 |
| | Uniform | 89.2±5.8 | 89.2±5.8 | 73.4±7.8 | 73.4±7.8 | 91.2±4.4 |
| | Random | 87.8±4.4 | 88.0±4.5 | 61.4±8.7 | 34.6±5.5 | 83.2±3.7 |
| WikiCS (Directed) | Attention | 79.7±0.5 | 79.7±0.5 | 80.6±0.6 | 80.7±0.5 | 37.8±4.4 |
| | Uniform | 79.8±0.6 | 79.8±0.6 | 80.7±0.6 | 80.7±0.6 | 40.3±5.2 |
| | Random | 79.6±0.5 | 79.7±0.5 | 80.4±0.5 | 74.4±0.7 | 38.1±5.0 |
| Coauthor-CS | Attention | 94.4±0.2 | 93.8±0.3 | 93.9±0.2 | 94.3±0.1 | 94.0±0.2 |
| | Uniform | 94.9±0.1 | 94.9±0.1 | 94.7±0.1 | 94.7±0.1 | 55.3±5.2 |
| | Random | 94.7±0.1 | 94.8±0.1 | 94.5±0.1 | 92.4±0.1 | 51.6±8.8 |
| Coauthor-Physics | Attention | 96.4±0.1 | 96.2±0.1 | 96.1±0.1 | 96.5±0.1 | 96.4±0.1 |
| | Uniform | 96.7±0.1 | 96.7±0.1 | 96.5±0.1 | 96.5±0.1 | 86.9±3.0 |
| | Random | 96.6±0.1 | 96.6±0.1 | 96.5±0.1 | 95.5±0.1 | 88.0±2.1 |
| PubMed | Attention | 77.0±0.8 | 76.7±1.1 | 77.0±1.1 | 76.3±2.9 | 74.6±1.1 |
| | Uniform | 77.7±0.3 | 77.5±0.7 | 76.9±1.5 | 76.9±1.5 | 66.7±2.1 |
| | Random | 77.2±0.8 | 77.4±0.2 | 76.6±1.9 | 74.6±0.3 | 67.8±2.5 |
| Amazon-Photo | Attention | 94.9±0.3 | 93.9±0.7 | 94.6±0.3 | 95.1±0.2 | 71.0±26.3 |
| | Uniform | 95.1±0.3 | 95.1±0.3 | 95.0±0.3 | 95.0±0.2 | 30.7±6.0 |
| | Random | 95.2±0.3 | 95.2±0.1 | 94.8±0.4 | 91.9±0.2 | 30.2±3.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.