REA-CNN: A Region-Aware & Enhanced Attention CNN for Robust Leaf Disease Classification

Abstract
Automated plant leaf disease detection using deep learning has achieved high accuracy on benchmark datasets; however, its performance often degrades when applied to real-world agricultural images affected by background clutter, illumination variability, and partial occlusions. These factors limit the reliability of conventional convolutional neural network (CNN)–based models trained under controlled conditions. To address this limitation, this paper proposes a Region-Aware and Enhanced Attention Convolutional Neural Network (REA-CNN) for robust classification of plant leaf diseases. The proposed framework integrates explicit region-aware pre-processing for background suppression with an attention-enhanced CNN backbone, enabling the model to focus on disease-relevant visual patterns. Unlike conventional two-stage CNN–SVM pipelines, REA-CNN is trained in a fully end-to-end manner, allowing joint optimization of feature extraction, attention refinement, and classification. Experimental results show that the proposed approach achieves higher classification accuracy and improved generalization on heterogeneous and real-world images compared to existing methods. These results demonstrate the effectiveness of combining region awareness and attention-guided learning for developing practical and deployable decision-support systems in precision agriculture.

1. Introduction

Accurate detection of plant leaf diseases is crucial for enhancing crop productivity and facilitating timely agricultural interventions. In recent years, convolutional neural networks (CNNs), particularly when combined with transfer learning, have demonstrated strong performance in plant disease classification. However, most existing CNN-based methods are evaluated primarily on controlled benchmark datasets such as PlantVillage, where images are captured under uniform lighting conditions with clean and uncluttered backgrounds. As a result, their performance often degrades when applied to real agricultural images that exhibit background clutter, illumination variability, occlusions, and partial leaf visibility.

In our previous work, we investigated hybrid CNN–CBAM–SVM architectures to improve classification performance by incorporating attention mechanisms and margin-based classifiers. Although this approach demonstrated that attention can enhance disease-relevant feature discrimination, its two-stage training strategy introduces important limitations. In particular, decoupling feature extraction from classification prevents end-to-end optimization, increases system complexity, and restricts the influence of the classifier on learned feature representations. In addition, background interference is not explicitly addressed, which limits robustness when models trained on laboratory-style datasets are deployed in real-world environments.
These limitations motivate the development of a unified learning framework that explicitly accounts for both background suppression and feature refinement. To this end, we propose a Region-Aware and Enhanced Attention Convolutional Neural Network (REA-CNN), designed as an end-to-end architecture for real-world plant leaf disease classification. The core idea of REA-CNN is to suppress irrelevant background information while enhancing disease-specific visual patterns within the leaf region. This is achieved by integrating region-aware pre-processing prior to feature extraction and embedding an attention mechanism directly into the CNN backbone, without requiring pixel-level lesion annotations.
Unlike conventional pre-processing techniques that apply uniform transformations across the entire image, region-aware processing enables the model to focus on leaf-centric visual cues, reducing dataset bias and improving robustness to environmental variations. Furthermore, end-to-end training ensures uninterrupted gradient flow from the classification layer to early convolutional stages, allowing joint optimization of feature extraction, attention refinement, and classification. Consequently, the proposed REA-CNN provides a robust and deployable framework that addresses the limitations of conventional CNN-based and hybrid plant disease classification systems under real-world imaging conditions.
In recent years, numerous hybrid artificial intelligence models have been proposed for plant leaf disease detection, combining convolutional neural networks with complementary techniques such as attention mechanisms, optimization algorithms, transformers, and recurrent neural networks. These hybrid designs aim to enhance feature representation and improve classification accuracy. Most reported results are obtained on benchmark datasets captured under controlled conditions, characterized by uniform backgrounds and consistent illumination. Although these approaches demonstrate high accuracy on benchmark datasets, many hybrid frameworks process the full image and rely primarily on implicit mechanisms such as attention layers, transformer blocks, or feature optimization strategies to suppress irrelevant information. While effective under controlled conditions, these mechanisms do not explicitly address background clutter, illumination variability, or environmental noise. Consequently, their performance often degrades when applied to real-world field images containing complex backgrounds, shadows, partial occlusions, and unconstrained lighting conditions.
In addition, several hybrid models employ multi-stage learning pipelines that separate feature extraction, optimization, and classification into distinct components. Although such designs can improve classification accuracy in specific settings, they increase system complexity and limit end-to-end optimization. In these architectures, the classifier has limited influence on earlier feature extraction stages, restricting the model’s ability to adapt learned representations to task-specific and real-world variations. In contrast, the proposed Region-Aware and Enhanced Attention Convolutional Neural Network (REA-CNN) adopts a unified design that explicitly integrates region-aware background suppression with attention-enhanced feature learning within a single end-to-end framework. By isolating leaf regions and enhancing disease-relevant visual features before deep feature extraction, the proposed approach guides the learning process toward semantically meaningful information while reducing environmental interference.
Unlike recent hybrid models that primarily increase architectural depth or complexity, REA-CNN emphasizes input-level refinement combined with joint optimization of feature extraction, attention refinement, and classification. This design improves robustness, interpretability, and suitability for practical agricultural deployment. While a limited number of studies incorporate segmentation or detection, these components are typically applied as independent pre-processing steps and are not jointly optimized with attention mechanisms and classification. A detailed comparison with recent studies published in 2025 is summarized in Table 1.
Table 2 provides a comparative overview of recent state-of-the-art approaches for plant leaf disease detection across multiple crops, including tomato, potato, apple, pepper, and grape. As summarized, most existing studies rely on conventional convolutional neural networks, transfer-learning-based architectures, or classical machine-learning pipelines, with performance commonly reported using overall classification accuracy. Although some methods incorporate region-based processing, such as segmentation or object detection, or attention mechanisms, these components are typically applied independently rather than within a unified learning framework.
In contrast, the proposed REA-CNN integrates explicit region-aware pre-processing for background suppression with attention-enhanced feature learning in a single end-to-end architecture. This joint design enables more effective focus on disease-relevant visual patterns while reducing sensitivity to background clutter and illumination variability.
Overall, the comparison in Table 2 indicates that REA-CNN combines multiple complementary components within a unified framework, extending beyond incremental architectural modifications observed in prior approaches and supporting its practical relevance for plant leaf disease classification.
For image-based classification tasks, AI system architectures can be interpreted in terms of multiple functional stages, depending on task complexity and deployment constraints. In this work, we adopt a level-based design perspective that organizes such systems into up to four functional levels (Table 3). A basic two-level system typically consists of limited input preparation followed by direct feature extraction and classification using a convolutional neural network. While effective for clean and well-controlled datasets, such architectures often exhibit limited robustness under real-world conditions.
A three-level system extends this structure by introducing an intermediate refinement stage, such as attention mechanisms, feature optimization strategies, or transfer learning, to improve discriminative capability. Although this additional stage enhances feature representation, it may remain insufficient when images are affected by background clutter, illumination variability, or environmental noise.
For more challenging real-world scenarios, an advanced four-level system design can be employed. This configuration incorporates explicit input pre-processing and augmentation, deep feature extraction, intelligent feature refinement, and a final decision layer. By decomposing the learning process into distinct functional stages, each level addresses specific challenges such as background suppression, illumination variation, or feature ambiguity, thereby improving robustness, interpretability, and deployment suitability. This level-based perspective provides a practical framework for analysing existing approaches and for positioning the proposed REA-CNN within a unified and application-oriented system design, as summarized in Table 3 and Table 4.
Based on the literature survey, each level of a level-based AI system can be implemented using a range of techniques or their combinations, as summarized in Table 4. The input level commonly includes image pre-processing, normalization, and data augmentation to improve visual consistency and reduce data variability. The feature extraction level is typically realized using convolutional neural networks or transfer-learning architectures capable of learning hierarchical and discriminative representations.
The feature refinement level further enhances these representations by emphasizing informative features and suppressing redundancy, often through attention mechanisms, optimization algorithms, or transformer-based modules. The decision level produces the final classification output and is commonly implemented using fully connected layers with softmax activation or, in hybrid systems, classical machine-learning classifiers such as SVM or k-NN. Together, the techniques listed in Table 4 illustrate the modular and flexible nature of multi-level AI system design, enabling architectures to be adapted to task complexity, data characteristics, and deployment requirements.
Based on this categorization, the proposed REA-CNN corresponds to a four-level AI system: region-aware pre-processing and enhancement at Level-1, CNN-based deep feature extraction at Level-2, CBAM-based attention refinement at Level-3, and dense softmax classification at Level-4.

2. Proposed System

Conventional convolutional neural networks (CNNs) learn discriminative representations directly from raw input images. Given an RGB input image:
$I \in \mathbb{R}^{H \times W \times 3}$
a standard CNN relies on hierarchical feature learning to implicitly suppress irrelevant visual information. While this implicit suppression is effective under controlled imaging conditions, the underlying assumption is often violated in real agricultural environments, where leaf images exhibit complex backgrounds, illumination variability, shadows, occlusions, and viewpoint changes. Under such conditions, convolutional filters may respond strongly to background textures rather than disease-related patterns, resulting in degraded generalization. In our earlier work, this limitation was partially addressed using a hybrid CNN–CBAM–SVM framework, in which attention-enhanced deep features were classified using a separate SVM. Although this approach improved class separability, it exhibits two fundamental limitations:
  • The absence of explicit background suppression at the input level, and
  • The decoupling of feature learning and classification, which prevents end-to-end optimization and limits task-driven feature adaptation.
To overcome these limitations, we propose a Region-Aware and Enhanced Attention Convolutional Neural Network (REA-CNN), an end-to-end learning framework that integrates explicit region-aware pre-processing with attention-guided feature refinement. By suppressing background interference prior to convolutional feature extraction and embedding attention directly within the CNN backbone, the proposed REA-CNN learns disease-relevant representations that are more robust to real-world imaging variability. This unified design improves robustness and consistency compared to conventional CNNs and the previously proposed CNN–CBAM–SVM model, as summarized in Table 5.

2.1. REA-CNN: Architecture and Learning Framework

REA-CNN is an end-to-end deep learning framework that integrates region-aware processing and attention-enhanced feature learning within a unified CNN architecture. The primary objective of this design is to suppress irrelevant background information while emphasizing disease-specific visual patterns, enabling reliable classification under diverse and unconstrained imaging conditions.
Unlike conventional CNN pipelines that operate directly on full images, REA-CNN explicitly guides the learning process toward leaf-centric and disease-relevant regions. This design improves generalization beyond controlled benchmark datasets and enhances suitability for real agricultural deployment.
As illustrated in Figure 1, the REA-CNN architecture is implemented as a sequence of five processing stages: region-aware pre-processing, CNN-based feature extraction, attention-based feature refinement, feature aggregation, and classification. These stages represent functional processing blocks and are collectively mapped onto the four-level system design introduced earlier. Specifically, feature aggregation and classification are treated as part of a single decision level in the level-based framework, while being shown as separate stages here for architectural clarity. Formally, the proposed model can be expressed as a composite function $f(I; \Theta)$:
$f(I; \Theta) = f_{cls}\left(f_{gap}\left(f_{ea}\left(f_{cnn}\left(f_{ra}(I)\right)\right)\right)\right)$
where
  • $f_{ra}(\cdot)$ denotes region-aware pre-processing,
  • $f_{cnn}(\cdot)$ is the CNN backbone,
  • $f_{ea}(\cdot)$ represents the CBAM attention module,
  • $f_{gap}(\cdot)$ denotes global average pooling,
  • $f_{cls}(\cdot)$ is the classification head, and
  • $\Theta = \{\Theta_{ra}, \Theta_{cnn}, \Theta_{ea}, \Theta_{cls}\}$ is the set of all parameters learned by the network during training.
GAP and the dense classification layer are shown separately in the pipeline below; when they are grouped into a single decision level, the five stages reduce to the four-level design of Table 3:
I → RA → CNN Backbone → CBAM (EA) → GAP → Dense + Softmax

2.2. Stage I – Region-Aware Processing (RA)

The objective of the region-aware processing stage is to suppress irrelevant background information and emphasize the leaf region before deep feature extraction. By isolating the leaf and enhancing its visual characteristics, this stage reduces background interference and illumination variability, providing a refined input for the subsequent CNN backbone. Let the input RGB image be denoted as
$I \in \mathbb{R}^{H \times W \times 3}$
where $H$ and $W$ represent the image height and width, respectively. To isolate the leaf region, a binary foreground mask is defined as
$M \in \{0,1\}^{H \times W}$
The mask $M$ is estimated using mask-based or segmentation-driven background removal techniques. In this mask, $M(x,y) = 1$ indicates leaf pixels, while $M(x,y) = 0$ corresponds to background regions. The region-aware image is obtained through element-wise multiplication of the input image with the binary mask:
$I_{RA}(x,y,c) = I(x,y,c) \cdot M(x,y)$
where $(x,y)$ denotes the spatial pixel location and $c \in \{R, G, B\}$ represents the color channel. This operation preserves pixel values within the leaf region and suppresses background pixels across all channels, resulting in a leaf-centric image representation.
Although background suppression isolates the leaf region, variations in illumination, contrast, and color distribution may still affect visual consistency. To address these issues, classical image enhancement operations are applied to the region-aware image I R A .
A. Contrast normalization: First, contrast normalization is performed using min–max normalization:
$I' = \frac{I_{RA} - \min(I_{RA})}{\max(I_{RA}) - \min(I_{RA}) + \varepsilon}$
where the minimum and maximum are computed over all pixel intensities of $I_{RA}$, and $\varepsilon = 10^{-8}$ is a small constant added to ensure numerical stability. This operation scales pixel values to a normalized range and reduces the effect of illumination bias.
B. Color normalization and brightness/saturation adjustment: Next, color normalization and brightness–saturation adjustment are applied to improve chromatic consistency and highlight disease-relevant patterns:
$I_{ENH} = \mathcal{N}(I')$
where $\mathcal{N}$ denotes the combined color normalization and enhancement function. These operations improve visual clarity, enhance symptom visibility, and reduce intra-class variability caused by lighting conditions.
C. Output of Region-Aware Processing: The final output of this stage, $I_{ENH}$, constitutes a refined, leaf-centric, and visually enhanced image representation. As illustrated in Figure 2, this pipeline includes foreground–background separation, background suppression, and final image enhancement, producing an input that emphasizes disease-relevant visual features while minimizing environmental noise. This enhanced representation is then forwarded to the CNN backbone for robust feature extraction.
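To make Stage I concrete, the following is a minimal sketch of the masking and contrast-normalization steps, assuming OpenCV's GrabCut as the mask estimator (consistent with the GrabCut-based leaf extraction listed in Table 5). The centered rectangle prior, iteration count, and function name are illustrative assumptions rather than the exact implementation.

```python
import cv2
import numpy as np

def region_aware_preprocess(image_bgr, grabcut_iters=5, eps=1e-8):
    """Sketch of Stage I: estimate a binary leaf mask M with GrabCut,
    suppress the background (I_RA = I * M), then min-max normalize."""
    h, w = image_bgr.shape[:2]
    mask = np.zeros((h, w), np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    # Assumed prior: the leaf roughly occupies the central 90% of the frame.
    rect = (int(0.05 * w), int(0.05 * h), int(0.90 * w), int(0.90 * h))
    cv2.grabCut(image_bgr, mask, rect, bgd, fgd, grabcut_iters,
                cv2.GC_INIT_WITH_RECT)
    # M(x, y) = 1 for (probable) foreground pixels, 0 for background.
    M = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    I_ra = image_bgr.astype(np.float32) * M[:, :, None].astype(np.float32)
    # Contrast normalization: I' = (I_RA - min) / (max - min + eps).
    I_norm = (I_ra - I_ra.min()) / (I_ra.max() - I_ra.min() + eps)
    return I_norm, M
```

The subsequent color and brightness–saturation enhancement $\mathcal{N}(\cdot)$ can then be applied to the returned image before it is passed to the backbone.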

2.3. Stage II – CNN Backbone (Feature Extraction)

The enhanced leaf-centric image obtained from the region-aware processing stage, denoted as IENH, is forwarded to a pretrained convolutional neural network (CNN) backbone for hierarchical feature extraction. In this work, a ResNet50 architecture initialized with ImageNet pretrained weights is employed due to its strong representation capability and proven generalization performance.
The CNN backbone progressively transforms the input image into high-level feature representations through a sequence of convolution, normalization, nonlinearity, and pooling operations.
A. Convolutional Feature Extraction: At the $l$-th convolutional layer, the feature maps are computed as:
$F^{(l)} = W^{(l)} * F^{(l-1)} + b^{(l)}$
where $F^{(l-1)}$ and $F^{(l)}$ denote the input and output feature maps of layer $l$, respectively, $W^{(l)}$ represents the convolutional filter weights, $b^{(l)}$ is the bias term, and the convolution operation is denoted by "$*$". Through this process, local spatial patterns such as edges, textures, and disease-related visual cues are captured at increasing levels of abstraction.
B. Batch Normalization: To stabilize training and accelerate convergence, batch normalization is applied to the convolutional outputs:
$\hat{F}^{(l)} = \frac{F^{(l)} - \mu_B}{\sigma_B + \varepsilon}$
$F_{BN}^{(l)} = \gamma \hat{F}^{(l)} + \beta$
where $\mu_B$ and $\sigma_B$ denote the mean and the standard deviation computed over the mini-batch, $\varepsilon$ is a small constant added for numerical stability, and $\gamma$ and $\beta$ are learnable scaling and shifting parameters. This normalization reduces internal covariate shift and improves optimization stability.
C. Nonlinearity and Pooling: Nonlinear activation is introduced using the Rectified Linear Unit (ReLU):
$F_{ReLU}^{(l)} = \max\left(0, F_{BN}^{(l)}\right)$
which enables the network to model complex nonlinear relationships and mitigates vanishing gradients. To further enhance robustness and spatial invariance, spatial pooling is applied:
$F_{pool}(i,j) = \max_{(m,n) \in \Omega(i,j)} F_{ReLU}^{(l)}(m,n)$
where $\Omega(i,j)$ defines the local pooling neighbourhood. Pooling reduces spatial resolution while preserving dominant activations, allowing the network to focus on salient disease-related structures.
D. Output of CNN Backbone: After multiple stacked convolutional blocks, the CNN backbone produces a high-level feature tensor:
$F \in \mathbb{R}^{H' \times W' \times C}$
where $H'$ and $W'$ are the spatial dimensions of the extracted feature maps and $C$ denotes the number of feature channels. This tensor encodes rich semantic information about leaf structure and disease patterns and serves as the input to the subsequent attention-enhancement stage.
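As an illustration of how such a backbone can be instantiated, the sketch below truncates an ImageNet-pretrained ResNet50 from torchvision before its pooling and classification layers so that it emits the feature tensor $F$; the use of torchvision (rather than the authors' exact code) is an assumption. Note that PyTorch stores tensors channel-first, i.e., $(C, H', W')$.

```python
import torch
import torchvision

# ImageNet-pretrained ResNet50, truncated before global pooling and the
# classification head, so it outputs the feature tensor F (Stage II).
resnet = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V2)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool, fc

x = torch.randn(1, 3, 224, 224)   # stands in for the enhanced input I_ENH
F = backbone(x)
print(F.shape)                    # torch.Size([1, 2048, 7, 7]) -> C=2048, H'=W'=7
```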

2.4. Stage III – Enhanced Attention via CBAM (EA)

To further refine the feature representations extracted by the CNN backbone, the proposed REA-CNN employs a Convolutional Block Attention Module (CBAM). CBAM sequentially applies channel attention followed by spatial attention, allowing the network to emphasize informative feature channels and spatial locations while suppressing irrelevant or noisy responses. This attention-guided refinement enables more discriminative learning of disease-specific visual patterns. Let the high-level feature tensor be denoted by:
$F \in \mathbb{R}^{H' \times W' \times C}$
Channel Attention Mechanism
The channel attention module focuses on identifying which feature channels are more important for disease classification.
A. Channel Descriptor Generation: Global information is first aggregated from the feature tensor using both global average pooling, GAP(·), and global max pooling, GMP(·):
$z_{avg} = \mathrm{GAP}(F)$
$z_{max} = \mathrm{GMP}(F)$
where $z_{avg}, z_{max} \in \mathbb{R}^{C}$ summarize channel-wise statistics and capture complementary contextual information.
B. Channel Weight Estimation: The pooled descriptors are passed through a shared multi-layer perceptron (MLP), and the two responses are combined to compute the channel attention weights:
$s = \sigma\left(W_2\,\delta(W_1 z_{avg}) + W_2\,\delta(W_1 z_{max})\right)$
where $W_1$ and $W_2$ are learnable weight matrices, $\delta$ denotes the ReLU activation, and $\sigma$ is the sigmoid function. The resulting vector $s \in \mathbb{R}^{C}$ assigns an importance weight to each feature channel.
C. Channel-wise Feature Refinement: The original feature tensor is refined by channel-wise multiplication:
$F_{CA} = F \otimes s$
where $\otimes$ denotes element-wise multiplication with broadcasting across spatial dimensions. This operation amplifies disease-relevant channels while suppressing less informative ones.

2.5. Spatial Attention Mechanism

While channel attention emphasizes what to focus on, spatial attention identifies where important features are located within the image.
A. Spatial Descriptor Computation: Spatial descriptors are generated by applying average and max pooling across the channel dimension:
$F_{avg} = \mathrm{mean}_{c}(F_{CA})$
$F_{max} = \max_{c}(F_{CA})$
where both descriptors capture complementary spatial cues related to disease symptoms.
B. Spatial Attention Map Generation: The pooled feature maps are concatenated and passed through a convolutional layer to compute the spatial attention map:
$A_s = \sigma\left(\mathrm{Conv}_{7\times7}\left([F_{avg}; F_{max}]\right)\right)$
where $\mathrm{Conv}_{7\times7}$ denotes a convolution with a 7×7 kernel and $\sigma$ is the sigmoid activation. The resulting map $A_s \in \mathbb{R}^{H' \times W'}$ highlights spatial locations that are most relevant to disease patterns.
C. Spatial Feature Refinement: The final attention-enhanced feature tensor is obtained as:
$F_{EA} = F_{CA} \otimes A_s$
which selectively emphasizes disease-relevant regions while suppressing background or irrelevant spatial responses.
D. Output of Enhanced Attention Stage: The output $F_{EA}$ represents an attention-refined feature tensor that integrates both channel-wise and spatial importance. By explicitly modelling which features matter and where they occur, the CBAM module significantly improves feature discriminability and robustness, particularly for visually similar plant diseases. This refined representation is subsequently forwarded to the feature aggregation and classification stages.
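A minimal PyTorch sketch of this two-step refinement is given below: a shared MLP over the GAP/GMP descriptors for channel attention, followed by a 7×7 convolution over the channel-wise mean/max maps for spatial attention. The reduction ratio of 16 is the value commonly used in the CBAM literature and is an assumption here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of Stage III: channel attention then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP: W2 d(W1 z)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):                      # feat: (B, C, H', W')
        b, c, _, _ = feat.shape
        z_avg = feat.mean(dim=(2, 3))             # GAP -> (B, C)
        z_max = feat.amax(dim=(2, 3))             # GMP -> (B, C)
        s = torch.sigmoid(self.mlp(z_avg) + self.mlp(z_max))
        f_ca = feat * s.view(b, c, 1, 1)          # channel refinement F_CA
        f_avg = f_ca.mean(dim=1, keepdim=True)    # mean over channels
        f_max = f_ca.amax(dim=1, keepdim=True)    # max over channels
        a_s = torch.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))
        return f_ca * a_s                         # F_EA = F_CA (x) A_s
```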

2.6. Stage IV – Feature Aggregation

The attention-refined feature tensor produced by the CBAM module,
$F_{EA} \in \mathbb{R}^{H' \times W' \times C}$
still retains spatial dimensions. To convert these spatial features into a compact and discriminative representation suitable for classification, Global Average Pooling (GAP) is applied.
Global Average Pooling: For each channel $c$, GAP computes the average response over all spatial locations:
$v_c = \frac{1}{H' W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_{EA}(i,j,c)$
where $H'$ and $W'$ denote the height and width of the feature maps, respectively. By aggregating spatial information in this manner, GAP produces a compact feature vector:
$v \in \mathbb{R}^{C}$
Role of Feature Aggregation: Global Average Pooling acts as a structural regularizer, significantly reducing the number of trainable parameters and mitigating overfitting. Moreover, since GAP directly aggregates the attention-refined feature maps, it preserves the semantic consistency enforced by the preceding channel and spatial attention mechanisms. The resulting feature vector v therefore encodes global, disease-relevant information in a compact and robust form.

2.7. Stage V – Classification Head

The final classification stage maps the aggregated feature vector v to class scores corresponding to the plant disease categories.
Fully Connected Layer: A fully connected layer is used to compute the logits:
$o = W_{fc}\, v + b_{fc}$
where $W_{fc}$ and $b_{fc}$ denote the learnable weights and bias of the classification layer, respectively.
Softmax Classification: The logits are transformed into normalized class probabilities using the softmax function:
$P(y = k \mid I) = \frac{e^{o_k}}{\sum_{i=1}^{K} e^{o_i}}$
where $K$ is the number of disease classes and $o_k$ represents the logit corresponding to class $k$.
Output Interpretation: The predicted class corresponds to the category with the highest posterior probability. By combining attention-refined features, global aggregation, and end-to-end optimization, this classification head enables reliable and robust decision-making for plant leaf disease recognition under real-world imaging conditions.
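Putting Stages II–V together, a compact sketch of the full forward path is shown below, reusing the backbone and CBAM sketches above; the class count and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class REACNN(nn.Module):
    """Sketch of the REA-CNN forward path: backbone -> CBAM -> GAP -> dense."""
    def __init__(self, backbone, num_classes, channels=2048):
        super().__init__()
        self.backbone = backbone                    # truncated ResNet50 (Stage II)
        self.cbam = CBAM(channels)                  # attention module (Stage III)
        self.fc = nn.Linear(channels, num_classes)  # o = W_fc v + b_fc (Stage V)

    def forward(self, x):                           # x: enhanced input I_ENH
        f_ea = self.cbam(self.backbone(x))          # F_EA: (B, C, H', W')
        v = f_ea.mean(dim=(2, 3))                   # GAP (Stage IV): v in R^C
        return self.fc(v)                           # logits; softmax in the loss

model = REACNN(backbone, num_classes=10)            # 10 classes is illustrative
logits = model(torch.randn(2, 3, 224, 224))
probs = logits.softmax(dim=1)                       # P(y = k | I)
```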

2.8. End-to-End Learning Strategy

The proposed REA-CNN is trained in an end-to-end manner using a single objective function, allowing all network components from region-aware processing to final classification to be jointly optimized.
Loss Function: Let $K$ denote the number of disease classes. The model is optimized using the categorical cross-entropy loss, defined as:
$\mathcal{L} = -\sum_{k=1}^{K} y_k \log P(y = k \mid I)$
where $y_k \in \{0,1\}$ is the ground-truth (one-hot) label for class $k$, and $P(y = k \mid I)$ denotes the predicted probability for class $k$ obtained from the softmax output of the network. This loss encourages the model to assign high probability to the correct disease class while penalizing incorrect predictions.
Gradient Propagation and Joint Optimization: During training, gradients of the loss function, $\partial\mathcal{L}/\partial\Theta$, are propagated backward through the entire network:
Classification Head → GAP → CBAM → CNN Backbone → Region-Aware Input
where $\Theta$ denotes the set of all trainable parameters of the model. This uninterrupted gradient flow ensures that:
  • The classification objective directly influences feature learning,
  • The attention weights are dynamically adapted based on task relevance, and
  • The convolutional filters learn representations that are aligned with disease-specific visual patterns.
Significance of End-to-End Learning: Unlike multi-stage or hybrid pipelines that decouple feature extraction and classification, the proposed REA-CNN benefits from joint optimization across all stages. This unified learning strategy enables task-driven adaptation of region awareness, attention mechanisms, and deep feature representations simultaneously, resulting in improved robustness, faster convergence, and better generalization to real-world plant leaf images (a training sketch follows the list below). Key advantages:
  • Explicit background suppression reduces input entropy
  • Attention enhances semantic and spatial discriminability
  • GAP improves generalization and parameter efficiency
  • Fully end-to-end differentiability ensures robustness under domain shift
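A minimal sketch of this single-objective training loop, using the Adam learning rate reported in Section 3, is given below; the data loader and the model instance are assumed to come from the earlier sketches.

```python
import torch
import torch.nn as nn

# Assumed: `model` is the REA-CNN sketched above and `train_loader` yields
# mini-batches of region-aware, enhanced images with integer class labels.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()   # categorical cross-entropy over K classes

model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                 # gradients flow: head -> GAP -> CBAM -> backbone
    optimizer.step()
```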

2.9. Comparison with Previous Hybrid Models

Unlike CNN+CBAM+SVM architectures:
  • REA-CNN does not freeze feature extractors
  • The classifier influences feature learning directly
  • Attention mechanisms adapt dynamically during training
This results in improved robustness, generalization, and deployment simplicity.

3. Experimental Setup

Experiments were conducted on the PlantVillage dataset and its region-aware enhanced variant (PlantVillage_BG_ENH), comprising approximately 32,000 leaf images across multiple disease classes. No pixel-level lesion annotations were used; region awareness was achieved using classical unsupervised background suppression. The enhanced dataset (PlantVillage_BG_ENH) was generated by applying region-aware background suppression and image enhancement to the original PlantVillage images. Images were resized to 224×224 pixels and normalized using ImageNet pre-processing. The dataset was split into 70% training, 15% validation, and 15% testing using class-stratified sampling. The proposed REA-CNN employed a ResNet50 backbone pretrained on ImageNet, followed by a CBAM attention module and a global average pooling–based classification head. Training was performed end-to-end using the Adam optimizer with an initial learning rate of $10^{-4}$, categorical cross-entropy loss, and early stopping based on validation accuracy.
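The preprocessing and stratified split described above can be sketched as follows; the standard ImageNet mean/std values and the `labels` array are assumptions about details not stated in the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from torchvision import transforms

# Resize to 224x224 and apply standard ImageNet normalization.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Class-stratified 70/15/15 split; `labels` holds one class id per image
# (hypothetical array, e.g., derived from the dataset folder structure).
y = np.array(labels)
idx = np.arange(len(y))
train_idx, rest_idx = train_test_split(idx, test_size=0.30,
                                       stratify=y, random_state=42)
val_idx, test_idx = train_test_split(rest_idx, test_size=0.50,
                                     stratify=y[rest_idx], random_state=42)
```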

3.1. Evaluation Metrics

The performance of the proposed REA-CNN model was evaluated using Top-1 accuracy, Top-5 accuracy, and categorical cross-entropy loss. Given the balanced nature of the dataset, these metrics provide a reliable and comprehensive assessment of classification correctness, ranking quality, and prediction confidence.
Top-1 Accuracy: Top-1 accuracy measures the proportion of test samples for which the predicted class with the highest probability matches the ground-truth label. It is defined as:
$\mathrm{Accuracy}_{\mathrm{Top}\text{-}1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\arg\max_{k \in \{1,\dots,K\}} P(y = k \mid I_i) = y_i\right]$
where $N$ denotes the total number of samples, $y_i$ is the true class label of sample $i$, $I_i$ is the corresponding input image, $\arg\max$ returns the class that maximizes the predicted probability, and $\mathbb{1}[\cdot]$ is the indicator function. Top-1 accuracy directly reflects the model's classification precision under strict decision criteria. To analyze model behavior at different stages of training, three accuracy values are reported:
  • acc: training accuracy,
  • val_acc: validation accuracy,
  • test_acc: final test accuracy, representing the true generalization performance.
Top-5 Accuracy: Top-5 accuracy considers a prediction correct if the ground-truth class appears among the five most probable predicted classes:
$\mathrm{Accuracy}_{\mathrm{Top}\text{-}5} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[y_i \in \mathrm{Top5}\!\left(\{P(y = k \mid I_i)\}_{k=1}^{K}\right)\right]$
This metric is particularly relevant for plant disease classification, where visually similar symptoms can lead to ambiguity between multiple disease categories. A high Top-5 accuracy indicates that the model effectively captures the underlying disease structure even when fine-grained distinctions are challenging.
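Both metrics can be computed directly from the softmax outputs; the sketch below is a straightforward implementation of the two definitions above (the evaluation loader is assumed).

```python
import torch

@torch.no_grad()
def topk_accuracy(model, loader, ks=(1, 5)):
    """Top-1 / Top-5 accuracy; `loader` yields (image, label) batches."""
    correct = {k: 0 for k in ks}
    total = 0
    model.eval()
    for images, labels in loader:
        probs = model(images).softmax(dim=1)   # P(y = k | I_i)
        _, pred = probs.topk(max(ks), dim=1)   # top-5 class indices, ranked
        for k in ks:
            # indicator: 1[y_i in top-k predictions]
            correct[k] += (pred[:, :k] == labels[:, None]).any(dim=1).sum().item()
        total += labels.size(0)
    return {k: correct[k] / total for k in ks}
```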

3.2. Result Analysis

The confusion matrix shown in Figure 3 indicates strong overall classification performance of the proposed REA-CNN, with clear diagonal dominance across most classes, reflecting high true-positive rates. Classes such as Healthy, Target Spot, Yellow Leaf Curl Virus, and Spider Mite (Two-Spotted) exhibit particularly high correct classification rates, suggesting that region-aware background suppression and attention-enhanced feature learning effectively emphasize disease-relevant visual patterns. Misclassifications are primarily observed between visually similar disease categories, notably Early Blight and Late Blight, as well as between Bacterial Spot and Late Blight. These confusions are expected due to overlapping texture and color characteristics in leaf symptoms. Minor confusion is also observed between Septoria Leaf Spot and Leaf Mold, indicating residual inter-class similarity despite attention-based refinement. Overall, the confusion matrix analysis shows that the proposed REA-CNN reduces inter-class confusion across most disease categories while maintaining consistent performance, supporting its applicability to plant disease classification under complex background conditions.
The proposed REA-CNN model was evaluated against the Hybrid-1 (CNN+CBAM+SVM) architecture over 30 training epochs, as shown in Figure 4, where the x-axis represents the number of epochs and the y-axis denotes training accuracy (%). The REA-CNN exhibits faster convergence, achieving high training accuracy within the initial epochs, whereas the Hybrid-1 model demonstrates a more gradual learning trend. Throughout training, REA-CNN maintains consistently higher accuracy, approaching approximately 99% by the final epochs, compared to around 96–97% for Hybrid-1. This behaviour indicates that the integration of region-aware pre-processing and attention-enhanced learning facilitates more efficient feature learning during early training stages, contributing to improved convergence and overall training performance.
Training accuracy is reported to illustrate convergence behaviour, whereas validation accuracy is used to assess generalization performance. Figure 5 compares the validation accuracy of the proposed REA-CNN with the Hybrid-1 (CNN+CBAM+SVM) model. The REA-CNN achieves higher validation accuracy during the early training epochs, indicating faster stabilization on unseen data. Although the Hybrid-1 model gradually improves and approaches comparable validation accuracy in later epochs, it requires a larger number of training iterations to do so. In contrast, the REA-CNN validation curve remains consistently high with limited fluctuations, suggesting improved stability during training. Overall, the observed validation behaviour indicates that the proposed framework achieves more efficient convergence and maintains consistent generalization performance, supporting its applicability to plant disease classification under challenging imaging conditions.
Figure 6 illustrates the relationship between overall validation accuracy (Top-1) and Top-5 validation accuracy of the proposed REA-CNN model over 30 training epochs, where the x-axis represents epochs and the y-axis denotes accuracy (%). The Top-5 accuracy rapidly approaches nearly 100% within the initial epochs and remains saturated throughout training, indicating that the ground-truth class is consistently included among the model’s top predictions. In contrast, the overall validation accuracy exhibits a more gradual and stable improvement, converging to approximately 95–97% in later epochs. The observed gap between Top-5 and Top-1 accuracy reflects the presence of fine-grained visual similarities among certain disease classes, which can lead to occasional misclassification in the top prediction. Overall, the comparison between Top-1 and Top-5 validation accuracy indicates strong class separability and effective feature representation learning by the proposed REA-CNN, supporting its suitability for multi-class plant disease classification tasks.
To examine the behavior of the proposed REA-CNN in realistic conditions, we conducted a small-scale experiment using 12 tomato leaf images representing different disease types, including target spot, mosaic virus, and spider mite (Figure 7). These images were intentionally selected to reflect real-world variability, with noticeable differences in image resolution, leaf size and shape, background clutter, and lighting conditions. Unlike laboratory datasets, the test samples were not standardized or visually optimized, allowing the evaluation to focus on the model’s generalization capability under practical conditions. Model performance was evaluated using Top-1 accuracy, Top-5 accuracy, and average prediction confidence. The proposed REA-CNN correctly identified 9 out of 12 images at Top-1, while the remaining samples were correctly included within the Top-5 predictions, resulting in complete Top-5 coverage. In addition, the model maintained a consistently high confidence level, exceeding 70%, indicating stable and reliable predictions even in the presence of background noise and illumination variation.
For comparison (Figure 8), the existing Hybrid-1 (CNN+CBAM+SVM) framework showed limited robustness on the same test set, correctly identifying only one image with a very low confidence score of 1.25%. This comparison suggests that the explicit incorporation of region awareness and attention-based feature enhancement enables REA-CNN to better focus on disease-relevant visual patterns, leading to improved accuracy and confidence in real-world scenarios. Overall, the experiment demonstrates the suitability of the proposed approach for deployment beyond controlled laboratory environments.
| Metric | REA-CNN | Hybrid-1 |
|---|---|---|
| Top-1 correct (out of 12) | 9 (75%) | 1 (8.3%) |
| Top-5 correct (out of 12) | 12 (100%) | 5 (41.7%) |
| Average confidence (%) | >70 | 1.25 |

4. Conclusion

This paper presented REA-CNN, a region-aware and attention-enhanced convolutional neural network for plant leaf disease classification under real agricultural imaging conditions. Unlike conventional CNN-based and hybrid pipelines that rely primarily on implicit feature learning or multi-stage optimization, the proposed framework integrates explicit region-aware pre-processing and CBAM-based attention within a single end-to-end architecture. This design enables the model to focus on leaf-centric and disease-relevant representations while maintaining uninterrupted gradient flow during training. Experimental evaluation shows that REA-CNN exhibits improved convergence behaviour and consistent classification performance compared to conventional CNNs and CNN–CBAM–SVM hybrid systems. The confusion matrix and accuracy trends indicate reduced inter-class confusion, particularly for visually similar disease categories, while attention-based feature refinement contributes to improved discriminative learning. These observations suggest that combining input-level refinement with feature-level attention can be effective for improving robustness beyond controlled benchmark conditions.
Overall, the proposed REA-CNN provides a unified and practical framework for plant disease recognition that is better suited to real-world deployment scenarios. Future work will focus on extending the framework to larger and more diverse field datasets, exploring lightweight backbone architectures for edge deployment, and incorporating temporal and multispectral information to further enhance early disease detection in precision agriculture applications.

References

  1. Yağ, I.; Altan, A. Artificial Intelligence-Based Robust Hybrid Algorithm Design and Implementation for Real-Time Detection of Plant Diseases in Agricultural Environments. Biology (Basel) 2022, 11, 1732.
  2. Kamencay, P.; Benco, M.; Mizdos, T.; Radil, R. A New Method for Face Recognition Using Convolutional Neural Network. Advances in Electrical and Electronic Engineering 2017, 15.
  3. Bedi, P.; Gole, P. Plant disease detection using hybrid model based on convolutional autoencoder and convolutional neural network. Artificial Intelligence in Agriculture 2021, 5, 90–101.
  4. Hashmi, N.; Haroon, M. A Hybrid AI-Based Approach for Early and Accurate Rice Disease Detection. Fusion: Practice and Applications 2026, 21.
  5. Kabir, M.F.; Rahat, I.S.; Beverley, C.; Uddin, R.; Kant, S. TeaLeafNet-GWO: an intelligent CNN-Transformer hybrid framework for tea leaf disease detection using gray wolf optimization. Discover Artificial Intelligence 2025, 5, 377.
  6. Khalid, M.; Talukder, M.D.A. A Hybrid Deep Multistacking Integrated Model for Plant Disease Detection. IEEE Access 2025, 13, 116037–116053.
  7. Aboelenin, S.; Elbasheer, F.A.; Eltoukhy, M.M.; El-Hady, W.M.; Hosny, K.M. A hybrid framework for plant leaf disease detection and classification using convolutional neural networks and vision transformer. Complex & Intelligent Systems 2025, 11, 142.
  8. Huang, X.; Chen, A.; Zhou, G.; Zhang, X.; Wang, J.; Peng, N.; Yan, N.; Jiang, C. Tomato Leaf Disease Detection System Based on FC-SNDPN. Multimed Tools Appl 2023, 82, 2121–2144.
  9. Baser, P.; Saini, J.R.; Kotecha, K. TomConv: An Improved CNN Model for Diagnosis of Diseases in Tomato Plant Leaves. Procedia Computer Science 2023, 218, 1825–1833.
  10. Arafath, M.; Nithya, A.A.; Gijwani, S. Tomato Leaf Disease Detection Using Deep Convolution Neural Network. Procedia Computer Science 2023, 236–245.
  11. Indira, K.; Mallika, H. Classification of Plant Leaf Disease Using Deep Learning. Journal of The Institution of Engineers (India): Series B 2024, 105, 609–620.
  12. Roy, K.; et al. Detection of Tomato Leaf Diseases for Agro-Based Industries Using Novel PCA DeepNet. IEEE Access 2023, 11, 14983–15001.
  13. Zhang, L.; Zhou, G.; Lu, C.; Chen, A.; Wang, Y.; Li, L.; Cai, W. MMDGAN: A fusion data augmentation method for tomato-leaf disease identification. Applied Soft Computing 2022, 123, 108969.
  14. Sunil, C.K.; Jaidhar, C.D.; Patil, N. Tomato plant disease classification using Multilevel Feature Fusion with adaptive channel spatial and pixel attention mechanism. Expert Syst Appl 2023, 228, 120381.
  15. Arshad, F.; Mateen, M.; Hayat, S.; Wardah, M.; Al-Huda, Z.; Gu, Y.H.; Al-antari, M.A. PLDPNet: End-to-end hybrid deep learning framework for potato leaf disease prediction. Alexandria Engineering Journal 2023, 78, 406–418.
  16. Rashid, J.; Khan, I.; Ali, G.; Almotiri, S.H.; AlGhamdi, M.A.; Masood, K. Multi-Level Deep Learning Model for Potato Leaf Disease Recognition. Electronics (Basel) 2021, 10, 2064.
  17. Khobragade, P.; Shriwas, A.; Shinde, S.; Mane, A.; Padole, A. Potato Leaf Disease Detection Using CNN. International IEEE Conference on Smart Generation Computing, Communication and Networking (SMART GENCON) 2022, 1–5.
  18. Mahum, R.; Munir, H.; Mughal, Z.; Awais, M.; Khan, F.S.; Saqlain, M.; Mahmad, S.; Tlili, I. A novel framework for potato leaf disease detection using an efficient deep learning model. Human and Ecological Risk Assessment: An International Journal 2023, 29, 303–326.
  19. Goyal, B.; Pandey, A.K.; Kumar, R.; Gupta, M. Disease Detection in Potato Leaves Using an Efficient Deep Learning Model. International IEEE Conference on Data Science and Network Security (ICDSNS) 2023, 1–5.
  20. Das, P.K. Leaf Disease Classification in Bell Pepper Plant using VGGNet. Journal of Innovative Image Processing 2023, 5, 36–46.
  21. Kapoor, K.; Singh, S.; Singh, N.P.; Priyanka. Bell-Pepper Leaf Bacterial Spot Detection Using AlexNet and VGG-16; 2023; pp. 507–519.
  22. Mahesh, T.Y.; Mathew, M.P. Detection of Bacterial Spot Disease in Bell Pepper Plant Using YOLOv3. IETE J Res 2024, 70, 2583–2590.
  23. Begum, S.S.A.; Syed, H. GSAtt-CMNetV3: Pepper Leaf Disease Classification Using Osprey Optimization. IEEE Access 2024, 12, 32493–32506.
  24. Fatima, S.; Kaur, R.; Doegar, A.; Srinivasa, K.G. CNN Based Apple Leaf Disease Detection Using Pre-trained GoogleNet Model; 2023; pp. 575–586.
  25. Hasan, S.; Jahan, S.; Islam, M.I. Disease detection of apple leaf with combination of color segmentation and modified DWT. Journal of King Saud University - Computer and Information Sciences 2022, 34, 7212–7224.
  26. Rehman, Z.U.; Khan, M.A.; Ahmed, F.; Damasevicius, R.; Naqvi, S.R.; Nisat, W.; Javed, K. Recognizing apple leaf diseases using a novel parallel real-time processing framework based on MASK RCNN and transfer learning: An application for smart agriculture. IET Image Process 2021, 15, 2157–2168.
  27. Kaur, N.; Devendran, V. A novel framework for semi-automated system for grape leaf disease detection. Multimed Tools Appl 2023, 83, 50733–50755.
  28. Rao, U.S.; Swathi, R.; Sanjana, V.; Arpitha, L.; Chandrasekhar, K.; Chinmayi; Naik, P.K. Deep Learning Precision Farming: Grapes and Mango Leaf Disease Detection by Transfer Learning. Global Transitions Proceedings 2021, 2, 535–544.
  29. Altalak, M.; Uddin, M.A.; Alajmi, A.; Rizg, A. A Hybrid Approach for the Detection and Classification of Tomato Leaf Diseases. Applied Sciences 2022, 12, 8182.
Figure 1. REA-CNN system architecture.
Figure 2. Region-aware enhancement pipeline: foreground–background separation, background suppression, and final image enhancement.
Figure 3. Confusion matrix for the REA-CNN model.
Figure 4. Comparison of training accuracy of the proposed model vs. Hybrid-1.
Figure 5. Comparison of validation accuracy of the proposed model vs. Hybrid-1.
Figure 6. Comparison of overall validation accuracy vs. Top-5 accuracy of the proposed model.
Figure 7. Dataset of real images.
Figure 8. High accuracy and confidence observed for the proposed REA-CNN model.
Table 1. Comparison of REA-CNN with recent published articles.

| Aspect | [1] | [2] | [3] | [4] | [5] | [6] | [7] | REA-CNN |
|---|---|---|---|---|---|---|---|---|
| Publication venue | Biology (MDPI) | AEEE Journal | Elsevier ScienceDirect | Springer | Springer | IEEE Access | Springer | This work |
| Application domain | Plant diseases | Face recognition | Plant diseases | Rice diseases | Tea leaf diseases | Plant diseases | Plant leaf diseases | Plant diseases |
| Input type | Leaf images | Face images | Leaf images | Leaf images | Leaf images | Leaf image sequences | Leaf images | Leaf images |
| Background removal | No | No | No | Segmentation | Partial | No | No | Explicit |
| Region awareness | No | No | No | Yes | Partial | No | Partial (ViT attention) | Yes |
| Image enhancement | No | No | No | Yes | CLAHE | No | Indirect | Explicit |
| Feature extraction | Hand-crafted (2D-DWT) | PCA / LBP / CNN | CAE latent space | Hybrid fusion | CNN + Transformer | VGG16 CNN | CNN ensemble + ViT | Region-guided CNN |
| Temporal modeling | No | No | No | No | No | MBi-LSTM | No | No |
| Optimization strategy | Flower Pollination Algorithm | Manual | Implicit | African Vultures Optimization | Gray Wolf Optimization | None | None | Not required |
| Classifier / backbone | SVM + CNN | CNN / KNN | CNN | Depth-separable NN | CNN–Transformer | CNN + Bi-LSTM | CNN + Vision Transformer | CNN |
| End-to-end learning | Partial | Mixed | Semi | Semi | Semi | Semi | Semi | Full |
| Model complexity | High | Medium | Low | Moderate–High | Moderate–High | High | High | Moderate |
| Explainability | Weak | Weak | Weak | Moderate | Moderate | Moderate | Moderate | Strong |
| Real-time suitability | Possible | Limited | Yes | Limited | Moderate | Limited | Limited | Yes |
| Generalizability | Moderate | Low | Moderate | Moderate | Moderate | Moderate | Moderate | High |
| Robust to real-world images | No | No | Limited | Limited | Limited | Limited | Limited | High |
Table 2. Published hybrid AI systems and their relevance to the proposed REA-CNN.

| Title | Crop Studied | AI Model Used | Accuracy (%) | Relevance with respect to REA-CNN |
|---|---|---|---|---|
| Tomato Leaf Disease Detection System Based on FC-SNDPN [8] | Tomato | FC-SNDPN (FCN + Dual-Path + Switchable Norm) | 97.59 | Partial (segmentation, no explicit attention) |
| An Improved CNN Model for Diagnosis of Diseases in Tomato Crop [9] | Tomato | Hybrid CNN (VGG + Inception) | 99.17 | No |
| Tomato Leaf Disease Detection Using Deep CNN [10] | Tomato | Deep CNN (Batch Norm + Dropout) | 98.00 | No |
| Classification of Plant Leaf Disease Using Deep Learning [11] | Multi (incl. Tomato) | CNN / AlexNet / MobileNet | 84.24 / 91.19 / 97.33 | No |
| Detection of Tomato Leaf Diseases… PCA DeepNet [12] | Tomato | PCA DeepNet + GAN + Faster R-CNN | 99.60 | Partial (region detection, no attention) |
| MMDGAN: Fusion Augmentation + B-ARNet [13] | Tomato | GAN-based + B-ARNet | 97.12 | No |
| Tomato Plant Disease Classification… Adaptive Attention [14] | Tomato | Multilevel fusion + attention | ≈99.83 | No |
| PLDPNet: Hybrid CNN Framework [15] | Potato / Tomato | Hybrid CNN | 94.25 (Tomato) | No |
| Multi-Level Deep Learning Model [16] | Potato / Tomato | Multi-level CNN | 96.71 (Tomato) | No |
| Potato Leaf Disease Detection Using CNN [17] | Potato | CNN | 98.07 | No |
| Efficient DenseNet-Based Potato Leaf Disease Detection [18] | Potato | Efficient DenseNet-201 | 97.20 | No |
| Disease Detection in Potato Leaves Using DL + SVM [19] | Potato | Enhanced DL + SVM | 99.42 | No |
| Leaf Diseases Classification in Bell Pepper Using VGGNet [20] | Bell Pepper | VGG16 / VGG19 | 97 / 96 | No |
| Bell-Pepper Leaf Bacterial Spot Detection [21] | Bell Pepper | AlexNet / VGG-16 | 97.87 | No |
| Detection of Bacterial Spot Disease Using YOLOv3 [22] | Bell Pepper | YOLOv3 (detection) | 90 | Partial (region detection only) |
| Optimized GSAtt-CMNetV3 for Pepper Leaf Disease [23] | Pepper | GSAtt-CMNetV3 + Os-OA | 97.87 | Partial (attention, no RA pre-processing) |
| CNN Based Apple Leaf Disease Detection [24] | Apple | GoogleNet (pretrained) | 99.79 | No |
| Apple Leaf Disease Detection using Color + DWT + RF [25] | Apple | Color + DWT + Random Forest | 98.63 | No |
| Recognizing Apple Leaf Diseases using Mask R-CNN [26] | Apple | Mask R-CNN + Transfer Learning | 96.6 | Partial (segmentation, no attention) |
| Efficient Deep Learning Approach for Apple Leaf Diseases | Apple | CNN (transfer learning) | 97.82 | No |
| A Novel Framework for Grape Leaf Disease Detection [27] | Grape | Hybrid segmentation + ensemble classifier | 98.7 | Partial |
| Deep Learning Precision Farming: Grapes & Mango Leaf Diseases [28] | Grapes / Mango | AlexNet (pretrained) | 99 (Grapes) / 89 (Mango) | No |
Table 3. AI system development depth/complexity.

| AI System Type | Level-1: Input & Preparation | Level-2: Feature Extraction | Level-3: Feature Refinement / Intelligence | Level-4: Decision / Classification |
|---|---|---|---|---|
| 2-Level AI System | Preprocessing / normalization / augmentation | CNN / feature extractor | — | — |
| 3-Level AI System | Preprocessing / augmentation / normalization | CNN / transfer learning | Attention / optimization / feature selection | — |
| 4-Level AI System (advanced / hybrid) | Preprocessing + augmentation + normalization | CNN backbone | CBAM / attention / hybrid intelligence | Dense / softmax / SVM |
Table 4. AI technology involved at each level of the AI system.

| Level | Possible Techniques |
|---|---|
| Level-1 (Input & Preparation) | Pre-processing, normalization, data augmentation, CLAHE, background removal, segmentation |
| Level-2 (Feature Extraction) | CNN, ResNet, VGG, DenseNet, EfficientNet, autoencoder, transfer learning |
| Level-3 (Refinement / Intelligence) | CBAM, SE-block, attention, ViT, PCA, GWO, PSO, LSTM |
| Level-4 (Decision Layer) | Dense + softmax, SVM, KNN, Random Forest, XGBoost |
Table 5. Comparison of the REA-CNN system with CNN and the Hybrid-1 system.

| Aspect | CNN Model | Hybrid-1 [29] | Proposed REA-CNN |
|---|---|---|---|
| Pipeline | Input leaf image → image pre-processing (resize, normalization, augmentation) → CNN backbone (pretrained ResNet50) → global average pooling → fully connected layer → softmax → disease/healthy label | Input leaf image → image pre-processing (resize, normalization, augmentation) → CNN backbone (pretrained ResNet50) → CBAM (channel + spatial attention) → deep feature vector → SVM classifier → disease/healthy label | Input leaf image → region-aware processing (leaf-focused background suppression, GrabCut-based leaf extraction) → CNN backbone (pretrained ResNet50) → CBAM attention module → global average pooling → fully connected layer → softmax → disease/healthy label |
| Training | End-to-end training with a single loss function | Stage 1: CNN + CBAM trained using softmax loss; Stage 2: CNN frozen, SVM trained on extracted deep features ("extract strong features, then classify robustly") | Single-stage, end-to-end training with joint optimization of feature extraction and classification ("focus on the right regions, attend to disease patterns, and learn everything end-to-end") |
| Attention / region handling | No explicit attention; no region isolation | Attention focuses on disease regions; SVM improves margin-based generalization; not end-to-end | End-to-end; attention-enhanced; region-focused; robust to background and noise; simple deployment |
| Best for | Clean benchmark datasets (e.g., PlantVillage) | Small datasets, controlled environments | Real-world images and practical deployment |
| Weakness | Sensitive to background and dataset bias | More complex training, harder deployment | Slightly more complex than CNN-only |