Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

Nisreen Albzour; Sarah S. Lam

doi:10.20944/preprints202606.0693.v1

Submitted: 09 June 2026

Posted: 09 June 2026


Abstract

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to improve both the performance and the interpretability of automated cervical cancer screening. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved a cross-validation accuracy of approximately 95% (95.15% for the best replicated configuration), with random horizontal flipping and class weighting (0.7 × 1.3 multipliers) identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, which align with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, combining competitive classification performance with the attention-based transparency relevant to medical AI. Further validation on larger, multi-center datasets remains necessary before clinical deployment.

Keywords: vision transformer; cervical cancer; pap smear; medical image classification; interpretability; Grad-CAM

Subject: Medicine and Pharmacology - Oncology and Oncogenics

1. Introduction

Cervical cancer remains a significant global health challenge, with over 600,000 new cases and 340,000 deaths reported annually worldwide [1]. The primary strategy for reducing mortality is early detection through Pap smear screening, which enables identification of precancerous lesions before they progress to invasive cancer. However, manual analysis of Pap smear slides is inherently time-consuming, labor-intensive, and subject to substantial inter-observer variability, with diagnostic accuracy varying significantly among cytotechnologists [2,3]. These limitations highlight the critical need for automated, reliable, and interpretable diagnostic solutions that can augment human expertise while maintaining the accuracy essential for clinical decision-making.

Deep learning approaches, particularly Convolutional Neural Networks (CNNs), have revolutionized medical image analysis by providing automated feature extraction and achieving remarkable diagnostic accuracy across various medical imaging modalities [4,5]. CNNs have demonstrated substantial success in cervical cancer detection, surpassing traditional machine learning methods through their ability to learn hierarchical feature representations directly from image data [6,7]. However, CNN architectures face inherent limitations in medical imaging applications. Their reliance on local receptive fields and hierarchical feature extraction can miss important long-range spatial relationships that may be crucial for accurate diagnosis of cellular abnormalities [8,9]. Additionally, the limited interpretability of CNN decision-making processes poses significant barriers to clinical adoption, where understanding the rationale behind diagnostic predictions is essential for building trust and enabling clinical validation [10,11].

Related automation-oriented work in healthcare has also explored hyperautomation-based leukemia detection and classification [12], supporting the broader relevance of AI-driven clinical decision-support workflows beyond a single disease domain. In a closely related study that applied rigorous evaluation protocols to structured clinical data, the authors of [13] evaluated hybrid deep learning–gradient boosting ensembles for breast cancer survival prediction using SEER and METABRIC cohorts. They found that, under nested cross-validation with Nadeau–Bengio corrected statistical testing, hybrid ensembles did not outperform a class-weighted logistic regression baseline, and that cross-cohort degradation was concentrated in calibration rather than discrimination. This underscores that methodological rigor in evaluation is as consequential as architectural choice across AI-driven clinical applications.

Vision Transformers (ViTs) have emerged as a promising alternative architecture that addresses many limitations of traditional CNNs. By adapting the transformer architecture from natural language processing to computer vision, ViTs [14] leverage self-attention mechanisms to capture global contextual relationships across entire images, enabling more comprehensive analysis of spatial patterns [15]. This global perspective is particularly valuable for medical image analysis, where diagnostic features may be distributed across different regions of an image and require integration of multiple cellular characteristics [16]. Furthermore, the attention mechanisms inherent in ViTs provide natural interpretability through attention maps, which offer insights into which image regions influence diagnostic decisions, a crucial requirement for clinical acceptance and regulatory approval [17]. Recent studies have further validated the effectiveness of Vision Transformers in medical imaging applications, demonstrating improved performance in modeling both global and local context, enhanced computational efficiency, and robustness to limited labeled data [18,19].

Despite the promising potential of ViTs in medical imaging, their application to cervical cancer screening remains underexplored. Existing studies have not systematically examined optimization requirements for ViTs in this domain, nor have they provided the rigorous statistical validation necessary for clinical adoption. Critical gaps remain in the evaluation of data augmentation strategies, class imbalance handling techniques, and hyperparameter optimization tailored for cervical cell classification. Furthermore, the interpretability advantages of ViTs have not been fully investigated in the context of cytopathological analysis, where alignment with established diagnostic criteria is essential.

This study addresses these gaps by presenting a comprehensive evaluation of ViT architectures for automated cervical cancer screening using Pap smear images. The contributions of this study are fourfold: (1) systematic optimization of augmentation strategies and class weighting approaches to address class imbalance while preserving biological validity; (2) rigorous statistical validation through repeated experiments and pairwise comparisons to identify statistically equivalent high-performing configurations; (3) enhanced Gradient-weighted Class Activation Mapping (Grad-CAM) interpretability aligned with cytopathological diagnostic criteria; and (4) demonstration that ViTs can achieve clinically relevant performance while providing the transparency and reliability essential for medical AI applications.

The remainder of this paper is organized as follows: Section 2 reviews prior work on traditional machine learning, deep learning, and emerging transformer-based methods for cervical cancer detection. Section 3 provides a detailed description of the methodology, which includes dataset preparation, model architecture, optimization strategies, and interpretability frameworks. Section 4 presents experimental results and discussion, which include statistical analyses and interpretability findings. Section 5 concludes with a summary of contributions and directions for future research.

2. Related Literature

The application of artificial intelligence techniques in cervical cancer detection and classification has advanced significantly over the past decade, leading to increasingly accurate and automated diagnostic methods. As illustrated in Figure 1, these techniques can be broadly categorized into machine learning and deep learning approaches, each of which encompasses a range of methodologies tailored to improve diagnostic performance.

2.1. Machine Learning Approaches

Traditional machine learning approaches for cervical cancer detection rely heavily on handcrafted feature extraction followed by classical classification algorithms. This subsection reviews studies that employ various feature-engineering and machine-learning techniques. Support Vector Machine (SVM) and statistical learning approaches have been extensively utilized. Several studies [20,21] developed approaches that include hybrid linear iterative clustering with Bayes classification, k-Nearest Neighbors (k-NN) with fuzzy logic, and machine learning-assisted detection frameworks. Mariarputham and Stephen [22] used handcrafted texture features with SVM and neural networks on the Herlev dataset and found that SVM performed best. Optimization-based methods [23] explored quantum hybrid Particle Swarm Optimization (PSO) with fuzzy k-NN for feature selection and cell classification. Cross-validation and prediction-based machine-learning techniques [24,25] investigated stratified k-fold cross-validation, early detection frameworks, and prediction using various machine learning methods. Additional ML approaches addressed Pap smear image classification and early prediction [26,27]. Nandanwar and Dhonde [28] proposed a hybrid stacked ensemble model that combines feature extraction with segmentation, followed by classifier-level fusion using Histogram and Hu Moments techniques.

Traditional machine learning approaches in cervical cancer diagnosis often lack integrated data augmentation strategies, which limits their ability to generalize across diverse image variations. Furthermore, these systems are prone to high false negative rates, which can result in missed diagnoses and pose significant risks in clinical settings. These limitations highlight the need for more robust, automated, and generalizable solutions in medical diagnostics.

2.2. Deep Learning Approaches

Deep learning approaches have demonstrated remarkable success in medical image analysis, particularly for cervical cancer detection and classification. The following review encompasses studies that employ various deep neural network architectures for cervical cancer classification.

CNNs are widely adopted deep learning models known for their ability to extract spatial and hierarchical features from input data, which makes them suitable for a variety of pattern recognition tasks. CNN architectures have been extensively employed for cervical cancer detection across multiple imaging modalities. Several studies [29,30] employed various CNN variants that included Mask Region-based Convolutional Neural Networks (R-CNN) with Visual Geometry Group (VGG) and Residual Network (ResNet) components, Capsule Networks (CapsNet) preprocessing with VGG-like networks, specialized CNN architectures using deformable Faster Region-based Convolutional Neural Network with Feature Pyramid Networks (R-CNN-FPN) with pretrained backbones for cervical image analysis, and improved Inception-ResNet-V2. Transfer learning approaches [31] implemented and optimized Squeeze-and-Excitation Residual Network-152 (SE-ResNet152) models for multi-class cervical cancer detection. Recent CNN-based work [32,33] explored privacy-preserved detection, lightweight CNN architectures, ResNet-based automated screening, and Attention-Fused Squeeze-and-Excitation Network (AF-SENet) using pre-trained models for feature fusion. Additional CNN approaches [34,35] investigated various CNN variants with different backbone networks for enhanced performance.

In a closely related cervical cytology study, Albzour and Lam investigated deep-learning-based segmentation and classification of Pap smear images for cervical cancer detection [36], which provides a relevant foundation for the present transformer-based extension.

U-Net is a deep learning architecture originally developed for biomedical applications and is recognized for its encoder-decoder structure and ability to capture both high-level context and fine-grained details. U-Net variants and advanced architectures have shown exceptional performance in cervical cancer applications. Enhanced U-Net architectures [37,38] incorporated residual SE blocks, feature fusion from multiple pretrained models, and encoder-weighted designs for improved analysis.

Ensemble learning refers to the technique of combining predictions from multiple models to achieve better performance and generalization than any single model alone. Ensemble and hybrid deep learning approaches have emerged to improve robustness and diagnostic accuracy in cervical cancer classification. Several studies have proposed hybrid frameworks that integrate deep feature fusion, pretrained networks, and classifier-level combinations to enhance performance [39,40]. In addition, radiomics-guided deep learning approaches have combined handcrafted feature descriptors with neural network representations to further strengthen feature expressiveness and diagnostic robustness [41].

Following advancements in CNNs, U-Nets, and ensemble learning, transformer-based architectures have recently attracted significant attention due to their capability to capture global contextual relationships [14,42]. Recent studies have further advanced Vision Transformer frameworks through systematic optimization strategies and attention-enhanced architectures to improve classification accuracy, computational efficiency, and contextual feature learning in medical image analysis [43,44]. Vision Transformers have demonstrated strong potential in learning long-range dependencies in medical images, though their performance often benefits from pretraining and large datasets. In a companion benchmarking study, the present authors compared a compact Vision Transformer (ViT-Tiny) against four widely used CNN baselines on the Herlev dataset under a uniform training recipe, finding that the transformer matched the strongest CNNs in accuracy while offering substantial parameter and memory efficiency advantages [45].

Despite their successes, deep learning approaches face several limitations that impact their clinical applicability. Common challenges include class imbalance between normal and abnormal cells, and lack of interpretability tools such as saliency maps or attention visualizations. These issues can reduce model robustness and make predictions less transparent and more difficult to validate for clinical use. This study specifically addresses two critical limitations: class imbalance through systematic optimization of class weighting approaches, and lack of interpretability through comprehensive Grad-CAM visualization that aligns model attention with established cytopathological criteria. Addressing these limitations is essential to improve trust, generalizability, and the practical deployment of AI models in cervical cancer diagnosis.

3. Methodology

Figure 2 provides an overview of the methodology employed in this study. Each stage contributes to the effectiveness and robustness of the classification framework, reflecting the systematic approach adopted to tackle the challenges in cervical cancer diagnosis.

3.1. Dataset Preparation

The Herlev Pap Smear dataset, which comprises 917 images, was used and regrouped from seven cytological classes into a binary screening task (normal: 242; abnormal: 675), consistent with clinical triage. This publicly available dataset, introduced by Jantzen et al. [46], is a standard benchmark for cervical cytology classification. The binary reformulation reflects the primary objective of population-level screening, in which the clinically decisive step is separating cases that require cytopathologist review (any abnormal grade) from those that do not (normal), rather than assigning a precise dysplasia grade; the seven-class grading task is therefore treated as a complementary objective and is discussed as a direction for future work in Section 5. Images were inspected for readability and converted to a unified input size compatible with ViT (RGB, standardized resolution). Pixel intensities were normalized using ImageNet statistics to leverage pretrained foundations. Because the Herlev dataset does not provide patient-level subject identifiers, stratified k-fold cross-validation was performed at the image level, with class proportions preserved across folds. The absence of subject-level separation is acknowledged as a limitation in Section 5. All preprocessing and augmentations were applied on the fly within the training pipeline to avoid data leakage.
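The image-level stratified split described above can be illustrated with a minimal sketch. It assumes a simple round-robin assignment per class; the function name `stratified_folds` and the seed are hypothetical, and the split implementation actually used in the study is not stated.

```python
import random
from collections import Counter

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal indices round-robin so per-class fold sizes differ by at most one.
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of

# Class counts of the binary Herlev task as stated in the text.
labels = ["normal"] * 242 + ["abnormal"] * 675
fold_of = stratified_folds(labels, k=5)

per_fold = [
    Counter(y for y, f in zip(labels, fold_of) if f == fold) for fold in range(5)
]
for fold, counts in enumerate(per_fold):
    share = counts["normal"] / sum(counts.values())
    print(fold, dict(counts), f"normal share = {share:.3f}")
```

Each fold receives 48 or 49 normal images and 135 abnormal images, so the class proportions of the full dataset are preserved to within one image per class.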

3.2. Data Augmentation

Augmentations were designed to increase data diversity while preserving cytological validity (cell/nuclear morphology, chromatin patterns). The augmentation policy included the following transformations:

Geometric: horizontal flip, small rotations, translations, and mild scaling to simulate slide handling variability.
Photometric: limited brightness/contrast jitter and color perturbations to reflect staining variability.
Regularization: light Gaussian blur/noise where appropriate.

Each transformation used clinically conservative ranges and per-operation probabilities to ensure that augmented images remained biologically plausible. All augmentations were applied on the fly during training (virtual augmentation): each image was transformed stochastically at load time rather than expanded into a fixed enlarged dataset on disk, so the nominal dataset size of 917 images was unchanged while the effective diversity seen across epochs increased. Augmentations were applied only to training data.
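The on-the-fly behavior can be sketched with a toy dataset wrapper. This is an illustration only: the class `OnTheFlyDataset`, the flip probability of 0.5, and the nested-list image representation are assumptions, not the study's pipeline.

```python
import random

def hflip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

class OnTheFlyDataset:
    """Toy dataset that augments at load time instead of storing extra copies."""

    def __init__(self, images, p_flip=0.5, train=True, seed=0):
        self.images = images
        self.p_flip = p_flip  # assumed probability, not taken from the paper
        self.train = train
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.images)  # nominal dataset size never changes

    def __getitem__(self, i):
        image = self.images[i]
        # Augment stochastically, and only when serving training data.
        if self.train and self.rng.random() < self.p_flip:
            image = hflip(image)
        return image

images = [[[1, 2, 3], [4, 5, 6]]]  # one tiny 2x3 single-channel "image"
train_ds = OnTheFlyDataset(images, train=True)
val_ds = OnTheFlyDataset(images, train=False)

# Repeated loads of the same training image yield different views across epochs.
views = {str(train_ds[0]) for _ in range(50)}
```

The stored image list is never modified, which is why the dataset length stays fixed while the set of views seen during training grows.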

3.3. Class Weighting

Given the class imbalance (normal < abnormal), class-weighted cross-entropy was employed to mitigate bias toward the majority class. For class $c$, the class weight $W_{c}$ was defined as:

$$W_{c} = \frac{N}{C \times N_{c}}$$

where $N$ is the total number of samples, $C$ is the total number of classes, and $N_{c}$ is the number of samples in class $c$. For the binary case ($C = 2$), this formulation assigns a larger weight to the minority (normal) class, thereby penalizing its misclassification more heavily.
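As a worked example of the weight formula, the sketch below evaluates it for the Herlev class counts given in Section 3.1. The function name `class_weights` is hypothetical, and the multiplier configurations evaluated in Section 4.3 are not modeled here.

```python
def class_weights(counts):
    """Return W_c = N / (C * N_c) for each class in a {name: count} mapping."""
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {c: n_total / (n_classes * n_c) for c, n_c in counts.items()}

counts = {"normal": 242, "abnormal": 675}
weights = class_weights(counts)
for name, w in weights.items():
    print(f"{name}: {w:.4f}")
# normal: 1.8946
# abnormal: 0.6793
```

With these weights, each class contributes the same total weight ($N_{c} \times W_{c} = N / C$), which is the sense in which the formula balances the two classes.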

3.4. ViT Model Training

The primary backbone was ViT-Tiny (vit_tiny_patch16_224; 12 transformer blocks, embedding dimension 192, patch size 16, approximately 5.5 million parameters), a compact Vision Transformer adapted for binary classification by replacing the classification head with a two-logit linear layer. Figure 3 illustrates the overall architecture, showing the patch-embedding pipeline and the internal structure of a transformer encoder block. The encoder was initialized with ImageNet pretrained weights to enhance data efficiency, and the input resolution was 224×224 pixels (RGB). Training used a modern optimization recipe: AdamW with weight decay, label smoothing, and a learning-rate schedule (warm-up followed by cosine/step decay). Early stopping and checkpointing were triggered by validation loss to prevent overfitting. A full grid search was performed over the three hyperparameters that most affect convergence on this dataset: batch size {16, 32, 64}, learning rate {1×10⁻⁴, 5×10⁻⁴, 1×10⁻³}, and number of epochs {5, 10, 15}, yielding 27 configurations (Section 4.4). The primary metric used to select the optimal configuration was the mean cross-validation F1-score, with accuracy used as a secondary criterion (Section 3.6). The same training recipe underpins the controlled cross-architecture benchmark reported in our companion study [45].
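The token geometry implied by these settings can be checked with a few lines of arithmetic. The sketch assumes the standard ViT patch embedding, that is, a linear projection with bias applied to each flattened RGB patch; the variable names are illustrative.

```python
image_size, patch_size, embed_dim = 224, 16, 192

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
sequence_length = num_patches + 1  # patch tokens plus the [CLS] token
values_per_patch = patch_size * patch_size * 3  # flattened RGB pixels per patch
# Assumed standard patch embedding: one linear map plus bias.
patch_embed_params = values_per_patch * embed_dim + embed_dim

print(patches_per_side, num_patches, sequence_length, patch_embed_params)
# 14 196 197 147648
```

The 14×14 grid of 196 patch tokens is the spatial layout that gradient-based localization methods recover when token sequences are reshaped back into a map.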

3.5. Grad-CAM Application

To enhance model interpretability, Grad-CAM was applied to ViT predictions using a transformer-compatible implementation that computes gradients with respect to the final attention or feature maps. Because Grad-CAM was originally formulated for the spatial feature maps of convolutional networks, it was adapted to the transformer as follows. The sequence of patch tokens output by the final encoder block (excluding the [CLS] token) was reshaped back into its 14×14 spatial grid, which serves as the target activation map in place of a convolutional feature map. Gradients of the target-class logit were computed with respect to these reshaped token activations and averaged over the spatial positions to obtain one importance weight per channel. The weights were then combined with the activations and passed through a ReLU to yield the class-discriminative localization map, which was upsampled to the input resolution. This reshape-and-attribute procedure is the standard means of applying gradient-based localization to ViTs. Heatmaps were generated for validation images to visualize regions that contribute to the classification decision. Visualization parameters (smoothing, normalization, overlay opacity) were maintained consistently across all runs. This procedure enables qualitative assessment of whether the model’s attention corresponds to clinically relevant morphological regions such as nuclei, irregular boundaries, or chromatin distribution. The resulting observations are analyzed in Section 4.
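The reshape-and-attribute steps can be sketched in plain Python on precomputed activations and gradients. This is a minimal illustration under stated assumptions: the function `vit_grad_cam` is hypothetical, the activations and gradients would in practice come from a framework's autograd hooks, the rescaling to [0, 1] is an added convenience, and the final upsampling to input resolution is omitted.

```python
def vit_grad_cam(tokens, grads, grid):
    """Grad-CAM over ViT patch tokens.

    tokens, grads: lists of length grid*grid ([CLS] already removed); each
    entry holds D channel values (activations, and gradients of the
    target-class logit with respect to those activations).
    Returns a grid x grid map rescaled to [0, 1].
    """
    n, dim = len(tokens), len(tokens[0])
    assert n == grid * grid and len(grads) == n
    # One importance weight per channel: gradient averaged over all positions.
    alpha = [sum(g[d] for g in grads) / n for d in range(dim)]
    # Weighted sum over channels at each token position, followed by ReLU.
    cam = [max(0.0, sum(a * t[d] for d, a in enumerate(alpha))) for t in tokens]
    peak = max(cam)
    if peak > 0:
        cam = [v / peak for v in cam]
    # Reshape the flat token sequence into its row-major spatial grid.
    return [cam[r * grid:(r + 1) * grid] for r in range(grid)]

# Toy 2x2 grid with 2 channels; a ViT-Tiny map would be 14x14 with 192 channels.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
grads = [[0.5, -0.5]] * 4
heatmap = vit_grad_cam(tokens, grads, grid=2)
print(heatmap)  # [[0.5, 0.0], [1.0, 0.0]]
```

In the toy example, the first channel has a positive weight and the second a negative one, so only positions where the first channel is active survive the ReLU.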

3.6. Evaluation Protocol

For completeness, performance was assessed with accuracy, precision, recall, and F1-score on the held-out folds. Because false negatives and false positives carry different clinical consequences in screening, sensitivity (recall on the abnormal class) and specificity (recall on the normal class) were reported separately, and discrimination was additionally summarized using the area under the receiver operating characteristic curve (ROC-AUC). These metrics, together with the cross-architecture comparison in Section 4.6, follow the evaluation protocol of the companion benchmark study [45]. The use of multiple complementary metrics together with feature-based model assessment is consistent with prior health-analytics work on clinical outcome prediction, where careful metric selection improved both performance and interpretability [47]. The metrics are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \cdot (\text{Precision} \cdot \text{Recall})}{\text{Precision} + \text{Recall}}$$

where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively. To enable consistent comparison across configurations, fold-wise means and 95% confidence intervals (CI) were computed as follows:

$$CI = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the mean of fold-wise results, $s$ is the standard deviation, $n$ is the number of folds, and $z = 1.96$ (for a 95% confidence interval). Interpretability outputs were summarized with representative heatmaps; quantitative alignment with expert-defined regions can be added when annotations are available.
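The metric and confidence-interval definitions can be made concrete with a short sketch. The confusion counts and fold-wise accuracies below are hypothetical values chosen for illustration, not results from this study. The abnormal class is treated as positive, and the sample standard deviation is used for $s$, which is an assumption because the text does not state which estimator was applied.

```python
import math
import statistics

def screening_metrics(tp, tn, fp, fn):
    """Metrics from confusion counts, with 'abnormal' as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,  # sensitivity
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def normal_ci(values, z=1.96):
    """Mean and z-based confidence interval over fold-wise results."""
    mean = statistics.mean(values)
    half = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half, mean + half

# Hypothetical counts for one held-out fold (not taken from the paper).
m = screening_metrics(tp=130, tn=44, fp=4, fn=5)
# Hypothetical fold-wise accuracies in percent (not taken from the paper).
mean, low, high = normal_ci([94.0, 95.0, 96.0, 95.0, 95.0])
print({k: round(v, 4) for k, v in m.items()})
print(f"{mean:.2f} [{low:.2f}, {high:.2f}]")
```

Reporting specificity alongside recall keeps the two error types visible separately, which accuracy and F1-score alone do not.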

4. Results and Discussion

This section reports experimental outcomes following the same five-stage workflow described in Section 3 and Figure 2: (1) Dataset Preparation, (2) Data Augmentation, (3) Class Weighting, (4) ViT Model Training, and (5) Grad-CAM Interpretability. Organizing results in parallel with methodology enables direct tracing from design choices to performance.

4.1. Dataset Preparation Results

The Herlev dataset was successfully preprocessed into 917 images with binary classification (242 normal, 675 abnormal). All images were standardized to RGB format with consistent resolution. ImageNet normalization statistics were applied (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). Stratified 5-fold cross-validation maintained the class distribution implied by these counts (approximately 26.4% normal and 73.6% abnormal) across all folds.
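The normalization step can be illustrated per pixel. The sketch assumes that 8-bit pixel values are first scaled to [0, 1] before the ImageNet mean and standard deviation are applied, which is the usual convention but is not stated explicitly in the text; `normalize_pixel` is a hypothetical helper.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )

white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print([round(v, 4) for v in white])
print([round(v, 4) for v in black])
```

Using the same statistics as the pretraining data keeps the input distribution close to what the pretrained encoder expects.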

4.2. Data Augmentation Results

Table 1 presents the performance comparison of seven augmentation strategies (three single and four combined augmentation strategies) evaluated using 5-fold cross-validation with optimal class weights.

Single augmentations showed varying effectiveness. Color Jitter performed worst, with an accuracy of 89.87% and a precision of 80.40%, which suggests that color variations obscured key diagnostic cues such as chromatin texture. Random Affine resulted in an accuracy of 91.39% but a reduced recall (83.40%), likely due to geometric distortions that affected nuclear shape. In contrast, Horizontal Flip achieved strong overall performance (94.77% accuracy and 91.30% recall), which indicates that left–right invariance is a beneficial augmentation for cervical cytology images without distorting diagnostically relevant structures.

Combined augmentations showed limited synergy. Color Jitter + Horizontal Flip (94.33%) was slightly below horizontal flip alone, while Color Jitter + Random Affine (93.23%) achieved the highest recall (95.00%) but lower precision (84.10%), which increased false positives. Horizontal Flip + Random Affine (92.26%) and All Three Combined (94.22%) offered no significant gains over simpler (single) augmentations.

Overall, augmentation effectiveness depends on biological plausibility rather than variety. Transformations that preserve diagnostic features (horizontal flip) outperformed those that introduce artificial distortions, which indicates that simple, clinically valid augmentations yield the most reliable results.

4.3. Class Weighting Results

Table 2 summarizes the systematic evaluation of five weight multiplier configurations to address class imbalance.

Varying the abnormal-to-normal weight ratio produced distinct performance trade-offs. Case 4 (0.7 × 1.3 multipliers) achieved the best overall performance: it balanced recall and precision to yield the highest F1-score (91.90%), together with 95.64% accuracy and 93.40% recall, and thus prioritized abnormal-cell detection without excessive misclassification of normal samples. These results indicate that moderate weighting enhances screening reliability, which is critical for medical applications where missing abnormal cases carries higher clinical risk.

4.4. ViT Model Training Results

A comprehensive hyperparameter optimization study evaluated 27 combinations of batch sizes (16, 32, 64), learning rates (0.0001, 0.0005, 0.001), and numbers of epochs (5, 10, 15). All hyperparameter configurations were evaluated using stratified 5-fold cross-validation, and Table 3 reports the average performance across folds for all 27 configurations, with the top-performing configurations highlighted in bold.
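The search space and the selection rule from Section 3.4 (mean cross-validation F1-score first, accuracy second) can be sketched as follows. The dictionary keys and the `select_best` helper are hypothetical, and the two result rows are made-up numbers used only to exercise the tie-break.

```python
from itertools import product

batch_sizes = [16, 32, 64]
learning_rates = [1e-4, 5e-4, 1e-3]
epoch_counts = [5, 10, 15]

# Full Cartesian grid: 3 x 3 x 3 = 27 configurations.
grid = [
    {"batch_size": b, "lr": lr, "epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]

def select_best(results):
    """Pick the entry with the highest mean F1, breaking ties by accuracy."""
    return max(results, key=lambda r: (r["f1"], r["accuracy"]))

# Hypothetical cross-validation summaries (not taken from the paper).
results = [
    {"name": "A", "f1": 0.91, "accuracy": 0.950},
    {"name": "B", "f1": 0.91, "accuracy": 0.955},
]
best = select_best(results)
print(len(grid), best["name"])  # 27 B
```

In a full run, each grid entry would be trained and scored under stratified 5-fold cross-validation before the selection rule is applied.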

A learning rate of 0.0001 consistently outperformed higher values, which ensured stable convergence and prevented accuracy loss caused by overshooting the optimal loss minimum during optimization. Batch size affected training efficiency; Batch 16 required more epochs due to noisy gradients, while Batch 32 achieved the best balance with the highest accuracy (96.51%). Batch 64 showed slight degradation, likely from reduced gradient diversity. Training duration analysis indicated that 15 epochs yielded optimal convergence, whereas shorter runs led to underfitting.

To assess result stability and reduce the impact of stochastic training effects, each selected configuration was replicated 10 times using different random initializations. Replication mitigates variability introduced by random weight initialization and data shuffling, which provides a more reliable estimate of model performance. Table 4 reports the mean and standard deviation across these replications for both cross-validation and application-level evaluation. In Table 4, “CV” denotes cross-validation performance computed on the held-out fold in each split, and “App” (application-level) denotes resubstitution on the full dataset used to fit the final model.

As shown in Table 4, B32_E15 (batch size of 32 and 15 epochs) achieved the highest cross-validation accuracy (95.15%) with minimal variation, which indicates stable convergence and strong generalization. The application-level columns in Table 4 are reported only for completeness: they reflect resubstitution on the full dataset used to fit the final model and therefore overestimate generalization. Accordingly, all comparative claims in this study rely exclusively on the cross-validation estimates, and the application-level figures should not be interpreted as held-out performance. Their large standard deviations (for example, an application F1 standard deviation of 9.02 for B32_E15) further underscore that these values are not a reliable basis for model selection.

Pairwise t-tests (Table 5 and Table 6) were used to assess whether the performance differences in accuracy and F1-score were statistically significant. Only the comparison between B32_E10 and B32_E15 showed statistically significant differences (p = 0.035 for accuracy, p = 0.045 for F1-score). The comparisons among the remaining configurations (B16_E15, B32_E15, and B64_E15) exhibited no significant differences, which indicates comparable performance levels. These results suggest that model variations were minor beyond the optimal configuration, as visualized in Figure 4. Pairwise testing was reported for accuracy and F1-score because, in this binary screening setting, F1-score already integrates precision and recall into a single balanced measure for the positive (abnormal) class. Precision and recall were not tested individually here; given the clinical importance of false negatives versus false positives, the cross-architecture comparison in Section 4.6 and the companion benchmark study [45] additionally examine sensitivity and specificity directly, and formal pairwise testing of precision and recall across configurations is a useful extension for future work.
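For readers who want to reproduce this kind of comparison, the sketch below computes a two-sample t statistic from replicated scores. The text does not specify which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the two score lists are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical replicated accuracies in percent (not taken from the paper).
config_a = [95.0, 95.2, 94.8, 95.1, 94.9]
config_b = [94.0, 94.4, 93.6, 94.2, 93.8]
t_stat, dof = welch_t(config_a, config_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Converting the statistic to a p-value requires the t-distribution CDF, which a statistics library can supply, for example `scipy.stats.ttest_ind(a, b, equal_var=False)`. If replications share seeds or data splits across configurations, a paired test would be the more appropriate choice.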

4.5. Grad-CAM Interpretability Results

Gradient-weighted Class Activation Mapping was applied to the best-performing configuration (B32_E15) to visualize model decision-making patterns and provide transparency in the classification process.

Figure 5 presents Grad-CAM visualizations for correctly classified abnormal cells and false negative errors. Here, the reported “focus scores” denote the normalized fraction of total Grad-CAM activation that falls within manually delineated nuclear/abnormal versus normal regions for each example, scaled to the range [0,1]; the scores are illustrative of individual representative cases rather than a dataset-wide quantitative metric. For correctly classified abnormal cells, the model demonstrated strong focus on nuclear regions (focus score: 1.000) with minimal attention to normal features (score: 0.000). In contrast, false negative cases shown in Figure 5 exhibited high normal focus scores (0.994-0.998) with insufficient attention to abnormal features (0.002-0.006), which indicates the model incorrectly focused on normal-appearing regions within abnormal samples.
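The focus score as defined above, the fraction of total activation that falls inside a delineated region, can be sketched as follows. The function `focus_score`, the toy heatmap, and the masks are illustrative assumptions; the study's own scoring code is not given.

```python
def focus_score(heatmap, mask):
    """Fraction of total heatmap activation that lies inside a binary mask."""
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0
    inside = sum(
        h for h_row, m_row in zip(heatmap, mask) for h, m in zip(h_row, m_row) if m
    )
    return inside / total

# Toy 2x2 heatmap and a hypothetical manually delineated nuclear region.
heatmap = [[0.0, 0.2], [0.8, 0.0]]
nucleus_mask = [[0, 0], [1, 0]]
background_mask = [[1 - m for m in row] for row in nucleus_mask]

nuclear_focus = focus_score(heatmap, nucleus_mask)
background_focus = focus_score(heatmap, background_mask)
print(round(nuclear_focus, 3), round(background_focus, 3))  # 0.8 0.2
```

When the two regions partition the image, the two scores sum to one, which matches the complementary pairs reported for the examples in Figure 5.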

Figure 6 illustrates the attention patterns for false positive errors and correctly classified normal cells. False positive cases in Figure 6 showed moderate abnormal focus scores (0.712-0.725) on benign features that superficially resembled abnormal patterns, which often result from staining artifacts or cellular overlapping. Correctly classified normal cells demonstrated appropriate normal focus (0.997-1.000) with minimal abnormal attention (0.000-0.003).

The attention patterns revealed in Figure 5 and Figure 6 aligned with established cytopathological criteria, which focus on nuclear morphology, chromatin distribution, and cellular boundaries, the same features that trained cytopathologists use for diagnosis. For the representative cases examined, this alignment suggests that the Vision Transformer model learns clinically relevant features for cervical cancer detection.

4.6. Comparison with CNN Baselines and Computational Efficiency

To contextualize the performance of the ViT-Tiny model against established convolutional architectures under identical conditions, a controlled cross-architecture benchmark was conducted in our companion study [45], in which ViT-Tiny and four widely used CNN baselines (ResNet50, EfficientNet-B0, VGG16, and DenseNet121) were trained on the same Herlev binary task using a single uniform training recipe with matched data splits and replication. Under that controlled comparison, ViT-Tiny matched the strongest CNN baselines (VGG16 and DenseNet121) on classification performance, with paired statistical testing finding no significant difference between them, while significantly outperforming ResNet50 and EfficientNet-B0 [45]. This indicates that, among modern competitive architectures, the choice of backbone is not the dominant determinant of accuracy on this dataset, which shifts the practical selection criterion toward efficiency. Full per-architecture results are reported in the companion study [45]; the present paper’s own ViT-Tiny results (for example, the 95.15% cross-validation accuracy of the optimal configuration in Section 4.4) were obtained under a different optimization protocol and remain this study’s primary results.

Beyond classification accuracy, the same study quantified computational efficiency on an NVIDIA A100 GPU; efficiency becomes the deciding selection criterion once accuracy saturates. ViT-Tiny (5.52 M parameters) uses approximately 24× fewer parameters than VGG16 (134.27 M) and about 15× less peak GPU memory, while delivering roughly 2.3× higher batch-level throughput; relative to the more compact DenseNet121, it still uses fewer parameters and markedly less memory at higher throughput. These reductions are an advantage for deployment in the resource-constrained, often GPU-free laboratory settings where cervical cancer screening is most needed, and they provide the practical justification for selecting a compact Vision Transformer over heavier CNN backbones. Full per-architecture latency, throughput, and memory measurements are reported in [45].
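The 5.52 M parameter figure and the roughly 24× ratio can be reproduced by hand from the architecture summarized in the Figure 3 caption (192-dimensional tokens, 12 encoder blocks, MLP 192 → 768 → 192). The sketch below is a back-of-the-envelope count, not the authors' code. It assumes a 224 × 224 RGB input split into a 14 × 14 grid of 16 × 16 patches, a standard ViT layout (biased query/key/value and output projections, learned positional embeddings, a final LayerNorm), and a two-class linear head. The number of attention heads does not change the count.

```python
def vit_param_count(image_size=224, patch_size=16, channels=3, dim=192,
                    depth=12, mlp_dim=768, num_classes=2):
    """Count trainable parameters of a standard ViT encoder with a linear head."""
    num_patches = (image_size // patch_size) ** 2            # 14 * 14 = 196

    patch_embed = channels * patch_size * patch_size * dim + dim  # linear patch projection
    cls_token = dim
    pos_embed = (num_patches + 1) * dim                      # patches plus [CLS]

    layer_norm = 2 * dim                                     # scale and shift
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)  # qkv + output projection
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)  # 192 -> 768 -> 192
    block = 2 * layer_norm + attention + mlp

    head = dim * num_classes + num_classes                   # normal vs. abnormal
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head


total = vit_param_count()
print(f"ViT-Tiny parameters: {total:,} (~{total / 1e6:.2f} M)")
print(f"VGG16 / ViT-Tiny parameter ratio: {134.27 / 5.52:.1f}x")
```

Under these assumptions the count comes to 5,524,802 parameters, which rounds to the reported 5.52 M, and 134.27 / 5.52 ≈ 24.3 matches the stated ratio.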

4.7. Comparison with Prior Work

To position the present results against prior work on the same dataset, Table 7 compares binary-classification studies on the Herlev Pap smear dataset in terms of method, reported performance, validation technique, split type, and methodological rigor. Early approaches relied on classical machine learning, such as the hybrid genetic-algorithm and nearest-neighbor scheme of Marinakis et al. [48], whereas more recent work has shifted toward deep learning and, most recently, transformer-based models. The present study is the only one in this comparison that simultaneously employs cross-validation, statistical significance testing, and clinical interpretability. While some earlier studies report higher headline accuracy or F1-score, several of those results were obtained on single train/test splits, and none was accompanied by significance testing; single-split estimates tend to be optimistic and higher-variance and are not directly comparable to cross-validated performance.

As Table 7 indicates, the proposed ViT-Tiny model achieves competitive cross-validated accuracy (95.15%) while being the only approach that combines stratified cross-validation, pairwise statistical testing, and Grad-CAM interpretability aligned with cytopathological criteria. Among the cross-validated studies, Pirovano et al. [52] is the closest comparable baseline, reporting 94.0% binary accuracy on Herlev using 4-fold CV and ResNet-101 with Integrated Gradients interpretability; our approach exceeds this accuracy by 1.15 percentage points under a stricter protocol (stratified 5-fold CV with 10 replications and pairwise statistical testing), although Pirovano et al. report a higher F1-score (96.0% versus 91.03%). Ghoneim et al. [50] also used 5-fold cross-validation and reported a higher accuracy of 99.5%; however, their work includes neither statistical significance testing nor interpretability analysis, both of which are important for trustworthy medical AI. The remaining studies, namely CerviFormer [51] (94.57%, single 90/10 train/test split), Kaur et al. [53] (95.0%, single 60/20/20 train/validation/test split), and Yilmaz and Kantar [49] (85.0% and 93.0% for their two method families, single 85/15 train/test split), relied on single fixed splits without cross-validation, statistical testing, or interpretability analysis. The present work therefore represents the most methodologically complete approach in this comparison, combining competitive cross-validated accuracy with the statistical rigor and Grad-CAM interpretability needed for transparent and reliable decision support in cervical cancer screening.
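To make the validation protocol concrete, the following sketch generates the splits for stratified 5-fold cross-validation repeated 10 times, using the Herlev class counts given in the abstract (242 normal, 675 abnormal). It is a standard-library stand-in written for illustration; the function names, the seeding scheme, and the label encoding (0 = normal, 1 = abnormal) are assumptions, and the study's actual implementation may differ.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal each class out round-robin
    return folds


def repeated_stratified_kfold(labels, k=5, repeats=10, base_seed=0):
    """Yield (replication, fold, train_indices, test_indices) for every split."""
    for rep in range(repeats):
        folds = stratified_kfold(labels, k=k, seed=base_seed + rep)
        for fold in range(k):
            test = folds[fold]
            train = [i for other in range(k) if other != fold for i in folds[other]]
            yield rep, fold, train, test


# Herlev class counts from the abstract; 0 = normal, 1 = abnormal (assumed encoding).
herlev_labels = [0] * 242 + [1] * 675
splits = list(repeated_stratified_kfold(herlev_labels))
print(f"{len(splits)} train/test splits")  # 10 replications x 5 folds
rep, fold, train, test = splits[0]
print(f"replication {rep}, fold {fold}: {len(train)} train, {len(test)} test")
```

Each replication reshuffles the data with a different seed, so the mean and standard deviation reported per configuration reflect variation across both folds and replications.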

5. Conclusions

This study demonstrates the potential of Vision Transformers as accurate and interpretable tools for automated cervical cancer screening. Framing Pap smear classification as a binary screening task aligned the experiments with clinical practice, and the optimized ViT-Tiny models achieved approximately 95% cross-validation accuracy, with pairwise t-tests indicating that the top configurations were statistically comparable. Interpretability analysis suggested that, for the representative cases examined, model attention aligned with clinically meaningful morphological features, supporting the potential of these models as decision-support tools pending broader validation.

Several limitations should be acknowledged. First, the Herlev dataset contains only 917 images from a single institution, which may not fully represent the diversity of cervical cell morphology across different populations, staining protocols, and imaging equipment; external validation on independent, multi-center datasets such as SIPaKMeD, together with cross-dataset transfer experiments, is necessary to establish generalizability and is a priority for subsequent work. Second, the binary classification approach, while clinically relevant for initial screening, simplifies the nuanced multi-class grading system used in cytopathology practice. Third, the study relied on pre-existing annotations without inter-rater reliability assessment, which potentially introduces label noise. Future work should mitigate this through expert consensus labeling by multiple cytopathologists with reporting of an agreement statistic such as Cohen’s or Fleiss’ kappa, and through uncertainty-aware learning techniques that down-weight or flag ambiguous samples. Fourth, the computational requirements of Vision Transformers may limit deployment in resource-constrained settings where cervical cancer screening is most needed; the companion benchmark study [45] partly addresses this by quantifying that the compact ViT-Tiny backbone used here requires substantially fewer parameters and less memory than common CNN baselines while remaining competitive in accuracy. Finally, while Grad-CAM provides valuable insights into model attention, it may not fully capture all aspects of the transformer’s decision-making process, particularly the complex interactions between attention heads.

While these results are promising, further validation on larger, multi-center datasets is essential to ensure robustness across diverse populations and imaging protocols. Additionally, prospective clinical trials that evaluate AI-assisted screening in real-world workflows with cytotechnologists and pathologists are crucial for clinical translation; such trials should assess not only diagnostic accuracy but also workflow integration, efficiency, and user acceptance in routine medical practice. Ultimately, this research is a step toward bridging AI research and clinical practice, and toward AI-assisted cytopathology systems that can enhance early cervical cancer screening, reduce diagnostic errors, and expand global access to preventive care.

References

1. Sung, H.; et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2021, vol. 71(no. 3), 209–249.
2. Stoler, M. H.; Schiffman, M.; for the Atypical Squamous Cells of Undetermined Significance–Low-grade Squamous Intraepithelial Lesion Triage Study (ALTS) Group. Interobserver Reproducibility of Cervical Cytologic and Histologic Interpretations: Realistic Estimates From the ASCUS-LSIL Triage Study. JAMA 2001, vol. 285(no. 11), 1500–1505.
3. Koonmee, S.; et al. False-Negative Rate of Papanicolaou Testing: A National Survey from the Thai Society of Cytology. Acta Cytol. 2017, vol. 61(no. 6), 434–440.
4. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597.
5. Shorten, C.; Khoshgoftaar, T. M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, vol. 6(no. 1), 60.
6. Greenspan, H.; Van Ginneken, B.; Summers, R. M. Guest Editorial Deep Learning in Medical Imaging: Overview and Future Promise of an Exciting New Technique. IEEE Trans. Med. Imaging 2016, vol. 35(no. 5), 1153–1159.
7. Litjens, G.; et al. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, vol. 42, 60–88.
8. da Silva, E. L. P. Combining machine learning and deep learning approaches to detect cervical cancer in cytology images. 2021.
9. Tan, X.; et al. Automatic model for cervical cancer screening based on convolutional neural network: a retrospective, multicohort, multicenter study. Cancer Cell Int. 2021, vol. 21(no. 1), 35.
10. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, 2017; pp. 3319–3328.
11. Samek, W.; Wiegand, T.; Müller, K.-R. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv 2017, arXiv:1708.08296.
12. Al-Zoubi, H.; Al-Bzoor, N. Toward driverless AI: Automating leukemia detection and classification using hyperautomation, a case study. 2022.
13. Albzour, N. Do Hybrid Deep Learning–Gradient Boosting Ensembles Generalize? A Cross-Cohort Evaluation of Breast Cancer Survival Prediction Using SEER and METABRIC. SSRN preprint, 2026. Available online: https://ssrn.com/abstract=6810699.
14. Dosovitskiy, A.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
15. Vaswani, A.; et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, vol. 30.
16. Han, K.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, vol. 45(no. 1), 87–110.
17. Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 2017; pp. 618–626.
18. Yuan, F.; Zhang, Z.; Fang, Z. An effective CNN and Transformer complementary network for medical image segmentation. Pattern Recognit. 2023, vol. 136, 109228.
19. Guo, X.; Lin, X.; Yang, X.; Yu, L.; Cheng, K.-T.; Yan, Z. UCTNet: Uncertainty-guided CNN-Transformer hybrid networks for medical image segmentation. Pattern Recognit. 2024, vol. 152, 110491.
20. Magaraja, A.D.; et al. A hybrid linear iterative clustering and Bayes classification-based GrabCut segmentation scheme for dynamic detection of cervical cancer. Appl. Sci. 2022, vol. 12(no. 20), 10522.
21. Mehmood, M.; Rizwan, M.; Gregus Ml, M.; Abbas, S. Machine Learning Assisted Cervical Cancer Detection. Front. Public Health 2021, vol. 9, 788376.
22. Mariarputham, E. J.; Stephen, A. Nominated Texture Based Cervical Cancer Classification. Comput. Math. Methods Med. 2015, vol. 2015, 1–10.
23. Iliyasu, A.M.; Fatichah, C. A quantum hybrid PSO combined with fuzzy k-NN approach to feature selection and cell classification in cervical cancer detection. Sensors 2017, vol. 17(no. 12), 2935.
24. Pranuthi, Tenali. Predicting Cervical Cancer Cases Resulting in Biopsies Using Machine Learning Techniques. Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol. 2021, 28–37.
25. Prusty, S.; Patnaik, S.; Dash, S. K. SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Front. Nanotechnol. 2022, vol. 4, 972421.
26. HassanMbaga, A.; ZhiJun, P. Pap Smear Images Classification for Early Detection of Cervical Cancer. Int. J. Comput. Appl. 2015, vol. 118(no. 7), 10–16.
27. Al-Batah, M. S.; Alzyoud, M.; Alazaidah, R.; Toubat, M.; Alzoubi, H.; Olaiyat, A. Early prediction of cervical cancer using machine learning techniques. Jordanian J. Comput. Inf. Technol. 2022, vol. 8(no. 4).
28. Nandanwar, P. D.; Dhonde, S. B. A Novel Approach to Cervical Cancer Detection Using Hybrid Stacked Ensemble Models and Feature Selection. Int. J. Electr. Electron. Res. 2023, vol. 11(no. 2), 582–589.
29. K., D. Cervical Cancer Classification. Int. J. Emerg. Trends Eng. Res. 2020, vol. 8(no. 3), 804–807.
30. Dash, S.; Sethy, P. K.; Behera, S. K. Cervical Transformation Zone Segmentation and Classification based on Improved Inception-ResNet-V2 Using Colposcopy Images. Cancer Inform. 2023, vol. 22.
31. Battula, K. Prasad; Chandana, B. Sai. Multi-class Cervical Cancer Classification using Transfer Learning-based Optimized SE-ResNet152 model in Pap Smear Whole Slide Images. Int. J. Electr. Comput. Eng. Syst. 2023, vol. 14(no. 6), 623–623.
32. Alsubai, S.; et al. Privacy Preserved Cervical Cancer Detection Using Convolutional Neural Networks Applied to Pap Smear Images. Comput. Math. Methods Med. 2023, vol. 2023(no. 1).
33. Huang, P.; Tan, X.; Chen, C.; Lv, X.; Li, Y. AF-SENet: Classification of Cancer in Cervical Tissue Pathological Images Based on Fusing Deep Convolution Features. Sensors 2020, vol. 21(no. 1), 122.
34. Kurnianingsih, et al. Segmentation and Classification of Cervical Cells Using Deep Learning. IEEE Access 2019, vol. 7, 116925–116941.
35. Suguna, S. P.; Balamurugan. Multi-Class Segmentation with Deep Learning based Pap Smear Image Analysis for Cervical Cancer Detection and Classification Model. Tuijin Jishu/J. Propuls. Technol. 2023, vol. 44(no. 3), 4475–4487.
36. Albzour, N.; Lam, S. S. Segmentation and Classification of Pap Smear Images for Cervical Cancer Detection Using Deep Learning. arXiv 2025, arXiv:2508.17728.
37. Chowdary, G. J.; S. G, P. M.; Yogarajah, P. Nucleus segmentation and classification using residual SE-UNet and feature concatenation approach in cervical cytopathology cell images. Technol. Cancer Res. Treat. 2023, vol. 22.
38. Park, J.; Yang, H.; Roh, H.-J.; Jung, W.; Jang, G.-J. Encoder-Weighted W-Net for Unsupervised Segmentation of Cervix Region in Colposcopy Images. Cancers 2022, vol. 14(no. 14), 3400.
39. Shinde, S.; Kalbhor, M.; Wajire, P. DeepCyto: a hybrid framework for cervical cancer classification by using deep feature fusion of cytology images. Math. Biosci. Eng. 2022, vol. 19(no. 7), 6415–6434.
40. Kalbhor, M.; Shinde, S.; Popescu, D. E.; Hemanth, D. J. Hybridization of Deep Learning Pre-Trained Models with Machine Learning Classifiers and Fuzzy Min–Max Neural Network for Cervical Cancer Diagnosis. Diagnostics 2023, vol. 13(no. 7), 1363.
41. Jiménez Gaona, Y.; et al. Radiomics Diagnostic Tool Based on Deep Learning for Colposcopy Image Classification. Diagnostics 2022, vol. 12(no. 7), 1694.
42. Nirmala, G.; Nayudu, P. P.; Kumar, A. R.; Sagar, R. Automatic cervical cancer classification using adaptive vision transformer encoder with CNN for medical application. Pattern Recognit. 2025, vol. 160, 111201.
43. Şahin, E.; Özdemir, D.; Temurtaş, H. Multi-objective optimization of ViT architecture for efficient brain tumor classification. Biomed. Signal Process. Control 2024, vol. 91, 105938.
44. Liu, X.; Hu, Y.; Chen, J. Hybrid CNN-Transformer model for medical image segmentation with pyramid convolution and multi-layer perceptron. Biomed. Signal Process. Control 2023, vol. 86, 105331.
45. Albzour, N.; Lam, S. S. A reproducible benchmark of ViT-Tiny against CNN baselines for cervical cell classification: Accuracy, statistical validation, and deployment efficiency. SSRN preprint, 2025. Available online: https://ssrn.com/abstract=6839541.
46. Jantzen, J.; Norup, J.; Dounias, G.; Bjerregaard, B. Pap-smear benchmark data for pattern classification. Proc. Nature Inspired Smart Information Systems (NiSIS), Albufeira, Portugal, 2005; pp. 1–9.
47. Albzour, N.; Agarwal, S.; Althnaibat, H.; Lu, S. S. Predicting post-stroke activities of daily living: Enhancing Machine Learning with Feature Selection. IISE Annual Conference Proceedings, 2025; pp. 1–6.
48. Marinakis, Y.; Dounias, G.; Jantzen, J. Pap smear diagnosis using a hybrid intelligent scheme focusing on genetic algorithm based feature selection and nearest neighbor classification. Comput. Biol. Med. 2009, vol. 39(no. 1), 69–78.
49. Yilmaz, E.; Kantar, M. Comparison of deep learning and traditional machine learning techniques for classification of normal and abnormal cervical cells. arXiv 2020, arXiv:2009.06366.
50. Ghoneim, A.; Muhammad, G.; Hossain, M. S. Cervical cancer classification using convolutional neural networks and extreme learning machines. Future Gener. Comput. Syst. 2020, vol. 102, 643–649.
51. Deo, B.S.; Pal, M.; Panigrahi, P. K.; Pradhan, A. CerviFormer: A Pap-smear-based cervical cancer classification method using cross-attention and latent transformer. arXiv 2023, arXiv:2303.10222.
52. Pirovano; Almeida, L. G.; Ladjal, S. Regression Constraint for an Explainable Cervical Cancer Classifier. arXiv 2019, arXiv:1908.02650.
53. Kaur, H.; Sharma, R.; Kaur, J. Comparison of deep transfer learning models for classification of cervical cancer from pap smear images. Sci. Rep. 2025, vol. 15(no. 1), 3945.

Figure 1. AI Techniques in Cervical Cancer Detection.

Figure 2. Overview of the Methodology.

Figure 3. Architecture of the ViT-Tiny model. (a) The pipeline splits the input Pap smear cell into 14 × 14 patches, projects them to 192-dimensional tokens with positional embeddings, processes them through 12 identical transformer encoder blocks, and classifies the [CLS] token as normal or abnormal via a linear head. (b) Each encoder block applies LayerNorm, multi-head self-attention (3 heads), and an MLP (192 → 768 → 192) with GELU activation, each wrapped in a residual connection. The full model has 5.52 million parameters.

Figure 4. Performance Comparison Across All Experimental Configurations.

Figure 5. Grad-CAM Analysis of Correct Abnormal Classifications and False Negative Errors.

Figure 6. Grad-CAM Analysis of False Positive Errors and Correct Normal Classifications.

Table 1. Augmentation Techniques Evaluation (5-Fold Cross-Validation).

Augmentation Strategy	Precision (%)	Recall (%)	F1-score (%)	Accuracy (%)
Color Jitter	80.40	90.60	83.30	89.87
Horizontal Flip	89.70	91.30	90.00	94.77
Random Affine	86.90	83.40	83.10	91.39
Color Jitter + Horizontal Flip	89.10	90.10	89.40	94.33
Color Jitter + Random Affine	84.10	95.00	88.70	93.23
Horizontal Flip + Random Affine	83.60	89.70	85.70	92.26
All Three Combined	89.40	89.70	89.10	94.22

Table 2. Class Weight Optimization Results (5-Fold Cross-Validation).

Case #	Weight Multiplier	Abnormal Weight	Normal Weight	Precision (%)	Recall (%)	F1-score (%)	Accuracy (%)
1	1.0×1.0	0.68	1.90	92.10	84.30	87.40	93.67
2	0.8×0.8	0.54	1.52	84.40	90.60	85.80	91.93
3	1.2×1.2	0.82	2.27	83.00	93.00	86.70	91.72
4	0.7×1.3	0.48	2.46	90.90	93.40	91.90	95.64
5	1.3×0.7	0.88	1.33	90.70	88.80	89.70	94.55

Table 3. Full Hyperparameter Optimization Results (27 Configurations). The top-performing configurations selected for further analysis (bold in the original layout) are the four with the highest accuracy: Experiments 3, 11, 12, and 21.

Experiment #	Batch Size	Learning Rate	Epochs	Precision (%)	Recall (%)	F1-score (%)	Accuracy (%)
1	16	0.0001	5	79.14	90.93	84.28	90.95
2	16	0.0001	10	89.91	90.07	89.66	94.55
3	16	0.0001	15	93.36	90.52	91.82	95.75
4	16	0.0005	5	53.84	87.17	61.00	65.32
5	16	0.0005	10	58.96	84.23	68.55	78.96
6	16	0.0005	15	65.70	83.97	72.31	82.54
7	16	0.001	5	53.53	79.31	58.01	62.83
8	16	0.001	10	53.03	83.38	60.91	68.72
9	16	0.001	15	59.24	78.87	63.69	73.07
10	32	0.0001	5	85.31	90.47	87.56	93.34
11	32	0.0001	10	90.47	95.06	92.59	95.96
12	32	0.0001	15	96.23	90.49	93.10	96.51
13	32	0.0005	5	56.57	77.67	61.21	70.13
14	32	0.0005	10	68.24	78.98	71.36	82.43
15	32	0.0005	15	53.52	95.88	68.11	74.94
16	32	0.001	5	67.09	60.26	57.91	77.53
17	32	0.001	10	52.15	81.38	61.44	70.77
18	32	0.001	15	49.43	86.44	60.69	67.26
19	64	0.0001	5	78.96	95.46	86.05	91.71
20	64	0.0001	10	84.55	95.03	89.39	94.00
21	64	0.0001	15	93.83	92.15	92.87	96.29
22	64	0.0005	5	50.99	84.75	59.55	67.39
23	64	0.0005	10	61.60	93.38	73.71	82.02
24	64	0.0005	15	71.84	82.19	75.72	85.72
25	64	0.001	5	51.95	69.81	57.18	72.73
26	64	0.001	10	42.27	88.89	55.77	61.49
27	64	0.001	15	58.11	88.87	69.27	78.17

Table 4. Comprehensive Experimental Results (10 Replications).

Configuration (B = Batch size, E = Epochs)	CV Precision (%)	CV Recall (%)	CV F1-score (%)	CV Accuracy (%)	App Precision (%)	App Recall (%)	App F1-score (%)	App Accuracy (%)
B16_E15	91.08 ± 4.10	90.99 ± 1.43	90.74 ± 1.93	95.05 ± 1.20	96.98 ± 4.15	96.65 ± 6.85	96.63 ± 4.27	98.27 ± 2.10
B32_E10	86.91 ± 3.86	93.02 ± 2.82	89.36 ± 2.06	93.93 ± 1.41	97.59 ± 2.48	99.55 ± 0.47	98.54 ± 1.32	99.21 ± 0.72
B32_E15	90.78 ± 2.41	91.91 ± 1.49	91.03 ± 0.87	95.15 ± 0.57	93.57 ± 14.15	99.96 ± 0.12	96.00 ± 9.02	97.18 ± 6.67
B64_E15	91.22 ± 3.48	90.38 ± 1.94	90.53 ± 1.91	94.92 ± 1.22	98.41 ± 2.71	99.55 ± 0.84	98.95 ± 1.34	99.43 ± 0.73

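Table 5. Pairwise T-Test Results for Accuracy. Exp1 to Exp4 match, by their means, the configurations B16_E15, B32_E10, B32_E15, and B64_E15 of Table 4, respectively.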

Comparison	Mean₁ (Exp A)	Mean₂ (Exp B)	Diff (A-B)	95% CI (Diff)	p-value	Significant? (p < 0.05)
Exp1 vs Exp2	95.05	93.93	+1.122	(−0.179, 2.423)	0.087	No
Exp1 vs Exp3	95.05	95.15	−0.099	(−1.064, 0.865)	0.826	No
Exp1 vs Exp4	95.05	94.92	+0.128	(−1.077, 1.334)	0.825	No
Exp2 vs Exp3	93.93	95.15	−1.221	(−2.336, −0.106)	0.035	Yes
Exp2 vs Exp4	93.93	94.92	−0.993	(−2.306, 0.319)	0.129	No
Exp3 vs Exp4	95.15	94.92	+0.228	(−0.753, 1.209)	0.622	No

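Table 6. Pairwise T-Test Results for F1-score. Exp1 to Exp4 match, by their means, the configurations B16_E15, B32_E10, B32_E15, and B64_E15 of Table 4, respectively.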

Comparison	Mean₁ (Exp A)	Mean₂ (Exp B)	Diff (A-B)	95% CI (Diff)	p-value	Significant? (p < 0.05)
Exp1 vs Exp2	90.74	89.36	+1.384	(−0.603, 3.371)	0.160	No
Exp1 vs Exp3	90.74	91.03	−0.287	(−1.824, 1.250)	0.692	No
Exp1 vs Exp4	90.74	90.53	+0.212	(−1.697, 2.121)	0.817	No
Exp2 vs Exp3	89.36	91.03	−1.671	(−3.297, −0.045)	0.045	Yes
Exp2 vs Exp4	89.36	90.53	−1.172	(−3.149, 0.805)	0.228	No
Exp3 vs Exp4	91.03	90.53	+0.499	(−1.024, 2.021)	0.489	No

Table 7. Comparison of binary cervical cytology classification studies on the Herlev dataset. All studies use the Herlev dataset (917 images) for binary (normal vs. abnormal) classification. Methodological-rigor columns are reported as Yes/No. F1 scores marked with † are macro-averaged from per-class values reported in the original paper.

Study	Method	Acc (%)	F1 (%)	Validation technique	Split type	Dataset	Statistical test	Interpretability
Yilmaz & Kantar [49]	XGBoost/k-NN and Custom CNN	85.0 / 93.0	87.0 / 95.0	Train/test split (85/15)	Single split	Herlev (917)	No	No
Ghoneim et al. [50]	CNN + extreme learning machine	99.5	—	5-fold CV	Cross-validation	Herlev (917)	No	No
Deo et al., CerviFormer [51]	Cross-attention transformer	94.57	92.5†	Train/test split (90/10)	Single split	Herlev (917)	No	No
Pirovano et al. [52]	ResNet-101 + Integrated Gradients	94.0	96.0	4-fold CV	Cross-validation	Herlev (917)	No	Yes (Integrated Gradients)
Kaur et al. [53]	ResNet50 (best of 16 TL models)	95.0	94.0	Train/val/test (60/20/20)	Single split	Herlev (917)	No	No
Present work — ViT-Tiny	ViT-Tiny (~5.5 M) + Grad-CAM	95.15	91.03	Stratified 5-fold CV (×10 reps)	Cross-validation	Herlev (917)	Yes (paired t-tests)	Yes (Grad-CAM)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
