Beyond Accuracy: Pneumonia Severity Grading in Chest X-Rays Using RSNA 2018 Bounding-Box Extent Metadata and ViT

Emanuel-Crăciun Trînc; Beatrice Arvinti; Emil-Radu Iacob; Cristina Stolojescu-Crisan

doi:10.20944/preprints202606.1178.v2

Submitted: 16 June 2026

Posted: 25 June 2026

Abstract

Pneumonia detection in chest X-rays is commonly treated as a binary classification task, although the radiographic burden of disease varies substantially across patients. This study investigates whether the expert bounding-box annotations available in the RSNA 2018 Pneumonia Detection dataset can be reused to derive an interpretable severity grading framework based on lesion extent. For each pneumonia-positive image, total disease burden was computed as the cumulative area of all annotated opacity boxes, and a balanced three-tier split was generated, defining Severity 1 (low/mild), Severity 2 (moderate), and Severity 3 (severe). The resulting severity groups contained 1999, 2006, and 2007 images, respectively. To validate the usefulness of this annotation-driven severity formulation, we trained Vision Transformer models in both a classical binary setting and a severity-specialized setup. The classical binary baseline reached a maximum validation accuracy of 95.05%, while the specialized models achieved 94.98% for Severity 1, 97.88% for Severity 2, and 99.08% for Severity 3. These results indicate that radiographic burden derived from bounding-box extent supports highly separable severity-aware subproblems, particularly for severe pneumonia, while lower-burden categories remain more difficult. The proposed framework offers a transparent and reproducible way to transform RSNA 2018 lesion annotations into ordinal severity labels. Beyond its original use as a detection and localization benchmark, the dataset can therefore also support severity-aware classification, radiographic burden analysis, and future explainable AI studies in chest X-ray interpretation.

Keywords: pneumonia severity grading; chest X-ray; RSNA 2018 dataset; bounding box annotations; disease extent; radiographic burden; severity stratification; medical imaging; explainable AI; chest radiograph analysis

Subject: Engineering - Telecommunications

1. Introduction

Chest X-ray (CXR) imaging remains one of the most widely used diagnostic tools for pulmonary disease because of its low cost, rapid acquisition, and broad clinical availability. Among thoracic conditions, pneumonia continues to be a major target for computer-aided analysis, and automated CXR interpretation has therefore become an important topic in medical image computing. Most existing deep learning studies formulate pneumonia assessment as a binary task, namely distinguishing pneumonia-positive from normal chest X-rays. While this setting is useful for detection benchmarking, it does not capture the fact that pneumonia presents with markedly different radiographic burdens, ranging from small focal opacities to extensive multifocal or diffuse lung involvement [2,3,4].

Severity-aware analysis is important for several reasons. First, it provides a more fine-grained representation of disease burden than binary diagnosis alone. Second, stratifying cases by radiographic extent can support dataset curation and help reveal whether models behave differently across lower- and higher-burden presentations. Third, severity-oriented formulations may provide a practical bridge toward downstream tasks such as triage support, prioritization, explainability studies, and burden-aware classification [3,5,6]. Even when direct clinical severity labels are unavailable, radiographic extent remains an interpretable signal that can be exploited for structured analysis.

The RSNA 2018 Pneumonia Detection dataset is particularly well suited for such an investigation because, in addition to binary pneumonia labels, it provides expert bounding-box annotations for pneumonia-positive cases [2,7]. These annotations explicitly localize suspicious opacity regions and therefore encode more information than simple disease presence or absence. Prior work has used this dataset primarily for pneumonia detection and lesion localization, treating the bounding boxes as supervision for object detection or classification-plus-localization pipelines [8,9,10]. At the same time, a separate body of literature has shown that chest radiographs can support meaningful severity scoring and grading, although such studies often rely on externally defined radiologist scores, regional scoring systems, or disease-specific datasets rather than on RSNA 2018 annotation geometry itself [3,4,6]. This creates a natural opportunity to reinterpret the RSNA bounding-box metadata not only as localization targets, but also as a source of image-level radiographic burden.

In this work, we develop and validate a severity-aware reinterpretation of the RSNA 2018 dataset based on bounding-box extent, and we additionally release the resulting public severity-oriented dataset split for reuse by the research community [11]. For each pneumonia-positive image, we quantify the total annotated disease burden by summing the areas of all lesion bounding boxes and use this quantity to derive a three-tier severity grouping corresponding to low/mild, moderate, and severe radiographic involvement. The resulting split is nearly perfectly balanced across the three pneumonia severity groups, which makes it suitable for downstream comparative modeling. Rather than claiming a clinical severity score from annotation geometry alone, we propose an interpretable and reproducible proxy framework for radiographic burden analysis grounded directly in expert-labeled lesion extent.

To assess whether this grading-oriented reinterpretation is useful in practice, we further evaluate Vision Transformer (ViT)-based downstream classification models using the derived severity labels. In addition to a classical binary baseline, we train specialized severity-aware branches and compare their validation behavior across the different burden levels. The final results show that the proposed split is not only statistically coherent, but also operationally meaningful for learning, with stronger separability observed for higher-burden groups than for lower-burden ones. In this way, the study moves beyond pure dataset reinterpretation and provides empirical evidence that annotation-derived burden labels can support structured pneumonia grading experiments.

The main contributions of this paper are as follows:

We analyze the RSNA 2018 pneumonia bounding-box annotations from the perspective of radiographic disease burden rather than binary detection alone.
We define a transparent and reproducible image-level burden measure based on the cumulative area of all expert-annotated lesion boxes.
We derive a balanced three-tier severity split from the pneumonia-positive cohort, yielding low/mild, moderate, and severe groups with near-equal sample counts.
We validate the monotonic structure of the resulting grading scheme through distributional, thresholding, and subgroup analyses.
We evaluate the practical utility of the derived severity labels in downstream ViT-based classification experiments, including comparison with a classical binary baseline.
We position the RSNA 2018 dataset as a useful benchmark not only for detection and localization, but also for severity-aware chest X-ray analysis based on annotation extent metadata.

The remainder of this paper is organized as follows. Section 2 reviews related work on RSNA expert annotations, pneumonia detection and localization, chest X-ray severity grading, and localization-to-severity bridging methods. Section 3 clarifies the positioning of the present study with respect to these research directions. Section 4 describes the dataset, the bounding-box extent analysis, the proposed severity stratification procedure, the dataset split, and the training setup. Section 5 presents the statistical and model-based results. Section 6 discusses the implications and limitations of the proposed framework, and Sections 7 and 8 conclude the paper.

2. Related Work

2.1. Foundational Annotation Work: RSNA Bounding-Box Expert Labels

A directly related foundational study for the present work is the dataset paper by Shih et al. [12], "Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia." This paper is central to our study because it established the expert-annotated bounding-box resource on which the RSNA 2018 pneumonia challenge dataset is based. In particular, the authors augmented a 30,000-image frontal chest radiograph subset derived from the NIH ChestX-ray8 collection and introduced expert annotations for pneumonia-like opacities, thereby transforming a weakly labeled chest X-ray corpus into a localization-aware dataset suitable for machine learning development. As stated in the paper, the goal was not only to improve categorical labeling, but also to provide spatially localized information through bounding boxes, enabling algorithms to identify the location and approximate size of suspicious pulmonary opacity regions.

The importance of this contribution lies in the annotation design. Shih et al. [12] explicitly emphasized that accurate object detection in medical imaging requires more than image-level labels, and that manually drawn bounding boxes provide clinically meaningful information about both lesion location and extent. The paper notes that such localization may help clinicians assess whether an algorithm’s output is trustworthy by showing where pneumonia has been identified and whether the extent appears small or large. This idea is especially relevant to the present study, since our severity-oriented analysis is based precisely on the geometric burden encoded in those bounding boxes. Rather than using the annotations only for detection, we reinterpret them as a source of image-level radiographic burden by summing their areas across each study.

The annotation protocol described by Shih et al. [12] also provides important methodological context for our work. Eighteen board-certified radiologists from multiple institutions participated in the labeling effort, and readers were instructed to draw boxes as small as possible while still encompassing the full suspicious opacity. When discontinuous opacities were present, multiple boxes were used. The paper further describes a warm-up and quality-assurance stage in which readers compared annotations on a shared set of 50 radiographs. The screenshot in Figure 1 (page 3) of Shih et al. [12], reproduced here as Figure 1, shows substantial overlap among readers' bounding boxes but also visible variation in their exact placement and opacity confidence assignment. This figure is highly relevant to our study because it visually demonstrates that the RSNA annotations encode not only lesion presence but also an approximate spatial footprint of radiographic abnormality.

Another key aspect of the dataset paper is the adjudication procedure. According to Shih et al. [12], a subset of cases was triple-read, majority voting was used for final categorical labels, isolated boxes were reviewed by adjudicators, and medium- and high-probability opacity boxes were ultimately merged into a unified likely-pneumonia category for the adjudicated release. The paper also notes that low-confidence boxes were removed from the final adjudicated dataset. This is important for the present article because it means that the bounding boxes used in the RSNA challenge were not arbitrary rough marks, but the result of a structured expert-labeling and review process. Although the annotations remain an imperfect proxy for clinical severity, they are sufficiently systematic to support downstream analysis of opacity burden.

The paper is also directly relevant because it reports properties of the released dataset that align with the assumptions of our study. Shih et al. [12] state that the 30,000-image set contains both PA and AP frontal chest radiographs, includes metadata such as patient sex and age, and was released in both original and adjudicated forms. They further report that the final dataset included adults and minors, male and female patients, and both posteroanterior and anteroposterior views. These metadata fields provide the basis for later subgroup analyses in our study, while the bounding-box annotations provide the basis for our severity grading strategy.

Most importantly, Shih et al. [12] explicitly discuss the size distribution of the final bounding boxes and even present histograms of bounding-box area for the train/validation and test sets on page 5. This is particularly relevant to the present paper because it confirms that bounding-box extent is a meaningful measurable variable in the dataset and not merely an auxiliary localization label. Our work extends this idea by aggregating box areas at the study level and using the resulting total disease area as a radiographic burden signal for ordinal severity grouping. In this sense, the dataset paper provides the annotation foundation, while our study builds a severity-oriented reinterpretation on top of the same expert-labeled spatial metadata.

2.2. Pneumonia Detection and Localization on RSNA 2018 Chest X-Rays

Following the release of the RSNA 2018 pneumonia challenge dataset, most subsequent work used the collection primarily as a benchmark for pneumonia detection and lesion localization, rather than for severity grading. This emphasis is consistent with the official challenge design, which explicitly asked participants to identify and localize pneumonia in frontal chest radiographs using the released expert annotations [2,7]. As a result, the dominant research trajectory around the dataset has focused on predicting image-level pneumonia presence together with bounding-box coordinates for suspicious opacities.

Early work in this direction is exemplified by the DeepRadiology Team, who described a competition-level pipeline for pneumonia classification and localization in chest radiographs using an open-source object detector based on CoupleNet together with a custom box-ensembling strategy [8]. Their work is representative of the challenge-era interpretation of the RSNA dataset: the expert bounding boxes were treated as supervision for object detection, with the goal of maximizing localization performance rather than transforming annotation geometry into an ordinal burden representation.

This detector-oriented perspective was further summarized by Pan et al., who reviewed deep learning approaches used by top-placing teams in the RSNA challenge [9]. Their discussion reinforced the central framing of the dataset as a benchmark for classification plus localization, highlighting how challenge solutions relied on detector-style architectures and ensemble methods to identify pneumonia regions. In this literature, the bounding boxes primarily served as spatial targets for pneumonia detection rather than as metadata for secondary image-level interpretation.

Subsequent studies continued to develop this localization-oriented use of the RSNA dataset. For example, Mao et al. proposed a deep learning approach based on an ensemble of RetinaNet and Mask R-CNN models for automatic pneumonia detection in chest radiographs [10]. Here again, the RSNA 2018 labels were used as localization supervision to improve box prediction and lesion detection performance. Such work confirms that the most common downstream use of the dataset has been to train detector architectures that estimate where pneumonia is located and whether it is present.

Broader reviews of deep learning for pneumonia analysis also reflect this trend. Recent surveys describe a literature dominated by image classification, object detection, and localization frameworks, with the RSNA 2018 dataset playing a major role as a benchmark for annotated pneumonia region detection [13]. Across this body of work, the primary goals remain diagnosis, classification, and region localization, while comparatively little attention has been paid to reinterpreting bounding-box extent itself as an image-level burden signal.

Overall, the literature built around the RSNA 2018 dataset has treated the expert annotations primarily as supervision for pneumonia detection and lesion localization. Most published approaches have focused on identifying whether pneumonia is present and where suspicious opacities are located, typically through classification or detector-based frameworks. Consequently, while these studies establish an important methodological foundation for localized pneumonia analysis in chest radiographs, they leave open the question of whether the same bounding-box annotations can also support a grading-oriented interpretation of radiographic disease burden.

2.3. Chest X-Ray Severity Scoring and Grading

Compared with the large body of work on pneumonia detection and lesion localization, chest X-ray severity scoring and grading has received comparatively less attention, particularly outside the COVID-19 literature. Most severity-oriented studies aim to move beyond binary diagnosis by estimating the extent of radiographic involvement, often through ordinal labels, regional burden scores, or continuous severity targets derived from expert assessment. This line of work is directly relevant to the present study because it shares the broader objective of converting chest radiographs from simple disease-presence classification into a richer representation of radiographic burden [3,4,6].

One influential early study in this direction is the work of Cohen et al., who proposed deep learning-based prediction of COVID-19 pneumonia severity from frontal chest radiographs [3]. Rather than performing only disease detection, their framework estimated severity-related scores linked to lung opacity and geographic extent, thereby illustrating how chest X-rays can support a more graded interpretation of pulmonary involvement. Although their setting was focused on COVID-19 rather than RSNA 2018 pneumonia, the study is conceptually important because it demonstrated that radiographic burden can be treated as a structured prediction target rather than a purely qualitative observation [3].

A related direction was developed by Signoroni et al., who introduced BS-Net for automatic estimation of chest X-ray disease severity using deep learning [4]. Their work emphasized multilevel severity prediction and regional burden quantification, showing that severity assessment can be formulated as a dedicated task in its own right rather than as an accessory to disease classification. Similarly, Chandra et al. proposed a framework that explicitly linked disease localization and severity assessment in chest X-rays, thereby reinforcing the idea that the spatial extent of abnormality is strongly connected to clinically meaningful severity-related interpretation [5].

More recent work has continued to explore severity scoring in chest radiography using deep neural networks and transformer-based models. For example, Slika et al. investigated lung pneumonia severity scoring in chest X-ray images using transformer architectures and curated severity labels, highlighting the growing interest in chest X-ray grading beyond binary prediction [6]. Likewise, newer studies have proposed confidence-aware or multistage severity classification frameworks that categorize chest radiographs into normal, mild, moderate, and severe groups based on opacity burden and consolidation extent [14,15]. These works are important because they confirm that severity-aware chest X-ray analysis is becoming an active research direction.

However, most existing severity-scoring studies differ from the present work in two important respects. First, many of them rely on dedicated radiologist-defined severity scores, region-based scoring systems, or COVID-specific datasets, rather than on the RSNA 2018 pneumonia challenge data [3,4,6]. Second, their labels are usually introduced externally through manual scoring protocols or disease-specific grading schemes, rather than derived directly from the geometric extent of existing lesion annotations. In contrast, our study leverages the expert bounding boxes already available in the RSNA 2018 dataset and reinterprets their cumulative area as a proxy for radiographic burden.

From this perspective, the severity-scoring literature provides a strong conceptual foundation for the present work: it establishes that chest radiographs can support meaningful ordinal grading of pulmonary involvement and that radiographic burden is a relevant target for machine learning analysis. At the same time, the present study differs methodologically by deriving severity labels directly from expert bounding-box extent metadata rather than from an externally defined scoring system. This makes our work particularly relevant as a bridge between annotation-based localization research and severity-aware chest X-ray grading.

2.4. Localization-to-Severity Bridging Methods

A particularly relevant line of related work lies between pure lesion localization and explicit severity scoring, namely methods that use where disease is found in a chest radiograph to infer how severe the radiographic involvement may be. These localization-to-severity bridging approaches are closely aligned with the spirit of the present study because they treat spatial extent, infection maps, or abnormal-region burden as intermediate signals connecting detection with grading [3,4,5].

One of the most directly related examples is the framework of Chandra et al., who proposed a multistage chest X-ray pipeline that jointly addressed disease localization and severity assessment [5]. Their study generated compact disease boundaries and infection maps and then used these spatial outputs to support severity estimation. This is methodologically important because it explicitly recognizes that radiographic severity is not independent of lesion extent; rather, the size and spread of the abnormal region provide a natural bridge from localization to grading. Although their approach differs from ours in using a model-based localization framework rather than expert RSNA bounding boxes directly, it supports the core idea that spatial disease burden can be transformed into ordinal severity information [5].

A related conceptual direction appears in the work of Cohen et al., who predicted COVID-19 pneumonia severity on frontal chest radiographs using deep learning and radiologist-derived scores linked to lung opacity and geographic extent [3]. While their study did not use bounding-box annotations, it is highly relevant because the target variables themselves were severity measures grounded in the extent of radiographic abnormality. In this sense, the work showed that chest X-ray severity can be modeled through spatial burden signals rather than through binary disease labels alone [3].

Similarly, Signoroni et al. introduced BS-Net, an end-to-end framework for chest X-ray severity estimation that predicts a multiregional burden score conveying the degree of pulmonary involvement [4]. Although the study was developed in the context of COVID-19 pneumonia, it further reinforced the broader principle that structured spatial disease extent can serve as a meaningful severity target. Rather than classifying radiographs only as positive or negative, these approaches interpret the image through ordered levels of pathological spread [4].

Additional work has explored anatomy-aware and progression-oriented variants of this bridge between localization and severity. For example, Nizam et al. proposed an anatomy-aware model for COVID-19 severity prediction from chest X-rays, emphasizing the role of anatomically constrained image understanding in severity estimation [16]. Likewise, the longitudinal framework reported in "COVID-19 in CXR: From Detection and Severity Scoring to Patient Disease Monitoring" used localization maps to derive a pneumonia ratio as an image-level severity indicator [17]. These works are especially relevant because they show that localization outputs can be transformed into scalar or ordinal burden measures for downstream interpretation.

Despite these important advances, most localization-to-severity bridging studies differ from the present work in two major ways. First, they typically rely on learned localization maps, segmentation outputs, or radiologist-defined severity scores rather than on expert bounding-box metadata already embedded in a large benchmark dataset [3,5,16]. Second, many of them are developed in COVID-specific settings, where severity labels are often tied to disease-specific scoring systems or longitudinal clinical cohorts [3,4,17]. By contrast, our study uses the RSNA 2018 expert pneumonia annotations directly and derives severity from the cumulative geometric burden of those bounding boxes at study level.

For this reason, the present work can be viewed as a structurally simpler but highly transparent localization-to-severity bridge. Instead of learning a latent infection map and then inferring severity from it, we start from explicit expert-drawn lesion regions and convert their aggregated extent into ordered radiographic burden groups. In this way, our study extends the bridging logic established in prior work while grounding it in an annotation-driven reinterpretation of the RSNA 2018 pneumonia dataset.

3. Study Positioning

As discussed in Section 2.1, Section 2.2, and Section 2.3, the existing literature around the RSNA 2018 dataset and chest X-ray severity analysis has developed along two largely separate directions. On the one hand, RSNA-based studies have primarily treated the dataset as a benchmark for pneumonia detection and lesion localization, using expert bounding boxes as supervision for identifying whether pneumonia is present and where suspicious opacities are located [2,7,8,9,10]. On the other hand, severity-scoring studies have shown that chest radiographs can support meaningful ordinal grading of pulmonary burden, but these approaches typically rely on externally defined severity scores, radiologist scoring systems, or disease-specific datasets rather than on RSNA 2018 bounding-box metadata [3,4,5,6].

The present study is positioned at the intersection of these two lines of work. Rather than predicting bounding boxes or optimizing a detector for challenge-style scoring, and rather than importing an external severity protocol, we reuse the expert RSNA annotations themselves as a quantitative proxy for radiographic burden. By aggregating bounding-box area at the study level and converting it into ordered severity groups, we reinterpret the RSNA 2018 dataset not only as a detection/localization benchmark, but also as a grading-oriented resource.

In this sense, prior RSNA-based studies provide the methodological foundation for lesion localization, while the chest X-ray severity literature provides the conceptual foundation for moving beyond binary disease prediction. Our contribution connects these two traditions by showing how expert localization annotations can be repurposed into a transparent, annotation-driven severity framework for pneumonia burden analysis.

4. Methodology

4.1. RSNA 2018 Bounding Box Annotation Structure and Spatial Distribution

Pneumonia localization annotations in the RSNA dataset are provided separately from the DICOM files in a structured CSV file (stage_2_train_labels.csv). Each entry contains a patientId, bounding box coordinates (x, y, width, height), and a binary target label indicating the presence of pneumonia.

Table 1 summarizes the key properties of the annotation structure. The dataset contains 9,555 bounding box annotations associated exclusively with pneumonia-positive cases. A total of 6,012 patients are labeled as positive, while 20,672 are negative. Negative cases do not contain bounding box annotations, resulting in a strict separation between classification labels and localization data.

Positive cases may contain multiple bounding boxes corresponding to distinct regions of radiographic opacity. On average, each positive patient contains approximately 1.59 bounding boxes, with a maximum of four annotations observed in a single image. No inconsistencies were identified in the labeling: all positive cases include at least one bounding box, and no negative cases contain bounding box annotations.
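The annotation structure described above can be checked with a few lines of code. The following is a minimal sketch, assuming the column layout named in the text (patientId, x, y, width, height, Target) and empty coordinate fields for negative rows; the function name summarize_labels and the sample rows are hypothetical and are not taken from the authors' pipeline.

```python
import csv
import io
from collections import defaultdict

BOX_FIELDS = ("x", "y", "width", "height")


def summarize_labels(csv_file):
    """Count boxes per patient and flag rows where label and box disagree."""
    boxes_per_patient = defaultdict(int)
    targets = {}
    inconsistent = []
    for row in csv.DictReader(csv_file):
        pid = row["patientId"]
        target = int(row["Target"])
        has_box = all(row[field] not in ("", None) for field in BOX_FIELDS)
        targets[pid] = target
        if has_box:
            boxes_per_patient[pid] += 1
        # A positive row must carry a box; a negative row must not.
        if (target == 1) != has_box:
            inconsistent.append(pid)
    n_positive = sum(1 for t in targets.values() if t == 1)
    n_boxes = sum(boxes_per_patient.values())
    return {
        "n_positive": n_positive,
        "n_negative": len(targets) - n_positive,
        "n_boxes": n_boxes,
        "mean_boxes_per_positive": n_boxes / n_positive if n_positive else 0.0,
        "max_boxes": max(boxes_per_patient.values(), default=0),
        "inconsistent": inconsistent,
    }


# Tiny synthetic sample in the same column layout (all values are made up).
SAMPLE = """patientId,x,y,width,height,Target
p1,100,200,150,300,1
p1,600,220,180,350,1
p2,,,,,0
p3,300,400,200,250,1
"""

summary = summarize_labels(io.StringIO(SAMPLE))
print(summary)
```

Applied to the real labels file, the same ratio reproduces the figure quoted above: 9,555 boxes over 6,012 positive patients gives 9,555 / 6,012 ≈ 1.59 boxes per positive patient.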

Figure 2 provides representative examples of bounding box annotations overlaid on chest X-ray images. These examples illustrate both localized and multifocal disease presentations, with variability in bounding box size and position reflecting differences in disease extent and severity.

Together, these visualizations confirm that bounding box annotations are anatomically meaningful and consistently aligned with lung regions. This provides a strong reference for evaluating model explainability, enabling direct comparison between predicted saliency maps and ground-truth pathology locations.

The bounding box overlays shown in Figure 2 were generated using OpenCV version 4.11.0. Each DICOM image was first read with pydicom and converted to an 8-bit grayscale display representation in the range [0,255]. The grayscale image was then transformed to a three-channel BGR image using cv2.cvtColor(..., cv2.COLOR_GRAY2BGR) to enable colored annotation rendering. Bounding boxes were drawn with cv2.rectangle(...), and annotation labels were added with cv2.putText(...). The final visualizations were exported as PNG images using cv2.imwrite(...).

Figure 3 presents an aggregated heatmap of bounding box locations across all pneumonia-positive cases. The heatmap is constructed by accumulating bounding box regions over a normalized image grid, highlighting the spatial distribution of annotated pathology. The distribution reveals two dominant regions corresponding to the left and right lung fields, with higher concentration in the mid-to-lower lung zones. This pattern is consistent with the typical radiographic presentation of pneumonia and indicates that annotations are well localized to clinically relevant areas.
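One way to build such an aggregated heatmap is sketched below. The grid resolution, the assumed square image size of 1024 pixels, and the function name accumulate_heatmap are illustrative assumptions; the text does not specify how the normalized grid was configured.

```python
import numpy as np


def accumulate_heatmap(boxes, image_size=1024, grid=64):
    """Count, for every grid cell, how many boxes overlap it.

    boxes holds (x, y, width, height) tuples in pixel units; image_size and
    grid are assumed values, not taken from the paper.
    """
    heat = np.zeros((grid, grid), dtype=np.float64)
    scale = grid / image_size
    for x, y, w, h in boxes:
        col0 = max(int(np.floor(x * scale)), 0)
        row0 = max(int(np.floor(y * scale)), 0)
        col1 = min(int(np.ceil((x + w) * scale)), grid)
        row1 = min(int(np.ceil((y + h) * scale)), grid)
        heat[row0:row1, col0:col1] += 1.0
    return heat


# Two overlapping synthetic boxes on a coarse 4 x 4 grid.
heat = accumulate_heatmap([(0, 0, 512, 512), (256, 256, 512, 512)], grid=4)
normalized = heat / heat.max()  # scale to [0, 1] for display
print(heat)
```

Dividing by the maximum, as in the last step, is only one possible display normalization; dividing by the number of positive images would instead give the fraction of cases covering each cell.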

4.2. Bounding-Box Extent Analysis and Severity-Oriented Stratification

To derive a severity-aware labeling scheme from the RSNA 2018 pneumonia annotations, we quantified radiographic disease burden at the image level using the extent of the annotated bounding boxes. Only pneumonia-positive chest X-ray (CXR) studies with valid lesion annotations were included in this analysis. More specifically, from the RSNA training labels file (stage_2_train_labels.csv), we retained cases with Target = 1 and non-missing bounding-box coordinates, resulting in a cohort of 6,012 radiographs for severity grading. Normal studies (Target = 0) were not assigned a severity grade, since no lesion burden can be computed for these cases.

For each image $i$, every annotated pneumonia region $r$ was represented by a rectangular bounding box with width $w_{i,r}$ and height $h_{i,r}$. The area of each lesion box was computed as shown in Equation 1:

$$a_{i,r} = w_{i,r} \cdot h_{i,r} \tag{1}$$

The total disease burden for the image was then defined as the sum of all annotated lesion areas, as shown in Equation 2:

$$A_{i} = \sum_{r=1}^{R_{i}} a_{i,r} \tag{2}$$

where $R_{i}$ denotes the number of pneumonia boxes associated with image $i$. This produced a single image-level burden measure, expressed in pixel².
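As a small worked illustration of Equations 1 and 2, the sketch below sums box areas per image. The function name total_burden and the example box sizes are hypothetical. Note that, following Equation 2 literally, overlapping boxes would be counted twice, since areas are summed rather than merged.

```python
from collections import defaultdict


def total_burden(rows):
    """Sum width * height over all boxes of each image (Equations 1 and 2).

    rows is an iterable of (image_id, width, height) tuples for positive boxes.
    """
    burden = defaultdict(float)
    for image_id, width, height in rows:
        burden[image_id] += width * height  # a_{i,r} added into A_i
    return dict(burden)


# Made-up boxes: image p1 has two boxes, image p3 has one.
rows = [("p1", 150, 300), ("p1", 180, 350), ("p3", 200, 250)]
areas = total_burden(rows)
print(areas)  # p1: 45,000 + 63,000 = 108,000 px^2; p3: 50,000 px^2
```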

The present study adopts a simplified three-tier severity split intended to preserve ordinal disease progression while improving class balance for downstream experiments. The three severity categories are defined as follows: Severity 1 (low/mild burden), Severity 2 (moderate burden), and Severity 3 (severe burden), as can be seen in Figure 4. To obtain the thresholds separating these categories, all positive cases were sorted in ascending order of total disease area, and a rank-based optimization procedure was used to identify two cut points near the one-third and two-thirds positions of the ordered list.

The optimization was designed to satisfy two objectives simultaneously: (1) produce approximately equal sample sizes across the three severity tiers, and (2) preserve a male–female composition within each tier that remains close to the overall sex distribution of the positive cohort (reserved for future use). Let n denote the number of graded images, and let $k_{1}$ and $k_{2}$ denote candidate split positions near $\mathrm{round}(n / 3)$ and $\mathrm{round}(2 n / 3)$, respectively. For each feasible pair $(k_{1}, k_{2})$, the objective function shown in Equation 3 was minimized:

L = L_{cnt} + \lambda_{sex} L_{sex}

(3)

where $L_{cnt}$ penalizes deviation from equal-sized thirds, $L_{sex}$ penalizes deviation between tier-specific male prevalence and the global male prevalence in the cohort, and $\lambda_{sex} = 6.0$ controls the relative influence of the sex-balance term. The search was performed in a neighborhood around the ideal tertile ranks, with a constrained minimum bin size to avoid degenerate splits.
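The search over $(k_{1}, k_{2})$ can be sketched as a small grid search. The text does not give the exact forms of $L_{cnt}$ and $L_{sex}$, so the versions below (squared deviation of tier fractions from one third, and absolute deviation of tier male prevalence from the global prevalence) are assumptions made purely for illustration, as are the function name `search_cuts`, the window width, and the minimum bin size.

```python
def search_cuts(is_male, window=25, min_bin=1, lam_sex=6.0):
    """Grid-search rank cut points (k1, k2) minimising L = L_cnt + lam_sex * L_sex.

    `is_male` holds 0/1 flags for images already sorted by ascending total area.
    The concrete loss terms are assumed forms, not the study's definitions.
    Returns (loss, k1, k2), or None if no feasible pair exists.
    """
    n = len(is_male)
    p_global = sum(is_male) / n
    # Prefix sums give the male count of any rank interval in O(1).
    prefix = [0]
    for flag in is_male:
        prefix.append(prefix[-1] + flag)

    c1, c2 = round(n / 3), round(2 * n / 3)
    best = None
    for k1 in range(max(min_bin, c1 - window), c1 + window + 1):
        lo2 = max(k1 + min_bin, c2 - window)
        hi2 = min(n - min_bin, c2 + window)
        for k2 in range(lo2, hi2 + 1):
            bounds = [(0, k1), (k1, k2), (k2, n)]
            l_cnt = sum(((hi - lo) / n - 1 / 3) ** 2 for lo, hi in bounds)
            l_sex = sum(
                abs((prefix[hi] - prefix[lo]) / (hi - lo) - p_global)
                for lo, hi in bounds
            )
            loss = l_cnt + lam_sex * l_sex
            if best is None or loss < best[0]:
                best = (loss, k1, k2)
    return best


# Toy cohort: 60 images with alternating sex flags (hypothetical data)
best = search_cuts([1, 0] * 30, window=3)
```

On this toy cohort the exact tertile ranks already balance both terms, so the search returns them unchanged. On real data the sex term can move the cut points slightly away from the exact tertiles.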

Using this procedure, the final thresholds were obtained at:

c_{1} = 60{,}455 \ \text{px}^{2}, \quad c_{2} = 139{,}133 \ \text{px}^{2} .

The final severity assignment rule was therefore:

\text{Severity 1}: A_{i} < 60{,}455,

\text{Severity 2}: 60{,}455 \leq A_{i} < 139{,}133,

\text{Severity 3}: A_{i} \geq 139{,}133 .
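The assignment rule translates directly into a small function. The sketch below uses the two thresholds reported above. The function name `severity_grade` is a hypothetical choice for illustration.

```python
C1_PX2 = 60_455   # lower threshold c_1 in px^2, as reported in the text
C2_PX2 = 139_133  # upper threshold c_2 in px^2, as reported in the text


def severity_grade(total_area_px2):
    """Map total disease area A_i (px^2) to an ordinal severity grade 1, 2, or 3."""
    if total_area_px2 < C1_PX2:
        return 1
    if total_area_px2 < C2_PX2:
        return 2
    return 3


grades = [severity_grade(a) for a in (10_000, 60_455, 139_132, 139_133)]
```

Both thresholds are inclusive on the lower bound of the higher tier, so an image with exactly 60,455 px² falls in Severity 2 and one with exactly 139,133 px² falls in Severity 3.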

This thresholding strategy yielded a nearly perfectly balanced severity distribution: 1,999 images in Severity 1, 2,006 images in Severity 2, and 2,007 images in Severity 3, corresponding to approximately 33.3%, 33.4%, and 33.4% of the graded pneumonia-positive cohort, respectively. As shown in Figure 5, the resulting class frequencies are effectively uniform, which is advantageous for subsequent classification experiments that compare model behavior across severity-defined subgroups.

To motivate the three-tier grouping, it is useful to examine the raw distribution of total disease burden in the positive cohort. Figure 6 shows the histogram of total disease area per pneumonia-positive chest X-ray, computed as the sum of all annotated bounding-box areas within each image. The distribution is strongly right-skewed, with a large concentration of cases at relatively low burden values and a progressively decreasing number of images extending toward higher disease areas. This confirms that radiographic opacity burden is not uniformly distributed across positive cases and provides the empirical motivation for deriving ordered severity strata from the bounding-box extent variable.

The relationship between disease area and the resulting severity labels is illustrated from multiple perspectives. Figure 7 shows the global distribution of total disease area, stacked by sex, together with the two decision thresholds and shaded severity bands. This visualization confirms the strongly right-skewed nature of the burden distribution and highlights how the three severity intervals partition the lesion-positive cohort.

Figure 8 further characterizes the spread of disease burden within each tier using boxplots and violin plots. As expected, the median and dispersion of total disease area increase monotonically from Severity 1 to Severity 3, indicating that the proposed grading preserves an interpretable ordering of radiographic burden.

Figure 9 provides the empirical cumulative distribution function (ECDF) of total disease area for each severity tier. This plot is useful because it shows how lesion burden accumulates within each category. In particular, the Severity 1 curve rises rapidly at lower area values, Severity 2 occupies an intermediate range, and Severity 3 spans a broader and higher-burden range with a longer upper tail. Thus, the ECDF confirms that the three groups are not only balanced in count, but also remain well ordered with respect to the underlying disease-area variable.

Finally, Figure 10 summarizes the sex distribution within each severity category. Although males remain somewhat more frequent than females across all three tiers, the sex proportions remain reasonably stable after threshold optimization. Specifically, Severity 1 contains 893 female and 1,106 male cases, Severity 2 contains 821 female and 1,185 male cases, and Severity 3 contains 788 female and 1,219 male cases. This indicates that the optimization successfully balanced the severity tiers by count while avoiding extreme sex imbalance in any individual group.

Overall, this bounding-box-based stratification provides a reproducible and interpretable way to transform RSNA lesion annotations into ordinal severity labels. Importantly, the resulting grades should be understood as a proxy for radiographic burden rather than a direct clinical severity score. Nevertheless, they provide a practical and structured basis for severity-aware downstream modeling and subgroup analysis.

4.3. Dataset Split Used in This Study

Although the original RSNA 2018 training folder contains 26,684 DICOM studies, the classification pipeline used in this work was built on a curated subset of this cohort. More specifically, the final dataset used in the study contains four diagnostic categories: Normal, Pneumonia Severity 1 (low/mild), Pneumonia Severity 2 (moderate), and Pneumonia Severity 3 (severe). From the normal category, 8,851 chest X-ray images were retained for the present pipeline, while the pneumonia-positive cases were divided according to the severity-grading procedure described earlier into 1,999 Severity 1 images, 2,006 Severity 2 images, and 2,007 Severity 3 images. In total, the working dataset used in this study therefore comprises 14,863 chest X-ray studies.

To construct the experimental partitions, a stratified class-wise split was applied independently within each of the four categories. For every class, 70% of the available images were assigned to the training subset, 15% to the validation subset, and the remaining 15% to the test subset. This design preserves the original class structure inside each category while ensuring that normal, low/mild, moderate, and severe pneumonia cases are all represented in the three experimental partitions.
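A class-wise 70/15/15 split of this kind can be sketched as follows. The rounding rule (floor for training and validation, remainder to test), the fixed seed, and the function name `stratified_split` are assumptions. The text does not specify how fractional counts were handled.

```python
import random


def stratified_split(samples_by_class, train_pct=70, val_pct=15, seed=0):
    """Split each class independently into train/val/test lists of (sample, label)."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, samples in samples_by_class.items():
        items = list(samples)
        rng.shuffle(items)
        # Integer arithmetic avoids floating-point surprises in the counts.
        n_train = len(items) * train_pct // 100
        n_val = len(items) * val_pct // 100
        parts = {
            "train": items[:n_train],
            "val": items[n_train:n_train + n_val],
            "test": items[n_train + n_val:],  # remainder goes to the test subset
        }
        for name, part in parts.items():
            splits[name].extend((item, label) for item in part)
    return splits


# Toy class lists (hypothetical file names)
toy = {
    "normal": [f"n{i}" for i in range(100)],
    "severity_1_low_mild": [f"s{i}" for i in range(20)],
}
splits = stratified_split(toy)
```

Splitting inside each class, rather than over the pooled dataset, guarantees that every partition contains all four categories in roughly the same proportions.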

The training subset was used for model optimization, whereas the validation subset was employed for monitoring convergence, tuning hyperparameters, and selecting the final checkpoint. The test subset was kept fully separated from the optimization process and was used only after training had been completed. In this sense, the test partition acts as a held-out inference cohort, allowing the final trained model to be evaluated on previously unseen studies in a manner that more closely resembles a practical deployment scenario.

This separation is particularly important in the context of severity-aware classification. Because the final goal is not only to learn the distinction between normal and pneumonia cases, but also to differentiate among multiple severity levels, the test subset must remain untouched during development in order to provide a more realistic estimate of how the model behaves when applied to new chest X-ray images. Thus, the present study adopts a three-way split strategy, in which model fitting is restricted to the training subset, model selection is guided by the validation subset, and the final performance assessment is performed later on the held-out test subset in inference mode.

Figure 11 summarizes the dataset organization used in this study. It illustrates both the class composition of the final working cohort and the 70/15/15 train–validation–test rule applied to each class. The resulting protocol provides a consistent basis for training and evaluating the proposed severity-aware pipeline, while also preserving a separate test stage that can be interpreted as a simulation of clinical inference on unseen data.

4.4. Budget-Aware ViT Configuration Exploration

Vision Transformer fine-tuning introduces a large hyperparameter search space. In principle, the exploration could include many combinations of backbone family, patch size, image resolution, batch size, optimizer, learning rate, weight decay, classification-head design, data augmentation, scheduler type, number of epochs, and fine-tuning strategy. Exhaustively evaluating all possible combinations would rapidly lead to hundreds or thousands of training runs. However, this was not practical in the present study because each ViT experiment required substantial GPU time, memory, storage, and wall-clock duration, especially when using high-resolution $512 \times 512$ input images and large ViT backbones.

For this reason, we adopted a budget-aware experimental design. Rather than performing a fully exhaustive neural architecture search, we selected a compact but systematic set of configurations that would allow controlled comparison across the most relevant ViT design dimensions. The search was restricted to four widely used Vision Transformer backbones: ViT-B/16, ViT-L/16, ViT-B/32, and ViT-L/32. These models allow comparison between base and large model capacity, as well as between 16-pixel and 32-pixel patch sizes. Since patch size directly affects the number of image tokens and the amount of local radiographic detail preserved, this comparison was particularly relevant for chest X-ray pneumonia analysis.
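The effect of patch size on sequence length can be made concrete with a short calculation. The sketch below assumes the standard ViT layout with non-overlapping square patches and one extra class token. The function name `vit_token_count` is hypothetical.

```python
def vit_token_count(image_size, patch_size, class_token=True):
    """Number of tokens a ViT processes for a square image of side `image_size`."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size          # patches per side
    return grid * grid + (1 if class_token else 0)


tokens_p16 = vit_token_count(512, 16)  # 32 x 32 patches + class token
tokens_p32 = vit_token_count(512, 32)  # 16 x 16 patches + class token
```

At $512 \times 512$ resolution, the 16-pixel models therefore see roughly four times as many tokens as the 32-pixel models (1,025 versus 257), which preserves finer spatial detail at a higher attention cost.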

The main tunable optimization parameters were the learning rate and weight decay. Three learning rates were evaluated, $3 \times 10^{-4}$, $1 \times 10^{-4}$, and $5 \times 10^{-5}$, together with two weight-decay values, 0.05 and 0.01. This produced six optimization configurations per backbone. Combined with the four ViT backbones, the final exploration consisted of:

4 \ \text{backbones} \times 3 \ \text{learning rates} \times 2 \ \text{weight decays} = 24 \ \text{runs} .

(4)
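The 24-run grid in Equation 4 can be enumerated with a Cartesian product. The backbone identifiers follow the torchvision-style names used elsewhere in the paper. The ordering and run numbering below are assumptions for illustration and need not match the numbering used in the study's tables.

```python
from itertools import product

BACKBONES = ["vit_b_16", "vit_l_16", "vit_b_32", "vit_l_32"]
LEARNING_RATES = [3e-4, 1e-4, 5e-5]
WEIGHT_DECAYS = [0.05, 0.01]

# One dict per run; numbering starts at 1 (assumed ordering).
runs = [
    {"run": idx, "backbone": backbone, "lr": lr, "weight_decay": wd}
    for idx, (backbone, lr, wd) in enumerate(
        product(BACKBONES, LEARNING_RATES, WEIGHT_DECAYS), start=1
    )
]
```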

All other training settings were fixed to ensure fair comparison. Each run used $512 \times 512$ input images, batch size 8, ImageNet-pretrained initialization, AdamW optimization, automatic mixed precision, cosine annealing learning-rate scheduling, GELU activation, and a single-layer classification head without dropout. Each configuration was trained for 50 epochs during the exploration stage. By fixing these parameters, the experiments focused specifically on the effect of backbone choice, patch size, learning rate, and weight decay.
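For readers unfamiliar with cosine annealing, the following sketch evaluates the usual closed-form schedule over a 50-epoch run. The minimum learning rate of zero, the per-epoch stepping, and the absence of warm-up are assumptions. The text does not state these details.

```python
import math


def cosine_annealed_lr(epoch, base_lr, total_epochs=50, eta_min=0.0):
    """Closed-form cosine annealing from base_lr at epoch 0 to eta_min at total_epochs."""
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Learning rate at the start of each epoch, plus the end point
schedule = [cosine_annealed_lr(e, base_lr=5e-5) for e in range(51)]
```

Under these assumptions the learning rate starts at the configured value, reaches half of it at epoch 25, and decays smoothly toward zero by the end of training.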

The same 24-run design was applied consistently to the classical binary model exploration and to each severity-specific binary branch. This enabled direct comparison of configuration behavior across the normal-versus-pneumonia baseline and the normal-versus-severity classification tasks. For the severity-specific experiments, each branch was trained as a binary classifier: normal versus Severity 1, normal versus Severity 2, or normal versus Severity 3. This design allowed the study to identify the strongest ViT configuration for each branch while keeping the search computationally feasible.

Figure 12 summarizes the budget-aware configuration strategy. The figure illustrates how a potentially large ViT search space was reduced to a focused 24-run design by selecting the most important architectural and optimization dimensions. This compact design does not represent a full exhaustive NAS study, but rather a controlled exploration intended to maximize experimental insight under realistic computational constraints.

This design should therefore be interpreted as a focused model exploration rather than an exhaustive hyperparameter optimization study. The goal was not to claim that the globally optimal ViT configuration was found, but to identify strong and computationally practical candidates for subsequent severity-aware analysis. The selected 24-run search space provided a balanced compromise between experimental coverage and feasibility, making it possible to compare multiple ViT backbones across the classical binary and severity-specific pneumonia grading tasks.

4.5. Vision Transformer Training Pipeline Used in This Study

The overall training pipeline used in this study is summarized in Figure 13. The workflow was designed to support both severity-aware four-class classification and binary severity-specific experiments derived from the RSNA 2018 chest X-ray dataset. At a high level, the pipeline consists of four main stages: dataset preparation, class-wise data splitting, automated Vision Transformer configuration search, and downstream model training and evaluation.

The starting point of the pipeline is the RSNA 2018 chest X-ray collection, which is reorganized into a JPEG-based folder structure compatible with the training scripts. In the four-class severity-aware setting, the dataset is arranged into one normal class and three pneumonia severity classes: severity_1_low_mild, severity_2_moderate, and severity_3_severe. The main training script supports this RSNA severity layout directly, while also retaining compatibility with simpler binary ImageFolder-style datasets when needed. During loading, the script detects the folder layout automatically and constructs the corresponding sample list and class-to-index mapping.

After dataset construction, a stratified split is applied independently within each class. In the current study, the pipeline uses a class-wise 70%/15%/15% partition into training, validation, and testing subsets, thereby preserving the class composition inside each severity category. The data-loading code also supports optional class-balanced sampling through a WeightedRandomSampler, which can be enabled to compensate for class imbalance during training. Training and evaluation transforms are defined separately: the training transform includes resizing, random resized cropping, horizontal flipping, and small rotations, whereas the validation and test transforms use deterministic resizing and center cropping. The scripts further support high-resolution inputs, such as 512×512, which are relevant in this work.

The model search and training stage is based on an automated NAS wrapper built around Vision Transformers. The NAS script explores combinations of ViT architectures and optimization hyperparameters through a configurable search space. In the present implementation, the search includes multiple ViT variants (such as vit_b_16, vit_l_16, vit_b_32, and vit_l_32), input image sizes, batch sizes, learning rates, weight decay values, optimizer choices, scheduler types, mixed-precision settings, and head configurations. A particularly important feature of the pipeline is that it adapts torchvision ViT models to custom image sizes by resizing positional embeddings, which allows pretrained transformers to be used at larger square resolutions than their original defaults.

For each NAS trial, the pipeline builds the dataloaders, initializes the corresponding ViT model, applies the selected optimizer and learning-rate scheduler, and trains the model while monitoring validation behavior. Standard classification metrics are recorded, including accuracy, precision, recall, F1-score, and confusion matrices. The scripts also save trial-specific artifacts such as configuration files, training histories, split summaries, best checkpoints, and evaluation outputs. This makes the workflow reproducible and suitable for systematic comparison across model variants and severity definitions.

In addition to the four-class framework, the study also uses three dedicated binary NAS wrappers, each targeting one severity subgroup against the normal class. More specifically, separate scripts are provided for normal versus severity_1_low_mild, normal versus severity_2_moderate, and normal versus severity_3_severe. These wrappers reuse the same end-to-end NAS and training logic, but filter the dataset to keep only the normal class and the selected severity class, excluding the remaining pneumonia categories. In this way, the pipeline supports both a global severity-aware experiment and specialized binary branches that allow a more targeted analysis of separability at each radiographic burden level.
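The filtering step performed by the binary wrappers can be sketched as below. The severity folder names are those given in the text. The folder name `normal`, the (path, class name) sample format, and the function name `filter_binary` are assumptions for illustration.

```python
def filter_binary(samples, positive_class, negative_class="normal"):
    """Keep only the normal class and one severity class, relabelled as 0 and 1.

    `samples` is a list of (path, class_name) pairs; other classes are dropped.
    """
    mapping = {negative_class: 0, positive_class: 1}
    return [(path, mapping[name]) for path, name in samples if name in mapping]


# Toy sample list (hypothetical file names)
samples = [
    ("a.jpg", "normal"),
    ("b.jpg", "severity_1_low_mild"),
    ("c.jpg", "severity_2_moderate"),
    ("d.jpg", "severity_3_severe"),
]
sev1_branch = filter_binary(samples, "severity_1_low_mild")
```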

Overall, the resulting pipeline provides a unified experimental framework for severity-aware chest X-ray analysis. It combines structured RSNA severity-labeled data preparation, class-wise stratified splitting, automated exploration of ViT architectures and hyperparameters, and both multiclass and binary severity-specific evaluation. This design enables a consistent comparison between baseline and severity-specialized models while preserving a transparent and reproducible training workflow.

4.6. Specialized Binary ViT Branch for Severity 1

Figure 14 presents the specialized binary Vision Transformer pipeline used for the Severity 1 branch. In this experiment, the RSNA severity-oriented dataset is filtered so that only two classes remain: Normal and Pneumonia Severity 1 (low/mild). The binary wrapper script enforces this mapping explicitly by keeping only normal and severity_1_low_mild samples and excluding the moderate and severe pneumonia groups. This design allows the model to focus on the most subtle pneumonia classification scenario considered in the study.

After dataset filtering, the selected chest X-ray images undergo the preprocessing pipeline implemented in the training framework, including resizing to the target resolution, normalization, and data augmentation for the training subset. The resulting inputs are then forwarded to a pretrained Vision Transformer model adapted to the selected image size. In the present study, the best-performing Severity 1 branch was obtained with a ViT-B/16 model, using $512 \times 512$ input images and hyperparameters chosen through the automated NAS procedure. The model therefore learns a binary decision boundary between normal chest radiographs and low/mild pneumonia cases, rather than solving a multiclass severity problem.

The final output block consists of a binary classification head with softmax probabilities for the two target classes, producing one confidence score for Normal and one for Low/Mild Pneumonia. As highlighted by this figure, the Severity 1 branch represents the most challenging specialized experiment in the paper, since the radiographic difference between normal lungs and low-burden pneumonia is often small. For this reason, the branch provides an important test of whether a dedicated binary ViT can capture subtle opacity patterns that may be diluted in broader classification settings.
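The two-class softmax output can be illustrated with a minimal, numerically stable implementation. The logit values below are hypothetical and only show how two raw head outputs become confidence scores that sum to one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)                      # subtract the max to avoid overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical head outputs for one image: [normal, low/mild pneumonia]
p_normal, p_low_mild = softmax([0.4, 1.7])
```

The predicted class is the one with the larger probability. With only two classes, this is equivalent to thresholding the pneumonia probability at 0.5.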

4.7. Performance Metrics

To evaluate the classification performance of the proposed Vision Transformer models, we used standard metrics derived from the confusion matrix, including accuracy, precision, recall, and F1-score. These metrics were computed for training, validation, and testing outputs, depending on the experiment. In addition, per-class accuracy was used to assess how well each individual class was recognized, which is particularly important in the present study because the difficulty of the classification task differs across severity levels.

Let $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The overall classification accuracy is defined as

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} .

(5)

Accuracy measures the proportion of correctly classified samples among all evaluated samples. In the multiclass setting, this corresponds to the fraction of total predictions whose predicted class matches the ground-truth class.

Precision measures how many of the samples predicted as positive are actually positive, and is defined as

\mathrm{Precision} = \frac{TP}{TP + FP} .

(6)

Recall, also referred to as sensitivity, measures how many of the truly positive samples are correctly identified by the model:

\mathrm{Recall} = \frac{TP}{TP + FN} .

(7)

The F1-score combines precision and recall into a single harmonic-mean metric and is given by

\mathrm{F1\text{-}score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} .

(8)
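Equations 5 to 8 can be checked with a short worked example. The counts below are hypothetical and chosen only to make the arithmetic easy to follow. The function name `binary_metrics` and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    def safe_div(num, den):
        return num / den if den else 0.0     # assumed convention for empty denominators

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts: 200 evaluated images
metrics = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```

For these counts, accuracy is 175/200 = 0.875, precision is 90/105 ≈ 0.857, recall is 90/100 = 0.900, and the F1-score is 180/205 ≈ 0.878.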

In the multiclass experiments, macro-averaged precision, recall, and F1-score were used in order to give equal importance to each class regardless of class frequency. If C denotes the number of classes and $M_{c}$ denotes the metric value computed for class c, then the macro-average is defined as

\mathrm{MacroAvg}(M) = \frac{1}{C} \sum_{c = 1}^{C} M_{c} .

(9)

To better understand model behavior across the severity-aware labels, we also report per-class accuracy. For a given class c, per-class accuracy is defined as

\mathrm{Accuracy}_{c} = \frac{N_{c}^{(correct)}}{N_{c}^{(total)}},

(10)

where $N_{c}^{(correct)}$ is the number of correctly classified samples belonging to class c, and $N_{c}^{(total)}$ is the total number of samples in that class. This metric is especially useful in the present work because it highlights the relative difficulty of distinguishing subtle classes such as normal versus low/mild pneumonia.

For model selection during training, validation accuracy was monitored at each epoch. Two summary indicators were retained: the final validation accuracy, corresponding to the accuracy obtained at the last training epoch, and the maximum validation accuracy, corresponding to the best validation accuracy achieved during the full training run. These measures were used together with training duration to compare candidate configurations during the hyperparameter search process.
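The two summary indicators can be extracted from an epoch-wise validation history as sketched below. The history values are hypothetical, and the 1-based epoch numbering and the function name `summarize_validation` are assumptions.

```python
def summarize_validation(val_acc):
    """Return final and maximum validation accuracy plus the 1-based best epoch."""
    # max() keeps the earliest epoch when several epochs tie for the best value.
    best_index = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        "final_val_acc": val_acc[-1],
        "max_val_acc": val_acc[best_index],
        "best_epoch": best_index + 1,
    }


history = [0.90, 0.93, 0.95, 0.94]  # hypothetical per-epoch validation accuracy
summary = summarize_validation(history)
```

A gap between the two indicators, as in this toy history, signals that the last checkpoint is not the best one and that validation-based checkpoint selection matters.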

Finally, confusion matrices were generated to provide a class-wise view of prediction errors. A confusion matrix summarizes how many samples from each ground-truth class were assigned to each predicted class, thereby offering additional insight into which categories are most frequently confused.

For a classification problem with C classes, the confusion matrix is defined as

\mathbf{C} = [C_{i j}] \in \mathbb{N}^{C \times C},

(11)

where each element $C_{i j}$ denotes the number of samples whose true class is i and whose predicted class is j. Formally,

C_{i j} = \sum_{n = 1}^{N} \mathbf{1} (y_{n} = i \land \hat{y}_{n} = j),

(12)

where N is the total number of evaluated samples, $y_{n}$ is the ground-truth label of sample n, $\hat{y}_{n}$ is the predicted label, and $\mathbf{1}(\cdot)$ is the indicator function, which equals 1 when the condition is true and 0 otherwise.

In this representation, the diagonal entries $C_{i i}$ correspond to correctly classified samples, while the off-diagonal entries represent misclassifications between classes. Therefore, the confusion matrix provides a more detailed view of model behavior than scalar metrics alone, especially in severity-aware settings where some classes may be more easily confused than others.

Using the confusion matrix notation, overall multiclass accuracy can be written as

Accuracy = \frac{\sum_{i = 1}^{C} C_{i i}}{\sum_{i = 1}^{C} \sum_{j = 1}^{C} C_{i j}} .

(13)
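The confusion-matrix definitions in Equations 10 to 13, together with the macro-average of Equation 9, can be sketched in a few lines. The labels below are a hypothetical three-class toy example with integer class indices starting at 0. All function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i][j] counts samples with true class i and predicted class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    """Trace of the matrix divided by the total number of samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def per_class_accuracy(cm):
    """Diagonal entry divided by the row sum, one value per true class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def macro_avg(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = overall_accuracy(cm)
class_acc = per_class_accuracy(cm)
```

In this toy case six of eight samples lie on the diagonal, so overall accuracy is 0.75, while the per-class values (2/3, 1, 2/3) show which classes absorb the errors.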

4.8. Training Hardware and Software Environment

All model training experiments in this study were executed on a cloud-based environment provided by Runpod.io. The training setup used a single NVIDIA RTX 5090 GPU with 32 GB of VRAM, accompanied by 60 GB of system RAM, 12 virtual CPUs, and 80 GB of available disk space. The software environment was based on a Runpod PyTorch 2.8.0 container, specifically configured with a CUDA-enabled runtime suitable for GPU-accelerated deep learning workloads.

Figure 15 summarizes the hardware and software environment used for training. This setup provided sufficient computational resources for Vision Transformer training and experimentation, while also offering a reproducible cloud-based configuration for future extensions of the study. The use of a standardized PyTorch environment helped ensure compatibility with the implemented training pipeline, data loading procedures, and model fine-tuning workflow.

5. Results

5.1. Qualitative Inspection of the Generated Severity Split

To qualitatively verify the consistency of the proposed three-tier severity grading scheme, we manually inspected a second dataset generated for debugging and visualization purposes, distinct from the dataset used for model training. In this auxiliary dataset, the original RSNA 2018 chest X-ray images were rendered together with their annotated pneumonia bounding boxes, which were overlaid using OpenCV. The resulting visualization is shown in Figure 16.

The figure provides a clear visual progression across the three severity levels. In the Severity 1 row (low/mild burden), the annotated regions are generally small and localized, typically involving one limited opacity region or a small number of compact boxes. In the Severity 2 row (moderate burden), the lesion boxes become larger and, in several cases, more numerous, indicating a broader radiographic extent of pneumonia-related involvement. Finally, in the Severity 3 row (severe burden), the annotated regions are visibly larger, often spanning substantial portions of one or both lungs, and multiple boxes are more frequently present. This confirms that the severity split derived from total bounding-box area is visually coherent and consistent with the intended ordering of disease burden.

Beyond the annotation geometry itself, the radiographic appearance of the pneumonia also changes progressively across the three groups. As severity increases from Severity 1 to Severity 3, the white opacity patterns become more extensive and more visually prominent, indicating a larger burden of abnormal lung involvement. In the low/mild group, opacities tend to be focal and relatively limited. In the moderate group, they are more apparent and occupy a larger fraction of the lung field. In the severe group, the opacities are often widespread, denser, and distributed over large lung regions, consistent with more substantial radiographic disease burden. Thus, the visual patterns in the underlying chest radiographs align well with the annotation-based stratification.

It is important to note that this figure is intended as a qualitative validation of the generated severity split rather than as a formal statistical result. The images shown were selected from a debugging-oriented derived dataset and were not part of the actual training subset. Nevertheless, the visualization is useful because it confirms that the automated severity partitioning based on summed bounding-box area corresponds to an interpretable increase in both annotation extent and visible pulmonary opacity. This supports the validity of the proposed grading approach as a radiographic burden proxy for downstream experiments.

5.2. Severity 1 ViT Model Exploration

To evaluate the suitability of Vision Transformer (ViT) architectures for the low/mild pneumonia severity category, we performed a focused exploration of the Severity 1 binary classification branch. Four pretrained ViT backbones were evaluated: ViT-B/16, ViT-L/16, ViT-B/32, and ViT-L/32. For each backbone, six fine-tuning configurations were tested, resulting in a total of 24 runs. The explored configurations varied the learning rate and weight decay, while the remaining training setup was kept fixed across all experiments.

All runs used an input resolution of $512 \times 512$, batch size of 8, ImageNet-pretrained weights, automatic mixed precision (AMP), AdamW optimization, a cosine annealing learning-rate scheduler, and a single-layer classification head with no dropout. The purpose of this exploration was not only to identify the highest-performing Severity 1 classifier, but also to compare the trade-off between validation accuracy and training duration across different ViT model sizes and patch resolutions.

Table 2 summarizes the 24 training runs. For each configuration, the table reports the final training accuracy, final validation accuracy, maximum training accuracy, maximum validation accuracy, best epoch according to validation performance, and total training duration in seconds.

The best overall validation result was obtained by the ViT-B/16 backbone using a learning rate of $5 \times 10^{-5}$ and weight decay of 0.01. This configuration reached a maximum validation accuracy of 94.98% and a final validation accuracy of 94.42%, while requiring 7,483.2 s of training. The second strongest maximum validation performance was achieved by ViT-L/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05, reaching 94.70%. However, the ViT-L/16 runs required substantially longer training time, with durations close to 19,500 s, making them considerably more computationally expensive than ViT-B/16.

The comparison between patch resolutions shows that 16-pixel patch models achieved better validation performance than their 32-pixel counterparts. ViT-B/16 and ViT-L/16 both exceeded 94.5% maximum validation accuracy, whereas ViT-B/32 and ViT-L/32 remained below 94.0%. This suggests that the smaller patch size preserved more spatial detail relevant to the detection of low/mild pneumonia patterns, which are expected to have smaller and more localized radiographic extent.

A consistent trend was also observed with respect to the learning rate. For all four backbones, the best validation performance was obtained using the lowest tested learning rate, $5 \times 10^{-5}$. This indicates that Severity 1 fine-tuning benefits from a more conservative optimization regime. At the same time, several configurations reached nearly perfect training accuracy while validation accuracy saturated around 94–95%, suggesting that additional generalization improvements may require stronger data augmentation, improved regularization, or severity-aware sampling strategies.

To further summarize the comparison across the four backbones, Figure 17 presents the final and maximum validation accuracy for all 24 runs, grouped by model family and configuration. Figure 18 shows the corresponding training duration comparison. These plots highlight the main trade-off observed in the Severity 1 exploration: ViT-L/16 provides competitive accuracy but at a substantially higher computational cost, while ViT-B/16 achieves the best overall balance between validation performance and training efficiency.

Based on these results, ViT-B/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.01 (run #6) was selected as the preferred Severity 1 classifier for subsequent severity-aware inference experiments. This configuration achieved the highest validation accuracy while remaining substantially more efficient than the larger ViT-L/16 alternative.

To further inspect the behavior of the selected Severity 1 model, we analyzed the complete training history of the best-performing ViT-B/16 configuration. This additional analysis is important because the 24-run exploration identifies the best configuration at the summary level, while the epoch-wise curves reveal how the model converged, whether the final checkpoint was also the best checkpoint, and whether signs of overfitting appeared during training. Since Severity 1 corresponds to low/mild pneumonia burden, this branch is expected to be the most difficult among the three severity-specific classifiers, as the radiographic opacity patterns are smaller and more localized.

Figure 19 shows the training and validation accuracy curves for the selected Severity 1 ViT-B/16 model (run #6). The training accuracy increases steadily and approaches near-complete convergence by the final epochs. The validation accuracy also improves substantially during training, but it reaches its best value at epoch 39, well before the final epoch. This confirms that validation-based checkpoint selection is necessary, since the final epoch does not necessarily correspond to the best generalization point. In this experiment, the best validation checkpoint reached a maximum validation accuracy of 94.98%, while the final validation accuracy was slightly lower (94.42%).

Figure 20 presents the corresponding training and validation loss curves. The training loss decreases consistently throughout the 50-epoch schedule, indicating that the model continues to fit the training data. However, the validation loss does not decrease in the same monotonic manner and shows signs of saturation and later instability. This behavior suggests mild overfitting after the best validation checkpoint, which is consistent with the observed gap between the final and maximum validation accuracy. Therefore, saving the best validation checkpoint is more appropriate than relying only on the final training epoch.

In addition to accuracy and loss, we also examined validation macro-precision, macro-recall, and macro-F1, as shown in Figure 21. These metrics provide a more balanced view of performance than accuracy alone, because they account for both classes equally. This is particularly relevant for the Severity 1 branch, where low/mild pneumonia cases may be visually closer to normal radiographs. The macro-metric curves show that the model reaches strong balanced performance, but also that the validation behavior fluctuates across epochs, again supporting the need for checkpoint selection based on validation performance rather than final-epoch performance alone.

Finally, Figure 22 reports the class-specific validation accuracy for the normal class and the Severity 1 pneumonia class. This analysis is useful because it shows whether the model performs uniformly across both classes or whether one class remains more difficult. The results indicate that low/mild pneumonia classification is more challenging than normal-image recognition, which is consistent with the subtle radiographic presentation of Severity 1 cases. This supports the interpretation that the Severity 1 branch is the hardest severity-specific classifier and may benefit from additional data augmentation, longer training with early stopping, or calibration strategies in future experiments.

Overall, the training-history analysis confirms that the selected Severity 1 ViT-B/16 model learns a strong representation for low/mild pneumonia detection, but also highlights the difficulty of this branch compared with the higher-severity models, as we will see in the following subsections. The best validation performance occurs before the final epoch, and the validation loss behavior suggests that later epochs may introduce mild overfitting. These observations motivate the use of validation-based checkpoint selection and support future experiments with longer schedules, early stopping, stronger augmentation, and calibrated probability outputs.

To further examine the class-level behavior of the selected Severity 1 model, Figure 23 presents the training and validation confusion matrices at the final epoch. This analysis complements the class-specific validation accuracy curves by showing the absolute number of correct and incorrect predictions for both classes: normal and Severity 1 low/mild pneumonia.

The training confusion matrix shows near-complete separation between the two classes, with 7074 normal images correctly classified as normal and 1594 Severity 1 images correctly classified as low/mild pneumonia. Only 7 normal images were incorrectly predicted as Severity 1, and only 5 Severity 1 images were incorrectly predicted as normal. This confirms that the selected ViT-B/16 model fits the training distribution very closely.

The validation confusion matrix reveals the main generalization challenge of the Severity 1 branch. The model correctly classifies 1721 normal validation images and 328 Severity 1 validation images. However, 72 Severity 1 images are misclassified as normal, while 49 normal images are misclassified as Severity 1. This indicates that the most important error mode is false-negative behavior for low/mild pneumonia, where subtle radiographic abnormalities may be insufficiently visible or overlap with normal anatomical variation. This supports the interpretation that Severity 1 is the most difficult severity-specific branch and may require additional augmentation, calibration, or explainability-based refinement.
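The validation counts above are sufficient to recompute the summary metrics. The following sketch is one way to do so; the function name `binary_metrics` is hypothetical, and macro-F1 is assumed here to be the unweighted mean of the two per-class F1 scores, which may differ from the definition used for Figure 21.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy and macro-averaged metrics from a 2x2 confusion matrix.

    The positive class is Severity 1 pneumonia; the negative class is normal.
    """
    total = tn + fp + fn + tp
    recall_pos = tp / (tp + fn)       # sensitivity for Severity 1
    recall_neg = tn / (tn + fp)       # specificity, i.e. recall for normal
    precision_pos = tp / (tp + fp)
    precision_neg = tn / (tn + fn)
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "recall_pos": recall_pos,
        "recall_neg": recall_neg,
        "macro_precision": (precision_pos + precision_neg) / 2,
        "macro_recall": (recall_pos + recall_neg) / 2,
        # ASSUMPTION: macro-F1 as the mean of per-class F1 scores.
        "macro_f1": (f1_pos + f1_neg) / 2,
    }


# Final-epoch validation counts reported for the Severity 1 branch.
sev1 = binary_metrics(tn=1721, fp=49, fn=72, tp=328)
```

Under these assumptions, the counts reproduce the final validation accuracy of 94.42% and give a Severity 1 recall of 82.0% against a normal-class recall of about 97.2%, which quantifies the false-negative tendency described above.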

Together with the class-specific validation accuracy curves, the confusion matrix confirms that the normal class is easier to recognize than the low/mild pneumonia class, while most clinically relevant errors occur in the Severity 1 false-negative direction.

5.3. Severity 2 ViT Model Exploration

To evaluate Vision Transformer performance for the moderate pneumonia burden category, we performed a dedicated model exploration for the Severity 2 binary classification branch. As in the Severity 1 experiment, four pretrained ViT backbones were evaluated: ViT-B/16, ViT-L/16, ViT-B/32, and ViT-L/32. For each backbone, six fine-tuning configurations were tested, resulting in a total of 24 training runs.

All runs used an input resolution of $512 \times 512$, batch size 8, ImageNet-pretrained initialization, automatic mixed precision (AMP), AdamW optimization, a cosine annealing learning-rate scheduler, GELU activation, and a single-layer classification head without dropout. The explored hyperparameters were the learning rate and weight decay. The objective of this experiment was to identify the most suitable backbone and optimization configuration for the Severity 2 classifier, while also considering the computational cost of each model.
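The 24-run grid can be enumerated as in the sketch below. Several details are assumptions made only for illustration: the text names the learning rates $5 \times 10^{-5}$ and $3 \times 10^{-4}$ and the weight decays 0.05 and 0.01, but the third learning rate is not stated here, so `MIDDLE_LR_PLACEHOLDER` is a hypothetical value. The run ordering is likewise inferred so that it agrees with the run numbers quoted in the text (#5, #6, and #11) and may not match the original tables in every row.

```python
from itertools import product

BACKBONES = ["ViT-B/16", "ViT-L/16", "ViT-B/32", "ViT-L/32"]

# ASSUMPTION: placeholder only; the middle learning rate is not given in the text.
MIDDLE_LR_PLACEHOLDER = 1e-4
LEARNING_RATES = [3e-4, MIDDLE_LR_PLACEHOLDER, 5e-5]
WEIGHT_DECAYS = [0.05, 0.01]

# Enumerate runs with the backbone as the slowest-varying factor
# (ASSUMPTION: ordering inferred from the quoted run numbers).
runs = {
    run_id: {"backbone": backbone, "lr": lr, "weight_decay": wd}
    for run_id, (backbone, lr, wd) in enumerate(
        product(BACKBONES, LEARNING_RATES, WEIGHT_DECAYS), start=1
    )
}
```

With this ordering, `runs[5]` is ViT-B/16 with learning rate $5 \times 10^{-5}$ and weight decay 0.05, `runs[6]` is the same backbone with weight decay 0.01, and `runs[11]` is ViT-L/16 with learning rate $5 \times 10^{-5}$ and weight decay 0.05.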

Table 3 summarizes the complete Severity 2 exploration. For each run, the table reports the final training accuracy, final validation accuracy, maximum training accuracy, maximum validation accuracy, the epoch at which the maximum validation accuracy was achieved, and the total training duration in seconds.

To further summarize the Severity 2 exploration, Figure 24 compares the final and maximum validation accuracy across all 24 runs, grouped by ViT backbone and training configuration. The figure shows that the best validation performance was obtained with the lowest tested learning rate, $5 \times 10^{-5}$, across multiple backbone families. The highest maximum validation accuracy was 97.88%, reached by ViT-B/16, ViT-L/16, and ViT-L/32 configurations. However, among these top-performing runs, ViT-B/16 achieved the highest final validation accuracy, reaching 97.84%, which indicates better stability at the end of training.

Figure 25 presents the corresponding training duration for each run. The results show a clear computational difference between the evaluated backbones. ViT-L/16 required the longest training time, with an average duration of approximately 19,587 s per run, while ViT-B/16 required approximately 7,542 s per run. ViT-B/32 was slightly faster on average, requiring approximately 6,862 s, but it did not reach the same validation accuracy as ViT-B/16. Therefore, ViT-B/16 provided the best trade-off between validation performance and computational efficiency for the Severity 2 classifier.

Based on the combined accuracy and duration analysis, ViT-B/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05 was selected as the preferred Severity 2 model configuration. This configuration reached a maximum validation accuracy of 97.88%, achieved the best final validation accuracy of 97.84%, and required substantially less training time than the larger ViT-L/16 alternatives.

The best maximum validation accuracy obtained in the Severity 2 exploration was 97.88%. This value was reached by multiple configurations: ViT-B/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05, ViT-L/16 with a learning rate of $5 \times 10^{-5}$ and both tested weight-decay values, and ViT-L/32 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05. Among these tied configurations, ViT-B/16 achieved the best final validation accuracy, reaching 97.84%, while also requiring substantially less training time than ViT-L/16.

The results show that Severity 2 classification achieved higher validation accuracy than the Severity 1 branch, with several models approaching or reaching 98% maximum validation accuracy. This suggests that moderate pneumonia cases may provide more visually distinguishable radiographic patterns than low/mild cases, likely because the annotated opacity burden is larger and more clearly expressed in the image. As a result, the distinction between normal and Severity 2 pneumonia appears to be more separable for ViT-based classifiers.

A consistent trend was observed across all four backbone families: the best-performing configurations used the lowest tested learning rate, $5 \times 10^{-5}$. This confirms the pattern already observed in the Severity 1 exploration and indicates that conservative fine-tuning is beneficial for pneumonia severity classification using pretrained ViTs. Higher learning rates, particularly $3 \times 10^{-4}$, produced noticeably lower validation accuracies across all backbones.

From the perspective of backbone comparison, ViT-B/16 provided the most favorable balance between validation performance and computational efficiency. Although ViT-L/16 matched the best maximum validation accuracy, its training duration was approximately 19,500–19,700 s per run, compared with approximately 7,500 s for ViT-B/16. Therefore, the larger ViT-L/16 backbone did not provide a sufficient accuracy advantage to justify its substantially higher computational cost in this experiment.

The patch-size comparison also indicates that the 16-pixel patch configuration remained competitive for Severity 2 classification. ViT-B/16 achieved the best final validation accuracy and matched the best maximum validation accuracy, while ViT-B/32 remained slightly lower, with a maximum validation accuracy of 97.60%. ViT-L/32 also reached 97.88%, but its best final validation accuracy remained below the best ViT-B/16 configuration. These results suggest that smaller patches may preserve useful spatial detail for identifying pneumonia-related opacities, while the base-size architecture remains computationally more efficient than the large variants.

Based on these results, ViT-B/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05 (run #5) was selected as the preferred Severity 2 classifier. This configuration achieved the best maximum validation accuracy, the highest final validation accuracy among the top-performing runs, and a substantially lower training duration than the larger ViT-L/16 alternatives.

To further examine the behavior of the selected Severity 2 model, we analyzed the epoch-wise training history of the best-performing ViT-B/16 configuration. This configuration corresponds to run #5, using a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05. While the 24-run exploration table identifies this configuration as the strongest Severity 2 candidate, the training-history curves provide additional insight into the convergence behavior, validation stability, and class-level performance of the selected model.

Figure 26 shows the training and validation accuracy curves for the selected Severity 2 ViT-B/16 model. The model reaches its best validation checkpoint early in the training process, at epoch 13, with a maximum validation accuracy of 97.88%. The final validation accuracy remains very close, reaching 97.84%, which indicates that the model maintains stable validation performance after the best checkpoint. Compared with the Severity 1 branch, the Severity 2 model shows stronger validation performance and a smaller gap between the best and final validation accuracy, suggesting that moderate pneumonia cases are more visually separable from normal radiographs.

Figure 27 presents the corresponding training and validation loss curves. The training loss decreases rapidly during the early epochs and remains low throughout training, indicating strong fitting of the training distribution. The validation loss also drops sharply at the beginning and remains relatively stable afterward, although some fluctuations are visible across later epochs. This pattern suggests that the model converges quickly and that the selected learning rate is appropriate for Severity 2 fine-tuning. The absence of a strong divergence between training and validation behavior also indicates better generalization stability than observed for the lower-severity branch.

To complement the accuracy and loss analysis, Figure 28 shows the validation macro-precision, macro-recall, and macro-F1 curves. These metrics are important because they provide a class-balanced view of model performance, rather than being dominated by the majority class. The curves remain high across the training schedule, confirming that the selected Severity 2 model performs well on both normal and moderate-pneumonia validation images. The macro-F1 trajectory also supports the conclusion that the Severity 2 branch is more stable and easier to optimize than the Severity 1 branch.

Finally, Figure 29 reports the class-specific validation accuracy for the normal class and the Severity 2 pneumonia class. This analysis provides a direct view of whether the model favors one class over the other. The results show that both class-specific curves remain high, indicating that the selected Severity 2 model classifies moderate pneumonia images more reliably than the Severity 1 model classifies low/mild pneumonia images. This supports the interpretation that increasing radiographic burden improves class separability, since moderate opacity patterns are more visible and easier for the ViT backbone to distinguish from normal chest radiographs.

Overall, the training-history analysis confirms that the selected Severity 2 ViT-B/16 model (run configuration #5) provides a strong and stable classifier for moderate pneumonia detection. The model reaches its best validation performance early, maintains a final validation accuracy close to the maximum value, and shows balanced macro-metric and class-specific behavior. These results support the use of this configuration as the preferred Severity 2 branch in the proposed severity-aware inference framework.

To further examine the class-level behavior of the selected Severity 2 model, Figure 30 presents the training and validation confusion matrices at the final epoch. This analysis complements the class-specific validation accuracy curves by showing the absolute number of correct and incorrect predictions for the normal and Severity 2 moderate-pneumonia classes.

The training confusion matrix indicates almost complete separation between the two classes, with 7080 normal images correctly classified as normal and 1603 Severity 2 images correctly classified as moderate pneumonia. Only one normal image is incorrectly predicted as Severity 2, and only two Severity 2 images are incorrectly predicted as normal. This confirms that the selected ViT-B/16 model fits the training distribution extremely well.

The validation confusion matrix shows that the model also generalizes strongly to unseen data. A total of 1752 normal validation images are correctly classified as normal, while 372 Severity 2 validation images are correctly classified as moderate pneumonia. The number of false positives is limited to 18 normal images misclassified as Severity 2, and the number of false negatives is 29 moderate-pneumonia images misclassified as normal. Compared with the Severity 1 branch, these results indicate a much more balanced and reliable classifier, which is consistent with the stronger visual separability of moderate pneumonia patterns.

Together with the class-specific validation accuracy curves, the confusion matrix confirms that the Severity 2 branch is both accurate and stable. Most clinically relevant errors remain limited in number, and the model shows substantially fewer ambiguous cases than the Severity 1 classifier. This further supports the selection of the ViT-B/16 configuration (run #5) as the preferred Severity 2 model in the proposed severity-aware inference framework.

5.4. Severity 3 ViT Model Exploration

To complete the severity-specific ViT exploration, we performed a dedicated model search for the Severity 3 binary classification branch, corresponding to the severe pneumonia burden category. As in the Severity 1 and Severity 2 experiments, four pretrained Vision Transformer backbones were evaluated: ViT-B/16, ViT-L/16, ViT-B/32, and ViT-L/32. Each backbone was trained using six learning-rate and weight-decay configurations, resulting in a total of 24 experimental runs.

All experiments used an input resolution of $512 \times 512$, batch size 8, ImageNet-pretrained initialization, automatic mixed precision (AMP), AdamW optimization, GELU activation, a single-layer classification head without dropout, and a cosine annealing learning-rate scheduler. The main objective of this exploration was to identify the most suitable ViT configuration for distinguishing normal cases from severe pneumonia cases, while also comparing the computational cost of each backbone.
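For reference, the shape of a cosine annealing schedule can be sketched as follows. This is a generic closed-form illustration rather than the exact training code: the minimum learning rate of zero, the per-epoch update, the absence of warm-up, and the use of a 50-epoch horizon (the schedule length mentioned for the Severity 1 run) are all assumptions.

```python
import math


def cosine_annealing_lr(epoch: int, total_epochs: int,
                        base_lr: float, min_lr: float = 0.0) -> float:
    """Closed-form cosine annealing from base_lr (epoch 0) to min_lr (last epoch)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


BASE_LR = 5e-5       # lowest tested learning rate in the exploration
TOTAL_EPOCHS = 50    # ASSUMPTION: same schedule length as the Severity 1 run

# Learning rate at the start of each epoch, plus the end point of the schedule.
schedule = [
    cosine_annealing_lr(epoch, TOTAL_EPOCHS, BASE_LR)
    for epoch in range(TOTAL_EPOCHS + 1)
]
```

Under these assumptions the learning rate starts at $5 \times 10^{-5}$, passes through half that value at the midpoint of training, and decays smoothly towards zero, so late epochs make only small parameter updates.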

Table 4 summarizes the complete Severity 3 exploration. For each run, the table reports the final training accuracy, final validation accuracy, maximum training accuracy, maximum validation accuracy, the epoch at which the best validation performance was reached, and the total training duration in seconds.

To further analyze the Severity 3 exploration, Figure 31 compares the final and maximum validation accuracy across all 24 runs. The results show that the severe pneumonia classification task achieved the strongest validation performance among the severity-specific branches. This is expected, since Severity 3 cases correspond to the largest annotated disease burden and therefore contain more extensive radiographic opacity patterns. As a result, the distinction between normal cases and severe pneumonia cases appears to be more clearly separable for the evaluated ViT classifiers.

The validation accuracy comparison also confirms the learning-rate trend observed in the Severity 1 and Severity 2 experiments. For all four ViT backbone families, the best or near-best results were obtained with the lowest tested learning rate, $5 \times 10^{-5}$. The highest maximum validation accuracy was achieved by ViT-L/16, reaching 99.08% (run #11), with a final validation accuracy of 98.99%. However, ViT-B/16 followed closely, reaching 98.94% maximum validation accuracy, while ViT-B/32 achieved 98.89%. This indicates that, although the larger ViT-L/16 backbone produced the best absolute result, the smaller ViT-B/16 and ViT-B/32 variants remained highly competitive for severe pneumonia classification.

Figure 32 presents the corresponding training duration comparison. A clear computational gap can be observed between the evaluated backbones. ViT-L/16 required the longest training time, with an average duration of approximately 19,526 s per run. In contrast, ViT-B/16 required approximately 7,895 s on average, while ViT-B/32 was the fastest competitive backbone, requiring approximately 6,027 s on average. Therefore, the modest accuracy gain obtained by ViT-L/16 came at a substantially higher computational cost.

Taken together, the validation accuracy and duration comparisons highlight two possible model-selection strategies. If the objective is to maximize validation accuracy, ViT-L/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05 is the strongest Severity 3 configuration. If the objective is to balance accuracy and computational efficiency, ViT-B/16 becomes a strong alternative, since it achieved a very similar maximum validation accuracy while requiring less than half the training duration of ViT-L/16. ViT-B/32 also represents an efficient option, especially when training time is a major constraint.

The Severity 3 exploration produced the highest validation accuracies among the three severity-specific branches. The best overall configuration was obtained with ViT-L/16 using a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05. This configuration reached a maximum validation accuracy of 99.08% and a final validation accuracy of 98.99%. These results indicate that severe pneumonia cases are highly distinguishable from normal cases, likely because this group contains the largest annotated radiographic burden and therefore more extensive visible opacity patterns.

Although ViT-L/16 achieved the highest validation accuracy, its computational cost was substantially higher than that of the other backbones, requiring 19,641.5 s for the best-performing run. In comparison, ViT-B/16 reached a maximum validation accuracy of 98.94% with a training duration of 7,912.2 s, while ViT-B/32 reached 98.89% with a duration below 6,000 s. Therefore, the performance advantage of ViT-L/16 over ViT-B/16 was relatively small in absolute terms, but it came at a considerably higher computational cost.
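The trade-off described above can be made explicit with a short calculation. The sketch below uses only the accuracies and durations quoted in the text, together with the Severity 3 validation counts reported with the confusion matrices; the variable names are illustrative.

```python
# Best-run figures quoted in the text for the Severity 3 branch.
best_runs = {
    "ViT-L/16": {"max_val_acc": 99.08, "duration_s": 19641.5},
    "ViT-B/16": {"max_val_acc": 98.94, "duration_s": 7912.2},
}

large, base = best_runs["ViT-L/16"], best_runs["ViT-B/16"]

# Accuracy gain of the larger backbone, in percentage points.
acc_gain_pp = large["max_val_acc"] - base["max_val_acc"]

# Relative and absolute extra training cost.
time_ratio = large["duration_s"] / base["duration_s"]
extra_hours = (large["duration_s"] - base["duration_s"]) / 3600

# Size of the Severity 3 validation set, from the reported confusion matrix.
val_images = 1764 + 6 + 385 + 16

# Number of validation images that the accuracy gain corresponds to.
images_equivalent = acc_gain_pp / 100 * val_images
```

On these figures, ViT-L/16 trains for roughly 2.5 times as long (about 3.3 additional hours per run) for a gain of 0.14 percentage points, which corresponds to about three validation images.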

The learning-rate trend was consistent with the previous severity branches. For all four ViT backbone families, the strongest results were obtained using the lowest tested learning rate, $5 \times 10^{-5}$. This confirms that conservative fine-tuning is beneficial across all severity-specific binary classifiers. Higher learning rates, especially $3 \times 10^{-4}$, generally produced lower validation accuracies and less stable final performance.

The patch-size comparison again shows that 16-pixel patches remained highly competitive. ViT-L/16 achieved the best overall result, while ViT-B/16 achieved the strongest accuracy–efficiency trade-off among the 16-pixel and 32-pixel variants. The 32-pixel models also performed well for Severity 3, particularly ViT-B/32, which achieved 98.89% maximum validation accuracy with the lowest training duration among the competitive configurations. This suggests that severe pneumonia patterns are easier to detect even with larger patch sizes, because the radiographic abnormalities are more extensive than in lower-severity cases.

Based strictly on maximum validation accuracy, ViT-L/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05 (run configuration #11) was selected as the best Severity 3 model. However, from a deployment-oriented perspective, ViT-B/16 remains an attractive alternative because it achieved a very similar maximum validation accuracy of 98.94% while requiring less than half the training time of ViT-L/16.

To further analyze the selected Severity 3 model, we inspected the epoch-wise training history of the best-performing ViT-L/16 configuration. This configuration corresponds to run #11, using a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05. Since Severity 3 represents the highest radiographic burden group, the training dynamics provide insight into whether severe pneumonia cases are more easily separable from normal radiographs and whether the larger ViT-L/16 backbone contributes to stronger validation performance.

Figure 33 shows the training and validation accuracy curves for the selected Severity 3 ViT-L/16 model. The model reaches its best validation checkpoint at epoch 39, with a maximum validation accuracy of 99.08%, while the final validation accuracy remains very close at 98.99%. This indicates that the model maintains excellent generalization performance throughout the later training epochs. Compared with the Severity 1 and Severity 2 branches, the Severity 3 classifier achieves the highest validation accuracy, supporting the interpretation that severe pneumonia cases are the most visually distinguishable from normal radiographs.

Figure 34 presents the corresponding training and validation loss curves. The training loss decreases rapidly during the early epochs and remains very low throughout the rest of training, indicating strong fitting of the training distribution. The validation loss also remains low, although some fluctuations are visible across the training schedule. These fluctuations are expected when a high-performing model is evaluated on a limited validation subset, and they do not prevent the model from maintaining very high validation accuracy. Overall, the loss curves confirm that the selected ViT-L/16 configuration converges effectively for the severe-pneumonia classification task.

To evaluate the selected model beyond accuracy alone, Figure 35 shows the validation macro-precision, macro-recall, and macro-F1 curves. These class-balanced metrics remain consistently high, confirming that the model performs well across both the normal and Severity 3 classes. This is especially important because high accuracy alone could be misleading in binary classification if one class dominates the prediction behavior. The macro-metric curves indicate that the selected ViT-L/16 model does not simply rely on the normal class, but instead provides strong balanced discrimination between normal radiographs and severe pneumonia cases.
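The concern about accuracy alone can be illustrated with a trivial baseline. The sketch below considers a hypothetical classifier that always predicts "normal" and evaluates it on the Severity 3 validation split, whose class sizes follow from the validation counts reported with the confusion matrices; the variable names are illustrative.

```python
# Validation class sizes for the Severity 3 branch, from the reported counts.
normal_val = 1764 + 6     # normal images (correct + misclassified)
severe_val = 385 + 16     # Severity 3 images (correct + misclassified)

# Hypothetical classifier that labels every image as normal.
always_normal_accuracy = normal_val / (normal_val + severe_val)

# Its recall is 1.0 on the normal class and 0.0 on the Severity 3 class,
# so the macro-averaged recall is exactly one half.
always_normal_macro_recall = (1.0 + 0.0) / 2
```

Such a classifier would already reach about 81.5% accuracy while detecting no severe pneumonia case at all, whereas its macro-recall of 0.5 exposes the failure. This is why the class-balanced curves are informative on this imbalanced validation split.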

Finally, Figure 36 presents the class-specific validation accuracy for the normal class and the Severity 3 pneumonia class. The curves show high performance for both categories, indicating that the selected model is able to identify severe pneumonia cases reliably while preserving strong normal-case recognition. This class-level behavior further supports the conclusion that Severity 3 is the most separable severity branch, likely because severe cases contain larger and more visually prominent opacity patterns than low/mild or moderate cases.

Overall, the training-history analysis confirms that the selected Severity 3 ViT-L/16 model provides the strongest performance among the three severity-specific branches. The model achieves the highest maximum validation accuracy, maintains a final validation accuracy close to the best checkpoint, and shows strong macro-metric and class-specific behavior. These findings support the use of the ViT-L/16 configuration as the preferred Severity 3 branch when maximum validation performance is prioritized, while also motivating comparison with smaller alternatives when computational efficiency is more important.

To further examine the class-level behavior of the selected Severity 3 model, Figure 37 presents the training and validation confusion matrices at the final epoch. This analysis complements the class-specific validation accuracy curves by showing the absolute number of correct and incorrect predictions for the normal and Severity 3 severe-pneumonia classes.

The training confusion matrix indicates almost perfect separation between the two classes. A total of 7080 normal images are correctly classified as normal and all 1606 Severity 3 images are correctly classified as severe pneumonia, while only one normal image is incorrectly predicted as Severity 3. No Severity 3 training image is misclassified as normal. This confirms that the selected ViT-L/16 model fits the training distribution extremely well.

The validation confusion matrix shows that the model also generalizes very well to unseen data. A total of 1764 normal validation images are correctly classified as normal, while 385 Severity 3 validation images are correctly classified as severe pneumonia. Only 6 normal images are misclassified as Severity 3, and only 16 severe-pneumonia images are misclassified as normal. These results indicate that both false-positive and false-negative errors remain very limited, which is consistent with the strong class separability of severe pneumonia patterns.
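To place the three branches side by side, the reported final-epoch validation counts can be converted into sensitivity and specificity. This is a minimal sketch using only the counts quoted in the text; the helper names are illustrative, and the comparison remains descriptive because each branch solves a different binary task.

```python
# Final-epoch validation counts reported for each severity-specific branch.
validation_counts = {
    "Severity 1": {"tn": 1721, "fp": 49, "fn": 72, "tp": 328},
    "Severity 2": {"tn": 1752, "fp": 18, "fn": 29, "tp": 372},
    "Severity 3": {"tn": 1764, "fp": 6, "fn": 16, "tp": 385},
}


def sensitivity(c: dict) -> float:
    """Fraction of pneumonia images correctly flagged."""
    return c["tp"] / (c["tp"] + c["fn"])


def specificity(c: dict) -> float:
    """Fraction of normal images correctly recognised."""
    return c["tn"] / (c["tn"] + c["fp"])


def accuracy(c: dict) -> float:
    return (c["tp"] + c["tn"]) / sum(c.values())


summary = {
    name: {
        "sensitivity": sensitivity(c),
        "specificity": specificity(c),
        "accuracy": accuracy(c),
    }
    for name, c in validation_counts.items()
}
```

The counts reproduce the final validation accuracies of 94.42%, 97.84%, and 98.99%, and the sensitivity rises from 82.0% for Severity 1 to about 92.8% for Severity 2 and 96.0% for Severity 3, consistent with the increasing separability discussed above.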

The Severity 3 confusion matrix should not be interpreted as evidence that the Severity 3 model is generally more reliable than the Severity 1 and Severity 2 models, since each branch addresses a different binary classification task. Rather, the results indicate that severe pneumonia cases are more easily separable from normal radiographs in this dataset. This is clinically meaningful, because the cases with the highest radiographic burden are also the cases for which early recognition and prioritization may be most important in a hospital triage workflow.

Together with the class-specific validation accuracy curves, the confusion matrix confirms that the selected Severity 3 configuration performs strongly for its specific binary task: distinguishing normal radiographs from severe pneumonia cases. Both classes are recognized with a small number of validation errors, supporting the use of this configuration as the preferred Severity 3 model when maximum validation performance for the severe-pneumonia branch is prioritized.

5.5. Classical Binary ViT Model Exploration

In addition to the three severity-specific pneumonia branches, we evaluated the conventional binary classification setting, where chest X-ray images are classified as either normal or pneumonia-positive. This experiment provides a useful baseline because most existing pneumonia classification studies formulate the task as a binary decision problem. In contrast, the proposed severity-aware framework extends this setting by separating pneumonia-positive cases into low/mild, moderate, and severe categories. Therefore, the classical binary experiment helps contextualize the difficulty and performance of the severity-specific models.

As in the severity-specific experiments, four pretrained Vision Transformer backbones were evaluated: ViT-B/16, ViT-L/16, ViT-B/32, and ViT-L/32. Each backbone was tested using six learning-rate and weight-decay configurations, resulting in a total of 24 runs. All runs used $512 \times 512$ input images, batch size 8, ImageNet-pretrained initialization, AdamW optimization, automatic mixed precision, and cosine annealing scheduling. The explored configurations varied the learning rate and weight decay, while the remaining training setup was kept fixed.

Table 5 summarizes the 24-run exploration for the classical binary task. For each run, the table reports the final training accuracy, final validation accuracy, maximum validation accuracy, validation macro-F1 score, best validation epoch, and total training duration.

The best overall binary classification result was obtained by the ViT-L/16 backbone using a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05 (run #11). This configuration reached a maximum validation accuracy of 95.05%, a final validation accuracy of 94.72%, and a validation macro-F1 score of 94.85%. The second-best result was obtained by ViT-L/16 with the same learning rate and weight decay of 0.01, reaching 94.95% maximum validation accuracy. Among the base-size models, ViT-B/16 achieved a competitive maximum validation accuracy of 94.68%, with a substantially shorter training duration.

The results show a consistent learning-rate pattern similar to the severity-specific branches. The strongest configurations generally used the lowest tested learning rate, $5 \times 10^{-5}$. Higher learning rates, especially $3 \times 10^{-4}$, produced weaker validation performance across all backbone families. This confirms that conservative fine-tuning is beneficial not only for severity-specific classification, but also for the classical binary pneumonia detection task.

From a backbone perspective, the 16-pixel patch models outperformed their 32-pixel counterparts. ViT-L/16 achieved the best validation performance, followed by ViT-B/16. The best ViT-B/32 and ViT-L/32 configurations reached 94.01% and 94.38% maximum validation accuracy, respectively, remaining below the best 16-pixel models. This suggests that the smaller patch size preserves useful spatial detail for pneumonia detection in chest radiographs.

To further summarize the 24-run classical binary exploration, Figure 38 compares the final and maximum validation accuracy across all evaluated configurations. The results show that the strongest validation performance was obtained by the ViT-L/16 backbone using the lowest tested learning rate, $5 \times 10^{-5}$. This configuration reached the highest maximum validation accuracy of 95.05%, while the best ViT-B/16 configuration achieved a closely competitive maximum validation accuracy of 94.68%. The figure also shows that the 16-pixel patch models generally outperformed their 32-pixel counterparts, suggesting that the smaller patch size preserved useful local radiographic detail for the normal-versus-pneumonia task.

Figure 39 presents the corresponding training duration comparison. The ViT-L/16 configurations required the longest training time, with an average duration of approximately 26,706 s per run. In contrast, ViT-B/16 required approximately 13,671 s on average, while ViT-B/32 was the fastest family, requiring approximately 8,025 s per run. Therefore, although ViT-L/16 achieved the strongest absolute binary validation accuracy, it also introduced a substantial computational cost. ViT-B/16 provides a more efficient alternative, with only a modest reduction in validation performance.

Taken together, Figure 38 and Figure 39 highlight the main trade-off in the classical binary setting. ViT-L/16 achieved the strongest absolute validation performance, reaching 95.05% maximum validation accuracy and 94.85% validation macro-F1. However, this improvement came with a substantial computational cost: the best ViT-L/16 run required 26,763.1 s, roughly 2.6 times the 10,409.6 s of the best ViT-B/16 run, for a gain of 0.37 percentage points in maximum validation accuracy (95.05% versus 94.68%). Therefore, ViT-L/16 is the preferred choice when maximum binary validation accuracy is prioritized, while ViT-B/16 remains a more efficient alternative.

Overall, the classical binary experiment confirms that ViT-based models can achieve strong normal-versus-pneumonia classification performance on this dataset, with the best classical model reaching 95.05% maximum validation accuracy. However, the binary formulation compresses all pneumonia-positive cases into a single category and therefore does not expose differences in radiographic burden. In contrast, the severity-aware formulation separates pneumonia cases into low/mild, moderate, and severe burden groups, enabling grading-oriented analysis while also producing strong severity-specific binary classifiers. Across the selected Severity 1, Severity 2, and Severity 3 branches, the maximum validation accuracies were 94.98%, 97.88%, and 99.08%, respectively, corresponding to a mean of approximately 97.31% and a median of 97.88%. This suggests that severity splitting may improve class separability, especially for moderate and severe pneumonia cases, while also providing a more informative grading-oriented output. Nevertheless, this comparison should be interpreted carefully, since the classical binary model and the severity-specific models solve different classification tasks. A definitive performance comparison would require evaluating the final combined severity-routing system end-to-end on the same external validation or test cohort.

To further inspect the selected classical binary model, we analyzed the epoch-wise training history of the best-performing ViT-L/16 configuration. This configuration corresponds to the conventional normal-versus-pneumonia classification task, where all pneumonia-positive images are grouped into a single class. The training-history curves provide additional insight into convergence behavior, validation stability, and class-level performance beyond the 24-run summary table.

Figure 40 shows the training and validation accuracy curves for the selected classical binary ViT-L/16 model. The best validation checkpoint was reached at epoch 45, with a maximum validation accuracy of 95.05%, while the final validation accuracy was 94.72%. The validation curve improves during training but does not increase monotonically, indicating that the best checkpoint should be selected according to validation performance rather than simply using the final epoch. This behavior is consistent with the severity-specific experiments, where validation-based checkpointing was also important.

Figure 41 presents the corresponding training and validation loss curves. The training loss decreases across the training schedule, showing that the model continues to fit the training data. The validation loss decreases during the early epochs but then fluctuates across the remaining training period. This indicates that the model reaches a strong generalization point before the end of training, but additional epochs do not necessarily produce consistent validation improvement. Therefore, the loss behavior again supports the use of validation-based checkpoint selection for the classical binary setting.
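The validation-based checkpoint selection discussed above can be sketched in a few lines. This is a minimal illustration rather than the training code used in this study; the function name `select_best_epoch` and the toy accuracy values are assumptions introduced for the example.

```python
def select_best_epoch(val_acc_history):
    """Return (epoch, accuracy) of the best validation epoch (1-indexed).

    Ties are resolved in favour of the earliest epoch.
    """
    if not val_acc_history:
        raise ValueError("validation history is empty")
    best_index = max(range(len(val_acc_history)), key=lambda i: val_acc_history[i])
    return best_index + 1, val_acc_history[best_index]


# Toy, hypothetical validation accuracies (not values from this study).
history = [0.90, 0.93, 0.95, 0.94, 0.945]
best_epoch, best_acc = select_best_epoch(history)
final_acc = history[-1]
print(f"best epoch={best_epoch}, best acc={best_acc:.3f}, final acc={final_acc:.3f}")
```

In this toy history the final epoch is not the best one, which is the situation in which keeping the best validation checkpoint matters.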

To complement accuracy and loss, Figure 42 reports the validation macro-precision, macro-recall, and macro-F1 curves. These metrics are important because they provide a class-balanced view of performance between the normal and pneumonia classes. The macro-metric curves remain high throughout the later epochs, confirming that the selected ViT-L/16 model performs strongly in the classical binary task. However, the visible fluctuations also show that validation performance varies across epochs, reinforcing the need to preserve the best validation checkpoint.
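For reference, macro-averaged metrics for a two-class problem can be computed from the four confusion-matrix counts as sketched below. The helper name `macro_metrics` and the example counts are hypothetical; the sketch only illustrates the standard definition, in which each class is scored separately and the two scores are averaged with equal weight.

```python
def macro_metrics(tn, fp, fn, tp):
    """Macro-averaged precision, recall and F1 for a two-class problem.

    Each class (normal, pneumonia) is treated in turn as the positive class
    and the per-class scores are averaged with equal weight.
    """
    def prf(true_pos, false_pos, false_neg):
        predicted = true_pos + false_pos
        actual = true_pos + false_neg
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    pneumonia = prf(tp, fp, fn)  # pneumonia as the positive class
    normal = prf(tn, fn, fp)     # normal as the positive class
    return tuple((a + b) / 2 for a, b in zip(pneumonia, normal))


# Hypothetical counts, for illustration only.
precision, recall, f1 = macro_metrics(tn=90, fp=10, fn=20, tp=80)
print(f"macro-P={precision:.4f}, macro-R={recall:.4f}, macro-F1={f1:.4f}")
```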

Finally, Figure 43 shows the class-specific validation accuracy for the normal class and the pneumonia-positive class. This analysis is useful because the binary pneumonia-positive label contains cases with different radiographic burden levels, ranging from low/mild to severe involvement. The class-specific curves indicate that the model achieves strong recognition of both normal and pneumonia-positive images, but the pneumonia class remains heterogeneous because it combines all severity levels into a single target category.

Overall, the training-history analysis confirms that the selected classical binary ViT-L/16 model provides strong normal-versus-pneumonia classification performance. However, the binary task remains limited by its label structure, since all pneumonia-positive images are treated as one class regardless of radiographic extent. This motivates the severity-aware experiments, where pneumonia-positive cases are separated into more homogeneous burden-oriented groups, enabling both grading-oriented analysis and a more detailed understanding of model behavior across disease extent.

5.6. Selection of ViT Models for Extended Training and Inference

After completing the classical binary ViT exploration and the separate severity-specific ViT exploration experiments for Severity 1, Severity 2, and Severity 3, the best-performing configuration from each setting was selected for further analysis. The selection was primarily based on maximum validation accuracy, while also considering final validation accuracy and training duration. The classical binary model is included as a conventional normal-versus-pneumonia baseline, whereas the three severity-specific models represent the selected branches of the proposed severity-aware framework.

Table 6 summarizes the selected ViT configurations. For Severity 1, the best configuration was ViT-B/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.01, reaching a maximum validation accuracy of 94.98%. For Severity 2, the selected configuration was ViT-B/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05, which reached a maximum validation accuracy of 97.88% and the highest final validation accuracy among the top-performing Severity 2 runs. For Severity 3, the best absolute validation performance was achieved by ViT-L/16 with a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05, reaching a maximum validation accuracy of 99.08%.

The selected configurations show that the conventional binary model achieves strong overall pneumonia detection performance, with a maximum validation accuracy of 95.05%. However, the severity-aware formulation provides additional information by separating pneumonia-positive cases into low/mild, moderate, and severe radiographic burden categories. Across the three selected severity-specific branches, the maximum validation accuracies were 94.98%, 97.88%, and 99.08%, respectively. This corresponds to a mean maximum validation accuracy of approximately 97.31% and a median of 97.88% across the selected severity-specific branches. Although this should not be interpreted as a direct end-to-end comparison against the classical binary model, it suggests that severity splitting may improve class separability, particularly for moderate and severe pneumonia cases.
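The summary statistics quoted above follow directly from the three branch accuracies. As a quick arithmetic check (a sketch only; the variable names are illustrative):

```python
import statistics

# Maximum validation accuracies (%) of the selected branches, as reported above.
branch_max_val_acc = {"Severity 1": 94.98, "Severity 2": 97.88, "Severity 3": 99.08}

values = list(branch_max_val_acc.values())
mean_acc = statistics.fmean(values)     # (94.98 + 97.88 + 99.08) / 3
median_acc = statistics.median(values)  # middle value of the three

print(f"mean = {mean_acc:.2f}%, median = {median_acc:.2f}%")
```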

In addition to predictive performance, the selected severity-specific models also differ substantially in storage footprint. Figure 44 summarizes the disk size of the three selected severity-specific classifiers. The selected Severity 1 model occupies 989 MB, whereas the selected Severity 2 and Severity 3 models occupy 3.41 GB and 3.39 GB, respectively. This indicates that the Severity 2 and Severity 3 models are approximately 3.5 times larger than the Severity 1 model, which is consistent with the use of larger-capacity backbone variants in the final selection.

The selected models reveal a clear relationship between pneumonia severity and classification difficulty. Severity 1 produced the lowest validation performance among the severity-specific branches, with a maximum validation accuracy of 94.98%. This is expected because low/mild pneumonia cases contain smaller and more localized opacity patterns, making them more difficult to distinguish from normal radiographs. Severity 2 achieved higher performance, reaching 97.88% maximum validation accuracy, suggesting that moderate pneumonia burden provides more visible and separable radiographic features. Severity 3 achieved the strongest severity-specific performance, with a maximum validation accuracy of 99.08%, indicating that severe pneumonia cases are the most visually distinguishable from normal cases due to their larger annotated disease extent.

A consistent optimization trend was observed across all selected configurations. In the classical binary model and in all three severity-specific branches, the best-performing configuration used the lowest tested learning rate, $5 \times 10^{-5}$. This suggests that conservative fine-tuning is important when adapting pretrained ViT backbones to the RSNA pneumonia classification tasks. The selected Severity 1 and Severity 2 models both used ViT-B/16, indicating that the base-size model with 16-pixel patches provides a strong balance between accuracy and computational efficiency for low/mild and moderate pneumonia classification.

For the classical binary task, ViT-L/16 achieved the strongest maximum validation accuracy, but required the longest training duration among the selected models, with 26,763.1 s. For Severity 3, ViT-L/16 also achieved the best absolute validation accuracy, requiring 19,641.5 s. However, this higher performance came with a substantially greater computational cost than the ViT-B/16 severity-specific branches. Therefore, two possible strategies can be considered in the next stage. If the objective is to maximize validation accuracy, the selected ViT-L/16 models should be retained for the classical binary and Severity 3 tasks. If computational efficiency is prioritized, ViT-B/16 remains an attractive alternative, especially because the best ViT-B/16 Severity 3 configuration achieved a closely competitive maximum validation accuracy of 98.94%.

The selected severity-specific models will be used in the next experimental stage for extended training and inference analysis. In particular, each selected configuration can be trained for longer schedules, such as 250, 500, and 1000 epochs, to determine whether validation performance improves further or whether the models saturate early. These extended experiments will help verify whether the current best validation results are limited by the 50-epoch exploration schedule or whether additional training mainly increases overfitting. The resulting checkpoints can then be evaluated in an inference-oriented setting, where the three binary severity classifiers are applied jointly to assess their behavior across normal, Severity 1, Severity 2, and Severity 3 cases.

Overall, the exploration stage identified one preferred model for the classical binary baseline and one preferred model per severity-specific branch. The classical binary model provides a strong conventional benchmark, while the selected Severity 1, Severity 2, and Severity 3 models provide the foundation for the proposed severity-aware ViT pipeline. This reduces the severity-specific search space from 72 total runs across the three severity levels to three primary models for extended training and downstream inference, while retaining the classical binary model as an important reference point for comparison.

5.7. Cross-Severity Inference Analysis

After selecting the best severity-specific ViT models, we performed a cross-severity inference experiment to evaluate how each binary classifier responds to pneumonia validation images from all three severity levels, as illustrated in Figure 45. Each selected model was applied not only to its own target severity validation subset, but also to the validation subsets of the other two severity categories, so that all three models were evaluated on the Severity 1, Severity 2, and Severity 3 pneumonia validation images.

The objective of this analysis was not to compute standard binary accuracy against the original training labels, but rather to study the probability response of each severity-specific classifier across the ordered pneumonia burden spectrum. For each model, the probability assigned to its target pneumonia class was extracted. Thus, the Severity 1 model output was interpreted as $P(\text{mild})$, the Severity 2 model output as $P(\text{moderate})$, and the Severity 3 model output as $P(\text{severe})$. This allowed us to examine whether the trained classifiers behave as strict severity-specific detectors or as broader pneumonia-presence detectors with different sensitivity to disease extent.

Table 7 summarizes the cross-severity inference results numerically. For each model–severity combination, the table reports the number of pneumonia validation images, the mean and median predicted target-class probability, the interquartile range, and the percentage of images above the 0.5 threshold.
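The per-cell statistics of such a table can be computed as sketched below. This is an illustrative sketch, not the analysis code of this study: the function name `summarize_probabilities`, the inclusive quartile method, the strict `>` comparison against the threshold, and the toy probabilities are all assumptions.

```python
import statistics


def summarize_probabilities(probs, threshold=0.5):
    """Summarise target-class probabilities for one model-severity pair."""
    # Quartiles via the "inclusive" method (an assumption; other conventions exist).
    q1, _, q3 = statistics.quantiles(probs, n=4, method="inclusive")
    return {
        "n": len(probs),
        "mean": statistics.fmean(probs),
        "median": statistics.median(probs),
        "iqr": q3 - q1,
        "pct_above": 100.0 * sum(p > threshold for p in probs) / len(probs),
    }


# Toy, hypothetical probabilities (not values from this study).
summary = summarize_probabilities([0.1, 0.4, 0.6, 0.9, 1.0])
print(summary)
```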

Figure 46 shows the resulting probability distributions using violin plots. Each row corresponds to one selected severity-specific model, and each column within a row corresponds to the true pneumonia severity of the validation images. The dashed horizontal line marks the 0.5 decision threshold. A probability distribution concentrated above this threshold indicates that the corresponding model frequently activates on that severity group, while a distribution below or around the threshold indicates weaker or more uncertain activation.

The probability distributions reveal an important behavior of the binary severity-specific classifiers. The Severity 1 model strongly activates not only on Severity 1 images, but also on Severity 2 and Severity 3 images. Its median probability is close to 1.0 for all three severity groups, and the proportion of pneumonia validation images above the 0.5 threshold increases from 82.0% for Severity 1 to 93.0% for Severity 2 and 96.5% for Severity 3. This suggests that the Severity 1 classifier learned features that are not exclusive to mild pneumonia, but instead capture general pneumonia-related abnormalities that become even more visible as disease burden increases.

A similar pattern is observed for the Severity 2 model. The model activates on its own Severity 2 validation images with a high median probability and also responds strongly to Severity 3 images. The activation rate above the 0.5 threshold increases from 72.5% on Severity 1 images to 93.5% on Severity 2 images and 97.3% on Severity 3 images. This indicates that the moderate-severity classifier also behaves partly as a pneumonia-burden detector: it is less confident on low/mild cases, but it responds strongly when the radiographic burden becomes moderate or severe.

The Severity 3 model shows the clearest ordinal behavior. On Severity 1 images, its median probability remains below the 0.5 threshold, and only 48.5% of the Severity 1 validation images are classified above the severe-probability threshold. On Severity 2 images, the model response increases substantially, with 85.8% of cases above the threshold. On Severity 3 images, the model reaches its strongest response, with a median probability close to 1.0 and 95.0% of images above the threshold. This trend suggests that the severe classifier is more selective than the other two branches and better aligned with increasing radiographic burden.

Overall, the cross-severity inference results suggest that the three selected binary ViT classifiers are not completely independent severity detectors. Instead, their probability responses reflect a mixture of pneumonia presence and disease-burden sensitivity. The Severity 1 and Severity 2 models generalize strongly toward higher-severity cases, indicating that mild and moderate classifiers also capture visual patterns present in more advanced pneumonia. In contrast, the Severity 3 model shows the strongest ordinal selectivity, with substantially weaker activation on Severity 1 images and progressively stronger activation on Severity 2 and Severity 3 images.

This behavior has important implications for the final inference strategy. A simple independent thresholding approach may produce multiple positive severity outputs for the same image, especially for moderate and severe cases. Therefore, the three binary classifiers should not be interpreted as mutually exclusive class detectors without an additional decision rule. A practical next step is to combine the three probability outputs into a severity-routing mechanism, for example by selecting the severity branch with the highest calibrated probability, applying hierarchical rules, or introducing a final meta-classifier that receives the three probabilities as input. Such a strategy would allow the model ensemble to exploit the observed ordinal probability structure while reducing ambiguity between overlapping severity responses.
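The ambiguity of independent thresholding, and the simplest of the alternatives mentioned above (selecting the branch with the highest probability), can be illustrated as follows. The function names and the example probabilities are hypothetical, and the probabilities are assumed to be already calibrated, which this study has not yet verified.

```python
def independent_positives(p1, p2, p3, threshold=0.5):
    """Severity branches whose probability reaches the threshold."""
    return [k for k, p in ((1, p1), (2, p2), (3, p3)) if p >= threshold]


def argmax_branch(p1, p2, p3):
    """Branch with the highest probability; ties go to the higher severity.

    The tie-breaking rule is an assumption made for this sketch.
    """
    return max(((p1, 1), (p2, 2), (p3, 3)))[1]


# Hypothetical outputs of the three branches for a single image.
probs = (0.97, 0.95, 0.60)
print(independent_positives(*probs))  # all three branches fire
print(argmax_branch(*probs))          # a single branch is selected
```

With these toy values, independent thresholding marks the image as positive in all three branches, whereas the selection rule returns exactly one branch.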

5.8. Toward a Hierarchical Severity-Routing System for Clinical Use

The cross-severity inference results suggest that the three severity-specific ViT models can be interpreted not only as independent binary classifiers, but also as components of a unified severity-routing system. In a clinical decision-support scenario, such a system could be used to assist the prioritization and grading of chest X-rays by first identifying whether severe radiographic burden is likely, then progressively routing lower-confidence cases toward moderate or mild severity branches. This direction is motivated by the probability distributions observed in Figure 46, where the Severity 3 model showed the clearest ordinal behavior: low response on Severity 1 images, stronger response on Severity 2 images, and the strongest response on Severity 3 images.

Let $x$ denote an input chest X-ray image, and let the three selected binary models be denoted by $f_{1}$, $f_{2}$, and $f_{3}$, corresponding to Severity 1, Severity 2, and Severity 3, respectively. Each model outputs a probability associated with its target severity class:

p_{1}(x) = f_{1}(x) = P(\text{Severity 1} \mid x),

(14)

p_{2}(x) = f_{2}(x) = P(\text{Severity 2} \mid x),

(15)

p_{3}(x) = f_{3}(x) = P(\text{Severity 3} \mid x).

(16)

Although these probabilities are produced by independently trained binary classifiers, the cross-severity inference experiment showed that they contain useful information about the relative radiographic burden of pneumonia. Therefore, instead of interpreting the three models independently, they can be arranged into a hierarchical decision process that evaluates the highest-severity condition first. This is clinically intuitive because severe cases are generally the most important to detect early in a triage-oriented workflow.

A possible hierarchical severity-routing rule can be defined using two high-confidence thresholds, $\tau_{3}$ and $\tau_{2}$, for severe and moderate pneumonia detection, respectively, together with a threshold $\tau_{1}$ for the low/mild branch. Based on the observed violin plots, $\tau_{3}$ and $\tau_{2}$ could initially be explored in the range of 0.85–0.90, although they would need to be calibrated on an external validation set before any clinical interpretation. The hierarchical decision rule can be written as:

C_{3}(x):\; p_{3}(x) \geq \tau_{3}, \qquad C_{2}(x):\; p_{3}(x) < \tau_{3} \land p_{2}(x) \geq \tau_{2},

(17)

C_{1}(x):\; p_{3}(x) < \tau_{3} \land p_{2}(x) < \tau_{2} \land p_{1}(x) \geq \tau_{1}.

(18)

\hat{y}(x) = \begin{cases} \text{Severity 3}, & \text{if } C_{3}(x), \\ \text{Severity 2}, & \text{if } C_{2}(x), \\ \text{Severity 1}, & \text{if } C_{1}(x), \\ \text{Normal / Uncertain}, & \text{otherwise}. \end{cases}

(19)

In this formulation, the model first evaluates whether the image is likely to belong to the severe category. If $p_{3}(x)$ is above the severe threshold $\tau_{3}$, the image is classified as Severity 3. If the severe probability is below the threshold, the system interprets the image as unlikely to represent severe pneumonia and passes it to the moderate branch. The Severity 2 model then evaluates whether the image is more consistent with moderate pneumonia. If $p_{2}(x)$ is above $\tau_{2}$, the image is classified as Severity 2. If both severe and moderate probabilities are below their thresholds, the image is passed to the Severity 1 model, which provides the final low/mild pneumonia probability.
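The sequential rule of Equations (17) to (19) can be sketched as a short function. This is an illustrative sketch under stated assumptions, not a validated implementation: the function name `route_severity` is hypothetical, the defaults for `tau2` and `tau3` are taken from the lower end of the exploratory 0.85–0.90 range, and the default for `tau1` is a placeholder, since no value for the low/mild threshold is proposed here.

```python
def route_severity(p1, p2, p3, tau1=0.5, tau2=0.85, tau3=0.85):
    """Hierarchical routing: severe first, then moderate, then low/mild.

    p1, p2, p3 are the outputs of the Severity 1, 2 and 3 branches.
    All threshold defaults are placeholders that would need calibration.
    """
    if p3 >= tau3:                # condition C_3
        return "Severity 3"
    if p2 >= tau2:                # condition C_2 (p3 < tau3 already holds)
        return "Severity 2"
    if p1 >= tau1:                # condition C_1 (p3 < tau3 and p2 < tau2 hold)
        return "Severity 1"
    return "Normal / Uncertain"   # no branch reached its threshold


# Hypothetical branch outputs for four images.
for probs in [(0.99, 0.99, 0.95), (0.99, 0.90, 0.40), (0.70, 0.30, 0.10), (0.20, 0.10, 0.05)]:
    print(probs, "->", route_severity(*probs))
```

Because the branches are checked in descending order of severity, the conjunctions in Equations (17) and (18) are implied by the early returns and do not need to be tested explicitly.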

The normal case can be handled in two possible ways. In a strict threshold-based interpretation, an image may be classified as normal if none of the severity-specific models produces a probability above its corresponding decision threshold. This can be expressed as:

\hat{y}(x) = \text{Normal} \quad \text{if } p_{1}(x) < \tau_{1},\; p_{2}(x) < \tau_{2},\; p_{3}(x) < \tau_{3}.

(20)

Alternatively, in a clinical safety-oriented workflow, such a case may be marked as uncertain rather than directly normal, especially if the probability values are close to the decision thresholds. This would allow the system to avoid overconfident negative predictions and instead flag ambiguous cases for radiologist review. A margin-based uncertainty rule can be introduced as:

\hat{y}(x) = \text{Uncertain} \quad \text{if } \max_{k \in \{1, 2, 3\}} p_{k}(x) < \tau_{\text{normal}} \;\text{ or }\; |p_{i}(x) - \tau_{i}| < \delta

(21)

for at least one severity branch $i$, where $\delta$ is a predefined uncertainty margin around the decision threshold. This mechanism is important because the proposed system is intended as a decision-support tool rather than an autonomous diagnostic system.
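A literal reading of Equation (21) can be sketched as follows. The function name `is_uncertain` and all numeric values are hypothetical; `tau_normal` is taken directly from the equation and is treated here, as an assumption, as a floor below which no branch is considered active.

```python
def is_uncertain(probs, taus, tau_normal, delta):
    """Margin-based uncertainty flag following Equation (21).

    probs: (p1, p2, p3) branch outputs; taus: (tau1, tau2, tau3).
    Returns True if every branch is below tau_normal, or if any branch
    output lies within delta of its own decision threshold.
    """
    below_floor = max(probs) < tau_normal
    near_threshold = any(abs(p - t) < delta for p, t in zip(probs, taus))
    return below_floor or near_threshold


# Placeholder thresholds and margin; none of these values are calibrated.
taus = (0.5, 0.85, 0.85)
print(is_uncertain((0.95, 0.83, 0.20), taus, tau_normal=0.1, delta=0.05))  # p2 is near tau2
print(is_uncertain((0.99, 0.99, 0.99), taus, tau_normal=0.1, delta=0.05))  # clear of all thresholds
```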

Another possible formulation is to combine the three model outputs into a severity score. Since the three classes follow an ordinal structure, a weighted severity index can be computed as:

S(x) = \frac{1 \cdot p_{1}(x) + 2 \cdot p_{2}(x) + 3 \cdot p_{3}(x)}{p_{1}(x) + p_{2}(x) + p_{3}(x) + \epsilon},

(22)

where $\epsilon$ is a small constant used to avoid division by zero. The resulting score $S(x)$ ranges approximately between 1 and 3 when at least one pneumonia model activates, with higher values indicating stronger evidence for more severe radiographic burden. This score could be useful as a continuous severity indicator, complementing the discrete severity class predicted by the hierarchical routing rule.
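The weighted severity index of Equation (22) is straightforward to compute. The sketch below uses a hypothetical function name and an arbitrary small value for the stabilizing constant.

```python
def severity_index(p1, p2, p3, eps=1e-8):
    """Probability-weighted severity index from Equation (22)."""
    return (1 * p1 + 2 * p2 + 3 * p3) / (p1 + p2 + p3 + eps)


# Hypothetical branch outputs.
print(severity_index(1.0, 0.0, 0.0))  # only the low/mild branch active: close to 1
print(severity_index(1.0, 1.0, 1.0))  # all branches equally active: close to 2
print(severity_index(0.0, 0.0, 1.0))  # only the severe branch active: close to 3
```

When all three probabilities are zero the index evaluates to 0 rather than to a value between 1 and 3, so the score is only meaningful as a severity indicator once at least one branch has activated.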

For clinical interpretability, the system could report both the final severity decision and the three raw probability outputs:

p(x) = [p_{1}(x),\; p_{2}(x),\; p_{3}(x)].

(23)

This probability vector would allow clinicians to inspect whether the model prediction is clear or ambiguous. For example, a case with high $p_{3}(x)$ and low $p_{1}(x)$ and $p_{2}(x)$ would indicate a strong severe-pneumonia signal. In contrast, a case with similar moderate values across multiple branches could be flagged as uncertain or borderline, requiring manual review.

The proposed hierarchical system is summarized conceptually as follows:

x \to p_{3}(x).

(24)

d_{3}(x) = \begin{cases} \text{Severity 3}, & p_{3}(x) \geq \tau_{3}, \\ \text{continue}, & p_{3}(x) < \tau_{3}. \end{cases}

(25)

d_{2}(x) = \begin{cases} \text{Severity 2}, & p_{2}(x) \geq \tau_{2}, \\ \text{continue}, & p_{2}(x) < \tau_{2}. \end{cases}

(26)

d_{1}(x) = \begin{cases} \text{Severity 1}, & p_{1}(x) \geq \tau_{1}, \\ \text{Normal / Uncertain}, & p_{1}(x) < \tau_{1}. \end{cases}

(27)

This approach has several practical advantages. First, it prioritizes the detection of severe cases, which is desirable in a hospital triage setting. Second, it makes use of the observed ordinal behavior of the severity-specific models rather than forcing the three classifiers to operate as unrelated binary systems. Third, it provides interpretable intermediate outputs, since each branch produces a clinically meaningful probability associated with mild, moderate, or severe radiographic burden. Finally, it creates a flexible framework in which thresholds can be adjusted according to the desired sensitivity–specificity trade-off.

However, this proposed routing strategy should be interpreted as a future direction rather than a clinically validated system. The thresholds $\tau_{1}$, $\tau_{2}$, and $\tau_{3}$ must be calibrated using an independent validation cohort, and the final decision logic should be evaluated against expert radiologist assessment or clinically grounded severity labels. Moreover, because the severity labels in this study are derived from bounding-box extent rather than direct clinical outcomes, the predicted severity should be interpreted as radiographic burden severity, not as a complete clinical severity score.

To better visualize the proposed clinical-use direction, Figure 47 presents a conceptual overview of the hierarchical severity-routing framework. The diagram integrates the three specialized severity models into a single sequential decision process. Starting from an input chest X-ray, the system first applies the Severity 3 model and checks whether the severe-probability output exceeds a high-confidence threshold. If the threshold is reached, the image is classified as Severity 3. Otherwise, the case is routed to the Severity 2 branch, where a similar thresholding step is applied. If the moderate-probability output also remains below threshold, the image is then evaluated by the Severity 1 model.

Figure 47 also highlights an important practical aspect of the framework: the three model outputs can be used jointly rather than independently. In this way, the system can support both discrete grading and clinical triage, while still exposing the full probability vector for interpretability. If none of the three branches exceeds its threshold, the image may be labeled as Normal or Uncertain, depending on the chosen safety policy. This makes the proposed routing framework suitable as a decision-support concept that could be explored in future work through threshold calibration, external validation, and integration with radiologist review.

Overall, the cross-severity inference results indicate that a hierarchical ensemble of severity-specific ViT models may provide a promising direction for severity-aware chest X-ray analysis. By applying the severe classifier first, followed by the moderate and mild classifiers only when needed, the system could support both pneumonia grading and clinical triage. This motivates future work on threshold calibration, probability calibration, external validation, and integration with explainability methods such as Grad-CAM to verify whether the severity-routing decisions are based on clinically meaningful radiographic regions.

6. Discussion

6.1. Severity Splitting as a Strategy for Improved Class Separability

An important observation emerging from this study is that severity-aware splitting may improve the separability of pneumonia cases compared with the classical binary formulation. In the conventional setting, all pneumonia-positive images are grouped into a single class, regardless of whether the radiographic burden is small, moderate, or extensive. This produces a heterogeneous pneumonia category that contains visually subtle low/mild cases together with more obvious severe cases. As a result, the binary classifier must learn a broad decision boundary that covers a wide spectrum of radiographic appearances.

In the classical binary experiment, the best-performing ViT model reached a maximum validation accuracy of 95.05% for normal-versus-pneumonia classification. This confirms that ViT-based models can perform strongly in the traditional pneumonia detection setting. However, when pneumonia cases were separated into three severity-oriented groups and modeled using severity-specific binary branches, the selected models achieved maximum validation accuracies of 94.98%, 97.88%, and 99.08% for Severity 1, Severity 2, and Severity 3, respectively. Across these three selected branches, the mean maximum validation accuracy was approximately 97.31%, while the median was 97.88%. This suggests that severity-specific modeling can produce stronger class separability, particularly for moderate and severe pneumonia cases.

The interpretation of this result is clinically and methodologically important. The Severity 1 branch remains the most challenging, with a maximum validation accuracy close to the classical binary baseline. This is expected because low/mild pneumonia cases contain smaller and more localized opacity patterns, which may visually overlap with normal anatomical variation or subtle non-pneumonia findings. In contrast, Severity 2 and Severity 3 cases contain larger annotated disease burden and therefore provide stronger visual evidence for the model. The higher validation performance in these branches indicates that pneumonia cases with greater radiographic extent are easier to separate from normal chest radiographs.

This finding supports the main motivation of the proposed framework: pneumonia should not always be treated as a homogeneous positive class. A binary label can hide clinically relevant variation in radiographic burden, whereas severity splitting exposes this variation and allows separate analysis of different disease extents. From a machine learning perspective, the severity-aware formulation reduces intra-class heterogeneity by replacing one broad pneumonia category with more structured subgroups. This can make the classification problem more interpretable and may allow each model branch to specialize in a narrower visual phenotype.

At the same time, the comparison between the classical binary model and the severity-specific branches must be interpreted carefully. The classical model solves a single normal-versus-all-pneumonia task, whereas the severity-specific branches solve three separate normal-versus-severity tasks. Therefore, the higher mean or median validation accuracy across the severity-specific models should not be presented as definitive proof that the complete severity-aware system outperforms the classical binary model in an end-to-end setting. A direct comparison would require evaluating the final combined severity-routing system on the same validation or external test cohort using a unified decision rule.

Nevertheless, the observed performance pattern is encouraging. It suggests that severity splitting may provide two simultaneous benefits: first, it enables grading-oriented output rather than simple pneumonia detection; second, it may improve class separability for higher-burden pneumonia cases. This is particularly relevant in clinical triage, where the ability to identify moderate and severe pneumonia cases reliably may be more important than maximizing a single binary detection metric. The fact that the highest-severity branch achieved the strongest validation performance is consistent with the expectation that severe radiographic burden produces more distinctive visual patterns.

Future work should therefore investigate severity-aware classification not only as a grading strategy, but also as a possible way to improve model robustness and interpretability. A natural next step is to compare the classical binary model, the three independent severity-specific models, and the proposed hierarchical severity-routing system under the same evaluation protocol. This should include external hospital data, probability calibration, radiologist-reviewed severity labels, and clinically meaningful metrics such as sensitivity for severe cases, false-negative rate, and triage prioritization performance. Such experiments would clarify whether the apparent separability advantage observed in this study translates into a practical improvement in real-world clinical decision support.
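Two of the clinically oriented metrics mentioned above, sensitivity for severe cases and the corresponding false-negative rate, could be computed from end-to-end predictions as sketched below. The function name, the label strings, and the toy labels are hypothetical and serve only to illustrate the definitions.

```python
def sensitivity_and_fnr(true_labels, predicted_labels, positive="Severity 3"):
    """Sensitivity and false-negative rate for one class treated as positive."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp + fn == 0:
        raise ValueError("no positive cases in true_labels")
    sensitivity = tp / (tp + fn)
    return sensitivity, 1.0 - sensitivity


# Toy, hypothetical labels (not results from this study).
y_true = ["Severity 3", "Severity 3", "Severity 3", "Severity 3", "Normal", "Normal"]
y_pred = ["Severity 3", "Severity 3", "Severity 3", "Severity 2", "Normal", "Severity 3"]
print(sensitivity_and_fnr(y_true, y_pred))
```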

Overall, the results suggest that bounding-box-derived severity splitting is not only useful for constructing an interpretable grading framework, but may also help reveal structure that is hidden inside the conventional pneumonia-positive label. By separating pneumonia cases according to radiographic burden, the proposed approach provides a more informative representation of the dataset and creates a foundation for future severity-aware AI systems in chest X-ray analysis.

6.2. Clinical Translation of the Severity-Routing Proposal

The hierarchical severity-routing framework proposed in this study should be interpreted as an initial conceptual direction rather than a complete clinical decision-support system. The proposed approach uses the observed behavior of the three severity-specific ViT classifiers to define a simplified sequential routing strategy, where the system first evaluates the probability of severe pneumonia, then moderate pneumonia, and finally low/mild pneumonia. This design is useful because it provides an interpretable way to combine the three binary severity models into a unified grading workflow. However, a real-world clinical implementation would likely require a more sophisticated decision mechanism.
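The sequential routing strategy described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in this study: the function name `route_severity`, the argument names, the mapping of each threshold to a severity level, the default threshold value of 0.5, and the use of a strict comparison are all assumptions made for the example.

```python
def route_severity(
    p_severe: float,
    p_moderate: float,
    p_mild: float,
    tau_3: float = 0.5,  # assumed threshold for the Severity 3 branch
    tau_2: float = 0.5,  # assumed threshold for the Severity 2 branch
    tau_1: float = 0.5,  # assumed threshold for the Severity 1 branch
) -> str:
    """Combine three binary branch probabilities into one grading output.

    Branches are checked from most to least severe, and the first branch
    whose probability exceeds its threshold determines the result.
    """
    if p_severe > tau_3:
        return "Severity 3 (severe)"
    if p_moderate > tau_2:
        return "Severity 2 (moderate)"
    if p_mild > tau_1:
        return "Severity 1 (low/mild)"
    # No branch fired: the case is left for additional review.
    return "Normal or Uncertain"
```

Checking the severe branch first means that a case on which several branches fire is assigned the highest activated grade, which is a conservative choice for triage but not the only possible one.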

In a hospital setting, pneumonia severity assessment is not determined only by the visual extent of radiographic opacity. Clinical decision making may also depend on patient age, symptoms, oxygen saturation, inflammatory markers, comorbidities, previous imaging, treatment history, and radiologist interpretation. Therefore, the severity-routing proposal described in this work should be understood as a radiographic-burden-oriented decision-support concept, not as a complete substitute for clinical severity assessment. Its purpose is to show how severity-specific AI models may be organized into a structured inference workflow, while keeping the final interpretation under medical supervision.

A more advanced clinical version of this system could incorporate additional model outputs and mathematical decision layers. For example, instead of relying only on fixed thresholds such as $\tau_{1}$, $\tau_{2}$, and $\tau_{3}$, future work could investigate calibrated probabilities, uncertainty estimation, Bayesian decision rules, ordinal classification models, cost-sensitive triage functions, or meta-classifiers that learn how to combine the three severity probabilities. Such methods could reduce ambiguity in cases where multiple severity branches activate simultaneously, which was observed during the cross-severity inference analysis (boundary severity cases). In addition, probability calibration would be essential before interpreting model outputs as clinically meaningful confidence scores.
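One common way to check whether a branch's probabilities are calibrated is the expected calibration error (ECE). The sketch below is a minimal, standard-library version for a single binary branch, binning the predicted probability of the positive class and comparing it with the observed positive rate. The function name, the choice of ten equal-width bins, and the input format are assumptions for illustration; no calibration analysis of this kind is reported in this study.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE for one binary branch.

    probs  : predicted probability of the positive class, each in [0, 1]
    labels : ground-truth labels, 1 for positive and 0 for negative
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equally long")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")

    # Group predictions into equal-width probability bins.
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))

    # Weighted average gap between mean predicted probability and
    # the observed fraction of positives in each bin.
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_prob - frac_pos)
    return ece
```

A value near zero indicates that, for example, cases scored around 0.9 are positive roughly 90% of the time, which is the property needed before a threshold on the branch output can be read as a confidence level.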

Another important direction is to evaluate the routing framework in a real clinical environment. The present study uses severity labels derived from RSNA bounding-box extent, which provide an interpretable proxy for radiographic burden but do not represent direct clinical severity labels. Hospital-based validation would make it possible to compare model predictions against radiologist grading, clinical outcomes, oxygen requirements, hospitalization status, or treatment escalation. Such evaluation could reveal whether the current severity-routing logic is clinically useful, whether the thresholds need to be adapted, or whether the entire decision process should be reformulated.

The proposed framework should therefore be viewed as a starting point for deeper investigation. More AI models, ensemble methods, explainability tools, and mathematical algorithms could be incorporated as the system is tested and refined. For example, Grad-CAM or other attention-based visualization methods could be used to verify whether the routing decisions are based on clinically meaningful opacity regions. Similarly, transformer-based models, convolutional architectures, hybrid CNN–ViT systems, or multimodal models could be compared to determine whether the same severity-routing behavior remains stable across architectures.

Overall, the severity-routing proposal represents a simplified but useful bridge between the experimental cross-severity inference results and a potential clinical decision-support workflow. It demonstrates how independently trained severity-specific models may be combined into a structured system for pneumonia grading, while also highlighting the need for further validation, calibration, and clinical refinement. Future studies should investigate this direction using larger datasets, external hospital cohorts, expert radiologist scoring, and more advanced decision algorithms before such a system could be considered for practical clinical deployment.

7. Conclusions

This paper investigated whether the expert bounding-box annotations provided in the RSNA 2018 Pneumonia Detection dataset can be repurposed beyond their original detection and localization role to support a severity-aware interpretation of pneumonia burden in chest X-rays. By quantifying the cumulative area of all annotated lesion boxes in each pneumonia-positive image, we derived a transparent image-level burden measure and used it to construct a balanced three-tier severity grading framework corresponding to low/mild, moderate, and severe radiographic involvement.
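The burden computation summarized above can be illustrated with a short Python sketch. It assumes a labels file with the columns `patientId`, `x`, `y`, `width`, `height`, and `Target`, as in the publicly distributed RSNA challenge labels, and it uses the thresholds of 60,455 px² and 139,133 px² given in the Figure 4 caption. The function names and the sample rows are hypothetical, and the sketch sums box areas directly without correcting for overlapping boxes.

```python
import csv
import io
from collections import defaultdict
from typing import Dict

T_LOW = 60_455    # px^2, Severity 1 / Severity 2 boundary (Figure 4 caption)
T_HIGH = 139_133  # px^2, Severity 2 / Severity 3 boundary (Figure 4 caption)

# Hypothetical rows in the assumed labels format; negative rows have no box.
SAMPLE = """patientId,x,y,width,height,Target
a,,,,,0
b,100,200,200,250,1
c,100,200,200,300,1
c,600,200,200,300,1
d,100,100,400,400,1
"""


def total_burden(csv_text: str) -> Dict[str, float]:
    """Sum the areas of all annotated boxes for each pneumonia-positive image."""
    burden: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Target"] != "1":
            continue  # skip images without annotated opacities
        burden[row["patientId"]] += float(row["width"]) * float(row["height"])
    return dict(burden)


def severity_tier(area: float) -> int:
    """Map total annotated area (px^2) to Severity 1, 2, or 3."""
    if area < T_LOW:
        return 1
    if area < T_HIGH:
        return 2
    return 3


tiers = {pid: severity_tier(area) for pid, area in total_burden(SAMPLE).items()}
```

Because the label depends only on annotation geometry and two fixed thresholds, the same split can be regenerated from the labels file alone, which is the reproducibility property emphasized in this work.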

The resulting severity split proved to be both interpretable and structurally consistent. The proposed thresholds produced three nearly perfectly balanced pneumonia subgroups, while the accompanying histogram, boxplot, violin, ECDF, and sex-distribution analyses confirmed that the derived groups preserved a meaningful ordinal progression of radiographic burden. In this sense, the study shows that bounding-box extent metadata, although not equivalent to a clinical severity score, can nevertheless serve as a practical and explainable proxy for radiographic disease burden.

Beyond dataset reinterpretation, the study also demonstrated the downstream usefulness of the proposed grading framework through Vision Transformer experiments. The severity-specialized models achieved strong validation performance, especially for the higher-burden categories. The best specialized branches reached maximum validation accuracies of 94.98% for Severity 1, 97.88% for Severity 2, and 99.08% for Severity 3, while the classical binary baseline reached 95.05%. Because each specialized branch solves a different normal-versus-severity task, these figures are not a direct end-to-end comparison with the baseline. They nevertheless suggest that severe pneumonia cases are substantially more separable in radiographic space, whereas lower-burden categories remain more challenging, likely due to greater visual overlap with normal or intermediate presentations.

Taken together, these findings support the main premise of the paper: the RSNA 2018 dataset can be reinterpreted not only as a benchmark for pneumonia detection and localization, but also as a useful resource for severity-aware chest X-ray analysis. The proposed framework provides a simple, reproducible, and annotation-driven path from localized opacity labels to ordinal burden categories, thereby bridging detection-oriented and grading-oriented use of the dataset.

At the same time, several limitations should be acknowledged. The proposed severity labels are derived from annotation geometry and should not be interpreted as direct clinical severity scores. Bounding-box area reflects spatial extent of annotated opacity, but it does not capture all clinically relevant aspects of pneumonia, such as density, temporal evolution, symptoms, laboratory findings, or patient outcome. In addition, the official RSNA test set could only be used in a downstream inference setting because it does not expose the same metadata structure required for the full routing pipeline. For these reasons, the present framework should be viewed as an interpretable radiographic burden proxy rather than a definitive clinical grading system.

Future work can extend this study in several directions. First, the proposed severity split can be validated on external chest X-ray datasets or through radiologist review to better assess clinical plausibility. Second, more advanced architectures and multi-branch strategies can be explored for severity-aware classification. Third, bounding-box extent could be complemented with lung segmentation, opacity density estimation, or anatomical region analysis in order to produce richer burden descriptors. Finally, the framework may serve as a foundation for explainable AI studies, subgroup analysis, and metadata-aware modeling of pneumonia severity.

Overall, this work moves beyond binary pneumonia detection and demonstrates that expert localization annotations can be transformed into a useful severity-oriented representation of disease burden. By doing so, it opens a practical research direction for annotation-driven chest X-ray grading based on a widely used public benchmark.

8. Next Steps

The present work establishes a severity-aware reinterpretation of the RSNA 2018 dataset based on expert bounding-box extent metadata and demonstrates that the resulting severity groups can support strong downstream Vision Transformer classification. However, several promising next steps emerge from the current findings and from preliminary follow-up experiments conducted alongside this study.

A first natural extension is to combine the severity framework proposed here with lung segmentation (Figure 48). In preliminary experiments, we trained a U-Net-based segmentation model and used it to generate segmented lung regions for 2000 chest X-ray images drawn from the same RSNA 2018 dataset used in this paper. Visual inspection and early quantitative behavior suggested that the segmentation model performed better on normal and lower-severity radiographs, where lung boundaries remained clearer and less disrupted by extensive opacity burden. In addition, initial Vision Transformer experiments on these segmented images showed extremely strong validation performance for low/mild CXRs, approaching 100% in some preliminary runs. While these early results should be interpreted cautiously, they indicate that lung-focused preprocessing may provide an important advantage for severity-aware classification.

This direction is especially relevant because the present study showed that the low/mild severity group is the most difficult to classify accurately. From a radiographic perspective, the difference between normal and low/mild pneumonia cases is inherently subtle, as the burden of opacity may be limited, focal, or visually close to normal anatomical variation. By constraining model attention to the lung fields and suppressing irrelevant background content, lung segmentation may help the classifier focus on the most informative regions and reduce confounding signals outside the lungs. A particularly important next step, therefore, is to segment the full severity-labeled dataset developed in this paper and repeat the ViT experiments under the same evaluation protocol. Such a comparison would make it possible to assess whether lung-focused inputs improve validation accuracy, especially for the more challenging low/mild severity category.

A second major direction concerns clinical translation (Figure 49). Because this study already includes medical collaboration through a physician coauthor, there is a valuable opportunity to move beyond retrospective computational analysis and toward medically meaningful deployment scenarios. In future work, the developed algorithms could be refined and tested in a hospital setting in order to evaluate how well they function as decision-support tools for real clinicians. Rather than replacing radiologists or physicians, such systems could be designed to act as assistants that provide severity-oriented suggestions, visual localization cues, and interpretable confidence information.

For such clinical deployment, explainability will be especially important. Future iterations of the framework should therefore incorporate more explicit explainable AI components, such as saliency maps, attention visualization, and region-based attribution aligned with radiographic findings. These additions could help physicians better understand why the model assigns a given severity level and whether the highlighted regions correspond to meaningful pathology. At the same time, deployment-oriented evaluation should extend beyond validation accuracy and include metrics that are more relevant to clinical use, such as sensitivity to higher-burden cases, reliability under ambiguous presentations, confidence calibration, and agreement with expert assessment.

Overall, the next steps following this study are: first, to investigate whether lung-segmented severity-aware classification can improve performance, particularly for subtle low/mild cases; and second, to progress toward clinically informed evaluation and eventual hospital-based testing of the developed framework. Together, these directions would strengthen both the methodological and translational significance of the present work.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
ML	Machine Learning
ViT	Vision Transformer
CNN	Convolutional Neural Network
CXR	Chest X-Ray
XAI	Explainable Artificial Intelligence
RSNA	Radiological Society of North America
AP	Anteroposterior
PA	Posteroanterior
ECDF	Empirical Cumulative Distribution Function
DICOM	Digital Imaging and Communications in Medicine
CSV	Comma-Separated Values
PACS	Picture Archiving and Communication System
U-Net	U-shaped Convolutional Network
AUC	Area Under the Curve
ROC	Receiver Operating Characteristic
ACC	Accuracy
LR	Learning Rate
WD	Weight Decay
TP	True Positive
TN	True Negative
FP	False Positive
FN	False Negative

References

Trînc, E.C.; Ancuți, C.; Stolojescu-Crişan, C.; Nițǎ, V. Beyond Accuracy: An Explainable AI Approach to Chest X-Ray Pneumonia Detection. In Proceedings of the 2025 27th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC); 2025; pp. 253–262.
Shih, G.; Wu, C.C.; Halabi, S.S.; Kohli, M.D.; Prevedello, L.M.; Cook, T.S.; Sharma, A.; Amorosa, J.K.; Arteaga, V.; Galperin-Aizenberg, M.; et al. Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Radiol. Artif. Intell. 2019, 1, e180041.
Cohen, J.P.; Dao, L.; Roth, K.; Morrison, P.; Bengio, Y.; Abbasi, A.F.; Shen, B.; Mahsa, H.K.; Ghassemi, M.; Li, H.; et al. Predicting COVID-19 Pneumonia Severity on Chest X-ray With Deep Learning. Cureus 2020, 12, e9448.
Signoroni, A.; Savardi, M.; Benini, S.; Adami, N.; Leonardi, R.; Gibellini, P.; Vaccher, F.; Ravanelli, M.; Borghesi, A.; Maroldi, R.; et al. BS-Net: Learning COVID-19 pneumonia severity on a large chest X-ray dataset. Med. Image Anal. 2021, 71, 102046.
Chandra, T.B.; Singh, B.K.; Jain, D. Disease Localization and Severity Assessment in Chest X-Ray Images using Multi-Stage Superpixels Classification. Comput. Methods Programs Biomed. 2022, 222, 106947.
Slika, B.; Dornaika, F.; Merdji, H.; Hammoudi, K. Lung pneumonia severity scoring in chest X-ray images using transformers. Med. Biol. Eng. Comput. 2024, 62, 2389–2407.
Radiological Society of North America. RSNA Pneumonia Detection Challenge (2018), 2018. Official challenge page. Accessed 2026-06-14.
Team, T.D. Pneumonia Detection in Chest Radiographs. arXiv 2018, arXiv:cs.CV/1811.08939.
Pan, I.; Cadrin-Chênevert, A.; Cheng, P.M. Tackling the Radiological Society of North America Pneumonia Detection Challenge. Am. J. Roentgenol. 2019, 213, 568–574.
Mao, L.; Yumeng, T.; Lina, C. Pneumonia Detection in chest X-rays: a deep learning approach based on ensemble RetinaNet and Mask R-CNN. In Proceedings of the 2020 Eighth International Conference on Advanced Cloud and Big Data (CBD), 2020; pp. 213–218.
Trinc, E.C. RSNA 2018 Severity Split, 2026. Kaggle dataset, accessed 2026-06-14.
Shih, G.; Wu, C.C.; Halabi, S.S.; Kohli, M.D.; Prevedello, L.M.; Cook, T.S.; Sharma, A.; Amorosa, J.K.; Arteaga, V.; Galperin-Aizenberg, M.; et al. Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Radiol. Artif. Intell. 2019, 1, e180041.
Siddiqi, R.; Javaid, S. Deep Learning for Pneumonia Detection in Chest X-ray Images: A Comprehensive Survey. J. Imaging 2024, 10.
Zandehshahvar, M.; van Assen, M.; Kim, E.; Kiarashi, Y.; Keerthipati, V.; Tessarin, G.; Muscogiuri, E.; Stillman, A.E.; Filev, P.; Davarpanah, A.H.; et al. Confidence-Aware Severity Assessment of Lung Disease from Chest X-Rays Using Deep Neural Network on a Multi-Reader Dataset. J. Imaging Inform. Med. 2025, 38, 793–803.
Singh, T.; Mishra, S.; Kalra, R.; Satakshi; Kumar, M.; Kim, T. COVID-19 severity detection using chest X-ray segmentation and deep learning. Sci. Rep. 2024, 14, 19846.
Nizam, N.B.; Siddiquee, S.M.; Shirin, M.; Bhuiyan, M.I.H.; Hasan, T. COVID-19 Severity Prediction from Chest X-ray Images Using an Anatomy-Aware Deep Learning Model. J. Digit. Imaging 2023, 36, 2100–2112.
Frid-Adar, M.; Amer, R.; Gozes, O.; Nassar, J.; Greenspan, H. COVID-19 in CXR: From Detection and Severity Scoring to Patient Disease Monitoring. IEEE J. Biomed. Health Inform. 2021, 25, 1892–1903.
Pan, I.; Cadrin-Chênevert, A.; Cheng, P.M. Tackling the Radiological Society of North America Pneumonia Detection Challenge. Am. J. Roentgenol. 2019, 213, 568–574.

Figure 1. Initial quality-assurance example reproduced from Shih et al. [12], showing multiple radiologists’ bounding-box annotations on the same pneumonia case during the shared calibration stage. The figure highlights both overlap and variation in box placement, illustrating the expert-driven annotation process underlying the RSNA bounding-box dataset.

Figure 2. Example chest X-ray images with bounding box annotations. The images illustrate both localized and multifocal pneumonia patterns.

Figure 3. Aggregated bounding box location heatmap showing the spatial distribution of pneumonia annotations across the dataset. High-density regions correspond to the lung fields, particularly in mid-to-lower zones.

Figure 4. Schematic illustration of the three pneumonia severity levels used in this study. The severity groups are defined from the total annotated disease area per chest X-ray image, with Severity 1 corresponding to low/mild burden ($< 60{,}455$ px²), Severity 2 to moderate burden ($60{,}455$–$139{,}133$ px²), and Severity 3 to severe burden ($\geq 139{,}133$ px²). The visual opacities are illustrative and represent increasing radiographic burden from localized involvement to widespread disease.

Figure 5. Count and proportion of images in the three derived severity categories. The optimized thresholding procedure yields an almost perfectly balanced distribution across Severity 1, Severity 2, and Severity 3.

Figure 6. Histogram of total disease area per pneumonia-positive chest X-ray in the RSNA 2018 dataset. For each image, total burden was computed by summing the areas of all annotated lesion bounding boxes. The right-skewed distribution indicates that most positive cases have relatively limited annotated extent, while a smaller subset exhibits substantially larger radiographic burden.

Figure 7. Distribution of total disease area in pneumonia-positive RSNA 2018 chest X-rays, stacked by sex. The shaded regions indicate the three final severity intervals derived from the optimized rank-based split: Severity 1 (low/mild), Severity 2 (moderate), and Severity 3 (severe).

Figure 8. Distribution of total disease area across the three severity categories, shown using boxplots and violin plots. Both visualizations confirm the expected monotonic increase in burden from Severity 1 to Severity 3.

Figure 9. Empirical cumulative distribution functions (ECDFs) of total disease area for the three severity categories. The curves show the within-grade distribution of lesion burden and confirm the ordinal separation between low/mild, moderate, and severe groups.

Figure 10. Sex distribution within each severity category after the optimized three-tier split. Males are more frequent than females in all tiers, but the optimization avoids extreme sex imbalance while preserving near-equal class sizes.

Figure 11. Dataset split used in this study. From the curated RSNA 2018 cohort employed in the pipeline, 8,851 studies were retained in the Normal category, while pneumonia-positive cases were divided into three severity groups: Severity 1 (1,999 images), Severity 2 (2,006 images), and Severity 3 (2,007 images). A stratified class-wise split was then applied independently within each category, assigning 70% of the data to training, 15% to validation, and 15% to testing. The held-out test subset is not used during optimization, but is reserved for later inference-based evaluation in order to simulate a more realistic clinical deployment scenario.

Figure 12. Budget-aware ViT configuration exploration. Although many architectural and optimization combinations were possible, the search was reduced to 24 strategically selected runs: four ViT backbones, three learning rates, and two weight-decay values. All other training settings were fixed to enable fair comparison under realistic computational constraints.

Figure 13. Overview of the Vision Transformer (ViT) pipeline used in this study. The workflow starts from the RSNA 2018 chest X-ray dataset reorganized into a severity-aware JPEG folder structure, followed by a stratified class-wise split into training, validation, and testing subsets. An automated NAS procedure is then used to explore multiple ViT architectures and hyperparameter combinations, including model variant, image resolution, batch size, learning rate, weight decay, head design, optimizer, scheduler, and mixed-precision settings. The pipeline supports both the four-class severity-aware setting and specialized binary branches comparing normal cases against each severity subgroup independently.

Figure 14. Specialized binary Vision Transformer pipeline used for the Severity 1 branch. The workflow takes only Normal and Pneumonia Severity 1 (low/mild) chest X-ray images as input, applies preprocessing and augmentation, and trains a ViT-B/16 model at $512 \times 512$ resolution for binary classification. The output layer produces softmax probabilities for the two classes, allowing the model to distinguish subtle low/mild pneumonia cases from normal chest radiographs.

Figure 15. Overview of the hardware and software environment used for model training. Experiments were conducted on Runpod.io using a PyTorch 2.8.0 environment with one NVIDIA RTX 5090 GPU (32 GB VRAM), 60 GB RAM, 12 vCPUs, and 80 GB disk space.

Figure 16. Qualitative examples from the generated three-tier severity split. Bounding boxes were overlaid with OpenCV on a debugging-oriented dataset created for manual inspection. From Severity 1 (low/mild) to Severity 3 (severe), the annotated boxes generally increase in size and/or number, while the visible pneumonia-related white opacities also become progressively more extensive and pronounced.

Figure 17. Comparison of final and maximum validation accuracy across the 24 Severity 1 ViT exploration runs. Results are grouped by backbone and configuration. The strongest validation performance was obtained by ViT-B/16 using a learning rate of $5 \times 10^{-5}$ and weight decay of 0.01.

Figure 18. Training duration comparison across the 24 Severity 1 ViT exploration runs. ViT-L/16 required substantially longer training time than the other backbones, while ViT-B/32 was the fastest model family but achieved lower validation accuracy than ViT-B/16.

Figure 19. Training and validation accuracy curves for the selected Severity 1 ViT-B/16 model. The best validation checkpoint was reached at epoch 39, with a maximum validation accuracy of 94.98%, while the final validation accuracy was 94.42%.

Figure 20. Training and validation loss curves for the selected Severity 1 ViT-B/16 model. The training loss decreases throughout the training schedule, while the validation loss shows signs of saturation and later instability, suggesting mild overfitting after the best validation checkpoint.

Figure 21. Validation macro-precision, macro-recall, and macro-F1 evolution for the selected Severity 1 ViT-B/16 model. These metrics provide a class-balanced view of model behavior beyond accuracy alone.

Figure 22. Class-specific validation accuracy for the selected Severity 1 ViT-B/16 model. The curves compare validation performance on the normal class and the low/mild pneumonia class.

Figure 23. Training and validation confusion matrices for the selected Severity 1 ViT-B/16 model at the final epoch. The model achieves very strong training-set separation, while the validation matrix shows that most errors occur when Severity 1 low/mild pneumonia images are predicted as normal.

Figure 24. Comparison of final and maximum validation accuracy across the 24 Severity 2 ViT exploration runs. Results are grouped by backbone and configuration. Several configurations reached the highest maximum validation accuracy of 97.88%, while ViT-B/16 achieved the best final validation accuracy among the top-performing runs.

Figure 25. Training duration comparison across the 24 Severity 2 ViT exploration runs. ViT-L/16 required substantially longer training time than the other backbones, while ViT-B/16 provided the best balance between validation accuracy and computational efficiency.

Figure 26. Training and validation accuracy curves for the selected Severity 2 ViT-B/16 model. The best validation checkpoint was reached at epoch 13, with a maximum validation accuracy of 97.88%, while the final validation accuracy was 97.84%.

Figure 27. Training and validation loss curves for the selected Severity 2 ViT-B/16 model. The model converges rapidly, with validation loss stabilizing after the early epochs, indicating stable generalization behavior for the moderate-severity classification branch.

Figure 28. Validation macro-precision, macro-recall, and macro-F1 evolution for the selected Severity 2 ViT-B/16 model. The curves show strong balanced validation performance across the normal and moderate-pneumonia classes.

Figure 29. Class-specific validation accuracy for the selected Severity 2 ViT-B/16 model. The curves compare validation performance on the normal class and the moderate-pneumonia class, showing strong performance for both categories.

Figure 30. Training and validation confusion matrices for the selected Severity 2 ViT-B/16 model at the final epoch. The model shows near-perfect separation on the training set and strong generalization on the validation set, with relatively few errors in both directions.

Figure 31. Comparison of final and maximum validation accuracy across the 24 Severity 3 ViT exploration runs. Results are grouped by backbone and configuration. The strongest validation performance was obtained by ViT-L/16, while ViT-B/16 and ViT-B/32 achieved closely competitive results at substantially lower computational cost.

Figure 32. Training duration comparison across the 24 Severity 3 ViT exploration runs. ViT-L/16 required the longest training time, whereas ViT-B/32 was the fastest among the competitive configurations.

Figure 33. Training and validation accuracy curves for the selected Severity 3 ViT-L/16 model. The best validation checkpoint was reached at epoch 39, with a maximum validation accuracy of 99.08%, while the final validation accuracy was 98.99%.

Figure 34. Training and validation loss curves for the selected Severity 3 ViT-L/16 model. The model converges rapidly and maintains low validation loss, supporting the strong generalization performance observed for the severe-pneumonia branch.

Figure 35. Validation macro-precision, macro-recall, and macro-F1 evolution for the selected Severity 3 ViT-L/16 model. The curves show strong balanced validation performance across the normal and severe-pneumonia classes.

Figure 36. Class-specific validation accuracy for the selected Severity 3 ViT-L/16 model. The curves compare validation performance on the normal class and the severe-pneumonia class, showing strong performance for both categories.

Figure 37. Training and validation confusion matrices for the selected Severity 3 ViT-L/16 model at the final epoch. The model achieves almost perfect separation on the training set and very strong generalization on the validation set, with only a small number of errors in either direction.

Figure 38. Comparison of final and maximum validation accuracy across the 24 classical binary ViT exploration runs. The best validation performance was obtained by ViT-L/16 using a learning rate of $5 \times 10^{-5}$ and weight decay of 0.05.

Figure 39. Training duration comparison across the 24 classical binary ViT exploration runs. ViT-L/16 required the longest training time, while ViT-B/16 provided a more efficient alternative with competitive validation accuracy.

Figure 40. Training and validation accuracy curves for the selected classical binary ViT-L/16 model. The best validation checkpoint was reached at epoch 45, with a maximum validation accuracy of 95.05%, while the final validation accuracy was 94.72%.

Figure 41. Training and validation loss curves for the selected classical binary ViT-L/16 model. The training loss decreases during training, while the validation loss shows fluctuations after the early convergence period, supporting validation-based checkpoint selection.

Figure 42. Validation macro-precision, macro-recall, and macro-F1 evolution for the selected classical binary ViT-L/16 model. These metrics provide a class-balanced view of model performance beyond accuracy alone.

Figure 43. Class-specific validation accuracy for the selected classical binary ViT-L/16 model. The curves compare validation performance on the normal class and the combined pneumonia-positive class.

Figure 44. Size comparison of the three selected severity-specific ViT models. The Severity 1 selected model has a disk size of 989 MB (1,037,438,261 bytes), while the Severity 2 and Severity 3 selected models are considerably larger, with 3.41 GB (3,669,091,565 bytes) and 3.39 GB (3,650,217,197 bytes), respectively.

Figure 45. Overview of the cross-severity inference experiment. Each selected severity-specific ViT model was applied to all three pneumonia validation subsets, resulting in a total of nine cross-severity inference runs.

Figure 46. Cross-severity inference probability distributions for the selected severity-specific ViT models. Each model was applied to the pneumonia validation images from Severity 1, Severity 2, and Severity 3. The violin plots show the predicted probability of the target class of each binary model: $P(\text{mild})$ for the Severity 1 model, $P(\text{moderate})$ for the Severity 2 model, and $P(\text{severe})$ for the Severity 3 model. The dashed horizontal line indicates the 0.5 decision threshold.

Figure 47. Conceptual overview of the proposed hierarchical severity-routing framework. The system evaluates the Severity 3 model first, followed by the Severity 2 and Severity 1 models only when the higher-severity probability does not exceed its corresponding threshold. If none of the severity branches reaches the decision threshold, the case can be flagged as Normal or Uncertain for additional review.
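
The routing order described in the Figure 47 caption can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function `route_severity`, the helper `make_stub`, and the stub probabilities are hypothetical names and values introduced only for illustration, and the 0.5 thresholds are an assumption that mirrors the decision threshold shown in Figure 46. In practice each scorer would wrap one of the selected severity-specific ViT models.

```python
from typing import Callable, Optional, Sequence, Tuple

# A scorer is any callable returning P(target class) in [0, 1] for one image.
# Here it is a placeholder for a severity-specific model (assumption).
Scorer = Callable[[object], float]
Branch = Tuple[str, Scorer, float]  # (label, scorer, decision threshold)


def route_severity(
    image: object,
    branches: Sequence[Branch],
    fallback: str = "Normal or Uncertain",
) -> Tuple[str, Optional[float]]:
    """Return the first label whose probability exceeds its threshold.

    Branches are evaluated in the given order, so lower-severity scorers
    are only called when every earlier branch stayed at or below threshold.
    """
    for label, scorer, threshold in branches:
        probability = scorer(image)
        if probability > threshold:
            return label, probability
    # No branch exceeded its threshold: flag the case for additional review.
    return fallback, None


def make_stub(probability: float) -> Scorer:
    """Hypothetical stand-in for a trained model: ignores the image."""
    return lambda _image: probability


# Illustrative values only; highest severity is listed (and evaluated) first.
example_branches = [
    ("Severity 3", make_stub(0.12), 0.5),
    ("Severity 2", make_stub(0.81), 0.5),
    ("Severity 1", make_stub(0.95), 0.5),
]

result = route_severity(image=None, branches=example_branches)
print(result)  # ('Severity 2', 0.81)
```

In this sketch the Severity 1 scorer is never called, because the Severity 2 branch already exceeds its threshold. A probability exactly equal to the threshold is treated as not exceeding it; the caption does not specify this boundary case, so that choice is also an assumption.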

Figure 48. Proposed next-step extension of the severity-aware classification framework through lung-segmentation-guided modeling.

Figure 49. Example concept dashboard illustrating how AI-assisted chest X-ray analysis could support clinical triage and consultation workflows. The interface combines a prioritized patient queue, current-exam visualization with AI attention overlays, severity prediction, confidence scoring, and structured decision-support outputs. In this example, the system highlights a severe pneumonia case, suggests immediate review priority, provides explainability cues such as attention maps and suspicious-region highlights, and summarizes clinically relevant findings to assist radiologists and clinicians in faster review, triage prioritization, and more informed consultation.

Table 1. Summary statistics of bounding box annotations in the RSNA Pneumonia Detection dataset.

#	Metric	Description / Value
1	Total bounding box annotations	9,555 annotations across all pneumonia-positive cases
2	Positive patients	6,012 patients labeled with pneumonia
3	Negative patients	20,672 patients labeled as normal
4	Average bounding boxes per positive patient	1.59 bounding boxes per patient
5	Maximum bounding boxes per patient	Up to 4 bounding boxes in a single patient
6	Positive cases without bounding boxes	0 (all positive cases include at least one annotation)
7	Negative cases with bounding boxes	0 (no bounding boxes are assigned to negative cases)
8	Annotation consistency	Bounding boxes are exclusively associated with positive cases, indicating a consistent labeling structure

Table 2. Severity 1 ViT model exploration results across 24 runs. Four ViT backbones were evaluated using six learning-rate and weight-decay configurations per model. All runs used

512 \times 512