Introduction
Axial spondyloarthritis (axSpA) is a group of chronic inflammatory rheumatic diseases primarily affecting the spine and sacroiliac joints, typically beginning in early adulthood and potentially leading to progressive structural damage and functional impairment [
1]. Early diagnosis is of critical importance to alter the natural course of the disease and improve long-term functional outcomes. However, conventional radiography often appears normal in the early stages of axSpA, resulting in diagnostic delays [
2].
According to the European League Against Rheumatism (EULAR), the definition of sacroiliitis on Magnetic Resonance Imaging (MRI) in patients with SpA is based on the qualitative, i.e., visually appreciable, presence of subchondral bone marrow edema (BME) as an indicator of active inflammation. Within the framework of the Assessment of SpondyloArthritis International Society (ASAS) classification criteria, MRI's ability to detect inflammation in the sacroiliac joints is highlighted as crucial for diagnosis [
3]. MRI can reveal bone marrow edema and signs of active inflammation much earlier than radiographic imaging. However, interpreting MRI scans is time-consuming and inherently subjective. Differences in interpretation between expert radiologists and rheumatologists may arise during image assessment [
4]. These discrepancies can introduce uncertainty in the diagnostic and therapeutic decision-making process, potentially compromising diagnostic accuracy.
Despite efforts toward standardization and the development of guidelines such as those by ASAS [
3], qualitative and semi-quantitative (i.e., measurable but not fully numerical) assessments remain largely dependent on the observer’s interpretation. As a result, intra- and inter-observer variability persists in the image analysis process, continuing to pose diagnostic challenges [
4]. Although ASAS imaging criteria aim to improve standardization, in early-stage or borderline cases, MRI assessments still rely heavily on observer experience. Consequently, objectivity and reproducibility remain limited [
4,
5].
In recent years, artificial intelligence (AI) and deep learning techniques have emerged as promising tools to overcome these limitations in medical imaging. Convolutional Neural Networks (CNNs) can analyze textural and morphological features of images at a resolution beyond human visual perception [
6]. However, the heterogeneous distribution of lesions in sacroiliitis reduces the efficiency of classical CNN models that process all slices uniformly. Therefore, the Multiple Instance Learning (MIL) framework is particularly well-suited, as it enables the model to treat each patient as a “bag” of instances (i.e., slices) and learn the importance of each slice dynamically. Gated Attention MIL, in particular, generates attention weights across slices, allowing the model to emphasize inflamed regions [
7]. AI algorithms can thus accelerate the automated analysis of MRI data, grading of disease activity, and detection of inflammatory changes in the sacroiliac joints.
This study aims to evaluate the diagnostic performance of a Gated Attention MIL-based deep learning model combined with Test-Time Augmentation (TTA) for detecting active inflammatory sacroiliitis on sacroiliac joint MRI scans.
Materials and Methods
This study was designed as a retrospective, single-center, observational investigation. The study was conducted in accordance with the principles of the Declaration of Helsinki and received approval from the local ethics committee (Decision No: 2024/001, date: 19 November 2024). Due to the retrospective nature of the study, the requirement for obtaining informed consent from patients was waived by the ethics committee. All imaging data and clinical information included in the study were obtained from the hospital’s electronic medical record system, and all data were fully anonymized prior to analysis.
Study Population and Inclusion Criteria
A total of 554 individuals aged 18 years or older who presented to our hospital’s Rheumatology Outpatient Clinic between November 2022 and October 2024 were included in this study. Of these, 276 were patients diagnosed with axSpA, and 278 were clinically and radiologically confirmed healthy controls. The diagnosis of axSpA was established by consensus among one rheumatologist and two radiologists based on the ASAS 2009 classification criteria [
3].
Study Population and Selection Criteria
This study retrospectively included cases that met specific inclusion and exclusion criteria. The inclusion criteria were as follows: age 18 years or older, availability of sacroiliac joint MRI, full accessibility of clinical records, and the presence or absence of acute inflammatory sacroiliitis clearly reported by a radiologist.
Exclusion criteria were significant artifacts or insufficient resolution on MRI, a history of any surgical intervention on the sacroiliac joint, pathological conditions requiring differential diagnosis (such as fractures), and MRI data acquired with sequences outside the standard imaging protocol. In accordance with these criteria, only cases with diagnostically adequate and standardized imaging were included in the study.
Imaging Protocol
All MRI scans were performed using a 1.5 Tesla Magnetom Vision scanner (Siemens Medical Systems, Erlangen, Germany) located at the same center. For sacroiliac joint evaluation, the T2-TSE Transverse Fat-Suppressed (t2_tse_tra_FS) sequence was selected in accordance with the protocol recommended by ASAS. This sequence is recognized in the literature for its high sensitivity in detecting BME (Figure 1) [
3].
Imaging parameters were as follows: repetition time/echo time (TR/TE) of 2220/66 ms, slice thickness of 3.3 mm, matrix size of 110 × 256, and field of view (FOV) of 250 mm. For each individual, between 25 and 60 transverse (axial) slices were obtained. All DICOM data were converted to 8-bit grayscale PNG format using a Python-based converter, resized to 224 × 224 pixels, normalized to a 0–255 pixel value range, and subjected to histogram equalization for contrast standardization.
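The preprocessing chain described above (resize, rescale to 0–255, histogram equalization) can be illustrated with a minimal standard-library sketch. The function names, the nearest-neighbour interpolation, and the min-max rescaling are assumptions made for illustration; the study's converter is not described in this level of detail.

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image stored as a list of rows."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def rescale_to_uint8(image):
    """Linearly map raw intensities to integers in the 0-255 range."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: map everything to 0
        return [[0 for _ in row] for row in image]
    scale = 255.0 / (hi - lo)
    return [[min(255, max(0, int(round((v - lo) * scale)))) for v in row]
            for row in image]


def equalize_histogram(image):
    """Histogram equalization of an 8-bit image (values 0-255)."""
    flat = [v for row in image for v in row]
    hist = [0] * 256
    for v in flat:
        hist[v] += 1
    # Cumulative distribution of grey levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:  # only one grey level present
        return [row[:] for row in image]
    lut = [max(0, int(round((c - cdf_min) * 255.0 / (total - cdf_min))))
           for c in cdf]
    return [[lut[v] for v in row] for row in image]


def preprocess_slice(raw, size=224):
    """Hypothetical helper: resize, rescale to 0-255, then equalize."""
    return equalize_histogram(rescale_to_uint8(resize_nearest(raw, size, size)))
```

In practice an imaging library would replace these helpers; the sketch only makes the order of operations explicit.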
Dataset and Preprocessing
The dataset consisted of 554 individuals (patients and healthy controls) who underwent MRI for suspected sacroiliitis. Data were extracted from ISO files in the hospital’s PACS system using automated scripts, selecting only the t2_tse_tra_FS sequence. Images were organized in individual folders for each subject, creating a “bag” structure. Each bag contained a minimum of 25 slices.
To prevent data leakage, the dataset was split on a patient basis into training, validation, and test subsets. The training set included 193 patients and 194 healthy controls (387 subjects total) with 11,677 slices. The validation set included 41 patients and 41 healthy controls (82 subjects total) with 2,473 slices. The test set comprised 42 patients and 43 healthy controls (85 subjects total) with 2,549 slices. Across all datasets, a total of 16,699 images from 276 patients and 278 healthy controls were used. There was no overlap of individuals between training, validation, and test sets. During training, data augmentation techniques such as random horizontal and vertical flipping, contrast and brightness variations were applied to prevent overfitting. All images were normalized according to ImageNet statistics.
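A patient-level split of this kind can be sketched as follows. This is an illustrative assumption, not the study's script: the function name `split_by_patient`, the random shuffling, and the fixed per-class hold-out counts are hypothetical, chosen only to reproduce the subset sizes reported above.

```python
import random


def split_by_patient(subject_labels, n_val, n_test, seed=0):
    """Split subjects (not slices) into train/val/test, per class.

    subject_labels: dict mapping subject id -> label (1 = axSpA, 0 = control)
    n_val, n_test:  dicts mapping label -> number of subjects to hold out
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(set(subject_labels.values())):
        ids = sorted(s for s, y in subject_labels.items() if y == label)
        rng.shuffle(ids)
        v, t = n_val[label], n_test[label]
        splits["val"] += ids[:v]
        splits["test"] += ids[v:v + t]
        splits["train"] += ids[v + t:]
    return splits


# Synthetic subject ids matching the reported cohort sizes.
labels = {f"P{i:03d}": 1 for i in range(276)}
labels.update({f"C{i:03d}": 0 for i in range(278)})
splits = split_by_patient(labels, n_val={1: 41, 0: 41},
                          n_test={1: 42, 0: 43}, seed=42)
```

Because every slice inherits the subset of its subject, no individual can contribute images to more than one subset.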
Deep Learning Architecture: Gated Attention MIL
Given that sacroiliitis typically exhibits focal or patchy involvement without affecting the entire tissue uniformly, an MIL-based architecture was employed instead of conventional supervised learning [
8]. In this approach, each patient is considered a “bag” and their MRI slices are treated as “instances” within that bag.
The model architecture consists of three main stages: (1) Feature Extraction, (2) Gated Attention Mechanism, and (3) Bag-Level Classification.
1. Instance-Level Feature Extraction: Each patient $P_i$ is represented as a bag containing $K$ MRI slices. A deep Convolutional Neural Network (CNN) was employed to transform the inflammatory tissue characteristics (such as intensity variations and structural distortions) present in each slice into a low-dimensional feature vector $h_{ik}$ [9]. The ResNet-18 model, pre-trained on ImageNet, was employed as the backbone architecture [10]. The final fully connected layer of the model was removed, and a feature vector $h_{ik} \in \mathbb{R}^{d}$ with $d = 512$ was extracted for each slice.
2. Gated Attention Mechanism: The most critical step in detecting active sacroiliitis is identifying which slices in the MRI series show signs of inflammation. A Gated Attention Mechanism was employed to prevent healthy slices from negatively influencing the model’s decision. This mechanism calculates a learnable attention score $a_k$ for each slice $k$, indicating how informative that slice is for the patient’s inflammation status. It combines hyperbolic tangent (tanh) and sigmoid ($\sigma$) activation functions:

$$a_k = \frac{\exp\{ w^{\top} ( \tanh(V h_{ik}) \odot \sigma(U h_{ik}) ) \}}{\sum_{j=1}^{K} \exp\{ w^{\top} ( \tanh(V h_{ij}) \odot \sigma(U h_{ij}) ) \}}$$

Here, $w$, $V$, and $U$ are trainable weights ($w$ a vector, $V$ and $U$ matrices), and $\odot$ represents element-wise multiplication [10]. This allows the model to assign low weights (near zero) to non-informative slices and high weights to potentially inflamed slices, thereby mitigating the mislabeling that arises when a patient-level label is applied to every slice [11].
3. Bag Representation and Classification: A single patient-level feature vector $z_i$ is obtained by aggregating all instance feature vectors weighted by their attention scores $a_k$:

$$z_i = \sum_{k=1}^{K} a_k h_{ik}$$

This vector contains the most discriminative inflammation-related features for that patient. Finally, it is passed through a classifier layer to compute the probability of being “Inflamed” or “Healthy” [9,10,11,12,13].
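The gated attention pooling described in stages 2 and 3 can be sketched in plain Python as below. This follows the standard gated attention formulation with a softmax over slices, which is assumed here to match the model used in the study. The toy dimensions (feature size 2, hidden size 1) and all weight values are illustrative only; in the paper the feature dimension is 512.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gated_attention_pool(H, V, U, w):
    """Return attention weights a_k and the bag vector z.

    H: list of K instance feature vectors (one per slice), each of length d
    V, U: weight matrices of shape L x d
    w: weight vector of length L
    """
    scores = []
    for h in H:
        tanh_branch = [math.tanh(x) for x in matvec(V, h)]
        gate_branch = [1.0 / (1.0 + math.exp(-x)) for x in matvec(U, h)]
        # w^T (tanh(V h) * sigmoid(U h)), element-wise product inside
        scores.append(sum(wi * t * g
                          for wi, t, g in zip(w, tanh_branch, gate_branch)))
    # Softmax over slices (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    a = [e / norm for e in exps]
    # Attention-weighted sum of instance features.
    d = len(H[0])
    z = [sum(a[k] * H[k][j] for k in range(len(H))) for j in range(d)]
    return a, z


# Toy bag: the middle "slice" carries signal, the other two do not.
H = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
a, z = gated_attention_pool(H, V=[[1.0, 0.0]], U=[[1.0, 0.0]], w=[5.0])
```

In this toy bag nearly all of the attention mass falls on the middle instance, so the bag vector is dominated by that slice, which is the behaviour the text attributes to the mechanism for inflamed slices.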
Test-Time Augmentation (TTA)
To enhance the model’s robustness against noise and acquisition variability, TTA was applied during inference. Each patient’s slices were presented to the model in both their original and horizontally flipped forms, with final predictions obtained by averaging the outputs [
14,
15].
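The TTA step described above can be sketched as follows, under the assumption that the averaged outputs are patient-level probabilities (the text does not say whether probabilities or logits were averaged). `predict_with_tta` and `toy_model` are hypothetical names; the toy model is deliberately left-right asymmetric so that flipping changes its output.

```python
def hflip(slice_2d):
    """Horizontally flip a 2-D slice stored as a list of rows."""
    return [row[::-1] for row in slice_2d]


def predict_with_tta(model, bag):
    """Average the model output over the original and the flipped bag.

    model: callable mapping a bag (list of 2-D slices) to a probability
    bag:   list of 2-D slices belonging to one patient
    """
    p_original = model(bag)
    p_flipped = model([hflip(s) for s in bag])
    return 0.5 * (p_original + p_flipped)


def toy_model(bag):
    """Stand-in model: mean of the left-most pixel of each slice."""
    return sum(s[0][0] for s in bag) / len(bag)


bag = [[[0.2, 0.8]], [[0.4, 0.6]]]
p_tta = predict_with_tta(toy_model, bag)
```

Here the original bag yields about 0.3 and the flipped bag about 0.7, so the averaged prediction is about 0.5.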
Findings
In the cohort of 554 individuals (mean age: 37.9 years; range: 18–76), the Gated Attention MIL-based deep learning model demonstrated high diagnostic accuracy in detecting active inflammatory sacroiliitis on sacroiliac joint MRI. The MRI and structural imaging findings, clinical and laboratory characteristics, and the coexistence of active osteitis with rheumatic diseases in the study population are summarized in
Table 1,
Table 2 and
Table 3. A total of 16,699 MRI slices obtained from these 554 unique individuals were divided into training (n = 11,677), validation (n = 2,473), and test (n = 2,549) sets on a per-patient basis. This patient-level splitting strategy prevented data leakage between subsets and enabled an unbiased assessment of the model’s generalization to previously unseen patients.
This table summarizes the distribution of rheumatic diseases in 276 patients with active osteitis, including adjusted patient counts, sex distribution, and mean age by sex.
The model’s baseline accuracy prior to the application of TTA was 75.29%. Following the implementation of TTA, accuracy increased to 85.88% (Table 4), a gain of 10.59 percentage points. This improvement suggests that the model became more resilient to variables such as image quality, anatomical variations, and borderline inflammatory signal changes. These findings suggest that TTA enhances the model’s generalization capacity and strengthens diagnostic consistency despite clinical variability (Figure 2 and Figure 3).
This line graph presents the validation F1-scores of the baseline ResNet-18 and the proposed Gated Attention MIL model over 20 training epochs. While ResNet-18 reaches an early performance plateau, the Gated Attention MIL model demonstrates a dynamic learning pattern with eventual superior performance, reflecting its capacity to focus on informative image slices over time.
Discussion
This study quantitatively demonstrated the superiority of the Gated Attention MIL architecture developed for the diagnosis of active sacroiliitis in sacroiliac joint MRIs, compared to a standard deep learning approach (i.e., plain ResNet-18). The results show that our model achieved an accuracy of 85.88% and a high sensitivity of 92.86% on the independent test set. These values exceed several results reported for classical CNN architectures in the literature.
The most striking finding of our study is that the Gated MIL model improved classification accuracy by 10.59 percentage points and specificity by 13.95 percentage points compared to the plain ResNet-18. The baseline ResNet-18 model, which treats all MR slices with equal weight, processes signals from healthy, non-inflammatory slices as “noise,” leading to a notably low specificity of 65.12%. In contrast, the Gated Attention mechanism filters out this noise by focusing only on clinically meaningful slices (e.g., those showing BME) in sacroiliitis cases with heterogeneous lesion distributions.
For comparison, the classical CNN model developed by Bordner et al. (2023) reported a sensitivity of 56%, whereas the MIL-based model in our study reached 92.86%, although the two models were evaluated on different datasets. This suggests that “bag-level” learning, as opposed to “instance-level” learning, provides more reliable results in clinical decision support systems for focal pathologies like sacroiliitis.
The high sensitivity of the model helps minimize the risk of “missed cases” (false negatives), which is critical in early axSpA diagnosis. Furthermore, the integration of TTA into the model raised accuracy by approximately 10 percentage points, enhancing robustness against different acquisition parameters and anatomical variability.
In conclusion, despite being trained with weakly labeled data (i.e., patient-level labels only), the Gated Attention MIL model provides high diagnostic reliability and time efficiency, unlike methods requiring complex segmentation or manual ROI annotation. This study supports the potential of the proposed AI solution to serve as a standardized assistive tool in clinical practice by reducing observer-dependent variability in radiological interpretation.
The proposed Gated Attention MIL deep learning model demonstrates high diagnostic accuracy in detecting active inflammatory sacroiliitis in sacroiliac joint MRIs. In the independent test set, which included 85 previously unseen individuals, the model achieved 85.88% accuracy, 92.86% sensitivity, and 79.07% specificity. These results compare favorably with comparable approaches in the literature and highlight the effectiveness of the MIL approach in modeling the heterogeneous and slice-distributed inflammatory patterns of sacroiliitis [
5,
6,
7].
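The reported test-set metrics can be checked with simple confusion-matrix arithmetic. The counts below are not stated in the text; they are inferred as the only whole-number counts consistent with the reported percentages for a test set of 42 patients and 43 healthy controls, and should be read as an assumption.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard binary diagnostic metrics, as percentages."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "false_positive_rate": 100.0 * fp / (tn + fp),
    }


# Inferred counts (assumption): 39 of 42 patients and 34 of 43 controls
# classified correctly.
metrics = diagnostic_metrics(tp=39, fn=3, tn=34, fp=9)
rounded = {name: round(value, 2) for name, value in metrics.items()}
# rounded == {"accuracy": 85.88, "sensitivity": 92.86,
#             "specificity": 79.07, "false_positive_rate": 20.93}
```

Under these counts, 73 of 85 individuals are classified correctly, with three false negatives and nine false positives.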
Previous studies have explored various AI techniques for sacroiliitis classification. For example, Faleiros et al. (2020) [
6] used support vector machines (SVM) and random forest algorithms, reporting 75–82% accuracy. However, these models were limited by manual feature selection and observer dependency.
More recently, deep learning architectures have become more prevalent. Bordner et al. (2023) [
7] developed a CNN model to automatically detect active sacroiliitis based on ASAS MRI criteria. While their model reported 56% sensitivity and 100% specificity in the external validation set, its AUC was only 0.76, indicating limited clinical utility due to low sensitivity. Classic CNNs evaluate all slices with equal importance, often missing critical slices in heterogeneous MRIs.
In a large-scale study by Bressem et al. (2022), a deep learning framework was developed for detecting inflammatory and structural changes in the sacroiliac joints in axial spondyloarthritis (axSpA). With a training/validation dataset of 477 patients and an independent test set of 116, the model achieved 84% accuracy in the validation set and 75% in the independent test set for identifying active inflammation. For structural changes, the model achieved 85% and 79% accuracy, respectively, demonstrating reasonable generalization on new data [
16].
In the study by Zhang et al. (2024), the effectiveness of deep learning radiomics (DLR) combining multimodal MRI features with clinical data was evaluated. Their ensemble model, integrating ResNet50, ResNet101, and DenseNet121, achieved 82.5% accuracy. The final hybrid model combining the DLR signature with clinical factors achieved the highest diagnostic accuracy of 85.6% [
17].
In another multicenter study by Bressem et al. (2022), a deep learning model was developed to detect inflammatory and structural changes in sacroiliac joint MRIs in axSpA. Among 593 patients, the model achieved 75% accuracy for detecting inflammatory changes (sensitivity 88%, specificity 71%) and 79% for structural changes (sensitivity 85%, specificity 78%) on an external test set (n=116). High AUC values (0.88–0.94 for inflammation and 0.89 for structural changes) support the model’s reliability in standardizing axSpA diagnosis and assisting radiologists [
18].
In a retrospective cohort study by Liu et al., a diagnostic model was developed using semi-supervised segmentation and radiomic analysis of sacroiliac MRIs for detecting sacroiliitis and BME. Radiomic features extracted from automated segmentations using the Unimatch framework were classified with SVM, logistic regression (LR), and Light GBM. Independent test results showed 81.2% accuracy for sacroiliitis and 74.2% for BME detection. This study demonstrated the potential of combining machine learning and radiomics to reduce reliance on expert annotation and enhance diagnostic workflows [
19].
In another study by Nicolaes et al. (2024), a deep learning algorithm was tested on a large external validation set of 731 patients drawn from two randomized controlled trials (RAPID-axSpA and C-OPTIMISE). Using expert assessments based on the 2009 ASAS definitions as a reference, the algorithm achieved 74% overall agreement (accuracy). On independent test data, the model achieved 70% sensitivity, 81% specificity, and 84% positive predictive value, indicating acceptable performance for identifying inflammatory changes in large, heterogeneous datasets [
20]. As summarized in
Table 6, the Gated Attention MIL model achieved the highest independent test accuracy among the reviewed approaches.
This table summarizes machine learning, radiomics, and deep learning-based approaches used for the detection of sacroiliitis and axSpA from sacroiliac joint MRI, along with their performance on independent test datasets. The reviewed studies differ in dataset size, model architecture, and methodological design. Compared with previously reported methods, the proposed Gated Attention MIL approach achieves the highest independent test accuracy.
The primary advantage of the Gated Attention MIL approach used in this study lies in its ability to dynamically evaluate the importance of each image slice and assign higher weights only to the most informative ones. This mechanism is particularly meaningful given that sacroiliitis typically presents intensely in certain slices while remaining minimal in others. The architecture developed by Ilse et al. (2018) [
8] enables learning from weakly labeled data, making it highly suitable for clinical applications. Similarly, our study achieved high classification performance using only patient-level labels under a comparable weak labeling scenario.
Inconsistencies in intra- and inter-observer evaluations are a significant challenge in diagnosing sacroiliitis [
4]. In our study, MIL attention maps significantly overlapped with the BME regions identified by radiologists, suggesting the model not only performs classification but may also contribute to clinical decision-making. Given the risk of treatment delays from missed early inflammatory changes, the model’s high sensitivity (92.86%) is clinically vital. Its low false-negative rate further supports its ability to detect early inflammation.
Moreover, the model’s relatively low false positive rate (20.93%) suggests it could reduce unnecessary diagnostic procedures or treatments. The model’s specificity (79.07%) reflects its capacity to distinguish healthy individuals, minimizing misinterpretation of mechanical back pain or degenerative changes as inflammatory lesions.
The integration of TTA raised accuracy by approximately 10 percentage points, enhancing the model’s robustness against image quality variations and acquisition inconsistencies. This finding aligns with other studies reporting improved generalization performance through TTA [
21,
22]. Given the variability in sacroiliac joint anatomy and imaging parameters, TTA’s contribution in this domain is clear.
Since MIL does not require slice-level annotations, its clinical applicability is high. Annotating individual slices is both time-consuming and expert-dependent. In our study, patient-level labeling sufficed due to MIL, significantly accelerating data preparation. Furthermore, the attention mechanism provided insight into which slices influenced predictions, reducing the "black box" nature and enhancing clinical interpretability.
In contrast to much of the literature, which depends on manual ROI annotation or strong slice-level labels [
5,
6,
7], our study used a minimally supervised data preparation approach. Thanks to the MIL framework, a large number of unlabeled images could be effectively processed.
Additionally, patient-based data partitioning completely prevented data leakage. Many studies partition by image slices, leading to the same patient's slices appearing in both training and testing sets, which artificially inflates performance. Our methodology avoided this flaw entirely.
There are, however, several limitations. First, the study is based on data from a single center, necessitating validation of the model's performance across different centers, scanners, and populations. Second, all scans were acquired on a single 1.5 Tesla system, and the model’s performance at other field strengths was not assessed. Third, only T2-weighted fat-suppressed sequences were evaluated; alternative sequences such as STIR or post-contrast T1 were excluded, potentially limiting the model’s sensitivity to lesion diversity. Finally, the model was not validated on external datasets, leaving its generalizability unproven.
Future studies should incorporate multicenter data acquired using diverse scanners and imaging protocols to better evaluate generalization capacity. Hybrid CNN–Transformer architectures and advanced attention mechanisms could improve the precision of inflammatory lesion detection. Integrating MIL-generated attention maps with automatic lesion segmentation systems may enhance interpretability and ease integration into clinical decision support systems. Real-time deployment of such models could further promote the adoption of AI in clinical practice.
In conclusion, our Gated Attention MIL-based deep learning model demonstrated high diagnostic accuracy for detecting active inflammatory sacroiliitis on sacroiliac joint MRI. With an accuracy of 85.88%, sensitivity of 92.86%, and specificity of 79.07%, the model shows promising potential for both early diagnosis of axial spondyloarthritis and clinical application. Beyond classification, the MIL approach highlights clinically relevant slices, aiding decision-making.
Patient-based partitioning of the data enabled realistic evaluation without data leakage. Although this was a single-center study, the broad patient population suggests that the results may generalize to other settings. Nonetheless, broader validation through multicenter studies is warranted.
Ultimately, the proposed model offers an effective AI-based solution that can be integrated into clinical workflows for early and accurate sacroiliitis diagnosis. With further validation, its clinical utility can be significantly expanded.
Author Contributions
Conceptualization: Zeynep Keskin, Reyhan Bilici, Onur İnan. Methodology: Zeynep Keskin, Reyhan Bilici, Ömer Özberk. Software: Onur İnan, Sema Servi. Validation: Zeynep Keskin, Onur İnan, Ömer Özberk, Reyhan Bilici, Sema Servi. Formal Analysis: Zeynep Keskin, Sema Servi, Onur İnan. Investigation: Zeynep Keskin, Ömer Özberk, Reyhan Bilici, Mehmet Yıldırım. Resources: Reyhan Bilici, Sema Servi, Selma Özlem Çelikdelen, Zeynep Keskin, Ömer Özberk, Mehmet Yıldırım. Data Curation: Zeynep Keskin, Onur İnan, Ömer Özberk, Reyhan Bilici, Sema Servi, Mehmet Yıldırım. Writing – Original Draft: Zeynep Keskin, Sema Servi. Writing – Review & Editing: Zeynep Keskin, Onur İnan, Sema Servi, Ömer Özberk, Reyhan Bilici, Mehmet Yıldırım. Visualization: Zeynep Keskin, Ömer Özberk, Mehmet Yıldırım. Supervision: Selma Özlem Çelikdelen, Zeynep Keskin. Project Administration: Zeynep Keskin, Selma Özlem Çelikdelen, Onur İnan, Sema Servi, Ömer Özberk, Reyhan Bilici, Mehmet Yıldırım. All authors have reviewed and approved the final version of the manuscript and agreed to be personally accountable for their own contributions and for ensuring the integrity and accuracy of the work.