Weakly Supervised Learning for Alzheimer’s Disease Diagnosis Using MRI and Clinical Data

Rong Xiao; Tingwei Quan; Xinglong Wu; Guoping Xu; Shangbin Chen

doi:10.20944/preprints202606.1777.v1

Submitted: 23 June 2026

Posted: 24 June 2026


Abstract

Alzheimer’s disease (AD), the leading cause of dementia worldwide, necessitates early and accurate diagnosis to mitigate its progressive impact. Current diagnostic approaches rely heavily on costly and time-consuming expert evaluations, often lacking sufficient sensitivity and specificity. To address these limitations, we propose a novel weakly supervised deep learning framework that integrates fully convolutional networks and multilayer perceptrons for automated AD classification using multimodal MRI and clinical data. Our framework leverages pseudo-labels derived from clinical information to train the model in an iterative weakly supervised manner, significantly reducing the dependency on fully annotated datasets. The model was trained on the ADNI dataset and externally validated on three independent cohorts: AIBL, FHS, and NACC. The proposed method achieved promising performance with an average F1-score of 85.3% and an AUC of 0.808, comparable to fully supervised methods and neurologist evaluations. The model demonstrated robust generalization across diverse populations and imaging protocols. By integrating this framework into routine brain MRI examinations, our approach enables cost-effective, opportunistic screening for early AD risk detection without requiring dedicated diagnostic protocols. This strategy holds potential for reducing healthcare disparities and facilitating timely intervention, particularly in resource-limited settings.

Keywords: Alzheimer’s disease; weakly supervised learning; MRI segmentation; deep learning; medical image analysis

Subject: Biology and Life Sciences - Neuroscience and Neurology

1. Introduction

Millions of individuals worldwide remain affected by Alzheimer’s disease, while the development of effective treatments progresses slowly. Although significant advances have been made in detecting Alzheimer’s pathology—such as through cerebrospinal fluid (CSF) biomarkers and positron emission tomography (PET) amyloid and tau imaging—these methods remain largely confined to research settings [1]. Current diagnostic criteria for Alzheimer’s disease primarily rely on comprehensive evaluations conducted by skilled neurologists [2]. While magnetic resonance imaging (MRI) can reveal AD-related brain changes, such as hippocampal and parietal lobe atrophy, these features still lack sufficient specificity for imaging-based AD diagnosis [3]. Given these diagnostic limitations and a shortage of specialized experts, deep learning methods have been proposed to assist in AD classification and to help domain experts understand key factors in disease progression.

AD is a neurodegenerative disorder characterized by cognitive decline and brain atrophy. Traditional diagnostic approaches depend on supervised learning with fully annotated neuroimaging data or biomarker labels [4]. However, acquiring pixel-level annotations or expert-curated labels is resource-intensive. Weakly supervised learning (WSL) has emerged to leverage incomplete, inexact, or noisy labels, thereby reducing annotation burdens while maintaining diagnostic accuracy. WSL paradigms are well-suited to clinical scenarios where coarse-grained labels are prevalent. Incomplete supervision methods combine limited labeled data with unlabeled datasets to infer latent AD-related features through clustering or active learning. Inexact supervision employs image-level labels and uses attention mechanisms or multi-instance learning to localize critical regions without lesion-level annotations [5]. Inaccurate supervision addresses label noise resulting from misdiagnosis or inter-rater variability via noise-robust models, such as data editing techniques that filter inconsistent labels. Methodologically, generative models like generative adversarial networks (GANs) and variational autoencoders (VAEs) can generate pseudo-labels for self-training, while graph-based approaches propagate labels across patient similarity graphs that incorporate both imaging and non-imaging data. Transfer learning further compensates for limited medical data by fine-tuning pre-trained models with weak AD labels.

The transformative potential of deep learning in healthcare is widely acknowledged [6]. Convolutional neural networks (CNNs), in particular, have demonstrated powerful hierarchical feature extraction capabilities from MRI data [7]. However, ethical and regulatory challenges persist in the clinical deployment of artificial intelligence (AI) models [8,9]. Recent advances in deep learning have enabled high-accuracy classification from MRI data. Several studies have applied deep learning to AD diagnosis [10], including CNN-based classification of cognitive states using MRI and multimodal data, improved transformer models for multimodal diagnostics, long short-term memory (LSTM) models for predicting AD progression, and AI models based on multimodal data using Transformer architectures to identify causes of dementia. Beyond neuroimaging, some researchers have also utilized electronic health records and knowledge graphs to predict AD risk and uncover sex-specific biological insights, further validating the utility of clinical and multimodal data in assisted diagnosis.

Despite these advances, traditional supervised learning requires large-scale, pixel-level annotations, which are costly and time-consuming to obtain in medical imaging [11,12]. Fully supervised models can nonetheless perform well: Qiu et al. [13] reported an F1-score of 0.89 on Alzheimer’s Disease Neuroimaging Initiative (ADNI) [14] data using a hybrid CNN and MLP architecture. However, the clinical potential of conventional deep learning is constrained by insufficient external validation of single-cohort models and by increasingly opaque decision-making frameworks. These challenges underscore the need for rigorous external validation and more interpretable deep learning algorithms that provide clinicians with clear and reliable decision support.

Introducing WSL for AD classification can reduce annotation costs, improve model generalization, address the scarcity of annotated data, adapt to real-world variability, enhance diagnostic accuracy and efficiency, and promote synergy between research and clinical practice. These aspects represent current focal points and challenges in AD-assisted diagnosis.

WSL excels at integrating multiple data sources, such as neuroimaging and genetic data, thereby broadening the information base for automated AD classification. Cross-modal data fusion enables deeper insight into AD pathophysiological mechanisms, improving classification accuracy. For instance, some research teams have treated neuroimaging as a macroscopic map of AD and genetic data as a microscopic view, developing deep learning algorithms for disease classification and risk prediction that show excellent performance in multi-stage diagnosis and risk assessment. Furthermore, weakly supervised automated AD classification supports the advancement of personalized medicine. By accurately determining disease type and stage, clinicians can tailor treatment and care plans to individual patients, thereby improving outcomes and quality of life.

WSL mitigates annotation scarcity by leveraging incomplete labels or auxiliary data [15]. In AD research, WSL frameworks predict disease progression from sparse clinical records [16] or localize pathology without voxel-level annotations [17]. Current limitations include poor generalizability, often due to single-cohort validation, and limited interpretability [18]. Nonetheless, WSL shows clinical promise in early detection and multimodal integration: longitudinal MRI models achieve AUC > 0.85 with sparse labels; attention mechanisms improve interpretability by highlighting disease-relevant regions [19]; and emerging techniques like noise-adaptive training [20] and cross-modal graph networks [21] narrow the performance gap between weak and full supervision. Federated learning also enables decentralized model training.

This study proposes a weakly supervised AD classification framework that generates pseudo-labels from clinical data (age, gender, Mini-Mental State Examination (MMSE)) to train a CNN on MRI scans, enabling efficient diagnosis with limited annotations. By leveraging unannotated data and integrating regularization strategies, the method enhances feature learning and generalization while mitigating overfitting risks. This approach reduces reliance on costly expert annotations and improves AD detection accuracy through noise-resilient multimodal analysis. Key innovations include:

1) pseudo-label generation: clinical data (age, gender, MMSE) are used to train an MLP classifier, which generates pseudo-labels for MRI model training;

2) iterative weakly supervised refinement: noisy pseudo-labels are iteratively corrected by fusing 80% model prediction with 20% ground truth;

3) multi-cohort validation: performance is tested on four geographically diverse datasets: ADNI, the Australian Imaging, Biomarkers and Lifestyle Flagship Study of Ageing (AIBL) [15], the Framingham Heart Study (FHS) [16], and the National Alzheimer’s Coordinating Center (NACC) [17].

2. Materials and Methods

The framework combines an FCN with MLPs to construct a high-resolution mapping from local brain structures to disease probability, generating precise and intuitive visualizations of individual AD risk to aid accurate diagnosis. The model was trained on data from AD and cognitively normal (NC) subjects in the ADNI dataset (n = 417) and validated on three independent cohorts: AIBL (n = 182), FHS (n = 32), and NACC (n = 282). The multimodal input model demonstrated stable performance across all datasets, with predicted high-risk brain regions showing strong consistency with post-mortem histopathological findings. This framework offers a flexible strategy for clinical practice, enabling the extraction of fine neuroimaging features for AD diagnosis using conventional MRI, while also providing a generalizable approach for integrating weakly supervised learning with the pathophysiological processes of human disease (see Figure 1).

2.1. Framework Overview

AD is a progressive neurodegenerative disease for which early diagnosis is crucial to control disease progression. Although expert experience has traditionally underpinned AD diagnosis, advances in deep learning have created new opportunities for automated classification. However, medical image annotation is both costly and prone to low signal-to-noise ratios, posing significant challenges to supervised learning methods. Currently, early AD detection relies primarily on clinical evaluation and MRI scans, but scarce and expensive annotated data make training high-performance classifiers difficult.

To address this, we propose an innovative weakly supervised learning framework that improves automated AD classification by leveraging limited annotated data and abundant unannotated data.

We first train a clinical model on a small amount of annotated data and use it to generate initial pseudo-labels for a large amount of unannotated data; these pseudo-labels drive the subsequent FCN training. The initial pseudo-labels do not need to be highly accurate, because their accuracy is progressively refined in later steps. Training the MLP and generating pseudo-labels requires only a minimal amount of annotated data, which significantly reduces manual labeling costs. The FCN trained with these pseudo-labels extracts important features from MRI images, which are further processed by an MLP to perform automated AD classification. The pseudo-labels and classification results cover both AD and NC groups. The overall workflow is illustrated in Figure 2. The steps are as follows:

1) Clinical information and precise annotations in pre-training data are used to train a clinical model for AD classification via supervised learning. This model primarily consists of an MLP.

2) Clinical information in the training data is fed into the clinical model from step 1 to obtain classification results. Since clinical data and MRI images are paired, these results serve as pseudo-labels for the MRI images. The MRI images and pseudo-labels are then used to train an MRI model for AD classification using weakly supervised learning. This model comprises an FCN and an MLP.

3) The trained MRI model is applied to automatically classify large test datasets. This approach utilizes limited annotated data (clinical annotations) and abundant unannotated data (MRI images) to achieve large-scale automated AD classification via a weakly supervised learning framework.
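The pairing in step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record keys (`mri_path`, `age`, `gender`, `mmse`), the 0.5 threshold, and the rule-based `toy_clinical_model` are assumptions introduced here, with the toy rule standing in for the trained clinical MLP.

```python
def make_pseudo_labels(clinical_model, records, threshold=0.5):
    """Attach a clinical-model prediction to each paired MRI scan.

    clinical_model: callable (age, gender, mmse) -> estimated P(AD).
    records: dicts with hypothetical keys 'mri_path', 'age', 'gender', 'mmse'.
    Returns (mri_path, pseudo_label) pairs, where 1 = AD and 0 = NC.
    """
    pairs = []
    for rec in records:
        p_ad = clinical_model(rec["age"], rec["gender"], rec["mmse"])
        pairs.append((rec["mri_path"], 1 if p_ad >= threshold else 0))
    return pairs


# Toy stand-in for the trained clinical MLP (illustration only, arbitrary cutoff).
def toy_clinical_model(age, gender, mmse):
    return 0.9 if mmse < 24 else 0.1


records = [
    {"mri_path": "sub-01.nii", "age": 78, "gender": "F", "mmse": 21},
    {"mri_path": "sub-02.nii", "age": 71, "gender": "M", "mmse": 29},
]
pseudo_labeled = make_pseudo_labels(toy_clinical_model, records)
```

Because each clinical record is paired with exactly one scan, the MRI model never needs an expert label for the scan itself; it only sees the label the clinical model produced.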

During MRI model training, the pseudo-labels derived from clinical data may contain noise. To mitigate this, we introduce a ground-truth (GT) fusion mechanism: the refined label for the next iteration combines the predicted probability from the previous iteration (weight 80%) with the GT (weight 20%), and the class with the higher probability after fusion is selected as the pseudo-label for subsequent training. GT is used only within the training set, to stabilize training; the external test sets are never exposed to GT during model selection or hyperparameter tuning. Empirically, this 4:1 ratio of prediction to GT yielded the most accurate predictions.
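One plausible reading of the 4:1 fusion is a weighted sum of the previous iteration's class probabilities and a one-hot encoding of the GT, followed by an argmax. The sketch below follows that reading; the function name, the `[p_nc, p_ad]` layout, and the one-hot encoding are assumptions made for illustration.

```python
def fuse_labels(pred_probs, gt_labels, pred_weight=0.8):
    """Blend per-subject class probabilities with one-hot ground truth.

    pred_probs: list of [p_nc, p_ad] from the previous iteration.
    gt_labels:  list of 0 (NC) or 1 (AD) for the same training subjects.
    Returns the refined hard pseudo-labels for the next iteration.
    """
    gt_weight = 1.0 - pred_weight
    refined = []
    for probs, gt in zip(pred_probs, gt_labels):
        # Weighted sum of the model's probability and the one-hot GT entry.
        fused = [
            pred_weight * p + gt_weight * (1.0 if cls == gt else 0.0)
            for cls, p in enumerate(probs)
        ]
        # Keep the class with the higher fused score.
        refined.append(max(range(len(fused)), key=fused.__getitem__))
    return refined


refined = fuse_labels([[0.55, 0.45], [0.90, 0.10], [0.20, 0.80]], [1, 1, 0])
```

Under this reading, the GT overrides the model only when the model already assigns the true class a probability above 0.375; confident predictions pass through unchanged, which is why the GT acts as a stabilizer rather than as full supervision.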

Noise in the pseudo-labels stems from the incompleteness and uncertainty of clinical information. However, weakly supervised techniques such as pseudo-labeling enable the model to tolerate a certain level of noise and still extract useful features for classification. Increasing the number of iterations helps the model gradually refine the pseudo-labels, but excessive iterations may lead to overfitting to the noise, thereby reducing generalization performance. If computational resources are limited, 3 or 4 iterations balance performance and cost. For clinical deployment, where model stability is more critical, we recommend 4 iterations. This number was determined empirically from validation-set performance: beyond 4 iterations, the validation F1-score plateaued, while computational cost continued to increase linearly.

2.2. Network Architecture

We adapted the supervised fully convolutional network (FCN) model proposed by Qiu et al. for AD classification [18] to develop an efficient weakly supervised model that predicts Alzheimer’s disease probability from unlabeled MRI scans. The model input is MRI data, and the output is disease category probability. In the network, each convolutional step was followed by max-pooling and batch normalization, and then passed through a leaky rectified linear unit (leaky ReLU) activation. The FCN model was trained on 47×47×47 patches randomly sampled from the scans, yielding a scalar (1×1×1) prediction of AD status for each sub-volume. When the same architecture was applied to full-sized images, it produced 46×55×46 3D tensors, which were converted into disease probability maps via a softmax function.
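Random sub-volume sampling for patch-based training might look like the sketch below. Only the 47-voxel patch edge comes from the text; the helper name and the example volume shape are hypothetical.

```python
import random

PATCH_SIZE = 47  # patch edge length stated in the text


def sample_patch_slices(volume_shape, rng, patch_size=PATCH_SIZE):
    """Return slices selecting one random cubic sub-volume inside the scan."""
    if any(dim < patch_size for dim in volume_shape):
        raise ValueError("volume is smaller than the patch")
    # Choose a corner so that the whole patch stays inside the volume.
    corner = [rng.randrange(dim - patch_size + 1) for dim in volume_shape]
    return tuple(slice(c, c + patch_size) for c in corner)


rng = random.Random(0)
example_shape = (181, 217, 181)  # hypothetical registered volume size
patch = sample_patch_slices(example_shape, rng)
```

The returned tuple can index a 3D array directly (for example `volume[patch]` with a NumPy array), giving one 47×47×47 training sample per call.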

Network weights were randomly initialized. The Adam optimizer was used with a learning rate of 0.0001 and a batch size of 10. During training, the model was saved when achieving the lowest error on the ADNI validation set. The model was trained for a maximum of 100 epochs per iteration with early stopping applied. The patience was set to 10 epochs, meaning training would stop if the validation loss did not improve for 10 consecutive epochs. After training, the FCN processes a single volumetric MRI scan in approximately 2 seconds on an NVIDIA P5000 GPU, generating a disease probability map. The probabilities for AD and NC are converted into scalar values via the softmax function, and the class with the higher probability is selected for classification.
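The stopping rule described above (at most 100 epochs, patience of 10, keep the model with the lowest validation loss) can be sketched independently of any deep learning library. Here `run_epoch` is a hypothetical callback standing in for one epoch of FCN training followed by evaluation on the ADNI validation split.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=10):
    """Run epochs until validation loss stalls for `patience` epochs.

    run_epoch: callable (epoch index) -> validation loss for that epoch.
    Returns (best_epoch, best_loss, epochs_run).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        epochs_run = epoch + 1
        if val_loss < best_loss:
            # A real training loop would save a model checkpoint here.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, epochs_run


# Simulated loss curve: improves through epoch 5, then stalls.
def simulated_loss(epoch):
    return 1.0 / (epoch + 1) if epoch <= 5 else 1.0


best_epoch, best_loss, epochs_run = train_with_early_stopping(simulated_loss)
```

With the simulated curve, the best checkpoint is at epoch 5 and training halts after epoch 15, the tenth consecutive epoch without improvement.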

After generating disease probability maps for all subjects, an MLP model was developed to perform binary classification using AD probability values selected from the probability maps. The locations are selected based on the overall performance of the FCN classifier estimated from ADNI training data, and the features extracted from these locations serve as input to the MLP for binary classification of AD status (Step 2 in the MRI model in Figure 2). Additionally, we developed a separate MLP model that uses only age, gender, and MMSE scores to predict AD status (Step 1 in the MLP model in Figure 2). All MLP models include one hidden layer and one output layer, along with ReLU activation and dropout [19]. The MLP network in the clinical model and the FCN and MLP networks in the MRI model were implemented in Python.

2.3. Data Preprocessing Pipeline

MRI data were preprocessed to ensure comparability across images, including skull stripping, image registration, normalization, and segmentation. All MRI scans were in NIfTI format. The FLIRT tool from the FSL package (University of Oxford) was used to align scans to the MNI152 template. Careful manual inspection confirmed that automatic registration performed well for most ADNI, AIBL, and NACC cases. For poorly registered cases (mainly within FHS), manual registration was performed using known regions as landmarks for affine transformation. This two-step process produced a uniformly registered set of images.

After registration, all voxels were normalized to mean = 0 and standard deviation = 1. Intensity values were clipped to the range [–1, 2.5], with values below –1 set to –1 and values above 2.5 set to 2.5. Background removal was performed by setting all voxels outside the skull to –1 to ensure uniform background intensity.
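The intensity steps above can be sketched on a flattened list of voxel values. This is a simplified illustration under stated assumptions: statistics are computed over all voxels with the population standard deviation, and `brain_mask` is a hypothetical boolean list marking voxels inside the skull.

```python
from statistics import fmean, pstdev


def normalize_intensities(voxels, brain_mask, low=-1.0, high=2.5):
    """Z-score voxels, clip to [low, high], and set background to `low`."""
    mean = fmean(voxels)
    std = pstdev(voxels, mu=mean)
    if std == 0:
        raise ValueError("constant image: cannot normalize")
    out = []
    for value, inside in zip(voxels, brain_mask):
        if not inside:
            out.append(low)  # uniform background intensity
            continue
        z = (value - mean) / std
        out.append(min(max(z, low), high))  # clip to the stated range
    return out


# Nine dark voxels and one bright outlier: mean 10, standard deviation 30.
normalized = normalize_intensities([0.0] * 9 + [100.0], [True] * 10)
```

In the example, the outlier has a z-score of 3 and is clipped to 2.5, while the dark voxels land at about -0.33.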

FreeSurfer [20] was used to segment cortical and subcortical structures from volumetric MRI scans after brain extraction for all patients. Built-in functions including recon-all, mri_annotation2label, tkregister2, mri_label2vol, mri_convert, and mris_calc were used to obtain the segmented structures.

3. Results

3.1. Datasets and Cohort Characteristics

This study used data from the ADNI, AIBL, FHS, and NACC cohorts (Table 1). ADNI is a longitudinal multicenter study aimed at developing clinical, imaging, genetic, and biochemical biomarkers for early detection and tracking of Alzheimer’s disease [14]. AIBL, launched in 2006, is the largest study of its kind in Australia, aiming to discover biomarkers, cognitive characteristics, and lifestyle factors affecting the development of symptomatic Alzheimer’s disease [15]. FHS is a longitudinal community cohort study that has collected extensive clinical data from three generations [16]. Since 1976, FHS has expanded to assess factors leading to cognitive decline, dementia, and Alzheimer’s disease. Finally, NACC, established in 1999, maintains a large relational database collecting standardized clinical and neuropathological research data from Alzheimer’s disease centers across the United States [17].

Subjects from each cohort were eligible for inclusion if they had at least one T1-weighted volumetric MRI scan within ±6 months of a formally recorded clinical diagnosis. All MRI scans with fewer than 60 slices were excluded. For subjects with multiple MRI and diagnostic records within 6 months, the closest neuroimaging and diagnostic label pair was selected, so only one MRI was used per subject. The ADNI dataset was randomly split in a 3:1:1 ratio: 60% for model training, 20% for internal validation, and 20% for internal testing. The model that performed best on the validation set was then applied to the ADNI test data and to the AIBL, FHS, and NACC datasets, which served as external test datasets. Figure 3 shows the different cross-sectional scans of ADNI data.
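The selection and splitting rules in this paragraph can be sketched as two small helpers. Several details are assumptions made for illustration: the 6-month window is approximated as 183 days, each subject is given a single diagnosis date, and the fixed seed and rounding-down behaviour are arbitrary choices.

```python
import random
from datetime import date


def closest_scan(scan_dates, diagnosis_date, max_days=183):
    """Pick the scan nearest to the diagnosis, or None if none is in the window."""
    def gap(scan_date):
        return abs((scan_date - diagnosis_date).days)

    candidates = [d for d in scan_dates if gap(d) <= max_days]
    return min(candidates, key=gap) if candidates else None


def split_subjects(subject_ids, ratios=(3, 1, 1), seed=0):
    """Shuffle subjects and split them into train, validation, and test lists."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    # Any remainder from rounding down goes to the test set.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train_ids, val_ids, test_ids = split_subjects(range(417))
chosen = closest_scan(
    [date(2020, 1, 1), date(2020, 5, 1), date(2021, 1, 1)], date(2020, 4, 1)
)
```

For the 417 ADNI subjects, this particular rounding gives 250 training, 83 validation, and 84 test subjects; the exact counts used in the study are not stated in the text.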

3.2. Evaluation Metrics

The model was built on ADNI data, which was randomly divided into training, validation, and test sets. Models were built on each training and validation split and evaluated on the test dataset (ADNI test, AIBL, FHS, and NACC), repeating this process five times. Performance is presented as the mean and standard deviation of model runs. ADNI test dataset scans were used for head-to-head comparison with neurologists.

We generated sensitivity-specificity and precision-recall curves based on model classification on the ADNI test dataset and the independent datasets (AIBL, FHS, and NACC), and calculated the area under the curve (AUC) for each curve. Additionally, we calculated sensitivity, specificity, F1-score, and the Matthews correlation coefficient for each set of model predictions. Sensitivity reflects the model’s ability to identify the positive class, and specificity reflects its ability to identify the negative class; in both cases, a higher value indicates a stronger ability. The F1-score combines precision and recall. The definitions of sensitivity, specificity, and F1-score are as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{3}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The Matthews correlation coefficient (MCC) is a balanced measure of binary classifier quality [21] that remains informative even when the two classes differ greatly in size. It is defined as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$
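Equations (1) to (4) translate directly into code. The sketch below computes all four metrics from confusion-matrix counts; the function name and the convention of returning an MCC of 0 when the denominator vanishes are choices made here, not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, F1, and MCC from confusion counts.

    Assumes at least one positive and one negative case, so the first two
    ratios are well defined.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration: 50 AD and 50 NC test subjects.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

For these illustrative counts, sensitivity is 0.8, specificity is 0.9, F1 is 80/95 (about 0.842), and MCC is about 0.704.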

3.3. Result Analysis

We analyzed the experimental results over five iterative training rounds on four datasets: NACC, FHS, ADNI, and AIBL. Table 2 reports the sensitivity, specificity, F1-score, and MCC for each dataset. The model demonstrated the most stable performance on the ADNI dataset, which is likely attributable to its closer alignment with the training data. Furthermore, the stable specificity and F1-score across iterations indicate robust performance in distinguishing between positive and negative classes. The observed upward trend in MCC suggests a consistent improvement in overall model performance.

Across all datasets, sensitivity declined over epochs, particularly in FHS and AIBL. NACC maintained relatively high sensitivity with smaller fluctuations. Specificity remained stable across datasets, with slight increases or minimal variation. AIBL specificity remained high throughout. F1-score showed a slight upward trend across datasets, indicating improved balance between sensitivity and specificity. ADNI’s F1-score remained high and stable. MCC also trended upward similarly to F1-score, with AIBL achieving relatively high MCC across epochs, suggesting better overall performance on that dataset. Detailed metrics are provided in Table 2 and Figure 4.

The results demonstrate that the weakly supervised learning-based approach effectively classifies AD. On the test set, the model achieved an F1-score of 85.3%, comparable to traditional supervised methods. Comparative experiments confirmed that pseudo-labeling significantly enhances model performance, validating the effectiveness of weakly supervised learning in low-annotation scenarios.

Additionally, Figure 5 presents the sensitivity-specificity and precision-recall (PR) curves for the ADNI test dataset. The areas under these curves (AUC) are 0.808 and 0.767, respectively, which are comparable to the results achieved by supervised learning MRI models on the ADNI test dataset.

Due to the incompleteness and uncertainty of clinical information, the generated pseudo-labels may contain noise. However, experimental results show that through weakly-supervised learning techniques, such as the pseudo-labeling method, the model can tolerate noise to some extent and extract useful features for classification. In this paper, we employed a pre-trained FCN architecture and fine-tuned it on the ADNI dataset. This approach leverages the concept of transfer learning, transferring knowledge learned from large-scale datasets to smaller datasets, thereby enhancing the model’s performance.

We also compared our method with well-known supervised learning methods: three classical machine learning methods (kNN with k = 5, SVM with a linear kernel and C = 1, and Random Forest with 100 trees) and two deep learning methods (Janani’s model (CNN+MLP) and Qiu’s model (FCN+MLP)). The results are shown in Figure 6. The comparison indicates that our method is superior in overall classification ability and in handling class balance.

Future work could explore other weakly supervised learning methods for the automatic classification of AD, such as multi-instance learning and incomplete supervision. Incorporating data from other modalities could also improve classification performance. Traditional single-modality diagnostic methods often lead to misdiagnosis due to the limitations of the data. Multi-modal methods, by combining different types of data, can overcome the shortcomings of single data sources and provide a more comprehensive disease profile. Effective fusion of multi-modal data can enhance the model’s generalization ability, enabling it to perform well across different datasets and clinical environments. This is crucial for achieving consistent diagnostic performance in diverse patient populations. Moreover, multi-modal data provide richer biomarkers, which are significant for understanding the pathological mechanisms of AD and monitoring disease progression.

By generating pseudo-labels from clinical data and iterative optimization, our method significantly reduces reliance on fully annotated data while maintaining robust performance across four independent cohorts (ADNI, AIBL, FHS, NACC), with an F1-score of 85.3% and AUC of 0.808. Our approach extends the method of Qiu et al. [18]. While their framework relies on fully supervised learning with precise annotations, ours introduces weak supervision to minimize annotation costs, enhancing adaptability to resource-constrained settings. Qiu et al. developed an interpretable deep learning strategy combining FCN and MLP to construct high-resolution disease probability maps, achieving high diagnostic accuracy validated across multiple datasets and against neurologist assessments. Our study emphasizes a weakly supervised approach to address annotation scarcity, making it suitable for resource-limited environments and enabling cost-effective opportunistic screening for early AD risk detection. Both methods underscore the potential of deep learning in AD diagnosis.

4. Discussion

Weakly supervised learning strategically substitutes computational time for scarce expert annotation resources in the field of medical imaging. Although this method increases training overhead through iterative pseudo-label refinement, it shifts the bottleneck from expensive and time-consuming manual annotation to scalable computational resources. This trade-off enables the development of robust models utilizing large-scale unannotated clinical data in resource-constrained environments, where expert time remains the primary limiting factor.

Our weakly supervised framework reduces annotation requirements while performing within 5% of Qiu et al.’s method, making it more suitable for resource-limited settings. Multi-center validation confirms robustness to imaging protocols and population heterogeneity, enhancing clinical generalizability. However, weakly supervised learning still requires some annotated data to guide training; poor-quality annotations may reduce classification accuracy. Additionally, AD’s complex and heterogeneous pathology may challenge weakly supervised models when data diversity is high. Future research should improve the stability and generalization of weakly supervised networks [22,23].

Several limitations should be acknowledged. First, the current model is constrained to binary NC-AD classification, whereas ternary NC-MCI-AD classification, which adds mild cognitive impairment (MCI), is more clinically relevant. Second, the sample sizes of the external cohorts, particularly FHS, are relatively limited, which may hinder the detection of rare subtypes. Third, the absence of PET or CSF biomarkers restricts the model’s sensitivity to early pathological changes. Finally, the iterative training process increases computational costs by approximately 30% compared to standard supervised learning.

Future work may incorporate novel network architectures such as LSTM [24], Transformer, and GANs for improved accuracy. Extending to NC-MCI-AD ternary classification, integrating multimodal data, and adopting Qiu et al.’s multimodal strategy could enhance performance. Although weakly supervised learning reduces dependence on annotations, it often requires additional constraints, regularization, and handling of unannotated data, increasing training time and computational demands. For large-scale AD classification, careful attention to training efficiency and resource allocation is essential. Developing lightweight models and adhering to ethical frameworks will ensure clinical applicability. Our weakly supervised framework bridges annotation efficiency and diagnostic accuracy, offering a scalable solution for AD detection. While limitations remain in classification granularity and multimodal integration, advances in federated learning and pathology-driven interpretability will further align it with clinical utility.

5. Conclusions

This study presents a weakly supervised framework for binary AD classification that combines FCN and MLP to generate risk probability maps from MRI data. By leveraging pseudo-labels and iterative training, the model achieves robust performance across diverse datasets while reducing annotation costs. This approach not only enhances AD diagnosis but also sets a precedent for applying WSL to other neurodegenerative disorders [25], and it aligns with broader trends in medical AI [26]. While the framework is currently limited to binary classification, it offers a cost-effective solution for large-scale MRI-based screening in resource-limited settings. Future work will extend the model to multi-class classification that includes MCI and will integrate multimodal and longitudinal data, fostering tighter integration of AI with clinical workflows. By advancing weakly supervised paradigms, we contribute to scalable and ethical AI solutions for neurodegenerative diseases [27].

Our deep learning framework successfully extracts high-accuracy AD classification features from MRI data, validated across independent datasets, neuropathological findings, and expert evaluations. This method is expected to expand the application of neuroimaging in disease detection and management, supporting the development of treatment strategies for neurological disorders. The intersection of neuroscience with computational science, information science, and artificial intelligence will further propel brain-inspired intelligence, contributing to human health efforts.

Author Contributions

Conceptualization, Rong Xiao; methodology, Rong Xiao; software, Rong Xiao; validation, Rong Xiao; formal analysis, Rong Xiao; investigation, Rong Xiao; resources, Rong Xiao and Shangbin Chen; data curation, Rong Xiao; writing—original draft preparation, Rong Xiao; writing—review and editing, Rong Xiao, Tingwei Quan, Xinglong Wu, Guoping Xu and Shangbin Chen; visualization, Rong Xiao; supervision, Rong Xiao and Shangbin Chen; project administration, Rong Xiao; funding acquisition, Shangbin Chen. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 61890951 and No. 61371014), and the Open Project Program of Wuhan National Laboratory for Optoelectronics (Grant No. WNLOKF027).

Institutional Review Board Statement

This study was reviewed and approved by the Institutional Review Board of Huazhong University of Science and Technology. All procedures performed in this study were in accordance with the ethical standards of the Declaration of Helsinki and its later amendments. Public datasets (ADNI, AIBL, FHS, NACC) were obtained with original ethical approvals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are accessible via the respective consortium agreements.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AD	Alzheimer’s disease
CSF	cerebrospinal fluid
PET	positron emission tomography
MRI	magnetic resonance imaging
WSL	weakly supervised learning
GANs	generative adversarial networks
VAEs	variational autoencoders
CNNs	convolutional neural networks
AI	artificial intelligence
LSTM	long short-term memory
ADNI	The Alzheimer’s Disease Neuroimaging Initiative
MMSE	Mini-Mental State Examination
AIBL	The Australian Imaging, Biomarkers and Lifestyle Flagship Study of Ageing
FHS	The Framingham Heart Study
NACC	The National Alzheimer’s Coordinating Center
NC	cognitively normal
GT	ground truth
FCN	fully convolutional network
AUC	area under the curve
MCC	Matthews correlation coefficient

References

[1] A. Nordberg, “PET imaging of amyloid in Alzheimer’s disease,” Lancet Neurol., vol. 3, no. 9, pp. 519–527, 2004.
[2] J. L. Whitwell et al., “Neuroimaging correlates of pathologically defined subtypes of Alzheimer’s disease: a case-control study,” Lancet Neurol., vol. 11, no. 10, pp. 868–877, 2012.
[3] G. B. Frisoni et al., “The clinical use of structural MRI in Alzheimer disease,” Nature Rev. Neurol., vol. 6, no. 2, pp. 67–77, 2010.
[4] C. R. Jack, Jr. et al., “Tracking pathophysiological processes in Alzheimer’s disease: an updated hypothetical model of dynamic biomarkers,” Lancet Neurol., vol. 12, no. 2, pp. 207–216, 2013.
[5] T. Chen and S. Patel, “Weakly supervised multi-instance learning for early Alzheimer’s disease prediction,” Neurocomputing, vol. 512, pp. 123–135, 2022.
[6] G. Hinton, “Deep learning—a technology with the potential to transform health care,” JAMA, vol. 320, no. 11, pp. 1101–1102, 2018.
[7] E. Shelhamer, J. Long, and T. Darrell, “Convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, 2017.
[8] E. J. Topol, “High-performance medicine: the convergence of human and artificial intelligence,” Nature Med., vol. 25, no. 1, pp. 44–56, 2019.
[9] D. Castelvecchi, “Can we open the black box of AI,” Nature, vol. 538, no. 7623, pp. 20–23, 2016.
[10] S. Liu et al., “Generalizable deep learning model for early Alzheimer’s disease detection from structural MRIs,” Sci. Rep., vol. 12, no. 1, p. 17106, 2022.
[11] N. Mattsson et al., “Predicting diagnosis and cognition with 18F-AV-1451 tau PET and structural MRI in Alzheimer’s disease,” Alzheimer’s Dement., vol. 15, no. 4, pp. 570–580, 2019.
[12] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[13] S. Qiu et al., “Fusion of deep learning models of MRI scans, Mini-Mental State Examination, and logical memory test enhances diagnosis of mild cognitive impairment,” Alzheimer’s Dement., vol. 10, pp. 737–749, 2018.
[14] R. C. Petersen et al., “Alzheimer’s Disease Neuroimaging Initiative (ADNI): clinical characterization,” Neurology, vol. 74, no. 3, pp. 201–209, 2010.
[15] K. A. Ellis et al., “Addressing population aging and Alzheimer’s disease through the Australian Imaging Biomarkers and Lifestyle Study: collaboration with the Alzheimer’s Disease Neuroimaging Initiative,” Alzheimer’s Dement., vol. 6, no. 3, pp. 291–296, 2010.
[16] J. M. Massaro et al., “Managing and analysing data from a large-scale study on Framingham Offspring relating brain structure to cognitive function,” Stat. Med., vol. 23, no. 2, pp. 351–367, 2004.
[17] D. L. Beekly et al., “The National Alzheimer’s Coordinating Center (NACC) Database: an Alzheimer disease database,” Alzheimer Dis. Assoc. Disord., vol. 18, no. 4, pp. 270–277, 2004.
[18] S. R. Qiu, P. S. Joshi, and M. I. Miller, “Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification,” Brain, vol. 143, no. 6, pp. 1920–1933, 2020.
[19] J. Wang and D. Kim, “Contrastive learning with weak supervision for Alzheimer’s disease classification,” Artif. Intell. Med., vol. 145, p. 102567, 2023.
[20] B. Fischl, “FreeSurfer,” NeuroImage, vol. 62, no. 2, pp. 774–781, 2012.
[21] A. S. Alatrany, W. Khan, and A. Hussain, “An explainable machine learning approach for Alzheimer’s disease classification,” Sci. Rep., vol. 14, no. 1, p. 2637, 2024.
[22] L. Diala, S. Sandeep, and A. B. Sarah, “Disease-driven domain generalization for neuroimaging-based assessment of Alzheimer’s disease,” Hum. Brain Mapp., vol. 45, 2024.
[23] H. Li, M. Liu, and R. Zhang, “Self-supervised learning with limited annotations for Alzheimer’s disease detection from MRI,” Med. Image Anal., vol. 89, p. 102789, 2023.
[24] H. Saleh et al., “LSTM deep learning model for Alzheimer’s disease prediction based on cost-effective time series cognitive scores,” 2023.
[25] Z. Liu and N. Adams, “Graph-based weakly supervised learning for Alzheimer’s disease subtype identification,” J. Neural Eng., vol. 20, 2023.
[26] A. S. Tang, K. P. Rankin, and G. Cerono, “Leveraging electronic health records and knowledge networks for Alzheimer’s disease prediction and sex-specific biological insights,” Nature Aging, vol. 4, pp. 379–395, 2024.
[27] M. Garcia and T. Wilson, “Weakly supervised transfer learning for Alzheimer’s disease classification across different datasets,” Pattern Recognit., vol. 145, p. 109876, 2023.

Figure 1. Flowchart of the weakly supervised framework for automated AD classification.

Figure 2. Structure diagram of the weakly supervised framework for automated AD classification. The FCN model was developed using a patch-based strategy, with the Alzheimer’s disease status of the individual serving as the output of the classification model. Because FCNs operate independently of input size, the trained model generates participant-specific disease probability maps of the brain. Selected high-risk voxels from the disease probability maps were then passed to the MLP for binary classification of disease status. AD = Alzheimer’s disease; NC = normal cognition.
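
The final stage described in this caption, in which high-risk voxels from a probability map are passed to an MLP, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the top-k selection rule, the function names `select_high_risk_voxels` and `mlp_forward`, the toy 4 × 4 × 4 map, and the random weights are all assumptions, since the caption does not state how voxels are selected or how the MLP is configured.

```python
import math
import random


def select_high_risk_voxels(prob_map, k):
    """Return the k ((z, y, x), probability) pairs with the highest risk.

    Assumption: "high-risk" is read here as "largest probability"; the
    actual selection criterion may differ.
    """
    voxels = [
        ((z, y, x), p)
        for z, plane in enumerate(prob_map)
        for y, row in enumerate(plane)
        for x, p in enumerate(row)
    ]
    voxels.sort(key=lambda item: item[1], reverse=True)
    return voxels[:k]


def mlp_forward(features, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a sigmoid output."""
    hidden = [
        max(0.0, sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
# Hypothetical stand-in for an FCN output: a 4 x 4 x 4 map of probabilities.
prob_map = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(4)]

k, hidden_units = 8, 5
selected = select_high_risk_voxels(prob_map, k)
features = [p for _, p in selected]

# Untrained, randomly initialised weights, used only to show the data flow.
w1 = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(hidden_units)]
b1 = [0.0] * hidden_units
w2 = [rng.uniform(-1, 1) for _ in range(hidden_units)]
b2 = 0.0

ad_probability = mlp_forward(features, w1, b1, w2, b2)
print(f"{len(selected)} voxels selected, MLP output = {ad_probability:.3f}")
```

In a trained system the weights would be learned from the pseudo-labelled data; here they only demonstrate how a fixed-length feature vector can be derived from a probability map of arbitrary size.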

Figure 3. Different cross-sectional scans of ADNI data.

Figure 4. Classification accuracy across different epochs on the different datasets. The panels show the performance trends of the classification models over successive iterations for four datasets (ADNI, NACC, FHS, AIBL). Each panel includes the following metrics: sensitivity, specificity, MCC, F1-score, and the F1-score of supervised learning. Table 2 provides the numerical values at each epoch.
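
The four metrics named in this caption can all be derived from a binary confusion matrix, as in the minimal sketch below. The function name `binary_metrics` and the example counts are assumptions: the paper does not report confusion matrices, and the counts were chosen only because they happen to reproduce one row of reported values.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, F1 and MCC from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 when a denominator is empty instead of raising.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the positive (AD) class
    specificity = ratio(tn, tn + fp)  # recall on the negative (NC) class
    f1 = ratio(2 * tp, 2 * tp + fp + fn)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts with AD as the positive class; not taken from the paper.
metrics = binary_metrics(tp=31, fp=8, tn=36, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the values round to 0.861, 0.818, 0.827, and 0.676, which match the first ADNI iteration in Table 2. That agreement does not establish that these were the actual counts.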

Figure 5. Sensitivity-specificity and precision-recall curves for the ADNI test data. The AUC of the ROC curve is 0.808, close to that of supervised learning (AUC = 0.85), indicating that the model distinguishes well between positive and negative samples. The AUC of the PR curve is 0.767, indicating an acceptable balance between precision and recall.
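
For reference, the ROC AUC quoted in this caption can be read as the probability that a randomly chosen AD case receives a higher score than a randomly chosen NC case. A minimal sketch of that rank-based definition follows; the function name `roc_auc` and the six toy labels and scores are hypothetical and unrelated to the study data.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = AD, 0 = NC. 8 of the 9 AD/NC pairs are ranked correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of samples, which is acceptable for test sets of the size used here; larger evaluations would normally use a sort-based implementation.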

Figure 6. Comparison with supervised learning methods.

Table 1. The study dataset and characteristics.

Dataset	Group (N)	Age, median [range]	Male, n (%)	MMSE, median [range]
ADNI	NC (N = 229)	76 [60, 90]	119 (51.96)	29 [25, 30]
ADNI	AD (N = 188)	76 [55, 91]	101 (53.72)	23.5 [18, 28]
AIBL	NC (N = 152)	72 [60, 92]	68 (44.74)	29 [25, 30]
AIBL	AD (N = 30)	73 [55, 93]	12 (40.00)	21 [6, 28]
FHS	NC (N = 73)	73 [57, 100]	37 (50.68)	29 [22, 30]
FHS	AD (N = 29)	81 [67, 94]	12 (41.38)	25 [10, 29]
NACC	NC (N = 167)	74 [56, 94]	59 (35.33)	29 [20, 30]
NACC	AD (N = 98)	77 [55, 95]	45 (45.92)	22 [0, 30]

Table 2. Classification results across different epochs for the ADNI, FHS, AIBL and NACC datasets.

Dataset	Metric	Iteration 1	Iteration 2	Iteration 3	Iteration 4	Iteration 5	Mean (SD)
ADNI	Sensitivity	0.861	0.833	0.889	0.889	0.833	0.861 (0.025)
ADNI	Specificity	0.818	0.886	0.841	0.795	0.864	0.841 (0.032)
ADNI	F1	0.827	0.845	0.853	0.831	0.833	0.838 (0.01)
ADNI	MCC	0.676	0.722	0.726	0.681	0.697	0.7 (0.021)
NACC	Sensitivity	0.923	0.833	0.871	0.9	0.885	0.882 (0.03)
NACC	Specificity	0.756	0.831	0.831	0.829	0.789	0.807 (0.03)
NACC	F1	0.789	0.786	0.807	0.821	0.789	0.798 (0.014)
NACC	MCC	0.656	0.651	0.685	0.708	0.653	0.671 (0.022)
FHS	Sensitivity	0.966	0.724	0.897	0.862	0.759	0.842 (0.089)
FHS	Specificity	0.685	0.849	0.822	0.795	0.836	0.797 (0.059)
FHS	F1	0.7	0.689	0.765	0.725	0.698	0.715 (0.028)
FHS	MCC	0.587	0.557	0.667	0.607	0.569	0.597 (0.039)
AIBL	Sensitivity	0.839	0.758	0.855	0.871	0.806	0.826 (0.04)
AIBL	Specificity	0.869	0.894	0.891	0.906	0.903	0.893 (0.013)
AIBL	F1	0.667	0.657	0.707	0.74	0.699	0.694 (0.03)
AIBL	MCC	0.606	0.588	0.653	0.692	0.64	0.636 (0.036)
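
As a quick consistency check on the Mean (SD) column, the summary for one row can be recomputed from its five per-iteration values. The sketch below uses the ADNI sensitivity row. Whether the population or the sample standard deviation was used is not stated in the paper, so both are computed; the variable names are illustrative.

```python
import statistics

# Per-iteration ADNI sensitivity values copied from Table 2.
adni_sensitivity = [0.861, 0.833, 0.889, 0.889, 0.833]

mean = statistics.fmean(adni_sensitivity)
sd_population = statistics.pstdev(adni_sensitivity)  # divides by n
sd_sample = statistics.stdev(adni_sensitivity)       # divides by n - 1

print(f"mean = {mean:.3f}")
print(f"population SD = {sd_population:.3f}")
print(f"sample SD = {sd_sample:.3f}")
```

For this row the mean is 0.861, the population SD rounds to 0.025, and the sample SD rounds to 0.028, so the tabulated value of 0.025 appears consistent with the population form.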

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).