A Two-Stage Ensemble Machine Learning Pipeline for Breast Cancer Diagnosis from Digital Mammograms

Fernando Martín-Rodríguez; Carmen Freire-Bouza; Mónica Fernández-Barciela; Ainhoa Morales-Fernandez; Maria Marante-Boado

doi:10.20944/preprints202606.0367.v1

Submitted: 03 June 2026

Posted: 04 June 2026


Abstract

This paper presents a two-stage ensemble machine learning pipeline for breast cancer diagnosis from digital mammograms. A complete end-to-end platform was developed and trained using publicly available datasets. The proposed methodology includes a dedicated preprocessing stage followed by two classification stages. In the first classification stage, six convolutional neural networks (CNNs), each trained under different conditions and dataset configurations, are used to extract predictive information from mammographic images. In the second stage, the outputs of these CNNs are combined into feature vectors and used for final classification. Several machine learning approaches were evaluated for this stage, including multilayer perceptron (MLP), support vector machine (SVM), bagged trees (BT), gradient boosting (XGBoost), and a heuristic fusion method. Among them, an optimized two-layer MLP was selected due to its high sensitivity and suitability for continuous risk estimation, achieving an F1-score and recall exceeding 0.90 and an area under the ROC curve (AUC) above 0.97. Additionally, a demonstration application was developed to provide a quantitative breast cancer risk score directly from mammography images, supporting clinical decision-making and early diagnosis. Finally, the Grad-CAM technique is applied to provide explainability by highlighting the image regions that contribute most strongly to the final diagnostic decision.

Keywords: breast cancer diagnosis; digital mammography; computer-aided diagnosis (CAD); machine learning; deep learning; convolutional neural networks (CNN); ensemble learning

Subject: Engineering - Bioengineering

1. Introduction

Breast cancer is the most common type of cancer worldwide, according to the World Health Organization (WHO), with more than 2.3 million cases reported in 2022 [1]. Nearly 1 in 12 women will develop breast cancer during their lifetime. In addition, it affects women across all age groups after puberty, with the probability increasing significantly after the age of 40. Approximately half of all breast cancer cases occur in women with no identifiable risk factors other than gender and age.

Since 1980, mortality rates in developed countries have been reduced by 40% due to the implementation of regular mammographic screening programs in the most affected age groups. Therefore, early detection and treatment are essential to reducing the number of deaths. Furthermore, there are significant disparities in patient survival rates between high-income countries and low- and middle-income countries, largely due to the high cost of diagnosis, which requires the involvement of radiologists with specialized expertise in this field.

Currently, thanks to recent advances in artificial intelligence and, more specifically, in machine learning, interest in the development of automatic and semi-automatic diagnostic support systems has increased considerably. These tools are not intended to replace healthcare professionals, but rather to complement their work, optimize their time, and improve productivity, contributing to faster and better-informed clinical decision-making.

This work presents a system for the analysis of digital mammograms using machine learning techniques. A complete platform trained on publicly available datasets is developed, including a preprocessing stage and two modeling phases. In the first phase, six convolutional neural networks (CNNs) [2], trained with different datasets, extract relevant information from the images. In the second phase, their outputs are combined as feature vectors to perform the final classification. After evaluating several methods (MLP, SVM, BT, XGBoost, and a heuristic method), an optimized two-layer Multilayer Perceptron (MLP) [3] was selected, achieving an F₁-score and recall higher than 0.90. The Area Under the Curve (AUC) was also computed, yielding a value above 0.97. A demonstration application was also implemented.

The application of convolutional neural networks to computer-aided diagnosis (CAD) of breast cancer from mammographic images has been an active research area in recent years. A systematic review by Nasser and Yusof [4] analyzes deep-learning-based CAD systems for breast cancer. The best results are reported for systems that use genomic and/or histopathological information. For the particular case of mammograms, some examples follow. In [5], CNNs are combined with Generative Adversarial Networks (GANs), achieving an AUC of 0.88; [6] focuses on segmentation from a manually introduced region of interest (ROI), achieving an IoU of 0.87; [7] applies CNNs to ROIs extracted from mammograms, using an adversarial network to augment the training data, with a maximum reported accuracy of 0.85. In [8], the authors report a recall of 0.98 using multiple input modalities (combining mammography with ultrasound, MRI, and tomosynthesis).

Another relevant work is that of Dehghan Rouzi et al. [9], who propose a CAD ensemble system integrating multiple pretrained deep networks (EfficientNet, Xception, MobileNetV2, InceptionV3, and ResNet50) via a consensus-adaptive weighting method. It achieves an accuracy of 0.95 on image crops (selected ROIs); nevertheless, the F-score values on whole mammograms are less favorable. Similarly, Shah et al. [10] combined EfficientNet, AlexNet, ResNet, and DenseNet in an optimized ensemble, demonstrating that the complementary feature extraction capabilities of heterogeneous architectures lead to superior diagnostic performance compared to any single model. In a broader survey following PRISMA guidelines, Masud et al. [11] analyzed 50 studies published between 2018 and 2025 and confirmed that hybrid and ensemble models consistently outperform standalone CNN or classical ML approaches in mammogram classification. A common thread in these works is the reliance on large pretrained architectures and transfer learning; by contrast, the present work proposes two lightweight CNNs: the first is a novel architecture trained from scratch and the second is the well-known ResNet-18 architecture, trained via transfer learning. Six differently trained CNNs are used in a first stage and then combined through an ensemble strategy based on a classical classifier. This approach achieves competitive sensitivity while significantly reducing computational requirements.

The main contribution of this paper is the development of a complete two-stage ensemble machine learning pipeline for breast cancer diagnosis from digital mammograms. Unlike approaches based solely on a single deep learning model, the proposed system combines the predictive capabilities of six independently trained convolutional neural networks with a second-stage classifier that integrates their outputs as a feature vector for final decision-making. This strategy improves diagnostic performance, achieving an F₁ and recall higher than 0.90, which is particularly relevant in medical diagnosis where minimizing false negatives is critical. In addition, the work includes a dedicated preprocessing methodology for heterogeneous DICOM mammography images and the implementation of a demonstration application capable of providing a quantitative breast cancer risk score directly from mammographic images. The application also incorporates explainability through the Grad-CAM technique, highlighting the image regions that most strongly support the final diagnostic decision. This improves interpretability and increases the practical applicability of the proposed system for clinical decision support.

2. Materials and Methods

2.1. The Datasets

The starting point of this study is a dataset published on the Kaggle platform [12]. This dataset consists of a large collection of classified mammograms in DICOM format [13], compiled by the Radiological Society of North America (RSNA). Depending on the type of study and the associated diagnosis, the distribution of the images is shown below (Table 1):

Four types of images are found: CC: “Cranial-Caudal” (top view), ML: “Mediolateral” (from the center), MLO: “Mediolateral-Oblique” (oblique view from the center), and LM: “Latero-Medial” (from the outside toward the center).

From Table 1, it can be observed that the most common image types are CC and MLO. In addition, the dataset is clearly highly imbalanced. The direct application of any machine learning model would result in a system that tends to prioritize the negative (healthy) class.

To alleviate the imbalance problem, another public dataset was obtained from [14]. This second dataset contains only positive cases: 1,324 positive cases (631 CC and 693 MLO).

From this initial collection, the following working datasets were created:

BLD-CC: Balanced CC dataset for training: 1,017 samples per class; 180 samples per class reserved for testing.
BLD-MLO: Balanced MLO dataset for training: 1,091 samples per class; 192 samples per class reserved for testing.
BLD-MIX: Combination of the two previous datasets for training: 2,108 samples per class; 372 samples per class reserved for testing.
AUG-CC: Imbalanced CC dataset (to be balanced using data oversampling): 2,320 healthy and 1,017 sick cases for training; 180 samples per class reserved for testing.
AUG-MLO: Imbalanced MLO dataset: 2,308 healthy and 1,091 sick cases for training; 192 samples per class reserved for testing.
AUG-MIX: Combination of the two previous augmented datasets for training: 4,628 healthy and 2,108 sick cases; 372 samples per class reserved for testing.

These six datasets are summarized in Table 2, below.

Data oversampling consists of generating synthetic samples to balance an imbalanced dataset, especially when the ratio between the majority and minority classes is moderate (less than 3:1). In many cases, this is achieved using interpolation between nearby samples (SMOTE algorithm). In this case, since the data consists of images, oversampling is performed by simple random repetition of minority-class examples until the number of samples matches that of the majority class.
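The random-repetition oversampling described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the function name, the seed, and the sample identifiers are hypothetical, and only the class sizes (taken from the AUG-CC description) come from the text.

```python
import random

def oversample_by_repetition(majority, minority, seed=0):
    """Repeat randomly chosen minority samples until both classes have equal size."""
    if len(minority) == 0:
        raise ValueError("minority class is empty")
    rng = random.Random(seed)
    deficit = max(len(majority) - len(minority), 0)
    # Extra samples are drawn with replacement from the original minority set.
    extra = [rng.choice(minority) for _ in range(deficit)]
    return list(minority) + extra

# Class sizes from the AUG-CC training set: 2,320 healthy and 1,017 sick cases.
healthy = [f"healthy_{i}" for i in range(2320)]  # hypothetical identifiers
sick = [f"sick_{i}" for i in range(1017)]        # hypothetical identifiers
balanced_sick = oversample_by_repetition(healthy, sick)
print(len(healthy), len(balanced_sick))
```

Every original minority sample is kept once, so the repetition only adds copies and never discards data.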

2.2. Pipeline Scheme

The general scheme for the full pipeline is shown in Figure 1.

As shown in the figure, after the input image is preprocessed, six different CNNs are applied. Each individual CNN is independently trained to solve the diagnostic classification problem, as described in Section 2.4. The six numerical outputs obtained from these networks are then combined into a feature vector, which is used to make the final decision through a second machine learning model trained independently from the CNN stage. Several classical machine learning methods were evaluated and compared for this final classification task.

2.3. Preprocessing Stage

Preprocessing is necessary because the dataset contains a mixture of images from different sources. For example, the number of bits per pixel in DICOM files may be 10, 12, or 16. Image size can also vary and CNN-based networks require fixed-size input images.

The applied preprocessing consists of the following operations:

Conversion to floating-point numerical format through an affine transformation that ensures maximum contrast (minimum value equal to 0.0 or black, maximum value equal to 1.0 or white).
When necessary, horizontal mirroring along the X-axis of the image to ensure that the significant information is always located on the left side.
Nonlinear contrast enhancement using a pointwise nonlinear power-law filter: y = xⁿ, with n = 1.2 if the total sum of pixel values is less than 50% of the maximum possible value, and n = 2 otherwise.
Resizing to a fixed size of 384 × 512 pixels. Most available files have an aspect ratio of 3:4 = 0.75 (ar = width/height). In some cases, this value is smaller, and the image is resized without distortion to a height of 512 pixels. Then, blank columns are added to the right side. If ar were greater than 0.75, the opposite procedure would be applied: resizing to a width of 384 pixels and adding blank rows at the top.
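The four operations listed above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the text does not specify the interpolation method (nearest-neighbour is assumed here) or the criterion that decides when mirroring is necessary (here the image is mirrored when its right half is brighter than its left half). The function names are hypothetical.

```python
import numpy as np

TARGET_W, TARGET_H = 384, 512  # fixed size from the text (width x height)

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize (the interpolation method is an assumption)."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows[:, None], cols[None, :]]

def preprocess(raw):
    img = raw.astype(np.float64)
    # 1) Affine stretch so that the minimum maps to 0.0 and the maximum to 1.0.
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
    # 2) Mirror horizontally when the right half is brighter (assumed criterion).
    half = img.shape[1] // 2
    if img[:, half:].sum() > img[:, :half].sum():
        img = img[:, ::-1]
    # 3) Power law y = x**n; "sum below 50% of the maximum" means mean < 0.5.
    n = 1.2 if img.mean() < 0.5 else 2.0
    img = img ** n
    # 4) Resize without distortion, then pad with blank (zero) pixels.
    h, w = img.shape
    out = np.zeros((TARGET_H, TARGET_W))
    if w / h <= TARGET_W / TARGET_H:
        new_w = min(TARGET_W, max(1, round(w * TARGET_H / h)))
        out[:, :new_w] = resize_nearest(img, TARGET_H, new_w)   # blank columns on the right
    else:
        new_h = min(TARGET_H, max(1, round(h * TARGET_W / w)))
        out[TARGET_H - new_h:, :] = resize_nearest(img, new_h, TARGET_W)  # blank rows on top
    return out

# Synthetic 12-bit image (height 1000, width 600) with a bright block on the right side.
raw = np.zeros((1000, 600), dtype=np.uint16)
raw[300:700, 350:600] = 4095
result = preprocess(raw)
print(result.shape, result.min(), result.max())
```

In the synthetic example the aspect ratio is 0.6, so the image is scaled to a height of 512 pixels and blank columns fill the remaining width on the right.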

Figure 2 shows a real mammogram displayed using a DICOM viewer (original image) and the same image after the complete preprocessing procedure.

2.4. Convolutional Neural Networks

A CNN is a cascade of filter banks and nonlinear operations followed by a multilayer perceptron (sometimes consisting of only a single layer). The perceptron is a set of individual neurons connected to all numerical values from the previous layer (fully connected layers).

The main purpose of the initial stages is to extract relevant information from the input images in order to facilitate the classification task in the final stages. The training algorithm includes the filter parameters (impulse responses), allowing the feature extraction process to be optimized.

A novel architecture was designed and trained from scratch in this work. Figure 3 shows the architecture. This option was selected after testing several alternatives. All datasets described in subsection 2.1 will be used to train different instances of this network.

As previously mentioned, the inputs must be grayscale images of size 384 × 512 pixels. The first filter bank consists of 16 filters of size 7 × 7, followed by normalization, ReLU activation, and max-pooling with a factor of 4. This is followed by 32 filters of size 5 × 5, normalization, ReLU activation, and max-pooling with a factor of 3. The final filter bank contains 64 filters of size 3 × 3, followed by normalization and ReLU activation.

The dropout layer removes output pixels with a probability of 0.5, providing protection against overfitting. The final MLP consists of three fully connected layers with 128, 64, and 2 neurons, respectively.
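To make the size of the layers concrete, the following sketch traces the feature-map dimensions through the three filter banks. It relies on assumptions that the text does not state: convolutions are taken to preserve spatial size ("same" padding, so the kernel size does not change the shape), pooling is non-overlapping with floor division, and no pooling follows the third filter bank. The resulting numbers are therefore indicative only.

```python
def trace_shapes(width=384, height=512):
    """Return (channels, height, width) after each filter bank under the assumptions above."""
    stages = [(16, 7, 4), (32, 5, 3), (64, 3, 1)]  # (filters, kernel size, pooling factor)
    shapes = []
    for filters, _kernel, pool in stages:
        # "Same" padding keeps the size; only pooling reduces it.
        width, height = width // pool, height // pool
        shapes.append((filters, height, width))
    return shapes

shapes = trace_shapes()
channels, h, w = shapes[-1]
flattened = channels * h * w  # number of values fed to the first fully connected layer
for shape in shapes:
    print(shape)
print(flattened)
```

Under these assumptions the first fully connected layer (128 neurons) would receive 64 × 42 × 32 values, which indicates where most of the trainable parameters would be concentrated.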

This architecture is trained using the backpropagation algorithm with the SGDM optimizer, a mini-batch size of 128, and 30 epochs. The optimization target is recall (also called sensitivity, the probability of detecting a positive case). At the beginning of each training process, some samples are set aside for validation during training (1/9 of the training samples), while the test set is not used until the final evaluation.

Training was performed in the same way for the six datasets described in subsection 2.1, with the results presented in section 3. For the AUG-xx datasets, balancing is performed before training by random repetition of minority-class samples.

In all cases, data augmentation was applied to each mini-batch by performing small random transformations on each image to introduce variability into the data. This process helps to avoid overfitting. These transformations do not alter the class of each sample: a small vertical displacement (randomly between -30 and +30 pixels) and a vertical mirror reflection along the Y-axis, applied randomly with a probability of 0.50.
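A minimal NumPy sketch of this per-image augmentation is given below. Two points are assumptions: the reflection is interpreted as a top-to-bottom flip (which preserves the left-side alignment produced by preprocessing), and the rows uncovered by the displacement are filled with zeros (black). The function name and the random seeds are hypothetical.

```python
import numpy as np

def augment(img, rng, max_shift=30, flip_prob=0.5):
    """Random vertical shift in [-max_shift, max_shift] plus a random top-bottom flip."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.zeros_like(img)
    if shift > 0:
        out[shift:, :] = img[:-shift, :]   # move content down, blank rows on top
    elif shift < 0:
        out[:shift, :] = img[-shift:, :]   # move content up, blank rows at the bottom
    else:
        out[:] = img
    if rng.random() < flip_prob:
        out = out[::-1, :]                 # assumed interpretation of the mirror reflection
    return out

rng = np.random.default_rng(0)
image = np.random.default_rng(1).random((512, 384))  # stand-in for a preprocessed mammogram
augmented = augment(image, rng)
print(augmented.shape)
```

Because the transformation is drawn again for every image in every mini-batch, the network rarely sees exactly the same input twice.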

To provide a comparative baseline, an additional CNN architecture was trained using transfer learning with ResNet-18 [15]. This architecture was selected among the currently available pretrained models because of its relative simplicity, with a smaller number of parameters, which is particularly suitable when working with moderately sized datasets. In addition, residual networks have consistently shown strong performance and robustness in medical image classification tasks.

As is standard in transfer learning, the original output layer of ResNet-18 was replaced with a new fully connected layer containing two output neurons corresponding to the two target classes. This final layer was trained using a higher learning rate than the pretrained layers in order to adapt the network to the specific classification problem.

Since ResNet-18 requires RGB input images of size 224 × 224 pixels, the preprocessed mammograms were resized accordingly and converted from grayscale to RGB by replicating the single intensity channel across the three color channels. The same training strategy described previously was applied in this case to ensure a fair comparison with the proposed custom CNN architecture.

2.5. Second Classification Stage

This second stage uses the outputs of the six CNNs described in the previous subsection (their softmax outputs, referred to here as logits) as a feature vector to make the final decision. The idea is to apply the six networks to the same image and obtain six numerical values.

The main reason for implementing this second stage was the limited performance achieved by the individual CNN models, which by themselves were not sufficient to provide the level of diagnostic accuracy required for a reliable clinical support system. Although the CNNs were able to extract relevant predictive information from the mammographic images, their individual classification performance remained moderate. Therefore, an additional classification stage was necessary to improve the overall system performance by combining the outputs of the six networks into a single feature vector. This ensemble strategy allows the model to take advantage of complementary information provided by each CNN, resulting in better generalization, higher robustness, and improved sensitivity for breast cancer detection.

Since a softmax activation function is used at the end of each CNN (Figure 3), the two output neurons always produce numerical outputs in probability format (between 0.0 and 1.0, with a total sum equal to 1.0). By taking the output corresponding to the “Sick” class, we obtain something similar to a probability of disease detected by that network.
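The construction of the six-component feature vector can be illustrated with the short sketch below. The raw output values are invented for illustration, and the class order (Healthy, Sick) is an assumption.

```python
import math

def softmax(values):
    """Numerically stable softmax over a short list of raw outputs."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the two final neurons, ordered (Healthy, Sick), for six CNNs.
raw_outputs = [(1.2, -0.3), (0.1, 0.4), (-0.8, 1.5), (2.0, -1.0), (0.0, 0.0), (-0.2, 0.9)]

# The "Sick" component of each softmax output becomes one feature.
features = [softmax(pair)[1] for pair in raw_outputs]
print([round(f, 3) for f in features])
```

Since the two softmax outputs sum to 1.0, keeping only the "Sick" component loses no information.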

It should be noted that in this scheme, an image of type CC (for example) is also processed by networks trained with MLO images. The numerical output obtained in this case can still be used as a feature vector component. The underlying idea is that the low-level characteristics that define the diagnosis do not depend strongly on the projection type.

Note that the previously trained CNNs are now treated as feature extractors (essentially as specialized filters), so a new classification problem arises. The BLD-MIX dataset described above is used for training and testing.

The methods evaluated were:

A simple multilayer perceptron (MLP) with one hidden layer and a single output neuron (probability of disease). This neural network was optimized in two ways. First, the hidden layer size was determined by performing a preliminary training with an excessive number of neurons and analyzing the autocorrelation matrix of the hidden-layer outputs and its eigenvalues. Then, once trained, the best output threshold for declaring a positive case was selected by maximizing the parameter: $F_{\beta} = 1 / \left[\frac{1}{1 + \beta^{2}} \left(\frac{1}{\mathrm{precision}} + \frac{\beta^{2}}{\mathrm{recall}}\right)\right]$, where precision is understood as the probability that a positive prediction is correct, and recall as the probability that a true positive case is detected. The parameter β determines how strongly recall is prioritized. In this work, β = 1.5 was used.
A Support Vector Machine (SVM) model [16], which consists of defining a transformation of the feature vectors that simplifies the classification problem by making it linearly separable. In this case, an automatic parameter optimization option was applied.
A Bagged Tree model (Bootstrap Aggregated Trees, BT) [17] with ten trees. In this case, several decision trees are trained using random (and overlapping) subsets of the training set. The final decision is obtained by running all constructed trees in parallel and averaging their results. For proper training, the cost of a false negative was set to twice the cost of a false positive.
A Gradient Boosting model (XGBoost) [18]. This technique is based on the sequential construction of weak models (decision trees), where each new model is trained to correct the errors made by the previous ones (implemented by optimizing a loss function using gradient descent). Automatic parameter optimization was applied, and the cost of a false negative was also set to twice the cost of a false positive.
The final method tested was a heuristic calculation (HRS). In this case, six disease probability estimates $x_{i}$ are available. These are converted into health probabilities: $x'_{i} = 1 - x_{i}$. They are then averaged using a harmonic mean instead of an arithmetic mean: $\bar{x} = 1 / \left[\frac{1}{n} \sum_{i = 1}^{n} \frac{1}{x'_{i}}\right]$ (the inverse of the averaged inverses). The use of this formula causes the average to decrease much more sharply if one of the probabilities decreases, compared to a linear average. This average is considered the global probability of health, and therefore $1 - \bar{x}$ is used as the global discriminant function. As in the MLP case, the optimal threshold is again calculated by maximizing $F_{\beta}$ with β = 1.5.
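The heuristic score and the threshold selection by maximizing F_β can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names are hypothetical, the toy scores and labels are invented, and the small clipping constant `eps` (which avoids division by zero when a health probability is exactly 0) is an added assumption.

```python
def f_beta(precision, recall, beta=1.5):
    """F_beta as defined above: the inverse of a weighted mean of inverses."""
    if precision == 0 or recall == 0:
        return 0.0
    return (1 + beta ** 2) / (1 / precision + beta ** 2 / recall)

def hrs_score(sick_probs, eps=1e-6):
    """One minus the harmonic mean of the health probabilities (eps is an assumption)."""
    health = [max(1.0 - p, eps) for p in sick_probs]
    harmonic = len(health) / sum(1.0 / h for h in health)
    return 1.0 - harmonic

def best_threshold(scores, labels, beta=1.5):
    """Try every observed score as threshold and keep the one with the highest F_beta."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_beta(precision, recall, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# One network reporting 0.9 pulls the score up far more than an arithmetic mean would.
print(round(hrs_score([0.1, 0.1, 0.1, 0.1, 0.1, 0.9]), 4))
# Toy example with invented scores (label 1 = sick).
print(best_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
```

The first printed value illustrates the intended behavior of the harmonic mean: a single strongly positive network dominates the combined score, which favors sensitivity.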

Finally, note that the two-stage strategy allows the system to combine the feature extraction capability of convolutional neural networks with the robustness of classical machine learning classifiers. By integrating the outputs of multiple CNN models trained under different conditions, the final classifier benefits from complementary predictive information, leading to improved generalization and higher diagnostic reliability. This ensemble approach is particularly suitable for medical decision support systems, where maximizing sensitivity and reducing false negatives are critical requirements.

3. Results

3.1. Performance Metrics

The results obtained for each of the trained models are presented below. For each case, performance metrics were calculated using the test subset of the corresponding dataset. These evaluation metrics are defined as follows:

$$P = \mathrm{precision} = \frac{\mathrm{True\ Detections}}{\mathrm{True\ Detections} + \mathrm{False\ Detections}}, \qquad R = \mathrm{recall} = \frac{\mathrm{True\ Detections}}{\mathrm{True\ Detections} + \mathrm{Missed\ Detections}}, \qquad F_{1} = \frac{2 \cdot P \cdot R}{P + R} \qquad (1)$$

These metrics were selected because they provide a meaningful evaluation of classification performance in medical diagnosis problems, where the cost of errors is not equally distributed. Precision measures the reliability of positive predictions, indicating how many detected cancer cases are actually correct, which helps reduce unnecessary follow-up procedures and patient anxiety caused by false positives. Recall (or sensitivity) is particularly important in this context because it measures the ability of the system to correctly identify true positive cases, minimizing false negatives, which represent missed cancer diagnoses and may have serious clinical consequences. The F₁-score combines both precision and recall into a single metric, providing a balanced assessment of overall performance. Since breast cancer screening systems must prioritize the detection of positive cases while maintaining acceptable precision, these three metrics are more informative and clinically relevant than overall accuracy alone, especially when working with imbalanced datasets.
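As a consistency check, the F₁-score reported here is the β = 1 case of the F_β criterion used for threshold selection in subsection 2.5. The short derivation below shows this, followed by a numerical comparison for β = 1.5; the values P = 0.80 and R = 0.90 are purely illustrative and are not results from this work.

```latex
F_{\beta}\Big|_{\beta = 1}
  = \frac{1}{\frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)}
  = \frac{2}{\frac{P + R}{P \cdot R}}
  = \frac{2 \cdot P \cdot R}{P + R}
  = F_{1}

% Illustrative values: P = 0.80, R = 0.90
F_{1}   = \frac{2 \cdot 0.80 \cdot 0.90}{0.80 + 0.90} = \frac{1.44}{1.70} \approx 0.847
F_{1.5} = \frac{(1 + 1.5^{2}) \cdot 0.80 \cdot 0.90}{1.5^{2} \cdot 0.80 + 0.90} = \frac{2.34}{2.70} \approx 0.867
```

With recall higher than precision, F₁.₅ exceeds F₁, which shows how β = 1.5 rewards operating points that favor sensitivity.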

3.2. CNN Results (First Stage)

The results obtained for the different trained CNNs are presented in Table 3 (using the novel architecture described in subsection 2.4). Figure 4 illustrates the training of the AUG-CC network (all training curves are similar).

From the table, it can be concluded that none of these models is sufficient when used individually. The BLD-CC and AUG-MLO models provide the best results. It should be noted that, up to this point, the models have been tested using the same type of images as those in their respective training datasets. Therefore, BLD-CC should only be used with CC mammograms, while AUG-MLO should be used with MLO-type mammograms.

Pretrained ResNet-18 networks were then customized via transfer learning and assessed, obtaining the results presented below (Table 4).

The results obtained are very similar to those of the previously proposed architecture. Therefore, it is worthwhile to evaluate both CNN architectures as feature extractors for the second-stage classifier, with the aim of determining whether combining their outputs can further improve the overall diagnostic performance.

3.3. Second Stage Results

Using the training part of the BLD-MIX dataset for learning and its testing part for assessment, the results of this new stage are obtained and summarized here. First (Table 5 and Table 6), results are computed using the CNN architecture in Figure 3. Afterwards (Table 7 and Table 8), results with ResNet-18 as the feature extractor are presented.

A new test was performed: a second version of the dataset was created for training the second-stage classifier. Since the CNNs are used here only as feature extractors, it is possible to merge the original training and testing partitions of the BLD-MIX dataset into a single larger dataset. The samples in this new dataset are then randomly shuffled, and a new test subset is selected before training, representing 10% of the total samples. This test subset is carefully constructed to preserve the same class balance as the complete dataset, ensuring a fair and representative evaluation of the final classifier.
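The merge, shuffle, and class-preserving 10% split described above can be sketched as follows. The function name and seed are hypothetical; the class sizes come from the BLD-MIX description (2,108 training plus 372 test samples per class, that is, 2,480 per class after merging).

```python
import random

def stratified_split(labels, test_fraction=0.10, seed=0):
    """Return shuffled (train, test) index lists that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # same fraction taken from each class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Merged BLD-MIX dataset: 2,480 healthy (0) and 2,480 sick (1) samples.
labels = [0] * 2480 + [1] * 2480
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting each class separately guarantees that the test subset keeps the same balance as the complete dataset.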

The results obtained for this new dataset are presented in Table 6.

In both versions of the second-stage classifier, a significant improvement in performance is observed compared to the individual CNN models. All the evaluated methods achieve an F₁-score above 0.80, with XGBoost providing the best overall results in both versions. However, when a continuous output score (a non-binary value between 0.0 and 1.0 representing disease probability) is required, the MLP becomes the preferred option, as it achieves the highest sensitivity (recall), which is particularly important in medical diagnosis tasks where minimizing false negatives is critical.

The second dataset configuration provides slightly better results for both XGBoost and MLP, making it the preferred training strategy. This can be explained by the fact that CNN outputs (logits) are usually more decisive in the original training subset, often being closer to extreme values such as (1.0, 0.0) or (0.0, 1.0). Maintaining the original train-test split for the second stage may therefore introduce a bias that slightly reduces generalization performance. By reshuffling all samples and creating a new balanced test subset, the second-stage classifier achieves a more robust and representative evaluation.

Repeating the same two experiments with the ResNet-18-based CNN stage yields new results, presented in Table 7 and Table 8 below.

The results presented in Table 7 and Table 8 are very similar to each other, further confirming that the second stage behaves as an independent classifier whose performance is not strongly dependent on the dataset partition. Nevertheless, the models using ResNet-18 as the feature extractor achieve consistently better results, making this architecture the preferred option for the first stage.

As in the previous experiments, XGBoost provides the best overall performance in terms of F₁-score, while the MLP achieves the highest recall (sensitivity). However, in this case, the difference in recall between both models is relatively small and not sufficiently conclusive to clearly favor the MLP over XGBoost. Therefore, considering the better balance between precision and recall, XGBoost remains the most robust overall solution for the final classifier. If a continuous (non-binary) result is needed, MLP is the best solution. The heuristic method performs surprisingly well given its simplicity.

Additionally, for MLP and HRS (the methods offering a continuous result), ROC curves and the AUC (Area Under the Curve) have been computed. The curves are shown in Figure 5 (for the configuration in Table 8: ResNet-18 for the first stage and a shuffled dataset). Both AUC values are above 0.97, confirming very good performance.
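For reference, the AUC of a continuous score can be computed without drawing the curve, as the fraction of (positive, negative) pairs in which the positive case receives the higher score (ties count one half). The sketch below uses this equivalent rank formulation; it is not necessarily how the values in this work were computed, and the toy scores are invented.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(positives) * len(negatives))

# Toy example with invented scores (label 1 = sick).
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for test sets of a few hundred images.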

3.4. Demonstration Application

A demonstration application has been created that allows the estimation of cancer risk from an image that can be provided in JPEG or DICOM format. This involves running the entire process: preprocessing, the six CNNs, feature vector creation, and execution of one of the models trained in the previous subsection.

Here, the goal is to obtain a value between 0% and 100%, which requires choosing one of the models that produces a continuous output between 0.0 and 1.0. SVM, BT, and XGBoost produce binary outputs. Between MLP and HRS, MLP is selected (because it offers maximum recall and AUC). Figure 6 shows the graphical user interface of the application; note how the temperature scale takes into account the threshold computed for the MLP output.

Figure 6. Screenshot of the DEMO app.

As a final enhancement of the application, an explainability option was incorporated to allow users to better understand the diagnostic results using Grad-CAM [19]. Grad-CAM (Gradient-weighted Class Activation Mapping) is a visualization technique used to explain the decisions made by CNNs by analyzing the gradients of the target class (the “Sick” class in this case) with respect to the feature maps of the last convolutional layer. Based on this information, a heatmap is generated to indicate how strongly each region of the input image contributes to the final prediction. This functionality is particularly valuable in positive cases, as it helps verify whether the model is focusing on clinically relevant suspicious areas of the mammogram, improving interpretability and increasing confidence in the diagnostic output. Feature maps for a real positive case are shown in Figure 7.
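The core Grad-CAM computation can be sketched with NumPy as shown below. The sketch assumes that the feature maps of the last convolutional layer and the gradients of the "Sick" score with respect to them have already been obtained from the deep learning framework; here they are replaced by small invented arrays. The function name and the final normalization to [0, 1] are illustrative choices.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from arrays of shape (channels, height, width)."""
    # Channel weights: spatial average of the gradients of the target class score.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, feature_maps, axes=1)
    # ReLU keeps only the regions that support the target class.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Invented example: channel 0 supports the class, channel 1 opposes it.
maps = np.zeros((2, 3, 3))
maps[0, 1, 1] = 1.0
maps[1, 0, 0] = 1.0
grads = np.stack([np.ones((3, 3)), -np.ones((3, 3))])
heatmap = grad_cam(maps, grads)
print(heatmap)
```

In practice the heatmap is upsampled to the input resolution and overlaid on the mammogram; that step is omitted here.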

4. Discussion

One of the most relevant findings of this study is the difference in performance between the individual CNN models and the final ensemble system. The standalone CNNs achieved moderate performance, with F₁-scores generally ranging between 0.70 and 0.77. Although these values indicate that the networks were capable of extracting meaningful diagnostic information from mammographic images, they were insufficient for a reliable clinical support system when used independently. However, when the outputs of the six CNNs were combined into a feature vector and processed by a second-stage classifier, a clear improvement was observed across all evaluated metrics. This result supports the hypothesis that different CNN configurations learn complementary representations of the mammographic data and that combining them increases robustness and generalization capability.

Among the evaluated second-stage methods, XGBoost achieved the highest F1-score, while the optimized MLP showed comparable performance and consistently competed for the highest recall. The MLP was selected as the preferred option for the demonstration application because it provides a continuous output score (logit) representing the probability of disease, which is particularly useful in practical clinical decision support scenarios. In addition, its high recall is especially important in medical diagnosis, where sensitivity is often prioritized over overall accuracy, since failing to detect a malignant lesion may have severe clinical consequences. The achieved recall above 0.90 indicates that the proposed system is capable of identifying the majority of positive cases while maintaining acceptable precision.

The preprocessing stage also plays an important role in the overall system performance. The original datasets contain mammograms acquired under heterogeneous conditions, with different image sizes, intensity ranges, and acquisition formats. The proposed preprocessing pipeline standardizes these variations through normalization, contrast enhancement, geometric alignment, and resizing. This allows the CNNs to focus on diagnostically relevant patterns rather than irrelevant acquisition differences. Additionally, the mirroring strategy ensures consistent anatomical orientation, which may facilitate the learning process.

Another important aspect of this work is the use of relatively lightweight CNN architectures: ResNet-18 and a novel architecture designed by the authors. ResNet-18 was trained by transfer learning, while the novel architecture was trained from scratch. This design choice reduces computational complexity and training requirements while still providing competitive performance. Although more complex transfer learning approaches based on architectures such as ResNet-50, DenseNet, or EfficientNet have shown strong performance in medical imaging applications, they often require significantly larger computational resources and may introduce unnecessary complexity for datasets of limited size. The proposed architecture demonstrates that simpler networks, when combined through an ensemble strategy, can still achieve clinically meaningful results.

The study also highlights the importance of data balancing and augmentation in medical imaging problems. The original datasets were highly imbalanced, with healthy cases greatly outnumbering positive cancer cases. Without compensation mechanisms, this imbalance would bias the models toward the majority class. The use of balanced datasets, oversampling strategies, and online data augmentation contributed to improving generalization and reducing overfitting. Nevertheless, some limitations remain due to the relatively limited number of positive samples available for training.
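Random oversampling, one of the simplest balancing strategies, can be sketched as follows. This is a generic illustration, not the exact procedure used to build the work datasets: the function name, the file-name placeholders, and the toy class counts are assumptions, and both classes are assumed to be present in the input.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until both classes match.

    Returns new (samples, labels) lists; the inputs are left unchanged.
    Labels are assumed to be 0 (healthy) or 1 (sick).
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    deficit = len(by_class[majority]) - len(by_class[minority])
    extra = [rng.choice(by_class[minority]) for _ in range(deficit)]
    out_samples = list(samples) + extra
    out_labels = list(labels) + [minority] * deficit
    return out_samples, out_labels


# Toy data with roughly the imbalance of Table 1 (about 46 healthy per sick case).
samples = [f"img_{i:03d}" for i in range(94)]
labels = [1 if i < 2 else 0 for i in range(94)]

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Oversampling should be applied only to the training split, after the train/test partition, so that duplicated images cannot leak into the test set. Online augmentation then makes each duplicate differ slightly from its source at training time.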

Despite the encouraging results, several important research directions remain open. The main future objective is the acquisition of an external validation dataset from local clinical sources. This would constitute a first step toward the development of a potential commercial product.

Another limitation of the current system is that it relies exclusively on imaging information. In real clinical practice, radiologists also consider additional patient data such as age, family history, breast density, genetic predisposition, and prior examinations. Integrating multimodal clinical information could further improve diagnostic performance and reduce uncertainty in challenging cases.

Future work should therefore focus on several directions. First, external validation studies should be conducted using independent clinical datasets, preferably obtained from the local healthcare system. Second, more advanced architectures and alternative transfer learning strategies could be evaluated and compared with the proposed CNN-based approaches. Third, explainability methods should be further investigated to improve model interpretability and facilitate clinical adoption. Finally, incorporating additional clinical variables and temporal information from previous mammograms could enable the development of a more robust and clinically useful multimodal decision-support system.

Overall, the results obtained in this work suggest that ensemble-based machine learning strategies constitute a promising approach for computer-aided breast cancer diagnosis from mammographic images. The proposed system demonstrates that combining multiple CNN models with a second-stage classifier can substantially improve sensitivity and overall diagnostic reliability, providing a valuable foundation for future research and potential clinical decision support applications.

Author Contributions

Conceptualization, F.M-R., C.F-B., M.F-B., A.M-F. and M.M-B.; methodology, F.M-R. and M.F-B.; formal analysis, F.M-R. and M.F-B.; investigation, F.M-R. and C.F-B.; software, F.M-R. and C.F-B.; resources, F.M-R. and M.F-B.; data curation, F.M-R. and C.F-B.; writing—original draft preparation, F.M-R.; writing—review and editing, F.M-R., C.F-B., M.F-B., A.M-F. and M.M-B.; visualization, F.M-R., C.F-B., M.F-B. and M.M-B.; supervision, M.M-B.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Input data were taken from public sources that are referenced within the paper; all generated data are included within the paper.

Acknowledgments

The authors would like to thank all the personnel at the atlanTTic Research Center for their continuous support and valuable collaboration throughout the development of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under the Curve
AUG	Augmented dataset
BLD	Balanced dataset
BT	Bagged tree classifier
CAD	Computer-aided diagnosis
CC	Cranial caudal view
CNN	Convolutional neural network
DICOM	Digital imaging and communications in medicine
F1-score	Harmonic mean of precision and recall
GAN	Generative adversarial network
HRS	Heuristic aggregation method
LM	Latero-medial view
ML	Medio-lateral view
MLO	Medio-lateral oblique view
MLP	Multi-layer perceptron
ReLU	Rectified linear unit
ROC	Receiver Operating Characteristic
ROI	Region of Interest
SGDM	Stochastic gradient descent with momentum
SMOTE	Synthetic minority oversampling technique
SVM	Support vector machine
WHO	World Health Organization
XGBOOST	eXtreme gradient boosting

References

“Breast Cancer”. https://www.who.int/news-room/fact-sheets/detail/breast-cancer, World Health Organization (WHO), 23-03-2021 (accessed on 04-05-2026).
Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324 (1998). [CrossRef]
Rumelhart, D., Hinton, G. & Williams, R. “Learning representations by back-propagating errors,” in Nature, vol. 323, pp. 533–536 (1986). [CrossRef]
Nasser M, Yusof U.K., “Deep Learning Based Methods for Breast Cancer Diagnosis: A Systematic Review and Future Direction,” in Diagnostics, vol. 13, no. 1, 161 (2023). [CrossRef]
Shams, S., Platania, R., Zhang, J., Kim, J., Lee, K., Park, SJ. Deep Generative Breast Cancer Screening and Diagnosis. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds) Medical Image Computing and Computer Assisted Intervention (MICCAI 2018). Lecture Notes in Computer Science, vol 11071 (2018). Springer, Cham. [CrossRef]
Vivek Kumar Singh, Hatem A. Rashwan, Santiago Romani, Farhan Akram, Nidhi Pandey, Md. Mostafa Kamal Sarker, Adel Saleh, Meritxell Arenas, Miguel Arquez, Domenec Puig, Jordina Torrents-Barrena, Breast tumor segmentation and shape classification in mammograms using generative adversarial and convolutional neural network, Expert Systems with Applications, Volume 139, 2020, 112855, ISSN 0957-4174. [CrossRef]
Shuyue Guan and Murray Loew, “Breast cancer detection using synthetic mammograms from generative adversarial networks in convolutional neural networks”, Proc. SPIE 10718, 14th International Workshop on Breast Imaging (IWBI 2018), 107180X (6 July 2018). [CrossRef]
J. Zheng, D. Lin, Z. Gao, S. Wang, M. He and J. Fan, “Deep Learning Assisted Efficient AdaBoost Algorithm for Breast Cancer Detection and Early Diagnosis,” in IEEE Access, vol. 8, pp. 96946-96954, 2020. [CrossRef]
Dehghan Rouzi, M. et al. “Breast Cancer Detection with an Ensemble of Deep Learning Networks Using a Consensus-Adaptive Weighting Method.” J. Imaging 9(11), 247 (2023). [CrossRef]
Shah et al. “Optimizing Breast Cancer Detection With an Ensemble Deep Learning Approach.” Int. J. Intelligent Systems, 2024. [CrossRef]
Masud et al. “From Machine Learning to Ensemble Approaches: A Systematic Review of Mammogram Classification Methods.” Diagnostics 15, 2829 (2025). [CrossRef]
C. Carr, F. Kitamura and G. Partridge, “RSNA Screening Mammography Breast Cancer Detection,” https://www.kaggle.com/competitions/rsna-breast-cancer-detection/overview (accessed on 04-05-2026).
National Electrical Manufacturers Association, “Digital Imaging and Communications in Medicine (DICOM) Standard”, Rosslyn, VA, USA. https://dicomstandard.org (accessed on 04-05-2026).
https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/377790 (accessed on 04-05-2026).
K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770-778. [CrossRef]
Cortes, C., Vapnik, V. “Support-vector networks,” in Machine Learning, vol. 20, no. 3, pp. 273–297 (1995). [CrossRef]
Breiman, L., “Bagging predictors,” in Machine Learning, vol. 24, no. 2, pp. 123–140 (1996). [CrossRef]
Friedman, J. H., “Greedy function approximation: A gradient boosting machine,” in Annals of Statistics, vol. 29, no. 5, pp. 1189–1232 (2001). [CrossRef]
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization,” 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 618-626. [CrossRef]

Figure 1. Pipeline general diagram.

Figure 2. Digital mammogram: (a) original DICOM file displayed by a DICOM viewer (microdicom), (b) preprocessed image.

Figure 3. Novel CNN architecture used in this work.

Figure 4. Training curves for AUG-CC.

Figure 5. (a) ROC curve and AUC value for MLP. (b) ROC curve and AUC value for HRS.

Figure 7. Grad-CAM feature maps computed for a strongly positive case. (a) Preprocessed mammogram, (b) Grad-CAM activation maps together with the corresponding numerical outputs (logits) from the six different CNNs. From left to right and top to bottom: BLD-CC, BLD-MLO, BLD-MIX, AUG-CC, AUG-MLO, and AUG-MIX.

Table 1. Main dataset.

	CC	MLO	ML	LM
Healthy	26199	27313	8	10
Sick	566	590	0	0

Table 2. Work datasets.

		BLD-CC	BLD-MLO	BLD-MIX	AUG-CC	AUG-MLO	AUG-MIX
Train	Healthy	1017	1091	2108	2320	2308	4628
Train	Sick	1017	1091	2108	1017	1091	2108
Test	Healthy	180	192	372	180	192	372
Test	Sick	180	192	372	180	192	372

Table 3. CNN results (novel architecture of Figure 3).

Training dataset	Precision	Recall	F₁
BLD-CC	0.81	0.70	0.75
BLD-MLO	0.77	0.67	0.72
BLD-MIX	0.98	0.59	0.73
AUG-CC	0.78	0.63	0.70
AUG-MLO	0.79	0.74	0.77
AUG-MIX	0.79	0.71	0.75

Table 4. CNN results (ResNet-18).

Training dataset	Precision	Recall	F₁
BLD-CC	0.84	0.74	0.79
BLD-MLO	0.83	0.67	0.74
BLD-MIX	0.85	0.69	0.76
AUG-CC	0.90	0.65	0.75
AUG-MLO	0.95	0.60	0.74
AUG-MIX	0.86	0.70	0.77

Table 5. Second stage (final) results, BLD-MIX dataset.

	MLP	SVM	BT	XGBOOST	HRS
Precision	0.79	0.88	0.84	0.82	0.78
Recall	0.89	0.83	0.85	0.88	0.90
F₁	0.84	0.85	0.84	0.85	0.84
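As a consistency check on the table above, the F₁ row can be recomputed from the precision (P) and recall (R) rows using the harmonic-mean definition given in the abbreviation list. The worked values below use only the figures reported in Table 5 for MLP and XGBoost.

```latex
F_1 = \frac{2\,P\,R}{P + R}

\text{MLP:}\quad
F_1 = \frac{2 \times 0.79 \times 0.89}{0.79 + 0.89}
    = \frac{1.4062}{1.68} \approx 0.84

\text{XGBoost:}\quad
F_1 = \frac{2 \times 0.82 \times 0.88}{0.82 + 0.88}
    = \frac{1.4432}{1.70} \approx 0.85
```

Both values agree with the reported F₁ entries to two decimal places.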

Table 6. Second stage (final) results, joint and shuffled dataset.

	MLP	SVM	BT	XGBOOST	HRS
Precision	0.76	0.90	0.85	0.87	0.77
Recall	0.92	0.81	0.83	0.84	0.86
F₁	0.83	0.85	0.84	0.86	0.81

Table 7. Second stage results after ResNet-18 stage, BLD-MIX dataset.

	MLP	SVM	BT	XGBOOST	HRS
Precision	0.92	0.95	0.95	0.95	0.88
Recall	0.98	0.96	0.97	0.98	0.97
F₁	0.95	0.96	0.95	0.96	0.93

Table 8. Second stage results after ResNet-18 stage, joint and shuffled dataset.

	MLP	SVM	BT	XGBOOST	HRS
Precision	0.94	0.97	0.95	0.97	0.93
Recall	0.93	0.90	0.93	0.92	0.93
F₁	0.94	0.93	0.94	0.94	0.93

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.