4. Discussion
One of the most relevant findings of this study is the difference in performance between the individual CNN models and the final ensemble system. The standalone CNNs achieved moderate performance, with F1-scores generally ranging between 0.70 and 0.77. Although these values indicate that the networks were capable of extracting meaningful diagnostic information from mammographic images, they were insufficient for a reliable clinical support system when used independently. However, when the outputs of the six CNNs were combined into a feature vector and processed by a second-stage classifier, a clear improvement was observed across all evaluated metrics. This result supports the hypothesis that different CNN configurations learn complementary representations of the mammographic data and that combining them increases robustness and generalization capability.
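The two-stage arrangement described above can be illustrated with a minimal sketch. It assumes each CNN emits one score per mammogram. A simple logistic function stands in for the second-stage classifier, in place of the XGBoost and MLP models actually evaluated. The function names, scores, weights, and bias are hypothetical and serve only to show how the six outputs form one feature vector per image.

```python
import math

def stack_features(per_model_scores):
    """Turn [model][sample] scores into [sample][model] feature vectors."""
    n_samples = len(per_model_scores[0])
    if any(len(scores) != n_samples for scores in per_model_scores):
        raise ValueError("every model must score the same samples")
    return [list(sample) for sample in zip(*per_model_scores)]

def second_stage_probability(features, weights, bias):
    """Hypothetical logistic second stage: sigmoid of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up scores from six CNNs for three mammograms (illustrative only).
cnn_scores = [
    [0.9, 0.2, 0.6],
    [0.8, 0.3, 0.4],
    [0.7, 0.1, 0.7],
    [0.9, 0.4, 0.5],
    [0.6, 0.2, 0.6],
    [0.8, 0.3, 0.3],
]

X = stack_features(cnn_scores)   # one 6-dimensional vector per mammogram
weights = [1.0] * 6              # placeholder; a real second stage learns these
bias = -3.0                      # placeholder
probs = [second_stage_probability(x, weights, bias) for x in X]
```

In a real stacking setup, the second stage would be fitted on CNN outputs computed for images the CNNs did not see during training. Otherwise it learns from overconfident first-stage scores.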
Among the evaluated second-stage methods, XGBoost achieved the highest F1-score, while the optimized MLP showed comparable performance and consistently competed for the highest recall. The MLP was selected as the preferred option for the demonstration application because it provides a continuous output score (a logit) that can be mapped to an estimated probability of disease, which is particularly useful in practical clinical decision support scenarios. In addition, its high recall is especially important in medical diagnosis, where sensitivity is often prioritized over overall accuracy, since failing to detect a malignant lesion may have severe clinical consequences. The achieved recall above 0.90 indicates that the proposed system identifies the majority of positive cases while maintaining acceptable precision.
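The relationship between a continuous output score, the decision threshold, and the reported metrics can be sketched as follows. The logits and labels are made up for illustration, and the thresholds 0.5 and 0.35 are assumptions, not values used in this study.

```python
import math

def logit_to_probability(logit):
    """Map a raw logit to a probability with the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-logit))

def classify(logits, threshold=0.5):
    """Label a case positive when its probability reaches the threshold."""
    return [1 if logit_to_probability(z) >= threshold else 0 for z in logits]

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall (sensitivity), and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative logits for three positive and three negative cases.
logits = [2.0, -0.4, 0.3, -1.5, -0.2, -3.0]
y_true = [1, 1, 1, 0, 0, 0]

default_metrics = precision_recall_f1(y_true, classify(logits, 0.5))
sensitive_metrics = precision_recall_f1(y_true, classify(logits, 0.35))
```

On this toy data, lowering the threshold from 0.5 to 0.35 raises recall from 2/3 to 1.0 and lowers precision from 1.0 to 0.75. A continuous score allows the operating point to be shifted toward sensitivity in this way.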
The preprocessing stage also plays an important role in the overall system performance. The original datasets contain mammograms acquired under heterogeneous conditions, with different image sizes, intensity ranges, and acquisition formats. The proposed preprocessing pipeline standardizes these variations through normalization, contrast enhancement, geometric alignment, and resizing. This allows the CNNs to focus on diagnostically relevant patterns rather than irrelevant acquisition differences. Additionally, the mirroring strategy ensures consistent anatomical orientation, which may facilitate the learning process.
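Two of the steps mentioned above, intensity normalization and mirroring, can be sketched in a few lines. This is a simplified illustration on nested lists and not the pipeline used in this study. Contrast enhancement, geometric alignment, and resizing are omitted. The orientation rule, which flips the image when the right half is brighter, is an assumed heuristic.

```python
def min_max_normalize(image):
    """Rescale pixel intensities to the range [0, 1]."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant image: avoid division by zero.
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) / (hi - lo) for p in row] for row in image]

def mirror_to_left(image):
    """Flip horizontally when most intensity lies in the right half.

    Assumed heuristic: breast tissue is brighter than background, so the
    brighter half indicates the side on which the breast appears.
    """
    width = len(image[0])
    half = width // 2
    left = sum(sum(row[:half]) for row in image)
    right = sum(sum(row[width - half:]) for row in image)
    if right > left:
        return [row[::-1] for row in image]
    return image

# Tiny made-up 2x4 "mammogram" with tissue on the right side.
raw = [
    [0, 0, 100, 200],
    [0, 50, 150, 250],
]
processed = mirror_to_left(min_max_normalize(raw))
```

After both steps, every image shares the same intensity range and the same orientation, so the networks do not need to learn these acquisition differences.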
Another important aspect of this work is the use of relatively lightweight CNN architectures: ResNet-18 and a novel architecture designed by the authors. ResNet-18 was trained by transfer learning, whereas the novel architecture was trained from scratch. This design choice reduces computational complexity and training requirements while still providing competitive performance. Although more complex transfer learning approaches based on architectures such as ResNet-50, DenseNet, or EfficientNet have shown strong performance in medical imaging applications, they often require significantly larger computational resources and may introduce unnecessary complexity for datasets of limited size. The proposed architecture demonstrates that simpler networks, when combined through an ensemble strategy, can still achieve clinically meaningful results.
The study also highlights the importance of data balancing and augmentation in medical imaging problems. The original datasets were highly imbalanced, with healthy cases greatly outnumbering positive cancer cases. Without compensation mechanisms, this imbalance would bias the models toward the majority class. The use of balanced datasets, oversampling strategies, and online data augmentation contributed to improving generalization and reducing overfitting. Nevertheless, some limitations remain due to the relatively limited number of positive samples available for training.
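Random oversampling, one of the balancing strategies mentioned above, can be sketched as follows. This is a generic illustration and not the exact procedure used in this study. The function name, sample identifiers, and class counts are hypothetical.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until classes are equal."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement to reach the majority count.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Made-up identifiers: six healthy cases (0) and two cancer cases (1).
samples = ["h1", "h2", "h3", "h4", "h5", "h6", "c1", "c2"]
labels = [0] * 6 + [1] * 2
bal_samples, bal_labels = oversample_minority(samples, labels)
```

Oversampling of this kind should be applied to the training split only. Duplicating samples before the data are split would place copies of the same image in both training and test sets. Online augmentation can then vary the duplicated images so that the network does not see identical copies.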
Despite the encouraging results, several important research directions remain open. The main future objective is to acquire an external validation dataset from local clinical sources, which would constitute a first step toward the development of a potential commercial product.
Another limitation of the current system is that it relies exclusively on imaging information. In real clinical practice, radiologists also consider additional patient data such as age, family history, breast density, genetic predisposition, and prior examinations. Integrating multimodal clinical information could further improve diagnostic performance and reduce uncertainty in challenging cases.
Future work should therefore focus on several directions. First, external validation studies should be conducted using independent clinical datasets, preferably obtained from the local healthcare system. Second, more advanced architectures and alternative transfer learning strategies could be evaluated and compared with the proposed CNN-based approaches. Third, explainability methods should be further investigated to improve model interpretability and facilitate clinical adoption. Finally, incorporating additional clinical variables and temporal information from previous mammograms could enable the development of a more robust and clinically useful multimodal decision-support system.
Overall, the results obtained in this work suggest that ensemble-based machine learning strategies constitute a promising approach for computer-aided breast cancer diagnosis from mammographic images. The proposed system demonstrates that combining multiple CNN models with a second-stage classifier can substantially improve sensitivity and overall diagnostic reliability, providing a valuable foundation for future research and potential clinical decision support applications.