1. Introduction
For breast cancer, which remains one of the most prevalent malignancies among women worldwide, early detection and accurate diagnosis are crucial for improving survival. Mammography is the widely adopted standard for breast cancer screening, but its interpretation demands extensive expertise, and diagnostic discrepancies among radiologists and missed diagnoses remain persistent challenges.
Beyond mammography, breast cancer diagnosis incorporates various methods, including visual inspection, palpation, and ultrasound examination. When these examinations reveal abnormalities, clinicians often perform highly invasive procedures such as cytological and histological examinations to reach a definitive diagnosis. If deep learning-based analysis of minimally invasive mammography can achieve high diagnostic accuracy, it could reduce the need for these highly invasive procedures while also alleviating the interpretive burden on radiologists and breast surgeons.
Recent rapid advances in artificial intelligence (AI), particularly deep learning, have remarkably accelerated the development of automated analysis and diagnostic support systems for mammographic images. Deep learning algorithms, especially convolutional neural networks (CNNs), now demonstrate performance comparable to or exceeding human capabilities on various image recognition tasks, and in medical image diagnosis they often achieve better accuracy and efficiency than conventional methodologies.
Many studies have explored deep learning applications for mammographic image diagnosis. For instance, Zhang et al. [1] performed two-stage classification (normal/abnormal and benign/malignant) of two-view mammograms (CC and MLO) on the public DDSM dataset using a multi-scale attention DenseNet. Lång et al. [2] evaluated the potential of AI for identifying normal mammograms by classifying cancer likelihood scores with a deep learning model on a private dataset, comparing the results to radiologists’ interpretations. Another study by Lång et al. [3] indicated that deep learning models trained on a private dataset can reduce interval cancer rates without supplementary screening. Zhu et al. [4] predicted future breast cancer development in negative subjects over an eight-year period using a deep learning model with a private dataset. Kerschke et al. [5] compared human and deep learning AI accuracy for benign–malignant screening on a private dataset, highlighting the need for prospective studies. Nica et al. [6] reported high-accuracy benign–malignant classification of cranio-caudal view mammography images using an AlexNet deep learning model and a private dataset. Rehman et al. [7] achieved high-accuracy architectural distortion detection using image processing and a proprietary depth-wise 2D V-net 64 convolutional neural network on the PINUM, CBIS-DDSM, and DDSM datasets. Yirgin et al. [8] applied a publicly available deep learning diagnostic system to a private dataset, concluding that combined assessment by the deep learning model and radiologists yielded the best performance. Tzortzis et al. [9] demonstrated superior performance in efficiently detecting abnormalities on the public INbreast dataset using a tensor-based deep learning model, showing particular robustness with limited data and reduced computational requirements. Pawar et al. [10] and Hsu et al. [11] reported high-accuracy Breast Imaging Reporting and Data System (BIRADS) category classification using, respectively, a proprietary multi-channel DenseNet architecture and a fully convolutional dense connection network on private datasets. Elhakim et al. [12] investigated the feasibility of replacing the first reader with AI in double-reading mammography using a commercial AI system with a private dataset, emphasizing the importance of an appropriate AI threshold. Jaamour et al. [13] improved segmentation accuracy for mass and calcification images from the public CBIS-DDSM dataset by applying transfer learning. Kebede et al. [14] developed a model combining EfficientNet-based classifiers with a YOLOv5 object detection model and an anomaly detection model for mass screening on the public VinDr and Mini-DDSM datasets. Ellis et al. [15], using the UK-national OPTIMAM dataset, developed a deep learning AI model for predicting future cancer risk in patients with negative mammograms. Elhakim et al. [16] further investigated replacing one or both readers with AI in double-reading mammography, emphasizing the clinical implications for accuracy and workload. Sait et al. [17] reported high accuracy and generalizability in multi-class breast cancer image classification using an EfficientNet B7 model combined with a LightGBM model on the CBIS-DDSM and CMMD datasets. Chakravarthy et al. [18] reported high classification accuracy for normal, benign, and malignant cases using an ensemble method with a modified Gompertz function on the BCDR, MIAS, INbreast, and CBIS-DDSM datasets. Liu et al. [19] achieved high classification accuracy on four binary tasks using a CNN and a private mammography image dataset, suggesting a potential to reduce unnecessary breast biopsies. Park et al. [20] reported improved diagnostic accuracy, especially in the challenging ACR BIRADS categories 3 and 4 with breast density exceeding 50%, by learning both benign–malignant classification and lesion boundaries with a ViT-B DINO-v2 model on the public CBIS-DDSM dataset. Finally, AlMansour et al. [21] reported high-accuracy BIRADS classification using MammoViT, a novel hybrid deep learning framework, on a private dataset.
Despite these advancements, several difficulties hinder the reproducibility of claims in deep learning applications for mammographic image diagnosis. Studies using private, non-public datasets or proprietary deep learning models with undisclosed details make verification challenging. Methods that incorporate subject information alongside mammographic images as training data also face reproducibility issues because such attributes have limited commonality across datasets. Similarly, studies combining mammographic images with images from other modalities require specific data combinations, further complicating reproduction of their claims.
Given these considerations, we prioritized reproducible research by focusing on studies that use publicly available datasets and open-source deep learning models. Furthermore, we emphasized the generalizability of claims across multiple public datasets and various deep learning models.
Therefore, this study tested the hypothesis that prediction accuracy improves when images are first divided into those with and without annotated mask information for regions of interest, then trained and predicted separately for each of the four mammographic views (RCC, LCC, RMLO, LMLO), before the results are merged; a schematic sketch of this grouping follows. This approach was compared against training on image data that are not separated by the availability of region-of-interest mask information. Using two public datasets and two deep learning models, we validated this hypothesis, treating the presence or absence of annotated mask information as a novel feature.
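As a minimal sketch of this grouping strategy, the following Python code illustrates one way the split-train-merge pipeline could be organized. The record fields and the `train_classifier` callable are hypothetical placeholders, not the exact implementation used in this study:

```python
# Sketch of the proposed pipeline: split images by availability of an
# annotated ROI mask, train one classifier per view within each group,
# then merge the per-view predictions at the end.
from collections import defaultdict

VIEWS = ["RCC", "LCC", "RMLO", "LMLO"]

def split_and_train(records, train_classifier):
    """records: iterable of dicts with keys 'image', 'view', 'label',
    and 'mask' (None when no annotated ROI mask exists)."""
    groups = defaultdict(list)  # (has_mask, view) -> list of records
    for r in records:
        groups[(r["mask"] is not None, r["view"])].append(r)
    # One model per (mask availability, view) combination: 2 x 4 = 8 models.
    return {key: train_classifier(items) for key, items in groups.items()}

def predict_merged(models, records):
    """Route each test image to the model matching its group, then pool
    all benign/malignant predictions into a single merged result list."""
    merged = []
    for r in records:
        model = models[(r["mask"] is not None, r["view"])]
        merged.append((r["image"], model.predict(r["image"])))
    return merged
```

The baseline comparison corresponds to training a single model on all records without the `(has_mask, view)` grouping.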
4. Discussion
Some data lead to a malignant diagnosis but lack a visible region of interest (ROI). This discrepancy is likely attributable to factors such as dense breast tissue, which can obscure ROIs by causing the entire image to appear uniformly opaque. In such cases, a malignant diagnosis reached despite the absence of a clear ROI on the image likely reflects corroborating results from other diagnostic modalities such as biopsy. We also observed data diagnosed as normal but exhibiting an ROI; the presence of an ROI in a “normal” case is counterintuitive and suggests a potential misrepresentation or artifact in the diagnostic labeling process. Such anomalous data points, whether a malignant diagnosis without a discernible ROI or a normal diagnosis with an ROI, introduce noise that can strongly hinder a deep learning model’s ability to learn accurate patterns and consequently diminish its predictive performance. Preprocessing the dataset to identify and remove or re-evaluate these inconsistent data points before training, as sketched below, might enhance the learning and prediction accuracy of deep learning algorithms for medical image analysis.
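A minimal sketch of such a preprocessing step follows; the field names and label values are hypothetical and would need adapting to a given dataset’s schema:

```python
# Hypothetical preprocessing step: flag records whose diagnostic label
# conflicts with ROI availability (malignant without an ROI mask, or
# normal with one), so they can be removed or re-reviewed before training.
def partition_by_consistency(records):
    """records: dicts with 'label' in {'normal', 'benign', 'malignant'}
    and 'mask' (None when no ROI annotation exists)."""
    consistent, suspect = [], []
    for r in records:
        has_roi = r["mask"] is not None
        if (r["label"] == "malignant" and not has_roi) or \
           (r["label"] == "normal" and has_roi):
            suspect.append(r)   # candidate label noise: re-evaluate or drop
        else:
            consistent.append(r)
    return consistent, suspect
```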
This study used data with pre-existing ROI mask images. However, mammographic images requiring benign–malignant classification do not always have corresponding mask images available. Future research should therefore examine the generation of mask images for mammographic data lacking existing masks, employing techniques such as semantic segmentation or object detection, and subsequently validate these approaches; one possible direction is sketched below.
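As one illustrative direction (not a method validated in this study), a segmentation network could be trained on the subset of images that do have masks and then used to generate pseudo-masks for the rest. The sketch below uses the segmentation_models_pytorch library; the U-Net architecture, ResNet-34 encoder, and 0.5 threshold are all assumptions:

```python
# Sketch: generate ROI masks for images lacking annotations, using a
# U-Net trained on the annotated subset. Encoder choice and threshold
# are illustrative assumptions.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",      # assumed backbone
    encoder_weights="imagenet",
    in_channels=1,                # grayscale mammograms
    classes=1,                    # binary ROI mask
)
loss_fn = torch.nn.BCEWithLogitsLoss()  # loss for training on the annotated subset

def predict_mask(image_1chw, threshold=0.5):
    """image_1chw: float tensor of shape (1, 1, H, W), H and W divisible by 32."""
    model.eval()
    with torch.no_grad():
        prob = torch.sigmoid(model(image_1chw))
    return (prob > threshold).float()  # pseudo-mask for downstream training
```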
The deep learning models used in this study, Swin Transformer and ConvNeXtV2, demonstrated superior accuracy in both training and prediction compared to other deep learning models. We hypothesize that this improved performance derives from differences in their respective layer architectures; a detailed analysis of this phenomenon is left for future investigation.
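For reference, both architectures are available in the open-source timm library; the specific model variants and the two-class head below are illustrative assumptions, since this study does not tie its claims to particular checkpoints:

```python
# Instantiating the two backbone families via timm for binary
# benign/malignant classification (variant names are assumptions).
import timm

swin = timm.create_model(
    "swin_base_patch4_window7_224", pretrained=True, num_classes=2)
convnext_v2 = timm.create_model(
    "convnextv2_base", pretrained=True, num_classes=2)
```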
Whereas this study addressed only benign–malignant classification, mammographic data are typically first categorized into normal versus abnormal findings, with abnormal cases subsequently classified as benign or malignant. An important area for future investigation is whether our methodology can also classify normal versus abnormal cases effectively; a minimal cascade of this kind is sketched below. If successful, this capability would enable diagnostic prediction for a broader range of arbitrary mammographic data.
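Such a two-stage workflow could be expressed, under the assumption of two independently trained binary classifiers with a simple `predict` interface (placeholder names), as:

```python
# Hypothetical two-stage cascade: first screen normal vs. abnormal, then
# classify only the abnormal cases as benign or malignant.
def two_stage_predict(image, normal_vs_abnormal, benign_vs_malignant):
    if normal_vs_abnormal.predict(image) == "normal":
        return "normal"
    return benign_vs_malignant.predict(image)  # 'benign' or 'malignant'
```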