1. Introduction
Breast cancer remains a prevalent malignancy among women worldwide, and early detection and accurate diagnosis are crucial for improving survival. Mammography is the widely adopted standard for breast cancer screening, but its interpretation demands extensive expertise, and diagnostic discrepancies and missed diagnoses among radiologists remain a challenge.
Beyond mammography, breast cancer diagnosis incorporates various methods, including visual inspection, palpation, and ultrasound examination. When these examinations reveal abnormalities, clinicians often perform highly invasive procedures such as cytological and histological examinations for definitive diagnosis. If deep learning-based analysis of mammographic images, which are acquired with minimal invasiveness, can achieve high diagnostic accuracy, then it could reduce the need for such highly invasive procedures. It would also alleviate the burden on the radiologists and breast surgeons who interpret these images.
Recent rapid advancements in artificial intelligence (AI) technology, particularly deep learning, have markedly accelerated the development of automated analysis and diagnostic support systems for mammographic images. For various image recognition tasks, deep learning algorithms, especially convolutional neural networks (CNNs), now demonstrate performance comparable to or exceeding human capabilities. For medical image diagnosis, these technologies often achieve superior accuracy and efficiency compared to conventional methodologies.
Many studies have explored deep learning applications for mammographic image diagnosis. For instance, Zhang et al. [1] performed two-stage classification (normal/abnormal and benign/malignant) of two-view mammograms (CC and MLO) on the public DDSM dataset using a multi-scale attention DenseNet. Lång et al. [2] evaluated the potential of AI for identifying normal mammograms by classifying cancer likelihood scores with a deep learning model on a private dataset, comparing the obtained results to radiologists’ interpretations. Another study by Lång et al. [3] indicated that deep learning models trained on a private dataset can reduce interval cancer rates without supplementary screening. Zhu et al. [4] predicted future breast cancer development in negative subjects during an eight-year period using a deep learning model with a private dataset. Kerschke et al. [5] compared human versus deep learning AI accuracy for benign–malignant screening using a private dataset, highlighting the need for prospective studies. Nica et al. [6] reported high-accuracy benign–malignant classification of cranio-caudal view mammography images using an AlexNet deep learning model and a private dataset. Rehman et al. [7] achieved high-accuracy architectural distortion detection using image processing and proprietary depth-wise 2D V-net 64 convolutional neural networks on the PINUM, CBIS-DDSM, and DDSM datasets. Yirgin et al. [8] used a public deep learning diagnostic system on a private dataset, concluding that combined assessment by the deep learning model and radiologists yielded the best performance. Tzortzis et al. [9] demonstrated superior performance for efficiently detecting abnormalities on the public INbreast dataset using their tensor-based deep learning model, particularly showing robustness with limited data and reduced computational requirements. Pawar et al. [10] and Hsu et al. [11] both reported high-accuracy Breast Imaging Reporting and Data System (BIRADS) category classification on private datasets, using, respectively, a proprietary multi-channel DenseNet architecture and a fully convolutional dense connection network. Elhakim et al. [12] investigated the feasibility of replacing the first reader with AI in double-reading mammography using a commercial AI system with a private dataset, emphasizing the importance of an appropriate AI threshold. Jaamour et al. [13] improved the segmentation accuracy for mass and calcification images from the public CBIS-DDSM dataset by application of transfer learning. Kebede et al. [14] developed a model combining EfficientNet-based classifiers with a YOLOv5 object detection model and an anomaly detection model for mass screening on the public VinDr and Mini-DDSM datasets. Ellis et al. [15], using the UK-national OPTIMAM dataset, developed a deep learning AI model for predicting future cancer risk in patients with negative mammograms. Elhakim et al. [16] further investigated replacement of one or both readers with AI in double-reading mammography, emphasizing clinical implications for accuracy and workload. Sait et al. [17] reported high segmentation accuracy and generalizability in multi-class breast cancer image classification using an EfficientNet B7 model within a LightGBM model on the CBIS-DDSM and CMMD datasets. Chakravarthy et al. [18] reported high classification accuracy for normal, benign, and malignant cases using an ensemble method with a modified Gompertz function on the BCDR, MIAS, INbreast, and CBIS-DDSM datasets. Liu et al. [19] achieved high classification accuracy on four binary tasks using a CNN and a private mammography image dataset, suggesting a potential to reduce unnecessary breast biopsies. Park et al. [20] reported improved diagnostic accuracy, especially in challenging ACR BIRADS categories 3 and 4 with breast density exceeding 50%, by learning both benign–malignant classification and lesion boundaries using a ViT-B DINO-v2 model on the public CBIS-DDSM dataset. Finally, AlMansour et al. [21] reported high-accuracy BIRADS classification using MammoViT, a novel hybrid deep learning framework, on a private dataset.
Despite these advancements, several difficulties hinder the reproducibility of claims in deep learning applications for mammographic image diagnosis. Studies that use private datasets, or proprietary deep learning models with undisclosed details, are difficult to verify. Methods that incorporate subject information alongside mammographic images as training data also face reproducibility issues because different datasets share few such attributes. Similarly, studies that combine mammographic images with images from other modalities require specific data combinations, which complicates reproduction of their claims.
Given these considerations, we prioritized reproducibility by focusing on publicly available datasets and open-source deep learning models. Furthermore, we emphasized the generalizability of claims across multiple public datasets and various deep learning models.
Therefore, this study tested the hypothesis that prediction accuracy improves when images are first divided into those with and without annotated mask information for regions of interest, training and prediction are then performed separately for each of the four mammographic views (RCC, LCC, RMLO, LMLO), and the results are finally merged. This approach is compared with the case in which image data are not separated according to the availability of mask information for regions of interest. Using two public datasets and two deep learning models, we validated this hypothesis, treating the presence or absence of annotated mask information as a novel feature.
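To make the data flow of this hypothesis concrete, the following is a minimal illustrative sketch, not the implementation used in this study. It assumes a hypothetical record layout of (image_id, view, has_mask) and a placeholder `dummy_predict` function standing in for a model trained on one subgroup only. All function and variable names are assumptions introduced for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# The four mammographic views named in the text.
VIEWS = ("RCC", "LCC", "RMLO", "LMLO")

# Hypothetical record layout (an assumption): (image_id, view, has_mask).
Record = Tuple[str, str, bool]
GroupKey = Tuple[str, bool]


def partition_records(records: Iterable[Record]) -> Dict[GroupKey, List[str]]:
    """Group image ids by (view, has_mask), giving up to 4 x 2 = 8 subgroups."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for image_id, view, has_mask in records:
        if view not in VIEWS:
            raise ValueError(f"unknown view: {view}")
        groups[(view, has_mask)].append(image_id)
    return dict(groups)


def predict_and_merge(
    groups: Dict[GroupKey, List[str]],
    predict_fn: Callable[[GroupKey, List[str]], Dict[str, float]],
) -> Dict[str, float]:
    """Predict each subgroup separately, then merge into one id -> score map."""
    merged: Dict[str, float] = {}
    for key, image_ids in groups.items():
        merged.update(predict_fn(key, image_ids))
    return merged


def dummy_predict(key: GroupKey, image_ids: List[str]) -> Dict[str, float]:
    """Placeholder for a per-subgroup model; returns a constant score."""
    _, has_mask = key
    return {image_id: (0.9 if has_mask else 0.1) for image_id in image_ids}


# Toy records, invented purely for illustration.
records = [
    ("img1", "RCC", True),
    ("img2", "RCC", False),
    ("img3", "LMLO", True),
    ("img4", "RCC", True),
]
groups = partition_records(records)
merged = predict_and_merge(groups, dummy_predict)
```

The comparison condition described above corresponds to dropping `has_mask` from the grouping key, so that each view forms a single group regardless of mask availability.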