4. Discussion
With the advent of artificial intelligence (AI), its applications in various medical fields are increasingly being explored. Significant advances have been made, particularly in AI-powered medical devices for image-based diagnostics such as endoscopy and CT, ranging from support for endoscopic image diagnosis to the identification of pulmonary nodules and COVID-19 pneumonia on imaging. Gastrointestinal endoscopy, a field with a high volume of cases, has made especially substantial progress in real-time AI-assisted diagnostic support during examinations [8].
Our previous AI research has demonstrated the feasibility of distinguishing inverted papilloma on endoscopic images and predicting the extent of middle ear cholesteatoma, achieving high diagnostic accuracy even with small sample sizes. The potential for AI to provide high-precision diagnostic support for rare cases holds promise for significantly transforming future clinical practices.
The primary treatment for cholesteatoma is surgery, but reports indicate a residual cholesteatoma rate of 11% and a recurrence rate of 8% within five years postoperatively [9], with long-term recurrence rates reaching approximately 20–30% [10]. Preventing recurrence at the time of surgery therefore remains critical. While complete removal along the cholesteatoma matrix can theoretically prevent residual lesions, inflammatory findings in infected cases often make this difficult. Additionally, complete removal of all soft tissues, including the mastoid air cells, might prevent residual lesions but would impair postoperative re-aeration of the mastoid cavity [11].
If AI-based diagnostic support could be implemented during surgery, it might contribute to reducing recurrence rates while enabling the formation of well-aerated mastoid air cells, thus improving surgical outcomes.
In this study, the diagnostic accuracy for cholesteatoma (based on overall positive rate analysis) was approximately 80% for both endoscopic and microscopic images, indicating room for further improvement. Compared to endoscopic views, microscopic images often included non-lesion elements, such as bone or skin. To address this, the videos were edited and magnified to exclude non-lesion areas and to center the lesion within the image for training. In contrast, most endoscopic videos already focused on the lesion, requiring minimal to no editing.
As a result, while the diagnostic accuracy improved for both modalities, the improvement was more significant for microscopic images. This can be attributed to the fact that lesions in microscopic images were often located in distal regions within the frame, and magnification allowed the AI model to learn from images that focused solely on the lesion, leading to enhanced accuracy. This is consistent with the results observed when comparing Figure 3 and Figure 4, where the improvement in the average sensitivity and specificity of unedited microscopic videos became more pronounced with longer unit times. The limited improvement in edited microscopic videos may be due to excessive magnification, which could have made it difficult to distinguish the lesion’s contours and its relationship with the surrounding structures. The improvement in endoscopic diagnostic accuracy, despite minimal modifications to the original images, may be explained by the AI model being trained on both endoscopic and microscopic data. The inclusion of edited microscopic images in the training dataset likely contributed to the improvement in endoscopic performance as well.
When performing lesion detection with AI, trimming and centering the lesion in the training data may enhance diagnostic accuracy. Previous studies have reported that AI models tend to perform poorly in detecting distal lesions within an image [12] or may focus on non-target areas during training [13], highlighting the need for careful design of training datasets. Furthermore, some studies have shown that AI can accurately recognize lesions even in the presence of surgical instruments [14, 15]. We minimized the inclusion of surgical instruments in our videos for this study, but given these reports, their presence would likely have had minimal impact on the results.
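As an illustration of the trimming and centering step discussed above, the following minimal sketch computes a fixed-size crop window that places a lesion as close to the center of the frame as the image borders allow. It is not the editing procedure used in this study: the function name `centered_crop_box`, the assumption that a lesion center `(cx, cy)` is available in pixel coordinates, and the example crop size are all hypothetical.

```python
from typing import Tuple


def centered_crop_box(
    img_w: int, img_h: int, cx: int, cy: int, crop_w: int, crop_h: int
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a crop_w x crop_h window.

    The window is centered on the assumed lesion center (cx, cy) and
    shifted back inside the frame when the lesion lies near a border.
    """
    if crop_w > img_w or crop_h > img_h:
        raise ValueError("crop window must fit inside the image")
    # Ideal top-left corner for a lesion-centered window, clamped to the frame.
    left = min(max(cx - crop_w // 2, 0), img_w - crop_w)
    top = min(max(cy - crop_h // 2, 0), img_h - crop_h)
    return left, top, left + crop_w, top + crop_h


# Hypothetical example: a lesion near the upper-right corner of a 1920x1080 frame.
box = centered_crop_box(1920, 1080, cx=1800, cy=100, crop_w=640, crop_h=480)
```

Clamping keeps the crop size constant, so a lesion at the edge of the frame ends up off-center rather than padded; whether padding would be preferable is a design choice this sketch does not address.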
We re-examined the cases with diagnostic accuracy below 10% in this study (see Supplementary Tables S4–S7). In false-negative cases, some involved surgical fields with significant bleeding that obscured the margins, whereas others reflected very thin residual epithelium; importantly, no obvious volumetric cholesteatoma lesions were overlooked. In false-positive cases, some also involved bleeding fields, while others appeared to show no apparent lesion; however, it was difficult to confidently exclude the presence of disease based on limited frames. The absence of lesions in these cases was supported by the lack of recurrence during a follow-up period of more than two years, yet it remains challenging to rule out disease with certainty from restricted video information alone. As surgeons, we do not rely on a single frame; instead, we incorporate tactile feedback and the overall intraoperative progression to determine whether residual cholesteatoma is present, thereby achieving greater diagnostic accuracy. By contrast, AI depends solely on visual information from the video, and thus cases that are visually ambiguous to clinicians may also be difficult for AI to classify correctly. These findings suggest that incorporating temporal continuity and multimodal intraoperative information may further improve diagnostic performance in future model development.
The time-window positive rate analysis showed that diagnostic accuracy improved with longer video durations. This is likely because longer videos provide the AI with more opportunities to analyze the cholesteatoma from various angles, enabling more accurate diagnosis. Since the videos were originally recorded for surgical purposes rather than for lesion diagnosis, the lesion's position within the frame varied across videos. In gastrointestinal endoscopy, where AI-assisted diagnosis has been more widely implemented, the accuracy and reliability of diagnosis have been shown to improve when the operator adjusts the focus and angle to better observe the lesion, a process that depends on the operator’s skill [16].
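To make the idea of a time-window positive rate concrete, the sketch below splits per-frame binary predictions into non-overlapping windows and computes the fraction of positive frames in each. This is an illustrative reading only: the function names, the use of non-overlapping windows, the dropping of incomplete trailing windows, and the 0.5 decision threshold are assumptions and may differ from the analysis actually performed in this study.

```python
from typing import List


def window_positive_rates(frame_preds: List[int], window: int) -> List[float]:
    """Positive rate of each non-overlapping window of `window` frames.

    `frame_preds` holds one 0/1 prediction per frame (1 = cholesteatoma).
    Frames left over after the last full window are ignored (assumption).
    """
    if window <= 0:
        raise ValueError("window must be a positive number of frames")
    rates = []
    for start in range(0, len(frame_preds) - window + 1, window):
        chunk = frame_preds[start:start + window]
        rates.append(sum(chunk) / window)
    return rates


def classify_windows(rates: List[float], threshold: float = 0.5) -> List[bool]:
    """Call a window positive when its positive rate reaches the threshold."""
    return [rate >= threshold for rate in rates]


# Hypothetical per-frame predictions for one short clip.
preds = [1, 0, 1, 1, 0, 0, 0, 1]
short_windows = window_positive_rates(preds, window=4)
long_window = window_positive_rates(preds, window=8)
```

Longer windows average over more frames, so a few misclassified frames have less influence on the window-level decision, which is one plausible mechanism behind the trend described above.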
When using AI for intraoperative diagnosis, it may be possible to enhance diagnostic accuracy by using magnification to observe the lesion with a microscope or by utilizing an endoscope for deep or angled regions. Interestingly, in the development of the AI model for this study, we found that diagnostic accuracy was higher when both endoscopic and microscopic data were used for training and diagnosis, compared to when each modality was trained and tested separately. It was initially assumed that separating endoscopic and microscopic data would facilitate AI training by simplifying the learning process [17], but the opposite result was observed.
Previous studies have reported AI applications for cholesteatoma mucosal lesions using static endoscopic images [6], but this study is the first to focus on surgical videos, to integrate both endoscopic and microscopic perspectives, and to evaluate performance without manual annotation. As the era of robotic surgery advances, the incorporation of AI during surgery is inevitable. Research on AI systems that enhance the safety and precision of surgical procedures is crucial.
This study has several limitations. First, the sample size was relatively small. However, given the rarity of cholesteatoma compared to the large case volumes in gastrointestinal endoscopy, the AI model used in this study may be particularly valuable for similarly rare diseases. Second, instead of static images, the model was trained using segmented surgical videos, which frequently included bleeding scenes. Although the videos did not exclusively focus on cholesteatoma, this likely provided a closer approximation to real intraoperative conditions, since middle ear surgeries almost always involve some degree of bleeding. Third, this study should be regarded as a proof-of-concept investigation. The present results are not directly comparable with those of previous studies, as differences in datasets, methodologies, and evaluation criteria make straightforward comparisons difficult; instead, our findings should be interpreted as exploratory evidence that clarifies technical requirements for future clinical systems. The study does not directly establish a real-time intraoperative system; rather, it demonstrates under controlled conditions that pretrained convolutional models can distinguish cholesteatoma from normal mucosa, thereby providing foundational evidence that may contribute to the future development of real-time intraoperative support. Fourth, this study did not include a direct comparison between the model’s performance and the diagnostic accuracy of experienced surgeons; such analyses would be useful for validating the clinical relevance of the system and represent an important direction for future research. At the same time, the present findings suggest that AI support may help reduce the risk of overlooked lesions and serve as a valuable aid for a wide range of surgeons.
Therefore, future studies should verify the generalizability of these findings through multi-institutional collaborations and should include direct comparisons with established diagnostic benchmarks and clinical expertise as part of validation.
In addition, the present study employed a large ensemble of models as an experimental strategy to stabilize performance under limited data conditions. Such an approach is not clinically feasible, and future work will require the development of robust single models trained on larger multi-institutional datasets. Furthermore, real-time processing capability and seamless integration into the surgical workflow represent essential technical hurdles that must be overcome before this system can be applied intraoperatively.
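For readers unfamiliar with ensembling, the following minimal sketch shows one common way to combine several models: averaging their predicted probabilities for a frame (soft voting) and thresholding the mean. The combination rule, the function names, and the 0.5 threshold are assumptions made for illustration; this section does not specify how the ensemble in this study aggregated its members.

```python
from typing import List


def ensemble_probability(model_probs: List[float]) -> float:
    """Average the cholesteatoma probabilities predicted by each model."""
    if not model_probs:
        raise ValueError("at least one model prediction is required")
    return sum(model_probs) / len(model_probs)


def ensemble_label(model_probs: List[float], threshold: float = 0.5) -> bool:
    """Label the frame positive when the averaged probability reaches the threshold."""
    return ensemble_probability(model_probs) >= threshold


# Hypothetical outputs of three models for a single frame.
frame_probs = [0.5, 0.25, 0.75]
mean_prob = ensemble_probability(frame_probs)
```

Averaging reduces the variance of individual models trained on limited data, but inference cost grows with the number of models, which is consistent with the feasibility concern raised above.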
Taken together, we believe this proof-of-concept study offers insights into the potential of AI-equipped video systems to support intraoperative detection of residual cholesteatoma and lays the groundwork for their future clinical application.