Preprint
Article

This version is not peer-reviewed.

Deep Learning-Based Blood Segmentation and Temporal Characterization for the Robin Heart Surgical Robot

Submitted: 21 April 2026
Posted: 22 April 2026


Abstract
Background/Objectives: In laparoscopic and robot-assisted surgery, bleeding may rapidly impair operative-field readability and procedural safety. In the broader Robin Heart teleoperation framework, interpretation of such events is relevant not only for scene understanding, but also as a potential prerequisite for future safety-oriented supervisory functions under communication-degraded conditions. The aim of this study was to assess whether a deep learning model for blood segmentation could provide outputs suitable for preliminary image-level temporal characterization of visible blood-region behavior in laparoscopic video. Methods: The model was first trained on a simulated bleeding dataset prepared under controlled conditions and then fine-tuned on annotated frames from robot-assisted laparoscopic hysterectomy video. Additional limited adaptation and held-out evaluation were performed on annotated bleeding-related episodes derived from the public GynSurg dataset. Segmentation performance was assessed using the Dice coefficient and Intersection over Union (IoU). Temporal analysis was performed on representative internal and external sequences using mask-derived descriptors and auxiliary optical-flow-based motion descriptors computed after camera-motion compensation within the detected blood ROI. Results: The model achieved Dice/IoU values of 0.94/0.89 on the simulated validation set, 0.907/0.830 on the internal operative validation set, and 0.764/0.626 on the annotated external GynSurg subset. The combined descriptor set differentiated more dynamic and unstable progression profiles from more spatially coherent ones across both datasets. Peak dA/dt reflected abrupt visible blood-area expansion, temporal IoU described mask stability over time, and optical-flow-based descriptors provided additional information on local motion activity. A peak-only description was insufficient to fully characterize the observed progression patterns. Conclusions: The results support the feasibility of combining deep-learning-based blood segmentation with temporal and optical-flow-based descriptors for exploratory image-level characterization of visible blood-region behavior in laparoscopic video. Within the Robin Heart development pathway, such descriptors may in the future serve as candidate components of image-analysis support modules for safety-oriented teleoperative scenarios. At this stage, they should be interpreted as exploratory image-derived indicators rather than clinically validated markers of bleeding severity.

1. Introduction

Robot-assisted surgery is evolving from the classical master-slave paradigm toward systems that are expected not only to execute motion with high precision, but also to interpret the operative scene, preserve basic situation awareness, and support safety under communication disturbances [1]. In teleoperated surgery, the quality and continuity of communication between the surgeon and the robotic platform become key determinants of procedural safety [2]. Under conditions of transmission delay, degraded image quality, packet loss, or temporary interruption of contact, mechanical precision alone is no longer sufficient [3].
This broader perspective defines the development logic of the Robin Heart project at the Foundation for Cardiac Surgery Development in Zabrze, Poland (Figure 1) [4]. Robin Heart should not be regarded merely as a single surgical robot, but rather as a broader research and development platform encompassing teleoperation, safety, operative planning, feedback systems, and the gradual development of partially autonomous functions. Within this program, multiple robot designs and prototypes have been created for experimental and research purposes. This progression - from systems designed primarily to study motion and control to platforms approaching more practical applications - creates a natural basis for the next stage, namely the integration of artificial intelligence with safety-oriented teleoperation [5]. Rather than being viewed solely as a motion-execution system, Robin Heart is increasingly being developed as a platform in which robotic control, image interpretation, and safety-oriented support may ultimately form an integrated whole.
The central idea of this stage of development is straightforward: if a robot is expected to operate with greater independence, it cannot merely perform motion but must also interpret the operative field [6]. This requires combining adequate technical capability with a cognitive layer able to recognize relevant events in the image and interpret changes occurring within the surgical scene [7]. In teleoperative settings, this becomes particularly important, because under degraded communication the robot should still be capable of maintaining at least a structured representation of the operative situation and supporting predefined safety priorities. In this context, image-based recognition of safety-relevant intraoperative events may be regarded as an important prerequisite for future supervisory functions intended to operate under degraded teleoperative conditions.
One of the most important intraoperative phenomena in this context is bleeding [8]. In laparoscopic and robot-assisted surgery, bleeding may rapidly impair operative-field readability, hinder orientation within the scene, and increase procedural risk, especially when remote control is temporarily degraded [9]. From this perspective, it becomes important not only to detect the presence of blood, but also to determine how the visible blood region behaves over time - whether it expands rapidly, remains localized, progresses gradually, or exhibits irregular temporal variability.
At the same time, the visual interpretation of bleeding in laparoscopic video is inherently difficult [10]. The appearance of blood depends on anatomical context, rate of outflow, illumination conditions, camera position, and tissue arrangement. Additional challenges arise from instrument motion, partial occlusions, specular reflections, and surgical smoke (Figure 2) [11]. As a result, image changes do not always reflect true bleeding progression alone but may also result from temporary visibility loss or scene changes. For this reason, frame-by-frame segmentation alone does not provide a sufficient basis for higher-level interpretation of the operative situation.
In recent years, deep-learning-based methods have become increasingly important in medical image analysis, including segmentation tasks in minimally invasive surgical video [12,13]. In the case of bleeding, segmentation can provide information not only about the presence of blood, but also about its spatial extent and distribution within the operative field [14]. However, moving from detection toward a more structured description of the phenomenon requires extending segmentation with temporal descriptors and auxiliary motion-sensitive measures that characterize how the segmented blood region changes over time [15].
Against this background, the aim of the present study was to assess whether a deep learning model for blood segmentation could provide masks of sufficient quality to support a preliminary image-level temporal characterization of visible blood-region behavior in robot-assisted laparoscopic video. Accordingly, the main contribution of this work does not lie in proposing a novel segmentation architecture, but in evaluating whether segmentation-derived masks can be transformed into descriptors of extent, field occupancy, rate of change, short-term consistency, and local motion activity that may in the future contribute to image-analysis support modules within the Robin Heart framework. At this stage, the study should be regarded as an exploratory proof of concept rather than a clinically validated framework.

2. Materials and Methods

2.1. Dataset and Data Preparation

The data used in this study reflected the consecutive stages of model development and evaluation. First, the model was trained on a laboratory dataset prepared under controlled conditions. Next, it was fine-tuned on real operative material. Finally, its behavior was additionally assessed on external gynecologic laparoscopic video episodes derived from the public GynSurg dataset in order to examine how the model performed on recordings obtained under different visual conditions.
In the first stage, a simulated bleeding dataset was prepared under controlled experimental conditions using animal tissue and an artificial blood-like fluid. This dataset contained 300 annotated images, of which 250 were used for training and 50 for validation. The original images had a spatial resolution of 1920 × 1080 pixels.
In the second stage, a real operative dataset was prepared. All operative material used in the internal dataset was derived from a single robot-assisted laparoscopic hysterectomy procedure. This dataset contained 200 annotated operative frames, of which 150 were used for fine-tuning and 50 for validation. These images were used to adapt the model to the visual appearance of blood in real laparoscopic scenes. The original operative frames also had a spatial resolution of 1920 × 1080 pixels.
To reduce the risk of information leakage within the internal evaluation setting, the temporal-analysis stage on the operative material was performed on five bleeding sequences derived from video fragments that were not used for model training, fine-tuning, or internal validation. These sequences were treated as a separate application-oriented analysis stage rather than as a formal external validation cohort.
To broaden the evaluation beyond the internal material, an externally sourced annotated subset was prepared from the public GynSurg dataset [16], which contains gynecologic laparoscopic surgical videos intended for video-based surgical analysis. Fifteen bleeding-related episodes were selected. Each of them represented a different bleeding situation and originated from gynecologic laparoscopic procedures recorded under different visual conditions, including different camera views, illumination settings, surgical instruments, and tissue appearance. The detailed distribution of procedure types was not specified. From this material, 102 frames were manually annotated using the same binary class definition as in the internal dataset. The external subset was split at the episode level rather than at the frame level in order to reduce information leakage between adaptation and evaluation stages.
All annotations were prepared manually by the research team using CVAT software [17]. A single binary class (“blood”) was used throughout the study and included both clearly visible fresh blood and more diffuse blood-related staining when distinguishable from surrounding tissue. Specular reflections, smoke, and purely illumination-related artifacts were excluded unless accompanied by a clearly visible blood-related region. Ambiguous cases were resolved according to a predefined annotation protocol and discussed until consensus was reached, in order to maintain consistency across the dataset.

2.2. Model Architecture

A U-Net-based model [18] was used for binary segmentation of blood regions in laparoscopic images. The network takes an RGB image as input and produces a single output probability map indicating, for each pixel, the estimated likelihood of belonging to the blood class.
The architecture was configured with a base width of 64 channels, progressively increased across subsequent encoder levels. The network consisted of five encoder levels and five decoder levels. Each convolutional block included convolutional layers, Batch Normalization [19], and ReLU activation [20]. Downsampling was performed using max-pooling, and upsampling was implemented using transposed convolutions. Dropout with a rate of 0.3 was applied in the deeper layers to improve generalization under conditions of limited training data.
A single-output binary architecture was adopted because the study focused on segmentation of one class only, namely blood. U-Net was selected because of its established role in medical image segmentation, architectural transparency, and suitability for feasibility-oriented experiments on a relatively limited annotated dataset.
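For illustration, a minimal PyTorch sketch of such a configuration is given below. The class and variable names are hypothetical, the per-level widths are assumed to double at each level as is conventional for U-Net, and the five levels are interpreted as four pooling/upsampling transitions between five feature resolutions; the actual implementation may differ in detail.
```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by Batch Normalization and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class BloodUNet(nn.Module):
    """U-Net with base width 64, max-pool downsampling, transposed-convolution
    upsampling, and dropout 0.3 applied at the deepest level."""
    def __init__(self, base=64, dropout=0.3):
        super().__init__()
        widths = [base * 2 ** i for i in range(5)]   # 64, 128, 256, 512, 1024
        self.encoders = nn.ModuleList()
        in_ch = 3
        for w in widths:
            self.encoders.append(DoubleConv(in_ch, w))
            in_ch = w
        self.pool = nn.MaxPool2d(2)
        self.drop = nn.Dropout2d(dropout)
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(widths[i], widths[i - 1], 2, stride=2)
             for i in range(4, 0, -1)])
        self.decoders = nn.ModuleList(
            [DoubleConv(widths[i - 1] * 2, widths[i - 1])
             for i in range(4, 0, -1)])
        self.head = nn.Conv2d(base, 1, 1)            # single blood-class output

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:
                skips.append(x)
                x = self.pool(x)
        x = self.drop(x)                             # regularize deepest features
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([skip, up(x)], dim=1))
        return torch.sigmoid(self.head(x))           # per-pixel blood probability
```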

2.3. Training Procedure

Model training was conducted in three consecutive stages. In the first stage, the network was trained on the simulated bleeding dataset prepared under controlled laboratory conditions. In the second stage, the model was fine-tuned on the operative dataset in order to adapt it to the visual appearance of blood in real laparoscopic scenes. In the third stage, an additional limited fine-tuning step was performed on the training split of the annotated GynSurg subset in order to assess model adaptation to external material acquired under different visual conditions.
In all stages, input images were resized to 512 × 512 pixels before training and inference. The optimization objective was defined as the sum of Binary Cross-Entropy (BCE) loss and Dice loss [21,22]. This combination was selected to account simultaneously for pixel-wise classification correctness and overlap quality, which is particularly important in the presence of class imbalance [22]. Training was performed using the AdamW optimizer [23] with an initial learning rate of 3 × 10⁻⁴. The model was trained with a batch size of 1. In the first and second stages, training was performed for 20 epochs. In the third stage, the same training setup was retained for the limited external adaptation step. Accordingly, the GynSurg experiment should be interpreted as an external-domain adaptation and held-out evaluation setting rather than as a fully independent external validation. No additional learning-rate scheduler was used. Online data augmentation included horizontal and vertical flips, small rotations, brightness and contrast modifications, and blur. Mixed precision and gradient norm clipping to 1.0 were applied to improve training stability.
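A minimal sketch of the combined optimization objective is shown below, assuming sigmoid outputs and a unit smoothing constant in the soft Dice term; the class name and the smoothing value are illustrative rather than taken from the study's actual code.
```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Sum of Binary Cross-Entropy and soft Dice loss on sigmoid outputs."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.bce = nn.BCELoss()
        self.smooth = smooth

    def forward(self, probs, target):
        bce = self.bce(probs, target)
        inter = (probs * target).sum()
        dice = (2.0 * inter + self.smooth) / (probs.sum() + target.sum() + self.smooth)
        return bce + (1.0 - dice)

# Illustrative usage on stand-in tensors at the reported input resolution
criterion = BCEDiceLoss()
probs = torch.rand(1, 1, 512, 512)                   # stand-in network output
target = (torch.rand(1, 1, 512, 512) > 0.5).float()  # stand-in reference mask
loss = criterion(probs, target)
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # reported settings
```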
Training and fine-tuning were carried out in Python using the PyTorch library [24] on an NVIDIA RTX 5090 GPU. In each stage, the best checkpoint was selected on the basis of the highest Dice coefficient obtained on the corresponding validation split.
During inference, binary masks were obtained by thresholding the output probability map produced by the network. For the internal and external operative evaluation, a default threshold of 0.5 was used. No additional morphological postprocessing was applied.
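For completeness, inference-time binarization reduces to a single threshold comparison, as in the following sketch; the function name is hypothetical.
```python
import numpy as np

def binarize(prob_map, threshold=0.5):
    """Convert the network's probability map into a binary blood mask.
    No morphological postprocessing is applied afterwards."""
    return (prob_map > threshold).astype(np.uint8)
```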
A formal k-fold cross-validation framework was not used. Instead, evaluation was based on predefined validation splits, on internal bleeding sequences excluded from training and validation, and on an additional external assessment performed on annotated GynSurg material representing heterogeneous gynecologic laparoscopic recordings. Importantly, the reported external evaluation was performed on previously unseen GynSurg episodes at the episode level, that is, on sequences not used during the limited external adaptation step.

2.4. Temporal Analysis of Bleeding

Temporal analysis was performed to assess whether the predicted blood masks could support image-level characterization of visible blood-region behavior over time. This stage was applied to all five withheld internal bleeding episodes extracted from the operative material and excluded from model training, fine-tuning, and validation. These episodes contained five bleeding events in total. In addition, temporal descriptors were computed for fifteen external bleeding episodes derived from the GynSurg material. The external episodes were selected to represent visually different bleeding situations recorded under heterogeneous scene conditions. For each sequence, the predicted blood masks were mapped back to the geometry of the original video frames. The analysis was then performed within a local region defined automatically from the segmented blood area and updated over time so that it followed the most relevant local blood-related image changes. This approach was used to provide a uniform analysis framework for both internal and external material.
For source-indication estimation, consecutive predicted blood masks were analyzed frame by frame in order to identify newly appearing blood pixels. These frame-to-frame growth regions were accumulated over a short temporal window of five frames, and the dominant cumulative-growth component was used to define the estimated source region and the local ROI for subsequent temporal analysis. In the present study, the source-indication region should be understood as an image-derived estimate of the dominant visible origin of blood-region expansion rather than as direct anatomical localization of the bleeding vessel (Figure 3).
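The following sketch illustrates one plausible implementation of this accumulation step using OpenCV connected-component analysis; the function name, the rolling-buffer scheme, and the bounding-box form of the returned ROI are assumptions rather than the study's exact code.
```python
from collections import deque
import cv2
import numpy as np

def estimate_source_roi(masks, window=5):
    """Accumulate newly appearing blood pixels over a short temporal window and
    return the bounding box of the dominant cumulative-growth component."""
    growth = deque(maxlen=window)                    # rolling window of growth maps
    prev = None
    for mask in masks:
        if prev is not None:
            # Pixels labeled blood now but not in the previous frame
            growth.append(((mask > 0) & (prev == 0)).astype(np.uint8))
        prev = mask
    if not growth:
        return None
    cumulative = np.clip(sum(growth), 0, 1).astype(np.uint8)
    n, _, stats, _ = cv2.connectedComponentsWithStats(cumulative)
    if n < 2:                                        # no growth component found
        return None
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    x, y, w, h, _ = stats[largest]
    return int(x), int(y), int(w), int(h)            # ROI for temporal analysis
```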
To reduce the influence of camera motion, frame-to-frame camera-motion compensation was applied before motion analysis using the ECC image-alignment method in OpenCV [25]. In the present configuration, a Euclidean motion model was used. After compensation, dense optical flow was computed using the Farnebäck method [26] within the automatically updated blood-related ROI. Directional consistency was quantified as the norm of the mean normalized flow vector within the ROI, with higher values indicating greater directional agreement of local motion vectors. The resulting flow-based descriptors were interpreted as supplementary image-derived measures of local motion activity within the segmented blood ROI rather than as direct estimates of true blood-flow velocity. Optical flow and frame-to-frame alignment were implemented using standard OpenCV routines in the present configuration.
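A condensed sketch of this processing chain, built from the standard OpenCV routines named above, is shown below. The Farnebäck parameter values, the ECC convergence criteria, and the fallback behavior when alignment fails are illustrative assumptions, not reported settings.
```python
import cv2
import numpy as np

def flow_descriptors(prev_gray, curr_gray, roi):
    """ECC camera-motion compensation (Euclidean model) followed by Farnebäck
    dense optical flow and simple motion descriptors inside the blood ROI."""
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
    try:
        # Align the previous frame to the current one (template = current frame)
        _, warp = cv2.findTransformECC(curr_gray, prev_gray, warp,
                                       cv2.MOTION_EUCLIDEAN, criteria)
    except cv2.error:
        pass                                          # keep identity if ECC fails
    h, w = curr_gray.shape
    prev_aligned = cv2.warpAffine(prev_gray, warp, (w, h),
                                  flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    flow = cv2.calcOpticalFlowFarneback(prev_aligned, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    x, y, rw, rh = roi
    fx, fy = flow[y:y + rh, x:x + rw, 0], flow[y:y + rh, x:x + rw, 1]
    mag = np.hypot(fx, fy)
    # Directional consistency: norm of the mean normalized flow vector
    ux, uy = fx / (mag + 1e-6), fy / (mag + 1e-6)
    return {"mean_flow": float(mag.mean()),
            "p95_flow": float(np.percentile(mag, 95)),
            "direction_consistency": float(np.hypot(ux.mean(), uy.mean()))}
```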
All temporal parameters were intended to describe visible image-level blood-region behavior rather than clinical bleeding severity.

2.5. Evaluation Metrics

Segmentation performance was evaluated using the Dice coefficient and Intersection over Union (IoU), calculated with respect to manually prepared reference masks. Training and validation loss values were also monitored during model development.
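For reference, both overlap metrics can be computed directly from binary masks, as in the following sketch; the convention for empty masks is an assumption.
```python
import numpy as np

def dice_iou(pred, ref):
    """Dice coefficient and IoU between binary predicted and reference masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    total = pred.sum() + ref.sum()
    dice = 2.0 * inter / total if total else 1.0      # both masks empty -> 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)
```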
For the temporal-analysis stage, both mask-derived and motion-derived descriptors were calculated. The mask-derived descriptors included maximum visible blood area, occupancy, the rate of visible blood-area change (dA/dt), the mean absolute frame-to-frame area change, and temporal IoU between consecutive predicted masks. For selected sequences, the times required to reach predefined occupancy thresholds were also reported.
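A sketch of the mask-derived descriptor computation is given below; the exact normalization of the mean absolute frame-to-frame area change ratio and the handling of degenerate cases are assumptions.
```python
import numpy as np

def temporal_descriptors(masks, fps, frame_area):
    """Mask-derived temporal descriptors for a sequence of binary blood masks."""
    areas = np.array([m.sum() for m in masks], dtype=float)
    dA = np.diff(areas) * fps                          # visible-area change, px/s
    occupancy = areas / frame_area                     # fraction of field occupied
    t_iou = []                                         # IoU between consecutive masks
    for a, b in zip(masks[:-1], masks[1:]):
        union = np.logical_or(a, b).sum()
        t_iou.append(np.logical_and(a, b).sum() / union if union else 1.0)
    change_ratio = np.abs(np.diff(areas)) / np.maximum(areas[:-1], 1.0)
    return {"max_area_px": float(areas.max()),
            "max_occupancy": float(occupancy.max()),
            "peak_dA_dt_px_s": float(dA.max()) if dA.size else 0.0,
            "mean_abs_change_ratio": float(change_ratio.mean()) if change_ratio.size else 0.0,
            "mean_temporal_iou": float(np.mean(t_iou)) if t_iou else 1.0}
```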
In addition, auxiliary motion-sensitive descriptors were computed from optical flow after camera-motion compensation. These included mean flow magnitude, the 95th percentile of flow magnitude, and direction-consistency-related measures.
All temporal metrics were treated as image-derived descriptors rather than clinically validated severity markers.

3. Results

3.1. Segmentation Performance

In the first stage, the model was evaluated on the simulated laboratory validation set, where it achieved a Dice coefficient of 0.94 and an IoU of 0.89. These results indicated good pixel-level segmentation performance under controlled visual conditions.
In the second stage, after fine-tuning on the operative material, the model achieved a Dice coefficient of 0.907 and an IoU of 0.830 on the internal operative validation set, indicating good agreement between predicted masks and manually annotated reference masks in real laparoscopic scenes.
To assess transferability beyond the internal operative material, the model was additionally evaluated on a held-out annotated subset derived from the external GynSurg source after a limited adaptation step. In this setting, the model achieved a Dice coefficient of 0.764 and an IoU of 0.626. Although the external performance was lower than that observed on the internal operative validation set, the results still indicated that the model retained useful segmentation capability on visually different gynecologic laparoscopic material.
Taken together, these results reflect the staged development of the model: strong performance under controlled laboratory conditions, good adaptation to real operative frames, and retained useful segmentation capability on external material acquired under heterogeneous visual conditions.
Qualitative inspection showed that the model was able to delineate both clearly visible fresh blood regions and more diffuse, lower-contrast blood-related staining. At the same time, local imperfections in segmentation were still observed in more challenging frames, especially under conditions of partial occlusion, reduced visibility, smoke, or rapid scene change. These observations are consistent with the intended role of the segmentation stage in the present study, namely to provide masks of sufficient quality for downstream temporal characterization rather than to serve as the sole endpoint of evaluation.
Figure 4. Overview of the bleeding segmentation pipeline. (a) Original laparoscopic input frame. (b) U-Net-based model used for binary blood segmentation. (c) Predicted segmentation mask. (d) Overlay of the predicted blood region on the original image. The highlighted overlay indicates pixels classified as blood by the model. Schematic prepared by the authors using original study materials.

3.2. Temporal Analysis of Bleeding Dynamics

To better describe the temporal behavior of the segmented blood region, additional analysis was performed using mask-derived descriptors supplemented with optical-flow-based motion descriptors. Temporal descriptors were computed for all five withheld internal episodes and for fifteen external bleeding episodes. For clarity, the text below discusses two examples from each dataset chosen to illustrate distinct image-level progression patterns, whereas Table 1 presents selected representative cases rather than the full temporal-analysis cohort. The cases included in Table 1 were selected to represent clearly distinguishable image-level progression profiles across the internal and external material, including more dynamic and unstable behavior, more spatially coherent progression, localized burst-like expansion, and a visually static reference case. In the external material, the table includes four progressive bleeding episodes together with one visually static reference case used for comparison. The terms used here describe qualitative image-level progression patterns and should not be interpreted as clinically validated categories.
In the internal dataset, the more dynamic and unstable example was represented by int-5. This sequence showed relatively low temporal stability (mean temporal IoU = 0.789), very high average frame-to-frame variability (mean absolute frame-to-frame area change ratio = 0.258), and elevated motion-related activity (mean flow magnitude = 3.583, p95 flow magnitude = 26.291). Together, these results indicate that the episode was not only associated with visible blood-region growth (peak dA/dt = 3.87 × 10^6 px/s), but also remained clearly unstable and temporally active. This profile is therefore consistent with a more dynamic and unstable pattern of progression.
In contrast, a more spatially coherent internal example was represented by int-4. In this case, temporal stability was higher (mean temporal IoU = 0.882) and average variability was substantially lower (mean absolute frame-to-frame area change ratio = 0.055), while motion-related activity remained present (mean flow magnitude = 2.526, p95 flow magnitude = 23.957). This indicates that, despite ongoing progression and visible blood-area expansion (peak dA/dt = 5.05 × 10^6 px/s), the sequence remained more spatially coherent and temporally organized than int-5.
A similar contrast was observed in the external dataset. The more dynamic and unstable external example was ext-1, which showed low temporal stability (mean temporal IoU = 0.690), high variability (mean absolute frame-to-frame area change ratio = 0.179), and very high flow descriptors (mean flow magnitude = 7.213, p95 flow magnitude = 31.167). These values indicate a clearly dynamic episode in which the segmented blood region changed substantially over time and also exhibited strong local motion activity.
A more spatially coherent external profile was represented by ext-2. This sequence retained relatively high temporal stability (mean temporal IoU = 0.856) and low average variability (mean absolute frame-to-frame area change ratio = 0.045), while still showing greater motion-related activity than a visually static reference case (mean flow magnitude = 1.966, p95 flow magnitude = 23.032). This suggests a progressive episode in which the blood region remained active, but evolved in a more spatially coherent and less unstable manner than in ext-1.
Taken together, the results from both datasets show that more dynamic and unstable episodes tend to be associated with lower temporal IoU, higher mask variability, and increased motion-related activity, whereas more spatially coherent episodes retain greater temporal coherence and lower variability. At the same time, peak dA/dt alone was not sufficient to describe the overall course of an episode, because it primarily reflected the abruptness of instantaneous visible blood-area expansion rather than the general temporal stability or irregularity of the sequence. For this reason, the most informative description of temporal blood-region behavior was obtained by jointly considering descriptors of burst growth, mask stability, average variability, and local motion activity.

4. Discussion

The present study showed that deep-learning-based blood segmentation can serve not only as a tool for frame-wise blood detection, but also as a basis for temporal characterization of visible blood-region behavior in laparoscopic video. Importantly, the proposed framework should be interpreted as an exploratory proof of concept rather than a clinically validated bleeding-assessment system. In this setting, segmentation masks were treated not only as indicators of where blood was present in a given frame, but also as a basis for describing how the visible blood region evolved over time.
This distinction is relevant because the mere detection of blood does not indicate whether the observed episode is rapidly expanding, relatively stable, localized, or temporally irregular. In operative and teleoperative settings, the practical question is often not only whether blood is visible, but also whether the visible region is growing quickly, occupying an increasing portion of the operative field, or remaining spatially limited and relatively coherent. For this reason, temporal analysis may provide a more informative description than frame-wise segmentation alone.
The obtained results suggest that the analyzed descriptors reflect different aspects of segmented blood-region behavior. Peak dA/dt primarily captures the abruptness of visible blood-area expansion over a short period but does not by itself determine whether the episode remains stable or irregular across the full sequence. In contrast, mean temporal IoU reflects temporal mask stability, that is, the extent to which the segmented region preserves a similar shape and location across consecutive frames. Higher values therefore suggest a more spatially coherent course, whereas lower values indicate greater temporal instability. Mean absolute frame-to-frame area change ratio provides complementary information on the average degree of change between consecutive frames and can therefore be interpreted as a measure of overall temporal variability rather than a single burst event.
These descriptors should therefore not be regarded as interchangeable. A high peak dA/dt together with a high mean temporal IoU may indicate abrupt but spatially coherent expansion, whereas a lower mean temporal IoU combined with high average frame-to-frame variability suggests a more unstable and irregular temporal pattern. This behavior was visible in representative examples from both the internal and external datasets, confirming that different episodes may exhibit distinct temporal profiles rather than a single uniform pattern of change.
Additional value was provided by the optical-flow-based analysis. After camera-motion compensation and restriction of the analysis to the blood ROI, the flow-based descriptors provided supplementary information on local motion activity within the segmented region. These measures should not be interpreted as direct estimates of true blood-flow velocity, but rather as auxiliary image-derived descriptors indicating whether relevant local motion activity was present in the ROI. In practice, they proved useful in distinguishing visually static cases from progressive ones and in differentiating episodes that were more motion-active from those that remained more spatially coherent despite ongoing progression.
This is particularly important in laparoscopic video, where the visible appearance of blood is influenced not only by the bleeding event itself, but also by camera repositioning, instrument manipulation, partial occlusions, specular reflections, and surgical smoke. Under such conditions, frame-wise segmentation alone may be insufficient for a broader interpretation of the observed phenomenon. By combining segmentation with temporal and motion-sensitive descriptors, it becomes possible to move from simple blood detection toward a more structured image-level description of visible blood-region behavior over time.
An important strength of the proposed framework is that it allows distinction not only between more dynamic and unstable cases and more spatially coherent ones, but also between episodes with high instantaneous growth yet preserved spatial coherence and those showing stronger instability and motion activity. In this sense, the descriptor set may serve as a multidimensional description of temporal blood-region behavior rather than as a single maximal-growth indicator.
Although the proposed descriptors are not clinically validated severity markers, they were qualitatively anchored to visually distinct image-level progression patterns observed in the analyzed sequences. These included more dynamic and unstable courses, more spatially coherent progression, localized burst-like expansion, and visually static reference behavior. Accordingly, the descriptor values were interpreted in the context of observable temporal and spatial characteristics of the segmented blood region rather than as abstract numerical outputs alone.
From a practical perspective, such a framework may be relevant for future image-analysis support systems in robotic surgery and teleoperation. In remote-control settings, it may be important not only to detect the presence of blood, but also to assess whether the visible blood region is expanding rapidly, remaining localized, occupying an increasing portion of the operative field, or showing rising temporal activity. Structured temporal descriptors may therefore provide more informative support than frame-wise blood detection alone.
Several limitations should be acknowledged. The internal operative dataset was limited in size and originated from a single robot-assisted laparoscopic hysterectomy procedure, which restricts generalizability. The external GynSurg experiment included a limited adaptation step and should therefore not be interpreted as fully independent external validation, although the reported external evaluation was performed on previously unseen episodes at the episode level. In addition, the proposed temporal descriptors are exploratory rather than clinically anchored measures. Finally, inter-annotator agreement was not formally quantified, although annotation consistency was supported by predefined labeling rules and consensus-based discussion of difficult cases.
Future work should focus on extending the analysis to a larger number of cases, further assessing descriptor robustness under different visual conditions, and linking the obtained parameters more directly to expert interpretation and decision relevance. In the longer term, such a framework may support the development of image-analysis modules for robotic and teleoperative systems, including safety-oriented monitoring and other carefully constrained image-analysis support functions under degraded communication conditions.

5. Conclusions

The present study showed that a deep-learning-based model can effectively segment visible blood regions in laparoscopic images, including under visually challenging operative conditions. The obtained results further indicate that segmentation masks may serve not only for frame-wise blood detection, but also as a basis for a more structured temporal characterization of visible blood-region behavior.
By extending segmentation with temporal and motion-sensitive descriptors, the proposed framework enabled a richer image-level description of bleeding-related visual change than a peak-only approach alone. The results showed that different episodes may present distinct combinations of abrupt growth, temporal stability, average variability, and local motion activity. In this sense, the combined descriptor set made it possible to distinguish more dynamic and unstable progression patterns from more spatially coherent ones.
At this stage, the proposed descriptors should be understood as exploratory image-derived indicators rather than as clinically validated severity markers or decision-ready measures. Overall, the findings support the feasibility of combining deep-learning-based blood segmentation with temporal and optical-flow-based descriptors for exploratory characterization of visible blood-region behavior in minimally invasive and robotic surgery. Under strictly controlled conditions, such descriptors may in the future support image-analysis modules in robotic and teleoperative settings.

Author Contributions

Conceptualization, K.S., D.K., Z.N.; methodology, K.S., D.K.; software, K.S.; validation, K.S., D.K.; formal analysis, K.S.; investigation, K.S.; data curation, K.S.; writing - original draft preparation, K.S.; writing - review and editing, K.S., D.K., Z.N.; visualization, K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were not required for this study according to institutional policy, because it was based on retrospective, fully anonymized surgical video data, did not affect patient management, and did not involve direct interaction with patients.

Data Availability Statement

The data presented in this study are not publicly available due to patient privacy and ethical restrictions.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, San Francisco, CA, USA) as a language-support tool for improving clarity, readability, and stylistic consistency, and for assisting in the generation of initial graphical elements (Figure 3). The authors reviewed, revised, and edited all content and take full responsibility for the integrity of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ghezzi, T.L.; Corleta, O.C. 30 Years of Robotic Surgery. World J. Surg. 2016, 40, 2550–2557. [CrossRef]
  2. King, A.B.; Fowler, G.E.; Macefield, R.C.; Walker, H.; Thomas, C.; Markar, S.; Higgins, E.; Blazeby, J.M.; Blencowe, N.S. Use of artificial intelligence in the analysis of digital videos of invasive surgical procedures: Scoping review. BJS Open 2025, 9, zraf073. [CrossRef]
  3. Nawrat, Z. MIS AI – the artificial intelligence application in minimally invasive surgery. Mini-invasive Surg. 2020, 4, 28. [CrossRef]
  4. Nawrat, Z.; Kostka, P.; Polański, A.; Rohr, K.; Sadowski, W.; Krzysztofik, K. Polish cardio-robot “Robin Heart”. System description and technical evaluation. Int. J. Med. Robot. Comput. Assist. Surg. 2006, 2, 36–44.
  5. Nawrat, Z.; Mucha, Ł.; Krawczyk, D.; Lis, K.; Lehrich, K.; Rohr, K.; Földesy, P.; Radó, J.; Dücső, C.; Sántha, H.; Szebényi, G.; Fürjes, P. Robin Heart INCITE surgical tele manipulator controlled by system equipped with 3D force sensor. Med. Robot. Rep. 2017, 6, 37–46.
  6. Nawrat, Z. Introduction to AI-driven surgical robots. Artif. Intell. Surg. 2023, 3, 90–97. [CrossRef]
  7. Arakaki, S.; Takenaka, S.; Sasaki, K.; Kitaguchi, D.; Hasegawa, H.; Takeshita, N.; Takatsuki, M.; Ito, M. Artificial Intelligence in Minimally Invasive Surgery: Current State and Future Challenges. JMA J. 2025, 8, 86–90. [CrossRef]
  8. Hua, S.; Gao, J.; Wang, Z.; Yeerkenbieke, P.; Li, J.; Wang, J.; He, G.; Jiang, J.; Lu, Y.; Yu, Q.; Han, X.; Liao, Q.; Wu, W. Automatic bleeding detection in laparoscopic surgery based on a faster region-based convolutional neural network. Ann. Transl. Med. 2022, 10, 546. [CrossRef]
  9. Sunakawa, T.; Kitaguchi, D.; Kobayashi, S.; Aoki, K.; Kujiraoka, M.; Sasaki, K.; Azuma, L.; Yamada, A.; Kudo, M.; Sugimoto, M.; Hasegawa, H.; Takeshita, N.; Gotohda, N.; Ito, M. Deep learning-based automatic bleeding recognition during liver resection in laparoscopic hepatectomy. Surg. Endosc. 2024, 38, 7656–7662. [CrossRef]
  10. Okamoto, T.; Ohnishi, T.; Kawahira, H.; Dergachyava, O.; Jannin, P.; Haneishi, H. Real-time identification of blood regions for hemostasis support in laparoscopic surgery. Signal Image Video Process. 2019, 13, 405–412. [CrossRef]
  11. Grammatikopoulou, M.; Sanchez-Matilla, R.; Bragman, F.; Owen, D.; Culshaw, L.; Kerr, K.; Stoyanov, D.; Luengo, I. A spatio-temporal network for video semantic segmentation in surgical videos. Int. J. Comput. Assist. Radiol. Surg. 2024, 19, 375–382. [CrossRef]
  12. Gao, Y.; Jiang, Y.; Peng, Y.; Yuan, F.; Zhang, X.; Wang, J. Medical Image Segmentation: A Comprehensive Review of Deep Learning-Based Methods. Tomography 2025, 11, 52. [CrossRef]
  13. Kamtam, D.N.; Shrager, J.B.; Malla, S.D.; Lin, N.; Cardona, J.J.; Kim, J.J.; Hu, C. Deep learning approaches to surgical video segmentation and object detection: A scoping review. Comput. Biol. Med. 2025, 194, 110482. [CrossRef]
  14. Sibilano, E.; Delprete, C.; Marvulli, P.M.; Brunetti, A.; Marino, F.; Lucarelli, G.; Battaglia, M.; Bevilacqua, V. Deep learning strategies for semantic segmentation in robot-assisted radical prostatectomy. Appl. Sci. 2025, 15, 10665. [CrossRef]
  15. Caballero, D.; Sánchez-Margallo, J.A.; Pérez-Salazar, M.J.; Sánchez-Margallo, F.M. Applications of Artificial Intelligence in Minimally Invasive Surgery Training: A Scoping Review. Surgeries 2025, 6, 7. [CrossRef]
  16. Nasirihaghighi, S.; Ghamsarian, N.; Peschek, L.; Munari, M.; Husslein, H.; Sznitman, R.; Schoeffmann, K. GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset. arXiv 2025, arXiv:2506.11356.
  17. Computer Vision Annotation Tool (CVAT). Available online: https://cvat.ai/ (accessed on 7 April 2026).
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [CrossRef]
  19. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F.; Blei, D., Eds.; PMLR: Lille, France, 2015; Volume 37, pp. 448–456.
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; IEEE: Santiago, Chile, 2015; pp. 1026–1034. [CrossRef]
  21. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the Fourth International Conference on 3D Vision (3DV 2016), Stanford, CA, USA, 25–28 October 2016; IEEE: Stanford, CA, USA, 2016; pp. 565–571. [CrossRef]
  22. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M.J. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Wang, Q., Shi, Y., Suk, H.-I., Suzuki, K., Eds.; Springer: Cham, Switzerland, 2017; Volume 10553, pp. 240–248. [CrossRef]
  23. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [CrossRef]
  24. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Köpf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; Chintala, S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019); Wallach, H.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.
  25. Evangelidis, G.D.; Psarakis, E.Z. Parametric Image Alignment Using Enhanced Correlation Coefficient Maximization. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1858–1865. [CrossRef]
  26. Farnebäck, G. Two-Frame Motion Estimation Based on Polynomial Expansion. In Image Analysis; Bigun, J.; Gustavsson, T., Eds.; Springer: Berlin, Heidelberg, Germany, 2003; pp. 363–370.
Figure 1. Selected systems developed within the Robin Heart platform. From left to right: (a) Robin Heart Tele and (b) Robin Heart PVA during tests at the FRK Robotics Laboratory in Zabrze (2015). (c) Robin Heart mc2 during experimental operations at the Center for Experimental Medicine of the Medical University of Silesia in Katowice (2009). Photographs from the authors’ archive.
Figure 2. Example of visual challenges in laparoscopic scenes, including surgical instruments, partial occlusions, and surgical plume generated during electrocoagulation. Image from the authors’ internal operative dataset.
Figure 3. Schematic illustration of image-based source-indication estimation from segmented blood masks. Consecutive laparoscopic frames were first processed by the blood-segmentation model. Newly appearing blood pixels were then identified between successive masks and accumulated over a short temporal window. The dominant cumulative-growth component was used to define the estimated source region and the local ROI for subsequent temporal analysis. Figure was prepared as an illustrative schematic using AI-assisted generation of initial visual elements, followed by manual refinement and composition.
Table 1. Representative examples of temporal and optical-flow-based descriptors for segmented blood-region behavior.
Sequence | Source | Pattern | Duration [s] | Peak dA/dt [px/s] | Mean temporal IoU | Mean abs. change ratio | Mean flow magnitude | P95 flow magnitude
int-1 | internal | burst-like coherent progression | 2.47 | 5.01 × 10^6 | 0.909 | 0.036 | 1.978 | 20.590
int-2 | internal | motion-active progression | 20.47 | 7.88 × 10^6 | 0.849 | 0.068 | 2.916 | 27.537
int-3 | internal | burst-like coherent progression | 20.00 | 1.62 × 10^7 | 0.900 | 0.044 | 2.514 | 27.592
int-4 | internal | more spatially coherent progression | 17.87 | 5.05 × 10^6 | 0.882 | 0.055 | 2.526 | 23.957
int-5 | internal | dynamic and unstable progression | 31.50 | 3.87 × 10^6 | 0.789 | 0.258 | 3.583 | 26.291
ext-static | external | static reference | 4.70 | 1.83 × 10^6 | 0.894 | 0.065 | 0.006 | 0.046
ext-1 | external | dynamic and unstable progression | 6.07 | 1.57 × 10^6 | 0.690 | 0.179 | 7.213 | 31.167
ext-2 | external | more spatially coherent progression | 4.83 | 5.23 × 10^5 | 0.856 | 0.045 | 1.966 | 23.032
ext-3 | external | motion-active despite relative stability | 6.97 | 3.46 × 10^6 | 0.865 | 0.04 | 3.091 | 29.703
ext-4 | external | localized burst | 1.53 | 4.01 × 10^6 | 0.816 | 0.078 | 1.250 | 9.051