A Unified Multi-Task Vision Transformer for Interpretable Ovarian Tumor Analysis

Abdussamad Abdullahi Musa; David Emmanuel; Adeeb Alchaikh Hassan; Anil Fernando

doi:10.20944/preprints202606.0575.v1

Submitted: 06 June 2026

Posted: 08 June 2026


Abstract

Ovarian cancer remains a leading cause of gynaecological cancer mortality, and ultrasound-based deep learning systems for its diagnosis are typically built as separate post-hoc processes for classification, segmentation, and interpretability, which introduces workflow inefficiencies and may produce inconsistent predictions. This work addresses that limitation. We propose UM-TOTA (Unified Multi-Task Ovarian Tumor Architecture), a Vision Transformer (ViT)-based architecture that performs eight-class tumor classification, three-class malignancy detection, tumor segmentation, and clinical concept interpretability within a single unified framework. We integrate a concept bottleneck guided by the IOTA and O-RADS clinical guidelines to enable transparent decision-making through medical concepts that clinicians can understand, and we employ combined adaptive t-vMF Dice and boundary-enhanced segmentation losses with progressive task weighting to stabilise multi-task optimisation. We evaluated the model on the Multi-Modality Ovarian Tumor Ultrasound (MMOTU) 2D dataset using 5-fold stratified cross-validation. UM-TOTA achieved 80.26% ± 1.10% accuracy (97.06% one-vs-rest macro specificity) for eight-class classification, 90.88% ± 1.14% accuracy (90.41% specificity) for malignancy detection, and 77.29% ± 1.29% Dice for segmentation, while reducing the computational parameter load by approximately 66.7% relative to sequential single-task pipelines. The learned concepts aligned with established malignancy criteria, identifying vascularization, solid components, and papillary projections as key predictors. This unified approach offers an efficient and interpretable framework for clinical ovarian ultrasound workflows.

Keywords: ovarian cancer; multi-task learning; vision transformer; explainable AI; concept bottleneck models; medical image analysis

Subject: Computer Science and Mathematics - Computer Vision and Graphics

1. Introduction

Ovarian cancer (OC) is a prevalent gynaecological malignancy that causes many deaths annually. According to the World Cancer Research Fund [1], 324,603 cases and 206,965 deaths were recorded in 2022. Despite ongoing research efforts, early detection of OC remains challenging, with only 25% of cases diagnosed at an early stage. However, patients diagnosed early (Stage 1) have a five-year survival rate of 95%, whereas those diagnosed late (Stage 4) have a five-year survival rate of 15%. This difference in survival rates suggests that accurate early detection can play a significant role in improving patient outcomes.

Clinicians commonly use two screening approaches to diagnose OC. The first is a blood test that measures the level of the CA125 protein, a biomarker frequently elevated in OC patients. However, this approach lacks specificity because various benign conditions can also elevate CA125 concentrations. The second is ultrasound imaging, which is non-invasive and provides real-time diagnostic images [2,3]. However, interpreting ovarian ultrasound images can be challenging, as it often depends on the operator’s skill and experience. This variability in interpretation can lead to misdiagnoses, which highlights the need for standardised, reproducible diagnostic tools that are objective, reliable, and interpretable to the clinician [4].

The advent of deep learning has advanced medical image analysis by automating and improving diagnostic tasks such as image classification and segmentation of the region of interest [5,6,7,8]. Convolutional Neural Networks (CNNs) have shown strong performance in categorizing and delineating anomalies in ovarian medical images [9,10,11,12,13]. More recently, Vision Transformers (ViTs) have gained attention as promising alternatives, especially for their ability to capture long-range dependencies and global contextual information, which are crucial for comprehensive understanding of ovarian medical images [14,15,16,17]. Our preliminary single-task study comparing ViT and CNN backbones on ovarian ultrasound informs the backbone selection adopted in this work [18].

Although deep learning models have achieved strong predictive performance, their ‘black-box’ nature limits their acceptance in clinical settings, especially where decisions carry high risk. Clinicians need not only accurate predictions but also a clear understanding of the reasoning behind them, so that they can reconcile those predictions with medical knowledge and practice. This has led to growing interest in Explainable Artificial Intelligence (XAI) [19,20]. One promising approach is the use of Concept Bottleneck Models (CBMs) [21,22], which introduce human-interpretable concepts as an intermediate step in the decision-making process. Incorporating interpretability directly into the model is important for building trust and supporting clinical use, rather than relying on post-hoc saliency-based methods such as Grad-CAM [23] and attention visualization, which fail to provide clinically relevant explanations that align with medical knowledge [11,24].

Despite the advancements in deep learning for ovarian ultrasound [11,13,17], multi-modal learning [12], and XAI [11], a notable research gap still exists. Most studies focus on single-task approaches such as classification or segmentation, or employ XAI as a post-hoc process. No prior research has attempted to integrate concept-based interpretability directly into a multi-task framework using vision transformers for comprehensive ovarian ultrasound image analysis.

To address this gap, we propose a Vision Transformer-based multi-task deep learning pipeline for comprehensive ovarian ultrasound analysis. Unlike previous approaches that treat classification, segmentation, and interpretability as separate post-hoc processes, our framework simultaneously performs 8-class ovarian tumor classification, 3-class malignancy detection (Normal, Benign, Malignant), precise ovarian tumor segmentation, and medical concept interpretability guided by established clinical criteria.

Our architectural design reflects how radiologists actually work when they analyze ovarian ultrasound images. Each computational task maps to a specific clinical step: classification provides the initial tumor categorization, malignancy detection enables risk stratification, segmentation supports surgical planning, and concept prediction gives diagnostic justification. By integrating these tasks into a single model rather than handling them separately, the system ensures that risk assessment and morphological analysis both come from the same underlying feature representations. This helps maintain consistency across predictions, which is similar to how radiologists consider multiple characteristics together rather than in isolation.

Contributions

UM-TOTA makes three distinct technical contributions to the field of ultrasound image analysis of ovarian cancer:

We introduce UM-TOTA, the first ViT-based architecture for ovarian ultrasound that integrates concept-based interpretability directly into the multi-task learning framework, rather than applying it post-hoc. All predictions pass through a clinical concept bottleneck whose targets are informed by the International Ovarian Tumor Analysis (IOTA) and Ovarian-Adnexal Reporting and Data System (O-RADS) criteria, ensuring that model explanations reflect the actual decision process rather than approximating it after the fact.
We develop a knowledge-guided concept supervision framework that maps ground-truth tumor labels to IOTA- and O-RADS-defined clinical concept targets, going beyond generic concept bottleneck implementations to embed domain-specific diagnostic knowledge directly into model training. This enables clinicians to verify model reasoning using criteria that they already apply in practice.
We demonstrate that a unified four-task architecture performing classification, malignancy detection, segmentation, and concept-based interpretability in a single forward pass achieves a 66.7% parameter reduction and 65.5% inference latency reduction over sequential single-task deployment, with concrete hardware measurements provided, while maintaining comparable diagnostic accuracy to specialised single-task models.

The remainder of this paper is organised as follows. Section 2 reviews related work. Section 3 details our proposed UM-TOTA methodology. Section 4 presents our detailed results and discussion. Section 5 concludes with a summary of contributions and clinical impact.

2. Related Works

2.1. Deep Learning Architectures and Multi-Task Strategies in Ovarian Cancer

Deep learning has shown strong capabilities in medical imaging tasks, with CNNs emerging as the dominant approach for automated diagnosis [25]. Early deep learning work on OC diagnosis by Wu et al. [26] explored classification methods for ultrasound images, demonstrating improved accuracy compared to traditional manual methods. Wang et al. [27] employed transfer learning techniques to distinguish between benign, borderline, and malignant ovarian masses. Furthermore, [28] demonstrated the effectiveness of fine-tuned VGG-16 networks for ovarian cyst detection, establishing that transfer learning approaches are viable for this medical domain.

Recent developments have seen the emergence of comprehensive diagnostic systems that extend beyond simple classification. Such systems align with the multi-task attention frameworks observed in adjacent ultrasound tasks [29], which combine localized segmentation maps and pathological descriptors within a unified learning architecture, and are similar in spirit to the ensemble methods demonstrated by [11], which achieved accuracy improvements through model integration. Another study [30] used datasets from multiple health centres to show that deep learning approaches can be clinically viable across diverse settings. The integration of multi-modal approaches in studies such as [13,31,32] has opened new avenues for comprehensive analysis combining ultrasound imaging with clinical parameters.

Vision Transformers (ViT) have achieved competitive performance when compared with CNNs by providing superior capability for modelling global relationships within images [33]. This advancement builds upon the foundational Transformer architecture introduced by [34]. Hybrid CNN-Transformer architectures have shown consistent advantages in medical applications. TransUNet [35] combines Transformers with U-Net for medical image segmentation, while Swin-Unet [36] utilizes hierarchical ViT to capture multi-scale features effectively. Furthermore, [37] demonstrated pure transformer architectures for 3D medical image segmentation with UNETR. PMFFNet [38] addressed CNN limitations by combining feature pyramid networks with ViT components for ovarian tumor segmentation. Multiple ViT variants, including ViT-Large-P32-384, have been reported to achieve strong performance in ovarian cancer classification tasks [15].

Multi-task learning has been successfully applied to other medical domains, such as the end-to-end multi-task architecture for brain tumor analysis in MRI proposed by [39] and the breast cancer framework by [29], which integrates BI-RADS descriptors with SHAP-based post-hoc explainability. However, the use of ViT in multi-task OC diagnosis remains underexplored, particularly for simultaneous classification, segmentation, and concept-based interpretability. Most existing systems address tasks in isolation or use post-hoc interpretability [11,29]. The choice of ViT in this work is motivated by its ability to capture long-range spatial dependencies through self-attention mechanisms, which is essential for sonographic interpretation, where understanding relationships across the entire image matters for accurate diagnosis.

2.2. From Post-Hoc Explanation to Concept-Based Interpretability in Ovarian Tumor Diagnosis

The increasing adoption of artificial intelligence in medical practice has created a growing demand for interpretable and explainable medical AI systems. Explainable AI methods in medical imaging are important for clinical decision-making transparency [20]. Traditional post-hoc explanation methods such as Grad-CAM [23], LIME [40], and SHAP [41] have been applied to several medical interpretability tasks, enabling clinicians to understand which image regions influence model decisions.

Concept bottleneck models represent a significant advance in interpretable machine learning for medical diagnosis. While foundational post-hoc approaches like Testing with Concept Activation Vectors (TCAV) [42] measure a model’s sensitivity to high-level ideas after training, the framework presented by [43] utilizes a concept bottleneck paradigm, enabling models to explicitly learn human-understandable concepts as intermediate representations for interpretable medical diagnosis. The work by [44] demonstrated concept-based explanations for dermatology. Another study [45] emphasized the importance of inherently interpretable models rather than post-hoc explanations, which aligns with medical requirements for transparent decision-making. The work by [15], which integrates ViT with LIME for OC classification, has demonstrated the potential for combining advanced architectures with interpretability methods.

Current interpretability approaches primarily rely on generic explanation techniques rather than incorporating domain-specific medical knowledge. For instance, [29] demonstrated post-hoc explainability of BI-RADS descriptors for breast cancer detection. Post-hoc methods approximate the model’s reasoning and may not be faithful to the actual decision process [45]. Unlike these approaches, our framework incorporates interpretability directly into the model through the concept bottleneck, ensuring that decisions are causally linked to the detected clinical concepts. The challenge in medical interpretability extends beyond generic explanation methods to require integration of established clinical frameworks. In OC diagnosis, clinical assessment relies on well-established criteria such as the IOTA (International Ovarian Tumor Analysis) simple rules [46] and O-RADS (Ovarian-Adnexal Reporting and Data System) guidelines [47]. These provide standards for characterizing ovarian masses based on some morphological features that can be encoded into AI decision-making processes. Current interpretability approaches fail to incorporate domain-specific features established in medical knowledge frameworks like IOTA/O-RADS criteria into AI decision-making processes.

2.3. Evaluation Methodologies in Medical AI

Cross-validation methods specific to medical imaging address unique challenges in healthcare data. According to [48], foundational cross-validation principles remain relevant for medical AI applications, while [49] provided theoretical foundations for model selection. The study by [50] provided guidelines for cross-validation in medical image analysis, highlighting the importance of patient-level splits to avoid optimistic bias.

The importance of ablation studies for understanding model component contributions through systematic evaluation of architectural elements was demonstrated by PMFFNet [38]. Comprehensive empirical evaluation and benchmarking practices in medical image analysis are exemplified by nnU-Net [51] and the survey of Litjens et al. [52].

Statistical significance testing in medical AI has gained increased attention as systems approach clinical deployment. Statistical comparisons of classifiers over multiple datasets require careful consideration [53], while [54] addressed challenges of inference for generalization error. The recent work by [55] highlighted the importance of accounting for randomness in deep learning experiments, which is particularly relevant for medical AI where reproducibility is crucial for clinical acceptance.

Current evaluation methodologies frequently lack systematic analysis of individual task contributions in multi-task learning scenarios. While single-task evaluations are well-established, comprehensive ablation studies that examine how different tasks interact and contribute to overall system performance remain uncommon. This limitation is particularly problematic for complex medical AI systems where understanding the relative importance and interdependencies of different clinical objectives is crucial for system optimization and clinical acceptance.

3. Methodology

3.1. Dataset and Data Preparation

3.1.1. Dataset

We use the Multi-Modality Ovarian Tumor Ultrasound (MMOTU) dataset [13], which contains 1,469 2D ultrasound images from 294 patients, captured from different viewing angles during transvaginal examination. These images were obtained using a Mindray Resona8 ultrasonic diagnostic instrument at Beijing Shijitan Hospital, Capital Medical University. The dataset has expert annotations from 27 specialists in Obstetrics and Gynecology. Each image undergoes a two-stage annotation process in which one expert provides initial annotations and another expert verifies them, ensuring the high-quality ground truth labels essential for medical AI applications. The dataset has eight distinct ovarian tumor categories, which we index as follows: 0-chocolate cyst, 1-serous cystadenoma, 2-teratoma, 3-theca cell tumor, 4-simple cyst, 5-normal ovary, 6-mucinous cystadenoma, and 7-high-grade serous carcinoma. This research maps the eight-class categorization to a three-class malignancy classification scheme following established clinical practice as follows:

$$M(c) = \begin{cases} 0, & c \in \{0, 1, 2, 3, 4, 6\} \ \text{(benign)} \\ 1, & c \in \{5\} \ \text{(normal)} \\ 2, & c \in \{7\} \ \text{(malignant)} \end{cases} \tag{1}$$

where $c$ represents the 8-class tumor category and $M(c)$ represents the corresponding malignancy classification.
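The mapping in Eq. (1) can be written as a small lookup. The following is a minimal sketch in plain Python; the function and dictionary names are illustrative and are not taken from the authors' code.

```python
# Class indices follow the numbering given in Section 3.1.1.
TUMOR_CLASSES = {
    0: "chocolate cyst",
    1: "serous cystadenoma",
    2: "teratoma",
    3: "theca cell tumor",
    4: "simple cyst",
    5: "normal ovary",
    6: "mucinous cystadenoma",
    7: "high-grade serous carcinoma",
}

MALIGNANCY_NAMES = {0: "benign", 1: "normal", 2: "malignant"}


def malignancy_label(c: int) -> int:
    """Map an 8-class tumor index c to the 3-class label M(c) of Eq. (1)."""
    if c not in TUMOR_CLASSES:
        raise ValueError(f"unknown tumor class index: {c}")
    if c == 5:
        return 1  # normal ovary
    if c == 7:
        return 2  # high-grade serous carcinoma
    return 0  # classes 0, 1, 2, 3, 4 and 6 are benign


for c, name in TUMOR_CLASSES.items():
    print(c, name, "->", MALIGNANCY_NAMES[malignancy_label(c)])
```

Because the 3-class label is a deterministic function of the 8-class label, both targets can be derived from a single annotation per image.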

Although the MMOTU dataset is publicly available and the original authors obtained ethics approval for data collection and release, we secured additional departmental ethics approval from our institution (Ethics Approval Number: 2457) to ensure full compliance with research standards.

3.1.2. Data Preparation and Augmentation

The images in the dataset have various resolutions, so we resized them to 224×224 pixels to match the ViT backbone requirements. We applied normalisation using ImageNet channel statistics [56] to align the input distribution with that of the pre-trained backbone and to support effective domain adaptation. We then employed an augmentation strategy with probabilistic transformations to prevent overfitting:

$$T_{aug}(I) = T_{rot}(T_{vflip}(T_{hflip}(I, p_{h}), p_{v}), \theta) \tag{2}$$

where $T_{hflip}$ and $T_{vflip}$ represent horizontal and vertical flipping with probabilities $p_{h} = p_{v} = 0.5$, and $T_{rot}$ applies random rotation with angle $\theta \sim U(0^{\circ}, 90^{\circ})$. According to general image augmentation frameworks [57], these geometric transformations mitigate overfitting by artificially increasing dataset diversity without altering the underlying semantic labels.

In addition to geometric transformations, intensity augmentations were applied including RandomBrightnessContrast ($p = 0.4$) and CLAHE ($p = 0.4$). Noise augmentations included GaussNoise ($p = 0.3$) and CoarseDropout ($p = 0.15$). Images were resized to $224 \times 224$ pixels and normalized using ImageNet statistics.
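To make the probabilistic nature of the pipeline concrete, the sketch below draws one random realisation of the augmentations using the probabilities quoted above. It is a dependency-free illustration rather than the authors' pipeline: it only decides which transforms fire and applies the two flips to a toy image stored as nested lists. The function names and the seed are assumptions, and the rotation, intensity, and noise operations are recorded in the plan but not implemented here.

```python
import random

# Probabilities and ranges quoted in Section 3.1.2.
P_HFLIP = 0.5
P_VFLIP = 0.5
ROTATION_RANGE_DEG = (0.0, 90.0)
P_OPTIONAL = {
    "RandomBrightnessContrast": 0.4,
    "CLAHE": 0.4,
    "GaussNoise": 0.3,
    "CoarseDropout": 0.15,
}


def sample_augmentation_plan(rng: random.Random) -> dict:
    """Draw one random realisation of the augmentation pipeline."""
    return {
        "hflip": rng.random() < P_HFLIP,
        "vflip": rng.random() < P_VFLIP,
        "angle_deg": rng.uniform(*ROTATION_RANGE_DEG),
        "optional": [name for name, p in P_OPTIONAL.items() if rng.random() < p],
    }


def apply_flips(image: list, plan: dict) -> list:
    """Apply only the two flips of Eq. (2) to an image stored as a list of rows."""
    out = [row[:] for row in image]
    if plan["hflip"]:
        out = [row[::-1] for row in out]  # mirror left-right
    if plan["vflip"]:
        out = out[::-1]  # mirror top-bottom
    return out


rng = random.Random(0)  # hypothetical seed, used only to make the demo repeatable
plan = sample_augmentation_plan(rng)
toy_image = [[1, 2], [3, 4]]
print(plan)
print(apply_flips(toy_image, plan))
```

Since each transform is sampled independently, the same source image yields a different variant in most epochs, which is the source of the added diversity described above.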

3.2. Unified Multi-Task Ovarian Tumor Architecture (UM-TOTA)

We propose UM-TOTA, a unified multi-task architecture for ovarian tumor analysis integrating Vision Transformer (ViT)-based representation learning with concept-driven clinical interpretability. Unlike traditional single-task models, UM-TOTA concurrently performs:

1. Eight-class tumor classification
2. Three-class malignancy detection
3. ROI-based tumor segmentation
4. Prediction of clinical semantic concepts for transparent diagnosis

The system consists of five interconnected modules:

Transformer Backbone: A ViT encoder which extracts global contextual features from ultrasound images.
Task Heads: Parallel classification, malignancy detection and segmentation heads generate respective predictions.
Concept Bottleneck Module: Predicts interpretable clinical indicators such as boundary clarity, shape regularity, vascularization, and solid-component presence.
Clinical Reasoning Unit: Integrates concept activations into attention-weighted decision pathways.
Multi-Task Loss Coordinator: Joint optimization with dynamic weighting ensures stable convergence during training.

Figure 1 illustrates the high-level structure of UM-TOTA. The design enforces semantic alignment between clinical concepts and final decisions, producing interpretable predictions suitable for radiology-assisted assessment.

The architectural design is motivated by the clinical workflow where radiologists simultaneously assess multiple characteristics when analyzing ovarian ultrasound images. Rather than treating these tasks independently, UM-TOTA leverages shared feature representations to improve performance across all objectives while maintaining clinical interpretability through explicit concept modeling. This approach represents a significant shift from conventional CNN-based medical imaging systems, offering both performance and clinical transparency.

3.2.1. ViT Backbone Architecture

The choice of ViT over conventional CNN architectures was motivated by two factors specific to our multi-task interpretability framework. First, the self-attention mechanism supports our clinical reasoning module by capturing global relationships between anatomical structures across the entire ultrasound image. This global view is essential for learning IOTA/O-RADS concepts, which require assessment of features distributed across the image rather than confined to localized regions. In contrast, the local receptive fields of CNNs would require explicit mechanisms to aggregate such global context. Second, the way ViT processes images in patches resembles how radiologists examine ovarian ultrasounds: they look at features such as edges, solid areas, and cystic regions separately, but also consider how these parts relate to each other. This representation handles the varying scales of ovarian tumors without requiring explicit multi-scale feature extraction modules.

To provide direct empirical justification for this choice, we conducted a systematic single-task comparison of ViT-Base/16 against ResNet50, DenseNet-121, EfficientNet-B0, and EfficientNetV2-m on the same MMOTU dataset under identical 5-fold stratified cross-validation conditions [18]. ViT achieved 82.98% ± 1.89% 8-class classification accuracy and 91.15% ± 1.48% 3-class malignancy accuracy, outperforming the best CNN baseline (EfficientNet-B0 at 80.66%) by 2.32 percentage points. This empirical superiority in the single-task setting, combined with the global attention properties described above, motivates the ViT backbone selection for UM-TOTA.

The ViT backbone processes input images of size $224 \times 224$ pixels by dividing them into non-overlapping patches of $16 \times 16$ pixels, resulting in a sequence of 196 patch embeddings. Each patch embedding is projected to a 768-dimensional feature space through a learned linear transformation.
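The sequence length follows directly from the image and patch sizes. The short check below reproduces the arithmetic; the three-channel input is an assumption, made because the backbone is normalised with ImageNet channel statistics.

```python
IMAGE_SIZE = 224  # input height and width in pixels
PATCH_SIZE = 16   # patch height and width in pixels
EMBED_DIM = 768   # width of each patch embedding
CHANNELS = 3      # assumed RGB-style input, as used by ImageNet pre-training

patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2
values_per_patch = PATCH_SIZE * PATCH_SIZE * CHANNELS

print("patches per side:", patches_per_side)
print("sequence length:", num_patches)
print("raw values per patch:", values_per_patch, "-> embedding width:", EMBED_DIM)
```

Under that assumption each flattened patch holds 16 × 16 × 3 = 768 raw values, so the learned projection maps a 768-value patch to a 768-dimensional embedding.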

The transformer encoder processes patch embeddings through 12 layers of multi-head self-attention [34], producing a 768-dimensional feature representation via global average pooling that serves as the shared feature space for all downstream tasks (Figure 2).

To address the data efficiency challenges inherent in Vision Transformers when trained on medical datasets of moderate size ($N = 1{,}469$), we employed a transfer learning strategy. The ViT backbone was initialised with ImageNet-1K pre-trained ViT-Base/16 weights to draw on learned textural and shape features before fine-tuning on the MMOTU dataset. This stabilizes training and reduces the overfitting that occurs when Transformers are trained from scratch on small datasets.

3.2.2. Task-Specific Head Architecture Design

The shared 768-dimensional feature representation from the ViT backbone feeds into four specialized prediction heads, each architecturally designed for specific clinical objectives as shown in Figure 2. This design ensures that while features are shared at the backbone level, each task receives specialized processing appropriate to its clinical requirements.

As illustrated in Figure 2, the shared 768-dimensional ViT feature embeddings feed into four task-specific heads for classification, malignancy detection, segmentation and clinical concept reasoning.

The segmentation head implements a decoder architecture that progressively upsamples the feature representation through transposed convolutions. Unlike traditional U-Net architectures that require encoder–decoder symmetry, this approach leverages the rich feature representations from the transformer backbone.

3.2.3. Concept Bottleneck Model Integration

The distinguishing feature of the UM-TOTA architecture is the integration of a concept bottleneck model. Ten clinical concepts informed by the IOTA and O-RADS guidelines were modelled into the architecture, building on concept bottleneck models [22] and their extension to clinically guided supervision [21]. The concept learning utilizes knowledge-guided supervision, where we mapped ground-truth tumor labels to concept targets according to IOTA and O-RADS guidelines. For example, a lesion classified as serous cystadenoma was linked to the concepts of high boundary clarity and avascularity. This design ensures that the bottleneck layer reflects clinical criteria familiar to practicing clinicians. For presentation in Figure 3, the ten concepts are grouped into three categories: Boundary/Structure (boundary clarity, shape regularity, acoustic shadowing, posterior enhancement), Tissue/Component (homogeneous texture, cystic components, solid components, papillary projections), and Clinical Signs (vascularization, ascites presence).
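A knowledge-guided mapping of this kind can be pictured as a rule table from tumor label to concept targets. The sketch below is illustrative only: the sole entry taken from the text is the serous cystadenoma example, the 1.0/0.0 encoding is an assumption, and the remaining rules are deliberately left unspecified rather than guessed.

```python
from typing import Dict, List, Optional

# The ten concepts named in this section, in a fixed (assumed) order.
CONCEPTS: List[str] = [
    "boundary_clarity", "shape_regularity", "acoustic_shadowing",
    "posterior_enhancement", "homogeneous_texture", "cystic_components",
    "solid_components", "papillary_projections", "vascularization",
    "ascites_presence",
]

# Only the serous cystadenoma entries come from the text (high boundary
# clarity, avascular). The full rule table is not reproduced in this section,
# so the other classes would have to be filled in from the IOTA/O-RADS rules.
CONCEPT_RULES: Dict[str, Dict[str, float]] = {
    "serous cystadenoma": {"boundary_clarity": 1.0, "vascularization": 0.0},
}


def concept_targets(label: str) -> List[Optional[float]]:
    """Return a 10-element target vector; None marks an unspecified concept."""
    rules = CONCEPT_RULES.get(label, {})
    return [rules.get(name) for name in CONCEPTS]


targets = concept_targets("serous cystadenoma")
for name, value in zip(CONCEPTS, targets):
    print(f"{name:24s} {value}")
```

Keeping unspecified concepts as `None` makes it explicit which targets are supervised for a given label, so a training loop could mask them out instead of treating them as negatives.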

The concept bottleneck accepts the 768-dimensional ViT features $F \in \mathbb{R}^{768}$ and processes them through individual concept predictors to generate interpretable clinical representations. As shown in Figure 3, it utilizes ten individual concept predictors rather than a single joint layer. Each predictor is formulated as:

$$C_{i} = h_{\mathrm{concept}_{i}}(F) \tag{3}$$

where $h_{\mathrm{concept}_{i}}$ represents the $i$-th concept predictor defined as:

$$h_{\mathrm{concept}_{i}}(F) = \mathrm{Linear}^{1}(\mathrm{ReLU}(\mathrm{Linear}^{64}(\mathrm{Dropout}(F, 0.1)))) \tag{4}$$

where the numerical superscripts ($64, 1$) denote the output dimensionality of the respective linear layers, so that each concept predictor reduces the shared feature vector to a single raw scalar logit $C_{i} \in \mathbb{R}$. The ten clinical concepts modelled are: boundary clarity, shape regularity, acoustic shadowing, posterior enhancement, texture homogeneity, cystic components, solid components, papillary projections, vascularization patterns, and ascites presence.

The individual concept logits are stacked into a joint vector $C = [C_{1}, C_{2}, \dots, C_{10}]^{T} \in \mathbb{R}^{10}$ and converted to non-mutually exclusive probability scores using an element-wise sigmoid activation function:

$$C_{scores} = \sigma(C) = [\sigma(C_{1}), \sigma(C_{2}), \dots, \sigma(C_{10})]^{T} \tag{5}$$

where $C_{scores} \in \mathbb{R}^{10}$ represents the final probability scores for each clinical concept, providing interpretable intermediate representations bounded in $[0, 1]$ that align with radiological assessment criteria established by [46].
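The forward pass described by Eqs. (3) to (5) can be sketched without a deep learning framework. The example below uses plain Python lists with random weights standing in for trained parameters, and it treats dropout as the identity, which is its standard behaviour at inference time. All function names, the initialisation scheme, and the seed are assumptions made for illustration.

```python
import math
import random

FEATURE_DIM, HIDDEN_DIM, NUM_CONCEPTS = 768, 64, 10


def init_linear(rng, n_in, n_out):
    """Random weights and zero bias standing in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Affine map y = Wx + b for a single input vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def concept_logit(features, head):
    """Eq. (4) at inference time, where dropout acts as the identity."""
    hidden_layer, output_layer = head
    return linear(relu(linear(features, hidden_layer)), output_layer)[0]


rng = random.Random(0)  # hypothetical seed for a repeatable demo
# One independent two-layer predictor per concept, not a single joint layer.
heads = [
    (init_linear(rng, FEATURE_DIM, HIDDEN_DIM), init_linear(rng, HIDDEN_DIM, 1))
    for _ in range(NUM_CONCEPTS)
]
features = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]  # stand-in for F

logits = [concept_logit(features, head) for head in heads]  # Eq. (3)
scores = [sigmoid(c) for c in logits]                       # Eq. (5)
print([round(s, 3) for s in scores])
```

Because the sigmoid is applied to each logit separately, the ten scores do not have to sum to one, which matches the non-mutually exclusive reading of the concepts.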

3.2.4. Clinical Reasoning Module Architecture

Taking the output of the concept bottleneck module shown in Figure 3, the clinical reasoning module processes the concept activations through a vectorized attention mechanism that weights the relative importance of different clinical attributes for the final diagnostic decision. This module bridges the soft concept checklist and the malignancy decision.

Rather than relying on post-hoc local approximations or disjoint scalar updates, the attention weights are derived directly from the concept probability vector $C_{scores} \in \mathbb{R}^{10}$ through a joint linear projection layer followed by softmax normalization over the ten concepts:

$$\mathrm{attention}_{weights} = \mathrm{Softmax}(W_{a} \cdot C_{scores} + b_{a}) \tag{6}$$

where $W_{a} \in \mathbb{R}^{10 \times 10}$ represents the learnable attention weight matrix, $b_{a} \in \mathbb{R}^{10}$ is the bias vector, and $\mathrm{attention}_{weights} \in \mathbb{R}^{10}$ gives the relative clinical priority assigned to each item of the predicted checklist.

The contextual vector containing the scaled, human-interpretable features is subsequently derived by taking the Hadamard product of the concept probability vector and the normalized attention weights:

$$\mathrm{attended}_{concepts} = C_{scores} \odot \mathrm{attention}_{weights} \tag{7}$$

where $\odot$ denotes element-wise multiplication, yielding the representation vector $\mathrm{attended}_{concepts} \in \mathbb{R}^{10}$. This attention-weighted framework enables clinicians to directly track and audit which morphological features contribute most significantly to the model’s diagnostic conclusions [47].
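Eqs. (6) and (7) amount to a linear projection, a softmax, and an element-wise product. The following is a minimal sketch in plain Python, with random values standing in for the learned matrix and for the concept scores; the names and the seed are assumptions.

```python
import math
import random

NUM_CONCEPTS = 10


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(concept_scores, w_a, b_a):
    """Return (attention_weights, attended_concepts) as in Eqs. (6) and (7)."""
    projected = [
        sum(w * c for w, c in zip(row, concept_scores)) + b
        for row, b in zip(w_a, b_a)
    ]
    weights = softmax(projected)
    attended = [c * a for c, a in zip(concept_scores, weights)]  # Hadamard product
    return weights, attended


rng = random.Random(0)  # hypothetical seed for a repeatable demo
w_a = [[rng.uniform(-0.3, 0.3) for _ in range(NUM_CONCEPTS)] for _ in range(NUM_CONCEPTS)]
b_a = [0.0] * NUM_CONCEPTS
concept_scores = [rng.random() for _ in range(NUM_CONCEPTS)]  # stand-in for C_scores

weights, attended = attend(concept_scores, w_a, b_a)
print("attention weights:", [round(a, 3) for a in weights])
print("attended concepts:", [round(v, 3) for v in attended])
```

One consequence worth noting is that the softmax weights sum to one across ten concepts, so every attended value is no larger than its original score; the ranking of the weights, rather than their absolute size, is what indicates which concepts the model prioritises.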

The clinical reasoning network processes these attended concepts through a multi-layer perceptron (MLP), $h_{reasoning}$, to produce the final malignancy estimate. For parameter efficiency, the hidden layer transformations, activations, and dropout are defined sequentially, with $x = \mathrm{attended}_{concepts}$ as the input:

$$\begin{cases} z_{1} = \mathrm{Dropout}_{0.2}(\mathrm{ReLU}(\mathrm{Linear}^{64}(x))) \\ z_{2} = \mathrm{Dropout}_{0.1}(\mathrm{ReLU}(\mathrm{Linear}^{32}(z_{1}))) \end{cases} \tag{8}$$

The forward path then passes the regularized representation through the output head to obtain the risk index:

$$\mathrm{reasoning}_{output} = h_{reasoning}(\mathrm{attended}_{concepts}) = \sigma(\mathrm{Linear}^{1}(z_{2})) \tag{9}$$

where $z_{1} \in \mathbb{R}^{64}$ and $z_{2} \in \mathbb{R}^{32}$ represent the hidden representations computed from the clinical features, and $\sigma$ is the final sigmoid activation.

3.2.5. Multi-Task Loss Coordination System

The UM-TOTA architecture uses loss functions chosen to address the challenges of multi-task learning in medical image analysis. For segmentation, we adopted the adaptive t-vMF Dice loss (AdaptiveTvMFDiceLoss) [58], which incorporates adaptive weighting to handle the class imbalance and boundary sensitivity issues common in medical segmentation. Compared with traditional Dice loss formulations, it dynamically adjusts the loss contribution based on prediction difficulty.

We used a boundary-enhanced Dice loss (BoundaryEnhancedDiceLoss) to address the challenge of accurate boundary delineation in medical segmentation. This loss function combines the traditional Dice loss with boundary-aware components that penalize incorrect boundary predictions more heavily than errors in interior regions. For the classification tasks, we implement label smoothing following the methodology established in [59], which provides regularization and improved calibration for medical classification problems. The loss weighting strategy employs dynamic weighting to keep training balanced across tasks, preventing any single task from dominating the learning process, as recommended in the multi-task learning literature [60].

The multi-task loss coordination system implements a dynamic normalization strategy to stabilize concurrent parameter optimization across diverse clinical objectives. Let T represent the active subset of diagnostic objectives determined by the specific structural ablation configuration, where

T \subseteq \{classification, malignancy, segmentation, boundary, concepts\}

. To protect the shared backbone from gradient imbalance and prevent any single objective from dominating the backpropagation trajectory, individual raw task losses (

L_{t}

) are coordinated using dynamically scaled weighting ratios:

{\bar{w}}_{t} = \frac{w_{t}}{\sum_{i \in T} w_{i}}

(10)

L_{total} = \sum_{t \in T} {\bar{w}}_{t} \times L_{t}

(11)

where

w_{t}

corresponds to the predefined baseline static weight assigned to task component t upon loss module initialization, and

{\bar{w}}_{t}

represents the dynamically normalized relative task coefficient computed on the fly to enforce a strict unit sum constraint where

\sum_{t \in T} {\bar{w}}_{t} = 1.0

across all training folds [60].
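The normalisation in Equations (10) and (11) can be sketched as follows. The base weights and raw loss values below are hypothetical placeholders chosen for illustration; the paper does not list the values used.

```python
def normalise_weights(base_weights, active_tasks):
    """Eq. (10): rescale the static weights of the active tasks to sum to 1."""
    total = sum(base_weights[t] for t in active_tasks)
    if total <= 0:
        raise ValueError("active task weights must sum to a positive value")
    return {t: base_weights[t] / total for t in active_tasks}


def total_loss(raw_losses, base_weights, active_tasks):
    """Eq. (11): weighted sum of the raw task losses over the active set T."""
    w_bar = normalise_weights(base_weights, active_tasks)
    return sum(w_bar[t] * raw_losses[t] for t in active_tasks)


# Hypothetical static weights w_t (not the values used in the study).
BASE_WEIGHTS = {
    "classification": 1.0,
    "malignancy": 1.0,
    "segmentation": 2.0,
    "boundary": 0.5,
    "concepts": 0.5,
}

# Example: an ablation variant in which only two objectives are active.
active = ["segmentation", "boundary"]
w_bar = normalise_weights(BASE_WEIGHTS, active)
loss = total_loss({"segmentation": 0.5, "boundary": 1.0}, BASE_WEIGHTS, active)
```

Because the weights are renormalised over the active set, removing a task in an ablation variant rescales the remaining coefficients rather than shrinking the total loss.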

3.3. Ablation Study Design and Task Configuration

This research implements a systematic ablation study framework to evaluate task contributions through controlled component removal. Let

M

represent the full model and

M_{- T_{i}}

represent the model with task

T_{i}

removed. The task contribution is quantified as:

Δ_{T_{i}} = Performance (M) - Performance (M_{- T_{i}})

(12)

The ablation variants are defined as:

\begin{matrix} Full model : & M_{full} = \{T_{cls}, T_{mal}, T_{seg}, T_{concept}\} \end{matrix}

(13)

\begin{matrix} No concepts : & M_{- concept} = \{T_{cls}, T_{mal}, T_{seg}\} \end{matrix}

(14)

\begin{matrix} Classification only : & M_{cls} = \{T_{cls}, T_{mal}\} \end{matrix}

(15)

\begin{matrix} No classification : & M_{- cls} = \{T_{seg}, T_{concept}\} \end{matrix}

(16)

\begin{matrix} Segmentation only : & M_{seg} = \{T_{seg}\} \end{matrix}

(17)

where cls, mal, seg, and concept denote the classification, malignancy detection, segmentation, and concept prediction tasks, respectively. This systematic approach provides insight into task interactions and individual contributions. A fixed random seed (

s = 42

) was applied across all variants to ensure reproducible initialization [55].

3.4. Training and Validation Protocol

We employed 5-fold stratified cross-validation [48] to preserve class distribution across splits, reporting mean and standard deviation across folds for all metrics. Statistical significance between UM-TOTA and baseline variants was assessed using paired t-tests (

p < 0.05

) [53].

We trained the model with the AdamW optimizer, using an initial learning rate of

5 \times 10^{- 5}

and a maximum learning rate of

1 \times 10^{- 4}

. The learning rate followed a OneCycle schedule, which starts from the initial value, increases to the maximum, and then decreases smoothly. Training ran for 250 epochs with a batch size of 16 and was implemented in PyTorch 2.5.1.
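The shape of such a schedule can be sketched in plain Python. The warm-up fraction (30%), the cosine interpolation, and the final learning rate are assumptions that mirror the default shape of `torch.optim.lr_scheduler.OneCycleLR`; the paper states only the initial and maximum learning rates, so this is an illustration rather than the exact schedule used.

```python
import math


def one_cycle_lr(step, total_steps, initial_lr=5e-5, max_lr=1e-4,
                 final_lr=5e-9, pct_start=0.3):
    """Cosine one-cycle schedule: initial_lr -> max_lr -> final_lr.

    final_lr and pct_start are assumed values, not taken from the paper.
    """
    warmup_steps = max(1, int(pct_start * total_steps))
    if step < warmup_steps:
        progress = step / warmup_steps
        start, end = initial_lr, max_lr
    else:
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
        start, end = max_lr, final_lr
    # Cosine factor runs from 1 (start of phase) to 0 (end of phase).
    cosine = (1.0 + math.cos(math.pi * progress)) / 2.0
    return end + (start - end) * cosine


# One learning-rate value per epoch, for 250 epochs.
schedule = [one_cycle_lr(epoch, 250) for epoch in range(250)]
```

In PyTorch's `OneCycleLR`, the initial rate is set as `max_lr / div_factor`, so the stated values would correspond to `max_lr=1e-4` with `div_factor=2`.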

3.5. Comprehensive Evaluation Framework

The evaluation framework uses task-specific quantitative metrics. For the multi-class classification and malignancy detection tasks, performance is measured using Accuracy, Precision, Sensitivity (Recall), F1-Score (the harmonic mean of precision and recall), and the Area Under the Receiver Operating Characteristic curve (AUC-ROC). For the pixel-wise segmentation task, performance is quantified using the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU). Following established medical image analysis evaluation standards [61], these metrics provide a balanced assessment that remains informative under severe class imbalance.

To evaluate the fidelity of the intermediate concept layer, the outputs of the continuous concept bottleneck are compared with clinical targets using a concept alignment metric. The Mean Absolute Error (MAE) alignment score for the j-th clinical concept across N validation images is defined as:

{Alignment}_{{concept}_{j}} = \frac{1}{N} \sum_{i = 1}^{N} |C_{pred} (i, j) - C_{target} (i, j)|

(18)

where

j \in \{1, 2, \dots, 10\}

represents the specific clinical descriptor index, and N represents the total number of validation samples evaluated (

N = 1{,}469

). The scalar value

C_{pred} (i, j) \in [0, 1]

represents the sigmoid-activated continuous probability score output by the model for image i regarding concept j, while

C_{target} (i, j) \in [0, 1]

represents the respective deterministic clinical target profile derived from expert consensus International Ovarian Tumor Analysis (IOTA) simple rules [62]. Values near zero indicate close alignment between the model’s concept activations and the clinical ground truth [42].
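Equation (18) amounts to a per-concept mean absolute error, as in the following sketch. The three samples and two concepts are toy values for illustration only, not data from the study.

```python
def concept_alignment(c_pred, c_target):
    """Eq. (18): per-concept mean absolute error over N samples.

    c_pred and c_target are N x J nested lists with entries in [0, 1].
    Returns a list of J alignment scores (lower means closer alignment).
    """
    n_samples = len(c_pred)
    n_concepts = len(c_pred[0])
    return [
        sum(abs(c_pred[i][j] - c_target[i][j]) for i in range(n_samples)) / n_samples
        for j in range(n_concepts)
    ]


# Toy example: 3 images, 2 concepts (hypothetical values).
pred = [[0.9, 0.2], [0.7, 0.1], [0.2, 0.6]]
target = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = concept_alignment(pred, target)
```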

4. Results and Discussion

4.1. Multi-Task Learning Performance and Clinical Significance

We implemented the UM-TOTA framework on an ASUS TUF Gaming A15 system with an AMD Ryzen 9 processor, an NVIDIA RTX 4070 GPU (12 GB VRAM), and 32 GB of RAM, running Ubuntu 24.04 LTS with PyTorch 2.5.1+cu121. With all four tasks handled simultaneously, the UM-TOTA model showed promising performance. This outcome supports our central hypothesis that an integrated learning model can provide unified diagnostic support that is valuable in a clinical setting (Section 2.3). Table 1 provides a full summary of the 5-fold stratified cross-validation results.

4.1.1. Task-Specific Performance Analysis and Comparison with Related Studies

For 8-class tumor classification, the system performed consistently across the different tumor types. The ROC analysis in Figure 4 shows the fold-wise AUC for 8-class classification, with a mean AUC of

0.950 \pm 0.009

. This level of performance alongside the specificity of

97.06 % \pm 0.15 %

suggests that our approach can differentiate between tumor types that look similar, such as serous and mucinous cystadenomas, a known challenge in clinical practice. The F1-score of

80.24 % \pm 1.00 %

from Table 1 also indicates a balanced performance across the tumor classes.

For 3-class malignancy detection, the model achieved its strongest performance, with an accuracy of

90.88 % \pm 1.14 %

and specificity of

90.41 % \pm 1.60 %

. The lower specificity relative to the 8-class task is largely a consequence of the class structure: under one-vs-rest macro averaging, the 8-class setting has seven negative classes per comparison, which inflates the true-negative count, whereas the 3-class malignancy setting has only two. The difference is therefore primarily a statistical artefact of the averaging scheme rather than a direct measure of clinical difficulty. The ROC curves for the individual folds are presented in Figure 4. The model achieved high discriminative power, with a mean AUC of

0.946 \pm 0.047

, and reliably separates benign, malignant, and normal tissue. This result is comparable to clinical validation studies; for example, the IOTA simple rules have shown a sensitivity of

91.66 %

and specificity of

84.84 %

[63].
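The averaging effect described above can be reproduced with synthetic confusion matrices. In the sketch below, both classifiers have exactly 80% accuracy on balanced classes with errors spread evenly; the matrices are constructed for illustration and are not the study's results.

```python
def macro_specificity(cm):
    """One-vs-rest macro specificity; cm[r][c] counts true class r predicted as c."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for c in range(k):
        false_pos = sum(cm[r][c] for r in range(k) if r != c)
        negatives = total - sum(cm[c])  # all samples whose true class is not c
        per_class.append((negatives - false_pos) / negatives)
    return sum(per_class) / k


def balanced_confusion(k, per_class, correct):
    """Synthetic matrix: equal class sizes, errors spread evenly over other classes."""
    off_diagonal = (per_class - correct) // (k - 1)
    return [[correct if r == c else off_diagonal for c in range(k)] for r in range(k)]


# Both settings: 140 samples per class, 112 correct (80% accuracy).
cm8 = balanced_confusion(8, 140, 112)
cm3 = balanced_confusion(3, 140, 112)
spec8 = macro_specificity(cm8)
spec3 = macro_specificity(cm3)
```

With identical accuracy, the 8-class specificity comes out higher than the 3-class one simply because each one-vs-rest comparison has more negatives to absorb the same number of false positives.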

In the Tumor Segmentation task, the model produced Dice scores of

77.29 % \pm 1.29 %

, which we consider acceptable for clinical use, as shown in Figure 5. The Dice and IoU scores were also strongly correlated, with

r = 0.980

. This suggests that the model’s ability to delineate boundaries is consistent, which is important for applications like planning for surgery or monitoring treatment.
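For reference, the two overlap metrics can be computed as in the sketch below, which uses flattened binary masks and toy values. For a single image the two are linked by Dice = 2 IoU / (1 + IoU), so a strong correlation between them is expected; the convention of returning 1.0 for two empty masks is an assumption.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # assumed convention when both masks are empty
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 6-pixel masks (hypothetical values).
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```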

Our classification accuracy of

80.26 % \pm 1.10 %

is competitive with values reported in the existing literature. For example, recent meta-analyses have reported overall sensitivities of

81 %

(95% CI, 0.80–0.82) and specificities of

92 %

(95% CI, 0.92–0.93) for diagnosing ovarian cancer from ultrasound imaging [64]. In addition, the study that introduced the dataset used in this work achieved an accuracy of

80.6 %

[13].

The simultaneous training approach adopted in UM-TOTA has shown advantages when compared to separate post-hoc processing pipelines. This is reflected in the stable performance observed across all tasks and the consistent convergence patterns shown in the training curves presented in Figure 6. The Vision Transformer backbone effectively learns shared feature representations that facilitate knowledge transfer between tasks. For instance, features learned for classification contribute to improved segmentation precision, while segmentation-related features help the classification task become more boundary-aware.

To contextualize the contributions of UM-TOTA, Table 2 benchmarks our framework against existing methods. Since no prior study simultaneously integrates classification, malignancy detection, segmentation, and concept-based interpretability for ovarian ultrasound, we compare our results against state-of-the-art single-task models, clinical rules, and recent multi-task frameworks from related medical imaging domains. The breast ultrasound [29] and brain MRI [39] entries are reported on different datasets; they are included for cross-domain context rather than as direct comparisons on MMOTU.

By separating our results into specific tasks, the advantages become clear. Our UM-TOTA 8-class classification achieves a specificity of 97.06%, outperforming all compared methods shown in Table 2. High specificity is particularly important in medical diagnosis, as it reduces false positive rates and helps to avoid unnecessary surgical interventions for patients.

For malignancy detection, UM-TOTA achieves an accuracy of 90.88%, outperforming the 80.60% and 86.66% reported by [13] and the IOTA simple rules [63], respectively. In addition, UM-TOTA attains a sensitivity of 90.88%, which is competitive with established clinical rules while offering the added benefit of inherent interpretability.

When compared with recent multi-task learning approaches in other medical imaging domains, the unique contribution of UM-TOTA becomes evident. [29] proposed MT-BI-RADS for breast ultrasound, achieving strong classification and sensitivity performance through a multi-task framework; however, their interpretability relies on post-hoc SHAP explanations rather than inherent concept-based reasoning. Similarly, [39] developed an end-to-end multi-task architecture for brain tumor analysis combining classification and segmentation, but without any explicit interpretability mechanism.

In contrast, UM-TOTA integrates concept-based reasoning directly into the diagnostic pipeline, guided by established IOTA and O-RADS clinical guidelines. This inherent interpretability enables clinicians to understand, verify, and trust model predictions using familiar diagnostic criteria, addressing a critical requirement for clinical adoption that remains unmet by existing multi-task approaches.

4.1.2. Comparison with CNN-Based Multi-Task Baseline

To validate the ViT backbone selection with direct experimental evidence, we trained a ResNet50-based multi-task baseline (CNN-MTL) under identical experimental conditions: the same four tasks, loss functions, 5-fold stratified cross-validation protocol, and MMOTU dataset. The only difference is the backbone: ResNet50 with 2048-dimensional features in place of ViT-Base/16 with 768-dimensional features, with the task heads adapted accordingly.

Table 3 presents the results. CNN-MTL achieved 79.31% ± 2.96% 8-class classification accuracy, 90.13% ± 0.88% malignancy detection accuracy, and 77.71% ± 0.52% Dice score. Paired t-tests across the five folds confirmed no statistically significant difference between UM-TOTA and CNN-MTL on any task (classification:

p = 0.178

, malignancy:

p = 0.396

, segmentation:

p = 0.463

). Notably, UM-TOTA demonstrated greater cross-fold stability on the classification task (±1.10% vs ±2.96%), suggesting more consistent generalisation across data splits.

The CNN-MTL framework with a ResNet-50 backbone improved on the classification performance of the single-task baseline reported in our earlier study [18], an outcome consistent with established multi-task learning theory. In particular, multi-task optimisation can improve the performance of moderately strong baseline models by encouraging shared feature representations across related auxiliary tasks. The improvements observed here are therefore in line with prior findings that multi-task learning acts as a form of implicit regularisation, leading to better generalisation in limited data settings [60,65,66]. This regularisation benefit and the gradient-interference cost discussed below are not contradictory: their net effect depends on backbone capacity, and our ablation in Section 4.3 shows that the higher-capacity ViT, already strong in the single-task setting, gains less from multi-task regularisation and is more exposed to interference between objectives.

However, the reduction in ViT accuracy from

82.98 %

in our previous single-task benchmarks [18] to

80.26 %

in this multi-task configuration reflects a common pattern in joint optimisation, where competing objectives introduce gradient interference. Here, a global semantic task (tumour classification) must be balanced against localised pixel-wise segmentation. This interference can hinder convergence and limit the extent to which higher-capacity backbones realise the representational advantage they show in isolation [67].

Consequently, while joint optimisation narrows the performance gap between the ResNet-50 and the stronger ViT backbone, the ViT-based UM-TOTA model offers the strongest overall trade-off between accuracy and interpretability among the multi-task configurations evaluated. This is supported by its superior performance in the single-task setting, its ability to model globally distributed features relevant to IOTA-based reasoning, and its favourable trade-off between performance and deployment efficiency, as discussed in Section 4.1.3.

4.1.3. Deployment Efficiency Analysis

Table 4 presents efficiency measurements comparing UM-TOTA as a unified model against three sequential single-task ViT models, which is the realistic clinical alternative for pipelines that handle classification, segmentation, and reporting independently. The comparison uses three sequential models rather than four because concept prediction in UM-TOTA is a lightweight head intrinsically coupled to the classification and malignancy heads; it does not require a separately deployed backbone, so the realistic sequential baseline comprises three deployable models. All experiments were performed on an NVIDIA RTX 4070 GPU (12 GB VRAM) using PyTorch 2.5.1, with 10 GPU warmup runs followed by 100 timed inference passes per image.

UM-TOTA processes a single ultrasound image in

8.98 \pm 1.30

ms, compared with 26.01 ms for three sequential models, a 65.5% latency reduction. GPU memory consumption is 7.4 MB versus 20.8 MB (64.2% reduction), and throughput improves from 48.9 to 147.9 images per second (202.7% increase). Throughput was measured on the same hardware with a batch size of 16. Execution times were captured directly on the GPU using asynchronous CUDA events (torch.cuda.Event()). These efficiency gains are directly relevant to clinical deployment, where real-time processing during ultrasound examination and GPU memory constraints at point-of-care settings are practical requirements.

4.2. Clinical Interpretability Results and Trust Building

The UM-TOTA Clinical Reasoning module learned clinical concepts that align with established IOTA and O-RADS guidelines. Figure 7 presents the real medical concept interpretability analysis, showing statistically significant correlations between learned concepts and clinical tumor characteristics.

4.2.1. IOTA/O-RADS Guideline Alignment

The concept activation patterns in Figure 7 align notably with clinical criteria. We distinguish here between two related but separate analyses: intra-class concept activations (the mean activation of a concept within a single tumor class) and between-group statistical comparisons (benign versus malignant), which are reported with p-values in Table 5. As an example of the intra-class activations, serous cystadenomas had high activation scores for ‘boundary_clarity’ (0.86), ‘cystic_components’ (0.84), and ‘homogeneous_texture’ (0.83). This is a good match for the IOTA B-features, which describe unilocular cysts with smooth walls. In contrast, high-grade serous carcinomas showed higher activations for ‘vascularization’ (0.53), ‘solid_components’ (0.58), and ‘papillary_projections’ (0.36), which are known M-features indicating malignancy.

Quantitative Concept Activation Analysis: To move beyond descriptive validation, we computed weighted mean concept activation values separately for benign and malignant tumours across all 1,469 samples. Statistical significance was assessed for each concept. All ten IOTA-grounded concepts showed statistically significant differences between benign and malignant cases (all

p < 0.01

, with eight concepts reaching

p < 10^{- 6}

), as presented in Table 5.

The three strongest malignancy indicators were vascularization (benign: 0.167 vs malignant: 0.399, importance:

+ 0.375

,

p = 5.48 \times 10^{- 96}

), papillary projections (0.072 vs 0.204, importance:

+ 0.270

,

p = 3.41 \times 10^{- 90}

), and solid components (0.313 vs 0.644, importance:

+ 0.262

,

p = 3.40 \times 10^{- 8}

). Conversely, benign tumours showed significantly higher activations for homogeneous texture (0.687 vs 0.493,

p = 2.79 \times 10^{- 19}

), cystic components (0.606 vs 0.356,

p = 1.93 \times 10^{- 8}

), and boundary clarity (0.669 vs 0.626,

p = 1.63 \times 10^{- 7}

). These activation patterns are directly consistent with established IOTA simple rules and O-RADS scoring criteria [46,47], providing quantitative evidence that the model has learned clinically meaningful concept representations rather than arbitrary internal features.

Statistical Validation of Clinical Concepts: Our analysis of feature importance for malignancy confirmed that the model identified clinically relevant features, as shown in the panel of Figure 7 titled Key Concept for malignancy (*

p < 0.05

, **

p < 0.01

, ***

p < 0.001

). ‘Vascularization’ was the strongest predictor of malignancy (importance:

+ 0.375

,

p < 10^{- 90}

). This was followed by ‘papillary projections’ (

+ 0.270

,

p < 10^{- 80}

) and ‘solid components’ (

+ 0.262

,

p < 10^{- 7}

). All of these are considered high-risk features in the O-RADS system [47]. Features associated with benign tumors showed strong negative correlations, such as ‘boundary_clarity’ (

- 0.355

,

p < 10^{- 6}

), ‘cystic_components’ (

- 0.338

,

p < 10^{- 7}

), and ‘homogeneous_texture’ (

- 0.329

,

p < 10^{- 18}

).

4.2.2. Clinical Reasoning Transparency and Trust Building

The UM-TOTA Clinical Reasoning module provides transparent decision pathways. This allows radiologists to validate what the system is doing, which we believe helps build trust. Figure 8 shows example cases in which the concept activations link directly to what a radiologist would see, creating an understandable connection between UM-TOTA’s predictions and known diagnostic rules.

Attention-Based Clinical Reasoning: The model’s attention mechanism appears to weigh concepts according to their clinical relevance. The heat map of clinical attention weights by class in Figure 9 shows that attention patterns differ between tumor types in a clinically plausible way. For malignant cases, the system gave more weight to vascularization and solid components. For benign cases, it focused more on boundary clarity and cystic features.

Case-by-Case Interpretability: Looking at individual predictions from Figure 10, we can see consistent pathways from concept to diagnosis, even with different kinds of tumors. The system gives an explanation for each prediction with a confidence weight. This could allow a clinician to check the AI’s reasoning against their own judgement.

4.2.3. Interpretable Multi-Task Integration

The concept bottleneck design enables unified clinical reasoning across all four tasks. The correlation analysis shown in the concept patterns graph in Figure 7 suggests that the concepts the model learns for classification are also used to inform malignancy detection and segmentation. This helps ensure the system processes cases in an integrated manner, analogous to how radiologists assess cases.

4.3. Ablation Study Insights and Technical Innovations

To better understand the contribution of each component, we conducted ablation studies on five variants of the architecture: full_model, no_concepts, class_mal_only, segmentation_only, and no_class_no_mal. These revealed complex patterns of task interaction. The ablation variants are configured as follows:

full_model: {classification: True, malignancy: True, segmentation: True, concepts: True}
no_concepts: {classification: True, malignancy: True, segmentation: True, concepts: False}
no_class_no_mal: {classification: False, malignancy: False, segmentation: True, concepts: True}
segmentation_only: {classification: False, malignancy: False, segmentation: True, concepts: False}
class_mal_only: {classification: True, malignancy: True, segmentation: False, concepts: False}
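The configurations above can be expressed as a small lookup table. The following sketch copies the task flags verbatim from the list; the dictionary layout and the helper function names are assumptions for illustration.

```python
# Task flags copied from the five ablation variants listed above.
ABLATION_VARIANTS = {
    "full_model": {"classification": True, "malignancy": True,
                   "segmentation": True, "concepts": True},
    "no_concepts": {"classification": True, "malignancy": True,
                    "segmentation": True, "concepts": False},
    "no_class_no_mal": {"classification": False, "malignancy": False,
                        "segmentation": True, "concepts": True},
    "segmentation_only": {"classification": False, "malignancy": False,
                          "segmentation": True, "concepts": False},
    "class_mal_only": {"classification": True, "malignancy": True,
                       "segmentation": False, "concepts": False},
}


def active_tasks(variant):
    """Set of tasks enabled in a variant."""
    return {task for task, enabled in ABLATION_VARIANTS[variant].items() if enabled}


def removed_tasks(variant):
    """Tasks dropped relative to the full model (the removed T_i of Section 3.3)."""
    return active_tasks("full_model") - active_tasks(variant)
```

For example, `removed_tasks("no_concepts")` returns `{"concepts"}`, which corresponds to the no-concepts variant in Equation (14).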

4.3.1. Task Combination Impact Analysis

The interactions between tasks were complex and went beyond simple synergy, as shown in the three panels of Figure 11:

Classification: no_concepts (82.44% ± 1.94%) > class_mal_only (80.67% ± 2.25%) > full_model (80.26% ± 1.10%)
Malignancy: no_concepts (92.38% ± 0.85%) > class_mal_only (91.22% ± 1.22%) > full_model (90.88% ± 1.14%)
Segmentation: segmentation_only (80.88% ± 1.04%) > full_model (77.29% ± 1.29%) > no_concepts (75.50% ± 0.94%)

The results indicate that the single-task segmentation model outperforms the multi-task variants on its own task. In medical imaging, multi-task learning often has difficulty balancing the local spatial features needed for segmentation with the global semantic features needed for classification [66,68]. Our segmentation-only model reached a Dice score of 80.88% (Figure 11), which is about 3.6 percentage points higher than the full model. Within the multi-task setting, the segmentation ranking places full_model (77.29%) above no_concepts (75.50%); that is, adding the concept bottleneck improves segmentation by 1.79 percentage points relative to the no-concepts multi-task variant. The segmentation penalty therefore arises from joint optimisation with the classification and malignancy tasks, not from the concept bottleneck.

The training curves illustrate these dynamics. The segmentation_only model has the smoothest convergence curves (Figure 12a). In contrast, the full_model (Figure 6) and no_concepts (Figure 12b) training curves are noisier and oscillate, as the optimizer balances gradients pulling in different directions. The class_mal_only model (Figure 12c) also converges stably and smoothly, without degradation. From these observations we infer that combining segmentation with the classification tasks introduces optimization complexity through conflicting gradients.

We validated statistical significance using paired Student’s t-tests on the 5-fold cross-validation results, as shown in Table 6. For classification, the UM-TOTA model achieved an accuracy of

80.26 % \pm 1.10 %

, which showed no statistically significant difference from the task-matched class_mal_only variant (

80.67 % \pm 2.25 %

;

t = - 0.43

;

p = 0.692

). Similarly, malignancy detection showed no significant difference (

90.88 % \pm 1.14 %

vs

91.22 % \pm 1.37 %

;

t = - 0.85

;

p = 0.445

). To isolate the explicit architectural impact of the concept bottleneck module, a direct baseline comparison was conducted between the full multi-task network and the concept-free variant model (no_concepts). Removing the concept layer yielded a marginal absolute increase of 2.18 percentage points in classification accuracy (

82.44 % \pm 2.17 %

) and 1.50 percentage points in malignancy detection (

92.38 % \pm 0.95 %

). Formal paired t-tests confirmed that these absolute variations are not statistically significant (classification:

t = 1.85, p = 0.139

; malignancy:

t = 1.61, p = 0.182

), indicating that the 10-concept interpretability framework adds clinical explainability without a statistically detectable loss in downstream diagnostic accuracy.

By contrast, segmentation performance demonstrated a highly significant difference (

t = - 9.88

;

p < 0.001

). The UM-TOTA model achieved a Dice score of

77.29 % \pm 1.29 %

compared with the specialized segmentation_only variant’s

80.88 % \pm 1.04 %

. This trade-off is visible in the cross-validation training dynamics illustrated in Figure 12, where the unified architecture balances local, pixel-level spatial features against the global semantic features required for multi-class optimization. This reduction in segmentation accuracy is a known trade-off attributable to multi-task learning optimization conflicts. Despite this localized degradation, the unified UM-TOTA model delivers transparent, auditable decision-making structured directly around the clinical International Ovarian Tumor Analysis (IOTA) guidelines while preserving competitive performance. We explicitly note that, with a 5-fold cross-validation design (degrees of freedom

= 4

), these paired tests have limited statistical power; non-significant results are therefore interpreted as indicating stable performance rather than as proof of equivalence.
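A paired t-test over five folds can be sketched with the standard library as follows. The per-fold Dice scores are hypothetical values chosen for illustration, not the study's fold-level results; 2.776 is the two-sided critical value of Student's t at the 0.05 level with 4 degrees of freedom.

```python
import math
from statistics import mean, stdev

T_CRIT_DF4 = 2.776  # two-sided critical t value, alpha = 0.05, df = 4


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for two paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = stdev(diffs) / math.sqrt(len(diffs))  # sample std (n - 1 denominator)
    if std_err == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / std_err, len(diffs) - 1


# Hypothetical per-fold Dice scores (%), NOT the study's fold-level results.
full_model_dice = [76.0, 77.5, 78.9, 76.8, 77.3]
seg_only_dice = [79.8, 81.0, 82.3, 80.2, 81.1]

t_stat, dof = paired_t_statistic(full_model_dice, seg_only_dice)
significant = abs(t_stat) > T_CRIT_DF4
```

With only four degrees of freedom the critical value is large, which is why modest mean differences with fold-to-fold variability, such as those seen for classification, do not reach significance.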

4.3.2. Technical Component Validation

Interpretability-Performance Trade-off Analysis: When we compared full_model with no_concepts variants, we observed that there was a trade-off between the 10-concept interpretability module and predictive performance on the classification and malignancy tasks. Removing the concept module increased classification accuracy by 2.18 percentage points (80.26% → 82.44%) and malignancy accuracy by 1.50 percentage points (90.88% → 92.38%); for segmentation, however, removing the concept module reduced the Dice score by 1.79 percentage points (77.29% → 75.50%). This indicates a tension between interpretability and peak per-task accuracy for the global semantic tasks, while the concept bottleneck is, if anything, beneficial for segmentation within the multi-task setting. Figure 11 shows the corresponding performance levels.

4.3.2.1. Quantified Interpretability Effect:

Classification: 2.18 percentage points cost for clinical interpretability
Malignancy detection: 1.50 percentage points cost for transparent reasoning
Segmentation: 1.79 percentage points gain from concept-guided features
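These three figures follow from applying Equation (12) to the concept task, using the mean scores reported above for the full_model and no_concepts variants. The sketch below reproduces the arithmetic; the dictionary layout and function name are illustrative.

```python
# Mean scores (%) as reported for the two variants in Section 4.3.1.
REPORTED_MEANS = {
    "full_model": {"classification": 80.26, "malignancy": 90.88, "segmentation": 77.29},
    "no_concepts": {"classification": 82.44, "malignancy": 92.38, "segmentation": 75.50},
}


def task_contribution(full, ablated):
    """Eq. (12): Performance(M) - Performance(M without T_i), per metric."""
    return {metric: round(full[metric] - ablated[metric], 2) for metric in full}


# Negative values: the concept module costs accuracy; positive: it helps.
concept_effect = task_contribution(
    REPORTED_MEANS["full_model"], REPORTED_MEANS["no_concepts"]
)
```

Negative values for classification and malignancy correspond to the interpretability cost, and the positive value for segmentation corresponds to the gain from the concept module.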

Advanced Loss Function Contributions: Our specific loss function design, which used an adaptive t-vMF Dice loss, a boundary-enhanced loss, and progressive task weighting, appears to have contributed to a more stable training process as seen in Figure 12. The boundary-enhanced Dice loss in particular helps to address a common problem with standard Dice loss by putting more emphasis on getting the boundaries correct.

Progressive Training Dynamics: Looking at the learning rate schedules from the plots in Figure 12 for the different models, our progressive training approach seems to be effective. The OneCycle learning rate scheduler showed consistent patterns in all variants, with smooth acceleration and controlled annealing phases.

4.3.3. Optimal Clinical Configuration

The recommendations below are practical guidance for deployment scenarios and do not retract the central argument of this paper, which is that the unified, interpretable model is the primary contribution. Based on our findings, we suggest different configurations for different operational situations:

High-volume screening: The no_concepts variant (82.44% ± 1.94% classification, 92.38% ± 0.85% malignancy).
Surgical planning: The segmentation_only variant (80.88% ± 1.04% Dice), as it is focused on boundary precision.
Research/teaching: The full_model variant, which gives balanced performance and interpretable reasoning.
Real-time decision support: The class_mal_only variant for its good performance and efficiency.

Even though the single-task models performed better on their specific tasks, the full_model still offers a critical advantage for a complete clinical workflow. It provides 8-class classification (80.26% ± 1.10%), malignancy detection (90.88% ± 1.14%), and segmentation (77.29% ± 1.29% Dice) all at once, with explainable reasoning, in a single unified framework. This avoids the complexity and inconsistencies of sequential post-hoc approaches.

4.4. Comparative Analysis and Clinical Deployment

Our UM-TOTA architecture was designed to address some of the known limitations of traditional approaches where tasks are done sequentially. These limitations relate to performance consistency, workflow efficiency, and whether predictions are coherent.

As mentioned, our results are competitive with established benchmarks. Published meta-analyses show sensitivities of 81% and specificities of 92% [64]. Our results are also on par with those of the study that introduced the MMOTU dataset [13] used in this work. Our model’s performance for 8-class classification (80.26% ± 1.10%) and 3-class malignancy detection (90.88% ± 1.14%) is in this range, but with the added capabilities of segmentation and clinical concept interpretability, which are missing in existing single-task models.

With post-hoc approaches, there is a risk of getting conflicting results for the same patient. Our simultaneous approach ensures that all predictions originate from a shared feature representation, which provides radiologists with internally consistent diagnostic information.

Using a sequence of models increases computational steps, memory usage, and pipeline complexity. We analyzed the parameter load of our UM-TOTA architecture compared to such a sequential pipeline. In a traditional single-task framework, performing classification, malignancy detection, and segmentation requires three separate models. If

θ_{enc}

represents the backbone parameters (ViT encoder, which serves as feature extractor), this results in a total load of

3 \times θ_{enc}

for the three tasks. In contrast, UM-TOTA employs a shared encoder where feature extraction is performed only once (

1 \times θ_{enc}

) per image. The reduction in computational load is calculated as:

\text{Saved Load} = \text{Total Work}(\text{Sequential}) - \text{Total Work}(\text{UM-TOTA})

(19)

\text{Saved Load} = 3 θ_{enc} - 1 θ_{enc} = 2 θ_{enc}

(20)

\text{Efficiency Gain} \approx \frac{\text{Saved Load}}{\text{Sequential Load}} = \frac{2 θ_{enc}}{3 θ_{enc}} \approx 66.7\%

(21)

This shared-encoder strategy reduces the backbone computational load by approximately two-thirds (66.7%). This efficiency advantage, combined with the preservation of classification and malignancy detection accuracy, justifies the segmentation performance trade-off observed in Section 4.3, as similarly demonstrated in other medical imaging applications [65]. Table 4 summarises the efficiency improvements.
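The calculation in Equations (19) to (21) counts backbone parameters only. The sketch below reproduces it and, as an extension that is not part of the paper's analysis, shows how the saving shrinks slightly once task-head parameters are included; the encoder and head sizes in the second call are hypothetical.

```python
def shared_encoder_saving(n_tasks, encoder_params=1.0, head_params=0.0):
    """Fractional parameter saving of one shared encoder over n separate models.

    With head_params = 0 this reduces to (n - 1) / n, as in Eq. (21).
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    sequential = n_tasks * (encoder_params + head_params)  # n full models
    unified = encoder_params + n_tasks * head_params       # 1 encoder, n heads
    return (sequential - unified) / sequential


# Backbone-only view, three tasks: (3 - 1) / 3.
backbone_only = shared_encoder_saving(3)

# Hypothetical sizes (millions of parameters) to show the effect of task heads.
with_heads = shared_encoder_saving(3, encoder_params=86.0, head_params=2.0)
```

The backbone-only figure is therefore an upper bound on the overall parameter saving, which approaches it when the task heads are small relative to the encoder.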

4.4.1. State-of-the-Art Positioning and Clinical Adoption Advantages

To our knowledge, our approach is the first to use a clinically-guided concept learning framework for ovarian ultrasound analysis, by building IOTA and O-RADS knowledge directly into the model. The 10 medical concepts we used map directly to clinical assessment criteria [47,62]. This helps radiologists validate the model’s predictions using a framework they are already familiar with.

The concept activation patterns show relationships that are clinically meaningful. For instance, malignant tumors are associated with increased vascularization ($+0.375$ importance), papillary projections ($+0.270$), and solid components ($+0.262$). Benign lesions, in contrast, are associated with higher boundary clarity ($-0.355$ importance, i.e., higher boundary clarity indicates benignity) and cystic characteristics ($-0.338$). This kind of clinical consistency is important for building radiologist confidence and for addressing a major barrier to AI adoption, namely trust in how the algorithm makes its decisions.
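As an illustration of how the signed importance scores can be read, the sketch below splits the ten concepts into malignancy and benignity indicators and ranks each group by magnitude. The scores are copied from Table 5; the dictionary and the helper `split_by_direction` are hypothetical names introduced here and are not part of the UM-TOTA code.

```python
# Importance scores copied from Table 5 (positive = malignancy indicator,
# negative = benign indicator).
concept_importance = {
    "Vascularization": 0.375,
    "Papillary projections": 0.270,
    "Solid components": 0.262,
    "Ascites presence": 0.192,
    "Boundary clarity": -0.355,
    "Cystic components": -0.338,
    "Homogeneous texture": -0.329,
    "Acoustic shadowing": -0.208,
    "Posterior enhancement": -0.190,
    "Shape regularity": 0.031,
}


def split_by_direction(importance):
    """Return (malignancy indicators, benign indicators), strongest first."""
    items = list(importance.items())
    malignant = sorted((kv for kv in items if kv[1] > 0),
                       key=lambda kv: kv[1], reverse=True)
    benign = sorted((kv for kv in items if kv[1] < 0),
                    key=lambda kv: kv[1])  # most negative first
    return malignant, benign


malignant, benign = split_by_direction(concept_importance)
print("Malignancy indicators:", [name for name, _ in malignant])
print("Benign indicators:    ", [name for name, _ in benign])
```

Under this reading, the three strongest malignancy indicators and the two strongest benign indicators are the concepts named in the paragraph above.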

4.5. Limitations and Future Directions

Despite the promising results, this study has certain limitations. First, our evaluation was based on data from a single centre (the MMOTU dataset), which is small for training a Vision Transformer from scratch. Although we used a transfer learning strategy with ImageNet weights to stabilise training and prevent overfitting, validation with data from multiple institutions is needed to confirm that our model can generalise, since clinics differ in their ultrasound machines and patient populations.

Second, our current analysis is based on 2D images. This omits volumetric information present in 3D ultrasound, which could provide additional context.

Third, the clinical concept annotations, which we based on the IOTA and O-RADS guidelines, may carry some subjectivity, as they rely on expert interpretation. Relatedly, the 8-to-3 class malignancy mapping in Equation (1) is a clinical simplification; categories assigned here to the benign group can in practice be borderline or malignant (for example, immature teratomas, sex-cord-stromal theca cell tumors, and borderline mucinous cystadenomas). This mapping was agreed with our clinical co-author (A.A.H.) as a tractable target for malignancy detection.

Lastly, our ablation study focused on task contribution analysis rather than backbone architecture comparison.

Future investigation should examine the integration of multiple data types, such as combining 2D ultrasound with Doppler flow or 3D data, which presents a viable avenue for enriching the diagnostic context. We also suggest the exploration of weakly supervised learning to automatically discover new clinical concepts. The UM-TOTA architecture could also be extended to other gynaecological imaging problems beyond ovarian tumors.

5. Conclusion

In this work, we addressed the limitation of current AI systems that handle OC classification, segmentation, and interpretability as separate post-hoc processes, which is inefficient. We proposed the UM-TOTA architecture, which performs these tasks simultaneously. This unified multi-task approach contributes a coherent diagnostic pipeline in which predictions are derived from shared feature representations, unlike fragmented baselines.

Our model achieved competitive performance for 8-class classification (80.26% ± 1.10% accuracy, 97.06% one-vs-rest macro specificity), 3-class malignancy detection (90.88% ± 1.14% accuracy, 90.41% specificity), and segmentation (77.29% ± 1.29% Dice). We note that the 97.06% value is a multiclass macro-averaged specificity and is not directly comparable with the binary clinical specificities reported elsewhere (for example, 84.84% for the IOTA rules [63] and 92% in meta-analysis [64]). Most importantly, our architecture achieves an approximately 66.7% reduction in computational parameter load compared with deploying equivalent sequential models, demonstrating significant efficiency for clinical environments. The model also produced interpretability that aligns with clinical guidelines such as IOTA and O-RADS, correctly identifying key malignancy predictors such as vascularization, solid components, and papillary projections. These results were achieved using the loss functions we developed and a progressive training strategy that supported stable multi-task learning.

The genuine contribution of UM-TOTA is not state-of-the-art per-task accuracy, since single-task models outperform it on each individual task in our ablation, but rather a unified, interpretable, and efficient framework. Its central novelty is concept-based interpretability embedded directly in the decision process: each prediction is accompanied by clinically grounded concept activations and attention weights that map to IOTA and O-RADS criteria, allowing clinicians to audit the model’s reasoning using familiar diagnostic descriptors. The primary clinical significance therefore extends beyond model performance, to improving radiologist workflow and trust through the provision of integrated and interpretable predictions within a single system.

Further research is needed to fully realise this potential. We suggest that integrating multi-modal data, such as Doppler flow or 3D ultrasound, represents a viable avenue for enriching the diagnostic context. Furthermore, external validation on multi-centre clinical datasets is necessary to confirm that the model performs consistently across different patient populations and real-world diagnostic environments. We hope this work serves as a model for developing medical AI that effectively supports clinical expertise for better OC patient outcomes.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
AUC	Area Under the Curve
CA125	Cancer Antigen 125
CBM	Concept Bottleneck Model
CI	Confidence Interval
CNN	Convolutional Neural Network
CNN-MTL	CNN-based Multi-Task Learning baseline
DSC	Dice Similarity Coefficient
IoU	Intersection over Union
IOTA	International Ovarian Tumor Analysis
LIME	Local Interpretable Model-agnostic Explanations
MMOTU	Multi-Modality Ovarian Tumor Ultrasound
MTL	Multi-Task Learning
OC	Ovarian Cancer
O-RADS	Ovarian-Adnexal Reporting and Data System
ROC	Receiver Operating Characteristic
SHAP	SHapley Additive exPlanations
TCAV	Testing with Concept Activation Vectors
UM-TOTA	Unified Multi-Task Ovarian Tumor Architecture
ViT	Vision Transformer
VRAM	Video Random Access Memory
XAI	Explainable Artificial Intelligence

Author Contributions

Conceptualization, A.A.M., A.A.H.(clinical guidance) and A.F.; methodology, A.A.M.; software, A.A.M.; validation, A.A.M., D.E. and A.F.; formal analysis, A.A.M.; investigation, A.A.M.; data curation, A.A.M.; writing—original draft preparation, A.A.M.; writing—review and editing, A.A.M., D.E. and A.F.; visualization, A.A.M.; supervision, A.F.; project administration, A.F.; funding acquisition, A.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Petroleum Technology Development Fund (PTDF), Nigeria, under the Overseas Scholarship Scheme.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Departmental Ethics Committee of the University of Strathclyde (Ethics Approval Number: 2457). The Multi-Modality Ovarian Tumor Ultrasound (MMOTU) dataset used in this work was originally collected and released by Zhao et al. [13] under their own institutional ethics approval.

Informed Consent Statement

Not applicable. This study used a publicly available, de-identified retrospective ultrasound image dataset (MMOTU) [13]; no new patient data were collected by the authors.

Data Availability Statement

The Multi-Modality Ovarian Tumor Ultrasound (MMOTU) dataset analyzed in this study is publicly available and was originally released by Zhao et al. [13]. The trained UM-TOTA model weights and reproduction code can be made available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to express their sincere gratitude to Professor Anil Fernando, whose guidance and expertise were instrumental throughout this research. We also extend our appreciation to the Artificial Intelligence Research Group at the Department of Computer and Information Sciences, University of Strathclyde, for their collaborative environment, technical support and insightful discussions.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

World Cancer Research Fund International. Ovarian Cancer Statistics. Accessed. 2022. (accessed on 2026-04-18).
Cancer Research UK. Screening for Ovarian Cancer, 2025. Last reviewed. 17 February 2025. (accessed on 2023-06-21).
Sahu, S.A.; Shrivastava, D. A Comprehensive Review of Screening Methods for Ovarian Masses: Towards Earlier Detection. Cureus 2023, 15, e48534. [Google Scholar] [CrossRef]
Tang, C.; Xu, Z.; Duan, H.; Zhang, S. Advancements in artificial intelligence for ultrasound diagnosis of ovarian cancer: a comprehensive review. Front. Oncol. 2025, 15, 1581157. [Google Scholar] [CrossRef] [PubMed]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE, 2016; pp. 770–778. [Google Scholar] [CrossRef]
Qadir, H.A.; Shin, Y.; Solhusvik, J.; Bergsland, J.; Aabakken, L.; Balasingham, I. Polyp Detection and Segmentation Using Mask R-CNN: Does a Deeper Feature Extractor CNN Always Perform Better? In Proceedings of the 2019 13th International Symposium on Medical Information and Communication Technology (ISMICT), 2019; pp. 1–6. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015); Springer; Lecture Notes in Computer Science; 2015; Vol. 9351, pp. 234–241. [Google Scholar] [CrossRef]
Sanderson, E.; Matuszewski, B.J. FCN-Transformer Feature Fusion for Polyp Segmentation. In Proceedings of the Medical Image Understanding and Analysis (MIUA 2022), Cambridge, UK, 24–26 August 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Vol. 13413, pp. 892–907. [Google Scholar] [CrossRef]
Wang, R.; Cai, Y.; Lee, I.K.; Hu, R.; Purkayastha, S.; Pan, I.; Yi, T.; Tran, T.M.L.; Lu, S.; Liu, T.; et al. Evaluation of a convolutional neural network for ovarian tumor differentiation based on magnetic resonance imaging. Eur. Radiol. 2021, 31, 4960–4971. [Google Scholar] [CrossRef]
Sengupta, D.; Ali, S.N.; Bhattacharya, A.; Mustafi, J.; Mukhopadhyay, A.; Sengupta, K. A deep hybrid learning pipeline for accurate diagnosis of ovarian cancer based on nuclear morphology. In PLOS ONE; Public Library of Science, 2022; Volume 17. [Google Scholar] [CrossRef]
Hsu, S.T.; Su, Y.J.; Hung, C.H.; Chen, M.J.; Lu, C.H.; Kuo, C.E. Automatic ovarian tumors recognition system based on ensemble convolutional neural network with ultrasound imaging. BMC Med. Inform. Decis. Mak. 2022, 22, 298. [Google Scholar] [CrossRef]
Ghoniem, R.M.; Algarni, A.D.; Refky, B.; Ewees, A.A. Multi-Modal Evolutionary Deep Learning Model for Ovarian Cancer Diagnosis Number: 4. In Symmetry; Multidisciplinary Digital Publishing Institute, 2021; Volume 13. [Google Scholar] [CrossRef]
Zhao, Q.; Lyu, S.; Bai, W.; Cai, L.; Liu, B.; Cheng, G.; Wu, M.; Sang, X.; Yang, M.; Chen, L. MMOTU: A Multi-Modality Ovarian Tumor Ultrasound Image Dataset for Unsupervised Cross-Domain Semantic Segmentation 2207.06799 [cs]. 2023. [Google Scholar]
Xu, T.; Farahani, H.; Bashashati, A. Multi-Resolution Vision Transformer for Subtype Classification in Ovarian Cancer Whole-Slide Histopathology Images; 2022. [Google Scholar] [CrossRef]
Alahmadi, A. Towards ovarian cancer diagnostics: A vision transformer-based computer-aided diagnosis framework with enhanced interpretability. Results Eng. 2024, 23, 102651. [Google Scholar] [CrossRef]
Alshdaifat, E.H.; Gharaibeh, H.; Sindiani, A.M.; Madain, R.; Al-Mnayyis, A.M.; Abu Mhanna, H.Y.; Almahmoud, R.E.; Akhdar, H.F.; Amin, M.; Nasayreh, A.; et al. Hybrid vision transformer and Xception model for reliable CT-based ovarian neoplasms diagnosis. Intell.-Based Med. 2025, 11, 100227. [Google Scholar] [CrossRef]
Wei, S.; Hu, Z.; Tan, L. Res-ECA-UNet++: an automatic segmentation model for ovarian tumor ultrasound images based on residual networks and channel attention mechanism. In Frontiers in Medicine; Frontiers, 2025. [Google Scholar] [CrossRef]
Musa, A.A.; Fernando, A. Vision Transformer for Ovarian Tumor Classification: A Comparative Study with CNNs on Ultrasound Imaging. In Proceedings of the 2026 IEEE International Conference on Consumer Electronics (ICCE), Dubai, United Arab Emirates, 2026; pp. 1–6. [Google Scholar] [CrossRef]
Gunning, D.; Aha, D. DARPA’s Explainable Artificial Intelligence (XAI) Program. AI Mag. 2019, 40, 44–58. Number: 2. [Google Scholar] [CrossRef]
Borys, K.; Schmitt, Y.A.; Nauta, M.; Seifert, C.; Krämer, N.; Friedrich, C.M.; Nensa, F. Explainable AI in medical imaging: An overview for clinical practitioners – Saliency-based XAI approaches. Eur. J. Radiol. 2023, 162, 110787. [Google Scholar] [CrossRef]
Pang, W.; Ke, X.; Tsutsui, S.; Wen, B. Integrating Clinical Knowledge into Concept Bottleneck Models. In Proceedings of the Medical Image Computing and Computer Assisted Intervention – MICCAI 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer Nature Switzerland, 2024; pp. 243–253. [Google Scholar] [CrossRef]
Koh, P.W.; Nguyen, T.; Tang, Y.S.; Mussmann, S.; Pierson, E.; Kim, B.; Liang, P. Concept bottleneck models. In Proceedings of the 37th International Conference on Machine Learning (ICML’20); JMLR.org, 2020; Vol. 119, pp. 5338–5348. [Google Scholar]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 2017; pp. 618–626, ISSN 2380-7504. [Google Scholar] [CrossRef]
Patrício, C.; Neves, J.C.; Teixeira, L.F. Explainable Deep Learning Methods in Medical Image Classification: A Survey. ACM Comput. Surv. 2023, 56, 85:1–85:41. [Google Scholar] [CrossRef]
Shen, D.; Wu, G.; Suk, H.I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248. [Google Scholar] [CrossRef]
Wu, C.; Wang, Y.; Wang, F. Deep Learning for Ovarian Tumor Classification with Ultrasound Images. In Proceedings of the Advances in Multimedia Information Processing – PCM 2018; Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W., Eds.; Springer International Publishing, 2018; pp. 395–406. [Google Scholar] [CrossRef]
Wang, H.; Liu, C.; Zhao, Z.; Zhang, C.; Wang, X.; Li, H.; Wu, H.; Liu, X.; Li, C.; Qi, L.; et al. Application of Deep Convolutional Neural Networks for Discriminating Benign, Borderline, and Malignant Serous Ovarian Tumors From Ultrasound Images. Front. Oncol. 2021, 11. [Google Scholar] [CrossRef]
Srivastava, S.; Kumar, P.; Chaudhry, V.; Singh, A. Detection of Ovarian Cyst in Ultrasound Images Using Fine-Tuned VGG-16 Deep Learning Network Number: 2. In SN Computer Science; Springer, 2020; Volume 1, pp. 1–8. [Google Scholar] [CrossRef]
Karimzadeh, M.; Vakanski, A.; Xian, M.; Zhang, B. Post-Hoc Explainability of BI-RADS Descriptors in a Multi-Task Framework for Breast Cancer Detection and Segmentation. In Proceedings of the 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), 2023; pp. 1–6, ISSN 2161-0371. [Google Scholar] [CrossRef]
Christiansen, F.; Konuk, E.; Ganeshan, A.R.; Welch, R.; Palés Huix, J.; Czekierdowski, A.; Leone, F.P.G.; Haak, L.A.; Fruscio, R.; Gaurilcikas, A.; et al. International multicenter validation of AI-driven ultrasound detection of ovarian cancer. Nat. Med. 2025, 31, 189–196. [Google Scholar] [CrossRef] [PubMed]
Su, C.; Miao, K.; Zhang, L.; Yu, X.; Guo, Z.; Li, D.; Xu, M.; Zhang, Q.; Dong, X. Multimodal Deep Learning Based on Ultrasound Images and Clinical Data for Better Ovarian Cancer Diagnosis. J. Imaging Inform. Med. 2025. [Google Scholar] [CrossRef] [PubMed]
Heinrich, M.P. Intra-operative Ultrasound to MRI Fusion with a Public Multimodal Discrete Registration Tool. In Proceedings of the Simulation, Image Processing, and Ultrasound Systems for Assisted Diagnosis and Navigation; Stoyanov, D., Taylor, Z., Aylward, S., Tavares, J.M.R., Xiao, Y., Simpson, A., Martel, A., Maier-Hein, L., Li, S., Rivaz, H., et al., Eds.; Springer International Publishing, 2018; pp. 159–164. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021, 2010.11929 [cs]. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. 2023, 1706.03762 [cs]. [Google Scholar] [CrossRef]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. [cs]. 2021. [Google Scholar] [CrossRef]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. [eess]. 2021. [Google Scholar] [CrossRef]
Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.; Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. 2021, 2103.10504 [eess]. [Google Scholar] [CrossRef]
Li, L.; He, L.; Guo, W.; Ma, J.; Sun, G.; Ma, H. PMFFNet: A hybrid network based on feature pyramid for ovarian tumor segmentation. In PLOS ONE; Public Library of Science, 2024; Volume 19. [Google Scholar] [CrossRef]
Nazir, M.; Shakil, S.; Khurshid, K. End-to-End Multi-task Learning Architecture for Brain Tumor Analysis with Uncertainty Estimation in MRI Images. J. Imaging Inform. Med. 2024, 37, 2149–2172. [Google Scholar] [CrossRef]
Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16); Association for Computing Machinery, 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. 2017, 1705.07874 [cs]. [Google Scholar] [CrossRef]
Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; Sayres, R. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). 2018, 1711.11279 [stat]. [Google Scholar] [CrossRef]
Wang, H.; Hou, J.; Chen, H. Concept Complement Bottleneck Model for Interpretable Medical Image Diagnosis. 2024, 2410.15446 [cs]. [Google Scholar] [CrossRef]
Lucieri, A.; Bajwa, M.N.; Braun, S.A.S.; Malik, M.I.; Dengel, A.; Ahmed, S. ExAID: A multimodal explanation framework for computer-aided diagnosis of skin lesions. Comput. Methods Programs Biomed. 2022, 215, 106620. [Google Scholar] [CrossRef] [PubMed]
Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. 2019, 1811.10154 [stat]. [Google Scholar] [CrossRef] [PubMed]
Timmerman, D.; Valentin, L.; Bourne, T.H.; Collins, W.P.; Verrelst, H.; Vergote, I. Terms, definitions and measurements to describe the sonographic features of adnexal tumors: a consensus opinion from the International Ovarian Tumor Analysis (IOTA) group. Ultrasound Obstet. Gynecol. 2000, 16, 500–505. _eprint. Available online: https://obgyn.onlinelibrary.wiley.com/doi/pdf/10.1046/j.1469-0705.2000.00287.x. [CrossRef]
Andreotti, R.F.; Timmerman, D.; Strachowski, L.M.; Froyman, W.; Benacerraf, B.R.; Bennett, G.L.; Bourne, T.; Brown, D.L.; Coleman, B.G.; Frates, M.C.; et al. O-RADS US Risk Stratification and Management System: A Consensus Guideline from the ACR Ovarian-Adnexal Reporting and Data System Committee. In Radiology; Radiological Society of North America: Publisher, 2020; Volume 294, pp. 168–185. [Google Scholar] [CrossRef]
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI’95), Volume 2; 1995; pp. 1137–1143. [Google Scholar]
Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 0907.4728 [math]. [Google Scholar] [CrossRef]
Roberts, M.; Driggs, D.; Thorpe, M.; Gilbey, J.; Yeung, M.; Ursprung, S.; Aviles-Rivero, A.I.; Etmann, C.; McCague, C.; Beer, L.; et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat. Mach. Intell. 2021, 3, 199–217. [Google Scholar] [CrossRef]
Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. In Nature Methods; Nature Publishing Group: Publisher, 2021; Volume 18, pp. 203–211. [Google Scholar] [CrossRef]
Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Laak, J.A.W.M.v.d.; Ginneken, B.v.; Sánchez, C.I. A Survey on Deep Learning in Medical Image Analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef]
Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Nadeau, C.; Bengio, Y. Inference for the Generalization Error. In Machine Learning; Number: 3 Publisher; Springer, 2003; Volume 52, pp. 239–281. [Google Scholar] [CrossRef]
Bouthillier, X.; Delaunay, P.; Bronzi, M.; Trofimov, A.; Nichyporuk, B.; Szeto, J.; Sepah, N.; Raff, E.; Madan, K.; Voleti, V. Accounting for Variance in Machine Learning Benchmarks 2103.03098 [cs]. 2021. [Google Scholar] [CrossRef]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009; pp. 248–255, ISSN 1063-6919. [Google Scholar] [CrossRef]
Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
Kato, S.; Hotta, K. Adaptive t-vMF Dice Loss for Multi-class Medical Image Segmentation. [eess]. 2022. [Google Scholar] [CrossRef]
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE, 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
Ruder, S. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv 2017, arXiv:1706.05098. [Google Scholar] [CrossRef]
Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
Timmerman, D.; Testa, A.C.; Bourne, T.; Ameye, L.; Jurkovic, D.; Van Holsbeke, C.; Paladini, D.; Van Calster, B.; Vergote, I.; Van Huffel, S.; et al. Simple ultrasound-based rules for the diagnosis of ovarian cancer. Ultrasound Obstet. Gynecol. 2008, 31, 681–690. Available online: https://onlinelibrary.wiley.com/doi/pdf/10.1002/uog.5365. [CrossRef]
Garg, S.; Kaur, A.; Mohi, J.K.; Sibia, P.K.; Kaur, N. Evaluation of IOTA Simple Ultrasound Rules to Distinguish Benign and Malignant Ovarian Tumours. J. Clin. Diagn. Res. JCDR 2017, 11, TC06–TC09. [Google Scholar] [CrossRef]
Mitchell, S.; Nikolopoulos, M.; El-Zarka, A.; Al-Karawi, D.; Al-Zaidi, S.; Ghai, A.; Gaughran, J.E.; Sayasneh, A. Artificial Intelligence in Ultrasound Diagnoses of Ovarian Cancer: A Systematic Review and Meta-Analysis Number: 2. In Cancers; Multidisciplinary Digital Publishing Institute, 2024; Volume 16. [Google Scholar] [CrossRef]
Schäfer, R.; Nicke, T.; Höfener, H.; Lange, A.; Merhof, D.; Feuerhake, F.; Schulz, V.; Lotz, J.; Kiessling, F. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. In Nature Computational Science; Nature Publishing Group, 2024; Volume 4, pp. 495–509. [Google Scholar] [CrossRef]
Rhanoui, M.; Alaoui Belghiti, K.; Mikram, M. Multi-Task Deep Learning for Simultaneous Classification and Segmentation of Cancer Pathologies in Diverse Medical Imaging Modalities Number: 3. In Onco; Multidisciplinary Digital Publishing Institute, 2025; Volume 5. [Google Scholar] [CrossRef]
Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient Surgery for Multi-Task Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual Event, 6–12 December 2020; Curran Associates, Inc., 2020; pp. 5824–5836. [Google Scholar]
Bui, P.N.; Le, D.T.; Bum, J. Multi-scale Feature Enhancement in Multi-task Learning for Medical Image Analysis. arXiv 2024, arXiv:2412.00351. [Google Scholar] [CrossRef]

Figure 1. UM-TOTA high-level multi-task architecture.

Figure 2. Multi-task heads architecture for UM-TOTA showing the four task-specific prediction heads derived from shared ViT backbone features: classification, malignancy detection, segmentation, and concept bottleneck layers.

Figure 3. Clinical Reasoning Module and Ovarian Concept Bottleneck. Ten clinically interpretable concepts are extracted from ViT features, grouped into Boundary/Structure, Tissue/Component, and Clinical Sign categories. Concept logits are transformed to probabilities and attended using a clinical reasoning network, generating malignancy predictions with concept-level explanation.

Figure 4. ROC curves for UM-TOTA evaluated using 5-fold stratified cross-validation. Top row: fold-wise ROC curves for (left) 8-class tumor classification and (right) 3-class malignancy detection. Bottom row: mean ROC curves with standard deviation shading, achieving an average AUC of $0.950 \pm 0.009$ for classification and $0.946 \pm 0.047$ for malignancy detection.

Figure 5. Segmentation performance of UM-TOTA across five folds. (Left): Dice score evolution. (Middle): IoU evolution. (Right): Final Dice and IoU per fold.

Figure 6. Training curves across all five folds for the full UM-TOTA model, showing training and validation loss, classification accuracy, malignancy detection accuracy, Dice score, IoU score, and the OneCycle learning-rate schedule.

Figure 7. Real medical concept interpretability analysis showing correlations between learned clinical concepts and tumor characteristics across ovarian tumor classes, including concept activation patterns, malignancy-relevant concepts, and overall activation distributions.

Figure 8. Example cases illustrating clinical reasoning transparency in UM-TOTA.

Figure 9. Clinical attention weight distributions across tumor classes.

Figure 10. Individual predictions with clinical reasoning produced by the UM-TOTA model. For each case, the input ultrasound image, ground truth segmentation mask, predicted segmentation, class prediction with confidence, and top contributing medical concepts with attention weights are shown.

Figure 11. Ablation study task-specific performance analysis.

Figure 12. Training curves across all folds for the three ablation variants: (a) segmentation_only, (b) no_concepts, (c) class_mal_only.

Table 1. Multi-Task (UM-TOTA) Performance Summary.

Task	Metric	Mean ± Std	Minimum	Maximum
8-Class Classification	Accuracy	80.26% ± 1.10%	79.3%	83.0%
	Precision	81.07% ± 0.89%	–	–
	Sensitivity (Recall)	80.26% ± 1.10%	–	–
	Specificity	97.06% ± 0.15%	–	–
	F1-Score	80.24% ± 1.00%	–	–
3-Class Malignancy	Accuracy	90.88% ± 1.14%	89.4%	91.8%
	Precision	90.94% ± 1.27%	–	–
	Sensitivity (Recall)	90.88% ± 1.14%	–	–
	Specificity	90.41% ± 1.60%	–	–
	F1-Score	90.57% ± 1.22%	–	–
Segmentation	Dice Score	77.29% ± 1.29%	75.1%	79.0%
	IoU	66.57% ± 1.40%	–	–

For multi-class tasks, sensitivity (recall) is the weighted-average recall, which equals overall accuracy; specificity is the one-vs-rest macro-averaged specificity. In the 8-class task, each one-vs-rest comparison therefore treats the other seven classes as negatives, and the reported specificity is the mean over the eight comparisons.
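The two conventions in this note can be made concrete with a small sketch. It computes accuracy, support-weighted recall, and one-vs-rest macro specificity from a confusion matrix, and shows that weighted recall coincides with accuracy. The function name `multiclass_metrics` and the 3-class confusion matrix are hypothetical and chosen only for illustration; they are not results from this study.

```python
def multiclass_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    weighted_recall = 0.0
    specificities = []
    for c in range(k):
        tp = cm[c][c]
        support = sum(cm[c])                         # true samples of class c
        fp = sum(cm[i][c] for i in range(k)) - tp    # other classes predicted as c
        fn = support - tp
        tn = total - tp - fp - fn                    # all remaining classes are negatives
        recall = tp / support if support else 0.0
        weighted_recall += (support / total) * recall
        specificities.append(tn / (tn + fp) if (tn + fp) else 0.0)

    macro_specificity = sum(specificities) / k       # one-vs-rest macro average
    return accuracy, weighted_recall, macro_specificity


# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm = [
    [50, 3, 2],
    [4, 30, 6],
    [1, 5, 19],
]
accuracy, weighted_recall, macro_specificity = multiclass_metrics(cm)
print(f"accuracy={accuracy:.4f}, weighted recall={weighted_recall:.4f}, "
      f"macro specificity={macro_specificity:.4f}")
```

Because each class recall is weighted by its share of the samples, the weighted sum reduces to the total number of correct predictions divided by the total number of samples, which is why the Accuracy and Sensitivity rows in Table 1 are identical.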

Table 2. Benchmark comparison of UM-TOTA against existing ovarian ultrasound and related medical imaging approaches.

Study	Accuracy	Sensitivity	Specificity	Interpretability	Multi-Task?
Zhao et al. [13]	80.60%	–	–	×	×
Mitchell et al. [64]	–	81.00%	92.00%	–	×
Garg et al. [63] (IOTA Rules)	86.66%	91.66%	84.84%	Manual Rules	×
Christiansen et al. [30]	–	89.31%	82.67%	–	×
Karimzadeh et al. [29]	91.30%	94.00%	85.80%	Post-hoc (SHAP)	✓
Nazir et al. [39]	95.10%	–	–	×	✓
UM-TOTA (8-Class)	80.26%	80.26%	97.06%	Inherent	✓
UM-TOTA (Malignancy)	90.88%	90.88%	90.41%	Inherent	✓

Table 3. Performance comparison of UM-TOTA (ViT-Base/16) and CNN-MTL (ResNet50) baselines under identical 5-fold stratified cross-validation. ns = not statistically significant ($p > 0.05$, paired t-test).

Model	Backbone	8-Class Accuracy	Malignancy Accuracy	Dice Score
CNN-MTL	ResNet50	79.31% ± 2.96%	90.13% ± 0.88%	77.71% ± 0.52%
UM-TOTA	ViT-Base/16	80.26% ± 1.10%	90.88% ± 1.14%	77.29% ± 1.29%
p-value (paired t-test)	–	0.178 (ns)	0.396 (ns)	0.463 (ns)

Table 4. Deployment efficiency: unified UM-TOTA versus sequential single-task ViT models. Measurements on RTX 4070 (12 GB VRAM), PyTorch 2.5.1. Latency = mean ± std over 100 inference runs after 10 warmup runs.

Metric	Sequential (3×ViT)	UM-TOTA (Unified)	Improvement
Parameters	259.8 M	86.6 M	66.7% reduction
Latency (ms/img)	26.01 ms	$8.98 \pm 1.30$ ms	65.5% reduction
GPU Memory (MB)	20.8 MB	7.4 MB	64.2% reduction
Throughput (img/s)	48.9	147.9	202.7% increase
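The latency protocol stated in the Table 4 caption (mean ± std over 100 timed runs after 10 warmup runs) can be sketched as follows. This is a minimal, framework-agnostic illustration and not the benchmarking script used for Table 4: `measure_latency_ms` and `dummy_inference` are hypothetical names, and the dummy workload merely stands in for a model forward pass. On a GPU, the timed call would additionally have to wait for the device to finish (for example with `torch.cuda.synchronize()` in PyTorch) before the clock is read.

```python
import statistics
import time


def measure_latency_ms(fn, n_warmup=10, n_runs=100):
    """Return (mean, sample std) of the wall-clock latency of fn() in milliseconds."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2 to compute a standard deviation")
    for _ in range(n_warmup):          # warmup runs are executed but not timed
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)


def dummy_inference():
    # Hypothetical stand-in for a single-image forward pass.
    return sum(i * i for i in range(10_000))


mean_ms, std_ms = measure_latency_ms(dummy_inference)
print(f"latency: {mean_ms:.3f} ± {std_ms:.3f} ms")
```

Discarding the warmup runs keeps one-off costs such as memory allocation and kernel compilation out of the reported mean.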

Table 5. Concept activation analysis: mean activations for benign and malignant tumours. Positive importance = malignancy indicator; negative = benign indicator. All concepts statistically significant ($p < 0.05$). Bold values indicate higher activation per concept.

Concept (IOTA/O-RADS)	Benign	Malignant	Importance	p-value
Vascularization	0.167	0.399	$+0.375$	$5.48 \times 10^{-96}$
Papillary projections	0.072	0.204	$+0.270$	$3.41 \times 10^{-90}$
Solid components	0.313	0.644	$+0.262$	$3.40 \times 10^{-8}$
Ascites presence	0.068	0.199	$+0.192$	$1.13 \times 10^{-87}$
Boundary clarity	0.669	0.626	$-0.355$	$1.63 \times 10^{-7}$
Cystic components	0.606	0.356	$-0.338$	$1.93 \times 10^{-8}$
Homogeneous texture	0.687	0.493	$-0.329$	$2.79 \times 10^{-19}$
Acoustic shadowing	0.259	0.405	$-0.208$	$1.34 \times 10^{-6}$
Posterior enhancement	0.475	0.257	$-0.190$	$3.19 \times 10^{-6}$
Shape regularity	0.405	0.641	$+0.031$	$9.28 \times 10^{-3}$

Table 6. Paired t-test results comparing the unified UM-TOTA model against specialized baselines and ablation variants across tasks.

Task	Model Configuration	Metric Value	Compared Model Value	t-stat	p-value
8-Class Classification	UM-TOTA (Full Model)	80.26% ± 1.10%	–	–	–
	vs. Specialized (class_mal_only)	–	80.67% ± 2.25%	$-0.43$	$0.692$
	vs. Ablated (no_concepts)	–	82.44% ± 2.17%	$1.85$	$0.139$
3-Class Malignancy	UM-TOTA (Full Model)	90.88% ± 1.14%	–	–	–
	vs. Specialized (class_mal_only)	–	91.22% ± 1.37%	$-0.85$	$0.445$
	vs. Ablated (no_concepts)	–	92.38% ± 0.95%	$1.61$	$0.182$
Segmentation	UM-TOTA (Full Model)	77.29% ± 1.29%	–	–	–
	vs. Specialized (segmentation_only)	–	80.88% ± 1.04%	$-9.88$	$< 0.001$ ***
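The paired comparison behind Table 6 can be sketched as follows, assuming the test pairs per-fold scores from the 5-fold cross-validation so that there are 4 degrees of freedom. The fold scores in the example are hypothetical placeholders, not values from the study, and the function names are illustrative. The p-value helper uses the closed-form CDF of Student's t distribution for exactly 4 degrees of freedom; in practice a library routine such as `scipy.stats.ttest_rel` would normally be used instead.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched per-fold scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


def two_sided_p_df4(t_stat):
    """Two-sided p-value for a t statistic with exactly 4 degrees of freedom."""
    scale = 1.0 + t_stat * t_stat / 4.0
    s = abs(t_stat) / math.sqrt(scale)
    x = t_stat * t_stat / scale
    cdf = 0.5 + 0.375 * s * (1.0 - x / 12.0)  # closed-form CDF for df = 4
    return 2.0 * (1.0 - cdf)


# Hypothetical per-fold accuracies (%), for illustration only.
unified_folds = [80.0, 81.0, 79.0, 82.0, 80.0]
baseline_folds = [81.0, 83.0, 82.0, 86.0, 85.0]

t_stat, dof = paired_t_statistic(unified_folds, baseline_folds)
print(f"t = {t_stat:.2f}, df = {dof}, p = {two_sided_p_df4(t_stat):.3f}")

# Feeding the rounded t statistics from Table 6 into the df = 4 formula.
for reported_t in (-0.43, 1.85, -0.85, 1.61, -9.88):
    print(f"t = {reported_t:+.2f} -> p = {two_sided_p_df4(reported_t):.4f}")
```

Applying the 4-degree-of-freedom formula to the rounded t statistics in Table 6 gives p-values close to the tabulated ones (for example, about 0.138 for t = 1.85 against the reported 0.139), which is consistent with a test over five paired folds.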

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the authors and the preprint are cited in any reuse.
