3. Results
A. Proton MRI
1) Full dataset experiment
Performance evaluation on the complete proton MRI dataset revealed the following quantitative results across all model architectures. MedSAM demonstrated a Dice Similarity Coefficient of 0.92 with a range from 0.73 to 0.96, while the Average Hausdorff Distance measured 1.18 with values ranging between 0.55 and 4.15. The 95th percentile Hausdorff Distance was recorded at 3.27, spanning from 1.41 to 10.0, and the XOR error reached 0.14 with a range of 0.05 to 0.55.
SAM exhibited a DSC of 0.92, ranging from 0.74 to 0.96, with an Average HD of 1.20 extending from 0.60 to 4.18. The HD95 measurement was 3.31, varying between 1.41 and 10.63, while the XOR error was 0.14, ranging from 0.06 to 0.53.
Among the traditional deep learning architectures, FPN-MIT-B5 achieved a DSC of 0.92, with values spanning from 0.77 to 0.96. The Average HD was measured at 1.24, ranging from 0.56 to 3.81, while HD95 reached 3.38 with a span of 1.41 to 9.62. The XOR error was recorded at 0.14, varying between 0.06 and 0.46.
UNet-VGG19 demonstrated a DSC of 0.92, ranging from 0.72 to 0.97, accompanied by an Average HD of 1.31 that extended from 0.47 to 4.3. The HD95 measurement was 3.59, spanning from 1.24 to 10.77, and the XOR error reached 0.16 with values between 0.05 and 0.57.
DeepLabV3-ResNet152 recorded a DSC of 0.92, ranging from 0.73 to 0.96, with an Average HD of 1.27 that varied from 0.64 to 4.31. The HD95 was measured at 3.64, extending from 2.0 to 10.15, while the XOR error was 0.16, ranging from 0.07 to 0.55.
Table 1 summarizes the metrics for the proton MRI full dataset experiment.
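For orientation, the two overlap metrics reported here can be sketched for binary masks as follows. This is a minimal illustration, not the evaluation code used in the study. The function names are hypothetical, and the XOR error is assumed to be the number of disagreeing pixels divided by the ground-truth area. That convention allows values above 1, as reported for some limited dataset cases, but this section does not define the metric.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient for two binary masks (nested lists of 0/1)."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    intersection = sum(a and b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def xor_error(pred, truth):
    """Disagreeing pixels divided by ground-truth area (assumed normalization)."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    disagreement = sum(a != b for a, b in zip(p, t))
    truth_area = sum(t)
    if truth_area == 0:
        raise ValueError("ground-truth mask is empty")
    return disagreement / truth_area


# Toy masks for illustration only (not study data).
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 0, 0]]
print(dice_coefficient(pred, truth))  # 0.75
print(xor_error(pred, truth))         # 0.5
```

In the toy case, three of the four ground-truth pixels are recovered and one false positive is added, so the DSC is 2·3/(4+4) and the XOR error is 2/4.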
Statistical analysis using Friedman tests with Bonferroni-corrected post-hoc comparisons yielded specific p-values for pairwise model comparisons. For the Dice Similarity Coefficient metric, the comparison between DeepLabV3-ResNet152 and UNet-VGG19 resulted in a p-value of 0.003, while DeepLabV3-ResNet152 compared to FPN-MIT-B5 produced a p-value less than 0.001. The comparison of DeepLabV3-ResNet152 with SAM yielded a p-value less than 0.001, as did the comparison with MedSAM. UNet-VGG19 versus FPN-MIT-B5 resulted in a p-value of 1.000, while UNet-VGG19 compared to SAM produced a p-value of 0.173 and to MedSAM yielded 0.124. The comparison between FPN-MIT-B5 and SAM resulted in a p-value of 1.000, as did the comparison between FPN-MIT-B5 and MedSAM. The SAM versus MedSAM comparison produced a p-value of 1.000.
For Average Hausdorff Distance comparisons, MedSAM versus SAM yielded a p-value of 1.000, while MedSAM compared to FPN-MIT-B5 resulted in 0.518. MedSAM versus UNet-VGG19 produced a p-value of 0.001, and MedSAM compared to DeepLabV3-ResNet152 yielded a p-value less than 0.001. SAM versus FPN-MIT-B5 resulted in a p-value of 1.000, while SAM compared to UNet-VGG19 produced 0.026 and to DeepLabV3-ResNet152 yielded a p-value less than 0.001. FPN-MIT-B5 versus UNet-VGG19 resulted in a p-value of 0.430, while FPN-MIT-B5 compared to DeepLabV3-ResNet152 produced 0.010. UNet-VGG19 versus DeepLabV3-ResNet152 yielded a p-value of 1.000.
The 95th percentile Hausdorff Distance comparisons showed that MedSAM versus SAM produced a p-value of 1.000, while MedSAM compared to FPN-MIT-B5 resulted in 0.808. MedSAM versus UNet-VGG19 yielded a p-value of 0.002, and MedSAM compared to DeepLabV3-ResNet152 produced a p-value less than 0.001. SAM versus FPN-MIT-B5 resulted in a p-value of 1.000, while SAM compared to UNet-VGG19 yielded 0.027 and to DeepLabV3-ResNet152 produced a p-value less than 0.001. FPN-MIT-B5 versus UNet-VGG19 resulted in a p-value of 0.543, while FPN-MIT-B5 compared to DeepLabV3-ResNet152 yielded 0.002. UNet-VGG19 versus DeepLabV3-ResNet152 produced a p-value of 0.650.
For XOR error comparisons, MedSAM versus SAM resulted in a p-value of 1.000, as did MedSAM compared to FPN-MIT-B5. MedSAM versus UNet-VGG19 yielded a p-value of 0.010, while MedSAM compared to DeepLabV3-ResNet152 produced a p-value less than 0.001. SAM versus FPN-MIT-B5 resulted in a p-value of 1.000, while SAM compared to UNet-VGG19 yielded 0.045 and to DeepLabV3-ResNet152 produced a p-value less than 0.001. FPN-MIT-B5 versus UNet-VGG19 resulted in a p-value of 0.843, while FPN-MIT-B5 compared to DeepLabV3-ResNet152 yielded a p-value less than 0.001. UNet-VGG19 versus DeepLabV3-ResNet152 produced a p-value of 0.027.
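The omnibus step of this analysis can be illustrated with a small, self-contained sketch of the Friedman chi-square statistic for k models scored on the same n test cases. It rests on several assumptions: the helper names are hypothetical, no tie correction is applied, the p-value uses a closed form that is exact only for an even number of degrees of freedom (k = 5 models gives 4), and the scores are made up for illustration rather than taken from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..k, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square for scores[case][model] (no tie correction)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for model, rank in enumerate(average_ranks(row)):
            rank_sums[model] += rank
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)


def chi2_sf_even_df(x, df):
    """Chi-square survival function; this closed form holds only for even df."""
    if df % 2:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Illustrative DSC values: 4 test cases (rows) by 5 models (columns).
scores = [
    [0.70, 0.75, 0.80, 0.85, 0.90],
    [0.60, 0.72, 0.78, 0.88, 0.91],
    [0.65, 0.70, 0.82, 0.86, 0.93],
    [0.71, 0.74, 0.79, 0.84, 0.92],
]
q = friedman_statistic(scores)
p = chi2_sf_even_df(q, df=len(scores[0]) - 1)
print(q)            # 16.0
print(round(p, 4))  # 0.003
```

Because every toy case ranks the models identically, the statistic reaches its maximum of n(k − 1) = 16. Real data with mixed rankings would give a smaller value.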
2) Limited dataset experiment
When trained on the reduced dataset containing 25% of the original training data, performance diverged markedly across model architectures. MedSAM achieved a Dice Similarity Coefficient of 0.92, with values ranging from 0.81 to 0.97, while the Average Hausdorff Distance was measured at 1.33, extending from 0.54 to 3.50. The 95th percentile Hausdorff Distance reached 3.68, spanning from 1.41 to 10.04, and the XOR error was recorded at 0.15 with a range of 0.04 to 0.42.
SAM exhibited a DSC of 0.92, ranging from 0.80 to 0.97, accompanied by an Average HD of 1.40 that varied between 0.55 and 4.11. The HD95 measurement was 3.94, extending from 2.0 to 13.89, while the XOR error reached 0.16, spanning from 0.05 to 0.44.
Among the traditional architectures, UNet-VGG19 demonstrated a DSC of 0.79, with values ranging from 0.56 to 0.89. The Average HD was measured at 3.32, extending from 1.2 to 7.45, while HD95 reached 9.23 with a span of 3.60 to 20.44. The XOR error was recorded at 0.34, varying between 0.19 and 0.60.
DeepLabV3-ResNet152 achieved a DSC of 0.71, ranging from 0.16 to 0.83, with an Average HD of 6.68 that extended from 3.64 to 13.11. The HD95 measurement was 17.24, spanning from 8.54 to 45.75, and the XOR error reached 0.74 with values between 0.39 and 1.18.
FPN-MIT-B5 recorded a DSC of 0.53, ranging from 0.30 to 0.75, accompanied by an Average HD of 5.28 that varied from 2.26 to 9.06. The HD95 was measured at 18.55, extending from 8.67 to 33.34, while the XOR error was 0.62, ranging from 0.40 to 0.81.
Table 2.
Performance metrics for all model architectures on proton MRI limited dataset experiment (25% of training data). Values are presented as mean (range) for Dice Similarity Coefficient (DSC), Average Hausdorff Distance (Avg_HD), 95th percentile Hausdorff Distance (HD95), and XOR error.
| Model | DSC | Avg_HD | HD95 | XOR |
|---|---|---|---|---|
| DeepLabV3-ResNet152 | 0.71 (0.16–0.83) | 6.68 (3.64–13.11) | 17.24 (8.54–45.75) | 0.74 (0.39–1.18) |
| FPN-MIT-B5 | 0.53 (0.30–0.75) | 5.28 (2.26–9.06) | 18.55 (8.67–33.34) | 0.62 (0.40–0.81) |
| MedSAM | 0.92 (0.81–0.97) | 1.33 (0.54–3.50) | 3.68 (1.41–10.04) | 0.15 (0.04–0.42) |
| SAM | 0.92 (0.80–0.97) | 1.40 (0.55–4.11) | 3.94 (2.0–13.89) | 0.16 (0.05–0.44) |
| UNet-VGG19 | 0.79 (0.56–0.89) | 3.32 (1.2–7.45) | 9.23 (3.60–20.44) | 0.34 (0.19–0.60) |
Statistical analysis using Friedman tests with Bonferroni correction revealed specific p-values for pairwise comparisons under limited data conditions. For Dice Similarity Coefficient comparisons, FPN-MIT-B5 versus DeepLabV3-ResNet152 resulted in a p-value of 0.008, while FPN-MIT-B5 compared to UNet-VGG19 produced a p-value less than 0.001. FPN-MIT-B5 versus SAM yielded a p-value less than 0.001, as did the comparison with MedSAM. DeepLabV3-ResNet152 compared to UNet-VGG19 resulted in a p-value of 0.168, while DeepLabV3-ResNet152 versus SAM and MedSAM both produced p-values less than 0.001. UNet-VGG19 compared to both SAM and MedSAM yielded p-values less than 0.001, while SAM versus MedSAM resulted in a p-value of 1.000.
Average Hausdorff Distance comparisons showed that SAM versus MedSAM produced a p-value of 1.000, while both foundational models compared to UNet-VGG19, FPN-MIT-B5, and DeepLabV3-ResNet152 all yielded p-values less than 0.001. Among traditional models, UNet-VGG19 versus FPN-MIT-B5 resulted in a p-value of 0.002, UNet-VGG19 compared to DeepLabV3-ResNet152 produced a p-value less than 0.001, while FPN-MIT-B5 versus DeepLabV3-ResNet152 yielded a p-value of 1.000.
For 95th percentile Hausdorff Distance measurements, MedSAM versus SAM resulted in a p-value of 1.000, while both foundational models compared to all traditional architectures produced p-values less than 0.001. Among traditional models, all pairwise comparisons yielded p-values less than 0.001, except for DeepLabV3-ResNet152 versus FPN-MIT-B5 which resulted in a p-value of 1.000.
XOR error comparisons revealed that MedSAM versus SAM produced a p-value of 1.000, while both foundational models compared to all traditional architectures yielded p-values less than 0.001. Among traditional models, UNet-VGG19 versus both FPN-MIT-B5 and DeepLabV3-ResNet152 produced p-values less than 0.001, while FPN-MIT-B5 versus DeepLabV3-ResNet152 resulted in a p-value of 1.000.
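Many of the adjusted p-values reported here are exactly 1.000. This is what a Bonferroni adjustment produces when a raw p-value is multiplied by the number of comparisons and the result is capped at 1. The sketch below shows the adjustment for the 10 pairwise comparisons among five models. It is an illustration under assumptions: the section does not state which post-hoc test produced the raw p-values, the function name is hypothetical, and the raw values are invented for the example.

```python
from itertools import combinations

MODELS = ["MedSAM", "SAM", "UNet-VGG19", "FPN-MIT-B5", "DeepLabV3-ResNet152"]


def bonferroni_adjust(raw_p_values):
    """Multiply each raw p-value by the number of comparisons and cap at 1."""
    m = len(raw_p_values)
    return {pair: min(1.0, p * m) for pair, p in raw_p_values.items()}


pairs = list(combinations(MODELS, 2))
print(len(pairs))  # 10 pairwise comparisons among 5 models

# Illustrative raw p-values (not study data): one small, the rest large.
raw = {pair: 0.40 for pair in pairs}
raw[("MedSAM", "UNet-VGG19")] = 0.00005
adjusted = bonferroni_adjust(raw)
print(adjusted[("MedSAM", "SAM")])                   # 1.0 (0.40 * 10, capped at 1)
print(round(adjusted[("MedSAM", "UNet-VGG19")], 4))  # 0.0005
```

Under this scheme any raw p-value of 0.1 or larger is reported as 1.000, so an adjusted value of 1.000 indicates only that no difference was detected, not that the two models performed identically.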
B. Hyperpolarized Gas MRI
1) Full dataset experiment
Performance evaluation on the complete hyperpolarized gas MRI dataset yielded the following quantitative measurements. SAM demonstrated a Dice Similarity Coefficient of 0.92, with values ranging from 0.77 to 0.97, while the Average Hausdorff Distance measured 1.17, extending from 0.47 to 3.40. The 95th percentile Hausdorff Distance was recorded at 3.60, spanning from 1.0 to 8.0, and the XOR error reached 0.16 with a range of 0.05 to 0.56.
MedSAM exhibited a DSC of 0.92, ranging from 0.77 to 0.96, accompanied by an Average HD of 1.15 that varied between 0.52 and 3.57. The HD95 measurement was 3.75, extending from 1.0 to 10.0, while the XOR error was 0.16, ranging from 0.06 to 0.56.
Among the traditional deep learning architectures, FPN-MIT-B5 achieved a DSC of 0.91, with values spanning from 0.76 to 0.97. The Average HD was measured at 1.18, ranging from 0.45 to 3.21, while HD95 reached 3.63 with a span of 1.41 to 8.54. The XOR error was recorded at 0.17, varying between 0.05 and 0.61.
UNet-VGG19 demonstrated a DSC of 0.91, ranging from 0.74 to 0.97, with an Average HD of 1.24 that extended from 0.42 to 4.08. The HD95 measurement was 3.81, spanning from 1.0 to 11.40, and the XOR error reached 0.17 with values between 0.05 and 0.68.
DeepLabV3-ResNet152 recorded a DSC of 0.91, ranging from 0.74 to 0.97, accompanied by an Average HD of 1.29 that varied from 0.61 to 3.56. The HD95 was measured at 4.13, extending from 1.41 to 10.0, while the XOR error was 0.19, ranging from 0.06 to 0.70.
Table 3 summarizes the quantitative results for hyperpolarized gas MRI segmentation in the full dataset experiment.
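The two boundary-distance metrics can be sketched on contour point sets as shown below. Definitions of both vary between toolkits and this section does not specify them, so the choices here are assumptions: the Average HD is taken as the mean of the two directed average nearest-point distances, HD95 as the larger of the two directed 95th percentiles, and the percentile uses the nearest-rank rule. Function names and the toy contours are hypothetical.

```python
import math


def nearest_distances(source, target):
    """Distance from each source point to its closest target point."""
    return [min(math.dist(s, t) for t in target) for s in source]


def average_hausdorff(a, b):
    """Mean of the two directed average distances (one common convention)."""
    d_ab = nearest_distances(a, b)
    d_ba = nearest_distances(b, a)
    return (sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)) / 2.0


def percentile_nearest_rank(values, q):
    """q-th percentile by the nearest-rank rule (assumed; other rules interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hd95(a, b):
    """Larger of the two directed 95th-percentile distances."""
    return max(percentile_nearest_rank(nearest_distances(a, b), 95),
               percentile_nearest_rank(nearest_distances(b, a), 95))


# Toy contours: b is a shifted one pixel to the right.
a = [(0, 0), (0, 1), (0, 2), (0, 3)]
b = [(1, 0), (1, 1), (1, 2), (1, 3)]
print(average_hausdorff(a, b))  # 1.0
print(hd95(a, b))               # 1.0
```

The brute-force nearest-point search is quadratic in the number of contour points. That is adequate for a sketch, but large masks would typically use a distance transform or a spatial index instead.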
Statistical analysis using Friedman tests with Bonferroni correction yielded the following p-values for pairwise model comparisons. For Dice Similarity Coefficient measurements, DeepLabV3-ResNet152 compared to FPN-MIT-B5 resulted in a p-value of 0.001, while DeepLabV3-ResNet152 versus UNet-VGG19, MedSAM, and SAM all produced p-values less than 0.001. FPN-MIT-B5 compared to UNet-VGG19 yielded a p-value of 1.000, while FPN-MIT-B5 versus MedSAM resulted in 0.701 and versus SAM produced 0.252. UNet-VGG19 compared to both MedSAM and SAM yielded p-values of 1.000, while MedSAM versus SAM resulted in a p-value of 1.000.
Average Hausdorff Distance comparisons revealed that MedSAM and SAM each compared to DeepLabV3-ResNet152 produced p-values of 0.001. MedSAM versus FPN-MIT-B5 and SAM versus FPN-MIT-B5 both resulted in p-values of 1.000, while MedSAM versus SAM yielded a p-value of 1.000. MedSAM and SAM each compared to UNet-VGG19 produced p-values of 0.497, while FPN-MIT-B5 versus UNet-VGG19 resulted in a p-value of 1.000. FPN-MIT-B5 compared to DeepLabV3-ResNet152 yielded a p-value of 0.041, while UNet-VGG19 versus DeepLabV3-ResNet152 produced 0.442.
For 95th percentile Hausdorff Distance measurements, SAM compared to FPN-MIT-B5 and MedSAM both resulted in p-values of 1.000. SAM versus UNet-VGG19 yielded a p-value of 0.826, while SAM compared to DeepLabV3-ResNet152 produced 0.009. FPN-MIT-B5 versus MedSAM and UNet-VGG19 both resulted in p-values of 1.000, while FPN-MIT-B5 compared to DeepLabV3-ResNet152 yielded 0.052. MedSAM versus UNet-VGG19 produced a p-value of 1.000, while MedSAM compared to DeepLabV3-ResNet152 resulted in 0.180. UNet-VGG19 versus DeepLabV3-ResNet152 yielded a p-value of 1.000.
XOR error comparisons showed that SAM versus MedSAM resulted in a p-value of 1.000, while SAM compared to UNet-VGG19 yielded 0.968 and to FPN-MIT-B5 produced 0.119. SAM versus DeepLabV3-ResNet152 resulted in a p-value less than 0.001. MedSAM compared to UNet-VGG19 and FPN-MIT-B5 both yielded p-values of 1.000, while MedSAM versus DeepLabV3-ResNet152 produced a p-value less than 0.001. UNet-VGG19 compared to FPN-MIT-B5 resulted in a p-value of 1.000, while UNet-VGG19 versus DeepLabV3-ResNet152 yielded a p-value less than 0.001. FPN-MIT-B5 compared to DeepLabV3-ResNet152 produced a p-value of 0.001.
2) Limited dataset experiment
When evaluated on the reduced hyperpolarized gas MRI dataset containing 25% of the original training data, the model architectures exhibited the following performance characteristics. MedSAM achieved a Dice Similarity Coefficient of 0.88, with values ranging from 0.71 to 0.95, while the Average Hausdorff Distance measured 1.62, extending from 0.68 to 4.00. The 95th percentile Hausdorff Distance was recorded at 5.48, spanning from 1.41 to 12.92, and the XOR error reached 0.23 with a range of 0.08 to 0.52.
SAM demonstrated a DSC of 0.88, ranging from 0.62 to 0.94, accompanied by an Average HD of 1.78 that varied between 0.80 and 4.15. The HD95 measurement was 6.42, extending from 2.23 to 17.72, while the XOR error was 0.24, spanning from 0.10 to 0.54.
Among the traditional architectures, UNet-VGG19 achieved a DSC of 0.76, with values ranging from 0.19 to 0.93. The Average HD was measured at 2.64, extending from 1.00 to 7.06, while HD95 reached 11.39 with a span of 2.23 to 57.72. The XOR error was recorded at 0.38, varying between 0.13 and 1.02.
DeepLabV3-ResNet152 demonstrated a DSC of 0.74, ranging from 0.07 to 0.87, with an Average HD of 2.97 that extended from 1.71 to 4.58. The HD95 measurement was 12.97, spanning from 7.0 to 68.80, and the XOR error reached 0.46 with values between 0.22 and 0.98.
FPN-MIT-B5 recorded a DSC of 0.65, ranging from 0.05 to 0.89, accompanied by an Average HD of 3.53 that varied from 1.62 to 8.14. The HD95 was measured at 15.35, extending from 4.0 to 76.41, while the XOR error was 0.51, ranging from 0.21 to 0.97.
Table 4 summarizes the quantitative results for hyperpolarized gas MRI segmentation in the limited dataset experiment.
Statistical analysis using Friedman tests with Bonferroni correction revealed the following p-values for pairwise model comparisons under limited data conditions. For Dice Similarity Coefficient measurements, FPN-MIT-B5 compared to DeepLabV3-ResNet152 resulted in a p-value of 0.649, while FPN-MIT-B5 versus UNet-VGG19 produced a p-value of 0.001. FPN-MIT-B5 compared to SAM and MedSAM both yielded p-values less than 0.001. DeepLabV3-ResNet152 versus UNet-VGG19 resulted in a p-value of 0.409, while DeepLabV3-ResNet152 compared to both SAM and MedSAM produced p-values less than 0.001. UNet-VGG19 versus both SAM and MedSAM yielded p-values less than 0.001, while SAM compared to MedSAM resulted in a p-value of 1.000.
Average Hausdorff Distance comparisons showed that MedSAM versus SAM produced a p-value of 1.000, while MedSAM compared to FPN-MIT-B5, UNet-VGG19, and DeepLabV3-ResNet152 all resulted in p-values less than 0.001. SAM versus FPN-MIT-B5 yielded a p-value of 0.016, while SAM compared to UNet-VGG19 and DeepLabV3-ResNet152 both produced p-values less than 0.001. Among traditional models, FPN-MIT-B5 versus UNet-VGG19 resulted in a p-value of 1.000, FPN-MIT-B5 compared to DeepLabV3-ResNet152 yielded 0.046, while UNet-VGG19 versus DeepLabV3-ResNet152 produced 0.993.
For 95th percentile Hausdorff Distance measurements, MedSAM versus SAM resulted in a p-value of 0.062, while MedSAM compared to UNet-VGG19, FPN-MIT-B5, and DeepLabV3-ResNet152 all produced p-values less than 0.001. SAM versus UNet-VGG19 yielded a p-value of 0.518, SAM compared to FPN-MIT-B5 resulted in 0.192, while SAM versus DeepLabV3-ResNet152 produced a p-value less than 0.001. Among traditional models, UNet-VGG19 versus FPN-MIT-B5 resulted in a p-value of 1.000, UNet-VGG19 compared to DeepLabV3-ResNet152 yielded 0.092, while FPN-MIT-B5 versus DeepLabV3-ResNet152 produced 0.272.
XOR error comparisons revealed that MedSAM versus SAM resulted in a p-value of 1.000, while MedSAM compared to UNet-VGG19, FPN-MIT-B5, and DeepLabV3-ResNet152 all yielded p-values less than 0.001. SAM versus UNet-VGG19 produced a p-value of 0.001, while SAM compared to FPN-MIT-B5 and DeepLabV3-ResNet152 both resulted in p-values less than 0.001. Among traditional models, UNet-VGG19 versus FPN-MIT-B5 yielded a p-value of 1.000, UNet-VGG19 compared to DeepLabV3-ResNet152 produced 0.122, while FPN-MIT-B5 versus DeepLabV3-ResNet152 resulted in 0.147.