Submitted:
06 October 2023
Posted:
10 October 2023
Abstract
Keywords:
1. Introduction
2. Ensemble Approaches
- for different tasks, including image classification, detection, and segmentation,
- in several application domains, including healthcare, speech analysis, forecasting, fraud prevention, and information retrieval.
3. Materials and Methods
3.1. Loss Functions
- The Generalized Dice Loss is a multiclass variant of the Dice Loss.
- The Tversky Loss is a weighted version of the Tversky index designed to deal with unbalanced classes.
- The Focal Tversky Loss is a variant of the Tversky Loss in which a modulating factor ensures that the model focuses on hard samples rather than on properly classified examples.
- The Focal Generalized Dice Loss is the focal version of the Generalized Dice Loss.
- The Log-Cosh Dice Loss is a combination of the Dice Loss and the Log-Cosh function, applied with the purpose of smoothing the loss curve.
- The Log-Cosh Focal Tversky Loss is based on the same idea of smoothing, here applied to the Focal Tversky Loss.
- The SSIM Loss is obtained from the Structural similarity (SSIM) index, usually adopted to evaluate the quality of an image.
- The MS-SSIM Loss is a variant of the SSIM Loss defined using the Multiscale structural similarity (MS-SSIM) index.
The losses described above can be combined in different ways (the four combination formulas are not reproduced here).
- The Boundary Enhancement Loss explicitly focuses on the boundary areas during training. A Laplacian filter is used to generate strong responses around the boundaries and zero everywhere else. We combine the Dice Loss, the Boundary Enhancement Loss, and the Structure Loss in a weighted sum.
- The Structure Loss is a combination of the weighted Intersection over Union (IoU) loss and the weighted binary cross-entropy (BCE) loss. We refer the reader to [10] for details. The weights in this loss function are determined by the importance of each pixel, which is calculated from the difference between the center pixel and its surrounding pixels. To give more importance to the binary cross-entropy loss, we assign it a weight of 2, so the Structure Loss is the weighted IoU loss plus twice the weighted BCE loss.
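As an illustration of the overlap-based losses listed above, the following minimal sketch computes the Dice, Tversky, Focal Tversky, and Log-Cosh Dice losses on flattened masks. It uses commonly cited formulations, which may differ in detail from the ones adopted in this work; the function names, the smoothing term `eps`, and the default values of `alpha`, `beta`, and `gamma` are assumptions made for the example.

```python
import math

def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient for flat lists of values in [0, 1]."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def tversky_index(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """TP / (TP + alpha*FP + beta*FN); alpha and beta are assumed values."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(pred, target, alpha=0.3, beta=0.7):
    return 1.0 - tversky_index(pred, target, alpha, beta)

def focal_tversky_loss(pred, target, alpha=0.3, beta=0.7, gamma=4 / 3):
    """(1 - TI)^(1/gamma): one common way to emphasize hard samples."""
    base = max(0.0, 1.0 - tversky_index(pred, target, alpha, beta))
    return base ** (1.0 / gamma)

def log_cosh_dice_loss(pred, target):
    """log(cosh(Dice loss)), which smooths the loss curve."""
    return math.log(math.cosh(dice_loss(pred, target)))

pred = [1.0, 0.0, 0.0, 0.0]    # predicted foreground probabilities
target = [1.0, 1.0, 0.0, 0.0]  # ground-truth mask
print(dice_loss(pred, target))
print(tversky_loss(pred, target))
print(focal_tversky_loss(pred, target))
print(log_cosh_dice_loss(pred, target))
```

With `alpha = beta = 0.5` the Tversky index reduces to the Dice coefficient, so the Tversky Loss can be read as a Dice Loss in which false positives and false negatives are penalized differently.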
3.2. Data Augmentation
- Data Augmentation 1 (DA1) [24] is obtained through horizontal flip, vertical flip, and 90° rotation;
- Data Augmentation 2 (DA2) [24] consists of 13 operations, some changing the color of an image and some changing its shape.
- Data Augmentation 3 (DA3) is a variant of the approach used in [12]. It uses a multi-scale strategy (scales 1.25, 1, and 0.75) to reduce the sensitivity of the network to scale variability. In addition, a random perspective transformation is applied to the input image with a probability of 0.5, together with a random color adjustment with a probability of 0.2. Because DA3 is randomized, whereas DA1 and DA2 are not, each network is trained on a different training set. This substantially increases the variability of the trained networks and therefore the diversity among ensemble members.
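The three DA1 operations can be sketched on an image stored as a list of rows. This is only an illustration: the function names are hypothetical, the rotation direction (clockwise) is an assumption, and whether the original image is kept alongside the augmented copies is not stated in the description above. The same operation has to be applied to the image and to its ground-truth mask.

```python
def horizontal_flip(image):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in image[::-1]]

def rotate_90(image):
    """Rotate by 90 degrees clockwise (the direction is an assumption)."""
    return [list(col) for col in zip(*image[::-1])]

def augment_da1(image, mask):
    """Return the three DA1-style variants as (image, mask) pairs."""
    operations = (horizontal_flip, vertical_flip, rotate_90)
    return [(op(image), op(mask)) for op in operations]

image = [[1, 2],
         [3, 4]]
mask = [[0, 1],
        [0, 0]]
for aug_image, aug_mask in augment_da1(image, mask):
    print(aug_image, aug_mask)
```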
3.3. Performance Metrics
3.4. Datasets and Testing Protocols
3.4.1. Polyp Segmentation (POLYP)
- The Kvasir-SEG dataset comprises medical images that have been meticulously labeled and verified by medical professionals. These images depict various segments of the digestive system, showcasing both healthy and diseased tissue. The dataset encompasses images with varying resolutions, ranging from 720x576 pixels to 1920x1072 pixels, organized into folders based on their content. Some of these images also include a small picture-in-picture display indicating the position of the endoscope within the body.
- The CVC-ColonDB dataset consists of images designed to offer a diverse range of polyp appearances, maximizing dataset variability.
- CVC-T serves as the test set of a larger dataset named CVC-EndoSceneStill.
- The ETIS-Larib dataset comprises 196 colonoscopy images.
- CVC-ClinicDB encompasses images extracted from 31 videos of colonoscopy procedures. Expert annotations identify the regions affected by polyps, and ground truth data is also available for light reflections. The images in this dataset are uniformly sized at 576x768 pixels.
3.4.2. Skin Segmentation (SKIN)
3.4.3. Leukocyte Segmentation (LEUKO)
3.4.4. Butterfly Identification (BFLY)
3.4.5. Microorganism Identification (EMICRO)
3.4.6. Ribs Segmentation (RIBS)
3.4.7. Locust Segmentation (LOC)
3.4.8. Portrait Segmentation (POR)
3.4.9. Camouflaged Segmentation (CAM)
4. Experimental Results
- In Section 4.1, different methods for building an ensemble of DeepLabV3+ models are tested and compared.
- In Section 4.2, the ensemble of different topologies is tested and the different methods for building the output mask of HardNet, HSN and PVT are compared.
4.1. Experiments: DeepLabV3+
- initial learning rate = 0.01;
- number of epochs = 10 or 15 (depending on the data augmentation; see below);
- momentum = 0.9;
- L2Regularization = 0.005;
- Learning Rate Drop Period = 5;
- Learning Rate Drop Factor = 0.2;
- shuffle training images at every epoch;
- optimizer = SGD (stochastic gradient descent).
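To make the interaction of these settings concrete, the sketch below assumes a step-decay ("piecewise") schedule in which the learning rate is multiplied by the drop factor every drop period, and a standard SGD update with momentum and L2 regularization. The function names and the 0-based epoch index are assumptions made for the example, and other frameworks may define the momentum update slightly differently.

```python
def piecewise_learning_rate(epoch, initial_lr=0.01, drop_period=5, drop_factor=0.2):
    """Learning rate at a 0-based epoch under an assumed step-decay schedule."""
    return initial_lr * drop_factor ** (epoch // drop_period)

def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9, l2=0.005):
    """One SGD update on a scalar weight with momentum and L2 regularization."""
    regularized_grad = grad + l2 * weight
    velocity = momentum * velocity - lr * regularized_grad
    return weight + velocity, velocity

# Learning rate over a 15-epoch run.
for epoch in range(15):
    print(epoch, piecewise_learning_rate(epoch))
```

Under these assumptions, a 10-epoch run sees a single drop (after epoch 5), while a 15-epoch run sees two.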
- ERN18(N) is an ensemble of N RN18 networks trained with DA1.
- ERN50(N) is an ensemble of N RN50 networks trained with DA1.
- ERN101(N) is an ensemble of N RN101 networks trained with DA1.
- E101(10) is an ensemble of 10 RN101 models trained with DA1 and five different loss functions. The final fusion is determined by the formula: , where indicates two RN101 models trained using the loss function .
- EM(10) is a similar ensemble, but the two networks using the same loss (as in E101(10), the five losses are , , , , ) are trained once using DA1 and once using DA2.
- EM2(10) is similar to the previous ensemble, but is used instead of .
- In EM2(5)_DAx, five RN101 networks are trained using the loss of EM2(10). All five networks are trained using data augmentation DAx.
- EM3(10) is similar to the previous ensemble, but is used as a loss function.
- Among stand-alone networks, RN101 obtains the best average performance, but on RIBS (small training set) it performs worse than the others. This probably happens because it is a larger network than RN18 and RN50 and therefore requires a larger training set to be tuned well.
- ERN101(10) always outperforms RN101(1).
- E101(10) outperforms ERN101(10) with a p-value of 0.0078 (Wilcoxon signed-rank test), and EM(10) outperforms E101(10) with a p-value of 0.0352. For the sake of space, we do not report the performance obtained with the individual losses; in any case, there is no clear winner, as the various losses lead to similar performance.
- EM3(10) obtains the highest average performance but the p-value is quite high: it outperforms EM(10) with a p-value of 0.1406 and EM2(10) with a p-value of 0.2812.
- There is no statistical difference between the performance of EM2(5)_DA1 and EM2(5)_DA2.
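The p-values reported for these comparisons can be checked against the per-dataset scores of ERN101(10), E101(10), and EM(10) in the DeepLabV3+ results table. The sketch below is a plain-Python exact Wilcoxon signed-rank test; it assumes a two-sided test in which zero differences are discarded and tied absolute differences receive average ranks, which is a common convention but is not stated explicitly in the text. The function name is hypothetical.

```python
from itertools import product

def wilcoxon_signed_rank_exact(a, b, scale=1000):
    """Exact two-sided Wilcoxon signed-rank p-value for paired scores."""
    # Work in integer thousandths so that equal differences are seen as ties.
    diffs = [round(x * scale) - round(y * scale) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]  # zero differences are discarded
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Doubled average ranks keep the rank sums in integer arithmetic.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)
        i = j + 1
    total = sum(ranks2)
    w_pos = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    observed = min(w_pos, total - w_pos)
    # Enumerate all 2^n sign assignments, equally likely under the null.
    hits = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks2, signs) if keep)
        if min(s, total - s) <= observed:
            hits += 1
    return hits / 2 ** n

# Scores on POLYP, SKIN, LEUKO, BFLY, EMICRO, RIBS, LOC, POR, CAM.
ern101 = [0.834, 0.878, 0.925, 0.978, 0.919, 0.779, 0.838, 0.982, 0.734]
e101 = [0.842, 0.880, 0.925, 0.980, 0.921, 0.785, 0.841, 0.984, 0.747]
em = [0.851, 0.883, 0.936, 0.983, 0.924, 0.833, 0.854, 0.985, 0.740]

print(wilcoxon_signed_rank_exact(e101, ern101))  # 0.0078125
print(wilcoxon_signed_rank_exact(em, e101))      # 0.03515625
```

Both values round to the reported 0.0078 and 0.0352. With only nine datasets, 0.0078 is the smallest two-sided p-value attainable once one tie is discarded, which is why E101(10) needs to win on every remaining dataset to reach it.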
4.2. Experiments: Combining Different Topologies
- LRa: .
- LRb: decaying to after 10 epochs.
- LRc: decaying to after 30 epochs.
- Fusion: the combination of all the nets varying DA and LR strategy.
- Baseline Ensemble: fusion of 9 networks (the same size as Fusion) obtained by retraining DA3-LRc nine times.
- Fusion obtains the best performance, outperforming (on average) the stand-alone approaches and the previous ensemble.
- There is no clear winner among the different data augmentation approaches and learning rate strategies.
- The proposed Fusion always improves on the Baseline Ensemble except on CAM. On this dataset there is a significant difference in performance between LRc and the other learning rate strategies: combining only the three networks based on LRc (i.e., the three data augmentations coupled with LRc), both HSN and PVT reach a Dice of 0.830, outperforming the Baseline Ensemble.
- Ens1: EM3(10) ⊖ Fusion(FH) ⊖ Fusion(PVT) ⊖ Fusion(HSN).
- Ens2: Fusion(FH) ⊖ Fusion(PVT) ⊖ Fusion(HSN).
- Ens3: Fusion(PVT) ⊖ Fusion(HSN).
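The fusion rule behind these ensembles is not spelled out in the text above. As a minimal sketch, the example below assumes the sum rule [21]: the per-pixel foreground scores of the networks are averaged and the result is thresholded. The function name, the 0.5 threshold, and the equal weighting of all networks are assumptions made for the example.

```python
def sum_rule_fusion(score_maps, threshold=0.5):
    """Average per-pixel scores over networks, then binarize the result."""
    n = len(score_maps)
    height, width = len(score_maps[0]), len(score_maps[0][0])
    fused = [[sum(m[r][c] for m in score_maps) / n for c in range(width)]
             for r in range(height)]
    mask = [[1 if value >= threshold else 0 for value in row] for row in fused]
    return fused, mask

# Three networks scoring a 1x2 image.
maps = [
    [[0.9, 0.2]],
    [[0.6, 0.4]],
    [[0.3, 0.6]],
]
fused, mask = sum_rule_fusion(maps)
print(fused)
print(mask)
```

When a network outputs nearly binary scores, averaging behaves like majority voting and the confidence information of that network is lost; this is one way to read the remark that excessively sharp masks should be avoided when combining HSN and PVT.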
5. Conclusions
- a fusion of different convolutional and transformer networks can achieve SOTA performance;
- applying different approaches to learning rate strategy is a feasible method to build a set of segmentation networks;
- a better way to add the transformers (HSN and PVT) in an ensemble is to modify the way the final segmentation map is obtained, avoiding excessively sharp masks.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Hao, S.; Zhou, Y.; Guo, Y. A brief survey on semantic segmentation with deep learning. Neurocomputing 2020, 406, 302–321.
- Wang, S.; Mu, X.; Yang, D.; He, H.; Zhao, P. Attention guided encoder-decoder network with multi-scale context aggregation for land cover segmentation. IEEE Access 2020, 8, 215299–215309.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
- Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-Net and its variants for medical image segmentation: A review of theory and applications. IEEE Access 2021, 9, 82031–82057.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 40, 834–848.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 2481–2495.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, [arXiv:cs.CV/2010.11929].
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.
- Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. Journal of King Saud University-Computer and Information Sciences 2023.
- Huang, C.H.; Wu, H.Y.; Lin, Y.L. HarDNet-MSEG: A Simple Encoder-Decoder Polyp Segmentation Neural Network that Achieves over 0.9 Mean Dice and 86 FPS, 2021, [arXiv:cs.CV/2101.07172].
- Dong, B.; Wang, W.; Fan, D.P.; Li, J.; Fu, H.; Shao, L. Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers, 2023, [arXiv:eess.IV/2108.06932].
- Zhang, W.; Fu, C.; Zheng, Y.; Zhang, F.; Zhao, Y.; Sham, C.W. HSNet: A hybrid semantic network for polyp segmentation. Computers in Biology and Medicine 2022, 150, 106173.
- Rokach, L. Ensemble-based classifiers. Artificial Intelligence Review 2010, 33, 1–39.
- Polikar, R. Ensemble Based Systems in Decision Making. IEEE Circuits and Systems Magazine 2006, 6, 21–45.
- Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Frontiers of Computer Science 2020, 14, 241–258.
- Schapire, R.E. The strength of weak learnability. Machine learning 1990, 5, 197–227.
- Breiman, L. Bagging Predictors. Machine learning 1996, 24, 123–140.
- Valiant, L.G. A Theory of the Learnable. Communications of the ACM 1984, 27, 1134–1142.
- Kearns, M.; Valiant, L.G. Cryptographic Limitations on Learning Boolean Formulae and Finite Automata. Journal of the ACM 1994, 41, 67–95.
- Efron, B. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 1979, 7, 1–26.
- Alexandre, L.A.; Campilho, A.C.; Kamel, M. On combining classifiers using sum and product rules. Pattern Recognition Letters 2001, 22, 1283–1289, Selected Papers from the 11th Portuguese Conference on Pattern Recognition - RECPAD2000.
- Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence 2022, 115, 105151.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Computer Vision – ECCV 2018: 15th European Conference; Springer-Verlag: Berlin, Heidelberg, 2018; pp. 833–851.
- Nanni, L.; Lumini, A.; Loreggia, A.; Formaggio, A.; Cuza, D. An Empirical Study on Ensemble of Segmentation Approaches. Signals 2022, 3, 341–358.
- Lumini, A.; Nanni, L. Fair comparison of skin detection approaches on publicly available datasets. Expert Systems with Applications 2020, 160, 113677.
- Phung, S.L.; Bouzerdoum, A.; Chai, D. Skin segmentation using color pixel classification: analysis and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005, 27, 148–154.
- Liu, Y.; Cao, F.; Zhao, J.; Chu, J. Segmentation of White Blood Cells Image Using Adaptive Location and Iteration. IEEE Journal of Biomedical and Health Informatics 2017, 21, 1644–1655.
- Filali, I.; Achour, B.; Belkadi, M.; Lalam, M. Graph ranking based butterfly segmentation in ecological images. Ecological Informatics 2022, 68, 101553.
- Zhao, P.; Li, C.; Rahaman, M.M.; Xu, H.; Ma, P.; Yang, H.; Sun, H.; Jiang, T.; Xu, N.; Grzegorzek, M. EMDS-6: Environmental Microorganism Image Dataset Sixth Version for Image Denoising, Segmentation, Feature Extraction, Classification, and Detection Method Evaluation. Frontiers in Microbiology 2022, 13.
- Nguyen, H.C.; Le, T.T.; Pham, H.H.; Nguyen, H.Q. VinDr-RibCXR: A Benchmark Dataset for Automatic Segmentation and Labeling of Individual Ribs on Chest X-rays, 2021, [arXiv:eess.IV/2107.01327].
- Liu, L.; Liu, M.; Meng, K.; Yang, L.; Zhao, M.; Mei, S. Camouflaged locust segmentation based on PraNet. Computers and Electronics in Agriculture 2022, 198, 107061.
- Park, H.; Sjösund, L.L.; Yoo, Y.; Kwak, N. ExtremeC3Net: Extreme Lightweight Portrait Segmentation Networks using Advanced C3-modules, 2019, [arXiv:cs.CV/1908.03093].
- Yan, J.; Le, T.N.; Nguyen, K.D.; Tran, M.T.; Do, T.T.; Nguyen, T.V. MirrorNet: Bio-Inspired Camouflaged Object Segmentation. IEEE Access 2021, 9, 43290–43300.
- Nanni, L.; Lumini, A.; Loreggia, A.; Formaggio, A.; Cuza, D. An Empirical Study on Ensemble of Segmentation Approaches. Signals 2022, 3, 341–358.
- Nanni, L.; Loreggia, A.; Lumini, A.; Dorizza, A. A Standardized Approach for Skin Detection: Analysis of the Literature and Case Studies. Journal of Imaging 2023, 9.
- Nanni, L.; Fantozzi, C.; Loreggia, A.; Lumini, A. Ensembles of Convolutional Neural Networks and Transformers for Polyp Segmentation. Sensors 2023, 23.
| Short Name | Name | #Samples |
|---|---|---|
| Kvasir | Kvasir-SEG dataset | 100 |
| ColDB | CVC-ColonDB | 380 |
| CVC-T | CVC-EndoSceneStill | 300 |
| ETIS | ETIS-Larib | 196 |
| ClinicDB | CVC-ClinicDB | 612 |
| Short Name | Name | #Samples |
|---|---|---|
| Prat | Pratheepan | 78 |
| MCG | MCG-skin | 1000 |
| UC | UChile DB-skin | 103 |
| CMQ | Compaq | 4675 |
| SFA | SFA | 1118 |
| HGR | Hand Gesture Recognition | 1558 |
| Sch | Schmugge dataset | 845 |
| VMD | Human Activity Recognition | 285 |
| ECU | ECU Face and Skin Detection | 2000 |
| VT | VT-AAST | 66 |
| Method | POLYP | SKIN | LEUKO | BFLY | EMICRO | RIBS | LOC | POR | CAM |
|---|---|---|---|---|---|---|---|---|---|
| RN18(1) | 0.806 | 0.865 | 0.897 | 0.960 | 0.908 | 0.827 | 0.812 | 0.980 | 0.624 |
| RN50(1) | 0.802 | 0.871 | 0.895 | 0.968 | 0.909 | 0.818 | 0.835 | 0.979 | 0.665 |
| RN101(1) | 0.808 | 0.871 | 0.915 | 0.976 | 0.918 | 0.776 | 0.830 | 0.981 | 0.717 |
| ERN18(10) | 0.821 | 0.866 | 0.913 | 0.963 | 0.913 | 0.842 | 0.830 | 0.981 | 0.672 |
| ERN50(10) | 0.807 | 0.872 | 0.897 | 0.969 | 0.918 | 0.839 | 0.840 | 0.980 | 0.676 |
| ERN101(10) | 0.834 | 0.878 | 0.925 | 0.978 | 0.919 | 0.779 | 0.838 | 0.982 | 0.734 |
| E101(10) | 0.842 | 0.880 | 0.925 | 0.980 | 0.921 | 0.785 | 0.841 | 0.984 | 0.747 |
| EM(10) | 0.851 | 0.883 | 0.936 | 0.983 | 0.924 | 0.833 | 0.854 | 0.985 | 0.740 |
| EM2(10) | 0.851 | 0.883 | 0.943 | 0.984 | 0.925 | 0.846 | 0.859 | 0.986 | 0.731 |
| EM2(5)_DA1 | 0.836 | 0.881 | 0.928 | 0.982 | 0.921 | 0.800 | 0.841 | 0.985 | 0.742 |
| EM2(5)_DA2 | 0.847 | 0.869 | 0.948 | 0.985 | 0.920 | 0.860 | 0.842 | 0.983 | 0.700 |
| EM3(10) | 0.852 | 0.883 | 0.945 | 0.985 | 0.925 | 0.856 | 0.860 | 0.986 | 0.728 |
| Method | POLYP | SKIN | LEUKO | BFLY | EMICRO | RIBS | LOC | POR | CAM |
|---|---|---|---|---|---|---|---|---|---|
| EM(10) | 0.787 | 0.798 | 0.887 | 0.966 | 0.869 | 0.714 | 0.769 | 0.971 | 0.630 |
| EM2(10) | 0.790 | 0.799 | 0.897 | 0.969 | 0.870 | 0.734 | 0.778 | 0.972 | 0.621 |
| EM3(10) | 0.791 | 0.798 | 0.899 | 0.970 | 0.872 | 0.749 | 0.780 | 0.972 | 0.617 |
| Method | DA | LR | POLYP | SKIN | EMICRO | CAM |
|---|---|---|---|---|---|---|
| HardNet | DA1 | LRa | 0.828 | 0.873 | 0.912 | 0.700 |
| HardNet | DA1 | LRb | 0.821 | 0.858 | 0.905 | 0.667 |
| HardNet | DA1 | LRc | 0.795 | 0.869 | 0.909 | 0.712 |
| HardNet | DA2 | LRa | 0.852 | 0.870 | 0.912 | 0.715 |
| HardNet | DA2 | LRb | 0.826 | 0.854 | 0.905 | 0.665 |
| HardNet | DA2 | LRc | 0.846 | 0.872 | 0.910 | 0.710 |
| HardNet | DA3 | LRa | 0.828 | 0.853 | 0.907 | 0.653 |
| HardNet | DA3 | LRb | 0.832 | 0.839 | 0.904 | 0.613 |
| HardNet | DA3 | LRc | 0.828 | 0.865 | 0.904 | 0.694 |
| Fusion | DA1,2,3 | LRa,b,c | 0.868 | 0.883 | 0.921 | 0.726 |
| Previous | | | 0.863 | 0.886 | 0.916 | — |
| Method | DA | LR | SM | POLYP | SKIN | EMICRO | CAM |
|---|---|---|---|---|---|---|---|
| PVT | DA1 | LRa | No | 0.857 | 0.874 | 0.919 | 0.788 |
| PVT | DA1 | LRb | No | 0.850 | 0.844 | 0.914 | 0.743 |
| PVT | DA1 | LRc | No | 0.861 | 0.877 | 0.919 | 0.810 |
| PVT | DA2 | LRa | No | 0.862 | 0.845 | 0.917 | 0.742 |
| PVT | DA2 | LRb | No | 0.847 | 0.854 | 0.912 | 0.743 |
| PVT | DA2 | LRc | No | 0.862 | 0.876 | 0.917 | 0.813 |
| PVT | DA3 | LRa | No | 0.855 | 0.875 | 0.917 | 0.765 |
| PVT | DA3 | LRb | No | 0.851 | 0.856 | 0.916 | 0.718 |
| PVT | DA3 | LRc | No | 0.871 | 0.883 | 0.918 | 0.817 |
| Fusion | DA1,2,3 | LRa,b,c | No | 0.884 | 0.892 | 0.925 | 0.813 |
| Fusion | DA1,2,3 | LRa,b,c | Yes | 0.885 | 0.892 | 0.926 | 0.814 |
| Baseline Ensemble | DA3 | LRc | | 0.880 | 0.886 | 0.921 | 0.829 |
| Previous | | | | 0.877 | 0.883 | 0.922 | — |
| Method | DA | LR | SM | POLYP | SKIN | EMICRO | CAM |
|---|---|---|---|---|---|---|---|
| HSN | DA1 | LRa | No | 0.847 | 0.873 | 0.919 | 0.776 |
| HSN | DA1 | LRb | No | 0.852 | 0.816 | 0.916 | 0.742 |
| HSN | DA1 | LRc | No | 0.860 | 0.873 | 0.919 | 0.817 |
| HSN | DA2 | LRa | No | 0.857 | 0.873 | 0.921 | 0.742 |
| HSN | DA2 | LRb | No | 0.849 | 0.850 | 0.918 | 0.743 |
| HSN | DA2 | LRc | No | 0.873 | 0.873 | 0.919 | 0.814 |
| HSN | DA3 | LRa | No | 0.866 | 0.863 | 0.922 | 0.782 |
| HSN | DA3 | LRb | No | 0.854 | 0.856 | 0.913 | 0.697 |
| HSN | DA3 | LRc | No | 0.866 | 0.876 | 0.924 | 0.800 |
| Fusion | DA1,2,3 | LRa,b,c | No | 0.881 | 0.885 | 0.926 | 0.813 |
| Fusion | DA1,2,3 | LRa,b,c | Yes | 0.882 | 0.886 | 0.926 | 0.812 |
| Baseline Ensemble | DA3 | LRc | | 0.876 | 0.879 | 0.923 | 0.820 |
| Previous | | | | 0.879 | 0.879 | — | — |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).