Submitted: 11 March 2025
Posted: 12 March 2025
Abstract
Keywords:
1. Introduction
- We propose a Scalable Bi-Directional Feature Pyramid (SBFPN) with an attention mechanism to enhance the discriminative ability of feature representations for instance segmentation. The scalable layers and other hyperparameters can be adjusted for objects of different scales, so that the model can be adapted to more downstream tasks.
- We dynamically adjust the network structure through a Mixture of Experts (MoE) model, so that the model can adaptively select an appropriate structure and accommodate multi-scale tasks.
2. Related Work
2.1. Instance Segmentation
2.2. Ship Detection and Segmentation
3. Methodology: A Roadmap
3.1. Bi-Directional Feature Pyramid
3.1.1. Bottom-up Path Augmentation
3.1.2. Sparse Feature Fusion (Skip Connection)
3.2. Attention Mechanism
3.2.1. Basic Attention Unit
- Spatial Attention (SA). The spatial attention mechanism weights the spatial positions of the convolutional features differently: it assigns larger weights to target regions, devoting more “attention” to them, and smaller weights to background and noise regions in order to suppress these distractors. The spatial attention module is shown in Figure 5(a). Given an input feature map F, we use a convolution to fuse and compress F into a single-channel feature map, and then apply a Sigmoid function to generate a spatial attention mask.
- Channel-wise Attention (CA). In a convolutional neural network, the features of different channels respond to different semantic information. As shown in Figure 5(b), the channel attention mechanism assigns larger weights to channels with a higher target response and smaller weights to channels with a lower target response, so that the network pays more attention to the high-response channels. Our channel-wise attention module is implemented as follows. Given the input feature map F, we apply global average pooling and global max pooling to each channel of F to generate two pooled feature maps. A convolution is applied to each pooled feature map, and the two outputs are summed to form the channel attention map. We then multiply F by the channel attention map element by element and add the result to F. This gives the final output of the channel-wise attention module and assigns a weight to each channel.
- Non-local Attention (NA). The self-attention mechanism was first applied in natural language processing. Wang et al. later applied it to computer vision tasks in the Non-local network [36]. The Non-local block can be combined with other network structures as a component. The whole process is divided into four steps:
- Apply three convolution kernels to the feature map X as linear mappings that compress the number of channels, obtaining three features (written θ, φ, and g in [36]).
- Multiply θ and φ as matrices, that is, compute the autocorrelation of the features.
- Apply a softmax to the autocorrelation features to obtain weights between 0 and 1, that is, the self-attention coefficients.
- Multiply the self-attention coefficients and g as matrices, and then add the result to the feature map X through a residual connection to obtain the output Z of the Non-local block. Following these four steps, the Non-local block is defined as:
$$Z = \mathrm{softmax}\left(\theta(X)\,\phi(X)^{\top}\right) g(X) + X$$
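The three basic attention units described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The 1×1 kernel size, the weight names (`w_sa`, `w_avg`, `w_max`, `w_theta`, `w_phi`, `w_g`, `w_out`), and the use of separate weights for the two pooled maps are assumptions, since the kernel sizes are not recoverable from the text. The final projection `w_out`, which restores the channel count before the residual addition, follows the design of Wang et al. [36] and is likewise an assumption here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def spatial_attention_mask(f, w_sa):
    """Compress F to one channel, then squash with a sigmoid -> (1, H, W)."""
    return sigmoid(conv1x1(f, w_sa))

def channel_attention(f, w_avg, w_max):
    """Pool each channel (average and max), convolve, sum, then F * M + F."""
    avg = f.mean(axis=(1, 2))            # (C,) global average pooling
    mx = f.max(axis=(1, 2))              # (C,) global max pooling
    m = w_avg @ avg + w_max @ mx         # (C,) channel attention map
    return f * m[:, None, None] + f      # reweight channels, add the input back

def non_local(x, w_theta, w_phi, w_g, w_out):
    """Four-step Non-local block on a (C, H, W) feature map."""
    _, h, w = x.shape
    n = h * w
    # Step 1: three linear mappings that compress the channels.
    theta = conv1x1(x, w_theta).reshape(-1, n).T    # (N, C')
    phi = conv1x1(x, w_phi).reshape(-1, n)          # (C', N)
    g = conv1x1(x, w_g).reshape(-1, n).T            # (N, C')
    # Step 2: autocorrelation between all pairs of positions.
    corr = theta @ phi                              # (N, N)
    # Step 3: softmax turns each row into weights in [0, 1].
    attn = softmax(corr, axis=-1)
    # Step 4: aggregate g, restore the channels (assumed), add the residual.
    y = (attn @ g).T.reshape(-1, h, w)              # (C', H, W)
    return conv1x1(y, w_out) + x
```

A spatial mask produced this way would typically be applied as `f * spatial_attention_mask(f, w_sa)`, broadcasting the single-channel mask over all channels. Note that the attention matrix in `non_local` has N × N entries for N = H·W positions, which is why such a block is usually placed on lower-resolution feature maps.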


3.2.2. Mixture-of-Experts (MoE) Attention Layer
3.2.3. Hybrid Attention Module
- Channel-wise and Non-local and Spatial Attention Module (CA-NA-SA). In the CA-NA-SA module, the feature map F is first passed through the channel-wise attention module to obtain the channel-level attention features, which are then input to the Non-local Attention module. Note that the Non-local Attention module produces an output feature map of the same size as its input and contains an identity mapping from input to output, so it can be embedded as a generic module in other modules. The result is finally fed into the spatial attention module to obtain the output of the Channel Attention - Non-local Attention - Spatial Attention module.
- Channel-wise and Spatial Attention Module (CA-SA). The structure of the CA-SA module is shown in Figure 9(a). The feature map F is first passed through the channel attention module to obtain the channel-level attention feature $F'$, which is then input to the spatial attention module to obtain the final channel-spatial attention feature $F''$:
$$F' = C(F), \qquad F'' = S(F')$$
where S denotes the computation of the spatial attention module and C denotes the computation of the channel attention module.
- Parallel [Channel-wise and Spatial] Attention Module ([CA-SA]). The structure of the parallel [CA-SA] module is shown in Figure 9(b). Its implementation differs from the preceding attention modules, which are connected in series. First, a three-dimensional attention map $M(F)$ is inferred from the input feature map F. Then $M(F)$ is multiplied element by element with F, and the result is added to the original input F to give the output of the module, as in Eq. (4), where ⊙ denotes element-by-element multiplication:
$$F' = F + F \odot M(F) \tag{4}$$
The input feature map is processed by the spatial attention branch and the channel attention branch to obtain the spatial attention map $M_s(F)$ and the channel attention map $M_c(F)$, respectively. Both branches follow the Spatial Attention and Channel-wise Attention modules described in this paper. After the two attention maps are expanded to the size of F, they are added element by element, and the sum is passed through a Sigmoid activation to obtain $M(F)$ with values in [0,1], as in Eq. (6):
$$M(F) = \sigma\left(M_c(F) + M_s(F)\right) \tag{6}$$
where $\sigma$ denotes the Sigmoid function, $\sigma(x) = 1/(1 + e^{-x})$.
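The difference between the serial and parallel hybrid modules can be made concrete with a short NumPy sketch. It is a minimal illustration under stated assumptions: the function names are hypothetical, the branch modules are passed in as callables, and the toy branches in the usage note are stand-ins rather than the attention modules of this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def serial_ca_sa(f, channel_module, spatial_module):
    """Serial CA-SA: apply channel attention first, then spatial attention."""
    return spatial_module(channel_module(f))

def parallel_ca_sa(f, channel_map, spatial_map):
    """Parallel [CA-SA]: fuse both attention maps, then F + F * M(F).

    channel_map(f) returns a per-channel map of shape (C,);
    spatial_map(f) returns a per-position map of shape (H, W).
    """
    m_c = channel_map(f)[:, None, None]   # expand (C,) to (C, 1, 1)
    m_s = spatial_map(f)[None, :, :]      # expand (H, W) to (1, H, W)
    m = sigmoid(m_c + m_s)                # broadcast sum -> (C, H, W), values in (0, 1)
    return f + f * m                      # element-wise reweighting plus residual

# Toy stand-in branches (assumptions for demonstration only).
toy_channel_map = lambda f: f.mean(axis=(1, 2))
toy_spatial_map = lambda f: f.mean(axis=0)
```

For example, `parallel_ca_sa(np.ones((2, 3, 3)), toy_channel_map, toy_spatial_map)` keeps the input shape. In the serial form each module sees the output of the previous one, while in the parallel form both branches see the same input F and only their maps are fused.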
3.3. FPN and Bottom-up Structures with Attention Module

3.3.1. FPN with Attention Module (FPN-AM)
3.3.2. Bottom-up with Attention Module (Bottom-up-AM)
3.4. Scalable Bi-Directional Attention Feature Pyramid
3.5. Other Tricks
3.5.1. Adjustment for Backbone
- Group Convolution. In ResNeXt [37], Xie et al. introduced group convolution into the residual network to improve accuracy and enhance feature representation without significantly increasing the number of parameters. Inspired by ResNeXt, we adopt a similar approach: we replace the convolutional structure in each stage of the original ResNet-101 with a grouped convolutional structure and set the number of groups of all grouped convolutions to 32.
- Activation Function. The performance of the ReLU activation function decreases as the network deepens. The Swish function is bounded below, smooth, and non-monotonic. It inherits the advantages of ReLU, does not suffer from the vanishing gradient problem, and performs better in deep networks. We therefore try replacing the ReLU activation function in the residual network with Swish, which is formulated as
$$f(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$$
where $\beta$ is a constant or trainable parameter, and $\beta$ is set to 1 in this paper.
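Both backbone adjustments can be checked numerically with a small sketch in plain Python. The helper names, the 256-channel width, and the 3×3 kernel used in the example are assumptions for illustration, since the kernel size is not recoverable from the text above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

def conv_weight_count(c_in, c_out, kernel, groups=1):
    """Number of weights (bias excluded) in a 2D convolution with `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kernel * kernel

# Swish dips below zero for negative inputs and then returns towards zero,
# which is the non-monotonic, bounded-below behaviour described above.
print(swish(-5.0), swish(-1.0), swish(0.0), swish(1.0))

# Hypothetical 256 -> 256 layer with a 3x3 kernel (sizes are assumptions).
print(conv_weight_count(256, 256, 3))             # 589824
print(conv_weight_count(256, 256, 3, groups=32))  # 18432
```

With 32 groups the same layer needs 32 times fewer weights, which is the budget that grouped designs such as ResNeXt can spend on wider layers without a large increase in parameters.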
3.5.2. Training Techniques
3.5.3. Loss Function
4. Experimental Results on Airbus Ship Dataset
4.1. Airbus Ship Dataset
4.2. Implementation Details and Results
4.2.1. Experimental Analysis of Scalable Bi-Directional Feature Pyramid Structure
4.2.2. Experimental Analysis of the Attention Module
4.2.3. Other Experimental Analysis
4.3. Comparison with Other Methods
| Method | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_{S}$ | $AP_{M}$ | $AP_{L}$ |
|---|---|---|---|---|---|---|
| Mask R-CNN | 71.3/62.4 | 94.6/90.8 | 81.7/71.8 | 63.2/50.8 | 86.6/79.1 | 87.2/84.1 |
| PANet [35] | 72.6/63.2 | 95.1/82.3 | 80.4/68.5 | 63.0/49.4 | 89.4/79.5 | 91.4/84.9 |
| Mask Scoring R-CNN [11] | 65.8/56.2 | 93.5/89.7 | 72.5/60.2 | 54.4/42.0 | 85.3/78.2 | 88.7/85.2 |
| Mask R-CNN + S-NMS [49] | 66.5/56.2 | 88.6/89.7 | 75.3/67.6 | 55.4/46.5 | 85.7/79.4 | 88.3/85.2 |
| Cascade Mask R-CNN [50] | 72.9/65.2 | 86.1/84.9 | 77.3/73.4 | 61.0/53.0 | 95.0/86.8 | 96.1/88.6 |
| (ours) | 82.7/71.1 | 97.4/95.3 | 90.2/80.3 | 76.9/61.8 | 93.9/86.9 | 93.4/89.3 |

| Method | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_{S}$ | $AP_{M}$ | $AP_{L}$ |
|---|---|---|---|---|---|---|
| Mask R-CNN | 37.7/36.5 | 63.7/59.1 | 46.0/37.5 | 26.9/20.9 | 48.7/42.5 | 55.1/50.5 |
| PANet [35] | 42.6/34.2 | 64.2/59.1 | 46.4/37.9 | 27.2/20.9 | 48.8/42.4 | 55.1/51.4 |
| PANet+ [14] | 46.3/38.5 | 64.3/59.5 | 45.9/38.4 | 26.5/21.2 | 50.7/42.7 | 56.8/52.5 |
| Mask Scoring R-CNN [11] | 41.7/35.5 | 63.5/58.8 | 46.1/37.7 | 26.9/20.9 | 48.3/42.4 | 54.8/51.4 |
| Cascade Mask R-CNN [50] | 43.2/37.1 | 63.5/59.6 | 45.7/37.2 | 26.8/20.6 | 50.0/43.6 | 55.9/52.4 |
| (ours) | 47.3/40.7 | 64.2/60.2 | 48.3/39.2 | 28.7/21.1 | 51.2/44.3 | 57.5/51.9 |

| Method | Extra Data | | Extra Data | |
|---|---|---|---|---|
| Mask R-CNN | × | 31.5 | × | 26.2 |
| PANet [35] | × | 36.5 | × | 31.8 |
| PANet | - | - | ✓ | 36.4 |
| Panoptic-DeepLab [51] | ✓ | 38.8 | ✓ | 39.0 |
| Panoptic-DeepLab | × | 35.3 | × | 34.6 |
| Panoptic-FPN [52] | × | 33.0 | - | - |
| GAIS-Net [53] | ✓ | 37.1 | × | 32.5 |
| AUNet [54] | × | 34.4 | - | - |
| AdaptIS [55] | × | 36.3 | × | 32.5 |
| UPSNet [56] | ✓ | 37.8 | ✓ | 33.0 |
| UPSNet | × | 33.3 | - | - |
| (ours) | × | 38.6 | × | 36.7 |
4.4. Ablation Studies
5. Experimental Results on Other Tasks
5.1. Cityscape Dataset
5.2. Results on Cityscape Dataset
5.3. iSAID Dataset
5.4. Results on iSAID Dataset
6. Conclusion
Acknowledgments
References
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 2020, 159, 296–307.
- Pi, Z.; Shao, Y.; Gao, C.; Sang, N. Instance-based feature pyramid for visual object tracking. IEEE Transactions on Circuits and Systems for Video Technology 2021, 32, 3774–3787.
- Pi, Z.; Shao, Y.; Gao, C.; Sang, N. Instance-based feature pyramid for visual object tracking. IEEE Transactions on Circuits and Systems for Video Technology 2021, 32, 3774–3787.
- Sun, Y.; Su, L.; Luo, Y.; Meng, H.; Zhang, Z.; Zhang, W.; Yuan, S. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes. IEEE Transactions on Circuits and Systems for Video Technology 2022, 32, 6029–6043.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
- Li, G.; Fang, Q.; Zha, L.; Gao, X.; Zheng, N. HAM: Hybrid attention module in deep convolutional neural networks for image classification. Pattern Recognition 2022, 129, 108785.
- Guo, N.; Gu, K.; Qiao, J.; Bi, J. Improved deep CNNs based on Nonlinear Hybrid Attention Module for image classification. Neural Networks 2021, 140, 158–166.
- Wang, D.; Li, M.; Gong, C.; Chandra, V. AttentiveNAS: Improving neural architecture search via attentive sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6418–6427.
- Jianxin, G.; Zhen, W.; Shanwen, Z. Multi-scale ship detection in SAR images based on multiple attention cascade convolutional neural networks. In Proceedings of the 2020 international conference on virtual reality and intelligent systems (ICVRIS). IEEE, 2020, pp. 438–441.
- Zheng, L.; Zeng, L. An Improved YOLOv7x Small-Scale Ship Target Detection Algorithm. In Proceedings of the 2023 11th International Conference on Information Systems and Computing Technology (ISCTech). IEEE, 2023, pp. 130–135.
- Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask Scoring R-CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5561–5569.
- Al-Saad, M.; Aburaed, N.; Panthakkan, A.; Al Mansoori, S.; Al Ahmad, H.; Marshall, S. Airbus ship detection from satellite imagery using frequency domain learning. In Proceedings of the Image and Signal Processing for Remote Sensing XXVII. SPIE, 2021, Vol. 11862, pp. 279–285.
- Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. iSAID: A large-scale dataset for instance segmentation in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 28–37.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
- Su, H.; Wei, S.; Yan, M.; Wang, C.; Shi, J.; Zhang, X. Object detection and instance segmentation in remote sensing imagery based on precise mask R-CNN. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019, pp. 1454–1457.
- Ran, J.; Yang, F.; Gao, C.; Zhao, Y.; Qin, A. Adaptive fusion and mask refinement instance segmentation network for high resolution remote sensing images. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2020, pp. 2843–2846.
- Zhang, T.; Zhang, X.; Zhu, P.; Tang, X.; Li, C.; Jiao, L.; Zhou, H. Semantic attention and scale complementary network for instance segmentation in remote sensing images. IEEE Transactions on Cybernetics 2021, 52, 10999–11013.
- Liu, X.; Di, X. Global context parallel attention for anchor-free instance segmentation in remote sensing images. IEEE Geoscience and Remote Sensing Letters 2020, 19, 1–5.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
- Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 40, 834–848.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proceedings of the International Conference on Learning Representations, 2015.
- Chen, L.C. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
- Feng, Y.; Diao, W.; Zhang, Y.; Li, H.; Chang, Z.; Yan, M.; Sun, X.; Gao, X. Ship instance segmentation from remote sensing images using sequence local context module. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019, pp. 1025–1028.
- Huang, Z.; Li, R. Orientated silhouette matching for single-shot ship instance segmentation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2021, 15, 463–477.
- Huang, Z.; Sun, S.; Li, R. Fast single-shot ship instance segmentation based on polar template mask in remote sensing images. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2020, pp. 1236–1239.
- Xie, E.; Sun, P.; Song, X.; Wang, W.; Liu, X.; Liang, D.; Shen, C.; Luo, P. PolarMask: Single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12193–12202.
- Sun, Y.; Su, L.; Cui, H.; Chen, Y.; Yuan, S. Ship instance segmentation in foggy scene. In Proceedings of the 2021 40th Chinese Control Conference (CCC). IEEE, 2021, pp. 8340–8345.
- Gao, F.; Huo, Y.; Wang, J.; Hussain, A.; Zhou, H. Anchor-free SAR ship instance segmentation with centroid-distance based loss. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 2021, 14, 11352–11371.
- Zhu, C.; Zhao, D.; Qi, J.; Qi, X.; Shi, Z. Cross-domain transfer for ship instance segmentation in SAR images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS. IEEE, 2021, pp. 2206–2209.
- Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Applied Sciences 2022, 12, 8972.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 10347–10357.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
- Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Zhang, H. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
- Peng, C.; Xiao, T.; Li, Z.; Jiang, Y.; Zhang, X.; Jia, K.; Yu, G.; Sun, J. MegDet: A large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6181–6189.
- Redmon, J. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
- Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv 2017, arXiv:1710.03740.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol. 34, pp. 12993–13000.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5561–5569.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
- Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.C. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12475–12485.
- Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
- Wu, C.Y.; Hu, X.; Happold, M.; Xu, Q.; Neumann, U. Geometry-aware instance segmentation with disparity maps. arXiv 2020, arXiv:2006.07802.
- Li, Y.; Chen, X.; Zhu, Z.; Xie, L.; Huang, G.; Du, D.; Wang, X. Attention-guided unified network for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7026–7035.
- Sofiiuk, K.; Barinova, O.; Konushin, A. AdaptIS: Adaptive instance selection network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7355–7363.
- Xiong, Y.; Liao, R.; Zhao, H.; Hu, R.; Bai, M.; Yumer, E.; Urtasun, R. UPSNet: A unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8818–8826.













| Method | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_{S}$ | $AP_{M}$ | $AP_{L}$ |
|---|---|---|---|---|---|---|
| Mask R-CNN (baseline) | 71.3/61.4 | 94.6/90.8 | 81.7/71.8 | 63.2/50.8 | 86.6/79.1 | 87.2/84.1 |
| + B (Bottom-up Path) | 73.9/63.9 | 94.9/91.3 | 82.2/72.7 | 64.8/52.9 | 87.4/81.2 | 86.0/83.8 |
| + B + SC (Skip Connection) | 74.6/65.7 | 95.0/91.6 | 84.5/73.2 | 64.4/54.7 | 87.9/81.9 | 88.8/85.9 |
| Number of feature fusion layer stacks | | | | | | |
| + B + SC + =2 | 76.4/66.4 | 95.6/92.3 | 85.6/74.3 | 68.2/55.3 | 89.8/83.7 | 89.4/87.0 |
| + B + SC + =3 | 75.2/65.7 | 95.2/92.6 | 84.7/74.8 | 67.8/54.8 | 88.3/83.2 | 90.8/86.3 |
| + B + SC + =4 | 75.5/65.4 | 95.3/92.0 | 84.9/73.7 | 67.2/54.6 | 89.2/82.9 | 90.3/86.7 |

| Method | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_{S}$ | $AP_{M}$ | $AP_{L}$ |
|---|---|---|---|---|---|---|
| Mask R-CNN (baseline) | 71.3/62.4 | 94.6/90.8 | 81.7/71.8 | 63.2/50.8 | 86.6/79.1 | 87.2/84.1 |
| +B+SC+ | 76.4/66.4 | 95.6/92.3 | 85.6/74.3 | 68.2/55.3 | 89.8/83.7 | 89.4/87.0 |
| Bottom-up-AM(+B+SC+) | | | | | | |
| + CA | 77.3/67.1 | 96.1/93.8 | 86.9/76.9 | 72.2/57.3 | 90.5/84.9 | 90.8/87.2 |
| + SA | 77.7/66.9 | 95.7/94.3 | 87.2/77.2 | 72.7/56.9 | 90.3/84.2 | 90.5/87.5 |
| + NA | 78.5/67.9 | 96.8/94.6 | 87.7/77.9 | 72.3/57.0 | 90.9/85.1 | 91.5/86.9 |
| + CA-SA | 79.7/68.2 | 97.1/94.7 | 88.3/78.7 | 73.4/58.1 | 91.8/85.0 | 92.6/88.3 |
| + SA-CA | 79.0/68.0 | 96.6/94.5 | 87.9/78.1 | 73.2/58.4 | 91.1/85.2 | 91.7/88.7 |
| + CA-NA | 78.3/67.8 | 95.9/94.2 | 87.8/77.9 | 73.9/57.8 | 91.0/84.8 | 91.0/97.5 |
| + NA-CA | 77.8/67.4 | 96.3/93.9 | 87.3/77.8 | 73.5/57.4 | 90.6/85.0 | 91.8/87.9 |
| + SA-NA | 78.7/67.7 | 96.0/94.2 | 88.0/78.3 | 72.5/58.0 | 91.5/85.0 | 92.3/88.6 |
| + NA-SA | 78.2/68.0 | 96.5/94.3 | 87.5/78.0 | 72.3/57.8 | 90.8/84.7 | 92.0/87.9 |
| + [CA-SA] | 77.2/66.8 | 95.3/94.0 | 86.7/76.9 | 71.2/56.3 | 90.6/83.9 | 90.2/87.0 |

| Method | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_{S}$ | $AP_{M}$ | $AP_{L}$ |
|---|---|---|---|---|---|---|
| Mask R-CNN (baseline) | 71.3/62.4 | 94.6/90.8 | 81.7/71.8 | 63.2/50.8 | 86.6/79.1 | 87.2/84.1 |
| +B+SC++B-AM(CA-SA) | 79.7/68.2 | 97.1/94.7 | 88.3/78.7 | 73.4/58.1 | 91.8/85.0 | 92.6/88.3 |
| + B + SC + =2 + B-AM (CA-SA) | | | | | | |
| + GC (Group Convolution) | 81.5/70.5 | 97.1/94.9 | 89.7/79.8 | 74.8/60.2 | 93.0/86.2 | 92.6/88.6 |
| + Swish | 80.2/69.2 | 96.7/94.7 | 88.8/78.8 | 74.2/59.7 | 92.7/85.8 | 92.3/88.3 |
| + GC + Swish | 81.7/70.5 | 97.4/95.3 | 90.1/80.0 | 75.4/60.8 | 93.3/86.2 | 92.4/89.5 |
| + GC + Swish + DIoU | 81.2/70.1 | 97.0/94.9 | 89.7/79.0 | 73.9/59.3 | 92.9/85.9 | 92.1/89.1 |
| Training Techniques | 82.7/71.1 | 97.4/95.3 | 90.2/80.3 | 76.9/61.8 | 93.9/86.9 | 93.4/89.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).