Submitted: 13 June 2025
Posted: 16 June 2025
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Test-Time Augmentation
2.2. Semi-Supervised Learning
2.3. Random Augmentation Sampler
- Shuffling: Shuffles the original dataset indices (e.g., [0, 1, 2, 3, 4]) into a randomized order (e.g., [3, 1, 4, 0, 2]).
- Repetition: Repeats each index according to the augmentation count (e.g., three times), yielding [3, 3, 3, 1, 1, 1, 4, 4, 4, 0, 0, 0, 2, 2, 2].
- Subsampling: Splits the repeated indices to keep only a subset (e.g., [3, 3, 1, 4, 4]); see the sketch after this list.
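Putting the three steps together, a minimal single-process sketch in Python might look as follows (the function name `repeated_augmentation_indices` and its parameters are ours for illustration; the actual distributed RASampler interleaves the repeated indices across processes rather than taking a simple prefix):

```python
import random

def repeated_augmentation_indices(num_samples, num_repeats=3, subset_size=None):
    # Shuffling: randomize the dataset order, e.g. [0, 1, 2, 3, 4] -> [3, 1, 4, 0, 2].
    indices = list(range(num_samples))
    random.shuffle(indices)
    # Repetition: each index appears num_repeats times, so consecutive entries
    # become different augmented views of the same image,
    # e.g. [3, 3, 3, 1, 1, 1, 4, 4, 4, 0, 0, 0, 2, 2, 2].
    repeated = [i for i in indices for _ in range(num_repeats)]
    # Subsampling: keep only a subset so one epoch does not grow by a factor
    # of num_repeats (here a simple prefix; the distributed sampler instead
    # splits the repeated list across processes).
    if subset_size is None:
        subset_size = num_samples
    return repeated[:subset_size]

print(repeated_augmentation_indices(5))  # e.g. [3, 3, 3, 1, 1]
```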
3. Methods
3.1. Inverse Transform
3.2. Instance Alignment
3.3. Loss Transform
4. Results
4.1. Training Benefits from LossTransform
4.1.1. Suitability for Complex Augmentation
4.1.2. Faster Convergence
4.1.3. Stable Predictions
4.1.4. The Number of Repetitions for Repeated Augmentation
4.1.5. Comparison with Other Loss Functions
4.1.6. Object Detection and Instance Segmentation
4.2. Fine-Tuning Benefits from LossTransform-v1
5. Discussion
6. Conclusions
Abbreviations
| Abbreviation | Definition |
|---|---|
| AP | Average Precision |
| CCE | Categorical Cross-Entropy |
| hflip | Horizontal Flip |
| mAP | Mean Average Precision |
| RA | Random Augmentation |
| RASampler | Random Augmentation Sampler |
| RPN | Region Proposal Network |
| SSL | Semi-Supervised Learning |
| TTA | Test-Time Augmentation |
References
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning; PMLR, 2020; pp. 1597–1607.
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Advances in Neural Information Processing Systems 2020, 33, 18661–18673.
- Krizhevsky, A.; Hinton, G.; et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 2015, 115, 211–252.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 2015, 28.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- Shi, J.; et al. Penn-Fudan Database for Pedestrian Detection and Tracking. https://www.cis.upenn.edu/~jshi/ped_html/, 2007. Accessed: 2024-08-09.
- Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 2010, 88, 303–338.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Part V; Springer, 2014; pp. 740–755.
- Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Frontiers of Computer Science 2020, 14, 241–258.
- Shanmugam, D.; Blalock, D.; Balakrishnan, G.; Guttag, J. Better aggregation in test-time augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. International Journal of Computer Vision 2021, 129, 1789–1819.
- Galstyan, A.; Cohen, P.R. Empirical comparison of “hard” and “soft” label propagation for relational classification. In Proceedings of the International Conference on Inductive Logic Programming; Springer, 2007; pp. 98–111.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Kim, I.; Kim, Y.; Kim, S. Learning loss for test-time augmentation. Advances in Neural Information Processing Systems 2020, 33, 4163–4174.
- Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow Twins: Self-supervised learning via redundancy reduction. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 12310–12320.
- Athiwaratkun, B.; Finzi, M.; Izmailov, P.; Wilson, A.G. There are many consistent explanations of unlabeled data: Why you should average. arXiv preprint arXiv:1806.05594, 2018.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 10347–10357.
- Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
- DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552, 2017.
- Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin softmax loss for convolutional neural networks. arXiv preprint arXiv:1612.02295, 2016.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning; PMLR, 2019; pp. 6105–6114.
- Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 2019, 32.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Fiorio, C.; Gustedt, J. Two linear time Union-Find strategies for image processing. Theoretical Computer Science 1996, 154, 165–181.
- TorchVision Contributors. TorchVision Models. https://pytorch.org/vision/master/models.html, 2024. Accessed: 2024-08-09.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
| 1 | The validation accuracies (MobileNet-V2: 71.3%, NASNetMobile: 74.4%, EfficientNet-B0: 77.1%) reported on the TensorFlow webpage and in the corresponding papers are the results of ensemble and multi-crop models [30]. |
| 2 | As of v0.14, TorchVision offers these model architectures pretrained on ImageNet. We extract their backbones and place a feature pyramid network (FPN) on top to serve as the backbone of Mask R-CNN. |
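As a rough illustration of footnote 2, the sketch below builds Mask R-CNN from an ImageNet-pretrained TorchVision backbone with an FPN on top (assumptions on our part: the TorchVision ≥ 0.13 multi-weight API, ResNet-50 as one concrete backbone choice, and two classes, background plus pedestrian, for a Penn-Fudan-style setup):

```python
# Minimal sketch: an ImageNet-pretrained classification backbone with an FPN
# attached, used as the backbone of Mask R-CNN (assumes TorchVision >= 0.13).
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 feature extractor with a feature pyramid network on top;
# the weights come from ImageNet pretraining.
backbone = resnet_fpn_backbone(
    backbone_name="resnet50",
    weights=ResNet50_Weights.IMAGENET1K_V1,
    trainable_layers=3,  # the last three stages stay trainable (TorchVision default)
)

# Two classes for a Penn-Fudan-style setup: background + pedestrian (our assumption).
model = MaskRCNN(backbone, num_classes=2)
```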




| Model | Cutout | CutMix | mixup | CCE | LossTransform |
|---|---|---|---|---|---|
| ResNet-20 (#params=0.27M) | ✗ | ✗ | ✗ | 7.07 | 6.46 |
| | ✓ | ✗ | ✗ | 7.23 | 6.70 |
| | ✗ | ✓ | ✗ | 7.50 | 6.13 |
| | ✗ | ✓ | ✓ | 7.70 | 6.27 |
| ResNet-40 (#params=0.66M) | ✗ | ✗ | ✗ | 6.32 | 5.73 |
| | ✓ | ✗ | ✗ | 6.22 | 5.58 |
| | ✗ | ✓ | ✗ | 5.99 | 4.47 |
| | ✗ | ✓ | ✓ | 7.70 | 4.39 |
| ResNet-110 (#params=1.7M) | ✗ | ✗ | ✗ | 6.10 | 5.64 |
| | ✓ | ✗ | ✗ | 5.71 | 4.90 |
| | ✗ | ✓ | ✗ | 5.07 | 4.28 |
| | ✗ | ✓ | ✓ | 5.76 | 4.02 |
| Model | loss function | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| ResNet-20 (#params=0.27M) | CCE | 7.70% | 33.60% |
| | SupCon | 6.55% | 31.10% |
| | LossTransform | 6.27% | 28.59% |
| ResNet-40 (#params=0.66M) | CCE | 6.68% | 29.80% |
| | SupCon | 5.23% | 26.48% |
| | LossTransform | 4.39% | 25.34% |
| ResNet-110 (#params=1.7M) | CCE | 5.76% | 28.45% |
| | SupCon | 4.56% | 25.14% |
| | LossTransform | 4.02% | 24.37% |
| DenseNet-40 (#params=1.0M) | CCE | 5.67% | 27.37% |
| | SupCon | 4.52% | 24.90% |
| | LossTransform | 4.35% | 23.55% |
| ResNet-50 (#params=25.6M) | CCE | 5.0% | 29.7% |
| | SimCLR | 6.4% | 29.3% |
| | Max-Margin | 7.6% | 29.5% |
| | SupCon | 4.0% | 23.5% |
| Model | loss function | Acc@1 | Acc@5 |
|---|---|---|---|
| MobileNet-V2 | CCE | 70.81% | 89.93% |
| | LossTransform | 71.26% | 90.48% |
| NASNetMobile | CCE | 72.35% | 90.87% |
| | LossTransform | 73.14% | 91.54% |
| EfficientNet-B0 | CCE | 74.57% | 92.13% |
| | LossTransform | 77.09% | 93.61% |
| Backbone | loss | Det. AP (%) | Det. AP50 (%) | Det. AP75 (%) | Seg. AP (%) | Seg. AP50 (%) | Seg. AP75 (%) |
|---|---|---|---|---|---|---|---|
| ConvNeXt-tiny | baseline | 65.4 | 96.7 | 82.7 | 60.5 | 91.5 | 79.2 |
| | LossTransform | 65.6 | 96.9 | 83.6 | 63.5 | 94.0 | 81.8 |
| DenseNet121 | baseline | 64.6 | 96.7 | 74.9 | 62.8 | 96.7 | 74.8 |
| | LossTransform | 67.1 | 99.8 | 82.7 | 68.7 | 99.8 | 88.3 |
| EfficientNetB4 | baseline | 73.8 | 100 | 91.3 | 66.8 | 96.8 | 88.7 |
| | LossTransform | 74.4 | 100 | 91.4 | 69.0 | 100 | 85.6 |
| GoogLeNet | baseline | 59.6 | 98.9 | 74.0 | 60.1 | 98.9 | 75.8 |
| | LossTransform | 59.4 | 96.6 | 68.7 | 63.1 | 96.1 | 81.6 |
| InceptionV3 | baseline | 71.0 | 99.7 | 87.0 | 64.6 | 99.5 | 82.2 |
| | LossTransform | 75.0 | 99.8 | 91.5 | 66.9 | 99.8 | 88.8 |
| MNASNet1.0 | baseline | 61.7 | 94.1 | 78.3 | 56.0 | 94.1 | 65.7 |
| | LossTransform | 65.5 | 96.5 | 81.1 | 61.0 | 93.7 | 83.0 |
| ResNet50 | baseline | 66.1 | 99.5 | 78.4 | 64.6 | 99.4 | 80.4 |
| | LossTransform | 70.7 | 99.9 | 88.6 | 70.1 | 99.9 | 91.8 |
| ShuffleNetV2 | baseline | 70.5 | 100 | 83.9 | 62.7 | 96.8 | 81.8 |
| | LossTransform | 72.0 | 99.3 | 90.9 | 64.0 | 96.5 | 85.6 |
| WideResNet50 | baseline | 69.0 | 100 | 84.5 | 70.8 | 100 | 91.2 |
| | LossTransform | 74.2 | 100 | 90.9 | 71.3 | 100 | 92.2 |
| Mask R-CNN | baseline | 86.9 | 100 | 96.8 | 80.6 | 100 | 96.8 |
| | LossTransform | 88.2 | 100 | 96.7 | 81.9 | 100 | 96.7 |
| Backbone | Baseline Acc@1 (%) | Baseline Acc@5 (%) | LossTrans Acc@1 (%) | LossTrans Acc@5 (%) |
|---|---|---|---|---|
| ResNet-50 | 76.130 | 92.862 | 76.206 | 93.138 |
| ResNext-50 | 77.618 | 93.698 | 77.853 | 93.923 |
| MobileNet-V2 | 71.878 | 90.286 | 72.020 | 90.310 |
| EfficientNet-B0 | 77.692 | 93.532 | 77.817 | 93.674 |
| ViT-B16 | 81.072 | 95.318 | 82.019 | 95.326 |
| Model | #Params | loss | Box mAP |
|---|---|---|---|
| Faster R-CNN ResNet-50 FPN | 41.8M | baseline | 37.0% |
| | | LossTransform | 37.1% |
| Faster R-CNN MobileNetV3-L FPN | 19.4M | baseline | 32.8% |
| | | LossTransform | 33.2% |
| RetinaNet ResNet-50 FPN | 34.0M | baseline | 36.4% |
| | | LossTransform | 36.6% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).