Submitted: 04 March 2025
Posted: 05 March 2025
Abstract
Keywords:
1. Introduction
- We improve on TSM, building on previous studies, and design the Dual Cascade Temporal Shift and Fusion (DualCascadeTSF) module. The module integrates feature fusion into TSM and strengthens the temporal connection between adjacent time steps without increasing the number of parameters.
- We combine the DualCascadeTSF module with MobileNetV2. The combination improves the accuracy of MobileNetV2 while using fewer parameters and running faster than other classic models, which makes it better suited to deployment on edge devices.
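
To make the "no additional parameters" point concrete, the following is a minimal sketch of the plain temporal shift from TSM (Lin et al.) that the module builds on. It is not the DualCascadeTSF module itself: the cascade and fusion steps are not specified in this section and are left out. The function name, the `fold_div` parameter, and the `[T][C]` list layout (spatial dimensions omitted) are assumptions made for illustration.

```python
def temporal_shift(frames, fold_div=8):
    """Parameter-free temporal shift in the style of TSM.

    frames:   list of T frames, each a list of C channel values
              (spatial dimensions are omitted to keep the sketch small).
    fold_div: hypothetical name; 1/fold_div of the channels are shifted
              in each temporal direction, the rest stay in place.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    # Positions with no source frame (sequence boundaries) stay at zero.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # first fold: read from the next frame
            elif c < 2 * fold:
                src = t - 1   # second fold: read from the previous frame
            else:
                src = t       # remaining channels are not shifted
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Toy input: 3 frames, 8 channels, value = 10 * frame_index + channel_index.
frames = [[float(10 * t + c) for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames)
print(shifted[1])  # [20.0, 1.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
```

The shift only moves existing values between neighbouring frames, so it has no learnable weights; any fusion of the shifted and unshifted features would be layered on top of this step.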
2. Related Work
2.1. Research on Lightweight Models
2.2. TSM and Two-Cascade TSM
2.3. MobileNetV2
3. Methods
3.1. Conception
3.2. DualCascadeTSF Module
3.3. DualCascadeTSF-MobileNetV2
4. Experiments
4.1. Datasets


4.2. Parameter Settings
4.3. Results
4.3.1. Ablation Experiments of Different Feature Fusion Methods
4.3.2. Ablation Study on Different Structures
4.3.3. Comparative Experiments with Classic Models
5. Discussion
6. Conclusion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| DualCascadeTSF | Dual Cascade Temporal Shift and Fusion module |
| TSM | Temporal Shift Module |
References
1. Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV); Seoul, South Korea, 27 October–2 November 2019.
2. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In 2017 IEEE International Conference on Computer Vision (ICCV); Venice, Italy, 22–29 October 2017.
3. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H. P. Pruning Filters for Efficient ConvNets. arXiv preprint arXiv:1608.08710, 2016.
4. Sanh, V.; Wolf, T.; Rush, A. M. Movement Pruning: Adaptive Sparsity by Fine-Tuning. arXiv preprint arXiv:2005.07683, 2020.
5. Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; others. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, 2017.
6. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018.
7. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018.
8. Iandola, F. N.; Han, S.; Moskewicz, M. W.; Ashraf, K.; Dally, W. J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
9. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; Adam, H.; Le, Q. Searching for MobileNetV3. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV); Seoul, South Korea, 27 October–2 November 2019.
10. Tan, M.; Le, Q. V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Statistics 2019, 2.
11. Wang, X.; Xiang, T.; Zhang, C.; Song, Y.; Liu, D.; Huang, H.; Cai, W. BiX-NAS: Searching Efficient Bi-directional Architecture for Medical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; 2021.
12. Yan, B.; Peng, H.; Wu, K.; Wang, D.; Fu, J.; Lu, H. LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021.
13. Cai, H.; Gan, C.; Han, S. Once for All: Train One Network and Specialize it for Efficient Deployment. Statistics 2019.
14. Liang, Q.; Li, Y.; Chen, B.; Yang, K. Violence behavior recognition of two-cascade temporal shift module with attention mechanism. Journal of Electronic Imaging 2021, 30.
15. Hassner, T.; Itcher, Y.; Kliper-Gross, O. Violent flows: Real-time detection of violent crowd behavior. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops; 2012.
16. Cheng, M.; Cai, K.; Li, M. RWF-2000: An Open Large Scale Video Database for Violence Detection. In 2020 25th International Conference on Pattern Recognition (ICPR); 2021.
17. Bermejo Nievas, E.; Deniz Suarez, O.; Bueno Garcia, G.; Sukthankar, R. Violence Detection in Video Using Computer Vision Techniques. In International Conference on Computer Analysis of Images and Patterns; 2011.
18. Zhang, Y.; Li, Y.; Guo, S. Lightweight mobile network for real-time violence recognition. PLoS ONE 2022, 17.
19. Shafiq, M.; Gu, Z. Deep Residual Learning for Image Recognition: A Survey. Applied Sciences 2022, 12, 8972.
20. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2013, 35, 221–231.
21. Donahue, J.; Hendricks, L. A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 677–691.
22. Carreira, J.; Zisserman, A. Quo Vadis, action recognition? A new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017.
23. Wang, W.; Dong, S.; Zou, K.; Li, W. A Lightweight Network for Violence Detection. In ICIGP 2022: 2022 the 5th International Conference on Image and Graphics Processing (ICIGP); 2022.
24. Meng, Y.; Lin, C.-C.; Panda, R.; Sattigeri, P.; Karlinsky, L.; Oliva, A.; Saenko, K.; Feris, R. AR-Net: Adaptive Frame Resolution for Efficient Action Recognition. In Computer Vision – ECCV 2020; 2020.
25. Li, Y.; Ji, B.; Shi, X.; Zhang, J.; Kang, B.; Wang, L. TEA: Temporal Excitation and Aggregation for Action Recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020.
26. Asad, M.; Jiang, H.; Yang, J.; Tu, E.; Malik, A. A. Multi-Level Two-Stream Fusion-Based Spatio-Temporal Attention Model for Violence Detection and Localization. International Journal of Pattern Recognition and Artificial Intelligence 2022, 36, 1–25.
27. Mohammadi, H.; Nazerfard, E. Video violence recognition and localization using a semi-supervised hard attention model. Expert Systems with Applications 2023, 212, 118791.
28. Zhang, Y.; Li, Y.; Guo, S.; Liang, Q. Not all temporal shift modules are profitable. Journal of Electronic Imaging 2022, 31.
29. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision – ECCV 2016, Part VIII; 2016; Vol. 9912, pp 20–36.









| Layer | Structure | DualCascadeTSF Included | Output Size | Repetitions |
|---|---|---|---|---|
| conv1 | 3 × 3 × 32 | | 112 × 112 | ×1 |
| bottleneck1 | 1×1×32, 3×3×32, 1×1×16 | Yes | 112 × 112 | ×1 |
| bottleneck2 | 1×1×96, 3×3×96, 1×1×24 | Yes | 56 × 56 | ×2 |
| bottleneck3 | 1×1×192, 3×3×192, 1×1×32 | Yes | 28 × 28 | ×3 |
| bottleneck4 | 1×1×384, 3×3×384, 1×1×64 | Yes | 14 × 14 | ×4 |
| bottleneck5 | 1×1×576, 3×3×576, 1×1×96 | Yes | 14 × 14 | ×3 |
| bottleneck6 | 1×1×960, 3×3×960, 1×1×160 | Yes | 7 × 7 | ×3 |
| bottleneck7 | 1×1×960, 3×3×960, 1×1×320 | Yes | 7 × 7 | ×1 |
| conv2 | 1 × 1 × 1280 | | 7 × 7 | ×1 |
| Avgpool | Pool 7 × 7 | | 1 × 1 | ×1 |
| FC | Linear 1280 → 2 | | 1 × 1 | ×1 |

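
The bottleneck rows of the architecture table can be written down as a small configuration list, which makes it easy to check the block count and where the spatial resolution drops. This is only a sketch: the `Stage` record and its field names are hypothetical, and reading each row as (expanded width, output width) is an interpretation of the "Structure" column rather than something the table states.

```python
from collections import namedtuple

# Hypothetical record type; field names are illustrative only.
Stage = namedtuple("Stage", "name expand_channels out_channels output_size repeats")

# Values transcribed from the bottleneck rows of the table.
BOTTLENECK_STAGES = [
    Stage("bottleneck1", 32, 16, 112, 1),
    Stage("bottleneck2", 96, 24, 56, 2),
    Stage("bottleneck3", 192, 32, 28, 3),
    Stage("bottleneck4", 384, 64, 14, 4),
    Stage("bottleneck5", 576, 96, 14, 3),
    Stage("bottleneck6", 960, 160, 7, 3),
    Stage("bottleneck7", 960, 320, 7, 1),
]


def total_blocks(stages):
    """Total number of bottleneck blocks after applying the repetition counts."""
    return sum(stage.repeats for stage in stages)


def downsampling_stages(stages, input_size=112):
    """Names of stages whose output is smaller than the preceding feature map.

    input_size defaults to the conv1 output size listed in the table.
    """
    names = []
    previous = input_size
    for stage in stages:
        if stage.output_size < previous:
            names.append(stage.name)
        previous = stage.output_size
    return names


print(total_blocks(BOTTLENECK_STAGES))         # 17
print(downsampling_stages(BOTTLENECK_STAGES))  # ['bottleneck2', 'bottleneck3', 'bottleneck4', 'bottleneck6']
```

The 17 blocks match the standard MobileNetV2 layout, which suggests the table keeps the backbone structure unchanged and only marks where the module is included.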
| Model | Crowd Violence (%) | RWF-2000 (%) | Hockey Fights (%) |
|---|---|---|---|
| ResNet-50 [19] | 93.878 | 84 | 95.5 |
| 3D-CNN [20] | 94.3 | 82.75 | 94.4 |
| LRCN [21] | 94.57 | 77 | 97.1 |
| I3D [22] | 88.89 | 85.75 | 97.5 |
| MiNet-3D [23] | 91.41 | 81.98 | 94.71 |
| AR-Net [24] | 95.918 | 87.3 | 97.2 |
| TSM [1] | 95.95 | 88 | 97.5 |
| TEA [25] | 96.939 | 88.5 | 97.7 |
| Two-cascade TSM [14] | 96.939 | 89 | 98.05 |
| SAM [26] | 98.15 | 89.1 | 99.1 |
| SSHA [27] | - | 90.4 | 98.7 |
| P-TSM [28] | 96.969 | 91 | 98.5 |
| MobileNet-TSM [18] | 97.959 | 87.75 | 97.5 |
| DualCascadeTSF-MobileNetV2 (ours) | 98.98 | 88.5 | 98.0 |

| Model | Parameters (MB) | Training Time (min) | Memory Size (MB) |
|---|---|---|---|
| 3D-CNN [20] | 297.83 | 248.38 | 2647.7 |
| LRCN [21] | 237.83 | 185.12 | 1212.93 |
| I3D [22] | 46.88 | 123.8 | 1000.2 |
| TSN [29] | 89.69 | 46.68 | 390.05 |
| TSM [1] | 89.69 | 113.43 | 397.71 |
| TEA [25] | 91.95 | 190.8 | 479.78 |
| Two-cascade TSM [14] | 89.69 | 128.33 | 397.71 |
| P-TSM [28] | 89.69 | 87.63 | 396.57 |
| MobileNet-TSM [18] | 8.49 | 32.68 | 175.86 |
| DualCascadeTSF-MobileNetV2 (ours) | 16.99 | 35.92 | 347.13 |
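
To read the efficiency and accuracy tables side by side, the short sketch below computes how many times larger or slower a baseline is than the proposed model, plus the accuracy gap in percentage points. The numbers are transcribed from a few rows of the tables; the dictionary layout and function names are assumptions made for illustration.

```python
OURS = "DualCascadeTSF-MobileNetV2"

# Selected rows transcribed from the tables above.
PARAMS_MB = {"TSM": 89.69, "Two-cascade TSM": 89.69, "MobileNet-TSM": 8.49, OURS: 16.99}
TRAIN_MIN = {"TSM": 113.43, "Two-cascade TSM": 128.33, "MobileNet-TSM": 32.68, OURS: 35.92}
CROWD_VIOLENCE_ACC = {"TSM": 95.95, "Two-cascade TSM": 96.939, "MobileNet-TSM": 97.959, OURS: 98.98}


def ratio_to_ours(table, baseline):
    """How many times larger the baseline value is than the proposed model's."""
    return table[baseline] / table[OURS]


def accuracy_gain(table, baseline):
    """Accuracy difference (ours minus baseline) in percentage points."""
    return table[OURS] - table[baseline]


for name in ("TSM", "Two-cascade TSM", "MobileNet-TSM"):
    print(
        f"{name}: params x{ratio_to_ours(PARAMS_MB, name):.2f}, "
        f"training time x{ratio_to_ours(TRAIN_MIN, name):.2f}, "
        f"Crowd Violence gain {accuracy_gain(CROWD_VIOLENCE_ACC, name):+.2f} pts"
    )
```

A ratio above 1 means the baseline is larger or slower than the proposed model; a ratio below 1, as for MobileNet-TSM in both columns, means the baseline is the smaller or faster of the two.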
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).