Submitted: 03 October 2025
Posted: 03 October 2025
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Overview
- Temporal Feature Adapter: We compare five Parameter-Efficient Fine-Tuning (PEFT) adapters (the original BottleneckAdapter, InvertedConvNeXtAdapter, GRNConvNeXtAdapter, Adapter+, and Compacter) to optimize temporal feature learning while the VideoMAE backbone remains frozen.
- Decoder Head: We replace the standard Multilayer Perceptron (MLP) decoder in the detection head with a Kolmogorov-Arnold Network (KAN) that utilizes Chebyshev polynomials, hypothesizing it can better model the non-linear dynamics of swallowing.
- Input Modality: To leverage depth information, we evaluate a channel substitution strategy (RGD, RDB, DGB) and compare its performance against standard RGB and traditional early/late fusion RGB-D methods.
- Regression Method: We compare two strategies for boundary prediction: centerness-based regression and direct boundary regression.
- Patch Embedding: We assess the impact of the positional encoding technique by comparing the standard sinusoidal positional encoding with Rotary Position Embedding (RoPE).
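To make the channel substitution idea in the Input Modality item concrete, the following is a minimal sketch in Python with NumPy and is not the authors' implementation. It assumes that each three-letter name gives the channel order after substitution (for example, RGD keeps red and green and places depth in the blue slot) and that depth is min-max scaled to the 0 to 255 range; the source states neither detail. The function name `substitute_channel` and the mapping `SUBSTITUTED_CHANNEL` are hypothetical.

```python
import numpy as np

# Hypothetical mapping: index of the RGB channel that the depth map replaces.
# The names come from the text (RGD, RDB, DGB); the mapping is an assumed
# reading of those names.
SUBSTITUTED_CHANNEL = {"DGB": 0, "RDB": 1, "RGD": 2}


def substitute_channel(rgb, depth, mode):
    """Return a 3-channel frame in which depth replaces one colour channel.

    rgb:   uint8 array of shape (H, W, 3) in R, G, B order.
    depth: numeric array of shape (H, W).
    mode:  "RGB" (no change), "RGD", "RDB" or "DGB".
    """
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if depth.shape != rgb.shape[:2]:
        raise ValueError("depth must match the spatial size of rgb")
    if mode == "RGB":
        return rgb.copy()
    if mode not in SUBSTITUTED_CHANNEL:
        raise ValueError(f"unknown mode: {mode}")

    # Min-max scale depth to 0..255 so it shares the colour channels' range
    # (an assumption; the source does not say how depth is normalised).
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    if span > 0:
        d = (d - d.min()) / span * 255.0
    else:
        d = np.zeros_like(d)

    out = rgb.copy()
    out[..., SUBSTITUTED_CHANNEL[mode]] = np.round(d).astype(np.uint8)
    return out


# Toy example: a uniform grey frame and a small depth map.
rgb = np.full((2, 2, 3), 100, dtype=np.uint8)
depth = np.array([[0.5, 1.0], [1.5, 2.5]])
rgd = substitute_channel(rgb, depth, "RGD")
```

Under these assumptions the output keeps three channels, so it could be passed to a backbone that expects RGB input without changing the patch embedding layer, whereas early fusion of RGB-D would typically require a four-channel input.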
2.2. Data Acquisition and Labelling
2.3. Data Processing
2.4. Baseline Model Architecture
2.5. Adapter Exploration
2.6. Decoder Selection
2.7. Data Input Strategy
2.8. Regression Method
2.9. Patch Embedding Method
2.10. Model Training
2.11. Model Evaluation
3. Results
3.1. Benchmarking and Initial Ablation
3.2. Domain Adaptation and Model Performance

3.3. Adapter Fine-Tuning
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest





| Data metric | Training | Testing | Total |
|---|---|---|---|
| No. of video clips | 512 (79.88%) | 129 (20.12%) | 641 |
| Total duration (min) | 229.6 | 55.17 | 284.71 |
| Mean (SD) duration (s) | 26.90 (20.92) | 25.66 (20.02) | 26.65 (20.75) |
| Max duration (s) | 150.80 | 122.22 | – |
| Min duration (s) | 4.74 | 7.04 | – |
| Annotated swallowing events | 1091 | 260 | 1351 |
| Annotated non-swallowing events | 1427 | 375 | 1802 |
| Total No. of events | 2518 (79.86%) | 635 (20.14%) | 3153 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).