Submitted: 25 February 2024
Posted: 27 February 2024
Abstract
Keywords:
1. Introduction
2. Understanding Swin Transformer
3. Diverse Datasets
- American Sign Language Lexicon Video Dataset (ASLLVD): Consists of videos of more than 3,300 ASL signs in citation form, each produced by 1-6 native ASL signers [16].
- Word-Level American Sign Language (WLASL) Video Dataset on Kaggle: This dataset contains 12k processed videos of word-level ASL glossary performances [17].
- ASL Citizen by Microsoft Research: The first crowdsourced isolated sign language video dataset containing about 84k video recordings of 2.7k isolated signs from ASL [18].
- MS-ASL Dataset: A large-scale sign language dataset comprising over 25,000 annotated videos [19].
- OpenASL Dataset: A large-scale ASL-English dataset collected from online video sites, containing 288 hours of ASL videos in multiple domains from over 200 signers [20].
- How2Sign Dataset: A multimodal and multiview continuous ASL dataset, consisting of a parallel corpus of more than 80 hours of sign language videos along with corresponding modalities including speech, English transcripts, and depth [21].
- YouTube-ASL Dataset: A large-scale, open-domain corpus of ASL videos and accompanying English captions drawn from YouTube, with about 1000 hours of videos [22].
- ASL Video Dataset by Boston University: A large and expanding public dataset containing video sequences of thousands of distinct ASL signs (produced by native signers of ASL), along with annotations of those sequences [23].
4. Related Work
5. Applying Transformers to the ASL Dataset
6. Case Studies and Applications
- (1) Develop a user-friendly interface for ASL translation, ensuring it is suitable for the intended users and use cases.
- (2) Create interactive and engaging learning tools for ASL education.
7. Sign Language and LLMs
8. Discussion
9. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- GitHub Repository of DAT Transformer. Available online: https://github.com/LeapLabTHU/DAT (accessed on 24 February 2024).
- GitHub Repository of Swin Transformer. Available online: https://github.com/microsoft/Swin-Transformer (accessed on 24 February 2024).
- ASL Alphabet [Online]. Available online: https://www.kaggle.com/datasets/grassknoted/asl-alphabet (accessed on 24 February 2024).
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 10012–10022.
- Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Guo, B. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 12009–12019.
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 3202–3211.
- Lu, Y.; You, K.; Zhou, C.; Chen, J.; Wu, Z.; Jiang, Y.; Huang, C. Video surveillance-based multi-task learning with Swin transformer for earthwork activity classification. Engineering Applications of Artificial Intelligence 2024, 131, 107814.
- Hu, X.; Hampiholi, B.; Neumann, H.; Lang, J. Temporal Context Enhanced Referring Video Object Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2024; pp. 5574–5583.
- Yu, Z.; Guan, F.; Lu, Y.; Li, X.; Chen, Z. Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine Strategy. arXiv preprint 2024, arXiv:2401.08522.
- Dosovitskiy, A.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint 2020, arXiv:2010.11929.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint 2014, arXiv:1409.1556.
- Xie, S.; et al. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.
- Kirillov, A.; et al. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019.
- Huang, G.; et al. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv preprint 2022, arXiv:2201.01266.
- Home Page of ASLLVD (American Sign Language Lexicon Video Dataset). Available online: https://paperswithcode.com/dataset/asllvd (accessed on 24 February 2024).
- WLASL Dataset on Kaggle. Available online: https://www.kaggle.com/datasets/grassknoted/asl-alphabet (accessed on 24 February 2024).
- Microsoft Research ASL Citizen Dataset. Available online: https://www.microsoft.com/en-us/research/project/asl-citizen/ (accessed on 24 February 2024).
- MS-ASL Dataset. Available online: https://www.microsoft.com/en-us/research/project/ms-asl/ (accessed on 24 February 2024).
- GitHub Repository of OpenASL Dataset. Available online: https://github.com/chevalierNoir/OpenASL (accessed on 24 February 2024).
- GitHub Repository of How2sign Dataset. Available online: https://how2sign.github.io/ (accessed on 24 February 2024).
- Uthus, D.; Tanzer, G.; Georg, M. YouTube-ASL: A large-scale, open-domain American sign language-English parallel corpus. Advances in Neural Information Processing Systems 2024, 36.
- Colarossi, J. World’s Largest American Sign Language Database Makes ASL Even More Accessible. 2021. Available online: https://www.bu.edu/articles/2021/worlds-largest-american-sign-language-database-makes-asl-even-more-accessible/ (accessed on 24 February 2024).
- Home Page of TAT (Taiwanese Across Taiwan). Available online: https://paperswithcode.com/dataset/tat (accessed on 24 February 2024).
- A Survey of Sign Language in Taiwan. Available online: https://www.sil.org/resources/archives/9125 (accessed on 24 February 2024).
- Sklar, J. A Mobile App Gives Deaf People a Sign-Language Interpreter They Can Take Anywhere. Available online: https://www.technologyreview.com/innovator/ronaldo-tenorio/ (accessed on 24 February 2024).
- Ankit Jain. Project Idea | Audio to Sign Language Translator. Available online: https://www.geeksforgeeks.org/project-idea-audio-sign-language-translator/ (accessed on 24 February 2024).
- English to Sign Language (ASL) Translator. Available online: https://wecapable.com/tools/text-to-sign-language-converter/ (accessed on 24 February 2024).
- The ASL App (ASL for the People) on Google Play. Available online: https://theaslapp.com/about (accessed on 24 February 2024).
- iASL App on Speechie Apps. Available online: https://speechieapps.wordpress.com/2012/03/26/iasl/ (accessed on 24 February 2024).
- Sign 4 Me App. Available online: https://apps.microsoft.com/detail/9pn9qd80mblx?hl=en-us&gl=US (accessed on 24 February 2024).
- ASL Dictionary App. Available online: https://play.google.com/store/apps/details?id=com.signtel&gl=US (accessed on 24 February 2024).
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 1833–1844.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 205–218.
- Xie, Z.; Lin, Y.; Yao, Z.; Zhang, Z.; Dai, Q.; Cao, Y.; Hu, H. Self-supervised learning with Swin transformers. arXiv preprint 2021, arXiv:2105.04553.
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–15.
- Zu, B.; Cao, T.; Li, Y.; Li, J.; Ju, F.; Wang, H. SwinT-SRNet: Swin transformer with image super-resolution reconstruction network for pollen images classification. Engineering Applications of Artificial Intelligence 2024, 133, 108041.
- Nguyen, L.X.; Tun, Y.L.; Tun, Y.K.; Nguyen, M.N.; Zhang, C.; Han, Z.; Hong, C.S. Swin transformer-based dynamic semantic communication for multi-user with different computing capacity. IEEE Transactions on Vehicular Technology 2024.
- MohanRajan, S.N.; Loganathan, A.; Manoharan, P.; Alenizi, F.A. Fuzzy Swin transformer for Land Use/Land Cover change detection using LISS-III Satellite data. Earth Science Informatics 2024, 1–20.
- Ekanayake, M.; Pawar, K.; Harandi, M.; Egan, G.; Chen, Z. McSTRA: A multi-branch cascaded Swin transformer for point spread function-guided robust MRI reconstruction. Computers in Biology and Medicine 2024, 168, 107775.
- Lu, Y.; You, K.; Zhou, C.; Chen, J.; Wu, Z.; Jiang, Y.; Huang, C. Video surveillance-based multi-task learning with Swin transformer for earthwork activity classification. Engineering Applications of Artificial Intelligence 2024, 131, 107814.
- Lin, Y.; Han, X.; Chen, K.; Zhang, W.; Liu, Q. CSwinDoubleU-Net: A double U-shaped network combined with convolution and Swin Transformer for colorectal polyp segmentation. Biomedical Signal Processing and Control 2024, 89, 105749.
- Pan, C.; Chen, J.; Huang, R. Medical image detection and classification of renal incidentalomas based on YOLOv4+ ASFF swin transformer. Journal of Radiation Research and Applied Sciences 2024, 17, 100845.
- Kumar, Y.; Huang, K.; Gordon, Z.; Castro, L.; Okumu, E.; Morreale, P.; Li, J.J. Transformers and LLMs as the New Benchmark in Early Cancer Detection. In ITM Web of Conferences; EDP Sciences; 2024; Volume 60, p. 00004.
- Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. arXiv 2022.
- Tellez, N.; Serra, J.; Kumar, Y.; Li, J.J.; Morreale, P. Gauging Biases in Various Deep Learning AI Models. In Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems; Arai, K., Ed.; Springer: Cham, 2023; Volume 544.
- Delgado, J.; Ebreso, U.; Kumar, Y.; Li, J.J.; Morreale, P. Preliminary Results of Applying Transformers to Geoscience and Earth Science Data. In Proceedings of the 2022 International Conference on Computational Science and Computational Intelligence (CSCI); 2022; pp. 284–288.
- Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision; 2017; pp. 618–626.
- TeachableMachines Web Tool Page. Available online: https://teachablemachine.withgoogle.com/models/TY21XA7_Q/ (accessed on 24 February 2024).
- Home Page of the NAD Youth. Available online: https://youth.nad.org/ (accessed on 24 February 2024).
- Home Page of the NAD. Available online: https://www.nad.org/resources/american-sign-language/learning-american-sign-language/ (accessed on 24 February 2024).

| Trial Parameter | Comments |
|---|---|
| Initial Dataset | 87,000 images |
| Trial Dataset (split) | 80% for training, 20% for testing, at random |
| Classification | 29 classes (A to Z, Space, Del, and Nothing) |
| Batch Size | 16 |
| Trial Dataset (image size) | 256×256 (resized) |
| Optimizer used | SGD, learning rate 0.001 |
| Number of Epochs | 100 |
| PyTorch version | 1.12.1 |

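
The trial parameters above imply some simple arithmetic about the size of each split and the number of optimizer steps. The following is a minimal sketch of that arithmetic using only the Python standard library. The function name `split_indices` and the fixed seed are hypothetical additions for illustration; the source states only that the 80/20 split is made at random.

```python
import math
import random

# Values taken from the trial-parameter table.
NUM_IMAGES = 87_000
TRAIN_FRACTION = 0.80
BATCH_SIZE = 16
NUM_EPOCHS = 100


def split_indices(num_items, train_fraction, seed=0):
    """Shuffle item indices and cut them into train and test lists.

    The seed is an assumption added so the sketch is reproducible;
    it is not part of the reported setup.
    """
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    cut = round(num_items * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(NUM_IMAGES, TRAIN_FRACTION)

# One optimizer step per batch; the last partial batch (if any) still counts.
steps_per_epoch = math.ceil(len(train_idx) / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

print("train images:", len(train_idx))
print("test images:", len(test_idx))
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
```

Under these assumptions the split yields 69,600 training images and 17,400 test images, giving 4,350 steps per epoch. In a PyTorch setup these numbers would typically map onto `torch.utils.data.random_split` and `torch.optim.SGD` with `lr=0.001`, although the source does not say which utilities were actually used.
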
| Model | Number of Parameters | Accuracy |
|---|---|---|
| DAT Transformer | 86,886,357 | 99.99% |
| VGG-16 | 165,845,085 | 100% |
| ResNet-50 | 23,567,453 | 100% |
| Swin Transformer | 65,960,349 | 100% |
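
Because the reported accuracies are nearly identical, one possible way to read the comparison table is by model size. The sketch below only rearranges the numbers reported in the table; the `results` dictionary and the helper `relative_size` are hypothetical names introduced for illustration, and no new measurements are implied.

```python
# Parameter counts and accuracies (%) as reported in the comparison table.
results = {
    "DAT Transformer": (86_886_357, 99.99),
    "VGG-16": (165_845_085, 100.0),
    "ResNet-50": (23_567_453, 100.0),
    "Swin Transformer": (65_960_349, 100.0),
}

# Smallest parameter count among the listed models, used as the baseline.
smallest = min(count for count, _ in results.values())


def relative_size(name):
    """Parameter count of a model as a multiple of the smallest model."""
    return results[name][0] / smallest


# List the models from smallest to largest.
ranked = sorted(results, key=lambda name: results[name][0])
for name in ranked:
    count, accuracy = results[name]
    print(f"{name:<17} {count / 1e6:7.1f}M  {relative_size(name):4.1f}x  {accuracy:.2f}%")
```

Sorting this way makes the size gap explicit: the largest listed model has roughly seven times the parameters of the smallest, while the reported accuracies differ by at most 0.01 percentage points.
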

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).