Real-Time Sports Action Recognition Using a CNN–Transformer Hybrid Deep Learning Framework

Valli Nayagam; Anukarthika S; Muhesh Krishnaa S; Sri Sathya K B

doi:10.20944/preprints202604.0150.v1

Submitted:

01 April 2026

Posted:

02 April 2026

You are already at the latest version

Abstract

The rapid expansion of sports broadcasting and digital media platforms has increased the demand for intelligent systems capable of automatically identifying important sports events for real-time analytics and highlight generation. Manual annotation of sports videos requires significant time and effort and may introduce human errors during analysis. This paper presents a real-time sports action recognition framework using a hybrid CNN–Transformer architecture for detecting critical events in football and cricket videos. The proposed system processes live or recorded video streams through frame extraction, normalization, and spatial feature learning using the MobileNetV2 network. Temporal relationships between consecutive frames are modeled using a Transformer encoder to improve action understanding. The framework classifies events such as pass and goal in football, and four, six, and wicket in cricket. Motion-based filtering and confidence thresholding reduce non-action frames and improve prediction reliability. Detected events are recorded with timestamps and displayed using broadcast-style overlays to support automated highlight generation. Experimental evaluation demonstrates high recognition accuracy and efficient real-time performance on low-cost hardware platforms. The framework provides an effective solution for sports analytics, media automation, and intelligent decision-support systems.

Keywords:

sports action recognition

;

CNN–transformer

;

real-time video analytics

;

highlight generation

;

MobileNetV2

;

deep learning

Subject:

Computer Science and Mathematics - Computer Vision and Graphics

I. Introduction

The rapid expansion of sports broadcasting and digital streaming platforms generates an enormous volume of sports video content every day. Efficient analysis of such content requires intelligent systems capable of automatically identifying important events without human intervention. Sports action recognition therefore becomes an important application of computer vision and artificial intelligence. Recent survey studies highlight the growing interest in automated sports video analysis due to its applications in sports analytics, automated highlight generation, and intelligent broadcasting systems [1,2].

Sports videos contain complex visual patterns including rapid player movements, dynamic camera motion, occlusions, and varying lighting conditions. These characteristics make automatic action recognition a challenging task. In games such as football and cricket, crucial events occur within short time intervals and require both spatial understanding of visual features and temporal interpretation of motion patterns. Survey research on video understanding emphasizes that traditional image-based recognition approaches are insufficient because they process frames independently and fail to capture motion information across time sequences [3,4].

Deep learning techniques significantly improve video analysis by automatically learning hierarchical feature representations. Convolutional Neural Networks (CNNs) are widely used for extracting spatial features from video frames, while temporal models such as recurrent neural networks and transformer architectures capture relationships between consecutive frames. Recent survey papers indicate that hybrid deep learning models combining CNNs with temporal learning mechanisms achieve better performance in action recognition tasks [5,6].

Transformer-based architectures introduce attention mechanisms that enable models to capture long-range dependencies across video frames. These models demonstrate promising results in various video understanding applications including sports event detection and activity recognition. However, existing approaches often require large computational resources and may not support efficient real-time processing in practical deployment environments [7,8,9,10]

A. Problem Statement

Sports broadcasting platforms generate a large volume of video data every day, creating a need for automated systems capable of identifying important sports events efficiently. Manual annotation of sports videos requires significant time and human effort and may introduce inconsistencies in analysis. Although deep learning approaches improve the performance of action recognition systems, several limitations remain in existing methods.

Many current sports action recognition models operate in offline environments and require high computational resources for processing video data. These models often struggle to achieve real-time performance, particularly when deployed on low-cost hardware platforms. In addition, several existing approaches focus only on a single sport or limited action categories, which restricts their applicability in real-world sports analytics systems.

Furthermore, sports videos contain complex visual challenges such as dynamic camera movements, occlusions, background clutter, and rapid scene transitions. These factors increase the difficulty of accurately identifying key events while minimizing false detections. Therefore, an efficient and scalable framework is required to detect important sports actions in real time while maintaining high accuracy and low computational complexity.

B. Aim

The aim of the proposed research is to develop a real-time sports action recognition framework using a CNN–Transformer based deep learning architecture to accurately detect key events in football and cricket videos.

C. Research Objectives

To achieve the above aim, the following objectives are defined:

Design an efficient deep learning framework for recognizing sports actions from video streams.
Extract spatial features from video frames using a lightweight convolutional neural network architecture.
Model temporal relationships between consecutive frames using a Transformer encoder.
Improve prediction reliability by applying motion filtering, confidence thresholding, and temporal smoothing techniques.
Generate time-stamped event detection outputs with broadcast-style overlays for automated highlight generation and sports analytics.

D. Proposed Solution

To address the identified challenges, the proposed research introduces a hybrid CNN–Transformer framework that integrates efficient spatial feature extraction with temporal modeling capabilities. A lightweight pre-trained MobileNetV2 network extracts spatial features from video frames, enabling efficient processing with reduced computational requirements. The extracted features are then passed to a Transformer encoder, which models temporal dependencies across consecutive frames using a self-attention mechanism.

The system processes sampled video frames in real time and applies post-processing techniques such as confidence filtering and temporal smoothing to improve prediction stability. This approach enables accurate detection of sports events while maintaining efficient inference performance on low-cost hardware platforms.

E. Contributions of the Work

The main contributions of the proposed research are summarized as follows:

Development of a unified multi-sport action recognition framework capable of detecting events in both football and cricket videos.
Integration of a CNN–Transformer hybrid architecture designed for efficient real-time inference.
Implementation of motion-based filtering and temporal smoothing techniques to reduce false predictions.
Generation of time-stamped event logs and broadcast-style overlays to support automated highlight generation.
Demonstration of accurate and efficient performance on low-cost computing hardware.

F. Scope of the Work

The scope of the proposed research focuses on the development of a practical real-time sports action recognition system using broadcast sports video data. The framework aims to detect high-impact events in football and cricket videos, supporting automated highlight generation and sports analytics applications. The system operates on single-camera video streams and emphasizes low-latency processing for real-time deployment.

Although the current implementation targets selected sports actions, the modular architecture enables extension to additional sports, action categories, and live video streams in future developments. This design ensures scalability while maintaining a balance between accuracy, efficiency, and practical applicability.

II. Literature Survey

Guo et al. (2025) introduced RSFormer, a transformer-based architecture designed for recognizing sports actions from video sequences. The model aims to better understand temporal relationships in sports footage by applying attention mechanisms across features extracted from consecutive frames. By learning motion dynamics and long-range dependencies within the video, the framework enhances the recognition of complex sports activities. The experimental evaluation shows that transformer-driven models can achieve higher accuracy in sports action recognition compared with many conventional approaches [26,29].

Zhang and Yang (2025) developed FoT (Football Transformer), a transformer-based framework optimized for detecting small objects in football broadcast videos. The method focuses on improving the identification of objects that are difficult to detect, such as footballs and players appearing far from the camera. By using a lightweight transformer structure, the model maintains real-time processing capability while improving detection accuracy. Their work demonstrates that efficient transformer architectures can be effectively applied to real-time sports video analysis [30].

Kristina and Ivasic-Kos (2022) provided an extensive review of computer vision techniques used for human action recognition in sports. The study examined various methodological developments, including traditional machine learning algorithms, convolutional neural networks, and other deep learning approaches used to analyze sports activities. The authors also discussed several practical challenges encountered in sports video analysis, such as player occlusion, complex backgrounds, and fast movements. Their review highlights the growing importance of deep learning methods in advancing modern sports analytics systems [1,3].

Xarles et al. (2024) proposed ASTRA (Action Spotting Transformer), a transformer-based approach designed to detect significant events within soccer videos. Instead of processing entire video sequences uniformly, the framework uses attention mechanisms to emphasize temporal segments that are more likely to contain important sports actions. This targeted analysis improves the efficiency and accuracy of identifying key moments during sports broadcasts. The results demonstrate that transformer-based models can effectively support event detection tasks in sports video streams [26].

Wu et al. (2023) conducted a comprehensive survey on video action recognition in sports, reviewing existing datasets, methodologies, and real-world applications. The study analyzed a wide range of techniques, including CNN-based models, recurrent neural networks, and transformer architectures applied to sports video analysis. The authors emphasized the strength of transformer models in modeling long-term temporal relationships between frames. At the same time, the survey highlighted challenges such as high computational requirements and difficulties in achieving real-time performance [1,2].

Bertasius et al. (2022) proposed a transformer-based architecture for video action recognition that focuses on modeling space–time relationships using attention mechanisms. The model analyzes spatial and temporal features simultaneously to better capture motion dynamics across video frames. Experimental results demonstrate that space–time attention significantly improves the recognition of complex actions in video datasets. Their work highlights the potential of transformer-based architectures for improving video understanding tasks [21].

Liu et al. (2022) introduced the Video Swin Transformer, a hierarchical transformer architecture designed for efficient video understanding. The model processes video frames using shifted window attention mechanisms, enabling effective learning of spatial and temporal features while maintaining computational efficiency. The framework achieved strong performance on several benchmark datasets, demonstrating its capability for action recognition and event detection in complex video environments [18].

Arnab et al. (2022) proposed the Video Vision Transformer (ViViT), which applies transformer architectures directly to video classification tasks. The model decomposes video data into spatial and temporal tokens and uses self-attention mechanisms to capture long-range dependencies between frames. Experimental evaluations show that ViViT achieves competitive performance in action recognition while providing a scalable architecture for large-scale video analysis [17].

The reviewed studies demonstrate significant progress in sports action recognition using deep learning techniques. Early research focused on convolutional neural networks for spatial feature extraction, while recent studies increasingly utilize transformer architectures to model temporal dependencies in video sequences. These approaches have improved the accuracy of event detection in sports videos. However, many existing methods rely on computationally expensive architectures or are designed for offline analysis, limiting their applicability in real-time sports analytics systems. These limitations highlight the need for efficient hybrid models that combine lightweight spatial feature extraction with effective temporal modeling.

B. Limitations

Although several deep learning approaches have been proposed for sports action recognition, existing methods still exhibit certain limitations. Many models rely on computationally intensive architectures such as 3D convolutional networks or large transformer models, which require high-end hardware resources and limit their practical deployment in real-time environments. Additionally, several approaches focus on a single sport or a limited number of action categories, reducing their generalization capability across different sports scenarios.

Furthermore, some existing methods primarily emphasize spatial feature extraction while providing limited attention to temporal relationships between consecutive video frames. This limitation can affect the accurate recognition of complex sports actions that occur within short time intervals.

To overcome these challenges, the proposed research introduces a hybrid CNN–Transformer framework that integrates efficient spatial feature extraction using MobileNetV2 with transformer-based temporal modeling. This approach reduces computational complexity while improving temporal understanding of sports actions, enabling accurate and real-time event detection in football and cricket videos.

III. Related Research and Technical Background

A. Traditional and Classical Approaches

Early sports action recognition systems relied on handcrafted visual and motion features such as Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), optical flow, and trajectory-based descriptors. These features were combined with classical classifiers such as Support Vector Machines (SVM) and Hidden Markov Models (HMM). While these approaches performed reasonably in controlled environments, they failed to generalize to real-world broadcast videos due to camera motion, cluttered backgrounds, and variations in lighting and viewpoints.

B. Deep Learning-Based Action Recognition

The introduction of Convolutional Neural Networks (CNNs) significantly improved visual feature learning for sports videos. Architectures such as VGGNet, ResNet, and Inception enabled automatic extraction of spatial features from video frames. To incorporate temporal information, CNN features were later combined with Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. Although CNN–LSTM models improved temporal understanding, they suffer from limited parallelism, higher latency, and difficulty in modeling long-range dependencies, making real-time deployment challenging.

C. Transformer-Based and Hybrid Models

Recent research has explored Transformer architectures for video understanding due to their self-attention mechanism, which effectively models long-term temporal dependencies. Vision Transformers and video Transformers have demonstrated strong performance in action recognition tasks. However, pure Transformer models are computationally expensive and require large-scale annotated datasets. To overcome these challenges, hybrid CNN–Transformer architectures have been proposed, where CNNs extract spatial features and Transformers model temporal relationships. Lightweight CNN backbones such as MobileNetV2 are increasingly preferred for real-time applications due to their efficiency and reduced computational cost.

D. Research Gaps Identified

Despite notable progress, several research gaps remain in existing sports action recognition systems:

Most existing approaches focus on offline analysis and lack real-time processing capability suitable for live sports broadcasts.
Many models are computationally intensive and require high-end hardware, limiting their deployment on low-cost systems.
Existing solutions are often sport-specific and do not generalize well across multiple sports and action categories.
Limited attention has been given to false positive suppression, uncertainty handling, and broadcast-specific challenges such as scoreboard interference and non-action frames.

To address these gaps, this work proposes a real-time, low-cost CNN–Transformer-based framework capable of multi-sport action recognition with enhanced robustness and practical deployment feasibility.

IV. Methodological Framework

A. Input Video Acquisition

The system accepts sports video streams captured from broadcast footage or recorded match videos. Videos are provided in standard formats (MP4) and may include camera motion, replays, celebrations, and scoreboard overlays. Both football and cricket videos are supported, enabling multi-sport applicability within a unified framework.

B. Frame Extraction and Sampling

Input videos are decomposed into individual frames using OpenCV. To ensure real-time performance and reduce computational load, frame sampling is applied by processing every Nth frame. This strategy preserves temporal information while maintaining low latency, making the system suitable for live or near-real-time applications.

C. Preprocessing and Normalization

Each extracted frame is resized to a fixed resolution of 224×224 pixels to match the input requirements of the pre-trained CNN backbone. Pixel values are normalized to the range [0,1] to ensure numerical stability and faster convergence during inference. This preprocessing step standardizes input data across different videos and lighting conditions.

D. Spatial Feature Extraction Using CNN

A lightweight pre-trained MobileNetV2 model is employed as the CNN backbone for spatial feature extraction. MobileNetV2 utilizes depth wise separable convolutions, significantly reducing the number of parameters while preserving representational power. The CNN extracts high-level spatial features related to player positions, ball movement, and contextual cues from each frame.

E. Temporal Modeling Using Transformer Encoder

To capture temporal relationships between consecutive video frames, a Transformer encoder is employed. The Transformer utilizes a self-attention mechanism to learn dependencies across frame-level feature representations extracted from the CNN backbone. Unlike sequential models such as recurrent neural networks, the attention mechanism allows the model to analyze all frames simultaneously and identify important temporal interactions between them.

The self-attention mechanism maps the input feature sequence into three representations known as query (Q), key (K), and value (V) vectors. The attention weights are calculated based on the similarity between query and key vectors. The self-attention operation can be expressed as:

A t t e n t i o n (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

where:

$Q$ represents the query matrix derived from input features
$K$ represents the key matrix used to compute similarity scores
$V$ represents the value matrix containing the feature representations
$d_{k}$ denotes the dimension of the key vectors, used for scaling the dot-product attention

This mechanism enables the model to assign higher importance to frames that are most relevant to a specific sports action, such as the moment of ball contact in a football pass or the bat–ball interaction during a cricket shot.

F. Classification and Decision Logic

The Transformer output is aggregated using global pooling and passed through fully connected layers with a softmax activation function to classify actions. Confidence thresholding is applied to suppress uncertain predictions. Additionally, temporal smoothing using a sliding window majority-voting mechanism is employed to stabilize predictions and reduce frame-level noise.

G. Post-Processing, Event Detection, and Output Generation

Post-processing techniques such as motion-based filtering and event confirmation logic are applied to further reduce false positives. Detected events are time-stamped and logged for analysis. A broadcast-style overlay displaying action labels and counts is rendered on the output video, enabling automated highlight generation and real-time visualization.

H. Dataset Description

The football and cricket datasets were created from publicly available broadcast match videos. Frames were extracted from each video and organized into class-wise directories. For football, frames corresponding to pass and goal actions were collected. For cricket, frames corresponding to four, six, and wicket actions were collected. Data augmentation techniques such as horizontal flipping and brightness variation were applied to improve generalization. The dataset contains several thousand frames per class, ensuring sufficient diversity across lighting conditions, camera angles, and match situations.

V. System Design and Implementation

A. System Architecture Design

The overall system architecture of the proposed sports action recognition framework is illustrated in Figure 1. The system follows a modular pipeline consisting of video input, preprocessing, feature extraction, temporal modeling, classification, and output generation.

The input sports video stream is first captured from a recorded or live broadcast source. Frames are extracted and sampled at regular intervals to balance computational efficiency and temporal continuity. Preprocessed frames are then forwarded to the CNN-based feature extractor, where spatial features such as player positions, ball movement, and scene context are learned.

These spatial features are reshaped into temporal sequences and passed to the Transformer encoder, which captures long-range temporal dependencies across frames. The classification module assigns action labels based on learned representations, while post-processing modules handle confidence filtering, temporal smoothing, and event confirmation. Finally, detected events are logged with timestamps and rendered as broadcast-style overlays on the output video.

B. Implementation Details

The proposed research framework is implemented using the Python programming language, with TensorFlow and Keras serving as the primary deep learning libraries. These frameworks provide efficient tools for designing, training, and deploying deep neural network models. The OpenCV library is used for video processing tasks including frame extraction, image preprocessing, and real-time visualization of the detected events.

A pre-trained MobileNetV2 convolutional neural network is utilized as the spatial feature extractor. The model is initialized with weights trained on the ImageNet dataset, allowing effective feature learning while maintaining computational efficiency. MobileNetV2 employs depthwise separable convolutions, which significantly reduce the number of parameters and enable faster inference on low-cost hardware systems.

To capture temporal dependencies between consecutive video frames, a Transformer encoder architecture is incorporated into the framework. The Transformer encoder consists of multi-head self-attention layers and feed-forward neural networks, which enable the model to learn contextual relationships across frame sequences. This mechanism improves the model’s ability to distinguish between similar actions occurring in different temporal contexts.

The training process uses categorical cross-entropy as the loss function and the Adam optimizer for efficient gradient-based optimization. The dataset is divided into training and testing subsets, and the model learns to classify sports actions through supervised learning.

During inference, several post-processing techniques are applied to improve prediction stability. Confidence thresholding is used to filter low-confidence predictions, while majority-voting-based temporal smoothing helps reduce frame-level prediction noise.

The output module generates annotated video streams with real-time overlays, displaying detected action labels and cumulative event counts. Additionally, the system records event timestamps in structured text logs, which can be used for automated highlight generation and post-match analysis. The modular design of the framework enables easy integration with live RTSP video streams and allows extension to additional sports or action categories in future implementation.

C: System Workflow

Figure 2. System Workflow.

Table 1. System Configuration.

Component	Specification
Processor	Intel i5
RAM	8 GB
OS	Windows 10
Framework	TensorFlow, Keras
Input Resolution	224×224
FPS Achieved	20–25

VI. Experimental Results and Analysis:

A. Experimental Setup

Experiments were conducted on a computing system equipped with an Intel i5 processor, 8 GB RAM, and Windows operating system. The proposed model was implemented using the Python programming language with TensorFlow and Keras deep learning frameworks. OpenCV was used for video processing operations such as frame extraction and visualization. Both training and inference were performed on a CPU-based environment, demonstrating the feasibility of deploying the framework on low-cost hardware platforms.

For football action recognition, two action classes were considered: pass and goal. For cricket action recognition, three action classes were considered: four, six, and wicket. The datasets were prepared by extracting frames from match videos and organizing them into class-specific folders. Data preprocessing included frame resizing and normalization before model training.

The dataset was divided into training and testing sets using an 80:20 ratio, where 80% of the data was used for training the model and the remaining 20% was used for testing and evaluation.

B. Evaluation Metrics

The performance of the proposed model was evaluated using standard classification metrics including accuracy, precision, recall, and F1-score. These metrics provide an effective evaluation of the model’s ability to correctly recognize sports actions from video frames.

1. Accuracy

Accuracy represents the ratio of correctly predicted samples to the total number of samples.

A c c u r a c y = \frac{T P + T N}{T P + T N + F P + F N}

where:

TP – True Positive,

TN – True Negative,

FP – False Positive,

FN – False Negative.

2. Precision

Precision measures the proportion of correctly predicted positive samples among all predicted positives.

P r e c i s i o n = \frac{T P}{T P + F P}

where:

TP – True Positive predictions,

FP – False Positive predictions.

3. Recall

Recall measures the ability of the model to correctly identify actual positive samples.

R e c a l l = \frac{T P}{T P + F N}

where:

TP – True Positive predictions,

FN – False Negative predictions.

4. F1-Score

F1-score represents the harmonic mean of precision and recall.

F 1 = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

where Precision and Recall represent the classification performance of the model.

In addition, confusion matrices were generated to visualize class-wise prediction performance. Figure 3 presents the confusion matrix obtained for the proposed model.

Figure 3. Confusion Matrix.

Figure 4. Graph.

C. Quantitative Results

The experimental results demonstrate strong performance of the proposed CNN–Transformer framework in recognizing sports actions from video sequences.

For football action recognition, the system achieved an overall accuracy of approximately 98–99%. For cricket action recognition, the model achieved an accuracy ranging from 95–97% depending on the action category.

The confusion matrix presented in Figure 3 shows strong diagonal dominance, indicating that most samples were correctly classified. Minor misclassifications were observed between visually similar actions such as certain pass sequences and non-goal celebrations.

The system also generates time-stamped event logs that record detected actions along with their occurrence time in seconds. These logs can be used directly for automated highlight generation and post-match analysis.

An example event log generated by the system is shown in Figure 6, demonstrating the practical usability of the proposed framework.

Figure 5. Performance & accuracy of the model.

Figure 6. : Timestamp.

Table 2. Performance Comparison with Baseline Models.

Model Architecture	Accuracy (%)	Inference Speed (FPS)	Hardware Requirement
CNN + LSTM	91.5	12–15	Medium
3D-CNN (C3D)	94.2	5–8	High (GPU)
Two-Stream CNN	95.0	8–12	High (GPU)
Proposed CNN–Transformer	97.8	20–25	Low (CPU)

The comparative results presented in Table II demonstrate that the proposed CNN–Transformer framework achieves higher classification accuracy while maintaining significantly improved real-time performance. Traditional architectures such as CNN–LSTM and 3D-CNN require higher computational resources and often rely on GPU-based processing. In contrast, the proposed framework combines efficient spatial feature extraction using MobileNetV2 with transformer-based temporal modeling, enabling accurate action recognition with lower hardware requirements and higher inference speed.

D. Real-Time Performance

The real-time performance of the proposed framework was evaluated by measuring the inference speed during video processing. The system achieved an average processing speed of approximately 20–25 frames per second (FPS) on a CPU-based system.

Frame sampling and the lightweight architecture of MobileNetV2 significantly reduce computational overhead while maintaining high recognition accuracy. These results indicate that the proposed model is capable of performing near real-time sports action recognition on low-cost hardware platforms.

Figure 7. Event detection Football.

Figure 8. Event detection Cricket.

VII. Discussion

The experimental results demonstrate that integrating CNN-based spatial feature extraction with Transformer-based temporal modeling provides improved performance compared to conventional CNN-only approaches. The MobileNetV2 backbone ensures efficient feature extraction with reduced computational complexity, while the Transformer encoder effectively captures temporal dependencies between consecutive frames. This combination enables the system to better distinguish between visually similar sports actions by considering both spatial context and motion patterns.

The application of confidence thresholding and temporal smoothing significantly reduces false positive detections and improves prediction stability during real-time inference. Despite the strong overall performance, certain challenges remain in detecting events such as goals and wickets due to limited training samples, occlusions, and variations in camera angles commonly present in broadcast sports videos. Increasing dataset diversity and incorporating motion-aware features such as optical flow are expected to further enhance action recognition accuracy.

Overall, the proposed framework achieves a balanced trade-off between accuracy, computational efficiency, and real-time performance. These characteristics make the system suitable for practical deployment in sports analytics platforms and automated highlight generation systems.

VIII. Conclusions

This paper presented a real-time CNN–Transformer-based framework for multi-sport action recognition in football and cricket videos. The proposed system integrates efficient spatial feature extraction using MobileNetV2 with Transformer-based temporal modeling to capture temporal dependencies between consecutive video frames. This hybrid architecture enables accurate detection of key sports events while maintaining efficient real-time performance on low-cost hardware platforms. Experimental results demonstrate strong classification accuracy and confirm the practical applicability of the framework for automated highlight generation and sports analytics. Compared with conventional CNN-based approaches, the proposed CNN–Transformer framework improves temporal understanding of sports actions while maintaining computational efficiency, making it suitable for real-world deployment in intelligent sports broadcasting systems.

Future work will focus on extending the framework to additional sports and action categories, integrating player and ball tracking mechanisms, incorporating motion-based features such as optical flow or 3D-CNN models, and deploying the system with live RTSP video streams for real-time sports broadcasting environments.

References

F. Wu, Q. Wang, J. Bian, H. Xiong, and J. Cheng, “A survey on video action recognition in sports: Datasets, methods and applications,” IEEE Access, 2022.
Q. Duan, “Video action recognition: A survey,” Applied and Computational Engineering, vol. 6, pp. 1366–1378, 2023.
L. Zhao, Z. Lin, R. Sun, and A. Wang, “A review of state-of-the-art methodologies and applications in action recognition,” Electronics, vol. 13, no. 23, 2024.
H. Yin, R. O. Sinnott, and G. T. Jayaputera, “A survey of video-based human action recognition in team sports,” Artificial Intelligence Review, 2024.
M. Mao, A. Lee, and M. Hong, “Deep learning innovations in video classification: A survey on techniques and dataset evaluations,” Electronics, 2024.
H. P. Nguyen and B. Ribeiro, “Video action recognition collaborative learning with dynamics via PSO-ConvNet transformer,” Scientific Reports, vol. 13, 2023.
J. Wang, L. Zuo, and C. C. Martínez, “Basketball technique action recognition using 3D convolutional neural networks,” Scientific Reports, 2024.
T. Wang, “Deep learning-based action recognition and quantitative assessment method for sports skills,” Applied Mathematics and Nonlinear Sciences, 2024.
S. Dass, H. B. Barua, G. Krishnasamy, and R. C. W. Phan, “ActNetFormer: Transformer-ResNet hybrid method for semi-supervised action recognition in videos,” 2024.
C. Lai, J. Mo, H. Xia, and Y. Wang, “FACTS: Fine-grained action classification for tactical sports,” 2024.
A. K. AlShami et al., “SMART-Vision: Survey of modern action recognition techniques in vision,” 2025.
J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” Proc. CVPR, updated applications widely used in recent research.
C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal residual networks for video action recognition,” IEEE Conference on Computer Vision and Pattern Recognition, extended applications in sports analytics.
K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Neural Information Processing Systems, adapted in many sports video recognition systems.
D. Tran et al., “Learning spatiotemporal features with 3D convolutional networks,” IEEE ICCV.
H. Wang and C. Schmid, “Action recognition with improved trajectories,” IEEE ICCV.
A. Arnab et al., “ViViT: A video vision transformer,” IEEE International Conference on Computer Vision, widely used in transformer-based video recognition.
Z. Liu et al., “Video Swin Transformer for video understanding,” IEEE CVPR, 2022.
H. Fan et al., “Multiscale vision transformers for video recognition,” IEEE CVPR, 2022.
Z. Wu et al., “Long-term feature banks for detailed video understanding,” IEEE CVPR.
G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” ICML, transformer-based video recognition.
K. Soomro, A. Zamir, and M. Shah, “UCF101: A dataset of 101 human action classes,” University of Central Florida dataset widely used in action recognition research.
W. Kay et al., “The Kinetics human action video dataset,” DeepMind Dataset, widely used for training video models.
A. Deliège et al., “SoccerNet: A scalable dataset for action spotting in soccer videos,” IEEE Conference on Computer Vision.
A. Cioppa et al., “SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos,” IEEE CVPR Workshops.
S. Sudhakaran and O. Lanz, “Learning to detect events in videos with transformer networks,” IEEE Transactions on Multimedia.
Y. Sun et al., “Temporal action localization using deep learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence.
M. Baccouche et al., “Sequential deep learning for human action recognition,” Pattern Recognition Letters.
R. Girdhar et al., “Video action transformer network,” IEEE CVPR.
X. Sun, S. Liu, and P. Niyogi, “Efficient action recognition in sports videos using deep learning,” IEEE International Conference on Multimedia and Expo.

Figure 1. System Architecture.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Real-Time Sports Action Recognition Using a CNN–Transformer Hybrid Deep Learning Framework

Abstract

Keywords:

Subject:

I. Introduction

A. Problem Statement

B. Aim

C. Research Objectives

D. Proposed Solution

E. Contributions of the Work

F. Scope of the Work

II. Literature Survey

B. Limitations

III. Related Research and Technical Background

A. Traditional and Classical Approaches

B. Deep Learning-Based Action Recognition

C. Transformer-Based and Hybrid Models

D. Research Gaps Identified

IV. Methodological Framework

A. Input Video Acquisition

B. Frame Extraction and Sampling

C. Preprocessing and Normalization

D. Spatial Feature Extraction Using CNN

E. Temporal Modeling Using Transformer Encoder

F. Classification and Decision Logic

G. Post-Processing, Event Detection, and Output Generation

H. Dataset Description

V. System Design and Implementation

A. System Architecture Design

B. Implementation Details

C: System Workflow

VI. Experimental Results and Analysis:

A. Experimental Setup

B. Evaluation Metrics

1. Accuracy

2. Precision

3. Recall

4. F1-Score

C. Quantitative Results

D. Real-Time Performance

VII. Discussion

VIII. Conclusions

References

MDPI Initiatives

Important Links

Subscribe