Fine-grained action recognition in sports video analysis presents challenges that distinguish it from conventional human activity recognition. When actions share substantial visual and temporal overlap, as cricket batting strokes do, the discriminative cues needed for accurate classification lie in subtle kinematic variations at fine spatial and temporal scales. This paper presents an empirical investigation of contemporary temporal deep learning architectures for classifying fifteen distinct cricket batting strokes using the CricShot10k dataset, a curated collection of approximately 10,000 video clips with balanced class representation. We conduct a controlled comparative evaluation of four modeling paradigms: a non-temporal convolutional baseline employing frame-averaged ResNet50 features, a unidirectional Long Short-Term Memory (LSTM) network, a bidirectional LSTM (BiLSTM) architecture, and a Transformer encoder with multi-head self-attention. All models are trained and evaluated under identical experimental conditions, including uniform frame sampling (32 frames per video), consistent spatial feature extraction (2048-dimensional ResNet50 embeddings), matched data splits, and standardized optimization hyperparameters. This design allows observed performance differences to be attributed to architectural choices in temporal modeling rather than to confounding experimental variables. Our empirical results establish a clear performance hierarchy among the evaluated architectures. The BiLSTM model achieves the highest classification accuracy at 44.0%, followed by the unidirectional LSTM at 42.0%, the non-temporal baseline at 34.0%, and, notably, the Transformer at only 23.0%, a result that falls substantially below even the frame-averaged baseline.
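As an illustration of the shared input pipeline, the following minimal sketch shows one way to select 32 uniformly spaced frames from a clip and to collapse per-frame features into the single vector used by a frame-averaged baseline. The function names and the segment-centre sampling rule are assumptions made for this example; the text specifies only that sampling is uniform with 32 frames per video.

```python
from typing import List, Sequence


def uniform_frame_indices(num_frames: int, num_samples: int = 32) -> List[int]:
    """Pick num_samples frame indices spread evenly over a clip.

    Assumption: the clip is split into num_samples equal-width segments and
    the centre frame of each segment is taken. Clips shorter than
    num_samples repeat some frames.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def average_frame_features(features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (T, D) sequence of per-frame features into one D-dim vector."""
    if not features:
        raise ValueError("need at least one frame of features")
    num_frames = len(features)
    # zip(*features) iterates over feature dimensions (columns).
    return [sum(column) / num_frames for column in zip(*features)]
```

Because the mean is invariant to the order of the frames, a baseline built this way cannot use temporal ordering at all, which is what makes it a useful reference point for the recurrent and attention-based models.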
Statistical validation through McNemar’s test indicates that the 2 percentage point improvement from LSTM to BiLSTM is statistically significant (p < 0.001), suggesting systematic correction of specific misclassification patterns rather than random variation. Detailed per-class analysis and confusion matrix examination reveal pronounced performance heterogeneity across the fifteen stroke categories. High-performing classes such as the pull shot and cut shot exhibit F1-scores exceeding 0.65, which we attribute to their distinctive lateral motion patterns that create separable trajectories in the spatiotemporal feature space. Conversely, a cluster of front-foot strokes (the cover drive, defensive shot, down-the-wicket stroke, and lofted offside drive) constitutes the primary locus of classification error, with pairwise confusion rates frequently exceeding 40%. We attribute these persistent confusions to the visual and temporal similarity of these strokes, which share common initial footwork, comparable bat trajectories through the downswing phase, and overlapping follow-through dynamics. The frame-level ResNet50 features, while capturing high-level spatial semantics effectively, lack the temporal granularity and explicit motion encoding necessary to resolve these fine-grained distinctions. Our error analysis further reveals that the model struggles to capture subtle kinematic cues such as bat face angle at impact, degree of wrist rotation, bat elevation trajectory, and the precise timing of weight transfer, all features that human experts rely upon for stroke differentiation. The pronounced underperformance of the Transformer architecture provides a critical case study in the data efficiency limitations of attention-based models for fine-grained visual sequence tasks.
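For readers unfamiliar with McNemar’s test, the sketch below shows how a paired comparison of two classifiers on the same test clips can be computed. It uses the exact binomial form of the test; whether the reported p-value comes from the exact or the chi-squared variant is not stated, so this choice, along with the function names, is an assumption of the example.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> Tuple[int, int]:
    """Return (b, c): b = A right and B wrong, c = A wrong and B right."""
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all sequences must have the same length")
    b = c = 0
    for truth, a, p in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = a == truth, p == truth
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    return b, c


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant clip is equally likely to
    favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

The test depends only on the clips where exactly one model is correct, so a small accuracy gap can still be significant when those disagreements fall predominantly in favour of one model. For example, `mcnemar_exact_p(0, 10)` evaluates to 2/1024, roughly 0.002, whereas `mcnemar_exact_p(5, 5)` returns 1.0.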
Despite its theoretical capacity to model long-range dependencies without the sequential bottleneck of recurrence, the Transformer’s lack of built-in inductive biases regarding temporal locality and sequential ordering renders it vulnerable to overfitting in moderate-data regimes. With approximately 400 training examples per class, the CricShot10k dataset provides insufficient statistical signal for the Transformer to learn robust spatiotemporal attention patterns from scratch, resulting in degraded generalization performance. This finding underscores an important principle for practitioners: architectural sophistication does not guarantee empirical superiority; rather, the choice of temporal modeling strategy must be carefully aligned with dataset scale and task characteristics.

The contributions of this work are fourfold. First, we establish a rigorous and reproducible benchmark for cricket stroke classification on the CricShot10k dataset, providing a standardized reference point for future research. Second, through systematic architectural comparison, we quantify the marginal contribution of bidirectional temporal context to classification performance and demonstrate its statistical significance. Third, we provide granular error analysis that identifies the specific stroke categories and confusion pairs that dominate classification failures, thereby directing future research efforts toward the most impactful areas for improvement. Fourth, we offer a critical assessment of Transformer limitations in fine-grained, moderate-data video classification scenarios, contributing to the broader understanding of where attention mechanisms are applicable.

This paper is organized as follows. Section II reviews related work in action recognition, temporal modeling, and sports video analysis, situating our contribution within the broader research landscape.
Section III details our methodological framework, including dataset characteristics, preprocessing pipelines, feature extraction protocols, and architectural specifications. Section IV presents comprehensive experimental results encompassing aggregate performance metrics, training dynamics, statistical validation, and per-class analysis. Section V provides extended discussion of key findings, interpreting the relative performance of architectures and analyzing the structure of classification errors. Section VI addresses limitations and outlines promising directions for future investigation. Section VII concludes with a summary of contributions and their implications for the field.