The development and introduction of fully autonomous vehicles are widely seen as the future of the automotive industry, and this new paradigm is expected to transform the way transportation is used today. Actions by autonomous vehicle manufacturers and their partners suggest that these vehicles will initially be deployed in shared mobility services. For example, BMW, Ford, Volkswagen, and Hyundai have all partnered with various companies to develop autonomous vehicles for ride-sharing and on-demand services, with production planned for 2021 [8,9,10]. Daimler has partnered with Uber to introduce autonomous vehicles into Uber's ride-sharing network [6], and Toyota has partnered with Uber with the same goal [11]. Waymo has already started commercial autonomous ride-sharing services in Tempe, Mesa, and Chandler [12].
Regarding action recognition, a substantial body of work has been developed in this field. In [13], Carreira et al. introduced a deep learning model together with Kinetics, a large-scale video action recognition dataset. The model, called the Two-Stream Inflated 3D ConvNet (I3D), extends the popular two-stream architecture that exploits spatial and temporal information. Their main contribution is the I3D model itself, a 3D CNN that inflates 2D filters into 3D, so that not only can the architecture of 2D models (e.g. ResNet, Inception) be reused, but their pretrained 2D weights can also be bootstrapped into the 3D network.
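As an illustration of this inflation step, the following minimal sketch (our own illustration, not the authors' implementation) turns a pretrained 2D convolution into a 3D one by repeating its kernel along a new temporal axis and rescaling it; the function name, temporal extent, and tensor shapes are assumptions for demonstration purposes.

```python
# Sketch of I3D-style filter inflation: a pretrained 2D kernel is repeated
# along a new temporal axis and divided by the temporal extent.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Turn a pretrained 2D convolution into a 3D one by filter inflation."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # Repeat the 2D kernel T times along the temporal axis and divide by T.
    weight2d = conv2d.weight.data                      # (out, in, kH, kW)
    conv3d.weight.data = weight2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d

# Example: inflate the first convolution of a 2D backbone and apply it to a clip.
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = inflate_conv2d(conv2d, time_dim=7)
clip = torch.randn(1, 3, 16, 224, 224)                # (batch, channels, frames, H, W)
print(conv3d(clip).shape)
```

Dividing by the temporal extent keeps the inflated network's response on a clip of identical frames equal to that of the original 2D model, which is what makes bootstrapping from 2D pretrained weights possible.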
Following the same path of spatio-temporal 3D ConvNets, Tran et al. [14] introduced the R(2+1)D model and present a deeper analysis of the use of spatio-temporal convolutions for action recognition in videos. Spatio-temporal convolutions are convolutional layers that model both spatial and temporal dependencies in the input and are widely used in video action recognition models. The authors study different variants of these convolutions, such as 2D convolutions, 3D convolutions, and separable (2+1)D convolutions that factorize a full 3D kernel into a spatial convolution followed by a temporal convolution, and they evaluate the impact of factors such as kernel size, dilation rate, and number of layers on model performance. They trained and tested their approach on the Sports1M [15], Kinetics [16,17,18], UCF101 [19,20], and HMDB51 [21] datasets, reporting accuracies of 73.3%, 72%, 96.8%, and 74.5%, respectively.
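To make the separable variant concrete, the sketch below shows a (2+1)D block in the spirit of [14]; it is an illustrative approximation, not the reference implementation, and the number of intermediate channels is chosen arbitrarily rather than matched to a full 3D kernel's parameter count as done in the paper.

```python
# A (2+1)D block factorizes a full t x d x d 3D convolution into a 1 x d x d
# spatial convolution followed by a t x 1 x 1 temporal convolution, with a
# nonlinearity in between.
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    def __init__(self, in_channels, out_channels, mid_channels, t=3, d=3):
        super().__init__()
        self.spatial = nn.Conv3d(
            in_channels, mid_channels,
            kernel_size=(1, d, d), padding=(0, d // 2, d // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(
            mid_channels, out_channels,
            kernel_size=(t, 1, 1), padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):            # x: (batch, channels, frames, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

block = Conv2Plus1D(3, 64, mid_channels=45)
print(block(torch.randn(1, 3, 8, 112, 112)).shape)   # -> (1, 64, 8, 112, 112)
```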
SlowFast, another recent architecture built on a ResNet3D backbone, was presented by Feichtenhofer et al. [22] for video action recognition. SlowFast networks are a two-stream architecture with a Slow pathway, which processes the video at a low frame rate, and a Fast pathway, which processes the video at a high frame rate. The design is partially inspired by the retinal ganglion cells of primates, in which roughly 80% of the cells (P-cells) operate at a low temporal frequency and resolve fine detail, while about 20% (M-cells) operate at a high temporal frequency and respond to rapid changes; analogously, in SlowFast the compute cost of the Slow pathway is about 4x that of the Fast pathway. The network is trained end-to-end to learn both spatial and temporal representations of the video: the Slow pathway captures long-term temporal information, the Fast pathway captures fine-grained temporal information, and a fusion strategy combines the output of the two pathways to make the final prediction. Evaluated on several action recognition datasets, the method outperforms approaches that use only a single pathway or simple fusion schemes.
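The following toy sketch illustrates only the dual-frame-rate idea behind SlowFast; the pathway backbones are single placeholder convolutions rather than the ResNet3D stages of [22], and the late concatenation used here stands in for the lateral connections of the original design.

```python
# The same clip is fed to a Slow pathway at a low frame rate and to a Fast
# pathway at a high frame rate; the two feature streams are fused before
# classification.
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    def __init__(self, alpha: int = 8, num_classes: int = 10):
        super().__init__()
        self.alpha = alpha  # temporal stride between the Fast and Slow inputs
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(64 + 8, num_classes)

    def forward(self, clip):                       # clip: (B, 3, T, H, W)
        slow_in = clip[:, :, :: self.alpha]        # low-frame-rate input
        slow_feat = self.pool(self.slow(slow_in)).flatten(1)
        fast_feat = self.pool(self.fast(clip)).flatten(1)
        # Late fusion by concatenation (the paper fuses via lateral connections).
        return self.fc(torch.cat([slow_feat, fast_feat], dim=1))

model = TinySlowFast()
print(model(torch.randn(2, 3, 32, 224, 224)).shape)   # -> (2, 10)
```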
In [23], the authors present a method for action recognition in videos called Temporal Segment Networks (TSN). TSN addresses the problem of recognizing actions in videos of variable length and temporal structure by combining a sparse temporal sampling strategy with video-level supervision, enabling efficient and effective learning over the whole action video. Its main contributions are the TSN architecture, composed of multiple branches that process different segments of the input video, and a consensus module that combines the outputs of all branches into the final prediction. The authors also propose several good practices for training TSNs, such as using multiple segments per video, data augmentation, and a consensus loss function.
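A minimal sketch of the sparse-sampling-plus-consensus idea is given below, assuming a generic torchvision 2D backbone; the sampling and averaging follow the spirit of TSN, but the class names, segment count, and shapes are illustrative choices rather than the original implementation.

```python
# One frame is sampled from each of K equal video segments, a shared 2D backbone
# scores every snippet, and the segment scores are averaged (consensus).
import torch
import torch.nn as nn
import torchvision.models as models

class TinyTSN(nn.Module):
    def __init__(self, num_segments: int = 3, num_classes: int = 10):
        super().__init__()
        self.num_segments = num_segments
        backbone = models.resnet18(weights=None)           # generic 2D backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, snippets):
        # snippets: (B, K, 3, H, W), one frame per segment
        b, k, c, h, w = snippets.shape
        logits = self.backbone(snippets.reshape(b * k, c, h, w)).reshape(b, k, -1)
        return logits.mean(dim=1)                           # average consensus

def sample_segments(video, num_segments=3):
    """Pick one random frame from each of num_segments equal chunks of the video."""
    t = video.shape[1]                                      # video: (3, T, H, W)
    bounds = torch.linspace(0, t, num_segments + 1).long()
    idx = [torch.randint(int(bounds[i]), int(bounds[i + 1]), (1,)).item()
           for i in range(num_segments)]
    return video[:, idx].permute(1, 0, 2, 3)                # (K, 3, H, W)

video = torch.randn(3, 90, 224, 224)
batch = sample_segments(video).unsqueeze(0)                 # (1, K, 3, H, W)
print(TinyTSN()(batch).shape)                               # -> (1, 10)
```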
In [24], Lin et al. proposed the Temporal Shift Module (TSM) for video action recognition. Video understanding faces the challenge of achieving high accuracy at low computational cost: 2D CNNs are computationally cheap but fail to capture temporal relationships, while 3D CNNs model them well but are computationally intensive. TSM combines efficiency with performance by shifting part of the feature channels along the temporal dimension, enabling the exchange of information among adjacent frames and achieving 3D-CNN-level performance while retaining 2D-CNN complexity. The module can be inserted into existing 2D CNN architectures without adding computational or parameter costs, and it is also adaptable to online settings, enabling real-time, low-latency video recognition and detection. The authors propose an architecture built around TSM and show that it improves performance while reducing computational cost.
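The core shift operation is simple enough to sketch directly; the version below is an illustrative offline (bi-directional, zero-padded) variant, and the fraction of shifted channels is an assumption for demonstration.

```python
# A fraction of the channels is shifted one step forward in time and another
# fraction one step backward, so a plain 2D convolution applied afterwards can
# mix information from neighbouring frames at no extra FLOPs or parameters.
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """x: (batch, frames, channels, H, W); shift 1/shift_div of channels each way."""
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                      # shift towards the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]      # shift towards the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]                 # remaining channels unshifted
    return out

x = torch.randn(2, 8, 64, 56, 56)
print(temporal_shift(x).shape)    # -> (2, 8, 64, 56, 56), same size and cost as the input
```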
Object detection is a well-studied area with diverse applications, including mask detection. The R-CNN family of algorithms [25,26] identifies regions of interest and uses a CNN to detect objects within those regions. More recently, Redmon et al. [27] introduced YOLO (You Only Look Once), a single-stage object detection system based on a neural network that takes an entire image as input and outputs a set of bounding boxes and class probabilities for all objects in the image, in contrast to traditional detection pipelines that rely on multiple stages such as region proposal generation, feature extraction, and object classification. Subsequent members of the YOLO family, YOLOv2 [28], YOLOv3 [29], YOLOv4 [30], and YOLOv5 [7], provide faster and more accurate detection than the R-CNN family.
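As a hedged usage example, the snippet below loads a pretrained YOLOv5 model through the public ultralytics/yolov5 torch.hub entry point and runs it on a single image; the image filename is a placeholder, and the call assumes the repository's dependencies and an internet connection are available.

```python
# Single-stage detection: one forward pass over the whole image returns bounding
# boxes, confidences, and class labels, with no separate region-proposal stage.
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
results = model('cabin_frame.jpg')   # placeholder image path
results.print()                      # summary of detections
boxes = results.xyxy[0]              # (N, 6): x1, y1, x2, y2, confidence, class
print(boxes)
```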
In addition to developing a method for action recognition in a car environment, its integration into the vehicle must also be considered. This requires selecting and developing an embedded system that can run the algorithm and is suitable for the automotive context. Embedded computing systems intended to operate in a vehicle should be selected primarily on cost and on the ratio between operations per second and energy consumed (FLOPS/Watt). Beyond these factors, development should also take into account future trends of the automotive industry: with the prospect of reaching full autonomous driving levels, car control systems are moving towards a centralized topology in which all of the vehicle's intelligence resides in a single processing unit. To meet this requirement, NVIDIA offers products for the centralized development of all vehicle intelligence, covering the factors mentioned above, such as cost, computing power, energy consumption, ASIL safety levels, and SAE autonomy levels, for the different needs of autonomous driving; NVIDIA currently acts as a Tier-1 supplier serving OEM customers that build autonomous cars. Implementations and direct comparisons between standard and embedded CUDA systems show that the same performance can be achieved with 50% reductions in power consumption [31,32].
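As a trivial illustration of the FLOPS/Watt selection criterion, the snippet below ranks candidate platforms by that ratio; the device names and figures are placeholders, not measured values for any real hardware.

```python
# Rank candidate embedded platforms by throughput per watt (placeholder specs).
candidates = {
    "device_a": {"tflops": 1.3, "watts": 10},
    "device_b": {"tflops": 32.0, "watts": 60},
    "device_c": {"tflops": 21.0, "watts": 30},
}

for name, spec in sorted(candidates.items(),
                         key=lambda kv: kv[1]["tflops"] / kv[1]["watts"],
                         reverse=True):
    print(f'{name}: {spec["tflops"] / spec["watts"]:.2f} TFLOPS/W')
```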