2. Research and Development in AI and Control Strategies in the Field of Self-Driving Cars
Experiments on self-driving cars have been conducted since the 1920s, with notable milestones marking the progress of autonomous vehicle technology. In 1925, a demonstration called "American Wonder" showcased self-driving cars controlled by radio signals on the streets of New York City, establishing radio-controlled cars as a popular trend for many years.
In the 1980s, a significant development took place when Mercedes-Benz introduced a robotic van in Munich, Germany. This van utilized vision-based guidance, marking a shift towards the use of visual perception for autonomous driving. Subsequently, researchers and developers focused on integrating technologies such as LIDAR, GPS, and computer vision into a vision-based guidance system, comparing various methodologies to advance research in autonomous systems.
The concept of using trained neural networks for self-driving cars emerged in the early 1990s, pioneered by Dean Pomerleau, a Carnegie Mellon researcher. This approach involved capturing raw road images for neural network training, enabling the steering wheel to be controlled based on road conditions. The utilization of neural networks proved to be a significant breakthrough for self-driving cars, demonstrating high efficiency compared to other methods. Training architectures such as convolutional neural networks (CNNs), and later region-based variants (R-CNN), became available, with networks relying on live video streams and image data for training. However, it should be noted that such camera-based networks are trained only on two-dimensional (2D) images, which is a notable drawback.
To address the limitations of neural networks, the development of Light Detection and Ranging (LIDAR) technology has emerged. LIDAR collects real-time data in a 360-degree range around the car, providing a comprehensive view of the surroundings. Combining LIDAR technology with neural networks has been employed to enhance efficiency in self-driving cars. Additionally, GPS systems have been incorporated to improve navigation. As advancements progressed, odometry sensors and ultrasonic sensors were added to cars for motion detection and distance measurement, respectively. These technological advancements contribute to the ongoing development of self-driving cars. A central computer was also embedded in the car, to which all interfaced devices and the car’s control systems were connected [27].
Figure 1. Application of LIDAR in path planning [28].
The self-driving car industry and autonomous driving (AD) have witnessed significant technological progress in recent years. Researchers and developers are actively engaged in research and development activities aimed at introducing innovative advancements and enhancements to self-driving cars. Cutting-edge concepts of artificial intelligence (AI) and efficient control strategies are being utilized to realize the desired features and incentives in autonomous vehicles.
In a publication by Li et al. (2019) [
29], a new control policy called "Autonomous Driving Via Principled Simulations (ADAPS)" is proposed for self-driving cars. This control policy is claimed to be robust due to its inclusion of various scenarios, including rare ones like accidents, in the training data. ADAPS utilizes two simulation platforms to generate and analyze accidental scenarios. Through this process, a labeled training dataset and a hierarchical control policy are automatically generated. The hierarchical control policy, which incorporates memory, proves to be effective in avoiding collisions, as demonstrated through conducted experiments. Additionally, the online learning mechanism employed by ADAPS is deemed more efficient than other state-of-the-art methods like DAGGER, requiring fewer iterations in the learning process.
In another research work by Tan (2018) [
30], the use of "Reinforcement Learning" and "Image Translation" for self-driving cars is recommended. While reinforcement learning does not necessitate an abundance of labeled data like supervised learning, it can only be trained in a virtual environment. The real challenge with reinforcement learning lies in bridging the gap between real and virtual environments, considering the unpredictable nature of real-world situations, such as accidents. The research presents an innovative platform that employs an "Image Semantic Segmentation Network" to enable adaptability to the real environment. Semantic images, which focus on essential content for training and decision-making while discarding irrelevant information, are used as input to reinforcement learning agents to address the disparity between real and virtual environments. However, the quality of the segmentation technique limits the outcomes of this approach. The model is trained in a virtual environment and then transferred to the real environment, minimizing the risks and significant losses associated with experimental failures.
The emergence of U-Net, a semantic segmentation architecture developed by Ronneberger et al. [31], marked a significant breakthrough. The design builds on a traditional convolutional network and is divided into two sections: a contracting path and an expanding path. In the contracting path, each block of convolutions and ReLU activations is followed by max pooling with a stride of 2, down-sampling the feature maps and doubling the number of feature channels. The expanding path then up-samples the feature maps, halving the number of feature channels, and merges them with their counterparts from the contracting path, followed by 3×3 convolutions and ReLU activations. Finally, a 1×1 convolution converts the 64-component feature map into the desired number of classes.
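To make this encoder-decoder structure concrete, the PyTorch sketch below implements a heavily reduced U-Net-style network; the two-stage depth, padded convolutions, and channel sizes are simplifications chosen for brevity rather than the exact configuration of [31].

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in each U-Net stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        # Contracting path: max pooling halves resolution, channels double
        self.enc1 = conv_block(in_ch, 64)
        self.enc2 = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        # Expanding path: up-sampling halves the channel count again
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = conv_block(128, 64)        # 128 = 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # 1x1 conv to classes

    def forward(self, x):
        e1 = self.enc1(x)                      # skip-connection source
        e2 = self.enc2(self.pool(e1))          # down-sampled, doubled channels
        d1 = self.up(e2)                       # up-sample
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # merge with contracting path
        return self.head(d1)                   # per-pixel class scores

logits = TinyUNet()(torch.randn(1, 3, 128, 128))   # -> (1, 2, 128, 128)
```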
Imitation learning has shown promise for self-driving cars, but it is not straightforward to train a model to achieve arbitrary goals. Goal-directed planning-based algorithms are typically employed to address this challenge, utilizing reward functions and dynamics models to accomplish diverse goals. However, specifying the reward function that elicits the desired behavior is difficult. To address this issue, researchers have proposed "Imitative Models" [
32]. These probabilistic and predictive models can achieve specified goals by planning expert-like trajectories that are explainable. They can learn from expert demonstrations without requiring online data collection. The "Deep Imitative Model" has been evaluated for various goal objectives, such as energy-based, unconstrained, and constrained goal sets, and has demonstrated successful behavior alignment with the specified goal. The researchers also claimed that their model outperforms six state-of-the-art imitation learning approaches and a planning-based approach in a dynamic self-driving simulation task. Additionally, the model exhibits reasonable robustness when dealing with loosely specified objectives.
Imitation learning trains deep networks based on driving demonstrations by human experts. These models can emulate expert behavior, such as avoiding obstacles and staying on straight roads. However, during testing, an end-to-end driving policy trained with imitation learning cannot be controlled or directed to make specific turns at intersections. To address this limitation, a technique called "Conditional Imitation Learning" [
34] has been proposed. This model takes high-level commands as input during training, serving two important purposes. Firstly, these commands help resolve uncertainties in the mapping between perception and action, leading to better learning. Secondly, during testing, the controller is directed using the communication channel provided by these commands. The proposed approach has been evaluated in dynamic urban simulated environments as well as on a real robotic vehicle, showing improved performance compared to simple imitation learning in both scenarios.
Figure 3. Semantic Segmentation [33].
Researchers have explored ways to further enhance the performance of imitation learning for robust driving in real-world conditions [
35]. They found that cloning standard behaviors alone is insufficient to handle complex scenarios, even with a large number of examples. To address this challenge, they proposed synthesizing perturbations or interesting situations into the expert’s driving behavior, such as going off the road or encountering collisions. Additionally, they introduced additional losses to exaggerate the imitation loss, discouraging failures and undesired events. These recommendations improved the robustness of the imitation learning model, referred to as the "ChauffeurNet model." The model demonstrated efficient handling of complex situations in simulations and ablation experiments.
Attention mechanisms, such as channel, spatial, and temporal attention, have gained popularity due to their resemblance to human visual systems. These mechanisms enable dynamic selection by adaptively weighting features based on their relevance. Wang et al. [36] introduced self-attention in computer vision, leading to significant advancements in video understanding and object detection. This breakthrough laid the groundwork for vision transformers, and subsequent works [37,38,39,40,41] have demonstrated the remarkable potential of these attention mechanisms.
Direct perception is another innovative approach that combines the strengths of modular pipelines and imitation learning for autonomous driving. In modular pipelines, an elaborate model of the environment is created, while imitation learning directly maps images to control outputs. Direct perception techniques employ neural networks that learn from intermediate and low-dimensional representations. While direct perception approaches have been developed for highway scenarios, they have lacked features such as following traffic lights, obeying speed limitations, and navigating intersections. A recent research work introduced "Conditional Affordance Learning" [
42], which applies the direct perception approach to complex urban environments. This method takes video input and maps it to intermediate representations, utilizing high-level directional commands for autonomous navigation. Researchers claimed a 68% improvement in goal-directed navigation compared to other reinforcement learning and conditional imitation learning approaches. Additionally, this approach successfully incorporates desirable features like smooth car-following, adherence to traffic lights, and compliance with speed limitations.
A comprehensive understanding of real traffic is crucial for self-driving cars. In the near future, combining various devices such as video recorders, cameras, and laser scanners will enable semantic comprehension of traffic. Currently, most approaches rely on large-scale videos for learning, as there is a lack of proper benchmarks for accurate laser scanner data. Researchers have introduced an innovative benchmark called "Driving Behaviour Net (DBNet)" [
43] to address this gap. DBNet provides high-quality, large-scale point clouds obtained from Velodyne laser scanning and video recording with a dashboard camera. It also includes driving behavior information from a standard driver. The additional depth of information provided by the DBNet dataset proves beneficial for learning driving policies and improving prediction performance. Similarly, the "LIDAR-Video dataset" presented in the same work facilitates detailed understanding of real traffic behavior.
Another group of researchers has emphasized the importance of "Deep Object-Centric" models for self-driving cars [
44]. Based on their findings in robotics tasks, representing objects explicitly in models leads to higher robustness in new scenarios and more intuitive visualizations. The researchers presented a classification of Object-Centric models that are advantageous for end-to-end learning and object instances. These proposed models were evaluated in a "Grand Theft Auto V simulator" for scenarios involving pedestrians, vehicles, and other objects. Performance metrics such as collision frequency, interventions, and distance driven were used to assess the models. The researchers claimed that their models outperformed object-agnostic methods, even when using a defective detector. Furthermore, evaluations in real environments with the "Berkeley DeepDrive Video dataset" demonstrated the models’ effectiveness in low-data regimes.
End-to-end learning, a promising approach for self-driving cars, has also been explored in a doctoral thesis [43]. It has the potential to achieve effective self-driving results in complex urban environments. Representation learning plays a crucial role in end-to-end learning, and auxiliary vision techniques, such as semantic segmentation, can facilitate it. While simple convolutional neural networks (CNNs) can handle reactive control tasks, they are not capable of complex reasoning. The thesis focuses on scene-conditioned driving, which requires more than just reactive control. To tackle this, a unified algorithmic method is proposed, combining the advantages of imitation learning and reinforcement learning, so that the algorithm can learn from both the environment and demonstration data.
Transferring and scaling end-to-end driving policies from simulations to real-world environments is challenging. Real-world environments are more complex compared to the controlled and diverse environments provided by simulations for model training. Researchers have discovered a novel technique to scale simulated driving policies for real-world environments by using abstraction and modularity. This approach combines the benefits of end-to-end deep learning techniques with a modular architecture. The driving policy remains hidden from low-level vehicle dynamics and unprocessed perceptual input. The proposed methodology successfully transfers a driving policy trained on simulations to a 1/5-scale robotic truck without the need for fine-tuning. The robotic truck is evaluated in different circumstances on two continents, demonstrating the effectiveness of the approach.
The importance of surround-view cameras and route planners for end-to-end driving policy learning is emphasized in a study by [
45]. Humans rely on side-view and rear-view mirrors, as well as their cognitive map, to understand their surroundings while driving. However, most self-driving research and development only utilize a front-facing camera without incorporating a route planner for driving policy learning. The researchers propose a more comprehensive setup that includes a route planner and a surround-view camera system with eight cameras, along with a CAN bus reader. This realistic sensor setup provides a 360° view of the vehicle’s surroundings and a diverse driving dataset with varying weather conditions, illumination, and driving scenarios. The sensor setup also provides a proposed route to the destination and low-level driving activities, such as predicting speed and steering angle. The researchers empirically demonstrate that the setup with surround-view cameras and route planners improves the efficiency of autonomous driving, especially in urban driving environments with intersections.
Convolutional Neural Networks (CNNs) have been widely used in the field of self-driving cars. Previous end-to-end steering control methods often utilize CNNs for predicting steering angles based on image sequences. However, controlling the entire vehicle’s operation solely through the prediction of steering angles is insufficient. To address this limitation, Yang et al. [
46] propose a framework for multi-task learning that simultaneously predicts speed and steering angle for end-to-end learning. Accurately predicting speeds using visual input alone is not common practice. Their research introduces a multi-task and multi-modal network that incorporates sequences of images, recorded visualizations, and feedback speeds from previous instances as input. The proposed framework is evaluated on the "SAIC dataset" and the public Udacity dataset, demonstrating accurate predictions of speed and steering angle values.
While basic end-to-end visuomotor policies can be learned through behavior cloning, it is impossible to cover the entire range of driving behaviors. Researchers have conducted important exploratory work [
34] and developed a benchmark to investigate the scalability and limitations of behavior cloning techniques. Their experiments show that behavior cloning techniques can perform well in arbitrary and unseen situations, including complicated longitudinal and lateral operations without explicit pre-programming. However, behavior cloning protocols have limitations such as dataset biasing and overfitting, model training instability, the presence of dynamic objects, and the lack of causality. Addressing these limitations requires further research and technical advancements to ensure practical driving in real-world environments.
It is ideal to evaluate self-driving policies and models in real-world environments with multiple vehicles. However, this approach is risky and impractical for the extensive amount of research being conducted in this field. Recognizing the need for a fair and appropriate evaluation method specifically designed for autonomous driving tasks, developers have realized the importance of "Vision-based Autonomous Steering Control (V-ASC) Models" [
47]. They propose a suitable dataset for training models and develop a benchmark environment for evaluating different models, generating authentic quantitative and qualitative results. This platform allows the examination of the accuracy of steering value prediction for each frame and the evaluation of AD performance during continuous frame changes. The developers introduce a Software-In-the-Loop Simulator (S-ILS) that generates view-transformed image frames based on steering value changes, taking into account the vehicle and camera sensor models. They also present a baseline V-ASC model using custom-built features. A comparison between two state-of-the-art methods indicates that the end-to-end CNN technique is more efficient in achieving ground-truth (GT) tracking results, which strongly depend on human driving outcomes.
Figure 4. Sensing modalities by CARLA. From left to right: normal vision camera, ground-truth depth, ground-truth semantic segmentation [48].
The need for an appropriate evaluation method is indispensable. Another group of researchers has investigated offline evaluation methods for AD models [
34]. These methods involve collecting datasets with "Ground Truth Annotation" in advance for validation purposes. The research focuses on relating different offline and online evaluation metrics specific to AD models. The findings suggest that offline prediction error alone does not accurately indicate the driving quality of a model. Two models with the same prediction error can demonstrate significantly different driving performances. However, by carefully selecting validation datasets and appropriate metrics, the efficiency of offline evaluation methods can be greatly enhanced.
Another novel approach, called "Controllable Imitative Reinforcement Learning (CIRL)," has been proposed for vision-based autonomous vehicles [
49]. The researchers claimed that their driving agent, which relies solely on visual inputs, achieved a higher success rate in realistic simulations. Traditional reinforcement learning methods often struggle with the exploration of large and continuous action spaces in self-driving tasks, leading to inefficiencies. To address this issue, the CIRL technique restricts the action space by leveraging encoded experiences from the Deep Deterministic Policy Gradient (DDPG). The researchers also introduced adaptive strategies and steering-angle reward designs for various control signals such as "Turn Right," "Turn Left," "Follow," and "Go Straight." These proposals rely on shared representations and enable the model to handle diverse scenarios effectively. The researchers evaluated their methodology using the "CARLA Driving Benchmark" and demonstrated its superior performance compared to previously developed approaches in both goal-directed tasks and handling unseen scenarios.
Environments with a high presence of pedestrians, such as campus surroundings, pose challenges for self-driving policies. The development of robot or self-driving policies for such environments requires frequent intervention. Early detection and resolution of robotic decisions that could lead to failures are crucial. Recently, researchers proposed the "Learning from Intervention Dataset Aggregation (DAgger) algorithm" specifically designed for pedestrian-rich surroundings [
50]. This algorithm incorporates an error backtrack function that can include expert interventions. It works in conjunction with a CNN and a hierarchically nested procedure for policy selection. The researchers claim that their algorithm outperforms standard imitation learning policies in pedestrian-rich environments. Furthermore, there is no need to explicitly model pedestrian behavior in the real world, as the control commands can be directly mapped to pixels.
"Learning by Cheating" is another promising approach for vision-based autonomous vehicles in urban areas [
51]. The developers of this technique divided the learning process into two phases. In the first phase, a privileged agent is provided with additional information, including the locations of all traffic objects and the "Ground Truth Layout" of the surroundings. This privileged agent is allowed to exploit this privileged information and its own observations. In the second phase, a "Vision-based Sensorimotor" agent learns from the privileged agent without access to any privileged information or the ability to cheat. Although this two-step training algorithm may seem counterintuitive, the developers have experimentally demonstrated its benefits. They claim that their strategy achieves significantly better performance compared to other state-of-the-art methodologies on the CARLA benchmark and the NoCrash benchmark.
Computer vision, particularly in the context of autonomous driving, has been widely explored thanks to the extensive research conducted in this field. Numerous object detection models have been proposed, typically combining a backbone (encoder-decoder or transformer) such as VGG16, ResNet50, MobileNet [52,53], or CSPDarkNet53 with a head for making predictions and estimating bounding boxes. The head architecture varies between one-stage detectors, such as YOLO, SSD [54], RetinaNet [55], and CornerNet [56], and two-stage detectors, including different variants of the R-CNN family (R-CNN, R-FCN) [57]. In addition, works such as Feature Pyramid Network (FPN) [58], BiFPN [59], and NAS-FPN [60] have added a neck, which collects and aggregates feature maps from different stages of the backbone before they reach the head.
Data augmentation techniques, including photometric and geometric distortions as well as object-occlusion methods such as CutOut [61], hide-and-seek [62], and grid mask [63], have yielded significant improvements in image classification and object detection. Related regularization techniques such as DropOut [64], DropConnect [65], and DropBlock [66], which zero out randomly selected activations or regions in feature maps, have helped deal with semantic bias and imbalanced datasets. Hard example mining has been applied to two-stage detection models, but it is not applicable to the dense prediction layers of one-stage detectors. To address this, Lin et al. [67] introduced the focal loss, which effectively handles class imbalance in the dense prediction layer.
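As a rough illustration of the focal loss idea in a binary dense-prediction setting, the sketch below assumes PyTorch; the gamma and alpha values shown are common defaults rather than figures taken from [67].

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples so the dense prediction
    layer is not dominated by the abundant easy negatives."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()      # (1 - p_t)^gamma focuses on hard cases

# Example: four anchor predictions, mostly easy negatives
logits = torch.tensor([3.0, -4.0, -5.0, 0.2])
targets = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```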
Bounding box (BBox) estimation is crucial for object detection tasks. Traditional methods used various representations, such as centre coordinates with height and width, top-left and bottom-right corners, or anchor-based offset pairs, to determine the size and position of the BBox. However, regressing these coordinates as independent variables ignores the integrity of the object itself. To overcome this limitation, researchers introduced the IoU loss [68], a scale-invariant measure that compares the predicted box against the ground truth as a whole. Subsequently, GIoU [69] additionally accounted for the shape and extent of the smallest enclosing region, DIoU [70] considered the distance between box centres, and CIoU [70] took into account the overlap area, centre-point distance, and aspect ratio simultaneously. These advancements significantly improved the accuracy of BBox estimation.
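The difference between plain IoU and GIoU can be illustrated with a small Python sketch for axis-aligned boxes; DIoU and CIoU add centre-distance and aspect-ratio terms on top of the same quantities.

```python
def iou_and_giou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns (IoU, GIoU)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box, used by GIoU to penalise empty space
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    enclose = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (enclose - union) / enclose
    return iou, giou

# Prediction slightly offset from the ground-truth box
print(iou_and_giou((0, 0, 4, 4), (1, 1, 5, 5)))   # IoU = 9/23, GIoU < IoU
# The corresponding losses are 1 - IoU and 1 - GIoU.
```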
Attention mechanisms, including Squeeze-and-Excitation (SE) [71] and the Spatial Attention Module (SAM) [72], have been used for feature refinement. Spatial mechanisms like Spatial Pyramid Pooling (SPP) [73] aggregate information from feature maps into a fixed-length feature vector. Combining SPP with max-pooling outputs of several kernel sizes improved YOLOv3-608 by 2.7% on the MS COCO dataset, and using RFB [74] the improvement reached 5.7%. Other feature integration techniques, such as SFAM [75], ASFF [76], and BiFPN [77], have successfully integrated low-level physical features with high-level semantic features on a feature pyramid. PANet [78], proposed by Wang et al., performs per-pixel segmentation by leveraging few-shot non-parametric metric learning. YOLOv4, proposed by Bochkovskiy et al. [79], enhanced the YOLOv3 model by adding an SPP block, PANet as the path-aggregation neck, Mish activation [80], and DropBlock regularization, and employed Mosaic and Self-Adversarial Training (SAT) data augmentation. It also utilized modified versions of SAM and PAN, Cross mini-Batch Normalization (CmBN), which accumulates statistics only across the mini-batches within a single batch, and the CIoU loss for BBox estimation.
In summary, computer vision techniques, such as object detection architectures, data augmentation methods, attention mechanisms, and feature integration strategies, have greatly contributed to the advancement of autonomous driving. These techniques have addressed challenges related to exploration of action spaces, pedestrian-rich environments, and accurate object detection, enabling vision-based self-driving cars to achieve better performance and handle diverse scenarios effectively.
3. Multi-Task Learning and Meta Learning
AutoML has emerged as a popular field aimed at simplifying the use of machine learning techniques and reducing the reliance on experienced human experts. By automating time-consuming tasks such as hyperparameter optimization (HPO) using methods like Bayesian optimization, AutoML can enhance the efficiency of machine learning specialists. Google’s Neural Architecture Search (NAS) [81] is a well-known AutoML deep learning approach. Another significant AutoML technology is meta-learning [82], which involves teaching machine learning algorithms to learn from previous models through methods like transfer learning [30], few-shot learning [83], and even zero-shot learning [84].
In simple terms, a task $\mathcal{T}_i$ in the context of intelligent systems, such as robots, refers to accomplishing a specific objective by learning from repetitive actions. While single-task learning can be effective in environments where the learning objective remains consistent, it may not be applicable when the environment changes. For instance, in supervised learning, such as classifying cat images using a dataset of animal images, the task can be defined over a dataset

$$\mathcal{D} = \{(x, y)_k\},$$

which contains many input-output $(x, y)$ pairs. The objective is to minimize a loss function $\mathcal{L}$, which is a function of the dataset $\mathcal{D}$ and the parameters of the model $\theta$, i.e., $\min_{\theta} \mathcal{L}(\theta, \mathcal{D})$. The loss function can vary a lot and is usually chosen according to the learning task. One common choice is the negative log-likelihood loss in Equation (1),

$$\mathcal{L}(\theta, \mathcal{D}) = -\mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\log p_{\theta}(y \mid x)\big], \quad (1)$$

which is the expectation, over the datapoints in the dataset, of the log-probability the model assigns to the target labels. Thus, a task can be defined by a distribution over inputs, a distribution over labels given inputs, and a loss function, as in Equation (2):

$$\mathcal{T}_i \triangleq \{\, p_i(x),\ p_i(y \mid x),\ \mathcal{L}_i \,\}. \quad (2)$$
In the context of multi-task learning, there are scenarios where certain parameters remain consistent across all tasks. For instance, in multi-task classification the loss function $\mathcal{L}_i$ is identical for all tasks. Similarly, in multi-label learning both the loss function and the input distribution are the same across all tasks, as in Equation (3):

$$\mathcal{L}_i = \mathcal{L}, \quad p_i(x) = p(x) \quad \forall\, i. \quad (3)$$

An example of this is scene understanding. However, it is important to note that the loss functions may differ when the label distribution changes (e.g., mixed discrete versus continuous labels) or when there is a preference for one task over another.

A task descriptor $z_i$ is also used, which can be the index of the task or any metadata related to the task. The objective then becomes a summation over all the tasks, as in Equation (4):

$$\min_{\theta} \sum_{i=1}^{T} \mathcal{L}_i\big(\theta, \mathcal{D}_i\big). \quad (4)$$

Furthermore, some parameters ($\theta^{sh}$) can be shared and optimized across all the tasks, while other parameters are task-specific ($\theta^{i}$) and can in turn be shared and optimized among similar sub-tasks. The objective function therefore generalizes to Equation (5):

$$\min_{\theta^{sh},\, \theta^{1}, \ldots,\, \theta^{T}} \; \sum_{i=1}^{T} \mathcal{L}_i\big(\{\theta^{sh}, \theta^{i}\}, \mathcal{D}_i\big). \quad (5)$$

How the task descriptor is conditioned on determines how these parameters are shared.
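As a minimal PyTorch illustration of this split between shared parameters $\theta^{sh}$ and task-specific parameters $\theta^{i}$ (the layer sizes and the two example tasks are assumptions made for the sketch, not taken from the text):

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk (theta_sh) with one task-specific head (theta_i) per task."""
    def __init__(self, in_dim=64, hidden=128, task_out_dims=(10, 1)):
        super().__init__()
        self.shared = nn.Sequential(              # theta_sh: optimized by every task
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(                # theta_i: one head per task
            [nn.Linear(hidden, d) for d in task_out_dims]
        )

    def forward(self, x, task_id):
        # The task descriptor (here just an index) selects the head
        return self.heads[task_id](self.shared(x))

net = MultiTaskNet()
x = torch.randn(8, 64)
class_logits = net(x, task_id=0)   # e.g. a 10-way classification task
steer_pred   = net(x, task_id=1)   # e.g. a scalar regression task
```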
3.1. Conditioning Task Descriptor
Extensive research has been conducted on how to condition the task description with various techniques like:
3.1.1. Concatenation
In this method, the task descriptor is first passed through a linear layer to adjust its dimensionality to match the hidden layers, and is then concatenated with the network’s hidden features.
3.1.2. Additive Conditioning
In this strategy, the projected task descriptor is added element-wise to the hidden activations rather than concatenated with them, conditioning the features on the task while keeping the layer dimensionality unchanged.
3.1.3. Multi-head Architecture
In this strategy, the network is divided into multiple heads, each consisting of task-specific layers. This allows the selection of specific segments of the network for different tasks.
3.1.4. Multiplicative Conditioning
In this approach, the projected task descriptor is multiplied element-wise with the input or hidden features, acting as a task-dependent gate. This multiplicative gating is more expressive than additive conditioning and provides fine-grained, per-feature control for specific tasks; in the extreme, the gate can select disjoint parts of the network for different tasks, recovering largely task-specific sub-networks.
While more complex conditioning techniques for task descriptors exist, each of these methods is tailored to a specific optimization problem and requires domain knowledge for effective utilization.
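The sketch below contrasts two of these conditioning choices, concatenation and multiplicative gating, for a single hidden layer. It is an illustrative toy example (the dimensions and the sigmoid gate are assumptions), not a recommendation for any particular task.

```python
import torch
import torch.nn as nn

class ConditionedLayer(nn.Module):
    """Illustrative task-descriptor conditioning for one hidden layer.
    mode='concat' appends the projected descriptor; mode='mult' gates features."""
    def __init__(self, feat_dim=128, task_dim=8, mode="concat"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(task_dim, feat_dim)          # match dimensionality
        in_dim = feat_dim * 2 if mode == "concat" else feat_dim
        self.fc = nn.Linear(in_dim, feat_dim)

    def forward(self, h, z):
        z = self.proj(z)                                   # project task descriptor
        if self.mode == "concat":
            h = torch.cat([h, z], dim=-1)                  # concatenation conditioning
        else:
            h = h * torch.sigmoid(z)                       # multiplicative gating
        return torch.relu(self.fc(h))

h = torch.randn(4, 128)            # hidden features
z = torch.randn(4, 8)              # task descriptor (index embedding or metadata)
out_cat = ConditionedLayer(mode="concat")(h, z)
out_mul = ConditionedLayer(mode="mult")(h, z)
```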
3.2. Objective Optimization
Meta-learning follows standard neural-network objective optimization. A simple minimization of the multi-task objective can be carried out with the following steps:
1. Sample a mini-batch of tasks.
2. Sample a mini-batch of datapoints for each sampled task.
3. Compute the loss on the mini-batch.
4. Compute gradients via backpropagation.
5. Update the parameters using the gradient information.
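A short Python sketch of this optimization loop is given below. It assumes a model with a `forward(x, task_id)` interface like the multi-task sketch above, hypothetical per-task datasets exposing a `sample()` method, and illustrative per-task loss functions; none of these interfaces come from the text.

```python
import random
import torch

def train_multitask(model, task_datasets, loss_fns, steps=1000,
                    tasks_per_batch=2, points_per_task=32, lr=1e-3):
    """Vanilla multi-task optimization. `task_datasets[i].sample(n)` and the
    per-task `loss_fns[i]` are hypothetical interfaces assumed for this sketch."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        # 1. Sample a mini-batch of tasks
        task_ids = random.sample(range(len(task_datasets)), tasks_per_batch)
        total_loss = 0.0
        for t in task_ids:
            # 2. Sample a mini-batch of datapoints for each sampled task
            x, y = task_datasets[t].sample(points_per_task)
            # 3. Compute the loss on the mini-batch
            total_loss = total_loss + loss_fns[t](model(x, task_id=t), y)
        # 4. Compute gradients via backpropagation
        opt.zero_grad()
        total_loss.backward()
        # 5. Update parameters using the gradient information
        opt.step()
```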
3.3. Action Prediction
Initially, the perception module of the self-driving car detects the presence of obstacles in its surroundings. It utilizes obstacle data and basic information such as accelerations, headings, velocities, and positions as input. Upon processing this input, the perception module predicts the trajectories of all obstacles, along with their associated probabilities. Subsequently, the prediction module analyzes and forecasts the expected behavior of these obstacles. It is crucial for the prediction module to operate in real-time, with high precision and minimal latency. Additionally, the prediction module should possess the ability to learn and adapt to the varying behaviors exhibited by nearby obstacles. This section encompasses some cutting-edge approaches employed for action prediction in self-driving cars.
Model-based Prediction: It comprises two models, one for describing the vehicle’s movement (straight or curved) and another for determining turning positions.
Driven-based Prediction: It relies on machine learning and involves training a model on observations to generate real-time predictions.
Lane sequence-based predictions: This approach divides the original path into segments and uses the sensors in self-driving cars to detect obstacles. Classical probabilistic graphical modeling methods, such as spatio-temporal graphs, factor graphs, and dynamic Bayesian networks, are used to predict the state of detected obstacles and model temporal sequences.
Recurrent neural networks: This approach utilizes recurrent neural networks for prediction and takes advantage of time-series data. Object detection using a "Single Shot Detector (SSD)" module is performed for handling larger inputs, and recurrent networks can also operate directly on video streams. Trajectory planning and generation are carried out as a concluding step, where the action prediction module generates constraints to choose the most suitable trajectory for the self-driving car (a minimal sketch of such a recurrent trajectory predictor is shown below).
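The following is a minimal, illustrative PyTorch sketch of a recurrent trajectory predictor: it encodes a short history of an obstacle's (x, y) positions with an LSTM and regresses a fixed number of future positions. The observation length, prediction horizon, and single-layer design are assumptions for the example.

```python
import torch
import torch.nn as nn

class TrajectoryLSTM(nn.Module):
    """Toy recurrent predictor: encode an obstacle's past (x, y) track and
    roll out a fixed number of future positions."""
    def __init__(self, hidden=64, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, 2 * horizon)    # predict all future steps at once

    def forward(self, past_xy):                          # (batch, T_past, 2)
        _, (h, _) = self.encoder(past_xy)
        future = self.decoder(h[-1])                     # (batch, 2 * horizon)
        return future.view(-1, self.horizon, 2)          # (batch, horizon, 2)

past = torch.randn(4, 20, 2)            # 4 obstacles, 20 observed timesteps
pred = TrajectoryLSTM()(past)           # predicted 10 future (x, y) positions each
```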
To achieve optimal path planning, the physical environment is represented digitally, enabling a search for the best drivable lane and passage for the self-driving car [85].
3.4. Depth and Flow Estimation
Precisely estimating optical flow, scene depth, and ego-motion poses significant challenges in computer vision, robotics, video analysis, simultaneous localization and mapping (SLAM), and self-driving cars. While humans can easily and quickly perceive ego-motion and scene direction, developing reconstruction models for real-world scenes is highly challenging. Obstacles such as light reflection, occlusion, and non-rigidity complicate the process [
86].
Accurate depth estimation plays a critical role in various computer vision applications, including 3D reconstruction, object tracking, and self-driving assistance. Photometric methods for depth estimation fall broadly into two approaches, discussed below: stereo (binocular) matching of image pairs and motion-based estimation from sequential scenes.
Previous research in the field of depth estimation has attempted to replicate the binocular human vision system using Graphics Processing Units (GPUs) [
87] for real-time processing of source scenes. These studies tried to achieve accurate and efficient depth estimation [
88]; however, certain challenges remained, such as inadequate depth quality in occluded areas, differences in image scale, and variations in lighting conditions. To address these issues, a more effective approach involves analyzing sequential or unordered scenes and predicting motion or structure. One commonly used method for quickly generating scene structure is "Structure from Motion" (SfM), which leverages ego-motion. This approach is sensitive and capable of handling homogeneous regions [
89]. Similarly, "Visual Odometry" techniques are widely adopted as learning methodologies in this context.
These methods involve training network structures using a portion of the ground truth dataset. Simultaneously, the same portion of the dataset is utilized for achieving odometry [
90]. Typically, visual reconstruction encompasses tasks such as extracting feature points, matching scenes, and verifying geometry. Subsequently, visual reconstruction requires optimization through specialized procedures such as bundle adjustments and determining camera poses based on matching points. The main drawback of visual odometry techniques lies in their susceptibility to geometric constraints, including homogeneous regions, occlusions, and complex textures during feature point extraction [
86].
Furthermore, many researchers are exploring the use of supervised learning techniques for scene reconstruction tasks, including camera pose estimation, feature matching, and camera ego-motion prediction. Convolutional neural networks (CNNs) are highly recommended for accurate extraction of feature points from input scenes [
90]. Researchers have also been experimenting with recursive CNN models that make use of sequential scenes, claiming improved efficiency in feature extraction from individual scenes [
90]. Additionally, incorporating probabilistic approaches like conditional random field (CRF) can further enhance the efficiency of motion estimation [
91]. CNN architectures have been proven effective for both calibrated stereo scenes and monocular scenes in estimating depth.
Moreover, depth and optical flow estimation problems have also been explored using unsupervised learning techniques. These approaches offer a solution to the geometric challenges posed by the camera, which were previously considered difficult to address through learning techniques. Since sequential scenes exhibit spatial smoothness, the loss function can be modeled by extending the concept of scene reconstruction (Reference 53). Similarly, the photometric discrepancy technique, along with other depth estimation approaches, can be employed to predict ego-motion from a monocular scene [
92]. A Kalman filter is recommended for estimating depth and camera pose from monocular video sequences, as it enhances the smoothness of the camera’s ego motion within the learning framework [
93]. Additionally, the deployment of a multi-stream architecture of CNNs can help avoid aliasing problems [
86].
3.5. Behavior Prediction of Nearby Objects
Having the ability to accurately predict the behaviors of nearby vehicles, cyclists, and pedestrians is extremely valuable for self-driving cars to navigate safely. In terms of safety, it is important for self-driving cars to maintain a safe distance from all other objects in their surroundings. At the same time, they must prioritize minimal fuel consumption, time efficiency, and adherence to traffic laws. However, this task of ensuring safe and efficient navigation becomes even more difficult in densely populated urban areas where there are more vehicles driven by humans, as well as pedestrians and cyclists. In order to drive cautiously and avoid any collisions, self-driving cars need to calculate the expected paths of all surrounding objects [
94].
Due to the wide range of driving behaviors exhibited by human drivers, including aggressiveness and unpredictability, extensive research is being conducted to assess this significant factor. Self-driving cars are programmed to prioritize hyper-cautious driving in all situations, which can potentially frustrate human drivers nearby and result in minor accidents [
95]. Conversely, human drivers are known to act recklessly in certain scenarios, such as sudden lane changes, aggressive maneuvers, and imprudent overtaking [
94]. Mavrogiannis and colleagues have developed an innovative algorithm that utilizes learning techniques to take into account the behaviors of human drivers when predicting actions and navigating autonomous vehicles (AVs). They have categorized traffic participants into conservative individuals and those with varying degrees of aggressiveness.
To achieve this goal, a versatile simulator with rich behavior capabilities has been implemented. This simulator can generate various longitudinal and lateral behaviors and offers a range of driving styles and traffic densities for different scenarios. The prediction of high-level actions for a self-driving car when faced with aggressive drivers nearby is performed using a ’Markov Decision’ process. Additionally, the interaction between different traffic participants is modeled using a ’Graph Convolutional Network (GCN)’, which trains a behavioral policy [
94]. Another group of researchers has recognized the significance of considering uncertainties associated with behavioral predictions. They have highlighted the need for self-driving cars to take into account not only the behavioral predictions of nearby traffic participants but also the corresponding uncertainties. To address this, they have proposed a new framework called ’uncertain-aware integrated prediction and planning (UAPP)’. This framework gathers real-time information about other traffic participants and generates behavioral predictions along with their associated uncertainties [
96].
The presence of many pedestrians in an area poses challenges for self-driving cars in predicting their behavior and ensuring safe navigation. Accurately predicting the future paths of nearby pedestrians over extended time periods becomes crucial. In order to address this issue, Jayaraman et al. introduced a hybrid model that estimates long-term trajectories (over 5 seconds) of pedestrians, enabling smoother interaction with self-driving cars. This model combines the dynamics of constant velocity with the pedestrians’ tendency to accept gaps in traffic, as described in their analysis [
97].
A research study has examined the behavior of cyclists, who are important road users in urban areas, and proposed an innovative framework for predicting their intentions. The researchers focused on four specific hand signals commonly used by cyclists and developed a data generation pipeline to overcome the challenge of insufficient data in this area. The pipeline is used to generate synthetic point cloud scans representing various cyclists in urban environments. Ultimately, the framework takes a set of these point cloud scans as input and produces predictions for the cyclists’ intentions, assigning the highest probability [
98].
3.6. One Shot Learning
One-shot learning is a key method in computer vision that enables the categorization of objects depicted in an image. Unlike other machine learning and deep learning techniques, which rely on large amounts of labeled data or numerous samples for effective training, one-shot learning aims to learn from just a single sample or image during the training phase. It then applies this learned knowledge to categorize objects during the testing phase. This approach takes inspiration from the human brain’s ability to quickly learn thousands of object categories by observing only a few samples or examples from each category.
In the context of one-shot learning, when a new object category is presented to the algorithm with just a single example (without being previously included in the training dataset), the algorithm recognizes it as a new category. It then utilizes this single sample to learn and subsequently becomes capable of re-identifying instances of this object class in the future. One notable application of one-shot learning is face recognition, which is widely used in smart devices and security systems for re-identification purposes. For example, the face recognition system employed at passport security checks takes the passport photo as a single training sample and then determines whether the person holding the passport matches the image in the passport photo [
99].
Researchers have developed many algorithms capable of performing one-shot learning. Some of them are:
‘Probabilistic models’ using ‘Bayesian learning’
‘Generative models’ deploying ‘probability density functions’
Images transformation
Memory augmented neural networks
Meta learning
Metric learning exploiting ‘convolutional neural networks (CNN)’ [
99]
The realm of self-driving cars and robotics is making use of one-shot learning algorithms in various ways. Typically, deep neural networks are employed to visually understand driving scenes in fully autonomous driving systems. However, these networks necessitate large annotated databases for training. Fortunately, a recent research study has introduced a method called one-shot learning, which eliminates the need for manual annotation during the training of perception modules. This approach involves the development of a generative framework known as "Generative One-Shot Learning (GOL)." GOL takes in one-shot samples or generic patterns, along with some regularization samples, to guide the generative process [
100].
In a similar vein, Hadad et al. have proposed the use of one-shot learning techniques for driver identification in shared mobility services. They have implemented a fully functional system using a camera module, Python language, and a Raspberry Pi. The application of one-shot learning for face recognition has yielded satisfactory results [
101]. Additionally, one-shot learning can also be leveraged for surveillance anomaly detection without requiring precise temporal annotations. Researchers have developed an innovative framework that utilizes a three-dimensional CNN Siamese network to identify surveillance abnormalities. The discriminative properties of the 3D CNN enable the detection of similarities between two sequences of surveillance anomalies [
102].
3.7. Few Shot Learning
Few-shot learning is a subset of machine learning that focuses on learning from a small number of labeled examples for each classification present in a database. It falls within the realm of supervised learning techniques. Few-shot learning finds its main applications in object detection and recognition, image classification, as well as sentiment classification. One of the key advantages of few-shot learning techniques is their ability to operate without the need for large databases, which is typically required by most machine learning approaches. Leveraging prior knowledge, few-shot learning can generalize to new tasks with limited examples and supervised information. However, a significant challenge in few-shot learning lies in the empirical unreliability of risk minimizers. Different methods in few-shot learning address this issue in various ways, thereby aiding in the categorization of these methods.
Some methods use data to enhance supervised experience by using previously acquired knowledge
Some methods use model to minimize the dimensions of hypothesis space by deploying previously acquired knowledge
Others use previous knowledge to facilitate algorithm which helps in searching of optimal hypothesis present in given space [
103]
Many researchers in the field of autonomous driving and robotics have embraced few-shot learning techniques to address specific challenges. With the escalating road traffic and diverse traffic scenarios, ensuring safety for self-driving cars has become increasingly crucial. To tackle this, a group of researchers has proposed equipping self-driving cars with multiple low-light cameras equipped with fish-eye lenses to enhance the assessment of nearby traffic conditions. Furthermore, a combination of LIDAR and infrared cameras captures the front view. The few-shot learning algorithm is utilized to identify nearby vehicles and pedestrians in the front view. Additionally, the researchers have developed a comprehensive embedded visual assistance system that integrates all these devices into a miniature setup [
104].
Similarly, Majee et al. have encountered the "few-shot object detection (FSOD)" problem in real-world, class-imbalanced scenarios. They specifically selected the "India Driving Dataset (IDD)" due to its inclusion of less frequently occurring road agents. The researchers conducted experiments with both metric-learning and meta-learning approaches for FSOD and found that metric-learning yielded superior outcomes compared to the meta-learning method [
105].
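As a rough sketch of what the metric-learning flavour of few-shot classification looks like, the code below implements a prototypical-network-style nearest-centroid classifier; it is an illustration of the general technique, not the specific FSOD method of [105], and the embedding network and episode sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prototypical_classify(encoder, support_x, support_y, query_x, num_classes):
    """Metric-learning few-shot classification: embed the few labelled support
    examples, average them into per-class prototypes, and label each query by
    its nearest prototype in the embedding space."""
    z_support = encoder(support_x)                        # (N_support, D)
    z_query = encoder(query_x)                            # (N_query, D)
    prototypes = torch.stack(
        [z_support[support_y == c].mean(dim=0) for c in range(num_classes)]
    )                                                     # (num_classes, D)
    dists = torch.cdist(z_query, prototypes)              # Euclidean distances
    return F.log_softmax(-dists, dim=1)                   # closer prototype -> higher score

# Hypothetical 2-way, 5-shot episode with a simple embedding network
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
support_x, support_y = torch.randn(10, 32), torch.tensor([0] * 5 + [1] * 5)
query_x = torch.randn(6, 32)
log_probs = prototypical_classify(encoder, support_x, support_y, query_x, num_classes=2)
```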
Furthermore, researchers have conducted experiments using a new approach called "Surrogate-gradient Online Error-triggered Learning (SOEL)" to achieve few-shot learning. This method utilizes neuromorphic processors and combines advanced techniques such as deep learning, transfer learning, and computational neuroscience. During the research, Spiking Neural Networks (SNNs) were partially trained and made adaptable online to unfamiliar data classes within a specific domain. The SOEL system automatically updates itself after encountering an error, enabling fast learning. Gesture recognition tasks have been successfully performed using the SOEL system, demonstrating its ability to rapidly learn new classes of gesture data using a few-shot learning paradigm. The developed system also has potential applications in predicting the behavior of pedestrians and cyclists in self-driving cars [
106].
In the field of autonomous driving, point clouds play a crucial role, and their applicability continues to expand. While deep neural networks with labeled point clouds have been used for various supervised learning tasks, the annotation of point clouds remains a burdensome process that should be minimized. To address this issue, some researchers proposed an innovative solution involving the use of a cover-tree to hierarchically partition the point cloud, which is then encoded through two self-supervised learning pre-training tasks. Few-shot learning has been employed for the pre-training of downstream self-supervised learning networks, reducing the reliance on labeled data during the annotation process [
107].
5. Case Study: Waymo vs Tesla
Extensive research has enabled the development of actual self-driving vehicles in the real world. These vehicles have undergone extensive training in computer simulations and real-world conditions, often with passengers on board. Companies such as Waymo (formerly Google Self-Driving Car Project) and Tesla Motors have achieved remarkable progress by rigorously testing their vehicles and deploying a wide range of models that have traveled millions of miles without major accidents.
Pioneers in the field have explored diverse architectures and deep learning algorithms due to the vast combination of underlying framework parameters. Both approaches involve gathering substantial amounts of metadata about the vehicle’s environment and surroundings, although the methods of processing and data collection may vary. Waymo, for example, utilizes a modular pipeline approach for autonomous vehicles, incorporating five onboard cameras, LiDAR, and HD maps. In contrast, Tesla Motors focuses exclusively on computer vision, relying on eight cameras. This approach increases the computational demands on the vision aspect but simplifies the interaction between different pipeline systems. The collected metadata, often referred to as "trigger-points," serves as input for solving meta-learning tasks. These triggers include moving objects, stationary objects, line markings, road markings, crosswalks, road signs, traffic signs, and various environmental factors, as shown in Figure 6. It is crucial to note that achieving very high accuracy in solving these tasks is essential for deploying such vehicles in real-world scenarios; reliable vision systems require highly accurate models for operation. This is exemplified in Figure 7, where a Tesla car accurately recognizes environmental conditions, such as wet roads, in the real world and updates its neural network accordingly.
Tesla recently announced their plans to replace their previous technology stack, which included Radar, Vision, and Sensor Fusion, with solely Computer Vision. This approach is driven by the efficiency of Vision compared to their previous Sensor Fusion techniques. They argue that radar has evident disadvantages in various situations, particularly during emergency braking, where the radar signals exhibit a jagged response to sudden brakes. Additionally, when the vehicle passed under a bridge, the radar system struggled to determine if it was a stationary object or a stationary car. In such cases, Vision was relied upon for confirmation and successfully differentiated the object by reporting "negative velocity" for a few frames. Vision also offers vertical resolution, enabling the classification of stationary bridges or cars.
Another significant drawback of the Radar stack is its susceptibility to environmental triggers. As mentioned earlier, radar heavily relies on vision for classification. Therefore, a noisy vision layer can introduce delays in the radar stack’s response, leading to a reduced reaction time to stationary objects. On the other hand, using Vision alone allows for early identification of such objects, resulting in longer and smoother braking responses. The utilization of Vision alone has demonstrated improved precision and recall compared to the previous technology stack.
In a Tesla vehicle, there are eight cameras that capture continuous sequences of images. These images are then processed using an image extractor, such as a standard Resnet model. The next step involves fusing all the images from the eight different vision sources together using a Transformer-like network. This fusion process enables the sharing of information across all the cameras and over time, utilizing Transformers (such as RNN, 3D CNN, or other hybrid structures).
Once the multi-camera fusion is completed, the information is passed through various components, including heads, trunks, and terminals, in a branching-like architecture employed by Tesla, as depicted in Figure 8. This architecture offers several advantages, particularly in situations where resources are limited and employing individual neural networks for each independent task is not feasible. Feature sharing becomes crucial in such cases, as the amount of input data sampled from each input source may vary depending on the target feature. Additionally, this architecture allows for the decoupling of signals at the terminal level, independent of operations on other levels.
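The overall pattern described in this case study — per-camera feature extraction, attention-based fusion across cameras, a shared trunk, and branching task heads — can be sketched as follows. This is a deliberately simplified, assumption-laden illustration (toy backbone, pooled fusion, arbitrary head sizes) and not Tesla's actual network.

```python
import torch
import torch.nn as nn

class MultiCamDrivingNet(nn.Module):
    """Illustrative pattern only: per-camera feature extraction, attention-based
    fusion across cameras, a shared trunk, and branching task heads."""
    def __init__(self, num_cams=8, feat_dim=256):
        super().__init__()
        # Per-camera extractor (stand-in for a ResNet backbone)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Transformer-style fusion across the camera tokens
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.trunk = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())  # shared trunk
        # Branching heads ("terminals") for separate tasks
        self.heads = nn.ModuleDict({
            "objects": nn.Linear(feat_dim, 10),     # e.g. object/trigger-point logits
            "lanes":   nn.Linear(feat_dim, 4),      # e.g. lane-marking parameters
            "control": nn.Linear(feat_dim, 2),      # e.g. steering and speed targets
        })

    def forward(self, images):                       # (batch, num_cams, 3, H, W)
        b, n, c, h, w = images.shape
        tokens = self.backbone(images.view(b * n, c, h, w)).view(b, n, -1)
        fused = self.fusion(tokens).mean(dim=1)      # pool the fused camera tokens
        shared = self.trunk(fused)
        return {name: head(shared) for name, head in self.heads.items()}

out = MultiCamDrivingNet()(torch.randn(2, 8, 3, 64, 96))
```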