An autonomous system comprises several components, and its architectural design provides an abstract view of the system's operation and organization. In a layered architecture, the components expose public, well-defined communication interfaces through which they exchange information with other components. This characteristic enables the definition of a common architecture for all tracks in this challenge: only a few components need to be adjusted while the same communication interfaces are maintained. This strategy reduces the time spent developing the agents and enables the evaluation of autonomous navigation performance with different sensors and algorithms for a specific task.
3.1. Mapping and Path Planning
A map is an essential component that enables the autonomous vehicle to execute its tasks safely and efficiently, storing diverse information that is beneficial for various components of the autonomous system [32]. This includes the road geometry description used for path planning and the topological representation of roads and intersections, commonly referred to as the road network, used for route planning. It also includes static object positions, navigable areas, positions of traffic signs and lights, traffic rules, and semantic information related to the road. In this architecture, we employed the OpenDRIVE [33,34] map standard to assist the navigation and perception components.
3.1.1. OpenDRIVE
OpenDRIVE is an open format for describing road networks, using XML version 1.0, which can represent the road geometry as well as contextual information about roads that may influence the behavior of vehicles driving on them, such as traffic signs, traffic lights, and the types of roads and lanes (e.g., highway and sidewalk) [34,35]. The format description is built on a hierarchical structure with four main elements: header, road, junction, and controller.
Figure 3 shows the visualization of the OpenDRIVE map after being parsed by the map manager in the architecture.
The header is the first element of the description and holds the map metadata, such as the name or a geographic reference for transformations between Cartesian and geodetic coordinate systems. The road element encompasses the geometry description and additional properties (e.g., elevation profile, lanes, and traffic signs). Properties such as traffic signs are placed with respect to the distance from the initial point of the lane, using a local coordinate system. In addition, roads can connect directly or through intersections using the junction element, preventing ambiguities in connections. Lastly, a controller is employed for signalized junctions or other road elements imposing control on the vehicle.
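As an illustration of this hierarchy, the following minimal sketch parses an OpenDRIVE file with Python's standard xml.etree.ElementTree module and lists the road, junction, and signal elements. The element and attribute names (header, road, junction, signal, s) follow the OpenDRIVE specification, but the file name is hypothetical and the snippet is only a simplified reading aid, not the map manager used in our architecture.

```python
import xml.etree.ElementTree as ET

# Parse an OpenDRIVE map (file name is hypothetical).
tree = ET.parse("town.xodr")
root = tree.getroot()  # <OpenDRIVE> element

# Header: map metadata and geographic reference.
header = root.find("header")
print("map name:", header.get("name"))

# Roads: geometry plus properties such as lanes and signals.
for road in root.findall("road"):
    print(f"road id={road.get('id')} length={road.get('length')} "
          f"junction={road.get('junction')}")  # -1 means not part of a junction
    signals = road.find("signals")
    if signals is not None:
        for signal in signals.findall("signal"):
            # Signals are placed by their s-coordinate along the road.
            print("  signal", signal.get("id"), "at s =", signal.get("s"))

# Junctions: explicit connections between incoming and connecting roads.
for junction in root.findall("junction"):
    n_conn = len(junction.findall("connection"))
    print(f"junction id={junction.get('id')} connections={n_conn}")
```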
3.1.2. Path Planning
Path planning is responsible for determining a feasible and optimal path from a starting point to a destination in a given environment. This planned path considers factors such as road geometry, static and dynamic obstacles, vehicle physical constraints, and criteria like time, speed, fuel efficiency, or safety. Various algorithms, including rule-based, gradient-based, graph-based, and optimization methods, can perform these tasks. In addition, recent studies have explored the integration of machine learning into path planning. Moreover, the task can be divided into two steps: global planning and local planning. Global planning estimates a reference path, considering static features (e.g., road geometry and static obstacles) and the intended route. Local planning adjusts this global reference path based on dynamic variables in the traffic scene, such as dynamic obstacles.
In our modular navigation pipeline, the global reference path is complemented by a local speed profile based on the dynamic scene and by local lane-change planning triggered by traffic events. Global reference planning starts from sparse waypoints representing the intended route. Subsequently, lane-level localization determines the corresponding lane and road ID for each waypoint using the OpenDRIVE map manager. This information enables the system to estimate a lane-level route, identifying the roads and lanes the vehicle must traverse to reach its destination. Finally, segments of the reference path, each spanning 50 meters, are published in the ROS ecosystem based on the vehicle's speed and current position.
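The sketch below illustrates the last step of this pipeline under simplifying assumptions: given a dense list of global reference waypoints, it extracts the roughly 50 m segment ahead of the vehicle's current position, which is what would be published to the rest of the ROS stack. The function names and the plain-list path representation are illustrative; the actual implementation relies on the OpenDRIVE map manager and ROS messages.

```python
import math

def closest_index(path, position):
    """Index of the waypoint closest to the current vehicle position."""
    return min(range(len(path)),
               key=lambda i: math.dist(path[i], position))

def next_segment(path, position, length_m=50.0):
    """Return the portion of the reference path ahead of the vehicle,
    accumulating waypoints until roughly `length_m` meters are covered."""
    start = closest_index(path, position)
    segment = [path[start]]
    travelled = 0.0
    for prev, curr in zip(path[start:], path[start + 1:]):
        travelled += math.dist(prev, curr)
        segment.append(curr)
        if travelled >= length_m:
            break
    return segment

# Example: a straight reference path with waypoints every 2 m.
reference_path = [(x, 0.0) for x in range(0, 500, 2)]
segment = next_segment(reference_path, position=(123.0, 0.5))
print(len(segment), "waypoints covering ~50 m ahead of the vehicle")
```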
3.2. Perception
The autonomous driving system proposed in this paper relies on two types of sensors for robust environmental perception: a stereo camera and a LiDAR sensor. Our proposed perception system, depicted in Figure 4, adopts a multi-sensor fusion approach to achieve accurate and robust object detection in 2D/3D images and point clouds. This approach employs distinct detection modules and their fusion, each capitalizing on the strengths of different sensor modalities:
• Height map: As a basic obstacle detector, we employ a height map generated from the LiDAR point cloud, similar to the one presented in [36]. This method analyzes height differences within a grid, identifying obstacles wherever the difference exceeds a threshold; grid cells with no such differences are deemed part of a plane. We use a polar grid map with a 20 cm threshold. This detection mechanism serves as a backup for emergency situations. The first branch in Figure 4 depicts the height map-based perception process: points are assigned colors based on their height to create a visual representation of the surrounding environment's topography, then the grid cells that contain points exceeding the predetermined vertical distance threshold are identified, and finally the points situated within these cells are marked in red to clearly indicate potential obstacles. A minimal sketch of this procedure follows this item. Our obstacle detector using a height map in a polar grid is open source and available online 5.
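The following sketch, written with NumPy, shows the core of such a height-map obstacle detector under simplified assumptions: points are binned into a polar (range/azimuth) grid around the sensor, and any cell whose height span exceeds 20 cm has its points flagged as obstacles. The grid resolution and variable names are illustrative choices, not the parameters of our open-source implementation.

```python
import numpy as np

def height_map_obstacles(points, range_res=0.5, azimuth_bins=360,
                         height_threshold=0.2, max_range=50.0):
    """Flag LiDAR points that fall into polar grid cells whose
    max-min height difference exceeds `height_threshold` (meters).

    points: (N, 3) array of x, y, z coordinates in the sensor frame.
    Returns a boolean mask of obstacle points.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)

    # Discretize into polar cells (range ring, azimuth sector).
    ring = np.clip((r / range_res).astype(int), 0, int(max_range / range_res))
    sector = ((theta + np.pi) / (2 * np.pi) * azimuth_bins).astype(int) % azimuth_bins
    cell_id = ring * azimuth_bins + sector

    obstacle_mask = np.zeros(len(points), dtype=bool)
    for cell in np.unique(cell_id):
        idx = np.where(cell_id == cell)[0]
        if z[idx].max() - z[idx].min() > height_threshold:
            obstacle_mask[idx] = True  # cell is not flat: mark its points
    return obstacle_mask

# Usage: obstacles = height_map_obstacles(lidar_points)  # (N, 3) float array
```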
• Instance segmentation: For object detection, we employ Mask R-CNN [37], a powerful instance segmentation algorithm that extracts the coordinates of bounding boxes and masks for each object instance detected in the image. Our system categorizes these objects into eight categories: car (including vans, trucks, and buses), bicycle (including motorcycles), pedestrian, red traffic light, yellow traffic light, green traffic light, stop, and emergency vehicle. The code is open source and available online 6.
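As a point of reference for the output format this module produces (per-instance boxes, labels, scores, and masks), the sketch below runs the off-the-shelf, COCO-pretrained Mask R-CNN from torchvision. It is only an illustration: our detector is trained on the eight classes listed above, whereas the pretrained model uses the COCO label set, and the image file name is hypothetical.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained Mask R-CNN; our system uses a model trained on 8 classes.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("left_camera_frame.png").convert("RGB")  # hypothetical file
with torch.no_grad():
    output = model([to_tensor(image)])[0]

# Keep confident detections; each instance has a box, a label, and a soft mask.
keep = output["scores"] > 0.5
boxes = output["boxes"][keep]          # (M, 4) in pixel coordinates
labels = output["labels"][keep]        # (M,) class indices
masks = output["masks"][keep] > 0.5    # (M, 1, H, W) boolean masks

print(f"{len(boxes)} instances detected")
```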
• Fusion with stereo camera: Since the image used for instance segmentation is also the left image of the stereo camera, we leverage this for 3D detection and classification. The bounding box coordinates and pixel mask of each object instance detected in 2D are mapped to the corresponding 3D points in the organized point cloud. This point cloud inherits the color and the image's row and column structure, but expands it with 3D/depth information. This fusion process creates an RGBD point cloud with color and 3D/depth information for each object instance, enabling accurate classification and positioning; a minimal sketch of this mapping is given after this item.
The third branch of Figure 4 illustrates an example of a point cloud constructed from the stereo camera, while the fourth branch shows an example of instance segmentation corresponding to the same scenario. Finally, the fusion of these two branches enables the detected objects to be mapped onto the RGBD point cloud, providing a comprehensive visualization of their positions within the 3D environment.
While this method can detect both static and dynamic objects, it is not the primary system detector in our architecture due to its non-real-time operation with a maximum of 5 frames per second. Nevertheless, the detailed information it provides on traffic light states, unavailable from LiDAR, justifies its inclusion despite the latency. Dynamic object detection (cars, bicycles and pedestrians) in this fusion serves as a backup for emergencies, and we currently do not track these objects.
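The sketch below illustrates the 2D-to-3D mapping described in this item under simplifying assumptions: the stereo point cloud is organized with the same height and width as the left image, so the pixels selected by an instance mask index directly into the point array. Array names and shapes are illustrative.

```python
import numpy as np

def instance_point_cloud(organized_cloud, mask):
    """Extract the 3D points belonging to one detected instance.

    organized_cloud: (H, W, 3) array of x, y, z per pixel (NaN where invalid),
                     row/column aligned with the left stereo image.
    mask:            (H, W) boolean instance mask from Mask R-CNN.
    Returns an (M, 3) array of valid 3D points for the instance.
    """
    points = organized_cloud[mask]          # pick the masked pixels
    valid = ~np.isnan(points).any(axis=1)   # drop pixels without depth
    return points[valid]

def instance_centroid(organized_cloud, mask):
    """A simple position estimate for the instance: the centroid of its points."""
    points = instance_point_cloud(organized_cloud, mask)
    return points.mean(axis=0) if len(points) else None

# Usage: centroid = instance_centroid(stereo_cloud, masks[i][0].numpy())
```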
• 3D detection in point clouds (dynamic objects): For 3D detection in LiDAR point clouds, we leverage the PointPillars algorithm [38]. This algorithm regresses the 3D bounding boxes of objects and their orientation relative to the ego vehicle. PointPillars demonstrates commendable performance in real-world autonomous driving scenarios. Its pillar representation retains valuable spatial information while maintaining computational efficiency, a critical factor for real-time applications. Moreover, it effectively leverages the strengths of LiDAR data, including its ability to handle occlusions and to perform reliably in diverse lighting conditions.
• Tracking: The perception stack (in point clouds) detailed in this section, which supplies inputs to the risk assessment module responsible for determining FSM graph states, consists of two integral components: (i) pose estimation (bounding boxes) and (ii) multi-object tracking (MOT) modules. Having a stable and precise state estimation, both for the ego vehicle and for dynamic objects in the surroundings, is important for transitioning across the states.
Regarding the multi-object tracking module, we based it on the one proposed by [39]. This approach, known as Simple Online and Realtime Tracking (SORT), divides the tracking task into three sub-tasks: detection, data association, and state estimation.
As detailed earlier, objects are detected in the LiDAR frame using PointPillars. This detection model differs from the ones adopted originally by the SORT tracker. The detected objects' bounding boxes are projected into the world frame using the ego vehicle's estimated pose, and their respective poses are matched in the data association step. At this point, we keep the SORT tracker's data association in 2D space, in which we compute the intersection over union (IoU) of the top-down projection of the bounding boxes. Our assumption is that two dynamic objects do not overlap in the X-Y plane.
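A minimal sketch of this association step is given below: tracked and detected boxes are projected onto the X-Y plane and matched by their intersection over union. The axis-aligned simplification (ignoring yaw) and the function and threshold names are illustrative assumptions.

```python
def top_down_iou(box_a, box_b):
    """IoU of two boxes projected onto the X-Y plane.

    Each box is (cx, cy, length, width); yaw is ignored for simplicity,
    i.e., boxes are treated as axis-aligned rectangles.
    """
    ax_min, ax_max = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay_min, ay_max = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx_min, bx_max = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by_min, by_max = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2

    inter_w = max(0.0, min(ax_max, bx_max) - max(ax_min, bx_min))
    inter_h = max(0.0, min(ay_max, by_max) - max(ay_min, by_min))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# Detections whose IoU against an existing track exceeds a threshold (e.g., 0.3)
# are assigned to that track; unmatched detections start new tracks.
```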
As for the state estimation, we use a Kalman filter-based approach, in which our goal is to estimate the 3D position of the bounding box center $(x, y, z)$, the 3D dimensions of the bounding box $(l, w, h)$, the yaw angle $\theta$, and the 3D velocities $(v_x, v_y, v_z)$ of the tracked object, in the International System of Units and in the global reference frame. This state space differs from that of the SORT paper, in which the state estimation is performed in pixel space. Also, notice that the vehicle pose estimation is important in this step, as we are tracking in the global reference frame.
As we represent this state vector as $\mathbf{x} = [x, y, z, l, w, h, \theta, v_x, v_y, v_z]^T$, we need to define both the state propagation and observation model matrices.
The state propagation we adopted assumes constant linear velocity between detections, so that we can propagate the position using the propagation matrix F from Equation (1):

$$F = \begin{bmatrix} I_{3\times 3} & 0_{3\times 4} & \Delta t\, I_{3\times 3} \\ 0_{4\times 3} & I_{4\times 4} & 0_{4\times 3} \\ 0_{3\times 3} & 0_{3\times 4} & I_{3\times 3} \end{bmatrix} \qquad (1)$$

where I represents the identity matrix, 0 the zero-filled matrix, and $\Delta t$ the period between predictions. The subscripts indicated with $M\times N$ represent the matrix number of rows M and columns N, respectively.
Regarding the observation model matrix, since we obtain the 3D position, dimensions, and yaw angle directly from our detection module, our observation matrix is defined by $H = \begin{bmatrix} I_{7\times 7} & 0_{7\times 3} \end{bmatrix}$.
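To make the constant-velocity model concrete, the sketch below builds the propagation and observation matrices for the 10-dimensional state described above and runs one Kalman predict/update cycle in plain NumPy; the covariance values are placeholder assumptions, not tuned parameters of our tracker.

```python
import numpy as np

def make_matrices(dt):
    """Propagation F (10x10) and observation H (7x10) for the state
    [x, y, z, l, w, h, yaw, vx, vy, vz] under constant velocity."""
    F = np.eye(10)
    F[0:3, 7:10] = dt * np.eye(3)                 # position advances by velocity * dt
    H = np.hstack([np.eye(7), np.zeros((7, 3))])  # we observe position, size, yaw
    return F, H

def kf_step(x, P, z, dt, q=0.1, r=0.5):
    """One predict/update cycle with isotropic process/measurement noise."""
    F, H = make_matrices(dt)
    Q = q * np.eye(10)                   # process noise (placeholder)
    R = r * np.eye(7)                    # measurement noise (placeholder)

    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q

    # Update with a detection z = [x, y, z, l, w, h, yaw].
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ y
    P = (np.eye(10) - K @ H) @ P
    return x, P
```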
The third branch in Figure 4 illustrates detection and tracking in 3D, with LiDAR serving as input to the 3D detector. In our case, the 3D detections from PointPillars are fed into the SORT algorithm, which we have modified for tracking. Ultimately, for visualization, each tracked object is assigned a distinct color.
• Prediction: Our system employs a prediction-based approach to ensure safe navigation by anticipating the movements of surrounding objects. It utilizes a simple motion model based on the object's current speed, as estimated by the tracking system; the model assumes constant velocity for each object, providing a first-order approximation of its trajectory.
The prediction formula for the linear motion model is defined by Equation (2):

$$p_{t+\Delta t} = p_t + v\,\Delta t \qquad (2)$$

where $p_t$ is the current pose, $v$ the current velocity of the surrounding object, and $\Delta t$ represents the time interval for prediction (5 seconds). Finally, $p_{t+\Delta t}$ is the predicted pose after the interval $\Delta t$.
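A small sketch of this constant-velocity prediction is given below; it samples the predicted positions of a tracked object over the 5-second horizon at a fixed step, with the step size being an illustrative choice.

```python
import numpy as np

def predict_trajectory(position, velocity, horizon=5.0, step=0.5):
    """Constant-velocity prediction: sample future positions.

    position: current position (array-like), velocity: current velocity
    from the tracker, horizon: prediction interval in seconds.
    """
    position = np.asarray(position, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    times = np.arange(step, horizon + step, step)
    return [position + velocity * t for t in times]

# Example: an object at (10, 2) m moving at 3 m/s along x.
future = predict_trajectory([10.0, 2.0], [3.0, 0.0])
print(future[-1])  # predicted position ~5 s ahead: [25.  2.]
```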
While more complex prediction models exist, opting for a simple linear model ensures computational efficiency. This linear motion model for future pose prediction demands fewer computational resources, making it suitable for real-time applications in autonomous driving.
3.4. Decision Making
The decision-making module utilizes a synchronous Moore Finite State Machine (FSM) to orchestrate actions based on inputs from the Collision Risk Assessment (CRA) module. The FSM employs a binary encoding scheme for its inputs, as shown in Table 2.
The FSM comprises four key states, each governing specific speed control behaviors:
• Drive State (S1): No obstacles impede the vehicle's progress. The target speed is set to a maximum of 8.8 m/s (31.68 km/h).
• Follow the Leader State (S2): The CRA reports an obstacle (static or dynamic) ahead of the ego vehicle, triggering dynamic speed adjustments. Speed is adjusted based on the distance and the time to collision (TTC), calculated using the ego vehicle's current speed and distance to the obstacle.
• Red Light State (S3) and Stop Sign State (S4): These states mirror the "Follow the Leader" logic, utilizing the TTC to achieve controlled stops at designated locations. The vehicle decelerates smoothly, ensuring compliance with traffic rules and safety.
Figure 6 depicts the state transition diagram, visually representing the FSM’s logic. The decision-making module employs a straightforward yet effective FSM structure for robust decision-making. Speed control strategies adapt dynamically to varying conditions, ensuring safe and efficient navigation. The module seamlessly integrates with other components of the autonomous driving system, including perception and control modules. The FSM operates synchronously at 10 Hz, aligning with sensory data capture rates.
The current approach uses the actual velocity of the ego vehicle to calculate the TTC, after which velocity adjustments are made to prevent collisions. This method is simple, fast, suitable for quick estimations, easy to implement, and computationally efficient. However, it assumes constant velocity and ignores potential future trajectory changes. The TTC formula is:

$$\mathrm{TTC} = \frac{d}{v_{ego}}$$

where $d$ is the distance to the obstacle ahead and $v_{ego}$ is the ego vehicle's current speed.
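The sketch below ties the FSM states to this TTC-based speed control in a simplified form: the state names follow the four states above, while the input flags, the deceleration rule, and the thresholds are illustrative assumptions (the actual binary input encoding is given in Table 2).

```python
from enum import Enum

class State(Enum):
    DRIVE = "S1"
    FOLLOW_LEADER = "S2"
    RED_LIGHT = "S3"
    STOP_SIGN = "S4"

MAX_SPEED = 8.8  # m/s (31.68 km/h)

def next_state(obstacle_ahead, red_light_ahead, stop_sign_ahead):
    """Pick the FSM state from CRA flags (illustrative priority order)."""
    if red_light_ahead:
        return State.RED_LIGHT
    if stop_sign_ahead:
        return State.STOP_SIGN
    if obstacle_ahead:
        return State.FOLLOW_LEADER
    return State.DRIVE

def target_speed(state, distance, ego_speed, min_ttc=3.0):
    """TTC-based speed command: slow down when TTC drops below `min_ttc` seconds."""
    if state is State.DRIVE:
        return MAX_SPEED
    ttc = distance / ego_speed if ego_speed > 0 else float("inf")
    if ttc >= min_ttc:
        return min(MAX_SPEED, ego_speed)      # keep the current pace
    return max(0.0, distance / min_ttc)       # decelerate toward a safe stop

# Example: leader 12 m ahead while driving at 8 m/s -> TTC = 1.5 s, slow down.
state = next_state(obstacle_ahead=True, red_light_ahead=False, stop_sign_ahead=False)
print(state, round(target_speed(state, distance=12.0, ego_speed=8.0), 2))
```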
3.6. Localization
In order to perform the ego-vehicle pose estimation, we fused the relative transforms obtained from an odometry source (in our case, visual-inertial odometry, VIO), the IMU orientation, and the GNSS position using an Extended Kalman Filter (EKF) approach, as shown in Figure 7. The inputs of our stack are the camera image, the IMU orientation, the GNSS coordinates, and the sensor calibration (the relative transformations between the sensors' reference frames). The outputs of the pose estimation stack are the estimated pose and its uncertainty. The VIO is responsible for estimating $T^{k}_{k-1}$, the pose transformation matrix of the current camera frame w.r.t. the previous one, and the corresponding estimation uncertainty covariance matrix, $\Sigma_{VIO}$.
While the GNSS is responsible for providing the global geographic coordinates, the IMU provides the linear acceleration, angular velocity, and 3D orientation at a higher frequency. We then synchronize the 3D orientation and the global coordinates in order to provide $T^{W}_{IMU}$, the transformation matrix of the IMU frame w.r.t. the world frame, and its uncertainty, $\Sigma_{GNSS}$. In our case, the IMU and the GNSS are represented by the same reference frame, but we keep both in the diagram for the sake of clarity. Also, the geographic coordinates provided are converted to a plane projection coordinate system.
The input poses, $T^{k}_{k-1}$ and $T^{W}_{IMU}$, and the sensor calibration are then provided to the EKF and converted internally to a common reference frame. The goal is to estimate the 6DoF pose of the agent frame w.r.t. the world frame, $\mathbf{x} = [x, y, z, q_x, q_y, q_z, q_w]^T$, where $x$, $y$, and $z$ are the global coordinates (easting, northing, and altitude, respectively) and $q_x$, $q_y$, $q_z$, $q_w$ are the four components of the quaternion that represents our agent's orientation. For each relative pose received, $T^{k}_{k-1}$, the EKF performs a system prediction, which implies accumulating drift until a global pose, $T^{W}_{IMU}$, is received and the state update is performed.
We emphasize that this pose estimation module is also modular, so that the back end (in this case, the EKF), the methods used for estimating the relative transforms, and the source of the global pose estimation do not need to be the same as the ones we used in this project.
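The sketch below illustrates this predict/update pattern in a deliberately simplified 2D form (position and heading only, with additive Gaussian noise on each component): relative VIO transforms drive the prediction and accumulate uncertainty, while a global GNSS/IMU pose triggers the update. It is an assumption-laden illustration of the fusion scheme, not our GTSAM-based EKF, which operates on full 3D poses.

```python
import numpy as np

class PoseFusion2D:
    """Toy EKF-style fusion: state [x, y, yaw] in the world frame."""

    def __init__(self):
        self.x = np.zeros(3)
        self.P = np.eye(3) * 1e-3

    def predict(self, dx, dy, dyaw, q=0.05):
        """Compose a relative (VIO) motion; drift accumulates in P."""
        c, s = np.cos(self.x[2]), np.sin(self.x[2])
        self.x[0] += c * dx - s * dy
        self.x[1] += s * dx + c * dy
        self.x[2] += dyaw
        self.P += q * np.eye(3)          # simplistic process-noise growth

    def update(self, z, r=0.5):
        """Correct with a global pose measurement z = [x, y, yaw]."""
        H = np.eye(3)                    # the full (reduced) state is measured
        R = r * np.eye(3)
        y = np.asarray(z) - self.x
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(3) - K @ H) @ self.P

# Usage: fusion = PoseFusion2D(); fusion.predict(0.5, 0.0, 0.01); fusion.update([0.48, 0.02, 0.0])
```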
In practice, for estimating the relative pose transformations with VIO, we used the RTabMap ROS implementation 7. RTabMap is known for its estimation robustness, and its full functionality is widely used in SLAM applications. As for the EKF implementation, we used the GTSAM implementation 8. While GTSAM is best known for its factor-graph solutions, it also provides a very convenient interface for representing pose transformations and implements a 3D pose Extended Kalman Filter off-the-shelf. Finally, our localization stack is open source and available online 9.