Submitted: 26 October 2023
Posted: 27 October 2023
Abstract
Keywords:
1. Introduction
1.1. Related Work
1.1.1. Data Augmentation
1.1.2. Dynamic Graph CNN
1.1.3. Differences from the PointPillars Model
1.2. Contributions
- We introduce AGPNet, a network framework that integrates Data Augmentation and Dynamic Graph CNN modules with PointPillars. The approach increases the diversity of the point cloud data and converts it into pseudo-image data, which notably accelerates detection;
- The proposed model addresses the inadequate detection accuracy on occluded or small objects, a common limitation of the widely used PointPillars model, while balancing detection speed and accuracy;
- Experimental results on the KITTI dataset show that our method achieves a detection accuracy approximately 6-7% higher than that of the PointPillars model at a comparable detection speed, so that it meets the accuracy and speed requirements of a real-world autonomous driving environment.
2. Proposed 3D Object Detection
- The Data Augmentation module augments point cloud data diversity, facilitating the acquisition of more varied point cloud features;
- The Dynamic Graph CNN module expands the feature dimension of each point so that it encodes both the point's three-dimensional coordinates and its geometric relationships with nearby points;
- The Pillar Feature Net module voxelizes the dimension-expanded point cloud data and processes it into pseudo-image information that represents the point cloud;
- The Detection Head (SSD) module processes the two-dimensional pseudo-image features and generates three-dimensional detection results.
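The dimension expansion performed by the Dynamic Graph CNN module can be illustrated with a small sketch. It is only an approximation under stated assumptions: neighbours are found by brute-force k-nearest-neighbour search, the relative offsets to those neighbours are max-pooled, and the learned layers of a real EdgeConv block are omitted. The function names `knn_indices` and `expand_point_features` and the choice of `k` are hypothetical and not taken from the paper.

```python
import math


def knn_indices(points, k):
    """Return, for every point, the indices of its k nearest neighbours.

    Brute-force search; assumed here only for illustration.
    """
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def expand_point_features(points, k=2):
    """Expand each (x, y, z) point to six values.

    Output per point: (x, y, z, max dx, max dy, max dz), where the offsets
    are taken towards the k nearest neighbours. A real Dynamic Graph CNN
    layer would pass the offsets through a learned MLP before pooling;
    that step is left out of this sketch.
    """
    expanded = []
    for p, nbrs in zip(points, knn_indices(points, k)):
        offsets = [[points[j][d] - p[d] for d in range(3)] for j in nbrs]
        pooled = [max(o[d] for o in offsets) for d in range(3)]
        expanded.append(tuple(p) + tuple(pooled))
    return expanded


cloud = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (5, 5, 5)]
features = expand_point_features(cloud, k=2)
```

Each output row keeps the original coordinates and adds a summary of the local geometry, which is the kind of per-point enrichment the module description refers to.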
2.1. Data Augmentation
- Regional Ensemble Point Cloud Pruning: With a given probability, the point cloud data in one of the six regions is randomly removed, as depicted by the blue section in the figure. This simulates the point cloud occlusion encountered in real-world environments.
- Inter-Region Point Cloud Exchange: With a specified probability, one of the six regional point cloud sets in the current frame is randomly selected, and its point cloud data is exchanged with the region on the same side of another object of the same category, as illustrated by the yellow and green portions in the figure. This enriches the diversity captured among objects of the same category.
- Subsequent Sparsification: The point cloud data involved in the second step is downsampled with a voxel-based method: within each non-empty voxel, one point is randomly selected as the representative of that voxel, as seen in the yellow part of the point cloud in the figure. This simulates the variation in point cloud density in real environments.
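The three augmentation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that object points are given in box-normalised local coordinates, and that the six regions are the pyramids spanned by the six box faces (a point belongs to the face along whose axis its coordinate is largest in magnitude). The function names, the probabilities `p_prune` and `p_swap`, and `voxel_size` are hypothetical.

```python
import random
from collections import defaultdict


def region_of(p):
    """Index 0..5 of the face pyramid containing p (assumed partition)."""
    axis = max(range(3), key=lambda d: abs(p[d]))
    return 2 * axis + (1 if p[axis] >= 0 else 0)


def prune_region(points, region):
    """Step 1: drop every point of one region to mimic occlusion."""
    return [p for p in points if region_of(p) != region]


def exchange_region(points_a, points_b, region):
    """Step 2: swap one region's points between two same-category objects."""
    keep_a = [p for p in points_a if region_of(p) != region]
    keep_b = [p for p in points_b if region_of(p) != region]
    move_a = [p for p in points_a if region_of(p) == region]
    move_b = [p for p in points_b if region_of(p) == region]
    return keep_a + move_b, keep_b + move_a


def sparsify(points, voxel_size, rng):
    """Step 3: keep one randomly chosen point per non-empty voxel."""
    voxels = defaultdict(list)
    for p in points:
        voxels[tuple(int(c // voxel_size) for c in p)].append(p)
    return [rng.choice(bucket) for bucket in voxels.values()]


def augment(points_a, points_b, rng, p_prune=0.5, p_swap=0.5, voxel_size=0.25):
    """Apply the three steps to object A, using object B as exchange partner."""
    if rng.random() < p_prune:
        points_a = prune_region(points_a, rng.randrange(6))
    if rng.random() < p_swap:
        region = rng.randrange(6)
        points_a, points_b = exchange_region(points_a, points_b, region)
        swapped = [p for p in points_a if region_of(p) == region]
        rest = [p for p in points_a if region_of(p) != region]
        # Only the exchanged region is sparsified, following the third step.
        points_a = rest + sparsify(swapped, voxel_size, rng)
    return points_a


obj_a = [(0.9, 0.1, 0.0), (-0.9, 0.1, 0.0), (0.0, 0.2, -0.8)]
obj_b = [(0.8, 0.0, 0.1), (0.0, -0.7, 0.1)]
augmented = augment(obj_a, obj_b, random.Random(0))
```

Keeping each step as a separate function makes it easy to vary the probabilities independently or to test a single step in isolation.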
2.2. Dynamic Graph CNN
2.3. Pillar Feature Net
- First, we discretize the point cloud into an evenly spaced grid on the x-y plane, generating a set of columns (pillars), denoted as P, which act as voxels with unlimited spatial extent along the z-axis.
- Second, we construct a dense tensor of size (D, P, N) by limiting the number of non-empty pillars per sample, P, and the number of points per pillar, N, where D is the feature dimension of each point. If a sample or pillar holds too much data to fit in this tensor, the data is randomly sampled; if it holds too little, zero padding is applied.
- Third, we apply a simplified version of PointNet to produce a tensor of size (C, P, N), followed by a max operation over the points of each pillar to create an output tensor of size (C, P).
- Fourth, we scatter these features back to their original pillar locations, generating a pseudo-image of size (C, H, W), where H and W are the height and width of the canvas.
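The four steps above can be sketched in plain Python. This is an illustration under assumptions, not the paper's code: the learned PointNet layer is replaced by a plain channel-wise max (so the channel count equals the input feature dimension), the tensor is stored as nested lists in pillar-major order, and the names `build_pillars`, `pillar_features`, `scatter`, `cell`, `max_pillars` and `max_points` are hypothetical. The input point list is assumed to be non-empty.

```python
import random
from collections import defaultdict


def build_pillars(points, x_range, y_range, cell, max_pillars, max_points, rng):
    """Group points into x-y grid cells, then sample or zero-pad each pillar."""
    buckets = defaultdict(list)
    for p in points:
        x, y = p[0], p[1]
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            col = int((x - x_range[0]) // cell)
            row = int((y - y_range[0]) // cell)
            buckets[(row, col)].append(list(p))
    coords = list(buckets)
    if len(coords) > max_pillars:          # too many non-empty pillars
        coords = rng.sample(coords, max_pillars)
    dim = len(points[0])
    pillars = []
    for rc in coords:
        pts = buckets[rc]
        if len(pts) > max_points:          # too many points in this pillar
            pts = rng.sample(pts, max_points)
        padding = [[0.0] * dim for _ in range(max_points - len(pts))]
        pillars.append(pts + padding)
    return coords, pillars                 # pillars has layout P x N x D


def pillar_features(pillars):
    """Stand-in for the simplified PointNet: max over the points of a pillar.

    Zero-padded rows take part in the max, as they would in a padded tensor.
    """
    return [
        [max(pt[d] for pt in pts) for d in range(len(pts[0]))]
        for pts in pillars
    ]


def scatter(coords, features, height, width):
    """Write each pillar feature back to its grid cell: C x H x W canvas."""
    channels = len(features[0])
    canvas = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for (row, col), feat in zip(coords, features):
        for c, value in enumerate(feat):
            canvas[c][row][col] = value
    return canvas


cloud = [(0.1, 0.1, 1.0), (0.2, 0.3, 2.0), (1.5, 0.2, 0.5), (9.0, 9.0, 9.0)]
coords, pillars = build_pillars(
    cloud, (0.0, 2.0), (0.0, 2.0), cell=1.0,
    max_pillars=4, max_points=2, rng=random.Random(0),
)
image = scatter(coords, pillar_features(pillars), height=2, width=2)
```

Cells without any point stay at zero in the canvas, which is what lets a 2D convolutional backbone treat the result as an ordinary image.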
2.4. Detection Head (SSD)
2.5. Loss Function
2.5.1. Object frame regression loss function
- (1) the real (ground-truth) object frame parameters;
- (2) the object frame parameters of the positive samples in the prediction;
- (3) the set of positive-sample object frames in the prediction, together with its total count;
- (4) the set of negative-sample object frames in the prediction, together with its total count.
2.5.2. Classification loss function
2.5.3. Overall loss function
3. Evaluations
3.1. Implementation Details
3.1.1. Experiment platform
3.1.2. Experimental data

3.1.3. Evaluation index
3.2. Comparison with state-of-the-art methods
3.3. Multi-scene continuous frame object detection experiment
3.4. Ablation experiment
4. Conclusions






| Difficulty | Minimum bounding box height | Occlusion level | Maximum truncation |
|---|---|---|---|
| Simple | 40 px | 0 (fully visible) | 15% |
| Medium | 25 px | 1 (partial occlusion) | 30% |
| Difficult | 25 px | 2 (hard to see) | 50% |
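
The difficulty thresholds in the table above can be expressed as a small lookup. The sketch assumes the levels are cumulative, so that an object is assigned the easiest level whose limits it satisfies, and that objects outside all three levels are left unlabelled. The function name `difficulty_level` is hypothetical.

```python
# (name, minimum box height in px, maximum occlusion level, maximum truncation)
LEVELS = [
    ("Simple", 40, 0, 0.15),
    ("Medium", 25, 1, 0.30),
    ("Difficult", 25, 2, 0.50),
]


def difficulty_level(height_px, occlusion, truncation):
    """Return the easiest level an object qualifies for, or None."""
    for name, min_height, max_occlusion, max_truncation in LEVELS:
        if (
            height_px >= min_height
            and occlusion <= max_occlusion
            and truncation <= max_truncation
        ):
            return name
    return None


example = difficulty_level(height_px=30, occlusion=1, truncation=0.2)
```
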

| Method | Model | Time/s | Car (Simple) | Car (Medium) | Car (Difficult) | Bicycle (Simple) | Bicycle (Medium) | Bicycle (Difficult) | Pedestrian (Simple) | Pedestrian (Medium) | Pedestrian (Difficult) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Voxel | PointPillars | 0.21 | 89.45 | 82.26 | 80.86 | 66.99 | 53.61 | 49.97 | 51.90 | 45.26 | 41.55 |
| Voxel | SECOND | 0.31 | 88.77 | 82.40 | 78.91 | 76.36 | 63.56 | 60.07 | 49.78 | 45.83 | 42.71 |
| Voxel | AGPNet | 0.24 | 90.93 | 86.75 | 84.78 | 77.99 | 59.86 | 55.97 | 59.14 | 52.51 | 47.40 |
| Point | PointRCNN | 0.45 | 89.91 | 86.49 | 80.11 | 88.20 | 72.80 | 66.97 | 64.22 | 55.47 | 48.35 |

| Method | Model | Time/s | Car (Simple) | Car (Medium) | Car (Difficult) | Bicycle (Simple) | Bicycle (Medium) | Bicycle (Difficult) | Pedestrian (Simple) | Pedestrian (Medium) | Pedestrian (Difficult) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Voxel | PointPillars | 0.21 | 80.62 | 67.88 | 65.21 | 61.95 | 48.70 | 45.55 | 47.49 | 40.77 | 36.73 |
| Voxel | SECOND | 0.31 | 84.68 | 71.89 | 68.31 | 71.60 | 59.41 | 57.62 | 41.27 | 36.28 | 32.92 |
| Voxel | AGPNet | 0.24 | 84.94 | 75.51 | 72.76 | 73.60 | 55.57 | 51.54 | 54.64 | 47.72 | 42.72 |
| Point | PointRCNN | 0.45 | 89.56 | 78.24 | 75.79 | 86.80 | 69.99 | 64.99 | 61.47 | 53.61 | 45.94 |

| Scene | Data | Time/s | Frames | Object category: Car | Object category: Bicycle | Object category: Pedestrian |
|---|---|---|---|---|---|---|
| City | 2011_09_26_drive_0014 | 32 | 320 | 32 | 4 | 5 |
| City | 2011_09_26_drive_0056 | 30 | 300 | 30 | 1 | 2 |
| Apartment | 2011_09_26_drive_0035 | 13 | 137 | 13 | 1 | 2 |
| Apartment | 2011_09_26_drive_0039 | 40 | 401 | 40 | 1 | 2 |
| Highway | 2011_09_26_drive_0032 | 39 | 396 | 39 | 0 | 0 |
| Highway | 2011_09_26_drive_0070 | 42 | 426 | 42 | 2 | 2 |
| Campus | 2011_09_26_drive_0038 | 11 | 116 | 11 | _ | _ |
| Campus | 2011_09_26_drive_0043 | 15 | 151 | 15 | _ | _ |

| Scene | Frames | Model | Time/s | Precision | Recall |
|---|---|---|---|---|---|
| City | 620 | PointPillars | 0.18 | 62.4% | 79.1% |
| City | 620 | AGPNet | 0.21 | 76.2% | 87.2% |
| Apartment | 538 | PointPillars | 0.20 | 67.5% | 71.8% |
| Apartment | 538 | AGPNet | 0.22 | 88.3% | 89.3% |
| Highway | 822 | PointPillars | 0.19 | 76.2% | 89.6% |
| Highway | 822 | AGPNet | 0.21 | 87.2% | 92.9% |
| Campus | 267 | PointPillars | 0.18 | 75.1% | 82.4% |
| Campus | 267 | AGPNet | 0.21 | 88.1% | 92.7% |
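
For reference, the precision and recall columns above can be computed from detection counts as in the sketch below. It assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)); how detections are matched to ground-truth objects is not covered here, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from detection counts.

    Both values fall back to 0.0 when their denominator is zero.
    """
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


scores = precision_recall(true_pos=3, false_pos=1, false_neg=0)
```
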

| Model | Time/s | Car (Simple) | Car (Medium) | Car (Difficult) | Bicycle (Simple) | Bicycle (Medium) | Bicycle (Difficult) | Pedestrian (Simple) | Pedestrian (Medium) | Pedestrian (Difficult) |
|---|---|---|---|---|---|---|---|---|---|---|
| PointPillars | 0.21 | 89.45 | 82.26 | 80.86 | 66.99 | 53.61 | 49.97 | 51.90 | 45.26 | 41.55 |
| GNN-PointPillars | 0.24 | 91.45 | 85.01 | 82.30 | 76.36 | 63.56 | 60.07 | 49.78 | 45.83 | 42.71 |
| AGPNet | 0.24 | 90.93 | 86.75 | 84.78 | 88.20 | 72.80 | 66.97 | 64.22 | 55.47 | 48.35 |

| Model | Time/s | Car (Simple) | Car (Medium) | Car (Difficult) | Bicycle (Simple) | Bicycle (Medium) | Bicycle (Difficult) | Pedestrian (Simple) | Pedestrian (Medium) | Pedestrian (Difficult) |
|---|---|---|---|---|---|---|---|---|---|---|
| PointPillars | 0.21 | 80.62 | 67.88 | 65.21 | 61.95 | 48.70 | 45.55 | 47.49 | 40.77 | 36.73 |
| GNN-PointPillars | 0.24 | 83.74 | 71.91 | 67.37 | 73.60 | 59.41 | 57.62 | 41.27 | 36.28 | 32.92 |
| AGPNet | 0.24 | 84.94 | 75.51 | 72.76 | 86.80 | 69.99 | 64.99 | 61.47 | 53.61 | 45.94 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).