Deep Learning Methods Applied to 3D Point Clouds Based Instance Segmentation: A Review

Abstract: Beyond semantic segmentation, 3D instance segmentation (a process that delineates objects of interest while also classifying them into a set of categories) is gaining more and more interest among researchers, since numerous computer vision applications need accurate segmentation processes (autonomous driving, indoor navigation, and even virtual or augmented reality systems...). This paper gives an overview and a technical comparison of the existing deep learning architectures for handling unstructured Euclidean data in the rapidly developing field of 3D instance segmentation. First, the authors divide the 3D point cloud based instance segmentation techniques into two major categories: proposal based methods and proposal free methods. Then, they introduce and compare the most used datasets with regard to 3D instance segmentation. Furthermore, they compare and analyze the performance of these techniques (speed, accuracy, response to noise...). Finally, this paper reviews possible future directions of deep learning for 3D sensor-based information and provides insight into the most promising areas for prospective research.


Introduction
Instance segmentation is not an isolated domain of object detection but rather a natural evolution from semantic segmentation towards finer refinement. The idea of instance segmentation emanates from classification, which is essentially the process of predicting a label for a whole input, classifying each object, or alternatively naming a ranked list of the items in the input [1]. Localization (detection) is the subsequent process of fine-grained regression, yielding the classes and supplementary data such as those classes' spatial positions.
The rapid expansion of affordable 3D sensors of diverse types, including 3D scanners, RGB-D cameras (Kinect, Apple depth cameras...) as well as LiDARs [2], has contributed immensely to the development of computer vision, where semantic and instance segmentation are key components. Data acquired by these 3D sensors have the advantage of providing size information and rich geometric profiles ([3], [4]). 3D data can better describe the surroundings for robots, for instance. Autonomous driving, medical treatment, remote sensing and robotics, among others, are good application areas for 3D data [5].
3D instance segmentation is also closely connected to the tasks of 3D semantic segmentation and 3D object detection. 3D semantic segmentation predicts semantic labels for 3D points, but it does not separate distinct instances. Conversely, 3D object detection predicts the 3D bounding box of each particular object, but is unable to produce a detailed mask of the 3D target object. Hence, 3D instance segmentation can safely be seen as an integrated task combining 3D object detection and semantic segmentation.
In general, deep learning on 3D point clouds encounters numerous considerable difficulties [5], such as the unstructured disposition of 3D point clouds and undersized datasets. In essence, this paper focuses on the study of 3D point cloud based deep learning methods for instance segmentation. This review elaborates on the most used datasets for 3D segmentation, such as NYUv2 [6], ShapeNet [7] and ScanNet [8], to name a few. In fact, these datasets have contributed immensely to improving research on 3D point cloud deep learning, because they ushered in several new techniques dedicated to solving point cloud processing challenges, including 3D segmentation and 3D object detection.
The simplest representation of 3D data is a point cloud [6], which is a set of 3D points, each with three coordinates in a coordinate system (Cartesian or otherwise) [7]. Generally, point clouds only possess implicit neighborhood relations (unstructured data), contrary to 3D images, which are structurally stored with clear neighborhood relationships [8]. Among the few available review papers on deep learning on 3D data ([1], [8], [9], [10]), this paper is the first to specifically center on the cross-analysis of deep learning methods for point clouds covering only 3D instance segmentation.
Compared to the existing literature, the major contributions of this paper are as follows: 1) To the best of the authors' knowledge, this paper is the first to cover exclusively and expansively the very important task of 3D point cloud instance segmentation.

3) This paper covers the most recent and advanced progress of deep learning for 3D instance segmentation on point clouds and the impact on performance of data noise, scene variations, loss function choices, hardware used... Therefore, it provides the reader with an all-round performance comparison of the state-of-the-art 3D instance segmentation methods.
The remainder of this paper is organized as follows. Firstly, section 2 introduces the 3D point cloud segmentation methods, where the common deep learning network architectures are explained. Next, section 3 describes existing benchmark datasets, their focus and their contents. Section 4 reviews and compares existing methods. Section 5 presents possible future research directions. Finally, section 6 summarizes the paper and draws conclusions.

PointNet
Convolutional Neural Networks make it possible to learn local regions using a hierarchical approach as the network grows deeper. Yet, convolution demands a structured grid that point cloud data normally does not have. The inaugural framework that applied deep learning directly to unstructured point clouds is PointNet [11]. PointNet relies on two fundamental building blocks: a symmetric max-pooling function and a multilayer perceptron (MLP) (Figure 1). Following the PointNet method, scores of techniques were introduced that consider the local structure of point clouds.
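The core PointNet idea can be illustrated with a minimal sketch (not the published architecture; the layer sizes and random weights below are hypothetical): a shared per-point MLP followed by a symmetric max-pooling function, which makes the global feature invariant to the ordering of the input points.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shared-MLP weights: 3 -> 64 -> 128 feature dimensions.
W1, b1 = rng.standard_normal((3, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 128)), np.zeros(128)

def pointnet_global_feature(points):
    """points: (N, 3) array -> (128,) permutation-invariant global feature."""
    h = np.maximum(points @ W1 + b1, 0.0)   # per-point MLP layer 1 (ReLU)
    h = np.maximum(h @ W2 + b2, 0.0)        # per-point MLP layer 2 (ReLU)
    return h.max(axis=0)                    # symmetric function: max pool over points

cloud = rng.standard_normal((1024, 3))
shuffled = cloud[rng.permutation(1024)]
# Reordering the points does not change the global feature.
assert np.allclose(pointnet_global_feature(cloud), pointnet_global_feature(shuffled))
```

It is this symmetry of the pooling step that lets PointNet consume raw, unordered point sets directly.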

Other methods for raw point clouds processing
Many state-of-the-art approaches are able to consider the local structure of point clouds in a hierarchical fashion [12]. Since 3D instance segmentation is a natural evolution from semantic segmentation, this paper will endeavor to give a quick overview of the latter. In essence, 3D semantic segmentation methods can be divided into two broad categories: point based and projection based (Table 1 (a and b)). As stated before, 3D instance segmentation is a natural evolution from semantic segmentation, as it aims to delineate points in the 3D scene. Table 1 a. Nomenclature of 3D semantic segmentation methods for projection based

As for grouping, once the representative points are sampled, a k-nearest neighbor algorithm selects the points nearest to each representative point in order to form a local neighborhood. After sampling and grouping, the last step consists of mapping each neighborhood into a feature vector that provides the sought-for structure.
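The sampling-and-grouping step can be sketched as follows (an illustrative toy example with hypothetical sizes; real pipelines usually use farthest point sampling rather than the random choice shown here):

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.standard_normal((1024, 3))   # toy point cloud

def sample_and_group(points, n_samples=64, k=16):
    # 1) sampling: pick representative points (FPS is common in practice)
    idx = rng.choice(len(points), n_samples, replace=False)
    centers = points[idx]
    # 2) grouping: distances from each center to all points, then k nearest
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]
    # each row is one local neighborhood ready for feature extraction
    return points[knn]                     # shape (n_samples, k, 3)

groups = sample_and_group(points)
print(groups.shape)  # (64, 16, 3)
```

Each (k, 3) neighborhood is what subsequently gets mapped to a single feature vector.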
Unlike PointNet, there are numerous methods in the literature that consider learning features for each point. One instance is PointNet++ [27], which completed PointNet by applying it to local regions. Other methods are VoxelNet, Self-Organizing Map and Pointwise Convolution (Figure 2). Moreover, graph based approaches (Figure 3) were proposed that represent the point cloud using a graph structure where each point is considered a node. In a graph structure, the relationships between points are represented using edges and nodes. The authors of [63] propose the Dynamic Graph CNN (DGCNN) method, with a clear disparity from the standard graph network: the features computed in the prior layer are the basis of an edge updating mechanism which takes place after each EdgeConv layer.

3D point clouds methods for instance segmentation
Generally, 3D instance segmentation techniques can be divided into two main categories (Table 2): region proposal based methods, centered on the generation of a bounding box, and region proposal free methods, focusing on deep learning frameworks. These categories are the focus of this paper, which will analyze, compare and study them.

Region proposal based methods
This approach is generally divided into two main steps: object proposal (bounding box proposal, object proposal) and the refinement task (mask prediction...). Yang et al. [64] followed this two-step solution and proposed a structure for 3D instance segmentation known as 3D-BoNet, where the system simultaneously regresses 3D bounding boxes and predicts point-level masks for all objects in the scene without the need for post-processing tasks such as clustering, feature sampling, non-maximum suppression or voting. However, 3D-BoNet also has some restrictions, such as an inability to learn the weights of diverse input point clouds and the absence of advanced feature fusion components to improve both semantic and instance segmentation at the same time.
Yi et al. set up the Generative Shape Proposal Network (GSPN) [65] as an algorithm for point cloud instance segmentation that leverages an innovative structure named Region-based PointNet (R-PointNet) to allow both flexible proposal refinement and instance segmentation generation. A Point RoIAlign layer is introduced to aggregate features and to allow the network to process the proposals. For the most part, the success of GSPN stems from its reliance on geometric shape in object proposal, with the advantage of removing proposals with low objectness. With PanopticFusion [66], Narita et al. used a voxel based semantic technique at the level of stuff and things that densely predicts object classes for background areas and individually segments arbitrary foreground objects, in order to reconstruct a large scene using a volumetric map representation. However, the network lacks global stability against long-term pose drift and incurs major mapping throughput costs in dynamic environments. Zhang et al. [67] introduced LIDARSeg, a network for 3D instance segmentation on outdoor LiDAR point clouds aimed at the segmentation and localization of small objects, while it also remains an adequate answer to single-shot instance prediction and to severe class imbalances. Basically, this method learns a feature representation by using self-attention blocks on point clouds.
Hou et al. [68] established a neural network structure called 3D-SIS for instance semantic segmentation of RGB-D scans. Here, the major contribution is the ability to learn simultaneously from both geometric and RGB input, hence allowing precise instance estimations. The network performs bounding box regression followed by instance (mask) segmentation for all objects. The network is fully convolutional, meaning it can run well in a single shot on sizeable 3D environments. Liu et al. [69] present a novel method called Gaussian Instance Center Network (GICN), which approximates the distributions of instance centers spread over the entire scene as Gaussian center heatmaps. Instance center size predictions, bounding box generation and the final instance masks are obtained as a result of the predicted heatmaps. Possible improvements would consist of enhancing accuracy in finding centers for the challenging semantic classes and using metric learning to learn feature embeddings.

Region proposal free methods
Proposal-free methods group points into instances using a system of similarity metrics, while methods relying on segmentation predict semantic labels to gather points into object instances. These techniques learn 3D instance segmentation using a deep learning framework. Wang et al. [70] brought forth the Similarity Group Proposal Network (SGPN), which learns a similarity matrix to articulate group proposals of object instances. The structure first uses PointNet and PointNet++ in order to extract significant characteristics from every point of the specific point cloud. Here, points from the identical object are considered to carry similar features, which is used to create a similarity matrix. The problem is that the size of the similarity matrix increases quadratically, making it impossible to process very large scenes due to memory constraints. Meanwhile, Liu and Furukawa [71] presented MASC, a novel, straightforward and efficient process to learn the resemblance between points so as to group them into instances. The network adopts sparse convolution and proposes a clustering algorithm based on learned multi-scale affinities to tackle the 3D instance segmentation problem. But the network is slow, since the clustering algorithm is implemented sequentially, even if it is parallel in theory. Liang et al. [72] suggested a 3D CNN called the submanifold sparse convolutional network, which simultaneously produces semantic predictions and instance embeddings. Here, a loss function takes into account both the embedding and structure data, where a graph convolutional neural network uses an attention-based k-nearest neighbor (KNN) scheme to allocate diverse weights to various neighbors. Lahoud et al. [73] suggested a multi-task learning problem (MTML), a 3D instance segmentation process that centers on volumetric scenes from either multi-view stereo or depth sensors.
In MTML, voxel-based scenes use metric learning to approximate the scene object centers directional information to achieve the segmentation results.
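The quadratic memory growth that limits SGPN-style similarity matrices can be seen in a small sketch (the point count and feature dimension are hypothetical): the pairwise distance matrix is N x N, so doubling the number of points quadruples the memory needed.

```python
import numpy as np

def similarity_matrix(features):
    """features: (N, D) per-point embeddings -> (N, N) pairwise L2 distances."""
    sq = (features ** 2).sum(axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped against rounding error
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    return np.sqrt(np.maximum(d2, 0.0))

feats = np.random.default_rng(2).standard_normal((2048, 32))
S = similarity_matrix(feats)
print(S.shape)            # (2048, 2048)
print(S.nbytes / 2**20)   # 32.0 MiB for just 2048 points
```

At realistic scene sizes (hundreds of thousands of points) the full matrix no longer fits in memory, which is exactly the constraint noted for SGPN above.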
Hu et al. [74] introduced a method where segmentation is coupled with labeling. In fact, the method first segments the point clouds into surface patches and clusters them into patch groups using unsupervised learning. The method thus avoids parameter sensitivity by producing both semantic labeling results and robust patch segmentation. Though the method manages occlusions very well, its performance is affected in extremely occluded scenes, which could be resolved by generalizing the patch segmentation.
Shao et al. [75] proposed the ClusterNet model for 3D instance segmentation of objects in RGB-D images. The task is presented as a regression problem followed by clustering. Specifically, the model makes pixel-wise predictions followed by sequential clustering in the feature space to deduce the object instances. To overcome its shortcomings, an improved version would assimilate that approach with semantic segmentation. Jiang et al. [76] introduced an innovative end-to-end bottom-up structure called PointGroup, with three key components (clustering, a backbone network and ScoreNet), that specially centers on improved grouping by investigating the empty space among objects. The network has a two-branch architecture (point feature extraction and semantic label prediction) and effectively pushes every point towards its own object centroid. However, this model requires a refinement stage to alleviate the semantic inaccuracy predicament that affects the instance grouping, and should investigate the prospect of supervision techniques to enhance its performance.
To implement instance segmentation over complete 3D scans, Elich et al. [77] brought up 3D-BEVIS, a 2D-3D bird's eye view framework which learns globally consistent instance features with a U-shaped fully convolutional network. The point cloud features are then jointly propagated by a graph neural network. However, the technique struggles to represent heavily occluded objects in its 2D representations. A possible future solution could be to consider different 2D representations to rise above the bird's-eye view restrictions. Engelmann et al. [78] propose 3D-MPA, an object-centric 3D instance segmentation technique that creates instance proposals based on a graph convolutional network to enable better interactions among neighboring proposals. The final object instances are obtained by aggregating several proposals rather than pruning them with non-maximum suppression. Even though multi-proposal aggregation shows good promise for object detection, there still remain challenges, such as the possibility of combining tracking and detection in semi-dynamic sequences, for instance.
On the other side, Pham et al. [79] present the Multi-Task Point-wise Network (MT-PNet), which concurrently addresses the challenges of instance and semantic segmentation in 3D point clouds. It develops a multi-task point-wise network (predicting semantic classes and embedding points of the same object instance close together) followed by a conditional random field model to integrate the semantic and object instance labels. Wang et al. [80] introduced a method called Associatively Segmenting Instances and Semantics (ASIS), which makes semantic segmentation benefit its instance-level counterpart through learning an instance embedding from a semantically aware point cloud. Semantic features belonging to the same instance are combined to improve the semantic predictions. This process could be extended to panoptic segmentation and beyond.
Moreover, Zhao and Tao [81] proposed JSNet, which deals with both 3D point cloud semantic and instance segmentation at the same time, and which contains an efficient backbone network (vigorous feature extraction) as well as a point cloud feature fusion module (discriminative feature detection).

Outdoor 3D datasets
These datasets are those built from outdoor scenes. They are produced using various tools such as LiDAR, laser scanners, cameras... The most used outdoor datasets for 3D segmentation and classification are:

Indoor 3D datasets
Contrary to the outdoor datasets, these are built from indoor scenes. They are produced using tools such as the Kinect, structure from motion, laser scanners, cameras... The most used indoor datasets for 3D segmentation and classification are as follows:

Runtime analysis
Computation speed is a very important metric for most systems that meet rigid requirements regarding the amount of time spent on execution, training... It is therefore helpful to provide a system's speed together with a comprehensive description of the hardware used to execute it (Table 3), which could help other researchers. Despite the various hardware and the different training methods used, the table clearly shows that the 3D-SIS network outperforms all the other methods regardless of category, with only about 10 epochs needed for training.

Accuracy
To evaluate accuracy in computer vision, the overall mean average precision is an important factor. The mean Average Precision (mAP) is the most used because it is a measure that combines recall and precision. Table 4 shows the comparison of the methods' performance on the ScanNet (v2) dataset, while Table 5 displays the accuracy comparison on the S3DIS dataset for Areas 5 and 6.
Using ScanNet (v2) as the dataset, Table 4 illustrates that no method is simultaneously the complete solution across both categories of methods and metrics. A complete solution should outperform all other methods at both 25% and 50% mAP at the same time. At mAP@25%, 3D-MPA performs best, while at mAP@50% GICN (slightly superseding PointGroup) performs best.
Using the S3DIS dataset with 6-fold cross-validation, Table 5 on the contrary shows the strength of 3D-MPA on both metrics, as it performs better than all other methods, making it the best solution on the S3DIS dataset. In conclusion, it appears that the region proposal free methods achieve better accuracy in general on both the ScanNet (v2) and S3DIS datasets.

Robustness to noise
The presence of noise in 3D LiDAR point clouds will, for the proposal based methods, result in the loss of points in the expected results. But in terms of voting for classification or segmentation, a strategy of "one point, one vote" would help to estimate the categories of each object as well as minimize the impact of the extreme values of unanticipated noise. By showing where to sample within the object distribution, the scene observations help in directing the proposal generation process. For instance, GSPN [65] encodes noisy observations in the object space to produce an instance-based feature extraction process.
In the proposal free methods, generating object proposals from directional information alone is difficult because of the noise, while the clustering predicament is harder and less efficient. Therefore, mean-shift clustering for object proposals together with directional information are the prevalent methods. Since real data contain missing data, noise and outliers, the projection of each point into the feature space could be less discriminative, and clusters could overlap.

Performance on scenes variations
Proposal based methods such as GICN [69] are capable of correctly approximating the instance center distributions spread over the entire scene as Gaussian center heatmaps, even for unknown 3D scenes. Furthermore, PanopticFusion [66] implements total scene reconstruction and solid semantic labelling, with the capacity to distinguish each object, and easily realizes context-aware interactions while naturally visualizing the occlusion effects. SGPN [70] is weak for road scenes with very scarce points, but LIDARSeg [67] is able to recognize smaller objects without producing several false positives.
For the proposal free methods, there exist fundamental limitations, for example in scenes where various occluded objects are not well visible, even though a method such as 3D-BEVIS [77] is able to learn global instance features which are consistent over a full scene.
In rare situations, the results of MTML [73] turn out to be worse due to improperly scanned scene parts, while the ClusterNet [75] model leverages RGB and depth data to obtain robust 3D instance segmentation in cluttered scenes with much occlusion. SGPN [70] is unable to process really large scenes containing very large numbers of points.

Loss functions study for performance improvement
In the 3D-BEVIS [77] approach, a discriminative loss is used to handle sizeable distances between pixels of the same instance alongside the cross-entropy loss, where the semantic losses are weighted with the negative logarithm of the class frequency.
In GICN [69], the proposed network is trained and optimized by a joint loss Ltotal made of several loss terms, such that Ltotal = Lcenter + Lbound + LIoU + Lmask + Lsize. The GICN [69] bounding box loss Lbound is simple compared with the multi-criteria loss used for box prediction by 3D-BoNet [64], which requires a box association layer to choose the mapping between the ground-truth bounding boxes and the predicted ones.
The focal loss helps the GICN network resolve the imbalance predicament of predicting diverse instances. The difference between using a vanilla cross-entropy loss and the focal loss is wide: an improvement of 14% in mAP can be observed. This result shows that without the focal loss strategy, the GICN network tends to learn merely the simpler instances.
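The mechanism behind that gap can be sketched as follows (the probabilities are made-up examples; gamma=2 is the usual default): the focal loss multiplies the cross-entropy by (1 - p)^gamma, which down-weights well-classified easy examples so that training is dominated by the hard, rare instances.

```python
import numpy as np

def cross_entropy(p_true):
    """Vanilla cross-entropy for the predicted probability of the true class."""
    return -np.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """(1 - p)^gamma down-weights well-classified (easy) examples."""
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

easy, hard = 0.95, 0.10   # predicted probability of the true class
print(cross_entropy(easy), focal_loss(easy))  # easy: loss shrinks by (0.05)^2 = 400x
print(cross_entropy(hard), focal_loss(hard))  # hard: loss shrinks only ~1.2x
```

Because easy instances contribute almost nothing, the network cannot minimize the total loss by learning merely the simpler instances, which matches the behavior reported for GICN.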
In 3D-SIS [68], the detection algorithm extracts features that serve as input to the 3D Region Proposal Network (3D-RPN) and the 3D Region of Interest layer to predict both the bounding box locations and the object class labels. The 3D-RPN is trained by associating ground-truth object annotations with previously designed anchors. In addition, 3D-SIS [68] uses a two-class cross-entropy loss for the objectness measurement. Finally, a Huber loss is used for the bounding box regression.
In order to constrain the direction of the predicted offsets, the PointGroup [76] network formulates a direction loss, because it finds it difficult to regress precise offsets, most importantly for boundary points of sizeable objects. The binary cross-entropy loss is used as the score loss to allow the whole framework to train in an end-to-end fashion, where the total loss is the sum of all the individual losses (semantic, direction, regression, score).
In order to learn the feature embeddings (one maps each voxel to a feature space, the other allocates a 3D vector to each voxel), MTML [73] introduces a multi-task loss function that is minimized during training. The first component of the loss promotes discrimination among various instances in the feature space, while the second component penalizes angular deviations of the vectors from the selected direction. For the directional loss, MTML [73] aims to generate a vector feature that locally describes the intra-cluster relationship without being affected by other clusters. The vector is chosen to be the one pointing towards the ground-truth center of the object, and a directional loss is formulated to learn this vector feature.
Generally, the MTML [73] network trained with the multi-task loss always performs better than the single-task one. This is confirmed by the results on the synthetic dataset and further supports the premise that the directional loss adds more discriminative features.
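A directional loss in the spirit of MTML can be sketched with cosine similarity (an illustrative formulation on toy data, not the paper's exact equation): each point predicts a unit vector, and the loss penalizes angular deviation from the direction towards the ground-truth object center.

```python
import numpy as np

def directional_loss(points, pred_dirs, center):
    """Mean (1 - cos angle) between predicted vectors and center directions."""
    gt = center[None, :] - points                               # true directions
    gt /= np.linalg.norm(gt, axis=1, keepdims=True)
    pred = pred_dirs / np.linalg.norm(pred_dirs, axis=1, keepdims=True)
    cos = (gt * pred).sum(axis=1)       # cosine of the angular deviation
    return np.mean(1.0 - cos)           # 0 when every vector points at the center

pts = np.random.default_rng(3).standard_normal((100, 3))
center = np.zeros(3)
perfect = center[None, :] - pts         # vectors pointing exactly at the center
print(directional_loss(pts, perfect, center))  # -> 0.0
```

Because the target direction depends only on the point's own instance center, the feature is unaffected by other clusters, which is the property the MTML authors seek.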
In ClusterNet [75], the loss function encourages the network to project points into a feature space where points of equivalent instances are close to each other, while points of different instances are separated by a wide margin. It is defined as an object-centric training loss consisting of the sum of the semantic mask loss (the cross-entropy between the ground truth and the estimated semantic segmentation), the cluster center loss, the pixel-wise loss, the variance loss and finally the violation loss.
In ASIS [80], the classical cross-entropy loss supervises the semantic segmentation step at training time. The instance loss function is formulated as a sum of three terms, where Lvar aims to drag embeddings towards the average embedding of the instance (i.e. the instance center), Ldist allows instances to fend each other off, and Lreg is a regularization term to bound the embedding values. In 3D-MPA [78], the mask loss is introduced as a focal loss instead of a cross-entropy loss in order to manage the class imbalance between foreground and background, while the model is trained with the multi-task loss in an end-to-end manner.
Basically, in 3D-BoNet [64], the loss function has two goals: to minimize the Euclidean distance between each associated predicted box and its ground-truth box, but also to maximize the coverage of suitable points inside every predicted box. For the loss function, 3D-BoNet uses the focal loss with default hyper-parameters in place of the standard cross-entropy loss for optimization.
GSPN [65] is trained to minimize a multi-task loss function LGSPN defined for every prospective object proposal. In fact, LGSPN is the sum of five terms: the shape generation loss Lgen, the shape generation per-point confidence loss Le, the KL loss LKL, the center prediction loss Lcenter and the objectness loss Lobj. The loss of MT-PNet [79] is the summation of the branch losses, where L = Lprediction + Lembedding. The prediction loss Lprediction is given by the cross-entropy. MT-PNet employs a discriminative loss to produce the embedding loss Lembedding. In JSNet [81], the loss function L at training time consists of a summation of the semantic segmentation loss Lsem and the instance embedding loss Lins, where Lsem is defined with the classical cross-entropy loss.

Metric as a learning tool
2D instance segmentation methods use metric learning as a very important tool. A frequent metric-learning approach is to define a pair-wise loss for learning appropriate pixel embeddings, so that points from the same instance are closer together. Comparable approaches can be implemented when learning embeddings for 3D instance segmentation. For example, MTML [73] learns intra-instance and inter-instance relationships, and develops a post-processing task for semantic segmentation with mean-shift clustering to group the 3D points. The labels are defined by learning a metric which assembles parts of one object instance and predicts the direction of the object's center of mass. ASIS [80] and JSIS3D [79] also choose mean-shift clustering to deduce instance segmentation clusters. One more strategy for metric learning consists of training a network to estimate an instance affinity score by predicting whether two points are from the same object instance, as in MASC [71].
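The mean-shift step used by MTML, ASIS and JSIS3D can be illustrated with scikit-learn (toy 2-D embeddings; the bandwidth value is a hypothetical choice): points whose embeddings fall in the same density mode receive the same instance label, without the number of instances being fixed in advance.

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(4)
# Two made-up instances whose embeddings form two well-separated blobs.
embeddings = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2)),   # instance A
    rng.normal(loc=[3.0, 3.0], scale=0.1, size=(50, 2)),   # instance B
])
labels = MeanShift(bandwidth=1.0).fit_predict(embeddings)
print(len(set(labels)))  # two well-separated blobs -> 2 clusters
```

Not needing the instance count a priori is precisely why mean shift is preferred over, say, k-means for instance grouping.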
After outputting initial embeddings for all points, the discriminative embeddings [72] method argues that points belonging to one instance will present comparable embeddings, while points of different instances will be far apart in the embedding space. This is a standard predicament in metric learning. Nonetheless, for the 3D instance segmentation task, points of the same instance have both embedding features and geometric associations in 3D space. The network combines such embedding features with structure information to attain more discriminative final results. The network therefore needs a metric to measure the likeness among embeddings, and it argues that the KNN uses the spatial distance of points rather than the embedding distance as its metric.
In 3D-MPA [78], bottom-up approaches use metric-learning techniques to learn a per-point feature embedding space that is then grouped into object instances. This approach is able to successfully deal with outliers, but it relies heavily on tuning cluster parameters. The SGPN [70] method is also intimately connected to similarity metric learning, which has been extensively used in deep learning on a multitude of tasks such as person re-identification [102], matching [103], image retrieval [104], and face recognition [105]. Basically, the SGPN method uses metric learning in a different way in that it regresses the probability of two points belonging to the same group and formulates the similarity matrix as group proposals to deal with a varying number of instances.

Possible future research directions
Based on this review, which studies and compares the most prominent methods in the 3D instance segmentation field, possible future research would address the increasing need for large datasets, the problem of efficient segmentation of voluminous point clouds, the semantic inaccuracy problem (by exploring the possibility of incorporating auto-supervision techniques that could further enhance system achievement), as well as the ability to learn the spatio-temporal data of dynamic point clouds, which could enhance the output of ensuing tasks such as 3D object recognition and segmentation.
Many other avenues exist, such as new applications turned towards context-aware augmented reality and intelligent autonomous robotics, the combination of detection with tracking in partially dynamic situations, and also the possibility of separating object proposals in 4D space.

Conclusions
This paper has established a comparative analysis of the relevant techniques for 3D perception, specifically 3D instance segmentation, including 3D region proposal based methods and 3D region proposal free methods. The datasets normally used in 3D point cloud instance segmentation have been presented. A complete nomenclature, a performance evaluation, as well as the respective contributions of these 3D instance segmentation methods were discussed. Prospective research possibilities were also presented. This research is a unique review paper in the literature, comparing and discussing the various deep learning based instance segmentation techniques.