Contrastive Learning for 3D Point Cloud Classification and Shape Completion

In this paper, we present the idea of self-supervised learning for the shape completion and classification of point clouds. Most 3D shape completion pipelines use autoencoders to extract features from point clouds for downstream tasks such as classification, segmentation, detection, and other related applications. Our idea is to add contrastive learning to autoencoders so that they learn both global and local feature representations of point clouds. We use a combination of Triplet loss and Chamfer distance to learn global and local feature representations, respectively. To evaluate the performance of the embeddings for classification, we use the PointNet classifier. We also extend the number of evaluation classes from 4 to 10 to show the generalization ability of the learned features. Based on our results, the embeddings generated by the contrastive autoencoder improve shape completion and classification performance of point clouds from 84.2% to 84.9%, achieving state-of-the-art results with 10 classes.


Introduction
Performing tasks on point clouds is considered more challenging than on 2D images. 2D images live on a regular grid with uniform spacing in the image plane, whereas point clouds are represented as lists of sparse 3D Cartesian points. Point clouds are spread through a 3D coordinate system and lack the topological information that many methods require. It is also challenging to apply typical deep learning methods such as Convolutional Neural Networks directly to point clouds to learn feature representations, because the convolution operation requires neighbouring samples to appear at fixed spatial orientations and distances, which is not the case for point clouds.

Recently, graph-based methods have been very successful at learning point cloud representations, and Dynamic Graph Convolutional Neural Networks (DGCNN) [1] were introduced, which aim to recover the topology of the point cloud so that rich representations can be extracted. The major drawback of GCNN-based methods is that they have millions of learnable parameters, making them vulnerable to over-fitting. We would need large-scale annotated datasets to train GCNNs in order to obtain a generalizable solution to our shape completion problem. However, unlike images, only very limited datasets are available for point clouds, and the collection and annotation of a new point cloud dataset is time-consuming and expensive, since point-level annotations are needed. With their powerful ability to learn useful representations from unlabeled data, unsupervised learning methods, sometimes known as self-supervised learning methods, have drawn significant attention.

In this work, we focus on extracting representations of point clouds that are helpful in downstream tasks. We choose classification as the downstream task and use PointNet [2] to measure classification accuracy.
The classical approach to this problem is to voxelize point clouds to extract features, but this is computationally expensive, the representation accuracy is insufficient, and information is lost. We present Con-

• We extend our idea to the shape completion task to complete corrupted and noisy point clouds. The results show that our method helps in that task as well.

• We increase the number of classes for evaluation from 4 to 10.

The forward pass of the proposed network is given in Figure 2.

Point clouds have received increased attention over the past few years. We review related point cloud analysis and shape completion methods below.

Common reconstruction measures include Earth Mover's distance (EMD) [32] and Chamfer distance [33]. We chose Chamfer distance because it gave us much better reconstruction quality. We use the symmetric Chamfer distance to measure the quality of the Encoder's generated feature representations against the input. Chamfer distance can be defined as follows.
L_{ch}(P_1, P_2) = \frac{1}{|P_1|} \sum_{x \in P_1} \min_{y \in P_2} \lVert x - y \rVert_2^2 + \frac{1}{|P_2|} \sum_{y \in P_2} \min_{x \in P_1} \lVert x - y \rVert_2^2 \quad (1)

where P_1 ∈ R^{2048×3} denotes an input point cloud from the training set and P_2 ∈ R^{2048×3} denotes the Encoder's generated point cloud representation, decoded using the Decoder.
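As a concrete illustration, the symmetric Chamfer distance of Equation 1 can be sketched in a few lines of NumPy. The names here are illustrative and not taken from our implementation, which is written in PyTorch:

```python
import numpy as np

def chamfer_distance(p1, p2):
    """Symmetric Chamfer distance between point clouds p1, p2 of shape (N, 3)."""
    # Pairwise squared Euclidean distances, shape (|P1|, |P2|).
    d2 = ((p1[:, None, :] - p2[None, :, :]) ** 2).sum(axis=-1)
    # Average squared nearest-neighbour distance in both directions (Equation 1).
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
cloud = rng.standard_normal((512, 3))
print(chamfer_distance(cloud, cloud))            # identical clouds give 0.0
print(chamfer_distance(cloud, cloud + 0.5) > 0)  # a shifted cloud gives a positive value
```

Because both directions are averaged, the measure is symmetric in its arguments, which is the property that makes it suitable for comparing an input cloud with its reconstruction.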

Our loss function consists of a distance function and a contrastive loss. The first idea is Pairwise Ranking loss, which is also used in related work [23]. Triplet Ranking loss, however, is easier to optimize, and hence we use Triplet Ranking loss in our setup.

Triplet Ranking loss can be defined as follows.

L_{triplet}(r_a, r_p, r_n) = \max\left(0,\; m + \lVert e_a - e_p \rVert_2 - \lVert e_a - e_n \rVert_2\right) \quad (2)

where r_a ∈ R^{2048×3} denotes the anchor sample, r_p ∈ R^{2048×3} the positive sample, and r_n ∈ R^{2048×3} the negative sample; e_a, e_p, and e_n are their embeddings, and m is the margin.

The objective of our method is to encourage learning both global and local feature representations. Using Equations 1 and 2, we define a combined loss function that aligns with our objective:

L(r_a, r_p, r_n) = L_{triplet}(r_a, r_p, r_n) + L_{ch}(r_a, dec(e_a)) + L_{ch}(r_p, dec(e_p)) + L_{ch}(r_n, dec(e_n)) \quad (3)

where the embeddings e_a, e_p, e_n ∈ R^{128} are generated by the encoder for the anchor, positive, and negative samples, respectively, and dec(·) denotes the decoder.
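Equations 2 and 3 can be sketched as below. This is a minimal NumPy sketch, not our training code: `toy_encoder` and `toy_decoder` are hypothetical stand-ins for the learned networks (which produce 128-d embeddings), and the margin value is arbitrary.

```python
import numpy as np

def chamfer(p1, p2):
    # Symmetric Chamfer distance (Equation 1) between (N, 3) point clouds.
    d2 = ((p1[:, None, :] - p2[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def triplet_loss(e_a, e_p, e_n, margin=0.2):
    """Triplet ranking loss on embeddings (Equation 2); m is the margin."""
    return max(0.0, margin + np.linalg.norm(e_a - e_p) - np.linalg.norm(e_a - e_n))

def combined_loss(r_a, r_p, r_n, enc, dec, margin=0.2):
    """Combined global (triplet) + local (Chamfer) loss (Equation 3)."""
    e_a, e_p, e_n = enc(r_a), enc(r_p), enc(r_n)
    return (triplet_loss(e_a, e_p, e_n, margin)
            + chamfer(r_a, dec(e_a)) + chamfer(r_p, dec(e_p)) + chamfer(r_n, dec(e_n)))

# Toy stand-ins so the sketch runs end to end; a real pipeline uses learned networks.
toy_encoder = lambda r: r.mean(axis=0)       # (N, 3) -> (3,)
toy_decoder = lambda e: np.tile(e, (64, 1))  # (3,)   -> (64, 3)

rng = np.random.default_rng(1)
r_a, r_p, r_n = (rng.standard_normal((64, 3)) for _ in range(3))
print(combined_loss(r_a, r_p, r_n, toy_encoder, toy_decoder) >= 0.0)  # True
```

The triplet term pulls the anchor's embedding toward the positive and pushes it away from the negative (global structure), while the three Chamfer terms keep each reconstruction faithful to its input (local structure).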

We used PyTorch [34] for the implementation and training of our network. We wanted to observe the effect of contrastive learning on shape completion pipelines, and therefore we compare our results with autoencoders trained without contrastive learning. We also evaluate whether contrastive learning helps in the classification of incomplete point clouds after their shapes are completed by our proposed autoencoder.

We refer to an autoencoder with the same architecture as ours but trained without contrastive learning as the Naive autoencoder.

Figure 4 shows the t-SNE plots. Figure 4a shows a somewhat clear separation between the classes. From the ten-class plot given in Figure 4b, we can infer

shown in Figure 7, and save the completed shapes. The quantitative results, i.e., the mean Chamfer distance per point for the completed shapes, are given in Table 1, whereas the qualitative results are given in Table 3.
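For context, the per-point mean Chamfer distance reported in Table 1 could be aggregated over a test set roughly as follows. This is a sketch under the assumption that completed shapes and ground truths come in matched pairs; the ×10⁴ scaling matches the table's units, and the function names are illustrative:

```python
import numpy as np

def chamfer(p1, p2):
    # Symmetric Chamfer distance (Equation 1) between (N, 3) point clouds.
    d2 = ((p1[:, None, :] - p2[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def mean_chamfer_x1e4(completed, ground_truth):
    """Mean Chamfer distance over a batch of (completed, ground-truth) pairs,
    scaled by 10^4 as in Table 1. Lower is better."""
    return 1e4 * float(np.mean([chamfer(c, g)
                                for c, g in zip(completed, ground_truth)]))

rng = np.random.default_rng(2)
gts = [rng.standard_normal((256, 3)) for _ in range(4)]
print(mean_chamfer_x1e4(gts, gts))  # perfect completions give 0.0
```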
Table 1. Quantitative results computed as the average Chamfer distance (10^{-4}) between the ground truth and the shapes completed by the respective methods. The lower the Chamfer distance from the ground truth, the better the completed shape. The Naive autoencoder performs better than all of the other methods.

The completed shapes are sent to a pre-trained PointNet, which outputs a class for each completed shape.

The results are shown in Table 2. On the other hand, for ten classes, the dataset plot in Figure 3 shows that there is little to no separation between the shapes of different classes, and they are also much more semantically similar, e.g., the shape of a Car is similar to a certain kind of Vessel, which makes classification a much more challenging task. In this case, the classification accuracy of the Contrastive autoencoder is better than that of the other methods, which shows that contrastive learning helps separate the embeddings of similar classes, which in turn helps classification, as shown in Figure 5a.

Last but not least, from Table 1 it is evident that the mean Chamfer distance for the Naive autoencoder is slightly lower than for the other methods, but the overall classification accuracy of the Contrastive autoencoder is higher than the others. This also shows how crucial global feature learning is for downstream tasks.

We proposed contrastive learning for the 3D point cloud shape completion and classification task. Contrastive learning provides us with the global features of point clouds, and we use Chamfer distance to extract local features. We combined both feature extractors and trained our network to learn both the global and the local feature sets. We provide benchmarks on the ShapeNetCore [28] dataset for 4 and 10 classes. Our results for both shape completion and classification are very promising, and as a possible extension, we would like to further look into other pretext tasks that help extract more useful global features.