Autonomous and Real Time Rock Image Classification using Convolutional Neural Networks

Autonomous image recognition has numerous potential applications in planetary science and geology. For instance, the ability to classify images of rocks would give geologists immediate feedback without having to bring samples back to the laboratory. Planetary rovers could likewise classify rocks in remote places, even on other planets, without human intervention. In 2017, Shu et al. used a Support Vector Machine (SVM) classification algorithm to classify 9 different types of rock images, with the image features extracted autonomously. Through this method, they achieved a test accuracy of 96.71%. Within the last few years, Convolutional Neural Networks (CNNs) have been shown to perform better than other algorithms in classifying images of everyday objects. In light of this development, this thesis demonstrates the use of CNNs to classify the same set of rock images. With the addition of dataset augmentation, a 3-layer CNN is shown to improve significantly on Shu et al.'s results, achieving an average accuracy of 99.60% across 10 trials on the test set. Multiple CNN operations with similar output shapes have been designed and appended to an existing architecture to expand hyperparameter considerations. These Combinational Fully Connected Neural Networks achieve an accuracy of 99.36% on the test set. The resulting models are also shown to be lightweight enough to be deployed on a mobile device. To tackle a more interesting and practical problem, CNNs have also been designed to classify natural scene images of rocks, an inherently more complex dataset. The task has been simplified into a binary classification problem where the images are classified into breccia and non-breccia. This thesis shows that a Combinational Fully Connected Neural Network achieves an accuracy of 93.50%, better than a 5-layer CNN, which achieves 89.43%.


Introduction and Motivation
Identifying outcrop lithology is an important aspect of geological mapping and exploration because it gives valuable insight into the area under examination, i.e. its geologic history, origin, and nature. At the same time, making accurate geologic maps has a profound impact on numerous other fields of study. For instance, mining and resource exploration rely on accurate maps to determine where a valuable mineral could be feasibly extracted; civil and structural engineering require solid geological information when building dams, roads, and buildings; and environmental geosciences depend upon geological maps to predict hazards [1]. Therefore, a system that could automatically classify outcrop lithology would offer multiple benefits to the geologist:
• Unknown rocks could be identified without bringing samples back to the laboratory;
• Accurate geological maps could be made with significantly less effort;
• Geologists could use the classifier as a learning tool for identifying rocks.
Applications beyond this planet could also be foreseen. Planetary rovers like the Mars Science Laboratory Rover are equipped with a suite of instruments that collect a plethora of scientific data [2]. However, communication with the rovers from Earth remains an issue. For one, as the rover travels farther in each mission, the amount of data that can be sent to Earth is reduced, potentially resulting in missed science opportunities during long traverses [3]. As well, many tasks, such as approaching a rock outcrop and placing an instrument against it, require several steps of human intervention [4]. As a result, a significant amount of time and data is wasted in the long, arduous process of sending commands to and receiving data from the rover.
For this reason, autonomous targeting systems like OASIS (Onboard Autonomous Science Investigation System) and AEGIS (Autonomous Exploration for Gathering Increased Science) have been developed to cut back on the data turnaround time, to afford the rover some amount of autonomy, and to allow the rover to collect and send more data [3,5,6]. A vision system that could identify interesting rock types without human intervention could therefore further increase the quality and quantity of data sent by the rovers.

Related Work
The task of rock image classification has garnered significant attention from researchers in the field of geology. For instance, Harinie et al. tested image classification algorithms on four types of rocks: intrusive igneous, extrusive igneous, sedimentary, and metamorphic [7]. This was accomplished by extracting Tamura features [8] from a set of 50 images and using these as the basis features for each class. Subsequent images are then classified by comparing the Tamura features of the input image with the basis features of each class; the image is assigned to the class whose basis features are at the minimum distance. Through this method, the authors obtained a classification accuracy of at least 87%.
As well, Mlynarczuk et al. classified thin section images of nine different rock samples [9]. First order statistics (mean, standard deviation, etc.) were used as features for each image. The features were extracted in four different color spaces, namely RGB, HSV, YIQ, and CIELAB [10,11], to examine the effect of the color space on automatic classification. Using these features, the authors found that the Nearest Neighbours and K-Nearest Neighbours algorithms performed best across all color spaces, achieving accuracies upwards of 96%.
Baykan and Yilmaz [12] also classified thin section images of rocks, as well as the percentage of each mineral present in the thin section. The images were taken under both plane polarized and cross polarized light, with each pixel labelled according to its mineral content. The authors used raw pixel values in the RGB and HSV spaces as features, which were then fed into an Artificial Neural Network with 1 hidden layer. The authors obtained an accuracy of 89.53% when using the RGB color space, 87.5% with HSV, and 87.45% using both color spaces.
Meanwhile, Shang and Barnes hand-crafted 54 different features and used three different classifiers (SVM, KNN, Decision Trees) to classify rock images into 14 different textures [13]. The authors then used a reliability-based method and Information Gain based ranking to select the top n features with which to represent the images. Overall, the authors achieved the highest accuracy by using the top 20 of the possible 54 features with an SVM classifier. Using all 54 features did not yield any significant increase in accuracy for the SVM.
Ishikawa and Gulick take a different approach to mineral classification, using Raman spectra instead of color images [14,15]. The authors gathered Raman data from 13 different minerals and segregated them according to mineral group. A total of 190 spectra, each with 765 dimensions, were collected. Using Principal Component Analysis, the number of dimensions was reduced to 3. The authors then used a Decision Tree and an Artificial Neural Network with 6 nodes in 1 hidden layer to classify the minerals. They obtained an average accuracy of 83% when classifying the samples by mineral group and 73% when classifying individual minerals.
Singh et al. used a mix of first and second order statistics, image features (percentage of the most common gray level, number of edge pixels, etc.), and region features to classify textures of basaltic rock images into three classes [16]. In total, 27 features were selected. Similar to [14], Principal Component Analysis was used to reduce feature dimensionality. An Artificial Neural Network was then used, with 2 hidden layers of 15 and 20 nodes, respectively. The authors obtained an average classification accuracy of 92.22%.
Outside of classifying images into rock types, Shu et al. autonomously categorized images of rocks into different sorting levels based on their granularity using Convolutional Neural Networks (CNNs) [17]. The rocks were classified into 5 levels, from very well sorted if their granularity is consistent, to very poorly sorted if the granularity varies by a large amount.

Baseline and Point of Comparison
Shu et al. described the perils of having to manually select features and emphasized unsupervised learning of features. In their work, they compared the results of combining different manually selected features with using a variation of the K-means algorithm to extract features from the images. Then, using a Support Vector Machine, they showed that the unsupervised method of selecting features outperforms the other methods by a significant margin [18].

Convolutional Neural Networks
A Convolutional Neural Network (CNN) is simply an Artificial Neural Network (ANN) preceded by operations that extract information from images. In the same way that an ANN attempts to mimic how the human brain functions, the CNN attempts to mimic how the brain interprets images. Take, for example, how an infant learns to differentiate between shapes. One would imagine that the edges of each object are among the main factors in deciding its shape. Over time, an infant learns that an object with four edges is a square, an object with no corners is a circle, and so on. Then, in differentiating objects, combinations of shapes, edges, and colors are taken into account: this object has this shape and this color. As children learn more shapes and more objects, the distinguishing features tend to become more complicated. A CNN is composed of the following layers:

Convolution Layer
The first operation executed on an input image is the convolution layer. A convolution of an image is simply a weighted sum of the pixels within the window of a filter, called the receptive field. To illustrate, Figure 1 shows a 4 x 4 sample image convolved with a 3 x 3 filter with weights [1 1 1; 1 1 1; 1 1 1]. This simply results in a linear sum of all the pixels in the receptive field of the filter. The window then slides to the next position by a fixed step size known as the stride; in this example, the stride is 1. This process is repeated until the convolutions throughout the image are exhausted, creating a feature map.
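To make the sliding-window arithmetic concrete, the operation in Figure 1 can be sketched in a few lines of pure Python. This is an illustrative sketch only (the thesis's actual models are built in Keras), and the function name `convolve2d` is our own:

```python
def convolve2d(image, kernel, stride=1):
    """Valid (no-padding) 2-D convolution: slide the kernel across the
    image and take a weighted sum over each receptive field."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(0, ih - kh + 1, stride):
        row = []
        for c in range(0, iw - kw + 1, stride):
            # weighted sum of the pixels inside the receptive field
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map

# A 4 x 4 image convolved with a 3 x 3 all-ones filter, as in Figure 1:
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
ones = [[1, 1, 1]] * 3
print(convolve2d(image, ones))  # [[54, 63], [90, 99]]
```

With the all-ones filter each output pixel is just the sum of the 9 pixels under the window, and the 4 x 4 input yields a 2 x 2 feature map.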
Even though the illustration in Figure 1 uses a simple 4 x 4 x 1 image (only one color channel) and a 3 x 3 x 1 filter, the convolutions extend all the way through the depth of the input. As such, for a 3-channel image of size 4 x 4 x 3, the filters would be 3 x 3 x 3, matching the depth of the input. Note that when the 4 x 4 image was convolved with a 3 x 3 filter, the resulting feature map was smaller than the input image. To alleviate this, some CNN architectures add zero padding around the images so that the size of the image is maintained throughout the network.
The output of each filter is then stacked, creating a 3-dimensional feature map. In general, the output of the convolution layer is a 3-D volume H2 x W2 x K where H2 = (H1 - F + 2P)/S + 1 and W2 = (W1 - F + 2P)/S + 1, with H1 and W1 the height and width of the input, F the filter size, P the amount of zero padding, S the stride, and K the number of filters. The number of filters per convolution layer is at the discretion of the researcher. Even the most common networks employ varying numbers of filters per layer, and there is no hard and fast rule on how many filters there should be.
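The standard output-size relation for a convolution layer, (W1 - F + 2P)/S + 1, can be checked with a small helper. This is an illustrative sketch; the helper name `conv_output_shape` is our own:

```python
def conv_output_shape(h, w, filter_size, stride=1, padding=0, num_filters=1):
    """Spatial size of a convolution layer's output feature map,
    using H2 = (H1 - F + 2P)/S + 1 (and likewise for the width)."""
    h2 = (h - filter_size + 2 * padding) // stride + 1
    w2 = (w - filter_size + 2 * padding) // stride + 1
    return h2, w2, num_filters

# The 4 x 4 image with a 3 x 3 filter, stride 1, no padding -> 2 x 2 map:
print(conv_output_shape(4, 4, 3))               # (2, 2, 1)
# With zero padding of 1, the input size is preserved:
print(conv_output_shape(4, 4, 3, padding=1))    # (4, 4, 1)
# A 128 x 128 rock image through a 32-filter, 3 x 3 layer:
print(conv_output_shape(128, 128, 3, num_filters=32))  # (126, 126, 32)
```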
In this algorithm, the weights within the filters are learned through back-propagation [19] whereby a loss function is defined, and the weights of the filters are updated using a gradient descent optimizer. In this research, the Adam optimizer is used [20].
Finally, the convolution operations are followed by an activation function to introduce non-linearity to the problem. The most commonly used activation function is the Rectified Linear Unit (ReLU), f(x) = max(0, x): the activation is simply a threshold at 0.
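The thresholding behaviour of ReLU is one line of code; a minimal sketch, applied elementwise to a feature map row:

```python
def relu(x):
    """Rectified Linear Unit: pass positives through, clamp negatives to 0."""
    return x if x > 0 else 0.0

# Negative responses are zeroed; positive responses pass through unchanged.
row = [-2.0, -0.5, 0.0, 1.5, 3.0]
print([relu(v) for v in row])  # [0.0, 0.0, 0.0, 1.5, 3.0]
```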

Pooling Layer
Convolutional layers are often followed by a pooling layer which reduces the dimensionality of the input. The mechanism is similar to the convolution layer where a sliding window propagates through the image. The output of the pooling operation could be 1) the maximum pixel value within the window (Max Pooling), or 2) the average of the pixel values (Average Pooling).
The purpose of the pooling layer is to aggregate responses within a local area of the image. On one hand, this decreases the number of parameters within the network, which results in a reduction of the computational complexity for training as well as a mitigation for overfitting. On the other hand, this operation brings a level of invariance to positional changes within the image, meaning the exact position of pixels in the image would matter less while preserving the structure of the whole image. Ultimately, this affords the classifier some leeway in that regardless of where the object is in the image, the classifier would still be able to successfully recognize the image [21].
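The pooling mechanism described above can be sketched like the convolution, but taking a maximum instead of a weighted sum. An illustrative sketch (the name `max_pool` is our own); the 2 x 2 window with stride 2 matches the pooling layers used later in the thesis's architecture:

```python
def max_pool(feature_map, window=2, stride=2):
    """Max pooling: keep only the largest response in each window,
    halving each spatial dimension for window=2, stride=2."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for r in range(0, h - window + 1, stride):
        out.append([max(feature_map[r + i][c + j]
                        for i in range(window) for j in range(window))
                    for c in range(0, w - window + 1, stride)])
    return out

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 0],
        [7, 2, 9, 8],
        [0, 1, 3, 5]]
print(max_pool(fmap))  # [[6, 4], [7, 9]]
```

Note how each output value survives even if the strongest response shifts by a pixel within its window, which is the positional invariance discussed above.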

Flattened Layer, Dropout, and Output Layer
After a series of convolutions, activations, and pooling, the resulting 3-D feature map is flattened, creating a 1-D vector of size H × W × K. This essentially creates the input for the hidden layers of an ANN, which comprise the last few layers of the CNN. In this research, a dropout layer is implemented where the neurons in the hidden layers of the ANN are randomly shut off during training [22]. This technique forces the neurons to train robustly and independently of one another. Thus, when all of the neurons are active during testing, the predictions can be more accurate.
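Both steps are simple to sketch. Flattening just unrolls the volume into a vector; for dropout, the sketch below uses the "inverted" variant common in modern frameworks, where survivors are rescaled during training so that no scaling is needed at test time. This is an illustrative sketch, not the thesis's Keras implementation, and the function names are our own:

```python
import random

def flatten(volume):
    """Unroll an H x W x K feature volume into a 1-D vector."""
    return [v for row in volume for col in row for v in col]

def dropout(vector, rate=0.5, training=True, seed=None):
    """Inverted dropout: randomly zero activations during training and
    scale the survivors by 1/(1 - rate) so the expected activation is
    unchanged. At test time all neurons stay active."""
    if not training or rate == 0.0:
        return list(vector)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]

flat = flatten([[[1, 2], [3, 4]],
                [[5, 6], [7, 8]]])   # 2 x 2 x 2 volume -> 8-vector
print(flat)                          # [1, 2, 3, 4, 5, 6, 7, 8]
print(dropout(flat, rate=0.5, training=False))  # unchanged at test time
```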
Finally, the output layer is an activation operation whose result is the prediction of the class to which the input image belongs. In this case, the softmax activation function is used, where the output is a normalized probability that the input image belongs to each class (Equation 4):
P(y = j | x) = exp(θ_j · x) / Σ_{i=1..k} exp(θ_i · x)
where: k = number of classes, x = input feature vector, θ = layer weights
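A minimal sketch of the softmax normalization over a vector of class scores (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something specific to this thesis):

```python
import math

def softmax(logits):
    """Normalize k class scores into probabilities that sum to 1."""
    m = max(logits)                          # stability: shift by the max
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # probabilities sum to 1; the largest score wins
```

The predicted class is simply the index of the largest probability, and that probability serves as the prediction's confidence.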

Transfer Learning
Given the layers that comprise a CNN, there is a huge number of possible combinations for the architecture of the network. Convolution layers can be repeated as many times as one wants, the pooling layer may or may not be included at all, and the hyperparameter choices (number of filters, filter size, stride, zero padding) are practically unlimited. Fortunately, constant research is being done to search for architectures that optimize the performance of a CNN.
In fact, the following CNN architectures [23][24][25][26][27][28][29][30][31][32] have been proven to work well on the ImageNet dataset, a dataset consisting of more than 1 million images and 1000 classes of everyday objects [33,34]. We can therefore assume that the filters in these networks have learned what features to extract from images of everyday objects. We then apply the capabilities of these networks to the more specific target task of classifying rock images. To accomplish this, the filter weights of the networks are frozen and used to extract features from rock images. The only layer left to be trained is the output layer, where the number of neurons is modified to match the number of classes in our dataset.
One advantage of using this method is that the architectures of these networks and their filter weights pre-trained on the ImageNet dataset are readily available. Thus, there is no need to train very deep networks on a large dataset. This dramatically decreases computational costs and introduces a great potential for producing good results.

Dataset
The dataset from the work of Shu et al. contains 9 classes of rocks with approximately 70-80 images per class [18]. The images have a resolution of 128 x 128 pixels and correspond to 1-2 cm in reality. Sample images are shown in Figure 2. The dataset is split into a 70% training set, 15% validation set, and 15% test set. As the names suggest, the training set is used to train and create the model. The validation set is then used to test the accuracy of the model at each training step. If the accuracy of the model improves on the validation set, the model is saved for further testing; if it does not improve for a particular training step, the model produced by that step is discarded. This guarantees that the best model produced by training is obtained. Finally, the model's performance is reported on the test set, which is a good indication of the generalizability of the model. For deployment testing purposes, 5 random images from each class were isolated from training, validation, and testing. These images are used to check the time it takes for each prediction and whether the models deployed on an iOS application still achieve the desired accuracy.
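The 70/15/15 split described above can be sketched with a shuffle followed by slicing. This is an illustrative sketch of the splitting logic only (the function name `split_dataset` and the fixed seed are our own, not from the thesis):

```python
import random

def split_dataset(samples, train=0.70, val=0.15, seed=42):
    """Shuffle and split samples into train/validation/test subsets
    (70/15/15 by default, with the remainder going to the test set)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# With ~80 images in a class, the split is 56 / 12 / 12:
train_set, val_set, test_set = split_dataset(range(80))
print(len(train_set), len(val_set), len(test_set))  # 56 12 12
```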

Data Augmentation
Before jumping into classifying rock images with CNNs, it must be pointed out that CNNs require a large number of labelled images to attain high testing accuracy [21]. For instance, the ImageNet dataset has more than 1 million images across 1000 classes [33], or about 1000 images per class. Meanwhile, Shu et al.'s dataset only contains about 80 images in some classes and even fewer in others [18]. To make up for the difference, data augmentation is employed to increase the number of images in the dataset without having to take more pictures of the same rocks. Data augmentation is accomplished by applying transformations to the images (e.g. translation, rotation, scaling, blurring), then concatenating the transformed images with the original dataset. In addition to multiplying the number of images in the dataset, data augmentation affords the model some form of invariance to how the rock is positioned in an image. After all, regardless of the way any person looks at an image of a rock, the rock in the image never changes.
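Two of the simplest label-preserving transformations, rotation and flipping, can be sketched on an image represented as a list of rows. This is an illustrative sketch only; the exact set of transformations used in the thesis is the one shown in Figure 3, and the function names here are our own:

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]

def augment(image):
    """Bundle the original with simple label-preserving variants:
    a horizontal flip plus the three remaining 90-degree rotations."""
    variants = [image, flip_horizontal(image)]
    rotated = image
    for _ in range(3):
        rotated = rotate90(rotated)
        variants.append(rotated)
    return variants

img = [[1, 2],
       [3, 4]]
print(rotate90(img))       # [[3, 1], [4, 2]]
print(len(augment(img)))   # 5 images from 1 original
```

Every variant keeps the original label, so a class of 80 images grows several-fold at essentially no cost.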
The intuition also goes back to the definition of an image as a 2-D matrix. Transforming the image yields different pixel positions in the matrix, but the transformed matrix still belongs to the same class as the original image, thus artificially creating an image that is different to the computer, but not in reality. Overall, data augmentation is a cheap way of improving the performance of classifiers. A diagram of the applied transformations is shown in Figure 3.
Figure 4 shows the architecture designed for the CNN. It features 3 convolutional layers, each followed by a 2 x 2 pooling layer. The filters in the convolutional layers have a receptive field size of 3 x 3, with the layers having 32, 32, and 64 filters, respectively. No zero padding was added, and the stride for each convolution was kept at 1. It should be noted that this configuration was not arbitrarily decided: preliminary work was done to determine the optimal number of filters and how deep the network should be.

Self Taught Learning with CNNs
As a measure of how well the network learns what features to extract from rock images, a version of Self-taught Learning has been applied with CNNs as well. Self-taught learning is a technique where feature representations are learned from large unlabelled datasets. From these learned feature representations, a classifier is then trained on a more specific, labelled dataset [35]. The advantage of Self-taught learning is that robust features are learned from unlabelled datasets, which are generally easier and cheaper to acquire than labelled ones. For example, feature representations can be learned from rock images downloaded from the internet without having a geologist label each and every one of them. Better yet, rocks can be photographed by a non-geologist and the Self-taught learning algorithm can learn from this dataset, again without a trained geologist. Once feature representations are learned, they can be applied to a labelled dataset to train a different classifier. Ultimately, this provides a more generalizable classifier.
Admittedly, training on an unlabelled dataset is impossible for a CNN because, by nature, a CNN is a supervised learning algorithm. In this case, the same 3-layer network was trained with images from only 5 of the 9 original classes. Then, freezing the weights of the filters, transfer learning was applied to the other 4 classes, with the only modification being the output layer, to accommodate the change in the number of remaining classes. Essentially, this tests whether filters that have learned to extract information from a separate "dataset" of rocks can extract meaningful information from a different set of rocks as well.
Shu et al.'s work demonstrated Self-taught Learning as well, where feature representations were first learned on an unlabelled subset of the original dataset [18]. From there, a Support Vector Machine classifier was trained on the remaining classes. The technique they used resembles the actual Self-taught Learning algorithm more closely in that the feature representation learning is done on an unlabelled dataset.

Hyperparameters
The hyperparameters used to train all pertinent networks are the same. However, for each of the networks used in Transfer Learning, the output layer and the preceding activation were modified to fit our dataset; the final output is reduced from the original 1000 nodes to 9 nodes. At the same time, all the pre-trained filter weights and biases were frozen, dramatically reducing the number of parameters to train. Each model was trained for 200 epochs with a batch size of 16. The Adam optimizer was used, with a learning rate of 0.0001 [20]. To confirm the accuracy of the results, 10 trials were performed for each model, with the dataset shuffled for each trial. The accuracy and loss on the test set are then recorded and reported. Table 1 summarizes the hyperparameters used during training.

iPad Deployment
For the models to be useful to a geologist, they have to be portable enough to be deployed on a mobile device. Since the best model is saved for each trial, one of the best performing models was chosen to be deployed on an iPad. To accomplish this, the model first has to be converted from a Keras module into one supported by Apple applications, i.e. a CoreML module. The conversion is done using CoreMLTools [36], a toolkit published by Apple. The conversion is straightforward, and a simple application has been developed that accepts an image and outputs the predicted type of rock. Figure 5 shows a screenshot of an iPhone simulator that displays the intended functionality of the application. The input image is displayed at the center, and the predicted class, alongside the confidence of the prediction, is shown as text at the bottom. Once the models were installed on an iPad, testing was conducted with the 5 random images per class that were excluded from the dataset. The duration of the operation, from the selection of the image up until displaying the prediction, was recorded, as were the accuracy and the confidence of each prediction. Table 2 shows the results of Transfer Learning, the CNN, and Shu et al.'s best results obtained through unsupervised feature learning.

Results and Discussion
Results show that the best classification accuracy was achieved by the 3-layer CNN, averaging more than 99% accuracy on the test set. In fact, in 4 out of the 10 trials, the model was able to achieve 100% test accuracy, correctly predicting the class of every input image. Transfer learning did not perform as well, however: only the VGG networks achieved better than randomly guessing the class of each image. One reason we could posit is that the filters of the networks used in transfer learning may have learned to extract features from relatively large objects but fail to recognize finer details like rock textures. After all, the ImageNet dataset does not have images for specific types of rocks.
Results also show that the CNN performs slightly better, by 2.19%, than Shu et al.'s best result.

Self Taught Learning with CNNs
The results from training the 3-layer network on 5 classes further reinforce the advantage of using CNNs for classifying rock images. Across 10 trials, the network achieved a classification accuracy of 99.70%. One of the best performing networks was selected out of the 10 trials and used for Transfer Learning on the 4 remaining classes. Modifying the output layer to have 4 nodes to match the target task, and keeping the hyperparameters the same, the results show that the filters do learn to extract rock texture information from images, as evidenced by a 91.00% classification accuracy. Similar to how Transfer Learning on the deep networks performed worse than the custom trained 3-layer network, the decrease in classification accuracy in the Self-taught learning version of the CNN is not surprising. After all, a network trained and tested on the original dataset would typically perform better than one trained on a separate dataset [37].

[Table 3: Accuracy and Loss Results for Self-Taught Learning — reporting, per model, the average accuracy and accuracy standard deviation, including the 5-class training dataset]

iPad Deployment
Out of the 10 networks trained on the original 9 classes, one of the best performing networks was also selected to be deployed and tested on an iPad with the isolated images from the dataset. Results show that the model achieves an accuracy of 100% on the iPad. This comes as no surprise because the model already achieves nearly 100% accuracy on the test set. On average, the application takes a respectable 0.0680 seconds to make a prediction, roughly equivalent to 14 frames per second. This shows that the models could be deployed in the field without access to relatively more powerful computers.

Conclusions
With this research, we have explored the use of Convolutional Neural Networks (CNNs) in classifying rock images. Comparing with the results of Shu et al. [18], we have shown that CNNs perform better than unsupervised feature learning using the K-means algorithm with a Support Vector Machine classifier. With the CNN, instead of having to choose feature representations and train a separate classifier, the image classification pipeline is merged into one algorithm. We have also shown that a CNN does indeed learn to extract rock textures from the images by implementing a version of Self-taught Learning, where a network was trained on 5 classes and then used for Transfer Learning on the remaining 4 classes. By deploying a model onto an iPad, we have also shown that the models are lightweight enough to be practically used in the field. The model remains accurate on the iPad and takes well under a second to make a prediction, suggesting the potential for video feed classification. An obvious research direction is to add more rocks to the dataset to build a more general classifier. For a more practical and useful tool, a natural scene image classifier, where a user in the field simply takes an image of a rock without the need for special equipment, is also worth pursuing.