GRaVN: A Convolutional Network Approach to Generalised Characterisation of Raman Spectra for Space Exploration

Abstract: During planetary exploration mission operations, one of the key responsibilities of the instrument teams is to determine data viability for subsequent analysis. During the 2019 CanMoon Lunar Sample Return Analogue Mission, the Lead Raman Specialist manually examined each spectrum to provide quality assurance/validation. This non-trivial process requires years of experience to complete accurately. With the proven efficacy of Convolutional Neural Networks (CNNs) in classification tasks, and the increased use of automation and control loops on planetary space platforms for navigation and science targeting, there is an opportunity to approach this validation problem utilising CNNs. We present the Generalised Raman Validation Network (GRaVN), a neural network focussed specifically on extracting the generalised structure of Raman spectra for quality assurance/validation. This work demonstrates the viability of utilising a CNN in validation activities for Raman spectroscopy. Utilising only two hidden layers, a configuration was developed that provided good levels of accuracy on a manually curated dataset. This indicates that such a system could be useful as part of an autonomous control loop during planetary exploration activities.


Introduction
One of the key tasks for science teams on planetary surface missions is data acquisition and the activities associated with the processing of that data. In general, the initial focus is on viability, checking that the data is useful, before examining it for science. Raman spectroscopy, in particular, requires a significant amount of experience and training to interpret returned data.
The focus of this work is to use machine learning to reliably automate this qualitative step, so that the remote platform (robotic rover, lander, etc.) itself can make decisions about data quality. When the platform takes a reading using its on-board Raman instrument, if it can detect that it has, for example, an over-saturated (fluorescence) spectrum, then it can immediately retake the reading until it obtains a qualitatively acceptable measurement.
This simple feedback loop can then be incorporated into existing sequence automation routines, already tested as a progression from manual sequencing in previous analogue missions [1][2][3][4][5][6][7]. The challenge with Raman spectra is that an (un)usable spectrum can take a very large variety of forms. This is why we have elected to focus on the extremely generalised categorisation that the trained human conducts, rather than attempting to train on individual effects visible in unusable spectra, such as cosmic rays or fluorescence, or to identify the spectra of specific materials, as is done for classification tasks.

Background: Raman Spectroscopy
Raman spectroscopy is a non-destructive spectra-based sensing technique, with applications in material and organic science [8,9]. In recent years, instruments specialising in the technique have seen consistent use in geological analysis terrestrially, as well as on ground-based space platforms [10,11]. Increased use in a variety of other areas, such as medical diagnosis and analysis and crime scene investigation [12][13][14][15][16], has demonstrated its viability as a reliable sensing paradigm. Most recently, the Raman technique is represented on the Mars 2020 rover, Perseverance, through the SuperCam and SHERLOC instruments [10]. SuperCam utilises a 532 nm excitation wavelength to take readings and differs from other instruments by reliably being able to take readings from further away (up to 7 m).
Raman spectroscopy exploits reactive shifts in the frequency of monochromatic light upon interaction with [molecular species in] target materials. Inelastic scattering occurs when monochromatic laser light of wavelength $\lambda_0$ and energy $E_0 = hc/\lambda_0$ is incident on a surface, where $h$ is Planck's constant and $c$ is the velocity of light [17]. In this process, the light that interacts with the matter will either be scattered or absorbed. Of the scattered light, the majority will scatter elastically, in a process known as Rayleigh scattering. However, a proportionally small amount will scatter inelastically, producing an emission "Raman spectrum" of light, 6-8 orders of magnitude weaker than the source [17,18].

Raman Spectrographic Artifacts
Raman spectroscopy is sensitive to a number of factors that can prevent the acquisition of scientifically viable readings. Two of the conditions that can require retaking of the sample or post-processing are Fluorescence (A) and Cosmic Ray Spikes (B) [19][20][21].

Fluorescence Background and other Impacts to Data Quality
One of the major obstacles to obtaining well-defined spectra is a condition known as fluorescence background [12,18,20-26]. This has the effect of "over-saturating" the spectra, such that the identifying molecular peaks are no longer visible. Many molecules that respond to Raman excitation can also exhibit fluorescence behaviour upon excitation, so this fluorescence state is frequently encountered in complex samples [22]. This is unfortunate, because apparent complexity, particularly in a planetary exploration situation, is often the reason that a target is selected in the first place. Fluorescence can also have the knock-on effect of increasing shot noise, a saturation effect associated with Rayleigh scattering [8,9,27], further adding to the difficulty of validation.
A variety of hardware and software-based methods have been explored for quenching fluorescence background and enhancing weak Raman signals. These range from modifying the aperture time of the Raman instrumentation, so that there isn't enough time for the fluorescence to develop in the signal [12,18,20-26], to applying a baseline detection algorithm to estimate a clean signal [28]. All these techniques interfere with the data itself, by either modifying the conditions for acquisition or altering the data directly, which impacts interpretation.
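To illustrate how a software baseline estimate alters the data, the following is a minimal sketch of one generic scheme, an iterative polynomial fit in which points above the fitted curve are clipped to it so that Raman peaks stop pulling the baseline upwards. It is not the algorithm of [28]; the function name `estimate_baseline`, the polynomial degree, the iteration count and the synthetic spectrum are all assumptions made for illustration.

```python
import numpy as np

def estimate_baseline(intensity, degree=3, n_iter=50):
    """Iteratively fit a polynomial under a spectrum (illustrative only)."""
    y = np.asarray(intensity, dtype=float)
    # A scaled axis in [-1, 1] keeps the polynomial fit well conditioned.
    axis = np.linspace(-1.0, 1.0, y.size)
    work = y.copy()
    for _ in range(n_iter):
        coeffs = np.polyfit(axis, work, degree)
        fit = np.polyval(coeffs, axis)
        # Clip anything above the fit so peaks stop influencing the next fit.
        work = np.minimum(work, fit)
    return fit

# Synthetic example (hypothetical data): curved background plus one narrow peak.
x = np.linspace(0.0, 1.0, 500)
background = 100.0 + 80.0 * x + 40.0 * x**2
peak = 200.0 * np.exp(-0.5 * ((x - 0.5) / 0.01) ** 2)
spectrum = background + peak

baseline = estimate_baseline(spectrum, degree=2)
corrected = spectrum - baseline  # the returned data are no longer the raw reading
```

The subtraction in the last line is the point of concern raised above: the corrected spectrum depends on the chosen degree and iteration count, not only on the instrument reading.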
Another artifact that can appear in spectra is the cosmic ray spike. These manifest as sharp spikes that often extend vertically by a disproportionate amount. These spikes are representative of low level radiation events and appear randomly across different areas of the spectrum [9]. Generally, the most used technique for removing them is to compare spectra of the same target and remove spikes that are not contained in both [8,9].
The third major consideration when evaluating Raman return signals is the signal-to-noise ratio (SNR) present in the reading. In general, the higher the signal-to-noise ratio, the better the quality of the reading. Various mathematical and algorithmic methods have been utilised to improve SNR, including derivative and wavelet methods [20], Fourier filtering and least squares analysis [21]. In this regard, some of the best results came from the use of Principal Component Analysis, where the authors demonstrate that this technique reduced acquisition times, allowing for it to be used in real-time applications [21].
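The comparison-based spike removal described above can be sketched as follows. This is a simplified illustration, not the procedure of [8,9]: it assumes two acquisitions of the same target on the same axis, and the function name `remove_cosmic_spikes`, the robust threshold of five noise scales and the synthetic scans are assumptions.

```python
import numpy as np

def remove_cosmic_spikes(spec_a, spec_b, threshold=5.0):
    """Replace points that spike in only one of two repeat acquisitions."""
    a = np.asarray(spec_a, dtype=float)
    b = np.asarray(spec_b, dtype=float)
    diff = a - b
    # Robust noise scale from the median absolute deviation of the difference.
    mad = np.median(np.abs(diff - np.median(diff)))
    scale = 1.4826 * mad if mad > 0 else 1.0
    spike_in_a = diff > threshold * scale
    spike_in_b = -diff > threshold * scale
    # A spike present in one scan only is replaced by the other scan's value.
    cleaned_a = np.where(spike_in_a, b, a)
    cleaned_b = np.where(spike_in_b, a, b)
    return cleaned_a, cleaned_b, spike_in_a | spike_in_b

# Synthetic repeat scans (hypothetical data) with one spike in each.
rng = np.random.default_rng(0)
x = np.arange(400)
signal = 50.0 + 100.0 * np.exp(-0.5 * ((x - 200) / 6.0) ** 2)
scan_a = signal + rng.normal(0.0, 1.0, x.size)
scan_b = signal + rng.normal(0.0, 1.0, x.size)
scan_a[120] += 500.0  # spike only in the first scan
scan_b[300] += 400.0  # spike only in the second scan

clean_a, clean_b, flagged = remove_cosmic_spikes(scan_a, scan_b)
```

A genuine Raman peak appears in both scans and therefore cancels in the difference, which is why it is left untouched.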

Raman Technique Compatibility with Automation and Machine Learning
For commonly used spectra-based techniques such as Laser Induced Breakdown Spectroscopy (LIBS) [29], the Alpha Particle X-Ray Spectrometer (APXS) and Mass Spectroscopy, a peak at a specific wavelength will usually indicate a specific material substance, and the result can be reproduced reliably, regardless of mitigating factors [30,31]. For Raman, that is not necessarily the case. Here, the same species or molecule can have multiple signatures at different positions. This aspect adds an extra challenge to the automated classification of Raman data. Together with the effects of the various conditions under which a sample has been captured, it leads to an intrinsic multi-variate uncertainty that suggests a complex problem space. The increasing application of machine learning in planetary science and space exploration for problems residing in complex problem spaces [4,7,16,28,29,32-41] suggests an opportunity to further apply these techniques.

Raman Analysis for Analogue and Live Planetary Rover Missions
Terrestrial analogues are locations on Earth that approximate conditions on other planetary bodies. These locations allow for comparative planetology studies, technology development/demonstration and the training of personnel.

The CanMoon analogue mission had the following objectives:
1. Compare the accuracy of selecting lunar samples remotely from mission control versus a traditional human field party
2. Test the efficiency of remote science operations including the use of pre-planned strategic traverses
3. Evaluate the utility of real-time automated data analysis approaches for lunar missions
4. Explore the mission control operations structure for 24/7 lunar science operations
5. Test how Virtual Reality (VR) technology can be used to help with enhancing the situational awareness in mission control
In order to achieve this, a full complement of instrument types was deployed in the field, with science and navigation targeting controlled from Mission Control at Western University, over the course of several days [1,6,42]. During the mission, one of the key responsibilities of the Raman specialist was to determine data viability for analysis [7]. The lead Raman specialist was a geoscience PhD student who had been training to recognise Raman data from a variety of instruments and conditions. Every spectrum returned from targeting activities was manually evaluated in order to provide a "Good", "Bad" or "Potential" categorisation. This indicated whether the data met the criteria to attempt a detailed scientific classification. This task was non-trivial, requiring significant time on the downlink end of operations to complete. Because of the complexity of the analysis, it could not be completed automatically during operations time, or via a well-defined pre-mission training process. Amongst the Raman instrument team, there was often vigorous debate as to whether the data received contained a viable spectrum, indicating the challenge in interpreting results. Competence in Raman data validation takes skill, patience and experience. In addition, because analogue missions are required to maintain fidelity to real planetary surface operations, data can only be reviewed intermittently (during communication cycles), and bandwidth is limited [2,5,42,45]. So, in this case, the "best" option is to send all collected data back for further processing and have a trained technician perform data validation before continuing on to extract science. This is not optimal, and makes it likely that at least some resources will be wasted during the communication cycle.

Algorithmic Processing for Raman Data
Work in this area revolves around using numerical and computational techniques to treat data, so that all the data returned can be considered viable. Here, a variety of work has been done in the area of fluorescence suppression [21,22,25,26], spectra identification using standard algorithmic processing [15,24,28,46] and more advanced techniques, such as machine learning and/or CNNs [33,34,46-49], to directly identify the molecular signatures.
However, there are limitations to this. Some spectra cannot be improved by these techniques. There is also the technical point that these techniques alter the original data, and we believe there is significant merit in preserving the original form of the data before it is evaluated by a trained technician.
Taking into account these points, plus the sensitivity of the readings to additional conditions not related to the physical construction of the instrument, an argument can be made that the option most likely to result in the best data analysis is to start with the best spectra possible.
Thus, a reliable machine learning model that can perform initial data validation in the same way as an experienced technician could prove invaluable in saving resources. By providing in-situ analysis of quality as data is acquired, an automated command cycle can be implemented, with the aim that only valid data is returned for evaluation.
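The automated command cycle described above can be sketched as a simple retry loop. This is a hypothetical outline rather than flight software: `acquire` and `classify` stand in for the platform's instrument driver and the trained validation model, and the retry limit of five is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

Spectrum = Sequence[float]

def acquire_valid_spectrum(
    acquire: Callable[[], Spectrum],
    classify: Callable[[Spectrum], str],
    max_attempts: int = 5,
    accepted: Tuple[str, ...] = ("Good",),
) -> Tuple[Optional[Spectrum], int]:
    """Retake a reading until the validator accepts it or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        spectrum = acquire()
        if classify(spectrum) in accepted:
            return spectrum, attempt
    # Nothing acceptable was obtained; the caller decides what to downlink.
    return None, max_attempts

# Stand-in instrument and validator (hypothetical): the first reading is
# saturated, the second is usable.
readings = iter([[65535.0] * 4, [10.0, 80.0, 12.0, 9.0]])

def fake_acquire() -> Spectrum:
    return next(readings)

def fake_classify(spectrum: Spectrum) -> str:
    return "Bad" if max(spectrum) >= 65535.0 else "Good"

spectrum, attempts = acquire_valid_spectrum(fake_acquire, fake_classify)
```

Passing `accepted=("Good", "Potential")` would let readings that may be recoverable through later treatment be returned as well.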

Neural Network Approaches to Raman Analysis
There are many works in the literature focussing on identification of specific spectral signatures, in a variety of different fields [17,19,27,32,46,49]. The focus of this work is not to propose another solution to that problem but, instead, to focus on initial quality assurance/validation, based on generalised criteria.
While the criteria are more general, the act of categorisation is still a non-trivial operation, particularly because the data conditions, such as the wide variance of SNR in samples, make it difficult to simply cross-check with a database, for example.
As mentioned previously, this can only be done efficiently once the human technician has gone through the process many times. In order for the technician to become competent, the process must be completed using multiple different instruments, instrument types (field, laboratory, handheld, etc.), and different conditions under which readings are taken. All these variables can affect the resulting spectral signature of the data products produced, even if the target is the same [9].
Recognition of all these elements feeds into the expert technician's decision as to whether a data product can be considered to contain data of high quality (Good), unusable data (Bad) or data that can be further treated to produce usable data (Potential).
This kind of multi-variate problem, where a "correct" answer is not easily defined, is generally considered a space to which neural networks are well suited. This work will examine the viability of a simple, deep 2D Convolutional Neural Network in performing this generalised categorisation on Raman Spectra.

Software Pipeline
The GRaVN pipeline reads each comma-separated data file and generates a 300 x 300 pixel image of the spectrum for consumption by the GRaVN network. The larger than usual image size allows enough base resolution to reliably pick up detail at small and large scales. Once this operation is complete, each image is converted to greyscale and matched with the appropriate label before being preprocessed, ready to pass into the CNN. More information on the preprocessing step is given in section 2.4.1 and information on the neural network itself is given in section 2.2.
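A minimal sketch of the rendering step is given below, assuming a two-column file of Raman shift and intensity. It is not the GRaVN implementation: the function name `spectrum_to_image` is hypothetical, and for brevity only one pixel per image column is drawn, whereas the actual pipeline may well render a connected line plot.

```python
import io
import numpy as np

def spectrum_to_image(shift, intensity, size=300):
    """Rasterise a spectrum into a size x size single-channel image."""
    shift = np.asarray(shift, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    order = np.argsort(shift)  # np.interp needs an increasing x axis
    shift, intensity = shift[order], intensity[order]
    # Resample to one intensity value per pixel column.
    columns = np.linspace(shift.min(), shift.max(), size)
    resampled = np.interp(columns, shift, intensity)
    # Scale intensities to pixel rows; row 0 is the top of the image.
    span = resampled.max() - resampled.min()
    norm = (resampled - resampled.min()) / span if span > 0 else np.zeros(size)
    rows = (size - 1) - np.round(norm * (size - 1)).astype(int)
    image = np.zeros((size, size), dtype=np.float32)
    image[rows, np.arange(size)] = 1.0
    return image

# Hypothetical comma-separated content standing in for one data file.
csv_text = "\n".join(
    f"{s:.1f},{100.0 + 900.0 * np.exp(-0.5 * ((s - 464.0) / 8.0) ** 2):.2f}"
    for s in np.arange(100.0, 1200.0, 2.0)
)
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")
image = spectrum_to_image(data[:, 0], data[:, 1])
```

Because the image is already single-channel with values in [0, 1], no separate greyscale conversion is needed in this sketch.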

Network Architecture and configuration
The convolutional network consists of three 2D convolutional layers with 16, 32 and 64 filters respectively, each with a window size of 3x3 and ReLU activation. This is then followed by a 2D max pooling layer with a pool size of 2x2. The feature maps are flattened and then passed to two fully connected layers. The first consists of 1024 neurons and an L2 regulariser. The activation is a sigmoid function and dropout of 0.5 is applied to the layer. The following layer utilises the same configuration but with 256 neurons. The final layer is a fully connected output layer with a number of neurons equal to the total number of classes, with a standard softmax activation. Sparse Categorical Cross Entropy was used as the model loss function, with the ADAM algorithm used as the optimiser. See Fig. 3.
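The size of the flattened feature vector that feeds the first fully connected layer follows from the layer description by simple arithmetic, sketched below. The padding and stride are not stated in the text, so this assumes unpadded, stride-1 convolutions; it also evaluates two possible readings of where pooling is applied. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width of a non-overlapping max pool (floor division)."""
    return size // pool

def flattened_features(size, filters=(16, 32, 64), pool_after_each=False):
    """Number of values passed to the first fully connected layer."""
    for _ in filters:
        size = conv_out(size)
        if pool_after_each:
            size = pool_out(size)
    if not pool_after_each:
        size = pool_out(size)  # single pooling layer after the conv stack
    return size * size * filters[-1]

# Reading 1: one pooling layer after the three convolutions.
single_pool = flattened_features(300)  # 147 * 147 * 64
# Reading 2 (assumption): pooling after every convolution.
pool_each = flattened_features(300, pool_after_each=True)  # 35 * 35 * 64
```

Under these assumptions the two readings give 1,382,976 and 78,400 features respectively, which shows how strongly the 300 x 300 input size and the pooling placement drive the width of the first dense layer.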

The Dataset
The dataset was manually curated from multiple sources by a trained PhD-level Raman specialist. It consists of a total of 3691 spectra, each labelled as Good, Bad or Potential. The breakdown between the categories is given in Table 1. Approximately 3500 samples were taken from the 532 nm catalog in the RRUFF database, with the remainder manually curated from other sources, including lab-based and handheld spectrometers such as the DeltaNu RockHound.

Data Preprocessing
As poor quality readings aren't generally considered useful, it is far more common to find examples of good spectra than any other. For the purpose of training a CNN-derived model, designed to sort the spectra into different classifications, this leads to what is referred to as an imbalanced dataset. The problem of imbalanced datasets is a common

Sampling Methods
We chose to employ random sampling to smooth network learning between categories.
In this case, we chose two techniques: Random Undersampling (RU) and Random Oversampling (RO). These strategies are commonly accepted substitutes for a large, balanced dataset [50][51][52], but both have trade-offs in implementation: RU limits the variation available in training, more so the larger the class is, while RO potentially oversamples from the smaller classes. Results from both are examined to determine which, if any, works better for our case.
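The two sampling strategies can be sketched in a few lines. This is a generic illustration rather than the code used in this work; the function names, the fixed seed and the toy class counts are assumptions (the real counts are those of Table 1).

```python
import random
from collections import Counter, defaultdict

def _group_by_label(samples, labels):
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    out = [(s, lab) for lab, group in groups.items()
           for s in rng.sample(group, target)]
    rng.shuffle(out)
    return out

def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    out = []
    for lab, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out.extend((s, lab) for s in group + extra)
    rng.shuffle(out)
    return out

# Toy imbalanced set (hypothetical counts): 8 Good, 2 Bad, 3 Potential.
labels = ["Good"] * 8 + ["Bad"] * 2 + ["Potential"] * 3
samples = list(range(len(labels)))

ru_counts = Counter(lab for _, lab in random_undersample(samples, labels))
ro_counts = Counter(lab for _, lab in random_oversample(samples, labels))
```

The trade-off is visible in the toy counts: RU keeps only 2 of the 8 Good samples, while RO repeats each Bad sample roughly four times.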

Experiments and Results Validation
We tested the network using a variation on the 10-Fold Cross Validation methodology. In our case, each "fold" is an experimental run consisting of a training and validation cycle, followed by a testing (evaluation) cycle on unlabelled data. Because of the limitations of the size of our dataset, each run uses the entire dataset. The dataset is split 80%/20% train/test, with the same split on the training set for train/validation.
For each RO run, the train/test split occurs first, immediately followed by the sampling operation. This ensures that the train set contains no spectra that are in the test set.
For RU, the sampling operation occurs first, followed by the train-test split. This, again, ensures that no evaluation spectra are contained in the training set. Numbers of samples for each category are contained in Table 1.
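The reason the ordering matters for RO is that oversampling duplicates spectra: if it ran before the split, copies of the same spectrum could land on both sides. A minimal sketch of the split-first ordering follows, with hypothetical sample identifiers and class counts, and a hypothetical function name.

```python
import random
from collections import Counter

def split_then_oversample(dataset, test_fraction=0.2, seed=0):
    """dataset: list of (sample_id, label) pairs. Returns (train, test)."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The test set is fixed before any duplication takes place.
    test, train = shuffled[:n_test], shuffled[n_test:]

    by_label = {}
    for item in train:
        by_label.setdefault(item[1], []).append(item)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Duplicates are drawn from training items only.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced, test

# Toy data (hypothetical): 40 Good and 10 Bad spectra identified by integer id.
dataset = [(i, "Good") for i in range(40)] + [(i, "Bad") for i in range(40, 50)]
train, test = split_then_oversample(dataset)
leaked = {sid for sid, _ in train} & {sid for sid, _ in test}
train_counts = Counter(label for _, label in train)
```

For RU no duplicates are created, so sampling before the split cannot place the same spectrum in both sets.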
Experiments consisted of standard evaluation of accuracy, loss and time to completion metrics for training/validation, and, accuracy and loss for evaluation. Training cycles were set for a maximum of 30 epochs, with early stopping activated after no accuracy score improvement for 5 epochs.
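The stopping rule can be sketched independently of any training framework. In this sketch `run_epoch` is a hypothetical callable that trains for one epoch and returns the monitored accuracy; that the monitored quantity is validation accuracy is an assumption, and the accuracy curve is invented for illustration.

```python
def train_with_early_stopping(run_epoch, max_epochs=30, patience=5):
    """Stop once accuracy has not improved for `patience` consecutive epochs."""
    best_acc = float("-inf")
    best_epoch = 0
    history = []
    for epoch in range(1, max_epochs + 1):
        acc = run_epoch(epoch)
        history.append(acc)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_acc, best_epoch, history

# Invented accuracy curve that peaks at epoch 4 and then plateaus.
fake_curve = [0.50, 0.62, 0.70, 0.74, 0.73, 0.74, 0.72, 0.73, 0.71, 0.70, 0.69]
best_acc, best_epoch, history = train_with_early_stopping(
    lambda epoch: fake_curve[epoch - 1]
)
```

With this curve training halts after epoch 9, five epochs after the best score at epoch 4, rather than running to the 30-epoch limit.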

Results
For RU, figure 5 shows accuracy at evaluation time consistently remains between 70% and 80% approx. for each run. Loss tends to fluctuate more, between approximately 0.68 and 1.0. Training time varied between approximately 350 and 850 seconds for each run.
Conversely, the RO results fluctuate far less. Accuracy remains in a tighter band between 80% and 90%, while loss remains between approximately 0.65 and 0.8. Time to completion for each run was much longer, however, ranging from just over 6500 to well over 9500 seconds, which is to be expected given the larger overall size of the training set.
When comparing the confusion matrices overall (Table 2), RO and RU have almost exactly the same standard deviation (std) and variance for the Good and Bad classes, while oversampling is better for the Potential class. The average correct classification score for Good is higher for oversampling, while for the Bad and Potential classes undersampling averages better. Table 3 shows that the RO average accuracy is higher and the RO average loss is lower, suggesting better performance than with an RU-treated dataset. In this case, the lower std and variance for RO confirm the lower rate of variation for RO shown in Figure 6, as compared to RU in Figure 5.
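The summary statistics compared above (mean, std and variance across runs) can be computed as in the following sketch. The per-run accuracies are invented placeholders chosen only to sit inside the reported 70%-80% and 80%-90% bands; they are not the values of Tables 2 or 3. Whether population or sample statistics were used is not stated, so the population form is an assumption.

```python
import statistics

def summarise_runs(per_run_scores):
    """Mean, population std and population variance over experimental runs."""
    return {
        "mean": statistics.mean(per_run_scores),
        "std": statistics.pstdev(per_run_scores),
        "var": statistics.pvariance(per_run_scores),
    }

# Placeholder evaluation accuracies for ten runs (NOT the paper's results).
ru_accuracy = [0.72, 0.78, 0.74, 0.76, 0.70, 0.79, 0.75, 0.73, 0.77, 0.76]
ro_accuracy = [0.84, 0.86, 0.85, 0.87, 0.83, 0.86, 0.85, 0.84, 0.86, 0.85]

ru_summary = summarise_runs(ru_accuracy)
ro_summary = summarise_runs(ro_accuracy)
```

A lower `std` alongside a higher `mean` is the pattern that would indicate both better and more stable performance for one sampling scheme.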
During evaluation, the GRaVN model classifies with accuracy scores consistently above 70% for both sampling techniques. That is, after the training cycle, the CNN classified unseen data correctly at least 70% of the time in every run. We contend that this demonstrates the robustness of the GRaVN model for this task. RO showed the best results for evaluation accuracy, with scores consistently between 80% and 90%, as opposed to between 70% and 80% for RU.

Discussion
In the case of RO, the test set is extracted before the sampling operation, then elements of the smaller classes are randomly duplicated until all classes have the same number of samples as the largest class. What these results indicate is that, for RO, the network learns the Good class with a high level of accuracy but, because of the proportional difference between the class sizes, the network will overtrain on the smaller classes. This means that, when presented with the test/evaluation set, the network finds it difficult to accurately identify the Bad class in particular, while showing a high level of accuracy when classifying a Good spectrum.
For RU, the result is slightly different. In this case, to preserve the integrity of the test set, the sampling operation is performed first, then the 80%/20% train/test split is done. This means that training occurs on even numbers of each class but those numbers are smaller, resulting in lower accuracy scores overall, but a more even distribution of accuracy between the classes. We consider this level of generalisation challenging for a network to learn, especially with the highly imbalanced nature of the dataset. This assessment also considers the nature of the data being evaluated. A 2D convolutional network examines images pixel by pixel. When examining images of spectra like this, the vast majority of the information is contained in a small area. Within this area is a very large variation because not only can the SNR vary, but so can the geometry and amplitude because we are not limiting training to specific signatures.
These factors create a broad problem space on which to attempt to apply solutions. Therefore, we consider this result positive and a good platform to build upon. We intend to continue modifying our neural network layer depth and width, to further improve our accuracy and loss scores. However, because of the nature of the problem, we would not necessarily predict that a simple increase in network depth and/or width (and therefore an increase in the number of trainable parameters) would guarantee an improvement in network output statistics.
He et al. first recorded that simply stacking increasing numbers of convolutional blocks in network architectures limits optimisation, and leads to diminishing returns in accuracy through overfitting and other artifacts such as the degradation problem [53]. Skip connections, which allow network information to skip layers, rather than passing directly between them, were devised as a method of maintaining the trainability of models, while increasing their depth. This would be an interesting avenue to pursue in increasing network complexity, while continuing to improve accuracy.
Further, in the area of mineral classification, certain network architectures [32] have demonstrated promising results by passing different parts of the input data through the network separately, and concatenating these into a single dense layer for classification later in the process. A network developed utilising these tools, along with a larger dataset with a higher proportion of Bad spectra, could learn to differentiate far more of the characteristics of a Raman spectrum, and thus be more accurate.
Lastly, it would be interesting to compare our network to some well-known convolutional frameworks. The closest problem to our work would be handwriting recognition, although at the pixel scale there is more variation in our problem because of the broad range of SNR, a factor that rarely plays a role in handwriting or letter/digit recognition. A comparison between well-known networks like VGG or ResNet and our method could be informative in a variety of different areas.

Conclusion and Future Work
We have presented a convolutional neural network solution for Raman spectrum processing, focusing on qualitative validation rather than specific classification. The network was able to accurately sort untreated spectra into categories as specified by a trained technical expert in the field.
The results indicate that networks developed to perform the more qualitative evaluative tasks are a viable option and, that further development to improve output scores is a legitimate direction to pursue.
We suggest that an algorithm such as this could be the basis of a sequence control loop. Automated space platforms could use it to validate and pre-select data in-situ during planetary missions, or other autonomous robotic activities, where Raman is utilised.
Regarding the sampling techniques for training, if mission parameters specified a two-class spectra evaluation, then the RO technique would be the better choice. In this case, mission operators would only be interested in returning spectra classed as Good. If mission parameters dictated that a more granular validation is preferable, then an RU training scheme could be implemented. Table 4 shows that the network processing time per sample is on the order of tens of milliseconds, regardless of sampling type, and indicates the time scale required for a platform to obtain a reading.
Continuing work in this area would focus on addressing the dataset imbalance issue by acquiring more examples of poorer quality spectral readings, increasing the overall size of the dataset and, perhaps, even expanding the classification to different Raman sources.
We would welcome input from the Raman community in providing us with [labelled] examples of Bad or Potentially Bad/Good spectra readings, regardless of instrument or conditions. As stated above, it is easy to find examples of good readings, but this algorithm's success depends on knowing what a bad reading looks like. So please, don't discard poor data, send it to us. It would allow us to further increase robustness by including data classified from multiple experts.
Unlike with other techniques, internal instrument components, wavelengths used and sample interval can all vary between different Raman instruments. An interesting study would be to build a network capable of equalising all these factors, providing a tool that could accurately validate regardless of any mitigating construction or environmental factors.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. Code developed in this study is available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.