Mushroom’s Evaluation Based on FT-IR Fingerprint and Chemometrics

Edible mushrooms have long been recognized as highly nutritional food, valued for their specific flavor and texture and for their therapeutic effects. This study proposes a new, simple approach, based on FT-IR analysis followed by statistical methods, to differentiate three wild mushroom species from the Romanian spontaneous flora, namely Armillaria mellea, Boletus edulis and Cantharellus cibarius. The preliminary data treatment consisted of data set reduction by principal component analysis (PCA), which provided the scores used by the subsequent methods. Linear discriminant analysis (LDA) correctly classified 100% of the samples in the initial classification, and the cross-validation step returned 97.4% correctly classified samples. Only one A. mellea sample overlapped with the B. edulis group. When kNN was used in the same manner as LDA, the overall percentage of correctly classified samples in the training step was 86.21%, while for the holdout set the percentage rose to 94.74%. The lower value obtained for the training set was due to one C. cibarius sample, two B. edulis samples and five A. mellea samples, which were assigned to other species. For the holdout set, only one B. edulis sample was misclassified. Fuzzy c-means clustering (FCM) successfully grouped the investigated mushroom samples according to their species: in every partition the predominant species had the highest degrees of membership (DOMs), while samples belonging to the other species had lower DOMs.


Introduction
Edible mushrooms have long been recognized as highly nutritional food, valued for their specific flavor and texture and for their therapeutic effects. From the nutritional point of view, mushrooms are an important source of proteins, fibers, minerals and polyunsaturated fatty acids, in proportions that vary widely among species. Regarding vitamin content, they represent the only vegetarian source of vitamin D [1] and an important source of B-group vitamins [2]. Moreover, mushrooms also serve as a vegetarian source of protein [3]. Wild mushrooms, in turn, are thought to be richer in flavor, taste, texture, nutrition and medical effects [4].
Due to their beneficial effects on human health, the demand for mushrooms is continuously growing and is expected to grow even more in the future. It is well known that, due to their soft texture, mushrooms have a short shelf life of around five days, and different post-harvest procedures are usually applied in order to keep them available for as long as possible [5].
There are three main classes of preservation procedures: thermal (drying/freezing), chemical (edible coatings, films, washing solutions) and physical (packing, irradiation, pulsed electric field, ultrasound) [6]. Another reason for applying conservation steps in mushroom preparation is the seasonal variability of some wild species. All these conservation methods also contribute to the preservation of nutritional and nutraceutical value. Every procedure has advantages and drawbacks. For example, drying, which is the first method of choice [7], gives dried mushrooms a richer flavor than fresh ones, but modifies the content of bioactive compounds and nutrients [8].
In Romania, for the fifth consecutive year, the market recorded an increase in the quantities of exported wild mushrooms. The main destination countries were Italy, followed by Hungary and Spain. China is the main producer of cultivated edible mushrooms. Collecting wild edible mushrooms for consumption is widely practiced in many countries, including Romania [9,10]. The consumption of mushrooms is expected to increase, since consumers are becoming aware of the benefits they bring to the diet [11]. Among the analytical techniques able to evaluate different types of compounds in food matrices, such as nuclear magnetic resonance (NMR) and high performance liquid chromatography (HPLC) [12], Fourier-transform infrared spectroscopy (FT-IR) is one of the most widely used methods for identifying chemical compounds and elucidating chemical structure. Its main advantages are rapid, reagent-free and high-throughput operation across a wide range of matrices [13]. It allows rapid and simultaneous characterization of the functional groups of different compound classes, such as lipids, proteins and polysaccharides [14]. In the field of food quality and control, FT-IR spectroscopy is an important tool, due to its low operating costs and good performance [15]. The recorded FT-IR spectrum represents a global assessment of a specific matrix, more precisely a molecular fingerprint, which is well suited to the characterization, differentiation or identification of different matrices, including mushrooms [16].
The aim of the present study was to differentiate three mushroom species (Armillaria mellea, Boletus edulis and Cantharellus cibarius) by developing a differentiation tool that couples a fast and efficient analytical technique with different chemometric methods.

Sample collection
To fulfil the aim of this study, 77 wild-grown mushroom samples, belonging to three different species, namely Armillaria mellea, Boletus edulis and Cantharellus cibarius, were collected and analyzed. The samples were collected during the summer of 2019 from different geographical areas located mainly near Cluj County, Romania. The distribution of samples according to species was as follows: Armillaria mellea, 12 samples; Boletus edulis, 31 samples; and Cantharellus cibarius, 34 samples.

Sample preparation and analysis
In the laboratory, the samples were dried in an oven at 60°C until constant weight.
Subsequently, the dried samples were ground into a fine powder and stored at 4°C for further analysis. The powder of each sample was mixed uniformly with KBr and then pressed into a tablet using a tablet press. The FT-IR spectrometer (PerkinElmer, USA) used for the mushroom analysis was equipped with a thermal deuterated triglycine sulfate (DTGS) detector. The spectral range was 4000-400 cm⁻¹, with a resolution of 4 cm⁻¹. For each sample the spectrum consisted of 64 scans, which were performed in triplicate and averaged. After recording, and prior to any further chemometric processing, all spectra were smoothed with the Savitzky-Golay algorithm and linear baseline corrected. The spectra were then imported into OriginPro 2017 (OriginLab, Northampton, USA) and subjected to [0,1] normalization.
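As a rough illustration of the baseline correction and [0,1] normalization steps described above, the following Python sketch processes a single spectrum. It is not the authors' procedure: the function names are hypothetical, the baseline is assumed to be the straight line joining the first and last points of the spectrum, and Savitzky-Golay smoothing is left out because the window length and polynomial order are not reported.

```python
import numpy as np

def linear_baseline_correct(y):
    """Subtract the straight line joining the first and last points (assumed baseline)."""
    y = np.asarray(y, dtype=float)
    baseline = np.linspace(y[0], y[-1], y.size)
    return y - baseline

def min_max_normalize(y):
    """Rescale a spectrum to the [0, 1] interval."""
    y = np.asarray(y, dtype=float)
    span = y.max() - y.min()
    if span == 0:
        # A flat spectrum carries no intensity information; return zeros.
        return np.zeros_like(y)
    return (y - y.min()) / span

# Hypothetical five-point "spectrum", for illustration only.
spectrum = [1.0, 3.0, 2.0, 5.0, 3.0]
processed = min_max_normalize(linear_baseline_correct(spectrum))
```

After this treatment every spectrum spans the same intensity range, so differences in tablet thickness or sample amount weigh less in the later multivariate steps.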

Chemometrics methods
All chemometric analyses were performed using SPSS Statistics version 24 (IBM, USA). The first method applied to the normalized spectra was principal component analysis (PCA). This is one of the most widely used unsupervised pattern recognition techniques. It reduces a large data set to a smaller number of components, called principal components (PCs) or factors, with a minimum loss of the original information. The PCs obtained are uncorrelated and appear in decreasing order of importance. Their eigenvalues measure how much each component contributes to the variance of the data set. Usually, the first two or three components retain a high percentage of the data variance.
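The properties of PCA mentioned above (uncorrelated components, eigenvalues in decreasing order) can be checked with a minimal Python sketch based on the singular value decomposition. This is an illustration only, not the SPSS procedure used in the study: the function name `pca_scores` is hypothetical, the data are mean-centered without scaling, and no rotation is applied.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return PC scores, eigenvalues and explained-variance ratios via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)      # variances along each PC
    explained = eigenvalues / eigenvalues.sum()  # fraction of total variance
    scores = Xc @ Vt[:n_components].T            # project samples onto the PCs
    return scores, eigenvalues[:n_components], explained[:n_components]
```

The rows of `scores` play the role of the reduced variables that are passed on to the classification methods.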
A widely employed supervised chemometric method for classification purposes is linear discriminant analysis (LDA). Because it is a supervised method, a new variable must be created in which every sample receives a code corresponding to the discrimination criterion. LDA finds linear combinations of variables, called discriminant functions (DFs), and uses them to build a predictive model. While constructing the model, the method tries to maximize the distance between classes and to minimize the distance within each class, thus providing a robust classification model based only on representative features. A validation step is also performed, using leave-one-out cross-validation, in which each sample is tested as a new one against a model obtained without that sample [17]. Model performance is evaluated through the percentage of correctly classified samples, a higher percentage suggesting a stronger model. In this study, LDA was applied to identify the specific FT-IR bands that can discriminate the three investigated mushroom species.
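The leave-one-out scheme described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SPSS implementation used in the study: the function names are hypothetical, and the classifier is the textbook linear discriminant rule with a pooled within-class covariance matrix and class priors estimated from the training data.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class means, priors and the inverse pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    pooled = np.zeros((p, p))
    for c, m in zip(classes, means):
        d = X[y == c] - m
        pooled += d.T @ d
    pooled /= (n - len(classes))
    return classes, means, priors, np.linalg.pinv(pooled)

def lda_predict(model, x):
    """Assign x to the class with the highest linear discriminant score."""
    classes, means, priors, inv = model
    quad = np.einsum("ij,jk,ik->i", means, inv, means)
    scores = means @ inv @ x - 0.5 * quad + np.log(priors)
    return classes[np.argmax(scores)]

def loo_accuracy(X, y):
    """Leave-one-out: classify each sample with a model built without it."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = lda_fit(X[keep], y[keep])
        hits += int(lda_predict(model, X[i]) == y[i])
    return hits / len(y)
```

Called on a matrix of PC scores and the species codes, `loo_accuracy` returns the fraction of correctly classified samples, the same kind of figure as the cross-validation percentage reported for the model.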
Apart from LDA, another widely used classification method is k nearest neighbor (kNN), which is one of the simplest machine learning algorithms. This method is based on the similarity between a new sample and the available data, and places the new sample in the category to which it is most similar. An important aspect of this algorithm is that it does not require an explicit training step: it finds the neighbors nearest to the new sample and assigns the sample to the category most represented among them. kNN is therefore suitable for multivariate classification and achieves high classification accuracy when the boundary between categories is clear [18].
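The neighbor search and majority vote described above can be illustrated with a short Python sketch. It assumes Euclidean distance and five neighbors, the settings reported in the Results section. The function name `knn_predict` and the toy data are hypothetical, and the feature selection and importance weighting used in the study are not reproduced.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    dist = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dist)[:k]                  # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy two-class data, for illustration only.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
           [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5]]
y_train = ["A"] * 5 + ["B"] * 5
label = knn_predict(X_train, y_train, [0.2, 0.3])
```

Because there is no fitted model, all of the work happens at prediction time, which is why the method is often described as needing no training.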
Clustering is an unsupervised machine learning technique that groups samples into clusters, so that samples from the same cluster have a high degree of similarity, while samples from different clusters have a low degree of similarity. In fuzzy clustering, each point (sample) has a degree of membership in each cluster, rather than belonging completely to just one cluster, as is the case in traditional k-means. Clustering and classification methods are useful for big data visualization, because they allow meaningful generalizations to be made by recognizing general patterns in the data [19,20]. In fuzzy c-means (FCM) clustering, each point has a weighting associated with each cluster, so a point whose association with a cluster is weak is not considered to lie "in" that cluster. FCM is an efficient algorithm for extracting rules and mining data from data sets in which fuzzy properties are highly common [21,22].
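To make the notion of per-cluster weighting concrete, the following Python sketch implements the standard fuzzy c-means updates. It is a minimal illustration, not the software routine used in the study: the function name, the fuzzifier m = 2, the random initialization and the stopping tolerance are all assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return cluster centers and the membership matrix U (rows sum to 1)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # random initial memberships
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)              # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U
```

Each row of `U` holds the degrees of membership of one sample, so a sample is attributed to the cluster in which its membership is highest while its weaker associations with the other clusters remain visible.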

Results and discussion

FT-IR initial spectra of mushroom samples
As previously mentioned, 77 wild-grown mushroom samples, belonging to three different species, namely Armillaria mellea, Boletus edulis and Cantharellus cibarius, were analyzed. The experimental spectra are presented in Figure 1. The fact that not all spectral differences among the analyzed species can be identified by visual inspection is not surprising for the spectroscopic analysis of complex matrices. Therefore, different chemometric methods are required in order to give a better and more comprehensive characterization of the matrices.

Chemometric processing
For chemometric data processing only the fingerprint region, 1800-400 cm⁻¹, was taken into account. Even so, the resulting FT-IR matrix was still too large to be processed directly, so a factor analysis for dimension reduction, namely PCA, was applied first. The PCA was run using the following key parameters: extraction method, principal components; rotation method, Varimax with Kaiser normalization.
A large number of PCs was obtained, but only the PCs with an eigenvalue higher than one were retained for further analysis. Usually the first PCs obtained explain the largest percentage of the data variation. In this case the first fourteen PCs had eigenvalues higher than one and explained a cumulative variance of 99.53%, and were therefore considered representative for the subsequent chemometric treatment.
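The retention rule used above (keep only components with an eigenvalue higher than one, then report the cumulative variance they explain) can be written as a short Python sketch. The function name and the eigenvalues in the example are hypothetical and serve only to show the arithmetic. They are not the values obtained in the study.

```python
def kaiser_retain(eigenvalues, threshold=1.0):
    """Count components above the threshold and the variance fraction they explain."""
    total = sum(eigenvalues)
    retained = [ev for ev in sorted(eigenvalues, reverse=True) if ev > threshold]
    cumulative = sum(retained) / total
    return len(retained), cumulative

# Hypothetical eigenvalues, for illustration only.
n_kept, cum_var = kaiser_retain([4.0, 2.5, 1.2, 0.8, 0.5])
```

With these illustrative values, three components pass the threshold and together account for 7.7 of the 9.0 units of total variance, about 85.6%.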
For discrimination of the three investigated mushroom species a new variable was created, and each sample received a code corresponding to its species, as follows: code 1 for [...]
The graphical representation is presented in Figure 2.
[...] particular species cannot be identified through spectroscopic techniques [28,29]. The next PC grouped another two spectral regions, 1484-1559 cm⁻¹ and 1598-1695 cm⁻¹. The next two PCs grouped another significant region of the FT-IR spectra, namely 1715-1800 cm⁻¹ and 1548-1561 cm⁻¹.
Taking into consideration that some of the last PCs obtained did not contain any specific spectral regions, and that no specific points were identified, a more powerful classification method was employed, this time using the entire FT-IR spectra as variables.
Among the machine learning algorithms, k nearest neighbor is one of the simplest and most accessible. In this case, kNN was applied to highlight the features used to predict a given mushroom species. The species variable, which holds a specific code for each sample, was set as the target variable of the model. The number of neighbors was set at five, and the distance between neighbors was measured as the Euclidean distance. Feature selection was also applied, with the features weighted by importance [30,31]. The samples were randomly partitioned between the training and holdout sets, in proportions of 70% and 30%, respectively (Table 1). The classification table, obtained after running kNN, is presented below. Regarding feature selection, only three points were selected: 1746 cm⁻¹, 1510 cm⁻¹ and 1388 cm⁻¹. The distribution of samples between the two sets, according to the selected features, is presented in Figure 3.
Table 2 and [...]
The fuzzy c-means clustering analysis successfully grouped the investigated samples according to their species: in every partition the predominant species had the highest DOMs, while samples belonging to the other species had lower DOMs.

Conclusions
The present study demonstrated the great potential of combining an analytical technique such as FT-IR with different chemometric processing methods. By applying LDA to the scores obtained after PCA, the percentage obtained for the initial classification was 100%, and the cross-validation step returned 97.4% correctly classified samples. Only one A. mellea sample overlapped with the B. edulis group. When kNN was used in the same manner as LDA, the overall percentage of correctly classified samples in the training step was 86.21%, while for the holdout set the percentage rose to 94.74%. The lower value obtained for the training set was due to one C. cibarius sample, two B. edulis samples and five A. mellea samples, which were assigned to other