Hyperspectral Image based Biodiversity of Forest Canopy and Marine Benthic Species

Hyperspectral images are an important tool for assessing ecosystem biodiversity in both terrestrial and benthic habitats. To obtain biodiversity indicators that agree with indicators obtained from field data, spectral diversity calculated from images has to be validated against field-based diversity estimates. Plant species richness is one of the most important indicators of biodiversity. This indicator can be measured in hyperspectral images through the Spectral Variation Hypothesis (SVH), which states that spectral heterogeneity is related to spatial heterogeneity and thus to species richness. The goal of this research is to capture spectral heterogeneity from hyperspectral images of a terrestrial neotropical forest site using the Vector Quantization (VQ) method and then use the result to predict plant species richness. The results are compared with those of Hierarchical Agglomerative Clustering (HAC). The process is validated by calculating the Pearson correlation coefficient between the Shannon entropy from actual field data and the Shannon entropy computed from the images. Terrestrial dry forest and marine coastal hyperspectral images with different resolutions have been used for spectral diversity feature validation.


Introduction
Hyperspectral images are obtained from Earth's surface by sensors that detect the reflectance of materials at hundreds of narrow wavelengths, giving a detailed spectral signature of the material that is imaged. Imaging spectroscopy gives spectral signatures that can be used to distinguish species and communities of species in the ecosystem. Spectral diversity is a measure derived from spectral properties at the canopy level of plants, and it can be used as a similarity measure to group communities. Spectral diversity is used to derive entropies that track species richness. Species or plant richness has emerged as one of the most important indicators of biodiversity. Biodiversity can be defined as the variety of species in an area. There are three levels of biodiversity: alpha, beta and gamma. Alpha refers to the diversity of a specific area or ecosystem, beta to the change in species between ecosystems, and gamma to the overall diversity [1]. Based on spectral diversity, or the Spectral Variation Hypothesis (SVH), plant richness can serve as an indicator that can be measured from hyperspectral images. In an earlier paper, we used hierarchical agglomerative clustering of spectrally similar regions and derived the Shannon entropy from the similarity of spectral signatures in these regions through spectral unmixing [2]. This was applied to correlate biodiversity based on the Shannon entropy in the hyperspectral images with biodiversity derived from field measurements in the Guanica dry tropical forest in Puerto Rico. It has been shown that canopy level monitoring of plant spectral diversity can be an indicator of the chemical diversity of tropical forests [3].
Hyperspectral images are a valuable tool for assessing ecosystem biodiversity. In recent years it has become clear that improvements in the spectral and spatial resolution of sensors are needed [4]. A clear advantage of hyperspectral imaging is that it helps to plan the collection of field data more efficiently, given that collecting field data is very expensive in time and resources. Also, developing more accurate analysis tools makes it possible to extend the analysis to larger zones through hyperspectral image processing.
Our goal is to capture spectral heterogeneity using clustering and unmixing methods and then use the result to predict plant species richness. The heterogeneity information can be used to compute entropies such as Shannon, Gini-Simpson and Renyi, among others; the Shannon entropy was chosen since it is commonly used in biodiversity assessment. Other studies try to capture plant diversity starting from spectral heterogeneity; these studies estimate the spectral variability using the radius of hyperspectral clusters. The spectral variability is related to the alpha diversity, which is a measure of the biological diversity of a particular community, habitat or sampling unit. This paper focuses on capturing spectral heterogeneity from hyperspectral images of the terrestrial tropical forest area of Guanica. The spectral heterogeneity is then used to predict plant species richness. Entropy indices are derived from proportional abundances, which are more meaningful than species richness alone.
The paper is organized as follows. Section 2 presents the theoretical background, which includes the algorithms used for hyperspectral image preprocessing and analysis, the methods for spectral variability assessment, and the derivation of biodiversity measures. Section 3 presents the spectral diversity assessment from images of the Guanica dry tropical forest taken at different times, as well as results with Landsat 8 multispectral images. Correlation with field data is presented for the areas for which such data are available. Section 4 presents the spectral heterogeneity assessment from hyperspectral images of benthic marine habitats of Enrique reef, La Parguera. Both sites are located in Puerto Rico. Section 5 presents the discussion, and Section 6 presents the conclusions.

Theoretical Background
The terms biodiversity and diversity have been used widely in the field, but the two terms differ from each other [5]. Diversity refers to the variety between ecosystems (from a pond to a desert), whereas biodiversity, as mentioned before, refers to the variety of species in an area. Biodiversity involves everything that lives. When calculating biodiversity for a specific area, only species that directly affect the ecosystem are considered. This requires field data and close study of ecosystems, which takes a lot of time and money [6][7][8]. Satellite images are a more viable option for biodiversity assessment, as they do not require extensive field sampling effort. Image acquisition is autonomous, as the images are acquired by sensors that are flown above the Earth's atmosphere. These images can be collected irrespective of the situation in the field, including extreme weather and other situations that may impede field data collection. However, field data collection is still necessary to validate the results obtained from satellite remote sensing measurements. Satellite remote sensing is also an excellent way to keep an eye on the health of habitats. Concern about biodiversity has existed for a long time, but now that human impact on ecosystems is increasing [9], biodiversity assessment is becoming ever more important.
In this paper, diversity means spectral diversity as computed from images, and biodiversity means species biodiversity. Two methods are used for computing spectral biodiversity indicators from hyperspectral images: Hierarchical Agglomerative Clustering (HAC) and Vector Quantization (VQ). Previous work with vector quantization has addressed the compression and classification of hyperspectral images [10]. The image preprocessing steps and the procedure for calculating biodiversity are shown in Figure 1.

Hierarchical Clustering
The goal of hierarchical clustering is to build a hierarchy of clusters, that is, a cluster tree. There are two main approaches to building a hierarchy of clusters: agglomerative and divisive. The former approach is used in this work. Hierarchical Agglomerative Clustering (HAC) is a "bottom up" approach in which, at first, each observation constitutes its own cluster; then pairs of clusters are merged successively, generating the hierarchy, until there is just one cluster left. HAC needs two concepts to carry out the process: a metric (distance) and a linkage [11]. The metric measures the dissimilarity between pairs of observations. The linkage measures the dissimilarity between clusters as a function of the pairwise distances between the observations in the compared clusters. HAC can work directly with the dissimilarity matrix, a representation purely in terms of the dissimilarity between pairs of objects. The dissimilarity matrix is an n x n matrix D, where n is the number of observations. An entry dij in D represents the dissimilarity between the ith and jth observations. In subsequent sections, the notation uses x or y for individual observations (more specifically, pixels) and X or Y for groups (clusters) of pixels.
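The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the Euclidean metric, the average linkage, the stopping rule `n_clusters`, and all function names are assumptions made for the example, since the text does not state which metric or linkage was chosen.

```python
import math


def euclidean(x, y):
    """Metric: dissimilarity between two pixels (spectral vectors)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(X, Y, metric):
    """Linkage: mean pairwise distance between pixels of clusters X and Y."""
    return sum(metric(x, y) for x in X for y in Y) / (len(X) * len(Y))


def hac(pixels, n_clusters, metric=euclidean, linkage=average_linkage):
    """Naive bottom-up clustering: merge the two closest clusters until
    only n_clusters remain (n_clusters=1 yields the full hierarchy)."""
    clusters = [[p] for p in pixels]  # each observation starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j], metric)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


# Illustrative two-band pixels (made-up values, not data from this study).
example_pixels = [(0.10, 0.20), (0.11, 0.21), (0.80, 0.90), (0.82, 0.88)]
example_clusters = hac(example_pixels, n_clusters=2)
print(example_clusters)
```

The naive search over all cluster pairs is cubic in the number of pixels, so a sketch like this is only practical for small plots.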

Vector Quantization
Applying an entropy index to a hyperspectral image requires a process for grouping the image pixels. Vector quantization can perform this task efficiently and faster than clustering methods. The VQ method is a classical quantization technique from signal processing that models probability density functions through the distribution of prototype vectors. It works by dividing a large set of points (vectors) into groups that have approximately the same number of points closest to them; each group is represented by its centroid, as in a clustering algorithm. The resulting codebook is suitable for computing the entropy index.
Two important values obtained using the VQ method over the hyperspectral images are the Shannon entropy and Pearson's correlation coefficient. The Shannon entropy is a popular biodiversity index based on the weighted geometric mean of the proportional abundances of the species (it is the logarithm of the reciprocal of that mean). Applying the Shannon entropy to spectral data requires grouping pixels under some criterion; otherwise it is very likely that every pixel is unique, which would produce a constant and, more importantly, meaningless maximum entropy. The second value, Pearson's correlation coefficient between two variables, is defined as the covariance of the two variables divided by the product of their standard deviations.
The Vector Quantization method requires a parameter called the maximum distance. This parameter can be estimated empirically by extracting a random sample of pixels from the image and calculating the "elbow curve" of the distances between these pixels. A distance close to the elbow of the curve, where the distances change from slow growth to fast growth, should be appropriate as the maximum distance.
The vector quantization process aims to build a codebook for the image. Every element of this codebook is called a codeword, and it is a generated representative pixel for its group. The process is iterative and visits all the pixels. Each pixel is compared with every codeword; the closest codeword whose distance is less than or equal to the maximum distance is selected and recomputed as the centroid of its group, now including the new pixel. If no codeword can be associated with the pixel, the pixel is added as a new codeword. Once the whole image is processed, it can be encoded completely in terms of the codebook. The entropy index is calculated by grouping the codewords that replaced the original pixels. For a particular pixel, a neighborhood is selected in order to compute the probabilities, and the entropy obtained for the neighborhood is assigned to this pixel. The neighborhood is selected with a square sliding window centered on the pixel. The image reconstructed from the entropy index is expected to reflect the degree of diversity of the spectral signatures: the higher the entropy value, the more diverse the neighborhood associated with the pixel.
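One possible way to carry out this empirical estimate is sketched below. The text only says that a distance "close to the elbow" should be chosen, so the automatic elbow rule used here (the point of the sorted distance curve farthest from the straight line joining its two ends) is an assumption, as are the Euclidean distance and the function names.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def sorted_sample_distances(pixels, sample_size, seed=0):
    """Sorted pairwise distances within a random sample of pixels."""
    rng = random.Random(seed)
    sample = rng.sample(pixels, min(sample_size, len(pixels)))
    distances = [euclidean(sample[i], sample[j])
                 for i in range(len(sample))
                 for j in range(i + 1, len(sample))]
    return sorted(distances)


def elbow_index(curve):
    """Index of the curve point farthest from the chord joining its
    endpoints (a common heuristic, assumed here for illustration)."""
    last = len(curve) - 1
    rise = curve[last] - curve[0]
    gaps = [abs(rise * i - last * (curve[i] - curve[0]))
            for i in range(len(curve))]
    return max(range(len(curve)), key=gaps.__getitem__)


# A made-up curve that grows slowly and then quickly.
toy_curve = [1, 2, 3, 4, 5, 20, 40, 60]
candidate_max_distance = toy_curve[elbow_index(toy_curve)]
print(candidate_max_distance)
```

In practice the curve would come from `sorted_sample_distances(image_pixels, sample_size)`, and the value suggested by the heuristic would still be checked visually against the plotted curve.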
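The codebook construction and the sliding-window entropy can be sketched as follows. This is a simplified illustration under stated assumptions rather than the code used for the experiments: the Euclidean distance, the single pass in which each pixel keeps the label of the codeword it was assigned to, the clipping of the window at the image border, and all names are choices made for the example.

```python
import math
from collections import Counter


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build_codebook(pixels, max_distance):
    """Return (codewords, labels), where labels[i] is the codeword index
    assigned to pixels[i] during a single pass over the pixels."""
    codewords, counts, labels = [], [], []
    for pixel in pixels:
        best, best_d = None, None
        for idx, codeword in enumerate(codewords):
            d = euclidean(pixel, codeword)
            if d <= max_distance and (best_d is None or d < best_d):
                best, best_d = idx, d
        if best is None:
            # No codeword is close enough: the pixel becomes a new codeword.
            codewords.append(list(pixel))
            counts.append(1)
            labels.append(len(codewords) - 1)
        else:
            # Recompute the codeword as the centroid including the new pixel.
            counts[best] += 1
            n = counts[best]
            codewords[best] = [c + (p - c) / n
                               for c, p in zip(codewords[best], pixel)]
            labels.append(best)
    return codewords, labels


def shannon_entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log(c / total)
               for c in Counter(labels).values())


def entropy_map(label_grid, window=3):
    """Assign to each pixel the entropy of the codeword labels inside a
    square window centered on it (the window is clipped at the borders)."""
    half = window // 2
    rows, cols = len(label_grid), len(label_grid[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighborhood = [label_grid[i][j]
                            for i in range(max(0, r - half), min(rows, r + half + 1))
                            for j in range(max(0, c - half), min(cols, c + half + 1))]
            result[r][c] = shannon_entropy(neighborhood)
    return result


# Made-up 2 x 2 image with two bands per pixel, flattened row by row.
toy_pixels = [(0.10, 0.10), (0.12, 0.10), (0.90, 0.90), (0.88, 0.90)]
toy_codewords, toy_labels = build_codebook(toy_pixels, max_distance=0.1)
toy_grid = [toy_labels[0:2], toy_labels[2:4]]
toy_entropy = entropy_map(toy_grid, window=3)
```

Because codewords move as pixels are added, a second pass that re-encodes every pixel against the final codebook may give slightly different labels; the single pass is kept here only for brevity.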

Unmixing
Statistics offers two different points of view for analyzing data: the frequentist and the Bayesian approaches. For unsupervised unmixing we use the Bayesian method because it is the only view that takes unknown parameters into consideration [12]. Unmixing is a method that takes an image and decomposes each pixel into the spectra of pure endmembers [13]. Each pure endmember is represented as a new band within the unmixed image. This is what we call unsupervised unmixing. In supervised unmixing, on the other hand, the pure endmember spectra are known in advance.
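As an illustration only, the decomposition of a pixel into endmembers is often written with a linear mixing model. The text does not state which mixing model is used, so the linear form and the symbols below are assumptions introduced for this sketch: $\mathbf{x}$ is a pixel spectrum, $\mathbf{e}_m$ is the spectrum of the $m$th of $M$ endmembers, $a_m$ is its abundance in the pixel, and $\mathbf{n}$ is noise.

$$\mathbf{x} = \sum_{m=1}^{M} a_m \mathbf{e}_m + \mathbf{n}, \qquad a_m \ge 0, \qquad \sum_{m=1}^{M} a_m = 1$$

Under this reading, the abundance $a_m$ of each endmember, evaluated at every pixel, forms one band of the unmixed image. In the unsupervised case both the $\mathbf{e}_m$ and the $a_m$ are unknown, whereas in the supervised case the $\mathbf{e}_m$ are given.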

Principal Component Analysis (PCA)
Principal component analysis (PCA) is a very common technique used to reduce noise by transforming a large data set into a smaller one with new variables while preserving the most important information [14]. These new variables are created by algorithms that identify the directions of largest variation in the data and create new representative variables, known as Principal Components (PCs). The analysis uses the eigenvectors of the data covariance matrix, which capture the variance of the data. The new variables are ordered from high variance to low variance, and the low variance variables are usually discarded so that only the high variance ones are analyzed. For a multispectral or hyperspectral image, for example, a new PCA image is created and the bands that represent low variance are usually eliminated before analysis [15]. The main purpose of PCA is to reduce or compress data to facilitate analysis and to visualize the information better.
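A compact sketch of this idea is given below, using only the Python standard library. It is not the implementation used in this work: the eigenvectors are approximated with power iteration and deflation, which is one simple option among many, and the function names, the random starting vector and the number of iterations are assumptions made for the example.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance of observations (rows) over bands (columns)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def principal_components(rows, n_components, iterations=200, seed=0):
    """Return [(variance, direction), ...] ordered from high to low variance."""
    rng = random.Random(seed)
    cov = covariance_matrix(rows)
    d = len(cov)
    components = []
    for _component in range(n_components):
        v = [rng.random() for _ in range(d)]  # arbitrary starting direction
        for _step in range(iterations):
            w = mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left to explain
                break
            v = [x / norm for x in w]
        variance = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append((variance, v))
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[a][b] - variance * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return components


# Made-up two-band pixels lying exactly on the line band2 = 2 * band1.
toy_rows = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
(toy_variance, toy_direction), = principal_components(toy_rows, n_components=1)
print(toy_variance, toy_direction)
```

For an image, each pixel would be one row and each band one column; keeping only the first few components (for example the 6 components used later in this paper) gives the reduced image.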

Maximum Noise Fraction (MNF)
Although linearly transforming data makes it easier to manipulate and analyze, it results in some loss of information. In contrast to PCA, the Maximum Noise Fraction (MNF) transform also takes the noise part of the data into consideration. It can be simply defined as a linear transformation of the data using the noise fraction [16]. This can be computed as follows:

$$\Sigma_S \Sigma_N^{-1} A = A \Lambda$$

where $\Sigma_S$ is the covariance matrix of the signal, $\Sigma_N$ is the covariance matrix of the noise, $A$ is an orthogonal matrix containing the eigenvectors of $\Sigma_S \Sigma_N^{-1}$, and $\Lambda$ is a diagonal matrix containing the eigenvalues that correspond to $A$ [17].

Shannon Entropy
The Shannon entropy is a popular biodiversity index based on the weighted geometric mean of the proportional abundances of the species (it is the logarithm of the reciprocal of that mean) [5]. It is calculated as follows:

$$h = -\sum_{i=1}^{k} p_i \ln p_i$$

where $k$ is the number of species and $p_i$ is the proportional abundance of the $i$th species. Applying the Shannon entropy to spectral data requires grouping pixels under some criterion. Otherwise, it is very likely that every pixel is unique, rendering every $p_i$ equal to $1/k$, which would produce a constant and also meaningless maximum entropy $h = \ln(k)$.
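A short sketch of the entropy computation follows. The function name and the abundance counts are made up for illustration; the example also shows the degenerate case discussed above, in which equal proportions give the maximum value ln(k).

```python
import math


def shannon_entropy(abundances):
    """Shannon entropy h = -sum(p_i * ln(p_i)) from raw abundance counts.
    Classes with zero abundance are skipped (they contribute nothing)."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return sum(-p * math.log(p) for p in proportions)


even_community = [5, 5, 5, 5]        # every p_i equals 1/k, so h = ln(k)
dominated_community = [97, 1, 1, 1]  # one class dominates, so h is small

print(shannon_entropy(even_community), math.log(4))
print(shannon_entropy(dominated_community))
```

The same function applies to species counts from field plots and to counts of pixel groups (clusters or codewords) in an image window, which is what makes the two entropies comparable.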

Pearson Correlation Coefficient
Pearson's correlation coefficient between two variables $u$ and $v$ is defined as the covariance of the two variables divided by the product of their standard deviations [18]:

$$\rho_{u,v} = \frac{\operatorname{cov}(u, v)}{\sigma_u \sigma_v}$$
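The validation step, correlating field entropies with image entropies, reduces to this formula. The sketch below is a minimal illustration; the function name is an assumption and the entropy values are invented for the example and are not results from this study.

```python
import math


def pearson(u, v):
    """Covariance of u and v divided by the product of their standard
    deviations (the common 1/n factors cancel, so they are omitted)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    covariance = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return covariance / (spread_u * spread_v)


# Hypothetical per-plot entropies (illustrative numbers only).
field_entropy = [1.2, 1.8, 2.1, 2.6]
image_entropy = [0.9, 1.5, 1.6, 2.4]
print(pearson(field_entropy, image_entropy))
```

A value near 1 indicates that plots ranked as more diverse in the field are also ranked as more spectrally diverse in the image; the function is undefined when either variable is constant.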

Spectral Diversity of Canopy Species
This section presents the experimental results for the Guanica forest hyperspectral AISA images acquired during 2007 and 2013 (Figure 2). The images have 128 bands. The 2007 image has 1 meter resolution and the 2013 image has 2 meter resolution. There are 30 field plots, each spanning an area of 20 meters x 20 meters. Therefore, the 2007 image was subsetted into circular plots of diameter 20 pixels (Figure 3), and the 2013 image, with its 2 m resolution, into circular plots of diameter 10 pixels (Figure 4). Some of the plots are shown in Figures 3 and 4. The plots are divided into five groups: groups 1 and 2 have 6 plots each, group 3 has 8 plots, and groups 4 and 5 have 5 plots each. Table 1 gives the correlation results for the 2007 plots using HAC and vector quantization. Each of these methods of initial pixel grouping is followed by spectral unmixing. Table 2 gives the correlation results for the 2013 image with the same methods. Vector quantization produces better results in both images. Because of the lower resolution, HAC does not perform very well with the 2013 image. Notably, Table 3 shows very good correlations for alpha diversity using the vector quantization method.
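The relation between plot size, image resolution and plot diameter in pixels can be illustrated with a small sketch. The function names and the way the circle is rasterized (a pixel belongs to the plot if its center lies within the circle) are assumptions made for the example; the text does not describe how the circular subsets were cut.

```python
def plot_diameter_pixels(plot_size_m, resolution_m):
    """Number of pixels spanned by a plot of the given size."""
    return round(plot_size_m / resolution_m)


def circular_mask(diameter):
    """Boolean grid (diameter x diameter) that is True inside the circle."""
    center = (diameter - 1) / 2.0
    radius_sq = (diameter / 2.0) ** 2
    return [[(r - center) ** 2 + (c - center) ** 2 <= radius_sq
             for c in range(diameter)]
            for r in range(diameter)]


diameter_2007 = plot_diameter_pixels(20, 1)  # 20 m plot at 1 m resolution
diameter_2013 = plot_diameter_pixels(20, 2)  # 20 m plot at 2 m resolution
mask_2013 = circular_mask(diameter_2013)
print(diameter_2007, diameter_2013)
```

Only the pixels where the mask is True would then be passed to the grouping step (HAC or VQ) for that plot.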

Result with Dimensionality Reduction Methods
Further experiments were done to determine whether dimensionality reduction produces better results in the diversity calculation. PCA and MNF were applied to the original 128 bands of the images, 6 principal component bands were selected for both methods, and the correlations were then computed using the HAC algorithm and the VQ method. Tables 4 and 5 present the results for the HAC and VQ methods, respectively. In general, the dimensionality reduction methods do not give good correlation with field data. Also, the lower resolution of the 2013 image gives poorer correlation. Both the HAC and VQ methods require the entire spectral signature to compute accurate spectral biodiversity values that correlate well with field data.

Spectral Diversity of Marine Benthic Habitats
In the case of marine biodiversity, studies have been limited by the presence of the water column, bathymetry, and the attenuation, diffraction and dispersion of optical radiation at different depths in the water column. The term benthic refers to anything associated with or occurring on the bottom of a body of water. The animals and plants that live on or in the bottom are known as the benthos. Benthic habitats can best be defined as bottom environments with distinct physical, geochemical, and biological characteristics. Benthic habitats vary widely depending on their location and depth, and they are often characterized by dominant structural features and biological communities. Benthic marine habitats include all biological communities associated with the sea floor, from the top of the intertidal zone and the inner reaches of estuaries down to the deep sea. The water column makes it difficult to differentiate between species [19], which is why all the study areas are shallow waters of approximately 2 m depth. Another important detail when working with images of water is sunglint, which affects the pixel values drastically and damages the results. The airborne AISA hyperspectral image used here has minimum sunglint. Results of spectral heterogeneity from this image are presented below. Figure 9 has two three-dimensional cubes, showing all 128 bands in the bottom cube and a spectral subset of 67 bands in the top cube. The wavelengths of the 128 bands run from 395.71 nm to 996.40 nm. Our data were analyzed using only the 67 band subset, which represents visible light from 395.71 nm to 701.36 nm. This is because visible light comprises the wavelengths of the electromagnetic spectrum that penetrate the water and thus holds the most information [20]. In the 128 band cube, the region beyond band 67 looks dark because it corresponds to red and infrared light, which does not carry enough information for biodiversity assessment; this is why those bands are removed. Tables 6, 7, and 8 show the Shannon entropy of each subset and the ground truth data for each transect and area, with its corresponding endmembers. Transects 1, 2 and 3 correspond to the blue, red and green transects in Fig. 8, and Transect 4 is the subset of the three transects together. The ground truth data for Transect 4 is the average of Transects 1, 2 and 3. Figure 10 shows the maximum, mean and minimum data values of one of the subsets in one of the locations where data were collected.

Discussion
For forest canopy species, 2 meter spatial resolution seems a good choice for diversity estimation. Clustering methods gave promising results for biodiversity assessment in shallow coastal waters. The results illustrate that even in shallow waters of approximately 2 m depth, it is hard to differentiate spectrally between species through the water column. This research suggests several directions for future hyperspectral imaging missions. Space-borne satellite images will continue to play a key role in hyperspectral data collection for biodiversity assessment. Multispectral and hyperspectral images from new sensors such as Sentinel, Landsat 8 and ECOSTRESS have to be evaluated for biodiversity assessment. Some of these images have poor spatial resolution, in which case subpixel resolution enhancement methods can be used. Field sampling plays a critical role in collecting ground truth data for the validation of satellite derived biodiversity estimates. Given the complexity of field sampling, recent advances in Unmanned Aerial Vehicles (UAVs) and Unmanned Underwater Vehicles (UUVs) can be used for field data collection. With new trends in hyperspectral remote sensing, the biodiversity of the Earth's terrestrial and coastal ecosystems can be mapped in the near future. With global climate change, the new hyperspectral sensing methods have to be put to use for accurate assessments. The research community has to put more effort into assessing changes in biodiversity over different periods of time, such as intervals of 2 to 5 years, so that the impact of climate change can be derived and presented quickly. Rapid retrieval of measurements and fast algorithms are required in this fast changing world, so that information reaches society in a timely manner.

Conclusions
A method based on HAC and VQ for calculating the spectral variability of vegetation and benthic habitats using hyperspectral images was presented; the captured information was used to assess plant species richness and the entropy of underwater marine species. The image reconstructed from the entropy index reflects the degree of diversity of the spectral signatures. A higher entropy value means a more diverse neighborhood associated with the pixel. Vector quantization followed by spectral unmixing outperforms the hierarchical agglomerative clustering method, giving better correlation between image entropy values and field entropies. Since forest canopy species do not change much, the 2007 field entropies were also used for correlation with the 2013 image. Biodiversity assessment plays a key role in understanding global climate change and the impact of human activity on the environment. The tools presented in this paper are valuable for biodiversity assessment in forest canopy and underwater benthic habitats. It is hoped that, with the availability of new hyperspectral sensing methods and faster algorithms, useful information on the biodiversity of forest and benthic reef habitats can be presented to the community.

Figure 1. Steps for spectral diversity feature extraction from hyperspectral images.

Figure 4. RGB images of plots from AISA 2013 images at 2 m resolution.

Figure 5 shows the entropy values computed at each pixel; higher values indicate a more spectrally diverse neighborhood, while lower values indicate less spectral diversity.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 25 October 2018 | doi:10.20944/preprints201810.0615.v1

The benthic results use hyperspectral images of the marine habitats of Enrique reef, La Parguera, in Puerto Rico, taken from the AISA 2013 hyperspectral image of 128 bands with 2 m x 2 m resolution. All data were compared for validation with 2014 ground truth data. The region of interest is inside the red square in Fig. 6. The three analyzed areas inside the red square are shown in Fig. 7.

Figure 6. Locations of three analyzed areas at approximately 2 m depth.

Figure 8. Transect of ground data for the three years.

Figure 8 shows how the ground data were collected: along a straight-line transect 10 meters long, shown in blue, and two secondary perpendicular transects, shown in red and green, to cover more area. Each square in a transect represents 1 m x 1 m, and every two squares correspond to one pixel in the image. Subsets were extracted at two pixels around each transect line, plus one subset covering the three transects together.

Figure 9. The view of one subset of area 3.

Table 1. Entropy correlation between field data and 2007 Guanica forest hyperspectral AISA image data.

Table 3. Correlation of field alpha diversity with spectral diversity using vector quantization.

Table 4. Entropy correlation between field data and entropies calculated from dimensionality reduction methods for the 2007 Guanica forest hyperspectral AISA image using the HAC method.

Table 5. Entropy correlation between field data and entropies calculated from dimensionality reduction methods for the 2007 Guanica forest hyperspectral AISA image using the VQ method.

Table 6. Shannon entropy for VQ and HAC with its corresponding ground data and endmembers.

Table 7. Shannon entropy for VQ and HAC with its corresponding ground data and endmembers.

Table 8. Shannon entropy for VQ and HAC with its corresponding ground data and endmembers.