Ink Classification in Hyperspectral Images

Abstract: Hyperspectral imaging provides detailed information about the objects and materials present in an image, which makes it useful in satellite imagery as well as image forensics. Hyperspectral document imagery (HSDI) analysis can be used for document authentication through ink analysis, which can provide information about the composition and type of ink. In this project, we have implemented an HSDI-based ink classification technique that uses Principal Component Analysis (PCA) for dimensionality reduction and k-means clustering for ink classification. This is an unsupervised learning approach, and it is simple and efficient when the number of bands is limited. We have applied this technique to a hyperspectral document image with 33 spectral bands.


I. INTRODUCTION
A hyperspectral image (HSI) can consist of several hundred bands, which provide valuable information that cannot be extracted from ordinary camera images, since those capture only the visible spectrum. Because of this property, hyperspectral images are widely used in different fields, including satellite imagery analysis, photogrammetry and mining. Mining companies use HSI to detect minerals beneath the earth's surface. HSI provides broad spectral information about the imaged subject or document. Thus HSI can be used by forensic experts and investigators to extract information about features such as the age of the subject, fingerprints and blood stains. Potential applications of HSI in the pharmaceutical sector include the detection of counterfeit drugs using hyperspectral cameras.
A unique spectral response can be observed at different wavelength bands of an HSI. Figure 1 shows a hyperspectral cube with the spectral signature (response) at each wavelength. Hyperspectral document imagery (HSDI) analysis is one application of HSI. HSDI is now being used in image forensics for tasks such as checking the authenticity of documents by detecting ink mismatch or handwriting patterns. In this paper, a technique is proposed to detect inks in HSDI using PCA (Principal Component Analysis) and k-means clustering. We have implemented it to separate the inks in an HSI cube of 33 bands. Deep learning can be used for this task, but because the dataset contains only 33 bands, we have used pattern recognition tools to classify the inks. Other clustering techniques, such as fuzzy c-means clustering, could also be used, but we have chosen k-means clustering because it is simpler and gives the required results for simple tasks.

II. LITERATURE REVIEW
Hyperspectral imagery is an emerging field, and research in it has increased sharply due to advances in computing power and deep learning. One team of researchers proposed a method to discriminate between inks that appear visually similar using hyperspectral images. They created a dataset of handwritten notes written with different types of blue and black ink. Unsupervised learning (clustering) was used to separate the different inks. The method performed better on HSI than on RGB images, separating the inks efficiently [1]. Hyperspectral imaging can also be used to restore or enhance historical documents in museums and archives, as it makes bands beyond the visible range available. This technique was used by university researchers in Singapore to enhance historical documents using the near-infrared bands (NIB) present in HSI, making the documents readable by extracting information from the NIB. First, the portions of the document that can be improved are extracted; after processing, the final enhanced document is reconstructed. This approach was evaluated on historical documents in collaboration with the National Archives of the Netherlands (NAN) [2].
In [3], the authors implemented a convolutional neural network (CNN) based deep learning algorithm to detect ink mismatch for document authentication in hyperspectral document images (HSDI). The technique uses spectral features such as spectral correlation and spectral context for ink classification. Six different CNN architectures were implemented, and different training-to-test ratios were evaluated. Experiments were performed on different types and ratios of ink from different manufacturers, and the results were evaluated. The algorithm is reported to outperform previous methods. Its limitation is that it requires prior knowledge of the inks present in the document. In [4], a local thresholding technique was applied to separate foreground pixels from background pixels, and the authors then used fuzzy c-means clustering to classify the different inks present in the document. Experiments were carried out using different combinations of inks, and feature selection was used to obtain optimal results.
Ink mismatch detection techniques using HSI are being developed, but the main issue is that ink mismatch is an unbalanced clustering problem, so ordinary methods do not achieve high performance. To solve this problem, researchers proposed a technique based on a hyperspectral unmixing scheme, which identifies the spectral responses of the inks and their abundances. This scheme performed well on the dataset compared with other techniques [5].
The applications of HSI in image forensics also include writer identification based on handwriting patterns. Previous methods for this application usually gave low accuracy. HSI addresses this problem because many spectral bands are available for feature identification. Researchers used deep learning to identify the writer from HSI: spectral responses were extracted and fed to a convolutional neural network (CNN). Different train-to-test ratios were evaluated [6].
In [6], PCA is used for dimensionality reduction of HSDI, and k-means clustering is then applied. The resulting clusters are used to train a multi-class SVM. The technique was validated on three benchmark images and compared with a standard scheme; it gave better results than the state-of-the-art method.

III. METHODOLOGY
The complete methodology of this project is depicted in figure 1. When the HSI are input to the system, the dimension of the feature space is very large, which can result in overfitting [11]. To overcome this problem, PCA is applied to the images to extract only the useful features. After this step, k-means clustering is applied for ink classification, and each ink class is then depicted in a different color [12]. Consider an HSI of size (m x n x N), as in figure 4. The pixel vector at location (x, y) in the image can be represented by:

$$\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{iN}]^T, \quad i = 1, \ldots, M$$

where N is the number of HSI bands. If m is the number of rows and n is the number of columns in the image, then M = m x n. The mean vector can then be calculated as:

$$\boldsymbol{\mu} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{x}_i$$

Figure 4 Pixel Vector in HSI
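The covariance matrix of the pixel vectors $\mathbf{x}_i$, with mean vector $\boldsymbol{\mu}$, is given by:

$$C = \frac{1}{M} \sum_{i=1}^{M} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$$

The next step is the eigendecomposition of the covariance matrix, which is given by the following equation:

$$C = A D A^T$$

where D is the diagonal matrix consisting of the eigenvalues $\lambda_1, \ldots, \lambda_N$ of the covariance matrix, and A is an orthonormal matrix whose columns are the eigenvectors. The number of eigenvectors is N.

$$\mathbf{y}_i = A^T \mathbf{x}_i$$

where $\mathbf{y}_i$ is the PCA pixel vector, a linear transformation of $\mathbf{x}_i$; it contains the pixel's values in all PCA bands. The eigenvalues are arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_N$, and the first K rows of the matrix $A^T$ form $A_K^T$.

$$\mathbf{z}_i = A_K^T \mathbf{x}_i$$

Here K is the number of principal components retained, namely those with the largest eigenvalues. Every pixel vector in the HSI can be mapped using the above transformation. We then have M data points, each a vector of length K.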
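The PCA steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code used in this project: the function name `pca_reduce` and the parameter `variance_to_keep` are hypothetical, and K is chosen here as the smallest number of components whose eigenvalues reach a given fraction of the total variance (the results section of this paper retains 99.9%).

```python
import numpy as np


def pca_reduce(cube, variance_to_keep=0.999):
    """Project an (m, n, N) hyperspectral cube onto its leading principal components.

    Returns the reduced pixel matrix of shape (M, K) with M = m * n,
    the eigenvalues in descending order, and the chosen K.
    """
    m, n, N = cube.shape
    X = cube.reshape(m * n, N).astype(np.float64)    # M pixel vectors of length N

    mu = X.mean(axis=0)                              # mean pixel vector
    Xc = X - mu
    C = (Xc.T @ Xc) / X.shape[0]                     # N x N covariance matrix

    # eigh is intended for symmetric matrices and returns ascending eigenvalues.
    eigvals, A = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                # reorder to descending
    eigvals, A = eigvals[order], A[:, order]
    eigvals = np.clip(eigvals, 0.0, None)            # remove tiny negative round-off

    # Smallest K whose eigenvalues reach the requested variance fraction.
    fraction = np.cumsum(eigvals) / eigvals.sum()
    K = min(int(np.searchsorted(fraction, variance_to_keep)) + 1, N)

    # Each row is A_K^T x_i: the pixel vector projected onto the K leading eigenvectors.
    Z = X @ A[:, :K]
    return Z, eigvals, K
```

The returned matrix has one row per pixel, so it can be passed directly to a clustering routine and the resulting labels reshaped back to (m, n).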

B. Dataset:
The dataset used here consists of 33 greyscale images, each with a spatial resolution of 627 by 81 pixels. The 33 images are the bands of a handwritten hyperspectral image captured over the spectral range from 400 nm to 720 nm with a step size of 10 nm. Each image contains the phrase "The quick brown fox jumps over the lazy dog". The image is composed in a 1:1 ratio from combinations of two different inks for experimental analysis. The combinations of two inks, colored differently for the sake of visualization, are shown in Figure 8.
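As a quick check of the figures quoted above, sampling 400 nm to 720 nm in 10 nm steps does give 33 bands, and each band holds 627 x 81 pixels. The short sketch below works this out; the helper `band_index` is a hypothetical convenience for illustration and is not part of the dataset.

```python
# Sampling grid implied by the dataset description: 400 nm to 720 nm, 10 nm steps.
START_NM, STOP_NM, STEP_NM = 400, 720, 10

wavelengths_nm = list(range(START_NM, STOP_NM + STEP_NM, STEP_NM))

# Number of pixel vectors per cube, M, for a 627 by 81 pixel band image.
pixels_per_band = 627 * 81


def band_index(wavelength_nm):
    """Return the zero-based band index of a sampled wavelength (hypothetical helper)."""
    if wavelength_nm not in wavelengths_nm:
        raise ValueError(f"{wavelength_nm} nm is not a sampled wavelength")
    return (wavelength_nm - START_NM) // STEP_NM


print(len(wavelengths_nm))   # 33
print(pixels_per_band)       # 50787
print(band_index(550))       # 15
```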

IV. RESULTS AND DISCUSSION
An unsupervised classification algorithm, here k-means, divides the image pixels into groups based on their spectral similarity without using any prior knowledge of the spectral classes. As a result, we get 3 clusters: one for the background pixels and two for the two different inks, as depicted in fig 6. Here we have no prior knowledge in the form of ground truth.
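The clustering step can be illustrated with a small NumPy version of Lloyd's k-means. It is a sketch under stated assumptions rather than the implementation used here: the function names are hypothetical, and the centroids are initialised with a farthest-first rule (each new centroid is the point farthest from those already chosen), which is a choice made for this example so that well-separated groups each receive one centroid.

```python
import numpy as np


def init_farthest_first(points, k, rng):
    """Pick k starting centroids: one random point, then repeatedly the farthest point."""
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    return np.array(centroids, dtype=float)


def kmeans(points, k=3, n_iter=100, seed=0):
    """Lloyd's k-means on an (M, K) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = init_farthest_first(points, k, rng)

    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):                 # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```

With k = 3, reshaping the returned labels to the image's row and column dimensions gives a label map that can be colored per cluster. The cluster indices are arbitrary, so which label corresponds to the background and which to each ink has to be determined by inspection.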
Performing supervised classification requires training a classifier with training data that associates samples with particular training classes. To assign class labels to the pixels of an image with m rows and n columns, one must provide an m x n integer-valued ground truth array whose elements are indices of the corresponding training classes.
Many of the bands within hyperspectral images are often strongly correlated. The principal components transformation is a linear transformation of the original image bands to a set of new, uncorrelated features. A very large percentage of the image variance can be captured in a relatively small number of principal components compared to the original number of bands. We therefore retain enough eigenvalues to capture a desired fraction of the total image variance, and then reduce the dimensionality of the image pixels by projecting them onto the corresponding eigenvectors. We choose to retain a minimum of 99.9% of the total image variance, as depicted in fig 7 [8].

HSI has great potential for determining legitimacy in forensic document examination. In this paper, a technique is proposed to detect inks in HSDI using PCA (Principal Component Analysis) and k-means clustering. We have implemented it to separate the inks in an HSI cube of 33 bands. In future, deep learning can be used for this task for better results; it would need a bigger dataset in order to train a network [9]. Other clustering techniques, such as fuzzy c-means clustering, could also be used, but we have used k-means clustering because it is simpler and gives the required results for simple tasks [10].