Unsupervised Metric Learning using low dimensional embedding

Unsupervised metric learning has generally been studied as a byproduct of dimensionality reduction or manifold learning techniques. Manifold learning techniques such as diffusion maps and Laplacian eigenmaps have the special property that the embedded space is Euclidean. Although Laplacian eigenmaps can provide some (dis)similarity information, it does not provide a metric that can be applied to out-of-sample data. On the other hand, supervised metric learning techniques such as ITML can learn a metric but need labeled data to do so. In this work we propose methods for incremental unsupervised metric learning. In the first approach, Laplacian eigenmaps is combined with Information Theoretic Metric Learning (ITML) to form an unsupervised metric learning method. We first project the data onto a low-dimensional manifold using Laplacian eigenmaps; in the embedded space, Euclidean distance gives a measure of similarity between points. If the Euclidean distance between two points in the embedded space is below a threshold t1, we consider them similar, and if it is greater than a threshold t2, we consider them dissimilar. In this way we collect a batch of similar and dissimilar pairs, which are then used as constraints for the ITML algorithm to learn a metric. As a proof of concept, we tested our approach on various UCI machine learning datasets. In the second approach we propose Incremental Diffusion Maps, which update the SVD in a batch-wise manner.

1 Unsupervised Information Theoretic Metric Learning using Laplacian eigenmaps

Laplacian eigenmaps
Laplacian eigenmaps learns a low-dimensional representation of the data such that the local geometry is optimally preserved; this low-dimensional manifold approximates the geometric structure of the data. The steps below describe the method in detail. Consider a set of data points $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^N$. The goal of Laplacian eigenmaps is to find an embedding in an $m$-dimensional space, with $m < N$, that preserves the local properties of the data.
1. Construct a graph $G(V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. Each vertex of $G$ corresponds to a point in $X$. We connect two vertices $v_i$ and $v_j$ by an edge if they are close, where closeness can be defined in two ways: (a) $\|x_i - x_j\|^2 < \epsilon$, where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^N$, or (b) $x_i$ is among the $k$ nearest neighbours of $x_j$. Here $\epsilon$ and $k$ are user-defined parameters.
2. Construct a weight matrix $W$ whose entry $W(i, j)$ is the weight of the edge between $v_i$ and $v_j$. Weights can be assigned in two ways: (a) The simple-minded approach is to assign $W(i, j) = 1$ if vertices $v_i$ and $v_j$ are connected and $0$ otherwise.
(b) Heat kernel: assign $W(i, j) = \exp(-\|x_i - x_j\|^2 / t)$ if $v_i$ and $v_j$ are connected and $0$ otherwise, where $t > 0$ is the kernel width.
3. Compute the embedding. Let $D$ be the diagonal degree matrix with $D(i, i) = \sum_j W(i, j)$ and let $L = D - W$ be the graph Laplacian. The final low-dimensional embedding is computed by solving the generalized eigenvalue problem $L v = \lambda D v$. Let $0 = \lambda_0 \le \lambda_1 \le \dots \le \lambda_m$ be the $m + 1$ smallest eigenvalues, and choose the corresponding eigenvectors $v_1, v_2, \dots, v_m$, ignoring the eigenvector corresponding to $\lambda_0 = 0$. The embedding coordinates are given by the mapping $x_i \mapsto y_i = (v_1(i), v_2(i), \dots, v_m(i))$, where $y_i$, $i = 1, 2, \dots, n$, are the coordinates in the $m$-dimensional embedded space.
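The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: it uses the k-nearest-neighbour rule for the graph, heat-kernel weights with an assumed width `t`, and solves the generalized problem through the equivalent symmetric matrix $D^{-1/2} L D^{-1/2}$. The function name `laplacian_eigenmaps` and its default parameter values are hypothetical.

```python
import numpy as np

def laplacian_eigenmaps(X, k=5, m=2, t=1.0):
    """Embed the rows of X into m dimensions (illustrative sketch).

    Assumptions: k-NN graph, heat-kernel weights with width t,
    and no duplicate points in X.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Step 1: k-nearest-neighbour graph, symmetrised.
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), idx.ravel()] = True
    adj = adj | adj.T
    # Step 2: heat-kernel weights on the edges, zero elsewhere.
    W = np.where(adj, np.exp(-sq / t), 0.0)
    # Step 3: solve L v = lambda D v via the symmetric form D^{-1/2} L D^{-1/2}.
    d = W.sum(axis=1)
    L = np.diag(d) - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    V = d_inv_sqrt[:, None] * evecs        # generalized eigenvectors of (L, D)
    # Drop the trivial eigenvector for lambda_0 = 0, keep the next m.
    return V[:, 1:m + 1], evals[:m + 1]
```

Row `i` of the returned array is the embedded point $y_i$. If the neighbourhood graph is disconnected, several eigenvalues are zero and the embedding mostly separates the components, so `k` may need to be increased.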

Information Theoretic Metric Learning
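Information Theoretic Metric Learning (ITML) [3] learns a Mahalanobis distance metric that satisfies given similarity and dissimilarity constraints on the input data. The goal of the ITML algorithm is to learn a metric of the form $d_A(x_i, x_j) = (x_i - x_j)^\top A (x_i - x_j)$ under which similar points are close relative to dissimilar points. ITML starts with an initial metric $d_{A_0}$, where $A_0$ can be set to the identity matrix $I$ or to the inverse covariance of the data, and learns a metric $d_A$ that is close to the starting metric $d_{A_0}$ and satisfies the defined constraints. To measure the distance between metrics it exploits the bijection between Gaussian distributions with fixed mean $\mu$ and Mahalanobis distances, $p(x; A) = \frac{1}{Z} \exp\left(-\frac{1}{2} d_A(x, \mu)\right)$. Using this connection, the problem is formulated as:

$$\min_{A} \; \mathrm{KL}\big(p(x; A_0) \,\|\, p(x; A)\big) \quad \text{s.t.} \quad d_A(x_i, x_j) \le u \;\; \forall (i, j) \in S, \qquad d_A(x_i, x_j) \ge l \;\; \forall (i, j) \in D, \tag{1}$$

where $S$ and $D$ are the sets of similar and dissimilar pairs and $u$ and $l$ are the upper and lower distance bounds. The above formulation can be simplified by using the connection between the KL divergence and the LogDet divergence, which is given as

$$\mathrm{KL}\big(p(x; A_0) \,\|\, p(x; A)\big) = \frac{1}{2} D_{ld}(A, A_0), \qquad D_{ld}(A, A_0) = \mathrm{tr}(A A_0^{-1}) - \log\det(A A_0^{-1}) - N, \tag{2}$$

with $N$ the dimensionality of the data. Using (2) in (1), the problem can be reformulated as:

$$\min_{A \succeq 0} \; D_{ld}(A, A_0) \quad \text{s.t.} \quad \mathrm{tr}\big(A (x_i - x_j)(x_i - x_j)^\top\big) \le u \;\; \forall (i, j) \in S, \qquad \mathrm{tr}\big(A (x_i - x_j)(x_i - x_j)^\top\big) \ge l \;\; \forall (i, j) \in D.$$

The above formulation can be solved efficiently using the Bregman projection method described in [3].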
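As a small numerical illustration of the two quantities ITML works with, the sketch below evaluates the squared Mahalanobis distance $d_A$ and the LogDet divergence in its standard form $\mathrm{tr}(A A_0^{-1}) - \log\det(A A_0^{-1}) - N$. It does not implement the Bregman projection solver of [3]; the function names `mahalanobis_sq` and `logdet_divergence` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def mahalanobis_sq(A, xi, xj):
    """Squared Mahalanobis distance d_A(xi, xj) = (xi - xj)^T A (xi - xj)."""
    diff = xi - xj
    return float(diff @ A @ diff)

def logdet_divergence(A, A0):
    """LogDet divergence between two positive definite matrices A and A0."""
    n = A.shape[0]
    M = A @ np.linalg.inv(A0)
    # slogdet is numerically safer than log(det(M)) for larger matrices.
    _, logdet = np.linalg.slogdet(M)
    return float(np.trace(M) - logdet - n)
```

With $A = A_0$ the divergence is zero, and it grows as $A$ moves away from the prior $A_0$, which is what keeps the learned metric close to the starting metric while the pair constraints are enforced.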

Proposed algorithm
We propose a method that combines Laplacian eigenmaps and ITML to form an unsupervised metric learning method. Laplacian eigenmaps, as described in Section 1.1, can be used to recover the underlying low-dimensional manifold of the data, where Euclidean distance provides (dis)similarity information. This information can then be used to learn a metric with a supervised metric learning method such as ITML. Box 1.3 describes the details of the proposed algorithm.

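Manifold + supervised metric learning
Input: $X \in \mathbb{R}^{N \times k}$, the input data in $k$-dimensional space; $t_s$: threshold for similarity; $t_d$: threshold for dissimilarity; $\epsilon$, $m$: parameters for the Laplacian eigenmaps algorithm, $m < k$.
Output: learned metric $A \in \mathbb{R}^{k \times k}$.
Steps:
1. Construct the low-dimensional embedding: compute $E$, the $m$-dimensional Laplacian eigenmaps embedding of $X$.
2. Construct similarity and dissimilarity pairs: for each pair of embedded points $(x_i, x_j) \in E$, add $(i, j)$ to the similar set $S$ if $\|x_i - x_j\| < t_s$, and add it to the dissimilar set $D$ if $\|x_i - x_j\| > t_d$.
3. Apply the ITML procedure of Section 1.2 to learn the metric $A$ from $X$ with the constraint sets $S$ and $D$.
We reduce the time complexity of the above algorithm by limiting the number of similar and dissimilar pairs.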
Calculating thresholds: Manually setting the similarity and dissimilarity thresholds can be difficult in practice, but they can be set with a simple procedure based on the extremes of the pairwise distances in the embedded space E. A safe choice is to compute the histogram of distances and set $t_s$ and $t_d$ to the 5th and 95th percentiles, respectively.
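The pair construction and the percentile-based thresholds can be sketched as follows. This is a minimal illustration, assuming the embedding is given as an array `E` with one row per point; the function name `build_pairs` and the `max_pairs` and `seed` parameters (used to cap the number of constraints by random subsampling) are hypothetical. The resulting index pairs would then be passed to an ITML solver, which is not shown.

```python
import numpy as np

def build_pairs(E, lo_pct=5.0, hi_pct=95.0, max_pairs=None, seed=0):
    """Return (similar, dissimilar, t_s, t_d) from an embedding E.

    Pairs are index pairs (i, j) with i < j. Thresholds are the lo_pct and
    hi_pct percentiles of all pairwise Euclidean distances in E.
    """
    i, j = np.triu_indices(E.shape[0], k=1)      # every unordered pair once
    dist = np.linalg.norm(E[i] - E[j], axis=1)
    t_s, t_d = np.percentile(dist, [lo_pct, hi_pct])
    sim = np.column_stack((i[dist < t_s], j[dist < t_s]))
    dis = np.column_stack((i[dist > t_d], j[dist > t_d]))
    if max_pairs is not None:
        # Limit the number of constraints to keep ITML cheap.
        rng = np.random.default_rng(seed)
        if len(sim) > max_pairs:
            sim = sim[rng.choice(len(sim), max_pairs, replace=False)]
        if len(dis) > max_pairs:
            dis = dis[rng.choice(len(dis), max_pairs, replace=False)]
    return sim, dis, t_s, t_d
```

Computing all pairwise distances costs memory quadratic in the number of points, so for large datasets the distances could be estimated from a random subset of pairs instead.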

Results
We evaluated our method on several UCI datasets [4]. To the best of our knowledge, no other method does exactly what we attempt, so we compare our algorithm against a Euclidean-distance baseline: we construct similar and dissimilar pairs using Euclidean distance computed directly on the input data and then learn a metric using ITML. We split each dataset randomly into two parts, 80% for training and 20% for testing, and evaluate the learned metric by k-NN classification with the learned metric as the distance measure. All results presented are averages over 5 runs.

[Table: results per dataset; recoverable column headers: Dataset, Proposed]

The method proposed in this section is general, and the learned metric can also be used for clustering. From the results it is clear that using a low-dimensional embedding to obtain similar and dissimilar pairs performs better than directly applying a general measure such as Euclidean distance. It is important to note that the comparison is not made directly between Euclidean distance and the proposed method; rather, in both cases a metric is learned using ITML.

2 Incremental Diffusion Maps

Diffusion maps
Diffusion maps [2] are a non-linear dimensionality reduction technique. They achieve dimensionality reduction by exploiting the relation between Markov chains and heat diffusion. A diffusion map embeds data into a low-dimensional space such that the Euclidean distance between points in the embedded space approximates the diffusion distance in the original feature space. Consider a graph $G = (\Omega, W)$, where $\Omega = \{x_i\}_{i=1}^{N}$ are the data samples and $W$ is a similarity matrix with $W(i, j) \in [0, 1]$. $W$ is obtained by applying a Gaussian kernel to the pairwise distances $d(i, j)$ between $x_i$ and $x_j$, with kernel size $\sigma$. Using $W$ we obtain a transition matrix $P$ by row-wise normalizing the similarity matrix:

$$p(x_i, x_j) = \frac{W(i, j)}{\sum_{l} W(i, l)}.$$

The transition matrix reflects the local geometry of the data, where $p(x, y)$ is the probability of transition from $x$ to $y$ in one step. Looking forward in time, $P^t$ gives the probability of transition from $x$ to $y$ in $t$ time steps. Intuitively, running the diffusion forward in time reveals the geometric structure of the data at different scales.
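Diffusion process
We define a new kernel $L^{(\alpha)}$ using the normalized Laplacian:

$$L^{(\alpha)} = D^{-\alpha} W D^{-\alpha},$$

where $D$ is a diagonal matrix such that $D_{ii} = \sum_j W_{i,j}$ and $\alpha \in \mathbb{R}$. We then apply the weighted graph Laplacian normalization to this kernel:

$$M = \big(D^{(\alpha)}\big)^{-1} L^{(\alpha)},$$

where $D^{(\alpha)}$ is the diagonal matrix with $D^{(\alpha)}_{ii} = \sum_j L^{(\alpha)}_{i,j}$. The diffusion distance can be defined in terms of the eigenvectors of the matrix $M$:

$$D_t(x_i, x_j)^2 \approx \|\Psi_t(x_i) - \Psi_t(x_j)\|^2 = \sum_{l=1}^{k} \lambda_l^{2t} \big(\psi_l(x_i) - \psi_l(x_j)\big)^2,$$

where $\psi_l$ are the right eigenvectors and $\lambda_l$ the eigenvalues of $M$, and $\Psi_t(x) = (\lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \dots, \lambda_k^t \psi_k(x))$.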
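The batch construction described above can be sketched in NumPy as follows. This is an illustrative sketch, not the code used in the experiments. It assumes the Gaussian kernel is scaled as $\exp(-d(i,j)^2 / \sigma^2)$, which is one common convention and is not fixed by the text, and it obtains the right eigenvectors of $M$ from the symmetric matrix $(D^{(\alpha)})^{-1/2} L^{(\alpha)} (D^{(\alpha)})^{-1/2}$, which has the same eigenvalues. The function name `diffusion_map` and its defaults are hypothetical.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, alpha=1.0, k=2, t=1):
    """Return (embedding, eigenvalues, M) for the rows of X (sketch)."""
    # Gaussian similarity matrix; the sigma**2 scaling is an assumption.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma ** 2)
    # Alpha-normalised kernel L_alpha = D^{-alpha} W D^{-alpha}.
    d = W.sum(axis=1)
    L_alpha = W / np.outer(d ** alpha, d ** alpha)
    # Row-normalise to get the Markov matrix M = D_alpha^{-1} L_alpha.
    d_alpha = L_alpha.sum(axis=1)
    M = L_alpha / d_alpha[:, None]
    # M is similar to the symmetric matrix S, so use a symmetric eigensolver.
    S = L_alpha / np.sqrt(np.outer(d_alpha, d_alpha))
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]              # sort in descending order
    evals, evecs = evals[order], evecs[:, order]
    psi = evecs / np.sqrt(d_alpha)[:, None]      # right eigenvectors of M
    # Skip the trivial pair (lambda_0 = 1, constant psi_0), keep the next k.
    embedding = psi[:, 1:k + 1] * evals[1:k + 1] ** t
    return embedding, evals, M
```

Euclidean distances between rows of the returned embedding approximate the diffusion distance at time `t`, up to the normalisation chosen for the eigenvectors.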

Updating SVD
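Given a matrix $A \in \mathbb{R}^{m \times n}$ and $\hat{A} = U \Sigma V^\top$, where $\hat{A}$ is the rank-$k$ approximation of $A$, [5] describes a procedure to obtain an approximate rank-$k$ approximation of $[\hat{A}, B]$ with $B \in \mathbb{R}^{m \times r}$ (appending columns) and of $[\hat{A}; B]$ with $B \in \mathbb{R}^{r \times n}$ (appending rows). The method proposed in [5] is summarized below.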

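Updating Columns
1. Let the QR decomposition of $(I - U U^\top) B$ be $(I - U U^\top) B = Q R$, where $R$ is upper triangular.
2. Get the SVD of $\begin{bmatrix} \Sigma & U^\top B \\ 0 & R \end{bmatrix} = \hat{U} \hat{\Sigma} \hat{V}^\top$.
3. Then the best rank-$k$ approximation of $[\hat{A}, B]$ is given as $\left([U, Q]\, \hat{U}_k\right) \hat{\Sigma}_k \left(\begin{bmatrix} V & 0 \\ 0 & I_r \end{bmatrix} \hat{V}_k\right)^\top$, where $\hat{U}_k$ and $\hat{V}_k$ are the first $k$ columns of $\hat{U}$ and $\hat{V}$ and $\hat{\Sigma}_k$ is the leading $k \times k$ block of $\hat{\Sigma}$.

Updating Rows
1. Let the QR decomposition of $(I - V V^\top) B^\top$ be $(I - V V^\top) B^\top = Q L^\top$, where $L$ is lower triangular.
2. Get the SVD of $\begin{bmatrix} \Sigma & 0 \\ B V & L \end{bmatrix} = \hat{U} \hat{\Sigma} \hat{V}^\top$.
3. Then the best rank-$k$ approximation of $[\hat{A}; B]$ is given as $\left(\begin{bmatrix} U & 0 \\ 0 & I_r \end{bmatrix} \hat{U}_k\right) \hat{\Sigma}_k \left([V, Q]\, \hat{V}_k\right)^\top$.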
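The column update can be sketched as below. This is a minimal illustration of the QR-then-small-SVD scheme summarized above, assuming $m \ge r$ and that the factors are stored as `U` (m by k), the singular values `s` (length k), and `Vt` (k by n); the function name `update_columns` is hypothetical. The row update follows the same pattern with the roles of `U` and `V` exchanged.

```python
import numpy as np

def update_columns(U, s, Vt, B):
    """Rank-k SVD of [A_hat, B], given A_hat = U @ diag(s) @ Vt (sketch)."""
    k, r = s.size, B.shape[1]
    n = Vt.shape[1]
    UtB = U.T @ B
    # Step 1: QR of the part of B orthogonal to span(U): (I - U U^T) B = Q R.
    Q, R = np.linalg.qr(B - U @ UtB)
    # Step 2: SVD of the small (k + r) x (k + r) matrix [[Sigma, U^T B], [0, R]].
    K = np.block([[np.diag(s), UtB],
                  [np.zeros((R.shape[0], k)), R]])
    Uh, sh, Vht = np.linalg.svd(K, full_matrices=False)
    # Step 3: rotate the enlarged bases and keep the leading k components.
    U_new = np.hstack((U, Q)) @ Uh[:, :k]
    V_big = np.block([[Vt.T, np.zeros((n, r))],
                      [np.zeros((r, k)), np.eye(r)]])
    Vt_new = (V_big @ Vht[:k].T).T
    return U_new, sh[:k], Vt_new
```

Only a matrix of size $(k + r) \times (k + r)$ is decomposed in each update, which is what makes the incremental procedure cheaper than recomputing the SVD of the full matrix when $k$ and $r$ are small.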

Proposed algorithm
Box 2.3 describes the basic steps of the proposed incremental diffusion maps algorithm. The computation can be made very efficient by storing the previous row sums from step 2 and step 4 of the algorithm.

Results
To check the performance of the proposed method we used a toy dataset of 2000 points forming two nonlinearly separable clusters. To test the effectiveness of the incremental procedure we divided the dataset into two parts of 1600 and 400 points. The initial 1600 points were used in batch mode to obtain an initial embedding, which we then updated using the proposed incremental diffusion maps procedure. To simulate a real-world setting we updated the embedding 10 points at a time, calling the incremental update 40 times in this case.
Once we obtained the final embedding after adding all 400 points, for both the batch and incremental settings, we used k-means with k = 2 to visualize the effectiveness of the learned diffusion distance. The final results are plotted in Figure 1.
All experiments were run on a Sony machine with an i3 processor and 4 GB of RAM. It can be observed that the incremental results are very close to those of batch mode, at a much lower run time. We also tested this method on other datasets, but due to high approximation error in the Markov matrix calculation and the SVD update procedure, the results were not satisfactory.

Conclusion and future work
From our results we conclude that the approximation in the Markov matrix and the SVD update is not yet satisfactory. However, the current method uses a basic SVD update procedure, and using recent advanced procedures [1] could lead to further improvements.