A Robust Symmetric Nonnegative Matrix Factorization Framework for Clustering Multiple Heterogeneous Microbiome Data

Integrating multi-view datasets, which comprise heterogeneous sources or different representations, is challenging but necessary for understanding the subtle and complex relationships in data. Data integration methods attempt to combine the complementary information of multiple data types efficiently to construct a comprehensive view of the underlying data. Nonnegative matrix factorization (NMF), an approach that can be used for signal compression and noise reduction, has attracted widespread attention in the last two decades. The Kullback-Leibler divergence (or relative entropy) can be used as the loss function of NMF. In this article, we propose a fast and robust framework (RSNMF) based on symmetric nonnegative matrix factorization (SNMF) and similarity network fusion (SNF) for clustering human microbiome data, including functional, metabolic and phylogenetic profiles. Many existing methods use all the information provided by each view to create a consensus representation, which often suffers from noise in the data and cannot provide a precise representation of the latent data structures. In contrast, RSNMF combines the strength of SNMF and the advantage of SNF to form a robust clustering indicator matrix and thus reduces the influence of noise. We conduct experiments on one synthetic and two real datasets (microbiome data and text data), and the results show that the proposed RSNMF performs better than the baseline and state-of-the-art methods, which demonstrates the potential of RSNMF for microbiome data analysis.


Introduction
In recent years, advances in high-throughput sequencing technologies have made it possible to collect multiple heterogeneous microbiome data on an unprecedented scale. Microbiome studies investigate the relationships among microorganisms and the interactions between microbiota and host environments, and they have become increasingly valuable for understanding the principles and mechanisms of microbiota-associated health and disease, as well as questions in environmental science, bio-energy production and other areas [1,2]. Many microbiome projects, including the Human Microbiome Project (HMP) [3], Metagenomics of the Human Intestinal Tract (MetaHIT) [4] and the Tara Oceans Project [5], have generated large amounts of data that can be processed to represent the composition profiles of a microbial community, for example the phylogenetic profile and the abundance profile. However, these composition profiles are often analyzed individually to investigate the variation among samples.
Using metagenomic data from MetaHIT, Arumugam et al. classified the microbiome into enterotypes and highlighted the significance of functional analysis for understanding the interactions among microbes [6,7]. Koren et al. used several clustering methods and distance metrics to demonstrate how different factors influence the detection of enterotypes, and pointed out that not all clustering methods accurately identify such enterotypes or clusters [8]. Ma et al. studied the microbiome function profile data from HMP to find differences among samples [9]. Nevertheless, many other studies suggested that either additional datasets should be considered or more rigorous approaches should be provided to draw unbiased conclusions from microbiome data. These studies are all based on clustering methods. So far, few methods integrate multiple measurements to study the microbial community. With different sequencing technologies, for example 16S rRNA sequencing or shotgun metagenomic sequencing, a microbial sample or its DNA sequences can be probed to measure its phylogenetic profile, functional profile, metabolic pathways, protein families and so on. However, because of the noise in the raw data and the heterogeneous and dynamic properties of microbiome data [10], it is difficult to combine different views of the data into an overall understanding of microbiome samples. Novel data integration methods are needed to disentangle complex microbial communities.
In the fields of text mining and image processing, many novel methods for integrating multi-view datasets have emerged [11][12][13][14][15]. The co-training spectral clustering algorithm [16] attempts to find a compatible clustering solution by iteratively updating the discriminative eigenvectors of each view. Similarity Network Fusion (SNF) [17] efficiently constructs a fused network that represents the underlying data structure across different views. SNF is an effective approach and has been applied in many fields, such as image retrieval [18] and cancer subtyping and survival prediction [17]. However, when building the similarity network for each view, SNF uses the Gaussian kernel and the K-nearest neighbors (KNN) approach, which are not appropriate or feasible for all types of data, for instance text data.
Recently, information theory has attracted extensive attention in machine learning and data science. For example, Nonnegative Matrix Factorization (NMF) can adopt relative entropy to measure the signal compression ratio and obtains good performance. NMF can be considered a signal compression and noise reduction method that approximates the original data signal by its low-rank factorization. The relative entropy, or Kullback-Leibler divergence, between the original data matrix and the low-rank approximation reflects the maximum compression possible [19]. A small relative entropy indicates that the original matrix (information source) has high redundancy. Multi-view Nonnegative Matrix Factorization (multi-NMF) [12] and Joint Nonnegative Matrix Factorization (JNMF) [20] are variants of NMF; they are efficient clustering algorithms that search for a meaningful consensus solution shared by all views.
Another efficient clustering method is Symmetric Nonnegative Matrix Factorization (SNMF). Kuang et al. indicated that SNMF can be used for graph clustering and often performs better than spectral methods and NMF [21]. SNMF takes a nonnegative similarity matrix as input and outputs two low-rank matrices ($H$, $H^T$). The performance of SNMF relies on the affinity matrix, which can be measured by various distance metrics. Zhu et al. established a robust affinity graph based on clustering random forests (ClustRF). Although ClustRF is capable of capturing subtle and weakly relevant information distributed among the discriminative features [10,11], it is an extraordinarily time-consuming process. The complexity of ClustRF algorithms increases exponentially with the number of samples, features and training trees.
To address the challenges above, we propose a fast and robust multi-view clustering framework (RSNMF) that combines the strength of SNMF with the advantage of SNF. RSNMF uses the final fused matrix formed by SNF to enhance the local neighborhood and adopts a robust symmetric nonnegative matrix factorization algorithm, without additional parameters, to assign a clustering indicator label to each sample. In our study, we use different kernel functions to construct affinity matrices according to the data type (see Materials and Methods). In extensive experiments on several realistic datasets, including HMP data, RSNMF significantly outperforms other methods, which suggests its potential application in microbiome data analysis. Fig. 1 illustrates the framework of the RSNMF algorithm. The contribution of this study is an efficient clustering approach for integrating multiple heterogeneous microbiome data. The proposed method can be easily extended to other fields in which the samples have multiple measurements. The rest of the paper is organized as follows: the next section describes the datasets used in this study, gives a brief overview of NMF and SNF, and then presents the RSNMF algorithm. Finally, the experimental results and the conclusion are described.

Experimental Data
Three datasets are used in our experiments. The first is a synthetic dataset, the second is the Three-source text story dataset and the last is the human microbiome dataset. Table 1 summarizes the important statistics of these three datasets.
Synthetic dataset: This artificial dataset includes three views generated from a two-component Gaussian mixture model. Each view has two clusters. Detailed information and parameters are given in [16].
Three-source text story dataset: This public dataset is derived from three online news sources: BBC, Reuters and the Guardian. In total, 948 news articles covering 416 different stories were collected, and 169 of these stories were reported in all three sources [22]. Each story was manually classified into one of six topical labels: entertainment, politics, health, technology, business and sport. More details are given in Table 2.
Human microbiome dataset: This public dataset includes three compositional profiles from HMP (http://hmpdacc.org/): the phylogenetic, transporter and metabolic profiles. It contains 637 samples drawn from seven body sites (see Table 3 for detailed information). The phylogenetic profile comprises the relative abundances of microbes, estimated by MetaPhlAn at the species level. The transporter and metabolic profiles were also downloaded from the HMP site; please refer to http://hmpdacc.org/ for more information [23].

Nonnegative Matrix Factorization (NMF)
In NMF, given an original matrix $V$ ($n \times m$), we seek two low-rank matrices $W$ ($n \times k$) and $H$ ($k \times m$) that approximate $V$ by minimizing an objective function (Eq. 1), where $k$ is the number of underlying clusters or the degree of factorization, and $W$ and $H$ are nonnegative. The objective function of NMF is the least-squares error:

$$\min_{W \ge 0,\, H \ge 0} \; \| V - WH \|_F^2 \qquad (1)$$

where $V \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times k}$, $H \in \mathbb{R}^{k \times m}$, and $\|\cdot\|_F$ denotes the Frobenius norm.
NMF can be used not only for data representation but also for clustering. In the first case, an additive linear combination of the basis vectors $(h_1, h_2, \cdots, h_k)$ can be used to represent each row of $V$, where $h_i$ denotes the i-th row of $H$. At the same time, each column of $V$ can be represented by an additive linear combination of $(w_1, w_2, \cdots, w_k)$, where $w_j$ denotes the j-th column of $W$. In the second case, by applying a hard clustering approach such as K-means to the basis matrix $H$, all samples are assigned to their corresponding cluster membership labels. Xu et al. used NMF to cluster documents and obtained better performance [24], which further indicates the potential of NMF for data clustering.
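The least-squares factorization described above can be illustrated with the classical multiplicative updates. The following is a minimal sketch rather than the implementation used in this study: the function name `nmf`, the random initialization, the iteration count and the toy matrix are all assumptions made for illustration, and the final argmax over the rows of W is a shortcut that stands in for the K-means step mentioned in the text.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative V (n x m) by W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Strictly positive random start (an assumption of this sketch).
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative and do not
        # increase the Frobenius error ||V - WH||_F.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix with two obvious row groups (hypothetical data).
V = np.array([[5.0, 5.0, 0.0, 0.0],
              [4.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 3.0],
              [0.0, 0.0, 2.0, 2.0]])
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H)   # reconstruction error
labels = W.argmax(axis=1)         # crude row labels, in place of K-means
```

On data with less clear structure, the argmax shortcut and a single random start are unlikely to be sufficient, which is one reason a separate clustering step and a careful initialization are used in practice.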

Symmetric Nonnegative Matrix Factorization (SNMF)
Although NMF has performed better than other methods in many fields, such as text clustering and image processing, it is not suitable for every circumstance. One important reason is that NMF approximates the original data by a linear combination of basis vectors [25]. When the data have a nonlinear structure or lie on a complicated manifold (for example, a ring structure), the performance of NMF is not consistently satisfactory. SNMF is an effective approach for clustering data with nonlinear structure. It is concerned only with a symmetric matrix, which can be constructed by various similarity metrics, and factorizes this matrix into two low-rank matrices ($H$, $H^T$). SNMF is also a graph clustering approach. The objective function of SNMF is defined as:

$$\min_{H \ge 0} \; \| A - HH^T \|_F^2 \qquad (2)$$

where $A \in \mathbb{R}^{n \times n}$ is the similarity matrix measured by a certain distance metric, $H \in \mathbb{R}^{n \times k}$, and $k$ is the degree of factorization or the default number of clusters. $A_{ij}$, the ij-th element of $A$, denotes the similarity between the i-th and j-th data points.
The similarity between each pair of nodes can be measured by many approaches, for example the heat kernel function, the inner-product kernel function and correlation coefficient methods. In this study, based on our experiments, we adopt the inner-product kernel for normalized text data and the Gaussian kernel for the other data types.

Constructing Affinity Graphs based on Different Kernels
Given a matrix $X \in \mathbb{R}^{n \times m}$, $x_i$ denotes the i-th sample point. The sample-by-sample similarity matrix, or affinity graph, can be established in two ways: as a sparse graph or as a full graph. The sparse graph adopts the KNN approach to preserve the locality between a specific sample and its k nearest neighbors [17]. For microbiome data, we use the Gaussian kernel function and the sparse construction to build the similarity matrix. For text data, we use the cosine distance and the full graph to build the weight matrix. The cosine similarity between two samples is defined as:

$$W_{ij} = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|} \qquad (3)$$

where $W_{ij}$ denotes the similarity between two nodes and $\|x_i\|$ denotes the norm of the i-th text vector. Eq. 3 can also be transformed into inner-product form by normalizing each sample vector so that $\|x_i\| = 1$ for all $i$.
For HMP data, the similarity is defined as:

$$W_{ij} = \exp\left( -\frac{\rho^2(x_i, x_j)}{\mu \, \varepsilon_{i,j}} \right) \qquad (4)$$

where $\mu$ is a parameter that can be set empirically and $\varepsilon_{i,j}$ is used to eliminate the scaling problem [17]. $\varepsilon_{i,j}$ is defined as:

$$\varepsilon_{i,j} = \frac{\mathrm{mean}(\rho(x_i, N_i)) + \mathrm{mean}(\rho(x_j, N_j)) + \rho(x_i, x_j)}{3} \qquad (5)$$

where $\rho(x_i, x_j)$ denotes the distance between the i-th and j-th data points, and $\mathrm{mean}(\rho(x_i, N_i))$ is the average distance between the i-th node and its neighbors. Furthermore, the obtained weight matrix can be transformed into a normalized one:

$$\widetilde{W} = D^{-1/2} \, W \, D^{-1/2} \qquad (6)$$

where $D$ is the degree matrix whose diagonal elements are $D_{ii} = \sum_j W_{ij}$ and whose other elements are zero.
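The two affinity constructions described above (cosine similarity on a full graph for text data, and a locally scaled Gaussian kernel for HMP data, followed by degree normalization) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the code used in this study: the function names are hypothetical, the scaled kernel is written in the form used by SNF [17], Euclidean distance is assumed, and the KNN sparsification step is omitted for brevity.

```python
import numpy as np

def cosine_affinity(X):
    """Full-graph cosine similarity between the rows (samples) of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)   # unit-norm rows
    return Xn @ Xn.T                    # inner-product form

def scaled_gaussian_affinity(X, k=3, mu=0.5):
    """Gaussian kernel with a local bandwidth; mu is the empirical parameter."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))          # Euclidean distance (assumed)
    # Mean distance from each sample to its k nearest neighbours
    # (column 0 of the sorted row is the zero self-distance).
    knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + dist) / 3.0
    return np.exp(-dist ** 2 / (mu * np.maximum(eps, 1e-12)))

def symmetric_normalize(W):
    """D^{-1/2} W D^{-1/2}, where D holds the row sums of W."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Hypothetical toy data: 6 samples with 4 features.
X = np.random.default_rng(0).random((6, 4))
C = cosine_affinity(X)
G = scaled_gaussian_affinity(X, k=2, mu=0.5)
Gn = symmetric_normalize(G)
```

The local bandwidth makes the kernel adapt to the sampling density around each pair of samples, which is the scaling issue the text refers to.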

The Weighted Symmetric Nonnegative Matrix Factorization
SNMF has been described in the section above. For Eq. 2, the updating rule of SNMF is [26]:

$$H_{ik} \leftarrow H_{ik} \left( 1 - \beta + \beta \, \frac{(AH)_{ik}}{(HH^TH)_{ik}} \right) \qquad (7)$$

where $0 < \beta \le 1$. In this study, the values of $\beta$ are set in the range between 0 and 1. In general, SNMF requires the similarity matrix to be positive semidefinite (s.p.d.). However, not all nonnegative matrices are s.p.d., so a more general algorithm was proposed in [27]:

$$\min_{H \ge 0,\, S \ge 0} \; \| A - HSH^T \|_F^2 \qquad (8)$$

where $S$ is a symmetric matrix that takes care of the negative eigenvalues. When $A$ is indefinite, the indefiniteness can pass onto $S$, so Eq. 8 gives a better approximation than Eq. 2, which is well known in Cholesky factorization. Eq. 8 is also called weighted SNMF. Its advantage is that $H$ is closer to the form of a clustering indicator and $S$ provides a good representation of the clustering quality. If the clusters are well separated, the diagonal elements of $S$ will be much larger than the off-diagonal elements [26].
The updating rules of the weighted SNMF are:

$$H_{ik} \leftarrow H_{ik} \left( 1 - \beta + \beta \, \frac{(AHS)_{ik}}{(HSH^THS)_{ik}} \right) \qquad (9)$$

$$S_{kl} \leftarrow S_{kl} \, \frac{(H^TAH)_{kl}}{(H^THSH^TH)_{kl}} \qquad (10)$$

A fast algorithm, NNDSVD, can be used to improve the initialization stage of SNMF. NNDSVD can readily be combined with SNMF and leads to a rapid reduction of the approximation error of many NMF algorithms [28].
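To make the weighted factorization $A \approx HSH^T$ concrete, the sketch below applies damped multiplicative updates to H and S. It is an illustration under explicit assumptions, not the authors' code: the update formulas are the standard multiplicative rules derived from the gradient of $\|A - HSH^T\|_F^2$ with the damping parameter β applied to H, the initialization is random rather than NNDSVD, and the function name and toy similarity matrix are hypothetical.

```python
import numpy as np

def weighted_snmf(A, k, beta=0.5, n_iter=500, eps=1e-10, seed=0):
    """Factorize a symmetric nonnegative A (n x n) as H S H^T."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, k)) + eps          # random start (NNDSVD not used here)
    S = rng.random((k, k))
    S = (S + S.T) / 2 + np.eye(k)         # symmetric, strictly positive
    for _ in range(n_iter):
        # Damped multiplicative update for H (0 < beta <= 1).
        num_H = A @ H @ S
        den_H = H @ (S @ (H.T @ H) @ S)
        H *= 1.0 - beta + beta * num_H / (den_H + eps)
        # Multiplicative update for S; symmetry of S is preserved.
        HtH = H.T @ H
        S *= (H.T @ A @ H) / (HtH @ S @ HtH + eps)
    return H, S

# Toy similarity matrix: two groups of three samples (hypothetical data).
A = np.full((6, 6), 0.05)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
H, S = weighted_snmf(A, k=2, beta=0.5)
rel_err = np.linalg.norm(A - H @ S @ H.T) / np.linalg.norm(A)
# Columns of H can differ in scale, so normalise them before reading labels.
labels = (H / np.linalg.norm(H, axis=0)).argmax(axis=1)
```

Because S absorbs the scale of each cluster, the columns of H are free to approach indicator-like vectors, which is the property the text attributes to the weighted form.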

Similarity Network Fusion (SNF)
SNF is an effective method for integrating multiple heterogeneous data types [17]. It uses the heat kernel function to construct similarity matrices and iteratively updates each affinity matrix with information from the other matrices to form a final matrix. Spectral clustering is then used to obtain network clusters.
After the weight matrix $W$ is established (Eq. 4), a well-normalized affinity matrix $P$ can be derived from:

$$P(i,j) = \begin{cases} \dfrac{W(i,j)}{2 \sum_{k \ne i} W(i,k)}, & j \ne i \\[2mm] \dfrac{1}{2}, & j = i \end{cases} \qquad (11)$$

To a certain degree, this normalization avoids the numerical instability caused by self-similarity. Note that $P$ encodes the global information between each sample and every other sample. To capture the local structure of the graph, [17] used the KNN method to define a kernel matrix $M$ as:

$$M(i,j) = \begin{cases} \dfrac{W(i,j)}{\sum_{k \in N_i} W(i,k)}, & j \in N_i \\[2mm] 0, & \text{otherwise} \end{cases} \qquad (12)$$

where $N_i$ denotes the neighborhood of the i-th sample. In contrast to $P$, $M$ carries only the local similarity to the k nearest neighbors of each sample. The SNF process always starts from $P$ as the initial state and uses $M$ as the kernel matrix. More details can be found in [17].
Let $P^{(v)}$ and $M^{(v)}$ represent the similarity matrices of the v-th view obtained from Eq. 11 and Eq. 12, respectively. SNF iteratively updates the affinity matrices as follows:

$$P^{(v)}_{t+1} = M^{(v)} \times \left( \frac{\sum_{k \ne v} P^{(k)}_{t}}{j - 1} \right) \times \left( M^{(v)} \right)^T, \quad v = 1, 2, \cdots, j \qquad (13)$$

where $j$ is the number of distinct views. Similarity information is thus propagated and accumulated through the fusion process: each view uses the information of all other views by interchanging diffusion processes. Finally, the affinity matrix that fuses all data types is defined as:

$$P = \frac{1}{j} \sum_{v=1}^{j} P^{(v)} \qquad (14)$$

SNF can reduce some of the noise between samples by using the KNN approach to construct the sub-graph $M$, so SNF is robust to noise in the affinity matrices. In graph theory, global silencing [29] and network deconvolution [30] are also efficient methods for eliminating noise. In future work, we will consider adopting these two approaches to remove indirect correlation (or similarity).
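The fusion procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than a faithful reimplementation of SNF [17]: the function names are hypothetical, each sample is counted among its own k nearest neighbours, and the renormalization steps that published implementations apply between iterations are omitted.

```python
import numpy as np

def global_kernel(W):
    """P: off-diagonal entries of each row sum to 1/2, diagonal fixed at 1/2."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """M: keep the k largest similarities in each row and renormalise the row."""
    M = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k]        # neighbourhood N_i (includes i)
    rows = np.arange(W.shape[0])[:, None]
    M[rows, idx] = W[rows, idx]
    return M / M.sum(axis=1, keepdims=True)

def snf(Ws, k=3, n_iter=20):
    """Fuse a list of similarity matrices, one per view."""
    Ps = [global_kernel(W) for W in Ws]
    Ms = [local_kernel(W, k) for W in Ws]
    n_views = len(Ws)
    for _ in range(n_iter):
        updated = []
        for v in range(n_views):
            # Average the current status matrices of all other views.
            others = sum(Ps[u] for u in range(n_views) if u != v) / (n_views - 1)
            updated.append(Ms[v] @ others @ Ms[v].T)
        Ps = updated
    fused = sum(Ps) / n_views
    return (fused + fused.T) / 2               # symmetrise the final matrix

# Two noisy views of the same 8 samples in 2 groups (hypothetical data).
rng = np.random.default_rng(0)
centers = np.repeat([[0.0, 0.0], [3.0, 3.0]], 4, axis=0)
views = [centers + 0.3 * rng.standard_normal(centers.shape) for _ in range(2)]
Ws = [np.exp(-((X[:, None] - X[None]) ** 2).sum(-1)) for X in views]
fused = snf(Ws, k=3, n_iter=20)
```

In this toy example the fused matrix keeps strong within-group similarities while between-group entries stay close to zero, because the local kernels only propagate information through each sample's nearest neighbours.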

The Robust Symmetric Nonnegative Matrix Factorization (RSNMF)
Here, we propose a robust symmetric nonnegative matrix factorization framework (RSNMF) based on SNF and SNMF for clustering text data, microbiome data and other multi-view data. First, RSNMF adopts appropriate kernel functions to construct the similarity matrices according to the data types. Then, SNF is used to integrate the similarity matrices, or networks, obtained in the previous step into a final fused matrix. Finally, by applying the weighted SNMF to the final fused matrix, the clustering indicator matrix is obtained and all samples are assigned to their corresponding membership labels. Figure 1 illustrates the process of RSNMF.
Because RSNMF combines the strengths of SNMF and SNF to form a robust clustering indicator matrix, it achieves better performance in terms of the two metrics described below. The subsequent experimental results also demonstrate the effectiveness and efficiency of RSNMF.

Evaluation Metrics
In our experiments, two widely adopted metrics, accuracy (AC) and normalized mutual information (NMI), are used to evaluate clustering quality by comparing the cluster labels derived from the various algorithms with the ground truth [24]. Higher values of AC and NMI indicate better clustering performance.
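For readers unfamiliar with these metrics, the sketch below computes both from two label lists using only the Python standard library. It is illustrative and rests on assumptions that the text does not spell out: AC is taken as the agreement under the best one-to-one relabelling of the predicted clusters (found here by brute force, which is only practical for a small number of clusters), and NMI is normalised by the larger of the two entropies, which is one of several conventions in use. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log

def clustering_accuracy(truth, pred):
    """Fraction of samples matched under the best relabelling of pred."""
    t_labels = sorted(set(truth))
    p_labels = sorted(set(pred))
    # Pad with dummy classes so every predicted label can be mapped somewhere.
    size = max(len(t_labels), len(p_labels))
    t_pad = t_labels + [None] * (size - len(t_labels))
    pairs = Counter(zip(truth, pred))
    best = 0
    for perm in permutations(t_pad, len(p_labels)):
        hits = sum(pairs[(t, p)] for t, p in zip(perm, p_labels))
        best = max(best, hits)
    return best / len(truth)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(truth, pred):
    """Mutual information divided by the larger label entropy (assumed convention)."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    mi = sum(c / n * log(c * n / (ct[t] * cp[p]))
             for (t, p), c in Counter(zip(truth, pred)).items())
    denom = max(entropy(truth), entropy(pred))
    return mi / denom if denom > 0 else 1.0

truth = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]      # same partition, labels swapped
noisy = [1, 1, 0, 0, 0, 0]        # one sample misassigned
ac_perfect, nmi_perfect = clustering_accuracy(truth, perfect), nmi(truth, perfect)
ac_noisy, nmi_noisy = clustering_accuracy(truth, noisy), nmi(truth, noisy)
```

Both metrics are invariant to how the clusters are numbered, which is why the relabelled partition scores as a perfect match.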

Detection of the Noisy views
As described, SNF takes advantage of the complementary information present in different views to construct a comprehensive representation that reflects the full spectrum of the latent data. The final fused matrix P can be used to measure the compatibility between similarity matrices obtained from different views. Because NMI measures the agreement between two clustering results, it is an appropriate tool for filtering out noisy views. If the NMI between one similarity matrix and P is much smaller than that of the other views, that view may contain more noise.

Experimental results
We conduct extensive experiments on one synthetic and two real-world datasets, and compare the proposed RSNMF algorithm with a number of baselines. These baseline algorithms include:
• Single view (BSV and WSV) [12]: Running the SNMF algorithm on each view separately to obtain its clustering result. BSV refers to the most informative view, which achieves the best clustering performance; in contrast, WSV refers to the worst view of the data.
• Co-training Spectral Clustering (Co-training SC): Applying the idea of co-training [31] to the framework of spectral clustering. The discriminative eigenvectors from one view are used iteratively to modify the graph structure of the other view, so that the underlying clusters of both views ultimately tend to agree with each other.
• Multi-view NMF: Integrating the coefficient matrices learnt from all views to obtain a common consensus solution [12]. The consensus matrix reflects the latent data structure shared by all views. The objective function of Multi-view NMF is:

$$\min \; \sum_{v} \left\| X^{(v)} - U^{(v)} \left( V^{(v)} \right)^T \right\|_F^2 + \alpha \sum_{v} \left\| V^{(v)} Q^{(v)} - V^{*} \right\|_F^2, \quad \text{s.t. } U^{(v)}, V^{(v)}, V^{*} \ge 0$$

where the weight $\alpha$ controls the agreement with the consensus matrix $V^{*}$. By introducing the auxiliary matrix $Q^{(v)} = \mathrm{Diag}\left( \sum_i U^{(v)}_{i,1}, \sum_i U^{(v)}_{i,2}, \cdots, \sum_i U^{(v)}_{i,k} \right)$, the coefficient matrices $V^{(v)}$ from different views become comparable, which guarantees that the fusion of all views is meaningful and interpretable.
• SNF: Constructing sample-by-sample similarity networks for all data types and integrating them into a final affinity graph that captures both the shared and the complementary information provided by distinct views. More details are given in the SNF section [17].
• RSNF: Constructing a clustering random forest for each data type and then using SNF to integrate the affinity graphs from the different views into a fused network [10]. Finally, spectral clustering is employed to partition the nodes in the graph.

Experimental results are reported on one synthetic and two real-world datasets. Table 4 shows the clustering results of RSNMF compared with the baseline algorithms above on these three datasets. As Table 4 shows, RSNMF significantly outperforms the other algorithms in terms of AC/NMI.

*For Multi-view NMF, the clustering results on the Three-source and HMP datasets are obtained when α = 0.01 and 0.05, respectively. -- indicates that the result is unavailable because of negative elements in the dataset. For RSNF_Adpt, the number of neighbors and the number of sub-trees on the three datasets are set to 12 and 200, respectively. µ, a hyperparameter, is set to 0.5 empirically.

On these three datasets, RSNMF achieves considerable improvement in both metrics compared with the other algorithms. Surprisingly, on the synthetic data, RSNMF obtains perfect performance (100%/100%) in terms of AC/NMI. RSNMF outperforms the best single view (BSV) by 3.37%/8.48% on the Three-source data and by 6.43%/10.41% on the HMP data. The difference between RSNMF and SNF is also significant: RSNMF reaches 82.25%/78.14% and 96.23%/95.32% in AC/NMI on the two realistic datasets, whereas SNF reaches only 65.68%/56.34% and 92.78%/89.20%, respectively (see Table 4). One possible reason is that RSNMF provides extra degrees of freedom via S, which allows H to be much closer to the form of a cluster indicator than in spectral clustering based methods (SNF, RSNF_Adpt, Co-training SC) [26]. Furthermore, RSNMF uses a more flexible way of constructing the affinity matrix for distinct data types. Therefore, the proposed RSNMF algorithm can find the true clustering effectively and accurately.

Parameter tuning
There are several parameters in the proposed RSNMF framework. µ is a hyperparameter; it is set to 0.4 on the Synthetic and Three-source data and to 0.45 on the HMP data. For other values, RSNMF still outperforms the other algorithms in most cases. The number of iterations and the number of neighbors in SNF, RSNF_Adpt and RSNMF are set to 20. The value of β varies between 0 and 1. Through extensive experiments on these three datasets, we find that RSNMF is insensitive to these parameters and runs efficiently.
Figure 2 shows how the performance of RSNMF on the two realistic datasets varies with the parameter β. RSNMF achieves consistently good performance regardless of β.

Computational complexity study
In this subsection, we discuss the computational complexity of RSNMF. Based on the updating rules summarized in Eq. 9 and Eq. 10, the overall cost of RSNMF after running SNF on the different data sources is (3 + 5 + ), where k is the latent number of clusters and ≪ , . Therefore, the computational cost can be simplified as ( + ). It is worth noting that in bioinformatics the number of samples n is generally much smaller than the number of features. Additionally, RSNMF adopts a fast initialization algorithm (NNDSVD) in the first stage of the iteration, which helps the inner loop converge very quickly [28].
For large-scale data, the total running time is approximately linear in the number of samples. In contrast to RSNMF, ClustRF is very costly, and generating the similarity matrix becomes computationally infeasible when the number of data points increases dramatically.

Analysis on HMP data
In Eq. 8, S provides a good description of the quality of the clustering and has a special meaning [26]. By applying K-means to H, we can obtain hard cluster indicators. Assuming $H^TH = I$ and setting the derivative of the objective in Eq. 8 with respect to $S$ to zero, we obtain:

$$S = H^T A H, \qquad S_{lk} = \frac{\sum_{i \in C_l} \sum_{j \in C_k} A_{ij}}{\sqrt{n_l \, n_k}}$$

where $C_i$ denotes the i-th cluster and $n_i$ denotes the number of samples in $C_i$. More details can be found in [26]. S represents the normalized sum of weights between two clusters. If the clusters are well separated, the diagonal elements will be much larger than the off-diagonal elements.
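The interpretation of S as a normalized sum of weights between clusters can be checked on a small example. The sketch below builds the normalized indicator matrix H (so that $H^TH = I$) from hard labels and evaluates $H^TAH$; the function name and the 4×4 similarity matrix are hypothetical, and every cluster is assumed to be non-empty.

```python
import numpy as np

def cluster_weight_matrix(A, labels, k):
    """S = H^T A H for the normalised cluster indicator H (H^T H = I)."""
    n = len(labels)
    H = np.zeros((n, k))
    H[np.arange(n), labels] = 1.0
    H /= np.sqrt(H.sum(axis=0, keepdims=True))   # column l scaled by 1/sqrt(n_l)
    return H.T @ A @ H

# Toy similarity matrix with two tight groups of two samples each.
A = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])
labels = np.array([0, 0, 1, 1])
S = cluster_weight_matrix(A, labels, k=2)
# S[0, 0] = (1.0 + 0.9 + 0.9 + 1.0) / 2 = 1.9   (within cluster 0)
# S[0, 1] = (0.1 + 0.0 + 0.0 + 0.1) / 2 = 0.1   (between clusters)
# S[1, 1] = (1.0 + 0.8 + 0.8 + 1.0) / 2 = 1.8   (within cluster 1)
```

The dominance of the diagonal entries over the off-diagonal entry is the pattern the text uses as evidence of well-separated clusters.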
On the Three-source and HMP data, RSNMF factorizes the fused similarity matrices and obtains S1 and S2, respectively (see Fig. 3 and Fig. 4). The off-diagonal elements of S denote the between-cluster sums of weights. The off-diagonal elements are much smaller than the diagonal elements, which shows that the boundary between any two clusters is evident; that is, the clusters are well separated.
Fig. 5 illustrates the clear clustering patterns on the HMP data. Because RSNMF and the other variants of NMF are soft-clustering methods, the coefficient matrix H (637×7), which is a low-dimensional representation of the microbiome samples, can be transformed into a hard cluster indicator with K-means. The figure clearly identifies the clustering structure corresponding to microbiome samples from distinct body sites. As also seen in Table 4, the proposed method achieves 96.23%/95.32% in terms of AC/NMI, which is the highest result to the best of our knowledge. Fig. 5 shows that the connections within clusters are very strong, whereas the edges between clusters are very sparse, which is consistent with S2 (see Figure 4). RSNMF successfully divides the nodes of this graph into different clusters with only a small loss of weights. This suggests that RSNMF is a versatile framework for integrating multiple heterogeneous data.
After the microbiome sample network is constructed, RSNMF can correctly identify communities from distinct body sites, as can be observed in Fig. 5. The sum of weights of the edges connecting different modules is much smaller than that within clusters. The clusters from Posterior_fornix (orange, triangle), Stool (empty) and Anterior_nares (light white, box) have distinct boundaries, but samples from Plaque (blue, ellipse), Buccal_mucosa (blue, diamond) and Tongue_dorsum (pink, diamond) are not well separated. A few samples (V169, V181, V319, V514, V504 and V608) from one oral site (Buccal_mucosa) overlap with another oral site. Not surprisingly, these three body sites are all closely related to the mouth and have similar microbiome composition and diversity [1]. At the top of the figure, the microbiome samples from Anterior_nares (light white, box) and Retroauricular_crease (light white, ellipse) have high similarities. We attribute this to the fact that both sites are skin sites. In this case RSNMF cannot clearly distinguish between the two sites, possibly because of noise in the raw data, which carries over into the network built from those data.
On the other hand, RSNMF can accurately detect the modules so that the cut between clusters has a minimum weight loss (see S2). In RSNMF, the quality of the clustering and community detection depends on the construction of the similarity network. Given a robust affinity graph, RSNMF will successfully identify the cluster structure. Although community detection is not the focus of this study, RSNMF can also be viewed as a novel method for community finding.

Conclusions
In this paper, we introduce a novel framework (RSNMF) for data integration based on similarity network fusion (SNF) and symmetric nonnegative matrix factorization (SNMF). We extend the similarity measures of SNF according to the data type and achieve a considerable improvement in performance. On human microbiome data, we combine the phylogenetic, metabolic and transporter profiles in the RSNMF framework to analyze the correlations among microbiome samples. Because of the delicate and complex interactions between microbiota and the host environment, a nonlinear method is more appropriate for modeling the microbial community. RSNMF, a graph clustering approach, can be used with a network constructed in any reasonable manner, including but not limited to Euclidean distance-based methods.
The proposed RSNMF is a robust method for clustering text and microbiome data efficiently and effectively. The performance of RSNMF on these two data types is quite stable. We also show that RSNMF converges in nearly linear computational time. Experimental results on one synthetic and two realistic datasets demonstrate that RSNMF is a more versatile framework than the baseline and state-of-the-art approaches, which suggests the potential of RSNMF for microbiome data analysis. RSNMF can also be applied to community finding: the analysis of the HMP data shows that RSNMF is able to identify distinct modules in a complex network. In the future, we will focus on new community detection algorithms based on RSNMF.

Fig 1. Illustrative example of RSNMF. (a) Example representation of the abundance profile and the phylogenetic profile for the same cohort of samples. (b) Sample similarity matrices for each view. (c) SNF iteratively updates each similarity matrix with information from the other similarity matrix, forming the final fused matrix. (d) By applying RSNMF to the final fused matrix, the clustering indicator matrix is obtained and each sample is assigned a reliable membership label.

Fig 2. Performance of RSNMF versus β on the Three-source and HMP datasets. In both evaluation metrics (AC and NMI), RSNMF has stable and reliable performance. When β equals 0.5 and 0.2 on the Three-source and HMP data, respectively, RSNMF obtains the highest values in terms of AC/NMI. For other values of β, the fluctuation of RSNMF is small.

Fig 3. The 6×6 matrix S1 derived from the Three-source data indicates that the distinct clusters are well separated. The result is obtained when β is set to 0.5; for other values of β, RSNMF still has analogous performance.

Fig 4. The 7×7 matrix S2 derived from the HMP data, indicating to what extent two clusters are well separated. The matrix contains many zeros, illustrating that the sum of weights between clusters is very small. The result is obtained when β equals 0.2.

Fig 5. Clusters of microbiome samples after integrating three sources of data (phylogenetic, transporter and metabolic profiles). This result is obtained when the edge (weight) threshold is set to 3e-3 and β equals 0.2 empirically.

Table 2. Statistics of the Three-source dataset

Table 3. Statistics of the human microbiome samples

Table 4. Clustering performance on the three datasets (%)