Benchmarking Spatial Clustering Methods for Mass Spectrometry-Based Spatial Metabolomics

Yunning Lu; Zhanlong Mei; Haoke Deng; Yun Zhao; Chunlu Feng; Siqi Liu

doi:10.20944/preprints202604.1355.v1

Submitted: 17 April 2026

Posted: 20 April 2026

Abstract

Mass spectrometry imaging (MSI) enables in situ mapping of metabolite distributions within tissues, and spatial clustering is a key step for delineating metabolically distinct regions. Nevertheless, spatial clustering methods have not been systematically benchmarked for spatial metabolomics data. Here, we evaluated the effects of ion filtering and clustering method selection on clustering performance and established a dual-metric framework that jointly assesses the spatial continuity of cluster labels and inter-cluster metabolic heterogeneity. We benchmarked 30 clustering algorithms, ranging from non-spatial methods to advanced spatially aware models, across 12 heterogeneous MSI datasets spanning three major ion sources, four mass analyzers, and multiple spatial resolutions. Noise filtering markedly improved the spatial continuity of results generated by non-spatial methods (mean improvement, approximately 28%) but provided limited benefit for spatially aware methods. Across the 12 datasets, a median of only 11 methods satisfied both evaluation criteria simultaneously, whereas SSC and DRSC met the dual-metric thresholds in at least nine datasets. In the mbrain2_pos50 dataset, the top-ranked method based on the composite dual-metric score reached a normalized mutual information (NMI) with cell-type annotations of 0.448, compared with 0.221 for the lowest-ranked method. Together, the proposed evaluation framework and the online platform SMcluster provide a standardized resource for benchmarking and selecting MSI clustering methods. Our results highlight the critical roles of preprocessing and method selection in determining spatial clustering performance and offer practical guidance for spatial metabolomics studies.

Keywords: spatial metabolomics; mass spectrometry imaging; clustering benchmarking; spatial continuity; inter-cluster heterogeneity; online platform

Subject: Biology and Life Sciences - Biology and Biotechnology

1. Introduction

Spatial metabolomics offers a unique molecular perspective for elucidating disease mechanisms and cellular heterogeneity by mapping the in situ distribution of metabolites within the tissue microenvironment [1,2,3]. Major mass spectrometry imaging (MSI) platforms include matrix-assisted laser desorption/ionization mass spectrometry imaging (MALDI-MSI) [4], desorption electrospray ionization mass spectrometry imaging (DESI-MSI), and air flow-assisted desorption electrospray ionization mass spectrometry imaging (AFADESI-MSI) [5]. These platforms differ markedly in spatial resolution, detection sensitivity, and sample compatibility. In addition, instrument-specific noise characteristics — such as matrix effects in MALDI and ion suppression in DESI — together with the high dimensionality of MSI data, which often comprise thousands of ions, pose substantial challenges for downstream analysis. This complexity highlights the need for computational frameworks tailored to spatial metabolomics.

Within the MSI analysis workflow, spatial clustering is a fundamental step for identifying tissue regions with distinct metabolic characteristics, thereby supporting downstream differential analysis, regional functional annotation, and biomarker discovery [6,7,8]. A broad range of clustering strategies has been proposed, including early approaches that combine dimensionality reduction with conventional clustering [9], spatially aware methods that incorporate neighborhood information [10,11], and more recent graph neural network models that capture complex spatial dependencies [12]. At the same time, advanced methods originally developed for spatial transcriptomics are increasingly being adapted for MSI analysis [13,14]. Despite this rapid methodological expansion, the suitability of different clustering methods for diverse spatial metabolomics datasets remains unclear. Although several benchmarking studies have been conducted in spatial transcriptomics [15,16,17,18], the substantial differences between transcriptomic and metabolomic measurements limit the direct transferability of those conclusions to spatial metabolomics. Moreover, most existing MSI clustering studies focus on introducing new methods and typically compare only 2–5 baseline methods [10,11,12,14,19,20]; to our knowledge, no study has yet systematically benchmarked dozens of clustering methods across more than 10 heterogeneous MSI datasets. In addition, the effects of preprocessing strategies, particularly noise filtering, on clustering performance remain insufficiently quantified [21,22,23], and the absence of a unified evaluation framework hinders cross-study comparison.

To address these gaps, we systematically evaluated 30 clustering methods across 12 MSI datasets spanning three ion sources, multiple spatial resolutions, and diverse tissue types. We established a dual-metric evaluation framework that jointly measures spatial continuity and inter-cluster metabolic heterogeneity and, where available, incorporated matched spatial transcriptomics-derived cell-type annotations for external validation. We also quantified the impact of preprocessing strategies on clustering performance. Through this comprehensive benchmark, we clarify the importance of preprocessing and method selection, identify clustering methods that perform robustly on spatial metabolomics data, and provide practical guidance for method selection. Finally, we developed an interactive online platform to help researchers choose appropriate clustering tools according to specific data characteristics and analytical objectives.

2. Materials and Methods

2.1. Data Collection and Preparation

To comprehensively assess the applicability of clustering methods to spatial metabolomics data, we assembled a benchmark comprising 12 MSI datasets from diverse sources (Table 1). These datasets span three major ion sources (AFADESI, MALDI, and DESI), four mass analyzers (Orbitrap, TOF, LTQ, and FT-ICR), and spatial resolutions ranging from 20 to 100 µm. The benchmark combines seven internally generated datasets with five publicly available datasets; the external datasets cover the major MSI ionization platforms commonly used in current research and representative tissue types, including mouse brain, mouse kidney, and mouse and pig fetuses.

To further examine the effects of ionization mode and spatial resolution on clustering performance, we acquired MSI data from coronal serial sections of 7-week-old male mouse brain using the AFADESI-MSI platform in both positive and negative ion modes at spatial resolutions of 20 µm, 50 µm, and 100 µm. The spray solvent consisted of acetonitrile/water (80:20, v/v), and the AFADESI extraction gas flow rate was set to 45 L/min. Mass spectrometric detection was performed on a Q Exactive mass spectrometer (Thermo Fisher Scientific, USA) at a mass resolution of 70,000.

All raw data, whether experimentally generated or publicly obtained, were processed using a unified workflow in Cardinal [24] to minimize analytical bias. Peak picking was performed with the diff method in Cardinal, using a signal-to-noise ratio (SNR) threshold of 6 to identify local maxima as feature peaks. Peaks were then aligned across pixels with a tolerance of 10 ppm, and low-frequency features detected in fewer than 10% of pixels were removed. Peak areas were calculated after Gaussian smoothing of the raw spectra. Finally, SMAnalyst [8] was used for quality control of the peak-picked spatial metabolomics data, including identification and removal of background regions.
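Peak alignment and the frequency filter were run in Cardinal, and Cardinal's own alignment algorithm is not reproduced here. As a rough illustration of what the two parameters mean, the Python sketch below greedily groups peaks whose m/z lies within 10 ppm of a running group centre (10 ppm corresponds to 0.008 at m/z 800) and then drops features present in fewer than 10% of pixels. The greedy grouping rule and the helper name `align_peaks` are assumptions made for illustration only.

```python
def align_peaks(peak_lists, tol_ppm=10.0, min_freq=0.10):
    """Hypothetical helper: group per-pixel peak m/z values into shared features,
    then drop features detected in too few pixels.

    peak_lists: one list of picked-peak m/z values per pixel.
    Returns the centre m/z of each feature detected in >= min_freq of pixels.
    """
    # Pool all peaks with the index of the pixel they came from, sorted by m/z.
    pooled = sorted((mz, pix) for pix, mzs in enumerate(peak_lists) for mz in mzs)
    groups = []  # each group: [sum of m/z, number of peaks, set of pixel indices]
    for mz, pix in pooled:
        if groups:
            total, count, pixels = groups[-1]
            centre = total / count
            # Greedy rule (assumption): join the open group if within tol_ppm of its centre.
            if abs(mz - centre) / centre * 1e6 <= tol_ppm:
                groups[-1][0] += mz
                groups[-1][1] += 1
                pixels.add(pix)
                continue
        groups.append([mz, 1, {pix}])
    n_pixels = len(peak_lists)
    # Frequency filter: remove features detected in fewer than min_freq of pixels.
    return [total / count for total, count, pixels in groups
            if len(pixels) / n_pixels >= min_freq]


# Hypothetical example: 20 pixels share a peak near m/z 500; one pixel has a stray m/z 900 peak.
peaks = [[500.0 + 0.001 * (i % 3)] for i in range(20)]
peaks[0].append(900.0)
features = align_peaks(peaks)
print(features)  # one feature near m/z 500.001; m/z 900 (5% of pixels) is dropped
```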

To establish an objective biological validation framework and assess whether clustering results recapitulate anatomical structure, we used co-registered spatial metabolomics and spatial transcriptomics data from our previous SMIntegration study [25]. This paired dataset consists of spatial metabolomics data (mbrain2_pos50) and spatial transcriptomics (ST) data obtained from adjacent mouse brain sections. The ST data were generated using Stereo-seq at a native spatial resolution of 500 nm. To match the pixel resolution of the spatial metabolomics data, the transcriptomics data were aggregated to 50 µm by binning, and cell-type annotations were inferred by deconvolving the ST data with Cell2location [26]. Precise co-registration between the spatial metabolomics and transcriptomics data was then performed using the SpatialData framework [27]. This workflow enabled cell-type annotations to be mapped onto spatial metabolomics pixels, thereby providing an external biological reference for downstream evaluation of clustering performance.
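Binning from the 500 nm Stereo-seq grid to 50 µm amounts to merging 100 × 100 native spots into one bin (50 / 0.5 = 100). The NumPy sketch below shows only this aggregation step, assuming spot coordinates are integer indices on the native grid and counts are a dense spots-by-genes array; real Stereo-seq matrices are sparse and are usually binned with dedicated tooling, and the Cell2location deconvolution and SpatialData registration are not shown. `bin_spots` is a hypothetical helper.

```python
import numpy as np


def bin_spots(x, y, counts, native_um=0.5, target_um=50.0):
    """Hypothetical helper: sum spot-level gene counts into square bins of side target_um.

    x, y: integer spot coordinates on the native grid (one unit = native_um).
    counts: (n_spots, n_genes) array of counts.
    Returns (bins, summed): bin coordinates (n_bins, 2) and summed counts (n_bins, n_genes).
    """
    factor = int(round(target_um / native_um))  # 50 / 0.5 -> 100 native spots per bin side
    keys = np.stack([np.asarray(x) // factor, np.asarray(y) // factor], axis=1)
    bins, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = np.asarray(inverse).reshape(-1)  # some NumPy versions return a 2-D inverse here
    summed = np.zeros((len(bins), counts.shape[1]), dtype=counts.dtype)
    np.add.at(summed, inverse, counts)  # unbuffered scatter-add of each spot into its bin
    return bins, summed


# Hypothetical toy input: four spots, two genes.
x = np.array([0, 99, 100, 250])
y = np.array([0, 50, 0, 10])
counts = np.array([[1, 0], [2, 1], [3, 0], [0, 5]])
bins, summed = bin_spots(x, y, counts)
print(bins.tolist(), summed.tolist())
```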

Table 1. Summary of the 12 heterogeneous MSI datasets used in this study, covering multiple platforms and spatial resolutions.

Dataset Name	Ion Source	Mass Analyzer	Ionization Mode	Resolution (μm)	Sample Type	No. of Pixels	No. of m/z	Source
mbrain1_neg20	AFADESI	Orbitrap	Neg	20	Mouse brain	107,423	2005	In-house
mbrain1_neg50	AFADESI	Orbitrap	Neg	50	Mouse brain	23,531	2138	In-house
mbrain1_neg100	AFADESI	Orbitrap	Neg	100	Mouse brain	5875	2162	In-house
mbrain1_pos20	AFADESI	Orbitrap	Pos	20	Mouse brain	112,023	2578	In-house
mbrain1_pos50	AFADESI	Orbitrap	Pos	50	Mouse brain	22,792	2989	In-house
mbrain1_pos100	AFADESI	Orbitrap	Pos	100	Mouse brain	5373	3044	In-house
mbrain2_pos50	AFADESI	Orbitrap	Pos	50	Mouse brain	14,283	2654	In-house
PDX_mbrain_pos100	MALDI	FT-ICR	Pos	100	Mouse brain	3570	1131	[28]
pfetus_neg	DESI	LTQ	Neg	NA¹	Pig fetus	4959	687	[24]
mfetus_neg	MALDI	TOF	Neg	NA¹	Mouse fetus	16,197	2203	[29]
mbrain_neg40	MALDI	TOF	Neg	40	Mouse brain	32,368	2531	[30]
mkidney_neg40	MALDI	Orbitrap	Neg	40	Mouse kidney	45,623	258	[31]

¹ Spatial resolution not explicitly reported in the original study.

2.2. Spatial Noise Score and Ion Filtering

MSI data contain complex matrix interferences and stochastic detection noise, which give rise to ions with spatially random intensity distributions. Such noisy ions interfere with the ability of clustering algorithms to identify tissue structures [22]. To characterize the spatial distribution of individual ions, we introduced the Spatial Noise Score (SNS), which quantifies the spatial noise level of a single ion image as the proportion of spatially dispersed pixels [32]. Specifically, the intensity image of each ion is first binarized at its median value, and the proportion of "abnormal boundary pixels", whose binary labels are inconsistent with those of their neighboring pixels, is then calculated. A lower SNS indicates a more spatially coherent distribution, suggesting that the ion is more likely to reflect a genuine anatomical region. To evaluate the effect of noisy-ion filtering on clustering algorithms, we applied a gradient filtering strategy: ions were ranked by SNS in ascending order (i.e., from least to most noisy), and the top 5%, 10%, 20%, 40%, 60%, 80%, and 100% were retained to construct test datasets. By comparing clustering performance across these filtering levels, we aimed to identify the preprocessing threshold that best balances retention of biological signal against noise suppression.
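The SNS description leaves a few details open. The pure-Python sketch below assumes an 8-connected neighbourhood, binarises each ion image as "above the median" versus "at or below the median", and counts a pixel as an abnormal boundary pixel when its binary label disagrees with more than half of the neighbours present in the tissue; these choices and the helper names are assumptions rather than the published implementation [32]. `retain_lowest_sns` illustrates one level of the gradient filtering, keeping a given fraction of ions ranked from lowest to highest SNS.

```python
from statistics import median

# 8-connected neighbourhood offsets (assumption: the neighbourhood is not stated).
OFFSETS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def spatial_noise_score(ion_image):
    """Fraction of pixels whose median-binarised label disagrees with most neighbours.

    ion_image: dict mapping (x, y) pixel coordinates to the ion's intensity.
    """
    cutoff = median(ion_image.values())
    high = {xy: value > cutoff for xy, value in ion_image.items()}
    abnormal = 0
    for (x, y), label in high.items():
        neighbours = [high[(x + dx, y + dy)] for dx, dy in OFFSETS
                      if (x + dx, y + dy) in high]
        disagree = sum(n != label for n in neighbours)
        if neighbours and disagree > len(neighbours) / 2:
            abnormal += 1
    return abnormal / len(high)


def retain_lowest_sns(ion_images, fraction):
    """Hypothetical gradient-filtering step: keep the given fraction of least noisy ions."""
    ranked = sorted(ion_images, key=lambda ion: spatial_noise_score(ion_images[ion]))
    return ranked[:max(1, round(len(ranked) * fraction))]


# Toy 4 x 4 images: a smooth two-region ion and a checkerboard (noise-like) ion.
smooth = {(x, y): (10.0 if x < 2 else 0.0) for x in range(4) for y in range(4)}
checker = {(x, y): 10.0 * ((x + y) % 2) for x in range(4) for y in range(4)}
print(spatial_noise_score(smooth), spatial_noise_score(checker))  # 0.0 0.75
print(retain_lowest_sns({"noisy": checker, "clean": smooth}, 0.5))  # ['clean']
```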

2.3. Clustering Methods

A total of 30 representative clustering methods were selected for systematic benchmarking. Based on their algorithmic logic and their use of spatial information, these methods fall into two broad categories: spatially-aware and non-spatially-aware methods. Spatially-aware methods explicitly integrate pixel spatial coordinates with spectral similarity through architectures such as graph convolutional networks (GCN), graph attention networks (GAT), or hidden Markov random fields (HMRF). Non-spatially-aware methods rely solely on metabolite intensity profiles and include commonly used community detection algorithms (e.g., Leiden and Louvain) as well as two-stage pipelines consisting of dimensionality reduction (PCA/t-SNE/UMAP) followed by conventional clustering (K-means/GMM/HC/Spectral).

The selected methods encompass algorithms specifically designed for MSI, advanced models transferred from spatial transcriptomics (ST), and general-purpose clustering methods used in bioinformatics analysis. Detailed descriptions of each algorithm are provided in Supplementary Note S1. To ensure a fair evaluation, all methods were run in a unified hardware environment. To improve the comparability of results across methods, the number of clusters was set to approximately the same value for all methods, with all remaining parameters set to their default values (Supplementary Table S1). Table 2 summarizes the categorical attributes, application scenarios, technical architectures, and implementation languages of all evaluated methods.
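The two-stage non-spatial pipelines can be illustrated with a minimal NumPy version of pca_Kmeans. This is a sketch, not the benchmarked implementation: it assumes PCA via a singular value decomposition of the centred intensity matrix, a hypothetical default of 20 components, and a farthest-point initialisation in place of the k-means++ seeding used by libraries such as scikit-learn. The pixel coordinates never enter the computation, which is what makes the method non-spatially-aware.

```python
import numpy as np


def pca_kmeans(intensities, n_clusters, n_components=20, n_iter=100, seed=0):
    """Sketch of a two-stage, non-spatial pipeline: PCA embedding, then K-means (Lloyd).

    intensities: (n_pixels, n_ions) matrix. Pixel coordinates are never used.
    Returns an integer cluster label per pixel.
    """
    X = np.asarray(intensities, dtype=float)
    X = X - X.mean(axis=0)                           # centre each ion
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ Vt[:min(n_components, Vt.shape[0])].T    # principal-component scores

    # Farthest-point initialisation (assumption): start from a random pixel, then
    # repeatedly add the pixel farthest from all centres chosen so far.
    rng = np.random.default_rng(seed)
    centres = [Z[rng.integers(len(Z))]]
    for _ in range(1, n_clusters):
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(Z[d2.argmax()])
    centres = np.array(centres)

    for _ in range(n_iter):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Update step: move each centre to the mean of its pixels (empty clusters stay put).
        new = np.array([Z[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
                        for c in range(n_clusters)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels


# Hypothetical data: two groups of 30 pixels with different 5-ion profiles.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (30, 5)), rng.normal(5.0, 0.1, (30, 5))])
labels = pca_kmeans(data, n_clusters=2, n_components=2)
print(labels)
```

Swapping the PCA step for t-SNE or UMAP, or the K-means step for GMM, hierarchical, or spectral clustering, gives the other two-stage combinations listed above.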

2.4. Evaluation Metrics

2.4.1. Spatial Continuity Assessment

Spatial continuity is an important criterion for evaluating spatial clustering algorithms because it reflects whether the resulting cluster labels are spatially coherent. We quantified spatial continuity using the Percentage of Abnormal Spots (PAS), which measures the agreement between each pixel's label and the labels of its neighbors. PAS is defined as the proportion of pixels whose cluster label differs from those of more than half of their eight surrounding neighbors [22]. A lower PAS indicates less spatial fragmentation and therefore smoother, more coherent tissue-domain boundaries, consistent with the biological expectation that metabolite distributions in the tissue microenvironment tend to be regionally continuous.
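PAS can be computed directly from a label map. The pure-Python sketch below assumes labels are stored on an integer pixel grid and, because the text does not say how pixels at the tissue edge are handled, compares edge pixels only with the neighbours that exist; for interior pixels this reduces to the "more than half of 8 neighbours" rule. The function name is illustrative.

```python
def percentage_abnormal_spots(labels):
    """PAS sketch: fraction of pixels whose cluster label differs from more than half
    of their 8-connected neighbours.

    labels: dict mapping (x, y) integer pixel coordinates to cluster labels.
    Edge pixels are judged only against neighbours present in the map (assumption).
    """
    abnormal = 0
    for (x, y), label in labels.items():
        neighbours = [labels[(x + dx, y + dy)]
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                      if (dx, dy) != (0, 0) and (x + dx, y + dy) in labels]
        if neighbours and sum(n != label for n in neighbours) > len(neighbours) / 2:
            abnormal += 1
    return abnormal / len(labels)


# Toy 5 x 5 section: two contiguous domains, then a uniform map with one isolated pixel.
clean = {(x, y): ("A" if x < 3 else "B") for x in range(5) for y in range(5)}
speckled = {(x, y): "A" for x in range(5) for y in range(5)}
speckled[(2, 2)] = "B"
print(percentage_abnormal_spots(clean), percentage_abnormal_spots(speckled))  # 0.0 0.04
```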

2.4.2. Inter-Cluster Metabolic Heterogeneity Assessment

To quantify how well clustering regions capture spatial metabolic heterogeneity, we introduced median-η², an inter-cluster heterogeneity metric based on variance decomposition. The metric borrows the partition of total variance into between-group and within-group components from analysis of variance, is consistent with explanatory measures commonly used in spatial heterogeneity analysis [48], and quantifies the extent to which cluster labels explain the variance in the spatial distribution of ions. For each ion in the dataset, we first compute η² as the ratio of the between-cluster sum of squares to the total sum of squares. The median η² across all ions is then taken as the score of that method on that dataset. A median-η² closer to 1 indicates more pronounced metabolic differences between clusters.
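For ion j, η² = SS_between / SS_total, where SS_between sums n_c times the squared difference between cluster c's mean intensity and the overall mean, and SS_total sums each pixel's squared deviation from the overall mean. The NumPy sketch below computes this per ion and takes the median; skipping ions with zero total variance is an assumption, since the text does not say how constant ions are treated.

```python
import numpy as np


def median_eta_squared(intensities, labels):
    """Median over ions of eta^2 = SS_between / SS_total for one cluster labelling.

    intensities: (n_pixels, n_ions) array; labels: cluster label of each pixel.
    """
    X = np.asarray(intensities, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum(axis=0)      # per-ion total sum of squares
    ss_between = np.zeros(X.shape[1])
    for cluster in np.unique(labels):
        members = X[labels == cluster]
        # n_c * (cluster mean - grand mean)^2, accumulated for every ion at once
        ss_between += len(members) * (members.mean(axis=0) - grand_mean) ** 2
    varying = ss_total > 0   # assumption: constant ions have nothing to explain; skip them
    return float(np.median(ss_between[varying] / ss_total[varying]))


# Hypothetical 4-pixel, 3-ion example with two clusters.
X = [[1, 1, 0],
     [1, 5, 2],
     [5, 1, 2],
     [5, 5, 4]]
labels = [0, 0, 1, 1]
# Per-ion eta^2 is 1.0 (fully explained), 0.0 (unrelated) and 0.5, so the median is 0.5.
print(median_eta_squared(X, labels))
```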

2.4.3. Computational Resource Efficiency Assessment

Computational efficiency and resource consumption determine whether an algorithm can be applied to large-scale, high-resolution MSI data. We therefore recorded the runtime and peak memory usage of each method during the clustering task. Runtime was measured as the total wall-clock time from algorithm initiation to final clustering output, using Python's system time interface. Peak memory was measured by high-frequency sampling of the process's resident set size (RSS), capturing the maximum memory footprint over the entire course of task execution. Together, these measurements identify algorithms efficient enough to handle high-resolution spatial metabolomics data.
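The runtime and memory measurements can be approximated with the Python standard library alone. The sketch below assumes a Unix system: it reads /proc/self/statm on Linux and otherwise falls back to `resource.getrusage`, which reports the peak rather than the current RSS. It uses `time.perf_counter` as the wall-clock timer and samples every 10 ms by default; the study does not report its sampling interval, and `profile_run` is a hypothetical helper. Memory used by child processes (for example, a method launched as a separate R process) is not captured by this per-process sampler.

```python
import os
import resource
import sys
import threading
import time


def current_rss_bytes():
    """Resident set size of this process (Linux /proc; falls back to peak RSS elsewhere)."""
    try:
        with open("/proc/self/statm") as fh:
            resident_pages = int(fh.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except (OSError, ValueError, IndexError):
        # ru_maxrss is the peak so far: kilobytes on Linux, bytes on macOS.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return peak if sys.platform == "darwin" else peak * 1024


def profile_run(func, *args, interval=0.01, **kwargs):
    """Run func(*args, **kwargs); return (result, wall-clock seconds, peak sampled RSS bytes)."""
    peak = [current_rss_bytes()]
    stop = threading.Event()

    def sampler():
        # Poll the RSS every `interval` seconds and keep the maximum seen.
        while not stop.is_set():
            peak[0] = max(peak[0], current_rss_bytes())
            stop.wait(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    start = time.perf_counter()
    thread.start()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        stop.set()
        thread.join()
    peak[0] = max(peak[0], current_rss_bytes())
    return result, elapsed, peak[0]


result, seconds, peak_bytes = profile_run(lambda n: sum(range(n)), 1_000_000)
print(f"{seconds:.3f} s, peak RSS {peak_bytes / 2**20:.1f} MiB")
```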

2.4.4. Clustering Consistency and Biological Validation

To objectively evaluate the consistency of clustering results and their biological accuracy, this study uniformly employs Normalized Mutual Information (NMI) for quantitative assessment [49]. NMI is a commonly used statistical metric for measuring the similarity between two label partitions, with values ranging from 0 to 1, where higher values indicate greater concordance between the two partitions. In the experimental design, we first computed pairwise NMI between the clustering results of different methods on the same dataset to assess the clustering consistency of each algorithm on specific spatial metabolomics data. Subsequently, for datasets with adjacent Stereo-seq spatial transcriptomics sections, we used the cell-type annotations from the ST data as an external biological validation standard, computing the NMI between cluster labels and registered cell-type labels to evaluate the concordance of clustering results at the level of cellular composition.
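NMI does not depend on how clusters are numbered, which is what makes it usable both for method-versus-method comparisons and for comparing cluster labels with cell-type labels. The pure-Python sketch below computes it from label counts alone; the arithmetic-mean normalisation (the default of scikit-learn's `normalized_mutual_info_score`) is an assumption, because the normalisation used in the study is not stated.

```python
from collections import Counter
from math import log


def normalized_mutual_info(a, b):
    """NMI between two labelings of the same pixels, normalised by the arithmetic
    mean of the two entropies (assumption)."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mi = sum(c / n * log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings are a single cluster, hence identical partitions
    return mi / ((h_a + h_b) / 2)


# Cluster IDs are arbitrary: a relabelled copy scores 1, an unrelated partition scores 0.
print(normalized_mutual_info([0, 0, 1, 1], ["x", "x", "y", "y"]))
print(normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1]))
```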

2.5. Construction and Implementation of the Online Clustering Evaluation Platform

To lower the technical barrier for spatial metabolomics clustering analysis and improve the reproducibility of the evaluation workflow, we developed the online clustering evaluation platform SMcluster (available at: https://metax.genomics.cn/app/smcluster) based on the R Shiny framework. The platform is deployed on a high-performance cloud server equipped with 128 CPU cores and 1,000 GB of memory, supporting concurrent multi-user access through a dynamic resource allocation mechanism. The platform is designed with ease of use as its core principle, implementing an end-to-end analysis pipeline that encompasses raw data import, ion quality filtering, batch execution of multiple algorithms, and automated metric evaluation. To facilitate rapid onboarding, the platform includes built-in standardized example datasets as well as detailed operational documentation. With respect to data security and privacy protection, the platform enforces strict controls: all data uploaded by users are processed exclusively in active memory during the session and are automatically destroyed upon session termination, ensuring that researchers retain full control and ownership of their raw data.

The platform requires input in the form of a standard CSV-format peak intensity matrix, in which the first two columns must explicitly define the spatial coordinates (X, Y) of each pixel, with subsequent columns corresponding to the detected intensities of individual ions. During the data import stage, the platform automatically identifies coordinate and feature columns and generates a data overview including pixel count, ion count, and coordinate range, while also providing a spatial distribution preview of target features to assist with preliminary quality inspection. In the preprocessing module, the platform integrates the SNS-based ion filtering strategy described in Section 2.2, allowing users to visually observe the effect of different filtering thresholds on ion image quality and to save filtering parameters to ensure the traceability of the analysis pipeline. In the core computation module, the platform fully supports all 30 clustering methods covered in Section 2.3. Users can configure multiple algorithms and parameter combinations within a unified session for queued batch execution, with the system automatically recording run logs and monitoring computational overhead. Upon task completion, the platform automatically computes key evaluation metrics including spatial continuity (PAS), inter-cluster metabolic heterogeneity (median-η²), clustering consistency (NMI), peak memory usage, and runtime (as detailed in Section 2.4), and presents the results in the form of interactive tables and figures.
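A minimal reader for the described input layout is sketched below with Python's standard `csv` module. It is not SMcluster's code (the platform is built on R Shiny); it assumes a header row, comma separators, and that the first two columns are always X and Y, rather than detecting coordinate columns automatically as the platform does. The helper names are hypothetical.

```python
import csv
import io


def load_peak_matrix(handle):
    """Parse a CSV laid out as X, Y, then one intensity column per ion."""
    reader = csv.reader(handle)
    header = next(reader)
    ion_names = header[2:]                 # every column after X and Y is an ion
    coords, intensities = [], []
    for row in reader:
        if not row:                        # tolerate blank lines
            continue
        coords.append((float(row[0]), float(row[1])))
        intensities.append([float(v) for v in row[2:]])
    return coords, ion_names, intensities


def overview(coords, ion_names):
    """Pixel count, ion count and coordinate ranges, as in the import summary."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {"pixels": len(coords), "ions": len(ion_names),
            "x_range": (min(xs), max(xs)), "y_range": (min(ys), max(ys))}


# Hypothetical three-pixel file with two ion columns (names and values are made up).
sample = io.StringIO("X,Y,760.585,788.616\n0,0,1.5,0.0\n1,0,2.0,0.3\n0,1,0.0,4.1\n")
coords, ions, intensities = load_peak_matrix(sample)
print(overview(coords, ions))
```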

3. Results and Discussion

3.1. Effects of SNS-Based Ion Filtering on Spatial Clustering Performance

Because MSI data are pervasively affected by technical noise and matrix interference, effective ion pre-filtering is essential for obtaining biologically meaningful clustering results. We therefore used the Spatial Noise Score (SNS) to quantify the spatial noise level of each ion and examined its distribution across the 12 heterogeneous datasets. Ions with high SNS values produced spatially fragmented ion images (Supplementary Figure S1). The overall SNS distributions varied substantially across platforms and tissue types: high-resolution AFADESI datasets contained a larger fraction of low-SNS, structurally informative ions, whereas several other datasets showed a higher proportion of noisy ions (Figure 1A).

To visually assess the effect of SNS-based ion filtering on clustering quality, we used the mouse brain dataset mbrain1_pos50 as an example and compared the results of the conventional non-spatially-aware method pca_Kmeans under different filtering thresholds. Without ion filtering, the clustering map displayed a pronounced salt-and-pepper pattern attributable to the large number of noisy ions, making tissue boundaries difficult to resolve (Figure 1C, right). In contrast, retaining only the 20% of ions with the lowest SNS markedly improved spatial continuity, and the resulting substructures, including the cortex, hippocampus, and thalamus, agreed better with anatomical annotations from the Allen Brain Atlas (Figure 1B–C).

This trend was supported by quantitative analysis. We examined how spatial continuity (PAS) changed with the ion-retention proportion for representative algorithms across all datasets (Figure 1D; Supplementary Figure S2). For the two non-spatially-aware methods pca_Kmeans and umap_Kmeans, PAS decreased substantially after moderate ion filtering (e.g., retaining the top 20%), indicating that removing high-SNS ions improved their ability to recover spatially coherent structures. However, in some datasets, such as mbrain1_neg100 and mbrain1_pos20, overly stringent filtering (retaining only the top 5% or 10%) caused PAS to rise again, possibly because too few informative ions remained. By contrast, spatially-aware methods such as DRSC maintained consistently low and stable PAS values across a wide range of ion-retention levels, indicating stronger robustness to noise. These results show that SNS-based ion filtering is a critical preprocessing step for non-spatially-aware methods. We therefore selected an appropriate filtering threshold for each dataset (Supplementary Table S2); at these thresholds, the PAS of non-spatially-aware methods decreased by an average of 27.67% relative to unfiltered data. All subsequent analyses used these dataset-specific thresholds.

3.2. Spatial Continuity Across Clustering Methods

Identifying biologically coherent spatial regions is the core task of spatial clustering; however, some clustering results are fragmented (the so-called salt-and-pepper effect) and fail to reveal true regional boundaries. We therefore used PAS to quantitatively evaluate the spatial continuity of all 30 algorithms on SNS-filtered data. In the pfetus_neg dataset, clustering results with PAS > 0.2 were visibly fragmented, whereas those with PAS < 0.2 were relatively continuous with well-defined boundaries. Spatially-aware methods showed a marked advantage: by explicitly modeling spatial neighborhood dependencies, methods such as CCST, SCAN.IT, and SpaceFlow produced clustering maps with clear boundaries and low PAS values. In contrast, non-spatially-aware algorithms (e.g., K-means) captured the approximate outlines of organs but produced numerous isolated pixels (high PAS), substantially reducing the anatomical interpretability of the spatial structures (Figures 2A and 2B). Evaluation across all datasets supported this observation: spatially-aware methods consistently yielded low PAS values on data from different imaging platforms (MALDI, DESI, and AFADESI), spatial resolutions (20–100 µm), and ion modes (Figure 2C; Supplementary Figure S3). Notably, in high-noise datasets such as mfetus_neg, the PAS values of non-spatially-aware methods increased sharply, whereas spatially-aware methods maintained good spatial continuity. These results indicate that incorporating spatial coordinates or neighborhood graph structures effectively suppresses random noise and yields continuous clustering regions.

3.3. Inter-Cluster Metabolic Heterogeneity Across Clustering Methods

An ideal spatial clustering result should not only exhibit good spatial continuity, but should also minimize intra-cluster metabolic variation while maximizing inter-cluster metabolic differences. To this end, we employed median-η² as an evaluation metric to quantify the degree of ion heterogeneity between clustering regions. Interestingly, the performance of clustering methods on this metric was not consistent with their performance on the PAS metric. Using the pfetus_neg dataset as an example, some methods that demonstrated strong spatial continuity in terms of PAS showed relatively low median-η² values (Figure 3A), suggesting that an excessive emphasis on spatial continuity may obscure subtle metabolic differences. In contrast, non-spatially-aware methods such as umap_Spectral, SSC, and Leiden achieved higher median-η² scores, indicating greater ion heterogeneity between their clustering regions.

To further validate the biological relevance of median-η², we compared the clustering regions of methods with high (> 0.5) versus low (< 0.5) median-η² values against the spatial distributions of marker ions for specific organs (Figure 3B). The results show that for tissues including the midbrain (marker ion m/z 810.424), heart (marker ion m/z 187.36), and liver (marker ion m/z 537.11), the clustering regions identified by methods with higher median-η² rankings showed greater concordance with the distribution of organ marker ions, whereas the clustering regions of methods with lower median-η² values often failed to reflect the fine structural organization of tissue regions. This suggests that, in this dataset, the ability to accurately delineate tissue regions may be one of the factors contributing to greater inter-cluster heterogeneity and lower intra-cluster variation.

Comparing median-η² values across all methods and datasets (Figure 3C), most algorithms achieved relatively high median-η² values on AFADESI data, whereas values were generally lower on the mkidney_neg40 dataset, suggesting that median-η² partly reflects the intrinsic ion heterogeneity of the data. Beyond dataset-level characteristics, methods also varied considerably within the same dataset; for example, pca_Kmeans achieved the highest median-η² in every dataset. This follows from the design of the method: PCA retains the orthogonal directions of maximum variance, and K-means then minimizes the within-cluster sum of squares in that space. Because the total sum of squares equals the within-cluster plus the between-cluster sum of squares, minimizing the within-cluster term in the principal-component space maximizes the between-cluster share of variance, the same ratio that η² computes for each ion. The method therefore tends to produce compact, well-separated clusters of exactly the kind that median-η² rewards.

3.4. Dual-Metric Evaluation Framework and Its Biological Validation

Spatial continuity (PAS) and inter-cluster heterogeneity (median-η²) are both important for evaluating clustering results and must be considered jointly. However, the preceding results indicate a trade-off between them. For example, some spatially-aware methods (e.g., CCST) performed well on PAS (Figure 2B) but showed low median-η² (Figure 3C), suggesting that excessive reinforcement of spatial continuity may erase local metabolic differences. We therefore constructed a dual-metric filtering framework to identify methods that perform well on both criteria, using PAS < 0.2 and median-η² > 0.5 as thresholds; the metric distributions and pass/fail outcomes for each dataset are shown in Supplementary Figures S4 and S6. In the pfetus_neg dataset, for example, 4 methods passed both criteria, 9 passed only the PAS criterion, and 17 passed only the median-η² criterion (Figure 4A). We further computed pairwise Normalized Mutual Information (NMI) between clustering results to assess the consistency among methods (Figure 4B). Methods satisfying both criteria showed higher mutual consistency, and the same pattern was observed in other datasets (Supplementary Figure S5), indicating that these methods identify more similar clustering regions than methods that fail either criterion.
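The dual-metric screen is a pair of threshold tests. The Python sketch below applies the stated cut-offs (PAS < 0.2 and median-η² > 0.5), sorts each method into one of four outcome groups for a dataset, and computes a method's pass rate over datasets (for example, passing in 9 of 12 datasets gives 0.75). Method names and metric values in the example are hypothetical; the strict inequalities follow the wording of the thresholds.

```python
from collections import Counter


def dual_metric_outcome(pas, median_eta2, pas_max=0.2, eta2_min=0.5):
    """Classify one clustering result against the PAS and median-eta^2 thresholds."""
    continuous = pas < pas_max               # spatial continuity criterion
    heterogeneous = median_eta2 > eta2_min   # inter-cluster heterogeneity criterion
    if continuous and heterogeneous:
        return "both"
    if continuous:
        return "PAS only"
    if heterogeneous:
        return "median-eta2 only"
    return "neither"


def tally(results):
    """Count outcomes over {method: (PAS, median-eta^2)} for one dataset."""
    return Counter(dual_metric_outcome(p, e) for p, e in results.values())


def pass_rate(per_dataset):
    """Fraction of datasets in which one method passes both thresholds."""
    return sum(dual_metric_outcome(p, e) == "both" for p, e in per_dataset) / len(per_dataset)


# Hypothetical metric values for four methods on one dataset.
scores = {"method_a": (0.05, 0.62), "method_b": (0.08, 0.31),
          "method_c": (0.35, 0.71), "method_d": (0.40, 0.20)}
print(tally(scores))
print(pass_rate([(0.1, 0.6)] * 9 + [(0.3, 0.6)] * 3))  # 9 of 12 datasets -> 0.75
```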

Across the 12 test datasets, methods such as SSC and DRSC passed the dual-metric filter on 9 datasets (pass rate, 75%), demonstrating that they can achieve both spatial continuity and inter-cluster heterogeneity on most datasets (Figure 4C). Conversely, in some datasets, such as mfetus_neg, few or no methods passed both criteria. Datasets with lower pass rates tended to have higher overall noise levels (Supplementary Figures S6 and S7).

To test whether the dual-metric approach selects biologically meaningful clustering methods, we used cell-type annotations from adjacent Stereo-seq sections of the mbrain2_pos50 dataset (Figure 4D) as an external biological reference, quantified the concordance between each MSI clustering result and the ST cell-type annotations, and compared the resulting concordance ranking with the dual-metric composite ranking (Figure 4E). Methods ranked higher by the dual-metric composite evaluation showed greater concordance with the ST cell-type annotations (Spearman correlation coefficient, 0.83). The method with the best dual-metric composite ranking (STAGATE; NMI = 0.448) achieved an NMI 0.227 higher than that of the worst-ranked method (GraphST; NMI = 0.221). These results indicate that, at least in this dataset, the dual-metric framework reflects not only the performance of algorithms on MSI data but also, to a meaningful degree, their concordance with true tissue structures. The framework can therefore provide a more reliable basis for selecting clustering methods in subsequent spatial metabolomics studies.
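The agreement between the two rankings can be expressed as Spearman's ρ, the Pearson correlation of the two rank vectors. The pure-Python sketch below uses average ranks for ties; how ties and the composite dual-metric score were handled in the study is not stated, and the five method scores in the example are invented for illustration.

```python
def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend the run of tied values
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical scores for five methods: a composite dual-metric score and NMI with cell types.
composite = [0.91, 0.75, 0.62, 0.40, 0.18]
nmi_vs_celltypes = [0.45, 0.39, 0.41, 0.30, 0.22]
print(round(spearman_rho(composite, nmi_vs_celltypes), 2))  # 0.9
```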

3.5. Online Clustering Evaluation Platform

Because ion noise distributions and biological heterogeneity differ substantially across MSI datasets (Figure 1A), researchers often need to tune parameters and adjust strategies to the characteristics of their own data. We therefore developed the interactive online evaluation platform SMcluster to provide standardized tool support for clustering evaluation of MSI data. To validate the platform, we used an example dataset to reproduce the full evaluation workflow of this study, including data upload, SNS-based ion filtering, and parallel multi-algorithm clustering (Supplementary Figure S8). To further test the robustness of the conclusions, we used the platform to compare clustering performance under two noise-filtering intensities (retaining 80% versus 20% of ions). As the number of retained ions increased, the spatial continuity of non-spatially-aware methods deteriorated substantially (higher PAS; Supplementary Figure S9), whereas spatially-aware methods remained robust to noise, consistent with the benchmarking results in Section 3.1. In summary, the platform reproduces the evaluation framework established in this study and, through user-adjustable noise-filtering and dual-metric thresholds, helps researchers select clustering strategies suited to complex real-world data.

4. Conclusions

This study systematically benchmarked 30 representative clustering methods across 12 heterogeneous MSI datasets and established a dual-metric evaluation framework that jointly assesses spatial continuity and inter-cluster metabolic heterogeneity. Our results show that preprocessing has a substantial impact on spatial clustering performance: SNS-based ion filtering markedly improves the spatial continuity of results generated by non-spatially-aware methods, whereas spatially-aware methods are less sensitive to noise. We further found that many clustering methods struggle to achieve strong spatial continuity and high inter-cluster metabolic heterogeneity simultaneously. Under the dual-metric framework, methods such as SSC and DRSC performed robustly across most datasets, and higher-ranked methods tended to produce clustering results that were more concordant with one another. External validation with cell-type annotations further showed that methods with better overall rankings agreed more closely with the annotated cell-type distribution, indicating that the proposed framework captures the extent to which clustering results recapitulate tissue spatial organization and metabolic heterogeneity. Finally, we developed the online platform SMcluster, which implements a standardized workflow for ion filtering, clustering, and result evaluation, providing practical support for method selection in spatial metabolomics studies. Overall, this work addresses a gap in the systematic evaluation of spatial clustering methods for spatial metabolomics, highlights the critical roles of preprocessing and method choice in MSI analysis, and provides a methodological foundation for future research in this area.

Data Availability Statement

A detailed data availability statement will be provided in the final version of the manuscript.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org. Supplementary Note S1: Detailed descriptions of clustering algorithms. Supplementary Tables: Table S1, key parameter settings for clustering methods; Table S2, dataset-specific Spatial Noise Score (SNS) filtering thresholds; Table S3, η² values of organ-specific marker ions in the pfetus_neg dataset. Supplementary Figures: Figure S1, spatial distributions of ions near SNS cutoff thresholds; Figure S2, spatial clustering maps under different SNS filtering proportions; Figure S3, spatial clustering maps generated by all evaluated methods; Figure S4, joint distributions of PAS and median-η² for all methods across datasets; Figure S5, pairwise Normalized Mutual Information (NMI) between clustering results of different methods; Figure S6, pass rates of each method under dual criteria (PAS and median-η²) across datasets; Figure S7, distributions of Spatial Noise Score (SNS) in filtered datasets used for clustering; Figure S8, workflow reproduction and benchmarking results on the SMcluster online platform; Figure S9, comparison of clustering results under different SNS filtering stringencies (retaining the top 20% versus 80% of ions).

Funding

Not applicable.

Institutional Review Board Statement

The animal study protocol was approved by the Institutional Review Board of BGI (protocol code BGI-IRB A25004; date of approval: 21 February 2025).

Informed Consent Statement

Not applicable.

Acknowledgments

During the preparation of this manuscript, the authors used GPT-5.4 for language polishing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SNS	Spatial Noise Score
PAS	Percentage of Abnormal Spots
AFADESI	Air flow-assisted desorption electrospray ionization
MSI	Mass spectrometry imaging

References

1. Fujimura, Y.; Miura, D. MALDI Mass Spectrometry Imaging for Visualizing In Situ Metabolism of Endogenous Metabolites and Dietary Phytochemicals. Metabolites 2014, 4, 319–346.
2. Chen, K.; Baluya, D.; Tosun, M.; Li, F.; Maletic-Savatic, M. Imaging Mass Spectrometry: A New Tool to Assess Molecular Underpinnings of Neurodegeneration. Metabolites 2019, 9, 135.
3. He, M.J.; Pu, W.; Wang, X.; Zhang, W.; Tang, D.; Dai, Y. Comparing DESI-MSI and MALDI-MSI Mediated Spatial Metabolomics and Their Applications in Cancer Studies. Front. Oncol. 2022, 12, 891018.
4. Caprioli, R.M.; Farmer, T.B.; Gile, J. Molecular Imaging of Biological Samples: Localization of Peptides and Proteins Using MALDI-TOF MS. Anal. Chem. 1997, 69, 4751–4760.
5. Takáts, Z.; Wiseman, J.M.; Gologan, B.; Cooks, R.G. Mass Spectrometry Sampling under Ambient Conditions with Desorption Electrospray Ionization. Science 2004, 306, 471–473.
6. Ràfols, P.; Vilalta, D.; Brezmes, J.; Cañellas, N.; Del Castillo, E.; Yanes, O.; Ramírez, N.; Correig, X. Signal Preprocessing, Multivariate Analysis and Software Tools for MA(LDI)-TOF Mass Spectrometry Imaging for Biological Applications. Mass Spectrometry Reviews 2018, 37, 281–306.
7. Buchberger, A.R.; DeLaney, K.; Johnson, J.; Li, L. Mass Spectrometry Imaging: A Review of Emerging Advancements and Future Insights. Anal. Chem. 2018, 90, 240–265.
8. Mei, Z.; Ning, X.; Deng, H.; Chen, L.; Zhao, Y.; Zi, J. SMAnalyst: A Web Server for Spatial Metabolomic Data Analysis and Annotation. Biomolecules 2025, 15, 1562.
9. McCombie, G.; Staab, D.; Stoeckli, M.; Knochenmuss, R. Spatial and Spectral Correlations in MALDI Mass Spectrometry Images by Clustering and Multivariate Analysis. Anal. Chem. 2005, 77, 6118–6124.
10. Alexandrov, T.; Kobarg, J.H. Efficient Spatial Segmentation of Large Imaging Mass Spectrometry Datasets with Spatially Aware Clustering. Bioinformatics 2011, 27, i230–i238.
11. Bemis, K.D.; Harry, A.; Eberlin, L.S.; Ferreira, C.R.; Van De Ven, S.M.; Mallick, P.; Stolowitz, M.; Vitek, O. Probabilistic Segmentation of Mass Spectrometry (MS) Images Helps Select Important Ions and Characterize Confidence in the Resulting Segments. Molecular & Cellular Proteomics 2016, 15, 1761–1772.
12. Shah, M.; Wang, L.; Guo, L.; Xie, C.; Lam, T.K.-Y.; Deng, L.; Xu, X.; Xu, J.; Dong, J.; Cai, Z. SagMSI: A Graph Convolutional Network Framework for Precise Spatial Segmentation in Mass Spectrometry Imaging. Analytica Chimica Acta 2025, 1358, 344098.
13. Dong, K.; Zhang, S. Deciphering Spatial Domains from Spatially Resolved Transcriptomics with an Adaptive Graph Attention Auto-Encoder. Nat Commun 2022, 13, 1739.
14. Xiao, K.; Wang, Y.; Dong, K.; Zhang, S. SmartGate Is a Spatial Metabolomics Tool for Resolving Tissue Structures. Briefings in Bioinformatics 2023, 24, bbad141.
15. Cheng, A.; Hu, G.; Li, W.V. Benchmarking Cell-Type Clustering Methods for Spatially Resolved Transcriptomics Data. Briefings in Bioinformatics 2023, 24, bbac475.
16. Hu, Y.; Xie, M.; Li, Y.; Rao, M.; Shen, W.; Luo, C.; Qin, H.; Baek, J.; Zhou, X.M. Benchmarking Clustering, Alignment, and Integration Methods for Spatial Transcriptomics. Genome Biol 2024, 25, 212.
17. Yuan, Z.; Zhao, F.; Lin, S.; Zhao, Y.; Yao, J.; Cui, Y.; Zhang, X.-Y.; Zhao, Y. Benchmarking Spatial Clustering Methods with Spatially Resolved Transcriptomics Data. Nat Methods 2024, 21, 712–722.
18. Kang, L.; Zhang, Q.; Qian, F.; Liang, J.; Wu, X. Benchmarking Computational Methods for Detecting Spatial Domains and Domain-Specific Spatially Variable Genes from Spatial Transcriptomics Data. Nucleic Acids Research 2025, 53, gkaf303.
19. Guo, L.; Liu, X.; Zhao, C.; Hu, Z.; Xu, X.; Cheng, K.-K.; Zhou, P.; Xiao, Y.; Shah, M.; Xu, J.; et al. iSegMSI: An Interactive Strategy to Improve Spatial Segmentation of Mass Spectrometry Imaging Data. Anal. Chem. 2022, 94, 14522–14529.
20. Shah, M.; Guo, L.; Xu, X.; Deng, L.; Lu, K.; Dong, J.; Zhao, C.; Xu, J. eLIMS: Ensemble Learning-Based Spatial Segmentation of Mass Spectrometry Imaging to Explore Metabolic Heterogeneity. 2024.
21. Alexandrov, T.; Becker, M.; Deininger, S.-O.; Ernst, G.; Wehder, L.; Grasmair, M.; Von Eggeling, F.; Thiele, H.; Maass, P. Spatial Segmentation of Imaging Mass Spectrometry Data with Edge-Preserving Image Denoising and Clustering. J. Proteome Res. 2010, 9, 6535–6546.
22. Guo, L.; Hu, Z.; Zhao, C.; Xu, X.; Wang, S.; Xu, J.; Dong, J.; Cai, Z. Data Filtering and Its Prioritization in Pipelines for Spatial Segmentation of Mass Spectrometry Imaging. Anal. Chem. 2021, 93, 4788–4793.
23. Mei, Z.; Sun, W.; Zhao, Y.; Deng, H.; Ning, X.; Feng, C.; Zi, J. SMQVP: A Web Application for Spatial Metabolomics Quality Visualization and Processing. Metabolites 2025, 15, 354.
24. Bemis, K.D.; Harry, A.; Eberlin, L.S.; Ferreira, C.; Van De Ven, S.M.; Mallick, P.; Stolowitz, M.; Vitek, O. Cardinal: An R Package for Statistical Analysis of Mass Spectrometry-Based Imaging Experiments. Bioinformatics 2015, 31, 2418–2420.
25. Deng, H.; Ning, X.; Lin, X.; Zong, L.; Zheng, S.; Zhao, Y.; Wang, J.; Chen, L.; Zi, J.; Mei, Z. SMIntegration: A Web Tool for Comprehensive Spatial Metabolomics and Transcriptomics Integrated Analysis and Visualization. GigaScience 2026, giag033.
26. Kleshchevnikov, V.; Shmatko, A.; Dann, E.; Aivazidis, A.; King, H.W.; Li, T.; Elmentaite, R.; Lomakin, A.; Kedlian, V.; Gayoso, A.; et al. Cell2location Maps Fine-Grained Cell Types in Spatial Transcriptomics. Nat Biotechnol 2022, 40, 661–671.
27. Marconato, L.; Palla, G.; Yamauchi, K.A.; Virshup, I.; Heidari, E.; Treis, T.; Vierdag, W.-M.; Toth, M.; Stockhaus, S.; Shrestha, R.B.; et al. SpatialData: An Open and Universal Data Framework for Spatial Omics. Nat Methods 2025, 22, 58–62.
28. Randall, E.C.; Emdal, K.B.; Laramy, J.K.; Kim, M.; Roos, A.; Calligaris, D.; Regan, M.S.; Gupta, S.K.; Mladek, A.C.; Carlson, B.L.; et al. Integrated Mapping of Pharmacokinetics and Pharmacodynamics in a Patient-Derived Xenograft Model of Glioblastoma. Nat Commun 2018, 9, 4904.
29. Zhao, C. Airborne Fine Particulate Matter Induces Cognitive and Emotional Disorders in Offspring Mice Exposed during Pregnancy. 2021.
30. Xu, Y. Brain-9aa-Neg-40um. Available online: https://metaspace2020.org/dataset/2023-08-25_17h02m43s (accessed on 14 March 2026).
31. Kasarla, S.S.; Fecke, A.; Smith, K.W.; Flocke, V.; Flögel, U.; Phapale, P. Improved MALDI-MS Imaging of Polar and ²H-Labeled Metabolites in Mouse Organ Tissues. Anal. Chem. 2025, 97, 10720–10728.
32. Inglese, P.; Correia, G.; Takats, Z.; Nicholson, J.K.; Glen, R.C. SPUTNIK: An R Package for Filtering of Spatially Related Peaks in Mass Spectrometry Imaging Data. Bioinformatics 2019, 35, 178–180.
33. Singhal, V.; Chou, N.; Lee, J.; Yue, Y.; Liu, J.; Chock, W.K.; Lin, L.; Chang, Y.-C.; Teo, E.M.L.; Aow, J.; et al. BANKSY Unifies Cell Typing and Tissue Domain Segmentation for Scalable Spatial Omics Data Analysis. Nat Genet 2024, 56, 431–441.
34. Li, J.; Chen, S.; Pan, X.; Yuan, Y.; Shen, H.-B. Cell Clustering for Spatial Transcriptomics Data with Graph Neural Networks. Nat Comput Sci 2022, 2, 399–408.
35. Zong, Y.; Yu, T.; Wang, X.; Wang, Y.; Hu, Z.; Li, Y. conST: An Interpretable Multi-Modal Contrastive Learning Framework for Spatial Transcriptomics. 2022.
36. Guo, L.; Dong, J.; Xu, X.; Wu, Z.; Zhang, Y.; Wang, Y.; Li, P.; Tang, Z.; Zhao, C.; Cai, Z. Divide and Conquer: A Flexible Deep Learning Strategy for Exploring Metabolic Heterogeneity from Mass Spectrometry Imaging Data. Anal. Chem. 2023, 95, 1924–1932.
37. Xu, C.; Jin, X.; Wei, S.; Wang, P.; Luo, M.; Xu, Z.; Yang, W.; Cai, Y.; Xiao, L.; Lin, X.; et al. DeepST: Identifying Spatial Domains in Spatial Transcriptomics by Deep Learning. Nucleic Acids Research 2022, 50, e131.
38. Liu, W.; Liao, X.; Yang, Y.; Lin, H.; Yeong, J.; Zhou, X.; Shi, X.; Liu, J. Joint Dimension Reduction and Clustering Analysis of Single-Cell RNA-Seq and Spatial Transcriptomics Data. Nucleic Acids Research 2022, 50, e72.
39. Long, Y.; Ang, K.S.; Li, M.; Chong, K.L.K.; Sethi, R.; Zhong, C.; Xu, H.; Ong, Z.; Sachaphibulkij, K.; Chen, A.; et al. Spatially Informed Clustering, Integration, and Deconvolution of Spatial Transcriptomics with GraphST. Nat Commun 2023, 14, 1155.
40. Traag, V.A.; Waltman, L.; Van Eck, N.J. From Louvain to Leiden: Guaranteeing Well-Connected Communities. Sci Rep 2019, 9, 5233.
41. Blondel, V.D.; Guillaume, J.-L.; Lambiotte, R.; Lefebvre, E. Fast Unfolding of Communities in Large Networks. J. Stat. Mech. 2008, 2008, P10008.
42. Cang, Z.; Ning, X.; Nie, A.; Xu, M.; Zhang, J. SCAN-IT: Domain Segmentation of Spatial Transcriptomics Images by Graph Neural Network. 2022.
43. Xu, H.; Fu, H.; Long, Y.; Ang, K.S.; Sethi, R.; Chong, K.; Li, M.; Uddamvathanak, R.; Lee, H.K.; Ling, J.; et al. Unsupervised Spatially Embedded Deep Representation of Spatial Transcriptomics. Genome Med 2024, 16, 12.
44. Ren, H.; Walker, B.L.; Cang, Z.; Nie, Q. Identifying Multicellular Spatiotemporal Organization of Cells with SpaceFlow. Nat Commun 2022, 13, 4076.
45. Hu, J.; Li, X.; Coleman, K.; Schroeder, A.; Ma, N.; Irwin, D.J.; Lee, E.B.; Shinohara, R.T.; Li, M. SpaGCN: Integrating Gene Expression, Spatial Location and Histology to Identify Spatial Domains and Spatially Variable Genes by Graph Convolutional Network. Nat Methods 2021, 18, 1342–1351.
46. Abdelmoula, W.M.; Balluff, B.; Englert, S.; Dijkstra, J.; Reinders, M.J.T.; Walch, A.; McDonnell, L.A.; Lelieveldt, B.P.F. Data-Driven Identification of Prognostic Tumor Subpopulations Using Spatially Mapped t-SNE of Mass Spectrometry Imaging Data. Proc. Natl. Acad. Sci. U.S.A. 2016, 113, 12244–12249.
47. Healy, J.; McInnes, L. Uniform Manifold Approximation and Projection. Nat Rev Methods Primers 2024, 4, 82.
48. Wang, J.; Li, X.; Christakos, G.; Liao, Y.; Zhang, T.; Gu, X.; Zheng, X. Geographical Detectors-Based Health Risk Assessment and Its Application in the Neural Tube Defects Study of the Heshun Region, China. International Journal of Geographical Information Science 2010, 24, 107–127.
49. Vinh, N.X.; Epps, J.; Bailey, J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research 2010, 11, 2837–2854.

Figure 1. Effect of SNS-based ion filtering on clustering quality. (A) Distribution of SNS values for all ions across the 12 datasets. (B) H&E staining image of mouse brain tissue (mbrain1_pos50) and its corresponding structural annotations from the Allen Brain Atlas. (C) Comparison of clustering results produced by the pca_Kmeans algorithm with and without ion filtering. (D) PAS values of representative methods across all datasets at different ion filtering proportions (Top n% m/z).

Figure 2. Spatial continuity (PAS) of clustering results from 30 clustering methods across different datasets. (A) Visualization of clustering results produced by 30 algorithms on the pfetus_neg dataset. (B) PAS values of clustering results from each algorithm on the pfetus_neg dataset, ranked in descending order. (C) Heatmap of PAS scores for 30 methods across 12 datasets.

Figure 3. Comparison of inter-cluster ion heterogeneity. (A) Ranking of 30 algorithms by median-η² on the pfetus_neg dataset. (B) Comparison of clustering regions from methods with high versus low median-η² values against the spatial intensity distributions of organ marker ions. (C) Heatmap of median-η² values for clustering results of all methods across all datasets.

Figure 4. Dual-metric evaluation of spatial clustering results and external validation using spatial cell-type annotations. (A) Distribution of clustering results from 30 methods on the two metrics (PAS and median-η²) for the pfetus_neg dataset. (B) Heatmap of pairwise NMI between clustering results. (C) Number of algorithms passing both the PAS and median-η² thresholds in each of the 12 datasets. (D) Spatial distribution of Stereo-seq cell-type annotations from the adjacent section of the mbrain2_pos50 sample. (E) Correlation between the composite dual-metric ranking and the biological validation (NMI between clustering results and cell-type annotations) for each method on the mbrain2_pos50 sample.

Table 2. Summary of the evaluated clustering algorithms (√, yes; ×, no; --, no dedicated reference).

Algorithm	Primary Application	Spatially-Aware	Deep Learning	Language	Reference
Banksy	Spatial transcriptomics	√	×	Python	[33]
CCST	Spatial transcriptomics	√	√	Python	[34]
conST	Spatial transcriptomics	√	√	Python	[35]
dcDeepMSI	Spatial metabolomics	√	√	Python	[36]
DeepST	Spatial transcriptomics	√	√	Python	[37]
DRSC	Spatial transcriptomics	√	×	R	[38]
eLIMS	Spatial metabolomics	×	×	Python	[20]
GraphST	Spatial transcriptomics	√	√	Python	[39]
isegMSI	Spatial metabolomics	√	√	Python	[19]
Leiden	Single-cell omics	×	×	R/Python	[40]
Louvain	Single-cell omics	×	×	R/Python	[41]
pca_GMM	General	×	×	R/Python	--
pca_HC	General	×	×	R/Python	--
pca_Kmeans	General	×	×	R/Python	--
pca_Spectral	General	×	×	R/Python	--
sagMSI	Spatial metabolomics	√	√	Python	[12]
SCAN.IT	Spatial transcriptomics	√	√	Python	[42]
SEDR	Spatial transcriptomics	√	√	Python	[43]
SpaceFlow	Spatial transcriptomics	√	√	Python	[44]
SpaGCN	Spatial transcriptomics	√	√	Python	[45]
SSC	Spatial metabolomics	√	×	R	[11]
STAGATE	Spatial transcriptomics	√	√	Python	[13]
tsne_GMM	General	×	×	R/Python	[46]
tsne_HC	General	×	×	R/Python	--
tsne_Kmeans	General	×	×	R/Python	[46]
tsne_Spectral	General	×	×	R/Python	--
umap_GMM	General	×	×	R/Python	[47]
umap_HC	General	×	×	R/Python	--
umap_Kmeans	General	×	×	R/Python	[47]
umap_Spectral	General	×	×	R/Python	--

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.