UHPLC-Orbitrap-HRMS Identification of 51 Oleraceins (Cyclo-Dopa Amides) in Portulaca oleracea L. Cluster Analysis and MS2 Filtering by Mass Difference

Oleraceins are a class of indoline amide glycosides found in Portulaca oleracea L. (Portulacaceae), or purslane. These compounds are characterized by a 5,6-dihydroxyindoline-2-carboxylic acid core N-acylated with cinnamic acid derivatives, and many are glucosylated. Herein, hydromethanolic extracts of the aerial parts of purslane were subjected to UHPLC-Orbitrap-HRMS analysis in negative ionization mode. Diagnostic ion filtering (DIF), followed by diagnostic difference filtering (DDF), was used to automatically filter the MS data and select plausible oleracein structures. After an in-depth MS2 analysis, a total of 51 oleraceins were tentatively identified. Of these, 26 had structures matching known oleraceins, and the other 25 are new structures of the oleracein class not previously described in the literature. Moreover, diagnostic fragment ions were selected and used as the basis for clustering algorithms and visualizations. As we demonstrate, clustering methods can provide valuable insights into the elucidation of the mass fragmentation of natural compounds in complex mixtures.


Introduction
Portulaca oleracea L. (Portulacaceae), or purslane, is a widespread annual plant found in many parts of the world. Purslane is eaten as a vegetable in many areas of Europe, the Mediterranean, and tropical Asia, where it is added to soups and salads [1][2][3]. Purslane has been used in folk and traditional medicine as a remedy for many ailments [4].
Herein, using UHPLC-Orbitrap-HRMS in negative ionization mode, we carried out an extensive identification and characterization of oleraceins in hydromethanolic extracts from the aerial parts of purslane. We limited the MS2 characterization to oleraceins with a mass of up to 1 kDa, although heavier ones were also detected. After an in-depth MS2 analysis, a total of 51 oleraceins were tentatively identified and characterized, of which 26 had structures matching known oleraceins and the other 25 are new structures of the oleracein class not previously described in the literature. Diagnostic ion filtering (DIF) and diagnostic difference filtering (DDF) were used to refine the selection of compounds with an oleracein structure. Additionally, the oleraceins were clustered based on their MS2 features, and the results are presented with heatmaps, k-means and pam clustering, principal component analysis (PCA) and hierarchical clustering. As we demonstrate, clustering methods can provide valuable insights into the structure elucidation of natural compounds by mass spectrometry in complex mixtures.

Results and Discussion
A workflow diagram of the study is shown in Figure 1. In summary, hydromethanolic extracts of purslane were obtained and subjected to HR-MS2 analysis. The raw data were filtered by DIF and then DDF to select compounds showing the specific fragment ions and specific mass differences of the 5,6-dihydroxyindoline-2-carboxylic acid scaffold, which is common to all oleraceins. The filtered MS2 data were then analyzed manually to structurally elucidate the oleraceins. In the course of elucidation, 43 fragment ions were selected as diagnostic fragment ions and afterwards used to describe every oleracein as a vector of length 43 with values equal to the relative percentage intensities of the diagnostic ions. This made it possible to carry out clustering analyses to establish structural similarities between oleraceins based on their MS2 features. The results of the clustering analysis were used to corroborate, supplement, or correct the structural elucidation.
In our study, the oleracein with the lowest m/z observed was oleracein A, with an [M-H]- ion at m/z 502.135. Other oleraceins of lower mass reported in the literature [29,30] were not observed in our samples. Our study limited the characterization to oleraceins with a mass of up to 1 kDa, although heavier oleraceins were also detected.
In total, 82 candidate substances were automatically selected by DIF followed by DDF, based on the abovementioned criteria, and their MS2 fragmentation was manually inspected. The DDF results are presented in the supplementary material (Table S1) as m/z transitions for all identified oleraceins. Of the 82 candidates, 19 had too low an MS2 intensity (below 1.5E04) and were not interpreted, 12 were false positives (not having an oleracein structure), and 51 were identified as oleraceins (Table 2). Of these, 25 are new oleraceins not previously described in the literature. The other 26 matched the structure of one of the already known oleraceins: A, B, C, D (2 isoforms), H, I (2 isoforms), J, K, L, M (2 isoforms), N/S (4 isoforms), O (3 isoforms), P, Q, W glu, X (3 isoforms). Table 2 presents the chemical structures of the 51 identified oleraceins, and Table 3 provides their chromatographic and mass spectral characteristics. For complete chromatographic and mass spectral data, see Table S2 and Table S3.

Individual MS2 fragmentation analysis
Below, the tentative identification of all 51 oleraceins (Table 2 and Table 3)

Diagnostic ions
Since oleraceins share similar structures, we sought to refine the "fragment ion pool" and select diagnostic fragment ions that can be used to describe the identified oleraceins. After thorough MS2 fragmentation analysis of the 51 identified oleraceins, we selected 43 fragment ions, with their elemental compositions and exact masses determined, as diagnostic ions for the identified substances. Each oleracein could therefore be described as a vector of length 43 with values equal to the relative intensities of the corresponding diagnostic ions. A fragment ion in the MS2 data of a particular oleracein was assigned to a diagnostic ion if its m/z was within 15 ppm of the diagnostic ion's m/z. All diagnostic ions met the following criteria: a mass greater than 100 Da, occurrence in two or more oleraceins, a mean percent intensity greater than 5%, and a maximum percent intensity greater than 10%. The diagnostic fragment ions, along with their structures, are shown in Table S4 and discussed below in order of increasing mass. As mentioned above, the ind-HCA structures IC, IA and IF are confirmed by their corresponding fragment ions at m/z 340.0826, 356.0775 and 370.0931, respectively. However, as the mass of the oleracein increases, the intensities of these characteristic fragments may decrease. If the oleraceins bear easily cleavable moieties, such as two or three consecutively linked G units, the fragment ions indicating the ind-HCA structures may be more prominent. Thus, in the oleracein GGGICG, where the GGG unit cleaves as a single neutral loss of 486.159 Da, a high intensity is observed for the fragment at m/z 340.0826 (64%), as well as for the fragment at m/z 145.0294 (100%). In the oleracein AGGIC, on the other hand, both fragments show lower intensities (11% and 37%, respectively).
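The assignment of fragment ions to diagnostic ions described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy spectrum, and the rule of keeping the most intense peak when several peaks match one diagnostic ion are assumptions, and only three of the 43 diagnostic ions are listed.

```python
from typing import List, Sequence, Tuple

PPM_TOLERANCE = 15.0  # matching window stated in the text


def within_ppm(observed_mz: float, reference_mz: float,
               ppm: float = PPM_TOLERANCE) -> bool:
    """Return True if observed_mz lies within +/- ppm of reference_mz."""
    return abs(observed_mz - reference_mz) <= reference_mz * ppm * 1e-6


def to_diagnostic_vector(spectrum: Sequence[Tuple[float, float]],
                         diagnostic_mzs: Sequence[float]) -> List[float]:
    """Describe one MS2 spectrum as a vector over the diagnostic ions.

    spectrum holds (m/z, relative intensity in %) pairs. The result has one
    entry per diagnostic ion and 0.0 where no peak matches. If several peaks
    match one diagnostic ion, the most intense one is kept (an assumption;
    the text does not say how such cases are handled).
    """
    vector = []
    for reference in diagnostic_mzs:
        matches = [inten for mz, inten in spectrum if within_ppm(mz, reference)]
        vector.append(max(matches) if matches else 0.0)
    return vector


# Three of the 43 diagnostic ions named in the text (full list: Table S4).
diagnostic_mzs = [145.0294, 340.0826, 356.0775]

# Hypothetical spectrum; the first two m/z values are slightly offset from
# the diagnostic masses to exercise the ppm window.
toy_spectrum = [(145.0300, 100.0), (340.0848, 64.0), (512.1200, 20.0)]

vector = to_diagnostic_vector(toy_spectrum, diagnostic_mzs)
print(vector)
```

Applied to all 43 diagnostic ions, the same mapping would yield one length-43 vector per oleracein, which is the representation used for the clustering analyses.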
The IC fragment at m/z 340.0826 undergoes consecutive losses of CO2 and CO, giving fragments at m/z 296.0927 and 268.0978, respectively, in decreasing intensity.
Then, k-means and pam clustering were used to estimate the optimal number of clusters. The clustering observed in the ordered heatmap (Figure 4), together with the results of the k-means and pam clustering (supplementary material, Figure S1 and Figure S2), gave us grounds to partition our data into 8 clusters (Figure 5). The distribution of individuals among the groups can be found in Table S6. Next, principal components were calculated. A scree plot (showing the percentage of variance explained by each principal component) can be found in Figure S3. The PCA visualization is presented in Figure 6, where the color gradient from orange (darker) to blue (lighter) represents the quality of representation (cos2), from high to low. A high cos2 indicates a good representation of the variable on the principal component, and vice versa. Hence, among the well-represented oleraceins, three groups can be distinguished, characterized by IA (1st quadrant), IF (2nd quadrant) or IC (4th quadrant) substructures. The quality of representation (cos2) of individuals and a visualization of the contribution of individuals to PC1 and PC2 are given in Figure S4 and Figure S5. Overall, the different clustering methods give very similar results. In accordance with the proposed structures, different isoforms with the same structure cluster together. In general, oleraceins are grouped according to the presence of one of the three common substructures, GIC, GIA, or GIF, which give rise to other diagnostic fragment ions. Beyond that, different combinations or permutations of substructures lead to specific diagnostic fragment ions.
Oleraceins GIC, FGIC, SGIC, GICG, FGICG and SGICG have either GIC or GICG in common and are grouped together. Next, GGICGG has the unique feature that a GG unit is attached to the N-coumaroyl. It does, however, show some similarity to GGICG, GGGICG and GGGICG.1. The latter three cluster together because of the ICG substructure. It is worth noting that the calculated similarities do not provide a direct quantitative measure of structural similarity. Some substructures (or moieties) are represented by more than one diagnostic fragment ion, and others are not. Additionally, some diagnostic fragment ions generally exhibit higher intensity than others. Nevertheless, our clustering analysis demonstrates that this approach can provide additional insight into the relationships between MS2 fragment ions and outline groups of parent ion → daughter ion transitions.
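To illustrate how intensity vectors of this kind can be grouped, the following is a small sketch of agglomerative (hierarchical) clustering on a distance matrix. It is not the procedure used in the study: the Euclidean distance, the average linkage, the three-ion vectors and all intensity values are assumptions chosen for illustration, and the spectrum names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1: List[str], c2: List[str],
                    vectors: Dict[str, List[float]]) -> float:
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(vectors: Dict[str, List[float]],
                n_clusters: int) -> List[List[str]]:
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], vectors)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Toy vectors over three diagnostic ions (m/z 340.0826, 356.0775, 370.0931,
# i.e. the IC, IA and IF fragments). Intensities are invented, not measured.
toy_vectors = {
    "ic_1": [60.0, 0.0, 0.0],
    "ic_2": [50.0, 5.0, 0.0],
    "ia_1": [0.0, 70.0, 0.0],
    "ia_2": [5.0, 60.0, 0.0],
    "if_1": [0.0, 0.0, 80.0],
    "if_2": [0.0, 5.0, 70.0],
}

groups = agglomerate(toy_vectors, n_clusters=3)
print(groups)
```

With these toy values, the spectra dominated by the same ind-HCA fragment end up in the same group, which mirrors the IC, IA and IF grouping reported for the real data.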

Conclusions
Herein, a total of 51 oleraceins were tentatively identified by UHPLC-Orbitrap-HRMS in hydromethanolic extracts of purslane, 25 of which are new structures. Diagnostic ion filtering (DIF) and diagnostic difference filtering (DDF) were employed to filter the MS data. Moreover, all 51 identified oleraceins were represented by a selected set of 43 diagnostic fragment ions, which permitted the generation of a distance matrix. Furthermore, the identified oleraceins were clustered based on their MS2 features, and the results were presented as heatmaps, pam and k-means clustering, principal component analysis and hierarchical clustering. We demonstrate that clustering methods can provide valuable insights into the MS2 elucidation of natural compounds in complex mixtures.

Extraction and sample preparation
The hydromethanolic extracts of purslane were obtained as described in our previous study [28]. In brief, air-dried aerial parts of purslane were powdered, and 3.00 g of plant material was extracted twice by sonication with 10 mL of 50% MeOH at 50 °C for 15 min in an ultrasonic bath. The combined extracts were filtered and diluted to 25 mL in volumetric flasks with 50% MeOH. The solutions were filtered through a 0.22 μm syringe filter, and 1 μL was injected into the LC instrument for LC-MS analysis.

Chromatographic parameters
Elution was carried out on a Kromasil EternityXT C18 column (1.8 μm, 2.1 × 100 mm) maintained at 40 °C. The chromatographic conditions were as described elsewhere [28], with slight modifications. The binary mobile phase consisted of A: 0.1% formic acid in water, and B: 0.1% formic acid in acetonitrile. The total run time was 23.5 min. The MS acquisition window was 19 min, set between the 1st and the 20th min. The following gradient was used: the mobile phase was held at 5% B for 0.5 min and then gradually increased to 33% B over 19.5 min. Next, B was increased gradually to 95% over 1 min and maintained at 95% for 2 min. The system was returned to the initial condition of 5% B in 1 min and re-equilibrated over 4 min. The flow rate and the injection volume were set to 300 μL/min and 1 μL, respectively.

Mass spectrometric parameters
For MS2 fragmentation analysis, several normalized collision energies (NCE) were tested to select the optimal conditions. An NCE of 20 gave a satisfactory abundance of a variety of heavier fragment ions, and an NCE of 40 gave good intensity for specific fragment ions of lower m/z; thus, a stepped NCE of 20-40 was selected for the initial screening of oleraceins. Mass spectrometric parameters for Full-scan MS were as follows: resolution 17,500; AGC target 1e6; Maximum IT 83 ms; Scan range 500-2000 m/z. For dd-MS2, the following parameters were used: TopN 10; isolation window 1.0 m/z; stepped NCE 20-40; Minimum AGC target 8.00e3; Intensity threshold 9.6e4; Apex trigger 2 to 6 sec; Dynamic exclusion 3 sec. The structural elucidation of the oleraceins was achieved by manual inspection of the MS2 spectra in Xcalibur 4.2 software (Thermo Fisher Scientific).

Mass spectral filtering by Diagnostic Ion Filtering (DIF) and Diagnostic Difference Filtering (DDF)
Initially, vendor *.raw files (Thermo Fisher Scientific) were converted to *.mzML files with MSConvertGUI 3.0 (ProteoWizard) and imported into MZmine 2.53. Then, DIF was applied based on the presence of two of the specific fragment ions of 5,6-dihydroxyindoline-2-carboxylic acid (referred to below as the "indoline core"): 194.0459 m/z (chemical formula: C9H8O4N-) and 150.0560 m/z (chemical formula: C8H8O2N-) (Figure 2), with a ± 15 ppm threshold. MZmine also offers "diagnostic neutral loss" filtering, which searches for specific mass difference(s) only between the precursor ion and each of its fragments. However, since we were also interested in the specific mass difference between fragment ions of the same precursor ion (Figure 3), a DDF approach was applied to refine the selection of molecules that presumably possess the 5,6-dihydroxyindoline-2-carboxylic acid substructure. DDF involved searching for a specific mass difference between each fragment (including the precursor ion, even if it was not present in the MS2 spectrum) and all fragment ions of lower m/z. That difference was set to 195.05316 Da, which corresponds to a neutral loss of 5,6-dihydroxyindoline-2-carboxylic acid. DDF was performed with an in-house script written in Python 3.7.1. The threshold was defined as ± 15 ppm of each of the two ions from which the difference originated. Thus, for the fragmentation transition 340.0848 m/z → 145.0300 m/z, the accepted difference spanned the extremes obtained from 145.0300 ± 15 ppm and 340.0848 ± 15 ppm (i.e., from 195.0475 Da to 195.0621 Da). Consequently, a larger absolute mass tolerance applied when the difference originated from heavier fragments, and vice versa.
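The two filtering steps can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the in-house script itself: the function names and the toy spectrum are hypothetical, and requiring both indoline-core ions in the DIF step is one reading of the description above.

```python
from typing import List, Sequence, Tuple

PPM = 15.0                                  # tolerance stated in the text
INDOLINE_CORE_IONS = (194.0459, 150.0560)   # C9H8O4N- and C8H8O2N-
INDOLINE_CORE_LOSS = 195.05316              # Da, neutral loss of the indoline core


def passes_dif(fragments: Sequence[float], ppm: float = PPM) -> bool:
    """DIF: keep a spectrum if both indoline-core ions are present.

    Requiring both ions is an assumption made for this sketch.
    """
    return all(
        any(abs(mz - ref) <= ref * ppm * 1e-6 for mz in fragments)
        for ref in INDOLINE_CORE_IONS
    )


def difference_window(heavy: float, light: float,
                      ppm: float = PPM) -> Tuple[float, float]:
    """Range of differences possible when both ions may deviate by +/- ppm."""
    tolerance = (heavy + light) * ppm * 1e-6
    return heavy - light - tolerance, heavy - light + tolerance


def find_ddf_transitions(precursor_mz: float, fragments: Sequence[float],
                         target: float = INDOLINE_CORE_LOSS,
                         ppm: float = PPM) -> List[Tuple[float, float]]:
    """DDF: list (heavier, lighter) ion pairs whose difference can equal target.

    The precursor is always included, even if absent from the MS2 spectrum.
    """
    ions = sorted(set(fragments) | {precursor_mz}, reverse=True)
    hits = []
    for i, heavy in enumerate(ions):
        for light in ions[i + 1:]:
            low, high = difference_window(heavy, light, ppm)
            if low <= target <= high:
                hits.append((heavy, light))
    return hits


# Hypothetical spectrum built from m/z values quoted in the text.
toy_precursor = 502.135
toy_fragments = [340.0848, 194.0459, 150.0560, 145.0300]

dif_ok = passes_dif(toy_fragments)
transitions = find_ddf_transitions(toy_precursor, toy_fragments)
low, high = difference_window(340.0848, 145.0300)
print(dif_ok, transitions, f"{low:.4f} to {high:.4f}")
```

For the 340.0848 → 145.0300 transition, `difference_window` reproduces the 195.0475 to 195.0621 Da range given above. Because the tolerance scales with the sum of the two m/z values, the window widens for heavier ion pairs.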

Grouping of MS2 scans
In order to group MS2 scans that presumably derive from the same substance, MS2 scans with precursor ion m/z within 15 ppm and retention times within 1.5% deviation were combined and afterwards manually checked. In these grouped MS2 scans, fragment ions within 15 ppm m/z of each other were considered identical, their intensities were summed, and their masses were recalculated as the intensity-weighted mean:
$$\overline{m/z} = \frac{\sum_i (m/z)_i \, I_i}{\sum_i I_i}$$
where $\overline{m/z}$ is the recalculated m/z value, and $(m/z)_i$ and $I_i$ are the m/z and the intensity of the i-th fragment ion, respectively. Fragment ions having less than 0.5% intensity and mass < 100 Da were excluded. The retention time of the precursor ion with the highest intensity was chosen as the retention time of the grouped MS2 scans, i.e., the peak apex.
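The grouping criteria and the merging of fragment ions can be sketched as below. The sketch rests on assumptions that the text does not specify: deviations are taken relative to the first scan, peaks are merged greedily in order of increasing m/z against the running weighted mean, and all names and peak values are hypothetical.

```python
from typing import List, Sequence, Tuple


def same_substance(mz_a: float, rt_a: float, mz_b: float, rt_b: float,
                   ppm: float = 15.0, rt_fraction: float = 0.015) -> bool:
    """True if two MS2 scans meet both grouping criteria.

    Deviations are taken relative to scan a (an assumption).
    """
    mz_close = abs(mz_a - mz_b) <= mz_a * ppm * 1e-6
    rt_close = abs(rt_a - rt_b) <= rt_a * rt_fraction
    return mz_close and rt_close


def merge_fragments(peaks: Sequence[Tuple[float, float]],
                    ppm: float = 15.0) -> List[Tuple[float, float]]:
    """Merge fragment peaks that lie within ppm of each other.

    peaks holds (m/z, intensity) pairs with positive intensities. Merged
    peaks get the summed intensity and the intensity-weighted mean m/z.
    """
    merged: List[List[float]] = []  # each entry: [sum(m/z * I), sum(I)]
    for mz, intensity in sorted(peaks):
        if merged:
            centroid = merged[-1][0] / merged[-1][1]
            if abs(mz - centroid) <= centroid * ppm * 1e-6:
                merged[-1][0] += mz * intensity
                merged[-1][1] += intensity
                continue
        merged.append([mz * intensity, intensity])
    return [(weighted / total, total) for weighted, total in merged]


# Hypothetical peaks pooled from two grouped scans.
toy_peaks = [(340.0826, 3.0e5), (340.0848, 1.0e5), (145.0294, 2.0e5)]
merged_peaks = merge_fragments(toy_peaks)
print(merged_peaks)
```

In this toy case the two peaks near m/z 340.08 collapse into one peak with summed intensity 4.0e5 and a weighted mean of about m/z 340.0832, while the peak at m/z 145.0294 is left unchanged.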

Used abbreviations
For simplicity and clarity of presentation, the following abbreviations are used throughout this paper: hydroxycinnamic acid: HCA; hydroxybenzoyl: hb or O; coumaroyl: coum or C; caffeoyl: caff or A; glucopyranosyl: glu or G; feruloyl: fer or F; indoline core: ind or I; sinapoyl: sin or S. Where multiple oleraceins bore the same structure, the names of the compounds were suffixed with numbers, e.g., FGGIC, FGGIC.1, FGGIC.2, etc. (Table 2 and Table 3).
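As a reading aid for these compound names, the one-letter codes can be expanded programmatically. The helper below is a hypothetical sketch: `expand_name` is not part of the study's tooling, and it only spells out the letters without implying how the moieties are linked.

```python
from typing import List, Tuple

# One-letter codes as defined in the abbreviation list.
MOIETIES = {
    "O": "hydroxybenzoyl",
    "C": "coumaroyl",
    "A": "caffeoyl",
    "G": "glucopyranosyl",
    "F": "feruloyl",
    "I": "indoline core",
    "S": "sinapoyl",
}


def expand_name(name: str) -> Tuple[List[str], int]:
    """Split a name such as 'FGGIC.1' into moiety names and isoform index.

    A name without a numeric suffix gets index 0. An unknown letter raises
    KeyError.
    """
    base, _, suffix = name.partition(".")
    isoform = int(suffix) if suffix else 0
    return [MOIETIES[letter] for letter in base], isoform


moieties, isoform = expand_name("FGGIC.1")
print(moieties, isoform)
```

For example, FGGIC.1 expands to feruloyl, glucopyranosyl, glucopyranosyl, indoline core and coumaroyl, with isoform index 1.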

Supplementary Materials:
The supplementary tables are available online at www.mdpi.com/xxx/...  Figure S1: Estimating the optimal number of clusters with k-means clustering; Figure S2: Estimating the optimal number of clusters with pam clustering; Figure S3: Scree plot representing the percentage of variances explained by each principal component; Figure S4