Watershed: a key for microbial biogeography

: Biogeography research is flawed by the poor understanding of microbial distributions due to the lack of a systematic research framework, especially regarding appropriate study units. By combining pure culture and molecular methods, we studied the biogeographic patterns of nema-tode-trapping fungi by collecting and analysing 2,250 specimens from 228 sites in Yunnan Province, China. We found typical watershed patterns at the species and genetic levels of nematode-trapping fungi. The results showed that microbial biogeography could be better understood by 1) using watersheds as research units, 2) removing the coverup of widespread species, and 3) applying good sampling efforts and strategies. We suggest that watersheds could help unify the understanding of the biogeographic patterns of animals, plants, and microbes and may also help account for the historical and contemporary factors driving species distributions.


Introduction
Biogeographic patterns and their underlying mechanisms are essential in biodiversity research. Understanding biogeographic patterns is crucial for predicting how global change affects biodiversity, designing effective conservation strategies, and achieving sustainable development [1]. Despite microbes being the most important biotic driver of global material and energy flows, biogeographic research on microbiota has long lagged behind that of macro-biota. Debates exist regarding whether heterogeneous biogeographic patterns exist in microbes [2,3], as so far researchers also believe that its biogeographic pattern is weaker than macro-biota. The strong diffusion ability and extreme environmental adaptability (such as extremely hardy resting stages) of microorganisms are considered to have led to their lack of or weak biogeographic pattern, especially in regions with low heterogeneity [3][4][5]. It is time to carry out microbial biogeography research based on a new perspective.we put forward.
In this regard, a hypothesis we put forward is that the understanding of microbial biogeographic pattern may be helped by watershed perspective. Specifically, microorganisms mainly diffuse through wind, water currents, animal carrying, etc [6]. Watersheds may restrict these processes, thereby shaping the heterogeneity microbial biogeographic pattern, and this diffusion restriction should be more obvious in mountain and valley areas. For example, the climate of the dry-hot valley in Yunnan, China is caused by the restriction of the distribution of water vapor due to the blocking of the airflow by the mountains [7]. Even the airflow will be blocked by mountains and unevenly distributed. There is no doubt that the microorganisms that travel with the airflow will also be restricted by this. Secondly, the blocking of mountains will create unique habitats and relatively independent spaces in the watersheds, driving the occurrence of endemic species. These endemic species are relatively rare, but they play a key role in shaping the regional heterogeneity patterns. Third, the widespread species does not have biogeographic pattern at the species level or even conducive to the discovery of the pattern, but there should be a watershed pattern at the genetic level.
Consequently, we conducted this research in Yunnan Province, China, which holds six major international rivers belonging to the Pacific and Indian Ocean systems. The watersheds were largely shaped by the collision of the Indian Ocean and Eurasian plates with the formation of the Qinghai-Tibet Plateau. The spatial heterogeneity in Yunnan is very high, given that mountainous areas occupy 88.6% of the province. For the research object, Chase and Martiny (2018) believe that the study of small taxa is conducive to the discovery of microbial patterns [8]. Therefor this study take Orbiliaceae nematode-trapping fungi (NTF) as research object which is a cultivable small taxa with 104 known species. Using the stratified sampling method, we collected 2,250 specimens from 228 sites in Yunnan Province ( Figure S1, S2), combined pure culture and molecular methods to investigate the NTF diversity both at species and genetic level. By doing so, we aim to verify each of our hypotheses.

Study area
Yunnan Province is located between 21° 8' 32" and 29° 15' 8" N and between 97° 31' 39" and 106° 11' 47" E, with a total area of 394,100 km2, and spans elevations of 76-6740 m. This area is characterized by a sharp elevational gradient and six large rivers -the Mekong, Irrawaddy, Salween, Red, Pearl and Yangtze Rivers -cutting across the province. This complex topography helped shape the region's relatively closed ecological zones, among which biodiversity is extremely high.

Research objects
Nematode-trapping fungi (NTF) consist of eukaryotic microorganisms with terrestrial and aquatic biotypes belonging to the Orbiliaceae family. NTF are predatory microbes and, as their name suggests, prey on nematodes and other small organisms by producing trapping structures with their vegetative hyphae. Globally distributed across various habitats and extreme environments, 104 NTF species have thus far been reported, including in the Arthrobotrys, Drechslerella and Dactylella genera. All NTF species can be pure cultured, and their morphological characteristics are observable under low-magnification microscopes. NTF can be studied qualitatively and semi-quantitatively, making them ideal taxa for studying patterns in the spatial distribution of microbial diversity.

Stratified sampling
We systematically established 282 stratified sampling sites according to Yunnan's hydrographic network ( Figure S1). We were unable to sample from sites located in inaccessible terrain, including in deep valleys and high mountains ( Figure S2). Sites with strong human disturbances were also avoided. Finally, 228 randomly distributed sites (Ripley's K, p = 0.2435) remained ( Figure S1a). To ensure the phenological consistency and clear aquatic-terrestrial boundaries of the sites, sampling was conducted from the southernmost to the northernmost region and between February 27, 2014, and June 8, 2014. At each site, 5 sampling points were set parallel to the river in both the aquatic and terrestrial habitats at a separation distance of 10 metres ( Figure S1b). In total, 2,250 sampling points were set ( Figure S1, panel a), and 30 sampling points were lost due to the lack of soil at the sampling location. Using the five-point sampling method in 1-m×1-m squares ( Figure  S1b), samples consisted of ~50 g of mixed soil (or sediment) and were collected at a depth of 0~10 cm below the surface at each sampling point. In total, we collected 2,250 specimens from 228 sites. The samples were then stored in a disposable self-sealing bag at room temperature. We recorded habitat information, including the location, longitude, latitude, and elevation of each sampling site. The samples were sent to the laboratory within 1 week of collection, immediately followed by NTF identification processing.

NTF classification and identification
Isolation and purification: Using the five-point spreading method, the samples were plated on petri dishes with Corn agar medium (CMA) with ~5,000 individuals of Panagrellus redivivus to induce NTF germination. We plated three replicates per sample for a total of 6,750 petri dishes plated. The petri dishes were incubated at room temperature (25-28°C). We examined the petri dishes once after 3 weeks and again after 4-5 weeks using a stereomicroscope to isolate all the species. During each examination, we used single-spore isolation to collect and purify any germinated NTF on CMA medium. A sterile toothpick was used during collection. Pure cultures were preserved until we were able to isolate and purify cultures from all specimens.
Morphological identification: All preserved pure cultures were revived in CMA medium and placed on a temporary slide using the insertion and sticking method. The morphological characteristics of NTF, including conidia, conidiophores and chlamydospores, were then photographed using a microscope camera (Olympus BX51, Japan). For each NTF strain, the type of trapping device was observed and confirmed using the observation chamber and nematode induction method. The strains were identified according to Li et al., 2014 [9].
Molecular identification: Each NTF strain was inoculated on PDA medium to gather the mycelium. We then extracted DNA from the mycelium using a CTAB protocol. Fragments of ITS (internal transcribed spacer), TUB (β-tubulin gene), TEF (translation elongation factor 1-alpha) and RPB2 (RNA polymerase II) were amplified by PCR and sequenced by BioSune Biotechnology Co., Ltd. (Shanghai, China). The homologous ITS and TUB sequences were aligned using the BLAST toolkit from NCBI (National Center for Biotechnology Information Search database).
We were able to identify and classify specific NTF species using morphological and molecular identifications, referring to the taxonomic system used by Li et al., 2014 [9].

Analysis of NTF species composition and distribution
The occurrence frequency (OF) of each species and sampling site was calculated. The OF (%) was calculated as the number of specimens with NTF occurrence divided by the total number of specimens × 100%. Species with OF > 5% were defined as widespread species, while species with OF < 1% were defined as rare species.
Using the latitude and longitude of each site and the NTF OFs, we then plotted distribution maps based on the genus, species, and rare species with the raster, sf and ggplot2 packages in R version 4.02.
The mean variance ratio was used to evaluate the NTF species distributions. A mean variance ratio of 1 indicated a random distribution (conforming to Poisson's distribution); a mean variance ratio >1 indicated a clustered distribution; and a mean variance ratio <1 indicated an even distribution. The significances of these distributions were verified using a T-test.
We divided the sampling areas into 9 grids using QGIS version 3.10. We selected the six grids with the largest sampling areas (grids A, B, C, D, E and F) for analysis ( Figure  S3a). The watershed boundaries of Yunnan Province were extracted from a derived digital elevation model (CGIAR-CSI SRTM v4.1) at a 90-metre resolution using GIS geoprocessing algorithms ( Figure S3b). We constructed UpSet plots using grid and watershed units with the UpSetR package in R version 4.02.

Genetic divergence and biogeographical distribution analysis of Arthrobotrys oligospora
DNA sequencing of ITS, TUB, TEF and RPB2 fragments was conducted on strains of Arthrobotrys oligospora, which were found at every sampling site. The four sequences from each strain were summarized in a text file and converted into a fasta file. MAFFT version 7 was used to generate the multi-sequence matrix (homologous sequence searching), and BioEdit was used to manually improve the accuracy of the alignment. We used jmodeltest software to select the optimal calculating alternative model for the piecing sequence. The phylogenetic tree was constructed following the maximum likelihood (ML) method via partition calculation using IQ-Tree version 1.6.5 software. We used FigTree version 1.3.1, Microsoft Word and Photoshop to read the phylogenetic tree.

Biogeographical distribution of NTF in Yunnan
A first prediction was performed using the different watersheds as unique predictors of the clade distribution in the phylogenetic tree of Arthrobotrys oligospora. We assigned clades to watersheds using the majority rule and assessed the classification accuracy using error matrix metrics.
From the original dataset, we generated 45 different splits between the training and test samples with training ratios between 30% and 70% (step=5). Each ratio step included five different mixes of training/test samples. Voronoi (Thiessen) polygons were computed around each training set and assigned the clade corresponding to their originating point. The polygons were used to predict the test dataset clades.

Multivariate machine learning model
We compiled a dataset of 97 variables divided based on bioclimatic (24), topographic (6), vegetation (12), and soil properties (55) using QGIS (Table S1). In addition to the watershed, 19 predictors were selected among an initial pool of 97 variables. Table S2 shows the selection results after multinomial logistic regression and collinearity screening. In these cases, we always retained the values for the first layer.
To select the most suitable predictor variables to include in our machine learning model, we hierarchically screened each variable in our dataset. We iteratively performed multinomial logistic regressions to assess the effects of each predictor variable on clade. We assessed the relationships among predictor variables using correlation matrices (Pearson's and Kendall's) and assessed multicollinearity using the variance inflation factor.
The selected predictor variables and watershed predictors constituted the final dataset to investigate patterns in clade distribution using an extra-trees classifier (extremely randomized trees). We aimed to improve the prediction ability from the previous analysis (using the watershed subdivision as a unique predictor) and therefore used the same accuracy metrics as those used in that analysis. The hyperparameters were optimized with grid search and cross validation techniques, although we also considered bootstrapping and accuracy estimations on the out-of-bag samples. The relative feature importance was calculated for the best model.

Results
We identified 44 Orbiliaceae NTF species from 2,250 samples. These species belonged to three genera: Arthrobotrys (29 species), Dactylella (9 species), and Drechslerella (6 species) ( Table S3). The ranked order of the occurrence frequency (OF) of NTF species corresponded to a logarithmic distribution model (R 2 = 0.940, p <0.05) (Figure 1). Four species were widely distributed (mean OF = 5.22%) and contributed 9.09% of the total species richness and 62.32% of the total OF. Conversely, 31 species were classified as rare (mean OF = 0.18%) and contributed 70.45% of the total species richness and 7.52% of the total OF (Figure 1).

Widespread species shaped the random distribution pattern of NTF
At the genus level, Arthrobotrys was regionally distributed and found in 98.68% of the sampling sites, while Dactylella and Drechslerella were locally distributed and found in 7.02% and 7.89% of the sampling sites, respectively ( Figure S4).
A mean variance ratio of 1.07 (t = 0.7029，p > 0.05) indicated that the NTF species were randomly distributed ( Figure 2). Four widespread species were found in 83.30%, 56.58%, 47.37% and 41.23% of the sampling sites ( Figure S5). When the widespread species were removed, the cumulative distribution of the OFs of rare and non-widely distributed species was non-random and clustered (rare species: mean variance ratio = 1.284, t = 3.022, p = 0.003; non-widely distributed species: mean variance ratio = 1.326, t = 3.471, p < 0.001).

The watershed patterns of the widespread species (for genetic level) and rare species
At the genetic level, the phylogenetic tree revealed that the widespread species A. oligospora was divided into 5 large clades (Figure 3a). The distribution of these clades was consistent with the natural watershed divisions in Yunnan Province, except for the Irrawaddy River, which had no corresponding clade (Figure 3b). The strains within the other clades were distributed within their corresponding watersheds. A total of 64.70%, 60.71%, 80.00%, 63.89% and 61.11% of the strains from clade 1 to clade 5 were correspondingly distributed in the Yangtze River, Red River, Pearl River, Mekong watershed, and Salween-Irrawaddy watershed, respectively (Figure 3b).
Among the 44 species detected in this study, 31 rare species were found in fewer than 10% of the sampling sites (range: 0.41-9.21%) ( Figure S5). Of the 19 rare species with non-single-site distributions within grids or watersheds, 63.16% were distributed in the same or an adjacent grid and 84.21% were distributed in the same or adjacent watershed ( Figure  S6). With the exception of Drechslerella dactyloides, 8 of the 9 rare species that appeared at only 2 or 3 sites were distributed in the same or adjacent watershed ( Figure S6).

Watershed is a key for discover NTF biogeographic pattern
When a grid was used as the unit of analysis, no clear distribution pattern of the five clades in the A. oligospora phylogenetic tree was detected (Figure 3c). Based on our phylogenetic tree, the spatial distribution of A. oligospora was machine learned using randomly generated Voronoi polygons and watersheds. We found that the 45 maps generated using polygons all had lower accuracy than the maps watersheds generated (Table S4). On average, the accuracy of the polygons was low (mean: 36%, median: 38%), with all but one prediction falling below 50%. By assigning the clades according to the watersheds, 68.8% of clades were classified correctly (Figure S7 A). The watershed unit explained nearly 70% of the A. oligospora distribution in Yunnan. None of the additional climate, topographic, soil or vegetation variables significantly improved the model ( Figure S8, Table S5).
We were better able to capture the distribution pattern of NTF when using watersheds as the unit of analysis compared with grids. Only 17 intersections were found when we constructed UpSet plots using watershed units( Figure 4). We found 19 species distributed in only one watershed unit, and 76.2% of all species were distributed in the adjacent watersheds. There was a large gap between the datasets (the variance was 7.89, and the mean value was 2.53). Despite finding 22 combinations when constructing the integrative diagrams by grid, only 11 species were distributed in only one grid, and 55.0% of all species were distributed in adjacent grids. There was a small fluctuation between the datasets (the variance was 5.04, and the mean value was 1.77). Ultimately, using watershed units led to higher observations of endemic species, more species distributed within adjacent or similar units, and a more varied species composition compared with those obtained using grid units.

Discussion
Nearly half of the reported NTF species (N = 44) in three genera were found during this study. At the species level, the distribution pattern was random and had no heterogeneous pattern. However, at the level of widespread species genetic, the level of rare species and genera, NTF has a pattern of local distribution. The heterogeneous distribution of microbial diversity could be better explained by using watersheds as units. Therefore, our three hypotheses have been verified, 1) rare species are indeed important to the microbial biogeographic pattern; 2) widespread species will cover the pattern at the species level, and its geographic pattern needs to be revealed at the genetic level; 3) watersheds are the key to discovering the microbial biogeographic pattern. In the process of gradually removing obstacles to discover the microbial biogeographic pattern, the following points need to be paid attention to.
First, widespread species are widely distributed, covering patterns at the species level, and their biogeographic patterns need to be studied at the genetic level. In this study, the number of widespread species only accounted for less than 5% of the total number of species but contributed 70% of the OF. When excluding the 4 widespread species, the pattern tended to be non-random with a clear heterogeneous pattern. If the distribution of the species with high OFs was gradually superimposed, the patterns of NTF tended to become random. The results indicated that the extremely high OFs and wide distribution of widespread species masked the heterogeneous microbial biogeographic patterns. This may be the major factor in the past that microbial research has been difficult to find or can only find a weak biogeographic pattern. At the genetic level, the 5 clades of A. oligospora were consistent with the natural watershed divisions, and these divisions were further supported by machine learning results. Totals of 64.70%, 60.71%, 80.00%, 63.89% and 61.11% a. c. d.
of the strains from clade 1 to clade 5 were correspondingly distributed in the Yangtze River, Red River, Pearl River, Mekong watershed, and Salween-Irrawaddy watershed, respectively. Therefore, in future studies, removing the effects of widespread species at the species level, and paying attention to the lineage differentiation at the genetic level will help microbial biogeography research. Second, although rare species and genera have low abundance, they play a key role in shaping the microbial diversity pattern [10]. It is necessary to be more cautious to conduct research on the biogeographical pattern of microbes based on operational taxonomic units (OTU) alone. At the level of rare species, the 31 rare species were restricted to 0.44% ~ 9.21% of the sampling sites. The genera Dactylella and Drechslerella were restricted to 7.02% and 7.89% of the sampling sites, respectively. Because these limited distribution and the existence of endemic species were representative of the biogeographic pattern in itself, we concluded that NTF species have heterogeneous biogeographic patterns [2]. Many of the microbial biogeographic pattern studies are OTU based which may ignore rare species and the genetic variation under the species due to the broader resolution of OTUs (usually 97% similarity clustering), and thus hinders the detecting of microbial heterogeneous patterns emerged during the evolution [8,11,12]. Therefore, we support the view of Cbase and Martiny (2018) and believe that conducting more refined research on a specific taxa and paying more attention to the pattern of the genetic variation will promote the understanding of the microbial biogeographic [8]. The pure culture based study is still indispensable for current microbial biogeography research. The results of the experiments carried out in this study confirm the feasibility of this idea. In the future, we need to combine culture-dependent and culture-independent approaches to carry out more systematic research on microbial biogeography.
Third, microbial biogeography research requires more scientific sampling design, including sample area selection, sampling effort and sampling strategy. Insufficient sampling effort is the main cause of our insufficient understanding of the microbial biogeographic pattern [3]. In accordance with the logic of stratified sampling, we arranged 228 sites in Yunnan Province and collected 2,250 samples intensively, and discovered the NTF biogeographic pattern. Had the sampling effort been reduced by 10% (do not set 10 samples at each site), we estimated that 24 species (77.42% of the rare species) would have been more difficult to observe. If 23 sites were set up only (10% of the current sample site), we estimated that 20 species (64.52% of the rare species) would have been more difficult to observe. What's bad is that most of the current researches on microbial diversity are not high in sampling intensity. The sampling intensity interval between 3~300 samples, with an average of 40 samples collected (summarised according to articles published in the ISME Journal and Microbiome over the past five years). Sampling efforts were subsequently reduced to the same time and resources when using metagenomic methods [13]. With low sampling efforts, many rare species would be ignored. In addition, the selection of research area is also very critical. In this study, a strong biogeographic pattern of NTF was found in Yunnan.
It is precisely because of the rich types of mountains and the strong environmental heterogeneity in Yunnan that we discovered the strong biogeographical pattern of NTF here. It is difficult to find a strong microbial biogeographical pattern if the research is carried out in areas with relatively flat geographical terrain and not strong geographical barriers.
Finally, the watershed unit is the key to understanding the microbial biogeographic pattern. On the one hand, our research results show that a clear watershed pattern of NTF. On the other hand, the heterogeneous distribution of microbial diversity could be better explained by using watersheds as units compared with grids. At the genetic level, the 5 clades of A. oligospora were consistent with the natural watershed divisions. That is to say, even if it can be widely distributed, the genetic diversity of A. oligospora is differentiated within the watershed. At the species level, more endemic species were observed using watershed units (19 species by using watersheds vs 11 species by using grids). Additionally, 84.21% of rare species were clustered in the same or adjacent watersheds, while only 63.16% were clustered in the same or adjacent grids. It can be assumed that the spatial compression of the uplift of the Qinghai-Tibet Plateau and the Hengduan Mountains has formed the six major watersheds of Yunnan. Because environments in mountainous regions are geographically isolated by mountains, rivers, and other natural barriers, many new species have evolved in these environments [14]. The results of machine learning found that watershed is the most important factor for this pattern, while other environmental factors have little effect. Therefore, the biogeographical pattern of NTF found in this study is mainly shaped by the geographical isolation caused by the historical changes of the six major river watersheds in Yunnan, and contemporary factors have little influence. Considering that there are natural cascades in the watershed itself, we believe that the future watershed units can also help form a breakthrough in the impact of contemporary and historical factors on the microbial diversity. The proportion of contemporary and historical factors in the watershed units of different scales should be gradual. Historical factors represented by geographical barriers between large watersheds are the main factors shaping the biogeographic patterns. As the scale of the watersheds shrinks, the geographical barriers between sub-small watersheds weaken, and the role of shaping the contemporary environment within the watershed will become more important. In the past, research in this area was often carried out on different spatial, temporal and phylogenetic scales [15]. Even worse, biogeographic patterns are particularly sensitive to the above factors [16]. These confusing results make it impossible for researchers to distinguish variations in microbial biodiversity that are caused by contemporary or historical factors [17]. Natural watershed units can provide a natural scale for the study of the scale effect of the above-mentioned factors driving diversity patterns.
Natural units with clear boundaries should be a fundamental need in studies of biodiversity because the coupling of animal, plant and microbial biodiversity distributions is inevitable in future biogeographic research [18]. While biogeographic units have long been defined for animals and plants (e.g., ecoregions, biota, and biogeographic provinces) [19,20], these units do not correspond well with each other; great differences exist among different regions and taxa [21]. We propose that natural watersheds are ideal units for biodiversity research because they are relatively independent or "closed" units of material and energy flow, and their boundaries can be natural barriers to transmission [22]. As a result, the watershed provides a unit selection for biogeography research that is suitable for multiple groups.

Conclusions
Future microbial biogeography could be better understood by 1) using watersheds as research units, 2) removing the coverup of widespread species, and 3) applying systematic sampling design. When conducting research, attention should be paid to selecting regions with high heterogeneity, increasing sampling efforts, and introducing stratified sampling strategies. The research methods aimed at small taxa and pure cultivation are a powerful supplement to the current research with metagenomic technology. Research on a single widespread species can focus on its biogeographic distribution at the level of genetic diversity. The watershed unit is helpful to analyze the impact of historical and contemporary factors on biodiversity, and it also provides a natural unit for studying the coupling relationship between microbes and the biogeographic pattern of animal and plant diversity, which is the key to the study of biogeography.
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Figure S1: a. Map of sampling sites. b. Sampling points at each site. Figure S2: Geographic environments of the sampling sites. Figure S3: Divisions of a) grids and b) watershed units. Figure S4: Biogeographic patterns of each genus of Nematode-Trapping Fungi. Figure S5: Biogeographic patterns of each genus of Nematode-Trapping Fungi. Figure S6: Biogeographic patterns of rare species in the pespective of a) grids and b) watersheds. Figure S7: Watershed map (A) and three examples of maps generated using Voronoi polygons from different training sets. Figure S8: Impurity-based feature importance from the Extra-Trees classification results. Table S1: List of environmental factors. Table S2 Variables selection stages and final data set. Table S3: List of detected species of Nematode-Trapping Fungi. Table S4: The 45 maps generated using polygons all had lower accuracy than the maps watersheds generated. Table S5: Permutation feature importance from the Extra-Trees classification results.