A genomic comparative analysis of Bacillus coagulans unravels the genetic potential of MA-13 strain for biotechnological applications

: The production of bio-chemicals requires the use of microbial strains with efficient substrate conversion and excellent environmental robustness, such as Bacillus coagulans spp. So far the genomes of about 50 strains have been sequenced. Herein, we report a comparative genomic analysis of nine strains on the full repertoire of CAZymes, secretion systems, and resistance mechanisms to environmental challenges. Moreover, B. coagulans Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) immune system along with CRISPR-associated Cas) genes, was also analysed. Overall, this study expands our understanding of the strains genomic diversity of B. coagulans to fully exploit its potential in biotechnological applications.


Introduction
Wide substrate efficiency and excellent environmental robustness are indispensable features for the microbial-based production of bio-chemicals [1,2]. For instance, bacterial fermentation is currently used as an eco-sustainable alternative to obtain lactic acid (LA) from raw materials [3]. So far several lactic-acid producer strains of B. coagulans have been used for the production of LA from lignocellulosic biomasses [4]. Interestingly, B. coagulans grows at temperatures ranging from 50 to 55° and pH 5-6 that are chemicophysical conditions comparable to those used for the biomass saccharification [5][6][7]. Therefore B. coagulans represents an attractive biocatalyst for LA production because of its thermophilic nature that not only reduces the risk of microbial contamination but also makes it possible to develop efficient fermentation processes, i.e, simultaneous saccharification and co-fermentation (SSF) [6,8]. Recently, a novel B. coagulans strain, designed as MA-13 and isolated from canned beans manufacturing, has been tested in an SSF configuration for efficient production of LA from wheat straw [5]. Moreover, it turned out to be exceptionally resistant to extreme conditions, such as the presence of toxic compounds derived from the thermo-acidic treatment of lignocellulose pointing to this microbe as a suitable biological tool for the eco-friendly production of LA [6].
B. coagulans is a homofermentative microorganism and glucose is converted into lactate usually via the Embden-Meyerhof-Parnas (EMP) pathway and xylose via the pentose phosphate pathway (PPP) or the phosphoketolase pathway (PKP). Specifically, the yield of LA varies from 60% to 98% of the total fermentation products depending on whether xylose is metabolized through one or the other pathway [9,10]. Interestingly, genomic comparison among different B. coagulans strains have revealed that some of them lack at least some of the genetic determinants of the pentose phosphate pathway, hinting at a significant metabolic and genomic diversity among the isolates [4,7,9]. On the other hand, knowledge of the genomic features of the different B. coagulans strains is essential to fully exploit their potential in white biotechnology applications [11]. Indeed, B. coagulans is also emerging as a promising GRAS (Generally Regarded As Safe) probiotic candidate, by sharing characteristics with both Bacillaceae and Lactobacillaceae [12]. Currently, B. coagulans strains are widely used as probiotics in marketed worldwide formulations namely, Sustenex, Lactospore, Neolactoflorene, etc [13] In a recent study it has been demonstrated that MA-13 over-produces under standard growth conditions αand β-galactosidases that are key enzymes for improving the nutritional value of raffinose family oligosaccharides and lactose-containing food as well as for the production of galactooligosaccharides (GOS) that improve the digestibility and nutritional value of food. Moreover, the identification and comparative analysis of the full repertoire of Carbohydrate-Active enZymes (CAZymes) of other B. coagulans strains might help to shed light on the probiotic and prebiotic properties of the available strains [14].
Furthermore, B. coagulans has been exploited also in other diverse biotechnological applications in medicine, food, and chemical industry including the production of thermostable enzymes and antimicrobial peptides, such as coagulin [15].
So far many strains have been isolated and for most of them, the genome sequence is available [7,[16][17][18]. To fully explore the biotechnological potential of MA-13, a comparative analysis among some strains has been carried out. Genome decoding is functional to unravel molecular mechanisms driving diversification, genetic variability, and host-pathogen interactions [19]. Previously, it has been reported comparative genomic analyses on a few B. coagulans strains with a special focus on carbohydrates catabolism and genetic determinants responsible for general defense systems from foreign DNA, have been reported [4]. In our analysis, we picked up the closest evolutionary related strains to MA and pinpoints the core and the dispensable genes that are present in some but not all the strains studied (non-core) as well as the strain-specific genes (singletons) related to CAZymes, protein secretion systems, antimicrobial peptides/proteins, secondary metabolites, and the defense systems protecting cells from foreign DNA. Moreover, we show that the genomic content varied among close individual strains, and such variations are traced back to the gain or loss of singleton, genomic islands, selfish DNA (plasmids, bacteriophages, integrative conjugative elements), and/or widespread horizontal gene transfer (HGT) [20].

Genomic features of B. coagulans strains
A draft of the MA-13 genome was recently published [21] supplying a general overview of its genetic content. In this work, we provide a comparative analysis among MA-13 and other phylogenetically related strains, with the purpose to highlight the genomic features of MA-13 that point up its potential in different biotechnological fields. To this aim, the genome sequence of MA-13 was re-assembled using different tools (data not shown) and according to the quality check (QUAST), the best result was obtained using MEGAHIT. The updated genome assembly was also deposited in NCBI under the same accession number (WGS: SMSP02000001-SMSP02000116, ACCESSION SMSP00000000) and consists of 125 contigs, with a total length of 2,942,169 bp and an average G+C content of 47.2%. Compared to the previous work [21], the number of contigs was drastically reduced (from 1,653 to 125 contigs) and the total length adjusted from 3,237,270 to 2,942,169 bp. To date, forty-seven B. coagulans strains have been deposited in the NCBI database and their genomic features are summarized in Table S1. Only fourteen strains were reported to be sequenced as single replicons, while the others are deposited as draft genomes. In general B. coagulans genome size varies from 2.86 to 3.69 Mbp with an average size set around 3.0 Mbp as described for MA-13 as well as for H1, XZL4, AF24-21, AF24-19, MGYG-HGUT-00191, B4098, DSM1 strains (Table S1). The number of CDS ranges from 2,536 to 3,237 and the GC% content is around 46-47% (Table S1).

Phylogenetic and comparative genomic analysis
In microbial genomics, phylogenetic analysis is crucial not only to establish the genetic novelty and the genotype-phenotype relationships of the isolates but also to identify the closest relatives (micro)organisms within assembled genomes. With this aim, a species tree was generated using the genomes of nine B. coagulans strains available in the KBase system. In our previous work, a phylogenetic tree was built based on 16S rRNA genes and two main groups were identified [5]. In this study, the comparative whole-genome sequence analysis highlighted the presence of two clades (named Group A and B) that underline the coexistence of two distinct evolutionary lineages. This hypothesis is supported by the similar genome size within each clade (~2.9 Mbp for Group A and ~3.4 Mbp for Group B, respectively) ( Figure 1, Table S1). Moreover, the closest strain to MA-13 is DSM1 (accession number ALAS00000000) which was also isolated from agri-food wastes [18]. Then, we resolved to calculate the Average Nucleotide Identity (ANI) value of orthologous genes shared by the two genomes. The FastANI visualization of the reciprocal mappings shows the close evolutionary relationship between the two strains as clearly highlighted by the dark-red lines representing all the conserved regions shared by MA-13 and DSM1 which define an ANI value of 98.43%. Moreover, as further evidence of the evolutionary connection, we have calculated the DDH (DNA-DNA hybridization) values to demonstrate that the two microorganisms unambiguously belong to the same specie (DDH1=98.96%) but they are not members of the same sub-specie (DDH2= 60.49%).
To get a deeper insight into the genomic variation among the nine strains a comparison of all the genes grouped by sequence homology was carried out [22]. This analysis was performed using the phylogenetic tree ( Figure 1) as input to build up a framework for exploring the genomic diversity and the common features through the analysis of non-core genes/singletons and core genes ( Figure 2, Table S2). The formers are represented by all genes not universally conserved within a species, whilst the core is constituted by genes conserved across all the genomes [23]. The genomic diversity and variability among the isolates are highlighted by the analysis of singletons and non-core genes that may be acquired from distal lineages through horizontal gene transfer and represent the genetic source for the emergence of novel functions ( Figure 2, Table S2). To assess the functional features of singletons, TIGRFAM was used to cluster proteins in categories and sub-categories ( Figure 3). Genes involved in central metabolism as well as several hypothetical genes and some with unknown functions are distributed over all the genomes. Moreover, a distinctive feature is the presence of prophage elements that are found only in MA-13, DSM1, and P38 ( Figure 3). These strains are also characterized by singletons related to the functions of mobile genetic elements. This feature points to the availability of the genes attainable by the evolving microbial population and to the capability of microbial genomes to acquire new functions. The distribution of singletons in MA-13 strain revealed a significant proportion (~ 40%) of them falling within the hypothetical proteins category with unknown function thus indicating that a significant portion of them is still "terra incognita" and therefore the full understanding of its genomic potential needs to be still unveiled (Table S3). Moreover, transposable elements provide microorganisms with the ability to be responsive and susceptible to environmental changes by acquiring new genetic material and disseminating regulatory elements. The presence of singletons related to transposon activity, which is a distinctive feature of MA-13, points to significant genome plasticity, possibly to suit the requirements of environmental adaptation.

CAZome of B. coagulans
CAZymes (Carbohydrate-Active enZYmes) repertoire covers diverse enzymatic functions related to the synthesis, modification, and breakdown of saccharides, thus exploiting a great potential in biotechnological applications. B. coagulans CAZome was explored through dbCAN HMMs and subsequently matched with HMMER 3 for functional annotation (Table 1) [24]. The total amount of CAZymes in the nine genomes (344 genes in total) represents at least ~1% of all predicted coding sequences (Table 1). A deeper inspection of the CAZome is reported in Figure 4. The annotated genes were clustered in glycoside hydrolases (51.2 %), glycosyltransferases (23.2 %), carbohydrate esterases (17.1 %), and auxiliary activity (8.4%) families. Glycoside Hydrolases (GHs) catalyze the hydrolytic cleavage of the glycosidic bond and are involved in the degradation of agri-food and lignocellulose biomasses [25][26][27][28][29]. A total of 176 genes encoding GHs are clustered in 20 families, among which some are spread over all the compared genomes, i.e. GH13, GH18, GH23, GH25, GH36, GH65. The highest number of sequences (51 and 21, respectively) were found within GH13 and GH65 families whose members are mainly active on substrates containing α-glucoside linkages, such as starch [30]. GH42 representatives are found in almost all the strains except for H-1 and XZL4. Sequence analysis and experimental data indicate that the GH42 enzymes are mostly beta-galactosidases which are key providers of energy and carbons source through the breakdown of lactose to galactose and glucose [31]. GHs members belonging to families 18 might be involved in the sporulation pathway and this observation is in line with the lifestyle of B. coagulans members. Interestingly, four GH5 (in B. coagulans XZL4, XZL9, CSIL1) and one GH10 representative (in B. coagulans 36D1) were annotated. Both families include enzymes potentially acting on cellulose and hemicellulose, respectively [32,33]. In particular, the GH10 member was retrieved in the singletons list, suggesting that this enzyme was acquired through a 'genetic crosstalk' and/or horizontal gene transfer among bacterial genomes.
Glycosyl Transferases (GTs) are enzymes that catalyse the transfer of sugar moieties from activated donors to specific acceptor molecules, forming glycosidic bonds, and are involved in the synthesis of oligosaccharides, polysaccharides, and glycoconjugates [34]. Typically, the bacterial glycosyltransferases are poly-specific enzymes that mainly belong to families GT2 and GT4 and exploit the inverting and retaining catalytic mechanism, respectively. Except for GT2, GT35, and GT83, at least one member of the remaining identified GTs families is found in all the genomes. Moreover, GT4 and GT83 are the largest and the smallest groups, respectively. Noteworthy, the GT83 family which includes enzymes involved in lipopolysaccharide biosynthesis [35], typifies the GT repertoire of only two strains, i.e. MA-13 and CSIL1. Finally, family GT28 related to peptidoglycan biosynthesis, maintenance of the rod cell shape, and elongation of the lateral cell, is represented by two members for each genome.
Carboxyl Esterases (CEs) release acyl or alkyl groups attached by an ester linkage to carbohydrates [36]. Among the 18 CEs families, only six of them were annotated in this analysis. CE1, CE4, and CE14 are found in all the strains whilst a singleton encoding for a putative acetyl xylan esterase (CE6) was found in XZL9.
The remaining enzymes are categorized as Auxiliary Activities (AAs) in the CAZy database and comprise carbohydrate oxidases, redox enzymes involved in diverse functions such as lignin degradation and metabolism of lignin-derived compounds [37]. AA members (in total 29) were assigned to AA1 and AA4 families; the former are multicopper oxidases that use diphenols and related substances as donors and oxygen as acceptors [38], while the latter are potential vanillyl-alcohol oxidases [39].
The CAZome analysis along with the available biochemical and physiological data in the literature points to B. coagulans as model microorganism not only for carbohydrate degradation processes but also as a valuable source of novel enzymes for potential applications especially in food and medical industries. Specifically, MA-13 is currently tested for its ability to use agri-food wastes as carbon and energy sources (manuscript in preparation), showing excellent degradative capability on diverse biomasses. The absence of enzymatic activity related to cellulose degradation can be used for applications aimed at the isolation of cellulose from agri-food wastes.

Protein secretion systems
Bacterial cells rely on a plethora of secretion systems to dynamically interact with their environment. Two main types of protein secretion systems, the Sec and Tat pathways are the most common ones and have been identified in all domains of life [40]. Both Tat and Sec systems are present in B. coagulans as shown by the core feature set that includes tatC/tatA and secYEG genes [4]. Moreover, the comparative analysis of non-core genes revealed that B. coagulans exploits other less widespread secretion systems (Table 2). Indeed, 36D1 and CSL1 bear the hylD gene that is a component of the Escherichia coli αhemolysin secretion system (type I secretion system) which is otherwise made up also of hlyA, hlyB, and tolC [41].
Most but not all B. coagulans strains are featured by genes of the type VII secretion system (called T7SS) which exports proteins lacking a canonical cleavable signal peptide but bearing a WXG motif at the N-terminus [42]. EsxA and EsxB are molecular markers of the T7SS system and the relative genes (esxA and esxB) are usually part of gene clusters each harboring a membrane-associated ATPase of the FtsK-SpoIIIE protein family and other putative components of the secretion systems. Only P38 bears a complete set of the main components, whereas all the others show only some of them (Table 2). Table 2. Non-core genes of B.coagulans secretion systems.
It is tempting to speculate that either some of the T7SS genes are dispensable for leader-less secretion systems or that single components might cross-function with Tat or Sec systems. Interestingly, MA-13 bears only essA and essB, encoding for membrane proteins usually working in association with essC and esxA that are instead missing. Nevertheless, we have demonstrated in a previous work that WXG-proteins even not related to virulence factors are secreted in MA-13 through a leader-less mechanism [14].
In biotechnological applications, it would be useful to facilitate protein production or protein downstream processing using different secretion mechanisms for biotransformations and/or for screening enzyme variants in libraries. The study of the multiple secretion systems functioning in MA-13 along with its fermentation performance [5,6] makes this host a suitable starting point for the development of a consolidated bioprocess for the production of recombinant proteins/enzymes of biotechnological interest.

Toxin-antitoxin systems
Toxin-antitoxin systems (TAs) are ubiquitous among bacteria and play a crucial role in the dissemination and evolution of antibiotic resistance, for instance through stabilization of multi-resistant mobile genetic elements that inhibit or kill plasmid-free daughter cells [43]. Generally, toxins are proteins that affect metabolism and antitoxins are either RNA or proteins that counteract the effect of the toxin. Currently, TA systems are based on the coordinated action of a toxin which is always a stable protein, while the labile antitoxins can be either RNAs or proteins [44].
All B. coagulans strains are endowed with a common TA system retrieved from the core set and which relies on ydcD and ydcE genes, encoding for both an endoribonuclease that inactivates cellular mRNAs and its inhibitor, respectively. Moreover, we spotlighted in the MA-13 singletons set, the presence of a single gene (MBF8417901.1) encoding for the toxin of the PemK/MazF family [45]. The absence of the antitoxin component suggests that a functional TA system was lost during the evolution. It is tempting to speculate that the TA system based on ydcD/ydcE genes has been selected to face efficiently the challenge of an ever-changing environment. The TA systems of B. coagulans are a platform to discover novel antimicrobial targets, as well as to set up genetic reporter systems for plasmid maintenance and protein production.

Antimicrobial resistance mechanisms
Another strategy to withstand exposure to toxic compounds is antimicrobial resistance (AMR) which refers to the ability of microorganisms to cope with the effects of antibiotic treatments or antimicrobial peptides [46]. To advance phenotype prediction and to identify genomic regions related to AMR, an in silico analysis of the "resistome" of B. coagulans was performed using PATRIC software. Signatures of antimicrobial target genes were identified, among which those encoding for resistance to fluoroquinolones (gyrA and gyrB) and (murA) were found in MA-13, DSM1, and 2-6 (Table S4). Whilst those conferring resistance to sulfonamides (folP) and cycloserine (Alr) were present only in MA-13 and DSM1 strains. Interestingly, ORFs encoding for the primary target of the isoniazid (inhA, kasA) were only identified in MA-13, together with gdpD which is involved in phospholipid metabolism. In general, alteration in composition or distribution of phospholipids in the cell membrane is a common mechanism of resistance to lipopeptides in Gram-positive bacteria. Indeed, mutations in Enterococcus faecalis gdpD and liaFSR genes have been linked to the development of resistance to the cyclic lipopeptide daptomycin [47]. Then, the presence of the gdpD gene in MA-13 might represent a defense weapon against the action of antimicrobial peptides. In antibiotic-producing bacteria, it is known that biosynthetic operons of peptide antibiotics are frequently associated with membrane drug efflux transporters of the ABC family, which pump out the antibiotics for selfprotection. Accordingly, all B. coagulans strains analyzed bear three genes encoding for efflux pumps (bceA, bceB, bcrC) [48]. Furthermore, MA-13 is the only strain with a complete genetic array responsible for resistance to bacitracin including the two-component signal transduction system (bceRS) which is adjacent to the efflux transporter genes (bceA/B) ( Figure 5). This genetic proximity is common to other Gram-positive bacteria and the relative genes are located on contig k141_1005 in MA-13. In this system, BceR is a sensory transduction protein that responds to extracellular bacitracin stimulus and transmits the signal to the regulator BceS. This latter binds the promoter region upstream to bceA/B genes affecting their expression. Furthermore, MA-13 exhibits another genetic determinant of bacitracin resistance, i.e. bcrC encoding for a broad-specificity multidrug efflux pump which is located on a genomic region (contig k141_1) distant from the bceRS/AB genetic array. MA-13 BcrC protein is homolog (31.3% a.a. sequence identity) to BcrC of Bacillus licheniformis, a hydrophobic protein functioning as membrane component of the permease [49]. Differently from Bacillus licheniformis which bears a threecomponents operon (bcrABC), MA-13 possesses only the bcrC gene ( Figure 5). Furthermore, the MA-13 genome bears genetic determinants for triclosan (TCS)resistance, i.e. fabL and fabI genes, while a more complex cluster was identified in DSM1 including fabD, fabH, fabI, fabL, fabF (Table S4) [50]. TCS is a commonly used artificial antimicrobial agent widely spread in the environment. One of the major concerns about TCS is that resistance to this compound may contribute to reduced susceptibility to clinically important antimicrobials, due to either cross-resistance or co-resistance mechanisms. Indeed, bacteria exposed to continuously changing chemico-physical conditions are expected to be endowed with features of better adaptability and competitive fitness to overcome environmental challenges. The horizontal gene transfer (conjugation, transformation, and transduction) and the role played by mobile genetic elements (plasmids, transposons, insertion sequences, integrons, and integrativeconjugative elements) and the bacterial toxin-antitoxin system as well as the occurrence of genetic mutations, lead to a speedy bloom of antibiotic resistance amongst bacteria [20]. MA-13 which is an environmentally isolated strain is a multidrug resistance bacterium containing several diverse antibiotic resistance genes whose characterization might unravel novel mechanisms of resistance to organic compounds [6]. Our previous studies have highlighted the robustness of MA-13 toward biomass-derived inhibitors [5,6] which allows its propagation in the presence of such toxic compounds without affecting LA production. It is tempting to speculate that MA-13 tolerates the presence of diverse classes of organic toxic compounds through highly efficient detoxification/extrusion systems and therefore is suitable for different biotechnological applications related to the food industry as well as to biomass-degrading processes.

Secondary metabolites: bacteriocins
The discovery of new natural compounds produced by microorganisms, the socalled secondary metabolites, is a hot topic for diverse biotechnological applications [51]. These bioactive molecules, such as pigments, alkaloids, toxins, antimicrobials, are not involved in the basal metabolism, rather they often perform ancillary functions [52]. The ability of B. coagulans to hinder bacterial growth and balance co-habitant microbiota populations by producing antimicrobial compounds was already demonstrated in several studies. For instance, the metabolic acidosis associated with the secretion of LA during fermentation is detrimental for acid-sensitive microorganisms. To understand the potential of MA-13 in the production of secondary metabolites, the genome was analysed using antiSMASH, a bioinformatic tool suitable to search for clusters involved in secondary metabolites synthesis ( Figure 6). From this analysis, a gene encoding for a non-lanthionine-containing peptide (class II bacteriocins), the circularin A, was identified. Circularin A is usually produced as a pre-peptide that undergoes a proteolytic cleavage of the leader sequence followed by head-to-tail ligation between the N-and C-termini to produce a smaller circular antimicrobial peptide [53]. This bacteriocin is usually encoded by a gene cluster or in 4-10 genes containing operon. A genetic arrangement formed by ORFs encoding for circularin A, CircC (see below), ABC transporter, and two hypothetical proteins was found in MA-13 on contig k141_585. A similar cluster structure is shared among almost all the strains analysed ( Figure 6), but in the case of MA-13, the region is delimited by two repeats. Moreover, given the presence of several surrounding mobile elements, among which a transposase IS4, it is likely that the acquisition of this genetic cassette occurred upon HGT in MA-13. The strain XZL4 is the only one missing the circC gene, whose product has known to have a central role in the maturation of circularin A. Bacteriocin production could be considered advantageous to the producer since these peptides can kill or inhibit bacteria by competing for the same ecological niche or the same nutrient pool. Moreover, probiotics that synthetize bacteriocins are of particular importance because these bacteria are widely used in dairy industries. In this context, MA-13 has been proven to exhibit probiotic and prebiotic properties, because of its ability to express enzymes catalysing the synthesis of GOS as well as to improve the digestibility and nutritional value of foods. These features along with the ability to produce bacteriocins make MA-13 an ideal candidate for exploitation in white biotechnology.
2.6. Innate and adaptive immunity 2.6.1. Innate immunity Prokaryotic immunity is constituted by multiple systems to defend from mobile genetic elements (i.e. viruses and plasmids). Innate defense mechanisms do not require previous exposure to a pathogen. The examination of the different B. coagulans strains revealed the ubiquitous presence of the Wadjet defense immune system in the nine strains analysed (Figure 7 and Table S5) [54]. Wadjet cassettes are composed of the four genes jetABCD, where JetABC are homologs of the bacterial condensins MukF, MukE, and MukB, respectively. Bacterial condensins are involved in DNA condensation and segregation during bacterial replication. The fourth component of the operon, JetD, has a putative topoisomerase VI domain. Three types of Wadjet systems have been described, reflecting the domain organization of its components within the cassette [ strains predominantly contain type II Wadjet systems, although strains 36D1, CSIL1, and XZL9 harbor type III Wadjet operons (Figure 7). Wadjet system has been related to antiplasmid defense in the Gram-positives Bacillus cereus Q1 and Mycobacterium smegmatis [56] where it impairs plasmid transformation and plasmid segregation to daughter cells, respectively. It is tempting to speculate that the presence of Wadjet systems in B. coagulans is involved in preventing the acquisition of DNA from the environment. This would hamper the development of genetic systems for these strains, similarly to what occurs in M. smegmatis, where disruption of this system allowed transformation and maintenance of plasmid pAL5000 [56]. Other isolated defense systems are present in B. coagulans strains. Strain 2-6 harbors a single CBASS (cyclic-oligonucleotide-based anti-phage signaling systems) cassette in addition to a type II Wadjet system (Figure 7) [57]. CBASS systems are constituted by an oligonucleotide cyclase that produces cyclic nucleotide secondary messengers upon the encounter of a pathogen, and an effector protein. They are classified into four types according to their operon organization, type of effector, and cyclic nucleotide messenger. In particular, effectors of type II CBASS systems contain ubiquitin-associated domains and their mechanism of action is still unknown given that ubiquitin has not been found in bacteria. However, this system confers protection against phages in E. coli and V. cholerae. Strain XZL9 harbors two Wadjet cassettes of types II and III, plus two Gabija loci (Figure 7). The Gabija system is composed of proteins GajA and GajB where GajA is a sequence-specific DNA nicking endonuclease and GajB is predicted to be a UvrD-like helicase. It is hypothesized that GajA nuclease activity is activated by the change in the intracellular nucleotide levels resulting from virus replication and transcription, thus eliciting bacteriophage resistance [58].
Likewise many environmental strains, B. coagulans is not amenable to transformation with exogenous DNA and so far only 2 strains (DSM1 and P4-102B) are genetically tractable. However, setting up a reproducible genetic system is indispensable to apply metabolic engineering techniques to meet the requirements of commercial applications. The ubiquitous presence of the putative antiplasmid Wadjet immune system in B. coagulans may explain the recalcitrance of these strains to genetic tractability and constitutes a promising candidate for the development of strains amenable to genetic engineering. 2.6.2. Adaptive immunity: CRISPR-Cas system and mobile elements CRISPR-Cas systems constitute an adaptive immune apparatus in a wide range of bacteria (~40%), and the majority of archaea (~ 90%); small guide RNAs (crRNAs) interfere with sequence-specifically nucleic acids neutralizing invading genetic elements [59]. A CRISPR locus consists of repeats (23-47 bp) separated by unique sequences (spacers) with similar length (21-72 bp), originating from mobile genetic elements such as bacteriophages, transposons, or plasmids. Moreover, the cas genes are generally closely linked to CRISPR arrays and are diversified. Indeed, CRISPR-Cas systems are classified into two classes which are further subclassified into six types and several sub-types: in class I, the effector module consists of several Cas proteins whereas class II is defined by the presence of a single multidomain protein (e.g. Cas9). The CRISPR-Cas defense mechanism acts in a sequence-specific manner by recognizing and cleaving foreign DNA or RNA. The defence response can be divided into three stages: (i) adaptation or spacer acquisition, (ii) crRNA biogenesis, and (iii) target interference [60]. MA-13 genome harbors 48 spacers, 51 repeats organized in at least two CRISPR arrays with the same direct repeat sequence. The exact number of arrays cannot be determined due to the fragmented nature of the MA-13 genomic assembly that includes small contigs containing partial, truncated CRISPR arrays. Additionally, these arrays are linked to several unidirectionally arranged cas genes organized in two loci (cas1, cas2, cas3, cas4, cas5, cas7, cas8) ( Figure 8 and Table S6). Such genomic association is typical of Class I-C, characterized by the lack of cas6, encoding for an endoribonuclease employed by most type I systems for pre-crRNA processing and by the presence of a single protein encoded by the cas8c gene. The absence of cas6 can be replaced by cas5 whose product performs a similar catalytic reaction. In type I-C system, genes are typically encoded by a single (predicted) operon. Genes cas 3, 5, 8c and 7 of locus 1 share sequence identity below the 30% threshold with the corresponding genes located on the other locus. This indicates that these cas genes do not derive from gene duplication rather they have been acquired through independent events (see below). The comparative genomic analysis revealed that eight strains of B. coagulans possess cas operons of type I-B (2-6, 36D1, H-1, P-38, and XZL9), type I-C (CSIL1, DSM1, H-1, MA-13) or type IV (DSM1), mostly located in the vicinity of CRISPR arrays (Figure 8-9 and Table 3). Both type IV cas operons are located within prophages in the DSM1 genome, suggesting they are of viral origin. Only strain XZL4 lacks a complete cas operon or CRISPR array.  The B. coagulans strains contain 1 to 5 CRISPR arrays each, for a total of 30 arrays of which 13 are annotated as orphan CRISPRs (i. e. not located in the vicinity of a cas operon). Additionally, while strains MA-13, P38, and XZL9 have one consensus direct repeat sequence, strains 36D1, DSM1, CSIL1 and H1 have two to three direct repeat sequences despite all strains harboring one single adaptation cassette (Table 3).
At present, it is not clear whether orphan CRISPR arrays are indeed functional, although several lines of evidence indicate that at least some isolated CRISPR arrays are active [61,62]. Strains 2-6, 36D1, and CSIL1 have orphan arrays with a repeat different to the one of the CRISPR array associated with their cas operon and adaptation cassette. Moreover, in strains 36D1 and CSIL1 this repeat is predicted to belong to another CRISPR subtype. A similar picture is seen for strain DSM1, which has an adaptation cassette associated with an I-C CRISPR system, but an orphan array with a different direct repeat. These arrays could represent a remnant of a cas operon previously present in the genome that was subsequently lost. Additionally, strain H-1 bears interference cassettes for both I-B and I-C subtypes, but an adaptation cassette only for the I-B operon ( Table 3). The repeat of the associated CRISPR arrays is different for each subtype, suggesting that both cas operons are capable of interference, but only the I-B cassette may be able to acquire new spacers. The low number of spacers in the I-C CRISPR arrays (3 to 5) could reflect its inability to acquire new spacers. Taken together, the distribution of CRISPR-Cas modules in B. coagulans strains suggests that loss and gain of CRISPR-Cas cassettes is a common trait in these organisms. From an applicative point of view, the endogenous CRISPR-Cas adaptive immune systems can be repurposed for genome editing of B. coagulans, potentially constituting a valuable tool in the genetic toolbox for these strains. Analysis of CRISPR spacer matches allowed the identification of viruses infecting the B. coagulans strains. From a pool of 651 spacers extracted from the strains, 113 spacers (17.3%) matched to 77 Firmicutes-infecting viruses in the IMG-VR database [63]. These viruses are classified as members of the class Caudoviricetes, with the majority of them (66%) belonging to the Siphoviridae family. Several viruses match to CRISPR spacers of strains from both clades of B. coagulans (group A and group B), suggesting that these strains share a common pool of viruses. Additionally, most viruses in the network originate from an engineered environment (e.g. lab enrichment), including all spacers from strain MA-13 (Figure 10b), while only 4 viruses originated from environmental ecosystems. Interestingly, one single spacer from strain H-1 matched 38 viruses, of which 36 originate from host-associated environments, particularly the human digestive system ( Figure 10). The source of the B. coagulans viruses is in agreement with the prevalent agri-food origin of the isolated B. coagulans strains. The analysis of the adaptive immunity in MA-13 was extended also to the identification of genomic islands (GIs, regions of probable horizontal origin) and prophage regions [64,65]. GIs are non-self-mobilizing integrative regions encoding factors that support the adaptability and competitiveness of the microbes within a niche, including virulence factors and other medically or environmentally important adaptations [61,63]. Using IslandViewer, the prediction of GIs is carried out with a precise definition of boundaries since integration splits off a gene fragment that marks the distal ends of the island. 297 genes were assigned as belonging to GIs (Table S8 and Fig 11), among which the following cas genes have been identified: cas3 (MBF8417476.1), cas5 (MBF8417475.1), cas8c (MBF8417474.1), cas7 (MBF8417473.1, MBF8417526.1), cas4 (MBF8417527.1), cas1 (MBF8417528.1) and cas2 (MBF8417529.1). A similar analysis conducted on strains 2-6 and 36D1 indicates that the CRISPRs-Cas systems are located in genomic islands that are flanked by transposases [4]. Several CRISPR-Cas systems have been discovered within GIs and for V. cholerae all CRISPR-Cas arrays are located in mobile genetic elements (MGEs) [64]. Among the other genetic traits found in GIs that enhance the fitness of MA-13, worth noting is the presence of i) two genes encoding for putative thiazole-containing heterocyclic bacteriocins (MBF8418016.1 bacteriocin biosynthesis cyclodehydratase) and uberolysin/carnocyclin family circular bacteriocin (MBF8417079.1); ii) a full set of genes belonging to type I restriction systems that are large pentameric proteins with separate restriction (R, MBF8419113.1), methylation (M, MBF8419111.1) and DNA sequence-recognition (S, MBF8419112.1) subunits; iii) a complete cluster of genes encoding for arsenic resistance made up of arsABCD (MBF8417195.1, MBF8417192.1, MBF8417193.1, MBF8417194.1). ArsB and D are arsenite efflux transporters, ArsC is an arsenate reductase and ArsA is an arsenical pump-driving ATPase that overall contributes to confer resistance to arsenic ( Figure 11, Table S8) [65]. Furthermore, the prophage finding software PHASTER was used to identify intact prophages and their integrase genes in MA-13, utilizing a BLASTP comparison of the query genome with a frequently updated prophage sequence database [66]. Only one prophage is present in the MA-13 genome whose integrity is indicated by the presence of the attachment junctions attL and attR (also named hybrid attachment sites) at the extremities of the prophage (Figure 11). A total of 58 genes were identified in this genomic region that overall represent a blend of bacterial and bacteriophage features. Specifically, 20 hypothetical proteins were annotated using Bacterial Database and the remaining 38 belong to diverse phages ( Figure 12 and Table S9). Among these genes, 13 are linked to Bacillus cereus prophage phBC6A52 (Genbank NC_004821.1) assigned to the Podoviridae family [67]. Besides MA-13, also P38 bears a complete sequence related to a B. cereus prophage. PHASTER analysis revealed that it shares 100% identity with Bacillus Phage vB_BtS_B83 that belongs to the Bembunaquatrovirus bacteriophages genus. Although bacteriophages typically exhibit a narrow host range, yet the presence of the same phage in two different Bacillus species suggests that the horizontal gene transfer events can overcome the interspecies boundaries to allow the bacteria-phage coevolution in a given ecotype. Indeed, the exchange of phage attachment molecules could even occur in an interspecies context, enabling phage adsorption to non-host species, providing an alternative route for HGT. In in vitro models, species of Bacillus, including B. coagulans, B. subtilis, and B. cereus showed a similar life cycle in artificial gastrointestinal tract models [68].
Since B. coagulans and B. cereus strains have been both isolated mainly from food sources, it is reasonable that they share the same lifestyle and/or ecological environment, making it possible for genetic material exchanges.
Two strains i.e MA-13 and DSM1 were selected based on their genetic relatedness and by pairwise comparative analysis by calculating the average nucleotide identity (ANI) using FastANI. The percentage threshold for species boundary is 95% ANI [73]. In silico DNA-DNA hybridization (DDH) values were estimated using the Genome-to-Genome Distance Calculator (GGDC).

Conclusions
Thermophilic microorganisms are a reservoir for biodiversity, molecular phylogeny, and the production of unique industrially valuable enzymes and other compounds [76][77][78][79][80][81][82][83][84]. The purpose of this work is to expand our understanding of the intra-strains genomic diversity of B. coagulans and to provide new insights into its genetic potential in biotechnological applications. Since MA-13 has been successfully used for LA production from lignocellulose biomasses and as prebiotics producer, we attempted to explain its favorable fermentation features through this comparative genomic analysis. Several interesting characteristics emerged from this study, pointing to MA-13 as a platform for the efficient production of bio-chemicals. Indeed, the high operating temperature used during fermentation and its innate and adaptive immunity could protect the MA-13 strain from phage infection and contamination. Moreover, the presence of diverse secretion systems can be efficiently employed for homologous/heterologous protein production.
Finally, our analysis contributes also to explore genomic diversity and sheds light on genetic traits that are essential to the basic lifestyle of B. coagulans. Furthermore, we have expanded the knowledge of the selective advantages that shape the organization and dynamics of B. coagulans genomes, including niche adaptation, antibiotic resistance, and genetic tractability.
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Figure S1: title, Table S1: title, Video S1: title.  Institutional Review Board Statement: "Not applicable." Informed Consent Statement: "Not applicable." Data Availability Statement: All data generated or analyzed during this study are included in this published article.