Preprint
Article

This version is not peer-reviewed.

Draft Genome Analysis and Virulence Gene Profiling of Escherichia coli ERR039477

Submitted: 01 June 2026

Posted: 02 June 2026

Abstract
Escherichia coli is a highly diverse bacterial species with strains ranging from harmless commensals to pathogenic lineages. Here, we present a draft genome assembly and virulence gene analysis of the E. coli isolate ERR039477. Illumina paired-end sequencing reads were quality-checked, trimmed, and assembled using SPAdes, producing a draft genome of 4,420,077 bp across 2,450 contigs with a GC content of 50.7% and an N50 of 2,794 bp. Genome annotation with Prokka identified 6,151 protein-coding sequences, including 4,675 with putative functions, 38 tRNAs, a complete rRNA operon, and two CRISPR arrays. Species-level assignment was validated using KmerFinder and TYGS analyses, confirming placement within the E. coli clade. Virulence profiling against the Virulence Factor Database (VFDB) revealed a diverse repertoire of genes associated with motility, adhesion, invasion, immune modulation, and iron acquisition. The most abundant virulence factors included flagellar biosynthesis genes (fliP, fliI, flhA, flgD, flgI, flgG), the fimbrial adhesion gene fimD, and iron uptake systems (entA, entB, entE, entF, fepA). Antimicrobial resistance screening using ResFinder did not detect any known acquired resistance genes in the genome. Additionally, pathogenicity prediction with PathogenFinder classified the isolate as a human pathogen with a probability score of 0.886, supported by numerous matches to pathogenic protein families. These findings indicate that ERR039477 carries genetic traits associated with pathogenic potential. This study provides a genomic resource and virulence gene profile for ERR039477 and illustrates a reproducible workflow that combines genome assembly, annotation, and virulence screening. The results offer a reference for comparative studies of E. coli isolates and pathogenicity research.

1. Introduction

Escherichia coli is a Gram-negative bacterium that inhabits the intestinal tract of humans and animals, displaying remarkable genetic and phenotypic diversity (Lim et al., 2010). While most strains exist as harmless commensals (Genomics & Fletcher, 2023), certain lineages have acquired virulence determinants (Sora et al., 2021) that enable them to cause diseases ranging from urinary tract infections and septicemia to enteric illnesses and neonatal meningitis (Smith et al., 2007). The pathogenic potential of E. coli is largely dictated by its complement of virulence genes, which encode proteins responsible for adhesion, motility, biofilm formation, immune evasion, toxin production, and iron acquisition (Gambushe et al., 2022; Miethke & Marahiel, 2007).
Advances in whole-genome sequencing and computational genomics have transformed the study of bacterial pathogens, allowing rapid and detailed characterization of genome structure, gene content, and functional capabilities (Quainoo et al., 2017). Genome sequencing facilitates species-level identification and phylogenetic placement. It also enables comprehensive detection of virulence determinants and other adaptive traits (Altamirano et al., 2020; Yu et al., 2022). Such genomic insights are critical for understanding strain-specific pathogenic mechanisms and assessing epidemiological risks.
In this study, we assembled and annotated the draft genome of the Escherichia coli isolate ERR039477 and screened it for virulence genes using the Virulence Factor Database (VFDB). The main goals were to generate a reliable draft genome, confirm the species identity using genome-based methods, and describe the virulence genes present in this isolate. Together, the assembly, annotation, and systematic virulence profile provide a reproducible genomic resource for understanding the pathogenic potential of this E. coli strain and will support future comparative and functional studies of pathogenic lineages.

2. Materials and Methods

2.1. Data Acquisition

Illumina paired-end sequencing reads for Escherichia coli K-12 DH10B (run accession ERR039477) were obtained in FASTQ format from the European Nucleotide Archive (ENA) (Durfee et al., 2008). All analyses were conducted in a Linux environment using Windows Subsystem for Linux (WSL).

2.2. Raw Data Inspection

Sequencing reads were initially inspected for file integrity, completeness, and proper read structure to ensure suitability for downstream analyses.

2.3. Quality Assessment

Per-base sequence quality, GC content, sequence length distribution, sequence duplication levels, and adapter contamination were evaluated using FastQC (v0.11.9). This step identified typical declines in quality toward the ends of reads and the presence of adapter sequences.

2.4. Read Trimming and Filtering

Low-quality bases and adapter sequences were removed using fastp (v0.23.4), producing high-quality reads for genome assembly.

2.5. De Novo Genome Assembly

Trimmed reads were assembled de novo using SPAdes (v3.15.5), generating contiguous sequences (contigs) for genome reconstruction. Assembly statistics, including total contigs, total length, N50, L50, and GC content, were evaluated with QUAST (v5.2.0).
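As an illustration of how the contiguity metrics named above are defined, the following minimal Python sketch computes N50, L50, and GC content from contig data. It is not QUAST's implementation; the function names `n50_l50` and `gc_percent` and the toy inputs are hypothetical and chosen only for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp.

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly;
    L50 is the number of contigs in that set.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


def gc_percent(sequences):
    """Return the GC content (%) over all bases in the given sequences."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        raise ValueError("sequences must contain at least one base")
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return 100.0 * gc / total


# Toy data (hypothetical, not taken from the ERR039477 assembly).
print(n50_l50([100, 80, 60, 40, 20]))  # (80, 2)
print(gc_percent(["ATGC", "GGCC"]))    # 75.0
```

In the toy example the total length is 300 bp, so the two longest contigs (100 + 80 bp) are the first to reach the 150 bp midpoint, giving N50 = 80 and L50 = 2.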

2.6. Genome Annotation

Structural and functional annotation was performed using Prokka (v1.15.6). Protein-coding sequences (CDSs), tRNAs, rRNAs, tmRNA, and CRISPR arrays were identified. Functional annotation was assigned using curated protein databases, and sequences lacking known functional assignments were flagged as hypothetical proteins.

2.7. Species Confirmation

Species-level identity was validated using KmerFinder (v3.2) by comparing the assembled genome against reference Escherichia coli genomes. Genome-based taxonomic analysis was further performed with the Type (Strain) Genome Server (TYGS), using digital DNA–DNA hybridization (dDDH) and 16S rRNA gene-based phylogenies to confirm placement within the E. coli species cluster.

2.8. Virulence Gene Profiling

To evaluate pathogenic potential, all Prokka-predicted protein sequences were screened against the Virulence Factor Database (VFDB) using BLASTp. High-confidence matches were defined as sequences with ≥70% amino acid identity and ≥50% query coverage. Identified virulence factors were cataloged by functional class and gene name to construct a genome-wide virulence profile.
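The filtering rule above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in this study: it assumes a tab-separated BLASTp table with the columns `qseqid`, `sseqid`, `pident`, `qlen`, `qstart`, and `qend` (a hypothetical layout, since the output format is not stated here) and approximates query coverage as the aligned query span divided by the query length. The function name and example rows are also hypothetical.

```python
import csv
import io

# Assumed column order of the BLASTp table (hypothetical; not stated in the text).
COLUMNS = ["qseqid", "sseqid", "pident", "qlen", "qstart", "qend"]


def high_confidence_hits(handle, min_identity=70.0, min_query_cov=50.0):
    """Keep hits with identity >= min_identity and query coverage >= min_query_cov."""
    hits = []
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        identity = float(row["pident"])
        aligned_span = int(row["qend"]) - int(row["qstart"]) + 1
        query_cov = 100.0 * aligned_span / int(row["qlen"])
        if identity >= min_identity and query_cov >= min_query_cov:
            hits.append((row["qseqid"], row["sseqid"], identity, round(query_cov, 1)))
    return hits


# Made-up rows for illustration only.
EXAMPLE = (
    "prot_1\tVFG_a\t92.5\t200\t1\t180\n"   # 92.5% identity, 90% coverage: kept
    "prot_2\tVFG_b\t65.0\t300\t1\t300\n"   # identity below 70%: dropped
    "prot_3\tVFG_c\t88.0\t400\t10\t150\n"  # coverage below 50%: dropped
)

print(high_confidence_hits(io.StringIO(EXAMPLE)))
```

Only the first example row passes both thresholds. Because fragmented assemblies can split one gene across contigs, several short queries may hit the same VFDB entry, so counting distinct entries rather than raw hits is a reasonable design choice.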

3. Results

3.1. Quality Assessment and Genome Assembly

Initial quality assessment of raw Illumina reads using FastQC revealed a typical decline in per-base quality towards the read ends, with adapter sequences present (Figure 1A). Trimming with fastp improved base quality across all positions (Figure 1B) and removed adapter sequences, resulting in 372,243 high-quality reads suitable for assembly.
De novo assembly with SPAdes produced a fragmented draft genome consisting of 2,450 contigs with a total length of 4,420,077 bp, an N50 of 2,794 bp, and a GC content of 50.7% (Table 1). QUAST evaluation indicated that the largest contig was 39,568 bp and the L50 was 457. In the trimmed reads, per-base sequence quality was acceptable, sequence duplication levels were low, and no overrepresented sequences were detected.

3.2. Genome Annotation

Prokka v1.15.6 (Seemann, 2014) annotation identified 6,151 protein-coding sequences (CDSs), 38 tRNAs, one tmRNA, and a complete rRNA operon comprising 5S, 16S, and 23S rRNAs on a single contig. Two CRISPR arrays containing 13 and 5 spacers were detected, suggesting prior bacteriophage exposure. A total of 4,675 CDSs were assigned putative functional annotations, while 1,476 remained hypothetical. Several duplicated gene names were observed due to contig fragmentation, and a few CDSs were predicted as pseudogenes (Table 2).
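As a quick consistency check on the counts reported above, the short Python sketch below confirms that the functionally annotated and hypothetical CDSs sum to the total and derives the hypothetical fraction. The variable names are illustrative only; the numbers are those given in this section.

```python
# Counts reported in the annotation summary.
total_cds = 6151
with_function = 4675
hypothetical = 1476

# The two categories should account for every predicted CDS.
assert with_function + hypothetical == total_cds

hypothetical_pct = round(100 * hypothetical / total_cds, 1)
functional_pct = round(100 * with_function / total_cds, 1)
print(hypothetical_pct, functional_pct)  # 24.0 76.0
```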

3.3. Species Confirmation

3.3.1. KmerFinder Analysis

KmerFinder 3.2 matched the genome to Escherichia coli str. K12 substr. DH10B with 97.55% query coverage, 95.66% template coverage, and 0.91× depth (Table 3). These results indicate high confidence in species-level identification.

3.3.2. TYGS Analysis

TYGS provided genome-based taxonomic validation (Table 4, Table 5) using digital DNA–DNA hybridization (dDDH) (Meier-Kolthoff et al., 2022) and 16S rRNA phylogenies (Kreft et al., 2017; Lefort et al., 2015; Farris, 1972). The genome clustered within the Escherichia coli species cluster with high confidence (Figure 2, Figure 3) (Meier-Kolthoff & Göker, 2019; Meier-Kolthoff et al., 2014).

3.4. Antimicrobial Resistance (AMR) Analysis

To evaluate antimicrobial resistance potential, the assembled genome was screened using ResFinder. The analysis searches for acquired antimicrobial resistance genes through comparison with a curated resistance gene database. No known acquired antimicrobial resistance genes were detected in the ERR039477 genome under the applied identity and coverage thresholds. This result suggests that the strain does not harbor major plasmid-mediated resistance determinants commonly associated with multidrug-resistant E. coli strains. The absence of detectable AMR genes indicates that the isolate may remain susceptible to commonly used antibiotics. However, resistance may still arise through chromosomal mutations or regulatory mechanisms that are not detectable using acquired resistance gene databases.

3.5. Virulence Gene Profiling

To assess the pathogenic potential of the assembled Escherichia coli genome, all Prokka-predicted protein sequences (n = 6,151) were screened against the Virulence Factor Database (VFDB) using BLASTp. High-confidence matches were defined as alignments with ≥70% amino acid identity and ≥50% query coverage. This analysis identified multiple virulence-associated genes distributed across the genome. The most frequently detected virulence factors included entries VFG045823, VFG035970, VFG049146, VFG049144, VFG049139, and VFG049136, each represented by multiple high-confidence matches. These VFDB entries correspond to proteins involved in host adhesion, membrane interaction, nutrient acquisition, and survival within host environments.
The presence of iron-uptake systems, membrane-associated virulence proteins, and stress-response factors suggests that the strain possesses genetic traits enabling colonization and persistence in host tissues. Several virulence genes were detected in multiple fragmented forms, which is consistent with the highly fragmented nature of the draft genome assembly and likely reflects contig breaks rather than true gene duplications.
Overall, the virulence gene profile indicates that the ERR039477 isolate belongs to a potentially pathogenic lineage of E. coli, rather than a benign commensal or laboratory strain.

3.6. Virulence Gene Repertoire

BLAST screening against the Virulence Factor Database (VFDB) revealed a large and diverse virulence gene repertoire (Table 6, Table 7). This virulence profile indicates strong motility, epithelial attachment, invasion, and iron scavenging, all of which are hallmarks of pathogenic and invasive E. coli lineages.

3.7. Pathogenicity Prediction

The pathogenic potential of the assembled genome was further evaluated using PathogenFinder. The analysis predicted the ERR039477 isolate to be a human pathogen with a probability score of 0.886.
Out of 6,156 analyzed sequences, 410 matched pathogenic protein families, whereas 32 matched non-pathogenic families, resulting in a genome coverage of 7.18%. Several sequences showed similarity to proteins from well-characterized pathogenic strains such as Escherichia coli O157:H7, Escherichia coli E24377A, and Shigella flexneri.
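The reported coverage figure can be reproduced from the counts above with the short Python sketch below. It assumes that "genome coverage" here means the share of analyzed sequences matching any protein family, pathogenic or non-pathogenic; that definition is an assumption, although the reported numbers are consistent with it.

```python
# Counts reported for the PathogenFinder analysis.
analyzed_sequences = 6156
pathogenic_matches = 410
non_pathogenic_matches = 32

# Assumed definition: sequences matching any family / all analyzed sequences.
matched = pathogenic_matches + non_pathogenic_matches
coverage_pct = round(100 * matched / analyzed_sequences, 2)

# Share of the matched sequences that hit pathogenic families.
pathogenic_share_pct = round(100 * pathogenic_matches / matched, 1)

print(matched, coverage_pct, pathogenic_share_pct)  # 442 7.18 92.8
```

Under this reading, roughly 93% of the matched sequences hit pathogenic families, but those matches involve only about 7% of all analyzed sequences.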
The combined evidence from virulence gene profiling and pathogenicity prediction suggests that the ERR039477 genome contains multiple genetic determinants associated with human pathogenicity, despite the absence of detectable antimicrobial resistance genes.

4. Discussion

In this study, we present the draft genome sequence and virulence profile of the Escherichia coli isolate ERR039477. Illumina reads trimmed with fastp enabled the assembly of a 4.42 Mb genome, although the assembly remained fragmented (N50 = 2,794 bp). Despite this fragmentation, the contigs capture the majority of coding sequences, the CRISPR arrays, and an rRNA operon, providing a solid foundation for downstream analyses. Fragmentation may have resulted in partial gene predictions or apparent duplications, but the overall genome completeness is consistent with typical draft assemblies from Illumina short reads (Gurevich et al., 2013).
Annotation with Prokka revealed a comprehensive gene set, including over 6,000 coding sequences and two CRISPR arrays. The presence of a complete rRNA operon, tRNAs, and CRISPR elements indicates that essential genomic features are well represented despite contig fragmentation. Approximately 24% of the coding sequences were predicted as hypothetical proteins. This highlights the continuing need for experimental validation and functional characterization in E. coli genomics (Vincent, 2024).
Species identification through KmerFinder and TYGS confirmed ERR039477 as E. coli, with close relatedness to type strain DSM 30083. The high query and template coverage, along with dDDH values above the 70% species delineation threshold, provide strong taxonomic confidence (Meier-Kolthoff et al., 2013). Phylogenetic analyses based on 16S rRNA and genome-scale GBDP trees further reinforced this classification and distinguished the isolate from closely related Shigella species. These findings support the utility of combining k-mer-based and genome-scale methods for accurate species assignment (Byrd et al., 2020; Tian et al., 2024), especially in highly recombinogenic taxa such as Escherichia/Shigella (Chattaway et al., 2017).
The virulence gene profiling revealed a diverse repertoire of genes associated with motility, adhesion, invasion, immune evasion, and iron acquisition. The predominance of flagellar genes (fliP, fliI, flhA, flgD, flgI, flgG) and the type 1 fimbrial gene fimD suggests strong motility and epithelial attachment capabilities (Type 1 Fimbriae (Pili), n.d.), which are essential traits for colonization and biofilm formation (Guttenplan & Kearns, 2013). Additionally, the presence of genes encoding the E. coli common pilus (yagX/ecpC, ykgK/ecpR, yagV/ecpE) and brain endothelial invasion factors (ibeB, ibeC) highlights the potential of this strain to penetrate host tissues and traverse epithelial barriers, a hallmark of extraintestinal pathogenic E. coli (ExPEC) (Seib et al., 2012; Köhler & Dobrindt, 2011; Johnson & Russo, 2005).
Iron acquisition systems, including enterobactin biosynthesis and transport genes (entA, entB, entE, entF, fepA), were highly represented, emphasizing the importance of iron scavenging in host colonization and survival under nutrient-limited conditions (Amiri et al., 2025). The identification of multiple stress-response and immune modulation genes (VFC0258) further suggests that ERR039477 possesses genetic traits enabling persistence in hostile host environments. Collectively, these virulence determinants indicate that the isolate belongs to a potentially pathogenic lineage, which is distinct from benign commensal or laboratory strains.
Antimicrobial resistance analysis using ResFinder did not detect any known acquired resistance genes in the assembled genome. This suggests that ERR039477 lacks major plasmid-mediated resistance determinants commonly observed in multidrug-resistant E. coli strains. While this indicates potential susceptibility to commonly used antibiotics, it should be noted that resistance can also arise through chromosomal mutations or regulatory changes that are not captured by acquired gene detection databases (Munita & Arias, 2016).
Further evaluation of pathogenic potential using PathogenFinder predicted the isolate as a human pathogen with a probability score of 0.886. The analysis identified 410 matches to pathogenic protein families compared to only 32 non-pathogenic families, supporting the virulence gene profiling results. The detected matches included proteins associated with membrane transport, iron uptake, and prophage-related elements, many of which share similarity with sequences from pathogenic E. coli and Shigella strains. This finding reinforces the conclusion that ERR039477 possesses multiple genomic features linked to host colonization and pathogenicity (Srinivasan et al., 2025). It is important to note that some virulence genes were present in fragmented forms, which likely reflects assembly artifacts rather than true gene duplication. Future studies using long-read sequencing technologies could resolve these fragmented loci and provide more precise insights into genomic organization and pathogenic potential (González et al., 2025).
In conclusion, the combination of high-quality assembly, comprehensive annotation, accurate species confirmation, and detailed virulence profiling establishes ERR039477 as a genetically well-equipped, potentially pathogenic E. coli isolate. The absence of detectable acquired antimicrobial resistance genes alongside strong virulence signatures suggests that this strain may represent a pathogenic but potentially antibiotic-susceptible lineage. These findings contribute to our understanding of E. coli virulence mechanisms and provide a foundation for functional studies, epidemiological surveillance, and comparative genomics of pathogenic versus commensal strains.

5. Conclusion

The draft genome of Escherichia coli ERR039477 provides a comprehensive view of its genetic composition and virulence potential. Assembly and annotation recovered coding sequences, rRNAs, tRNAs, and CRISPR elements, indicating that essential genomic features are represented despite contig fragmentation. Species confirmation through KmerFinder and TYGS analyses established the isolate as E. coli, closely related to type strain DSM 30083. Virulence profiling highlighted a diverse repertoire of genes associated with motility, adhesion, invasion, immune evasion, and iron acquisition, reflecting traits characteristic of pathogenic and invasive E. coli lineages. The predominance of flagellar and fimbrial genes, combined with iron-scavenging systems and stress-response factors, underscores the isolate’s potential to colonize host tissues and persist under adverse conditions. Antimicrobial resistance screening using ResFinder did not identify any known acquired resistance genes in the genome. However, pathogenicity prediction with PathogenFinder classified ERR039477 as a probable human pathogen with a high probability score (0.886), supporting the extensive virulence gene repertoire detected in this study. Overall, ERR039477 represents a potentially pathogenic E. coli strain. It provides genomic and virulence insights that can inform future functional studies, epidemiological monitoring, and comparative analyses of pathogenic versus commensal strains. These findings improve our understanding of the genomic determinants underlying E. coli pathogenicity and add to bacterial genomics research.

Author Contributions

Rofiqul Islam Nayem designed the study, performed genome assembly, annotation, and bioinformatics analyses, and drafted the manuscript. Nishat Tasnim assisted with data curation, quality control, and figure preparation. Mst Raihana Sultana contributed to literature review, virulence analysis, and result interpretation. Md. Mahidul Islam Masum supervised the project and critically revised the manuscript. All authors approved the final version.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The genome assembly and annotation files for Escherichia coli ERR039477 have been deposited in the Zenodo repository and are publicly available at: https://doi.org/10.5281/zenodo.19499706. All data supporting the findings of this study are included within the article and its supplementary files.

Acknowledgments

We thank the European Nucleotide Archive (ENA) and the European Bioinformatics Institute (EMBL-EBI) for providing open access to the Illumina sequencing data used in this study. We also acknowledge the developers of FastQC, fastp, SPAdes, QUAST, Prokka, and the Virulence Factor Database (VFDB) for making their tools freely available. All computational analyses were done using Linux through the Windows Subsystem for Linux (WSL).

Competing Interests

The authors declare that there are no competing interests.

References

  1. Altamirano, S., Jackson, K. M., & Nielsen, K. (2020). The interplay of phenotype and genotype in Cryptococcus neoformans disease. Bioscience Reports, 40(10).
  2. Amiri, M., Golchin, M., Jamshidian Mojaver, M., Farzin, H., & Hajizade, A. (2025). Enterobactin: A key player in bacterial iron acquisition and virulence and its implications for vaccine development and antimicrobial strategies. Virulence, 16(1), 2563018.
  3. Byrd, A. L., Liu, M., Fujimura, K. E., Lyalina, S., Nagarkar, D. R., Charbit, B., Bergstedt, J., Patin, E., Harrison, O. J., Quintana-Murci, L., Mellman, I., Duffy, D., & Albert, M. L. (2020). Gut microbiome stability and dynamics in healthy donors and patients with non-gastrointestinal cancers. The Journal of Experimental Medicine, 218(1).
  4. Chattaway, M. A., Schaefer, U., Tewolde, R., Dallman, T. J., & Jenkins, C. (2017). Identification of Escherichia coli and Shigella Species from Whole-Genome Sequences. Journal of Clinical Microbiology, 55(2), 616–623.
  5. Durfee, T., Nelson, R., Baldwin, S., Plunkett, G., Burland, V., Mau, B., Petrosino, J. F., Qin, X., Muzny, D. M., Ayele, M., Gibbs, R. A., Csorgo, B., Posfai, G., Weinstock, G. M., & Blattner, F. R. (2008). The Complete Genome Sequence of Escherichia coli DH10B: Insights into the Biology of a Laboratory Workhorse. Journal of Bacteriology, 190(7), 2597–2606.
  6. Farris, J. S. (1972). Estimating phylogenetic trees from distance matrices. The American Naturalist, 106, 645–667.
  7. Gambushe, S. M., Zishiri, O. T., & El Zowalaty, M. E. (2022). Review of Escherichia coli O157:H7 Prevalence, Pathogenicity, Heavy Metal and Antimicrobial Resistance, African Perspective. Infection and Drug Resistance, 15, 4645–4673.
  8. Genomics, F. L., & Fletcher, L. (2023, April 7). Study reveals the diversity of non-pathogenic E. coli strains. Front Line Genomics. https://frontlinegenomics.com/study-reveals-the-diversity-of-non-pathogenic-e-coli-strains/.
  9. González, A., Fullaondo, A., & Odriozola, A. (2025). Why Are Long-Read Sequencing Methods Revolutionizing Microbiome Analysis? Microorganisms, 13(8), 1861.
  10. Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072–1075.
  11. Guttenplan, S. B., & Kearns, D. B. (2013). Regulation of flagellar motility during biofilm formation. FEMS Microbiology Reviews, 37(6), 849–871.
  12. Johnson, J. R., & Russo, T. A. (2005). Molecular epidemiology of extraintestinal pathogenic (uropathogenic) Escherichia coli. International Journal of Medical Microbiology, 295(6-7), 383–404.
  13. Köhler, C.-D., & Dobrindt, U. (2011). What defines extraintestinal pathogenic Escherichia coli? International Journal of Medical Microbiology, 301(8), 642–647.
  14. Kreft, Ł., Botzki, A., Coppens, F., Vandepoele, K., & Van Bel, M. (2017). PhyD3: a phylogenetic tree viewer with extended phyloXML support for functional genomics data visualization. Bioinformatics, 33(18), 2946–2947.
  15. Lefort, V., Desper, R., & Gascuel, O. (2015). FastME 2.0: A comprehensive, accurate, and fast distance-based phylogeny inference program. Molecular Biology and Evolution, 32, 2798–2800. DOI: 10.1093/molbev/msv150.
  16. Lim, J. Y., Yoon, J. W., & Hovde, C. J. (2010). A Brief Overview of Escherichia coli O157:H7 and Its Plasmid O157. Journal of Microbiology and Biotechnology, 20(1), 5. https://pmc.ncbi.nlm.nih.gov/articles/PMC3645889/.
  17. Meier-Kolthoff, J. P., Sardà Carbasse, J., Peinado-Olarte, R. L., & Göker, M. (2022). TYGS and LPSN: a database tandem for fast and reliable genome-based classification and nomenclature of prokaryotes. Nucleic Acids Research, 50, D801–D807. DOI: 10.1093/nar/gkab902.
  18. Meier-Kolthoff, J. P., & Göker, M. (2019). TYGS is an automated high-throughput platform for state-of-the-art genome-based taxonomy. Nature Communications, 10(1).
  19. Meier-Kolthoff, J. P., Hahnke, R. L., Petersen, J., Scheuner, C., Michael, V., Fiebig, A., Rohde, C., Rohde, M., Fartmann, B., Goodwin, L. A., Chertkov, O., Reddy, T., Pati, A., Ivanova, N. N., Markowitz, V., Kyrpides, N. C., Woyke, T., Göker, M., & Klenk, H.-P. (2014). Complete genome sequence of DSM 30083T, the type strain (U5/41T) of Escherichia coli, and a proposal for delineating subspecies in microbial taxonomy. Standards in Genomic Sciences, 9(1), 2.
  20. Meier-Kolthoff, J. P., Auch, A. F., Klenk, H.-P., & Göker, M. (2013). Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics, 14, 60. DOI: 10.1186/1471-2105-14-60.
  21. Miethke, M., & Marahiel, M. A. (2007). Siderophore-Based Iron Acquisition and Pathogen Control. Microbiology and Molecular Biology Reviews, 71(3), 413–451.
  22. Munita, J. M., & Arias, C. A. (2016). Mechanisms of antibiotic resistance. Virulence Mechanisms of Bacterial Pathogens, Fifth Edition, 4(2), 481–511.
  23. Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068–2069.
  24. Seib, K. L., Zhao, X., & Rappuoli, R. (2012). Developing vaccines in the era of genomics: a decade of reverse vaccinology. Clinical Microbiology and Infection, 18(5), 109–116.
  25. Smith, J. L., Fratamico, P. M., & Gunther, N. W. (2007). Extraintestinal Pathogenic Escherichia coli. Foodborne Pathogens and Disease, 4(2), 134–163.
  26. Sora, V. M., Meroni, G., Martino, P. A., Soggiu, A., Bonizzi, L., & Zecconi, A. (2021). Extraintestinal Pathogenic Escherichia coli: Virulence Factors and Antibiotic Resistance. Pathogens, 10(11), 1355.
  27. Srinivasan, V. B., Rajasekar, N., Krishnan, K., Kumar, M., Giri, C., Singh, B., & Rajamohan, G. (2025). Genomic and Functional Characterization of Multidrug-Resistant E. coli: Insights into Resistome, Virulome, and Signaling Systems. Antibiotics, 14(7), 667.
  28. Type 1 Fimbriae (Pili). (n.d.). Encyclopedia.pub. https://encyclopedia.pub/entry/17950.
  29. Vincent, A. T. (2024). Bacterial hypothetical proteins may be of functional interest. Frontiers in Bacteriology, 3.
  30. Yu, H., Li, L., Huffman, A., Beverley, J., Hur, J., Merrell, E., Huang, H., Wang, Y., Liu, Y., Ong, E., Cheng, L., Zeng, T., Zhang, J., Li, P., Liu, Z., Wang, Z., Zhang, X., Ye, X., Handelman, S. K., & Sexton, J. (2022). A new framework for host-pathogen interaction research. Frontiers in Immunology, 13.
Figure 1. FastQC per-base quality plots: (A) pre-trimming, (B) post-trimming.
Figure 2. Minimum evolution tree showing placement of the assembled genome relative to type strains of Escherichia and Shigella.
Figure 3. Phylogenetic tree based on 16S rRNA sequences confirming assignment within the E. coli clade.
Table 1. Genome assembly statistics (SPAdes + QUAST).
Metric | Value
Total contigs | 2,450
Total length (bp) | 4,420,077
Largest contig (bp) | 39,568
N50 (bp) | 2,794
L50 | 457
GC content (%) | 50.7
Sequence duplication | Low
Overrepresented sequences | None
Table 2. Genome annotation summary (Prokka).
Feature | Count
Protein-coding sequences (CDS) | 6,151
tRNA genes | 38
tmRNA | 1
rRNA operon (5S, 16S, 23S) | 1
CRISPR arrays | 2
Hypothetical proteins | 1,476
Table 3. KmerFinder species identification.
Template | Query Coverage (%) | Template Coverage (%) | Depth | Score
E. coli str. K12 substr. DH10B | 97.55 | 95.66 | 0.91 | 141,302
Table 4. dDDH values between user genome and selected type strains.
Query | Subject | d0 (%) | d4 (%) | d6 (%) | ΔG+C (%)
contigs.fasta | Shigella sonnei ATCC 29930 | 76.1 | 85.2 | 80.5 | 0.28
contigs.fasta | Shigella boydii ATCC 8700 | 72.3 | 84.2 | 76.9 | 0.39
contigs.fasta | Shigella flexneri ATCC 29903 | 71.3 | 83.1 | 75.8 | 0.06
contigs.fasta | Escherichia coli DSM 30083 | 70.4 | 74.3 | 73.5 | 0.09
contigs.fasta | Escherichia ruysiae OPT1704T | 63.2 | 49.4 | 61.5 | 0.14
Table 5. Closest type strains from TYGS.
Species | Strain | Base pairs | GC (%) | Proteins | Assembly Accession
Escherichia coli | DSM 30083 | 5,037,933 | 50.6 | 4,762 | GCA_000690815
Escherichia ruysiae | OPT1704T | 4,767,674 | 50.6 | 4,375 | GCF_902498915
Shigella flexneri | ATCC 29903 | 4,938,295 | 50.7 | 5,112 | GCA_002950215
Shigella sonnei | ATCC 29930 | 4,994,001 | 51.0 | 5,043 | GCA_002950395
Table 6. Virulence gene functional categories.
Functional class | Role
Motility (VFC0204) | Flagellar assembly and host navigation
Adhesion (VFC0001) | Attachment to host cells
Invasion (VFC0272) | Tissue penetration
Immune modulation (VFC0258) | Evasion of host defenses
Iron acquisition (VFC0083, VFC0086) | Survival in host
Exotoxins (VFC0235) | Host cell damage
Biofilm formation (VFC0301) | Persistence and colonization
Table 7. The most abundant virulence genes.
Gene | Function
fliP, fliI, flhA, flgD, flgI, flgG | Flagellar biosynthesis
fimD | Type 1 fimbrial adhesion
yagX/ecpC, ykgK/ecpR, yagV/ecpE | E. coli common pilus
ibeB, ibeC | Brain endothelial invasion
entA, entB, entE, entF, fepA | Enterobactin iron uptake
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.