Preprint Article Version 1 This version is not peer-reviewed

The Utility of Data Transformation for Alignment, de novo Assembly and Classification of Short Read Virus Sequences

Version 1: Received: 30 March 2019 / Approved: 1 April 2019 / Online: 1 April 2019 (13:29:58 CEST)

A peer-reviewed article of this Preprint also exists.

Tapinos, A.; Constantinides, B.; Phan, M.V.T.; Kouchaki, S.; Cotten, M.; Robertson, D.L. The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences. Viruses 2019, 11, 394.

Journal reference: Viruses 2019, 11, 394
DOI: 10.3390/v11050394

Abstract

Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work we explore the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Despite using highly compressed sequence transformations to accelerate the processes, our sequence processing approach yielded accuracy comparable to that of existing approaches and is ideally suited to sequences originating from highly diverse virus populations. We demonstrate the application of our methodology to both synthetic and real viral pathogen sequence data. Our results show that highly compressed sequence approximations can provide accurate results, and that useful analytical performance can be retained and even enhanced through appropriate dimensionality reduction of sequence data.
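To make the idea of a compressed numerical sequence representation concrete, the sketch below applies piecewise aggregate approximation (PAA), one of the transformations named in the subject areas, to a short read. It is an illustration only, not the authors' implementation: the nucleotide-to-number mapping `BASE_TO_VALUE` and the helper names `encode` and `paa` are assumptions made for this example.

```python
import numpy as np

# Hypothetical nucleotide-to-number mapping, chosen only for illustration;
# the abstract does not specify which numerical representation is used.
BASE_TO_VALUE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def encode(sequence):
    """Map a DNA string to a numeric series using BASE_TO_VALUE."""
    return np.array([BASE_TO_VALUE[base] for base in sequence.upper()], dtype=float)


def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each near-equal segment."""
    if not 1 <= n_segments <= len(series):
        raise ValueError("n_segments must be between 1 and len(series)")
    # np.array_split tolerates lengths that are not a multiple of n_segments.
    return np.array([segment.mean() for segment in np.array_split(series, n_segments)])


read = "ACGTACGTTTGGCCAA"
signal = encode(read)            # 16 numeric values, one per base
approximation = paa(signal, 4)   # 4 values: a fourfold reduction in length
print(approximation)
```

Here a 16-base read is reduced to four segment means, so any downstream comparison operates on a quarter of the original values. The trade-off between compression level and retained accuracy is the question the abstract addresses.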

Subject Areas

Alignment; assembly; taxonomic classification; time series; data transformation; DWT; DFT; PAA; data compression; compressive genomics
