Preprint. Article. Version 2. Preserved in Portico. This version is not peer-reviewed.

Optimization and Performance Analysis of CAT Method for DNA Sequence Similarity Searching and Alignment

Version 1 : Received: 2 February 2024 / Approved: 2 February 2024 / Online: 4 February 2024 (17:15:55 CET)
Version 2 : Received: 18 February 2024 / Approved: 19 February 2024 / Online: 19 February 2024 (16:12:31 CET)

A peer-reviewed article of this Preprint also exists.

Gancheva, V.; Stoev, H. Optimization and Performance Analysis of CAT Method for DNA Sequence Similarity Searching and Alignment. Genes 2024, 15, 341.

Abstract

Bioinformatics is a rapidly developing field that enables scientific experiments through computer models and simulations. Given the vast databases of biological data available, it is extremely important to develop efficient methods and algorithms for processing them. Sequence comparison is the best method for studying the evolutionary interaction between genes. It is based on alignment, the process of arranging two or more sequences to achieve the maximum level of identity and degree of similarity. The paper presents a new version of an algorithm for pairwise DNA sequence alignment based on a new method called CAT, in which a dependency on the previous match and the closest neighbor is taken into consideration to increase the uniqueness of the CAT profile and to reduce possible collisions, i.e., two or more sequences having the same CAT profile. This makes the proposed algorithm suitable for faster exact-match searching of a specific DNA sequence in a large set of DNA data. CAT profiles are generated once, before the data are uploaded to the database, so that the profiles can be used as metadata for the sequences. The method consists of an algorithm that calculates a CAT profile against the selected reference sequences and an algorithm that compares two sequences based on the calculated CAT profiles. The improvements in the generation of CAT profiles are described in detail in the paper. The block scheme, pseudocode tables, and figures were updated according to the proposed new version and the experimental results. Experiments aligning DNA sequences with the new version of the CAT method were carried out on different datasets. New experimental results in terms of collisions, speed, and efficiency of the proposed solutions are presented. The experiments comparing performance with the Needleman-Wunsch algorithm were re-executed with the new version of the algorithm to confirm that its performance is unchanged.
The performance of the proposed CAT-based algorithm was also analyzed against the Knuth–Morris–Pratt algorithm, which has a complexity of O(n) and is one of the most widely used algorithms for searching biological data. The impact of prior matching dependencies on the uniqueness of the generated CAT profiles is investigated. Analysis of the experimental results obtained by sequence alignment shows a small deviation for the proposed CAT-based algorithm, which can be ignored when such a deviation is an acceptable price for the gain in performance. The time efficiency of the CAT algorithm remains constant regardless of the length of the sequences. Therefore, the advantage of the proposed method is its fast processing when aligning large sequences, for which the exact algorithms take a long time to execute. An example implementation of the CAT method, released under the GNU General Public License v3.0, is available on GitHub at: https://github.com/HristoS/CATSequenceAnalysis.
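
For readers unfamiliar with the baseline named above, the following is a minimal Python sketch of the standard Knuth–Morris–Pratt search. It is not taken from the paper or from the linked repository, and it does not implement the CAT method. The function names `build_failure_table` and `kmp_search` and the example sequence are illustrative assumptions. The sketch only shows why the search runs in linear time: the text is scanned once, and the pattern index is moved back using a precomputed table instead of rescanning the text.

```python
def build_failure_table(pattern: str) -> list[int]:
    """For each position i, store the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter matching prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every (possibly overlapping) occurrence
    of pattern in text. Hypothetical helper, not the authors' code."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # The text index i never moves backwards; only k is reduced.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # continue, allowing overlapping matches
    return matches


if __name__ == "__main__":
    # Made-up example sequence, for illustration only.
    print(kmp_search("ACGTACGTGACG", "ACG"))
```

Each call to `kmp_search` still reads the whole text, so its cost grows with sequence length. The abstract contrasts this with CAT profiles, which are computed once and stored as metadata.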

Keywords

bioinformatics; biological data sequences; DNA sequences; metadata; performance analysis; similarity searching; sequence alignment

Subject

Computer Science and Mathematics, Computer Science
