1. Introduction
The classification of DNA sequences is a critical task in bioinformatics, as it plays a fundamental role in various biological studies, including gene identification, disease prediction, and evolutionary analysis [1]. With the exponential growth of genomic data, traditional machine learning algorithms often face challenges in handling the high-dimensional and complex nature of this data [2]. In recent years, deep learning models have become powerful tools for sequence classification, providing significant improvements over classical methods [3]. Among these deep learning architectures, Convolutional Neural Networks (CNNs), Bidirectional Long Short-Term Memory (BiLSTM) networks, and Gated Recurrent Units (GRUs) have demonstrated strong performance in sequence data analysis, particularly for DNA sequence classification [4,6].
CNNs excel at capturing local patterns in sequential data, making them well suited to DNA sequence analysis, where local motifs play a crucial role in sequence function [7]. BiLSTM networks, which take both past and future information in the sequence into account, have been shown to be highly effective at capturing long-range dependencies, a critical feature in DNA sequence classification [8]. Similarly, GRU networks, a variant of LSTMs, offer comparable advantages in sequence classification with fewer parameters and hence faster training than LSTMs [9]. While these models have each shown success on their own, ensemble methods offer further room for improving performance.
Ensemble learning techniques combine multiple models, leveraging their complementary strengths to obtain better predictive performance [10]. These methods are widely recognized for outperforming single models by reducing both variance and bias, making them a valuable approach to complex tasks such as DNA sequence classification [11]. In an ensemble, CNNs, BiLSTMs, and GRUs can be combined to achieve more robust results by capturing different aspects of the sequence data, such as local patterns, long-range dependencies, and temporal structure.
This paper evaluates the performance of CNN, BiLSTM, and GRU models for classifying DNA sequences and compares them with an ensemble approach comprising all three. We aim to show that ensemble learning can improve the accuracy and robustness of the classification task, offering a more reliable method for the analysis of large-scale genomic data.
2. Background and Related Work
A different class of sequence classification methods was presented in [12], which combines non-parametric compression techniques with the k-nearest-neighbour classifier. The technique uses the compression rates of various algorithms, including Gzip, Snappy, Brotli, LZ4, Zstandard, BZ2, and LZMA, comparing sequences by their compressed forms and thereby providing a resource-efficient classification approach. The Brotli compression algorithm, while very slow in processing, produced the best accuracy in classifying DNA sequences from humans, chimpanzees, and dogs, demonstrating the robustness of embedding compression algorithms into classification pipelines for bioinformatics applications in terms of both efficiency and precision. In short, Ozan's approach of incorporating compression algorithms into k-nearest-neighbour classification eases the demand on computational resources while retaining competitive accuracy, opening a path toward more efficient and effective DNA sequence analysis.
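To make the compression-based approach concrete, the following minimal Python sketch pairs the normalized compression distance (NCD) with a k-nearest-neighbour vote. It uses the standard library's gzip to stay dependency-free, although [12] found Brotli the most accurate of the compressors tested; the toy sequences, labels, and function names are purely illustrative, not the paper's implementation.

```python
import gzip

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: near 0 for very similar inputs,
    # near 1 for unrelated ones
    cx, cy = len(gzip.compress(x)), len(gzip.compress(y))
    cxy = len(gzip.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(query: str, train: list, k: int = 3) -> str:
    # Label the query by majority vote among its k nearest training
    # sequences under NCD
    q = query.encode()
    dists = sorted((ncd(q, seq.encode()), label) for seq, label in train)
    top = [label for _, label in dists[:k]]
    return max(set(top), key=top.count)

train = [("ATGCGTACGTTAGC", "human"), ("ATGCGAACGTTAGA", "chimpanzee"),
         ("TTGACGGATCCAAT", "dog")]
print(knn_classify("ATGCGTACGTTAGA", train, k=1))
```

Swapping gzip.compress for another compressor's compress call changes only the ncd helper, which is what makes the comparison across Gzip, Brotli, and the others straightforward.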
The work in [13] introduced a machine-learning-based predictor for DNA i-motifs, the secondary structures that form within cytosine-rich DNA sequences. Putative-iM-Searcher is an algorithm that predicts i-motif conformations using different strategies, including overlapping versus non-overlapping and greedy versus non-greedy searches. The tool iM-Seeker incorporates Putative-iM-Searcher to locate regions of the human genome containing putative i-motif-forming sequences. A Balanced Random Forest model was then used to predict folding status, while an XGBoost regression model targeted the folding strength of the same motifs. The Balanced Random Forest model yielded an accuracy of 81%, recall of 77%, specificity of 81%, and an AUROC of 0.87 for i-motif formation. Notably, the model performed well across all folded i-motifs, including the unfolded C-rich sequences, and showed strong generalization ability, with five-fold cross-validation scores consistently above 0.8. Among the regression models tested for predicting i-motif folding strength, the best results were obtained with XGBoost, which achieved R² = 0.458 with small errors, making it effective for this type of prediction.
In [9], the authors aimed to improve the efficiency of DNA sequence classification using different machine learning methods. The study proposes a full model in which pattern-matching algorithms are combined with machine-learning classifiers such as Random Forest, K-Nearest Neighbors, Naïve Bayes, Decision Tree, and Support Vector Machine.
The pipeline includes the following major steps: DNA data pre-processing, feature extraction, classification-model development, and finally pattern matching to determine sequence similarity. The performance metrics used to compare the algorithms include F1 score, recall, precision, execution time, and accuracy. Among all tested classifiers, the linear SVM model recorded the highest accuracy and F1 score, showing that it performed best at classifying DNA sequences. The results have practical utility for applications in drug discovery, personalized medicine, and disease diagnosis. The paper also describes each algorithm's pros and cons, offering guidance on which technique to use for a given task.
Hence, further research in this direction is likely to refine the techniques proposed so far and may well increase the accuracy and speed of DNA sequence analysis. The review in [14] covers machine learning methods for bioinformatics as they relate to the analysis of DNA sequencing data across several major fields of genomics. The work shows how machine learning algorithms are applied to organize the large-scale data produced by contemporary sequencing tools, focusing on genome assembly, gene annotation, variant calling, and GWAS.
These methods help place short DNA reads at their correct positions when assembling a genome, find genes and regions of functional significance, distinguish genuine genetic variants from sequencing errors, and identify genetic markers linked to specific traits or diseases. Given the complexity of genomics, the introduction of machine learning has brought welcome improvements to genomic research. These algorithms model the features of the data sets, predict underlying links between genetics and disease, and enhance drug targeting and personalized medicine. The review thus establishes how machine learning has changed the face of genomics, showing its ability to sift through heaps of available genomic data and extract useful knowledge that would normally remain unnoticed with everyday methods.
An in-depth review, current to 2023, of machine learning applied to mining DNA sequences by [15] centered on classification, clustering, and pattern-recognition methods. The work examines how these approaches extract meaningful information from large data sets and find significant patterns in DNA sequences. Importantly, the paper highlights the pros and cons of various machine learning algorithms with respect to their capability to filter a vast volume of data for patterns that might not be noticed using other methods. It brings to the fore the predictability of biological systems, applying machine learning to predict function from DNA sequences and thereby gaining insight into evolutionary relationships.
The paper also looks at the development of sequencing technology, the challenges involved in analyzing DNA sequence data, and some ways in which machine learning can offer solutions. These sophisticated algorithms give researchers a more informed understanding of the genetic information underlying biological systems; without advances in machine learning techniques, analyzing DNA sequence data at this scale would not have been possible. The review is also insightful for readers, showing how machine learning tools can improve interpretation.
Suresh et al. [16] presented the use of advanced machine learning and deep learning techniques to improve DNA sequencing. The paper incorporates decision trees, random forests, and several deep learning models to improve DNA data analysis. These approaches are essential for forecasting potential future medical conditions and genetic predispositions, and are one of the ways 4.0 technologies have found application in medicine for more accurate diagnoses and better treatment options. The paper shows how these sophisticated methods help categorize DNA sequences and forecast patients' health status. Such algorithms let researchers demonstrate the strengths of machine learning and deep learning models while yielding key insights into genetic data, improving predictive accuracy and supporting best practices in patient care. The work thus presents the transformative potential of these technologies in medical diagnosis, underlining their role in advancing DNA sequencing methods.
Juneja [17] discussed machine-learning-based DNA sequence classification, focusing on the Multinomial Naive Bayes algorithm. The algorithm was applied to three DNA datasets, namely chimpanzee, dog, and human, for gene-family classification and for distinguishing original from mutant sequences. The paper examines the substring-length parameter k with respect to classification performance: metrics such as accuracy, precision, recall, and F1 score increase with k up to an optimum and then decline as the computational overhead of handling the larger number of substrings grows. The research concluded that Multinomial Naive Bayes can reach a striking accuracy of 98.4%, outperforming several classifiers in accuracy and in flexibility across variable gene-sequence lengths. This demonstrates the algorithm's efficiency in classifying DNA sequences and contributes much to bioinformatics in handling huge volumes of diverse and complex genetic data.
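As an illustration of this family of methods, the sketch below featurizes sequences as overlapping k-mers and fits scikit-learn's MultinomialNB. The toy data, the choice of k = 4, and the pipeline setup are assumptions for demonstration, not the paper's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def to_kmers(seq: str, k: int = 4) -> str:
    # Represent a sequence as space-separated overlapping k-mers so a
    # standard bag-of-words vectorizer can count them
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

seqs = ["ATGCGTACGTTAGCA", "ATGCGAACGTTAACA",
        "TTGACGGATCCAATG", "TTGACGGTTCCAATG"]
labels = ["human", "human", "dog", "dog"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit([to_kmers(s) for s in seqs], labels)
print(model.predict([to_kmers("ATGCGTACGTTAACA")]))
```

Sweeping k in to_kmers reproduces the trade-off the paper describes: larger k gives more specific features until the substring vocabulary becomes too large and sparse to handle.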
In the work of [18], a derivative of Boyer–Moore, d-BM, was developed to increase computational efficiency on compressed DNA sequences; it is particularly effective at narrowing down long DNA sequences. The same review outlines how Bayesian networks can be used in the demanding task of accurate DNA identification in forensic analysis, where such models are helpful for samples with minimal or smeared DNA. Context-sensitive methods were proposed by [16] to identify RNA structures within secondary structures. The system interfaces with the RNA Spec language through a user interface developed in Java, and this approach to searching for context-sensitive pseudoknots yields higher precision in the identification of actual RNA structures.
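Returning to the Boyer–Moore family mentioned above: the details of d-BM on compressed sequences are not given here, but the skip-table idea that makes Boyer–Moore fast is easy to illustrate with the classic Horspool simplification, sketched below for a plain (uncompressed) DNA string.

```python
def horspool_search(text: str, pattern: str) -> int:
    # Boyer-Moore-Horspool: return the first index of pattern in text,
    # or -1. DNA's four-letter alphabet keeps the skip table tiny.
    m = len(pattern)
    skip = {c: m for c in set(text)}          # default shift: full length
    for i, c in enumerate(pattern[:-1]):
        skip[c] = m - 1 - i                   # shift based on last occurrence
    i = 0
    while i <= len(text) - m:
        if text[i:i + m] == pattern:
            return i
        i += skip.get(text[i + m - 1], m)     # jump by the skip rule
    return -1

print(horspool_search("ACGTTGACGGATCC", "GACG"))  # -> 5
```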
The work in [19] presents a machine-learning approach for identifying enhancer regions of DNA, built on a Random Forest classifier that handles the high complexity of the dataset effectively and gives accurate predictions. The study investigates statistical moments as features for a more discriminative model, classifying enhancer versus non-enhancer regions with good performance. Using 5-fold cross-validation on a benchmark dataset, the Random Forest classifier identified enhancers with 91.68% accuracy and strong enhancers with 84.53% accuracy. The method not only outperforms existing techniques in accuracy but also underlines the importance of correct enhancer prediction for understanding the mechanisms regulating gene expression. These results indicate that the model contributes a robust computational tool for enhancer identification and for assessing enhancers' functional role.
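A heavily simplified sketch of this style of pipeline follows: sequences are reduced to a few statistical moments of a numeric base encoding and fed to scikit-learn's RandomForestClassifier. The base-to-integer mapping, the particular moments chosen, and the toy data are illustrative assumptions, not the paper's feature scheme.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestClassifier

def moment_features(seq: str) -> list:
    # Map bases to integers and summarize the sequence with its first
    # statistical moments (a simplified stand-in for the paper's features)
    codes = {"A": 1, "C": 2, "G": 3, "T": 4}
    x = np.array([codes.get(b, 0) for b in seq], dtype=float)
    return [x.mean(), x.var(), skew(x), kurtosis(x)]

seqs = ["ATGCGTACGT", "ATGCGAACGT", "GGGGCCCCGG", "GGGCCCCGGG"]
labels = ["enhancer", "enhancer", "non-enhancer", "non-enhancer"]
clf = RandomForestClassifier(n_estimators=100).fit(
    [moment_features(s) for s in seqs], labels)
print(clf.predict([moment_features("ATGCGTACGA")]))
```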
The authors of [20] reported DNA sequence classification using Multinomial Naive Bayes. Their method includes DNA sequencing, k-mer counting, thresholding, and a sequence analysis that labels gene sequences as normal or abnormal. The paper tests the performance of different classifiers for various settings of the substring-length parameter k and observes that, as k increases, accuracy, precision, recall, and F1 score first rise but then fall beyond an optimal threshold, because managing the growing number of substrings becomes too expensive. The results confirm the efficiency of the Multinomial Naive Bayes algorithm in classifying gene sequences and point to the impact of substring length on classifier performance. The work provides valuable insight into optimizing machine learning techniques for DNA sequence analysis by demonstrating the benefits and limitations of varying k in practical applications.
The authors of [21] provide an all-in-one tool for fast, accurate disease prediction from DNA sequence classification. To improve prediction accuracy, four methods are employed: Support Vector Machine, Convolutional Neural Network, Recurrent Neural Network, and Multi-Layer Perceptron. SVM is known for accuracy and efficiency, CNN produces strong results on textual DNA data, RNN is important for sequence modeling, and MLP serves general classification. Results show that the proposed method achieves the highest accuracy of 93.9% in disease prediction, far better than the alternatives: CNN attained 73.5%, MLP 78.0%, RNN 69.0%, while SVM performed very poorly at only 50.0% accuracy. This points to the error researchers might commit by selecting an inappropriate machine learning algorithm for disease prediction, since different algorithms can behave very differently on DNA sequence classification. The research thus provides a useful tool for improved diagnostic calibration and accuracy in rapid disease prediction using sophisticated machine learning techniques.
In the experiment of [22], the dataset consisted of DNA sequences, and the classification models were evaluated on metrics such as F1 score, recall, precision, execution time, and accuracy. Using machine learning methods combined with pattern-matching algorithms, the researchers efficiently searched and compared DNA sequences to detect query patterns. The study found that the proposed model outperformed other algorithms, such as FLPM and PAPM, in accuracy and time complexity, although further comparisons with additional algorithms are required to confirm the model's efficiency. The study design also considered how pattern length affects algorithm performance and the importance of investigating performance on longer or more complex patterns. Overall, the research showed the power of machine learning algorithms to transform pattern matching in many fields, most especially DNA sequence classification. By carefully weighing their strengths and limitations, researchers can uncover and analyze patterns in enormous data sets far more quickly and accurately, enabling treatments tailored to a particular DNA pattern. The model itself was fairly impressive, achieving 92.3% accuracy and an AUC of 0.94.
4. Proposed Model
4.1. Pre-Processing
Deep learning techniques have proven effective in various applications, including DNA sequence classification. In this work, we propose an ensemble model that combines Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (BiLSTM) networks, and Gated Recurrent Units (GRU) into a unified architecture. The ensemble is designed to exploit the complementary strengths of the individual networks, creating a more robust system and improving overall classification performance.
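The exact encoding scheme is not spelled out here, but the algorithm in Section 4.5 mentions normalizing and encoding the sequences, and one-hot encoding over the A/C/G/T alphabet is the standard input format for the architectures that follow. A minimal sketch, with the fixed length of 300 as an assumed parameter:

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str, seq_len: int = 300) -> np.ndarray:
    # Encode a DNA string as a (seq_len, 4) one-hot matrix, truncating
    # or zero-padding to a fixed length
    x = np.zeros((seq_len, 4), dtype=np.float32)
    for i, base in enumerate(seq[:seq_len].upper()):
        if base in BASES:          # ambiguous bases (e.g. N) stay all-zero
            x[i, BASES[base]] = 1.0
    return x
```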
4.2. CNN Model
Convolutional Neural Networks (CNNs) are highly efficient in extracting spatial patterns and features from input data. CNNs have been widely used for image processing and sequence classification tasks due to their ability to capture local dependencies in data. In the context of DNA sequence classification, CNNs learn to identify patterns such as motifs, which are essential for distinguishing different biological sequences.
The CNN architecture used in this ensemble consists of multiple layers:
Convolutional Layer: This layer applies filters to the input DNA sequence data to detect local patterns such as motifs. The convolution operation is given by:

$$o_j = f\left(\sum_{i} W_i \, x_{j+i-1} + b\right)$$
where $o_j$ is the $j$-th output of the layer, $x$ is the input sequence, $W$ is the weight (filter) matrix, $b$ is the bias, and $f$ is the activation function, typically ReLU.
Pooling Layer: Following the convolutional layer, a pooling layer sub-samples the feature maps to reduce their dimension while retaining only the most important features. This model uses max-pooling, which returns the maximum value in each sub-region of the feature map:

$$p_j = \max_{i \in R_j} o_i$$

where $R_j$ is the $j$-th pooling region.
This operation helps in reducing the computational complexity and preventing overfitting.
Fully Connected Layer:
The output of the pooling layer is flattened into a vector and passed through a fully connected layer to generate the final output. A softmax activation function is used for classification, transforming the outputs into a probability distribution over the class labels.
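A minimal Keras sketch of such a CNN follows. The sequence length (300), number of classes (7), filter count, and kernel size are illustrative assumptions; the paper does not report its exact hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, NUM_CLASSES = 300, 7  # assumed input length and class count

cnn = models.Sequential([
    layers.Input(shape=(SEQ_LEN, 4)),        # one-hot encoded A/C/G/T
    layers.Conv1D(64, kernel_size=8, activation="relu"),  # motif detectors
    layers.MaxPooling1D(pool_size=4),        # keep strongest activations
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy",
            metrics=["accuracy"])
```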
4.3. BiLSTM Model
Bidirectional Long Short-Term Memory (BiLSTM) networks are a type of recurrent neural network (RNN) designed to capture long-range dependencies in sequential data. A BiLSTM consists of two LSTM units: one processes the sequence from beginning to end, and the other processes it in reverse order. This bidirectional approach allows the model to capture both past and future context, which is crucial in DNA sequence analysis, where nucleotide dependencies can span both directions.
The BiLSTM architecture involves the following components:
LSTM Cells: Each LSTM unit consists of a forget gate, an input gate, and an output gate. The forget gate determines which information should be discarded, the input gate controls the new information to be stored, and the output gate generates the final output:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, respectively, $h_{t-1}$ is the previous hidden state, $x_t$ is the current input, and $\sigma$ is the sigmoid activation function.
Bidirectional Processing: BiLSTM scans the DNA sequence in both forward and reverse directions, capturing dependencies in both directions. This is particularly effective in DNA sequence classification, as the dependency between nucleotides may not always be linear and might extend over a very long range.
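A matching BiLSTM sketch, reusing the imports and constants from the CNN sketch above; the layer width of 64 units is again an assumption rather than the paper's reported configuration.

```python
bilstm = models.Sequential([
    layers.Input(shape=(SEQ_LEN, 4)),
    layers.Bidirectional(layers.LSTM(64)),   # forward + backward passes
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
bilstm.compile(optimizer="adam", loss="categorical_crossentropy",
               metrics=["accuracy"])
```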
4.4. GRU Model
Gated Recurrent Units (GRUs) are another variant of LSTMs, designed to be more computationally efficient while remaining equally good at modeling long-range context. GRUs simplify the LSTM architecture by merging the forget and input gates into a single update gate, resulting in fewer parameters and less computation.
The GRU architecture includes:
Update Gate: This gate decides how much of the past information should be passed along to the next time step. It is computed using the sigmoid function:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

Reset Gate: This gate controls how much of the past information should be forgotten:

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

New Memory Content: The new memory content is computed by combining the reset gate with the previous hidden state, allowing the model to decide which parts of the past to retain:

$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$

Final Output: The final output of the GRU unit is a weighted combination of the previous output and the new memory content:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\odot$ denotes element-wise multiplication.
This enables the GRU model to efficiently capture long-term dependencies in DNA sequences.
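The corresponding GRU sketch, again reusing the imports and constants from the CNN sketch; relative to the BiLSTM, only the recurrent layer changes.

```python
gru = models.Sequential([
    layers.Input(shape=(SEQ_LEN, 4)),
    layers.GRU(64),                          # single gated recurrent layer
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
gru.compile(optimizer="adam", loss="categorical_crossentropy",
            metrics=["accuracy"])
```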
4.5. Ensemble Model Strategy
The proposed CNN, BiLSTM, and GRU models are combined into an ensemble through majority voting. For every input DNA sequence, each model produces a classification independently, and the final result is decided by a majority vote across all three. This ensures the proposed model exploits the diversity of the participating models, achieving higher accuracy by mitigating their individual weaknesses.
The process of the ensemble model goes as follows:
Independent Training: The CNN, BiLSTM, and GRU models are trained independently on the training data.
Prediction Aggregation: For any given input, each model makes a prediction, and the ensemble aggregates them through majority voting.
Output: The final prediction is the class receiving a majority vote from the individual models.
Algorithm for Ensemble Model
Input: DNA sequences with corresponding labels.
Preprocessing: Preprocess DNA sequences by normalizing and encoding.
Train Models: Perform independent training for the CNN, BiLSTM, and GRU models using the training data.
Collect Predictions: For each test sample, collect the predictions from the CNN, BiLSTM, and GRU models.
Majority Voting: Perform majority voting over the three models' predictions to obtain the final classification (see the sketch below).
Output: Return the final classification result. The performance of the proposed ensemble model can then be evaluated using metrics such as accuracy, precision, recall, and F1-score.
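A minimal sketch of the voting step, assuming the three trained Keras models from the previous sections and probability outputs from their predict methods:

```python
import numpy as np

def ensemble_predict(members, x):
    # Hard majority vote over per-model class predictions.
    # Each member outputs class probabilities; take the argmax per model.
    votes = np.stack([np.argmax(m.predict(x), axis=1) for m in members])
    # For each sample, return the most frequent class among the votes
    # (ties resolve to the lowest class index)
    return np.array([np.bincount(v).argmax() for v in votes.T])

# Usage (assuming cnn, bilstm, gru trained as above):
# y_pred = ensemble_predict([cnn, bilstm, gru], x_test)
```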
4.6. Evaluation Metrics
We evaluate the models based on four key performance metrics:
Accuracy: The percentage of correct predictions made by the model.
Precision: The ratio of true positive predictions to the total number of positive predictions.
Recall: The ratio of true positive predictions to the total number of actual positive instances.
F1 Score: The harmonic mean of precision and recall, providing a balanced measure of model performance (all four metrics are computed as in the sketch below).
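These four metrics map directly onto scikit-learn calls. The paper does not state its averaging mode for the multi-class case; macro averaging is used here as one common choice.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def report(y_true, y_pred):
    # Macro averaging weights each class equally; "weighted" is another
    # common option for imbalanced DNA datasets
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
    }
```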
CNN Confusion Matrix, ROC, and AUC
Figure 4. CNN confusion matrix.
Confusion Matrix: The CNN confusion matrix shows a moderate balance between true positives and true negatives, but with noticeable false positives and false negatives. This indicates that while the model classifies most instances correctly, it struggles slightly to distinguish classes that are very close in feature space.
ROC Curve: The ROC curve for the CNN model stays above the random diagonal, reflecting good class discrimination, although it does not approach the top-left corner of perfect sensitivity across all thresholds.
AUC: The AUC score for the CNN model falls in the range 0.80 to 0.85, consistent with its overall accuracy. This indicates good discrimination ability but also highlights room for improvement in handling more complex patterns.
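For reference, a multi-class AUC of this kind can be computed with scikit-learn; the one-vs-rest, macro-averaged variant below is an assumption, as the paper does not state how its AUC values were derived.

```python
from sklearn.metrics import roc_auc_score

def multiclass_auc(y_true, y_prob):
    # y_true: integer class labels; y_prob: per-class probabilities,
    # e.g. cnn.predict(x_test). One-vs-rest macro averaging is one
    # common choice for multi-class ROC analysis.
    return roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```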
BiLSTM Confusion Matrix, ROC, and AUC
Figure 6. BiLSTM confusion matrix.
Confusion Matrix: The BiLSTM confusion matrix shows a high number of true positives, consistent with its high recall. Its lower precision, however, corresponds to more false positives: the model is good at finding positive cases but also mislabels some negatives as positive.
Figure 7. BiLSTM ROC curve.
ROC Curve: The ROC curve for the BiLSTM model rises steeply toward the top-left corner, indicating very strong class discrimination, particularly in its high true positive rate.
AUC: The AUC for the BiLSTM is around 0.90, showing that the model performs very well. Although the confusion matrix reveals a relatively high rate of false positives, the model retains high overall discriminative power.
GRU Confusion Matrix, ROC, and AUC
Figure 8. GRU confusion matrix.
Confusion Matrix: The confusion matrix for the GRU model is fairly balanced, like the CNN's, with a reasonable number of true positives and true negatives but still some false positives and false negatives. Its slightly higher precision compared with the BiLSTM suggests it is better at avoiding false positives but slightly worse at capturing all true positives.
ROC Curve: The ROC curve for the GRU model is similar to the CNN's but not as steep as the BiLSTM's, reflecting reasonable success in differentiating the classes with some trade-off between sensitivity and specificity.
AUC: The AUC score for the GRU falls in the range 0.80 to 0.85, showing solid, reliable performance, though not as strong as the BiLSTM or ensemble models. It is a good model but would benefit from improvements in either recall or precision.
Ensemble Model Confusion Matrix, ROC, and AUC
Figure 10. Ensemble confusion matrix.
Confusion Matrix: The confusion matrix for the Ensemble model shows an overwhelmingly high number of true positives and true negatives with minimal false positives and false negatives, indicating excellent performance in both precision and recall.
ROC Curve: The Ensemble model produces the best ROC curve, rising rapidly toward the top-left corner and hugging the upper boundary, which indicates very good discrimination between positive and negative cases across thresholds.
AUC: The AUC for the Ensemble model is about 0.95 or higher, approaching the ideal. This high score indicates that, among all models presented, the Ensemble best balances sensitivity and specificity, making it well suited to scenarios where both false positives and false negatives carry serious consequences.
6. Discussion
The performance comparison of the CNN, BiLSTM, GRU, and Ensemble models reveals distinct strengths and weaknesses on each metric. The CNN model shows good overall performance, with 80.6% accuracy, a balanced precision of 81.6% and recall of 80.6%, and a strong F1 score of 83.1%, depicting reliable but not outstanding classification capability.
The BiLSTM model has the highest accuracy, 90.98%, and strong recall, 82.83%, indicating that it is very good at identifying true positives. However, its lower precision of 73.09% indicates a higher rate of false positives, and hence a moderate F1 score of 77.99%. The GRU model performs close to the CNN, yielding an accuracy of 81.2% with a slightly lower precision of 74.2% and a recall of 80%, reflecting good balance but somewhat reduced overall performance, as shown by its F1 score of 76%.
The Ensemble model outperforms the individual models on most metrics, yielding an accuracy of 90.6%, precision of 91%, recall of 91%, and an F1 score of 91%. This suggests that the strengths of the individual models are effectively combined within the Ensemble, producing a more rounded classification result without a significant trade-off between precision and recall.