2.1. Sequence-Based Features
In protein primary sequences, the 20 standard amino acids (AAs) exhibit different biochemical properties such as hydrophobicity, hydrophilicity, and side-chain characteristics. Sequence-based methods aim to predict protein subcellular locations from the correlations between those locations and the information embedded in amino acid sequences. Three major types of features are used for model construction: AA composition information, sorting signal information, and evolutionary information.
Composition-based features, which capture the occurrence and order of AAs in the query sequence, were commonly used in the earliest subcellular prediction methods. Previous studies have shown that combining the original AA sequence with gapped amino acid composition (GapAA) [18] and amino acid pair composition (PairAA) [19] improves model performance. Building on AA composition features, Chou [20] proposed pseudo amino acid composition (PseAA), which uses sequence-order correlation factors to capture additional biochemical properties while avoiding high-dimensional vector representations. The simplicity of composition features aids the generalization and interpretability of computational models, since these features capture the most basic trends in protein sequences associated with their locations. However, they may not provide sufficient resolution for high accuracy, as they discard information about sequence or structural motifs closely tied to a protein's subcellular location.
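To make these descriptors concrete, the sketch below computes AA composition, PairAA, and GapAA for a query sequence in plain Python. The example sequence and the normalization by the number of counted pairs are illustrative assumptions; published methods differ in exact normalization and in how non-standard residues are handled (here the sequence is assumed to contain only the 20 standard AAs).

```python
# Sketch of composition-based features: AA composition, PairAA, and GapAA.
# Assumes the sequence contains only the 20 standard amino acids.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each standard amino acid (20-dim vector)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def pair_composition(seq, gap=0):
    """Frequencies of residue pairs separated by `gap` positions.
    gap=0 yields PairAA (adjacent pairs); gap>0 yields GapAA (400-dim each)."""
    counts = dict.fromkeys(("".join(p) for p in product(AMINO_ACIDS, repeat=2)), 0)
    total = len(seq) - gap - 1
    for i in range(total):
        counts[seq[i] + seq[i + gap + 1]] += 1
    return [counts[p] / total for p in sorted(counts)]

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy query sequence
features = aa_composition(seq) + pair_composition(seq) + pair_composition(seq, gap=1)
print(len(features))  # 20 + 400 + 400 = 820
```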
Sorting signal sequences, or signal peptides, including transit peptides such as mitochondrial transit peptides (mTPs) and chloroplast transit peptides (cTPs) [21], are short, cleavable segments of amino acids appended to newly synthesized proteins that determine their destination during transport. These short peptides encode where the mature protein should be delivered, reflecting its likely localization [22]. Signal-peptide-based approaches to protein localization mainly focus on identifying cleavage sites [23]. As described in previous studies, sorting signal sequences vary in length and composition but share a similar structure: the N-terminal flanking region (n-region), the central hydrophobic region (h-region), and the C-terminal flanking region (c-region) [24]. Computational methods use the hydrophobicity of the h-region and the large proportion of nonpolar residues in the c-region to label cleavage sites [25,26]. By exploiting the location signals embedded in these short peptides, one can mimic the cell's own sorting machinery and identify the target compartment of a query protein.
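As a simple illustration of how hydrophobicity can flag a candidate h-region, the sketch below runs a Kyte-Doolittle sliding-window scan over the N-terminus. The window size, score threshold, and N-terminal cutoff are assumptions chosen for the sketch; practical cleavage-site predictors rely on trained statistical or neural models rather than a fixed cutoff.

```python
# Illustrative sliding-window hydrophobicity scan over the N-terminus,
# in the spirit of h-region detection in signal peptides. The window,
# threshold, and cutoff below are assumptions, not a published method.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_window(seq, window=7, threshold=2.0, n_terminal=40):
    """Return (start, mean_score) of the most hydrophobic window within the
    first `n_terminal` residues, or None if no window exceeds `threshold`."""
    region = seq[:n_terminal]
    best = None
    for i in range(len(region) - window + 1):
        score = sum(KYTE_DOOLITTLE[a] for a in region[i:i + window]) / window
        if best is None or score > best[1]:
            best = (i, score)
    return best if best and best[1] >= threshold else None

# A typical layout: charged n-region, hydrophobic core, then polar residues.
print(hydrophobic_window("MKKTAIAIAVALAGFATVAQAAPKDNTWY"))
```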
In addition, because homologous sequences are likely to share the same subcellular location, an unknown protein can be assigned the same location as its homologs retrieved by PSI-BLAST [27]. Moreover, evolutionary similarity profiles extracted from the position-specific scoring matrix (PSSM) and the position-specific frequency matrix (PSFM), both derived from multiple sequence alignments, can serve as classification features that capture valuable information such as conserved motifs and targeting signals across protein families. This representation can also be extended by integrating pseudo analysis [28]. When close homologs exist in the database, this approach can achieve high accuracy. However, because even a single amino acid change can alter a protein's characteristics, evolutionary information is better used as one of several feature sources for prediction models.
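A common way to turn a variable-length L x 20 PSSM into classifier-ready input is to pool its rows into a fixed-length descriptor. The sketch below computes one such 400-dimensional "PSSM composition"; the random matrix stands in for real PSI-BLAST scores, and this particular pooling scheme is only one of several published variants.

```python
# Sketch: collapsing a variable-length L x 20 PSSM (e.g., parsed from
# PSI-BLAST's ASCII PSSM output) into a fixed-length feature vector.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_composition(pssm, seq):
    """Average PSSM rows grouped by the residue type at each position,
    yielding a 20 x 20 = 400-dim descriptor independent of sequence length."""
    aa_index = {a: i for i, a in enumerate(AMINO_ACIDS)}
    comp = np.zeros((20, 20))
    counts = np.zeros(20)
    for row, residue in zip(pssm, seq):
        j = aa_index[residue]
        comp[j] += row
        counts[j] += 1
    counts[counts == 0] = 1  # avoid division by zero for absent residues
    return (comp / counts[:, None]).ravel()

# Toy example: a random matrix standing in for real PSI-BLAST scores.
rng = np.random.default_rng(0)
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
pssm = rng.normal(size=(len(seq), 20))
print(pssm_composition(pssm, seq).shape)  # (400,)
```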
2.2. Sequence-Based AI Approaches
Most computational frameworks include three major steps: feature extraction, feature selection, and final classification. Built on the features discussed above, model complexity also grows with the amount of data processed and the dimensionality of the input features, ranging from traditional machine learning classifiers to complex deep learning models. Besides the development of computational frameworks, we also introduce techniques used to improve algorithms for multi-location proteins in the following.
For conventional classification, Support Vector Machines (SVMs) [29], K-Nearest Neighbors (KNN) [30], and Random Forests (RFs) [31,32] are widely used classifiers. Their simplicity makes them easy to deploy in prediction protocols with fast speed and low computational cost, well suited to limited data and low-dimensional inputs. Combined with efficient feature extraction methods, these frameworks work well in most cases [33]. For instance, Du et al. [34] proposed two novel feature extraction methods that exploit evolutionary information via the consensus-sequence transition matrix (CTM) and the PSSM before applying an SVM, ultimately reaching an overall accuracy of 99.7% on the CL317 dataset. A feature-extraction-based hierarchical extreme learning machine (H-ELM) introduced by Zhang et al. [35] can handle high-dimensional feature inputs directly, without requiring dimension reduction, while still achieving acceptable results. Alaa et al. [36] exploit an extended Markov chain to produce latent feature vectors that record micro-similarities between the given sequence and its counterparts in reference models. These methods extract richer features from query sequences and thereby deliver better performance.
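A minimal end-to-end version of this conventional pipeline (composition features followed by an SVM) is sketched below using scikit-learn. The library choice, toy sequences, and labels are assumptions for illustration, not the setups of the cited works, which use their own implementations and curated datasets such as CL317.

```python
# Minimal sketch of the conventional pipeline: composition features + SVM.
# scikit-learn is an assumption; the cited methods use their own tooling.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """20-dim amino acid composition vector."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

# Toy (sequence, location) pairs; a real study would use a curated dataset.
sequences = ["MKKTAIAIAVALAGFATVAQA", "MLRTSSLFTRRVQPSLFRNIL",
             "MKTAYIAKQRQISFVKSHFSR", "MDDDIAALVVDNGSGMCKAGF"] * 10
labels = [0, 1, 0, 1] * 10

X = np.array([aa_composition(s) for s in sequences])
y = np.array(labels)

clf = SVC(kernel="rbf", C=1.0)
print(cross_val_score(clf, X, y, cv=5).mean())  # cross-validated accuracy
```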
However, these conventional models may not perform well in complex scenarios [1], especially multi-location protein prediction [28]. Although many proteins reside in a single subcellular compartment, studies have identified many multi-location proteins that perform special functions or participate in crucial biological processes [37]. Moreover, rather than staying in one place, proteins may move from one subcellular compartment to another or reside in two locations simultaneously, participating in different cellular processes [38]. Recent studies have also shown the remarkable significance of multi-location proteins in cell growth and development [39]. For instance, phosphorylation-related multi-location proteins can function as a “needle and thread” via protein-protein interactions (PPIs), thus playing an important role in organelle communication and in regulating plant growth [40]. Under these circumstances, there are two main ways to predict multi-location proteins with conventional classifiers: algorithm adaptation and problem transformation. The former extends existing algorithms to handle multi-label problems. Jiang et al. [41] incorporate weighted prior probabilities into a multi-label KNN algorithm to increase model accuracy. The LIBSVM (Library for Support Vector Machines) toolbox [34,42], by contrast, uses a one-versus-one (OVO) strategy to solve multi-class classification problems. Customizing well-known algorithms adapts them to specific requirements, but it risks overfitting and may require significant computational resources. The problem transformation approach instead recasts the original problem into a representation or formulation solvable by existing algorithms [43,44], such as converting a multi-location classification problem into multiple single-label classification problems [45]. Shen et al. [28] introduce a multi-kernel SVM that trains multiple independent SVM classifiers, one per class, on single-label problems and then combines their results. Following this idea, an algorithm can easily be extended to multi-label classification.
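The sketch below illustrates the problem-transformation idea in its simplest binary-relevance form: one independent binary SVM per candidate location, with the per-class predictions combined at the end. It mirrors the one-classifier-per-class strategy described above but not Shen et al.'s specific multi-kernel construction; the toy data and scikit-learn usage are assumptions.

```python
# Sketch of the binary-relevance transformation for multi-location
# prediction: one independent binary SVM per subcellular location.
import numpy as np
from sklearn.svm import SVC

def train_binary_relevance(X, Y):
    """Y is an (n_samples, n_locations) 0/1 matrix; one SVM per column."""
    return [SVC(kernel="rbf").fit(X, Y[:, k]) for k in range(Y.shape[1])]

def predict_multilabel(models, X):
    """A protein is assigned every location whose classifier fires."""
    return np.column_stack([m.predict(X) for m in models])

# Toy multi-label data: 60 proteins, 10 features, 3 candidate locations.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
Y = (rng.random((60, 3)) < 0.4).astype(int)

models = train_binary_relevance(X, Y)
print(predict_multilabel(models, X[:5]))  # 0/1 matrix, one row per protein
```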
In summary, while traditional machine learning algorithms achieve fast training and high accuracy in scenarios with well-organized feature spaces and clear decision boundaries, their performance may degrade quickly on large-scale inputs, even with tailored classifiers and carefully selected features. Dimension reduction [46] and parallel processing [47] can be applied to mitigate these challenges, improving the scalability of computational methods.
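As one example of such mitigation, the sketch below prepends PCA to an SVM so that high-dimensional composition vectors are compressed before classification. The component count, feature dimensionality, and scikit-learn pipeline are illustrative assumptions rather than the configurations of the cited works.

```python
# Sketch: PCA-based dimension reduction ahead of classification, one common
# way to keep conventional classifiers tractable on large feature vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 820))   # e.g., 820-dim AAC+PairAA+GapAA vectors
y = rng.integers(0, 4, size=200)  # four toy location classes

model = make_pipeline(PCA(n_components=50), SVC())  # 820 -> 50 dims, then SVM
model.fit(X, y)
print(model.score(X, y))
```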
As multi-layered architectures provide better performance than traditional approaches [31], methods based on deep networks, especially neural networks, have become increasingly popular in protein subcellular localization research [48,49]. Serving as effective feature extractors that automatically learn deep features embedded in sequences [50], convolutional neural networks (CNNs) are widely adopted in multi-location protein localization frameworks. Digging deeper, Kaleel et al. [51] ensemble deep N-to-1 convolutional neural networks that predict endomembrane system and secretory pathway locations versus all others, outperforming many state-of-the-art web servers. Cong et al. [52] proposed a self-evolving deep convolutional neural network (DCNN) protocol that resolves difficulties in inter-site feature correlation and avoids the impact of unknown data distributions, using a self-attention mechanism [53] and a customized loss function to safeguard model performance. In addition, long short-term memory (LSTM) networks, which combine previous states with current inputs, are also commonly used [54,55], with generative adversarial networks (GANs) [56] and the Synthetic Minority Over-sampling Technique (SMOTE) [57] employed to synthesize minority-class samples and counter data imbalance. Advances in deep-learning-based data augmentation have also made protein language models possible [58,59]. Through transfer learning [60], pre-trained models can be fine-tuned on different downstream tasks, reducing the need for large amounts of labeled training data. For example, Heinzinger et al. [61] proposed Sequence-to-Vector (SeqVec), which embeds biophysical properties of protein sequences as continuous vectors by applying the natural language processing model ELMo to unlabeled big data. This offers a way to speed up prediction independent of input size. Details of the computational models mentioned above can be found in Table 1.
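For orientation, the sketch below shows the basic 1-D CNN building block behind such sequence-based predictors: one-hot encoded residues, a convolution that scans for local motifs, and a global pooling layer that makes the output length-invariant. PyTorch, the layer sizes, and the ten-class output are assumptions for the sketch, not the architecture of any cited model.

```python
# Minimal sketch of a 1-D CNN over one-hot encoded protein sequences.
# Hyper-parameters below are illustrative, not from any cited work.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq, max_len=100):
    """Encode a sequence as a (20, max_len) one-hot tensor, zero-padded."""
    x = torch.zeros(20, max_len)
    for i, a in enumerate(seq[:max_len]):
        x[AMINO_ACIDS.index(a), i] = 1.0
    return x

class LocCNN(nn.Module):
    def __init__(self, n_locations=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(20, 64, kernel_size=9, padding=4),  # scan local motifs
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # global max pool -> length-invariant
        )
        self.fc = nn.Linear(64, n_locations)

    def forward(self, x):             # x: (batch, 20, L)
        return self.fc(self.conv(x).squeeze(-1))

model = LocCNN()
batch = torch.stack([one_hot("MKKTAIAIAVALAGFATVAQA"),
                     one_hot("MKTAYIAKQRQISFVKSHFSR")])
print(model(batch).shape)             # (2, 10) location logits
```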
Deep learning demonstrates exceptional results on high-dimensional inputs through deep feature extraction, eliminating the need for manual feature engineering and capturing intricate patterns in sequences. However, large, labeled, high-quality datasets are still needed to train the original models, which also carry many hyper-parameters and remain hard to interpret [31].