Depression Diagnosis by Deep Learning Using EEG Signals: A Systematic Review

The WHO considers depression the main contributor to global disability, and it poses serious threats to nearly all aspects of human life, in particular public and private health. This mental disorder is usually characterized by considerable changes in feelings, routines, or thoughts. Because early diagnosis is critically important for effective treatment, progress has been made in depression detection. EEG signals reflect the working status of the human brain and are therefore considered the most suitable tools for depression diagnosis. Deep learning algorithms can discover patterns and extract features from the raw data fed into them. Owing to this characteristic, these methods have recently been used intensively in diverse fields of research, specifically medicine and healthcare. In this article, we therefore aimed to review all papers that use deep learning to detect or predict depressive subjects with EEG signals as input data. Following the adopted search method, we ultimately evaluated 22 articles published between 2016 and 2021. This article, which is organized according to the systematic literature review (SLR) method, provides complete summaries of all included studies and compares their notable aspects. Moreover, some statistical analysis is performed to gain deeper insight into the general ideas of the latest research in this area. A five-step procedure was also identified through which almost all reviewed articles accomplished depression detection. Finally, open issues and challenges in this approach to depression diagnosis or prediction are discussed, together with suggested future directions.


Introduction
The World Health Organization (WHO) has estimated that more than 322 million people worldwide suffer from depression, which has made this mental disorder the leading contributor to disability. Depressive patients are typically recognized by symptoms such as feelings of sadness, hopelessness, and guilt; loss of interest, concentration, and energy; and changes in appetite, sleep, and other routines [1].
Poverty, unemployment, tragic life events, physical disorders, and problems with alcohol or drug consumption are considered determining factors leading to depression [1]. Recently, the COVID-19 pandemic has contributed to depression, and its consequences, such as lockdowns, quarantine, and social distancing, are considered other main reasons for experiencing depression [2][3][4][5][6].
Depression poses an unprecedented threat to public health and has adverse effects on depressed people, including suicide [7]. Because early diagnosis can enable timely and more effective treatment, devising and developing an efficient and reliable method for depression detection, or even prediction, is of high importance.
Electroencephalogram (EEG) signals, which are recorded non-invasively and are naturally nonstationary, highly complex, and nonlinear, reflect human brain activity and working status. Due to this complexity, abnormalities are difficult to detect with the naked eye. These properties have made these physiological signals valuable tools for depression detection [8][9].
Deep learning is defined as a hierarchical structure composed of a series of algorithms with hidden neurons [10]. These models enable computers to build complex concepts from simpler ones [11]. The learned concepts are then used to build subsequent layers. Furthermore, in these methods, multiple processing layers are responsible for recognizing patterns and data structure [12].
This multi-layer approach has recently been applied in various fields, ranging from agriculture [13], the automotive industry [14][15], and IoT [16] to diverse uses in medicine [17][18][19][20]. Deep learning architectures can automatically learn and extract features from raw input data, whereas machine learning techniques are limited in this respect [21], and manual analysis of EEG signals is difficult [22]. For these reasons, deep learning solutions have become more popular in this setting, since they can extract the implicit nonlinear features of EEG signals with the least effort [23][24][25].
Applications of deep learning in depression diagnosis with the assistance of EEG signals have increased in recent times. The main purpose of this article is to conduct an SLR-based review of papers that use deep learning for depression detection or prediction using EEG signals. To the best of our knowledge, no systematic literature review has focused on this topic, and our SLR survey is the first in this field of research. The main contributions of this review are as follows:
• Designing a complete taxonomy chart of all deep learning algorithms used
• Providing concise summaries of all studied papers and comparing their main aspects
• Proposing a general five-step pattern based on all reviewed papers and providing information related to each stage in order to analyze them
• Highlighting open issues and future work
The rest of this review is organized as follows. Section 2 discusses related work. Section 3 presents the adopted research method. The summaries, together with the designed taxonomy, are given in section 4. Section 5 provides a discussion and comparison of the studied papers. Finally, section 6 is devoted to the conclusion.

Review of Related Studies
This section reviews earlier review studies that have addressed, at least in part, the diagnosis and prediction of depression using deep learning methods and EEG signals.
Yasin et al. [26] conducted a review of studies that had adopted neural network and deep learning approaches to detect two kinds of depression, Major Depressive Disorder (MDD) and Bipolar Disorder (BD), using EEG signals. It used various search engines and a combination of diverse keywords to search among papers published over the preceding ten years, and then extracted useful information from them. One strength of this review was its classification of the datasets, the feature analysis or extraction methods, and the algorithms used in the articles. It also presented the extracted data in various tables and compared them in different ways. However, its major drawback was the insufficient number of articles reviewed: as it stated, only approximately five articles addressed MDD diagnosis in particular. In addition, the included articles were not explained in enough detail to convey their general idea and procedure.
The review presented in [27] concentrated on articles in which deep learning methods had been employed to study mental disorders, of which depression was one. Its four main categories were detection using clinical data, the use of genetic data in disease diagnosis, the analysis of different datasets, and the use of social media data to estimate the probability of mental illness. The selected papers, published until April 2019, used various kinds of datasets, among which EEG datasets were used in only three articles focused on depression detection or prediction. This review included a complete description of the evaluated datasets. It also thoroughly discussed the opportunities and challenges likely to be faced when exploiting each dataset. However, because it was a comprehensive review covering diverse mental illnesses, it covered only a few papers on deep learning for depression diagnosis and prediction using EEG signals, each with a short explanation.
Khosla et al. [28] reviewed articles focused on using EEG signals and different types of classifiers to diagnose neurological disorders such as depression and to address other tasks such as emotion recognition. It drew on various sources, including journals, conferences, books, and theses, to collect papers that were mainly published from 1999 to 2019; only four papers dated from an earlier time. For depression diagnosis, only around ten articles were considered. It provided a comprehensive comparison, with statistical analyses, of the techniques and knowledge used in the articles, organized into separate sections, namely artifact removal, feature extraction, types of extracted features, feature selection, dimensionality reduction, and classification methods. The datasets used were also summarized and classified as public or locally acquired, based on how they were collected. Moreover, it contained information about functional neuroimaging techniques. Covering diverse areas of mental health research, however, prevented it from addressing each area in depth. Table 1 provides the main information about the reviewed articles that contained some analysis of depression discrimination by deep learning using EEG signals.

Research Method
In this article, we have concentrated on the use of deep learning algorithms to detect or predict depression using EEG signals. These techniques are classified based on the SLR method, which serves as the assessment method of this research study [29][30][31][32]. This section describes the adopted SLR-based research method.
The Analytical Questions (AQs) defined in this SLR paper are listed below. They are answered in full in line with the purpose of this study.
AQ1: What deep learning algorithms have been used to detect or predict depression?
AQ2: Which deep learning methods are most preferred for this goal?
AQ3: What approaches are mainly used to extract features from EEG signals?
AQ4: What are the future and open works related to depression detection or prediction?
The final papers were selected by applying a set of principles and inclusion/exclusion criteria.
Publisher type, index, publication year, and citation count were the four main aspects covered by the applied rules. Accordingly, only peer-reviewed, ISI-indexed papers published over the past five years in journals or conferences with more than 10 citations were analyzed, with an exception for those published in the last two years. Owing to the fact that articles published in 2020 and 2021 have been cited less often than those from other years, papers from 2021 with zero citations, papers from the second half of 2020 with fewer than 5 citations, and papers from the first half of 2020 with at least 5 citations were also included in the analysis.
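The citation rules above can be read as a threshold that depends on the publication date. The following is a minimal sketch of one such reading, not the procedure actually used for this review: the `Paper` record, its field names, and the functions `min_citations` and `is_included` are hypothetical, and the review period is assumed to be 2016 to 2021.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for a candidate paper; field names are illustrative."""
    year: int
    month: int          # 1-12, month of publication
    citations: int
    peer_reviewed: bool
    isi_indexed: bool


def min_citations(year: int, month: int) -> int:
    """Minimum citation count under one reading of the stated rules (an assumption)."""
    if year == 2021:
        return 0        # zero-citation papers from 2021 are accepted
    if year == 2020:
        # first half of 2020: at least 5; second half: fewer than 5 is accepted
        return 5 if month <= 6 else 0
    return 11           # earlier years: more than 10 citations


def is_included(p: Paper) -> bool:
    """Apply the inclusion/exclusion rules to a single paper."""
    if not (p.peer_reviewed and p.isi_indexed):
        return False
    if not 2016 <= p.year <= 2021:   # assumed five-year review period
        return False
    return p.citations >= min_citations(p.year, p.month)
```

Under this reading, a 2018 paper needs at least 11 citations, whereas a paper from late 2020 or from 2021 passes on the peer-review and indexing checks alone.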

Organization of deep learning algorithms for depression diagnosis using EEG signals
This part, following the applied systematic literature review process, reviews and summarizes papers that use deep learning algorithms to detect or predict depression with the assistance of EEG signals and that satisfy the rules set out in section 3. Fig. 3 shows a comprehensive taxonomy of the deep neural network models used for the diagnosis and prediction of depressive disorders. According to this taxonomy, all adopted models fall into four main groups, namely CNN-based models, LSTM models, Combination (models combining convolutional networks and LSTM blocks), and Other Algorithms. Because all deep learning algorithms other than CNN and LSTM were less common among the selected papers, with about one application each, they have been grouped into the Other Algorithms cluster. The vast majority of the reviewed articles applied several deep learning algorithms to the same prepared dataset to obtain the best accuracy, so the taxonomy chart is drawn based on the algorithms used across all articles. The summaries of the inspected studies are arranged in descending chronological order in subsections 4.1 and 4.2. A detailed analysis of these studies, and a comparison between them, is carried out in subsection 4.3. Furthermore, major aspects such as the main context, advantages, drawbacks, and new contributions of the summarized articles are presented in Table 2 and Table 3. As we have observed, all articles follow some general steps to discriminate between depressed and normal cases. Subsection 4.3 also analyzes this five-step pattern, and Fig. 4 shows the five key stages.
In subsections 4.3.1 to 4.3.5, information extracted for each of these five steps is presented in Tables 4 to 9, which allows the examined papers to be compared in terms of the techniques and methods employed in each step.

Deep learning methods for depression detection using EEG signals
Following the SLR structure and organization [29][30][31][32], this part provides a brief yet complete introduction to all examined articles to acquaint future researchers with previous work on the topic of this review. The summaries aim to cover the main points and features of each paper, including the dataset used, the structure of the proposed deep learning model, the data preparation methods, the advantages and drawbacks of the adopted method, and the achievements and contributions.
Sharma et al. [33] proposed a new computer-aided diagnosis (CAD) method for depression called DepHNN, which stands for Depression Hybrid Neural Network. A Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are the two deep learning algorithms used, to capture the temporal dependencies in the time-series EEG input signals and to perform sequence learning, respectively. Artifacts were eliminated by Independent Component Analysis. Time-frequency information was also extracted from the EEG signals by the Fast Fourier Transform. To construct a high-performance model, several models with different numbers of LSTM and fully connected layers were built. The result was that decreasing the number of LSTM layers and increasing the number of fully connected layers, such that the former is smaller than the latter, reduces both the execution time and the loss value. In this approach, the CNN first transforms the EEG input data into cross-sectional data using a technique named windowing. The transformed output is then fed into the LSTM block, which can memorize data from previous steps thanks to the hidden memory cells in its architecture. Finally, fully connected layers assist in the automatic detection of depressive cases. This hybrid 6-layer CNN-LSTM model benefits from lower time and computational complexity with considerably high accuracy, owing to its fewer hidden layers and the windowing technique. The possibility of overfitting and the small amount of input data, however, were the two main problems this research faced.
A new 18-layer CNN-based framework named DeprNet was proposed in [34], in which both the spatial and temporal dimensions of the EEG data are used during training. Artifacts produced by eye blinks were removed from the EEG signals with independent component analysis (ICA). Three kinds of interference, namely low-frequency noise, irrelevant signals, and baseline noise, were eliminated with three filters: a 0.1-Hz high-pass filter, a 100-Hz low-pass filter, and a 50-Hz notch filter. The input data were arranged as a 2D matrix. The network consisted of five convolution layers for feature extraction, five batch normalization layers that keep the network stable by normalizing outputs, five max-pooling layers that reduce the sample size of the EEG signals, two fully connected layers, and finally a softmax layer that classifies subjects. This paper benefited from using short 4-s EEG segments from 19 channels, which makes the method suitable for practical use. Achieving significantly higher accuracy than other baseline models was another merit. Moreover, it showed that activity in the left and right hemispheres of the brain is affected differently by depression. However, the technologies required for depression diagnosis with this method might be difficult to deploy in clinical settings. The large number of layers might also add complexity.
Saeedi A et al. [35] used generalized partial directed coherence (GPDC) and direct directed transfer function (DDTF) along with deep learning algorithms for detection. GPDC and DDTF were used to extract the connections between EEG channels as an effective analysis of brain connectivity, and each was combined with eight frequency bands. These sixteen connectivity methods transformed the 1D EEG signals into 2D images that form the input of the deep learning classifiers. For classification, among the five deep learning algorithms adopted, which were 1DCNN, 2DCNN, LSTM, 1DCNN-LSTM, and 2DCNN-LSTM, the one-dimensional CNN-LSTM performed better than the others in terms of accuracy and sensitivity. Although 2DCNN-LSTM was faster and employed more parameters, its two-dimensional filtering might filter out temporal information, preventing it from reaching the best accuracy. Providing a comparison between diverse deep learning algorithms and converting EEG signals into 2D connectivity matrix images through effective connectivity methods are two important merits of this research. It also took advantage of both temporal and spatial information in detection. The small data size, however, is considered a major deficiency of this model. Moreover, there might be further techniques for using these features and parameters to obtain higher accuracy.
The model proposed in [36] detected depression with a three-dimensional convolutional neural network (3D-CNN) using effective connectivity within the brain's default mode network (DMN), estimated from 19-channel EEG signals. Artifacts from eye blinks, muscles, and other sources were removed in the EEGLAB software using the artifact subspace reconstruction (ASR) method. To calculate effective connectivity, continuous 2-s segments were extracted from the pre-processed output and fed into the partial directed coherence (PDC) algorithm. The result of PDC was a set of connectivity matrices covering all 19 channels. These matrices were then reduced to the six channels of the DMN area by the subsequent DMN connectivity extraction step, and the reduced matrices were used as input to the CNN model. The network comprised three convolutional layers, three batch normalization (BN) layers, three rectified linear unit (ReLU) activation layers, a global average pooling layer, a dropout layer, and one fully connected layer. The fully connected layer performed classification using binary softmax regression. This approach, which according to its authors was taken for the first time, offered a novel viewpoint in the study of EEG signals. Overfitting, however, might be one of the most likely issues that this method should address. More data is also required to evaluate and examine this new procedure.
Qayyum et al. [37] suggested two deep neural network models, called 1DCNN-GRU and 1DCNN-LSTM, applied to two datasets recorded with eyes open (EO) and eyes closed (EC). The input EEG signals, which included 19 channels, were segmented into time windows of one second. A 1DCNN, Gated Recurrent Units (GRU) or long short-term memory (LSTM), and classifiers were the three main parts of both methods. The 1DCNN, as the first component, included three CNN layers, two max-pooling layers, and two dropout layers and was used to extract features. The LSTM/GRU was then in charge of sequential learning. Finally, KNN, RF, and SVM machine learning models performed the prediction. Features extracted from only the last dense layer were fed into the classification layer (a sigmoid function for binary classification) to improve performance. The learning rate, optimizer, and loss function were the training hyperparameters explored in the study. As a result, GRU outperformed LSTM, since it is suited to short sequences and requires fewer training parameters and less memory. Time window segmentation, data size, and the extracted features, however, need further investigation.
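Fixed-length time windows recur in several of the reviewed studies. The following is a minimal sketch of non-overlapping segmentation of a multichannel recording; the function name `segment`, the channels-by-samples list layout, and the sampling rate used in the usage note are assumptions for illustration and are not taken from [37].

```python
from typing import List

Signal = List[List[float]]  # channels x samples


def segment(eeg: Signal, fs: int, window_s: float = 1.0) -> List[Signal]:
    """Split a multichannel recording into non-overlapping windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    if not eeg:
        return []
    win = int(round(fs * window_s))           # samples per window
    if win <= 0:
        raise ValueError("window must contain at least one sample")
    n_samples = min(len(ch) for ch in eeg)    # guard against ragged channels
    n_windows = n_samples // win
    return [
        [ch[k * win:(k + 1) * win] for ch in eeg]
        for k in range(n_windows)
    ]
```

For example, a 19-channel recording of 600 samples at an assumed 256 Hz yields two windows of shape 19 x 256, and the remaining 88 samples per channel are discarded.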
Depression detection in [38] was carried out using a 12-layer CNN-LSTM model. The recorded EEG signals were prepared in two steps. First, artifacts were eliminated with the FASTER algorithm. The FASTER algorithm was then employed to remove offset effects and amplitude scaling issues from the EEG signals. The proposed model comprised three main components: a convolutional neural network, LSTM memory, and a classification part. Three CNN layers and three MaxPooling1D layers formed the first section of the model, which was used to extract features. The second unit included two LSTM layers, whose role was to generate feature maps by discovering different patterns in the EEG signals and then retain the sequence of what was learned. Two fully connected layers, one dropout layer, and one Flatten layer carried out the classification task; the Flatten layer converted the output of the LSTM layers into a feature vector for the fully connected layers. This simple architecture showed that right-hemisphere signals affected the diagnosis result more than left-hemisphere signals. Beyond the need for more data to evaluate the accuracy and claims of the model, and the risk of overfitting, more advanced methods of feature extraction and analysis would have been expected. Moreover, the dataset used was inadequately described.
Kang M et al. [39] offered a model based on two-dimensional convolution layers using signals recorded in both eyes-closed (EC) and eyes-open (EO) states. After data collection, power line interference was first suppressed by a 50 Hz notch filter. Artifact removal was applied only to the EC dataset: noise from eye blinks or muscle movement was eliminated with a low-pass filter that removes frequencies above 32 Hz. The output signals of the preprocessing step were decomposed into the Delta (0.5-4 Hz), Theta (4-8 Hz), Alpha (8-13 Hz), and Beta (13-32 Hz) frequency bands. The proposed 11-layer model comprised a convolutional part with two Conv2D layers with ReLU activation, two max-pooling layers, and one dropout layer; a fully connected part with two dense layers with ReLU activation and one dropout layer; and an output part composed of one dense layer with ReLU activation. Ten-fold cross-validation was used to evaluate and verify the results of the model. As a result, it obtained high accuracy compared to existing models on the same dataset. However, generalizing the results of this model was hampered by a lack of adequate data.
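The band decomposition described above can be illustrated by summing spectral power inside each band. This is a minimal sketch that uses a plain discrete Fourier transform on one channel; it is not the decomposition procedure of [39], whose filter details are not given here. The function name `band_powers` and the exclusive upper band edges are assumptions, and a practical pipeline would use an FFT library instead of this quadratic-time loop.

```python
import cmath
import math
from typing import Dict, List, Tuple

# Band edges in Hz as listed in the text; the upper edge is exclusive here.
BANDS: Dict[str, Tuple[float, float]] = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 32.0),
}


def band_powers(x: List[float], fs: float) -> Dict[str, float]:
    """Sum the periodogram power of one channel inside each band."""
    n = len(x)
    powers = {name: 0.0 for name in BANDS}
    for k in range(1, n // 2 + 1):            # positive-frequency bins only
        freq = k * fs / n                     # centre frequency of bin k
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        p = abs(coeff) ** 2 / n
        for name, (lo, hi) in BANDS.items():
            if lo <= freq < hi:
                powers[name] += p
    return powers
```

A pure 10 Hz sinusoid, for instance, places essentially all of its power in the Alpha entry of the returned dictionary.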
Kang M et al. [40] introduced a novel feature extraction methodology for depression detection, which exploits the asymmetry of EEG signals and converts it into 2D images that are fed to a convolutional neural network. In preprocessing, each channel of the raw EEG signals was normalized by min-max normalization, and artifacts were then eliminated by independent component analysis (ICA). The output signals were afterward segmented into time windows of four seconds. To generate the asymmetry matrix image, four frequency bands, namely Delta, Theta, Alpha, and Beta, were first extracted, and the matrix image was created from them. The resulting image was used as input to a CNN model built from three two-dimensional convolution layers with ReLU activation, three two-dimensional max-pooling layers, one flatten layer, one dropout layer, and two fully connected layers. The study concluded that the alpha-band asymmetry matrix image obtained the best accuracy among the power bands. Although it used a dropout layer, data segmentation, and batch normalization to avoid overfitting, the lack of sufficient data remained a potential cause of it.
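Per-channel min-max normalization, as mentioned in the preprocessing step above, rescales each channel to the range [0, 1]. The following is a minimal sketch; the function name `min_max_normalize` and the handling of a constant channel (mapped to zeros) are assumptions, since [40] is not quoted on that edge case.

```python
from typing import List


def min_max_normalize(channel: List[float]) -> List[float]:
    """Rescale one channel linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(channel), max(channel)
    if hi == lo:
        # Constant channel: the usual formula would divide by zero (assumed fallback).
        return [0.0 for _ in channel]
    span = hi - lo
    return [(v - lo) / span for v in channel]
```

Applying the function to each channel separately keeps channels with large amplitudes from dominating the input image.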
Duan L et al. [41] sought to exploit the effects of depression on the EEG signals of MDD patients. For pre-processing, a finite impulse response (FIR) filter and the independent component analysis (ICA) algorithm were adopted to remove unnecessary signals and visual artifacts from the EEG data, respectively. Then, the fast Fourier transform (FFT) was used to extract the Theta, Beta, and Alpha frequency bands from the pre-processed EEG signals, from which two features, the interhemispheric asymmetry value and the cross-correlation value, were computed. Structural and connectivity changes were afterwards analyzed using these extracted features. The structural and connectivity features were used to construct three feature matrices: two single-feature matrices, each composed of three layers of Theta, Beta, and Alpha interhemispheric asymmetry (or cross-correlation), and one 6-layer mixed-feature matrix composed of two Alpha, two Beta, and two Theta mixed-feature layers. To determine the feature that yields the best accuracy, the study concluded that, among the structural features, connectivity features, and mixed features fed into the CNN, the feature-combining method outperforms the others. Apart from overfitting, which might be one weakness of the proposed model, the features and formulas used might need to be reconsidered when more data is available.
Saeedi M et al. [42] performed prediction with a three-step procedure: feature extraction, feature selection, and classification. For pre-processing, the Discrete Wavelet Transform (DWT) was employed to decompose the EEG signals into detail and approximation coefficients, after which components matching the artifact thresholds were omitted. Five common frequency bands, referred to as linear features, were extracted from the EEG signals: Delta (0.5-4 Hz), Theta (4-8 Hz), Alpha (8-13 Hz), Beta (13-30 Hz), and Gamma. Non-linear features were then obtained by applying sample entropy and approximate entropy to the wavelet packet coefficients. Next, a genetic algorithm (GA) was adopted to choose significant features and reduce feature dimensionality. Two machine learning algorithms, namely KNN and SVM, and a multilayer perceptron (MLP) as the deep learning method were used to identify depressive cases. With the deep learning model, MLP combined with non-linear features achieved higher accuracy than with linear features. Among the linear features, the Gamma power band provided higher accuracy than the others. Although this research sought to improve prediction by taking advantage of various features of the EEG signals, these parameters should be optimized.
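Sample entropy, one of the non-linear features mentioned above, can be written compactly. The following is a minimal sketch of the standard definition SampEn(m, r) = -ln(A / B), applied directly to a short sequence; in [42] it is applied to wavelet packet coefficients, and the parameter values used there are not stated here. The defaults `m = 2` and `r = 0.2`, the absolute tolerance, and the return value of infinity when no matches are found are assumptions.

```python
import math
from typing import List


def _count_matches(x: List[float], m: int, r: float, n_templates: int) -> int:
    """Count template pairs (i < j) of length m within Chebyshev distance r."""
    count = 0
    for i in range(n_templates):
        for j in range(i + 1, n_templates):
            if all(abs(x[i + k] - x[j + k]) <= r for k in range(m)):
                count += 1
    return count


def sample_entropy(x: List[float], m: int = 2, r: float = 0.2) -> float:
    """SampEn(m, r) = -ln(A / B), with self-matches excluded.

    B counts matching template pairs of length m, A those of length m + 1.
    r is an absolute tolerance here; it is often chosen as a fraction of
    the standard deviation of the signal.
    """
    n_templates = len(x) - m          # same template count for both lengths
    if n_templates < 2:
        raise ValueError("signal too short for the chosen m")
    b = _count_matches(x, m, r, n_templates)
    a = _count_matches(x, m + 1, r, n_templates)
    if a == 0 or b == 0:
        return math.inf               # undefined ratio, reported as infinite here
    return -math.log(a / b)
```

A perfectly regular sequence gives a value of zero, and less predictable sequences give larger values, which is why the measure is used as an indicator of signal complexity.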
Y. Xie et al. [43] designed a CNN model combined with the functional connectivity of brain networks. For pre-processing, after artifact elimination from the EEG signals, the phase lag index (PLI) method was employed to form a 31 × 31 adjacency matrix of the functional connectivity of brain networks, calculated on the Delta, Alpha, Beta, Theta, and Gamma frequency bands. In this research, a brain network was defined as a graph of nodes and edges, in which nodes corresponded to the particular areas of the brain where EEG signals were recorded, and edges indicated the functional connections of each node. The PLI method was also used to measure the strength of the connection between nodes in order to construct the brain network as input data. They used a simple deep learning model named CNN-2 with one convolutional layer and one pooling layer. This study made it possible to compare six different models, in which the brain network and Prefrontal Lateralization methods were each paired with the CNN-2, DBN, and LDA algorithms. They showed that, among all six models, brain networks with CNN-2 performed best. Although the proposed method took a new approach in the pre-processing step, the simple convolutional neural network model posed a risk of overfitting in the classification output. Moreover, more data or other techniques such as LSTM might improve the accuracy.
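The adjacency matrix described above can be illustrated with the usual definition of the phase lag index, PLI = |mean(sign(sin(Δφ)))|, where Δφ is the instantaneous phase difference between two channels. The following is a minimal sketch, not the implementation of [43]: band filtering is omitted, the instantaneous phase is obtained from an analytic signal computed with a plain discrete Fourier transform, and all function names are hypothetical. A practical pipeline would use an FFT-based Hilbert transform from a signal processing library.

```python
import cmath
import math
from typing import List


def analytic_signal(x: List[float]) -> List[complex]:
    """Analytic signal via a plain DFT (Hilbert transform in the frequency domain)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist for even n), double positive frequencies, zero the rest.
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue
        spectrum[k] *= 2.0 if k < (n + 1) // 2 else 0.0
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def instantaneous_phase(x: List[float]) -> List[float]:
    """Phase angle of the analytic signal at every sample."""
    return [cmath.phase(z) for z in analytic_signal(x)]


def phase_lag_index(phase_x: List[float], phase_y: List[float]) -> float:
    """PLI = |mean(sign(sin(phase_x - phase_y)))|, a value between 0 and 1."""
    total = 0
    for px, py in zip(phase_x, phase_y):
        s = math.sin(px - py)
        total += (s > 0) - (s < 0)    # sign of the phase difference
    return abs(total) / len(phase_x)


def pli_matrix(channels: List[List[float]]) -> List[List[float]]:
    """Symmetric channel-by-channel PLI adjacency matrix with a zero diagonal."""
    phases = [instantaneous_phase(ch) for ch in channels]
    n = len(channels)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = phase_lag_index(phases[i], phases[j])
    return matrix
```

With 31 channels the result is a 31 × 31 matrix. Two channels with a consistent non-zero phase lag give a value close to 1, whereas identical channels give 0, because a zero phase difference contributes no sign.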
X. Zhang et al. [44] experimented with two models for MDD prediction, a one-dimensional Convolutional Neural Network (1-D CNN) with and without a demographic attention mechanism, to show the effect on accuracy of integrating EEG signals with demographic information. To prepare the data, the discrete wavelet transform and Kalman filtering were adopted to remove noise. In addition, non-overlapping 4-second time windows were taken from 40-second high-quality parts of the EEG signals. One-hot encoding and normalization converted the demographic information into a 1D vector. Spatial-temporal features were then extracted using the 1DCNN. After that, the demographic information was indirectly integrated into the model by the attention mechanism. Lastly, a fully connected layer produced the final classification through the softmax function. As a result, the CNN with demographic attention outperformed the CNN without that information. To investigate the impact of the incorporated demographic information more closely, the DeepDream method was used to create two artificial EEG signals from the two CNN models, and their power spectral density (PSD) was measured. This showed that the two generated signals differed mainly in the Beta frequency band. This study took advantage of demographic information as a new technique to obtain acceptable diagnostic accuracy. However, alongside the lack of data and overfitting, the integration and influence of this information require a closer look.
C. Uyulan et al. [45] suggested three convolutional neural network-based models integrated with advanced computational neuroscience methodology, named ResNet-50, MobileNet, and Inception-v3, and applied them to EEG signals recorded from both the left and right hemispheres in the eyes-closed state. Artifacts produced by eye and muscle activity were removed by the wavelet transform. In addition to a 50 Hz notch filter employed to compensate for power line interference, three band-stop filters with 3 Hz, 35 Hz, and 50 Hz cutoff frequencies and the FastICA method were also applied to the input signals to complete the noise removal process. A 12th-order Butterworth band-pass filter was used to extract the Delta, Theta, Beta, and Alpha frequency bands. Possible error in both passbands and stopbands was minimized by using the Parks-McClellan optimal FIR (finite impulse response) filter algorithm. The output signals of the noise removal and pre-processing steps were then converted to a 2D EEG image matrix as input to the CNN structures. The 22-layer ResNet-50 convolutional model, a residual learning network, was able to deal with the vanishing gradient problem and led to better training of the deep network. The MobileNet convolutional structure used depth-wise separable convolutions and some parameters to enhance performance. The 48-layer Inception-v3 convolutional model also had advantages such as avoiding loss of information during training. As a consequence, the MobileNet architecture performed better than the two other proposed models and obtained the highest accuracy for hemispheric classification. Furthermore, for frequency band-based classification, the best accuracy belonged to ResNet-50. Although the proposed CNN-based models had some novelty, clinical environments would face barriers to implementation due to their complexity. Moreover, the obtained accuracy should be evaluated with more data.
Z. Wan et al. [46] set the objective of MDD prediction using HybridEEGNet, a Convolutional Neural Network, with feature analysis as the second main aspect of the proposed model. Two different kinds of convolutional filters were adopted to learn the synchronous and regional characteristics of the EEG signals. The analysis was conducted by using the DeepDream algorithm to create feature matrices and the FFT method to examine those matrices. The 21-layer deep model employed in this research consisted of eight convolution layers to learn features, eight max-pooling layers for down-sampling, one concatenation layer to transform the output of the preceding layer into one vector and feed it into the first fully connected layer, three fully connected layers, and one soft-max layer for classification. The FFT analysis of the two learned features, particularly their combination, indicated that they carry significant information about depressive cases; namely, the authors claimed that differences in amplitude ranges and spatial distributions in the Alpha frequency band might be a distinctive characteristic. The study also emphasized the importance of the low-frequency 4-10 Hz band for diagnosis. Moreover, the authors stated that Theta and Alpha rhythms played an important part in depression discrimination with considerable performance. In addition to the lack of data, which is a notable drawback because it casts doubt on the accuracy, the feature extraction procedure and the findings derived from it need to be reviewed.
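Several of the reviewed models, including the one above, end in a soft-max layer. As a minimal sketch, assuming a two-class problem (depressed versus normal) and made-up scores from the last fully connected layer, the soft-max function turns raw scores into class probabilities as follows.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing for large scores.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the classes (depressed, normal).
probabilities = softmax([2.0, 0.0])
predicted_class = probabilities.index(max(probabilities))
```

The predicted class is simply the index of the largest probability, so the soft-max does not change which class wins; it only makes the scores interpretable as probabilities.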
One-dimensional Convolutional Neural Network (1DCNN) and 1DCNN with long short-term memory (LSTM) are the two deep learning algorithms employed in [47]. The 1DCNN, which aims to learn and transform temporal information, consisted of three 1D convolutional layers, two pooling layers, one dropout layer, and three other layers. The second model, with the same purpose as the first architecture but better performance, comprised three blocks of one 1D convolutional layer, one pooling layer, and one dropout layer, followed by two LSTM layers. As the starting point, the multiple source eye correction (MSEC) method was employed to correct artifacts caused by eye blinks and muscular or heart activity in the raw input data. Then, the signals were segmented into 1-second time windows. This proposed model has the merit of correcting noise generated by eye blinks and muscular activity during the EEG recording session by applying the MSEC method. It also showed that eyes-open (EO) signals gave better classification performance than eyes-closed (EC) signals. Nonetheless, its disadvantages are the small data size and the short time window. Although the study claimed that a 1-second window size resulted in the best performance, this assertion needs to be verified.
Mahato et al. [48] pursued depression detection by employing four classifiers: multi-layered perceptron neural network (MLPNN), radial basis function network (RBFN), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). Eight linear features, namely Alpha, Beta, Delta, and Gamma frequency band power and Alpha, Beta, Delta, and Gamma hemispheric asymmetry, together with relative wavelet energy (RWE) and wavelet entropy (WE) as non-linear features, were extracted from the input EEG signals. Principal component analysis (PCA) was used to reduce the dimension of the non-linear features, and independent component analysis (ICA) was used to remove artifacts. The paper found that the highest accuracy was obtained by combining linear and non-linear features with both the MLPNN and RBFN classifiers. Using the two non-linear features, WE and RWE, together with the LDA classifier produced the second-highest accuracy, which was higher than that of any of the mentioned classifiers combined with any linear features. Although this model pointed to Theta power as a possible route to high accuracy in diagnosing depressive cases, more data are required to clarify and generalize this finding.
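To make the two non-linear features concrete, the sketch below computes relative wavelet energy and wavelet entropy from a list of sub-band energies, using the commonly used Shannon-type definition. This is an assumption: the review does not give the exact formulation used in [48], and the sub-band energies here are made-up numbers standing in for the output of a wavelet decomposition that is not shown.

```python
import math


def relative_wavelet_energy(band_energies):
    """Return each sub-band's share of the total energy (shares sum to 1)."""
    total = sum(band_energies)
    if total <= 0:
        raise ValueError("total energy must be positive")
    return [energy / total for energy in band_energies]


def wavelet_entropy(band_energies):
    """Shannon entropy of the relative wavelet energy distribution."""
    shares = relative_wavelet_energy(band_energies)
    # Sub-bands with zero energy contribute nothing to the sum.
    return -sum(p * math.log(p) for p in shares if p > 0)


# Hypothetical energies of four wavelet sub-bands of one EEG channel.
energies = [4.0, 2.0, 1.0, 1.0]
rwe = relative_wavelet_energy(energies)
we = wavelet_entropy(energies)
```

Under this definition the entropy is zero when all energy sits in one sub-band and reaches its maximum, the logarithm of the number of sub-bands, when the energy is spread evenly.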
X. Li et al. [49] combined ensemble learning and deep learning approaches for prediction. The study also paid close attention to the influence of the temporal, spectral, and spatial information of EEG signals in analyzing them. EEG signals were recorded with a 128-channel HydroCel Geodesic Sensor Net (HCGSN). An auto-regressive model and the Hjorth algorithm were executed with various time windows to extract power spectral density as the frequency-domain feature and activity as the time-domain feature. In this research, EEG signals were transformed into images by adding spatial information through the azimuthal equidistant projection (AEP) method. Two VGG-style architectures were used, both ending in a dense layer and a soft-max layer. This study showed that the Alpha frequency band outperformed the Beta and Theta bands. It also showed that the choice of features played a prominent role in the accuracy obtained. Adopting the AEP method, however, limited performance, since it is a distance-based projection method and, as a consequence, non-distance aspects were ignored during image conversion. The steps for removing noise from the EEG input may also need to be reviewed.
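The Hjorth activity feature mentioned above is the variance of the signal within a time window. The sketch below computes it together with the two other standard Hjorth parameters, mobility and complexity, which are included only for completeness; the review names only activity. The sampling rate and the sine-wave test signal are assumptions made for the example.

```python
import math


def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def first_difference(values):
    """Discrete approximation of the derivative: x[n+1] - x[n]."""
    return [b - a for a, b in zip(values, values[1:])]


def hjorth(window):
    """Return (activity, mobility, complexity) for one time window."""
    d1 = first_difference(window)
    d2 = first_difference(d1)
    activity = variance(window)
    if activity == 0 or variance(d1) == 0:
        raise ValueError("window is constant or linear; parameters undefined")
    mobility = math.sqrt(variance(d1) / activity)
    complexity = math.sqrt(variance(d2) / variance(d1)) / mobility
    return activity, mobility, complexity


# Assumed test signal: 4 s of a 10 Hz sine sampled at 250 Hz.
fs = 250
sine = [math.sin(2 * math.pi * 10 * n / fs) for n in range(4 * fs)]
activity, mobility, complexity = hjorth(sine)
```

For a pure sine wave the complexity is close to 1, which is a convenient sanity check before applying the function to real EEG windows.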
B. Ay et al. [50] studied an 11-layer CNN-LSTM model to discriminate normal and depressed cases using EEG signals from both the right and left brain hemispheres. Artifacts were eliminated manually by specialists. The model comprised four Conv1D, two MaxPooling1D, two dropout, two dense, and one LSTM layers. Feature extraction was carried out by the CNN. The LSTM performed sequence learning on the feature maps generated by the CNN. In the end, detection was accomplished by fully connected layers using the LSTM output. In addition, max pooling was used to reduce the input size, memory usage, and number of parameters. This research reached a noticeable performance by exploiting EEG signal properties such as long-term dependencies and local characteristics. In addition to the small data size, manual artifact removal was another drawback of this method. Even though the study affirmed that there was no possibility of overfitting, this matter needs to be investigated.
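The size reduction achieved by max pooling can be seen in a few lines. The following is a minimal sketch of one-dimensional max pooling on a single feature map; the pool size, stride, and input values are assumptions for illustration and are not the settings of [50].

```python
def max_pool_1d(values, pool_size=2, stride=None):
    """Keep the maximum of each window; stride defaults to the pool size."""
    if stride is None:
        stride = pool_size
    if pool_size <= 0 or stride <= 0:
        raise ValueError("pool_size and stride must be positive")
    last_start = len(values) - pool_size
    return [max(values[i:i + pool_size]) for i in range(0, last_start + 1, stride)]


# A hypothetical feature map of length 6 is halved to length 3.
feature_map = [1, 3, 2, 5, 4, 0]
pooled = max_pool_1d(feature_map)
```

With a pool size and stride of 2, each pooling layer halves the sequence length, which in turn reduces the memory and computation needed by the layers that follow.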
Acharya et al. [51] claimed to be the first to employ a deep neural network, specifically a Convolutional Neural Network (CNN), to detect depression. This 13-layer CNN model comprised five convolutional layers, five pooling layers, and three fully connected layers. The convolutional layers extracted important features from the input EEG signals of the left and right hemispheres to train the algorithm, the pooling layers reduced the size of the feature maps, and the fully connected layers connected the neurons of one layer with those of the next. Finally, the output of the last fully connected layer was passed to a soft-max function to detect depressive cases. One of their notable achievements was to show that right hemisphere signals outperformed left hemisphere signals in terms of sensitivity and accuracy. However, this method might suffer from false predictions due to the lack of adequate data. Moreover, the complex architecture with many layers might limit the use of this model.
W. Mao et al. [52] placed strong emphasis on the pre-processing step, in which raw EEG signals were transformed and encoded to construct input samples for the deep learning algorithms. The primary preparation tasks comprised segmentation of 270 seconds, eliminating noise, removing the interfering effects of EOG, extracting the Theta, Alpha, and Beta bands, cropping into 0.5-s segments, and finally applying the auto-regressive (AR) model to calculate the power at each electrode. The power spectral density of the signals was used to preserve the temporal property of the EEG signals. To preserve the spatial information of the EEG, two projection methods, distance-based and non-distance, were employed. The output data were fed into deep neural networks to extract both inherent spatial and temporal patterns. After data processing, four deep learning models, CNN, Temporal Convolution, MAX, and LSTM, were used for classification. The first three models performed better on non-distance mapping frames, whilst the LSTM had lower performance. Although the distance-based method generates frames with more brain activity information, it badly affected the prediction accuracy. Low accuracy and small input data are two downsides of this procedure. Moreover, performance could be improved by optimizing the parameter configuration.
Cai et al. [53] diagnosed depression by using both deep learning and machine learning algorithms and compared them. EEG signals were collected from the Fp1, Fp2, and Fpz channels on the frontal scalp by placing a three-electrode EEG collector on those sites. Data pre-processing involved, first, noise removal by the wavelet transform method; second, extraction of the Alpha, Theta, Beta, and Gamma linear features; then normalization; and finally, construction of a feature matrix to feed into the deep models. The extracted features included both linear and non-linear types. ANN (Artificial Neural Network) and DBN (Deep Belief Network) were the deep learning algorithms used for detection. Among the adopted methods, DBN in combination with the extracted features outperformed the others. This study showed that Beta wave power can be used to enhance classification accuracy. However, more data would be essential to reach a more reliable accuracy. Besides, some problems are likely to arise due to the complexity of a large DBN architecture.
Table 2 presents the major information about all articles in section 4.1 to highlight the contrasts between them and outline their main context.

Deep learning methods for depression prediction using EEG signals
S. D. Kumar et al. [54] researched depression prediction with an LSTM model and the help of feature extraction. To remove power line interference from EEG signals recorded from the left and right hemispheres of the brain, a 50 Hz notch filter was applied. To generate input data, time-domain analysis with moving window segmentation was employed to extract the statistical mean feature. The LSTM block comprised one LSTM layer with ten hidden neurons, a dropout layer with a rate of 0.1, and a dense layer. This study compared the introduced method with two other models, CNN-LSTM and ConvLSTM. Among them, LSTM performed best, since it had the smallest root mean square error (RMSE), which was used as the model evaluator. Although the simple construction of the suggested model might bring some advantages, especially for implementation, EEG signals could be investigated further to extract more effective features or parameters for MDD prediction. Besides, in addition to providing more data, some enhancements to the model architecture could be applied to improve prediction accuracy.
Table 3 displays the key concepts of the article studied in section 4.2 to help convey its main idea.
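The moving-window mean feature and the RMSE criterion described for [54] can be sketched as follows. The window length, step, and sample values are assumptions for illustration; the review does not report the settings used in the study.

```python
import math


def moving_window_means(signal, window, step):
    """Mean of each window of `window` samples, advancing by `step` samples."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    starts = range(0, len(signal) - window + 1, step)
    return [sum(signal[i:i + window]) / window for i in starts]


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical samples: window of 2 with step 2 gives non-overlapping windows.
features = moving_window_means([1, 2, 3, 4, 5, 6], window=2, step=2)
error = rmse([0.0, 0.0], [3.0, 4.0])
```

A lower RMSE means the predicted sequence follows the observed one more closely, which is why the model with the smallest RMSE was judged the best of the three.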

Detailed analysis of reviewed articles
Subsections 4.1 and 4.2 are devoted to giving brief and effective summaries of all examined papers on depression detection and prediction, respectively. In this subsection, the main goal is to provide a comprehensive analysis of the studies while introducing the five-step process and to make a thorough comparison between the works across all five stages.
As far as we observed, almost all studied articles share the same five general steps to recognize or predict depressive disorders by deep learning using EEG signals. According to fig.4, the five parts of this process are collecting EEG signals, eliminating artifacts and noise, performing the required pre-processing operations, extracting suitable features, and finally classifying the output. Subsections 4.3.1 to 4.3.5 describe in detail the methods used in each phase by each study. Because all of this information is presented with the related reference, it is also possible to compare the studies at each step.
Fig.4. Five general steps to achieve the goal of depression diagnosis by deep learning using EEG signals

Analysis of utilized EEG datasets to diagnose depressive cases by deep learning
The main purpose of table 4 is to give detailed facts about the main properties of the datasets used in the reviewed articles. This table allows a comprehensive comparison between datasets and can also directly inform researchers' choices of data in the future.
As can be observed, most datasets consisting of high-channel signals were recorded from only a few patients. Since deep learning models require large amounts of data to work effectively, this shortage of data might cast doubt on the accuracy of the results.

Analysis of interference and artifacts removal methods
Before pre-processing can be performed, EEG signals are expected to be free of any kind of noise. Tables 5 and 6 present the methods adopted by each article to address interference and artifacts, respectively. These tables also show which works have used the same method. For example, most studies employed a 50 Hz notch filter.
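As a rough illustration of the 50 Hz notch filtering that most reviewed studies report, the sketch below builds a second-order IIR notch from the widely used biquad formulas and applies it sample by sample. The sampling rate (250 Hz), quality factor (30), and test signals are assumptions made for the example and are not taken from any reviewed study; in practice a signal processing library would normally be used instead of hand-written filtering.

```python
import math


def notch_coefficients(f0, fs, q=30.0):
    """Return normalized (b, a) coefficients of a biquad notch at f0 Hz."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [1 / a0, -2 * math.cos(w0) / a0, 1 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a


def apply_filter(b, a, signal):
    """Run a second-order IIR filter over the signal (direct form I)."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def rms(values):
    """Root mean square amplitude."""
    return math.sqrt(sum(v * v for v in values) / len(values))


fs = 250                                   # assumed sampling rate in Hz
n = range(8 * fs)                          # 8 seconds of samples
mains = [math.sin(2 * math.pi * 50 * k / fs) for k in n]       # interference
alpha_wave = [math.sin(2 * math.pi * 10 * k / fs) for k in n]  # 10 Hz rhythm

b, a = notch_coefficients(50, fs)
filtered_mains = apply_filter(b, a, mains)
filtered_alpha = apply_filter(b, a, alpha_wave)
```

After the start-up transient has decayed, the 50 Hz component is almost completely removed while the 10 Hz component passes nearly unchanged, which is the behavior wanted from a power line notch.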

Analysis of pre-processing operations
Before EEG signals can be used for depression detection or prediction, they must undergo some transformations that prepare them for the feature extraction step. Table 7 shows the related information for each article.
Table 8 explains how each research work extracted the required features from the prepared input signals. Most research works used convolutional layers in an end-to-end manner to extract features, among which frequency band features such as Alpha and Beta were preferred.
Table 9 summarizes the classifier(s) used, the results obtained, and some related facts for each article. It is intended to convey the gist of the detection or prediction stage at a glance.

Discussion and comparison
In line with the adopted SLR method, the previous sections explained the paper selection procedure, summarized the selected studies, and analyzed all research works in detail according to the defined five-step pattern. This part offers some statistical analysis and reports on the articles studied in this review. Furthermore, the analytical questions posed in section 3 are answered in full.
AQ1: What deep learning algorithms have been used to detect or predict depression?
The variety of deep learning algorithms used for the diagnosis or prediction of depressive subjects with EEG signals is illustrated in fig.5 according to the taxonomy designed in section 4. Among the displayed methods, LSTM has been used for prediction purposes. The majority of papers used at least two deep learning structures and applied them to the same prepared dataset to increase the chance of obtaining the best accuracy. Combined models of convolutional neural network structures and LSTM blocks obtained better results than the others because they benefit from both the memorization capability of the LSTM architecture and the built-in feature extraction of CNNs. From the reviewed studies, it can be concluded that combining external methods with a deep learning structure also gives significant accuracy, as adopted by [42,51].
CNN, 1DCNN, 2DCNN, and 3DCNN, which together accounted for approximately half of the deep learning classifiers used, were preferred to other algorithms. Within this category, CNN, with 32.4% of the total, was the most popular deep learning structure for depression discrimination or prediction.
As can be seen from the depiction, most studies selected convolutional layer(s) to derive features as an end-to-end technique. Although hand-defined features can also be extracted, this approach seems safe because all aspects of the extraction process are handled by the deep model. The second most frequent methods were the band-pass filter and FFT, each used three times. The other methods were each used once. We observed that some researchers applied two or more of the mentioned feature extraction approaches to produce different or combined sets of features.

Open issues and future works
This article has been organized based on the SLR method. Considering the evaluated studies and their chief characteristics, such as main context, advantages, drawbacks, and contributions, some challenges still need to be addressed. These matters are presented under AQ4.
AQ4: What are the future and open works related to depression detection or prediction?
In this part, the issues that the studies faced are discussed according to the introduced five-step procedure, including the input data source, data preparation (noise removal and pre-processing), and implementation of the deep learning model. In addition, we look into a general problem concerning depression prediction and the importance of resolving it.
Inadequate input data: The lack of input data was the first and major problem that posed serious difficulties for all studied researchers. Overfitting and unreliable accuracy were two consequences from which the detection or prediction process suffered. As we have noticed, some studies strived to overcome these issues. Exploiting one or more dropout layers with various probability rates was one technique that was applied to deal with overfitting. Tackling overfitting was also carried out by utilizing the windowing method and sample generation to produce substantial data records. However, although these approaches might alleviate these difficulties, the problem of insufficient data remains an open issue and would be of primary importance to cope with.
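The windowing approach to sample generation mentioned above can be sketched as a sliding window with overlap. The window length and step below are assumptions for illustration only; the reviewed studies used a range of settings.

```python
def sliding_windows(signal, window, step):
    """Cut a recording into windows of `window` samples, `step` samples apart.

    A step smaller than the window length makes consecutive windows overlap,
    which yields more training samples from the same recording.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    starts = range(0, len(signal) - window + 1, step)
    return [signal[i:i + window] for i in starts]


recording = list(range(10))                      # stand-in for EEG samples
disjoint = sliding_windows(recording, 4, 4)      # no overlap: 2 windows
overlapping = sliding_windows(recording, 4, 2)   # 50% overlap: 4 windows
```

Overlapping windows from one recording are strongly correlated, so windows from the same subject should be kept together in either the training or the test split; otherwise the reported accuracy can be inflated.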
Lack of a specific and effective technique for data pre-processing: One problem that almost all works had in common was the challenge of preparing data for prediction or detection. Regarding noise removal, some studies performed this task manually, while others used various experience-based methods, which are listed in tables 5 and 6. This issue extends to the pre-processing steps too, in which diverse choices, such as time window segmentation of different lengths, the AEP method, or transforming signals to images, can negatively affect the accuracy of the proposed model. The main reason is that these experimental techniques eliminate some important features of the input signals, for example the loss of non-distance properties of signals when they are converted to images. Therefore, a specific and reliable way of preparing data is required.

Implementation and usability: Another key problem that the papers encountered was the implementation of their proposed method(s). Although the main objective of the conducted research was to present a practical method to assist clinical environments and medical doctors, the construction and architecture of some of the introduced models have not met this requirement. To achieve high accuracy, they usually adopted deep learning models with many layers, and this complexity can create barriers to using the proposed methods for depression detection or prediction in real situations. Moreover, in some studies, EEG signals were recorded with many electrodes, such as 19-channel signals. Clinical settings might face considerable difficulties in collecting such high-channel signals. Therefore, designing a model compatible with the clinical setting is an important aspect that deserves more focus.
Giving less attention to depression prediction: Our search of different search engines for papers on predicting depression showed that very few works have been done in this respect, to the extent that we could review only one article that met our paper selection criteria. Depression, besides seriously affecting people's health or costing their lives [57], damages patients' financial condition [58] and imposes difficulties on the healthcare system and the economy [59]. Beyond these factors, depression is often diagnosed late for a variety of reasons, which makes treatment difficult and sometimes renders the condition incurable [60]. All of these elements underline the importance of devoting more time and attention to research into the prediction of this dangerous disease.

Conclusion
This paper has concentrated on depression detection and prediction by deep learning algorithms using EEG signals. Following the SLR method applied to this study, a comprehensive review was conducted in which research works focused on the specified topic were evaluated and their main aspects were inspected. Moreover, open issues and future research directions were discussed. Given our purpose, and the fact that most articles compared the results of two or more deep learning algorithms on the same prepared dataset, the taxonomy was designed based on the collection of all deep learning methods used across the studies. By analyzing 22 articles obtained from a complete, elaborate, SLR-based refinement, it was concluded that CNN-based deep learning methods, namely CNN, 1DCNN, 2DCNN, and 3DCNN, are by far the most preferred group among the diverse adopted algorithms, with almost 50% of the total. Within this group, CNN took first place with approximately one-third of the total. Combined models of the mentioned CNN-based algorithms and LSTM blocks were second only to the CNN-based category in popularity. Regarding AQ3, it was also inferred that different researchers adopted various feature extraction methods to build more appropriate models. Amongst these techniques, the majority of papers used convolutional layers as an end-to-end technique to extract local features. As has been determined, all analyzed studies follow almost the same procedure of collecting EEG signals, removing artifacts and noise, extracting the required features from the pre-processed signals, and finally classifying depressive and normal subjects using one or more deep learning methods. To sum up, in accordance with the SLR method and our goal, we aimed to present a comprehensive review that gives future research a strong background in this area.