A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

Deep learning (DL) has solved a problem that as little as five years ago was thought by many to be intractable - the automatic recognition of patterns in data - and it can do so with accuracy that often surpasses that of human beings. It has solved problems beyond the realm of traditional, hand-crafted machine learning algorithms and captured the imagination of practitioners trying to make sense of the flood of data that now inundates our society. As public awareness of the efficacy of DL increases, so does the desire to make use of it. But even for highly trained professionals it can be daunting to approach the rapidly increasing body of knowledge produced by experts in the field. Where does one start? How does one determine if a particular model is applicable to their problem? How does one train and deploy such a network? A primer on the subject can be a good place to start. With that in mind, we present an overview of some of the key multilayer ANNs that comprise DL. We also discuss some new automatic architecture optimization protocols that use multi-agent approaches. Further, since guaranteeing system uptime is becoming critical to many computer applications, we include a section on using neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where DL has emerged as a game-changing technology: anomalous behavior detection in financial applications, financial time-series forecasting, predictive and prescriptive analytics, medical image processing and analysis, and power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the DL community as well as to provide a reference to researchers seeking to use it in their work for what it does best: statistical pattern recognition with unparalleled learning capacity and the ability to scale with information.


Introduction
Artificial neural networks (ANNs), now one of the most widely-used approaches to computational intelligence, started as an attempt to mimic adaptive biological nervous systems in software and customized hardware [1]. ANNs have been studied for more than 70 years [2], during which time they have waxed and waned in the attention of researchers. Recently they have made a strong resurgence as pattern recognition tools following pioneering work by a number of researchers [3]. It has been demonstrated unequivocally that multilayered artificial neural architectures can learn complex, non-linear functional mappings, given sufficient computational resources and training data. Importantly, unlike more traditional approaches, their results scale with training data. Following these remarkable, significant results in robust pattern recognition, the intellectual neighborhood has seen exponential growth, both in academic and industrial research. Moreover, multilayer ANNs reduce much of the manual work that until now has been needed to set up classical pattern recognizers. They are, in effect, black-box systems that can deliver, with minimal human attention, excellent performance in applications that require insights from unstructured, high-dimensional data [4] [5] [6] [7] [8] [9]. These facts motivate this review of the topic.

What is an Artificial Neural Network?
An artificial neural network comprises many interconnected, simple functional units, or neurons, that act in concert as parallel information processors to solve classification or regression problems. If a black box is created by stacking layers of these multiply connected neurons, the resulting computational system can:
1. Interact with the surrounding environment by using one layer of neurons to receive information (these units are part of the input layer of the neural network);
2. Pass information back and forth between layers within the black box for processing by invoking certain design goals and learning rules (these units are part of the hidden layers of the neural network);
3. Relay processed information out to the surrounding environment via some of its atomic units (these units are part of the output layer of the neural network).
Within a hidden layer, each neuron is connected to the outputs of a subset (or all) of the neurons in the previous layer, each multiplied by a number called a weight. The neuron computes the sum of the products of those outputs (its inputs) and their corresponding weights. This computation is the dot product between an input vector and a weight vector, which can be thought of as the projection of one vector onto the other, or as a measure of similarity between the two. Assume the input vectors and weight vectors are both n-dimensional and there are m neurons in the layer. Each neuron has its own weight vector, so the output of the layer is an m-dimensional vector y = Wx, where W is the m × n matrix whose rows are the neurons' weight vectors and x is the input. In a regression setting, the fitted hyperplane maps new input points to output points that are consistent with the original data, in the sense that some error function between the computed outputs and the actual outputs in the training data is minimized.
Multiple layers of linear maps, wherein the output of one linear classifier or regression is the input of another, are equivalent to a single, different linear classifier or regression. This is because the output of k such layers reduces to the multiplication of the inputs by a single q × n matrix that is the product of the k matrices, one per layer.
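That equivalence is easy to verify numerically; the sketch below uses NumPy, and the layer sizes are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three stacked linear layers with no activation: 5 -> 8 -> 6 -> 3
W1 = rng.normal(size=(8, 5))
W2 = rng.normal(size=(6, 8))
W3 = rng.normal(size=(3, 6))

x = rng.normal(size=5)

# Passing x through the three layers one at a time...
deep_output = W3 @ (W2 @ (W1 @ x))

# ...is identical to multiplying by the single collapsed q x n matrix.
W_collapsed = W3 @ W2 @ W1          # shape (3, 5): q = 3, n = 5
single_output = W_collapsed @ x

assert np.allclose(deep_output, single_output)
```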
To classify inputs non-linearly, or to approximate a non-linear function with a regression, the output of each neuron is passed through a nonlinear activation function. The actual form of the activation function is a design parameter. Given the dimensions of the input and output vectors, the number of layers, the number of neurons in each layer, the form of the activation function, and an error function, the weights are computed via optimization over input-output pairs to minimize the error function. The resulting network is then a best approximation of the known input-output data. Each layer of the network acts like a filter that places its inputs on one side or the other of its own hyperplane defined by the weight matrix. In this way the artificial network is analogous to the excitation and inhibition processes in biological neural systems.
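As a sketch, a single layer's computation (dot products followed by a nonlinearity) can be written as follows; the tanh activation and the layer sizes are illustrative choices, not prescribed by the text:

```python
import numpy as np

def layer_forward(x, W, b, activation=np.tanh):
    """One fully connected layer: m dot products plus a nonlinear activation."""
    return activation(W @ x + b)

rng = np.random.default_rng(1)
n, m = 4, 3                      # input and output dimensions (illustrative)
W = rng.normal(size=(m, n))      # one weight vector per neuron, stacked as rows
b = np.zeros(m)
x = rng.normal(size=n)

y = layer_forward(x, W, b)       # m-dimensional nonlinear output
```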

How do these networks learn?
Neural networks are capable of learning: by changing the distribution of weights it is possible to approximate a function representative of the patterns in the input. The key idea is to repeatedly stimulate the black box with new excitation (data) until a sufficiently well-structured representation is achieved. Each stimulation redistributes the neural weights a little, hopefully in the right direction given an appropriate learning algorithm, until the approximation error with respect to some well-defined metric falls below a practitioner-defined bound. Learning, then, is the aggregation of variable-length causal chains of neural computations [10] seeking to approximate a certain pattern-recognition task through linear/nonlinear modulation of the activation of the neurons across the architecture. Where chains of purely linear activations fail to learn the underlying structure, non-linearity aids the modulation process. The term 'deep' in this context is a direct indicator of the space complexity of the aggregation chain across many hidden layers needed to learn sufficiently detailed representations. Theorists and empiricists alike have contributed to an exponential growth in studies using deep neural networks, although, generally speaking, the existing constraints of the field are well acknowledged [11] [12] [13].
Deep learning has grown to be one of the principal components of contemporary research in artificial intelligence in light of its ability to scale with input data and its capacity to generalize across problems with similar underlying feature distributions, in stark contrast to the hard-coded, problem-specific pattern-recognition architectures of yesteryear.

Why are deep neural networks garnering so much attention now?
Multi-layer neural networks have been around through the better part of the latter half of the previous century. A natural question to ask is why deep neural networks have gained the undivided attention of academics and industrialists alike in recent years. Several factors contribute to this meteoric rise in research funding and volume:
• A surge in the availability of large training data sets with high-quality labels
• Advances in parallel computing capabilities and multi-core, multi-threaded implementations
• Niche software platforms such as PyTorch [25], Tensorflow [26], Caffe [27], Chainer [28], Keras [29], BigDL [30], etc., that allow seamless integration of architectures into a GPU computing framework without the complexity of addressing low-level details such as derivatives and environment setup. Table 2 provides a summary of popular deep learning frameworks.
• Better regularization techniques introduced over the years that help avoid overfitting as we scale up: techniques like batch normalization, dropout, data augmentation, and early stopping are highly effective in avoiding overfitting and can single-handedly improve model performance at scale.
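Of these regularizers, dropout is simple enough to sketch directly; the inverted-dropout variant below (drop probability and sizes are illustrative) zeroes random units during training and rescales the survivors so no change is needed at test time:

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: randomly zero units during training and rescale
    the survivors so the expected activation matches the test-time value."""
    if not train or p_drop == 0.0:
        return activations
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(42)
h = np.ones(10_000)                          # a stand-in activation vector
h_train = dropout(h, p_drop=0.5, rng=rng)

# Survivors are scaled to 2.0, so the mean stays near the original 1.0.
assert abs(h_train.mean() - 1.0) < 0.05
assert np.all((h_train == 0.0) | (h_train == 2.0))
```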

Review Methodology
The article, in its present form, serves to present a collection of notable work carried out by researchers in and around the deep learning niche. It is by no means exhaustive, and it is limited in its ability to capture the global scheme of proceedings in the ever-evolving web of interactions within the deep learning community. While cognizant of the difficulty of the stated goal, we nonetheless try to present the reader with an overview of pertinent scholarly work in varied niches in a single article.
The article makes the following contributions from a practitioner's reading perspective: an overview of key architectures in the context of deep learning, while Section 4 details diagnostic approaches for assuring fault-tolerant implementations of deep learning systems. Section 5 presents an exploratory survey of several pertinent applications highlighted earlier, Section 6 critically dissects the general successes and pitfalls of the field as of now, and Section 7 concludes the article.

Deep architectures: Working mechanisms
There are numerous deep architectures available in the literature. Comparing architectures is difficult, as different architectures have different advantages depending on the application and the characteristics of the data involved; for example, in vision, convolutional neural networks [23] are preferred, while for sequence and time-series modelling, recurrent neural networks [33] are preferred. Deep learning, however, is a fast-evolving field: every year, new architectures with new learning algorithms are developed to address the need for efficient, human-like machines in different domains of application.

Deep Feed-forward Networks
The deep feedforward neural network is the most basic deep architecture, in which connections between nodes move only forward. When a multilayer neural network contains multiple hidden layers, we call it a deep neural network [34]. An example of a deep feed-forward network with n hidden layers is provided in Figure 3. Multiple hidden layers help model complex nonlinear relations more efficiently than a shallow architecture.
A complex function can be modelled with fewer computational units than a similarly performing shallow network, owing to the hierarchical learning made possible by the multiple levels of nonlinearity [35]. Due to the simplicity of its architecture and training, this model remains popular among researchers and practitioners in almost all domains of engineering. Backpropagation using gradient descent [36] is the most common learning algorithm used to train it. The algorithm first initialises the weights randomly, and the weights are then tuned to minimise the error using gradient descent. The learning procedure involves repeated forward and backward passes. In the forward pass, the input is propagated towards the output through multiple hidden layers of nonlinearity, and the computed output is ultimately compared with the actual output for the corresponding input. In the backward pass, the error derivatives with respect to the network parameters are backpropagated to adjust the weights so as to minimise the error in the output. The process is repeated until the desired improvement in the model's prediction is obtained. If X_i is the input and f_i the nonlinear activation function of layer i, the output of layer i, which becomes the input of the next layer, is

X_{i+1} = f_i(W_i X_i + b_i),

where W_i and b_i are the parameters connecting layer i with the previous layer. In the backward pass, these parameters are updated as

W_new = W − η ∂E/∂W,    b_new = b − η ∂E/∂b,

where W_new and b_new are the updated parameters for W and b respectively, E is the cost function, and η is the learning rate. The cost function of the model is chosen according to the task to be performed: for regression, root mean square error is common, and for classification, the softmax function.
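The forward and backward passes can be sketched end to end. The toy network below (one tanh hidden layer fitting sin(x), with illustrative sizes and learning rate) applies exactly the updates W_new = W − η ∂E/∂W:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression target: y = sin(x), one hidden tanh layer (sizes illustrative)
X = np.linspace(-2, 2, 64).reshape(1, -1)      # shape (1, batch)
Y = np.sin(X)

W1, b1 = rng.normal(size=(8, 1)) * 0.5, np.zeros((8, 1))
W2, b2 = rng.normal(size=(1, 8)) * 0.5, np.zeros((1, 1))
eta = 0.05                                     # learning rate

def mse(pred, target):
    return np.mean((pred - target) ** 2)

loss_before = mse(W2 @ np.tanh(W1 @ X + b1) + b2, Y)

for _ in range(500):
    # Forward pass
    H = np.tanh(W1 @ X + b1)
    P = W2 @ H + b2
    # Backward pass: error derivatives w.r.t. each parameter
    dP = 2 * (P - Y) / X.shape[1]
    dW2 = dP @ H.T
    db2 = dP.sum(axis=1, keepdims=True)
    dH = W2.T @ dP
    dZ = dH * (1 - H ** 2)                     # derivative of tanh
    dW1 = dZ @ X.T
    db1 = dZ.sum(axis=1, keepdims=True)
    # Gradient-descent updates: W_new = W - eta * dE/dW
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

loss_after = mse(W2 @ np.tanh(W1 @ X + b1) + b2, Y)
assert loss_after < loss_before                # the error decreases with training
```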
Many issues, such as overfitting, getting trapped in local minima, and vanishing gradients, can arise if a deep neural network is trained naively. This was the reason neural networks were forsaken by the machine learning community in the late 1990s.
However, in 2006 [24,37], with the advent of the unsupervised pretraining approach, neural networks were revived for complex tasks like vision and speech. Since then, many other techniques, such as l1 and l2 regularisation [38], dropout [39], batch normalisation [40], better weight-initialisation schemes [41,42,43,44], and better activation functions [45], have been introduced to combat the issues in training deep neural networks.

Restricted Boltzmann Machines
The Restricted Boltzmann Machine (RBM) [46] can be interpreted as a stochastic neural network. It is one of the popular deep learning building blocks owing to its ability to learn the input probability distribution in both supervised and unsupervised fashion. It was first introduced by Paul Smolensky in 1986 under the name Harmonium [47], and was popularised by Hinton in 2002 [48] with the advent of an improved training algorithm. After that, it found wide application in tasks such as representation learning [49], dimensionality reduction [50], and prediction problems [51]. However, deep belief network training using the RBM as a building block [24] was the most prominent application in the history of the RBM, marking the start of the deep learning era. More recently, the RBM has gained immense popularity in collaborative filtering [52] owing to state-of-the-art performance on the Netflix data. Many variants have since appeared: the models of [53] and the conditional RBM [54], proposed and applied to model multivariate time-series data and to generate motion captures; the gated RBM [55], to learn transformations between two input images; the convolutional RBM [56,57], to capture the temporal structure of input time series; the mean-covariance RBM [58,59,60], to represent the covariance structure of the data; and many more, such as the recurrent TRBM [61] and the factored conditional RBM (fcRBM) [62]. Different types of units, e.g. Bernoulli and Gaussian [63], have been introduced to cope with the characteristics of the data, although the basic RBM model uses Bernoulli units. Each node in an RBM is a computational unit that processes the input it receives and makes a stochastic decision whether to transmit it. An RBM with m visible and n hidden units is shown in Figure 4. The RBM is trained to maximise the expected probability of the training samples.
The contrastive divergence algorithm proposed by Hinton [48] is popular for training RBMs. Training brings the model to a stable state by minimising its energy through updates to the model parameters:

ΔW = ε(⟨vhᵀ⟩_data − ⟨vhᵀ⟩_model),

where ε is the learning rate and ⟨·⟩_data, ⟨·⟩_model denote expected values under the data and the model, respectively. The log-probability of the training data can be improved by adding layers to the network, which in turn increases the true representational power of the network [65]. A pretrained model can be applied as a discriminative model by appending a discrimination layer at the end and fine-tuning with the target labels [3]. In most applications, this approach of pretraining a deep architecture led to state-of-the-art performance in discriminative models [66,24,37,67,50], since it provides initial weights that can improve both modelling and convergence in fine-tuning [66,70]. The DBN has been used as an initialised model for classification in many applications, such as phone recognition [58], computer vision [59] (where it is used to train a higher-order factorized Boltzmann machine), speech recognition [71,72,73] for pretraining DNNs, and pretraining of deep convolutional neural networks (CNNs) [56,74,57]. The improved performance is attributed to the good initialisation the pretrained weights provide.
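A minimal sketch of one contrastive-divergence (CD-1) update for a Bernoulli RBM follows; biases are omitted for brevity and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, eps=0.1):
    """One CD-1 step for a Bernoulli RBM (biases omitted for brevity)."""
    # Positive phase: hidden probabilities given the data
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # stochastic hidden states
    # Negative phase: one step of Gibbs sampling reconstructs the visibles
    pv1 = sigmoid(h0 @ W.T)
    ph1 = sigmoid(pv1 @ W)
    # <v h>_data - <v h>_model, averaged over the batch
    grad = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    return W + eps * grad

# Illustrative sizes: 6 visible units, 4 hidden units, batch of 16 binary vectors
W = rng.normal(scale=0.01, size=(6, 4))
v0 = (rng.random((16, 6)) < 0.5).astype(float)
W_new = cd1_update(v0, W)
```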

Autoencoders
An autoencoder is a three-layer neural network, as shown in Figure 6, that tries to reconstruct its input at its output layer; hence, the output layer has the same dimension as the input layer. In the encoding phase, the hidden representation is

y = f(Wx + b),

where W and b are the parameters to be tuned, f is the activation function, and x is the input vector. In the decoding phase, the reconstruction is

x' = f(W'y + c),

where W' is the transpose of W, c is the bias of the output layer, and x' is the reconstructed input. The parameters of the autoencoder are updated as

w_new = w − η ∂E/∂w,    b_new = b − η ∂E/∂b,

where w_new and b_new are the updated parameters for w and b respectively at the end of the current iteration, and E is the reconstruction error of the input at the output layer.
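The encode/decode equations can be sketched directly; tied weights (W' = Wᵀ) and a tanh activation are assumptions made for illustration:

```python
import numpy as np

def autoencoder_pass(x, W, b, c, f=np.tanh):
    """Encode then decode with tied weights: y = f(Wx + b), x' = f(W.T y + c)."""
    y = f(W @ x + b)          # hidden representation (encoding)
    x_rec = f(W.T @ y + c)    # reconstruction at the output layer (decoding)
    return y, x_rec

rng = np.random.default_rng(0)
n, h = 8, 3                   # 8 inputs compressed to 3 hidden units (illustrative)
W = rng.normal(scale=0.1, size=(h, n))
b, c = np.zeros(h), np.zeros(n)
x = rng.normal(size=n)

y, x_rec = autoencoder_pass(x, W, b, c)
reconstruction_error = np.mean((x - x_rec) ** 2)   # the E to be minimised
```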
An autoencoder with multiple hidden layers forms a deep autoencoder. As with deep neural networks, training may be difficult because of the multiple layers; this can be overcome by training each layer of the deep autoencoder as a simple autoencoder [24,37]. The approach has been successfully applied to encode documents for faster subsequent retrieval [77], to image retrieval, and to learning efficient speech features [78]. Much as RBMs are stacked to form DBNs [24] for layerwise pretraining of DNNs, the autoencoder [37], along with the sparse-encoding energy-based model [67], was independently developed at that time, and both were effectively used to pretrain deep neural networks. Unsupervised pretraining using autoencoders has been successfully applied in many fields, such as image recognition and dimensionality reduction on MNIST [50,78,79], and multimodal learning on speech and video [80,81]. The autoencoder has also gained immense popularity as a generative model in recent years [34,82]: the non-probabilistic, non-generative nature of the conventional autoencoder has been generalised to generative modelling [83,38,84,85,86], allowing samples to be drawn from the network meaningfully.
Several variations of the autoencoder have been introduced, with quite different properties and implementations, to learn more efficient representations of data. One popular variation that is robust to input variations is the denoising autoencoder [85,38,86]. The model can be used for a good compact representation of the input, with fewer hidden units than input units, or to perform robust modelling of the input distribution with a larger number of hidden units. Robustness in the denoising autoencoder is achieved by the dropout trick or by adding Gaussian noise to the input data [87,88] or to the hidden layers [89]. This helps in several ways: it virtually enlarges the training set, hence reducing overfitting, and it makes the representation of the input robust. The sparse autoencoder [89] was introduced to allow a larger number of hidden units than input units while keeping most of them inactive, yielding sparse representations.

Convolutional Neural Networks
Convolutional neural networks learn increasingly abstract representations in successive layers [96]. Figure 8 is a good representative of this learning scheme.
The layers involved in any CNN model are the convolution layers and the subsampling/pooling layers, which allow the network to learn filters specific to particular parts of an image. The convolution layers help the network retain the spatial arrangement of pixels present in the image, whereas the pooling layers allow the network to summarize the pixel information [97]. There are several CNN architectures, such as ZFNet, AlexNet, VGG, YOLO, SqueezeNet, and ResNet, and some of these are discussed in Section 2.8.
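A minimal sketch of the two layer types follows, using a naive 'valid' convolution (implemented, as in most deep learning libraries, as cross-correlation) and non-overlapping max pooling; the image and kernel are toy values:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: summarize each size x size patch."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # a toy 6x6 'image'
edge_kernel = np.array([[1.0, -1.0]])              # crude horizontal edge detector
fmap = conv2d_valid(image, edge_kernel)            # feature map, shape (6, 5)
pooled = max_pool(fmap)                            # summarized map, shape (3, 2)
```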

Recurrent Neural Networks
Although Hidden Markov Models (HMMs) can express time dependencies, they become computationally infeasible when modelling the long-term dependencies that RNNs are capable of capturing. A detailed derivation of the recurrent neural network from differential equations can be found in [98]. In the LSTM variant, the amount of information retained from previous time steps is controlled by a sigmoid layer known as the 'forget' gate, whereas the sigmoid-activated 'input' gate decides what new information is to be stored in the cell; a hyperbolic-tangent-activated layer then produces new candidate values, and the cell state is updated by combining these with the old state's value weighted by the forget-gate coefficients.
Finally, the output is produced under the control of the output gate applied to the hyperbolic-tangent-activated candidate value of the state.
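The gate interplay described above can be sketched as a single LSTM step; the weight shapes and layer sizes below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: forget, input, and output gates plus a tanh candidate."""
    Wf, Wi, Wc, Wo, bf, bi, bc, bo = params
    z = np.concatenate([h_prev, x])       # previous hidden state and new input
    f = sigmoid(Wf @ z + bf)              # forget gate: how much old state to keep
    i = sigmoid(Wi @ z + bi)              # input gate: how much new info to store
    c_tilde = np.tanh(Wc @ z + bc)        # new candidate values
    c = f * c_prev + i * c_tilde          # updated cell state
    o = sigmoid(Wo @ z + bo)              # output gate
    h = o * np.tanh(c)                    # new hidden state (the output)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5                        # illustrative sizes
params = tuple(rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(4)) \
         + tuple(np.zeros(n_hid) for _ in range(4))
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), params)
```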

Generative Adversarial Networks
Goodfellow et al. [105] introduced a novel framework, Generative Adversarial Nets, with simultaneous training of a generative and a discriminative model. The goal is to train the generative network so as to maximize the probability that the discriminative network makes a mistake. In [106], the authors presented training procedures for GANs focused on producing visually sensible images. The proposed model succeeded both in producing MNIST samples visually indistinguishable from the original data and in learning recognizable features from the ImageNet dataset in a semi-supervised way. This work provides insight into appropriate evaluation metrics for generative models in GANs and a stable semi-supervised training approach. In [107], the authors identified distinct features of GANs from a Turing perspective: the discriminators were allowed to behave as interrogators, as in the Turing Test, by interacting with the data-sample-generating processes, and the increase in model accuracy was affirmed through two case studies. The first concerns inferring an agent's behavior based on a hidden stochastic process while managing its environment.
The second example describes active self-discovery exercised by a robot to draw conclusions about its own sensors through controlled movements.
Wu et al. [108] proposed a 3D Generative Adversarial Network (3D-GAN) for three-dimensional object generation using volumetric convolutional networks, with a mapping from a lower-dimensional probabilistic space to the space of three-dimensional objects, so that a 3D object can be sampled or explored without any reference image. As a result, high-quality 3D objects can be generated using an efficient shape descriptor learnt in an unsupervised manner by the adversarial discriminator. Another interesting application is generating images from detailed visual descriptions [110]: the authors trained a deep convolutional generative adversarial network (DC-GAN) on text features encoded by a hybrid character-level convolutional recurrent neural network, with a manifold-interpolation regularizer. The generalizability of the approach was tested by generating images of various objects with changing backgrounds.
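The adversarial objective underlying all of these systems can be sketched with the standard discriminator loss and the non-saturating generator loss; the discriminator scores below are made-up placeholders, not outputs of a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy the discriminator minimises:
    -[log D(x) + log(1 - D(G(z)))], averaged over the batch."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: -log D(G(z)), so the generator
    gains when the discriminator mistakes fakes for real."""
    return -np.mean(np.log(d_fake))

# Illustrative discriminator outputs (probabilities of 'real') on a batch
d_real = rng.uniform(0.6, 0.9, size=32)   # scores on real samples
d_fake = rng.uniform(0.1, 0.4, size=32)   # scores on generated samples

d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```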

Recent Deep Architectures
When it comes to deep learning and computer vision, datasets like Cats and Dogs, ImageNet, CIFAR-10, and MNIST are used for benchmarking purposes.
Throughout this section, the ImageNet dataset is used for benchmarking results, as it is more general than the other datasets just mentioned.
Every year, an image-classification competition named the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), based on the ImageNet dataset, is organized and widely followed by the deep learning community [111]. The 2013 winner was an enhancement of the AlexNet architecture: it used expanded mid-level convolution layers and incorporated smaller strides and filters in the first convolution layer to capture pixel information in greater detail [112]. In 2014,

Swarm Intelligence in Deep Learning
The introduction of heuristic and meta-heuristic algorithms for designing complex neural network architectures, aimed at tuning the network parameters to optimize the learning process, has brought improvements in the performance of deep learning models.

Different Methods of Adversarial Test Generation
Despite the success of deep learning in various domains, the robustness of the architectures needs to be studied before neural networks are applied in safety-critical systems. In this subsection we discuss the kinds of malicious attacks that can fool or mislead a neural network into outputting wrong decisions, and ways to overcome them.
In the work presented by Tuncali et al. [131], s is the magnitude of a perturbation η which, when added to an input sample, generates an adversarial sample.
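A common instance of this scheme is a fast-gradient-sign-style perturbation; the sketch below is illustrative and not necessarily the exact formulation used in [131]:

```python
import numpy as np

def fgsm_perturbation(x, grad_wrt_x, s):
    """Fast-gradient-sign style attack: eta = s * sign(dLoss/dx), x_adv = x + eta.
    (Illustrative sketch; the cited work's exact formulation may differ.)"""
    eta = s * np.sign(grad_wrt_x)
    return np.clip(x + eta, 0.0, 1.0)     # keep pixels in the valid range

rng = np.random.default_rng(0)
x = rng.uniform(size=(8, 8))              # a toy 'image' with pixels in [0, 1]
grad = rng.normal(size=(8, 8))            # stand-in for a loss gradient
x_adv = fgsm_perturbation(x, grad, s=0.03)

# The perturbation is bounded by the magnitude s in every pixel.
assert np.max(np.abs(x_adv - x)) <= 0.03 + 1e-12
```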

Countermeasures for Adversarial Examples
The paper [132] discusses countermeasures for adversarial examples; further details can be found in [136].

Fraud Detection in Financial Services
Fraud detection is an interesting problem in that it can be formulated in an unsupervised, a supervised, or a one-class classification setting. In the unsupervised category, class labels are either unknown or assumed to be unknown, and clustering techniques are employed to find (i) distinct clusters containing fraudulent samples, or (ii) far-off fraudulent samples that do not belong to any cluster of genuine samples, in which case it is treated as an outlier-detection problem. In the supervised category, class labels are known, and a binary classifier is built to classify fraudulent samples.
Examples of such techniques include logistic regression, naive Bayes, supervised neural networks, decision trees, support vector machines, and fuzzy rule-based classifiers; see [191], Kong et al. [192], Dong et al. [193], Kalogirou et al. [194], Wang et al. [195], Das et al. [196], Dabra et al. [197], Liu et al. [198], Jang et al. [199], and Gensler et al. In today's word-of-mouth marketing, online reviews posted by customers influence buyers' purchase decisions more critically than before. However, fraud can be perpetrated in these reviews too, by posting fake and meaningless reviews that do not reflect customers' genuine purchase experience and opinions.

These pose great challenges for users trying to make the right choices; it is therefore desirable to build a fraud-detection model to identify and weed out fake reviews. Fiore et al. (2019) [144] observed that data imbalance is a crucial issue in payment-card fraud detection and that oversampling has some drawbacks.

In summary, as far as fraud detection is concerned, some progress has been made in applying a few deep learning architectures. However, there is immense potential to contribute to this field, especially through the application of ResNets, gated recurrent units, capsule networks, etc., to detect frauds, including cyber frauds.

Financial Time Series Forecasting
Advances in technology and breakthroughs in deep learning models have led to an increase in intelligent automated trading and decision-support systems in financial markets, especially the stock and foreign exchange (FOREX) markets. However, time series are difficult to predict, especially financial time series [145]. Nonetheless, neural networks and deep learning models have shown great success in forecasting financial time series [146], despite the contradicting efficient market hypothesis (EMH) [147], which holds that the FOREX and stock markets follow a random walk and any profit made is by chance. This success can be attributed to the ability of neural networks to self-adapt to any nonlinear data set without any statistical assumptions or prior knowledge of the data [148].
Deep learning algorithms have used both fundamental- and technical-analysis data, the two most commonly used kinds of input for financial time-series forecasting, to train and build models [145]. Fundamental analysis is the use or mining of textual information, such as financial news, company financial reports, and other economic factors like government policies, to predict price movement. Technical analysis, on the other hand, is the analysis of historical data of the stock and FOREX markets.
The deep learning neural network (DLNN), or multilayer feedforward network (MFF), is the algorithm most used for financial markets [149]. The experimental analysis by Pandey et al. [150] showed that an MFF with Bayesian learning performed better than an MFF trained with backpropagation for the FOREX market.
Deep neural networks, like other machine learning models, are considered black boxes because their internal workings are not fully understood. The performance of a DNN is highly influenced by its parameters for a particular domain. Lasfer et al. [151] analysed the influence of parameters (such as the number of neurons, the learning rate, and the activation function) on stock-price forecasting. Their work showed that a larger NN produces better results than a smaller one, although the effect of the activation function is smaller in a large NN.
Although the CNN is best known for its successes in image recognition and has seen less application in financial markets, it has also shown good performance in forecasting the stock market. As indicated by [151], the deeper the network, the better it can generalize to produce good results; however, the more layers a NN has, the more likely it is to overfit a given data set. The CNN, with its convolution, pooling, and dropout mechanisms, reduces this tendency to overfit [152].
In order to apply a CNN to the financial market, the input data needs to be transformed or adapted for the CNN; with the help of a sliding window, Gudelek et al. [152] performed such a transformation. The EMH [147] holds that financial news affecting price movement is incorporated into the price immediately or gradually; therefore, any investor who can analyze the news first and construct a good trading strategy can profit. Based on this view, and with the rise of big data, there has been an upward trend in sentiment-analysis and text-mining research that utilizes blogs, financial news, and social media to forecast the stock or FOREX market [145]. Santos et al. [154] explored the impact of news articles on company stock prices by implementing an LSTM neural network, pre-trained by a character-level language model, to predict changes in a company's prices for both inter-day and intraday trading.

The results showed that the CNN with a word-level model outperformed the other models. The LSTM character-level model, however, performed better than the RNN-based models and also had lower architectural complexity than the other algorithms.
Moreover, hybrid architectures have been proposed that combine the strengths of more than one deep learning model to forecast financial time series, such as that of Bao et al. This approach to learning might be the breakthrough for intelligent automated trading in the financial markets.

Prognostics and Health Management
The service reliability of the ever-encompassing cyber-physical systems around us has started to garner the undivided attention of the prognostics community. For model-driven approaches to continue to perform well as problem complexity scales, the prior distribution (the physical equations) needs to keep capturing the causalities embedded in the data accurately. However, it has been observed that as sensor data scales, the ability of model-driven approaches to learn the inherent structures in the data has lagged. This is due to the use of simplistic priors and updates which are unable to capture the complex functional relationships in high-dimensional input data. With the introduction of self-regulated learning paradigms such as deep learning, this problem of learning the structure in sensor data was mitigated to a large extent, because it was no longer necessary for an expert to hand-design the physical evolution scheme of the system. With the recent advancements in parallel computational capabilities, techniques leveraging the volume of available data have begun to shine. One key issue to keep in mind is that data-driven approaches are only as good as the labeled data available for training. While the surplus of sensor data may motivate such approaches, it is critical that the precursor to the supervised part of learning, i.e. data labeling, is accurate. This often requires laborious and time-consuming effort and is not guaranteed to result in near-accurate ground truth. However, when adequate precautions are in place and a strategic implementation facilitating optimal learning is achieved, it is possible to deliver customized solutions to complex prediction problems with an accuracy unmatched by simpler, model-driven approaches. Therein lies the holy grail of deep learning: the ability to scale learning with training data.
Recent works on device health forecasting include that of Basak et al.

Medical Image Processing
Deep learning techniques have pervaded the entire discipline of medical image processing, and the number of studies highlighting their application in canonical tasks such as image classification, detection, enhancement, image generation, registration and segmentation has risen sharply. A recent survey by Litjens et al. [209] presents a collective picture of the prevalence and applications of deep learning models within the community, as does a fairly rigorous treatise of the same by Shen et al. [210]. A concise overview of recent work in some of these canonical tasks follows.

Figure 11: MRI brain slice and its different segmentations [213]

In 2018, Rajaraman et al. [169] used specialized CNN architectures such as ResNet for detecting malarial parasites in thin blood smear images. Kang et al. [170] improved on the performance of 2D CNNs by using a 3D multi-view CNN for lung nodule classification, exploiting spatial contextual information with the help of a 3D Inception-ResNet architecture.
Object/lesion detection aims to identify different parts/lesions in an image.
Although object classification and object detection are quite similar to each other, the challenges are specific to each category. In object detection, class imbalance can pose a major hurdle to model performance. Object detection also involves identifying localized information (specific to different parts of an image) from the full image space; the task is therefore a combination of localization and classification [212]. In 2016, Hwang and Kim proposed a self-transfer learning (STL) framework which optimizes both aspects of the medical object detection task, and tested it on the detection of nodules in chest radiographs and lesions in mammography [171]. For image registration, the authors of [174] developed CNN regressors to directly estimate the registration transformation parameters. In addition, image generation and enhancement techniques are discussed in [175], [176].
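Detection models of this kind are typically scored by the overlap between predicted and annotated bounding boxes. The sketch below shows the standard intersection-over-union (IoU) metric; it is a generic illustration, not specific to the STL framework of [171].

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping rectangle, clipped to zero when the boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter else 0.0

# A predicted lesion box vs. an annotated ground-truth box:
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```

A prediction is usually counted as a true positive only when its IoU with a ground-truth box exceeds a threshold (0.5 is a common choice), which is one place where class imbalance between lesion and background regions makes evaluation delicate.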
So far, applications of deep learning in medical image processing have produced satisfactory results in most cases. However, in a sensitive field like medical imaging, prior knowledge should be incorporated into detection, recognition and reconstruction tasks so that data-driven approaches do not produce implausible results [215].

Power Systems
Artificial Neural Networks (ANN) have rapidly gained popularity among power system researchers [177]. Since their introduction to the power systems area in 1988 [178], numerous applications of ANNs to problems of electric power systems have been proposed. The recent development of Deep Learning (DL) methods, however, has produced powerful tools that can handle large data sets and often outperform traditional machine learning methods in problems related to the power sector [179].

Load forecasting is one of the most important tasks for the efficient operation of a power system. It allows the system operator to schedule spinning reserve allocation, decide on possible interchanges with other utilities and assess the system's security [180]. A small decrease in load forecasting error may result in a significant reduction of the total operating cost of the power system [181]. Among the artificial intelligence techniques applied to load forecasting, methods based on ANNs have undoubtedly received the largest share of attention [182]. A basic reason for their popularity lies in the fact that ANN techniques are well suited to energy forecasting [183]: they can obtain adequate estimates when data are incomplete [184] and can consistently deal with complex non-linear problems [185]. Park et al. [186] presented one of the first approaches applying ANNs to load forecasting; the efficiency of the proposed Multi-layer Perceptron (MLP) was demonstrated by benchmarking it against a numerical forecasting method frequently used by utilities. As an evolution of ANN forecasting techniques, DL methods are expected to increase prediction accuracy by allowing higher levels of abstraction [187]. Thus, DL methods are gradually gaining popularity due to their ability to capture data behaviour in the presence of complex non-linear patterns and large amounts of data. In [188], an end-to-end network was proposed comprising several stacks of RBMs, which were pre-trained layerwise.
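To make the forecasting setup concrete, the sketch below builds lag features from an hourly load series and fits a simple linear autoregressive baseline by least squares. The data are synthetic and the model is a stand-in chosen for brevity, not the RBM network of [188]; the `make_lag_matrix` helper is an assumption made for the example.

```python
import numpy as np

def make_lag_matrix(load, n_lags=24):
    """Build (X, y) where each row of X holds the previous n_lags hourly
    loads and y is the load at the next hour."""
    load = np.asarray(load, dtype=float)
    X = np.stack([load[i : i + n_lags] for i in range(len(load) - n_lags)])
    y = load[n_lags:]
    return X, y

# Synthetic hourly load with a daily cycle plus noise (illustrative data)
rng = np.random.default_rng(0)
t = np.arange(24 * 60)  # 60 days of hourly observations
load = 100 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, t.size)

X, y = make_lag_matrix(load, n_lags=24)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # linear autoregressive fit
mae = np.mean(np.abs(X @ coef - y))
print(round(mae, 2))
```

A deep model replaces the linear map `X @ coef` with a learned non-linear function of the same lag features, which is where the accuracy gains reported in the literature come from.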
Rahman et al. [191] proposed two models based on Recurrent Neural Network (RNN) architectures, aiming to predict medium- and long-term electricity consumption in residential and commercial buildings at one-hour resolution; the approach combines an MLP with an LSTM-based model using an encoder-decoder architecture. A model based on an LSTM-RNN framework with appliance consumption sequences for short-term residential load forecasting was proposed in [192]; the researchers showed that their method outperforms other state-of-the-art methods for load forecasting. In [193], a Convolutional Neural Network (CNN) with k-means clustering was proposed: k-means partitions the large amount of data into clusters, which are then used to train the networks. The method showed improved performance compared with the case where k-means clustering is not used.
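The clustering step of a pipeline like that of [193] can be illustrated with a minimal k-means sketch on synthetic daily load profiles. The `kmeans` helper, its farthest-point initialization, and the two synthetic profile groups are illustrative assumptions; in the full pipeline each resulting cluster would train its own forecasting network.

```python
import numpy as np

def kmeans(X, k=2, n_iter=20):
    """Minimal Lloyd's algorithm; returns (centroids, labels)."""
    # Farthest-point initialization: start from sample 0, then repeatedly
    # add the sample farthest from the centroids chosen so far.
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign each profile to its nearest centroid, then recompute means
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Two synthetic groups of daily load profiles (24 hourly values each)
rng = np.random.default_rng(1)
weekday = 100 + 30 * np.sin(np.linspace(0, 2 * np.pi, 24))
weekend = 80 + 10 * np.sin(np.linspace(0, 2 * np.pi, 24))
profiles = np.vstack([weekday + rng.normal(0, 2, (50, 24)),
                      weekend + rng.normal(0, 2, (50, 24))])

centroids, labels = kmeans(profiles, k=2)
```

Partitioning first lets each network specialize on one consumption regime (e.g. weekday vs. weekend shapes) instead of averaging across all of them.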
The utilization of DL techniques for modelling and forecasting in renewable energy systems is progressively increasing. Since the data in such systems are inherently noisy, they may be adequately handled with ANN architectures [194]. Moreover, because renewable energy data are complicated in nature, shallow learning models may be insufficient to identify and learn the corresponding deep non-linear and non-stationary features and traits [195]. Among the various renewable energy sources, wind and solar energy have gained the most popularity due to their potential and high availability [196]. As a result, recent research endeavours have focused on developing DL techniques for the problems related to the deployment of these renewable energy sources.
Photovoltaic (PV) energy has received much attention due to its many advantages: it is abundant, inexhaustible and clean [197]. However, due to the chaotic and erratic nature of weather systems, the power output of PV energy systems is intermittent, volatile and random [198]. These uncertainties may degrade real-time control performance, reduce system economics, and thus pose a great challenge for the management and operation of electric power and energy systems [199]. For these reasons, accurate forecasting of PV power output plays a major role in ensuring optimum planning and modelling of PV plants. In [195], the model that exhibited the best performance is the Auto-LSTM network, which combines the feature extraction ability of the autoencoder with the forecasting ability of the LSTM. In [201], an LSTM-RNN is proposed for forecasting the output power of solar PV systems; the authors examine five different LSTM architectures in order to find the one with the highest forecasting accuracy on the examined data sets, which are retrieved from two cities in Egypt. The network that provided the highest accuracy is the LSTM with memory between batches.
With the advantages of non-pollution, low cost and remarkable benefits of scale, wind power is considered one of the most important sources of energy [202]. ANNs have been widely employed for processing the large amounts of data obtained from the data acquisition systems of wind turbines [203].

A number of related surveys, grouped by topic, are listed below:

Time Series Forecasting
• A survey on retail sales forecasting and prediction in fashion markets [227]
• Electric load forecasting: Literature survey and classification of methods [228]
• A review of unsupervised feature learning and deep learning for time-series modeling [229]

Image Processing
• A Survey on Deep Learning in Medical Image Analysis [209]
• A Comprehensive Survey of Deep Learning for Image Captioning [230]
• Biological image analysis using deep learning-based methods: Literature review [231]
• Deep learning for remote sensing image classification: A survey [232]

Natural Language Processing
• Deep Processing [250]
• A survey on the state-of-the-art machine learning models in the context of NLP [251]
• Inflectional Review of Deep Learning on Natural Language Processing [252]
• Deep learning for natural language processing: advantages and challenges [253]

Big Data
• A survey on deep learning for big data [262]
• A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective [263]
• A survey of machine learning for big data processing [264]
• Deep learning in big data analytics: A comparative study [265]
• Deep learning applications and challenges in big data analytics [266]

Conclusions and Future Work
In conclusion, we highlight a few open areas of research and elaborate on some of the existing lines of thought and studies addressing the challenges that lie within them.
• Challenges with scarcity of data: With the growing availability of data as well as of powerful and distributed processing units, deep learning architectures can be successfully applied to major industrial problems. However, deep learning is traditionally big-data driven and lacks the efficiency to learn abstractions through clear verbal definitions [267] if not trained with billions of training samples. Moreover, the heavy reliance on Convolutional Neural Networks (CNNs), especially for video recognition, could face exponential inefficiency leading to their demise [268]; this can be avoided by capsules [269], which capture critical spatial hierarchical relationships more efficiently than CNNs and with smaller data requirements. To make DL work with smaller data sets, approaches in use include data augmentation, transfer learning, recursive classification techniques and synthetic data generation. One-shot learning [270] is also opening new avenues for learning from very few training examples, and has already shown progress in language processing and image classification tasks.
Developing more general techniques that allow DL models to learn from sparse or limited data representations is a current research thrust.
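As one concrete example of the data augmentation mentioned above, the hypothetical `augment` helper below produces jittered and rescaled copies of a single labeled 1-D signal; the perturbation parameters are illustrative assumptions, not values from any cited work.

```python
import numpy as np

def augment(signal, n_copies=4, noise_std=0.02, scale_range=(0.9, 1.1), seed=0):
    """Create jittered and rescaled copies of a 1-D signal.

    A common trick when labeled examples are scarce: each copy keeps the
    label of the original while perturbing amplitude and adding noise.
    """
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_copies):
        scale = rng.uniform(*scale_range)        # random amplitude rescaling
        noise = rng.normal(0.0, noise_std, size=signal.shape)
        out.append(scale * signal + noise)
    return np.stack(out)

x = np.sin(np.linspace(0, 2 * np.pi, 128))   # one labeled training example
batch = augment(x, n_copies=8)
print(batch.shape)  # (8, 128)
```

The same idea carries over to images (flips, crops, small rotations): the perturbations must be label-preserving for the task at hand, which is why augmentation policies are usually domain-specific.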
• Adopting unsupervised approaches: A major thrust is towards combining deep learning with unsupervised learning methods. Systems that set their own goals [267] and develop problem-solving approaches while exploring the environment are the future research directions, surpassing supervised approaches that require large amounts of data a priori. Thus, the thrust of AI research, including deep learning, is towards meta-learning, i.e. learning to learn, which involves automated model design and decision-making capabilities in the algorithms themselves, optimizing the ability to learn various tasks from fewer training data [271].
• Influence of cognitive neuroscience: Inspiration drawn from cognitive neuroscience and developmental psychology to decipher human behavioural patterns has been able to bring major breakthroughs in applications such as enabling artificial agents to learn spatial navigation on their own, a capability which comes naturally to most living beings [272].
• Neural networks and reinforcement learning: Meta-modeling approaches using Reinforcement Learning (RL) are being used to design problem-specific neural network architectures. In [273], the authors introduced MetaQNN, an RL-based meta-modeling algorithm that automatically generates CNN architectures for image classification using Q-learning [274] with ε-greedy exploration. AlphaGo, the computer program built by combining reinforcement learning and CNNs to play the game of Go, achieved great success by beating professional human Go players. Deep convolutional neural networks can also serve as function approximators to predict Q-values in a reinforcement learning problem. A major thrust of current research is therefore the superposition of neural networks and reinforcement learning, geared towards problem-specific requirements.
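The Q-learning update with ε-greedy exploration that MetaQNN builds on can be illustrated on a toy chain MDP. This is a generic tabular sketch, not the architecture-search action space of [273]; the environment and all parameter values are assumptions chosen for the example.

```python
import numpy as np

def q_learning_chain(n_states=6, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration on a chain MDP.

    States 0..n_states-1; actions: 0 = left, 1 = right. Reaching the
    rightmost state yields reward 1 and ends the episode.
    """
    rng = np.random.default_rng(seed)
    Q = np.ones((n_states, 2))   # optimistic init encourages exploration
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            a = rng.integers(2) if rng.random() < epsilon else int(Q[s].argmax())
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Terminal states bootstrap from the reward alone
            target = r if s_next == n_states - 1 else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

Q = q_learning_chain()
greedy_policy = Q.argmax(axis=1)   # "move right" at every state before the goal
```

In MetaQNN the same update rule applies, but a "state" is a partial network architecture and an "action" adds the next layer, so the learned Q-values steer the search towards high-accuracy CNN designs.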
This review has aimed to aid both the beginner and the practitioner in the field in making informed choices, and has provided an in-depth analysis of some recent deep learning architectures as well as an exploratory dissection of some pertinent application areas. It is the authors' hope that readers find the material engaging and informative, and we openly encourage feedback to make the organization and content of this article more closely aligned with a formal extension of the literature within the deep learning community.