Deep Reinforcement Learning For Trading - A Critical Survey

Deep reinforcement learning (DRL) has achieved significant results in many Machine Learning (ML) benchmarks. In this short survey we provide an overview of DRL applied to trading on financial markets, including a short meta-analysis using Google Scholar, with an emphasis on using hierarchy for dividing the problem space as well as on using model-based RL to learn a world model of the trading environment which can be used for prediction. In addition, multiple risk measures are defined and discussed, which not only provide a way of quantifying the performance of various algorithms but can also act as (dense) reward-shaping mechanisms for the agent. We discuss in detail the various state representations used for financial markets, which we consider critical for the success and efficiency of such DRL agents. The market in focus for this survey is the cryptocurrency market.


Introduction
Predicting and analyzing financial indices has long been of interest to the financial community, and recently the Machine Learning (ML) community has shown wide interest in applying its more advanced techniques, such as deep networks and deep reinforcement learning, to this problem. In addition, the cryptocurrency market has gained worldwide popularity, with Bitcoin in particular making the top news almost weekly.
In this survey article we explore different aspects of using DRL for trading. We first use Google Scholar to get a glimpse of what the corpus of papers using DRL for trading looks like, then go through the main concepts at the intersection of trading and DRL, such as using different measures of risk for reward shaping and embedding some form of forecasting or prediction into the agent's decision-making. We then explain the contributions of the first DRL paper [6] in more detail and continue with the specifics of applying DRL to trading markets. Next, we go deeper into the technologies which look most promising, hierarchical DRL and model-based RL, and finish by stating some of the commonalities and methodological issues we found in the corpus explored.
We begin by referring to a few survey articles which are related to this work:
• [1] looks only at model-free methods and is thus limited in scope. Moreover, it discusses some aspects of the portfolio management problem from a financial-economic point of view, such as the relation between the Kelly Criterion, Risk Parity and Mean-Variance. We are interested in the RL perspective, and we need risk measures only to feed the agents with helpful reward functions. We also discuss the hierarchical approach, which we consider quite promising, especially for applying a DRL agent to multiple markets.
• [2] looks only at the prediction problem using deep learning.
• [3] has a similar outlook to this work, but the works discussed are quite old, with very few using deep learning techniques.
• [4] looks at applications in economics more broadly, not just trading, and the deep reinforcement learning part is quite small within the whole article.
We focus on the cryptocurrency markets, but without much loss of generality, since trading is very similar across markets; the main difference is the volatility of the assets in each market type, with cryptocurrencies being the most volatile. We do advise the curious reader to take a look at the works mentioned above, since they provide different perspectives and plenty of resources to further one's research. Following [5], we take a similar approach in our investigation. We use Google Scholar for a meta-analysis, searching through the papers instead of downloading and individually inspecting them, since the latter is a cumbersome process and a large number of papers are behind a paywall. This is possible for two reasons:
• All papers using DRL for trading will most probably cite the first paper that introduced DQN [6].
• Because of this, we can use the Scholar feature which searches through the citing articles.
The following data was obtained through this methodology. As of today (18 October 2021), the total number of articles citing [6] is 16652. By adding trading, market, reward, action and state as search words, we narrow it down to 395. We have to mention here that we had to put every word between quotes, otherwise Google Scholar sometimes behaved a bit stochastically, in the sense that by changing the position of market in the search query we got a different number of results. We investigate the type of action space, state space, market type, input data and whether the papers use some form of prediction or not, by using representative keywords (e.g. continuous actions vs. discrete actions). Of course, the following results should be taken with a grain of salt, in the sense that some keywords might appear in a paper as references or mentions rather than actual uses. We think these results provide a rough idea of the underlying structures, but definitely not a precise one. More details are given in Figures 1 and 2.
The cryptocurrency market has been of wide interest for economic agents for some time now. Everything started with the whitepaper on Bitcoin [7]. This is in part due to the fundamentally different principles on which it functions (i.e. coins are minted / mined, they are decentralised, etc.) [8], but also due to the extremely high volatility the market is experiencing [9]. This means there is room for significant profit for the smart agent, but also the possibility of huge losses for the speculative trader. Moreover, the market capitalisation is increasing almost continuously, which means new actors are entering and investing actively in the crypto markets. When looking at the problem of automatically trading on these markets, we can define a few quite different problems, which have typically been tackled separately:

• The prediction problem, where we want to accurately predict some timesteps into the future based on the past and on the general dynamics of the time-series. What we mean by this is that, for example, gold or oil do not have such large fluctuations as the crypto markets; when predicting the next price point, it is advisable to take this information into account. One way to do this is to consider general a priori models, based on large historical samples, so that this historical dynamics is embedded into the prediction model. This functions as a type of general prior information: for example, it is unreasonable to consider models where gold rises 100 percent overnight, since this has never happened and probably never will. Deep learning models have been successfully applied to such predictions [2,[10][11][12], as have many other ML techniques [13][14][15]. However, for the crypto markets this is not so implausible: some coins did in fact rise thousands of percent overnight, and many still rise significantly every day 1 . In the context of RL, prediction is usually done through model-based RL [16], where a simulation of the environment is created based on the data seen so far; prediction then amounts to running this simulation to see what comes next (at least according to the prediction model). More on model-based RL in Section 9.1. For a comprehensive survey see [17].

• The strategy itself might be based on the prediction, but it can embed other information as well, for example the certainty of the prediction (which should inform how much we would like to invest, assuming a rising or bull future market) or external information sources, like social media sentiment [18] (e.g. are people positive or negative about an asset?), news [19], volume, or other related information which might influence the decision. The strategy should also be customisable, in the sense that we might want a risk-averse strategy or a risk-seeking one. This cannot be done from a point prediction only, but there are some ways of estimating risk for DRL [20]. For all these reasons we consider the strategy, or decision-making, a quite different problem from the prediction problem. This is also the general trend in the related literature [21,22].

• The portfolio management problem is encountered when dealing with multiple assets. Until now we have been talking only about a single asset, but if some amount of resources is split among multiple concurrent assets, then we have to decide how to split these resources. At first the crypto markets were highly correlated (if Bitcoin was going up then most other cryptocurrencies would go up, so this differentiation did not matter too much), but more recently it has been shown that the correlation among different pairs is significantly decreasing 2 . Thus, an investing strategy is needed to select among the most promising (for the next time slice) coins. The time slice at which one rebalances is also a significant parameter of the investing strategy; rebalancing usually means deciding how to split the resources (often evenly) among the most promising coins.
Most of the work on trading with smart techniques like DRL is situated here; see for example [1,[23][24][25][26][27][28]. Not all the related works we cite deal with the cryptocurrency markets, but since all financial markets are quite similar, we consider them relevant for this article.

• The risk associated with an investment is inevitable. However, through smart techniques, a market actor might minimize the risk of an investment while still achieving significant profits. Also, some actors might want to follow a risk-averse strategy while others might be risk-seeking. The risk of a strategy is generally measured using the Sharpe ratio or the Sortino ratio, which take into account a risk-free investment (e.g. a U.S. treasury bond or cash); even though we acknowledge the significance of these measures, they do have some drawbacks. We will discuss these two measures in more detail, give their formal descriptions and propose alternative measures of risk. Moreover, if the prediction is used in the strategy, the uncertainty of the prediction can itself provide a measure of risk: if the prediction that some asset will go up is very uncertain then the risk is high, and conversely, if the prediction is certain, the risk is lower.
We show an overview of the above mentioned financial concepts, the relations between them and their RL correspondents in Figure 3.

Prediction
Prediction or forecasting of a time-series is the process of estimating what will happen in the next time period based on what we have seen so far. Will the price go up or down, by how much and for how long? A comprehensive prediction provides a probability distribution of these quantities (if it is not a point estimate). Many older works deal with the prediction problem in a financial market, assuming the strategy is clear once the prediction is accurate. This is indeed so, but quite often the predictions are not correct, and this can significantly impact the performance of the trading strategy. However, it has been shown that some techniques can predict even quasi-periodic time-series, which is quite remarkable considering the fundamentally chaotic nature of such time-series [29,30]. Depending on the nature of the time-series, the prediction horizon is generally one time-step, and for each new time-point prediction the ground truth is fed back to the network. If instead the prediction is fed back, so that multiple time-steps are predicted, the quality degrades rapidly due to accumulating errors. However, there are some exceptions, see for example [31]. One way to overcome this limitation is to look at the data at a coarser granularity. For example, if one looks at minute data (by this we mean that the prediction machinery is fed with this type of data), to predict 30 minutes into the future one needs to predict 30 time-points, whereas if one looks at 30-minute data (meaning each point is a summarization of a 30-minute time-slice), only a one-step prediction is needed. Any strategy built on top of a prediction module is critically dependent on this sampling period. By using multiple different time-scales, one can have a much better picture of the future, especially if the predictions at different levels are combined (for example using ensembles). Surprisingly, there are not many related works which use such a technique, but there are some [28].
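
To make the granularity point concrete, the sketch below (a minimal illustration of ours, not taken from any of the cited works) resamples minute-level candles into coarser bars with pandas, so that a 30-minute-ahead forecast becomes a one-step prediction at the coarser scale; the column names are assumptions.

```python
import pandas as pd

# Assumed input: a DataFrame of minute-level candles indexed by timestamp,
# with columns 'open', 'high', 'low', 'close', 'volume' (names are illustrative).
def resample_ohlcv(df_minute: pd.DataFrame, rule: str = "30min") -> pd.DataFrame:
    """Summarize minute candles into coarser bars, so that a 30-minute-ahead
    forecast becomes a one-step-ahead prediction at the coarser granularity."""
    return df_minute.resample(rule).agg({
        "open": "first",
        "high": "max",
        "low": "min",
        "close": "last",
        "volume": "sum",
    }).dropna()

# Predictions made at several granularities (e.g. 5min, 30min, 4h) can then be
# combined, for instance in an ensemble, to obtain a richer picture of the future.
```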

Strategy for RL
The strategy for trading can vary widely, depending on its scope, the data sampling rate, the risk quantification choice and more. We distinguish a few ways to organise the type of strategy:

• Single- or multi-asset strategy. The strategy needs to take into account data for each of the assets, either time-based or price-based. This can be done individually (e.g. one strategy per asset) or in an aggregated manner.

• The sampling rate and nature of the data. We can have strategies which make decisions at second intervals or even faster (High-Frequency Trading, HFT [32]), make a few decisions throughout a day (intraday) [33], day trade (all open orders are closed at the end of each day) [34], or act at longer intervals. The data fed to the agent is usually time-based, but there are also price-based environments for RL [35].

• The reward fed to the RL agent completely governs its behaviour, so a wise choice of the reward-shaping function is critical for good performance. There are quite a number of rewards one can choose from or combine, from risk-based measures to profitability or cumulative return, the number of trades per interval, etc. The RL framework accepts any sort of reward, and the denser the better. For example, in [35] the authors test 7 different reward functions based on various measures of risk or PnL (Profit and Loss); a minimal PnL-style reward is sketched after this list.
There are also works which discover or learn algorithmic trading rules through various methods like evolutionary computation [36] or deep reinforcement learning [37].
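
As a concrete illustration of the simplest, profit-based family of rewards, the sketch below computes a dense per-step reward from the log-return of the held position minus a transaction-cost penalty. It is a minimal example of ours, not the reward used in any particular cited work, and the fee value is an arbitrary placeholder.

```python
import math

def pnl_reward(position: int, price_prev: float, price_now: float,
               traded: bool, fee: float = 0.001) -> float:
    """Illustrative dense reward: log-return of the held position minus a
    transaction-cost penalty when an order was executed at this step.
    `position` is +1 (long), 0 (flat) or -1 (short); `fee` is a per-trade cost."""
    reward = position * math.log(price_now / price_prev)
    if traded:
        reward -= fee
    return reward
```

Risk-based alternatives (Sharpe-style or drawdown-based rewards) are discussed in the Risk section below.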

Benchmark strategies
When evaluating this type of work using backtests, one needs some benchmark strategies to compare with. There is a wide variety to choose from, but we restrict ourselves to just a few for brevity. We list them below in increasing order of complexity:
• Buy and Hold. This is the classical benchmark [38][39][40] when trading on the cryptocurrency markets; since the markets have been rising significantly, simply buying an asset and holding it until the end of the testing period provides an informative baseline profit measure.

• Uniform constant rebalanced portfolio (UCRP) [41], which rebalances the assets every day, keeping the same distribution among them.

• On-line moving average reversion (OLMAR) [42] is an extension of the moving average reversion (MAR) strategy [43] which overcomes some of its limitations. MAR tries to make profits by assuming the asset will return (revert) to its moving average value in the near future.
• A deep RL architecture [27] very well received by the community and with strong results. It has also been used for comparison in other related works, since the code is readily available 3 .

Portfolio management
In a financial market, one of the main issues is in which assets to invest, i.e. which coins to buy. There are many available assets; on Binance (the largest cryptocurrency platform by market capitalisation), for example, there are hundreds of coins, and each can form a pair with a few major stablecoins like USDT (Tether). Thus the choices are numerous and the decision is hard. The portfolio management problem deals with allocating the cash resources (this can be any stable currency which does not suffer from the fluctuations of the market, a baseline) into a number of assets m. Formally, this means finding a set of weights, where at the beginning the weights are w = {1, 0, 0, ..., 0}, meaning we hold only cash. The goal is to increase the value of the portfolio by selecting the assets which give the highest return. If at time-step t the overall value of the portfolio is given by

$$V_t = \sum_{i=0}^{m} q_{i,t}\, p_{i,t},$$

where p_{i,t} is the sell price of asset i at time t and q_{i,t} is the quantity held (cash being asset 0 with price 1), then the goal is V_{t+1} > V_t. Prices and asset quantities change during trading periods, and when switching back to cash (which is what the above equation measures), the new amount should be larger than the previous cash value of the portfolio.
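
A minimal sketch of these quantities, under the assumption that cash is treated as one of the assets with price 1 and that transaction costs are ignored:

```python
import numpy as np

def portfolio_value(quantities: np.ndarray, sell_prices: np.ndarray) -> float:
    """V_t = sum_i q_i * p_i; cash can be included as an asset with price 1."""
    return float(np.dot(quantities, sell_prices))

def rebalance(quantities: np.ndarray, prices: np.ndarray,
              target_weights: np.ndarray) -> np.ndarray:
    """Quantities that realise `target_weights` (fractions of the current
    portfolio value, summing to 1), ignoring transaction costs."""
    value = portfolio_value(quantities, prices)
    return target_weights * value / prices
```

For example, starting from all cash (w = {1, 0, ..., 0}), a uniform rebalance over m coins uses a target weight of 1/m for each coin and 0 for cash.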

Risk
Risk has been quantified (measured) and/or managed (fed as reward to the DRL agent, or part of the reward) through various measures throughout the years. We state below what we consider to be some of the most important ones.

Sharpe ratio and Sortino ratio
One of the most ubiquitous is the Sharpe ratio [44], together with the related Sortino ratio. These are performance measures of investments which take risk into account by considering an extra term, the return of a risk-free asset (originally it was suggested that U.S. treasury bonds be used here). Formally, the Sharpe ratio is defined as

$$S_a = \frac{\mathbb{E}[R_a - R_b]}{\sqrt{\mathrm{var}[R_a - R_b]}},$$

where R_a is the asset return and R_b is the risk-free return, with the expectation taken over the excess return of the asset compared to the risk-free return; var is the variance of this excess, so the denominator is the standard deviation of the excess. To accurately measure the risk of an investment strategy, this should be taken over comprehensive periods in which all types of transactions happen; we need a representative sample for the Sharpe ratio to give an accurate estimate of the risk.
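
An empirical estimate from a series of per-period returns can be computed as below (a minimal sketch; the population standard deviation is used in the denominator):

```python
import numpy as np

def sharpe_ratio(asset_returns: np.ndarray, risk_free_returns: np.ndarray) -> float:
    """Empirical Sharpe ratio: mean excess return over its standard deviation.
    Both arguments are per-period return series of equal length."""
    excess = asset_returns - risk_free_returns
    return float(excess.mean() / excess.std())
```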

Sortino ratio
The Sortino ratio is a modified version of the Sharpe ratio in which only negative outcomes (failures) are penalized, quadratically. It can be written as

$$\text{Sortino} = \frac{R - T}{DR},$$

where R is the average return of the asset of interest, T is the target rate of return (also called MAR, the minimum acceptable return) and DR is the downside risk, i.e. the target semi-deviation (square root of the target semi-variance), formally given by

$$DR = \sqrt{\int_{-\infty}^{T} (T - r)^2 f(r)\, dr},$$

where r is the random variable representing the return (this might be annual) and f(r) is the distribution of returns.
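
An empirical counterpart, in which only returns below the target contribute to the denominator, might look as follows (a sketch, assuming a per-period return series):

```python
import numpy as np

def sortino_ratio(returns: np.ndarray, target: float = 0.0) -> float:
    """Empirical Sortino ratio: only returns below the target rate (the MAR)
    contribute to the downside risk in the denominator."""
    downside = np.minimum(returns - target, 0.0)
    downside_risk = np.sqrt(np.mean(downside ** 2))
    return float((returns.mean() - target) / downside_risk)
```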

Differential Sharpe ratio
In an online learning setting, as is the case for RL (if this measure is used as a reward-shaping mechanism), a different measure is needed, one which does not require all the returns in order to compute the expectation, since these are obviously not available online. To overcome this limitation, [45] derived a new measure given by

$$D_t = \frac{B_{t-1}\,\Delta A_t - \frac{1}{2} A_{t-1}\,\Delta B_t}{\left(B_{t-1} - A_{t-1}^2\right)^{3/2}},$$

where A_t and B_t are exponential moving estimates of the first and second moments of R_t, with R_t the estimate of the return at time t (R_T is the usual return used in the standard Sharpe ratio, with T being the final step of the evaluated period), so a small t usually refers to the current timestep. A_t and B_t are defined as

$$A_t = A_{t-1} + \eta\,\Delta A_t, \qquad \Delta A_t = R_t - A_{t-1},$$
$$B_t = B_{t-1} + \eta\,\Delta B_t, \qquad \Delta B_t = R_t^2 - B_{t-1},$$

with η the adaptation rate of the moving estimates.
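
The update can be implemented online, which is what makes it attractive as a dense reward; the following is a minimal sketch of the recursion above, where `eta` is the adaptation rate and the very first steps return 0 to avoid a zero (or negative) denominator:

```python
class DifferentialSharpe:
    """Online differential Sharpe ratio, a minimal sketch of the update in [45]."""

    def __init__(self, eta: float = 0.01):
        self.eta = eta
        self.A = 0.0  # exponential moving estimate of the first moment of R_t
        self.B = 0.0  # exponential moving estimate of the second moment of R_t

    def step(self, r_t: float) -> float:
        dA = r_t - self.A
        dB = r_t ** 2 - self.B
        var = self.B - self.A ** 2
        d_t = 0.0 if var <= 0 else (self.B * dA - 0.5 * self.A * dB) / var ** 1.5
        self.A += self.eta * dA
        self.B += self.eta * dB
        return d_t  # can be fed to the agent as a dense, risk-aware reward
```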

Value at risk
Another measure of risk is the so-called value at risk (VaR) [23,37,46]: with probability at most α a loss greater than the VaR will occur, and with probability at least 1 − α a loss smaller than the VaR will occur. It is the smallest number y such that P(Y ≤ y) ≥ 1 − α, where Y = −X is the loss. Formally, the VaR is defined as

$$\mathrm{VaR}_\alpha(X) = -\inf\{x \in \mathbb{R} : F_X(x) > \alpha\} = F_Y^{-1}(1-\alpha),$$

where X is the random variable describing the profits and losses and F_X is its cumulative distribution function. This can also be seen as the (1 − α)-quantile of Y. However, it implies that we assume a parametric form for the distribution of X, which is quite limiting in practice.

Conditional value at risk
This is also called expected shortfall [23,46] or average value at risk and is related to VaR, but it is more sensitive to the shape of the tail of the loss distribution. It is a more conservative way of measuring risk, since for the gains it ignores the highest, least likely profits, and for the losses it focuses on the worst ones. Formally, CVaR is defined as

$$\mathrm{CVaR}_\alpha(X) = -\frac{1}{\alpha}\Big(\mathbb{E}\big[X\,\mathbf{1}_{\{X \le x_\alpha\}}\big] + x_\alpha\big(\alpha - P(X \le x_\alpha)\big)\Big),$$

with x_α the lower α-quantile, given by

$$x_\alpha = \inf\{x \in \mathbb{R} : P(X \le x) \ge \alpha\},$$

and 1 being the indicator function. For more details and for other measures of risk see 4 .
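
Both quantities can also be estimated non-parametrically from a historical sample of profit-and-loss values, which avoids the parametric-form limitation mentioned for VaR; a minimal sketch:

```python
import numpy as np

def var_cvar(pnl: np.ndarray, alpha: float = 0.05):
    """Historical (non-parametric) VaR and CVaR at level alpha, estimated from a
    sample of profit-and-loss values (losses are negative entries of `pnl`).
    Both are returned as positive loss magnitudes."""
    q = np.quantile(pnl, alpha)         # lower alpha-quantile of the P&L sample
    var = -q
    cvar = -pnl[pnl <= q].mean()        # average of the worst alpha-fraction
    return float(var), float(cvar)
```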

Turbulence
In [47] the authors use a measure of turbulence to indicate whether the market is in some extreme condition. If this value rises above some threshold, the algorithm is stopped and all assets are sold. The quantity is time-dependent and is defined as

$$\text{turbulence}_t = (x_t - \mu)\, \Sigma^{-1}\, (x_t - \mu)^{\top} \in \mathbb{R},$$

where x_t ∈ R^D is the vector of asset returns at timestep t, μ ∈ R^D is the mean of the historical returns and Σ ∈ R^{D×D} is the covariance of the historical returns, with D the number of assets.
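
A direct implementation of this Mahalanobis-style index, assuming a history of per-asset returns with one row per timestep, might look like this:

```python
import numpy as np

def turbulence(returns_t: np.ndarray, hist_returns: np.ndarray) -> float:
    """Turbulence index: distance of the current return vector (one entry per
    asset) from the historical mean, scaled by the inverse covariance of the
    historical returns (rows of `hist_returns` are timesteps)."""
    mu = hist_returns.mean(axis=0)
    sigma_inv = np.linalg.pinv(np.cov(hist_returns, rowvar=False))
    d = returns_t - mu
    return float(d @ sigma_inv @ d)
```

The pseudo-inverse is used here only to keep the sketch robust when the covariance matrix is singular.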

Maximum drawdown (MDD)
The maximum drawdown is a measure of downside risk: it represents the maximum loss from a peak to a trough over some period, and is given by

$$\mathrm{MDD} = \frac{TV - PV}{PV},$$

where TV is the trough value (the lowest value an asset reaches before recovering) and PV is the peak value of the asset in the period of interest. Obviously this says nothing about the frequency of losses or about the magnitude or frequency of wins, but it is nonetheless considered very informative (and is usually complemented by other measures), with a small MDD indicating a robust strategy. The Calmar ratio is another measure which uses the MDD as its only quantification of risk. For a comprehensive overview of the Calmar ratio, MDD and related measures see [48].
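
Over a series of portfolio values, the MDD can be computed with a running peak, as in the short sketch below (the result is negative, e.g. -0.3 for a 30% drawdown):

```python
import numpy as np

def max_drawdown(values: np.ndarray) -> float:
    """Largest peak-to-trough loss of a value series, as a fraction of the peak."""
    running_peak = np.maximum.accumulate(values)
    drawdowns = (values - running_peak) / running_peak
    return float(drawdowns.min())
```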

Deep Reinforcement Learning
Deep reinforcement learning is a recent field of ML that combines two powerful techniques: the data-hungry deep learning and the older, process-oriented reinforcement learning (RL). Reinforcement learning was born as a heuristic approach descending from dynamic programming. Sutton and Barto, in their seminal work [49], laid the basis for a whole new field, which would turn out to be highly influential throughout the years, both in neuroscience and in ML. In general, reinforcement learning was employed where data was scarce and the desired behaviour complex. Recently, however, with the advent of deep networks, reinforcement learning has received increased horsepower to tackle more challenging problems. For a comprehensive overview see [50][51][52]. The general reinforcement learning problem is defined by an agent acting, or making decisions, in an environment, where the process is modelled as a Markov Decision Process (MDP).
The MDP is represented by a state space S, comprising the states the agent can be in, defined by the environment; an action space A, which is the set of actions the agent can take; a transition dynamics giving the probability of reaching a state at time t+1 after taking an action in a state at time t, i.e. p(s_{t+1} | s_t, a_t); and a reward function r : S × A → R. The MDP dynamics is assumed to satisfy the Markov property p(s_{t+1} | s_1, a_1, ..., s_t, a_t) = p(s_{t+1} | s_t, a_t), i.e. the next state depends only on the previous state and action, for any state-action trajectory. This is generally verbalised as: the future is independent of the past given the present. The general goal of the agent is to find a policy which maximizes the discounted future reward R_t = ∑_{k=t}^{T} γ^{k−t} r(s_k, a_k), with γ a real number called the discount factor, 0 < γ < 1. The policy is in general modelled as stochastic and is parameterized by a set of parameters θ, i.e. π_θ : S → P(A), where P(A) is the set of probability measures on the action space A. The reward is assumed to be given by the environment, but we will see later that auxiliary reward functions (not given by the environment) can significantly improve an agent's performance. We will briefly state the main paradigms used in finding a good policy and then delve directly into the acclaimed DQN, the first DRL agent introduced to the machine learning community.
We consider the partitioning of RL algorithms following Silver 5 . The first of the three approaches is value-based RL, where the value of each state (and action) is a kind of prediction of the cumulative future reward. In short, each state has an associated real number which defines how good it is to be in that state (or to take a specific action in that state). This is given by the state-value function, usually denoted V^π(s), where π is the policy which governs the agent, and which has one identifier, namely the state, or by the state-action value function Q^π(s, a), which has two identifiers, namely the state and the action. We will see later that new flavours of the value function exist, where additional identifiers are used, for example the current goal of the agent (when there are multiple goals). In DRL, the value functions are outputs of deep networks.
The second approach is to represent the actual policy as a deep network, and this is referred to as policy-based RL. In general, the policy networks can be directly optimized with stochastic gradient descent (SGD).
The third approach is model-based RL, where the agent explicitly constructs a transition model of the environment, generally modelled by a deep network. This can be hard, depending on the complexity of the environment, but it offers some advantages over the model-free approaches: for example, the model is generative, meaning that the agent can generate samples from its model of the environment and can thus avoid actually interacting with the environment. This can be highly beneficial when the interaction is expensive or risky. Having stated the main approaches, we proceed to the description of the first DRL agent, often referred to as DQN (Deep Q-Network) [6]. We show in Figure 5 the much-acclaimed architecture of the original DQN. In DQN, deep networks are employed to approximate the optimal action-value function, i.e.

$$Q^*(s,a) = \max_\pi \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s, a_t = a, \pi \right],$$

where r_t and a_t are the reward and action at timestep t, each future reward is discounted by a factor of γ and π is the policy that governs the agent's behaviour. For the rest of this document, if we do not specify with respect to which variables the expectation operator is taken, it is taken with respect to the state variable. The iterative (Bellman) version is then given by

$$Q^*(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right].$$

In general, in reinforcement learning, the optimal Bellman value or action-value function is denoted with an upper *. DQN was able to play a few tens of Atari games with no special tuning of hyperparameters or engineered features. Learning just from pixel data, the agent was capable of playing all the games with human or near-human performance, which is remarkable for an artificial agent. For the first time, an agent combined the representational power of deep nets with RL's complex control, with two simple tricks.
The first one is the replay memory, which stores the last million frames (this number can vary as a function of the task; it is a hyperparameter of the model) from the experience of the agent and avoids the highly correlated consecutive samples that arise when learning directly from experience.
The second trick was to replace the supervised signal, i.e. the target values of the Q-network (the optimal Q-function from the Bellman equation), with an approximation given by a previously saved version of the network, referred to as the target network. Using this approach defines a well-posed problem and avoids the difficulty of estimating the optimal value function directly. The loss function is then given by (we use the notation of the original DQN paper [6])

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right],$$

where U(D) is the uniform distribution used to sample from the replay memory, γ is the usual discount factor, and θ_i^- is the set of parameters from a previous iteration (the target network mentioned above). This gives the following gradient for the loss:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right].$$

These two main additions enabled unprecedented performance and adaptability of the deep net, allowing the agent to deal with a relatively diverse set of tasks: different games, but still within the same domain, Atari 2600. The work that followed was immense; we show in Figure 4 the number of citations over the years.
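
A minimal sketch of how the target network enters the loss is given below; `q_net` and `target_net` are assumed callables mapping a batch of states to per-action value estimates, and the batch is assumed to be uniformly sampled from the replay memory D (this illustrates the equations above, it is not the original implementation).

```python
import numpy as np

def dqn_td_targets(batch, target_net, gamma: float = 0.99) -> np.ndarray:
    """Regression targets r + gamma * max_a' Q(s', a'; theta^-) for non-terminal
    transitions; `batch` is (states, actions, rewards, next_states, dones)."""
    states, actions, rewards, next_states, dones = batch
    next_q = target_net(next_states)                  # Q(s', . ; theta^-)
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def dqn_loss(batch, q_net, target_net, gamma: float = 0.99) -> float:
    """Mean squared TD error L_i(theta_i) over the sampled batch."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states)[np.arange(len(actions)), actions]
    targets = dqn_td_targets(batch, target_net, gamma)
    return float(np.mean((targets - q_sa) ** 2))
```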
We will discuss some of the ideas that extend this approach, both functionally and theoretically. We will not discuss practical concerns, even though we are aware of the critical importance of such works and of the fact that they effectively enable these algorithms to run in an acceptable time. Deep learning is a relatively small subset of the whole machine learning community, even though it is progressing fast, at an unprecedented rate, due to its relative ease of use and of understanding, but also due to the greater computational resources available. Since playing Atari games is a visual task, CNNs perform best there, but other types of networks have been used as well, for example for text-based reinforcement learning [53,54]. More and more tasks will become amenable to DRL as the field embraces the new paradigm and the various existing approaches are adapted to fit these new understandings of what can be achieved. Trading has long been of wide interest in the ML community, and DRL is viewed as a new set of powerful tools for modelling and analysis of the financial markets.
We now proceed to describe different aspects of training a DRL agent to make profitable trades. The first critical problem is how to present the data to the agent, so that it receives all the information it needs in a compressed, lower-dimensional form. This is referred to as the representation learning problem in RL [55], and recently it has been shown that it can be decoupled from policy learning [56].

Time-series encodings or state representation for RL
There are many ways to encode a time-series, from a simple moving average to high-rank tensors. We will discuss in detail the most widely used encodings for the financial markets. We are interested solely in trading based on price, and will not consider alternatives such as limit order books (LOB) 6 . In short, LOBs are lists of orders which actors on the market are willing to execute; they usually specify a price level and a quantity. Sometimes these are aggregated to give one single value for the price (for example, the first 10 buy orders and the first 10 sell orders are weighted by their respective quantities to give a weighted average). We also skip the details of how news / Twitter feed sentiments are encoded in the state.

Convolutions
Convolutional Neural Networks (CNNs) were first used successfully in vision [57] and have since gained popularity throughout the wider ML field, for example in Natural Language Processing (NLP) [58], anomaly detection [59], drug discovery 7 and many other areas, including time-series encoding and forecasting [60]. As we are interested in RL, one way to use them for the state representation of financial time-series is to concatenate the time-series of interest into a tensor, with one axis corresponding to time (usually multiple adjacent time-steps are fed as a single state, e.g. the last 5 or 10 steps) and the other axis corresponding to the different quantities of interest (e.g. the Open, Close, High and Low prices of a candlestick). We show the graphical representations from [40] and [27] in Figure 6.

Raw-data OHLCV
Due to large variations in the price of different assets, we need to preprocess each one so that it has a standardized form (e.g. values between 0 and 1). Using the return, v_t / v_{t−1}, to normalize prices is common in the financial market literature. OHLCV stands for the Open, High, Low, Close and Volume of one time-point. Many works use this representation, concatenating the last few timesteps into a single state. A variation on this is to obtain features from the raw data, which is often quite noisy, by preprocessing it with a Kalman filter and then extracting features with an Autoencoder and/or an SVD for time-series [61].

Figure 6. Different RL state representations for time-series built for being processed by CNNs: (a) stacked data used for convolutional preprocessing, from [40]; (b) another way of stacking prices for convolutions, from [27].
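
A minimal sketch of such a state, assuming a numpy array of candles with columns ordered Open, High, Low, Close, Volume and normalizing the window by the most recent close:

```python
import numpy as np

def make_state(ohlcv: np.ndarray, t: int, window: int = 10) -> np.ndarray:
    """Last `window` OHLC rows for one asset, divided by the latest close so the
    agent sees relative price levels rather than absolute ones."""
    frame = ohlcv[t - window + 1 : t + 1, :4]   # open, high, low, close columns
    last_close = ohlcv[t, 3]
    return frame / last_close                    # shape: (window, 4)
```

Stacking such windows for several assets along a third axis yields price tensors of the kind shown in Figure 6.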

Technical trading Indicators
These are analytical transformations of the time-series, developed by economists to summarize or highlight different properties of the series. A wide variety of technical indicators exist, see for example [62], or some of their uses combined with DRL in [63], [64] or [65].

Correlated pairs
In one work on the FX market [66], the authors use 3 related pairs to predict one. The same can be done for the crypto markets, using correlated cryptocurrency pairs as inputs (e.g. for predicting BTCUSD we might use ETHUSD, BTCEUR, ETHBTC, etc.). This assumes there is enough information in the related pairs that we get a more informative dynamics than from the predicted time-series alone.

Historical binning or price-events environment
One interesting way to transform the continuous trading environment (i.e. the price) is to discretize it, transforming the time-series so that it can take only a fixed number of values. If the price change stays between the threshold values, then no effective change is seen by the DRL agent. This, in fact, compresses the time-series and transforms it from time-event based into price-event based [35]. It is a way of simplifying the time-series, making it more relevant for trading and removing some of the noise in the original data. Due to the shorter time-series produced in this way, agents using this transformation should train faster. We can also combine the two approaches, keeping the time component while still discretising the price value. We show both approaches in Figure 7. We are developing more types of environments (among them the one in Figure 7b), where we modify the incoming data on the fly, for example by artificially amplifying rises or drops. This should make agents more sensitive to one or the other and thus change their behaviour towards being risk-seeking or risk-averse.
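
The transformation itself can be sketched in a few lines, assuming a chosen relative threshold (e.g. 0.5%) below which price changes are ignored:

```python
import numpy as np

def price_events(prices: np.ndarray, threshold: float = 0.005) -> np.ndarray:
    """Emit a new point only when the price has moved by more than `threshold`
    relative to the last emitted event, turning a time-based series into a
    shorter, price-event-based one."""
    events = [prices[0]]
    for p in prices[1:]:
        if abs(p / events[-1] - 1.0) >= threshold:
            events.append(p)
    return np.asarray(events)
```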

Adding various account information
Some authors add to the state of the RL agent the price or quantity of the last buy, the current balance of the portfolio, or the specific stocks held and their respective quantities. There does not seem to be a general conclusion about what sort of account information is useful in the state of the agent, as agents can differ widely and the optimal choice is specific to each. For more information see Table 1.

Action space
The action space is generally the simple discrete set {buy, hold, sell}, but some works use more actions to denote different numbers of shares or quantities to buy or sell, as in [68]. When dealing with multiple assets, either a continuous multi-dimensional action space is used, or the problem is split into selecting the number of shares / the quantity (for example with a deep learning module [69,70]) and separately selecting the type of order (buy or sell).

Reward shaping
The reward is the most important component when training a DRL trading agent. Most works use the direct profit from a transaction or a longer-term profit (even over a full episode, though episodes can vary in length from a few hundred time-steps to thousands). The other group of reward functions is based on various measures of risk; for a comprehensive overview see [35]. Even the number of transactions can be used [71].

Model-based RL
In general, model-based methods are more data efficient, or have lower sample complexity, but are more computationally intensive. There are many approaches in the general RL literature, both parametric and non-parametric, most notably [79][80][81], but in this work we describe only the few approaches used in deep RL. Both model-free and model-based methods are known to exist in the brain [82], but the cognitive sciences differentiate between the types of task which employ each. Model-based decision making is shown to be used when the task is outcome-based, so one is interested in the unfolding in time of the sequence of states and actions (and sometimes rewards), while model-free methods are used when the decision making is mostly concerned with the value of an action (e.g. moral judgements: this action is good because it is moral). For DRL, however, model-based approaches are quite scarce, probably due to the high complexity of the models that need to be learned and the variety of models needed for the tasks that DRL is usually applied to. For example, TRPO deals with a number of classical control tasks and with Atari games as well; it would be quite hard to derive a model-building technique easily applicable to both of these domains. Even the Atari games are so different from one another that a general model-building algorithm for all of them (the 50 games used most often in the DRL literature) would not be trivial.

Model-based RL in trading
As we mentioned earlier, model-based RL implies the creation of a model or simulation of the environment. This simulation is then run to produce different possible predictions of the future (different if it is run multiple times and the model is stochastic). One such example is the work in [78], where the authors learn a model of the time-series (a summary of a limit order book) with an MDN, after encoding (or extracting features from) the LOB with an AE [83]. They then use this model to train an RL agent without requiring interaction with the actual environment. In [23] the authors test two prediction models, the NDyBM and WaveNet, adapted to multi-variate data. The predicted data is the Close, High and Low percentage change in the asset price. With a couple of additional modules (a data-augmenting module and a behavioural cloning module), the overall architecture is applied successfully to a small number of US stocks.

There are also many approaches which use policy-based RL algorithms combined with other techniques. For example, in [70] the authors use a supervised learning module (an autoencoder) to predict the next time-step of the price, which is then used in the signal representation for RL. The recurrent architecture, based on GRU and LSTM units, is present for each stock and is learned end-to-end. The paper has many contributions and additions, training and testing on 2000 and 3000 steps respectively over 20 years, with an annual rate of return of about 20% in a bull and 10% in a bear market. A larger-scale test (55 stocks) shows a 26% annual return. One of the most well-received papers is [27], where the authors apply their portfolio management DRL to the cryptocurrency market. Their architecture is an ensemble of identical independent evaluators, so called because to some degree each stream is independent and the streams only connect at the output, even though they share weights. They also add the previous portfolio value before the softmax output so as to minimise transaction costs. The results are remarkable compared to the baseline strategies tested (up to almost 50x the initial investment). It is interesting to note the large training set (2 years) compared to the test set (50 days). They test 3 different topologies, a CNN, a simple RNN and an LSTM, with all three performing significantly better than any other strategy tested.

Hierarchical reinforcement learning
Another promising direction is Hierarchical Reinforcement Learning (HRL), where the overall problem is split into somewhat independent subproblems. Either the action space is divided into smaller partitions, or the state space, or both. The most appealing aspect of HRL is that the agent higher in the hierarchy gives goal information to the level below and thus rewards based on how well the lower agent is doing with respect to the goal being pursued. This means that auxiliary rewards (rewards given to the agent in addition to the environment rewards, usually defined by the programmer, see for example [85]) can be learned rather than hard-coded. This is a huge milestone in DRL, since the rewards fully govern an agent's behaviour. There are many different architectures for HRL, each inheriting some of the traits of the original flat RL, some with engineered dynamics between the agents, while others learn this as well. HRL holds the promise of scaling DRL to large sets of different problems, because agents at lower levels learn how to solve specific narrow tasks, or skills, whereas going up the hierarchy the agents learn more general behaviour and how to select between the agents at lower levels. Transfer learning also comes more naturally in HRL due to the advantages mentioned above.

HRL in trading
There is little research on HRL in trading, but some works exist, see for example [22,25,26,77]. HRL is used in different ways in the mentioned works, and we discuss each in turn. In one of the works, [26], the authors use HRL to manage a subset of the portfolio with an individual, low-level agent. Of course, the principle can be applied repeatedly, where a few flat agents (we call flat agents the lowest-level agents, which interact with the actual trading environment) are controlled by a higher-level agent (level 1), then a group of agents at level 1 is controlled by an agent at level 2, and so on. In this way, a hierarchical agent of arbitrary depth can be formed.

This type of partitioning deals with the state space (each DRL agent receives only the prices of the stocks it is managing) and with the action space (each agent outputs actions only for its specific stocks), but there are no shared dynamics between the levels of the hierarchy. However, there is information sharing in the form of the weight vector that each agent receives from the level above. Even though this approach is quite simplistic, it is also quite natural and partitions the overall problem into smaller subproblems which can be solved individually without loss of expressivity.

In a second line of work, [22], the high- and low-level agents use different time-scales to make decisions. The high-level agent decides which assets to hold for some larger period (the holding period) and the low-level agent decides how to buy these assets (and/or sell them to reach the desired portfolio configuration) over a shorter period (the trading period), such that transaction costs are minimised. The authors use limit order books for the states, which are higher dimensional and require more resources; see more details in Section 7.

As a conclusion to this work, we mention some of the commonalities between papers and some methodological issues which arise and which, if solved, could benefit the community significantly.

Common ground
• Many papers use convolutional neural networks to (at least) preprocess the data.
• A significant number of papers concatenate the last few prices as the state for the DRL agent (this can vary from 10 steps to hundreds).
• Many papers use the Sharpe ratio as a performance measure.

We showed in Table 1 a list of the most representative modelling choices with respect to the data used, its sampling frequency, the state space, the action space, and the reward.