Preprint — this version is not peer-reviewed.

Implementing Neural SDEs for Data-Driven Dynamics of the Bitcoin Option Surface

Submitted: 01 May 2026. Posted: 05 May 2026.

Abstract
This paper presents a full implementation of data-driven modelling of the dynamics of Bitcoin options, using high-frequency data from the Deribit exchange. To this end, we provide a synthesis of methods established in prior papers, namely the works on “neural SDE market models,” to build a pipeline from raw options quotes to a functioning non-parametric model. The options surface is decomposed into a low-dimensional latent space designed to minimise arbitrage in reconstruction, and the temporal evolution of these factors is modelled with a stochastic differential equation (SDE). The drift and diffusion of the SDE are learnt from data using neural networks, thereby forming a “neural SDE”. These networks are constrained in order to guarantee the absence of static arbitrage and to minimise dynamic arbitrage in the resulting model, and are trained using a likelihood-based objective built on a discretisation of the SDE's transition density. The framework produces arbitrage-free simulations of option surfaces and enables risk management applications such as Value-at-Risk estimation and hedging.

1. Introduction

1.1. Motivation

Accurately modeling the dynamics of options is a core function of modern quantitative finance, underpinning both valuation and risk management in numerous asset classes. Traditional approaches to option pricing, such as Black-Scholes (Black and Scholes 1973), Heston (Heston 1993) and SABR (Hagan et al. 2002) have been extensively developed and tested in the context of mature financial markets such as equities or foreign exchange. As with any modeling approach, these frameworks have simplifying assumptions, in particular in the form of parametric assumptions for the drift and diffusion processes.
Cryptocurrency markets, and by extension their derivatives markets, represent a markedly different environment from established financial markets. Because of these altered dynamics, the assumptions of traditional option pricing models may be even more restrictive when applied to cryptocurrencies than to traditional asset classes. Ahmed et al. (2024) provide a review highlighting that high volatility appears to be a structural feature of cryptocurrency markets. They also suggest that machine learning methods, in particular deep learning models, may better capture non-linear dynamics and handle high-frequency data.
Furthermore, little work has been done on cryptocurrency option pricing within the literature. While options modelling techniques are domain-agnostic (the market itself does not change the mathematics of the models), as far as we are aware there has been no work on the modelling of a multi-strike, multi-maturity options book (i.e., the so-called volatility surface) for cryptocurrencies.

1.2. Goals

The broad goal of this project is to use raw options quotes, obtained from the Deribit exchange, and create a data-driven model for the joint dynamics of the options observable in the market. This can be broken down into three points.
  • Data driven method to model dynamics of the observable call options surface. We use non-parametric machine learning methods to extract the model dynamics from the observable call options on Deribit. Rather than assuming a parametric form for the drift and diffusion of the underlying, the model seeks to learn the evolution of option prices directly from historical time-series data across strikes and maturities. By focusing on the evolution of the entire surface rather than individual options, the model captures the co-movements and structural relationships inherent in a live options market, enabling a more realistic depiction of the dynamics observed in high-frequency cryptocurrency data.
  • Ensure the model does not admit arbitrage. Absence of arbitrage is a basic consistency requirement for financial models. We want to ensure our model accounts for this by ensuring that outputs admitting arbitrage are either eliminated entirely, or at the least minimised. To ensure theoretical and practical validity, the model is explicitly designed to satisfy static, and minimise any violation of dynamic, no-arbitrage conditions. Static arbitrage constraints impose shape restrictions on the option surface—such as monotonicity in strike and convexity, and calendar spread consistency—ensuring that prices at a single point in time form a feasible and economically consistent surface. This is the more fundamental requirement in the sense that violations of static arbitrage constraints lead to immediate, model–free arbitrage opportunities. Dynamic arbitrage constraints, in contrast, govern the temporal evolution of option prices, ensuring that the model’s drift and diffusion dynamics are consistent with an equivalent martingale measure. What each of these constraints mean definitionally will be discussed in more depth in Section 2.2 and Section 2.3 below. The static conditions are incorporated directly into the learning process through constrained optimisation and diffusion shrinkage methods derived from the Arbitrage-Free Neural SDE framework of Cohen et al. (2023a).
  • Use of the model for risk management and hedging. As application examples of the model fitted to the market in the present paper, we build on the work of Cohen et al. (see Cohen et al. (2022) and Cohen et al. (2023b)) to hedge options portfolios and provide value at risk (VaR) estimates via market simulation.

1.3. Significance and Contributions

This work contributes to the emerging literature on data-driven financial modelling by applying the arbitrage-free Neural SDE framework to the cryptocurrency derivatives market. It provides the first empirical application of an arbitrage-consistent neural SDE to intraday, multi-strike, multi-maturity crypto options data. Beyond methodological adaptation, the model demonstrates practical relevance by producing risk and hedging metrics that are derived from the same underlying stochastic dynamics, bridging the gap between theoretical consistency and applied market utility.

1.4. Overview of Dataset Used

The dataset has been obtained from the Deribit exchange, a cryptocurrency options exchange, where at each timestep we see every available option on Bitcoin. This effectively provides a time-series of options prices, though some pre-processing is needed to create a static grid of strikes and maturities for the time-series; this is explored in more depth in Section 3. An important feature of the raw data is that it is high-frequency, meaning that consecutive snapshots arrive at sub-second intervals, see Figure 1. There are 118,193 unique timestamps in one dataset, which spans 24 hours.
The date for the market data used in the initial experiment is May 22nd 2025. For context, the price of BTCUSD on that day is plotted in Figure 2.
We also perform an extended analysis on a dataset following the same schema, covering a week's worth of data as opposed to a day: the 7th to the 14th (inclusive) of September 2025. The distribution of timestep lengths is seen in Figure 3 and the movement of the underlying asset in Figure 4. This dataset has 842,940 unique timestamps.

2. Background

2.1. Market Models

The present paper follows the approach of so-called market models, which model the joint dynamics of all liquidly tradable assets simultaneously and impose hard and/or soft constraints on the modelling process to ensure absence of arbitrage. This idea is similar to the interest rate modelling framework of Heath-Jarrow-Morton (HJM) (Heath et al. 1992), though models for options are inherently more complex due to the constrained state-space of options prices created through no-arbitrage relationships.
The literature on market models contains work on special cases. Schönbucher (1999) and Lyons (1995) proposed models for the implied volatility of single options. For example, denote by $S_t$ the price of the underlying asset at time $t$, $C_t$ the price of a call option, $K$ the strike price and $T$ the maturity. If there is only one option to model, one has the static bounds $(S_t - K)^+ \le C_t \le S_t$ and $X = C_T = (S_T - K)^+$, where the latter corresponds to the terminal pay-off condition. The authors of these papers proceed by reparameterising the option price by its implied volatility. By contrast, modelling the joint dynamics of multiple options across strikes and maturities introduces a much richer set of static arbitrage constraints that prices must satisfy (Carr and Madan 2005; Davis and Hobson 2007; Gatheral 2011). To our knowledge, only two substantial contributions address this setting: the arbitrage-consistent approach of Wissel (2008) and the data-driven neural SDE framework of Cohen et al. (2023a). The latter introduces the concept of a “liquid lattice”—a structured, yet arbitrage-consistent, state space for option prices—allowing dynamic modelling of an entire option surface rather than a single contract. This addresses the need for a data-driven model for options that can take in a time-series of data, so the present paper builds on the methods provided by Cohen et al. (2020, 2022, 2023b).

2.2. Static Arbitrage

Static arbitrage refers to the ability to make a risk-free profit by exploiting the prices of options at a single point in time (Carr and Madan 2005; Davis and Hobson 2007; Gatheral 2011). Static arbitrage strategies are immediate and model-free; thus there is a strong requirement that option price surfaces should not allow static arbitrage. Effectively, the static no-arbitrage constraints are a set of shape constraints on the price surface. Specifically, for call options the conditions are:
1. Monotonicity in maturity. Whenever two call prices share the same moneyness $m$ but have different maturities $\tau_i < \tau_j$, the call price must satisfy
$$C(\tau_j, m) \ge C(\tau_i, m).$$
2. Monotonicity in moneyness. At any fixed maturity $\tau$, when two moneyness values satisfy $m_k < m_\ell$, the corresponding call prices must satisfy
$$C(\tau, m_k) \ge C(\tau, m_\ell).$$
These conditions follow from the absence of 'calendar spread' (and vertical spread) arbitrage, formalised by Carr and Madan (2005) and further reviewed by Gatheral (2011).
3. Convexity in moneyness. Also at a fixed $\tau$, for any three consecutive moneyness points $m_{k-1} < m_k < m_{k+1}$, convexity requires (see e.g. Breeden and Litzenberger 1978 and Roper and Rutkowski 2009)
$$C(\tau, m_{k-1}) - 2\,C(\tau, m_k) + C(\tau, m_{k+1}) \ge 0.$$
4. Nonnegativity. For every node $(\tau_i, m_i)$, the call price must be nonnegative:
$$C(\tau_i, m_i) \ge 0.$$
As in Cohen et al. (2020), we can collect all the linear inequalities arising from conditions (1)–(3) into a single matrix inequality
$$A c \ge 0,$$
giving a set of $R$ linear constraints in $\mathbb{R}^n$. In particular, each row of $A$ corresponds to exactly one of the following three types of finite-difference expressions:
(a) if $(\tau_j, m)$ and $(\tau_i, m)$ are adjacent in $\tau$: $C(\tau_j, m) - C(\tau_i, m) \ge 0$;
(b) if $(\tau, m_k)$ and $(\tau, m_\ell)$ are adjacent in $m$: $C(\tau, m_k) - C(\tau, m_\ell) \ge 0$;
(c) if $(\tau, m_{k-1})$, $(\tau, m_k)$, $(\tau, m_{k+1})$ are consecutive in $m$: $C(\tau, m_{k-1}) - 2\,C(\tau, m_k) + C(\tau, m_{k+1}) \ge 0$.
Meanwhile, the nonnegativity requirement is captured by the coordinate-wise condition $c \ge 0$. We use this parameterisation in the data preprocessing step for detecting and removing arbitrage in our historical option prices. Similar arbitrage-cleaning procedures are applied in Cohen et al. (2020) and Wissel (2008) to enforce shape constraints before model calibration.
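To make the matrix parameterisation concrete, the following minimal numpy sketch (the function name and the toy surface are ours, not from the paper) builds rows of types (a)–(c) for a small regular lattice and verifies $A c \ge 0$ on a synthetic surface constructed to satisfy the constraints:

```python
import numpy as np

def build_constraint_matrix(n_tau, n_m):
    """Rows of A such that A @ c >= 0 encodes the shape constraints on a
    regular n_tau x n_m lattice, with prices stacked row-major:
    c[i * n_m + j] = C(tau_i, m_j), tau increasing in i, m increasing in j."""
    idx = lambda i, j: i * n_m + j
    rows = []
    # (a) monotonicity in maturity: C(tau_{i+1}, m_j) - C(tau_i, m_j) >= 0
    for i in range(n_tau - 1):
        for j in range(n_m):
            r = np.zeros(n_tau * n_m)
            r[idx(i + 1, j)], r[idx(i, j)] = 1.0, -1.0
            rows.append(r)
    # (b) monotonicity in moneyness: C(tau_i, m_j) - C(tau_i, m_{j+1}) >= 0
    for i in range(n_tau):
        for j in range(n_m - 1):
            r = np.zeros(n_tau * n_m)
            r[idx(i, j)], r[idx(i, j + 1)] = 1.0, -1.0
            rows.append(r)
    # (c) convexity: C(tau_i, m_{j-1}) - 2 C(tau_i, m_j) + C(tau_i, m_{j+1}) >= 0
    for i in range(n_tau):
        for j in range(1, n_m - 1):
            r = np.zeros(n_tau * n_m)
            r[idx(i, j - 1)], r[idx(i, j)], r[idx(i, j + 1)] = 1.0, -2.0, 1.0
            rows.append(r)
    return np.array(rows)

# Synthetic surface satisfying the constraints by construction:
# increasing in tau, decreasing and convex in m (c = 0.1 (1 + tau) e^{-m}).
tau = np.array([0.1, 0.5, 1.0])
m = np.linspace(-0.5, 0.5, 5)
c = (0.1 * (1 + tau)[:, None] * np.exp(-m)[None, :]).ravel()
A = build_constraint_matrix(len(tau), len(m))
print((A @ c >= -1e-12).all())  # True
```

Violations are then detected simply as negative entries of $A c$, which is exactly how the arbitrage-cleaning step later flags bad prices.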

2.3. Dynamic Arbitrage

While static arbitrage concerns the shape of the option price surface at a single point in time, dynamic arbitrage refers to risk-free profits that can be generated by trading options (and the underlying asset) over time as their prices evolve. In other words, even if an option surface satisfies all static no-arbitrage conditions at each date, its temporal evolution may still violate arbitrage-free dynamics if there is an inconsistent drift or diffusion structure across maturities and strikes (Derman and Kani 1994; Dupire 1994; Fukasawa 2011).
Formally, a family of call price processes $\{C_t(\tau, m)\}_{t \ge 0}$ is dynamically arbitrage-free if there exists a risk-neutral probability measure $\mathbb{Q}$ under which the discounted price processes of the underlying and of all replicable portfolios of options are martingales. This implies that
$$e^{-\int_0^t r_s \, ds} S_t \quad \text{and} \quad e^{-\int_0^t r_s \, ds} C_t(\tau, m) \quad \text{are } \mathbb{Q}\text{-martingales}.$$

2.4. Neural Networks in Options Pricing & Modelling

The use of neural networks in options pricing was first pursued by Hutchinson et al. (1994), who trained feedforward networks to learn the pricing function of S&P 500 index options. The resurgence of neural networks in the 2010s brought new techniques for option pricing. Major advances include neural SDEs, generative models, and physics-informed neural networks (PINNs).

2.4.1. Dynamic Hedging via Deep Learning

The concept of using deep learning for hedging, or ’deep hedging’ was introduced by Buehler et al. (2019). Here, the hedging problem is treated as a reinforcement learning task (Sutton et al. 1998), where an agent (the hedging strategy) learns to optimise a reward or risk objective over time. Buehler et al. (2019) used a synthetic dataset, with transaction costs, and found that the learned hedging strategy outperforms the traditional Black-Scholes delta hedging in terms of risk-adjusted returns. Similar work by Kolm and Ritter (2019) applied reinforcement learning (RL) to dynamic replication and hedging, training a hedging agent that learns how to optimally trade to manage option risk. These RL-based approaches mark a transition from using neural networks merely as function approximators to using them as decision-makers in a sequential setting.

2.4.2. Generative Models

Another use of neural networks is generative models, which seek to capture the joint distribution of asset prices and implied volatility surfaces, enabling realistic market simulation and data augmentation. Generative Adversarial Networks (GANs) (Goodfellow et al. 2020), and other related generative models, have been used to learn distributions of option prices across strikes, maturities and time, with the aim of producing synthetic data that is statistically indistinguishable from real market data. Work by Wiese et al. (2019), building on deep hedging, proposed a GAN-based simulator for equity options. Their approach involves a generator network producing a sequence of market states (underlying price and entire implied volatility surface over time) which a discriminator network tries to distinguish from real historical market data. By training these networks against each other, the generator learns to create highly realistic option price dynamics. Similarly, the model introduced as 'VolGAN' by Vuletić and Cont (2024) has a GAN-based architecture trained on historical sequences of implied volatility surfaces and underlying prices, capable of simulating realistic joint dynamics of the volatility surface and the underlying while respecting static arbitrage constraints. Cont and Vuletić (2023) also introduce another, tractable, method capable of simulating arbitrage-free volatility surfaces using more conventional stochastic and factor dynamics with constraints and calibration steps.

2.4.3. Neural SDEs

Notable examples of neural SDE martingale models include Cuchiero et al. (2020), who use GANs to match risk-neutral probability distributions, and Gierjatowicz et al. (2022), who train the neural SDE in a more traditional way. The thesis by Phytides (2023) demonstrates that neural SDEs, when properly constrained with no-arbitrage rules, can accurately replicate benchmark processes (such as Black-Scholes and constant elasticity of variance (CEV)). Another approach of note is the work of DeLise (2021), who treats neural SDEs as universal Itô process approximators. Using standard martingale theory and Girsanov's theorem, these trained neural SDEs reliably reproduce the Black-Scholes option price (i.e., the neural SDE is trained only on the underlying geometric Brownian motion dynamics and then reproduces Black-Scholes option prices). In contrast, Lei Fan (2024) uses real-world historical S&P 500 and S&P 100 index options to train neural SDE models for pricing and hedging.
Cohen et al. (2023a) developed an arbitrage-free neural SDE market model for European options, incorporating explicit constraints to enforce static arbitrage-free relationships among options. In their framework, a set of latent factors (such as implied volatility levels or slopes) evolve in time along with the underlying asset price, and these jointly drive the entire surface of option prices. The drift and volatility of these factors are given by neural networks, which are trained on time-series data of options and underlying prices. To ensure the learned model does not violate no-arbitrage conditions, the neural network outputs are passed through transformations that enforce the linear inequalities reflecting static arbitrage constraints. This constrained neural network approach guarantees that every price surface produced by the model is free of calendar arbitrage and butterfly arbitrage by design. The result is a nonparametric yet arbitrage-consistent market model that can be calibrated to the market’s entire option book. Such a model can serve as a market simulator, generating realistic paths of future implied volatility surfaces for risk management use, or be used to price illiquid exotic derivatives by Monte Carlo simulation in the learned dynamics.
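As a minimal illustration of the neural SDE machinery (not the constrained model of Cohen et al. (2023a)), the following numpy sketch simulates latent factors under an Euler–Maruyama scheme with untrained toy MLPs standing in for the drift and diffusion networks; all names, sizes and weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hidden, d_out):
    """Random (untrained) tanh-MLP parameters, for illustration only."""
    return (0.1 * rng.standard_normal((d_in, d_hidden)), np.zeros(d_hidden),
            0.1 * rng.standard_normal((d_hidden, d_out)), np.zeros(d_out))

def mlp(params, x):
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

d = 3                                    # number of latent factors
mu_params = init_mlp(d, 16, d)           # drift network mu_theta
sigma_params = init_mlp(d, 16, d * d)    # diffusion network sigma_theta

def simulate(xi0, n_steps, dt, n_paths):
    """Euler-Maruyama scheme for d Xi_t = mu(Xi_t) dt + sigma(Xi_t) dW_t."""
    xi = np.tile(xi0, (n_paths, 1))
    out = [xi.copy()]
    for _ in range(n_steps):
        mu = mlp(mu_params, xi)                               # (n_paths, d)
        sigma = mlp(sigma_params, xi).reshape(n_paths, d, d)  # (n_paths, d, d)
        dW = np.sqrt(dt) * rng.standard_normal((n_paths, d))
        xi = xi + mu * dt + np.einsum("pij,pj->pi", sigma, dW)
        out.append(xi.copy())
    return np.stack(out)                 # (n_steps + 1, n_paths, d)

paths = simulate(np.zeros(d), n_steps=100, dt=1 / 365, n_paths=64)
print(paths.shape)  # (101, 64, 3)
```

In the actual framework the network weights are trained on the factor time-series, and the outputs are further transformed to respect the no-arbitrage constraints described below.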

2.4.4. Physics Informed Neural Networks (PINNs) for Options Pricing

PINNs offer a powerful way to reduce the “black-box” nature of deep learning by embedding known physical or theoretical laws directly into the learning process. In PINNs, the solution to a (partial) differential equation (PDE) is approximated by a neural network $\mu_\theta(x, t)$, and the governing PDE is enforced by penalising any deviation from its prescribed behaviour.
There have been several attempts at applying PINNs to the option pricing problem. The thesis by Tanios (2021) demonstrates the feasibility of applying PINNs to high-dimensional option pricing problems by solving both the forward and inverse formulations of the Black-Scholes and Heston partial differential equations (PDEs). In his work, PINNs are used not only to accurately approximate the solutions of these PDEs, but also to compute sensitivities (Greeks) via automatic differentiation, thereby overcoming the limitations of traditional numerical methods in high-dimensional settings. Similarly, recent work by Dhiman and Hu (2023) extends the PINN framework to both European and American options, showing that embedding financial constraints—such as no-arbitrage conditions—directly into the neural network's loss function can lead to improved accuracy and computational efficiency. While Tanios uses synthetically generated data, Dhiman and Hu (2023) include the application of PINNs to real market data.

2.5. Choice of Modelling Approach

Static constraints define a feasible state space for option prices, while dynamic constraints govern admissible drifts/diffusions over time. The “liquid lattice”, i.e., linear-inequality parameterisation, enables tractable enforcement of static constraints during learning and cleaning. For our use case, these constraints are not optional regularisation but necessary domain knowledge that stabilises learning and ensures economically meaningful outputs.
Deep hedging (i.e., RL) targets hedging strategy optimisation, not market option surface dynamics. Generative models (e.g., GANs) excel at sample realism, but can struggle with enforcing the absence of arbitrage and with the interpretability of calibrated factor dynamics. PINNs embed the PDE structure, but align best with parametric PDE models. Thus, we will focus on Neural SDEs, which provide a stochastic, non-parametric, time-series view with continuous-time semantics. Combined with explicit no-arbitrage conditions, they can model the joint evolution of the entire option surface while remaining economically consistent. Building on the approach of Cohen et al. (2023a), we will construct a model with dynamics which are learnt directly from the data, for the temporal evolution of the options surface.

3. Data Preprocessing

This section details the pre-processing steps required to take raw options quotes, obtained from Deribit, and clean and transform them so that they are ready to be fed into the neural SDE model. This involves the creation of a static lattice of options at high frequency, arbitrage repair of these option price surfaces and a transformation of the surfaces into a lower dimensional representation.

3.1. Static Grid for Options

Over time, the time-to-maturity/moneyness coordinates of liquid options change stochastically. To facilitate training our model, it is hence necessary to create a static lattice of time-to-maturity/moneyness coordinates, to which market options are then matched. Let $\tau = T - t$ denote time to maturity and let $m$ denote forward log-moneyness, i.e., $m = \ln(K / F_t(T))$, with $K$ the strike and $F_t(T)$ the forward price.
It is important to note that the entire option surface is never observable at once; it is only ever the currently liquid options for which we see quotes within the order book. Because of this, for each time to maturity τ there is a different set of log-moneynesses m. In particular, options further from expiry have a larger range of traded moneyness values in comparison to options traded closer to maturity. This represents what Cohen et al. (2023a) call the ’liquid lattice’ of options. When constructing our static grid, it is important to capture this aspect, as we are interested in modelling the movement of tradeable options that have sufficient liquidity. Conversely, we are not interested in modelling a very deep out-of-the-money option that has a one day expiry, because we are very unlikely, given most market conditions, to observe this option ever being actively traded.

3.1.1. Constructing a Static Grid at High-Frequency

Prior “market model” work in the literature has also parameterised the price of options in this fashion. Wissel (2008) constructs a square lattice (i.e. each maturity has the same distribution of moneyness), while Cohen et al. (2023a) use end-of-day closing data for STOXX 50 index and DAX options, which naturally provide a distribution of moneyness for each maturity and hence the shape of the liquid lattice is chosen heuristically.
However, at high frequency/intraday frequency each maturity has effectively a continuous distribution of traded moneynesses. Figure 5 highlights the observed distribution of available log-moneyness values for each maturity cluster in the Deribit data for a 24-hour period on May 22nd 2025.
This leads to a method that captures the distribution of $m$ for each $\tau$, i.e., the shape of the “liquid lattice”. We are primarily interested in ensuring that this liquid lattice reflects the liquidity profile within our observed dataset. We construct a lattice over the joint $(\tau, m)$ space as follows. First, let $\{\bar{\tau}_i\}_{i=1}^p$ be the sorted unique maturities observed in the dataset. For each $\bar{\tau}_i$, compute empirical moneyness percentiles $m_{\text{lo},i} = P_1(m \mid \tau = \bar{\tau}_i)$ and $m_{\text{hi},i} = P_{99}(m \mid \tau = \bar{\tau}_i)$, and form an $N_m$-point linear grid
$$m_{ij} = m_{\text{lo},i} + (j-1)\,\frac{m_{\text{hi},i} - m_{\text{lo},i}}{N_m - 1}, \qquad j = 1, \dots, N_m.$$
The lattice nodes are the Cartesian set $\mathcal{N} = \{(\bar{\tau}_i, m_{ij})\}$. We then fit a 1-nearest-neighbour map in the 2D space (Euclidean metric) and assign each quote $k$ with features $(\tau_k, m_k)$ directly to its nearest node index $\ell_k = \arg\min_{(\bar{\tau}_i, m_{ij}) \in \mathcal{N}} \|(\tau_k, m_k) - (\bar{\tau}_i, m_{ij})\|$. For each calendar time $t$ and node in $\mathcal{N}$, if multiple quotes map to the same node, we keep the one with the highest liquidity (measured by trading volume). Finally, we pivot to a time-by-node panel and fill any missing entries by linear interpolation over $t$, followed by forward/backward fill as needed.
This method is implemented by Algorithm 1, noting that within the 24 hours of data we take all unique maturities rather than performing clustering on $\tau$. Let $Q = \{q_k\}_{k=1}^M$ be quote observations, where each
$$q_k = (t_k, T_k, K_k, B_k, A_k, S_k, V_k, \text{type}_k),$$
with timestamp $t_k$, expiry $T_k$, strike $K_k$, best bid $B_k$, best ask $A_k$, underlying spot $S_k$, liquidity proxy $V_k$ (USD traded volume), and option type. Fix reference date $t_{\text{ref}}$, rates $(r, q)$, and lattice hyperparameters $n_\tau \in \mathbb{N}$, $n_m \in \mathbb{N}$, and $K_{\max} \in \mathbb{N}$ (default 50). We retain calls only ($\text{type}_k = C$) with $V_k > 0$. For each retained quote $k$, define
$$\tau_k := \frac{T_k - t_{\text{ref}}}{365.25}, \qquad F_k := S_k\, e^{(r-q)\tau_k}, \qquad m_k := \log\frac{K_k}{F_k},$$
$$\text{mid}_k := \frac{B_k + A_k}{2}, \qquad c_k := \text{mid}_k\, e^{-(r-q)\tau_k} \ (= C_k / F_k).$$
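A minimal sketch of these definitions, assuming bid/ask are quoted in units of the underlying (the Deribit convention), so that the normalised price reads $c_k = \text{mid}_k\, e^{-(r-q)\tau_k}$; the function name is ours:

```python
import numpy as np

def normalise_quotes(T_days, K, bid, ask, S, r=0.0, q=0.0, t_ref_days=0.0):
    """Map raw quotes to lattice coordinates (tau, m) and a normalised
    mid-price c. Bid/ask are assumed quoted in units of the underlying
    (Deribit convention), so c = mid * exp(-(r - q) * tau) plays the
    role of C / F."""
    tau = (T_days - t_ref_days) / 365.25      # year fraction
    F = S * np.exp((r - q) * tau)             # forward price
    m = np.log(K / F)                         # forward log-moneyness
    mid = 0.5 * (bid + ask)
    c = mid * np.exp(-(r - q) * tau)          # normalised call price
    return tau, F, m, c

# At-the-money-forward example with r = q = 0, so F = S and m = 0.
tau, F, m, c = normalise_quotes(
    T_days=np.array([30.0]), K=np.array([100_000.0]),
    bid=np.array([0.024]), ask=np.array([0.026]),
    S=np.array([100_000.0]))
print(tau, m, c)  # tau ≈ 0.0821, m = 0, c = 0.025
```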
Algorithm 1: Static Lattice Construction
Require: cleaned quotes $\{(t_k, \tau_k, m_k, c_k, V_k)\}_{k=1}^M$, $n_\tau$, $n_m$, $K_{\max}$, fill method (linear)
Ensure: reduced node set $L_{\text{sub}}$, dense panel $C(t, \ell)$ on $L_{\text{sub}}$
1. $U \leftarrow$ sorted unique $\{\tau_k\}_{k=1}^M$
2. Fit 1D K-means with $n_\tau$ clusters on $U$; obtain centroids $\{\tilde{\tau}_i\}_{i=1}^{n_\tau}$
3. Sort centroids increasingly, $\bar{\tau}_1 < \dots < \bar{\tau}_{n_\tau}$, and relabel quote cluster indices accordingly
4. for $i = 1, \dots, n_\tau$:
5.   $M_i \leftarrow \{m_k : \text{quote } k \text{ assigned to cluster } i\}$
6.   if $M_i = \emptyset$: $(\underline{m}_i, \bar{m}_i) \leftarrow \big(P_1(\{m_k\}), P_{99}(\{m_k\})\big)$ (global fallback)
7.   else: $(\underline{m}_i, \bar{m}_i) \leftarrow \big(P_1(M_i), P_{99}(M_i)\big)$
8.   for $j = 1, \dots, n_m$: $m_{ij} \leftarrow \underline{m}_i + \frac{j-1}{n_m - 1}(\bar{m}_i - \underline{m}_i)$; create node $\ell = (i, j)$ with coordinates $x_\ell := (\bar{\tau}_i, m_{ij})$
9. $L \leftarrow \{x_\ell\}$, $N := |L| = n_\tau n_m$
10. Fit a 1-nearest-neighbour model on $L$ in $\mathbb{R}^2$ (Euclidean distance)
11. for $k = 1, \dots, M$: $\ell_k \leftarrow \arg\min_{\ell \in \{1, \dots, N\}} \|(\tau_k, m_k) - x_\ell\|_2$
12. for each pair $(t, \ell)$ observed among $\{(t_k, \ell_k)\}$: $I_{t,\ell} \leftarrow \{k : t_k = t,\ \ell_k = \ell\}$; $k^*(t, \ell) \leftarrow \arg\max_{k \in I_{t,\ell}} V_k$; $C_{\text{sparse}}(t, \ell) \leftarrow c_{k^*(t,\ell)}$
13. Set all unfilled entries of $C_{\text{sparse}}$ to NaN
14. Node coverage count: $n_\ell \leftarrow \#\{t : C_{\text{sparse}}(t, \ell) \text{ is observed}\}$
15. Keep $L_K$ = indices of the $K_{\max}$ largest $n_\ell$
16. Restrict $C_{\text{sparse}}$ to columns in $L_K$; drop columns that are entirely NaN
17. Let the remaining column indices be $L_{\text{sub}}$, with corresponding nodes $\{x_\ell\}_{\ell \in L_{\text{sub}}}$
18. for each $\ell \in L_{\text{sub}}$: interpolate $C_{\text{sparse}}(\cdot, \ell)$ linearly in time; apply forward-fill then backward-fill at the time boundaries
19. Denote the resulting dense panel by $C(t, \ell)$
20. return $L_{\text{sub}}$, $C(t, \ell)$
If a train/test split is used, build $(L, \text{NN})$ on the training quotes only, apply the snapping-and-panel step separately to train and test, then keep the intersection of node indices,
$$L = L_{\text{sub}}^{\text{train}} \cap L_{\text{sub}}^{\text{test}},$$
and reindex both panels to $L$ with the same linear + forward/backward fill.
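The core of the lattice construction (percentile bands per maturity plus 1-nearest-neighbour snapping) can be sketched in plain numpy; here unique maturities are used directly rather than K-means centroids, as in the 24-hour experiment, and the volume-based deduplication and time-filling steps are omitted for brevity:

```python
import numpy as np

def build_lattice(tau_q, m_q, n_m=5):
    """For each unique maturity, an n_m-point linear grid between the
    1st and 99th percentiles of that maturity's traded moneyness."""
    nodes = []
    for t in np.unique(tau_q):
        lo, hi = np.percentile(m_q[tau_q == t], [1, 99])
        nodes += [(t, lo + j * (hi - lo) / (n_m - 1)) for j in range(n_m)]
    return np.array(nodes)                       # (n_tau * n_m, 2)

def snap_to_nodes(tau_q, m_q, nodes):
    """1-nearest-neighbour assignment in (tau, m) space (Euclidean)."""
    pts = np.column_stack([tau_q, m_q])          # (M, 2)
    d2 = ((pts[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                     # node index per quote

# Toy quotes: a short maturity with a narrow moneyness band and a longer
# maturity with a wide band, mimicking the liquidity profile in the text.
rng = np.random.default_rng(1)
tau_q = np.repeat([0.1, 0.5], 200)
m_q = np.concatenate([rng.uniform(-0.2, 0.2, 200),
                      rng.uniform(-0.6, 0.6, 200)])
nodes = build_lattice(tau_q, m_q)
labels = snap_to_nodes(tau_q, m_q, nodes)
print(nodes.shape, labels.shape)  # (10, 2) (400,)
```

Note that the Euclidean metric mixes the units of $\tau$ and $m$, exactly as in Algorithm 1; in practice one may wish to rescale the two coordinates before snapping.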
The resulting liquid lattice from this process is displayed in Figure 6, built from 24 hours of data from the 22nd of May 2025. While the lattice is irregular and asymmetric, it has the general shape we would expect of an options lattice, with options closer to maturity having a narrower support of traded moneyness values. Additionally, one can speculate that, were this process run over a longer period, say one month, the lattice would look more symmetric, as more traded combinations of $(\tau, m)$ would be observed.

3.2. Arbitrage Repair of Historical Option Prices

When training an options model, we want to ensure that the input data is arbitrage free. Thus cleaning input data of arbitrage violations is a logical next step in the data processing. We may encounter seeming arbitrage violations in the input data for two reasons:
1. Using the mid-price to construct our model. Within our data, and in market microstructure generally, we do not observe a single price for an asset but rather two quotes: the bid, the highest price a buyer is willing to pay, and the ask, the lowest price a seller is willing to accept. The spread between these quotes carries information about the liquidity of the option, which is captured within the lattice creation (via volume weighting) and the arbitrage-cleaning process. The existence of an arbitrage-free surface between the spread is discussed in Appendix A.
2. Data artefacts of our lattice creation method. The lattice of $(\tau, m)$ pairs created by Algorithm 1 is not guaranteed to satisfy our static arbitrage constraints. Even if the mid-price captured a static arbitrage-free surface, we have interpolated these prices to common nodes, which could potentially break convexity and/or monotonicity conditions.
Thus it is important to note that what may seem like arbitrage violations in the data are unlikely to be opportunities to make actual arbitrage profits in the market.
From Algorithm 1, we have a timeseries of option surfaces. We can use methods from Cohen et al. (2020) to “repair” these constructed surfaces of static arbitrage violations. Note that Cohen et al. (2020) applied this method to end-of-day closing data, while in our context it is applied to intraday, high frequency cryptocurrency options data.
Recall the representation of static arbitrage constraints reproduced from Cohen et al. (2020) in Section 2.2 above:
$$A c \ge 0,$$
where $A$ is a matrix encoding each of the option shape constraints and $c$ is a stacked vector of call option prices across moneyness/time-to-maturity pairs. Let $A \in \mathbb{R}^{m \times N}$ and $b \in \mathbb{R}^m$ encode the shape constraints (following Cohen et al. (2020)):
(i) monotone in maturity: $c(\tau_{i+1}, m) - c(\tau_i, m) \ge 0$;
(ii) monotone in strike: $c(\tau, m_j) - c(\tau, m_{j+1}) \ge 0$;
(iii) convex in strike: $\dfrac{c(\tau, m_j) - c(\tau, m_{j-1})}{x_j - x_{j-1}} \le \dfrac{c(\tau, m_{j+1}) - c(\tau, m_j)}{x_{j+1} - x_j}$, where $x_j := e^{m_j}$;
(iv) bounds and intrinsic lower envelope: $0 \le c(\tau, m) \le 1$ and $c(\tau, m) \ge (1 - e^m)^+$.
On a non-uniform $x = e^m$ grid, (iii) is implemented as the linear inequality
$$\frac{1}{x_j - x_{j-1}}\, c_{j-1} - \left(\frac{1}{x_j - x_{j-1}} + \frac{1}{x_{j+1} - x_j}\right) c_j + \frac{1}{x_{j+1} - x_j}\, c_{j+1} \ge 0.$$
When adjacent $\tau$ rows have misaligned $m$ grids, (i) evaluates $c(\tau_{i+1}, m)$ by linear interpolation on the $\tau_{i+1}$ row.
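A small numpy check of the non-uniform convexity row (the function name, toy grid and toy prices are ours, for illustration):

```python
import numpy as np

def convexity_row(x_prev, x_mid, x_next, n, j):
    """Row r of the constraint r @ c >= 0 encoding convexity in strike
    x = e^m at interior node j on a (possibly non-uniform) grid."""
    row = np.zeros(n)
    hl, hr = x_mid - x_prev, x_next - x_mid
    row[j - 1] = 1.0 / hl
    row[j] = -(1.0 / hl + 1.0 / hr)
    row[j + 1] = 1.0 / hr
    return row

# Prices convex and decreasing in x (c = ((1.2 - x)^+)^2) pass the check.
m = np.array([-0.4, -0.1, 0.1, 0.5])   # non-uniform moneyness grid
x = np.exp(m)
c = np.maximum(1.2 - x, 0.0) ** 2
for j in range(1, len(x) - 1):
    assert convexity_row(x[j - 1], x[j], x[j + 1], len(x), j) @ c >= -1e-12
print("convexity rows satisfied")
```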
To speed up repeated projections, redundant rows of $A c \le b$ are eliminated by LP tests, a method due to Caron et al. (1989): for each row $i$, solve
$$\max_c \ A_i c \quad \text{s.t.} \quad A_{-i}\, c \le b_{-i},$$
where $A_{-i}$ and $b_{-i}$ denote the matrix $A$ with its $i$-th row removed and the vector $b$ with its $i$-th element removed. We are testing whether the $i$-th constraint is redundant: if the optimum satisfies $A_i c^* \le b_i + \text{tol}$, the constraint can never be violated given the others, so row $i$ is redundant and is removed.
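A sketch of this redundancy test using `scipy.optimize.linprog`, keeping the price box constraint $0 \le c \le 1$ as solver bounds (function name and toy constraints are ours):

```python
import numpy as np
from scipy.optimize import linprog

def redundant_rows(A, b, tol=1e-9):
    """Caron et al. (1989) test: row i of A c <= b is redundant if
    max_c A_i c, subject to the remaining constraints, is still <= b_i."""
    redundant = []
    for i in range(A.shape[0]):
        keep = np.arange(A.shape[0]) != i
        # linprog minimises, so maximise A_i c by negating the objective
        res = linprog(-A[i], A_ub=A[keep], b_ub=b[keep],
                      bounds=[(0.0, 1.0)] * A.shape[1])
        if res.status == 0 and -res.fun <= b[i] + tol:
            redundant.append(i)
    return redundant

# c1 <= 0.5 together with c1 <= 0.8: the looser constraint never binds.
A = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([0.5, 0.8, 0.9])
print(redundant_rows(A, b))  # [1]
```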
Subsequently, each time slice is repaired by the nearest feasible point
$$\hat{c}_t = \arg\min_{c \in \mathbb{R}^N} \frac{1}{2}\left\|c - c_t^{\text{raw}}\right\|_2^2 \quad \text{s.t.} \quad A c \le b, \quad 0 \le c \le 1.$$
This strictly enforces all static constraints while minimally perturbing the raw panel. When bid/ask data are available, we perform a spread-aware $L^1$ projection, biasing adjustments to remain within the spread by introducing non-negative up/down slacks $\epsilon^+, \epsilon^-$ and weights $w^+, w^-$ proportional to inverse half-spreads:
$$\min_{c,\ \epsilon^\pm \ge 0} \ \sum_{n=1}^N w_n^+ \epsilon_n^+ + w_n^- \epsilon_n^-$$
$$\text{s.t.} \quad c = c_t^{\text{mid}} + \epsilon^+ - \epsilon^-, \quad A c \le b, \quad 0 \le c \le 1,$$
$$w_n^+ = \frac{1}{\max\{c_{t,n}^{\text{ask}} - c_{t,n}^{\text{mid}},\ \delta\}}, \qquad w_n^- = \frac{1}{\max\{c_{t,n}^{\text{mid}} - c_{t,n}^{\text{bid}},\ \delta\}}.$$
The spread-aware L1 adjustment matches the method of Cohen et al. (2020) exactly.
After these steps, the repaired surfaces { c ^ t } t are static-arbitrage free and feed all downstream steps. Because each time slice solves a convex program with warm starts, the method scales to high-frequency data while remaining faithful to the observed bid-ask data. Figure 7 highlights an example of the adjustments made to create an arbitrage-free surface. We can see that the deep out-of-the-money options are the ones that have the largest corrections. This can make sense, as these options are typically the ones with the largest bid-ask spread and also the ones with the most unreliable (and potentially stale) quotes.
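As a sketch of the per-slice $L^2$ repair, the following uses a generic SLSQP solve from `scipy.optimize` in place of a dedicated QP solver; the two-node toy example is ours:

```python
import numpy as np
from scipy.optimize import minimize

def repair_slice(c_raw, A, b):
    """Nearest feasible point: min 0.5 * ||c - c_raw||^2 subject to
    A c <= b and 0 <= c <= 1, solved with a generic SLSQP routine."""
    res = minimize(
        lambda c: 0.5 * np.sum((c - c_raw) ** 2),
        x0=np.clip(c_raw, 0.0, 1.0),
        jac=lambda c: c - c_raw,
        constraints=[{"type": "ineq",
                      "fun": lambda c: b - A @ c,   # feasible iff >= 0
                      "jac": lambda c: -A}],
        bounds=[(0.0, 1.0)] * c_raw.size,
        method="SLSQP")
    return res.x

# Two-node toy slice violating monotonicity in moneyness (need c0 >= c1).
A = np.array([[-1.0, 1.0]])       # row encodes c1 - c0 <= 0
b = np.array([0.0])
c_raw = np.array([0.30, 0.40])    # violates the constraint
c_hat = repair_slice(c_raw, A, b)
print(np.round(c_hat, 4))         # ~[0.35 0.35]: both prices meet midway
```

The projection pulls both prices to the same level, the closest point on the constraint boundary, which mirrors how the full repair makes the smallest adjustment consistent with the surface shape.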

3.3. Linear Factor Representation

Now that we have a series of arbitrage-free option surfaces, we note that the representation is high-dimensional and likely contains redundant information: prices across the surface are strongly correlated. It is therefore useful and efficient to decompose the surface into a low-dimensional representation.
We follow the method presented in Cohen et al. (2023a), where it is labelled Algorithm 1. This algorithm decomposes the surface into latent factors chosen to minimise arbitrage in the reconstruction.
Suppose that the surface can be described by d latent factors ξ 1 t , , ξ d t , collected into the factor vector Ξ t = ( ξ 1 t , , ξ d t ) R d . We link these factors to the vector of call option prices via a linear decomposition:
\[
\tilde{c} = G_0(\tau, m) + \sum_{i=1}^{d} G_i(\tau, m) \, \xi_{it}
\]
Here G 0 ( τ , m ) R N is a baseline price vector (the temporal mean of the arbitrage-repaired surfaces) and each G i ( τ , m ) R N , i = 1 , , d , is a vector of linear loading coefficients indexed over the N lattice nodes ( τ , m ) , mapping the scalar factor ξ i t to price perturbations across the surface. Stacking the loading vectors column-wise gives the N × d loading matrix M = [ G 1 G 2 G d ] , so that (9) can be written compactly as c ˜ = G 0 + M Ξ t .
The loading vectors G i and associated factor scores ξ i t are obtained via a modified principal components analysis (PCA) procedure, following Algorithm 1 of Cohen et al. (2023a). The decomposition is performed sequentially, with each successive factor chosen to optimise a different objective:
1.
Minimal Dynamic Arbitrage — The first loading vector G 1 (denoted G dyn ) is chosen to align with the leading principal component of the empirical z-factor, i.e., the deterministic drift operator applied to the reconstructed surface (see Appendix B). This ensures the dominant direction of surface evolution is representable by the factor model, thereby minimising violation of the HJM-type drift restriction.
2.
Statistical Accuracy — The second loading vector G 2 (denoted G stat ) is the leading principal component of the residual price variation after projecting out G 1 . This captures the direction of maximum remaining variance, ensuring accurate reconstruction.
3.
Minimal Static Arbitrage — The third loading vector G 3 (denoted G sa ) is chosen from the residual subspace to minimise the static arbitrage violations (measured via the constraint matrix A) in the reconstructed surfaces. A hinge-penalty objective penalises violations of A c ˜ 0 during the optimisation of this direction.
The loading matrix is therefore partitioned as M = [ G dyn G stat G sa ] under a three-factor model ( d = 3 ).
Figure 8 illustrates the three loading vectors and the baseline G 0 under a three-factor model. While the stochastic loading vectors are difficult to interpret individually, the baseline G 0 exhibits the standard call surface shape one would expect. Similar to the approach by Cohen et al. (2023a), all loading vectors G i ( i 1 ) are hard coded to zero at maturity τ = 0 to ensure that the terminal payoff condition is satisfied.
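To make the sequential construction concrete, here is a minimal numpy sketch of one step of the procedure: extracting the next loading as the leading principal component of the residual surface variation after projecting out previously chosen loadings. This corresponds to the statistical-accuracy step only; the dynamic- and static-arbitrage steps replace the SVD objective with the z-factor PCA and a hinge penalty respectively, and the function name is ours:

```python
import numpy as np

def leading_residual_loading(C, G_prev):
    """Next loading direction = leading principal component of the surface
    residuals after projecting out previously chosen loadings.

    C: T x N matrix of centred (baseline-removed) repaired surfaces.
    G_prev: N x k matrix whose columns are the loadings chosen so far.
    """
    if G_prev.size:
        Q, _ = np.linalg.qr(G_prev)       # orthonormal basis of prior loadings
        C = C - (C @ Q) @ Q.T             # remove variation already explained
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    return Vt[0]                          # direction of max remaining variance
```

The factor scores ξ then follow by projecting each centred surface onto the chosen loadings.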

4. Neural SDE Modelling

This section builds on the method of Cohen et al. (2023a), applying it to high-frequency cryptocurrency options data to obtain a data-driven, arbitrage-free model taking in the timeseries of available cryptocurrency option market data.

4.1. Stochastic Dynamics of Market Factors

We first posit an SDE for the evolution of our market factors under the statistical (a.k.a. “objective”, “physical” or “real–world”) probability measure, commonly denoted as the P measure. Specifically, the dynamics of the d-dimensional factor vector Ξ t = ( ξ 1 t , , ξ d t ) introduced in Section 3.3 are assumed to be governed by the SDE,
\[
d\Xi_t = \mu_\theta(\Xi_t, t) \, dt + \sigma_\phi(\Xi_t, t) \, dW_t,
\]
where μ_θ is the drift vector, σ_ϕ is the diagonal diffusion matrix, and W_t is a standard d-dimensional Brownian motion, with d the number of factors chosen in the factor decoding of Section 3.3. Assuming a diagonal diffusion matrix means the noise terms driving the Ξ factors are taken to be uncorrelated; over the 24 hours of data this assumption holds reasonably well when analysing the residual factor plots, see Section 4.4.2.
In our data-driven approach, we represent the drift and diffusion components of this SDE with neural networks with weights θ and ϕ respectively. The specific architecture of the neural networks is detailed in Appendix E.
We can formulate the loss function for training the neural networks by considering the Euler–Maruyama discretisation of the stochastic differential equation, which yields a well-defined transition. Let Δt_k := t_{k+1} − t_k and ΔW_k := W_{t_{k+1}} − W_{t_k} ∼ N(0, Δt_k I_d). The Euler–Maruyama step for the Itô SDE is
\[
\Xi_{k+1} = \Xi_k + \mu_\theta(\Xi_k, t_k) \, \Delta t_k + \sigma_\phi(\Xi_k, t_k) \, \Delta W_k
= \Xi_k + \mu_\theta(\Xi_k, t_k) \, \Delta t_k + \sigma_\phi(\Xi_k, t_k) \sqrt{\Delta t_k} \, \varepsilon_k,
\qquad \varepsilon_k \sim N(0, I_d).
\]
Conditionally on Ξ k , the transition density is Gaussian:
Ξ k + 1 Ξ k N m k , Δ t k Γ k , m k : = Ξ k + μ θ ( Ξ k , t k ) Δ t k , Γ k : = σ ϕ ( Ξ k , t k ) σ ϕ ( Ξ k , t k ) .
We can then formulate a likelihood-based loss function. Define the observed increment ΔΞ_k := Ξ_{k+1} − Ξ_k. The one-step negative log-likelihood (NLL) is
\[
\ell_k(\theta, \phi) = \frac{1}{2} \left[
\left( \Delta\Xi_k - \mu_\theta(\Xi_k, t_k) \Delta t_k \right)^{\top}
\left( \Delta t_k \Gamma_k \right)^{-1}
\left( \Delta\Xi_k - \mu_\theta(\Xi_k, t_k) \Delta t_k \right)
+ \log\det\left( \Delta t_k \Gamma_k \right)
+ d \log(2\pi) \right],
\]
so that for a path { Ξ k } k = 0 K , the total loss function value is
\[
\mathcal{L}_{\mathrm{NLL}}(\theta, \phi) = \sum_{k=0}^{K-1} \ell_k(\theta, \phi).
\]
This NLL serves as both the training objective and the test loss: minimising it is equivalent to maximising the likelihood of the next step given the current data. This follows the classical EM framework from Dempster et al. (1977) and the neural SDE learning strategy from Dridi et al. (2021), and is structurally analogous to the maximisation/inference loop approach from Greff et al. (2017).
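For the diagonal-diffusion case the per-step NLL reduces to a few lines; a numpy sketch in which `mu_fn` and `sigma_fn` stand in for the trained drift and diffusion networks (function and argument names are ours):

```python
import numpy as np

def euler_nll(xi, dt, mu_fn, sigma_fn):
    """Path negative log-likelihood under the Euler-Maruyama Gaussian
    transition with a diagonal diffusion.

    xi: (K+1, d) observed factor path; dt: (K,) step sizes.
    """
    nll, d = 0.0, xi.shape[1]
    for k in range(len(dt)):
        var = dt[k] * sigma_fn(xi[k]) ** 2            # diag of dt_k * Gamma_k
        resid = xi[k + 1] - xi[k] - mu_fn(xi[k]) * dt[k]
        nll += 0.5 * (np.sum(resid ** 2 / var)        # Mahalanobis term
                      + np.sum(np.log(var))           # log det(dt_k Gamma_k)
                      + d * np.log(2.0 * np.pi))
    return nll
```

In the actual training loop the same expression would be written in an autodiff framework so that gradients flow to the network weights (θ, ϕ).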
This training scheme assumes that the discretised transitions of the process are Gaussian, and at ultra-high frequency (sub-second intervals) this assumption does not hold. When using 24 hours of high-frequency cryptocurrency options data, prices were therefore aggregated to one-minute increments by taking the median price, so that goodness-of-fit tests look reasonable. The training results at high frequency and the associated discussion can be found in Appendix C.
One can also consider using different expansions or quasi-likelihoods as the loss term for the SDE. Aït-Sahalia (2002) develops a closed-form approximation for the likelihood of discretely observed diffusion processes, addressing the fact that most continuous-time SDEs lack an analytically tractable transition density. The approach constructs a sequence of explicit Hermite polynomial expansions that converge uniformly to the true but unknown transition density. By successively correcting the leading Gaussian term through higher-order terms in the sampling interval Δ t , this method yields a quasi-maximum likelihood estimator that remains asymptotically equivalent to the true MLE. Conceptually, it extends the Euler–Maruyama Gaussian transition by incorporating higher-order corrections that depend on the local drift, diffusion, and their spatial derivatives, thereby capturing the nonlinearity and skewness inherent in finite-time transitions. We explored using this approximation as a higher-order likelihood correction to the Gaussian transition in (13), attempting to capture the non-Gaussianity of the high frequency intervals, and the training results in this case are presented and discussed in Appendix D.

4.2. Arbitrage Free Factor Representation

In order to constrain the outputs of our model to be free of static arbitrage, we must first consider what these constraints mean in our factor space. As the constraints are linear inequalities and our factor representation of option prices is also linear, the constraints have a representation in the factor space. Recall the linear constraints of Section 2.2, A c ≤ 0, and the linear factor decomposition c̃ = G_0(τ, m) + Σ_{i=1}^{d} G_i(τ, m) ξ_{it}. Substituting this decomposition into the constraint system gives
\[
A \tilde{c} = A G_0 + \sum_{i=1}^{d} A G_i \, \xi_{it} \leq 0 .
\]
Hence, in the factor space the arbitrage-free region is characterised by all factor vectors Ξ t = ( ξ 1 t , , ξ d t ) satisfying
\[
A_G \, \Xi_t \leq b_G, \qquad A_G := \left[ A G_1, A G_2, \ldots, A G_d \right], \qquad b_G := -A G_0 .
\]
This defines a convex polytope in the factor space, within which any combination of factors yields option prices that respect all static arbitrage constraints. During training and simulation, we therefore enforce that the factors Ξ_t remain inside this feasible region
\[
\mathcal{P}_{\mathrm{arb}} = \left\{ \Xi_t \;\middle|\; A_G \, \Xi_t \leq b_G \right\},
\]
ensuring that the reconstructed surface is statically arbitrage-free.
Removing redundant linear constraints as described in Section 3.2, we can plot the static-arbitrage-free convex polytope together with the associated ξ factors to visually inspect whether the factor decoding of Section 3.3 has been done correctly. Figure 9 presents the scores of a two-factor model and the corresponding static no-arbitrage polygon. The vast majority of the latent ξ factors lie within the acceptable polygon. The few scores lying on the boundary or just outside it are a consequence of the hinge-penalty optimisation used to reduce static arbitrage in the price reconstruction during factor decoding (Section 3.3).
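The mapping into factor space and the membership test are each a one-liner; a toy sketch (the constraint row, baseline and loadings are illustrative, not from the data):

```python
import numpy as np

def factor_polytope(A, G0, M):
    """Map lattice constraints A c <= 0 through the decoder c = G0 + M Xi,
    giving the factor-space system A_G Xi <= b_G with b_G = -A G0."""
    return A @ M, -(A @ G0)

def inside_polytope(Xi, A_G, b_G, tol=1e-9):
    """True iff the factor vector decodes to a statically arbitrage-free surface."""
    return bool(np.all(A_G @ Xi <= b_G + tol))
```

Checking the empirical factor scores against this polytope reproduces the visual inspection of Figure 9 programmatically.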

4.3. Constrained Networks and No-Static-Arbitrage Guarantees

Given the specified region of space in which the Ξ factors can evolve without option prices allowing static arbitrage, we can enforce constraints on the Neural SDE to ensure that the process remains within the required polytope. Cohen et al. (2023a) provide the method to achieve this using Friedman and Pinsky conditions for the non-attainability of a boundary for an SDE (Friedman and Pinsky 1973).
Definition: [Friedman–Pinsky Non-Attainability Conditions]
Consider an Itô diffusion in R d ,
d Ξ t = μ ( Ξ t ) d t + σ ( Ξ t ) d W t ,
and let P = { Ξ ∈ R^d : a_i^⊤ Ξ ≤ b_i, i = 1, …, n_f } be a convex polytope defined by n_f linear constraints (i.e. n_f faces). For each face i with boundary function
\[
q_i(\Xi) := a_i^{\top} \Xi - b_i = 0,
\]
define the normal drift and diffusion components near the boundary as
\[
\mu_i(\Xi) := a_i^{\top} \mu(\Xi), \qquad
\sigma_i^2(\Xi) := a_i^{\top} \Gamma(\Xi) \, a_i, \qquad
\Gamma(\Xi) := \sigma(\Xi) \, \sigma(\Xi)^{\top}.
\]
Then, a sufficient condition for the boundary P to be non-attainable (i.e. the process remains in P almost surely if started inside) is that, for each i,
\[
(i)\;\; \mu_i(\Xi) \leq 0 \;\text{ whenever } q_i(\Xi) = 0, \qquad
(ii)\;\; \sigma_i^2(\Xi) = 0 \;\text{ whenever } q_i(\Xi) = 0 .
\]
That is, the drift of the SDE points inward at the boundary and the diffusion vanishes in the direction normal to it.
Cohen et al. (2023a) enforce these conditions via a smooth Lipschitz function, which shrinks the diffusion as the process approaches the boundary and gradually adds drift in the inward direction. Choosing this function to be smooth and Lipschitz is important so as not to break backpropagation when training the neural networks. Recall the linear price map c̃ = G_0 + M Ξ_t from (9), where Ξ_t ∈ R^d is the factor vector and M = [G_dyn G_stat G_sa] ∈ R^{N×d} is the loading matrix. Given the lattice constraints A c̃ ≤ b, we map them into factor space as
\[
H \, \Xi_t \leq h, \qquad H := A M, \qquad h := b - A G_0 ,
\]
and augment the system with a data-driven box L ≤ Ξ_t ≤ U. The box bounds L, U ∈ R^d are obtained component-wise from the training factor scores as L_k = Q_α({ξ_{k,t}}_t) and U_k = Q_{1−α}({ξ_{k,t}}_t) for a small quantile level α (defaulting to component-wise min/max when quantile trimming is not applied). These box constraints are converted into additional half-space rows appended to H Ξ_t ≤ h, so that the diffusion shrinkage and inward drift correction treat empirical range boundaries identically to the static-arbitrage faces. Let n_f denote the total number of half-space constraints (arbitrage faces plus box faces) after this augmentation. Rows of H are row-normalised when building the shrink geometry, so the signed distance of Ξ_t to face i is
\[
\rho_i(\Xi_t) := \frac{h_i - H_i \Xi_t}{\| H_i \|} = h_i - H_i \Xi_t \quad (\text{since } \| H_i \| = 1).
\]
Following Cohen et al. (2023a), the Lipschitz shrink profile is given by
\[
\phi(x) = \frac{x_{+}}{1 + x_{+}}, \qquad x_{+} := \max\{x, 0\}.
\]
Near the boundary, ρ_i(Ξ_t) → 0 and hence ϕ(ρ_i) → 0; far inside the polytope, ϕ(ρ_i) → 1.
At a given state Ξ t we select the k = min { d , n f } most relevant faces (smallest ρ i ), build an orthonormal frame Q ( Ξ t ) R d × d whose first k columns point along the corresponding unit normals, and compute shrink coefficients ε ( Ξ t ) [ 0 , 1 ] k with components ε j ( Ξ t ) = ϕ ( ρ i j ( Ξ t ) ) . This yields two equivalent shrink operators (chosen by mode):
- Matrix shrink:
\[
P(\Xi_t) = Q(\Xi_t)
\begin{pmatrix} \operatorname{diag}(\varepsilon(\Xi_t)) & 0 \\ 0 & I_{d-k} \end{pmatrix}
Q(\Xi_t)^{\top} \in \mathbb{R}^{d \times d},
\]
so the effective diffusion is
\[
\Gamma_{\mathrm{eff}}(\Xi_t) = P(\Xi_t) \, \Gamma(\Xi_t) \, P(\Xi_t)^{\top} .
\]
- Diagonal shrink (default, diagonal noise): compute elementwise multipliers s ( Ξ t ) [ 0 , 1 ] d so that diag ( s ( Ξ t ) ) σ ( Ξ t ) best matches P ( Ξ t ) diag ( σ ( Ξ t ) ) in a covariance sense. The effective diffusion used inside the model is
\[
\sigma_{\mathrm{eff}}(\Xi_t) = s(\Xi_t) \odot \sigma(\Xi_t) .
\]
In both cases, along any active face normal a_i one has a_i^⊤ Γ_eff(Ξ_t) a_i → 0 as ρ_i(Ξ_t) → 0, satisfying the boundary-diffusion condition (ii). The same shrink can also be applied at the loss level: rather than modifying the diffusion σ directly, the constraint geometry is absorbed into a state-dependent preconditioner Ω(Ξ_t) ∈ R^{d×d} (derived from P(Ξ_t)) that transforms the observed increments and the likelihood. Specifically, the tuple (Ω, det Ω, Ω ΔΞ) is passed to the negative log-likelihood (14), so that the Gaussian transition is evaluated in the preconditioned coordinate system in which the boundary constraints are automatically respected.
To enforce (16) via an inward drift correction, we precompute, along the observed factor path X_t, per-time inward directions dirs_t and normalised boundary distances ε_t^μ from the row-normalised half-spaces (W, b) of the polytope output. At training time, for a batch indexed by t, the raw drift μ(·) is corrected by
\[
\mu_{\mathrm{corr}}(X_t) = \mu(X_t) + \Delta\mu_t, \qquad
\Delta\mu_t = \sum_{k} w_{t,k} \, \mathrm{dirs}_{t,k}, \qquad
w_{t,k} = \frac{\lambda_{\mathrm{bc}}}{\varepsilon^{\mu}_{t,k} + \varepsilon_0} \,
\mathbf{1}\{ \varepsilon^{\mu}_{t,k} \leq \tau_{\mathrm{cut}} \},
\]
with user-selectable parameters (λ_bc, ε_0, τ_cut) and an optional norm cap ‖Δμ_t‖ ≤ cap.
Together, the smooth diffusion shrink guarantees a_i^⊤ Γ_eff(Ξ_t) a_i → 0 as the boundary is approached, and the inward drift correction ensures a_i^⊤ μ_corr(Ξ_t) ≤ 0 there; hence the Friedman–Pinsky non-attainability conditions are satisfied for the learned SDE on the static-arbitrage polytope.
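A simplified sketch of the boundary treatment for a single state, using only the nearest face and illustrative parameter defaults (the full method shrinks along the k most active faces via the orthonormal frame Q; the single-face restriction and all names here are ours):

```python
import numpy as np

def shrink_profile(x):
    """phi(x) = x_+ / (1 + x_+): 0 on the boundary, tending to 1 deep inside."""
    xp = np.maximum(x, 0.0)
    return xp / (1.0 + xp)

def constrained_coeffs(Xi, mu, sigma, H, h, lam=1.0, eps0=1e-6, tau_cut=0.25):
    """One-state sketch: shrink the diagonal diffusion by the profile at the
    nearest face and, close to the boundary, push the drift back inward.
    H is assumed row-normalised so h - H Xi is a distance."""
    rho = h - H @ Xi                     # distances to faces (positive inside)
    i = int(np.argmin(rho))              # most active face
    eps = shrink_profile(rho[i])
    sigma_eff = eps * sigma              # diffusion vanishes at the boundary
    w = lam / (eps + eps0) if eps <= tau_cut else 0.0
    mu_corr = mu - w * H[i]              # -H_i points back into H Xi <= h
    return mu_corr, sigma_eff
```

Both operations are smooth in Ξ away from the kink of x₊, which is what keeps backpropagation well-behaved during training.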

4.4. Training Results

This section discusses training results and goodness-of-fit tests of the Neural SDE model on our Deribit cryptocurrency options data.

4.4.1. Loss Curve

Illustrated in Figure 10, each loss curve, with and without the drift/diffusion constraints, suggests convergence. However, the loss with the constraints is far lower than the loss without them, and the gap between train and test loss is smaller. We interpret this as the constraints working as intended, reducing the effective state space of the Ξ factor values.
Here, the train/test split is 80/20: the first 80% of the 24-hour period is used for training and the remaining 20% is the test, or out-of-sample, region.

4.4.2. Residuals Plots

Following the estimation of the Neural SDE, we assess the adequacy of the fitted drift and diffusion functions by examining the one-step standardised residuals. Under the Euler–Maruyama discretisation of the stochastic differential equation
d Y t = μ θ ( Y t ) d t + σ ϕ ( Y t ) d W t ,
the discrete-time transition between observations can be approximated as
\[
Y_{t+\Delta t} \approx Y_t + \mu_\theta(Y_t) \, \Delta t + \sigma_\phi(Y_t) \sqrt{\Delta t} \, \varepsilon_t,
\qquad \varepsilon_t \sim N(0, I_d),
\]
where ε t are the model innovations implied by the fitted dynamics. Given observed data { Y t } t = 1 T , we therefore compute the empirical residuals as
\[
\hat{\varepsilon}_t = \frac{ Y_{t+\Delta t} - Y_t - \mu_\theta(Y_t) \, \Delta t }{ \sigma_\phi(Y_t) \sqrt{\Delta t} } .
\]
If the estimated SDE correctly captures the conditional mean and variance of the data, then ε ^ t should be approximately independent and standard normal across time and dimensions.
These residuals thus serve as a key diagnostic measure: quantile–quantile (QQ) plots of ε ^ t against a standard normal distribution test for deviations from Gaussianity (such as skewness or excess kurtosis), while pairwise scatter plots of ε ^ t ( i ) and ε ^ t ( j ) assess the cross-sectional independence across latent factors.
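Computing the residual panel is straightforward; a numpy sketch with stand-in drift and diffusion callables (names ours):

```python
import numpy as np

def standardised_residuals(Y, dt, mu_fn, sigma_fn):
    """One-step Euler residuals eps_hat = (dY - mu dt) / (sigma sqrt(dt)).

    Y: (T, d) observed path; dt: scalar step size. The residuals should be
    approximately iid N(0, I) when drift and diffusion are well specified.
    """
    num = Y[1:] - Y[:-1] - mu_fn(Y[:-1]) * dt
    return num / (sigma_fn(Y[:-1]) * np.sqrt(dt))
```

The pairwise residual correlations of the next subsection are then `np.corrcoef(eps.T)`, and QQ plots compare the sorted residuals of each column against standard normal quantiles.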
From the QQ plots in Figure 11, we can see that the Gaussian transition-increment assumption behind our Euler–Maruyama training step is approximately correct, though not perfect: the residuals show heavier tails and a slightly narrower centre than a standard normal. This suggests that larger deviations from the model's predictions remain possible, which makes sense given the cryptocurrency market's higher volatility.
In addition to marginal normality checks, we also compute the pairwise correlations between residual components to evaluate the cross-sectional independence of the model’s innovations. For each pair of factors ( i , j ) , the sample correlation coefficient is given by
\[
\rho_{ij} = \frac{ \operatorname{Cov}\!\left( \hat{\varepsilon}^{(i)}, \hat{\varepsilon}^{(j)} \right) }
{ \sqrt{ \operatorname{Var}\!\left( \hat{\varepsilon}^{(i)} \right) \operatorname{Var}\!\left( \hat{\varepsilon}^{(j)} \right) } } .
\]
Under a correctly specified diffusion, the residuals should be uncorrelated across factors, reflecting that each dimension of the learned SDE captures an independent source of risk or market variation.
Analysing the residuals plotted in Figure 12, we see some slight correlation between the 'dynamic arbitrage' factor and the 'statistical accuracy' factor. This is not unexpected: in testing, the vast majority of variance is explained by a 2-factor model. When training with two factors (one statistical-accuracy factor and one static-arbitrage factor), we observe an R² of 99.5–99.8%, indicating that almost all variation in the observed option prices is captured by the model's reconstructed surfaces. Here R² measures the proportion of total price variance explained by the factor representation, reflecting both the completeness and the goodness-of-fit of the decomposition. This is unsurprising in the sense that classical stochastic volatility models are also two-factor models (see e.g. Heston (1993); Hagan et al. (2002)). Aside from this, the correlations are low and the residuals are centred.

5. Applications

This section considers two fundamental downstream applications: hedging and risk management, in the form of Value at Risk (VaR). These represent the most immediate and practically meaningful uses of a market pricing model. Both tasks provide direct validation of the model’s ability to capture market dynamics: hedging evaluates how well the learned dynamics can replicate real-world sensitivities and risk exposures, while VaR assesses the model’s capacity to simulate realistic loss distributions under stochastic evolution. However, this is not to say that this is an exhaustive list of the model’s capabilities as it provides a complete stochastic representation of how the implied volatility and option prices evolve through time. In principle, this framework can be extended to a wide range of quantitative finance tasks — including scenario generation, risk-neutral valuation, stress testing, derivative portfolio simulation, and even data-driven model calibration or discovery of latent market factors.

5.1. Market Simulation and VaR

Cohen et al. (2023b) detail the methodology for creating VaR estimates for options portfolios using neural SDE market models. In essence, the Neural SDE is a generative model: it can simulate the learned Ξ market factors. Via the factor decoding process (moving from the low-dimensional factor space back to the higher-dimensional options surface), simulated prices for various option portfolios can be obtained. Since this yields a distribution of potential returns for the options on our defined liquid lattice, VaR estimates (and hence other risk metrics such as expected shortfall) follow naturally from this distribution.

5.1.1. Market Simulation Methodology

Let Ξ t R d denote the vector of latent market factors, evolved according to the trained Neural–SDE:
\[
d\Xi_t = \mu_\theta(\Xi_t) \, dt + \sigma_\phi(\Xi_t) \, dW_t, \qquad
\sigma_\phi(\Xi_t) := \operatorname{diag}(\sigma(\Xi_t)) \succeq 0,
\]
where μ θ and σ ϕ represent the learned drift and diffusion networks, and W t is a standard Brownian motion. During simulation, we again ensure that the factor process Ξ t remains within the static–arbitrage polytope defined by the half-space constraints:
\[
\mathcal{P}_{\mathrm{arb}} = \{ \Xi_t \mid A_G \, \Xi_t \leq b_G \} .
\]
For each face i of the polytope, define the signed distance
\[
\rho_i(\Xi_t) = b_{G,i} - A_{G,i}^{\top} \Xi_t, \qquad i = 1, \ldots, m,
\]
which measures proximity to the ith constraint boundary.
As before, we apply a smooth Lipschitz function φ σ : R + [ 0 , 1 ] to modulate the effective diffusion near the boundary:
\[
\varepsilon_i(\Xi_t) = \varphi_\sigma\!\left( \rho_i(\Xi_t) \right) = \frac{ \rho_i(\Xi_t) }{ 1 + \rho_i(\Xi_t) } .
\]
Let Q ( Ξ t ) be an orthonormal basis spanning the k most active faces (smallest ρ i ), and define
\[
P(\Xi_t) = Q(\Xi_t)
\begin{pmatrix} \operatorname{diag}(\varepsilon(\Xi_t)) & 0 \\ 0 & I_{d-k} \end{pmatrix}
Q(\Xi_t)^{\top} .
\]
The effective diffusion is then given by
\[
\sigma_{\mathrm{eff}}(\Xi_t) = \operatorname{diag}\!\left( s(\Xi_t) \right) \sigma(\Xi_t),
\]
where s(Ξ_t) ∈ [0,1]^d are elementwise shrink coefficients matching P(Ξ_t)σ(Ξ_t) in covariance. By construction, the normal component A_{G,i}^⊤ Γ_eff(Ξ_t) A_{G,i} → 0 as ρ_i(Ξ_t) → 0, satisfying the diffusion non-attainability condition.
Again following our previous procedure, a smooth correction is added to ensure the drift points inward at the boundary. Define a Lipschitz function φ_μ(ρ) = 1/(1 + ρ) and inward directions d_i associated with each active face normal A_{G,i}. The corrected drift is:
\[
\mu_{\mathrm{corr}}(\Xi_t) = \mu_\theta(\Xi_t) + \sum_{i \in I(\Xi_t)} \omega_i(\Xi_t) \, d_i, \qquad
\omega_i(\Xi_t) = \frac{ \left[ A_{G,i}^{\top} \mu_\theta(\Xi_t) \, \varphi_\mu\!\left( \rho_i(\Xi_t) \right) \right]_{+} }{ \langle d_i, A_{G,i} \rangle },
\]
ensuring A_{G,i}^⊤ μ_corr(Ξ_t) ≤ 0 along all active faces.
Given a discretisation step Δt, each Euler–Maruyama step evolves as:
\[
\Xi_{t+\Delta t} = \Xi_t + \mu_{\mathrm{corr}}(\Xi_t) \, \Delta t + \sigma_{\mathrm{eff}}(\Xi_t) \sqrt{\Delta t} \, \varepsilon_t,
\qquad \varepsilon_t \sim N(0, I_d).
\]
Running B Monte Carlo paths of Ξ t generates a sample distribution of factor evolutions { Ξ t ( b ) } b = 1 B .
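A sketch of the Monte Carlo loop; for brevity we keep paths inside the polytope by rejecting boundary-crossing steps rather than implementing the full shrink and drift-correction machinery, so this illustrates the simulation structure, not the paper's exact scheme (all names are ours):

```python
import numpy as np

def simulate_factor_paths(Xi0, n_steps, dt, B, mu_fn, sig_fn, A_G, b_G, seed=0):
    """B Euler-Maruyama paths of the factor SDE, constrained to A_G Xi <= b_G
    by rejecting any proposed step that would leave the polytope."""
    rng = np.random.default_rng(seed)
    d = len(Xi0)
    paths = np.tile(np.asarray(Xi0, float), (B, 1))
    out = [paths.copy()]
    for _ in range(n_steps):
        z = rng.standard_normal((B, d))
        prop = paths + mu_fn(paths) * dt + sig_fn(paths) * np.sqrt(dt) * z
        ok = np.all(prop @ A_G.T <= b_G, axis=1)   # feasibility of each path
        paths = np.where(ok[:, None], prop, paths) # reject infeasible proposals
        out.append(paths.copy())
    return np.stack(out)                           # (n_steps + 1, B, d)
```

From the simulated terminal factors, portfolio VaR follows by decoding to prices and taking a quantile of the P&L distribution, e.g. `VaR_99 = -np.percentile(pnl, 1)`.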

5.1.2. VaR of Option Portfolios

Table 1 presents VaR results for various option portfolios, for both the Neural SDE simulation and a standard historical simulation approach.
The results in Table 1 show that across all portfolios, the Neural–SDE simulated Value-at-Risk (VaR) estimates are consistently higher in magnitude than those obtained via historical simulation. This indicates that the Neural–SDE captures fatter tails and greater short-term variability in the factor dynamics, reflecting more realistic risk in high-frequency markets such as cryptocurrency options. The largest VaR differences appear in short-maturity, short-volatility positions (e.g., short strangles and short condors), where non-linear payoffs are most sensitive to volatility clustering and jump-like behaviour. Conversely, for longer-dated or more balanced portfolios (e.g., butterflies or calendar spreads), the Neural–SDE and historical methods yield more comparable results, suggesting that the learned diffusion smooths risk over longer horizons.
Plotting a few example loss distributions in Figure 13 and comparing them against the distribution from historical simulation, we can see that the Neural SDE does indeed provide a different distribution, which we interpret as reflecting data-driven and arbitrage-consistent dynamics of the underlying surface. Traditional historical simulation implicitly treats options, or market factors, as moving individually, whereas options in fact move as a surface with constraints on its state, as discussed in Section 2.2. The historical simulation benchmark used here is a simulation of the market factors: we take the historical distribution of factor returns (in the test period) and simulate returns by sampling from this distribution; option prices and portfolios are then obtained through the factor decoding process.

5.2. Hedging of Option Portfolios

Cohen et al. (2022) highlight how the learned model can be used to hedge a portfolio of options. While their work presents both sensitivity-based hedging and minimum-variance (MV) hedging, our application to cryptocurrency options focuses on the MV hedge, as it directly uses the learned Neural SDE model, whereas the sensitivity-based hedge only uses the factor representation of option prices.

5.2.1. Minimum Variance Hedging

Let Ξ t R d denote the latent market factors driving the option surface, and recall the linear decoder from Section 3.3,
\[
\tilde{c}(\Xi_t) = G_0 + \sum_{i=1}^{d} G_i \, \xi_{it} = G_0 + M \Xi_t, \qquad M \in \mathbb{R}^{N \times d},
\]
where M = [ G 1 G d ] is the loading matrix defined in Section 3.3. For a portfolio defined by node weights w R N , the portfolio’s factor exposure vector is given by
\[
s = M^{\top} w \in \mathbb{R}^d .
\]
To first order, the one-step change in portfolio value is given by
\[
\Delta V_t \approx s^{\top} \Delta \Xi_t, \qquad \Delta \Xi_t := \Xi_{t+1} - \Xi_t .
\]
From the learned Neural–SDE model, the drift and diffusion functions μ θ ( Ξ t ) and σ ϕ ( Ξ t ) define the instantaneous covariance of the latent factors as
\[
\Sigma_t^{\mathrm{base}} = \sigma_\phi(\Xi_t) \, \sigma_\phi(\Xi_t)^{\top} .
\]
To ensure that this covariance structure respects the static arbitrage geometry defined by the polytope constraints A_G Ξ_t ≤ b_G, we apply a state-dependent linear map Ω_t that shrinks the diffusion along the active boundary directions, as detailed in Section 4.3. The resulting adjusted covariance is
\[
\Sigma_t = \Omega_t \, \Sigma_t^{\mathrm{base}} \, \Omega_t^{\top} .
\]
This procedure enforces that factor dynamics remain within the feasible arbitrage-free region during simulation and hedging.
Now, consider a set of k hedging instruments indexed by H = { j_1, …, j_k }. Let B ∈ R^{d×k} denote their factor exposures, obtained by transposing the corresponding rows of the loading matrix:
\[
B := \left( M_{H, :} \right)^{\top} .
\]
The static minimum-variance (MV) hedge solves
\[
u^{*} = \arg\min_{u \in \mathbb{R}^k} \operatorname{Var}\!\left( (s - B u)^{\top} \Delta \Xi_t \right)
= \left( B^{\top} \bar{\Sigma} B \right)^{-1} B^{\top} \bar{\Sigma} \, s,
\]
where \(\bar{\Sigma} = \frac{1}{T-1} \sum_t \Sigma_t\) is the time-averaged model-implied factor covariance. A dynamic version of this hedge recomputes u_t at each time step according to
\[
u_t = \left( B^{\top} \Sigma_t B \right)^{-1} B^{\top} \Sigma_t \, s,
\]
yielding time-varying hedge weights that evolve consistently with the latent factor covariance implied by the Neural–SDE. In practice, the Σ t matrices are computed per step using the endogenous shrinkage Ω t and the trained model parameters ( θ , ϕ ) .
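The static hedge weights are a two-line computation once the loading matrix and a factor covariance are in hand; a toy sketch with an identity covariance and hand-picked exposures (all inputs illustrative):

```python
import numpy as np

def mv_hedge_weights(M, w, hedge_idx, Sigma):
    """Minimum-variance hedge u = (B' Sigma B)^{-1} B' Sigma s with
    s = M' w the portfolio factor exposure and B the exposures of the
    chosen hedge instruments (rows hedge_idx of M, transposed)."""
    s = M.T @ w                       # portfolio factor exposure, in R^d
    B = M[hedge_idx].T                # d x k instrument exposures
    return np.linalg.solve(B.T @ Sigma @ B, B.T @ Sigma @ s)
```

The dynamic variant simply re-evaluates the same expression with the per-step covariance Σ_t in place of the time average.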

5.2.2. Hedging an Example Portfolio (A 30-Day ATM Option)

We present hedging results for a portfolio consisting of an at-the-money option with a 30-day expiry. Figure 14 showcases the root mean squared hedging error of the Neural SDE MV hedge compared with standard Black–Scholes delta and delta-vega hedging.
While we cannot claim that the MV hedging procedure from the learnt Neural SDE model outperforms the delta or delta-vega hedge, it is comparable to these traditional hedging methods. These cryptocurrency hedging results are in line with those of Cohen et al. (2022), who present hedging results for traditional equity and FX markets.

6. Conclusions

This paper has synthesised several methods from the literature into a pipeline that takes raw options quotes and produces a non-parametric model for the entire options surface. The presented Neural SDE model adheres to static arbitrage constraints, and applications to risk management and hedging have been showcased. We conclude that combining machine learning with known financial laws, e.g. arbitrage constraints, can provide non-parametric alternatives to traditional models.
Our results demonstrate the feasibility of applying the arbitrage-free Neural SDE market model of Cohen et al. (2023a) to an entirely new asset class—high-frequency cryptocurrency options—where the microstructure and volatility regimes differ substantially from traditional markets. In particular, this extends the model’s applicability to ultra-high-frequency data, developing preprocessing methods for constructing static liquid lattices and repairing arbitrage in sparse intraday panels. The elements brought together here form a complete pipeline from raw data to an operational, data-driven, arbitrage-free simulator of the Bitcoin options market.

Author Contributions

Conceptualization, A.S. and E.S.; methodology, A.S.; software, A.S.; validation, A.S. and E.S.; formal analysis, A.S.; investigation, A.S. and E.S.; resources, A.S.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, E.S.; visualization, A.S.; supervision, E.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data was sourced from publicly available webpages of the Deribit exchange.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. On the Existence of an Arbitrage Free Surface Between the Bid and Ask Spread

Consider the interpolated price vector
\[
c_k = (1 - k) \, b^{\mathrm{raw}} + k \, a^{\mathrm{raw}},
\]
for a scale parameter k [ 0 , 1 ] . This represents a convex combination between the raw bid price vector b raw (when k = 0 ) and the raw ask price vector a raw (when k = 1 ), interpolating uniformly across all nodes.
We require that this interpolated vector satisfies the static-arbitrage and non-negativity constraints:
\[
A c_k \leq 0, \qquad c_k \geq 0,
\]
where A ∈ R^{R×n} encodes the linear inequalities enforcing monotonicity in maturity, monotonicity and convexity in moneyness, and non-negativity across all lattice points.
Substituting Eq. (A1) into the constraints gives:
\[
(1 - k) \, A b^{\mathrm{raw}} + k \, A a^{\mathrm{raw}} \leq 0 .
\]
Each row r of this matrix inequality corresponds to a scalar linear constraint:
\[
(1 - k) \left( A b^{\mathrm{raw}} \right)_r + k \left( A a^{\mathrm{raw}} \right)_r \leq 0, \qquad r = 1, \ldots, R .
\]
Each inequality in (A4) defines a half-space constraint on k, restricting k to one side of a threshold value within the interval [ 0 , 1 ] . Taken together, these R inequalities define a closed feasible interval:
\[
[ k_{\min}, k_{\max} ] \subseteq [0, 1] .
\]
If this interval is non-empty, i.e. k min k max , then there exists at least one uniform interpolation c k between the bid and ask vectors that is arbitrage-free. The endpoints of the feasible interval can be computed explicitly by solving two simple linear programs:
\[
k_{\min} = \arg\min_{0 \leq k \leq 1} k \quad \text{s.t.} \quad (1 - k) \, A b^{\mathrm{raw}} + k \, A a^{\mathrm{raw}} \leq 0,
\]
\[
k_{\max} = \arg\max_{0 \leq k \leq 1} k \quad \text{s.t.} \quad (1 - k) \, A b^{\mathrm{raw}} + k \, A a^{\mathrm{raw}} \leq 0 .
\]
These solutions yield the exact bounds on k ensuring arbitrage-free uniform interpolation within the observed bid–ask spread.
Interpretation. Geometrically, each inequality in Eq. (A3) defines a half-space in the one-dimensional variable k, restricting k to values for which the interpolated surface c k remains within the arbitrage-free region defined by A c k 0 . The intersection of these half-spaces yields the feasible interval [ k min , k max ] , within which all convex combinations of bid and ask prices satisfy the static no-arbitrage constraints. If this interval is non-empty, it implies that the observed bid–ask quotes are internally consistent with at least one arbitrage-free price surface. Conversely, an empty interval indicates that no uniform interpolation between bids and asks can satisfy all no-arbitrage inequalities simultaneously, signalling inconsistency or microstructure noise in the quoted market data.
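Because k is a scalar, the two LPs can be solved in closed form by intersecting the per-row bounds each inequality imposes on k; a sketch (function name ours, returning None when the feasible interval is empty):

```python
import numpy as np

def feasible_k_interval(A, bid, ask, tol=1e-12):
    """Closed-form [k_min, k_max] for c_k = (1-k) bid + k ask with A c_k <= 0.

    Each row gives p_r + k (q_r - p_r) <= 0 with p = A bid, q = A ask, i.e.
    an upper or lower bound on k depending on the sign of the slope.
    """
    p, q = A @ bid, A @ ask
    k_lo, k_hi = 0.0, 1.0
    for pr, qr in zip(p, q):
        slope = qr - pr
        if abs(slope) < tol:
            if pr > tol:
                return None              # constant row, violated for every k
        elif slope > 0:
            k_hi = min(k_hi, -pr / slope)
        else:
            k_lo = max(k_lo, -pr / slope)
    return (k_lo, k_hi) if k_lo <= k_hi else None
```

A non-empty return certifies that the quotes admit an arbitrage-free uniform interpolation; None flags internally inconsistent or noisy quotes, as discussed above.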

Appendix B. Drift Restrictions, Dynamic Arbitrage and the ’z’ Factor

This appendix serves as an explanation of how we minimise dynamic arbitrage, as defined in Section 2.3, in the reconstructed option prices. Appendix A in Cohen et al. (2023a) gives a detailed derivation for this. For the reader’s convenience, this derivation is summarised here, with additional commentary on the application to the Deribit data.
First, we assume that the absence of arbitrage (dynamic and static) is equivalent to the existence of an equivalent martingale measure Q, under which the discounted price process of any tradable instrument must be a local martingale. For an option with maturity T and strike K, this implies the Heath–Jarrow–Morton (HJM) type drift restriction:
$\nu_t(T, K) \cdot \phi_t = \eta_t(T, K) + \nu_{0,t}(T, K)\,\gamma_t,$
where η t ( T , K ) and ν t ( T , K ) are respectively the drift and diffusion of the option price process under the physical measure P , ϕ t is the market price of risk vector and γ t is the volatility of the underlying process.

Appendix B.1. Drift of the Option Surface Under the Physical Measure

From the factor representation of normalised call prices:
$\tilde{c}_t(\tau, m) = G_0(\tau, m) + \sum_{i=1}^{d} \xi_{i,t}\, G_i(\tau, m),$
where $\tau = T - t$ and $m = \log(K / F_t)$, applying Itô's lemma yields the differential of the call surface evaluated along the moving coordinates $(\tau_t, m_t)$:
$d\tilde{c}_t = \sum_i G_i(\tau_t, m_t)\, d\xi_{i,t} + \frac{\partial \tilde{c}_t}{\partial \tau}\, d\tau_t + \frac{\partial \tilde{c}_t}{\partial m}\, dm_t + \frac{1}{2} \frac{\partial^2 \tilde{c}_t}{\partial m^2}\, d\langle m \rangle_t.$
Substituting $d\tau_t = -\,dt$, $dm_t = \left(-\alpha_t + \tfrac{1}{2}\gamma_t^2\right) dt - \gamma_t\, dW_{0,t}$, and $d\langle m \rangle_t = \gamma_t^2\, dt$ gives the drift of the surface evolution:
$\left.\frac{d\tilde{c}_t}{dt}\right|_{\mathrm{drift}} = -\frac{\partial \tilde{c}_t}{\partial \tau} + \left(\tfrac{1}{2}\gamma_t^2 - \alpha_t\right)\frac{\partial \tilde{c}_t}{\partial m} + \frac{1}{2}\gamma_t^2\, \frac{\partial^2 \tilde{c}_t}{\partial m^2} + \sum_i \mu_{i,t}\, G_i(\tau_t, m_t).$

Appendix B.2. The Infinitesimal Drift Operator and the z-Factor

We define the deterministic differential operator
$\mathcal{Z} := -\frac{\partial}{\partial \tau} + \frac{1}{2}\gamma_t^2\, \frac{\partial}{\partial m} + \frac{1}{2}\gamma_t^2\, \frac{\partial^2}{\partial m^2},$
and apply it to the reconstructed option surface to obtain the z-factor:
$z_t = \mathcal{Z}[\tilde{c}_t](\tau_t, m_t).$
This represents the infinitesimal deterministic drift of the option surface induced by the decay of time to maturity and the diffusion of moneyness. Intuitively, z t describes how the option surface should evolve in ( τ , m ) -space in order to remain dynamically arbitrage-free.

Appendix B.3. Dynamic No-Arbitrage Under the Basis Expansion

Under the basis expansion of Eq. (A11), the HJM drift restriction becomes:
$G\mu_t + z_t = G\sigma_t\, \psi_t,$
where $\mu_t$ and $\sigma_t$ are respectively the drift and diffusion coefficients of the factor process $d\xi_t = \mu_t\, dt + \sigma_t\, dW_t$, and $\psi_t$ is the vector of market prices of risk. Eq. (A14) states that, up to a change of measure, the observed drift of the factors $G\mu_t$ should be explainable by the deterministic surface drift $z_t$ and the model's diffusion structure $G\sigma_t$.

Appendix B.4. Principal Component Analysis of the z-Factor

We compute z t empirically for each observation of the reconstructed surface and perform Principal Component Analysis (PCA) over its temporal evolution. The PCA decomposes the variability of z t into orthogonal modes corresponding to consistent directions of deterministic surface motion (e.g., level, skew, curvature shifts). If the market is dynamically arbitrage-free, these principal directions should align closely with the factor basis { G i } .
Residual components of z t that are orthogonal to span { G i } indicate systematic drifts of the surface that are not representable by the model’s factors, implying violations of the HJM drift restriction. By quantifying the variance explained by the first d principal components of z t , we obtain a diagnostic of the extent to which the Neural-SDE factor model captures the dynamically consistent evolution of the option surface.
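The variance-explained diagnostic above can be sketched in a few lines. We assume the z-factor observations are stacked row-wise over time and the factor surfaces are flattened onto the same lattice; the function name and array layout are illustrative, not from our codebase:

```python
import numpy as np

def z_variance_explained(Z, G):
    """Fraction of the temporal variance of the z-factor captured by span{G_i}.

    Z : (T, N) array, one reconstructed z_t surface (flattened) per timestamp.
    G : (N, d) array, columns are the factor surfaces G_i on the same grid.
    Projects each centred z_t onto span{G_i} by least squares and compares
    the variance of the projection to the total variance.
    """
    Zc = Z - Z.mean(axis=0, keepdims=True)            # centre in time
    coef, *_ = np.linalg.lstsq(G, Zc.T, rcond=None)   # least-squares projection
    Z_proj = (G @ coef).T                             # component of z_t in span{G_i}
    return Z_proj.var() / Zc.var()                    # in [0, 1]
```

A value close to one indicates that the deterministic surface drift is almost entirely representable by the learned factors, i.e. little residual dynamic arbitrage in the sense of Appendix B.3.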

Appendix B.5. Minimising Dynamic Arbitrage

By training the model and selecting factor structures such that the leading principal components of z t align with the learned factors G i , we effectively minimise the unexplained component of the HJM drift restriction. This ensures that the model-implied surface evolution under the physical measure P is as close as possible to one that admits an equivalent martingale measure Q , that is, the dynamic arbitrage in the model is minimised.

Appendix B.6. Empirical Computation of the z-Factor

While Cohen et al. (2023a) use splines to calculate the derivatives in Eq. (A12), we have access to high-frequency data and therefore use finite-difference methods. The parameter $\gamma_t$ is computed by fitting an initial, 'Stage 0', neural SDE to the underlying process alone.
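A minimal finite-difference sketch of the surface derivatives entering Eq. (A12), assuming the reconstructed surface lives on a regular $(\tau, m)$ lattice (the function and variable names are our own; `np.gradient` uses central differences in the interior and one-sided stencils at the boundary):

```python
import numpy as np

def surface_derivatives(C, taus, ms):
    """Finite-difference estimates of the derivatives entering the z-factor.

    C    : (n_tau, n_m) array of normalised call prices on a regular lattice.
    taus : (n_tau,) grid of times to maturity.
    ms   : (n_m,) grid of log-moneyness values.
    Returns (dC/dtau, dC/dm, d2C/dm2).
    """
    dC_dtau = np.gradient(C, taus, axis=0)    # maturity decay direction
    dC_dm = np.gradient(C, ms, axis=1)        # moneyness slope
    d2C_dm2 = np.gradient(dC_dm, ms, axis=1)  # moneyness curvature
    return dC_dtau, dC_dm, d2C_dm2
```

On the sparse lattices discussed below, these estimates are noisy near the boundary, which is consistent with the jaggedness visible in Figure A1.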
Figure A1. Surface derivatives at t = 0, Deribit call options, 22 May 2025.
Our estimated surface derivatives are jagged due to the sparsity of our lattice; however, their shapes approximately resemble the surface derivatives presented by Cohen et al. (2023a).

Appendix C. Training at Ultra High Frequency Intervals

The results in this paper are based on prices, for both the options and the underlying asset, aggregated to one-minute increments. This appendix provides some results and commentary for training the neural SDE model on raw, un-aggregated increments, where consecutive unique timestamps are typically separated by less than a second. Figure 1 shows the distribution of interval lengths without any resampling or aggregation of prices: the majority of sampled intervals from Deribit are shorter than one second.
Figure A2. QQ plots of 3 factors at HF.
Figure A3. Loss Curve training at HF (with constraints).
Analysing Figure A3 and Figure A2, we can see that although the loss curve appears to converge, the model does not respect the Gaussian transition implied by our Euler–Maruyama training assumption. The QQ plots of the residuals show clearly heavy tails: extreme transitions (both negative and positive) are more common, and larger in magnitude, than Gaussianity would predict. One can speculate that this is due to the presence of microstructure noise at the lowest time-scales (Bandi and Russell 2006).
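The heavy-tail diagnostic behind these QQ plots can be reproduced by standardising the observed increments with the model's one-step Euler–Maruyama moments and inspecting the excess kurtosis of the residuals. A scalar sketch, where `mu` and `sigma` stand in for the trained drift and diffusion networks (the helper is illustrative, not our training code):

```python
import numpy as np

def em_residuals(y, dt, mu, sigma):
    """Standardised Euler-Maruyama residuals for a 1-D path.

    Under the Gaussian transition assumption,
    (y[t+1] - y[t] - mu(y[t]) * dt) / (sigma(y[t]) * sqrt(dt))
    should be i.i.d. standard normal; excess kurtosis well above 0
    indicates the heavy tails visible in the QQ plots.
    """
    incr = np.diff(y)
    resid = (incr - mu(y[:-1]) * dt) / (sigma(y[:-1]) * np.sqrt(dt))
    m, s = resid.mean(), resid.std()
    excess_kurtosis = np.mean(((resid - m) / s) ** 4) - 3.0
    return resid, excess_kurtosis
```

For a well-specified Gaussian transition the excess kurtosis should hover near zero; at sub-second sampling intervals we observe it to be strongly positive.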

Appendix D. Alternative Likelihoods for Training

Here we report on an attempt to model the transition probabilities more accurately using a different likelihood parameterisation. To estimate the parameters of our Neural SDE, we employ a quasi-likelihood approach inspired by Aït-Sahalia’s expansion for diffusion transition densities (Aït-Sahalia 2002). The true transition density of a nonlinear stochastic differential equation,
$dY_t = \mu(Y_t)\, dt + \sigma(Y_t)\, dW_t,$
is generally intractable except for a few special cases. Aït-Sahalia proposed a first-order approximation based on the Lamperti transformation, which locally linearises the diffusion coefficient. Defining the transform
$Z_t = \int^{Y_t} \frac{1}{\sigma(y)}\, dy,$
the dynamics of Z t can be written approximately as
$dZ_t \approx \left( \frac{\mu(Y_t)}{\sigma(Y_t)} - \frac{1}{2}\,\sigma'(Y_t) \right) dt + dW_t.$
This results in an approximate Gaussian transition density for Z t + Δ t conditional on Z t ,
$p(Y_{t+\Delta t} \mid Y_t) \approx \mathcal{N}\!\left( Z_{t+\Delta t};\; Z_t + \mu_L(Y_t)\,\Delta t,\; \Delta t \right) \cdot \left| \partial Z / \partial Y \right|_{Y_{t+\Delta t}},$
where $\mu_L(Y_t) = \mu(Y_t)/\sigma(Y_t) - \tfrac{1}{2}\sigma'(Y_t)$ and the Jacobian term $\log \left| \partial Z / \partial Y \right|_{Y_{t+\Delta t}} = -\log \sigma(Y_{t+\Delta t})$ corrects for the nonlinear transformation.
In implementation, the networks $\mu_\theta(\cdot)$ and $\sigma_\phi(\cdot)$ parameterise the drift and diffusion, and the loss is given by the sum of per-step negative log-likelihoods of this approximate density. Compared to the standard Euler–Maruyama Gaussian likelihood, which assumes locally constant diffusion, the Aït-Sahalia quasi-likelihood explicitly accounts for the spatial variation of $\sigma(Y_t)$ through its derivative $\sigma'(Y_t)$, improving statistical efficiency and potentially providing a more accurate approximation of the true SDE transition law while remaining consistent with the Euler–Maruyama discretisation framework.
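For concreteness, a scalar sketch of the per-step quasi-likelihood described above (names are illustrative; in our setting the callables would be the trained networks $\mu_\theta$, $\sigma_\phi$ and an automatic derivative of $\sigma_\phi$, and the Z-increment would be computed per batch):

```python
import numpy as np

def lamperti_nll_step(y0, y1, dt, mu, sigma, dsigma, n_grid=201):
    """One-step negative log-likelihood under the first-order
    Ait-Sahalia / Lamperti approximation (scalar case).

    mu, sigma, dsigma are callables for the drift, the diffusion and the
    spatial derivative sigma'(y).
    """
    # Z-increment: integral of 1/sigma between y0 and y1 (trapezoid rule)
    grid = np.linspace(y0, y1, n_grid)
    f = 1.0 / sigma(grid)
    dz = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(grid))
    # Lamperti drift and Gaussian NLL in Z-space
    mu_L = mu(y0) / sigma(y0) - 0.5 * dsigma(y0)
    resid = dz - mu_L * dt
    nll = 0.5 * np.log(2.0 * np.pi * dt) + resid ** 2 / (2.0 * dt)
    # Jacobian of the transform: log|dZ/dy| at y1 equals -log sigma(y1)
    return nll + np.log(sigma(y1))
```

With constant diffusion ($\sigma \equiv 1$, $\sigma' \equiv 0$) this reduces exactly to the standard Euler–Maruyama Gaussian negative log-likelihood, which is a useful sanity check.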
However, when training a neural SDE market model with this transition density as the loss at the native high-frequency intervals, the QQ plots still show clear fat tails (Figure A4). This suggests that the noise at high frequency remains too heavy-tailed for the Gaussian assumption to hold, even when spatial derivatives of the diffusion are included.
Figure A4. QQ plots for quasi-likelihood training at native high-frequency intervals.

Appendix E. Neural Network Architecture

We model a diagonal Itô SDE $dX_t = f_\theta(X_t)\, dt + \operatorname{diag}(g_\theta(X_t))\, dW_t$, where the drift $f_\theta: \mathbb{R}^p \to \mathbb{R}^p$ and diffusion amplitude $g_\theta: \mathbb{R}^p \to \mathbb{R}_+^p$ are small multilayer perceptrons (MLPs). The baseline network (NeuralSDE) uses two parallel heads of identical depth: (i) the drift head is a 2-layer MLP, Linear(p, 256) → ReLU → Linear(256, p), and (ii) the diffusion head is Linear(p, 256) → ReLU → Linear(256, p) → Softplus, where the Softplus enforces $g_\theta(x) > 0$ componentwise. Thus, both architectures are compact 2-layer multi-layer perceptrons with ReLU hidden activations and Softplus on the diffusion, with the shrink variant inserting a geometric diffusion contraction that preserves positivity and encodes the static constraints.
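The two heads can be sketched in plain NumPy as follows; the weights below are random placeholders standing in for learned parameters, and the final arbitrage-constraint layer described in the next paragraph is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
p, hidden = 3, 256   # factor dimension and hidden width from the text

def mlp_head(x, W1, b1, W2, b2, softplus=False):
    """Linear(p, 256) -> ReLU -> Linear(256, p), optional Softplus output."""
    h = np.maximum(x @ W1 + b1, 0.0)              # ReLU hidden layer
    out = h @ W2 + b2
    return np.log1p(np.exp(out)) if softplus else out

def params(m, n):
    # Random placeholder weights; in practice these are learned.
    return 0.1 * rng.standard_normal((m, n)), np.zeros(n)

W1f, b1f = params(p, hidden); W2f, b2f = params(hidden, p)   # drift head
W1g, b1g = params(p, hidden); W2g, b2g = params(hidden, p)   # diffusion head

x = rng.standard_normal((8, p))                    # batch of factor states
drift = mlp_head(x, W1f, b1f, W2f, b2f)
diff_amp = mlp_head(x, W1g, b1g, W2g, b2g, softplus=True)    # > 0 componentwise
```

The Softplus output guarantees a strictly positive diagonal diffusion amplitude for every input state, which is the property the architecture is designed to enforce.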
The neural SDE architecture used for the latent factor dynamics is visualised in Figure A5. The input is the factor state $\Xi_t \in \mathbb{R}^d$. The drift network outputs $\mu_\theta(\Xi_t, t)$ and the diffusion network outputs the componentwise non-negative diagonal diffusion amplitude $\sigma_\phi(\Xi_t, t)$, with Softplus enforcing positivity. Note that our implementation of neural SDEs includes a final layer that constrains both the drift and diffusion to enforce no-arbitrage boundaries.
Figure A5. Neural SDE architecture for latent factor dynamics.

References

  1. Ahmed, Mohamed Shaker, Ahmed A. El-Masry, Aktham I. Al-Maghyereh, and Satish Kumar. 2024. Cryptocurrency volatility: A review, synthesis, and research agenda. Research in International Business and Finance 71: 102472.
  2. Aït-Sahalia, Yacine. 2002. Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach. Econometrica 70, 1: 223–262.
  3. Aït-Sahalia, Yacine, and Jean Jacod. 2009. Estimating the degree of activity of jumps in high frequency data.
  4. Aït-Sahalia, Yacine, Per A. Mykland, and Lan Zhang. 2005. How often to sample a continuous-time process in the presence of market microstructure noise. The Review of Financial Studies 18, 2: 351–416.
  5. Bandi, Federico M., and Jeffrey R. Russell. 2006. Separating microstructure noise from volatility. Journal of Financial Economics 79, 3: 655–692.
  6. Black, Fischer, and Myron Scholes. 1973. The pricing of options and corporate liabilities. Journal of Political Economy 81, 3: 637–654.
  7. Breeden, Douglas T., and Robert H. Litzenberger. 1978. Prices of state-contingent claims implicit in option prices. Journal of Business: 621–651.
  8. Buehler, Hans, Lukas Gonon, Josef Teichmann, and Ben Wood. 2019. Deep hedging. Quantitative Finance 19, 8: 1271–1291.
  9. Caron, R. J., J. F. McDonald, and C. M. Ponic. 1989. A degenerate extreme point strategy for the classification of linear constraints as redundant or necessary. Journal of Optimization Theory and Applications 62, 2: 225–237.
  10. Carr, Peter, and Dilip B. Madan. 2005. A note on sufficient conditions for no arbitrage. Finance Research Letters 2, 3: 125–130.
  11. Cohen, Samuel N., Christoph Reisinger, and Sheng Wang. 2020. Detecting and repairing arbitrage in traded option prices. Applied Mathematical Finance 27, 5: 345–373.
  12. Cohen, Samuel N., Christoph Reisinger, and Sheng Wang. 2022. Hedging option books using neural-SDE market models.
  13. Cohen, Samuel N., Christoph Reisinger, and Sheng Wang. 2023a. Arbitrage-free neural-SDE market models. Applied Mathematical Finance 30, 1: 1–46.
  14. Cohen, Samuel N., Christoph Reisinger, and Sheng Wang. 2023b. Estimating risks of European option books using neural-SDE market models. Journal of Computational Finance 26, 3.
  15. Cont, Rama, and Milena Vuletić. 2023. Simulation of arbitrage-free implied volatility surfaces. Applied Mathematical Finance 30, 2: 94–121.
  16. Cuchiero, Christa, Wahid Khosrawi, and Josef Teichmann. 2020. A generative adversarial network approach to calibration of local stochastic volatility models. Risks 8, 4.
  17. Davis, Mark H. A., and David G. Hobson. 2007. The range of traded option prices. Mathematical Finance 17, 1: 1–14.
  18. DeLise, Timothy. 2021. Neural Options Pricing. Ph.D. thesis, Université de Montréal.
  19. Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39, 1: 1–22.
  20. Derman, Emanuel, and Iraj Kani. 1994. Riding on a smile. Risk 7, 2: 32–39.
  21. Dhiman, Ashish, and Yibei Hu. 2023. Physics informed neural network for option pricing. arXiv abs/2312.06711.
  22. Dridi, Naura, Lucas Drumetz, and Ronan Fablet. 2021. Learning stochastic dynamical systems with neural networks mimicking the Euler–Maruyama scheme. In 2021 29th European Signal Processing Conference (EUSIPCO). IEEE, pp. 1990–1994.
  23. Dupire, Bruno, et al. 1994. Pricing with a smile. Risk 7, 1: 18–20.
  24. Durbin, James, and Siem Jan Koopman. 2012. Time Series Analysis by State Space Methods. Oxford University Press.
  25. Friedman, Avner, and Mark A. Pinsky. 1973. Asymptotic stability and spiraling properties for solutions of stochastic equations. Transactions of the American Mathematical Society 186: 331–358.
  26. Fukasawa, Masaaki. 2011. Asymptotic analysis for stochastic volatility: martingale expansion. Finance and Stochastics 15, 4: 635–654.
  27. Gatheral, Jim. 2011. The Volatility Surface: A Practitioner's Guide. John Wiley & Sons.
  28. Gatheral, Jim, Thibault Jaisson, and Mathieu Rosenbaum. 2022. Volatility is rough. In Commodities. Chapman and Hall/CRC, pp. 659–690.
  29. Gierjatowicz, Patryk, Marc Sabate Vidales, David Siska, Lukasz Szpruch, and Zan Zuric. 2022. Robust pricing and hedging via neural stochastic differential equations. Journal of Computational Finance 26, 3: 1–32.
  30. Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Communications of the ACM 63, 11: 139–144.
  31. Greff, Klaus, Sjoerd Van Steenkiste, and Jürgen Schmidhuber. 2017. Neural expectation maximization. In Advances in Neural Information Processing Systems 30.
  32. Hagan, Patrick S., Deep Kumar, Andrew S. Lesniewski, and Diana E. Woodward. 2002. Managing smile risk. In The Best of Wilmott 1, pp. 249–296.
  33. Heath, David, Robert Jarrow, and Andrew Morton. 1992. Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation. Econometrica: Journal of the Econometric Society: 77–105.
  34. Heston, Steven L. 1993. A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies 6, 2: 327–343.
  35. Hutchinson, James M., Andrew W. Lo, and Tomaso Poggio. 1994. A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance 49, 3: 851–889.
  36. Jia, Junteng, and Austin R. Benson. 2019. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems 32.
  37. Kolm, Petter N., and Gordon Ritter. 2019. Dynamic replication and hedging: A reinforcement learning approach. The Journal of Financial Data Science 1, 1: 159–171.
  38. Lee, Suzanne S., and Per A. Mykland. 2008. Jumps in financial markets: A new nonparametric test and jump dynamics. The Review of Financial Studies 21, 6: 2535–2563.
  39. Fan, Lei, and Justin Sirignano. 2024. Machine learning methods for pricing financial derivatives. In Computational Finance.
  40. Lyons, T. J. 1995. Uncertain volatility and the risk-free synthesis of derivatives. Applied Mathematical Finance 2, 2: 117–133.
  41. Phytides, Alexio. 2023. No-arbitrage option pricing with neural SDEs. Master's thesis, University of Cape Town.
  42. Roper, Michael, and Marek Rutkowski. 2009. On the relationship between the call price surface and the implied volatility surface close to expiry. International Journal of Theoretical and Applied Finance 12, 4: 427–441.
  43. Schönbucher, Philipp J. 1999. A market model for stochastic implied volatility. SSRN Electronic Journal.
  44. Sutton, Richard S., and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. Cambridge: MIT Press.
  45. Tanios, R. 2021. Physics informed neural networks in computational finance. Master's thesis, ETH Zurich.
  46. Vuletić, Milena, and Rama Cont. 2024. VolGAN: A generative model for arbitrage-free implied volatility surfaces. Applied Mathematical Finance 31, 4: 203–238.
  47. Wiese, Magnus, Lianjun Bai, Ben Wood, and Hans Buehler. 2019. Deep hedging: Learning to simulate equity option markets. arXiv:1911.01700.
  48. Wissel, Johannes Stefan. 2008. Arbitrage-free market models for liquid options. Ph.D. thesis, ETH Zurich.
  49. Yu, Jialin. 2007. Closed-form likelihood approximation and estimation of jump-diffusions with an application to the realignment risk of the Chinese yuan. Journal of Econometrics 141, 2: 1245–1280.
Figure 1. Native high frequency interval distribution.
Figure 2. BTCUSD 2025 May 22 price movement.
Figure 3. Interval distribution over a week-long period.
Figure 4. BTCUSD 2025 September 7th - 14th price movement.
Figure 5. Distribution of Observed Log-Moneyness Values per Maturity.
Figure 6. Liquid Options Nodes.
Figure 7. Before and after arbitrage repair of a surface at an example timestep.
Figure 8. Factor decomposition showing dynamic, statistical, static-arbitrage, and baseline ( G 0 ) factors.
Figure 9. Two factor scores and the static no-arbitrage polygon.
Figure 10. Comparison of NLL training curves with and without drift/diffusion constraints.
Figure 11. QQ plots of the learned neural SDE.
Figure 12. Residual correlation of the learnt factors.
Figure 13. Value-at-Risk (VaR) distributions across option portfolios under Neural–SDE and historical simulation.

Figure 14. Hedging a 30-day ATM option, compared against Black–Scholes Greek-based methods.
Table 1. Value-at-Risk (VaR) comparison: Neural-SDE vs Historical Simulation.
| Portfolio | VaR 99% (NSDE) | VaR 99% (HS) | VaR 95% (NSDE) | VaR 95% (HS) | Description |
| --- | --- | --- | --- | --- | --- |
| Short 10% strangle ∼1d | -0.018899 | -0.003865 | -0.013648 | -0.002575 | Short 10% OTM call and 10% OTM put, τ ≈ 1d. |
| Short 10% strangle ∼7d | -0.014940 | -0.003988 | -0.010646 | -0.002645 | Same short strangle with maturity ≈ 7d. |
| Long ATM straddle ∼1d | -0.014211 | -0.002680 | -0.013370 | -0.001680 | Long ATM call + long ATM put, τ ≈ 1d. |
| Risk-reversal ±10% ∼1d | -0.006106 | -0.003822 | -0.006106 | -0.002410 | Long 10% OTM call, short 10% OTM put (directional skew), τ ≈ 1d. |
| Put spread 5–10% OTM ∼1d | -0.004825 | -0.002017 | -0.003476 | -0.001376 | Long 10% OTM put, short 5% OTM put; same expiry (debit put spread). |
| Iron condor 10/20% ∼30d | -0.004094 | -0.003429 | -0.003197 | -0.002309 | Short 10%/20% OTM call & put wings (short condor), τ ≈ 30d. |
| Butterfly 10% wings ∼7d | -0.003645 | -0.002377 | -0.002677 | -0.001515 | Same as above (ATM long body, OTM short wings). |
| Fly wings=15% τ=1d | -0.002920 | -0.004423 | -0.001855 | -0.002849 | Wider ±15% wing butterfly, 1d. |
| Fly wings=10% τ=30d | -0.002821 | -0.004889 | -0.002049 | -0.003254 | ±10% wing butterfly, 30d. |
| Fly wings=15% τ=30d | -0.002810 | -0.001905 | -0.001703 | -0.001362 | ±15% wing butterfly, 30d. |
| Iron condor 10/20% ∼1d | -0.002717 | -0.001909 | -0.001937 | -0.001197 | Short condor with inner 10% & outer 20% strikes, τ ≈ 1d. |
| Fly wings=10% τ=1d | -0.002506 | -0.004034 | -0.001470 | -0.002632 | ±10% wing butterfly, 1d. |
| Butterfly 10% wings ∼1d | -0.002506 | -0.004034 | -0.001470 | -0.002632 | Equivalent short-dated butterfly. |
| Fly wings=15% τ=7d | -0.002457 | -0.001571 | -0.002000 | -0.001165 | ±15% wing butterfly, 7d. |
| Calendar ATM 7d/30d | -0.001962 | -0.000974 | -0.001381 | -0.000652 | Long 30d ATM, short 7d ATM (calendar spread). |
| Fly wings=5% τ=1d | -0.001693 | -0.003198 | -0.001181 | -0.002135 | Tight ±5% wing butterfly, 1d. |
| Call spread ±5% ∼1d | -0.001653 | -0.002181 | -0.001653 | -0.001498 | Long 5% ITM call, short 5% OTM call; τ ≈ 1d. |
| Fly wings=5% τ=7d | -0.001541 | -0.003935 | -0.001112 | -0.002596 | ±5% wing butterfly, 7d. |
| Call spread ±5% ∼30d | -0.001495 | -0.001580 | -0.001054 | -0.001022 | 5% call spread, 30d. |
| Fly wings=5% τ=30d | -0.001460 | -0.001538 | -0.000985 | -0.000962 | Tight ±5% wing butterfly, 30d. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.