1. Introduction
Ordinary differential equations (ODEs) are ubiquitous in the engineering sciences, from modeling and control of simple physical systems like pendulums and mass-spring-dampers, or more complicated robotic arms and drones, to the description of high-dimensional spatial discretizations of distributed systems, such as fluid flows, chemical reactions or quantum oscillators. Neural ordinary differential equations (neural ODEs) [1,2] are ODEs parameterized by neural networks. Given a state $x$ and parameters $\theta$ representing the weights and biases of a neural network, a neural ODE reads:
First introduced by [1] as the continuum limit of recurrent neural networks, neural ODEs quickly found applications well beyond simple classification tasks: learning highly nonlinear dynamics of multi-physical systems from sparse data [3,4,5], optimal control of nonlinear systems [6], medical imaging [7] and real-time handling of irregular time-series [8], to name but a few. Discontinuous state-transitions and dynamics [9,10], time-dependent parameters [11], augmented neural ODEs [12] and physics-preserving formulations [13,14] present further extensions that increase the expressivity of neural ODEs.
However, these methods are typically phrased for states in $\mathbb{R}^n$. For many physical systems of interest, such as robot arms, humanoid robots and drones, the state lives on differentiable manifolds and Lie groups [15,16]. More generally, the manifold hypothesis in machine learning raises the expectation that many high-dimensional data-sets evolve on intrinsically lower-dimensional, albeit more complicated, manifolds [17]. Neural ODEs on manifolds [18,19] presented significant steps to address this gap, providing first optimization methods for neural ODEs on manifolds. Yet, the general tools and approaches available on $\mathbb{R}^n$, such as running costs, augmented states, time-dependent parameters, control inputs or discontinuous state-transitions, are rarely addressed in a manifold context. Similar issues persist in the Lie group context, where neural ODEs on Lie groups [20,21] were formalized.
Our goal is to extend more of the methods for neural ODEs on $\mathbb{R}^n$ to arbitrary manifolds, and in particular to Lie groups. To this end, we present a systematic approach to the design of neural ODEs on manifolds and Lie groups. Specifically, our contributions are:
- a systematic derivation of neural ODEs on manifolds and Lie groups, highlighting differences and equivalences of various approaches; for an overview, see also Table 1;
- a summary of the state of the art on manifold and Lie group neural ODEs, formalizing the notion of extrinsic and intrinsic neural ODEs;
- a tutorial-like introduction, to assist the reader in implementing various neural ODE methods on manifolds and Lie groups, presenting coordinate expressions alongside geometric notation.
The remainder of the article is organized as follows. A brief review of the state of the art on neural ODEs concludes this introduction. Section 2 provides background on differentiable manifolds, Lie groups, and the coordinate-free adjoint method. Section 3 describes neural ODEs on manifolds and derives parameter updates via the adjoint method for various common architectures and cost functions, including time-dependent parameters, augmented neural ODEs, running costs, and intermediate cost terms. Section 4 describes neural ODEs on matrix Lie groups, explaining the merits of treating Lie groups separately from general differentiable manifolds. Both Section 3 and Section 4 also classify methods into extrinsic and intrinsic approaches. We conclude with a discussion in Section 5, highlighting advantages, disadvantages, challenges and promises of the presented material. The Appendix includes background on Hamiltonian systems, which appear when transforming the adjoint method into a form that is unique to Lie groups.
1.1. Literature review
For a general introduction to neural ODEs, see [25]. Neural ODEs on $\mathbb{R}^n$ with fixed parameters were first introduced by [1], where parameter optimization via the adjoint method allowed for intermittent and final cost terms on each trajectory. The generalized adjoint method [2] also allows for running cost terms. Memory-efficient checkpointing was introduced in [26] to address stability issues of adjoint methods. Augmented neural ODEs [12] introduced augmented state-spaces to allow neural ODEs to express arbitrary diffeomorphisms. Time-varying parameters were introduced by [11], with benefits similar to those of augmented neural ODEs. Neural ODEs with discrete transitions were formulated in [9,10], with [9] also learning the event-triggered transitions common in engineering applications. Neural controlled differential equations (CDEs) were introduced in [27] for handling irregular time-series, with parameter updates reapplying the adjoint method of [1]. Neural stochastic differential equations (SDEs) were introduced in [28], relying on a stochastic variant of the adjoint method for the parameter update. All of the aforementioned literature phrases the dynamics of neural ODEs on $\mathbb{R}^n$.
Neural ODEs on manifolds were first introduced by [22], including an adjoint method on manifolds for final cost terms and an application to continuous normalizing flows on Riemannian manifolds, but embedding the manifolds into $\mathbb{R}^N$. Neural ODEs on Riemannian manifolds are expressed in local exponential charts in [18], avoiding the embedding into $\mathbb{R}^N$ and considering final cost terms in the optimization. Charts for unknown, nontrivial latent manifolds, and dynamics in local charts, are learned from high-dimensional data in [29], including also discretized solutions of partial differential equations. Parameterized equivariant neural ODEs on manifolds are constructed in [23], also commenting on state augmentation to express arbitrary (equivariant) flows on manifolds.

Neural ODEs on Lie groups were first introduced in [30] on the Lie group SE(3), to learn the port-Hamiltonian dynamics of a drone from experiments, expressing group elements in an embedding space; the approach was formalized for port-Hamiltonian systems on arbitrary matrix Lie groups in [20], embedding the matrix-valued states in a Euclidean embedding space. Neural ODEs on SE(3) were phrased in local exponential charts in [24] to optimize a controller for a rigid body, using a chart-based adjoint method in local exponential charts. An alternative, Lie algebra based adjoint method on general Lie groups was introduced in [21], foregoing the Lie-group-specific numerical issues of applying the adjoint method in local charts.
1.2. Notation
For a complete introduction to differential geometry see, e.g., [31], and for Lie group theory see [32].

Calligraphic letters denote smooth manifolds. For conceptual clarity, the reader may think of these manifolds as embedded in a high-dimensional $\mathbb{R}^N$. The set $C^\infty(\mathcal{M}, \mathcal{N})$ contains smooth functions between the manifolds $\mathcal{M}$ and $\mathcal{N}$, and we define $C^\infty(\mathcal{M}) := C^\infty(\mathcal{M}, \mathbb{R})$.

The tangent space at $p \in \mathcal{M}$ is $T_p\mathcal{M}$ and the cotangent space is $T_p^*\mathcal{M}$. The tangent bundle of $\mathcal{M}$ is $T\mathcal{M}$, and the cotangent bundle of $\mathcal{M}$ is $T^*\mathcal{M}$. Then $\mathfrak{X}(\mathcal{M})$ denotes the set of vector fields over $\mathcal{M}$, and $\Omega^k(\mathcal{M})$ denotes the set of $k$-forms, where 1-forms are co-vector fields and 0-forms are smooth functions $\mathcal{M} \to \mathbb{R}$. The exterior derivative is denoted as $\mathrm{d}$. For functions of several arguments, we denote the partial differential with respect to the $k$-th argument by a corresponding subscript. Curves and their tangent vectors are denoted accordingly.

A Lie group is denoted by $G$ and its elements by $g, h \in G$. The group identity is $e$, and $I$ denotes the identity matrix. The Lie algebra of $G$ is $\mathfrak{g}$, and its dual is $\mathfrak{g}^*$. Dedicated letters denote vectors in the Lie algebra and vectors in its dual $\mathfrak{g}^*$, respectively.

In coordinate expressions, lower indices are covariant and upper indices are contravariant components of tensors. For example, for a $(0,2)$-tensor $M$, the components $M_{ij}$ are covariant, and for non-degenerate $M$, the components of its inverse are $M^{ij}$, which are contravariant. We use the Einstein summation convention, i.e., a product of variables with repeated lower and upper indices implies a sum over that index.

Denoting by $W$ a topological space, by $D$ the Borel $\sigma$-algebra and by $\mathbb{P}$ a probability measure, the tuple $(W, D, \mathbb{P})$ denotes a probability space. Given a vector space $L$ and a random variable $C : W \to L$, the expectation of $C$ w.r.t. $\mathbb{P}$ is denoted $\mathbb{E}_{\mathbb{P}}[C]$.
2. Background
2.1. Smooth Manifolds
Given an $n$-dimensional manifold $\mathcal{M}$, with $U \subset \mathcal{M}$ an open set and $\varphi : U \to \mathbb{R}^n$ a homeomorphism, we call $(U, \varphi)$ a chart, and we denote the coordinates of $p \in U$ as $x = \varphi(p)$. Smooth manifolds admit charts $(U, \varphi)$ and $(V, \psi)$ with smooth transition maps $\psi \circ \varphi^{-1}$ defined on the intersection $U \cap V$, and a collection of charts with smooth transition maps is called a (smooth) atlas. A vector field associates to any point $p \in \mathcal{M}$ a vector in $T_p\mathcal{M}$. It defines a dynamic system
and we denote the solution of (3) by the flow operator.
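To make the chart-based description concrete, the following sketch (our own illustration, not part of the original text) integrates the chart representative of a vector field on the circle $S^1$ with scipy's solve_ivp; the particular chart, vector field and time horizon are assumptions made only for this example.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative example: the circle S^1 embedded in R^2, with the angle chart
# phi(p) = atan2(p_y, p_x). A vector field on S^1 is represented in this chart
# by a scalar ODE for the angle coordinate x.

def f_chart(t, x):
    # Chart representative of a vector field on S^1 (chosen arbitrarily here):
    # the flow slows down near x = pi.
    return np.array([1.0 + 0.5 * np.cos(x[0])])

def chart_inverse(x):
    # phi^{-1}: angle coordinate -> point on S^1 in the embedding R^2.
    return np.array([np.cos(x[0]), np.sin(x[0])])

# Flow operator in the chart: integrate the chart representative of f.
x0 = np.array([0.1])                      # coordinates of the initial point
sol = solve_ivp(f_chart, (0.0, 2.0), x0, rtol=1e-8, atol=1e-8)

p_T = chart_inverse(sol.y[:, -1])         # map the final coordinates back to S^1
print("final point on S^1:", p_T, "norm:", np.linalg.norm(p_T))
```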
For a real-valued function $V : \mathcal{M} \to \mathbb{R}$, its differential is the covector field $\mathrm{d}V$.
Given additionally a smooth manifold $\mathcal{N}$ and a smooth map $\Psi$, with appropriate charts of $\mathcal{M}$ and $\mathcal{N}$, respectively, the pullback of $V$ via $\Psi$ is
With a Riemannian metric $M$ (i.e., a symmetric, non-degenerate $(0,2)$ tensor field) on $\mathcal{M}$, the gradient of $V$ is the uniquely defined vector field given by
When $\mathcal{M} = \mathbb{R}^n$, we assume that $M$ is the Euclidean metric, and pick coordinates such that the components of the gradient and of the differential are the same. Finally, we define the Lie derivative of 1-forms, which differentiates a 1-form along a vector field and returns another 1-form:
2.2. Lie groups
Lie groups are smooth manifolds with a compatible group structure. We consider real matrix Lie groups $G \subseteq GL(n, \mathbb{R})$, i.e., subgroups of the general linear group
For $g \in G$, the left and right translations by $h \in G$ are, respectively, the matrix multiplications
The Lie algebra of $G$ is the vector space $\mathfrak{g} = T_e G$, with $\mathfrak{gl}(n, \mathbb{R})$ the Lie algebra of $GL(n, \mathbb{R})$.
Define a basis of $\mathfrak{g}$ with $m = \dim \mathfrak{g}$ elements, and define the (invertible, linear) map from $\mathbb{R}^m$ to $\mathfrak{g}$ as (see footnote 1)
The dual of $\mathfrak{g}$ is denoted $\mathfrak{g}^*$, and given this map, we call the corresponding map between $\mathfrak{g}^*$ and $\mathbb{R}^m$ its dual. For elements of $\mathfrak{g}$, the small adjoint $\mathrm{ad}$ is a bilinear map, and the large Adjoint $\mathrm{Ad}$ is a linear map
In the remainder of the article, we exclusively use the adjoint representation, written without a tilde in the subscript $A$, and the Adjoint representation, which are obtained as
The exponential map is a local diffeomorphism given by the matrix exponential ([32], Chapter 3.7):
Its inverse is given by the matrix logarithm, and it is well-defined on a subset of $G$ ([32], Chapter 2.3):
Often, the infinite sums in (17) and (18) can be further reduced to finite sums of $m$ terms by use of the Cayley-Hamilton theorem [34]. A chart on $G$ that assigns zero coordinates to a chosen base element can be defined using (18) and (12):
The chart is called an exponential chart, and a collection of exponential charts that covers the manifold is called an exponential atlas.
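As a small illustration of exponential charts (our own sketch, not part of the original text), the following code builds an exponential chart around a chosen base element of SO(3) from the matrix exponential and logarithm in scipy; the choice of SO(3), the hat/vee maps for its Lie algebra and the base point are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import expm, logm

# Hat map: R^3 -> so(3), stacking coordinates into a skew-symmetric matrix.
def hat(a):
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

# Vee map: so(3) -> R^3, the inverse of hat.
def vee(A):
    return np.array([A[2, 1], A[0, 2], A[1, 0]])

# Exponential chart centred at g0: the coordinates of g are vee(logm(g0^{-1} g)),
# and the chart inverse is g = g0 expm(hat(x)); g0 itself gets zero coordinates.
def chart(g0, g):
    return vee(np.real(logm(np.linalg.inv(g0) @ g)))

def chart_inverse(g0, x):
    return g0 @ expm(hat(x))

g0 = expm(hat(np.array([0.3, -0.2, 0.1])))           # some element of SO(3)
g  = g0 @ expm(hat(np.array([0.05, 0.02, -0.04])))   # a nearby element
x  = chart(g0, g)
print("coordinates of g:", x)
print("round-trip error:", np.linalg.norm(chart_inverse(g0, x) - g))
```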
The differential of a function $V : G \to \mathbb{R}$ is the co-vector field $\mathrm{d}V$ (see also Equation (5)). For any given $g \in G$, we further transform the co-vector $\mathrm{d}V(g)$ into a left-trivialized differential, which collects the components of the gradient expressed in the Lie algebra:
For a derivation of this coordinate expression, see [21] (Sec. 3).
2.3. Gradient over a flow
We will be interested in computing the gradient of functions with respect to the initial state of a flow. The adjoint sensitivity equations are a set of differential equations that achieve this. In the following, we show a derivation of the adjoint sensitivity method on manifolds [21]. Given a function $V$, a vector field $f$, the associated flow, and a final time $T$, the goal of the adjoint sensitivity method on manifolds is to compute the gradient
In the adjoint method we define a co-state, which represents the differential of the final value of $V$ with respect to the current state. The adjoint sensitivity method describes its dynamics, which are integrated backwards in time from the known final condition at time $T$; see also Figure 1. The adjoint sensitivity method is stated in Theorem 1.

Theorem 1 (Adjoint sensitivity on manifolds).
The gradient of a function is
where the co-state is defined along the flow. In a local chart of $\mathcal{M}$ with induced coordinates on $T^*\mathcal{M}$, the state and co-state satisfy the dynamics
Proof. Define the co-state as
Then Equation (23) is recovered by application of Equation (6):
A derivation of the dynamics governing the co-state constitutes the remainder of this proof. By the definition of the co-state and of the Lie derivative (8), we have that:
If we further treat the co-state as a 1-form (denoted by the same symbol, by an abuse of notation), we obtain:
The components satisfy the partial differential equation
Impose the condition
(this defines the 1-form along the flow); then
Combining Equations (29) and (30) leads to the co-state dynamics stated in Theorem 1:
Expanding the final condition in local coordinates (see Equation (5)) gives
□
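For readers who prefer code, here is a minimal sketch of the adjoint sensitivity computation of Theorem 1 in the Euclidean case, i.e., in one global chart (our own illustration; the dynamics, the function $V$ and the time horizon are arbitrary choices). The co-state is integrated backwards from its final condition and compared against finite differences.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative dynamics f on R^2 and final function V (our own choices).
def f(t, x):
    return np.array([x[1], -np.sin(x[0]) - 0.1 * x[1]])   # damped pendulum

def dfdx(x):
    return np.array([[0.0, 1.0],
                     [-np.cos(x[0]), -0.1]])

V  = lambda x: 0.5 * np.dot(x, x)
dV = lambda x: x

x0, T = np.array([1.0, 0.0]), 5.0

# Forward pass: solve the state dynamics and keep a dense interpolant.
fwd = solve_ivp(f, (0.0, T), x0, dense_output=True, rtol=1e-9, atol=1e-9)

# Backward pass: co-state dynamics lambda_dot = -(df/dx)^T lambda,
# integrated from t = T (with lambda(T) = dV(x(T))) back to t = 0.
def costate_dynamics(t, lam):
    return -dfdx(fwd.sol(t)).T @ lam

bwd = solve_ivp(costate_dynamics, (T, 0.0), dV(fwd.sol(T)), rtol=1e-9, atol=1e-9)
grad_adjoint = bwd.y[:, -1]    # lambda(0): gradient of V(flow_T(x0)) w.r.t. x0

# Finite-difference check of the same gradient.
eps, grad_fd = 1e-6, np.zeros(2)
for i in range(2):
    e = np.zeros(2); e[i] = eps
    xp = solve_ivp(f, (0.0, T), x0 + e, rtol=1e-9, atol=1e-9).y[:, -1]
    xm = solve_ivp(f, (0.0, T), x0 - e, rtol=1e-9, atol=1e-9).y[:, -1]
    grad_fd[i] = (V(xp) - V(xm)) / (2 * eps)

print("adjoint gradient:  ", grad_adjoint)
print("finite differences:", grad_fd)
```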
A fact that will become useful in Section 4 is that the dynamics (24), together with the co-state dynamics of Theorem 1, have a Hamiltonian form. Define the control Hamiltonian as
Then the state and co-state dynamics of Theorem 1, respectively, follow as the Hamiltonian equations on $T^*\mathcal{M}$:
For background on Hamilton's equations, see also Appendix A.1.
3. Neural ODEs on Manifolds
A neural ODE on a manifold is an NN-parameterized vector field on $\mathcal{M}$; or, including time dependence, it is an NN-parameterized vector field on $\mathbb{R} \times \mathcal{M}$, with the time $t$ in the first slot. Given parameters $\theta$, we denote this parameterized vector field as $f_\theta$. This results in the dynamic system
The key idea of neural ODEs is to tackle various flow approximation tasks by optimizing the parameters with respect to a to-be-specified optimization problem. Denote by $T$ a finite time horizon and by $t_i \in [0, T]$ intermittent times. Denote a general trajectory cost by
with intermittent and final cost term $F$, and running cost $r$. Indicating a probability space, we define the total cost as
The minimization problem takes the form
Note that (39) is not subject to any dynamic constraint; the flow already appears explicitly in the cost.
Normally, the optimization problem is solved by means of a stochastic gradient descent algorithm [35]. In each iteration, a batch of $N$ initial conditions is sampled from the probability distribution corresponding to the probability measure above. Writing the per-sample cost accordingly, the parameter gradient is approximated as
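A toy sketch of this batch-averaged gradient estimate and the corresponding stochastic gradient step is given below (our own example; the linear parameterized dynamics, the cost and helper names such as grad_single are illustrative assumptions, with finite differences standing in for the adjoint-based gradients derived in the following sections).

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
T, target = 2.0, np.array([1.0, 0.0])

def f(t, x, theta):
    # Toy parameterized dynamics on R^2 (stand-in for a neural ODE vector field).
    A = np.array([[theta[0], theta[1]],
                  [-theta[1], theta[0]]])
    return A @ x

def cost_single(theta, x0):
    # Final cost F(x(T)) for one sampled initial condition.
    xT = solve_ivp(f, (0.0, T), x0, args=(theta,), rtol=1e-8, atol=1e-8).y[:, -1]
    return 0.5 * np.sum((xT - target) ** 2)

def grad_single(theta, x0, eps=1e-6):
    # Per-sample parameter gradient; finite differences are used here only for
    # brevity, the adjoint method plays this role in the following sections.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (cost_single(theta + e, x0) - cost_single(theta - e, x0)) / (2 * eps)
    return g

theta, lr, N = np.array([-0.1, 0.5]), 0.1, 8
for step in range(50):
    batch = rng.normal(scale=0.3, size=(N, 2)) + np.array([0.5, 0.0])  # sampled x0
    grad = np.mean([grad_single(theta, x0) for x0 in batch], axis=0)   # batch average
    theta -= lr * grad                                                 # SGD update
print("optimized parameters:", theta)
```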
In this section, we show how to optimize the parameters $\theta$ for various choices of neural ODEs and cost functions, with (37) the most general case of a cost, and we highlight similarities in the various derivations. In the following, the gradient is computed via the adjoint method on manifolds for various scenarios. The advantage of the adjoint method over, e.g., automatic differentiation / back-propagation through an ODE solver is that its memory cost is constant with respect to the network depth, i.e., the integration horizon $T$.
3.1. Constant parameters, running and final cost
Here we consider neural ODEs of the form (36), with constant parameters $\theta$, and cost functions of the form
with a final cost term $F$ and a running cost term $r$. This generalizes [1,2] to manifolds. Compared to existing manifold methods for neural ODEs [18,19], the running cost is new.
The parameter gradient's components are then computed by Theorem 2 (see also [21]):

Theorem 2 (Generalized Adjoint Method on Manifolds).
Given the dynamics (36) and the cost (41), the parameter gradient's components are computed by
where the state and co-state satisfy, in a local chart with induced coordinates,
Proof. Define the augmented state space as a product manifold, to include the original state, the parameters, the accumulated running cost and time in the augmented state. In addition, define the augmented dynamics as
This is an autonomous system, with its final state evaluated at time $T$. Next, define the cost on the augmented space:
Then Equation (41) can be rewritten as the evaluation of a terminal cost:
By Theorem 1, the gradient is given by
and, by the co-state equation of Theorem 1, the components of the co-state satisfy
Split the co-state into the parts associated with the state, the parameters, the running cost and time; then their components' dynamics are:
The co-state component associated with the accumulated running cost is constant, so equation (50) coincides with the co-state dynamics of Theorem 2. Integrating the equation for the parameter co-state from the final time to the initial time recovers Equation (42). The time component of the co-state does not appear in any of the other equations, such that its equation may be ignored. □
In summary, the above proof depended on identifying a suitable augmented manifold, chosen such that the augmented dynamics are autonomous and the cost function on the augmented manifold rephrases the cost (41) as a final cost. This allows applying Theorem 1 and expressing the gradients of the original cost. In later sections (Section 3.2), this process will be the main technical tool for generalizations of Theorem 2. The next sections describe common special cases of (36) and Theorem 2.
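The following sketch (our own, anticipating the Euclidean special case of the next subsection) mirrors the structure of Theorem 2 in a single global chart: a forward pass for the state, and a backward pass for the co-state and the parameter gradient including a running cost. The specific dynamics and cost terms are illustrative assumptions, and the result is checked against finite differences.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy setup (our own): parameterized pendulum-like dynamics, running + final cost.
def f(x, th):
    return np.array([x[1], -th[0] * np.sin(x[0]) - th[1] * x[1]])

def dfdx(x, th):
    return np.array([[0.0, 1.0],
                     [-th[0] * np.cos(x[0]), -th[1]]])

def dfdth(x, th):
    return np.array([[0.0, 0.0],
                     [-np.sin(x[0]), -x[1]]])

r     = lambda x, th: x[0] ** 2                 # running cost
drdx  = lambda x, th: np.array([2 * x[0], 0.0])
drdth = lambda x, th: np.zeros(2)
F     = lambda x: 0.5 * np.dot(x, x)            # final cost
dFdx  = lambda x: x

x0, th, T = np.array([1.2, 0.0]), np.array([1.0, 0.3]), 4.0

def total_cost(th):
    s = solve_ivp(lambda t, z: np.append(f(z[:2], th), r(z[:2], th)),
                  (0.0, T), np.append(x0, 0.0), rtol=1e-9, atol=1e-9)
    return F(s.y[:2, -1]) + s.y[2, -1]

# Forward pass with dense output for the state only.
fwd = solve_ivp(lambda t, x: f(x, th), (0.0, T), x0,
                dense_output=True, rtol=1e-9, atol=1e-9)

# Backward pass: co-state lambda and the accumulated parameter gradient.
def backward(t, z):
    lam, x = z[:2], fwd.sol(t)
    dlam  = -dfdx(x, th).T @ lam - drdx(x, th)       # co-state dynamics
    dgrad = -(dfdth(x, th).T @ lam + drdth(x, th))   # integrated backwards in t
    return np.concatenate([dlam, dgrad])

zT  = np.concatenate([dFdx(fwd.sol(T)), np.zeros(2)])
bwd = solve_ivp(backward, (T, 0.0), zT, rtol=1e-9, atol=1e-9)
grad_adjoint = bwd.y[2:, -1]

eps, grad_fd = 1e-6, np.zeros(2)
for i in range(2):
    e = np.zeros(2); e[i] = eps
    grad_fd[i] = (total_cost(th + e) - total_cost(th - e)) / (2 * eps)

print("adjoint gradient:  ", grad_adjoint)
print("finite differences:", grad_fd)
```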
3.1.1. Vanilla neural ODEs and extrinsic neural ODEs on manifolds
The case of neural ODEs on $\mathbb{R}^n$ (e.g., [1,2]) is obtained by setting $\mathcal{M} = \mathbb{R}^n$. In this case, one global chart can be used to represent all quantities.
This overlaps with extrinsic neural ODEs on manifolds (described, for instance, in [22]), which optimize the neural ODE on an embedding space $\mathbb{R}^N$. We denote this embedding accordingly and consider the manifold as a subset of $\mathbb{R}^N$. Optimizing the neural ODE on $\mathbb{R}^N$ requires extending the dynamics to a vector field on $\mathbb{R}^N$, such that
The extended dynamics are then used in Theorem 2, and the co-state also lives in the embedding space $\mathbb{R}^N$.
As shown in [22], the resulting parameter gradients are equivalent to those resulting from an application in local charts, as long as it can be guaranteed that the integral curves of the extended vector field remain within the embedded manifold, i.e., the integration is geometry-preserving.
A strong upside of the extrinsic formulation is that existing neural ODE packages (e.g., [36]) can be applied directly. Possible downsides of extrinsic neural ODEs are that finding the extended vector field may not be immediate, that a geometry-preserving integration has to be guaranteed separately, and that $N$ is larger than the intrinsic dimension $n$, leading to computational overhead.
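A minimal sketch of such an extension (our own example, not from the original text): the unit sphere $S^2$ embedded in $\mathbb{R}^3$, with an arbitrary ambient vector field projected onto the tangent spaces of the spheres around the origin, so that its integral curves remain on $S^2$ up to integration error.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Ambient ("unconstrained") vector field on R^3; stands in for a parameterized network.
def g_ambient(x):
    return np.array([x[1], -x[0], 0.3 * np.sin(x[0])])

# Extension f_tilde on R^3 whose restriction to S^2 is tangent to S^2:
# project the ambient field onto the tangent plane at x / |x|.
def f_tilde(t, x):
    n = x / np.linalg.norm(x)
    v = g_ambient(x)
    return v - np.dot(v, n) * n

x0 = np.array([1.0, 0.0, 0.0])                     # a point on S^2
sol = solve_ivp(f_tilde, (0.0, 10.0), x0, rtol=1e-9, atol=1e-9)
# Analytically |x(t)| is conserved; numerically there is a small drift,
# which is why geometry-preserving integration must be checked separately.
print("norm drift:", np.abs(np.linalg.norm(sol.y[:, -1]) - 1.0))
```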
3.1.2. Intrinsic neural ODEs on manifolds
The intrinsic case of neural ODEs on manifolds [18] is described by integrating the dynamics in local charts. Given a chart transition from a chart $(U, \varphi)$ to a chart $(V, \psi)$, the chart transitions of the state and co-state components are given by
with the Jacobian of the transition map. The advantage of intrinsic neural ODEs on manifolds over extrinsic neural ODEs on manifolds is that the dimension of the resulting equations is as low as possible for the given manifold. A disadvantage lies in having to determine charts and chart-switching procedures. In available state-of-the-art packages for neural ODEs, these are phrased as discontinuous dynamics with state transitions; a sketch of such a procedure follows below. For details on chart-switching methods see [18,21,29].
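The following sketch (our own illustration) shows such a chart-switching integration on the circle $S^1$ with two overlapping angle charts, phrased as a discontinuous state transition triggered by an integration event; the charts, the switching threshold and the vector field are assumptions made for the example.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Vector field on S^1, written as an angular velocity h(theta).
h = lambda theta: 1.0 + 0.5 * np.cos(theta)

wrap = lambda a: np.arctan2(np.sin(a), np.cos(a))    # wrap an angle to (-pi, pi]

# Two overlapping angle charts: chart 0 is centred at theta = 0, chart 1 at theta = pi.
# Coordinate of theta in chart k: wrap(theta - k*pi); chart representative of the
# vector field in chart k: x |-> h(x + k*pi).
def f_chart(t, x, k):
    return np.array([h(x[0] + k * np.pi)])

def leave_chart(t, x, k):                            # event: approach the chart boundary
    return abs(x[0]) - 2.5
leave_chart.terminal = True

# Integrate in chart 0, and switch charts (a discrete state transition) whenever
# the coordinate approaches the chart boundary, then continue in the other chart.
t, t_end, k, x = 0.0, 10.0, 0, np.array([0.1])
while t < t_end:
    sol = solve_ivp(f_chart, (t, t_end), x, args=(k,),
                    events=leave_chart, rtol=1e-9, atol=1e-9)
    t, x = sol.t[-1], sol.y[:, -1]
    if sol.status == 1:                              # event fired: chart switch
        theta = wrap(x[0] + k * np.pi)               # chart-independent point on S^1
        k = 1 - k
        x = np.array([wrap(theta - k * np.pi)])      # coordinates in the new chart
print("final angle on S^1:", wrap(x[0] + k * np.pi))
```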
3.2. Extensions
The proof of Theorem 2 depended on identifying a suitable augmented manifold, autonomous augmented dynamics and an augmented cost function that rephrases the cost (41) as a final cost, in order to apply Theorem 1. This approach generalizes to various other scenarios, including different cost terms, augmented neural ODEs on manifolds and time-dependent parameters, as presented in the following.
3.2.1. Nonlinear and intermittent cost terms
We consider here the case of neural ODEs on manifolds of the form (36) with the cost (37). For the final and intermittent cost term $F$, we denote the differential w.r.t. the $k$-th slot with the slot index as a subscript, to avoid confusion, and we denote its components accordingly. In this case, the parameter gradient is determined by repeated application of Theorem 2:

Theorem 3 (Generalized Adjoint Method on Manifolds).
Given the dynamics (36) and the cost (37), the parameter gradient's components are computed by
where the state satisfies (43) and the co-state satisfies dynamics with discrete updates at the times $t_i$, given by
where one term denotes the instance of the co-state immediately after a discrete update at time $t_i$ (recall that the co-state dynamics are integrated backwards in time), and the other the instance immediately before.
Proof. We introduce an augmented manifold, to include $N$ copies of the original state, the parameters, the accumulated running cost and time in the augmented state. Define the augmented dynamics as
This is an autonomous system, with its final state evaluated at time $T$. Next, define the cost on the augmented space:
Then Equation (37) can be rewritten as the evaluation of a terminal cost:
Apply the co-state equation of Theorem 1 and split the co-state into the parts associated with the state copies, the parameters, the running cost and time; then their components' dynamics are:
We excluded the dynamics of the time component, which does not appear in any of the other equations, and of the constant running-cost component. Finally, define the cumulative co-state
Its dynamics between the times $t_i$ are given by the sum of (65) and the subsequent equations:
with the discrete jumps (58) accounting for the final conditions of the individual co-states, and the summed dynamics can be rewritten as
Integrating this from the final time to the initial time recovers Equation (57). □

Cost terms of this form are interesting for the optimization of, e.g., periodic orbits [37] or trajectories on manifolds, where conditions at multiple checkpoints $t_i$ may appear in the cost.
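As an illustration of the discrete co-state updates (our own sketch, in a single Euclidean chart and for the gradient with respect to the initial state; the parameter gradient of Theorem 3 additionally accumulates the integral term of Theorem 2 between checkpoints), consider a cost that sums a final-type cost term over several checkpoint times:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy dynamics on R^2 and an intermittent cost summed over checkpoint times t_i.
def f(t, x):
    return np.array([x[1], -np.sin(x[0]) - 0.2 * x[1]])

def dfdx(x):
    return np.array([[0.0, 1.0], [-np.cos(x[0]), -0.2]])

F, dFdx = lambda x: 0.5 * np.dot(x, x), lambda x: x
t_check = [1.0, 2.0, 3.0]                 # checkpoint times t_i, with T = 3
x0 = np.array([1.0, 0.0])

def total_cost(x_init):
    s = solve_ivp(f, (0.0, t_check[-1]), x_init, dense_output=True,
                  rtol=1e-10, atol=1e-10)
    return sum(F(s.sol(ti)) for ti in t_check)

fwd = solve_ivp(f, (0.0, t_check[-1]), x0, dense_output=True,
                rtol=1e-10, atol=1e-10)

# Backward sweep: between checkpoints the co-state follows the usual adjoint ODE;
# at each checkpoint it jumps by the differential of the intermittent cost term.
lam = np.zeros(2)
times = [0.0] + t_check
for i in reversed(range(len(t_check))):
    lam = lam + dFdx(fwd.sol(t_check[i]))                  # discrete update
    seg = solve_ivp(lambda t, l: -dfdx(fwd.sol(t)).T @ l,
                    (t_check[i], times[i]), lam, rtol=1e-10, atol=1e-10)
    lam = seg.y[:, -1]

eps, grad_fd = 1e-6, np.zeros(2)
for j in range(2):
    e = np.zeros(2); e[j] = eps
    grad_fd[j] = (total_cost(x0 + e) - total_cost(x0 - e)) / (2 * eps)

print("adjoint gradient of the checkpoint cost:", lam)
print("finite differences:                     ", grad_fd)
```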
3.2.2. Augmented neural ODEs on manifolds and time-dependent parameters
With state $x \in \mathcal{M}$, an augmented state (not to be confused with the augmented state space used in the proofs above), and parameterized dynamics, augmented neural ODEs on manifolds are neural ODEs on the product manifold, of the form
Time $t$ is not included explicitly in these dynamics, since it can be included in the augmented state. This case also includes the scenario of time-dependent parameters as part of the augmented state. As the trajectory cost, we take a final cost

Theorem 4 (Adjoint Method for Augmented Neural ODEs on Manifolds).
Given the dynamics (73) and the cost (74), the parameter gradient's components are computed by
where the states satisfy (73) and the co-states satisfy, in local charts on the two factor manifolds:
Proof. Define the augmented state space as the product of the two factor manifolds and the parameter space, to include the states and the parameters in the augmented state. In addition, define the augmented dynamics as
This is an autonomous system, with its final state evaluated at time $T$. Next, define the cost on the augmented space:
Then Equation (74) can be rewritten as the evaluation of a terminal cost. The gradient is given by an application of the co-state equation of Theorem 1. Split the co-state into the parts associated with the two states and the parameters; then their components' dynamics are:
Since the initial condition of the augmented state also depends on the parameters, the total gradient of the cost w.r.t. the parameters is given by
This recovers Equation (75). □

A further degenerate application of Theorem 4 is obtained by removing $x$. Then both the dynamics and the initial condition are parameterized by $\theta$, allowing the joint optimization of parameters and initial condition. This is interesting for joint optimization and numerical continuation, see, e.g., [37].
4. Neural ODEs on Lie Groups
Just as a neural ODE on a manifold is an NN-parameterized vector field on $\mathcal{M}$ (or, including time, on $\mathbb{R} \times \mathcal{M}$), a neural ODE on a Lie group can be seen as a parameterized vector field on $G$ (or on $\mathbb{R} \times G$, respectively). Similarly to Equation (36), this results in a dynamic system
Yet, Lie groups offer more structure than manifolds: the Lie algebra $\mathfrak{g}$ provides a canonical space to represent tangent vectors, and its dual $\mathfrak{g}^*$ provides a canonical space to represent the co-state. Similarly, canonical (exponential) charts offer structure for integrating dynamic systems [38]. Frequently, dynamics on a Lie group induce dynamics on a manifold $\mathcal{M}$: by means of a group action, evolutions on $G$ induce evolutions on $\mathcal{M}$. This makes neural ODEs on Lie groups interesting in their own right.
In this section, we describe optimizing (39) for the cost
with a final cost term $F$ and a running cost term $r$. We highlight the extrinsic approach and two intrinsic approaches, one of which is peculiar to Lie groups.
4.1. Extrinsic neural ODEs on Lie groups
The extrinsic formulation of neural ODEs on Lie groups was first introduced by [20], and applies ideas of [19] (see also Section 3.1.1). Given $G \subseteq GL(n, \mathbb{R})$, this formulation treats the dynamic system (84) as a dynamic system on $\mathbb{R}^{n^2}$. Denote an invertible map that stacks the components of an input matrix into a component vector (see footnote 2), together with a suitable projection. A lift of the dynamics (84) to the embedding space can then be defined as
As was the case for extrinsic neural ODEs on manifolds, the cost gradient resulting from this optimization is well-defined and equivalent to that of any intrinsically defined procedure. However, the dimension of the vectorization can be significantly larger than the intrinsic dimension of the Lie group.
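A minimal sketch of the extrinsic treatment (our own example): a left-invariant ODE on SO(3) is vectorized into $\mathbb{R}^9$ and handed to a generic solver, which illustrates both the convenience of reusing standard tooling and the loss of exact group structure; the specific angular velocity profile is an arbitrary choice.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Extrinsic treatment of a left-invariant ODE R_dot = R * hat(omega(t)) on SO(3):
# vectorize the 3x3 state into R^9 and hand it to a generic solver.
def hat(a):
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

omega = lambda t: np.array([0.3, 1.0 * np.sin(t), 0.2])   # body angular velocity

def f_vec(t, r):
    R = r.reshape(3, 3)                  # "unvec": component vector -> matrix
    return (R @ hat(omega(t))).reshape(-1)

R0 = np.eye(3)
sol = solve_ivp(f_vec, (0.0, 20.0), R0.reshape(-1), rtol=1e-6, atol=1e-6)
R_T = sol.y[:, -1].reshape(3, 3)

# The co-state of Theorem 2 would live in R^9 here, although dim SO(3) = 3;
# moreover, a generic solver does not keep the state exactly on the group:
print("orthogonality drift:", np.linalg.norm(R_T.T @ R_T - np.eye(3)))
```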
4.2. Intrinsic neural ODEs on Lie groups
Theorem 2 directly applies to the optimization of neural ODEs on Lie groups, given the local exponential charts (19) on $G$. This, however, does not make full use of the available structure on Lie groups. Frequently, dynamical systems are of a left-invariant form (88) or of a right-invariant form
Denoting the derivative of the exponential map as in [21], the chart representatives of these dynamics in a local exponential chart are
Application of Theorem 2 then requires differentiating these chart representatives, which leads to significant computational overhead due to the differentiation of the terms involving the derivative of the exponential map (see [21]). Instead of applying Theorem 2, i.e., expressing the dynamics in local charts, the dynamics can also be expressed at the Lie algebra $\mathfrak{g}$. Theorem 1 has a Hamiltonian form, which can be directly transformed into Hamiltonian equations on a Lie group (see also Appendix A.1). Applying this reasoning to Theorem 2, we arrive at the following form, which foregoes differentiating the derivative-of-the-exponential terms:

Theorem 5 (Left Generalized Adjoint Method on Matrix Lie Groups).
Given are the dynamics (88) and the cost (86), or the corresponding right-invariant dynamics with the appropriate substitutions. Then the parameter gradient of the cost is given by the integral equation
where the state and co-state are the solutions of the system of equations
Proof. This is proven in two steps. First, define the time-and-parameter-dependent control Hamiltonian as
The state dynamics (43) and the co-state dynamics of Theorem 2 then follow as the Hamiltonian equations on $T^*\mathcal{M}$:
And the integral equation (42) reads
Second, rewrite the control Hamiltonian (95) on a Lie group $G$, i.e., for $\mathcal{M} = G$. By substituting the left-trivialized co-state (see also Equation (A6)), this induces
Finally, Hamilton's equations (96) are rewritten in their form on a matrix Lie group by means of (A7), which recovers equation (93) and the accompanying co-state equation:
To find the final condition for the co-state, use that
□
Similar equations also hold on abstract (non-matrix) Lie groups, see [21]. Compared to the extrinsic method of Section 4.1, Theorem 5 has the advantage that the dimension of the co-state is as low as possible. Compared to the chart-based approach on Lie groups, Theorem 5 foregoes differentiating through the derivative-of-the-exponential terms, avoiding that overhead. Compared to a chart-based approach on general manifolds, the choice of charts is moreover canonical on Lie groups. Although the Lie group approach foregoes many of the pitfalls of intrinsic neural ODEs on manifolds, implementation in existing neural ODE packages is currently cumbersome: the adjoint sensitivity equations of Theorem 5 have a non-standard form, requiring an adapted dynamics for the co-state, but these equations are rarely intended for modification in existing packages. Packages for geometry-preserving integrators on Lie groups, such as [38], are also not readily available for arbitrary Lie groups.
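To illustrate the kind of geometry-preserving integration referred to above (our own sketch, a first-order method in the spirit of the Runge-Kutta-Munthe-Kaas schemes of [38]), the following Lie-Euler integrator advances the same left-invariant ODE on SO(3) through the exponential map, so every iterate is a rotation matrix up to the accuracy of expm:

```python
import numpy as np
from scipy.linalg import expm

# Lie-Euler integrator for R_dot = R * hat(omega(t)) on SO(3): each step moves
# along the group via the exponential map, keeping the iterates on the group.
def hat(a):
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

omega = lambda t: np.array([0.3, 1.0 * np.sin(t), 0.2])

def lie_euler(R0, t0, t1, steps):
    R, t = R0.copy(), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        R = R @ expm(dt * hat(omega(t)))    # geometry-preserving update on SO(3)
        t += dt
    return R

R_T = lie_euler(np.eye(3), 0.0, 20.0, 2000)
print("orthogonality drift:", np.linalg.norm(R_T.T @ R_T - np.eye(3)))
```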
4.3. Extensions
The proof of Theorem 5 relied on finding a control-Hamiltonian formulation of Theorem 2. This approach generalizes to the methods in Section 3.2, which rely on the use of Theorem 1, because Theorem 1 itself has a Hamiltonian form ([19,21]).
5. Discussion
We discuss advantages and disadvantages of the main flavors of the presented formulations for manifold neural ODEs, expanding on the previous sections. We focus on extrinsic (embedding the dynamics in $\mathbb{R}^N$) and intrinsic (integrating in local charts) formulations. Summarizing prior comments:
- the extrinsic formulation is readily implemented if the low-dimensional manifold and an embedding into $\mathbb{R}^N$ are known. This comes at the possible cost of geometric inexactness and of a higher dimension of the co-state and sensitivity equations;
- the co-state in the intrinsic formulation has a generally lower dimension, which reduces the dimension of the sensitivity equations. The chart-based formulation also guarantees geometrically exact integration of the dynamics. This comes at the mild cost of having to define local charts and chart transitions.
This dimensionality reduction is unlikely to have a high impact when the manifold is known and low dimensional, e.g., for the sphere or similar manifolds. However, when applying the manifold hypothesis to high-dimensional data, there might be non-trivial latent manifolds for which the embedding is not immediate, and where the latent manifold is of a much lower dimension than the embedding data-manifold. Then the intrinsic method becomes difficult to avoid. If geometric exactness of the integration is desired, local charts need to be defined also for the extrinsic approach, in which case the intrinsic approach may offer further advantages.
In order to derive neural ODEs on Lie groups, three approaches were possible: the extrinsic and intrinsic formulations on manifolds directly carry over to matrix Lie groups, by embedding the matrices in a vector space or by using local exponential charts, respectively. A third option is a novel intrinsic method for neural ODEs on matrix Lie groups, which makes full use of the Lie group structure by phrasing the dynamics on the Lie algebra $\mathfrak{g}$ (as is more common on Lie groups) and the co-state on its dual $\mathfrak{g}^*$, avoiding the difficulties of the chart-based formalism in differentiating extra terms.
Summarizing prior comments on advantages and disadvantages of these flavors:
- the extrinsic formulation on matrix Lie groups can come at a much higher cost than that on manifolds, since the intrinsic dimension of $G$ can be much lower than that of the embedding $\mathbb{R}^{n^2}$, leading to a higher dimension of the co-state and sensitivity equations. Geometrically exact integration procedures are, however, more readily available for matrix Lie groups, integrating in local exponential charts;
- the chart-based formulation on matrix Lie groups struggles when the dynamics are not naturally phrased in local charts. This is common: dynamics are often more naturally phrased on the Lie algebra. This was alleviated by the algebra-based formulation on matrix Lie groups. Both are intrinsic approaches that feature co-state dynamics of a dimension as low as possible. However, the algebra-based approach still lacks a readily available software implementation.
The authors believe that the algebra-based formulation is more convenient, in principle, and consider software implementations of the algebra-based approach as possible future work.
In summary, we presented a unified, geometric approach to extend various methods for neural ODEs on $\mathbb{R}^n$ to neural ODEs on manifolds and Lie groups. Optimization of neural ODEs on manifolds was based on the adjoint method on manifolds. Given a novel cost function C and neural ODE architecture f, the strategy for presenting the results in a unified fashion was to identify a suitable augmented manifold, augmented dynamics, and an augmented cost such that the original cost function can be rephrased as a final cost. The derivation of intrinsic neural ODEs on Lie groups was further based on finding a Hamiltonian formulation of the adjoint method on manifolds, and on subsequently transforming it into the Hamiltonian equations on a matrix Lie group.
Author Contributions
Conceptualization, Y.P.W.; methodology, Y.P.W.; software, Y.P.W.; validation, Y.P.W.; formal analysis, Y.P.W.; investigation, Y.P.W.; resources, S.S.; data curation, Y.P.W.; writing—original draft preparation, Y.P.W.; writing—review and editing, Y.P.W., F.C., S.S.; visualization, Y.P.W.; supervision, F.C., S.S.; project administration, S.S.; funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A. Additional Material
Appendix A.1. Hamiltonian dynamics on Lie groups
We briefly review Hamiltonian systems on manifolds and matrix Lie groups (see also [21]).
Given a manifold $\mathcal{M}$ with local coordinates and the induced coordinates on $T^*\mathcal{M}$, we define the symplectic form as
A Hamiltonian function on $T^*\mathcal{M}$ then implicitly defines a unique vector field by
In coordinates, this Hamiltonian vector field has the components
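As a small worked example of Hamilton's equations in canonical coordinates (our own illustration, with a pendulum Hamiltonian chosen arbitrarily):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hamilton's equations in canonical coordinates (q, p) for an illustrative
# pendulum Hamiltonian H = p^2/2 + (1 - cos q).
H    = lambda q, p: 0.5 * p ** 2 + (1.0 - np.cos(q))
dHdq = lambda q, p: np.sin(q)
dHdp = lambda q, p: p

def hamiltonian_vf(t, z):
    q, p = z
    return np.array([dHdp(q, p), -dHdq(q, p)])    # q_dot = dH/dp, p_dot = -dH/dq

z0 = np.array([1.0, 0.0])
sol = solve_ivp(hamiltonian_vf, (0.0, 20.0), z0, rtol=1e-10, atol=1e-10)
qT, pT = sol.y[:, -1]
print("energy drift:", abs(H(qT, pT) - H(*z0)))   # H is conserved along the flow
```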
On a Lie group $G$, the group structure allows the identification $T^*G \cong G \times \mathfrak{g}^*$, e.g., by using the pullback of the left-translation map to define the left-trivialized co-state as
Then the left Hamiltonian is defined in terms of the Hamiltonian on $T^*G$ as
For a matrix Lie group, the left Hamiltonian equations read:
with the map defined in (12) and the left-trivialized differential as in (22).
References
- Chen, R.T.Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural Ordinary Differential Equations. CoRR, abs/1806.07366, 2018. [Google Scholar]
- Massaroli, S.; Poli, M.; Park, J.; Yamashita, A.; Asama, H. Dissecting neural ODEs. Advances in Neural Information Processing Systems, 2020. [Google Scholar]
- Zakwan, M.; Natale, L.D.; Svetozarevic, B.; Heer, P.; Jones, C.; Trecate, G.F. Physically Consistent Neural ODEs for Learning Multi-Physics Systems*. IFAC-PapersOnLine 2023, 56, 5855–5860. [Google Scholar] [CrossRef]
- Sholokhov, A.; Liu, Y.; Mansour, H.; Nabi, S. Physics-informed neural ODE (PINODE): embedding physics into models using collocation points. Scientific Reports 2023 13:1 2023, 13, 1–13. [Google Scholar] [CrossRef] [PubMed]
- Ghanem, P.; Demirkaya, A.; Imbiriba, T.; Ramezani, A.; Danziger, Z.; Erdogmus, D. Learning Physics Informed Neural ODEs With Partial Measurements. AAAI-25, 2025, arXiv:2412.08681.
- Massaroli, S.; Poli, M.; Califano, F.; Park, J.; Yamashita, A.; Asama, H. Optimal Energy Shaping via Neural Approximators. SIAM Journal on Applied Dynamical Systems 2022, 21, 2126–2147. [Google Scholar] [CrossRef]
- Niu, H.; Zhou, Y.; Yan, X.; Wu, J.; Shen, Y.; Yi, Z.; Hu, J. On the applications of neural ordinary differential equations in medical image analysis. Artificial Intelligence Review 2024, 57, 1–32. [Google Scholar] [CrossRef]
- Oh, Y.; Kam, S.; Lee, J.; Lim, D.Y.; Kim, S.; Bui, A.A.T. Comprehensive Review of Neural Differential Equations for Time Series Analysis 2025.
- Poli, M.; Massaroli, S.; Yamashita, A.; Asama, H.; Garg, A. Neural Hybrid Automata: Learning Dynamics with Multiple Modes and Stochastic Transitions 2021.
- Chen, R.T.Q.; Amos, B.; Nickel, M. Learning Neural Event Functions for Ordinary Differential Equations 2021.
- Davis, J.Q.; Choromanski, K.; Varley, J.; Lee, H.; Slotine, J.J.; Likhosterov, V.; Weller, A.; Makadia, A.; Sindhwani, V. Time Dependence in Non-Autonomous Neural ODEs 2020, arXiv:2005.01906.
- Dupont, E.; Doucet, A.; Teh, Y.W. Augmented Neural ODEs 2019, arXiv:1904.01681.
- Chu, H.; Miyatake, Y.; Cui, W.; Wei, S.; Furihata, D. Structure-Preserving Physics-Informed Neural Networks With Energy or Lyapunov Structure, 2024, arXiv:2401.04986.
- Kütük, M.; Yücel, H. Energy dissipation preserving physics informed neural network for Allen–Cahn equations. Journal of Computational Science 2025, 87, 102577. [Google Scholar] [CrossRef]
- Bullo, F.; Murray, R.M. Tracking for fully actuated mechanical systems: a geometric framework. Automatica 1999, 35, 17–34. [Google Scholar] [CrossRef]
- Marsden, J.E.; Ratiu, T.S. Introduction to Mechanics and Symmetry; Vol. 17, Springer New York, 1999. [CrossRef]
- Whiteley, N.; Gray, A.; Rubin-Delanchy, P. Statistical exploration of the Manifold Hypothesis 2025. arXiv:stat.ME/2208.11665].
- Lou, A.; Lim, D.; Katsman, I.; Huang, L.; Jiang, Q.; Lim, S.N.; De Sa, C. Neural Manifold Ordinary Differential Equations. Advances in Neural Information Processing Systems, 2020. [Google Scholar]
- Falorsi, L.; Davidson, T.R.; Forré, P. Reparameterizing Distributions on Lie Groups 2019, 89, arXiv:1903.02958v1.
- Duong, T.; Altawaitan, A.; Stanley, J.; Atanasov, N. Port-Hamiltonian Neural ODE Networks on Lie Groups for Robot Dynamics Learning and Control. IEEE Transactions on Robotics 2024, 40, 3695–3715. [Google Scholar] [CrossRef]
- Wotte, Y.P.; Califano, F.; Stramigioli, S. Optimal potential shaping on SE(3) via neural ordinary differential equations on Lie groups. The International Journal of Robotics Research 2024, 43, 2221–2244. [Google Scholar] [CrossRef]
- Falorsi, L.; Forré, P. Neural Ordinary Differential Equations on Manifolds, 2020, arXiv:2006.06663.
- Andersdotter, E.; Persson, D.; Ohlsson, F. Equivariant Manifold Neural ODEs and Differential Invariants 2024.
- Wotte, Y. Optimal Potential Energy Shaping on SE(3) via Neural Approximators. University of Twente Archive 2021. [Google Scholar]
- Pau, B.S. An introduction to neural ordinary differential equations 2024.
- Gholami, A.; Keutzer, K.; Biros, G. ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs 2019, arXiv:1902.10298.
- Kidger, P.; Morrill, J.; Foster, J.; Lyons, T.J. Neural Controlled Differential Equations for Irregular Time Series. CoRR, abs/2005.08926, 2020. [Google Scholar]
- Li, X.; Wong, T.K.L.; Chen, R.T.Q.; Duvenaud, D. Scalable Gradients for Stochastic Differential Equations 2020, arXiv:2001.01328.
- Floryan, D.; Graham, M.D. Data-driven discovery of intrinsic dynamics. Nature Machine Intelligence 2022, 4, 1113–1120. [Google Scholar] [CrossRef]
- Duong, T.; Atanasov, N. Hamiltonian-based Neural ODE Networks on the SE(3) Manifold For Dynamics Learning and Control 2021. [2106.12782v3].
- Isham, C.J. Modern Differential Geometry for Physicists; 1999. [CrossRef]
- Hall, B.C. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction; Graduate Texts in Mathematics (GTM, volume 222), Springer, 2015.
- Solà, J.; Deray, J.; Atchuthan, D. A micro Lie theory for state estimation in robotics, 2021, arXiv:1812.01537.
- Visser, M.; Stramigioli, S.; Heemskerk, C. Cayley-Hamilton for roboticists. IEEE International Conference on Intelligent Robots and Systems 2006, 1, 4187–4192. [Google Scholar] [CrossRef]
- Robbins, H.; Monro, S. A Stochastic Approximation Method. The Annals of Mathematical Statistics 1951, 22, 400–407. [Google Scholar] [CrossRef]
- Poli, M.; Massaroli, S.; Yamashita, A.; Asama, H.; Park, J. TorchDyn: A Neural Differential Equations Library 2020. [2009.09346].
- Wotte, Y.P.; Dummer, S.; Botteghi, N.; Brune, C.; Stramigioli, S.; Califano, F. Discovering efficient periodic behaviors in mechanical systems via neural approximators. Optimal Control Applications and Methods 2023, 44, 3052–3079. [Google Scholar] [CrossRef]
- Munthe-Kaas, H. High order Runge-Kutta methods on manifolds. Applied Numerical Mathematics 1999, 29, 115–127. [Google Scholar] [CrossRef]
1. Equivalently (e.g., [33]), these maps are often denoted as the "hat" and "vee" operators, respectively.
2. For instance, in canonical coordinates on the matrix space and its vectorization, though this choice is not required.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).