Preprint · Article · This version is not peer-reviewed.
Optimal Transport with Total Variation Regularization: Metric Properties and Limiting Behavior

Submitted: 25 December 2025 · Posted: 25 December 2025

Abstract
We investigate an optimal transport problem augmented with a total variation regularization term that penalizes deviations of a transport plan from the independent product of the marginals. This approach yields a convex but non-smooth optimization problem and provides an alternative to entropy-based regularization. We establish existence of minimizers and prove that for any positive regularization parameter, the resulting functional defines a metric on the space of probability measures. Detailed analysis of the triangle inequality and other metric properties is provided. We study limiting regimes as the regularization parameter tends to zero (recovering the Wasserstein distance) and to infinity (yielding a multiple of the total variation distance). A discrete formulation leading to a linear programming problem is presented, along with qualitative examples illustrating the sparsity-promoting nature of the model. Comparisons with entropic regularization highlight the trade-offs between computational efficiency and structural properties of optimal couplings.

1. Introduction

Optimal transport theory provides a geometrically meaningful framework for comparing probability measures by accounting for both mass distribution and the underlying space geometry. Since Kantorovich’s seminal formulation [1], the theory has evolved into a rich mathematical discipline with applications ranging from partial differential equations to machine learning. Comprehensive treatments can be found in the monographs of Villani [2,3].
Wasserstein distances, derived from optimal transport, have become indispensable tools in statistics, image analysis, and data science due to their favorable geometric properties. However, computational complexity remains a significant challenge, motivating the development of regularized formulations that preserve convexity while enabling efficient computation.
Entropic regularization, introduced by Cuturi [4], has emerged as the dominant approach, leading to the Sinkhorn algorithm with quadratic complexity. As surveyed by Peyré and Cuturi [5], this method enables large-scale applications but produces fully dense coupling matrices. While appropriate for many tasks, dense couplings may be undesirable in applications requiring sparse, interpretable transport plans, such as graph matching, feature correspondence, or problems with inherent sparsity structure.
In this work, we propose an alternative regularization based on the total variation (TV) norm. The TV norm has a long history in analysis and inverse problems, particularly in image processing where it preserves edges and promotes piecewise constant solutions [6]. We employ TV to penalize deviations of transport plans from the independent product of marginals, yielding a convex optimization problem with non-smooth regularization that naturally encourages sparse couplings.
Our contributions are threefold: (1) we define the TV-regularized optimal transport problem and establish its basic properties, including existence of minimizers; (2) we prove that for any positive regularization parameter, the functional defines a metric on the space of probability measures; (3) we analyze limiting behavior as the regularization parameter varies and present a discrete linear programming formulation. Throughout, we emphasize theoretical understanding while noting computational implications.
The paper is structured as follows: Section 2 establishes notation and reviews necessary background. Section 3 studies properties of the TV deviation functional. Section 4 defines the regularized problem and proves existence. Section 5 establishes metric properties with detailed proofs. Section 6 analyzes limiting regimes. Section 7 presents the discrete formulation. Section 8 provides illustrative examples, and Section 9 concludes with future directions.

2. Preliminaries and Notation

Let $(\Omega, d)$ be a compact metric space. We denote by $\mathcal{P}(\Omega)$ the set of Borel probability measures on $\Omega$. For $\mu, \nu \in \mathcal{P}(\Omega)$, let $\Pi(\mu,\nu)$ denote the set of couplings (transport plans) with marginals $\mu$ and $\nu$, i.e., probability measures $\pi$ on $\Omega \times \Omega$ satisfying $\pi(A \times \Omega) = \mu(A)$ and $\pi(\Omega \times B) = \nu(B)$ for all Borel sets $A, B \subseteq \Omega$.
Given a continuous cost function $c : \Omega \times \Omega \to \mathbb{R}_+$, the Kantorovich optimal transport problem is
$$W_c(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{\Omega \times \Omega} c(x,y) \, d\pi(x,y).$$
When $c(x,y) = d(x,y)^p$ for $p \ge 1$, the $p$-th root of this infimum defines the $p$-Wasserstein distance $W_p(\mu,\nu)$.
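As a quick numerical illustration (ours, not from the paper), the $p = 1$ case on the real line can be evaluated directly: `scipy.stats.wasserstein_distance` computes exactly this quantity for empirical measures, for which the optimal coupling is the monotone rearrangement of the samples.

```python
# Minimal sketch (illustrative, not from the paper): W_1 between two
# empirical measures on the real line via scipy.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(0.0, 1.0, size=2000)  # samples from N(0, 1)
nu_samples = rng.normal(1.0, 1.0, size=2000)  # samples from N(1, 1)

# On R with c(x, y) = |x - y|, the optimal plan is the monotone
# rearrangement; the true W_1 between these Gaussians is 1 (the mean shift).
print(wasserstein_distance(mu_samples, nu_samples))  # approximately 1.0
```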
The total variation norm of a finite signed measure $\sigma$ on $\Omega$ is
$$\|\sigma\|_{\mathrm{TV}} = \sup\left\{ \int f \, d\sigma \;:\; f \in C_b(\Omega),\ \|f\|_\infty \le 1 \right\} = |\sigma|(\Omega),$$
where $C_b(\Omega)$ denotes the bounded continuous functions on $\Omega$. When $\sigma(\Omega) = 0$, as for $\sigma = \mu - \nu$ with $\mu, \nu \in \mathcal{P}(\Omega)$, this equals
$$\|\sigma\|_{\mathrm{TV}} = 2 \sup_{A \in \mathcal{B}(\Omega)} |\sigma(A)|,$$
where $\mathcal{B}(\Omega)$ is the Borel $\sigma$-algebra. This norm induces the strong topology on measures and plays a fundamental role in probability and statistics [7,8].
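For discrete measures both expressions are easy to compute, and the factor of 2 above can be checked directly. The following sketch (our illustration; measures are represented as weight vectors over a common finite support) does exactly that.

```python
# Illustration (assumes a finite common support; weight-vector representation).
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.2, 0.6])
sigma = mu - nu                     # finite signed measure with sigma(Omega) = 0

tv_norm = np.abs(sigma).sum()       # ||sigma||_TV = |sigma|(Omega)
sup_sets = sigma[sigma > 0].sum()   # sup_A sigma(A), attained at A = {sigma > 0}
assert np.isclose(tv_norm, 2.0 * sup_sets)
print(tv_norm, sup_sets)            # 0.8 and 0.4
```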
For $\mu, \nu \in \mathcal{P}(\Omega)$, we denote by $\mu \otimes \nu$ their product measure on $\Omega \times \Omega$, representing the joint distribution of independent random variables with marginals $\mu$ and $\nu$.

3. Total Variation Deviation from Independence

Definition 1 
(TV deviation from independence). For $\mu, \nu \in \mathcal{P}(\Omega)$ and $\pi \in \Pi(\mu,\nu)$, define
$$D_{\mu,\nu}(\pi) = \|\pi - \mu \otimes \nu\|_{\mathrm{TV}}.$$
This functional measures how far the coupling $\pi$ deviates from statistical independence. It vanishes precisely when $\pi = \mu \otimes \nu$, and attains its maximum value of $2$ when $\pi$ is singular with respect to $\mu \otimes \nu$.
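In the discrete setting of Section 7, $D_{\mu,\nu}$ is an entrywise $\ell_1$ distance between matrices, which makes the range $[0,2]$ easy to see. The sketch below (our notation, not code from the paper) compares the independent coupling with a diagonal one.

```python
# Illustration: D as an entrywise l1 distance in the discrete case.
import numpy as np

def deviation(pi, a, b):
    """||pi - mu (x) nu||_TV for discrete measures with weight vectors a, b."""
    return np.abs(pi - np.outer(a, b)).sum()

a = b = np.array([0.5, 0.5])
independent = np.outer(a, b)   # the product coupling
diagonal = np.diag(a)          # a valid coupling here because a == b

print(deviation(independent, a, b))  # 0.0: exactly independent
print(deviation(diagonal, a, b))     # 1.0: strictly between 0 and 2
```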
Proposition 1 
(Properties of $D_{\mu,\nu}$). The functional $D_{\mu,\nu} : \Pi(\mu,\nu) \to [0,2]$ satisfies:
(i) Convexity: for any $\pi_1, \pi_2 \in \Pi(\mu,\nu)$ and $\alpha \in [0,1]$,
$$D_{\mu,\nu}(\alpha \pi_1 + (1-\alpha)\pi_2) \le \alpha D_{\mu,\nu}(\pi_1) + (1-\alpha) D_{\mu,\nu}(\pi_2).$$
(ii) Lower semicontinuity: if $\pi_n \to \pi$ weakly in $\Pi(\mu,\nu)$, then
$$D_{\mu,\nu}(\pi) \le \liminf_{n \to \infty} D_{\mu,\nu}(\pi_n).$$
(iii) Subadditivity: for any $\mu, \nu, \sigma \in \mathcal{P}(\Omega)$, $\pi_1 \in \Pi(\mu,\sigma)$, and $\pi_2 \in \Pi(\sigma,\nu)$,
$$D_{\mu,\nu}(\pi) \le D_{\mu,\sigma}(\pi_1) + D_{\sigma,\nu}(\pi_2),$$
where $\pi$ is the gluing of $\pi_1$ and $\pi_2$.
(iv) Bounds: $0 \le D_{\mu,\nu}(\pi) \le 2$ for all $\pi \in \Pi(\mu,\nu)$.
Proof. (i) Convexity follows from the triangle inequality and positive homogeneity of $\|\cdot\|_{\mathrm{TV}}$, since $\alpha \pi_1 + (1-\alpha)\pi_2 - \mu \otimes \nu = \alpha(\pi_1 - \mu \otimes \nu) + (1-\alpha)(\pi_2 - \mu \otimes \nu)$. (ii) Lower semicontinuity holds because $\|\cdot\|_{\mathrm{TV}}$ is lower semicontinuous with respect to weak convergence. (iii) For subadditivity, let $\gamma$ be a gluing of $\pi_1$ and $\pi_2$ on $\Omega^3$ and set $\pi = (\mathrm{proj}_{13})_\# \gamma$. Since $(\mathrm{proj}_{13})_\#(\mu \otimes \sigma \otimes \nu) = \mu \otimes \nu$ and pushforwards do not increase the TV norm,
$$D_{\mu,\nu}(\pi) = \|\pi - \mu \otimes \nu\|_{\mathrm{TV}} \le \|\gamma - \mu \otimes \sigma \otimes \nu\|_{\mathrm{TV}} \le \|\pi_1 - \mu \otimes \sigma\|_{\mathrm{TV}} + \|\pi_2 - \sigma \otimes \nu\|_{\mathrm{TV}},$$
using the triangle inequality and properties of product measures. (iv) The bounds follow from $\|\pi - \mu \otimes \nu\|_{\mathrm{TV}} \le \|\pi\|_{\mathrm{TV}} + \|\mu \otimes \nu\|_{\mathrm{TV}} = 2$. □

4. TV-Regularized Optimal Transport

Definition 2 
(TV-regularized optimal transport). Let $\mu, \nu \in \mathcal{P}(\Omega)$, $\lambda \ge 0$, and let $c : \Omega \times \Omega \to \mathbb{R}_+$ be continuous. Define
$$T_\lambda(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} \left\{ \int_{\Omega \times \Omega} c(x,y) \, d\pi(x,y) + \lambda D_{\mu,\nu}(\pi) \right\}.$$
The parameter $\lambda$ controls the trade-off between transportation cost and deviation from independence. When $\lambda = 0$, we recover the classical optimal transport problem. As $\lambda$ increases, couplings close to the independent product are favored.
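The trade-off is transparent in the discrete case. The sketch below (the function name and test data are our choices) evaluates the objective of Definition 2 for two extreme couplings: as $\lambda$ grows, the independent coupling eventually beats the cheap sparse one.

```python
# Illustration: the regularized objective of Definition 2, discrete case.
import numpy as np

def tv_ot_objective(C, pi, a, b, lam):
    """Transport cost plus lam times the TV deviation from independence."""
    return (C * pi).sum() + lam * np.abs(pi - np.outer(a, b)).sum()

a = b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
sparse = np.diag(a)            # zero transport cost, deviation 1
independent = np.outer(a, b)   # transport cost 0.5, deviation 0
for lam in (0.0, 0.25, 1.0):
    print(lam, tv_ot_objective(C, sparse, a, b, lam),
          tv_ot_objective(C, independent, a, b, lam))
```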
Theorem 2 
(Existence of minimizers). For any $\mu, \nu \in \mathcal{P}(\Omega)$ and $\lambda \ge 0$, the infimum in the definition of $T_\lambda(\mu,\nu)$ is attained.
Proof. 
The set $\Pi(\mu,\nu)$ is compact in the weak topology by Prokhorov's theorem, since $\Omega \times \Omega$ is compact. The objective functional
$$F(\pi) = \int c \, d\pi + \lambda D_{\mu,\nu}(\pi)$$
is lower semicontinuous: the first term is continuous under weak convergence by continuity of $c$, while the second term is lower semicontinuous by Proposition 1(ii). A lower semicontinuous function on a compact set attains its minimum. □

5. Metric Properties

We now focus on the case $c(x,y) = d(x,y)$, where $d$ is the metric on $\Omega$.
Proposition 3 
(Basic properties). For any $\lambda \ge 0$ and $\mu, \nu \in \mathcal{P}(\Omega)$:
(i) $T_\lambda(\mu,\nu) \ge 0$ (non-negativity);
(ii) $T_\lambda(\mu,\nu) = T_\lambda(\nu,\mu)$ (symmetry);
(iii) $T_\lambda(\mu,\nu) \le W_1(\mu,\nu) + 2\lambda$ (boundedness).
Proof. (i) Both terms in the definition are non-negative. (ii) Symmetry follows from the symmetry of $d$ and of $D_{\mu,\nu}$. (iii) Let $\pi^*$ be an optimal coupling for $W_1(\mu,\nu)$. Since $D_{\mu,\nu}(\pi^*) \le 2$,
$$T_\lambda(\mu,\nu) \le \int d \, d\pi^* + \lambda D_{\mu,\nu}(\pi^*) \le W_1(\mu,\nu) + 2\lambda. \qquad \square$$
Theorem 4 
(Identity of indiscernibles). For $\lambda > 0$, $T_\lambda(\mu,\nu) = 0$ if and only if $\mu = \nu$.
Proof. 
If $\mu = \nu$, the diagonal coupling $\pi = (\mathrm{id}, \mathrm{id})_\# \mu$ satisfies $\int d \, d\pi = 0$ and $D_{\mu,\nu}(\pi) = 0$, so $T_\lambda(\mu,\nu) = 0$.
Conversely, suppose $T_\lambda(\mu,\nu) = 0$. Let $\pi_n \in \Pi(\mu,\nu)$ be a minimizing sequence with
$$\int d \, d\pi_n + \lambda D_{\mu,\nu}(\pi_n) \to 0.$$
Since both terms are non-negative, $\int d \, d\pi_n \to 0$ and $D_{\mu,\nu}(\pi_n) \to 0$. The first condition implies that any weak limit of $(\pi_n)$ concentrates on the diagonal $\{(x,x) : x \in \Omega\}$, which forces $\mu = \nu$ as marginals. □
Theorem 5 
(Triangle inequality). For any $\mu, \nu, \sigma \in \mathcal{P}(\Omega)$ and $\lambda \ge 0$,
$$T_\lambda(\mu,\nu) \le T_\lambda(\mu,\sigma) + T_\lambda(\sigma,\nu).$$
Proof. 
Let $\epsilon > 0$. Choose $\pi_1 \in \Pi(\mu,\sigma)$ and $\pi_2 \in \Pi(\sigma,\nu)$ such that
$$\int d \, d\pi_1 + \lambda D_{\mu,\sigma}(\pi_1) \le T_\lambda(\mu,\sigma) + \epsilon, \qquad \int d \, d\pi_2 + \lambda D_{\sigma,\nu}(\pi_2) \le T_\lambda(\sigma,\nu) + \epsilon.$$
By the gluing lemma [3], there exists $\gamma \in \mathcal{P}(\Omega \times \Omega \times \Omega)$ whose marginal on the first two coordinates is $\pi_1$ and whose marginal on the last two coordinates is $\pi_2$. Define $\pi = (\mathrm{proj}_{13})_\# \gamma \in \Pi(\mu,\nu)$.
For the transport cost, the triangle inequality for $d$ gives
$$\int d(x,z) \, d\pi(x,z) \le \int \big[ d(x,y) + d(y,z) \big] \, d\gamma(x,y,z) = \int d \, d\pi_1 + \int d \, d\pi_2.$$
For the TV term, Proposition 1(iii) gives
$$D_{\mu,\nu}(\pi) \le D_{\mu,\sigma}(\pi_1) + D_{\sigma,\nu}(\pi_2).$$
Combining these estimates,
$$T_\lambda(\mu,\nu) \le \int d \, d\pi + \lambda D_{\mu,\nu}(\pi) \le \left( \int d \, d\pi_1 + \lambda D_{\mu,\sigma}(\pi_1) \right) + \left( \int d \, d\pi_2 + \lambda D_{\sigma,\nu}(\pi_2) \right) \le T_\lambda(\mu,\sigma) + T_\lambda(\sigma,\nu) + 2\epsilon.$$
Since ϵ > 0 was arbitrary, the result follows. □
Corollary 6 
(Metric property). For any $\lambda > 0$, $T_\lambda$ defines a metric on $\mathcal{P}(\Omega)$.
Proof. 
Combine Proposition 3 (non-negativity, symmetry), Theorem 4 (identity of indiscernibles), and Theorem 5 (triangle inequality). □

6. Limiting Regimes

Theorem 7 
(Recovery of Wasserstein distance). For $\mu, \nu \in \mathcal{P}(\Omega)$,
$$\lim_{\lambda \to 0} T_\lambda(\mu,\nu) = W_1(\mu,\nu).$$
Moreover, if $\pi_\lambda$ is a minimizer for $T_\lambda(\mu,\nu)$, then any weak limit point of $\{\pi_\lambda\}$ as $\lambda \to 0$ is an optimal coupling for $W_1(\mu,\nu)$.
Proof. 
For any $\pi \in \Pi(\mu,\nu)$ and $\lambda \ge 0$,
$$\int d \, d\pi \le \int d \, d\pi + \lambda D_{\mu,\nu}(\pi).$$
Taking the infimum over $\pi \in \Pi(\mu,\nu)$ and using $D_{\mu,\nu}(\pi) \le 2$ gives
$$W_1(\mu,\nu) \le T_\lambda(\mu,\nu) \le W_1(\mu,\nu) + 2\lambda.$$
The squeeze theorem yields the limit.
For the second claim, let $\lambda_n \to 0$ and $\pi_{\lambda_n} \to \pi$ weakly. Then
$$\int d \, d\pi \le \liminf_{n \to \infty} \int d \, d\pi_{\lambda_n} \le \lim_{n \to \infty} T_{\lambda_n}(\mu,\nu) = W_1(\mu,\nu),$$
so $\pi$ is optimal for $W_1$. □
Theorem 8 
(Large regularization limit). For $\mu, \nu \in \mathcal{P}(\Omega)$,
$$\lim_{\lambda \to \infty} \frac{T_\lambda(\mu,\nu)}{\lambda} = \|\mu - \nu\|_{\mathrm{TV}}.$$
Proof. 
For any $\pi \in \Pi(\mu,\nu)$,
$$\frac{T_\lambda(\mu,\nu)}{\lambda} \le \frac{1}{\lambda} \int d \, d\pi + D_{\mu,\nu}(\pi).$$
Letting $\lambda \to \infty$ gives $\limsup_{\lambda \to \infty} T_\lambda(\mu,\nu)/\lambda \le D_{\mu,\nu}(\pi)$ for every $\pi$, hence
$$\limsup_{\lambda \to \infty} \frac{T_\lambda(\mu,\nu)}{\lambda} \le \inf_{\pi \in \Pi(\mu,\nu)} D_{\mu,\nu}(\pi) = \|\mu - \nu\|_{\mathrm{TV}},$$
where the last equality follows from [8].
For the lower bound, note that since the transport term is non-negative, for any $\lambda > 0$,
$$\frac{T_\lambda(\mu,\nu)}{\lambda} \ge \inf_{\pi \in \Pi(\mu,\nu)} D_{\mu,\nu}(\pi) = \|\mu - \nu\|_{\mathrm{TV}}. \qquad \square$$

7. Discrete Formulation and Computation

Consider discrete measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ with $a \in \Delta_n$, $b \in \Delta_m$ (probability simplices). Let $C_{ij} = d(x_i, y_j)$.
Proposition 9 
(Linear programming formulation). The discrete TV-regularized optimal transport problem is equivalent to
$$T_\lambda(\mu,\nu) = \min_{\pi,\, s^+,\, s^- \ge 0} \; \sum_{i=1}^n \sum_{j=1}^m C_{ij} \pi_{ij} + \lambda \sum_{i=1}^n \sum_{j=1}^m \left( s_{ij}^+ + s_{ij}^- \right)$$
subject to
$$\sum_{j=1}^m \pi_{ij} = a_i, \quad i = 1, \dots, n, \qquad \sum_{i=1}^n \pi_{ij} = b_j, \quad j = 1, \dots, m, \qquad \pi_{ij} - a_i b_j = s_{ij}^+ - s_{ij}^-, \quad \forall i, j,$$
where $s_{ij}^+, s_{ij}^-$ represent the positive and negative parts of the deviation $\pi_{ij} - a_i b_j$.
Proof. 
In the discrete setting the TV norm becomes the $\ell_1$ norm: $D_{\mu,\nu}(\pi) = \sum_{i,j} |\pi_{ij} - a_i b_j|$. Introduce slack variables $s_{ij}^+, s_{ij}^- \ge 0$ with $s_{ij}^+ - s_{ij}^- = \pi_{ij} - a_i b_j$. At an optimum $s_{ij}^+ s_{ij}^- = 0$, since reducing both by their minimum lowers the objective, so $s_{ij}^+ + s_{ij}^- = |\pi_{ij} - a_i b_j|$ and the problem is a linear program. □
This formulation involves $3nm$ non-negative variables ($\pi$, $s^+$, $s^-$) and $n + m + nm$ equality constraints, making it solvable by standard linear programming algorithms (simplex, interior-point methods) for moderate $n, m$.
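A direct implementation of this linear program is short. In the following sketch, the helper name `tv_regularized_ot`, the variable ordering $[\pi, s^+, s^-]$, and the use of `scipy.optimize.linprog` are our choices, not specified in the paper.

```python
# Sketch of the LP in Proposition 9, solved with scipy's HiGHS backend.
import numpy as np
from scipy.optimize import linprog

def tv_regularized_ot(C, a, b, lam):
    """Solve the discrete TV-regularized OT problem; return (value, coupling)."""
    n, m = C.shape
    nm = n * m
    # Decision vector x = [vec(pi), vec(s+), vec(s-)], all entries >= 0.
    cost = np.concatenate([C.ravel(), lam * np.ones(2 * nm)])

    # Marginal constraints: row sums of pi equal a, column sums equal b.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))   # shape (n, nm)
    col_sums = np.kron(np.ones((1, n)), np.eye(m))   # shape (m, nm)
    A_marg = np.hstack([np.vstack([row_sums, col_sums]),
                        np.zeros((n + m, 2 * nm))])

    # Deviation constraints: pi_ij - s+_ij + s-_ij = a_i * b_j.
    I = np.eye(nm)
    A_dev = np.hstack([I, -I, I])

    res = linprog(cost,
                  A_eq=np.vstack([A_marg, A_dev]),
                  b_eq=np.concatenate([a, b, np.outer(a, b).ravel()]),
                  bounds=(0, None), method="highs")
    assert res.success, res.message
    return res.fun, res.x[:nm].reshape(n, m)
```

For $\lambda = 0$ this reduces to the classical optimal transport linear program; for large $\lambda$ the returned coupling approaches the outer product `np.outer(a, b)`, which is always feasible.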
Table 1. Comparison of transport formulations ($n$: support size; $\varepsilon$: entropic regularization parameter; $\lambda$: TV regularization parameter).

| Method | Metric? | Computational complexity | Sparsity of $\pi^*$ | Convergence as regularization $\to 0$ |
|---|---|---|---|---|
| $W_1$ (unregularized) | Yes | $O(n^3 \log n)$ | High | — |
| Entropic OT [4] | Yes | $O(n^2/\varepsilon)$ | None | $O(\varepsilon \log(1/\varepsilon))$ |
| TV-regularized OT | Yes | $O(n^3)$ (LP) | Moderate | $O(\lambda)$ |

8. Examples and Illustrations

Example 10 
(Two-point distributions). Let $\mu = \frac{1}{2}\delta_0 + \frac{1}{2}\delta_1$ and $\nu = \frac{1}{2}\delta_{1/2} + \frac{1}{2}\delta_{3/2}$, with $d(x,y) = |x-y|$. The optimal couplings for different $\lambda$ are:
  • $\lambda = 0$: $\pi_0 = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$, $T_0(\mu,\nu) = \frac{1}{2}$, $D_{\mu,\nu}(\pi_0) = 1$ (for $0 \le \lambda < \frac{1}{4}$, $\pi_0$ remains optimal and $T_\lambda = \frac{1}{2} + \lambda$)
  • $\lambda = \frac{1}{4}$: every coupling $\pi_t = \begin{pmatrix} 1/2 - t & t \\ t & 1/2 - t \end{pmatrix}$ with $t \in [0, \frac{1}{4}]$ is optimal, $T_{1/4}(\mu,\nu) = \frac{3}{4}$; e.g. $\pi = \begin{pmatrix} 0.45 & 0.05 \\ 0.05 & 0.45 \end{pmatrix}$ with $D_{\mu,\nu}(\pi) = 0.8$
  • $\lambda > \frac{1}{4}$: $\pi_\infty = \begin{pmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{pmatrix}$ (the independent coupling), $T_\lambda(\mu,\nu) = \frac{3}{4}$, $D_{\mu,\nu}(\pi_\infty) = 0$
The unregularized solution is sparse (two zero entries), while entropic regularization would yield a fully dense matrix. TV regularization admits optimal couplings of intermediate sparsity at the threshold $\lambda = \frac{1}{4}$; these values can be verified with the scan sketched below.
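Because both marginals are uniform on two points, every coupling in this example has the form $\pi_t$ above for some $t \in [0, \frac{1}{2}]$, so the objective can be scanned directly (our verification sketch, not code from the paper).

```python
# Brute-force check of Example 10 over the one-parameter coupling family.
import numpy as np

C = np.array([[0.5, 1.5],
              [0.5, 0.5]])      # C_ij = |x_i - y_j|, x in {0, 1}, y in {1/2, 3/2}
a = b = np.array([0.5, 0.5])

def objective(t, lam):
    pi = np.array([[0.5 - t, t],
                   [t, 0.5 - t]])
    return (C * pi).sum() + lam * np.abs(pi - np.outer(a, b)).sum()

for lam in (0.0, 0.25, 0.5):
    ts = np.linspace(0.0, 0.5, 501)
    vals = np.array([objective(t, lam) for t in ts])
    k = int(vals.argmin())
    print(f"lam={lam}: optimal t={ts[k]:.3f}, T_lam={vals[k]:.3f}")
```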
Example 11 
(Gaussian distributions). Consider $\mu = \mathcal{N}(0,1)$ and $\nu = \mathcal{N}(1,1)$ on $\mathbb{R}$, discretized with $n = 100$ points. Figure 1 (conceptual) shows that:
  • For small $\lambda$, the optimal coupling approximates the monotone rearrangement (sparse, in the sense of a Monge map).
  • For large $\lambda$, the coupling approaches the product measure (dense but independent).
  • TV regularization preserves more sparsity than entropic regularization at comparable regularization strength; a comparison sketch follows this list.
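To make the comparison concrete, the sketch below counts near-zero entries of the entropic and TV-regularized couplings for discretized Gaussians. The grid size, $\varepsilon$, $\lambda$, and threshold are illustrative choices, and it reuses the hypothetical `tv_regularized_ot` helper from the Section 7 sketch.

```python
# Illustrative sparsity comparison for discretized Gaussians.
import numpy as np

n = 30                                 # coarser than n = 100 to keep the LP small
x = np.linspace(-4.0, 5.0, n)
a = np.exp(-0.5 * x**2)
a /= a.sum()                           # discretized N(0, 1)
b = np.exp(-0.5 * (x - 1.0)**2)
b /= b.sum()                           # discretized N(1, 1)
C = np.abs(x[:, None] - x[None, :])

def sinkhorn(C, a, b, eps, iters=2000):
    """Entropic OT coupling via Sinkhorn iterations; always fully dense."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

pi_entropic = sinkhorn(C, a, b, eps=0.1)
_, pi_tv = tv_regularized_ot(C, a, b, lam=0.1)  # LP sketch from Section 7

threshold = 1e-8
print("entropic coupling: entries above threshold =", (pi_entropic > threshold).sum())
print("TV coupling:       entries above threshold =", (pi_tv > threshold).sum())
```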

9. Discussion and Future Work

We have introduced and analyzed a total variation regularized optimal transport problem. The main theoretical contributions are:
  • Existence of optimal transport plans (Theorem 2)
  • Proof that $T_\lambda$ defines a metric for $\lambda > 0$ (Corollary 6)
  • Characterization of the limiting behavior as $\lambda \to 0$ and $\lambda \to \infty$ (Theorems 7 and 8)
  • A discrete linear programming formulation for computation
Compared to entropic regularization, the TV-regularized formulation offers distinct advantages for applications requiring sparse couplings but comes at higher computational cost due to the linear programming structure.

Future Research Directions

  • Algorithmic development: Specialized algorithms (primal-dual methods, network flow formulations, cutting-plane methods) could improve computational efficiency beyond generic LP solvers.
  • Metric geometry: Study geodesics, curvature, and other geometric properties of $(\mathcal{P}(\Omega), T_\lambda)$. Does $T_\lambda$ induce a geodesic metric space?
  • Statistical properties: Analyze sample complexity, consistency, and robustness of estimators based on T λ . The TV component may provide robustness to outliers.
  • Applications: Explore specific domains where sparse couplings are beneficial: graph matching, feature selection, multi-marginal problems with sparsity constraints.
  • Extensions: Generalize to unbalanced optimal transport (allowing mass variation), dynamic formulations (Benamou-Brenier), and cost functions beyond $d(x,y)$.
  • Hybrid approaches: Combine TV and entropic regularization to balance computational efficiency with sparsity promotion.

References

  1. L. V. Kantorovich. On the translocation of masses. Doklady Akademii Nauk SSSR, 37(7–8):227–229, 1942.
  2. C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
  3. C. Villani. Optimal Transport: Old and New. Springer, 2009.
  4. M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.
  5. G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5–6):355–607, 2019.
  6. L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.
  7. V. I. Bogachev. Measure Theory. Springer, 2007.
  8. A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
Figure 1. Conceptual illustration: optimal couplings for Gaussian distributions under different regularization. Left: small $\lambda$ (near-Monge). Center: TV regularization ($\lambda = 1$). Right: entropic regularization ($\varepsilon = 0.1$).