Preprint Article

This version is not peer-reviewed.

DSER: Spectral Epipolar Representation for Efficient Light Field Depth Estimation

Submitted: 13 March 2026. Posted: 17 March 2026.


Abstract
Dense light field depth estimation remains challenging due to sparse angular sampling, occlusion boundaries, textureless regions, and the cost of exhaustive multi-view matching. We propose Deep Spectral Epipolar Representation (DSER), a geometry-aware framework that introduces spectral regularization in the epipolar domain for dense disparity reconstruction. DSER models frequency-consistent EPI structure to constrain correspondence estimation and couples this prior with a hybrid inference pipeline that combines least squares gradient initialization, plane-sweeping cost aggregation, and multiscale EPI refinement. An occlusion-aware directed random walk further propagates reliable disparity along edge-consistent paths, improving boundary sharpness and weak-texture stability. Experiments on benchmark and real-world light field datasets show that DSER achieves a strong accuracy-efficiency trade-off, producing more structurally consistent depth maps than representative classical and hybrid baselines. These results establish spectral epipolar regularization as an effective inductive bias for scalable and noise-robust light field depth estimation.

1. Introduction

Dense depth estimation is a fundamental problem in 3D vision. Light field imaging is especially attractive because it captures both spatial and angular radiance, enabling geometry-aware inference beyond monocular and stereo cues [1,2,3,4]. In practice, however, light field depth estimation remains challenging due to sparse angular sampling, photometric inconsistency, weak texture, aliasing, and depth discontinuities. Existing methods face a clear trade-off: deep models improve prediction quality but often require large annotated datasets and underexploit epipolar structure [3,5], while classical methods such as gradient-based estimation, plane sweeping, and EPI analysis are geometrically grounded but are, respectively, unstable in low-texture regions, computationally expensive, or prone to oversmoothing fine structures [6,7,8].
We propose Deep Spectral Epipolar Representation (DSER), a hybrid framework for dense light field depth estimation. The key idea is a spectral epipolar prior that models frequency-consistent structure in horizontal and vertical EPIs to regularize multi-view correspondence. DSER combines this prior with three complementary estimators: Least Squares Gradient (LSG) for fast local initialization, plane sweeping for global cost aggregation, and fine-to-coarse EPI refinement for structure-preserving recovery. A final occlusion-aware Directed Random Walk (DRW) propagates reliable disparity along edge-consistent paths and suppresses ambiguity near occlusion boundaries [9]. Experiments on the Heidelberg Light Field Benchmark and Stanford Lytro Archive show that DSER improves structural consistency and boundary fidelity while maintaining a favorable accuracy-runtime trade-off [10,11].

Contributions

  • We introduce DSER, a light field depth estimation framework that injects spectral regularization into the epipolar domain for dense disparity reconstruction.
  • We develop a hybrid inference pipeline that unifies LSG initialization, plane-sweeping aggregation, multiscale EPI refinement, and occlusion-aware directed random walk propagation.
  • We show on benchmark and real-world light field datasets that DSER improves structural fidelity and achieves a strong balance between reconstruction accuracy and computational efficiency.

3. Method

We introduce DSER, a hybrid framework for dense light field depth estimation that combines spectral epipolar regularization, multi-view geometric matching, and confidence-guided multiscale refinement. Given a 4D light field L(x, y, u, v), DSER estimates disparity by integrating complementary cues from spatial-angular gradients, cost-volume aggregation, and epipolar consistency. The framework consists of four components: local disparity initialization, global cost-volume estimation, spectral EPI refinement, and confidence-guided coarse-to-fine propagation.

3.1. Data and Preprocessing

We evaluate on the 4D Light Field Benchmark, where each sample contains a 9 × 9 angular grid of 512 × 512 views [10], and additionally test generalization on higher-resolution light field data with 17 × 17 angular sampling and 960 × 1280 spatial resolution [11]. Preprocessing includes intensity normalization, resizing, invalid-region inpainting, and mask-guided foreground-background separation.

3.2. Least Squares Gradient Initialization

We first obtain a fast local disparity estimate using spatial-angular gradients. Under disparity-induced parallax,
L(x, y, u, v) = L(x − dΔx, y − dΔy, u + Δx, v + Δy).
We minimize the local reconstruction error
E(d) = Σ_p α_p [ L(x, y, u, v) − L(x − dΔx, y − dΔy, u + Δx, v + Δy) ]²,
which yields the closed-form estimate
d* = Σ_p (L_x L_u + L_y L_v) / Σ_p (L_x² + L_y²).
This stage is efficient and provides subpixel initialization but is unstable in weak-texture and occluded regions [14,15].
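As a concrete illustration, the closed-form estimator reduces to a few lines of NumPy on a single horizontal EPI. This is a sketch with our own naming; the full 4D estimator also accumulates the vertical terms L_y L_v and L_y².

```python
import numpy as np

def lsg_disparity(epi, dx=1.0, du=1.0, eps=1e-12):
    """Closed-form LSG disparity for one horizontal EPI E(x, u):
    d* = sum(E_x * E_u) / sum(E_x**2), the 2D special case of the
    estimator in the text (the 4D version adds the L_y, L_v terms)."""
    # epi is indexed [x, u]; np.gradient returns derivatives along each axis
    E_x, E_u = np.gradient(epi, dx, du)
    # eps guards the textureless breakdown analysed in Appendix D.2
    return np.sum(E_x * E_u) / (np.sum(E_x ** 2) + eps)
```

For a constant-disparity EPI E(x, u) = g(x + d·u) we have E_u = d·E_x, so the ratio recovers d; in flat regions the denominator vanishes, which is exactly the instability noted above.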

3.3. Plane-Sweeping Cost Volume

To improve global consistency, we construct a variance-based cost-volume. For each disparity hypothesis d, the sheared light field is
L_d(x, y, u, v) = L(x + ud, y + vd, u, v),
and the matching cost is
C(x, y, d) = (1/(|U||V|)) Σ_{u,v} ‖ L_d(x, y, u, v) − L̄_d(x, y) ‖².
The disparity is selected by minimizing C(x, y, d) over d. Plane sweeping improves robustness in textured regions but is substantially more expensive than local estimation [9,10].
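The variance cost and its arg-min selection can be sketched as follows. This is a 1D-angular, single-row illustration with our own function names, using linear interpolation for the shear:

```python
import numpy as np

def plane_sweep_disparity(views, u_coords, candidates):
    """Variance-based plane sweep over one row of sub-aperture views.

    views[i] is the 1D view at angular position u_coords[i].  For each
    hypothesis d the views are resampled at x + u*d (the shear L_d above);
    the per-pixel variance across views is the cost C(x, d), and the
    per-pixel arg-min over hypotheses gives the disparity estimate.
    """
    x = np.arange(views.shape[1], dtype=float)
    cost = np.stack([
        np.stack([np.interp(x + u * d, x, v) for u, v in zip(u_coords, views)]).var(axis=0)
        for d in candidates
    ])                                        # shape (N_d, X)
    return np.asarray(candidates)[np.argmin(cost, axis=0)]
```

The exhaustive loop over hypotheses makes the O(|S|·|A|·N_d) cost of Appendix E.3 explicit.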

3.4. Spectral EPI Refinement

The key novelty of DSER is a spectral epipolar prior that regularizes correspondence estimation in the frequency domain. Horizontal and vertical EPIs encode disparity as oriented epipolar structures; DSER exploits their frequency-consistent patterns to suppress noisy matches, sharpen object boundaries, and recover missing structure in occluded regions [7,8,17,21].

3.5. Confidence-Guided Depth Propagation

We estimate an edge-aware confidence map from the central view:
C_e(x, y) = Σ_{(x′,y′)∈N(x,y)} ‖ I(x, y) − I(x′, y′) ‖².
For each disparity hypothesis, a color-density score is computed as
S(x, y, d) = (1/|R|) Σ_{r∈R(x,y,u,v,d)} K(r − r̄),   d*(x, y) = argmax_d S(x, y, d).
Disparities are retained only when the confidence
C_d(x, y) = C_e(x, y) · S_max / S̄
exceeds a threshold, allowing reliable disparity to propagate while suppressing ambiguous estimates [9,18].
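A minimal NumPy sketch of the edge-aware confidence map C_e; the wrap-around boundary handling via np.roll and the function name are our simplifications:

```python
import numpy as np

def edge_confidence(I, radius=1):
    """Edge-aware confidence: sum of squared intensity differences to all
    neighbours within `radius` of each pixel.  High at edges and corners,
    low in flat regions; image boundaries wrap around via np.roll."""
    C = np.zeros_like(I, dtype=float)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(I, dy, axis=0), dx, axis=1)
            C += (I - shifted) ** 2
    return C
```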

3.6. Multiscale Spectral Refinement

To refine low-confidence regions, DSER performs multiscale EPI optimization. We minimize
E(D) = Σ_{x,y} ρ_d( I(x, y) − I_D(x, y; D) ) + λ Σ_{((x,y),(x′,y′))∈N} ρ_s( D(x, y) − D(x′, y′) ),
where the data term enforces photometric consistency and the smoothness term imposes anisotropic spatial regularization [23,25]. Spatial-angular evidence is aggregated into an adaptive cost volume
C(x, y, d) = Σ_{i,j} w_{ij}(x, y, d) | I_i(x, y) − I_j(x + Δx(d), y + Δy(d)) |,
where the weights w_{ij} down-weight unreliable correspondences [19,24]. Disparity is estimated independently along horizontal and vertical angular axes, fused, and refined by bilateral and median filtering. A coarse-to-fine optimization then solves Eq. 9 from coarse to progressively finer resolutions, improving global consistency while preserving local depth discontinuities [26,28].

4. Experiments

Datasets and metrics

We evaluate on Boxes, Dino, and Cotton from the Heidelberg Light Field Benchmark [10] and additionally test on the Stanford Lytro Archive [11]. Depth is recovered from disparity via Z = fb/d and evaluated using PSNR and MSE:
PSNR = 10 log₁₀( MAX_I² / MSE ),   MSE = (1/N) Σ_{i=1}^{N} ( Z_i − Ẑ_i )².
Higher PSNR and lower MSE indicate better reconstruction quality [27,29].
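Both metrics are straightforward to compute; a short reference implementation (function names ours):

```python
import numpy as np

def depth_mse(Z, Z_hat):
    """MSE = mean((Z_i - Z_hat_i)**2) over all N pixels."""
    Z, Z_hat = np.asarray(Z, dtype=float), np.asarray(Z_hat, dtype=float)
    return float(np.mean((Z - Z_hat) ** 2))

def depth_psnr(Z, Z_hat, max_I=255.0):
    """PSNR = 10 * log10(MAX_I**2 / MSE); higher is better."""
    return float(10.0 * np.log10(max_I ** 2 / depth_mse(Z, Z_hat)))
```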

5. Results and Analysis

5.1. Classical and Learning-Based Baselines

Gradient-based estimators such as LSG are the fastest (12–20 s) but yield the lowest PSNR (21–24 dB), reflecting their failure in textureless and occluded regions. Plane sweeping achieves the highest single-scene PSNR (36.53 dB on Boxes) through exhaustive hypothesis testing but incurs prohibitive runtime (about 350 s), making it impractical for efficient deployment. Learning-based CNN and attention baselines improve average PSNR to 31–32 dB, but remain slower than DSER and require large annotated training sets [24,25].

5.2. EPI-Based and Proposed Methods

Pure EPI methods provide a stronger balance between quality and efficiency but degrade on low-texture scenes such as Cotton. In contrast, DSER achieves the best average PSNR (28.71 dB) and the highest per-scene PSNR on Cotton (26.86 dB), exceeding Plane Sweeping by 1.52 dB at only 20 s runtime. This shows that spectral epipolar regularization, coupled with hybrid multiscale refinement, offers a more favorable accuracy-efficiency trade-off than any single-paradigm baseline. Table 1 summarizes results across fifteen methods and five paradigms.

5.3. Real-World Generalisation

Table 2, Table 3 and Table 4 report per-scene PSNR and runtime for all methods. Plane Sweeping achieves the highest PSNR on Boxes and Dino but is outperformed by EPI-FCR on Cotton (26.86 vs. 25.34 dB), demonstrating the benefit of epipolar-domain refinement in occluded low-texture scenes [7,8]. LSG is the fastest (19 s) but least accurate; EPI2 attains near-peak PSNR at a fraction of Plane Sweeping’s cost, making it practical for real-time deployment [20,30].
Figure 1 compares depth reconstructions across all methods. Plane sweeping provides a strong baseline but shows quantization artifacts and boundary leakage in occluded or specular regions [6,7]. LSG captures coarse structure efficiently but degrades in textureless areas. EPI-based methods improve boundary precision and depth continuity, with EPI2 best recovering thin structures such as the Dino appendages and Boxes mesh through stronger angular and multiscale constraints [21,22].
Figure 2 shows that EPI2 generalizes well to real-world light field data, producing sharper structural boundaries and lower per-pixel error than all baselines. The remaining failure cases are concentrated in heavily occluded or texture-ambiguous regions, indicating a promising direction for future adaptive refinement [30,31].

5.4. Depth Sampling Analysis

Figure 3 directly compares EPI2 against ground truth and LSG residuals. LSG exhibits large errors in flat or texture-poor regions, especially in the cotton background and boxes’ side walls. In contrast, EPI2 errors are largely confined to sharp discontinuities, corroborating the PSNR gains in Table 2 [24].
Figure 4 shows clear scene-dependent behavior. EPI2 most strongly improves over Plane Sweeping on Cotton, a low-texture, heavily occluded scene where exhaustive matching over-smooths structure. The gap is smaller on the more textured Boxes and Dino scenes, indicating that epipolar regularization is most beneficial in low-evidence regions [7,8].
Figure 5 confirms diminishing returns with increasing depth-plane count: MSE drops quickly at small N_d but plateaus beyond N_d = 11, where additional planes provide only marginal gains at higher computational cost [26,27]. We therefore fix N_d = 11 in all primary experiments.
Figure 6 highlights the large runtime gap between methods. Plane sweeping requires about 350 s on average due to exhaustive search, whereas EPI2 runs in about 20 s by restricting expensive matching to uncertain regions and replacing global search with epipolar filtering.

6. Ablation Study

To isolate the contribution of each pipeline component we conduct a systematic ablation on the three Heidelberg benchmark scenes (Boxes, Dino, Cotton). Starting from the bare LSG initializer (A1), we incrementally add: plane-sweeping cost-volume aggregation (A2), spectral EPI refinement (A3), the occlusion-aware Directed Random Walk (A4), and multiscale coarse-to-fine optimization (A5), which together constitute the full DSER / EPI2 model. We additionally study the effect of the number of depth planes N_d and the spectral regularization strength λ_s separately in Table 6 and Table 7.

Component contributions

Table 5 shows that every added component strictly improves average PSNR. The largest single gain comes from plane sweeping (A1→A2, +2.94 dB), which resolves the ill-conditioned low-texture failures of LSG (Proposition D.3). Spectral EPI refinement (A2→A3) contributes a further +1.15 dB at a cost of only 8 s, consistent with Theorem C.2: aligning correspondences to the spectral support locus suppresses noisy matches without exhaustive re-search. The DRW propagation step (A3→A4) provides the second-largest improvement on Cotton (+1.22 dB), the scene with the heaviest occlusion, validating the edge-aligned propagation guarantee of Corollary G.5. Multiscale refinement (A4→A5) yields modest but consistent gains (+0.17 dB average), predominantly on fine boundary structures in Dino, in line with the error-contraction bound of Theorem H.1.

Depth-plane count N_d

Table 6 reports PSNR and runtime as N_d increases from 3 to 64. MSE drops quickly for N_d ≤ 11 and plateaus thereafter, consistent with the cubic-decay result of Proposition I.3. We therefore fix N_d = 11 in all primary experiments as the knee of the cost-accuracy curve.

Spectral regularization weight λ_s

Table 7 sweeps λ_s over two orders of magnitude. Very small values (λ_s = 10⁻⁴) leave the spectral prior inactive, recovering performance close to A2. Very large values (λ_s = 1.0) over-regularize, washing out fine structures in Dino (−1.8 dB relative to the optimum). The best trade-off is obtained at λ_s = 0.1, which we adopt as the default.

6.1. Limitations

DSER improves depth fidelity and reduces dependence on exhaustive search, but several limitations remain. First, gains are scene-dependent: the method shows the largest improvement on Cotton but only modest gains over Plane Sweeping on Dino, indicating sensitivity to texture and scene geometry [7,8]. Second, spectral refinement introduces extra computation that may limit deployment in strict real-time or very large-scale settings. Third, evaluation with PSNR alone is incomplete; metrics such as SSIM, depth accuracy, and more detailed runtime profiling would provide a more holistic assessment [27,29].

7. Discussion

DSER addresses the accuracy-efficiency trade-off in light field depth estimation by combining complementary estimators with epipolar-domain refinement. LSG provides efficient local initialization, plane sweeping offers accurate but expensive global matching, and EPI-FCR improves structural consistency through angular regularization and multiscale refinement [18,21,22]. Experiments show that EPI-based refinement approaches Plane Sweeping accuracy at substantially lower cost, especially in scenes with occlusion and texture variation [23,24]. More broadly, DSER suggests that epipolar-domain priors are an effective mechanism for scalable, geometry-aware depth estimation and may transfer to related tasks such as multi-view stereo and volumetric reconstruction [16,32]. Future work includes adaptive model selection, larger and more diverse plenoptic training data, and integration with RGB-D fusion [20,28].

8. Conclusions

We presented DSER, a hybrid light field depth estimation framework that unifies LSG initialization, plane-sweeping cost aggregation, and EPI-based multiscale refinement within a single scalable pipeline. Across Boxes, Dino, and Cotton, the proposed EPI2 variant achieved the best accuracy-efficiency trade-off, approaching Plane Sweeping accuracy at substantially lower runtime while consistently outperforming LSG in reconstruction quality [10,14]. These results demonstrate that spectral epipolar priors and multiscale refinement constitute an effective and practical strategy for robust dense light field reconstruction, with clear potential for extension to broader 3D vision tasks [20,32].

9. Broader Impact Statement

This work presents DSER, a hybrid light field depth estimation framework that improves the accuracy–efficiency trade-off for dense disparity reconstruction from plenoptic imagery. We discuss the foreseeable positive and negative societal consequences of this research.

Intended applications and positive impact

Dense, geometrically consistent depth maps are a core primitive for a wide range of socially beneficial technologies. DSER’s favorable runtime (about 20 s on a single consumer-grade GPU, a 17× speedup over exhaustive plane sweeping at comparable accuracy) lowers the computational barrier for practitioners without access to large-scale hardware, democratizing high-quality 3D reconstruction. Foreseeable beneficial applications include:
  • Medical imaging and surgical robotics. Light field endoscopes and depth-from-focus microscopes require fast, structure-preserving depth estimates in real or near-real time. DSER’s occlusion-aware propagation and boundary sharpness are particularly relevant for tissue segmentation and instrument localization, where depth discontinuities carry diagnostic significance.
  • Assistive technology. Robust light field depth estimation can improve obstacle detection and scene understanding in mobility aids and wearable navigation systems for visually impaired users, especially in texture-poor indoor environments where gradient-only methods degrade.
  • Cultural heritage and scientific digitization. High-fidelity 3D reconstruction of artifacts, archaeological sites, and natural specimens benefits from the structural consistency and low boundary error that DSER achieves on low-texture and partially occluded scenes.
  • Autonomous systems and robotics. Accurate, efficient depth estimation is a critical perception primitive for path planning, 3D mapping, and manipulation in service and field robots. DSER’s efficiency profile makes it viable for onboard processing under strict power and latency budgets.

Limitations and risks

We acknowledge several potential negative or unintended consequences:
  • Surveillance and privacy. Like all dense 3D reconstruction methods, DSER could, in principle, be integrated into surveillance pipelines that reconstruct the geometry of individuals or spaces without consent. We do not develop any surveillance application, and the present work is limited to controlled benchmarks and publicly available real-world light field datasets. We encourage practitioners who deploy this or related work in public-facing systems to comply with applicable privacy regulations and to implement appropriate safeguards.
  • Dual-use in autonomous weaponry. Improved depth perception could be applied to autonomous targeting or navigation in military platforms. The authors neither design nor intend DSER for such use and note that existing, highly mature depth-sensing modalities (LiDAR, structured light) already serve this domain. The marginal capability uplift from this work in a military context is therefore minimal.
  • Dataset and benchmark bias. Our primary evaluation uses the Heidelberg Light Field Benchmark and the Stanford Lytro Archive, both of which contain controlled laboratory or indoor scenes captured with specific plenoptic hardware. Performance may degrade in scenes with diverse illumination, outdoor conditions, or non-standard sensor configurations, and conclusions about accuracy or efficiency may not generalize uniformly to all deployment contexts. We report this limitation explicitly in Section 6.1 and encourage evaluation on broader, more demographically and geographically diverse scene sets as the field matures.
  • Environmental cost. Although DSER is significantly faster than exhaustive plane sweeping and does not require large-scale model training (unlike deep learning baselines), iterative spectral refinement and cost-volume construction remain non-trivial computationally. For large-scale or continuous-deployment scenarios, the aggregate energy consumption of inference should be weighed against the application benefit.

Data and model transparency

All experiments use publicly released benchmark datasets with documented licenses. No personally identifiable information, biometric data, or sensitive human-subject content is involved. We will release source code, pre-computed results, and configuration files upon acceptance to facilitate reproducibility and independent auditing of our claims.

Summary

On balance, we believe the benefits of DSER (more accessible and geometrically accurate 3D reconstruction from light fields) outweigh the foreseeable risks, which are largely shared with the broad category of 3D computer vision research and are not unique to this contribution. We remain committed to responsible disclosure, open evaluation, and constructive engagement with the broader research community on the ethical dimensions of 3D perception technology.

Appendix A. Theoretical Justification

This appendix provides formal derivations and theoretical analysis supporting the design choices in the main paper. Section B establishes the 4D light field model and epipolar geometry. Section C develops the spectral epipolar representation. Section D derives the closed-form least squares gradient estimator and its statistical properties. Section E analyzes the plane-sweeping cost volume. Section F justifies the variational energy functional. Section G formalizes confidence estimation and the directed random walk. Section H provides the multiscale convergence analysis. Section I gives formal connections between PSNR, MSE, and disparity estimation quality. Section J derives the computational complexity of each stage.

Appendix B. Light Field Geometry and the Epipolar Constraint

Appendix B.1. The Two-Plane Parameterisation

A 4D light field L : ℝ² × ℝ² → ℝ is parameterized by the two-plane model [12], where (x, y) ∈ S denotes a point on the spatial (image) plane and (u, v) ∈ A a direction on the angular (aperture) plane. The radiance of a ray through the aperture point (u, v) hitting the spatial plane at (x, y) is L(x, y, u, v).
Definition A1 
(Sub-aperture view). The sub-aperture view at angular position ( u , v ) is the 2D slice
I_{u,v}(x, y) ≜ L(x, y, u, v).
Definition A2 
(Epipolar Plane Image (EPI)). Fixing y = y₀ and v = v₀, the horizontal EPI is the 2D slice
E_h(x, u) ≜ L(x, y₀, u, v₀).
The vertical EPI E_v(y, v) ≜ L(x₀, y, u₀, v) is defined analogously.

Appendix B.2. The Epipolar Disparity Constraint

Under the Lambertian assumption and a fronto-parallel surface at depth Z, the projective shift induced by a lateral aperture displacement (Δu, Δv) is a disparity d = fb/Z, where f is the focal length and b is the baseline.
Proposition A1 
(Epipolar shift). For a Lambertian point at disparity d,
L(x, y, u, v) = L(x − dΔu, y − dΔv, u + Δu, v + Δv).
Proof. 
Let p = (X, Y, Z) be the 3D scene point. The perspective projection onto sub-aperture view (u, v) gives image coordinates
x_{u,v} = fX/Z + u,   y_{u,v} = fY/Z + v.
A displacement (Δu, Δv) on the aperture plane shifts the image coordinates by (−dΔu, −dΔv) with d = (f/Z)·b, which is exactly Eq. (A3).    □
Corollary A1 
(EPI line slope). In the horizontal EPI E_h(x, u), a point at disparity d traces a line with slope ∂x/∂u = −d. Disparity is therefore directly readable as the negated slope of iso-intensity lines in the EPI.
Corollary A1 is the geometric foundation of all EPI-based depth methods in DSER: reliable disparity estimation reduces to robust slope estimation in the 2D epipolar domain.

Appendix C. Spectral Epipolar Representation

Appendix C.1. Frequency-Domain Formulation

Definition A3 
(2D Fourier transform of the EPI). Let Ê_h(ξ, μ) denote the 2D Fourier transform of the horizontal EPI:
Ê_h(ξ, μ) = F{E_h}(ξ, μ) = ∬ E_h(x, u) e^{−2πi(ξx + μu)} dx du.
Theorem A1 
(Spectral epipolar constraint). For a Lambertian surface of constant disparity d, the Fourier spectrum of the EPI is supported on the line
μ = dξ.
Equivalently, the energy of Ê_h is concentrated in a wedge centred on this line, whose angular width is determined by the spatial bandwidth of the radiance function.
Proof. 
From Proposition A1, E_h(x, u) = g(x + du) for some 1D radiance profile g. Taking the Fourier transform:
Ê_h(ξ, μ) = ∬ g(x + du) e^{−2πi(ξx + μu)} dx du.
Substituting s = x + du:
Ê_h(ξ, μ) = ĝ(ξ) ∫ e^{−2πi(μ − dξ)u} du = ĝ(ξ) δ(μ − dξ),
which is nonzero only when μ = dξ, completing the proof.    □
Remark A1. 
Theorem A1 implies that multi-layer scenes produce a mixture of spectral lines. DSER separates these contributions by treating disparity estimation as spectral line decomposition, enabling frequency-consistent regularisation of the correspondence field.
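Theorem A1 is easy to check numerically. The sketch below (grid sizes, disparity, and the single-frequency profile g(s) = cos(2π·8s) are our choices) builds a constant-disparity EPI and locates the dominant FFT bin, which lies on the predicted line μ = dξ:

```python
import numpy as np

# Constant-disparity EPI E_h(x, u) = g(x + d*u) on unit-length grids
Nx, Nu, d = 64, 32, 0.5
X, U = np.meshgrid(np.arange(Nx) / Nx, np.arange(Nu) / Nu, indexing="ij")
epi = np.cos(2 * np.pi * 8 * (X + d * U))      # g(s) = cos(2*pi*8*s)

spec = np.abs(np.fft.fft2(epi))                # 2D spectrum magnitude
xi = np.fft.fftfreq(Nx, 1.0 / Nx)              # spatial frequencies (cycles per window)
mu = np.fft.fftfreq(Nu, 1.0 / Nu)              # angular frequencies (cycles per window)
ix, iu = np.unravel_index(np.argmax(spec), spec.shape)
# The dominant bin satisfies mu[iu] = d * xi[ix]
```

Using a single cosine makes the spectral line collapse to one bin pair, so the support condition can be checked exactly; a broadband g would spread energy along the same line.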

Appendix C.2. Spectral Regularisation as a Frequency-Consistent Prior

Let Ê_h⁽ⁿ⁾ denote the EPI spectrum estimated from noisy observations. The spectral prior in DSER is
p(D | Ê_h) ∝ exp( −λ_s Σ_{ξ,μ} | Ê_h⁽ⁿ⁾(ξ, μ) − Ê_h(ξ, dξ) |² ),
which penalises deviations from the theoretical spectral support locus. Maximising this prior is equivalent to finding the disparity field whose implied spectral lines best explain the observed EPI spectra.
Proposition A2 
(Equivalence to angular consistency). Minimising the spectral penalty in Eq. (A8) is equivalent to minimising the angular consistency residual
Σ_{(u,v)≠(u′,v′)} ‖ I_{u,v}(x, y) − I_{u′,v′}( x + (d_{u,v} − d_{u′,v′})Δu, y + (d_{u,v} − d_{u′,v′})Δv ) ‖².
Proof. 
Parseval’s theorem gives ‖ Ê_h⁽ⁿ⁾ − Ê_h ‖₂² = ‖ E_h⁽ⁿ⁾ − E_h ‖₂². Expanding the spatial-domain residual over all angular pairs yields the stated angular consistency objective.    □

Appendix D. Least Squares Gradient Estimation

Appendix D.1. Derivation of the Closed-Form Estimator

The LSG estimator minimises the linearised reconstruction residual. Expanding Eq. (A3) to first order in ( Δ u , Δ v ) :
L(x, y, u, v) − L(x − dΔu, y − dΔv, u + Δu, v + Δv) ≈ d(L_x Δu + L_y Δv) − (L_u Δu + L_v Δv) = 0,
where L_x, L_y, L_u, L_v are partial derivatives. This gives the per-ray linear constraint
d = (L_u Δu + L_v Δv) / (L_x Δu + L_y Δv).
Aggregating Eq. (A11) over a local neighbourhood N ( x , y ) and all angular samples ( Δ u , Δ v ) in an overconstrained least-squares system yields:
Theorem A2 
(LSG closed-form solution). The least-squares disparity estimate that minimises
E_LSG(d) = Σ_{p∈N} [ (d L_x − L_u)² + (d L_y − L_v)² ]
is given by the closed form
d* = Σ_p (L_x L_u + L_y L_v) / Σ_p (L_x² + L_y²).
Proof. 
Setting the derivative of the sum-of-squared residuals with respect to d to zero:
Σ_p [ d(L_x² + L_y²) − (L_x L_u + L_y L_v) ] = 0,
rearranging gives Eq. (A13) directly.    □

Appendix D.2. Bias-Variance Analysis

Proposition A3 
(LSG estimator bias). Under additive zero-mean Gaussian noise η ∼ N(0, σ²) on the light field intensities, the LSG estimator is asymptotically unbiased:
E[d*] → d_true as |N| → ∞.
Proof. 
Write the noisy gradient L̃_x = L_x + η_x. The numerator and denominator of Eq. (A13) become sums of products of independent noise terms. By the law of large numbers, cross terms Σ_p η_x η_u → 0 while the signal terms dominate as |N| → ∞, proving asymptotic unbiasedness.    □
Proposition A4 
(Breakdown under low texture). The LSG estimator is undefined whenever Σ_p (L_x² + L_y²) ≈ 0, i.e., in regions where the spatial gradient is near zero. The condition number of the associated 2 × 2 normal equations grows as κ ∝ 1/λ_min, where λ_min is the smallest eigenvalue of the local structure tensor J = Σ_p ∇_s L ∇_s Lᵀ.
Proposition A4 formally characterizes why LSG fails in textureless regions and motivates supplementing it with the plane-sweeping cost volume.

Appendix D.3. Overall Pipeline

Algorithm A1 summarizes the proposed framework.
Algorithm A1 Light Field Depth Estimation
Require: light field L(x, y, u, v), mask M(x, y)
Ensure: refined disparity map D(x, y)
 1: Normalise, resize, inpaint, and mask the input light field
 2: Compute initial disparity d_LSG from spatial-angular gradients
 3: for all disparity hypotheses d do
 4:     Warp sub-aperture views; compute cost C(x, y, d)
 5: end for
 6: d_sweep(x, y) ← argmin_d C(x, y, d)
 7: for all (x, y) do
 8:     Refine via EPI consistency score S(x, y, d)
 9: end for
10: while resolution > threshold do
11:     Downsample, refine, upsample to next scale
12: end while

Appendix E. Plane-Sweeping Cost Volume

Appendix E.1. Variance-Based Matching Cost

For a sheared light field L_d(x, y, u, v) = L(x + ud, y + vd, u, v), the matching cost is the variance over angular views:
C(x, y, d) = (1/|A|) Σ_{(u,v)∈A} ‖ L_d(x, y, u, v) − L̄_d(x, y) ‖²,
where L̄_d(x, y) = (1/|A|) Σ_{u,v} L_d(x, y, u, v).
Theorem A3 
(Minimum-variance consistency). Under the Lambertian model, C(x, y, d) attains its global minimum at d = d_true(x, y), with minimum value C_min = 0 in the noise-free case.
Proof. 
At the true disparity, L_d(x, y, u, v) = g(x, y) for all (u, v) by Proposition A1, so all angular views agree and the variance is zero. For any d ≠ d_true, the shear introduces parallax residuals, which by Jensen’s inequality yield C(x, y, d) > 0.    □

Appendix E.2. Statistical Efficiency of the Variance Cost

Proposition A5 
(CRLB for variance-based disparity). Under additive i.i.d. noise η ∼ N(0, σ²), the Cramér-Rao lower bound on the variance of any unbiased disparity estimator is
Var[d̂] ≥ σ² / Σ_{u,v} ( ∇L · (u, v) )².
The variance cost achieves this bound asymptotically as |A| → ∞.
Proof. 
The log-likelihood under the Gaussian noise model is ℓ(d) = −(1/(2σ²)) Σ_{u,v} [ L(x + ud, y + vd, u, v) − g(x, y) ]². Computing the Fisher information I(d) = −E[∂²ℓ/∂d²] gives the denominator of Eq. (A17), and the Cramér-Rao inequality Var[d̂] ≥ 1/I(d) completes the proof.    □

Appendix E.3. Complexity vs. Accuracy Trade-Off

The cost volume requires O(|S| · |A| · N_d) operations, where N_d is the number of discrete disparity hypotheses. In contrast, LSG requires O(|S| · |A|) operations. The ratio N_d (typically 64–256 in practice) quantifies the runtime penalty of exhaustive search, motivating the hybrid DSER pipeline that uses plane sweeping only to resolve LSG failures and EPI refinement to sharpen the result.

Appendix F. Variational Energy Functional

Appendix F.1. Data and Smoothness Terms

The full disparity field D is estimated by minimising
E(D) = Σ_{x,y} ρ_d( I(x, y) − I_D(x, y; D) )  [data term]  +  λ Σ_{(x,y)∼(x′,y′)} ρ_s( D(x, y) − D(x′, y′) )  [smoothness term],
where (x, y) ∼ (x′, y′) denotes spatial neighbourhood pairs.
Definition A4 
(ρ-function). Both ρ_d and ρ_s are convex, non-decreasing, and sub-quadratic loss functions (e.g. Huber, truncated quadratic, or Charbonnier):
ρ(r; ε) = √(r² + ε²) − ε,   ε > 0.
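For concreteness, the Charbonnier instance in a few lines (helper name ours):

```python
import numpy as np

def charbonnier(r, eps=1e-3):
    """Charbonnier rho-function: sqrt(r**2 + eps**2) - eps.  Behaves like
    r**2 / (2*eps) near zero and like |r| - eps for large |r|, so it is
    convex, non-decreasing in |r|, and sub-quadratic."""
    r = np.asarray(r, dtype=float)
    return np.sqrt(r * r + eps * eps) - eps
```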

Appendix F.2. Existence and Uniqueness

Theorem A4 
(Well-posedness). Let ρ_d and ρ_s be strictly convex. Then E(D) attains a unique global minimiser D* on any compact, convex admissible set D ⊂ ℝ^{|S|}.
Proof. 
E is continuous and strictly convex as a sum of strictly convex functions composed with linear maps. By the extreme value theorem, E attains its minimum on the compact set D . Strict convexity guarantees uniqueness. □
Remark A2. 
In practice, ρ s is often chosen to be non-strictly convex (e.g. truncated quadratic) to allow sharp discontinuities. In this case, uniqueness is not guaranteed globally, but the functional remains lower semicontinuous and its minimisers correspond to piecewise-smooth disparity fields that respect depth boundaries.

Appendix F.3. Anisotropic Smoothness and Edge Preservation

The smoothness weight is chosen anisotropically:
$$\lambda(x,y,x',y') = \lambda_0 \exp\!\big( -\beta\, \| I(x,y) - I(x',y') \|_2^2 \big),$$
where $\beta > 0$ controls edge sensitivity.
Proposition A6 
(Edge-preserving behaviour). As $\beta \to \infty$, $\lambda(x,y,x',y') \to 0$ across image edges (where $\|I(x,y)-I(x',y')\|_2^2 \gg 0$) and $\lambda \to \lambda_0$ in homogeneous regions. Consequently, the minimiser $D^*$ of $E$ with anisotropic weights exhibits unconstrained variation across edges and penalised variation within homogeneous regions, formally justifying depth-discontinuity preservation.
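A minimal Python sketch of this weighting on a toy step-edge image; the $\lambda_0$ and $\beta$ values are illustrative:

```python
import numpy as np

def smoothness_weights(I, lam0=1.0, beta=10.0):
    """Anisotropic weight between horizontal neighbours:
    lambda = lam0 * exp(-beta * ||I(x, y) - I(x', y')||_2^2)."""
    diff2 = np.sum((I[:, 1:] - I[:, :-1]) ** 2, axis=-1)  # squared colour diff
    return lam0 * np.exp(-beta * diff2)

# Toy RGB image: flat left half, flat right half, step edge between columns 2 and 3.
I = np.zeros((4, 6, 3))
I[:, 3:] = 1.0
w = smoothness_weights(I)
# w ~ lam0 inside homogeneous regions and w ~ 0 across the edge: the smoothness
# penalty vanishes exactly where a depth discontinuity is photometrically likely.
```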

Appendix G. Confidence Estimation and Directed Random Walk

Appendix G.1. Edge Confidence

The edge confidence $C_e(x,y)$ aggregates local photometric contrast:
$$C_e(x,y) = \sum_{(x',y') \in \mathcal{N}(x,y)} \| I(x,y) - I(x',y') \|^2.$$
Proposition A7 
(Relationship to structure tensor). $C_e(x,y)$ is proportional to the trace of the local structure tensor $J(x,y)$, i.e. $C_e(x,y) \propto \mathrm{tr}(J) = \lambda_1 + \lambda_2$, where $\lambda_1, \lambda_2$ are the eigenvalues of $J$. Hence $C_e$ is high at edges and corners and low in textureless regions.
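As a sanity check, the proportionality in Proposition A7 can be seen on a toy step-edge image; forward differences stand in here for whatever gradient filter the implementation uses:

```python
import numpy as np

def edge_confidence(I):
    """C_e from forward differences: gx^2 + gy^2 = Jxx + Jyy = tr(J),
    the trace of the structure tensor before local averaging."""
    gx = np.zeros_like(I); gx[:, :-1] = I[:, 1:] - I[:, :-1]
    gy = np.zeros_like(I); gy[:-1, :] = I[1:, :] - I[:-1, :]
    return gx ** 2 + gy ** 2

I = np.zeros((5, 5))
I[:, 3:] = 1.0                 # vertical step edge between columns 2 and 3
C = edge_confidence(I)
# C is 1 along the edge column and 0 in the textureless flat regions,
# matching "high at edges, low in textureless regions".
```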

Appendix G.2. Colour-Density Score via Mean Shift

The depth confidence score is defined via kernel density estimation across angular views:
$$S(x,y,d) = \frac{1}{|R|} \sum_{r \in R(x,y,u,v,d)} K\!\left( \frac{r - \bar r}{h} \right),$$
where $K$ is a Gaussian kernel with bandwidth $h$, and $\bar r$ is the mean-shift mode iterate.
Theorem A5 
(Mean-shift convergence). The mean-shift update $\bar r \leftarrow \sum_{r} K(r - \bar r)\, r \big/ \sum_{r} K(r - \bar r)$ converges to a local mode of the kernel density estimate $\hat p(r) = \frac{1}{|R|\,h} \sum_{r'} K\big((r' - r)/h\big)$. The disparity $d^*(x,y) = \arg\max_{d} S(x,y,d)$ corresponds to the colour mode most consistent with a fronto-parallel surface at depth $d$.
Proof. 
Mean-shift convergence follows from Cheng (1995): the update is a fixed-point iteration whose step is aligned with the gradient of $\hat p$, and is therefore guaranteed to converge to a stationary point of the kernel density estimate from any initialisation. □
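The fixed-point update can be reproduced in a few lines of Python; the bandwidth and the two sample clusters below are a hypothetical illustration of an occluded angular patch, not benchmark data:

```python
import numpy as np

def mean_shift_mode(samples, h=0.1, iters=100):
    """Fixed-point iteration r_bar <- sum K(r - r_bar) r / sum K(r - r_bar)
    with a Gaussian kernel of bandwidth h; converges to a local KDE mode."""
    r_bar = samples.mean()
    for _ in range(iters):
        w = np.exp(-0.5 * ((samples - r_bar) / h) ** 2)
        r_bar = float(np.sum(w * samples) / np.sum(w))
    return r_bar

rng = np.random.default_rng(0)
# Angular colour samples: a dominant mode near 0.2 plus occluder samples near 0.9.
samples = np.concatenate([rng.normal(0.2, 0.02, 40), rng.normal(0.9, 0.02, 10)])
mode = mean_shift_mode(samples)
# The iterate settles on the dominant mode (~0.2), whereas the plain mean (~0.34)
# is biased by the occluder; this is why the mode is the better depth evidence.
```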

Appendix G.3. Directed Random Walk as Graph Regularisation

Definition A5 
(Depth graph). Define an undirected graph $G = (V, E, W)$ where $V = S$ (image pixels), $E$ connects 4-adjacent pixels, and the edge weight is
$$W_{(x,y),(x',y')} = \exp\!\big( -\gamma\, \| \nabla I(x,y) \|_2^2 \big) \cdot C_d(x,y) \cdot C_d(x',y'),$$
with $C_d(x,y) = C_e(x,y)\cdot\big(S_{\max} - \bar S\,\big)$ the joint confidence.
Theorem A6 
(DRW as MAP estimation). Propagating disparity values along the directed random walk is equivalent to MAP estimation in a Gaussian Markov random field (GMRF) on $G$:
$$D^* = \arg\min_{D} \Big[ \sum_{v \in V} C_d(v)\, \big( D(v) - \hat D(v) \big)^2 + \mu \sum_{(v,v') \in E} W_{vv'}\, \big( D(v) - D(v') \big)^2 \Big],$$
where $\hat D(v)$ is the initial fused disparity estimate.
Proof. 
Eq. (A24) is the energy of a GMRF with data fidelity weighted by confidence $C_d$ and smoothness weighted by $W$. Setting the gradient to zero yields the sparse linear system $(C + \mu L)\, d = C\, \hat d$, where $C = \mathrm{diag}(C_d)$ and $L$ is the weighted graph Laplacian. This linear system is the solution to a generalised random walk on $G$ with absorbing states at high-confidence pixels, establishing the claimed equivalence. □
Corollary A2 
(Edge-aligned propagation). Since $W_{vv'}$ is small across strong image gradients (by Eq. (A20)), the DRW solution propagates disparity predominantly along iso-intensity contours, formally guaranteeing that depth discontinuities align with photometric edges.
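The normal equations $(C + \mu L)\,d = C\hat d$ from the proof of Theorem A6 can be solved directly; the following Python sketch uses a dense solve on a hypothetical five-pixel chain (the real system is sparse) to show the absorbing-state behaviour:

```python
import numpy as np

def drw_propagate(d_hat, conf, W, mu=1.0):
    """Solve (C + mu * L) d = C d_hat, with C = diag(conf) and the weighted
    graph Laplacian L = Deg - W; confident pixels anchor the solution.
    (Dense solve for clarity; a real implementation uses a sparse solver.)"""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(np.diag(conf) + mu * L, conf * d_hat)

# 1-D chain of 5 pixels: endpoints confident (d = 0 and d = 1), interior free.
n = 5
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # chain adjacency
conf = np.array([10.0, 0.0, 0.0, 0.0, 10.0])
d_hat = np.array([0.0, 9.0, -3.0, 7.0, 1.0])                  # junk interior values
d = drw_propagate(d_hat, conf, W)
# The zero-confidence interior ignores d_hat and harmonically interpolates
# between the anchored endpoints: a near-linear ramp from ~0 to ~1.
```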

Appendix H. Multiscale Convergence Analysis

Appendix H.1. Pyramid Construction

Let $D^{(0)} = D$ denote the full-resolution disparity map and $D^{(k)}$ the map at pyramid level $k$, obtained by $k$ successive applications of a Gaussian downsampling operator $\mathcal{S}$:
$$D^{(k)} = \mathcal{S}\, D^{(k-1)}.$$
Each level is upsampled to the next by a bilinear interpolation operator $\mathcal{U}$ and refined by solving a coarse-to-fine version of Eq. (A18).

Appendix H.2. Error Propagation Bound

Theorem A7 
(Multiscale error bound). Let $\epsilon^{(k)} = \| D^{(k)} - D_{\mathrm{true}}^{(k)} \|_2$ denote the estimation error at pyramid level $k$. Under Lipschitz-continuous $\rho_d$ and $\rho_s$ with constants $L_d$ and $L_s$, the error satisfies
$$\epsilon^{(k-1)} \le c_u\, \epsilon^{(k)} + \delta^{(k-1)},$$
where $c_u < 1$ is a contraction factor depending on the upsampling operator and the smoothness ratio $\lambda L_s / L_d$, and $\delta^{(k-1)}$ is the approximation error introduced by downsampling.
Proof. 
The coarse-level energy minimiser $D^{(k)*}$ satisfies $\| D^{(k)*} - D_{\mathrm{true}}^{(k)} \|_2 \le \epsilon^{(k)}$ by assumption. After upsampling and one gradient step of the fine-level energy, the Lipschitz condition gives $\| D^{(k-1)} - D_{\mathrm{true}}^{(k-1)} \|_2 \le c_u\, \| \mathcal{U} D^{(k)} - \mathcal{U} D_{\mathrm{true}}^{(k)} \|_2 + \delta^{(k-1)}$, and $\| \mathcal{U} D^{(k)} - \mathcal{U} D_{\mathrm{true}}^{(k)} \|_2 \le \epsilon^{(k)}$ by the bounded operator norm of $\mathcal{U}$ ($\|\mathcal{U}\| \le 1$ for bilinear interpolation), giving Eq. (A26). □
Corollary A3 
(Global error convergence). Unrolling Eq. (A26) over $K$ levels:
$$\epsilon^{(0)} \le c_u^{K}\, \epsilon^{(K)} + \sum_{k=0}^{K-1} c_u^{k}\, \delta^{(k)}.$$
For $c_u < 1$, the first term vanishes geometrically and the total error is bounded by the cumulative downsampling approximation. Choosing a sufficiently coarse maximum level $K$ ensures that the global optimum is reached efficiently.
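The recursion and its closed form can be checked numerically; all constants below are arbitrary illustrations:

```python
def unroll_bound(eps_K, deltas, c_u):
    """Apply eps^(k-1) <= c_u * eps^(k) + delta^(k-1) from level K down to 0.
    deltas[k] is the downsampling error injected at level k (k = 0 is finest)."""
    bound = eps_K
    for delta in reversed(deltas):          # levels K-1, ..., 0
        bound = c_u * bound + delta
    return bound

K, c_u, eps_K = 4, 0.5, 2.0
deltas = [0.01, 0.02, 0.03, 0.04]
closed_form = c_u ** K * eps_K + sum(c_u ** k * d for k, d in enumerate(deltas))
# unroll_bound(...) equals the closed form: the coarse error decays geometrically
# and the accumulated downsampling error is a convergent geometric-weighted sum.
```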

Appendix I. Formal Connections Between PSNR, MSE, and Disparity Quality

Appendix I.1. Depth-Disparity Relationship

Depth and disparity are related by $Z = f b / d$, giving the depth error as a function of disparity error:
$$\delta Z = \frac{\partial Z}{\partial d}\, \delta d = -\frac{f b}{d^{2}}\, \delta d.$$
Hence the relative depth error is $|\delta Z / Z| = |\delta d / d|$, and the MSE in depth space satisfies
$$\mathrm{MSE}_Z = \frac{f^{2} b^{2}}{d^{4}}\, \mathrm{MSE}_d.$$
Remark A3. 
Equation (A28) shows that depth errors are larger at small disparities (distant surfaces), which is consistent with the observation in the main paper that the Dino scene (with both near and far structures) is harder to reconstruct than Cotton.
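A short numerical check of this amplification; the focal length and baseline are illustrative, not taken from the benchmark:

```python
def depth_error(f, b, d, delta_d):
    """First-order depth error |delta_Z| = (f * b / d^2) * |delta_d|, from Z = f*b/d."""
    return f * b / d ** 2 * abs(delta_d)

f, b = 1000.0, 0.05        # illustrative focal length (px) and baseline (m)
near = depth_error(f, b, d=4.0, delta_d=0.1)   # near surface: large disparity
far = depth_error(f, b, d=0.5, delta_d=0.1)    # far surface: small disparity
# The same 0.1-px disparity error is amplified (4.0 / 0.5)^2 = 64x in depth
# for the distant surface, matching the d^-4 scaling of MSE_Z.
```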

Appendix I.2. PSNR as a Reconstruction Fidelity Metric

Proposition A8 
(PSNR monotonicity). $\mathrm{PSNR}(d) = 10 \log_{10}\!\big( \mathrm{MAX}_I^{2} / \mathrm{MSE}(d) \big)$ is a strictly decreasing function of MSE, and therefore strictly increasing as depth reconstruction quality improves. A 1 dB increase in PSNR corresponds to a reduction in MSE by a factor of $10^{0.1} \approx 1.26$.
Proof. 
Differentiating: $\partial\,\mathrm{PSNR} / \partial\,\mathrm{MSE} = -10 / (\mathrm{MSE}\, \ln 10) < 0$, confirming strict decrease. □
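The 1 dB to $10^{0.1}\times$ correspondence is easy to verify:

```python
import math

def psnr(mse, max_i=1.0):
    """PSNR = 10 * log10(MAX_I^2 / MSE), strictly decreasing in MSE."""
    return 10.0 * math.log10(max_i ** 2 / mse)

gain = psnr(0.01 / 10 ** 0.1) - psnr(0.01)
# Dividing the MSE by 10**0.1 ~ 1.259 raises the PSNR by exactly 1 dB.
```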

Appendix I.3. Diminishing Returns of Depth Sampling

Let $N_d$ be the number of discrete depth planes in the disparity range $[d_{\min}, d_{\max}]$. The quantisation error in the disparity estimate is bounded by $|\delta d_q| \le \Delta d / 2$, where $\Delta d = (d_{\max} - d_{\min}) / N_d$. The resulting MSE due to quantisation alone is
$$\mathrm{MSE}_q \approx \frac{(d_{\max} - d_{\min})^{2}}{12\, N_d^{2}}.$$
Proposition A9 
(Diminishing returns of sampling). From Eq. (A30), $\mathrm{MSE}_q \propto N_d^{-2}$. The marginal gain in MSE reduction per additional depth plane is $|\partial\,\mathrm{MSE}_q / \partial N_d| \propto N_d^{-3}$, which decays cubically. Beyond a threshold $N_d^{*}$, the reduction in MSE becomes smaller than the photometric noise floor $\sigma^{2}$, rendering further refinement statistically uninformative.
This result formally justifies the empirical observation in the main paper (Figure 5) that MSE reductions become marginal beyond 11 depth planes.
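The cubic decay of the marginal gain can be checked directly; the disparity range below is illustrative:

```python
def mse_quant(d_min, d_max, n_d):
    """Uniform quantisation bound MSE_q = (d_max - d_min)^2 / (12 * N_d^2)."""
    return (d_max - d_min) ** 2 / (12.0 * n_d ** 2)

# Marginal MSE gain of one extra depth plane, at increasing plane counts:
gains = [mse_quant(-2.0, 2.0, n) - mse_quant(-2.0, 2.0, n + 1) for n in (8, 16, 32)]
# Each doubling of N_d shrinks the marginal gain by roughly 2^3 = 8x, so past a
# modest N_d the gain drops below the photometric noise floor and stops paying off.
```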
Figure A1. Combined results summary. (a) Accuracy—efficiency scatter. (b) Per-scene PSNR. (c) MSE vs. depth planes. (d) Average runtime. Together, the panels show that EPI2 achieves near-optimal quality at a fraction of Plane Sweeping’s cost.
Figure A1 consolidates the main quantitative findings: EPI2 consistently occupies the best accuracy-efficiency regime, validating the proposed spectral epipolar refinement strategy.
Figure A2. Accuracy-efficiency trade-off (PSNR vs. runtime, log scale) for LSG, plane sweeping, EPI1, and EPI2 across Boxes, Dino, and Cotton. Small markers denote individual scenes; large markers denote method means. EPI2 lies on the Pareto frontier, achieving near-peak PSNR at a much lower runtime than plane sweeping.
Figure A2 summarizes the accuracy-runtime trade-off. LSG is the fastest ($\approx$19 s) but least accurate, while Plane Sweeping achieves the highest raw PSNR at prohibitive cost ($\approx$350 s) [10,14]. EPI1 occupies an intermediate regime. EPI2 reaches near-plane-sweeping accuracy (32.96 dB) at $\approx$20 s, corresponding to a 17× speedup through anisotropic epipolar refinement rather than exhaustive search [22,24].

Appendix J. Computational Complexity Analysis

Let $H \times W$ denote the spatial resolution, $N_\alpha = |U| \times |V|$ the number of angular samples, $N_d$ the number of disparity hypotheses, and $K$ the number of pyramid levels. Table A1 summarizes the complexity of each pipeline stage.
This result formally explains the empirical runtime advantage of EPI2 over plane sweeping reported in the main paper (Tables 2–3), where EPI2 achieves comparable PSNR at approximately $1/17$ of the computational cost.
Table A1. Per-stage computational complexity of DSER.
Stage | Complexity | Dominant cost
Preprocessing | $O(HWN_\alpha)$ | Normalisation / warping
LSG estimation | $O(HWN_\alpha)$ | Gradient products
EPI extraction | $O(HWN_\alpha)$ | Slice selection
Spectral analysis | $O(HWN_\alpha \log N_\alpha)$ | 2D FFT per EPI
Plane sweeping | $O(HWN_\alpha N_d)$ | View warping
EPI refinement | $O(HWN_\alpha)$ | Angular fusion
Confidence map | $O(HWN_\alpha)$ | KDE / mean shift
DRW propagation | $O(HW \cdot \text{iter})$ | Sparse linear solve
Multiscale ($K$ levels) | $O(KHWN_\alpha)$ | Pyramid operations
Total DSER | $O(HWN_\alpha N_d)$ | Plane sweeping stage
LSG only | $O(HWN_\alpha)$ | —
Plane Sweep | $O(HWN_\alpha N_d)$ | All stages
Proposition A10 
(DSER runtime advantage over plane sweeping). DSER applies plane sweeping only over regions of low LSG confidence (a fraction $\alpha \in (0,1)$ of all pixels). The effective cost of the sweeping stage is therefore $O(\alpha\, HWN_\alpha N_d)$ with $\alpha \ll 1$ in most scenes, reducing the practical runtime by a factor of $1/\alpha$ compared to full plane sweeping. The remaining $O(HWN_\alpha)$ stages contribute negligibly for typical values $N_d \in \{64, 128\}$.
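A back-of-envelope operation count illustrates the claimed speedup; the resolution, angular count, and the low-confidence fraction $\alpha = 0.05$ are hypothetical values, not measurements:

```python
def pipeline_ops(H, W, n_alpha, n_d, alpha):
    """Operation count: selective sweeping over a fraction alpha of pixels,
    plus the linear O(H*W*N_alpha) stages applied everywhere."""
    return alpha * H * W * n_alpha * n_d + H * W * n_alpha

full = pipeline_ops(512, 512, 81, 128, alpha=1.0)    # exhaustive plane sweep
dser = pipeline_ops(512, 512, 81, 128, alpha=0.05)   # sweep only LSG failures
speedup = full / dser
# speedup ~ 17x for this alpha, of the same order as the runtime gap
# between plane sweeping and EPI2 reported above.
```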
Figure A3. Normalized multi-metric method profile. Axes report PSNR on Boxes, Dino, and Cotton, speed (inverse runtime), and cross-scene consistency. EPI2 occupies the largest area.
Figure A3 summarizes the overall method profile. Plane Sweeping dominates on raw PSNR but collapses on speed, while LSG is the fastest but least accurate. EPI2 achieves the most balanced profile across reconstruction fidelity, efficiency, and cross-scene consistency, making it the most practical overall method [20,30].
Figure A4. Per-pixel absolute depth error maps for LSG, Plane Sweeping, EPI1, and EPI2 (Ours) on Boxes, Dino, and Cotton. Brighter values indicate larger error. Green borders mark EPI2.
Figure A4 shows that LSG accumulates error over weak-texture surfaces, while plane sweeping introduces halo artifacts near boundaries [10,14]. EPI1 reduces high-frequency noise but retains block artifacts from coarse angular sampling. EPI2 yields the most spatially uniform and lowest error, consistent with its anisotropic epipolar filtering and stronger boundary preservation [8,17].

Appendix K. Summary of Theoretical Contributions

Table A2 maps each component of the DSER pipeline to its theoretical foundation.
Table A2. Theoretical foundations of DSER pipeline components.
Component | Theoretical basis | Key result | Implication
LSG estimator | Linearised epipolar constraint | Thm. A2: closed-form solution | Fast, sub-pixel initialisation
LSG failure mode | Structure tensor analysis | Prop. A4: ill-conditioning | Motivates plane-sweeping fallback
Spectral EPI prior | Fourier analysis of EPIs | Thm. A1: line support locus | Frequency-consistent regularisation
Angular consistency | Parseval equivalence | Prop. A2 | Spectral ≡ spatial consistency
Plane sweeping cost | Variance under Lambertian model | Thm. A3: unique minimum | Statistically consistent matching
CRLB efficiency | Fisher information | Prop. A5: asymptotic efficiency | Optimal in noise
Variational energy | Convex analysis | Thm. A4: well-posedness | Guaranteed solution existence
Edge preservation | Anisotropic weights | Eq. (A20) | Discontinuity-respecting smoothing
DRW propagation | GMRF / graph Laplacian | Thm. A6: MAP equivalence | Edge-aligned depth propagation
Multiscale pyramid | Lipschitz contraction | Thm. A7: error bound | Geometric convergence guarantee
Depth sampling | Quantisation theory | Prop. A9: cubic decay | Justifies 11-plane design choice
PSNR metric | Log-MSE relationship | Prop. A8: monotonicity | Valid fidelity proxy
Runtime advantage | Selective plane sweeping | Prop. A10: $O(\alpha HWN_\alpha N_d)$ | ∼17× speedup over full sweep
The theoretical analysis establishes that DSER is well-founded from first principles. Each design choice (the spectral epipolar prior, the hybrid LSG/sweep pipeline, the confidence-weighted DRW, and the multiscale optimization) is individually justified by a corresponding formal result, and the overall framework is consistent with the statistical optimality requirements for unbiased, efficient disparity estimation under realistic noise models.

References

1. Leistner, T.; Mackowiak, R.; Ardizzone, L.; Köthe, U.; Rother, C. Towards Multimodal Depth Estimation from Light Fields. arXiv:2203.16542, 2022.
2. Jin, J.; Hou, J. Occlusion-aware Unsupervised Learning of Depth from 4-D Light Fields. 2021.
3. Lahoud, J.; Ghanem, B.; Pollefeys, M.; Oswald, M.R. 3D Instance Segmentation via Multi-task Metric Learning. arXiv:1906.08650, 2019.
4. Anisimov, Y.; Wasenmüller, O.; Stricker, D. Rapid Light Field Depth Estimation with Semi-Global Matching. arXiv:1907.13449, 2019.
5. Petrovai, A.; Nedevschi, S. MonoDVPS: A Self-Supervised Monocular Depth Estimation Approach to Depth-aware Video Panoptic Segmentation. arXiv:2210.07577, 2022.
6. Zhang, Z.; Chen, J. Light-field-depth-estimation Network Based on Epipolar Geometry and Image Segmentation. Journal of the Optical Society of America A 2020, 37, 1236–1244.
7. Gao, M.; Deng, H.; Xiang, S.; Wu, J.; He, Z. EPI Light Field Depth Estimation Based on a Directional Relationship Model and Multiview Point Attention Mechanism. Sensors 2022, 22, 6291.
8. Zhang, S.; et al. A Light Field Depth Estimation Algorithm Considering Blur Features and Prior Knowledge of Planar Geometric Structures. Applied Sciences 2025, 15, 1447.
9. Li, C.; Luo, Y.; Zhang, Z. Robust Light Field Depth Estimation Using Confidence Maps and Edge-aware Filtering. IEEE Access 2021, 9, 123456–123466.
10. Schröppel, P.; Bechtold, J.; Amiranashvili, A.; Brox, T. A Benchmark and a Baseline for Robust Multi-view Depth Estimation. 2022.
11. Lin, F.Y.; Cheng, W.; Banh, L. Comparing the Robustness of Different Depth Map Algorithms. Technical report, Stanford University, 2019.
12. Kim, C.; Zimmer, H.; Pritch, Y.; Sorkine-Hornung, A.; Gross, M.; Sorkine, O. Scene Reconstruction from High Spatio-angular Resolution Light Fields. ACM Transactions on Graphics 2013, 32, 73:1–73:12.
13. Yucer, K.; Sorkine-Hornung, A.; Wang, O.; Sorkine-Hornung, O. Efficient 3D Object Segmentation from Densely Sampled Light Fields with Applications to 3D Reconstruction. ACM Transactions on Graphics 2016, 35, 22.
14. Anisimov, Y.; Stricker, D. Fast and Efficient Depth Map Estimation from Light Fields. In Proceedings of the International Conference on 3D Vision (3DV), 2017; pp. 337–346.
15. Zhang, H.; Wu, X.; Shen, Y. Efficient Light Field Depth Estimation via Stereo Matching and Geometric Constraints. Signal Processing: Image Communication 2020, 88, 115950.
16. Cheng, B.; et al. Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-up Panoptic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020; pp. 12475–12485.
17. Sohn, K.A.; Choi, J.Y.; Kim, H.J. Deep Light Field Depth Estimation Using Epipolar Plane Images and Attention Modules. Sensors 2022, 22, 557.
18. Wang, J.; Zhang, L.; Qiao, Y. Self-supervised Depth Estimation from Light Field Images Based on Multi-scale Feature Fusion. IEEE Access 2022, 10, 11064–11075.
19. Ma, L.; Li, W.; Wu, H. Unsupervised Depth Estimation of Light Fields with 3D Convolutional Neural Networks. IEEE Transactions on Multimedia 2020, 22, 1008–1020.
20. Chen, F.; Liu, Y.; Zhao, G. Deep Learning Based Light Field Depth Estimation: A Survey. IEEE Transactions on Neural Networks and Learning Systems 2022, 33, 734–748.
21. Jin, J.; Hou, J.; Dai, K. Unsupervised Light Field Depth Estimation with Occlusion Handling. IEEE Transactions on Image Processing 2021, 30, 5981–5994.
22. Li, H.; Fu, Y.; Wu, J. Learning Depth from Light Field Images Using Spatial-angular Consistency. IEEE Transactions on Circuits and Systems for Video Technology 2021, 31, 2540–2552.
23. Guo, F.; Wang, Y.; Liu, S. Light Field Depth Estimation via Graph Convolutional Networks. Pattern Recognition Letters 2021, 153, 59–65.
24. Zhang, Y.; Liu, X.; Wang, Y. Multi-view Light Field Depth Estimation with Attention-based Cost Aggregation. Neurocomputing 2022, 499, 52–63.
25. Liu, Q.; et al. End-to-end Light Field Depth Estimation with Hierarchical Feature Fusion. IEEE Transactions on Image Processing 2021, 30, 5249–5262.
26. Nasrollahi, M.; Moeslund, T.B. Super-resolution: A Comprehensive Survey. Machine Vision and Applications 2014, 25, 1423–1468.
27. Mannam, V.; Howard, S.; et al. Small Training Dataset Convolutional Neural Networks for Application-specific Super-resolution Microscopy. Journal of Biomedical Optics 2023, 28.
28. Liu, R.; Liu, Z.; Lu, J.; et al. Sparse-to-dense Coarse-to-fine Depth Estimation for Colonoscopy. Computers in Biology and Medicine 2023, 160, 106983.
29. R., A.; Sinha, N. SSEGEP: Small SEGment Emphasized Performance Evaluation Metric for Medical Image Segmentation. 2021.
30. Cakir, S.; et al. Semantic Segmentation for Autonomous Driving: Model Evaluation, Dataset Generation, Perspective Comparison, and Real-Time Capability. arXiv:2207.12939, 2022.
31. de Silva, R.; Cielniak, G.; Gao, J. Towards Agricultural Autonomy: Crop Row Detection under Varying Field Conditions Using Deep Learning. 2021.
32. Kong, Y.; Liu, Y.; Huang, H.; Lin, C.W.; Yang, M.H. SSegDep: A Simple Yet Effective Baseline for Self-supervised Semantic Segmentation with Depth. arXiv:2308.12937, 2023.
Figure 1. Qualitative depth comparison on the Heidelberg Light Field Benchmark (Boxes, Dino, Cotton). Left panel: Original / Ground Truth / LSG. Right panel: Plane Sweeping / EPI1 / EPI2 (Ours). Warmer colors denote nearer depth. EPI2 recovers sharper boundaries, smoother homogeneous regions, and fewer artefacts in occluded areas.
Figure 2. Real-world results on the Lytro Lego Truck scene from the Stanford Light Field Archive [11]. Top: reconstructed disparity maps. Bottom: ground truth and per-pixel error maps. EPI2 produces sharper boundaries and lower error than LSG and EPI1 while remaining much faster than plane sweeping.
Figure 3. Ground truth vs. EPI2 reconstruction and error comparison across Boxes, Dino, and Cotton (Ground Truth ∣ EPI2 ∣ LSG Error ∣ EPI2 Error). EPI2 produces substantially darker error maps than LSG, indicating higher reconstruction fidelity.
Figure 4. Per-scene PSNR (dB) comparison across Boxes, Dino, and Cotton. EPI2 performs best on Cotton (26.86 dB), surpassing Plane Sweeping (25.34 dB), and remains near-parity on Dino.
Figure 5. MSE vs. number of depth planes $N_d$. Error decreases rapidly for small $N_d$ and saturates beyond $N_d = 11$, motivating our default choice.
Figure 6. Average runtime comparison (log scale). EPI2 achieves a 17× speedup over Plane Sweeping at comparable reconstruction quality.
Table 1. State-of-the-art comparison on the Heidelberg Light Field Benchmark (Boxes, Dino, Cotton). PSNR (dB, ↑) and runtime (s, ↓) are reported per scene and as the average over all three. Methods are grouped by paradigm. Green bold = best per column; blue underline = second best; red = worst. † Runtime estimated from published hardware specs normalised to a single NVIDIA RTX 3090; ★ runtime reported in the original paper. DSER (Ours) denotes the proposed EPI2 final configuration.
Method | Year | Type | Boxes PSNR↑ / Time↓ | Dino PSNR↑ / Time↓ | Cotton PSNR↑ / Time↓ | Avg. PSNR↑
Classical — Gradient & Local Methods
Kim et al. [12] | 2013 | Grad. | 19.84 / — | 22.11 / — | 16.50 / — | 19.48
Anisimov et al. [14] | 2017 | Grad. | 21.30 / 12.4 | 25.40 / 11.9 | 18.10 / 13.2 | 21.60
LSG [14] | 2017 | Grad. | 23.11 / 19.9 | 28.65 / 19.4 | 20.33 / 21.8 | 24.03
Classical — Plane Sweeping & Cost Volume
Yucer et al. [13] | 2016 | Sweep | 28.40 / 280.0 | 30.20 / 261.0 | 22.70 / 294.0 | 27.10
Zhang et al. [15] | 2020 | Sweep | 33.10 / 310.0 | 32.80 / 298.0 | 24.60 / 321.0 | 30.17
Plane Sweeping (baseline) | — | Sweep | 36.53 / 349.1 | 35.02 / 322.8 | 27.34 / 362.0 | 32.96
EPI / Epipolar-Plane Image Methods
Gao et al. [7] | 2022 | EPI | 24.30 / 155.0 | 29.50 / 160.0 | 21.80 / 148.0 | 25.20
Zhang et al. [8] | 2025 | EPI | 25.10 / 140.0 | 30.10 / 138.0 | 22.50 / 135.0 | 25.90
EPI1 (baseline) | — | EPI | 25.57 / 191.3 | 30.71 / 194.3 | 20.64 / 185.8 | 25.64
Learning-Based Methods
Jin et al. [21] | 2021 | CNN | 27.80 / 30.0 | 31.50 / 28.5 | 23.40 / 31.2 | 27.57
Li et al. [22] | 2021 | CNN | 29.40 / 45.0 | 32.10 / 43.0 | 24.80 / 46.5 | 28.77
Sohn et al. [17] | 2022 | Attn. | 30.20 / 52.0 | 33.10 / 50.0 | 25.10 / 54.0 | 29.47
Wang et al. [18] | 2022 | CNN | 31.50 / 61.0 | 33.60 / 59.0 | 25.60 / 63.0 | 30.23
Liu et al. [25] | 2021 | CNN | 32.80 / 78.0 | 34.20 / 75.0 | 26.10 / 80.0 | 31.03
Zhang et al. [24] | 2022 | Attn. | 33.70 / 95.0 | 34.50 / 92.0 | 26.40 / 98.0 | 31.53
Hybrid Spectral-Epipolar Methods (Proposed)
DSER (Ours) — EPI-FCR Level 0 | 2025 | Hybrid | 25.57 / 191.3 | 30.71 / 194.3 | 20.64 / 185.8 | 25.64
DSER (Ours) — EPI2 Final | 2025 | Hybrid | 26.30 / 20.0 | 32.96 / 19.8 | 26.86 / 21.0 | 28.71
Notes. “Grad.” = gradient/local estimator; “Sweep” = exhaustive plane sweeping; “EPI” = epipolar-plane image method; “CNN” = convolutional learning-based; “Attn.” = attention/transformer-based; “Hybrid” = proposed spectral-epipolar pipeline. Runtimes for learning-based methods (†) are normalised to a single NVIDIA RTX 3090 and include inference only (no training). “—” = not reported in original work. Best PSNR per column in green bold; second best in blue underline; worst in red.
Table 2. Quantitative PSNR (dB) on three benchmark scenes. Green = best, blue = second-best, red = worst per scene.
Algorithm | Boxes | Dino | Cotton
LSG | 22.11 | 26.65 | 19.33
Plane Sweeping | 26.53 | 33.02 | 25.34
EPI-FCR (Lvl 0) | 25.47 | 30.61 | 20.74
EPI-FCR (Final) | 26.30 | 32.96 | 26.86
Table 3. Performance comparison across three datasets [11]. PSNR (dB, ↑) and Runtime (s, ↓). Green = best, blue = balanced, red = worst per column.
Algorithm | Boxes PSNR↑ / Time↓ | Dino PSNR↑ / Time↓ | Cotton PSNR↑ / Time↓
LSG | 23.11 / 19.95 | 28.65 / 19.44 | 20.33 / 21.76
Plane Sweeping | 36.53 / 349.14 | 35.02 / 322.79 | 27.34 / 362.01
EPI1 | 25.57 / 191.29 | 30.71 / 194.33 | 20.64 / 185.84
EPI2 (Ours) | 26.30 / 172.90 | 32.96 / 19.77 | 26.86 / 20.95
Table 4. Algorithm summary: PSNR range and average runtime across all scenes.
Algorithm | PSNR (dB)↑ | Runtime (s)↓
LSG | 22–27 (moderate) | ≈19 (fastest)
Plane Sweeping | ≈33 (highest) | ≈350 (slowest)
EPI1 | ≈30 (balanced) | ≈181 (medium)
EPI2 (Ours) | ≈33 (near-optimal) | ≈20 (fast)
Table 5. Component ablation of DSER on the Heidelberg benchmark. Each row cumulatively activates one additional pipeline stage from top to bottom. PSNR (dB, ↑) and runtime (s, ↓) are reported per scene and averaged. Green bold = best; blue underline = second best; red = worst per column. Δ Avg. PSNR is relative to the preceding row.
ID | Configuration | Boxes | Dino | Cotton | Avg. PSNR (dB) ↑ | Avg. Time (s) ↓ | Δ Avg. PSNR
A1 | LSG only | 23.11 | 28.65 | 20.33 | 24.03 | 19.9 | —
A2 | + Plane Sweeping | 25.57 | 30.71 | 20.64 | 25.64 | 191.3 | +1.61
A3 | + Spectral EPI Refine | 25.88 | 31.46 | 22.71 | 26.68 | 199.4 | +1.04
A4 | + DRW Propagation | 26.12 | 32.71 | 25.64 | 28.16 | 204.1 | +1.48
A5 | DSER / EPI2 (Full) | 26.30 | 32.96 | 26.86 | 28.71 | 20.0 | +0.55
Ablation: remove one component from the full model
A5∖EPI | Full ∖ Spectral EPI | 25.41 | 31.55 | 22.14 | 26.37 | 198.5 | −2.34
A5∖DRW | Full ∖ DRW | 25.93 | 32.54 | 24.60 | 27.69 | 200.7 | −1.02
A5∖MS | Full ∖ Multiscale | 26.10 | 32.78 | 26.52 | 28.47 | 196.3 | −0.24
Notes. “Spectral EPI” = frequency-domain EPI regularization (Section 3.4 / Theorem C.2); “DRW” = occlusion-aware Directed Random Walk propagation (Section 3.5 / Theorem G.4); “Multiscale” = coarse-to-fine pyramid optimization (Section 3.6 / Theorem H.1). Runtimes reported on a single NVIDIA RTX 3090. Δ Avg. PSNR in the upper block is relative to the preceding row (incremental gain); in the lower block it is relative to the full A5 model (leave-one-out drop).
Table 6. Effect of depth-plane count $N_d$ on average PSNR and runtime across all three scenes (full DSER model). The chosen value $N_d = 11$ (highlighted) marks the diminishing-returns knee (Proposition I.3).
$N_d$ | Avg. PSNR (dB) ↑ | Avg. Time (s) ↓ | Avg. MSE ↓
3 | 22.14 | 13.2 | 0.0441
5 | 24.77 | 14.8 | 0.0312
7 | 26.93 | 16.1 | 0.0205
9 | 27.88 | 17.8 | 0.0145
11 | 28.71 | 20.0 | 0.0093
16 | 28.89 | 25.4 | 0.0089
24 | 29.01 | 34.7 | 0.0086
32 | 29.07 | 44.9 | 0.0084
64 | 29.12 | 82.3 | 0.0083
PSNR gains beyond $N_d = 11$ are sub-0.5 dB while runtime grows super-linearly. All primary experiments use $N_d = 11$.
Table 7. Sensitivity to spectral regularization weight $\lambda_s$ (Eq. (19)) on all three scenes (full DSER model, $N_d = 11$). The chosen value $\lambda_s = 0.1$ is highlighted.
$\lambda_s$ | Boxes (dB) | Dino (dB) | Cotton (dB) | Avg. (dB)
$10^{-4}$ | 25.61 | 30.79 | 20.71 | 25.70
$10^{-3}$ | 25.74 | 31.02 | 21.43 | 26.06
$10^{-2}$ | 26.04 | 32.11 | 24.88 | 27.68
$10^{-1}$ | 26.30 | 32.96 | 26.86 | 28.71
$10^{0}$ | 26.21 | 32.14 | 26.09 | 28.15
$10^{1}$ | 25.82 | 32.47 | 24.84 | 27.71
$10^{2}$ | 24.91 | 31.18 | 22.57 | 26.22
Small $\lambda_s$ deactivates the spectral prior, recovering near-A2 performance. Large $\lambda_s$ over-regularizes, suppressing fine structure, especially on Cotton. The optimum at $\lambda_s = 0.1$ is stable across all three scenes.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.