Preprint
Article

This version is not peer-reviewed.

Multi-Modal Remote Sensing Image Registration Using Curvature Scale Space Contour Point Features

Submitted: 02 April 2026
Posted: 02 April 2026


Abstract
Multi-modal remote sensing image registration is a challenging task due to differences in resolution, viewpoint, and intensity, which often lead to inaccurate and time-consuming results with existing algorithms. To address these issues, we propose an algorithm based on Curvature Scale Space Contour Point Features (CSSCPF). Our approach combines multi-scale Sobel edge detection, dominant-direction determination, an improved curvature scale space corner detector, a new gradient definition, and enhanced SIFT descriptors. Test results on publicly available datasets show that our algorithm outperforms existing methods in overall performance. Our code will be released at https://github.com/JianhuaZhu-IR.

1. Introduction

Multi-modal remote sensing image registration has become a key area in computer vision and image processing, as single-modal images are insufficient for comprehensive applications [1]. Captured by various sensors with distinct imaging principles, multi-modal images offer unique information that enhances earth observation. The goal is to accurately align images from different times, cameras, or perspectives [2].
Multi-modal remote sensing image registration algorithms are mainly classified into region-based and feature-based methods [3]. Region-based algorithms use image intensity information and optimization to align regions by minimizing a cost function. Feature-based algorithms, which better handle geometric distortions, extract and match features, then select a transformation model based on geometric relationships [4]. Techniques like Scale-Invariant Feature Transform (SIFT) [5] address key multi-modal registration challenges.
Gao et al. [6] proposed the Multi-Scale Partial Intensity Invariant Feature Descriptor (MS-PIIFD) algorithm to address multi-source remote sensing image differences; it shows good registration accuracy but is time-consuming, produces fewer matching point pairs, and sometimes fails to complete registration. Gao et al. [7] developed the Multi-Scale Histogram of Local Main Orientations (MS-HLMO) algorithm, which handles intensity, scale, and rotation differences but is also very slow. Li et al. [8] introduced the Radiation-Invariant Feature Transform (RIFT) algorithm to address Non-linear Radiometric Distortion (NRD), offering reliable feature matching but with lower accuracy and longer processing time than MS-PIIFD. Zhu et al. [9] found that existing registration algorithms for infrared and visible images of power equipment suffer from low accuracy and long processing times, and proposed a registration algorithm based on Large-Gap Fracture Contours (LGFC). This algorithm works well for power equipment images; however, its applicability to multi-modal remote sensing images is limited, because such images generally lack rich LGFC information, which leads to poor registration performance.
Öfverstedt et al. [10] found that intensity-based image registration relies on similarity metrics, which are crucial for robustness and accuracy. They proposed an affine registration framework combining intensity and spatial information using symmetric, non-intensity interpolation. Jiang et al. [11] addressed challenges in multi-modal power equipment images with their Contour Angle Orientation (CAO) method and Coarse-to-Fine (C2F) algorithm (CAO-C2F). Although effective, it still faces issues with accuracy and time consumption.
In response to the above analysis, this paper develops a multi-modal remote sensing image registration algorithm based on Curvature Scale Space (CSS) point features. The main contributions of this paper are as follows.
(1)
The number of feature points is crucial for image registration quality. To ensure an adequate number of feature points, we have enhanced the CSS corner detection algorithm. Since CSS extracts feature points based on contours, and the number of edges correlates with contour quantity, maintaining a sufficient number of edges is vital. To address this, we propose a multi-scale Sobel edge detection algorithm.
(2)
Given the significant differences in intensity, resolution, and viewpoint between multi-modal remote sensing image pairs, we propose a new gradient definition and a method to determine the dominant direction of feature points for rotation invariance. This gradient definition is applied to SIFT descriptors, with segmented normalization to enhance the similarity between feature point descriptors.
The following sections are structured as follows: Section 2 details the proposed registering algorithm. Section 3 conducts an extensive experimental analysis of the algorithm. Lastly, Section 4 provides the conclusion, outlining the key findings and contributions.

2. Proposed Image Registering Algorithm

The study begins by applying the proposed multi-scale Sobel edge detection algorithm to extract image edges, followed by contour tracking to outline these edges. The improved CSS algorithm detects and extracts feature points, with their dominant direction determined for rotation invariance. Enhanced SIFT descriptors are then used to characterize the feature points for image registration, ensuring alignment between the two images.

2.1. Edge Detection

The gradient components in the x and y directions can be calculated using the convolution operation as follows:
$$G_x = S * I(x, y), \qquad G_y = S^{\top} * I(x, y), \tag{1}$$
$$S = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix},$$
where * denotes the convolution operation, S represents the convolution kernel and I ( x , y ) denotes the multi-modal remote sensing image.
The gradient magnitude image G can then be obtained using the formula:
$$G = \sqrt{G_x^2 + G_y^2}. \tag{2}$$
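As a concrete illustration of Equations (1)-(2), the following pure-Python sketch convolves a tiny grayscale image with the Sobel kernel (using the transpose for the vertical gradient, which is our assumption since the paper lists a single kernel) and takes the gradient magnitude. Function names are ours, not the paper's.

```python
import math

# Sobel kernel S from Equation (1); its transpose gives the vertical gradient.
S = [[-1, 0, 1],
     [-2, 0, 2],
     [-1, 0, 1]]

def conv3x3(img, k):
    """Valid 3x3 convolution (kernel flipped, as in true convolution)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += k[2 - i][2 - j] * img[y + i][x + j]
            out[y][x] = acc
    return out

def sobel_magnitude(img):
    St = [list(row) for row in zip(*S)]   # transpose of S for G_y
    gx = conv3x3(img, S)
    gy = conv3x3(img, St)
    # Equation (2): G = sqrt(Gx^2 + Gy^2), element-wise.
    return [[math.hypot(a, b) for a, b in zip(ra, rb)]
            for ra, rb in zip(gx, gy)]

# A vertical step edge: the gradient magnitude responds along the edge.
img = [[0, 0, 1, 1]] * 4
G = sobel_magnitude(img)   # 2x2 map of magnitude 4.0 at the step
```

Here `conv3x3` uses "valid" boundaries for brevity; a real implementation would pad the image so the output keeps the input size.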
We introduce a multi-scale Sobel edge detection approach. The input color image is first converted to grayscale. A Gaussian filter with scale parameter $\sigma_0$ is then applied to the grayscale image, yielding the first layer of the scale space, denoted $L_1(x, y, \sigma_0)$. Next, $L_2(x, y, \sigma_1)$ is derived by convolving a Gaussian filter with scale $\sigma_1$ over $L_1(x, y, \sigma_0)$, and $L_3(x, y, \sigma_2)$ by convolving a Gaussian filter with scale $\sigma_2$ over $L_2(x, y, \sigma_1)$. This iterative process continues until a set of $N$ layers of identical dimensions is established, constituting the scale space. The Gaussian filter function and the relationship between consecutive layers within the scale space are given by:
$$g_k(x, y, \sigma_{k-1}) = \frac{1}{2\pi\sigma_{k-1}^2}\, e^{-\frac{x^2 + y^2}{2\sigma_{k-1}^2}}, \qquad L_k(x, y, \sigma_{k-1}) = g_k(x, y, \sigma_{k-1}) * I(x, y), \quad k = 1, 2, \ldots, N, \tag{3}$$
where $L_k(x, y, \sigma_{k-1})$ signifies the image obtained through Gaussian filtering.
The selection of an appropriate value for $\sigma_{k-1}$ is crucial. Increasing $\sigma_{k-1}$ leads to more blurring, resulting in the loss of fine details, whereas a smaller $\sigma_{k-1}$ produces a clearer image with more detail.
To ensure the required relationship $\sigma_{k-1} = f(k)$ with $f'(k) < 0$, we can assume a function of the form $f(k) = \sigma_0 k^u$. Its first derivative, $f'(k) = u \sigma_0 k^{u-1} < 0$, implies $u < 0$. Consequently, the following relationship between $\sigma_{k-1}$ and $\sigma_0$ is established:
$$\sigma_{k-1} = \sigma_0 k^u, \qquad u < 0, \quad k = 1, 2, \ldots, N. \tag{4}$$
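The scale schedule of Equation (4) can be sketched directly. The values $u = -\tfrac{1}{2}$ and $\sigma_0 = 1.6$ below follow the parameter settings reported in Section 3.1, with $u$ taken negative as the derivation requires.

```python
# Equation (4): sigma_{k-1} = sigma_0 * k**u with u < 0, so each
# additional layer is smoothed by a progressively smaller increment.
sigma0, u, N = 1.6, -0.5, 5
sigmas = [sigma0 * k ** u for k in range(1, N + 1)]

# The schedule is strictly decreasing: later layers add less extra blur.
assert all(a > b for a, b in zip(sigmas, sigmas[1:]))
```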
After constructing the scale space $L_k(x, y, \sigma_{k-1})$, $k = 1, 2, \ldots, N$, we apply the single-scale Sobel edge detector to each layer, resulting in $N$ edge images $E_k(x, y, \sigma_{k-1})$. We then perform an overlay operation on these edge images to obtain a combined edge image $E(x, y)$. Finally, $E(x, y)$ is normalized using Equation (5):
$$\tilde{E}(x, y) = \frac{E(x, y) - a}{b - a}, \tag{5}$$
where $\tilde{E}(x, y)$ represents the normalized image, $a$ and $b$ denote the minimum and maximum values in $E(x, y)$, respectively, and the subtraction of $a$ and division by $b - a$ are applied element-wise.
Considering the binary nature of the edge image, the following technique can be employed:
$$\dot{E}(x, y) = \begin{cases} 1, & \text{if } \tilde{E}(x, y) > T_1, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$
where E ˙ ( x , y ) represents the pixel value at coordinate ( x , y ) , and T 1 is a threshold value. By performing the aforementioned operations, we obtain the resulting edge image E ˙ ( x , y ) .
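The overlay, normalization (Equation (5)), and thresholding (Equation (6)) steps can be sketched as follows; the three tiny edge maps are illustrative stand-ins for the per-scale Sobel outputs $E_k$, not real detector output.

```python
T1 = 0.5
E_layers = [
    [[0, 1, 1, 0]],   # E_1 (illustrative)
    [[0, 1, 0, 0]],   # E_2
    [[0, 1, 1, 0]],   # E_3
]

# Overlay: element-wise sum of the N edge images.
E = [[sum(layer[y][x] for layer in E_layers)
      for x in range(len(E_layers[0][0]))]
     for y in range(len(E_layers[0]))]

# Equation (5): min-max normalization (E - a) / (b - a).
flat = [v for row in E for v in row]
a, b = min(flat), max(flat)
E_norm = [[(v - a) / (b - a) for v in row] for row in E]

# Equation (6): binarize with threshold T1.
E_dot = [[1 if v > T1 else 0 for v in row] for row in E_norm]
```

Pixels voted for by most scales survive the threshold, which is what gives the multi-scale detector its improved edge continuity.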

2.2. Contour Extraction

Contours offer more continuity than edges, providing geometric details like area and perimeter. Before corner detection, using a contour tracking method is recommended to extract contour sets from multi-modal remote sensing images.
$$L_c = \left\{ l_i \,\middle|\, l_i = \left( (x_1, y_1)_i, (x_2, y_2)_i, \ldots, (x_n, y_n)_i \right) \right\}, \tag{7}$$
where $i = 1, 2, \ldots, M$, $L_c$ denotes the set of extracted contours, $l_i$ denotes the $i$-th contour as a sequence of boundary points, and $M$ is the total number of contours in $L_c$.

2.3. Feature Point Detection

The classical corner detection algorithm, CSS [12], computes the absolute curvature of identified corners along the contour at a low scale. It uses the local maximum of the absolute curvature as initial candidate corner points and employs an adaptive algorithm to eliminate rounded corners, thus reducing false corner detections.
Each contour extracted from a multi-modal remote sensing image can be treated as a curve, represented by an arc length parameter v:
$$\beta(v) = [\, x(v), \, y(v) \,], \tag{8}$$
where x ( v ) and y ( v ) denote the sequence of horizontal and vertical coordinates constituting each curve.
The curve β ( v ) is smoothed using a Gaussian function with a scale parameter α to derive the smooth filter curve β α ( v ) .
$$\beta_\alpha(v) = [\, x(v) * g(v, \alpha), \; y(v) * g(v, \alpha) \,] = [\, X(v, \alpha), \, Y(v, \alpha) \,], \qquad g(v, \alpha) = \frac{1}{\sqrt{2\pi}\,\alpha}\, e^{-\frac{v^2}{2\alpha^2}}, \tag{9}$$
where $*$ denotes the convolution operation and $g(v, \alpha)$ is a one-dimensional Gaussian function.
The absolute curvature of each point on the curve can be determined by:
$$K(v, \alpha) = \frac{\left| X_v(v, \alpha)\, Y_{vv}(v, \alpha) - X_{vv}(v, \alpha)\, Y_v(v, \alpha) \right|}{\left[ X_v(v, \alpha)^2 + Y_v(v, \alpha)^2 \right]^{3/2}}, \tag{10}$$
where $X_v(v, \alpha) = x(v) * g_v(v, \alpha)$, $Y_v(v, \alpha) = y(v) * g_v(v, \alpha)$, $X_{vv}(v, \alpha) = x(v) * g_{vv}(v, \alpha)$, and $Y_{vv}(v, \alpha) = y(v) * g_{vv}(v, \alpha)$.
By utilizing Equations (8)-(10), we can calculate the absolute curvature for each point along the contour. Subsequently, the CSS [12] selects initial corner candidates at a lower scale based on the maximum value of the absolute curvature K ( v , α ) .
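A minimal sketch of this curvature computation (Equations (9)-(10)): the contour coordinate sequences are convolved with sampled first and second Gaussian derivatives, and the absolute curvature is evaluated pointwise. The kernel discretization is our own standard choice, not code from the paper; on a circle of radius 20 the estimate stays close to the true curvature 1/20.

```python
import math

# Sampled first/second derivatives of a 1-D Gaussian (g_v, g_vv).
def gauss_deriv_kernels(alpha, radius):
    g1, g2 = [], []
    for v in range(-radius, radius + 1):
        g = math.exp(-v * v / (2 * alpha * alpha)) / (math.sqrt(2 * math.pi) * alpha)
        g1.append(-v / (alpha * alpha) * g)                   # g_v
        g2.append((v * v / alpha ** 4 - 1 / alpha ** 2) * g)  # g_vv
    return g1, g2

def circ_conv(seq, kernel):
    """Circular convolution, suitable for closed contours."""
    n, r = len(seq), len(kernel) // 2
    return [sum(kernel[r + m] * seq[(i - m) % n] for m in range(-r, r + 1))
            for i in range(n)]

def curvature(xs, ys, alpha=3.0):
    # Equation (10): |Xv*Yvv - Xvv*Yv| / (Xv^2 + Yv^2)^(3/2), pointwise.
    g1, g2 = gauss_deriv_kernels(alpha, radius=int(4 * alpha))
    Xv, Yv = circ_conv(xs, g1), circ_conv(ys, g1)
    Xvv, Yvv = circ_conv(xs, g2), circ_conv(ys, g2)
    return [abs(xv * yvv - xvv * yv) / (xv * xv + yv * yv) ** 1.5
            for xv, yv, xvv, yvv in zip(Xv, Yv, Xvv, Yvv)]

# Sanity check on a circle of radius 20 sampled at roughly unit arc length:
# the estimated curvature should stay close to 1/20 everywhere.
R, n = 20.0, 126
xs = [R * math.cos(2 * math.pi * i / n) for i in range(n)]
ys = [R * math.sin(2 * math.pi * i / n) for i in range(n)]
K = curvature(xs, ys)
```

Corner candidates would then be local maxima of `K` along the contour, as the CSS detector prescribes.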
Since the CSS is based on contours, it cannot detect a sufficient number of feature points if the image lacks rich edge detail information. Therefore, we have improved CSS to increase the detection of corner information. The possible relationships between the line segments connecting adjacent feature points on each contour l i ( i = 1 , 2 , , M ) are illustrated in Figure 1.
As shown in Figure 1(a), for the case where the line segment connecting feature points P L i ( x L i , y L i ) and P R i ( x R i , y R i ) is a straight line, we use the midpoint P M i ( x M i , y M i ) of the line segment P L i P R i as a feature point. The midpoint P M i can be calculated using the following formula:
$$P_M^i = (x_M^i, y_M^i) = \tfrac{1}{2}\left( P_L^i + P_R^i \right) = \tfrac{1}{2}\left( x_L^i + x_R^i, \; y_L^i + y_R^i \right). \tag{11}$$
As shown in Figure 1(b) and Figure 1(c), the connecting curve between feature points $P_L^i$ and $P_R^i$ is a curved segment, exhibiting a concave or convex shape, respectively. This segment is continuous between $P_L^i$ and $P_R^i$. By Lagrange's mean value theorem, as long as the function is continuous on the closed interval and differentiable on the open interval, there must be at least one point $P_M^i$ at which Equation (12) holds:
$$F(x_M^i) = \frac{y_L^i - y_R^i}{x_L^i - x_R^i}, \tag{12}$$
where $F(\cdot)$ is the derivative of the function $f(\cdot)$ that represents the curved segment between feature points $P_L^i$ and $P_R^i$.
From Figure 1(b) and Figure 1(c), it can be observed that the straight line segment P L i P R i connecting P L i and P R i is parallel to the tangent L t i at point P M i , and there is only one such tangent. Therefore, the tangent point P M i is used as a feature point, and P M i is calculated as shown in Equation (13).
$$P_M^i = (x_M^i, y_M^i) = \left( F^{-1}(x_m^i), \; f\!\left( F^{-1}(x_m^i) \right) \right), \tag{13}$$
where $F^{-1}(\cdot)$ is the inverse function of $F(\cdot)$.
After the above processing, the number of feature points extracted by the refined CSS corner detection algorithm is $\sum_{i=1}^{M} (2N_i - 1)$, where $N_i$ represents the number of feature points detected by the original CSS algorithm in the $i$-th contour, and $M$ denotes the total number of contours extracted from the image.
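The straight-segment case (Equation (11)) can be sketched as a midpoint-insertion pass over adjacent corners; the curved cases handled by Equation (13) are omitted here. With $N_i$ corners on a contour, this yields $2N_i - 1$ feature points, matching the count above.

```python
# Insert the midpoint P_M = (P_L + P_R) / 2 between each pair of
# adjacent CSS corners on one contour (Equation (11), straight case).
def augment_with_midpoints(corners):
    out = []
    for (xl, yl), (xr, yr) in zip(corners, corners[1:]):
        out.append((xl, yl))
        out.append(((xl + xr) / 2, (yl + yr) / 2))   # P_M
    out.append(corners[-1])
    return out

# 3 original corners -> 2*3 - 1 = 5 feature points.
pts = augment_with_midpoints([(0, 0), (4, 0), (4, 6)])
```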

2.4. Dominant Direction

Assigning a dominant direction linked to the local gradient of each feature point is essential for achieving rotation, translation, and scaling invariance, and it underpins the generation of effective descriptors. We propose a novel method to determine the dominant direction, specifically designed to handle significant intensity variations between multi-modal remote sensing image pairs. For the image $I(x, y)$, the gradient magnitude image $G$ is computed using Equation (2). Equation (1) is then applied to $G$ to obtain its horizontal and vertical derivatives, $G_{1x}$ and $G_{1y}$, and the gradient magnitude $G_1$ is determined as:
$$G_1 = \sqrt{G_{1x}^2 + G_{1y}^2}, \tag{14}$$
where $G_1$ denotes the gradient magnitude image of $G$.
Next, the gradient magnitude image $G_1$ is normalized:
$$\bar{G} = U(G_1), \tag{15}$$
where $U$ denotes the normalization operation, i.e., dividing all values in $G_1$ by its maximum value, and $\bar{G}$ is the resulting normalized gradient magnitude image.
Subsequently, calculate the horizontal and vertical partial derivatives based on the normalized gradient image using Equation (1). This can be expressed as:
$$\begin{bmatrix} \bar{G}_x \\ \bar{G}_y \end{bmatrix} = \begin{bmatrix} \partial_x (\bar{G}) \\ \partial_y (\bar{G}) \end{bmatrix}, \tag{16}$$
where $\partial$ denotes the partial-derivative operation.
Equation (16) represents the new gradient definition proposed in this paper. The weighted squared gradient in the average squared gradient method is defined as follows:
$$\begin{bmatrix} G_{w_r,s,x} \\ G_{w_r,s,y} \end{bmatrix} = \begin{bmatrix} w_r * \left( \bar{G}_x^2 - \bar{G}_y^2 \right) \\ 2\, w_r * \left( \bar{G}_x \bar{G}_y \right) \end{bmatrix}, \tag{17}$$
where w r represents a Gaussian window with variance r. Subsequently, the dominant direction of the feature points can be determined by:
$$D = \left| \angle\!\left( G_{w_r,s,x},\, G_{w_r,s,y} \right) \right|. \tag{18}$$
In Equation (18), the absolute-value operation $|\cdot|$ is applied to confine the dominant-direction range of feature points from $(-\pi, \pi)$ to $(0, \pi)$. The variable $D$ denotes the dominant direction of the given feature point. The angle operator $\angle(X, Y)$ is defined as:
$$\angle(X, Y) = \begin{cases} \arctan\dfrac{Y}{X}, & X < 0,\ Y < 0, \\[6pt] \arctan\dfrac{Y}{X} + \pi, & X \ge 0, \\[6pt] \arctan\dfrac{Y}{X} + 2\pi, & X < 0,\ Y \ge 0. \end{cases} \tag{19}$$
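A sketch of the dominant-direction computation (Equations (17)-(19)), assuming the gradients of a local patch around the feature point stand in for the normalized gradients of Equation (16). The handling of the absolute value follows our reading of the text, and the $X = 0$ case, which Equation (19) leaves undefined, is patched with its one-sided limit.

```python
import math

def angle(X, Y):
    """Branch-corrected arctan of Equation (19)."""
    if X == 0:
        # Not covered by Equation (19); use the limit of the X >= 0 branch.
        return 3 * math.pi / 2 if Y >= 0 else math.pi / 2
    if X < 0 and Y < 0:
        return math.atan(Y / X)
    if X >= 0:
        return math.atan(Y / X) + math.pi
    return math.atan(Y / X) + 2 * math.pi      # X < 0, Y >= 0

def dominant_direction(gx, gy, r=5):
    """Equations (17)-(18): Gaussian-weighted squared gradient of a patch."""
    h, w = len(gx), len(gx[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    gsx = gsy = 0.0
    for y in range(h):
        for x in range(w):
            wgt = math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * r * r))
            gsx += wgt * (gx[y][x] ** 2 - gy[y][x] ** 2)   # w_r * (Gx^2 - Gy^2)
            gsy += wgt * 2 * gx[y][x] * gy[y][x]           # 2 w_r * (Gx Gy)
    return abs(angle(gsx, gsy))

# Patch with a uniform gradient at 60 degrees: the squared-gradient vector
# points at the doubled angle 120 degrees, which Equation (19) maps to 5*pi/3.
theta = math.pi / 3
gx = [[math.cos(theta)] * 7 for _ in range(7)]
gy = [[math.sin(theta)] * 7 for _ in range(7)]
D = dominant_direction(gx, gy)
```

The squared-gradient (doubled-angle) representation is what makes opposite-polarity gradients, common across modalities, vote for the same direction.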

2.4.1. Feature Descriptor Construction

Due to variations in resolution, intensity, and other attributes, the SIFT descriptor struggles with multi-modal images. The traditional SIFT, based on the first-order gradient, fails to effectively address these disparities. To improve this, we propose a novel gradient definition within the SIFT framework, using Equation (16) to calculate a more robust gradient magnitude G 2 , enhancing the similarity of feature descriptors across multi-modal remote sensing images.
$$G_2 = \sqrt{\bar{G}_x^2 + \bar{G}_y^2}. \tag{20}$$
Normalization is then performed, following the approach of Equation (15):
$$\bar{G}_2 = U(G_2). \tag{21}$$
In addition, we applied piecewise normalization to the gradient magnitude of the normalized gradient image G ¯ 2 . Specifically, the gradient magnitudes were sorted in descending order and then normalized accordingly.

2.4.2. Coarse-to-Fine Feature Matching

We use the Best Bin First (BBF) [13] matching algorithm with generated feature descriptors, performing two-sided matching to establish initial correspondences between feature points.
In the context of bilateral matching of two multi-modal remote sensing images, denoted as R ( x , y ) (reference image) and F ( x , y ) (floating image), we identify the nearest p and second-nearest q neighbors related to a point o in R ( x , y ) , as well as the nearest f and second-nearest v neighbors corresponding to a point e in F ( x , y ) . The distance between two points is computed:
$$d_{ij} = \sqrt{ \sum_{k=1}^{128} \left( R_i^k(x, y) - F_j^k(x, y) \right)^2 }. \tag{22}$$
We enforce a distance ratio threshold T 2 , as defined in Equation (23), where satisfying the conditions ensures successful matching between points from both images.
$$\frac{d_{op}}{d_{oq}} \le T_2, \qquad \frac{d_{ef}}{d_{ev}} \le T_2. \tag{23}$$
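The bilateral matching with the ratio test of Equations (22)-(23) can be sketched with a brute-force nearest-neighbour search standing in for BBF; the two-dimensional descriptors below are illustrative (the paper's are 128-dimensional), and function names are ours.

```python
import math

def euclid(a, b):
    # Equation (22) for descriptors of arbitrary length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mutual_matches(R, F, T2=0.95):
    """Keep (i, j) only if the ratio test passes in both directions
    and F[j]'s best match in R is R[i] (bilateral consistency)."""
    pairs = []
    for i, r in enumerate(R):
        dists = [euclid(r, f) for f in F]
        j = min(range(len(F)), key=dists.__getitem__)
        d1, d2 = sorted(dists)[:2]
        if d1 > T2 * d2:                      # forward ratio test fails
            continue
        back = [euclid(F[j], q) for q in R]
        if min(range(len(R)), key=back.__getitem__) != i:
            continue                          # not mutually nearest
        b1, b2 = sorted(back)[:2]
        if b1 > T2 * b2:                      # reverse ratio test fails
            continue
        pairs.append((i, j))
    return pairs

R = [[0, 0], [10, 0]]                 # reference descriptors
F = [[0.1, 0], [9.9, 0], [5, 5]]      # floating descriptors
pairs = mutual_matches(R, F)
```

In practice BBF replaces the linear scans with an approximate k-d tree search, trading a small amount of accuracy for speed on 128-dimensional descriptors.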
We use the line consistency theory [11] to filter out most of the incorrect matches. Finally, to achieve more robust feature point pairs, we apply the Random Sample Consensus (RANSAC) algorithm [14] for fine-tuning the matches.

3. Experimental Results and Analysis

The experimental setup includes an 11th Gen Intel(R) Core(TM) i3-1115G4 processor running at 3.00 GHz with 8 GB of memory, on a 64-bit Windows 11 operating system with an x64-based processor architecture. The software was developed on the MATLAB R2019b platform.

3.1. Data Set and Parameter Setting

The test dataset for our method is sourced from the literature [15], encompassing seven distinct types of multi-modal remote sensing image pairs, such as Cross-Season, Day-Night, Depth-Optical, Map-Optical, and Infrared-Optical. In this study, the parameters are set as follows: $u = -\tfrac{1}{2}$, $\sigma_0 = 1.6$, $N = 5$, $T_1 = 0.5$, $r = 5$, $T_2 = 0.95$.

3.2. Experimental Results of Multi-scale Sobel Edge Detection

Figure 2 shows the original multi-modal remote sensing images, single-scale Sobel edge detection results, and multi-scale Sobel edge detection results. Our multi-scale Sobel approach offers superior edge detection with enhanced continuity, supporting tasks like image registration, contour extraction, and feature point identification. The algorithm’s effectiveness is further demonstrated in Figures 2(b), 2(c), 2(h), 2(i), 2(e), 2(f), 2(n), and 2(o), where more edges are extracted compared to the single-scale method, proving its value for advanced image analysis in multi-modal remote sensing.

3.3. Subjective Evaluation of the Registration Results

We conducted tests and evaluated our algorithm through subjective assessment against several established classical multi-modal image registering techniques, namely LGFC [9], MS-PIIFD [6], CAO-C2F [11], RIFT [8], and MS-HLMO [7].
Figure 3 presents partial test results of our algorithm compared to five advanced multi-modal image registration techniques across twelve image pairs. Each row shows the results for one algorithm. The LGFC algorithm, while correctly matching some feature points, fails when fewer than three pairs are matched, as at least three valid pairs are required to solve the transformation matrix. This failure occurs due to LGFC’s reliance on large gap fracture contours. The MS-PIIFD algorithm struggles with intensity differences, as seen in Figure 3(f), but succeeds in some cases. In contrast, MS-HLMO, CAO-C2F, and RIFT perform better, yielding more visible matching pairs due to their design for handling intensity and resolution differences. While our algorithm doesn’t visually match as many pairs as these methods, it still effectively identifies correct pairs, demonstrating its adaptability to image pairs with intensity differences. Since some subtle differences may be undetected by the human eye, objective evaluations were conducted in the following section.

3.4. Objective Registering Results and Analysis

Table 1 compares various algorithms with the proposed registering algorithm.

3.4.1. Running Time of Different Algorithms

Table 1 shows the average running times of various algorithms. The MS-HLMO algorithm has the longest running time, nearly 4 minutes, followed by CAO-C2F. MS-PIIFD and RIFT perform slightly better than CAO-C2F, yielding relatively favorable results. The LGFC algorithm has the shortest running time, while our algorithm ranks second with an average time of 5.5795 seconds. This is acceptable, though further optimization is possible for improvement.

3.4.2. NCM Point Pairs for Different Algorithms

Table 1 presents the average NCM point pairs for various algorithms. Due to matching failures in the LGFC and MS-PIIFD algorithms, only successfully matched pairs were used when calculating the average NCM. Typically, at least three point pairs are needed to compute a transformation matrix. While our algorithm's final average NCM of 14 is not as high as that of RIFT and MS-HLMO, it is sufficient for computing the transformation matrix. The CAO-C2F algorithm, designed for multi-modal power equipment images, also shows a notable average NCM. In contrast, MS-PIIFD and LGFC perform relatively poorly.

3.4.3. RMSE of Different Algorithms

We calculated the average RMSE and compared registration times, as shown in Table 1. A smaller RMSE indicates higher registration accuracy. Our algorithm achieved an average RMSE of 1.8542, ranking second among all methods. While MS-PIIFD slightly outperforms our method in RMSE, it underperforms in several other metrics. LGFC performed the worst, and CAO-C2F and RIFT had RMSE values similar to ours. MS-HLMO ranked third with an average RMSE of 2.2151. Overall, our algorithm performs excellently in terms of average RMSE.
$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{j=1}^{n} \left[ \left( x_1^j - x_2^j \right)^2 + \left( y_1^j - y_2^j \right)^2 \right] }, \tag{24}$$
where $(x_1^j, y_1^j)$ are the coordinates of the $j$-th matching point in the reference image, and $(x_2^j, y_2^j)$ are the coordinates of the corresponding matching point of the floating image after affine transformation into the reference image's coordinate system. The parameter $n$ denotes the number of matched point pairs.
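Equation (24) translates directly into code; the point pairs below are illustrative, not from the experiments.

```python
import math

def rmse(ref_pts, warped_pts):
    """Equation (24): RMSE between reference keypoints and the floating
    keypoints after mapping into the reference frame."""
    n = len(ref_pts)
    s = sum((x1 - x2) ** 2 + (y1 - y2) ** 2
            for (x1, y1), (x2, y2) in zip(ref_pts, warped_pts))
    return math.sqrt(s / n)

# One pair off by a single pixel in y, one pair exact:
err = rmse([(0, 0), (3, 4)], [(0, 1), (3, 4)])   # sqrt(1/2) ~ 0.707
```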

3.4.4. Registration Accuracy of Different Algorithms

We assess registration accuracy by calculating the ratio of NCM to the initial number of matching point pairs. As shown in Table 1, RIFT and MS-HLMO exhibit relatively low registering accuracy due to a high number of initial matching point pairs, leading to more erroneous matches. MS-PIIFD and CAO-C2F show better accuracy, with LGFC in second place. Our algorithm achieves the best performance. An ablation experiment comparing single-scale and multi-scale Sobel edge detection showed that the single-scale version had an average NCM 23 lower than the multi-scale version, validating the effectiveness of our approach. Overall, our algorithm demonstrates excellent performance.

4. Conclusion

Multi-modal remote sensing image registration remains a complex challenge, with existing algorithms still in early stages of development. Due to differences in intensity, spectral characteristics, and viewing angles, current methods face limitations in both registration time and accuracy. To address these issues, we propose a novel approach that introduces a new gradient definition and technique for determining the dominant direction of feature points, along with improvements to the CSS corner detection algorithm and SIFT descriptor. Experimental results on public datasets demonstrate superior performance.

Author Contributions

J.Z. wrote the main manuscript and conducted the related experiments, C.L. revised and proofread the paper and D.L. provided the software.

Data Availability Statement

Code and data will be made available on request.

Acknowledgments

This work was supported in part by the Open Project of the Key Lab of Enterprise Informationization and Internet of Things of Sichuan Province grant number 2022WZJ01, Postgraduate course construction project of Sichuan University of Science and Engineering grant number YZ202103, Research Project on Teaching Reform at Sichuan University of Science and Technology grant number JG-2302, Talent Introduction Program at Sichuan University of Science and Engineering grant number 2023RC22 and Graduate Innovation Fund of Sichuan University of Science and Engineering grant number Y2023338.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, X.; Leng, C.; Hong, Y.; Pei, Z.; Cheng, I.; Basu, A. Multimodal remote sensing image registration methods and advancements: A survey. Remote Sensing 2021, 13, 1–31.
  2. Ye, Y.; Shan, J.; Bruzzone, L.; Shen, L. Robust registration of multimodal remote sensing images based on structural similarity. IEEE Transactions on Geoscience and Remote Sensing 2017, 55, 2941–2958.
  3. Huang, Q.; Guo, X.; Wang, Y.; Sun, H.; Yang, L. A survey of feature matching methods. IET Image Processing 2024, 18, 1385–141.
  4. Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image matching from handcrafted to deep features: A survey. International Journal of Computer Vision 2021, 129, 23–79.
  5. Lowe, D.G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 2004, 60, 91–110.
  6. Gao, C.; Li, W. Multi-scale PIIFD for registration of multi-source remote sensing images. Journal of Beijing Institute of Technology 2021, 30, 113–124.
  7. Gao, C.; Li, W.; Tao, R.; Du, Q. MS-HLMO: Multiscale histogram of local main orientation for remote sensing image registration. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–14.
  8. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-modal image matching based on radiation-variation insensitive feature transform. IEEE Transactions on Image Processing 2020, 29, 3296–3310.
  9. Zhu, J.; Liu, C.; Yang, Y. Robust image registration for power equipment using large gap fracture contours. IEEE MultiMedia 2024, 31, 53–64.
  10. Öfverstedt, J.; Lindblad, J.; Sladoje, N. Fast and robust symmetric image registration based on distances combining intensity and spatial information. IEEE Transactions on Image Processing 2019, 28, 3584–3597.
  11. Jiang, Q.; Liu, Y.; Yan, Y.; Deng, J.; Jiang, X. A contour angle orientation for power equipment infrared and visible image registration. IEEE Transactions on Power Delivery 2021, 36, 2559–2569.
  12. He, X.C.; Yung, N.H.C. Corner detector based on global and local curvature properties. Optical Engineering 2008, 47, 1–13.
  13. Beis, J.S.; Lowe, D.G. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1997.
  14. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 1981, 24, 381–395.
  15. Jiang, X.; Ma, J.; Xiao, G.; Shao, Z.; Guo, X. A review of multimodal image matching: Methods and applications. Information Fusion 2021, 73, 22–71.

Short Biography of Authors

Jianhua Zhu received his B.S. degree in mathematics and applied mathematics from Xichang University, and his M.A.Sc. degree in mathematics from Sichuan University of Science and Engineering, Zigong, 643000, China. His research interests include image processing and computer vision. This work was completed during his master's studies. He is the first author of this article. Contact him at tostuhua@qq.com.
Changjiang Liu is an associate professor with the Key Laboratory of Higher Education of Sichuan Province for Enterprise Informationalization and Internet of Things, Sichuan University of Science and Engineering, Zigong, 643000, China. His research interests include image processing and computer vision. Liu received his Ph.D. degree in image segmentation and image registration from Sichuan University. He is the corresponding author of this article. Contact him at liuchangjiang@189.cn.
Danling Liang is currently working toward her M.A.Sc. degree, focused on image segmentation, with the School of Mathematics and Statistics, Sichuan University of Science and Engineering, Zigong, 643000, China. Her research interests include image processing. Liang received her B.S. degree in mathematics and applied mathematics from Sichuan University of Science and Engineering. She is a co-author of this article. Contact her at liangdanling@163.com.
Figure 1. Schematic diagram of possible connecting lines between adjacent feature points on each contour.
Figure 2. Comparison of images resulting from single-scale and multi-scale Sobel edge detection.
Figure 3. The proposed algorithm and the intuitive test results of different multi-modal image registering algorithms, where each row represents the test results of one multi-modal image registering algorithm.
Table 1. Average running time, average Number of Correct Matching (NCM) points, average Root Mean Square Error (RMSE), and average registering accuracy of different image registering algorithms.
| Algorithms | Average running time (s) | Average NCM | Average RMSE | Average registering accuracy (%) |
| --- | --- | --- | --- | --- |
| LGFC | 4.1215 | 3 | 8.4699 | 25.70 |
| MS-PIIFD | 18.3924 | 4 | 1.8334 | 29.08 |
| CAO-C2F | 25.2271 | 43 | 1.8796 | 20.84 |
| RIFT | 10.2207 | 147 | 1.9241 | 10.09 |
| MS-HLMO | 214.514 | 267 | 2.2151 | 18.75 |
| CSSCPF (Ours) | 5.5795 | 14 | 1.8542 | 46.74 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.