A New Principle Toward Robust Matching in Human-like Stereovision

Ming Xie; Tingfeng Lai; Yuhui Fang

doi:10.20944/preprints202306.1313.v1

Submitted:

19 June 2023

Posted:

19 June 2023

Read the latest preprint version here

Abstract

Visual signals are the upmost important source for robots, vehicles or machines to achieve human-like intelligence. Human beings heavily depend on binocular vision to understand the dynamically changing world. Similarly, intelligent robots or machines must also have the innate capabilities of perceiving knowledge from visual signals. Until today, one of the biggest challenges faced by intelligent robots or machines is the matching in stereovision. In this paper, we present the details of a new principle toward achieving a robust matching solution which leverages on the use and integration of top-down image sampling strategy, hybrid feature extraction, and RCE neural network for incremental learning (i.e., cognition) as well as robust match-maker (i.e., recognition). A preliminary version of the proposed solution has been implemented and tested with data from Maritime RobotX Challenge (www.robotx.org). The contribution of this paper is to attract more research interest and effort toward this new direction which may eventually lead to the development of robust solutions expected by future stereovision systems in intelligent robots, vehicles and machines.

Keywords:

Visual Signals

;

Stereovision

;

Image Sampling

;

Feature Extraction

;

Incremental Learning

;

Match-Maker

;

Cognition

;

Recognition

;

Possibility Function.

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

We are living inside an ocean of signals. Among all the signals, visual signals should be the ones with the upmost importance. This is because without the visual signals, human beings will not be able to undertake many activities in the physical world. Similarly, visual signals are extremely important to today’s autonomous robots, vehicles and machines [1]. Hence, research works on enabling robots, vehicles and machines to gain human-like intelligence from the use of visual signals should never be undermined or shadowed by the development of alternative sensors such as Radar [2] and LiDAR [3].

In this paper, we present a new principle which addresses the most difficult problem in stereovision, which is to achieve stereo matching as robust as possible [4]. The motivation behind our research works comes from projects dedicated to the development of intelligent humanoid robots [5] as well as autonomous surface vehicles [6], as shown in Figure 1.

For both platforms in Figure 1, their intelligence and autonomy greatly depend on the outer loop which consists of perception, planning and control. Among all possible modalities of doing perception, visual perception is a very important one. Especially, the goal toward achieving human-like visual perception must start with the use of binocular vision or stereovision [7]. Hence, research on human-like stereovision should be a non-negligible topic in both artificial intelligence and robotics. Therefore, the purpose of this paper is to present a new principle which advances the current state of the art in developing human-like stereovision for autonomous robots, vehicles, and machines [8].

The remaining part of the paper is organized as follows: Section 2 outlines the biggest challenge faced by stereovision. Section 3 briefly discusses similar works dedicated to stereo matching. Section 4 describes the proposed new principle which addresses stereo matching problem. Sections 5 to 10 present the details of the key steps inside the proposed principle. Section 11 shows some preliminary results. The conclusions are given in Section 12.

2. Problem Statement

Stereovision enables human beings to classify entities, to identify entities, and to localize entities. Clearly, stereovision is a very powerful system or module which provides answers to the following questions:

What are the classes of perceived entities?
What are the identities of perceived entities?
Where are the locations of perceived entities inside related images?
Where are the locations of perceived entities inside scenes?

A complete solution, which could fully answer the above questions, actually depends on the availability of the working principles underlying image sampling (NOTE: this is a largely overlooked sub-topic), entity cognition (NOTE: vaguely named as deep learning in literature [9]), entity recognition (NOTE: vaguely named as object detection in literature [10]), and entity matching [11] as shown in Figure 2, etc.

As illustrated in Figure 2, when an entity is placed at location Q in a scene, its stereo images will appear at location a in left image and location b in right image, respectively. Then, the stereo matching problem could be stated as follows:

Given the location of an entity seen by the left camera, how to determine the location of the same entity seen by the right camera? [12]

The above problem implies the following two challenges:

How to determine the presence of an entity in the left camera’s image plane?
How to find the match in the right camera’s image plane if an entity has been detected in the left camera’s image plane?

It is important to further pay attention to the root causes behind these two challenges. The major root causes include [13]:

Variations of entities in size
Variations of entities in orientation
Variations of entities’ images due to lighting conditions
Variations of entities’ images due to occlusions as shown in Figure 3a
Variations of entities’ images due to image sampling process as shown in Figure 3b

In practice, we could cope with the issues raised by the variations of images in sizes and orientations if we could afford to have enough computational powers allocated to process images at multiple scales and multiple rotations. Also, we could cope with the variations of lighting conditions if we could make use of cameras with built-in functions of automatic illumination compensation and/or automatic contrast equalization. Hence, our remaining effort should focus on dealing with the issue raised by the occurrence of partial views faced by stereovision.

3. Similar Works on Stereo Matching

Stereo matching is an old problem in computer vision. In literature, there is a tremendous amount of works dedicated to solving the problem faced by stereo matching. For example, there are:

Methods which make the attempt of matching points within a pair of stereo images [14].
Methods which make the attempt of matching edges or contours within a pair of stereo images [15].
Methods which make the attempt of matching line segments within a pair of stereo images [16].
Methods which make the attempt of matching curves within a pair of stereo images [17].
Methods which make the attempt of matching regions within a pair of stereo images [18].
Methods which make the attempts of matching objects within a pair of stereo images [19].

The proposed principle in this paper falls into the category of making the attempt of matching entities within a pair of stereo images. Here, an entity may broadly refer to an object, a person, an animal, a building, or a machine, etc. In the literature, the existing solution in this category focuses on the use of deep convolution to do feature extraction which is then followed by the use of artificial neural network to do tuning and prediction. Such methods actually depend on the process of bottom-up optimization (e.g., back propagation algorithm) and the use of features in time-domain. In contrast, our proposed principle advocates the use of top-down design process in which we promote the use of hybrid features (i.e., features from both time-domain and frequency domain) as well as the use of the improved version of RCE neural work [20]. RCE neural network [21,22,23], which was discovered in 1970s by a research team led by a laureate [23] of Nobel prize in 1969, is fundamentally different from artificial neural network. So far, to the best of our knowledge, there is no other better way of designing human-like cognition and recognition than the use of RCE neural work or its improved versions [20,21,22,23].

It is worth acknowledging that despite the huge amount of research works dedicated to stereovision, the achieved results are far behind the performance of human beings’ stereovision. Obviously, we should not stop the continuous investigation which aims at looking for better principles of, or solutions to, human-like stereovision.

4. The Outline of Proposed Principle

Human vision is attention-driven in a top-down manner. The attention could be triggered by the occurrences of reference entities such as appearances of persons, appearances of animals, appearances of objects, appearances of machines, appearances of geometries (e.g., lines, curves, surfaces, volumes, etc.), appearances of photometry (e.g., chrominance and luminance, etc.), appearances of textures, etc. Such reference entities could be learnt by a cognition process incrementally in real-time. However, the occurrences of familiar reference entities should be the responses of an internal recognition process.

Inspired by the innate processes of human vision, we propose a new principle which imitates the attention-driven behavior of human vision. The main idea of the proposed new principle is outlined in Figure 4.

Without loss of generality, we assume that the attention is to be recognized from the video streams of the left camera. The key steps involved in the proposed new principle include:

Image acquisition by both cameras.
Image sampling on video stream from left camera.
Hybrid feature extraction for each image sample.
Cognition of image samples if they correspond to the training data of reference entities inside training images.
Recognition of image samples if they correspond to the possible occurrences of reference entities inside real-time images.
Forward/Inverse processes of template matching, which work together so as to find the occurrence of matched candidate in the right image, if a recognized entity is present in the left image.

In the subsequent sections, we will describe the details of key steps 2 to 6.

5. Top-Down Strategy of Doing Image Sampling

An image may contain many entities of interest. One of the biggest challenges faced by image understanding or image segmentation/grouping is to divide an image into a matrix of image samples, each of which just contains the occurrence or appearance of a single entity. In theory and in practice, there is no solution which could generally guarantee such results expected by the subsequent visual processes in stereovision.

In addition, the problem of finding better ways to do image sampling did not receive enough attention in the research community. One major reason is because many people believe that it is good enough to use a sub-window to scan an input image so as to obtain all the possible image samples. However, this way of doing image sampling has serious drawbacks such as:

It is difficult to determine, or to justify, the size of sub-window which is used to scan an input image. If the size of sub-window is allowed to be dynamically changed, then the next question is how to do such dynamic adjustment of sizes.
The number of obtained image samples is independent of the content inside an input image. For example, an input image may contain a single entity. In this case, the scanning method will still produce many image samples which will be the input to subsequent visual processes of classification, identification, and grouping, etc. Obviously, irrelevant image samples may potentially cause troubles to these visual processes of recognition.

In this paper, we advocate a top-down strategy which iteratively divides an input image into a list of sets which contain linearly growing numbers of image samples of different sizes. If we denote

S_{k}

a set which contains k image samples, one way to obtain

S_{k}

is to uniformly divide an input image into a matrix of

d_{v} \times d_{h}

samples, in which

d_{v} \times d_{h} = k

. For example, if we iteratively divide an input image into:

$S_{k}$ with one sample, then $k = 1$ and $d_{v} \times d_{h} \in [1 \times 1]$ .
$S_{k}$ with two samples, then $k = 2$ and $d_{v} \times d_{h} \in [1 \times 2, 2 \times 1]$ .
$S_{k}$ with three samples, then $k = 3$ and $d_{v} \times d_{h} \in [1 \times 3, 3 \times 1]$ .
$S_{k}$ with four samples, then $k = 4$ and $d_{v} \times d_{h} \in [1 \times 4, 4 \times 1, 2 \times 2]$ .
and so on.

This top-down strategy of doing image sampling is suitable for both parallel implementation and sequential implementation.

Before sending the image samples to the next visual process of extracting features, it is necessary to normalize the size of image samples so as to make them to be comparable in size. In practice, it is trivial to scale up or down the size of an image sample to any chosen standard value. By now, we could represent image sample set

S_{k}

as follows:

S_{k} = \{I_{j, r} (u, v), I_{j, g} (u, v), I_{j, b} (u, v), u \in [0, U - 1], v \in [0, V - 1], j \in [1, k]\}

(1)

where

(r, g, b)

are the three primary color components at index coordinates

(u, v)

inside set

S_{k}

’s

j^{t h}

image sample

I_{j} (u, v)

which has the size of

V \times U

. Hence, by default, each image acquisition module in stereovision outputs color images, each of which is represented by a set of three matrices such as

I_{j, r} (u, v), I_{j, g} (u, v), I_{j, b} (u, v)

in Equation (1).

6. Feature Extraction from Sample Image in Time-Domain

Mathematically speaking, the periodicity in space is equivalent to the periodicity in time. Hence, without loss of generality, we consider the spatial axes of an image or image sample as time axes. In this way, we could focus our discussions on how to extract features in time domain as well as in frequency domain.

Feature extraction in time domain has been extensively investigated by the research community of image processing and computer vision. In general, the basic operations include the computations of n-order derivatives where n could be equal to 0, 1, 2, 3, and any other larger value of integer. Here, the zero-order derivatives could refer to the results obtained by the operation of image smoothing for noise reduction (e.g., to use Gaussian filters).

In the literature, there are also many advanced studies which explore the use of Laplacian filters, Gabor filters, Wavelet filters, Moravec corner filter, Harris-Stephens corner filter, and Shi-Tomasi corner filter, etc. Hence, feature extraction in time domain is a very rich topic.

From the results shown in Figure 5, it is clear to us that higher order derivatives do not significantly provide extra information. The data of zero-order derivatives and first-order derivatives should be good enough for us to extract meaningful features in time domain.

In practice, the zero-order derivatives could be obtained by convoluting set

S_{k}

’s

j^{t h}

image sample

I_{j} (u, v)

with a discrete Gaussian filter such as:

G (u, v) = \frac{1}{16} [\begin{matrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{matrix}], v \in [0,2], u \in [0,2]

(2)

If we represent the results (i.e., a matrix of zero-order derivatives of all the color components) of zero-order derivatives as follows:

S_{k, 0} = \{I_{j, r_{0}} (u, v), I_{j, g_{0}} (u, v), I_{j, b_{0}} (u, v), u \in [0, U - 1], v \in [0, V - 1], j \in [1, k]\}

(3)

Then, the first-order derivatives could be obtained by convoluting each matrix in

S_{k, 0}

with the following two Sobel filters:

C_{h} (u, v) = [\begin{matrix} + 1 & + 2 & + 1 \\ 0 & 0 & 0 \\ - 1 & - 2 & - 1 \end{matrix}], v \in [0,2], u \in [0,2]

(4)

and

C_{v} (u, v) = [\begin{matrix} + 1 & 0 & - 1 \\ + 2 & 0 & - 2 \\ + 1 & 0 & - 1 \end{matrix}], v \in [0,2], u \in [0,2]

(5)

Clearly, the convolution with filter in Equation (4) will result in the horizontal components of the first-order derivatives while the convolution with filter in Equation (5) will result in the vertical components of the first-order derivatives. The

L^{2}

norms computed from these two components will yield the results of the first-order derivatives of image samples

S_{k, 0}

, which could be represented by:

S_{k, 1} = \{I_{j, r_{1}} (u, v), I_{j, g_{1}} (u, v), I_{j, b_{1}} (u, v), u \in [0, U - 1], v \in [0, V - 1], j \in [1, k]\}

(6)

Therefore, for image sample j in set

S_{k}

, it actually has six image matrices which are:

[I_{j, r_{0}} (u, v), I_{j, g_{0}} (u, v), I_{j, b_{0}} (u, v)]

in Equation (3) and

[I_{j, r_{1}} (u, v), I_{j, g_{1}} (u, v), I_{j, b_{1}} (u, v)]

in Equation (6). Then, the next question is how to determine a feature vector

F_{j}

which meaningfully represents image sample j in set

S_{k}

.

A simple answer to the above question could be to convert the six image matrices of a sample into their vector representations (i.e., a 2D matrix is re-arranged as a 1D vector). Then, by putting these six image vectors together, we will obtain feature vector

F_{j}

. The advantage of this method is its simplicity. However, the noticeable drawback is the large dimension of feature vector

F_{j}

. Then, we may want to know whether there is a better way of determining feature vector

F_{j}

from image matrices, or not.

So far, there is no theoretical answer to this question. Maybe, a practical way is to design workable solutions which could be suitable for applications in hands. In this way, a library of workable solutions may empower autonomous robots, vehicles, or machines to adapt their behaviors to real-time situations or applications. Clearly, this topic still offers opportunities for further or continuous research works.

Here, we propose a simple and practical way of determining feature vector

F_{j}

from image matrices. The idea is to compute statistics from a set of image matrices. Interestingly, the two obvious types of statistics are the mean values and standard deviations.

For example, if

{I (u, v), u \in [0, U - 1], v \in [0, V - 1]}

is an image matrix of single values such as red components, green components, blue components, or their individual first-order derivatives, each value in

{I (u, v), u \in [0, U - 1], v \in [0, V - 1]}

could be considered as a kind of measurement of approximate electromagnetic energy. Therefore, we could compute the following four meaningful statistics from

{I (u, v), u \in [0, U - 1], v \in [0, V - 1]}

, which are:

The mean value of approximate electromagnetic energy:

$I_{a} = \frac{1}{U V} \sum_{v = 0}^{V - 1} \sum_{u = 0}^{U - 1} I (u, v)$

(7)
The square-root of the variance of approximate electromagnetic energy:

$σ_{I} = \sqrt{\frac{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{U - 1} {(I (u, v) - I_{a})}^{2}}{U V}}$

(8)
The horizontal distribution of approximate electromagnetic energy:

$σ_{u} = \sqrt{\frac{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{N - 1} I (u, v) \times {(u - u_{c})}^{2}}{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{U - 1} I (u, v)}}$

(9)

with:

$u_{c} = \frac{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{N - 1} {I (u, v) \times u}}{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{U - 1} I (u, v)}$

(9a)

and

$v_{c} = \frac{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{N - 1} {I (u, v) \times v}}{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{U - 1} I (u, v)}$

(9b)
The vertical distribution of approximate electromagnetic energy:

$σ_{v} = \sqrt{\frac{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{N - 1} {I (u, v) \times {(v - v_{c})}^{2}}}{\sum_{v = 0}^{V - 1} \sum_{u = 0}^{U - 1} I (u, v)}}$

(10)

As a result, any image matrix such as

{I (u, v), u \in [0, U - 1], v \in [0, V - 1]}

could be represented by feature vector

F

as follows:

F = [I_{a}, σ_{I}, σ_{u}, σ_{v}]

(11)

In time domain, if image sample j in set

S_{k}

has six image matrices, its feature vector

F_{j}

will contain 24 feature values.

7. Feature Extraction from Sample Image in Frequency-Domain

In mathematics, a very important discovery was Fourier Transform which tells us that any signal is the (finite or infinite) sum of sine functions. In engineering, one of the greatest inventors was Nikolas Tesla who told us that the secret of the universe could be understood by simply thinking in terms of energy, vibration, and frequency. Such statement explicitly advises us to look for feature space and feature vector in frequency domain if we would like to understand the secret of machine intelligence.

Given image matrix

{I (u, v), u \in [0, U - 1], v \in [0, V - 1]}

, it could be represented by, or decomposed into, its Fourier series in terms of complex exponentials

e^{\pm i x}

(in which

i = \sqrt{- 1}

) which could be computed as follows:

I (u, v) = \frac{1}{U V} \sum_{ω_{v} = 0}^{V - 1} \sum_{ω_{u} = 0}^{U - 1} \hat{I} (ω_{u}, ω_{v}) e^{i (\frac{2 π ω_{u} u}{U})} e^{i (\frac{2 π ω_{v} v}{V})}

(12)

with

0 \leq u \leq U - 1

,

0 \leq v \leq V - 1

and:

\hat{I} (ω_{u}, ω_{v}) = \sum_{v = 0}^{V - 1} \sum_{u = 0}^{U - 1} I (u, v) e^{- i (\frac{2 π ω_{u} u}{U})} e^{- i (\frac{2 π ω_{v} v}{V})}

(13)

in which

0 \leq ω_{u} \leq U - 1

and

0 \leq ω_{v} \leq V - 1

. With continuous signals or data, Equation (12) will become inverse Fourier Transform while Equation (3) will become forward Fourier transform.

It is interesting to take note that each value

\hat{I} (ω_{u}, ω_{v})

in Equation (13) is a complex number or more precisely a vector. Mathematically speaking, a vector indicates a position in a space. Hence, Fourier coefficient vectors (or complex numbers), which are stored inside complex matrix

{\hat{I} (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

, nicely define a feature space. Such a feature space could be called as Fourier feature space.

In mathematics, complex matrix

{\hat{I} (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

could be split into two ordinary matrices

{A (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

and

{B (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

, where

A (ω_{u}, ω_{v})

is the real part of complex number (or vector)

\hat{I} (ω_{u}, ω_{v})

and

B (ω_{u}, ω_{v})

is the imaginary part of complex number (or vector)

\hat{I} (ω_{u}, ω_{v})

. Both matrices A and B are Fourier coefficient matrices.

Therefore, in frequency domain, a straightforward way of determining feature vector

F

which characterizes image matrix

{I (u, v), u \in [0, U - 1], v \in [0, V - 1]}

taken from image sample j in set

S_{k}

is to re-arrange the corresponding Fourier coefficient matrix

{A (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

or

{B (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

into a vector.

Alternatively, we could use equations, which are the same to Equations (7) to (10), to compute the mean values

(ω_{u, c}, ω_{v, c})

of frequencies and their standard deviations

(σ_{ω, u}, σ_{ω, v})

from Fourier coefficient matrix

{A (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

or

{B (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

. In this way, frequency domain’s feature vector corresponding to each Fourier coefficient matrix

{A (u, v), u \in [0, U - 1], v \in [0, V - 1]}

or

{B (ω_{u}, ω_{v}), ω_{u} \in [0, U - 1], ω_{v} \in [0, V - 1]}

could be as follows:

F_{j} = [ω_{u, c}, ω_{v, c}, σ_{ω, u}, σ_{ω, v}]

(14)

In time domain, each image sample j in set

S_{k}

has three color component images. Each color component image could yield two Fourier coefficient matrices. In total, there will be six Fourier coefficient matrices for any given image sample j in set

S_{k}

. As a result, in frequency domain, feature vector

F_{j}

of image sample j in set

S_{k}

will also contain 24 feature values.

8. Cognition Process Using RCE Neural Network

Today, many researchers still believe that our mind arises from our brain. This opinion makes a lot of people or young researchers believe that the blueprint of mind is part of the blueprint of brain. For those who are familiar with microprocessors and operating systems, it is clear to us that the blueprints of operating systems are not part of the blueprints of microprocessors.

Here, we advocate the truth which states that mind is mind while brain is brain. Most importantly, the basic functions of brain are to support memorizations and computations which are intended by mind. With this truth in mind, the future research in artificial intelligence or machine intelligence should be focused on the physical principles behind the design of human-like minds which could transform signals into the cognitive states of knowing the conceptual meanings behind the signals.

In the previous sections, we have discussed the details of feature extraction. The results are lists of feature vectors in time domain, frequency domain, or both. Then, the next question will be how to learn the conceptual meaning behind a set of feature vectors corresponding to the same class of sample images or the same identity of sample images. The good news is that RCE neural network discovered in 1970s provides us a better version of answers, so far.

As shown in Figure 6, both cognition and recognition could be implemented with the use of RCE neural network which consists of three layers. There is a single vector at the input layer. Also, there is a single vector at the output layer. However, inside the middle layer, there is a dynamically growing number of nodes, each of which memorizes the feature vectors from a set of sample images provided by a training session of cognition.

Clearly, RCE neural network is fundamentally different from the so-called artificial neural network which is simply a graphical representation of a system of equations with coefficients to be tuned in some simple or deep manners (e.g., back-propagation method).

Refer to Figure 6. With training session

i

’s feature vectors, we could easily compute the mean vector and the standard deviation of the distances from the training session’s feature vectors to their mean vector.

For example, if training session

i

has

k_{i}

sample images which form the following set:

S_{k_{i}} = \{I_{j, r} (u, v), I_{j, g} (u, v), I_{j, b} (u, v), u \in [0, U - 1], v \in [0, V - 1], j \in [1, k_{i}]\}

(15)

then training session

i

’s set of feature vectors computed by feature extraction module could be denoted by

{F_{i, j}, j \in [1, k_{i}]}

where

j

is the index of image sample

j

in set

S_{k_{i}}

. Subsequently, the mean vector of

{F_{i, j}, j \in [1, k_{i}]}

could be calculated by:

F_{i, a} = \frac{1}{k_{i}} \sum_{j = 1}^{k_{i}} F_{i, j}

(16)

and the standard deviation of the distances from

{F_{i, j}, j \in [1, k_{i}]}

to the mean vector could be computed by:

σ_{i} = \sqrt{\frac{1}{k_{i}} \sum_{j = 1}^{k_{i}} {(F_{i, j} - F_{i, a})}^{T} (F_{i, j} - F_{i, a})}

(17)

By now, we could explain he physical meaning of node

i

(i.e., outcome of training session

i

) in RCE neural network, which is simply the representation of hyper-sphere [21] with its center at

F_{i, a}

and its radius to be equal to

3 σ_{i}

. Since

i

could dynamically grow, RCE neural network naturally supports the process of incrementally learning as well as the process of deep learning which is widely discussed about in the literature.

As we mentioned above, the deep tuning of parameters inside a complex artificial neural network, which is a graphical representation of a system of equations, has nothing to do with deep learning, and the true nature of deep learning is outlined in Figure 6.

In summary, a training session for cognizing entity n consists of supplying a set of entity n’s image samples in Equation (15) and entity n’s conceptual meaning

L_{n}

which is a label or a word in a natural language such as English.

9. Recognition Process Using Possibility Function

Refer to Figure 6 again. With a trained RCE neural network by a cognition process for each entity of interest (e.g., entity n), the output layer is primarily for the purpose of executing recognition process when the feature vector computed from any arbitrary image sample is given to the input layer.

In the literature, many researchers believe that recognition is a process of determining the chances of occurrences. As a result, probability functions are widely used inside a recognition module.

Here, we advocate the truth which states that recognition is a process of evaluating the beliefs about the identities and categories of any arbitrary image sample at input. This truth is in line with the fact that our mind consists of many sub-systems of beliefs. Hence, the function for estimating the degrees of beliefs should be a probability function such as:

p_{j, n} (i) = e^{- \frac{1}{2 σ_{i}^{2}} {(F_{0, j} - F_{i, a})}^{T} (F_{0, j} - F_{i, a})}

(18)

where

(F_{i, a}, σ_{i})

is the parameter vector of the hyper-sphere obtained from training session

i

while

F_{0, j}

is the feature vector computed from image sample

j

during recognition process, and

p_{j, n} (i)

is the possibility for image sample

j

to belong to learnt entity

n

according to training session

i

’s parameter vector.

Since RCE neural network intrinsically supports incremental learning as well as deep learning, the single node in the output layer must include a Soft-Max function such as:

P_{j, n} = \max_{i} {p_{m i n}, p_{j, n} (i)}

(19)

where

p_{m i n}

is the minimum value of acceptable possibility (e.g., 0.5). In practice, if

P_{j, n} = p_{m i n}

, the interpretation could be stated as follows: input

F_{0, j}

does not support the belief that image sample

j

belongs to learnt entity

n

. Otherwise, if

P_{j, n} > p_{m i n}

, it means that the output of recognition will be

(L_{n}, P_{j, n})

in which

L_{n}

is the conceptual meanings of image sample

j

.

10. Forward/Inverse Processes of Template Matching

In the previous sections, we have discussed the key details about the modules of image sampling, hybrid feature extraction, cognition, and recognition. In a human-like stereovision system, these modules will produce the output of recognized entities inside left camera’s image plane, as illustrated by Figure 4. Then, the next question will be how to determine the match in right camera if a recognized entity in left camera is given. This question describes the famous problem of stereo matching faced by today’s stereovision systems.

In the literature, stereo matching is a widely investigated problem. So far, there is no solution which could achieve the performance close to, or as good as, the one of human being’s stereovision system. Hence, better solutions for improved performance are still expected from future research works in this area.

In this paper, we present a new strategy which could cope with the problem of stereo matching in a better way. This new strategy consists of the interplay between forward template matching and inverse template matching.

In stereovision, the only geometrical constraint is the so-called epipolar line which indicates the possible locations of a match (e.g., at location b in Figure 7) in right image plane if a location in left image plane is given (e.g., location a in Figure 7).

As shown in Figure 7, if recognized sample

j

at location a is given in left image plane, the forward process of template matching will consist of the following steps:

Determine the equation of epipolar line from both stereovision’s calibration parameters (NOTE: such knowledge could be found in any textbook of computer vision) and location a’s coordinates.
Scan the epipolar line location by location.
Take image sample $e$ at currently scanned location e.
Compute the feature vector of image sample $e$ .
Compute the cosine distance between image sample $j$ ’s feature vector and image sample $e$ ’s feature vector.
Repeat the scanning until it is completed.
Choose the image sample to be the candidate of matched sample $j^{'}$ if it minimizes the cosine distance.
Use the cosine distance between recognized sample $j$ and the chosen candidate of matched sample $j^{'}$ to compute the possibility value of match (i.e., to use Equation (18)).
Accept matched sample $j^{'}$ if the possibility value of match is greater than a chosen threshold value (e.g., 0.5).

In the above process, if

F_{j}

is the feature vector of image sample

j

while

F_{e}

is the feature vector of image sample

e

, the cosine distance

d_{j, e}

between those two vectors is simply calculated according to their inner product, which is:

d_{j, e} = \frac{F_{j}^{T} \times F_{e}}{‖F_{j}‖ \times ‖F_{e}‖}

(20)

and the corresponding possibility value is calculated as follows:

P_{j, e} = e^{- \frac{1}{2 σ_{0}^{2}} d_{j, e}^{2}}

(21)

where

σ_{0}

is a default value of standard deviation which could be self-determined by robots, vehicles, or machines during a training session of cognition process.

According to the illustration shown in Figure 3, the forward process of template matching will work only if there is no partial view due to either occlusion or image sampling. If matched sample

j^{'}

in right image contains partial view of recognized sample

j

in left image, the inverse process of template matching will perform better than its counterpart of forward process.

As shown in Figure 8, if recognized sample

j

at location a is given in left image plane, the inverse process of template matching consists of the following steps:

Determine the equation of epipolar line from both the stereovision’s calibration parameters and the location a’s coordinates.
Scan the epipolar line location by location.
Take image sample $e$ at currently scanned location e.
Divide image sample e into a matrix of sub-samples ${e_{i}, i = 1,2, 3, \dots}$ .
Use each sub-sample in ${e_{i}, i = 1,2, 3, \dots}$ as template and do forward template matching with recognized sample $j$ .
Compute the mean value of all the possibility values which measure the match between all the sub-samples in ${e_{i}, i = 1,2, 3, \dots}$ and recognized sample $j$ . This mean value represents the possibility value for image sample $e$ in right image to match with recognized sample $j$ in left image.
Repeat the scanning until it is completed.
Choose the image sample to be the candidate of matched sample $j^{'}$ if it minimizes the possibility values of match (i.e., calculated by Equation (21)).
Accept the match if the possibility value of match is greater than a chosen threshold value (e.g., 0.5).

In practice, we could run both forward process and inverse process of template matching in parallel. In this way, a better decision of match in right image could be made if recognized sample

j

in left image is given.

11. Implementation and Results

The proposed new principle has been implemented in Python. Preliminary tests have been with image data from public domain. Especially, we use image data which are posted to public domain by maritime RobotX challenge (www.robotx.org). Figure 9 shows two typical examples of scenes constructed by maritime RobotX challenge.

The tasks to be undertaken by an autonomous surface vehicle include stereovision-guided delivery of objects, stereovision-guided parking into the docking bay, etc. In the following sections, we share some of our experimental results.

11.1. Results of Top-down Sampling Strategy of Input Images

Our proposed top-down sampling strategy of input images (e.g., images from left camera) is to divide an input image from left camera into a list of sets

S_{k}

which contain increasing number of image samples (i.e., k = 1, 2, 3, …). Figure 10 shows an example of results (i.e., k = 28) from our proposed top-down sampling strategy of an input image. At this level of sampling, a red floating post clearly appears inside one of these 28 samples.

11.2. Examples of Training Data for Cognition (i.e., Learning)

The proposed new principle involves the use of cognition and recognition modules. For cognition module, it is necessary to train it with training data of reference entities. Without loss of generality, we simply use a set of 10 samples to train the cognition module which is specifically dedicated to an entity of interest. It is amazing to see that the proposed solution could achieve successful results with 10 samples inside a dataset of training for each entity of interest.

Figure 11 shows the scenario of autonomous parking into a docking bay by an autonomous surface vehicle. In this task, the mental capabilities of the autonomous surface vehicle include a) cognition of triangle, cross and circle, and b) recognition of triangle, cross and circle. Hence, for the training of cognition module dedicated to each entity among triangle, cross and circle, we simply take ten samples as shown in Figure 11.

11.3. Results of Feature Extraction in Time Domain

For each sample image in Figure 10, we calculate its feature vector in time domain. Here, we share the results of feature vectors computed from the ten image samples of triangle in Figure 11. These results are shown in Figure 12, which also gives the result of the mean vector and its standard deviation.

11.4. Results of Feature Extraction in Frequency Domain

For each sample image in Figure 10, we calculate its feature vector in frequency domain. Similarly, we share the results of feature vectors computed from the ten image samples of triangle in Figure 11. These results are shown in Figure 13, which also gives the result of the mean vector and its standard deviation.

11.5. Results of Cognition

The mean vector and its standard deviation, which are obtained from each training session for any entity of interest, will be stored inside a node at the middle layer of the RCE neural network which is allocated to an entity’s cognition module. If there are N entities of interest, there will be a set of N RCE neural networks which support N pairs of cognizers and recognizers such as {(Cognizer n, Recognizer n), n = 1, 2, 3, …, N}. As illustrated in Figure 14, N could incrementally grow very deeply.

11.6. Results of Recognition

With N pairs of cognizers and recognizers such as {(Cognizer n, Recognizer n), n = 1, 2, 3, …, N} in place, an autonomous surface vehicle or robot is ready to recognize familiar or learnt entities inside images of left camera.

Figure 15 shows two examples of results of recognition in time domain. Each example contains seven image samples as input. Among these seven inputs, three of them are totally out of the class dedicated to the pair of cognizer and recognizer. We can see that recognition module performs quite successfully in recognizing the correct entries. Please take note that the feature vectors of image samples at input are all in time domain.

With the same image samples at input, Figure 16 shows the results of recognition in frequency domain. We can see that recognition module also performs quite successfully in recognizing the correct entries. Please take note that the feature vectors of image samples at input are all in frequency domain.

By now, people may ask whether the proposed new principle could work well with other more complex entities. For the sake of responding to such doubt, we give two more examples of results which make use of the feature vectors in frequency domain to do cognition and recognition. For each reference entity (e.g., car and dog), the cognition module is trained with ten image samples while the recognition process is tested with six image samples as input.

In Figure 17, we show the experimental results of cognizing and recognizing cars in frequency domain. The results are judged to be very good.

In Figure 18, we show the experimental results of cognizing and recognizing dogs in frequency domain. The results are also judged to be very good.

11.7. Results of Stereo Matching

Mathematically, a pair of images is good enough to validate a stereo matching algorithm. In practice, a pair of images could come from a binocular vision system which is normally named as stereovision system. Alternatively, a pair of images could come from a mobile monocular vision system. Since we use the image dataset from the public domain, it is easier for us to take two images from an image sequence captured by a mobile camera.

Here, we share one example of results in Figure 19, Figure 20 and Figure 21. In Figure 19, we let the stereovision system undergo the cognition process in which ten sample images of a floating post are used to train the RCE neural network inside the cognizer allocated to learn the red floating post. After the training of the cognizer’s RCE neural network, the so-called stereovision system is ready to enter the recognition process which takes any set of new images as input.

In Figure 19, seven image samples are selected for testing the validity of trained RCE neural network. The possibility values show good outcome from the recognition process.

Now, we could start the process of stereo matching. As shown in Figure 20, the first step is to do image sampling. When the so-called left image is sampled into a matrix of 4x7 image samples, the occurrence of a red floating post could be recognized. Please take note that the image sample of this recognized occurrence is named as image sample 1a by our testing program.

Subsequently, in the so-called right image, we could determine a line (i.e., equivalent to epipolar line) which will guide the search for the best match candidate.

For the purpose of illustration, we take three image samples to compute the stereo matching results. Among these three image samples, image sample 1b is the best match. In general, the best match is the one which maximizes the possibility value (i.e., computed by Equation (21) with

σ_{0} = 10

) between image sample 6a in left image and all the possible image samples in right image. Figure 21 shows the possibility values computed for three pairs of possible matches, which are (1a, 1b), (1a, 1c), and (1a, 1d). Clearly, pair (1a, 1b) stands out to be the best match.