3. The Experiment and the Preparation
As noted before, one of the contributions of this paper is the exposition of the connection between PCA and quantum logic. In this section, we take a closer look at one possible way to convert our data into a format that can make use of this connection.
More precisely, we first explain our data format and then illustrate the preprocessing that generates the random vectors X from our shape dataset. We then discuss a way of classifying shapes and provide an algorithmic pipeline. Finally, we present and discuss the results.
3.1. Shape Data
To begin, we briefly explain the concept of a shape as we employ it. A shape S is considered to be realized by a closed curve in a two-dimensional space, $\mathbb{R}^2$. Such a curve is given as a point cloud $P = \{p_1, \dots, p_n\}$ with points $p_i \in \mathbb{R}^2$ for $i = 1, \dots, n$. The x and y coordinates of a single point $p_i$ are denoted by $x_i$ and $y_i$. In this way, the point cloud P represents a discrete realization of the shape S.
3.1.1. The Dataset
The core of the dataset we use is the geometric shape dataset [25]. This dataset consists of 10,000 pictures per shape and contains pictures of triangles, squares, pentagons, hexagons, heptagons, octagons, nonagons, circles, and stars (with a centred pentagonal hole) in different rotations, sizes, positions, and colours of both the shapes and the background. We also used this dataset in a previous paper, see [24], where we already transformed the images into point clouds. These point clouds are the starting point for our preprocessing.
The process we employ to reduce a colour image to a point cloud is sketched in Figure 2 and is briefly reviewed here. First, we select one colour channel of the image and obtain a greyscale version of the image. The second step is the transformation into a binary black-and-white image. After that, we use the MATLAB routine bwboundaries to compute the point clouds.
The reason for using only one colour channel to create the greyscale image is that the pictures in the dataset were originally created to have a constant mean value over the whole image. Otherwise, we would end up with no difference between background and shape in the image, and we would not be able to produce a binary image for further processing.
For the stars, the MATLAB routine bwboundaries produced two to three point clouds per image, i.e. one point cloud of the star, one point cloud of the pentagon, and sometimes a point cloud of the image border. Moreover, these point clouds changed their order from image to image, so we were not able to select the useful point cloud automatically. Since producing point clouds from the star images is this troublesome, we chose to ignore these data and consider only the remaining geometrical shapes of the given dataset.
With the remaining dataset we have the geometrical shape set
$$O = \{3, 4, 5, 6, 7, 8, 9, \infty\},$$
where we denote the different shapes by their number of vertices. For the circle, we decided to use the symbol ∞ to encode the number of vertices.
The point clouds P that result from this process are ordered. That is, the point $p_i$ is the predecessor of the point $p_{i+1}$, both in the images and in the dataset. Since the geometric shapes have different sizes in the images, the number of points per point cloud differs between realizations of a shape class. We thus end up with point clouds whose size ranges from close to 100 to about 500 points.
3.2. Preprocessing
The point clouds are the starting point for the preprocessing that produces the random vectors X. Thus, the aim of the preprocessing is to convert all point clouds into normalized vectors of the same length, which are supposed to store enough information to make meaningful classifications possible.
3.2.1. Shape Descriptors and Signature
We start by converting the point clouds P, with their sets of two-dimensional coordinates, into one-dimensional objects. This process should encode the geometry of the shapes in a meaningful way that is fast to compute. To this end, we adopt the D1 shape descriptors presented in [26].
This shape descriptor calculates the distance d between a point p from the point cloud P and a chosen centre point $c$ as
$$d = \| p - c \|.$$
We make use of the Euclidean norm $\|\cdot\|$ and the barycentre of the point cloud as the designated centre point $c$. The centre point is thus calculated as the arithmetic average of all n points of the point cloud. In Figure 3 we provide a visualization of this process for a triangle point cloud.
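The descriptor described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the function names `barycentre` and `d1_descriptor` and the toy square are hypothetical and are not taken from the original implementation, which started from point clouds computed in MATLAB.

```python
import math

def barycentre(points):
    """Arithmetic mean of all points of an ordered 2D point cloud."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return (cx, cy)

def d1_descriptor(points):
    """Euclidean distance of every point to the barycentre, in point order."""
    centre = barycentre(points)
    return [math.dist(p, centre) for p in points]

# Hypothetical toy point cloud: the four corners of a square with side length 2.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
distances = d1_descriptor(square)
```

Because the list comprehension preserves the order of the input points, the result is an ordered collection in the sense used in this section.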
Now, we can create an ordered set of distance samples, also called an ordered collection,
$$D = (d_1, \dots, d_n),$$
with a collection size of n, which corresponds to the number of points in the point cloud. The term "ordered" refers to the fact that we keep the intrinsic order of the points from the underlying point cloud P.
In Figure 4 we plot the ordered samples of a triangle, a pentagon, a nonagon, and a circle. In these ordered samples, the vertices and the mid-edge points are recognizable through the higher and lower distance values, respectively. Additionally, we notice that the size of the shape affects the produced distances d. This indicates that some further preprocessing is needed, which is addressed in the next section.
3.2.2. Normalization
By creating the samples D, we eliminated the dependencies on rotation and translation. As already mentioned, the samples still depend on the size of the shape. For example, bigger or image-filling shapes produce, on average, larger shape descriptors d, since the distance between the centre point and the boundary points is larger, see Figure 4. The triangle (first row) and the nonagon (third row) are similar in size, and therefore their ranges of distance values are roughly similar. The pentagon (second row) and the circle (fourth row) produce distances that vary around values of 23 and 31, respectively.
Since we can have large and small shapes, we need to make the samples D more comparable with each other. A first approach to this problem is normalization. Since there are multiple ways of normalization, we want to present the used methods:
- Mean-Normalization: In order to ensure that all collections have the same mean value, we compute the mean value of a reference collection $D_{\mathrm{ref}}$. The mean value of a collection D is calculated via
  $$\bar{D} = \frac{1}{n} \sum_{i=1}^{n} d_i.$$
  The mean normalization of a collection D can then be calculated by
  $$D_{\mathrm{mean}} = \frac{\bar{D}_{\mathrm{ref}}}{\bar{D}} \, D.$$
- Max-Normalization: Similar to the mean normalization, we store the maximum of a reference collection $D_{\mathrm{ref}}$. Then, we ensure that every collection D has the same maximum value as the reference collection via
  $$D_{\max} = \frac{\max(D_{\mathrm{ref}})}{\max(D)} \, D.$$
As for important implementation details, from the 10,000 samples per shape class, we simply take the first sample as the reference and normalize all other samples of that shape accordingly. The normalization is done per shape class; that is, triangles are normalized with respect to the reference triangle, squares with respect to the reference square, and so on.
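The two normalizations can be illustrated as follows. The sketch assumes that both are realized as a multiplicative rescaling of the whole collection, which is our reading of the requirement that all collections share the mean (or maximum) of the reference collection; the function names and the toy numbers are hypothetical.

```python
def mean_normalize(sample, reference):
    """Rescale `sample` so that its mean equals the mean of `reference`."""
    ref_mean = sum(reference) / len(reference)
    mean = sum(sample) / len(sample)
    return [d * ref_mean / mean for d in sample]

def max_normalize(sample, reference):
    """Rescale `sample` so that its maximum equals the maximum of `reference`."""
    factor = max(reference) / max(sample)
    return [d * factor for d in sample]

# Hypothetical toy collections; the first sample of a class acts as the reference.
reference = [2.0, 4.0, 6.0]
sample = [10.0, 20.0, 30.0, 20.0]
mean_scaled = mean_normalize(sample, reference)
max_scaled = max_normalize(sample, reference)
```

Note that the two collections may have different lengths: only a single summary value of the reference (its mean or its maximum) enters the rescaling.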
3.2.3. Histogram Technique
After normalization, the samples no longer depend on the shape size. However, they still depend on the size of our point clouds: as the object size in the image increases, the number n of points in the point cloud P increases, and so does the number of elements in D. This is why the samples still have different sizes at this point.
Another issue is that the individual entries carry no information tied to their position in the sample, which may be crucial for the PCA. The first element in the sample is the normalized distance from the first point in the point cloud to the centre of the shape. This first point could be anywhere on the shape curve, as we have no control over the original ordering of the points. Nor do we want such control, since we do not want to limit the generalization properties of our approach unnecessarily.
To address these two points, we make use of a histogram technique. The distances of each sample D are sorted into a predetermined number of bins, which standardizes the length and provides an approximate representation of the distance distribution of the sample. Furthermore, we can store the results in the elements of a vector, so that even the position in the vector encodes information, which we consider helpful for the PCA. We thus expect the distance distribution of a triangle to have more in common with that of another triangle than with, e.g., that of a square.
Let us now briefly explain the idea behind the histograms and how to construct them in order to obtain a vector from a shape that meets our desired requirements. The main idea of a histogram, as we use it, is to distribute a sample over bins, where each bin represents a range of values, in our case the distances d from the samples D. Then, we can count how many elements fall into each of the bins. Therefore, it is crucial that the bins do not overlap, so that all values can only fall into one bin. In the end, we get a vector of length k, which stores the number of elements in each bin. And to keep the results comparable and to ensure that the entries store the same information, it is mandatory to use the same bins for all histograms of a single shape.
Since the sample size still differs from sample to sample, the bins of larger samples contain more elements than the bins of smaller samples. To solve this, we normalize the histograms so that each histogram has an area of one, where the area is the sum over all bins of the bin height multiplied by the bin width. We make use of this approach because most numerical libraries offer a density option for creating a histogram.
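The counting and the area normalization can be sketched as follows. This is an illustrative stand-in written in plain Python, not the routine used for the experiments; the function name `density_histogram`, the handling of values outside the bins, and the toy data are our assumptions. The bins are half-open, except for the last one, which also includes its upper edge.

```python
def density_histogram(sample, edges):
    """Count the values of `sample` per bin and rescale to a total area of one."""
    k = len(edges) - 1
    counts = [0] * k
    for d in sample:
        if d < edges[0] or d > edges[-1]:
            continue  # assumption: values outside the bin range are ignored
        for i in range(k):
            # bin i covers [edges[i], edges[i+1]); the last bin is closed
            if d < edges[i + 1] or i == k - 1:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no value of the sample falls into the bins")
    return [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]

# Two hypothetical samples of different size, binned with the same edges.
edges = [0.0, 0.5, 1.0, 1.5, 2.0]
small = density_histogram([0.1, 0.2, 0.6, 1.2, 1.9, 2.0], edges)
large = density_histogram([0.1 * j for j in range(21)], edges)
area_small = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(small))
area_large = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(large))
```

Both results are vectors of the same length, and both have an area of one although the underlying samples differ in size.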
To summarize, with these histograms we have constructed, from a shape, a mathematical object in the form of a vector with meaningful entries that is independent of the rotation, translation, and size of the shape. In addition, the resulting vector has a defined length of k, and each dimension of the vector has a relationship to a shape class that may differ from class to class. This means that we can use these vectors as a starting point for the PCA presented in Section 2.3.
3.3. Shape Classification
Our classification process consists of two parts: the actual classification, in which we determine the extent to which a shape belongs to a certain shape class, and the calculation of the hit rate, which quantifies the quality of our classification.
3.3.1. Classification Procedure
The foundation of the classification is provided by Equation (18). Given a normalized random vector X ($\|X\| = 1$) from an unknown class, we compute its principal component vector with respect to the eigenvector space of each shape class in O. After that, we compute the probability of X for each shape class and for different numbers of principal components using Equation (18). In the end, we choose the class in which X has the highest probability.
This choice is formalized in Equation (24), where we make use of the Euclidean norm for ‖·‖. Since each term in the sum is a scalar, the norm simplifies to the square of that term.
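The selection rule can be sketched as follows. The sketch rests on several assumptions of ours: each shape class is represented by a mean vector and a list of orthonormal eigenvectors, the principal components are the projections of the centred vector onto these eigenvectors, and the probability is the sum of the squares of the first m components. The names `models`, `m`, and the toy eigenvectors are hypothetical and serve only as an illustration of the argmax step.

```python
def principal_components(x, mean, eigenvectors):
    """Project the centred vector x - mean onto each orthonormal eigenvector."""
    centred = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(v_j * c_j for v_j, c_j in zip(v, centred)) for v in eigenvectors]

def class_probability(x, mean, eigenvectors, m):
    """Sum of the squared first m principal components (assumed measure)."""
    y = principal_components(x, mean, eigenvectors[:m])
    return sum(y_i ** 2 for y_i in y)

def classify(x, models, m):
    """Return the label of the class with the highest probability for x."""
    return max(models, key=lambda label: class_probability(x, *models[label], m))

# Hypothetical models: label -> (mean vector, orthonormal eigenvectors).
models = {
    3: ([0.0, 0.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    4: ([0.0, 0.0, 0.0], [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]),
}
x = [0.8, 0.6, 0.0]  # normalized, i.e. its Euclidean norm is one
label_m1 = classify(x, models, m=1)
label_m2 = classify(x, models, m=2)
```

In this toy setting the vector is assigned to class 3 for both values of m, because its squared components along the eigenvectors of class 3 sum to more than those along the eigenvectors of class 4.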
This procedure is illustrated in Table 1, where we use the values from Figure 5 to support the method with numbers. The figure shows the squares of the principal components for two shape classes (left and right) in percent. The square of the first component in the left panel contributes markedly more to the total probability than the square of the first component in the right panel. The outcome of Equation (24) would therefore always be 3, since we obtain the highest probability measure for the projection operator of the triangle shape class, regardless of the chosen parameter value.
3.3.2. Hit Rate
Consider multiple shape classes stored in O, where each shape class $i \in O$ has its own set of random vectors X. We want to know how many vectors of each class are assigned to a specific shape class.
Generally speaking, we call a classification successful if $i = j$ and failed if $i \neq j$, for an original shape class $i$ and an assigned shape class $j$ with $i, j \in O$. To quantify the quality of the classification, we introduce the hit rate $h_{i,j}$, the quotient of the number of vectors of shape class $i$ classified as shape class $j$ over the total number of elements $N_i$ in class $i$:
$$h_{i,j} = \frac{\left|\{X \text{ in class } i \text{ classified as } j\}\right|}{N_i}.$$
The hit rate $h_{i,j}$ is visualized in a matrix-style plot, see Figure 6. The x-axis represents the result of the classification, shape class $j$, and the y-axis indicates the original shape class $i$, for $i, j \in O$. The grey value of each entry of the hit rate matrix encodes the hit rate: a black entry represents a value of one for $h_{i,j}$, and a white entry a value of zero.
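A hit rate matrix of this kind can be computed from two label lists, as in the following sketch. The function name `hit_rate_matrix` and the toy labels are hypothetical; rows correspond to the original shape class and columns to the assigned shape class.

```python
def hit_rate_matrix(true_labels, predicted_labels, classes):
    """Entry [i][j]: share of the vectors of class i that were classified as j."""
    index = {c: n for n, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)  # number of vectors in the original class
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical toy result: four triangles (3) and two squares (4).
classes = [3, 4]
true_labels = [3, 3, 3, 3, 4, 4]
predicted = [3, 3, 3, 4, 4, 3]
rates = hit_rate_matrix(true_labels, predicted, classes)
```

Each row sums to one, so a perfect classification yields the identity matrix, which corresponds to a black diagonal on a white background in the plot described above.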
3.4. The Experimental Pipeline
In this section, we want to present the pipelines and algorithms we used to create our results.
First, we need to preprocess the loaded point clouds as discussed, so that they can be used for the PCA. Algorithm 1 gives an overview of this process.
For every shape class in O and for every point cloud belonging to that class, we produce normalized samples. Then, we compute the maximum and minimum values over all normalized samples of each shape class to define the width of the bins. With these bins, we produce the histograms X from the normalized samples. After this step, the preprocessing is finished.
Algorithm 1: Algorithm for preprocessing the data.
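The bin definition used in the preprocessing can be sketched as follows. We assume here that the k bins have equal width and span the range between the minimum and maximum over all normalized samples of one shape class; the function name `shared_bin_edges` and the toy samples are hypothetical.

```python
def shared_bin_edges(samples, k):
    """Edges of k equal-width bins covering all samples of one shape class."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    if hi == lo:
        raise ValueError("all values are identical, the bins would have zero width")
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k)]
    edges.append(hi)  # use the exact maximum to avoid floating-point drift
    return edges

# Hypothetical normalized samples of one shape class, with different lengths.
samples = [[1.0, 2.0, 3.0], [0.0, 4.0]]
edges = shared_bin_edges(samples, k=4)
```

Since the edges are computed once per shape class, every histogram of that class uses the same bins, and every normalized value of the class falls into one of them.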
The next step is the separation into a training data set and a test data set. We chose a ratio of 7:3. From the 10,000 histograms X in each shape class, we use 7,000 for training, i.e. for the PCA. The remaining 3,000 histograms are used to test the matching quality of this approach.
In the third step, we apply the PCA to the histograms X labelled for training to obtain the eigenspaces and mean values. The PCA is done separately for each shape class in O. We store the resulting eigenspaces and mean values for the classification process later on.
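The per-class training step can be illustrated in plain Python as follows. The sketch computes the class mean, the sample covariance matrix, and only the leading eigenvector, using power iteration as a simple stand-in; a full PCA would compute all required eigenvectors, typically with a library eigensolver. All function names and the toy data are our assumptions.

```python
import math

def class_mean(vectors):
    """Component-wise mean of all training vectors of one shape class."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def covariance(vectors, mean):
    """Sample covariance matrix of the centred training vectors."""
    n, k = len(vectors), len(mean)
    centred = [[v[j] - mean[j] for j in range(k)] for v in vectors]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(k)]
            for a in range(k)]

def leading_eigenpair(matrix, iterations=200):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    k = len(matrix)
    v = [1.0 / (j + 1) for j in range(k)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix or start vector in the null space
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
    eigenvalue = sum(v[a] * mv[a] for a in range(k))  # Rayleigh quotient
    return eigenvalue, v

# Hypothetical training vectors of one class, all lying on one line.
training = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mean = class_mean(training)
eigenvalue, direction = leading_eigenpair(covariance(training, mean))
```

The stored pair of mean vector and eigenvectors per shape class is all that the classification step needs later on.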
In the fourth step, we calculate the classification via (15), as illustrated in Algorithm 2. Here, we normalize the histograms labelled for testing again, so that $\|X\| = 1$. We make use of the 2-norm for this, since $\|\cdot\|$ in (18) refers to the norm induced by the inner product. In this case, it is the 2-norm, since we are using the Hilbert space $\mathbb{R}^k$. After that, we compute the principal values for all combinations of test histograms and shape classes using Equation (15).
Algorithm 2: Compute the probability and construct the hit rate matrix.
With these principal values, we can then calculate the probabilities via (18) for different numbers of principal components. After that, we are able to produce a hit rate matrix using Equation (25).
Finally, we would like to emphasize the two main properties that the generated random vectors X should have: all vectors must have the same length, and each dimension should encode some information.
3.5. Experiments
For the experiments, we formulate two hypotheses. First, we would like to see that we are actually able to distinguish between the different shape classes in O. Second, we would like to see that this prediction improves with an increasing number of principal components.
To this end, we make use of the presented pipeline and switch between the two normalizations, namely mean-normalization and max-normalization.
The resulting hit rate matrices for the different parameter settings are depicted in Figure 7 and Figure 8. There, the number of principal components increases row-wise from the top left image down to the bottom right image. We now examine the hit rate matrices with respect to our two hypotheses.
Both figures show a visible diagonal line, which is more dominant for shapes with a higher number of vertices, i.e. from the hexagon up to the circle. The mean-normalization tends to have more problems with shape classes with few vertices than the max-normalization, since its diagonal line is only faintly recognizable for these classes.
Concerning the second hypothesis, we notice that the diagonal becomes more and more dominant as we increase the number of principal components. It is remarkable, however, that this does not hold for all values: for larger values, we observe an overall increase in the number of failed classifications. These failed classifications are spread over multiple shape classes for the max-normalization and concentrated on the hexagon shape class for the mean-normalization.
In addition to the failed classifications, we notice other side effects and attempt to explain them.
We start with a potential explanation for the failed classifications. One reason could lie in the inherent structure of the principal component calculation. There, we subtract the mean value of a specific shape class from a random vector X. As a result, a random vector X belonging to this shape class should be closer to the origin of the eigenvector space spanned by this shape class than the random vectors of the other shape classes. Since points close to the origin are more sensitive to small errors, a small error could already change the shares on the different eigenvectors considerably, and therefore the principal components, too. Random vectors of other shape classes may not have this problem, because their mean values differ from the subtracted one. Therefore, random vectors of shape classes other than the one used for testing tend to stay away from the sensitive region around the origin. As a consequence, the principal components of these random vectors vary less and may even be larger.
Another effect is that, as the number of principal components increases, the corresponding eigenvalues converge to zero. The eigenvectors added last are therefore not meaningful enough: they add little information describing the tested shape, but allow random vectors from other shapes to increase their probability measure. This results in the failed classifications in the bottom left images of the presented figures.
To summarize, on the one hand, we notice that an interpretable, meaningful classification is possible. Even with this relatively simple approach, we obtain correct classifications for part of the data. Note that an advantage of such a simple approach is, for example, that it can easily be extended to 3D shape point clouds. On the other hand, we know that the presented preprocessing can be optimized in some aspects for better shape classification.