Preference Neural Network

This paper proposes a preference neural network (PNN) to address the problem of indifference preference orders using a new activation function. PNN also solves the multi-label ranking problem, where labels may have indifference preference orders or subgroups that are equally ranked. PNN follows a multi-layer feedforward architecture with fully connected neurons. Each neuron contains a novel smooth stairstep activation function based on the number of preference orders. PNN inputs represent data features, and output neurons represent label indexes. The proposed PNN is evaluated using a new preference mining dataset that contains repeated label values, which has not been experimented with before. PNN outperforms five previously proposed methods for strict label ranking in terms of accuracy with high computational efficiency.


I. INTRODUCTION
Preference learning (PL) is an extended paradigm in machine learning that induces predictive preference models from experimental data [1]-[3]. PL has applications in various research areas such as knowledge discovery and recommender systems [4]. Object, instance, and label ranking are the three main categories of the PL domain. Of those, label ranking (LR) is a challenging problem that has gained importance in information retrieval by search engines [5], [6]. Unlike the common problems of regression and classification [7]-[13], label ranking involves predicting the relationship between multiple label orders.
For a given instance x from the instance space X, there is a label L associated with x, L ∈ π, where π = {λ_1, .., λ_n} and n is the number of labels. LR is an extension of multi-class and multi-label classification, where each instance x is assigned an ordering of all the class labels in the set L. This ordering gives the ranking of the labels for the given x object and can be represented by a permutation set π = {1, 2, · · · , n}. The label order has the following three properties: irreflexive, where λ_a ⊁ λ_a; transitive, where (λ_a ≻ λ_b) ∧ (λ_b ≻ λ_c) =⇒ λ_a ≻ λ_c; and asymmetric, where λ_a ≻ λ_b =⇒ λ_b ⊁ λ_a. Label preference takes one of two forms, strict and non-strict order. The strict total order (λ_a ≻ λ_b ≻ λ_c ≻ λ_d) can be represented as π = (1, 2, 3, 4), and the non-strict total order (λ_a ≻ λ_b ∼ λ_c ≻ λ_d) can be represented as π = (1, 2, 2, 3), where a, b, c, and d are the label indexes and λ_a, λ_b, λ_c, and λ_d are the ranking values of these labels.
For a non-continuous permutation space, the order is represented by the relations mentioned earlier plus the incomparability binary relation ⊥. For example, the partial order λ_a ≻ λ_b ≻ λ_d can be represented as π = (1, 2, 0, 3), where 0 represents an incomparable relation, since λ_c is not comparable to (λ_a, λ_b, λ_d).
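The three encodings above (strict, non-strict with ties, and partial with incomparability) can be illustrated with a short sketch; the helper names below are ours, not the paper's:

```python
# Illustrative encoding of the ranking notation above: a ranking over n labels
# is a vector pi of integers, where a strict total order uses distinct ranks,
# tied labels share a rank (indifference), and 0 marks incomparability.

def is_strict(pi):
    """True when every compared label holds a distinct rank (strict order)."""
    ranks = [r for r in pi if r != 0]
    return len(ranks) == len(set(ranks))

def has_ties(pi):
    """True when two labels share a preference value (indifference)."""
    ranks = [r for r in pi if r != 0]
    return len(ranks) != len(set(ranks))

def incomparable_labels(pi):
    """Indexes of labels marked 0, i.e. incomparable to the rest."""
    return [i for i, r in enumerate(pi) if r == 0]

strict    = [1, 2, 3, 4]   # lambda_a > lambda_b > lambda_c > lambda_d
nonstrict = [1, 2, 2, 3]   # lambda_b and lambda_c are tied
partial   = [1, 2, 0, 3]   # lambda_c is incomparable
```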
Various label ranking methods have been introduced in recent years [14], such as decomposition-based, statistical, similarity-based, and ensemble-based methods. Decomposition methods include pairwise comparison [15], [16], log-linear models, and constraint classification [17]. The pairwise approach introduced by Hüllermeier [18] divides the label ranking problem into several binary classification problems to predict, for an input x, the pairs of labels λ_i ≻ λ_j or λ_j ≺ λ_i. Statistical methods include decision trees [19], instance-based (Plackett-Luce) methods [20], and Gaussian-mixture-model-based approaches; for example, Grbovic uses Gaussian mixture models to learn soft pairwise label preferences [21].
The artificial neural network (ANN) for ranking was first introduced as RankNet by Burges to solve the problem of object ranking for sorting web documents by a search engine [22]. RankNet uses gradient descent with a probabilistic ranking cost function for each object pair. The multilayer perceptron for label ranking (MLP-LR) [23] employs a network architecture with a sigmoid activation function to calculate the error between the actual and expected values of the output labels. However, it uses a local approach that minimizes the individual error per output neuron by subtracting actual from predicted values, with Kendall error as a global approach. Neither direction uses a ranking objective function in the backpropagation (BP) or learning steps.
Deep neural networks (DNN) have been introduced for object ranking to solve document retrieval problems. RankNet [22], RankBoost [24], LambdaMART [25], and deep pairwise label ranking models [26] are convolutional neural network (CNN) approaches for vector representations of queries and documents. CNNs are used for image retrieval [27] and label classification in remote sensing and medical diagnosis [28]-[35]. Moraga and Heider [36] proposed a generalized multiple-valued neuron with a differentiable soft staircase activation function, represented by a sum of sigmoidal functions. In addition, Aizenberg proposed a generalized multiple-valued neuron using a convex shape to support complex-valued neural networks [37]. Visual saliency detection using the Markov chain model is one approach that simulates the human visual system by highlighting the most important area in an image and calculating superpixels as absorbing nodes [38]-[40]. However, this approach needs saliency optimization of the results and incurs calculation cost [41], [42].
Particle swarm optimization for movement detection is based on the concept of variation and inter-frame difference for feature selection. Swarm algorithms are mainly used for human motion detection in sports, based on probabilistic optimization algorithms [43]-[46] and CNNs [47].
Some of the methods mentioned above and their variants have issues that can be broadly categorized into three types: 1) The ANN predictive probability can be enhanced by limiting the output ranking values to discrete values via the SS functions, instead of the ranges of values produced by the rectified linear unit (ReLU), sigmoid, or softmax activation functions. Prediction is enhanced by using the SS slope as a step function to create discrete values, which accelerates learning by narrowing the output values and thereby the ranking convergence.
2) Ranking based on classification techniques ignores the relations between multiple labels: when the ranking model is constructed from binary classification models, these methods cannot consider the relationships between labels because the activation functions do not provide deterministic multiple values. Ranking by minimizing pairwise classification errors differs from maximizing label ranking performance over all labels, because the multiple pairwise models may reduce ranking unification by increasing ranking-pair conflicts where there is no ground truth; no single generalized model ranks all the labels simultaneously. For example, with D = (1, 1, 1) for π = (λ_a ≻ λ_b ≻ λ_c) and D = (1, 1, 1) for π = (λ_a ≻ λ_c ≻ λ_b), the ranking is unique; however, pairwise classification has no ground-truth ranking for the pairs λ_b ≻ λ_c and λ_c ≻ λ_b, which adds complexity to the learning process. 3) Ignoring the relations between features: the convolution kernel has a fixed size that detects one feature per kernel; thus it ignores the relationships between different parts of the image. For example, a CNN detects a face by combining features (the mouth, two eyes, the face oval, and a nose) with a high probability of classifying the subject, without learning the relationships between these features. The proposed PN kernel instead starts by attending to the important features that have a high amount of pixel-ranking variation.
The main contributions of the proposed neural network are: • Solving label ranking as a machine learning problem.
• Solving the deep learning classification problem by employing computational ranking in feature selection and learning. PNN has several advantages over existing label ranking methods and CNN classification approaches: 1) PNN uses the smooth staircase (SS) as an activation function, which improves predictive probability over the sigmoid and softmax: its step shape narrows the output from a continuous range (-1 to 1 for the sigmoid) to almost discrete multiple values. 2) PNN uses gradient ascent to maximize the Spearman ranking correlation coefficient. In contrast, classification-based methods such as MLP-LR use the absolute difference of root mean square error (RMS), calculated between actual and predicted rankings, and other RMS optimizations, which may not give the best ranking results. 3) PNN is implemented directly as a label ranker. It uses staircase activation functions to rank all the labels together in one model. The SS and PSS functions provide multiple output values during the conversions, whereas MLP-LR and RankNet use sigmoid and ReLU activation functions, which have binary outputs. Thus, PNN ranks all the labels together in one model instead of performing pairwise ranking by classification. 4) PN uses a novel approach to learning feature selection by ranking the pixels and using weighted kernels of different sizes to scan the image and generate the feature maps. The next section explains the ranker network experiment, the problem formulation, the PNN components (activation functions, objective function, and network structure) that solve the ranker's problems, and a comparison between the ranker network and PNN.

A. Initial Ranker
The proposed PNN is based on an initial experiment implementing a computationally efficient label ranker network, based on the Kendall τ error function and sigmoid activation function, using a simple structure as illustrated in section IV, Fig. 6.
The ranker network is a fully connected, three-layer net. The input represents one data instance with three inputs; there are six neurons in the hidden layer and three output neurons representing the labels' indexes. Each output neuron represents a ranking value. A small toy data set is used in this experiment. The ranker uses RMS gradient descent as an error function to measure the difference between the predicted and actual ranking values, with Kendall τ as a stopping criterion. The same ANN structure, number of neurons, and learning rate using the SS activation function, the Spearman error function, and gradient ascent of ρ will be discussed in section IV. The ranking convergence reaches τ ≈ 1 after 160 epochs using the sigmoid function [48]. The sigmoid and ReLU shapes have a slightly high rate of change of y, producing a larger output range of data. Therefore, we consider ranking performance one of the disadvantages of the sigmoid function in the ranker network.
The ranker network has two main problems. 1) The ranker uses two different error functions: RMS for learning and Kendall τ for the stopping criterion. Kendall τ is not used for learning because it is not continuous or differentiable. The two functions are not consistent: the stopping criterion measures relative ranking and RMS does not, which may lead to incorrect stopping. Improving RMS may also fail to improve ranking performance, as illustrated in Fig. 3, which compares the ranker network's evaluation using ρ and RMS.
2) Convergence takes many iterations to reach a ranking of τ ≈ 1, depending on the shape of the sigmoid or ReLU function and the learning rate, as shown in the experiment video [48]. This is due to the slope between -1 (or 0) and 1: the prediction probability is almost equal across the values from -1 (or 0) to 1.

B. Problem Formulation
For multi-class and multi-label problems, learning the data's preference relation predicts both the class and the label ranking, i.e., for data instances D ∈ {x_1, x_2, . . . , x_n}, the output labels are predicted as a ranked set of labels with preference relations L = {λ_y1, . . . , λ_yn}. PNN creates a model that learns from an input set of ranked data to predict a set of new ranked data. The next section presents the initial experiment to rank labels using a standard network structure.

C. Activation Functions
The usual ANN activation functions have a binary output or a range of values based on a threshold; they do not produce multiple deterministic values on the y-axis. This paper proposes new functions that slow the differential rate around the ranking values on the y-axis to solve ranking instability. The proposed functions are designed to be non-linear, monotonic, continuous, and differentiable, using a polynomial of tanh functions. The step width maintains the stability of the ranking during the forward and backward processes. Moraga [36] introduced a similar multi-valued function; however, its exponential derivative was not applied in an ANN implementation. Moraga's exponential function is geometrically similar to the step function [49]. The newly proposed functions consist of tanh polynomials instead of exponentials, due to the difficulty of implementation. The new functions detect consecutive integer values, and the transition from low to high rank (or vice versa) is fast and does not interfere with threshold detection.

1) Positive Smooth Staircase (PSS):
A positive smooth staircase (PSS) is a non-linear, monotonic activation function represented as a bounded smooth staircase starting from x = 0; thus, it is not geometrically symmetrical around the y-axis, as shown in Fig. 1. PSS is a polynomial of multiple tanh functions and is therefore differentiable and continuous. The function squashes the output neuron values during the FF into finitely many integer values. These values represent the preference values from 0 to n, where 0 represents the incomparable relation ⊥ and values from 1 to n represent the label ranking. The activation function is given in Eq. 1. PSS is scaled by increasing the step width w, where n is the number of stair steps (equal to the number of labels to rank), w is the step width, and c is the stair curvature (c = 100 and c = 5 for sharp and smooth steps, respectively). s is a scaling factor that reduces the height of each step so that rank values can carry decimal places for regression problems (s = 10 and s = 100 for one and two decimal places, respectively); s is calculated as in Eq. 2,
and w is the step width as shown in Eq. 3.
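The paper's Eq. 1 is not reproduced here, but the described behavior of PSS (n smooth steps of width w and curvature c, scaled by s, starting at x = 0) can be sketched as a sum of shifted tanh terms; the exact functional form below is our assumption, chosen only to match that description:

```python
import numpy as np

# A plausible sketch of the PSS activation as a sum of shifted tanh terms.
# Each term contributes one smooth step of height 1/s at x = k*w; large c
# gives sharp steps, small c gives smooth ones.

def pss(x, n=4, w=1.0, c=100.0, s=1.0):
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for k in range(n):
        y += 0.5 * (1.0 + np.tanh(c * (x - k * w)))
    return y / s
```

With n = 4 and w = 1, the output plateaus near 0, 1, 2, 3, 4 as x crosses each step, which is the "squashing to finite integer values" behavior described above.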
The proposed SS represents a staircase similar to PSS; however, SS has a variable boundary value used as a hyperparameter in the learning process. The derivative of the activation function is discussed in section III, and the performance comparison between SS and PSS is given in Section V. The activation function is given in Eq. 4,
where c is the step curvature, n is the number of ranked labels, and b is the boundary value on the x-axis; SS lies between -b and b. Y_max is the maximum value to rank: e.g., if Y_max = 3 and values have one decimal place, then n = 30. The SS function has the shape of smooth stair steps, where each step represents an integer label ranking value on the y-axis, as shown in Fig. 1. The SS step is not flat; it has a differential slope. The function's boundary on the x-axis runs from -b to b; therefore, input values must be scaled from -b to b. The step width is 1 when n = 2b. The convergence rate is based on the step width, though convergence may take less time depending on the network hyperparameters, as shown in Fig. 2 (a) and (b). SS is scaled by increasing the boundary value b.

D. Ranking Loss Function
Two main error functions have been used for label ranking: Kendall τ [50] and Spearman ρ [51]. However, the Kendall τ function lacks continuity and differentiability. Therefore, the Spearman ρ correlation coefficient is used to measure the ranking between output labels: its error derivative drives the gradient ascent in BP, and the correlation is used as a ranking evaluation function for the convergence stopping criterion. τ_Avg is the average τ per label divided by the number of instances m, as shown in line 8 of Algorithm 1. Spearman ρ measures the relative ranking correlation between actual and expected values; it is preferred over the absolute difference of root mean square error (RMS) because gradient descent on RMS may not reduce the ranking error. For example, π_1 = (1, 2.1, 2.2) and π_2 = (1, 2.2, 2.1) have a low RMS = 0.081 but also a low ranking correlation, ρ = 0.5 and τ = 0.3. The Spearman error function is represented by Eq. 5, where y_i, yt_i, i, and m represent the output rank value, expected rank value, label index, and number of instances, respectively.
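The numerical example above can be checked directly; `rms` and `spearman_rho` below are our illustrative helpers, with Spearman computed via the standard rank-difference formula (assuming no ties):

```python
import numpy as np

# Reproducing the paper's example: a pair of rankings with a small RMS error
# but a much weaker rank correlation, which is why PNN optimizes Spearman rho
# rather than RMS.

def rms(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def spearman_rho(a, b):
    """Spearman rho via 1 - 6*sum(d^2)/(n^3 - n), assuming no ties."""
    ra = np.argsort(np.argsort(a)) + 1   # ranks of a
    rb = np.argsort(np.argsort(b)) + 1   # ranks of b
    n = len(a)
    d = ra - rb
    return 1.0 - 6.0 * float(np.sum(d ** 2)) / (n ** 3 - n)

pi1 = [1.0, 2.1, 2.2]
pi2 = [1.0, 2.2, 2.1]
print(round(rms(pi1, pi2), 3))   # 0.082 -- looks like a good fit
print(spearman_rho(pi1, pi2))    # 0.5   -- but the ranking disagrees
```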
E. PNN Structure
1) One middle layer: An ANN may have multiple hidden layers. However, we propose PNN with a single middle layer instead of multiple hidden layers, because ranking performance is not enhanced by increasing the number of hidden layers, due to the fixed multi-valued neuron output, as shown in Fig. 4. Seven benchmark data sets [52] were experimented with using the SS function and one, two, and three hidden layers, with the following hyperparameters: learning rate (l.r.) = 0.05, and each layer has i = 100 neurons and b = 10. We found that increasing the number of hidden layers decreases ranking performance and requires more iterations to reach ρ ≈ 1. The low performance arises because the shape of SS produces multiple deterministic values, which decrease the arbitrarily complex decision regions and degrees of freedom per extra hidden layer. 2) Preference Neuron: A preference neuron (PN) is a multi-valued neuron that uses PSS or SS as its activation function. Each function has a single output; however, the PN output is graphically drawn with n arrow links representing the multiple deterministic values. A PN in the middle layer connects to only n output neurons, with stp = n + 1, where stp is the number of SS steps. The PN in the output layer represents the preference value. The middle and output PNs produce a preference value from 0 to n, as illustrated in Fig. 5.
The PNN is a fully connected network of multiple-valued preference neurons (for example, a 16-neuron input layer, a 300-hidden-neuron middle layer, and a 16-neuron output layer for one instance).
An ANN is scaled up by increasing its hidden layers and neurons; however, increasing the hidden layers in PNN does not enhance the ranking correlation, because it does not arbitrarily increase the complex decision regions and degrees of freedom needed to solve more complex ranking problems. This limitation is due to the semi-discrete multi-valued activation function, which limits the output data variation. Therefore, instead of increasing the hidden layers, PNN is scaled up by increasing the number of neurons in the middle layer, scaling the input data boundary value, and increasing the PSS step width and SS boundaries (which equal the input data scaling value), which increases data separability.
PNN reaches a ranking of ρ ≈ 1 after 24 epochs, compared to the initial ranker network, which reaches the same result in 200 iterations. The ranking convergence is shown in Fig. 7 and demonstrated in the video [48]. A summary of the three networks is presented in Table I.
The output labels represent the ranking values. The differentiable PSS and SS functions accelerate convergence within a few iterations due to the staircase shape, which achieves stability in learning. PNN simplifies the calculation of FF and BP, updating weights in two steps due to the single-middle-layer architecture. Therefore, the batch weight-updating technique is not used in PNN; pattern updating is used in one step. The network bias is low due to the limited data variance of the preference neuron output; thus it is not calculated. Each neuron uses the SS or PSS activation function in the FF step and calculates the preference number from 1 to n, where n is the number of label classes. The processes of FF and BP are executed in two steps until ρ_Avg ≈ 1 or the number of iterations reaches 10^6, as mentioned in the algorithm section.
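A minimal sketch of this two-step FF/BP cycle with gradient ascent on ρ, under our assumptions (no bias, pattern-by-pattern updates, an assumed tanh-staircase form for SS, and a numerical SS derivative; all names are ours, not the paper's):

```python
import numpy as np

def ss(x, n=3, b=3.0, c=5.0):
    """Smooth staircase on [-b, b] with n steps (assumed form)."""
    w = 2.0 * b / n
    return sum(0.5 * (1.0 + np.tanh(c * (x + b - (k + 1) * w))) for k in range(n))

def d_ss(x, n=3, b=3.0, c=5.0, eps=1e-4):
    """Numerical derivative of the staircase (nonzero on the step slopes)."""
    return (ss(x + eps, n, b, c) - ss(x - eps, n, b, c)) / (2 * eps)

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.05, 0.05, (3, 6))   # input -> middle
W2 = rng.uniform(-0.05, 0.05, (6, 3))   # middle -> output

def forward(x):
    h = ss(x @ W1)          # step 1: middle-layer preference values
    y = ss(h @ W2)          # step 2: output-layer preference values
    return h, y

def train_step(x, yt, lr=0.05, n=3):
    """One pattern update by gradient ascent on rho ~ 1 - 6*sum(d^2)/(n^3-n)."""
    global W1, W2
    h, y = forward(x)
    d = y - yt
    drho_dy = -12.0 * d / (n ** 3 - n)        # d(rho)/d(y_i)
    delta_out = drho_dy * d_ss(h @ W2)
    W2 += lr * np.outer(h, delta_out)
    delta_mid = (W2 @ delta_out) * d_ss(x @ W1)
    W1 += lr * np.outer(x, delta_mid)
```

The two `delta` computations correspond to the two BP steps made possible by the single middle layer.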
The SS step width decreases as the number of labels increases; thus, we increase the function boundary b to bring the step width back to 1 so that the ranking converges. In addition, a few complex data sets may need more data separability to enhance the ranking. Therefore, we use the b value as a hyperparameter to keep the stair width ≥ 1 and normalize input data from -b to b. The following section describes the data preprocessing steps, feature selection, and the components of PN.

III. PN COMPONENTS
A. Image Preprocessing 1) Greyscale Conversion: Data scaling as red, green, and blue (RGB) colors is not considered for ranking because PN measures the preference values between pixels. Thus, the image is converted from RGB color to greyscale.
2) Pixels' Sorting: The image is ranked from π = {λ_1, .., λ_m} to π = {λ_1, .., λ_k}, where the maximum greyscale value is λ_m = 255 and λ_k is the maximum ranked pixel value, as illustrated in Fig. 8 (a). 3) Pixels' Averaging: Ranking image pixels yields an almost low ranking correlation due to noise, scaling, light, and object movement; therefore, window averaging is proposed, calculating the mean pixel value of a small flattened 2x2 window of 4 pixels, as shown in Fig. 9. The overall image ρ of pixels increased from 0.2 to 0.79 in (a and b), from 0.137 to 0.75 for noisy images in (c and d), and from -0.18 to 0.71 for scaled images in (e and f).
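The preprocessing steps above can be sketched as follows; the function names and the dense-ranking scheme are our illustrative choices:

```python
import numpy as np

# Sketch of the preprocessing pipeline: greyscale conversion, pixel ranking
# (greyscale values 0..255 compressed to dense ranks 1..k), and 2x2 window
# averaging to suppress noise before ranking correlation is measured.

def to_grey(rgb):
    """Luminosity greyscale conversion (standard ITU-R 601 weights)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def rank_pixels(img):
    """Map grey values to dense ranks 1..k; equal pixels share a rank."""
    values = np.unique(img)                      # sorted unique grey levels
    lookup = {v: i + 1 for i, v in enumerate(values)}
    return np.vectorize(lookup.get)(img)

def average_2x2(img):
    """Mean over non-overlapping 2x2 windows (trailing odd row/col dropped)."""
    h, w = img.shape
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```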
The two approaches, pixel ranking and averaging, have been tested on remote sensing and face images to detect similarity, showing high ranking correlations for different window sizes, as shown in Fig. 10. High correlation is detected by starting from a large window size (the image size) and reducing the size and scanning until the highest correlation is reached.

B. Feature Selection By Attention
Feature selection for the kernel proceeds by selecting the features with a high group of pixel-ranking variations, indicating the importance of the scanned kernel area. This kind of hard attention makes the selection based on a threshold over pixel ranking values, reducing the dimension of the input image.
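A minimal sketch of this hard-attention selection, assuming the max-min spread as the measure of pixel-ranking variation and an illustrative threshold (both are our choices, not the paper's):

```python
import numpy as np

# Hard attention over a ranked image: keep only the kernel windows whose
# pixel-rank variation exceeds a threshold, discarding uninformative areas.

def rank_variation(window):
    """Spread of ranked pixel values inside one window."""
    return float(window.max() - window.min())

def attended_windows(ranked_img, k=3, threshold=5.0):
    """Top-left coordinates of k x k windows passing the threshold."""
    h, w = ranked_img.shape
    keep = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            if rank_variation(ranked_img[i:i + k, j:j + k]) > threshold:
                keep.append((i, j))
    return keep
```

A flat region produces no windows; a region with a strong rank change passes the threshold and is kept for the kernel scan.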

C. Feature Extraction
This paper proposes a new approach to image feature selection based on the preference values between pixels, instead of the convolution over pixel arrays implemented in CNNs. The PN's features are based on a ranking computational space. Therefore, the kernel window size is considered a factor in feature selection.
1) Pixels' Resorting: The flattened window's values are sorted for each kernel window in the image. Fig. 8 (b) shows a 3x3 window ranging from λ_k1 = 23 to λ_k2 = 9. Pixel sorting reduces the data margin and thus the computational complexity. 2) Weighted Ranker Kernel: The kernel weights are randomly initialized from -0.05 to 0.05. The kernel learns the features by BP of its weights to select the best feature. The partial change in the kernel is calculated by differentiating the Spearman correlation, as in Eq. 6: dK_w = 2 · Img_w · dρ · (-6 / (n³ - n)). Different kernel sizes can be used for large images; we use three different kernels to capture the relations between different features.
3) Max Pooling: Max pooling is used to reduce the feature map's size and select the highest correlation values to feed to the PNN.

D. PN Structure
PN is the deep learning structure of PNN for image classification. It consists of five layers: a ranking feature map, a max pooling layer, and three PNN layers. PN has one or multiple PNNs of different sizes connected by one output layer. Each PNN uses SS or PSS with ϕ n = 2 for binary ranking to map to classification. The number of output neurons is the number of classes. The structure is shown in Fig. 11. PN has one or more ranker kernels of different sizes, each with a corresponding PNN. PN uses the weighted ranking kernel to scan the image and extract a feature map of Spearman correlation values between the kernel and the scanned ranked image window, ρ(π_k, π_w), where π_k is the kernel's preference values and π_w is the scanned image window's preference values. Each kernel scans the image one step at a time, creating a Spearman feature list. Max pooling is used to minimize the feature map used as input to the PNN.
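The kernel scan described above can be sketched as follows: each feature-map cell holds the Spearman correlation ρ(π_k, π_w) between the flattened kernel and the image window (stride 1), followed by max pooling. Helper names and the 2x2 pooling size are our choices:

```python
import numpy as np

def spearman(a, b):
    """Spearman rho via the rank-difference formula, assuming no ties."""
    ra = np.argsort(np.argsort(a)) + 1
    rb = np.argsort(np.argsort(b)) + 1
    n = len(a)
    return 1.0 - 6.0 * float(np.sum((ra - rb) ** 2)) / (n ** 3 - n)

def feature_map(img, kernel):
    """Slide the ranker kernel over the image; one correlation per window."""
    k = kernel.shape[0]
    h, w = img.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = spearman(img[i:i + k, j:j + k].ravel(), kernel.ravel())
    return out

def max_pool(fmap, p=2):
    """Keep the highest correlation in each p x p block."""
    h, w = fmap.shape
    return fmap[:h - h % p, :w - w % p].reshape(h // p, p, w // p, p).max(axis=(1, 3))
```

A window whose pixel ranks follow the same order as the kernel's scores ρ = 1, so matching patterns light up in the feature map regardless of absolute grey values.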

E. Choosing The Kernel Size
The kernel size is chosen based on hard attention to the group of pixels with the highest ranking variation. The process scans the image sequentially, starting from a small size, to find the size with the highest pixel-ranking variation. For example, for the MNIST dataset, where images have a size of 28x28, meaningful features are extracted using kernel sizes of 10x10, 15x15, 20x20, and 25x25.

A. Baseline Algorithm
Algorithm 1 represents the three functions of the network learning process: feed-forward (FF), BP, and updating weights (UW). Algorithm 2 represents the learning flow of PN. Algorithm 3 represents the simplified two-step BP function.

Algorithm 1: PNN learning flow
• BP starts by calculating the error of the output layer. This time complexity is then multiplied by the number of epochs p. 2) Input Neurons: The number of PN input neurons is given by Eq. 10, where w and h are the width and height of the kernel and the image.

V. NETWORK EVALUATION
This section evaluates PNN with different activation functions and architectures. All weights are initialized to 0 when comparing activation functions, and networks A and B share the same randomly initialized weights when evaluating the structure.

A. Activation Functions Evaluation
PNN is tested on the iris and stock data sets using five activation functions: SS, PSS, ReLU, sigmoid, and tanh. PNN has one middle layer, the number of hidden neurons (h.n.) is 50, and l.r. = 0.05. Fig. 13 shows the convergence after 500 iterations using the five activation functions (SS, PSS, sigmoid, ReLU, and tanh), respectively. We noticed that PSS and SS have a stable rate of ranking convergence compared to sigmoid, tanh, and ReLU. This stability is due to the stair-step width, which leads each point to reach the correct ranking during FF and BP in fewer epochs. 1) PSS and SS Evaluation: As shown in Fig. 13, PSS reaches convergence and remains stable over many iterations compared to SS; however, SS reaches a better ρ than PSS. The good performance of SS is due to the symmetry of the SS function on the x-axis: the SS shape handles both positive and negative normalized data, reducing the number of iterations needed to reach the correct ranking values. For SS and PSS to perform the same, the input data should be scaled from 0 to (step width × #steps) for PSS and from -b to b for SS.
2) Missing Labels Evaluation: The activation functions are evaluated by removing a random number of labels per instance. PNN marks a missing label as -1 and neglects its error calculation during BP (δ = 0); thus, the missing label's weights remain constant for that learning iteration. The missing-label approach is applied to 20% and 60% of the training data. The ranking performance decreases as the number of missing labels increases; however, SS and PSS converge more stably than the other functions. This evaluation is performed on the iris data set, as shown in Fig. 13.
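The missing-label rule (δ = 0 for labels marked -1) can be sketched as a mask over the error term; the variable names and the ρ-derivative form are our assumptions:

```python
import numpy as np

# Labels marked -1 are skipped during BP by zeroing their error term, so
# their incoming weights stay constant for that pattern.

def masked_delta(y, yt, n_labels):
    """Per-output error signal with delta = 0 on missing (-1) labels."""
    y, yt = np.asarray(y, float), np.asarray(yt, float)
    d = y - yt
    drho_dy = -12.0 * d / (n_labels ** 3 - n_labels)   # gradient-ascent term
    drho_dy[yt == -1] = 0.0                            # neglect missing labels
    return drho_dy

delta = masked_delta([1.0, 2.0, 3.0], [2.0, -1.0, 3.0], 3)
```

Here the second label is missing, so its delta is zero and only the first label contributes a weight update.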

3) Statistical Test:
The PNN results were evaluated using receiver operating characteristic (ROC) curves. The true positives and negatives for each rank are evaluated per label of the wine dataset, as shown in Fig. 14. The confusion matrices on the wine and glass data sets are shown in Fig. 15, where τ = 0.947 and 0.84, and accuracy = 0.935 and 0.8, in (a) and (b) respectively. 4) Dropout Regularization: Dropout is applied as a regularization approach to enhance PNN ranking stability by reducing over-fitting. We drop out the weights that have a probability of less than 0.5; these dropped weights are removed from the FF, BP, and UW steps. The comparison between PNN with and without dropout is shown in Fig. 16. The gap between the training model and ten-fold cross-validation curves is reduced using dropout regularization with hyperparameters (l.r. = 0.05, h.n. = 100) on the iris data set. The dropout technique is used with all the data ranking results in the next section.
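Our reading of this weight-level dropout can be sketched as a binary mask sampled per iteration; the threshold interpretation is our assumption:

```python
import numpy as np

# Weight dropout: weights whose sampled probability falls below 0.5 are
# excluded from FF, BP, and the weight update for that iteration.

def dropout_mask(shape, keep_threshold=0.5, rng=None):
    """1.0 keeps a weight, 0.0 drops it for this iteration."""
    rng = rng or np.random.default_rng()
    return (rng.random(shape) >= keep_threshold).astype(float)

rng = np.random.default_rng(42)
W = np.ones((4, 4))
mask = dropout_mask(W.shape, rng=rng)
W_active = W * mask   # used in FF/BP; dropped weights receive no gradient
```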
The following section evaluates the ranking experiments using the label benchmark data sets.

VI. EXPERIMENTS
This section describes the classification and label ranking benchmark data sets, the results using PN and PNN, and a comparison with existing classification and ranking methods. 2) Label Ranking Data Sets: PNN is evaluated on three different types of benchmark data sets to assess multi-label ranking performance. The first type focuses on exception preference mining [56]; the 'algae' data set is of this type, highlighting the indifference preference problem, where labels have repeated preference values [57]. The German elections 2005 and 2009 and modified sushi data sets are considered new and restricted preference data sets. The second type is real-world data related to biological science [18]. The third type is semi-synthetic (s-s), taken from the KEBI Data Repository at the Philipps University of Marburg [52]. None of the data sets has a ranking ground truth, and all labels have a continuous permutation space of relations between labels. Table II summarizes the preference mining [57], real-world [58], and semi-synthetic (s-s) [52] data sets.

B. Results
1) Image Classification Results: PN has three kernels of sizes 5, 10, and 20 and is tested on the CIFAR-100 [54] data set, and one kernel of size 5 for the Fashion-MNIST data set [55]. Table III shows the results compared to other convolutional networks.
2) Label Ranking Results: PNN is evaluated on restricted and non-restricted label ranking data sets. The results are derived using Spearman ρ and converted to the Kendall τ coefficient for comparison with other approaches. For data validation, we used ten-fold cross-validation. To avoid over-fitting, the hyperparameters, i.e., l.r. = (0.0008, 0.0005, 0.005, 0.05, 0.1), hidden neurons = no. inputs + (5, 10, 50, 100, 200, 300, 400, 450), and scaling boundaries from 1 to 250, are chosen within each cross-validation fold by using the best l.r. on each fold and calculating the average τ of the ten folds. Grid search is used to obtain the best hyperparameters. For type B, we use three output groups, l.r. = 0.001, and w_b = 0.01.
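The fold-wise grid search described above can be sketched as follows; `train_and_eval` is a stand-in for PNN training, and the hidden-neuron offsets assume 100 inputs purely for illustration:

```python
import itertools
import numpy as np

# Grid search over learning rates and hidden-neuron counts, scored by the
# average Kendall tau over ten folds.

def ten_folds(n_instances, rng=None):
    idx = np.arange(n_instances)
    (rng or np.random.default_rng(0)).shuffle(idx)
    return np.array_split(idx, 10)

def grid_search(X, y, grid, train_and_eval):
    """Return (best config, best average tau) over all grid combinations."""
    best = (None, -np.inf)
    folds = ten_folds(len(X))
    for params in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), params))
        taus = []
        for k in range(10):
            test_idx = folds[k]
            train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
            taus.append(train_and_eval(X[train_idx], y[train_idx],
                                       X[test_idx], y[test_idx], **cfg))
        avg_tau = float(np.mean(taus))
        if avg_tau > best[1]:
            best = (cfg, avg_tau)
    return best

grid = {"lr": [0.0008, 0.0005, 0.005, 0.05, 0.1],
        "hidden": [105, 110, 150, 200]}   # no. inputs (100) + offsets
```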
3) Benchmark Results: Table IV summarizes PNN's ranking performance on 16 strict label ranking data sets by l.r. and m.n. The results are compared with four label ranking methods: supervised clustering [58], supervised decision trees [52], MLP label ranking [23], and label ranking tree forests (LRT) [67]. Each method's results are generated by ten-fold cross-validation. The comparison selects only the best approach for each method.
During the experiments, it was found that ranking performance increases with the number of middle neurons, up to a maximum of 20 times the number of features. As shown in Table VI, the real data sets are ranked using PNN with dropout regularization due to their complexity and over-fitting; dropout requires more epochs to reach high accuracy. All results are obtained using a single hidden layer with various hidden neuron counts (100 to 450) and the SS activation function. The Kendall τ error converges close to 1 after 2000 iterations, as shown in Fig. 17. Table IV compares PNN with similar approaches used for label ranking: decision trees [58], MLP-LR [23], and label ranking tree forests (LRT) [67]. In this comparison, we choose the method that has the best results for each approach.

4) Preference Mining Results:
The ranking performance on the new preference mining data sets is presented in Table II. Two hundred fifty hidden neurons are used to enhance the ranking performance on the algae data set's repeated label values. However, restricted label ranking data sets of the same type (i.e., the German elections and sushi) did not require a high number of hidden neurons and incurred less computational cost.
Experiments on the real-world biological data set were conducted using supervised clustering (SC) [58]. Table V presents the comparison between PNN and supervised clustering on this data in terms of Loss LR, as given in Eq. 11, where τ is the Kendall τ ranking error and Loss LR is the ranking loss function. An SS function with 16 steps is used to rank the Wisconsin data set, which has 16 labels. Increasing the number of steps in the interval and scaling the features between -100 and 100 makes the step width small; because the data set has many labels, the number of hidden neurons is increased to exceed τ = 0.5 and enhance ranking performance.
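The multi-step behaviour described above can be illustrated with a generic smooth-staircase function. The exact SS formula from the paper is not reproduced here; this is an assumed sum-of-sigmoids form that plateaus near the integers 1..n, with the step width controlled by a sharpness parameter:

```python
# Illustrative smooth-staircase activation with n steps (assumed form,
# not the paper's exact SS definition): each sigmoid contributes one
# unit step centred at an integer threshold, so the output plateaus
# near 0, 1, ..., n_steps.
import numpy as np

def smooth_stairstep(x, n_steps, width=0.1):
    """Smooth staircase over array x: ~0 below the first threshold,
    rising by ~1 near each integer 1..n_steps, saturating at n_steps."""
    steps = np.arange(1, n_steps + 1)
    return np.sum(1.0 / (1.0 + np.exp(-(x[..., None] - steps) / width)),
                  axis=-1)
```

Shrinking `width` sharpens the plateaus, which mirrors the observation above that more steps over a wider feature scale yield a smaller step width.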

C. Computational Platform
PNN and PN are implemented from scratch without the TensorFlow API and developed using the Numba API to speed up execution on the GPU, with CUDA 10.1 and TensorFlow-GPU 2.3 for GPU execution. Experiments were executed on the University of Technology Sydney High-Performance Computing cluster running Linux RedHat 7.7, with an NVIDIA Quadro GV100 GPU and 32 GB of memory. A non-GPU version of PNN is available in a GitHub repository [68].

D. Discussion and Future Work
It can be noticed from Table II that PN performs better than ResNet [59] and WRN [60]. Different PN architectures could be used to enhance the results and reach the state of the art in image classification [69]-[71]. It can be noticed from Table III that PNN outperforms the alternatives on SS data sets with τ Avg = 0.8, whereas supervised clustering, decision tree, MLP-ranker, and LRT achieve τ Avg = 0.79, 0.73, 0.62, and 0.475, respectively. Moreover, PNN performs almost 50% better than supervised clustering in terms of the ranking loss function Loss LR on the real-world biological data set, as shown in Table V. PNN is thus applicable to both classification and ranking problems, and using ranking over the input data as a feature selection criterion is a novel approach for deep learning. Encoding the labels' preference relations to numeric values and ranking the output labels simultaneously in one model is an advance over pairwise label ranking based on classification. PNN could be used to solve new preference mining problems. One such problem is incomparability between labels, where label ranking has the incomparability relation ⊥, i.e., the ranking space (λ a ≻ λ b ⊥ λ c ) is encoded to (1, 2, -1) and (λ a ≻ λ b ) ⊥ (λ c ≻ λ d ) is encoded to (1, 2, -1, -2). PNN could also be used to solve the new problem of non-strict partial order ranking, i.e., the ranking space (λ a ≻ λ b ⪰ λ c ) is encoded to (1, 2, 3) or (1, 2, 2). Future research may enhance PN by adding the kernel size and SS parameters as learnable parts of the deep network, so the best kernel size and SS step width are chosen automatically, which could enhance image attention, and may modify the PNN architecture by adding bias terms and addressing noisy label ranking problems.

VII. CONCLUSION
This paper proposed a novel method to rank a complete multi-label space in the output labels and to extract features in both shallow and deep learning. PN is a new research direction for image recognition based on new kernel and pixel calculations. PNN and PN are native ranker networks for image classification and label ranking problems that use SS or PSS to rank the multiple labels per instance. The novelty of these networks lies in a new kernel mechanism, activation function, and objective function. The approach takes less computational time with a single middle layer and indexes the multiple labels as output neurons with preference values. The neuron output structure can be mapped to an integer ranking value; thus, PNN accelerates ranking learning by assigning the rank value to more than one output neuron to reinforce updating the random weights. PNN is implemented in Python 3.6 [68], and the activation functions are modelled using Wolfram Mathematica [72]. A video demo that shows the ranking learning process on toy data is available for download [48].