Preprint
Article

This version is not peer-reviewed.

A Robust Clustering Framework Combining Minimum Description Length and Genetic Optimization

Submitted:

07 December 2024

Posted:

09 December 2024


Abstract

Clustering algorithms have been instrumental in advancing the field of data analysis, providing valuable techniques for organizing data into meaningful groups. However, individual clustering algorithms often suffer from inherent criteria limitations and biases, which prevent the development of a universal clustering method capable of delivering optimal solutions across diverse datasets. Addressing this challenge, we propose a novel clustering method that combines the Minimum Description Length (MDL) principle with a genetic optimization algorithm to overcome these limitations. The proposed method begins by generating an initial clustering solution using an ensemble clustering technique. This initial solution serves as a baseline, which is subsequently refined through evaluation functions grounded in the MDL principle and optimized using a genetic algorithm. By integrating the MDL principle, the proposed method not only incorporates external information from the input clusters but also adapts to the intrinsic properties of the dataset, thereby reducing the dependence of the final results on the input clusters. This adaptive approach ensures that the clustering process remains data-driven and robust. The effectiveness of the proposed method was evaluated using thirteen standard datasets, employing four widely recognized validation criteria: accuracy, normalized mutual information (NMI), Fisher score, and adjusted Rand index (ARI). Experimental results demonstrate that the proposed method consistently produces clusters with superior accuracy, greater stability, and reduced biases compared to traditional clustering methods. The results highlight the method's versatility, making it suitable for clustering a wide range of datasets with diverse characteristics. By leveraging the strengths of MDL and genetic optimization, this study presents a robust and adaptable clustering framework that advances the field of data clustering, offering a reliable tool for handling complex and varied datasets.

Keywords: 

Introduction

Cluster analysis is a fundamental tool in modern data science, providing critical insights during the preprocessing stage of large-scale data analysis. This technique is particularly valuable when dealing with high-dimensional datasets, which can consist of tens, hundreds, or even thousands of attributes. By organizing data samples into clusters based on their similarities, clustering allows researchers to uncover hidden patterns and relationships in unlabeled data. The core principle of clustering is to group samples with similar attributes into the same cluster while ensuring that samples in different clusters exhibit distinct characteristics [1,2,3,4]. Over the years, clustering has found applications across a diverse range of fields, including bioinformatics, pattern recognition, machine learning, data mining, and image processing [3,5,6,7].
Clustering plays a pivotal role in fields like bioinformatics, where gene expression data is often analyzed using clustering methods. By partitioning genes into clusters, researchers can infer relationships between genes, predict unknown functions, and better understand biological processes. For example, genes with similar functional roles or those participating in the same genetic pathways are typically grouped within the same cluster. This not only aids in exploring the genetic landscape but also facilitates advancements in precision medicine and disease diagnostics [8].
Beyond bioinformatics, clustering serves as a cornerstone for analyzing unlabeled data in diverse contexts. Its importance can be summarized in five key areas [9,12]:
  • Facilitating Data Labeling: In many real-world scenarios, labeling datasets manually is prohibitively expensive and time-consuming. Clustering enables researchers to group similar data points, reducing the effort required for manual labeling.
  • Reverse Analysis: In data mining, clustering allows researchers to identify patterns and relationships in large, unlabeled datasets. Once clusters are identified, human observers can interpret and label the results for further analysis.
  • Adaptability to Change: In dynamic environments, such as seasonal food classification or adaptive recommendation systems, clustering can track evolving data attributes, ensuring that models remain relevant and accurate over time.
  • Feature Extraction: Clustering helps identify key attributes or features in datasets, enabling efficient data representation and improved performance in downstream machine learning tasks.
  • Structural Insight: By revealing the inherent structure and relationships within data, clustering provides valuable insights that inform decision-making and exploratory analysis.
With the explosion of big data across various domains, understanding and analyzing massive, high-dimensional datasets has become increasingly important. As a result, clustering has emerged as a crucial technique for making sense of complex data. However, the effectiveness of clustering largely depends on the choice of algorithm, which poses several challenges. Most existing clustering algorithms are broadly classified into hierarchical clustering and partitional clustering methods, each with its strengths and limitations [10].
Hierarchical clustering organizes data into a tree-like structure, or dendrogram, which represents the nested relationships among data points. Clusters are formed by cutting the dendrogram at a specific level. This approach can be further divided into two strategies: divisive and agglomerative clustering [11].
  • Divisive Clustering: Begins with all data points in a single cluster and iteratively splits them into smaller clusters until each point forms its own cluster.
  • Agglomerative Clustering: Starts with each data point as its own cluster and successively merges the closest clusters until all points are grouped into a single cluster.
While hierarchical clustering provides valuable insights into the data structure, it suffers from high computational complexity, making it unsuitable for large-scale datasets [11].
Partitional clustering methods, such as k-means, k-medoids, Forgy, and Isodata, divide the dataset into a predetermined number of clusters based on distance metrics or optimization criteria [13,14]. Among these, k-means is widely used due to its simplicity, low computational cost, and ability to handle large datasets. However, k-means has several known limitations:
  • Sensitivity to Initialization: Poor initialization of cluster centroids can lead to suboptimal results.
  • Sensitivity to Outliers: Noise or outliers in the dataset can significantly skew clustering outcomes.
  • Input Layout Dependency: The algorithm’s performance depends heavily on the spatial distribution of data.
  • Result Variability: Different initializations may produce inconsistent clustering results.
Given these limitations, traditional clustering methods often struggle to address the complexities of high-dimensional and dynamic datasets.
Emergence of Ensemble Clustering
To overcome the inherent weaknesses of individual clustering algorithms, ensemble clustering has emerged as a robust alternative in recent years. Ensemble clustering combines the outputs of multiple clustering algorithms to generate more accurate and stable clustering results [25,28,29,30,31,32]. By leveraging the diversity of individual clustering solutions, ensemble methods reduce the biases and dependencies associated with single algorithms. Ensemble clustering typically involves two main stages:
  • Generation of Input Clusters: Produces a diverse set of clustering solutions using different algorithms or configurations [30,37].
  • Consensus Combination: Aggregates the input clusters to form a unified clustering result, minimizing inconsistencies and maximizing accuracy [30,37,41,42,43,44].
Despite its advantages, ensemble clustering faces challenges in effectively combining diverse input clusters, particularly when some inputs are highly inaccurate or contradictory. Addressing this limitation requires innovative methods to mitigate the dependency on input clusters while incorporating additional data-driven insights.
This study introduces a novel clustering approach that integrates the Minimum Description Length (MDL) principle with a genetic optimization algorithm to overcome the biases and limitations of existing methods. The proposed method begins with an initial solution generated through ensemble clustering. Using evaluation functions based on MDL and genetic optimization, the solution is iteratively refined to achieve more accurate and stable clustering results. Unlike traditional ensemble methods that rely solely on external input clusters, the proposed approach leverages both external information and intrinsic data properties, reducing dependency on initial inputs and enhancing robustness.
The proposed methodology has been evaluated on multiple standard datasets using comprehensive validation metrics, demonstrating its ability to produce high-quality clusters suitable for a wide range of applications.

The Proposed Method

In this study, we introduce a novel clustering method called Genetic MDL, which integrates the Minimum Description Length (MDL) principle with a genetic optimization algorithm to address the challenges of traditional clustering methods. The Genetic MDL framework aims to overcome the limitations of existing approaches by combining the strengths of MDL, which balances model complexity and data fidelity, with the adaptability and efficiency of genetic algorithms for optimization.
The Genetic MDL approach operates through three key optimization stages:
  • EPMDLGAO: This stage employs the MDL principle to evaluate and optimize partitioning solutions. It focuses on ensuring that the clustering result represents the data with minimal encoding cost, effectively balancing simplicity and accuracy.
  • ABMDLGAO: This stage applies MDL-based adjustments to the clustering model, further refining the cluster assignments to reduce redundancy and improve consistency across the dataset.
  • EPAFGAO: This stage incorporates an enhanced genetic algorithm to optimize cluster configurations, ensuring robust convergence to high-quality solutions.
The proposed framework leverages these three stages in sequence to iteratively refine the clustering process, ensuring that the final result is stable, accurate, and free from biases introduced by initial conditions or input clusters.
An outline of the Genetic MDL framework is provided in Figure 1, which illustrates the interplay between the MDL principle and the genetic optimization algorithm in each stage.
The pseudo-code of the proposed method is shown in Figure 2.

The Genetic MDL Framework

The Genetic MDL framework optimizes clustering by combining the Minimum Description Length (MDL) principle with genetic algorithms. It involves six key steps:
  • Formation of the Agreement Matrix (A): Constructs a matrix capturing consensus between clustering results from ensemble methods, forming the foundation for an initial solution.
  • Production of the Original Solution (C₀): Generates an initial clustering configuration by combining input clusters, ensuring diversity and robustness.
  • Normalization of Datasets: Scales data attributes to a uniform range, eliminating the influence of differing scales and ensuring consistent evaluation.
  • Calculation of Attribute Weight Coefficients: Assigns weights to attributes based on variance, prioritizing more informative attributes and reducing noise.
  • AWDL Description of the Original Solution: Evaluates the clustering solution using Attribute Weighted Description Length (AWDL) to balance model simplicity and data representation.
  • Genetic Algorithm Multiple Optimization (GAMO): Refines the clustering solution using genetic algorithm operations (selection, crossover, mutation) to achieve globally optimized, stable clusters.

Formation of Agreement Matrix (A)

The Agreement matrix is built from the results of the input clusterings, each of which is a one-dimensional label vector that forms one row of a matrix. These individual clusterings are selected by domain experts. For any dataset, the length of each clustering's output vector equals the total number of items in the dataset, and each entry gives the cluster number assigned to the corresponding sample. The Agreement Matrix is a square symmetric matrix with zeros on the main diagonal. Each element of the Agreement matrix indicates how many of the input clusterings agree on placing the corresponding pair of samples in the same cluster. The Agreement matrix is denoted by A; it is a square symmetric n × n matrix, where n is the number of samples in the dataset.
Figure 3 shows the Agreement matrix formation algorithm.
The formation of the Agreement Matrix (A) can be explained through an example. Consider three individual clustering algorithms: G1, G2, and G3. Each algorithm clusters the dataset X, consisting of 10 samples, into three clusters. The clustering results are represented in vector form, where each element indicates the cluster assignment of a specific sample.
For instance, the clustering results are as follows:
Output G1 = {1 , 2, 1 , 1, 1, 3 , 3, 2 , 3, 1 }
Output G2 = {3, 2, 3, 1, 2, 1, 1, 3, 2, 3}
Output G3 = {2 , 1, 1, 3, 3 , 2, 1, 1, 2, 2 }
The results of G1, G2, and G3 are stacked as the rows of matrix Z, and matrix Z is then rewritten as matrix H.
The consensus matrix A is obtained by equation 1.
A = H × H′   (1)
In the above equation, H′ denotes the transpose of matrix H. For the example above, matrix A can be computed directly from the three label vectors, as in the sketch below.
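The following is a minimal NumPy sketch (not the authors' code) that builds A for this example directly from the three label vectors. The function name agreement_matrix and the pairwise co-membership formulation are illustrative assumptions that match the definition above: element A[i][j] counts how many input clusterings place samples i and j together, with a zero main diagonal.

```python
import numpy as np

# Label vectors of the three input clusterings from the example above.
g1 = np.array([1, 2, 1, 1, 1, 3, 3, 2, 3, 1])
g2 = np.array([3, 2, 3, 1, 2, 1, 1, 3, 2, 3])
g3 = np.array([2, 1, 1, 3, 3, 2, 1, 1, 2, 2])

def agreement_matrix(labelings):
    """Count, for every pair of samples, how many input clusterings put them in the same cluster."""
    n = len(labelings[0])
    A = np.zeros((n, n), dtype=int)
    for labels in labelings:
        # Pairwise co-membership indicator for this clustering (n x n boolean matrix).
        A += (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(A, 0)  # the main diagonal is defined to be zero
    return A

A = agreement_matrix([g1, g2, g3])
print(A)  # symmetric 10 x 10 matrix with entries in {0, 1, 2, 3}
```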

Production of the Original Solution (C0)

There are various techniques to produce the original solution, each offering distinct advantages depending on the characteristics of the dataset. One common approach involves running the K-means clustering algorithm multiple times (e.g., five iterations) and combining its results with hierarchical clustering methods such as Linkage Average, Linkage Complete, Linkage Ward, and Linkage Single, each executed once. The total number of iterations is set to an odd number to facilitate consensus in the final solution. After aligning the cluster labels from these methods, the results are aggregated through an ensemble process to determine the initial clustering solution.
Another method involves random sampling from the dataset. For instance, a subset of the samples (e.g., 80%) is randomly selected, and the K-means algorithm is applied to this subset to create initial clusters. The remaining samples are then assigned to these clusters based on their similarity to cluster centroids, often measured using a distance metric such as Euclidean distance. This method is particularly efficient for large datasets, as it reduces computational complexity while maintaining accuracy. This approach has been employed in the present study to generate the original solution. The algorithm for generating the initial solution is illustrated in Figure 4.
Other techniques include density-based clustering, where methods like DBSCAN or OPTICS are used to identify dense regions in the data, which serve as initial clusters. Sparse points are treated as noise or assigned to the nearest cluster based on proximity. Additionally, model-based clustering, such as Gaussian Mixture Models (GMMs) or Expectation-Maximization (EM), can fit a probabilistic model to the data, providing cluster assignments as the initial solution. While effective for datasets with well-defined distributions, these methods may be computationally intensive for large datasets.
In this study, the random sampling with K-means clustering technique was selected due to its simplicity, efficiency, and ability to handle large datasets effectively, ensuring a robust starting point for subsequent optimization processes.
The following is an example to clarify this procedure.
Suppose the dataset A contains 8 samples and the randomly selected subset is A′. After running the K-means algorithm on A′, the clusters C1 and C2 are obtained.
A = [1; 2; 3; 4; 5; 6; 7; 8]
A′ = [1; 2; 3; 5; 6; 7]
C1 = {1; 2; 3}
C2 = {5; 6; 7}
The remaining samples (e.g., samples 4 and 8) are then assigned to clusters C1 and C2, respectively, based on their proximity to the centers of these clusters. Proximity is typically measured using a distance metric, such as the Euclidean distance, ensuring that each sample is allocated to the cluster with the closest centroid.
Another approach, referred to as the third method, is similar to the second method but replaces the K-means algorithm with hierarchical clustering algorithms such as Linkage Average, Linkage Complete, Linkage Ward, and Linkage Single. The process proceeds in the same way as the second method, where a subset of the dataset is clustered first, and the remaining samples are subsequently assigned to the generated clusters based on their similarity to cluster centers.
The fourth method combines the strengths of the second and third methods. It uses both the K-means algorithm and hierarchical clustering algorithms (Linkage methods) to cluster the selected subset of samples. This hybrid approach ensures that the clustering leverages the diverse perspectives of both partitional and hierarchical methods, offering a more robust initial solution. By integrating the outputs of these algorithms, the fourth method provides a more comprehensive representation of the data structure, particularly for datasets with complex distributions.
These methods collectively offer flexibility and adaptability for producing the original clustering solution, depending on the characteristics and requirements of the dataset. In this study, the second method, involving random sampling with the K-means algorithm, was chosen for its simplicity, efficiency, and scalability.
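As a concrete illustration of the second method, the sketch below clusters a random 80% subset with K-means and assigns the remaining samples to the nearest centroid. It is a hedged example, not the authors' implementation; the function name initial_solution, the use of scikit-learn, and the default parameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def initial_solution(X, k, sample_frac=0.8, seed=0):
    """Cluster a random subset of X with K-means, then assign the rest to the nearest centroid."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subset = rng.choice(n, size=int(sample_frac * n), replace=False)

    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[subset])

    labels = np.empty(n, dtype=int)
    labels[subset] = km.labels_

    # Assign the held-out samples to the cluster with the closest centroid (Euclidean distance).
    rest = np.setdiff1d(np.arange(n), subset)
    dists = np.linalg.norm(X[rest, None, :] - km.cluster_centers_[None, :, :], axis=2)
    labels[rest] = dists.argmin(axis=1)
    return labels
```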

Normalization of Data Sets

Before applying AWDL, the dataset M should be normalized, producing the standardized dataset X. In normalization, the value of each attribute is scaled to the range [0, 1]. Each row of the dataset M is a sample and each column is an attribute. Equation 2 defines the normalization:
$X_{ij} = \frac{M_{ij} - \min(M_{\cdot j})}{\max(M_{\cdot j}) - \min(M_{\cdot j})}, \quad 1 \le i \le n, \ 1 \le j \le a$   (2)
In the above equation:
X_ij: the value of the jth attribute of the ith sample in the normalized dataset X.
M_ij: the value of the jth attribute of the ith sample in the original dataset M.
n: the total number of samples; a: the total number of attributes.
max(M_·j) and min(M_·j): the highest and lowest values of the jth attribute in dataset M.
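A short sketch of Equation 2 in code is given below, assuming the dataset is held as a NumPy array with samples in rows and attributes in columns; the guard against constant attributes is an added assumption to avoid division by zero.

```python
import numpy as np

def min_max_normalize(M):
    """Column-wise min-max scaling of dataset M into [0, 1] (Equation 2)."""
    M = np.asarray(M, dtype=float)
    col_min = M.min(axis=0)
    col_max = M.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid 0/0 on constant attributes
    return (M - col_min) / span
```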

Calculation of the Weight Coefficient of the Attributes

Variance measures the degree of variation among the values of a variable: a higher variance indicates larger differences between those values. The higher the variance of an attribute, the more weight it receives. Here, the variance is used as the weight of each attribute.
To calculate the weight of each attribute, equations 3 to 5 are used:
$D_j = \frac{\sum_{i=1}^{n} (X_{ij} - \mu_j)^2}{n - 1}$   (3)
$\mu_j = \frac{\sum_{i=1}^{n} X_{ij}}{n}$   (4)
$W_j = \frac{D_j}{\max(D_1, D_2, \ldots, D_a)}$   (5)
D_j: the variance of the jth attribute over the samples of dataset X.
n: the number of samples in dataset X.
X_ij: the value of the jth attribute of the ith sample.
μ_j: the average value of the jth attribute over the n samples.
W_j: the weight of the jth attribute.
a: the total number of attributes.
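Equations 3 to 5 reduce to a few lines of NumPy, as sketched below; the function name attribute_weights is an assumption, and the sample variance (divisor n − 1) follows Equation 3.

```python
import numpy as np

def attribute_weights(X):
    """Variance of each attribute (Equations 3-4), normalized by the largest variance (Equation 5)."""
    D = X.var(axis=0, ddof=1)  # sample variance with divisor n - 1
    return D / D.max()
```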

Description of the Initial Solution Based on the Attribute Weighted Description Length Criterion

In this study, we employ the Attribute Weighted Description Length (AWDL) criterion, which builds upon the foundational principles of the Minimum Description Length (MDL) framework. MDL is a modern and widely applicable approach for comparative inference, offering a general solution to model selection problems. Its core principle states that:
Any pattern or regularity in the data can be used for compression.
In other words, if a dataset exhibits a certain structure or pattern, fewer bits or samples are required to describe the data than would be needed for a dataset with no discernible structure. The extent to which data can be compressed is directly proportional to the amount of order or regularity within the dataset. The key strength of the MDL approach is its versatility, as it can effectively describe various types of data.
Grünwald et al. (2004) demonstrated that MDL is a comprehensive approach for inference and model selection, outperforming other well-known methods, such as Bayesian statistical approaches, in terms of effectiveness and general applicability [40]. Today, MDL has been extended and applied in numerous research domains, including data clustering.
MDL in Clustering
The main concept behind using MDL in clustering is that the best clustering solution is the one that minimizes the description length of the dataset, rather than clustering solely based on similarity measures between samples. This approach ensures that clusters are formed in a way that optimally compresses the data, capturing its inherent structure.
To illustrate this concept, consider a dataset with N samples that needs to be clustered into K clusters. The process involves two stages:
  • Formation of Initial Clusters: The K clusters are initialized, and the cluster centers are calculated based on the dataset.
  • Calculation of Distances: For each sample, the distance to the center of its assigned cluster is computed.
The quality of clustering is determined by the description length of the dataset. A clustering solution that minimizes the description length provides the best representation of the dataset. This methodology shifts the focus from traditional similarity-based clustering to a more holistic and information-theoretic perspective.

Attribute Weighted Description Length Criterion

The Attribute Weighted Description Length criterion is an extension of MDL, designed to incorporate attribute weighting into the clustering evaluation. Unlike standard MDL, which treats all attributes equally, AWDL assigns weights to attributes based on their variance, ensuring that more informative attributes contribute proportionally to the clustering process.
To calculate AWDL, the criterion L is applied, which consists of three components. The mathematical formulation of L is provided in Equation (6) and integrates the attribute weights, cluster structure, and overall dataset representation. The three components are:
  • Cluster Formation Cost: Captures the cost of forming clusters, including initializing cluster centers.
  • Cluster Assignment Cost: Represents the cost of assigning each sample to its respective cluster, considering its distance from the cluster center.
  • Compression Cost: Accounts for the compression achieved by clustering, balancing between simplicity and accuracy.
The AWDL criterion ensures that the clustering process is driven not only by the structural relationships within the data but also by the contribution of individual attributes, leading to a more robust and nuanced clustering solution.
$L = \{S_m, \ S_d, \ W\}$   (6)
In equation 6, W = [w_1, w_2, ..., w_a] is a vector containing the weights of all attributes. Since the vector W is always constant, AWDL can be expressed more simply through the criterion L′, given in equation 7:
$L' = \{S_m, \ S_d\}$   (7)
S_m and S_d are described in the following.
S_m: the sum of the weighted mean values of all clusters of a partition, defined by Equation 8:
$S_m = \sum_{p=1}^{k} \left( w_1 m_{p1} + w_2 m_{p2} + \cdots + w_a m_{pa} \right)$   (8)
In equation 8:
a: the number of attributes (features) of the dataset, and k is the number of clusters.
m_pj: the average value of the jth attribute in the pth cluster.
w_j: the weight of the jth attribute of the dataset.
S_d: defined by equation 9, it represents the weighted deviation of all samples of each cluster in a partition; the deviation measures the difference between each sample's value and the cluster mean:
$S_d = \sum_{p=1}^{k} \sum_{q=1}^{T} \left( w_1 \left| X_{q1} - m_{p1} \right| + w_2 \left| X_{q2} - m_{p2} \right| + \cdots + w_a \left| X_{qa} - m_{pa} \right| \right)$   (9)
In the above equation:
a: the number of attributes (features) of the dataset, k is the number of clusters, and T is the number of samples in each cluster.
m_pj: the average value of the jth attribute in the pth cluster.
X_qj: the value of the jth attribute of the qth sample in the pth cluster (q = 1 ... T).
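A hedged sketch of the L′ criterion is given below. It assumes the two terms S_m and S_d are simply summed into a single score and that absolute deviations are used in Equation 9; both are reading assumptions, since the paper specifies L′ = {S_m, S_d} without stating the combination explicitly.

```python
import numpy as np

def awdl_criterion(X, labels, w):
    """Sketch of L' = {S_m, S_d} (Equations 7-9) for normalized data X, labels, and weights w."""
    S_m = 0.0
    S_d = 0.0
    for p in np.unique(labels):
        members = X[labels == p]
        centroid = members.mean(axis=0)                 # m_p: per-attribute cluster mean
        S_m += np.sum(w * centroid)                     # weighted cluster mean term (Equation 8)
        S_d += np.sum(w * np.abs(members - centroid))   # weighted deviation term (Equation 9)
    return S_m + S_d  # combining the two terms by summation is an assumption
```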

Genetic Algorithm Multiple Optimization Framework (GAMO)

GAMO framework consists of three separate optimization phases:
  • MDL optimizer based on Agreement Based MDL Genetic Algorithm Optimization (ABMDLGAO).
  • Equal Probability MDL Genetic Algorithm Optimization (EPMDLGAO).
  • Equal Probability Agreement Fitness Genetic Algorithm Optimization (EPAFGAO).
The performance of each phase is described in the following.

The Optimization Function ABMDLGAO

Displacing the samples within clusters is a critical step in optimizing the clustering solution. This step aims to find the optimal cluster for each sample such that the L   ' criterion—as described earlier—is minimized. By minimizing L   ' , the clustering process ensures that the overall description length of the dataset is reduced, leading to a more efficient and accurate representation of the data.
To achieve this, samples are iteratively evaluated for potential displacement to a different cluster. However, the probabilities for selecting samples for displacement are not uniform. Instead, they are determined based on their likelihood of improving the clustering solution, which is calculated using Equations (10) to (13). These equations incorporate factors such as the distance of a sample to the cluster centers and the contribution of its displacement to minimizing L   ' . This probability-based selection ensures that the optimization process focuses on the most impactful samples, reducing unnecessary computations and enhancing convergence speed.
The displacement process continues iteratively, with each iteration re-evaluating the clustering configuration. This iterative adjustment persists until one of two conditions is met:
  • Convergence of L   ' : The value of L   ' stabilizes, indicating that further sample displacements will not significantly reduce the description length.
  • End Condition Reached: A predefined iteration limit or computational threshold is achieved, ensuring that the process terminates within practical runtime constraints.
By iteratively optimizing the placement of samples, this approach ensures a refined clustering solution that effectively balances accuracy, simplicity, and computational efficiency.
$V_1 = \frac{N_q \cdot \max(A_{iq})}{N_i \cdot \max(A_i)}$   (10)
$V_2 = \frac{\max(A_{iq})}{r}$   (11)
$V = \begin{cases} 0 & \text{if } \max(A_{iq}) < \max(A_i) \\ \min(V_1, V_2) & \text{if } \max(A_{iq}) = \max(A_i) \end{cases}$   (12)
$V_p = 1 - V$   (13)
In the above equations:
A: the Consensus (Agreement) Matrix.
V_1: the integrated agreement rate.
A_iq: the agreement values of the sample pairs that include sample i and a member of cluster q.
max(A_iq): the maximum agreement value in A_iq; N_q: the number of entries in A_iq that attain this maximum.
A_i: the agreement values of all sample pairs that include sample i; N_i: the number of entries in A_i that attain the maximum.
max(A_i): the maximum agreement value in A_i.
V_2: the simple agreement value.
r: the number of input clusterings.
V: the probability that the sample is not selected for displacement.
V_p: the probability that the sample is selected for displacement.
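The sketch below shows one possible reading of Equations 10 to 13 in code: it computes the displacement-selection probability V_p of a single sample i with respect to a candidate cluster q. The function name, the NumPy formulation, and the guard for empty or all-zero agreement rows are assumptions; this is not the authors' implementation.

```python
import numpy as np

def displacement_probability(A, i, labels, q, r):
    """Probability V_p that sample i is selected for displacement into cluster q (Equations 10-13)."""
    A_i = np.delete(A[i], i)            # agreements of sample i with all other samples
    in_q = (labels == q)
    in_q[i] = False
    A_iq = A[i, in_q]                   # agreements of sample i with the members of cluster q

    if A_iq.size == 0 or A_i.max() == 0 or A_iq.max() < A_i.max():
        V = 0.0                                      # Equation 12, first case
    else:
        N_q = np.sum(A_iq == A_iq.max())
        N_i = np.sum(A_i == A_i.max())
        V1 = (N_q * A_iq.max()) / (N_i * A_i.max())  # integrated agreement rate (Equation 10)
        V2 = A_iq.max() / r                          # simple agreement value (Equation 11)
        V = min(V1, V2)                              # Equation 12, second case
    return 1.0 - V                                   # V_p = 1 - V (Equation 13)
```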

The Optimization Function EPMDLGAO

Similar to the ABMDLGAO function, this function aims to optimize sample placement within clusters to minimize the L′ value. However, unlike ABMDLGAO, the probabilities for sample displacement are uniform.

The Optimizer Function EPAFGAO

The objective function EPAFGAO serves as an agreement-based evaluation function, designed to identify solutions that maximize the value of the agreement objective function. The agreement function is defined by Equations (14) to (17):
$B = \left( \max(A) - \min(A) \right) \times 0.6 + \min(A)$   (14)
$A' = A - B$   (15)
$f'(C_i) = \begin{cases} \sum_{k=1}^{S_i - 1} \sum_{q=k+1}^{S_i} A'(c_{ik}, c_{iq}) & \text{if } S_i > 1 \\ 0 & \text{otherwise} \end{cases}$   (16)
$F(C) = \sum_{i=1}^{m} f'(C_i)$   (17)
In the above equations:
A: the Consensus Matrix.
B (consensus threshold): clusters whose agreement values exceed B are rewarded, while clusters whose agreement values fall below B are penalized.
max(A) and min(A): the maximum and minimum values in matrix A, respectively.
A′: the weighted agreement matrix obtained by subtracting B from the consensus matrix A; its main diagonal is zero.
f′(C_i): the evaluation function of the ith cluster of the initial solution.
F(C): the evaluation function of the initial solution (m is the number of clusters).
S_i: the number of samples in the ith cluster. If a cluster contains only one sample (i.e., S_i = 1), its evaluation function is zero.
c_ik: the kth element (sample) of the ith cluster.
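A compact sketch of Equations 14 to 17 follows, assuming the partition is given as a list of index lists; the helper name agreement_fitness and the NumPy formulation are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def agreement_fitness(A, clusters):
    """Agreement-based fitness F(C) of a partition (Equations 14-17)."""
    B = (A.max() - A.min()) * 0.6 + A.min()   # consensus threshold (Equation 14)
    A_w = A - B                               # weighted agreement matrix (Equation 15)
    np.fill_diagonal(A_w, 0)

    F = 0.0
    for members in clusters:                  # each cluster is a list of sample indices
        if len(members) > 1:                  # singleton clusters contribute zero (Equation 16)
            idx = np.asarray(members)
            sub = A_w[np.ix_(idx, idx)]
            F += np.triu(sub, k=1).sum()      # sum A' over all unordered pairs in the cluster
    return F                                  # Equation 17
```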

Experimental Results

In this section, the experimental results of the evaluation of Genetic MDL across multiple datasets are presented. The datasets used in this study are sourced from the UCI Machine Learning Repository, a benchmark repository commonly utilized in clustering studies worldwide. The performance of Genetic MDL was assessed using various evaluation metrics, including Fisher's F-measure, Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Accuracy. Experimental results demonstrated that Genetic MDL exhibits satisfactory efficiency and robustness when applied to diverse datasets.

Data Sets

The Genetic MDL method was tested on 13 standard datasets to ensure a comprehensive evaluation. The datasets were selected to maximize diversity in terms of the number of classes, number of attributes, and number of samples, providing a robust basis for evaluating the method’s generalizability and effectiveness. A summary of the datasets used is provided in Table 1, which highlights their key characteristics and ensures that the results reflect a wide range of data conditions.
For all datasets, the number of clusters and the true labels of the samples were known beforehand. This allowed the percentage of correctly recognized samples to serve as the efficiency metric for the clustering approach. By addressing the correspondence between the obtained labels and the actual clusters, the error rate was calculated to evaluate the performance of the method.
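One common way to establish this correspondence is a Hungarian assignment between predicted clusters and true classes; the hedged sketch below (using scipy's linear_sum_assignment, not the authors' code) returns the resulting clustering accuracy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Match cluster labels to true classes, then return the fraction of correctly recognized samples."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(pred_labels)

    # Contingency table: overlap between every predicted cluster and every true class.
    overlap = np.zeros((clusters.size, classes.size), dtype=int)
    for a, c in enumerate(clusters):
        for b, t in enumerate(classes):
            overlap[a, b] = np.sum((pred_labels == c) & (true_labels == t))

    row, col = linear_sum_assignment(-overlap)   # maximize the total matched count
    return overlap[row, col].sum() / true_labels.size
```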

Validation Measures

To compare the efficiencies of different algorithms, four widely recognized evaluation criteria were employed: Accuracy, NMI, Fisher’s F-measure (F-measure), and ARI. The definitions and significance of these criteria are explained below.

Fisher Measure

Fisher's F-measure is a criterion that combines precision and recall into a single metric, making it particularly effective for evaluating unbalanced datasets. Defined in Equation (18), the F-measure is calculated as the harmonic mean of precision and recall. Its value ranges between 0 and 1, where 1 represents perfect agreement and 0 indicates no agreement.
$F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall}$   (18)
In the context of the above formula, the term precision refers to the closeness of the measured values to one another, regardless of whether those values reflect the true reality or not.

Precision Measure

The precision measure evaluates clustering accuracy based on the class labels by assessing the correct assignment of data points to their respective clusters. This measure accounts for both the items correctly assigned to their actual clusters and those correctly excluded from incorrect clusters. Mathematically, as defined in Equation (19), precision is calculated as the ratio of the sum of correctly clustered items and correctly excluded items to the total number of data points. This measure provides a clear indication of the clustering algorithm's effectiveness in accurately grouping the data.
$Accuracy = \frac{TP + TN}{N}$   (19)

Normalized Mutual Information Criterion

In this method, clusters are evaluated using a stability criterion based on Normalized Mutual Information (NMI). The approach involves a mechanism to assess the stability of individual clusters independently of the other clusters produced during the clustering process. This ensures that each cluster's stability is measured solely on the basis of its consistency and reproducibility across multiple clustering iterations or sampling processes. To do so, suppose that we want to calculate the stability of cluster C_i. New datasets are formed by sampling, and different clusterings are performed on them. The question is then whether this cluster also appears in those clusterings. To answer it, a similarity measure between the cluster C_i and a clustering P(D) is defined, denoted sim(C_i, P(D)). Using this criterion, the similarity between the cluster and the different clusterings obtained by sampling is calculated, and the mean of these similarity values is returned as the stability of the cluster, g_i(C_i, D). In fact, sim(C_i, P(D)) specifies the validity of cluster C_i in clustering P on dataset D. The NMI between two clusterings P1 and P2 is calculated with equation 20.
$NMI = \frac{MI(P_1, P_2)}{\frac{1}{2m}\left( \sum_{i=0}^{1} p_{i\cdot} \log \frac{p_{i\cdot}}{m} + \sum_{j=0}^{1} p_{\cdot j} \log \frac{p_{\cdot j}}{m} \right)}$   (20)
In this equation, p_11 represents the number of common samples in C* and C_i; p_10 the number of common samples in D \ C* and C_i; p_01 the number of common samples in C* and D \ C_i; and p_00 the number of common samples in D \ C* and D \ C_i. Moreover, m is the total number of samples. The marginal totals p_i· and p_·j are, respectively, the total numbers of samples in C_i and C*.
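In practice, the NMI between two partitions can be obtained from a standard library call, as in the short example below; note that scikit-learn's general-purpose score is a stand-in for comparing two label vectors and is not the authors' cluster-stability procedure built on Equation 20.

```python
from sklearn.metrics import normalized_mutual_info_score

# Two example partitions of six samples, given as label vectors.
p1 = [0, 0, 1, 1, 2, 2]
p2 = [0, 0, 1, 2, 2, 2]
print(normalized_mutual_info_score(p1, p2))  # value in [0, 1]; 1 means identical partitions
```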

ARI Standard

The consistency between partitions U and C can be represented by the contingency matrix M ∈ ℝ^{K_c × K_u}, in which K_c and K_u are, respectively, the numbers of clusters in partitions C and U. M_ij = |C_i ∩ U_j| is the number of data points that belong to cluster i of partition C and cluster j of partition U. For two partitions C and U, the Rand Index is based on the following quantities:
1. a: the number of pairs of data points placed in the same cluster in both C and U.
2. b: the number of pairs of data points placed in the same cluster in C but not in U.
3. c: the number of pairs of data points placed in the same cluster in U but not in C.
4. d: the number of pairs of data points placed in different clusters in both partitions.
$RI = \frac{a + d}{a + b + c + d}$   (21)
The quantities a and d can be considered as agreements, while b and c represent disagreements. The Adjusted Rand Index (ARI) is a criterion that measures the similarity between two partitions and is calculated with the following formula:
$ARI = \frac{\sum_{ij} \binom{M_{ij}}{2} - \left[ \sum_i \binom{M_{i\cdot}}{2} \sum_j \binom{M_{\cdot j}}{2} \right] / \binom{M}{2}}{\frac{1}{2}\left[ \sum_i \binom{M_{i\cdot}}{2} + \sum_j \binom{M_{\cdot j}}{2} \right] - \left[ \sum_i \binom{M_{i\cdot}}{2} \sum_j \binom{M_{\cdot j}}{2} \right] / \binom{M}{2}}$   (22)
In formula 22, M_ij is the number of data items that belong to cluster i of partition C and cluster j of partition U; M_i· is the number of data items in cluster i of partition C (the sum of the ith row of matrix M); M_·j is the number of data items in cluster j of partition U (the sum of the jth column of matrix M); and M is the total number of data items.
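The ARI of formula 22 can be computed directly from the contingency matrix, as in the hedged sketch below; the function name and the use of scipy.special.comb are assumptions, and the result should match the standard library implementation (sklearn.metrics.adjusted_rand_score).

```python
import numpy as np
from scipy.special import comb

def adjusted_rand_index(labels_c, labels_u):
    """ARI computed from the contingency matrix M of two partitions (formula 22)."""
    _, ci = np.unique(labels_c, return_inverse=True)
    _, uj = np.unique(labels_u, return_inverse=True)
    M = np.zeros((ci.max() + 1, uj.max() + 1), dtype=int)
    np.add.at(M, (ci, uj), 1)                      # M_ij = |C_i ∩ U_j|

    sum_ij = comb(M, 2).sum()                      # pairs within every cell
    sum_i = comb(M.sum(axis=1), 2).sum()           # pairs from the row sums M_i.
    sum_j = comb(M.sum(axis=0), 2).sum()           # pairs from the column sums M_.j
    expected = sum_i * sum_j / comb(M.sum(), 2)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```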

Experiments

The Genetic MDL method was implemented and tested using Python 3.4. Its performance was compared against the following clustering algorithms: K-means, Single Linkage, Average Linkage, Complete Linkage, Ward Linkage, and FCM. The experiments were conducted over 100 independent runs of the program, and the results were reported as averages accompanied by their standard deviations. The algorithm rankings based on various validation criteria are presented in Table 2, Table 3, Table 4 and Table 5 below.

Evaluation of Genetic MDL

In this section, the results obtained from 100 independent runs of each algorithm are presented. As shown in Table 2 and Figure 5, the algorithms were evaluated on thirteen datasets using the NMI criterion. The results demonstrate that, on average, across 100 independent implementations for each dataset, the Genetic MDL method consistently outperformed the other algorithms in terms of efficiency.
As presented in Table 3 and Figure 6, the algorithms were assessed on thirteen datasets using the ARI criterion. The results, averaged over 100 independent runs for each dataset, indicate that the Genetic MDL method consistently achieved superior efficiency compared to the other algorithms.
In Table 4 and Figure 7, the precision metric—which assesses the accuracy of the clustering method in assigning data points to their respective clusters—is presented. Based on the results averaged over 100 independent runs for each algorithm on the thirteen datasets, the Genetic MDL method demonstrated greater efficiency compared to the other methods.
As shown in Table 5, the algorithms were evaluated on thirteen datasets using the Fisher standard. Based on the results averaged over 100 independent runs for each dataset, the Genetic MDL method consistently outperformed the other algorithms in terms of efficiency.

Algorithm Ranking

In this section, the ranking of each algorithm is presented based on the T-test results, derived from the average of 100 independent runs of each algorithm across 13 datasets for each criterion. Before discussing the rankings, a brief explanation of the T-test is provided.

Introduction of the T Test

The T-test is a statistical method used to compare the performance of algorithms in machine learning. Assuming two algorithms, A and B, and a dataset Z, the general procedure is as follows:
  • Let P(A) and P(B) represent the precision values of algorithms A and B, respectively, on the dataset Z.
  • Both algorithms are executed k times, and their results are compared with the real labels of the dataset.
  • A set of k differences is calculated by subtracting the precision values of the two algorithms for each test run, forming the set of differences {P(A) − P(B)}.
These differences are then analyzed statistically using the T-test to determine whether the performance of the two algorithms differs significantly. This comparison helps in ranking the algorithms based on their efficiency and accuracy.
$P(1) = P_A(1) - P_B(1), \quad \ldots, \quad P(k) = P_A(k) - P_B(k)$
The samples are selected independently. Under the null hypothesis (H_0: equivalent accuracy), t is calculated with k − 1 degrees of freedom as follows:
$t = \frac{\bar{P} \cdot \sqrt{k}}{\sqrt{\frac{\sum_{i=1}^{k} \left( P(i) - \bar{P} \right)^2}{k - 1}}}, \quad \text{where } \bar{P} = \frac{1}{k} \sum_{i=1}^{k} P(i)$
In the final step of the T-test, the calculated t-value, approximated using the given formula, is compared with the critical t-value at a 0.05 significance level (5% error). If the calculated t-value exceeds the critical t-value, the difference between the algorithms is considered statistically significant; otherwise, it is deemed nonsignificant. The results of these comparisons, along with the algorithm rankings based on each criterion, are presented in Table 6, Table 7, Table 8 and Table 9.
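The sketch below shows the paired comparison in code for two vectors of per-run accuracies; the hand-computed t statistic follows the formula above, and scipy's ttest_rel is included as an equivalent library route for obtaining the p-value. The function name and the 0.05 threshold argument are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def paired_t_test(acc_a, acc_b, alpha=0.05):
    """Paired t-test over k runs of algorithms A and B, as described above."""
    diff = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    k = diff.size
    p_bar = diff.mean()
    t = p_bar * np.sqrt(k) / np.sqrt(((diff - p_bar) ** 2).sum() / (k - 1))

    # Equivalent library computation of the statistic and its p-value.
    t_lib, p_value = stats.ttest_rel(acc_a, acc_b)
    return t, p_value < alpha  # True means the difference is statistically significant
```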

Summary and Conclusions

In this study, the AWDL approach, an extended version of the MDL principle, was applied to clustering datasets. The key advantage of this method lies in its ability to consider both the internal and external information of the data. By weighting the attributes of the datasets, the AWDL approach assigns special importance to attribute value coefficients during clustering, reducing biases toward input clusters and producing more stable clustering results. Additionally, the proposed method incorporates multiple evaluation functions for assessing candidate solutions, enhancing its adaptability to diverse datasets. Experimental results on 13 standard datasets, evaluated using four widely accepted validation metrics—Fisher, accuracy, NMI, and ARI—demonstrated the superior efficiency of the proposed method compared to conventional clustering algorithms such as Linkage Single, Linkage Average, Linkage Complete, Linkage Ward, K-means, and FCM.
The AWDL approach is particularly promising for applications in medical and health sciences, where clustering methods are often used to analyze complex and high-dimensional data. For example, in EEG (electroencephalogram) and ECG (electrocardiogram) signal analysis, clustering can help identify patterns associated with neurological disorders or cardiac abnormalities, enabling early diagnosis and personalized treatment [45,46]. Similarly, in spatial transcriptomics, clustering is critical for grouping cells based on their gene expression profiles and spatial locations, aiding in the study of tissue architecture and disease progression, such as in cancer research. The stability and bias-resilient properties of AWDL make it particularly well-suited for these applications, where accurate clustering directly impacts clinical insights and outcomes.
Furthermore, the AWDL method holds significant potential for use in deep learning, where clustering can play a vital role in tasks such as feature extraction, unsupervised pretraining, and data augmentation [47,48]. Clustering methods like AWDL can be integrated into autoencoders, variational autoencoders (VAEs), or contrastive learning frameworks to identify latent structures in the data, improving representation learning. Additionally, the ability to handle diverse datasets and produce stable clusters makes AWDL an excellent choice for applications in domains like computer vision, natural language processing, and bioinformatics, where clustering is often used to preprocess data or interpret deep learning model outputs. By enhancing clustering performance, AWDL can contribute to developing more robust and efficient deep learning pipelines.

Future Work

One potential avenue for advancing the assessment of clustering methods is the development of a comprehensive dataset encompassing diverse data types with varying characteristics to serve as a benchmark for testing clustering algorithms. Additionally, building upon the present research, it is recommended to explore the use of other heuristic and meta-heuristic optimization techniques, such as Ant Colony Optimization, Bee Algorithms, and Frog Leap Algorithm, for refining the initial solution, alongside the genetic optimization algorithm employed in this study.

References

  1. Jain, A.K., Murty, M.N. and Flynn, P.J. Data clustering: a review. ACM Comput. Surv. 1999, 31, 264–323.
  2. Berkhin, P. Survey of clustering data mining techniques. Technical Report, Accrue Software, 2002.
  3. Xu, R. and Wunsch II, D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005, 16, 645–678.
  4. Pandey, G., Kumar, V. and Steinbach, M., "Computational Approaches for Protein Function Prediction," Supported by the National Science Foundation under Grant Nos. IIS-0308264 and ITR-0325949, 2007.
  5. Jain, A.K. and Dubes, R.C., "Algorithms for Clustering Data," Prentice Hall, 1988.
  6. Duda, R., Hart, P. and Stork, D., "Pattern Classification," John Wiley & Sons, 2001.
  7. Hastie, T., Tibshirani, R. and Friedman, J., "The Elements of Statistical Learning: Data Mining, Inference and Prediction," Springer, 2001.
  8. Jiang, D., Tang, C. and Zhang, A., "Cluster Analysis for Gene Expression Data: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370-1386, Nov. 2004.
  9. Strehl, A. and Ghosh, J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 2002, 3, 583–617.
  10. Everitt, B., "Cluster Analysis," Third Edition. Edward Arnold, 1993.
  11. Johnson, S.C. Hierarchical Clustering Schemes. Psychometrika, 1967, 2, 241–254.
  12. Webb, A., "Statistical Pattern Recognition," Arnold, pp. 275-317, 1999.
  13. MacQueen, J., "Some methods for classification and analysis of multivariate observations," In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
  14. Kaufman, L. and Rousseeuw, P.J., "Finding Groups in Data: An Introduction to Cluster Analysis," Wiley-Interscience, New York, 1990.
  15. Ester, M., Kriegel, H.P. and Xu, X., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," In: Proc. 2nd ACM SIGKDD, pp. 226-231, 1996.
  16. Ankerst, M., Breunig, M., Kriegel, H.P. and Sander, J., "OPTICS: Ordering Points to Identify the Clustering Structure," Management of Data Mining, pp. 49–60, 1999.
  17. Wang, W., Yang, J. and Muntz, R., "STING: A Statistical Information Grid Approach to Spatial Data Mining," In: Proc. of the 23rd Very Large Databases Conf., 1997.
  18. Parvin, H., Parvin. A classifier ensemble of binary classifier ensembles. International Journal of Learning Management Systems, 2013, 1, 36–46.
  19. Kohonen, T., "Automatic formation of topological maps of patterns in a self-organizing system," Proceedings of 2SCIA, pp. 214-220, 1981a.
  20. Fisher, D.H., "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, vol. 2, pp. 139-172, 1987.
  21. Cheeseman, P. and Stutz, J., "Bayesian Classification (AutoClass): Theory and Results," Advances in Knowledge Discovery and Data Mining, Cambridge, 1996.
  22. Dunn, J.C., "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters," Journal of Cybernetics, vol. 3, pp. 32-57, 1973.
  23. Bezdek, J.C., "Pattern Recognition with Fuzzy Objective Function Algorithms," Plenum Press, New York, 1981.
  24. Yang, X.S., "Introduction to Mathematical Optimization: From Linear Programming to Metaheuristics," Cambridge Int. Science Publishing, 2008.
  25. Swift, S., Tucker, A., Vincotti, V., Martin, N., Orengo, C., Liu, X. and Kellam, P., "Consensus Clustering and Functional Interpretation of Gene-Expression Data," Genome Biology, vol. 5, no. 11, pp. R94.1-R94.16, 2004.
  26. Hu, X. and Yoo, I., "Cluster Ensemble and its Applications in Gene Expression Analysis," Proc. Second Conference on Asia-Pacific Bioinformatics, vol. 55, pp. 297-302, Jan. 2004.
  27. Jenner, R.G., Maillard. Kaposi's sarcoma-associated herpesvirus-infected primary effusion lymphoma has a plasma cell gene expression profile. Proc Natl Acad Sci USA, 2003, 100, 10399–10404.
  28. Tseng, V.S. and Kao, C., "Efficiently Mining Gene Expression Data via a Novel Parameterless Clustering Method," IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 2, no. 4, pp. 355-365, 2005.
  29. Parvin, H., Minaei-Bidgoli. A new imbalanced learning and dictions tree method for breast cancer diagnosis. Journal of Bionanoscience, 2013, 7, 673–678.
  30. Yu, Z.W., Wong, H.S. and Wang, H.Q., "Graph-based consensus clustering for class discovery from gene expression data," Bioinformatics, vol. 23, no. 21, pp. 2888-2896, 2007.
  31. Mahmoudi, M.R., Akbarzadeh. Consensus function based on cluster-wise two level clustering. Artificial Intelligence Review, 2021, 54, 639–665.
  32. Parvin, H., Alizadeh, H., Moshki, M., Minaei-Bidgoli, B. and Mozayani, N., "Divide & conquer classification and optimization by genetic algorithm," In 2008 Third International Conference on Convergence and Hybrid Information Technology, vol. 2, pp. 858-863. IEEE, 2008.
  33. Parvin, H., Minaei-Bidgoli. A new classifier ensemble methodology based on subspace learning. Journal of Experimental & Theoretical Artificial Intelligence, 2013, 25, 227–250.
  34. Punera, K. and Ghosh, J., "Consensus-based Ensembles of Soft Clusterings," Applied Artificial Intelligence, vol. 22, no. 7-8, pp. 780-810, 2008.
  35. Asur, S., Ucar, D. and Parthasarathy, S., "An Ensemble Framework for Clustering Protein-Protein Interaction Networks," Bioinformatics, vol. 23, no. 13, pp. i29-i40, Jul. 2007.
  36. Fred, A.L.N. and Jain, A.K., "Combining Multiple Clusterings using Evidence Accumulation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, Jun. 2005.
  37. Azimi, J., Abdoos, M. and Analoui, M., "A New Efficient Approach in Clustering Ensembles," In IDEAL, International Conference on Intelligent Data Engineering and Automated Learning, 2007.
  38. Fern, X.Z. and Brodley, C.E., "Solving Cluster Ensemble Problems by Bipartite Graph Partitioning," Proc. Twenty-First International Conference on Machine Learning (ICML '04), vol. 69, p. 36, ACM Press, New York, NY, 2004.
  39. Vega-Pons, S., Correa-Morris, J. and Ruiz-Shulcloper, J., "Weighted Cluster Ensemble Using a Kernel Consensus Function," Progress in Pattern Recognition, Image Analysis and Applications, vol. 5197, pp. 195-202, 2008.
  40. Grünwald, P.D., Myung, I.J. and Pitt, M.A. (Eds.), "Advances in Minimum Description Length: Theory and Applications," MIT Press, 2004.
  41. Parvin, H., Minaei-Bidgoli. Data weighing mechanisms for clustering ensembles. Computers & Electrical Engineering, 2013, 39, 1433–1450.
  42. Minaei-Bidgoli, B., Parvin. Effects of resampling method and adaptation on clustering ensemble efficacy. Artificial Intelligence Review, 2014, 41, 27–48.
  43. Parvin, H., Seyedaghaee. A heuristic scalable classifier ensemble of binary classifier ensembles. Journal of Bioinformatics and Intelligent Control, 2012, 1, 163–170.
  44. Parvin, H., Minaei, B., Alizadeh, H. and Beigi, A., "A novel classifier ensemble method based on class weightening in huge dataset," In Advances in Neural Networks – ISNN 2011: 8th International Symposium on Neural Networks, Guilin, China, May 29–June 1, 2011, Proceedings, Part II, pp. 144-150. Springer Berlin Heidelberg, 2011.
  45. Hansun, S., Argha, A., Alizadehsani, R., Gorriz, J.M., Liaw, S.T., ... and Marks, G.B., "A New Ensemble Transfer Learning Approach With Rejection Mechanism for Tuberculosis Disease Detection," IEEE Transactions on Radiation and Plasma Medical Sciences, 2024.
  46. Sadeghi, A., Hajati. 3DECG-Net: ECG fusion network for multi-label cardiac arrhythmia detection. Computers in Biology and Medicine, 2024, 182, 109126.
  47. Niu, H., et al. Deep feature learnt by conventional deep neural network. Computers & Electrical Engineering, 2020, 84, 106656.
  48. Hong, L., Modirrousta, M.H., Hossein Nasirpour, M., Mirshekari Chargari, M., Mohammadi, F., Moravvej, S.V. and Nahavandi, S., "GAN-LSTM-3D: An efficient method for lung tumour 3D reconstruction enhanced by attention-based LSTM," CAAI Transactions on Intelligence Technology, 2023.
Figure 1. Framework of the proposed approach.
Figure 2. The pseudo-code of the proposed method.
Figure 3. Agreement matrix formation algorithm.
Figure 4. Initial solution generation algorithm.
Figure 5. The NMI graph of clustering methods.
Figure 6. ARI graph of clustering methods.
Figure 7. Accuracy graph of clustering methods.
Table 1. The used data sets and their features.
Table 2. Comparison of average normalized mutual information (NMI) of the Genetic MDL method with the common clustering algorithms.
Table 3. Comparison of average ARI of Genetic MDL method with common clustering algorithms.
Data sets    K-means    Single    Average    Complete    Ward    FCM    EPMDLGAO    EPAFGAO    ABMDLGAO
balance_scale 13.24 ± 3.78 0.36± 0 9.2 ± 0 6.03 ± 0 11.98 ± 0 10.73 ± 8.31 13.5 ± . 58 13.14 ± . 71 12.65 ± . 66
Halfring 23.87 ± 0 2.99 ± 0 78.07 ± 0 51.16 ± 0 51.16 ± 0 23.87 ± 0 94.31 ± 4.3 58.94 ± 3.18 96.29 ± 5.39
Iris 63.85 ± 14.01 56.38 ± 0 75.92 ± 0 64.23 ± 0 73.12 ± 0 72.94 ± 0 78.5 ± 4.82 96.24 ± 4.25 73.44 ± 4.21
Nbalance_scale 13.55 ± 3.88 0.36 ± 0 8.54 ± 0 8.3 ± 0 3.55 ± 0 14.82 ± 10.99 15.01 ± . 64 8.42 ± . 52 14.62 ± . 53
Nbreast 81.61 ± 0.44 0.25 ± 0 9.63 ± 0 79.75 ± 0 86.35 ± 0 81.38 ± 0 91.7 ± 3.28 85.45 ± 1.09 78.56 ± . 71
Nbupa -.069 ± 0.02 -0.16 ± 0 -0.46 ± 0 -0.44 ± 0 -0.3 ± 0 -0.6 ± 0 1.62 ± . 3 1.57 ± . 28 1.62 ± . 54
Ngalaxy 11.17 ± 1.04 -0.03 ± 0 13.4 ± 0 6.93 ± 0 10.59 ± 0 10.54 ± 0 4.16 ± . 96 27.51 ± 2.76 15.55 ± 2.15
Nglass 19.04 ± 5.39 1.49 ± 0 1.95 ± 0 3.5 ± 0 13.51 ± 0 15.44 ± 0 33.63 ± 3.65 33.45 ± 4.33 34.76 ± 2.78
Nionosphere 16.79 ± 0 0.45 ± 0 0.45 ± 0 11.45 ± 0 17.75 ± 0 15.87 ± 0 1.53 ± 3.23 19.57 ± . 63 -1.23 ± . 47
NSAHeart 6.57 ± 0.24 -0.2 ± 0 1.55 ± 0 2.54 ± 0 7.98 ± 0 8.72 ± 0 3.15 ± 2.43 10.13 ± 5.13 9.31 ± 1.82
Nwine 89.75 ± 0 -0.68 ± 0 -0.54 ± 0 57.71 ± 0 78.99 ± 0 89.75 ± 0 67.95 ± 7.2 90.87 ± 6.03 51.27 ± 11.34
NYeast 17.47 ± 1.17 1 ± 0 1.87 ± 0 10.83 ± 0 16.92 ± 0 12.19 ± 0.73 44.81 ± 3.03 45.1 ± 3.49 48.49 ± 4.63
Wine 35.75 ± 1.74 0.54 ± 0 29.26 ± 0 37.08 ± 0 36.84 ± 0 35.39 ± 0 46.32 ± 4.59 53.26 ± 8.35 41.66 ± 3.33
Table 4. Comparison of the mean Accuracy of Genetic MDL with common clustering algorithms.
Data sets    K-means    Single    Average    Complete    Ward    FCM    EPMDLGAO    EPAFGAO    ABMDLGAO
balance_scale 51.75 ± 2.71 3.61 ± 0 53.12 ± 0 50.24 ± 0 55.04 ± 0 53.65 ± 8.37 53.81 ± 4.68 47.64 ± 6.79 56.66 ± 9.84
Halfring 74.5 ± 0 75.75 ± 0 94.75 ± 0 86 ± 0 86 ± 0 74.5 ± 0 93.64 ± 2.39 80.20 ± 11.63 95.44 ± 3.95
Iris 86.58 ± 9.36 68 ± 0 90.67 ± 0 84 ± 0 89.33 ± 0 89.33 ± 0 87.51 ± 9.21 82.23 ± 5.63 91.17 ± 6.5
Nbalance_scale 51.94 ±3.25 46.4 ± 0 52.16 ± 0 49.28 ± 0 43.2 ± 0 50.48 ± 7.14 48.85 ± 4.97 53.79 ± 5.22 47.49 ± 6.33
Nbreast 95.25 ± 0.13 65.15 ± 0 70.13 ± 0 94.73 ± 0 96.49 ± 0 95.17 ± 0 96.7 ± 2.61 89.17 ± 9.44 93.89 ± 3.71
Nbupa 54.58 ± 0.13 57.68 ± 0 57.1 ± 0 55.94 ± 0 55.65 ± 0 51.3 ± 0 64.84 ± 6.69 56.5 ± 7.84 58.33 ± 9.15
Ngalaxy 30.88 ± 1.73 25.08 ± 0 33.44 ± 0 28.79 ± 0 29.1 ± 0 29.32 ± 0.14 31.67 ± 3.99 28.76 ± 1.73 36.34 ± 4.11
Nglass 47.55 ± 4.45 36.45 ± 0 37.85 ± 0 40.65 ± 0 42.06 ± 0 40.53 ± 0.2 47.01 ± 5.13 51.19 ± 6.31 48.12 ± 4.21
Nionosphere 70.66 ± 0 64.39 ± 0 64.39 ± 0 67.24 ± 0 71.23 ± 0 70.09 ± 0 66.65 ± 8.7 71.73 ± 7.28 68.82 ± 6.63
NSAHeart 63.1 ± 0.24 65.15 ± 0 66.23 ± 0 63.1 ± 0 64.29 ± 0 64.93 ± 0.02 66.32 ± 3.73 69.32 ± 4.08 66.38 ± 4.57
Nwine 96.63 ± 0 37.64 ± 0 38.76 ± 0 83.71 ± 0 92.7 ± 0 96.63 ± 0 97.13 ± . 69 67.45 ± . 8 88.79 ± 1.49
NYeast 42.09 ± 2.37 31.74 ± 0 32.41 ± 0 35.92 ± 0 41.58 ± 0 33.09 ± 0 55.39 ± 2.73 56.41 ± 9.04 61.44 ± 2.31
Wine 65 ± 6.54 42.7 ± 0 61.24 ± 0 67.42 ± 0 69.66 ± 0 68.54 ± 0 79.32 ± 14.58 59.83 ± 6.25 86.37 ± 9.32
Table 5. Comparison of the mean Fisher standard of Genetic MDL with common clustering algorithms.
Data sets    K-means    Single    Average    Complete    Ward    FCM    EPMDLGAO    EPAFGAO    ABMDLGAO
balance_scale 55.9 ± 3.05 89.83 ± 0 53.16 ± 0 51.56 ± 0 53.4 ± 0 50.61± 6.25 56.99 ± 10.43 88.09 ± 7.02 54.96 ± 9.77
Halfring 78.65 ± 0 92.83 ± 0 92.97 ± 0 86.14 ± 0 86.14± 0 78.37±0 84.53 ± 12.52 84.44 ± 6.36 95.19 ± 6.34
Iris 88.22 ± 5.41 90.55 ± 0 91.68 ± 0 86.03 ± 0 90.29± 0 89.8±0 90.15 ± 2.75 88.09 ± 7.13 91.08 ± 2.89
Nbalance_scale 54.94 ± 2.54 89.83 ± 0 52.83± 0 51.82± 0 47.3± 0 49.86±0 55.61 ± 7.96 91.12 ± 3.7 61.94 ± 8.21
Nbreast 94.75 ± 0.14 90.36 ± 0 68.23± 0 94.22± 0 96.3± 0 94.67±0 97.49 ± 2.58 84.23 ± 8.36 95.74 ± 4.75
Nbupa 55.38 ± 0.05 83.01 ± 0 61.74± 0 54.89± 0 54.2± 0 62.36±0 84.290 ± 8.15 57.06 ± 2.52 65.16 ± 9.94
Ngalaxy 37.39 ± 2.73 66.53 ± 0 42.16± 0 35.32± 0 37.65± 0 31.43±0 45.66 ± 1.47 49.81 ± 11.98 68.55 ± 8.29
Nglass 55.43 ± 3.41 82.8 ± 0 37.58± 0 41.76± 0 55.15± 0 50.42±0 70.36 ± 1.61 84.76 ± 6.23 69.65 ± 1.05
Nionosphere 70.24 ± 0 90.03 ± 0 62.47± 0 65.65± 0 70.84± 0 69.79±0 72.62 ± 1.25 90.19 ± 4.54 77.93 ± 6.87
Table 6. Algorithm priorities based on ARI criterion.
Methods    balance_scale    Halfring    Iris    Nbalance_scale    Nbreast    Nbupa    Ngalaxy    Nglass    Nionosphere    NSAHeart    Nwine    NYeast    Wine
kmeans 5 -5 -5 3 2 -8 2 2 4 0 3 2 -2
Single -8 -8 -8 -8 -8 2 -6 -8 -6 -8 -8 -8 -8
Average -3 4 6 0 -5 -4 4 -6 -4 -6 -6 -6 -6
Complete -6 -1 -4 -6 -2 -2 -4 -4 0 -4 -2 -4 -2
Ward -1 -1 2 -4 6 0 -2 -2 6 2 4 0 2
Fcm -2 -5 1 5 -1 -6 -2 0 2 4 7 -2 -2
EPMDLGAO 7 6 8 7 8 6 -6 5 -2 -2 0 5 6
EPAFGAO 5 2 -2 -2 4 6 8 5 8 7 6 5 8
ABMDLGAO 3 8 2 5 -4 6 6 8 -8 7 -4 8 4
Table 7. Algorithm priorities based on Fisher standard.
Methods    balance_scale    Halfring    Iris    Nbalance_scale    Nbreast    Nbupa    Ngalaxy    Nglass    Nionosphere    NSAHeart    Nwine    NYeast    Wine
Kmeans 2 -2 -5 1 2 -4 -3 -1 -1 0 6 2 -1
Single 8 6 6 7 -2 7 6 6 7 6 -6 2 2
Average -3 5 6 -1 -8 0 0 -8 -8 -4 -8 -8 -4
Complete -7 -3 -8 -4 -1 -6 -6 -6 -6 -8 -2 -2 -4
Ward -1 -1 1 -7 4 -8 -3 -1 -1 -2 2 0 2
Fcm -7 -8 -1 -6 -2 2 -8 -4 -4 2 7 -6 -3
EPMDLGAO 2 -1 1 -1 8 7 2 4 2 -6 5 -2 0
EPAFGAO 6 -3 -5 7 -6 -2 4 8 7 4 -4 8 0
ABMDLGAO 0 7 5 4 5 4 8 2 4 8 0 6 8
Table 8. Algorithm priorities based on normalized mutual information criterion.
Methods    balance_scale    Halfring    Iris    Nbalance_scale    Nbreast    Nbupa    Ngalaxy    Nglass    Nionosphere    NSAHeart    Nwine    NYeast    Wine
kmeans 7 -5 -5 6 0 -4 2 -2 1 2 7 5 0
Single -8 -8 -6 -6 -8 0 -8 -8 -2 -2 -6 -8 -8
Average -4 6 2 -4 -6 5 6 1 -8 -4 -8 -6 -6
Complete -6 0 -6 -2 -3 -6 -6 -6 -6 -8 0 -2 0
ward 1 2 0 -8 6 -8 2 -2 4 0 4 3 -2
Fcm 6 -5 -3 5 0 -2 -2 4 8 4 -4 -4 -2
EPMDLGAO 1 4 5 1 4 5 8 6 6 7 7 7 6
EPAFGAO 1 -2 5 6 -1 2 -4 -1 -4 7 -2 4 4
ABMDLGAO 2 8 8 2 8 8 2 8 1 -6 2 1 8
Table 9. Algorithm priorities based on precision criterion.
Methods    balance_scale    Halfring    Iris    Nbalance_scale    Nbreast    Nbupa    Ngalaxy    Nglass    Nionosphere    NSAHeart    Nwine    NYeast    Wine
kmeans -2 0 -1 4 4 -6 3 4 7 -6 2 2 -2
Single -8 -6 -8 -5 -8 -2 -8 -8 -8 -2 -8 -8 -8
Average 2 6 3 5 -6 2 6 -6 -6 2 -6 -6 -4
Complete -4 -1 -4 0 3 1 -3 0 -1 -8 -2 -2 0
ward 6 -6 5 -8 3 -1 -5 -2 3 0 4 0 2
Fcm 3 -2 3 1 3 -6 0 -4 1 -2 6 -4 4
EPMDLGAO 2 4 1 -1 7 8 3 4 -2 4 8 5 6
EPAFGAO -6 -2 -6 8 4 1 -4 8 6 8 -4 5 -6
ABMDLGAO 7 7 7 -4 -2 3 8 4 0 4 0 8 8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.