Submitted: 16 January 2025
Posted: 17 January 2025
Abstract
Keywords:
1. Introduction
- The paper presents an in-depth review of nine clustering techniques: K-Means, BIRCH, Divisive Clustering, DBSCAN, OPTICS, Mean Shift, GMM, BGMM, and CLIQUE.
- A comparative analysis highlights the key characteristics, strengths, and limitations of each technique. This includes insights into parameter sensitivity, handling of clusters with varying densities, overfitting tendencies, computational complexity, and best application scenarios.
- The paper outlines a universal framework for image compression using clustering methods, including preprocessing, compression, and decompression phases.
- Each clustering technique was implemented to achieve image compression by segmenting image blocks into clusters and reconstructing them using cluster centroids.
- The implementations are adapted for diverse clustering methods, ensuring consistency in preprocessing, compression, and decompression phases.
- The paper provides detailed analysis and interpretation for each clustering technique, addressing trade-offs between compression and quality.
- Rigorous experiments were conducted using benchmark images from CID22 to validate the compression efficiency and image quality for all clustering techniques.
- The results are synthesized into a clear discussion that ranks the techniques by how well they balance compression ratio (CR) against structural similarity (SSIM).
- Custom visualizations demonstrate the impact of varying block sizes and parameters for each technique, offering intuitive insights into their performance characteristics.
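The block-clustering idea in the contributions above (segment image blocks into clusters, then rebuild each block from its cluster centroid) can be sketched in a few lines. This is a minimal illustration using plain K-Means on a grayscale image stored as a list of rows, not the paper's implementation. The function names, the default block size, the seed, and the choice to ignore partial blocks at the image border are all assumptions made for brevity.

```python
import random

def split_into_blocks(image, block):
    """Cut a 2-D grayscale image into flattened block x block tiles (row-major).
    Assumption: partial tiles at the right/bottom border are ignored."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(block) for j in range(block)]
        for r in range(0, h - block + 1, block)
        for c in range(0, w - block + 1, block)
    ]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda j: sq_dist(vector, centroids[j]))

def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd-style K-Means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in vectors]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(v, centroids) for v in vectors]
    return centroids, labels

def compress(image, block=2, k=2):
    """Compressed form: one centroid table plus one cluster index per block."""
    return kmeans(split_into_blocks(image, block), k)

def reconstruct_blocks(centroids, labels):
    """Decompression step: replace each index by its centroid block."""
    return [centroids[lab] for lab in labels]
```

The compressed representation here is just the centroid table and the per-block index list; fewer clusters or larger blocks shrink that representation at the cost of reconstruction quality, which is the CR versus SSIM trade-off the paper evaluates.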
2. Overview of Clustering Techniques
2.1. Partitioning Techniques
2.1.1. K-Means
2.1.2. BIRCH
2.2. Hierarchical Techniques
2.2.1. Divisive Clustering
2.3. Density-Based Techniques
2.3.1. DBSCAN
2.3.2. OPTICS
2.3.3. Mean Shift
2.4. Distribution-Based Techniques
2.4.1. GMM
2.4.2. BGMM
2.5. Grid-Based Techniques
2.5.1. CLIQUE
2.6. Comparative Analysis of Clustering Techniques
3. Implementation and Experimental Evaluation of Clustering Techniques
4. Compression and Decompression Framework Using Clustering Techniques
4.1. Compression Phase
4.1.1. Image Preprocessing
4.1.2. Clustering Initialization
4.1.3. Clustering Process
4.1.4. Centroid Quantization
4.1.5. Index Encoding
4.1.6. Run-Length Lossless Compression
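As a rough illustration of the run-length step, the sketch below encodes a sequence of cluster indices as (value, run length) pairs and decodes it again without loss. It assumes that run-length coding is applied to the per-block index stream; the function names and the pair-based output format are illustrative choices, not details taken from the paper.

```python
def rle_encode(indices):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in indices:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [tuple(run) for run in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out
```

Run-length coding pays off when neighbouring blocks fall into the same cluster, as in smooth image regions; on rapidly alternating indices the encoded stream can be longer than the input.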
4.2. Decompression Phase
4.2.1. Loading the Compressed Data
4.2.2. Block Reconstruction
4.2.3. Image Assembly
4.2.4. Post-Processing
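The image assembly step of the decompression phase can be pictured as tiling the reconstructed blocks back onto a canvas. The sketch below is a hedged illustration only: it assumes blocks are flattened in row-major order, that the image was padded up to a multiple of the block size before compression, and that cropping back to the original dimensions is part of post-processing. None of these details, nor the function name, come from the source.

```python
def assemble_image(blocks, height, width, block):
    """Tile flattened block x block tiles (row-major) into a 2-D image,
    then crop away any padding beyond height x width."""
    blocks_per_row = -(-width // block)    # ceiling division
    blocks_per_col = -(-height // block)
    canvas = [[0] * (blocks_per_row * block) for _ in range(blocks_per_col * block)]
    for idx, flat in enumerate(blocks):
        top = (idx // blocks_per_row) * block
        left = (idx % blocks_per_row) * block
        for i in range(block):
            for j in range(block):
                canvas[top + i][left + j] = flat[i * block + j]
    # Assumed post-processing: drop padded rows and columns.
    return [row[:width] for row in canvas[:height]]
```

For example, four 2 x 2 blocks assembled into a 3 x 3 image fill a 4 x 4 canvas first, and the last row and column are then cropped.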
5. Quality Assessment Metrics for Image Compression
5.1. Compression Ratio (CR)
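$$\mathrm{CR} = \frac{S_{\text{original}}}{S_{\text{compressed}}}$$
where:
- $S_{\text{original}}$: Size of the original image file (in bytes).
- $S_{\text{compressed}}$: Size of the compressed image file (in bytes).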
5.2. Bits Per Pixel (BPP)
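$$\mathrm{BPP} = \frac{8 \times S_{\text{compressed}}}{H \times W}$$
where:
- $S_{\text{compressed}}$: Size of the compressed image file (in bytes).
- $H$: Height of the image (in pixels).
- $W$: Width of the image (in pixels).
- The factor 8 converts bytes to bits.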
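Both size-based metrics, CR and BPP, reduce to one-line computations. The helper names below are hypothetical, and the sketch assumes file sizes are already known in bytes.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """CR: original file size divided by compressed file size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes

def bits_per_pixel(compressed_bytes, height, width):
    """BPP: compressed size in bits divided by the number of pixels."""
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    return 8 * compressed_bytes / (height * width)
```

As a usage example, a 256 x 256 image compressed to 8192 bytes gives `bits_per_pixel(8192, 256, 256) == 1.0`. If the original were stored as raw 8-bit grayscale (65536 bytes), the same file would give a CR of 8, so in that special case BPP equals 8 / CR.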
5.3. Structural Similarity Index (SSIM)
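$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where:
- $x$ and $y$: Corresponding image patches from the original and compressed images, respectively.
- $\mu_x$ and $\mu_y$: Mean intensities of $x$ and $y$.
- $\sigma_x^2$ and $\sigma_y^2$: Variances of $x$ and $y$.
- $\sigma_{xy}$: Covariance of $x$ and $y$.
- $C_1$ and $C_2$: Small constants to stabilize the division when the denominator is close to zero.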
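The quantities defined above (patch means, variances, covariance, and two stabilizing constants) combine into a single score per patch pair. The sketch below computes that score for two flattened patches. It is an illustration under stated assumptions: population (divide-by-N) statistics, an 8-bit dynamic range of 255, and the commonly used constants built from 0.01 and 0.03; the paper may use different settings. Practical SSIM implementations usually average this score over many local windows rather than evaluating one global patch.

```python
def ssim_patch(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equal-length flattened patches."""
    if len(x) != len(y) or not x:
        raise ValueError("patches must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Assumed constants; they keep the ratio stable for flat patches.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical patches score 1 (up to floating-point rounding), while patches whose intensities vary in opposite directions have negative covariance and can score below zero, which is how negative SSIM values can arise.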
6. Comprehensive Performance Analysis of Clustering Techniques in Image Compression
6.1. K-Means Clustering for Compression
6.2. BIRCH Clustering for Compression
6.3. Divisive Clustering for Compression
6.4. DBSCAN and OPTICS Clustering for Compression
6.5. Mean Shift Clustering for Compression
6.6. GMM and BGMM Clustering for Compression
6.7. CLIQUE Clustering for Compression
6.8. Discussion
7. Validation of Compression Results Using CID22 Benchmark Dataset
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| BIRCH | Balanced Iterative Reducing and Clustering using Hierarchies |
| DBSCAN | Density-Based Spatial Clustering of Applications with Noise |
| OPTICS | Ordering Points to Identify the Clustering Structure |
| GMM | Gaussian Mixture Models |
| BGMM | Bayesian Gaussian Mixture Models |
| CLIQUE | Clustering In Quest |
| SSIM | Structural Similarity Index |
| CR | Compression Ratio |
| CF | Clustering Feature |
| Eps | Epsilon |
| Min_Pts | Minimum Points |
| BPP | Bits Per Pixel |

| Criterion | DBSCAN | OPTICS |
|---|---|---|
| Handling Clusters of Varying Densities | Requires a single eps value to define the neighborhood | Does not require a single eps value; instead, it builds a reachability plot |
| Cluster Identification | Directly labels clusters and noise points based on eps and minPts | Produces an ordering of the data points and a reachability distance for each point |
| Reachability Plot | Not generated | Generated |
| Sensitivity to Parameters | Very sensitive to eps and minPts | Less sensitive to eps and minPts |
| Output | Final clusters and noise points | Ordering of the points and their reachability distances |
| Computational Complexity | Generally faster due to its direct clustering approach | More computationally intensive due to the reachability plot |
| Criterion | GMM | BGMM |
|---|---|---|
| Number of Clusters | Required | Not required |
| Parameter Estimation | Uses EM algorithm to estimate the parameters | Utilizes variational inference instead of EM |
| Priors | Not incorporated | Incorporated |
| Computational Complexity | Simple | Complex |
| Output | Direct Clustering | Probabilistic Clustering |
| Criterion | K-Means | BIRCH | Divisive Clustering | DBSCAN | OPTICS | Mean Shift | GMM | BGMM | CLIQUE |
|---|---|---|---|---|---|---|---|---|---|
| Used Parameters | k | k, branching factor, threshold | k | minPts, eps | minPts, eps (optional) | Bandwidth | k | Priors, k (optional) | Grid size, density threshold |
| Parameter Estimation | Elbow method, Silhouette | Elbow method, Silhouette | Dendrogram analysis | Heuristic, domain knowledge | Heuristic, domain knowledge | Iterative shifting towards density peaks | Expectation-Maximization | Variational inference | Grid density analysis |
| Sensitivity to Parameters | High | Moderate | N/A | High | Low, depends on minPts | High | High | Low to moderate | Moderate |
| Hierarchical Clustering | No | Yes | Yes | No | No | No | No | No | |
| Number of Clusters | Predefined | Predefined or determined based on CF Tree structure | Determined | Determined by data density | Inferred from reachability plot | Determined by density peaks | Predefined | Inferred from data using priors | Determined by grid density |
| Handling Clusters of Varying Densities | Poor | Good | Moderate | Good | Excellent | Excellent | Poor (assumes Gaussian shape and variance) | Good | Good |
| Cluster Assignment | Hard | Hard | Hard | Soft (points can belong to multiple clusters) | Soft (based on reachability distance) | Hard | Soft (probabilistic) | Soft (probabilistic) | Soft |
| Output | Fixed number of clusters and centroids | CF Tree and clusters | Hierarchical tree and clusters | Cluster labels and noise points | Reachability plot and cluster labels | Cluster labels and convergence points | Mixture components (means, covariances) | Probabilistic cluster memberships and model parameters | Cluster labels |
| Handling Overfitting | Prone to overfitting if k is too high | Moderate | Moderate | Robust (explicit handling of noise) | Robust | Moderate | Prone to overfitting with too many components | Robust | Depends on grid granularity |
| Complexity | O(n × k × d × iterations) | O(n) | O(n²) | O(n log n) | O(n log n) (slightly higher than DBSCAN) | O(n²) | O(n × k × d × iterations) | Higher than GMM due to variational inference | O(n × grids) |
| Flexibility | Moderate | Moderate to High (handles varying shapes, adaptable structure) | High (can adapt to different cluster shapes) | High (handles arbitrary shapes and noise) | Very High (handles varying densities, arbitrary shapes) | High (handles arbitrary shapes, non-parametric) | Moderate (fixed Gaussian shapes, fixed number of clusters) | High (flexible through priors, infers number of clusters) | Moderate |
| Best Application | When number of clusters is known, clusters are spherical | Large datasets, incremental clustering, data streams | Hierarchical data, large initial clusters need division | Spatial data, noise-prone environments, arbitrary shaped clusters | Complex datasets with varying densities, spatial data | Image processing, complex shapes, non-parametric clustering | When Gaussian assumptions hold, for overlapping clusters | When prior knowledge exists, uncertainty in number of clusters | High-dimensional data |
| Image | K-Means CR | K-Means SSIM | BIRCH CR | BIRCH SSIM | Divisive CR | Divisive SSIM | DBSCAN CR | DBSCAN SSIM | OPTICS CR | OPTICS SSIM | Mean Shift CR | Mean Shift SSIM | GMM CR | GMM SSIM | BGMM CR | BGMM SSIM | CLIQUE CR | CLIQUE SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sports action | 22.20 | 0.993 | 1.76 | 0.012 | 23.45 | 0.993 | 1.58 | 1.0 | 266.06 | -0.113 | 1.6 | 1.0 | 21.02 | 0.941 | 25.33 | 0.905 | 14.39 | 0.992 |
| Mechanical objects | 19.05 | 0.985 | 1.61 | 0.015 | 20.13 | 0.984 | 1.49 | 1.0 | 4468 | 0.441 | 1.50 | 1.0 | 18.30 | 0.949 | 23.10 | 0.928 | 11.43 | 0.984 |
| Vehicles | 27.05 | 0.994 | 1.95 | 0.007 | 28.34 | 0.994 | 1.75 | 1.0 | 337.84 | 0.597 | 1.76 | 1.0 | 24.44 | 0.978 | 28.88 | 0.968 | 14.40 | 0.992 |
| Food photography | 20.52 | 0.996 | 1.67 | -0.004 | 21.90 | 0.996 | 1.54 | 1.0 | 154.21 | 0.530 | 1.54 | 1.0 | 20.47 | 0.977 | 23.30 | 0.952 | 12.08 | 0.992 |
| Outdoor scenes | 17.80 | 0.985 | 1.41 | -0.155 | 18.88 | 0.984 | 1.36 | 1.0 | 183.93 | 0.208 | 1.32 | 1.0 | 18.13 | 0.961 | 21.80 | 0.911 | 7.75 | 0.988 |
| Macro photography (insects) | 30.58 | 0.996 | 3.39 | -0.032 | 31.55 | 0.996 | 2.77 | 1.0 | 4468.6 | 0.189 | 2.91 | 1.0 | 23.70 | 0.933 | 29.27 | 0.921 | 29.29 | 0.963 |
| Artwork | 20.16 | 0.985 | 1.58 | -0.056 | 21.29 | 0.984 | 1.43 | 1.0 | 104.86 | 0.417 | 1.46 | 1.0 | 18.40 | 0.943 | 20.91 | 0.934 | 7.98 | 0.989 |
| Cloud formations | 19.75 | 0.998 | 1.86 | 0.149 | 21.54 | 0.988 | 1.63 | 1.0 | 191.45 | 0.302 | 1.69 | 1.0 | 18.80 | 0.982 | 21.44 | 0.928 | 18.24 | 0.987 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).