Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

An Improved Similarity-based Clustering Algorithm for Multi-database Mining

Version 1 : Received: 8 April 2021 / Approved: 9 April 2021 / Online: 9 April 2021 (10:20:06 CEST)

A peer-reviewed article of this Preprint also exists.

Miloudi, S.; Wang, Y.; Ding, W. An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining. Entropy 2021, 23, 553.

Abstract

Clustering algorithms for multi-database mining (MDM) rely on computing $(n^2-n)/2$ pairwise similarities among $n$ databases to generate and evaluate $m\in[1, (n^2-n)/2]$ candidate clusterings, from which they select the partitioning that optimizes a predefined goodness measure. However, when these pairwise similarities are distributed around the mean value, the clustering algorithm becomes indecisive about which database pairs are eligible to be grouped together. Consequently, it produces a trivial result, either putting all $n$ databases in one cluster or returning $n$ singleton clusters. To tackle this problem, we propose a learning algorithm that reduces the fuzziness in the similarity matrix by minimizing a weighted binary entropy loss function via gradient descent and back-propagation. As a result, the learned model improves the certainty of the clustering algorithm, allowing it to correctly identify the optimal database clusters. Additionally, in contrast to gradient-based clustering algorithms, which are sensitive to the choice of the learning rate and require more iterations to converge, we propose a learning-rate-free algorithm that assesses the candidate clusterings generated on the fly in fewer, upper-bounded iterations. Through a series of experiments on multiple database samples, we show that our algorithm outperforms the existing clustering algorithms for MDM.
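
As a rough illustration of the quantities mentioned above, the following Python sketch counts the $(n^2-n)/2$ database pairs and scores how ambiguous a similarity matrix is using the standard binary entropy function. It is not the authors' algorithm: the helper names (`num_pairs`, `binary_entropy`, `mean_fuzziness`), the use of an unweighted mean over the pairs as a fuzziness score, and the two toy matrices are all assumptions made for illustration only.

```python
import math
from itertools import combinations


def num_pairs(n):
    """Number of unordered database pairs, (n^2 - n) / 2."""
    return (n * n - n) // 2


def binary_entropy(p):
    """Binary entropy in bits; defined as 0 at the endpoints p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def mean_fuzziness(sim):
    """Average binary entropy over the upper triangle of a similarity matrix.

    Hypothetical score (an assumption, not the paper's exact fuzziness index):
    it is close to 1 when similarities sit near 0.5 and 0 when every
    similarity is exactly 0 or 1.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two databases")
    pairs = list(combinations(range(n), 2))
    return sum(binary_entropy(sim[i][j]) for i, j in pairs) / len(pairs)


# Toy similarity matrices for three databases (made-up values).
ambiguous = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
crisp = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    print(num_pairs(len(ambiguous)))
    print(mean_fuzziness(ambiguous))
    print(mean_fuzziness(crisp))
```

Under these assumptions, the matrix whose entries all sit at 0.5 gets the highest possible score, which matches the indecisive case described in the abstract, while the matrix with only 0 and 1 entries scores zero and admits an unambiguous grouping.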

Keywords

Multi-database Mining; Graph Clustering; Coordinate Descent; Convex Optimization; Similarity Measure; Binary Entropy Loss; Fuzziness Index

Subject

Computer Science and Mathematics, Mathematics
