Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

# An Improved Similarity-based Clustering Algorithm for Multi-database Mining

Version 1: Received: 8 April 2021 / Approved: 9 April 2021 / Online: 9 April 2021 (10:20:06 CEST)

A peer-reviewed article of this Preprint also exists.

Miloudi, S.; Wang, Y.; Ding, W. An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining. Entropy 2021, 23, 553.

## Abstract

Clustering algorithms for multi-database mining (MDM) rely on computing $(n^2-n)/2$ pairwise similarities among $n$ databases to generate and evaluate $m\in[1, (n^2-n)/2]$ candidate clusterings, from which they select the partitioning that optimizes a predefined goodness measure. However, when these pairwise similarities are concentrated around the mean value, the clustering algorithm becomes indecisive about which database pairs are eligible to be grouped together. Consequently, it produces a trivial result, either putting all $n$ databases in one cluster or returning $n$ singleton clusters. To tackle this problem, we propose a learning algorithm that reduces the fuzziness of the similarity matrix by minimizing a weighted binary entropy loss function via gradient descent and back-propagation. As a result, the learned model improves the certainty of the clustering algorithm in identifying the optimal database clusters. Additionally, in contrast to gradient-based clustering algorithms, which are sensitive to the choice of the learning rate and require more iterations to converge, we propose a learning-rate-free algorithm that assesses the candidate clusterings generated on the fly in a smaller, upper-bounded number of iterations. Through a series of experiments on multiple database samples, we show that our algorithm outperforms the existing clustering algorithms for MDM.
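
To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a symmetric similarity matrix with entries in $[0, 1]$, generates one candidate clustering per distinct similarity value by thresholding and taking connected components (so at most $(n^2-n)/2$ candidates), and measures fuzziness as the mean unweighted binary entropy of the pairwise similarities. The function names `binary_entropy`, `fuzziness`, and `candidate_clusterings` are hypothetical, and the weighted loss and learning procedure of the paper are not reproduced here.

```python
import math
from itertools import combinations


def binary_entropy(p):
    """Binary entropy in bits; taken to be 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fuzziness(sim):
    """Mean binary entropy over the (n^2 - n)/2 pairwise similarities.

    Assumption: an unweighted average; the paper uses a weighted loss.
    """
    pairs = list(combinations(range(len(sim)), 2))
    return sum(binary_entropy(sim[i][j]) for i, j in pairs) / len(pairs)


def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def candidate_clusterings(sim):
    """Return (threshold, clustering) pairs, one per distinct similarity.

    Two databases are linked when their similarity is >= the threshold;
    clusters are the connected components of the resulting graph.
    """
    n = len(sim)
    pairs = list(combinations(range(n), 2))
    thresholds = sorted({sim[i][j] for i, j in pairs}, reverse=True)
    candidates = []
    for t in thresholds:
        parent = list(range(n))
        for i, j in pairs:
            if sim[i][j] >= t:
                parent[_find(parent, i)] = _find(parent, j)
        groups = {}
        for i in range(n):
            groups.setdefault(_find(parent, i), []).append(i)
        candidates.append((t, sorted(groups.values())))
    return candidates


# Illustrative data (made up): databases 0-1 and 2-3 are similar.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
candidates = candidate_clusterings(sim)
```

In this toy matrix the lowest threshold merges everything into a single cluster, which is one of the trivial outcomes the abstract describes. A matrix whose off-diagonal entries all equal 0.5 has the maximum fuzziness of 1 under this measure, while entries of exactly 0 or 1 give a fuzziness of 0.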

## Keywords

Multi-database Mining; Graph Clustering; Coordinate Descent; Convex Optimization; Similarity Measure; Binary Entropy Loss; Fuzziness Index

## Subject

Computer Science and Mathematics, Mathematics
