An Improved Similarity-based Clustering Algorithm for Multi-database Mining

Salim Miloudi; Yulin Wang; Wenjia Ding

doi:10.20944/preprints202104.0256.v1

Submitted: 08 April 2021

Posted: 09 April 2021


Abstract

Clustering algorithms for multi-database mining (MDM) rely on computing $(n^2-n)/2$ pairwise similarities among $n$ databases to generate and evaluate $m\in[1, (n^2-n)/2]$ candidate clusterings, from which they select the partitioning that optimizes a predefined goodness measure. However, when these pairwise similarities are concentrated around the mean value, the clustering algorithm becomes indecisive about which database pairs are eligible to be grouped together. Consequently, it produces a trivial result, either putting all $n$ databases in one cluster or returning $n$ singleton clusters. To tackle this problem, we propose a learning algorithm that reduces the fuzziness in the similarity matrix by minimizing a weighted binary entropy loss function via gradient descent and back-propagation. As a result, the learned model improves the certainty of the clustering algorithm by correctly identifying the optimal database clusters. Additionally, in contrast to gradient-based clustering algorithms, which are sensitive to the choice of the learning rate and require more iterations to converge, we propose a learning-rate-free algorithm that assesses the candidate clusterings generated on the fly in fewer, upper-bounded iterations. Through a series of experiments on multiple database samples, we show that our algorithm outperforms existing clustering algorithms for MDM.
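To make the indecision problem concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fuzziness index defined as the mean binary entropy over the $(n^2-n)/2$ pairwise similarities, and it assumes candidate clusterings are the connected components obtained by thresholding the similarity matrix at each distinct similarity level. The abstract specifies neither detail, and all function names are hypothetical.

```python
import math
from itertools import combinations


def binary_entropy(s):
    """Binary entropy (in bits) of a similarity value s in [0, 1]."""
    if s <= 0.0 or s >= 1.0:
        return 0.0
    return -(s * math.log2(s) + (1.0 - s) * math.log2(1.0 - s))


def fuzziness_index(sim):
    """Assumed index: mean binary entropy over the (n^2 - n)/2 database pairs."""
    pairs = list(combinations(range(len(sim)), 2))
    return sum(binary_entropy(sim[i][j]) for i, j in pairs) / len(pairs)


def clusters_at_threshold(sim, delta):
    """Connected components of the graph linking pairs with similarity >= delta."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if sim[i][j] >= delta:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def candidate_clusterings(sim):
    """One candidate per distinct similarity level, so at most (n^2 - n)/2 candidates."""
    n = len(sim)
    levels = sorted({sim[i][j] for i, j in combinations(range(n), 2)}, reverse=True)
    return [(delta, clusters_at_threshold(sim, delta)) for delta in levels]


# Hypothetical similarity matrices, made up for illustration only.
fuzzy = [[1.0, 0.5, 0.5],
         [0.5, 1.0, 0.5],
         [0.5, 0.5, 1.0]]
crisp = [[1.0, 0.9, 0.1, 0.1],
         [0.9, 1.0, 0.1, 0.1],
         [0.1, 0.1, 1.0, 0.1],
         [0.1, 0.1, 0.1, 1.0]]
crisp[2][3] = crisp[3][2] = 0.9  # databases 2 and 3 form a second group

print(fuzziness_index(fuzzy), candidate_clusterings(fuzzy))
print(fuzziness_index(crisp), candidate_clusterings(crisp))
```

In this sketch, the matrix with every similarity equal to 0.5 has the maximum fuzziness of 1 bit and yields only the trivial one-cluster candidate, whereas the low-entropy matrix also yields the non-trivial partition {0, 1}, {2, 3}.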

Keywords: Multi-database Mining; Graph Clustering; Coordinate Descent; Convex Optimization; Similarity Measure; Binary Entropy Loss; Fuzziness Index

Subject: Computer Science and Mathematics - Mathematics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
