$k$-Means+++: Outliers-Resistant Clustering

Adiel Statman; Liat Rozenberg; Dan Feldman

Submitted: 23 September 2020

Posted: 24 September 2020

Abstract

The $k$-means problem is to compute a set of $k$ centers (points) that minimizes the sum of squared distances to a given set of $n$ points in a metric space. Arguably, the most common algorithm for solving it is $k$-means++, which is easy to implement and provides a provably small approximation factor in time that is linear in $n$. We generalize $k$-means++ to support: (i) non-metric spaces and any pseudo-distance function. In particular, it supports M-estimator functions that handle outliers, e.g., where the distance $\mathrm{dist}(p,x)$ between a pair of points is replaced by $\min\{\mathrm{dist}(p,x),1\}$. (ii) $k$-means clustering with $m\geq 1$ outliers, i.e., where the $m$ farthest points from the $k$ centers are excluded from the total sum of distances. This is the first algorithm whose running time is linear in $n$ and polynomial in $k$ and $m$.
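To make the two objectives in the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes points are tuples of floats and uses standard $D^2$-style seeding with a pluggable distance; the names `capped_dist`, `point_costs`, `seed_centers`, and `cost_with_outliers` are hypothetical and chosen only for illustration.

```python
import math
import random


def capped_dist(p, x, cap=1.0):
    """Euclidean distance truncated at `cap`, i.e. min{dist(p, x), cap}."""
    return min(math.dist(p, x), cap)


def point_costs(points, centers, dist=math.dist):
    """Squared (pseudo-)distance from each point to its nearest center."""
    return [min(dist(p, c) for c in centers) ** 2 for p in points]


def seed_centers(points, k, dist=math.dist, rng=None):
    """D^2-style seeding (illustrative): each new center is drawn with
    probability proportional to its current squared (pseudo-)distance cost."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = point_costs(points, centers, dist)
        if sum(weights) == 0:
            # Every point already coincides with a center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def cost_with_outliers(points, centers, m, dist=math.dist):
    """Sum of squared distances after discarding the m farthest points."""
    costs = sorted(point_costs(points, centers, dist))
    return sum(costs[: max(len(costs) - m, 0)])
```

Passing `dist=capped_dist` to `seed_centers` or `point_costs` corresponds to case (i), where a far-away point contributes at most a bounded cost, while `cost_with_outliers` evaluates the objective of case (ii) for a given set of centers.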

Keywords: Clustering; Approximation; Outliers

Subject: Computer Science and Mathematics - Mathematics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
