Working Paper Article Version 1 This version is not peer-reviewed

# k-Means+++: Outliers-Resistant Clustering

Version 1 : Received: 23 September 2020 / Approved: 24 September 2020 / Online: 24 September 2020 (03:22:16 CEST)

How to cite: Statman, A.; Rozenberg, L.; Feldman, D. k-Means+++: Outliers-Resistant Clustering. Preprints 2020, 2020090558

## Abstract

The $k$-means problem is to compute a set of $k$ centers (points) that minimizes the sum of squared distances to a given set of $n$ points in a metric space. Arguably, the most common algorithm for solving it is $k$-means++, which is easy to implement and provides a provably small approximation factor in time that is linear in $n$. We generalize $k$-means++ to support: (i) non-metric spaces and any pseudo-distance function; in particular, it supports M-estimator functions that handle outliers, e.g., where the distance $\mathrm{dist}(p,x)$ between a pair of points is replaced by $\min\{\mathrm{dist}(p,x),1\}$; and (ii) $k$-means clustering with $m\geq 1$ outliers, i.e., where the $m$ farthest points from the $k$ centers are excluded from the total sum of distances. This is the first algorithm whose running time is linear in $n$ and polynomial in $k$ and $m$.
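To make the two objectives in the abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm: `seed_centers` is the standard $k$-means++ seeding step ($D^2$ sampling) with the distance function left pluggable, and the names `clipped_dist`, `seed_centers`, and `cost_with_outliers` are hypothetical, chosen for illustration only. It assumes points are tuples of floats and that the underlying distance is Euclidean.

```python
import math
import random


def clipped_dist(p, x, cap=1.0):
    """Pseudo-distance min{dist(p, x), cap}; cap=1 matches the abstract's example."""
    return min(math.dist(p, x), cap)


def seed_centers(points, k, dist=clipped_dist, rng=None):
    """Standard k-means++ style seeding (D^2 sampling) with a pluggable distance.

    This is an illustrative sketch, not the paper's generalized algorithm.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest chosen center.
    d2 = [dist(p, centers[0]) ** 2 for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform sampling.
            centers.append(rng.choice(points))
        else:
            # Sample the next center with probability proportional to d2.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        c = centers[-1]
        d2 = [min(old, dist(p, c) ** 2) for old, p in zip(d2, points)]
    return centers


def cost_with_outliers(points, centers, m, dist=math.dist):
    """Sum of squared distances to the nearest center, ignoring the m farthest points."""
    d2 = sorted(min(dist(p, c) ** 2 for c in centers) for p in points)
    return sum(d2[: max(0, len(d2) - m)])
```

For example, with `points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0)]` and the single center `(0.0, 0.0)`, `cost_with_outliers(points, [(0.0, 0.0)], m=1)` drops the far point `(10.0, 0.0)` and returns `1.0`, whereas `m=0` gives `101.0`. Each seeding round in this sketch touches every point once, so it runs in $O(nk)$ distance evaluations.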

## Subject Areas

Clustering; Approximation; Outliers
