Working Paper Article Version 1 This version is not peer-reviewed

# k-Means+++: Outliers-Resistant Clustering

Version 1 : Received: 23 September 2020 / Approved: 24 September 2020 / Online: 24 September 2020 (03:22:16 CEST)
Version 2 : Received: 18 November 2020 / Approved: 19 November 2020 / Online: 19 November 2020 (10:54:12 CET)

A peer-reviewed article of this Preprint also exists.

Statman, A.; Rozenberg, L.; Feldman, D. k-Means+++: Outliers-Resistant Clustering. Algorithms 2020, 13, 311.

Journal reference: Algorithms 2020, 13, 311
DOI: 10.3390/a13120311

## Abstract

The $k$-means problem is to compute a set of $k$ centers (points) that minimizes the sum of squared distances to a given set of $n$ points in a metric space. Arguably, the most common algorithm to solve it is $k$-means++, which is easy to implement and provides a provably small approximation factor in time that is linear in $n$. We generalize $k$-means++ to support: (i) non-metric spaces and any pseudo-distance function. In particular, it supports M-estimator functions that handle outliers, e.g., where the distance $\mathrm{dist}(p,x)$ between a pair of points is replaced by $\min\{\mathrm{dist}(p,x),1\}$. (ii) $k$-means clustering with $m\geq 1$ outliers, i.e., where the $m$ farthest points from the $k$ centers are excluded from the total sum of distances. This is the first algorithm whose running time is linear in $n$ and polynomial in $k$ and $m$.
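To make the two ingredients of the abstract concrete, the following is a minimal sketch and not the authors' algorithm. It shows the standard $k$-means++ seeding rule (sample each new center with probability proportional to its squared distance to the nearest chosen center) run with the clipped pseudo-distance $\min\{\mathrm{dist}(p,x),1\}$, and the cost that excludes the $m$ farthest points. The function names `clipped_dist`, `seed_centers`, and `cost_with_outliers`, the `threshold` parameter, and the choice of the Euclidean distance are all assumptions made for illustration.

```python
import math
import random


def clipped_dist(p, x, threshold=1.0):
    """Pseudo-distance min{dist(p, x), threshold}; Euclidean dist is assumed."""
    return min(math.dist(p, x), threshold)


def seed_centers(points, k, dist=clipped_dist, rng=None):
    """Standard k-means++-style seeding with a pluggable pseudo-distance.

    Illustrative only: each new center is sampled with probability
    proportional to its squared pseudo-distance to the nearest chosen center.
    """
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    # Squared pseudo-distance from each point to its nearest center so far.
    d2 = [dist(p, centers[0]) ** 2 for p in points]
    while len(centers) < k:
        if sum(d2) == 0:  # every point coincides with a chosen center
            break
        new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        d2 = [min(old, dist(p, new_center) ** 2) for p, old in zip(points, d2)]
    return centers


def cost_with_outliers(points, centers, m, dist=math.dist):
    """Sum of squared distances after excluding the m farthest points."""
    d2 = sorted(min(dist(p, c) ** 2 for c in centers) for p in points)
    return sum(d2[: max(len(d2) - m, 0)])
```

For example, with the points `[(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (100.0, 100.0)]` and centers `[(0.0, 0.0), (5.0, 5.0)]`, calling `cost_with_outliers(points, centers, m=1)` drops the far point `(100.0, 100.0)` and returns roughly `0.02`, whereas `m=0` lets that single point dominate the cost.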

## Keywords

Clustering; Approximation; Outliers

## Subject

MATHEMATICS & COMPUTER SCIENCE, Numerical Analysis & Optimization
