Article
Version 2
This version is not peer-reviewed
k-Means+++: Outliers-Resistant Clustering
Version 1 : Received: 23 September 2020 / Approved: 24 September 2020 / Online: 24 September 2020 (03:22:16 CEST)
Version 2 : Received: 18 November 2020 / Approved: 19 November 2020 / Online: 19 November 2020 (10:54:12 CET)
A peer-reviewed article of this Preprint also exists.
Statman, A.; Rozenberg, L.; Feldman, D. k-Means: Outliers-Resistant Clustering
Abstract
The $k$-means problem is to compute a set of $k$ centers (points) that minimizes the sum of squared distances to a given set of $n$ points in a metric space. Arguably, the most common algorithm for solving it is $k$-means++, which is easy to implement and provides a provably small approximation factor in time that is linear in $n$. We generalize $k$-means++ to support: (i) non-metric spaces and any pseudo-distance function. In particular, it supports M-estimator functions that handle outliers, e.g., where the distance $\mathrm{dist}(p,x)$ between a pair of points is replaced by $\min\{\mathrm{dist}(p,x),1\}$. (ii) $k$-means clustering with $m\geq 1$ outliers, i.e., where the $m$ farthest points from the $k$ centers are excluded from the total sum of distances. This is the first algorithm whose running time is linear in $n$ and polynomial in $k$ and $m$.
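To make the two generalizations concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes points are tuples of numbers and shows (a) $k$-means++-style seeding in which the per-point cost is computed by a caller-supplied pseudo-distance, such as the clipped distance $\min\{\mathrm{dist}(p,x),1\}$, and (b) a clustering cost that excludes the $m$ farthest points. The names `sq_dist`, `clipped_dist`, `seed_centers`, and `cost_with_outliers` are hypothetical and do not come from the paper.

```python
import math
import random


def sq_dist(p, x):
    """Squared Euclidean distance (the classic k-means cost per point)."""
    return sum((a - b) ** 2 for a, b in zip(p, x))


def clipped_dist(p, x, cap=1.0):
    """Illustrative M-estimator-style pseudo-distance: min{dist(p, x), cap}."""
    return min(math.sqrt(sq_dist(p, x)), cap)


def seed_centers(points, k, pdist, rng):
    """k-means++-style seeding: sample each new center with probability
    proportional to its current cost under the pseudo-distance `pdist`."""
    centers = [rng.choice(points)]
    cost = [pdist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(cost) == 0:
            break  # every point already coincides with a chosen center
        c = rng.choices(points, weights=cost, k=1)[0]
        centers.append(c)
        # Each point keeps the cost to its nearest center so far.
        cost = [min(old, pdist(p, c)) for old, p in zip(cost, points)]
    return centers


def cost_with_outliers(points, centers, pdist, m):
    """Sum of per-point costs after discarding the m most expensive points."""
    per_point = sorted(min(pdist(p, c) for c in centers) for p in points)
    return sum(per_point[:max(0, len(per_point) - m)])


points = [(0, 0), (0, 1), (10, 10), (10, 11), (100, 100)]
centers = [(0, 0), (10, 10)]
print(cost_with_outliers(points, centers, sq_dist, 0))       # outlier dominates
print(cost_with_outliers(points, centers, sq_dist, 1))       # outlier excluded
print(cost_with_outliers(points, centers, clipped_dist, 0))  # outlier capped
print(seed_centers(points, 2, sq_dist, random.Random(0)))
```

In this toy data the single far point dominates the plain cost, while either excluding $m=1$ outlier or capping the distance keeps its contribution bounded. Each seeding round scans all $n$ points once, so this sketch runs in time linear in $n$ for fixed $k$; it makes no claim about the approximation guarantees stated in the abstract.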
Keywords
Clustering; Approximation; Outliers
Subject
Computer Science and Mathematics, Algebra and Number Theory
Copyright: This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Comments (1)
Commenter: Adiel Statman
Commenter's Conflict of Interests: Author