Mathematics Ph.D. Dissertations

Title

K-groups: A Generalization of K-means by Energy Distance

Date of Award

2015

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Maria Rizzo (Advisor)

Second Advisor

Christopher Rump (Other)

Third Advisor

Hanfeng Chen (Committee Member)

Fourth Advisor

Wei Ning (Committee Member)

Abstract

We propose two distribution-based clustering algorithms, which we call K-groups. Our algorithms assign observations to the same cluster if they follow a common distribution. Energy distance is a non-negative measure of the distance between distributions, based on Euclidean distances between random observations, that is zero if and only if the distributions are identical. We use energy distance to measure the statistical distance between two clusters, and we search for the partition that maximizes the total between-cluster energy distance. To implement our algorithms, we apply a version of Hartigan and Wong's idea of moving one point, and we generalize this idea to moving any m points. We also prove that K-groups is a generalization of the K-means algorithm: K-means is a limiting case of K-groups, in which the two methods share a common objective function and updating formula.
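For concreteness, here is a minimal Python sketch of these two ingredients: the two-sample energy distance, and a brute-force version of the one-point move. This is an illustration under our reading of the abstract, not the dissertation's implementation; the function names are ours, and the sketch recomputes the full dispersion at every candidate move rather than using the efficient updating formulas the dissertation derives.

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(x, y):
    """Two-sample energy distance:
    2*mean|x_i - y_j| - mean|x_i - x_k| - mean|y_j - y_l|.
    Non-negative, and zero iff the two distributions coincide."""
    return 2 * cdist(x, y).mean() - cdist(x, x).mean() - cdist(y, y).mean()

def within_dispersion(X, labels, k):
    """Total within-cluster dispersion. By the usual energy decomposition,
    minimizing this maximizes the total between-cluster energy distance,
    since their sum is fixed by the pooled sample."""
    w = 0.0
    for j in range(k):
        c = X[labels == j]
        if len(c):
            w += 0.5 * len(c) * cdist(c, c).mean()
    return w

def k_groups_one_point(X, k, max_sweeps=50, seed=0):
    """Greedy 'move one point' pass: reassign each observation to the
    cluster that most lowers the dispersion; stop when a full sweep
    makes no move. X is an (n, d) array; brute force, O(n^2) per check."""
    labels = np.random.default_rng(seed).integers(k, size=len(X))
    for _ in range(max_sweeps):
        moved = False
        for i in range(len(X)):
            old = labels[i]
            scores = []
            for j in range(k):
                labels[i] = j
                scores.append(within_dispersion(X, labels, k))
            labels[i] = int(np.argmin(scores))
            moved = moved or labels[i] != old
        if not moved:
            break
    return labels
```

On a pooled sample such as np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))]) with k = 2, a greedy pass of this kind should typically separate the two components.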

K-means is one of the best-known clustering algorithms. Previous research has established several of its disadvantages: K-means performs poorly when clusters are skewed or overlapping; it cannot handle categorical data, because the mean is not a meaningful measure of center for such data; and it cannot be applied when the dimension exceeds the sample size. Our K-groups methods provide a practical and effective solution to these problems.

Simulation studies of the performance of the clustering algorithms on univariate and multivariate mixture distributions are presented. Four validation indices (diagonal, Kappa, Rand, and corrected Rand) are reported for each example in the simulation study. Results of the empirical studies show that both K-groups algorithms perform as well as K-means when clusters are well separated and spherically shaped, while the K-groups algorithms perform better than K-means when clusters are skewed or overlapping. The K-groups algorithms are also more robust than K-means with respect to outliers. Results are presented for three multivariate data sets: wine cultivars, dermatology diseases, and oncology cases. In each of these real-data examples, both K-groups algorithms outperform K-means.
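For readers reproducing such comparisons, the Rand and corrected (adjusted) Rand indices can be computed from the contingency table of true versus estimated labels. A minimal Python sketch follows; the function name rand_indices is ours, chosen for illustration.

```python
import numpy as np
from scipy.special import comb

def rand_indices(true_labels, pred_labels):
    """Rand index and corrected (adjusted) Rand index of two partitions,
    computed from their contingency table (Hubert-Arabie correction)."""
    t = np.unique(true_labels, return_inverse=True)[1]
    p = np.unique(pred_labels, return_inverse=True)[1]
    table = np.zeros((t.max() + 1, p.max() + 1))
    for a, b in zip(t, p):
        table[a, b] += 1
    pairs_both = comb(table, 2).sum()             # pairs together in both
    pairs_true = comb(table.sum(axis=1), 2).sum() # pairs together in truth
    pairs_pred = comb(table.sum(axis=0), 2).sum() # pairs together in estimate
    total = comb(len(t), 2)
    rand = (total + 2 * pairs_both - pairs_true - pairs_pred) / total
    expected = pairs_true * pairs_pred / total
    corrected = (pairs_both - expected) / (
        0.5 * (pairs_true + pairs_pred) - expected)
    return rand, corrected
```

sklearn.metrics.adjusted_rand_score computes the corrected index directly and can serve as a cross-check.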
