Mathematics Ph.D. Dissertations

Title

Visualization and Unsupervised Pattern Recognition in Multidimensional Data Using a New Heuristic for Linear Data Ordering

Date of Award

2016

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Craig Zirbel (Advisor)

Second Advisor

Haowen Xi (Committee Member)

Third Advisor

Hanfeng Chen (Committee Member)

Fourth Advisor

Maria Rizzo (Committee Member)

Fifth Advisor

Junfeng Shang (Committee Member)

Abstract

In data-driven applications, understanding the structural relationship in the given data can greatly facilitate data analysis and decision making in their broadest sense. Many different tools, such as multidimensional scaling and hierarchical clustering, have been developed and used for this purpose. Seriation is another method. Given a sample of n objects and the corresponding dissimilarity matrix, seriation aims to produce a linear ordering of the objects. One uses the ordering to produce a heat map visualization of the reordered dissimilarity matrix and thus understand the structure of the data. Good orderings should reflect the underlying data structure and result in heat maps that are easy to read and allow for clear interpretation of the data structure. Since the pioneering work of F. Petrie in 1899, a substantial number of seriation methods have been developed. Which methods consistently produce good orderings? In the literature, some authors have made comparisons of different seriation methods. However, the number of seriation methods compared and the number of datasets used are relatively small.

This dissertation conducts an evaluation study of the potential of 35 existing seriation methods and one novel method to reveal the structure of data. Initial assessment of the potential is conducted for all 36 methods across six datasets with relatively simple data structure. Further assessment is conducted for the most successful seriation methods using another collection of six datasets with a more sophisticated data structure. The assessment results show that some seriation methods consistently produce orderings that are more helpful for understanding and visualization of the structure of data, and that some methods should only be used when their particular features are called for. The results also show that even the better methods should be used with proper caution.

This dissertation introduces a new seriation method, called tree-penalized TSP (tpTSP), which compares favorably with other considered methods. Hybrid in nature, the method benefits from the strengths of two popular types of seriation methods, TSP and OLO, but avoids their key pitfalls. The datasets used for the performance evaluation and the R code for the new method are posted on GitHub.
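
As a rough illustration of the seriation task described in the abstract (a dissimilarity matrix goes in, a linear ordering comes out, and the matrix is reordered for a heat map), the following minimal Python sketch may help. It is not the dissertation's tpTSP method and not its R code. It uses a plain nearest-neighbour path heuristic as a stand-in for a TSP-style seriation method, and the function names (`dissimilarity_matrix`, `path_length`, `greedy_seriation`, `reorder`) and the toy data are assumptions made only for this example.

```python
import numpy as np


def dissimilarity_matrix(points):
    """Pairwise Euclidean distances between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def path_length(D, order):
    """Sum of dissimilarities between consecutive objects in `order`."""
    order = np.asarray(order)
    return float(D[order[:-1], order[1:]].sum())


def greedy_seriation(D):
    """Illustrative nearest-neighbour path heuristic (NOT tpTSP).

    Builds one path from every possible starting object by repeatedly
    moving to the closest unvisited object, then keeps the shortest path.
    """
    n = D.shape[0]
    best_order, best_len = None, np.inf
    for start in range(n):
        order = [start]
        unvisited = set(range(n)) - {start}
        while unvisited:
            last = order[-1]
            nxt = min(unvisited, key=lambda j: D[last, j])
            order.append(nxt)
            unvisited.remove(nxt)
        length = path_length(D, order)
        if length < best_len:
            best_order, best_len = order, length
    return best_order


def reorder(D, order):
    """Apply the same permutation to the rows and columns of D."""
    return D[np.ix_(order, order)]


# Hypothetical toy data: 8 objects equally spaced on a line, then shuffled.
rng = np.random.default_rng(0)
positions = np.arange(8, dtype=float).reshape(-1, 1)
shuffled = positions[rng.permutation(8)]

D = dissimilarity_matrix(shuffled)
order = greedy_seriation(D)
D_sorted = reorder(D, order)
```

In this toy case the objects lie on a line, so a good ordering places small dissimilarities next to the diagonal of `D_sorted`. Passing `D_sorted` to a plotting routine such as matplotlib's `imshow` would give the kind of heat map the abstract refers to. On real data a simple heuristic like this one can easily produce poor orderings, which is the kind of behaviour the evaluation study examines.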
