Mathematics Ph.D. Dissertations

An efficient framework for hypothesis testing using Topological Data Analysis

Date of Award

2023

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Umar Islambekov (Committee Chair)

Second Advisor

Paul Morris (Other)

Third Advisor

Kit Chan (Committee Member)

Fourth Advisor

Maria Rizzo (Committee Member)

Abstract

Topological data analysis (TDA) has become a popular approach in recent years for studying the shape of data. Persistent homology is a widely used TDA tool that describes how topology changes through a nested sequence of topological spaces in the form of simplicial complexes. Along this sequence, topological features (e.g., connected components, loops, and voids) appear and disappear, and the summary of this evolution is reported as a persistence diagram (PD). PDs serve as topological signatures of the data, and the space of PDs can be endowed with the Wasserstein distance, which has a stability property. However, since PDs are not vectors in Euclidean space and computing Wasserstein distances can be costly, PDs are of limited use as direct inputs to machine learning tasks. A common remedy is to map PDs into a space of functions and to vectorize these functions by evaluating them over a grid of scale values, which yields vector summaries in Euclidean space. The Betti function, which incorporates weights, is one of the simplest functional summaries of PDs leading to such vector representations. Even though it is the easiest to construct and fast to implement, to the best of our knowledge no stability result has been proven for it. In the present work, we introduce a new method to vectorize the Betti function by integrating it between consecutive scale values of a grid. The resulting vector summary, named a vector of averaged Bettis (VAB), provides a lower-dimensional, informative, and computationally efficient vector representation compared to the standard method of vectorizing the Betti function. We also prove a stability result with respect to the Wasserstein distance for a class of weight functions.
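As an illustration of the vectorization described above, the following is a minimal Python sketch, not the dissertation's implementation. It assumes the Betti function is the weighted count of intervals [b, d) of a PD that contain a scale value t, and that each VAB entry is the integral of that function over a grid cell divided by the cell length (the normalization is an assumption suggested by the word "averaged"). The function name `averaged_betti_vector`, the default of unit weights, and the toy diagram are hypothetical.

```python
import numpy as np

def averaged_betti_vector(diagram, grid, weights=None):
    """Average a (weighted) Betti function over consecutive grid cells.

    diagram : array-like of shape (n_points, 2) holding (birth, death) pairs
    grid    : strictly increasing scale values t_1 < ... < t_{m+1}
    weights : optional per-point weights (assumed to be 1 if omitted)
    Returns a vector of length m, one entry per cell [t_i, t_{i+1}].
    """
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    grid = np.asarray(grid, dtype=float)
    births = diagram[:, 0][:, None]   # shape (n_points, 1)
    deaths = diagram[:, 1][:, None]
    left = grid[:-1][None, :]         # shape (1, m)
    right = grid[1:][None, :]
    # Length of the overlap between [birth, death) and [t_i, t_{i+1}];
    # this is the exact integral of the indicator function over the cell.
    overlap = np.clip(np.minimum(deaths, right) - np.maximum(births, left), 0.0, None)
    if weights is None:
        weights = np.ones(diagram.shape[0])
    weights = np.asarray(weights, dtype=float)
    # Weighted sum of overlaps, divided by the cell lengths.
    return (weights @ overlap) / np.diff(grid)

# Toy diagram with two features and a grid with two cells.
toy_diagram = [(0.0, 1.0), (0.5, 2.0)]
toy_grid = [0.0, 1.0, 2.0]
vab = averaged_betti_vector(toy_diagram, toy_grid)
```

Because each entry is an exact integral over a cell rather than a point evaluation, a coarse grid still accounts for every feature whose lifetime overlaps the cell, which is one way to read the "lower-dimensional" claim above.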
Further, through several experimental studies, we show that the permutation-based hypothesis testing procedure, which aims to identify whether two sets of PDs are drawn from the same distribution or process, can be made computationally cheaper without compromising its performance. This is achieved by replacing the Wasserstein distances between pairs of PDs in the associated loss function with L1 distances between the corresponding pairs of VABs. We also introduce a new method to shuffle PDs in the permutation test that provides a greater decrease in the Type II error rate at the cost of a smaller increase in the Type I error rate.
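The testing procedure can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the procedure evaluated in the dissertation: the loss is assumed to be the sum of the average within-group pairwise distances, labels are shuffled uniformly at random (the new shuffling method mentioned above is not reproduced here), and the p-value uses a common add-one correction. The names `within_group_loss` and `permutation_test`, and the synthetic input vectors standing in for VABs, are hypothetical.

```python
import numpy as np

def within_group_loss(dist, labels):
    """Sum over both groups of the average within-group pairwise distance.

    Assumed form of the loss; each group needs at least two members.
    """
    loss = 0.0
    for g in (0, 1):
        idx = np.flatnonzero(labels == g)
        k = idx.size
        sub = dist[np.ix_(idx, idx)]
        # sub.sum() counts every unordered pair twice, hence the factor 2.
        loss += sub.sum() / (2.0 * k * (k - 1))
    return loss

def permutation_test(vecs_a, vecs_b, n_perm=999, seed=0):
    """Permutation p-value for 'both groups come from the same process'."""
    vecs_a = np.asarray(vecs_a, dtype=float)
    vecs_b = np.asarray(vecs_b, dtype=float)
    vecs = np.vstack([vecs_a, vecs_b])
    labels = np.array([0] * len(vecs_a) + [1] * len(vecs_b))
    # All pairwise L1 distances between vector summaries, computed once.
    dist = np.abs(vecs[:, None, :] - vecs[None, :, :]).sum(axis=2)
    rng = np.random.default_rng(seed)
    observed = within_group_loss(dist, labels)
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)  # plain uniform label shuffle
        if within_group_loss(dist, shuffled) <= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)

# Synthetic stand-ins for two groups of vector summaries (not real PD data).
demo_rng = np.random.default_rng(1)
group_a = demo_rng.normal(0.0, 1.0, size=(10, 5))
group_b = demo_rng.normal(10.0, 1.0, size=(10, 5))
p_value = permutation_test(group_a, group_b, n_perm=199)
```

Computing the distance matrix once and only re-indexing it for each shuffle is what makes the choice of distance matter: with L1 distances between fixed-length vectors, that matrix is a single vectorized operation rather than one optimal-matching problem per pair of diagrams.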
