Sequential batches of time-evolving data for a set of persistent identifiable entities (e.g. online shopping behavior by month for a customer ID, or economic figures by year for a collection of countries) can exhibit temporal shifts in their underlying clustering structure. Methods for recovering this evolutionary clustering structure exploit natural...
Supervised learning model is one of the most fundamental machine learning models. It can provide powerful capability of prediction by learning complex patterns hidden in many, sometimes thousands, predictors. It can also be used as a building block of other machine learning tasks, like unsupervised learning and reinforcement learning. Such...
The ever growing desire for accurate estimation and efficient learning necessitates the efforts to quantitatively characterize uncertainties for models. In this thesis, four problems pertaining to uncertainty quantification are discussed: A sequential stopping framework of constructing fixed-precision confidence regions is proposed for a class of multivariate simulation problems where variance...
The advent of sequencing technologies has generated a large amount of biological and medical data. These data such as genetic sequencing data and lab experimental evidence data can help understand critical biomedical problems. This dissertation makes contribution in three different but related applications in biomedical research. In Chapter 2, we...
Gaussian process provides a principled and flexible approach for modeling the response surface or the latent function in many areas, including machine learning, statistics and computer experiment. In literature, Gaussian process models have already demonstrated their effectiveness and usefulness in a variety of applications. In this dissertation, we mainly focus...
Modeling human language is at the very frontier of machine learning and artificial intelligence. Statistical language models are probabilistic models that assign probabilities to sequences of words. For example, topic models are frequently used text-mining tools to organize a vast set of unstructured documents by exploring their theme structure. More...
This dissertation focuses on subgroup identification in longitudinal studies. There are two different but related topics. In chapter two and chapter three, several longitudinal based methods for subgroup identification with enhanced treatment effect are proposed to correct the deficiency in measuring treatment effect by simply using a summary statistic. In...
The advent of next-generation sequencing technologies has greatly promoted the devel- opment of metagenomics, and the analysis of compositional dataset has a wide range of application in this area. Because of the constraint that the sum of species relative abun- dance being 1, many traditional and classical statistical methods cannot...
Randomization is considered the gold standard when it comes to evaluating the effectiveness of interventions, primarily due to its ability to avoid bias. However, in recent years, randomization has been heavily criticized in circumstances where subject randomization may not be ethical. In a randomized controlled trial, patients who are extremely...
A replication crisis has enveloped several scientific fields since the early 2000s (see Baker, 2016). This has given rise to improved research and reporting practices (e.g., F. S. Collins & Tabak, 2014), as well as a cottage industry of research into issues of replication and reproducibility (e.g., R. A. Klein...