Predictive Models for Group-Structured Regression and Classification Problems

Massive amounts of data with large numbers of predictors routinely arrive in data systems as a result of recent developments in data collection technology. In this data-intensive world, predictive models are more important than ever for extracting information and making decisions, and they are widely applied across many fields. However, high dimensionality complicates the task of finding interpretable models with high predictive accuracy in a computationally efficient way, so there is increasing demand for interpretable, fast, and accurate predictive models for large data sets. This dissertation develops such models for linear regression, generalized linear models, and longitudinal data. We first develop new approaches for fitting linear regression and generalized linear models to large data sets in a fundamentally different way, and we then turn to longitudinal data to efficiently discover its underlying temporal structure. To this end, the dissertation consists of four chapters: 1) Coefficient tree regression: Fast, accurate, and interpretable predictive modeling; 2) Coefficient tree regression for discovering structure in generalized linear models; 3) Discovering structure in longitudinal data via coefficient tree regression; and 4) Future work.

Chapter 1 proposes a new algorithm, coefficient tree regression (CTR), that fits regression models in a fundamentally different way. In practice, many groups of predictors influence the response in the same way, but the groups are unknown and must be discovered from the data. CTR successively partitions the regression coefficients into groups, with the coefficients within each group constrained to be equal.
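The grouped-coefficient idea can be seen in a minimal sketch. This is illustrative only, not the CTR tree-partitioning algorithm itself, and for simplicity it assumes the groups are already known; CTR's contribution is discovering them from the data:

```python
import numpy as np

# Illustrative sketch (not the CTR algorithm): if predictors within a group
# share one regression coefficient, the sum of those predictors acts as a
# single derived predictor, and one coefficient per group suffices.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))

# Hypothetical grouping, assumed known here for illustration:
# predictors {0,1,2} share coefficient 2.0; {3,4,5} share coefficient -1.0.
groups = [[0, 1, 2], [3, 4, 5]]
beta_true = np.array([2.0, 2.0, 2.0, -1.0, -1.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Derived design matrix: one column per group (the within-group sum).
Z = np.column_stack([X[:, g].sum(axis=1) for g in groups])
gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)  # one coefficient per group

print(np.round(gamma, 2))  # recovers roughly [2.0, -1.0]
```

Fitting two group-level coefficients instead of six individual ones is what makes the derived predictors both cheaper to estimate and easier to interpret.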
CTR does this in a highly computationally efficient way, borrowing concepts from both linear regression and regression trees to achieve computational performance that is orders of magnitude faster than existing competitors on larger-scale problems. Finding groups of predictors that share a common regression coefficient is an automated form of feature engineering, in which the sum of the predictors within each group becomes a new derived predictor. Moreover, discovering hidden groups of predictors that affect the response only through their sum has major interpretability advantages, which we demonstrate with a real-data example of predicting political affiliation from television viewing habits. CTR can be viewed as a form of regularization without shrinkage, and its overall predictive performance is on par with lasso regression, with enormous computational and interpretive advantages.

Chapter 2 develops a coefficient tree regression algorithm for generalized linear models (CTRGLM) that discovers group structure from the data and fits GLMs to large data sets. We fully derive and implement the special cases of logistic and Poisson regression, and we develop computational strategies to approximately maximize the likelihood in a computationally efficient way. In addition to being simple and highly interpretable, CTRGLM has enormous advantages over existing methods for grouped predictors, being many orders of magnitude faster while also achieving better predictive accuracy on large data sets.

Chapter 3 discovers group structure when the data have a temporal component, as in longitudinal data. Longitudinal studies are becoming increasingly commonplace, and especially when the data are massively multivariate, fast, accurate, and interpretable models are fundamentally important.
When regression data have a temporal component, we often expect predictor variables that are closer in time to be associated with the response in a similar way. In such situations, we can exploit this structure to discover groups of predictors that share the same (or similar) coefficients according to their temporal proximity. Our longitudinal CTR (LCTR) approach yields a simple and highly interpretable tree structure that reveals the hierarchical relationships between groups of predictor variables affecting the response in a similar manner based on their temporal proximity, and we demonstrate with a real news engagement example that it provides a clear and concise interpretation of the data. In numerical comparisons over a variety of examples, our approach outperforms ridge and lasso regression in both computing time and predictive accuracy. Finally, Chapter 4 discusses future work arising from these studies.
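The temporal-proximity idea behind LCTR can also be illustrated with a small sketch. This is a hedged, simplified example, not the authors' algorithm: with lagged predictors, restricting groups to temporally contiguous windows means that splitting a group reduces to choosing a single cut point along the time axis. The data here are synthetic and the window-split search is a hypothetical stand-in for the tree construction:

```python
import numpy as np

# Illustrative sketch of temporal grouping (not the LCTR algorithm): adjacent
# lags share a coefficient, so we search for the contiguous cut point that
# best separates the two coefficient regimes.
rng = np.random.default_rng(1)
n, lags = 300, 8
X = rng.normal(size=(n, lags))                # columns = lags t-1, ..., t-8
beta_true = np.array([1.5] * 4 + [-0.5] * 4)  # adjacent lags share a coefficient
y = X @ beta_true + rng.normal(scale=0.1, size=n)

def sse_for_cut(cut):
    """Fit one coefficient per contiguous window [0:cut) and [cut:lags)."""
    Z = np.column_stack([X[:, :cut].sum(axis=1), X[:, cut:].sum(axis=1)])
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ gamma
    return resid @ resid

# The cut minimizing the error recovers the true boundary between regimes.
best_cut = min(range(1, lags), key=sse_for_cut)
print(best_cut)  # -> 4
```

Applying such splits recursively to each window would yield a tree over time windows, which conveys the same kind of hierarchical, proximity-based interpretation the abstract describes for LCTR.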
