Machine Learning with Heterogeneous Data

A massive amount of data is generated every second all around the world. Machine learning has become the most attractive way to consume this data fuel and transform it into productivity, and it has yielded great results in many fields, such as healthcare, marketing, and finance. Machine learning models are usually designed for tasks with a certain type of input data. For example, linear regression can be used to model relationships among multiple variables, and Gaussian process models are commonly used to model time series. For tasks related to images, convolutional neural networks are widely used, while recurrent neural networks have proven successful for sequential data. Challenges arise when machine learning is used to solve more complex tasks involving different types of data. Although it is natural for a human to accomplish complex tasks by collecting related information from different types of data, even data with conflicting signals, it is not straightforward for machines to learn from data where such heterogeneity exists. In this research, we study the following machine learning scenarios with heterogeneous data: (1) learning with structured and time series data; (2) learning with structured and multimodal unstructured data; (3) learning with multimodal temporal data; and (4) learning with non-IID data.

We first study machine learning with structured and time series data in an imputation task. Missing values in multivariate time series are a key challenge in many applications, such as clinical data mining. Although many imputation methods have shown their effectiveness in practice, few are designed to accommodate multivariate time series. We propose a multiple imputation model that captures both cross-sectional information and temporal correlations. We integrate Gaussian processes with mixture models and introduce individualized mixing weights to handle the varying predictive confidence of Gaussian process models.
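The idea of blending a temporal Gaussian process prediction with a cross-sectional estimate via confidence-based weights can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's actual model: the RBF kernel, the weight rule `w = 1 / (1 + var)`, and all function names are hypothetical stand-ins.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel between two 1-D coordinate arrays."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(t_obs, y_obs, t_new, noise=1e-2):
    """Posterior mean and variance of a 1-D GP fit to (t_obs, y_obs)."""
    K = rbf_kernel(t_obs, t_obs) + noise * np.eye(len(t_obs))
    Ks = rbf_kernel(t_new, t_obs)
    Kss = rbf_kernel(t_new, t_new)
    alpha = np.linalg.solve(K, y_obs)
    mean = Ks @ alpha
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T)) + noise
    return mean, var

def mixture_impute(t_obs, y_obs, t_missing, cross_sectional_est):
    """Blend a temporal GP prediction with a cross-sectional estimate.

    The mixing weight is individualized: the more confident the GP is
    at a given time point (lower posterior variance), the more weight
    its prediction receives.  The weight rule here is an illustrative
    assumption, not the model proposed in the dissertation.
    """
    gp_mean, gp_var = gp_predict(t_obs, y_obs, t_missing)
    w = 1.0 / (1.0 + gp_var)
    return w * gp_mean + (1.0 - w) * cross_sectional_est
```

For a variable observed at times `t_obs`, this imputes the missing time points by leaning on the GP where its posterior is tight and falling back toward the cross-sectional estimate where it is not.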
The proposed model is compared with several state-of-the-art imputation algorithms on both real-world and synthetic datasets. Experiments show that our best model provides more accurate imputations than the benchmarks on all of our datasets.

In the second chapter, we study machine learning with heterogeneous data in a more complex scenario, multimodal learning, with modalities including images, text, videos, and structured data. Multimodal multitask learning has attracted increasing interest in recent years. Single-modal models have been advancing rapidly and have achieved astonishing results on various tasks across multiple domains, and multimodal learning offers opportunities for further improvement by integrating data from multiple modalities. Many methods have been proposed to learn from a specific type of multimodal data, such as vision-and-language data, but few are designed to handle several modalities and tasks at a time. We extend and improve Omninet, an architecture capable of handling multiple modalities and tasks at a time, by introducing cross-cache attention, integrating patch embeddings for vision inputs, and supporting structured data. The proposed Structured-data-enhanced Omninet (S-Omninet) is a universal model that learns effectively from structured data of various dimensions together with unstructured data through cross-cache attention, which enables interactions among spatial, temporal, and structured features. We also enhance the spatial representations in the spatial cache with patch embeddings. We evaluate the proposed model on several multimodal datasets and demonstrate a significant improvement over the baseline, Omninet.

The multimodal learning scenario can be even more complicated when there is a temporal dimension: different modalities become available at different times. This is very common in real-world applications. For example, in-hospital patients usually do not take all tests at the same time.
Vitals and demographics are usually available at an early stage of admission; various lab tests may then be taken during the admission; and, if further analyses are needed, medical images such as X-rays and text data such as bedside notes may finally become available. Because not all modalities of a sample arrive at the same time, different samples can carry different importance in many use cases: an early sample with significant modalities may be more valuable than a later one, since early predictions can speed up decision-making. In addition, sample correlations are very common in multimodal temporal data, as samples accumulate over time and a late sample may contain the same data as an earlier one. Training without awareness of this importance and correlation yields less effective models. We define multimodal temporal data, discuss its key challenges, and propose two methods that improve traditional multimodal training on such data. We demonstrate the effectiveness of the proposed methods on several multimodal temporal datasets, where they show 1% to 3% improvements over the baseline.

Lastly, we study a special case of learning with heterogeneous data in which the data are of the same kind but statistically heterogeneous. This is typical in federated learning, a distributed machine learning paradigm where multiple data owners (clients) collaboratively train one machine learning model while keeping data on their own devices. The heterogeneity of client datasets is one of the most important challenges for federated learning algorithms: studies have found performance reductions with standard federated algorithms, such as FedAvg, on non-IID data. Many existing works on handling non-IID data adopt the same aggregation framework as FedAvg and focus on improving model updates either on the server side or on the clients.
We tackle this challenge from a different angle by introducing redistribution rounds that delay aggregation. With delayed aggregation, local models are trained on data that are more representative of the global distribution. The proposed algorithm can also serve as a federated learning paradigm, an alternative to FedAvg, into which other methods can be plugged. We perform experiments on multiple tasks and show that the proposed framework significantly improves performance on non-IID data.
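The contrast between standard FedAvg aggregation and a round that delays aggregation can be illustrated with a toy linear-regression sketch. This is a minimal sketch, assuming a least-squares local trainer and a simple client-chaining scheme; `delayed_aggregation_round` and its `redistribution` parameter are hypothetical stand-ins, not the dissertation's exact algorithm.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: gradient descent on least-squares
    linear regression (a stand-in for any local training procedure)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, client_data):
    """Standard FedAvg: every client trains locally from the current
    global model, then the server averages the resulting models,
    weighted by client dataset size."""
    updates, sizes = [], []
    for X, y in client_data:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(np.stack(updates), axis=0,
                      weights=np.array(sizes, dtype=float))

def delayed_aggregation_round(global_w, client_data, redistribution=2):
    """Illustrative delayed-aggregation round: before the server
    aggregates, each model is trained sequentially on a chain of
    clients, so every local model has seen data closer to the global
    distribution.  The chaining scheme here is an assumption for
    illustration only."""
    updates, sizes = [], []
    n = len(client_data)
    for i in range(n):
        w = global_w
        for j in range(redistribution):
            X, y = client_data[(i + j) % n]
            w = local_update(w, X, y)
        updates.append(w)
        sizes.append(len(client_data[i][1]))
    return np.average(np.stack(updates), axis=0,
                      weights=np.array(sizes, dtype=float))
```

On non-IID clients (each holding data from a different region of feature space), the delayed variant lets each local model see several clients' data before aggregation, which is the intuition behind training on data "more representative of the global distribution."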
