Ensembling and Data Selection for Neural Language Models, and Analysis of F-measure

Language models are the foundation of many natural language tasks such as machine translation, speech recognition, and dialogue systems. Modeling the probability distributions of text accurately helps capture the structure of language and extract the valuable information contained in various corpora. In recent years, many advanced models have achieved state-of-the-art performance on large, diverse language modeling benchmarks. However, training these advanced models on large benchmark corpora can be difficult due to memory constraints and prohibitive computational costs. Among these advanced models, a deep ensemble combines several individual models to obtain better generalization performance, but training a standard ensemble of language models over the entire corpus may ignore the inherent characteristics of the text and the opportunity for individual ensemble members to specialize on certain subsets of it.

Many natural language processing (NLP) systems are evaluated using the F-measure, computed as the harmonic mean of recall and precision. The F-measure is usually estimated on a sample set, and several useful approaches have been proposed for estimating its standard error. However, no efficient approach has been proposed for comparing the accuracy of the estimated F-measures of different algorithms.

In this dissertation, we develop several statistical methodologies for more effective and efficient language modeling, and we propose a framework for comparing different approaches to estimating F-measures. First, we correct some estimation details in Wong's approach [1] and give an estimate of the covariance between two F-measures under JVESR's approach; we then present an experimental framework for comparing JVESR's approach and Wong's approach in estimating the F-measures of two algorithms. Second, we introduce a novel algorithm called BaMME for effective ensemble deep neural network training. Under BaMME, each ensemble member is guided to focus on text with certain characteristics during training, and our experiments demonstrate that BaMME outperforms traditional ensemble modeling. Finally, we propose two new subsampling methods, CLIS and CLISIER, for selecting text data for efficient sentence-level RNN language modeling. Both methods combine clustering with importance sampling to make the selected dataset as diverse and informative as possible; CLISIER additionally removes inefficient sentences by building a Dataset Map. Both CLIS and CLISIER outperform the baseline methods in training data selection for language modeling, and CLISIER provides an efficient way to build Dataset Maps for language modeling.
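For concreteness, ensembling language models typically means combining the members' next-token distributions. The minimal Python sketch below shows two standard combination rules (linear and log-linear mixing); it illustrates generic deep ensembling only, not the dissertation's BaMME algorithm, and the function names and the optional weights are hypothetical.

    import numpy as np

    def mix_linear(member_probs, weights=None):
        # member_probs: array of shape (n_members, vocab_size); each row is
        # one ensemble member's probability distribution over the vocabulary.
        return np.average(member_probs, axis=0, weights=weights)

    def mix_loglinear(member_probs, weights=None):
        # Geometric (log-space) average of the distributions, renormalized.
        log_mix = np.average(np.log(member_probs + 1e-12), axis=0, weights=weights)
        p = np.exp(log_mix)
        return p / p.sum()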
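The F-measure referenced throughout is the standard harmonic mean of precision P and recall R:

    F = \frac{2PR}{P + R}

For example, with P = 0.8 and R = 0.6, F = 0.96 / 1.4 ≈ 0.686. This is the textbook definition; the dissertation's variance and covariance estimators for F are not reproduced in this abstract.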
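The abstract describes CLIS and CLISIER only at a high level, as combining clustering with importance sampling so that the selected data stay diverse and informative. As a hedged illustration of that general idea, not the dissertation's actual procedure, the sketch below clusters sentence embeddings and then samples within each cluster using importance weights; the function name, the centroid-distance weighting, and all parameters are hypothetical assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_diverse_sentences(embeddings, n_clusters=10, n_select=1000, seed=0):
        # Illustrative subsampling: cluster sentence embeddings, then draw
        # sentences within each cluster with probability proportional to a
        # simple importance proxy (distance to the cluster centroid).
        rng = np.random.default_rng(seed)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
        chosen = []
        quota = n_select // n_clusters  # equal share per cluster keeps the sample diverse
        for c in range(n_clusters):
            idx = np.flatnonzero(km.labels_ == c)
            # Importance proxy (an assumption for illustration): sentences far
            # from the centroid are treated as more informative.
            dist = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
            weights = dist + 1e-12  # avoid zero-probability entries
            probs = weights / weights.sum()
            k = min(quota, idx.size)
            chosen.extend(rng.choice(idx, size=k, replace=False, p=probs).tolist())
        return chosen

In this sketch, the per-cluster quota spreads the selection across clusters (diversity), while the distance-based weights bias it toward less typical sentences within each cluster, a crude stand-in for informativeness.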
