Abstract

Neural machine learning models can successfully model language that is similar to their training distribution, but they are highly susceptible to degradation under distribution shift, which occurs in many practical applications when processing out-of-domain (OOD) text. This has been attributed to "shortcut learning": relying on weak correlations over arbitrarily large contexts. We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to automatically disregard OOD context during inference, smoothly transitioning towards a less expressive but more robust model as the data becomes more OOD, while retaining its full context capability when operating in-distribution. We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets.

Introduction

Neural language models have become the main component of modern natural language processing systems, with larger and larger models being used as feature extractors for downstream tasks (Devlin et al., 2019), as probability estimators for ranking and ensembling (Gulcehre et al., 2015), or as language generators (Bahdanau et al., 2015; Vaswani et al., 2017; Brown et al., 2020).

Despite their success, neural machine learning models can suffer large performance degradation when applied to out-of-domain data that is substantially different from their training data (Lapuschkin et al., 2019; Hupkes et al., 2019; Recht et al., 2019). Autoregressive language models used for generation also suffer from the related exposure bias problem (Ranzato et al., 2015): as the model is fed its own samples, deviations from the training distribution are amplified, and eventually, for sufficiently long sequences, the model generates completely abnormal text.

Unlike older statistical language models, recurrent LMs (RNNLMs) (Mikolov et al., 2010) and their successors, Transformer LMs (Vaswani et al., 2017), can consider the entire prefix of a sentence when predicting or generating the next token. By being able to relate a very high-dimensional input to the output, these models can learn many subtle correlations which are highly useful as long as the input is in-distribution. Unfortunately, these correlations tend to be brittle to distribution shift, causing a model that depends on them to go astray. This phenomenon is known as "shortcut learning" (Geirhos et al., 2020); it has been found to also occur in humans and animals, but it is especially prevalent in artificial neural networks.

Research on this problem has explored models invariant or equivariant w.r.t. certain transformations by means of compositional representations (Sabour et al., 2017; Soulos et al., 2019; Liu et al., 2020), causal modeling (Schölkopf et al., 2021), or both (Arjovsky et al., 2019; Krueger et al., 2020), but these works focus on classification tasks, often on synthetic datasets, and can't be straightforwardly applied to black-box language models. Approaches specific to LMs have focused on robustness in settings where the data domains are known and represented in the training data (Oren et al., 2019; Gerstenberger et al., 2020).

In this work we propose a method that uses Random Network Distillation (RND) (Burda et al., 2018) to dynamically adapt the amount of context that the model relies upon during inference, based on an estimate of how much this context is out-of-distribution (OOD).
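For concreteness, here is a minimal sketch of an RND-style OOD scorer, assuming PyTorch; the class name RNDScorer, the network sizes, and the choice of context featurization are illustrative assumptions, not the paper's actual configuration. A fixed random target network is distilled into a predictor trained only on in-distribution features, so the predictor's error stays low in-distribution and grows under distribution shift.

# Minimal sketch of Random Network Distillation as an OOD scorer.
# Assumes PyTorch; architecture and sizes are illustrative placeholders.
import torch
import torch.nn as nn

class RNDScorer(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network, trained on in-distribution features
        # to match the target's outputs.
        self.predictor = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Per-example squared prediction error: low on inputs similar to
        # the training distribution, high on OOD inputs, hence an OOD score.
        with torch.no_grad():
            target_out = self.target(features)
        pred_out = self.predictor(features)
        return (pred_out - target_out).pow(2).mean(dim=-1)

During training, the same squared error serves as the predictor's loss on in-distribution context features; at inference time, the resulting score can gate how much of the context the language model is allowed to rely on.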
This way, the model can still make use of all available context when operating within a familiar context space, exploiting long-distance weak correlations, but it reduces to a less expressive and more robust model when operating OOD, relying only on the strongest correlations.

As a proof of concept, we implement our approach on a GRU recurrent language model (Cho et al., 2014). While Transformer decoders outperform RNNs when trained on large training sets, RNNs remain competitive on smaller datasets (< 10^7 tokens), where OOD phenomena are easier to measure; furthermore, they are easier to optimize, simplifying architecture and hyperparameter search.

We evaluate our method on language modeling tasks on English datasets, obtaining improvements when evaluating on eight OOD domains. We report additional preliminary sequence-to-sequence results on Transformer-RNN models (Zhang et al., 2018) in Appendix A. We leave extensions of our method to full Transformer architectures as future research.

Conclusions and future work

We proposed a method to improve the robustness of language models to distribution shift caused by train/test domain mismatch. Our model contracts the RNN state based on an unsupervised out-of-distribution estimator in order to reduce the model's dependency on weak long-distance correlations, which are useful in-distribution but brittle to distribution shift, causing a model that depends on them to go astray.
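As an illustration of the state contraction described above, the following hypothetical sketch scales a GRU hidden state toward zero as the OOD score grows, so the model smoothly falls back to a low-context, more robust predictor. The sigmoid gate, its parameters, and the function name contract_state are assumptions for illustration, not the paper's exact formulation.

# Hypothetical sketch of OOD-driven state contraction.
# h: GRU hidden state of shape (batch, hidden); ood_score: shape (batch,).
import torch

def contract_state(h: torch.Tensor, ood_score: torch.Tensor,
                   sharpness: float = 1.0, threshold: float = 0.0) -> torch.Tensor:
    # Gate tends to 1 when in-distribution (score below threshold)
    # and to 0 when far OOD.
    gate = torch.sigmoid(-sharpness * (ood_score - threshold))
    # Contracted state: full context in-distribution,
    # near-zero context (stateless fallback) when far OOD.
    return gate.unsqueeze(-1) * h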