zippy/samples/llm-generated/2110.15802_generated.txt

ABSTRACT

We propose BERMo, an architectural modification to BERT, which makes predictions based on a hierarchy of surface, syntactic and semantic language features. We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths. Our approach has two-fold benefits: (1) improved gradient flow for the downstream task, as every layer has a direct connection to the gradients of the loss function, and (2) increased representative power, as the model no longer needs to copy the features learned in the shallower layers that are necessary for the downstream task. Further, our model has negligible parameter overhead, as there is a single scalar parameter associated with each layer in the network. Experiments on the probing tasks from the SentEval dataset show that our model performs up to 4.65% better in accuracy than the baseline, with an average improvement of 2.67% on the semantic tasks. When subject to compression techniques, we find that our model enables stable pruning on small datasets like SST-2, where the BERT model commonly diverges. We observe that our approach converges 1.67× and 1.15× faster than the baseline on the MNLI and QQP tasks from the GLUE dataset. Moreover, our results show that our approach can obtain better parameter efficiency for penalty-based pruning approaches on the QQP task.

INTRODUCTION

The invention of the Transformer (Vaswani et al. (2017)) architecture has paved new research directions in the deep learning community. Descendants of this architecture, namely BERT (Devlin et al. (2019)) and GPT (Brown et al. (2020)), attain State-of-The-Art (SoTA) performance for a broad range of NLP applications. The success of these networks is primarily attributed to the two-stage training process (self-supervised pre-training and task-based fine-tuning) and the attention mechanism introduced in Transformers. Many of the top models on various leaderboards belong to the BERT family. All fifteen systems that surpass the human baseline on the General Language Understanding Evaluation (GLUE) (Wang et al. (2018)) benchmark use variants of BERT or include one as a constituent of an ensemble, except for T5 (Raffel et al. (2019)), which uses the Transformer architecture. Further, the best-performing systems for each task in the OntoNotes (Weischedel et al. (2011)) benchmark belong to the BERT family, with the exception of the Entity Typing task, where Embeddings from Language Models (ELMo) (Peters et al. (2018)) tops the leaderboard. These promising results make the BERT family of models increasingly ubiquitous across several domains of machine learning, such as NLP, image recognition (Dosovitskiy et al. (2021); Jaegle et al. (2021)) and object detection (Carion et al. (2020)).

Motivation: In Jawahar et al. (2019), the authors show that the lower layers of BERT capture phrase-level information, which gets diluted with depth. They also demonstrate that the initial layers capture surface-level features, the middle layers deal with syntactic features, and the last few layers are responsible for semantic features. These findings indicate that BERT captures a rich hierarchy of linguistic features at different depths of the network. Intrigued by this discovery, we aim to combine the activations from different depths to obtain a richer feature representation.
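For concreteness, the linear combination scheme referred to above can be written as below. This is the standard scalar-mixing formulation from ELMo (Peters et al. (2018)) with notation adapted here; h_j denotes the hidden state of layer j, the softmax-normalised scalars s_j and the global scale gamma are the only added parameters, and the exact layer indexing used by BERMo may differ from this sketch.

    \[
        \mathbf{e} \;=\; \gamma \sum_{j=0}^{L} s_j \, \mathbf{h}_j,
        \qquad
        s_j \;=\; \frac{\exp(w_j)}{\sum_{k=0}^{L} \exp(w_k)}
    \]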
We find our problem formulation similar to the one presented in ELMo, where the authors illustrate that the higher-level Long Short Term Memory (LSTM) states capture context-dependent aspects of word meaning, or semantic features, while the lower-level LSTM states model the syntax. Inspired by ELMo, we propose BERMo, a modification of the BERT architecture that increases the dependence on features from different depths to generate a rich context-dependent embedding. This approach improves the gradient flow during the backward pass and increases the representative power of the network (He et al. (2016); Huang et al. (2016)). Further, the linear combination of features from intermediate layers, proposed in ELMo, is a simpler form of the skip connections introduced in ResNets (He et al. (2016)). The skip connections in ResNets enable aggressive pooling in the initial layers without affecting the gradient flow, which in turn allows these networks to have orders of magnitude fewer parameters than architectures without skip connections, such as VGG (Simonyan & Zisserman (2014)), while achieving competitive accuracy. Since the performance improvements associated with the BERT architecture come at the cost of a large memory footprint and enormous computational resources, compression becomes necessary to deploy these models on demand. As we introduce skip connections into the BERT architecture, we expect the resulting model to be more amenable to compression, since every depth of the network retains a direct path to the gradients of the loss.
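A minimal PyTorch sketch of such a scalar mix over the hidden states of a BERT encoder is given below. This is not the authors' implementation; the class name ScalarMix, the use of the transformers library, and the wiring shown in the comments are illustrative assumptions.

    # Sketch of an ELMo-style scalar mix over BERT layer outputs (assumed design,
    # not the released BERMo code): each of the L+1 hidden states is weighted by a
    # softmax-normalised scalar and summed, then scaled by a single gamma.
    import torch
    import torch.nn as nn

    class ScalarMix(nn.Module):
        def __init__(self, num_layers: int):
            super().__init__()
            # one learnable scalar per layer plus a global scale: negligible overhead
            self.weights = nn.Parameter(torch.zeros(num_layers))
            self.gamma = nn.Parameter(torch.ones(1))

        def forward(self, hidden_states):
            # hidden_states: sequence of tensors, each (batch, seq_len, hidden_dim)
            s = torch.softmax(self.weights, dim=0)
            mixed = sum(w * h for w, h in zip(s, hidden_states))
            return self.gamma * mixed

    # Hypothetical usage with a HuggingFace BERT encoder:
    # from transformers import BertModel
    # bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    # outputs = bert(input_ids, attention_mask=attention_mask)
    # mix = ScalarMix(num_layers=len(outputs.hidden_states))
    # representation = mix(outputs.hidden_states)  # fed to the downstream task head

Because the mixing weights are the only parameters added on top of the encoder, the gradient of the loss reaches every layer both through the usual Transformer stack and through its scalar-weighted shortcut, which is the property the paragraph above appeals to.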