zippy/samples/llm-generated/2111.03837_generated.txt

ABSTRACT

Named entity recognition (NER) aims to identify mentions of named entities in an unstructured text and classify them into the predefined named entity classes. Even though deep learning-based pre-trained language models achieve good predictive performance, many domain-specific NER tasks still require a sufficient amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been used for NER tasks to minimize the annotation cost without sacrificing model performance. However, the heavily imbalanced class distribution of tokens introduces challenges in designing effective AL querying methods for NER. We propose AL sentence query evaluation functions that pay more attention to possible positive tokens, and evaluate these proposed functions with both sentence-based and token-based cost evaluation strategies. We also propose a better data-driven normalization approach to penalize sentences that are too long or too short. Our experiments on three datasets from different domains reveal that the proposed approaches reduce the number of annotated tokens while achieving better or comparable prediction performance relative to conventional methods.

Keywords: Active learning, Named entity recognition, Annotation cost, Semi-supervised clustering

Introduction

Named entity recognition (NER) aims to identify mentions of named entities in an unstructured text and classify them into the predefined named entity classes (e.g., person names, organizations, locations). NER is one of the fundamental natural language processing (NLP) tasks and is used in other NLP tasks such as entity linking, event extraction, and question answering. Although deep learning-based pre-trained language models [Devlin et al., 2019, Yang et al., 2019, Liu et al., 2019, Raffel et al., 2020] have advanced the state-of-the-art performance in NER [Torfi et al., 2020, Peters et al., 2019], a sufficient amount of labeled data is still necessary for achieving satisfactory prediction performance in most domains [Tikhomirov et al., 2020, Wang et al., 2020]. Since acquiring labeled data is both time- and budget-consuming, efficient label acquisition for NER remains a challenge.

A general framework for tackling the labeled data acquisition problem is active learning, in which the learner strategically chooses the most valuable instances as opposed to selecting a random sample for labeling [Thompson et al., 1999]. In the pool-based active learning setup, the active learner selects the most useful examples from an unlabeled pool of samples and queries them to an annotator for labeling; upon receiving the labels for the queried examples, the model is retrained with the augmented labeled set. These query selection, annotation, and retraining steps are iterated multiple times until either the desired performance is achieved or the labeling budget is exhausted [Settles, 2011]. The goal is to reduce the annotation cost by creating a smaller labeled set while still achieving good predictive performance. Active learning (AL) has demonstrated success in various sequence annotation tasks such as part-of-speech tagging [Ringger et al., 2007], dependency parsing [Li et al., 2016b], and semantic parsing [Thompson et al., 1999]. AL has been used to tackle the label acquisition problem in the NER task as well. Shen et al. [2017] demonstrated that AL combined with deep learning achieves nearly the same performance on standard datasets with just 25% of the original training data. Chen et al. [2015] developed and evaluated both existing and new AL methods for a clinical NER task. Their results showed that AL methods, particularly uncertainty-sampling approaches, provide significant savings in the annotation cost.

In active learning, the most critical step is selecting the useful query examples for manual annotation. This step becomes more challenging for sequence labeling tasks, especially for named entity recognition, for two reasons. The first challenge of applying active learning to NER arises due to the imbalanced data distribution. In NER annotation, a token is either labeled with its corresponding named entity class if it is part of a named entity, or with the "other" class if it is not. The other class is generally referred to as the negative annotation or negative token, and all other labels (named entity labels) are referred to as positive annotations or positive tokens [Marcheggiani and Artières, 2014]. In NER datasets, negative tokens are usually over-abundant compared to positive tokens.

The second challenge in applying active learning to NER is related to the varying length of sentences. In NER, tokens are annotated one by one, but the context, hence the corresponding sentence, is still required for accurate token annotation. Therefore, at each active learning iteration, sentences are queried instead of tokens. Active learners that select the sentences for querying by directly aggregating over all the tokens are biased towards longer sentences. In order to prevent this bias towards sentences with more terms, the aggregated sentence scores are normalized by the number of tokens in the sentence [Engelson and Dagan, 1996, Haertel et al., 2008, Settles and Craven, 2008]. This commonly used approach solves the problem only partially, since this time the active learner starts to query "too short" sentences in the early and intermediate rounds [Tomanek, 2010]. In this paper, we moderate these two extreme cases and propose a normalization which exploits the corresponding dataset's token count distribution.

The varying length of sentences also affects the cost evaluation of the active learning framework. Some studies [Settles and Craven, 2008, Yao et al., 2009, Kim et al., 2006, Liu et al., 2020a] treat all sentences equally and compare active learning methods directly with respect to the number of sentences queried. However, this is not realistic since the cost is not fixed across sentences, as sentences differ in the number of tokens and the number of named entities they contain [Arora et al., 2009, Haertel et al., 2008]. Therefore, the number of annotated tokens should be incorporated into the active learning cost. In this regard, many studies in the literature [Shen et al., 2004, Settles and Craven, 2008, Reichart et al., 2008, Shen et al., 2017] measure the cost of the annotation by the number of tokens annotated even though they query sentences. Using only the token count is also an imperfect strategy, as the cost of annotating the same number of tokens distributed over multiple sentences is not equivalent to annotating these tokens within a single sentence [Settles et al., 2008, Tomanek and Hahn, 2010]. This is mainly because there is a cost factor associated with each new sentence that is independent of its content and length. Even though we do not propose a new cost calculation method that encompasses all these different aspects, we consider these two cost evaluation setups to analyze the existing and proposed approaches in detail.
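To make the aggregation, the length normalization, and the two cost accounting schemes discussed above concrete, the sketch below scores candidate sentences by summing per-token uncertainties (here, predictive entropy) with optional length normalization, and tracks both the sentence-based and the token-based cost. All function and variable names are illustrative assumptions; this is not the exact formulation used later in the paper.

import math

def token_entropy(class_probs):
    # Predictive entropy of a single token's class distribution.
    return -sum(p * math.log(p) for p in class_probs if p > 0)

def sentence_score(token_probs, normalize=True):
    # Aggregate per-token uncertainties; dividing by the sentence length
    # removes the bias towards long sentences but tends to favor very short ones.
    total = sum(token_entropy(p) for p in token_probs)
    return total / len(token_probs) if normalize else total

def select_batch(pool, token_budget):
    # pool: list of (sentence_id, token_probs) pairs.
    # Greedily pick the most uncertain sentences until the token budget is spent.
    ranked = sorted(pool, key=lambda item: sentence_score(item[1]), reverse=True)
    selected, tokens_spent = [], 0
    for sent_id, token_probs in ranked:
        if tokens_spent + len(token_probs) > token_budget:
            break
        selected.append(sent_id)
        tokens_spent += len(token_probs)
    # Sentence-based cost: len(selected); token-based cost: tokens_spent.
    return selected, tokens_spent

Under a sentence-based cost evaluation, the budget would instead cap the number of selected sentences rather than the number of annotated tokens.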
In this study, we propose an extension to a subset of the existing uncertainty sampling methods to handle the challenges associated with the over-abundance of negative tokens. In our proposed approach, the query evaluation metrics are designed to pay less attention to the tokens that are predicted to have negative annotations. We identify potentially negative tokens through clustering of pre-trained BERT representations after a semi-supervised dimensionality reduction. To the best of our knowledge, the use of BERT embeddings directly in the active learning querying step for NER has never been attempted before this paper. Last but not least, this paper proposes a better normalization strategy for aggregating token scores to attain a good sentence query metric. For a fair comparison, we evaluate different active learning query methods both under the assumption of a fixed annotation cost per sentence and a fixed annotation cost per token. Our experiments on three datasets from different domains illustrate that our proposed approach reduces the number of annotated tokens while achieving slightly better or comparable prediction performance relative to the compared methods. We also present an extensive investigation of the effects of different pre-trained language embeddings on the performance of our NER model.

The rest of the paper is organized as follows: Section 2 presents the NER data collections used in the experiments together with additional details that motivate the proposed method, Section 3 summarizes the general active learning setup and the commonly used active learning strategies for the NER task, and Section 4 describes the proposed approach. We describe our experimental setting in Section 5 and detail our results in Section 6. Section 7 concludes with a summary of our findings.

Conclusion

In this work, we focus on active learning for NER. One challenge of applying active learning to NER is the abundance of negative tokens. Uncertainty-based sentence query functions aggregate the scores of the tokens in a sentence, and since the negative tokens' uncertainty scores dominate the overall score, they overshadow the informative positive tokens. In this work, we propose strategies to overcome this by focusing on the possible positive tokens. To identify positive tokens, we use a semi-supervised clustering strategy on the tokens' BERT embeddings. We experiment with several strategies where the sentence uncertainty score focuses on positive tokens and show empirically on multiple datasets that this is useful.

A second challenge of querying sentences for NER is related to the length of the sentences. Longer sentences that contain more tokens can bring more information at once; however, their annotation cost is higher. Normalizing sentence scores by the number of tokens they contain, on the other hand, leads to querying sentences that are too short. We propose to normalize the scores such that sentences with the typical length for the dataset are queried more often. We evaluate the suggested methods based on both sentence-based and token-based cost analyses. Overall, we believe the work presented here can support the application of active learning to NER.
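As a rough illustration of the two ideas summarized above, the sketch below flags likely-positive tokens by clustering their BERT embeddings (k-means is used here as a stand-in for the semi-supervised clustering and dimensionality reduction described in the paper), restricts the sentence uncertainty score to those tokens, and weights the result by how typical the sentence length is for the dataset (here, a Gaussian around the mean length; the paper's actual normalization may differ). All names and specific choices are illustrative assumptions, not the authors' exact method.

import numpy as np
from sklearn.cluster import KMeans

def fit_token_clusters(token_embeddings, n_clusters=2):
    # Cluster pre-trained BERT token embeddings. In the paper this step
    # follows a semi-supervised dimensionality reduction, omitted here.
    return KMeans(n_clusters=n_clusters, n_init=10).fit(token_embeddings)

def likely_positive_mask(kmeans, sentence_embeddings, positive_cluster):
    # positive_cluster would be identified from the small labeled set,
    # e.g. the cluster containing most of the labeled entity tokens.
    return kmeans.predict(sentence_embeddings) == positive_cluster

def length_weight(n_tokens, mean_len, std_len):
    # Data-driven weight that prefers sentences of typical length for the
    # dataset (one simple choice: a Gaussian around the dataset mean).
    return float(np.exp(-((n_tokens - mean_len) ** 2) / (2.0 * std_len ** 2)))

def positive_focused_score(token_uncertainties, positive_mask, mean_len, std_len):
    # Aggregate uncertainty only over likely-positive tokens, then weight
    # by how typical the sentence length is.
    focused = token_uncertainties[positive_mask]
    base = focused.sum() if focused.size else 0.0
    return base * length_weight(len(token_uncertainties), mean_len, std_len)

At query time, sentences in the unlabeled pool would be ranked by such a score and the top-ranked ones sent to the annotator, exactly as in the standard pool-based loop.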