[Journal article][Full-length article]


Bayesian active summarization

Authors:
Alexios Gidiotis; Grigorios Tsoumakas

Year: 2023

Pages: 101553 - 101553
Publisher: Elsevier BV


Abstract:

Bayesian active learning has had a significant impact on various NLP problems, but its application to text summarization has been explored very little. We introduce Bayesian Active Summarization (BAS), a method that combines active learning with state-of-the-art summarization models. Our findings suggest that BAS achieves better and more robust performance than random selection, particularly for small and very small data annotation budgets. More specifically, applying BAS with a summarization model like PEGASUS, we reached 95% of the performance of the fully trained model using fewer than 150 training samples. Furthermore, we reduced the standard deviation of performance by 18% compared to the conventional random selection strategy. With BAS, we show that it is possible to leverage large summarization models to effectively solve real-world problems with very limited annotated data.

Introduction

Modern Natural Language Processing (NLP) methods based on deep neural networks have achieved remarkable performance on several different tasks (Vaswani et al., 2017, Radford et al., 2019, Devlin et al., 2018, Liu et al., 2019, Lewis et al., 2019, Raffel et al., 2020). Such performance levels are usually achieved by scaling deep neural networks up to millions or even billions of parameters. Scaling, in turn, requires extremely high computational capacity and large training datasets.

Abstractive text summarization is an NLP task that has drawn extensive attention in recent deep learning work, with some very significant achievements (Lewis et al., 2019, Zhang et al., 2020, Zaheer et al., 2020). State-of-the-art summarization models generally depend on large supervised datasets to achieve good performance and generalize to unseen data. Summarization datasets are usually document collections, each accompanied by some form of summary, typically written by humans. Although numerous datasets for summarization are available in the literature (Napoles et al., 2012, Hermann et al., 2015, Narayan et al., 2018, Grusky et al., 2018), this is not the case in many practical applications, since constructing large supervised datasets can be very expensive and time consuming. Collecting good quality training data in large amounts and annotating it for summarization would be costly for many small businesses trying to adopt summarization technology in their respective domains. This cost can be particularly high if domain expertise is required for annotation, as is the case for many use cases such as financial and legal documents.

Active learning methods have been widely adopted in an effort to reduce the data requirements of deep learning for various tasks (Houlsby et al., 2011, Gal et al., 2017). Strategically selecting the most informative samples for annotation has proven more effective than random selection when the budget for annotating data is small (Siddhant and Lipton, 2020). Although active learning has been applied to NLP problems, it has rarely been explored from a summarization perspective (Zhang and Fung, 2009). Part of the reason is that one of the key challenges of active learning is finding a good strategy for selecting informative samples for annotation. Various selection strategies, founded on information theory or Bayesian learning, have been explored in different problem setups, such as classification and regression. On the other hand, applying such methods to tasks that deal with language generation, such as abstractive summarization, is far from trivial.
We propose Bayesian Active Summarization (BAS), an approach for applying active learning to abstractive text summarization that aims to mitigate the data dependence of summarization models. BAS iteratively alternates between annotating and training, in an effort to maximize the gains from a limited data annotation budget. Building upon our previous work (Gidiotis and Tsoumakas, 2022), we apply Bayesian deep learning methods with Monte Carlo (MC) dropout (Gal and Ghahramani, 2016) to quantify summarization uncertainty, and use it to select data samples for annotation. We empirically show that BAS is more data efficient than random selection, and achieves better overall performance when the annotation budget is small. After experimenting with multiple Transformer-based summarization models, we show that these models, when trained with BAS, consistently outperform random selection for data budgets of a few hundred samples. For instance, by applying BAS to the PEGASUS summarization model (Zhang et al., 2020) we reached 95% of the performance of a PEGASUS model trained on the full XSum dataset (Narayan et al., 2018) using fewer than 150 training samples. Finally, we analyze BAS with regard to computational cost, in an effort to identify the effects of different design choices. This analysis gives us insight into the trade-off between performance and computational complexity, helping us understand how to scale BAS effectively.

The rest of this paper is structured as follows. Section 2 reviews relevant work on active learning, Bayesian methods, and their application to NLP. In Section 3 we briefly introduce Bayesian uncertainty and its extension to summarization models. In Section 4 we present the main BAS algorithm in detail, and in Section 5 some practical considerations concerning scalability and robustness. Section 6 describes our experimental setup and Section 7 discusses our main findings, including an analysis of BAS from different angles. We conclude with some final remarks in Section 8.

Section snippets

Related work

A number of works have applied active learning methods to NLP problems. Shen et al. (2018) use active learning for named entity recognition (NER), selecting and annotating the samples with the lowest predicted probability. Most notably, Siddhant and Lipton (2020) empirically study multiple different active learning methods, focusing on sentence classification and NER. Various works combine BERT (Devlin et al., 2018) with different active learning methods on NLP tasks like text classification…

Bayesian uncertainty

In this section we briefly introduce some key concepts about Bayesian uncertainty, which is the foundation of our document selection strategy.
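The snippet above ends before the formal definitions, but the idea stated in the introduction (generate several summaries with MC dropout active and measure how much they disagree) can be sketched concretely. The following is a minimal, illustrative sketch rather than the authors' released code: it assumes a Hugging Face PEGASUS checkpoint, uses sacrebleu for sentence-level BLEU, and computes a BLEUVar-style pairwise disagreement score in the spirit of Gidiotis and Tsoumakas (2022); the checkpoint name, generation settings, normalization, and the select_for_annotation helper are our own assumptions.

from itertools import combinations

import torch
from sacrebleu import sentence_bleu
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "google/pegasus-xsum"  # assumed checkpoint, not specified in this snippet
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME)
model.eval()

# MC dropout: keep dropout layers stochastic at inference time, so that
# repeated forward passes produce different summaries for the same input.
for module in model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.train()


def sample_summaries(document: str, n_samples: int = 10) -> list[str]:
    # Generate n_samples stochastic summaries of a single document.
    inputs = tokenizer(document, truncation=True, return_tensors="pt")
    summaries = []
    for _ in range(n_samples):
        with torch.no_grad():
            ids = model.generate(**inputs, max_length=64, num_beams=1)
        summaries.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return summaries


def bleuvar(summaries: list[str]) -> float:
    # BLEUVar-style disagreement: average squared (1 - BLEU) over ordered
    # pairs of sampled summaries. Higher values mean the samples disagree
    # more, i.e. the model is more uncertain about this document.
    n = len(summaries)
    total = 0.0
    for i, j in combinations(range(n), 2):
        b_ij = sentence_bleu(summaries[i], [summaries[j]]).score / 100.0
        b_ji = sentence_bleu(summaries[j], [summaries[i]]).score / 100.0
        total += (1.0 - b_ij) ** 2 + (1.0 - b_ji) ** 2
    return total / (n * (n - 1))  # the normalization is our choice


def select_for_annotation(pool: list[str], k: int = 10, n_samples: int = 10) -> list[int]:
    # One acquisition step: score unlabeled documents and return the indices
    # of the k most uncertain ones, to be annotated and added to training.
    scores = [bleuvar(sample_summaries(doc, n_samples)) for doc in pool]
    return sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:k]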
Active summarization

The main objective of Bayesian Active Summarization (BAS) is to train a summarization model that achieves competitive performance while requiring less supervised data for training. Since creating large numbers of samples for summarization training can be particularly difficult and costly, we focus on training budgets of only a few hundred annotated samples. Active learning methods, and particularly Bayesian ones, are known to have significant computational overheads. When applied to the…

Practical considerations

In this section we discuss practical considerations that arise when trying to apply active summarization in real-world scenarios. These considerations concern important decisions we need to make with each application's specific requirements in mind, and they have a great impact on BAS's practical performance, cost, scalability, and robustness.

Experimental setup

To assess BAS's effectiveness, we conducted an experimental study. Here we describe our experimental setup, including the data and models used for the study as well as the learning details. We aim to simulate a real-world scenario with a low data annotation budget, and as a consequence all experimental decisions are made under that assumption.

Results

The experimental results presented in this section are organized as follows. First, we present a study of the BLEUVar metric that we use to quantify summarization uncertainty for BAS. Then we go through the process of tuning BAS, in an attempt to achieve a good balance between effectiveness and computational complexity. Finally, we evaluate the performance of BAS and compare it with a baseline that follows the standard supervised learning paradigm of randomly selecting and…
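The results snippet cuts off before the evaluation details, but comparing a BAS-trained checkpoint against a randomly-trained baseline on held-out documents is straightforward to reproduce in spirit. Below is a minimal sketch of such a comparison using ROUGE (Lin, 2004); the rouge_score package, the F1 averaging, and the mean_rouge helper are our assumptions rather than the article's published procedure.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)


def mean_rouge(references: list[str], predictions: list[str]) -> dict[str, float]:
    # Average ROUGE F1 over a held-out set of (reference, prediction) pairs.
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, pred in zip(references, predictions):
        scores = scorer.score(ref, pred)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: value / len(references) for key, value in totals.items()}


# Hypothetical usage: score the same held-out references against outputs from
# a BAS-trained model and from a randomly-trained baseline.
# bas_scores = mean_rouge(refs, bas_outputs)
# baseline_scores = mean_rouge(refs, random_outputs)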
Conclusion

We explored active learning in the context of abstractive text summarization. Although active learning methods have had a significant impact on various NLP problems, applications to text summarization have been very limited. We introduced BAS as a way of combining active learning methods with state-of-the-art summarization models. BAS is, to the best of our knowledge, the first attempt to apply Bayesian active learning to abstractive text summarization. Our main findings suggest that indeed BAS…

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (35)

Cohn D.A. et al. Active learning with statistical models. J. Artificial Intelligence Res. (1996)
Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
Ein-Dor, L., Halfon, A., Gera, A., Shnarch, E., Dankin, L., Choshen, L., Danilevsky, M., Aharonov, R., Katz, Y., ...
Filos A. et al. A systematic comparison of Bayesian deep learning robustness in diabetic retinopathy tasks (2019)
Gal Y. et al. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning
Gal Y. et al. Deep Bayesian active learning with image data
Gidiotis A. et al. Should we trust this summary? Bayesian abstractive summarization to the rescue
Grießhaber, D., Maucher, J., Vu, N.T., 2020. Fine-tuning BERT for Low-Resource Natural Language Understanding via ...
Grusky, M., Naaman, M., Artzi, Y., 2018. Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive ...
Hermann K.M. et al. Teaching machines to read and comprehend
Houlsby N. et al. Bayesian active learning for classification and preference learning (2011)
Hu, P., Lipton, Z.C., Anandkumar, A., Ramanan, D., 2019. Active learning with partial feedback. In: 7th International ...
Kendall A. et al. What uncertainties do we need in Bayesian deep learning for computer vision?
Kirsch A. et al. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning
Lewis M. et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
Lin, C.-Y., 2004. ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the 2004 Workshop on Text ...
Litjens G. et al. A survey on deep learning in medical image analysis (2017)



Keywords:

None


Journal:
Computer Speech & Language
ISSN: 0885-2308
From: Elsevier BV