Abstract

Simultaneous speech translation (SimulST) is the task in which output generation has to be performed on partial, incremental speech input. In recent years, SimulST has become popular due to the spread of multilingual application scenarios, like international live conferences and streaming lectures, in which on-the-fly speech translation can facilitate users' access to audio-visual content. In this paper, we analyze the characteristics of the SimulST systems developed so far, discussing their strengths and weaknesses. We then concentrate on the evaluation framework required to properly assess systems' effectiveness. To this end, we raise the need for a broader performance analysis, one that also includes the user experience standpoint. We argue that SimulST systems should be evaluated not only in terms of quality/latency measures, but also via task-oriented metrics accounting, for instance, for the visualization strategy adopted. In light of this, we highlight the goals already achieved by the community and what is still missing.

Introduction

Simultaneous speech translation (SimulST) is the task in which the translation of source language speech has to be performed on partial, incremental input. This is a key feature for achieving low latency in scenarios like streaming conferences and lectures, where the displayed text has to follow the pace of the speech as closely as possible. SimulST is indeed a complex task, in which the difficulties of performing speech recognition on partial input are exacerbated by the need to project meaning across languages. Despite the increasing demand for such systems, the problem is still far from solved. So far, research efforts have mainly focused on the quality/latency trade-off, i.e., producing high-quality output in the shortest possible time, balancing the need for a good translation with the necessity of rapid text generation. Previous studies, however, disregard how the translation is displayed and, consequently, how it is actually perceived by end users. After a concise survey of the state of the art in the field, in this paper we posit that, from the users' experience standpoint, output visualization is at least as important as producing a good translation in a short time. This raises the need for a broader, task-oriented and human-centered analysis of SimulST systems' performance, one that also accounts for this third crucial factor.

Background

As in the case of offline speech translation, the adoption of cascade architectures (Stentiford and Steer, 1988; Waibel et al., 1991) was the first attempt made by the SimulST community to tackle the problem of generating text from partial, incremental input. Cascade systems (Fügen, 2009; Fujita et al., 2013; Niehues et al., 2018; Xiong et al., 2019; Arivazhagan et al., 2020b) involve a pipeline of two components. First, a streaming automatic speech recognition (ASR) module transcribes the input speech into the corresponding text (Wang et al., 2020; Moritz et al., 2020). Then, a simultaneous text-to-text translation (MT) module translates the partial transcription into target language text (Gu et al., 2017; Dalvi et al., 2018; Ma et al., 2019; Arivazhagan et al., 2019). This approach suffers from error propagation, a well-known problem even in the offline scenario: the transcription errors made by the ASR module are propagated to the MT module, which cannot recover from them as it does not have direct access to the audio.
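To make this two-step organization concrete, the following minimal sketch shows one way such a cascaded loop can be wired together. It is illustrative only: the `asr` and `mt` objects and their methods (`transcribe_incremental`, `translate_prefix`, `finish`) are hypothetical stand-ins for the ASR and MT modules discussed above, not the API of any specific system.

def cascaded_simulst(audio_chunks, asr, mt):
    """Consume fixed-size audio chunks; yield newly committed target words.

    Hypothetical interfaces assumed here:
      asr.transcribe_incremental(chunk) -> growing partial transcript (str)
      mt.translate_prefix(transcript, committed) -> list of new target words
      mt.finish(transcript, committed) -> remaining words once input ends
    """
    transcript = ""
    committed = []  # target words already shown to the user
    for chunk in audio_chunks:
        # Step 1: streaming ASR extends the partial transcript.
        transcript = asr.transcribe_incremental(chunk)
        # Step 2: the simultaneous MT module decides whether it has seen
        # enough source context to emit more target words (e.g. via a
        # wait-k policy) or should wait for further input.
        new_words = mt.translate_prefix(transcript, committed)
        committed.extend(new_words)
        yield new_words
    # Flush: the source is complete, so emit whatever is still pending.
    yield mt.finish(transcript, committed)

Note how the MT step can only operate on whatever transcript the ASR step has produced so far: this is exactly where the error propagation just discussed, and the extra latency discussed next, come from.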
Another strong limitation of cascaded systems is the extra latency added by the two-step pipeline, since the MT module has to wait until the streaming ASR output is produced. To overcome these issues, the direct models initially proposed by Bérard et al. (2016) and Weiss et al. (2017) represent a valid alternative that is gaining increasing traction (Bentivogli et al., 2021). Direct ST models are composed of an encoder, usually bidirectional, and a decoder: the encoder computes a hidden representation from the audio features extracted from the input signal, and the decoder transforms this representation into target language text. Direct modeling becomes crucial in the simultaneous scenario, as it reduces the overall system's latency thanks to the absence of intermediate symbolic representation steps. Despite the data scarcity caused by the limited availability of speech-to-translation corpora, the adoption of direct architectures has proven promising (Weiss et al., 2017; Ren et al., 2020; Zeng et al., 2021), driving recent efforts towards the development of increasingly powerful and efficient models.

Conclusions and Future directions

SimulST systems have become increasingly popular in recent years, and efforts have been made to build robust and efficient models. Despite the difficulties introduced by the online setting, these models have rapidly improved, achieving results comparable to those of offline systems. However, many research directions have not been explored enough (e.g., the adoption of dynamic or fixed segmentation, and offline versus online training). We argue that SimulST systems should be evaluated not only in terms of quality/latency measures, but also via task-oriented metrics accounting, for instance, for the visualization strategy adopted.
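To ground the latency side of this evaluation, the sketch below computes Average Lagging (AL), the latency metric introduced by Ma et al. (2019) that is widely used for simultaneous translation. It is a minimal illustration assuming word-level READ/WRITE steps; the helper name `average_lagging` and the toy wait-3 example are ours, not taken from any particular toolkit.

def average_lagging(g, src_len, tgt_len):
    """Average Lagging (Ma et al., 2019).

    g[i-1] = number of source words already read when the i-th target
    word is emitted. AL measures, on average, how many source words the
    system lags behind an ideal fully synchronous translator.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau = index of the first target word emitted after the whole
    # source has been read; later words are excluded from the average.
    tau = next((i for i, gi in enumerate(g, start=1) if gi >= src_len), len(g))
    return sum(g[i - 1] - (i - 1) / gamma for i in range(1, tau + 1)) / tau

# Toy check: a wait-3 policy on a 10-word source/target pair always
# lags three words behind the speaker, and AL recovers that value.
g = [3, 4, 5, 6, 7, 8, 9, 10, 10, 10]
print(average_lagging(g, src_len=10, tgt_len=10))  # -> 3.0

Translation quality is typically measured separately (e.g., BLEU on the final output); the argument made here is that neither of these two numbers, alone or together, captures how the partial translations are actually displayed to, and perceived by, the user.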