Please use this identifier to cite or link to this item: http://hdl.handle.net/10609/151239
Title: Filtering and rescoring the CCMatrix corpus for neural machine translation training
Authors: Oliver, Antoni  
Alvarez Vidal, Sergi  
Citation: Oliver, A. [Antoni] & Álvarez, S. [Sergi]. (2023). Filtering and rescoring the CCMatrix corpus for Neural Machine Translation training. Proceedings of the 24th Annual Conference of the European Association for Machine Translation.
Abstract: There are several parallel corpora available for many language pairs, such as CCMatrix, built from mass downloads of web content and automatic detection of segments in one language and the translation equivalent in another. These techniques can produce large parallel corpora, but of questionable quality. In many cases, the segments are not in the required languages, or if they are, they are not translation equivalents. In this article, we present an algorithm for filtering out the segments in languages other than the required ones and re-scoring the segments using SBERT. A use case on the Spanish–Asturian and Spanish–Catalan CCMatrix corpus is presented.
Document type: info:eu-repo/semantics/conferenceObject
Publication date: Jun-2023
Publication license: http://creativecommons.org/licenses/by-nd/4.0/es/  
Appears in collections: Conferences

Files in this item:
File: Filtering and rescoring the CCMatrix corpus for Neural Machine Translation training.pdf
Size: 124.17 kB
Format: Adobe PDF

This item is subject to a Creative Commons license.