Please use this identifier to cite or link to this item: http://hdl.handle.net/10609/151239
Full metadata record
DC Field | Value | Language
dc.contributor.author | Oliver, Antoni | -
dc.contributor.author | Alvarez Vidal, Sergi | -
dc.date.accessioned | 2024-09-17T12:42:20Z | -
dc.date.available | 2024-09-17T12:42:20Z | -
dc.date.issued | 2023-06 | -
dc.identifier.citation | Oliver, A. [Antoni] & Álvarez, S. [Sergi]. (2023). Filtering and rescoring the CCMatrix corpus for Neural Machine Translation training. Proceedings of the 24th Annual Conference of the European Association for Machine Translation | -
dc.identifier.isbn | 9789520329471 | -
dc.identifier.uri | http://hdl.handle.net/10609/151239 | -
dc.description.abstract | There are several parallel corpora available for many language pairs, such as CCMatrix, built from mass downloads of web content and automatic detection of segments in one language and the translation equivalent in another. These techniques can produce large parallel corpora, but of questionable quality. In many cases, the segments are not in the required languages, or if they are, they are not translation equivalents. In this article, we present an algorithm for filtering out the segments in languages other than the required ones and re-scoring the segments using SBERT. A use case on the Spanish–Asturian and Spanish–Catalan CCMatrix corpus is presented. | en
dc.format.mimetype | application/pdf | ca
dc.language.iso | eng | en
dc.publisher | European Association for Machine Translation | -
dc.relation.ispartof | Proceedings of the 24th Annual Conference of the European Association for Machine Translation - EAMT 2023, Tampere, Finland, 12-15 June 2023 | ca
dc.rights | CC BY-ND | *
dc.rights.uri | http://creativecommons.org/licenses/by-nd/4.0/es/ | -
dc.title | Filtering and rescoring the CCMatrix corpus for neural machine translation training | ca
dc.type | info:eu-repo/semantics/conferenceObject | ca
dc.rights.accessRights | info:eu-repo/semantics/openAccess | -
dc.gir.id | CO/0000006492 | -
Appears in collections: Conferencias

Files in this item:
File | Description | Size | Format
Filtering and rescoring the CCMatrix corpus for Neural Machine Translation training.pdf | | 124.17 kB | Adobe PDF

This item is licensed under a Creative Commons license.