Conference paper

SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings

Masoud Jalili Sabet, Philipp Dufter, François Yvon (1), Hinrich Schütze
(1) TLP - Traitement du Langage Parlé
LIMSI - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
Abstract: Word alignments are useful for tasks such as statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only, without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.
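The abstract does not spell out how alignments are read off the embeddings. As a rough illustration only, one simple way to do it is to compute a cosine similarity matrix between source and target word vectors and keep the pairs that are each other's best match (a mutual argmax). The sketch below assumes this strategy; the function names and the two-dimensional toy vectors are hypothetical and do not come from the paper or from any real embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Return (i, j) pairs where source word i and target word j
    are each other's most similar word (hypothetical helper)."""
    # Similarity matrix: rows are source words, columns are target words.
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target index for every source word.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sim]
    # Best source index for every target word.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sim[i][j])
        for j in range(len(tgt_vecs))
    ]
    # Keep only the pairs on which both directions agree.
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy example with invented vectors in a shared multilingual space.
src_words = ["blue", "house"]
tgt_words = ["maison", "bleue"]
src_vecs = [[1.0, 0.1], [0.1, 1.0]]
tgt_vecs = [[0.2, 0.9], [0.9, 0.2]]

alignment = mutual_argmax_align(src_vecs, tgt_vecs)
for i, j in alignment:
    print(src_words[i], "<->", tgt_words[j])
```

In this toy case the two words swap positions across the languages, and the mutual argmax recovers the crossing links. Real sentences would use vectors from a multilingual embedding model, and a stricter or looser extraction rule could be substituted for the agreement check.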

https://hal.archives-ouvertes.fr/hal-03013194
Contributor: Limsi Publications
Submitted on: Thursday, November 19, 2020 - 15:23:09
Last modified on: Monday, February 22, 2021 - 16:21:22
Long-term archiving on: Saturday, February 20, 2021 - 19:03:21

File

2020.findings-emnlp.147.pdf
Publisher files allowed on an open archive

Identifiers

  • HAL Id: hal-03013194, version 1

Citation

Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze. SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings. Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Nov 2020, Online, United States. pp. 1627-1643. ⟨hal-03013194⟩
