Cross-Language Alignments: Challenges, Guidelines and Gold Sets

3 July 2012

Anabela Barreiro, L2F, INESC-ID

In this presentation I will describe the key cross-language annotation guidelines designed to support state-of-the-art machine translation systems. The guidelines aim to improve the quality of statistical machine translation output through linguistically informed and motivated annotation of special-case multiwords and semantico-syntactic translation units. They are based on the alignment of bilingual texts from the common test set of the Europarl corpus, covering all possible language pairs among English, Spanish, French, and Portuguese. The major challenges I will discuss fall into four classes: lexical and semantico-syntactic (multiword units, compound verbs, and prepositional predicates); morphological (lexical versus non-lexical realization, such as determiners and zero determiners, the pro-drop phenomenon including subject pronoun drop and empty relative pronouns, and contracted forms); morpho-syntactic (free noun adjuncts); and semantico-discursive (emphatic linguistic constructions such as tautology, pleonasm, and repetition, and focus constructions). I will also present CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and to facilitate the alignment of meaning and translation units. The inter-annotator agreement for English-Portuguese is 0.98 for word alignment and 0.54 for multiword and semantico-syntactic unit alignment, which represents a total agreement of 0.87. The gold collection and the alignment tool are publicly available.



Anabela Barreiro is an invited researcher at the Spoken Language Systems Laboratory of INESC-ID Lisbon. She holds a PhD in Linguistics and works in the areas of machine translation and paraphrasing applied to authoring aids, text production and revision, and cross-language tasks. Her post-doctoral work consists of developing a new hybrid machine translation system that applies linguistically enhanced natural language processing resources (semantico-syntactic knowledge) to statistical machine translation. She has over seven years of experience in the development of commercial machine translation systems at Logos Corporation, USA. More recently, she has been endorsing the OpenLogos open-source machine translation system initiative. She has substantial experience in the development of monolingual and multilingual linguistic resources and natural language processing tools. She is the author of several journal publications on machine translation, paraphrases, and linguistic resources.