Enrichment of Bilingual Dictionary through News Stream Data

Abstract

Bilingual dictionaries are a key component of cross-lingual similarity estimation methods. Such dictionaries are usually generated either manually or automatically. Automatic generation approaches exploit parallel or comparable data to derive dictionary entries, and they require a large amount of bilingual data to produce a good-quality dictionary. Many language pairs do not have large bilingual comparable corpora, and in such cases the best automatic dictionary is upper-bounded by the quality and coverage of the available corpora. In this work we propose a method that exploits continuous quasi-comparable corpora to derive term-level associations for the enrichment of such a limited dictionary. Although our experiments are on English and Hindi, our approach can easily be extended to other languages. We evaluated the dictionary by manually computing its precision. In preliminary experiments, our approach is able to derive interesting term-level associations across languages.

Reference

Ajay Dubey, Parth Gupta, Vasudev Varma and Paolo Rosso. Enrichment of Bilingual Dictionary through News Stream Data. In Proceedings of LREC 2014, Reykjavík, Iceland, May 26-31.
