English/Arabic Cross Language Information Retrieval (CLIR) for Arabic OCR-Degraded Text

Tarek A. Elghazaly and Aly A. Fahmy

Faculty of Computers & Information, Cairo University, Giza, Egypt

Abstract

This paper proposes, implements, and tests a novel model for Query Translation and Expansion that enables English/Arabic CLIR for both normal and OCR-degraded Arabic text. First, an English/Arabic Word Collocations Dictionary was established and three English/Arabic Single Words Dictionaries were reproduced. Second, a modern Arabic Corpus was built. Third, a model for simulating Arabic OCR errors was proposed. Fourth, a comprehensive model for Query Translation and Expansion was proposed. The model translates the query from English to Arabic by detecting and translating collocations, translating single words, and transliterating names. It resolves the replacement ambiguity and then expands the Arabic query to handle the expected Arabic OCR errors. The proposed model achieves high accuracy in translating queries from English to Arabic, resolving the translation and transliteration ambiguities, and, with orthographic query expansion, it handles OCR errors with a high degree of accuracy.

Keywords: Cross Language Information Retrieval, CLIR, Arabic OCR-Degraded Text, Arabic Corpus.
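
To make the idea of orthographic query expansion concrete, the following is a minimal sketch, not the model proposed in this paper. It assumes that OCR errors can be approximated by single-character substitutions among Arabic letters that differ only in their dots. The confusion groups, the names CONFUSION_GROUPS, build_confusion_map, expand_word, and expand_query, and the sample words are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical confusion groups: letters sharing a base shape and differing
# only in dots. Illustrative assumption, not the paper's OCR error model.
CONFUSION_GROUPS: List[str] = [
    "بتث", "جحخ", "دذ", "رز", "سش", "صض", "طظ", "عغ", "فق",
]


def build_confusion_map(groups: List[str]) -> Dict[str, Set[str]]:
    """Map each letter to the other letters in its confusion group."""
    confusions: Dict[str, Set[str]] = {}
    for group in groups:
        for ch in group:
            confusions[ch] = set(group) - {ch}
    return confusions


def expand_word(word: str, confusions: Dict[str, Set[str]]) -> List[str]:
    """Return the word followed by every single-substitution variant."""
    variants = [word]
    seen = {word}
    for i, ch in enumerate(word):
        # Sort alternatives so the output order is deterministic.
        for alt in sorted(confusions.get(ch, ())):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate not in seen:
                seen.add(candidate)
                variants.append(candidate)
    return variants


def expand_query(terms: List[str],
                 confusions: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Expand each translated Arabic query term into its OCR-variant list."""
    return {term: expand_word(term, confusions) for term in terms}


if __name__ == "__main__":
    cmap = build_confusion_map(CONFUSION_GROUPS)
    for term, variants in expand_query(["كتاب", "مال"], cmap).items():
        print(term, variants)
```

In a retrieval setting, the variants of each term would typically be combined as alternatives (for example, with an OR or synonym operator) so that a document containing a misrecognized form can still match the query. Limiting the sketch to one substitution per word keeps the expansion small; a realistic error model would also need to cover insertions, deletions, and merged or split characters.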