1.1. NE applications
1.1.1 Information Retrieval: This is the task of retrieving data or documents according to a search input query. The task requires identifying NEs in the input query and in the searched data or documents in order to retrieve the relevant documents. For example, the word الهلال (alhilal) can be recognized as an organization name (a Saudi football club) or as a common noun meaning "moon". Correct NE identification facilitates retrieving the correct document. A study by [Dayong WU et al. 2011] indicated that about 60% of search engine queries contain NEs.
1.1.2 Question Answering: This is the task of producing an answer to a given question. NER can be used to analyze questions, which helps in identifying the correct domain and constructing a relevant answer. Moreover, the answers to many questions contain NEs. For example, the answers to questions that begin with who (من) usually involve persons or locations, and the answers to questions that begin with where (أين) usually involve locations. [Shaalan 2014]
1.1.3 Machine Translation: This is the task of automatically translating a given text from one language to another. In this task, NER systems play a key role in the overall quality of machine translation applications: NER is needed to determine which parts of an NE should be translated by meaning and which parts should be transliterated, such as personal names. For example, جامعه الأميرة سمية للتكنولوجيا is translated as Princess Sumaya University for Technology; here the word سميه is transliterated as Sumaya, while the other words are translated normally.
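The translate-or-transliterate decision described above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical lexicon (LEXICON) and a deliberately naive rule: a token found in the lexicon is translated by meaning, and any other token is treated as a name to be transliterated. Real systems rely on NER output rather than lexicon membership, and this sketch does not handle word reordering.

```python
# Hypothetical toy lexicon (assumption): Arabic token -> English translation.
LEXICON = {
    "جامعه": "University",
    "الأميرة": "Princess",
    "للتكنولوجيا": "for Technology",
}

def plan_ne_translation(ne: str) -> list:
    """Label each token of an NE as 'translate' or 'transliterate'."""
    plan = []
    for token in ne.split():
        # Naive rule: known vocabulary is translated, everything else
        # is assumed to be a name and is transliterated.
        action = "translate" if token in LEXICON else "transliterate"
        plan.append((token, action))
    return plan

if __name__ == "__main__":
    for token, action in plan_ne_translation("جامعه الأميرة سمية للتكنولوجيا"):
        print(action, token)
```

Under this rule, only the personal name in the example falls outside the lexicon, so it is the only token marked for transliteration.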
1.1.4 Navigation Systems: The main task of a navigation system is to use digital maps to provide directions and information about nearby places.
In such systems, all places and locations stored in the system database with their geographical coordinates are NEs.
1.2. Arabic language aspects and challenges
“Arabic is a language of rich morphology and complex syntax” [Al-Sughaiyer and Al-Kharashi 2004]. It is classified into three main types: Classical Arabic, which is the language of Islam and has been used for over 1500 years; Modern Standard Arabic, which is one of the six official languages of the United Nations and is the focus of most Arabic NLP research; and Colloquial Arabic, which is the spoken form of the language, is irregular, and differs among countries and regions.
The task of Arabic NER is challenging due to the following Arabic language features:
• Lack of capitalization: Arabic script has no capitalization, unlike languages such as English, in which capital letters are used to recognize NEs. The absence of this feature makes Arabic NER more difficult because most Arabic NEs are indistinguishable from common nouns and adjectives. For example, the Arabic location word الزرقاء (a city in Jordan) can also be used as an adjective (referring to something blue). This type of ambiguity is commonly resolved by analyzing the context surrounding the NE. [Shaalan 2014]
• Complicated morphology: Each Arabic word may consist of one or more prefixes, a stem or root, and one or more suffixes, resulting in a complicated morphology. Moreover, clitics, including conjunctions, prepositions, or a combination of them, may be attached to the NE.
• Optional Short Vowels: Arabic words contain diacritics (small marks placed above or under an Arabic letter) that represent most of the vowels and give different meanings to the same lexical form. Most Arabic text that appears in letters, magazines, or other media is undiacritized for simplicity, which leads to lexical ambiguity. For example, the word مؤسسة could be recognized as a location NE when it is diacritized as مُؤَسَّسَة (foundation or corporation) or as a person NE when it is diacritized as مُؤَسِسَة (a founder). [Shaalan 2014]
• Ambiguity in Named Entities: The same word can be ambiguous between two or more NE types. For example, the word أمنيه could be considered a person NE or an organization NE (a telecommunication company in Jordan).
• Lack of Uniformity in Writing Styles: This ambiguity occurs when transliterating an NE from another language into Arabic; it happens because Arabic has more speech sounds than other languages. For example, transliterating an English NE such as Gallery Mall (a location in Jordan) into Arabic can produce several variants, such as جاليري مول ، غاليري مول.
• Lack of Resources: Arabic has a limited number of available resources that can be used in NER systems. Corpora (tagged documents) and gazetteers (lists of NEs by type) are used to implement Arabic NER systems and to test their performance. Researchers in Arabic NLP rely on their own human-annotated corpora; some of these have been published and made freely available to others, whereas others are available only under paid license agreements. [Shaalan 2014]
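Several of the features listed above can be made concrete with a small sketch. It shows, under stated assumptions, how two differently diacritized forms collapse to the same undiacritized string, and how attached clitics can hide an NE from a gazetteer lookup. The gazetteer contents, the clitic list, and the function names are hypothetical and chosen only for illustration; this is not a complete normalizer or morphological analyzer.

```python
import re
import unicodedata

# Arabic tashkeel (tanween, short vowels, shadda, sukun) plus the dagger alif.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics; base letters are left untouched."""
    return DIACRITICS.sub("", unicodedata.normalize("NFC", text))

# The two diacritized readings from the example above.
FOUNDATION = "مُؤَسَّسَة"
FOUNDER = "مُؤَسِسَة"

# Hypothetical toy gazetteer (assumption): surface form -> candidate NE types.
GAZETTEER = {
    strip_diacritics(form): types
    for form, types in {
        "الزرقاء": {"LOC"},        # city in Jordan, also an adjective
        "أمنيه": {"PER", "ORG"},   # personal name or telecom company
    }.items()
}

# Illustrative single-letter proclitics (conjunctions, prepositions); not exhaustive.
PROCLITICS = ("و", "ف", "ب", "ل", "ك")

def candidate_types(token: str) -> set:
    """Look a token up in the gazetteer, peeling off up to two leading clitics."""
    form = strip_diacritics(token)
    for _ in range(3):
        if form in GAZETTEER:
            return GAZETTEER[form]
        if len(form) > 1 and form.startswith(PROCLITICS):
            form = form[1:]  # drop one clitic letter and retry
        else:
            break
    return set()

if __name__ == "__main__":
    # Different readings, identical once the diacritics are removed.
    print(strip_diacritics(FOUNDATION) == strip_diacritics(FOUNDER))
    # The conjunction attached to the city name is peeled off before lookup.
    print(candidate_types("و" + "الزرقاء"))
```

Peeling letters off blindly can also damage words whose first letter merely looks like a clitic, and a gazetteer hit only yields candidate types. As noted above, the surrounding context is still needed to choose among them.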