Thesis Title: 
Automatic Diacritizer for Arabic Texts
Author: 
Mohammad Ahmed Sayed Ahmed
Date of Birth: 
Tue, 11/08/1981
Previous Degrees: 
B.Sc. (ELC) 2003 - Cairo
Registration Date: 
Wed, 01/10/2003
Awarding Date: 
Tue, 14/07/2009

Dr. Fahmy, A. A.
Dr. Fakhr, M. W.
Dr. Rashwan, M. A.

Key Words: 

Automatic Arabic diacritization, Factorization, Unfactorization,
Morphological analysis, Natural language processing, Human
language technology


Choosing between factorizing and unfactorizing entities is one of the main
problems facing researchers in the field of human language technologies. As a
case study of this problem, this thesis studies automatic Arabic text
diacritization. It compares diacritization through word factorization, using
morphological analysis, with diacritization through unfactorized, full-form
words. The experimental results show that for a small training corpus the
unfactorizing system is better, since it reaches saturation faster than the
factorizing one, although it may suffer from the out-of-vocabulary (OOV)
problem. For a very large training corpus the two systems perform almost the
same, except that the cost of the unfactorizing system is lower. The best
strategy is therefore a hybrid of the two systems, which combines the fast
learning and low cost of the unfactorizing system with the wide coverage of the
factorizing one.
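
The hybrid strategy can be illustrated with a minimal back-off sketch: try the
full-form (unfactorizing) lookup first, and fall back to a morphological
(factorizing) analysis only for OOV words. This is not the thesis's
implementation. The names make_hybrid_diacritizer and toy_factorized, the toy
lexicons, and the Latin transliterations are all hypothetical and stand in for
trained statistical models.

```python
from typing import Callable, Dict, Optional

# Hypothetical toy data: undiacritized form -> diacritized form (transliterated).
FULL_FORM_LEXICON: Dict[str, str] = {"ktb": "kataba"}
STEMS: Dict[str, str] = {"ktb": "kataba", "drs": "darasa"}
PREFIXES: Dict[str, str] = {"w": "wa", "f": "fa"}


def toy_factorized(word: str) -> Optional[str]:
    """Assumed stand-in for a morphological analyzer: prefix + stem lookup."""
    if word in STEMS:
        return STEMS[word]
    for prefix, diacritized_prefix in PREFIXES.items():
        stem = word[len(prefix):]
        if word.startswith(prefix) and stem in STEMS:
            return diacritized_prefix + STEMS[stem]
    return None  # the analyzer cannot factorize this word


def make_hybrid_diacritizer(
    full_forms: Dict[str, str],
    factorized: Callable[[str], Optional[str]],
) -> Callable[[str], str]:
    """Build a diacritizer that backs off from full forms to factorization."""

    def diacritize(word: str) -> str:
        hit = full_forms.get(word)
        if hit is not None:
            return hit  # cheap path: full-form word seen in training
        guess = factorized(word)
        # OOV path: use the factorized analysis, else leave the word unchanged.
        return guess if guess is not None else word

    return diacritize


diacritize = make_hybrid_diacritizer(FULL_FORM_LEXICON, toy_factorized)
print(diacritize("ktb"))   # full-form hit
print(diacritize("wdrs"))  # OOV for the full-form lexicon, resolved by factorization
print(diacritize("xyz"))   # unknown to both systems, returned unchanged
```

In this sketch the full-form lookup handles every word it has seen, so the more
expensive factorized analysis runs only on the OOV remainder.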