Abstract
This thesis presents a novel segmentation-free technique for the
design and implementation of an OCR (Optical Character Recognition)
system for printed Nastalique text.
The specific area of this thesis is document understanding and
recognition, a branch of computer vision, which is in turn a
sub-field of Artificial Intelligence.
Optical character recognition is the translation of optically
scanned bitmaps of printed or handwritten text into digitally
editable data files. OCRs developed for many world languages are
already in effective use, but none exists for Nastalique, a
calligraphic adaptation of the Arabic script, just as Jawi is for
Malay. Often, a single script with its basic character shapes
is adapted for writing in multiple languages, e.g. the Roman script
for English, German and French, and the Arabic script for Persian,
Sindhi, Urdu, Pashtu and Malay.
Urdu has 39 characters against the 28 of Arabic. Each character
has two to four different shapes according to its position in the
word: isolated, initial, medial and final. Many character shapes
have multiple instances and are context sensitive: a character's
shape changes with the preceding or the following character.
At times even the third or the fourth character may cause a similar
change, a dependency that resembles an n-gram model in a Markov
chain. Unlike in the Roman script, words and characters overlap in
Nastalique, which makes optical recognition extremely complex.
Compared with OCR for Roman script languages, very little research
has been done on Arabic Naskh OCR. Only a few Arabic Naskh OCR
systems are available today, and they too are far from perfect,
lagging behind Roman script OCR systems in accuracy.
In this respect, Nastalique is even more complicated than Naskh:
it has multiple baselines, more overlapping of characters within
a ligature and between adjacent ligatures, vertical stacking of
characters in a ligature, and so on.
Urdu has still not attracted researchers’ attention for the
development of OCR, partly because of a lack of funds in this area
but mainly because of the challenges the Nastalique style poses
through its cursiveness and context sensitivity. For the same
reason, published research in this area is nearly non-existent.
The proposed system for Nastalique OCR does not require segmentation
of a ligature into its constituent character shapes. However, it does
require segmentation at two levels: first the text image is
segmented into lines of text, then each line of text is further
segmented into ligatures or isolated characters. The next step is a
line-by-line cross-correlation for recognition of characters in the
ligatures, whereby character codes are written into a text file in
the sequence in which the characters are found in the ligature. Once
the recognition process is complete, the character codes in the text
file are passed to the rendering engine, which displays the
recognized text in a text region.
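
The two stages described above, splitting the page image into text
lines and then matching character templates by cross-correlation,
can be illustrated with a minimal sketch. It is written in Python
rather than the Matlab used for the thesis work, and it is not the
thesis implementation: the function names (segment_lines, ncc,
best_match), the rule that a text line is a run of rows containing
ink, and the use of normalized cross-correlation are all assumptions
made for illustration. Ligature segmentation within a line is omitted.

```python
from math import sqrt


def segment_lines(image):
    """Split a binary image (list of rows, 1 = ink) into (start, end) row ranges.

    Assumption: a text line is a maximal run of consecutive rows that
    contain at least one ink pixel (a horizontal projection profile).
    """
    lines, start = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:                  # line running to the last row
        lines.append((start, len(image)))
    return lines


def ncc(patch, template):
    """Normalized cross-correlation of two equally sized 2-D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a)
               * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0       # flat patch: no correlation


def best_match(line_img, template):
    """Slide the template across a line image; return (score, x offset).

    Assumption: the template has the same height as the line image, so
    only the horizontal offset needs to be searched.
    """
    h, w = len(template), len(template[0])
    best = (-1.0, -1)
    for x in range(len(line_img[0]) - w + 1):
        patch = [row[x:x + w] for row in line_img[:h]]
        score = ncc(patch, template)
        if score > best[0]:
            best = (score, x)
    return best


# Toy page: two text lines separated by a blank row.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
template = [[1, 0], [1, 1]]                # hypothetical character shape

line_ranges = segment_lines(image)
top, bottom = line_ranges[0]
score, x = best_match(image[top:bottom], template)
```

In a full system, the offset of the best match for each character
template would determine the order in which character codes are
written out for a ligature; how the thesis resolves competing matches
is not stated in this abstract.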
The limitation of the proposed Nastalique character recognition
system is that it is font dependent: for recognition it needs the
same font file that was used to write the text. The new undertaking
poses greater challenges, as it aims to overcome the inherent
cursiveness and context sensitivity of the Nastalique style of
writing.
For Nastalique OCR, we develop character-based True Type Font files
for a few Nastalique words. These words are written using the same
character-based TTF font, and an image is made of the Nastalique
text. The image is then given to our Nastalique OCR. After
recognition, rendering is done using the same TTF font file to
display the recognized text. The work is therefore threefold:
development of a character-based Nastalique True Type Font,
Nastalique character recognition, and rendering of the recognized
text using the character-based Nastalique True Type Font.
Since our character-based segmentation-free Nastalique OCR algorithm
needs, as groundwork, a character-based Nastalique Text
Processor, we have also proposed a Finite State Nastalique Text
Processor Model. It has not yet been implemented, so no results are
reported. However, this model could serve as an impetus for future
research in this challenging field.
Optical Character Recognition for Roman script languages is almost a
solved problem for document images, and researchers are now focusing
on extraction and recognition of text from video scenes. This new and
emerging field in character recognition is called Video OCR and has
numerous applications such as video annotation, indexing, retrieval,
search, digital libraries, and lecture video indexing.
This emerging field is attracting research on other scripts such as
Chinese but, to the best of our knowledge, no work has yet been
reported on Video OCR for Arabic script languages such as Arabic,
Persian and Urdu.
As an extension of our Nastalique OCR to Video OCR for Arabic script
languages, we have also performed experiments on video text
identification, localization and extraction for recognition. We
have used a MACH (Maximum Average Correlation Height) filter to
identify text regions in video frames; these text regions are then
localized and extracted for recognition. All research and
development work was done using Matlab 7.0. Experiments and results
are reported in the thesis.