Pakistan Research Repository
 

Title of Thesis

A Technique For The Design And Implementation Of An OCR For Printed Nastalique Text

Author(s)

Sohail Abdul Sattar

Institute/University/Department Details
Department of Computer Science and Information Technology / N.E.D. University of Engineering & Technology, Karachi
Session
2009
Subject
Computer Science
Number of Pages
216
Keywords (Extracted from title, table of contents and abstract of thesis)
Implementation, Technique, Nastalique, Segmentation, Recognition, Optically, Cursiveness

Abstract
This thesis presents a novel segmentation-free technique for the design and implementation of an OCR (Optical Character Recognition) system for printed Nastalique text.
The specific area of this thesis is document understanding and recognition, a branch of computer vision, which is in turn a sub-field of Artificial Intelligence.
Optical character recognition is the translation of optically scanned bitmaps of printed or handwritten text into digitally editable data files. OCR systems developed for many of the world's languages are already in effective use, but none exists for Nastalique, a calligraphic adaptation of the Arabic script, just as Jawi is for Malay. Often a single script with its basic character shapes is adapted for writing multiple languages, e.g. the Roman script for English, German and French, and the Arabic script for Persian, Sindhi, Urdu, Pashtu and Malay.
Urdu has 39 characters against the 28 of Arabic. Each character has two to four different shapes according to its position in the word: isolated, initial, medial and final. Many character shapes have multiple instances and are context sensitive: the shape changes with the preceding or the following character. At times even the third or the fourth character may cause a similar change, a dependence that resembles an n-gram model in a Markov chain. Unlike in the Roman script, the overlapping of words and characters in Nastalique makes optical recognition extremely complex.
Compared with OCR for Roman-script languages, very little research has been done on Arabic Naskh OCR. Only a few Arabic Naskh OCR systems are available today, and they too are far from perfect, lagging behind Roman-script OCR systems in accuracy.
In this respect Nastalique is even more complicated than Naskh, as it has multiple baselines, more overlapping of characters within a ligature and between adjacent ligatures, vertical stacking of characters in a ligature, and so on.
Urdu has still not attracted researchers' attention for OCR development, partly because of a lack of funds in this area but mainly because of the challenges the Nastalique style poses through its cursiveness and context sensitivity. For the same reason, published research in this area is nearly non-existent.
The proposed system for Nastalique OCR does not require segmentation of a ligature into its constituent character shapes. It does, however, require segmentation at two levels: first the text image is segmented into lines of text, then each line is further segmented into ligatures or isolated characters. The next step is a line-by-line cross-correlation to recognize the characters in the ligatures, whereby character codes are written to a text file in the sequence in which the characters are found in the ligature. Once the recognition process is complete, the character codes in the text file are given to the rendering engine, which displays the recognized text in a text region.
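The two segmentation levels can be illustrated with a minimal sketch. It assumes a binarized image (a list of rows, 1 = ink), a horizontal projection profile for line splitting, and 8-connected components for ligature splitting. The abstract does not say which methods the thesis actually uses, so both choices and the function names `segment_lines` and `segment_components` are illustrative assumptions.

```python
from collections import deque

def segment_lines(img):
    """Split a binary image into (top, bottom) row ranges of text lines.

    A row belongs to a line if it contains any ink (horizontal projection > 0).
    """
    lines, start = [], None
    for y, row in enumerate(img):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(img)))
    return lines

def segment_components(img, top, bottom):
    """Return bounding boxes (x0, y0, x1, y1) of 8-connected ink components
    inside rows [top, bottom), ordered right to left (the reading direction)."""
    width = len(img[0])
    seen, boxes = set(), []
    for y in range(top, bottom):
        for x in range(width):
            if not img[y][x] or (y, x) in seen:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(y, x)])
            seen.add((y, x))
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (top <= ny < bottom and 0 <= nx < width
                                and img[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1 + 1, y1 + 1))
    boxes.sort(key=lambda b: -b[2])        # rightmost component first
    return boxes
```

Connected components are used here rather than a vertical projection because adjacent Nastalique ligatures can overlap horizontally, so there may be no blank column between them. Dots and diacritics come out as separate components in this sketch and would still need to be grouped with their ligature body.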
The limitation of the proposed Nastalique character recognition system is that it is font dependent: for recognition it needs the same font file that was used to write the text. The new undertaking has greater challenges, as it will aim to overcome the inherent cursiveness and context sensitivity of the Nastalique style of writing.
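Font dependence is what one would expect from correlation-based matching, since the character shapes taken from the font serve as templates. The following is a minimal sketch of recognition by zero-mean normalized cross-correlation, not the thesis's actual procedure. The template dictionary, the codes `"A"` and `"B"`, the threshold of 0.9 and the function names are all hypothetical.

```python
from math import sqrt

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-sized grids."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = sqrt(sum((p - mean_a) ** 2 for p in a) *
               sum((q - mean_b) ** 2 for q in b))
    return num / den if den else 0.0       # flat patches carry no information

def best_match(image, template):
    """Slide the template over the image; return (score, y, x) of the best fit."""
    th, tw = len(template), len(template[0])
    best = (-2.0, 0, 0)                    # below the minimum possible score
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc(patch, template)
            if score > best[0]:
                best = (score, y, x)
    return best

def recognise(ligature, templates, threshold=0.9):
    """Return codes of the templates found in a ligature image,
    ordered right to left by the right edge of each match."""
    hits = []
    for code, tmpl in templates.items():
        if len(tmpl) > len(ligature) or len(tmpl[0]) > len(ligature[0]):
            continue                       # template larger than the image
        score, _, x = best_match(ligature, tmpl)
        if score >= threshold:
            hits.append((x + len(tmpl[0]), code))
    return [code for _, code in sorted(hits, reverse=True)]
```

For example, with two toy 2x2 templates placed side by side in a ligature image, the codes come back in right-to-left order:

```python
templates = {"A": [[1, 0], [1, 1]], "B": [[1, 1], [0, 1]]}
ligature = [[1, 1, 0, 1, 0],
            [0, 1, 0, 1, 1]]
print(recognise(ligature, templates))      # ['A', 'B']
```

Because the score depends on pixel-level agreement with the template, text set in a different font would generally score below the threshold.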
For Nastalique OCR, we develop character-based TrueType Font (TTF) files for a few Nastalique words. These words are written using the same character-based TTF font and an image is made of the Nastalique text. The image is then given to our Nastalique OCR. After recognition, rendering is done using the same TTF font file to display the recognized text. The work is therefore threefold: development of a character-based Nastalique TrueType Font, Nastalique character recognition, and rendering of the recognized text using the character-based Nastalique TrueType Font.
Since our character-based, segmentation-free Nastalique OCR algorithm needs a character-based Nastalique text processor as groundwork, we have also proposed a Finite State Nastalique Text Processor Model. It has not yet been implemented, so no results are reported. However, this model could serve as an impetus for future research in this challenging field.
Optical Character Recognition for Roman-script languages is almost a solved problem for document images, and researchers are now focusing on the extraction and recognition of text from video scenes. This new and emerging field in character recognition is called Video OCR and has numerous applications, such as video annotation, indexing, retrieval, search, digital libraries, and lecture video indexing.
This emerging field is attracting research on other scripts such as Chinese but, to the best of our knowledge, no work has yet been reported on Video OCR for Arabic-script languages such as Arabic, Persian and Urdu.
As an extension of our Nastalique OCR to Video OCR for Arabic-script languages, we have also performed experiments on video text identification, localization and extraction for recognition. We have used a MACH (Maximum Average Correlation Height) filter to identify text regions in video frames; these text regions are then localized and extracted for recognition. All research and development work was done using Matlab 7.0. Experiments and results are reported in the thesis.
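For orientation, a commonly cited frequency-domain form of the MACH filter is given below. The abstract does not state which variant or parameter values the thesis uses, so the trade-off weights and the symbols here are assumptions drawn from the general correlation-filter literature.

```latex
% x_i : vectorized 2-D Fourier transform of the i-th of N training images
% X_i : diagonal matrix with x_i on its diagonal;  M_x : diagonal matrix of m_x
% C   : diagonal noise power spectral density (assumed known)
\mathbf{m}_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i, \qquad
\mathbf{D}_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{X}_i^{*}\mathbf{X}_i, \qquad
\mathbf{S}_x = \frac{1}{N}\sum_{i=1}^{N}
  (\mathbf{X}_i-\mathbf{M}_x)^{*}(\mathbf{X}_i-\mathbf{M}_x)

% Optimal trade-off form; \alpha, \beta, \gamma \ge 0 are tuning weights
\mathbf{h} = \left(\alpha\,\mathbf{C} + \beta\,\mathbf{D}_x
  + \gamma\,\mathbf{S}_x\right)^{-1}\mathbf{m}_x
```

With α = β = 0 this reduces, up to a scale factor, to the basic MACH filter h = S_x⁻¹ m_x. Correlating a video frame with h and thresholding the correlation peaks is one way candidate text regions could be flagged.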

Download Full Thesis
4,932 KB
S. No. | Chapter | Chapter Title | Page | Size (KB)

1 | 0 | CONTENTS | x | 117

2 | 1 | INTRODUCTION | 1 | 365
    1.1 Computer Vision
    1.2 Character Recognition
    1.3 History of OCR
    1.4 OCR Processes
    1.5 World Languages and Scripts
    1.6 Perso-Arabic Script
    1.8 Nastalique Text Processor
    1.9 The Digital Divide
    1.10 Importance of Bridging the Digital Divide
    1.11 Approaches for Arabic Naskh OCR
    1.12 Motivation and Research Objective
    1.13 Main Contribution of this Research
    1.14 Additional Contribution of this Research
    1.15 Thesis Overview
    1.16 Conclusion

3 | 2 | LITERATURE SURVEY | 31 | 165
    2.1 Introduction
    2.2 Previous Work on Urdu OCR
    2.3 Approaches for Arabic Script OCR
    2.4 Previous Work in Ligature-Based Arabic OCR
    2.5 Previous Work in Segmentation-Based Arabic OCR

4 | 3 | VIDEO OCR | 59 | 2,938
    3.1 Introduction
    3.2 Types of Video Text
    3.3 Applications of Video-OCR
    3.4 Literature Survey
    3.5 Correlation Pattern Recognition
    3.6 Text Region Detection in Video Frames
    3.7 Results
    3.8 Conclusion and Future Work

5 | 4 | IMPLEMENTATION CHALLENGES FOR NASTALIQUE OCR | 109 | 871
    4.1 Introduction
    4.2 Nastalique Character Set
    4.3 Nastalique Script Characteristics
    4.4 Computational Analysis of Urdu Alphabet
    4.5 Nastalique Script for Urdu
    4.6 Ligature in Urdu
    4.7 Word Forming in Urdu
    4.8 Styles of Urdu Writing
    4.9 Nastalique Script Complexities
    4.10 Sloping and Multiple Base-Lines
    4.11 A Generic OCR Model
    4.12 Working of a Roman Script OCR
    4.13 Working of a Nastalique Script OCR
    4.14 Approaches for Nastalique OCR

6 | 5 | THE PROPOSED NASTALIQUE OCR SYSTEM | 137 | 955
    5.1 Introduction
    5.2 The Nastalique OCR Implementation
    5.3 The Novel Segmentation-free Nastalique OCR Algorithm
    5.4 Nastalique OCR Algorithm Description
    5.5 Segmentation of Text Image into Lines
    5.6 Segmentation of Text Line into Ligatures
    5.7 Character Recognition by Cross-Correlation
    5.8 Nastalique Text Segmentation
    5.9 Segmentation of Text Image into Lines and Ligatures
    5.10 Recognition Technique
    5.11 Nastalique OCR Application
    5.12 The Recognition Procedure
    5.13 The Recognition Process

7 | 6 | CONCLUSION AND FUTURE WORK | 169 | 290
    6.1 Introduction
    6.2 Nastalique Character Shapes
    6.3 Nastalique Joining Characters Features Set
    6.4 Proposed Nastalique Text Processor Model
    6.5 Components of Nastalique Text Processor Model (NTPM)
    6.6 Conclusion
    6.7 Contribution
    6.8 Future Work

8 | 7 | REFERENCES | 209 | 129