Optical Character Recognition for printed Tamil text using Unicode
Authors: SEETHALAKSHMI R.  SREERANJANI T.R.  BALACHANDAR T.  Abnikant Singh  Markandey Singh  Ritwaj Ratan  Sarvesh Kumar
Affiliation (all authors): Shanmugha Arts Science Technology and Research Academy, Thirumalaisamudram, Thanjavur, Tamil Nadu, India
Abstract: INTRODUCTION. Optical Character Recognition (OCR) deals with the machine recognition of characters in an input image obtained by scanning. It refers to the process by which scanned images are electronically processed and converted to editable text. The need for OCR arises in the context of digitizing Tamil documents, from ancient and old ones to the latest, which helps in sharing the data through the Internet. Tamil language. Tamil is a South Indian language spo…

Keywords: optical character recognition  support vector  artificial neural network  encoding system  software translation
Received: 2005-08-05
Revised: 2005-09-10

SEETHALAKSHMI R.,SREERANJANI T.R.,BALACHANDAR T.,Abnikant Singh,Markandey Singh,Ritwaj Ratan,Sarvesh Kumar.Optical Character Recognition for printed Tamil text using Unicode[J].Journal of Zhejiang University Science,2005,6(11):1297-1305.
Authors:R Seethalakshmi  T R Sreeranjani  T Balachandar  Abnikant Singh  Markandey Singh  Ritwaj Ratan  Sarvesh Kumar
Institution:(1) Shanmugha Arts Science Technology and Research Academy, Thirumalaisamudram, Thanjavur, Tamil Nadu, India
Abstract: Optical Character Recognition (OCR) here refers to the process of converting printed Tamil text documents into software-translated Unicode Tamil text. Printed documents such as books, papers, and magazines are scanned with standard scanners, which produce an image of each document. In the preprocessing phase the image is checked for skew; if it is skewed, it is corrected by a simple rotation in the appropriate direction. The image then passes through a noise elimination phase and is binarized. The preprocessed image is segmented by an algorithm that decomposes the scanned text into paragraphs using a special space detection technique, the paragraphs into lines using vertical histograms, the lines into words using horizontal histograms, and the words into character image glyphs using horizontal histograms. Each image glyph comprises 32×32 pixels. The segmentation phase thus yields a database of character image glyphs. All the image glyphs are then considered for recognition using Unicode mapping. Each glyph is passed through routines that extract its features. The features considered for classification are character height, character width, the number of horizontal lines (long and short), the number of vertical lines (long and short), horizontally oriented curves, vertically oriented curves, the number of circles, the number of slope lines, the image centroid, and special dots. The extracted features are passed to a Support Vector Machine (SVM), where the characters are classified by a supervised learning algorithm. The resulting classes are mapped onto Unicode for recognition, and the text is then reconstructed using Unicode fonts.
Keywords:OCR  Unicode  Features  Support Vector Machine (SVM)  Artificial Neural Networks
This article is indexed in CNKI, Wanfang Data, SpringerLink, and other databases.