Clustering Subword Images in Old, Voluminous Printed Texts Using an Image Comparison Measure

Subjects: Electrical and Computer Engineering

1 - Tarbiat Modares University
2 - Tarbiat Modares University

Submission date: Sunday, 17 Safar 1437; Acceptance date: Monday, 18 Safar 1437; Publication date: Friday, 20 Jumada al-Awwal 1435

Keywords: document image analysis, recognition of voluminous texts, incremental clustering, dataset segmentation

Abstract:

The growing volume of text images makes digitizing the text they contain, and searching within such sources, increasingly important. Recognition of voluminous texts can exploit properties such as the limited number of fonts and font sizes, a uniform layout across all pages, a limited vocabulary and semantic domain, and a consistent writing style throughout the text. This paper presents an algorithm that uses the uniformity of font type and size to cluster the subwords of an old book with low print quality. The book has 233 pages, and all of its subwords, about 111,000 in total, have been segmented and labeled. A simple incremental method is used to cluster the subwords. First, four simple features are extracted for each subword; if the difference between these features and those of a cluster's representative is below a threshold, an image comparison between the two is performed. Because the number of subwords is large, the simplest possible methods were chosen in order to increase execution speed. Experimental results show that the subwords can be clustered with an accuracy of about 99.7 percent. The results of this clustering will be of great help in the subword recognition stage.
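The two-stage incremental procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the four features, the image comparison measure, or the threshold values, so the features used here (height, width, ink pixel count, ink pixels in the top half), the pixel-mismatch distance, and the parameters `feat_tol` and `img_tol` are all assumptions chosen for the example.

```python
# Binary image as nested lists: 1 = ink, 0 = background.
Image = list[list[int]]


def features(img: Image) -> tuple[int, int, int, int]:
    """Four cheap features (assumed, not taken from the paper)."""
    height, width = len(img), len(img[0])
    ink = sum(map(sum, img))                     # total ink pixels
    top_ink = sum(map(sum, img[: height // 2]))  # ink pixels in the top half
    return (height, width, ink, top_ink)


def image_distance(a: Image, b: Image) -> float:
    """Assumed comparison measure: fraction of mismatched pixels after
    top-left alignment on a canvas large enough for both images."""
    h = max(len(a), len(b))
    w = max(len(a[0]), len(b[0]))

    def px(img: Image, r: int, c: int) -> int:
        # Pixels outside an image count as background.
        return img[r][c] if r < len(img) and c < len(img[0]) else 0

    diff = sum(px(a, r, c) != px(b, r, c) for r in range(h) for c in range(w))
    return diff / (h * w)


def cluster(images: list[Image],
            feat_tol: tuple[int, int, int, int] = (2, 2, 6, 4),
            img_tol: float = 0.15) -> list[dict]:
    """Incremental clustering: each image joins the first cluster whose
    representative passes the cheap feature filter and then the image
    comparison; otherwise it starts a new cluster as its representative."""
    clusters: list[dict] = []
    for idx, img in enumerate(images):
        f = features(img)
        for c in clusters:
            # Stage 1: cheap feature filter against the representative.
            close = all(abs(x - y) < t for x, y, t in zip(f, c["feat"], feat_tol))
            # Stage 2: pixel-level comparison, run only if stage 1 passes.
            if close and image_distance(img, c["rep"]) < img_tol:
                c["members"].append(idx)
                break
        else:
            clusters.append({"rep": img, "feat": f, "members": [idx]})
    return clusters
```

The feature filter exists only to skip the more expensive pixel comparison for obviously different shapes, which matters when roughly 111,000 subwords must each be checked against the existing clusters. In this sketch the first member of a cluster stays its representative; other choices, such as an averaged representative, are equally plausible readings of the abstract.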

