Challenges of Persian Scene Text Detection and the Importance of a New Dataset for Evaluating Deep Learning Models
Subject Areas : electrical and computer engineering
Z. Raisi 1, R. Damani 2, E. Sarani 3, V. Nazarhzehi Had 4
1 - Elec. Eng. Dept., Marine Engineering Faculty, Chabahar Maritime University, Chabahar, Iran
2 - Elec. Eng. Dept., Marine Engineering Faculty, Chabahar Maritime University, Chabahar, Iran
3 - Elec. Eng. Dept., Marine Engineering Faculty, Chabahar Maritime University, Chabahar, Iran
4 - Elec. Eng. Dept., Marine Engineering Faculty, Chabahar Maritime University, Chabahar, Iran
Keywords: Persian text dataset, scene text detection, deep learning models, FATD benchmark dataset
Abstract :
Due to the structural complexity of the Persian script and the lack of standardized, reliable datasets, detecting Persian text and segmenting words in natural scene images captured by conventional cameras remain key challenges in image processing. In this paper, we introduce a comprehensive dataset for Persian text detection, named FATD (Farsi Text Detection Dataset). FATD comprises more than 2,000 diverse images containing text in various fonts, sizes, and orientations under varied environmental conditions, covering a wide range of visual complexity. Six deep learning models are then evaluated and compared under identical conditions on this dataset: two convolutional neural network (CNN)-based models (YOLOv8 and CRAFT), two transformer-based models (RRDETR and RRBDETR), and two vision-language models (Qwen2.5VL and Florence-2). Experimental results show that the transformer-based models achieve the highest accuracy, up to 65% in H-mean, at the expense of higher computational cost. In contrast, the CNN-based models offer competitive accuracy with notably faster inference. Moreover, despite their limited training exposure to Persian text, the evaluated vision-language models exhibit promising localization performance according to the H-mean metric. Overall, this study provides a benchmark and comparative analysis for advancing Persian scene text detection and highlights the potential of modern vision-language architectures for low-resource languages.
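The H-mean reported above is conventionally the harmonic mean of detection precision and recall. The abstract does not state the matching protocol, so the following is only a minimal sketch under assumptions that are not taken from the paper: axis-aligned boxes given as (x1, y1, x2, y2), greedy one-to-one matching, and an IoU threshold of 0.5. The function names `iou` and `detection_scores` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(pred, gt, iou_thr=0.5):
    """Return (precision, recall, h_mean) for one image.

    Assumption: each prediction is greedily matched to the unmatched
    ground-truth box with the highest IoU, and counts as a true positive
    when that IoU reaches iou_thr (0.5 here, an assumed value).
    """
    matched_gt = set()
    true_pos = 0
    for p in pred:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt):
            if j in matched_gt:
                continue
            value = iou(p, g)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou >= iou_thr:
            matched_gt.add(best_j)
            true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    total = precision + recall
    h_mean = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, h_mean


# Toy example with made-up boxes: two ground-truth words, three predictions.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(0, 0, 10, 10), (50, 50, 60, 60), (21, 21, 30, 30)]
precision, recall, h_mean = detection_scores(pred_boxes, gt_boxes)
print(f"precision={precision:.3f} recall={recall:.3f} h_mean={h_mean:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a detector cannot reach a high H-mean by trading away either precision or recall. Benchmarks with rotated or polygonal annotations would need a polygon IoU in place of the axis-aligned one assumed here.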
