Semantic Word Embedding Using BERT on the Persian Web

Subject: electrical and computer engineering

Shekoofe Bostan ¹, Ali Mohammad Zareh Bidoki ², Mohammad Reza Pajoohan ³

1 - Yazd University
2 - Yazd University
3 - Yazd University

Submission date: Wednesday, 16 Rabi' al-Awwal 1444. Acceptance date: Sunday, 21 Rajab 1444. Publication date: Tuesday, 9 Rabi' al-Thani 1445.

Keywords: semantic vector, word embedding, ranking, deep learning

Abstract:

Using the context and order of the words in a phrase can lead to a better understanding of that phrase. In recent years, pre-trained language models have brought remarkable progress to natural language processing, and transformer-based models such as BERT have become increasingly popular. This topic has been studied less for Persian and remains a challenge for the Persian web. This paper therefore investigates Persian word embedding with BERT, which captures the meaning of each word from its textual context. In the proposed approach, the model is pre-trained on a Persian web dataset, and the final model is produced after two stages of fine-tuning with different architectures. The model's features are then extracted and evaluated on ranking Persian web documents. The resulting model improves on the other models examined and raises accuracy over multilingual BERT by at least one percent. Applying the fine-tuning process with the proposed structure to other existing models also improved those models and their embedding accuracy after each fine-tuning stage. Ranking results based on the final models show better Persian web ranking accuracy than the evaluated baseline models, with a gain of about 5 percent in the best case.
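The last step of the pipeline, ranking documents from extracted features, can be illustrated with a minimal sketch. The abstract does not say how features are pooled or how documents are scored, so the mean pooling, the cosine-similarity scoring, the function names, and the toy 3-dimensional vectors below are all assumptions made for illustration; in practice the per-token vectors would come from the fine-tuned model.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_tokens, docs_tokens):
    """Return document ids sorted by similarity to the query, best first."""
    query_vec = mean_pool(query_tokens)
    scores = {
        doc_id: cosine(query_vec, mean_pool(tokens))
        for doc_id, tokens in docs_tokens.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy vectors standing in for contextual token embeddings.
query = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
docs = {
    "doc_a": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "doc_b": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
    "doc_c": [[0.5, 0.5, 0.0], [0.6, 0.4, 0.1]],
}

ranking = rank_documents(query, docs)
print(ranking)
```

With these toy inputs, the document whose pooled vector points most nearly in the query's direction is listed first. A learned ranking function could replace the cosine score without changing the rest of the sketch.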

