Improving Q-Learning Using Simultaneous Updating and an Adaptive Policy Based on Opposite Action

Topics: electrical and computer engineering

Maryam Pouyan ¹, Shahram Golzari ², Amin Mousavi ³, Ahmad Hatam ⁴

1 - University of Hormozgan
2 - University of Hormozgan
3 - University of Hormozgan
4 - University of Hormozgan

Submitted: Thursday, 19 Shawwal 1438. Accepted: Thursday, 19 Shawwal 1438. Published: Wednesday, 20 Dhu al-Hijjah 1437.

Keywords: adaptive policy, convergence speed, opposite action, simultaneous updating, Q-learning

Abstract:

Q-learning is one of the best-known and most widely used model-free reinforcement learning methods. Its advantages include that it requires no prior knowledge and is guaranteed to reach the optimal solution. One limitation is that its convergence slows as the dimensionality of the problem grows, so speeding up convergence remains a challenge. Using the concept of opposite actions in Q-learning improves convergence speed, because two Q-values are updated simultaneously at each learning step. This paper proposes a hybrid method that combines an adaptive policy with the opposite-action concept to increase convergence speed. The methods are simulated on the Grid world problem. The proposed methods show improvements in the average success rate (percent), the average percentage of optimal states, the average number of steps the agent takes to reach the goal, and the average reward received.
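
The idea of updating two Q-values per learning step can be illustrated with a minimal sketch. This is not the authors' algorithm: the action names, the `OPPOSITE` table, the negated reward for the opposite action, and bootstrapping both updates from the same next state are all assumptions made here for illustration, and the adaptive policy is not modeled.

```python
from collections import defaultdict

# Hypothetical action set for a grid world; each action's opposite is assumed
# to be the move in the reverse direction.
ACTIONS = ("up", "down", "left", "right")
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}


def q_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard one-step Q-learning update for a single (state, action) pair."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])


def opposition_q_update(Q, state, action, reward, next_state,
                        alpha=0.5, gamma=0.9):
    """Update the taken action and, in the same step, its opposite action.

    Assumption: the opposite action receives the negated reward and is
    bootstrapped from the same next state. This is one simple choice,
    not necessarily the rule used in the paper.
    """
    q_update(Q, state, action, reward, next_state, alpha, gamma)
    q_update(Q, state, OPPOSITE[action], -reward, next_state, alpha, gamma)


Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
opposition_q_update(Q, state=(0, 0), action="right", reward=1.0,
                    next_state=(0, 1))
```

After this single step, both `Q[((0, 0), "right")]` and `Q[((0, 0), "left")]` have moved away from zero, whereas plain Q-learning would have changed only the entry for the action actually taken.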
