A New Method for Skill Acquisition in Reinforcement Learning Using Graph Clustering
Authors: Marzieh Davoodabadi Farahani 1, Nasser Mozayani 2
1 - Iran University of Science and Technology
2 - Iran University of Science and Technology
Keywords: hierarchical reinforcement learning, option, temporal abstraction, skill, skill evaluation, graph clustering
Abstract:
Reinforcement learning is a branch of machine learning in which an agent learns about its environment and improves its behavior by interacting with it. A major shortcoming of standard reinforcement learning algorithms such as Q-learning is that they cannot solve large problems within an acceptable time. Automatic skill acquisition can help break a problem into smaller subproblems and solve it hierarchically. Despite the promising results of using skills in hierarchical reinforcement learning, other studies have shown that, depending on the task at hand, the effect of skills on learning performance can be entirely positive or entirely negative, and poorly chosen skills can even increase the complexity of solving the problem. A key weakness of previous automatic skill-acquisition methods is therefore the lack of any evaluation of the individual skills that are acquired. This paper presents new graph-clustering-based methods for extracting subgoals and acquiring skills. It also introduces new criteria for evaluating skills, which are used to discard skills that are unsuitable for solving the problem. Experiments in several test environments show that these methods considerably increase the speed of learning.
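To make the general idea concrete, the sketch below illustrates the family of techniques the abstract describes, not the paper's exact algorithm: a state-transition graph is built from the agent's trajectories, partitioned with an off-the-shelf modularity-based clustering routine from networkx, and states on the border between clusters are taken as candidate subgoals for skills (options). The grid-world trajectories and the "hit rate" scoring function are hypothetical stand-ins for the paper's skill-evaluation criteria.

```python
# Illustrative sketch only: graph-clustering-based subgoal discovery with a
# toy skill-evaluation score. Assumes a discrete state space and networkx.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def build_transition_graph(trajectories):
    """Build an undirected graph whose nodes are visited states and whose
    edges connect states observed in consecutive time steps."""
    g = nx.Graph()
    for episode in trajectories:
        for s, s_next in zip(episode, episode[1:]):
            g.add_edge(s, s_next)
    return g


def find_subgoals(graph):
    """Cluster the graph and return border states, i.e. states with at least
    one neighbour in a different cluster; these serve as candidate subgoals."""
    communities = greedy_modularity_communities(graph)
    label = {s: i for i, community in enumerate(communities) for s in community}
    subgoals = {
        s for s in graph.nodes
        if any(label[n] != label[s] for n in graph.neighbors(s))
    }
    return subgoals, label


def evaluate_skill(subgoal, trajectories):
    """Toy evaluation criterion (a stand-in, not the paper's measure): the
    fraction of episodes that pass through the subgoal. Low-scoring subgoals
    would be discarded before building options around them."""
    hits = sum(subgoal in episode for episode in trajectories)
    return hits / max(len(trajectories), 1)


if __name__ == "__main__":
    # Two hypothetical trajectories in a two-room grid world; state (2, 1)
    # plays the role of the doorway between the rooms.
    trajectories = [
        [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (3, 2)],
        [(0, 1), (1, 1), (2, 1), (2, 2), (3, 2), (4, 2)],
    ]
    graph = build_transition_graph(trajectories)
    subgoals, _ = find_subgoals(graph)
    scores = {s: evaluate_skill(s, trajectories) for s in subgoals}
    print("candidate subgoals and scores:", scores)
```

In this sketch, an option would then be defined for each retained subgoal, with an initiation set equal to the subgoal's cluster and a termination condition at the subgoal state; the scoring step mirrors the paper's idea of filtering out skills that do not help solve the task.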