
Event

Periodic agent-state based Q-learning for POMDPs

Thursday, July 4, 2024, 10:00 to 11:00
McConnell Engineering Building Zames Seminar Room, MC 437, 3480 rue University, Montreal, QC, H3A 0E9, CA

Informal Systems Seminar (ISS), Centre for Intelligent Machines (CIM) and Groupe d'Études et de Recherche en Analyse des Décisions (GERAD)

Speaker: Amit Sinha

** Note that this is a hybrid event.
** This seminar will be projected at McConnell 437 at McGill University.


Meeting ID: 845 1388 1004
Passcode: VISS

Abstract: The traditional approach to POMDPs is to convert them into fully observed MDPs by treating the belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting, where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. A unified treatment of these approaches considers the "agent state": a model-free, recursively updateable function of the observation history, of which frame stacking and recurrent neural networks are two examples. Since the agent state is model-free, it can be used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a deterministic stationary policy. Since the agent state is not an information state, the standard MDP results do not carry over, so we must first consider what happens under the different policy classes: stationary versus non-stationary, and deterministic versus stochastic. Our main thesis, which we illustrate via examples, is that because the agent state is not an information state, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
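For readers who want a concrete picture, below is a minimal tabular sketch of how a PASQL-style periodic Q-learning loop could look, reconstructed only from the description in the abstract: one Q-table per phase of the period, with each phase bootstrapping from the next. This is not the speaker's implementation; the environment interface (`env.reset()`, `env.step()` returning `(obs, reward, done)`), the `update_agent_state` recursion, and all parameter names are illustrative assumptions.

```python
import numpy as np

def pasql_sketch(env, update_agent_state, num_agent_states, num_actions,
                 period=2, gamma=0.95, alpha=0.1, epsilon=0.1,
                 num_steps=100_000):
    """Hypothetical tabular sketch of periodic agent-state based Q-learning.

    Maintains one Q-table per phase k = t mod period; the greedy policy
    induced by these tables is periodic rather than stationary.
    """
    # One Q-table per phase of the periodic policy.
    Q = np.zeros((period, num_agent_states, num_actions))

    obs = env.reset()
    z = update_agent_state(None, None, obs)  # initial agent state
    for t in range(num_steps):
        k = t % period  # current phase
        # Epsilon-greedy action from the phase-k Q-table.
        if np.random.rand() < epsilon:
            a = np.random.randint(num_actions)
        else:
            a = int(np.argmax(Q[k, z]))
        obs_next, reward, done = env.step(a)  # assumed interface
        z_next = update_agent_state(z, a, obs_next)
        k_next = (k + 1) % period
        # Bootstrap from the *next phase's* Q-table; this cyclic coupling
        # is what makes the learned policy periodic.
        target = reward + gamma * np.max(Q[k_next, z_next]) * (not done)
        Q[k, z, a] += alpha * (target - Q[k, z, a])
        if done:
            z = update_agent_state(None, None, env.reset())
        else:
            z = z_next
    # Greedy periodic policy: at phase k, agent state z -> argmax_a Q[k, z, a].
    return Q
```

Setting `period=1` recovers ordinary stationary agent-state based Q-learning, which is one way to see the stationary policy class as a special case of the periodic one.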

Affiliation: Amit Sinha is a PhD candidate in the Department of Electrical and Computer Engineering, McGill University.
