Handling Out-of-distribution Scenarios in Offline Reinforcement Learning
MPhil Thesis Defence

Title: "Handling Out-of-distribution Scenarios in Offline Reinforcement Learning"

By

Mr. Hon Hing CHAK

Abstract

In standard reinforcement learning (RL), environment interactions are usually available during model training to facilitate continuous exploration and performance improvement. However, in many applications, models can only be trained on pre-existing datasets without online interaction with the environment. This setting is called offline reinforcement learning (offline RL). Recent studies in offline RL combine traditional RL techniques with some form of regularization, which usually aims to match the RL policy with the dataset-generating policy. This addresses the extrapolation errors that arise when evaluating the quality of out-of-distribution state-action pairs, also known as the distributional shift issue. However, most regularization techniques assume that the environment states encountered by the RL agent during deployment will stay close to the dataset distribution, so that the agent can identify suitable actions to minimize distributional shift. In many real-world applications with highly stochastic environments, this might not be true. When an unfamiliar environment state, i.e. an out-of-distribution (OOD) state, is encountered, the agent might pick actions that were unregularized during training. These unregularized actions could lead to further distributional shift in later interactions, forming a vicious cycle.

In this thesis, we propose an offline RL model that combines the standard actor-critic architecture with a Wasserstein-1 divergence critic, inspired by the Wasserstein Generative Adversarial Network with gradient penalty (WGAN-GP), to address the issue of OOD states. We build a gradient-penalized critic network to capture the divergence from the state-action distribution of the dataset using the Wasserstein-1 distance, and extend this distance to the full state space during training. When encountering unfamiliar environment states during deployment, the model can still output actions that are close to the marginal action distribution of the dataset. In our experiments, we tested our model in a real-world application: an automatic cremation control system. We show that our model produces actions similar to human actions in OOD states, whereas previous methods fail to identify meaningful actions in those states. In addition, our model enjoys stable performance and outperforms the human baseline.

Date: Thursday, 11 August 2022
Time: 2:00pm - 4:00pm
Zoom Meeting: https://hkust.zoom.us/j/99482645453?pwd=R2VLRVZmemF2L01DRjR2cXQ1MktmQT09

Committee Members:
Prof. Raymond Wong (Supervisor)
Prof. Dit-Yan Yeung (Chairperson)
Prof. Shing-Chi Cheung

**** ALL are Welcome ****
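For readers unfamiliar with the WGAN-GP construction mentioned in the abstract, the following is a minimal illustrative sketch, not the thesis code: a gradient-penalized critic over state-action pairs, written in PyTorch with all class, function, and parameter names assumed for illustration. It shows the standard WGAN-GP loss form, i.e. the Kantorovich-Rubinstein dual estimate of the Wasserstein-1 divergence plus a gradient penalty that encourages the critic to be 1-Lipschitz.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a critic f(s, a) trained to estimate the
# Wasserstein-1 divergence between the policy's state-action distribution
# and the dataset distribution, with a WGAN-GP gradient penalty enforcing
# the 1-Lipschitz constraint required by the W1 dual formulation.

class DivergenceCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def critic_loss(critic, data_s, data_a, policy_s, policy_a, gp_coef=10.0):
    """WGAN-GP style objective: separate dataset pairs from policy pairs,
    and penalize critic gradients whose norm deviates from 1 on
    interpolated pairs."""
    w1_term = critic(policy_s, policy_a).mean() - critic(data_s, data_a).mean()

    # Interpolate between dataset and policy samples for the gradient penalty.
    eps = torch.rand(data_s.size(0), 1, device=data_s.device)
    mix_s = (eps * data_s + (1 - eps) * policy_s).requires_grad_(True)
    mix_a = (eps * data_a + (1 - eps) * policy_a).requires_grad_(True)
    out = critic(mix_s, mix_a)
    grads = torch.autograd.grad(out.sum(), [mix_s, mix_a], create_graph=True)
    grad_norm = torch.cat(grads, dim=-1).norm(2, dim=-1)
    gp = ((grad_norm - 1.0) ** 2).mean()

    return w1_term + gp_coef * gp
```

In an offline RL training loop of the kind the abstract describes, a loss of this shape would be minimized for the critic while the actor is regularized by the critic's divergence estimate; the exact way the thesis couples the two networks and extends the distance to the full state space is not reproduced here.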