On the Foundations of Distributionally Robust Reinforcement Learning 


Motivated by the need for a robust policy in the face of environment shifts between training and deployment, we contribute to the theoretical foundations of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around distributionally robust Markov decision processes (DRMDPs). This framework requires the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we construct DRMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include adaptability granularity, covering history-dependent, Markov, and Markov time-homogeneous dynamics for the decision maker and the adversary. Additionally, we delve into the flexibility of the shifts induced by the adversary, examining the so-called SA- and S-rectangularity conditions. We investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of a DPP holds significant implications, as the vast majority of existing data- and computationally efficient RL algorithms rely on the DPP. We also offer counterexamples for settings in which a DPP in full generality is absent.
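As a concrete illustration (not part of the abstract itself), the SA-rectangular setting mentioned above is commonly formalized through a robust Bellman operator, in which the adversary picks a worst-case transition kernel independently for each state-action pair from an uncertainty set \(\mathcal{P}_{s,a}\). A standard sketch of this operator, assuming a reward function \(r\), discount factor \(\gamma \in (0,1)\), and finite action set \(\mathcal{A}\), is:

```latex
% Robust Bellman operator under SA-rectangularity (standard form; notation
% r, \gamma, \mathcal{P}_{s,a} introduced here for illustration):
(\mathcal{T} V)(s) \;=\; \max_{a \in \mathcal{A}} \;
    \min_{P_{s,a} \in \mathcal{P}_{s,a}}
    \Big[\, r(s,a) \;+\; \gamma \, \mathbb{E}_{s' \sim P_{s,a}}\big[ V(s') \big] \Big].
```

When a DPP holds, the robust optimal value function is a fixed point of this operator, which is what makes value-iteration-style algorithms applicable; the counterexamples referenced in the abstract concern settings where no such fixed-point characterization of the optimal history-dependent value exists.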

(Joint work with Shengbo Wang, Nian Si, Zhengyuan Zhou).


Jose Blanchet is a Professor of Management Science and Engineering (MS&E) at Stanford. Prior to joining MS&E, he was a professor at Columbia (Industrial Engineering and Operations Research, and Statistics, 2008-2017), and before that he taught at Harvard (Statistics, 2004-2008). Jose is a recipient of the 2010 Erlang Prize and several best publication awards in areas such as applied probability, simulation, operations management, and revenue management. He also received a Presidential Early Career Award for Scientists and Engineers in 2010. He worked as an analyst at Protego Financial Advisors, a leading investment bank in Mexico. His research interests are in applied probability and Monte Carlo methods. He is the Area Editor of Stochastic Models in Mathematics of Operations Research. He has served on the editorial boards of Advances in Applied Probability, Bernoulli, Extremes, Insurance: Mathematics and Economics, Journal of Applied Probability, Queueing Systems: Theory and Applications, and Stochastic Systems, among others.