We study offline reinforcement learning under a novel model called the strategic MDP, which characterizes the strategic interactions between a principal and a sequence of myopic agents with private types. Because of its bilevel structure and the agents' private types, the strategic MDP involves information asymmetry between the principal and the agents. We focus on the offline RL problem, where the goal is to learn the principal's optimal policy with respect to a target population of agents from a pre-collected dataset of historical interactions. The unobserved private types confound such a dataset, as they affect both the rewards and the observations received by the principal. We propose a novel algorithm, Pessimistic policy Learning with Algorithmic iNstruments (PLAN), which leverages instrumental variable regression and the pessimism principle to learn a near-optimal policy for the principal under general function approximation. Our algorithm is based on the critical observation that the principal's actions serve as valid instrumental variables. In particular, under a partial coverage assumption on the offline dataset, we prove that PLAN outputs a near-optimal policy at a root-N statistical rate, where N is the number of trajectories. We further apply our framework to several special cases of the strategic MDP, including strategic regression, strategic bandits, and noncompliance in recommendation systems. This is joint work with Mengxin Yu and Jianqing Fan.
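
To make the instrumental-variable idea concrete, here is a minimal sketch in a toy linear setting. It is not the PLAN algorithm: it uses the textbook single-instrument IV estimator, and the data-generating process, the function names, and the value theta = 2 are all assumptions chosen for illustration. The principal's action is drawn independently of the agent's private type, so it can act as an instrument for the confounded observation.

```python
import numpy as np

def simulate(n, theta=2.0, seed=0):
    """Toy confounded data (hypothetical model, not taken from the talk)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n)                        # principal's action, independent of the type
    u = rng.normal(size=n)                        # agent's private type (never observed)
    x = a + u + 0.1 * rng.normal(size=n)          # observation received by the principal
    y = theta * x + u + 0.1 * rng.normal(size=n)  # reward, also affected by the type
    return a, x, y

def ols_slope(x, y):
    """Naive least-squares slope of y on x; biased when x and y share a confounder."""
    return float(np.dot(x, y) / np.dot(x, x))

def iv_slope(z, x, y):
    """Single-instrument IV estimate: cov(z, y) / cov(z, x) for zero-mean data."""
    return float(np.dot(z, y) / np.dot(z, x))

a, x, y = simulate(20000)
theta_ols = ols_slope(x, y)    # pulled away from theta by the private type
theta_iv = iv_slope(a, x, y)   # uses the action as the instrument
print(f"OLS estimate: {theta_ols:.3f}, IV estimate: {theta_iv:.3f}")
```

In this toy model the naive regression overestimates theta because the private type raises both the observation and the reward, while the IV estimate only uses variation in the observation that is driven by the principal's action.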



Zhuoran Yang is an Assistant Professor of Statistics and Data Science at Yale University, starting in July 2022. His research interests lie at the interface of machine learning, statistics, and optimization. He is particularly interested in the foundations of reinforcement learning, representation learning, and deep learning. Before joining Yale, Zhuoran worked as a postdoctoral researcher at the University of California, Berkeley, advised by Michael I. Jordan. Prior to that, he obtained his Ph.D. from the Department of Operations Research and Financial Engineering at Princeton University, co-advised by Jianqing Fan and Han Liu. He received his bachelor’s degree in Mathematics from Tsinghua University in 2015.