Offline Reinforcement Learning Workshop

Neural Information Processing Systems (NeurIPS)

December 12, 2020

@OfflineRL · #OFFLINERL2020

This page contains a non-exhaustive list of resources for machine learning and reinforcement learning researchers and practitioners to learn more about offline RL. We welcome additional resource suggestions via offline-rl-neurips@google.com or a PR on GitHub.

References (Alphabetical Order)

Agarwal, R., Schuurmans, D., and Norouzi, M. 2020. An Optimistic Perspective on Offline Reinforcement Learning. International Conference on Machine Learning (ICML).

Bodnar, C., Li, A., Hausman, K., Pastor, P., and Kalakrishnan, M. 2019. Quantile QT-Opt for risk-aware vision-based robotic grasping. arXiv preprint arXiv:1910.02787.

Bottou, L., Peters, J., Quiñonero-Candela, J., et al. 2013. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR.

Boyan, J. A. 1999. Least-squares temporal difference learning. ICML.

Cabi, S., Colmenarejo, S.G., Novikov, A., et al. 2019. A framework for data-driven robotics. arXiv preprint arXiv:1909.12200.

Chang, M., Gupta, A., and Gupta, S. 2020. Semantic visual navigation by watching YouTube videos. NeurIPS.

Chen, J. and Jiang, N. 2019. Information-theoretic considerations in batch reinforcement learning. ICML.

Chen, X., Zhou, Z., Wang, Z., et al. 2019. BAIL: Best-action imitation learning for batch deep reinforcement learning. arXiv preprint arXiv:1910.12179.

Dai, B., Shaw, A., He, N., Li, L., and Song, L. 2017. Boosting the actor with dual critic. arXiv preprint arXiv:1712.10282.

Dai, B., Shaw, A., Li, L., et al. 2018. SBEED: Convergent reinforcement learning with nonlinear function approximation. International Conference on Machine Learning, 1125–1134.

Degris, T., White, M., and Sutton, R. S. 2012. Off-policy Actor-Critic. arXiv preprint arXiv:1205.4839.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. 2009. ImageNet: A Large-Scale Hierarchical Image Database. CVPR.

Dudík, M., Erhan, D., Langford, J., and Li, L. 2014. Doubly robust policy evaluation and optimization. Statistical Science 29, 4, 485–511.

Dulac-Arnold, G., Mankowitz, D., and Hester, T. 2019. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901.

Ernst, D., Geurts, P., and Wehenkel, L. 2005. Tree-based batch mode reinforcement learning. JMLR.

Farahmand, A. M., Szepesvári, C., and Munos, R. 2010. Error propagation for approximate policy and value iteration. NeurIPS.

Farahmand, A. M. and Szepesvári, C. 2011. Model selection in reinforcement learning. Machine Learning, 85(3), 299–332.

Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. 2020. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint.

Fujimoto, S., Conti, E., Ghavamzadeh, M., and Pineau, J. 2019. Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708.

Fujimoto, S., Meger, D., and Precup, D. 2018. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900.

Gottesman, O., Futoma, J., Liu, Y., et al. 2020. Interpretable off-policy evaluation in reinforcement learning by highlighting influential transitions. arXiv preprint arXiv:2002.03478.

Gulcehre, C., Wang, Z., Novikov, A., et al. 2020. RL unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888.

Hoppe, S. and Toussaint, M. 2019. Qgraph-bounded Q-learning: Stabilizing model-free off-policy deep reinforcement learning.

Jaques, N., Ghandeharioun, A., Shen, J.H., et al. 2019. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456.

Jiang, N. and Li, L. 2016. Doubly robust off-policy value evaluation for reinforcement learning. International Conference on Machine Learning, 652–661.

Karampatziakis, N., Langford, J., and Mineiro, P. 2019. Empirical likelihood for contextual bandits. arXiv preprint arXiv:1906.03323.

Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. 2020. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951.

Kumar, A., Fu, J., Tucker, G., and Levine, S. 2019. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS.

Kumar, A., Gupta, S. and Malik, J. 2020. Learning Navigation Subroutines from Egocentric Videos. Conference on Robot Learning.

Kumar, A., Zhou, A., Tucker, G., and Levine, S. 2020. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779.

Lagoudakis, M. G. and Parr, R. 2003. Least-squares policy iteration. JMLR.

Lange, S., Gabel, T., and Riedmiller, M. 2012. Batch reinforcement learning. Reinforcement learning.

Langford, J. 2019. A real-world reinforcement learning revolution.

Laroche, R., Trichelair, P., and Des Combes, R.T. 2019. Safe policy improvement with baseline bootstrapping. International Conference on Machine Learning, 3652–3661.

Levine, S., Kumar, A., Tucker, G., and Fu, J. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.

Liu, Y., Bacon, P.-L., and Brunskill, E. 2019a. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling. arXiv preprint arXiv:1910.06508.

Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. 2019b. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473.

Matsushima, T., Furuta, H., Matsuo, Y., Nachum, O., and Gu, S. 2020. Deployment-efficient reinforcement learning via model-based offline optimization. arXiv preprint arXiv:2006.03647.

Nachum, O., Chow, Y., Dai, B., and Li, L. 2019a. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 2318–2328.

Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., and Schuurmans, D. 2019b. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.

Nair, A., Dalal, M., Gupta, A., and Levine, S. 2020. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359.

Namkoong, H., Keramati, R., Yadlowsky, S., and Brunskill, E. 2020. Off-policy policy evaluation for sequential decisions under unobserved confounding. arXiv preprint arXiv:2003.05623.

Peng, X.B., Kumar, A., Zhang, G., and Levine, S. 2019. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.

Peshkin, L. and Shelton, C. R. 2002. Learning from scarce experience. arXiv preprint cs/0204043.

Prasad, N., Engelhardt, B., and Doshi-Velez, F. 2020. Defining admissible rewards for high-confidence policy evaluation in batch reinforcement learning. Proceedings of the ACM Conference on Health, Inference, and Learning, 1–9.

Precup, D. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 80.

Precup, D., Sutton, R.S., and Dasgupta, S. 2001. Off-policy temporal-difference learning with function approximation. ICML, 417–424.

Shortreed, S.M., Laber, E., Lizotte, D.J., Stroup, T.S., Pineau, J., and Murphy, S.A. 2011. Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learning.

Siegel, N., Springenberg, J.T., Berkenkamp, F., et al. 2020. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. ICLR.

Sohn, S., Chow, Y., Ooi, J., et al. 2020. BRPO: Batch residual policy optimization. arXiv preprint arXiv:2002.05522.

Sussex, S., Gottesman, O., Liu, Y., Murphy, S., Brunskill, E., and Doshi-Velez, F. Stitched trajectories for off-policy learning.

Sutton, R.S., Maei, H.R., Precup, D., et al. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. Proceedings of the 26th Annual International Conference on Machine Learning, 993–1000.

Sutton, R.S. 1991. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 160–163.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. 2015. High-confidence off-policy evaluation. AAAI.

Wang, Q., Xiong, J., Han, L., Liu, H., and Zhang, T. 2018. Exponentially weighted imitation learning for batched historical data. NeurIPS.

Wang, Z., Novikov, A., Żołna, K., et al. 2020. Critic Regularized Regression. arXiv preprint arXiv:2006.15134.

Wu, Y., Tucker, G., and Nachum, O. 2019. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.

Xie, T. and Jiang, N. 2020. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. arXiv preprint arXiv:2003.03924.

Yu, T., Thomas, G., Yu, L., et al. 2020. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239.