Cost-Aware Reinforcement Learning
Paper title: Reinforcement Learning for Cost-Aware Markov Decision Processes
- Wesley Suttle, Stony Brook University (AMS PhD student)
- Kaiqing Zhang, University of Illinois at Urbana-Champaign
- Zhuoran Yang, Princeton University
- David Kraemer, Stony Brook University (AMS PhD student)
- Ji Liu, Stony Brook University (ECE faculty)
38th International Conference on Machine Learning (ICML), 2021. Acceptance Rate: 21.5%
Novel Technical Contribution
The paper proposes two new tractable, model-free reinforcement learning algorithms, with theoretical convergence guarantees, for solving cost-aware Markov decision processes (MDPs), where the goal is to maximize the ratio of the long-run average reward to the long-run average cost.
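In standard average-reward notation, the objective can be written as the following ratio (a sketch only; the symbols r, c, and pi are assumed here, and the paper states the precise conditions, such as strictly positive costs):

```latex
% Cost-aware MDP objective: maximize the ratio of the long-run average
% reward to the long-run average cost over stationary policies \pi.
\max_{\pi} \; \rho(\pi) \;=\;
  \frac{\displaystyle \lim_{T \to \infty} \tfrac{1}{T}\,
        \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{T-1} r(s_t, a_t)\Big]}
       {\displaystyle \lim_{T \to \infty} \tfrac{1}{T}\,
        \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{T-1} c(s_t, a_t)\Big]}
```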
The paper lays sound theoretical foundations for the application of reinforcement learning to ratio maximization problems in areas as diverse as reward shaping for incorporating domain-specific knowledge and expert guidance, portfolio optimization in finance, and the development of safe artificial intelligence.
Abstract
Ratio maximization has applications in areas as diverse as finance, reward shaping for reinforcement learning (RL), and the development of safe artificial intelligence, yet there has been very little exploration of RL algorithms for ratio maximization. This paper addresses this deficiency by introducing two new, model-free RL algorithms for solving cost-aware Markov decision processes, where the goal is to maximize the ratio of long-run average reward to long-run average cost. The first algorithm is a two-timescale scheme based on relative value iteration (RVI) Q-learning, and the second is an actor-critic scheme. The paper proves almost sure convergence of the former to the globally optimal solution in the tabular case, and almost sure convergence of the latter under linear function approximation for the critic. Unlike previous methods, both algorithms provably converge for general reward and cost functions under suitable conditions. The paper also provides empirical results demonstrating promising performance and lending strong support to the theoretical results.
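As a rough illustration of the first algorithm's structure, here is a minimal Python sketch of a two-timescale, Dinkelbach-style RVI Q-learning loop. It is not the paper's exact algorithm: the `env.step(s, a)` interface, the reference state-action pair used for the RVI offset, and the step-size schedules are all assumptions made for this sketch.

```python
import numpy as np

def cost_aware_rvi_q_learning(env, n_states, n_actions, n_steps=100_000,
                              epsilon=0.1, seed=0):
    """Illustrative two-timescale RVI Q-learning sketch for ratio maximization.

    NOT the paper's exact algorithm: it runs standard RVI Q-learning on the
    Dinkelbach-style modified reward r - rho * c (fast timescale), while rho
    is adjusted on a slower timescale. `env.step(s, a)` returning
    (reward, cost, next_state) is an assumed interface.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    rho = 0.0                      # current estimate of the optimal ratio
    s = 0                          # assumed initial state
    ref = (0, 0)                   # fixed reference pair for the RVI offset

    for t in range(1, n_steps + 1):
        alpha = 1.0 / t**0.6       # fast step size (Q-update)
        beta = 1.0 / t             # slow step size (rho-update); beta/alpha -> 0

        # Epsilon-greedy exploration.
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))

        reward, cost, s_next = env.step(s, a)

        # Fast timescale: RVI Q-learning on the modified reward r - rho * c,
        # with f(Q) = Q[ref] as the offset that pins down the value scale.
        target = reward - rho * cost + np.max(Q[s_next]) - Q[ref]
        Q[s, a] += alpha * (target - Q[s, a])

        # Slow timescale: at the optimal ratio, the average modified reward
        # is zero, so nudge rho in the direction of the offset f(Q).
        rho += beta * Q[ref]

        s = s_next

    return Q, rho
```

The key design point is the timescale separation: the Q-update uses step sizes that decay more slowly than those of the ratio estimate, so the Q-learning iteration effectively sees a quasi-static rho while rho tracks the slowly evolving solution.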
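For the second algorithm, the sketch below shows one plausible actor-critic structure for the same ratio objective, with a linear critic as in the paper's analysis. Again, this is illustrative only: the feature map `phi`, the `env.step` interface, and the three step-size schedules are assumptions, not the paper's specification.

```python
import numpy as np

def cost_aware_actor_critic(env, phi, n_actions, d, n_steps=100_000, seed=0):
    """Illustrative cost-aware actor-critic sketch (not the paper's exact scheme).

    Critic: linear differential-value estimate v(s) ~ w @ phi(s) for the
    modified reward r - rho * c. Actor: softmax policy with a score-function
    update weighted by the critic's TD error. `phi(s)` (a d-dim feature map)
    and `env.step(s, a) -> (reward, cost, next_state)` are assumed interfaces.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                      # critic weights
    theta = np.zeros((n_actions, d))     # actor weights (softmax over features)
    avg_r, avg_c = 0.0, 1.0              # running reward and cost averages
    s = 0                                # assumed initial state

    for t in range(1, n_steps + 1):
        alpha = 1.0 / t**0.6             # critic step size (fast)
        beta = 1.0 / t**0.8              # actor step size (slower)
        gamma = 1.0 / t                  # averaging step size (slowest)

        # Softmax policy over linear features.
        x = phi(s)
        logits = theta @ x
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = int(rng.choice(n_actions, p=probs))

        reward, cost, s_next = env.step(s, a)

        # Slowest timescale: track long-run averages and the ratio estimate.
        avg_r += gamma * (reward - avg_r)
        avg_c += gamma * (cost - avg_c)
        rho = avg_r / max(avg_c, 1e-8)

        # Critic (fast): TD(0) on the modified reward r - rho * c, whose
        # long-run average is ~0 once rho tracks the reward/cost ratio.
        x_next = phi(s_next)
        delta = (reward - rho * cost) + w @ x_next - w @ x
        w += alpha * delta * x

        # Actor (slower): score-function (policy-gradient) update, where
        # grad log pi(a|s) = (e_a - probs) outer phi(s) for the softmax.
        grad_log = -probs[:, None] * x[None, :]
        grad_log[a] += x
        theta += beta * delta * grad_log

        s = s_next

    return theta, w, rho
```

Here the critic TD-learns the differential value of the modified reward r - rho * c, while rho itself is tracked from running reward and cost averages on the slowest timescale, mirroring the ratio objective stated above.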