Online Learning in Risk Sensitive constrained MDP

Research output: Contribution to journalConference articlepeer-review

Abstract

We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of (Foramula Presented) and constraint violation of (Formula Presented) with probability at least 1-δ, where S and A denote the cardinalities of the state and action spaces, respectively, H is the episode length, K is the number of episodes, α < 0 is the risk-aversion parameter, and (Foramula Presented). To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.

Original languageEnglish (US)
Pages (from-to)19406-19425
Number of pages20
JournalProceedings of Machine Learning Research
Volume267
StatePublished - 2025
Event42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: Jul 13 2025Jul 19 2025

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'Online Learning in Risk Sensitive constrained MDP'. Together they form a unique fingerprint.

Cite this