Abstract
We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike in the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of (Formula Presented) and constraint violation of (Formula Presented) with probability at least 1-δ, where S and A denote the cardinalities of the state and action spaces, respectively, H is the episode length, K is the number of episodes, α < 0 is the risk-aversion parameter, and (Formula Presented). To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.
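The OCE representation underlying the reformulation is a known identity (Ben-Tal and Teboulle): with the exponential utility u(t) = (e^{αt} − 1)/α, the optimized certainty equivalent sup_b { b + E[u(X − b)] } equals the entropic risk (1/α) log E[e^{αX}], with the optimizer b* equal to the risk itself; b plays the role of the budget variable added to the state. A minimal numerical sketch of this identity (illustrative only; the sample distribution and grid are assumptions, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = -1.0  # risk-aversion parameter, alpha < 0
# Stand-in samples of the cumulative utility X (assumed Gaussian here for illustration)
X = rng.normal(loc=1.0, scale=0.5, size=50_000)

# Entropic risk: (1/alpha) * log E[exp(alpha * X)]
entropic = np.log(np.mean(np.exp(alpha * X))) / alpha

# OCE form: sup_b { b + (E[exp(alpha * (X - b))] - 1) / alpha }, via grid search over b
budgets = np.linspace(-2.0, 4.0, 401)
oce_vals = [b + (np.mean(np.exp(alpha * (X - b))) - 1.0) / alpha for b in budgets]
oce = max(oce_vals)
b_star = budgets[int(np.argmax(oce_vals))]

print(f"entropic risk = {entropic:.4f}, OCE value = {oce:.4f}, argmax b = {b_star:.3f}")
```

The maximizing budget b* coincides with the entropic risk value, which is why tracking a scalar budget alongside the state recovers the risk constraint through expectations alone.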
| Original language | English (US) |
|---|---|
| Pages (from-to) | 19406-19425 |
| Number of pages | 20 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 267 |
| State | Published - 2025 |
| Event | 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada |
| Duration | Jul 13 2025 → Jul 19 2025 |
All Science Journal Classification (ASJC) codes
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence