Abstract
In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form, and the state-space may be large. In this paper, we establish a regret bound ofÕ((1 + 1 τ )√log( 1 τ ) d3 H4 K), applicable to both star-convex and non-star-convex cases, where d is the feature dimension, H the episode length, K the number of episodes, and τ the safety threshold for a linear MDP setting. Moreover, the violation of safety constraints is zero with a high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called Objective–Constraint Decomposition (OCD) to properly bound the covering number, and resolves an error in a previous work on the constrained RL.
| Original language | English (US) |
|---|---|
| Pages (from-to) | 51957-51995 |
| Number of pages | 39 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 267 |
| State | Published - 2025 |
| Event | 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada Duration: Jul 13 2025 → Jul 19 2025 |
All Science Journal Classification (ASJC) codes
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence