TY - JOUR
T1 - A New Regret-analysis Framework for Budgeted Multi-Armed Bandits
AU - Xu, Evan Yifan
AU - Xu, Pan
N1 - Publisher Copyright:
© 2025 The Authors.
PY - 2025
Y1 - 2025
N2 - We consider two versions of the (stochastic) budgeted Multi-Armed Bandit problem. The first was introduced by Tran-Thanh et al. (AAAI, 2012): pulling each arm incurs a fixed deterministic cost and yields a random reward sampled i.i.d. from an unknown distribution (prior free). We have a global budget B and aim to devise a strategy to maximize the expected total reward. The second was introduced by Ding et al. (AAAI, 2013): it has the same setting as before, except that the costs of each arm are i.i.d. samples from an unknown distribution (independent of its rewards). We propose a new budget-based regret-analysis framework and design two simple algorithms to illustrate the power of our framework. Our regret bounds for both problems not only match the optimal bound of O(ln B) but also significantly reduce the dependence on other input parameters (assumed constants), compared with the two studies of Tran-Thanh et al. (AAAI, 2012) and Ding et al. (AAAI, 2013), both of which utilized a time-based framework. Extensive experimental results show the effectiveness and computational efficiency of our proposed algorithms and confirm our theoretical predictions.
AB - We consider two versions of the (stochastic) budgeted Multi-Armed Bandit problem. The first was introduced by Tran-Thanh et al. (AAAI, 2012): pulling each arm incurs a fixed deterministic cost and yields a random reward sampled i.i.d. from an unknown distribution (prior free). We have a global budget B and aim to devise a strategy to maximize the expected total reward. The second was introduced by Ding et al. (AAAI, 2013): it has the same setting as before, except that the costs of each arm are i.i.d. samples from an unknown distribution (independent of its rewards). We propose a new budget-based regret-analysis framework and design two simple algorithms to illustrate the power of our framework. Our regret bounds for both problems not only match the optimal bound of O(ln B) but also significantly reduce the dependence on other input parameters (assumed constants), compared with the two studies of Tran-Thanh et al. (AAAI, 2012) and Ding et al. (AAAI, 2013), both of which utilized a time-based framework. Extensive experimental results show the effectiveness and computational efficiency of our proposed algorithms and confirm our theoretical predictions.
UR - https://www.scopus.com/pages/publications/105003181148
UR - https://www.scopus.com/inward/citedby.url?scp=105003181148&partnerID=8YFLogxK
U2 - 10.1613/jair.1.16261
DO - 10.1613/jair.1.16261
M3 - Article
AN - SCOPUS:105003181148
SN - 1076-9757
VL - 82
SP - 1943
EP - 1959
JO - Journal of Artificial Intelligence Research
JF - Journal of Artificial Intelligence Research
ER -