TY - GEN
T1 - Stochastic Contextual Bandits with Long Horizon Rewards
AU - Qin, Yuzhen
AU - Li, Yingcong
AU - Pasqualetti, Fabio
AU - Fazel, Maryam
AU - Oymak, Samet
N1 - Publisher Copyright:
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2023/6/27
Y1 - 2023/6/27
N2 - The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most s prior actions and contexts (not necessarily consecutive), up to a time horizon of h. In order to avoid polynomial dependence on h, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor (T < h) and data-rich (T ≥ h) regimes, and derive respective regret upper bounds Õ(d√sT + min{q, T}) and Õ(√sdT), with sparsity s, feature dimension d, total time horizon T, and q that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon h. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
AB - The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most s prior actions and contexts (not necessarily consecutive), up to a time horizon of h. In order to avoid polynomial dependence on h, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor (T < h) and data-rich (T ≥ h) regimes, and derive respective regret upper bounds Õ(d√sT + min{q, T}) and Õ(√sdT), with sparsity s, feature dimension d, total time horizon T, and q that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon h. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
UR - https://www.scopus.com/pages/publications/85163764110
UR - https://www.scopus.com/pages/publications/85163764110#tab=citedBy
U2 - 10.1609/aaai.v37i8.26140
DO - 10.1609/aaai.v37i8.26140
M3 - Conference contribution
AN - SCOPUS:85163764110
T3 - Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023
SP - 9525
EP - 9533
BT - AAAI-23 Technical Tracks 8
A2 - Williams, Brian
A2 - Chen, Yiling
A2 - Neville, Jennifer
PB - AAAI press
T2 - 37th AAAI Conference on Artificial Intelligence, AAAI 2023
Y2 - 7 February 2023 through 14 February 2023
ER -