Wang, Yiliu, Chen, Wei and Vojnovic, Milan ORCID: 0000-0003-1382-022X (2024) Combinatorial bandits for maximum value reward function under value-index feedback. In: ICLR 2024 The Twelfth International Conference on Learning Representations, 2024-05-07 - 2024-05-11, Messe Wien Exhibition and Congress Center, Vienna, Austria, AUT.
Abstract
We investigate the combinatorial multi-armed bandit problem where an action is to select $k$ arms from a set of base arms, and its reward is the maximum of the sample values of these $k$ arms, under a weak feedback structure that only returns the value and index of the arm with the maximum value. This novel feedback structure is much weaker than the semi-bandit feedback previously studied and is only slightly stronger than the full-bandit feedback, and thus it presents a new challenge for the online learning task. We propose an algorithm and derive a regret bound for instances where arm outcomes follow distributions with finite supports. Our algorithm introduces a novel concept of biased arm replacement to address the weak feedback challenge, and it achieves a distribution-dependent regret bound of $O((k/\Delta)\log(T))$ and a distribution-independent regret bound of $\tilde{O}(\sqrt{T})$, where $\Delta$ is the reward gap and $T$ is the time horizon. Notably, our regret bound is comparable to the bounds obtained under the more informative semi-bandit feedback. We demonstrate the effectiveness of our algorithm through experimental results.
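The feedback structure described in the abstract can be illustrated with a small simulation. The following is a minimal sketch of the value-index feedback only, not the paper's algorithm: the function names, the example instance, and the tie-breaking rule (the first listed arm wins) are assumptions made for illustration.

```python
import random


def sample_arm(values, probs, rng):
    """Draw one outcome from a finite-support distribution."""
    return rng.choices(values, weights=probs, k=1)[0]


def value_index_feedback(arms, action, rng):
    """Return (max value, index of the arm attaining it) for the chosen arms.

    Only this pair is revealed to the learner; the samples of the other
    selected arms stay hidden. Ties go to the arm listed first in `action`
    (an illustrative assumption, not taken from the paper).
    """
    samples = [(sample_arm(*arms[i], rng), i) for i in action]
    best_value, best_index = samples[0]
    for value, index in samples[1:]:
        if value > best_value:
            best_value, best_index = value, index
    return best_value, best_index


# Hypothetical instance: each arm maps to (support values, probabilities).
arms = {
    0: ([0.0, 1.0], [0.9, 0.1]),
    1: ([0.0, 0.5], [0.2, 0.8]),
    2: ([0.0, 0.3], [0.5, 0.5]),
}

rng = random.Random(0)
value, index = value_index_feedback(arms, action=[0, 1], rng=rng)
print(f"observed value={value}, winning arm={index}")
```

Under semi-bandit feedback the learner would instead see the sampled value of every selected arm, which is why the value-index setting is described as weaker.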
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Additional Information: | © 2024 The Author(s) |
Divisions: | Statistics |
Subjects: | H Social Sciences > HA Statistics |
Date Deposited: | 19 Jun 2024 10:39 |
Last Modified: | 20 Dec 2024 01:00 |
URI: | http://eprints.lse.ac.uk/id/eprint/123919 |