Setting

Consider the following `multi-player' version of adversarial multi-armed bandits, on three arms: An adversary chooses a sequence of $T$ reward vectors $x_t\in [0,1]^3$. Then, two randomized decision makers iteratively choose $i_{t},j_{t}\in \{1,2,3\}$, and receive a reward $g_t(i_t,j_t)$ equal to $x_{t,i_t}+x_{t,j_t}$ if $i_t\neq j_t$, and $0$ if $i_t=j_t$. This models situations where two users utilize the same set of resources, resulting in zero reward if they collide. Crucially, the decision makers cannot communicate.
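The single-round reward rule above can be sketched in a few lines; arms are 0-indexed here (the text uses $\{1,2,3\}$), and the function names are illustrative, not part of any stated protocol:

```python
def round_reward(x_t, i, j):
    """Reward for one round: the sum of the two chosen arms' rewards
    if the players pick different arms, and zero on a collision."""
    if i == j:
        return 0.0  # collision: both players chose the same arm
    return x_t[i] + x_t[j]

# One round with reward vector x_t over three arms:
x_t = [0.5, 0.9, 0.2]
round_reward(x_t, 0, 1)  # 1.4 -- distinct arms, rewards add up
round_reward(x_t, 2, 2)  # 0.0 -- collision wipes out the reward
```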

Goal

Design an algorithm for the two decision makers which attains sublinear regret: namely, one ensuring that regardless of the adversary's strategy, $\max_{i,j}\sum_{t=1}^{T}g_t(i,j)-\sum_{t=1}^{T}g_t(i_t,j_t)= o(T)$.
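Given the full reward sequence and the players' choices, the regret above compares the realized reward to the best fixed pair of arms in hindsight (pairs with $i=j$ earn zero, so the benchmark is effectively over distinct pairs). A minimal sketch of this computation, with hypothetical helper names:

```python
from itertools import permutations

def regret(xs, plays):
    """Regret against the best fixed pair of arms in hindsight.
    xs: list of length-3 reward vectors x_t; plays: list of
    (i_t, j_t) pairs, 0-indexed."""
    def g(x, i, j):
        # Joint reward of one round: zero on collision.
        return 0.0 if i == j else x[i] + x[j]
    # Best fixed pair of distinct arms, evaluated on the whole sequence.
    best_fixed = max(sum(g(x, i, j) for x in xs)
                     for i, j in permutations(range(3), 2))
    achieved = sum(g(x, i, j) for x, (i, j) in zip(xs, plays))
    return best_fixed - achieved

# Two rounds: the best fixed pair (arms 0 and 1) earns 1.5 per round;
# the players earn 1.5 in round one and collide in round two.
xs = [[1.0, 0.5, 0.0], [1.0, 0.5, 0.0]]
regret(xs, [(0, 1), (2, 2)])  # 1.5
```

An algorithm meeting the goal must keep this quantity $o(T)$ for every adversarial sequence `xs`.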

Existing Results

Multi-player bandits have been studied in the context of cognitive radio networks, where users need to transmit on channels of varying quality and may interfere with each other (e.g. [1,2,3]). However, all results we are aware of are for a stochastic setting, where the rewards are sampled i.i.d., and most of them require a (possibly elaborate) communication scheme between the players, which is not always realistic.

Bibliography

[1] Animashree Anandkumar, Nithin Michael, Ao Kevin Tang, and Ananthram Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.

[2] Orly Avner and Shie Mannor. Concurrent bandits and cognitive radio networks. In Machine Learning and Knowledge Discovery in Databases, pages 66–81. Springer, 2014.

[3] Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.