# Around Thompson Sampling (Sebastien Bubeck)

Hyperlinks lead to the arxiv version of the mentioned papers.

Let $\nu$ be a distribution over length $T$ sequences of functions from some compact set $K \subset \mathbb{R}^n$ to $[-1,1]$. Let $(\ell_1, \hdots, \ell_T) \sim \nu$, $x^* = \mathrm{argmin}_{x \in K} \sum_{t=1}^T \ell_t(x)$, and $(U_1, \hdots, U_T)$ be uniformly distributed in $[0,1]^T$ (and independent of $\ell_t's$). A bandit strategy is a sequence of random variables $x_1, \hdots, x_T$ where $x_t$ is measurable with respect to the bandit filtration $\mathcal{F}_t = \sigma(U_1, \hdots, U_t, \ell_1(x_1), \hdots, \ell_{t-1}(x_{t-1}))$. We are interested in the following quantity (where expectation is taken with respect to $\nu$): $R_T = \mathbb{E} \sum_{t=1}^T (\ell_t(x_t) - \ell_t(x^*))$.

Thomspon Sampling (TS) is the strategy where $x_t$ has the same the same distribution as $x^*$ conditionally on $\mathcal{F}_t$. Here are some simple questions about TS.

Approximation properties of TS

Consider the special case where $K = \{e_1, \hdots, e_n\}$, and one can simulate $\nu$ by first sampling $\theta \in [0,1]^n$ from some distribution $\nu_0^{\otimes n}$ and then sampling independently each $\ell_t(e_i)$ from $\mathrm{Ber}(\theta_i)$. Let $R_T^*$ be the infimum of $R_T$ over all bandit strategies for this setting. Guha and Munagala ask whether TS satisfies $R_T \leq 2 R_T^*$ (more precisely they ask this question for a somewhat stronger notion a regret where one counts the number of times that a suboptimal action was selected). Note that in this setting we know that there exists a numerical constant $C>0$ such that $\sup_{\nu} R_T \leq C \sup_{\nu} R_T^*$ (see this paper). Furthermore Guha and Munagala essentially prove that $R_T \leq 2 R_T^*$ is true for $n=2$.

TS and the geometry of the action set

Russo and van Roy showed that if $\nu$ is supported on linear functions, then TS satisfies $R_T = O(\sqrt{n T \log |K|})$. This essentially implies that on $K=\mathbb{B}^n$ (the unit Euclidean ball in $\mathbb{R}^n$) one has $R_T = O(n \sqrt{T})$. However it is known that for this particular convex body there exists strategies with $R_T = O(\sqrt{n T})$ (see this paper). Is this latter bound true for TS? More generally understanding the interplay between TS and the geometry of the underlying action set is entirely open.

TS and convexity

In our recent work we showed that if $\nu$ is supported on convex functions, and if $n=1$, then a small modification of TS leads to $R_T = O(\log(T) \sqrt{T})$. Is this bound true for the basic TS? What about higher dimensions?