Around Thompson Sampling (Sebastien Bubeck)

Hyperlinks lead to the arxiv version of the mentioned papers.

Let \nu be a distribution over length T sequences of functions from some compact set K \subset \mathbb{R}^n to [-1,1]. Let (\ell_1, \hdots, \ell_T) \sim \nu, x^* = \mathrm{argmin}_{x \in K} \sum_{t=1}^T \ell_t(x), and (U_1, \hdots, U_T) be uniformly distributed in [0,1]^T (and independent of \ell_t's). A bandit strategy is a sequence of random variables x_1, \hdots, x_T where x_t is measurable with respect to the bandit filtration \mathcal{F}_t = \sigma(U_1, \hdots, U_t, \ell_1(x_1), \hdots, \ell_{t-1}(x_{t-1})). We are interested in the following quantity (where expectation is taken with respect to \nu): R_T = \mathbb{E} \sum_{t=1}^T (\ell_t(x_t) - \ell_t(x^*)).

Thomspon Sampling (TS) is the strategy where x_t has the same the same distribution as x^* conditionally on \mathcal{F}_t. Here are some simple questions about TS.

Approximation properties of TS

Consider the special case where K = \{e_1, \hdots, e_n\}, and one can simulate \nu by first sampling \theta \in [0,1]^n from some distribution \nu_0^{\otimes n} and then sampling independently each \ell_t(e_i) from \mathrm{Ber}(\theta_i). Let R_T^* be the infimum of R_T over all bandit strategies for this setting. Guha and Munagala ask whether TS satisfies R_T \leq 2 R_T^* (more precisely they ask this question for a somewhat stronger notion a regret where one counts the number of times that a suboptimal action was selected). Note that in this setting we know that there exists a numerical constant C>0 such that \sup_{\nu} R_T \leq C \sup_{\nu} R_T^* (see this paper). Furthermore Guha and Munagala essentially prove that R_T \leq 2 R_T^* is true for n=2.

TS and the geometry of the action set

Russo and van Roy showed that if \nu is supported on linear functions, then TS satisfies R_T = O(\sqrt{n T \log |K|}). This essentially implies that on K=\mathbb{B}^n (the unit Euclidean ball in \mathbb{R}^n) one has R_T = O(n \sqrt{T}). However it is known that for this particular convex body there exists strategies with R_T = O(\sqrt{n T}) (see this paper). Is this latter bound true for TS? More generally understanding the interplay between TS and the geometry of the underlying action set is entirely open.

TS and convexity

In our recent work we showed that if \nu is supported on convex functions, and if n=1, then a small modification of TS leads to R_T = O(\log(T) \sqrt{T}). Is this bound true for the basic TS? What about higher dimensions?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s