Optimal regret in a bandit problem with combinatorial structure (Laurent Massoulie)

In http://dl.acm.org/citation.cfm?id=2745868 we considered the following
bandit problem: an object with unknown type \tau\in T is to be
disseminated to a subset of consumers u\in U; the benefit that  u gets
from being exposed is +1 with probability p_{u\tau} and -1
otherwise. The question is then to sequentially determine who to expose,
so that the set V of users eventually exposed minimizes C_0+C_1, where
C_0:=\sum_{v\in V}(1-2 p_{v\tau})^+ is the spamming cost, and
C_1:=\sum_{v\in U\setminus V}(2p_{v\tau}-1)^+ is the missed opportunity
cost. We identified lower bounds on regrets in terms of the parameters
n=|U|, t=|T|, and \delta=\inf_{u\in U,\tau\in T}|1/2-p_{u\tau}| in
several scenarios of interest and found that the so-called greedy Bayes
strategy (a variant of Thompson sampling) would match these lower bounds
in each of these scenarios.

This begs two questions:

(i) Beyond these specific scenarios, in general what function of the
matrix (p_{u\tau})_{u\in U,\tau\in T} conditions the best possible
regret?

(ii) Do generic methods such as Thompson sampling or Greedy Bayes approach
the best possible regret more generally than in the specific scenarios
considered in http://dl.acm.org/citation.cfm?id=2745868 ?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s