Optimal regret in a bandit problem with combinatorial structure (Laurent Massoulie)

In http://dl.acm.org/citation.cfm?id=2745868 we considered the following
bandit problem: an object with unknown type \tau\in T is to be
disseminated to a subset of consumers u\in U; the benefit that  u gets
from being exposed is +1 with probability p_{u\tau} and -1
otherwise. The question is then to sequentially determine who to expose,
so that the set V of users eventually exposed minimizes C_0+C_1, where
C_0:=\sum_{v\in V}(1-2 p_{v\tau})^+ is the spamming cost, and
C_1:=\sum_{v\in U\setminus V}(2p_{v\tau}-1)^+ is the missed opportunity
cost. We identified lower bounds on regrets in terms of the parameters
n=|U|, t=|T|, and \delta=\inf_{u\in U,\tau\in T}|1/2-p_{u\tau}| in
several scenarios of interest and found that the so-called greedy Bayes
strategy (a variant of Thompson sampling) would match these lower bounds
in each of these scenarios.

This begs two questions:

(i) Beyond these specific scenarios, in general what function of the
matrix (p_{u\tau})_{u\in U,\tau\in T} conditions the best possible

(ii) Do generic methods such as Thompson sampling or Greedy Bayes approach
the best possible regret more generally than in the specific scenarios
considered in http://dl.acm.org/citation.cfm?id=2745868 ?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s