Optimal regret in a bandit problem with combinatorial structure (Laurent Massoulie)

In http://dl.acm.org/citation.cfm?id=2745868 we considered the following
bandit problem: an object with unknown type \tau\in T is to be
disseminated to a subset of consumers u\in U; the benefit that u gets
from being exposed is +1 with probability p_{u\tau} and -1 otherwise.
The question is then to sequentially determine whom to expose, so that
the set V of users eventually exposed minimizes C_0+C_1, where
C_0:=\sum_{v\in V}(1-2p_{v\tau})^+ is the spamming cost and
C_1:=\sum_{v\in U\setminus V}(2p_{v\tau}-1)^+ is the missed-opportunity
cost. We identified lower bounds on regret in terms of the parameters
n=|U|, t=|T|, and \delta=\inf_{u\in U,\tau\in T}|1/2-p_{u\tau}| in
several scenarios of interest, and found that the so-called greedy Bayes
strategy (a variant of Thompson sampling) matches these lower bounds in
each of these scenarios.
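
For concreteness, here is a minimal Thompson-sampling-style sketch for
this problem (an illustrative simplification, not the exact greedy Bayes
strategy of the paper): maintain a posterior over the unknown type \tau,
sample a type, and expose a user that looks profitable under the sample.
All names and the crude stopping rule below are our own assumptions.

import numpy as np

def thompson_dissemination(P, true_tau, rng=np.random.default_rng(0)):
    """Thompson-sampling sketch for the dissemination bandit.

    P is the |U| x |T| matrix of probabilities p_{u,tau}; true_tau indexes
    the unknown column. Exposing user u yields +1 w.p. P[u, true_tau] and
    -1 otherwise. This is an illustration, not the paper's greedy Bayes.
    """
    n, t = P.shape
    posterior = np.full(t, 1.0 / t)            # uniform prior over types
    unexposed = set(range(n))
    while unexposed:
        tau_hat = rng.choice(t, p=posterior)   # sample a type from the posterior
        good = [u for u in unexposed if P[u, tau_hat] > 0.5]
        if not good:
            break                              # crude stop: nobody looks profitable
        u = good[0]
        unexposed.discard(u)
        r = 1 if rng.random() < P[u, true_tau] else -1   # +/-1 feedback
        lik = P[u] if r == 1 else 1.0 - P[u]   # likelihood of r under each type
        posterior *= lik
        posterior /= posterior.sum()
    return posterior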

This raises two questions:

(i) Beyond these specific scenarios, what function of the matrix
(p_{u\tau})_{u\in U,\tau\in T} governs the best achievable regret in
general?

(ii) Do generic methods such as Thompson sampling or greedy Bayes
approach the best possible regret more generally than in the specific
scenarios considered in http://dl.acm.org/citation.cfm?id=2745868 ?

Clustering cheaply under local differential privacy (Laurent Massoulie)

The trade-off between privacy preservation and statistical inference
takes interesting forms in the context of recommendation engines. In
http://arxiv.org/abs/1207.3269 we considered the problem of inferring
clusters of similar items from user inputs under an extreme form of
differential privacy, namely local differential privacy. Local DP
requires each user to provide only input distorted by noise, so that the
underlying ratings of items by that user (as well as which items they
rated) remain ambiguous. We assessed the performance of clustering
schemes under these constraints by their query complexity, i.e., the
number of users probed before the cluster structure underlying users'
ratings can be reliably identified. Assuming users rate a small number
of items chosen uniformly at random, the best query complexity we could
achieve is an impracticably large \tilde{O}(N^2), where N is the number
of items. However, the best lower bound we know is \tilde{\Omega}(N).
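
For concreteness, the canonical local-DP primitive underlying such
schemes is randomized response, in which each user independently flips
each reported bit before sending it; a minimal sketch (the function name
and signature are ours):

import numpy as np

def randomized_response(bits, eps, rng=np.random.default_rng(0)):
    """eps-locally-DP release of each bit of a user's binary vector.

    Each bit is reported truthfully with probability e^eps / (1 + e^eps)
    and flipped otherwise, so the likelihood ratio of any report under
    the two values of a given bit is at most e^eps (privacy over the
    whole vector then follows by composition).
    """
    keep_prob = np.exp(eps) / (1.0 + np.exp(eps))
    flips = (rng.random(len(bits)) >= keep_prob).astype(int)
    return np.bitwise_xor(np.asarray(bits, dtype=int), flips)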

This raises the question of whether, in this setup, quasi-linear query
complexity suffices to cluster successfully, and if so, what scheme
achieves it.

Minimax off-policy value estimation (Lihong Li)

In stochastic contextual bandits, one is often interested in estimating the average per-step reward of a given policy from data collected by a different policy. The problem is also known as off-policy learning in reinforcement learning and as covariate shift in statistics, among other names.

Precisely, a contextual K-armed bandit is specified by a distribution \mu over a context set \mathcal{X} and a reward distribution \Phi. Given a stochastic policy \pi, its average reward is v^\pi := E_{X\sim\mu,\, A\sim\pi(\cdot|X),\, R\sim\Phi(\cdot|A)}[R].
The data available to an estimator is of the form D^n = \{(X_i, A_i, R_i)\}_{i=1,\dots,n}, where X_i\sim\mu, A_i\sim\pi_D(\cdot|X_i), R_i\sim\Phi(\cdot|A_i), and \pi_D is a data-collection policy different from \pi. The problem is to estimate v^\pi from the data D^n.

A natural solution is to use importance sampling to obtain an unbiased estimate of v^\pi, whose variance depends on n as well as on how “close” \pi and \pi_D are. While this approach has been successful empirically, it is not necessarily optimal, as shown by an asymptotic result [1]. More recently, it was shown that in the special case where \mathcal{X} is a singleton (i.e., the classic K-armed bandit), a less popular approach known as the regression estimator is near-optimal (and asymptotically optimal) in a minimax sense, while the importance-sampling estimator can be arbitrarily sub-optimal [2].
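
In the K-armed (contextless) case the two estimators in question are easy to write down; the following sketch, with names of our choosing, shows both:

import numpy as np

def importance_sampling(arms, rewards, pi, pi_d):
    """Unbiased IS estimate of v^pi from data logged under pi_d (K-armed case).

    arms, rewards: length-n arrays of logged actions A_i and rewards R_i;
    pi, pi_d: length-K probability vectors of the target and
    data-collection policies.
    """
    w = pi[arms] / pi_d[arms]        # importance weights pi(A_i) / pi_D(A_i)
    return np.mean(w * rewards)

def regression_estimator(arms, rewards, pi, K):
    """Plug-in estimate: v^pi ~= sum_a pi(a) * (mean observed reward of arm a)."""
    r_hat = np.zeros(K)
    for a in range(K):
        pulls = rewards[arms == a]
        if len(pulls) > 0:
            r_hat[a] = pulls.mean()  # arms never logged default to 0 (a design choice)
    return float(np.dot(pi, r_hat))

The result of [2] is that the second, plug-in estimator is near-minimax-optimal in this case, while the variance of the first blows up when \pi_D puts little mass on arms that \pi favors.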

The open question is to propose a (near-)optimal estimator for general contextual bandits.

[1] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

[2] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In AISTATS, 2015.

Active Learning for Maximum Likelihood Estimation with Model Mismatch (Kamalika Chaudhuri)

An active learner is given a model class, a large number of unlabeled samples, and the ability to interactively query labels of a subset of these samples. The goal of the learner is to produce a model in the model class that fits the data best while making as few interactive label queries as possible. An example is active regression; here, the model class is the set of linear models \{ w \in \mathbb{R}^d \}, and the corresponding labels are generated as follows:
y = w^{\top}x + \eta, \quad \eta \sim \mathcal{N}(0, \sigma^2)

When there is no model mismatch — that is, when the labels are indeed generated from a model in the input model class — the active learning problem is relatively easy, and has been addressed in a fairly general context in this paper:

http://arxiv.org/abs/1506.02348

What happens when the labels are not necessarily generated by a model in the input model class? This paper gives an approach for linear regression:

http://arxiv.org/abs/1410.5920

by partitioning the data domain into pieces and learning importance weights that are constant over each piece of the partition. The weights are adjusted so that the weighted MLE achieves an optimal upper bound on the risk.
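
As a rough illustration of that idea, the sketch below queries labels with probability proportional to a per-piece weight and fits an inverse-propensity-weighted least-squares (Gaussian-MLE) estimate. Here the partition and weights are assumed given, whereas the paper optimizes the weights, and all names are ours:

import numpy as np

def weighted_active_regression(X, query_label, weights, pieces, budget,
                               rng=np.random.default_rng(0)):
    """Sketch of importance-weighted MLE for active linear regression.

    pieces: integer array assigning each row of X to a piece of a fixed
    partition; weights: one sampling weight per piece. query_label(i)
    returns the label y_i of row i (this is the interactive query).
    """
    # each point is queried with probability proportional to its piece's weight
    probs = weights[pieces].astype(float)
    probs /= probs.sum()
    idx = rng.choice(len(X), size=budget, replace=True, p=probs)
    y = np.array([query_label(i) for i in idx])
    # inverse-propensity weights keep the weighted squared loss an unbiased
    # estimate of the uniform-average squared loss over the pool
    iw = 1.0 / (len(X) * probs[idx])
    Xw = X[idx] * np.sqrt(iw)[:, None]
    yw = y * np.sqrt(iw)
    w_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return w_hat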

Can we find a better and more efficient way for active learning for MLE with model mismatch?

Differentially Private Model Selection (Ben Recht)

Recent work in differential privacy has shown how to build provably reusable holdout sets and tamper-proof machine learning competitions:

http://arxiv.org/abs/1411.2664

http://arxiv.org/abs/1502.04585

These papers address the problem of preventing a malicious adversary from overfitting to held-out data. But a question remains: if I know that the machine learning competition I am entering uses one of these mechanisms, what is my optimal strategy as a machine learning practitioner for achieving the lowest possible error on the test set? Such a model would be guaranteed to generalize, and would have nearly optimal performance subject to generalizability. Concretely: what are the best optimization schemes for parameter tuning and model selection, given that one is faced with such a differentially private adversary?
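
For reference, the kind of mechanism such a competitor faces is the Thresholdout-style reusable holdout of the first paper; the sketch below (parameter values and names are illustrative) shows the narrow channel through which any tuning strategy must operate:

import numpy as np

def thresholdout(train_score, holdout_score, threshold=0.04, sigma=0.01,
                 rng=np.random.default_rng(0)):
    """One query to a Thresholdout-style reusable holdout (a sketch).

    If the training score already agrees with the holdout score up to a
    noisy threshold, the training score is returned and nothing new is
    revealed about the holdout set; otherwise a noised holdout score is
    returned. The full mechanism also refreshes the threshold noise and
    tracks an overfitting budget, omitted here.
    """
    gap = abs(train_score - holdout_score)
    if gap > threshold + rng.laplace(scale=2 * sigma):
        return holdout_score + rng.laplace(scale=sigma)
    return train_score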