In http://dl.acm.org/citation.cfm?id=2745868 we considered the following
bandit problem: an object of unknown type is to be disseminated to a
subset of consumers; the benefit a consumer derives from exposure is
positive with a probability that depends on the object's type and on the
consumer, and negative otherwise. The question is then to sequentially
determine whom to expose, so that the set of users eventually exposed
minimizes the sum of the spamming cost (incurred by exposing users who do
not benefit) and the missed-opportunity cost (incurred by failing to
expose users who would have benefited). We identified lower bounds on
regret in terms of the model parameters in several scenarios of interest,
and found that the so-called greedy Bayes strategy (a variant of Thompson
sampling) matches these lower bounds in each of these scenarios.
This begs two questions:
(i) Beyond these specific scenarios, what function of the matrix of
exposure-benefit probabilities conditions the best possible regret in
general?
(ii) Do generic methods such as Thompson sampling or greedy Bayes approach
the best possible regret more generally than in the specific scenarios
considered in http://dl.acm.org/citation.cfm?id=2745868?
The trade-off between privacy preservation and statistical inference takes
interesting forms in the context of recommendation engines. In
http://arxiv.org/abs/1207.3269 we considered the problem of inferring
clusters of similar items from inputs of users under an extreme form of
differential privacy, namely local differential privacy. Local DP mandates
each user to only provide input that is distorted by noise so that the
underlying ratings of items by that user (as well as which items it rated)
remain ambiguous. We assessed the performance of clustering schemes under
these constraints by their query complexity, i.e., the number of users
that must be probed before the cluster structure underlying the users'
ratings can be reliably identified. Assuming users rate a small number of
items chosen uniformly at random, the best query complexity we could
achieve grows impractically fast with the number of items, whereas the
best lower bounds we know of are far smaller.
This begs the question of whether, in this setup, quasi-linear query
complexity (in the number of items) suffices to successfully cluster, and
if so, what scheme achieves such complexity.
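A minimal sketch of the local-DP input model, assuming binary ratings and per-bit randomized response; the parameters and the debiasing step are illustrative and not the scheme from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch: each user's input is a binary vector over N items
# (1 = liked).  Under local differential privacy, the user releases each
# bit through randomized response: keep it with probability
# e^eps / (1 + e^eps), flip it otherwise.  Per bit this is eps-locally
# differentially private; privatizing the whole vector costs more.
N, n_users, eps = 8, 2000, 1.0
keep = np.exp(eps) / (1.0 + np.exp(eps))

true_means = rng.uniform(0.2, 0.8, size=N)
ratings = (rng.random((n_users, N)) < true_means).astype(float)

flips = rng.random((n_users, N)) >= keep
reports = np.where(flips, 1.0 - ratings, ratings)

# The server debiases: E[report] = keep * m + (1 - keep) * (1 - m),
# so m = (mean(report) - (1 - keep)) / (2 * keep - 1).
est = (reports.mean(axis=0) - (1.0 - keep)) / (2.0 * keep - 1.0)
print(np.round(np.abs(est - true_means).max(), 3))
```

The query-complexity question is how many such noisy reports a clustering scheme must collect before the structure behind `true_means` (here, a single population; in the problem, a mixture of clusters) becomes identifiable.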
The problem description can be found here.
In stochastic contextual bandits, one is often interested in estimating the average per-step reward of a given policy from data collected by a different policy. The problem is also known as off-policy learning in reinforcement learning and covariate shift in statistics, among other names.
Precisely, a contextual $K$-armed bandit is specified by a distribution $D$ over a context set $\mathcal{X}$ and a reward distribution $R(\cdot \mid x, a)$. Given a stochastic policy $\pi$, its average reward is given by
$$ v^{\pi} = \mathbb{E}_{x \sim D,\; a \sim \pi(\cdot \mid x),\; r \sim R(\cdot \mid x, a)}[r]. $$
The data available to an estimator is of the form $\{(x_i, a_i, r_i)\}_{i=1}^{n}$, where $x_i \sim D$, $a_i \sim \mu(\cdot \mid x_i)$, $r_i \sim R(\cdot \mid x_i, a_i)$, and $\mu$ is a data-collection policy different from $\pi$. The problem is to estimate $v^{\pi}$ from these data.
A natural solution is to use importance sampling to get an unbiased estimate of $v^{\pi}$, whose variance depends on $n$ as well as on how "close" $\pi$ and $\mu$ are. While this approach has been successful empirically, it is not necessarily optimal, as shown in an asymptotic result [1]. More recently, it was shown that for the special case where the context set is a singleton (i.e., the classic $K$-armed bandit), a less popular approach called the regression estimator is near-optimal (and asymptotically optimal) in a minimax sense, whereas the importance-sampling estimator can be arbitrarily sub-optimal [2].
The open question is to propose a (near-)optimal estimator for general contextual bandits.
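The two estimators can be sketched as follows, under illustrative assumptions: a finite context set, a known logging policy `mu`, Bernoulli rewards, and per-cell averaging as a simple instance of the regression estimator:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: finite context set of size X, K arms, target policy
# pi and logging policy mu given as X-by-K matrices of action
# probabilities, and Bernoulli rewards with means r_mean[x, a].
X, K, n = 4, 3, 5000
pi = rng.dirichlet(np.ones(K), size=X)
mu = rng.dirichlet(np.ones(K), size=X)
mu = 0.5 * mu + 0.5 / K              # keep mu bounded away from zero
r_mean = rng.uniform(0, 1, size=(X, K))
ctx_dist = rng.dirichlet(np.ones(X))

# Log n interactions under the data-collection policy mu.
xs = rng.choice(X, size=n, p=ctx_dist)
acts = np.array([rng.choice(K, p=mu[x]) for x in xs])
rews = (rng.random(n) < r_mean[xs, acts]).astype(float)

true_v = ctx_dist @ (pi * r_mean).sum(axis=1)

# Importance sampling: reweight each logged reward by pi/mu.
v_is = np.mean(rews * pi[xs, acts] / mu[xs, acts])

# Regression (direct) estimator: fit r_hat(x, a) by per-cell averages,
# then plug into the definition of the value, using the empirical
# context distribution.
r_hat = np.zeros((X, K))
for x in range(X):
    for a in range(K):
        sel = (xs == x) & (acts == a)
        if sel.any():
            r_hat[x, a] = rews[sel].mean()
v_reg = np.mean((pi[xs] * r_hat[xs]).sum(axis=1))

print(round(true_v, 3), round(v_is, 3), round(v_reg, 3))
```

Both estimates converge to the true value here; the open question concerns which of them (or which combination) is minimax optimal when contexts are genuinely high-dimensional.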
[1] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
[2] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In AISTATS, 2015.
An active learner is given a model class, a large number of unlabeled samples, and the ability to interactively query labels for a subset of these samples. The goal of the learner is to output the model in the class that fits the data best while making as few label queries as possible. An example is active regression: here the model class is the set of linear models $\{x \mapsto \theta^{\top} x : \theta \in \mathbb{R}^d\}$, and the labels are generated as $y = \theta^{\top} x + \eta$ for some unknown $\theta$ and zero-mean noise $\eta$.
When there is no model mismatch — that is, when the labels are indeed generated from a model in the input model class — the active learning problem is relatively easy and has been addressed in a fairly general context in this paper.
What happens when the labels are not necessarily generated by a model in the input model class? For linear regression, this paper gives an approach based on a piecewise partitioning of the data domain, learning a set of importance weights that are constant over each piece of the partition. The importance weights are adjusted so that the weighted MLE estimate has an optimal (upper bound on) risk.
Can we find a better and more efficient approach to active learning for maximum-likelihood estimation under model mismatch?
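The importance-weighting idea can be sketched as follows. The partition, per-piece query probabilities, and misspecified label function below are all illustrative assumptions, not the construction from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch of importance-weighted active regression: partition
# the domain into pieces, query each label with a piece-dependent
# probability, and fit by inverse-propensity-weighted least squares.
# The labels are NOT linear in x (model mismatch), yet the weighted fit
# still targets the best linear approximation under the data distribution.
n = 4000
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(n)   # misspecified model

pieces = np.digitize(x, [-0.5, 0.0, 0.5])          # 4 pieces
q = np.array([0.8, 0.2, 0.2, 0.8])                 # per-piece query prob.
queried = rng.random(n) < q[pieces]
w = 1.0 / q[pieces]                                # importance weights

X = np.column_stack([np.ones(n), x])
sel = queried
# Weighted least squares on the queried points only.
theta = np.linalg.lstsq(
    X[sel] * np.sqrt(w[sel])[:, None],
    y[sel] * np.sqrt(w[sel]),
    rcond=None,
)[0]

# Compare with the (unobservable) full-data least-squares fit.
theta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(theta, 3), np.round(theta_full, 3))
```

The open question is whether the query probabilities (here fixed by hand) can be chosen adaptively and efficiently so that the weighted fit attains optimal risk under mismatch.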
Recent work in differential privacy has shown how to build provably reusable holdout sets and tamper-proof machine learning competitions.
These papers address the problem of preventing a malicious adversary from overfitting to data. But a question remains: if I know that the machine learning competition I am entering uses one of these mechanisms, what is my optimal strategy, as a machine learning practitioner, for achieving the lowest possible error on the test set? Such a model would be guaranteed to generalize, and would have nearly optimal performance subject to generalizability. Concretely: what are the best optimization schemes for parameter tuning and model selection when one faces such a differentially private adversary?
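For concreteness, the kind of mechanism the practitioner faces can be sketched as a Thresholdout-style reusable holdout (in the spirit of Dwork et al.); the threshold and noise scales below are illustrative, not the published parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

# Minimal sketch of a Thresholdout-style reusable holdout: an analyst's
# query phi is answered from the training set unless it disagrees with
# the holdout set by more than a noisy threshold, in which case a noised
# holdout value is returned instead.
def thresholdout(phi, train, holdout, threshold=0.04, sigma=0.01):
    t_val, h_val = phi(train).mean(), phi(holdout).mean()
    if abs(t_val - h_val) > threshold + rng.laplace(0, sigma):
        return h_val + rng.laplace(0, sigma)
    return t_val

train = rng.standard_normal(5000)
holdout = rng.standard_normal(5000)

# A non-overfitting query (a bounded statistic) is answered from train.
ans = thresholdout(lambda d: (d > 0).astype(float), train, holdout)
print(round(ans, 3))
```

The open question is which querying strategy an analyst should adopt against such a mechanism: every answer is noisy or clipped, so the usual greedy tuning loop is no longer obviously optimal.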