In http://dl.acm.org/citation.cfm?id=2745868 we considered the following
bandit problem: an object of unknown type is to be disseminated to a
subset of consumers; the benefit a consumer derives from being exposed to
the object is 1 with some probability, determined by the object's type
and the consumer, and 0 otherwise. The question is then to sequentially
determine whom to expose, so that the set of users eventually exposed
minimizes the sum of the spamming cost (users exposed who derive no
benefit) and the missed-opportunity cost (users not exposed who would
have benefited). We identified lower bounds on regret, in terms of the
model parameters, in several scenarios of interest, and found that the
so-called Greedy Bayes strategy (a variant of Thompson sampling) matches
these lower bounds in each of these scenarios.
This begs two questions:
(i) Beyond these specific scenarios, what function of the matrix of
exposure-benefit probabilities determines the best achievable regret in
general?
(ii) Do generic methods such as Thompson sampling or Greedy Bayes approach
the best possible regret more generally than in the specific scenarios
considered in http://dl.acm.org/citation.cfm?id=2745868 ?
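As a concrete baseline for question (ii), here is a minimal sketch of a plain Thompson-sampling exposure rule, assuming Bernoulli user feedback with Beta posteriors and a fixed per-exposure spamming cost; the threshold rule and names are illustrative, not the Greedy Bayes strategy of the paper.

```python
import random

def thompson_expose(priors, spam_cost, rng=random.Random(0)):
    """One Thompson-sampling round: sample each user's acceptance
    probability from its Beta posterior and expose exactly those users
    whose sampled probability beats the spamming-cost threshold.

    priors: dict mapping user -> [alpha, beta] posterior parameters.
    Returns the set of users selected for exposure."""
    exposed = set()
    for user, (a, b) in priors.items():
        p = rng.betavariate(a, b)   # posterior sample of the acceptance prob.
        if p > spam_cost:           # expected benefit outweighs spamming cost
            exposed.add(user)
    return exposed

def update(priors, user, liked):
    """Conjugate Beta update after observing binary feedback (0/1)."""
    a, b = priors[user]
    priors[user] = [a + liked, b + (1 - liked)]
```

Running such a rule repeatedly, with `update` applied to the observed feedback of each exposed user, gives the kind of generic strategy whose general-scenario optimality question (ii) asks about.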
The trade-off between privacy preservation and statistical inference takes
interesting forms in the context of recommendation engines. In
http://arxiv.org/abs/1207.3269 we considered the problem of inferring
clusters of similar items from inputs of users under an extreme form of
differential privacy, namely local differential privacy. Local DP mandates
each user to only provide input that is distorted by noise so that the
underlying ratings of items by that user (as well as which items it rated)
remain ambiguous. We assessed performance of clustering schemes under
these constraints by query complexity. This counts how many users are
probed before a cluster structure underlying users’s ratings can be
reliably identified. Assuming users rate a small number of items chosen
uniformly at random, the best query complexity we could achieve is an
impracticably large where is the number of items.
However the best lower bounds we know are in .
This begs the question of determining whether in this setup quasi-linear
query complexity suffices to successfully cluster, and if so, what scheme
achieves such complexity?
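The local-randomization step itself is standard; below is a minimal sketch assuming binary ratings perturbed by per-bit randomized response. The epsilon value and the mean-estimation use are illustrative only, not the clustering scheme studied in the paper.

```python
import math
import random

def randomized_response(bits, epsilon, rng=random.Random(0)):
    """Locally private release of one user's binary rating vector:
    each bit is reported truthfully with probability
    e^eps / (1 + e^eps) and flipped otherwise. The guarantee stated
    here is per bit; releasing all bits costs more under composition."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [b if rng.random() < keep else 1 - b for b in bits]

def debias_mean(reports, epsilon):
    """Unbiased estimate of the per-item mean rating from the noisy
    reports, inverting the known flip probability."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    n, k = len(reports), len(reports[0])
    return [(sum(r[j] for r in reports) / n - (1 - keep)) / (2 * keep - 1)
            for j in range(k)]
```

The query-complexity question is then how many such noisy reports a clustering scheme must collect before the item clusters become identifiable.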
In stochastic contextual bandits, one is often interested in estimating the average per-step reward of a given policy from data collected by a different policy. The problem is also known as off-policy evaluation in reinforcement learning, and is closely related to covariate shift in statistics.
Precisely, a contextual $K$-armed bandit is specified by a distribution $\nu$ over a context set $\mathcal{X}$ and a reward distribution $R(\cdot \mid x, a)$ for every context $x \in \mathcal{X}$ and arm $a \in \{1, \ldots, K\}$. Given a stochastic policy $\pi$ mapping contexts to distributions over arms, its average reward is given by $v^{\pi} = \mathbb{E}_{x \sim \nu,\, a \sim \pi(\cdot \mid x),\, r \sim R(\cdot \mid x, a)}[r]$.
The data available to an estimator is of the form $D = \{(x_i, a_i, r_i)\}_{i=1}^{n}$, where $x_i \sim \nu$, $a_i \sim \pi_D(\cdot \mid x_i)$, $r_i \sim R(\cdot \mid x_i, a_i)$, and $\pi_D$ is a data-collection policy different from $\pi$. The problem is to estimate $v^{\pi}$ from the data $D$.
A natural solution is to use importance sampling to obtain an unbiased estimate of $v^{\pi}$, whose variance depends on $n$ as well as on how "close" $\pi$ and $\pi_D$ are. While this approach has been successful empirically, it is not necessarily optimal, as shown by an asymptotic analysis [1]. More recently, it was shown that in the special case where $\mathcal{X}$ is a singleton (i.e., the classic $K$-armed bandit), a less popular approach, the regression estimator, is near-optimal (and asymptotically optimal) in a minimax sense, whereas the importance-sampling estimator can be arbitrarily sub-optimal [2].
The open question is to propose a (near-)optimal estimator for general contextual bandits.
[1] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
[2] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In AISTATS, 2015.
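The two estimators under comparison are easy to state; here is a minimal sketch assuming finite context and action sets, with the regression step done by empirical averages (function and variable names are illustrative):

```python
def importance_sampling(data, pi, mu):
    """IPS estimate of the target policy's average reward from logs
    collected under the behavior policy.
    data: list of (context, action, reward) triples.
    pi, mu: functions giving the target / behavior action probabilities."""
    return sum(r * pi(x, a) / mu(x, a) for x, a, r in data) / len(data)

def regression_estimate(data, pi, actions):
    """Regression (direct) estimator: fit each (context, action) mean
    reward by empirical averages, then average the fitted rewards of
    the target policy over the empirical context distribution."""
    sums, counts = {}, {}
    for x, a, r in data:
        sums[x, a] = sums.get((x, a), 0.0) + r
        counts[x, a] = counts.get((x, a), 0) + 1
    rhat = {key: sums[key] / counts[key] for key in sums}
    values = [sum(pi(x, a) * rhat.get((x, a), 0.0) for a in actions)
              for x, _, _ in data]
    return sum(values) / len(values)
```

The open question asks for an estimator that dominates both of these, with minimax guarantees, beyond the singleton-context case.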
Recent work in differential privacy has shown how to build provably reusable holdout sets and tamper-proof machine learning competitions.
These papers address the problem of preventing a malicious adversary from overfitting to data. But the question remains: if I know that the machine learning competition I am entering uses one of these mechanisms, what is my optimal strategy, as a machine learning practitioner, for achieving the lowest possible error on the test set? Such a model would be guaranteed to generalize, and would have nearly optimal performance subject to generalizability. Concretely: what are the best optimization schemes to use for parameter tuning and model selection, given that one faces such a differentially private adversary?
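For concreteness, here is a minimal sketch of the kind of mechanism such a practitioner would face, a Thresholdout-style reusable holdout; the threshold and noise scale below are illustrative choices, not the published parameter settings.

```python
import math
import random

class Thresholdout:
    """Sketch of a Thresholdout-style reusable holdout: a query is
    answered from the training set unless its training and holdout
    values disagree by more than a noisy threshold, in which case a
    noised holdout value is returned instead."""

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01,
                 rng=random.Random(0)):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma, self.rng = threshold, sigma, rng

    def _laplace(self, scale):
        # Standard inverse-CDF sampling of a centered Laplace variate.
        u = self.rng.random() - 0.5
        return -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))

    def query(self, phi):
        """phi maps a data point to [0, 1]; returns a noise-protected
        estimate of its mean on the holdout data."""
        t_val = sum(map(phi, self.train)) / len(self.train)
        h_val = sum(map(phi, self.holdout)) / len(self.holdout)
        if abs(t_val - h_val) > self.threshold + self._laplace(self.sigma):
            return h_val + self._laplace(self.sigma)
        return t_val
```

The open question is then what the competitor's best response to such a `query` interface is when tuning parameters and selecting models.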
What is the “right” way to think about parameter tuning in machine learning? Contemporary machine learning models may have thousands of parameters that have to be hand-tuned by the engineer. Current techniques for automating this tuning come in two flavors:
(a) use derivative-free optimization to set the parameters. This reduces the problem to black-box optimization, which in the worst case requires time exponential in the number of parameters. Current “best practice” using Gaussian processes is often not much better than pure random search.
(b) combine multiple models using statistical aggregation techniques. This reduces the problem to model selection. The resulting models are often large, unwieldy, and uninterpretable (like random forests).
Both methods ignore the structure of machine learning design, where pipelines are built stage-wise as a DAG, and ignore concerns about the stability of the end-to-end pipeline. Are we just thinking about this problem incorrectly? What other structures and modeling can we take advantage of when optimizing machine learning models end-to-end?
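The pure-random-search baseline mentioned in (a) can be sketched in a few lines, assuming a finite candidate list per hyperparameter (the objective and search space in the test below are toy illustrations):

```python
import random

def random_search(objective, space, n_trials=50, rng=random.Random(0)):
    """Pure random search: sample each hyperparameter independently
    from its candidate list and keep the best configuration found.
    space: dict mapping parameter name -> list of candidate values."""
    best_cfg, best_val = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val
```

Note that this baseline treats the pipeline as an opaque objective: it exploits none of the DAG structure or stability concerns raised above, which is precisely the gap the question points at.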