# Author Archives: sbubeck

# Optimal regret in a bandit problem with combinatorial structure (Laurent Massoulie)

In http://dl.acm.org/citation.cfm?id=2745868 we considered the following bandit problem: an object with unknown type is to be disseminated to a subset of consumers; the benefit a consumer gets from being exposed is positive with a probability that depends on the object's type and on the consumer, and zero otherwise. The question is then to sequentially determine whom to expose, so that the set of users eventually exposed minimizes the sum of the spamming cost (for exposing users who do not benefit) and the missed opportunity cost (for failing to expose users who would have benefited). We identified lower bounds on regret in terms of the problem parameters in several scenarios of interest, and found that the so-called greedy Bayes strategy (a variant of Thompson sampling) matches these lower bounds in each of these scenarios.
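The greedy Bayes idea can be illustrated on a toy instance. The sketch below is only an illustration, not the model of the paper: the two-type space, the benefit-probability matrix `q`, and the per-exposure cost `c` are made-up placeholders. At each step the strategy exposes the unexposed consumer with the highest posterior expected benefit, and stops once that expected benefit drops below the spamming cost.

```python
import random

# Hypothetical instance: 2 possible object types, 6 consumers.
# q[t][i] = probability that consumer i benefits when the object has type t.
q = [[0.9, 0.8, 0.7, 0.2, 0.1, 0.1],
     [0.1, 0.2, 0.1, 0.8, 0.9, 0.7]]
c = 0.4                      # per-exposure spamming cost (made up)
true_type = 1                # unknown to the algorithm
posterior = [0.5, 0.5]       # uniform prior over the object's type

random.seed(0)
exposed, total_benefit = [], 0.0
while True:
    candidates = [i for i in range(6) if i not in exposed]
    if not candidates:
        break
    # Greedy Bayes: expose the consumer with the highest posterior
    # expected benefit; stop when it falls below the spamming cost.
    expected = lambda i: sum(posterior[t] * q[t][i] for t in range(2))
    best = max(candidates, key=expected)
    if expected(best) < c:
        break
    exposed.append(best)
    liked = random.random() < q[true_type][best]
    total_benefit += 1.0 if liked else 0.0
    # Bayes update of the type posterior from the observed feedback.
    lik = [q[t][best] if liked else 1 - q[t][best] for t in range(2)]
    z = sum(posterior[t] * lik[t] for t in range(2))
    posterior = [posterior[t] * lik[t] / z for t in range(2)]
```

Feedback from early exposures concentrates the posterior on the true type, after which the strategy exposes only the consumers likely to benefit under that type.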

This raises two questions:

(i) Beyond these specific scenarios, what function of the matrix of benefit probabilities (indexed by object type and consumer) determines the best possible regret in general?

(ii) Do generic methods such as Thompson sampling or greedy Bayes approach the best possible regret more generally than in the specific scenarios considered in http://dl.acm.org/citation.cfm?id=2745868 ?

# Clustering cheaply under local differential privacy (Laurent Massoulie)

The trade-off between privacy preservation and statistical inference takes interesting forms in the context of recommendation engines. In http://arxiv.org/abs/1207.3269 we considered the problem of inferring clusters of similar items from user inputs under an extreme form of differential privacy, namely local differential privacy. Local DP requires each user to provide only input distorted by noise, so that the underlying ratings of items by that user (as well as which items they rated) remain ambiguous. We assessed the performance of clustering schemes under these constraints by their query complexity: the number of users that must be probed before the cluster structure underlying the users' ratings can be reliably identified. Assuming users rate a small number of items chosen uniformly at random, the best query complexity we could achieve is impracticably large in the number of items, whereas the best lower bounds we know are far smaller. This raises the question of whether, in this setup, query complexity quasi-linear in the number of items suffices to cluster successfully, and if so, what scheme achieves it.
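The local DP constraint can be made concrete with randomized response, the canonical locally private release of a single bit. This is an illustrative sketch, not the scheme of the paper: each user flips their rating bit with a probability controlled by the privacy parameter, and the aggregator debiases the reported mean.

```python
import math
import random

def randomized_response(bit, epsilon):
    """Local DP release of one rating bit: report the true value with
    probability e^eps / (1 + e^eps), otherwise flip it. Each reported
    bit is epsilon-locally differentially private."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < keep else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased estimate of the true mean from randomized responses:
    P(report = 1) = (1 - keep) + p * (2*keep - 1), solved for p."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    raw = sum(reports) / len(reports)
    return (raw - (1.0 - keep)) / (2.0 * keep - 1.0)

random.seed(1)
true_bits = [1] * 700 + [0] * 300          # toy data: 70% liked the item
reports = [randomized_response(b, 1.0) for b in true_bits]
estimate = debiased_mean(reports, 1.0)     # close to 0.7 for large samples
```

The point of the open question is that aggregate statistics remain estimable from such noisy reports, but the number of users needed to recover a full cluster structure blows up.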

# Better Guarantees for Sparsest Cut Clustering (Nina Balcan)

The problem description can be found here.

# Minimax off-policy value estimation (Lihong Li)

In stochastic contextual bandits, one is often interested in estimating the average per-step reward of a given policy from data collected by a different policy. The problem is also known as off-policy learning in reinforcement learning and covariate shift in statistics, among others.

Precisely, a contextual $K$-armed bandit is specified by a distribution $D$ over a context set $X$ and a reward distribution $R(\cdot|x,a)$ for each context $x$ and arm $a \in \{1,\dots,K\}$. Given a stochastic policy $\pi(\cdot|x)$, its average reward is given by

$$v(\pi) = \mathbb{E}_{x \sim D,\, a \sim \pi(\cdot|x),\, r \sim R(\cdot|x,a)}[r].$$

The data available to an estimator is of the form $\{(x_i, a_i, r_i)\}_{i=1}^n$, where $x_i \sim D$, $a_i \sim \pi_0(\cdot|x_i)$, $r_i \sim R(\cdot|x_i,a_i)$, and $\pi_0$ is a data-collection policy different from $\pi$. The problem is to estimate $v(\pi)$ from this data.

A natural solution is to use importance sampling to get an unbiased estimate of $v(\pi)$, whose variance depends on $n$ as well as on how “close” $\pi$ and $\pi_0$ are. While this approach has been successful empirically, it is not necessarily optimal, as shown in an asymptotic result [1]. More recently, it was shown that for the special case where the context set is a singleton (i.e., the classic $K$-armed bandit), a less popular approach called the regression estimator is near-optimal (and asymptotically optimal) in a minimax sense, whereas the importance-sampling estimator can be arbitrarily sub-optimal [2].
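For the singleton-context case, the two estimators are easy to compare in code. A minimal sketch, with arbitrary toy values for the arm means, the two policies, and the sample size:

```python
import random

random.seed(2)

K = 3
mu = [0.2, 0.5, 0.8]       # true mean rewards (unknown to the estimators)
pi0 = [0.6, 0.3, 0.1]      # data-collection policy
pi = [0.1, 0.2, 0.7]       # target policy to evaluate
v_true = sum(pi[a] * mu[a] for a in range(K))

# Collect n samples under the behavior policy pi0.
n = 5000
data = []
for _ in range(n):
    a = random.choices(range(K), weights=pi0)[0]
    r = 1.0 if random.random() < mu[a] else 0.0
    data.append((a, r))

# Importance-sampling estimator: unbiased, but its variance blows up
# when pi(a)/pi0(a) is large for some arm.
v_is = sum(pi[a] / pi0[a] * r for a, r in data) / n

# Regression estimator: fit the mean reward of each arm, then average
# under the target policy.
mu_hat = [0.0] * K
for a in range(K):
    rs = [r for (ai, r) in data if ai == a]
    mu_hat[a] = sum(rs) / len(rs) if rs else 0.0
v_reg = sum(pi[a] * mu_hat[a] for a in range(K))
```

Both estimates land near the true value here; the gap appears when the behavior policy barely explores an arm that the target policy favors, which inflates the importance weights.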

The open question involves proposing a (near-)optimal estimator for general contextual bandits.

[1] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

[2] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In AISTATS, 2015.

# Differentially Private Model Selection (Ben Recht)

Recent work in differential privacy has shown how to build provably reusable holdout sets and tamper-proof machine learning competitions:

http://arxiv.org/abs/1411.2664

http://arxiv.org/abs/1502.04585

These papers look at the problem of preventing a malicious adversary from overfitting to data. But the question remains: if I know that the machine learning competition that I am entering is using one of these mechanisms, what is the optimal strategy for me as a machine learning practitioner to yield the lowest possible error on a test set? Such a model would be guaranteed to generalize, and would have nearly optimal performance subject to generalizability. Concretely: what are the best optimization schemes to use for parameter tuning and model selection, given that one is faced with such a differentially private adversary?
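For concreteness, here is a simplified sketch of the reusable-holdout mechanism (Thresholdout) from the first paper above. The threshold and noise scale are illustrative, and Gaussian noise stands in for the paper's Laplace noise to keep the sketch short.

```python
import random

random.seed(3)

def thresholdout(train, holdout, query, threshold=0.04, sigma=0.01):
    """Simplified Thresholdout: answer a statistical query with the
    training estimate unless it deviates noticeably from the holdout
    estimate; in that case return a noised holdout estimate. Answers
    therefore leak little about the holdout set when train and holdout
    agree."""
    t = sum(query(x) for x in train) / len(train)
    h = sum(query(x) for x in holdout) / len(holdout)
    if abs(t - h) > threshold + random.gauss(0, sigma):
        return h + random.gauss(0, sigma)
    return t

# Toy use: estimate the fraction of points above 0.5.
train = [random.random() for _ in range(1000)]
holdout = [random.random() for _ in range(1000)]
answer = thresholdout(train, holdout, lambda x: float(x > 0.5))
```

The open question above is the competitor's side of this game: given that answers are filtered through such a mechanism, which tuning and model-selection strategies extract the most signal per query?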

# Parameter Tuning (Ben Recht)

What is the “right” way to think about parameter tuning in machine learning? Contemporary machine learning models may have thousands of parameters that must be hand-tuned by the engineer. Current techniques for automating this tuning come in two flavors:

(a) use derivative-free optimization to set the parameters. This reduces the problem to black-box optimization, which in the worst case requires time exponential in the number of parameters. Current “best practice” using Gaussian processes is often not much better than pure random search.

(b) combine multiple models using statistical aggregation techniques. This reduces the problem to model selection. The resulting models are often large, unwieldy, and uninterpretable (like random forests).

Both methods ignore the structure of machine learning design, where pipelines are built stage-wise in a DAG, and ignore concerns about the stability of the end-to-end pipeline. Are we just thinking about this problem incorrectly? What other structures and modeling can we take advantage of when optimizing machine learning models end-to-end?
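The baseline mentioned under flavor (a), pure random search, fits in a few lines. In this sketch, `validation_loss` is a made-up surrogate standing in for an actual train-and-evaluate run, and the search box and budget are arbitrary:

```python
import math
import random

random.seed(4)

def validation_loss(params):
    # Stand-in for "train a model, measure held-out loss"; this toy
    # surrogate is minimized at lr = 0.1, reg = 0.01.
    return ((math.log10(params["lr"]) + 1.0) ** 2
            + (math.log10(params["reg"]) + 2.0) ** 2)

# Pure random search over a log-scaled box: the baseline that
# Gaussian-process tuners are commonly compared against.
best, best_loss = None, float("inf")
for _ in range(200):
    params = {"lr": 10 ** random.uniform(-4.0, 0.0),
              "reg": 10 ** random.uniform(-5.0, -1.0)}
    loss = validation_loss(params)
    if loss < best_loss:
        best, best_loss = params, loss
```

Note what this baseline ignores: every evaluation is independent, so nothing about the stage-wise DAG structure of a real pipeline, or the stability of its end-to-end behavior, informs the search.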