Parameter Tuning (Ben Recht)

What is the “right” way to think about parameter tuning in machine learning?  Contemporary machine learning models may have thousands of parameters that have to be hand-tuned by the engineer.  Current techniques for automating this tuning come in two flavors:

(a) use derivative-free optimization to set the parameters.  This reduces the problem to black-box optimization, which in the worst case requires time exponential in the number of parameters.  Current “best practice” using Gaussian processes is often not much better than pure random search (a minimal random-search baseline is sketched at the end of this section).

(b) combine multiple models using statistical aggregation techniques.  This reduces the problem to model selection.  The resulting models are often large, unwieldy, and uninterpretable (like random forests).

Both methods ignore the structure of machine learning design, where pipelines are built stage-wise as a DAG, and both ignore concerns about the stability of the end-to-end pipeline. Are we just thinking about this problem incorrectly? What other structure and modeling can we take advantage of when optimizing machine learning models end-to-end?
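For reference, here is what the pure random search baseline from (a) looks like; a minimal sketch assuming scikit-learn, with a placeholder dataset X, y and an arbitrarily chosen search space and budget:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def random_search(X, y, n_trials=50, seed=0):
    """Pure random search over two SVM hyperparameters.
    Each trial samples a configuration log-uniformly and scores it
    by 5-fold cross-validation; the best configuration wins."""
    rng = np.random.default_rng(seed)
    best_score, best_params = -np.inf, None
    for _ in range(n_trials):
        params = {"C": 10 ** rng.uniform(-3, 3),
                  "gamma": 10 ** rng.uniform(-4, 1)}
        score = cross_val_score(SVC(**params), X, y, cv=5).mean()
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

Note that each trial treats the learner as a black box: nothing about the pipeline's internal structure is exploited, which is exactly the complaint above.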

Modular Machine Learning (Ben Recht)

In machine learning theory, we study problems with very simple descriptions.  We analyze how to fit a mixture of Gaussians, train an SVM, or compute principal components.  But in the wild, machine learning is a much more involved process.  A data scientist (heheh) is never just presented with data and told to fit an SVM.  One must first decide how to represent the data as a vector or discrete code, then how to normalize features, then how to prune features, and only then, maybe, can one begin fitting an SVM.
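To make the stage-wise structure concrete, here is a minimal sketch of such a pipeline in scikit-learn; the particular stages and hyperparameter values are illustrative choices, not a recommendation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_pipeline():
    """Stage-wise pipeline: represent and normalize, prune, then fit.
    Each stage has its own hyperparameters, and the stages compose into
    a chain (a special case of the DAGs discussed above)."""
    return Pipeline([
        ("represent", TfidfVectorizer()),     # raw text -> normalized vectors
        ("prune", SelectKBest(chi2, k=500)),  # keep 500 features
        ("fit", LinearSVC(C=1.0)),            # only now fit the SVM
    ])

# Usage (placeholder data): build_pipeline().fit(train_texts, train_labels)
```

The open question is what can be said about the risk of the composed object, not of any single stage.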

What can we say about the theoretical properties of such “end-to-end” machine learning?  What do the risk bounds look like?  Is this simply a matter of studying the “stability” properties of the end-to-end pipeline?  If so, how does one evaluate the stability of a composition of components, each of which has its own well-understood stability properties?  Can we derive generalization bounds for general DAGs of heterogeneous machine learning components?

The Online Light Bulb Problem (Ohad Shamir)

Setting

A learner is given access to an infinite stream of i.i.d. vectors x_t\in \{-1,+1\}^d. The distribution is such that \mathbb{E}[x_t]=0 and \mathbb{E}[x_t x_t^\top]=I+\rho(e_i e_j^\top + e_j e_i^\top) (in words, all coordinates are zero-mean and uncorrelated, except for a single pair of coordinates (i,j), i\neq j, which have a positive correlation of \rho).
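For concreteness, one way to generate such a stream (the function name and interface are my own): draw each vector uniformly, then resample coordinate j as a \rho-correlated copy of coordinate i.

```python
import numpy as np

def sample_stream(d, rho, i, j, T, seed=0):
    """T i.i.d. samples from a distribution on {-1,+1}^d with E[x] = 0 and
    E[x x^T] = I + rho * (e_i e_j^T + e_j e_i^T)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(T, d))
    # Set x_j = x_i with probability (1 + rho)/2 and x_j = -x_i otherwise,
    # so that E[x_i x_j] = rho while all other pairs stay uncorrelated.
    agree = rng.random(T) < (1 + rho) / 2
    X[:, j] = np.where(agree, X[:, i], -X[:, i])
    return X
```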

Goal

What is the minimal sample size (as a function of d,\rho) required to detect i,j with constant probability, using an algorithm with \tilde{O}(d) bits of memory? The \tilde{O} notation can hide arbitrary factors polylogarithmic in the parameters of the problem (so that finite-precision issues are not a problem).

Existing Results

This problem is closely related to the light bulb problem, introduced by [1], which asked a similar question in an offline setting, where a sample of such vectors is given and the goal is to detect a correlation in time sub-quadratic in d. In contrast, we consider an online/streaming setting, and constrain ourselves in terms of memory.

Without memory constraints, it is possible to detect the correlation with O(\log(d)/\rho^2) samples (which is information-theoretically optimal), by maintaining the empirical average of x_{t,i}x_{t,j} for all pairs i,j. However, this requires \tilde{O}(d^2) memory, which can be impractical for high-dimensional data. With \tilde{O}(d) memory, one simple algorithm is to maintain empirical averages in batches of O(d) coordinate pairs at a time, moving to the next batch once the current batch has been ruled out, and stopping when the correlated pair is found (a sketch appears below). However, the sample size for such an algorithm is much larger, O(d\log(d)/\rho^2), and it is not clear whether this is the best possible.
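A minimal sketch of this batch-scanning baseline, with loosely chosen Hoeffding-style constants and thresholds that are my own:

```python
import itertools
import numpy as np

def detect_pair_batched(stream, d, rho, delta=0.01):
    """Scan the ~d^2/2 coordinate pairs in batches of d, so only O(d)
    running sums live in memory at any time.  Each batch is watched for
    ~log(d)/rho^2 samples, giving O(d log(d)/rho^2) samples in total.
    `stream` is an iterator over vectors in {-1,+1}^d."""
    n = int(np.ceil(16 * np.log(2 * d / delta) / rho ** 2))  # per-batch budget
    pairs = itertools.combinations(range(d), 2)
    while True:
        batch = list(itertools.islice(pairs, d))  # next d candidate pairs
        if not batch:
            return None  # exhausted all pairs without a detection
        sums = np.zeros(len(batch))
        for _ in range(n):
            x = next(stream)
            sums += np.array([x[i] * x[j] for (i, j) in batch])
        best = int(np.argmax(sums))
        if sums[best] / n > rho / 2:  # empirical correlation clearly above noise
            return batch[best]
```

With the sampler above (and a long enough sample), one could call detect_pair_batched(iter(sample_stream(d, rho, i, j, T)), d, rho).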

[2] essentially showed that the trivial algorithm above can be optimal, but for a certain carefully constructed artificial distribution, quite different from the one considered here. For the offline light bulb problem, the best currently known algorithm appears in [3] and is based on a clever random projection scheme. However, in an online \tilde{O}(d)-memory setting, random projections do not seem to yield a sample complexity much better than that of the trivial algorithm described above.

Bibliography

[1] Leslie G Valiant. Functionality in neural nets. In AAAI, pages 629–634, 1988.

[2] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In NIPS, 2014.

[3] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and juntas. In FOCS, 2012.

Adversarial Multi-Player Bandits (Nicolò Cesa-Bianchi and Ohad Shamir)

Setting

Consider the following “multi-player” version of adversarial multi-armed bandits, on three arms: an adversary chooses a sequence of T reward vectors x_t\in [0,1]^3. Then, two randomized decision makers iteratively choose i_{t},j_{t}\in \{1,2,3\}, and receive a reward g_t(i_t,j_t) equal to x_{t,i_t}+x_{t,j_t} if i_t\neq j_t, and 0 if i_t=j_t. This models situations where two users utilize the same set of resources and get zero reward when they collide. Crucially, the decision makers cannot communicate.
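To fix ideas, here is the reward structure together with a naive baseline in which each player runs an independent copy of Exp3 on the common reward. Nothing here is claimed to achieve sublinear regret (whether any non-communicating scheme does is exactly the open question), and the feedback model, parameter choices, and names are my own:

```python
import numpy as np

class Exp3:
    """Exponentially weighted forecaster with importance-weighted
    reward estimates; rewards here lie in [0, 2]."""
    def __init__(self, K, eta, rng):
        self.logw = np.zeros(K)
        self.eta, self.rng, self.K = eta, rng, K

    def act(self):
        w = np.exp(self.logw - self.logw.max())
        self.p = w / w.sum()
        return int(self.rng.choice(self.K, p=self.p))

    def update(self, arm, reward):
        self.logw[arm] += self.eta * reward / self.p[arm]

def play(x, seed=0):
    """Two non-communicating players, each running its own Exp3 on the
    common reward g_t (zero on collision).  x is the T x 3 reward matrix."""
    rng = np.random.default_rng(seed)
    T = len(x)
    eta = np.sqrt(np.log(3) / (3 * T))
    p1, p2 = Exp3(3, eta, rng), Exp3(3, eta, rng)
    total = 0.0
    for t in range(T):
        i, j = p1.act(), p2.act()
        g = 0.0 if i == j else x[t, i] + x[t, j]
        p1.update(i, g)
        p2.update(j, g)
        total += g
    return total
```

The difficulty is visible in the update: each player's reward estimate is confounded by the other player's (unobserved) choice, so the standard single-player analysis breaks down.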

Goal

Design an algorithm for the two decision makers which attains sublinear regret: namely, that regardless of the adversary’s strategy, \max_{i,j}\sum_{t=1}^{T}g_t(i,j)-\sum_{t=1}^{T}g_t(i_t,j_t)= o(T).

Existing Results

Multi-player bandits have been studied in the context of cognitive radio networks, where users need to transmit on channels of varying quality and may interfere with each other (e.g. [1,2,3]). However, all results we are aware of are for a stochastic setting, where the rewards are sampled i.i.d., and most of them require a (possibly elaborate) communication scheme between the players, which is not always realistic.

Bibliography

[1] Animashree Anandkumar, Nithin Michael, Ao Kevin Tang, and Ananthram Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.

[2] Orly Avner and Shie Mannor. Concurrent bandits and cognitive radio networks. In Machine Learning and Knowledge Discovery in Databases, pages 66–81. Springer, 2014.

[3] Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.

Concentration of a Lipschitz Function of Negatively Associated Random Variables (Yuval Peres)

Let f : \{0,1\}^n \to \mathbb{R} be a Lipschitz function. If the binary variables \{X_1, \ldots, X_n\} are independent, then it is well known that f(X)=f(X_1, \ldots, X_n) satisfies a Gaussian concentration inequality (*) \mathbb{P}\left( f(X) - \mathbb{E} f(X) \ge a \right) \leq \exp(- \frac{a^2}{c n} ) for a suitable constant c>0.
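Here “Lipschitz” is naturally read as 1-Lipschitz with respect to the Hamming distance, i.e. |f(x) - f(y)| \le \#\{k : x_k \neq y_k\} for all x, y \in \{0,1\}^n. In the independent case, (*) is then McDiarmid’s bounded-differences inequality, \mathbb{P}\left( f(X) - \mathbb{E} f(X) \ge a \right) \le \exp(-\frac{2a^2}{n}), which is (*) with c = 1/2.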

Question

Does such an inequality also hold if the binary variables \{X_1, \ldots, X_n\} are negatively associated (i.e., increasing functions of disjoint subsets of these variables are negatively correlated) rather than independent?

Background

This is classical when f is an increasing linear function of X, and therefore also holds for any linear Lipschitz function f (with a different constant c).

In 2009, Elchanan Mossel (personal communication) asked whether (*) holds for negatively associated X_i and nonlinear Lipschitz functions. This is still open. A central motivation was the ensemble of uniform spanning trees of a graph G, which was proved in [1] to have negative association (the relevant binary variables are indexed by the edges of G and indicate inclusion in the uniform spanning tree).

In [2] the inequality (*) is proved under a more restrictive negative dependence hypothesis known as the strong Rayleigh condition,  defined in [3]. The class of strong Rayleigh measures includes determinantal measures (which include the uniform spanning tree) and also weighted uniform matroids and spanning tree measures. However, it is hard to verify directly if a measure is strong Rayleigh, and not all negatively associated measures on the hypercube satisfy this condition.

Bibliography

[1] T. Feder and M. Mihail. Balanced matroids. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), pages 26–38, 1992.

[2] R. Pemantle and Y. Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23:140–160, 2014.

[3] J. Borcea, P. Brändén, and T. Liggett. Negative dependence and the geometry of polynomials. Journal of the AMS, 22:521–567, 2009.

Grothendieck inequality and hyperbolicity (Sebastien Bubeck)

Let me warn you that this open problem is more open-ended than the one on Thompson Sampling!

Grothendieck’s inequality and its proof by Gaussian projection give a universal rounding scheme for semidefinite programs. The question is whether there exists a generalization of Grothendieck’s inequality to hyperbolic programs, and what the rounding scheme would look like in this case.
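To recall what the Gaussian rounding scheme looks like in the semidefinite case, here is a minimal sketch, assuming numpy and cvxpy (the function name and the loose treatment of solver tolerances are my own):

```python
import numpy as np
import cvxpy as cp

def grothendieck_round(A, seed=0):
    """Gaussian rounding for the SDP relaxation of
    max over x in {-1,+1}^m, y in {-1,+1}^n of sum_ij A_ij x_i y_j.
    Grothendieck's inequality says the SDP value is within a universal
    constant K_G of the best sign assignment, and projecting the SDP
    vectors on a single random Gaussian direction is the rounding
    behind (one proof of) it."""
    m, n = A.shape
    d = m + n
    X = cp.Variable((d, d), PSD=True)  # Gram matrix of (u_1..u_m, v_1..v_n)
    objective = cp.Maximize(cp.sum(cp.multiply(A, X[:m, m:])))
    cp.Problem(objective, [cp.diag(X) == 1]).solve()
    # Factor X = V V^T to recover the unit vectors.
    w, U = np.linalg.eigh(X.value)
    V = U * np.sqrt(np.clip(w, 0, None))
    g = np.random.default_rng(seed).standard_normal(d)
    s = np.sign(V @ g)  # round by a single Gaussian projection
    return s[:m], s[m:]
```

The open problem asks for the analogue of this picture when the PSD cone is replaced by a hyperbolicity cone.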

Some readings:

– See section 7 here for a clean description of hyperbolic programming and the associated basic concepts.

– See section 2 here for a simple proof of Grothendieck’s inequality.

– Known connections between hyperbolicity and probability will probably be useful here, see this survey by Robin Pemantle.

Around Thompson Sampling (Sebastien Bubeck)

Hyperlinks lead to the arXiv versions of the papers mentioned.

Let \nu be a distribution over length-T sequences of functions from some compact set K \subset \mathbb{R}^n to [-1,1]. Let (\ell_1, \ldots, \ell_T) \sim \nu, x^* = \mathrm{argmin}_{x \in K} \sum_{t=1}^T \ell_t(x), and let (U_1, \ldots, U_T) be uniformly distributed in [0,1]^T (and independent of the \ell_t's). A bandit strategy is a sequence of random variables x_1, \ldots, x_T where x_t is measurable with respect to the bandit filtration \mathcal{F}_t = \sigma(U_1, \ldots, U_t, \ell_1(x_1), \ldots, \ell_{t-1}(x_{t-1})). We are interested in the following quantity (where the expectation is taken with respect to \nu and the U_t's): R_T = \mathbb{E} \sum_{t=1}^T (\ell_t(x_t) - \ell_t(x^*)).

Thompson Sampling (TS) is the strategy where x_t has the same distribution as x^* conditionally on \mathcal{F}_t. Here are some simple questions about TS.

Approximation properties of TS

Consider the special case where K = \{e_1, \ldots, e_n\}, and one can simulate \nu by first sampling \theta \in [0,1]^n from some product distribution \nu_0^{\otimes n} and then sampling each \ell_t(e_i) independently from \mathrm{Ber}(\theta_i). Let R_T^* be the infimum of R_T over all bandit strategies for this setting. Guha and Munagala ask whether TS satisfies R_T \leq 2 R_T^* (more precisely, they ask this question for a somewhat stronger notion of regret, where one counts the number of times a suboptimal action was selected). Note that in this setting we know that there exists a numerical constant C>0 such that \sup_{\nu} R_T \leq C \sup_{\nu} R_T^* (see this paper). Furthermore, Guha and Munagala essentially prove that R_T \leq 2 R_T^* is true for n=2.
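A minimal sketch of TS in exactly this Beta-Bernoulli setting, taking \nu_0 uniform on [0,1] (so the posterior on each \theta_i is a Beta distribution); the interface of `bandit` is hypothetical, and playing the arm with the smallest sampled \theta_i is the standard implementation of the posterior-sampling definition above:

```python
import numpy as np

def thompson_sampling(bandit, n, T, seed=0):
    """Each round: draw one sample of theta from the current posterior,
    play the arm that would be optimal under that sample (the smallest
    theta_i, matching the loss convention above), and update the posterior.
    `bandit(i)` returns the Bernoulli loss ell_t(e_i) of arm i."""
    rng = np.random.default_rng(seed)
    a = np.ones(n)  # Beta(a_i, b_i) posterior on theta_i, starting from Beta(1,1)
    b = np.ones(n)
    for t in range(T):
        theta = rng.beta(a, b)     # one posterior sample per arm
        i = int(np.argmin(theta))  # x_t is drawn as the optimal arm for the sample
        loss = bandit(i)           # observe ell_t(e_i) ~ Ber(theta_i)
        a[i] += loss               # Beta-Bernoulli conjugate update
        b[i] += 1 - loss
    return a, b
```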

TS and the geometry of the action set

Russo and van Roy showed that if \nu is supported on linear functions, then TS satisfies R_T = O(\sqrt{n T \log |K|}). This essentially implies that on K=\mathbb{B}^n (the unit Euclidean ball in \mathbb{R}^n) one has R_T = O(n \sqrt{T}). However, it is known that for this particular convex body there exist strategies with R_T = O(\sqrt{n T}) (see this paper). Is this latter bound true for TS? More generally, understanding the interplay between TS and the geometry of the underlying action set is entirely open.

TS and convexity

In our recent work, we showed that if \nu is supported on convex functions and n=1, then a small modification of TS leads to R_T = O(\log(T) \sqrt{T}). Is this bound true for basic TS? What about higher dimensions?