# The Online Light Bulb Problem (Ohad Shamir)

## Setting

A learner is given access to an infinite stream of i.i.d. vectors $x_t\in \{-1,+1\}^d$. The distribution is such that $\mathbb{E}[x_t]=0$, and $\mathbb{E}[x_t x_t^\top]=I+\rho(e_i e_j^\top + e_j e_i^\top)$ (in words, all coordinates are zero mean and uncorrelated, except for a single pair of coordinates $(i,j)$ which have positive correlation of $\rho$).
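For concreteness, one way to realize such a distribution (an illustrative construction, not part of the problem statement) is to draw all coordinates as i.i.d. uniform signs, and then, with probability $\rho$, overwrite coordinate $j$ with a copy of coordinate $i$. This gives $\mathbb{E}[x_{t,i}x_{t,j}] = \rho\cdot 1 + (1-\rho)\cdot 0 = \rho$ while keeping every coordinate zero mean with unit variance:

```python
import numpy as np

def sample_stream(d, rho, i, j, T, rng):
    """Draw T vectors in {-1,+1}^d with i.i.d. uniform-sign coordinates,
    except that coordinates i and j have correlation rho.
    Construction: with probability rho, copy coordinate i into coordinate j;
    then E[x_i x_j] = rho*1 + (1-rho)*0 = rho, and E[x_j] remains 0."""
    X = rng.choice([-1, 1], size=(T, d))
    copy = rng.random(T) < rho          # rows where coordinate j is copied
    X[copy, j] = X[copy, i]
    return X
```

Note that coordinate $j$ is still a uniform sign marginally, so the covariance matrix is exactly $I+\rho(e_i e_j^\top + e_j e_i^\top)$ as in the setting above.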

## Goal

What is the minimal sample size (as a function of $d,\rho$) required to detect $i,j$ with constant probability, using an algorithm with $\tilde{O}(d)$ bits of memory? The $\tilde{O}$ notation can hide arbitrary factors polylogarithmic in the parameters of the problem (so that finite-precision issues are not a problem).

## Existing Results

This problem is closely related to the light bulb problem, introduced by [1], which asked a similar question in an offline setting, where a sample of such vectors is given and the goal is to detect a correlation in time sub-quadratic in $d$. In contrast, we consider an online/streaming setting, and constrain ourselves in terms of memory.

Without memory constraints, it is possible to detect the correlation with $O(\log(d)/\rho^2)$ samples (which is information-theoretically optimal), by maintaining the empirical average of $x_{t,i}x_{t,j}$ for all pairs $i,j$. However, this requires $\tilde{O}(d^2)$ memory, which can be impractical for high-dimensional data. With $\tilde{O}(d)$ memory, one simple algorithm is to maintain empirical averages for batches of $O(d)$ coordinate pairs at a time, moving to the next batch once the current batch has been ruled out, and stopping when the correlated pair is found. Since there are $\Theta(d^2)$ pairs, this requires $\Theta(d)$ batches, each consuming $O(\log(d)/\rho^2)$ fresh samples, so the sample size of such an algorithm is much larger, $O(d\log(d)/\rho^2)$, and it is not clear whether this is the best possible.
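The batch-scanning algorithm above can be sketched as follows. This is a minimal illustration, not the definitive algorithm: the per-batch sample count and the detection threshold $\rho/2$ are assumed constants chosen for the sketch, not tuned for a formal guarantee.

```python
import numpy as np
from itertools import combinations

def find_correlated_pair(stream, d, rho, batch_pairs=None):
    """Memory-bounded sketch of the batch-scanning algorithm: walk over the
    d*(d-1)/2 coordinate pairs in batches of ~d pairs, average x_i * x_j over
    n = O(log(d)/rho^2) fresh samples per batch, and return the first pair
    whose empirical correlation clears rho/2 (an assumed threshold).
    `stream` is any iterator yielding vectors in {-1,+1}^d."""
    pairs = list(combinations(range(d), 2))
    batch = batch_pairs or d
    # Samples per batch; the constant 64 is illustrative, not optimized.
    n = int(np.ceil(64 * np.log(max(d, 2)) / rho ** 2))
    for start in range(0, len(pairs), batch):
        chunk = pairs[start:start + batch]   # O(d) pairs -> O(d) memory
        sums = np.zeros(len(chunk))
        for _ in range(n):
            x = next(stream)
            for k, (i, j) in enumerate(chunk):
                sums[k] += x[i] * x[j]
        best = int(np.argmax(sums))
        if sums[best] / n > rho / 2:         # batch not ruled out: report
            return chunk[best]
    return None
```

Only one batch of running averages is stored at a time, which is what keeps the memory at $\tilde{O}(d)$ while forcing the $\Theta(d)$-fold blowup in sample size.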

[2] essentially showed that the trivial algorithm above can be optimal, but only for a certain carefully constructed artificial distribution, quite different from the one considered here. For the offline light bulb problem, the best currently-known algorithm appears in [3] and is based on a clever random projection scheme. However, in an online $\tilde{O}(d)$-memory setting, random projections do not seem to yield a sample complexity much better than that of the trivial algorithm described above.

## Bibliography

[1] Leslie G. Valiant. Functionality in neural nets. In AAAI, pages 629–634, 1988.

[2] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In NIPS, 2014.

[3] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and juntas. In FOCS, 2012.