In stochastic contextual bandits, one is often interested in estimating the average per-step reward of a given policy from data collected by a different policy. The problem is also known as off-policy learning in reinforcement learning and as covariate shift in statistics, among other names.

Precisely, a contextual $K$-armed bandit is specified by a distribution $\mu$ over a context set $\mathcal{X}$ and a reward distribution $R(\cdot \mid x, a)$ for each context–arm pair $(x, a)$. Given a stochastic policy $\pi(\cdot \mid x)$, its average reward is given by
$$v^\pi = \mathbb{E}_{x \sim \mu,\; a \sim \pi(\cdot \mid x),\; r \sim R(\cdot \mid x, a)}[r].$$

The data available to an estimator is of the form $D = \{(x_i, a_i, r_i)\}_{1 \le i \le n}$, where $x_i \sim \mu$, $a_i \sim \pi_D(\cdot \mid x_i)$, $r_i \sim R(\cdot \mid x_i, a_i)$, and $\pi_D$ is a data-collection policy different from $\pi$. The problem is to estimate $v^\pi$ from data $D$.
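The data-generation process above can be sketched on a toy instance; all concrete numbers and names below (the context set, $\mu$, the Bernoulli mean rewards) are illustrative assumptions, not part of the problem statement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: context set X = {0, 1, 2}, K = 2 arms.
K = 2
mu = np.array([0.5, 0.3, 0.2])           # context distribution mu over X
mean_reward = np.array([[0.1, 0.9],      # E[r | x, a]; rewards drawn Bernoulli
                        [0.5, 0.5],
                        [0.8, 0.2]])

def draw_data(policy, n):
    """Collect D = {(x_i, a_i, r_i)} for i = 1..n by running `policy`,
    a |X| x K matrix whose row x is the action distribution pi(. | x)."""
    xs = rng.choice(len(mu), size=n, p=mu)
    acts = np.array([rng.choice(K, p=policy[x]) for x in xs])
    rs = rng.binomial(1, mean_reward[xs, acts])
    return xs, acts, rs

# Data-collection policy pi_D: uniform over the two arms in every context.
pi_D = np.full((3, K), 0.5)
xs, acts, rs = draw_data(pi_D, n=10_000)
```

Any estimator for this problem sees only the triples `(xs, acts, rs)` together with knowledge of the policies, not the underlying `mean_reward` table.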

A natural solution is to use importance sampling to get an unbiased estimate of $v^\pi$, whose variance depends on $n$ as well as on how “close” $\pi$ and $\pi_D$ are. While this approach has been successful empirically, it is not necessarily optimal, as shown by an asymptotic result [1]. More recently, it has been shown that for the special case where $\mathcal{X}$ is a singleton (i.e., the classic $K$-armed bandit), a less popular approach, the regression estimator, is near-optimal in a minimax sense (and asymptotically optimal), whereas the importance-sampling estimator can be arbitrarily sub-optimal [2].
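A minimal sketch of the two estimators, assuming finite context and action sets and policies given as probability tables (the plug-in form of the regression estimator below, with sample-average reward fits and empirical context frequencies, is one concrete instantiation, not necessarily the construction analyzed in [2]):

```python
import numpy as np

def importance_sampling(xs, acts, rs, pi, pi_D):
    """Unbiased IS estimate of v^pi: the average of
    r_i * pi(a_i | x_i) / pi_D(a_i | x_i) over the data."""
    w = pi[xs, acts] / pi_D[xs, acts]
    return float(np.mean(w * rs))

def regression_estimator(xs, acts, rs, pi, n_contexts, n_arms):
    """Plug-in estimate: fit mean rewards r_hat(x, a) by sample averages,
    then average r_hat under pi and the empirical context frequencies."""
    r_hat = np.zeros((n_contexts, n_arms))
    for x in range(n_contexts):
        for a in range(n_arms):
            mask = (xs == x) & (acts == a)
            if mask.any():
                r_hat[x, a] = rs[mask].mean()  # unobserved (x, a) pairs stay 0
    mu_hat = np.bincount(xs, minlength=n_contexts) / len(xs)
    return float(np.sum(mu_hat[:, None] * pi * r_hat))
```

Note the qualitative difference: the IS estimate reweights individual rewards, so its variance blows up when $\pi_D$ rarely takes actions that $\pi$ favors, while the regression estimator first smooths rewards per $(x, a)$ pair and only then averages under $\pi$.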

The open problem is to propose a (near-)optimal estimator for general contextual bandits.

[1] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

[2] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.