# Minimax off-policy value estimation (Lihong Li)

In stochastic contextual bandits, one is often interested in estimating the average per-step reward of running a given policy, using data collected by a different policy. The problem is known as off-policy learning in reinforcement learning and as covariate shift in statistics, among other names.

Precisely, a contextual $K$-armed bandit is specified by a distribution $\mu$ over a context set $\mathcal{X}$ and a reward distribution $\Phi$. Given a stochastic policy $\pi$, its average reward is $v^\pi := E_{X\sim\mu,\, A\sim\pi(\cdot | X),\, R\sim\Phi(\cdot | X, A)} [R]$.
The data available to an estimator is of the form $D^n = \{ (X_i, A_i, R_i) \}_{i=1}^{n}$, where $X_i \sim \mu$, $A_i \sim \pi_D(\cdot | X_i)$, $R_i \sim \Phi(\cdot | X_i, A_i)$, and $\pi_D$ is a data-collection policy different from $\pi$. The problem is to estimate $v^\pi$ from the data $D^n$.
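
For concreteness, the following sketch simulates this data-generating process. The finite context set, Bernoulli rewards, and all numerical choices (`n_contexts`, `K`, `mu`, and so on) are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite setup: contexts drawn from mu, K arms, and
# Bernoulli rewards whose means depend on the (context, action) pair.
n_contexts, K, n = 3, 2, 10_000
mu = np.array([0.5, 0.3, 0.2])                    # context distribution mu
reward_mean = rng.uniform(size=(n_contexts, K))   # E[R | X=x, A=a]

pi = rng.dirichlet(np.ones(K), size=n_contexts)    # target policy pi(a | x)
pi_D = rng.dirichlet(np.ones(K), size=n_contexts)  # behavior policy pi_D(a | x)

# Ground-truth value v^pi = E_{X~mu, A~pi(.|X)}[ E[R | X, A] ].
v_pi = np.sum(mu[:, None] * pi * reward_mean)

# Logged data D^n, collected by the behavior policy pi_D.
X = rng.choice(n_contexts, size=n, p=mu)
A = np.array([rng.choice(K, p=pi_D[x]) for x in X])
R = rng.binomial(1, reward_mean[X, A]).astype(float)
```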

A natural solution is importance sampling, which yields an unbiased estimate of $v^\pi$ whose variance depends on $n$ as well as on how “close” $\pi$ and $\pi_D$ are. While this approach has been successful empirically, it is not necessarily optimal, as shown by an asymptotic analysis [1]. More recently, it was shown that in the special case where $\mathcal{X}$ is a singleton (i.e., the classic $K$-armed bandit), a less popular approach, the regression estimator, is near-optimal (and asymptotically optimal) in a minimax sense, whereas the importance-sampling estimator can be arbitrarily suboptimal [2].
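
Continuing the toy simulation above, here is a minimal sketch of both estimators. The plug-in (regression) form shown for a finite context set is an assumed extension of the $K$-armed construction in [2], not the paper's own recipe.

```python
# Importance-sampling estimator: unbiased whenever pi_D(a|x) > 0 for every
# (x, a) with pi(a|x) > 0, but its variance grows with the weights pi/pi_D.
w = pi[X, A] / pi_D[X, A]
v_is = np.mean(w * R)

# Regression (plug-in) estimator: estimate the mean reward of each
# (context, action) cell from the logged data, then average the estimates
# under the empirical context distribution and the target policy pi.
counts = np.zeros((n_contexts, K))
sums = np.zeros((n_contexts, K))
np.add.at(counts, (X, A), 1.0)
np.add.at(sums, (X, A), R)
r_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
mu_hat = np.bincount(X, minlength=n_contexts) / n
v_reg = np.sum(mu_hat[:, None] * pi * r_hat)

print(f"true v^pi = {v_pi:.4f}, IS = {v_is:.4f}, regression = {v_reg:.4f}")
```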

The open question is to propose a (near-)optimal estimator for general contextual bandits.

[1] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

[2] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.