Minimax off-policy value estimation (Lihong Li)

In stochastic contextual bandits, one is often interested in estimating the average per-step reward of a given policy from data collected by a different policy. The problem is also known as off-policy evaluation in reinforcement learning and is closely related to covariate shift in statistics, among others.

Precisely, a contextual K-armed bandit is specified by a distribution \mu over a context set \mathcal{X} and a reward distribution \Phi. Given a stochastic policy \pi, its average reward is given by v^\pi := E_{X\sim\mu,\, A\sim\pi(\cdot | X),\, R\sim\Phi(\cdot | X, A)} [R].
The data available to an estimator is of the form D^n = \{ (X_i, A_i, R_i) \}_{i=1}^{n}, where X_i \sim \mu, A_i \sim \pi_D(\cdot | X_i), R_i \sim \Phi(\cdot | X_i, A_i), and \pi_D is a data-collection (behavior) policy, in general different from \pi. The problem is to estimate v^\pi from the data D^n.
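To make the data-generation process concrete, here is a minimal simulation sketch. All specific numbers (three contexts, two arms, Bernoulli rewards, the particular \mu, \pi_D, and mean-reward tables) are hypothetical choices for illustration, not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: 3 contexts, K = 2 arms, Bernoulli rewards.
K, n = 2, 10_000
mu = np.array([0.5, 0.3, 0.2])        # context distribution mu(x)
pi_D = np.array([[0.9, 0.1],          # behavior policy pi_D(a | x)
                 [0.5, 0.5],
                 [0.2, 0.8]])
mean_reward = np.array([[1.0, 0.0],   # E[R | x, a]
                        [0.2, 0.8],
                        [0.5, 0.5]])

# Draw the dataset D^n = {(X_i, A_i, R_i)}: context, then action from
# pi_D, then a reward whose mean depends on both context and action.
X = rng.choice(len(mu), size=n, p=mu)
A = np.array([rng.choice(K, p=pi_D[x]) for x in X])
R = rng.binomial(1, mean_reward[X, A]).astype(float)
```

An estimator of v^\pi only sees the triples (X, A, R); it does not get to query \Phi or \mu directly.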

A natural solution is importance sampling, which gives an unbiased estimate of v^\pi whose variance depends on n as well as on how “close” \pi and \pi_D are. While this approach has been successful empirically, it is not necessarily optimal, as shown by an asymptotic result [1]. More recently, it was shown that in the special case where \mathcal{X} is a singleton (i.e., the classic K-armed bandit), a less popular approach, the regression estimator, is near-optimal (and asymptotically optimal) in a minimax sense, whereas the importance-sampling estimator can be arbitrarily sub-optimal [2].
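The two estimators can be sketched side by side for the singleton-context (classic K-armed bandit) case. The instance below (K = 3, the particular \pi, \pi_D, mean rewards, and Gaussian noise) is a hypothetical example; the estimators themselves follow the standard definitions: importance sampling reweights observed rewards by \pi(A_i)/\pi_D(A_i), while the regression estimator plugs per-arm sample means into v^\pi = \sum_a \pi(a) r(a).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical singleton-context instance with K = 3 arms.
K, n = 3, 50_000
pi = np.array([0.1, 0.3, 0.6])     # target policy pi(a)
pi_D = np.array([0.6, 0.3, 0.1])   # behavior policy pi_D(a)
r = np.array([0.2, 0.5, 0.8])      # true mean rewards, so v^pi = pi @ r

# Data collected under pi_D, with Gaussian reward noise.
A = rng.choice(K, size=n, p=pi_D)
R = r[A] + rng.normal(0.0, 0.1, size=n)

# Importance-sampling estimator: unbiased, but its variance grows with
# the ratios pi(a) / pi_D(a).
v_is = np.mean(pi[A] / pi_D[A] * R)

# Regression estimator: estimate r(a) by the per-arm sample mean, then
# plug into the definition of v^pi.
r_hat = np.array([R[A == a].mean() if np.any(A == a) else 0.0
                  for a in range(K)])
v_reg = pi @ r_hat

v_true = pi @ r
```

With large n both estimates land near v^\pi here; the minimax separation in [2] appears when \pi places mass on arms that \pi_D rarely plays, which inflates the importance weights.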

The open question involves proposing a (near-)optimal estimator for general contextual bandits.

[1] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

[2] Lihong Li, Rémi Munos, and Csaba Szepesvári. Toward minimax off-policy value estimation. In AISTATS, 2015.
