|
Article Excerpt It was established in Dupuis and Wang [Dupuis, E, H. Wang. 2004. Importance sampling, large deviations, and differential games. Stoch. Stoch. Rep. 76 481-508, Dupuis, E, H. Wang. 2005. Dynamic importance sampling for uniformly recurrent Markov chains. Ann. Appl. Probab. 15 1-38] that importance sampling algorithms for estimating rare-event probabilities are intimately connected with two-person zero-sum differential games and the associated Isaacs equation. This game interpretation shows that dynamic or state-dependent schemes are needed in order to attain asymptotic optimality in a general setting. The purpose of the present paper is to show that classical subsolutions of the Isaacs equation can be used as a basic and flexible tool for the construction and analysis of efficient dynamic importance sampling schemes. There are two main contributions.
The first is a basic theoretical result characterizing the asymptotic performance of importance sampling estimators based on subsolutions. The second is an explicit method for constructing classical subsolutions as a mollification of piecewise affine functions. Numerical examples are included for illustration and to demonstrate that simple, nearly asymptotically optimal importance sampling schemes can be obtained for a variety of problems via the subsolution approach.
Key words: importance sampling; large deviations; Isaacs equation; rare events; simulation; Monte Carlo
1. Introduction. It was established in Dupuis and Wang [5, 6] that importance sampling algorithms for estimating rare-event probabilities, or functionals that are largely determined by rare events, are closely related to deterministic differential games. More precisely, the asymptotic optimal performance of importance sampling schemes can be characterized by the value function of a two-person zero-sum differential game, which can in turn be characterized by the solution to the Isaacs equation, a nonlinear partial differential equation (PDE), associated with the game. It was also discussed in Dupuis and Wang [5, 6] that one can construct asymptotically optimal importance sampling algorithms based on this solution.
The purpose of the present paper is to explore this connection in further depth. A main new feature is that it is possible to construct efficient importance sampling schemes based on classical subsolutions of the Isaacs equation. This leads to a more general class of schemes and lends great flexibility to the designer. We will see that one can often construct subsolutions that are structurally much simpler than the actual solution, but which correspond to asymptotically optimal, or at least nearly asymptotically optimal, importance sampling schemes that reflect this simpler structure. This simplicity is important, since one is interested in properties other than just asymptotic optimality, e.g., ease of construction and ease of implementation.
The main theoretical contribution of the paper is a basic result on the asymptotic performance of subsolution-based importance sampling schemes. It characterizes the performance in terms of the value of the subsolution at a particular point. The proof is carried out in a general setting that contains as special cases sums of independent identically distributed (iid) random variables and the empirical measure of a finite-state Markov chain. Another contribution of the current paper is a method for the systematic construction of classical subsolutions that lead to simple yet efficient importance sampling schemes. More precisely, we show that in many cases one can build such subsolutions as a mollification of the minimum of finitely many affine functions.
We wish to point out that the potential application of the subsolution approach is much broader, and includes systems with state dependencies and small noise effects, solutions to stochastic differential equations, systems with constrained dynamics (e.g., queueing networks), and expected values involving path-dependent events. For the purpose of illustration, we include a few numerical examples that do not fit directly into the theoretical framework of the current paper and yet for which efficient importance sampling algorithms can still be built via subsolutions.
The paper is organized as follows. Section 2 gives a brief account of importance sampling and asymptotic optimality. Since the underlying game and Isaacs equation are not yet widely exposed in the importance sampling literature, we give some heuristics and a formal overview in [section] 3 in the setting of sums of iid random variables. Sections 4 through 8 are the theoretical part of the paper. In [section] 4, the general model and assumptions are stated.
Importance sampling for Markov chains uses a collection of eigenfunctions that are related to the transition kernel of the chain, and [section] 5 reviews the properties of these eigenfunctions. Sections 6 and 7 introduce the concept of generalized subsolution/control and describe the associated importance sampling algorithms. The main theoretical result, which characterizes the asymptotic performance of such schemes, is stated in [section] 8. Section 9 discusses methods for the construction of subsolutions in great detail, starting with the simplest possible examples and then extending to more complicated situations. Section 10 is devoted to numerical examples. To illustrate the broad application of the subsolution approach, the latter part of [section] 10 considers several classes of problems that are not covered by the main theoretical result, including multidimensional level-crossing problems, probabilities and expected values that involve path-dependent events, and buffer overflow in a mixed open/closed queueing network. In each case, the application of the Isaacs equation and subsolution approach follows the pattern laid out for the simple case of sums of random variables. To streamline the presentation, technical proofs are collected in appendices.
Notation. For a Polish space S, P(S) denotes the collection of all probability measures on (S, B(S)), where B(S) is the Borel [sigma]-algebra. There will be many instances in this paper where we decompose measures on a product space as the product of a marginal distribution and a stochastic kernel. The following notation will be used. Suppose that [mu] [member of] P([S.sub.1] x [S.sub.2]) with each [S.sub.i] a Polish space. Then [[[mu]].sub.l] will denote the first marginal of [mu], and [mu]([dy.sub.2] | [y.sub.1]) will denote the stochastic kernel on [S.sub.2] given [S.sub.1] such that [mu]([dy.sub.1] x [dy.sub.2]) = [[[mu]].sub.l]([dy.sub.l])[mu]([dy.sub.2] | [y.sub.1]). Quantities such as [[[mu]].sub.2], [mu]([dy.sub.1] | [y.sub.2]), and the extension to products of more than two Polish spaces are all defined in the analogous fashion. Given [mu] [member of] P([S.sub.1]) and a stochastic kernel q on [S.sub.2] given [S.sub.1], we let [mu] [cross product] q denote [mu]([dy.sub.1])q([dy.sub.2] | [y.sub.1]) [member of] P([S.sub.1] x [S.sub.2]).
Superscripts in this paper generically denote an index, such as a scaling parameter or a mollification parameter. The one exception is when a superscript of 2 appears, in which case it denotes the square.
Relative entropy. Given a Polish space S and two probability measures [gamma], [mu] [member of] P(S), the relative entropy is defined by
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
The relative entropy is always nonnegative and is a convex, lower semicontinuous function of ([gamma],[mu]) [member of] P(S) x P(S)(Dupuis and Ellis [4, [section] 1.4]).
2. An overview of importance sampling.
2.1. Basics of importance sampling. Importance sampling is a variance reduction technique widely used in Monte Carlo simulation. The basic idea of importance sampling is "change of measure." In other words, the system is simulated under a different probability distribution and the outcomes are multiplied by appropriate Radon-Nikodym derivatives to form unbiased estimators.
Suppose we wish to estimate the expected value of a random variable X with distribution 0. Consider an alternative sampling distribution [mu] such that [theta] [much less than] [mu]. Let f(x) [??} (d[theta]/d[mu])(x) denote the Radon-Nikodym derivative. Then importance sampling considers independent copies of a random variable [bar.X] with distribution [mu] and forms an estimator by averaging the independent copies of
Z [??] [bar.X]f([bar.X]).
This estimator is unbiased since
E[bar.X]f([bar.X] = [integral] xf(x)[mu](dx) = [integra] x [theta](dx) = EX,
and its rate of convergence is determined by the variance Z--the smaller the variance, the faster the convergence. One typically seeks to minimize the variance of Z, or equivalently the second moment of Z, within a parametric family of alternative sampling distributions.
REMARK 2.1. In the analysis, it is often convenient to write the second moment of Z in terms of the original random variable X, that is,
[EZ.sup.2] = [integral] [x.sup.2] [f.sup.2](x)[mu](dx) = [integral] [x.sup.2] [f(x)[theta](dx) = E[[X.sup.2]f(X)].
2.2. Asymptotic optimality. The following asymptotic optimality criterion is often adopted when one is interested in a family of random variables {[X.sub.n]} that satisfy
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
for some constant [gamma] > 0. Let [Z.sub.n] be an unbiased importance sampling estimator for E[[X.sub.n]]. Recall that a major concern of importance sampling is to minimize the second moment of [Z.sub.n]. However, by Jensen's inequality,
[EZ.sup.2.sub.n] [greater than or equal to][([EZ.sub.n]).sup.2] = [([EX.sub.n]).sup.2].
Therefore,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
We say the importance sampling estimator is asymptotically optimal if the upper bound is achieved, i.e., if
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Sometimes 2[gamma] is simply referred to as the "optimal decay rate."
3. An introduction to the role of subsolutions. This section describes how an Isaacs equation arises in importance sampling and the implications for the performance of schemes based on subsolutions. Since it is an overview, we do not give all details and will not be precise regarding all necessary assumptions.
3.1. Problem formulation for sums of iid random variables. Consider a sequence of iid random variables {[Y.sub.i], i [member of] N} with distribution [mu] [member of] P ([R.sup.d]), and define
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Let H be the log-moment generating function, that is, for [alpha] [member of] [R.sup.d],
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Assume H is finite for each [alpha]. Denote by L the Legendre transform
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Note that both H and L are convex functions.
Suppose one is interested in the importance sampling estimation of
Eexp{-nF([X.sub.n])}.
In the context of sums of lid random variables, one typically uses the following parametric family of exponential changes of measure to generate the replacements for the [Y.sub.i]:
[[mu].sub.[alpha]](dy)[??][e.sup.([alpha],y)-H([alpha])][mu](dy).
In constructing the replacement for [X.sub.n], we use a dynamic change of measure. For a function [bar.[alpha]](x, t): [R.sup.d] x [0, 1] [right arrow] [R.sup.d] recursively define the following quantities. Let [[bar.X].sup.n.sub.0] = 0, and assume that [[bar.X].sup.n.sub.j], [[bar.Y].sup.n.sub.j], j = 1, ..., i have been defined. Let [[bar.Y].sup.n.sub.i+1], conditioned on [[bar.X].sup.n.sub.j], [[bar.Y].sup.n.sub.j], j = 1, ..., i, have distribution [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], and then set [[bar.X].sup.n.sub.i+1][??][[bar.X].sup.n.sub.i] + [[bar.Y].sup.n.sub.i+1]/n. When [[bar.X].sup.n.sub.i], [[bar.Y].sup.n.sub.i] have been defined for all i = 1, ..., n, the importance sampling estimator is given by
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
The importance sampling algorithm then takes the sample average of independent replications of [Z.sup.n] as the estimate. Using a conditioning argument, it is not difficult to check that [Z.sup.n] is unbiased, and therefore to minimize the variance, it suffices to minimize the second moment of [Z.sup.n].
We consider the problem of minimizing the second moment as a control problem, with [bar.[alpha]] the control. It is here that the problem connects naturally to a PDE. To make the connection, we must extend the problem slightly. For i [member of] N [union] {0} and x [member of] [R.sup.d], define [[bar.X].sup.n.sub.j], j = i, ..., n - 1 as above except [[bar.X].sup.n.sub.i] = x, and then define
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
It will be more convenient to express this in terms of the original random variables as in Remark 2.1:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Owing to the exponential scaling in n, one gets a simple asymptotic problem by considering the logarithmic transform
[W.sup.n] (x, i) = - 1/n log [V.sup.n] (x, i).
The performance of the scheme corresponding to [bar.[alpha]] can then be characterized in terms of lim [inf.sub.n[right arrow][infinity]] [W.sup.n](0, 0), with larger values indicating better performance.
3.2. The associated Isaacs equation. [V.sup.n] is the value function of a discrete time stochastic control problem and, as such, satisfies the dynamic programming equation
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
A variational formula (Dupuis and Ellis [4, [section] 1.4]) shows how to represent exponential integrals in terms of relative entropy. For any bounded and continuous function f: [R.sup.d] [right arrow] R,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Applying this to the dynamic programming equation and using the definition of [W.sup.n] gives the following discrete-time Isaacs equation:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
To formally relate [W.sup.n](x, i) to the solution of a PDE, suppose that for a smooth function W: [R.sup.d] x [0, 1] [right arrow] R, [W.sup.n] (x, i) [approximately equal to] W(x, i/n). We also use the following relationship (Dupuis and Ellis [4, [section] C.5]) between relative entropy and the function L defined previously as the Legendre transform of H. For any [beta] [member of] [R.sup.d],
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Let [W.sub.t] denote the partial derivative with respect to t, DW the gradient in x, and
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
for s, [alpha], [beta] [member of] [R.sup.d]. We bring [W.sup.n](x, i) [approximately equal to] W(x, i/n) to the right side of the Isaacs equation, expand via Taylor series, insert the expression for L, multiply by n, and send n [right arrow] [infinity] to get
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
Note that one also expects the terminal condition W(x, 1) = 2F(x) to hold.
This PDE, which is also known as an Isaacs equation, was identified in Dupuis and Wang [5, 6] and its solution was used there to construct asymptotically optimal importance sampling schemes.
3.3. Subsolutions and importance sampling. The purpose of the present paper is to show that it is only the subsolution property that is essential in the context of importance sampling. The definition of a subsolution simply replaces the equalities that appear in the Isaacs equation and terminal condition with inequalities. More precisely, by a classical subsolution, we mean a continuously differentiable function [bar.W]: [R.sup.d] x [0, 1] [right arrow] [R] such that
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
for all (x, t) and [bar.W](x, 1) [less than or equal to] 2F(x).
The sufficiency of the subsolution property can be understood intuitively as follows. Recall that we are only interested in bounding the quantity [W.sup.n](0, 0) from below, since an upper bound is automatic from Jensen's inequality (see [section] 2.2). The inequalities in the definition of a subsolution are those which give lower bounds when the subsolution is combined with a verification argument to estimate the performance.
In a more general context than the one used in this overview, we will show how subsolutions naturally suggest importance sampling schemes. The main theoretical result of this paper can then be roughly stated as follows: If [Z.sup.n] is the importance sampling estimator constructed according to a subsolution [bar.W], then
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (1)
With result (1), the design problem becomes clear: Construct a subsolution whose associated importance sampling scheme can be implemented with reasonable effort and for which [bar.W](0, 0) is equal or close to the optimal decay rate 2[gamma].
REMARK 3.1. Because subsolutions deal with inequalities (rather than equalities), there is not a unique importance sampling scheme associated with each subsolution. We will turn this flexibility to our advantage, but it requires that the control for the importance sampling player be specified as part of the definition. As a consequence, the notion of a generalized subsolution/control will be introduced, which specifies a set of differential inequalities for the given pair. The lower bound (1) still holds for importance sampling estimators based on generalized subsolution/controls (see [section] 7).
4. The general setup. The broader collection of importance sampling problems we wish to analyze includes sums of iid random variables and sums of functionals of a finite state Markov chain. The following general model includes both as special cases. Let Y [??] {[Y.sub.i] i [member of] [N.sub.0]} denote a Markov chain with state space S. Assume that S is a Polish space, and let p(y, dz) denote the probability transition kernel. Let {[b.sub.i](*), i [member of] [N.sub.0]} be a sequence of iid random vector fields on S that is independent of the Markov chain Y. For each y [member of] S, [b.sub.i](y) is distributed according to a probability measure, say m(* | y), on [R.sup.d]. Our interest is in sums of the form
[X.sub.n] [??] 1/n [n.summation over (i=1][b.sub.i]([Y.sub.i]). (2)
By choosing S to be a single point, we recover the case of sums of iid random variables, whereas taking [b.sub.i](y) to be deterministic (i.e., m(* | y) is a single atom for each y [member of] S) produces the case of functionals of a Markov chain. The general case is also of interest and occurs when the distribution of the summand [b.sub.i] is modulated by the "exogenous" process Y. Note that ([Y.sub.n], n[X.sub.n]) forms a Markov additive process.
REMARK 4.1. In the literature on importance sampling for Markov chains, it is standard to include the initial state [Y.sub.0]...
|