|
Article Excerpt We devise an algorithm for solving the infinite-dimensional linear programs that arise from general deterministic semi-Markov decision processes on Borel spaces. The algorithm constructs a sequence of approximate primal-dual solutions that converge to an optimal one. The innovative idea is to approximate the dual solution with continuous piecewise linear ridge functions that naturally represent functions defined on a high-dimensional domain as linear combinations of functions defined on only a single dimension. This approximation gives rise to a primal/dual pair of semi-infinite programs, for which we show strong duality. In addition, we prove various properties of the underlying ridge functions.
Key words: infinite/semi-infinite linear programming algorithms; deterministic semi-Markov decision processes; approximate dynamic programming; ridge function approximations
1. Introduction. Linear programming is a classical approach to the solution of Markov decision processes (MDP), and dates back to the pioneering work of Ghellinck [13], D'Epenoux [12], and Manne [23] for MDP's with finite state and action spaces. Recently, there has been substantial interest in MDP's on general Borel spaces, e.g., see Hernandez-Lerma and Lasserre [16, 19] and references therein. In this setting, the linear programs are no longer defined on finite spaces, but infinite ones. Unlike finite linear programs, which enjoy the wide availability of solution software, solution methodologies for infinite linear programs are rare. Typically, algorithms are specially designed for particular classes of problems. One exception to this is the paper by Hernandez-Lerma and Lasserre [17].
Klabjan and Adelman [21] develop infinite-dimensional linear programming theory for semi-Markov decision processes (SMDP) on Borel spaces and long-run time average expected cost criterion, with particular emphasis on the special case in which the state transitions are deterministic. Included in this theory are general conditions for strong duality and the existence of an optimal stationary, deterministic control policy. Adelman and Klabjan [2] apply this theory to the generalized joint replenishment problem, which includes many classical problems in inventory control for which the duality and existence questions have remained open for many years.
In this paper, we devise a new algorithm for solving the primal-dual pair of infinite-dimensional linear programs that arise from deterministic semi-Markov decision problems with long-run time average cost criterion. We show that under some mild conditions, the algorithm is convergent. Our algorithm constructs a sequence of measures that converge weakly to an optimal solution to the primal problem, and a corresponding sequence of dual feasible functions that converge to an optimal solution to the dual problem. A key contribution of our work is in specific forms of functional approximations to the feasible solutions of the dual problem. In Klabjan and Adelman [21], we showed how to construct an optimal stationary, deterministic control policy from an optimal primal solution.
Bellman and Dreyfus [3] were the first to consider functional approximations, in their case Legendre polynomials, in the solution of the dynamic programming optimality equation defined on continuous spaces. More recently, Johnson et al. [20] considered the use of multivariate spline functions to approximate the value function. The idea is to discretize the state space, evaluate the value function at the grid points, and then use multivariate splines to interpolate between grid points. Because the number of multivariate basis functions grows exponentially with the number of state space dimensions, Chen et al. [6] consider an experimental design approach from statistics, but with multivariate spline functions of lower polynomial order. These authors consider only the finite-horizon case, work directly with the optimality equations instead of the linear programming formulations as we do here, and do not show convergence. Recently, de Farias and Van Roy [10, 11] have considered linear programming for approximate dynamic programming. The key idea, which comes from Schweitzer and Seidmann [27], is to approximate the value function with a linear combination of basis functions. There has been considerable recent interest in using this approach. However, there remains a fundamental open question: How to generate basis functions? Until now, all research in approximate dynamic programming has assumed a fixed set of basis functions, chosen a priori by the human modeler. Ours is the first paper, to our knowledge, to provide an algorithm that automates the generation of basis functions. Our algorithm dynamically constructs basis functions as the algorithm proceeds, in the limit converging in such a way as to close the gap between the approximate value function and the true optimal value function. The key is limiting the search for new basis functions to a particular well-structured class that has the power to approximate, arbitrarily closely, any bounded measurable function.
Hernandez-Lerma and Lasserre [18] propose a convergent algorithm for solving the infinite-dimensional linear programs resulting from MDPs. Our algorithm differs from theirs in fundamental ways. Whereas we approximate the dual problem, they approximate the primal problem. They consider countable dense subsets of measures for aggregating the variables and countable dense subsets of functions to aggregate constraints of the primal problem. In contrast, we work with an explicit form of functional approximations, which are computationally tractable and still give a convergent algorithm under some mild conditions. Linear programming for semi-infinite problems is considered by Tadic et al. [28], where the problem is reformulated as a stochastic semi-infinite problem, which is then solved by Monte-Carlo simulation.
Our main idea is to approximate the bias function with piecewise linear functions. In our context, a piecewise linear function is always continuous, and it has a finite number of breakpoints. The key question is how to define multivariate piecewise linear functions. The definition must allow for an easy encoding of such functions, and it should be easy to capture properties such as continuity, convexity, and monotonicity. Such properties are important in applications of approximate dynamic programming, see, e.g., Powell and Topaloglu [25], to ensure computational tractability and economic sensibility. The multivariate spline functions, see, e.g., Wang [29] or Bojanov et al. [5], are very complex, and it is extremely difficult to impose the aforementioned properties. For this reason, we took a different approach where the mapping from the multidimensional space to the 1-dimensional space is carried via the notion of ridge functions, see, e.g., Cheney and Light [7]. In the 1-dimensional space, we then use piecewise linear functions or 1-splines to approximate continuous functions. It is well known that higher-order splines have better convergence properties in other settings and they are smooth. However, in our setting, they would yield nonlinear, very hard-to-solve programs. By using piecewise linear functions, often the resulting programs are linear mixed integer programs, which can be solved using readily available optimization software.
Our algorithm begins with a collection of piecewise linear functions [f.sub.j] R [right arrow] R, which correspond to 1-splines. Given a Borel state space X of high dimension, corresponding to function j we map each state x into a scalar [r.sup.j]x, where [r.sup.j] is called a ridge vector having dimension equal to that of X. We then approximate the bias function, u(.), with the summed ridge function, i.e.,
u(x) [approximately equal to] [summation over (j)] [f.sub.j] ([r.sup.j] x) x [member of] X.
By choosing [r.sup.j] to be different from the unit vectors, we capture interactions between components of the state space. Letting [b.sup.j] denote the vector of breakpoints for function [f.sub.j], we thus start the algorithm with a collection [{[r.sup.j], [b.sup.j]}.sub.j]. By substituting the above approximation into the dual-infinite program, the slopes between breakpoints for each function [f.sub.j] essentially become decision variables in what turns out to be a semi-infinite linear program. We show that this program is solvable and has strong duality. Using the primal solution of this semi-infinite program, we then solve an auxiliary problem that finds additional breakpoints and/or ridge vectors, and then we repeat the whole procedure. We show that primal and dual solutions generated by this algorithm converge to an optimal solution under some mild assumptions. In Adelman and Klabjan [1], we discuss how to efficiently implement the basic algorithm discussed here on the generalized joint replenishment problem, including how to construct feasible control policies in practice that are (near) optimal.
The paper is organized as follows. In [section]2, we give a brief overview of infinite-dimensional linear programming for deterministic SMDPs. In [section]3, we define ridge functions and prove various properties about them. We show in [section]4 how to formulate the problem of finding optimal weights for the ridge function as a semi-infinite linear program, prove a strong duality result, and provide an algorithm for solving it within arbitrary precision. Then in [section]5, we give the main algorithm and prove convergence.
2. Overview of infinite-dimensional linear programming for deterministic SMDPs. In this section, we review the linear programming theory developed in Klabjan and Adelman [21] and Adelman and Klabjan [2].
Given a Borel space Z we denote by C(Z) the set of all continuous functions on Z and by B(Z) the set of all Borel measurable bounded functions on Z. The set of all finite signed Borel measures on Z is denoted by M(Z). All these three sets can be equipped with a norm and the last two become Banach spaces. Let also B(Z) be the Borel [sigma]-algebra in Z.
2.1. Formulation. Consider a deterministic SMDP defined on a state space X and action space A, both assumed to be Borel spaces. For each x [member of] X, let A (x) [subset or equal to] A be a nonempty Borel subset that specifies the set of admissible actions from state x. We denote the collection of state-action pairs as K = {(x, a): x [member of] X, a [member of] A(x)}, assumed to be a Borel subset of X x A. Upon taking action a in state x, a cost c(x, a) is incurred and then the system transitions to some state s(x, a) after a time duration of length [tau](x, a), all with probability one. We assume that c: K [right arrow] R, s: K [right arrow] X, and [tau]: K [right arrow] [0, [infinity]) are measurable on K. Let [{[x.sub.n], [a.sub.n], [t.sub.n]}.sub.n=0,1, ...] [member of] (K x [[0, [infinity])).sup.[infinity]] denote any infinite sequence of state-action pairs and transition times. Suppose f: X [right arrow] A is a measurable decision function that specifies for every x [member of] X some action a [member of] A(x). Define the long-run average cost of the system under control f, starting from an initial state [x.sub.0] [member of] X, as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].
The problem
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
finds an optimal decision rule [f.sup.*] from starting state [x.sub.0]. One of the main questions in Markov control processes is under what conditions does there exist an [f.sup.*] such that [J.sup.*] = J([f.sup.*], [x.sub.0]) = J([x.sub.0]) for every [x.sub.0] [member of] X? Such a decision rule is said to be long-run time average optimal, in the class of stationary deterministic decision rules, from every starting state.
More generally, rather than restricting the class of policies to deterministic decision rules f: X [right arrow] A, we could pose the existence question over all admissible, nonanticipatory policies [pi] [member of] [PI], including randomized history-dependent ones. It follows from Klabjan and Adelman [21] that stationary deterministic policies still suffice for optimality.
2.2. Infinite-dimensional linear programming theory. We reformulate the problem of finding an optimal policy as an infinite-dimensional linear program.
We consider the following primal-dual linear programs on the spaces (M(K), B(K)), (R x M(X), R x B(X)). The primal problem (P) is
min [[integral].sub.K] c(x, a)[mu](d(x, a)) (1a)
[[integral].sub.K] [tau](x, a)[mu](d(x, a)) = 1 (1b)
[mu]((B x A) [intersection] K) - [mu]({(x, a) [member of] K: s(x, a) [member of] B}) = for every B [member of] B(X) (1c)
[mu] [greater than or equal to] 0, [mu] [member of] M(K) (1d)
and the corresponding dual problem (D) reads max [rho]
[tau](x, a) rho] + u(x) - u(s(x, a)) [less than or equal to] c(x, a) for every (x, a) [member of] K (2)
[rho] [member of] R, u [member of] B(X).
We denote by inf(P) and sup(D) the optimal values of the primal and dual programs, respectively. We call constraints (1c) the flow balance constraints.
Next, we provide a set of assumptions under which strong duality holds between these two programs.
ASSUMPTION 2.1. [tau] is continuous, nonnegative, and bounded on K.
ASSUMPTION 2.2. c is lower semicontinuous and nonnegative on K.
ASSUMPTION 2.3. c(x, a) + [tau](x, a) [greater than or equal to] 1 for every (x, a) [member of] K.
Here, the right-hand side can be changed to any [epsilon] > 0, but we normalize to 1 for convenience.
ASSUMPTION 2.4. {a [member of] A(x): c(x, a) + [tau](x, a) [less than or equal to] r} is compact for every x [member of] X, r [member of] R.
ASSUMPTION 2.5. There exists a decision rule f: X [right arrow] A and initial state [x.sub.0] [member of] X such that J (f, [x.sub.0]) < [infinity].
ASSUMPTION 2.6. s is continuous on K.
ASSUMPTION 2.7. K is compact.
The following assumption says that all states communicate with bounded cost and time.
ASSUMPTION 2.8. There exist constants C < [infinity], [GAMMA] < [infinity] such that for every measurable subset S [subset or equal to] X there is a decision rule f: X\S [right arrow] A with the property that for every x' [member of] X\S there exists a finite integer N and a set of states [x.sub.0], [x.sub.1], ..., [x.sub.N] with
* [x.sub.0] = x',
* [a.sub.n] = f([x.sub.n]) [member of] A([x.sub.n]) for every n = 0, ..., N - 1,
* [x.sub.n+1] = s([x.sub.n], [a.sub.n]) for every n = 0, ..., N - 1,
* [x.sub.n+1] [member of] S,
* [[summation].sup.N-1.sub.n=0] c([x.sub.n], [a.sub.n]) [less than or equal to] C, and
* [[summation].sup.N-1.sub.n=0] [tau]([x.sub.n], [a.sub.n]) [less than or equal to] [GAMMA].
This communication assumption holds for a wide variety of deterministic continuous time infinite time horizon inventory problems, Adelman and Klabjan [2]. For example, the classical joint replenishment problem has this property.
The following important theorems are given in Adelman and Klabjan [2].
THEOREM 2.1 (STRONG DUALITY). Under Assumptions 2.1-2.8, there exists an optimal primal-dual solution pair ([[mu].sup.*], ([[rho].sup.*], [u.sup.*])) [member of] (M(K), (R, B(X))) such that inf(P) = sup(D) and complementary slackness holds, i.e., for [[mu].sup.*]-almost all (x, a) [member of] K we have
[tau](x, a)[[rho].sup.*] + [u.sup.*](x) = c(x, a) + [u.sup.*](s(x, a)).
Because (P) and (D) are solvable, instead of denoting by inf(P), sup(D) their respective values, we denote them by rain(P) and max(D). The next theorem states that there is an optimal deterministic stationary policy.
THEOREM 2.2 (GENERAL EXISTENCE RESULT). Under Assumptions 2.1-2.8, there exists a decision rule [f.sup.*]: X [right arrow] A such that
J(x) = J([f.sup.*], x) = [J.sup.*] for all x [member of] X.
In the present work, we give an algorithm that solves (P) and (D). The main idea is to approximate u by certain functions, which we introduce next.
3. Ridge functions. Let [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] for j [member of] [n] = {1, 2, ..., n} be real numbers in an interval [-[OMEGA], [OMEGA]] except that [b.sup.j.sub.0] < -[OMEGA] and [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. Each such ordered set of numbers is denoted by [b.sup.j]. For each j [member of] [n] and i [member of] [[m.sub.j]] let [H.sup.j.sub.i]: [-[OMEGA], [OMEGA]] [right arrow] R be defined as
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]
otherwise.
Note that [H.sup.j.sub.i]([b.sup.j.sub.i]) = 1 and supp([H.sup.j.sub.i]) = ([b.sup.j.sub.i-1], [b.sup.j.sub.i+1]), see Figure 1. By supp(x) we denote the support set of a function. These functions are known as hat functions or also B-splines of degree 1, and they form the basis among the spline functions of degree 1, see, e.g., Bojanov et al....
|