|
Article Excerpt We introduce a new algorithm based on linear programming for optimization of average-cost Markov decision processes (MDPs). The algorithm approximates the differential cost function of a perturbed MDP via a linear combination of basis functions. We establish a bound on the performance of the resulting policy that scales gracefully with the number of states without imposing the strong Lyapunov condition required by its counterpart in de Farias and Van Roy [de Farias, D. P., B. Van Roy. 2003. The linear programming approach to approximate dynamic programming. Oper. Res. 51(6) 850-865]. We investigate implications of this result in the context of a queueing control problem.
Key words: approximate dynamic programming; linear programming; average cost
MSC2000 subject classification: Primary: 90C39, 90C40; secondary: 90C59
OR/MS subject classification: Primary: dynamic programming/optimal control, Markov infinite state; secondary: linear programming, large-scale systems
History: Received September 12, 2004; revised December 6, 2005.
1. Introduction. Over the past few years, there has been a growing interest in linear programming (LP) approaches to approximate dynamic programming (DP) (Adelman [1], de Farias and Van Roy [7, 9], Farias and Van Roy [11], Gordon [14], Guestrin [15], Guestrin et al. [16, 17], Hauskrecht and Kveton [19], Schuurmans and Patrascu [32], Schweitzer and Seidman [33], and Trick and Zin [unpublished manuscript, 1993; 34]). We refer to the methodology of LP-based approximate dynamic programming as approximate linear programming (ALP). The main idea in ALP is as follows. A classic result in dynamic programming is that Bellman's equation can be solved by solving a linear programming instead (D'Epenoux [10] and Manne [25]). ALP attempts to alleviate the curse of dimensionality by combining the LP formulation of Bellman's equation with approximation of the DP value function via a linear combination of preselected basis functions. The resulting LP is used for computing weights in the linear combination. A control policy that is "greedy" with respect to the resulting approximation is then used to make real-time decisions.
Empirically, ALP appears to generate effective control policies for high- dimensional dynamic programs (Adelman [1], de Farias and Van Roy [9], Farias and Van Roy [11], Guestrin et al. [17], and Schuurmans and Patrascu [32]). There is also evidence that, compared to other approximate dynamic programming approaches, ALP is competitive and its execution might in fact be orders of magnitude faster (Farias and Van Roy [11], Guestrin et al. [17], and Schuurmans and Patrascu [32]). At the same time, the strength of theoretical results about such algorithms has overtaken counterparts available for alternatives such as approximate value iteration, approximate policy iteration, and temporal-difference methods. As an example, a result in de Farias and Van Roy [9] implies that, for a discrete-time finite-state Markov decision process (MDP), if the span of the basis functions contains the constant function and comes within a distance of e of the dynamic programming value function, then the approximation generated by a certain LP will come within a distance of O([member of]). Here, the coefficient of the O([member of]) term depends on the discount factor and the metric used for measuring distance but not on the choice of basis functions. On the other hand, the strongest results available for approximate value iteration and approximate policy iteration only promise O([member of]) error under additional requirements on iterates generated in the course of executing the algorithms (Bertsekas and Tsitsiklis [3] and Munos [30]). In fact, it has been shown that even when [member of] = 0, approximate value iteration can generate a diverging sequence of approximations (Boyan and Moore [5], Gordon [12, 13], and Tsitsiklis and Van Roy [35]).
We propose and analyze a new ALP formulation for approximating the dynamic programming solution. Previous ALP analysis found in the literature focuses on finite-state discrete- time MDPs with discounted-cost criterion. We consider average-cost problems in discrete and continuous time, involving countable state spaces. As a side benefit, the analysis leads to an apparently new condition for the solution of Bellman's equation via linear programming to be the optimal differential cost function, when the state space is countable. However, we consider the main contribution of the paper to be the derivation of an ALP formulation that is suitable for average-cost problems.
Challenges in average-cost ALP. The theoretical analysis of average-cost dynamic programming, especially in the case of infinite state spaces, is notoriously more involved than that of discounted-cost problems. In ALP the situation is no different. Previous analysis for discounted-cost problems establishes that, with an appropriate choice of algorithm parameters in ALE the method exhibits certain desirable error guarantees. In particular, these guarantees suggest that the performance of the method will not degrade as it is applied to problems of increasing dimensions. However, attempts to extend the formulation and analysis to average-cost problems lead to at least two fundamental difficulties:
* As discussed in de Farias and Van Roy [9], a central concept in discounted- cost ALP is that of state-relevance weights. State-relevance weights are parameters in the ALP method that can be chosen to specify how errors in the approximation of the cost-to-go function over different system states should be emphasized. It is shown in de Farias and Van Roy [9] that state-relevance weights might have a first-order impact on the quality of the policy generated by ALE State-relevance weights appear naturally when one extends the LP formulation of the discounted-cost Bellman's equation to approximate DE However, the analogous extension in the average-cost setting, although seemingly intuitive and natural, does not include state- relevance weights and does not allow for specification of how to control the trade-off between approximation errors over different portions of the state space.
* Existing error analysis for discounted-cost ALP is based on the use of Lyapunov functions (de Farias and Van Roy [9]). The existence of a suitable Lyapunov function that is contained in the span of the basis functions used to approximate the DP solution is instrumental in ensuring that the ALP satisfies certain desirable properties. In particular, it ensures that the ALP is feasible and leads to approximation error bounds that scale gracefully with problem size. The main stumbling block in extending this analysis to average-cost problems is the condition that must be satisfied by the Lyapunov function. In the discounted-cost setting, the presence of the discount factor introduces some slack in the condition and allows for a suitable Lyapunov function to be derived. However, in the average-cost case the Lyapunov condition is overly restrictive.
In de Farias and Van Roy [8], we proposed a two-step ALP formulation for average-cost problems. The first step involves an LP for estimating the optimal average cost. The second step uses that estimate in another LP, which directly approximates the differential cost function. The formulation has the advantage of including state-relevance weights, which again can be used to control the trade-offs in approximating the differential cost function over different states. To extend the discounted-cost line of analysis, de Farias and Van Roy [8] introduce a slightly different definition of Lyapunov function. Under certain technical conditions (e.g., existence of a state that is recurrent under all policies), a Lyapunov function can be shown to exist. However, the resulting approximation error bound is unlikely to scale gracefully with problem size. In particular, the constants involved in the bound are expected to grow exponentially in the number of state variables of the system, which defeats the ALP objective of alleviating the curse of dimensionality. Moreover, even though a Lyapunov function can be shown to exist, finding such a function or ensuring that it is spanned by the basis functions could itself be a daunting task.
The cost-shaping LP and a performance bound. The aforementioned issues motivate the development of a new ALP formulation. We introduce a perturbed version of the MDP and add a cost-shaping term to the ALP. We derive a bound on the expected increase in average cost due to using the ALP policy in lieu of an optimal policy. The bound is stated in terms of the quality of the preselected basis functions. Specifically, the quality of a set of basis functions is measured by the best approximation error that can be attained if they are used to approximate the optimal differential cost function of the perturbed MDP. We show that the loss in performance related to the ALP policy is proportional in a certain sense to the best- possible approximation error. We consider the new formulation and associated performance bound to be the main contribution of this paper. Some of the most important aspects of the performance bound are discussed next.
Lyapunov-function-like bounds, without the Lyapunov condition. In the analysis of the new ALP, the cost-shaping term is shown to have a role analogous to the Lyapunov function. There are, however, two main differences. First, the cost-shaping term can be chosen arbitrarily and does not have to satisfy any condition. Hence, the Lyapunov function condition is relaxed. Furthermore, even when the cost-shaping term does satisfy the Lyapunov condition, the error and performance bounds that can be derived are strengthened. Specifically, they do not exhibit an undesirable dependence on the Lyapunov function that appears in the corresponding hounds obtained with the previous analysis (de Farias and Van Roy [9]).
Error vs. performance bounds. An important question about any approximate DP algorithm that approximates the solution to Bellman's equation is what guarantees can be established on the resulting approximation error. The analysis of ALP in de Farias and Van Roy [9] focuses primarily on that question, developing an error bound that establishes that, in terms of a certain choice of norm, the difference between the approximate and exact cost-to-go functions is proportional to the best that can be achieved by the approximation architecture. Perhaps an even more important question is whether the policy that results from using the cost-to-go approximation exhibits good performance. This is partially addressed in de Farias and Van Roy [9] through a bound that relates the loss in performance due to using the suboptimal ALP policy to a certain norm of the approximation error. However, there is a mismatch between the norm that is used in the bound on the approximation error and the one that is required in the performance analysis. In this paper, we emphasize the performance aspect by phrasing our results in terms of the expected loss of performance due to using the suboptimal policy. The norm mismatch issue is captured through a constant present in the performance bound.
Relation to Bellman error minimization. An alternative approach for approximate DP aims at minimizing "Bellman error" (this idea was first suggested in Schweitzer and Seidman [33]). Methods proposed for this (e.g., Bertsekas [2] and Harmon [18]) involve stochastic steepest descent of a complex nonlinear function. There are no results indicating whether...
|