Home | Industry Information | Business News | Browse by Publication | M | Mathematics of Operations Research

On the empirical state-action frequencies in Markov decision processes under general policies.

Publication: Mathematics of Operations Research
Publication Date: 01-AUG-05
Format: Online
Delivery: Immediate Online Access

Article Excerpt
We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, in...

View more below

Read this article now - Try Goliath Business News - FREE!   
You can view this article PLUS...

  • Over 5 million business articles
  • Hundreds of the most trusted magazines, newswires, and journals (see list)
  • Premium business information that is timely and relevant
  • Unlimited Access

Now for a Limited Time, try Goliath Business News - Free for 7 Days!
Tell Me More   Terms and Conditions

Purchase this article for $4.95

Already a subscriber? Log in to view full article

...under some policy, a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.

Key words: Markov decision processes; state-action frequencies; large deviations: empirical measure

MSC2000 subject classification: Primary: 90C40; secondary: 60F99, 60B10

OR/MS subject classification: Primary: probability, Markov processes; secondary: dynamic programming/optimal control, Markov finite state

History: Received March 6, 2003; revised April 6, 2004.

1. Introduction. We consider a Markov decision process (MDP) that satisfies a weak communication assumption and describe a polytope of possible state-action frequency vectors. We show that for every point in the polytope, there exists a policy that gets "very close" to that point. More accurately, for every point in the polytope, we specify a policy that guarantees that the empirical state-action frequency vector converges to that point, with probability one. Moreover, we show that under the prescribed policy, the probability of a large distance between the point and the empirical state-action frequency vector decays exponentially with time. On the other hand, we show that no policy can "get far" from this polytope even without the weak communication assumption. Specifically, we show that the probability of a large distance between the empirical state-action frequency vector and the polytope decays exponentially with time, uniformly over all admissible policies.

While the emphasis of this work is on bounds on the empirical frequencies, we also derive some apparently new results on state-action frequency polytopes. Under the weak communication assumption, our results establish that the polytope we consider is the same as the set of possible limits (both in expectation and almost surely) of the empirical frequency vector under different policies. This extends results in Derman [7] and Puterman [15], which assumed a unichain structure. These references also showed that every point in the polytope can be achieved by a stationary policy. In contrast, for the more general case that we consider, nonstationary policies may be necessary. We note that in Kallenberg [12], a related polytope was defined for every Markov decision process, without any communication assumptions. However, the framework of Kallenberg [12] is too general to be useful for our purposes. In particular, some communication assumption is necessary in order to establish that every point in the polytope is a possible limit of the empirical frequency vector.

The primary motivation for this work arises in the fields of adaptive control and reinforcement learning (e.g., Kumar and Varaiya [13], Bertsekas and Tsitsiklis [4], Sutton and Barto [17]). The policies used by learning algorithms are typically nonstationary. For this reason, it is useful to have a complete characterization of the possible behaviors of empirical state-action frequencies under general (not necessarily stationary) policies. For instance, there are certain bounds on the probability of a large distance between the empirical frequencies and their limit, under the assumption that such a limit exists (Altman and Zeitouni [2]). Our results indicate that similar bounds apply to the case of general, nonstationary policies.

Another motivation comes from the context of exploration in dynamic environments. Suppose that we wish to visit at least k times every state of a controlled Markov chain with known transition probabilities, where k is a large number. This may be the case if we desire to take a large number of measurements at each state, or in a "needle in a haystack" problem, where each state needs to be examined several times in order to identify whether something unique happens at that state. Under an appropriate accessibility assumption, it can be shown that the best possible expected time for achieving this goal is of the form [eta]k + o(k), where [eta] is a positive constant that can be computed in terms of the transition probabilities. Using the results in this paper, a stronger property is obtained, namely, that there exists a policy under which the time it takes, [T.sub.k], satisfies [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], for every [epsilon] > 0, where c and d are some positive constants. Moreover, for every policy, [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], that is, no policy can sample more efficiently.

Yet another motivation arises from the connection between the average rewards per unit time in finite and infinite horizon problems. An important question, for a finite-horizon problem, is whether one can gain substantially by using a time-dependent policy rather than a stationary one. Our results indicate that the probability of a substantial gain is exponentially small in the time horizon.

Regarding related research, let us mention that there are large deviations results for the empirical state-action frequency vector in finite-state Markov processes (see, e.g., Dembo and Zeitouni [6]). These results were extended to Markov decision processes in Altman and Zeitouni [2], which obtained uniform convergence rates over the class of stationary policies. The case of nonstationary policies that have a limit was also considered to some extent in Altman and Zeitouni [2].

The question of achievable rates of convergence for controlled processes was considered in Shimkin [16]. The model therein is essentially a single-state decision process in which a decision maker may choose between sampling several stationary reward populations. Lower and upper bounds were provided on the probabilities of rare events under arbitrary policies. Of a somewhat different flavor is a Hoeffding-type inequality for bounded functions of uniformly ergodic Markov chains, which was derived in Glynn and Ormoneit [9]. We note that this reference provides an error exponent that is tighter than ours, but these results are essentially dependent on the stationary nature of the underlying policy.

The rest of the paper is organized as follows. In [section] 2, we start by defining the model of interest. In [section] 3, we introduce state-action frequency polytopes. In [section] 4, we show that for every element of the polytope, there exists a policy under which the empirical state-action frequency vector converges to that point, in a strong sense. In [section] 5, we derive a large deviations bound for the distance of the empirical state-action frequency vector from the polytope. In [section] 6, we generalize and obtain bounds on the probability of large deviations of an empirical vector-valued reward. In [section] 7, we provide some brief concluding remarks. The appendix contains the proofs of some of lemmas used in our development.

2. Problem definition. We consider a Markov decision process (MDP) with finite state and action spaces. The MDP is formally defined by a triplet (T, A, P), where

(a) S = {1, ..., S} is a finite set of states.

(b) A = {1, ..., A} is a finite set of actions which is assumed, for simplicity, to be the same for all states.

(c) P is the conditional probability law. Namely, P(s' | s, a) is the probability that the next state is s', given that the current state is s and that action a was taken.

At every time epoch t, the decision maker observes the current state [s.sub.t] and chooses an action [a.sub.t]. Then the next state [s.sub.t+1] is chosen, according to P(* |[s.sub.t], [a.sub.t]). For a finite set B, we will use [DELTA](B) to denote the set of all probability distributions on B. A policy is a mapping from the set of possible past histories to the set [DELTA](A), which prescribes the probability of any particular action for every given history. A stationary policy is a policy that depends only on the current state.

Given a history (1) ([s.sub.1], [a.sub.1], [s.sub.2], [a.sub.2], ..., [s.sub.t], [a.sub.t], [s.sub.t+1]), we define the empirical state-action-state frequencies by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (1)

where [I.sub.E] stands for the indicator function of an event E. Note that the empirical frequency vector [[??].sub.t], with components [[??].sub.t] (s, a, s'), is a vector in [DELTA](T x A x T).

Without any stationarity assumption on the underlying policy, we cannot expect [[??].sub.t] to have a limit. Our main objective is to show that [[??].sub.t] has to be close (with high probability) to the set of expected frequency vectors that can be attained using stationary policies. We will therefore start by characterizing the latter set, which is the subject of the next section.

3. State-action polytopes. In this section, we introduce a polytope and characterize it as the set of feasible limiting expected state-action-state frequencies. This characterization is used in [section] 4 to show that the elements of this polytope are feasible empirical frequencies in a rather strong sense.

Given an MDP, the state-action polytope, X, is defined as the set of vectors x in [DELTA](T x A) that satisfy

[summation over (s)] [summation over (a)] P(s' |s, a)x(s, a) = [summation over (a)'] x(s', a'), [for all] s'. (2)

We let (as in Puterman [15]) [x.sup.[pi], [alpha]] [member of] [DELTA](T x A)...

NOTE: All illustrations and photos have been removed from this article.



More articles from Mathematics of Operations Research
Optimal investments for robust utility functionals in complete market ..., August 01, 2005
Lex-optimal online multiclass scheduling with hard deadlines., August 01, 2005
An alternative algorithm for counting lattice points in a convex polyt..., August 01, 2005
The scenario generation algorithm for multistage stochastic linear pro..., August 01, 2005
A semidefinite programming approach to optimal-moment bounds for conve..., August 01, 2005

Looking for additional articles?
Search our database of over 3 million articles.

Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication name or publication date.

About Goliath
Whether you're looking for sales prospects, competitive information, company analysis or best practices in managing your organization, Goliath can help you meet your business needs.

Our extensive business information databases empower business professionals with both the breadth and depth of credible, authoritative information they need to support their business goals. Whether it be strategic planning, sales prospecting, company research or defining management best practices - Goliath is your leading source for accurate information.