Home | Business News | Browse by Publication | M | Mathematics of Operations Research

Risk-sensitive and risk-neutral multiarmed bandits.

Publication: Mathematics of Operations Research
Publication Date: 01-MAY-07
Format: Online
Delivery: Immediate Online Access

Article Excerpt
For the multiarmed bandit, the classic result is probabilistic: each state of each bandit (Markov chain with rewards) has an index that is determined by an optimal stopping time for that state's bandit, and expected discounted income is maximized by playing at each epoch a bandit whose current state has the largest index. Our approach is analytic, not probabilistic. It uses pairwise comparison in place of stopping times. A simple recursion assigns to each state of each bandit a utility and an amplification of future utility that depend solely on the data for that state's bandit. These utilities and amplifications determine whether or not one state dominates another. We show that it is optimal to play at each epoch any bandit whose current state is not dominated by the current states of the other bandits. We obtain this result by a coherent analysis that encompasses three models--one with risk-averse exponential utility, one with risk-seeking exponential utility, and one with linear utility and either stopping or discounting. We also show that the risk-seeking case and a model of Nash [Nash, P. 1980. A generalized bandit problem. J. Roy. Statist. Soc. B 42 165-169] are equivalent to each other.

Key words: multiarmed bandits; exponential utility; risk-sensitive Markov decision processes; optimal stopping

1. Introduction. The much-studied multiarmed bandit is the following decision problem: At Epochs 1, 2, ... , a decision maker observes the current state of each of several bandits (Markov chains with rewards) and plays one of them. The bandits that are not played remain in their current states. The bandit that is played evolves for one transition according to its transition probabilities, earning an immediate income (possibly negative) that can depend on its current state and on the state to which transition occurs. Each transition takes one epoch, and its occurrence renews the cycle of events. Typically, the goal has been to maximize the expectation of the discounted total income with discount factor [alpha] ([alpha] < 1) per epoch.

To ease the exposition, we introduce language that distinguishes the states of the multiarmed bandit problem from the states of the individual bandits. From now on, the latter states will be called nodes. Thus, individual bandits evolve from node to node (not from state to state) according to transition probabilities, and each state (of the multiarmed bandit) prescribes a node for each of the individual bandits.

For a multiarmed bandit, Gittins and Jones [12] first demonstrated the key result: each node of each bandit has an index that depends solely on that bandit's data, and expected discounted income is maximized by playing, at each epoch, a bandit whose node has the largest index. Generalizations and novel proofs of this result have been devised by Nash [19], Whittle [30, 32], Varaiya et al. [26], Katehakis and Veinott [17], Weiss [29], Weber [28], E1 Karoui and Karatzas [10], Tsisiklis [25], Katehakis and Rothblum [16], Kaspi and Mandelbaum [15], Dumitriu et al. [9], and Katta and Sethuraman [18], among others. Their analyses are probabilistic in that each state's index is determined by an optimal stopping time. Certain of these papers indicate how to compute indices; see also Kallenberg [14] and Sonin [24].

By contrast, our analysis is analytic. Stopping times are set aside. In their place, we use a simple recursion to assign to each node i of each bandit a utility r(i) and an amplification a(i) of future utility that enables pair-wise comparison: Node i is said to dominate node j if

r(i) 4- a(i)r(j) > r(j) + a(j)r(i), (1)

and node i is said to be undominated if no node j of any bandit dominates node i, equivalently, if

r(i) + a(i)r(j) [greater than or equal to] r(j) + a(j)r(i) [for all] j [member of] N, (2)

where N denotes the set of all nodes of all bandits. Domination leads to a ranking of the nodes of the bandits. We show that it is optimal to play at each epoch any bandit whose current node has the highest rank. With appropriate values of the r(i)s and a(i)s, this line of analysis applies to three models: to a new model with risk-averse exponential utility function, to a new model with risk-seeking exponential utility function, and to a generalization of the risk-neutral model in which stopping occurs with probability of one, independent of the policy.

The recursion that is the crux of our analysis builds on an insight of Tsisiklis [25]. It is sketched here and developed subsequently. The initial values of the utilities and amplifications are determined from the problem's data in a natural way. A node i that is not dominated by any node of any bandit is identified. An interchange argument shows that it is optimal to play node i's bandit for each state of the multi-armed bandit that includes node i (see Theorems 5.1 and 5.2). Playing node i's bandit whenever node i is observed leads to modification of the data for node i's bandit so that transition to node i occurs with probability of 0, but the set of optimal policies is unaffected (see Theorem 6.1). Node i is ranked highest and is then set aside. Among the nodes that remain and using the modified utilities and amplifications, an undominated node [i.sup.*] is found, the data are modified a second time, node [i.sup.*] is ranked second highest, and it too is set aside. Repetition of this procedure is shown to rank nodes so that it is optimal to play the bandit whose node is ranked highest among those nodes that are currently observed (see Theorem 7.1). Theorem 7.1 also bounds the work (number of computer operations) required by this recursion. For linear-utility models, similar recursions had been obtained by Katta and Sethuraman [18] and by Sonin [24].

This recursion requires comparisons across bandits. To deal individually with the bandits, we find it necessary to classify the nodes into four categories and refine the notion of domination (see Equation (44)) in a way that breaks ties by preferring nodes in lower-numbered categories. With refined domination, the same data modification scheme assigns to each node i a finalized utility [bar.r](i) and amplification a(i) so that, given any state, it is optimal to play any bandit whose current node is not dominated (in its refined form) by the current node of any other bandit (see Theorem 8.1). Theorem 8.1 also shows that this method reduces the work somewhat; the number of computer operations it requires is proportional to the sum over k of the cube of the number of nodes in bandit k. The authors know of no method with a better worst-case work bound.

Our results relate directly to a lovely paper by Nash [19]. Using a notion of similarity that is due to Veinott [27], we show that Nash's model and the risk-seeking case are equivalent. We also show how to use domination to find directly optimal policies for Nash's model, without first converting it to an equivalent risk-seeking model. Related arguments show that Nash's model is equivalent to the special case of the risk-averse model in which each bandit k has a transition matrix [q.sub.k] (defined in [section] 3) that is transient, and to the special case of the risk-neutral model in which all utilities are nonnegative. Links between our work and Nash's enhance the value of both models and strengthen the ties to Tsisiklis [25].

The multiarmed bandit has myriad uses. Books by Presman and Sonin [21] and Gittins [11] describe early applications to search, scheduling, pharmaceutical research, and project management. More recent contributions can be found in Pinedo [20], in a burgeoning literature within economics; cf, Schlag [23], in Denardo et al. [8], and in Denardo et al. [6]. Lucid accounts of the multiarmed bandit can be found in several texts, including Whittle [31], Ross [22], and Bertsekas [1].

2. The multiarmed bandit. The multiarmed bandit is now introduced. The composite problem consists of K bandits that are numbered 1 through K. For k = 1, ... ,K, bandit k has a finite node set [N.sub.k]. These node sets are disjoint, and b(j) denotes the node set (bandit) of which j is a member, i.e., j 6 Nb(j). The set N of all nodes equals [N.sub.1] [union] ... [union] [N.sub.k].

Playing bandit k while its node is i causes transition of that bandit to node j with probability p(i, j), in which case income w(i, j) is earned. Playing bandit k while its node is i causes termination, rather than transition to some other node, with probability p(i) given by

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.]

in which case income w(i) is earned. Income can be positive, negative, or zero. Playing bandit k does not change the nodes of the other bandits. Termination stops the entire decision process, not merely access to the bandit that was played most recently.

Each state of the K-bandit problem prescribes one node per bandit. Because the bandits' node sets are disjoint, we can (and do) define a state s of the K-bandit problem as a set of K nodes containing one node per bandit, and we write each state s as s = {[s.sub.1], [s.sub.2], ... ,[s.sub.K]}, with [s.sub.k] [member of] [N.sub.k] for k = 1, ... ,K. The set S of states of the K-bandit problem is isomorphic to the Cartesian product [N.sub.1] x ... x [N.sub.K]. With s as any state, [s.sub.k] denotes its node of bandit k, so that s [intersection] [N.sub.k] = {[s.sub.k]}. Also, with s as any state, [s.sub.j[left arrow]k] denotes the state g whose nodes are identical to those of s, except that j has become the node of bandit k, so [s.sub.k[left arrow]k] [intersection] [N.sub.k] = {j} and [s.sub.j[left arrow]k]\[N.sub.k] = s\{[s.sub.k]}.

The risk-neutral, risk-averse, and risk-seeking models are dubbed Cases RN, RA, and RS, respectively. In Case RN, the utility u(X) of X units of income equals X itself. In Case RA, the utility of X units of income is given by u(X) = -[e.sup.-[lambda]X], where h is a fixed positive number that is known as the coefficient of risk aversion. In Case RS, the utility of X units of income is u(X) = [e.sup.-[lambda]X], where [lambda] is a fixed negative number that is (again) known as the coefficient of risk aversion.

For each case, it will prove useful to employ the local utility function h(s, k, v) that was introduced in Denardo [2, 3]; this function is defined for each state s, each action k, and each function v mapping the states into the real numbers, and h(s, k, v) denotes the expected utility for the truncated one-transition model in which one observes state s, plays bandit k, and receives terminal utility v(t) if transition occurs to state t. We will soon indicate why

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.] (3)

with data r(i) and q(i, j) that are specified in (4)-(5). In Cases RA and RS, there is no discounting (the per period discount factor a equals 1), but in Case RN, discounting can occur (0 < a [less than or equal to] 1). For each node i in N,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.] (14)

and for each pair (i, j) of nodes in N,

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.] (15)

Equations (3)-(5) are first justified for Case RN, for which r(i) is the expected income earned in the current epoch, q(i, j) equals product of the transition probability p(i, j), and the per epoch discount factor a, and (3) holds because the expectation of the sum equals the sum of the expectations. In Cases RA and RS, (3)-(5) describe the expected utility because the exponential function is multiplicative.

Termination, as described above, stops the play of all of the bandits. Stopping the play of a particular bandit can be modeled by adding an absorbing node to its node set, with a prohibitively high cost of playing that bandit when it is in its absorbing node. Similarly, voluntary stopping of the entire decision process can be modeled by adding a bandit that has only one node i with p(i) = 1 and with w(i) equated to the utility (possibly negative) that is earned by stopping.

The risk-neutral model accommodates a discounted income stream via a < 1, but the risk-sensitive models do not. To see why, note that in the risk-averse case, with discount factor a < 1, receiving income X now and Y one epoch hence has utility -[e.sup.-[lambda](X+[alpha]Y)] so the discount factor [alpha] is discounting (reducing) the coefficient of risk aversion; for this model, a stationary policy cannot be optimal, and a dynamic programming recursion would entail coefficients of risk aversion equal to [lambda][[alpha].sup.n] for n = 0, 1, ...

Within the economics literature, however, some Robinson Crusoe models use [alpha] < 1 to discount benefit rather than income; with an exponential utility function, these models satisfy (3), (4), and the variant of (5) in which q(i, j) = [alpha]p(i, j)[e.sup.-[lambda]w(i,j)].

2.1. Nomenclature. For each case, the data in (4)-(5) that relate to bandit k array themselves into the [absolute value [N.sub.k]] x 1 utility vector [r.sub.k] and an [absolute value of[N.sub.k]] x [absolute value of [N.sub.k]] transition matrix [q.sub.k] whose entries are

[r.sub.k](i)=r(i) [for all]I [member of] [N.sub.k] and [q.sub.k](i,j) = q(i,j) [for all]i, j [member of] [N.sub.k].

Related nomenclature is now introduced. We denote as [p.sub.k] the [asolute value of [N.sub.k] x [absolute value of [N.sub.k]] matrix whose ijth entry equals p(i, j). The matrix [P.sub.k] is substochastic because it is nonnegative and because each of its rows sums to 1 or less. A nonempty subset C of [N.sub.k] is said to be a final class if

[summation over j[member of]C]p(I, j) = 1, [[infinity].summation over n=0][[([p.sub.k]).sup.n]](i, j) > 0, [for all]I, j [member of] C,

that is, if escape from C is impossible and if it is possible to get from any state in C to any other with positive probability. A square matrix Q is said to be transient if each entry in [Q.sup.n] approaches as n [right arrow] [infinity]. The spectral radius of a square matrix Q is the largest of the absolute values of Q's eigenvalues.

With Q as any matrix, Qc, o denotes the submatrix of Q whose rows are in C and whose columns are in D. Similarly, with v as any column vector, [[upsilon].sub.c] denotes the subvector whose rows are in C. The symbol e denotes the vector of l's of appropriate size. Finally, if x and y are vectors of the same size, the statement x [much greater than] y means x(i) > y(i) for each i.

2.2. Hypotheses. The hypotheses under which our models are analyzed concern the local utility function h(s, d, v) as given by (3), but do not concern the formulas in (4) and (5) for r(i) and q(i, j). For Case RA, this hypothesis concerns final classes, for which reason it needs to be restated in terms of the r(i)s and the q(i, j)s. In Case RA, a nonempty subset C of [N.sub.k] is now said to be a final class if

[summation over n [not member of]C] q(i, n) = 0, [[infinity].summation over n=0][[([q.sub.k]).sup.n]](i, j)>0, r(i) = [for all]I, j [member of]C.

In Case RA, (7) is equivalent to (6) because r(i) = if and only if p(i) = and q(i, j) = if and only if p(i, j) : 0.

The hypotheses under which we analyze cases RN, RS, and RA appear below in their respective order:

HYPOTHESIS RN. For each k, the matrix [q.sub.k] is nonnegative, substochastic, and transient, and u(0) = 0.

HYPOTHESIS RS. For each k, the matrix [q.sub.k] is nonnegative and transient, u(0) = 1, and each node i has r(i) > 0.

HYPOTHESIS RA. For each k, the matrix [q.sub.k] is nonnegative, [q.sub.k] is transient for at least one k, u(0) : -1, and each node i has r(i) [less than or equal to]...

View this article FREE - Now for a Limited Time, try Goliath Business News
Free for 3 Days!



More articles from Mathematics of Operations Research
On the closedness of the linear image of a closed convex cone., May 01, 2007
Secret correlation in repeated games with imperfect monitoring., May 01, 2007
Secret correlation in repeated games with imperfect monitoring: the ne..., May 01, 2007
Pseudomonotonicity and economic equilibrium problem in reflexive Banac..., May 01, 2007
Mathematical programs with complementarity constraints: convergence pr..., May 01, 2007

Looking for additional articles?
Search our database of over 3 million articles.

Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication name or publication date.

About Goliath
Whether you're looking for sales prospects, competitive information, company analysis or best practices in managing your organization, Goliath can help you meet your business needs.

Our extensive business information databases empower business professionals with both the breadth and depth of credible, authoritative information they need to support their business goals. Whether it be strategic planning, sales prospecting, company research or defining management best practices - Goliath is your leading source for accurate information.