Home | Business News | Browse by Publication | M | Mathematics of Operations Research

Regret minimization under partial monitoring.

Publication: Mathematics of Operations Research
Publication Date: 01-AUG-06
Format: Online
Delivery: Immediate Online Access

Article Excerpt
We consider repeated games in which the player, instead of observing the action chosen by the opponent in each game round, receives a feedback generated by the combined choice of the two players. We study Hannan-consistent players for these games, that is, randomized playing strategies whose per-round regret vanishes with probability one as the number n of game rounds goes to infinity. We prove a general lower bound of [OMEGA]([n.sup.1/3]) for the convergence rate of the regret, and exhibit a specific strategy that attains this rate for any game for which a Hannan- consistent player exists.

Key words: repeated games; Hannan consistency; imperfect monitoring; internal regret

MSC2000 subject classification: Primary: 91A20, 62L12; secondary: 68Q32

OR/MS subject classification: Primary: computer science-artificial intelligence, decision analysis-sequential; secondary: games/group decisions-noncooperative

History: Received April 4, 2005; revised April 6, 2006.

**********

1. A motivating example. A simple yet nontrivial example of partial monitoring is the following dynamic pricing problem. A vendor sells a product to a sequence of customers whom he attends one by one. To each customer, the seller offers the product at a price he selects, say, from the interval [0, 1]. The customer then decides to buy the product or not. No bargaining is possible, and no other information is exchanged between buyer and seller. The goal of the seller is to achieve an income almost as large as if he knew the maximal price each customer is willing to pay for the product. Thus, if the price offered to the tth customer is [p.sub.t], and the highest price this customer is willing to pay is [y.sub.t] [member of] [0, 1], then the loss of the seller is [y.sub.t] - [p.sub.t] if the product is sold and (say) a constant c > if the product is not sold. Formally, the loss of the vendor at time t is

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.],

where c [member of] [0, 1]. (In another version of the problem the constant c may be replaced by [y.sub.t]. In this case it is easy to see that all terms depending on [y.sub.t] cancel out when considering the regret, and we obtain the bandit setting referred to as online posted price mechanism in, e.g., Kleinberg and Leighton [30], Blum et al. [9], Blum and Hartline [7]--see below.) In either case, if the seller knew in advance the empirical distribution of the [y.sub.t]s, then he could set a constant price q [member of] [0, 1], which minimizes his overall loss. A natural question is whether there exists a randomized strategy for the seller such that his average regret

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.]

is guaranteed to converge to zero as n [right arrow] [infinity] regardless of the sequence [y.sub.1], [y.sub.2], ... of prices. The difficulty in this problem is that the only information the seller (i.e., the forecaster) has access to is whether [p.sub.t] > [y.sub.t], but neither [y.sub.t] nor l ([p.sub.t], [y.sub.t]) are revealed. One of the main results of this paper describes a simple strategy such that the average regret defined above is of the order of [n.sup.-1/5].

We treat such limited-feedback (or partial-monitoring) prediction problems in a more general framework that we describe next. The dynamic pricing problem described above, which is a special case of this more general framework, has also been investigated by Blum and Hartline [7], Blum et al. [9], and Kleinberg and Leighton [30] in a simpler setting where the reward of the seller is defined as [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.]. Note that by using the feedback information (i.e., whether the customer bought the product or not), here the seller can compute the value of [rho]([p.sub.t], [y.sub.t]). Therefore, their game reduces to an instance of the multiarmed bandit game (see Example 2.1 below) with a continuous action space.

2. Main definitions. We adopt a learning-theoretic viewpoint and describe partial monitoring as a repeated prediction game between a forecaster (the player) and the environment (the opponent). In the same spirit, we call

PREDICTION WITH PARTIAL MONITORING

Parameters: number of actions N, number of outcomes M, loss function l, feedback function h. For each round t = 1, 2, ...,

(1) the environment chooses the next outcome [y.sub.t] [member of] {1, ..., M} without revealing it;

(2) the forecaster chooses a probability distribution [p.sub.t] over the set of N actions and draws an action [I.sub.t] [member of] {1, ..., N} according to this distribution;

(3) the forecaster incurs loss l([I.sub.t], [y.sub.t]) and each action i incurs loss l(i, [y.sub.t]), where none of these values is revealed to the forecaster;

(4) the feedback h([I.sub.t], [y.sub.t]) is revealed to the forecaster.

the actions taken by the environment outcomes. At each round t = 1, 2, ... of the game, the forecaster chooses an action [I.sub.t] from the set {1, ..., N}, and the environment chooses an action [y.sub.t] from the set {1, ..., M}. The losses of the forecaster are summarized in the loss matrix L = [[l(i, J)].sub.NxM]. (This matrix is assumed to be known by the forecaster.) Without loss of generality, we rescale the losses so that they all lie in [0, 1]. If at time t the forecaster chooses an action [I.sub.t] [member of] {1, ..., N} and the outcome is [y.sub.t] [member of] {1, ..., M}, then the forecaster suffers loss l([I.sub.t], [y.sub.t]). However, instead of the outcome [y.sub.t], the forecaster only observes the feedback h([I.sub.t], [y.sub.t]), where h is a known feedback function that assigns to each action/outcome pair in {1, ..., N} x {1, ..., M} an element of a finite set I = {[s.sub.1], ..., [s.sub.m]} of signals. The values of h are collected in a feedback matrix H = [[h(i,j)].sub.NxM].

Note that we do not make any restrictive assumption on the power of the opponent. The environment may choose action [y.sub.t] at time t by considering the whole past, that is, the whole sequence of action/outcome pairs ([I.sub.s], [y.sub.s]), s = 1, ..., t - 1. Without loss of generality, we assume that the opponent uses a deterministic strategy, so that the value of [y.sub.t] is fixed by the sequence ([I.sub.1], ..., [I.sub.t-1]). In Comparison, the forecaster has access to significantly less information because he knows only the sequence of past feedbacks, (h([I.sub.1], [y.sub.1]), ..., h([I.sub.t-1], [y.sub.t-1])).

We note here that some authors consider a more general setup in which the feedback could be random. For the sake of clarity, we treat the simpler model described above and return to the more general case in [section] 7.

It is an interesting and complex problem to investigate the possibilities of a predictor supplied only with the limited information of the feedback. In this paper we focus on the average regret

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.]

that is, the difference between the average (per-round) loss of the forecaster and the average (per-round) loss of the best action. Forecasting strategies guaranteeing that the average regret converges to zero almost surely for all possible strategies of the environment are called Hannan consistent after James Hannan, who first proved the existence of a Hannan-consistent strategy in the full-information case (Hannan [23]) when h(i,j) = j for all i, j (i.e., when the true outcome [y.sub.t] is revealed to the forecaster after taking an action). The full-information case has been studied extensively in the theory of repeated games and in the fields of learning theory and information theory. A few key references and surveys include Blackwell [6], Cesa-Bianchi et al. [14], Cesa-Bianchi and Lugosi [10], Feder et al. [16], Foster and Vohra [19], Hart and Mas-Colell [25], Littlestone and Warmuth [31], Merhav and Feder [35], and Vovk [40, 41].

A natural question one might ask is under what conditions on the loss and feedback matrices it is possible to achieve Hannan consistency, that is, to guarantee that, asymptotically, the cumulative loss of the forecaster is not larger than that of the best constant action with probability one. Naturally, this depends on the relationship between the loss and feedback functions. An initial answer to this question has been provided by the work of Piccolboni and Schindelhauer [37]. However, because they are concerned only with expected performance, their results do not imply Hannan consistency. In addition, their bounds have suboptimal rates of convergence. Below, we extend those results by showing a forecaster that achieves Hannan consistency with optimal convergence rates.

Note that the forecaster is free to encode the values h(i,j) of the feedback function by real numbers. The only restriction is that if h(i,j) = h(i,j'), then the corresponding real numbers should also coincide. To avoid ambiguities by trivial rescaling, we assume that [absolute value of h(i,j)] [less than or equal to] 1 for all pairs (i,j). Thus, in the sequel we assume that H = [[h(i,j)].sub.NxM] is a matrix of real numbers between -1 and 1, and we keep in mind that the forecaster can replace this matrix by [H.sub.[phi]] = [[[[phi].sub.i](h(i,j))].sub.NxM] for arbitrary functions [[phi].sub.i]: [-1,1] [right arrow] [-1,1], i = 1, ..., N. Note that the set I of signals may be chosen such that it has m [less than or equal to] M elements, although after numerical encoding the matrix might have as many as MN distinct elements.

The problem of partial monitoring was considered by Mertens et al. [36], Rustichini [38], Piccolboni and Schindelhauer [37], and Mannor and Shimkin [32]. The forecaster strategy studied in [section] 3 is first introduced in Piccolboni and Schindelhauer [37], where its expected regret is shown to have a sublinear growth. Rustichini [38] and Mannor and Shimkin [32] consider a more general setup in which the feedback is not necessarily a deterministic function of the pair outcome and the forecaster's action, but it might be random with a distribution indexed by this pair. Based on Blackwell's approachability theorem, Rustichini [38] establishes a general existence result for strategies with asymptotically optimal performance in...



More articles from Mathematics of Operations Research
Many-to-one stable matching: geometry and fairness., August 01, 2006
A cost-shaping linear program for average-cost approximate dynamic pro..., August 01, 2006
Nonparametric estimation of market distribution functions in electrici..., August 01, 2006
Excludability and bounded computational capacity., August 01, 2006

Looking for additional articles?
Search our database of over 3 million articles.

Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication name or publication date.

About Goliath
Whether you're looking for sales prospects, competitive information, company analysis or best practices in managing your organization, Goliath can help you meet your business needs.

Our extensive business information databases empower business professionals with both the breadth and depth of credible, authoritative information they need to support their business goals. Whether it be strategic planning, sales prospecting, company research or defining management best practices - Goliath is your leading source for accurate information.