In this study, we develop a multivariate generalization of the negative binomial distribution (NBD). This new model has potential application to situations where separate NBDs are correlated, such as for page views across multiple websites. In turn, our page view model is used to predict the audience for Internet advertising campaigns. For very large Internet advertising schedules, a simple approximation to the multivariate model is also derived. In a test of nearly 3,000 Internet advertising schedules, the two new models are compared with some proprietary and nonproprietary models previously used for Internet advertising and are shown to be significantly more accurate.
Key words: advertising; Internet marketing; media; probability models
1. Introduction
A key measure of website activity is page views, also known as page impressions or page requests, which are the number of distinct pages served to a Web user over the duration of his or her visit to a domain (Bhat et al. 2002). The underlying reason for the increasing importance of page views is that Web pages often carry banner ads or sponsored search links to other websites. Now that annual Internet advertising spending has reached $12.5 billion and is growing consistently and strongly (Internet Advertising Bureau 2006), accurate models for estimating campaign audiences are required (Meskauskas 2003). Models that produce well-known advertising audience measures such as reach and frequency are in particular demand to help give online advertising credibility alongside traditional advertising media such as television (Smith 2003).
A necessary starting point for a model of Internet advertising exposure is a probability model for page views. However, an alternative to a probability model is an empirical distribution based on historical data. For instance, in recent years, Nielsen/Netratings and comScore Media Metrix have enlarged their Web user panels to the point where a probability model is seemingly redundant. In particular, the comScore Media Metrix data set in the United States totals 100,000 panelists, which allows an advertiser to accurately compile an empirical exposure distribution for an Internet ad campaign, both for the entire panel and for demographic subgroups. Indeed, empirical distributions are the basis of Nielsen's and Telmar's online media planning software. In §5.1, we show that the real challenge for Internet media models is not to estimate the audience for a historical data set, since the availability of large panels enables audience measures to be estimated empirically with high accuracy. Instead, the pertinent problem is to predict audiences for a time period in the future based on historical data. For example, none of the probability models considered by Leckenby and Hong (1998) are adaptable to the predictive environment. Therefore, the purpose of this research is to develop a model that can accurately predict audience exposure measures for Internet advertising campaigns by using a model for page views. In doing so, we address the limitations of previous models by tailoring our model to the Internet environment, but we retain the ability to report media exposure measures that are familiar to traditional media buyers.
Internet ad campaigns often run across several websites simultaneously, and these sites are often similar in purpose (travel websites, for example), so page view counts are correlated among the sites (Li et al. 2002). Such correlations will affect the audience measures for Internet campaigns (Leckenby and Hong 1998) and thereby present a modeling challenge. Hence, our model of page views is multivariate without assuming independence between pairs of sites, and additionally allows for the possibility of different time periods for ad delivery on each website. Our results show our model to be significantly more accurate than all previous models, including those based on empirical distributions from large panels.
2. Modeling Page Views and Internet Advertising
There is a long history of models being used in traditional media such as television and print (Chandon 1986, Danaher 1992, Leckenby and Kishi 1982, Rust 1986). Models have been used in these media primarily to estimate reach, the proportion of the target audience exposed to at least one ad; frequency, the average number of exposures among those reached; gross rating points (GRPs), the average number of exposures (with GRPs = reach × frequency); and the exposure distribution (ED), the proportion of the target audience exposed to none, just one, just two, etc., ads.
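As a rough illustration of these definitions, the following sketch computes the exposure distribution, reach, frequency, and GRPs from a list of per-person exposure counts. The function name `audience_measures` and the sample data are hypothetical and chosen only for illustration. GRPs are expressed here as average exposures per person, following the definition in the text, rather than as percentage points.

```python
from collections import Counter


def audience_measures(exposures):
    """Compute ED, reach, frequency and GRPs from per-person exposure counts.

    `exposures` holds one non-negative integer per member of the target
    audience (hypothetical input format, assumed for this sketch).
    """
    n = len(exposures)
    if n == 0:
        raise ValueError("exposures must contain at least one person")

    # Exposure distribution: proportion of people with exactly k exposures.
    counts = Counter(exposures)
    ed = {k: counts[k] / n for k in sorted(counts)}

    # Reach: proportion exposed to at least one ad.
    reach = 1.0 - ed.get(0, 0.0)

    # GRPs: average number of exposures across the whole target audience.
    grps = sum(exposures) / n

    # Frequency: average number of exposures among those reached,
    # so that GRPs = reach * frequency.
    frequency = grps / reach if reach > 0 else 0.0

    return {"ed": ed, "reach": reach, "frequency": frequency, "grps": grps}


# Toy data: ten people, five of whom saw at least one ad.
sample = [0, 0, 1, 2, 0, 3, 1, 0, 0, 1]
measures = audience_measures(sample)
print(measures)
```

In this toy example, half of the audience is reached and those reached see 1.6 ads on average, which gives 0.8 exposures per person overall.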
2.1. Modeling Issues Specific to Online Advertising
For online media to be accepted by advertisers and advertising agencies, online publishers must also be able to apply traditional media language, particularly GRPs, to their medium (Meskauskas 2003). However, advertising space is bought in a fundamentally different way online than in offline media. (1) The primary difference is that online campaigns are often purchased on the basis of "ad impressions." An online ad impression is some form of advertisement (e.g., banner, interstitial, pop-up, etc.) that is served to a website's user during the course of a visit. A typical online ad campaign might comprise 50,000 to 200,000 ad impressions (http://computer.howstuffworks.com/banner-ad.htm). While a user is surfing a particular website, he downloads different pages depending on the links he clicks. As each page is assembled, advertisements are added to the page by the site's server. Each page served could have different ads embedded within it. However, the more pages a user requests, the more likely it is that he will receive several exposures to the same ad, especially if he visits the site multiple times over several weeks. Hence, the key issue in the online environment is that users determine the rate of page view delivery depending on where and how often they click on items within a session. This contrasts with traditional media, where the broadcaster/publisher controls the delivery of advertisements to its audience.
Estimating the audience for an Internet advertising campaign is further complicated by issues such as possibly having multiple ads per page, ads on just the homepage, and frequency capping, whereby websites limit the number of ads served to a computer by using "cookies." Handling these issues requires data not just from the page server, but also from the ad server. User-centric Web browsing data (such as the comScore Media Metrix data used in this study) has only page views and no record of ads served. Hence, we model page views/impressions, which are conceptually similar to ad impressions, since all ads are served on Web pages. If an advertiser is fortunate enough to additionally have data on the advertising regime, then the "ad view data" simply replaces the page view data.
A further difference between online and offline media models is the data available for model fitting. Models in traditional media are generally limited to using published figures on single-vehicle reach and pairwise duplications (Rust 1986). However, online media models usually have large-scale individual-level panel data available, as detailed in §4.2.
2.2. Previous Page View and Internet Advertising Models
To the best of our knowledge, only one page view model exists, being a multivariate discretized version of the Tobit model developed by Li et al. (2002). Their use of the Tobit model is justified because a large proportion of Web users do not visit particular sites, creating a "spike" at zero page views for each website. In addition, page views are nonnegative integers, so the Tobit must be "discretized." Last, Li et al. (2002) recognize the need to allow for correlations in page views across different website categories, so they generalize the univariate Tobit model to one that has a multivariate normal distribution. They apply their model to page views of comScore Media Metrix data, as we do, but their primary purpose is to uncover patterns in browsing behavior across categories of websites like auction and portal sites and test the effects of user demographics on such browsing. Still, their model can be adapted to predicting Internet audiences, so we compare our model with their multivariate discretized Tobit in §5.4.
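To make the idea of a "spike" at zero and a discretized count concrete, the sketch below gives one possible univariate discretization of a Tobit model. It is an assumption for illustration only and is not necessarily the scheme used by Li et al. (2002): a latent normal variable with mean `mu` and standard deviation `sigma` maps to zero page views when it is at or below zero, and to k page views when it falls in the interval (k - 1, k]. The function names are hypothetical, and the multivariate extension is not shown.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def discretized_tobit_pmf(k, mu, sigma):
    """P(page views = k) under an assumed discretization of a Tobit model.

    Latent z ~ Normal(mu, sigma^2); observed count is 0 if z <= 0 and
    k if k - 1 < z <= k (illustrative convention, not taken from the source).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if k < 0:
        return 0.0
    if k == 0:
        # All latent mass at or below zero is censored to zero page views.
        return normal_cdf((0.0 - mu) / sigma)
    return normal_cdf((k - mu) / sigma) - normal_cdf((k - 1 - mu) / sigma)


# A negative latent mean produces a large spike at zero page views.
mu, sigma = -1.0, 2.0
pmf = [discretized_tobit_pmf(k, mu, sigma) for k in range(61)]
print("P(0 page views) =", round(pmf[0], 4))
print("total probability =", round(sum(pmf), 6))
```

With a negative latent mean, most of the probability mass is censored to zero, which mimics the large share of Web users who never visit a given site.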
To date, only three nonproprietary models have been developed specifically for online advertising. Of these, the most comprehensive are Leckenby and Hong's (1998) and Huang and Lin's (2006) studies, with Wood's (1998) model essentially a curve-fitting method rather than a formal model. Leckenby and Hong compare some well-known models from offline media such as the beta-binomial distribution (BBD) (Metheringham 1964) and the Dirichlet-multinomial (Leckenby and Kishi 1984). To use these models, Leckenby and Hong (1998) had to artificially aggregate the panel-based website exposure data in a way that forced it into the same format as that used in offline media. Rather than restricting the number of exposures to coincide with a prespecified time period, as done by Leckenby and Hong (1998), our model allows each person's exposure level to range from zero to infinity. This is more appropriate for the Internet, where there is varying exposure opportunity per website visitor. Huang and Lin (2006) avoid the problems of Leckenby and Hong's (1998) model by allowing exposures to range upward from zero, but their model requires the duration of advertising on each website to be the same and ignores duplication of exposure between websites. Our model does not have these limitations.
Proprietary models for reach and frequency prediction include Nielsen/Netratings' "WebRF" and Telmar's "WebPlanner" models. Both use individual-level panel data to build an empirical exposure distribution. Another proprietary model is one developed by Atlas DMT (www.atlasdmt.com), which combines site-centric ad server information with comScore Media Metrix panel data (Smith 2003). No technical details about this model are available except that it is based on a simulation method, although Chandler-Pepelnjak (2004) reports that the average prediction error for reach for this model is 20%. Later, we demonstrate that this is much higher than the 5% average prediction error for our model.
3. Model Development
3.1. Notation
A formal statement of the exposure distribution (ED) setup is as follows. Let $X_i$ be the number of exposures (2) a person has to media vehicle $i$, $X_i = 0, 1, 2, \ldots$, $i = 1, \ldots, m$, where $m$ is the number of different vehicles. The exposure random variable to be modeled is $X = \sum_{i=1}^{m} X_i$, the total number of exposures to an advertising schedule. Although $X$ is a simple sum of random variables, two nonignorable correlations make modeling it difficult (Danaher 1989). One is the intravehicle correlation due to repeat viewing/visits to the same vehicle (Danaher 1989, Morrison 1979) and the other is intervehicle correlation, where there might be an overlap in exposure to two vehicles.
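The effect of intervehicle correlation on the distribution of total exposures can be seen in a small sketch. The toy panel below is invented for illustration, and the helper names are hypothetical. Each row is one person and each column one vehicle. The observed ED of the total is compared with the ED that would result if the two vehicles were treated as independent, obtained by convolving their marginal distributions.

```python
from collections import Counter


def total_exposures(panel):
    """Total exposures per person: the row sums of the person-by-vehicle counts."""
    return [sum(row) for row in panel]


def distribution(values):
    """Empirical distribution: proportion of people at each count."""
    n = len(values)
    counts = Counter(values)
    return {k: counts[k] / n for k in sorted(counts)}


def convolve(d1, d2):
    """Distribution of the sum of two counts if they were independent."""
    out = {}
    for a, pa in d1.items():
        for b, pb in d2.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return dict(sorted(out.items()))


# Invented panel: six people, two vehicles with strongly overlapping audiences.
panel = [
    [0, 0],
    [0, 0],
    [2, 1],
    [3, 2],
    [0, 0],
    [1, 1],
]

observed = distribution(total_exposures(panel))
marginal_1 = distribution([row[0] for row in panel])
marginal_2 = distribution([row[1] for row in panel])
independent = convolve(marginal_1, marginal_2)

print("observed reach:   ", 1 - observed.get(0, 0.0))
print("independent reach:", 1 - independent.get(0, 0.0))
```

In this toy panel the same people visit both vehicles, so assuming independence overstates reach (0.75 against an observed 0.5). This is the kind of distortion that a model of the joint distribution is meant to avoid.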
In the case of the print media, for example, observed empirical EDs are known to be particularly "lumpy" due to strong intravehicle correlation. As a consequence, Danaher (1988, 1989, 1991) shows that it is necessary to first model the joint multivariate distribution of $(X_1, X_2, \ldots, X_m)$, from which the distribution of total exposures $X = \sum_{i=1}^{m} X_i$ can be derived. This is less of a problem with television EDs (Rust 1986) where loyalty from episode to episode is generally moderate, with intra-exposure duplication factors of the order 0.28 (Ehrenberg and Wakshlag 1987). In addition, for the television environment there are more vehicle choices than for the print medium (Krugman and Rust 1993) and this helps to reduce both intra- and intervehicle correlation. As a result, models for just $X$ rather than the full multivariate $(X_1, X_2, \ldots, X_m)$ are often adequate for television EDs, which tend to be smooth...