|
Clickstream data provide information about the sequence of pages or the path viewed by users as they navigate a website. We show how path information can be categorized and modeled using a dynamic multinomial probit model of Web browsing. We estimate this model using data from a major online bookseller. Our results show that the memory component of the model is crucial in accurately predicting a path. In comparison, traditional multinomial probit and first-order Markov models predict paths poorly. These results suggest that paths may reflect a user's goals, which could be helpful in predicting future movements at a website. One potential application of our model is to predict purchase conversion. We find that after only six viewings purchasers can be predicted with more than 40% accuracy, which is much better than the benchmark 7% purchase conversion prediction rate made without path information. This technique could be used to personalize Web designs and product offerings based upon a user's path.
Key words: personalization; multinomial probit model; hierarchical Bayes models; hidden Markov chain models; vector autoregressive models
History: This paper was received November 18, 2002, and was with the authors 7 months for 3 revisions; processed by Peter Fader.
1. Introduction
One of the original promises of the Web was that online stores would be able to fully realize the potential of interactive marketing (Blattberg and Deighton 1991, Hoffman and Novak 1996, Alba et al. 1997) through personalization (Pal and Rangaswamy 2003, Ansari and Mela 2003). Currently, online stores target visitors (Mena 2001) using many types of information, such as demographic characteristics, purchase history (if any), and how the visitor arrives at the online store (i.e., did the user find the site through a bookmark, search engine, or link on an email promotion). Another potentially rich--but underutilized--source of information is clickstream data, which record the navigation path that a user takes through the website (Montgomery 2001). Unfortunately, marketers have lacked a methodology for analyzing path information (Bucklin et al. 2002). Our paper proposes a new model that draws upon past work in choice modeling (Rossi et al. 1996, Paap and Franses 2000, Haaijer and Wedel 2001) to extract information from the path. In particular, we develop a statistical model that analyzes the page-by-page viewings of a visitor as he browses through a website.
Path data may contain information about a user's goals, knowledge, and interests. The path brings a new facet to predicting consumer behavior that analysts working with scanner data have not considered. Specifically, the path encodes the sequence of events leading up to a purchase, as opposed to looking at the purchase occasion alone. To illustrate this point consider a user who visits the Barnes and Noble website, barnesandnoble.com (B&N). Suppose the user starts at the home page and executes a search for "information rules," selects the first item in the search list, and is directed to a product page with detailed information about the book Information Rules by Shapiro and Varian (1998). Alternatively, another user arrives at the home page, goes to the business category, surfs through a score of book descriptions, repeatedly backing up and reviewing pages, and finally views the same Information Rules product page.
Which user is more likely to purchase a book: the first or second? Intuition would suggest that the directed search and the lack of information review (e.g., selecting the back button) by the first user indicate an experienced user with a distinct purchase goal. The meandering path of the second user suggests a user who has no specific goal and is unlikely to purchase, but is simply surfing or foraging for information (Pirolli and Card 1999). It would appear that a user's path can inform about a user's goals and potentially predict future actions.
Our proposed statistical model can make probabilistic assessments about future paths, including whether the user will make a purchase. Our results show that the first user is more likely to purchase. Moreover, our model can be applied generally to predict any path through the website. For example, which user is more likely to view another product page or leave the website entirely within the next five clicks? Potentially this model could be used for website design or setting marketing mix variables. For example, the site design could be dynamically changed by adding links to helpful pages if it is known that a user is less likely to purchase, while the site could become more streamlined if it is known that a user is likely to purchase. A simulation study using our model suggests that purchase conversion rates could be improved using the prediction of the model, which could substantially increase operating profits.
From a marketing perspective, there has been recent interest in mining Web data to predict purchase conversion (Moe and Fader 2004, Moe et al. 2002, Park and Fader 2004). These studies have focused on web-browsing behavior using session-level data. These aggregate data are quite different from the page-level clickstream data we consider. One criticism of aggregate clickstream data is that sequential information is lost, while in our click-by-click level analysis it is retained. Because websites must interact with users dynamically, these sequencing data are crucial.
Sismeiro and Bucklin (2003) do consider some sequencing information. Specifically, they model the completion of tasks that correspond with groups of Web pages. However, our work is much more detailed because we are modeling page-level movements through a website and not through collections of pages that correspond to tasks. This requires our model to be much more flexible because the paths we observe do not have the nice, sequential properties that the tasks in Sismeiro and Bucklin's model do.
We also contrast our work to that of Ansari and Mela (2003), who consider the personalization of email messages--but whose work could potentially be applied in a clickstream environment. Again, the basic difference is the type of data we consider; the data dictate many modeling differences. Their data are derived from user clicks on hyperlinks to personalized emails. These emails may be separated by many days, hence modeling the dependence between choices is not crucial. Their choice model assumes independence both within a page and across time. In contrast, our goal is to focus on the sequence of the choices made, which tend to occur within seconds of one another. Hence we find it critical to introduce correlation across choices, as well as to introduce time series elements to capture the timing of the choices.
2. Clickstream Data
Given that clickstream data may be unfamiliar to many readers, we first explain our data, describe how they are collected, and conduct an exploratory data analysis to motivate the model we introduce in §3. Our data are derived from a panel of Web users maintained by Jupiter Media Metrix, which is now known as Comscore Media Metrix (CMM). CMM randomly recruits a representative sample of personal computer (PC) users and tracks these users' usage at home (Coffey 1999). These users agree to install a computer program (or PC meter) that runs in the background and monitors computer usage. It records any URL viewed by the user in his browser window. Because the meter records the actual pages viewed in the browser window, it avoids the caching problems commonly found by recording page requests at an Internet Service Provider (ISP) or a Web server. However, the meter does not distinguish how the user navigates between pages (e.g., whether the user selects a hyperlink, a bookmark, or directly types in the URL to navigate to a page). In addition, the meter does not record the content of the page, but records only the URL.
2.1. Descriptive Analysis and Defining the Path
Our dataset consists of 1,160 users who visited Barnes and Noble online by going to barnesandnoble.com, books.com, or bn.com between April 1, 2002, and April 30, 2002. (We abbreviate references to barnesandnoble.com as B&N.) This dataset represents all users in the full CMM panel who visited B&N for April 2002, which is almost 6% of the full panel. We selected B&N for our analysis because it is a popular online bookstore and has a relatively clean and stable site structure compared with other online stores. Although we use clickstream data collected by CMM, our methodology could be applied directly to clickstream data collected from B&N's Web servers. Again, our reason for using CMM clickstream data is that they are available to the authors, and that they are more complete and have a cleaner format than Web-server logs (Pitkow 1997).
First, we define the following terms to describe Web browsing: page request, page viewing, and session. A page request refers to a user's requesting a URL through her browser program. In turn this page request will appear as a hit in the server's log file. A page viewing refers to the actual rendering of a page request in the user's browser window. A user may hit the back button in his browser window to review a page, which will generate another page viewing but not a page request. (Instead, the browser program will render the page from a previously stored or cached copy.) Often, pages are viewed multiple times, so page viewings generally exceed page requests. Finally, a session is defined as a period of sustained Web browsing or a sequence of page viewings. If a user has not viewed any pages for 20 minutes, we assume that the viewing session has ended and that the next page viewing marks the beginning of a new session. Sessions include all of a user's page viewings, both at B&N and other sites.
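The 20-minute inactivity rule for sessions can be sketched in a few lines of Python. This is only an illustration of the definition above, not the authors' processing code: the function name `split_sessions`, the `(timestamp, url)` record layout, and the choice to start a new session when the gap is exactly 20 minutes are all assumptions.

```python
from datetime import datetime, timedelta

# Inactivity threshold taken from the session definition in the text.
SESSION_GAP = timedelta(minutes=20)


def split_sessions(viewings, gap=SESSION_GAP):
    """Group one user's (timestamp, url) page viewings into sessions.

    A new session starts whenever the time since the previous viewing
    is at least `gap` (treating exactly 20 minutes as a break is an
    assumption; the text does not specify the boundary case).
    """
    sessions = []
    for ts, url in sorted(viewings):
        previous_ts = sessions[-1][-1][0] if sessions else None
        if previous_ts is not None and ts - previous_ts < gap:
            sessions[-1].append((ts, url))  # continue the current session
        else:
            sessions.append([(ts, url)])    # start a new session
    return sessions


if __name__ == "__main__":
    start = datetime(2002, 4, 1, 9, 0)
    example = [
        (start, "home"),
        (start + timedelta(minutes=5), "product"),
        (start + timedelta(minutes=30), "home"),  # 25-minute gap
    ]
    for number, session in enumerate(split_sessions(example), start=1):
        print(number, [url for _, url in session])
```

Because sessions include viewings at all sites, a B&N-only analysis would filter each resulting session for B&N URLs after the split rather than before it.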
Our 1,160 users requested 9,180 unique URLs or pages at B&N on 14,512 viewing occasions over the course of 1,659 sessions. The average B&N page was viewed 1.5 times. The average number of B&N pages viewed during a session was 8.75. The number of B&N viewings during a session ranged from 2 to 239, with the median being 5 viewings. Most users had only one or two sessions that included activity at B&N; fewer than 25% of our users had more than two sessions. Of these 1,659 sessions, 114 included a purchase (two sessions had two purchases), which yields a purchase conversion rate of 7%. (This rate is higher than the industry average, either due to B&N's success or the fact that our estimate is not contaminated by automated traffic from search engines and robots, as is commonly the case.)
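As a quick arithmetic check, the per-session average and the conversion rate follow directly from the totals reported above. The variable names below are illustrative only.

```python
# Totals as reported in the text.
sessions = 1659
viewings = 14512
purchase_sessions = 114

# Average number of B&N page viewings per session.
viewings_per_session = viewings / sessions

# Share of sessions that included at least one purchase.
conversion_rate = purchase_sessions / sessions

print(f"viewings per session: {viewings_per_session:.2f}")
print(f"purchase conversion:  {conversion_rate:.1%}")
```

The conversion rate is computed per session, not per user, which matches how the 7% benchmark is stated.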
The descriptive statistics for the demographic information about our user sample are given in Table 1. All of our demographic variables, except age, are coded as dummy variables. Notice that the average user is 46 years old with a range from 9 to 89 years old; slightly more than half are female; most are white; most have some college education; and most have higher-than-average incomes. While it is unlikely that B&N would have such detailed information, we include this information to assess its predictive power; in the future it is possible that online retailers could purchase these data from online vendors.
Potentially, the clickstream is a very rich data source because the full text and HTML content of each URL is known (or can be recaptured). Practically, however, without some structure it is difficult to analyze these free-format and textual data. We choose to do so by focusing on the category that corresponds with each page viewed. Every page is classified into one of eight categories: Home, Account, Category, Product, Information, Shopping Cart, Order, and Enter/Exit pages. (See Technical Report Appendix C at http://mktsci.pubs.informs.org for our text matching algorithm to categorize pages and an example session.)
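To give a sense of what text matching on URLs might look like, the following is a minimal sketch that maps a URL path to one of the page categories listed above. It is not the algorithm from Technical Report Appendix C: the rule list, the URL patterns, the rule order, and the fallback to Information are all invented for illustration and do not describe B&N's actual site structure. Enter/Exit is not handled here, since it marks arrival at or departure from the site and is not a URL pattern.

```python
import re

# Hypothetical (pattern, category) rules, checked in order.
# The patterns are invented for illustration only.
RULES = [
    (re.compile(r"/(cart|basket)"), "Shopping Cart"),
    (re.compile(r"/checkout"), "Order"),
    (re.compile(r"/(account|login)"), "Account"),
    (re.compile(r"/(isbn|product)"), "Product"),
    (re.compile(r"/(subjects|browse)"), "Category"),
    (re.compile(r"^/?$"), "Home"),
]


def categorize(path):
    """Return the category of the first rule whose pattern matches `path`.

    Paths that match no rule fall back to "Information"; this default
    is an assumption made for the sketch.
    """
    for pattern, category in RULES:
        if pattern.search(path):
            return category
    return "Information"


if __name__ == "__main__":
    session = ["/", "/subjects/business", "/product/12345", "/cart/view"]
    print([categorize(path) for path in session])
```

Reducing each URL to a category turns a session into a short sequence of categorical states, which is the form of path that a choice model can work with.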
Redish (2002) proposed this categorization scheme as a common taxonomy across e-commerce sites based on a task analysis of what users want to do on an e-commerce site from a human-computer interaction standpoint. Moe et al. (2002) also employed a similar classification scheme. The home page is a common starting point for new tasks. Account pages are used for logins, address changes, and order status reviews. Category pages present lists of...
|