
The most influential paper Gerard Salton never wrote.

Publication: Library Trends
Publication Date: 22-MAR-04

Article Excerpt
ABSTRACT

Gerard Salton is often credited with developing the vector space model (VSM) for information retrieval (IR). Citations to Salton give the impression that the VSM must have been articulated as an IR model sometime between 1970 and 1975. However, the VSM as it is understood today evolved over a longer time period than is usually acknowledged, and an articulation of the model and its assumptions did not appear in print until several years after those assumptions had been criticized and alternative models proposed. An often-cited overview paper titled "A Vector Space Model for Information Retrieval" (alleged to have been published in 1975) does not exist, and citations to it represent a confusion of two 1975 articles, neither of which was an overview of the VSM as a model of information retrieval. Until the late 1970s, Salton did not present vector spaces as models of IR generally but rather as models of specific computations. Citations to the phantom paper reflect an apparently widely held misconception that the operational features and explanatory devices now associated with the VSM must have been introduced at the same time it was first proposed as an IR model.

INTRODUCTION

In a tribute written for the Journal of the American Society for Information Science (JASIS) (Crouch et al., 1996), Carolyn Crouch declares that Gerard Salton was more than just the leading authority in the field of information retrieval (IR). For thirty years, Crouch writes, "Gerry Salton was information retrieval" (p. 108). During times when the significance of computational IR research was in doubt, Salton defended and supported it "through the sheer force of his own personality and reputation" (Crouch et al., 1996, p. 108). Crouch's sentiments are echoed in the memoriam by Salton's other colleagues and former proteges, who reflect on his many contributions in research, teaching, writing, editing, and service to scholarly societies. They cite the textbooks he wrote, the SMART system developed under his leadership, the scholars he mentored, and many other contributions. Donna Harman reminds the reader that Salton investigated "the use of the vector space model in clustering, relevance feedback, automatic linking, book indexing, passage retrieval, visualization, and many other areas" (Crouch et al., 1996, p. 108).

It is hardly surprising that Dr. Harman would cite Salton's pioneering research in the vector space model (VSM) for information retrieval: there are numerous citations crediting Salton with the original development of that IR model, as well as responses commenting on its advantages and limitations and proposing extensions or alternatives to it (Bollmann-Sdorra & Raghavan, 1993, 1998; Raghavan & Wong, 1986; Wong & Raghavan, 1984; Wong, Ziarko, & Wong, 1985; Wong, Ziarko, Raghavan, & Wong, 1986, 1987; McGill & Huitfeldt, 1979; Singhal, 2001; Howland & Park, 2004; Kobayashi & Aono, 2004). What is surprising, however, is that there is evidence that the VSM evolved over a much longer period of time than is usually acknowledged and that Salton did not publish an articulation of the model and its assumptions until several years after criticisms of those assumptions had been leveled and alternative models proposed (see section 7 below).

In giving credit to Salton for the vector model, a number of authors cite an overview paper titled "A Vector Space Model for Information Retrieval," which some show as published in JASIS in 1975 and others as published in the Communications of the Association for Computing Machinery (CACM) in 1975. In fact, no such article was ever published, and citations to it usually represent a confusion of two 1975 articles (Salton, Wong, & Yang, 1975; Salton, Yang, & Yu, 1975), neither of which was an overview of the VSM as it is generally understood (see section 5 below). Some of Salton's own colleagues have been guilty of this mistake: both Cardie et al. and Singhal cite the CACM version, for example (Singhal, 2001; Cardie, Ng, Pierce, & Buckley, 2000). The paper is even cited in a few of the very last articles on which Salton is listed as a coauthor (Singhal, Salton, Mitra, & Buckley, 1996; Singhal & Salton, 1995). These papers were published close to or shortly after the time of his death, and so the errors cannot be blamed on Salton (remembered by his colleagues as a very careful and meticulous writer).

Another irony--one representing a more fitting tribute to Salton's legacy--is that locating papers containing the mistaken citation is very difficult using conventional citation databases such as the Web of Science. But discovery of the errors is greatly aided by search engines such as Google and CiteSeer--systems that employ techniques similar to those that Salton himself refined and recommended. The following papers were found in this way, and each cites one version or the other of the bibliographic ghost: McCabe, Lee, Chowdhury, Grossman, & Frieder, 2000; Theophylactou & Lalmas, 1998; Arampatzis, van der Weide, Koster, & van Bommel, 2000; Chen, 2001; Jiang & Littman, 2001; Nallapati, 2003. This leads us to the following questions: How did this mistake occur, and how was it perpetuated to the degree that it was? The answer seems to lie in a misconception widely held even by people who cite Salton's publications correctly: it is assumed that a description of the VSM must have been published sometime around 1975, even though it was not characterized as an IR model at that time.

VECTOR SPACES AND MATHEMATICAL MODELS

We begin with a description of the VSM that Salton included in chapter 10 of his 1989 book on automatic text processing. That treatment includes the following characterization:

1. The VSM (like the Boolean and probabilistic models) represents information retrieval systems and procedures.

2. Global measures of similarity (such as the cosine measure) are computed between queries and documents.

3. Queries and documents are represented by term sets.

4. Both queries and documents can then be represented as ordered term vectors.

5. The components of the vectors are numbers representing either the importance of a term or simply the presence or absence of a term (1 or 0, respectively).
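
The five features listed above can be made concrete with a short sketch. It is an illustration only and is not taken from Salton's text: the vocabulary, the sample query and documents, and the helper names `term_vector` and `cosine` are all hypothetical, and binary (presence or absence) components are assumed unless term weights are supplied.

```python
import math

def term_vector(terms, vocabulary, weights=None):
    """Turn a term set into an ordered vector over a fixed vocabulary.

    Hypothetical helper: components are 1.0/0.0 (presence/absence) unless
    a dict of term importance weights is supplied.
    """
    present = set(terms)
    vector = []
    for term in vocabulary:
        if term not in present:
            vector.append(0.0)
        elif weights is None:
            vector.append(1.0)
        else:
            vector.append(float(weights.get(term, 0.0)))
    return vector

def cosine(u, v):
    """Cosine measure between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Assumed toy data, for illustration only.
vocabulary = ["information", "retrieval", "vector", "library"]
query = {"vector", "retrieval"}
documents = {
    "d1": {"information", "retrieval", "vector"},
    "d2": {"library"},
}

q_vec = term_vector(query, vocabulary)
scores = {
    name: cosine(q_vec, term_vector(terms, vocabulary))
    for name, terms in documents.items()
}
# Rank documents by decreasing similarity to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the query and each document begin as term sets (feature 3), become ordered vectors over a shared vocabulary (features 4 and 5), and are compared with a single global similarity measure (feature 2). Passing a `weights` dictionary replaces the binary components with term importance values.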

As mentioned above, these features originated considerably earlier than the publications usually credited with defining the VSM. Salton himself, however, did not publish a full articulation of the VSM as a retrieval model until this chapter, which appeared years after he was publicly credited with having invented the VSM.

The VSM is a mathematical model. Generalizing a definition by Rutherford Aris, Davis and Hersh (1981) define a mathematical model as a consistent mathematical structure designed to correspond to some physical, biological, social, psychological, or conceptual entity. They cite a number of uses for mathematical models, including:

1. predicting events in the physical world

2. guiding observation or experimentation

3. fostering conceptual understanding

4. assisting the "axiomatization of the physical situation" (Davis & Hersh, 1981, p. 78)

5. promoting progress in mathematics

So there are any number of ways in which the VSM might represent an advance for or contribution to IR research or systems design. Clarifying the particular role it plays as a model calls for a closer look at how vector representations are used to model other domains. The vector space is a very general and flexible abstraction, used...
