ABSTRACT
This paper focuses on the practical limitations in the content and software of the databases used to calculate the h-index for assessing the publishing productivity and impact of researchers. To celebrate F. W. Lancaster's biological age of seventy-five and "scientific age" of forty-five, it discusses the related features of Google Scholar, Scopus, and Web of Science (WoS), and demonstrates in the last of these how a much more realistic and fair h-index can be computed for F. W. Lancaster than the one produced automatically. In my estimate, the cited reference index of the 1945-2007 edition of WoS contains over a hundred million "orphan references," which have no counterpart master records to be attached to, and "stray references," which cite papers that do have master records but cannot be identified by the matching algorithm because of errors of omission and commission in the references of the citing works. Browsing and searching this index can bring up hundreds of additional cited references to the works of an accomplished author that are ignored in the automatic process of calculating the h-index. The partially manual process doubled the h-index value for F. W. Lancaster from 13 to 26, which is a much more realistic value for an information scientist and professor of his stature.
INTRODUCTION
The h-index was developed by Professor Jorge E. Hirsch of the Department of Physics at the University of California, San Diego. It was published in the prestigious Proceedings of the National Academy of Sciences (Hirsch, 2005) soon after its preprint appeared in arXiv, the excellent and widely used preprint repository focusing primarily on physics (http://arxiv.org/pdf/physics/0508025). It was welcomed much more widely and quickly than any earlier bibliometric or scientometric indicator had been (Lancaster, 1991).
Hirsch summarized the essence in a terse abstract: "I propose the index h, defined as the number of papers with citation number ≥ h, as a useful index to characterize the scientific output of a researcher." He then explains that "A scientist has index h if h of his or her N_p papers have at least h citations each and the other (N_p - h) papers have ≤ h citations each." This means that an author with h=16 has 16 publications, each of which received 16 or more citations. The h-index varies widely from discipline to discipline and even within disciplines and research areas. In library and information science, for example, an h-index of 16 is a high value, but in, say, astronomy and retrovirology, it is considered a relatively low value.
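The definition quoted above can be turned into a few lines of code. The following is only a minimal sketch of the counting rule, not the procedure used by any of the databases discussed here; the function name h_index and the sample citation counts are illustrative assumptions, not data from this study.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank the papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the first `rank` papers all have >= rank citations
        else:
            break      # every later paper has even fewer citations
    return h


# Hypothetical citation counts for five papers (illustration only).
sample = [10, 8, 5, 4, 3]
print(h_index(sample))  # prints 4: four papers have at least 4 citations each
```

Because only the papers ranked at or near position h matter, a single missing record for a highly cited work can lower the result, which is the concern raised later in this paper.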
SHORT LITERATURE OVERVIEW
Immediately after publication there was a flurry of formal and informal comments and reactions by researchers from various disciplines, with only a few dismissive and skeptical comments (Purvis, 2006; Ashkanasy, 2007; Berger, 2007) and plenty of supportive ones, in serious news sources, listserv fora, and blog sites, beyond the many academic journals. It was cited by more than sixty papers by the end of August 2007. The most telling sign of the importance and appreciation of the h-index was that the editors of Scientometrics found a way to squeeze a paper about the h-index into the December 2005 issue (Bornmann and Daniel, 2005), then dedicated the April 2006 issue to the topic, with several substantial articles by some of the most respected scientometricians, followed by three more in May, June, and July, then two more in 2007, in that journal alone. The papers approached the topic from a variety of theoretical (Egghe and Rousseau, 2006; Liang, 2006; Egghe, 2006, 2007a; Schubert, 2007; Glanzel, 2006) and practical angles (Costas & Bordons, 2007; Imperial & Rodriguez-Navarro, 2007; Vanclay, 2007).
There are several case studies that present the h-index for a variety of target groups. These include the prominent scholars, educators, and researchers in a specific field (Kelly & Jennions, 2006; Saad, 2006; Cronin & Meho, 2006; Oppenheim, 2007), lesser known researchers in the broad field of physics (Schreiber, 2007a), institutions within a country (Prathap, 2006), researchers of a discipline within a country (Salgado and Paez, 2007), researchers within a country in different fields (Imperial & Rodriguez-Navarro, 2007; Packer & Meneghini, 2006; Meneghini & Packer, 2006), researchers across countries in a field of specialization (Oelrich, Peters, and Jung, 2007), and, in the highly select group of scientometrics, the winners of the award commemorating John Derek de Solla Price (Bar-Ilan, 2006a).
Some of the best papers about the h-index voiced reservations about the details of the proposed model, but they indicated their support of the theory of Hirsch by suggesting variant and derivative indexes built on his idea (Batista et al., 2006; Egghe, 2006; Vanclay, 2006; Barendse, 2007; Jin et al., 2007). Several papers compared the h-index with other, traditional measures (van Raan, 2006; Barendse, 2007; Costas & Bordons, 2007).
The h-index was begging to be applied to journals, to complement the controversial Journal Impact Factor, and several papers confirmed and applied this extension (although not for a lifetime measure given the volume of papers in many journals) (Braun, Glanzel, & Schubert, 2006; Schubert & Glanzel, 2007; Olden, 2007).
Although it is not about the h-index, an excellent article by Butler and Visser (2006) about the need for extending citation analysis to nonsource materials (i.e., to material types, document genres, and specific journals not covered by a database) is essential for understanding the context of the h-index. I will come back to their well-designed, nationwide research later, as their conclusions are likely to apply not only to researchers in Australia but around the world. The warnings of Cronin, Snyder, and Atkins (1997), and earlier of Line (1979), should be heeded by everyone who evaluates the research performance of scholars in the sciences, the social sciences, and especially the arts and humanities, where non-journal sources are preferred.
Two aspects of the concept of Hirsch received special interest: the bias of the h-index toward extensively self-citing authors (Schreiber, 2007b; Vinkler, 2007), and its robustness and relative insensitivity to missing records for highly cited papers (Vanclay, 2007; Rousseau, 2007). There are several other relevant papers cited in the sections on Google Scholar, Scopus, and Web of Science, which provide a broader background for these three systems, beyond the perspective of the h-index itself, often comparing the alternatives. The two papers on robustness offer a good transition to the focus of my research for this Festschrift, which illustrates through the example of Wilf Lancaster that purely software-generated h-indexes, which ignore the "orphan references," are actually sensitive to even just a few missing records for highly cited publications that are left out of the automatic process of generating the h-index.
One has to be careful when searching by author name(s). In WoS, the last name must come first, followed by the first and--if applicable--middle initial(s) and no punctuation at all. In Scopus, the order does not matter, but each initial must be followed by a dot. If the last name is entered first, it must be followed by a comma. The template suggests that the comma must be followed by a space, but the space is actually not needed, and the software removes it if it is entered, echoing back an odd-looking format that differs both from what the user entered and from the way the name looks in the record. In Google Scholar, punctuation is ignored, so "FW Lancaster" and "F.W. Lancaster" bring up the same number of hits. On the other hand, the name should be enclosed in quotes. This had no effect until about mid-2006, but it is important now because otherwise Google Scholar picks up records for articles authored by, say, "M Lancaster," "F Smith," and "W Black," as its software does not handle repeatable fields, such as authors, appropriately.
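The author-name conventions described above can be summarized in a small helper. This is a hedged sketch: the function format_author_query and the database labels "wos", "scopus", and "google_scholar" are hypothetical names introduced here for illustration, and the code only builds query strings in the formats described in the text; it does not call any database.

```python
def format_author_query(last_name, initials, database):
    """Build an author search string following the conventions described in the text.

    `database` is one of the hypothetical labels "wos", "scopus",
    or "google_scholar"; no real search API is involved.
    """
    # Normalize initials such as "f." or "W" to bare capital letters.
    clean = [i.strip(". ").upper() for i in initials]
    joined = "".join(clean)

    if database == "wos":
        # Last name first, then initials, with no punctuation at all.
        return f"{last_name} {joined}"
    if database == "scopus":
        # Last name first requires a comma; each initial is followed by a dot.
        # The space after the comma follows the template and is optional.
        dotted = "".join(i + "." for i in clean)
        return f"{last_name}, {dotted}"
    if database == "google_scholar":
        # Punctuation is ignored, but the name should be enclosed in quotes.
        return '"' + joined + " " + last_name + '"'
    raise ValueError(f"unknown database label: {database}")


for db in ("wos", "scopus", "google_scholar"):
    print(db, "->", format_author_query("Lancaster", ["F", "W"], db))
# wos -> Lancaster FW
# scopus -> Lancaster, F.W.
# google_scholar -> "FW Lancaster"
```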
OUTLINE
First, the features of Google Scholar and Scopus are discussed from the h-index perspective, followed by a more detailed analysis of the pros and cons of Web of Science (WoS) from the perspective of generating the h-index in general, and for F. W. Lancaster in particular. These three systems have the broadest disciplinary coverage among the databases that are fully or partially enhanced by well-tagged cited references, one of the prerequisites for counting and keeping track of the citations given and received (Jacso, 2008b). Why the developers of Google Scholar apparently did not make use of any of the metadata available in the tens of millions of records to which they had access is another question (Jacso, 2008a).
Given the space limitation of this Festschrift, I can provide only limited coverage of the issues, but a multipart series about the content and software advantages and disadvantages of using Google Scholar, Scopus, and WoS for calculating a rational h-index (Jacso, 2008c, 2008d, 2008e) is to be published in 2008.
Here it is demonstrated how the existing features of WoS can be exploited to arrive at a much more credible and traceable h-index, which is at least twice as high as the automatically generated h-index in WoS, and 4-8 times higher than the h-index produced by Scopus. The ratio depends on which of the two automatic h-index generation options of Scopus is chosen by the user (as discussed later). I am reluctant to provide any comparative score with Google Scholar, simply because its hit counts and citation counts remain as untraceable and inflated (Jacso, 2006a, 2006b) as they were at the launch of the beta version in 2004. All three systems have limitations (and so do all the citation-enhanced indexing and abstracting databases (Jacso, 2004b)), but the deficiencies in Google Scholar are so voluminous, unscholarly, and often so hidden that its hit counts and citation counts should not be accepted even as a starting point for evaluating the research output of real scholars. In an interesting twist, Google Scholar can help in revealing the shortcomings of WoS and Scopus by showing information about publications that are not covered by the automatic h-index generators of either.
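The effect of adding manually recovered citations to the automatically counted ones can be illustrated with a toy calculation. The sketch below is not the WoS procedure; the names h_index, merged_counts, master_records, and recovered_refs are hypothetical, and all citation counts are invented for illustration rather than taken from Lancaster's record.

```python
from collections import Counter


def h_index(citations):
    """Largest h such that h works have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


def merged_counts(master, recovered):
    """Add manually recovered citation counts to the automatic ones, work by work."""
    total = Counter(master)
    total.update(recovered)  # Counter.update adds counts instead of replacing them
    return dict(total)


# Invented counts: citations matched automatically to master records.
master_records = {"Paper A": 6, "Paper B": 4, "Paper C": 2, "Paper D": 1}

# Invented counts: "stray" references to Paper C that failed to match, and
# "orphan" references to two books that have no master record at all.
recovered_refs = {"Paper C": 3, "Book X": 9, "Book Y": 5}

automatic = h_index(master_records.values())
corrected = h_index(merged_counts(master_records, recovered_refs).values())
print(automatic, corrected)  # prints: 2 4
```

In this invented example the index doubles because the recovered works and citations fall exactly in the range that decides the h value, which is the kind of sensitivity the partially manual process is meant to address.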
THE OEUVRE OF LANCASTER
It is not for sheer snobbery that I use the French loan word. As a single word it refers to "a substantial body of work constituting the lifework of a writer" according to the Merriam-Webster Collegiate Dictionary (and many other good dictionaries). It is important--especially in a Festschrift (which is a borrowed German term and a general...