The Beck Depression Inventory (BDI; Beck, Steer, & Brown, 1996; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961) is a widely researched self-report measure of depression. It was designed to measure the presence and severity of depression in psychiatrically diagnosed patients and in nonclinical populations of both adolescents and adults. A variety of translated versions of the BDI-II have been studied worldwide, and many empirical studies have investigated psychometric characteristics of the BDI-II, including internal consistency, factorial validity, and dimensionality (e.g., Dozois, Dobson, & Ahnberg, 1998; Osman et al., 1997; Penley, Wiebe, & Nwosu, 2003; Steer, Ball, Ranieri, & Beck, 1999; Steer & Clark, 1997; Ward, 2006; Whisman, Perez, & Ramel, 2000). Studies have found satisfactory internal consistency reliability of BDI-II scores for both clinical and nonclinical samples. Factor analytic studies of the BDI-II, however, have provided different factor solutions, with the number of factors varying between one and seven depending on the study sample and extraction method used (e.g., Dozois et al., 1998; Penley et al., 2003). A two-factor solution, consisting of the Somatic-Affective factor and the Cognitive factor, has been most commonly obtained in clinical samples (e.g., Arnau, Meagher, Norris, & Bramson, 2001; Beck et al., 1996; Steer et al., 1999). For nonclinical samples, a two-factor solution, consisting of the Cognitive-Affective and Somatic factors (e.g., Beck et al., 1996; Steer & Clark, 1997; Steer et al., 1999; Whisman et al., 2000), as well as a three-factor solution, consisting of the Negative Attitude, Performance Difficulty, and Somatic Elements factors (e.g., Byrne, Stewart, & Lee, 2004), have been most frequently identified in the literature.
Most studies have used the classical test theory (CTT) approach to investigate the factor structure of the BDI-II. Under CTT, however, the estimates of item difficulty, item discrimination, score reliability, and standard errors are sample dependent. Many of the problems associated with CTT-based factor analysis have been noted (e.g., Embretson & Reise, 2000; R. M. Smith, 1996; Wright, 1996, 1999). First, factor analysis using CTT suffers from the limitation that factor structures are confounded by item difficulty. Second, item difficulties and person (ability or trait) measures are not estimated on the same scale. Third, because the raw scores for persons and items obtained from rating scales (e.g., Likert scales) are ordinal, the relationships between raw (observed) scores and factor scores are nonlinear. Therefore, factor scores and factor loadings (covariances of item and factor scores) are rarely reproduced when new sets of data from the same population are reanalyzed using the same procedures. Fourth, when measuring, say, depression with a Likert scale under CTT, it may happen that persons with higher levels of depression endorse a lower scale category more often than do persons with lower levels of depression. Another problem, related to factor analysis of BDI scores for nonclinical samples, is that the markedly skewed distributions of the item ratings for nonclinical samples seriously violate the normality assumption of factor analytic procedures and, thus, lead to inaccurate factor solutions (Welch, Hall, & Walkey, 1990). These problems can be alleviated by using Rasch modeling of BDI scores.
The Rasch model produces scale-free person measures and sample-free item locations (Wright & Masters, 1982). In other words, the differences between pairs of person measures and pairs of item locations are sample independent in the Rasch measurement. There are several advantages of using Rasch modeling for the validation of latent trait measures. First, both item difficulties and person (ability or trait) measures are located on the same interval scale and have the same units of measurement--logits (the natural logarithm of the odds for success on a test item). Thus, the estimates of person measures provide comparable scores for the comparison of persons across different forms of an assessment (Andrich, 1988; Bond & Fox, 2001). Second, the Rasch approach yields least squares maximum likelihood estimates of linear person measures and linear item locations along the entire continuum. The Rasch misfit analyses are sensitive to the full matrix of residuals, and, thus, the residual noise can be avoided (Wright, 1996, 1999). Third, the Rasch analysis provides additional information on category functioning, which increases the measurement accuracy (Linacre, 2002). For example, with information on whether the average measures or step calibrations advance monotonically across rating scale categories, researchers could optimize the effectiveness of scale categories.
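To make the logit metric concrete, the dichotomous Rasch model and its rating scale extension can be written as below. The notation is an assumption introduced here for illustration and is not taken from the article: \theta_n is the measure of person n, \delta_i is the location (difficulty) of item i, and \tau_k is the step calibration between categories k-1 and k. The article does not state which polytomous Rasch model was fitted, so the second line is only one common possibility.

```latex
\[
P(X_{ni}=1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\qquad
\ln\frac{P(X_{ni}=1)}{P(X_{ni}=0)} = \theta_n - \delta_i
\]
\[
\ln\frac{P(X_{ni}=k)}{P(X_{ni}=k-1)} = \theta_n - \delta_i - \tau_k
\]
```

Under the first expression, the difference in log-odds between two persons n and m on the same item is \theta_n - \theta_m, which does not involve \delta_i; likewise, the difference between two items for the same person is \delta_j - \delta_i, which does not involve \theta_n. This is the sense in which comparisons of person measures and of item locations are sample independent.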
There have been very few empirical studies of the BDI using Rasch-based analyses (e.g., Bouman & Kok, 1987; Hammond, 1995; Hong & Wong, 2005). The literature is inconsistent in regard to dimensionality of the BDI. For example, Bouman and Kok identified three unidimensional subscales of the BDI for Dutch samples: Mood and Inhibit, Guilt and Failure, and Somatic. Hammond found a borderline fit of unidimensionality for Western nonclinical populations with two poorly fitting items: Item 8 (self-blame) and Item 11 (irritability). For Korean samples, Hong and Wong found that most items of the BDI were associated with one unidimensional scale except for two misfitting items: Item 19 (weight loss) and Item 21 (libido loss). However, these two studies (Hammond, 1995; Hong & Wong, 2005) did not investigate the item fit after removing poorly fitting items and did not provide other evidence of unidimensionality (e.g., examining the size of residuals or person invariance property). To date, the unidimensionality of the BDI remains under debate. Regarding the BDI-II version, unidimensionality has not been found using Rasch analysis.
The BDI-II items are derived from symptoms of clinical depression, and, therefore, it is not clear whether the item difficulties are suitable for nonclinical populations. Very little is known in this regard from the literature for the BDI and nothing for the BDI-II. Richter, Werner, Heerlein, Kraus, and Sauer (1998) stated that high item difficulties are a shortcoming of the BDI. They showed that the average scores were below 1 for student samples. Also, Hong and Wong (2005) indicated that most of the BDI items were too difficult for the nonclinical sample. Therefore, the adequacy of the cutoff values proposed by Beck et al. (1996) for nonclinical populations has been called into question. Rasch analysis can be very useful in investigating this question. As noted earlier, Rasch person measures and item difficulties can be placed on the same linear continuum, which allows for the detection of ceiling effects and floor effects. A ceiling effect occurs when person measures are located much higher along the Rasch scale than are item difficulties. Conversely, a floor effect occurs when person measures are located much lower than are the item difficulties along the Rasch scale. Furthermore, the relative difficulty of items can provide useful insights for the construct validity of depression (e.g., to support or refute the expectation that somatic symptoms are easier to endorse than are cognitive-affective symptoms). These issues have not been addressed in the BDI-II literature.
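The ceiling and floor definitions above can be illustrated with a minimal sketch that compares the mean person measure with the mean item difficulty on the common logit scale. The function name `targeting_check`, the 1.0-logit tolerance, and the data values are hypothetical and chosen only for illustration; they are not a published criterion or results from any study.

```python
from statistics import mean


def targeting_check(person_measures, item_difficulties, tolerance=1.0):
    """Label person-item targeting as 'ceiling', 'floor', or 'adequate'.

    Both inputs are in logits on the same Rasch scale. `tolerance` is an
    illustrative assumption, not an established cutoff.
    """
    gap = mean(person_measures) - mean(item_difficulties)
    if gap > tolerance:
        label = "ceiling"   # persons sit well above the items
    elif gap < -tolerance:
        label = "floor"     # persons sit well below the items
    else:
        label = "adequate"
    return gap, label


# Made-up logit values resembling a nonclinical sample answering hard items.
persons = [-3.1, -2.4, -2.8, -1.9, -3.5]
items = [-0.6, 0.0, 0.4, -0.2, 0.4]
gap, label = targeting_check(persons, items)
print(f"mean person - mean item = {gap:.2f} logits -> {label}")
```

In practice a person-item map would usually be inspected as well, because a comparison of means alone hides differences in spread between the two distributions.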
One may also wonder about the "communication validity" of the BDI-II, which has to do with whether the rating scale categories perform as intended--for example, whether (a) people with lower levels of depression endorse lower categories, whereas people with higher levels of depression endorse higher categories; (b) the test language is socially acceptable and free of idiosyncratic category usage and ambiguous terminology; and (c) the respondents can differentiate the response levels of each rating scale (e.g., often vs. sometimes; Lopez, 1996). The Rasch model provides promising tools to verify and improve category functioning (Linacre, 1995, 2002; Lopez, 1996). Linacre (2002) proposed the following eight guidelines for optimizing rating scale category effectiveness: (a) at least 10 observations of each category, (b) regular observation distribution, (c) average measures advance monotonically with category, (d) outfit mean square (MNSQ) less than 2, (e) step calibrations advance, (f) ratings imply measures and measures imply ratings, (g) step difficulties advance by at least 1.40 logits, and (h) step difficulties advance by less than 5 logits. The focus on particular guidelines depends on the research purposes. Promising applications of these guidelines have been demonstrated in empirical studies (e.g., E. V. Smith, Wakely, deKruif, & Swartz, 2003; Stone & Wright, 1994; Zhu, Updyke, & Lewandowski, 1997).
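Several of these guidelines reduce to simple numeric checks once category statistics are available from Rasch software. The sketch below covers guidelines (a), (c), (d), (e), (g), and (h) only; guidelines (b) and (f) require judgment or additional output and are omitted. The function name `check_categories`, its argument layout, and all numbers are hypothetical illustrations, not output from any particular program or data from any study.

```python
def check_categories(counts, avg_measures, outfit_mnsq, step_calibrations):
    """Check a subset of Linacre's (2002) guidelines for one rating scale.

    counts, avg_measures, outfit_mnsq: one value per category, low to high.
    step_calibrations: one value per step between adjacent categories.
    Returns a dict mapping guideline label to True (met) or False (not met).
    """
    def increasing(values):
        # Strictly increasing from the lowest to the highest category.
        return all(a < b for a, b in zip(values, values[1:]))

    # Distances between adjacent step calibrations, in logits.
    gaps = [b - a for a, b in zip(step_calibrations, step_calibrations[1:])]
    return {
        "a_min_10_observations": all(c >= 10 for c in counts),
        "c_average_measures_advance": increasing(avg_measures),
        "d_outfit_mnsq_below_2": all(m < 2.0 for m in outfit_mnsq),
        "e_step_calibrations_advance": increasing(step_calibrations),
        "g_steps_advance_at_least_1_4": all(g >= 1.4 for g in gaps),
        "h_steps_advance_less_than_5": all(g < 5.0 for g in gaps),
    }


# Made-up statistics for a four-category (0-3) item.
report = check_categories(
    counts=[412, 187, 54, 8],
    avg_measures=[-2.9, -1.6, -0.7, -0.2],
    outfit_mnsq=[0.95, 1.02, 1.10, 1.35],
    step_calibrations=[-1.8, 0.1, 1.7],
)
for guideline, met in report.items():
    print(f"{guideline}: {'met' if met else 'NOT met'}")
```

In this invented example only guideline (a) fails, because the highest category has fewer than 10 observations; a sparse top category of this kind is what one might expect when a clinically derived scale is given to a nonclinical sample.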
The Chinese version of the BDI-II (BDI-II-C) was translated by the Chinese Behavioral Science Corporation in 2000 and has been extensively used to diagnose the severity of depression for nonclinical populations in Taiwan. In spite of its widespread use, however, very few studies have validated its psychometric properties with Taiwanese samples. As a step in this direction, the purpose of...