|
...to interpret for even moderate numbers of variables. This article demonstrates that the impact of high dimensions is much less severe when the component displays are clustered together according to some index of merit. Effectively, this clustering reduces the dimensionality and makes interpretation easier. For scatterplot matrices and parallel coordinates plots, clustering of component displays is achieved by finding suitable permutations of the variables. I discuss algorithms based on cluster analysis for finding permutations, and present examples using various indices of merit.
Key Words: Parallel coordinates; Permutation of variables; Projection pursuit; Scatterplot matrices.
1. INTRODUCTION
Datasets of three or more dimensions are notoriously difficult to display on a two-dimensional screen or on a piece of paper. Many graphical methods for displaying multivariate data consist of arrangements of multiple displays of one or two variables--for example, a scatterplot matrix consists of all pairwise scatterplots of two variables arranged in a square matrix, and a parallel coordinates display is a sequence of one-dimensional dotplots where line segments are drawn to connect the dots pertaining to a particular case. While in principle these methods generalize to arbitrary numbers of variables, in practice as the dimensions increase, they become less effective, presenting us with an overwhelming amount of information that is difficult to absorb. Usually, the ordering of the variables in these displays is arbitrary and corresponds to the order in which the variables were listed in the data file. However, the interpretability and effectiveness of visualizations often improve dramatically when the variables are reordered in some systematic way.
A scatterplot matrix shows all pairwise scatterplots of p variables, while a parallel coordinate display shows p - 1 of the p(p - 1)/2 pairwise line plots. Some of these pairwise plots are more interesting or informative than others, and an effective visualization should help us to focus on these. Our basic idea is that each pairwise display (a panel) is awarded a merit score measuring its "interestingness." Then the variables are reordered so that the viewer's attention will be focused on the most interesting panels, which are placed in prominent positions. For the scatterplot matrix, we consider positions close to the diagonal to be the most prominent, while for the parallel coordinate display interesting panels should be among the p - 1 visible panels. Suitable merit measures will depend on the context of the data and the type of display, but correlation is often a good starting point. Then the visualizations will help us identify clusters of similar (highly correlated) variables, effectively reducing the dimensionality of the visualization problem.
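The idea of panel merit scores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: absolute Pearson correlation is taken as the merit of a panel, the score of a parallel coordinate display is taken to be the sum of the merits of its p - 1 visible panels, and the toy data and the function names (pearson, merit_matrix, visible_merit) are invented for the example rather than taken from the article.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def merit_matrix(columns):
    """Symmetric p x p matrix holding one merit per panel.

    Assumption: merit = |correlation|. The diagonal is left at 0.
    """
    p = len(columns)
    m = [[0.0] * p for _ in range(p)]
    for i, j in combinations(range(p), 2):  # all p(p - 1)/2 panels
        m[i][j] = m[j][i] = abs(pearson(columns[i], columns[j]))
    return m


def visible_merit(m, order):
    """Total merit of the p - 1 adjacent panels shown for a given order."""
    return sum(m[a][b] for a, b in zip(order, order[1:]))


# Made-up data: four variables measured on four cases.
columns = [
    [1, 2, 3, 4],     # v0
    [1, -1, -1, 1],   # v1, uncorrelated with v0
    [2, 4, 6, 8],     # v2 = 2 * v0
    [-1, 1, 1, -1],   # v3 = -v1
]

merits = merit_matrix(columns)
file_order = [0, 1, 2, 3]   # order in which the variables were listed
reordered = [0, 2, 1, 3]    # puts the correlated pairs side by side

print(visible_merit(merits, file_order))
print(visible_merit(merits, reordered))
```

In the file order none of the strongly related pairs is adjacent, so the visible panels carry no merit; the reordered display shows both of them.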
Ideally, the panel merit scores are combined into an overall merit score for the entire display. We could then find the permutation of the variables maximizing this overall score. A brute-force approach to solving this problem evaluates the criterion on all possible permutations of the variables, but this is slow except for small numbers of variables. Because our goal is effective data visualization, it is probably better to find a good display quickly than to wait for an optimal display that is only slightly better. Therefore, we use a fast ad-hoc algorithm based on cluster analysis (Gruvaeus and Wainer 1972) to come up with suitable permutations of the variables. In our experience the resulting visualizations are often far more effective than those using standard variable order.
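The trade-off between exhaustive search and a fast clustering heuristic can be illustrated as follows. The sketch assumes the overall score is the sum of merits of adjacent variables, and the greedy routine is a simplified average-link agglomeration that keeps each cluster as an ordered list and orients the two lists being merged so that the best-matching end points meet. It is written in the spirit of a cluster-based ordering, not as a reproduction of the Gruvaeus and Wainer (1972) algorithm; the merit matrix and all function names are hypothetical.

```python
from itertools import permutations


def path_merit(m, order):
    """Overall score (assumed): sum of merits of adjacent variables."""
    return sum(m[a][b] for a, b in zip(order, order[1:]))


def brute_force_order(m):
    """Evaluate all p! permutations; feasible only for small p."""
    return list(max(permutations(range(len(m))),
                    key=lambda o: path_merit(m, o)))


def cluster_order(m):
    """Greedy agglomerative ordering (simplified sketch).

    Repeatedly merge the two ordered clusters with the highest average
    merit, choosing the orientation whose touching end points have the
    highest merit.
    """
    clusters = [[i] for i in range(len(m))]

    def avg(a, b):
        return sum(m[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        # Four ways to join two ordered lists; score each by its junction.
        candidates = [a + b, a + b[::-1], a[::-1] + b, a[::-1] + b[::-1]]
        merged = max(candidates, key=lambda c: m[c[len(a) - 1]][c[len(a)]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical symmetric merit matrix for five variables.
M = [
    [0.0, 0.1, 0.9, 0.2, 0.3],
    [0.1, 0.0, 0.2, 0.8, 0.1],
    [0.9, 0.2, 0.0, 0.3, 0.7],
    [0.2, 0.8, 0.3, 0.0, 0.2],
    [0.3, 0.1, 0.7, 0.2, 0.0],
]

best = brute_force_order(M)
fast = cluster_order(M)
print(best, path_merit(M, best))
print(fast, path_merit(M, fast))
```

The exhaustive search inspects p! orderings, whereas the greedy routine performs only p - 1 merges, which is why a heuristic of this kind remains practical as the number of variables grows. Its result is not guaranteed to be optimal.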
The problem of choosing an ordering of variables for displays of multivariate data has received surprisingly little attention in the literature. The work of Bertin is an exception in this regard; ordering variables, cases, and categories in so-called "matrix displays" is a major theme of his work (Bertin 1983).
In multiway trellis displays, Cleveland (1995) ordered categories by their medians, and Friendly (1994) ordered categories in a mosaic display by their score on the first correspondence analysis direction; in both cases ordering clarifies patterns present in the data. Carr and Olsen (1996) stated succinctly that "sorting simplifies" and demonstrated this extremely effectively using a minimal spanning tree-based ordering of row and column variables in a two-way layout. Wegman (1990) sorted observations along one variable at a time to produce a variation on the parallel coordinate display called the "color histogram." The "data image" described by Minnotte and West (1998) is similar to the color histogram, but it orders both cases and variables using the Gruvaeus and Wainer (1972) algorithm.
More recently, Friendly and Kwan (2003) argued very strongly in favor of ordering information in visual displays of data. Their basic notion is that similar variables, cases, and categories should be positioned adjacently in a graphical display, and they used orderings based on eigendecompositions for this purpose. In a related article, Friendly (2002) examined ways of rendering correlation matrices. He advocated reordering variables so that highly correlated variables are positioned adjacently, and computed an ordering from the angular positions of the first two eigenvectors of the correlation matrix.
In the visualization literature, Ankerst, Berchtold, and Keim (1998) tackled a problem that is closely related to that of the present article: they were concerned with clustering variables so that similar variables are clustered together in one-dimensional, two-dimensional, and circular display formats. However--unlike the present article--they were not concerned with placing interesting displays in prominent positions.
Section 2 describes a method for ordering variables in scatterplot matrices, so that interesting panels are clustered along the diagonal. I suggest various merit scores and give examples to show that the method yields improved visualizations. Section 3 describes a method for ordering variables in parallel coordinate displays so that interesting panels are visible. Again, I give an example and suggest various merit scores appropriate for parallel coordinate displays. Section 4 follows with some concluding remarks. The Appendix gives details of a suite of R functions implementing the graphical methods and algorithms described here.
2. SCATTERPLOT MATRICES
According to Hills (1969) the first and sometimes only impression gained from looking at a large correlation matrix is...
NOTE: All illustrations and photos
have been removed from this article.

More articles from Journal of Computational & Graphical Statistics
CARTscans: a tool for visualizing complex models., December 01, 2004 LOTUS: an algorithm for building accurate and comprehensible logistic ..., December 01, 2004 Evolutionary simulated annealing with application to image restoration..., December 01, 2004 Statistical simulations on parallel computers., December 01, 2004 Population Monte Carlo., December 01, 2004
|