About UsMy AccountView Cart
Browse or Search over 5 million articles »
Find Articles by Publication

Home | Industry Information | Business News | Browse by Publication | I | IIE Transactions

A very large scale neighborhood search algorithm for the q-mode problem.(Report)

Article, News, Research, Information, Industry & Business News
» View article excerpt

Read this article now - Try Goliath Business News - FREE!  
You can view this article PLUS...

  • Over 5 million business articles
  • Hundreds of the most trusted magazines, newswires, and journals (see list)
  • Premium business information that is timely and relevant
  • Unlimited Access
Now for a Limited Time, try Goliath Business News - Free for 7 Days!
Tell Me More Terms and Conditions

Purchase this article for $4.95

Already a subscriber? Log in to read full article
 

Publication: IIE Transactions
Publication Date: 01-OCT-07
Delivery: Immediate Online Access
Author: Kulkarni, Girish ; Fathi, Yahya

Article Excerpt
1. Introduction

The q-mode problem is a combinatorial optimization problem that requires partitioning a given collection of objects that are described by categorical attributes into mutually exclusive and collectively exhaustive clusters. Our objective in this paper is to develop an effective algorithm for the q-mode problem that is usable for solving relatively large instances of the problem within a reasonable CPU time. To this end we propose a local search algorithm that employs network flow techniques to implicitly search a very large scale neighborhood defined for the q-mode problem. The effectiveness of this algorithm is empirically evaluated by comparing the objective value obtained using the proposed algorithm with that of an exact algorithm and a heuristic procedure for a collection of randomly generated problem instances.

In Section 2 we formally define the q-mode problem, introduce the notation, and discuss the related literature. In Section 3 we present the design of a local improvement algorithm for the q-mode problem along with details of the very large scale neighborhood on which it is based. In Section 4 we describe a computational experiment that we have performed to empirically evaluate the effectiveness of this algorithm and discuss our observations. Finally, we conclude the paper in Section 5.

2. Problem definition and background

2.1. Notation and problem statement

In the context of the q-mode problem our objective is to partition a given collection of objects into q mutually exclusive and collectively exhaustive groups so as to minimize the total distance from the objects to the "mode" of the cluster to which each object is assigned. We refer to each object in this context as a record, and each record is represented by a vector of size n. Each element (position) of this vector corresponds to an attribute of the object, and we assume that all attributes are categorical in nature, i.e., each attribute can take one of m different values (or categories) for that attribute. Without loss of generality we assume that all attributes have the same number m of possible values or categories, but this assumption can be easily removed with minor adjustments. We refer to the space of all such vectors of size n as [CV.sup.nm].

For a given collection (group) of records [PHI] in [CV.sup.nm], and for each value of i = 1 to m and j = 1 to n, let F(i, j) = number of records in the collection [PHI] that have value i in position j, and let [F.sup.max](j) = [max.sub.i] F(i, j) and i*(j) = arg [max.sub.i] F(i, j). Clearly i*(j) represents the category that is most frequently observed in position j among all members of the collection [PHI]. If more than one category tie at achieving this maximum value at position j we break the tie arbitrarily and select any one of these indices as i*(j). We now define the mode of [PHI] as a vector of size n where its jth element is i*(j), for all j = 1 to n. We denote this vector by mod([PHI]) and its jth element by mo[d.sub.j]([PHI]) = i*(j) for all j. Note that given a collection of vectors [PHI] [??] [CV.sup.nm] its mode vector can be determined efficiently and the corresponding computational requirement is O(np) (Morgan et al., 2004), where p is the cardinality of the set [PHI].

Given two vectors [U.sup.1] and [U.sup.2] of the same size we define a dissimilarity measure (i.e., distance) between these vectors, D([U.sup.1], [U.sup.2]), as the number of positions at which the two vectors are not identical; i.e., letting [d.sub.j]([U.sup.1], [U.sup.2]) = 1 for all j such that [u.sub.j.sup.1] [not equal to] [u.sub.j.sup.2], and [d.sub.j]([U.sup.1], [U.sup.2]) = otherwise, it follows that D([U.sup.1], [U.sup.2]) = [[summation].sub.j=1.sup.n] [d.sub.j]([U.sup.1], [U.sup.2]). In these expressions the notation [u.sub.j] is used to represent the jth element of the vector U.

For a given collection of p vectors in [CV.sup.nm] (i.e., vectors of categorical data as defined above), say [PHI] = {[U.sup.1],..., [U.sup.p]}, it can be shown that the total distance between these vectors and their mode is smaller than or equal to the total distance between these vectors and any other vector of the same size in [CV.sup.nm] (Morgan et al., 2004). In other words:

[p.summation over (k=1)]D[[U.sup.k], mod([PHI])] = [min.[v[member of][CV.sup.nm]]]{[p.summation over (k=1)]D[[U.sup.k], v]}

Clearly [[summation].sub.k=1.sup.p]D[[U.sup.k], mod([PHI])] is a characteristic value of the collection [PHI]; we denote this value by MD([PHI]) and refer to it as the total distance of the collection [PHI] from its mode. In context, this value is comparable to the total distance of a collection of vectors in [R.sup.n] from their geometric center.

We can interpret the distance between a vector [U.sup.k] [member of] [PHI] and its mode, i.e., D[[U.sup.k], mod([PHI])], as the total number of positions where a replacement of the value (change of category) is required to make this vector identical to the mode vector. This interpretation is particularly useful in the context of switching cabinet manufacturing as described below or in a similar manufacturing scenario.

We are now prepared to give a formal definition of the q-mode problem.

The q-mode problem: Given a collection of p vectors in [CV.sup.nm] and a positive integer q, partition these vectors into q mutually exclusive and collectively exhaustive groups (clusters) [[PHI].sub.1] through [[PHI].sub.q] so as to minimize [[summation].sub.l=1.sup.q] MD([[PHI].sub.l]).

In this context, and for ease of discussion, throughout this paper we refer to [[summation].sub.l=1.sup.q] MD([[PHI].sub.l]) as the total number of replacements associated with partition [[PHI].sub.1] through [[PHI].sub.q] of [PHI]. Note that the 1-mode problem is simply the problem of finding the mode of the given collection of vectors, and hence it can be solved efficiently as discussed earlier. For q [greater than or equal to] 2 the q-mode problem is conjectured to be NP-hard, although no proof is available in the open literature as of this writing (Morgan et al., 2004).

2.2. Background

The problem of forming clusters among a given collection of objects with attribute data is addressed in a number of different articles in the open literature. Its applications include the problem of pattern discovery in computational biology (Rigoutsos et al., 2000), design of cluster-based document retrieval systems (Jain et al., 1999), creating clusters among a given collection of transactional data in market analysis (Guha et al., 2000), and the model configuration problem in the context of switching cabinet manufacturing (Morgan et al., 2004), among others. Of course each application could have certain distinguishing characteristics that are unique to that application. Such characteristics typically lead to an appropriate definition for the notion of similarity (or dissimilarity) between two objects and a corresponding evaluation criterion which could also be unique to that application. For instance, in the context of clustering transactional data, Guha et al. (2000) propose the concept of links in the algorithm ROCK, whereas the algorithm CACTUS (Ganti et al., 1999) creates summaries from the values and each record is assigned to a cluster based on these summaries.

The notion of measuring the dissimilarity (i.e., distance) between two vectors as the number of positions in which the two vectors are not identical (i.e., D([U.sup.1], [U.sup.2]) defined above) is previously employed and discussed in the open literature (Gordon, 1999). Below we briefly describe its application in two contexts.

In Morgan et al. (2004) this problem arises in the context of a manufacturing process. The product is a "switching cabinet" that needs to be configured according to a certain set of specifications for each customer. There are a number of circuit-pack slots within each cabinet, and the customer may order one of several available circuit-pack types (options) for each slot. In order to streamline the manufacturing process, instead of starting the manufacture of each switching cabinet from scratch according to its unique configuration requirements as specified by the customer, the manufacturer may choose to produce one or more (q) pre-configured model cabinets in large numbers. These models can then be used as the starting points in the manufacturing process. To fill a specific customer order we can start the manufacturing process by taking an appropriate pre-configured model that is closest to the customer order (i.e., requires the least number of replacements) and reconfigure this model to meet the customer requirements by replacing the circuit packs as necessary. Clearly minimizing the total number of replacements directly leads to lower manufacturing costs, and it can be easily shown that for any given value of q, in order to minimize the total number of replacements to fill a collection of customer orders, the q models should be exactly the mode vectors for the corresponding clusters, and a cluster pattern that minimizes MD([PHI]) would minimize the total number of replacements necessary to fill that group of customer orders, and hence the corresponding manufacturing cost. Obviously a similar scenario could also arise in other manufacturing contexts.

The second application that uses this measure is in data mining as described in Huang (1998). Here the measure MD([PHI]) is used to create meaningful classifications in the context of the well-known soybean disease data set (Michalski and Stepp, 1983; Fisher, 1987) and in a data set from the health insurance database. In both cases it is shown that the resulting classifications obtained by minimizing MD([PHI]) are indeed meaningful. Application of this measure in other contexts in data mining and pattern recognition that we stated earlier is also possible and it could result in determining useful and/or interesting clusters in context.

All known exact methods for solving the q-mode problem have exponential growth in the computational requirements with growth in the size of the problem. These methods can be effectively used only for relatively small to moderate-sized instances of the problem, and their execution time becomes excessive as the size of the problem increases (see Kulkarni (2005)). In many situations among the application areas described above the size of the problem can easily exceed the practical limits for exact methods. For instance, in the switching cabinet manufacturing environment, such a situation is routinely faced since the number of records (p) can be quite large. In such cases we can either reduce the size of the problem by selecting a representative group of records (i.e., reduce p), or by trying to solve the original problem via a computationally effective but inexact method. In either case, the solution obtained for the original problem could be suboptimal. In this article we focus on the latter approach.

Huang (1998) proposes a constructive/iterative heuristic procedure which is similar to the k-means algorithm for datasets with numerical values (MacQueen, 1967). This procedure, referred to as the K-modes algorithm, is relatively fast and hence it is scalable to relatively large data sets. Our preliminary computational experiments show that this method is also quite effective in finding the clusters if the dataset contains relatively strong natural clusters. But this method is not as effective if the natural clusters in the dataset are relatively weak.

Morgan et al. (2004) propose several heuristic approaches to solve the 2-mode problem. These procedures include constructive techniques as well as search methods, but the search methods are based on relatively small-sized neighborhood...

NOTE: All illustrations and photos have been removed from this article.



More articles from IIE Transactions
Locating capacitated facilities to maximize captured demand, 01-NOV-07
Erratum, 01-NOV-07
Sequencing with limited flexibility, 01-OCT-07

Looking for additional articles?
Click here to search our database of over 3 million articles.

Looking for more in-depth information on this industry?
Click here to search our complete database of Industry & Market reports by text, subject, publication name or publication date.

About Goliath
Whether you're looking for sales prospects, competitive information, company analysis or best practices in managing your organization, Goliath can help you meet your business needs.

Our extensive business information databases empower business professionals with both the breadth and depth of credible, authoritative information they need to support their business goals. Whether it be strategic planning, sales prospecting, company research or defining management best practices - Goliath is your leading source for accurate information.

Home

Company Profiles

Industry Information

Business Development Resources

Business Management Resources

U.S. Job Search

Need More Information?
Start a new search.
Advertising, Privacy Policy, Refund Policy, Contact Us, Site Map, Terms & Conditions, Add to del.icio.us
Customer Service, How to Buy, Frequently Asked Questions
Copyright © 2008, ECNext, Inc., All Rights Reserved