Partitioning by Iterative Optimization

Usage

partition(a, ng, iter.max=15, option="mindst", diagnostics=F)

partition(a, centers, iter.max=15, diagnostics=F)

Arguments

a: matrix of multivariate data. Each row corresponds to an observation; each column corresponds to a variable.
ng: number of groups or clusters.
centers: matrix of initial guesses for the cluster centers. Each row represents a cluster center, so centers must have the same number of columns as a. The number of rows of centers determines the number of clusters formed.
iter.max: maximum number of iterations.
option: either "mindst" (default) or "exch", selecting between the Spaeth (1985) algorithms. With "mindst", the group variances are optimized by assigning each object to the group from whose center it is minimally distant. With "exch", they are optimized by exchanging objects between groups so that each object is minimally distant from its group center. (The two settings are compared in the sketch below.)
diagnostics: if TRUE, cluster cardinalities, cluster center coordinates, and cluster compactness values are printed to the command window; FALSE (default) suppresses this output.
Value

cluster: vector of integers ranging from 1 to ng (or to the number of rows of centers), with length equal to the number of rows of a. The ith value indicates the group to which the ith observation belongs.
centers: matrix containing the coordinates of the final group centers.
withinss: vector of length ng (or the number of rows of centers). The ith value gives the within-group sum of squares for the ith group.
size: vector of length ng (or the number of rows of centers). The ith value gives the number of observations in group i.

In the case of the Spaeth algorithms (where the number of clusters is given), only the first of these components, cluster, is returned.
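As an illustration of the two option settings, the memberships from both can be cross-tabulated on the same data (a minimal sketch, assuming a data matrix a is already in the session; only the cluster component is relied on):

# Compare the memberships produced by the two option settings.
pp1 <- partition(a, 3)                  # option="mindst", the default
pp2 <- partition(a, 3, option="exch")   # exchange method
table(pp1$cluster, pp2$cluster)         # cross-tabulate the two memberships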

Description

Returns cluster memberships. For a data matrix, the user specifies either the requested number of clusters or, implicitly, as many clusters as there are rows in a matrix of initial guesses at the cluster centers. Incorporates the S function `kmeans'.

Method

Consider first the clustering of coordinate data. The aim is to find a partition of the observations into ng groups (or as many groups as centers has rows) that minimizes the sum of the withinss values. Actually guaranteeing the minimum would be computationally infeasible in many settings; this function instead finds a local minimum, that is, a solution such that no single switch of an observation from one group to another decreases the objective criterion.

In the case of the Spaeth (1985) algorithms, the local minimum arrived at depends on the initial arbitrary assignment of observations to groups. This assignment is carried out using a random number generator, so subsequent executions of partition will use different initial assignments. The function should therefore be run a number of times and the best result kept, as in the sketch below.
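For example (a minimal sketch: the helper wss below is illustrative and not part of the package; it recomputes the criterion from the membership vector, which is returned in all cases):

# Within-group sum of squares, computed from a membership vector.
wss <- function(a, cl) {
  tot <- 0
  for (g in unique(cl)) {
    sub <- a[cl == g, , drop=F]
    ctr <- apply(sub, 2, mean)                # group center
    tot <- tot + sum(sweep(sub, 2, ctr)^2)    # squared deviations from center
  }
  tot
}
# Run partition() several times; keep the best local minimum found.
best <- partition(a, 3)
for (i in 1:9) {
  trial <- partition(a, 3)
  if (wss(a, trial$cluster) < wss(a, best$cluster)) best <- trial
}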

The input data are not standardized: it may be advisable to standardize before clustering, for instance by dividing each column by the range of its values (sketched below).
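Range-standardization, for instance, can be carried out with sweep() before calling partition (a sketch; a.std and rng are illustrative names):

# Divide each column by its range, then cluster the rescaled data.
rng   <- apply(a, 2, function(col) diff(range(col)))
a.std <- sweep(a, 2, rng, "/")
pp    <- partition(a.std, 3)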

Note that this routine is best used with a small number of groups; requesting a large number of groups may produce many empty classes, in which case group centroids and compactness values may be undefined. The same can happen when the input data contain very large values. The cluster memberships, though, can still be used. A membership-based check for empty classes is sketched below.
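Empty classes can be detected from the membership vector alone (a sketch, assuming pp is the result of a call such as pp <- partition(a, ng)):

# Count members per requested group; zeros flag empty classes.
counts <- tabulate(pp$cluster, ng)
any(counts == 0)    # TRUE indicates at least one empty class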

Sample timings for the Spaeth (1985) algorithms (accessed by specifying the desired number of groups): with 33000 rows, 4 columns, and 3 groups, about 40 seconds on a Sun SPARCstation 1; with 50 groups, about 140 seconds.

When deciding on the number of clusters, Hartigan (1975, pp. 90-91) suggests the following rough rule of thumb. If fk is the result of partition with k groups and fk1 the result with k+1 groups, then it is justifiable to add the extra group when

(sum(fk$withinss)/sum(fk1$withinss) - 1) * (nrow(a) - k - 1)

is greater than 10.
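In code, the rule can be applied directly (a sketch assuming the withinss component is available, as in the centers form of the call):

# Is a (k+1)-group solution justified over a k-group one?
k    <- 3
fk   <- partition(a, k)
fk1  <- partition(a, k + 1)
stat <- (sum(fk$withinss)/sum(fk1$withinss) - 1) * (nrow(a) - k - 1)
stat > 10    # TRUE suggests adding the extra group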

When a partition is being obtained from a dendrogram, the procedure is straightforward: a slice is made orthogonal to the cluster criterion value axis, and the clusters are read off (see the final example below).

References

Spaeth, H. (1985). Cluster Dissection and Analysis: Theory, Fortran Programs, Examples. Ellis Horwood, Chichester.

Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.

Hartigan, J.A. and Wong, M.A. (1979). A k-means clustering algorithm. Applied Statistics, 28, 100-108.

See Also

kmeans (incorporated into this function); hierclust (hierarchical clustering routines, incorporating hclust); modclust (hierarchical clustering, incorporating mclust).

Examples

# Firstly, specifying the desired number of groups.
ng <- 3
pp <- partition(a, ng)
# How well did the partitioning do?  Examine the group sizes.
table(pp$cluster)
# Plot the results in the plane comprising the first two columns of `a'
x <- a[,1]       # for convenience
y <- a[,2]
plot(x, y, type="n")     # set up plot scales, etc.
points(x[pp$cluster==1], y[pp$cluster==1], pch="*")
points(x[pp$cluster==2], y[pp$cluster==2], pch=".")
points(x[pp$cluster==3], y[pp$cluster==3], pch="o")

# Secondly, specifying guesses at cluster centers.
irismean <- t(apply(iris, c(2, 3), mean))   # species means: one initial center per species
x <- rbind(iris[,,1], iris[,,2], iris[,,3]) # stack the three 50 x 4 species slices
km <- partition(x, irismean)
wrong <- km$cluster != rep(1:3, c(50, 50, 50))   # misclassified observations
spin(x, highlight=wrong)
plot(x[,2], x[,3], type="n")
text(x[!wrong, 2], x[!wrong, 3], km$cluster[!wrong])  # label correctly clustered points
points(x[wrong, 2], x[wrong, 3], pch=15)              # filled squares for points in error
title(main="K-Means Clustering of the Iris Data")

# Now the hierarchy slicing case.
# Produce a hierarchical clustering and plot it.
clhier    <- hierclust(a)
motif()
plclust(clhier)
# Derive the 4-cluster partition from this.  Update plot of dendrogram.
clresult2 <- partition(clhier, 4, showslice=T)

