NAME
     Statistics::Gap - Perl extension for the "Gap Statistic"

SYNOPSIS
     use Statistics::Gap;
     &gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "unif", 90);

     OR

     use Statistics::Gap;
     &gap("GapPrefix", "InputFile", "squared", "agglo", 5, 100, "prop", 90);

     Input file is can be in the "dense" format -
     Sample Input file:
   
     6 5
     1       1       0       0       1
     1       0       0       0       0
     1       1       0       0       1
     1       1       0       0       1
     1       0       0       0       1
     1       1       0       0       1              

     OR in "sparse" format -
     Sample Input file:

     8 10 41
     1 1 4 1 8 1 10 1
     1 1 2 1 3 1 5 1 7 1 9 1
     2 1 3 1 4 1 5 1 7 1 9 1
     2 1 4 1 5 1 9 1
     1 1 4 1 6 1
     1 1 2 1 3 1 4 1 5 1 6 1 10 1
     2 1 4 1 5 1 7 1 8 1 10 1
     2 1 4 1 7 1 9 1 10 1

DESCRIPTION
    Given a dataset how does one automatically find the optimal number of
    clusters that the dataset should be grouped into? - is one of the
    prevailing problems. Statisticians Robert Tibshirani, Guenther Walther
    and Trevor Hastie propose a solution for this problem in a Techinal
    Report named - "Estimating the number of clusters in a dataset via the
    Gap Statistic". This perl module implements the approach proposed in the
    above paper.

    If one tries to cluster a dataset (i.e. numerous observations described
    in terms of a feature space) into n groups/clusters and if we plot the
    graph of Within Cluster (Dis)Similarity along Y-axis and Number of
    clusters along X-axis then this graph generally takes a form of a
    elbow/knee depending upon the measure on the Y-axis. The Gap Statistic
    seeks to locate this elbow/knee because the value on the X-axis at this
    elbow is the optimal number of clusters for the data.

    NOTE: Gap Statistic uses reference distribution in the process of
    estimating the number of clusters. The appropriate methodology for
    generation of this reference distribution is dependent on the data to be
    clustered. This module was implemented for data with following
    characteristics: 1. highly sparse - very few features occur in any given
    observation. 2. high multivariate dimensionality (i.e. large feature
    space) 3. binary feature frequency - feature either occurs or does not
    occur in an observation but rarely occurs multiple times in the same
    observation.

  EXPORT
    "gap" function by default.

INPUT
  Prefix
    The string that should be used to as a prefix while naming the
    intermediate files and the .png files (graph files).

  InputFile
    The input dataset can be in "dense" or "sparse" matrix format. The input
    matrix is expected in a plain text file where the first line in the file
    gives the dimensions of the dataset and then the dataset in either of
    the format should follow. The contexts / observations should be along
    the rows and the features should be along the column.

            eg: dense format - 
            6 5
            1       1       0       0       1
            1       0       0       0       0
            1       1       0       0       1
            1       1       0       0       1
            1       0       0       0       1
            1       1       0       0       1       

    The first line (6 5) gives the number of rows (observations) and the
    number of columns (features) present in the following matrix. Each
    following line/row records the presence (1) or absence (0) of the
    feature at the column in the given observation. Thus features1 (1st
    column) occurs once in the observation1 and infact once in all the other
    observations too while the feature3 does not occur in observation1.

            eg: sparse format -
                    8 10 41
                    1 1 4 1 8 1 10 1
                    1 1 2 1 3 1 5 1 7 1 9 1
                    2 1 3 1 4 1 5 1 7 1 9 1
                    2 1 4 1 5 1 9 1
                    1 1 4 1 6 1
                    1 1 2 1 3 1 4 1 5 1 6 1 10 1
                    2 1 4 1 5 1 7 1 8 1 10 1
                    2 1 4 1 7 1 9 1 10 1

    The first line (8 10 41) gives the number of rows (observations), the
    number of columns (features) and the number of non-zero elements present
    in the following matrix. Each of the following line/row consists of pair
    of values where the first number in the pair gives the column number
    corresponding to the feature that occurs in the observation
    corresponding to the row. In this format only features that are
    present/observed (1) are recorded i.e. the features that are
    absent/not-observed (0) are implied. Thus in the 1st row (observation1)
    of the above matrix only features1, feature4, feature8 and feature10
    occur while the other 7 features do not occur.

  DistanceMeasure
    The Distance Measure that should be used. Currrently this module
    supports the following distance measure: 1. Squared Euclidean (string
    that should be used as an argument: "squared") 2. Manhattan (string that
    should be used as an argument: "manhattan") 3. Euclidean (string that
    should be used as an argument: "euclidean")

  ClusteringAlgorithm
    The Clustering Measures that can be used are: 1. rb - Repeated
    Bisections [Default] 2. rbr - Repeated Bisections for by k-way
    refinement 3. direct - Direct k-way clustering 4. agglo - Agglomerative
    clustering 5. graph - Graph partitioning-based clustering 6. bagglo -
    Partitional biased Agglomerative clustering

  K value
    This is an approximate upper bound for the number of clusters that may
    be present in the dataset. Thus for a dataset that you expect to be
    seperated into 3 clusters this value should be set some integer value
    greater than 3.

  B value
    Specifies the number of replicates that should be sampled from the
    generated reference distribution using Monte Carlo Sampling.

  ReferenceGenerationMethod
    1. Uniform - While generating the reference distribution, all the
    features in the feature set have equal probability of being selected for
    the observation under consideration.

    2. Proportional - Each feature is assigned a probability of being
    selected depending upon its frequency of occurrence in the observed
    data. Thus feature distribution is taken into consideration while
    selecting the features for the reference distribution generation.

  Percentage Confidence Interval
    This parameter specifies the percentage confidence to be reported in the
    log file. Since Statistics::Gap uses parametric bootstrap method for
    reference distribution generation, it is critical to understand the
    interval around the sample mean that could contain the population
    ("true") mean and with what certainty.

OUTPUT
    1. A single integer number at STDOUT which is the Gap Statistic's
    estimate of number of clusters present in the input dataset. 2. The
    PREFIX.log file contains the log of various values at different K
    values. The first table in the file gives values like Gap(k), log(W(k))
    etc. for every K value experimented with. 3. The PREFIX.fig*.png files
    are the graphical representations of different values which help locate
    the knee/elbow.

PRE-REQUISITES
    1. This module uses suite of C programs called CLUTO for clustering
    purposes. Thus CLUTO needs to be installed for this module to be
    functional. CLUTO can be downloaded from
    http://www-users.cs.umn.edu/~karypis/cluto/

    2. Following Perl Modules 1. GD
    (http://search.cpan.org/~lds/GD-2.19/GD.pm) 2. GD::Text
    (http://search.cpan.org/~mverb/GDTextUtil-0.86/Text.pm) 3.
    GD::Graph::lines (http://search.cpan.org/~mverb/GDGraph-1.43/) 4.
    GD::Graph::colour
    (http://search.cpan.org/~mverb/GDGraph-1.43/Graph/colour.pm)

SEE ALSO
        http://citeseer.ist.psu.edu/tibshirani00estimating.html
        http://www-users.cs.umn.edu/~karypis/cluto/

AUTHOR
        Anagha Kulkarni, University of Minnesota Duluth
        kulka020 <at> d.umn.edu
        
        Guergana Savova, Mayo Clinic
        savova.guergana <at> mayo.edu

COPYRIGHT AND LICENSE
    Copyright (C) 2006-2008, Anagha Kulkarni and Guergana Savova

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version. This program is distributed in the hope that
    it will be useful, but WITHOUT ANY WARRANTY; without even the implied
    warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.