Density Clustering

Download the data file iris.data from the UCI Machine Learning Repository. The dataset has 150 instances and five attributes; the last column is a categorical attribute giving the type of Iris flower.
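As a starting point, the following minimal sketch reads a comma-separated file without relying on column names, treating the last field as the label and every other field as a number. The function name `load_dataset` and the two in-memory sample rows are illustrative assumptions, not part of the assignment; in practice you would pass an open file handle for iris.data instead.

```python
import csv
import io


def load_dataset(source):
    """Parse comma-separated rows from an iterable of lines.

    The last field of each row is kept as the label; all other
    fields are converted to floats. Blank lines are skipped.
    """
    points, labels = [], []
    for row in csv.reader(source):
        if not row:  # csv.reader yields [] for a blank line
            continue
        points.append([float(v) for v in row[:-1]])
        labels.append(row[-1].strip())
    return points, labels


# Illustrative rows only; replace with open("iris.data", newline="").
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "\n"
)
points, labels = load_dataset(sample)
```

Skipping empty rows guards against blank lines in the file, which would otherwise break the float conversion.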

Write a script to implement the DENCLUE density-based clustering algorithm (Algorithm 15.2 in Chapter 15). The script should take as input a dataset \(\mathbf{D}\), the minimum density \(\xi\), the tolerance for convergence \(\epsilon\), and the width \(h\). Do not make any assumptions about the data (e.g., column names), except that the last column gives the "true" cluster id.
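The hill-climbing step that locates a density attractor can be sketched as follows. This is only one piece of the algorithm, and it assumes a Gaussian kernel, which the text above does not specify. The function names, the `max_iter` safety cap, and the toy data are all illustrative assumptions. Grouping attractors into clusters and applying the \(\xi\) threshold are left out.

```python
import math


def gaussian_kernel(z):
    """Gaussian kernel (an assumed choice) evaluated at vector z."""
    d = len(z)
    return math.exp(-sum(v * v for v in z) / 2.0) / (2 * math.pi) ** (d / 2)


def density(x, data, h):
    """Kernel density estimate at x with width h."""
    d = len(x)
    total = sum(
        gaussian_kernel([(a - b) / h for a, b in zip(x, p)]) for p in data
    )
    return total / (len(data) * h ** d)


def find_attractor(x, data, h, eps, max_iter=1000):
    """Move x to the kernel-weighted mean of the data until the step
    length is at most eps. max_iter is a hypothetical safety cap."""
    current = list(x)
    for _ in range(max_iter):
        weights = [
            gaussian_kernel([(a - b) / h for a, b in zip(current, p)])
            for p in data
        ]
        total = sum(weights)
        if total == 0.0:  # every kernel weight underflowed; stop here
            break
        nxt = [
            sum(w * p[j] for w, p in zip(weights, data)) / total
            for j in range(len(current))
        ]
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(nxt, current)))
        current = nxt
        if step <= eps:
            break
    return current


# Toy data: two well-separated pairs of points (illustrative only).
data = [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]]
attractor = find_attractor(data[0], data, h=0.5, eps=1e-4)
```

On this toy data the first point climbs to the midpoint of its own pair, since the distant pair contributes negligible kernel weight. A point would then be kept only if the density at its attractor is at least \(\xi\).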

Run your script on the iris dataset, with \(\epsilon=0.0001\). Your script should output the following:

  • The number of clusters, and the size of each cluster.

  • The density attractor, followed by the set of points in that cluster.

  • Purity of the clustering, based on the true id.
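For the purity item in the list above, a minimal sketch is given here. It assumes purity is the fraction of points that carry the majority true label of their assigned cluster, and that every point passed in has a cluster id. How to treat points dropped as noise is a choice the assignment leaves open. The function name and the sample labels are illustrative.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of points whose true label is the majority label
    of the cluster they were assigned to."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(
        max(counts.values()) for counts in by_cluster.values()
    )
    return majority_total / len(true_labels)


# Illustrative assignment: cluster 0 holds a, a, b; cluster 1 holds b, b, b.
example = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```

Here cluster 0 contributes 2 majority points and cluster 1 contributes 3, so the purity is 5/6.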

For Iris, use a value of \(\xi\) that yields 3 clusters, since there are 3 true clusters in the data: try different values, then report the results only for the value that gives 3 clusters. Select the value of \(h\) empirically.

To speed up the density estimation at a point, you may want to first identify its K nearest neighbors and use only those neighbors in the estimate.
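One way to restrict the estimate to the K nearest neighbors is sketched below, again assuming a Gaussian kernel. The function name `knn_density` and the toy data are illustrative. This version still computes every distance for each query, so it only shows the truncation itself; an actual speedup would need the neighbor lists to be precomputed or looked up through a spatial index.

```python
import heapq
import math


def knn_density(x, data, h, k):
    """Gaussian kernel density at x using only the k nearest points.

    The sum is still normalized by the full dataset size, so the
    result approximates the full estimate from below.
    """
    d = len(x)
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, p)) for p in data]
    nearest = heapq.nsmallest(k, sq_dists)
    norm = (2 * math.pi) ** (d / 2)
    total = sum(math.exp(-s / (2 * h * h)) for s in nearest) / norm
    return total / (len(data) * h ** d)


# Toy data: two well-separated pairs of points (illustrative only).
data = [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]]
full = knn_density([0.1, 0.0], data, h=0.5, k=len(data))
truncated = knn_density([0.1, 0.0], data, h=0.5, k=2)
```

Because the kernel weight decays rapidly with distance, dropping far-away points changes the estimate very little when \(h\) is small relative to the spacing between clusters.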