PHYS291 report:

Project in Data Handling in Physics. Implemented three different clustering algorithms on measured data.
Goal: determining the cluster of origin for data points.
This is achieved through the k-means algorithm, agglomerative clustering and the EM algorithm,
generalized to work in N dimensions on M measurements,
with graphical representation in the specific case of 4 dimensions and 150 measurements.

Part 1: k-means clustering:

Imported data from "iris.dat" containing the famous measurements of petal and sepal size from three different iris species.
Then performed the k-means clustering algorithm with 3 clusters.
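
A minimal sketch of how such a file could be read in plain C++ is shown below; the assumption that "iris.dat" contains four whitespace-separated measurement columns per row is made only for illustration and may not match the actual file layout.

  #include <fstream>
  #include <vector>

  // Read a whitespace-separated data file into rows of doubles.
  // Assumes nCols measurement columns per row; adjust to the real layout of iris.dat.
  std::vector<std::vector<double>> readData(const char* fname, int nCols = 4) {
      std::vector<std::vector<double>> rows;
      std::ifstream in(fname);
      std::vector<double> row(nCols);
      while (true) {
          for (int c = 0; c < nCols; ++c)
              if (!(in >> row[c])) return rows;   // stop at end of file / bad line
          rows.push_back(row);
      }
  }

  // usage: auto data = readData("iris.dat");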

Algorithm:

  1. Initialize clusters by randomly assigning a cluster label to every point.
  2. Calculate the k cluster centroids (the average value in each dimension).
  3. Update each cluster label according to the closest centroid.
  4. Repeat steps 2 and 3 until nothing changes between iterations (a sketch follows below).
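
For reference, here is a minimal sketch of this procedure in plain C++ (it is not the K_Means.C macro itself); the data layout, a vector of N-dimensional points, is an assumption made for the illustration.

  #include <vector>
  #include <cstdlib>

  using Point = std::vector<double>;            // one measurement in N dimensions

  // Squared Euclidean distance between a point and a centroid.
  double dist2(const Point& a, const Point& b) {
      double d = 0;
      for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
      return d;
  }

  // k-means: returns the cluster label of every point.
  std::vector<int> kMeans(const std::vector<Point>& data, int k) {
      const size_t n = data.size(), dim = data[0].size();
      std::vector<int> label(n);
      for (size_t i = 0; i < n; ++i) label[i] = std::rand() % k;   // step 1: random labels

      bool changed = true;
      while (changed) {                                            // step 4: loop to convergence
          // Step 2: centroid = average of the points currently in each cluster.
          std::vector<Point> centroid(k, Point(dim, 0.0));
          std::vector<int>   count(k, 0);
          for (size_t i = 0; i < n; ++i) {
              ++count[label[i]];
              for (size_t d = 0; d < dim; ++d) centroid[label[i]][d] += data[i][d];
          }
          for (int c = 0; c < k; ++c)
              if (count[c] > 0)
                  for (size_t d = 0; d < dim; ++d) centroid[c][d] /= count[c];

          // Step 3: move every point to the cluster with the closest centroid.
          changed = false;
          for (size_t i = 0; i < n; ++i) {
              int best = 0;
              for (int c = 1; c < k; ++c)
                  if (dist2(data[i], centroid[c]) < dist2(data[i], centroid[best])) best = c;
              if (best != label[i]) { label[i] = best; changed = true; }
          }
      }
      return label;
  }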

1. Initialized cluster:

3. Updated cluster-label:

4. Convergent cluster-labels:

All combinations of dimensions:

1. Initialized cluster:

3. Updated cluster-label:

4. Convergent cluster-labels:

Real clustering:

This classifier gives 82% correct classification.
Code can be run in ROOT with the command: ".x K_Means.C".
Code can be seen here.

There is also an example with three distinct Gaussian-distributed clusters, which gives 100% correct classification.
Code for that can be seen here.

Part 2: Agglomerative clustering:

Again imported data from "iris.dat" containing the famous measurements of petal and sepal size from three different iris species.
Then performed Agglomerative clustering until there were 3 clusters left.

Algorithm:

  1. Assign a cluster to each data point.
  2. Calculate the all-to-all cluster distances (using the average distance when a cluster has more than one point).
  3. Merge the two clusters closest to each other.
  4. Repeat steps 2 and 3 until only three clusters are left (a sketch follows below).
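
Again for reference, below is a minimal sketch of this average-linkage procedure in plain C++ (not the actual clustering2.C macro), using the same assumed data layout as the k-means sketch above.

  #include <vector>
  #include <cmath>
  #include <limits>

  using Point = std::vector<double>;

  // Euclidean distance between two points.
  double dist(const Point& a, const Point& b) {
      double d = 0;
      for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
      return std::sqrt(d);
  }

  // Average pairwise distance between two clusters of point indices.
  double clusterDist(const std::vector<int>& ca, const std::vector<int>& cb,
                     const std::vector<Point>& data) {
      double sum = 0;
      for (int i : ca) for (int j : cb) sum += dist(data[i], data[j]);
      return sum / (ca.size() * cb.size());
  }

  // Merge clusters until only `target` remain; each cluster is a list of point indices.
  std::vector<std::vector<int>> agglomerate(const std::vector<Point>& data, size_t target) {
      std::vector<std::vector<int>> clusters;
      for (size_t i = 0; i < data.size(); ++i) clusters.push_back({(int)i});   // step 1

      while (clusters.size() > target) {                                       // step 4
          size_t ia = 0, ib = 1;
          double best = std::numeric_limits<double>::max();
          for (size_t a = 0; a < clusters.size(); ++a)                         // step 2
              for (size_t b = a + 1; b < clusters.size(); ++b) {
                  double d = clusterDist(clusters[a], clusters[b], data);
                  if (d < best) { best = d; ia = a; ib = b; }
              }
          // Step 3: merge the closest pair of clusters.
          clusters[ia].insert(clusters[ia].end(), clusters[ib].begin(), clusters[ib].end());
          clusters.erase(clusters.begin() + ib);
      }
      return clusters;
  }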

150 Initialized clusters:

Reduced to 30 clusters:

Reduced to 3 clusters:

This classifier gives 83.3% correct classification.
Code can be run in ROOT with the command: ".x clustering2.C".
Code can be seen here.

There is also an example with three distinct Gaussian-distributed clusters, which gives 100% correct classification.
Code for that can be seen here.

Part 3: Quadratic discriminant analysis:

Again imported data from "iris.dat".
Then performed the EM algorithm to generate three Gaussian clusters.

EM Algorithm:

  1. Create three Gaussian distributions, with the three means initialized from one measurement of each species,
    and give them a standard deviation of 1.
  2. Calculate the probability of each point belonging to each of the three distributions,
    and classify the point as belonging to the cluster with the highest probability.
  3. Then calculate the mean and standard deviation of each cluster from the points assigned to it.
  4. Repeat steps 2 and 3 until there is no change (a sketch of this loop follows below).
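
Below is a minimal sketch of this classify-and-update loop in plain C++ (not the actual EM_alg.C macro). It assumes a hard assignment to the most probable cluster and independent per-dimension standard deviations for each Gaussian; those choices match the description above but are otherwise assumptions.

  #include <vector>
  #include <cmath>
  #include <algorithm>

  using Point = std::vector<double>;

  // Probability density of x under a Gaussian with independent dimensions.
  double gaussProb(const Point& x, const Point& mean, const Point& sd) {
      double p = 1.0;
      for (size_t d = 0; d < x.size(); ++d) {
          double z = (x[d] - mean[d]) / sd[d];
          p *= std::exp(-0.5 * z * z) / (sd[d] * std::sqrt(2.0 * M_PI));
      }
      return p;
  }

  // Hard-assignment EM: means start from one point per species, sd = 1 (step 1).
  std::vector<int> emClassify(const std::vector<Point>& data,
                              std::vector<Point> mean, std::vector<Point> sd) {
      const size_t n = data.size(), k = mean.size(), dim = data[0].size();
      std::vector<int> label(n, -1);
      bool changed = true;
      while (changed) {                                   // step 4: loop until labels are stable
          changed = false;
          for (size_t i = 0; i < n; ++i) {                // step 2: most probable cluster
              int best = 0;
              for (size_t c = 1; c < k; ++c)
                  if (gaussProb(data[i], mean[c], sd[c]) >
                      gaussProb(data[i], mean[best], sd[best])) best = c;
              if (best != label[i]) { label[i] = best; changed = true; }
          }
          for (size_t c = 0; c < k; ++c) {                // step 3: new mean and sd per cluster
              std::vector<int> idx;
              for (size_t i = 0; i < n; ++i) if (label[i] == (int)c) idx.push_back(i);
              if (idx.empty()) continue;
              for (size_t d = 0; d < dim; ++d) {
                  double m = 0, s = 0;
                  for (int i : idx) m += data[i][d];
                  m /= idx.size();
                  for (int i : idx) s += (data[i][d] - m) * (data[i][d] - m);
                  mean[c][d] = m;
                  sd[c][d]   = std::max(std::sqrt(s / idx.size()), 1e-6);   // floor avoids sd = 0
              }
          }
      }
      return label;
  }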

Histogram of data set:

Histogram of the data set with the sum of the three Gaussian distributions; the amplitudes are fitted to the histograms.
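
For reference, such an amplitude-only fit could be set up in ROOT roughly as sketched below; the histogram hist, the fit range and the fixed means and sigmas are placeholder values, not the ones used in EM_alg.C.

  // hist is assumed to be an existing TH1 holding the data of one dimension.
  // Sum of three Gaussians; parameters 0, 3 and 6 are the amplitudes.
  TF1 *f = new TF1("f3g", "gaus(0) + gaus(3) + gaus(6)", 4.0, 8.0);
  f->SetParameters(10, 5.0, 0.35,   // amplitude, mean, sigma of cluster 1 (placeholders)
                   10, 5.9, 0.50,   // cluster 2
                   10, 6.6, 0.60);  // cluster 3
  // Fix the means and sigmas, so that only the three amplitudes are free in the fit.
  for (int p : {1, 2, 4, 5, 7, 8}) f->FixParameter(p, f->GetParameter(p));
  hist->Fit(f, "R");                // "R": restrict the fit to the function range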

Graph of the three Gaussians in 4 dimensions:

Graphs of the three Gaussian distributions in each of the 4 dimensions.


* In the second image the green graph is behind the blue.

Convergent cluster-labels:


This classifier gives 94% correct classification.
Code can be run in ROOT with the command: ".x EM_alg.C".
Code can be seen here.

NB: the program gives a "malloc" error when finishing. I have tried to fix this, but with no luck yet.
I suspect it is caused by allocated memory not being deleted (it should be freed on exiting the program).

Part 4: Stuff which is not completed or crashes:

Work which has been abandoned due to time constraints, but some of it is still good:

  1. k_means2.C gives the fit of three Gaussian curves to the histograms, with 12 degrees of freedom (mean, standard deviation and amplitude).


    But this cannot be used as a classifier, since some of the amplitudes are negative.

  2. One-dimensional Gaussian classifier for the iris data set:

    Gives 72.7% correct classification. Code here.
  3. Attempt to draw the ellipse given by the covariance of a two-dimensional Gaussian distribution.
    TO DO: figure out the angle. Example from the file draw_ellipse_example.C.