Kmeans Mapreduce Save

K-means MapReduce implementation

Project README

K-means MapReduce implementation

In this work k-means clustering algorithm is implemented using MapReduce (Hadoop version 2.8) framework.

To run the program, shell script run.sh should be executed. It requires path to jar file and its input parameters which are:

  • input - path to data file
  • state - path to file that contains clusters
  • number - number of reducers
  • output - output directory
  • delta - threshold convergence (acceptable difference between 2 subsequent centroids)
  • max - maximum number of iterations
  • distance - similairty measure (currently only Euclidean distance is supported)

Workflow

The figure below denotes one iteration of MapReduce program.

alt text

First, Centroids and Context (Configuration) are loaded into the Distributed Cache. This is done by overriding setup function in the Mapper and Reducer class. Afterwards, the input data file is split and each data point is processed by one of the map functions (in Map process). The function writes key-value pairs <Centroid, Point>, where the Centroid is the closest one to the Point. Next, Combiner is used in order to decrease the number of local writings. In this phase data points that are on the same machine are summed up and the number of those data points is recorded, Point.number variable. Now, for the optimization reasons output values are automatically shuffled and sorted by Centroids. The Reducer performs the same procedure as the Combiner, but it also checks whether centroids converged; comparing the difference between old and new centroids with delta input parameter. If a centroid converge, then the global Counter remains unchanged, otherwise, it is incremented.

After the one iteration is done, new centroids are saved and the program checks two conditions, if the program reached the maximum number of iterations or if the Counter value is unchanged. If one of these two conditions is satisfied, then the program is finished, otherwise, the whole MapReduce process is run again with the updated centroids.

Examples

One of the use-cases of k-means algorithm is the color quantization process, reducing the number of distinct colors of an image. (Far better algorithms for this purpose are available)

Numerical (RGB) values of images (Fig. 1) are saved as input data (Fig. 2), and clusters are randomly initialized.

Original Images

alt text

RGB values of original and modified images

alt text

After 10 iterations with 10 clusters, RBG values are represented in Fig. 3. It can be noted that a couple of centroids have vanished.

alt text

Modified images for a different number of centroids

alt text

Modified images for a different number of iterations and 10 centroids

alt text

alt text

Open Source Agenda is not affiliated with "Kmeans Mapreduce" Project. README Source: markomih/kmeans_mapreduce
Stars
43
Open Issues
0
Last Commit
1 year ago

Open Source Agenda Badge

Open Source Agenda Rating