KMeansClusterer

k-means clustering in Ruby. Uses NArray under the hood for fast calculations.

Jump to the examples directory to see this in action.

Features

Runs multiple clustering attempts to find the best solution (a single run can get stuck in a non-optimal local minimum)
Initializes centroids via k-means++ algorithm, for faster convergence
Calculates silhouette score for evaluation
Option to scale data before clustering, so that output isn't biased by different feature scales
Works with high-dimensional data
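
To make the k-means++ feature above more concrete, here is a minimal pure-Ruby sketch of the seeding step. It is not the gem's implementation (which uses NArray); the method names `squared_distance` and `kmpp_init` and the `rng:` keyword are hypothetical and exist only for this illustration. The idea: pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far.

```ruby
# Squared Euclidean distance between two points (arrays of numbers).
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Illustrative k-means++ seeding (hypothetical helper, not the gem's API).
# Returns k points from `points` to use as initial centroids.
def kmpp_init(points, k, rng: Random.new)
  # First centroid: uniformly random point.
  centroids = [points[rng.rand(points.length)]]

  while centroids.length < k
    # Weight of each point = squared distance to its nearest chosen centroid.
    weights = points.map do |point|
      centroids.map { |c| squared_distance(point, c) }.min
    end
    total = weights.sum

    if total.zero?
      # All points coincide with a centroid; fall back to a random pick.
      centroids << points[rng.rand(points.length)]
      next
    end

    # Sample an index with probability proportional to its weight.
    target = rng.rand * total
    cumulative = 0.0
    index = weights.each_index.find { |i| (cumulative += weights[i]) > target }
    # Guard against floating-point rounding at the end of the range.
    centroids << points[index || weights.rindex { |w| w > 0 }]
  end

  centroids
end

data = [[40.71, -74.01], [34.05, -118.24], [39.29, -76.61],
        [45.52, -122.68], [38.9, -77.04], [36.11, -115.17]]

seeds = kmpp_init(data, 2, rng: Random.new(42))
puts seeds.inspect
```

Points far from every existing centroid are more likely to be picked, so the seeds tend to be spread out, which is why this initialization usually converges in fewer iterations than purely random seeding.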

Install

gem install kmeans-clusterer

Usage

Simple example:

require 'kmeans-clusterer'

data = [[40.71,-74.01],[34.05,-118.24],[39.29,-76.61],
        [45.52,-122.68],[38.9,-77.04],[36.11,-115.17]]

labels = ['New York', 'Los Angeles', 'Baltimore', 
          'Portland', 'Washington DC', 'Las Vegas']

k = 2 # find 2 clusters in data

kmeans = KMeansClusterer.run k, data, labels: labels, runs: 5

kmeans.clusters.each do |cluster|
  puts "#{cluster.id}. #{cluster.points.map(&:label).join(', ')}\t#{cluster.centroid}"
end

# Use existing clusters for prediction with new data:
predicted = kmeans.predict [[41.85,-87.65]] # Chicago
puts "\nClosest cluster to Chicago: #{predicted[0]}"

# Clustering quality score, in the range -1.0..1.0 (1.0 is best)
puts "\nSilhouette score: #{kmeans.silhouette.round(2)}"

Output of simple example:

0. New York, Baltimore, Washington DC [39.63, -75.89]
1. Los Angeles, Portland, Las Vegas [38.56, -118.7]

Closest cluster to Chicago: 0

Silhouette score: 0.91
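
For readers unfamiliar with the silhouette score printed above, the following is a rough pure-Ruby sketch of the textbook definition. The helper names (`euclidean`, `silhouette_score`) are made up for this example, and the gem's own calculation may differ in details, so the numbers it produces need not match the output shown above. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the lowest mean distance to the points of any other cluster, and the point's score is `(b - a) / max(a, b)`. The overall score is the mean over all points.

```ruby
# Euclidean distance between two points (arrays of numbers).
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Illustrative silhouette score (hypothetical helper, not the gem's API).
# `clusters` is an array of clusters; each cluster is an array of points.
# Assumes at least two non-empty clusters.
def silhouette_score(clusters)
  scores = clusters.each_with_index.flat_map do |cluster, ci|
    cluster.map do |point|
      # By convention, a point alone in its cluster scores 0.
      next 0.0 if cluster.length == 1

      # a: mean distance to the other members of the same cluster
      # (the distance from the point to itself contributes 0 to the sum).
      a = cluster.sum { |other| euclidean(point, other) } / (cluster.length - 1)

      # b: smallest mean distance to the members of any other cluster.
      b = clusters.each_with_index
                  .reject { |_, cj| cj == ci }
                  .map { |other, _| other.sum { |o| euclidean(point, o) } / other.length }
                  .min

      denominator = [a, b].max
      denominator.zero? ? 0.0 : (b - a) / denominator
    end
  end

  scores.sum / scores.length
end

tight = [[[0, 0], [0, 1]], [[10, 0], [10, 1]]]
mixed = [[[0, 0], [10, 0]], [[0, 1], [10, 1]]]

puts "well-separated clusters: #{silhouette_score(tight).round(2)}"
puts "poorly chosen clusters:  #{silhouette_score(mixed).round(2)}"
```

A score near 1.0 means points sit much closer to their own cluster than to any other; a negative score suggests many points would fit better in a different cluster.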

Options

The following options can be passed in to KMeansClusterer.run:

option	default	description
:labels	nil	optional array of Ruby objects to collate with data array
:runs	10	number of times to run kmeans
:log	false	print stats after each run
:init	:kmpp	algorithm for picking initial cluster centroids. Accepts :kmpp, :random, or an array of k centroids
:scale_data	false	scales features before clustering using formula (data - mean) / std
:float_precision	:double	float precision to use. :double or :single
:max_iter	300	max iterations per run
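
The `(data - mean) / std` formula listed for :scale_data can be illustrated with a small pure-Ruby sketch. This is only an approximation of the idea: the method name `scale_columns` is hypothetical, and it assumes the population standard deviation (dividing by n), which may not be what the gem uses internally. Each feature (column) is shifted to mean 0 and rescaled to unit standard deviation, so a feature measured in large units does not dominate the distance calculation.

```ruby
# Illustrative feature scaling (hypothetical helper, not the gem's API).
# `rows` is an array of points; each point is an array of numbers.
# Assumption: population standard deviation (divide by n).
def scale_columns(rows)
  scaled_columns = rows.transpose.map do |column|
    mean = column.sum / column.length.to_f
    variance = column.sum { |v| (v - mean)**2 } / column.length
    std = Math.sqrt(variance)

    # A constant column has std 0; map it to all zeros to avoid dividing by 0.
    column.map { |v| std.zero? ? 0.0 : (v - mean) / std }
  end

  scaled_columns.transpose
end

# Two features on very different scales.
rows = [[1, 100], [2, 200], [3, 300]]

scale_columns(rows).each { |row| puts row.map { |v| v.round(3) }.inspect }
```

After scaling, both columns in this example hold the same values, because they differed only by a constant factor.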

Open Source Agenda is not affiliated with "Kmeans Clusterer" Project. README Source: gbuesing/kmeans-clusterer

Last Commit

3 years ago

Repository

gbuesing/kmeans-clusterer

License

MIT