How to perform K-means clustering operation in Python?

Python | K-means Clustering Operation: In this tutorial, we will learn how to perform K-means clustering using Python.
By Shivang Yadav. Last updated: September 18, 2023

One of the major applications of the Python programming language is machine learning, and K-means clustering is one of the best-known algorithms for unsupervised machine learning.

K-Means Clustering

K-means clustering is a popular unsupervised machine learning algorithm that partitions a dataset into K groups, or clusters, based on similarity. Its goal is to find clusters in which data points are more similar to the other points in their own cluster than to points in other clusters; formally, it tries to minimize the sum of squared distances between each point and the centroid of its cluster. It is widely used in fields such as data analysis, image processing, and customer segmentation.

Working of K-means clustering

1. Initialization: Choose the number of clusters (K) you want to create. Then initialize K cluster centroids, either randomly within the data space or by using a specific method.
2. Assignment: For each data point in the dataset, compute its distance to each of the K centroids. The distance is typically the Euclidean distance. Assign the data point to the cluster of the nearest centroid.
3. Update Centroids: Recalculate the centroid of each cluster by taking the mean of all the data points assigned to that cluster. These new centroids represent the center of each cluster.
4. Repeat: Repeat steps 2 and 3 until either a convergence criterion is met (e.g., the centroids no longer change significantly) or a fixed number of iterations is reached.
5. Result: The final clusters are formed when the algorithm stops, and each data point belongs to the cluster with the nearest centroid.
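
The steps above can be sketched in plain NumPy. This is only a minimal illustration under simple assumptions, not how scikit-learn implements K-means: the function names assign_clusters and kmeans_sketch, their parameters, and the sample points are all made up for this example, and the starting centroids are simply K randomly chosen data points.

```python
import numpy as np


def assign_clusters(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    # Euclidean distance from each point to each centroid, shape (n_points, k)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_sketch(points, k, max_iter=100, tol=1e-6, seed=0):
    """Toy K-means (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: use k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign_clusters(points, centroids)
        # Update step: mean of the assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        # Convergence criterion: the centroids have (almost) stopped moving
        if shift < tol:
            break
    # Recompute the labels so they match the final centroids
    return centroids, assign_clusters(points, centroids)


# Made-up sample: two well-separated groups of three points each
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])
centroids, labels = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.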

It's important to choose an appropriate value of K, as it significantly affects the quality of clustering. Various methods, such as the elbow method or silhouette analysis, can help determine the optimal number of clusters.
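
As a rough illustration of silhouette analysis, the sketch below fits KMeans for several values of K and keeps the K with the highest mean silhouette score, computed with scikit-learn's silhouette_score. The synthetic data (three tight groups of points), the range of K values, and the variable names are assumptions made for this example; on real data the scores are usually less clear-cut and should be read alongside domain knowledge.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (made up for this example): three tight, well-separated groups
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in group_centers])

# The silhouette score needs at least 2 clusters, so start at K = 2
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_)

# A higher mean silhouette score means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f'K = {k}: silhouette = {score:.3f}')
print(f'Best K by silhouette score: {best_k}')
```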

Python program to perform K-means clustering operation

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Create the data set (it contains some missing values) and print it
dataSet = pd.DataFrame({
    'set1': [18, np.nan, 19, 14, 14, 11, 20, 28, 30, 31, 35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
    'set2': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9, 12, 14, np.nan, 9, 4, 3, 4, 12, 15, 11],
    'set3': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4, 11, 6, 5, 5, 3, 8, 12, 7, 6, 5]
})

print(f'The data set is \n{dataSet}')

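# Drop rows with missing values, then standardize each column
# (zero mean, unit variance) so no single column dominates the distances
dataSet = dataSet.dropna()
scaled_df = StandardScaler().fit_transform(dataSet)

# Settings shared by every KMeans model below
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "random_state": 1,
}

# Record the sum of squared errors (inertia) for K = 1..10;
# these values are what the elbow method inspects
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_df)
    sse.append(kmeans.inertia_)

# Fit the final model with K = 3 and print the cluster label of each row
kmeans = KMeans(n_clusters=3, **kmeans_kwargs)
kmeans.fit(scaled_df)
print(f'K means Clusters : \n{kmeans.labels_}')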

Output

The output of the above program is:

The data set is 
    set1  set2  set3
0   18.0   3.0    15
1    NaN   3.0    14
2   19.0   4.0    14
3   14.0   5.0    10
4   14.0   4.0     8
5   11.0   7.0    14
6   20.0   8.0    13
7   28.0   7.0     9
8   30.0   6.0     5
9   31.0   9.0     4
10  35.0  12.0    11
11  33.0  14.0     6
12  29.0   NaN     5
13  25.0   9.0     5
14  25.0   4.0     3
15  27.0   3.0     8
16  29.0   4.0    12
17  30.0  12.0     7
18  19.0  15.0     6
19  23.0  11.0     5
K means Clusters : 
[1 1 1 1 1 1 2 2 0 0 0 0 2 2 2 0 0 0]
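
The label array has 18 entries rather than 20 because the two rows containing NaN were dropped before fitting. To read the result more easily, the labels can be attached back to the rows and the centroids converted to the original units. The sketch below shows one way to do this; it uses a small made-up data set (not the data above) and hypothetical variable names, and it keeps the fitted StandardScaler in a variable so that its inverse_transform method can be used.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Small made-up data set (not the tutorial's data), with no missing values
df = pd.DataFrame({
    'set1': [10, 11, 12, 30, 31, 32],
    'set2': [1, 2, 1, 9, 8, 9],
})

# Keep the scaler object so the scaling can be undone later
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(scaled)

# labels_ follows the row order of the fitted data, so it lines up with df
labelled = df.assign(cluster=model.labels_)

# Centroids are in scaled units; convert them back to the original units
centers = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=df.columns,
)

print(labelled)
print(centers)
```

If rows were removed with dropna() before fitting, attach the labels to the DataFrame that remains after the drop, since that is the data the model actually saw.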
