Goals And Applications Of Cluster Analysis

Goals of Cluster Analysis

  • The goal of cluster analysis is to partition the data into distinct sub-groups or clusters such that observations within the same cluster are very similar (homogeneous), while observations in different clusters are dissimilar (heterogeneous).
  • The measure of similarity may be distance, correlation, cosine similarity, etc., depending on the context/domain of the problem (see the short sketch after this list).
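As a small illustration of these measures (the vectors a and b below are made up for the example), the following sketch computes them with NumPy:

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # hypothetical observation 1
b = np.array([2.0, 2.5, 4.0])   # hypothetical observation 2

dist = np.linalg.norm(a - b)                               # Euclidean distance (smaller = more similar)
corr = np.corrcoef(a, b)[0, 1]                             # Pearson correlation (closer to 1 = more similar)
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))      # cosine similarity (closer to 1 = more similar)
print(dist, corr, cos)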

Applications of clustering:

  • One very popular application of cluster analysis in business is market segmentation. Here, customers are grouped into distinct clusters or market segments and each segment is targeted with different marketing mixes such as different promotional messages, different products, different prices, and different distribution channels.
  • Other examples include clustering products into sub-groups based on attributes such as price elasticity, genre, etc.
  • In a way, clustering compresses the entire dataset into a small set of sub-groups, so clustering can also be viewed as a data reduction technique.

K-Means Clustering – a very popular clustering algorithm

  • The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.
  • One possible measure of within-cluster variation for the kth cluster is the sum of all pairwise squared Euclidean distances between the observations in the kth cluster, divided by the number of observations in that cluster (see the short numerical illustration after this list).
  • The total within-cluster variation is the sum of the within-cluster variations over all K clusters.
  • Minimizing this total within-cluster variation is the optimization problem for K-means clustering.
  • The K-means algorithm provides a local optimum, which is nevertheless usually a good solution to the above optimization problem.
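As a concrete illustration of the within-cluster variation described above (the three observations are made up), the sketch below computes it for a single cluster and checks that it equals twice the sum of squared distances to the cluster centroid:

import numpy as np

# hypothetical cluster of three observations with two features
cluster_k = np.array([[1.0, 2.0],
                      [1.5, 1.8],
                      [1.2, 2.4]])

# sum of pairwise squared Euclidean distances, divided by the cluster size
n_k = len(cluster_k)
W_k = sum(np.sum((xi - xj) ** 2) for xi in cluster_k for xj in cluster_k) / n_k

# equivalent centroid form: twice the sum of squared distances to the cluster centroid
centroid = cluster_k.mean(axis=0)
W_k_centroid = 2 * np.sum((cluster_k - centroid) ** 2)

print(W_k, W_k_centroid)   # the two values agree; the total is the sum of these over all K clusters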

K-Means Algorithm

  • Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
  • Iterate until there is no change in cluster centroids (change is below a tolerance limit):
    1. For each of the K clusters, compute the cluster centroid: the vector of feature means for the observations in the kth cluster.
    2. Assign each observation to the cluster whose centroid is closest (a minimal sketch of this iteration follows the list).
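A minimal NumPy sketch of this two-step iteration (an illustrative implementation, not the one scikit-learn uses; it does not handle the rare case of a cluster becoming empty):

import numpy as np

def simple_kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))      # random initial cluster assignments
    for _ in range(max_iter):
        # step 1: centroid of each cluster = vector of feature means
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2: reassign each observation to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # stop when assignments no longer change
            break
        labels = new_labels
    return labels, centroids

On real data you would use scikit-learn's KMeans, as in the code further below.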

The mathematical perspective of clustering:

  • Let C1, …, CK denote sets containing the indices of the observations in each cluster. These sets should satisfy two properties:
    1. C1 ∪ C2 ∪ … ∪ CK = {1, …, n}. That is, each observation belongs to at least one of the K clusters.
    2. Ci ∩ Cj = ∅ for all i ≠ j. No observation belongs to more than one cluster.
  • K-means cost function:
    1. Let Z1, …, ZK be the cluster centroids.
    2. Cost(C1, …, CK, Z1, …, ZK) = ∑_{k=1}^{K} ∑_{i ∈ Ck} ‖x_i − Z_k‖², which (up to a factor of 2) equals the total within-cluster variation ∑_{k=1}^{K} (1/|Ck|) ∑_{i, i′ ∈ Ck} ‖x_i − x_i′‖² (a quick numerical check follows this list).
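On a fitted scikit-learn model this cost is exposed as the inertia_ attribute; here is a quick sanity check on hypothetical data (the array X below is made up) that the formula above matches it:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(50, 4))     # hypothetical data for the check
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# cost evaluated at the fitted assignments (km.labels_) and centroids (km.cluster_centers_)
cost = sum(np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2) for k in range(3))
print(cost, km.inertia_)                              # agree up to floating-point error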

Issues with K-Means Clustering:

Because the K-means algorithm finds a local rather than a global optimum, the result depends on the initial random cluster assignment. Hence, it is important to run the algorithm multiple times from different random initial configurations and then select the solution for which the total within-cluster variation is smallest (scikit-learn's n_init argument does exactly this).
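The sketch below (on made-up data) makes this idea explicit by fitting several single-initialization models and keeping the one with the smallest cost; in practice, setting n_init once is simpler:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(50, 4))   # hypothetical data
runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(20)]
best = min(runs, key=lambda m: m.inertia_)          # smallest total within-cluster variation wins
print([round(m.inertia_, 1) for m in runs])
print(best.inertia_)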

How many clusters to extract?

  • Decide on the number of clusters that gives the most interpretable and actionable segments.
  • Other guidelines (each demonstrated in the Python code below):
    • The plot of the K-means cost (twss) versus K, looking for an "elbow"
    • The silhouette score
    • Visualizing the data on the first two principal components

Python Code for K-Means clustering:

# k-means clustering

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.metrics import silhouette_score

# load  data

df=pd.read_csv("H:/Data/USArrests.csv")

pd.options.display.max_columns=5

df.head()

df.shape

df1=df.iloc[:,1:] # removing the string column from the data

df1.head()

# Standardization of features

scaler=StandardScaler()

X=scaler.fit_transform(df1)

X

# Fitting 2 clusters (arbitrary choice)

kmeans=KMeans(n_clusters=2,n_init=20,random_state=0).fit(X)

# Cluster memberships

labels=kmeans.predict(X)

labels

# cluster size

np.unique(labels, return_counts=True) 

# cluster means (centroids)

df['cluster']=labels # add cluster membership as a new column

df.groupby('cluster').mean(numeric_only=True) # numeric_only avoids errors from the string State column

# Deciding the optimal number of clusters in kmeans

k_range=range(2,11)

sil_score=[]

twss=[]

for k in k_range:

    cluster=KMeans(n_clusters=k, n_init=10,random_state=42)

    cluster.fit(X)

    label=cluster.predict(X)

    ss=silhouette_score(X,label)

    sil_score.append(ss)

    twss.append(cluster.inertia_)  

# plot Number of clusters versus twss

plt.plot(k_range,twss,'ro-')

plt.ylabel("twss")

plt.xlabel("number of clusters")
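# look for an 'elbow': the value of K after which twss stops decreasing sharply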

# k versus silhouette

plt.plot(k_range,sil_score,'ro-')

plt.ylabel("Silhouette score")

plt.xlabel("number of clusters")
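# higher silhouette scores indicate better-separated clusters; a K near the peak is a reasonable choice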

# View the data along the first two  principal components

T=PCA(n_components=2).fit_transform(X)

df2=pd.DataFrame(T,columns=['PC1','PC2'])

plt.figure(figsize=(10,10))

plt.scatter(df2.PC1,df2.PC2)

# Fitting 4 clusters based on the twss (elbow) plot

kmeans=KMeans(n_clusters=4, n_init=10,random_state=0).fit(X)

# Cluster memberships

labels=kmeans.predict(X)

labels

# add cluster membership as a new column

df2['cluster']=labels

# view the 4 clusters on the PC plot

plt.figure(figsize=(10,10))

plt.scatter(df2.PC1,df2.PC2,c=df2['cluster'],s=50,cmap='rainbow')

# cluster means/centroids

df1['cluster']=labels

df1.groupby('cluster').mean()

# which states are in which cluster

df['cluster']=labels

df[df.cluster==0] # states in cluster 0

df[df.cluster==1] # states in cluster 1

df[df.cluster==2] # states in cluster 2

df[df.cluster==3] # states in cluster 3

# cluster size

np.unique(labels,return_counts=True)

# scatter plot with state names

fig, ax = plt.subplots(figsize=(10,10))

x=df2.PC1

y=df2.PC2

ax.scatter(x,y,c=labels,s=50, cmap="rainbow")

plt.xlabel('PC1')

plt.ylabel('PC2')

for i, txt in enumerate(df.State):

    ax.annotate(txt, (x[i], y[i]))
