Skip to content

Introduction to Customer Segmentation

Foreword

Notes and code snippets. Python 3. From DataCamp.


What is Customer Segmentation?

Customer segmentation is a method of dividing customers into groups or clusters on the basis of common characteristics.

  • The market researcher can segment customers into the B2C model using various customer’s demographic characteristics such as occupation, gender, age, location, and marital status.
  • Psychographic characteristics such as social class, lifestyle and personality characteristics and behavioral characteristics such as spending, consumption habits, product/service usage, and previously purchased products.
  • In the B2B model using various company’s characteristics such as the size of the company, type of industry, and location.

Need of Customer Segmentation

In the Retail sector, the various chain of hypermarkets generating an exceptionally large amount of data. This data is generated on a daily basis across the stores.
This extensive database of customers transactions needs to analyze for designing profitable strategies.
All customers have different-different kind of needs. With the increase in customer base and transaction, it is not easy to understand the requirement of each customer.
Segmentation can play a better role in grouping those customers into various segments:

  • It helps in identifying the most potential customers.
  • It helps managers to easily communicate with a targetted group of the audience.
  • It helps in selecting the best medium for communicating with the targetted segment.
  • It improves the quality of service, loyalty, and retention.
  • It improves customer relationship via better understanding needs of segments.
  • It provides opportunities for upselling and cross-selling.
  • It helps managers to design special offers for targetted customers, to encourage them to buy more products.
  • It helps companies to stay a step ahead of competitors.
  • It also helps in identifying new products that customers could be interested in.

Types of Segmentation

Customer Segmentation using RFM analysis

RFM (Recency, Frequency, Monetary) analysis is a behavior-based approach grouping customers into segments.
It groups the customers on the basis of their previous purchase transactions. How recently, how often, and how much did a customer buy.
RFM filters customers into various groups for the purpose of better service.

  • Recency (R): Who have purchased recently? Number of days since last purchase (least recency).
  • Frequency (F): Who has purchased frequently? It means the total number of purchases. ( high frequency).
  • Monetary Value(M): Who have high purchase amount? It means the total money customer spent (high monetary value).

Steps of RFM(Recency, Frequency, Monetary):

  • Calculate the Recency, Frequency, Monetary values for each customer.
  • Add segment bin values to RFM table using quartile.
  • Sort the customer RFM score in ascending order.

RFM in Python

Importing Data

The dataset is over 500K lines!

# Importing the necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
# Loading the dataset
data = pd.read_csv("Online_Retail.csv", sep=';', decimal=',')
# Examining the data
data.head(3)
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26 2.75 17850.0 United Kingdom
# Examining the fields (culomns)
data.info()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null object
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
data['InvoiceDate'] = data['InvoiceDate'].astype('datetime64[ns]')
data.info()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
# Removing null ID
data = data[pd.notnull(data['CustomerID'])]
# Removing duplicates
filtered_data = data[['Country','CustomerID']].drop_duplicates()

Data Insights

# Filtering the top ten country's customer
# Counting the number of occurrence by country
filtered_data.Country.value_counts()[:10].plot(kind='bar');

# Filtering for the United Kingdom
uk_data = data[data.Country=='United Kingdom']

uk_data.info()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
<class 'pandas.core.frame.DataFrame'>
Int64Index: 361878 entries, 0 to 541893
Data columns (total 8 columns):
InvoiceNo      361878 non-null object
StockCode      361878 non-null object
Description    361878 non-null object
Quantity       361878 non-null int64
InvoiceDate    361878 non-null datetime64[ns]
UnitPrice      361878 non-null float64
CustomerID     361878 non-null float64
Country        361878 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 24.8+ MB
# Getting summary statistics for the UK
uk_data.describe()
Quantity UnitPrice CustomerID
count 361878.000000 361878.000000 361878.000000
mean 11.077029 3.256007 15547.871368
std 263.129266 70.654731 1594.402590
min -80995.000000 0.000000 12346.000000
25% 2.000000 1.250000 14194.000000
50% 4.000000 1.950000 15514.000000
75% 12.000000 3.750000 16931.000000
max 80995.000000 38970.000000 18287.000000
# Some of the customers have ordered in a negative quantity
# Which is not possible
# Filter Quantity greater than zero

# Filtering quantity greater than zero
uk_data = uk_data[(uk_data['Quantity'] > 0)]

uk_data.info()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
<class 'pandas.core.frame.DataFrame'>
Int64Index: 354345 entries, 0 to 541893
Data columns (total 8 columns):
InvoiceNo      354345 non-null object
StockCode      354345 non-null object
Description    354345 non-null object
Quantity       354345 non-null int64
InvoiceDate    354345 non-null datetime64[ns]
UnitPrice      354345 non-null float64
CustomerID     354345 non-null float64
Country        354345 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 24.3+ MB
# Filtering the required columns
# for RFM analysis

# InvoiceNo helps you to count the number of time transaction
# performed(frequency)
# Quantity purchased in each transaction and
# UnitPrice of each unit purchased by the customer
# helps calculating the total purchased amount
uk_data = uk_data[['CustomerID',
                   'InvoiceDate',
                   'InvoiceNo',
                   'Quantity',
                   'UnitPrice']]


uk_data['TotalPrice'] = uk_data['Quantity'] * \
    uk_data['UnitPrice']

# InvoiceDate help you calculate recency of purchase
uk_data['InvoiceDate'].min(), uk_data['InvoiceDate'].max()
1
(Timestamp('2010-12-01 08:26:00'), Timestamp('2011-12-09 12:49:00'))
# Creating a constant
PRESENT = dt.datetime(2011,12,10)

# InvoiceDate help you calculate recency of purchase
uk_data['InvoiceDate'] = pd.to_datetime(uk_data['InvoiceDate'])
uk_data.head()
CustomerID InvoiceDate InvoiceNo Quantity UnitPrice TotalPrice
0 17850.0 2010-12-01 08:26:00 536365 6 2.55 15.30
1 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
2 17850.0 2010-12-01 08:26:00 536365 8 2.75 22.00
3 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
4 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34

RFM Analysis

  • For Recency, calculate the number of days between present date and date of last purchase each customer.
  • For Frequency, calculate the number of orders for each customer.
  • For Monetary, calculate the sum of purchase price for each customer.
rfm = uk_data.groupby('CustomerID'). \
    agg({'InvoiceDate': lambda date: (PRESENT - date.max()).days,
         'InvoiceNo': lambda num: len(num),
         'TotalPrice': lambda price: price.sum()})

rfm.columns
1
Index(['InvoiceDate', 'InvoiceNo', 'TotalPrice'], dtype='object')
# Change the name of columns
rfm.columns = ['monetary','recency','frequency']

rfm['recency'] = rfm['recency'].astype(int)
rfm.head(3)
monetary recency frequency
CustomerID
12346.0 325 1 77183.60
12747.0 2 103 4196.01
12748.0 0 4596 33719.73

Computing Quantile of RFM values

Customers with the lowest recency, highest frequency and monetary amounts considered as top customers.

qcut isquantile-based discretization function.
qcut bins the data based on sample quantiles.
For example, 1000 values for 4 quantiles would produce a categorical object indicating quantile membership for each customer.

rfm['r_quartile'] = pd.qcut(rfm['recency'], 4, ['1','2','3','4'])
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['4','3','2','1'])
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, ['4','3','2','1'])

rfm.head(3)
monetary recency frequency r_quartile f_quartile m_quartile
CustomerID
12346.0 325 1 77183.60 1 1 1
12747.0 2 103 4196.01 4 1 4
12748.0 0 4596 33719.73 4 1 4

RFM Result Interpretation

rfm['RFM_Score'] = rfm.r_quartile.astype(str)+ rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)

rfm.head(5)
monetary recency frequency r_quartile f_quartile m_quartile RFM_Score
CustomerID
12346.0 325 1 77183.60 1 1 1 111
12747.0 2 103 4196.01 4 1 4 414
12748.0 0 4596 33719.73 4 1 4 414
12749.0 3 199 4090.88 4 1 4 414
12820.0 3 59 942.34 3 2 4 324
# Filter out Top/Best customers
rfm[rfm['RFM_Score']=='111'].sort_values('monetary',
                                         ascending=False).head()
monetary recency frequency r_quartile f_quartile m_quartile RFM_Score
CustomerID
16754.0 372 2 2002.4 1 1 1 111
12346.0 325 1 77183.6 1 1 1 111
15749.0 235 10 44534.3 1 1 1 111
16698.0 226 5 1998.0 1 1 1 111
13135.0 196 1 3096.0 1 1 1 111