Decile Analysis in Python

Posted at 2024-05-25

1. Motivation

Using the following Kaggle purchase data to do targeting and decile analysis.
https://www.kaggle.com/vijayuv/onlineretail

2. Procedures

Read retail_data and perform preprocessing
Calculate the sum of TotalPrice for each CustomerID
Sort the dataframe in descending order based on the TotalPrice per CustomerID.
Divide the customers into n groups (where 𝑛 is a natural number that does not exceed the number of data and does not have the same value in the quartile).
Calculate what percentage of sales the each group's TotalPrice accounts for. (decile analysis)
Sort the groups by their calculated percentages in descending order.
Summarize the above steps as a function.
Create a cumulative distribution plot to visualize the contribution of each group to the total TotalPrice.

3. Steps

3-1. Read retail_data and perform preprocessing

Code

#Load the libraries
import pandas as pd
import numpy as np

# Data loading and preprocessing *Takes a lot of time
file_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
Excel_data = pd.ExcelFile(file_url)
retale_data = Excel_data.parse('Online Retail')

# Convert InvoiceNo to string type
retail_data['cancel_flg'] = retail_data.InvoiceNo.map(lambda x:str(x)[0])

# Target invoices whose InvoiceNo starts with 5 and whose ID is not Null
retail_data_1 = retail_data[(retail_data.cancel_flg == '5') & (retail_data.CustomerID.notnull())]

# Prepare main dataframe
df = retail_data_1.assign(TotalPrice=retail_data_1.Quantity * retail_data_1.UnitPrice)

Result

df.info()

df.describe()

3-2. Calculate the sum of `TotalPrice` for each `CustomerID`

sum = df.groupby('CustomerID')[['TotalPrice']].sum()
sum_sorted = sum.sort_values(by='TotalPrice', ascending=False)

Result

sum

3-3. Sort the dataframe in descending order based on the `TotalPrice` per `CustomerID`.

sum_sorted = sum.sort_values(by='TotalPrice', ascending=False)

Result

sum_sorted

3-4. Divide the customers into n `group`s .

where 𝑛 is a natural number that does not exceed the number of data and does not have the same value in the quartile

n = 10
group_labels = []
for i in range(n):
    group_labels.append('{0}th'.format(i+1))
sum_sorted['group'] = pd.qcut(sum_sorted['TotalPrice'], n, labels=group_labels, duplicates='drop')

Result

sum_sorted

3-5. Calculate what percentage of sales the each group's `TotalPrice` accounts for. (decile analysis)

# Find what percentage of the total purchase price each group represents
proportions = sum_sorted.groupby('group')['TotalPrice'].sum() / sum_sorted['TotalPrice'].sum()

Result

proportions

3-6. Sort the groups by their calculated percentages in descending order.

# Sort in ascending order by proportion height
result = proportions.sort_values(ascending=False)

Result

result

# Rank groups by percentage height and reassign indexes by rank
result.index = result.sort_values().index.map(lambda x: 'decile'+str(x)[0])

Result

result

3-7. Summarize the above steps as a function.

def function(df, n):
    sum = df.groupby('CustomerID')[['TotalPrice']].sum()
    sum_sorted = sum.sort_values(by='TotalPrice', ascending=False)
    group_labels = []
    for i in range(n):
        group_labels.append('{0}th'.format(i+1))
    sum_sorted['group'] = pd.qcut(sum_sorted['TotalPrice'], n, labels=group_labels, duplicates='drop')
    proportions = sum_sorted.groupby('group')['TotalPrice'].sum() / sum_sorted['TotalPrice'].sum()
    result = proportions.sort_values(ascending=False)
    result.index = result.sort_values().index.map(lambda x: str(x)[0])
    return result

Result

function(df, 10)

3-8. Create a cumulative distribution plot to visualize the contribution of each group to the total `TotalPrice`.

# Visualize with Matplotlib
import matplotlib.pyplot as plt
N = 10
data = function(df, N)

fig, ax1 = plt.subplots(figsize=(6,4))

# Number of segments determines the abscissa
data_num = len(data)

# Cumulative sum
cum_per = np.cum(data)

# Bar chart
ax1.bar(range(data_num), data)
ax1.set_xticks(range(data_num))

# line graph
ax2 = ax1.twinx()
ax2.plot(range(data_num), cum_per, c="k", marker="o")
ax2.set_ylim([0, 1])
ax2.grid(True, which='both', axis='y')

Result

Result (where `N`=20)

Acknowledgement

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197â€“208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

Decile Analysis in Python

1. Motivation

2. Procedures

3. Steps

3-1. Read retail_data and perform preprocessing

Code

Result

3-2. Calculate the sum of TotalPrice for each CustomerID

Result

3-3. Sort the dataframe in descending order based on the TotalPrice per CustomerID.

Result

3-4. Divide the customers into n groups .

Result

3-5. Calculate what percentage of sales the each group's TotalPrice accounts for. (decile analysis)

Result

3-6. Sort the groups by their calculated percentages in descending order.

Result

Result

3-7. Summarize the above steps as a function.

Result

3-8. Create a cumulative distribution plot to visualize the contribution of each group to the total TotalPrice.

Result

Result (where N=20)

Acknowledgement

3-2. Calculate the sum of `TotalPrice` for each `CustomerID`

3-3. Sort the dataframe in descending order based on the `TotalPrice` per `CustomerID`.

3-4. Divide the customers into n `group`s .

3-5. Calculate what percentage of sales the each group's `TotalPrice` accounts for. (decile analysis)

3-8. Create a cumulative distribution plot to visualize the contribution of each group to the total `TotalPrice`.

Result (where `N`=20)