0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

1. Motivation

Using the following Kaggle purchase data to do targeting and decile analysis.
https://www.kaggle.com/vijayuv/onlineretail

2. Procedures

  1. Read retail_data and perform preprocessing
  2. Calculate the sum of TotalPrice for each CustomerID
  3. Sort the dataframe in descending order based on the TotalPrice per CustomerID.
  4. Divide the customers into n groups (where 𝑛 is a natural number that does not exceed the number of data and does not have the same value in the quartile).
  5. Calculate what percentage of sales the each group's TotalPrice accounts for. (decile analysis)
  6. Sort the groups by their calculated percentages in descending order.
  7. Summarize the above steps as a function.
  8. Create a cumulative distribution plot to visualize the contribution of each group to the total TotalPrice.

3. Steps

3-1. Read retail_data and perform preprocessing

Code

#Load the libraries
import pandas as pd
import numpy as np

# Data loading and preprocessing *Takes a lot of time
file_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
Excel_data = pd.ExcelFile(file_url)
retale_data = Excel_data.parse('Online Retail')

# Convert InvoiceNo to string type
retail_data['cancel_flg'] = retail_data.InvoiceNo.map(lambda x:str(x)[0])

# Target invoices whose InvoiceNo starts with 5 and whose ID is not Null
retail_data_1 = retail_data[(retail_data.cancel_flg == '5') & (retail_data.CustomerID.notnull())]

# Prepare main dataframe
df = retail_data_1.assign(TotalPrice=retail_data_1.Quantity * retail_data_1.UnitPrice)

Result

df.info()

Screenshot 2024-05-25 at 18.03.51.png

df.describe()

Screenshot 2024-05-25 at 18.05.07.png

3-2. Calculate the sum of TotalPrice for each CustomerID

sum = df.groupby('CustomerID')[['TotalPrice']].sum()
sum_sorted = sum.sort_values(by='TotalPrice', ascending=False)

Result

sum

Screenshot 2024-05-25 at 18.07.57.png

3-3. Sort the dataframe in descending order based on the TotalPrice per CustomerID.

sum_sorted = sum.sort_values(by='TotalPrice', ascending=False)

Result

sum_sorted

Screenshot 2024-05-25 at 18.18.46.png

3-4. Divide the customers into n groups .

where 𝑛 is a natural number that does not exceed the number of data and does not have the same value in the quartile

n = 10
group_labels = []
for i in range(n):
    group_labels.append('{0}th'.format(i+1))
sum_sorted['group'] = pd.qcut(sum_sorted['TotalPrice'], n, labels=group_labels, duplicates='drop')

Result

sum_sorted

Screenshot 2024-05-25 at 18.11.19.png

3-5. Calculate what percentage of sales the each group's TotalPrice accounts for. (decile analysis)

# Find what percentage of the total purchase price each group represents
proportions = sum_sorted.groupby('group')['TotalPrice'].sum() / sum_sorted['TotalPrice'].sum()

Result

proportions

Screenshot 2024-05-25 at 18.22.16.png

3-6. Sort the groups by their calculated percentages in descending order.

# Sort in ascending order by proportion height
result = proportions.sort_values(ascending=False)

Result

result

Screenshot 2024-05-25 at 18.22.27.png

# Rank groups by percentage height and reassign indexes by rank
result.index = result.sort_values().index.map(lambda x: 'decile'+str(x)[0])

Result

result

Screenshot 2024-05-25 at 18.22.36.png

3-7. Summarize the above steps as a function.

def function(df, n):
    sum = df.groupby('CustomerID')[['TotalPrice']].sum()
    sum_sorted = sum.sort_values(by='TotalPrice', ascending=False)
    group_labels = []
    for i in range(n):
        group_labels.append('{0}th'.format(i+1))
    sum_sorted['group'] = pd.qcut(sum_sorted['TotalPrice'], n, labels=group_labels, duplicates='drop')
    proportions = sum_sorted.groupby('group')['TotalPrice'].sum() / sum_sorted['TotalPrice'].sum()
    result = proportions.sort_values(ascending=False)
    result.index = result.sort_values().index.map(lambda x: str(x)[0])
    return result

Result

function(df, 10)

Screenshot 2024-05-25 at 18.28.58.png

3-8. Create a cumulative distribution plot to visualize the contribution of each group to the total TotalPrice.

# Visualize with Matplotlib
import matplotlib.pyplot as plt
N = 10
data = function(df, N)

fig, ax1 = plt.subplots(figsize=(6,4))

# Number of segments determines the abscissa
data_num = len(data)

# Cumulative sum
cum_per = np.cum(data)

# Bar chart
ax1.bar(range(data_num), data)
ax1.set_xticks(range(data_num))

# line graph
ax2 = ax1.twinx()
ax2.plot(range(data_num), cum_per, c="k", marker="o")
ax2.set_ylim([0, 1])
ax2.grid(True, which='both', axis='y')

Result

Screenshot 2024-05-25 at 18.32.34.png

Result (where N=20)

Screenshot 2024-05-25 at 18.30.53.png

Acknowledgement

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?