Hello, my name is Sourav Saha, and I am a member of NEC Digital Technology Research and Development Laboratories. This is an article for Day 6 of the NEC デジタルテクノロジー開発研究所 Advent Calendar 2023, where I am going to discuss a typical performance issue with pandas that I encountered in one of my recent research projects.
Due to confidentiality, let me discuss this with a dummy scenario instead of the actual application details.
Problem Statement
For example, below is the transaction data of certain users captured yesterday:
cust_name trans_time_last_day item_num
A 2022-12-20 08:58:00 I10
A 2022-12-20 09:32:36 I11
B 2022-12-20 09:35:48 I10
B 2022-12-20 10:03:01 I15
A 2022-12-20 10:05:02 I20
A 2022-12-20 10:30:18 I10
C 2022-12-20 11:05:02 I12
C 2022-12-20 11:08:03 I13
A 2022-12-20 11:30:11 I10
B 2022-12-20 11:32:18 I15
A 2022-12-20 12:15:21 I20
A 2022-12-20 12:30:35 I20
We also have today's data, which holds the target time for each user:
cust_name trans_time_today
A 2022-12-21 10:20:01
B 2022-12-21 11:05:02
C 2022-12-21 11:20:09
Our objective is to recommend, for each user, the item that was ordered yesterday just before the corresponding time in the column trans_time_today. For example, for user A we would recommend I20, the item ordered at 10:05:02, which is A's latest transaction before 10:20:01.
Data Preparation
Let's prepare a Python script to demonstrate with this sample of data:
import pandas as pd
# sorted based on "trans_time_last_day"
last_day_record = pd.DataFrame()
last_day_record["cust_name"] = ["A", "A", "B", "B", "A", "A", "C", "C", "A", "B", "A", "A"]
last_day_record["trans_time_last_day"] = ["2022-12-20 08:58:00", "2022-12-20 09:32:36",
                                          "2022-12-20 09:35:48", "2022-12-20 10:03:01",
                                          "2022-12-20 10:05:02", "2022-12-20 10:30:18",
                                          "2022-12-20 11:05:02", "2022-12-20 11:08:03",
                                          "2022-12-20 11:30:11", "2022-12-20 11:32:18",
                                          "2022-12-20 12:15:21", "2022-12-20 12:30:35"]
last_day_record["item_num"] = ["I10", "I11", "I10", "I15",
                               "I20", "I10", "I12", "I13",
                               "I10", "I15", "I20", "I20"]
df = pd.DataFrame()
df["cust_name"] = ["A", "B", "C"]
df["trans_time_today"] = ["2022-12-21 10:20:01", "2022-12-21 11:05:02", "2022-12-21 11:20:09"]
# extracting only the time parts
last_day_record["trans_time_last_day"] = last_day_record["trans_time_last_day"].str.slice(11)
df["trans_time_today"] = df["trans_time_today"].str.slice(11)
Solution #1
The target application uses a groupby-apply approach as follows to solve the above problem:
def get_recommendation_1(df: pd.DataFrame,
                         last_day_record: pd.DataFrame):
    def get_previous_item(grp: pd.DataFrame,
                          last_day_record: pd.DataFrame):
        # each group holds exactly one row of today's data for one customer
        cust_name = grp["cust_name"].iloc[0]
        target_time = grp["trans_time_today"].iloc[0]
        # last_day_record is sorted by time, so the last matching row is
        # the latest transaction before the target time
        last_record = last_day_record[(last_day_record["cust_name"] == cust_name) &
                                      (last_day_record["trans_time_last_day"] < target_time)].iloc[-1]
        return last_record["item_num"]

    df["recommend"] = df.groupby("cust_name")\
                        .apply(lambda grp: get_previous_item(grp, last_day_record))\
                        .reset_index(drop=True)
    return df
res = get_recommendation_1(df, last_day_record)
print(res)
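For reference, with the small sample above the printed result should look roughly like this (only the sliced time-of-day strings remain in the time column):

  cust_name trans_time_today recommend
0         A         10:20:01       I20
1         B         11:05:02       I15
2         C         11:20:09       I13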
Although it might be easier and more flexible to implement such cases using groupby-apply, the performance impact becomes significant as the data grows.
In the actual situation, there were more than 421k records in last_day_record, and there were more than 25k users for which the groupby-apply method had to be performed. The actual execution time of the implemented approach was more than an hour!!
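For a rough feel of that scale, here is a minimal sketch that generates purely synthetic data of a similar size (the customer names, times, and item codes are all made up for illustration). Note that with random data some users may have no earlier transaction, so the toy implementations in this article would need an extra guard against empty matches:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_records = 25_000, 421_000
users = [f"U{i}" for i in range(n_users)]

# yesterday's transactions: random users, random times of day, random items
big_last_day = pd.DataFrame({
    "cust_name": rng.choice(users, n_records),
    "trans_time_last_day": pd.to_datetime(
        rng.integers(0, 86_400, n_records), unit="s").strftime("%H:%M:%S"),
    "item_num": [f"I{i}" for i in rng.integers(0, 100, n_records)],
}).sort_values("trans_time_last_day", kind="stable").reset_index(drop=True)

# today's data: one target time per user, sorted by time
big_df = pd.DataFrame({
    "cust_name": users,
    "trans_time_today": pd.to_datetime(
        rng.integers(0, 86_400, n_users), unit="s").strftime("%H:%M:%S"),
}).sort_values("trans_time_today").reset_index(drop=True)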
Solution #2
Let’s now discuss how it can be optimized to run faster, even in pandas!
The idea is to get the previous records for each user. Hence we can simply perform a JOIN operation using cust_name as the key and then FILTER on the transaction time to get the previous records, as follows:
def get_recommendation_2(df: pd.DataFrame,
                         last_day_record: pd.DataFrame):
    # join today's data with all of yesterday's records of the same customer
    merged = df.merge(last_day_record, on="cust_name")
    # keep only yesterday's transactions that happened before today's target time
    filtered = merged[merged["trans_time_last_day"] < merged["trans_time_today"]]
    # the data is sorted by time, so keeping the last row per customer
    # gives the latest transaction before the target time
    last_record = filtered.drop_duplicates(subset=["cust_name", "trans_time_today"],
                                           keep="last")
    df["recommend"] = last_record.reset_index(drop=True)["item_num"]
    return df
res = get_recommendation_2(df, last_day_record)
print(res)
The above optimization could successfully reduce the execution time from 1 hour to 20 minutes.
Significant improvement, isn’t it?
Wait, there is still room for further optimization for the above use case.
Solution #3
Unlike SQL, pandas doesn’t provide methods related to non-equi join. Therefore, it is very common to perform a merge followed by a filter to get the required result, like in the above implementation.
But have you heard of the method merge_asof?
It is a very useful method and can be a perfect fit for the above problem. Here is a solution using merge_asof:
def get_recommendation_3(df: pd.DataFrame,
                         last_day_record: pd.DataFrame):
    # merge_asof requires numeric or datetime-like keys, so convert the time strings
    df["trans_time_today"] = pd.to_datetime(df["trans_time_today"])
    last_day_record["trans_time_last_day"] = pd.to_datetime(last_day_record["trans_time_last_day"])
    # for each row of df, pick the latest earlier transaction of the same customer
    merged = pd.merge_asof(df, last_day_record, left_on="trans_time_today",
                           right_on="trans_time_last_day", by="cust_name",
                           allow_exact_matches=False, direction="backward")
    df["recommend"] = merged["item_num"]
    return df
res = get_recommendation_3(df, last_day_record)
print(res)
Using merge_asof, we could finally reduce the execution time to 10 minutes!!
Undoubtedly merge_asof is the clear choice for such problems.
merge_asof?
Let me explain the method and its parameters used in the above solution.
As written in the pandas documentation:
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the key.
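As a quick illustration of the "nearest key" behavior, here is a toy example of my own (not from the original application):

left = pd.DataFrame({"t": [3, 7, 10]})
right = pd.DataFrame({"t": [1, 2, 5, 9], "val": ["a", "b", "c", "d"]})

# for each left key, pick the row with the closest right key that is <= it
print(pd.merge_asof(left, right, on="t"))
#     t  val
# 0   3    b   <- closest earlier key is 2
# 1   7    c   <- closest earlier key is 5
# 2  10    d   <- closest earlier key is 9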
Our objective was to find the previous trans_time_last_day from the last_day_record table with respect to the current trans_time_today from the df table.
Hence, as join keys we have selected:
- trans_time_today from the left table (left_on), and
- trans_time_last_day from the right table (right_on),
with the direction parameter set to backward.
We need to perform a (<) comparison instead of (<=) to get the previous value. Hence, we have specified allow_exact_matches as False.
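For example, with the default allow_exact_matches=True an exactly equal key would be matched, whereas False forces a strictly earlier one (again, a toy example of mine):

left = pd.DataFrame({"t": [5]})
right = pd.DataFrame({"t": [3, 5], "val": ["x", "y"]})

print(pd.merge_asof(left, right, on="t"))                             # 5 matches 5 -> "y" (<=)
print(pd.merge_asof(left, right, on="t", allow_exact_matches=False))  # 5 matches 3 -> "x" (<)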
Please note the following conditions for the join keys:
- the keys must be numeric, such as datetime-like, integer, or float, in both tables.
- the keys must be sorted in both tables.
- there should be no missing values (nulls) in either key column.
Since the data in both tables of the above sample was already sorted, we could skip the sorting step, but to make the key columns (transaction time) numeric we needed to convert them from string to datetime.
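If your data were not already sorted, a sketch of the extra preparation might look like this (using the column names of our sample, and assuming the keys have not yet been converted):

# convert the string keys to datetime and drop rows with missing keys
df["trans_time_today"] = pd.to_datetime(df["trans_time_today"])
last_day_record["trans_time_last_day"] = pd.to_datetime(last_day_record["trans_time_last_day"])
last_day_record = last_day_record.dropna(subset=["trans_time_last_day"])

# merge_asof requires both tables to be sorted by their join keys
df = df.sort_values("trans_time_today")
last_day_record = last_day_record.sort_values("trans_time_last_day")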
What is the difference between the by and on parameters?
Well, these might confuse you and look like aliases if you are trying this method for the very first time, but each has its own purpose.
on is the mandatory parameter that performs the joining, whereas the by parameter can be specified to group both tables before performing the actual join. We need to extract the customer-wise previous transaction time. Hence, it creates groups sharing the same cust_name in both tables (by specifying by="cust_name") and then performs the join using the respective transaction-time columns (by specifying left_on and right_on). If the key column shares the same name in both tables, then specifying the on parameter alone is fine.
The by parameter is optional if you want to perform a record-wise join across both tables, but you must specify the on (or left_on/right_on) parameter to define the target keys for joining.
Also, you can specify multiple columns for the by parameter (to group using multiple keys), but you can pass only a single column for the on parameter.
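To make this concrete, if the time column carried the same name in both tables, the same join could be written with on instead of left_on/right_on (a hypothetical rename of our sample columns, assuming they have already been converted to datetime as in Solution #3):

left = df.rename(columns={"trans_time_today": "trans_time"})
right = last_day_record.rename(columns={"trans_time_last_day": "trans_time"})

merged = pd.merge_asof(left, right, on="trans_time", by="cust_name",
                       allow_exact_matches=False, direction="backward")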
Last but not least, since it is similar to a left-outer join, it doesn't have a how parameter!
That's all for today. I hope to discuss many such performance-critical cases with examples of possible solutions in some other article. Please stay tuned with us!
Additional Tips
By the way, have you heard about FireDucks?
It is a pandas-compatible, high-performance data analysis library for Python with a built-in compiler for query optimization, developed at our laboratory at NEC to accelerate your pandas applications. FireDucks can outperform pandas (by up to 40x) on commonly used data manipulation methods like sort, select, filter, join, groupby, and aggregation. It is publicly available and can simply be installed using pip.
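To the best of my knowledge, installation and usage look like the following (please refer to the homepage below for the exact, up-to-date instructions):

# pip install fireducks
import fireducks.pandas as pd  # drop-in replacement for "import pandas as pd"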
Please check its homepage for more details:
You may also like to check one of my articles explaining its salient features:
Wrapping-up
Thank you for reading so far!
To summarize, we discussed a unique use case for the pandas merge_asof method along with its different parameters. This method can be very useful in our regular analysis, where we unknowingly opt for groupby-apply or merge-filter operations and pay a significant performance cost. Using the proposed approach, we could successfully reduce the execution time from 1 hour to 10 minutes for a performance-critical data analysis application.
In a previous article, I discussed several Join-related use cases that can be accelerated if written carefully in pandas. You may like to check that out as well: