Hello, my name is Sourav Saha, and I am a member of NEC Digital Technology Research and Development Laboratories. This is an article for Day 6 of the NEC デジタルテクノロジー開発研究所 Advent Calendar 2023, where I am going to discuss a typical performance issue with pandas that I encountered in one of my recent research projects.
Due to confidentiality, let me discuss this with a dummy scenario instead of the actual application details.
Problem Statement
For example, below is the transaction data of certain users captured yesterday:
cust_name trans_time_last_day item_num
A 2022-12-20 08:58:00 I10
A 2022-12-20 09:32:36 I11
B 2022-12-20 09:35:48 I10
B 2022-12-20 10:03:01 I15
A 2022-12-20 10:05:02 I20
A 2022-12-20 10:30:18 I10
C 2022-12-20 11:05:02 I12
C 2022-12-20 11:08:03 I13
A 2022-12-20 11:30:11 I10
B 2022-12-20 11:32:18 I15
A 2022-12-20 12:15:21 I20
A 2022-12-20 12:30:35 I20
We also have today's data, which holds the target time for each user:
cust_name trans_time_today
A 2022-12-21 10:20:01
B 2022-12-21 11:05:02
C 2022-12-21 11:20:09
Our objective is to recommend, for each user, the item that was ordered yesterday just before the corresponding time in the column trans_time_today. For example, for user A we would recommend I20, the item ordered at 10:05:02, which is A's latest transaction before 10:20:01.
Data Preparation
Let's prepare a Python script to demonstrate with this sample of data:
import pandas as pd
# sorted based on "trans_time_last_day"
last_day_record = pd.DataFrame()
last_day_record["cust_name"] = ["A", "A", "B", "B", "A", "A", "C", "C", "A", "B", "A", "A"]
last_day_record["trans_time_last_day"] = ["2022-12-20 08:58:00", "2022-12-20 09:32:36",
                                          "2022-12-20 09:35:48", "2022-12-20 10:03:01",
                                          "2022-12-20 10:05:02", "2022-12-20 10:30:18",
                                          "2022-12-20 11:05:02", "2022-12-20 11:08:03",
                                          "2022-12-20 11:30:11", "2022-12-20 11:32:18",
                                          "2022-12-20 12:15:21", "2022-12-20 12:30:35"]
last_day_record["item_num"] = ["I10", "I11", "I10", "I15",
                               "I20", "I10", "I12", "I13",
                               "I10", "I15", "I20", "I20"]
df = pd.DataFrame()
df["cust_name"] = ["A", "B", "C"]
df["trans_time_today"] = ["2022-12-21 10:20:01", "2022-12-21 11:05:02", "2022-12-21 11:20:09"]
# extracting only the time parts
last_day_record["trans_time_last_day"] = last_day_record["trans_time_last_day"].str.slice(11)
df["trans_time_today"] = df["trans_time_today"].str.slice(11)
Solution #1
The target application uses a groupby-apply approach as follows to solve the above problem:
def get_recommendation_1(df: pd.DataFrame,
                         last_day_record: pd.DataFrame):
    def get_previous_item(grp: pd.DataFrame,
                          last_day_record: pd.DataFrame):
        # each group holds exactly one row of today's data for one customer
        cust_name = grp["cust_name"].iloc[0]
        target_time = grp["trans_time_today"].iloc[0]
        # last_day_record is sorted by time, so the last matching row is
        # the latest transaction before the target time
        last_record = last_day_record[(last_day_record["cust_name"] == cust_name) &
                                      (last_day_record["trans_time_last_day"] < target_time)].iloc[-1]
        return last_record["item_num"]

    df["recommend"] = df.groupby("cust_name")\
                        .apply(lambda grp: get_previous_item(grp, last_day_record))\
                        .reset_index(drop=True)
    return df
res = get_recommendation_1(df, last_day_record)
print(res)
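For reference, with the small sample above the printed result should look roughly like this (only the sliced time-of-day strings remain in the time column):

  cust_name trans_time_today recommend
0         A         10:20:01       I20
1         B         11:05:02       I15
2         C         11:20:09       I13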
Although it might be easier and more flexible to implement such cases using groupby-apply, the performance impact becomes significant as the data grows.
In the actual situation, there were more than 421k records in last_day_record, and there were more than 25k users for which the groupby-apply method had to be performed. The actual execution time of the implemented approach was more than an hour!!
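For a rough feel of that scale, here is a minimal sketch that generates purely synthetic data of a similar size (the customer names, times, and item codes are all made up for illustration). Note that with random data some users may have no earlier transaction, so the toy implementations in this article would need an extra guard against empty matches:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_records = 25_000, 421_000
users = [f"U{i}" for i in range(n_users)]

# yesterday's transactions: random users, random times of day, random items
big_last_day = pd.DataFrame({
    "cust_name": rng.choice(users, n_records),
    "trans_time_last_day": pd.to_datetime(
        rng.integers(0, 86_400, n_records), unit="s").strftime("%H:%M:%S"),
    "item_num": [f"I{i}" for i in rng.integers(0, 100, n_records)],
}).sort_values("trans_time_last_day", kind="stable").reset_index(drop=True)

# today's data: one target time per user, sorted by time
big_df = pd.DataFrame({
    "cust_name": users,
    "trans_time_today": pd.to_datetime(
        rng.integers(0, 86_400, n_users), unit="s").strftime("%H:%M:%S"),
}).sort_values("trans_time_today").reset_index(drop=True)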
Solution #2
Let’s now discuss how it can be optimized to run faster, even in pandas!
The idea is to get the previous records for each user. Hence we can simply perform a JOIN operation using cust_name as the key and then FILTER on the transaction time to get the previous records, as follows:
def get_recommendation_2(df: pd.DataFrame,
                         last_day_record: pd.DataFrame):
    # join today's data with all of yesterday's records of the same customer
    merged = df.merge(last_day_record, on="cust_name")
    # keep only yesterday's transactions that happened before today's target time
    filtered = merged[merged["trans_time_last_day"] < merged["trans_time_today"]]
    # the data is sorted by time, so keeping the last row per customer
    # gives the latest transaction before the target time
    last_record = filtered.drop_duplicates(subset=["cust_name", "trans_time_today"],
                                           keep="last")
    df["recommend"] = last_record.reset_index(drop=True)["item_num"]
    return df
res = get_recommendation_2(df, last_day_record)
print(res)
The above optimization could successfully reduce the execution time from 1 hour to 20 minutes.
Significant improvement, isn’t it?
Wait, there is still room for further optimization for the above use case.
Solution #3
Unlike SQL, pandas doesn’t provide methods related to non-equi join. Therefore, it is very common to perform a merge followed by a filter to get the required result, like in the above implementation.
But have you heard of the method merge_asof?
It is a very useful method and can be a perfect fit for the above problem. Here is a solution using merge_asof:
def get_recommendation_3(df: pd.DataFrame,
                         last_day_record: pd.DataFrame):
    # merge_asof requires numeric or datetime-like keys, so convert the time strings
    df["trans_time_today"] = pd.to_datetime(df["trans_time_today"])
    last_day_record["trans_time_last_day"] = pd.to_datetime(last_day_record["trans_time_last_day"])
    # for each row of df, pick the latest earlier transaction of the same customer
    merged = pd.merge_asof(df, last_day_record, left_on="trans_time_today",
                           right_on="trans_time_last_day", by="cust_name",
                           allow_exact_matches=False, direction="backward")
    df["recommend"] = merged["item_num"]
    return df
res = get_recommendation_3(df, last_day_record)
print(res)
Using merge_asof, we could finally reduce the execution time to 10 minutes!!
Undoubtedly merge_asof is the clear choice for such problems.
merge_asof?
Let me explain the method and its parameters used in the above solution.
As written in the pandas documentation:
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the key.
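As a quick illustration of the "nearest key" behavior, here is a toy example of my own (not from the original application):

left = pd.DataFrame({"t": [3, 7, 10]})
right = pd.DataFrame({"t": [1, 2, 5, 9], "val": ["a", "b", "c", "d"]})

# for each left key, pick the row with the closest right key that is <= it
print(pd.merge_asof(left, right, on="t"))
#     t  val
# 0   3    b   <- closest earlier key is 2
# 1   7    c   <- closest earlier key is 5
# 2  10    d   <- closest earlier key is 9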
Our objective was to find the previous trans_time_last_day from the last_day_record table with respect to the current trans_time_today from the df table.
Hence, as join keys we have selected:
- trans_time_today from the left table (left_on), and
- trans_time_last_day from the right table (right_on),
with the direction parameter set to backward.
We need to perform a (<) comparison instead of (<=) to get the previous value. Hence, we have specified allow_exact_matches as False.
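For example, with the default allow_exact_matches=True an exactly equal key would be matched, whereas False forces a strictly earlier one (again, a toy example of mine):

left = pd.DataFrame({"t": [5]})
right = pd.DataFrame({"t": [3, 5], "val": ["x", "y"]})

print(pd.merge_asof(left, right, on="t"))                             # 5 matches 5 -> "y" (<=)
print(pd.merge_asof(left, right, on="t", allow_exact_matches=False))  # 5 matches 3 -> "x" (<)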
Please note the following conditions for the join keys:
- the keys must be numeric, such as datetime-like, integer, or float, in both tables.
- the keys must be sorted in both tables.
- there should be no missing values (nulls) in either key column.
Since the data in both tables of the above sample was already sorted, we could skip the sorting step, but to make the key columns (transaction time) numeric we needed to convert them from string to datetime.
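If your data were not already sorted, a sketch of the extra preparation might look like this (using the column names of our sample, and assuming the keys have not yet been converted):

# convert the string keys to datetime and drop rows with missing keys
df["trans_time_today"] = pd.to_datetime(df["trans_time_today"])
last_day_record["trans_time_last_day"] = pd.to_datetime(last_day_record["trans_time_last_day"])
last_day_record = last_day_record.dropna(subset=["trans_time_last_day"])

# merge_asof requires both tables to be sorted by their join keys
df = df.sort_values("trans_time_today")
last_day_record = last_day_record.sort_values("trans_time_last_day")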
What is the difference between the by and on parameters?
Well, these might confuse you and look like aliases if you are trying this method for the very first time, but each has its own purpose.
on is the mandatory parameter that performs the joining, whereas the by parameter can be specified to group both tables before performing the actual join. We need to extract the customer-wise previous transaction time. Hence, it creates groups sharing the same cust_name in both tables (by specifying by="cust_name") and then performs the join using the respective transaction-time columns (by specifying left_on and right_on). If the key column shares the same name in both tables, then specifying the on parameter alone is fine.
The by parameter is optional if you want to perform a record-wise join across both tables, but you must specify the on (or left_on/right_on) parameter to define the target keys for joining.
Also, you can specify multiple columns for the by parameter (to group using multiple keys), but you can pass only a single column for the on parameter.
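To make this concrete, if the time column carried the same name in both tables, the same join could be written with on instead of left_on/right_on (a hypothetical rename of our sample columns, assuming they have already been converted to datetime as in Solution #3):

left = df.rename(columns={"trans_time_today": "trans_time"})
right = last_day_record.rename(columns={"trans_time_last_day": "trans_time"})

merged = pd.merge_asof(left, right, on="trans_time", by="cust_name",
                       allow_exact_matches=False, direction="backward")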
Last but not least, since it is similar to a left-outer join, it doesn't have a how parameter!
That's all for today. I hope to discuss many such performance-critical cases with examples of possible solutions in some other article. Please stay tuned with us!
Additional Tips
By the way, have you heard about FireDucks?
It is a pandas-compatible, high-performance data analysis library for Python with a built-in compiler for query optimization, developed at our laboratory at NEC to accelerate your pandas applications. FireDucks can outperform pandas (by up to 40x) on commonly used data manipulation methods like sort, select, filter, join, groupby, and aggregation. It is publicly available and can simply be installed using pip.
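To the best of my knowledge, installation and usage look like the following (please refer to the homepage below for the exact, up-to-date instructions):

# pip install fireducks
import fireducks.pandas as pd  # drop-in replacement for "import pandas as pd"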
Please check its homepage for more details:
You may also like to check one of my articles explaining its salient features:
Wrapping-up
Thank you for reading so far!
To summarize, we discussed a unique use case for the pandas merge_asof method along with its different parameters. This method can be very useful in our regular analysis, where we unknowingly opt for groupby-apply or merge-filter operations and pay a significant performance cost. Using the proposed approach, we could successfully reduce the execution time from 1 hour to 10 minutes for a performance-critical data analysis application.
In a previous article, I discussed several Join-related use cases that can be accelerated if written carefully in pandas. You may like to check that out as well: