NEC デジタルテクノロジー開発研究所Advent Calendar 2023Day 16

A hidden fact you must know when working with pandas

Last updated at 2023-12-18Posted at 2023-12-15

Hello, my name is Sourav Saha, a member of NEC Digital Technology Research and Development Laboratories. This is an article for NEC デジタルテクノロジー開発研究所 Advent Calendar 2023 Day 16.

Recently I came across a few examples from some performance-critical data analysis applications that revealed some hidden costs while working with pandas. I am going to explain one of them in today's blog.

Let's first look at the below example:

ret = pd.DataFrame()
ret["x"] = s1 * 2 # Line-1
ret["y"] = s2 * 2 # Line-2

This is simply multiplying two series data and assigning the result into a data frame. Both s1 and s2 have 100 million elements of type float64.

Still, they differ in their execution time to a great extent as shown below:

Can you guess why?

Well, the fact amazed me when I first encountered this situation.
In pandas, the series and data frames are labeled with an “index” column (useful for multiple purposes).

Whenever an assignment (=) operation takes place, pandas first checks whether the index columns for both side of the expression are equally aligned or not. If not aligned equally, it performs an expensive reindex operation to align the data on the right-hand side of the assignment operator according to the data on the left-hand side.

In the above example, s1 and s2 have different index layouts (due to some sorting).
Hence Line-1 only involves the computation cost of a binary multiplication, whereas Line-2 involves the cost of a binary multiplication, along with the cost of index-alignment as depicted in the following figure.

Because of this expensive reindex operation, the execution of Line-2 consumes almost 50 times the execution time for Line-1 .

Needless to say, it is very crucial to take care of such assignments when performing within a loop.
For example, below is a sample program that attempts to perform the multiplication of each numeric column in the input data frame, df with 2. For each column that contains missing values (nulls), it sorts the data by the dates column, performs a forward fill operation to fill the missing values for those columns, and finally performs the multiplication on the modified column:

ret = pd.DataFrame()
for c in df.select_dtypes("number").columns:
    if any(df[c].isnull())
        tmp = df.sort_values("dates")[c].ffill()
        ret[c] = tmp * 2
    else:
        ret[c] = df[c] * 2

There are two performance issues involved in the if part of the above program. The clearly visible one is the sort operation which is performed on the same data on every execution of the if part within the loop. Hence, let's first work on to optimize that:

# sorting and forward-filling are performed only once
sorted_df = df.sort_values("dates").ffill() 

ret = pd.DataFrame()
for c in df.select_dtypes("number").columns:
    if any(df[c].isnull()):
        ret[c] = sorted_df[c] * 2
    else:
        ret[c] = df[c] * 2

The invisible performance factor is involved in the line: ret[c] = sorted_df[c] * 2.
Here it is trying to multiply target columns from sorted_df with 2 and assigning into the resultant dataframe.

After the sorting operation, index columns of df and sorted_df are no longer the same. Hence, whenever the if part is executed, along with the binary multiplication, an expensive reindex operation will take place making the above program very slow.

Hence, such assignments inside a loop should be avoided wherever possible.
One can perform the multiplication on the sorted_df and then perform sort_index on the result to get the intended result as shown below:

# sorting and forward-filling is performed only once
sorted_df = df.sort_values("dates").ffill()
target_input = sorted_df.select_dtypes("number")
ret = target_input * 2
ret = ret.sort_index() # to make the result as expected in the original program

Hope these tips will help you to accelerate your analysis when working with pandas. It is recommended to use an alternative python library developed at our laboratory, named FireDucks.

It is a multithreaded library accelerated with compiler technology enabling you to perform faster analysis even with the above situation (it can auto-optimize the first case, i.e., performing sort operation only once and reuse the same whenever needed) and is publicly available for use.

You may like to check out one of my articles explaining key features of FireDucks, along with its different layers of optimization to realize its potential to be used as a drop-in replacement for pandas.

Wrapping-up

Thank you for reading so far.

To summarize, I have discussed a few tips one should avoid while working with pandas. In the actual situation, just by performing the above optimization, we could successfully reduce the execution time of a performance-critical application from 2 hours 5 mins to 1 hour 30 mins. Using FireDucks further makes it possible to reduce the overall analysis time to 56 mins. With some further optimization, we could reduce the overall analysis time within 10 mins. Let's discuss a few of them in some other blog.

We recommend getting it installed using pip and try using it in your pandas application to enjoy its benefit. We are continuously striving towards improving its performance and releasing the same at frequent intervals. Please stay tuned with us and keep sharing your valuable feedback on any execution issues you may face or optimization requests you may have. We are eagerly looking forward to hearing from you!

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up