Hello, my name is Sourav Saha, a member of NEC Digital Technology Research and Development Laboratories. This is an article for NEC デジタルテクノロジー開発研究所 Advent Calendar 2023 Day 7, where I will briefly discuss the motivation behind developing FireDucks, its different layers of optimization and some tips to ensure significant performance gain when using FireDucks.
Background
In general, a data scientist spends a lot of time on data preparation before training an AI model or creating visualizations, since the accuracy of any AI model depends directly on the quality of the training data. As the renowned British economist Ronald H. Coase rightly said, "If you torture the data long enough, it will confess to anything."
When it comes to performing data analysis in Python, the pandas library is the top choice of most data scientists because of its flexible and easy-to-use interface. However, its single-core execution becomes a performance bottleneck when processing large amounts of data, often forcing data scientists to look for alternative solutions.
Many alternatives to pandas are available. Some compel a pandas user to learn completely new data structures, API syntaxes, and execution models, while others demand paying extra for a high-performance computational unit such as a GPU.
To address the above, we have developed our own high-performance data analysis library for Python, named FireDucks, with a built-in runtime compiler aimed at query optimization. FireDucks is highly compatible with pandas and can be used without any changes to your existing pandas application. It is backed by different kernels optimized for different hardware models (CPU, GPU, etc.). Thus, if you want to speed up your existing application even on a CPU-only system, FireDucks enables you to do so without learning new APIs, putting in strenuous effort to migrate your existing application, or paying extra for new hardware.
The beta version of FireDucks was released on 19th October 2023, and it can easily be installed on your system using:
$ pip install fireducks
Salient Features of FireDucks
① Zero manual effort to start with
To enjoy the acceleration driven by FireDucks in your application, simply replace the import statement for pandas as follows:
# import pandas as pd
import fireducks.pandas as pd
Or, you can simply execute your program with the import-hook feature as follows:
$ python -m fireducks.imhook program.py
② Lazy execution with IR-based auto-optimization
Unlike pandas, FireDucks does not execute data-frame-specific API calls right away.
Instead, it creates specialized IR (Intermediate Representation) operators for each API call and keeps accumulating them, so that it can inspect the possible room for code-flow optimization at execution time.
Execution is triggered when a reduction method is invoked — e.g., print, or an aggregation on a series (like sum, max, etc.) — or when an explicit evaluation is requested. The accumulated IR operators are first processed by the built-in runtime compiler, which performs multiple levels of optimization, and then a target backend kernel is selected to execute the optimized code.
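The idea of accumulating operators and deferring work until a reduction can be sketched with a toy lazy frame in plain Python. This is an illustrative model only — the class and method names below are invented for the sketch and are not FireDucks internals:

```python
# A minimal sketch (NOT FireDucks' actual implementation) of lazy execution:
# operations only append "IR" operators; nothing runs until evaluate().
import pandas as pd

class LazyFrame:
    def __init__(self, df: pd.DataFrame):
        self._df = df
        self._ops = []          # accumulated "IR" operators

    def filter_rows(self, mask_fn):
        self._ops.append(("filter", mask_fn))   # just record the op
        return self

    def select(self, cols):
        self._ops.append(("project", cols))     # just record the op
        return self

    def evaluate(self):
        # A real runtime compiler could reorder/fuse self._ops here
        # before executing them on a backend kernel.
        df = self._df
        for op, arg in self._ops:
            df = df[arg(df)] if op == "filter" else df[arg]
        self._ops = []
        return df

lf = LazyFrame(pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}))
result = lf.filter_rows(lambda d: d["x"] >= 2).select(["y"]).evaluate()
```

Because the whole pipeline is visible at `evaluate()` time, the optimizer sees the full query rather than one eager call at a time.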
The beta version of FireDucks includes only one kernel: a multithreaded, optimized kernel named dfkl, aimed at CPU-only systems. The dfkl kernel is backed by an Arrow-based implementation along with algorithm-level optimizations for sort, groupby, dropna, etc., helping you perform faster analysis even on gigabytes of data without switching to a high-performance computational unit.
③ High compatibility with pandas
FireDucks provides exactly the same APIs as pandas, aiming for full compatibility so that a pandas user can accelerate an existing application without any extra effort. pandas offers a rich set of frame- and series-related methods, and it is difficult to implement all of them to achieve full compatibility. Hence, FireDucks implements most of the pandas methods commonly used for data analysis; whenever an unsupported operation is encountered in a user program, FireDucks internally invokes the respective pandas method, converting the target data into its pandas equivalent through an implicit Fallback mechanism. This aims for smooth execution without raising any error messages, while still speeding up the overall analysis.
This Fallback can sometimes be very expensive if the unsupported operation is performed on gigabytes of data, as it involves copying data out of the backend kernel. For relatively small data it is not a bottleneck, because most operations are already implemented on the backend side and can leverage the full power of the available CPU cores. However, we are continuously working to eliminate these costly fallbacks by supporting more and more operations, with releases at frequent intervals.
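The fallback pattern itself can be sketched in a few lines: try a fast native kernel first, and delegate to plain pandas only when the operation is unsupported. The dispatch table and function names below are illustrative assumptions, not FireDucks internals:

```python
# Hedged sketch of an implicit-fallback dispatcher (illustrative only).
import pandas as pd

# Pretend only 'sum' has a fast backend kernel in this toy model.
FAST_OPS = {"sum": lambda s: s.sum()}

def run_op(series: pd.Series, op_name: str, *args):
    fast = FAST_OPS.get(op_name)
    if fast is not None:
        return fast(series)          # fast path: stays on the backend
    # Fallback path: in a real system this is where the costly copy from
    # the backend kernel into pandas data would happen before delegating.
    return getattr(series, op_name)(*args)

s = pd.Series([1, 2, 3, 4])
total = run_op(s, "sum")             # fast path
top2 = run_op(s, "nlargest", 2)      # unsupported -> pandas fallback
```

The key point is that the user program never sees the difference; only the performance characteristics change.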
④ Ability to avoid dead code being executed
Unlike pandas, FireDucks can wisely decide not to execute the part of a user program whose result is computed but never used anywhere in the program. For example, consider the following method:
def func(x: pd.DataFrame, y: pd.DataFrame):
    merged = x.merge(y, on="key")
    sorted = merged.sort_values(by="key")  # is never used
    return merged.groupby("key").nunique()
The sorted object is never referenced within func(), so when using FireDucks the sort is never executed, saving a lot of execution time for func().
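Why lazy evaluation enables this can be shown with a toy dependency graph: only nodes reachable from the evaluated result are ever run. Again, this is a conceptual sketch, not FireDucks code:

```python
# Toy dead-code elimination: evaluation walks the dependency graph from
# the requested result, so unreferenced nodes are simply never visited.
class Node:
    def __init__(self, fn, deps=()):
        self.fn, self.deps = fn, deps

executed = []

def evaluate(node):
    vals = [evaluate(d) for d in node.deps]   # compute dependencies first
    executed.append(node)
    return node.fn(*vals)

x = Node(lambda: list(range(5)))
merged = Node(lambda a: [v * 2 for v in a], (x,))
sorted_ = Node(lambda a: sorted(a, reverse=True), (merged,))  # never used
result = Node(lambda a: sum(a), (merged,))

out = evaluate(result)   # sorted_ is unreachable from result: never computed
```

Eager pandas cannot do this, because each statement runs the moment it is written.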
Optimization Layers
There are mainly 3 layers of optimization applied in FireDucks:
- IR-based auto-optimization using the in-built runtime compiler
- The multi-threaded execution of the compiler-optimized code
- Algorithm-level optimization in the backend-side implementations of dropna, sort, groupby, etc.
Let me explain each of these with examples.
Although the major issue with pandas is its single-core execution model, it is possible to speed up an existing pandas application by expert-level code optimization. For example, let’s consider the below method:
def foo(df: pd.DataFrame, term: pd.Series, weight: pd.Series):
    r1 = term * weight - 30
    r2 = term * weight + 30
    return df[(df["x"] >= r1) & (df["x"] <= r2)][["c_1", "c_3", "c_5"]]
In the above example, there are two performance issues:

- Calculating term * weight twice, once for r1 and once for r2. Since the input series term and weight are used only as read-only inputs, the result of term * weight can be computed once and reused when computing r2.
- Extracting the required columns after filtering the entire data frame. Data frames are generally built on columnar data structures, where each column may have a different data type. Hence, filtering rows over the entire data frame can be costly, especially when the input data frame has many columns and only a few of them are required in the output. The target columns can therefore be extracted first, before performing the actual filter operation.
The optimized implementation for the above foo() method can be realized as follows:
def opt_foo(df: pd.DataFrame, term: pd.Series, weight: pd.Series):
    tw = term * weight  # one-time calculation
    r1 = tw - 30
    r2 = tw + 30
    tmp = df[["c_1", "c_3", "c_5", "x"]]  # extraction of required columns
    return tmp[(tmp["x"] >= r1) & (tmp["x"] <= r2)][["c_1", "c_3", "c_5"]]
For a sample input/output specification as follows, the above optimization achieves ~4x speed up even when using native pandas:
- df: A data frame of shape (10000000, 100) with all columns of type float64
- term: A float64 series of size 10000000
- weight: A float64 series of size 10000000
- output: A data frame of shape (3032988, 3) with all columns of type float64
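If you want to verify that the two hand-written variants really are equivalent, a scaled-down harness like the one below can be used with plain pandas. The sizes here are deliberately much smaller than the article's 10M-row benchmark so it runs quickly; timings will therefore differ, but the results must match exactly:

```python
# Scaled-down sanity check: foo() and opt_foo() must produce identical
# results (the optimization changes only the work done, not the answer).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, ncols = 10_000, 10   # far smaller than the article's 10M x 100 setup
df = pd.DataFrame(rng.random((n, ncols)),
                  columns=[f"c_{i}" for i in range(ncols)])
df["x"] = rng.random(n) * 100
term = pd.Series(rng.random(n))
weight = pd.Series(rng.random(n) * 100)

def foo(df, term, weight):
    r1 = term * weight - 30
    r2 = term * weight + 30
    return df[(df["x"] >= r1) & (df["x"] <= r2)][["c_1", "c_3", "c_5"]]

def opt_foo(df, term, weight):
    tw = term * weight                       # compute term * weight once
    r1 = tw - 30
    r2 = tw + 30
    tmp = df[["c_1", "c_3", "c_5", "x"]]     # extract needed columns first
    return tmp[(tmp["x"] >= r1) & (tmp["x"] <= r2)][["c_1", "c_3", "c_5"]]
```

At this scale the speed-up is modest; the ~4x figure below comes from the full-size input described above.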
Because of its lazy execution model, FireDucks can inspect the stacked IR operators generated by the API calls and automate many such domain-specific optimizations (which are sometimes difficult to spot even for an expert developer) using its built-in runtime compiler. On top of that, when the optimized code is executed by the multithreaded dfkl kernel, further speed-up comes from parallelization.
OMP_NUM_THREADS=1:
--------------------
[opt_foo] native-pandas-time: 0.1373 sec; fireducks-time: 0.1347 sec
[foo] native-pandas-time: 0.6017 sec; fireducks-time: 0.1225 sec
It is to be noted that:
- ~4x speed-up can be observed for native pandas when switching from foo() to the manually optimized opt_foo().
- however, because of the IR-based auto-optimization, the execution time of foo() is almost the same as that of opt_foo() for FireDucks, i.e., you can completely rely on FireDucks for such automatic optimization.
OMP_NUM_THREADS=2:
--------------------
[opt_foo] native-pandas-time: 0.1402 sec; fireducks-time: 0.0822 sec
[foo] native-pandas-time: 0.5873 sec; fireducks-time: 0.0793 sec
OMP_NUM_THREADS=3:
--------------------
[opt_foo] native-pandas-time: 0.1393 sec; fireducks-time: 0.0610 sec
[foo] native-pandas-time: 0.6063 sec; fireducks-time: 0.0588 sec
It is to be noted that:
- when there is no parallelization (single-threaded execution), native pandas and FireDucks take almost the same time for opt_foo().
- however, throughput increases as more cores are used for FireDucks execution, i.e., you can leverage the full power of the available CPU cores in your system without any extra effort.
The above example demonstrates the effect of the first two layers of optimization possible in FireDucks. Let me explain the effect of the third layer of optimization with another example.
The following method simply removes missing values from each row and calculates the mutual correlation of all N columns for the input data frame:
def bar(df: pd.DataFrame):
    return df.dropna().corr()
For this specific method there is nothing the compiler can optimize automatically, but because of the algorithm-level optimization performed in the dfkl kernel, one can experience a speed-up even with single-threaded execution.
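For readers unfamiliar with the two chained calls, the following tiny example (plain pandas) shows exactly what bar() computes — rows containing NaN are dropped first, then the pairwise correlation matrix of the remaining columns is returned:

```python
# What bar() computes, on a small hand-made input.
import numpy as np
import pandas as pd

def bar(df: pd.DataFrame):
    return df.dropna().corr()

df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [2.0, 4.0, 6.0, 8.0]})
corr = bar(df)
# After dropna(), b == 2 * a on the surviving rows, so the off-diagonal
# correlation is exactly 1.0.
```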
Below are the evaluation metrics for the above bar() method using the bitcoin data:
OMP_NUM_THREADS=1:
--------------------
[bar] native-pandas-time: 0.5538 sec; fireducks-time: 0.3276 sec
OMP_NUM_THREADS=12:
--------------------
[bar] native-pandas-time: 0.5496 sec; fireducks-time: 0.0523 sec
Even when no IR-based optimization is possible and multithreading is disabled (OMP_NUM_THREADS=1), significant performance can be achieved from FireDucks' default dfkl kernel, since we have carefully applied many algorithm-level optimizations while implementing methods like groupby, sort, dropna, corr, etc.
Because of all these optimizations, we could achieve significant performance gains while evaluating various queries from the popular TPC-H and TPCx-BB benchmarks.
Tips to ensure better performance
As mentioned earlier, performance can degrade when an unsupported operation is encountered and has to be handled by the Fallback mechanism. One such case is the apply() method. apply() gives a pandas developer the flexibility to apply a user-defined function along an axis of a data frame or on a series, but it is intrinsically costly — even in native pandas — because it iterates over each element along the given axis and applies the function element-wise.
For example, the method below multiplies each element of an input series by 2 using apply(); it takes ~17 seconds for a sample input of size 100,000,000 with the native pandas library.
def two_times(s: pd.Series):
    return s.apply(lambda x: x * 2)
There would be no performance gain when using FireDucks for such methods, since at this moment we do not support optimizing apply() statements; they implicitly fall back to the pandas level to get the job done. Such a Fallback may also reduce the effectiveness of the IR-based auto-optimization, as depicted in the following figure:
When re-writing the above method using the simple multiplication operator supported on pandas series, as follows, we could experience a ~190x (17 sec -> 0.09 sec) performance gain even with the native pandas library, and a further gain could be achieved with FireDucks because of its multithreaded execution.
def two_times(s: pd.Series):
    return s * 2
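That the vectorized rewrite is a drop-in replacement can be checked directly with plain pandas (a small input is used here; the ~190x figure above was measured on 100M elements):

```python
# Equivalence check: the vectorized version must return exactly the same
# series as the apply() version, just computed in one bulk operation.
import pandas as pd

def two_times_apply(s: pd.Series):
    return s.apply(lambda x: x * 2)   # element-wise Python loop: slow

def two_times(s: pd.Series):
    return s * 2                      # vectorized: one bulk operation

s = pd.Series(range(1_000))
```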
Hence it is strongly recommended to avoid such apply() statements whenever possible. In a future version of FireDucks we will automate many such apply() optimizations. You may also like to check out the other tips available on our public website.
Wrapping-up
To summarize, we discussed the key features of FireDucks, its different layers of optimization with examples, and our ongoing research activities to continuously improve its performance.
We expect it to be an economical solution for you, since you will neither need to upgrade your existing computational system nor spend significant time and effort learning new APIs and migrating your existing application. Additionally, if you use a cloud-based system such as AWS for your data analysis, the speed-up from FireDucks would significantly reduce your overall analysis time, helping you save significant cloud costs.
At the same time, FireDucks is an environment-friendly solution, since higher throughput means less CO2 emitted for your analyses throughout the year.
Thank you for your interest in FireDucks and for reading this article so far. We are continuously striving to improve its performance, with releases at frequent intervals. Please stay tuned and keep sharing your valuable feedback on any execution issues you face or optimization requests you have. We are eagerly looking forward to hearing from you!