Parallel Process Data Frames Using Dask

When working with data frames, processing can become very slow once the amount of data gets large. In that case, parallel processing can speed up the whole job. Dask is an open-source library that helps us do exactly that.

Dask - Why do we need it?

For data science tasks, libraries such as NumPy, scikit-learn, and pandas are very helpful for manipulating data. In most cases, pandas alone is sufficient, and we can do many kinds of things with it.
The problem starts when our data gets larger: it may no longer fit in RAM, and processing it can take a long time.
I am sure you know about big data platforms like Hadoop and Spark. Unfortunately, those are not Python-native environments, and many people prefer pandas, TensorFlow, Torch, etc., for ML/DL work.

Parallel Processing

Executing multiple tasks at the same time is called parallel processing. In typical situations, code is executed sequentially, which means one task at a time.
Rather than waiting for the prior task to finish, we can run many tasks concurrently.
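To make the difference concrete, here is a minimal sketch using only the Python standard library (not Dask). The function slow_square and its 0.2 second delay are hypothetical stand-ins for any slow task. Threads are used because the simulated task only waits; for CPU-heavy work, processes would usually be the better fit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Hypothetical slow task: wait 0.2 s, then return x squared."""
    time.sleep(0.2)
    return x * x


inputs = [1, 2, 3, 4]

# Sequential: each task starts only after the previous one has finished.
start = time.perf_counter()
sequential_results = [slow_square(x) for x in inputs]
sequential_seconds = time.perf_counter() - start

# Concurrent: four worker threads run the four tasks at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(slow_square, inputs))
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f} s")
print(f"concurrent: {parallel_seconds:.2f} s")
```

Both runs return the same results; only the waiting time overlaps in the second one.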

Demo

Let us compare the time taken for some tasks using pandas and Dask.
First, we need to install the dask library. Then we need to import pandas and dask.

Screen Shot 2021-09-02 at 17.51.10.png

I used two large CSV files. The first file is around 8 GB, and the second one is around 5 GB.

1) Read file and Create a data frame.

Screen Shot 2021-09-02 at 17.54.28.png
Pandas took approximately 2 minutes and 30 seconds to read the file and create the data frame. I feel thrilled to have a powerful server.
Now let us see how Dask performs.
Screen Shot 2021-09-02 at 17.54.43.png
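OH, WAIT. 0.7 milliseconds?

Something must be wrong...
No, nothing is wrong, but the number needs some context.
The Dask call returned in only 0.7 milliseconds for the same file that pandas needed more than 2 minutes to read. This is because Dask is lazy: read_csv only looks at a small sample of the file and builds a task graph, and the data itself is read later, when a result is requested (for example with compute()). So this timing measures the setup, not a full read of the file.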

2) Append two Data frames.

Screen Shot 2021-09-02 at 17.55.06.png
OK, for appending, we need two data frames. So let us read another file with pandas, create a second data frame, and append it to the first data frame.

For the first file - 2 minutes and 30 seconds
For the second file - 1 minute and 34 seconds
Append process - 19 seconds

So it took a total of approximately 4 minutes and 20 seconds.
OK, now it is time to see how Dask performs.
Screen Shot 2021-09-02 at 17.55.28.png
For the first file - 7 milliseconds
For the second file - 2 milliseconds
Append process - 3 milliseconds

It took 12 milliseconds for the whole process.

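MILLISECONDS?

Yes, you have seen it correctly. We are comparing minutes with milliseconds. Keep in mind that Dask evaluates lazily, so its figures cover describing the work rather than carrying it out on the full data.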

3) Merge two Data frames.

Screen Shot 2021-09-02 at 17.56.15.png
Pandas could not handle it. My kernel restarted after around 20 minutes of running because it ran out of memory. I have 120 GB of memory, but it was not enough for pandas.
Screen Shot 2021-09-02 at 17.56.26.png
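Wow.
The Dask call finished in less than a second; it took only 59 milliseconds. I had waited 20 minutes for pandas, and even then it could not finish the task. Note that the 59 milliseconds cover defining the merge; Dask defers the actual join until the result is computed.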

4) Group Data

Screen Shot 2021-09-02 at 17.55.37.png
Screen Shot 2021-09-02 at 17.55.46.png

For my data set, there is not much of a difference. Both finished very quickly.

5) Get unique values

Screen Shot 2021-09-02 at 17.57.17.png
Screen Shot 2021-09-02 at 17.57.30.png
Pandas took 4 seconds. However, the same command did not work on the Dask data frame. There may be a workaround or an equivalent command, but I will skip it here.

6) Get notNA values

Screen Shot 2021-09-02 at 17.56.44.png
Screen Shot 2021-09-02 at 17.57.01.png
Pandas took 32 seconds. For Dask, the situation is the same as above: the same command did not work. There may be a workaround or an equivalent command, but I will skip it here.

7) Sorting

Screen Shot 2021-09-02 at 19.06.09.png

Screen Shot 2021-09-02 at 17.56.35.png
For sorting, pandas is the winner. It took two and a half minutes, while Dask took three and a half minutes.

End Note

Please try some other tasks with Dask and see how it performs. We can see that Dask has both advantages and disadvantages.
The conclusion is that it is better to combine both when working with a dataset: Dask returned much faster in most of these tests, but it has some limitations for people who are used to pandas.

*This article was written by @nuwan, a member of @qualitia_cdev.
