Processing data frames can take a long time when the data set is massive. In that case, parallel processing can speed up the whole job. Dask is an open-source library that lets us do exactly that.
Dask - Why do we need it?
For data science tasks, numpy, sklearn, pandas, etc., are very helpful for manipulating data. In most cases, the pandas library is sufficient. We can do many kinds of things with pandas.
The problem starts when our data gets larger: it may no longer fit in RAM, and processing can take a long time.
I am sure you know about big data platforms like Hadoop and Spark. Unfortunately, those are not Python-native environments, and many people prefer pandas, TensorFlow, Torch, etc., for ML/DL work. Dask fills this gap by scaling pandas-style code across multiple cores while staying in Python.
Parallel Processing
Executing multiple tasks at the same time is called parallel processing. In typical situations, code is executed sequentially, which means one task at a time.
Rather than waiting for the prior task to finish, we can run many tasks concurrently.
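To make the difference concrete, here is a minimal sketch that uses only the Python standard library. The function `slow_square` is a hypothetical stand-in for real work, and the threads plus `time.sleep` only simulate tasks that spend their time waiting, so treat it as an illustration rather than a benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(n):
    """Hypothetical task: wait 0.2 seconds, then return n squared."""
    time.sleep(0.2)
    return n * n


def run_sequential(numbers):
    # One task at a time: each call waits for the previous one to finish.
    return [slow_square(n) for n in numbers]


def run_parallel(numbers):
    # One worker thread per task, so the waiting overlaps.
    with ThreadPoolExecutor(max_workers=len(numbers)) as pool:
        return list(pool.map(slow_square, numbers))


numbers = [1, 2, 3, 4]

start = time.perf_counter()
sequential_result = run_sequential(numbers)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel_result = run_parallel(numbers)
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both versions return the same values; only the elapsed time differs.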
Demo
Let us compare the time taken for some tasks using Pandas and Dask.
First, we need to install the Dask library. Then we need to import pandas and Dask.
I used two large CSV files. The first file is around 8 GB, and the second one is around 5 GB.
1) Read a file and create a data frame.
Pandas took approximately 2 minutes and 30 seconds to read the file and create the data frame. I feel thrilled to have a powerful server.
Now let us see how Dask performs.
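**OH, WAIT. 0.7 milliseconds????**
Something must be wrong...
No, there is nothing wrong.
Dask is a beast. The call returned in only 0.7 milliseconds for the same file that took pandas more than 2 minutes. One caveat for a fair comparison: Dask dataframes are lazy, so this call mostly sets up the work, and the bulk of the file is read only when a result is actually requested.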
2) Append two data frames.
OK, for appending, we need two data frames. So let us read another file with pandas, create a second data frame, and append it to the first one.
For the first file - 2 minutes and 30 seconds
For the second file - 1 minute and 34 seconds
Append process - 19 seconds
So it took a total of approximately 4 minutes and 20 seconds.
OK, now it is time to see how Dask performs.
For the first file - 0.7 milliseconds
For the second file - 0.2 milliseconds
Append process - 0.3 milliseconds
It took 1.2 milliseconds for the whole process.
MILLISECONDS???
Yes, you read that correctly. We are comparing minutes with milliseconds. It is awesome.
3) Merge two data frames.
Pandas could not handle it. My kernel restarted after around 20 minutes of running because it ran out of memory. I have 120 GB of memory. However, it was not enough for pandas.
Wow.
Dask finished in less than a second. Actually, it took only 59 milliseconds. I waited 20 minutes for pandas, and it still could not finish the task.
4) Group Data
For my data set, there is not much of a difference. Both finished very quickly.
5) Get unique values
Pandas took 4 seconds. Dask did not support the command I used. There may be a workaround or an equivalent command, but I will skip it here.
6) Get notNA values
Pandas took 32 seconds. For Dask, it is the same story as above: it did not support the command I used. There may be a workaround or an equivalent command, but I will skip it here.
7) Sorting
For sorting, pandas is the winner. It took two and a half minutes, while Dask took three and a half minutes.
End Note
Please try Dask on some other tasks and see how it performs. As we have seen, Dask has both advantages and disadvantages.
My conclusion is that it is better to combine both when working with a data set: Dask is much faster for many operations, but it has some limitations for people who are used to pandas.
*This article was written by @nuwan, a member of @qualitia_cdev.