Pandas vs Polars - A Comparison of File I/O
For more than a decade, Pandas has been the gold standard of DataFrame libraries in Python and the go-to for tasks such as manipulating tabular datasets. Most (if not all) connectors to third-party applications use the Pandas API when dealing with datasets. The de facto standard has been (and probably will continue to be) Pandas.
Enter Polars. Polars is a "Lightning-fast DataFrame library for Rust and Python" built on top of the Apache Arrow memory format. Polars' claim to fame is speed, as evidenced by the H2O.ai benchmarks, which mainly target database operations such as joins, group-bys, and aggregations. However, I am particularly interested in how the I/O performance of Polars compares to Pandas. In this article, we will explore how Polars and Pandas fare at reading and writing a large dataset that fits in memory (~1GB) and one that does not (~50GB). Let's get started!
Datasets
The specifications for the datasets used in the analysis are provided here:
medium.csv
- ~1GB
- 2.2 Million Rows, 27 Columns
large.csv
- ~50GB
- 1.2 Billion Rows, 3 Columns
These datasets were generated by modifying the following bash script:
for i in {0..10000000}; do echo $(uuidgen),$RANDOM,$(head -c 12 /dev/urandom | base64); done >> file.csv
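For reference, here is a hypothetical Python sketch that produces the three-column layout used by large.csv later in the article (id, letter, number). The function name, row count, and output path are illustrative and not the exact script used for these benchmarks.
# Hypothetical generator for a three-column CSV (id, letter, number);
# not the exact script used to build the benchmark files.
import csv
import random
import string
import uuid

def generate_csv(path: str, n_rows: int) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "letter", "number"])
        for _ in range(n_rows):
            writer.writerow([
                uuid.uuid4(),                           # random UUID
                random.choice(string.ascii_uppercase),  # random letter A-Z
                random.randint(0, 32767),               # random integer
            ])

generate_csv("sample.csv", 1_000_000)  # scale n_rows up to approach ~50GB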
Setup
Run the following to install the libraries and their dependencies:
pip install pandas
pip install polars
In memory dataset
Let's compare how Pandas and Polars perform at reading and writing a dataframe that fits in memory. To capture the execution time, I will be using the %%timeit Jupyter magic. If you are not following along in Jupyter, you could also put the code in a script and time it with the shell command time before running it, as shown in the sketch below.
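For example, a minimal standalone script (the file name read_benchmark.py is assumed for illustration) could be saved and run with time python read_benchmark.py:
# read_benchmark.py -- a minimal sketch; run with: time python read_benchmark.py
import pandas as pd

df = pd.read_csv("medium.csv")
print(df.shape)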
Reading
import pandas as pd
import polars as pl
import numpy as np
%%timeit
pandas_df = pd.read_csv("medium.csv")
# Output: 13.5 s ± 528 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On my machine (a MacBook Pro with an M1 chip), this took 13.5 seconds to read in. Let's see how Polars compares.
%%timeit
polars_df = pl.read_csv("medium.csv")
# Output: 350 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Quite a dramatic difference when reading in the same dataset: 13.5 seconds versus 350 milliseconds!
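As a quick sanity check (not timed), you can confirm that both libraries loaded the same data. Note that variables assigned inside a %%timeit cell are not kept around, so re-run the reads in a plain cell first; the comparison below is just a sketch, and to_pandas() requires pyarrow to be installed.
# Rough parity check between the two reads (run outside of %%timeit).
assert pandas_df.shape == polars_df.shape
assert list(pandas_df.columns) == polars_df.columns
# Converting the Polars frame to Pandas allows a value-level comparison,
# though dtype differences may cause equals() to report False.
print(pandas_df.equals(polars_df.to_pandas()))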
Writing
Now let's take that CSV that we read in and write it back out as a new file.
Pandas:
%%timeit
pandas_df.to_csv("pandas_data.csv")
# Output: 16.1 s ± 186 ms per loop
Polars:
%%timeit
polars_df.write_csv("polars_data.csv")
# Output: 951 ms ± 37.8 ms per loop
Another stark contrast: Pandas takes just over 16 seconds, while Polars takes just under 1 second.
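One caveat on the comparison: Pandas writes its row index to the CSV by default, whereas Polars has no index, so the Pandas output contains an extra column. Passing index=False makes the two outputs comparable column-for-column and may trim the Pandas write time slightly:
# Skip the row index so the Pandas output matches the Polars output column-for-column.
pandas_df.to_csv("pandas_data.csv", index=False)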
Large file operations
Let's turn our focus to a dataset that cannot be read entirely into memory. I have a CSV file that is nearly 50GB in size and consists of three columns:
- id (a random UUID)
- letter (a random letter from A-Z)
- number (a random integer)
I'd like to filter this dataset down to rows where letter equals 'A' and number equals 1.
Executing this in Polars is fairly seamless, as the scan_csv function allows for lazy execution and is ideal for larger-than-memory datasets.
%%timeit
q = (pl.scan_csv('large.csv')
.filter((pl.col('letter') == 'A') &
(pl.col('number') == 1))
)
q.collect().write_csv("tmp/polars/polars_results.csv")
# Output: 1min 13s ± 658 ms per loop
Trimming down a 50GB file and writing the output to a new CSV took roughly 1 minute and 13 seconds.
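Depending on your Polars version, you may also be able to skip the intermediate collect() and stream the filtered result straight to disk with LazyFrame.sink_csv. A sketch (untimed, assuming a recent Polars release):
# Stream the filtered rows to disk without materializing the result in memory.
# sink_csv is available on LazyFrame in recent Polars releases.
(
    pl.scan_csv('large.csv')
    .filter((pl.col('letter') == 'A') & (pl.col('number') == 1))
    .sink_csv('tmp/polars/polars_results_streamed.csv')
)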
If we were to use Pandas (without additional libraries such as Dask), we would have to resort to a chunking method to iteratively filter the DataFrame.
def filter_csv(csv_file: str, chunksize: int = 100_000) -> str:
    return_path = 'tmp/pandas/pandas_results.csv'
    # Read the CSV in chunks so the full 50GB never has to fit in memory.
    for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunksize, dtype={'number': np.int32})):
        filtered = chunk.loc[(chunk.letter == 'A') & (chunk.number == 1)]
        if i == 0:
            # The first chunk creates the file and writes the header.
            filtered.to_csv(return_path, index=False)
        else:
            # Subsequent chunks append without repeating the header.
            filtered.to_csv(return_path, mode='a', header=False, index=False)
    return return_path

filter_csv('large.csv')
Since Pandas cannot lazily scan a large file the way Polars can, we had to read the file in chunks and append each filtered chunk to a target file. This took 9 minutes and 53 seconds to run on my machine.
Conclusion
To summarize the results for each library:
| Library | Read 1 GB File | Write 1 GB File | Read 50 GB File and Filter |
|---|---|---|---|
| Pandas | 13.5s | 16.1s | 9min 53s |
| Polars | 350ms | 951ms | 1min 13s |
In each I/O operation, Polars fared better than Pandas, and by a significant margin. Granted, there are ways to improve the speed of Pandas using other libraries that we did not demonstrate here, but it is nonetheless impressive that Polars is this fast right out of the box.
Will it be dethroning Pandas any time soon? Definitely not. Is it a legitimate alternative for some use cases with tremendous upside? You bet!