5 Optimizations in Pandas
February 20, 2021

As someone who did over a decade of development before moving into Data Science, there are a lot of mistakes I see data scientists make while using Pandas. The good news is these are really easy to avoid, and fixing them can also make your code more readable.

Mistake 1: Getting or Setting Values Slowly

It’s nobody’s fault that there are way too many ways to get and set values in Pandas. In some situations, you have to find a value using only an index or find the index using only the value. However, in many cases, you’ll have many different ways of selecting data at your disposal: index, value, label, etc.

In those situations, I prefer to use whatever is fastest. Here are some common choices, ordered from slowest to fastest, which show you could be missing out on a 195% gain!

Tests were run using a DataFrame of 20,000 rows. Here’s the notebook if you want to run it yourself.
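The notebook itself isn't reproduced here, but a minimal sketch of the kind of comparison it runs (hypothetical data and column names, not the original benchmark) looks like this:

```python
import numpy as np
import pandas as pd

# A hypothetical 20,000-row DataFrame, standing in for the one in the notebook
df = pd.DataFrame({"value": np.random.rand(20_000)})

# Three ways to read the same cell, roughly slowest to fastest:
slow = df.loc[500]["value"]   # chained indexing: builds a whole row, then indexes it
mid = df.loc[500, "value"]    # single label-based lookup
fast = df.at[500, "value"]    # scalar accessor, optimized for one cell

assert slow == mid == fast
```

Wrapping each line in `%timeit` in a notebook will show the gap on your own hardware.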

Mistake 2: Only Using 25% of Your CPU

Whether you’re on a server or just your laptop, the vast majority of people never use all the computing power they have. Most processors (CPUs) have 4 cores nowadays, and by default, Pandas will only ever use one.

Modin is a Python module built to enhance Pandas by making way better use of your hardware. Modin DataFrames don’t require any extra code and in most cases will speed up everything you do to DataFrames by 3x or more.

Modin acts as more of a plugin than a library since it uses Pandas as a fallback and cannot be used on its own.

The goal of Modin is to augment Pandas quietly and let you keep working without learning a new library. The only line of code most people will need is import modin.pandas as pd replacing your normal import pandas as pd, but if you want to learn more, check out the documentation here.

In order to avoid recreating tests that have already been done, I’ve included this picture from the Modin documentation showing how much it can speed up the read_csv() function on a standard laptop.

Please note that Modin is still in development, and while I use it in production, you should expect some bugs. Check the Issues on GitHub and the Supported APIs for more information.



Mistake 3: Making Pandas Guess Data Types

When you import data into a DataFrame and don’t specifically tell Pandas the columns and datatypes, Pandas will read the entire dataset into memory just to figure out the data types.

For example, if you have a column full of text, Pandas will read every value, see that they’re all strings, and set the data type to “string” for that column. Then it repeats this process for all your other columns.

You can use df.info() to see how much memory a DataFrame uses; that’s roughly the same amount of memory Pandas will consume just to figure out the data types of each column.

Unless you’re tossing around tiny datasets or your columns are changing constantly, you should always specify the data types. To do this, just add the dtype parameter and a dictionary with your column names and their data types as strings. For example:
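A minimal sketch (the file contents and column names here are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV data, standing in for a real file on disk
csv_data = io.StringIO("name,age,score\nAlice,34,9.5\nBob,29,7.2")

# Tell Pandas the types up front instead of letting it guess
df = pd.read_csv(
    csv_data,
    dtype={"name": "string", "age": "int32", "score": "float64"},
)
```

With a real file, you'd pass the path instead of the StringIO buffer; the dtype dictionary works the same way.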

Note: This also applies to DataFrames that don’t come from CSVs.



Mistake 4: Leftover DataFrames

One of the best qualities of DataFrames is how easy they are to create and change. The unfortunate side effect of this is most people end up with code like this:
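The original snippet isn't preserved here, but the pattern looks something like this (hypothetical transformations):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, None]})
df2 = df1.dropna()                   # df1 is still sitting in memory...
df3 = df2.reset_index(drop=True)     # ...and now df2 is too
# From here on, the code only ever uses df3,
# but all three DataFrames stay alive.
```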

What happens is you leave df2 and df1 in Python memory, even though you’ve moved on to df3. Don’t leave extra DataFrames sitting around in memory. If you’re using a laptop, it’s hurting the performance of almost everything you do. If you’re on a server, it’s hurting the performance of everyone else on that server (or at some point, you’ll get an “out of memory” error).

Instead, here are some easy ways to keep your memory clean:

  • Use df.info() to see how much memory a DataFrame is using
  • Install plugin support in Jupyter, then install the Variable Inspector plugin for Jupyter. If you’re used to having a variable inspector in RStudio, you should know that RStudio now supports Python!
  • Anytime you edit a DataFrame using a function, check if that function supports the inplace=True parameter. Most of them do, and this will modify the DataFrame without making you store it in a variable again.
  • If you’re in a Jupyter session already, you can always erase variables without restarting by using del df2
  • Chain together multiple DataFrame modifications in one line (so long as it doesn’t make your code unreadable): df = df.apply(thing1).dropna()
  • As Roberto Bruno Martins pointed out, another way to ensure clean memory is to perform operations within functions. You can still unintentionally abuse memory this way, and explaining scope is outside the scope of this article, but if you aren’t familiar I’d encourage you to read this writeup.
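Putting a couple of those tips together (a sketch, reusing the same hypothetical data as above):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, None]})

# Chain the modifications so the intermediate DataFrames
# are never bound to names that linger in memory
clean = df.dropna().reset_index(drop=True)

# ...and explicitly drop the original once you're done with it
del df
```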

Mistake 5: Manually Configuring Matplotlib

This might be the most common mistake, but it lands at #5 because it’s the least impactful. I see this mistake happen even in tutorials and blog posts from experienced professionals.

Pandas imports Matplotlib for you automatically the first time you plot, and it even sets some chart configuration up for you on every DataFrame.

There’s no need to import and configure it for every chart when it’s already baked into Pandas for you.

Here’s an example of doing it the wrong way; even though this is a basic chart, it’s still a waste of code:
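Reconstructed from memory rather than the original snippet, the "wrong way" boils down to something like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 3, 2, 5]})

# Importing and configuring Matplotlib by hand,
# when Pandas would have done all of this for you
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["x"])
ax.set_title("Chart title")
```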

And here’s the right way:

df['x'].plot()

Easier, right? You can do anything on these DataFrame plot objects that you can do to any other Matplotlib plot object. For example:

df['x'].plot.hist(title='Chart title')

I’m sure I’m making other mistakes I don’t know about, but hopefully sharing these known ones with you will help put your hardware to better use, let you write less code, and get more done!


-Preston Badeer

Towards Data Science
