Be aware of this pitfall in pandas

Denys Kaliuzhnyi
3 min readJan 22, 2021

Managing datetime data type in pandas? Don't distort your data from the very beginning!

In this brief article, I want to highlight one mistake many beginners make while processing dates in pandas. It doesn’t break your code down, but rather makes your scripts working not in the way you expected. In my work experience, it takes some time to reveal a problem and then hardcode my mind to pay attention to it.

To start with, pandas has a very high-level API that provides an easy way to process various types of date formats and powerful further manipulation. However, don’t rely on this simplicity and always validate a return you get. Let’s see why.

Everyone who processes data in pandas deals with read_csv function that allows to read csv file in a convenient way.

df = pd.read_csv('path/to/file')

Most people know that there is a parse_dates argument that accepts a list of columns (either names or indices) to convert to an appropriate date type.

df = pd.read_csv('path/to/file', parse_dates=['Date'])

However, folks may not pay attention to daysfirst argument. Depending on where you live, you may use different date formats, e.g. DD/MM/YYYY or YYYY/MM/DD. By default pandas expects month followed by day, meaning dayfirst is False. That’s why if your csv uses the former date format, you need to explicitly specify dayfirst to be True.

df = pd.read_csv('path/to/file', parse_dates=['Date'], datefirst=True)

But what if you don’t? You may expect a risen error, but unfortunately it won’t happen. There is the illustration of what happens below.

Let’s assume we have a csv file with the following content and read it either with or without specifying dayfirst to True.

Sample of csv file with DD-MM-YYYY date format. Raw view from a text editor.
Read with dayfirst=True. Expected result.
Read with dayfirst=False. Unexpected result.

Surprise! As you can see reading with dayfirst=False ends up with swapped days and months in case where this tweak produces a valid date (i.e. day is not greater than 12). This is a very unpleasant pitfall that is hard to reveal. You are lucky if you know an expected date distribution of your data and it helps you to disclose the problem because of mismatch. Otherwise, your code will produce meaningless output and you will never know about it.

This behaviour is also relevant to converting data types after reading, e.g. using astype function. I assume pandas uses such behaviour because of time saving. So, I recommend you to keep track of dates you are dealing with, especially comparing raw (string) and processed values. Thanks for reading!

--

--