Meet the hardest functions of Pandas, Part II

Original article was published by Bex T. on Artificial Intelligence on Medium


Pandas crosstab() comparison with pivot_table() and groupby()

Before we move on to more fun stuff, I think I need to clarify the differences between the three functions that compute grouped summary stats.

I covered the differences of pivot_table() and groupby() in the first part of the article. For crosstab(), the difference between the three is the syntax and the shape of results. Let’s compute the last crosstab() table using all three:

The output of pivot_table
The output of crosstab

I think you already know your favorite. groubpy() returns a series while the other two return identical dataframes as a result. However, it is possible to turn the groupby series into the same dataframe like this:

If you don’t understand the syntaxes of pivot_table() and unstack(), I highly suggest you read the first part of the article.

When it comes to speed, crosstab() is faster than pivot_table() but both are much slower than groupby():

As you see, even when chained with unstack(), groupby() is 3 times faster than the other two. This tells that if you just want to group and compute summary stats, you should use the same ol’ groupby(). The speed difference was even larger when I chained other methods like simple round().

The rest of the comparison will mainly be about pivot_table() and crosstab(). As you saw, the shape of the results of the two functions is the same. The first difference between the two is that crosstab() can work with any data type.

It can accept any array-like objects such as lists, numpy arrays, data frame columns (which are pandas series). In contrast, pivot_table() only works on dataframes. In a helpful StackOverflow thread, I found out that if you use crosstab() on a dataframe it calls pivot_table() under the hood.

Next are the parameters. There are parameters that exist only in one and vice versa. The first one which is the most popular is crosstab()‘s normalize. normalize accepts these options (from the documentation):

  • If passed all or True, will normalize over all values.
  • If passed index will normalize over each row.
  • If passed columns will normalize over each column.

Let’s see a simple example:

The most number of diamonds are with ideal cut and color code G.

If passed all, for each cell, pandas calculates the percentage of the overall amount:

If passed, index or columns, the same operation is done column-wise or row-wise:

Row-wise normalization. Looks like in almost all types of diamond cuts, the highest number of them have color code G.
Column-wise normalization. Heatmap shows again that in the dataset, for each color code, the number of ideally cut diamonds is much higher.

In crosstab() you can also change the index and column names directly within the function using rownames and colnames. You don’t have to do it manually afterward. These two arguments are very useful when we group by multiple variables at a time, as you will see later.

The parameter fill_value only exists in pivot_table(). Sometimes, when you group by many variables, it is inevitable that there will NaNs. In pivot_table(), you can change them to custom value using fill_value:

But, you can achieve the same thing by chaining fillna() on the dataframe, if you use crosstab():