Original article was published by Bex T. on Artificial Intelligence on Medium
Pandas crosstab() comparison with pivot_table() and groupby()
Before we move on to more fun stuff, I think I need to clarify the differences between the three functions that compute grouped summary stats.
I covered the differences of
groupby() in the first part of the article. For
crosstab(), the difference between the three is the syntax and the shape of results. Let’s compute the last
crosstab() table using all three:
I think you already know your favorite.
groubpy() returns a series while the other two return identical dataframes as a result. However, it is possible to turn the
groupby series into the same dataframe like this:
If you don’t understand the syntaxes of
unstack(), I highly suggest you read the first part of the article.
When it comes to speed,
crosstab() is faster than
pivot_table() but both are much slower than
As you see, even when chained with
groupby() is 3 times faster than the other two. This tells that if you just want to group and compute summary stats, you should use the same ol’
groupby(). The speed difference was even larger when I chained other methods like simple
The rest of the comparison will mainly be about
crosstab(). As you saw, the shape of the results of the two functions is the same. The first difference between the two is that
crosstab() can work with any data type.
It can accept any array-like objects such as lists,
numpy arrays, data frame columns (which are
pandas series). In contrast,
pivot_table() only works on dataframes. In a helpful StackOverflow thread, I found out that if you use
crosstab() on a dataframe it calls
pivot_table() under the hood.
Next are the parameters. There are parameters that exist only in one and vice versa. The first one which is the most popular is
normalize accepts these options (from the documentation):
- If passed
True, will normalize over all values.
- If passed
indexwill normalize over each row.
- If passed
columnswill normalize over each column.
Let’s see a simple example:
all, for each cell,
pandas calculates the percentage of the overall amount:
columns, the same operation is done column-wise or row-wise:
crosstab() you can also change the index and column names directly within the function using
colnames. You don’t have to do it manually afterward. These two arguments are very useful when we group by multiple variables at a time, as you will see later.
fill_value only exists in
pivot_table(). Sometimes, when you group by many variables, it is inevitable that there will
pivot_table(), you can change them to custom value using
But, you can achieve the same thing by chaining
fillna() on the dataframe, if you use