Original article was published by Bex T. on Artificial Intelligence on Medium
Pandas pivot_table(), with comparison to groupby()
There should be one — and preferably only one — obvious way to do it.
The above is a quote from the Zen of python. Python wants to have only one obvious solution for a single problem. But,
pandas deliberately avoids this. Often in
pandas, there are several ways to do one operation.
pivot_table() is an example. It is a complete and sometimes a better alternative to
groupby() function. The difference is the shape of the result.
groupby() returns a
Series object while
pivot_table() gives an easy-to-work dataframe.
Let’s work on a problem and give the solutions using both functions. I will load the
tips dataset from
tips = sns.load_dataset('tips')
We want to find the sum of all bills for each gender:
Let’s compare the syntax of the two functions. In
groupby(), we pass the column we want to group by in the parentheses and in
pivot_table() the equivalent parameter is the
groupby(), to choose the column to aggregate, we use subsetting with brackets while in
pivot_table() we pass it to
values. Finally, to choose the aggregating function, we use method chaining in
When I wrote an article about project setup for DS and ML, I researched a lot of notebooks. What I found surprising was that many people used
groupby() and used
.reset_index() function to turn the results of
groupby() into a dataframe, let’s explore further to find out why:
result = tips.groupby('sex')['total_bill'].sum().reset_index()
If you use the
pivot_table() you don’t have to use
reset_index() to convert the result into a dataframe.
groupby() results are not as easy to work with as dataframes. Let’s see how to group by multiple columns and aggregate with multiple functions:
.agg([np.mean, np.median, np.sum]).reset_index()
aggfunc=[np.mean, np.median, np.sum])
Both functions return a dataframe for multiple columns. But, even though for a single column
pivot_table() is better, using the
reset_index() on the
groupby result gives a much nicer dataframe. Maybe that’s why Kagglers prefer
pivot_table(), sometimes you can use
columns parameter instead of
index (or sometimes both) to display each group as a column. But if you pass multiple arguments to
columns, the result will be a long dataframe with a single row.
Another difference between
pivot_table() would be
fill_value parameter. Sometimes, when you group by multiple variables, there won’t be matching cells for the result. In such cases,
NaNs but in
pivot_table() you can control this behavior:
When would you want to use
pivot_table()? As I said previously, it can sometimes be a better alternative for
groupby(). It is also a personal preference when it comes to syntax. An obvious example would be choosing
pivot_table() because it has some other parameters which are not available in
groupby(). I already covered
fill_value, but there are others like
margins. You can learn more about it in the documentation 😁.