Meet the hardest functions of Pandas, Part I

Original article was published by Bex T. on Artificial Intelligence on Medium


Pandas pivot_table(), with comparison to groupby()

There should be one — and preferably only one — obvious way to do it.

The above is a quote from the Zen of python. Python wants to have only one obvious solution for a single problem. But, pandas deliberately avoids this. Often in pandas, there are several ways to do one operation.

pivot_table() is an example. It is a complete and sometimes a better alternative to groupby() function. The difference is the shape of the result. groupby() returns a Series object while pivot_table() gives an easy-to-work dataframe.

Let’s work on a problem and give the solutions using both functions. I will load the tips dataset from seaborn:

tips = sns.load_dataset('tips')
tips.head()

We want to find the sum of all bills for each gender:

Let’s compare the syntax of the two functions. In groupby(), we pass the column we want to group by in the parentheses and in pivot_table() the equivalent parameter is the index. In groupby(), to choose the column to aggregate, we use subsetting with brackets while in pivot_table() we pass it to values. Finally, to choose the aggregating function, we use method chaining in groupby() whereas, pivot_table() provides aggfunc argument.

When I wrote an article about project setup for DS and ML, I researched a lot of notebooks. What I found surprising was that many people used groupby() and used .reset_index() function to turn the results of groupby() into a dataframe, let’s explore further to find out why:

result = tips.groupby('sex')['total_bill'].sum().reset_index()
result

If you use the pivot_table() you don’t have to use reset_index() to convert the result into a dataframe. groupby() results are not as easy to work with as dataframes. Let’s see how to group by multiple columns and aggregate with multiple functions:

tips.groupby(['sex', 'day'])['total_bill']\
.agg([np.mean, np.median, np.sum]).reset_index()
tips.pivot_table(values='total_bill', 
index=['sex', 'day'],
aggfunc=[np.mean, np.median, np.sum])

Both functions return a dataframe for multiple columns. But, even though for a single column pivot_table() is better, using the reset_index() on the groupby result gives a much nicer dataframe. Maybe that’s why Kagglers prefer groupby().

In pivot_table(), sometimes you can use columns parameter instead of index (or sometimes both) to display each group as a column. But if you pass multiple arguments to columns, the result will be a long dataframe with a single row.

Another difference between groupby() and pivot_table() would be fill_value parameter. Sometimes, when you group by multiple variables, there won’t be matching cells for the result. In such cases, groupby() puts NaNs but in pivot_table() you can control this behavior:

tips.head()

When would you want to use pivot_table()? As I said previously, it can sometimes be a better alternative for groupby(). It is also a personal preference when it comes to syntax. An obvious example would be choosing pivot_table() because it has some other parameters which are not available in groupby(). I already covered fill_value, but there are others like margins. You can learn more about it in the documentation 😁.