Original article was published by Moeedlodhi on Artificial Intelligence on Medium
7 Pandas functions to boost your productivity
Basic Pandas Functions to help you in your data preprocessing
If you work in data science, you are probably familiar with Pandas.
Built for data manipulation and preprocessing, Pandas has a ton of functionality that makes managing, cleaning, visualizing, and retrieving data much easier. As anyone in the field knows, a large chunk of a data scientist’s time goes into getting the data into a clean and understandable format for machine learning.
In this article, I will go over the basic Pandas functions that have made my life as a data science intern a whole lot easier.
For this article, I will be using the Titanic dataset from Kaggle. Let’s get started.
Loading the data in.
import pandas as pd
data = pd.read_csv('train.csv')
The first command I would like to go over is “data.isnull().sum()”, which gives us the number of null values in each column of our dataset.
Before beginning any preprocessing task, it is a must to check how many null values are present in the dataset, so that we can either remove or impute them. The above command does just that.
Let’s check the output.
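To make the idea concrete without needing the Kaggle file, here is a minimal sketch on a tiny stand-in data frame. The name `sample` and all of its values are made up for illustration; only the column names are borrowed from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are invented.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() marks every missing cell as True; sum() then counts
# the True values column by column.
null_counts = sample.isnull().sum()
print(null_counts)
```

On the real dataset the same call would be `data.isnull().sum()`.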
We have a number of null values that need to be dealt with. data.fillna() replaces null values with a value of our choice. Let’s use it on the Embarked and Age columns.
The Embarked column has three values: ‘S’, ‘C’, and ‘Q’. I would like to impute its null values with the most frequently occurring value, or “mode”, of the column, whereas the Age column can be imputed with its mean.
The mode of the Embarked column and the mean of the Age column can be computed with data['Embarked'].mode() and data['Age'].mean().
We can then pass these values to “data.fillna()” to impute the null values:
# mode() returns a Series, so take its first entry as the fill value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
data['Age'] = data['Age'].fillna(data['Age'].mean())
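One detail worth seeing in isolation: mode() returns a Series rather than a single value (there can be ties), so a scalar has to be picked out of it before it is used as a fill value. The sketch below shows this on a small made-up frame; `sample`, `embarked_mode`, and `age_mean` are names I chose for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
sample = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# mode() returns a Series, so take its first entry to get a scalar.
embarked_mode = sample["Embarked"].mode()[0]
# mean() skips missing values by default.
age_mean = sample["Age"].mean()

sample["Embarked"] = sample["Embarked"].fillna(embarked_mode)
sample["Age"] = sample["Age"].fillna(age_mean)
print(sample)
```

After the two fillna() calls, neither column should contain any null values.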
When working with categorical values, it’s a must to understand how to map categorical variables to numeric codes.
Luckily, the scikit-learn library has the “LabelEncoder” class to help us take care of that.
For the above example, I would like to convert the values in the Sex and Embarked columns into numeric values. Let’s check it out.
from sklearn import preprocessing
# fit_transform learns the categories and encodes them in one step
le = preprocessing.LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])
le = preprocessing.LabelEncoder()
data['Embarked'] = le.fit_transform(data['Embarked'])
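It helps to know which number each category ended up with. Here is a small sketch, again on made-up data, that encodes a column and then reads the mapping back from the fitted encoder. Note that the encoder has to be fitted before it can transform anything, which is why the sketch uses fit_transform. The names `sample` and `mapping` are my own.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical column of categorical values.
sample = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

le = preprocessing.LabelEncoder()
# fit_transform learns the distinct labels and encodes them in one step.
sample["Sex"] = le.fit_transform(sample["Sex"])

# classes_ holds the original labels in sorted order;
# a label's position in that array is its numeric code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(sample["Sex"].tolist())
```

Because the labels are sorted before codes are assigned, ‘female’ gets 0 and ‘male’ gets 1 here.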
There are situations where we don’t need certain columns as a part of the data frame and that is why we should know how to remove them.
The data.drop() function lets us drop columns that are of no use to us. Let’s remove the “Ticket”, “Cabin”, and “Name” columns.
The “axis=1” tells pandas we are removing columns, and “inplace=True” applies the change to the data frame itself instead of returning a modified copy.
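A minimal sketch of that drop call, run on a made-up frame that has the same three columns plus one we want to keep (the name `sample` and the values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with three unwanted columns and one useful one.
sample = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "Ticket": ["T1", "T2"],
    "Cabin": ["C85", None],
    "Fare": [7.25, 71.28],
})

# axis=1 targets columns; inplace=True modifies `sample` directly
# instead of returning a new data frame.
sample.drop(["Ticket", "Cabin", "Name"], axis=1, inplace=True)
print(sample.columns.tolist())
```

If you would rather not modify the frame in place, `sample = sample.drop(columns=[...])` is an equivalent way to write it.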
The data.corr() function computes the pairwise correlation between the variables in the dataset, which helps us pick variables for the machine learning part of the project. Let’s give it a try.
The above table is extremely handy for us to understand which variables are highly correlated with each other.
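As a rough sketch of how the correlation table can be used, the snippet below builds it for a small made-up frame and then pulls out the column for one variable of interest. The names `sample`, `corr_matrix`, and `survived_corr` are mine, and the numbers are invented.

```python
import pandas as pd

# Hypothetical numeric data using Titanic-style column names.
sample = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Fare": [80.0, 30.0, 10.0, 8.0],
    "Survived": [1, 1, 0, 0],
})

# Pairwise (Pearson, by default) correlation between all columns.
corr_matrix = sample.corr()

# Correlation of every column with Survived, strongest positive first.
survived_corr = corr_matrix["Survived"].sort_values(ascending=False)
print(survived_corr)
```

All columns here are numeric. In recent pandas versions corr() complains about text columns, so encoding or dropping them first (or passing numeric_only=True) may be needed.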
6) data['column name'].value_counts()
Now that our data has been cleaned and processed, I would like to move towards analyzing the different columns of our data frame.
The ‘Survived’ column labels the people who survived as 1 and the people who did not survive as 0.
The above function gives us the count of each unique value present in a column. Let’s give it a try.
Implementing the above command gives us:
We can implement the above command for all the columns to analyze the data frame.
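Here is a small sketch of value_counts() on a made-up ‘Survived’ column. The second call uses the normalize=True option, which returns shares instead of raw counts; `sample`, `counts`, and `shares` are names chosen for illustration.

```python
import pandas as pd

# Hypothetical outcome column: 0 = did not survive, 1 = survived.
sample = pd.DataFrame({"Survived": [0, 1, 0, 0, 1]})

# Number of rows for each unique value, most frequent first.
counts = sample["Survived"].value_counts()

# normalize=True returns the share of each value instead of the count.
shares = sample["Survived"].value_counts(normalize=True)

print(counts)
print(shares)
```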
groupby is an extremely handy function that splits the data into groups and lets us compute an aggregate, such as the mean or the sum, for each group. For example:
Let us group the entire data frame by the Survived column and find the mean.
The result we get is:
Looking at the groupby table, we can see that individuals who did not survive were slightly older than the ones who did survive, and that they paid a much lower fare on average, which points to a difference in class. The groupby function can be applied to any column of our choice to better understand the data.
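For reference, a minimal sketch of a group-and-average step on made-up numbers follows. The values are invented and are not the Titanic results; `sample` and `group_means` are hypothetical names.

```python
import pandas as pd

# Hypothetical data: two passengers per outcome.
sample = pd.DataFrame({
    "Survived": [0, 0, 1, 1],
    "Age": [30.0, 40.0, 20.0, 30.0],
    "Fare": [8.0, 10.0, 50.0, 70.0],
})

# Split the rows by the value of Survived, then average
# every remaining column within each group.
group_means = sample.groupby("Survived").mean()
print(group_means)
```

The result has one row per value of Survived and one column per averaged variable, so a single cell can be read with `group_means.loc[0, "Age"]`.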
I went over 7 basic Pandas functions that are extremely handy when it comes to preprocessing data. There are more commands than the ones I just mentioned, and I will hopefully go over them in some of my next articles.