Original article was published on Artificial Intelligence on Medium
Everything You Need to Know About “loc” and “iloc” of Pandas
Clearly distinguish loc and iloc
Pandas, one of the most widely used data analysis and manipulation libraries, provides many flexible and convenient functions that ease and speed up the data analysis process. In this post, I will cover two important tools for selecting data from a dataframe based on specified rows and columns. Let’s introduce them first and then build up a thorough understanding with different kinds of examples.
- loc: select by labels of rows and columns
- iloc: select by positions of rows and columns
The distinction becomes clear as we go through examples. As always, we start with importing numpy and pandas.
import pandas as pd
import numpy as np
We will run the examples on the Telco Customer Churn dataset available on Kaggle. Let’s read the dataset into a pandas dataframe.
df = pd.read_csv("Projects/churn_prediction/Telco-Customer-Churn.csv")
The dataset includes 21 columns, but when the dataframe is displayed we only see the ones that fit on the screen.
loc selects data by label. Column labels are the column names; for example, customerID, gender, and SeniorCitizen are the first three column names (i.e. labels). Row labels need more care. Since we did not assign a specific index, pandas created an integer index by default, so the row labels are integers starting from 0. The row positions used with iloc are also integers starting from 0. The examples show how pandas handles rows differently with loc and iloc.
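To make the idea of labels concrete, here is a minimal sketch that inspects the row and column labels. It assumes a tiny made-up frame with the first three column names of the dataset instead of the real CSV.

```python
import pandas as pd

# Toy stand-in for the first three columns of the churn data (values are made up).
df = pd.DataFrame({
    "customerID": ["C000", "C001", "C002"],
    "gender": ["Female", "Male", "Female"],
    "SeniorCitizen": [0, 0, 1],
})

row_labels = list(df.index)       # default integer labels created by pandas
column_labels = list(df.columns)  # column names serve as column labels

print(row_labels)
print(column_labels)
```

With the default index, row labels and row positions happen to coincide, which is why the difference between loc and iloc only shows up once rows are filtered or reordered.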
- Select row “2” and column “gender”
This returns the value in the “gender” column of the row labeled 2.
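A minimal sketch of this selection, assuming a small made-up frame in place of the real dataset:

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": ["C000", "C001", "C002", "C003"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# Row label 2, column label "gender": a single scalar comes back.
value = df.loc[2, "gender"]
print(value)
```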
- Select the row labels up to ‘5’ and columns “gender” and “Partner”
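A sketch of this selection on a made-up frame (the values are assumptions, only the column names come from the dataset). Note that a label slice with loc includes its end point, so labels 0 through 5 give six rows.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(8)],
    "gender": ["Female", "Male"] * 4,
    "Partner": ["Yes", "No"] * 4,
})

# Label slice ":5" is inclusive of 5; the columns are given as a list of names.
subset = df.loc[:5, ["gender", "Partner"]]
print(subset)
```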
- Select row labels “2”, “4”, “5” and “InternetService” column
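A sketch of selecting specific row labels with a list, again on a made-up frame:

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(6)],
    "InternetService": ["DSL", "Fiber optic", "No"] * 2,
})

# A list of row labels plus a single column label returns a Series.
selected = df.loc[[2, 4, 5], "InternetService"]
print(selected)
```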
We can also filter the dataframe and then apply loc or iloc
- Select rows with labels up to “10” and the “InternetService” and “PhoneService” columns for customers with a partner (Partner == ‘Yes’)
Filtering the dataframe does not change the index, so the index of the resulting dataframe contains only the labels of the rows that were kept. When we use loc[:10] on it, we get the remaining rows whose labels are at most 10, which may be fewer than 11 rows. On the other hand, if we use iloc[:10] after applying the filter, we get exactly 10 rows (as long as at least 10 remain), because iloc selects by position regardless of the labels.
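The difference can be sketched as follows. The frame is a made-up stand-in in which every other customer has a partner, so the filtered index keeps only the even labels.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(30)],
    "Partner": ["Yes", "No"] * 15,
    "PhoneService": ["Yes", "Yes", "No"] * 10,
    "InternetService": ["DSL", "Fiber optic", "No"] * 10,
})

with_partner = df[df["Partner"] == "Yes"]  # index keeps the original labels

# loc: rows whose labels are <= 10 (only the labels that survived the filter).
by_label = with_partner.loc[:10, ["InternetService", "PhoneService"]]

# iloc: the first 10 rows by position, whatever their labels are.
by_position = with_partner.iloc[:10]

print(list(by_label.index))
print(list(by_position.index))
```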
Note that the way we select columns changes as well: iloc expects the positions of the columns rather than their names.
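A sketch of the iloc version with column positions. Because the toy frame below has fewer columns than the real dataset, the positions are looked up with `df.columns.get_loc` instead of being hard-coded; that lookup is a convenience added here, not part of the original example.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(30)],
    "Partner": ["Yes", "No"] * 15,
    "PhoneService": ["Yes", "Yes", "No"] * 10,
    "InternetService": ["DSL", "Fiber optic", "No"] * 10,
})

# Translate column names into integer positions for iloc.
col_positions = [df.columns.get_loc(name)
                 for name in ["InternetService", "PhoneService"]]

# First 10 rows by position of the filtered frame, columns by position.
subset = df[df["Partner"] == "Yes"].iloc[:10, col_positions]
print(subset)
```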
- Select the first 5 rows and first 5 columns
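A sketch on a made-up six-column frame. Unlike loc, a position slice with iloc excludes its end point, so `:5` covers positions 0 through 4.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(10)],
    "gender": ["Female", "Male"] * 5,
    "SeniorCitizen": [0, 1] * 5,
    "Partner": ["Yes", "No"] * 5,
    "PhoneService": ["Yes"] * 10,
    "InternetService": ["DSL", "No"] * 5,
})

# Positions 0..4 on both axes.
top_left = df.iloc[:5, :5]
print(top_left)
```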
- Select the last 5 rows and last 5 columns.
Positions count from 0 at the beginning. Counting from the end starts at -1, so we use the slice “-5:” to select the last five.
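A sketch of the negative slice on a made-up six-column, ten-row frame:

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(10)],
    "gender": ["Female", "Male"] * 5,
    "SeniorCitizen": [0, 1] * 5,
    "Partner": ["Yes", "No"] * 5,
    "PhoneService": ["Yes"] * 10,
    "InternetService": ["DSL", "No"] * 5,
})

# "-5:" starts five positions before the end and runs to the end.
bottom_right = df.iloc[-5:, -5:]
print(bottom_right)
```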
We can also apply lambda functions.
- Select every third row up to the 15th row and show only the “Partner” and “InternetService” columns.
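One possible way to write this with a lambda is sketched below. It assumes the default integer index and a made-up frame; the lambda receives the dataframe and returns a boolean mask built from its index.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(20)],
    "Partner": ["Yes", "No"] * 10,
    "InternetService": ["DSL", "Fiber optic"] * 10,
})

# Keep labels divisible by 3 that do not exceed 15, then pick two columns.
every_third = df.loc[
    lambda x: (x.index % 3 == 0) & (x.index <= 15),
    ["Partner", "InternetService"],
]
print(every_third)
```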
We can select positions or labels in between.
- Select the row positions between 20 and 25, column positions between 4 and 6.
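A sketch of a slice that starts and ends in the middle, on a made-up frame. Since iloc excludes the end of a slice, `20:25` returns positions 20 through 24 and `4:6` returns positions 4 and 5.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(30)],
    "gender": ["Female", "Male"] * 15,
    "SeniorCitizen": [0, 0, 1] * 10,
    "Partner": ["Yes", "No"] * 15,
    "PhoneService": ["Yes", "Yes", "No"] * 10,
    "InternetService": ["DSL", "Fiber optic", "No"] * 10,
})

# Start positions are included, end positions are excluded.
middle = df.iloc[20:25, 4:6]
print(middle)
```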
If you try to pass labels to iloc, pandas is kind enough to raise an informative error:
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
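A sketch that triggers and captures this error on a made-up frame. The exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": ["C000", "C001", "C002"],
    "gender": ["Female", "Male", "Female"],
})

message = None
try:
    df.iloc[2, "gender"]  # a column label where iloc expects a position
except ValueError as err:
    message = str(err)

print(message)
```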
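loc raises an error as well when we pass a column position such as 1, in this case a KeyError, because no column carries that label. Integer row positions do not fail here, since the default index makes them valid row labels.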
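The loc case can be sketched as follows, assuming a made-up frame with the default integer index and string column names. An integer is accepted for the rows because it matches a row label, while an integer for the columns matches no column label.

```python
import pandas as pd

# Toy stand-in for the churn data (values are made up for illustration).
df = pd.DataFrame({
    "customerID": ["C000", "C001", "C002"],
    "gender": ["Female", "Male", "Female"],
})

# Works: 2 is a valid row label under the default integer index.
row = df.loc[2]

raised_key_error = False
try:
    df.loc[2, 1]  # 1 is a column position, not a column label
except KeyError:
    raised_key_error = True

print(row["gender"], raised_key_error)
```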