Pandas — quick reference for the busy data scientist

Source: Deep Learning on Medium

Pandas — quick reference for the busy data scientist

( Pandas snippets that I find useful for day to day tasks )

I want this guide to be a no-nonsense and quick reference sheet to refresh your data munging skills in pandas. So let’s dive into pandas directly.

Note: You can play around with any dataset you like.

Quick Pandas Vocabulary:

dataframe: Two-dimensional structure of data similar to a spreadsheet

series: One-dimensional indexed array of fixed data type

First, let’s load pandas

import pandas as pd
url = “https://raw.githubusercontent.com/jokecamp/FootballData/master/Germany/Bundesliga/1964/matches.csv"
  1. You can load data from the internet without downloading
data = pd.read_csv(url)
data.head()
reading data from url

(Note: tail() command picks rows from tail, also you can pass numerical parameters to head and tail commands)

Some of the exploratory data analysis(EDA) commands you can try with pandas:

# print column names
data.columns
# print general information about dataframedata.info()# STATISTICS of non-numerical fields (explicitely indicate data #types of interest in include parameterdata.describe(include=[‘object’, ‘bool’])#(Also don’t forget to try with only describe())# For categorical(type object) and boolean(type bool) features, we #can use value_counts method to count the number of values.data[‘home_team’].value_counts()# If you want to calculate porportions try this:data[‘home_team’].value_counts(normalize=True)# Get min and max value from columndata[‘round’].min()data[‘round’].max()# Calculate mean of a columndata[‘round’].mean()
  1. DataFrame Indexing

DataFrames can be indexed by column name(label), row name(index), or by a serial number of the row.

loc: Indexing by name

# Get 5 rows from home_score column to away_team column
data.loc[0:5, ‘home_score’:’away_team’]

iloc: Indexing by number

# Same thing as above but using numbers
data.iloc[0:5, 2:5]

2. Changing the value in column

data[‘away_score’] = data[‘away_score’].apply(lambda x: x*2)ORdata[‘away_score’] = data[‘away_score’].str.split(“ “)
double the away_score column

3. Adding new column

data[‘new_coumn’] = [i+1000 for i in range(len(data))]

(Note: Using arrays to add a new column is not the only way, but you can also use pandas’ Series object to create a column in dataframe.)

4. Apply function to multiple columns

data[[‘away_score’,’new_coumn’]] = data[[‘away_score’,’new_coumn’]].apply(lambda x: x*2)
double two columns

(Note: You just need to pass the list of columns you want to modify.)

5. Count rows with NaN

see how many rows are null

6. Selecting NaN rows and doing some operation

select rows with notes column having Nan

Important: Filling NaN rows is separate topic and out of the scope of this article. There are many ways to fill missing values for example filling with mean, median, highest frequency etc. But if you want to fill with some constant values, then it is pretty straightforward:

data.fillna(0)

7. Selecting non-NaN rows and doing some operation

data[data[‘notes’].notnull()]
select rows with note column with non-nan values

8. Filling NaN rows

data.fillna(value = “Replace NaN”, inplace = True)
replace nan values with “Replace Nan” string

9. Append rows to the data frame

data = data.append({ “round”: 5, “home_team”: “Some name”, “home_score”: 0, “away_score”: 50, “away_team”: “some team”, “notes”: “nothing”, “new_coumn”: “Nothing”}, ignore_index=True)data.tail(5)
append row to dataframe

10. Sort and Reindex data frame

data = data.sort_values([‘round’], ascending=False)# If you want to sort by multiple columns, just pass list of columnsdata = data.sort_values(by=[‘round’, ‘away_score’], ascending=[1,0])data.head()
sort round column in descending order

11. Update columns

data.index = range(1, len(data)+1)data.head()
reindex dataframe after sorting

12. Chaining conditions using bitwise operations

data[(data[‘notes’] == “Replace NaN”) & (data[‘round’] > 25)]Note: Use | for OR conidtion
applying and operation on dataframe’s columns

13. Printing column types

(Note: For strings it is using ‘object’ data types)

14. Select by column type

import numpy as np
data.loc[:, (data.dtypes == np.dtype(‘O’)).values].head()
# Note: ‘O’ means object type. if you want floats use ‘float64’ #instead.
selecting object data types

15. Converting column types

data[‘away_score’] = data[‘away_score’].astype(float)

16. Condition test

mydata = [ [1,2,3], [4,5,6] ]
df = pd.DataFrame(mydata)
df < 2
if test on dataframe

17. Iterating rows on dataframe

Note: I don’t recommend you to loop and do operation on rows of dataframe, but if there is a case you have to iterate through, here is the technique:

for index, row in data.iterrows():
name = row[‘name’]
age = row[‘age’]
# do some operation

18. Exporting dataframe to CSV

If you want to share your result with someone in your team do ahead and export the results in CSV, import it into google sheets and share them:

data.to_csv(‘result_file.csv’)

This article is a quick reference for brushing up your concepts for playing with panda’s dataframe. The above snippets might not be always efficient for performing the required task depending upon the number of datasets you have. So you should always try to use vectorization techniques wherever possible.

If you like my article, don’t forget to follow me on medium, or connect me on linkedin, or follow me on twitter.