Getting started with Pandas

Original article was published on Artificial Intelligence on Medium

A brief introduction to Pandas using an Olympics dataset

Photo by Vytautas Dranginis on Unsplash

This is the first article in a series offering a basic introduction to data science using Pandas on an Olympics dataset. The series is based on a project from the “Introduction to Data Science” course that I took at the University of Strasbourg in France. All credit and thanks go to the tutors and professors who made learning this subject possible.

The raw dataset can be accessed as an Excel file on my GitHub page, where the corresponding Jupyter Notebook can also be found.

In this first article we will focus on importing the library, reading the data and displaying it, as well as setting an index column.

1- Necessary imports

As in any Python script, we start by importing the libraries the task requires. In this very simple introduction, we only need to import the Pandas library, one of the most powerful tools in data science:

# Necessary imports
import pandas as pd

Importing pandas under the widely used alias “pd” simply shortens later commands. For example, instead of writing “pandas.read_excel” to read an Excel file, one only writes “pd.read_excel”. Time is precious 🙂
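To make the point about the alias concrete, here is a small sketch showing that “pd” is just another name for the same module. The two tiny frames are made up for illustration and are not part of the Olympics data.

```python
import pandas
import pandas as pd

# "pd" is only an alias: both names refer to the same module object
same_module = pd is pandas

# So the two spellings below build identical DataFrames (hypothetical values)
demo_long = pandas.DataFrame({"ID": [1, 2], "Name": ["Athlete A", "Athlete B"]})
demo_short = pd.DataFrame({"ID": [1, 2], "Name": ["Athlete A", "Athlete B"]})

print(same_module)
print(demo_long.equals(demo_short))
```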

2- Reading and displaying data

Now that we have imported the required library, we would like to read our file and display its first few rows. Let’s call our data “raw_data”.

# Reading and displaying the data
raw_data = pd.read_excel('JO.xls')
raw_data.head()

Sample output:

First five rows of the dataset
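For readers who do not have the Excel file at hand, the behaviour of head() can be tried on a small in-memory table. This is only a sketch: the column names ID, Name and Medal come from the article, but every value in “sample” is made up.

```python
import pandas as pd

# Hypothetical stand-in for the Olympics file: only the column names
# ID, Name and Medal come from the article; every value is made up.
sample = pd.DataFrame({
    "ID": [1, 1, 2, 3, 4, 5, 6],
    "Name": ["Athlete A", "Athlete A", "Athlete B", "Athlete C",
             "Athlete D", "Athlete E", "Athlete F"],
    "Medal": ["Gold", None, "Silver", None, "Bronze", None, None],
})

first_five = sample.head()    # with no argument, head() returns the first 5 rows
first_three = sample.head(3)  # an explicit row count can be passed instead

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names
print(first_five)
```

Checking shape and columns right after loading is a quick way to confirm that the file was read as expected before going any further.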

3- Setting an index for the data

Once the data has been read and displayed, an important thing to figure out is how to index it. When printing the first five rows of the dataset, we notice that the leftmost column is an index: by default, Pandas simply numbers the rows. The problem is that an athlete who appears in more than one row gets a different index value in each of them.

Looking at the rest of the columns, the only one that would be specific to each athlete is the “ID” column. We will set the index to this column and check that it really is specific to each athlete.

# Setting the index to ID
raw_data.index = raw_data['ID']

# Checking that Usain Bolt, for example, has a unique ID
usain_bolt = raw_data[raw_data['Name'].str.contains("Usain", na=False)]
usain_bolt[['ID','Name','Medal']]

# Looking for the athlete with the unique ID 23
athlete23 = raw_data[raw_data.index == 23]
athlete23.head()

Outputs:

Usain Bolt ID 13029
Fritz Aanes ID 23
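The same indexing step can also be written with set_index, and rows can then be selected by label with .loc. The sketch below uses a made-up table named “toy” (only the column names come from the article) and is one possible alternative, not necessarily the approach used in the notebook.

```python
import pandas as pd

# Made-up rows; only the column names ID, Name and Medal come from the article
toy = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Name": ["Athlete A", "Athlete A", "Athlete B", "Athlete C"],
    "Medal": ["Gold", None, "Silver", None],
})

# drop=False keeps "ID" as a regular column too, mirroring
# raw_data.index = raw_data['ID'] from the article
indexed = toy.set_index("ID", drop=False)

# .loc with a list of labels always returns a DataFrame,
# whether the label matches one row or several
athlete_1 = indexed.loc[[1]]   # both rows belonging to ID 1
athlete_2 = indexed.loc[[2]]   # the single row belonging to ID 2

print(athlete_1[["Name", "Medal"]])
```

Because one athlete can appear in several rows, the ID index is not unique per row, and a lookup such as indexed.loc[[1]] returns every row for that athlete.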

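By checking in both directions (from a name to its ID, and from an ID to its name), we have good reason to believe that each athlete has one and only one ID, at least for these two examples.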
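The two spot checks above cover only two athletes. A check over the whole table could be sketched as follows, again on a made-up frame named “toy” in which only the column names ID and Name come from the article.

```python
import pandas as pd

# Made-up rows standing in for the Olympics data
toy = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Name": ["Athlete A", "Athlete A", "Athlete B", "Athlete C"],
})

# Number of distinct names attached to each ID (expected: 1 everywhere)
names_per_id = toy.groupby("ID")["Name"].nunique()

# Number of distinct IDs attached to each name (expected: 1 everywhere,
# unless two different athletes happen to share a name)
ids_per_name = toy.groupby("Name")["ID"].nunique()

one_name_per_id = bool((names_per_id == 1).all())
one_id_per_name = bool((ids_per_name == 1).all())

print(one_name_per_id, one_id_per_name)
```

On the real data, where “ID” has been set as the index while also remaining a column, grouping by the label "ID" may be reported as ambiguous. Running the check before setting the index, or on raw_data.reset_index(drop=True), should avoid that.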

This brings us to the end of the first article in this series introducing Pandas. In the next article, we will deal with duplicates and missing values in our raw dataset.

Thank you for reading and stay tuned for more!