Original article was published on Artificial Intelligence on Medium
Getting started with Pandas
A brief introduction to Pandas using an Olympics dataset
This is the first article in a basic introduction to data science using Pandas on an Olympics dataset. The series is based on a project from the “Introduction to Data Science” course that I took at the University of Strasbourg in France. All credit and thanks go to the tutors and professors who made learning this subject possible.
In this first article we will focus on importing the library, reading the data and displaying it, as well as setting an index column.
1- Necessary imports
As in any Python code, we start with the imports required for the task at hand. In this very simple introduction, we only need the Pandas library, one of the most powerful tools in data science:
# Necessary imports
import pandas as pd
Importing pandas under the widely used abbreviation “pd” simply shortens later commands. For example, instead of writing “pandas.read_excel” to read an Excel file, one only writes “pd.read_excel”. Time is precious 🙂
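As a small illustration of the point above, the alias and the full name refer to the same module object, so either spelling reaches the same function. This is only a sketch to show the equivalence; the variable names `same_module` and `same_function` are made up for the example.

```python
import pandas
import pandas as pd

# The alias is just a second name bound to the same module object
same_module = pd is pandas

# Hence both spellings resolve to the same function
same_function = pd.read_excel is pandas.read_excel

print(same_module, same_function)
```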
2- Reading and displaying data
Now that we have imported the required library, we would like to read our file and display its first few rows. Let’s call our data “raw_data”.
# Reading and displaying the data
raw_data = pd.read_excel('JO.xls')
raw_data.head()
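Since the Excel file itself is not included here, the following sketch builds a small stand-in table in memory to show what `head()` does. The column names `ID`, `Name` and `Medal` are the ones used later in this article; the rows and the names `toy_data` and `preview` are made up for illustration. Note that reading an old-style `.xls` file with `pd.read_excel` typically requires the `xlrd` package to be installed.

```python
import pandas as pd

# Made-up stand-in for the real file: seven rows, three of the columns
toy_data = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4, 5, 5],
    "Name": ["Athlete A", "Athlete B", "Athlete B", "Athlete C",
             "Athlete D", "Athlete E", "Athlete E"],
    "Medal": ["Gold", None, "Silver", None, "Bronze", "Gold", None],
})

# head() returns the first five rows by default; head(n) returns the first n
preview = toy_data.head()
print(preview)

# shape gives (number of rows, number of columns) of the full table
print(toy_data.shape)
```

The left-most column in the printed preview is the default index, which simply numbers the rows starting from 0.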
3- Setting an index for the data
Once the data has been read and displayed, an important thing to figure out is how to index it. When printing the first five lines of the dataset, we notice that the column on the left is an index. However, this default index simply numbers the rows, so an athlete whose name appears more than once gets a different index value for each row.
When looking at the rest of the columns, the only one that would be specific to each athlete is the “ID” column. We will set the index to this column and check that it really is specific to each athlete.
# Setting the index to ID
raw_data.index = raw_data['ID']
# Checking that Usain Bolt for example has a unique ID
usain_bolt = raw_data[raw_data['Name'].str.contains("Usain", na=False)]
usain_bolt[['ID','Name','Medal']]
# Looking for the athlete with the unique ID 23
athlete23 = raw_data[raw_data.index == 23]
athlete23.head()
These two spot checks, from a name to its ID and from an ID to its name, suggest that each athlete has one and only one ID.
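The two checks above each look at a single athlete. A dataset-wide version could be sketched as follows, again on a made-up stand-in table rather than the real file (the rows and the names `toy_data`, `ids_per_name`, `names_per_id` and `indexed` are hypothetical). It counts how many distinct IDs each name has and how many distinct names each ID has. Keep in mind that two different athletes who happen to share a name would legitimately show up with two IDs in the first count.

```python
import pandas as pd

# Made-up stand-in: an athlete appears once per event, so IDs repeat
toy_data = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3],
    "Name": ["Athlete A", "Athlete B", "Athlete B", "Athlete C", "Athlete C"],
    "Medal": ["Gold", None, "Silver", "Bronze", None],
})

# Distinct IDs per name and distinct names per ID, computed on the plain table
ids_per_name = toy_data.groupby("Name")["ID"].nunique()
names_per_id = toy_data.groupby("ID")["Name"].nunique()

one_id_per_name = bool((ids_per_name == 1).all())
one_name_per_id = bool((names_per_id == 1).all())
print(one_id_per_name, one_name_per_id)

# set_index moves ID out of the columns and into the index
# (the assignment used in the article keeps ID as a column as well)
indexed = toy_data.set_index("ID")

# .loc with a list of labels returns every row carrying that label
rows_for_2 = indexed.loc[[2]]
print(rows_for_2)

# The index identifies athletes, not rows, so its values can repeat
print(indexed.index.is_unique)
```

With ID as the index, label-based selection through `.loc` is an alternative to the boolean mask `raw_data.index == 23` used earlier.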
This brings us to the end of the first article in this introductory series on Pandas. In the next article, we will deal with duplicates and missing values in our raw dataset.
Thank you for reading and stay tuned for more!