Exploratory data analysis of Trending YouTube Video Statistics in France

Original article was published on Artificial Intelligence on Medium


What is Exploratory Data Analysis?

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. This step is especially important before we model the data with machine learning. Typical EDA plots include histograms, box plots, scatter plots, and many more. Exploring the data often takes a lot of time, but the process helps us define the problem statement for our data set, which is very important.

Data we are exploring

YouTube has facilitated engagement between institutions and individuals, such as between universities and prospective students, and between businesses and employees. Also, some YouTube videos increase awareness of social issues, allow broadened social contact, and overcome stereotypes of minorities and minority viewpoints.

This dataset is a daily record of the top trending YouTube videos, taken from the Kaggle dataset Trending YouTube Video Statistics. It includes several months (and counting) of data on daily trending YouTube videos. Data is included for the US, GB, DE, CA, and FR regions (USA, Great Britain, Germany, Canada, and France, respectively), with up to 200 listed trending videos per day.

Let’s get started!

1. Importing the required libraries for EDA

Below are the libraries used to perform EDA in this tutorial. The complete code can be found on my GitHub.

# Importing required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
from datetime import datetime
from matplotlib import cm

2. Loading the data into the data frame

Loading the data into a pandas DataFrame is certainly one of the most important steps in EDA. I chose to work in a Kaggle kernel, which you can find online, on the France dataset, so let’s go!

youtube = pd.read_csv("../input/youtube-new/FRvideos.csv")

If we look at the trending_date and publish_time columns, we see that they are not yet stored as datetime data, so we convert them.

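# trending_date is stored as year.day.month with a two-digit year
youtube['trending_date'] = pd.to_datetime(youtube['trending_date'], format='%y.%d.%m')
# publish_time is a timestamp with fractional seconds and a trailing Z
youtube['publish_time'] = pd.to_datetime(youtube['publish_time'], format='%Y-%m-%dT%H:%M:%S.%fZ')
# category_id is a nominal label, so keep it as a string
youtube['category_id'] = youtube['category_id'].astype(str)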
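
To see what the two format strings do, here is a minimal sketch on a tiny hand-made frame. The sample values are hypothetical and only imitate the raw string layout implied by the formats above; they are not rows from the Kaggle file.

```python
import pandas as pd

# Hypothetical sample rows that imitate the raw string layout of the two columns
sample = pd.DataFrame({
    'trending_date': ['17.14.11', '18.01.02'],
    'publish_time': ['2017-11-13T17:32:55.000Z', '2018-01-29T08:00:00.000Z'],
})

# %y.%d.%m reads a two-digit year, then the day, then the month
sample['trending_date'] = pd.to_datetime(sample['trending_date'], format='%y.%d.%m')
# The trailing Z in the format is matched as a literal character
sample['publish_time'] = pd.to_datetime(sample['publish_time'], format='%Y-%m-%dT%H:%M:%S.%fZ')

print(sample.dtypes)
print(sample)
```

Passing an explicit format avoids any guessing about whether the day or the month comes first.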

Let’s check out the head of our data:

# To display the top rows
youtube.head()


The info() method reports the index range, the columns, the number of non-null values in each column, the data types, and the memory usage.
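
As a minimal sketch of that report, assuming a small hypothetical frame in place of the real data, info() can be called like this. The buf argument is used here only so the text can be stored in a variable; calling info() with no arguments prints the same report directly.

```python
import io
import pandas as pd

# Hypothetical miniature frame with one missing value in the likes column
sample = pd.DataFrame({
    'title': ['video a', 'video b', 'video c'],
    'views': [1200, 560, 98000],
    'likes': [150.0, None, 4300.0],
})

# info() prints to stdout by default; passing buf captures the report as text
buffer = io.StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

On the real data the equivalent call would simply be youtube.info().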


The describe() method calculates summary statistics such as percentiles, mean, and standard deviation for the numerical values of a Series or DataFrame. It can analyze both numeric and object Series, as well as DataFrame column sets of mixed data types.
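
A short sketch of both behaviours, on a hypothetical four-row frame rather than the real data: by default describe() summarizes only the numeric columns, while include='all' adds count, unique, top, and freq for the text columns.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and one text column
sample = pd.DataFrame({
    'views': [100, 200, 300, 400],
    'category': ['Music', 'Music', 'Sports', 'Comedy'],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = sample.describe()

# All columns: text columns additionally get unique, top, and freq
full_summary = sample.describe(include='all')

print(numeric_summary)
print(full_summary)
```

On the real data the equivalent call would be youtube.describe().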


Changing the types to a uniform format:

Pandas does not always infer an appropriate data type for every column. For example, views, likes, and similar columns are whole-number counts, so they only need an int data type rather than float, while category_id, a nominal attribute, should not carry an int data type.

It is important that we assign their data types appropriately ourselves.

type_int_list = ['views', 'likes', 'dislikes', 'comment_count']
for column in type_int_list:
    youtube[column] = youtube[column].astype(int)
type_str_list = ['category_id']
for column in type_str_list:
    youtube[column] = youtube[column].astype(str)
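
The effect of these casts can be checked on a small example. This is a sketch with hypothetical values, where the count columns start out as floats and category_id starts out as an integer.

```python
import pandas as pd

# Hypothetical frame: counts stored as floats, category_id stored as an integer
sample = pd.DataFrame({
    'views': [1200.0, 560.0],
    'likes': [150.0, 12.0],
    'category_id': [10, 24],
})

# Counts become integers
for column in ['views', 'likes']:
    sample[column] = sample[column].astype(int)

# The nominal id becomes a string label
sample['category_id'] = sample['category_id'].astype(str)

print(sample.dtypes)
```

Note that astype(int) raises an error if the column contains missing values, so this cast assumes the count columns are complete.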

Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.

The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the five regions in the dataset.

# creates a dictionary that maps `category_id` to `category`
id_to_category = {}
with open('../input/youtube-new/FR_category_id.json', 'r') as f:
    data = json.load(f)
    for category in data['items']:
        id_to_category[category['id']] = category['snippet']['title']
youtube.insert(4, 'category', youtube['category_id'].map(id_to_category))
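
The mapping step can be tried without the Kaggle files. The sketch below uses a hypothetical two-entry JSON string with the same shape the code above reads (items, then id and snippet title); the entries are illustrative, not a copy of FR_category_id.json. It assumes the ids in the JSON are strings, which is why category_id is cast to str before mapping.

```python
import json
import pandas as pd

# Hypothetical excerpt in the shape the code above reads: items -> id, snippet -> title
raw = (
    '{"items": ['
    '{"id": "10", "snippet": {"title": "Music"}},'
    '{"id": "24", "snippet": {"title": "Entertainment"}}'
    ']}'
)
data = json.loads(raw)
id_to_category = {item['id']: item['snippet']['title'] for item in data['items']}

sample = pd.DataFrame({'category_id': [10, 24, 99]})
# The ids in this JSON are strings, so integer ids must be cast before mapping
sample['category'] = sample['category_id'].astype(str).map(id_to_category)

print(sample)
```

Any id that is missing from the dictionary, such as 99 here, comes back as NaN, which makes unmapped categories easy to spot.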

Correlation analysis and heatmap

Visualize correlation between views and likes

plt.scatter(youtube['views'], youtube['likes'])
plt.title('Correlation between views and likes')

Correlation between views and likes

# Correlation matrix of the numeric engagement columns
keep_columns = ['views', 'likes', 'dislikes', 'comment_count']
corr_matrix = youtube[keep_columns].corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f")
plt.title('Correlation between views, likes, dislikes, and comment count')
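
To make the numbers in the heatmap easier to read, here is a minimal sketch of what corr() returns, using hypothetical values chosen so the result is obvious. By default corr() computes the Pearson correlation, which ranges from -1 to 1.

```python
import pandas as pd

# Hypothetical values chosen so the relationships are exact
sample = pd.DataFrame({
    'views': [100, 200, 300, 400],
    'likes': [10, 20, 30, 40],   # exactly proportional to views
    'dislikes': [4, 3, 2, 1],    # falls linearly as views rise
})

# Pairwise Pearson correlation between all numeric columns
corr_matrix = sample.corr()
print(corr_matrix)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.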

What are the trending dates?

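# Make sure publish_time is a datetime column (no change if it was already converted)
youtube['publish_time'] = pd.to_datetime(youtube['publish_time'])
# Whole days between publication and the trending date
youtube['diff'] = (youtube['trending_date'] - youtube['publish_time']).dt.days
# Plot views against the trending date
youtube[['trending_date', 'views']].set_index('trending_date').plot()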
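
The diff column is easier to understand on a couple of rows. This is a sketch with hypothetical dates: subtracting two datetime columns gives a timedelta, and .dt.days keeps only the whole-day component.

```python
import pandas as pd

# Hypothetical publication and trending timestamps
sample = pd.DataFrame({
    'trending_date': pd.to_datetime(['2017-11-14', '2018-02-01']),
    'publish_time': pd.to_datetime(['2017-11-13 17:32:55', '2018-01-29 08:00:00']),
})

# Timedelta between the two columns, reduced to whole days
sample['diff'] = (sample['trending_date'] - sample['publish_time']).dt.days

print(sample)
```

The first video trended less than a full day after it was published, so its diff is 0, while the second one took 2 whole days.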

Which Title has the most likes and views in 2018?

We have to extract the appropriate year from the date-time.

# Read the year directly from the datetime column instead of round-tripping through strings
youtube['publish_year'] = pd.to_datetime(youtube['publish_time']).dt.year
published_2018 = youtube[youtube['publish_year'] == 2018]
max_likes_2018 = published_2018['likes'].max()
max_views_2018 = published_2018['views'].max()
# Rows that hold the highest like count or the highest view count
most_liked = published_2018['likes'] == max_likes_2018
most_viewed = published_2018['views'] == max_views_2018
published_2018[most_liked | most_viewed]
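
An alternative way to find the same rows is idxmax(), which returns the index label of the maximum directly. The sketch below is not the approach used above, just a compact variant, and it runs on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical videos; the first one has the most likes but was published in 2017
sample = pd.DataFrame({
    'title': ['clip a', 'clip b', 'clip c', 'clip d'],
    'publish_time': pd.to_datetime(['2017-12-30', '2018-01-05', '2018-03-10', '2018-06-01']),
    'likes': [900, 150, 700, 300],
    'views': [5000, 8000, 2000, 9500],
})

# Keep only videos published in 2018
published_2018 = sample[sample['publish_time'].dt.year == 2018]

# idxmax() gives the index label of the row with the largest value
most_liked = published_2018.loc[published_2018['likes'].idxmax()]
most_viewed = published_2018.loc[published_2018['views'].idxmax()]

print(most_liked['title'], most_viewed['title'])
```

One difference to keep in mind: idxmax() returns only the first row when several rows tie for the maximum, whereas the boolean mask keeps all of them. On the trending data the same video can appear on several days, so the mask may return more than one row for it.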

So a BTS music video is the most liked video of 2018. Yeeey, go K-pop!


I was happy to share a very simple data exploration with you, in words simple enough that even people with no academic background can follow it.

I hope this clarified a little of the data exploration that comes before machine learning. I will be uploading many more explanations of algorithms, because why not 🙂



Zahra Elhamraoui

Thanks for reading and…