Data Science Projects

Source: Deep Learning on Medium

1. Stock Analysis

In this project, the data is fetched using a Yahoo API key. The data contains stock information for some top MNCs. The aim of the project is to provide a detailed analysis of these stocks along with visualizations.

Opening Prices of Each MNC

This plot shows that Tesla's stock opens at high prices.

This plot shows the distribution of trading volume for each company's stock.

After that, we calculated the total trade of each company as the opening price multiplied by the volume, and plotted it.

Total Trade Plot
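The total trade calculation can be sketched in pandas as below. This is a minimal illustration, not the project's actual code: the column names Open and Volume, the new column name Total Traded, and the toy values are all assumptions.

```python
import pandas as pd

# Toy prices and volumes; the values are made up for illustration only.
stock = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.5], "Volume": [1000, 1500, 800]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)


def add_total_traded(df):
    """Return a copy of df with a 'Total Traded' column (Open * Volume)."""
    out = df.copy()
    out["Total Traded"] = out["Open"] * out["Volume"]
    return out


traded = add_total_traded(stock)
print(traded)
```

Calling traded["Total Traded"].plot() would then draw the series, assuming matplotlib is installed.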

Then we calculated the moving averages of the opening prices over 50-day and 200-day windows.
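One common way to compute such averages is a rolling mean. The sketch below assumes the opening prices are held in a pandas Series indexed by date; the helper name add_moving_averages, the column names MA50 and MA200, and the synthetic prices are hypothetical.

```python
import numpy as np
import pandas as pd


def add_moving_averages(open_prices, windows=(50, 200)):
    """Build a DataFrame with the opening price and one rolling mean per window."""
    out = pd.DataFrame({"Open": open_prices})
    for w in windows:
        # The first w - 1 rows are NaN because the window is not yet full.
        out[f"MA{w}"] = open_prices.rolling(window=w).mean()
    return out


# Synthetic prices 1.0, 2.0, ..., 250.0 stand in for real market data.
prices = pd.Series(
    np.arange(1.0, 251.0),
    index=pd.date_range("2020-01-01", periods=250, freq="D"),
)
ma = add_moving_averages(prices)
print(ma.tail())
```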

We also plotted the daily opening prices of each company.

Then we calculated the returns and cumulative returns.

Returns
Cumulative Returns
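The post does not say which definition of returns was used. A minimal sketch, assuming simple daily percentage returns and cumulative returns expressed as the growth of one unit invested on the first day, could look like this (the prices are made up):

```python
import pandas as pd

# Made-up prices for illustration only.
prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)

# Daily return: (today's price / yesterday's price) - 1. The first value is NaN.
returns = prices.pct_change()

# Cumulative return: running product of (1 + daily return).
cumulative = (1 + returns).cumprod()

print(pd.DataFrame({"returns": returns, "cumulative": cumulative}))
```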

Project Link: https://github.com/harshbansal1999/Stock-Analysis

2. Credit Card Fraud Detection

For this project, refer to my story: https://medium.com/@bansalh944/credit-card-fraud-detection-c66d1399c0b7

3. Suicide Rate Prediction

We were provided with the dataset, most probably from Kaggle. It contains the number of suicides in different regions along with GDP, population, sex, generation, age and year.

country             27820 non-null  object
year                27820 non-null  int64
sex                 27820 non-null  object
age                 27820 non-null  object
suicides_no         27820 non-null  int64
population          27820 non-null  int64
suicides/100k pop   27820 non-null  float64
country-year        27820 non-null  object
HDI for year        8364 non-null   float64
gdp_for_year        27820 non-null  object
gdp_per_capita      27820 non-null  int64
generation          27820 non-null  object

The data cleaning step involved removing some columns and applying LabelEncoder to categorical columns such as generation.
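A minimal sketch of this cleaning step with scikit-learn's LabelEncoder is shown below. The post does not say which columns were removed, so dropping country-year here is only an assumption, and the rows are toy data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows using a few of the column names listed above; values are made up.
df = pd.DataFrame(
    {
        "country-year": ["X1990", "X1991", "Y1990", "Y1991"],
        "generation": ["Boomers", "Generation X", "Silent", "Boomers"],
        "suicides_no": [5, 7, 3, 4],
    }
)

# Assumption: country-year is dropped as an example of a removed column.
df = df.drop(columns=["country-year"])

# LabelEncoder maps each category to an integer, in sorted order of the labels.
encoder = LabelEncoder()
df["generation"] = encoder.fit_transform(df["generation"])

print(df)
print(list(encoder.classes_))
```

The fitted encoder keeps the original labels in encoder.classes_, so the integers can be mapped back with encoder.inverse_transform.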

Countplot of Generation Column
Boxplot of Generation and Suicides Number

We calculated the correlation matrix of the dataset to see how the explanatory variables are related to the response variable.
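As a rough illustration, a correlation matrix can be computed in pandas as follows. Treating suicides_no as the response variable is an assumption, and the numbers are toy values.

```python
import pandas as pd

# Toy numeric data; the values are made up for illustration only.
df = pd.DataFrame(
    {
        "population": [100, 200, 300, 400],
        "gdp_per_capita": [4000, 3000, 2000, 1000],
        "suicides_no": [1, 2, 3, 4],
    }
)

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()

# Correlation of every other column with the assumed response variable.
with_target = corr["suicides_no"].drop("suicides_no")
print(with_target)
```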

Distplot of Population
Barplot of year vs Suicide number

Then we split the dataset, trained various ML regression models, and compared their accuracy.

We found that the random forest regressor was the best-suited algorithm for this purpose.
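The comparison could be set up along these lines. This is only a sketch: the post does not list the models or the metric, so the three regressors and the R^2 score are assumptions, and synthetic data from make_regression stands in for the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data in place of the suicide dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models; this particular selection is an assumption.
models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

The ranking on synthetic data says nothing about the ranking on the real dataset; only the structure of the comparison carries over.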

Project Link: https://github.com/harshbansal1999/Suicide

4. Movie Recommendation System

This is a natural language processing project. The dataset used in this project is from Kaggle and contains information about movies: genre, whether the movie is rated, id, title, release date, vote average and vote count. Our aim is to recommend movies based on the genres provided.

We calculate the mean vote average across all movies and filter out those that do not meet the 90th-percentile cutoff.

After that, we calculated the score of each movie using its vote count and vote average.

This is the formula:

q_movies['score'] = (q_movies['vote_count'] / (q_movies['vote_count'] + c) * q_movies['vote_average']) + (c / (c + q_movies['vote_count']) * mean_rating)
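The formula is a weighted average of a movie's own rating and the overall mean rating, where movies with more votes lean more on their own rating. A self-contained sketch follows; it assumes that c is the 90th percentile of vote_count, which the post does not state explicitly, and it uses made-up data.

```python
import pandas as pd


def weighted_rating(vote_count, vote_average, c, mean_rating):
    """Blend a movie's own rating with the global mean, weighted by vote count."""
    own_weight = vote_count / (vote_count + c)
    mean_weight = c / (c + vote_count)
    return own_weight * vote_average + mean_weight * mean_rating


# Toy data; the values are made up for illustration only.
movies = pd.DataFrame(
    {
        "title": [f"Movie {i}" for i in range(10)],
        "vote_count": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        "vote_average": [5.0, 6.0, 7.0, 8.0, 6.5, 7.5, 5.5, 6.0, 7.0, 8.0],
    }
)

mean_rating = movies["vote_average"].mean()
c = movies["vote_count"].quantile(0.90)  # assumed meaning of the cutoff

# Keep only the movies that reach the vote-count cutoff, then score them.
q_movies = movies.loc[movies["vote_count"] >= c].copy()
q_movies["score"] = weighted_rating(
    q_movies["vote_count"], q_movies["vote_average"], c, mean_rating
)
print(q_movies.sort_values("score", ascending=False))
```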

Then we import TfidfVectorizer, which converts text into a numeric form the computer can work with. We apply it to the overview of each movie.

Then we calculate the cosine similarity between the resulting vectors to determine how similar the movies are to each other.

Then we wrote a function that finds movies with similar overviews, and we recommend those.

For example, we used the movie Inception as input to get the results.
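The TF-IDF, cosine similarity and recommendation steps can be sketched together as below. The function name recommend, the column names title and overview, and the four toy movies are hypothetical; the real project works on the Kaggle dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up titles and overviews for illustration only.
movies = pd.DataFrame(
    {
        "title": ["Dream Heist", "Dream Detective", "Space Farm", "Planet Farmers"],
        "overview": [
            "a thief enters dreams to steal secrets",
            "a detective enters dreams to solve a crime",
            "a farmer grows corn on a distant planet",
            "farmers travel to a distant planet to grow food",
        ],
    }
)

# Turn each overview into a TF-IDF vector, ignoring English stop words.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(movies["overview"].fillna(""))

# Pairwise cosine similarity between all overviews.
cosine_sim = cosine_similarity(tfidf_matrix)

# Map each title to its row position.
indices = pd.Series(range(len(movies)), index=movies["title"])


def recommend(title, top_n=2):
    """Return the top_n titles whose overviews are most similar to the given one."""
    idx = indices[title]
    ranked = sorted(enumerate(cosine_sim[idx]), key=lambda p: p[1], reverse=True)
    ranked = [(i, s) for i, s in ranked if i != idx][:top_n]
    return movies["title"].iloc[[i for i, _ in ranked]].tolist()


print(recommend("Dream Heist", top_n=1))
```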

Then we apply topic modelling by importing LDA. This determines the top topics on which most movies are based. To achieve that, we have to perform text cleaning, such as removing stopwords and lemmatization.

Project Link: https://github.com/harshbansal1999/Movie-Recommendation-System

5. Delhi Election Analysis

In this project, we study data from the elections held in Delhi in the past 10 years. This includes both Vidhan Sabha and Lok Sabha elections. We had to fetch the data from the Delhi government site, where it is available only as PDF files, so we had to extract the data tables from the PDFs. The data contains the list of candidates participating in each election and the votes each candidate received at each polling station.

We calculated detailed statistics for each election.

After getting the data into the required form, we built a dataframe containing a detailed analysis of each candidate.

Total votes in each polling station
Scatter plot of Candidates vs Total votes
Barplot of best candidate depicting votes in each station
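Once the tables are extracted, the aggregations behind these plots can be sketched in pandas as follows. The column names polling_station, candidate and votes, the candidate labels, and every number are made up; they do not come from the election data.

```python
import pandas as pd

# Made-up long-format table: one row per candidate per polling station.
votes = pd.DataFrame(
    {
        "polling_station": [1, 1, 2, 2, 3, 3],
        "candidate": ["Candidate A", "Candidate B"] * 3,
        "votes": [120, 80, 60, 90, 150, 30],
    }
)

# Total votes cast at each polling station.
station_totals = votes.groupby("polling_station")["votes"].sum()

# Total votes per candidate, highest first.
candidate_totals = (
    votes.groupby("candidate")["votes"].sum().sort_values(ascending=False)
)

# Votes of the leading candidate at each station.
best = candidate_totals.index[0]
best_by_station = votes[votes["candidate"] == best].set_index("polling_station")[
    "votes"
]

print(station_totals)
print(candidate_totals)
print(best, best_by_station.to_dict())
```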

We performed this detailed analysis for each election.

Project Link: https://github.com/harshbansal1999/Delhi-Election

For various other projects based on Data Analytics, Machine Learning and NLP, refer to my GitHub profile: https://github.com/harshbansal1999