7 Essential Data Science Projects

Original article was published by Adam DeJans on Artificial Intelligence on Medium



In this article we walk through 7 essential data science projects that you should include in your portfolio to get hired. Having these 7 different projects will show employers that you have diversity in your data science tool-kit and will help you stand out that much more when applying to data science positions.

Stand out from the crowd with these 7 data science projects

Project 1: Exploratory Data Analysis

Data visualization to understand the results of a data analysis. (credit Wikipedia)

First up on the list is an exploratory data analysis (EDA) project. This is especially helpful if you’re newer to data science, as it’s a perfect introduction to the field. The biggest benefit of having a thorough EDA project on hand is showing that you’re capable of, and better yet comfortable with, telling a story through data. Like it or not, EDA makes up the majority of data science work, and all data science projects start here. Gathering data, standardizing pipelines, and feature engineering are essential skills, and an EDA project will allow you to showcase all of them.

Recommendations: Don’t use clean data. Find some messy data or better yet collect your own data by scraping the web or using an API to pull data.
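As a minimal sketch of what that cleanup step can look like, here is a hypothetical messy dataset (the column names and values are invented for illustration) being standardized with pandas:

```python
import pandas as pd

# Hypothetical messy data, like you might get from web scraping:
# inconsistent capitalization, stray whitespace, bad values, missing rows.
raw = pd.DataFrame({
    "city": ["Detroit", "detroit ", "Chicago", None, "Chicago"],
    "price": ["100", "95", "bad", "120", "110"],
})

# Typical EDA cleanup: normalize text, coerce types, drop unusable rows.
df = raw.copy()
df["city"] = df["city"].str.strip().str.title()
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # "bad" -> NaN
df = df.dropna()

# A simple summary like this is the starting point for the story your
# EDA tells about the data.
summary = df.groupby("city")["price"].agg(["count", "mean"])
print(summary)
```

The point isn’t the specific calls; it’s demonstrating that you can take data no one has curated for you and make it analyzable.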

Project 2: Classification

The second project is a classification project. A classification problem involves predicting a binary or categorical outcome. The typical example is the famous Titanic survival data set, where the goal is to predict whether a passenger on the Titanic survived. Other more practical examples include predicting whether a customer will click on an ad, or whether a sports team will win a championship. Classification problems are a staple of data science, and data scientists solve them on a regular basis.

Recommendations:

  • Use predictive probabilities associated with different types of models — specifically Logistic Regression, Random Forests, or XGBoost. With these you can describe how confident you are with each prediction on each data point. By showing confidence in a prediction, you express to employers that you understand business value.
  • With classification problems you want to be very clear about your evaluation criteria. Try experimenting by optimizing for: accuracy, precision, recall, and even F1 score. Be sure to graph your ROC-AUC curve as well.
  • Try not to use standardized data sets like the Titanic data set. While these are great for practice and getting familiar with concepts, they have largely been worn out and are very well documented already. Put another way, they are not ideal for your portfolio because everyone has done them. To stand out from the crowd, try doing something more unique and give it your own personal touch.
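The recommendations above can be sketched in a few lines with scikit-learn. This is a minimal example on synthetic data (standing in for your own dataset), showing predicted probabilities alongside the evaluation metrics mentioned:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic binary-classification data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba gives the confidence behind each prediction, which is
# often more useful to a business than the hard 0/1 label.
proba = model.predict_proba(X_test)[:, 1]
preds = (proba >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_test, preds))
print("F1:      ", f1_score(y_test, preds))
print("ROC-AUC: ", roc_auc_score(y_test, proba))
```

Swapping in `RandomForestClassifier` or XGBoost’s `XGBClassifier` follows the same pattern, since all of them expose `predict_proba`.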

Project 3: Regression

For the third project try to predict a continuous outcome; otherwise known as a regression problem. A common example is trying to predict prices of houses in a certain region, or how many clicks an advertisement might get.

Recommendations:

  • Again, it is very important to explain how you evaluate your success. Perhaps you’re using R-squared, root-mean square error (RMSE), or mean absolute error (MAE). The way you evaluate your model is directly related to the type of problem you’re trying to solve.
  • Explore different models and see how they each perform against your chosen evaluation metric.
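As a sketch of both recommendations, here is a comparison of two models against the same metrics on synthetic "house price" data (the relationship between price, size, and age is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data: price depends on size and age, plus noise.
rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, 400)
age = rng.uniform(0, 50, 400)
price = 100 * size - 1000 * age + rng.normal(0, 20000, 400)
X = np.column_stack([size, age])

X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

# Evaluate each model against the same chosen metrics.
results = {}
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    pred = model.fit(X_train, y_train).predict(X_test)
    name = type(model).__name__
    results[name] = r2_score(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(name,
          f"R2={results[name]:.3f}",
          f"RMSE={rmse:.0f}",
          f"MAE={mean_absolute_error(y_test, pred):.0f}")
```

Whether RMSE or MAE is the right headline number depends on how costly large errors are in your particular problem, which is exactly the kind of reasoning worth writing up in the project.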

Project 4: Clustering

The fourth project uses models to group things together, more commonly referred to as “cluster analysis.” Cluster analysis sometimes comes up within EDA, but it is also particularly important when combined with classification. Specific cluster analysis problems can be considered an extension of EDA: a slightly more quantitatively focused way to understand the relationships between data points. Principal component analysis (PCA) is a very popular dimensionality reduction technique that is often paired with clustering and can also be used to find relationships between features.

Convergence of k-means (credit: Wikipedia)

An example of clustering might be determining which NFL quarterbacks play similar styles of offense. We could run a k-means cluster analysis on a few key per-game statistics, grouping quarterbacks according to however many clusters we choose to split the league into. From there we might see, for example, which types of quarterbacks play well in particular weather, which could create additional value. And when a new QB enters the league, we can determine which cluster they fall into, which may be interesting or valuable when predicting how well they’re going to play over the course of the upcoming season.
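The quarterback example can be sketched with scikit-learn’s `KMeans`. The per-game statistics below are entirely invented for illustration; the pattern of standardizing features, fitting clusters, and assigning a new player is what carries over to real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-game quarterback stats (numbers are made up):
# columns are passing yards and rushing yards per game.
stats = np.array([
    [300.0, 5.0],   # pocket passer
    [290.0, 8.0],   # pocket passer
    [220.0, 45.0],  # dual threat
    [230.0, 50.0],  # dual threat
    [310.0, 3.0],   # pocket passer
    [215.0, 55.0],  # dual threat
])

# Standardize so each statistic contributes comparably to the distance.
scaler = StandardScaler().fit(stats)
X = scaler.transform(stats)

# Split the league into 2 groups; n_clusters is a choice you should justify.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
print(labels)

# Assign a hypothetical new QB to one of the existing clusters.
new_qb = scaler.transform([[295.0, 6.0]])
print(kmeans.predict(new_qb))
```

Choosing the number of clusters (via the elbow method, silhouette scores, or domain knowledge) is itself good material to discuss in the project write-up.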

Projects 5, 6, & 7: Advanced Projects

To name a few popular areas in industry: natural language processing, computer vision, and deep learning.

These advanced projects require a bit more effort and skill, but they also prove to be the most beneficial, given that many industrial machine-learning problems revolve around them. Going deep into an advanced topic can make you far more desirable to employers.

Recommendations: make these projects fun and engaging. Show that you can contribute to real world needs through these projects.

Objects detected with OpenCV’s Deep Neural Network module (dnn) using a YOLOv3 model trained on the COCO dataset, capable of detecting objects of 80 common classes. (credit: Wikipedia)