# ML Day 1 : ‘Regression’

Original article was published on Deep Learning on Medium

Hi everyone!

This is the first post in a new series of articles focusing on machine learning algorithms and their practical implementation in Python using real-world data sets.

I assume that you have basic knowledge of Python programming and have used the basic libraries such as NumPy, SciPy, Pandas, Matplotlib and Sklearn.

If you haven’t, I am attaching the links here for you!

NumPy: http://bit.ly/2k4igLd

SciPy: http://bit.ly/1oyJqYk

Pandas: http://bit.ly/2qs1lAJ

Matplotlib: http://bit.ly/2EMuVNG

Sklearn: http://bit.ly/2j049C4

I wanted to provide a quick introduction to building models in Python, and what better way to start than one of the very basic models, linear regression?

This will be the first post about machine learning and I plan to write about more complex models in the future. Stay tuned! But for right now, let’s focus on linear regression.

I want to focus on the concept of linear regression and mainly on the implementation of it in Python.

Here we go!!

# Chapter 1 : Regression

Linear regression is a statistical model that examines the linear relationship between two (Simple Linear Regression) or more (Multiple Linear Regression) variables: a dependent variable and one or more independent variables. A linear relationship basically means that when an independent variable increases (or decreases), the dependent variable changes by a proportional amount, either in the same direction or in the opposite one.

A linear relationship can be positive (independent variable goes up, dependent variable goes up) or negative (independent variable goes up, dependent variable goes down). Like I said, I will focus on the implementation of regression models in Python, so I don’t want to delve too much into the math under the regression hood, but I will write a little bit about it.
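To make the positive/negative distinction concrete, here is a small sketch using NumPy's correlation coefficient. The numbers are made up for illustration; a value near +1 suggests a positive linear relationship and a value near -1 a negative one.

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # rises as x rises
y_down = np.array([9.9, 8.0, 6.1, 4.2, 1.8])  # falls as x rises

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the
# Pearson correlation between the two inputs.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(f"positive relationship: r = {r_up:.3f}")
print(f"negative relationship: r = {r_down:.3f}")
```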

# A Little Bit About the Math

A relationship between variables Y and X is represented by this equation:

`Y = mX + b`

In this equation, Y is the dependent variable, the variable we are trying to predict or estimate; X is the independent variable, the variable we are using to make predictions; m is the slope of the regression line, which represents the effect X has on Y; and b is the Y-intercept, the value of Y when X is 0.

This is Simple Linear Regression (SLR). In an SLR model, the slope and Y-intercept are derived from the data; furthermore, we don’t need the relationship between X and Y to be exactly linear. It is important to note that in a linear regression, we are trying to predict a continuous variable.

SLR models also include the errors in the data (also known as residuals). I won’t go too much into it now, maybe in a later post, but residuals are basically the differences between the true value of Y and the predicted/estimated value of Y. In a regression model, we try to minimize these errors by finding the “line of best fit”: the regression line for which the errors are minimal. On a scatter plot of the data, that means keeping the vertical distances between the data points and the fitted line as close to zero as possible. This is equivalent to minimizing the mean squared error (MSE) or the sum of squared errors (SSE), also called the “residual sum of squares” (RSS), but this might be beyond the scope of this blog post 🙂
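As a rough sketch of what "line of best fit" means in practice, the snippet below fits a line to a handful of made-up points using the standard closed-form least-squares formulas, then computes the residuals, SSE and MSE. The data and the comparison line `Y = 2X` are assumptions chosen only for illustration.

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 4.1, 5.8, 8.3, 9.9])

# Closed-form least-squares estimates of the slope m and intercept b
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

y_pred = m * x + b
residuals = y - y_pred        # true value minus predicted value
sse = np.sum(residuals ** 2)  # sum of squared errors (a.k.a. RSS)
mse = sse / len(y)            # mean squared error

# Any other line gives an SSE at least as large, e.g. Y = 2X + 0
sse_other = np.sum((y - (2.0 * x + 0.0)) ** 2)

print(f"m = {m:.3f}, b = {b:.3f}")
print(f"SSE (best fit) = {sse:.4f}, SSE (Y = 2X) = {sse_other:.4f}")
print(f"MSE (best fit) = {mse:.4f}")
```

Minimizing the SSE and minimizing the MSE pick the same line, because the MSE is just the SSE divided by the number of points.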

In most cases, we will have more than one independent variable; it can be as few as two and up to hundreds (or theoretically even thousands) of variables. In those cases we will use a Multiple Linear Regression (MLR) model. The regression equation is pretty much the same as the simple regression equation, just with more variables:

`Y’i = b0 + b1X1i + b2X2i`
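Here is a minimal sketch of how the coefficients b0, b1 and b2 could be estimated with plain NumPy. It generates noise-free synthetic data from made-up coefficients and recovers them by solving the normal equations; the data, the coefficient values, and the variable names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 observations of two independent variables X1 and X2
X = rng.normal(size=(50, 2))
true_b = np.array([1.0, 2.0, -3.0])  # made-up b0, b1, b2
y = true_b[0] + X @ true_b[1:]       # Y = b0 + b1*X1 + b2*X2, no noise

# Add a column of ones so that b0 is estimated along with b1 and b2
A = np.column_stack([np.ones(len(X)), X])

# Solve the normal equations (A^T A) b = A^T y for b
coef = np.linalg.solve(A.T @ A, A.T @ y)

print("estimated b0, b1, b2:", coef)
```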

This concludes the math portion of this post 🙂 Ready to get to implementing it in Python?

# Project 1: Predicting Boston Housing Prices

Here we perform a simple regression analysis on the Boston housing data, exploring simple types of linear regression models.

I will use the Boston Housing data set, which contains information about housing values in the suburbs of Boston. This data set was originally taken from the StatLib library, which is maintained at Carnegie Mellon University, and is now available on the UCI Machine Learning Repository. The UCI Machine Learning Repository contains many interesting data sets, and I encourage you to go through it.

We will use sklearn to import the Boston data, as it ships with a bunch of useful data sets to practice on, and we will also import the Linear Regression model from sklearn. You can also code your own linear regression model as a function or class in Python, and it’s easy.
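For the curious, here is one way a hand-written model might look. This is a hypothetical minimal class, not how sklearn implements `LinearRegression`; the class name and the toy data are assumptions, and the attribute names `coef_` and `intercept_` simply mirror sklearn's convention.

```python
import numpy as np


class MyLinearRegression:
    """Hypothetical minimal ordinary least-squares model (illustration only)."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the intercept is fitted too
        A = np.column_stack([np.ones(X.shape[0]), X])
        # Least-squares solution of A @ coef = y
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        self.intercept_ = coef[0]
        self.coef_ = coef[1:]
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return self.intercept_ + X @ self.coef_


# Toy data generated from y = 1 + 2*x1 + 3*x2
X_toy = [[1, 2], [2, 1], [3, 4], [4, 3]]
y_toy = [9, 8, 19, 18]

model = MyLinearRegression().fit(X_toy, y_toy)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("prediction for [5, 5]:", model.predict([[5, 5]]))
```

The class uses `np.linalg.lstsq` instead of inverting a matrix by hand, which tends to behave better when features are strongly correlated.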

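```python
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release.
from sklearn.datasets import load_boston

data = load_boston()
```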

Plot a histogram of the quantity to predict: price

```python
import matplotlib.pyplot as plt
# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline

plt.style.use('bmh')
plt.figure(figsize=(15, 6))
plt.hist(data.target)
plt.xlabel('price ($1000s)')
plt.ylabel('count')
plt.tight_layout()
```

Plot a scatter plot of each feature against the price

```python
for index, feature_name in enumerate(data.feature_names):
    plt.figure(figsize=(4, 3))
    plt.scatter(data.data[:, index], data.target)
    plt.ylabel('Price', size=15)
    plt.xlabel(feature_name, size=15)
    plt.tight_layout()
```

Prediction

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out part of the data (25% by default) for testing
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

clf = LinearRegression()
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
expected = y_test

# Points on the dashed diagonal would be perfect predictions
plt.figure(figsize=(15, 6))
plt.scatter(expected, predicted)
plt.plot([0, 50], [0, 50], '--k')
plt.axis('tight')
plt.xlabel('True price ($1000s)')
plt.ylabel('Predicted price ($1000s)')
plt.tight_layout()
```

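```python
import numpy as np

# Root-mean-square error of the predictions on the test set
print("RMS: %r " % np.sqrt(np.mean((predicted - expected) ** 2)))
```

In the original run this printed:

```
RMS: 5.3588890300591467
```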
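Note that `train_test_split` shuffles the data randomly, so an RMS value like this one will change from run to run unless the split is seeded. Here is a small sketch of fixing the seed through the `random_state` parameter. It uses synthetic data from `make_regression` as a stand-in (an assumption, not the Boston data), and the helper `rms_for_seed` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Boston housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)


def rms_for_seed(seed):
    """Split, fit, and return the test-set RMS error for one random seed."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    predicted = model.predict(X_test)
    return np.sqrt(np.mean((predicted - y_test) ** 2))


print("seed 0:", rms_for_seed(0))
print("seed 0 again:", rms_for_seed(0))  # same split, same score
print("seed 1:", rms_for_seed(1))        # different split
```

Averaging the score over several seeds gives a steadier estimate than any single split.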