How to Use Machine Learning Algorithms Such as Linear Regression on Covid19 Datasets for Prediction

So today I will be showing how to use the Linear Regression algorithm on a Covid19 dataset for prediction.

But before that, let’s understand what regression means.

A regression model estimates the relationship between variables: from a list of given input variables, or features, it estimates a continuous dependent variable. Typical applications include survival prediction, weather forecasting, etc. Use regression techniques when the response is a real-valued number. The regression function can be linear, quadratic, polynomial, non-linear, etc.

In the training phase, the hidden parameters are optimized with respect to the input values presented during training. The process that does the optimization is the gradient descent algorithm, also known as the steepest descent algorithm, which is used to update the parameters of the model. If the learning rate is too large, the updates overshoot; if it is too small, training takes a long time to converge. If you are using neural networks, you also need the back-propagation algorithm to compute the gradient at each layer. Once the hypothesis parameters have been trained (i.e., they give the least error on the training data), the same hypothesis with the trained parameters is used on new input values to predict outcomes, which will again be real values.
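
To make the gradient descent idea concrete, here is a minimal sketch for a single-feature linear model. The toy data, variable names, and learning rate below are purely illustrative and are not part of the article’s dataset or code:

import numpy as np

# Toy data: y is roughly 2*x + 1 with a little noise (made-up values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0 # the "hidden parameters" to be optimized
learning_rate = 0.01 # too large overshoots, too small converges slowly

for _ in range(5000):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x) # gradient of the mean squared error w.r.t. w
    grad_b = 2 * np.mean(error)     # gradient w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b) # ends up close to the true slope and intercept (about 2 and 1)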

Linear Regression is a way of predicting a response Y on the basis of a single predictor variable X.
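
Concretely, the model assumes a straight-line relationship of the form Y ≈ b0 + b1·X, where the intercept b0 and the slope b1 are the parameters learned from the training data (after fitting, scikit-learn exposes them as lr.intercept_ and lr.coef_).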

Dataset description:

I have downloaded the dataset from kaggle.com. The dataset name is “dataset on Novel Corona Virus Disease 2019 in India”.

Here is the description of the dataset:

There are 3567 rows in the file, out of which pandas kept only 2524 entries because error_bad_lines=False skips any line it cannot parse.

Dataset Description
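
As a side note, this skipping is just how pandas handles lines it cannot parse. A minimal, self-contained sketch of the same behaviour, using a made-up two-column CSV and the error_bad_lines argument from older pandas versions (newer releases have renamed it to on_bad_lines='skip'):

import io
import pandas as pd

raw = "a,b\n1,2\n3,4,5\n6,7\n" # the second data row has an extra field
df = pd.read_csv(io.StringIO(raw), error_bad_lines=False) # warns about and skips the bad row
print(len(df)) # 2 of the 3 data rows survive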

Python Code

1. Import the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Data visualisation libraries
import matplotlib.dates as mdates # Used later to convert dates into numbers
import seaborn as sns
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression # Used later to fit the model

2. Read the dataset and pair-plot it using Seaborn, one of the most widely used visualization libraries in Python

covid_data = pd.read_csv(r'C:\Users\sushilkumar.yadav\Desktop\vmware\Personal\Spyder\Covid19datasets\covid_19_india.csv', header=0, delim_whitespace=True, error_bad_lines=False)
covid_data.head()
covid_data.info()
covid_data.describe()
covid_data.columns
sns.pairplot(covid_data)

Output

3. Let’s now begin to train our regression model! We will first need to split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Deaths column.

X = covid_data[['Date', 'Confirmed']]
y = covid_data['Deaths']
X['Date'] = pd.to_datetime(X['Date'], format='%d-%m-%y')

4. Now, let’s plot the number of deaths with respect to the date, and also plot all the columns of the dataset against the date

plt.xticks(rotation=45)
print(X['Date'])
plt.plot_date(X['Date'], y, fmt='b-', xdate=True, ydate=False)
covid_data.plot()
plt.show()
date = X.loc[:, ['Date']]
X['Date2num'] = X['Date'].apply(lambda x: mdates.date2num(x))
del X['Date']

Output

Death rate plotting
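
The Date column is deleted and replaced by Date2num because LinearRegression needs purely numeric features: mdates.date2num turns each timestamp into a plain float count of days. A tiny illustrative check with a made-up date (not from the dataset):

import matplotlib.dates as mdates
import pandas as pd

print(mdates.date2num(pd.Timestamp('2020-04-15'))) # a float number of days, usable as a regression feature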

5. Train test split

Let’s split the data into a training dataset and a test dataset. In our case we use 30% of the data as the test data and the rest as the training data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
m = len(X) # total number of rows, used to slice the dates below
date_test = date.loc[:np.floor(m*0.3)]
date_train = date.loc[np.floor(m*0.3)+1:]
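
One caveat: train_test_split shuffles the rows by default, so the index-based date slices above will not necessarily line up with X_test and y_test. If you want the test dates to match the test rows exactly, one option (a sketch, not what the original code does) is to split everything together without shuffling, which holds out the last 30% of rows:

X_train, X_test, y_train, y_test, date_train, date_test = train_test_split(
    X, y, date, test_size=0.3, shuffle=False)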

6. Creating and training the model

Let’s fit the linear regression model on the training data

lr = LinearRegression()
lr.fit(X_train,y_train)

7. Let’s predict and visualize the model

print('Coefficients: \n', lr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((lr.predict(X_test) - y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lr.score(X_test, y_test))
# Plot outputs
plt.xticks(rotation=45)
plt.plot_date(date_test, y_test, fmt='b-', xdate=True, ydate=False, label='Real value')
plt.plot_date(date_test, lr.predict(X_test), fmt='r-', xdate=True, ydate=False, label='Predicted value')
plt.legend(loc='upper center')
plt.ylabel('No. of Deaths')
plt.title('Covid19 Death prediction for India')
plt.grid()

Output

Real value and Predicted value
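
The same numbers can also be computed with the helpers in sklearn.metrics, which is a little less error-prone than writing the formulas by hand. A minimal sketch reusing the lr, X_test and y_test objects from above:

from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr.predict(X_test)
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
print('R^2 score: %.2f' % r2_score(y_test, y_pred))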

Hope this gives a small gist of how machine learning can be applied to solving real-life problems.