Original article was published on Artificial Intelligence on Medium

**How to Use Machine Learning Algorithms Such as Linear Regression on Covid-19 Datasets for Prediction**

Today I will show how to use the linear regression algorithm on Covid-19 datasets for prediction.

But before that, let’s understand what **regression** means

A regression model estimates the relationships between variables: from a list of given input variables, or features, it estimates a continuous dependent variable. Typical applications include survival prediction, weather forecasting, etc. Use regression techniques when the response is a real number. The regression function can be linear, quadratic, polynomial, non-linear, and so on. In the training phase, the hidden parameters are optimized with respect to the input values presented during training. The process that does the optimization is the gradient descent algorithm, also known as the **steepest descent algorithm**, which updates the parameters of the model. If the learning rate is too large, the updates will overshoot; if it is too small, convergence takes much longer. If you are using neural networks, you also need the back-propagation algorithm to compute the gradient at each layer. Once the hypothesis parameters are trained (i.e. they give the least error during training), the same hypothesis with the trained parameters is used on new input values to predict outcomes, which are again real values.
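To make the gradient descent idea concrete, here is a minimal sketch that fits a straight line to toy data by repeatedly stepping down the mean-squared-error gradient. The data, learning rate, and iteration count are illustrative choices, not taken from the article's dataset:

```python
import numpy as np

# Toy data generated from y = 2x + 1 plus a little noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.size)

# Hypothesis: y_hat = w * x + b; minimize the mean squared error
w, b = 0.0, 0.0
learning_rate = 0.01  # too large overshoots, too small converges slowly
for _ in range(5000):
    error = (w * x + b) - y
    # Gradients of the MSE with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # close to the true slope 2 and intercept 1
```

Notice that the loop does exactly what the paragraph above describes: compute the gradient of the error, then move each parameter a small step against it.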

Linear Regression is a way of predicting a response Y on the basis of a single predictor variable X.

**Dataset description:**

I downloaded the dataset from kaggle.com; it is named “Dataset on Novel Corona Virus Disease 2019 in India”.

**Here is the description of the datasets,**

There are 3567 rows, out of which pandas loaded only 2524 entries because *error_bad_lines=False* silently skips malformed lines.
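A small sketch of that skipping behaviour on a made-up three-row CSV (the data is hypothetical; note that newer pandas versions replace `error_bad_lines=False` with `on_bad_lines="skip"`):

```python
import io
import pandas as pd

# A tiny CSV where the second record has an extra field,
# mimicking the malformed rows in the Covid-19 file (hypothetical data)
raw = (
    "Date,Confirmed,Deaths\n"
    "01-04-20,100,2\n"
    "02-04-20,150,3,extra\n"   # too many fields -> malformed
    "03-04-20,210,5\n"
)

# Skip rows that do not parse instead of raising an error
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(len(df))  # 2: the malformed row is dropped
```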

**Python Code**

1. Import the libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # Data visualisation libraries
import matplotlib.dates as mdates  # used later to convert dates to numbers
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
```

2. Read the dataset and pair-plot it using Seaborn, one of the most widely used visualization libraries in Python.

```python
covid_data = pd.read_csv(r'C:\Users\sushilkumar.yadav\Desktop\vmware\Personal\Spyder\Covid19datasets\covid_19_india.csv',
                         header=0, delim_whitespace=True, error_bad_lines=False)
covid_data.head()
covid_data.info()
covid_data.describe()
covid_data.columns
sns.pairplot(covid_data)
```

Output

3. Let’s now begin to train our regression model! We first need to split the data into an X array that contains the features to train on, and a y array with the target variable, in this case the Deaths column.

```python
X = covid_data[['Date', 'Confirmed']]
y = covid_data['Deaths']

X['Date'] = pd.to_datetime(X['Date'], format='%d-%m-%y')
```
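A quick sketch of what that `format='%d-%m-%y'` string does, on two made-up date strings (day-month-two-digit-year):

```python
import pandas as pd

# Toy day-month-year strings in the same format as the dataset's Date column
s = pd.Series(["01-04-20", "15-04-20"])
dates = pd.to_datetime(s, format="%d-%m-%y")
print(dates.dt.year.tolist())  # [2020, 2020]: two-digit '20' parses as 2020
```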

4. Now, plot the number of deaths with respect to the date, then convert the dates to numeric values so they can be used as a regression feature.

```python
plt.xticks(rotation=45)
print(X['Date'])
plt.plot_date(X['Date'], y, fmt='b-', xdate=True, ydate=False)
covid_data.plot()
plt.show()

date = X.loc[:, ['Date']]
X['Date2num'] = X['Date'].apply(lambda x: mdates.date2num(x))
del X['Date']
```

Output

5. **Train test split**

Let’s split the data into a training dataset and a test dataset; in our case we use 30% of the data as the test data and all the rest as the training data.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

m = len(X)  # total number of rows, needed for the date split below
date_test = date.loc[:np.floor(m * 0.3)]
date_train = date.loc[np.floor(m * 0.3) + 1:]
```
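A minimal sketch of how `train_test_split` divides the rows, using ten toy samples instead of the Covid-19 data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples; test_size=0.3 holds out 30% of them
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)
print(len(Xtr), len(Xte))  # 7 3
```

Fixing `random_state` makes the shuffle reproducible, which is why the article passes `random_state=101` above.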

6. **Creating and training the model**

Let’s fit the linear regression model on the training data

```python
lr = LinearRegression()
lr.fit(X_train, y_train)
```
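To see what `fit` learns, here is the same two-line pattern on synthetic data drawn from an exact line, where the recovered coefficients are known in advance (the data is illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Exact line y = 3x + 4, so the fitted parameters are fully recoverable
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 4

model = LinearRegression()
model.fit(X_demo, y_demo)
print(model.coef_[0], model.intercept_)  # ~3.0 ~4.0
```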

7. Let’s predict with the model and visualize the results

```python
print('Coefficients: \n', lr.coef_)

# The mean squared error
print("Residual sum of squares: %.2f"
      % np.mean((lr.predict(X_test) - y_test) ** 2))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lr.score(X_test, y_test))

# Plot outputs
plt.xticks(rotation=45)
plt.plot_date(date_test, y_test, fmt='b-', xdate=True, ydate=False, label='Real value')
plt.plot_date(date_test, lr.predict(X_test), fmt='r-', xdate=True, ydate=False, label='Predicted value')
plt.legend(loc='upper center')
plt.ylabel('No. of Deaths')
plt.title('Covid19 Death prediction for India')
plt.grid()
```

Output

I hope this gives a small gist of how machine learning can be applied to solving real-life problems.