NBA Game Classification using Machine Learning.

Source: Deep Learning on Medium

In this article, I will show different classification methods for predicting the result of an NBA game.

Some background on the NBA, for those unfamiliar with it.

The National Basketball Association (NBA) is a men’s professional basketball league in North America, composed of 30 teams (29 in the United States and 1 in Canada). It is one of the four major professional sports leagues in the United States and Canada, and is widely considered to be the premier men’s professional basketball league in the world.

What is classification in Machine Learning?

The majority of practical machine learning uses supervised learning. Supervised learning is where you have input variables (X) and an output variable (y), and you use an algorithm to learn the mapping function from input to output, y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variable (y) for that data.

Classification is a type of supervised learning. It assigns data elements to classes, and is best used when the output takes finite, discrete values: the model predicts a class label for each input.
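As a minimal sketch of the idea (toy data and a logistic-regression classifier of my choosing, not this article's dataset or models):

```python
# Toy example: learn a mapping from one input feature (point margin)
# to a discrete class label (1 = win, 0 = loss).
from sklearn.linear_model import LogisticRegression

X = [[-12], [-7], [-3], [2], [5], [11]]   # input variables (X)
y = [0, 0, 0, 1, 1, 1]                    # output variable (y)

clf = LogisticRegression()
clf.fit(X, y)                      # approximate the mapping y = f(X)
print(clf.predict([[-10], [8]]))   # predict class labels for new inputs
```

The classifier outputs one of the finite class values (0 or 1) for each new input, which is exactly the classification setting described above.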

What are we trying to achieve in this article?

In this article, we will apply several different classification algorithms and check the accuracy of each model at predicting the result. Our main objective is to show how to check a model's accuracy, not to make predictions on new games. The dataset used in this example is the results of all NBA games from 2014 to 2018 and can be found here: https://www.kaggle.com/ionaskel/nba-games-stats-from-2014-to-2018

Before working with the different models, let us see how to load the data frame and modify it as per our needs, along with the different libraries required for classification in machine learning.

Import the standard libraries which are essential for our objective.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now let us read the CSV file we downloaded from Kaggle. This data needs some pre-processing before we can use it.

df = pd.read_csv(r'C:\Siddhanth\SI4407\Sem 5\ML and AI\Home\nba.csv')
df.head()

Output:

The first step for classification is to turn all data into numeric values. There are many methods for converting non-numeric values to numeric ones. The one I used is one-hot encoding with pandas' pd.get_dummies, to change Home and Away into 1 and 0. You can read more about it here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

This is how:

one_hot = pd.get_dummies(df['Home'], prefix='Side')  # Side_Home / Side_Away indicator columns
df = df.drop('Home', axis=1)
df = df.join(one_hot)
df.head()

Output:

As you can see, the Home column has been changed into Side_Away and Side_Home, appended to the end of the data frame using join. Similarly, for other non-numeric data like Team and Opponent, we have done the same using scikit-learn's LabelEncoder. More about it here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df['Team'] = le.fit_transform(df['Team'])      # learn a number for each team name
df['Opponent'] = le.transform(df['Opponent'])  # reuse the same team-to-number mapping
df.head()

Output:

Here each team has been given a numeric value, and the names in the Opponent column have been replaced with the same numbers. That leaves the WINorLOSS column, which can easily be converted into 1 and 0.

df['WINorLOSS'] = df['WINorLOSS'].map(lambda x: 1 if x != 'L' else 0)
df.head()

Output:

Now that our data is pre-processed and ready for use, let's jump into the different models of classification.

1. CART

Classification and Regression Trees or CART for short is a term introduced by Leo Breiman to refer to Decision Tree algorithms that can be used for classification or regression predictive modeling problems. Classically, this algorithm is referred to as “decision trees”, but on some platforms like R they are referred to by the more modern term CART.

Let’s use this model and try our accuracy score.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Separate the features from the WINorLOSS target before splitting.
X = df.drop('WINorLOSS', axis=1)
y = df['WINorLOSS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

classifier = DecisionTreeClassifier(random_state=1)  # a classifier, since the target is discrete
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
accuracy_score(y_test, y_pred)

Output:

The representation for the CART model is a binary tree.
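That binary tree can be printed directly. A small sketch on toy data of my own (not the article's fitted model), using scikit-learn's export_text:

```python
# Fit a tiny tree and print its binary structure of threshold splits.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 10], [1, 25], [0, 30], [1, 5]]   # [home_flag, point_margin]
y = [0, 1, 1, 0]                          # 1 = win, 0 = loss

tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print(export_text(tree, feature_names=['home', 'margin']))
```

Each internal node tests one feature against a threshold, and each leaf holds a predicted class.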

2. Random Forest

The random forest algorithm can be used for both classification and regression problems. The same algorithm for both classification and regression? You might think I am kidding, but the truth is: yes, we really can use the same random forest algorithm for both. The best video I found for Random Forest on YouTube: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ

Let’s test our accuracy:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)  # 100 shallow trees
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Output:

Pretty good, huh? 🙂
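A fitted forest also reports how much each column contributed to its splits via feature_importances_. A self-contained sketch with synthetic data (the column names here are my own stand-ins, not the article's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'FieldGoals': rng.normal(38, 4, 200),
    'Turnovers':  rng.normal(14, 3, 200),
    'Noise':      rng.normal(0, 1, 200),   # unrelated to the label
})
# The label depends on FieldGoals and Turnovers, never on Noise.
y = (X['FieldGoals'] - X['Turnovers'] > 24).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
imp = pd.Series(clf.feature_importances_, index=X.columns)
print(imp.sort_values(ascending=False))   # Noise should rank near the bottom
```

On the real NBA data frame, the same two lines would show which box-score columns drive the win/loss prediction.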

3. Gradient Boosting

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Gradient boosting involves three elements:

  1. A loss function to be optimized.
  2. A weak learner to make predictions.
  3. An additive model to add weak learners to minimize the loss function.
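These three elements can be sketched numerically. A toy version of my own, using squared-error loss and depth-2 trees as the weak learners (GradientBoostingClassifier itself optimizes a different loss, but the additive mechanics are the same):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0])                 # target to approximate

pred = np.zeros_like(y)             # the additive model starts at zero
learning_rate = 0.1
for _ in range(50):
    residuals = y - pred            # negative gradient of squared-error loss
    weak = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # weak learner
    pred += learning_rate * weak.predict(X)                      # additive update

print('final MSE:', np.mean((y - pred) ** 2))
```

Each round adds one weak learner fitted to the current residuals, so the ensemble's loss shrinks step by step.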

We have taken multiple learning rates in a list and checked the accuracy for each.

from sklearn.ensemble import GradientBoostingClassifier

lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate,
                                        max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, y_train)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))
    print("Accuracy score (validation): {0:.3f}".format(gb_clf.score(X_test, y_test)))

Output:

4. Neural Networks

Artificial Neural Networks have generated a lot of excitement in Machine Learning research and industry, thanks to many breakthrough results in speech recognition, computer vision and text processing. In this blog post we will try to develop an understanding of a particular type of Artificial Neural Network called the Multi Layer Perceptron. More about this can be found here: https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/

Let’s test our accuracy

import keras
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(20, activation='relu', input_dim=38))  # input_dim must equal the number of feature columns
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))              # sigmoid output for the binary win/loss label
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=1,
                    validation_data=(X_test, y_test), shuffle=True)
score = model.evaluate(X_test, y_test, batch_size=64)

Output:
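The History object returned by model.fit also lets you plot the learning curves. A sketch with made-up per-epoch numbers standing in for history.history (the real values come from your own training run; older Keras versions use the keys 'acc' and 'val_acc' instead):

```python
import matplotlib
matplotlib.use('Agg')              # render without a display
import matplotlib.pyplot as plt

# Stand-in for history.history from the fit call above.
hist = {
    'accuracy':     [0.55, 0.61, 0.66, 0.70, 0.72],
    'val_accuracy': [0.54, 0.60, 0.63, 0.65, 0.66],
}

plt.plot(hist['accuracy'], label='train')
plt.plot(hist['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.savefig('training_curves.png')
```

A widening gap between the two curves is a quick visual sign of overfitting.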

CONCLUSION

Different models give different results: some underfit or overfit, while others give fairly accurate answers. Using these techniques, the models can now be tested on other data to predict the result of any NBA match. For our dataset, CART has proven to be a better model than the others. Do try the other models. Thank you.