Source: Deep Learning on Medium
Machine Learning with Abalone
At the most basic level, machine learning can be understood as programmed algorithms that receive and analyse input data to predict output values within an acceptable range. As new data is fed to these algorithms, they learn to optimize their operations so as to improve their performance, developing ‘intelligence’ over time.
In this blog, various machine learning algorithms will be compared using the Abalone dataset from the UCI Machine Learning Repository.
Predicting Age of Abalone
The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope, a tedious and time-consuming task. We want to predict the age from physical measurements that are easier to obtain. The age of an abalone is (number of rings + 1.5) years.
Sex / nominal / — / M, F, and I (infant)
Length / continuous / mm / Longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / — / +1.5 gives the age in years
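The last attribute gives the conversion from ring count to age, which is just a fixed offset; a tiny sketch (the function name is illustrative):

```python
def rings_to_age(rings):
    """Estimated age in years from a ring count (rings + 1.5)."""
    return rings + 1.5

print(rings_to_age(10))  # 11.5
```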
In the dataset, all the columns except ‘Sex’ are numeric.
import pandas as pd
import numpy as np

path = "/content/drive/My Drive/Machine learning 303L/abalone.csv"
df = pd.read_csv(path)
For the categorical column ‘Sex’ we will one-hot encode the three values ‘M’ (Male), ‘F’ (Female) and ‘I’ (Infant).
df["M"] = 0
df["F"] = 0
df["I"] = 0
for i in range(len(df["Sex"])):
    if df["Sex"][i] == 'M':
        df.at[i, "M"] = 1
    elif df["Sex"][i] == 'F':
        df.at[i, "F"] = 1
    elif df["Sex"][i] == 'I':
        df.at[i, "I"] = 1
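The same encoding can also be produced in one line with pandas; a minimal sketch on a toy frame (the column values are assumed to match the dataset above):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["M", "F", "I", "M"]})

# get_dummies creates one 0/1 column per category, equivalent to the loop above
dummies = pd.get_dummies(df["Sex"])          # columns: F, I, M (alphabetical)
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())  # ['Sex', 'F', 'I', 'M']
```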
For the numerical data we will normalize using scikit-learn, so that all the features end up on a comparable scale.
The Rings column will be taken as the output and the remaining columns as inputs.
from sklearn import preprocessing

x = df.drop(["Sex", "Rings"], axis=1).values
y = df["Rings"].values
x = preprocessing.normalize(x)
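Note that preprocessing.normalize works row-wise: it rescales each sample to unit norm. If the intent is instead to squeeze each feature into a fixed range, MinMaxScaler is the usual tool. A small sketch contrasting the two on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

# normalize: each ROW is scaled to unit (L2) norm
row_normed = normalize(x)
print(np.linalg.norm(row_normed, axis=1))  # every row has norm 1.0

# MinMaxScaler: each COLUMN is rescaled into [0, 1]
scaled = MinMaxScaler().fit_transform(x)
print(scaled.min(axis=0), scaled.max(axis=0))
```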
Now we will divide the dataset into training and testing sets.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
Training different models
Now we will train various models for this regression problem using python libraries. We will compare them using three regression metrics: MAE (mean absolute error), MSE (mean squared error) and R² (R squared).
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVR
import warnings
from keras.models import Sequential,model_from_json
from keras.layers import Dense
from keras.optimizers import RMSprop
A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while the tree is developed incrementally. The final result is a tree with decision nodes and leaf nodes.
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(x_train, y_train)
y_predicted1 = regressor.predict(x_test)
Linear regression fits a linear model to describe the relationship between two or more variables.
linearRegressor = LinearRegression()
linearRegressor.fit(x_train, y_train)
y_predicted2 = linearRegressor.predict(x_test)
Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modeled as an nth degree polynomial.
polynomial_features = PolynomialFeatures(degree=2)  # the degree is a tunable choice
x_poly = polynomial_features.fit_transform(x_train)
x_poly_test = polynomial_features.transform(x_test)  # reuse the transform fitted on the training set
model = LinearRegression()
model.fit(x_poly, y_train)
y_predicted3 = model.predict(x_poly_test)
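To see what PolynomialFeatures actually produces, here is the degree-2 expansion of a single two-feature sample: the constant term, the original features, and all degree-2 products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
# columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(x))  # [[1. 2. 3. 4. 6. 9.]]
```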
Random Forest is an ensemble method for classification or regression which combines multiple decision trees.
from sklearn.ensemble import RandomForestRegressor  # a regressor suits this task better than RandomForestClassifier
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_predicted4 = rf.predict(x_test)
Support Vector Machine — Regression (SVR)
Support Vector Regression (SVR) uses the same principles as the SVM for classification: it looks for the hyperplane that maximizes the margin while tolerating part of the error, in order to minimize the overall prediction error.
regressor = SVR(kernel='rbf')
regressor.fit(x_train, y_train)
y_predicted7 = regressor.predict(x_test)
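As a self-contained illustration of the rbf kernel and the tolerated-error margin, here is a sketch fitting SVR to toy data (the variable names and epsilon value are illustrative, not from the article's experiment):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
x_toy = np.sort(rng.rand(80, 1) * 5, axis=0)
y_toy = np.sin(x_toy).ravel()

# epsilon sets the width of the tube within which errors are tolerated
svr = SVR(kernel='rbf', epsilon=0.1)
svr.fit(x_toy, y_toy)
print(svr.score(x_toy, y_toy))  # R^2 on the training data
```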
A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates.
We will use 4 densely connected layers. The input will contain 10 nodes and the output will contain only 1 node. The loss will be MAE. Other activation functions with different numbers of nodes can also be used.
model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(10,)))
model.add(Dense(128, activation='relu'))  # hidden sizes are one possible choice
model.add(Dense(64, activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
history = model.fit(x_train, y_train, batch_size=100, epochs=5, verbose=1)
test = model.evaluate(x_test, y_test, verbose=1)
Using Regression Metrics
We are using python libraries for implementing regression metrics.
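A small helper keeps the three metrics consistent across models; the arrays below are illustrative stand-ins for the real predictions:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def score_model(name, y_true, y_pred):
    """Compute the three regression metrics used for comparison."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name}: MAE={mae:.3f}  MSE={mse:.3f}  R2={r2:.3f}")
    return mae, mse, r2

# illustrative values, not real model output
mae, mse, r2 = score_model("example", [3, 5, 7, 9], [2.5, 5.0, 7.5, 9.0])
```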
After calculating the regression metrics we can see that polynomial regression achieves the lowest MAE and MSE. The other models have larger MAE and MSE values and lower R² values. The neural network may perform better with different activation functions and numbers of nodes.
We can further improve our models by using regularization, which penalizes large weights. We can combine Ridge or Lasso with the original algorithms. Here I am using Ridge with linear regression; it does not improve the model in this case, but it can help in combination with other models. One can see the difference in the table below.
from sklearn.linear_model import Ridge
ridge = Ridge()
ridge.fit(x_train, y_train)
y_predicted5 = ridge.predict(x_test)
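Lasso follows the same pattern as Ridge but penalizes the absolute value of the weights, which can drive some of them to exactly zero. A minimal sketch on toy data (since the real x_train/y_train come from the preprocessing above, these arrays and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
x_toy = rng.rand(100, 3)
# only the first feature actually matters
y_toy = 4.0 * x_toy[:, 0] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1)   # alpha controls the strength of the L1 penalty
lasso.fit(x_toy, y_toy)
print(lasso.coef_)         # coefficients of the unused features shrink toward 0
```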
We have seen preprocessing, different types of algorithms with their implementations in python, and their comparison using regression metrics on the Abalone dataset.