Supporting the Math Behind Supporting Vector Machines!

Original article was published on Artificial Intelligence on Medium

Introduction to Support Vector Machine

Support Vector Machine(SVM) is a powerful classifier that works with both linear and non-linear data.

If you have a n-dimensional space, then the dimension of the hyperplane will be (n-1).

The goal of SVM is to find an optimal hyperplane that best separates our data so that distance from the nearest points in space to itself is maximized.

To keep it simple, consider a road, which separates the left, right-side cars, buildings, pedestrians and makes the widest lane as possible. And those cars, buildings, really close to the street are the support vectors.

This is how the Support Vector machine works.

How SVM works?

Goal: To find an optimal hyperplane.

Computing distance of point from Supporting vector

We want a hyperplane that can classify points in a very efficient manner i.e. allocating the classes in a precise manner.

To execute that precisely, we need to alter our goal.

Modified Goal: To maximize minimum distance from the hyperplane.

Let us assume a minimum threshold distance = γ

Now all points must satisfy following two conditions:

In order to achieve minimum distance, we need to maximize “w”.

This execution is a little tedious and cumbersome, therefore to reduce mathematical complexities we will reformulate our equations.

So, we will re-normalize our data in such a way that support vectors must lie on the hyperplane.

So, in order to maximize “w”, we need to maximize “D”.

D = D1+D2; so we need to maximize D1 and D2 too.

Now, in order to increase “D”, we must focus on decreasing “||w||” under the condition that the minimum distance must be 1.

In our new goal, we need to decrease ||w|| but we need to keep in mind that ||w|| isn’t free i.e. it depends upon γ.

Since SVM can’t employ Gradient Descent, due to constraint dependency, we need to look for other solutions.

The problem can be avoided using any of the following method:

  • Quadratic Solver
  • Langragian Density

Handling Outliers

Sometimes for some examples (x(i), y(i)) we might encounter error as E(i).

The error will completely demolish our goal as adding error will increase the cost of loss but to develop a robust model we need to allow some error or otherwise, it will lead to overfitting.

But to keep the error as low as possible, we will introduce a penalty, the moment our model wrongly classify a data point it will face some penalty(c) that will help us to increase our accuracy.

If E=0 ; then no penalty would be fined.

if c is very large ; then less margin but better classification (more prone to overfitting)

if c=1 i.e. very low ; then margin is maximized but at expense of misclassification (better classifier)


To overcome the constraint convex optimization problem, we employ the PEGASOS method.

We reformulate our equations to achieve constraint independency.

Now if we talk mathematically, the effect on 1 will be nil on the equation as we are adding it once and contrastingly subtracting it once.

To ease our equations, we will substitute t(i).

The above equation corresponds that if point is away from hyperplane the error will be zero otherwise the error encountered will be (1-t(i)).

The formulated function is still not differential at t≥1 ; so we are going to divide our gradient in Sub-gradient 1 & Sub-gradient 2.

Our final function that needs to be minimize is:

Now, since the constraints are removed and our function is independent of γ. We can employ Gradient Descent to minimize the loss.

Minimizing loss using Gradient Descent

Calculating loss
Implementing Gradient Descent for SVM

Non-Linear Classification

To classify non-linear data using the Support vector machine, we need to project data to higher dimensions i.e. convert 2D data to 3D data by increasing its features.


This increment in features will be computationally expensive.

To transform dimensions of data, we use kernels.

  • Linear Kernel
  • RBF (Radial Basis)Kernel

The hurdle with kernel trick is choosing the right kernel for your model, because we never know which kernel will be best suited to our model and at what cost(penalty).

So to auto-tune our hyperparameters, we will implement GridSearch.

We will specify params i.e. list consisting of dictionary with keys as kernels and penalty, we’ll run our model for every kernel and get the best accuracy.

params = [{‘kernel’:[‘linear’, ‘rbf’, ‘poly’, ‘sigmoid’], ‘c’:[0.1, 0.2, 0.5, 1.0, 2.0, 5.0]}

Now for each type of kernel and 5 types of penalty we will store accuracies and later will choose kernel and penalty(c) with best accuracy.

Parameters accepted :

  • cv: cross-validation
  • n_jobs: number of CPU available

Pros & Cons of Support Vector Machine


  • Can easily handle large feature spaces.
  • Kernel trick is real strength of SVM as it helps to find solution to even complex problems.
  • Works well with both Linear and non-linear data.
  • Less prone to overfitting.
  • Reliable even for unstructured data.


  • Sensitive to noise.
  • Choosing an optimal kernel is really tough.
  • Long training time for Large Dataset.

Visualizing SVM

Applications of SVM

Source : DataFlair