Original article was published on Artificial Intelligence on Medium
The issue here is that as the number of features that we have increased the computational cost of computing high degree polynomial features skyrockets, making the model slow.
Another way of making a data set linearly separable are the Radial Basis functions mentioned previously. The following example shows how we can transform a non linearly separable data set into one that can be easily divided using an RBF:
Now, if on the data set shown in the previous image, we compute a third feature, using an RBF centred on the origin, and we plot this new feature along with the previous two, our data set is transformed into the following one, which can be linearly separated by a horizontal plane at the height of 0.7 on the new feature r.
In this previous example, however, we choose a point to centre our RBF that gave us good results, which is not trivial to do in most cases. What it’s done most of the time is computing an RBF from every point in the data set to all the other data points using the original features, and using those RBF computed distances as new features.
The issue here is, that if we have a large training set, computing all these features is very computationally expensive. Again we reach problems when trying to use a linear SVC to separate non-linear data.
In the solution to these problems lies the true power of Support Vector Machines: Kernels.
The Kernel Trick
In both of the previous cases, to be able to separate our data, we needed to compute a transformation on it (a polynomial transformation, and an RBF similarity function), which as we saw was very computationally expensive.
A Kernel in Machine Learning is a mathematical function that allows us to compute the dot product of the transformation of two vectors, without actually having to compute the transformations itself.
The idea behind kernels is illustrated in the following formula.
K(a,b) in the previous formula represents the kernel of vectors a and b (a random kernel, later we will see there are many different ones). We can see that this kernel is equal to the dot product of ϕ(a) and ϕ(b), where ϕ represents an specific transformation (which could be a polynomial, or RBF transformation) of the vector that is given to it as an argument.
To understand how this largely simplifies our problem, and by the use of kernels we can use SVMs to classify non-linearly separable data sets, it is best if we first see how a Support Vector Machine is trained.
Training an SVM: Another optimisation problem
In the end, training an SVM Classifier, comes down to solving an optimisation problem, called the Dual Problem. An SVM makes predictions pretty much like any other classifier: it takes the input vector x, multiplies it by some weight vector w, and adds a bias term b, like shown in the following formula.
If this operation gives a result greater than some threshold (0 in our case), then the sample gets classified as a positive instance (1). If it yields a result lower than the threshold, then it gets classified as a negative instance (0).
To obtain the weight and bias vectors, we have to solve an optimisation problem that tries to maximise the margin of the street we spoke about while limiting the margin violations. Mathematically this is translated into finding the minimal value of the weight vector that satisfies a certain condition of vagueness or slack (how many times and how heavily the margin can be crossed).
This optimisation problem can be solved using a normal QP solver, as it is a convex, quadratic problem. The final equation which this problem solves is the following:
Formula 3, however, is not strictly correct. α should be greater than or equal to 0, taking this last value for every data point that is not a support vector.
Don’t mind about the t(i) or t(j), there are two important things about this equation: once we solve it and get the value of α, we can calculate the weight and bias vectors. The other thing to notice is the term inside the orange dotted line: the dot product of two training instances, which we have to repeat for all the training set.
This is where the Kernel trick we spoke about before comes in handy: if we have transformed our training instances to make them separable using any kind of transformation, we would have to compute the transformation on every training instance, do the dot product… and it would all be amazingly expensive in terms of computation. This is shown in the following formula.