Radial Basis Function (RBF) Kernel: The Go-To Kernel

Original article was published by Sushanth Sreenivasa on Artificial Intelligence on Medium



You’re working on a Machine Learning algorithm like Support Vector Machines for non-linear datasets and you can’t seem to figure out the right feature transform or the right kernel to use. Well, fear not because Radial Basis Function (RBF) Kernel is your savior.

Fig 1: No worries! RBF got you covered. [Image Credits: Tenor (tenor.com)]

RBF kernels are the most generalized form of kernelization and one of the most widely used kernels due to their similarity to the Gaussian distribution. The RBF kernel function for two points X₁ and X₂ computes their similarity, i.e. how close they are to each other. This kernel can be mathematically represented as follows:

K(X₁, X₂) = exp(−||X₁ − X₂||² / 2σ²)

where,
1. ‘σ’ is our hyperparameter (σ² is the variance of the corresponding Gaussian)
2. ||X₁ − X₂|| is the Euclidean (L₂-norm) distance between the two points X₁ and X₂
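The formula above can be sketched directly in NumPy; the `rbf_kernel` helper below is a hypothetical name introduced here for illustration:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """Similarity between two points: K = exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    d = np.linalg.norm(np.asarray(x1) - np.asarray(x2))  # Euclidean (L2) distance
    return np.exp(-d**2 / (2 * sigma**2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [5.0, 5.0]))  # far-apart points -> near 0
```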

Let d₁₂ be the distance between the two points X₁ and X₂, now we can represent d₁₂ as follows:

Fig 2: Distance between two points in space [Image by Author]

The kernel equation can be re-written as follows:

K(X₁, X₂) = exp(−d₁₂² / 2σ²)

The maximum value the RBF kernel can take is 1, which occurs when d₁₂ = 0, i.e. when the points are the same: X₁ = X₂.

  1. When the points are the same, there is no distance between them and therefore they are extremely similar
  2. When the points are separated by a large distance, then the kernel value is less than 1 and close to 0 which would mean that the points are dissimilar

Distance can be thought of as a measure of dissimilarity: as the distance between the points increases, they become less similar.

Fig 3: Similarity decreases as distance increases [Image by Author]

It is important to find the right value of ‘σ’ to decide which points should be deemed similar, and this can be demonstrated on a case-by-case basis.

a] σ = 1

When σ = 1, σ² = 1 and the RBF kernel’s equation becomes:

K(X₁, X₂) = exp(−d₁₂² / 2)

The curve for this equation is given below, and we can notice that as the distance increases, the RBF kernel decreases exponentially and is effectively 0 for distances greater than 4 units.

Fig 4: RBF Kernel for σ = 1 [Image by Author]
  1. We can notice that when d₁₂ = 0, the similarity is 1, and as d₁₂ increases beyond 4 units, the similarity is effectively 0
  2. From the graph, we see that if the distance is below 4 the points can be considered similar, and if the distance is greater than 4 the points are dissimilar
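These observations can be checked numerically by evaluating exp(−d₁₂²/2) at a few distances:

```python
import numpy as np

# Evaluate K = exp(-d^2 / (2 * sigma^2)) for sigma = 1 at a few distances
sigma = 1.0
for d in [0.0, 1.0, 2.0, 4.0]:
    k = np.exp(-d**2 / (2 * sigma**2))
    print(f"d12 = {d}: K = {k:.6f}")
```

At d₁₂ = 4 the similarity is roughly 0.0003, which is why the curve appears to reach 0 there.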

b] σ = 0.1

When σ = 0.1, σ² = 0.01 and the RBF kernel’s equation becomes:

K(X₁, X₂) = exp(−d₁₂² / 0.02)

The width of the Region of Similarity is minimal for σ = 0.1, and hence only points that are extremely close to each other are considered similar.

Fig 5: RBF Kernel for σ = 0.1 [Image by Author]
  1. We see that the curve is extremely peaked and is effectively 0 for distances greater than 0.2
  2. The points are considered similar only if the distance is less than or equal to 0.2

c] σ = 10

When σ = 10, σ² = 100 and the RBF kernel’s equation becomes:

K(X₁, X₂) = exp(−d₁₂² / 200)

The width of the Region of Similarity is large for σ = 10, because of which points that are farther apart can be considered similar.

Fig 6: RBF Kernel for σ = 10 [Image by Author]
  1. The width of the curve is large
  2. The points are considered similar for distances up to 10 units and beyond 10 units they are dissimilar
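The three cases above can be compared numerically by evaluating the kernel at a fixed distance, say d₁₂ = 1, for each σ (a small NumPy sketch):

```python
import numpy as np

def rbf(d, sigma):
    # K = exp(-d^2 / (2 * sigma^2))
    return np.exp(-d**2 / (2 * sigma**2))

# Same distance, three widths: small sigma -> dissimilar, large sigma -> similar
for sigma in [0.1, 1.0, 10.0]:
    print(f"sigma = {sigma:>4}: K(d12=1) = {rbf(1.0, sigma):.6f}")
```

At the same distance of 1 unit, σ = 0.1 yields a similarity of essentially 0, σ = 1 about 0.61, and σ = 10 about 0.995.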

It is evident from the above cases that the width of the Region of Similarity changes as σ changes.
Finding the right σ for a given dataset is important and can be done using hyperparameter tuning techniques like Grid Search Cross-Validation and Random Search Cross-Validation.
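A minimal tuning sketch using scikit-learn’s GridSearchCV on the Iris dataset (the grid values below are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid; the ranges should be adapted to the dataset at hand
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # best (C, gamma) found on this grid
print(search.best_score_)    # mean cross-validated accuracy
```

RandomizedSearchCV can be swapped in for larger grids, where exhaustive search becomes expensive.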

The RBF kernel is popular because of its similarity to the K-Nearest Neighbors (K-NN) algorithm. It has the advantages of K-NN while overcoming the space complexity problem: an RBF kernel Support Vector Machine only needs to store the support vectors during training, rather than the entire dataset.

The RBF kernel Support Vector Machine is implemented in the scikit-learn library and has two hyperparameters associated with it: ‘C’ for the SVM and ‘γ’ for the RBF kernel. Here γ = 1 / 2σ², so γ is inversely proportional to σ².
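The relationship between γ and σ can be sanity-checked with scikit-learn’s `sklearn.metrics.pairwise.rbf_kernel` helper, which computes exp(−γ||x₁ − x₂||²):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# scikit-learn parameterizes the kernel as exp(-gamma * ||x1 - x2||^2),
# so gamma = 1 / (2 * sigma^2)
sigma = 1.0
gamma = 1.0 / (2.0 * sigma**2)

x1 = np.array([[0.0, 0.0]])
x2 = np.array([[1.0, 1.0]])  # squared distance = 2

print(rbf_kernel(x1, x2, gamma=gamma))  # exp(-2 / (2 * sigma^2)) = exp(-1)
```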

Fig 7: RBF Kernel SVM for Iris Dataset [Image Credits: https://scikit-learn.org/]

From the figure, we can see that as γ increases (i.e. as σ decreases), the model tends to overfit for a given value of C.

Finding the right γ (or σ) along with the right value of C is essential in order to achieve the best bias-variance trade-off.

References:

  1. Scikit-Learn Implementation of SVM: https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
  2. Radial Basis Function Kernel: https://en.wikipedia.org/wiki/Radial_basis_function_kernel