Source: Deep Learning on Medium

Typically in convolutions, we use a 2D or 3D kernel, hoping that each filter extracts some kind of feature by convolving across all 2 or 3 dimensions, respectively. In the 2D case specifically, we try to extract simple features in the initial layers and more complex features in the later layers. However, when a 2D kernel is separable (that is, when it can be written as the outer product of a column vector and a row vector), we can factorize it into two 1D kernels as shown below.
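To make the factorization concrete, here is a minimal sketch using NumPy. The Sobel kernel is my own choice of example (it is not taken from the original figure), and the names `vert` and `horz` are only illustrative. The sketch also checks the matrix rank, since a 2D kernel splits exactly into two 1D kernels only when its rank is 1.

```python
import numpy as np

# Two 1D kernels. These are the well-known Sobel factors, used here
# purely as an illustration (an assumption, not the article's own figure).
vert = np.array([1, 2, 1])    # applied along the rows (column vector)
horz = np.array([-1, 0, 1])   # applied along the columns (row vector)

# Their outer product is the 2D kernel: sobel_x[r, c] = vert[r] * horz[c]
sobel_x = np.outer(vert, horz)
print(sobel_x)

# A kernel can be factorized exactly like this only if its rank is 1.
sobel_rank = np.linalg.matrix_rank(sobel_x)

# Counter-example: this Laplacian kernel has rank 2, so no pair of
# 1D kernels reproduces it exactly.
laplacian = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]])
laplacian_rank = np.linalg.matrix_rank(laplacian)
print(sobel_rank, laplacian_rank)
```

Kernels that are not rank 1 can still be approximated by a separable one, but the match is then no longer exact.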

Now, we can take these two 1D kernels and apply them one after the other (in consecutive layers) to an image instead of applying the original 2D kernel. By doing so, we reduce the number of parameters used for the convolution, and so have fewer parameters to train. Also, the order in which we apply these separable kernels generally does not matter. To put things into perspective, a 5×5 kernel has 25 parameters, whereas a 1×5 kernel and a 5×1 kernel together have only 10.
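The claims in this paragraph can be checked numerically. The sketch below is only an illustration under a few assumptions: it uses a hand-written, unpadded sliding-window helper (`correlate2d_valid`, a hypothetical name) rather than any particular library's convolution, random kernel values, and no nonlinearity between the two 1D passes. With an activation function between the layers, the two-pass result would no longer equal the single 2D pass exactly.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1.

    This is cross-correlation (no kernel flip), which is what most
    deep learning libraries call "convolution".
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
vert = rng.standard_normal((5, 1))   # 5x1 kernel
horz = rng.standard_normal((1, 5))   # 1x5 kernel
full = vert @ horz                   # the equivalent 5x5 (rank-1) kernel

one_pass = correlate2d_valid(image, full)
two_pass = correlate2d_valid(correlate2d_valid(image, vert), horz)
swapped = correlate2d_valid(correlate2d_valid(image, horz), vert)

print(np.allclose(one_pass, two_pass))  # same result as the 2D kernel
print(np.allclose(two_pass, swapped))   # order of the 1D passes is irrelevant

params_2d = full.size                # 25
params_1d = vert.size + horz.size    # 10
print(params_2d, params_1d)
```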

Of course, reducing the parameters means we might have to compromise on the complexity of the features we learn. But, if you look at the image below, you can see that two 1D kernels can easily learn the simple features that one 2D kernel was trying to learn. And if you visualize these two 1D kernels (as shown below), you can see that they should be able to learn almost any near-complex feature with decent accuracy. In the image, the left side shows the weights of the 1D kernels “horz[c]” and “vert[r]”, and the right side shows a 3D plot of the final weights that result from applying the two 1D kernels one after the other.

These separable convolutional layers can usually learn simple to near-complex features in an image very efficiently. Therefore, it makes intuitive sense to use separable layers more in the initial layers, which try to capture simple features, than in the final layers, which try to capture much more complex features.

We could also see this as a way of regularizing our network, in which we try to retain only the truly independent parameters. This makes the model computationally efficient while keeping a strong set of features learned at every layer. Also, by convolving and learning features along each dimension separately, the separable convolution tries to learn more abstract features in each dimension. In a way, it focuses on finding good features in the individual dimensions independently and then combines them at the end to extract a complex feature with a minimal number of parameters.

Depthwise convolutions are a special case of separable convolution (the full two-step operation is usually called a depthwise separable convolution). In depthwise convolution, we first convolve spatially in the X and Y dimensions using 2D filters (with size 1 in the third dimension), one per channel, and then convolve channel-wise in the Z dimension using a 1×1 filter. The image below shows how depthwise convolution is used in Xception networks.
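The two steps described above can be sketched in plain NumPy. This is a simplified illustration, not how Xception or any framework implements it: the function name `depthwise_separable_conv`, the channels-last layout, the tensor sizes, and the lack of padding, stride, and bias are all assumptions made to keep the example short.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise, pointwise):
    """Hypothetical helper: depthwise step followed by a 1x1 pointwise step.

    x:         (H, W, C_in)   input, channels last
    depthwise: (k, k, C_in)   one k x k spatial filter per input channel
    pointwise: (C_in, C_out)  1x1 filters that mix channels
    No padding, stride 1, no bias.
    """
    k = depthwise.shape[0]
    out_h = x.shape[0] - k + 1
    out_w = x.shape[1] - k + 1
    spatial = np.zeros((out_h, out_w, x.shape[2]))
    # Depthwise step: each channel is filtered only by its own 2D kernel.
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]                      # (k, k, C_in)
            spatial[i, j, :] = np.sum(patch * depthwise, axis=(0, 1))
    # Pointwise step: at every pixel, combine the channels (the 1x1 filter).
    return spatial @ pointwise                                  # (out_h, out_w, C_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))     # 8x8 input with 4 channels
dw = rng.standard_normal((3, 3, 4))    # 3x3 depthwise filters
pw = rng.standard_normal((4, 8))       # 1x1 filters producing 8 channels

out = depthwise_separable_conv(x, dw, pw)
print(out.shape)

# Parameter counts (ignoring biases) for the same input/output channels:
standard_params = 3 * 3 * 4 * 8        # one 3x3x4 filter per output channel
separable_params = dw.size + pw.size   # 3*3*4 + 4*8
print(standard_params, separable_params)
```

For these sizes the separable version uses 68 weights where a standard 3×3 convolution would use 288, and the gap widens as the number of channels grows.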

You can read more about separable convolutions in the article “A Basic Introduction to Separable Convolutions” by Chi-Feng Wang. If you want a more formal treatment, you can also refer here for a mathematical proof of 2D separable convolutions, supported with examples.