Source: Deep Learning on Medium
Invariant Information Clustering for Unsupervised Image Classification and Segmentation
Revisiting this paper → clustering using information maximization → a super interesting approach that is able to outperform previous SOTA results. (very sexy) → another good thing is that it is robust enough to → be used on any kind of data.
Easy to implement → and grounded in solid information theory. (new state of the art on the STL10 dataset)
Usually, annotation is very hard to do → and we are able to perform image segmentation. (just from class information) → many authors have tried to combine deep learning with information theory.
Other methods such as PCA and more → have been incorporated into the training of the network.
Optional heads can be added → this is good since the method becomes more robust. (it can be used in many different settings).
Auxiliary overclustering → is actually widely used in clustering methods. (super interesting).
Information theory has been used in clustering for a long time → this is not new. (learning some kind of representation using deep learning is quite well studied → even Geoff Hinton has done it).
But the problem is the loss function → computing mutual information with respect to class labels is a hard thing to do.
And when we view the clustering process, we are able to see that it is able to cluster perfectly. (this is very impressive).
Dealing with high-dimensional data is always hard and numerically problematic. (DeepCluster → is another method → but does not produce groups of images) → and such methods do not perform better than IIC.
Many approaches → use a distance function → what do they use here? → mutual information. (the distance can be scaled easily).
So some image transformation is done → and we want to learn a representation that maximizes the mutual information between the representations of the original and the transformed image. (and we have to remove degenerate solutions).
Softmax has to be applied first → the resulting square joint-probability matrix is then given to the loss function. (we have to perform marginalization as well → this is important) → and thanks to the formulation of the loss function we are able to avoid degenerate solutions.
And assuming z is the representation vector → the above is the full loss function, very easy to implement.
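To make the steps in these notes concrete (softmax → square joint matrix → marginalization → mutual information), here is a minimal NumPy sketch. It is not the authors' code: the function names `softmax` and `iic_loss`, the `eps` clipping, and the toy inputs are all assumptions made for illustration. The idea is that `z` and `z_t` hold the softmax outputs for a batch of images and for their transformed versions.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def iic_loss(z, z_t, eps=1e-12):
    """Negative mutual information between paired cluster assignments.

    z, z_t: arrays of shape (n, C) holding softmax outputs for the
    original images and for their transformed versions (hypothetical names).
    """
    n = z.shape[0]
    joint = z.T @ z_t / n              # (C, C) joint distribution over cluster pairs
    joint = (joint + joint.T) / 2.0    # symmetrize
    joint = np.clip(joint, eps, None)  # avoid log(0); eps is an assumed safeguard
    joint = joint / joint.sum()
    p_row = joint.sum(axis=1, keepdims=True)  # marginal of the first view, (C, 1)
    p_col = joint.sum(axis=0, keepdims=True)  # marginal of the second view, (1, C)
    mi = np.sum(joint * (np.log(joint) - np.log(p_row) - np.log(p_col)))
    return -mi  # minimizing this maximizes the mutual information

# Toy check: balanced, confident assignments versus a collapsed solution.
labels = np.array([0, 1, 2, 0, 1, 2])
balanced = np.eye(3)[labels]                    # each cluster used equally
collapsed = np.eye(3)[np.zeros(6, dtype=int)]   # everything in cluster 0
loss_balanced = iic_loss(balanced, balanced)
loss_collapsed = iic_loss(collapsed, collapsed)
```

In this toy case the balanced assignment reaches a mutual information of about ln 3, while the collapsed one gets roughly zero, which is one way to see why putting every image in a single cluster is not rewarded by this loss.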
One downside → is the fact that we need to use multiple heads → and we then have to choose the best head → or average the label information. (a method to avoid this → and make the neural network more robust → might be good future work).
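The two options for dealing with multiple heads can be sketched as below. This is only an illustration under assumptions: the helper names `select_best_head` and `average_heads` are hypothetical, "best" is taken to mean lowest loss, and the averaging variant assumes the cluster indices of the heads are already aligned with each other, which is not guaranteed for independently trained heads.

```python
import numpy as np

def select_best_head(head_losses):
    # Pick the index of the sub-head with the lowest loss (assumed criterion).
    return int(np.argmin(head_losses))

def average_heads(head_probs):
    # head_probs: (H, n, C) softmax outputs from H heads.
    # Assumes cluster k means the same thing in every head.
    return np.mean(head_probs, axis=0).argmax(axis=1)

# Made-up numbers, purely for illustration.
head_losses = np.array([-0.90, -1.05, -0.70])
best = select_best_head(head_losses)

head_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    [[0.7, 0.3], [0.7, 0.3], [0.1, 0.9]],
])
averaged_labels = average_heads(head_probs)
```

Selecting a single head sidesteps the alignment problem, which may be why it is the simpler choice in practice.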
These are purely unsupervised results → the base networks are VGG as well as ResNet. (only the subheads are different → and we have multiple subheads → they use the Adam optimizer)
They are also able to perform semi-supervised training. (each subhead is tested individually → the best one is selected).
There are some failure cases → but overall it performs really well. (and when they reduce the number of labels → this does not reduce the accuracy that much)
Which is crazy! (They also used the COCO dataset for image segmentation).
Very impressive results → overclustering is done → for segmentation as well. (training → again the Adam optimizer was used).
In conclusion, using a clever loss function that avoids degenerate solutions → has huge potential.